# Stop Guessing: A Smarter Way to Infer Time Frequencies in Climate Data

*A practical demonstration of robust frequency inference for climate time series.*

In [1]:
import numpy as np
import pandas as pd
import xarray as xr
import cftime

# Import the infer_frequency function from pymor
from pymor.core.infer_freq import infer_frequency, FrequencyResult

### Why `xarray.infer_freq` often returns `None`

**Example 1**: Non-standard calendar (360_day)

In [2]:
cftime_360day = [
    cftime.Datetime360Day(2000, 1, 16),
    cftime.Datetime360Day(2000, 2, 16),
    cftime.Datetime360Day(2000, 3, 16),
    cftime.Datetime360Day(2000, 4, 16),
]

# pandas.infer_freq does not understand cftime objects, so xarray.infer_freq fails on this list as well
try:
    xr_result = xr.infer_freq(cftime_360day)
    print(f"xarray.infer_freq result: {xr_result}")
except Exception as e:
    print(f"xarray.infer_freq failed: {e}")

# Now, with our robust implementation
pymor_result = infer_frequency(cftime_360day)
print(f"pymor.infer_frequency result: {pymor_result}")

xarray.infer_freq failed: <class 'cftime._cftime.Datetime360Day'> is not convertible to datetime, at position 0
pymor.infer_frequency result: M
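To see why calendar-aware arithmetic matters here, the following is a minimal sketch (not pymor's implementation) of date differences in a 360_day calendar, assuming every month has exactly 30 days. The helper `ordinal_360` is hypothetical and works on plain `(year, month, day)` tuples rather than cftime objects.

```python
def ordinal_360(year: int, month: int, day: int) -> int:
    """Day count since year 0 in a 360_day calendar (12 months of 30 days)."""
    return year * 360 + (month - 1) * 30 + (day - 1)


# Same time stamps as in Example 1, written as plain tuples
stamps = [(2000, 1, 16), (2000, 2, 16), (2000, 3, 16), (2000, 4, 16)]
ordinals = [ordinal_360(*s) for s in stamps]

# Differences between consecutive time stamps, in days
deltas = [b - a for a, b in zip(ordinals, ordinals[1:])]
print(deltas)  # [30, 30, 30]
```

Every step is exactly 30 days, so the series is regular and monthly in this calendar, even though a 360_day calendar contains dates (such as February 30) that standard datetime types cannot represent.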


---

**Example 2**: Monthly data with missing month

In [3]:
times_with_gap = pd.to_datetime(["2000-01-31", "2000-02-29", "2000-04-30"])  # March is missing

# Test xarray's infer_freq
xr_result = xr.infer_freq(times_with_gap)
print(f"xarray.infer_freq result: {xr_result}")

# Test our implementation
pymor_result = infer_frequency(times_with_gap)
print(f"pymor.infer_frequency result: {pymor_result}")

xarray.infer_freq result: None
pymor.infer_frequency result: M
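One way to reason about a monthly series with a missing step is to count whole months between consecutive time stamps. The sketch below uses only pandas and NumPy; it illustrates the idea and is not how `pymor` is implemented. The names `month_index`, `base_step` and `has_gap` are chosen for this example.

```python
import numpy as np
import pandas as pd

times = pd.to_datetime(["2000-01-31", "2000-02-29", "2000-04-30"])  # March is missing

# Map each time stamp to a running month number (year * 12 + month)
month_index = (times.year * 12 + times.month).to_numpy()

# Number of months between consecutive time stamps
steps = np.diff(month_index)

# Treat the smallest step as the underlying frequency; larger steps are gaps
base_step = int(steps.min())
has_gap = bool((steps > base_step).any())

print(steps.tolist(), base_step, has_gap)  # [1, 2] 1 True
```

The smallest step is one month, so the underlying frequency is monthly, and the step of two months marks where March is missing.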


---
**Example 3**: Unanchored monthly data

Anchored monthly data is time-stamped at either the month start ("MS") or the month end ("ME").

Unanchored monthly data is time-stamped on some other fixed day of the month, i.e. at an offset from the anchor.
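The distinction between anchored and unanchored stamps can be checked directly with pandas. The helper `month_anchor` below is a hypothetical name for this sketch; it relies on the `DatetimeIndex.is_month_start` and `DatetimeIndex.is_month_end` attributes.

```python
import pandas as pd


def month_anchor(index: pd.DatetimeIndex) -> str:
    """Classify monthly time stamps as 'MS', 'ME' or 'unanchored'."""
    if index.is_month_start.all():
        return "MS"
    if index.is_month_end.all():
        return "ME"
    return "unanchored"


starts = pd.to_datetime(["2000-01-01", "2000-02-01", "2000-03-01"])
ends = pd.to_datetime(["2000-01-31", "2000-02-29", "2000-03-31"])
shifted = starts + pd.Timedelta(days=5)  # 6th day of each month

print(month_anchor(starts))   # MS
print(month_anchor(ends))     # ME
print(month_anchor(shifted))  # unanchored
```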

In [4]:
datasets = {
    "month_start": pd.date_range("2000-01-01", periods=4, freq="MS"),
    "month_end": pd.date_range("2000-01-01", periods=4, freq="ME"),
    "unanchored_offset": pd.date_range("2000-01-01", periods=4, freq="MS") + pd.Timedelta(days=5),
}

for name, data in datasets.items():
    print(f"Name: {name}")
    print("Data:")
    print(list(data))
    xr_result = xr.infer_freq(data)
    print("Inference:")
    print(f"  → xarray.infer_freq : {xr_result}")
    pymor_result = infer_frequency(data)
    print(f"  → pymor.infer_frequency : {pymor_result}")
    print("-"*50)


Name: month_start
Data:
[Timestamp('2000-01-01 00:00:00'), Timestamp('2000-02-01 00:00:00'), Timestamp('2000-03-01 00:00:00'), Timestamp('2000-04-01 00:00:00')]
Inference:
  → xarray.infer_freq : MS
  → pymor.infer_frequency : MS
--------------------------------------------------
Name: month_end
Data:
[Timestamp('2000-01-31 00:00:00'), Timestamp('2000-02-29 00:00:00'), Timestamp('2000-03-31 00:00:00'), Timestamp('2000-04-30 00:00:00')]
Inference:
  → xarray.infer_freq : ME
  → pymor.infer_frequency : ME
--------------------------------------------------
Name: unanchored_offset
Data:
[Timestamp('2000-01-06 00:00:00'), Timestamp('2000-02-06 00:00:00'), Timestamp('2000-03-06 00:00:00'), Timestamp('2000-04-06 00:00:00')]
Inference:
  → xarray.infer_freq : None
  → pymor.infer_frequency : M
--------------------------------------------------


---

### Works with Any Calendar

In [5]:
# Test different calendar types
calendars_to_test = {
    "360_day": cftime.Datetime360Day,
    "noleap": cftime.DatetimeNoLeap,
    "standard": cftime.DatetimeGregorian
}

for calendar_name, datetime_class in calendars_to_test.items():
    print(f"\n=== {calendar_name.upper()} Calendar ===")
    
    # Create monthly data
    monthly_times = [
        datetime_class(2000, 1, 15),
        datetime_class(2000, 2, 15),
        datetime_class(2000, 3, 15),
        datetime_class(2000, 4, 15),
        datetime_class(2000, 5, 15),
    ]
    
    result = infer_frequency(monthly_times, calendar=calendar_name)
    print(f"Monthly data frequency: {result}")
    
    # Create daily data
    daily_times = [
        datetime_class(2000, 1, 1),
        datetime_class(2000, 1, 2),
        datetime_class(2000, 1, 3),
        datetime_class(2000, 1, 4),
        datetime_class(2000, 1, 5),
    ]
    
    result = infer_frequency(daily_times, calendar=calendar_name)
    print(f"Daily data frequency: {result}")


=== 360_DAY Calendar ===
Monthly data frequency: M
Daily data frequency: D

=== NOLEAP Calendar ===
Monthly data frequency: M
Daily data frequency: D

=== STANDARD Calendar ===
Monthly data frequency: M
Daily data frequency: D


---

### Rich Diagnostics

When `return_metadata=True` is set, `infer_frequency` returns a `FrequencyResult` object instead of a plain string. This object carries additional information such as `delta_days`, `step`, `is_exact` and `status`.
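To make the shape of such a result concrete, here is an illustrative stand-in. `FrequencyResultSketch` and `is_trustworthy` are hypothetical names for this sketch, not pymor's actual definitions; only the attribute names are taken from the way the result is used in this notebook, and the field types are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FrequencyResultSketch:
    """Illustrative stand-in for the attributes used in this notebook."""
    frequency: Optional[str]      # e.g. "MS", "M", "D", or None
    delta_days: Optional[float]   # spacing between time stamps, in days
    step: Optional[int]           # assumed here to be a step multiple
    is_exact: bool                # True when the spacing is perfectly regular
    status: str                   # e.g. "valid", "irregular", "missing_steps", "too_short"


def is_trustworthy(result) -> bool:
    """Accept a result only if a frequency was found and it is valid and exact."""
    return result.frequency is not None and result.status == "valid" and result.is_exact


# Values modelled on outputs shown in this notebook (step values are illustrative)
regular = FrequencyResultSketch("MS", 31.0, 1, True, "valid")
gappy = FrequencyResultSketch("M", 30.0, 1, False, "irregular")

print(is_trustworthy(regular))  # True
print(is_trustworthy(gappy))    # False
```

Because `is_trustworthy` only reads attributes, the same check could be applied to any object exposing `frequency`, `status` and `is_exact`.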

In [6]:
times = [
    "2000-01-01",
    "2000-02-01",
    "2000-02-28",  # <- off schedule: the next monthly stamp would be 2000-03-01
    "2000-04-01",
]

result = infer_frequency(times, return_metadata=True, strict=True)
print("Inference:")
print(f"  → Frequency: {result.frequency}")
print(f"  → Delta (days): {result.delta_days}")
print(f"  → Step: {result.step}")
print(f"  → Is exact: {result.is_exact}")
print(f"  → Status: {result.status}")

Inference:
  → Frequency: M
  → Delta (days): 27.0
  → Step: 1
  → Is exact: False
  → Status: irregular


In [7]:
# Test different scenarios
print("\n📊 Testing Different Scenarios:")
scenarios = {
    "Perfect monthly": pd.date_range("2000-01-01", periods=12, freq="MS"),
    "Monthly with gap": pd.to_datetime(["2000-01-01", "2000-02-01", "2000-04-01", "2000-05-01"]),
    "Daily data": pd.date_range("2000-01-01", periods=7, freq="D"),
    "Irregular daily": pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-04", "2000-01-05"]),
    "Too short": pd.to_datetime(["2000-01-01"]),
}

for scenario_name, times in scenarios.items():
    print(f"\n{scenario_name.upper()}:")
    print(f"data: {times}")
    result = infer_frequency(times, return_metadata=True, strict=True)
    print("Inference:")
    if result.frequency:
        print(f"    → Frequency: {result.frequency}")
        if result.delta_days is not None:
            print(f"    → Delta: {result.delta_days:.2f} days")
        else:
            print(f"    → Delta: None")
        print(f"    → Status: {result.status}")
        print(f"    → Exact: {'Yes' if result.is_exact else 'No'}")
    else:
        print(f"  → Could not infer frequency")
        print(f"  → Status: {result.status}")
    print('-'*50)


📊 Testing Different Scenarios:

PERFECT MONTHLY:
data: DatetimeIndex(['2000-01-01', '2000-02-01', '2000-03-01', '2000-04-01',
               '2000-05-01', '2000-06-01', '2000-07-01', '2000-08-01',
               '2000-09-01', '2000-10-01', '2000-11-01', '2000-12-01'],
              dtype='datetime64[ns]', freq='MS')
Inference:
    → Frequency: MS
    → Delta: 31.00 days
    → Status: valid
    → Exact: Yes
--------------------------------------------------

MONTHLY WITH GAP:
data: DatetimeIndex(['2000-01-01', '2000-02-01', '2000-04-01', '2000-05-01'], dtype='datetime64[ns]', freq=None)
Inference:
    → Frequency: M
    → Delta: 30.00 days
    → Status: irregular
    → Exact: No
--------------------------------------------------

DAILY DATA:
data: DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07'],
              dtype='datetime64[ns]', freq='D')
Inference:
    → Frequency: D
    → Delta: 1.00 days
    → Status

---

### Handle data overlaps and duplicates

`infer_frequency` can detect the underlying frequency of the data in spite of overlaps or duplicate entries.
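Duplicates and out-of-order time stamps can also be spotted directly with pandas before any frequency inference. The sketch below is independent of `pymor` and uses only `Index.duplicated`, `Index.is_monotonic_increasing`, `Index.drop_duplicates` and `Index.sort_values`.

```python
import numpy as np
import pandas as pd

data = pd.to_datetime(["2000-01-01", "2000-02-01", "2000-03-01"])

# Simulate concatenating the same file twice: [Jan, Feb, Mar, Jan, Feb, Mar]
doubled = pd.DatetimeIndex(np.tile(data.to_numpy(), 2))

n_duplicates = int(doubled.duplicated().sum())  # repeated time stamps
is_sorted = doubled.is_monotonic_increasing     # False when time jumps backwards

# Drop the repeats and restore chronological order
cleaned = doubled.drop_duplicates().sort_values()

print(n_duplicates, is_sorted)  # 3 False
print(cleaned.equals(data))     # True
```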

In [8]:
# Original monthly data
data = pd.to_datetime(["2000-01-01", "2000-02-01", "2000-03-01"])

print("Original data:")
print(data)
result = infer_frequency(data, return_metadata=True)
print("Inference:")
print(f"  → Frequency: {result.frequency}")
print(f"  → Status: {result.status}")
print(f"  → Is exact: {result.is_exact}")

Original data:
DatetimeIndex(['2000-01-01', '2000-02-01', '2000-03-01'], dtype='datetime64[ns]', freq=None)
Inference:
  → Frequency: MS
  → Status: valid
  → Is exact: True


In [9]:
# Simulate concatenating the same file twice (common mistake!)
duplicated_data = np.tile(data, 2)  # [Jan, Feb, Mar, Jan, Feb, Mar]
print("\nDuplicated data:")
print(f"{duplicated_data}")

result = infer_frequency(duplicated_data, return_metadata=True)
print("Inference:")
print(f"  → Frequency: {result.frequency}")
print(f"  → Status: {result.status}")
print(f"  → Is exact: {result.is_exact}")


Duplicated data:
['2000-01-01T00:00:00.000000000' '2000-02-01T00:00:00.000000000'
 '2000-03-01T00:00:00.000000000' '2000-01-01T00:00:00.000000000'
 '2000-02-01T00:00:00.000000000' '2000-03-01T00:00:00.000000000']
Inference:
  → Frequency: M
  → Status: irregular
  → Is exact: False


---

### Understanding FrequencyResult

In [10]:
test_cases = {
    "Valid regular series": pd.date_range("2000-01-01", periods=12, freq="MS"),
    "Missing steps": pd.to_datetime(["2000-01-01", "2000-02-01", "2000-04-01", "2000-05-01"]),
    "Irregular spacing": pd.to_datetime(["2000-01-01", "2000-01-31", "2000-02-28", "2000-04-01"]),
    "Too short": pd.to_datetime(["2000-01-01"]),
    "With duplicates": pd.to_datetime(["2000-01-01", "2000-01-01", "2000-02-01", "2000-03-01"]),
}

print("=== COMPREHENSIVE STATUS EXAMPLES ===")
for case_name, times in test_cases.items():
    print(f"\n{case_name}:")
    result = infer_frequency(times, return_metadata=True, strict=True)
    
    print(f"  Times: {len(times)} points")
    if len(times) > 1 and len(times) <= 4:
        print(f"  Range: {times[0].strftime('%Y-%m-%d')} to {times[-1].strftime('%Y-%m-%d')}")
    
    print(f"  → Frequency: {result.frequency or 'None'}")
    print(f"  → Status: {result.status}")
    print(f"  → Is exact: {result.is_exact}")
    if result.delta_days:
        print(f"  → Median Δ: {result.delta_days:.1f} days")
    
    # Interpretation
    if result.status == "valid" and result.is_exact:
        print("  ✅ Safe for resampling/analysis")
    elif result.status == "missing_steps":
        print("  ⚠️  Has gaps - consider filling before analysis")
    elif result.status == "irregular":
        print("  ⚠️  Underlying frequency exists, but beware of inconsistencies")
    elif result.status == "too_short":
        print("  ❌ Not enough points to determine frequency")
    else:
        print(f"  ❓ Status: {result.status}")


=== COMPREHENSIVE STATUS EXAMPLES ===

Valid regular series:
  Times: 12 points
  → Frequency: MS
  → Status: valid
  → Is exact: True
  → Median Δ: 31.0 days
  ✅ Safe for resampling/analysis

Missing steps:
  Times: 4 points
  Range: 2000-01-01 to 2000-05-01
  → Frequency: M
  → Status: irregular
  → Is exact: False
  → Median Δ: 30.0 days
  ⚠️  Underlying frequency exists, but beware of inconsistencies

Irregular spacing:
  Times: 4 points
  Range: 2000-01-01 to 2000-04-01
  → Frequency: M
  → Status: irregular
  → Is exact: False
  → Median Δ: 28.0 days
  ⚠️  Underlying frequency exists, but beware of inconsistencies

Too short:
  Times: 1 points
  → Frequency: None
  → Status: too_short
  → Is exact: False
  ❌ Not enough points to determine frequency

With duplicates:
  Times: 4 points
  Range: 2000-01-01 to 2000-03-01
  → Frequency: M
  → Status: missing_steps
  → Is exact: False
  → Median Δ: 29.0 days
  ⚠️  Has gaps - consider filling before analysis


---

## Preventing Subtle Errors



In [11]:
# Create some sample climate data
np.random.seed(42)

# Daily temperature data
daily_times = pd.date_range("2000-01-01", "2000-12-31", freq="D")
daily_temps = 15 + 10 * np.sin(2 * np.pi * np.arange(len(daily_times)) / 365.25) + np.random.normal(0, 2, len(daily_times))

# Monthly temperature data (subset)
monthly_times = pd.date_range("2000-01-01", "2000-12-01", freq="MS")
monthly_temps = 15 + 10 * np.sin(2 * np.pi * np.arange(len(monthly_times)) / 12) + np.random.normal(0, 1, len(monthly_times))

print("=== RESAMPLING SAFETY CHECK ===")

# Check daily data
daily_result = infer_frequency(daily_times, return_metadata=True, strict=True)
print(f"\nDaily data:")
print(f"  Frequency: {daily_result.frequency}")
print(f"  Status: {daily_result.status}")
print(f"  delta: {daily_result.delta_days}")
print(f"  Safe for monthly resampling: {'✅' if daily_result.delta_days and daily_result.delta_days < 30 else '❌'}")

# Check monthly data
monthly_result = infer_frequency(monthly_times, return_metadata=True, strict=True)
print(f"\nMonthly data:")
print(f"  Frequency: {monthly_result.frequency}")
print(f"  Status: {monthly_result.status}")
print(f"  Safe for annual resampling: {'✅' if monthly_result.delta_days and monthly_result.delta_days < 365 else '❌'}")
print(f"  Safe for daily resampling: {'❌ (would be upsampling)' if monthly_result.delta_days and monthly_result.delta_days > 1 else '✅'}")

# Demonstrate the risk of upsampling
print(f"\n⚠️  UPSAMPLING RISK:")
if monthly_result.delta_days is not None:
    print(f"   Trying to resample monthly data ({monthly_result.delta_days:.1f} day intervals)")
    print(f"   to daily frequency (1 day intervals) would create artificial data points!")
else:
    print(f"   Monthly data has unknown interval - check before resampling to daily!")

=== RESAMPLING SAFETY CHECK ===

Daily data:
  Frequency: D
  Status: valid
  delta: 1.0
  Safe for monthly resampling: ✅

Monthly data:
  Frequency: MS
  Status: valid
  Safe for annual resampling: ✅
  Safe for daily resampling: ❌ (would be upsampling)

⚠️  UPSAMPLING RISK:
   Trying to resample monthly data (31.0 day intervals)
   to daily frequency (1 day intervals) would create artificial data points!
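The comparisons in the cell above can be folded into a small reusable guard. This is a sketch under the assumption that both the source and the target spacing are known in days; `resample_direction` is a hypothetical helper name and not part of `pymor`.

```python
def resample_direction(source_delta_days: float, target_delta_days: float) -> str:
    """Classify a resampling request by comparing source and target spacing."""
    if target_delta_days > source_delta_days:
        return "downsample"  # aggregating to a coarser time step
    if target_delta_days < source_delta_days:
        return "upsample"    # would create time steps that were never observed
    return "same"


print(resample_direction(1.0, 30.0))  # daily -> monthly: downsample
print(resample_direction(31.0, 1.0))  # monthly -> daily: upsample
```

A pipeline could refuse to continue, or require an explicit interpolation method, whenever the answer is `"upsample"`.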


---

### File Concatenation Analysis

In [12]:
# Simulate concatenating multiple NetCDF files with potential issues

# File 1: Jan-Jun 2000
file1_times = pd.date_range("2000-01-01", "2000-06-30", freq="D")

# File 2: Jul-Dec 2000 (but with a gap - missing July 15)
file2_start = pd.date_range("2000-07-01", "2000-07-14", freq="D")
file2_end = pd.date_range("2000-07-16", "2000-12-31", freq="D")
file2_times = file2_start.union(file2_end)

# File 3: Overlap with file 2 (Dec 2000 repeated)
file3_times = pd.date_range("2000-12-01", "2001-06-30", freq="D")


print("=== FILE CONCATENATION ANALYSIS ===")

print("\n=== CHECK INDIVIDUAL FILES ===")

for i, times in enumerate([file1_times, file2_times, file3_times], 1):
    result = infer_frequency(times, return_metadata=True, strict=True)
    print(f"File {i}: {len(times)} time points")
    print(f"  Range: {times[0].strftime('%Y-%m-%d')} to {times[-1].strftime('%Y-%m-%d')}")
    print(f"  Frequency: {result.frequency}")
    print(f"  Status: {result.status}")
    print(f"  Is exact: {result.is_exact}")
    print()

# Combine all files. Note: Index.union sorts and drops duplicate timestamps, so the December overlap is removed here
combined_times = file1_times.union(file2_times).union(file3_times)
combined_result = infer_frequency(combined_times, return_metadata=True, strict=True)

print("\n=== COMBINED DATASET ===")
print(f"Total time points: {len(combined_times)}")
print(f"Range: {combined_times[0].strftime('%Y-%m-%d')} to {combined_times[-1].strftime('%Y-%m-%d')}")
print(f"Frequency: {combined_result.frequency}")
print(f"Status: {combined_result.status}")
print(f"Is exact: {combined_result.is_exact}")


if combined_result.status != "valid" or not combined_result.is_exact:
    print(f"\n⚠️  ISSUES DETECTED:")
    if "missing" in combined_result.status:
        print(f"   - Missing time steps detected")
    if "irregular" in combined_result.status:
        print(f"   - Irregular spacing or duplicates detected")
    print(f"   - Recommend investigating data before analysis")
else:
    print(f"\n✅ Combined dataset looks good for analysis!")

=== FILE CONCATENATION ANALYSIS ===

=== CHECK INDIVIDUAL FILES ===
File 1: 182 time points
  Range: 2000-01-01 to 2000-06-30
  Frequency: D
  Status: valid
  Is exact: True

File 2: 183 time points
  Range: 2000-07-01 to 2000-12-31
  Frequency: D
  Status: missing_steps
  Is exact: False

File 3: 212 time points
  Range: 2000-12-01 to 2001-06-30
  Frequency: D
  Status: valid
  Is exact: True


=== COMBINED DATASET ===
Total time points: 546
Range: 2000-01-01 to 2001-06-30
Frequency: D
Status: missing_steps
Is exact: False

⚠️  ISSUES DETECTED:
   - Missing time steps detected
   - Recommend investigating data before analysis
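Once a status such as `missing_steps` is reported, the next question is usually where the problem sits. The following pandas-only sketch locates the missing day inside file 2 and the overlap between files 2 and 3; it rebuilds the same time axes as the cell above and does not rely on `pymor`.

```python
import pandas as pd

# File 2: Jul-Dec 2000 with July 15 missing
file2_times = pd.date_range("2000-07-01", "2000-07-14", freq="D").union(
    pd.date_range("2000-07-16", "2000-12-31", freq="D")
)
# File 3: starts in Dec 2000, so it overlaps with file 2
file3_times = pd.date_range("2000-12-01", "2001-06-30", freq="D")

# Gaps: compare against the complete daily axis spanning the file
expected = pd.date_range(file2_times[0], file2_times[-1], freq="D")
missing = expected.difference(file2_times)

# Overlaps: time stamps present in both files
overlap = file2_times.intersection(file3_times)

print([t.strftime("%Y-%m-%d") for t in missing])  # ['2000-07-15']
print(len(overlap))                               # 31
```

The 31 overlapping days explain why the combined dataset above has 546 points rather than the 577 obtained by adding the three file lengths.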


---

### Summary and Takeaways

🎯 KEY TAKEAWAYS:

1. **Resilient to irregularities**: `infer_frequency` handles gaps, duplicates,
   and non-standard calendars that break standard tools.

2. **Transparent diagnostics**: Instead of silent failures, you get detailed 
   information about what was found and why.

3. **Tailored for climate data**: Built specifically for the messy realities 
   of climate model output and observational data.

4. **Prevents subtle errors**: Programmatically detect issues before they 
   propagate into your analysis pipeline.

5. **Easy integration**: Works with xarray, pandas, and cftime objects 
   out of the box.

The pymor.core.infer_freq module turns guesswork into a reliable, automated 
process—so you can spend less time debugging and more time doing science.

Stop guessing. Start inferring—smarter.

---

## Project Repository

- GitHub: [esm-tools/pymor](https://github.com/esm-tools/pymor)
- PyPI: [py-cmor](https://pypi.org/project/py-cmor/)

---

## Authors

This work was developed by the High Performance Computing and Data Processing
group at the Alfred Wegener Institute for Polar and Marine Research (AWI),
Bremerhaven, Germany.

- Pavan Kumar Siligam (AWI) - [ORCID: 0009-0003-8054-7021](https://orcid.org/0009-0003-8054-7021)
- Paul Gierz (AWI) - [ORCID: 0000-0002-4512-087X](https://orcid.org/0000-0002-4512-087X)
- Miguel Andrés-Martínez (AWI) - [ORCID: 0000-0002-1525-5546](https://orcid.org/0000-0002-1525-5546)

---