# Meteo Silver Pipeline

This notebook represents the *Silver stage* of the Meteo data pipeline.  
Here we process the **raw Bergen 2019 weather data** fetched from the Open-Meteo API (Bronze stage),  
and perform statistical and machine learning–based **outlier and anomaly detection**.

---

### Objectives
- Load and prepare the Bergen 2019 dataset.
- Detect temperature outliers using the **Discrete Cosine Transform (DCT)** and **Statistical Process Control (SPC)**.
- Detect precipitation anomalies using the **Local Outlier Factor (LOF)**.
- Generate summary tables and visualizations for both analyses.

**Inputs:**
- `data/bronze/meteo_bergen_2019.csv`

**Outputs:**
- `data/silver/outliers_temperature_bergen_2019.csv`
- `data/silver/anomalies_precipitation_bergen_2019.csv`


In [1]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from sklearn.neighbors import LocalOutlierFactor
from scipy.fftpack import dct, idct
from pathlib import Path

# Paths
DATA_BRONZE = Path("../../data/bronze")
DATA_SILVER = Path("../../data/silver")
DATA_SILVER.mkdir(parents=True, exist_ok=True)

print(f"✅ Silver data folder ready: {DATA_SILVER.resolve()}")

✅ Silver data folder ready: /Users/fabianheflo/UNI_courses/IND320/IND320/data/silver


In [5]:
df = pd.read_csv(DATA_BRONZE / "meteo_bergen_2019.csv")
df["time"] = pd.to_datetime(df["time"])
df = df.set_index("time")
df.head()

time,temperature_2m,precipitation,wind_speed_10m,latitude,longitude,year
2019-01-01 00:00:00,5.7,0.7,37.0,60.3913,5.3221,2019
2019-01-01 01:00:00,5.8,0.2,41.0,60.3913,5.3221,2019
2019-01-01 02:00:00,6.1,0.7,42.0,60.3913,5.3221,2019
2019-01-01 03:00:00,6.3,0.5,40.9,60.3913,5.3221,2019
2019-01-01 04:00:00,5.8,1.1,41.2,60.3913,5.3221,2019


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8760 entries, 2019-01-01 00:00:00 to 2019-12-31 23:00:00
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   temperature_2m  8760 non-null   float64
 1   precipitation   8760 non-null   float64
 2   wind_speed_10m  8760 non-null   float64
 3   latitude        8760 non-null   float64
 4   longitude       8760 non-null   float64
 5   year            8760 non-null   int64  
dtypes: float64(5), int64(1)
memory usage: 479.1 KB


In [7]:
df.describe()

,temperature_2m,precipitation,wind_speed_10m,latitude,longitude,year
count,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0
mean,7.831689,0.246518,10.688402,60.3913,5.3221,2019.0
std,5.999381,0.583165,6.05575,7.105833e-15,8.882291e-16,0.0
min,-12.9,0.0,0.0,60.3913,5.3221,2019.0
25%,3.1,0.0,6.1,60.3913,5.3221,2019.0
50%,7.2,0.0,9.5,60.3913,5.3221,2019.0
75%,12.1,0.2,14.3,60.3913,5.3221,2019.0
max,31.7,9.5,46.7,60.3913,5.3221,2019.0


## Temperature Outlier Detection (DCT + SPC)
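The temperature series is high-pass filtered with the **Discrete Cosine Transform (DCT)**  
to isolate short-term fluctuations from the slow seasonal trend.  
**Statistical Process Control (SPC)** limits (median ± `n_std` × MAD) are then applied to the filtered signal to flag outliers.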
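
As a rough illustration of the filtering step, the sketch below applies the same idea to a synthetic hourly series (a made-up seasonal cycle, a daily cycle, and one injected spike, not the Bergen data). It assumes the DCT-II convention used by `scipy.fftpack.dct`, under which coefficient `k` of an `N`-sample series corresponds to a period of about `2N/k` samples. On that assumption, `freq_cutoff=30` on 8760 hourly values would remove variation slower than roughly 584 hours (about 24 days). The names `high_pass_dct` and `spike_index` and all signal parameters are hypothetical.

```python
import numpy as np
from scipy.fftpack import dct, idct


def high_pass_dct(values, freq_cutoff):
    """Zero the first `freq_cutoff` DCT coefficients and transform back."""
    coeffs = dct(np.asarray(values, dtype=float), norm="ortho")
    coeffs[:freq_cutoff] = 0
    return idct(coeffs, norm="ortho")


# Synthetic stand-in for one year of hourly temperatures (illustrative only)
n_hours = 8760
t = np.arange(n_hours)
seasonal = 8 - 10 * np.cos(2 * np.pi * (t + 0.5) / n_hours)  # slow yearly cycle
daily = 2 * np.sin(2 * np.pi * t / 24)                       # fast daily cycle
signal = seasonal + daily

spike_index = 4000
signal[spike_index] += 15  # injected anomaly

filtered = high_pass_dct(signal, freq_cutoff=30)

# Approximate period (in hours) of the slowest component that survives the filter
cutoff_period_hours = 2 * n_hours / 30

print(f"Cutoff period: {cutoff_period_hours:.0f} h")
print(f"Largest deviation at index {int(np.argmax(np.abs(filtered)))}")
```

The yearly cycle falls below the cutoff and is removed, while the daily cycle and the spike survive, which is why the spike stands out in the filtered signal even though it is unremarkable against the raw seasonal range.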


In [8]:
def detect_temperature_outliers(df, freq_cutoff=30, n_std=3):
    """
    Identify temperature outliers using DCT and SPC.
    freq_cutoff: number of lowest-frequency DCT coefficients to zero out (high-pass filter).
    n_std: multiplier applied to the raw (unscaled) MAD to set the SPC boundaries.
    """
    y = df["temperature_2m"].values
    y_dct = dct(y, norm="ortho")

    # Zero out low frequencies (keep high frequencies)
    y_dct[:freq_cutoff] = 0
    y_filtered = idct(y_dct, norm="ortho")

    # Robust statistics
    median = np.median(y_filtered)
    mad = np.median(np.abs(y_filtered - median))
    upper = median + n_std * mad
    lower = median - n_std * mad

    # Identify outliers
    is_outlier = (y_filtered > upper) | (y_filtered < lower)

    df_out = df.copy()
    df_out["outlier_temp"] = is_outlier

    # Plot
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=df.index, y=df["temperature_2m"], mode="lines", name="Temperature"))
    fig.add_trace(go.Scatter(x=df.index, y=np.where(is_outlier, df["temperature_2m"], np.nan),
                             mode="markers", name="Outliers", marker=dict(color="red", size=6)))
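    # The SPC limits live in the filtered domain, so shift them by the removed
    # low-frequency trend before drawing them over the raw temperature series
    trend = y - y_filtered
    fig.add_trace(go.Scatter(x=df.index, y=trend + upper, mode="lines", name="Upper SPC limit",
                             line=dict(dash="dash", color="orange")))
    fig.add_trace(go.Scatter(x=df.index, y=trend + lower, mode="lines", name="Lower SPC limit",
                             line=dict(dash="dash", color="orange")))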
    fig.update_layout(title="Temperature Outliers (DCT + SPC)",
                      xaxis_title="Time", yaxis_title="Temperature (°C)")

    summary = df_out[df_out["outlier_temp"]][["temperature_2m"]]
    return fig, summary

In [17]:
fig_temp, outliers_temp = detect_temperature_outliers(df, freq_cutoff=30, n_std=3)
fig_temp.show()
print(f"Number of temperature outliers: {len(outliers_temp)}")
display(outliers_temp.head())

Number of temperature outliers: 465


time,temperature_2m
2019-01-28 04:00:00,-9.0
2019-01-28 05:00:00,-10.6
2019-01-28 06:00:00,-11.3
2019-01-28 07:00:00,-11.8
2019-01-28 08:00:00,-11.6
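
One possible reason for the fairly large count (465 of 8760 hours, about 5%) is that `n_std` multiplies the raw MAD. For normally distributed data the MAD is only about 0.6745 standard deviations, so 3 × MAD corresponds to roughly 2σ rather than 3σ. The sketch below compares raw and scaled limits on synthetic normal data, not on the notebook's filtered series. `robust_limits`, `MAD_TO_STD`, and `scale_mad` are hypothetical names, and whether the normal-consistency scaling suits this dataset is an assumption to verify.

```python
import numpy as np

MAD_TO_STD = 1.4826  # consistency constant for normally distributed data


def robust_limits(values, n_std=3.0, scale_mad=True):
    """Return (lower, upper) limits around the median based on the MAD."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    sigma = MAD_TO_STD * mad if scale_mad else mad
    return median - n_std * sigma, median + n_std * sigma


# Synthetic standard-normal sample (illustrative only)
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=100_000)

flagged_share = {}
for scale_mad in (False, True):
    lower, upper = robust_limits(sample, n_std=3, scale_mad=scale_mad)
    flagged_share[scale_mad] = float(np.mean((sample < lower) | (sample > upper)))
    print(f"scale_mad={scale_mad}: {flagged_share[scale_mad]:.2%} flagged")
```

With the raw MAD, a few percent of perfectly ordinary normal samples are flagged, while the scaled version flags far fewer. Which behaviour is preferable depends on how strict the SPC limits are meant to be.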


In [20]:
from scipy.fftpack import dct, idct
import plotly.graph_objects as go

def plot_dct_effect(df, cutoffs=(30,)):  # tuple default avoids a shared mutable argument
    y = df["temperature_2m"].values
    fig = go.Figure()
    for c in cutoffs:
        y_dct = dct(y, norm="ortho")
        y_dct[:c] = 0
        y_filtered = idct(y_dct, norm="ortho")
        fig.add_trace(go.Scatter(x=df.index, y=y_filtered, name=f"Filtered (cutoff={c})"))
    fig.add_trace(go.Scatter(x=df.index, y=y, name="Original", line=dict(color="black", width=1)))
    fig.update_layout(title="Effect of freq_cutoff on DCT Filtering", xaxis_title="Time", yaxis_title="Temperature (°C)")
    return fig

fig = plot_dct_effect(df)
fig.show()


## Precipitation Anomaly Detection (Local Outlier Factor)
The **Local Outlier Factor (LOF)** method is used to flag unusual precipitation values.  
The default contamination (expected proportion of anomalies) is 1%.
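
To show how LOF labels points, here is a minimal sketch on synthetic one-dimensional data (200 made-up "typical" values plus two extreme ones, not the Bergen precipitation series). It uses scikit-learn's `LocalOutlierFactor` with an explicit `n_neighbors=20`, which is also the library default. The variable names are hypothetical.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data: a dense cluster plus two far-away values (illustrative only)
rng = np.random.default_rng(42)
typical = rng.normal(loc=0.0, scale=1.0, size=200)
extreme = np.array([15.0, 20.0])
X = np.concatenate([typical, extreme]).reshape(-1, 1)

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X)            # -1 = anomaly, 1 = inlier
scores = lof.negative_outlier_factor_  # more negative = more anomalous

flagged = np.flatnonzero(labels == -1)
print(f"Flagged indices: {flagged.tolist()}")
print(f"Most anomalous score: {scores.min():.1f}")
```

In the notebook's precipitation column at least half of the hourly values are 0.0 (the median in `describe()` is 0). That is a likely trigger for scikit-learn's duplicate-values warning in the run output, so a larger `n_neighbors`, as the warning suggests, may be worth exploring.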

In [11]:
def detect_precipitation_anomalies(df, contamination=0.01):
    """
    Detect precipitation anomalies using Local Outlier Factor.
    contamination: expected proportion of outliers.
    """
    X = df[["precipitation"]].values
    lof = LocalOutlierFactor(contamination=contamination)
    y_pred = lof.fit_predict(X)

    df_out = df.copy()
    df_out["anomaly_precip"] = (y_pred == -1)

    # Plot
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=df.index, y=df["precipitation"], mode="lines", name="Precipitation"))
    fig.add_trace(go.Scatter(x=df.index, y=np.where(df_out["anomaly_precip"], df["precipitation"], np.nan),
                             mode="markers", name="Anomalies", marker=dict(color="red", size=6)))
    fig.update_layout(title="Precipitation Anomalies (LOF)",
                      xaxis_title="Time", yaxis_title="Precipitation (mm)")

    summary = df_out[df_out["anomaly_precip"]][["precipitation"]]
    return fig, summary

In [12]:
fig_precip, anomalies_precip = detect_precipitation_anomalies(df, contamination=0.01)
fig_precip.show()
print(f"Number of precipitation anomalies: {len(anomalies_precip)}")
anomalies_precip.head()


Duplicate values are leading to incorrect results. Increase the number of neighbors for more accurate results.



Number of precipitation anomalies: 80


time,precipitation
2019-01-04 14:00:00,2.4
2019-01-04 16:00:00,2.3
2019-01-20 21:00:00,3.8
2019-02-12 18:00:00,3.3
2019-02-21 03:00:00,2.4


### Save results to the Silver folder

In [21]:
outliers_temp.to_csv(DATA_SILVER / "outliers_temperature_bergen_2019.csv", index=True)
anomalies_precip.to_csv(DATA_SILVER / "anomalies_precipitation_bergen_2019.csv", index=True)
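
A downstream stage will need to restore the `time` index when reading these files back. The sketch below round-trips a small stand-in table through a temporary directory. The file name `outliers_example.csv` and the three rows are illustrative, with values copied from the outlier output earlier in this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in for the outlier table (illustrative rows)
original = pd.DataFrame(
    {"temperature_2m": [-9.0, -10.6, -11.3]},
    index=pd.DatetimeIndex(
        ["2019-01-28 04:00", "2019-01-28 05:00", "2019-01-28 06:00"], name="time"
    ),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "outliers_example.csv"
    original.to_csv(csv_path, index=True)
    # index_col + parse_dates restore the datetime index on load
    restored = pd.read_csv(csv_path, index_col="time", parse_dates=True)

print(restored.index.dtype)
print(restored)
```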