# Weather imputation algorithm
## Goals
### Get a weather time series that can be merged with the air quality measurements
- We need an hourly time series of weather conditions covering 2019-2023.
- Some hours have multiple reports (FM-12 and FM-15), and one report can be missing fields that the other one contains, so we have to merge them. To do that, we choose one report type as the base and fill its missing fields from the other reports.
- In general, FM-12 reports arrive every hour and FM-15 reports every 30 minutes.
- If none of the reports contains a field, we can a) treat it as missing, or b) linearly interpolate between the previous and the next available observation (assuming the weather changes linearly).
  - E.g. TMP was 10 degrees at 3pm, 4pm is missing, and the next available TMP is 5 degrees at 8pm, so we calculate 4pm to be 9 degrees (a drop of 1 degree per hour).
  - Approach b) gets fairly complicated once half-hour reports and similar irregularities are included.
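
Option b) can be sketched with pandas' time-weighted interpolation. This is only an illustration under stated assumptions: the series below is made up to mirror the TMP example (10 degrees at 15:00, 5 degrees at 20:00), and the variable names are hypothetical rather than taken from the dataset.

```python
import pandas as pd

# Hypothetical hourly series mirroring the TMP example above:
# a reading at 15:00, a reading at 20:00, and nothing in between.
hours = pd.date_range("2019-01-01 15:00", "2019-01-01 20:00", freq="h")
tmp = pd.Series([10.0, None, None, None, None, 5.0], index=hours, name="TMP")

# method="time" weights by the real time gap between observations, so it
# also works when timestamps are unevenly spaced (e.g. half-hour reports).
# limit_area="inside" fills only gaps that have a valid value on both sides.
filled = tmp.interpolate(method="time", limit_area="inside")

print(filled)
```

With these inputs the 16:00 value comes out as 9 degrees, matching the manual calculation. Whether long gaps should be filled at all is a separate decision; `interpolate` also accepts a `limit` argument that could cap the number of consecutive values filled.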

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("../data/processed/weather_pre_algorithm.csv")

## Example hour with multiple reports

In [3]:
# Parse timestamps; values that cannot be parsed become NaT
df["DATE"] = pd.to_datetime(df["DATE"], errors="coerce")

# Keep only timestamps that have more than one distinct report type
dupes = df.groupby("DATE").filter(lambda g: g["REPORT_TYPE"].nunique() > 1)

# Pick one example timestamp
example_time = dupes["DATE"].iloc[0]
subset = df[df["DATE"] == example_time]

# Compare reports row by row: keep only differing columns
diff_cols = subset.loc[:, (subset.nunique(dropna=False) > 1)]
print(f"🔎 Example time: {example_time}")
print(diff_cols)

🔎 Example time: 2019-01-01 00:00:00
  REPORT_TYPE  wind_speed_raw  temperature_C  SLP_hpa  DEW_C  MA1_main  \
0       FM-12            30.0            0.1   1030.2   -0.9       NaN   
1       FM-15            26.0            0.0      NaN   -1.0   10280.0   

   MA1_sec  GA1_amt  GA1_height  GA1_type  GA2_amt  GA2_height  MD1_m1  \
0   9996.0      7.0       800.0       6.0      NaN         NaN     8.0   
1      NaN      2.0       640.0       NaN      7.0       884.0     NaN   

   MD1_m2  MW1_val  
0    11.0     10.0  
1     NaN      NaN  
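
The two rows above complement each other: FM-12 carries `SLP_hpa` while FM-15 carries `MA1_main` and `GA2_amt`. Below is a minimal sketch of the "base report plus fill" idea, assuming FM-12 is chosen as the base. The helper `merge_reports` and its `priority` parameter are hypothetical names, and the toy frame copies only a few columns from the example output.

```python
import pandas as pd

# Values copied from the example output above (subset of columns only).
reports = pd.DataFrame({
    "DATE": pd.to_datetime(["2019-01-01 00:00"] * 2),
    "REPORT_TYPE": ["FM-12", "FM-15"],
    "temperature_C": [0.1, 0.0],
    "SLP_hpa": [1030.2, None],
    "MA1_main": [None, 10280.0],
    "GA2_amt": [None, 7.0],
})


def merge_reports(df, priority=("FM-12", "FM-15")):
    """Collapse all reports sharing a timestamp into one row.

    Hypothetical helper: the first report type in `priority` acts as the
    base; fields it lacks are taken from the next report type, and so on.
    Report types not listed in `priority` are used last.
    """
    rank = {report_type: i for i, report_type in enumerate(priority)}
    ordered = (
        df.assign(_rank=df["REPORT_TYPE"].map(rank).fillna(len(rank)))
          .sort_values(["DATE", "_rank"])
          .drop(columns="_rank")
    )
    # GroupBy.first() picks, per column, the first non-null value in each
    # group, which gives "base value if present, otherwise fallback".
    return ordered.groupby("DATE").first().reset_index()


merged = merge_reports(reports)
print(merged)
```

One caveat with filling column by column: fields that belong together (for example `GA1_amt`, `GA1_height`, and `GA1_type`) could end up combined from different reports, so it may be safer to fill such groups as a unit.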


In [4]:
report_counts = df["REPORT_TYPE"].value_counts()
print(report_counts)

REPORT_TYPE
FM-15    86875
FM-12    43452
FM-16        1
Name: count, dtype: int64
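
For context, 2019-2023 contains 43,824 hours, so the 43,452 FM-12 reports cannot cover every hour on their own. A small sketch for listing the hours that have no report at all is shown below; `missing_hours` is a hypothetical helper, the timestamps are toy values, and time zones are ignored.

```python
import pandas as pd


def missing_hours(timestamps, start, end):
    """Return the hours in [start, end] that have no report (hypothetical helper)."""
    expected = pd.date_range(start, end, freq="h")
    # Round each report down to its hour so a 00:30 report counts for 00:00.
    observed = pd.DatetimeIndex(timestamps).floor("h").unique()
    return expected.difference(observed)


# Size of the full hourly grid for 2019-2023 (2020 is a leap year).
expected_hours = len(pd.date_range("2019-01-01 00:00", "2023-12-31 23:00", freq="h"))
print(expected_hours)

# Toy example: reports at 00:00, 00:30 and 02:00 within a four-hour window.
toy = pd.to_datetime(["2019-01-01 00:00", "2019-01-01 00:30", "2019-01-01 02:00"])
gaps = missing_hours(toy, "2019-01-01 00:00", "2019-01-01 03:00")
print(gaps)
```

On the real data the same helper could be applied to `df["DATE"]`, or to the FM-12 rows only, to see where the base report is absent.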
