Step 1: Basic Time Features
Extract from the time index:

hour (0–23)

day (1–31)

month (1–12)

dayofweek (0=Mon to 6=Sun)

is_weekend (0/1; listed here but not computed in the cell below)

dayofyear (1–365, or 1–366 in leap years)

(optional) season (summer, monsoon, winter, etc.)
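
A minimal sketch of deriving the `is_weekend` and `season` features, assuming a `time` column that has already been parsed to datetimes. The helper name `add_calendar_flags` and the month-to-season mapping are illustrative assumptions, not part of the notebook; adjust the mapping to whatever seasonal split suits the region.

```python
import pandas as pd

# Assumed month-to-season mapping (hypothetical; not taken from the notebook).
SEASON_BY_MONTH = {
    1: "winter", 2: "winter",
    3: "summer", 4: "summer", 5: "summer",
    6: "southwest_monsoon", 7: "southwest_monsoon",
    8: "southwest_monsoon", 9: "southwest_monsoon",
    10: "northeast_monsoon", 11: "northeast_monsoon", 12: "northeast_monsoon",
}


def add_calendar_flags(df: pd.DataFrame, time_col: str = "time") -> pd.DataFrame:
    """Return a copy of df with is_weekend (0/1) and season columns added."""
    out = df.copy()
    # dayofweek is 0=Mon .. 6=Sun, so 5 and 6 are Saturday and Sunday.
    out["is_weekend"] = (out[time_col].dt.dayofweek >= 5).astype(int)
    out["season"] = out[time_col].dt.month.map(SEASON_BY_MONTH)
    return out


demo = pd.DataFrame(
    {"time": pd.to_datetime(["2024-01-06 10:00", "2024-07-01 00:00"])}
)
demo = add_calendar_flags(demo)
print(demo)
```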


In [1]:
import os
import pandas as pd
import glob

input_dir = 'Dataset_cleaned/Tamil Nadu/Metrological Data/'
file_paths = glob.glob(os.path.join(input_dir, '*.csv'))

for file_path in file_paths:
    print(f"\nProcessing time features for: {file_path}")
    
    try:
        with open(file_path, 'r') as f:
            header_lines = [next(f) for _ in range(3)]

        df = pd.read_csv(file_path, skiprows=3)

        df['time'] = pd.to_datetime(df['time'], errors='coerce')
        df = df.dropna(subset=['time'])

        df['hour'] = df['time'].dt.hour
        df['day'] = df['time'].dt.day
        df['month'] = df['time'].dt.month
        df['dayofweek'] = df['time'].dt.dayofweek
        df['dayofyear'] = df['time'].dt.dayofyear

        df.to_csv(file_path, index=False)
        
        with open(file_path, 'r') as f:
            body = f.read()

        with open(file_path, 'w') as f:
            f.writelines(header_lines)
            f.write(body)

        print(f"Time features added and saved to: {file_path}")

    except Exception as e:
        print(f"Failed to process {file_path}: {e}")



Processing time features for: Dataset_cleaned/Tamil Nadu/Metrological Data\Ariyalur_cleaned.csv
Time features added and saved to: Dataset_cleaned/Tamil Nadu/Metrological Data\Ariyalur_cleaned.csv

Processing time features for: Dataset_cleaned/Tamil Nadu/Metrological Data\Chengalpattu_cleaned.csv
Time features added and saved to: Dataset_cleaned/Tamil Nadu/Metrological Data\Chengalpattu_cleaned.csv

Processing time features for: Dataset_cleaned/Tamil Nadu/Metrological Data\Chennai_cleaned.csv
Time features added and saved to: Dataset_cleaned/Tamil Nadu/Metrological Data\Chennai_cleaned.csv

Processing time features for: Dataset_cleaned/Tamil Nadu/Metrological Data\Coimbatore_cleaned.csv
Time features added and saved to: Dataset_cleaned/Tamil Nadu/Metrological Data\Coimbatore_cleaned.csv

Processing time features for: Dataset_cleaned/Tamil Nadu/Metrological Data\Cuddalore_cleaned.csv
Time features added and saved to: Dataset_cleaned/Tamil Nadu/Metrological Data\Cuddalore_cleaned.csv

Pr

Step 2: Lag Features
Capture the past behavior of temperature and rain:

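| Lag Column                                                   | Meaning                                    |
| ------------------------------------------------------------ | ------------------------------------------ |
| `temp_lag_1h`                                                | Temperature 1 hour earlier                 |
| `temp_lag_3h`, `temp_lag_6h`, `temp_lag_12h`, `temp_lag_24h` | Temperature 3, 6, 12, and 24 hours earlier |
| `rain_lag_1h`, `rain_lag_3h`, `rain_lag_6h`, `rain_lag_12h`, `rain_lag_24h` | Rainfall 1, 3, 6, 12, and 24 hours earlier |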
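
The lag columns rely on `shift`, which moves values by rows rather than by clock time, so a lag of `n` rows is a lag of `n` hours only when the series is hourly and gap-free. The sketch below shows the idea on a toy frame. The helper names `add_lags` and `is_hourly` and the column name `temp` are hypothetical stand-ins, not names from the notebook.

```python
import pandas as pd


def add_lags(df, column, prefix, lags=(1, 3, 6, 12, 24)):
    """Return a copy of df with one `<prefix>_lag_<n>h` column per lag."""
    out = df.copy()
    for lag in lags:
        # Row i receives the value from row i - lag (NaN for the first rows).
        out[f"{prefix}_lag_{lag}h"] = out[column].shift(lag)
    return out


def is_hourly(times: pd.Series) -> bool:
    """True when consecutive timestamps are exactly one hour apart."""
    steps = times.sort_values().diff().dropna()
    return bool((steps == pd.Timedelta(hours=1)).all())


toy = pd.DataFrame({
    "time": pd.date_range("2024-01-01", periods=5, freq=pd.Timedelta(hours=1)),
    "temp": [20.0, 21.0, 22.0, 23.0, 24.0],
})
toy = add_lags(toy, "temp", "temp", lags=(1, 3))
print(toy)
print("hourly and gap-free:", is_hourly(toy["time"]))
```

Running a check like `is_hourly` before shifting is one way to catch files where a missing hour would silently misalign every lag.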


In [2]:
import os
import pandas as pd
import glob

input_dir = 'Dataset_cleaned/Tamil Nadu/Metrological Data/'
file_paths = glob.glob(os.path.join(input_dir, '*.csv'))

lag_hours = [1, 3, 6, 12, 24]

for file_path in file_paths:
    print(f"\nAdding lag features to: {file_path}")
    
    try:
        with open(file_path, 'r') as f:
            header_lines = [next(f) for _ in range(3)]

        df = pd.read_csv(file_path, skiprows=3)
        df['time'] = pd.to_datetime(df['time'], errors='coerce')
        df = df.dropna(subset=['time'])
        df = df.sort_values('time').reset_index(drop=True)

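        # shift(lag) moves values down by `lag` rows, so a lag equals `lag`
        # hours only if the rows are hourly with no gaps.
        for lag in lag_hours:
            df[f'temp_lag_{lag}h'] = df['temperature_2m (°C)'].shift(lag)
            df[f'rain_lag_{lag}h'] = df['rain (mm)'].shift(lag)

        # The first max(lag_hours) rows lack a full history and are dropped;
        # dropna() also removes rows with a NaN in any other column.
        df = df.dropna().reset_index(drop=True)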

        df.to_csv(file_path, index=False)
        
        with open(file_path, 'r') as f:
            body = f.read()
        
        with open(file_path, 'w') as f:
            f.writelines(header_lines)
            f.write(body)

        print(f"Lag features added: {file_path}")

    except Exception as e:
        print(f"Error in {file_path}: {e}")


Adding lag features to: Dataset_cleaned/Tamil Nadu/Metrological Data\Ariyalur_cleaned.csv
Lag features added: Dataset_cleaned/Tamil Nadu/Metrological Data\Ariyalur_cleaned.csv

Adding lag features to: Dataset_cleaned/Tamil Nadu/Metrological Data\Chengalpattu_cleaned.csv
Lag features added: Dataset_cleaned/Tamil Nadu/Metrological Data\Chengalpattu_cleaned.csv

Adding lag features to: Dataset_cleaned/Tamil Nadu/Metrological Data\Chennai_cleaned.csv
Lag features added: Dataset_cleaned/Tamil Nadu/Metrological Data\Chennai_cleaned.csv

Adding lag features to: Dataset_cleaned/Tamil Nadu/Metrological Data\Coimbatore_cleaned.csv
Lag features added: Dataset_cleaned/Tamil Nadu/Metrological Data\Coimbatore_cleaned.csv

Adding lag features to: Dataset_cleaned/Tamil Nadu/Metrological Data\Cuddalore_cleaned.csv
Lag features added: Dataset_cleaned/Tamil Nadu/Metrological Data\Cuddalore_cleaned.csv

Adding lag features to: Dataset_cleaned/Tamil Nadu/Metrological Data\Dindigul_cleaned.csv
Lag features

Step 3: Rolling Features
Capture short-term trends and volatility in temperature and rainfall:

| Rolling Column                                                   | Meaning                                          |
| ---------------------------------------------------------------- | ------------------------------------------------ |
| `temp_roll_mean_6h`                                              | Mean temperature over the last 6 hourly rows     |
| `temp_roll_mean_12h`, `temp_roll_mean_24h`                       | Mean temperature over the last 12 and 24 rows    |
| `rain_roll_sum_6h`, `rain_roll_sum_12h`, `rain_roll_sum_24h`     | Total rainfall over the last 6, 12, and 24 rows  |
| rolling std/variance (optional, not computed here)               | Spread/volatility                                |
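
One detail worth making explicit: a plain `rolling(window=n)` includes the current row in its window. The toy example below contrasts that with a strictly historical window obtained by shifting first. It is a sketch on made-up values, and the variable names are illustrative only.

```python
import pandas as pd

rain = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])

# Window covers rows i-2, i-1, and i (the current row is included).
incl_current = rain.rolling(window=3).sum()

# Shifting by one row first makes the window cover rows i-3, i-2, and i-1.
past_only = rain.shift(1).rolling(window=3).sum()

print(pd.DataFrame({
    "rain": rain,
    "incl_current": incl_current,
    "past_only": past_only,
}))
```

Either variant uses only information available at time `t`; which one is preferable depends on whether the current observation is also fed to the model as its own feature.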


In [3]:
input_dir = 'Dataset_cleaned/Tamil Nadu/Metrological Data/'
file_paths = glob.glob(os.path.join(input_dir, '*.csv'))

roll_windows = [6, 12, 24]

for file_path in file_paths:
    print(f"\nAdding rolling features to: {file_path}")
    
    try:
        with open(file_path, 'r') as f:
            header_lines = [next(f) for _ in range(3)]

        df = pd.read_csv(file_path, skiprows=3)
        df['time'] = pd.to_datetime(df['time'], errors='coerce')
        df = df.dropna(subset=['time'])
        df = df.sort_values('time').reset_index(drop=True)

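        # rolling(window=n) spans the current row plus the n-1 rows before it,
        # so "n hours" again assumes hourly, gap-free rows.
        for window in roll_windows:
            df[f'temp_roll_mean_{window}h'] = df['temperature_2m (°C)'].rolling(window=window).mean()
            df[f'rain_roll_sum_{window}h'] = df['rain (mm)'].rolling(window=window).sum()

        # Rows without a full window (the first max(roll_windows) - 1) are dropped.
        df = df.dropna().reset_index(drop=True)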

        df.to_csv(file_path, index=False)
        with open(file_path, 'r') as f:
            body = f.read()
        with open(file_path, 'w') as f:
            f.writelines(header_lines)
            f.write(body)

        print(f"Rolling features added: {file_path}")

    except Exception as e:
        print(f"Error in {file_path}: {e}")



Adding rolling features to: Dataset_cleaned/Tamil Nadu/Metrological Data\Ariyalur_cleaned.csv
Rolling features added: Dataset_cleaned/Tamil Nadu/Metrological Data\Ariyalur_cleaned.csv

Adding rolling features to: Dataset_cleaned/Tamil Nadu/Metrological Data\Chengalpattu_cleaned.csv
Rolling features added: Dataset_cleaned/Tamil Nadu/Metrological Data\Chengalpattu_cleaned.csv

Adding rolling features to: Dataset_cleaned/Tamil Nadu/Metrological Data\Chennai_cleaned.csv
Rolling features added: Dataset_cleaned/Tamil Nadu/Metrological Data\Chennai_cleaned.csv

Adding rolling features to: Dataset_cleaned/Tamil Nadu/Metrological Data\Coimbatore_cleaned.csv
Rolling features added: Dataset_cleaned/Tamil Nadu/Metrological Data\Coimbatore_cleaned.csv

Adding rolling features to: Dataset_cleaned/Tamil Nadu/Metrological Data\Cuddalore_cleaned.csv
Rolling features added: Dataset_cleaned/Tamil Nadu/Metrological Data\Cuddalore_cleaned.csv

Adding rolling features to: Dataset_cleaned/Tamil Nadu/Metrolo

Step 4: Delta Features

temp_diff_1h = temp - temp_lag_1h

rain_diff_1h = rain - rain_lag_1h

This step is combined with the creation of the global dataset.
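
The two formulas above amount to a first difference. On a single-district file, `value - value.shift(1)` is enough; on a combined frame, the difference should be taken within each district so that it never spans two districts. A small sketch on made-up values, where the column name `temp` is a stand-in for the real temperature column:

```python
import pandas as pd

toy = pd.DataFrame({
    "district": ["A", "A", "A", "B", "B"],
    "temp": [20.0, 22.0, 21.0, 30.0, 33.0],
})

# Manual form: current value minus the previous value in the same district.
manual = toy["temp"] - toy.groupby("district")["temp"].shift(1)

# Built-in form: a grouped diff() gives the same result.
builtin = toy.groupby("district")["temp"].diff()

toy["temp_diff_1h"] = builtin
print(toy)
```

The first row of each district has no predecessor, so its delta is NaN and would be removed by a later `dropna()`.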

In [5]:
input_dir = 'Dataset_cleaned/Tamil Nadu/Metrological Data/'
file_paths = glob.glob(os.path.join(input_dir, '*.csv'))

global_df_list = []

for file_path in file_paths:
    print(f"\nAdding delta features: {file_path}")
    
    try:
        with open(file_path, 'r') as f:
            header_lines = [next(f) for _ in range(3)]

        df = pd.read_csv(file_path, skiprows=3)
        df['time'] = pd.to_datetime(df['time'], errors='coerce')
        df = df.dropna(subset=['time'])
        df = df.sort_values('time').reset_index(drop=True)

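        # Same quantity as temp - temp_lag_1h, recomputed here with shift(1);
        # the first remaining row has no predecessor and is dropped below.
        df['temp_diff_1h'] = df['temperature_2m (°C)'] - df['temperature_2m (°C)'].shift(1)
        df['rain_diff_1h'] = df['rain (mm)'] - df['rain (mm)'].shift(1)

        df = df.dropna().reset_index(drop=True)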

        district = os.path.basename(file_path).replace('_cleaned.csv', '')
        df['district'] = district 

        df.to_csv(file_path, index=False)
        with open(file_path, 'r') as f:
            body = f.read()
        with open(file_path, 'w') as f:
            f.writelines(header_lines)
            f.write(body)

        global_df_list.append(df)

        print(f"Delta features and district column added for: {district}")

    except Exception as e:
        print(f"Failed: {file_path} — {e}")

global_df = pd.concat(global_df_list, ignore_index=True)
global_df.to_csv('weather_global_dataset.csv', index=False)
print("\nGlobal dataset saved as: weather_global_dataset.csv")



Adding delta features: Dataset_cleaned/Tamil Nadu/Metrological Data\Ariyalur_cleaned.csv
Delta features and district column added for: Ariyalur

Adding delta features: Dataset_cleaned/Tamil Nadu/Metrological Data\Chengalpattu_cleaned.csv
Delta features and district column added for: Chengalpattu

Adding delta features: Dataset_cleaned/Tamil Nadu/Metrological Data\Chennai_cleaned.csv
Delta features and district column added for: Chennai

Adding delta features: Dataset_cleaned/Tamil Nadu/Metrological Data\Coimbatore_cleaned.csv
Delta features and district column added for: Coimbatore

Adding delta features: Dataset_cleaned/Tamil Nadu/Metrological Data\Cuddalore_cleaned.csv
Delta features and district column added for: Cuddalore

Adding delta features: Dataset_cleaned/Tamil Nadu/Metrological Data\Dindigul_cleaned.csv
Delta features and district column added for: Dindigul

Adding delta features: Dataset_cleaned/Tamil Nadu/Metrological Data\Gummidipoondi_cleaned.csv
Delta features and dist

Step 5: Create Prediction Targets (t + 120 h, 5 days ahead)

| New Column    | Meaning                 |
| ------------- | ----------------------- |
| `target_temp` | temperature at `t+120h` |
| `target_rain` | rainfall at `t+120h`    |
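
A negative shift pulls future values back onto the current row, which is how a forecasting target is built. The sketch below uses a horizon of 2 rows instead of 120 so the effect is visible on a tiny frame. `HORIZON`, `temp`, and `target_time` are illustrative names, and `target_time` is added only to confirm that each target really lies the intended number of hours ahead.

```python
import pandas as pd

HORIZON = 2  # stand-in for the 120-hour horizon

toy = pd.DataFrame({
    "time": pd.date_range("2024-01-01", periods=5, freq=pd.Timedelta(hours=1)),
    "temp": [20.0, 21.0, 22.0, 23.0, 24.0],
})

# Row i receives the value (and timestamp) from row i + HORIZON.
toy["target_temp"] = toy["temp"].shift(-HORIZON)
toy["target_time"] = toy["time"].shift(-HORIZON)

# The last HORIZON rows have no future value yet and cannot be used for training.
labelled = toy.dropna(subset=["target_temp"]).reset_index(drop=True)
print(labelled)
```

With the real horizon, the same logic removes the final 120 rows of each district.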

In [6]:
input_dir = 'Dataset_cleaned/Tamil Nadu/Metrological Data/'
file_paths = glob.glob(os.path.join(input_dir, '*.csv'))

for file_path in file_paths:
    print(f"\nCreating prediction targets in: {file_path}")
    
    try:
        with open(file_path, 'r') as f:
            header_lines = [next(f) for _ in range(3)]

        df = pd.read_csv(file_path, skiprows=3)
        df['time'] = pd.to_datetime(df['time'], errors='coerce')
        df = df.dropna(subset=['time'])
        df = df.sort_values('time').reset_index(drop=True)

        df['target_temp'] = df['temperature_2m (°C)'].shift(-120)
        df['target_rain'] = df['rain (mm)'].shift(-120)

        df = df.dropna().reset_index(drop=True)

        df.to_csv(file_path, index=False)
        with open(file_path, 'r') as f:
            body = f.read()
        with open(file_path, 'w') as f:
            f.writelines(header_lines)
            f.write(body)

        print(f"Targets added to: {file_path}")

    except Exception as e:
        print(f"Error in {file_path}: {e}")

print("\n🌍 Updating global dataset...")
global_df = pd.read_csv('weather_global_dataset.csv')
global_df['time'] = pd.to_datetime(global_df['time'], errors='coerce')
global_df = global_df.dropna(subset=['time'])
global_df = global_df.sort_values(['district', 'time']).reset_index(drop=True)

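# Shift within each district so targets never cross district boundaries.
# Rows are already sorted by ['district', 'time'], so a grouped shift is
# enough. It also avoids groupby().apply(), which newer pandas versions
# warn about when the function operates on the grouping column.
by_district = global_df.groupby('district')
global_df['target_temp'] = by_district['temperature_2m (°C)'].shift(-120)
global_df['target_rain'] = by_district['rain (mm)'].shift(-120)
global_df = global_df.dropna().reset_index(drop=True)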
global_df.to_csv('weather_global_dataset.csv', index=False)
print("Global targets added and file updated.")



Creating prediction targets in: Dataset_cleaned/Tamil Nadu/Metrological Data\Ariyalur_cleaned.csv
Targets added to: Dataset_cleaned/Tamil Nadu/Metrological Data\Ariyalur_cleaned.csv

Creating prediction targets in: Dataset_cleaned/Tamil Nadu/Metrological Data\Chengalpattu_cleaned.csv
Targets added to: Dataset_cleaned/Tamil Nadu/Metrological Data\Chengalpattu_cleaned.csv

Creating prediction targets in: Dataset_cleaned/Tamil Nadu/Metrological Data\Chennai_cleaned.csv
Targets added to: Dataset_cleaned/Tamil Nadu/Metrological Data\Chennai_cleaned.csv

Creating prediction targets in: Dataset_cleaned/Tamil Nadu/Metrological Data\Coimbatore_cleaned.csv
Targets added to: Dataset_cleaned/Tamil Nadu/Metrological Data\Coimbatore_cleaned.csv

Creating prediction targets in: Dataset_cleaned/Tamil Nadu/Metrological Data\Cuddalore_cleaned.csv
Targets added to: Dataset_cleaned/Tamil Nadu/Metrological Data\Cuddalore_cleaned.csv

Creating prediction targets in: Dataset_cleaned/Tamil Nadu/Metrological 



Global targets added and file updated.


Step 6: Final Cleanup

For each file (and the global file):

Drop rows with any NaNs

Optionally print the final shape and column list
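
The cleanup steps above can be sketched as a single helper that keeps the three metadata lines at the top of each file and writes them back in one pass. This is an illustration under assumptions rather than the notebook's own code: the function name `clean_file`, the UTF-8 encoding, and the demo file contents are all hypothetical.

```python
import os
import tempfile

import pandas as pd

N_HEADER = 3  # number of metadata lines preceding the CSV column header


def clean_file(path: str) -> pd.DataFrame:
    """Drop rows containing NaNs while preserving the leading metadata lines."""
    # Encoding is an assumption; match whatever the files were written with.
    with open(path, "r", encoding="utf-8") as f:
        header_lines = [next(f) for _ in range(N_HEADER)]

    df = pd.read_csv(path, skiprows=N_HEADER, encoding="utf-8")
    df = df.dropna().reset_index(drop=True)

    # Write metadata and data through the same handle, avoiding a re-read.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.writelines(header_lines)
        df.to_csv(f, index=False)
    return df


# Demo on a throwaway file with one incomplete row.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_cleaned.csv")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("meta line 1\nmeta line 2\nmeta line 3\n")
    f.write("time,temp\n2024-01-01 00:00,20.0\n2024-01-01 01:00,\n")

cleaned = clean_file(demo_path)
print(cleaned.shape, list(cleaned.columns))
```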

In [7]:
input_dir = 'Dataset_cleaned/Tamil Nadu/Metrological Data/'
file_paths = glob.glob(os.path.join(input_dir, '*.csv'))

for file_path in file_paths:
    print(f"\nFinal cleaning: {file_path}")
    
    try:
        with open(file_path, 'r') as f:
            header_lines = [next(f) for _ in range(3)]

        df = pd.read_csv(file_path, skiprows=3)
        df = df.dropna().reset_index(drop=True)

        df.to_csv(file_path, index=False)
        with open(file_path, 'r') as f:
            body = f.read()
        with open(file_path, 'w') as f:
            f.writelines(header_lines)
            f.write(body)

        print(f"Clean and saved: {file_path} — Shape: {df.shape}")

    except Exception as e:
        print(f"Error in {file_path}: {e}")

print("\nFinal cleanup: global dataset")
global_df = pd.read_csv('weather_global_dataset.csv')
global_df = global_df.dropna().reset_index(drop=True)
global_df.to_csv('weather_global_dataset.csv', index=False)
print(f"Global dataset ready — Shape: {global_df.shape}")



Final cleaning: Dataset_cleaned/Tamil Nadu/Metrological Data\Ariyalur_cleaned.csv
Clean and saved: Dataset_cleaned/Tamil Nadu/Metrological Data\Ariyalur_cleaned.csv — Shape: (8616, 35)

Final cleaning: Dataset_cleaned/Tamil Nadu/Metrological Data\Chengalpattu_cleaned.csv
Clean and saved: Dataset_cleaned/Tamil Nadu/Metrological Data\Chengalpattu_cleaned.csv — Shape: (8616, 35)

Final cleaning: Dataset_cleaned/Tamil Nadu/Metrological Data\Chennai_cleaned.csv
Clean and saved: Dataset_cleaned/Tamil Nadu/Metrological Data\Chennai_cleaned.csv — Shape: (8616, 35)

Final cleaning: Dataset_cleaned/Tamil Nadu/Metrological Data\Coimbatore_cleaned.csv
Clean and saved: Dataset_cleaned/Tamil Nadu/Metrological Data\Coimbatore_cleaned.csv — Shape: (8616, 35)

Final cleaning: Dataset_cleaned/Tamil Nadu/Metrological Data\Cuddalore_cleaned.csv
Clean and saved: Dataset_cleaned/Tamil Nadu/Metrological Data\Cuddalore_cleaned.csv — Shape: (8616, 35)

Final cleaning: Dataset_cleaned/Tamil Nadu/Metrological D