STEP 1: File Listing + Basic Info for Each File

In [1]:
import os
import pandas as pd
import glob

# Set your folder path
input_dir = 'Dataset/Tamil Nadu/Metrological Data/'
file_paths = glob.glob(os.path.join(input_dir, '*.csv'))

# Loop through each file and show basic shape and null info
for file_path in file_paths:
    print(f"\n🔍 Inspecting file: {file_path}")
    
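    # Skip the first 3 lines of the file, which come before the column header row
    try:
        df = pd.read_csv(file_path, skiprows=3)

        print("📋 Shape:", df.shape)
        print("📉 Null values per column:\n", df.isnull().sum())
        print("ℹ️ Info:")
        df.info()  # info() prints its own report and returns None, so it is not wrapped in print()

    except Exception as e:
        print(f"Error reading {file_path}: {e}")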



🔍 Inspecting file: Dataset/Tamil Nadu/Metrological Data\Ariyalur.csv
📋 Shape: (8784, 9)
📉 Null values per column:
 time                        0
temperature_2m (°C)         0
relative_humidity_2m (%)    0
rain (mm)                   0
surface_pressure (hPa)      0
wind_speed_10m (km/h)       0
wind_speed_100m (km/h)      0
wind_direction_100m (°)     0
wind_direction_10m (°)      0
dtype: int64
ℹ️ Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8784 entries, 0 to 8783
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   time                      8784 non-null   object 
 1   temperature_2m (°C)       8784 non-null   float64
 2   relative_humidity_2m (%)  8784 non-null   int64  
 3   rain (mm)                 8784 non-null   float64
 4   surface_pressure (hPa)    8784 non-null   float64
 5   wind_speed_10m (km/h)     8784 non-null   float64
 6   wind_speed_100m (km/h)    8784 non-null  

STEP 2: .describe() + Time Parsing Check (One-by-One)


In [2]:
input_dir = 'Dataset/Tamil Nadu/Metrological Data/'
file_paths = glob.glob(os.path.join(input_dir, '*.csv'))

for file_path in file_paths:
    print(f"\n📊 Describing file: {file_path}")
    
    df = pd.read_csv(file_path, skiprows=3)

    df['time'] = pd.to_datetime(df['time'], errors='coerce')
    
    num_invalid_times = df['time'].isnull().sum()
    if num_invalid_times > 0:
        print(f"⚠️  {num_invalid_times} rows have invalid timestamps.")
    else:
        print("✅ All timestamps are valid.")

    df = df.set_index('time').sort_index()

    print(df.describe())



📊 Describing file: Dataset/Tamil Nadu/Metrological Data\Ariyalur.csv
✅ All timestamps are valid.
       temperature_2m (°C)  relative_humidity_2m (%)    rain (mm)  \
count          8784.000000               8784.000000  8784.000000   
mean             28.575774                 71.065801     0.197905   
std               4.194366                 19.234508     0.911090   
min              19.200000                 18.000000     0.000000   
25%              25.400000                 58.000000     0.000000   
50%              27.800000                 73.000000     0.000000   
75%              31.200000                 88.000000     0.000000   
max              42.500000                100.000000    15.000000   

       surface_pressure (hPa)  wind_speed_10m (km/h)  wind_speed_100m (km/h)  \
count             8784.000000            8784.000000             8784.000000   
mean               999.882309              11.491610               18.154019   
std                  3.618857           
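
The check above only confirms that every timestamp parses. A shape of 8784 rows equals 366 × 24, which would be consistent with one leap year of hourly records, so a natural follow-up is to look for duplicated or missing hours. The following is a minimal sketch on synthetic data; the helper name `check_hourly_index` and the assumption of a strictly hourly cadence are illustrative and not part of the original notebook.

```python
import pandas as pd


def check_hourly_index(index: pd.DatetimeIndex) -> dict:
    """Summarise duplicates and gaps in a timestamp index assumed to be hourly."""
    index = index.sort_values()
    n_duplicates = int(index.duplicated().sum())

    # Build the full hourly range between the first and last timestamp
    unique = index.unique()
    expected = pd.date_range(unique.min(), unique.max(), freq=pd.Timedelta(hours=1))
    missing = expected.difference(unique)

    return {
        "duplicates": n_duplicates,
        "missing": len(missing),
        "expected_rows": len(expected),
    }


# Synthetic example: 48 hourly stamps with one hour removed and one stamp repeated
idx = pd.date_range("2024-01-01", periods=48, freq=pd.Timedelta(hours=1))
idx = idx.delete(10).append(idx[:1])

report = check_hourly_index(idx)
print(report)
```

On the real files the same helper could be called on `df.index` right after `set_index('time')`, assuming no `NaT` values remain in the index.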

STEP 3: Z-Score Outlier Detection (Per File)

In [3]:
from scipy.stats import zscore


input_dir = 'Dataset/Tamil Nadu/Metrological Data/'
file_paths = glob.glob(os.path.join(input_dir, '*.csv'))

exclude_cols = ['time', 'wind_direction_100m (°)', 'wind_direction_10m (°)']

for file_path in file_paths:
    print(f"\nZ-Score Outlier Check for: {file_path}")
    
    df = pd.read_csv(file_path, skiprows=3)
    df['time'] = pd.to_datetime(df['time'], errors='coerce')
    df = df.set_index('time').sort_index()

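    # Check every numeric column except those listed in exclude_cols
    num_cols = df.select_dtypes(include='number').columns.difference(exclude_cols)
    z_scores = df[num_cols].apply(zscore)

    # Count values lying more than 3 standard deviations from the column mean
    outlier_counts = (z_scores.abs() > 3).sum()
    print(outlier_counts)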



Z-Score Outlier Check for: Dataset/Tamil Nadu/Metrological Data\Ariyalur.csv
rain (mm)                   173
relative_humidity_2m (%)      0
surface_pressure (hPa)        0
temperature_2m (°C)          28
wind_speed_100m (km/h)       17
wind_speed_10m (km/h)        25
dtype: int64

Z-Score Outlier Check for: Dataset/Tamil Nadu/Metrological Data\Chengalpattu.csv
rain (mm)                   102
relative_humidity_2m (%)      1
surface_pressure (hPa)        0
temperature_2m (°C)          39
wind_speed_100m (km/h)       55
wind_speed_10m (km/h)        52
dtype: int64

Z-Score Outlier Check for: Dataset/Tamil Nadu/Metrological Data\Chennai.csv
rain (mm)                   120
relative_humidity_2m (%)      7
surface_pressure (hPa)        0
temperature_2m (°C)          33
wind_speed_100m (km/h)       31
wind_speed_10m (km/h)        44
dtype: int64

Z-Score Outlier Check for: Dataset/Tamil Nadu/Metrological Data\Coimbatore.csv
rain (mm)                   151
relative_humidity_2m (%)      0
surf
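
Rain produces by far the largest counts above (173 flagged hours for Ariyalur). The describe() output in STEP 2 shows a rain median and 75th percentile of 0.0 mm, so the column is mostly zeros, its mean sits close to zero, and many ordinary showers end up beyond 3 standard deviations. The sketch below illustrates this on synthetic data only; the 5% wet-hour fraction and the exponential rain amounts are assumptions chosen for illustration, not properties measured from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hours = 8784

# Synthetic zero-inflated "rain": most hours are dry, a few have positive amounts
is_wet = rng.random(n_hours) < 0.05
rain = np.where(is_wet, rng.exponential(scale=2.0, size=n_hours), 0.0)

# Population z-score (ddof=0), matching the default of scipy.stats.zscore
z = (rain - rain.mean()) / rain.std()
flagged = np.abs(z) > 3

print("wet hours:", int(is_wet.sum()))
print("flagged hours:", int(flagged.sum()))
print("smallest flagged amount (mm):", float(rain[flagged].min()))
```

In this setting every flagged value is a wet hour, which is one way to read the decision in the next step to leave the rain column untouched.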

STEP 4: Replace Z-score Outliers with the Column Median (But Keep Rain)

In [4]:
input_dir = 'Dataset/Tamil Nadu/Metrological Data/'
output_dir = 'Dataset_cleaned/Tamil Nadu/Metrological Data/'
os.makedirs(output_dir, exist_ok=True)

clip_cols = [
    'temperature_2m (°C)',
    'relative_humidity_2m (%)',
    'surface_pressure (hPa)',
    'wind_speed_10m (km/h)',
    'wind_speed_100m (km/h)'
]

file_paths = glob.glob(os.path.join(input_dir, '*.csv'))

for file_path in file_paths:
    print(f"\n🧼 Cleaning: {file_path}")
    
    file_name = os.path.basename(file_path)
    district = os.path.splitext(file_name)[0]
    

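    # Preserve the 3 lines that precede the column header so they can be written back
    with open(file_path, 'r', encoding='utf-8') as f:
        header_lines = [next(f) for _ in range(3)]
    
    df = pd.read_csv(file_path, skiprows=3)
    df['time'] = pd.to_datetime(df['time'], errors='coerce')
    df = df.dropna(subset=['time'])
    df = df.set_index('time').sort_index()
    
    for col in clip_cols:
        if col in df.columns:
            # nan_policy='omit' stops a single missing value from turning every z-score into NaN
            z = zscore(df[col], nan_policy='omit')
            # Replace values more than 3 standard deviations from the mean with the column median
            df[col] = df[col].where(~(abs(z) > 3), df[col].median())

    output_path = os.path.join(output_dir, f"{district}_cleaned.csv")
    df.to_csv(output_path)

    # Re-read the cleaned table and prepend the original 3 header lines
    with open(output_path, 'r', encoding='utf-8') as f:
        cleaned_data = f.read()

    with open(output_path, 'w', encoding='utf-8') as f:
        f.writelines(header_lines)
        f.write(cleaned_data)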

    print(f"✅ Saved cleaned file to: {output_path}")



🧼 Cleaning: Dataset/Tamil Nadu/Metrological Data\Ariyalur.csv
✅ Saved cleaned file to: Dataset_cleaned/Tamil Nadu/Metrological Data/Ariyalur_cleaned.csv

🧼 Cleaning: Dataset/Tamil Nadu/Metrological Data\Chengalpattu.csv
✅ Saved cleaned file to: Dataset_cleaned/Tamil Nadu/Metrological Data/Chengalpattu_cleaned.csv

🧼 Cleaning: Dataset/Tamil Nadu/Metrological Data\Chennai.csv
✅ Saved cleaned file to: Dataset_cleaned/Tamil Nadu/Metrological Data/Chennai_cleaned.csv

🧼 Cleaning: Dataset/Tamil Nadu/Metrological Data\Coimbatore.csv
✅ Saved cleaned file to: Dataset_cleaned/Tamil Nadu/Metrological Data/Coimbatore_cleaned.csv

🧼 Cleaning: Dataset/Tamil Nadu/Metrological Data\Cuddalore.csv
✅ Saved cleaned file to: Dataset_cleaned/Tamil Nadu/Metrological Data/Cuddalore_cleaned.csv

🧼 Cleaning: Dataset/Tamil Nadu/Metrological Data\Dindigul.csv
✅ Saved cleaned file to: Dataset_cleaned/Tamil Nadu/Metrological Data/Dindigul_cleaned.csv

🧼 Cleaning: Dataset/Tamil Nadu/Metrological Data\Gummidipoondi.
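
Because the cleaned files get the 3 original header lines written back in front of the table, they should load with the same `skiprows=3` call used throughout this notebook. The sketch below checks that round trip in memory. The header text, the sample values, and the use of `io.StringIO` instead of a real file are all hypothetical stand-ins; only the column name and the `skiprows=3` convention come from the notebook.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the 3 lines preserved from a raw file
header_lines = [
    "latitude,longitude,elevation\n",
    "11.0,79.0,76.0\n",
    "hypothetical metadata line\n",
]

# Small synthetic table indexed by time, shaped like the cleaned data
df = pd.DataFrame(
    {
        "time": pd.date_range("2024-01-01", periods=3, freq=pd.Timedelta(hours=1)),
        "temperature_2m (°C)": [25.5, 24.5, 24.0],
    }
).set_index("time")

# Write header lines followed by the table, as the cleaning step does on disk
buffer = io.StringIO()
buffer.writelines(header_lines)
df.to_csv(buffer)
buffer.seek(0)

# Read it back the same way the raw files are read
restored = pd.read_csv(buffer, skiprows=3, parse_dates=["time"], index_col="time")
print(restored)
```

If the restored frame matches the one that was written, the cleaned files can be passed through the earlier inspection steps unchanged by pointing `input_dir` at `output_dir`.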