Key Columns (Weather and Environmental Data): 

- temp, feels_like, temp_min, temp_max: Temperature-related data in °C.
- pressure, humidity: Atmospheric pressure (hPa) and humidity (%).
- wind_speed, wind_deg, wind_gust: Wind characteristics.
- rain_1h, rain_3h, snow_1h, snow_3h: Precipitation data (mm).
- clouds_all: Cloud coverage in percentage.
- weather_main, weather_description, weather_icon: Weather conditions.
- dt, timezone: Unix timestamp (s) and UTC offset (s); redundant with dt_iso, so use dt_iso instead.
- dt_iso: ISO formatted datetime with timezone.
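
As a quick sanity check that dropping dt loses no information, the sketch below compares the two encodings on sample rows copied from the dataset preview. The frame name `sample` and the helper variables are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Two sample rows copied from the dataset preview (dt is in Unix seconds).
sample = pd.DataFrame({
    "dt": [1199145600, 1199149200],
    "dt_iso": [
        "2008-01-01 00:00:00 +0000 UTC",
        "2008-01-01 01:00:00 +0000 UTC",
    ],
    "timezone": [7200, 7200],
})

# Strip the fixed " +0000 UTC" suffix, then parse as timezone-aware UTC.
from_iso = pd.to_datetime(
    sample["dt_iso"].str.replace(" +0000 UTC", "", regex=False), utc=True
)

# Convert the Unix seconds to timezone-aware UTC timestamps.
from_epoch = pd.to_datetime(sample["dt"], unit="s", utc=True)

# True when both columns describe the same instants.
same_instant = bool((from_iso == from_epoch).all())
print(same_instant)
```

If the check holds on the full column as well, dt can be dropped safely. The timezone column holds a fixed offset in seconds and is only needed if local time matters.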

Observation:
- city_name, lat, lon, sea_level, and grnd_level can be dropped because they are constant or blank.
- temp, feels_like, temp_max, temp_min, and pressure contain implausible values such as -998 and -1000; drop those rows or replace the values with NaN.
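
Both observations can be checked mechanically. The following is a minimal sketch on a hypothetical toy frame (the frame `toy`, its values, and the bounds in `valid_ranges` are assumptions for illustration): it finds constant or blank columns, then replaces out-of-range readings with NaN instead of dropping the rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame imitating the issues noted above.
toy = pd.DataFrame({
    "city_name": ["Cluj", "Cluj", "Cluj"],   # constant column
    "sea_level": [np.nan, np.nan, np.nan],   # blank column
    "temp": [-9.46, -998.0, 12.3],           # contains an implausible value
    "pressure": [1024, 1023, -1000],         # contains an implausible value
})

# nunique() ignores NaN by default, so constant and all-blank
# columns both have at most one distinct value.
droppable = [c for c in toy.columns if toy[c].nunique() <= 1]
cleaned = toy.drop(columns=droppable)

# Assumed plausible bounds; adjust to the station's climate.
valid_ranges = {"temp": (-60, 60), "pressure": (870, 1085)}

# Series.where keeps values where the condition is True and
# writes NaN elsewhere, so the row count is preserved.
for col, (low, high) in valid_ranges.items():
    cleaned[col] = cleaned[col].where(cleaned[col].between(low, high))

print(droppable)
print(cleaned)
```

Replacing with NaN keeps the hourly index intact, which can matter for time-series work; dropping rows is simpler when gaps are acceptable.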

In [2]:
from datasets import load_dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the dataset
ds = load_dataset("LaurentiuStancioiu/Cluj-Napoca-Weather-OpenWeather-data")

# Confirm it's a Hugging Face Dataset object
print(type(ds['train']))  # Should show: <class 'datasets.arrow_dataset.Dataset'>

# Convert to a pandas DataFrame and display first few rows
df = ds['train'].to_pandas()
print(df.head())

# Drop unnecessary columns
columns_to_drop = [
    'dt', 'timezone', 'city_name', 'lat', 'lon',
    'sea_level', 'grnd_level'
]
df.drop(columns=columns_to_drop, inplace=True)

# List the remaining columns
print("All columns:", df.columns.tolist())

# Define realistic value ranges for filtering
valid_ranges = {
    'temp': (-60, 60),
    'feels_like': (-70, 60),
    'temp_min': (-60, 60),
    'temp_max': (-60, 60),
    'pressure': (870, 1085),
}

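# Keep only rows where every checked column is inside its valid range.
# Note: between() is False for NaN, so rows with missing values in these
# columns are dropped as well.
mask = np.ones(len(df), dtype=bool)
for col, (low, high) in valid_ranges.items():
    mask &= df[col].between(low, high)

# Copy so later edits to df_cleaned do not trigger SettingWithCopyWarning
df_cleaned = df[mask].copy()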




<class 'datasets.arrow_dataset.Dataset'>
           dt                         dt_iso  timezone  \
0  1199145600  2008-01-01 00:00:00 +0000 UTC      7200   
1  1199149200  2008-01-01 01:00:00 +0000 UTC      7200   
2  1199152800  2008-01-01 02:00:00 +0000 UTC      7200   
3  1199156400  2008-01-01 03:00:00 +0000 UTC      7200   
4  1199160000  2008-01-01 04:00:00 +0000 UTC      7200   

                                    city_name        lat        lon  temp  \
0  Universitatea Babeș-Bolyai din Cluj-Napoca  46.767141  23.592139 -9.46   
1  Universitatea Babeș-Bolyai din Cluj-Napoca  46.767141  23.592139 -9.39   
2  Universitatea Babeș-Bolyai din Cluj-Napoca  46.767141  23.592139 -9.39   
3  Universitatea Babeș-Bolyai din Cluj-Napoca  46.767141  23.592139 -9.55   
4  Universitatea Babeș-Bolyai din Cluj-Napoca  46.767141  23.592139 -9.55   

   visibility  dew_point  feels_like  ...  wind_gust  rain_1h  rain_3h  \
0      4000.0     -10.40       -9.46  ...        NaN      NaN      NaN   

In [None]:
# Parse datetime and extract time features
df['dt_iso'] = pd.to_datetime(df['dt_iso'].str.replace(' +0000 UTC', '', regex=False))
df['hour'] = df['dt_iso'].dt.hour
df['day'] = df['dt_iso'].dt.day
df['month'] = df['dt_iso'].dt.month
df['year'] = df['dt_iso'].dt.year
df['weekday'] = df['dt_iso'].dt.weekday

print(df.head())

               dt_iso  temp  visibility  dew_point  feels_like  temp_min  \
0 2008-01-01 00:00:00 -9.46      4000.0     -10.40       -9.46    -11.37   
1 2008-01-01 01:00:00 -9.39      4000.0     -10.33       -9.39    -11.50   
2 2008-01-01 02:00:00 -9.39      4000.0     -10.33       -9.39    -11.68   
3 2008-01-01 03:00:00 -9.55      4000.0     -10.49       -9.55    -11.74   
4 2008-01-01 04:00:00 -9.55      4000.0     -10.37       -9.55    -11.48   

   temp_max  pressure  humidity  wind_speed  ...  clouds_all  weather_id  \
0     -7.64      1024        92         1.0  ...         100         600   
1     -7.39      1024        92         1.0  ...         100         600   
2     -7.35      1023        92         1.0  ...         100         600   
3     -7.60      1023        92         1.0  ...         100         804   
4     -7.85      1023        93         1.0  ...         100         701   

   weather_main  weather_description  weather_icon  hour  day  month  year  \
0       

Blank or NaN; - 
