# Weather Data Cleaning
This notebook cleans the hourly weather data and prepares it for the next steps.

In [24]:
import pandas as pd
import numpy as np  
import matplotlib.pyplot as plt 
import seaborn as sns

pd.set_option('display.max_columns', None)

print('Pandas version:', pd.__version__) # 2.2.3
print('Numpy version:', np.__version__) # 1.26.4
print('Seaborn version:', sns.__version__) # 0.13.2

Pandas version: 2.2.3
Numpy version: 1.26.4
Seaborn version: 0.13.2


## Loading the dataset

The data comes from the KNMI (Royal Netherlands Meteorological Institute) hourly observations page:
> https://www.knmi.nl/nederland-nu/klimatologie/uurgegevens

In [25]:
df = pd.read_csv('../data/Weather/WeatherData.txt',skiprows=31, sep=',')
df

Unnamed: 0,# STN,YYYYMMDD,HH,DD,FH,FF,FX,T,T10N,TD,SQ,Q,DR,RH,P,VV,N,U,WW,IX,M,R,S,O,Y
0,260,20170101,1,200,40,40,60,12,,12,0,0,0,-1,10234,5,9,99,32,7,1,1,0,0,0
1,260,20170101,2,200,40,40,70,12,,12,0,0,0,0,10226,3,9,99,34,7,1,0,0,0,0
2,260,20170101,3,210,40,40,70,13,,12,0,0,0,0,10218,9,9,99,32,7,1,0,0,0,0
3,260,20170101,4,210,40,40,70,14,,13,0,0,0,0,10210,11,8,98,20,7,1,0,0,0,0
4,260,20170101,5,210,40,30,70,15,,14,0,0,0,0,10202,18,8,98,10,7,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70123,260,20241231,20,210,70,80,140,62,,37,0,0,0,0,10182,76,8,84,,5,0,0,0,0,0
70124,260,20241231,21,210,80,70,160,68,,41,0,0,0,0,10182,81,8,82,,5,0,0,0,0,0
70125,260,20241231,22,210,80,80,140,68,,42,0,0,0,0,10173,78,8,83,,5,0,0,0,0,0
70126,260,20241231,23,210,80,90,130,74,,48,0,0,0,0,10169,70,8,83,,5,0,0,0,0,0
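
The `skiprows=31` argument above skips the explanatory lines that precede the column row in the text file. A minimal sketch of that pattern on an in-memory sample follows. The sample text, its two legend lines, and the `skipinitialspace=True` option are assumptions for illustration; they are not the contents or settings of the real file.

```python
import io
import pandas as pd

# Hypothetical miniature of a KNMI-style file: legend lines, then a header row,
# then space-padded values (a blank field means "no observation").
sample = (
    "# legend line 1\n"
    "# legend line 2\n"
    "# STN,YYYYMMDD,   HH,   DD,    T, T10N\n"
    "  260,20170101,    1,  200,   12,     \n"
    "  260,20170101,    2,  200,   12,     \n"
    "  260,20170101,    6,  210,   15,   10\n"
)

# Skip the legend lines; skipinitialspace drops the padding after each comma,
# so blank fields are read as missing values instead of strings of spaces.
raw = pd.read_csv(io.StringIO(sample), skiprows=2, sep=",", skipinitialspace=True)

# Clean the header: remove the leading "# " and any stray whitespace.
raw.columns = [name.lstrip("# ").strip() for name in raw.columns]

print(raw.columns.tolist())
print(raw["T10N"].isna().sum())
```

If the number of legend lines ever changes, the first column name is a quick sanity check: it should be the station code column, not a fragment of the legend.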


## Data Cleaning

To start the data cleaning, the columns need names that we can understand. The new names are based on the column descriptions in the text file that contains the data.

In [26]:
# New column names
new_column_names = [
    "Station",
    "Date",
    "Hour",
    "WindDirection",
    "WindSpeedAvg60min",
    "WindSpeedAvg10min",
    "WindGust",
    "Temperature",
    "MinTemperature6hour",
    "DewPoint",
    "Sunshineperhour",
    "GlobalRadiation",
    "PrecipitationDuration",
    "HourlyPrecipitationAmount",
    "Pressure",
    "HorizontalVisibility",
    "CloudCover",
    "RelativeAtmosphericHumidity",
    "WeatherCode",
    "IndicatorWeatherCode",
    "Fog",
    "Rain",
    "Snow",
    "Thunder",
    "IceFormation",
]

# Rename columns
df.columns = new_column_names
df

Unnamed: 0,Station,Date,Hour,WindDirection,WindSpeedAvg60min,WindSpeedAvg10min,WindGust,Temperature,MinTemperature6hour,DewPoint,Sunshineperhour,GlobalRadiation,PrecipitationDuration,HourlyPrecipitationAmount,Pressure,HorizontalVisibility,CloudCover,RelativeAtmosphericHumidity,WeatherCode,IndicatorWeatherCode,Fog,Rain,Snow,Thunder,IceFormation
0,260,20170101,1,200,40,40,60,12,,12,0,0,0,-1,10234,5,9,99,32,7,1,1,0,0,0
1,260,20170101,2,200,40,40,70,12,,12,0,0,0,0,10226,3,9,99,34,7,1,0,0,0,0
2,260,20170101,3,210,40,40,70,13,,12,0,0,0,0,10218,9,9,99,32,7,1,0,0,0,0
3,260,20170101,4,210,40,40,70,14,,13,0,0,0,0,10210,11,8,98,20,7,1,0,0,0,0
4,260,20170101,5,210,40,30,70,15,,14,0,0,0,0,10202,18,8,98,10,7,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70123,260,20241231,20,210,70,80,140,62,,37,0,0,0,0,10182,76,8,84,,5,0,0,0,0,0
70124,260,20241231,21,210,80,70,160,68,,41,0,0,0,0,10182,81,8,82,,5,0,0,0,0,0
70125,260,20241231,22,210,80,80,140,68,,42,0,0,0,0,10173,78,8,83,,5,0,0,0,0,0
70126,260,20241231,23,210,80,90,130,74,,48,0,0,0,0,10169,70,8,83,,5,0,0,0,0,0
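
Assigning a list to `df.columns` relies on the columns arriving in exactly the expected order. A possible alternative, sketched below on a toy frame, is to rename through a mapping so each code is matched by label. The dictionary holds only a subset of the codes, paired with the names as they line up in the outputs above; the names `knmi_to_readable`, `toy`, and `renamed` are illustrative.

```python
import pandas as pd

# Subset of the KNMI codes paired with the readable names used in this notebook.
knmi_to_readable = {
    "STN": "Station",
    "YYYYMMDD": "Date",
    "HH": "Hour",
    "T": "Temperature",
    "T10N": "MinTemperature6hour",
}

# Toy frame with the raw KNMI-style column codes.
toy = pd.DataFrame(
    [[260, 20170101, 1, 12, None]],
    columns=["STN", "YYYYMMDD", "HH", "T", "T10N"],
)

# rename() matches on the existing label, so column order does not matter.
renamed = toy.rename(columns=knmi_to_readable)

# Collect any column that the mapping does not cover, so gaps are visible early.
unmapped = [col for col in toy.columns if col not in knmi_to_readable]

print(renamed.columns.tolist())
print(unmapped)
```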


Now that it is visible what data there is, it is time to make sure everything is in the correct format. To start off, it is best if the date and time are in the same column as a datetime variable.

In [27]:
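# Combine Date and Hour into a single datetime column.
# Hour - 1 labels each row with the start of its hour slot (hour 1 becomes 00:00).
df['Datetime'] = pd.to_datetime(df['Date'].astype(str), format='%Y%m%d') + pd.to_timedelta(df['Hour'] - 1, unit='h')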

# Move the Datetime column to the front
df = df[['Datetime'] + [col for col in df.columns if col != 'Datetime']]
df

Unnamed: 0,Datetime,Station,Date,Hour,WindDirection,WindSpeedAvg60min,WindSpeedAvg10min,WindGust,Temperature,MinTemperature6hour,DewPoint,Sunshineperhour,GlobalRadiation,PrecipitationDuration,HourlyPrecipitationAmount,Pressure,HorizontalVisibility,CloudCover,RelativeAtmosphericHumidity,WeatherCode,IndicatorWeatherCode,Fog,Rain,Snow,Thunder,IceFormation
0,2017-01-01 00:00:00,260,20170101,1,200,40,40,60,12,,12,0,0,0,-1,10234,5,9,99,32,7,1,1,0,0,0
1,2017-01-01 01:00:00,260,20170101,2,200,40,40,70,12,,12,0,0,0,0,10226,3,9,99,34,7,1,0,0,0,0
2,2017-01-01 02:00:00,260,20170101,3,210,40,40,70,13,,12,0,0,0,0,10218,9,9,99,32,7,1,0,0,0,0
3,2017-01-01 03:00:00,260,20170101,4,210,40,40,70,14,,13,0,0,0,0,10210,11,8,98,20,7,1,0,0,0,0
4,2017-01-01 04:00:00,260,20170101,5,210,40,30,70,15,,14,0,0,0,0,10202,18,8,98,10,7,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70123,2024-12-31 19:00:00,260,20241231,20,210,70,80,140,62,,37,0,0,0,0,10182,76,8,84,,5,0,0,0,0,0
70124,2024-12-31 20:00:00,260,20241231,21,210,80,70,160,68,,41,0,0,0,0,10182,81,8,82,,5,0,0,0,0,0
70125,2024-12-31 21:00:00,260,20241231,22,210,80,80,140,68,,42,0,0,0,0,10173,78,8,83,,5,0,0,0,0,0
70126,2024-12-31 22:00:00,260,20241231,23,210,80,90,130,74,,48,0,0,0,0,10169,70,8,83,,5,0,0,0,0,0
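
The date and hour arithmetic can be checked on a few hand-picked rows. The sketch below assumes that the hour column runs from 1 to 24, so the toy data includes an hour 24 to show that it stays on the same calendar day after subtracting one hour slot; `toy` and `day` are illustrative names.

```python
import pandas as pd

# Toy rows: the first, second, and last hour slot of one day.
toy = pd.DataFrame({"Date": [20170101, 20170101, 20170101], "Hour": [1, 2, 24]})

# Parse the integer date with an explicit format, then add the hour offset.
day = pd.to_datetime(toy["Date"].astype(str), format="%Y%m%d")
toy["Datetime"] = day + pd.to_timedelta(toy["Hour"] - 1, unit="h")

print(toy)
```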


Now that the datetime column is created, we can remove the separate date and hour columns.

In [28]:
# Drop Date and Hour columns
df.drop(columns=['Date', 'Hour'],inplace=True)
df

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop(columns=['Date', 'Hour'],inplace=True)


Unnamed: 0,Datetime,Station,WindDirection,WindSpeedAvg60min,WindSpeedAvg10min,WindGust,Temperature,MinTemperature6hour,DewPoint,Sunshineperhour,GlobalRadiation,PrecipitationDuration,HourlyPrecipitationAmount,Pressure,HorizontalVisibility,CloudCover,RelativeAtmosphericHumidity,WeatherCode,IndicatorWeatherCode,Fog,Rain,Snow,Thunder,IceFormation
0,2017-01-01 00:00:00,260,200,40,40,60,12,,12,0,0,0,-1,10234,5,9,99,32,7,1,1,0,0,0
1,2017-01-01 01:00:00,260,200,40,40,70,12,,12,0,0,0,0,10226,3,9,99,34,7,1,0,0,0,0
2,2017-01-01 02:00:00,260,210,40,40,70,13,,12,0,0,0,0,10218,9,9,99,32,7,1,0,0,0,0
3,2017-01-01 03:00:00,260,210,40,40,70,14,,13,0,0,0,0,10210,11,8,98,20,7,1,0,0,0,0
4,2017-01-01 04:00:00,260,210,40,30,70,15,,14,0,0,0,0,10202,18,8,98,10,7,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70123,2024-12-31 19:00:00,260,210,70,80,140,62,,37,0,0,0,0,10182,76,8,84,,5,0,0,0,0,0
70124,2024-12-31 20:00:00,260,210,80,70,160,68,,41,0,0,0,0,10182,81,8,82,,5,0,0,0,0,0
70125,2024-12-31 21:00:00,260,210,80,80,140,68,,42,0,0,0,0,10173,78,8,83,,5,0,0,0,0,0
70126,2024-12-31 22:00:00,260,210,80,90,130,74,,48,0,0,0,0,10169,70,8,83,,5,0,0,0,0,0
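
The warning in the output above appears because `df` was created by selecting columns from another DataFrame, so pandas flags it as a possible copy when it is then modified in place. One common way to avoid this, sketched here on a toy frame (the names `toy` and `reordered` are illustrative), is to take an explicit copy and reassign the result instead of using `inplace=True`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": [20170101, 20170101],
    "Hour": [1, 2],
    "Temperature": [12, 13],
})
toy["Datetime"] = (
    pd.to_datetime(toy["Date"].astype(str), format="%Y%m%d")
    + pd.to_timedelta(toy["Hour"] - 1, unit="h")
)

# .copy() makes the reordered frame independent of `toy`.
reordered = toy[["Datetime", "Date", "Hour", "Temperature"]].copy()

# Reassigning the result of drop() avoids modifying a frame in place.
reordered = reordered.drop(columns=["Date", "Hour"])

print(reordered.columns.tolist())
```

The original `toy` frame keeps its `Date` and `Hour` columns, which shows that the two frames no longer share state.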


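The temperature values are stored in tenths of a degree, so they need to be multiplied by 0.1 to show the actual temperature. The same applies to MinTemperature6hour, where missing values are stored as strings of spaces that must first be replaced with NaN.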

In [29]:
# Convert Temperature to float and scale it by 0.1
df['Temperature'] = df['Temperature'].astype(float) * 0.1

# Replace spaces with NaN
df.replace(to_replace='     ', value=np.nan, inplace=True)
# Convert MinTemperature6hour to float and scale it by 0.1
df['MinTemperature6hour'] = df['MinTemperature6hour'].astype(float) * 0.1

df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Temperature'] = df['Temperature'].astype(float) * 0.1
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.replace(to_replace='     ', value=np.nan, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['MinTemperature6hour'] = df['MinTemperature6hour'].astype(float) * 0.1


Unnamed: 0,Datetime,Station,WindDirection,WindSpeedAvg60min,WindSpeedAvg10min,WindGust,Temperature,MinTemperature6hour,DewPoint,Sunshineperhour,GlobalRadiation,PrecipitationDuration,HourlyPrecipitationAmount,Pressure,HorizontalVisibility,CloudCover,RelativeAtmosphericHumidity,WeatherCode,IndicatorWeatherCode,Fog,Rain,Snow,Thunder,IceFormation
0,2017-01-01 00:00:00,260,200,40,40,60,1.2,,12,0,0,0,-1,10234,5,9,99,32,7,1,1,0,0,0
1,2017-01-01 01:00:00,260,200,40,40,70,1.2,,12,0,0,0,0,10226,3,9,99,34,7,1,0,0,0,0
2,2017-01-01 02:00:00,260,210,40,40,70,1.3,,12,0,0,0,0,10218,9,9,99,32,7,1,0,0,0,0
3,2017-01-01 03:00:00,260,210,40,40,70,1.4,,13,0,0,0,0,10210,11,8,98,20,7,1,0,0,0,0
4,2017-01-01 04:00:00,260,210,40,30,70,1.5,,14,0,0,0,0,10202,18,8,98,10,7,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70123,2024-12-31 19:00:00,260,210,70,80,140,6.2,,37,0,0,0,0,10182,76,8,84,,5,0,0,0,0,0
70124,2024-12-31 20:00:00,260,210,80,70,160,6.8,,41,0,0,0,0,10182,81,8,82,,5,0,0,0,0,0
70125,2024-12-31 21:00:00,260,210,80,80,140,6.8,,42,0,0,0,0,10173,78,8,83,,5,0,0,0,0,0
70126,2024-12-31 22:00:00,260,210,80,90,130,7.4,,48,0,0,0,0,10169,70,8,83,,5,0,0,0,0,0
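
Replacing one exact string of five spaces works only if every blank has that width. A slightly more tolerant variant, shown as a sketch on toy data, strips the whitespace and lets `pd.to_numeric` turn anything non-numeric into NaN. The toy values and the name `toy` are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Temperature": [12, 13, -5],
    "MinTemperature6hour": ["     ", "  ", "   10"],
})

# Integer tenths of a degree become degrees.
toy["Temperature"] = toy["Temperature"] * 0.1

# Strip padding, coerce blanks (of any width) to NaN, then scale.
toy["MinTemperature6hour"] = (
    pd.to_numeric(toy["MinTemperature6hour"].str.strip(), errors="coerce") * 0.1
)

print(toy)
```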


Next, the null values need to be dealt with. First, there needs to be a clear view of how many null values there are and in which columns.

In [30]:
# Calculate the number of missing values for each column
missing_values = df.isnull().sum()

# Calculate the percentage of missing values for each column
missing_percentage = (missing_values / len(df)) * 100

# Create a DataFrame to display the results
missing_data = pd.DataFrame({'Missing Values': missing_values, 'Percentage': missing_percentage})

# Display the missing values analysis
missing_data

Unnamed: 0,Missing Values,Percentage
Datetime,0,0.0
Station,0,0.0
WindDirection,0,0.0
WindSpeedAvg60min,0,0.0
WindSpeedAvg10min,0,0.0
WindGust,0,0.0
Temperature,0,0.0
MinTemperature6hour,58440,83.333333
DewPoint,0,0.0
Sunshineperhour,0,0.0
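
With this many columns, the affected ones are easy to miss in a long table. A small sketch of the same analysis that keeps only the columns with at least one missing value is given below; the toy frame and the name `only_missing` are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Temperature": [1.2, 1.3, 1.4, 1.5],
    "MinTemperature6hour": [np.nan, np.nan, np.nan, 1.0],
    "WeatherCode": [32.0, np.nan, np.nan, np.nan],
})

# isna().mean() gives the share of missing values per column directly.
summary = pd.DataFrame({
    "Missing Values": toy.isna().sum(),
    "Percentage": toy.isna().mean() * 100,
})

# Keep only the columns that actually have gaps, largest share first.
only_missing = summary[summary["Missing Values"] > 0].sort_values(
    "Percentage", ascending=False
)

print(only_missing)
```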


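The analysis shows 2 columns where more than 50% of the values are missing: MinTemperature6hour and WeatherCode, the two columns that disappear from the next output. Since we aren't using these columns and so many values are missing, it doesn't make sense to fill them, so the columns will be dropped. Note that `dropna(axis=1)` removes every column that contains at least one NaN; here those are exactly these two columns.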

In [31]:
# Remove every column that contains at least one NaN value
df = df.dropna(axis=1)

df

Unnamed: 0,Datetime,Station,WindDirection,WindSpeedAvg60min,WindSpeedAvg10min,WindGust,Temperature,DewPoint,Sunshineperhour,GlobalRadiation,PrecipitationDuration,HourlyPrecipitationAmount,Pressure,HorizontalVisibility,CloudCover,RelativeAtmosphericHumidity,IndicatorWeatherCode,Fog,Rain,Snow,Thunder,IceFormation
0,2017-01-01 00:00:00,260,200,40,40,60,1.2,12,0,0,0,-1,10234,5,9,99,7,1,1,0,0,0
1,2017-01-01 01:00:00,260,200,40,40,70,1.2,12,0,0,0,0,10226,3,9,99,7,1,0,0,0,0
2,2017-01-01 02:00:00,260,210,40,40,70,1.3,12,0,0,0,0,10218,9,9,99,7,1,0,0,0,0
3,2017-01-01 03:00:00,260,210,40,40,70,1.4,13,0,0,0,0,10210,11,8,98,7,1,0,0,0,0
4,2017-01-01 04:00:00,260,210,40,30,70,1.5,14,0,0,0,0,10202,18,8,98,7,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70123,2024-12-31 19:00:00,260,210,70,80,140,6.2,37,0,0,0,0,10182,76,8,84,5,0,0,0,0,0
70124,2024-12-31 20:00:00,260,210,80,70,160,6.8,41,0,0,0,0,10182,81,8,82,5,0,0,0,0,0
70125,2024-12-31 21:00:00,260,210,80,80,140,6.8,42,0,0,0,0,10173,78,8,83,5,0,0,0,0,0
70126,2024-12-31 22:00:00,260,210,80,90,130,7.4,48,0,0,0,0,10169,70,8,83,5,0,0,0,0,0
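
If a column with only a handful of gaps should survive, the drop can be tied to the 50% figure mentioned in the text instead of to "any NaN". The following is a sketch of that idea on toy data, not the approach used in this notebook; `threshold`, `missing_share`, and `kept` are illustrative names.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Temperature": [1.2, 1.3, 1.4, 1.5],
    "MinTemperature6hour": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "Pressure": [10234.0, np.nan, 10218.0, 10210.0],       # 25% missing
})

threshold = 0.5  # drop columns with more than half of the values missing

# Share of missing values per column, between 0 and 1.
missing_share = toy.isna().mean()

# Boolean column selection: keep columns at or below the threshold.
kept = toy.loc[:, missing_share <= threshold]

print(kept.columns.tolist())
```

A column kept this way still contains NaN values, so it would need filling or explicit handling before the later steps.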


A quick check confirms that all the NaN values are gone.

In [32]:
# Calculate the number of missing values for each column
df.isnull().sum()

Datetime                       0
Station                        0
WindDirection                  0
WindSpeedAvg60min              0
WindSpeedAvg10min              0
WindGust                       0
Temperature                    0
DewPoint                       0
Sunshineperhour                0
GlobalRadiation                0
PrecipitationDuration          0
HourlyPrecipitationAmount      0
Pressure                       0
HorizontalVisibility           0
CloudCover                     0
RelativeAtmosphericHumidity    0
IndicatorWeatherCode           0
Fog                            0
Rain                           0
Snow                           0
Thunder                        0
IceFormation                   0
dtype: int64
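
The visual check can also be written as an assertion, so that a rerun on new data fails loudly if gaps remain. A minimal sketch, where `assert_no_missing` is a hypothetical helper and not part of pandas:

```python
import numpy as np
import pandas as pd

def assert_no_missing(frame: pd.DataFrame) -> None:
    """Raise an AssertionError naming the columns that still contain NaN values."""
    remaining = frame.columns[frame.isna().any()].tolist()
    assert not remaining, f"Columns with missing values: {remaining}"

clean = pd.DataFrame({"Temperature": [1.2, 1.3], "DewPoint": [12, 12]})
assert_no_missing(clean)  # passes silently

dirty = pd.DataFrame({"Temperature": [1.2, np.nan]})
try:
    assert_no_missing(dirty)
except AssertionError as error:
    print(error)
```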

There are a couple of additions that would be nice to have for EDA: the daily and monthly average temperatures. The next cell calculates the daily average temperature.

In [33]:
# Calculate the daily average temperature
df['Date'] = df['Datetime'].dt.date
daily_avg_temp = df.groupby('Date')['Temperature'].mean().reset_index()

# Rename columns for clarity
daily_avg_temp.columns = ['Date', 'AvgDailyTemperature']

# Merge the daily average temperature back into the original dataframe
df = df.merge(daily_avg_temp, on='Date', how='left')

# Drop the 'Date' column as it is no longer needed
df.drop(columns=['Date'], inplace=True)

df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Date'] = df['Datetime'].dt.date


Unnamed: 0,Datetime,Station,WindDirection,WindSpeedAvg60min,WindSpeedAvg10min,WindGust,Temperature,DewPoint,Sunshineperhour,GlobalRadiation,PrecipitationDuration,HourlyPrecipitationAmount,Pressure,HorizontalVisibility,CloudCover,RelativeAtmosphericHumidity,IndicatorWeatherCode,Fog,Rain,Snow,Thunder,IceFormation,AvgDailyTemperature
0,2017-01-01 00:00:00,260,200,40,40,60,1.2,12,0,0,0,-1,10234,5,9,99,7,1,1,0,0,0,0.520833
1,2017-01-01 01:00:00,260,200,40,40,70,1.2,12,0,0,0,0,10226,3,9,99,7,1,0,0,0,0,0.520833
2,2017-01-01 02:00:00,260,210,40,40,70,1.3,12,0,0,0,0,10218,9,9,99,7,1,0,0,0,0,0.520833
3,2017-01-01 03:00:00,260,210,40,40,70,1.4,13,0,0,0,0,10210,11,8,98,7,1,0,0,0,0,0.520833
4,2017-01-01 04:00:00,260,210,40,30,70,1.5,14,0,0,0,0,10202,18,8,98,7,0,0,0,0,0,0.520833
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70123,2024-12-31 19:00:00,260,210,70,80,140,6.2,37,0,0,0,0,10182,76,8,84,5,0,0,0,0,0,4.625000
70124,2024-12-31 20:00:00,260,210,80,70,160,6.8,41,0,0,0,0,10182,81,8,82,5,0,0,0,0,0,4.625000
70125,2024-12-31 21:00:00,260,210,80,80,140,6.8,42,0,0,0,0,10173,78,8,83,5,0,0,0,0,0,4.625000
70126,2024-12-31 22:00:00,260,210,80,90,130,7.4,48,0,0,0,0,10169,70,8,83,5,0,0,0,0,0,4.625000
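
The group, merge, and drop sequence can also be expressed with `groupby(...).transform`, which returns one value per original row and therefore needs no helper column or merge. This is an alternative sketch on toy data, not the cell used above; `toy` and `day` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2017-01-01 00:00", "2017-01-01 01:00",
        "2017-01-02 00:00", "2017-01-02 01:00",
    ]),
    "Temperature": [1.0, 3.0, 5.0, 7.0],
})

# normalize() sets the time to midnight, giving one key per calendar day.
day = toy["Datetime"].dt.normalize()

# transform("mean") broadcasts each day's mean back to every row of that day.
toy["AvgDailyTemperature"] = toy.groupby(day)["Temperature"].transform("mean")

print(toy)
```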


The monthly average temperature is added in the same way.

In [34]:
# Calculate the monthly average temperature
df.loc[:, 'Month'] = df['Datetime'].dt.to_period('M')
monthly_avg_temp = df.groupby('Month')['Temperature'].mean().reset_index()

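# Merge the monthly average temperature back into the original dataframe.
# Both frames have a 'Temperature' column, so the right-hand suffix names
# the new column 'TemperatureAvgMonthlyTemperature'.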
df = df.merge(monthly_avg_temp, on='Month', suffixes=('', 'AvgMonthlyTemperature'))

# Drop the 'Month' column as it is no longer needed
df.drop(columns=['Month'], inplace=True)

df

Unnamed: 0,Datetime,Station,WindDirection,WindSpeedAvg60min,WindSpeedAvg10min,WindGust,Temperature,DewPoint,Sunshineperhour,GlobalRadiation,PrecipitationDuration,HourlyPrecipitationAmount,Pressure,HorizontalVisibility,CloudCover,RelativeAtmosphericHumidity,IndicatorWeatherCode,Fog,Rain,Snow,Thunder,IceFormation,AvgDailyTemperature,TemperatureAvgMonthlyTemperature
0,2017-01-01 00:00:00,260,200,40,40,60,1.2,12,0,0,0,-1,10234,5,9,99,7,1,1,0,0,0,0.520833,1.565591
1,2017-01-01 01:00:00,260,200,40,40,70,1.2,12,0,0,0,0,10226,3,9,99,7,1,0,0,0,0,0.520833,1.565591
2,2017-01-01 02:00:00,260,210,40,40,70,1.3,12,0,0,0,0,10218,9,9,99,7,1,0,0,0,0,0.520833,1.565591
3,2017-01-01 03:00:00,260,210,40,40,70,1.4,13,0,0,0,0,10210,11,8,98,7,1,0,0,0,0,0.520833,1.565591
4,2017-01-01 04:00:00,260,210,40,30,70,1.5,14,0,0,0,0,10202,18,8,98,7,0,0,0,0,0,0.520833,1.565591
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70123,2024-12-31 19:00:00,260,210,70,80,140,6.2,37,0,0,0,0,10182,76,8,84,5,0,0,0,0,0,4.625000,6.067339
70124,2024-12-31 20:00:00,260,210,80,70,160,6.8,41,0,0,0,0,10182,81,8,82,5,0,0,0,0,0,4.625000,6.067339
70125,2024-12-31 21:00:00,260,210,80,80,140,6.8,42,0,0,0,0,10173,78,8,83,5,0,0,0,0,0,4.625000,6.067339
70126,2024-12-31 22:00:00,260,210,80,90,130,7.4,48,0,0,0,0,10169,70,8,83,5,0,0,0,0,0,4.625000,6.067339
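
The new column gets its name from the merge suffix, which is why it reads `TemperatureAvgMonthlyTemperature`. If a shorter name is preferred, the aggregated series can be named before merging. The sketch below shows this on toy data; the name `AvgMonthlyTemperature` is a suggestion that mirrors `AvgDailyTemperature`, and anything that reads the exported file would have to use the same name.

```python
import pandas as pd

toy = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2017-01-01", "2017-01-15", "2017-02-01", "2017-02-15",
    ]),
    "Temperature": [1.0, 3.0, 5.0, 7.0],
})

# One period value per calendar month, used as the grouping key.
toy["Month"] = toy["Datetime"].dt.to_period("M")

# Name the aggregated series explicitly so no merge suffix is needed.
monthly = (
    toy.groupby("Month")["Temperature"]
    .mean()
    .rename("AvgMonthlyTemperature")
    .reset_index()
)

# A left merge keeps every hourly row; the helper column is dropped afterwards.
toy = toy.merge(monthly, on="Month", how="left").drop(columns=["Month"])

print(toy)
```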


Finally, the cleaned dataframe is exported to a CSV file so it is ready to use in the next steps.

In [35]:
df.to_csv('../Data/Weather/WeatherDataCleaned.csv', index=False)
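
A CSV file stores the timestamps as plain text, so whatever reads the file next has to parse them again. The round trip below is a sketch on toy data; an in-memory buffer stands in for the file path, and `restored` is an illustrative name.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "Datetime": pd.to_datetime(["2017-01-01 00:00", "2017-01-01 01:00"]),
    "Temperature": [1.2, 1.3],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype that the CSV format cannot store.
restored = pd.read_csv(buffer, parse_dates=["Datetime"])

print(restored.dtypes)
```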