## ☁️ Daily Visibility Prediction

Given *hourly weather data for Szeged, Hungary from 2006 to 2016*, let's try to predict the **visibility** (in km) on a given day at a given hour.

We will use Linear Regression, Decision Tree Regression, and K-Nearest Neighbors Regression to make our predictions.

Data source: https://www.kaggle.com/datasets/budincsevity/szeged-weather

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

In [2]:
data = pd.read_csv('weatherHistory.csv')
data

Unnamed: 0,Formatted Date,Summary,Precip Type,Temperature (C),Apparent Temperature (C),Humidity,Wind Speed (km/h),Wind Bearing (degrees),Visibility (km),Loud Cover,Pressure (millibars),Daily Summary
0,2006-04-01 00:00:00.000 +0200,Partly Cloudy,rain,9.472222,7.388889,0.89,14.1197,251.0,15.8263,0.0,1015.13,Partly cloudy throughout the day.
1,2006-04-01 01:00:00.000 +0200,Partly Cloudy,rain,9.355556,7.227778,0.86,14.2646,259.0,15.8263,0.0,1015.63,Partly cloudy throughout the day.
2,2006-04-01 02:00:00.000 +0200,Mostly Cloudy,rain,9.377778,9.377778,0.89,3.9284,204.0,14.9569,0.0,1015.94,Partly cloudy throughout the day.
3,2006-04-01 03:00:00.000 +0200,Partly Cloudy,rain,8.288889,5.944444,0.83,14.1036,269.0,15.8263,0.0,1016.41,Partly cloudy throughout the day.
4,2006-04-01 04:00:00.000 +0200,Mostly Cloudy,rain,8.755556,6.977778,0.83,11.0446,259.0,15.8263,0.0,1016.51,Partly cloudy throughout the day.
...,...,...,...,...,...,...,...,...,...,...,...,...
96448,2016-09-09 19:00:00.000 +0200,Partly Cloudy,rain,26.016667,26.016667,0.43,10.9963,31.0,16.1000,0.0,1014.36,Partly cloudy starting in the morning.
96449,2016-09-09 20:00:00.000 +0200,Partly Cloudy,rain,24.583333,24.583333,0.48,10.0947,20.0,15.5526,0.0,1015.16,Partly cloudy starting in the morning.
96450,2016-09-09 21:00:00.000 +0200,Partly Cloudy,rain,22.038889,22.038889,0.56,8.9838,30.0,16.1000,0.0,1015.66,Partly cloudy starting in the morning.
96451,2016-09-09 22:00:00.000 +0200,Partly Cloudy,rain,21.522222,21.522222,0.60,10.5294,20.0,16.1000,0.0,1015.95,Partly cloudy starting in the morning.


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96453 entries, 0 to 96452
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Formatted Date            96453 non-null  object 
 1   Summary                   96453 non-null  object 
 2   Precip Type               95936 non-null  object 
 3   Temperature (C)           96453 non-null  float64
 4   Apparent Temperature (C)  96453 non-null  float64
 5   Humidity                  96453 non-null  float64
 6   Wind Speed (km/h)         96453 non-null  float64
 7   Wind Bearing (degrees)    96453 non-null  float64
 8   Visibility (km)           96453 non-null  float64
 9   Loud Cover                96453 non-null  float64
 10  Pressure (millibars)      96453 non-null  float64
 11  Daily Summary             96453 non-null  object 
dtypes: float64(8), object(4)
memory usage: 8.8+ MB


### Preprocessing

In [4]:
df = data.copy()

In [5]:
# Drop Summary and Daily Summary columns
df = df.drop(['Summary', 'Daily Summary'], axis=1)
df

Unnamed: 0,Formatted Date,Precip Type,Temperature (C),Apparent Temperature (C),Humidity,Wind Speed (km/h),Wind Bearing (degrees),Visibility (km),Loud Cover,Pressure (millibars)
0,2006-04-01 00:00:00.000 +0200,rain,9.472222,7.388889,0.89,14.1197,251.0,15.8263,0.0,1015.13
1,2006-04-01 01:00:00.000 +0200,rain,9.355556,7.227778,0.86,14.2646,259.0,15.8263,0.0,1015.63
2,2006-04-01 02:00:00.000 +0200,rain,9.377778,9.377778,0.89,3.9284,204.0,14.9569,0.0,1015.94
3,2006-04-01 03:00:00.000 +0200,rain,8.288889,5.944444,0.83,14.1036,269.0,15.8263,0.0,1016.41
4,2006-04-01 04:00:00.000 +0200,rain,8.755556,6.977778,0.83,11.0446,259.0,15.8263,0.0,1016.51
...,...,...,...,...,...,...,...,...,...,...
96448,2016-09-09 19:00:00.000 +0200,rain,26.016667,26.016667,0.43,10.9963,31.0,16.1000,0.0,1014.36
96449,2016-09-09 20:00:00.000 +0200,rain,24.583333,24.583333,0.48,10.0947,20.0,15.5526,0.0,1015.16
96450,2016-09-09 21:00:00.000 +0200,rain,22.038889,22.038889,0.56,8.9838,30.0,16.1000,0.0,1015.66
96451,2016-09-09 22:00:00.000 +0200,rain,21.522222,21.522222,0.60,10.5294,20.0,16.1000,0.0,1015.95


In [6]:
df.isna().sum()

Formatted Date                0
Precip Type                 517
Temperature (C)               0
Apparent Temperature (C)      0
Humidity                      0
Wind Speed (km/h)             0
Wind Bearing (degrees)        0
Visibility (km)               0
Loud Cover                    0
Pressure (millibars)          0
dtype: int64

In [7]:
df['Precip Type'].unique()

array(['rain', 'snow', nan], dtype=object)

In [9]:
df['Precip Type'].mode()[0]

'rain'

In [10]:
# Fill missing values in Precip Type column with the most frequent value (the mode, 'rain')
df['Precip Type'] = df['Precip Type'].fillna(df['Precip Type'].mode()[0])

In [11]:
df.isna().sum().sum()

0

In [12]:
df

Unnamed: 0,Formatted Date,Precip Type,Temperature (C),Apparent Temperature (C),Humidity,Wind Speed (km/h),Wind Bearing (degrees),Visibility (km),Loud Cover,Pressure (millibars)
0,2006-04-01 00:00:00.000 +0200,rain,9.472222,7.388889,0.89,14.1197,251.0,15.8263,0.0,1015.13
1,2006-04-01 01:00:00.000 +0200,rain,9.355556,7.227778,0.86,14.2646,259.0,15.8263,0.0,1015.63
2,2006-04-01 02:00:00.000 +0200,rain,9.377778,9.377778,0.89,3.9284,204.0,14.9569,0.0,1015.94
3,2006-04-01 03:00:00.000 +0200,rain,8.288889,5.944444,0.83,14.1036,269.0,15.8263,0.0,1016.41
4,2006-04-01 04:00:00.000 +0200,rain,8.755556,6.977778,0.83,11.0446,259.0,15.8263,0.0,1016.51
...,...,...,...,...,...,...,...,...,...,...
96448,2016-09-09 19:00:00.000 +0200,rain,26.016667,26.016667,0.43,10.9963,31.0,16.1000,0.0,1014.36
96449,2016-09-09 20:00:00.000 +0200,rain,24.583333,24.583333,0.48,10.0947,20.0,15.5526,0.0,1015.16
96450,2016-09-09 21:00:00.000 +0200,rain,22.038889,22.038889,0.56,8.9838,30.0,16.1000,0.0,1015.66
96451,2016-09-09 22:00:00.000 +0200,rain,21.522222,21.522222,0.60,10.5294,20.0,16.1000,0.0,1015.95


In [15]:
# Parse the Formatted Date column into datetimes so date/time features can be extracted from it
df['Formatted Date'] = pd.to_datetime(df['Formatted Date'], format='%Y-%m-%d %H:%M:%S.%f %z')

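The timestamps in this column carry a UTC offset (`+0200` in the rows shown). Assuming the winter rows use a different offset such as `+0100`, parsing them directly yields mixed offsets in one column. A possible alternative, sketched here on two toy timestamps rather than the real file, is to parse everything to UTC and then convert to a local zone. The zone name `Europe/Budapest` is an assumption, not something stated in the dataset.

```python
import pandas as pd

# Two toy timestamps in the dataset's format with different UTC offsets
# (the +0100 winter offset is an assumption for illustration).
raw = pd.Series([
    "2006-04-01 00:00:00.000 +0200",
    "2006-12-01 00:00:00.000 +0100",
])

# Parse to a single UTC-based dtype, then convert to local time
# (assumed zone: Europe/Budapest).
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S.%f %z", utc=True)
local = parsed.dt.tz_convert("Europe/Budapest")

# With a proper datetime dtype, the vectorized .dt accessor replaces apply().
features = pd.DataFrame({
    "Year": local.dt.year,
    "Month": local.dt.month,
    "Day": local.dt.day,
    "Hour": local.dt.hour,
})
print(features)
```

Converting back to local time keeps `Hour` aligned with the wall-clock hour in the original strings, which is what the hour feature is meant to capture.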


In [16]:
df['Year'] = df['Formatted Date'].apply(lambda x: x.year)
df['Month'] = df['Formatted Date'].apply(lambda x: x.month)
df['Day'] = df['Formatted Date'].apply(lambda x: x.day)
df['Hour'] = df['Formatted Date'].apply(lambda x: x.hour)
df = df.drop('Formatted Date', axis = 1)
df

Unnamed: 0,Precip Type,Temperature (C),Apparent Temperature (C),Humidity,Wind Speed (km/h),Wind Bearing (degrees),Visibility (km),Loud Cover,Pressure (millibars),Year,Month,Day,Hour
0,rain,9.472222,7.388889,0.89,14.1197,251.0,15.8263,0.0,1015.13,2006,4,1,0
1,rain,9.355556,7.227778,0.86,14.2646,259.0,15.8263,0.0,1015.63,2006,4,1,1
2,rain,9.377778,9.377778,0.89,3.9284,204.0,14.9569,0.0,1015.94,2006,4,1,2
3,rain,8.288889,5.944444,0.83,14.1036,269.0,15.8263,0.0,1016.41,2006,4,1,3
4,rain,8.755556,6.977778,0.83,11.0446,259.0,15.8263,0.0,1016.51,2006,4,1,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...
96448,rain,26.016667,26.016667,0.43,10.9963,31.0,16.1000,0.0,1014.36,2016,9,9,19
96449,rain,24.583333,24.583333,0.48,10.0947,20.0,15.5526,0.0,1015.16,2016,9,9,20
96450,rain,22.038889,22.038889,0.56,8.9838,30.0,16.1000,0.0,1015.66,2016,9,9,21
96451,rain,21.522222,21.522222,0.60,10.5294,20.0,16.1000,0.0,1015.95,2016,9,9,22


In [17]:
df['Precip Type'].value_counts()

Precip Type
rain    85741
snow    10712
Name: count, dtype: int64

In [18]:
# Binary-encode the Precip Type column (snow = 1, rain = 0)
df['Precip Type'] = df['Precip Type'].apply(lambda x: 1 if x == 'snow' else 0)

In [19]:
df

Unnamed: 0,Precip Type,Temperature (C),Apparent Temperature (C),Humidity,Wind Speed (km/h),Wind Bearing (degrees),Visibility (km),Loud Cover,Pressure (millibars),Year,Month,Day,Hour
0,0,9.472222,7.388889,0.89,14.1197,251.0,15.8263,0.0,1015.13,2006,4,1,0
1,0,9.355556,7.227778,0.86,14.2646,259.0,15.8263,0.0,1015.63,2006,4,1,1
2,0,9.377778,9.377778,0.89,3.9284,204.0,14.9569,0.0,1015.94,2006,4,1,2
3,0,8.288889,5.944444,0.83,14.1036,269.0,15.8263,0.0,1016.41,2006,4,1,3
4,0,8.755556,6.977778,0.83,11.0446,259.0,15.8263,0.0,1016.51,2006,4,1,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...
96448,0,26.016667,26.016667,0.43,10.9963,31.0,16.1000,0.0,1014.36,2016,9,9,19
96449,0,24.583333,24.583333,0.48,10.0947,20.0,15.5526,0.0,1015.16,2016,9,9,20
96450,0,22.038889,22.038889,0.56,8.9838,30.0,16.1000,0.0,1015.66,2016,9,9,21
96451,0,21.522222,21.522222,0.60,10.5294,20.0,16.1000,0.0,1015.95,2016,9,9,22


In [20]:
{column: len(df[column].unique()) for column in df.columns}

{'Precip Type': 2,
 'Temperature (C)': 7574,
 'Apparent Temperature (C)': 8984,
 'Humidity': 90,
 'Wind Speed (km/h)': 2484,
 'Wind Bearing (degrees)': 360,
 'Visibility (km)': 949,
 'Loud Cover': 1,
 'Pressure (millibars)': 4979,
 'Year': 11,
 'Month': 12,
 'Day': 31,
 'Hour': 24}

In [21]:
# Drop Loud Cover column (it has a single unique value, so it carries no information)
df = df.drop('Loud Cover', axis=1)
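`Loud Cover` is dropped because the unique-value counts show it takes only one value. As a small sketch of how that check could be automated, the snippet below finds constant columns in a toy frame (the toy data and the name `constant_cols` are made up for illustration) and drops them.

```python
import pandas as pd

# Toy frame standing in for the weather data; values are illustrative only.
toy = pd.DataFrame({
    "Humidity": [0.89, 0.86, 0.83],
    "Loud Cover": [0.0, 0.0, 0.0],  # constant column
})

# A column with a single distinct value cannot help a model tell rows apart.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
toy = toy.drop(columns=constant_cols)

print(constant_cols)
print(toy.columns.tolist())
```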

In [23]:
# Split df into X and y
y = df['Visibility (km)'].copy()
X = df.drop('Visibility (km)', axis=1).copy()

In [26]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=123)

In [27]:
X_train.shape, X_test.shape

((67517, 11), (28936, 11))

In [28]:
# Scale X with a standard scaler (mean of 0 and variance of 1), fitted on the training set only
scaler = StandardScaler()
scaler.fit(X_train)

X_train = pd.DataFrame(scaler.transform(X_train), index=X_train.index, columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), index=X_test.index, columns=X_test.columns)

In [29]:
X_train

Unnamed: 0,Precip Type,Temperature (C),Apparent Temperature (C),Humidity,Wind Speed (km/h),Wind Bearing (degrees),Pressure (millibars),Year,Month,Day,Hour
13596,-0.355013,1.627626,1.501724,-2.024300,1.461847,1.055230,0.006060,-1.269615,-0.151306,1.168975,0.073065
79326,-0.355013,0.441592,0.495554,-0.590020,-0.428029,0.393836,0.124929,1.264744,-0.730792,1.055362,-0.359778
72378,2.816801,-1.472340,-1.832447,0.075895,2.182800,-0.360712,0.170144,0.947949,-1.310278,-1.671337,0.938750
13895,-0.355013,-0.328167,-0.300348,0.332017,-0.425696,0.095743,0.033564,-1.269615,-1.020535,-1.671337,1.660155
37609,2.816801,-1.261296,-1.354283,0.793035,-0.005723,0.365890,0.060556,-0.319230,-1.310278,0.600913,-1.514026
...,...,...,...,...,...,...,...,...,...,...,...
63206,-0.355013,-0.529909,-0.594723,0.075895,0.206596,-0.193034,0.233154,0.631154,1.587150,0.714525,0.361627
61404,-0.355013,-0.613047,-0.693367,-0.487572,0.290591,0.943445,0.059364,0.631154,-0.730792,-1.671337,0.073065
17730,-0.355013,0.033457,0.131090,-0.641245,2.565441,0.952761,0.068560,-0.952820,-0.730792,0.032850,0.938750
28030,2.816801,-1.882801,-1.882288,0.536914,-0.598351,0.589460,0.047273,-0.636025,1.587150,0.260075,1.515874


In [30]:
X_train.mean()

Precip Type                -3.283459e-17
Temperature (C)             1.452299e-16
Apparent Temperature (C)    6.630062e-17
Humidity                   -6.640586e-17
Wind Speed (km/h)          -1.682773e-16
Wind Bearing (degrees)      5.661863e-17
Pressure (millibars)        1.005770e-15
Year                        2.217808e-14
Month                       8.955846e-17
Day                         7.361474e-17
Hour                        1.241821e-16
dtype: float64

In [31]:
X_train.var()

Precip Type                 1.000015
Temperature (C)             1.000015
Apparent Temperature (C)    1.000015
Humidity                    1.000015
Wind Speed (km/h)           1.000015
Wind Bearing (degrees)      1.000015
Pressure (millibars)        1.000015
Year                        1.000015
Month                       1.000015
Day                         1.000015
Hour                        1.000015
dtype: float64
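The variances print as 1.000015 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `.var()` defaults to the sample variance (`ddof=1`). With n = 67517 training rows the ratio is n / (n - 1) ≈ 1.000015. The sketch below reproduces this on synthetic data of the same length; the random column `x` is a stand-in, not the real features.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

n = 67517  # number of training rows in the split above
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.normal(10.0, 3.0, size=n)})  # synthetic feature

scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

pop_var = scaled["x"].var(ddof=0)  # population variance, what the scaler normalizes
sample_var = scaled["x"].var()     # pandas default (ddof=1)

print(pop_var, sample_var, n / (n - 1))
```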

In [32]:
y_train

13596     9.9820
79326    16.1000
72378    15.9229
13895    14.7154
37609    14.9569
          ...   
63206     6.9069
61404    11.2700
17730    11.2056
28030     6.3434
15725     5.8926
Name: Visibility (km), Length: 67517, dtype: float64

### Training

In [33]:
models = {
    "  Linear Regression": LinearRegression(),
    "      Decision Tree": DecisionTreeRegressor(),
    "K-Nearest Neighbors": KNeighborsRegressor()
}

In [34]:
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name + " trained.")

  Linear Regression trained.
      Decision Tree trained.
K-Nearest Neighbors trained.


### Results

In [35]:
for name, model in models.items():
    print(name + " R^2: {:.5f}".format(model.score(X_test, y_test)))

  Linear Regression R^2: 0.24017
      Decision Tree R^2: 0.50592
K-Nearest Neighbors R^2: 0.68630
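For regressors, `model.score` returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot, so the linear model above explains roughly 24% of the variance in test-set visibility and K-Nearest Neighbors roughly 69%. As a minimal sketch on synthetic data (not the weather data), the snippet below computes R^2 by hand and compares it with `score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression problem, for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

# Fit on the first 150 rows, evaluate on the remaining 50.
model = LinearRegression().fit(X_toy[:150], y_toy[:150])
y_true = y_toy[150:]
y_pred = model.predict(X_toy[150:])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = model.score(X_toy[150:], y_true)

print(r2_manual, r2_sklearn)
```

An R^2 of 0 corresponds to always predicting the test-set mean, and negative values are possible for models that do worse than that baseline.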
