# Author Information
* @Eric Alexander Zair
* @Snow Logistic Regression Classifier

## Goal

### The goal of this notebook is to build a logistic regression model on weather data (contained in a CSV file).

### We will use this model to predict whether or not a given weather record has snow.

In [70]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [71]:
# Load in our data.
weather_df = pd.read_csv('weatherHistory.csv')

In [72]:
# Let's see which columns (attributes) we are dealing with.
print(weather_df.columns)

Index(['Formatted Date', 'Summary', 'Precip Type', 'Temperature (C)',
       'Apparent Temperature (C)', 'Humidity', 'Wind Speed (km/h)',
       'Wind Bearing (degrees)', 'Visibility (km)', 'Loud Cover',
       'Pressure (millibars)', 'Daily Summary'],
      dtype='object')


In [73]:
# Let's take a better look at the data...
print(weather_df.head())


                  Formatted Date        Summary Precip Type  Temperature (C)  \
0  2006-04-01 00:00:00.000 +0200  Partly Cloudy        rain         9.472222   
1  2006-04-01 01:00:00.000 +0200  Partly Cloudy        rain         9.355556   
2  2006-04-01 02:00:00.000 +0200  Mostly Cloudy        rain         9.377778   
3  2006-04-01 03:00:00.000 +0200  Partly Cloudy        rain         8.288889   
4  2006-04-01 04:00:00.000 +0200  Mostly Cloudy        rain         8.755556   

   Apparent Temperature (C)  Humidity  Wind Speed (km/h)  \
0                  7.388889      0.89            14.1197   
1                  7.227778      0.86            14.2646   
2                  9.377778      0.89             3.9284   
3                  5.944444      0.83            14.1036   
4                  6.977778      0.83            11.0446   

   Wind Bearing (degrees)  Visibility (km)  Loud Cover  Pressure (millibars)  \
0                   251.0          15.8263         0.0               1015.13  

In [74]:
# Drop any rows that contain NaN values so that missing data does not cause issues later.
weather_df = weather_df.dropna(axis='rows')
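
Before dropping rows it can help to see where the missing values actually are. The sketch below is only an illustration on a tiny made-up frame (`toy_df` and its values are assumptions, not the real weather data); it counts NaN values per column and reports how many rows `dropna` removes.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame (hypothetical values) mimicking two of the weather columns.
toy_df = pd.DataFrame({
    "Precip Type": ["rain", np.nan, "snow", "rain"],
    "Temperature (C)": [9.4, 3.1, -2.0, 12.5],
})

# Count missing values per column before dropping anything.
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# Drop incomplete rows and report how many were removed.
cleaned_df = toy_df.dropna(axis="rows")
rows_dropped = len(toy_df) - len(cleaned_df)
print(f"Rows dropped: {rows_dropped}")
```

Running the same two checks on `weather_df` would show which columns are responsible for the dropped rows.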

In [75]:
# Our data has a column called "Precip Type". It records the precipitation type for each record, e.g. rain or snow. This will be our target value later.
print(weather_df['Precip Type'])


0        rain
1        rain
2        rain
3        rain
4        rain
         ... 
96448    rain
96449    rain
96450    rain
96451    rain
96452    rain
Name: Precip Type, Length: 95936, dtype: object


In [76]:
# First, let's look at the data and see how often it snows vs. how often it does not snow.
# Note: each row is an hourly record (see "Formatted Date"), so the "days" counted here are really hourly observations.
total_days = weather_df['Precip Type'].count()
count_of_snowy_days = len(weather_df[weather_df['Precip Type'] == 'snow'])
count_of_all_other_days = total_days - count_of_snowy_days

print(f"Number of snowy days: {count_of_snowy_days}")
print(f"Number of all other days: {count_of_all_other_days}")

print(f"Percentage of snowy days: {count_of_snowy_days / total_days * 100}")
print(f"Percentage of all other days: {count_of_all_other_days / total_days * 100}")

Number of snowy days: 10712
Number of all other days: 85224
Percentage of snowy days: 11.16577718478986
Percentage of all other days: 88.83422281521014
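
The same counts and percentages can be obtained more compactly with `value_counts`. This is a minimal sketch on a hypothetical stand-in series (`precip` and its values are made up), not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical stand-in for weather_df['Precip Type'].
precip = pd.Series(["rain", "rain", "snow", "rain"], name="Precip Type")

# Absolute counts per category.
counts = precip.value_counts()

# Relative frequencies, converted to percentages.
shares = precip.value_counts(normalize=True) * 100

print(counts)
print(shares.round(2))
```

On the real data this would be `weather_df['Precip Type'].value_counts(normalize=True)`.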


In [77]:
# We are going to use "Precip Type" as our target value, but first we need to change the layout of the data.
# Currently "Precip Type" holds categorical data, so we need to convert the column into integer-based data.
# We will use 1 to represent snowing, 0 for anything else.
# Comparing against 'snow' maps snow to 1 and every other value to 0.
y = (weather_df['Precip Type'] == 'snow').astype(int)

# Just to confirm that replacing these values worked correctly...
print(y[y == 1])
print(y[y == 0])

1562     1
1563     1
1564     1
1565     1
1566     1
        ..
93265    1
93266    1
93267    1
93311    1
93506    1
Name: Precip Type, Length: 10712, dtype: int64
0        0
1        0
2        0
3        0
4        0
        ..
96448    0
96449    0
96450    0
96451    0
96452    0
Name: Precip Type, Length: 85224, dtype: int64


In [78]:
# Let's figure out which features we should use for our X matrix.
print(weather_df.head(10))

                  Formatted Date        Summary Precip Type  Temperature (C)  \
0  2006-04-01 00:00:00.000 +0200  Partly Cloudy        rain         9.472222   
1  2006-04-01 01:00:00.000 +0200  Partly Cloudy        rain         9.355556   
2  2006-04-01 02:00:00.000 +0200  Mostly Cloudy        rain         9.377778   
3  2006-04-01 03:00:00.000 +0200  Partly Cloudy        rain         8.288889   
4  2006-04-01 04:00:00.000 +0200  Mostly Cloudy        rain         8.755556   
5  2006-04-01 05:00:00.000 +0200  Partly Cloudy        rain         9.222222   
6  2006-04-01 06:00:00.000 +0200  Partly Cloudy        rain         7.733333   
7  2006-04-01 07:00:00.000 +0200  Partly Cloudy        rain         8.772222   
8  2006-04-01 08:00:00.000 +0200  Partly Cloudy        rain        10.822222   
9  2006-04-01 09:00:00.000 +0200  Partly Cloudy        rain        13.772222   

   Apparent Temperature (C)  Humidity  Wind Speed (km/h)  \
0                  7.388889      0.89            14.1197   

In [79]:
X = weather_df

# Looking at the data:
#   We can discard Daily Summary, as it is basically a repeat of Summary.
#   We can drop Formatted Date, as that will not help us predict weather.
#   We need to drop Precip Type from the features, since it is our target value.
#   For now we will drop Summary. We will later add it back and test with it.
X = X.drop(columns=['Daily Summary', 'Formatted Date', 'Summary', 'Precip Type'])

In [80]:
# Our feature matrix...
print(X.head(5))

   Temperature (C)  Apparent Temperature (C)  Humidity  Wind Speed (km/h)  \
0         9.472222                  7.388889      0.89            14.1197   
1         9.355556                  7.227778      0.86            14.2646   
2         9.377778                  9.377778      0.89             3.9284   
3         8.288889                  5.944444      0.83            14.1036   
4         8.755556                  6.977778      0.83            11.0446   

   Wind Bearing (degrees)  Visibility (km)  Loud Cover  Pressure (millibars)  
0                   251.0          15.8263         0.0               1015.13  
1                   259.0          15.8263         0.0               1015.63  
2                   204.0          14.9569         0.0               1015.94  
3                   269.0          15.8263         0.0               1016.41  
4                   259.0          15.8263         0.0               1016.51  
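
In the rows printed above, "Loud Cover" is 0.0 every time. A column that never changes carries no information for the model, so it may be worth checking for constant columns. The sketch below uses a small made-up frame (`toy_X` is an assumption); whether "Loud Cover" is constant across the full dataset would need to be checked on `X` itself.

```python
import pandas as pd

# Hypothetical stand-in for the feature matrix X.
toy_X = pd.DataFrame({
    "Humidity": [0.89, 0.86, 0.89, 0.83],
    "Loud Cover": [0.0, 0.0, 0.0, 0.0],
})

# Number of distinct values in each column.
unique_counts = toy_X.nunique()

# Columns with a single distinct value are constant.
constant_columns = unique_counts[unique_counts <= 1].index.tolist()
print(f"Constant columns: {constant_columns}")
```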


In [81]:
# How many data records do we actually have?
print(f"Number of data records in our population data set: {y.count()}")

Number of data records in our population data set: 95936


In [82]:
# Before we build our model, let's first split the data into training and testing.
# This method should be rather effective, since we have a large amount of data to play with. 
from sklearn.model_selection import train_test_split

# Our test size will be 20 percent of our population set of data.
# train_test_split shuffles the data by default; random_state=42 makes that shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# This should be a reasonable amount of data.
print(f"Number of test examples: {len(y_test)}")
print(f"Number of training examples: {len(y_train)}")


Number of test examples: 19188
Number of training examples: 76748
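
Because only about 11 percent of the records are snow, one option is to stratify the split so that the training and test sets keep the same class balance. The sketch below is an illustration on synthetic data (`X_demo` and `y_demo` are made up, with a positive share chosen to resemble the notebook's); it is not the split used above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the notebook's class balance (about 11% positives).
rng = np.random.default_rng(42)
y_demo = (rng.random(10_000) < 0.11).astype(int)
X_demo = rng.normal(size=(10_000, 3))

# stratify=y_demo keeps the share of positives the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Positive share (train): {y_tr.mean():.4f}")
print(f"Positive share (test):  {y_te.mean():.4f}")
```

In the notebook this would amount to passing `stratify=y` to the existing `train_test_split` call.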


In [91]:
# Alright, let's run our basic Logistic Regression model and see what we get.
from sklearn.linear_model import LogisticRegression

# With training data...
basic_model = LogisticRegression(random_state=42)
basic_model.fit(X_train, y_train)
training_score = basic_model.score(X_train, y_train)
testing_score = basic_model.score(X_test, y_test)

print(f"Training Score: {training_score}")
print(f"Testing Score: {testing_score}")

Training Score: 0.9983452337520196
Testing Score: 0.9985407546383156


In [101]:
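# Training and testing accuracy are nearly identical, so there is no sign of overfitting; both scores are also high, so the model does not appear to be underfitting.
# Interestingly enough though, our testing score is slightly higher than our training score.
# The scores are fractions between 0 and 1, so we multiply the difference by 100 to report percentage points.
print(f"Testing is {(testing_score - training_score) * 100:.4f} percentage points better.")

Testing is 0.0196 percentage points better.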


In [None]:
# For the sake of curiosity, I am going to run 5-fold cross-validation on the population dataset just to see what we end up getting.
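
One way the 5-fold cross-validation could look is sketched below. It runs on synthetic data (`X_demo` and `y_demo` are made up), and the `StandardScaler` step and stratified folds are assumptions added for illustration, not something the notebook has done so far.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and an imbalanced target driven by the first feature plus noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(2_000, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=2_000) < -1.0).astype(int)

# Scaling inside the pipeline means each fold is scaled using only its own training part.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)

# Stratified folds keep the class balance similar across the 5 folds.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=folds, scoring="accuracy")

print(f"Fold accuracies: {scores}")
print(f"Mean accuracy: {scores.mean():.4f} (std {scores.std():.4f})")
```

With the notebook's data, `X_demo` and `y_demo` would be replaced by `X` and `y`.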

In [None]:
# I would also like to rerun the model with the "Summary" column added to the X matrix. To do this, we will have to one-hot encode "Summary" and then add the resulting columns to our matrix.
# ...more on this later...
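
A minimal sketch of the one-hot encoding step, assuming `pd.get_dummies` is used (the notebook has not chosen a method yet). The frame `toy_df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the weather data.
toy_df = pd.DataFrame({
    "Summary": ["Partly Cloudy", "Mostly Cloudy", "Partly Cloudy", "Overcast"],
    "Humidity": [0.89, 0.86, 0.89, 0.83],
})

# One indicator column per distinct Summary value.
summary_dummies = pd.get_dummies(toy_df["Summary"], prefix="Summary", dtype=int)

# Replace the text column with its indicator columns.
X_with_summary = pd.concat(
    [toy_df.drop(columns=["Summary"]), summary_dummies], axis=1
)
print(X_with_summary)
```

If the encoding is done after the train/test split, scikit-learn's `OneHotEncoder(handle_unknown="ignore")` is an alternative that copes with categories appearing only in the test set.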