# Where we left off

After determining that the k-nearest neighbors method wasn't very effective, I decided to rework my approach and explore a different method.

This time I wanted to incorporate more of the variables provided, so I changed the variable I wanted to predict. Now I will try to predict the season from values that someone could easily get from the daily weather report.

So, let's begin the process.

In [63]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import MultinomialNB, CategoricalNB

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

In [95]:
df = pd.read_csv('arizonaWeather.csv', sep=',', index_col=False, header=None, names=["Date", "Precipitation", "PanEvaporation", "MeanTemp", "MeanWindSpeed", "SolarRadiation", "FAOShortGrassEto", "DaylightStationPressure", "DaylightHumidity", "SkyCoverage", "DaylightTemp", "DaylightBroadbandAerosol", "WindSpeed", "WindDirection"])

print(df.shape)
df.head()


(10957, 14)


Unnamed: 0,Date,Precipitation,PanEvaporation,MeanTemp,MeanWindSpeed,SolarRadiation,FAOShortGrassEto,DaylightStationPressure,DaylightHumidity,SkyCoverage,DaylightTemp,DaylightBroadbandAerosol,WindSpeed,WindDirection
0,10161,0.0,0.22,8.1,118.5,294.8,1.5,97.8,45,0,12.1,0.117,0.4,0
1,10261,0.0,0.25,8.0,121.8,295.4,1.7,98.0,35,0,12.4,0.092,0.0,0
2,10361,0.0,0.32,9.6,171.0,297.9,2.2,98.3,30,0,14.0,0.092,1.8,50
3,10461,0.0,0.27,10.1,110.1,256.1,1.6,98.2,32,2,14.3,0.128,0.7,0
4,10561,0.0,0.35,12.2,167.3,271.0,2.3,97.7,32,5,15.9,0.055,3.3,80


## Data Munging

As we can see, quite a few of these values are not commonly known or available to the average person. To make the model more applicable, we will remove those variables for now and reconsider them later.

The variables we will remove are Pan Evaporation, Solar Radiation, FAO Short Grass Eto, Daylight Station Pressure, and Daylight Broadband Aerosol.

Every other variable should be relatively easy to obtain from a simple web search of the weather for the day.

In [96]:
df = df.drop(['PanEvaporation','SolarRadiation','FAOShortGrassEto','DaylightStationPressure','DaylightBroadbandAerosol'], axis = 1)
print(df.shape)
df.head()

(10957, 9)


Unnamed: 0,Date,Precipitation,MeanTemp,MeanWindSpeed,DaylightHumidity,SkyCoverage,DaylightTemp,WindSpeed,WindDirection
0,10161,0.0,8.1,118.5,45,0,12.1,0.4,0
1,10261,0.0,8.0,121.8,35,0,12.4,0.0,0
2,10361,0.0,9.6,171.0,30,0,14.0,1.8,50
3,10461,0.0,10.1,110.1,32,2,14.3,0.7,0
4,10561,0.0,12.2,167.3,32,5,15.9,3.3,80


Additionally, since we are predicting the season for a given day in Phoenix, we need to make sure the computer knows what season each observation was taken in.

Since the data set only provides the date of each observation, we will have to derive the season from that value.

In [98]:
# Convert the date to a string for easier parsing
df.dtypes
dateType = {'Date': str}
df = df.astype(dateType)

# Create an index to allow the machine to better understand the progression of time
# This will allow it to account for potential rises in temperature year over year
DayIndex = list(range(1,len(df)+1))
df['DayIndex'] = DayIndex

# Isolate the month value so that we can determine season
Month = list(range(1,len(df)+1))
df['Month'] = Month
        
for i in range(0, len(df)):
    dateStr = df.at[i, 'Date']
    if (len(dateStr) == 5):
        df.at[i, 'Month'] = int(dateStr[0])
    else:
        df.at[i, 'Month'] = int(dateStr[0]+dateStr[1])

# I elected to base season off of the meteorological standard (i.e. seasons start on the first of their equinox or solstice months)
# Start with an empty placeholder column; the loop below fills in every row
df['Season'] = ""

for i in range(0, len(df)):
    #dateDay = df.at[i, 'Day']
    dateMonth = df.at[i, 'Month']
    if (dateMonth == 12 or dateMonth < 3):
        df.at[i, 'Season'] = "Winter"
    elif (dateMonth >=3 and dateMonth < 6):
        df.at[i, 'Season'] = "Spring"
    elif (dateMonth >=6 and dateMonth < 9):
        df.at[i, 'Season'] = "Summer"
    else:
        df.at[i, 'Season'] = "Autumn"

# Sanity check
df.head()

Unnamed: 0,Date,Precipitation,MeanTemp,MeanWindSpeed,DaylightHumidity,SkyCoverage,DaylightTemp,WindSpeed,WindDirection,DayIndex,Month,Season
0,10161,0.0,8.1,118.5,45,0,12.1,0.4,0,1,1,Winter
1,10261,0.0,8.0,121.8,35,0,12.4,0.0,0,2,1,Winter
2,10361,0.0,9.6,171.0,30,0,14.0,1.8,50,3,1,Winter
3,10461,0.0,10.1,110.1,32,2,14.3,0.7,0,4,1,Winter
4,10561,0.0,12.2,167.3,32,5,15.9,3.3,80,5,1,Winter


<br>
Let's also do a quick check of the seasonal composition to make sure the classes are relatively even.

<br>

In [99]:
df['Season'].value_counts(normalize=True)

Spring    0.251894
Summer    0.251894
Autumn    0.249156
Winter    0.247057
Name: Season, dtype: float64
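The small differences between the seasons come down to the calendar: Spring and Summer each have 92 days, Autumn has 91, and Winter has 90 (plus leap days). The sketch below recounts the shares using only the standard library. It assumes one row per calendar day starting on 1 January 1961, which the first rows and the row count suggest but the data set does not state outright.

```python
from collections import Counter
from datetime import date, timedelta

SEASONS = {
    "Winter": (12, 1, 2),
    "Spring": (3, 4, 5),
    "Summer": (6, 7, 8),
    "Autumn": (9, 10, 11),
}
MONTH_TO_SEASON = {m: s for s, months in SEASONS.items() for m in months}

# Assumption: daily observations with no gaps, starting 1 January 1961
start = date(1961, 1, 1)
n_days = 10957

counts = Counter(
    MONTH_TO_SEASON[(start + timedelta(days=i)).month] for i in range(n_days)
)

for season, count in counts.most_common():
    print(f"{season}: {count} days, share {count / n_days:.6f}")
```

Under that assumption the counts work out to the same shares that `value_counts` reported, so the slight imbalance is not a sign of missing data.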

## Making Test and Train data

Now we can go ahead and make the train and test sets. Like before, I will use a 70-30 split of the data.

In [71]:
np.random.seed(657)
data_randomized = df.sample(frac=1)

trainsize = round(len(data_randomized) * 0.70)

# Split into training and test sets
training_set = data_randomized[:trainsize].reset_index(drop=True)
test_set = data_randomized[trainsize:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

(7670, 11)
(3287, 11)


<br>
Just to be safe, let's make sure the two sets share a similar breakdown of seasonal variety.

In [100]:
training_set['Season'].value_counts(normalize=True)


Spring    0.255280
Autumn    0.250587
Summer    0.248501
Winter    0.245632
Name: Season, dtype: float64

In [101]:
test_set['Season'].value_counts(normalize=True)

Summer    0.259811
Winter    0.250380
Autumn    0.245817
Spring    0.243991
Name: Season, dtype: float64

<br>
As a final adjustment, we need to make sure the Season and Month columns we made (along with the raw Date) are left out of the features, so the classifier does not have it easy.

In [102]:
trainX = training_set.iloc[:,1:-2]
trainy = training_set['Season']

colnames = trainX.columns

testX = test_set.iloc[:,1:-2]
testy = test_set['Season']

trainX.head()

Unnamed: 0,Precipitation,MeanTemp,DaylightHumidity,SkyCoverage,DaylightTemp,WindSpeed,WindDirection,DayIndex
0,0.0,13.7,52,2,16.2,2.6,280,7652
1,0.0,34.7,31,0,36.7,3.5,240,8259
2,0.0,25.5,33,4,27.7,6.4,253,1721
3,0.0,20.0,27,0,24.4,2.1,93,2130
4,1.12,21.7,56,10,23.6,2.8,255,9397


<br>
Now we encode our training values so that the machine can understand the data and how to classify it. We will also throw in a quick test to make sure the encoding process is working correctly.

In [118]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
trainLabels = le.fit_transform(trainy)

from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder() 
trainX = enc.fit_transform(trainX)
trainX = pd.DataFrame(trainX, columns=colnames) 


trainLabels[:5]

array([3, 2, 0, 0, 0])

In [119]:
le.inverse_transform(trainLabels[:5])

array(['Winter', 'Summer', 'Autumn', 'Autumn', 'Autumn'], dtype=object)


## Modeling and Results

It is time to create our model based off of the training data.

In [120]:
model = CategoricalNB() 
model.fit(trainX,trainLabels) 

CategoricalNB()

In [121]:
yhattrain = model.predict(trainX)
pd.crosstab(yhattrain, trainy)

row_0 \ Season,Autumn,Spring,Summer,Winter
0,1203,100,50,57
1,147,1407,104,96
2,270,137,1752,0
3,302,314,0,1731


In [122]:
accuracy_score(yhattrain, trainLabels)

0.794393741851369

While it is not a perfect model, it did a decent job, correctly classifying the season about 79% of the time on the training data.

We can see from the confusion matrix that the diagonal holds the number of correct classifications, while the off-diagonal numbers count the times the model guessed a different season. Each row is a predicted season and each column is the true season, in the order Autumn, Spring, Summer, Winter.

For example, we can see that of all the observations from the Summer, the classifier did not guess Winter a single time, and vice versa for Winter data. At the very least, that is a good start for a prediction generator.

It seems like the machine has the most trouble with Autumn and Spring data, which to some degree makes sense. Autumn and Spring exhibit very similar temperature ranges, as they are both the middle ground between the extreme seasons (Winter and Summer).
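To put numbers on that, here is a small sketch that recomputes the accuracy from the training crosstab and then breaks it down per season (the share of each true season that was predicted correctly). The matrix values are copied from the output above; the variable names are just for illustration.

```python
seasons = ["Autumn", "Spring", "Summer", "Winter"]

# Training crosstab: rows are predicted seasons, columns are true seasons
train_matrix = [
    [1203, 100, 50, 57],
    [147, 1407, 104, 96],
    [270, 137, 1752, 0],
    [302, 314, 0, 1731],
]

total = sum(sum(row) for row in train_matrix)
correct = sum(train_matrix[i][i] for i in range(len(seasons)))
print(f"accuracy: {correct / total:.4f}")

# Per-season hit rate: diagonal entry divided by that season's column total
for j, season in enumerate(seasons):
    actual = sum(row[j] for row in train_matrix)
    print(f"{season}: {train_matrix[j][j] / actual:.3f}")
```

The diagonal sums to 6093 of 7670 observations, which matches the accuracy score above. Summer and Winter days are recovered roughly 92% of the time, while Spring sits near 72% and Autumn near 63%.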

In [123]:
le = LabelEncoder() 
testLabels = le.fit_transform(testy)

# Note: this fits a second encoder on the test set alone, so its integer codes
# are not guaranteed to match the codes the model saw during training
enc = OrdinalEncoder()
testX = enc.fit_transform(testX)
testX = pd.DataFrame(testX, columns=colnames) 

In [124]:
yhattest = model.predict(testX)
# Predictions are passed first, so rows are predicted seasons and columns are true seasons
confM = confusion_matrix(yhattest, testLabels)

confM

array([[294, 132, 133,  21],
       [215, 417, 140,  43],
       [ 86,  21, 579,   1],
       [213, 232,   2, 758]])

In [125]:
acc = accuracy_score(yhattest, testLabels)

acc

0.6230605415272285

<br>

As expected, the model is a bit sloppier on the test data than on the training data.

So overall our model has about a 62% chance to accurately guess the season you are in when you give it Phoenix weather data.

A bit on the low side. Unfortunate.
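One thing that may contribute to the gap is the encoding step. The test features go through a second `OrdinalEncoder` that is fitted on the test set alone, and an ordinal encoder numbers each column's distinct values in sorted order, so the same reading can end up with different codes in the two sets. The sketch below shows the effect in plain Python; `fit_codes` is a hypothetical stand-in for the encoder and the temperatures are made up.

```python
def fit_codes(values):
    """Map each distinct value to its rank among the sorted distinct values,
    which is how an ordinal encoder assigns integer codes."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


# Hypothetical temperature readings, not taken from the data set
train_temps = [8.1, 20.0, 34.7]
test_temps = [20.0, 25.5, 34.7]

train_codes = fit_codes(train_temps)
test_codes = fit_codes(test_temps)

print(train_codes[20.0])  # code the model was trained on
print(test_codes[20.0])   # code from a separately fitted encoder

# Reusing the training mapping keeps codes consistent, but values never
# seen in training (25.5 here) then need an explicit policy
unseen = [t for t in test_temps if t not in train_codes]
print(unseen)
```

Fitting the encoder once on the training data and only transforming the test data would keep the codes aligned, though unseen values would still need handling (binning the continuous readings first is one option). I have not rerun the model that way, so the scores above reflect the original process.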

However, let's see whether giving the model every variable collected would increase our accuracy scores, using the exact same process.

## Retrying Attempt with More Variables

In [126]:
df = pd.read_csv('arizonaWeather.csv', sep=',', index_col=False, header=None, names=["Date", "Precipitation", "PanEvaporation", "MeanTemp", "MeanWindSpeed", "SolarRadiation", "FAOShortGrassEto", "DaylightStationPressure", "DaylightHumidity", "SkyCoverage", "DaylightTemp", "DaylightBroadbandAerosol", "WindSpeed", "WindDirection"])

# Convert the date to a string for easier parsing
df.dtypes
dateType = {'Date': str}
df = df.astype(dateType)

# Create an index to allow the machine to better understand the progression of time
# This will allow it to account for potential rises in temperature year over year
DayIndex = list(range(1,len(df)+1))
df['DayIndex'] = DayIndex

# Isolate the month value so that we can determine season
Month = list(range(1,len(df)+1))
df['Month'] = Month
        
for i in range(0, len(df)):
    dateStr = df.at[i, 'Date']
    if (len(dateStr) == 5):
        df.at[i, 'Month'] = int(dateStr[0])
    else:
        df.at[i, 'Month'] = int(dateStr[0]+dateStr[1])

# I elected to base season off of the meteorological standard (i.e. seasons start on the first of their equinox or solstice months)
# Start with an empty placeholder column; the loop below fills in every row
df['Season'] = ""

for i in range(0, len(df)):
    #dateDay = df.at[i, 'Day']
    dateMonth = df.at[i, 'Month']
    if (dateMonth == 12 or dateMonth < 3):
        df.at[i, 'Season'] = "Winter"
    elif (dateMonth >=3 and dateMonth < 6):
        df.at[i, 'Season'] = "Spring"
    elif (dateMonth >=6 and dateMonth < 9):
        df.at[i, 'Season'] = "Summer"
    else:
        df.at[i, 'Season'] = "Autumn"

# Sanity check
df.head()

Unnamed: 0,Date,Precipitation,PanEvaporation,MeanTemp,MeanWindSpeed,SolarRadiation,FAOShortGrassEto,DaylightStationPressure,DaylightHumidity,SkyCoverage,DaylightTemp,DaylightBroadbandAerosol,WindSpeed,WindDirection,DayIndex,Month,Season
0,10161,0.0,0.22,8.1,118.5,294.8,1.5,97.8,45,0,12.1,0.117,0.4,0,1,1,Winter
1,10261,0.0,0.25,8.0,121.8,295.4,1.7,98.0,35,0,12.4,0.092,0.0,0,2,1,Winter
2,10361,0.0,0.32,9.6,171.0,297.9,2.2,98.3,30,0,14.0,0.092,1.8,50,3,1,Winter
3,10461,0.0,0.27,10.1,110.1,256.1,1.6,98.2,32,2,14.3,0.128,0.7,0,4,1,Winter
4,10561,0.0,0.35,12.2,167.3,271.0,2.3,97.7,32,5,15.9,0.055,3.3,80,5,1,Winter


In [127]:
np.random.seed(657)
data_randomized = df.sample(frac=1)

trainsize = round(len(data_randomized) * 0.70)

# Split into training and test sets
training_set = data_randomized[:trainsize].reset_index(drop=True)
test_set = data_randomized[trainsize:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

trainX = training_set.iloc[:,1:-2]
trainy = training_set['Season']

colnames = trainX.columns

testX = test_set.iloc[:,1:-2]
testy = test_set['Season']

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
trainLabels = le.fit_transform(trainy)

from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder() 
trainX = enc.fit_transform(trainX)
trainX = pd.DataFrame(trainX, columns=colnames) 

model = CategoricalNB() 
model.fit(trainX,trainLabels) 

yhattrain = model.predict(trainX)
pd.crosstab(yhattrain, trainy)

(7670, 17)
(3287, 17)


row_0 \ Season,Autumn,Spring,Summer,Winter
0,1355,69,44,58
1,63,1515,34,31
2,201,176,1828,0
3,303,198,0,1795


In [128]:
accuracy_score(yhattrain, trainLabels)

0.8465449804432855

In [129]:
le = LabelEncoder() 
testLabels = le.fit_transform(testy)

from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
testX = enc.fit_transform(testX)
testX = pd.DataFrame(testX, columns=colnames) 

yhattest = model.predict(testX)
confM = confusion_matrix(yhattest, testLabels)

confM

array([[374, 211, 129,  35],
       [133, 379, 158,  12],
       [ 56,  37, 565,   0],
       [245, 175,   2, 776]])

In [130]:
acc = accuracy_score(yhattest, testLabels)

acc

0.6370550654091877

<br>
Overall, test accuracy only improved by about 1 percentage point (62.3% to 63.7%), which is unfortunate; training accuracy, however, improved by about 5 points (79.4% to 84.7%)! The model also seems to better differentiate between Autumn and Spring, but it still struggles to separate them from the extremes. Once again, though, it seems to know the difference between Winter and Summer.

This method yielded a much better result than our k-nearest neighbors attempt, which is exciting because it encourages further research into this method.

Just to make sure we aren't overfitting the data, let's try one more attempt, this time using only values that can be quickly obtained from a weather report.

## Retrying Attempt with Fewer Variables

In [133]:
df = pd.read_csv('arizonaWeather.csv', sep=',', index_col=False, header=None, names=["Date", "Precipitation", "PanEvaporation", "MeanTemp", "MeanWindSpeed", "SolarRadiation", "FAOShortGrassEto", "DaylightStationPressure", "DaylightHumidity", "SkyCoverage", "DaylightTemp", "DaylightBroadbandAerosol", "WindSpeed", "WindDirection"])

df = df.drop(['PanEvaporation','MeanWindSpeed','SolarRadiation','SkyCoverage','FAOShortGrassEto','DaylightStationPressure','DaylightBroadbandAerosol','WindSpeed','WindDirection'], axis = 1)

# Convert the date to a string for easier parsing
df.dtypes
dateType = {'Date': str}
df = df.astype(dateType)

# Create an index to allow the machine to better understand the progression of time
# This will allow it to account for potential rises in temperature year over year
DayIndex = list(range(1,len(df)+1))
df['DayIndex'] = DayIndex

# Isolate the month value so that we can determine season
Month = list(range(1,len(df)+1))
df['Month'] = Month
        
for i in range(0, len(df)):
    dateStr = df.at[i, 'Date']
    if (len(dateStr) == 5):
        df.at[i, 'Month'] = int(dateStr[0])
    else:
        df.at[i, 'Month'] = int(dateStr[0]+dateStr[1])

# I elected to base season off of the meteorological standard (i.e. seasons start on the first of their equinox or solstice months)
# Start with an empty placeholder column; the loop below fills in every row
df['Season'] = ""

for i in range(0, len(df)):
    #dateDay = df.at[i, 'Day']
    dateMonth = df.at[i, 'Month']
    if (dateMonth == 12 or dateMonth < 3):
        df.at[i, 'Season'] = "Winter"
    elif (dateMonth >=3 and dateMonth < 6):
        df.at[i, 'Season'] = "Spring"
    elif (dateMonth >=6 and dateMonth < 9):
        df.at[i, 'Season'] = "Summer"
    else:
        df.at[i, 'Season'] = "Autumn"

# Sanity check
df.head()

Unnamed: 0,Date,Precipitation,MeanTemp,DaylightHumidity,DaylightTemp,DayIndex,Month,Season
0,10161,0.0,8.1,45,12.1,1,1,Winter
1,10261,0.0,8.0,35,12.4,2,1,Winter
2,10361,0.0,9.6,30,14.0,3,1,Winter
3,10461,0.0,10.1,32,14.3,4,1,Winter
4,10561,0.0,12.2,32,15.9,5,1,Winter


In [134]:
np.random.seed(657)
data_randomized = df.sample(frac=1)

trainsize = round(len(data_randomized) * 0.70)

# Split into training and test sets
training_set = data_randomized[:trainsize].reset_index(drop=True)
test_set = data_randomized[trainsize:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

trainX = training_set.iloc[:,1:-2]
trainy = training_set['Season']

colnames = trainX.columns

testX = test_set.iloc[:,1:-2]
testy = test_set['Season']

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
trainLabels = le.fit_transform(trainy)

from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder() 
trainX = enc.fit_transform(trainX)
trainX = pd.DataFrame(trainX, columns=colnames) 

model = CategoricalNB() 
model.fit(trainX,trainLabels) 

yhattrain = model.predict(trainX)
pd.crosstab(yhattrain, trainy)

(7670, 8)
(3287, 8)


row_0 \ Season,Autumn,Spring,Summer,Winter
0,1151,67,73,45
1,173,1403,87,107
2,287,133,1746,0
3,311,355,0,1732


In [135]:
accuracy_score(yhattrain, trainLabels)

0.7864406779661017

In [136]:
le = LabelEncoder() 
testLabels = le.fit_transform(testy)

from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
testX = enc.fit_transform(testX)
testX = pd.DataFrame(testX, columns=colnames) 

yhattest = model.predict(testX)
confM = confusion_matrix(yhattest, testLabels)

confM

array([[277, 146, 130,  17],
       [223, 385, 122,  44],
       [ 88,  24, 601,   1],
       [220, 247,   1, 761]])

In [137]:
acc = accuracy_score(yhattest, testLabels)

acc

0.6157590508062063

## Review

So as we can see, with fewer variables the accuracy did go down. This could be because we chose the wrong variables to evaluate, but the purpose of this model was also to make it as simple as possible for someone who is not adept at meteorology. This model still exhibits similar characteristics to the ones before it, just a tad worse.
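For a side-by-side view, the sketch below lines up the training and test accuracy from the three runs and prints the gap between them. The scores are copied from the outputs above; the run labels are my own shorthand.

```python
# Accuracy scores reported above: (training, test) for each feature set
runs = {
    "weather-report variables": (0.794393741851369, 0.6230605415272285),
    "all variables": (0.8465449804432855, 0.6370550654091877),
    "fewest variables": (0.7864406779661017, 0.6157590508062063),
}

gaps = {}
for name, (train_acc, test_acc) in runs.items():
    # A large train-minus-test gap is a common sign of overfitting
    gaps[name] = train_acc - test_acc
    print(f"{name}: train {train_acc:.3f}, test {test_acc:.3f}, gap {gaps[name]:.3f}")
```

The gap is about 17 points for the two smaller feature sets and about 21 points with every variable included, so trimming variables did not really shrink it. That hints the gap may come from how the features are encoded or modeled rather than from which ones are used, though this comparison alone cannot confirm it.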

I'm sure with different variables or perhaps a broader set of data, this classifier could be even more accurate. But until then, this was a fun little exploration into weather predictions.