# Weather Prediction Model
This notebook builds an ML model that predicts comfortable temperatures from historical data. I use San Jose's weather data from 2010 through June 28th, 2020. Using this much data gives the model more variety to learn from, which can help avoid overfitting.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
weatherData = pd.read_csv('sanJoseWeather.csv')
weatherData.head()

,STATION,NAME,DATE,AWND,FMTM,PGTM,PRCP,SNOW,SNWD,TAVG,...,WT02,WT03,WT04,WT05,WT07,WT08,WT13,WT14,WT16,WT21
0,USW00023293,"SAN JOSE, CA US",2010-01-01,3.13,1209.0,1208.0,0.0,,,,...,,,,1.0,,,,,1.0,
1,USW00023293,"SAN JOSE, CA US",2010-01-02,2.68,1709.0,733.0,0.0,,,,...,,,,,,,,,,
2,USW00023293,"SAN JOSE, CA US",2010-01-03,0.89,1705.0,2025.0,0.0,,,,...,,,,,1.0,1.0,1.0,,,
3,USW00023293,"SAN JOSE, CA US",2010-01-04,1.12,2219.0,2216.0,0.0,,,,...,,,,,1.0,1.0,1.0,,,
4,USW00023293,"SAN JOSE, CA US",2010-01-05,2.01,1456.0,1647.0,0.0,,,,...,,,,,1.0,1.0,1.0,,,


The next step is to list all the attributes in the dataset and keep only those that matter for the model.

The remaining attributes can be dropped from the DataFrame.

In [3]:
print('There are ', len(weatherData.columns), ' attributes in our dataset')

There are  27  attributes in our dataset


Next, we decide which attributes are not crucial to the model. The city name is one example: all the data was pulled from the same weather station, so the station identifier is equally useless for training.
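
A quick way to confirm that a column carries a single value is to count its distinct entries. The sketch below uses a small made-up frame (`sample` is a hypothetical stand-in for `weatherData`), so treat it as an illustration rather than a result from the real file.

```python
import pandas as pd

# Toy stand-in for weatherData; the TMAX values are only illustrative.
sample = pd.DataFrame({
    'STATION': ['USW00023293'] * 3,
    'NAME': ['SAN JOSE, CA US'] * 3,
    'DATE': ['2010-01-01', '2010-01-02', '2010-01-03'],
    'TMAX': [63.0, 58.0, 60.0],
})

# nunique() counts distinct non-missing values per column.
unique_counts = sample.nunique()

# Columns with exactly one distinct value cannot help a model tell rows apart.
constant_columns = unique_counts[unique_counts == 1].index.tolist()
print(constant_columns)
```

Because `nunique()` ignores NaN by default, on the real data this check would also flag columns whose only non-missing value is a single number.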

In [4]:
weatherData.columns

Index(['STATION', 'NAME', 'DATE', 'AWND', 'FMTM', 'PGTM', 'PRCP', 'SNOW',
       'SNWD', 'TAVG', 'TMAX', 'TMIN', 'WDF2', 'WDF5', 'WSF2', 'WSF5', 'WT01',
       'WT02', 'WT03', 'WT04', 'WT05', 'WT07', 'WT08', 'WT13', 'WT14', 'WT16',
       'WT21'],
      dtype='object')

In [5]:
# Drop the two attributes that are identical for every row
weatherData = weatherData.drop(['STATION', 'NAME'], axis=1)
weatherData.head()

,DATE,AWND,FMTM,PGTM,PRCP,SNOW,SNWD,TAVG,TMAX,TMIN,...,WT02,WT03,WT04,WT05,WT07,WT08,WT13,WT14,WT16,WT21
0,2010-01-01,3.13,1209.0,1208.0,0.0,,,,63.0,49.0,...,,,,1.0,,,,,1.0,
1,2010-01-02,2.68,1709.0,733.0,0.0,,,,58.0,45.0,...,,,,,,,,,,
2,2010-01-03,0.89,1705.0,2025.0,0.0,,,,60.0,39.0,...,,,,,1.0,1.0,1.0,,,
3,2010-01-04,1.12,2219.0,2216.0,0.0,,,,57.0,42.0,...,,,,,1.0,1.0,1.0,,,
4,2010-01-05,2.01,1456.0,1647.0,0.0,,,,59.0,38.0,...,,,,,1.0,1.0,1.0,,,


Besides removing attributes that are clearly irrelevant to training the model, we should also remove attributes whose only value is **NaN**.
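
As a minimal sketch of that idea, pandas can find and drop all-NaN columns directly. The frame below is a made-up stand-in for `weatherData`, and the names `all_nan_columns` and `cleaned` are only illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for weatherData; values are made up for illustration.
sample = pd.DataFrame({
    'TAVG': [np.nan, np.nan, np.nan],
    'TMAX': [63.0, 58.0, 60.0],
    'WT16': [1.0, np.nan, np.nan],
})

# isna().all() is True for a column only when every entry is missing.
all_nan_columns = sample.columns[sample.isna().all()].tolist()
print(all_nan_columns)

# dropna(axis=1, how='all') removes exactly those columns.
cleaned = sample.dropna(axis=1, how='all')
print(list(cleaned.columns))
```

Columns that are only partly missing, such as the `WT16` column in this toy frame, are left in place by `how='all'`.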

In [6]:
weatherData.columns

Index(['DATE', 'AWND', 'FMTM', 'PGTM', 'PRCP', 'SNOW', 'SNWD', 'TAVG', 'TMAX',
       'TMIN', 'WDF2', 'WDF5', 'WSF2', 'WSF5', 'WT01', 'WT02', 'WT03', 'WT04',
       'WT05', 'WT07', 'WT08', 'WT13', 'WT14', 'WT16', 'WT21'],
      dtype='object')

Let's take a look at the snowfall (__**SNOW**__) and snow depth (__**SNWD**__) attributes. As far as I know, it does not snow in the SF Bay Area, so let's check the unique values in those two columns.

In [7]:
print('Unique values for SNOW: ', weatherData.SNOW.unique())
print('Unique values for SNWD: ', weatherData.SNWD.unique())

Unique values for SNOW:  [nan  0.]
Unique values for SNWD:  [nan  0.]


As expected, no snow has been recorded in San Jose since 2010: both columns contain only 0 or missing values. We can drop those columns now.
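
The same check can be run across a whole frame instead of one column at a time. This is a sketch on made-up data; `zero_or_missing_columns` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for weatherData; values are made up for illustration.
sample = pd.DataFrame({
    'SNOW': [np.nan, 0.0, 0.0, np.nan],
    'SNWD': [0.0, 0.0, np.nan, np.nan],
    'PRCP': [0.0, 0.12, 0.0, 0.4],
})

def zero_or_missing_columns(frame):
    """Return columns with at least one value whose non-missing values are all 0."""
    found = []
    for name in frame.columns:
        values = frame[name].dropna()
        # Skip all-NaN columns so they are not reported as "all zero".
        if len(values) > 0 and (values == 0).all():
            found.append(name)
    return found

empty_snow_columns = zero_or_missing_columns(sample)
print(empty_snow_columns)
```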

In [8]:
weatherData = weatherData.drop(['SNOW', 'SNWD'], axis=1)
weatherData.head()

,DATE,AWND,FMTM,PGTM,PRCP,TAVG,TMAX,TMIN,WDF2,WDF5,...,WT02,WT03,WT04,WT05,WT07,WT08,WT13,WT14,WT16,WT21
0,2010-01-01,3.13,1209.0,1208.0,0.0,,63.0,49.0,250.0,250.0,...,,,,1.0,,,,,1.0,
1,2010-01-02,2.68,1709.0,733.0,0.0,,58.0,45.0,260.0,340.0,...,,,,,,,,,,
2,2010-01-03,0.89,1705.0,2025.0,0.0,,60.0,39.0,240.0,230.0,...,,,,,1.0,1.0,1.0,,,
3,2010-01-04,1.12,2219.0,2216.0,0.0,,57.0,42.0,200.0,190.0,...,,,,,1.0,1.0,1.0,,,
4,2010-01-05,2.01,1456.0,1647.0,0.0,,59.0,38.0,290.0,340.0,...,,,,,1.0,1.0,1.0,,,


We still have 23 columns to check, and inspecting them one by one would take too long. Instead, I'll check the unique values of every column: an attribute with fewer than 31 unique values is unlikely to be useful to us.

In [9]:
def showNanattributes(dataFrame):
    # Print every column with fewer than 31 unique values (NaN counts as a value)
    for attr in dataFrame.columns:
        column = dataFrame[attr]
        if len(column.unique()) < 31:
            print('Attribute ', column.name, ' values: ', column.unique())

showNanattributes(weatherData)

Attribute  TAVG  values:  [nan]
Attribute  WT01  values:  [nan  1.]
Attribute  WT02  values:  [nan  1.]
Attribute  WT03  values:  [nan  1.]
Attribute  WT04  values:  [nan  1.]
Attribute  WT05  values:  [ 1. nan]
Attribute  WT07  values:  [nan  1.]
Attribute  WT08  values:  [nan  1.]
Attribute  WT13  values:  [nan  1.]
Attribute  WT14  values:  [nan  1.]
Attribute  WT16  values:  [ 1. nan]
Attribute  WT21  values:  [nan  1.]
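
The loop above can also be written as a compact summary with `DataFrame.nunique`. The sketch below assumes the same cutoff as the `< 31` test in the cell; `low_cardinality_columns` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def low_cardinality_columns(frame, max_unique=30):
    """Return unique-value counts for columns with at most max_unique values."""
    # dropna=False counts NaN as its own value, matching Series.unique().
    counts = frame.nunique(dropna=False)
    return counts[counts <= max_unique].sort_values()

# Toy stand-in for weatherData; values are made up for illustration.
sample = pd.DataFrame({
    'TAVG': [np.nan] * 40,
    'WT16': [1.0, np.nan] * 20,
    'TMAX': np.arange(40, dtype=float),
})

summary = low_cardinality_columns(sample)
print(summary)
```

Returning a Series rather than printing inside the loop makes it easy to feed the column names straight into `drop`.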


We can now see that the attributes containing the least useful data are TAVG, WT01, WT02, WT03, WT04, WT05, WT07, WT08, WT13, WT14, WT16, and WT21.

These attributes stand for:

- TAVG: Average Temperature
- WT01: Fog, ice fog, or freezing fog
- WT02: Heavy fog or heavy freezing fog
- WT03: Thunder
- WT04: Ice pellets, sleet, snow pellets, or small hail
- WT05: Hail
- WT07: Dust, volcanic ash, blowing dust, blowing sand, or blowing obstruction
- WT08: Smoke or haze
- WT13: Mist
- WT14: Drizzle
- WT16: Rain
- WT21: Ground fog
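
Since each weather-type flag is either 1 or NaN, counting the non-missing entries shows how many days each condition was reported. The sketch below pairs a few of the codes listed above with those counts. The frame is made up, and `WEATHER_TYPE_LABELS` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# A few of the labels from the list above; extend the mapping as needed.
WEATHER_TYPE_LABELS = {
    'WT01': 'Fog, ice fog, or freezing fog',
    'WT03': 'Thunder',
    'WT08': 'Smoke or haze',
    'WT16': 'Rain',
}

# Toy stand-in for weatherData; values are made up for illustration.
sample = pd.DataFrame({
    'WT01': [np.nan, 1.0, np.nan, 1.0],
    'WT16': [1.0, np.nan, np.nan, np.nan],
})

# Keep only the flag columns we have a label for.
flag_columns = [name for name in sample.columns if name in WEATHER_TYPE_LABELS]

# A non-missing entry means the condition was reported on that day.
days_flagged = sample[flag_columns].notna().sum()
for code, count in days_flagged.items():
    print(f'{code} ({WEATHER_TYPE_LABELS[code]}): {count} day(s)')
```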

Another note: WDF2 and WDF5 both record wind direction (for the fastest 2-minute wind and the fastest 5-second wind, respectively) and carry largely the same information. WDF5 is often missing, so I will drop WDF5 and keep WDF2.
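
One way to check how closely the two direction columns agree before dropping one is to compare them as angles. The sketch below uses only the five WDF2/WDF5 pairs visible in the `head()` output above. The 10-degree tolerance is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# The five WDF2/WDF5 pairs shown by head() earlier in the notebook.
sample = pd.DataFrame({
    'WDF2': [250.0, 260.0, 240.0, 200.0, 290.0],
    'WDF5': [250.0, 340.0, 230.0, 190.0, 340.0],
})

# Compare only rows where both directions are present.
both_present = sample[['WDF2', 'WDF5']].dropna()

# Directions are circular, so 350 and 10 degrees are 20 degrees apart.
raw_diff = (both_present['WDF2'] - both_present['WDF5']).abs() % 360
angular_diff = np.minimum(raw_diff, 360 - raw_diff)

# Share of days on which the two directions agree within the tolerance.
share_close = (angular_diff <= 10).mean()
missing_counts = sample[['WDF2', 'WDF5']].isna().sum()

print(angular_diff.tolist())
print(share_close)
print(missing_counts.to_dict())
```

Five rows are far too few to judge the columns, so the same comparison is worth running on the full `weatherData` frame.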

In [10]:
# Drop the least informative attributes for the area (WT14 and WT16 are kept)
weatherData = weatherData.drop(['TAVG', 'WDF5', 'WT01', 'WT02', 'WT03', 'WT04', 'WT05', 'WT07', 'WT08', 'WT13', 'WT21'], axis=1)

We have now removed the attributes that carry no significant data. However, the remaining names are cryptic, and it is not obvious what they stand for.

Next, we rename the attributes to make them clearer.

In [15]:
print(list(weatherData.columns))

unnamedAttributes = list(weatherData.columns)
renamedAttributes = ['DATE', 'AVGWINDSP', 'TIMEFASTESTWIND', 'PEAKGUST', 'PRECIPITATION', 'TAVG', 'TMAX', 'TMIN', 'GUSTDIRECTION']

# nameAdjustment = dict(zip(unnamedAttributes, renamedAttributes))

['DATE', 'AWND', 'FMTM', 'PGTM', 'PRCP', 'TAVG', 'TMAX', 'TMIN', 'WDF2', 'WDF5', 'WSF2', 'WSF5', 'WT14', 'WT16']
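
A positional `zip` only works when both lists have the same length and order, so an explicit mapping may be safer. The sketch below pairs old and new names the way the two lists above line up and applies them with `DataFrame.rename`. `COLUMN_NAMES` is a hypothetical name, and the empty frame only mimics the printed column list.

```python
import pandas as pd

# Old name -> new name, following the order of the two lists above.
COLUMN_NAMES = {
    'AWND': 'AVGWINDSP',
    'FMTM': 'TIMEFASTESTWIND',
    'PGTM': 'PEAKGUST',
    'PRCP': 'PRECIPITATION',
    'WDF2': 'GUSTDIRECTION',
}

# Empty frame with the columns printed above, used only to show the renaming.
sample = pd.DataFrame(columns=[
    'DATE', 'AWND', 'FMTM', 'PGTM', 'PRCP', 'TAVG', 'TMAX', 'TMIN',
    'WDF2', 'WDF5', 'WSF2', 'WSF5', 'WT14', 'WT16',
])

# rename() leaves columns without a mapping entry untouched.
renamed = sample.rename(columns=COLUMN_NAMES)
print(list(renamed.columns))
```

With a dictionary, dropping or keeping extra columns later does not shift the pairing between old and new names.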
