# Making an Air Pollutant Forecasting Model - Part II
<br>
We are now ready to build a forecasting tool for North Carolina using the decision tree algorithm.

Some notes on this tool:
- We will use the following "features": temperature, humidity, wind speed, and population.
- The prediction target is either "Good" or "Warning", where "Warning" indicates that the PM2.5 AQI is higher than 50.
- To see how well the model performs, we will again split the data into a "training set" and a "holdout set". We will use data from 2013-2016 as the training set and data from 2017 as the holdout set for testing.
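
The labeling rule in the list above can be sketched as follows. This is a minimal illustration on made-up AQI values. The helper name `label_pm25` is hypothetical, and the sketch assumes that an AQI of exactly 50 still counts as "Good", which is consistent with the sample rows of the data table shown later.

```python
import pandas as pd

def label_pm25(aqi: pd.Series, threshold: float = 50) -> pd.Series:
    """Return 1 ("Warning") where AQI is strictly above the threshold, else 0 ("Good")."""
    return (aqi > threshold).astype(int)

# Made-up AQI values for illustration only
aqi = pd.Series([44.0, 65.0, 50.0, 23.0])
labels = label_pm25(aqi)
print(labels.tolist())  # [0, 1, 0, 0]
```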

In [1]:
import numpy as np
import pandas as pd
import importlib
from IPython.display import display
from sklearn import tree
from sklearn.model_selection import train_test_split
from bokeh.io import output_notebook
output_notebook()


Next, we load our data table. The table includes a daily summary of weather and pollutant information, along with the population for the given year. Note that rows with missing data have been removed from this table. This may introduce some bias into our model, but we will ignore it for now.

In [2]:
pd.set_option('display.max_rows',None)
df=pd.read_csv('pm25_combined.csv')
df=df[df["State_name"]=="North Carolina"].reset_index(drop=True)
df

Unnamed: 0.1,Unnamed: 0,State_name,County_name,Date,avgAQI_pm25,Temperature,Wind,RH,Population,pm25_class
0,120936,North Carolina,Cumberland,1/4/13,44.0,40.375,114.0875,63.083333,333273,0
1,120942,North Carolina,Cumberland,1/10/13,38.0,58.5,76.383333,68.625,333273,0
2,120948,North Carolina,Cumberland,1/16/13,23.0,52.625,100.783333,94.416667,333273,0
3,120954,North Carolina,Cumberland,1/22/13,27.0,38.291667,125.575,32.083333,333273,0
4,120960,North Carolina,Cumberland,1/28/13,65.0,44.958333,85.864583,70.583333,333273,1
5,120966,North Carolina,Cumberland,2/3/13,25.0,43.916667,3.4875,60.791667,333273,0
6,120972,North Carolina,Cumberland,2/9/13,34.0,44.583333,1.970833,46.083333,333273,0
7,120978,North Carolina,Cumberland,2/15/13,43.0,46.916667,111.977083,66.166667,333273,0
8,120984,North Carolina,Cumberland,2/21/13,50.0,43.666667,85.395833,41.541667,333273,0
9,120990,North Carolina,Cumberland,2/27/13,25.0,49.5,120.152083,72.416667,333273,0


Here, we define the "features" and the "target", and then split the data into a training set and a holdout (test) set.

In [3]:
X=df.loc[:,["Temperature","Wind","RH","Population"]]
y=df.loc[:,"pm25_class"]

In [4]:
# Extract the year of each row once, then split by year
year=pd.DatetimeIndex(df["Date"]).year
X_train,y_train=X[year<2017],y[year<2017]    # 2013-2016: training set
X_test,y_test=X[year==2017],y[year==2017]    # 2017: holdout (test) set
print("Total number of data entries = ",len(y))
print("Number of data entries in the training set = ",len(y_train))
print("Number of data entries in the hold out (testing) data set = ",len(y_test))

Total number of data entries =  2481
Number of data entries in the training set =  2131
Number of data entries in the hold out (testing) data set =  350
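
The year-based split can also be written by parsing the dates with an explicit format and reusing a boolean mask. The sketch below uses a tiny made-up frame rather than the real table. The `%m/%d/%y` format string is an assumption based on how dates such as `1/4/13` appear in the table.

```python
import pandas as pd

# Tiny made-up frame that mimics the Date and pm25_class columns
toy = pd.DataFrame({
    "Date": ["1/4/13", "12/30/16", "1/2/17", "6/15/17"],
    "Temperature": [40.4, 51.0, 59.6, 78.2],
    "pm25_class": [0, 1, 0, 0],
})

# Parse the dates once (the format is an assumption) and extract the year
year = pd.to_datetime(toy["Date"], format="%m/%d/%y").dt.year

is_train = year < 2017    # 2013-2016 in the real data
is_test = year == 2017    # holdout year

toy_train = toy[is_train]
toy_test = toy[is_test]
print(len(toy_train), len(toy_test))  # 2 2
```

Parsing with an explicit format avoids relying on automatic date inference, which can be ambiguous for day-first versus month-first strings.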


### Decision Tree Model
If we keep growing the decision tree until it perfectly classifies every data entry, we will __overfit__ the training data. Such a model tends to perform poorly when asked to __predict__ future, unseen data. One way to limit overfitting is to manually set ```max_depth``` (the maximum depth of the decision tree).

We will build decision tree models using several different values of ```max_depth``` and see how each model performs on the test data set.
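
One way to compare several depths at once is a simple loop, sketched below. The data here is synthetic so that the block runs on its own. The names `X_tr`, `X_ho`, and `results` are illustrative, and the depth values are arbitrary choices rather than recommendations.

```python
import numpy as np
from sklearn import tree
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: 4 features and a noisy binary target
X_all = rng.normal(size=(600, 4))
y_all = ((X_all[:, 0] + 0.5 * rng.normal(size=600)) > 0.8).astype(int)

# The first 500 rows play the role of the training years, the rest the holdout year
X_tr, y_tr = X_all[:500], y_all[:500]
X_ho, y_ho = X_all[500:], y_all[500:]

results = {}
for depth in [1, 2, 3, 5, 10, None]:  # None lets the tree grow until its leaves are pure
    model = tree.DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, model.predict(X_tr))
    holdout_acc = accuracy_score(y_ho, model.predict(X_ho))
    results[depth] = (train_acc, holdout_acc)
    print(f"max_depth={depth}: train={train_acc:.3f}, holdout={holdout_acc:.3f}")
```

A training accuracy that keeps rising while the holdout accuracy stalls or drops is the usual sign of overfitting.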

In [5]:
max_depth = 2
pm25_tm = tree.DecisionTreeClassifier(max_depth=max_depth)
pm25_tm = pm25_tm.fit(X_train,y_train)


Use the following code to visualize the tree:

In [None]:
import graphviz 

dot_data = tree.export_graphviz(pm25_tm, out_file=None, 
                    feature_names=['Temperature', 'Wind', 'RH', 'Population'],  
                    class_names=['Good','Warning'],  
                    filled=True, rounded=True,  
                    special_characters=True, max_depth=5)  
graph = graphviz.Source(dot_data)  
graph 
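
If Graphviz is not installed, scikit-learn's `tree.export_text` gives a plain-text view of a fitted tree. The sketch below fits a small tree on synthetic data so that it runs on its own. In the notebook you would pass `pm25_tm` instead of the stand-in model `demo_tree` (a hypothetical name).

```python
import numpy as np
from sklearn import tree

rng = np.random.default_rng(1)
feature_names = ["Temperature", "Wind", "RH", "Population"]

# Synthetic stand-in for the training data; the target depends only on the first feature
X_demo = rng.uniform(0, 100, size=(200, 4))
y_demo = (X_demo[:, 0] > 60).astype(int)

demo_tree = tree.DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Render the fitted tree as indented text rules
rules = tree.export_text(demo_tree, feature_names=feature_names)
print(rules)
```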

The first metric we can look at is the classification accuracy:

In [6]:
from sklearn.metrics import accuracy_score
y_train_pred=pm25_tm.predict(X_train)
print("accuracy for the training set is ",np.around(accuracy_score(y_train,y_train_pred)*100,2),"%")
y_test_pred=pm25_tm.predict(X_test)
print("accuracy for the test set is ",np.around(accuracy_score(y_test,y_test_pred)*100,2),"%")

accuracy for the training set is  84.84 %
accuracy for the test set is  87.14 %
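
Accuracy alone can be hard to interpret when one class is much more common than the other, which may be the case here. A useful reference point is the accuracy of a rule that always predicts the most frequent class. The sketch below uses made-up labels, not the notebook's data, and `baseline_accuracy` is a hypothetical helper.

```python
import numpy as np

def baseline_accuracy(y_true) -> float:
    """Accuracy of always predicting the most frequent class in y_true."""
    y_true = np.asarray(y_true)
    counts = np.bincount(y_true)  # number of entries per class label
    return counts.max() / len(y_true)

# Made-up labels: 8 "Good" (0) days and 2 "Warning" (1) days
y_example = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
print(baseline_accuracy(y_example))  # 0.8
```

Applied to the notebook, `baseline_accuracy(y_test)` would show how much of the test accuracy could be reached without using any features.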


Next, we will explore the data entries our model accurately predicted as Good or Warning, and the entries it inaccurately predicted as Good or Warning. Record the counts in a matrix like the following:
<img src="confusion_matrix.png" width="300">

In [15]:
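# pd.concat aligns on index labels; the training rows appear to keep labels 0..n-1, matching the fresh index of the predictions.
df_train=pd.concat([X_train,y_train,pd.Series(data=y_train_pred,name="Prediction")],axis=1)
# The test rows start at label 2131, so reset_index() realigns them (and keeps the old labels in "index" columns).
df_test=pd.concat([X_test.reset_index(),y_test.reset_index(),pd.Series(data=y_test_pred,name="Prediction")],axis=1)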

In [None]:
pd.set_option('display.max_rows',None)

df_cm = df_train # df_train (training set) or df_test (test set)
actual = 0 #1 (Warning) or 0 (Good)
prediction = 0 #1 (Warning) or 0 (Good)

df_sub=df_cm[(df_cm["pm25_class"]==actual) & (df_cm["Prediction"]==prediction)]
display(df_sub)
print("number of data entries = ",len(df_sub))
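
All four cells of the matrix can also be counted in one step with `pd.crosstab`. The sketch below uses made-up labels in a frame named `toy_cm` (a hypothetical name). With the notebook's data, the same call could be applied to the `pm25_class` and `Prediction` columns of `df_train` or `df_test`.

```python
import pandas as pd

# Made-up actual and predicted labels (0 = Good, 1 = Warning)
toy_cm = pd.DataFrame({
    "pm25_class": [0, 0, 0, 1, 1, 0, 1, 0],
    "Prediction": [0, 0, 1, 1, 0, 0, 0, 0],
})

# Rows are the actual classes, columns are the predicted classes
cm = pd.crosstab(toy_cm["pm25_class"], toy_cm["Prediction"],
                 rownames=["Actual"], colnames=["Predicted"])
print(cm)
```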



In [9]:
df_train

Unnamed: 0,Temperature,Wind,RH,Population,pm25_class,Prediction
0,40.375,114.0875,63.083333,333273,0,0
1,58.5,76.383333,68.625,333273,0,0
2,52.625,100.783333,94.416667,333273,0,0
3,38.291667,125.575,32.083333,333273,0,0
4,44.958333,85.864583,70.583333,333273,1,0
5,43.916667,3.4875,60.791667,333273,0,0
6,44.583333,1.970833,46.083333,333273,0,0
7,46.916667,111.977083,66.166667,333273,0,0
8,43.666667,85.395833,41.541667,333273,0,0
9,49.5,120.152083,72.416667,333273,0,0


In [16]:
df_test

Unnamed: 0,index,Temperature,Wind,RH,Population,index.1,pm25_class,Prediction
0,2131,59.583333,90.0,80.083333,376320,2131,0,0
1,2132,48.041667,90.083333,83.958333,376320,2132,0,0
2,2133,47.833333,160.208333,69.458333,376320,2133,0,0
3,2134,52.0,99.666667,100.0,376320,2134,0,0
4,2135,52.5,245.958333,61.0,376320,2135,0,0
5,2136,37.458333,188.083333,47.875,376320,2136,0,0
6,2137,48.583333,187.083333,40.208333,376320,2137,0,0
7,2138,38.833333,70.958333,50.708333,376320,2138,0,0
8,2139,51.375,164.416667,54.166667,376320,2139,0,0
9,2140,42.5,130.25,52.958333,376320,2140,0,0
