<a href="https://colab.research.google.com/github/ashleydabb/IS-4487/blob/main/Lab11_DABB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IS 4487 Module 5 Script

## Objectives

a. Understand model accuracy.  Why is it a performance metric for classification and not regression?
    
b. Calculate accuracy for a simple majority class model (this is the same as calculating the proportion of the majority class in a binary variable). Consider: x = [1, 1, 1, 0, 0]. What is the majority class? What is the proportion of the majority class in x?
    
c. Fit a tree model of the target with just one predictor variable and calculate the accuracy of this model.
    
d. Interpret a tree model, and calculate information gain.
    
e. Fit a tree model of the target using all the predictors, then:  create a visualization of the tree and identify the top 3 most important predictors in this model.
    
f. How do these models compare to majority class prediction?
    
g. How will you use a classification model as part of a solution to the AdviseInvest case?

We will adapt the MegaTelCo demonstration to the 2021 daily AQI by county data.
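
Objective (b) can be worked through in a few lines. The sketch below uses the five-element vector from that objective; `majority_class_accuracy` is a hypothetical helper name chosen for illustration, not a library function.

```python
from collections import Counter

def majority_class_accuracy(values):
    """Return (majority class, its proportion) for a sequence of labels."""
    counts = Counter(values)
    majority, n_majority = counts.most_common(1)[0]
    return majority, n_majority / len(values)

x = [1, 1, 1, 0, 0]
majority, accuracy = majority_class_accuracy(x)
print(majority, accuracy)  # 1 0.6
```

A model that always predicts the majority class is right exactly as often as that class occurs, which is why the proportion doubles as the baseline accuracy.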

## Load Libraries

In this class we will be using:
- pandas
- NumPy
- scikit-learn
- Matplotlib


In [1]:
import pandas as pd
import matplotlib as mpl
import numpy as np

from sklearn.tree import DecisionTreeClassifier, export_graphviz # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
from sklearn import tree


## Getting data into Pandas

In this case we will load the data from a CSV file stored on Google Drive.

See the Canvas assignments and lectures for a description of the MegaTelCo data.

**Note:** you will need to enter a code supplied by Google in the next step. 


In [2]:
from google.colab import drive 
drive.mount('/content/gdrive', force_remount=True)

df = pd.read_csv(r'/content/gdrive/MyDrive/Colab Notebooks/daily_aqi_by_county_2021.csv')

Mounted at /content/gdrive


In [3]:
#look at the top rows
df.head(10) 

Unnamed: 0,State Name,county Name,State Code,County Code,Date,AQI,Category,Defining Parameter,Defining Site,Number of Sites Reporting
0,Alabama,Baldwin,1,3,2021-01-01,27,Good,PM2.5,01-003-0010,1
1,Alabama,Baldwin,1,3,2021-01-04,47,Good,PM2.5,01-003-0010,1
2,Alabama,Baldwin,1,3,2021-01-07,24,Good,PM2.5,01-003-0010,1
3,Alabama,Baldwin,1,3,2021-01-10,39,Good,PM2.5,01-003-0010,1
4,Alabama,Baldwin,1,3,2021-01-13,46,Good,PM2.5,01-003-0010,1
5,Alabama,Baldwin,1,3,2021-01-16,21,Good,PM2.5,01-003-0010,1
6,Alabama,Baldwin,1,3,2021-01-19,52,Moderate,PM2.5,01-003-0010,1
7,Alabama,Baldwin,1,3,2021-01-22,11,Good,PM2.5,01-003-0010,1
8,Alabama,Baldwin,1,3,2021-01-25,39,Good,PM2.5,01-003-0010,1
9,Alabama,Baldwin,1,3,2021-01-28,22,Good,PM2.5,01-003-0010,1


In [4]:
#look at the datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 218196 entries, 0 to 218195
Data columns (total 10 columns):
 #   Column                     Non-Null Count   Dtype 
---  ------                     --------------   ----- 
 0   State Name                 218196 non-null  object
 1   county Name                218196 non-null  object
 2   State Code                 218196 non-null  int64 
 3   County Code                218196 non-null  int64 
 4   Date                       218196 non-null  object
 5   AQI                        218196 non-null  int64 
 6   Category                   218196 non-null  object
 7   Defining Parameter         218196 non-null  object
 8   Defining Site              218196 non-null  object
 9   Number of Sites Reporting  218196 non-null  int64 
dtypes: int64(4), object(6)
memory usage: 16.6+ MB


In [5]:
#describe the data before cleaning it
df.describe()

Unnamed: 0,State Code,County Code,AQI,Number of Sites Reporting
count,218196.0,218196.0,218196.0,218196.0
mean,30.17839,82.098265,39.58425,1.92422
std,15.755914,89.555906,21.547055,2.251967
min,1.0,1.0,0.0,1.0
25%,18.0,25.0,29.0,1.0
50%,30.0,61.0,38.0,1.0
75%,42.0,111.0,47.0,2.0
max,80.0,810.0,2723.0,34.0


## Clean up the data
Did you notice anything unusual about the AQI values? The maximum (2723) is far above the 75th percentile (47).

Clean up the data in a new dataframe named "df_clean". Here we keep only the rows for Salt Lake county.


In [6]:
#keep only the Salt Lake county rows; put them in a new dataframe
#(.copy() avoids a SettingWithCopyWarning when columns are added later)
df_clean = df[df['county Name'] == 'Salt Lake'].copy()

#delete any rows with missing values in the clean dataframe
df_clean = df_clean.dropna() 

df_clean.describe()

Unnamed: 0,State Code,County Code,AQI,Number of Sites Reporting
count,305.0,305.0,305.0,305.0
mean,49.0,35.0,66.406557,7.767213
std,0.0,0.0,31.38647,0.460521
min,49.0,35.0,6.0,5.0
25%,49.0,35.0,44.0,8.0
50%,49.0,35.0,54.0,8.0
75%,49.0,35.0,84.0,8.0
max,49.0,35.0,177.0,8.0


In [7]:
#Convert each defining parameter to a 1/0 indicator column
#PM2.5
df_clean['PM2.5_Parameter'] = np.where(df_clean['Defining Parameter'] == 'PM2.5', 1, 0)
#NO2
df_clean['NO2_Parameter'] = np.where(df_clean['Defining Parameter'] == 'NO2', 1, 0)
#ozone
df_clean['Ozone_Parameter'] = np.where(df_clean['Defining Parameter'] == 'Ozone', 1, 0)
#PM10
df_clean['PM10_Parameter'] = np.where(df_clean['Defining Parameter'] == 'PM10', 1, 0)
#CO
df_clean['CO_Parameter'] = np.where(df_clean['Defining Parameter'] == 'CO', 1, 0)

df_clean.head()

Unnamed: 0,State Name,county Name,State Code,County Code,Date,AQI,Category,Defining Parameter,Defining Site,Number of Sites Reporting,PM2.5_Parameter,NO2_Parameter,Ozone_Parameter,PM10_Parameter,CO_Parameter
193138,Utah,Salt Lake,49,35,2021-01-01,70,Moderate,PM2.5,49-035-4002,7,1,0,0,0,0
193139,Utah,Salt Lake,49,35,2021-01-02,48,Good,PM2.5,49-035-3015,7,1,0,0,0,0
193140,Utah,Salt Lake,49,35,2021-01-03,51,Moderate,PM2.5,49-035-3015,7,1,0,0,0,0
193141,Utah,Salt Lake,49,35,2021-01-04,52,Moderate,PM2.5,49-035-4002,7,1,0,0,0,0
193142,Utah,Salt Lake,49,35,2021-01-05,37,Good,NO2,49-035-2005,7,0,1,0,0,0


## Convert attributes to factors

- Leave
- College
- Reported satisfaction
- Reported usage level
- Considering change of plan

## Fit a basic tree model

Use 'AQI', 'County Code', and the five defining-parameter indicator columns to predict 'Category'. We'll call this `AQI_model`.

What is the accuracy of this model?

In [8]:
# split the dataframe into independent (x) and dependent (predicted) attributes (y)
x = df_clean[['AQI','County Code', 'PM2.5_Parameter', 'NO2_Parameter', 'Ozone_Parameter', 'PM10_Parameter', 'CO_Parameter']]
y = df_clean['Category']

# Create the decision tree classifier
AQI_model = DecisionTreeClassifier()

# Fit the classifier to the data
AQI_model = AQI_model.fit(x,y)


## Preview the tree


In [9]:
AQI_model_text = tree.export_text(AQI_model)
print(AQI_model_text)

|--- feature_0 <= 50.50
|   |--- class: Good
|--- feature_0 >  50.50
|   |--- feature_0 <= 100.50
|   |   |--- class: Moderate
|   |--- feature_0 >  100.50
|   |   |--- feature_0 <= 150.50
|   |   |   |--- class: Unhealthy for Sensitive Groups
|   |   |--- feature_0 >  150.50
|   |   |   |--- class: Unhealthy
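
The printout labels the split variable as `feature_0`, which is the first column of `x`. As a minimal sketch on made-up data (the values and the names `X_toy`, `y_toy`, and `toy_model` are assumptions for illustration), `export_text` accepts a `feature_names` argument that puts the column name in the printout instead:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (made up): one AQI-like feature and a two-class label
X_toy = [[20], [40], [60], [80]]
y_toy = ["Good", "Good", "Moderate", "Moderate"]

toy_model = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# feature_names replaces the generic feature_0 label in the printout
toy_text = export_text(toy_model, feature_names=["AQI"])
print(toy_text)
```

In the notebook itself, the equivalent call would presumably pass `feature_names=list(x.columns)`.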



## Check Accuracy

What is the accuracy of `AQI_model`? Use these steps to calculate accuracy.

Is this overfitted?

In [10]:
pred = AQI_model.predict(x)

#print(pred)

print("Accuracy:",metrics.accuracy_score(y, pred))

Accuracy: 1.0
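
A perfect score is less surprising once the tree thresholds (50.5, 100.5, 150.5) are read as category boundaries: 'Category' appears to be a fixed binning of 'AQI'. The sketch below checks that idea on the five rows shown in the `df_clean.head()` output. The breakpoints are an assumption read off the tree, and the upper bound of 200 is likewise assumed.

```python
import pandas as pd

# AQI values and categories copied from the df_clean.head() output above
sample = pd.DataFrame({
    "AQI": [70, 48, 51, 52, 37],
    "Category": ["Moderate", "Good", "Moderate", "Moderate", "Good"],
})

# Assumed breakpoints, read off the tree thresholds (50.5, 100.5, 150.5)
bins = [-1, 50, 100, 150, 200]
labels = ["Good", "Moderate", "Unhealthy for Sensitive Groups", "Unhealthy"]

# Bin AQI and compare the result with the recorded category
sample["Binned"] = pd.cut(sample["AQI"], bins=bins, labels=labels).astype(str)
matches = (sample["Binned"] == sample["Category"]).all()
print(matches)
```

If the target is a direct function of one predictor, accuracy on the training data says little about overfitting; the model is recovering the rule that defined the labels.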


## Prune the tree

Limit the tree to a maximum depth of 4 levels

In [11]:
# Create the decision tree classifier with an entropy criterion and a depth limit
AQI_model2 = DecisionTreeClassifier(criterion="entropy", max_depth=4)

# Fit the classifier to the data
AQI_model2 = AQI_model2.fit(x,y)
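
With `criterion="entropy"`, splits are scored by information gain: the entropy of the parent node minus the size-weighted entropy of its children. The following is a hand-rolled sketch of that calculation on made-up labels. The helper names `entropy` and `information_gain` are hypothetical, and this is not scikit-learn's internal code.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Made-up labels: a split that separates the two classes perfectly
parent = ["Good", "Good", "Moderate", "Moderate"]
left, right = ["Good", "Good"], ["Moderate", "Moderate"]
gain = information_gain(parent, left, right)
print(gain)  # 1.0
```

A 50/50 parent has one bit of entropy and each pure child has none, so this split gains the full bit.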


## Preview the new tree

In [12]:
AQI_model2_text = tree.export_text(AQI_model2)
print(AQI_model2_text)

|--- feature_0 <= 50.50
|   |--- class: Good
|--- feature_0 >  50.50
|   |--- feature_0 <= 100.50
|   |   |--- class: Moderate
|   |--- feature_0 >  100.50
|   |   |--- feature_0 <= 150.50
|   |   |   |--- class: Unhealthy for Sensitive Groups
|   |   |--- feature_0 >  150.50
|   |   |   |--- class: Unhealthy



## Re-Check Accuracy

Is this accuracy better than making a random guess?  (check the distribution above)

In [13]:
pred = AQI_model2.predict(x)

print("Accuracy:",metrics.accuracy_score(y, pred))

Accuracy: 1.0
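
One way to put a number on the "random guess" comparison is a majority-class baseline. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels; the data and the names `X_dummy`, `y_dummy`, and `baseline` are assumptions for illustration only.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels: 3 "Good" days and 2 "Moderate" days
X_dummy = [[0], [0], [0], [0], [0]]  # features are ignored by this baseline
y_dummy = ["Good", "Good", "Good", "Moderate", "Moderate"]

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_dummy)
baseline_acc = accuracy_score(y_dummy, baseline.predict(X_dummy))
print(baseline_acc)  # 0.6
```

In the notebook, fitting the same baseline on `x` and `y` would give the figure a tree has to beat.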


## Visualize the model

In [14]:
from io import StringIO
from IPython.display import Image
import pydotplus

# Export the pruned tree to DOT format, then render it as a PNG
dot_data = StringIO()
export_graphviz(AQI_model2, out_file=dot_data,
                feature_names=list(x.columns), class_names=list(AQI_model2.classes_),
                filled=True, rounded=True, precision=2)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

## Fit a full tree model

Use all of the independent attributes.  We'll call this the "full tree." 

What is the accuracy of the full tree? 

In [15]:
# split the dataframe into independent (x) and dependent (predicted) attributes (y)
# the MegaTelCo columns from the template are not in this dataset, so use the numeric AQI columns
x = df_clean[['AQI','State Code','County Code','Number of Sites Reporting','PM2.5_Parameter','NO2_Parameter','Ozone_Parameter','PM10_Parameter','CO_Parameter']]
y = df_clean['Category']

# Create the decision tree classifier
full_tree = DecisionTreeClassifier(criterion="entropy", max_depth=1)

# Fit the classifier to the data
full_tree = full_tree.fit(x,y)

## Visualize the full tree

In [None]:
from io import StringIO
from IPython.display import Image
import pydotplus

# Export the full tree to DOT format; class names come from the fitted model
dot_data = StringIO()
export_graphviz(full_tree, out_file=dot_data,
                feature_names=list(x.columns), class_names=list(full_tree.classes_),
                filled=True, rounded=True, precision=2)

graph=pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

## Check Accuracy

In [None]:
pred = full_tree.predict(x)

#print(pred)

print("Accuracy:",metrics.accuracy_score(y, pred))
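
Objective (e) asks for the top 3 most important predictors. A fitted `DecisionTreeClassifier` exposes a `feature_importances_` array that can be paired with column names and sorted. The sketch below runs on synthetic stand-in data; the column names `f0` to `f4` and the variable names are placeholders, not columns from this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the column names f0..f4 are placeholders
X_arr, y_arr = make_classification(n_samples=200, n_features=5,
                                   n_informative=3, n_redundant=0,
                                   random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])

demo_tree = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                                   random_state=0).fit(X_demo, y_arr)

# Pair each importance with its column name and keep the three largest
importances = pd.Series(demo_tree.feature_importances_, index=X_demo.columns)
top3 = importances.sort_values(ascending=False).head(3)
print(top3)
```

Applied to the notebook, the same pattern would use `full_tree.feature_importances_` with `x.columns` as the index.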

## Test and Train
Now we will split the dataset into 80% training data and 20% test data.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
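
Without a fixed seed, each run of `train_test_split` produces a different split, so accuracy figures change between runs. As a sketch on made-up data, `random_state` makes the split repeatable and `stratify` keeps the class proportions the same in both parts. The value 42 and the variable names are arbitrary choices for illustration.

```python
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 15 "Good" and 5 "Moderate"
X_rows = [[i] for i in range(20)]
y_rows = ["Good"] * 15 + ["Moderate"] * 5

# random_state fixes the shuffle; stratify preserves the 75/25 class mix
X_tr, X_te, y_tr, y_te = train_test_split(
    X_rows, y_rows, test_size=0.2, random_state=42, stratify=y_rows
)
n_moderate_test = sum(label == "Moderate" for label in y_te)
print(len(X_tr), len(X_te), n_moderate_test)  # 16 4 1
```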

## Create a new tree using only training data

In [None]:
# Create the decision tree classifier
train_tree = DecisionTreeClassifier(criterion="entropy", max_depth=5)

# Fit the classifier to the training data only
train_tree = train_tree.fit(x_train,y_train)

## Apply the new tree to our test data

In [None]:
pred = train_tree.predict(x_test)
print("Accuracy:",metrics.accuracy_score(y_test, pred))
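
Test accuracy is most informative next to training accuracy, since a large gap between the two is a common sign of overfitting. The sketch below shows the comparison on synthetic, deliberately noisy data with an unpruned tree. The dataset and all variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise (flip_y) so a perfect fit is suspicious
X_syn, y_syn = make_classification(n_samples=300, n_features=6,
                                   n_informative=3, flip_y=0.1,
                                   random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.2,
                                      random_state=0)

# No depth limit: the tree can memorize the training rows
deep = DecisionTreeClassifier(random_state=0).fit(X_a, y_a)
train_acc = accuracy_score(y_a, deep.predict(X_a))
test_acc = accuracy_score(y_b, deep.predict(X_b))
print(train_acc, test_acc)
```

The unpruned tree reaches a training accuracy of 1.0 here; the test figure is the one that estimates performance on new data.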

## Visualize the trained tree

In [None]:
from io import StringIO
from IPython.display import Image
import pydotplus

# Export the trained tree to DOT format; class names come from the fitted model
dot_data = StringIO()
export_graphviz(train_tree, out_file=dot_data,
                feature_names=list(x.columns), class_names=list(train_tree.classes_),
                filled=True, rounded=True, precision=2)

graph=pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
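
Rendering the PNG depends on pydotplus and a Graphviz installation. If either is missing, the DOT source can still be inspected as text, because `export_graphviz` returns a string when `out_file=None`. A minimal sketch on made-up data (the values and variable names are illustrative assumptions):

```python
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Toy data (made up): one AQI-like feature and a two-class label
X_small = [[20], [40], [60], [80]]
y_small = ["Good", "Good", "Moderate", "Moderate"]
small_tree = DecisionTreeClassifier(random_state=0).fit(X_small, y_small)

# out_file=None returns the DOT description as a string instead of writing it
dot_source = export_graphviz(
    small_tree, out_file=None,
    feature_names=["AQI"],
    class_names=[str(c) for c in small_tree.classes_],
    filled=True, rounded=True, precision=2,
)
print(dot_source[:60])
```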

## Did the model improve?
👎  👍