# Classification for Data Exploration -- Decision Trees

Before we start, we need to install a couple of things (in this order), which will make it possible to visualize decision trees. 

### Anaconda Environment check

Make sure your Anaconda environment has **pydotplus** and **graphviz** installed.

If either package is missing, install it from a terminal opened inside Anaconda (Anaconda -> Environments -> base (root) -> Open Terminal) using the following commands:

**conda install -c conda-forge pydotplus**

**conda install -c conda-forge graphviz**


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn as sk
import sklearn.tree as tree
from IPython.display import Image  
import pydotplus

In [None]:
import warnings
# suppress all warnings (not only FutureWarning)
warnings.filterwarnings('ignore')

Data source: https://www.kaggle.com/uciml/adult-census-income

In [None]:
df = pd.read_csv('adult.csv')

In [None]:
df.head()

Let's try to train a decision tree to predict income

In [None]:
# The hyperparameter max_depth controls the overall complexity of a decision tree.
# Let’s build a shallow tree with max_depth=2 for better interpretation
dt = tree.DecisionTreeClassifier(max_depth=???)

In [None]:
# check dt type
type(dt)
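
As a rough illustration of how `max_depth` caps the complexity of a tree, the sketch below fits two trees on the iris data bundled with scikit-learn and compares their depth and leaf count. The iris data is an assumption made only so the example runs without a download; it is not the census data used in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Toy data that ships with scikit-learn (NOT the adult census data)
X_toy, y_toy = load_iris(return_X_y=True)

# One tree limited to two levels of splits, one allowed to grow freely
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)
full = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

print('shallow depth:', shallow.get_depth(), 'leaves:', shallow.get_n_leaves())
print('full depth:', full.get_depth(), 'leaves:', full.get_n_leaves())
```

A tree of depth 2 has at most four leaves, which is what keeps it easy to read.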

### Two problems with this data

<ol>
<li>We have null values
<li>We need numbers only
</ol>

## Cleaning the data set

In [None]:
df.head()

replace ? with NaN

In [None]:
df.replace(to_replace=???,value=???,inplace=???)

In [None]:
# prove that there is no '?' left
sum(df.workclass=='?')
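
The sketch below shows one possible way to turn a placeholder string into a proper missing value. The three-row `toy` table is hypothetical and only mimics two columns of the census data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the census table; '?' marks a missing entry
toy = pd.DataFrame({'workclass': ['Private', '?', 'State-gov'],
                    'occupation': ['Sales', 'Tech-support', '?']})

# Swap every '?' for NaN so that pandas treats it as missing
toy = toy.replace('?', np.nan)

# Count the missing values per column
print(toy.isna().sum())
```

Once the placeholders are NaN, functions such as `isna` and `get_dummies(..., dummy_na=True)` can recognize them.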

remove education (we already have education.num)

In [None]:
# Drop education column permanently
df.drop(columns=???,inplace=???)

In [None]:
# Drop fnlwgt column permanently
df.drop(columns=???,inplace=???)

In [None]:
df.head()

In [None]:
# check the unique values in native.country columns
df['native.country'].???()

In [None]:
# check the number of unique values in native.country columns
df['native.country'].???()

Check how many unique values there are in each column.

In [None]:
# Use a for loop to show the number of unique values in each column
for c in df.???:
    print(c + ' ' + str(df[c].???()) )
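
The difference between listing the distinct values and counting them can be seen on a small made-up table. This is only a sketch; the `toy` frame and the `counts` name are hypothetical.

```python
import pandas as pd

# Hypothetical four-row table with two categorical columns
toy = pd.DataFrame({'sex': ['Male', 'Female', 'Male', 'Male'],
                    'race': ['White', 'Black', 'White', 'Other']})

print(toy['race'].unique())   # the distinct values themselves
print(toy['race'].nunique())  # how many distinct values there are

# Number of distinct values for every column
counts = {c: toy[c].nunique() for c in toy.columns}
print(counts)
```

Columns with many distinct values produce many dummy variables later, so this count is worth checking before encoding.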

make dummy variables for all categorical variables except income

In [None]:
df.shape

In [None]:
# Use get_dummies to create dummy variables for all categorical columns
# For each categorical column, treat nan as a category and generate nan dummy variable
df = pd.get_dummies(df, columns=['workclass','marital.status',
        'occupation','relationship','race','sex','native.country'],
        dummy_na=True)

In [None]:
df.shape
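
To see what `dummy_na=True` adds, the sketch below encodes a hypothetical three-row table in which one `workclass` entry is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical table; the second row has no workclass
toy = pd.DataFrame({'age': [39, 50, 38],
                    'workclass': ['Private', np.nan, 'State-gov']})

# One indicator column per category, plus one for missing values
encoded = pd.get_dummies(toy, columns=['workclass'], dummy_na=True)
print(encoded.columns.tolist())
print(encoded)
```

The extra `workclass_nan` column flags the rows where the value was missing, so those rows keep a signal instead of being all zeros.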

In [None]:
# check the unique values in our outcome
df.income.???()

In [None]:
df

Make income binary

In [None]:
# make a copy of df
df2=df.???()

In [None]:
# write a lambda function and use startswith function to convert >50K to 1.0 and <=50K to 0.0
df2['income1'] = df2.income.apply(lambda x: ???)

In [None]:
# Use the replace function to replace '<=50K' with 0.0 and '>50K' with 1.0
df2['income2'] = df2.income.replace({'<=50K':0.0, '>50K':1.0})

In [None]:
df2.head(10)
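
The two conversion routes should give identical results. The sketch below checks this on a hypothetical three-value series; note that it uses `map` with a lookup table as a stand-in for the `replace` call above.

```python
import pandas as pd

# Hypothetical labels in the same format as the income column
income = pd.Series(['<=50K', '>50K', '<=50K'])

# Option 1: a lambda that tests the leading character of each label
via_lambda = income.apply(lambda label: 1.0 if label.startswith('>') else 0.0)

# Option 2: an explicit lookup table (map is used here in place of replace)
via_map = income.map({'<=50K': 0.0, '>50K': 1.0})

print(via_lambda.tolist())
print(via_map.tolist())
```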

Need to remove duplicated columns

In [None]:
# drop 'income' and 'income2'
df2.drop(columns=(['income', 'income2']), inplace=???)

In [None]:
# rename 'income1' to 'income'
df2.???(columns={'income1':'income'}, inplace=???)

In [None]:
df2.head()

In [None]:
df=df2

In [None]:
df.shape

## Train the decision tree

Make X and Y. Remember to take out the dependent variable from X, or else the classification problem becomes trivial!

In [None]:
# drop outcome from X
X = df.drop(columns=???)

In [None]:
# income column as Y
Y = df.???

Build the tree

In [None]:
# train the decision tree classifier
dt.???(X,Y)
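
The split into features and outcome, followed by training, can be sketched on a made-up table as follows. The six rows and the names `toy`, `X_toy`, `Y_toy`, and `clf` are hypothetical and chosen only for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up rows; income is 1.0 exactly when education.num is 13 or more
toy = pd.DataFrame({'age': [25, 38, 52, 46, 23, 60],
                    'education.num': [9, 13, 14, 10, 9, 16],
                    'income': [0.0, 1.0, 1.0, 0.0, 0.0, 1.0]})

# Features: everything except the outcome column
X_toy = toy.drop(columns='income')
# Outcome: the income column on its own
Y_toy = toy['income']

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, Y_toy)

# Accuracy on the training rows themselves
print(clf.score(X_toy, Y_toy))
```

If `income` were left inside the feature table, the tree would simply split on it and learn nothing about the other attributes.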

Visualize the tree

In [None]:
len(df)

In [None]:
import graphviz
# Visualize the decision tree dt, trained with the attributes in X and the class labels in Y
dt_feature_names = list(X.columns)
# class_names must follow the ascending order of the class labels, which is the order of dt.classes_
dt_target_names = [str(s) for s in dt.classes_]
dot_data = tree.export_graphviz(dt, feature_names=dt_feature_names, class_names=dt_target_names, filled=True)
graph = graphviz.Source(dot_data) 
graph

In [None]:
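dt_feature_names = list(X.columns)
# class_names must follow the ascending order of the class labels, which is the order of dt.classes_
dt_target_names = [str(s) for s in dt.classes_]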
fig = plt.figure(figsize=(25,20))
tree.plot_tree(dt,
           feature_names = dt_feature_names, 
           class_names=dt_target_names,
           filled = True)
plt.show()
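
If neither plotting route works on your machine, a plain-text dump of the tree is a possible fallback that needs no graphviz installation. The sketch below uses `export_text` from `sklearn.tree` on the bundled iris data, which is an assumption made so the example runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data that ships with scikit-learn (NOT the adult census data)
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Render the split rules as indented text
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indentation level corresponds to one level of the tree, and the `class:` lines are the leaves.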

### Note 1: 
For **macOS** users who have trouble installing graphviz, please try the following steps. (First check your macOS version. If it is still 10.x, please update macOS to 11.x.)

1. Go to this website: https://brew.sh 
2. Once you launch the website, you will see a line of code under **Install Homebrew**
3. Copy the code <i>**/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"**</i>
into your computer’s terminal (NOT the terminal launched through Anaconda) and run it
4. After installing Homebrew successfully, run <i>**brew install graphviz**</i> in your computer’s terminal
5. If the installation in step 4 does not work, restart the computer

If restarting the computer in step 5 does not get graphviz working, please try the steps below:

6. Go to your computer’s terminal and run <i>**brew uninstall graphviz**</i>
7. Install graphviz into Anaconda instead: 
Open Anaconda -> Environments -> Base(root) -> Open Terminal -> run the following code: <i>**pip install graphviz**</i>
8. If the installation in step 7 does not work, restart the computer


- If restarting the computer in step 8 does not get graphviz working, please contact the TA. 

Note: 
- For macOS 10.15.6+, the error "GraphViz's executables not found" can be solved by installing graphviz through Homebrew (steps 1-5)
- macOS Catalina (10.15.6) users need to take steps 6-8 after trying steps 1-5
- macOS 10.13 is too old for a direct Homebrew installation; update macOS before taking any of the steps above.

### Note 2:

For **Microsoft Windows**, if you see the error **'GraphViz's executables not found.'**, first find the ACTUAL_DIRECTORY where the graphviz package was installed on your C drive. Then do one of the following:

1. Modify the Windows system PATH environment variable to add the ACTUAL_DIRECTORY (for example: C:\Users\ttan\Anaconda3\Library\bin\graphviz\bin), then restart Windows. 

A detailed example is available here: https://geek-university.com/python/add-python-to-the-windows-path/ or https://www.youtube.com/watch?v=q7PzqbKUm_4

Or...

2. Add the following two lines at the beginning of this code section.

<i>**import os**</i>

<i>**os.environ["PATH"] += os.pathsep + ACTUAL_DIRECTORY**</i>

For example (the directory must be a quoted string; the r prefix keeps the backslashes literal): <i>**os.environ["PATH"] += os.pathsep + r"C:\Users\ttan\Anaconda3\Library\bin\graphviz\bin"**</i>
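
The PATH workaround can be sketched as a small self-checking snippet. The directory `C:\graphviz\bin` is a hypothetical placeholder, so substitute your own ACTUAL_DIRECTORY; `shutil.which` is used here only to check whether the graphviz `dot` executable can be found afterwards.

```python
import os
import shutil

# Hypothetical install location; replace with your ACTUAL_DIRECTORY
graphviz_bin = r"C:\graphviz\bin"

# Append the directory to PATH for the current Python session only
os.environ["PATH"] = os.environ.get("PATH", "") + os.pathsep + graphviz_bin

# Look up the graphviz 'dot' executable; None means it is still not visible
dot_path = shutil.which("dot")
print("dot found at:", dot_path)
```

This change lasts only for the running kernel, so it has to be executed again after every restart.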

**Note for users whose Windows user name contains a space:**

The above workaround may not work for you. Here are some extra steps you need to take:
* Uninstall the graphviz package from Anaconda by running <i>**conda remove graphviz**</i> from the command line.
* Reinstall pydotplus with <i>**conda install pydotplus**</i>
* Manually install the graphviz package with the .msi file from https://graphviz.gitlab.io/download/ Use the Windows host, stable version (for example, graphviz-2.38.msi).
* When executing the .msi file, specify a directory that is NOT under your user name (example: c:\graphviz).
* After the installation has completed, add the newly installed graphviz directory to your Windows PATH environment variable (see the YouTube video linked above).
* Restart Anaconda. Your graphviz search path issue should now be resolved.


## Validating the finding

In [None]:
sns.catplot(x='education.num', y='income', hue='marital.status_Married-civ-spouse', kind='point', data=df, aspect=2)

Education.num = 12.5 is a good threshold for separating low income from high income, but it is more effective for people in the Married-civ-spouse category (right part of the tree).
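
A threshold read off the tree can also be checked numerically by comparing the share of high earners on each side of it. The sketch below does this with `groupby` on made-up numbers that do NOT come from the census data; the column names `married` and `side` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'education.num': [9, 10, 13, 14, 9, 10, 13, 14],
    'married': [0, 0, 0, 0, 1, 1, 1, 1],
    'income': [0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0],
})

# Label each row by the side of the threshold it falls on
toy['side'] = np.where(toy['education.num'] > 12.5, 'above', 'below')

# Mean of a 0/1 outcome is the share of high earners in each group
rates = toy.groupby(['married', 'side'])['income'].mean()
print(rates)
```

On the real data, the same grouping would use `marital.status_Married-civ-spouse` in place of the hypothetical `married` column.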


In [None]:
df2 = df.copy()

In [None]:
df2['capital.gain'].value_counts().sort_index(ascending=False)

In [None]:
sns.catplot(x='capital.gain',y='income',data=df2,aspect=3,kind='bar', hue='marital.status_Married-civ-spouse')

In [None]:
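# include_lowest=True keeps rows with capital.gain == 0 in the first bin instead of turning them into NaN
df2['binned_capital_gain'] = pd.cut(df2['capital.gain'],bins=[0,7073,100000],include_lowest=True)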

In [None]:
sns.catplot(x='binned_capital_gain',y='income',data=df2,aspect=3,kind='bar', hue='marital.status_Married-civ-spouse')

Capital.gain = 7073 is a good threshold for separating low income from high income, but it is more effective for people NOT in the Married-civ-spouse category (left part of the tree).
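
One detail of `pd.cut` matters for the `binned_capital_gain` column: by default the first bin is open on the left, so a value equal to the lowest edge falls outside every bin. The sketch below shows the effect on five made-up capital gain values.

```python
import pandas as pd

# Made-up capital gain values, including a zero
gains = pd.Series([0, 5000, 7073, 8000, 99999])
bins = [0, 7073, 100000]

# Default: intervals (0, 7073] and (7073, 100000], so 0 gets no bin
default_bins = pd.cut(gains, bins=bins)

# include_lowest=True closes the first interval on the left
closed_bins = pd.cut(gains, bins=bins, include_lowest=True)

print('unbinned rows (default):', default_bins.isna().sum())
print('unbinned rows (include_lowest):', closed_bins.isna().sum())
```

Rows without a bin are silently left out of a bar plot grouped on the binned column, which can change the picture considerably when many values sit exactly on the lowest edge.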
