<a href="https://colab.research.google.com/github/code4tomorrow/machine-learning/blob/main/2_intermediate/chapter5/decision_trees.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Decision Trees**

In this notebook, we will apply Decision Trees to analyze admission rates. Many of us are headed to college after school, so this is a relevant topic to analyze! In particular, we are studying graduate admissions and trying to predict whether a particular student will be admitted based on features such as their GRE score and GPA.

### **Imports**

Here, we import the Pandas and NumPy libraries, and then the Decision Tree Classifier.

In [None]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier

### **Data Import**

In [None]:
from google.colab import files
files.upload()

Saving College_admission.csv to College_admission.csv


{'College_admission.csv': b'admit,gre,gpa,ses,Gender_Male,Race,rank\r\n0,380,3.61,1,0,3,3\r\n1,660,3.67,2,0,2,3\r\n1,800,4,2,0,2,1\r\n1,640,3.19,1,1,2,4\r\n0,520,2.93,3,1,2,4\r\n1,760,3,2,1,1,2\r\n1,560,2.98,2,1,2,1\r\n0,400,3.08,2,0,2,2\r\n1,540,3.39,1,1,1,3\r\n0,700,3.92,1,0,2,2\r\n0,800,4,1,1,1,4\r\n0,440,3.22,3,0,2,1\r\n1,760,4,3,1,2,1\r\n0,700,3.08,2,0,2,2\r\n1,700,4,2,1,1,1\r\n0,480,3.44,3,0,1,3\r\n0,780,3.87,2,0,3,4\r\n0,360,2.56,3,1,3,3\r\n0,800,3.75,1,1,3,2\r\n1,540,3.81,1,0,3,1\r\n0,500,3.17,3,0,2,3\r\n1,660,3.63,1,0,1,2\r\n0,600,2.82,1,0,3,4\r\n0,680,3.19,1,0,1,4\r\n1,760,3.35,2,0,2,2\r\n1,800,3.66,2,1,1,1\r\n1,620,3.61,2,0,1,1\r\n1,520,3.74,2,0,3,4\r\n1,780,3.22,1,0,1,2\r\n0,520,3.29,1,0,1,1\r\n0,540,3.78,1,1,1,4\r\n0,760,3.35,2,1,1,3\r\n0,600,3.4,3,0,1,3\r\n1,800,4,3,0,1,3\r\n0,360,3.14,1,1,2,1\r\n0,400,3.05,3,0,2,2\r\n0,580,3.25,1,0,2,1\r\n0,520,2.9,2,0,2,3\r\n1,500,3.13,2,0,2,2\r\n1,520,2.68,2,0,1,3\r\n0,560,2.42,1,1,3,2\r\n1,580,3.32,1,0,1,2\r\n1,600,3.15,2,1,1,2\r\n0,5

In [None]:
dataDf = pd.read_csv("College_admission.csv")  # Read the file that was uploaded in the previous cell
dataDf.head()  # Show the first five rows to get a feel for what the data set looks like

Unnamed: 0,admit,gre,gpa,ses,Gender_Male,Race,rank
0,0,380,3.61,1,0,3,3
1,1,660,3.67,2,0,2,3
2,1,800,4.0,2,0,2,1
3,1,640,3.19,1,1,2,4
4,0,520,2.93,3,1,2,4


### **Cleaning up Data**

Data Engineering is a huge topic in itself, so in this course we will largely avoid problematic data. In the following cell, we remove any rows that contain empty cells. Feel free to try this on another data set if you'd like to.

In [None]:
dataDf = dataDf.dropna() #Removes rows with Null Values.
dataDf.head()

Unnamed: 0,admit,gre,gpa,ses,Gender_Male,Race,rank
0,0,380,3.61,1,0,3,3
1,1,660,3.67,2,0,2,3
2,1,800,4.0,2,0,2,1
3,1,640,3.19,1,1,2,4
4,0,520,2.93,3,1,2,4
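
To see what `dropna()` actually does, here is a minimal sketch on a tiny made-up table (the `toy` data below is hypothetical and is not taken from the admissions file). It counts the missing values per column and then compares the row count before and after cleaning.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the admissions file: the second row has a missing gpa.
toy = pd.DataFrame({
    "gre": [380, 660, 800],
    "gpa": [3.61, np.nan, 4.0],
})

print(toy.isna().sum())   # number of missing values in each column
cleaned = toy.dropna()    # keep only the rows that have no missing values
print(len(toy), "rows before,", len(cleaned), "rows after")
```

If the row count is the same before and after, the data set had no missing values to begin with.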


### **Train-Test Split**

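First, we split the data into train and test datasets: each row has roughly an 80% chance of landing in the training set, and the remaining rows form the test set. Refer to the notebooks from the previous class on Regression if required.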

In [12]:
msk = np.random.rand(len(dataDf))<0.8
train = dataDf[msk]
test = dataDf[~msk]

In [13]:
train_x = train[["gre","gpa","ses","Gender_Male","Race"]]  # Input features (x). Note that "rank" is not included here
train_y = train["admit"]  # Output (y): the admit column is what we want to predict
test_x = test[["gre","gpa","ses","Gender_Male","Race"]]
test_y = test["admit"]
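
The random mask above gives a split that is only approximately 80/20 and that changes on every run. As an alternative sketch, scikit-learn's `train_test_split` produces an exact split and can be made repeatable with `random_state`. The `toy` DataFrame and the `split_*` names below are hypothetical stand-ins so that the example runs on its own; with the real data you would pass the columns of `dataDf` instead.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for dataDf, using a few of the same column names.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "admit": rng.integers(0, 2, size=100),
    "gre": rng.integers(300, 801, size=100),
    "gpa": rng.uniform(2.0, 4.0, size=100).round(2),
})

features = ["gre", "gpa"]
split_x_train, split_x_test, split_y_train, split_y_test = train_test_split(
    toy[features],      # input features (x)
    toy["admit"],       # output (y)
    test_size=0.2,      # hold out 20% of the rows for testing
    random_state=42,    # fixed seed so the split is the same on every run
)
print(len(split_x_train), "training rows,", len(split_x_test), "test rows")
```

Fixing the seed makes accuracy scores comparable between runs, which is helpful when you change one thing at a time (for example, the set of features).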

### **Categorical Variables**

Decision Trees, as implemented in Sklearn, do not handle Categorical Variables directly. Categorical Variables are ones that can only take values from a specific list, rather than any number. So, in order to apply Decision Trees, we need to convert any Categorical Variables to Numerical Variables.

Thankfully, this data set is already prepped for Decision Trees, since all variables contain numerical values. If any of them had textual values, we would need to convert them as shown in the commented-out code below.

In [15]:
# from sklearn import preprocessing
# NumberConverter = preprocessing.LabelEncoder()
# possible_values = [...]  # this list contains all possible values for a particular feature
# NumberConverter.fit(possible_values)
# numberFeature = NumberConverter.transform(categorical_value)  # categorical_value holds the textual values to be converted to numbers
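
As a runnable illustration of the conversion described above, here is a small sketch using `LabelEncoder` on a made-up text feature. The `residency` list is hypothetical; the admissions data has no such column.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical text-valued feature; not part of the admissions data set.
residency = ["in_state", "out_of_state", "international", "in_state"]

encoder = LabelEncoder()
encoder.fit(residency)                 # learns the sorted list of distinct values
codes = encoder.transform(residency)   # replaces each value with its integer code

print(list(encoder.classes_))  # the distinct values, in the order of their codes
print(list(codes))             # the encoded feature
```

Keep in mind that integer codes impose an arbitrary order on the categories. scikit-learn documents `LabelEncoder` as intended for target labels and offers `OrdinalEncoder` and `OneHotEncoder` for input features, so those may be a better fit in practice.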

### **Training**

Now, we will train our Decision Tree to predict whether a student is admitted.

In [20]:
DecisionTreeInstance = DecisionTreeClassifier(criterion="entropy") #Creating an instance of a Decision Tree Classifier from Sklearn

Then, we can train the Classifier on the training data set.

In [21]:
DecisionTreeInstance.fit(train_x, train_y)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
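
With `criterion="entropy"`, the tree prefers splits that reduce the entropy (the "mixedness") of the labels the most. The sketch below works through that idea by hand on made-up labels. The `entropy` helper and the `parent`, `left`, and `right` lists are hypothetical illustrations, not part of Sklearn's API or of the admissions data.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()              # proportion of each class
    return float(-(p * np.log2(p)).sum())

# Hypothetical node with 3 admitted (1) and 5 rejected (0) students,
# and one candidate split of it into two child nodes.
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left = [1, 1, 1, 0]
right = [0, 0, 0, 0]

# Entropy of the children, weighted by how many samples each one receives.
weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
info_gain = entropy(parent) - weighted     # how much the split reduces entropy

print("parent entropy:", round(entropy(parent), 3))
print("information gain:", round(info_gain, 3))
```

A node whose labels are all the same has entropy 0, and an evenly mixed two-class node has entropy 1, so a larger information gain means the split separates the classes better.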

### **Predictions and Accuracy Test**

We use the following syntax to predict the output from the Test Input Data.

In [22]:
predictions = DecisionTreeInstance.predict(test_x)

Here, we apply the Accuracy Score to see how good this model is.

In [24]:
from sklearn import metrics
print(metrics.accuracy_score(test_y,predictions))

0.5625


The accuracy score gets closer to 1 as the model's predictions improve. As you can see, this is not a very good accuracy score, so we may want to try another algorithm on this data. Because the train-test split is random, the exact score will vary from run to run.
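
To put an accuracy score in context, it helps to compare it with a simple baseline: always predicting the most common class. The sketch below uses made-up labels and predictions (`y_true` and `y_pred` are hypothetical, not the notebook's results) to compute accuracy by hand, check it against `accuracy_score`, and compare it with that baseline.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and model predictions, for illustration only.
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 1, 0, 0, 0, 1, 1])

# Accuracy is simply the fraction of predictions that match the true labels.
manual_accuracy = (y_true == y_pred).mean()

# Baseline: always predict whichever class is most common in the labels.
majority_class = np.bincount(y_true).argmax()
baseline_accuracy = (y_true == majority_class).mean()

print("model accuracy:   ", manual_accuracy)
print("sklearn accuracy: ", accuracy_score(y_true, y_pred))
print("baseline accuracy:", baseline_accuracy)
```

If a model's accuracy is no higher than the baseline, it has not learned anything more useful than "guess the most common answer".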

In the following cells, repeat what you have seen above, starting from the Train-Test Split, but leave out one feature when selecting the input columns. See if this makes your accuracy better or worse. Why do you think that is?