# Decision Tree Classification

## CART (Classification And Regression Tree)

CART (Classification And Regression Tree) is a variation of the decision tree algorithm that can handle both classification and regression tasks. Scikit-Learn uses the CART algorithm to train Decision Trees (also called “growing” trees). CART was introduced by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone in 1984.

### CART Algorithm
CART is a predictive algorithm used in machine learning. It explains how the values of a target variable can be predicted from other variables. The result is a decision tree in which each fork is a split on a predictor variable and each leaf node at the end holds a prediction for the target variable.

In the decision tree, nodes are split into sub-nodes on the basis of a threshold value of an attribute. The root node holds the whole training set and is split into two by choosing the best attribute and threshold value. The resulting subsets are then split using the same logic. This continues until every subset is pure (contains a single class) or the tree reaches a limit such as the maximum number of leaves allowed.

The CART algorithm works via the following process:

1. Find the best split point for each input variable.
2. Compare the best split points found in Step 1 and pick the overall “best” one.
3. Split the data on the chosen input at that “best” split point.
4. Repeat on each resulting subset until a stopping rule is satisfied or no further worthwhile split is available.
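
The process above can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: it assumes numeric (already encoded) features, uses Gini impurity (described in the next section) to rank splits, and uses a depth limit as its only stopping rule besides purity. The helper names `gini`, `best_split`, `grow`, and `predict` are hypothetical, and the ten rows are the encoded salary rows shown later in this notebook.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return (weighted_gini, feature, threshold) of the best split, or None."""
    best = None
    for feature in range(len(rows[0])):
        # Candidate thresholds are midpoints between consecutive distinct values.
        values = sorted({row[feature] for row in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            # Keep the best split across all inputs (lowest weighted impurity).
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best

def grow(rows, labels, depth=0, max_depth=5):
    """Recursively grow a tree; a leaf is simply the majority class label."""
    split = None
    if len(set(labels)) > 1 and depth < max_depth:  # stopping rules
        split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    _, feature, threshold = split
    left = [i for i, row in enumerate(rows) if row[feature] <= threshold]
    right = [i for i, row in enumerate(rows) if row[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": grow([rows[i] for i in left], [labels[i] for i in left], depth + 1, max_depth),
        "right": grow([rows[i] for i in right], [labels[i] for i in right], depth + 1, max_depth),
    }

def predict(tree, row):
    """Follow the splits from the root until a leaf label is reached."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Columns: company_n, job_n, degree_n (first ten rows of the encoded inputs)
rows = [[2, 2, 0], [2, 2, 1], [2, 0, 0], [2, 0, 1], [2, 1, 0],
        [2, 1, 1], [0, 2, 1], [0, 1, 0], [0, 0, 0], [0, 0, 1]]
labels = [0, 0, 1, 1, 0, 1, 0, 0, 0, 1]

tree = grow(rows, labels)
predictions = [predict(tree, row) for row in rows]
print(predictions)
```

Because every row here is distinct and the depth limit is generous, the sketch reproduces its own training labels, which mirrors the perfect training score reported for the scikit-learn model further down.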

### Gini index/Gini impurity
The Gini index (also called Gini impurity) is a split-quality metric for classification tasks in CART. It is computed as one minus the sum of the squared probabilities of each class. It measures how likely a randomly chosen element is to be misclassified if it is labeled at random according to the class distribution of the node. It is related to, but not the same as, the Gini coefficient. It applies to a categorical target, and CART uses it to choose binary splits only.

The Gini index ranges from 0 up to (but never reaching) 1:

- A value of 0 means that all the elements belong to a single class, or only one class exists there.
- A value of 0.5 means the elements are evenly divided between two classes, which is the maximum for a two-class problem.
- Values approaching 1 mean the elements are spread evenly across many classes; for $K$ classes the maximum is $1 - 1/K$.

Mathematically, for a node with $K$ classes where $p_i$ is the fraction of elements belonging to class $i$, we can write Gini impurity as follows:

$$Gini = 1 - \sum_{i=1}^{K} p_i^2$$
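
As a quick check of the definition (one minus the sum of squared class fractions), here is a minimal sketch in plain Python. The function name `gini_impurity` is hypothetical, and treating an empty node as pure is an assumption made only to avoid dividing by zero. The last example uses the 16 target values printed later in this notebook.

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p_i^2), where p_i is the fraction of each class in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumption: an empty node is treated as pure
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

pure = gini_impurity([1, 1, 1, 1])           # single class: 0.0
balanced = gini_impurity([0, 1, 0, 1])       # two classes, evenly split: 0.5
three_way = gini_impurity(["a", "b", "c"])   # three classes, evenly split: 1 - 1/3

# Target column from this notebook: 6 zeros and 10 ones
target_values = [0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
target_gini = gini_impurity(target_values)   # 1 - (6/16)^2 - (10/16)^2 = 0.46875

print(pure, balanced, round(three_way, 3), target_gini)
```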

In [1]:
import pandas as pd

In [4]:
df = pd.read_csv("Data/salaries.csv")
df.head()

|   | company | job | degree | salary_more_then_100k |
|---|---|---|---|---|
| 0 | google | sales executive | bachelors | 0 |
| 1 | google | sales executive | masters | 0 |
| 2 | google | business manager | bachelors | 1 |
| 3 | google | business manager | masters | 1 |
| 4 | google | computer programmer | bachelors | 0 |


In [5]:
inputs = df.drop('salary_more_then_100k',axis='columns')

In [6]:
target = df['salary_more_then_100k']

In [7]:
from sklearn.preprocessing import LabelEncoder
le_company = LabelEncoder()
le_job = LabelEncoder()
le_degree = LabelEncoder()

In [8]:
inputs['company_n'] = le_company.fit_transform(inputs['company'])
inputs['job_n'] = le_job.fit_transform(inputs['job'])
inputs['degree_n'] = le_degree.fit_transform(inputs['degree'])
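
To make the integer codes easier to read, here is a minimal pure-Python sketch of the idea behind label encoding. It assumes codes are assigned in sorted order of the distinct values, which is how `LabelEncoder` orders its classes; `label_encode` is a hypothetical helper written for illustration and is not part of scikit-learn.

```python
def label_encode(values):
    """Map each distinct value to an integer code, following sorted order."""
    classes = sorted(set(values))
    code_of = {value: code for code, value in enumerate(classes)}
    return [code_of[value] for value in values], classes

jobs = ["sales executive", "sales executive", "business manager", "computer programmer"]
job_codes, job_classes = label_encode(jobs)
print(job_classes)  # ['business manager', 'computer programmer', 'sales executive']
print(job_codes)    # [2, 2, 0, 1]

degrees = ["bachelors", "masters", "bachelors"]
degree_codes, degree_classes = label_encode(degrees)
print(degree_codes)  # [0, 1, 0]
```

Under this ordering, "computer programmer" maps to 1 and "bachelors" maps to 0, which is why those numbers appear in the `job_n` and `degree_n` columns and in the prediction queries further down.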

In [9]:
inputs

|   | company | job | degree | company_n | job_n | degree_n |
|---|---|---|---|---|---|---|
| 0 | google | sales executive | bachelors | 2 | 2 | 0 |
| 1 | google | sales executive | masters | 2 | 2 | 1 |
| 2 | google | business manager | bachelors | 2 | 0 | 0 |
| 3 | google | business manager | masters | 2 | 0 | 1 |
| 4 | google | computer programmer | bachelors | 2 | 1 | 0 |
| 5 | google | computer programmer | masters | 2 | 1 | 1 |
| 6 | abc pharma | sales executive | masters | 0 | 2 | 1 |
| 7 | abc pharma | computer programmer | bachelors | 0 | 1 | 0 |
| 8 | abc pharma | business manager | bachelors | 0 | 0 | 0 |
| 9 | abc pharma | business manager | masters | 0 | 0 | 1 |


In [10]:
inputs_n = inputs.drop(['company','job','degree'],axis='columns')

In [11]:
inputs_n

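|   | company_n | job_n | degree_n |
|---|---|---|---|
| 0 | 2 | 2 | 0 |
| 1 | 2 | 2 | 1 |
| 2 | 2 | 0 | 0 |
| 3 | 2 | 0 | 1 |
| 4 | 2 | 1 | 0 |
| 5 | 2 | 1 | 1 |
| 6 | 0 | 2 | 1 |
| 7 | 0 | 1 | 0 |
| 8 | 0 | 0 | 0 |
| 9 | 0 | 0 | 1 |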


In [12]:
target

0     0
1     0
2     1
3     1
4     0
5     1
6     0
7     0
8     0
9     1
10    1
11    1
12    1
13    1
14    1
15    1
Name: salary_more_then_100k, dtype: int64

In [13]:
from sklearn import tree
model = tree.DecisionTreeClassifier()

In [14]:
model.fit(inputs_n, target)

DecisionTreeClassifier()

In [15]:
model.score(inputs_n,target)

1.0

## Queries

Is the salary of a Google computer programmer with a bachelor's degree more than 100k?

In [16]:
model.predict([[2,1,0]])

array([0])

Is the salary of a Google computer programmer with a master's degree more than 100k?

In [17]:
model.predict([[2,1,1]])

array([1])

# Exercise

**Exercise: Build a decision tree model to predict survival based on certain parameters**

The CSV file is available in the Data folder.

##### Using the following columns in this file, build a model to predict whether a person would survive or not:

1. Pclass
1. Sex
1. Age
1. Fare

##### Calculate the score of your model