<div class="alert alert-block alert-info" style="margin-top: 20px"><h1>Decision Tree</h1>
<code>By AKSHAY KASHYAP</code></div>

<hr><div class="alert alert-block alert-danger" style="margin-top: 20px"><h3>Decision Tree</h3></div>
<hr>
<div>
A <code>decision tree</code> is a popular supervised machine learning algorithm used for both<br>
classification and regression tasks. It works by recursively <code>partitioning</code> the data into<br>
subsets based on the most <code>significant attribute</code> at each step, i.e. the one whose split best separates the target values.<hr>
<ul>
    <li><code>Tree Structure</code>: A decision tree consists of internal nodes (each testing a feature/attribute),<br>
        branches (representing decision rules), and leaf nodes (representing the outcome or decision).</li>
    <li><code>Splitting Criteria</code>: At each node of the tree, the algorithm selects the attribute that<br>
        splits the data into the purest possible subsets. Purity is typically measured<br>
        using metrics like Gini impurity or entropy for classification, and variance reduction for regression.</li>
</ul>
</div><hr>
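To make the splitting criteria concrete, here is a minimal sketch of Gini impurity and entropy computed from a list of class labels. The helper names (`gini`, `entropy`, `weighted_impurity`) are illustrative and not part of `sklearn`; they only mirror the textbook formulas.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the classes present in `labels`."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def weighted_impurity(left, right, measure=gini):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)


parent = [0, 0, 1, 1]
print(gini(parent))                        # 0.5
print(entropy(parent))                     # 1.0
print(weighted_impurity([0, 0], [1, 1]))   # 0.0, a perfectly pure split
```

A split is considered good when the weighted impurity of the children is much lower than the impurity of the parent.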

from `sklearn` import `tree`
__________

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder  # encodes categorical values as integer labels
from sklearn import tree  # decision tree models
from sklearn.model_selection import train_test_split  # splits data into training and test sets

In [2]:
df = pd.read_csv("ML_data/sal.csv")
df.head()

Unnamed: 0,company,job,degree,salary_more_then_100k
0,google,sales executive,bachelors,0
1,google,sales executive,masters,0
2,google,business manager,bachelors,1
3,google,business manager,masters,1
4,google,computer programmer,bachelors,0


`NOTE:` this is not real-world data.

In [3]:
inputs = df.drop(['salary_more_then_100k'], axis=1)  # axis=1 drops a column, not a row
inputs.head()

Unnamed: 0,company,job,degree
0,google,sales executive,bachelors
1,google,sales executive,masters
2,google,business manager,bachelors
3,google,business manager,masters
4,google,computer programmer,bachelors


In [4]:
target = df['salary_more_then_100k']

In [5]:
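# create one LabelEncoder per categorical column
le_company = LabelEncoder()
le_job = LabelEncoder()
le_degree = LabelEncoder()
# a single encoder could be reused for all 3 columns,
# but it would only remember the classes of the last column it was fitted on
# le = LabelEncoder()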

In [6]:
# fit each encoder on its column and store the integer labels in a new column

inputs['company_n'] = le_company.fit_transform(inputs['company'])
inputs['job_n'] = le_job.fit_transform(inputs['job'])
inputs['degree_n'] = le_degree.fit_transform(inputs['degree'])

`Note:` we can do this in a single line of code too<hr>

inputs[[ 'company_n', 'job_n', 'degree_n' ]] = inputs[[ 'company', 'job', 'degree' ]].apply( `LabelEncoder().fit_transform` )
<hr>

In [7]:
inputs

Unnamed: 0,company,job,degree,company_n,job_n,degree_n
0,google,sales executive,bachelors,2,2,0
1,google,sales executive,masters,2,2,1
2,google,business manager,bachelors,2,0,0
3,google,business manager,masters,2,0,1
4,google,computer programmer,bachelors,2,1,0
5,google,computer programmer,masters,2,1,1
6,abc pharma,sales executive,masters,0,2,1
7,abc pharma,computer programmer,bachelors,0,1,0
8,abc pharma,business manager,bachelors,0,0,0
9,abc pharma,business manager,masters,0,0,1


_____
`LabelEncoder` assigns integer labels in `alphabetical order` (the sorted order of each column's unique values)
_____________________
|company|label|job|label|degree|label|
|-|-|-|-|-|-|
|abc pharma|0|business manager|0|bachelors|0|
|facebook|1|computer programmer|1|masters|1|
|google|2|sales executive|2| | |
__________
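The ordering can be checked directly on a fitted encoder. The sketch below uses a small hand-written list of company names rather than the CSV, and relies on the encoder's `classes_` attribute and `inverse_transform` method.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["google", "abc pharma", "facebook", "google"])

# classes_ holds the sorted unique values; each code is an index into it
print(le.classes_.tolist())                    # ['abc pharma', 'facebook', 'google']
print(codes.tolist())                          # [2, 0, 1, 2]

# inverse_transform maps codes back to the original strings
print(le.inverse_transform([2, 0]).tolist())   # ['google', 'abc pharma']
```

Note that these integers carry no real ordering between companies; the tree simply splits on thresholds over them.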

In [8]:
inputs.drop(['company', 'job', 'degree'], axis='columns', inplace = True)

In [9]:
inputs.head()

Unnamed: 0,company_n,job_n,degree_n
0,2,2,0
1,2,2,1
2,2,0,0
3,2,0,1
4,2,1,0


In [10]:
model = tree.DecisionTreeClassifier()

In [11]:
model.fit(inputs,target)

Is the salary of a Google computer programmer with a bachelors degree > 100k?

In [12]:
model.predict([[2,1,0]]) # encoded labels: company=2 (google), job=1 (computer programmer), degree=0 (bachelors)



array([0], dtype=int64)
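Typing the encoded numbers by hand is easy to get wrong. A possible alternative, sketched below, is to let the fitted encoders translate the raw strings. The helper `encode_query` is hypothetical, and the encoders are refitted here on the categories from the label table above only so that the block runs on its own; in the notebook the already fitted `le_company`, `le_job` and `le_degree` could be used instead.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Categories taken from the label table; fitting on them reproduces the same codes.
le_company = LabelEncoder().fit(["abc pharma", "facebook", "google"])
le_job = LabelEncoder().fit(["business manager", "computer programmer", "sales executive"])
le_degree = LabelEncoder().fit(["bachelors", "masters"])


def encode_query(company, job, degree):
    """Hypothetical helper: turn raw strings into a one-row frame of encoded features."""
    return pd.DataFrame({
        "company_n": le_company.transform([company]),
        "job_n": le_job.transform([job]),
        "degree_n": le_degree.transform([degree]),
    })


query = encode_query("google", "computer programmer", "bachelors")
print(query.values.tolist())  # [[2, 1, 0]]
```

Because the frame carries the same column names the model was trained on, passing it to `model.predict(query)` should also avoid the feature-name warning that recent scikit-learn versions emit for plain lists.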

Is the salary of a Google computer programmer with a masters degree > 100k?

In [13]:
model.predict([[2,1,1]]) # encoded labels: company=2 (google), job=1 (computer programmer), degree=1 (masters)



array([1], dtype=int64)
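To see why a tree predicts what it does, the learned rules can be printed with `tree.export_text`. The sketch below uses a few made-up encoded rows, not the contents of `sal.csv`, so the rules it prints are only illustrative.

```python
import pandas as pd
from sklearn import tree

# Hypothetical encoded rows, for illustration only (not the contents of sal.csv).
X = pd.DataFrame({
    "company_n": [2, 2, 2, 2, 0, 0],
    "job_n":     [2, 2, 0, 0, 2, 0],
    "degree_n":  [0, 1, 0, 1, 0, 1],
})
y = [0, 0, 1, 1, 0, 1]  # made-up target: 1 only when job_n == 0

clf = tree.DecisionTreeClassifier(random_state=0).fit(X, y)

# export_text renders the learned rules as indented if/else text.
rules = tree.export_text(clf, feature_names=list(X.columns))
print(rules)
```

In this toy case only `job_n` separates the classes cleanly, so the tree needs a single split on it.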

__________
Let's try to predict survival on the `titanic dataset` using (`sex`, `class`, `age`, `fare`)
________

In [14]:
titan = pd.read_csv('ML_data/titanic.csv')

In [15]:
titan

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


_______________
As per our question, we need only `4 columns` + `1 target column`
______________

In [16]:
inputs = titan[['Fare','Pclass','Sex','Age']].copy()  # copy, so later column assignments do not act on a slice of titan
inputs.head()

Unnamed: 0,Fare,Pclass,Sex,Age
0,7.25,3,male,22.0
1,71.2833,1,female,38.0
2,7.925,3,female,26.0
3,53.1,1,female,35.0
4,8.05,3,male,35.0


In [17]:
target = titan['Survived']

In [18]:
# 'Sex' is the only non-numeric column; map it to integers
inputs['Sex'] = inputs['Sex'].map({'male': 1, 'female': 2})

In [19]:
inputs.head()

Unnamed: 0,Fare,Pclass,Sex,Age
0,7.25,3,1,22.0
1,71.2833,1,2,38.0
2,7.925,3,2,26.0
3,53.1,1,2,35.0
4,8.05,3,1,35.0


In [20]:
inputs.isnull().sum()

Fare        0
Pclass      0
Sex         0
Age       177
dtype: int64

In [21]:
inputs.loc[:, 'Age'] = inputs['Age'].fillna(inputs['Age'].mean())
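As a small illustration of what the mean fill does, the sketch below applies the same `fillna(mean)` pattern to a short made-up `Age` series (the values are not taken from the Titanic file).

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing entries.
age = pd.Series([22.0, np.nan, 38.0, np.nan, 30.0], name="Age")

mean_age = age.mean()           # NaN entries are skipped when computing the mean
filled = age.fillna(mean_age)   # every NaN is replaced by that mean

print(mean_age)                     # 30.0
print(filled.tolist())              # [22.0, 30.0, 38.0, 30.0, 30.0]
print(int(filled.isnull().sum()))   # 0
```

In the notebook the mean is computed over all rows before the train/test split; computing it on the training rows only is a common alternative when test information should not influence preprocessing.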

In [22]:
X_train, X_test, y_train, y_test = train_test_split(inputs,target, test_size=0.2, random_state=42)

In [23]:
model_titan = tree.DecisionTreeClassifier()

In [24]:
model_titan.fit(X_train,y_train)

In [25]:
model_titan.score(X_test,y_test)

0.7597765363128491
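For a classifier, `score` returns the mean accuracy on the data it is given. The sketch below checks this against `accuracy_score` and also prints the training accuracy for comparison. It runs on synthetic data from `make_classification`, which is an assumption made so the block is self-contained; the numbers it prints are unrelated to the Titanic result above.

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for this sketch, not the Titanic file).
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# random_state makes the tree itself reproducible between runs.
clf = tree.DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

test_acc = clf.score(X_test, y_test)
manual_acc = accuracy_score(y_test, clf.predict(X_test))

print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy: ", test_acc)
print("same as accuracy_score:", abs(test_acc - manual_acc) < 1e-12)
```

A large gap between training and test accuracy usually suggests the tree has grown deep enough to memorise the training rows; limiting `max_depth` is one option worth trying in that case.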

In [26]:
# feature order: Fare, Pclass, Sex (1 = male, 2 = female), Age
model_titan.predict([[9,3,2,23]]) 



array([0], dtype=int64)

In [27]:
# feature order: Fare, Pclass, Sex (1 = male, 2 = female), Age
model_titan.predict([[22,2,2,33]]) 



array([1], dtype=int64)