## Decision Tree: repeatedly split the data on one feature at a time until every row left in a group has the same target value (yes/no)

## The order in which features are chosen for splitting determines how well the tree performs.

## Entropy: a measure of how mixed the members of a group are (6 green + 0 red = low entropy, 4 green + 4 red = high entropy/total randomness)
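
The green/red example can be checked numerically. Below is a minimal sketch using Shannon entropy in bits; the helper name `entropy` and the two example lists are illustrative and not part of the original notebook.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Sum p * log2(1 / p) over the classes that actually occur
    return sum((n / total) * log2(total / n) for n in counts.values())


pure_group = ["green"] * 6                  # 6 green + 0 red
mixed_group = ["green"] * 4 + ["red"] * 4   # 4 green + 4 red

print(entropy(pure_group))   # 0.0 (no uncertainty)
print(entropy(mixed_group))  # 1.0 (maximum for two classes)
```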

## Information gain: the drop in entropy produced by a split
## Low entropy after the split = high information gain
## High entropy after the split = low information gain
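
To make the link between entropy, information gain, and feature order concrete, here is a rough sketch that ranks the three features by information gain. It uses only the ten rows visible in the outputs of this notebook, not the full `salaries.csv`, and the helper names (`entropy`, `information_gain`, `FEATURES`) are made up for illustration.

```python
from collections import Counter, defaultdict
from math import log2

# The ten rows visible in the notebook outputs:
# (company, job, degree, salary_more_then_100k)
rows = [
    ("google", "sales executive", "bachelors", 0),
    ("google", "sales executive", "masters", 0),
    ("google", "business manager", "bachelors", 1),
    ("google", "business manager", "masters", 1),
    ("google", "computer programmer", "bachelors", 0),
    ("google", "computer programmer", "masters", 1),
    ("abc pharma", "sales executive", "masters", 0),
    ("abc pharma", "computer programmer", "bachelors", 0),
    ("abc pharma", "business manager", "bachelors", 0),
    ("abc pharma", "business manager", "masters", 1),
]
FEATURES = {"company": 0, "job": 1, "degree": 2}  # column positions in `rows`


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())


def information_gain(rows, column):
    """Entropy before the split minus the size-weighted entropy after it."""
    labels = [row[-1] for row in rows]
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[-1])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - after


gains = {name: information_gain(rows, col) for name, col in FEATURES.items()}
best = max(gains, key=gains.get)

for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {gain:.3f}")
print("best first split:", best)
```

On these ten rows `job` gives the largest gain, so an entropy-based tree would split on it first. Note that scikit-learn's `DecisionTreeClassifier` uses Gini impurity by default; passing `criterion="entropy"` selects the entropy-based measure instead.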

In [1]:
import pandas as pd
df = pd.read_csv("salaries.csv")
df.head()

,company,job,degree,salary_more_then_100k
0,google,sales executive,bachelors,0
1,google,sales executive,masters,0
2,google,business manager,bachelors,1
3,google,business manager,masters,1
4,google,computer programmer,bachelors,0


## Separate the features (inputs) from the target column.

In [3]:
inputs = df.drop("salary_more_then_100k", axis="columns")
target = df["salary_more_then_100k"]

In [4]:
inputs

,company,job,degree
0,google,sales executive,bachelors
1,google,sales executive,masters
2,google,business manager,bachelors
3,google,business manager,masters
4,google,computer programmer,bachelors
5,google,computer programmer,masters
6,abc pharma,sales executive,masters
7,abc pharma,computer programmer,bachelors
8,abc pharma,business manager,bachelors
9,abc pharma,business manager,masters


In [5]:
target

0     0
1     0
2     1
3     1
4     0
5     1
6     0
7     0
8     0
9     1
10    1
11    1
12    1
13    1
14    1
15    1
Name: salary_more_then_100k, dtype: int64

## Convert string features into numeric features using LabelEncoder so that the model can read them.

In [15]:
from sklearn.preprocessing import LabelEncoder

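# fit_transform refits the encoder on every call, so one instance can be reused;
# afterwards `le` only remembers the mapping for the last column (degree).
le = LabelEncoder()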

inputs["company_n"] = le.fit_transform(inputs["company"])
inputs["job_n"] = le.fit_transform(inputs["job"])
inputs["degree_n"] = le.fit_transform(inputs["degree"])

In [18]:
inputs_n = inputs.drop(["company", "job", "degree"], axis="columns")
inputs_n

,company_n,job_n,degree_n
0,2,2,0
1,2,2,1
2,2,0,0
3,2,0,1
4,2,1,0
5,2,1,1
6,0,2,1
7,0,1,0
8,0,0,0
9,0,0,1
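
To read the integer codes in this table, it helps to know that `LabelEncoder` numbers the distinct values in sorted order. The pure-Python sketch below mimics that behaviour for the first five `job` values; `encode_labels` is a hypothetical helper written for illustration, not scikit-learn's implementation.

```python
def encode_labels(values):
    """Hypothetical helper: map each distinct string to its rank in sorted order."""
    mapping = {label: code for code, label in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


# The `job` values of rows 0-4 in the notebook output
jobs = [
    "sales executive",
    "sales executive",
    "business manager",
    "business manager",
    "computer programmer",
]

codes, mapping = encode_labels(jobs)
print(mapping)  # {'business manager': 0, 'computer programmer': 1, 'sales executive': 2}
print(codes)    # [2, 2, 0, 0, 1], matching job_n for rows 0-4
```

The codes carry an ordering that the original strings do not have. The scikit-learn documentation describes `LabelEncoder` as intended for target values, with `OrdinalEncoder` or `OneHotEncoder` as the encoders meant for input features, so those may be worth considering for a larger project.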


## Train a decision tree classifier.

In [25]:
from sklearn import tree
import warnings
# Silences all warnings for the rest of the notebook
warnings.filterwarnings("ignore")

In [21]:
model = tree.DecisionTreeClassifier()
model.fit(inputs_n, target)

DecisionTreeClassifier()

## An accuracy score of 1.0 is expected here because the model is scored on the same data it was trained on. To get a meaningful estimate, split the data into training and test sets before training the model.

In [22]:
model.score(inputs_n, target)

1.0
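
One way to get a more honest estimate is to hold out some rows before fitting. The sketch below uses scikit-learn's `train_test_split` on the ten encoded rows visible in the outputs of this notebook (not the full file); the `test_size=3` and `random_state=0` values and the name `holdout_model` are arbitrary choices made for illustration.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# The ten encoded rows visible in the notebook: [company_n, job_n, degree_n]
X = [[2, 2, 0], [2, 2, 1], [2, 0, 0], [2, 0, 1], [2, 1, 0],
     [2, 1, 1], [0, 2, 1], [0, 1, 0], [0, 0, 0], [0, 0, 1]]
y = [0, 0, 1, 1, 0, 1, 0, 0, 0, 1]

# Hold out 3 of the 10 rows; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=3, random_state=0
)

holdout_model = DecisionTreeClassifier(random_state=0)
holdout_model.fit(X_train, y_train)

train_accuracy = holdout_model.score(X_train, y_train)  # rows the model has seen
test_accuracy = holdout_model.score(X_test, y_test)     # rows held out from training
print(f"train: {train_accuracy:.2f}, test: {test_accuracy:.2f}")
```

With so few rows the test score will swing a lot depending on which rows are held out, so treat it as a demonstration of the workflow rather than a measurement of model quality.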

In [27]:
# Feature order is [company_n, job_n, degree_n]: google, business manager, masters
model.predict([[2, 0, 1]])

array([1], dtype=int64)