# . . . . . . . . . . . . . . . . . . . . . . . . . . . .
# Decision Tree Linear Regression
# . . . . . . . . . . . . . . . . . . . . . . . . . . . .




To build a decision tree, we will study a CSV file on the incomes of individuals in the USA.
It contains information such as marital status, age, type of employment, etc.
The data is from 1994. We want to predict whether a person's income is at most 50k or above 50k.

Source: [University of California, Irvine's website](http://archive.ics.uci.edu/ml/datasets/Adult)
The CSV is included.

### <span style="color:purple">Import the provided income.csv file
 </span>

In [2]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

In [6]:
# Pass index_col=False so that pandas does not treat the first column (age) as the row index

income = pd.read_csv("income.csv", index_col=False)
income.head(15)

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,high_income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K


### <span style="color:purple">How many observations are there?
 </span>

In [7]:
income.shape

(32561, 15)

 <span style="color:green">Expected: 32561</span>

### ------------------------------------------------------------------------------------------------------------------------------

## Categories

### <span style="color:purple">Take the time to interpret the data. What are the different factors (the Θ)? In what form are they classified?<br> `sex` takes the values "Male" or "Female". `workclass` contains several kinds of answers. To analyze them, we will convert them to numeric values, one code per category. To do so, use the [pandas.Categorical()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Categorical.html) class.</span>

In [8]:
#income["workclass"] = pd.Categorical(income["workclass"]).codes
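As a minimal sketch of what `pd.Categorical(...).codes` produces, the toy column below (hypothetical values, not read from `income.csv`) is converted to integer codes. The `mapping` dictionary is an illustrative helper, not part of the exercise; it shows that codes follow the sorted order of the distinct labels.

```python
import pandas as pd

# Toy column with hypothetical labels (not taken from income.csv)
workclass = pd.Series(["Private", "State-gov", "Private", "Self-emp-inc"])

cat = pd.Categorical(workclass)
codes = cat.codes                           # one integer code per row
mapping = dict(enumerate(cat.categories))   # code -> original label

print(list(codes))
print(mapping)
```

Keeping `mapping` around makes it possible to translate a code back to its label after the column has been overwritten.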


### <span style="color:purple">Convert the rest of the categorical columns in `income` (education, marital_status, occupation, relationship, race, sex, native_country, and high_income) to numeric categories.</span>


In [11]:
to_convert = ["workclass","education","marital_status","occupation","relationship","race","sex",
              "native_country","high_income"]
for feature in to_convert:
    income[feature] = pd.Categorical(income[feature]).codes

In [12]:
income.head(20)

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,high_income
0,39,7,77516,9,13,4,1,1,4,1,2174,0,40,39,0
1,50,6,83311,9,13,2,4,0,4,1,0,0,13,39,0
2,38,4,215646,11,9,0,6,1,4,1,0,0,40,39,0
3,53,4,234721,1,7,2,6,0,2,1,0,0,40,39,0
4,28,4,338409,9,13,2,10,5,2,0,0,0,40,5,0
5,37,4,284582,12,14,2,4,5,4,0,0,0,40,39,0
6,49,4,160187,6,5,3,8,1,2,0,0,0,16,23,0
7,52,6,209642,11,9,2,4,0,4,1,0,0,45,39,1
8,31,4,45781,12,14,4,10,1,4,0,14084,0,50,39,1
9,42,4,159449,9,13,2,4,0,4,1,5178,0,40,39,1


### <span style="color:purple">Now split `income` into two `DataFrames` based on `workclass`, separating people who work in the private sector from those who do not. </span>
*Python hint: use booleans to split a DataFrame.*

In [15]:
private = income[income["workclass"] == 4]

In [17]:
private.shape

(22696, 15)

In [18]:
public = income[income["workclass"] != 4]
public.shape[0] / income.shape[0], public.shape

(0.3029698105095052, (9865, 15))
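The value `4` used above is the integer code that `pd.Categorical` assigned to the private-sector label. A hedged sketch, on toy data with hypothetical labels, of looking the code up by name instead of hard-coding it; `label_to_code`, `private_rows` and `public_rows` are illustrative names only.

```python
import pandas as pd

# Toy frame with hypothetical values (not read from income.csv)
df = pd.DataFrame({"workclass": ["Private", "State-gov", "Private", "Local-gov"],
                   "age": [38, 39, 53, 41]})

cat = pd.Categorical(df["workclass"])
# Remember which integer code each label received before overwriting the column
label_to_code = {label: code for code, label in enumerate(cat.categories)}
df["workclass"] = cat.codes

private_code = label_to_code["Private"]
is_private = df["workclass"] == private_code   # boolean mask, one value per row
private_rows = df[is_private]
public_rows = df[~is_private]                  # ~ negates the mask

print(private_code, private_rows.shape, public_rows.shape)
```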

### ------------------------------------------------------------------------------------------------------------------------------

###  <span style="color:purple">Compute a proportion / probability</span>
 ##### <span style="color:purple">What are the proportions of `private_income` and `public_incomes`?
 </span>

In [32]:
private_incomes_prop = private.shape[0] / income.shape[0]
public_incomes_prop = public.shape[0] / income.shape[0]
print("private_incomes_proportion",private_incomes_prop)
print("public_incomes_proportion",public_incomes_prop) 

private_incomes_proportion 0.6970301894904948
public_incomes_proportion 0.3029698105095052


 <span style="color:green">private_incomes 0.6970301894904948 <br>
public_incomes 0.3029698105095052</span>

### ------------------------------------------------------------------------------------------------------------------------------

## Entropy

$$-\sum _{ i=1 }^{ c }{ P({ x }_{ i })\log _{ b }{ P({ x }_{ i }) }  } $$
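A small worked example of the formula, assuming base b = 2 so that the result is in bits. The `entropy` helper is an illustration written for this note, not the notebook's `calc_entropy`.

```python
import math

def entropy(probabilities):
    # Terms with p == 0 are skipped: they contribute nothing to the sum
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

fair_coin = entropy([0.5, 0.5])    # two equally likely outcomes
certain = entropy([1.0])           # a single possible outcome
skewed = entropy([0.75, 0.25])     # one outcome three times as likely

print(fair_coin, certain, skewed)
```

A fair coin gives 1 bit, a certain outcome gives 0 bits, and the skewed case falls in between (about 0.811).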

In [13]:
test = pd.Categorical(income["workclass"])
test

[7, 6, 4, 4, 4, ..., 4, 4, 4, 4, 5]
Length: 32561
Categories (9, int64): [0, 1, 2, 3, ..., 5, 6, 7, 8]

In [25]:
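def calc_entropy(column):
    # Probability of each distinct value in the column (a pandas Series)
    p = np.array(column.value_counts()) / column.shape[0]
    # Shannon entropy in bits (log base 2)
    entropy = -np.dot(p, np.log2(p))
    return entropy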
print("high_income", calc_entropy(income["high_income"]))
print("workclass", calc_entropy(income["workclass"]))

high_income 0.796383955202
workclass 1.64797692751


 <span style="color:green">Expected: <br> high_income 0.796383955202 <br>
workclass 1.64797692751</span>

### ------------------------------------------------------------------------------------------------------------------------------

## Information Gain

###  <span style="color:purple">Compute the `Information Gain` of `age` with respect to the final target `high_income`</span>
 
$$Entropy(T) = -\sum_{i=1}^{c} P(x_i) \log_b P(x_i)$$

$$IG(T, A) = Entropy(T) - \sum_{v \in A} \frac{|T_v|}{|T|} \cdot Entropy(T_v)$$
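To make the formula concrete, here is a sketch on six toy rows, assuming a single split at the median of the attribute (rows at or below the median on one side, the rest on the other). The helper names and the data are hypothetical.

```python
import math
from collections import Counter
from statistics import median

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    m = median(values)
    # Split the labels according to which side of the median the value falls on
    left = [y for x, y in zip(values, labels) if x <= m]
    right = [y for x, y in zip(values, labels) if x > m]
    # Entropy of each side, weighted by its share of the rows
    weighted = sum(len(part) / len(labels) * entropy(part)
                   for part in (left, right) if part)
    return entropy(labels) - weighted

ages = [20, 25, 30, 35, 40, 45]
high_income = [0, 0, 1, 0, 1, 1]
gain = information_gain(ages, high_income)
print(gain)
```

Here the target has an entropy of 1 bit and each side of the split keeps an entropy of about 0.918, so the gain is small (about 0.082). A split that separated the two classes perfectly would give a gain of 1.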


In [22]:
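import math

# Find the median of age.
median_age = income["age"].median()

# Create two subsets around the median (assumed convention: rows at or below the median go left).
left_split = income[income["age"] <= median_age]
right_split = income[income["age"] > median_age]

# Compute the proportion of rows in each split.
left_prop = left_split.shape[0] / income.shape[0]
right_prop = right_split.shape[0] / income.shape[0]


In [23]:
# Compute the entropy of high_income, the final target
income_entropy = calc_entropy(income["high_income"])


In [24]:
# Compute `age_information_gain`: target entropy minus the weighted entropy of the two splits
age_information_gain = income_entropy - (left_prop * calc_entropy(left_split["high_income"])
                                         + right_prop * calc_entropy(right_split["high_income"]))
age_information_gain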

0.047028661304691965

 <span style="color:green">Attendu:0.047028661304691965</span>

 ### <span style="color:purple">Create a function `calc_information_gain`</span>

In [25]:
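def calc_information_gain(data, split_name, target_name):

    # Compute the original entropy
    original_entropy = calc_entropy(data[target_name])

    # Find the median of the column
    median = data[split_name].median()

    # Create two subsets around the median (assumed convention: rows at or below the median go left)
    left_split = data[data[split_name] <= median]
    right_split = data[data[split_name] > median]

    # Add up the entropy of each subset, weighted by its share of the rows
    to_subtract = 0
    for subset in [left_split, right_split]:
        prob = subset.shape[0] / data.shape[0]
        to_subtract += prob * calc_entropy(subset[target_name])

    # Return the information gain
    return original_entropy - to_subtract

# Check that the value matches for `income`, "age", "high_income"
print(calc_information_gain(income, "age", "high_income"))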

0.0470286613047


 <span style="color:green">Expected: 0.0470286613047</span>

 ### <span style="color:purple">Then build a list `information_gains` covering all the columns</span>

In [26]:
columns = ["age", "workclass", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "hours_per_week", "native_country"]
information_gains = []
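# Loop over the columns and store the information gain of each one
for col in columns:
    information_gains.append(calc_information_gain(income, col, "high_income"))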

### ------------------------------------------------------------------------------------------------------------------------------

 ### <span style="color:purple">Select the name of the column with the highest value</span>

In [27]:
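# Column whose information gain is the largest
highest_gain = columns[information_gains.index(max(information_gains))]
highest_gain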

'marital_status'

### ------------------------------------------------------------------------------------------------------------------------------

 ### <span style="color:purple">Using recursion, you can write a function that builds the entire tree</span>
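One possible shape for such a recursive function is sketched below. It is an illustration under stated assumptions, not the notebook's solution: every split is a median split on the column with the highest information gain, the tree is stored as nested dictionaries, and `build_tree`, `predict_row`, `max_depth` and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd

def entropy(column):
    p = column.value_counts(normalize=True).to_numpy()
    return float(-np.sum(p * np.log2(p)))

def information_gain(data, split_name, target_name):
    median = data[split_name].median()
    left = data[data[split_name] <= median]
    right = data[data[split_name] > median]
    weighted = sum(len(s) / len(data) * entropy(s[target_name])
                   for s in (left, right) if len(s))
    return entropy(data[target_name]) - weighted

def build_tree(data, columns, target_name, depth=0, max_depth=3):
    labels = data[target_name]
    majority = labels.mode().iloc[0]
    # Stop when the node is pure or the depth limit is reached
    if labels.nunique() == 1 or depth == max_depth:
        return {"label": majority}
    gains = [information_gain(data, col, target_name) for col in columns]
    best = columns[int(np.argmax(gains))]
    median = data[best].median()
    left = data[data[best] <= median]
    right = data[data[best] > median]
    # Stop when the best split leaves one side empty
    if len(left) == 0 or len(right) == 0:
        return {"label": majority}
    # Recurse on each side; the dictionaries nest to form the tree
    return {"column": best, "median": median,
            "left": build_tree(left, columns, target_name, depth + 1, max_depth),
            "right": build_tree(right, columns, target_name, depth + 1, max_depth)}

def predict_row(tree, row):
    # Walk down from the root until a leaf is reached
    while "label" not in tree:
        tree = tree["left"] if row[tree["column"]] <= tree["median"] else tree["right"]
    return tree["label"]

# Toy data with hypothetical values (not read from income.csv)
toy = pd.DataFrame({"age": [22, 25, 47, 52, 46, 56],
                    "hours_per_week": [20, 40, 40, 45, 30, 60],
                    "high_income": [0, 0, 1, 1, 0, 1]})
tree_dict = build_tree(toy, ["age", "hours_per_week"], "high_income")
predictions = [int(predict_row(tree_dict, row)) for _, row in toy.iterrows()]
print(tree_dict["column"], predictions)
```

On this toy frame the root splits on `age`, since that median split separates the two classes completely. The depth limit is one simple way to keep the recursion from growing a leaf per row.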

In [14]:
import sklearn.tree
help(sklearn.tree)

Help on package sklearn.tree in sklearn:

NAME
    sklearn.tree

DESCRIPTION
    The :mod:`sklearn.tree` module includes decision tree-based models for
    classification and regression.

PACKAGE CONTENTS
    _criterion
    _splitter
    _tree
    _utils
    export
    setup
    tests (package)
    tree

CLASSES
    sklearn.base.ClassifierMixin(builtins.object)
        sklearn.tree.tree.DecisionTreeClassifier(sklearn.tree.tree.BaseDecisionTree, sklearn.base.ClassifierMixin)
            sklearn.tree.tree.ExtraTreeClassifier
    sklearn.base.RegressorMixin(builtins.object)
        sklearn.tree.tree.DecisionTreeRegressor(sklearn.tree.tree.BaseDecisionTree, sklearn.base.RegressorMixin)
            sklearn.tree.tree.ExtraTreeRegressor
    sklearn.tree.tree.BaseDecisionTree(abc.NewBase)
        sklearn.tree.tree.DecisionTreeClassifier(sklearn.tree.tree.BaseDecisionTree, sklearn.base.ClassifierMixin)
            sklearn.tree.tree.ExtraTreeClassifier
        sklearn.tree.tree.DecisionTreeRegre

In [35]:
X = np.array(income["age"])
X

array([39, 50, 38, ..., 58, 22, 52])

In [40]:
to_X = ["age","workclass","fnlwgt","education","education_num","marital_status","occupation","relationship","race",
        "sex","capital_gain","capital_loss","hours_per_week","native_country"]
X = income[to_X].to_numpy()  # as_matrix() was removed in pandas 1.0
y = np.array(income["high_income"])
X

array([[    39,      7,  77516, ...,      0,     40,     39],
       [    50,      6,  83311, ...,      0,     13,     39],
       [    38,      4, 215646, ...,      0,     40,     39],
       ..., 
       [    58,      4, 151910, ...,      0,     40,     39],
       [    22,      4, 201490, ...,      0,     20,     39],
       [    52,      5, 287927, ...,      0,     40,     39]])

In [43]:
X.shape, y.shape

((32561, 14), (32561,))

In [47]:
tree = DecisionTreeClassifier(criterion="entropy")
tree.fit(X,y)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [48]:
income1 = income.copy()  # copy, so that adding a column does not modify income
income1['predict'] = tree.predict(X)

In [50]:
income1.head(20)

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,high_income,predict
0,39,7,77516,9,13,4,1,1,4,1,2174,0,40,39,0,0
1,50,6,83311,9,13,2,4,0,4,1,0,0,13,39,0,0
2,38,4,215646,11,9,0,6,1,4,1,0,0,40,39,0,0
3,53,4,234721,1,7,2,6,0,2,1,0,0,40,39,0,0
4,28,4,338409,9,13,2,10,5,2,0,0,0,40,5,0,0
5,37,4,284582,12,14,2,4,5,4,0,0,0,40,39,0,0
6,49,4,160187,6,5,3,8,1,2,0,0,0,16,23,0,0
7,52,6,209642,11,9,2,4,0,4,1,0,0,45,39,1,1
8,31,4,45781,12,14,4,10,1,4,0,14084,0,50,39,1,1
9,42,4,159449,9,13,2,4,0,4,1,5178,0,40,39,1,1


In [51]:
tree.score(X,y)

0.99996928841251809
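The score above is measured on the same rows the tree was fitted on, so it mostly shows that a fully grown tree can memorize its training data. A hedged sketch of a held-out evaluation follows; it uses synthetic data as a stand-in for `X` and `y`, and the names `X_demo`, `y_demo` and `clf` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
# Synthetic stand-in for X and y (hypothetical data, not income.csv)
X_demo = rng.randint(0, 100, size=(500, 5))
noise = rng.randint(-20, 21, size=500)
y_demo = (X_demo[:, 0] + noise > 50).astype(int)

# Hold out 25% of the rows; the tree never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X_train, y_train)

train_accuracy = clf.score(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print(train_accuracy, test_accuracy)
```

The gap between the two numbers gives a rough idea of how much the tree overfits. Parameters such as `max_depth` or `min_samples_leaf` are the usual levers for reducing it.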

In [52]:
tree.decision_path(X)

<32561x9321 sparse matrix of type '<class 'numpy.int64'>'
	with 566687 stored elements in Compressed Sparse Row format>

In [54]:
tree.apply(X)

array([7575, 4555, 6815, ..., 6526, 6333, 9319])