<a href="https://colab.research.google.com/github/eljimenezj/-Data-modeling-techniques/blob/master/Tree_Based_Models_in_Python__ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning with Tree-Based Models in Python (Some Notes)

## Diagnosing bias and variance problems in tree models


In this section we look at how to diagnose bias and variance problems, mainly in tree models. This Jupyter notebook is based on DataCamp classes, and it uses a car dataset that appears in several of their courses.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error as MSE

import matplotlib.pyplot as plt

In [2]:
url = 'https://assets.datacamp.com/production/repositories/1796/datasets/3781d588cf7b04b1e376c7e9dda489b3e6c7465b/auto.csv'
df = pd.read_csv(url)

In [3]:
df.head()

Unnamed: 0,mpg,displ,hp,weight,accel,origin,size
0,18.0,250.0,88,3139,14.5,US,15.0
1,9.0,304.0,193,4732,18.5,US,20.0
2,36.1,91.0,60,1800,16.4,Asia,10.0
3,18.5,250.0,98,3525,19.0,US,15.0
4,34.3,97.0,78,2188,15.8,Europe,10.0


In [4]:
df.columns

Index(['mpg', 'displ', 'hp', 'weight', 'accel', 'origin', 'size'], dtype='object')

In [5]:
one_hot = pd.get_dummies(df['origin'])
df = df.drop('origin',axis = 1)
df = df.join(one_hot)

In [6]:
y = df['mpg'].values
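# Drop the target column; pass the columns by keyword (a positional axis is rejected by pandas 2.x)
df = df.drop(columns=['mpg'])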
X = df.values

In [7]:
# Set SEED for reproducibility
SEED = 1

# Split the data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)

# Instantiate a DecisionTreeRegressor dt
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=SEED)

In [8]:
dt.fit(X_train,y_train)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=4,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=0.26, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=1, splitter='best')

Cross validation is performed with K = 10 folds. We compute the mean squared error (MSE, the average of the 10 errors obtained across the 10 folds) and the root mean squared error (RMSE). All of this is done on the training set.

In [9]:
# Compute the array containing the 10-folds CV MSEs
MSE_CV_scores = - cross_val_score(dt, X_train, y_train, cv=10, 
                       scoring='neg_mean_squared_error',
                       n_jobs=-1)

# Compute the 10-folds CV RMSE
RMSE_CV = (MSE_CV_scores.mean())**(1/2)

# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))

CV RMSE: 5.14
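
The 10-fold procedure above can be unrolled by hand to show what `cross_val_score` is averaging. The sketch below is only illustrative: it uses synthetic data from `make_regression` (an assumption, not the auto dataset) and the hypothetical names `X_demo`, `y_demo`, and `fold_mses`. For a regressor and an integer `cv`, scikit-learn uses an unshuffled `KFold`, so the manual loop should reproduce the same result.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the auto dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=1)

# Manual K-fold loop: fit on K-1 folds, score on the held-out fold
fold_mses = []
for train_idx, val_idx in KFold(n_splits=10).split(X_demo):
    model = clone(tree).fit(X_demo[train_idx], y_demo[train_idx])
    fold_mses.append(mean_squared_error(y_demo[val_idx], model.predict(X_demo[val_idx])))

# CV RMSE is the square root of the mean of the fold MSEs
rmse_cv_manual = np.mean(fold_mses) ** 0.5

# Same quantity via cross_val_score (its scores are negated MSEs)
rmse_cv_sklearn = (-cross_val_score(tree, X_demo, y_demo, cv=10,
                                    scoring='neg_mean_squared_error')).mean() ** 0.5

print('manual: {:.4f}  cross_val_score: {:.4f}'.format(rmse_cv_manual, rmse_cv_sklearn))
```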


We evaluate the error on the training set.

In [10]:

# Fit dt to the training set
dt.fit(X_train, y_train)

# Predict the labels of the training set
y_pred_train = dt.predict(X_train)

# Evaluate the training set RMSE of dt
RMSE_train = (MSE(y_train,y_pred_train))**(1/2)

# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train))

Train RMSE: 5.15
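
One common way to read these two numbers is to compare them with each other and with an error level considered acceptable. The sketch below encodes that heuristic. The function name `diagnose_fit`, the `target_rmse` value, and the 10% gap tolerance are illustrative assumptions and do not come from the notebook's source.

```python
def diagnose_fit(rmse_train, rmse_cv, target_rmse, rel_gap=0.10):
    """Rough bias/variance heuristic; thresholds are illustrative assumptions."""
    # CV error clearly above training error: the model does not generalize
    if rmse_cv > rmse_train * (1 + rel_gap):
        return 'high variance'
    # Errors roughly equal but above the acceptable level: the model underfits
    if rmse_cv > target_rmse:
        return 'high bias'
    return 'no obvious problem'

# Train and CV RMSE as printed above; target_rmse=4.0 is a hypothetical choice
print(diagnose_fit(rmse_train=5.15, rmse_cv=5.14, target_rmse=4.0))
```

With the values printed above, the train and CV errors are nearly identical, so the heuristic does not point to high variance. Whether it flags high bias depends entirely on the chosen target.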


## Ensemble models

Ensemble models combine the individual predictions of different models to produce a final prediction or classification.
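
For classification, the simplest way to combine predictions is a majority (hard) vote. The following minimal sketch uses made-up prediction arrays for three hypothetical models (`preds_a`, `preds_b`, `preds_c`) purely to illustrate the idea.

```python
import numpy as np

# Hypothetical binary predictions from three models on five samples
preds_a = np.array([1, 0, 1, 1, 0])
preds_b = np.array([1, 1, 0, 1, 0])
preds_c = np.array([0, 0, 1, 1, 1])

# Stack into shape (n_models, n_samples) and count the votes for class 1
votes = np.vstack([preds_a, preds_b, preds_c])

# With three voters, class 1 wins whenever it gets at least two votes
ensemble_pred = (votes.sum(axis=0) >= 2).astype(int)

print(ensemble_pred)
```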

Data for the example: https://www.kaggle.com/jeevannagaraj/indian-liver-patient-dataset

In [11]:
def clean_dataset(df):
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    df.dropna(inplace=True)
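    # Keep only rows without NaN or infinite values (axis must be passed by keyword in pandas 2.x)
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)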
    return df[indices_to_keep].astype(np.float64)


In [12]:
df_orig = pd.read_csv('data_ensamble.csv')
df_orig.head(2)

Unnamed: 0,age,gender,tot_bilirubin,direct_bilirubin,tot_proteins,albumin,ag_ratio,sgpt,sgot,alkphos,is_patient
0,65,Female,0.7,0.1,187,16,18,6.8,3.3,0.9,1
1,62,Male,10.9,5.5,699,64,100,7.5,3.2,0.74,1


In [13]:
df = pd.read_csv('data_ensamble.csv')

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

number = LabelEncoder()

# Encode gender as integers and keep it as a one-column DataFrame
df_gender = pd.DataFrame(number.fit_transform(df['gender'].astype(str)), columns=['genero'])

In [14]:
df_complete = df.join(df_gender)
df_complete = df_complete.drop(columns=['gender'])
df_complete = clean_dataset(df_complete)

In [15]:
y = df_complete['is_patient']
df_complete = df_complete.drop(columns=['is_patient'])
col_names = df_complete.columns 

In [16]:
# Standardize the features
scaler = StandardScaler()
df_std = scaler.fit_transform(df_complete)

In [17]:
df_std = pd.DataFrame(df_std, columns=col_names)

In [18]:
#df_std = df_std.join(df_gender)
#df_std = clean_dataset(df_std)
df_std.head(5)

Unnamed: 0,age,tot_bilirubin,direct_bilirubin,tot_proteins,albumin,ag_ratio,sgpt,sgot,alkphos,genero
0,1.247403,-0.42032,-0.495414,-0.42887,-0.355832,-0.319111,0.293722,0.203446,-0.14739,-1.770795
1,1.062306,1.218936,1.423518,1.675083,-0.093573,-0.035962,0.939655,0.077462,-0.648461,0.564718
2,1.062306,0.640375,0.926017,0.816243,-0.115428,-0.146459,0.478274,0.203446,-0.178707,0.564718
3,0.815511,-0.372106,-0.388807,-0.449416,-0.36676,-0.312205,0.293722,0.329431,0.16578,0.564718
4,1.679294,0.093956,0.179766,-0.395996,-0.295731,-0.177537,0.755102,-0.930414,-1.713237,0.564718


In [19]:
#y = df_orig['is_patient'] 
#y = pd.DataFrame(number.fit_transform(df_orig['is_patient'].astype('float')))
#y.colums = ['is_patient']
#y = clean_dataset(y)

In [20]:
# Split the data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(df_std, y, test_size=0.3, random_state=SEED)

In [21]:
X_train , y_train

(          age  tot_bilirubin  direct_bilirubin  ...      sgot   alkphos    genero
 370  0.692113      -0.420320         -0.495414  ...  1.085338  1.105288 -1.770795
 342  0.075125      -0.436391         -0.459878  ...  0.329431 -0.147390  0.564718
 142 -0.912055      -0.275680         -0.388807  ... -0.552461 -0.147390  0.564718
 420 -0.788658      -0.420320         -0.495414  ... -0.174507 -0.773729 -1.770795
 6   -1.158850      -0.388178         -0.459878  ...  0.455416  0.165780 -1.770795
 ..        ...            ...               ...  ...       ...       ...       ...
 129  0.013427      -0.082826          0.073158  ... -1.056399 -0.460559  0.564718
 144  0.013427       0.029672          0.002087  ... -0.300492  0.165780 -1.770795
 72   1.864391      -0.404249         -0.459878  ... -1.434352 -0.460559 -1.770795
 235 -1.405646      -0.404249         -0.459878  ...  0.833369 -0.147390  0.564718
 37   0.075125       1.749283          2.240841  ... -1.434352 -0.460559 -1.770795
 
 [

In [22]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

Three individual classifiers are instantiated, and the classification performance of each one is then measured.

In [23]:
# Set seed for reproducibility
SEED=1

# Instantiate lr
lr = LogisticRegression(random_state=SEED)

# Instantiate knn
knn = KNeighborsClassifier(n_neighbors=27)

# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED)

# Define the list classifiers
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbours', knn), ('Classification Tree', dt)]

Measuring the accuracy of each individual classifier

In [24]:
from sklearn.metrics import accuracy_score

# Iterate over the pre-defined list of classifiers
for clf_name, clf in classifiers:    
 
    # Fit clf to the training set
    clf.fit(X_train, y_train)    
   
    # Predict y_pred
    y_pred = clf.predict(X_test)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred) 
   
    # Evaluate clf's accuracy on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy))

Logistic Regression : 0.759
K Nearest Neighbours : 0.718
Classification Tree : 0.730


Next, the ensemble model is built with `VotingClassifier`, using the three individual classifiers.

Its performance is then reviewed.

In [25]:
# Import VotingClassifier from sklearn.ensemble
from sklearn.ensemble import VotingClassifier

# Instantiate a VotingClassifier vc
vc = VotingClassifier(estimators=classifiers)     

# Fit vc to the training set
vc.fit(X_train, y_train)   

# Evaluate the test set predictions
y_pred = vc.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('Voting Classifier: {:.3f}'.format(accuracy))

Voting Classifier: 0.759
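
Because the example above depends on a local CSV file, a self-contained variant may help for experimentation. The sketch below is a stand-in built on assumptions: it swaps the liver-patient data for synthetic data from `make_classification`, reuses the same three classifier settings, and checks that the default hard vote of `VotingClassifier` matches a manual majority over its fitted members. The names `X_syn`, `y_syn`, `members`, and `vc_demo` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

SEED = 1

# Synthetic binary data (assumption: stands in for the liver dataset)
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=SEED)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=SEED)

members = [
    ('Logistic Regression', LogisticRegression(random_state=SEED)),
    ('K Nearest Neighbours', KNeighborsClassifier(n_neighbors=27)),
    ('Classification Tree', DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED)),
]

# voting='hard' is the default: each member casts one vote per sample
vc_demo = VotingClassifier(estimators=members).fit(X_tr, y_tr)
vc_pred = vc_demo.predict(X_te)

# Manual majority over the fitted members (labels are 0/1 here)
member_preds = np.array([est.predict(X_te) for est in vc_demo.estimators_])
manual_pred = (member_preds.sum(axis=0) >= 2).astype(int)

print('Voting accuracy: {:.3f}'.format(accuracy_score(y_te, vc_pred)))
print('Matches manual majority:', bool((vc_pred == manual_pred).all()))
```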
