We start by working with the file: **car.data**

In [1]:
# Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df_car = pd.read_csv("/content/car.data", sep = ",", header=None)
df_car.head()

Unnamed: 0,0,1,2,3,4,5,6
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [2]:
# Dataset dimensions
print(df_car.shape)

(1728, 7)


In [3]:
# Data type of each column
df_car.dtypes

0    object
1    object
2    object
3    object
4    object
5    object
6    object
dtype: object

In [4]:
# Dataset information
df_car.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       1728 non-null   object
 1   1       1728 non-null   object
 2   2       1728 non-null   object
 3   3       1728 non-null   object
 4   4       1728 non-null   object
 5   5       1728 non-null   object
 6   6       1728 non-null   object
dtypes: object(7)
memory usage: 94.6+ KB


In [5]:
# Rename the columns to make them easier to interpret
df_car.rename(columns={0:'x0', 1:'x1',2:'x2',3:'x3',4:'x4',5:'x5',6:'y'}, inplace=True)

In [6]:
# Verify the change
df_car.columns

Index(['x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'y'], dtype='object')

In [7]:
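# Split into X and y (y keeps the original string labels)
y = df_car['y'].to_numpy()

# Encode every categorical column as integer codes (string categories are sorted alphabetically)
for n in df_car.columns:
  if str(df_car[n].dtype) == 'object' or str(df_car[n].dtype) == 'category':
    df_car[n] = df_car[n].astype('category').cat.codes
X = df_car.drop(['y'], axis=1).to_numpy()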
df_car.head()

Unnamed: 0,x0,x1,x2,x3,x4,x5,y
0,3,3,0,0,2,1,2
1,3,3,0,0,2,2,2
2,3,3,0,0,2,0,2
3,3,3,0,0,1,1,2
4,3,3,0,0,1,2,2
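
The integer values in the table above follow from how pandas builds categories: for string columns, `astype('category')` sorts the distinct values alphabetically, and `cat.codes` returns each value's position in that sorted list. The sketch below uses a small hypothetical Series (not the real dataset) to show why `high`, `low`, `med`, `vhigh` map to 0, 1, 2, 3, an order that does not match their natural ranking. The `ordered_dtype` alternative is an illustration only and is not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample mimicking one ordinal column of car.data
levels = pd.Series(["vhigh", "low", "med", "high", "low"])

# Default behaviour: categories are sorted alphabetically
alphabetical = levels.astype("category")
print(list(alphabetical.cat.categories))  # ['high', 'low', 'med', 'vhigh']
print(alphabetical.cat.codes.tolist())    # [3, 1, 2, 0, 1]

# Optional alternative: declare the natural order explicitly
ordered_dtype = pd.CategoricalDtype(["low", "med", "high", "vhigh"], ordered=True)
ordered_codes = levels.astype(ordered_dtype).cat.codes.tolist()
print(ordered_codes)                      # [3, 0, 1, 2, 0]
```

A decision tree can split on either encoding, so the explicit order mainly helps when reading the encoded table or the fitted tree.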


We first apply **StratifiedKFold** and then compare the results with those obtained using **KFold**.

In [8]:
# StratifiedKFold - Train and Test (the first of the 5 folds is used as the test set)
from sklearn.model_selection import StratifiedKFold
skf_car = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
train, test = list(skf_car.split(X, y))[0]
X_train = X[train]
X_test = X[test]
y_train = y[train]
y_test= y[test]
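
The difference between the two splitters is easiest to see on class proportions. The sketch below is a minimal illustration on hypothetical imbalanced labels (`y_demo`, 90 samples of class 0 and 10 of class 1), not on car.data: StratifiedKFold keeps the minority share equal in every test fold, while KFold lets it vary from fold to fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((len(y_demo), 1))  # features do not influence the split itself


def minority_share_per_fold(splitter):
    """Return the fraction of class 1 in each test fold."""
    return [
        float(y_demo[test_idx].mean())
        for _, test_idx in splitter.split(X_demo, y_demo)
    ]


strat_shares = minority_share_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)
plain_shares = minority_share_per_fold(
    KFold(n_splits=5, shuffle=True, random_state=0)
)

print("StratifiedKFold:", strat_shares)  # every fold holds 2 of the 10 minority samples
print("KFold:          ", plain_shares)  # shares depend on the shuffle
```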

In [15]:
%%time
# Train a tree and find the best alpha with GridSearchCV and StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

t_car = DecisionTreeClassifier()
par_car = list(np.arange(0.0, 1., step=0.05))
cv_car = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
tunner_car = GridSearchCV(estimator=t_car, param_grid={'ccp_alpha':par_car}, cv=cv_car)
_=tunner_car.fit(X_train, y_train)

CPU times: user 306 ms, sys: 2.82 ms, total: 309 ms
Wall time: 309 ms
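
The grid above tries `ccp_alpha` values from 0.0 to 0.95 in steps of 0.05. One way to check whether such a grid is fine enough is to ask the tree for its own effective alphas with `cost_complexity_pruning_path`. The sketch below does this on synthetic data from `make_classification` (an assumption for illustration; the values for car.data may differ) and counts how many grid points fall within the range where pruning actually changes the tree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the notebook's dataset
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=4, random_state=0
)

# Effective alphas at which the fully grown tree would be pruned, in increasing order
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_demo, y_demo)
candidate_alphas = path.ccp_alphas

grid = np.arange(0.0, 1.0, 0.05)
grid_points_in_range = int((grid <= candidate_alphas.max()).sum())

print("number of effective alphas:", len(candidate_alphas))
print("largest effective alpha:", candidate_alphas.max())
print("grid points at or below that maximum:", grid_points_in_range)
```

If most grid points lie above the largest effective alpha, they all produce the same root-only tree, and passing `candidate_alphas` to `param_grid` may give a more informative search.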


In [10]:
print('Train score StratifiedKFold: ' + str(tunner_car.score(X_train, y_train)))
print('Test score StratifiedKFold: ' + str(tunner_car.score(X_test, y_test)))

Train score StratifiedKFold: 1.0
Test score StratifiedKFold: 0.9826589595375722
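
The scores above come from the refitted best model, but the notebook does not show which `ccp_alpha` was selected. A fitted `GridSearchCV` exposes this through `best_params_`, `best_score_` and `best_estimator_`. The sketch below shows these on synthetic data (an assumption for illustration; on the notebook's object the equivalent call would be `tunner_car.best_params_`).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the notebook's dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)

alphas = list(np.arange(0.0, 1.0, step=0.05))
search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": alphas},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["ccp_alpha"]
print("best ccp_alpha:", best_alpha)
print("mean CV accuracy at best alpha:", search.best_score_)
print("depth of the refitted tree:", search.best_estimator_.get_depth())
```

A selected alpha of 0.0 means no pruning was applied, which would be consistent with a train score of 1.0.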


In [11]:
#KFold
from sklearn.model_selection import KFold
KFold_car = KFold(n_splits=5, random_state=0, shuffle=True)
train_2, test_2 = list(KFold_car.split(X))[0]
X_train_2 = X[train_2]
X_test_2 = X[test_2]
y_train_2 = y[train_2]
y_test_2 = y[test_2]

In [12]:
# Train a tree and find the best alpha with GridSearchCV and KFold
t_car2 = DecisionTreeClassifier()
par_car2 = list(np.arange(0.0, 1., step=0.05))
cv_car2 = KFold(n_splits=5, random_state=0, shuffle=True)
tunner_car2 = GridSearchCV(estimator=t_car2, param_grid={'ccp_alpha':par_car2}, cv=cv_car2)
_=tunner_car2.fit(X_train_2, y_train_2)

In [13]:
# Metrics
print('Train score KFold: ' + str(tunner_car2.score(X_train_2, y_train_2)))
print('Test score KFold: ' + str(tunner_car2.score(X_test_2, y_test_2)))

Train score KFold: 1.0
Test score KFold: 0.976878612716763


**Conclusions and Interpretation for the dataset: car.data**

Starting with StratifiedKFold, the metrics obtained are:

* Train score StratifiedKFold: 1.0
* Test score StratifiedKFold: 0.9826589595375722

The perfect train score shows that the tree fits the training data exactly, which points to some overfitting; one likely contributing factor is the small size of the dataset (1728 records). The gap to the test score is nonetheless small, about 0.017.

Turning to KFold, the Train and Test results are:

* Train score KFold: 1.0
* Test score KFold: 0.976878612716763

In this case, switching the cross-validation method brings no noticeable improvement: the train score is still 1.0, so the model remains overfitted, and the test score is very slightly lower (about 0.006 below the StratifiedKFold result).
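
The comparison above rests on a single train/test split per method, and the two test sets contain different rows, so a difference of a few thousandths is hard to interpret. A minimal sketch of a more robust comparison, assuming synthetic data in place of car.data, is to score the same tree on all five folds of each splitter with `cross_val_score` and look at the mean and standard deviation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with a 70/30 class balance, not the notebook's dataset
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=4, weights=[0.7, 0.3], random_state=0
)

tree = DecisionTreeClassifier(random_state=0)
splitters = {
    "StratifiedKFold": StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    "KFold": KFold(n_splits=5, shuffle=True, random_state=0),
}

results = {}
for name, splitter in splitters.items():
    # One accuracy value per fold
    scores = cross_val_score(tree, X_demo, y_demo, cv=splitter)
    results[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

If the difference between the two means is smaller than the per-fold standard deviation, the splitters are effectively indistinguishable for that model.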

We continue with the file: **aug_train**

**Tasks**:
    
Replicate the previous example on the proposed dataset, for both StratifiedKFold and KFold.
The algorithm to train is a DecisionTreeClassifier. Finally, write a short conclusion and interpretation of the results.

Note: the variable of interest is "target".

In [16]:
# Import the second dataset
df_aug = pd.read_csv("/content/aug_train.csv", sep = ",")
df_aug

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19153,7386,city_173,0.878,Male,No relevent experience,no_enrollment,Graduate,Humanities,14,,,1,42,1.0
19154,31398,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,14,,,4,52,1.0
19155,24576,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,50-99,Pvt Ltd,4,44,0.0
19156,5756,city_65,0.802,Male,Has relevent experience,no_enrollment,High School,,<1,500-999,Pvt Ltd,2,97,0.0


In [17]:
# Dataset dimensions
print(df_aug.shape)

(19158, 14)


In [18]:
# Data type of each column
df_aug.dtypes

enrollee_id                 int64
city                       object
city_development_index    float64
gender                     object
relevent_experience        object
enrolled_university        object
education_level            object
major_discipline           object
experience                 object
company_size               object
company_type               object
last_new_job               object
training_hours              int64
target                    float64
dtype: object

In [19]:
# Dataset information
df_aug.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             19158 non-null  int64  
 1   city                    19158 non-null  object 
 2   city_development_index  19158 non-null  float64
 3   gender                  14650 non-null  object 
 4   relevent_experience     19158 non-null  object 
 5   enrolled_university     18772 non-null  object 
 6   education_level         18698 non-null  object 
 7   major_discipline        16345 non-null  object 
 8   experience              19093 non-null  object 
 9   company_size            13220 non-null  object 
 10  company_type            13018 non-null  object 
 11  last_new_job            18735 non-null  object 
 12  training_hours          19158 non-null  int64  
 13  target                  19158 non-null  float64
dtypes: float64(2), int64(2), object(10)

In [20]:
# Split into X and y
y = df_aug['target'].to_numpy()

# Encode categorical columns as integer codes; missing values (NaN) become -1
for n in df_aug.columns:
  if str(df_aug[n].dtype) == 'object' or str(df_aug[n].dtype) == 'category':
    df_aug[n] = df_aug[n].astype('category').cat.codes
X = df_aug.drop(['target'], axis=1).to_numpy()
df_aug.head()

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,5,0.92,1,0,2,0,5,21,-1,-1,0,36,1.0
1,29725,77,0.776,1,1,2,0,5,6,4,5,4,47,0.0
2,11561,64,0.624,-1,1,0,0,5,15,-1,-1,5,83,0.0
3,33241,14,0.789,-1,1,-1,0,1,20,-1,5,5,52,1.0
4,666,50,0.767,1,0,2,2,5,21,4,1,3,8,0.0
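
The `-1` entries in the encoded table correspond to missing values: `cat.codes` assigns `-1` to every NaN rather than raising an error. The sketch below reproduces this on a small hypothetical Series (not the real `gender` column) and shows one possible alternative, which is an assumption and not what the notebook does: filling NaN with an explicit `"missing"` label before encoding.

```python
import numpy as np
import pandas as pd

# Hypothetical column with missing entries
gender = pd.Series(["Male", np.nan, "Female", np.nan, "Male"])

# NaN is not a category, so its code is -1
codes = gender.astype("category").cat.codes
print(codes.tolist())  # [1, -1, 0, -1, 1]

# Alternative: make the missing value an explicit category
filled_codes = gender.fillna("missing").astype("category").cat.codes
print(filled_codes.tolist())  # [1, 2, 0, 2, 1]
```

Either way the tree receives a numeric column; the explicit label only makes the encoding easier to audit.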


In [21]:
# Split Train and Test for CV - StratifiedKFold
from sklearn.model_selection import StratifiedKFold
skf_aug = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
train, test = list(skf_aug.split(X, y))[0]
X_train = X[train]
X_test = X[test]
y_train = y[train]
y_test= y[test]

In [22]:
# Train a tree and find the best alpha with GridSearchCV and StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
t_aug = DecisionTreeClassifier()
par_aug = list(np.arange(0.0, 1., step=0.05))
cv_aug = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
tunner_aug = GridSearchCV(estimator=t_aug, param_grid={'ccp_alpha':par_aug}, cv=cv_aug)
_=tunner_aug.fit(X_train, y_train)

In [23]:
print('Train score StratifiedKFold: ' + str(tunner_aug.score(X_train, y_train)))
print('Test score StratifiedKFold: ' + str(tunner_aug.score(X_test, y_test)))

Train score StratifiedKFold: 0.7830484144590891
Test score StratifiedKFold: 0.7794885177453027


In [24]:
#KFold
from sklearn.model_selection import KFold
KFold_aug= KFold(n_splits=5, random_state=0, shuffle=True)
train_2, test_2 = list(KFold_aug.split(X))[0]
X_train_2 = X[train_2]
X_test_2 = X[test_2]
y_train_2 = y[train_2]
y_test_2 = y[test_2]

In [25]:
# Train a tree and find the best alpha with GridSearchCV and KFold
t_aug2 = DecisionTreeClassifier()
par_aug2 = list(np.arange(0.0, 1., step=0.05))
cv_aug2 = KFold(n_splits=5, random_state=0, shuffle=True)
tunner_aug2 = GridSearchCV(estimator=t_aug2, param_grid={'ccp_alpha':par_aug2}, cv=cv_aug2)
_=tunner_aug2.fit(X_train_2, y_train_2)

In [26]:
# Metrics
print('Train score KFold: ' + str(tunner_aug2.score(X_train_2, y_train_2)))
print('Test score KFold: ' + str(tunner_aug2.score(X_test_2, y_test_2)))

Train score KFold: 0.7818086911131411
Test score KFold: 0.784446764091858


**Conclusions and Interpretation for the dataset: aug_train**

For this analysis, the Train and Test values obtained with StratifiedKFold are very similar to each other, which is a desirable property since it indicates that the model is not noticeably overfitting. The metrics obtained are:

* Train score StratifiedKFold: 0.7830484144590891
* Test score StratifiedKFold: 0.7794885177453027

With KFold, the Train and Test results are again similar to each other, with both values close to 0.78.

The KFold metrics for Train and Test are:

* Train score KFold: 0.7818086911131411
* Test score KFold: 0.784446764091858
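
Accuracy values near 0.78 are easier to judge against a majority-class baseline, because on an imbalanced target a model that always predicts the most frequent class can already score high. The source does not report the class balance of `target`, so the sketch below uses synthetic data with an assumed 75/25 split and an arbitrary `ccp_alpha=0.01` purely to show how such a check could be run with `DummyClassifier`.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with an assumed 75/25 class balance
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=4, weights=[0.75, 0.25], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Baseline: always predict the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
# Pruned tree with an arbitrary illustrative alpha
tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
tree_acc = tree.score(X_te, y_te)
print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"tree accuracy:     {tree_acc:.3f}")
```

On the real data, the same comparison with `y_train` and `y_test` would show how much of the 0.78 is explained by the class balance alone.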