**COURSE**: *Machine Learning* in Geosciences<br />
**Instructor**: Edier Aristizábal (evaristizabalg@unal.edu.co) <br />
**Classroom code**: [wv4cglx]

# 08: Cross-Validation

## *train-test-split*

This method is very fast and is well suited to large datasets in which the training and validation sets are each sufficiently representative of the problem. Because it is fast, it can be used with complex algorithms that are slow to train. One drawback is that the performance estimate can have high variance, because the score depends on which observations happen to fall in the training set and which in the validation set.
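The variance mentioned above can be illustrated with a small sketch. It assumes synthetic data from `make_regression` (not the course dataset), and the names `X_demo`, `y_demo` and `scores` are purely illustrative: the same model is scored on ten different random partitions and the spread of the resulting R² values is printed.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption for illustration only)
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=20.0, random_state=0)

scores = []
for seed in range(10):
    # A different seed gives a different random train/validation partition
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=seed)
    scores.append(LinearRegression().fit(X_tr, y_tr).score(X_te, y_te))

scores = np.array(scores)
print("R2 per split:", np.round(scores, 3))
print("Spread (max - min):", scores.max() - scores.min())
```

The smaller the dataset, the larger this spread tends to be, which is the motivation for the cross-validation schemes later in this notebook.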

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

import warnings
warnings.simplefilter("ignore")

In [9]:
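# Note: load_boston was removed in scikit-learn 1.2, so this cell requires an older version
from sklearn.datasets import load_boston
dataset=load_boston()                # Bunch object with data, target and DESCR
X,y=load_boston(return_X_y=True)     # feature matrix and target vector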

In [12]:
dataset.DESCR

".. _boston_dataset:\n\nBoston house prices dataset\n---------------------------\n\n**Data Set Characteristics:**  \n\n    :Number of Instances: 506 \n\n    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.\n\n    :Attribute Information (in order):\n        - CRIM     per capita crime rate by town\n        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.\n        - INDUS    proportion of non-retail business acres per town\n        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n        - NOX      nitric oxides concentration (parts per 10 million)\n        - RM       average number of rooms per dwelling\n        - AGE      proportion of owner-occupied units built prior to 1940\n        - DIS      weighted distances to five Boston employment centres\n        - RAD      index of accessibility to radial highways\n        - TAX      full-value property-tax rate per $10,000

An important argument of this function is *random_state*: passing the same value always produces the same random partition of the data, which makes the results reproducible. Below, three partitions are generated, two of which use the same seed (1).

In [21]:
from sklearn.model_selection import train_test_split

X_train,X_test, y_train,y_test = train_test_split(X,y, train_size=0.8, random_state=1)
X2_train,X2_test, y2_train,y2_test = train_test_split(X,y, train_size=0.8, random_state=1)
X3_train,X3_test, y3_train,y3_test = train_test_split(X,y, random_state=2)

In [22]:
print('Shape of the training matrix:',X_train.shape)
print('Shape of the training vector:',y_train.shape)
print('Shape of the validation matrix:',X_test.shape)
print('Shape of the validation vector:',y_test.shape)

Shape of the training matrix: (404, 13)
Shape of the training vector: (404,)
Shape of the validation matrix: (102, 13)
Shape of the validation vector: (102,)


By default, *train_test_split* assigns 75% of the data to training and 25% to validation, but the *test_size* argument accepts another value between 0 and 1 for the fraction of data used for validation (equivalently, *train_size* sets the training fraction, as done above with 0.8).
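As a quick check of these proportions, the following sketch uses a toy array of 100 samples (an assumption, chosen so the percentages are easy to read; `X_demo` and `y_demo` are illustrative names) and compares the default split with `test_size=0.2`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 2 features (illustrative only)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Default behaviour: 75% training, 25% validation
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
default_sizes = (len(X_tr), len(X_te))

# Explicit validation fraction of 20%
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
custom_sizes = (len(X_tr), len(X_te))

print(default_sizes)  # (75, 25)
print(custom_sizes)   # (80, 20)
```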

In [19]:
np.array_equal(X_train,X2_train)

True

As shown, the two partitions generated with seed 1 are exactly the same. In contrast, the partition generated with seed 2 is not equal to them (it also uses the default 75/25 proportion, so even its shape differs).

In [None]:
np.array_equal(X_train,X3_train)

False

Next, the model can be instantiated (in this case a linear regression), fitted directly, and queried for its *score*, which for regressors is the coefficient of determination R².

In [25]:
# Fit on the training data and report R² on the validation data
LinearRegression().fit(X_train,y_train).score(X_test,y_test)

0.7634174432138495

## Validación cruzada (*cross validation*)

### *K-fold*

*K-fold cross-validation* estimates the performance of an algorithm with lower variance than a single *train-test split*. The method divides the data into K subsets (typically k = 5 or k = 10), each of which is called a *fold*. The algorithm is trained on K-1 folds and the remaining fold is used for validation. This is repeated k times, so that every fold is used once for validation and k *score* values are obtained. The algorithm is therefore trained and evaluated multiple times. This function does not return a model, since several models are created internally; its only purpose is to evaluate how well a given algorithm will generalize to data other than the training data.
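The procedure described above can be sketched as an explicit loop. This is a minimal illustration on synthetic data (an assumption; `X_demo`, `y_demo`, `manual_scores` and `auto_scores` are illustrative names), not a description of scikit-learn internals: each fold is held out once, a fresh model is fitted on the other folds, and the scores are compared with those returned by `cross_val_score` for the same splitter.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustrative only)
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=1)

manual_scores = []
for train_idx, test_idx in kf.split(X_demo):
    # Train on K-1 folds, validate on the held-out fold
    fold_model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))
manual_scores = np.array(manual_scores)

# Same splitter passed to cross_val_score
auto_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kf, scoring='r2')

print(manual_scores)
print(np.allclose(manual_scores, auto_scores))
```

Fixing `random_state` in the splitter is what makes the manual loop and `cross_val_score` see the same folds.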

In [27]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

In [29]:
from sklearn.ensemble import RandomForestRegressor
kfold = KFold(n_splits=5, shuffle= True,random_state=1)
model = RandomForestRegressor()
results = cross_val_score(model, X_train, y_train, cv=kfold, scoring='r2')
print(results)
print(results.mean())
print(results.std())

[0.84160485 0.7393877  0.91913118 0.84533725 0.87093729]
0.843279654477118
0.058854013990583175


The *cross_validate* function is presented next. It differs from *cross_val_score* in that it allows several metrics to be specified for evaluating the fit, and its output is also different: it returns a dictionary with fit times, score times, and test (and optionally training) scores, as shown below. When *cv* is not specified, both functions use 5 folds (the default was 3 before scikit-learn 0.22).
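To illustrate the multiple-metric option, here is a short sketch on synthetic data (an assumption; `X_demo`, `y_demo` and `res` are illustrative names). A list of scorer names is passed to `scoring`, and the returned dictionary then holds one `test_<metric>` and one `train_<metric>` entry per scorer.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic regression data (illustrative only)
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

res = cross_validate(
    Ridge(), X_demo, y_demo, cv=5,
    scoring=['r2', 'neg_mean_squared_error'],  # two metrics evaluated on every fold
    return_train_score=True,
)

# Print the mean of every entry in the result dictionary
for key in sorted(res):
    print(key, res[key].mean())
```

Note that error metrics are exposed with a `neg_` prefix so that larger values are always better.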

In [None]:
from sklearn.model_selection import cross_validate

In [None]:
results_ridge = cross_validate(Ridge(),X,y,return_train_score=True,cv=5)
results_ridge

{'fit_time': array([0.29299998, 0.00199986, 0.00199986, 0.00099993, 0.00099993]),
 'score_time': array([0.00300002, 0.00099993, 0.00100017, 0.00100017, 0.00099993]),
 'test_score': array([ 0.66089569,  0.74094893,  0.62923672,  0.08530169, -0.17029513]),
 'train_score': array([0.74372716, 0.72395587, 0.68988726, 0.84024816, 0.73384871])}

In [None]:
test_scores = results_ridge['test_score']
train_scores = results_ridge['train_score']
print('Train scores:', np.mean(train_scores))
print('Test scores:', np.mean(test_scores))

Train scores: 0.746333431797385
Test scores: 0.38921758241023985


In [None]:
results_lasso = cross_validate(Lasso(),X,y,return_train_score=True,cv=5)
results_lasso

{'fit_time': array([0.10900021, 0.00200033, 0.00199986, 0.00099993, 0.00200009]),
 'score_time': array([0.00099993, 0.00099993, 0.00099993, 0.00099993, 0.00099993]),
 'test_score': array([0.56156843, 0.63385562, 0.33456629, 0.35466066, 0.27459294]),
 'train_score': array([0.69205313, 0.66722484, 0.62206251, 0.77992825, 0.68385778])}

In [None]:
test_scores = results_lasso['test_score']
train_scores = results_lasso['train_score']
print('Train scores:', np.mean(train_scores))
print('Test scores:', np.mean(test_scores))

Train scores: 0.6890252995484658
Test scores: 0.431848787926522


### *Stratified Kfold*

For regression problems scikit-learn uses plain k-fold by default, whereas for classification problems (when *cv* is given as an integer) it uses *stratified k-fold cross-validation*, in which each fold keeps the same class proportions as the full dataset, that is, the percentage of observations in each class is preserved. For this reason it is a good strategy for imbalanced data.
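The effect of stratification can be seen in a toy example. The sketch below assumes an artificial label vector with 90 observations of class 0 followed by 10 of class 1 (`X_demo`, `y_demo`, `kf_pos` and `skf_pos` are illustrative names) and counts how many minority-class observations land in each validation fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced, sorted labels: 90 of class 0 followed by 10 of class 1 (illustrative only)
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((100, 1))  # features are irrelevant for the split itself

# Number of class-1 observations in each validation fold
kf_pos = [int(y_demo[test].sum()) for _, test in KFold(n_splits=5).split(X_demo, y_demo)]
skf_pos = [int(y_demo[test].sum()) for _, test in StratifiedKFold(n_splits=5).split(X_demo, y_demo)]

print(kf_pos)   # [0, 0, 0, 0, 10]
print(skf_pos)  # [2, 2, 2, 2, 2]
```

With plain k-fold and no shuffling, four of the five validation folds contain no minority observations at all, while the stratified version keeps the 10% proportion in every fold.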

In [33]:
from sklearn.model_selection import StratifiedKFold

data= pd.read_excel('https://github.com/edieraristizabal/MachineLearning/blob/master/data/Torrencialidad_DB_vf.xlsx?raw=true', sheet_name='cluster', engine='openpyxl')
data.head()



Unnamed: 0,Mean Basin Slope,Relief Ratio,Form Factor,Melton Index,Relief,Leght of overland flow,Drainage Density,Constant Channel mantenance,Drainage Intensity,Stream Frequency,...,Perimeter,Elongation Ratio,Circularity Ratio,Compactness Coefficient,Texture Ratio,Fitness Ratio,Wandering ratio,Stream Frequency.1,Rudgeness Number,Flash flood record
0,24.32,0.223408,0.147258,0.582183,1.319,0.515155,0.970583,1.030309,0.602168,0.584454,...,12.454,0.433006,0.415876,1.561689,0.160591,0.325036,0.685637,0.584454,1.280198,AB
1,28.1,0.219618,0.168967,0.534278,1.968,1.032682,0.484176,2.065365,0.45667,0.221108,...,18.516,0.463828,0.497315,1.428106,0.108015,0.323882,0.669233,0.221108,0.952858,AB
2,28.86,0.120483,0.13787,0.324482,2.484,0.919901,0.543537,1.839802,0.376731,0.204767,...,47.543,0.418977,0.325805,1.7644,0.189302,0.381066,0.878741,0.204767,1.350146,AB
3,20.669,0.125434,0.232704,0.260024,0.903,0.2279,2.193947,0.4558,1.171624,2.570481,...,17.075,0.544323,0.519799,1.396877,1.464129,0.360293,0.854563,2.570481,1.981134,AB
4,17.79,0.136407,0.117966,0.397153,1.299,0.513981,0.972799,1.027962,0.384357,0.373902,...,20.898,0.387554,0.307824,1.815201,0.143554,0.351708,0.771816,0.373902,1.263665,AB


In [38]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
data["Flash flood record"]=le.fit_transform(data["Flash flood record"])
data.tail()

Unnamed: 0,Mean Basin Slope,Relief Ratio,Form Factor,Melton Index,Relief,Leght of overland flow,Drainage Density,Constant Channel mantenance,Drainage Intensity,Stream Frequency,...,Perimeter,Elongation Ratio,Circularity Ratio,Compactness Coefficient,Texture Ratio,Fitness Ratio,Wandering ratio,Stream Frequency.1,Rudgeness Number,Flash flood record
68,15.25,0.074091,0.061792,0.363366,1.97,0.552729,0.904603,1.105457,0.225657,0.20413,...,40.854,0.280493,0.221302,2.140838,0.122387,0.474862,0.8895,0.20413,1.782068,1
69,24.74,0.009269,0.113525,0.145172,3.442,0.7569,0.660589,1.513801,0.355458,0.234812,...,148.112,0.38019,0.322021,1.774738,0.715675,0.451726,0.950788,0.234812,2.273747,1
70,25.196,0.067981,0.194978,0.285303,1.576,0.658112,0.75975,1.316223,0.388216,0.294947,...,32.567,0.49825,0.361538,1.674941,0.245647,0.311542,0.811031,0.294947,1.197365,1
71,25.22,0.071111,0.182145,0.300709,2.033,0.799381,0.625484,1.598762,0.314806,0.196906,...,38.141,0.481575,0.394828,1.602773,0.157311,0.36643,0.882268,0.196906,1.271609,1
72,17.086,0.110327,0.276545,0.189486,0.563,0.864981,0.578047,1.729963,0.783853,0.453104,...,14.635,0.593387,0.517949,1.399371,0.204988,0.224394,0.581239,0.453104,0.325441,1


In [37]:
X=data.drop(['Flash flood record'],axis=1)
y=data['Flash flood record']

In [32]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [42]:
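skfold=StratifiedKFold(n_splits=5)
# Caution: `model` is still the regressor defined earlier and 'r2' is a regression
# metric; for a class label such as this one, a classifier scored with e.g.
# 'accuracy' would be the more natural choice.
results = cross_val_score(model, X, y, cv=skfold, scoring='r2')
print(results)
print(results.mean())
print(results.std())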

[-0.81911243 -2.13701183  0.42707826  0.50320738 -0.04798167]
-0.4147640598932053
0.9813800845498886


### Leave One Out Cross Validation
A special case of *K-fold cross-validation* is the one in which k equals the number of observations, so that each validation fold contains a single observation. This variant is called *leave-one-out cross-validation*.
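A minimal sketch of leave-one-out on synthetic data follows (an assumption; `X_demo`, `y_demo`, `n_splits` and `neg_mse` are illustrative names). Because each validation fold holds a single observation, R² cannot be computed per fold, so the sketch uses the mean squared error instead, via the `neg_mean_squared_error` scorer.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic regression problem (illustrative only)
X_demo, y_demo = make_regression(n_samples=30, n_features=3, noise=5.0, random_state=0)

loo = LeaveOneOut()
n_splits = loo.get_n_splits(X_demo)  # one split per observation

# One model is fitted per observation; each score is the negated squared error on that point
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo, cv=loo,
                          scoring='neg_mean_squared_error')

print(n_splits)          # 30
print(-neg_mse.mean())   # average squared error over all held-out points
```

Since one model is trained per observation, this scheme becomes expensive for large datasets.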

In [47]:
from sklearn.model_selection import LeaveOneOut

In [49]:
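loocv = LeaveOneOut()
model = LinearRegression()
# R² is undefined on a single validation sample, so every fold returns nan here;
# an error metric such as 'neg_mean_squared_error' is better suited to leave-one-out.
results = cross_val_score(model, X, y, cv=loocv, scoring='r2')
print(results.mean())
print(results.std())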

nan
nan


### ShuffleSplit

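Another variant of *k-fold* is *ShuffleSplit*, which generates a random partition as in *train-test split* but repeats the partitioning and evaluation several times, as in *K-fold*. Unlike *K-fold*, the validation sets of different iterations are drawn independently, so they may overlap.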
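The following sketch prints the indices produced by `ShuffleSplit` on a toy array of 10 samples (an assumption; `X_demo`, `ss` and `test_sets` are illustrative names). Each iteration draws a new random partition with the requested validation fraction, independently of the previous ones.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy data: 10 samples, 2 features (illustrative only)
X_demo = np.arange(20).reshape(10, 2)

ss = ShuffleSplit(n_splits=3, test_size=0.3, random_state=1)

test_sets = []
for train_idx, test_idx in ss.split(X_demo):
    # Every iteration is an independent random train/validation partition
    test_sets.append(sorted(test_idx.tolist()))
    print("train:", sorted(train_idx.tolist()), "validation:", test_sets[-1])

# Number of distinct samples that were used for validation at least once
print(len(set(np.concatenate(test_sets).tolist())))
```

Because the partitions are independent, a sample may be validated more than once or never, which is the main practical difference from *K-fold*.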

In [43]:
from sklearn.model_selection import ShuffleSplit

In [46]:
kfold = ShuffleSplit(n_splits=5, test_size=0.3, random_state=1)
model = LinearRegression()
results = cross_val_score(model, X, y, cv=kfold, scoring='r2')
print(results)
print(results.mean())
print(results.std())

[ 0.315044    0.26362321 -1.23206737 -0.94435342  0.24068397]
-0.2714139230585575
0.6735197834538271
