# Classification

For classification, the variables to predict are `CAUSAACCI` and `TIPACCID`, using predictors that are mostly correlated with one another, plus others chosen by judgment.
The model is selected by the best **Cross Validation** result across different hyperparameters for each model. The candidate classifiers are **Decision Tree, LightGBM, CatBoost, KNN, Random Forest**.

CAUSAACCI (probable or presumed cause of the accident)
- 1: Driver
- 2: Pedestrian or passenger
- 3: Vehicle failure
- 4: Poor road condition
- 5: Other
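
For readable reports and plots, the codes listed above can be turned into labels with a small lookup. This is a minimal sketch: the English labels are translations of the catalogue entries, and the names `CAUSAACCI_LABELS` and `label_cause` are illustrative, not part of the original notebook.

```python
# Hypothetical lookup table for the CAUSAACCI codes listed above.
CAUSAACCI_LABELS = {
    1: "Driver",
    2: "Pedestrian or passenger",
    3: "Vehicle failure",
    4: "Poor road condition",
    5: "Other",
}


def label_cause(code: int) -> str:
    """Return the label for a CAUSAACCI code, or 'Unknown' for unseen codes."""
    return CAUSAACCI_LABELS.get(code, "Unknown")


print([label_cause(c) for c in (1, 4, 9)])
```

With pandas, `df['CAUSAACCI'].map(CAUSAACCI_LABELS)` applies the same lookup to the whole column.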

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold

In [2]:
df = pd.read_csv('../data/processed/processed_nacional.csv')

For classification, the per-role injured and dead columns are not needed, only the totals (`TOTMUERTOS`, `TOTHERIDOS`): `CAUSAACCI` is not correlated with the per-role columns, and if some relationship does exist, the totals are enough to capture it.
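
One way to express that choice is to drop the per-role casualty columns and keep the totals. The sketch below uses a toy frame with made-up values in place of `processed_nacional.csv`. The column names come from the dataset's column listing, while `PER_ROLE_COLS`, `toy` and `reduced` are illustrative names. The notebook itself gets the same effect by selecting the feature columns explicitly.

```python
import pandas as pd

# Per-role casualty columns; the totals (TOTMUERTOS, TOTHERIDOS) are kept.
PER_ROLE_COLS = [
    "CONDMUERTO", "CONDHERIDO", "PASAMUERTO", "PASAHERIDO",
    "PEATMUERTO", "PEATHERIDO", "CICLMUERTO", "CICLHERIDO",
    "OTROMUERTO", "OTROHERIDO",
]

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "CONDMUERTO": [0, 1], "CONDHERIDO": [1, 0],
    "TOTMUERTOS": [0, 1], "TOTHERIDOS": [1, 0],
    "CAUSAACCI": [1, 5],
})

# errors="ignore" skips the listed names that are absent from the toy frame.
reduced = toy.drop(columns=PER_ROLE_COLS, errors="ignore")
print(list(reduced.columns))
```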

In [3]:
df.shape

(180219, 38)

In [4]:
df.columns

Index(['EDO', 'MES', 'ANIO', 'MPIO', 'HORA', 'MINUTOS', 'DIA', 'DIASEMANA',
       'ZONA', 'TIPACCID', 'AUTOMOVIL', 'CAMPASAJ', 'MOTOCICLET', 'BICICLETA',
       'OTROVEHIC', 'CAUSAACCI', 'CAPAROD', 'SEXO', 'EDAD', 'CONDMUERTO',
       'CONDHERIDO', 'PASAMUERTO', 'PASAHERIDO', 'PEATMUERTO', 'PEATHERIDO',
       'CICLMUERTO', 'CICLHERIDO', 'OTROMUERTO', 'OTROHERIDO', 'TOTMUERTOS',
       'TOTHERIDOS', 'CLASE', 'CALLE1', 'LONGITUD', 'LATITUD', 'TRANPUBLICO',
       'VEHICARGA', 'ALIENTOCINT'],
      dtype='object')

In [5]:
y = df['CAUSAACCI']

X = df[[
    'EDO', 'MES', 'HORA', 'MINUTOS', 'DIA', 'DIASEMANA', 'ZONA', 
    'AUTOMOVIL', 'CAMPASAJ', 'MOTOCICLET', 'BICICLETA', 'TRANPUBLICO',
    'VEHICARGA', 'OTROVEHIC', 'SEXO', 'EDAD', 'CAPAROD', 'ALIENTOCINT',
    'TOTMUERTOS', 'TOTHERIDOS', 'CLASE', 'TIPACCID'
]]

print(y.shape, X.shape)

(180219,) (180219, 22)


In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
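
The split above is not stratified. If the target classes are imbalanced, passing `stratify=y` keeps the class proportions the same in the train and test partitions. The following is a sketch on synthetic labels with made-up proportions. It is not what the notebook does, and adopting it would change the scores reported below.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic, imbalanced 5-class target (proportions are made up).
y_toy = rng.choice([1, 2, 3, 4, 5], size=10_000, p=[0.90, 0.01, 0.01, 0.04, 0.04])
X_toy = rng.normal(size=(10_000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=42, stratify=y_toy
)


def proportions(labels):
    """Share of each class, keyed by class code."""
    values, counts = np.unique(labels, return_counts=True)
    return dict(zip(values.tolist(), (counts / counts.sum()).tolist()))


print({k: round(v, 3) for k, v in proportions(y_tr).items()})
print({k: round(v, 3) for k, v in proportions(y_te).items()})
```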

In [7]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

## 1. Decision Tree Classifier

In [8]:
from sklearn.tree import DecisionTreeClassifier

In [9]:
tree = DecisionTreeClassifier()

In [10]:
tree_params = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 7, 9, 10, 12, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 3, 5],
    'splitter': ['best', 'random'],
    # Note: 'auto' (equivalent to 'sqrt' for classifiers) was removed in scikit-learn 1.3
    'max_features': ['auto', 'sqrt', 'log2', None],
    'class_weight': [None, 'balanced'],
    'min_impurity_decrease': [0.0, 0.05, 0.1],
    'random_state': [42]
}

In [11]:
tree_cv = GridSearchCV(
    estimator=tree,
    param_grid=tree_params,
    cv=cv,
    scoring='accuracy',
    verbose=1
)

In [None]:
tree_cv.fit(X_train, y_train)

Fitting 5 folds for each of 6912 candidates, totalling 34560 fits
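
The candidate count in the message above is the product of the number of values per hyperparameter, and the number of fits is that product times the number of folds. A short check, repeating the grid from the cell above:

```python
from math import prod

# Same grid as tree_params above.
tree_params = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 7, 9, 10, 12, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 3, 5],
    'splitter': ['best', 'random'],
    'max_features': ['auto', 'sqrt', 'log2', None],
    'class_weight': [None, 'balanced'],
    'min_impurity_decrease': [0.0, 0.05, 0.1],
    'random_state': [42],
}

n_splits = 5  # StratifiedKFold(n_splits=5, ...)
n_candidates = prod(len(values) for values in tree_params.values())
n_fits = n_candidates * n_splits

print(n_candidates, n_fits)  # 6912 34560
```

`len(ParameterGrid(tree_params))` from `sklearn.model_selection` gives the same candidate count.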


In [14]:
tree_cv.best_params_

{'class_weight': None,
 'criterion': 'gini',
 'max_depth': 10,
 'max_features': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 5,
 'min_samples_split': 2,
 'random_state': 42,
 'splitter': 'best'}

In [15]:
tree_cv.best_score_

0.9273976788992107
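
`best_score_` is the mean cross-validated accuracy over the training folds. The held-out `X_test` and `y_test` are not used in this section. A hedged sketch of a test-set evaluation follows, run on synthetic data with a deliberately small grid rather than the real search. Balanced accuracy is included as an assumption about what may be informative when classes are imbalanced.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the accident data (imbalanced, 3 classes).
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, n_informative=5, n_classes=3,
    weights=[0.8, 0.1, 0.1], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=42
)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [3, 7, 10]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# With the default refit=True, the best candidate is refitted on all of
# X_tr, and predict() uses that refitted model.
y_pred = search.predict(X_te)
test_acc = accuracy_score(y_te, y_pred)
test_bal_acc = balanced_accuracy_score(y_te, y_pred)

print(f"CV accuracy:            {search.best_score_:.3f}")
print(f"Test accuracy:          {test_acc:.3f}")
print(f"Test balanced accuracy: {test_bal_acc:.3f}")
```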

---

In [12]:
%pip install lightgbm catboost

Note: you may need to restart the kernel to use updated packages.


## 2. LightGBM

In [13]:
from lightgbm import LGBMClassifier

In [14]:
lgbm = LGBMClassifier()

In [15]:
lgbm_params = {
    'n_estimators': [50, 100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1],
    'num_leaves': [31, 50]
}

In [16]:
lgbm_cv = GridSearchCV(
    estimator=lgbm,
    param_grid=lgbm_params,
    cv=cv,
    scoring='accuracy',
    verbose=1
)

In [18]:
lgbm_cv.fit(X_train, y_train)

Fitting 5 folds for each of 24 candidates, totalling 120 fits
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003817 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 344
[LightGBM] [Info] Number of data points in the train set: 100922, number of used features: 22
[LightGBM] [Info] Start training from score -0.105369
[LightGBM] [Info] Start training from score -4.902030
[LightGBM] [Info] Start training from score -5.183509
[LightGBM] [Info] Start training from score -3.240126
[LightGBM] [Info] Start training from score -3.040537

In [19]:
lgbm_cv.best_params_

{'learning_rate': 0.05, 'n_estimators': 200, 'num_leaves': 50}

In [20]:
lgbm_cv.best_score_

0.9353166304975007
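
To put this accuracy in context, the "Start training from score" lines in the LightGBM log can be read as class priors. The sketch below assumes that, for the multiclass objective, each start score is the natural log of a class's share of the training fold. That assumption is supported by the resulting values summing to about 1. It also assumes the first score belongs to code 1.

```python
import math

# Start scores copied from the first fold in the LightGBM log.
start_scores = [-0.105369, -4.902030, -5.183509, -3.240126, -3.040537]

# Assumption: start score = log(class prior) for each class.
priors = [math.exp(score) for score in start_scores]

# Accuracy of a model that always predicts the most frequent class.
majority_baseline = max(priors)

print([round(p, 4) for p in priors])
print(round(sum(priors), 3), round(majority_baseline, 3))
```

Under that reading, the largest class accounts for roughly 90% of the training rows, so always predicting it would already score about 0.90 accuracy. The cross-validated accuracies in this section are best judged against that baseline.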

## 3. CatBoost

In [21]:
from catboost import CatBoostClassifier

In [22]:
catboost = CatBoostClassifier(verbose=0)

In [23]:
catboost_params = {
    'iterations': [100, 200],
    'learning_rate': [0.01, 0.1],
    'depth': [6, 8, 10]
}

In [24]:
catboost_cv = GridSearchCV(
    estimator=catboost,
    param_grid=catboost_params,
    cv=cv,
    scoring='accuracy',
    verbose=1
)

In [25]:
catboost_cv.fit(X_train, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


In [26]:
catboost_cv.best_params_

{'depth': 10, 'iterations': 200, 'learning_rate': 0.1}

In [27]:
catboost_cv.best_score_

0.9320745338832171
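
With three searches completed, their best mean cross-validation accuracies can be compared directly. The dictionary below copies the `best_score_` values reported above; the variable names are illustrative.

```python
# Best mean CV accuracy reported by each completed search above.
best_scores = {
    "Decision Tree": 0.9273976788992107,
    "LightGBM": 0.9353166304975007,
    "CatBoost": 0.9320745338832171,
}

# Rank the models from highest to lowest CV accuracy.
ranking = sorted(best_scores.items(), key=lambda item: item[1], reverse=True)
best_model = ranking[0][0]

for name, score in ranking:
    print(f"{name:<14}{score:.4f}")
print("Best so far:", best_model)
```

The gaps between the three models are under one percentage point, so the ranking may not survive a different random seed or split.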

## 4. Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
r_forest = RandomForestClassifier()

In [None]:
r_forest_params = {
    'n_estimators': [50, 100, 200, 300],
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 7, 9, 10, 11, 12, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [2, 5, 7, 9],
    # Note: 'auto' (equivalent to 'sqrt' for classifiers) was removed in scikit-learn 1.3
    'max_features': ['auto', 'sqrt', 'log2', None],
    'class_weight': [None, 'balanced'],
    'random_state': [42]
}

In [None]:
r_forest_cv = GridSearchCV(
    estimator=r_forest,
    param_grid=r_forest_params,
    cv=cv,
    scoring='accuracy',
    verbose=1
)

In [None]:
r_forest_cv.fit(X_train, y_train)

In [None]:
r_forest_cv.best_params_

In [None]:
r_forest_cv.best_score_
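
The Random Forest grid above expands to 5376 candidates, which is 26880 fits with 5 folds, and the cells show no recorded output. One possible alternative, not used in the notebook, is `RandomizedSearchCV`, which samples a fixed number of candidates from the same kind of grid. This is a minimal sketch on synthetic data with a reduced grid, so the values are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Small synthetic dataset so the sketch runs quickly.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# Reduced, illustrative version of r_forest_params.
param_distributions = {
    "n_estimators": [50, 100],
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 7, 10],
    "min_samples_leaf": [2, 5],
    "max_features": ["sqrt", "log2", None],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,  # number of sampled candidates, instead of the full grid
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    scoring="accuracy",
    random_state=42,
)
search.fit(X_toy, y_toy)

n_evaluated = len(search.cv_results_["params"])
print(n_evaluated, search.best_params_)
```

The trade-off is that a sampled search can miss the best combination in the grid. In exchange, its cost is set by `n_iter` rather than by the grid size.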

## 5. KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier()

In [None]:
knn_params = {
    'n_neighbors' : [5, 7, 9, 11, 13, 15],
    'weights' : ['uniform','distance'],
    'metric' : ['minkowski','euclidean','manhattan']
}

In [None]:
knn_cv = GridSearchCV(
    estimator=knn,
    param_grid=knn_params,
    cv=cv,
    scoring='accuracy',
    verbose=1
)

In [None]:
knn_cv.fit(X_train, y_train)

In [None]:
knn_cv.best_params_

In [None]:
knn_cv.best_score_
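
KNN is distance-based, so features on larger scales can dominate the neighbor search. The notebook fits KNN on the raw features. The sketch below shows one common variant, scaling inside a `Pipeline` so that the scaler is fitted on each training fold only. It runs on synthetic data, and the names `knn_pipe` and `knn_pipe_params` are illustrative. Note also that `'minkowski'` with the default `p=2` is the same distance as `'euclidean'`, so the metric grid above evaluates that setting twice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=42
)
X_toy[:, 0] *= 1000  # exaggerate the scale of one feature

knn_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Pipeline parameters are addressed as <step name>__<parameter>.
knn_pipe_params = {
    "knn__n_neighbors": [5, 9, 13],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(
    estimator=knn_pipe,
    param_grid=knn_pipe_params,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="accuracy",
)
search.fit(X_toy, y_toy)
print(search.best_params_, round(search.best_score_, 3))
```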