# Classification

Para el caso de la clasificación, la variable a predecir será `CAUSAACCI` y `TIPACCID`, a partir de otras variables que se enuentran mayormente correlacionadas entre sí, además de otras que se eligieron a criterio.
La forma para la elección del modelo será con el mejor resultado dado por **Cross Validation** con diferentes hiperparámetros aplicados a cada modelo, tomando los modelos de clasificación como **Decision Tree, LightGBM, CatBoost, KNN, Random Forest**.

CAUSAACCI (Causa probable o presunta del accidente)
- 1: Conductor
- 2: Peatón o pasajero
- 3: Falla de vehículo
- 4: Mala condición del camino
- 5: Otra

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, RandomizedSearchCV, StratifiedKFold

In [2]:
df = pd.read_csv('processed_nacional.csv')

Para el caso de la clasificación, no son necesarias las columnas de heridos y muertos más que las de los totales (`TOTMUERTOS`, `TOTHERIDOS`), ya que la variable `CAUSAACCI` no tiene correlación con las mismas y en caso de, es suficiente con las columnas de totales.

In [None]:
df.shape

(180219, 38)

In [None]:
df.columns

Index(['EDO', 'MES', 'ANIO', 'MPIO', 'HORA', 'MINUTOS', 'DIA', 'DIASEMANA',
       'ZONA', 'TIPACCID', 'AUTOMOVIL', 'CAMPASAJ', 'MOTOCICLET', 'BICICLETA',
       'OTROVEHIC', 'CAUSAACCI', 'CAPAROD', 'SEXO', 'EDAD', 'CONDMUERTO',
       'CONDHERIDO', 'PASAMUERTO', 'PASAHERIDO', 'PEATMUERTO', 'PEATHERIDO',
       'CICLMUERTO', 'CICLHERIDO', 'OTROMUERTO', 'OTROHERIDO', 'TOTMUERTOS',
       'TOTHERIDOS', 'CLASE', 'CALLE1', 'LONGITUD', 'LATITUD', 'TRANPUBLICO',
       'VEHICARGA', 'ALIENTOCINT'],
      dtype='object')

In [3]:
y = df['CAUSAACCI']

X = df[[
    'EDO', 'MES', 'HORA', 'MINUTOS', 'DIA', 'DIASEMANA', 'ZONA',
    'AUTOMOVIL', 'CAMPASAJ', 'MOTOCICLET', 'BICICLETA', 'TRANPUBLICO',
    'VEHICARGA', 'OTROVEHIC', 'SEXO', 'EDAD', 'CAPAROD', 'ALIENTOCINT',
    'TOTMUERTOS', 'TOTHERIDOS', 'CLASE', 'TIPACCID'
]]

print(y.shape, X.shape)

(180219,) (180219, 22)


In [4]:
X_train, X_test, y_train, y_test, = train_test_split(X, y, test_size=0.30, random_state=42)

In [5]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

## 1. Decision Tree Classifier

In [29]:
from sklearn.tree import DecisionTreeClassifier

In [30]:
tree = DecisionTreeClassifier()

In [31]:
tree_params = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 7, 9, 10, 12, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 3, 5],
    'splitter': ['best', 'random'],
    'max_features': ['auto', 'sqrt', 'log2', None],
    'class_weight': [None, 'balanced'],
    'min_impurity_decrease': [0.0, 0.05, 0.1],
    'random_state': [42]
}

In [32]:
tree_cv = RandomizedSearchCV(
    estimator=tree,
    param_distributions=tree_params,
    cv=cv,
    scoring='accuracy',
    verbose=1
)

In [33]:
tree_cv.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


10 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.11/dist-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/usr/local/lib/python3.11/dist-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/usr/local/lib/python3.11/dist-packages/sklearn/utils/_param_validation.py", line 98, in validate_parameter_constraints
    raise InvalidParameterError(
sklea

In [34]:
tree_cv.best_params_

{'splitter': 'best',
 'random_state': 42,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'min_impurity_decrease': 0.0,
 'max_features': None,
 'max_depth': 7,
 'criterion': 'entropy',
 'class_weight': None}

In [35]:
tree_cv.best_score_

0.9230061884155525

---

In [7]:
pip install lightgbm catboost

Collecting catboost
  Downloading catboost-1.2.7-cp311-cp311-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.7-cp311-cp311-manylinux2014_x86_64.whl (98.7 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m98.7/98.7 MB[0m [31m8.4 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: catboost
Successfully installed catboost-1.2.7


## 2. LightGBM

In [36]:
from lightgbm import LGBMClassifier

In [37]:
lgbm = LGBMClassifier()

In [38]:
lgbm_params = {
    'n_estimators': [50, 100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1],
    'num_leaves': [31, 50]
}

In [39]:
lgbm_cv = RandomizedSearchCV(
    estimator=lgbm,
    param_distributions=lgbm_params,
    cv=cv,
    scoring='accuracy',
    verbose=1
)

In [40]:
lgbm_cv.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.018157 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 344
[LightGBM] [Info] Number of data points in the train set: 100922, number of used features: 22
[LightGBM] [Info] Start training from score -0.105369
[LightGBM] [Info] Start training from score -4.902030
[LightGBM] [Info] Start training from score -5.183509
[LightGBM] [Info] Start training from score -3.240126
[LightGBM] [Info] Start training from score -3.040537
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.018351 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 345
[LightGBM] [Info] Number of data points in the train set:

In [41]:
lgbm_cv.best_params_

{'num_leaves': 50, 'n_estimators': 200, 'learning_rate': 0.05}

In [42]:
lgbm_cv.best_score_

0.9353166304975007

## 3. CatBoost

In [8]:
from catboost import CatBoostClassifier

In [9]:
catboost = CatBoostClassifier(verbose=0)

In [10]:
catboost_params = {
    'iterations': [100, 200],
    'learning_rate': [0.01, 0.1],
    'depth': [6, 8, 10]
}

In [11]:
catboost_cv = RandomizedSearchCV(
    estimator=catboost,
    param_distributions=catboost_params,
    cv=cv,
    scoring='accuracy',
    verbose=2
)

In [12]:
catboost_cv.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END ........depth=10, iterations=100, learning_rate=0.1; total time=  34.7s
[CV] END ........depth=10, iterations=100, learning_rate=0.1; total time=  34.2s
[CV] END ........depth=10, iterations=100, learning_rate=0.1; total time=  52.6s
[CV] END ........depth=10, iterations=100, learning_rate=0.1; total time=  33.0s
[CV] END ........depth=10, iterations=100, learning_rate=0.1; total time=  34.0s
[CV] END .........depth=8, iterations=200, learning_rate=0.1; total time=  51.7s
[CV] END .........depth=8, iterations=200, learning_rate=0.1; total time=  51.5s
[CV] END .........depth=8, iterations=200, learning_rate=0.1; total time=  53.3s
[CV] END .........depth=8, iterations=200, learning_rate=0.1; total time=  51.7s
[CV] END .........depth=8, iterations=200, learning_rate=0.1; total time=  50.8s
[CV] END .........depth=8, iterations=100, learning_rate=0.1; total time=  25.6s
[CV] END .........depth=8, iterations=100, learn

In [13]:
catboost_cv.best_params_

{'learning_rate': 0.1, 'iterations': 200, 'depth': 10}

In [14]:
catboost_cv.best_score_

0.9320745338832171

## 4. Random Forest Classifier

In [15]:
from sklearn.ensemble import RandomForestClassifier

In [16]:
r_forest = RandomForestClassifier()

In [17]:
r_forest_params = {
    'n_estimators': [50, 100, 200, 300],
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 7, 9, 10, 11, 12, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [2, 5, 7, 9],
    'max_features': ['auto', 'sqrt', 'log2', None],
    'class_weight': [None, 'balanced'],
    'random_state': [42]
}

In [18]:
r_forest_cv = RandomizedSearchCV(
    estimator=r_forest,
    param_distributions=r_forest_params,
    cv=cv,
    scoring='accuracy',
    verbose=2
)

In [19]:
r_forest_cv.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END class_weight=None, criterion=gini, max_depth=9, max_features=None, min_samples_leaf=5, min_samples_split=5, n_estimators=50, random_state=42; total time=  15.0s
[CV] END class_weight=None, criterion=gini, max_depth=9, max_features=None, min_samples_leaf=5, min_samples_split=5, n_estimators=50, random_state=42; total time=  14.9s
[CV] END class_weight=None, criterion=gini, max_depth=9, max_features=None, min_samples_leaf=5, min_samples_split=5, n_estimators=50, random_state=42; total time=  15.0s
[CV] END class_weight=None, criterion=gini, max_depth=9, max_features=None, min_samples_leaf=5, min_samples_split=5, n_estimators=50, random_state=42; total time=  15.7s
[CV] END class_weight=None, criterion=gini, max_depth=9, max_features=None, min_samples_leaf=5, min_samples_split=5, n_estimators=50, random_state=42; total time=  14.8s
[CV] END class_weight=balanced, criterion=entropy, max_depth=9, max_features=log2, min_sa

5 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.11/dist-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/usr/local/lib/python3.11/dist-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/usr/local/lib/python3.11/dist-packages/sklearn/utils/_param_validation.py", line 98, in validate_parameter_constraints
    raise InvalidParameterError(
sklearn

In [20]:
r_forest_cv.best_params_

{'random_state': 42,
 'n_estimators': 100,
 'min_samples_split': 5,
 'min_samples_leaf': 9,
 'max_features': None,
 'max_depth': 12,
 'criterion': 'gini',
 'class_weight': None}

In [21]:
r_forest_cv.best_score_

0.9311470810346563

## 5. KNN

In [22]:
from sklearn.neighbors import KNeighborsClassifier

In [23]:
knn = KNeighborsClassifier()

In [24]:
knn_params = {
    'n_neighbors' : [5, 7, 9, 11, 13, 15],
    'weights' : ['uniform','distance'],
    'metric' : ['minkowski','euclidean','manhattan']
}

In [25]:
knn_cv = RandomizedSearchCV(
    estimator=knn,
    param_distributions=knn_params,
    cv=cv,
    scoring='accuracy',
    verbose=2
)

In [26]:
knn_cv.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END .metric=euclidean, n_neighbors=15, weights=distance; total time=  59.2s
[CV] END .metric=euclidean, n_neighbors=15, weights=distance; total time= 1.0min
[CV] END .metric=euclidean, n_neighbors=15, weights=distance; total time=  54.1s
[CV] END .metric=euclidean, n_neighbors=15, weights=distance; total time=  56.8s
[CV] END .metric=euclidean, n_neighbors=15, weights=distance; total time=  59.7s
[CV] END ..metric=euclidean, n_neighbors=9, weights=distance; total time=  55.0s
[CV] END ..metric=euclidean, n_neighbors=9, weights=distance; total time=  59.8s
[CV] END ..metric=euclidean, n_neighbors=9, weights=distance; total time=  53.8s
[CV] END ..metric=euclidean, n_neighbors=9, weights=distance; total time=  56.8s
[CV] END ..metric=euclidean, n_neighbors=9, weights=distance; total time=  58.5s
[CV] END .metric=euclidean, n_neighbors=13, weights=distance; total time=  57.4s
[CV] END .metric=euclidean, n_neighbors=13, weig

In [27]:
knn_cv.best_params_

{'weights': 'uniform', 'n_neighbors': 11, 'metric': 'manhattan'}

In [28]:
knn_cv.best_score_

0.9027450763977708