# Training with all attributes
One-hot encoding is applied to the categorical data, and two models are then trained:
1. A logistic regression model
2. A support vector machine (SVM) model

In each case the best model is selected by cross-validation (10-fold for logistic regression, a 3-fold grid search for the SVM), and the resulting models are compared.

We start by loading the data set.

In [2]:
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# Load already preprocessed dataset
dF = pd.read_csv("prep_sales_data.csv")
dF.head(5)

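| Unnamed: 0 | flag | gender | education | house_val | age | online | customer_psy | marriage | child | occupation | mortgage | house_owner | region | car_prob | fam_income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0.75 | 0.024869 | 6upto65 | 1 | B | 1 | N | Professional | 2Med | 1 | West | 0.1 | 0.545455 |
| 1 | 0 | 1 | 0.75 | 0.041693 | 5upto55 | 1 | C | 1 | Y | Professional | 1Low | 1 | South | 0.2 | 0.727273 |
| 2 | 1 | 1 | 0.75 | 0.024569 | 4upto45 | 0 | F | 1 | U | Blue Collar | 1Low | 1 | South | 0.3 | 0.363636 |
| 3 | 1 | 1 | 0.5 | 0.036059 | 5upto55 | 1 | C | 1 | Y | Professional | 3High | 1 | Midwest | 0.1 | 0.818182 |
| 4 | 1 | 0 | 0.0 | 0.016288 | 1_Unk | 1 | G | 1 | Y | Professional | 1Low | 0 | South | 0.7 | 0.181818 |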


In [3]:
# Labels vector (target column `flag`)
y = dF.flag
# Features matrix: drop the target, then one-hot encode the categorical columns
X = pd.get_dummies(dF.drop(columns='flag'))
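
To make the encoding step concrete, here is a minimal sketch on a made-up toy frame (the values and the name `toy` are assumptions for illustration, not rows from `prep_sales_data.csv`). It shows that `pd.get_dummies` leaves numeric columns untouched and expands each string column into one indicator column per category.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "flag": [1, 0, 1],
    "education": [0.75, 0.50, 0.00],
    "region": ["West", "South", "South"],
    "child": ["N", "Y", "U"],
})

toy_y = toy["flag"]
# Numeric columns pass through; string columns become indicator columns
toy_X = pd.get_dummies(toy.drop(columns="flag"))

print(toy_X.columns.tolist())
print(toy_X.shape)
```

Each indicator column is named `<column>_<category>`, which is why the feature matrix ends up with more columns than the original table.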

# Logistic regression
We now fit a logistic regression model. The regularization strength is selected by 10-fold cross-validation.

In [4]:
# Import scikit-learn modules and generate instance of regression model
from sklearn.linear_model import LogisticRegressionCV
from sklearn import metrics
logistic_reg = LogisticRegressionCV(cv = 10, random_state = 1, max_iter = 150)

# Train model on X and y, and automatically select best after cross-validation
logistic_reg_model = logistic_reg.fit(X, y)
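
As a hedged illustration of what "selecting the best model" means here, the sketch below fits `LogisticRegressionCV` on synthetic data (`X_demo`, `y_demo` and `demo_model` are stand-in names, not the notebook's variables) and reads back the candidate regularization values and the one that was chosen. `max_iter` is raised to 1000 in this sketch only to avoid convergence warnings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in for (X, y); the real data set is not used here
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=1)

demo_model = LogisticRegressionCV(cv=10, random_state=1, max_iter=1000)
demo_model.fit(X_demo, y_demo)

# Cs_ holds the candidate values of C; C_ holds the selected value
candidate_Cs = np.asarray(demo_model.Cs_)
best_C = float(np.ravel(demo_model.C_)[0])

print("Candidate C values:", candidate_Cs)
print("Selected C:", best_C)
```

With the default settings, the candidates form a logarithmic grid of ten values, and the estimator is refit on all the data with the selected one.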

In [5]:
# Details of the model
print("Model coefficients: \n", logistic_reg_model.coef_)
print("Accuracy on the full data set:", logistic_reg_model.score(X, y))

# Quality of the model, evaluated on the same data used for training
y_pred = logistic_reg_model.predict(X)
# Note: this AUC is computed from hard class predictions, not from probabilities
AUC = metrics.roc_auc_score(y, y_pred)
print("AUC =", AUC)
print("Classification report:\n", classification_report(y, y_pred))
print("Confusion matrix:\n", confusion_matrix(y, y_pred, labels=[0, 1]))

Model coefficients: 
 [[-0.80164737  0.66627807  0.78684295  0.66600931 -0.20670227  0.04372072
  -0.75978905  0.37723813  0.00994293 -0.68805103  0.18337929  0.36995366
   0.3693824   0.10829249 -0.3603389   0.23013478  0.14752304  0.31637954
  -0.13685667  0.17763823 -0.0113811  -0.10283323 -0.22521224 -0.22852157
  -0.17430994 -0.01551017  0.06837687 -0.06030585 -0.11251318 -0.34860262
   0.07804721  0.35713845 -0.05692572  0.07541671 -0.25768302  0.07374792
   0.17649596 -0.15072725 -0.09469509  0.23113718  0.02865395 -0.02180794]]
Accuracy on the full data set: 0.6998047372442482
AUC = 0.6874985240578948
Classification report:
               precision    recall  f1-score   support

           0       0.67      0.59      0.63     10219
           1       0.72      0.78      0.75     13339

   micro avg       0.70      0.70      0.70     23558
   macro avg       0.70      0.69      0.69     23558
weighted avg       0.70      0.70      0.70     23558

Confusion matrix: (values truncated in the source output)
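
The AUC in the cell above is computed from the 0/1 predictions. For a binary problem that value coincides with the balanced accuracy (the mean of the two per-class recalls), and it generally differs from the AUC obtained from predicted probabilities. The sketch below, on synthetic data with stand-in names (`X_demo`, `y_demo`, `clf`), shows both variants side by side; it is an illustration, not a re-run of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Synthetic stand-in for (X, y)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=1)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

labels = clf.predict(X_demo)               # hard 0/1 predictions
scores = clf.predict_proba(X_demo)[:, 1]   # probability of class 1

auc_from_labels = roc_auc_score(y_demo, labels)
auc_from_scores = roc_auc_score(y_demo, scores)
bal_acc = balanced_accuracy_score(y_demo, labels)

print("AUC from hard labels:  ", auc_from_labels)
print("Balanced accuracy:     ", bal_acc)
print("AUC from probabilities:", auc_from_scores)
```

If a ranking-based AUC is what is wanted, passing `predict_proba(X)[:, 1]` (or `decision_function(X)`) to `roc_auc_score` is the usual choice.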

# Support vector machines (SVM)
We now fit an SVM model. For model selection we run a grid search over the regularization parameter C and the RBF kernel coefficient gamma, scoring each combination with 3-fold cross-validation.

In [7]:
# Import scikit-learn modules
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

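# Slow: 9 parameter combinations x 3 folds, each one an RBF SVM fit
# Find the best SVM model through 3-fold CV and grid search, radial (RBF) kernel
grid = {'C': [1, 10, 100], 'gamma': [0.001, 0.01, 0.1]}
# The `iid` argument no longer exists in current scikit-learn releases;
# fold scores are averaged with equal weight, as with the earlier iid=False
svm_model = GridSearchCV(SVC(kernel='rbf'), grid, cv=3)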

svm_model = svm_model.fit(X_train, y_train)
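
To see how the grid search ranks the nine (C, gamma) combinations, one option is to tabulate `cv_results_`. The sketch below uses the same grid on a small synthetic data set so that it runs quickly; `X_demo`, `y_demo`, `search` and `results` are names assumed for this illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic stand-in for (X_train, y_train)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=1)

grid = {"C": [1, 10, 100], "gamma": [0.001, 0.01, 0.1]}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=3).fit(X_demo, y_demo)

# One row per parameter combination, best mean CV score first
results = pd.DataFrame(search.cv_results_)[
    ["param_C", "param_gamma", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(results.to_string(index=False))
print("Best parameters:", search.best_params_)
```

Looking at the whole table, rather than only the winner, helps to judge whether the best combination sits on the edge of the grid, in which case a wider grid may be worth trying.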

In [9]:
print("Best SVM model found by grid search:")
print(svm_model.best_estimator_)

Best SVM model found by grid search:
SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.01, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
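
For the comparison between the two models, a like-for-like check would score both on the same held-out split. The following is a minimal sketch under stated assumptions: synthetic data stands in for the real features, the SVM uses the `C=10`, `gamma=0.01` values reported above, and the names `candidates` and `test_accuracy` are introduced only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for (X, y), split 70/30 as in the SVM cell
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1
)

candidates = {
    "logistic regression": LogisticRegressionCV(cv=10, max_iter=1000, random_state=1),
    "SVM (rbf)": SVC(kernel="rbf", C=10, gamma=0.01),
}

# Fit each model on the training part and score it on the held-out part
test_accuracy = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    test_accuracy[name] = accuracy_score(y_te, model.predict(X_te))

for name, acc in test_accuracy.items():
    print(f"{name}: test accuracy = {acc:.3f}")
```

Scoring both models on data that neither saw during fitting avoids favoring the one that was evaluated on its own training set.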
