In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
from matplotlib import pyplot
%matplotlib inline

# Data understanding
## Initial data collection

The data describe wine quality.


The data were obtained from: https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv


## Data description

The data are stored in CSV (.csv) files and are handled with Python.

In [2]:
wine=pd.read_csv('wineEDA.csv', sep=',')
wine

Unnamed: 0.1,Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,quality_binned
0,0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5,Medium
1,1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5,Medium
2,2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5,Medium
3,3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6,Medium
4,4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5,Medium
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1594,1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5,Medium
1595,1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6,Medium
1596,1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6,Medium
1597,1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5,Medium
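
The `quality_binned` column is not part of the original UCI file, and the notebook does not show how it was derived. As a hedged sketch, the following assumes hypothetical cut points (scores up to 4 as Low, 5 to 6 as Medium, 7 and above as High) and applies them with `pd.cut`. The thresholds actually used for `wineEDA.csv` may differ.

```python
import pandas as pd

# Hypothetical cut points; the thresholds actually used for wineEDA.csv are not shown.
QUALITY_BINS = [0, 4, 6, 10]
QUALITY_LABELS = ["Low", "Medium", "High"]

def bin_quality(scores: pd.Series) -> pd.Series:
    """Map integer quality scores to Low / Medium / High categories."""
    # Intervals are right-closed: (0, 4], (4, 6], (6, 10].
    return pd.cut(scores, bins=QUALITY_BINS, labels=QUALITY_LABELS)

# Small illustrative sample, not the real data set.
sample = pd.Series([3, 5, 6, 7, 8], name="quality")
binned = bin_quality(sample)
print(binned.tolist())
```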


# Modelado

## Preparación de los datos de entrenamiento y prueba

Remove the target columns (`quality` and `quality_binned`) from the feature set.

In [3]:
x = wine.drop(columns= ['quality', 'quality_binned'])

y = wine['quality_binned']
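
The table above contains two columns, `Unnamed: 0.1` and `Unnamed: 0`, that look like row indices saved by earlier exports rather than wine measurements. As written, `x` keeps them as features. A possible variant, shown here on a tiny made-up frame, drops them before modelling. Applying it to the real data would change the scores reported later in this notebook.

```python
import pandas as pd

# Tiny made-up frame that mimics the layout of wineEDA.csv.
demo = pd.DataFrame({
    "Unnamed: 0.1": [0, 1, 2],
    "Unnamed: 0": [0, 1, 2],
    "alcohol": [9.4, 9.8, 11.2],
    "quality": [5, 5, 6],
    "quality_binned": ["Medium", "Medium", "Medium"],
})

# Columns whose name starts with "Unnamed" are treated as leftover indices.
index_like = [col for col in demo.columns if col.startswith("Unnamed")]

x_demo = demo.drop(columns=index_like + ["quality", "quality_binned"])
y_demo = demo["quality_binned"]
print(list(x_demo.columns))
```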

Split the data into training and test sets.
This requires the features, the target, and the size of the test set.

In [4]:
from sklearn.model_selection import train_test_split 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

Size of the training and test sets

In [5]:
print('x train: ', len(x_train))
print('y train: ', len(y_train))
print('x test: ', len(x_test))
print('y test: ', len(y_test))

x train:  1199
y train:  1199
x test:  400
y test:  400
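
The classes in `quality_binned` are imbalanced, so a purely random split can leave the smaller classes unevenly represented. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses made-up labels to show that each class keeps its proportion in the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 Medium, 12 High, 8 Low.
y_demo = np.array(["Medium"] * 80 + ["High"] * 12 + ["Low"] * 8)
x_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo asks for the same class proportions in both subsets.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

labels, counts = np.unique(y_te, return_counts=True)
test_counts = dict(zip(labels.tolist(), counts.tolist()))
print(test_counts)
```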



## Logistic Regression


Logistic regression is a statistical method for predicting classes. In its basic form it describes and estimates the relationship between a binary dependent variable and the independent variables. The target used here, `quality_binned`, has three classes (High, Low, Medium), and scikit-learn's `LogisticRegression` handles it as a multiclass problem.

* This method is one of the simplest and most commonly used machine learning algorithms for two-class classification.
* It computes the probability of a binary event by applying the logistic (sigmoid) function to a linear combination of the features; the logit is the inverse of that function.
* The outcome, or target variable, is dichotomous in nature.
* The dependent variable in logistic regression follows a Bernoulli distribution.
* Estimation is carried out by maximum likelihood estimation (MLE).
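
To make the link between linear scores and probabilities concrete, here is a minimal NumPy sketch of the logistic function and its inverse, the logit. The scores are illustrative values, not coefficients fitted on the wine data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds: the inverse of the logistic function."""
    return np.log(p / (1.0 - p))

scores = np.array([-2.0, 0.0, 2.0])   # illustrative linear scores w.x + b
probs = sigmoid(scores)
print(probs)
```

A score of 0 maps to a probability of 0.5, and applying `logit` to the probabilities recovers the original scores.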

In [6]:
from sklearn.linear_model import LogisticRegression
reg = LogisticRegression(max_iter= 10000)
reg.fit(x_train, y_train)

LogisticRegression(max_iter=10000)

In [7]:
y_pred = reg.predict(x_test)
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, r2_score, roc_auc_score
print(classification_report(y_test, y_pred))
print('Confusion matrix: ', confusion_matrix(y_test, y_pred))
print('Training score: ', reg.score(x_train, y_train)*100)

              precision    recall  f1-score   support

        High       0.61      0.31      0.41        45
         Low       0.00      0.00      0.00        16
      Medium       0.88      0.97      0.92       339

    accuracy                           0.86       400
   macro avg       0.49      0.43      0.44       400
weighted avg       0.81      0.86      0.83       400

Confusion matrix:  [[ 14   0  31]
 [  0   0  16]
 [  9   0 330]]
Training score:  83.73644703919933




In [8]:
score_lr = accuracy_score(y_test, y_pred)*100
print("Logistic Regression accuracy score: ", score_lr)

Logistic Regression accuracy score:  86.0
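
Accuracy alone can hide weak performance on the smaller classes. As a sketch that uses only the confusion matrix printed above (rows are the true classes High, Low, Medium), per-class recall and its unweighted mean can be recomputed with NumPy:

```python
import numpy as np

# Confusion matrix reported above for logistic regression.
# Rows: true High, Low, Medium. Columns: predicted High, Low, Medium.
cm = np.array([[14, 0, 31],
               [0, 0, 16],
               [9, 0, 330]])

accuracy = np.trace(cm) / cm.sum()        # share of correct predictions
recall = np.diag(cm) / cm.sum(axis=1)     # per-class recall
balanced_accuracy = recall.mean()         # unweighted mean of the recalls

print(accuracy, recall.round(2), round(float(balanced_accuracy), 2))
```

The mean recall of about 0.43 matches the macro-average recall in the report, and the empty Low column shows that the model never predicts that class.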


## K Nearest Neighbors Classifier (KNN)
### Classification by the K nearest neighbors

* To measure similarity between points, the algorithm computes the distance between them, using measures such as the Euclidean, Manhattan, Hamming, and Minkowski distances.
* It works well with data that have few features.
* K is the number of neighbors evaluated.
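
The distance measures listed above can be sketched in a few lines of NumPy. The helper names `minkowski` and `hamming` are illustrative and are not the functions scikit-learn uses internally.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two numeric vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

def hamming(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.mean(a != b))

p1, p2 = [0, 0], [3, 4]
manhattan = minkowski(p1, p2, p=1)   # p = 1 gives the Manhattan distance
euclidean = minkowski(p1, p2, p=2)   # p = 2 gives the Euclidean distance
hamming_demo = hamming([1, 0, 1, 1], [1, 1, 1, 0])
print(manhattan, euclidean, hamming_demo)
```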

Create the KNN object and fit the model on the training data

In [9]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors= 5)

knn.fit(x_train, y_train)

KNeighborsClassifier()

Generate the predictions and evaluate the model

In [10]:
y_pred = knn.predict(x_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, mean_squared_error, r2_score
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print("Training Score: ", knn.score(x_train, y_train)*100)
print("Testing Score:  ", knn.score(x_test, y_test)*100)

              precision    recall  f1-score   support

        High       0.28      0.24      0.26        45
         Low       0.33      0.06      0.11        16
      Medium       0.87      0.91      0.89       339

    accuracy                           0.81       400
   macro avg       0.49      0.41      0.42       400
weighted avg       0.78      0.81      0.79       400

[[ 11   0  34]
 [  1   1  14]
 [ 27   2 310]]
Training Score:  84.7372810675563
Testing Score:   80.5


Compute the accuracy on the test set

In [11]:
score_knn = accuracy_score(y_test, y_pred)*100
print("KNN accuracy Score: ", score_knn)

KNN accuracy Score:  80.5
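
Because KNN relies on distances, features measured on large scales (for example `total sulfur dioxide`) can dominate features on small scales (for example `density`). One common option, not used in this notebook, is to standardize the features inside a pipeline. The sketch below uses synthetic data from `make_classification`, so its score says nothing about the wine data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the wine features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# The scaler is fitted on the training data only, then applied before KNN.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_tr, y_tr)

test_accuracy = scaled_knn.score(X_te, y_te)
print(test_accuracy)
```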



## Decision Tree Classifier

The goal is to build a model that predicts the value of a target variable by learning simple decision rules inferred from the features of the data.

* The deeper the tree, the more complex the decision rules and the more closely the model fits the training data.

* Decision trees are easy to interpret and visualize.

* They require less data preprocessing from the user; for example, there is no need to normalize columns.

* They are sensitive to noisy data and can overfit it. This can be controlled by limiting the depth of the tree.
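
The effect of depth on overfitting can be illustrated with a small experiment. This is a sketch on synthetic, deliberately noisy data (`flip_y=0.1` randomly reassigns a share of the labels), not on the wine data; it compares training and test accuracy for a shallow tree, a tree of depth 6, and an unrestricted tree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_classes=3,
    flip_y=0.1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

results = {}
for depth in (2, 6, None):   # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for each depth setting.
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(depth, results[depth])
```

The unrestricted tree fits the training set perfectly, noise included, which is why a gap between training and test accuracy is the usual sign of overfitting.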


In [12]:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier(max_depth=6, random_state=0)

dtree.fit(x_train, y_train)

y_pred = dtree.predict(x_test)

In [13]:
y_pred=dtree.predict(x_test)

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, mean_squared_error
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print("Training Score: ",dtree.score(x_train, y_train)*100)


              precision    recall  f1-score   support

        High       0.42      0.49      0.45        45
         Low       0.33      0.06      0.11        16
      Medium       0.89      0.91      0.90       339

    accuracy                           0.82       400
   macro avg       0.55      0.49      0.48       400
weighted avg       0.82      0.82      0.82       400

[[ 22   0  23]
 [  1   1  14]
 [ 30   2 307]]
Training Score:  89.49124270225187


In [14]:
score_ds = accuracy_score(y_test, y_pred)*100
print("Decision Tree accuracy Score: ", score_ds)

Decision Tree accuracy Score:  82.5


## Pipeline

In [15]:
from sklearn.pipeline import make_pipeline # https://stackoverflow.com/questions/40708077/what-is-the-difference-between-pipeline-and-make-pipeline-in-scikit

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import ExtraTreesClassifier

In [16]:
Input=[('scale',StandardScaler()), ('polynomial', PolynomialFeatures(include_bias=False)), ('model',ExtraTreesClassifier())]

In [17]:
pipe=Pipeline(Input)
pipe

Pipeline(steps=[('scale', StandardScaler()),
                ('polynomial', PolynomialFeatures(include_bias=False)),
                ('model', ExtraTreesClassifier())])

In [18]:
pipe.fit(x_train, y_train)

Pipeline(steps=[('scale', StandardScaler()),
                ('polynomial', PolynomialFeatures(include_bias=False)),
                ('model', ExtraTreesClassifier())])

In [19]:
y_pipe=pipe.predict(x_test)

In [20]:
print(classification_report(y_test, y_pipe))
print(confusion_matrix(y_test, y_pipe))
print("Training Score: ", pipe.score(x_train, y_train)*100)
print("Testing Score: ", pipe.score(x_test, y_test)*100)

              precision    recall  f1-score   support

        High       0.78      0.56      0.65        45
         Low       0.00      0.00      0.00        16
      Medium       0.90      0.98      0.94       339

    accuracy                           0.89       400
   macro avg       0.56      0.51      0.53       400
weighted avg       0.85      0.89      0.87       400

[[ 25   0  20]
 [  0   0  16]
 [  7   1 331]]
Training Score:  100.0
Testing Score:  89.0


In [21]:
pipe.score(x_test,y_test)

0.89
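
A training score of 100% next to a test score of 89% comes from a single split. One way to get an estimate that depends less on that split, shown here as a sketch on synthetic data rather than the wine data, is k-fold cross-validation of the whole pipeline with `cross_val_score`. The step names mirror the ones above; `n_estimators=50` and `random_state=0` are added assumptions that keep the example fast and reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)

demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("polynomial", PolynomialFeatures(include_bias=False)),
    ("model", ExtraTreesClassifier(n_estimators=50, random_state=0)),
])

# Each fold refits the scaler, the polynomial expansion, and the model.
cv_scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=5)
print(cv_scores.mean(), cv_scores.std())
```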


## Random Forest Classifier

In [22]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)

RandomForestClassifier()

In [23]:
y_pred = rfc.predict(x_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, mean_squared_error
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print("Training Score: ", rfc.score(x_train, y_train)*100)
print("Testing Score: ", rfc.score(x_test, y_test)*100)

              precision    recall  f1-score   support

        High       0.71      0.60      0.65        45
         Low       0.00      0.00      0.00        16
      Medium       0.91      0.97      0.94       339

    accuracy                           0.89       400
   macro avg       0.54      0.52      0.53       400
weighted avg       0.85      0.89      0.87       400

[[ 27   0  18]
 [  0   0  16]
 [ 11   0 328]]
Training Score:  100.0
Testing Score:  88.75




In [24]:
score_rf = accuracy_score(y_test, y_pred)*100
print("Random Forest accuracy Score: ", score_rf)

Random Forest accuracy Score:  88.75



## C - Support Vector Classifier (SVC)

In [25]:
from sklearn.svm import SVC
svc = SVC()
svc.fit(x_train, y_train)

SVC()

In [26]:
y_pred = svc.predict(x_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, mean_squared_error, r2_score
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print("Training Score: ", svc.score(x_train, y_train)*100)
print("Testing Score: ", svc.score(x_test, y_test)*100)



              precision    recall  f1-score   support

        High       0.00      0.00      0.00        45
         Low       0.00      0.00      0.00        16
      Medium       0.85      1.00      0.92       339

    accuracy                           0.85       400
   macro avg       0.28      0.33      0.31       400
weighted avg       0.72      0.85      0.78       400

[[  0   0  45]
 [  0   0  16]
 [  0   0 339]]
Training Score:  81.7347789824854
Testing Score:  84.75


In [27]:
score_svc = accuracy_score(y_test, y_pred)*100
print("SVC accuracy Score: ", score_svc)

SVC accuracy Score:  84.75



## Ada Boost Classifier

In [28]:
from sklearn.ensemble import AdaBoostClassifier
# Default base learner; recent scikit-learn releases no longer accept `base_estimator`.
adb = AdaBoostClassifier()
adb.fit(x_train, y_train)

AdaBoostClassifier()

In [29]:
y_pred = adb.predict(x_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, mean_squared_error, r2_score
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print("Training Score: ", adb.score(x_train, y_train)*100)
print("Testing Score: ", adb.score(x_test, y_test)*100)

              precision    recall  f1-score   support

        High       0.38      0.69      0.49        45
         Low       0.10      0.19      0.13        16
      Medium       0.91      0.78      0.84       339

    accuracy                           0.74       400
   macro avg       0.46      0.55      0.49       400
weighted avg       0.82      0.74      0.77       400

[[ 31   0  14]
 [  0   3  13]
 [ 50  26 263]]
Training Score:  75.31276063386156
Testing Score:  74.25


In [30]:
score_ada = accuracy_score(y_test, y_pred)*100
print("Ada Boost accuracy Score: ", score_ada)

Ada Boost accuracy Score:  74.25



## Gradient Boosting Classifier

In [31]:
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
gbc.fit(x_train, y_train)

GradientBoostingClassifier()

In [32]:
y_pred=gbc.predict(x_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, mean_squared_error, r2_score
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print("Training Score: ", gbc.score(x_train, y_train)*100)
print("Testing Score: ", gbc.score(x_test, y_test)*100)

              precision    recall  f1-score   support

        High       0.63      0.58      0.60        45
         Low       0.00      0.00      0.00        16
      Medium       0.90      0.94      0.92       339

    accuracy                           0.86       400
   macro avg       0.51      0.51      0.51       400
weighted avg       0.84      0.86      0.85       400

[[ 26   0  19]
 [  0   0  16]
 [ 15   5 319]]
Training Score:  96.66388657214345
Testing Score:  86.25


In [33]:
score_gb = accuracy_score(y_test, y_pred)*100
print("Gradient Boosting accuracy Score: ", score_gb)

Gradient Boosting accuracy Score:  86.25



## XGB Classifier

In [34]:
from xgboost import XGBClassifier

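# 'multi:softprob' is the multiclass objective that the fitted model reports below;
# the original 'reg:linear' is a regression objective and does not apply to a classifier.
xgb = XGBClassifier(objective='multi:softprob', colsample_bytree=0.3, learning_rate=0.1,
                    max_depth=5, alpha=10, n_estimators=10)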

xgb.fit(x_train, y_train)





XGBClassifier(alpha=10, base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.3,
              enable_categorical=False, gamma=0, gpu_id=-1,
              importance_type=None, interaction_constraints='',
              learning_rate=0.1, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=10, n_jobs=12, num_parallel_tree=1,
              objective='multi:softprob', predictor='auto', random_state=0,
              reg_alpha=10, reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
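
Depending on the installed xgboost version, `XGBClassifier.fit` may reject string class labels such as `High`, `Low`, and `Medium` and expect integers starting at 0. If that happens, one option is to encode the target with scikit-learn's `LabelEncoder` and decode the predictions afterwards. The sketch below shows only the encoding step, on made-up labels.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["Medium", "High", "Medium", "Low", "Medium"])  # made-up sample

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)        # classes are sorted alphabetically
decoded = encoder.inverse_transform(encoded)   # back to the original strings

print(list(encoder.classes_), encoded.tolist())
```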

In [35]:
y_pred = xgb.predict(x_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, mean_squared_error, r2_score
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print("Training Score: ", xgb.score(x_train, y_train)*100)
print("Testing Score: ", xgb.score(x_test, y_test)*100)

              precision    recall  f1-score   support

        High       1.00      0.02      0.04        45
         Low       0.00      0.00      0.00        16
      Medium       0.85      1.00      0.92       339

    accuracy                           0.85       400
   macro avg       0.62      0.34      0.32       400
weighted avg       0.83      0.85      0.78       400

[[  1   0  44]
 [  0   0  16]
 [  0   0 339]]
Training Score:  81.81818181818183
Testing Score:  85.0




In [36]:
score_xgb = accuracy_score(y_test, y_pred)*100
print("XGB accuracy Score: ", score_xgb)

XGB accuracy Score:  85.0



## Naive Bayes

In [37]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(x_train,y_train)

GaussianNB()

In [38]:
y_pred = gnb.predict(x_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, mean_squared_error, r2_score
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print("Training Score: ",gnb.score(x_train, y_train)*100)
print("Testing Score: ", gnb.score(x_test, y_test)*100)

              precision    recall  f1-score   support

        High       0.38      0.71      0.49        45
         Low       0.00      0.00      0.00        16
      Medium       0.91      0.82      0.86       339

    accuracy                           0.77       400
   macro avg       0.43      0.51      0.45       400
weighted avg       0.81      0.77      0.78       400

[[ 32   0  13]
 [  0   0  16]
 [ 53   9 277]]
0.7725
Training Score:  78.0650542118432
Testing Score:  77.25


In [39]:
score_nb = accuracy_score(y_test, y_pred)*100
print("Naive Bayes accuracy Score: ", score_nb)

Naive Bayes accuracy Score:  77.25


## Summary of results

In [40]:
abstract = pd.DataFrame({"Model":['Logistic Regression','KNN Classifier','Decision Tree Classifier', 'Random Forest Classifier', 
                                'SVC Classifier','AdaBoost Classifier', 'Gradient Boosting Classifier','XGB Classifier',
                                'Naive Bayes Classifier'],
                      "Accuracy":[score_lr, score_knn, score_ds, score_rf, score_svc, score_ada, score_gb, score_xgb, score_nb]})

In [41]:
print(abstract)

                          Model  Accuracy
0           Logistic Regression     86.00
1                KNN Classifier     80.50
2      Decision Tree Classifier     82.50
3      Random Forest Classifier     88.75
4                SVC Classifier     84.75
5           AdaBoost Classifier     74.25
6  Gradient Boosting Classifier     86.25
7                XGB Classifier     85.00
8        Naive Bayes Classifier     77.25
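
To read the table more easily, the models can be ranked and compared with a simple baseline. In the test set, 339 of the 400 rows are Medium, so always predicting Medium already reaches 84.75% accuracy. The sketch below copies the accuracies from the table above; the `Above baseline` column is an illustrative addition.

```python
import pandas as pd

# Test accuracies copied from the summary table above.
summary = pd.DataFrame({
    "Model": ["Logistic Regression", "KNN Classifier", "Decision Tree Classifier",
              "Random Forest Classifier", "SVC Classifier", "AdaBoost Classifier",
              "Gradient Boosting Classifier", "XGB Classifier", "Naive Bayes Classifier"],
    "Accuracy": [86.00, 80.50, 82.50, 88.75, 84.75, 74.25, 86.25, 85.00, 77.25],
})

# Accuracy of always predicting Medium: 339 of the 400 test rows.
baseline = 100 * 339 / 400

ranked = summary.sort_values("Accuracy", ascending=False).reset_index(drop=True)
ranked["Above baseline"] = ranked["Accuracy"] > baseline
print(ranked)
```

Only four of the nine models exceed the majority-class baseline, and the SVC score equals it exactly, which is consistent with its confusion matrix, where every row is predicted as Medium.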


By Adilene Calderón, Aaron Lara, Adrían Vázquez. Introducción a la Ciencia de Datos y sus Metodologías (Introduction to Data Science and Its Methodologies). [MCD UNISON](https://mcd.unison.mx)

## References
I_Prerna_Kalura (2021). Breast cancer solution. https://www.kaggle.com/iprernakalura/breast-cancer-solution

https://www.datacamp.com/community/tutorials/k-nearest-neighbor-classification-scikit-learn

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html


https://www.datacamp.com/community/tutorials/decision-tree-classification-python

https://scikit-learn.org/stable/modules/tree.html

https://www.kaggle.com/madhurisivalenka/basic-machine-learning-with-red-wine-quality-data
