# Project 1 – Predicting Wine Quality

### Team Members
- Camilo Andres Galeano Trujillo

### Problem
Using a dataset that describes the red variants of the Portuguese "Vinho Verde" wine, we want to predict the quality of each wine. Quality is an ordinal categorical variable scored from 0 to 10. The following variables are available as predictors:

- fixed acidity
- volatile acidity
- citric acid
- residual sugar
- chlorides
- free sulfur dioxide
- total sulfur dioxide
- density
- pH
- sulphates
- alcohol

### GitHub Link

<https://github.com/cgatrujillo/ml-ean>

In [10]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, classification_report, mean_absolute_error, mean_squared_error

In [3]:
df = pd.read_csv("winequality-red.csv", sep=";")
df.head()

|  | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


The data consist mainly of the chemical properties of each of the 1,599 red wines recorded. Of the twelve variables, eleven are floating-point numeric features, and the last one, `quality`, is an ordinal categorical variable stored as an integer.

In [5]:
X = df.drop(columns="quality")
y = df["quality"]

In [26]:
# Standardize the features, then fit an ordinary least squares regression
model_rl = Pipeline(
    steps=[("scaler", StandardScaler()), ("reglin", LinearRegression())]
)

In [27]:
model_rl.fit(X, y)
print("Linear regression model: %f" % model_rl.score(X, y))

Linear regression model: 0.360552


In [28]:
# score() returns R^2, not predictions, so keep the two in separate variables
r2_train = model_rl.score(X, y)
y_pred = model_rl.predict(X)
mse = mean_squared_error(y, y_pred)
print("Coefficient of determination in the training set: ", r2_train)
print("MAE: ", mean_absolute_error(y, y_pred))
print("MSE: ", mse)
print("RMSE: ", np.sqrt(mse))

Coefficient of determination in the training set:  0.3605517030386881
MAE:  0.5004899635644883
MSE:  0.41676716722140805
RMSE:  0.6455750670692046


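The simple linear regression model does not capture wine quality adequately: it explains only about 36% of the variance in `quality` (R² = 0.36), and its predictions deviate considerably from the true scores, with an RMSE of about 0.65 quality points measured on the training data itself.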
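The scores above are computed on the same rows the model was fit on, so they are likely optimistic. A minimal sketch of a held-out evaluation follows. It uses synthetic data from `make_regression` as a stand-in for the wine features (an assumption, made so the block runs on its own), and the names `evaluate_holdout`, `X_demo`, and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def evaluate_holdout(X, y, test_size=0.2, random_state=12):
    """Fit scaler + linear regression on a training split and score both splits."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model = Pipeline(
        steps=[("scaler", StandardScaler()), ("reglin", LinearRegression())]
    )
    model.fit(X_tr, y_tr)
    pred_te = model.predict(X_te)
    mse = mean_squared_error(y_te, pred_te)
    return {
        "r2_train": model.score(X_tr, y_tr),
        "r2_test": r2_score(y_te, pred_te),
        "mae_test": mean_absolute_error(y_te, pred_te),
        "rmse_test": float(np.sqrt(mse)),
    }


# Synthetic stand-in for the 11 chemical features (NOT the wine data).
X_demo, y_demo = make_regression(
    n_samples=400, n_features=11, noise=25.0, random_state=0
)
metrics = evaluate_holdout(X_demo, y_demo)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With the real data, `X` and `y` from the cells above would replace `X_demo` and `y_demo`. A large gap between `r2_train` and `r2_test` would suggest overfitting.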

In [20]:
df["categorical_quality"] = df['quality'].map(lambda x: 'Bueno' if x >= 6 else 'Regular')

In [22]:
Xtrain, Xtest, ytrain, ytest = train_test_split(
    df.drop(columns=["quality", "categorical_quality"]),
    df["categorical_quality"],
    test_size=0.2,
    random_state=12,
)

In [24]:
Xtrain.describe()

|  | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1279.0 | 1279.0 | 1279.0 | 1279.0 | 1279.0 | 1279.0 | 1279.0 | 1279.0 | 1279.0 | 1279.0 | 1279.0 |
| mean | 8.312119 | 0.528292 | 0.270727 | 2.532604 | 0.086908 | 15.662627 | 46.56294 | 0.996742 | 3.312361 | 0.658288 | 10.431744 |
| std | 1.727279 | 0.179117 | 0.194901 | 1.366531 | 0.04647 | 10.088504 | 33.197896 | 0.001896 | 0.154791 | 0.169065 | 1.083963 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.012 | 1.0 | 6.0 | 0.99007 | 2.74 | 0.33 | 8.4 |
| 25% | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7.0 | 22.0 | 0.9956 | 3.21 | 0.55 | 9.5 |
| 50% | 7.9 | 0.52 | 0.26 | 2.2 | 0.079 | 14.0 | 38.0 | 0.99676 | 3.31 | 0.62 | 10.2 |
| 75% | 9.2 | 0.64 | 0.42 | 2.6 | 0.09 | 21.0 | 63.0 | 0.99786 | 3.4 | 0.73 | 11.1 |
| max | 15.9 | 1.33 | 1.0 | 15.5 | 0.611 | 68.0 | 289.0 | 1.00369 | 4.01 | 2.0 | 14.9 |


In [25]:
Xtest.describe()

|  | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 320.0 | 320.0 | 320.0 | 320.0 | 320.0 | 320.0 | 320.0 | 320.0 | 320.0 | 320.0 | 320.0 |
| mean | 8.349687 | 0.525938 | 0.271969 | 2.563594 | 0.0897 | 16.723437 | 46.0875 | 0.996767 | 3.306125 | 0.657594 | 10.387969 |
| std | 1.797717 | 0.179098 | 0.194705 | 1.573555 | 0.049386 | 11.808925 | 31.704827 | 0.001855 | 0.152897 | 0.171527 | 0.990007 |
| min | 5.0 | 0.16 | 0.0 | 1.3 | 0.012 | 1.0 | 7.0 | 0.99064 | 2.88 | 0.39 | 8.5 |
| 25% | 7.1 | 0.4 | 0.1 | 1.9 | 0.071 | 8.0 | 22.0 | 0.995615 | 3.2 | 0.55 | 9.5 |
| 50% | 7.95 | 0.52 | 0.25 | 2.1 | 0.079 | 13.0 | 38.5 | 0.996665 | 3.31 | 0.62 | 10.1 |
| 75% | 9.3 | 0.63625 | 0.43 | 2.6 | 0.091 | 24.0 | 60.0 | 0.997755 | 3.41 | 0.7325 | 11.0 |
| max | 15.6 | 1.58 | 0.76 | 15.4 | 0.415 | 72.0 | 160.0 | 1.00369 | 3.9 | 1.61 | 14.0 |


In [29]:
# Standardize the features, then fit a logistic regression classifier
model_lr = Pipeline(
    steps=[("scaler", StandardScaler()), ("logistic", LogisticRegression())]
)

In [31]:
# Fit on the training split, then predict on the held-out test split
model_lr.fit(Xtrain, ytrain)
ypred = model_lr.predict(Xtest)
confusion_matrix(ytest, ypred)

array([[134,  40],
       [ 43, 103]], dtype=int64)

In [33]:
print(classification_report(ytest, ypred))

              precision    recall  f1-score   support

       Bueno       0.76      0.77      0.76       174
     Regular       0.72      0.71      0.71       146

    accuracy                           0.74       320
   macro avg       0.74      0.74      0.74       320
weighted avg       0.74      0.74      0.74       320
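
The figures in this report can be recomputed by hand from the confusion matrix printed earlier. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels, in sorted label order ("Bueno", then "Regular"). The variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above: rows = true labels, columns = predictions.
cm = np.array([[134, 40],
               [43, 103]])
labels = ["Bueno", "Regular"]

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # correct / all predicted as that class
recall = true_positives / cm.sum(axis=1)     # correct / all truly in that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

for label, p, r, f in zip(labels, precision, recall, f1):
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
# Bueno: precision=0.76 recall=0.77 f1=0.76
# Regular: precision=0.72 recall=0.71 f1=0.71
# accuracy=0.74
```

For example, precision for "Bueno" is 134 / (134 + 43) ≈ 0.76, and overall accuracy is (134 + 103) / 320 ≈ 0.74.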



Comparing the two models, the logistic regression performs better. The linear model's coefficient of determination is quite low (0.36), whereas the logistic model reaches an accuracy of 0.74, which is more acceptable. The logistic model could be improved further by tuning its regularization, using cross-validation, and searching for its best hyperparameters. It also classifies good wines somewhat more reliably than regular ones (F1 of 0.76 versus 0.71). Keep in mind that the two metrics are not directly comparable: the linear model was fit to the full quality score, with 11 possible values, and scored on its own training data, whereas the logistic model predicts a binary outcome and was scored on a held-out test set.
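
The improvements mentioned above (regularization, cross-validation, hyperparameter search) could be combined roughly as follows. This is a sketch under assumptions: synthetic `make_classification` data stands in for the wine features so the block runs on its own, and the grid of `C` values, the 5-fold setting, and `max_iter=1000` are illustrative choices rather than tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 11 chemical features (NOT the wine data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=11, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=12
)

pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("logistic", LogisticRegression(max_iter=1000)),
    ]
)

# C is the inverse regularization strength of the default L2 penalty:
# smaller C means stronger regularization.
param_grid = {"logistic__C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation on the training split only; the test split stays unseen.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("mean CV accuracy:", round(search.best_score_, 3))
print("held-out accuracy:", round(search.score(X_te, y_te), 3))
```

Because the scaler sits inside the pipeline, it is refit on each training fold, which keeps information from the validation fold out of the standardization step.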