# **Performance Metrics**

[Victor Sanchez](https://github.com/VicoSan07) <br>
Dataset: [Wine Quality](https://archive.ics.uci.edu/ml/datasets/wine+quality) 

### **Objective**

Evaluate the performance of the supervised algorithms used so far, in order to compare them and make better-informed decisions when applying these models.

### **Loading the Dataset**

In [2]:
import pandas as pd

# Load the red wine dataset
dfwiner = pd.read_csv(r'C:/Users/vicos/Documents/winequality-red.csv',delimiter=";",
                      # Data type of each column
                      dtype={
                          'fixed acidity': float,
                          'volatile acidity': float,
                          'citric acid': float,
                          'residual sugar': float,
                          'chlorides': float,
                          'free sulfur dioxide': float,
                          'total sulfur dioxide': float,
                          'density': float,
                          'pH': float,
                          'sulphates': float,
                          'alcohol': float,
                          'quality': int,
                      })

# Load the white wine dataset
dfwinew = pd.read_csv(r'C:/Users/vicos/Documents/winequality-white.csv',delimiter=";",
                      # Data type of each column
                      dtype={
                          'fixed acidity': float,
                          'volatile acidity': float,
                          'citric acid': float,
                          'residual sugar': float,
                          'chlorides': float,
                          'free sulfur dioxide': float,
                          'total sulfur dioxide': float,
                          'density': float,
                          'pH': float,
                          'sulphates': float,
                          'alcohol': float,
                          'quality': int,
                      })

# Add a constant string column to each DataFrame to label the wine type
dfwiner['type']='red'
dfwinew['type']='white'

dfwineall = pd.concat([dfwiner,dfwinew],ignore_index=True)
dfwineall = dfwineall.drop(columns=["type"])

# Rename the columns to short codes so they are easier to display
dfwineall.rename(
    columns={"fixed acidity": "FA",
            "volatile acidity": "VA",
             "citric acid": "CA",
             "residual sugar": "RS",
             "chlorides": "CH",
             "free sulfur dioxide": "FSD",
             "total sulfur dioxide": "TSD",
             "density": "DE",
             "pH": "PH",
             "sulphates": "SU",
             "alcohol": "AL",
             "quality": "QU"},
    inplace=True,
)

dfwineall

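|  | FA | VA | CA | RS | CH | FSD | TSD | DE | PH | SU | AL | QU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6492 | 6.2 | 0.21 | 0.29 | 1.6 | 0.039 | 24.0 | 92.0 | 0.99114 | 3.27 | 0.50 | 11.2 | 6 |
| 6493 | 6.6 | 0.32 | 0.36 | 8.0 | 0.047 | 57.0 | 168.0 | 0.99490 | 3.15 | 0.46 | 9.6 | 5 |
| 6494 | 6.5 | 0.24 | 0.19 | 1.2 | 0.041 | 30.0 | 111.0 | 0.99254 | 2.99 | 0.46 | 9.4 | 6 |
| 6495 | 5.5 | 0.29 | 0.30 | 1.1 | 0.022 | 20.0 | 110.0 | 0.98869 | 3.34 | 0.38 | 12.8 | 7 |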


From here on we split the data into two sets: the independent variables and the dependent variable. We then split both into _train_ and _test_ subsets to evaluate the model's performance.

In [6]:
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

X=dfwineall.iloc[:,:-1]
Y=dfwineall.iloc[:,-1]

X_train, X_test, Y_train, Y_test = train_test_split(X.values,Y.values,test_size=0.2,random_state=0)

### **Multiple Linear Regression**

We fit the linear regression:

In [7]:
from sklearn.linear_model import LinearRegression

regressor=LinearRegression()
regressor.fit(X_train,Y_train)
Y_prediction=regressor.predict(X_test)

We then summarize the regression's error and goodness-of-fit metrics:

In [8]:
import sklearn.metrics as metrics

MAE = metrics.mean_absolute_error(Y_test, Y_prediction)
MSE = metrics.mean_squared_error(Y_test, Y_prediction)
RMSE = np.sqrt(MSE)  
R2 = metrics.r2_score(Y_test,Y_prediction)

resp_metrics = pd.DataFrame({'Valor': [MAE,MSE,RMSE,R2]  })

resp_metrics.rename(index= {0:'MAE',1:'MSE',2:'RMSE',3:'R2'}, inplace=True)

In [9]:
resp_metrics

|  | Valor |
|---|---|
| MAE | 0.582443 |
| MSE | 0.548296 |
| RMSE | 0.74047 |
| R2 | 0.294188 |
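To make explicit what each row of the table measures, the sketch below recomputes the four metrics directly with NumPy. The vectors `y_true` and `y_pred` are made-up illustrative values, not taken from the wine data; on the notebook's `Y_test` and `Y_prediction` the same formulas should agree with the `sklearn.metrics` results.

```python
import numpy as np

# Illustrative values only; they are NOT taken from the wine dataset.
y_true = np.array([5.0, 6.0, 7.0, 5.0])
y_pred = np.array([5.5, 6.0, 6.0, 5.5])

errors = y_true - y_pred

# MAE: average absolute error, in the same units as the target
mae = np.mean(np.abs(errors))

# MSE: average squared error, which penalizes large errors more heavily
mse = np.mean(errors ** 2)

# RMSE: square root of MSE, back in the units of the target
rmse = np.sqrt(mse)

# R2: share of the target's variance explained by the predictions
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.4f} MSE={mse:.4f} RMSE={rmse:.4f} R2={r2:.4f}")
```

Because MSE squares each error, a few large misses can raise MSE and RMSE even when MAE stays flat or improves.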


### **Multiple Linear Regression with DBSCAN**


In [10]:
from sklearn import preprocessing

# Standardize every column (features and target) to zero mean and unit variance
dfwineall_scaled = preprocessing.StandardScaler().fit_transform(dfwineall)
# Keep the original short column names (FA ... QU)
dfwineall_scaled = pd.DataFrame(dfwineall_scaled, columns=dfwineall.columns)

In [11]:
from sklearn.cluster import DBSCAN

# Run DBSCAN on the whole scaled dataset
model_dbscanX = DBSCAN(eps=3, min_samples = 5, metric = "euclidean").fit(dfwineall_scaled)

Build a new DataFrame without the outliers:

In [12]:
# Reuse the labels from the model fitted above; -1 marks noise points (outliers)
clusters = model_dbscanX.labels_

dfwineclean = dfwineall.copy(deep=True)
dfwineclean['id'] = clusters

dfwineclean = dfwineclean.drop(dfwineclean[dfwineclean['id'] == -1].index)
dfwineclean = dfwineclean.drop(columns=["id"])
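The filtering above relies on DBSCAN assigning the label `-1` to points that belong to no cluster. A minimal sketch on synthetic 2-D data illustrates the idea; the data, `eps=0.5` and `min_samples=5` are assumptions chosen for this toy example, not the notebook's settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic data: a tight group of 50 points plus one isolated point.
rng = np.random.default_rng(0)
dense_group = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
isolated_point = np.array([[10.0, 10.0]])
points = np.vstack([dense_group, isolated_point])

# labels_ holds one cluster id per row; -1 means "noise"
labels = DBSCAN(eps=0.5, min_samples=5).fit(points).labels_

# Keep only the rows that were assigned to some cluster
mask = labels != -1
cleaned = points[mask]

print(f"Rows before: {len(points)}, rows after: {len(cleaned)}")
print(f"Label of the isolated point: {labels[-1]}")
```

A boolean mask such as `labels != -1` can be applied to a DataFrame in the same way, which avoids adding and then dropping a temporary `id` column.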

Refit the regression on the cleaned data:

In [13]:
X2=dfwineclean.iloc[:,:-1]
Y2=dfwineclean.iloc[:,-1]

X2_train, X2_test, Y2_train, Y2_test = train_test_split(X2.values,Y2.values,test_size=0.2,random_state=0)

regressor=LinearRegression()
regressor.fit(X2_train,Y2_train)
Y2_prediction=regressor.predict(X2_test)

MAE2 = metrics.mean_absolute_error(Y2_test, Y2_prediction)
MSE2 = metrics.mean_squared_error(Y2_test, Y2_prediction)
RMSE2 = np.sqrt(MSE2)  
R22 = metrics.r2_score(Y2_test,Y2_prediction)

resp_metrics2 = pd.DataFrame({'Valor': [MAE2,MSE2,RMSE2,R22]  })

resp_metrics2.rename(index= {0:'MAE',1:'MSE',2:'RMSE',3:'R2'}, inplace=True)

resp_metrics2

|  | Valor |
|---|---|
| MAE | 0.580783 |
| MSE | 0.574369 |
| RMSE | 0.757872 |
| R2 | 0.298859 |


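The metrics change very little relative to the first regression: MAE and R2 improve marginally (0.582 to 0.581 and 0.294 to 0.299), while MSE and RMSE rise slightly (0.548 to 0.574 and 0.740 to 0.758).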
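Because the cleaned data is split again, the second regression is scored on different test rows than the first. One possible way to make the two results directly comparable, sketched here on synthetic data, is to split once, remove noise points from the training rows only, and score both models on the same untouched test set. Everything in this block (the data, `eps=1.0`, the variable names) is an assumption for illustration and not the notebook's procedure.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=300)

# Split once so both models share the same test rows
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Flag noise points using the scaled training rows only
train_scaled = StandardScaler().fit_transform(X_tr)
labels = DBSCAN(eps=1.0, min_samples=5).fit(train_scaled).labels_
keep = labels != -1

baseline = LinearRegression().fit(X_tr, y_tr)
filtered = LinearRegression().fit(X_tr[keep], y_tr[keep])

# Both models are evaluated on the same, unfiltered test set
mae_baseline = mean_absolute_error(y_te, baseline.predict(X_te))
mae_filtered = mean_absolute_error(y_te, filtered.predict(X_te))

print(f"Training rows kept: {keep.sum()} of {len(keep)}")
print(f"MAE baseline: {mae_baseline:.4f}, MAE filtered: {mae_filtered:.4f}")
```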

### **Random Forest Regressor**


In [14]:
# Fit a Random Forest regressor
# No random_state is set, so the metrics vary slightly between runs
from sklearn.ensemble import RandomForestRegressor
RFR = RandomForestRegressor(n_estimators=100)
RFR.fit(X_train,Y_train)

Y3_prediction = RFR.predict(X_test)

MAE3 = metrics.mean_absolute_error(Y_test, Y3_prediction)
MSE3 = metrics.mean_squared_error(Y_test, Y3_prediction)
RMSE3 = np.sqrt(MSE3)  
R23 = metrics.r2_score(Y_test,Y3_prediction)

resp_metrics3 = pd.DataFrame({'Valor': [MAE3,MSE3,RMSE3,R23]  })

resp_metrics3.rename(index= {0:'MAE',1:'MSE',2:'RMSE',3:'R2'}, inplace=True)

resp_metrics3


|  | Valor |
|---|---|
| MAE | 0.435762 |
| MSE | 0.373916 |
| RMSE | 0.611487 |
| R2 | 0.518664 |


It outperforms the first two regressions on every metric: MAE drops to about 0.44, RMSE to about 0.61, and R2 rises to about 0.52.

### **Regression Conclusions**

Overall, the results obtained with Random Forest exceeded expectations. Removing the points that DBSCAN labelled as outliers before fitting the linear regression raised R squared and lowered MAE only marginally, while MSE and RMSE rose slightly. These differences are small, and the two linear models were scored on different test splits, so the effect of removing the supposed outliers is inconclusive.

### **Random Forest Classifier**

In [15]:
# New binary target column: 1 if quality is above 5, otherwise 0
dfwineall['TY'] = 0
dfwineall.loc[dfwineall['QU'] > 5, 'TY'] = 1
dfwineall

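|  | FA | VA | CA | RS | CH | FSD | TSD | DE | PH | SU | AL | QU | TY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 | 0 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 5 | 0 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 5 | 0 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 6 | 1 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6492 | 6.2 | 0.21 | 0.29 | 1.6 | 0.039 | 24.0 | 92.0 | 0.99114 | 3.27 | 0.50 | 11.2 | 6 | 1 |
| 6493 | 6.6 | 0.32 | 0.36 | 8.0 | 0.047 | 57.0 | 168.0 | 0.99490 | 3.15 | 0.46 | 9.6 | 5 | 0 |
| 6494 | 6.5 | 0.24 | 0.19 | 1.2 | 0.041 | 30.0 | 111.0 | 0.99254 | 2.99 | 0.46 | 9.4 | 6 | 1 |
| 6495 | 5.5 | 0.29 | 0.30 | 1.1 | 0.022 | 20.0 | 110.0 | 0.98869 | 3.34 | 0.38 | 12.8 | 7 | 1 |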
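The thresholding step can also be written as a single vectorized expression, and it is worth checking the resulting class balance, since accuracy is easier to interpret when the share of each class is known. The small `scores` DataFrame below is a made-up example, not the wine data.

```python
import pandas as pd

# Hypothetical quality scores, for illustration only
scores = pd.DataFrame({"QU": [3, 5, 6, 7, 5, 8]})

# 1 when quality is above 5, otherwise 0
scores["TY"] = (scores["QU"] > 5).astype(int)

# Share of rows in each class
balance = scores["TY"].value_counts(normalize=True)

print(scores["TY"].tolist())  # [0, 0, 1, 1, 0, 1]
print(balance)
```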


In [16]:
from sklearn.model_selection import train_test_split

X=dfwineall.iloc[:,:-2]
Y=dfwineall.iloc[:,-1]

X_train, X_test, Y_train, Y_test = train_test_split(X.values,Y.values,test_size=0.2,random_state=0)

In [17]:
# Fit a Random Forest classifier
from sklearn.ensemble import RandomForestClassifier
RFC = RandomForestClassifier(n_estimators=100)
RFC.fit(X_train,Y_train)

RFC_Y_prediction = RFC.predict(X_test)

In [20]:
# Performance metrics
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_curve, auc)

# Accuracy
print(f"Model accuracy: {accuracy_score(Y_test, RFC_Y_prediction)}")

# Precision
print(f"Model precision: {precision_score(Y_test, RFC_Y_prediction)}")

# Recall
print(f"Model recall: {recall_score(Y_test, RFC_Y_prediction)}")

# F1 score
print(f"Model F1 score: {f1_score(Y_test, RFC_Y_prediction)}")

# ROC curve and AUC, computed from the predicted probability of class 1
class_probabilities = RFC.predict_proba(X_test)
preds = class_probabilities[:, 1]
fpr, tpr, threshold = roc_curve(Y_test, preds)
roc_auc = auc(fpr, tpr)

# Print the AUC
print(f"Model area under the ROC curve: {roc_auc}")

Model accuracy: 0.8315384615384616
Model precision: 0.8527042577675489
Model recall: 0.890625
Model F1 score: 0.871252204585538
Model area under the ROC curve: 0.9020728036653517
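Accuracy, precision, recall and F1 all derive from the four counts of a confusion matrix. The sketch below shows that relationship on ten made-up labels (not the wine data), using `sklearn.metrics.confusion_matrix` to obtain the counts.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels only; they are NOT taken from the wine dataset.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# For binary labels the flattened matrix is ordered TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
precision = tp / (tp + fp)                   # predicted positives that are right
recall = tp / (tp + fn)                      # actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")    # TN=3 FP=1 FN=1 TP=5
print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

The area under the ROC curve is different in kind: it is computed from the predicted probabilities across all thresholds, not from a single set of hard predictions.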


### **Decision Trees**

In [26]:
# Fit a decision tree classifier
from sklearn.tree import DecisionTreeClassifier
DTC = DecisionTreeClassifier(random_state=0)
DTC.fit(X_train,Y_train)

DTC_Y_prediction = DTC.predict(X_test)

In [27]:
# Performance metrics
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_curve, auc)

# Accuracy
print(f"Model accuracy: {accuracy_score(Y_test, DTC_Y_prediction)}")

# Precision
print(f"Model precision: {precision_score(Y_test, DTC_Y_prediction)}")

# Recall
print(f"Model recall: {recall_score(Y_test, DTC_Y_prediction)}")

# F1 score
print(f"Model F1 score: {f1_score(Y_test, DTC_Y_prediction)}")

# ROC curve and AUC, computed from the predicted probability of class 1
class_probabilities = DTC.predict_proba(X_test)
preds = class_probabilities[:, 1]
fpr, tpr, threshold = roc_curve(Y_test, preds)
roc_auc = auc(fpr, tpr)

# Print the AUC
print(f"Model area under the ROC curve: {roc_auc}")

Model accuracy: 0.78
Model precision: 0.8242280285035629
Model recall: 0.8341346153846154
Model F1 score: 0.8291517323775389
Model area under the ROC curve: 0.7589476495726496


### **Classification Conclusions**

Overall, the random forest outperforms the single decision tree on every metric, most visibly in accuracy (0.83 vs. 0.78) and in area under the ROC curve (0.90 vs. 0.76).
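A side-by-side comparison like this one can be collected into a single table with a small helper. The sketch below is one possible way to do it; `score_classifier` is a hypothetical function name, and the data comes from `make_classification` and not from the wine dataset, so the numbers it prints are unrelated to the results above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def score_classifier(model, X_test, y_test):
    """Return the five metrics used in this notebook for a fitted binary classifier."""
    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]  # probability of class 1
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
    }


# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=500, n_features=11, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# One row per model, one column per metric
rows = [score_classifier(m.fit(X_tr, y_tr), X_te, y_te) for m in models.values()]
comparison = pd.DataFrame(rows, index=list(models.keys()))

print(comparison.round(3))
```

Fixing `random_state` on both models keeps the table reproducible between runs.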