<a href="https://colab.research.google.com/github/danielbauer1979/ML_656/blob/main/Module3_WineData_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Wine data classification example

Let's import the relevant libraries:

In [82]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, classification_report, precision_score, roc_curve, auc

Let's import the dataset from our GitHub repo:

In [None]:
!git clone https://github.com/danielbauer1979/ML_656.git

In [None]:
wine = pd.read_csv('ML_656/winequality-red.csv', sep = ';')
wine.head()

Let's explore the data a little, starting with the pairwise correlations:

In [None]:

corr = wine.corr()  # compute the correlation matrix once and reuse it
mask = np.triu(np.ones_like(corr, dtype=bool))  # hide the redundant upper triangle
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

And let's look at how many wines there are in each quality level:

In [None]:
qualitycounts = wine['quality'].value_counts()
qualitycounts.plot(kind="bar")

We will (somewhat arbitrarily) classify wines as "high quality" if their quality score is above 6 and as "low quality" otherwise:

In [None]:
wine['quality'] = wine['quality'] > 6
wine['quality'].describe()

Let's randomly split into training and test sets (we will discuss the ins and outs of that more next week):

In [90]:
np.random.seed(43)
train, test = train_test_split(wine, test_size = 0.3)
Train = train.drop(columns = ['quality']).values
Train_y = train['quality'].values
Test = test.drop(columns = ['quality']).values
Test_y = test['quality'].values
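
For intuition, a random 70/30 split amounts to shuffling the row indices and cutting the shuffled list in two. The sketch below does this with NumPy on a small made-up array. `random_split` is a hypothetical helper, not scikit-learn's implementation, and it will not reproduce the exact split produced by `train_test_split` above.

```python
import numpy as np

def random_split(X, y, test_size=0.3, seed=43):
    """Hypothetical helper: shuffle row indices, then cut into train and test parts."""
    rng = np.random.default_rng(seed)
    n = len(X)
    n_test = int(round(test_size * n))  # number of rows held out for testing
    idx = rng.permutation(n)            # random ordering of the row indices
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Made-up data: 10 rows, 2 features, boolean labels
X = np.arange(20).reshape(10, 2)
y = np.arange(10) > 6
X_train, X_test, y_train, y_test = random_split(X, y)
print(X_train.shape, X_test.shape)
```

Fixing the seed makes the split reproducible, which is the role `np.random.seed(43)` plays in the cell above.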

### Logistic Regression

In [None]:
logistic_model = LogisticRegression(fit_intercept=True, max_iter=1000).fit(Train,Train_y)
print(logistic_model.intercept_)
print(logistic_model.coef_)

In [None]:
np.exp(logistic_model.coef_[0,-1])

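(That is, the odds p/(1-p) are multiplied by this factor, an increase of about 150%, for each one-percentage-point increase in alcohol content, holding the other features fixed.)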
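
To see why exponentiating a coefficient gives an odds multiplier, recall that the logistic model sets log(p/(1-p)) = b0 + b1*x. The sketch below uses made-up values for the intercept and slope (not the fitted wine coefficients) and checks numerically that raising x by one unit multiplies the odds by exp(b1).

```python
import numpy as np

def odds(x, b0, b1):
    """Odds p/(1-p) implied by a one-feature logistic model."""
    p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))
    return p / (1.0 - p)

b0, b1 = -12.0, 0.9   # made-up coefficients, for illustration only
x = 10.0              # e.g. an alcohol content of 10%

ratio = odds(x + 1.0, b0, b1) / odds(x, b0, b1)
print(ratio, np.exp(b1))  # the two values agree up to rounding error
```

In general, a multiplier of m corresponds to a change in the odds of (m - 1) x 100%.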

### Predictions in Training Set

In [None]:
logistic_pred_train = logistic_model.predict_proba(Train)
np.sum(logistic_pred_train[:,1] > 0.5)

In [None]:
logistic_pred_train_lab = logistic_pred_train[:,1] > 0.5
confusion_matrix(Train_y, logistic_pred_train_lab)

In [None]:
28/(28+111)  # TPR = TP / (TP + FN)

In [None]:
19/(961+19)  # FPR = FP / (FP + TN)

### Predictions in the Test Set

In [None]:
logistic_pred_test = logistic_model.predict_proba(Test)
np.sum(logistic_pred_test[:,1] > 0.5)

In [None]:
logistic_pred_test_lab = logistic_pred_test[:,1] > 0.5
confusion_matrix(Test_y, logistic_pred_test_lab)

In [None]:
21/(21+57)  # TPR = TP / (TP + FN)

In [None]:
10/(392+10)  # FPR = FP / (FP + TN)
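
The hand-computed rates above can also be read off the confusion matrix programmatically. This is a minimal sketch, assuming the matrices have scikit-learn's layout `[[TN, FP], [FN, TP]]` and using the counts implied by the hand calculations in this notebook. `rates` is a hypothetical helper.

```python
import numpy as np

def rates(cm):
    """Return (TPR, FPR) from a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    tn, fp, fn, tp = np.asarray(cm).ravel()
    return tp / (tp + fn), fp / (fp + tn)

# Counts inferred from the hand calculations in the training and test sections
cm_train = [[961, 19], [111, 28]]
cm_test = [[392, 10], [57, 21]]

for name, cm in [("train", cm_train), ("test", cm_test)]:
    tpr, fpr = rates(cm)
    print(f"{name}: TPR = {tpr:.3f}, FPR = {fpr:.3f}")
```

Passing the output of `confusion_matrix(...)` directly to such a helper avoids retyping the counts whenever the split or the model changes.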

### ROC Curves

In [None]:
fpr, tpr, threshold = roc_curve(Train_y, logistic_pred_train[:,1])
roc_auc = auc(fpr, tpr)
plt.title('Receiver Operating Characteristic (Training Set)')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

In [None]:
fpr, tpr, threshold = roc_curve(Test_y, logistic_pred_test[:,1])
roc_auc = auc(fpr, tpr)
plt.title('Receiver Operating Characteristic (Test Set)')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
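
Conceptually, each point on an ROC curve is the (FPR, TPR) pair obtained at one probability threshold. The sketch below traces the curve by hand on made-up scores and integrates it with the trapezoidal rule. `manual_roc` is a hypothetical helper meant for intuition, not a replacement for `roc_curve` and `auc`.

```python
import numpy as np

def manual_roc(y_true, scores):
    """Sweep thresholds from high to low and record (FPR, TPR) at each one."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    # Start above every score so that the curve begins at (0, 0)
    thresholds = np.concatenate(([np.inf], np.unique(scores)[::-1]))
    fpr, tpr = [], []
    for t in thresholds:
        pred = scores >= t  # classify as positive at this threshold
        tpr.append(np.sum(pred & y_true) / np.sum(y_true))
        fpr.append(np.sum(pred & ~y_true) / np.sum(~y_true))
    return np.array(fpr), np.array(tpr)

# Made-up labels and predicted probabilities
y = [0, 0, 1, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]

fpr, tpr = manual_roc(y, s)
# Area under the curve via the trapezoidal rule
area = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
print(area)
```

The 0.5 cutoff used for the confusion matrices corresponds to a single point on this curve. Lowering the cutoff moves up and to the right, trading a higher TPR for a higher FPR.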