# White Wine

In [1]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics 
from sklearn.preprocessing import scale

In [2]:
white_wine = pd.read_csv('wine+quality/winequality-white.csv', sep=';').drop_duplicates()
colNames = ['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar', 'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality']
white_wine.columns = colNames
white_wine

| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.00100 | 3.00 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.99400 | 3.30 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30.0 | 97.0 | 0.99510 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.99560 | 3.19 | 0.40 | 9.9 | 6 |
| 6 | 6.2 | 0.32 | 0.16 | 7.0 | 0.045 | 30.0 | 136.0 | 0.99490 | 3.18 | 0.47 | 9.6 | 6 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4893 | 6.2 | 0.21 | 0.29 | 1.6 | 0.039 | 24.0 | 92.0 | 0.99114 | 3.27 | 0.50 | 11.2 | 6 |
| 4894 | 6.6 | 0.32 | 0.36 | 8.0 | 0.047 | 57.0 | 168.0 | 0.99490 | 3.15 | 0.46 | 9.6 | 5 |
| 4895 | 6.5 | 0.24 | 0.19 | 1.2 | 0.041 | 30.0 | 111.0 | 0.99254 | 2.99 | 0.46 | 9.4 | 6 |
| 4896 | 5.5 | 0.29 | 0.30 | 1.1 | 0.022 | 20.0 | 110.0 | 0.98869 | 3.34 | 0.38 | 12.8 | 7 |


## Logistic Regression on All Features

### Convert 'quality' to a binary target: 7+ is good wine, anything lower is bad wine

In [3]:
# change into binary --> good wine has score of 7+, else is bad
white_wine['target'] = white_wine['quality'].apply(lambda x: 1 if x >= 7 else 0)
white_wine['target'].value_counts()

0    3136
1     825
Name: target, dtype: int64

### Create X and y values

In [4]:
X = white_wine.drop(['quality', 'target'], axis = 1)
y = white_wine['target']

### Scale data: mean = 0, variance = 1

In [5]:
# Scale data: mean = 0, variance = 1
scaled_wine = scale(X)

### Create training and testing sets from the scaled data

In [6]:
X_train, X_test, y_train, y_test = train_test_split(scaled_wine, y, test_size=0.3, random_state=1)
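
The cells above standardize the full dataset before splitting, so the test rows contribute to the mean and variance used for scaling. A common alternative is to fit the scaler on the training rows only. The sketch below shows that pattern on synthetic data; `make_classification`, the `*_demo` names, and the stratified split are assumptions for illustration and are not part of the original notebook, so the numbers it prints will differ from the wine results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine features (assumption: 11 numeric columns,
# roughly 80% negative class), NOT the wine data itself
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=11, n_informative=5,
    weights=[0.8, 0.2], random_state=1,
)

# Split first, so the test rows never influence the scaling statistics
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

# The pipeline fits StandardScaler on the training rows only, then reuses
# those training means and variances to transform the test rows
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

scaler = pipe.named_steps['standardscaler']
print('Scaler fitted on training rows only:',
      np.allclose(scaler.mean_, X_tr.mean(axis=0)))
test_accuracy = pipe.score(X_te, y_te)
print('Test accuracy on synthetic data:', test_accuracy)
```

With 11 features and several thousand rows the difference from scaling everything up front is probably small, but the pipeline keeps the test set fully unseen.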

### Fit logistic regression model

In [7]:
logreg_model = LogisticRegression()
result = logreg_model.fit(X_train, y_train)

### Accuracy Scores for Training

In [8]:
prediction_train_logreg = logreg_model.predict(X_train)
print(metrics.accuracy_score(y_train, prediction_train_logreg))

0.8156565656565656


### Accuracy Scores for Testing

In [9]:
prediction_logreg = logreg_model.predict(X_test)
print(metrics.accuracy_score(y_test, prediction_logreg))

0.8040370058873002
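
Because roughly four out of five wines fall in the 'bad' class, accuracy alone can hide how well the model finds 'good' wines. A minimal sketch of the extra metrics one might inspect is shown below. The ten labels are hypothetical and only illustrate the calls; in the notebook the same functions could be applied to `y_test` and `prediction_logreg`.

```python
from sklearn import metrics

# Hypothetical labels for ten wines (1 = good, 0 = bad); not taken from the dataset
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]

accuracy = metrics.accuracy_score(y_true, y_pred)
precision = metrics.precision_score(y_true, y_pred)  # share of predicted 'good' that are good
recall = metrics.recall_score(y_true, y_pred)        # share of actual 'good' that were found
cm = metrics.confusion_matrix(y_true, y_pred)        # rows = actual class, columns = predicted class

print('accuracy :', accuracy)
print('precision:', precision)
print('recall   :', recall)
print(cm)
```

In this toy case the accuracy is 0.8 even though only half of the good wines are recovered, which is the situation accuracy by itself cannot reveal.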


### Findings! Coefficients, intercepts, and weight breakdowns

In [10]:
logreg_model.coef_

array([[ 0.33032157, -0.31493843, -0.01602414,  0.99419388, -0.59242451,
         0.29115717, -0.17143901, -1.31994547,  0.50863023,  0.24214085,
         0.5332329 ]])

In [11]:
logreg_model.intercept_

array([-1.87318702])

In [12]:
weights = Series(logreg_model.coef_[0],
                 index=X.columns.values)
weights.sort_values()

density                -1.319945
chlorides              -0.592425
volatile_acidity       -0.314938
total_sulfur_dioxide   -0.171439
citric_acid            -0.016024
sulphates               0.242141
free_sulfur_dioxide     0.291157
fixed_acidity           0.330322
pH                      0.508630
alcohol                 0.533233
residual_sugar          0.994194
dtype: float64

# Findings Report for White Wine

1. density: The coefficient is negative (-1.319945). As density increases, it is negatively associated with the likelihood of the wine being classified as 'good.' In other words, higher density tends to make the model predict the wine as 'bad.'

2. chlorides: The coefficient is negative (-0.592425). As the chloride content increases, it is negatively associated with the likelihood of the wine being classified as 'good.' Higher chloride levels tend to make the model predict the wine as 'bad.'

3. volatile_acidity: The coefficient is negative (-0.314938). As volatile acidity increases, it is negatively associated with the likelihood of the wine being classified as 'good.' Higher levels of volatile acidity tend to make the model predict the wine as 'bad.'

4. total_sulfur_dioxide: The coefficient is negative (-0.171439). As total sulfur dioxide content increases, it is negatively associated with the likelihood of the wine being classified as 'good.' Higher total sulfur dioxide levels tend to make the model predict the wine as 'bad.'

5. citric_acid: The coefficient is negative (-0.016024). Citric acid content has only a very small negative effect on predicting the wine as 'good'; its coefficient is far smaller in magnitude than those of the other features.

6. sulphates: The coefficient is positive (0.242141). As sulphates increase, it is positively associated with the likelihood of the wine being classified as 'good.' Higher sulphate levels tend to make the model predict the wine as 'good.'

7. free_sulfur_dioxide: The coefficient is positive (0.291157). As free sulfur dioxide content increases, it is positively associated with the likelihood of the wine being classified as 'good.' Higher free sulfur dioxide levels tend to make the model predict the wine as 'good.'

8. fixed_acidity: The coefficient is positive (0.330322). As fixed acidity increases, it is positively associated with the likelihood of the wine being classified as 'good.' Higher fixed acidity tends to make the model predict the wine as 'good.'

9. pH: The coefficient is positive (0.508630). As pH increases, it is positively associated with the likelihood of the wine being classified as 'good.' Higher pH levels tend to make the model predict the wine as 'good.'

10. alcohol: The coefficient is positive (0.533233). As alcohol content increases, it is positively associated with the likelihood of the wine being classified as 'good.' Higher alcohol levels tend to make the model predict the wine as 'good.'

11. residual_sugar: The coefficient is positive (0.994194). As residual sugar content increases, it is positively associated with the likelihood of the wine being classified as 'good.' Higher residual sugar levels tend to make the model predict the wine as 'good.'
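
Since the features were standardized, each coefficient describes the change in log-odds for a one standard deviation increase in that feature. Exponentiating a coefficient turns it into an odds ratio, which can be easier to read. The sketch below does this for a few of the weights listed above; the dictionary name `white_coefs` is an assumption for illustration.

```python
import math

# A few coefficients copied from the sorted weights above
white_coefs = {
    'density': -1.319945,
    'citric_acid': -0.016024,
    'alcohol': 0.533233,
    'residual_sugar': 0.994194,
}

# exp(coefficient) = factor by which the odds of 'good' are multiplied
# when the feature rises by one standard deviation, other features held fixed
odds_ratios = {name: math.exp(beta) for name, beta in white_coefs.items()}

for name, ratio in odds_ratios.items():
    print(f'{name}: odds multiplied by {ratio:.3f}')
```

A ratio below 1 lowers the odds of 'good', a ratio above 1 raises them, and a ratio close to 1 (as for citric_acid) means the feature barely moves the prediction.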

The intercept value is -1.87318702
* Because the features are standardized, setting every predictor to zero corresponds to a wine with average values for all features. For such a wine, the log-odds of being classified as 'good' are approximately -1.87318702.
* The intercept therefore determines the baseline probability of the positive outcome before any feature effects are applied.
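
The intercept is on the log-odds scale. Passing it through the logistic (sigmoid) function gives the corresponding probability. The short sketch below does that conversion; the helper name `sigmoid` is an assumption for illustration.

```python
import math

def sigmoid(log_odds):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

white_intercept = -1.87318702  # from logreg_model.intercept_ above

# Predicted probability of 'good' when every scaled feature equals zero
baseline_probability = sigmoid(white_intercept)
print(f'Baseline probability of good wine: {baseline_probability:.3f}')
```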

### Extra: majority-class baseline

In [13]:
print('Number of good wine examples=', len(y_train[y_train==1]))
print('Number of bad wine examples =', len(y_train[y_train==0]))

Number of good wine examples= 590
Number of bad wine examples = 2182


In [14]:
# Baseline: always predict the majority class (bad wine, target = 0)
negative_examples_in_test = len(y_test[y_test==0])
total_examples_in_test = len(y_test)

print('Number of examples where baseline is correct =', negative_examples_in_test)      # every bad wine in the test set
print('Baseline accuracy =', negative_examples_in_test / total_examples_in_test)

Number of examples where baseline is correct = 954
Baseline accuracy = 0.8023549201009251
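
To put the baseline next to the model, the counts printed in this notebook can be combined with a little arithmetic. The sketch below uses only numbers reported above (3961 rows after dropping duplicates, 2772 training rows, 954 bad wines in the test set, test accuracy 0.8040370058873002); the variable names are assumptions for illustration.

```python
# Values copied from the outputs above
n_total = 3136 + 825        # rows after drop_duplicates (from value_counts)
n_train = 590 + 2182        # good + bad examples in the training set
baseline_correct = 954      # bad wines in the test set
model_accuracy = 0.8040370058873002

n_test = n_total - n_train
model_correct = round(model_accuracy * n_test)
extra_correct = model_correct - baseline_correct

print('Test set size                =', n_test)
print('Correct predictions (model)  =', model_correct)
print('Correct beyond the baseline  =', extra_correct)
```

Read this way, the logistic regression classifies only a couple more test wines correctly than always predicting 'bad', so the coefficients are probably more useful for interpretation than the accuracy figure is as evidence of predictive power.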


# Red Wine

In [15]:
# load in dataset
red_wine = pd.read_csv('wine+quality/winequality-red.csv', sep=';').drop_duplicates()
colNames = ['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar', 'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality']
red_wine.columns = colNames
red_wine

| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.880 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.760 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.280 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 6 |
| 5 | 7.4 | 0.660 | 0.00 | 1.8 | 0.075 | 13.0 | 40.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1593 | 6.8 | 0.620 | 0.08 | 1.9 | 0.068 | 28.0 | 38.0 | 0.99651 | 3.42 | 0.82 | 9.5 | 6 |
| 1594 | 6.2 | 0.600 | 0.08 | 2.0 | 0.090 | 32.0 | 44.0 | 0.99490 | 3.45 | 0.58 | 10.5 | 5 |
| 1595 | 5.9 | 0.550 | 0.10 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 6 |
| 1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 5 |


In [16]:
# change into binary --> good wine has score of 7+, else is bad
red_wine['target'] = red_wine['quality'].apply(lambda x: 1 if x >= 7 else 0)
red_wine['target'].value_counts()

0    1175
1     184
Name: target, dtype: int64

In [18]:
# create x and y values
X = red_wine.drop(['quality', 'target'], axis = 1)
y = red_wine['target']

# Scale data: mean = 0, variance = 1
scaled_wine = scale(X)

# create testing and training sets
X_train, X_test, y_train, y_test = train_test_split(scaled_wine, y, test_size=0.3, random_state=1)

In [19]:
# fit log reg model
logreg_model = LogisticRegression()
result = logreg_model.fit(X_train, y_train)

In [20]:
# accuracy for training sets
prediction_train_logreg = logreg_model.predict(X_train)
print(metrics.accuracy_score(y_train, prediction_train_logreg))

0.8832807570977917


In [21]:
# accuracy for testing sets
prediction_logreg = logreg_model.predict(X_test)
print(metrics.accuracy_score(y_test, prediction_logreg))

0.8700980392156863


In [22]:
# Get coefficients and intercepts
print('Coefficients:', logreg_model.coef_)
print()
print('intercept:', logreg_model.intercept_)

Coefficients: [[ 0.33188225 -0.56278893 -0.00119054  0.21888806 -0.32050344  0.34990868
  -1.04577503 -0.20978914 -0.0825475   0.81373633  1.02092384]]

intercept: [-3.05658068]


In [23]:
# get weights
weights = Series(logreg_model.coef_[0],
                 index=X.columns.values)
weights.sort_values()

total_sulfur_dioxide   -1.045775
volatile_acidity       -0.562789
chlorides              -0.320503
density                -0.209789
pH                     -0.082547
citric_acid            -0.001191
residual_sugar          0.218888
fixed_acidity           0.331882
free_sulfur_dioxide     0.349909
sulphates               0.813736
alcohol                 1.020924
dtype: float64

# Findings Report for Red Wine

1. total_sulfur_dioxide: The coefficient is negative (-1.045775). As total sulfur dioxide content increases, it is negatively associated with the likelihood of the wine being classified as 'good.' Higher total sulfur dioxide levels tend to make the model predict the wine as 'bad.'

2. volatile_acidity: The coefficient is negative (-0.562789). As volatile acidity increases, it is negatively associated with the likelihood of the wine being classified as 'good.' Higher levels of volatile acidity tend to make the model predict the wine as 'bad.'

3. chlorides: The coefficient is negative (-0.320503). As the chloride content increases, it is negatively associated with the likelihood of the wine being classified as 'good.' Higher chloride levels tend to make the model predict the wine as 'bad.'

4. density: The coefficient is negative (-0.209789). As density increases, it is negatively associated with the likelihood of the wine being classified as 'good.' In other words, higher density tends to make the model predict the wine as 'bad.'

5. pH: The coefficient is negative (-0.082547). As pH increases, it is negatively associated with the likelihood of the wine being classified as 'good.' Higher pH levels tend to make the model predict the wine as 'bad.'

6. citric_acid: The coefficient is negative (-0.001191). Citric acid content has a negligible negative effect on predicting the wine as 'good'; its coefficient is far smaller in magnitude than those of the other features.

7. residual_sugar: The coefficient is positive (0.218888). As residual sugar content increases, it is positively associated with the likelihood of the wine being classified as 'good.' Higher residual sugar levels tend to make the model predict the wine as 'good.'

8. fixed_acidity: The coefficient is positive (0.331882). As fixed acidity increases, it is positively associated with the likelihood of the wine being classified as 'good.' Higher fixed acidity tends to make the model predict the wine as 'good.'

9. free_sulfur_dioxide: The coefficient is positive (0.349909). As free sulfur dioxide content increases, it is positively associated with the likelihood of the wine being classified as 'good.' Higher free sulfur dioxide levels tend to make the model predict the wine as 'good.'

10. sulphates: The coefficient is positive (0.813736). As sulphates increase, it is positively associated with the likelihood of the wine being classified as 'good.' Higher sulphate levels tend to make the model predict the wine as 'good.'

11. alcohol: The coefficient is positive (1.020924). As alcohol content increases, it is positively associated with the likelihood of the wine being classified as 'good.' Higher alcohol levels tend to make the model predict the wine as 'good.'
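
With both sets of weights available, the white and red models can be compared directly. The sketch below copies the two sorted weight outputs from this notebook into dictionaries (the names `white` and `red` are assumptions for illustration) and reports which features change sign between the models and which feature carries the largest weight in each.

```python
# Weights copied from the white wine and red wine outputs in this notebook
white = {
    'density': -1.319945, 'chlorides': -0.592425, 'volatile_acidity': -0.314938,
    'total_sulfur_dioxide': -0.171439, 'citric_acid': -0.016024,
    'sulphates': 0.242141, 'free_sulfur_dioxide': 0.291157,
    'fixed_acidity': 0.330322, 'pH': 0.508630, 'alcohol': 0.533233,
    'residual_sugar': 0.994194,
}
red = {
    'total_sulfur_dioxide': -1.045775, 'volatile_acidity': -0.562789,
    'chlorides': -0.320503, 'density': -0.209789, 'pH': -0.082547,
    'citric_acid': -0.001191, 'residual_sugar': 0.218888,
    'fixed_acidity': 0.331882, 'free_sulfur_dioxide': 0.349909,
    'sulphates': 0.813736, 'alcohol': 1.020924,
}

# Features whose coefficient is positive in one model and negative in the other
sign_flips = sorted(name for name in white if (white[name] > 0) != (red[name] > 0))

# Feature with the largest absolute weight in each model
strongest_white = max(white, key=lambda name: abs(white[name]))
strongest_red = max(red, key=lambda name: abs(red[name]))

print('Sign changes between models:', sign_flips)
print('Largest weight (white):', strongest_white)
print('Largest weight (red):  ', strongest_red)
```

Because each model was fitted on separately standardized data, this comparison is best read as a rough guide to direction and relative importance rather than as an exact like-for-like comparison.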


The intercept value is -3.05658068
* Because the features are standardized, setting every predictor to zero corresponds to a wine with average values for all features. For such a wine, the log-odds of being classified as 'good' are approximately -3.05658068.
* The intercept therefore determines the baseline probability of the positive outcome before any feature effects are applied.
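
To see how the intercept and a coefficient combine, the sketch below converts log-odds to probabilities for two cases: a red wine with every scaled feature at zero, and the same wine with alcohol one standard deviation higher. The helper `sigmoid` and the variable names are assumptions for illustration; the two numbers are copied from the red wine outputs above.

```python
import math

def sigmoid(log_odds):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

red_intercept = -3.05658068  # from logreg_model.intercept_ (red wine)
alcohol_coef = 1.020924      # red wine weight for alcohol

# All scaled features at zero
p_average = sigmoid(red_intercept)

# Alcohol raised by one standard deviation, all other scaled features at zero
p_high_alcohol = sigmoid(red_intercept + alcohol_coef * 1.0)

print(f'P(good), all scaled features at zero: {p_average:.3f}')
print(f'P(good), alcohol +1 std             : {p_high_alcohol:.3f}')
```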