# Feature Engineering Project: Predict Wine Quality with Regularization



The data comes from the Wine Quality Dataset in the UCI Machine Learning Repository. We look at the red wine data in particular. The original dataset rates each wine on a 1-10 scale, but here the task is framed as binary classification: a wine is either good (rating > 5) or bad (rating <= 5). The goal of this project is to model wine quality from physicochemical tests. To get there, I will:


* implement different logistic regression classifiers
* find the best ridge-regularized classifier using hyperparameter tuning
* implement a tuned lasso-regularized feature selection method

Let's start by looking at the data we have available. First, let's answer a few questions:

* What is the total number of records in the dataset?
* What are the columns, or features, of the dataset?
* What is the quality rating of the first wine in the dataset?


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('wine_quality.csv', delimiter=';')
print(data.columns)
print(data.info())
# shape[1] counts all 12 columns: 11 features plus the quality target
print(f'There are {data.shape[0]} records and {data.shape[1]} columns')
print('Quality of the first wine on the 1-10 scale:', data.iloc[0, -1])



Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11 

In [None]:
# Binary target: 1 = good (rating > 5), 0 = bad (rating <= 5)
y = (data['quality'] > 5).astype(int)
features = data.drop(columns=['quality'])


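Before we begin modeling, let's scale the features with `StandardScaler()`. Regularization penalizes coefficient size, so the features need to be on a comparable scale for the penalty to treat them evenly.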

In [None]:
from sklearn.preprocessing import StandardScaler
standard_scaler_fit = StandardScaler().fit(features)
X = standard_scaler_fit.transform(features)
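To make the scaling step concrete, here is a minimal sketch on a small made-up array (the numbers are illustrative, not taken from the wine data). It checks that each scaled column ends up with mean 0 and standard deviation 1, and that the result matches the formula `(x - mean) / std` applied by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in for two features on very different scales (values are made up)
toy = np.array([[7.4, 34.0],
                [7.8, 67.0],
                [11.2, 60.0],
                [6.9, 40.0]])

scaler = StandardScaler().fit(toy)   # learns the per-column mean and standard deviation
toy_scaled = scaler.transform(toy)   # applies (x - mean) / std column by column

col_means = toy_scaled.mean(axis=0)
col_stds = toy_scaled.std(axis=0)    # population std (ddof=0), which StandardScaler uses
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(col_means.round(6), col_stds.round(6))
```

After scaling, a coefficient of a given size means the same thing for every feature: the change in log-odds per one standard deviation of that feature.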

Perform an 80:20 train-test split on the data. Set `random_state` to 7 for reproducibility.

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
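As a quick sanity check on what this split produces, the sketch below runs the same call on random placeholder data with the same dimensions as the wine table (1,599 rows, 11 features). The arrays `X_demo` and `y_demo` are stand-ins, not the real data. It shows the resulting row counts and that a fixed `random_state` gives the same split on every run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same dimensions as the wine features (values are random)
n_rows, n_features = 1599, 11
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(n_rows, n_features))
y_demo = rng.integers(0, 2, size=n_rows)

# Two identical calls: the fixed random_state should make them agree exactly
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=7)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=7)

train_rows = split_a[0].shape[0]
test_rows = split_a[1].shape[0]
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

print(train_rows, test_rows, same_split)
```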

Define a classifier `clf_no_reg`, a logistic regression model without regularization, and fit it to the training data.

In [None]:
from sklearn.linear_model import LogisticRegression

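# penalty=None switches regularization off (scikit-learn >= 1.2; older releases use penalty='none').
# A plain LogisticRegression() would apply an L2 penalty with C=1.0.
clf_no_reg = LogisticRegression(penalty=None)
clf_no_reg.fit(x_train, y_train)

LogisticRegression(penalty=None)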
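One thing to keep in mind for this step: scikit-learn's `LogisticRegression` regularizes by default (`C=1.0`, where smaller `C` means a stronger penalty), so an unregularized baseline has to switch the penalty off explicitly. The sketch below illustrates the effect on synthetic data from `make_classification`. It approximates "no regularization" with a very large `C`, which is an assumption made here so the comparison runs across library versions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic binary problem with 11 features (not the wine data)
X_demo, y_demo = make_classification(n_samples=80, n_features=11,
                                     n_informative=5, random_state=0)

# Default settings versus a penalty so weak it is effectively absent
default_clf = LogisticRegression(max_iter=5000).fit(X_demo, y_demo)
weak_reg_clf = LogisticRegression(C=1e6, max_iter=5000).fit(X_demo, y_demo)

default_C = default_clf.get_params()['C']
norm_default = np.linalg.norm(default_clf.coef_)
norm_weak = np.linalg.norm(weak_reg_clf.coef_)

print(default_C, norm_default, norm_weak)
```

The default model's coefficient vector comes out shorter than the weakly penalized one, which is the shrinkage a no-regularization baseline should not have.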

Next, I plot the coefficients obtained from fitting the logistic regression model, one bar per feature.

In [None]:
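predictors = features.columns
coefficients = clf_no_reg.coef_.ravel()
print(coefficients)

# Pair each coefficient with its feature name so the bars are labeled
coef = pd.Series(coefficients, index=predictors)
coef.plot(kind='bar', title='Coefficients (no regularization)')
plt.tight_layout()
plt.show()
plt.clf()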

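ValueError: Length of values (66) does not match length of index (11)

The flattened `coef_` array holds 66 values, not 11. When the target is the raw `quality` column, the model is multiclass and learns one row of 11 coefficients per rating level (6 * 11 = 66). With the binary good/bad target described in the introduction, `coef_` has a single row of 11 coefficients, one per feature.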
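The length mismatch in this error can be reproduced on placeholder data. The sketch below uses random features and six made-up rating levels (the variable names and values are assumptions for illustration) and compares the shape of `coef_` for a multiclass target against a binarized one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(600, 11))        # 11 features, like the wine data
ratings = rng.integers(3, 9, size=600)     # six made-up rating levels: 3 through 8

# Multiclass target: one row of coefficients per class
multi = LogisticRegression(max_iter=1000).fit(X_demo, ratings)

# Binarized target (good if rating > 5): a single row of coefficients
binary = LogisticRegression(max_iter=1000).fit(X_demo, (ratings > 5).astype(int))

multi_shape = multi.coef_.shape
binary_shape = binary.coef_.shape
multi_flat_size = multi.coef_.ravel().size

print(multi_shape, multi_flat_size)
print(binary_shape, binary.coef_.ravel().size)
```

Only the binary model's flattened coefficients line up one-to-one with the 11 feature names, which is what `pd.Series` needs for the bar plot.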

In [None]:
from sklearn.metrics import f1_score
y_pred_test = clf_no_reg.predict(x_test)
y_pred_train = clf_no_reg.predict(x_train)
print('Training Score', f1_score(y_train, y_pred_train))
print('Testing Score', f1_score(y_test, y_pred_test))

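ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].

`f1_score` defaults to `average='binary'`, which only accepts a two-class target. The error shows that `y` still holds the raw multiclass ratings. With the good (> 5) versus bad (<= 5) encoding, the default setting works as written.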
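To see how `f1_score` treats the two kinds of target, here is a small sketch with hand-written labels (all values are made up). It computes the default binary F1, confirms that the same call raises `ValueError` on multiclass labels, and shows that passing an explicit `average` avoids the error.

```python
from sklearn.metrics import f1_score

# Binary labels: the default average='binary' scores the positive class (1)
y_true_bin = [1, 0, 1, 1, 0]
y_pred_bin = [1, 0, 0, 1, 0]
binary_f1 = f1_score(y_true_bin, y_pred_bin)

# Multiclass labels: the default setting is rejected
y_true_multi = [5, 6, 5, 7, 6]
y_pred_multi = [5, 6, 6, 7, 6]
try:
    f1_score(y_true_multi, y_pred_multi)
    raised = False
except ValueError:
    raised = True

# An explicit averaging strategy makes the multiclass call valid
macro_f1 = f1_score(y_true_multi, y_pred_multi, average='macro')

print(binary_f1, raised, macro_f1)
```

For this project the binary encoding is the intended fix, since the task is defined as good versus bad. An averaged multiclass F1 would answer a different question.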