Training and Testing a Simple Logistic Regression Model on the Wine Dataset

This notebook trains and tests logistic regression models on a 70/30 train/test split. It compares three penalty settings ("none", "l1" and "l2") across different values of the regularization parameter C.

In [52]:
# imports 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn as sk
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Preprocess the data: encode the target column and make a 70/30 train/test split. StandardScaler is imported above, but the cell below does not standardize the features.

In [53]:
# load the wine data set from the csv file
df = pd.read_csv("wine_dataset.csv")

# encode the target column 'style' as integer labels
label_encoder = LabelEncoder()
df['style'] = label_encoder.fit_transform(df['style'])

# X holds every column except 'style'; 'style' is the target column
X = df.drop(columns=['style'])
y = df['style']

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, train_size=0.7, random_state=1, stratify=y)
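
StandardScaler is imported but never applied to the features. A scaling step could look like the sketch below. It is an assumption about one reasonable way to do it, not the notebook's own code: the synthetic array stands in for the wine features, and the names rng, X_demo, y_demo, X_train_s and X_test_s are hypothetical. The scaler is fit on the training split only, so test-set statistics do not leak into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine features: three columns on very different scales
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=[7.0, 0.3, 115.0], scale=[1.3, 0.15, 56.0], size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

# Fit the scaler on the training split only, then reuse it for the test split
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_tr)
X_test_s = scaler.transform(X_te)

# Training columns now have mean ~0 and standard deviation ~1
print(X_train_s.mean(axis=0))
print(X_train_s.std(axis=0))
```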

Part A. Fit 7 logistic regression models: one with no penalty, plus L1 and L2 penalties with three values of C each

In [54]:
np.random.seed(1000)

penalties = ['l1', 'l1', 'l1', 'l2', 'l2', 'l2']
C_values = [0.01, .1, 1, 0.01, .1, 1]

accuracies = []
models = []
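# separate model intended as the "no penalty" baseline
# NOTE: LogisticRegression() defaults to penalty='l2' with C=1.0, so this model
# is still L2-regularized; pass penalty=None (scikit-learn >= 1.2) to turn the penalty off
model = LogisticRegression()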
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
models.append(model)

print(f'accuracy score for no penalty and  C= 1: {accuracy}')

for p, C in zip(penalties, C_values):
    model = LogisticRegression(penalty=p, solver='liblinear', C=C)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred) 
    models.append(model) 
    accuracies.append(accuracy)



for p, C, accuracy in zip(penalties, C_values, accuracies):
    print(f'accuracy_score for logistic regression with {p}, C = {C}: {accuracy}')

accuracy score for no penalty and  C= 1: 0.9743589743589743
accuracy_score for logistic regression with l1, C = 0.01: 0.9394871794871795
accuracy_score for logistic regression with l1, C = 0.1: 0.9733333333333334
accuracy_score for logistic regression with l1, C = 1: 0.9805128205128205
accuracy_score for logistic regression with l2, C = 0.01: 0.9487179487179487
accuracy_score for logistic regression with l2, C = 0.1: 0.9666666666666667
accuracy_score for logistic regression with l2, C = 1: 0.9758974358974359


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
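
The warning above most likely comes from the one model fit with the default lbfgs solver, and it names two remedies: scale the data or raise max_iter. A minimal sketch of both, assuming a Pipeline is acceptable here; the synthetic data and the names X_demo, y_demo, pipe and clf are hypothetical, and max_iter=1000 is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features whose scales differ by orders of magnitude
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
# The label depends on the first two features plus noise
noise = rng.normal(scale=0.5, size=300)
y_demo = (X_demo[:, 0] + X_demo[:, 1] / 10.0 + noise > 0).astype(int)

# Scaling happens inside the pipeline, so it is always fit on the data passed to fit()
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_demo, y_demo)

# n_iter_ reports how many solver iterations were actually used
clf = pipe.named_steps["logisticregression"]
print("iterations used:", clf.n_iter_[0])
```

If n_iter_ stays well below max_iter, the solver stopped because it converged rather than because it ran out of iterations.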


Part B. Calculate the L2 norm of the trained weights of the model with no regularization

In [65]:
# get the weights of the no penalty model
weights = models[0].coef_[0]

# calculate its L2 norm
l2_norm = np.linalg.norm(weights, 2)

print('L2 norm of the weights for logistic regression with no penalty:', l2_norm)

L2 norm of the weights for logistic regression with no penalty: 10.97150400575851
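
For reference, np.linalg.norm(w, 2) on a 1-D array is the Euclidean (L2) norm, while ord=1 gives the L1 norm, the sum of absolute values. A small worked check on a made-up vector (w is illustrative only, not the model's weights):

```python
import numpy as np

# Illustrative vector, chosen so the norms are easy to verify by hand
w = np.array([3.0, -4.0, 0.0])

l2 = np.linalg.norm(w, 2)  # sqrt(3^2 + 4^2 + 0^2)
l1 = np.linalg.norm(w, 1)  # |3| + |-4| + |0|

# The same quantities written out explicitly
l2_manual = np.sqrt(np.sum(w ** 2))
l1_manual = np.sum(np.abs(w))

print(l2, l1)  # 5.0 7.0
```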


Part C. Choose the L1-penalized logistic regression model with the highest accuracy and report the L2 norm of its weights

In [67]:
weights2 = models[3].coef_[0]

l2_norm1 = np.linalg.norm(weights2, 2)

print('L2 norm of the weights for logistic regression with l1 penalty, C = 1:', l2_norm1)


L2 norm of the weights for logistic regression with l1 penalty, C = 1: 23.024567731084346


Part D. Choose the L2-penalized logistic regression model with the highest accuracy and report the L2 norm of its weights

In [68]:
weights3 = models[6].coef_[0]

l2_norm2 = np.linalg.norm(weights3, 2)

print('L2 norm of the weights for logistic regression with l2 penalty, C = 1:', l2_norm2)

L2 norm of the weights for logistic regression with l2 penalty, C = 1: 11.247302948510855
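
In scikit-learn, C is the inverse of the regularization strength, so smaller C values shrink the weights more and the L2 norm of coef_ should grow as C increases. The sketch below illustrates that trend on synthetic data. make_classification stands in for the wine data, the variable names are hypothetical, and the printed values say nothing about the wine models themselves.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem standing in for the wine data
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=1
)

norms = {}
for C in [0.01, 0.1, 1]:
    # Default settings apply an L2 penalty; only C changes between fits
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(X_demo, y_demo)
    norms[C] = np.linalg.norm(clf.coef_[0], 2)

for C, norm in norms.items():
    print(f"C = {C}: L2 norm = {norm:.4f}")
```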


Part E. Count the number of zero weights in the three models above.

In [71]:
# treat a weight as zero when its magnitude is at or below this threshold
threshold = 1e-5

for label, idx in [('no penalty', 0), ('l1 penalty', 3), ('l2 penalty', 6)]:
    abs_weights = np.abs(models[idx].coef_[0])
    print(label, int(np.sum(abs_weights <= threshold)))

no penalty 0
l1 penalty 0
l2 penalty 0
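
All three counts are zero at C = 1, where the penalty is weak. An L1 penalty tends to produce exact zeros only when regularization is stronger, that is, at smaller C. The sketch below illustrates this on synthetic, already standardized data with two informative features and eight noise features. The data and the names X_demo, y_demo and zero_counts are assumptions for illustration, not results from the wine models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Ten standard-normal features; only the first two influence the label
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 10))
logits = 2.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1]
y_demo = (logits + rng.normal(size=500) > 0).astype(int)

zero_counts = {}
for C in [0.01, 1]:
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(X_demo, y_demo)
    # Same threshold as the cell above: magnitudes at or below 1e-5 count as zero
    zero_counts[C] = int(np.sum(np.abs(clf.coef_[0]) <= 1e-5))

print(zero_counts)
```

Applying the same count to the l1 models fit with C = 0.01 and C = 0.1 (models[1] and models[2]) would show whether the wine data behaves similarly.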
