# Lab Assignment Three: Extending Logistic Regression 

Arely Alcantara, Emily Fashenpour

## 1. Preparation and Overview

### 1.1 Business Case

The dataset we selected is titled "Wine Quality" and contains a total of 6497 samples of red and white wines, provided as one CSV file per color. Each sample has 11 attributes, such as acidity, alcohol content, and density. Each sample was also given a quality score on a numerical scale from 0 to 10. Given a wine from a new company entering the market, we hope to classify its quality relative to the existing samples in our dataset. Third parties interested in this information might be new companies or individuals entering the wine market who want to see how their product compares in quality to wines from existing wineries.

We could deploy this solution and charge a small fee to companies that want to know their wine's quality before launching into the market. By analyzing this dataset, we hope to help wineries obtain a quality value for their wine based on the same attributes that our dataset uses.

Dataset URL: https://www.kaggle.com/danielpanizzo/wine-quality

Classification task: Wine quality on scale of 0 to 10

### 1.2 Data Preparation 

First, we read in both CSV files and combine them into one central source of information. We also add a color attribute so that we know whether each sample is a red or a white wine. We then remove the columns that are not necessary for our analysis: both CSVs include an index column, which we drop because a sample's position in the file carries no information. We consider all of the other attributes critical to determining the quality of a wine sample.

In [60]:
import pandas as pd
import numpy as np

#read red and white wine csv files respectively
redWines = pd.read_csv('wine-quality/wineQualityReds.csv')
whiteWines = pd.read_csv('wine-quality/wineQualityWhites.csv')
#add a color attribute, so we can differentiate between red and white wines
redWines['color']='Red'
whiteWines['color']='White'
#add both csv files so that there is only one dataframe
winesDf = pd.concat([redWines, whiteWines], ignore_index=True)

#drop unneeded columns
winesDf.drop(['Unnamed: 0'], axis=1, inplace=True)

#rename some columns - from periods to underscores to indicate spaces
winesDf = winesDf.rename(columns = {'fixed.acidity': 'fixed_acidity', 'volatile.acidity': 'volatile_acidity', 'citric.acid':'citric_acid', 'residual.sugar':'residual_sugar', 'free.sulfur.dioxide':'free_sulfur_dioxide', 'total.sulfur.dioxide': 'total_sulfur_dioxide'})

#shuffle rows and reset indices
winesDf = winesDf.sample(frac=1).reset_index(drop=True)
winesDf.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,color
0,6.7,0.24,0.41,2.9,0.039,48.0,122.0,0.99052,3.25,0.43,12.0,5,White
1,7.3,0.19,0.27,13.9,0.057,45.0,155.0,0.99807,2.94,0.41,8.8,8,White
2,6.8,0.23,0.29,15.4,0.073,56.0,173.0,0.9984,3.06,0.41,8.7,6,White
3,6.3,0.26,0.49,1.5,0.052,34.0,134.0,0.9924,2.99,0.61,9.8,6,White
4,6.0,0.45,0.65,9.7,0.08,11.0,159.0,0.9956,3.04,0.48,9.4,5,White


In [61]:
#show current column info with data type
print(winesDf.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
fixed_acidity           6497 non-null float64
volatile_acidity        6497 non-null float64
citric_acid             6497 non-null float64
residual_sugar          6497 non-null float64
chlorides               6497 non-null float64
free_sulfur_dioxide     6497 non-null float64
total_sulfur_dioxide    6497 non-null float64
density                 6497 non-null float64
pH                      6497 non-null float64
sulphates               6497 non-null float64
alcohol                 6497 non-null float64
quality                 6497 non-null int64
color                   6497 non-null object
dtypes: float64(11), int64(1), object(1)
memory usage: 660.0+ KB
None


As you can see, most of the attributes are floats, since measures such as acidity, sugar, and sulphates are all numerical values. We need these to predict the quality of a wine sample, so we decided not to alter them and to use all attributes when classifying a quality value.

In [62]:
#change color to be an int
winesDf['color']=winesDf['color'].map({'Red': 1, 'White': 2})
winesDf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
fixed_acidity           6497 non-null float64
volatile_acidity        6497 non-null float64
citric_acid             6497 non-null float64
residual_sugar          6497 non-null float64
chlorides               6497 non-null float64
free_sulfur_dioxide     6497 non-null float64
total_sulfur_dioxide    6497 non-null float64
density                 6497 non-null float64
pH                      6497 non-null float64
sulphates               6497 non-null float64
alcohol                 6497 non-null float64
quality                 6497 non-null int64
color                   6497 non-null int64
dtypes: float64(11), int64(2)
memory usage: 660.0 KB


In [65]:
winesDf.describe()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,color
count,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0
mean,7.215307,0.339666,0.318633,5.443235,0.056034,30.525319,115.744574,0.994697,3.218501,0.531268,10.491801,5.818378,1.753886
std,1.296434,0.164636,0.145318,4.757804,0.035034,17.7494,56.521855,0.002999,0.160787,0.148806,1.192712,0.873255,0.430779
min,3.8,0.08,0.0,0.6,0.009,1.0,6.0,0.98711,2.72,0.22,8.0,3.0,1.0
25%,6.4,0.23,0.25,1.8,0.038,17.0,77.0,0.99234,3.11,0.43,9.5,5.0,2.0
50%,7.0,0.29,0.31,3.0,0.047,29.0,118.0,0.99489,3.21,0.51,10.3,6.0,2.0
75%,7.7,0.4,0.39,8.1,0.065,41.0,156.0,0.99699,3.32,0.6,11.3,6.0,2.0
max,15.9,1.58,1.66,65.8,0.611,289.0,440.0,1.03898,4.01,2.0,14.9,9.0,2.0


We have 6497 entries and every field is filled - therefore we have no missing data.
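The non-null counts above can also be checked programmatically. The following is a minimal sketch on a small, hypothetical stand-in frame (`toy_df` and its values are made up for illustration and are not part of our dataset); the same two calls could be applied to `winesDf`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for winesDf, with one deliberately missing value
toy_df = pd.DataFrame({
    'alcohol': [9.4, 10.1, np.nan, 12.0],
    'pH': [3.2, 3.1, 3.3, 3.0],
    'quality': [5, 6, 6, 7],
})

# Count missing values in each column
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)

# True only when no column contains a missing entry
has_no_missing = not toy_df.isnull().values.any()
print(has_no_missing)
```

On a frame with no gaps, every per-column count would be zero and the final flag would be `True`.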

In [99]:
#create a dataframe to get a nice description table
data_des = pd.DataFrame()
data_des['Features'] = winesDf.columns
data_des['Description'] = ['does not evaporate readily',
                          'the amount of acetic acid in wine',
                          'citric acid can add freshness and flavor to wines',
                          'sugar remaining after fermentation stops',
                          'the amount of salt in the wine',
                          'prevents microbial growth and the oxidation',
                          'SO2 becomes evident in the nose and taste of wine',
                          'water depending on the percent alcohol and sugar content',
                          'describes how acidic or basic a wine is',
                          'sulfur dioxide gas (SO2) levels',
                          'the percent alcohol content of the wine',
                          'value given by experts on wine quality',
                          'refers to color of wine']
#quality is an ordered rating, so its scale is ordinal; color is a nominal label
data_des['Scale'] = ['ratio', 'ratio', 'ratio', 'ratio', 'ratio', 'ratio', 'ratio', 'ratio', 'ratio', 'ratio', 'ratio', 'ordinal', 'nominal']
data_des['Discrete/Continuous'] = ['continuous', 'continuous', 'continuous', 'continuous', 'continuous', 'continuous', 'continuous', 'continuous', 'continuous', 'continuous', 'continuous', 'discrete', 'discrete']
#ranges follow the min and max rows of winesDf.describe()
data_des['Range'] = ['3.8 - 15.9', '0.08 - 1.58', '0 - 1.66', '0.6 - 65.8','0.01 - 0.61','1 - 289','6 - 440','0.99 - 1.04','2.72 - 4.01','0.22 - 2','8 - 14.9','3 - 9 (based on sensory data)','1: Red, 2: White']
data_des

Unnamed: 0,Features,Description,Scale,Discrete/Continuous,Range
0,fixed_acidity,does not evaporate readily,ratio,continuous,3.8 - 15.9
1,volatile_acidity,the amount of acetic acid in wine,ratio,continuous,0.08 - 1.58
2,citric_acid,citric acid can add freshness and flavor to wines,ratio,continuous,0 - 1.66
3,residual_sugar,sugar remaining after fermentation stops,ratio,continuous,0.6 - 65.8
4,chlorides,the amount of salt in the wine,ratio,continuous,0.01 - 0.61
5,free_sulfur_dioxide,prevents microbial growth and the oxidation,ratio,continuous,1 - 289
6,total_sulfur_dioxide,SO2 becomes evident in the nose and taste of wine,ratio,continuous,6 - 440
7,density,water depending on the percent alcohol and sug...,ratio,continuous,0.99 - 1.04
8,pH,describes how acidic or basic a wine is,ratio,continuous,2.72 - 4.01
9,sulphates,sulfur dioxide gas (SO2) levels,ratio,continuous,0.22 - 2


We originally had 2 CSV files, one for red wines and one for white wines, so we merged both files and shuffled the rows to get a randomized dataset. We removed an index column since it is not necessary for our analysis, and we added an extra attribute for color since we're looking at both red and white wines. We are trying to classify a quality value given 11 attributes: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol. These attributes tell us how likely a wine sample is to evaporate, how sweet or acidic it is, how much sugar it contains, its density, its pH, and how much alcohol it contains. We did not remove any of these attributes, as we feel they are all required for our classification.

### 1.3 Divide data into training and testing sets

In [108]:
from sklearn.model_selection import train_test_split

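# create training and testing vars
# quality is the target, so it is dropped from the feature matrix to avoid leaking the label
X = winesDf.drop(columns=['quality'])
y = winesDf['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)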
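
Because the quality classes are unevenly represented, one option is to stratify an 80/20 split on the target so that both sets keep similar class proportions. The sketch below uses a hypothetical toy frame (`toy_df`, its class counts, and the `random_state` value are assumptions for illustration, not our data or settings) and keeps the target out of the feature matrix. Note that `stratify` requires every class to have at least two samples.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with imbalanced quality classes (not the real dataset)
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    'alcohol': rng.uniform(8.0, 14.9, size=100),
    'pH': rng.uniform(2.72, 4.01, size=100),
    'quality': np.repeat([5, 6, 7, 8], [40, 40, 10, 10]),
})

# Separate the features from the target so the label is not used as an input
X = toy_df.drop(columns=['quality'])
y = toy_df['quality']

# 80/20 split; stratify=y keeps class proportions similar in both sets,
# and random_state (an assumed value) makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Compare class proportions in the training and testing targets
print(y_train.value_counts(normalize=True).sort_index())
print(y_test.value_counts(normalize=True).sort_index())
```

Fixing `random_state` is a design choice that trades a fresh shuffle on every run for results that can be reproduced when comparing classifiers.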

## 2. Modeling

### 2.1 Implementation of logistic regression

### 2.2 One-versus-all logistic regression classifier

### 2.3 Train custom classifier

### 2.4 Compare custom results and scikit-learn

## 3. Deployment

## 4. Exceptional Work