# CS 495 (ML) Wine dataset
Welcome to the Machine Learning demo. In this notebook, we demonstrate how to read in a basic dataset and generate a machine learning model (or two) based on a wine quality data set that contains both red and white wines.

# 1 - Load & Initialize Data
## Import Libraries & Load Data
First, we must initialize the environment and import data from the CSV file into a Pandas dataframe:

In [1]:
# Imports
import math
import pandas as pd
from pandas import DataFrame
from IPython import display
from sklearn import preprocessing  # For label encoding and data scaling
from sklearn.model_selection import cross_val_score  # For cross-validation
from sklearn import linear_model  # For linear regression model

# Test that Pandas is installed and imported
pd.__version__

# Load the wine dataset from the CSV file into a DataFrame
df_wine = pd.read_csv("C:/Users/dattr/Desktop/wine_dataset.csv")
#print (df_wine)

## Display Numerical Data
Next, we print basic stats (count, mean, standard deviation, and quartiles) on the numeric columns:

In [2]:
df_wine.describe(include = ['number'])

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality
count,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0
mean,7.215307,0.339666,0.318633,5.443235,0.056034,30.525319,115.744574,0.994697,3.218501,0.531268,10.491801,5.818378
std,1.296434,0.164636,0.145318,4.757804,0.035034,17.7494,56.521855,0.002999,0.160787,0.148806,1.192712,0.873255
min,3.8,0.08,0.0,0.6,0.009,1.0,6.0,0.98711,2.72,0.22,8.0,3.0
25%,6.4,0.23,0.25,1.8,0.038,17.0,77.0,0.99234,3.11,0.43,9.5,5.0
50%,7.0,0.29,0.31,3.0,0.047,29.0,118.0,0.99489,3.21,0.51,10.3,6.0
75%,7.7,0.4,0.39,8.1,0.065,41.0,156.0,0.99699,3.32,0.6,11.3,6.0
max,15.9,1.58,1.66,65.8,0.611,289.0,440.0,1.03898,4.01,2.0,14.9,9.0


## Display Non-Numerical Data
Next, we print basic (and less useful) stats on the non-number columns:

In [3]:
# Collect the unique wine style labels (used later for per-class results and confusion matrices)
cm_labels = df_wine["style"].unique()
print (cm_labels)

['red' 'white']


# 2 - Pre-process Data
## Encode String Labels
Convert the labels (in String format) to integers, which are more easily processed by ML algorithms:


In [4]:
# Generate label encoder object
label_encoder = preprocessing.LabelEncoder()

# Convert strings to ints and print the unique ints
df_wine["style"] = label_encoder.fit_transform(df_wine["style"])
df_wine["style"].unique()

array([0, 1])
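
The encoder assigns integers in sorted order of the class names, so `red` becomes 0 and `white` becomes 1. A minimal sketch of how to inspect and reverse that mapping, assuming scikit-learn is installed; the `styles` list is a hypothetical sample, not the wine data:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of string labels (not taken from the wine CSV)
styles = ["white", "red", "red", "white"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(styles)

# classes_ lists the original labels; position i is the label encoded as i
class_names = list(encoder.classes_)

# inverse_transform maps the integers back to the original strings
decoded = list(encoder.inverse_transform(encoded))

print(class_names)
print(encoded)
print(decoded)
```

Keeping the fitted encoder around makes it possible to translate predictions back into readable labels later on.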

## Filter Data

In [5]:
# Nothing for now: all rows are kept. To filter, select rows by a condition, for example:
#df_wine = df_wine[df_wine["style"] >= 0]
#df_wine.describe()
#df_wine.head(n = 100)

## Randomize Data
Randomize the row order, save the shuffled data to a new CSV file, and print the first few rows for confirmation:

In [6]:
import numpy as np
df_wine = df_wine.reindex(np.random.permutation(df_wine.index))
df_wine.to_csv("C:/Users/dattr/Desktop/wine_dataset_Randomized.csv", index = False)
df_wine.head(n = 10)

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,style
873,9.1,0.21,0.37,1.6,0.067,6.0,10.0,0.99552,3.23,0.58,11.1,7,0
3237,7.6,0.47,0.49,13.0,0.239,42.0,220.0,0.9988,2.96,0.51,9.2,5,1
3546,5.7,0.45,0.42,1.1,0.051,61.0,197.0,0.9932,3.02,0.4,9.0,5,1
50,8.8,0.66,0.26,1.7,0.074,4.0,23.0,0.9971,3.15,0.74,9.2,5,0
2220,6.5,0.26,0.43,8.9,0.083,50.0,171.0,0.9965,2.85,0.5,9.0,5,1
2941,8.4,0.58,0.27,12.15,0.033,37.0,116.0,0.9959,2.99,0.39,10.8,6,1
6336,6.1,0.24,0.32,9.0,0.031,41.0,134.0,0.99234,3.25,0.26,12.3,7,1
848,6.4,0.64,0.21,1.8,0.081,14.0,31.0,0.99689,3.59,0.66,9.8,5,0
1637,7.3,0.24,0.39,17.95,0.057,45.0,149.0,0.9999,3.21,0.36,8.6,5,1
4305,7.5,0.28,0.39,10.2,0.045,59.0,209.0,0.9972,3.16,0.63,9.6,6,1
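
The shuffle above draws a new permutation on every run, so the split and the scores that follow can differ between runs. A minimal sketch of a reproducible shuffle, assuming a fixed seed is acceptable; `df_demo` and the seed value 42 are hypothetical choices for illustration:

```python
import pandas as pd

# Hypothetical miniature dataset standing in for df_wine
df_demo = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0],
    "style": [0, 0, 1, 1, 1],
})

# sample(frac=1) returns all rows in random order; random_state fixes the order
shuffled_a = df_demo.sample(frac=1, random_state=42)
shuffled_b = df_demo.sample(frac=1, random_state=42)

print(shuffled_a)
```

Both calls return the rows in the same order because they share the same seed.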


## Select Columns for Features & Labels
The following functions pre-process the data by extracting the relevant features and targets into separate dataframes:

In [7]:
# Takes in a Pandas DataFrame that contains a raw dataset and returns a
# Pandas DataFrame that contains only the selected features used for a model
def get_features_dataframe(df_input):
    
    # Create a new/blank DataFrame
    df_selected = pd.DataFrame()
    
    # Grab any features already available
    df_selected["fixed_acidity"] = df_input["fixed_acidity"]
    df_selected["volatile_acidity"] = df_input["volatile_acidity"]
    df_selected["citric_acid"] = df_input["citric_acid"]
    df_selected["redisual_sugar"] = df_input["residual_sugar"]
    df_selected["chlorides"] = df_input["chlorides"]
    df_selected["free_sulfur_dioxide"] = df_input["free_sulfur_dioxide"]
    df_selected["total_sulfur_dioxide"] = df_input["total_sulfur_dioxide"]
    df_selected["density"] = df_input["density"]
    df_selected["pH"] = df_input["pH"]
    df_selected["sulphates"] = df_input["sulphates"]
    df_selected["alcohol"] = df_input["alcohol"]
    
    # Make a copy of the selected features
    df_processed = df_selected.copy()
    
    # Return the selected features (both pre-existing and synthetic)
    return df_processed


# Takes in a Pandas DataFrame that contains a raw dataset and returns a
# Pandas DataFrame that contains only the selected target(s) used for a model
def get_targets_dataframe(df_input):
    
    # Create a new/blank DataFrame
    df_selected = pd.DataFrame()
    
    # Grab the target column (the encoded wine style)
    df_selected["wine_style"] = df_input["style"]
    
    # Make a copy of the selected targets
    df_processed = df_selected.copy()
    
    # Return the selected targets
    return df_processed
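
The column-by-column copying above can also be written with a list of column names. A minimal sketch, assuming the same selection behavior is wanted; `select_columns`, `FEATURE_COLUMNS`, and `df_demo` are hypothetical names introduced only for illustration:

```python
import pandas as pd

# Hypothetical subset of the feature columns
FEATURE_COLUMNS = ["fixed_acidity", "alcohol"]


def select_columns(df_input, columns):
    """Return an independent copy of df_input restricted to the given columns."""
    return df_input[list(columns)].copy()


# Hypothetical miniature dataset standing in for df_wine
df_demo = pd.DataFrame({
    "fixed_acidity": [7.4, 7.8],
    "alcohol": [9.4, 9.8],
    "style": [0, 1],
})

df_features = select_columns(df_demo, FEATURE_COLUMNS)

# Editing the copy leaves the original DataFrame untouched
df_features.loc[0, "alcohol"] = 0.0
print(df_features)
```

Keeping the column names in one list makes it easier to add or drop a feature without touching the function body.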

## Separate Data into Training & Testing Sets
Select the:
- percentage of data to be used for training in the classic training/validation split
- number of folds for cross-validation

In [8]:
# Fraction (0-1.0 corresponds to 0% to 100%) of the dataset used for training
percent_training_data = .8
percent_validation_data = 1 - percent_training_data
num_cv_folds = 5



Now separate the data into training and validation sets by setting the percentage of data to be used for training:

In [9]:
# Choose the first (percent_training_data)% examples for training
num_total_examples = len(df_wine)
num_training_examples = math.ceil(num_total_examples * percent_training_data)
num_validation_examples = num_total_examples - num_training_examples

# Get all examples (useful later on...)
df_features_all = get_features_dataframe(df_wine.head(num_total_examples))
df_targets_all = get_targets_dataframe(df_wine.head(num_total_examples))

# Choose the first (percent_training_data)% for training examples
df_features_training = get_features_dataframe(df_wine.head(num_training_examples))
df_targets_training = get_targets_dataframe(df_wine.head(num_training_examples))

# Choose the last (1-percent_training_data)% for validation examples
df_features_validation = get_features_dataframe(df_wine.tail(num_validation_examples))
df_targets_validation = get_targets_dataframe(df_wine.tail(num_validation_examples))
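
The head/tail split above relies on the earlier shuffle to balance the classes between the two sets. As an alternative sketch, scikit-learn's `train_test_split` can shuffle and keep the class proportions equal in one call; the synthetic `X` and `y` below are assumptions for illustration, not the wine data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 examples, 3 features, 75% of labels are class 1
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 25 + [1] * 75)

# stratify=y keeps the class ratio the same in both parts of the split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=0
)

print(len(X_train), len(X_val))
print("Share of class 1 in validation set:", y_val.mean())
```

Stratifying matters most when one class is much rarer than the other, since a plain random split could leave the validation set with very few examples of it.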


## Display Summary of Training/Testing Data (SANITY CHECK)
Print out basic stats of the training and validation data for both the features and targets/labels. If the data was properly randomized, the means of the training and validation features/targets should be close to each other:

In [10]:
# Print summary of data split
print (str(num_total_examples) + " total examples used: ")
print("\t" + str( round(num_training_examples / num_total_examples * 100, 2)  ) + 
      "% (" + str(num_training_examples) + " examples used for training)")
print("\t" + str( round(num_validation_examples / num_total_examples * 100, 2)  ) + 
      "% (" + str(num_validation_examples) + " examples used for validation)")

# Display summary of features data
print("\nTraining features summary:")
display.display(df_features_training.describe())
print("\nValidation features summary:")
display.display(df_features_validation.describe())

# Display summary of labels/targets data
print("\nTraining targets/labels summary:")
display.display(df_targets_training.describe())
print("\nValidation targets/labels summary:")
display.display(df_targets_validation.describe())


6497 total examples used: 
	80.01% (5198 examples used for training)
	19.99% (1299 examples used for validation)

Training features summary:


Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,redisual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol
count,5198.0,5198.0,5198.0,5198.0,5198.0,5198.0,5198.0,5198.0,5198.0,5198.0,5198.0
mean,7.229771,0.339533,0.320833,5.498288,0.05622,30.573682,115.852828,0.994751,3.21874,0.531301,10.477852
std,1.310984,0.163869,0.145154,4.789507,0.035291,17.760182,56.464978,0.003012,0.16127,0.14861,1.192113
min,3.8,0.08,0.0,0.6,0.009,1.0,6.0,0.98713,2.72,0.22,8.0
25%,6.4,0.23,0.25,1.8,0.038,17.0,77.0,0.9924,3.11,0.43,9.5
50%,7.0,0.29,0.31,3.0,0.047,29.0,118.0,0.99496,3.21,0.51,10.3
75%,7.7,0.4,0.39,8.2,0.066,41.0,156.0,0.997,3.32,0.6,11.3
max,15.9,1.58,1.66,65.8,0.611,289.0,440.0,1.03898,4.01,2.0,14.9



Validation features summary:


Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,redisual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol
count,1299.0,1299.0,1299.0,1299.0,1299.0,1299.0,1299.0,1299.0,1299.0,1299.0,1299.0
mean,7.157429,0.340196,0.309831,5.222941,0.055287,30.331794,115.311393,0.994478,3.217544,0.531139,10.547616
std,1.235291,0.167734,0.145697,4.624017,0.033985,17.711696,56.768624,0.002937,0.158897,0.149643,1.193935
min,4.7,0.08,0.0,0.8,0.014,1.0,6.0,0.98711,2.79,0.23,8.4
25%,6.4,0.22,0.24,1.8,0.0375,17.0,80.0,0.99211,3.11,0.43,9.5
50%,6.9,0.3,0.3,3.0,0.047,28.0,117.0,0.99458,3.21,0.51,10.4
75%,7.6,0.4075,0.39,7.7,0.063,41.0,155.0,0.9967,3.32,0.6,11.4
max,15.0,1.33,1.0,31.6,0.467,110.0,366.5,1.0103,4.01,1.62,14.2



Training targets/labels summary:


Unnamed: 0,wine_style
count,5198.0
mean,0.752982
std,0.431319
min,0.0
25%,1.0
50%,1.0
75%,1.0
max,1.0



Validation targets/labels summary:


Unnamed: 0,wine_style
count,1299.0
mean,0.757506
std,0.428757
min,0.0
25%,1.0
50%,1.0
75%,1.0
max,1.0


## Standardize Data & Display (SANITY CHECK)
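Standardize all feature data so that each feature has a mean of 0 and a standard deviation of 1. The scaler is fit on the training data only and then applied to both the training and validation sets, so the validation statistics will be close to, but not exactly, 0 and 1. Display results for sanity check: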

In [11]:
# Fit the scaler on the training examples, then standardize both the training and validation examples
scaler = preprocessing.StandardScaler().fit(df_features_training)
df_features_training_normalized = pd.DataFrame(scaler.transform(df_features_training))
df_features_validation_normalized = pd.DataFrame(scaler.transform(df_features_validation))

# Display summary of feature data
print("\nTraining features summary:")
display.display(df_features_training_normalized.describe())
print("\nValidation features summary:")
display.display(df_features_validation_normalized.describe())

# For more tips on scaling data in SCIKIT-LEARN:
# https://scikit-learn.org/stable/modules/preprocessing.html


Training features summary:


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
count,5198.0,5198.0,5198.0,5198.0,5198.0,5198.0,5198.0,5198.0,5198.0,5198.0,5198.0
mean,-5.741207e-17,1.530989e-16,4.3742530000000004e-17,6.629727000000001e-17,-1.500232e-16,4.4426010000000005e-17,-8.987723000000001e-17,2.874021e-14,-3.379111e-15,4.46994e-16,-1.936974e-15
std,1.000096,1.000096,1.000096,1.000096,1.000096,1.000096,1.000096,1.000096,1.000096,1.000096,1.000096
min,-2.616432,-1.583936,-2.210511,-1.022811,-1.338143,-1.665328,-1.945691,-2.530926,-3.092868,-2.094944,-2.078739
25%,-0.6329985,-0.6684843,-0.4880332,-0.7722388,-0.5163354,-0.7643496,-0.6881534,-0.7808393,-0.6743357,-0.6817171,-0.8203474
50%,-0.175283,-0.3023035,-0.0746385,-0.521667,-0.2612916,-0.08861585,0.03803027,0.06929761,-0.0541993,-0.1433449,-0.1492053
75%,0.3587185,0.369028,0.4765545,0.5641441,0.2771343,0.5871179,0.7110785,0.7467504,0.6279507,0.4623237,0.6897224
max,6.614164,7.570584,9.226743,12.59159,15.72146,14.55228,5.741229,14.68767,4.906892,9.883836,3.709862



Validation features summary:


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
count,1299.0,1299.0,1299.0,1299.0,1299.0,1299.0,1299.0,1299.0,1299.0,1299.0,1299.0
mean,-0.055187,0.004045,-0.075805,-0.057495,-0.026449,-0.013621,-0.00959,-0.090839,-0.007415,-0.001085,0.058527
std,0.942353,1.023686,1.003838,0.96554,0.963082,0.997366,1.005474,0.975488,0.985379,1.007045,1.001625
min,-1.929859,-1.583936,-2.210511,-0.981049,-1.196452,-1.665328,-1.945691,-2.537567,-2.658772,-2.027647,-1.743168
25%,-0.632998,-0.729514,-0.556932,-0.772239,-0.530504,-0.76435,-0.635018,-0.877144,-0.674336,-0.681717,-0.820347
50%,-0.251569,-0.241273,-0.143538,-0.521667,-0.261292,-0.144927,0.020318,-0.056895,-0.054199,-0.143345,-0.065313
75%,0.282433,0.414801,0.476554,0.459739,0.19212,0.587118,0.693367,0.647125,0.627951,0.462324,0.773615
max,5.92759,6.044831,4.679401,5.450294,11.640754,4.472587,4.439412,5.163477,4.906892,7.326568,3.122613
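
The training summary reports a standard deviation of 1.000096 rather than exactly 1. This is expected: `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`), and the two differ by a factor of sqrt(n / (n - 1)). A minimal sketch that reproduces the number with NumPy alone; the values in `x` are synthetic, and only `n` matches the training set size:

```python
import numpy as np

n = 5198  # number of training examples in the split above

# Synthetic stand-in for one feature column
rng = np.random.default_rng(0)
x = rng.normal(loc=7.2, scale=1.3, size=n)

# Standardize with the population standard deviation (ddof=0)
z = (x - x.mean()) / x.std(ddof=0)

population_std = z.std(ddof=0)  # 1 up to floating-point rounding
sample_std = z.std(ddof=1)      # slightly above 1
expected = np.sqrt(n / (n - 1))

print(round(float(sample_std), 6))
print(round(float(expected), 6))
```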


# 3 - Generate Machine Learning Models & Make Predictions
## Variable Initialization
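The following code creates lists that hold the model names, fitted models, predictions, and confusion matrices, so that later cells can loop over all models in the same order:

In [12]:
# Model names, listed in the same order in which the models are trained below
lst_model_names = ["Support Vector Classification (SVC)", "K-Nearest Neighbor (KNN)", "Linear Support Vector Classification (LSVC)"]
lst_models = []
lst_model_predictions = []
lst_model_CMs = []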



## Train Data
The following code fits several classifiers to the training data:

In [13]:
print ("Model Parameters:", end = "\n\n")

# Train/fit Support Vector Classification model
from sklearn import svm
svc = svm.SVC(kernel = "linear", class_weight = "balanced")
svc.fit(df_features_training_normalized, df_targets_training.to_numpy().ravel())
lst_models.append(svc)
print(svc, end="\n\n\t")

# Train/fit K-Nearest Neighbors model
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 2)
knn.fit(df_features_training_normalized, df_targets_training.to_numpy().ravel())
lst_models.append(knn)
print(knn, end="\n\n\t")

# Train/fit Linear Support Vector Classification model
from sklearn.svm import LinearSVC
lsvc = LinearSVC(random_state = 0, tol = 1e-5, max_iter = 100000)
lsvc.fit(df_features_training_normalized, df_targets_training.to_numpy().ravel())
lst_models.append(lsvc)
print(lsvc, end="\n\n\t")





Model Parameters:

SVC(C=1.0, break_ties=False, cache_size=200, class_weight='balanced', coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

	KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=2, p=2,
                     weights='uniform')

	LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=100000,
          multi_class='ovr', penalty='l2', random_state=0, tol=1e-05,
          verbose=0)

	

## Make Predictions
The following code makes predictions and prints the raw prediction arrays:

In [14]:
#Predict validation examples and print
for i in range(len(lst_model_names)):
    prediction = lst_models[i].predict(df_features_validation_normalized)
    lst_model_predictions.append(prediction)
    print(lst_model_names[i] + " Predictions:")
    print (prediction, "\n")
    
# Print actual validation labels
print("***ACTUAL LABELS:***")
print(df_targets_validation.to_numpy().ravel())

Support Vector Classification (SVC) Predictions:
[1 1 0 ... 1 1 1] 

K-Nearest Neighbor (KNN) Predictions:
[1 1 0 ... 1 1 1] 

Linear Support Vector Classification (LSVC) Predictions:
[1 1 0 ... 1 1 1] 

***ACTUAL LABELS:***
[1 1 0 ... 1 1 1]
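
Truncated arrays like these are hard to compare by eye. A minimal sketch of an element-wise comparison with NumPy, which yields the fraction of correct predictions and the positions of the mistakes; the `predicted` and `actual` arrays are hypothetical values, not the notebook's output:

```python
import numpy as np

# Hypothetical predictions and actual labels (0 = red, 1 = white)
predicted = np.array([1, 1, 0, 1, 0, 1, 1, 1])
actual = np.array([1, 1, 0, 0, 0, 1, 1, 1])

# Boolean array: True wherever the prediction equals the actual label
matches = predicted == actual

# The mean of a boolean array is the fraction of True values, i.e. the accuracy
accuracy = matches.mean()

# Positions at which the prediction was wrong
mismatch_positions = np.flatnonzero(~matches)

print("Accuracy: {:.2f}%".format(accuracy * 100))
print("Mismatched positions:", mismatch_positions)
```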


# 4 - Formatted results
## Generate Stats
Generate a confusion matrix for each model's predictions on the validation set:

In [15]:
from sklearn.metrics import confusion_matrix

# Generate one confusion matrix per model (rows = actual labels, columns = predicted labels)
for i in range(len(lst_model_names)):
    cm = confusion_matrix(df_targets_validation, lst_model_predictions[i])
    lst_model_CMs.append(cm)
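
In scikit-learn's convention, row i of a confusion matrix counts the examples whose actual label is i and column j counts those predicted as j. Dividing each diagonal entry by its row sum therefore gives the share of each class that was predicted correctly. A minimal sketch with NumPy only, using hypothetical labels and building the matrix by hand to make the convention explicit:

```python
import numpy as np

# Hypothetical actual and predicted labels (0 = red, 1 = white)
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 1, 0])

# Build the matrix by hand: rows index the actual class, columns the predicted class
num_classes = 2
cm = np.zeros((num_classes, num_classes), dtype=int)
for actual_label, predicted_label in zip(y_true, y_pred):
    cm[actual_label, predicted_label] += 1

# Correct predictions per class divided by the number of actual examples of that class
per_class_accuracy = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(per_class_accuracy)
```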


## Display Basic Summary
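The following code prints, for each model, the overall accuracy on the validation set, the accuracy within each class, and the mean cross-validation accuracy: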

In [16]:
# Print correctness of each model
for i in range(len(lst_model_names)):
    print(lst_model_names[i] + " Prediction Accuracy: ")
    
    # Print results for the classic split of training and validation data
    print("\tResults for classic {:.0f}/{:.0f} (training/testing) split:".format(percent_training_data * 100, (1 - percent_training_data) * 100))
    overall_score = lst_models[i].score(df_features_validation_normalized, df_targets_validation)
    print("\t\tOverall: {:.2f}%".format(overall_score * 100))
    
    # Print the accuracy within each class (correct predictions / actual examples of that class)
    for j in range(len(cm_labels)):
        print("\t\t{:s}: {:.2f}%".format(cm_labels[j], lst_model_CMs[i][j][j] / sum(lst_model_CMs[i][j]) * 100))
    
    # Print results for cross-validation (note: this runs on the unstandardized features in df_features_all)
    cv_results = cross_val_score(lst_models[i], df_features_all, df_targets_all.to_numpy().ravel(), cv=num_cv_folds)
    print("\tResults for classic {:d}-fold cross-validation:".format(num_cv_folds))
    print("\t\tOverall: {:.2f}%\n".format(np.mean(cv_results) * 100))

Support Vector Classification (SVC) Prediction Accuracy: 
	Results for classic 80/20 (training/testing) split:
		Overall: 99.38%
		red: 99.05%
		white: 99.49%
	Results for classic 5-fold cross-validation:
		Overall: 98.54%

K-Nearest Neighbor (KNN) Prediction Accuracy: 
	Results for classic 80/20 (training/testing) split:
		Overall: 99.46%
		red: 99.37%
		white: 99.49%
	Results for classic 5-fold cross-validation:
		Overall: 93.00%

Linear Support Vector Classification (LSVC) Prediction Accuracy: 
	Results for classic 80/20 (training/testing) split:
		Overall: 99.46%
		red: 98.73%
		white: 99.70%
	Results for classic 5-fold cross-validation:
		Overall: 98.74%
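
The cross-validation call above receives `df_features_all`, which holds the unstandardized features, while the 80/20 results use standardized features. This difference may partly explain why KNN, a distance-based model, scores lower under cross-validation. A minimal sketch of one way to standardize inside each fold, assuming scikit-learn's `make_pipeline`; the synthetic `X` and `y` are stand-ins, not the wine data:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two features on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(size=200),                # informative feature, small scale
    rng.normal(scale=1000.0, size=200),  # uninformative feature, large scale
])
y = (X[:, 0] > 0).astype(int)

# The pipeline refits the scaler on the training part of every fold,
# so no statistics from the held-out part leak into the scaling
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=2))
cv_scores = cross_val_score(scaled_knn, X, y, cv=5)

print("Mean accuracy: {:.2f}%".format(np.mean(cv_scores) * 100))
```

Wrapping the scaler and the model in one estimator keeps the cross-validated setup consistent with the standardized train/validation split.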





## Display Confusion Matrices
The following code generates and displays the confusion matrix of each model's predictions:

In [17]:
# Import the plotting function from a local file (modified so that it no longer uses deprecated methods)
# File/package sourced from: http://github.com/wipria
from confusion_matrix_pretty_print import pretty_plot_confusion_matrix

#print confusion matrices
for i in range(len(lst_model_names)):
    title = lst_model_names[i] + " Confusion Matrix"
    df_cm = DataFrame(lst_model_CMs[i], index = cm_labels, columns = cm_labels)
    pretty_plot_confusion_matrix(df_cm, cmap="PuRd", pred_val_axis="X", title=title)

ModuleNotFoundError: No module named 'matplotlib'
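
The plotting helper depends on matplotlib, which is missing in the environment that produced the error above. A minimal sketch of a text-only fallback that checks for matplotlib first and prints a labeled table either way; the counts in `cm_values` are hypothetical, not the notebook's results:

```python
import importlib.util

import pandas as pd

# Hypothetical confusion matrix counts (rows = actual, columns = predicted)
cm_values = [[8, 2], [1, 9]]
labels = ["red", "white"]

df_cm = pd.DataFrame(cm_values, index=labels, columns=labels)
df_cm.index.name = "actual"
df_cm.columns.name = "predicted"

# find_spec returns None when the package cannot be imported
has_matplotlib = importlib.util.find_spec("matplotlib") is not None
if has_matplotlib:
    print("matplotlib is available; the plotting helper can be used.")
else:
    print("matplotlib not found; showing a text table instead.")

# The text table works without any plotting library
cm_text = df_cm.to_string()
print(cm_text)
```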