# Machine Learning in Python - Project 2

Due Wednesday, April 15th by 5 pm.

## 1. Setup

### 1.1 Libraries

In [None]:
# Add any additional libraries or submodules below

# Display plots inline
%matplotlib inline

# Data libraries
import pandas as pd
import numpy as np

# Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Plotting defaults
plt.rcParams['figure.figsize'] = (8,5)
plt.rcParams['figure.dpi'] = 80

# sklearn modules
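import sklearn
import sklearn.metrics  # `import sklearn` alone does not guarantee that the `metrics` submodule is loaded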

### 1.2 Data

In [None]:
wine_train = pd.read_csv("wine_qual_train.csv")
wine_test  = pd.read_csv("wine_qual_test.csv")

## 2. Exploratory Data Analysis and Preprocessing

*Include a discussion of the data with a particular emphasis on the features of the data that are relevant for the subsequent modeling. Including visualizations of the data is strongly encouraged - all code and plots must also be described in the write up.*

*In this section you should also implement and describe any preprocessing / transformations of the features. Hint - you need to take care of the discretization of the `quality` variable as described in `README.ipynb`.*
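
As a starting point, the discretization step might look like the sketch below. It assumes `quality` is a numeric score; the cut points and the label names `poor`, `good`, and `excellent` are hypothetical placeholders, and the actual thresholds should be taken from `README.ipynb`.

```python
import pandas as pd

# Hypothetical cut points and labels: replace with the definitions in README.ipynb
QUALITY_BINS = [0, 4, 6, 10]
QUALITY_LABELS = ["poor", "good", "excellent"]

def discretize_quality(scores: pd.Series) -> pd.Series:
    """Map numeric quality scores to ordered categories.

    With these bins the intervals are [0, 4], (4, 6], and (6, 10].
    """
    return pd.cut(
        scores,
        bins=QUALITY_BINS,
        labels=QUALITY_LABELS,
        include_lowest=True,
    )

# Made-up scores for illustration only
example_scores = pd.Series([3, 5, 6, 7, 9])
example_labels = discretize_quality(example_scores)
print(example_labels.tolist())
```

Whatever rule is used, the same function should be applied to the train, test, and holdout data so that the class definitions stay consistent.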

## 3. Model Fitting and Tuning

*In this section you should detail your choice of model and describe the process used to refine and fit that model. You are strongly encouraged to explore many different modeling methods (e.g. logistic regression, classification trees, support vector machines), but you should not include a detailed narrative of all of these attempts. At most, this section should mention the methods explored and why they were rejected - most of your effort should go into describing the model you are using.*

*For example, if you considered a logistic regression model, a classification tree, and a support vector machine model and ultimately settled on the logistic regression approach, then you should mention that the other two approaches were tried, but do not include any of the code or any in-depth discussion of these models beyond why they were rejected. Additional code for these models should be included in a supplemental materials notebook. What this section should then detail is the development of the model of choice in terms of features used, interactions considered, and any additional tuning and validation which ultimately led to your final model.*
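
One possible shape for the tuning and validation step is sketched below. It is only an illustration: the data are synthetic stand-ins generated with `make_classification`, and the choice of logistic regression, the grid of `C` values, and the use of accuracy as the score are all assumptions rather than requirements of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine features and three quality classes
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Scaling lives inside the pipeline so it is refit on each training fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical grid over the inverse regularization strength
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# Stratified folds keep the class proportions similar across splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

Keeping the preprocessing inside the pipeline means the fitted `search.best_estimator_` can be applied directly to new data without repeating the scaling by hand.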

## 4. Discussion


*In this section you should provide a general overview of your final model and its performance. You should discuss what the implications of your model are in terms of the included features, predictive performance, and anything else you think is relevant. The target audience for this should be someone who is familiar with the basics of mathematics but not necessarily someone who has taken a postgraduate statistical modeling course. Your goal should be to convince this audience that your model is both accurate and useful. You should also discuss how the potential losses incurred by misclassification differ - i.e. classifying a "poor" wine as "excellent" is likely not the same as misclassifying an "excellent" wine as "good".*
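
One way to make unequal misclassification losses concrete is to weight a confusion matrix by a cost matrix. The sketch below uses made-up counts and made-up costs purely for illustration; neither comes from the wine data, and the relative costs are a judgment you would need to justify.

```python
import numpy as np

labels = ["poor", "good", "excellent"]

# Hypothetical confusion matrix: rows = true class, columns = predicted class
confusion = np.array([
    [50, 10, 0],
    [8, 120, 12],
    [1, 9, 40],
])

# Hypothetical costs: cost[i, j] is the loss when true class i is predicted as class j.
# Calling a "poor" wine "excellent" (cost 5) is penalized more than
# calling an "excellent" wine "good" (cost 1).
cost = np.array([
    [0, 1, 5],
    [1, 0, 1],
    [3, 1, 0],
])

# Element-wise product gives the total loss contributed by each cell
total_cost = int((confusion * cost).sum())
mean_cost = total_cost / confusion.sum()

print(total_cost, round(mean_cost, 3))  # prints: 42 0.168
```

Two models with the same accuracy can have quite different average costs under such a matrix, which is one way to frame the comparison for a non-specialist audience.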

## 5. Model Validation

*We have provided a third csv file called `wine_qual_holdout.csv` which we will be using for assessing the predictive performance of your model. The file provided with the assignment contains **identical** data to `wine_qual_test.csv`; however, after you turn in your notebook we will replace this file with the true holdout data (1000 additional wines not included in the train or test set) and rerun your notebook.*

*The objective of this is twofold. The first is to ensure that your modeling code is reproducible, so that everything can be rerun and "identical" results can be obtained. The second is to obtain a reliable estimate of your final model's predictive performance, which will be compared across all of the projects in the course.*

*You should include a brief write up in this section detailing the performance of your model. In particular, you should discuss the implications of this modeling uncertainty in the context of classifying wine quality.*
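
One possible way to quantify the uncertainty in a holdout estimate is a percentile bootstrap interval for accuracy, sketched below. The helper name `bootstrap_accuracy_interval` and the labels are hypothetical and made up for illustration; in the notebook you would pass `y_holdout` and your model's predictions instead.

```python
import numpy as np

def bootstrap_accuracy_interval(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Return (accuracy, lower, upper) using a percentile bootstrap over observations."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    correct = (y_true == y_pred).astype(float)

    rng = np.random.default_rng(seed)
    # Each row holds the indices of one resample drawn with replacement
    idx = rng.integers(0, len(correct), size=(n_boot, len(correct)))
    boot_acc = correct[idx].mean(axis=1)

    lower, upper = np.quantile(boot_acc, [alpha / 2, 1 - alpha / 2])
    return correct.mean(), lower, upper

# Made-up labels for illustration only (200 observations, 75% correct)
y_true_demo = np.array(
    ["poor", "good", "good", "excellent", "good", "poor", "excellent", "good"] * 25
)
y_pred_demo = np.array(
    ["poor", "good", "poor", "excellent", "good", "good", "excellent", "good"] * 25
)

acc, lo, hi = bootstrap_accuracy_interval(y_true_demo, y_pred_demo)
print(f"accuracy = {acc:.3f}, 95% interval = ({lo:.3f}, {hi:.3f})")
```

A wide interval suggests that differences between models of similar accuracy may not be meaningful on a holdout set of this size.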

In [None]:
wine_holdout = pd.read_csv("wine_qual_holdout.csv")

# Adjust this code as necessary to preprocess the holdout data
X_holdout = wine_holdout.drop('quality', axis=1)
y_holdout = wine_holdout.quality

In [None]:
# This is a placeholder model so the subsequent cell runs
# DELETE this cell once `final_model` or equivalent is defined
# in Section 3.

from sklearn.linear_model import LogisticRegression

X_holdout = X_holdout.select_dtypes(exclude='object')
final_model = LogisticRegression().fit(X_holdout, y_holdout)

In [None]:
# Calculate the confusion matrix for your model
# 
# Change the name of `final_model` to reflect the name of your fitted model object

sklearn.metrics.confusion_matrix(y_holdout, final_model.predict(X_holdout))

In [None]:
# Calculate the classification report for your model
# 
# Change the name of `final_model` to reflect the name of your fitted model object

print(
    sklearn.metrics.classification_report(y_holdout, final_model.predict(X_holdout))
)

In [None]:
# Alternative metrics can be included below