<h1 style="text-align:center">CS 212 - Introduction to Programming for Analysts - Spring 2021</h1>
<h1 style="text-align:center">Machine Learning Case Study - Wine Quality Evaluation - 40 Points</h1>

# Objectives
Upon completion of this programming exercise, students will:
* Describe when clustering, regression or classification is the appropriate analysis technique
* Use scikit-learn to perform clustering, regression and classification


# Description
In this case study, you will provide recommendations to leading grape growers and grape farmers who produce the finest wines in the world. Every grape grower wants to find the recipe for the perfect wine. Of course, the definition of the "perfect" wine is very subjective, so we can use wine quality reviews to help our grape growers craft the best wine. Our dataset comes from the [University of California, Irvine Machine Learning Data Repository](https://archive.ics.uci.edu/ml/datasets/wine).

***

# Gather and Prepare Data

### Import the Relevant Libraries

In [1]:
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Loading the Data
**Q1 (1 point)** Load `winequality-red.csv` using pandas and set the `delimiter` parameter to a semicolon (`;`). Assign the resulting DataFrame to the variable `df_red`. Do the same for `winequality-white.csv` and assign the DataFrame to the variable `df_white`. Inspect the first five rows of `df_red`.
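As a rough illustration of the `delimiter` parameter, the sketch below reads semicolon-separated text with pandas. The two inline rows and the name `df_sample` are made-up stand-ins so the snippet runs on its own; in the notebook you would pass the CSV file name to `pd.read_csv` instead of the `StringIO` object.

```python
import io
import pandas as pd

# Made-up sample in the same semicolon-separated layout as the wine files.
sample_csv = io.StringIO(
    '"fixed acidity";"volatile acidity";"quality"\n'
    "7.4;0.70;5\n"
    "7.8;0.88;5\n"
)

# delimiter=";" tells pandas that fields are separated by semicolons.
df_sample = pd.read_csv(sample_csv, delimiter=";")

# head() returns the first five rows (only two exist in this sample).
print(df_sample.head())
```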

### Concatenate Red and White DataFrames
**Q2 (1 point)** Both `df_red` and `df_white` have the same columns, so when putting these two DataFrames together we can use pandas' `concat` function. Concatenate `df_red` and `df_white` together and assign the resulting DataFrame to the variable `df`.
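A small sketch of how `pd.concat` stacks two DataFrames that share the same columns. The frames `red` and `white` are hypothetical miniatures, and `ignore_index=True` is an optional choice (not required by the question) that renumbers the rows.

```python
import pandas as pd

# Hypothetical miniature frames with identical columns.
red = pd.DataFrame({"pH": [3.5, 3.2], "quality": [5, 6]})
white = pd.DataFrame({"pH": [3.0, 3.3, 3.1], "quality": [6, 6, 7]})

# Stack the rows of both frames; ignore_index=True renumbers rows from 0.
combined = pd.concat([red, white], ignore_index=True)
print(combined.shape)  # (5, 2)
```

Without `ignore_index=True`, each frame keeps its own row labels, so the combined index would contain duplicates.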

# Choose, Train and Evaluate Model
### k-Means Clustering
We will use k-means clustering to try to see how many groups of wine we have. What do these groups represent?

Assign a `copy` of the DataFrame (`df`) to the variable `X`. This will be our features for our k-means clustering model.

**Q3 (2 points)** Create ten `KMeans` objects using 1 to 10 clusters, `fit` each model using `X`, read the `inertia_` attribute of each fitted model and save the values to a list called `wcss`. Perform the Elbow Method by plotting the Within-Cluster-Sum-of-Squares (`wcss`) against the number of clusters.
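The Elbow Method loop can be sketched as follows. This is only an illustration on synthetic data from `make_blobs` (an assumption, standing in for the wine features); `n_init=10`, `random_state=0` and the `Agg` backend are also assumed settings chosen so the snippet is repeatable and runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so this runs outside a notebook
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for X: 300 points drawn around 3 centers.
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=0)

wcss = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(X_demo)
    wcss.append(model.inertia_)  # within-cluster sum of squares for this k

# Plot WCSS against the number of clusters and look for the "elbow".
plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("WCSS")
plt.title("Elbow Method")
```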

It looks like either 2 or 3 clusters will be the best choice for this dataset. **REMEMBER:** more clusters will reduce the WCSS, but it will make the model more challenging to interpret. 

**Q4 (2 points)** Take a moment to pause and think about how you, a human, would categorize this wine dataset. Type your answer in the Markdown cell below.

**Q5 (2 points)** Let's create a k-means model with 2 clusters. Create a `copy` of `df` and assign it to the variable `clusters`. Create a new column in `clusters` called `cluster_pred` containing the cluster category predictions for each of our observations (`fit_predict(X)`).
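A minimal sketch of storing cluster predictions in a new column. The data and the names `df_demo`, `X_demo` and `clusters_demo` are hypothetical; `n_init` and `random_state` are assumed settings for repeatability.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic two-group data standing in for the wine DataFrame.
points, _ = make_blobs(n_samples=100, centers=2, random_state=1)
df_demo = pd.DataFrame(points, columns=["feature_a", "feature_b"])

X_demo = df_demo.copy()         # features used for fitting
clusters_demo = df_demo.copy()  # separate copy that receives the predictions

kmeans = KMeans(n_clusters=2, n_init=10, random_state=1)
# fit_predict fits the model and returns one cluster label per row.
clusters_demo["cluster_pred"] = kmeans.fit_predict(X_demo)

print(clusters_demo["cluster_pred"].value_counts())
```

Working on copies keeps the prediction column out of the feature matrix, so the labels are never fed back into the model by accident.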

**Q6 (2 points)** Now, we want to see which feature(s) are most important for categorizing our dataset into 2 clusters. Create a `pairplot` of our `clusters` DataFrame using the `cluster_pred` for `hue`.

Take a look at the plot above and see which row or column or combination of the 2 best separates the data into 2 groups. 

**Q7 (1 point)** Enter the row number that separates the data into two groups cleanly.

In [2]:
row = 0  # placeholder: replace 0 with the row number you chose
print('{} is the best feature to split the data into 2 groups.'.format(X.columns[row]))

Take a look at [this website](https://www.piwine.com/use-and-measurement-of-sulfur-dioxide-in-wine.html) to learn a little more about *why* wines would be separated into 2 groups based on total sulfur dioxide. **HINT:** take a look specifically at the third bullet in the section "Some General Points to Consider:".

**Q8 (2 points)** Based on the article link above, what do you think our two groups represent? Enter your answer in the Markdown cell below.

**Q9 (2 points)** Let's check our suspicion that our k-means with 2 clusters used total sulfur dioxide to split our dataset into red and white wines. Add a new column called `type` to `df_red` that contains the label "red". Create a new column called `type` in `df_white` that contains the label "white". Concatenate the two DataFrames together again and call the resulting DataFrame `df2`.
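Adding a constant label column before concatenating can be sketched like this. The frames `red` and `white` and their values are hypothetical, and `ignore_index=True` is an optional choice.

```python
import pandas as pd

# Hypothetical miniature frames.
red = pd.DataFrame({"total sulfur dioxide": [34.0, 67.0]})
white = pd.DataFrame({"total sulfur dioxide": [170.0, 132.0, 97.0]})

# Assigning a single string broadcasts it to every row of the frame.
red["type"] = "red"
white["type"] = "white"

labeled = pd.concat([red, white], ignore_index=True)
print(labeled["type"].value_counts())
```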

**Q10 (5 points)** Create a figure containing 2 subplots that are scatter plots:
1. `total sulfur dioxide` and `fixed acidity` columns within our DataFrame `df2` with the `hue` as `type`
2. `total sulfur dioxide` and `fixed acidity` columns within our DataFrame `clusters` with the `hue` as `cluster_pred`
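A side-by-side figure of this kind could be laid out as in the sketch below. The six rows in `demo` are invented values, both label columns live in one frame only to keep the example short, and the `Agg` backend, figure size and titles are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so this runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented rows; in the notebook the two plots draw from df2 and clusters.
demo = pd.DataFrame({
    "total sulfur dioxide": [34, 67, 54, 170, 132, 97],
    "fixed acidity": [7.4, 7.8, 11.2, 7.0, 6.3, 8.1],
    "type": ["red", "red", "red", "white", "white", "white"],
    "cluster_pred": [0, 0, 0, 1, 1, 1],
})

# One row, two columns of axes; each scatter plot is drawn on its own axis.
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.scatterplot(data=demo, x="total sulfur dioxide", y="fixed acidity",
                hue="type", ax=axes[0])
axes[0].set_title("Actual wine type")
sns.scatterplot(data=demo, x="total sulfur dioxide", y="fixed acidity",
                hue="cluster_pred", ax=axes[1])
axes[1].set_title("k-means cluster")
```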

We might be able to conclude that if a wine has more total sulfur dioxide and lower fixed acidity, there is a greater likelihood the wine is white.

### Linear Regression
We will use linear regression now to try to predict the `quality` based on the other features in our data.

**Q11 (2 points)** Divide our DataFrame `df` into the features (`X`) and the targets (`y`). Check out the `shape` of `X` and `y`, as well as the `columns` attribute of `X` and the `name` attribute of `y`.
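One way to separate features from a target column is sketched below on a made-up frame (`demo`, `X_demo` and `y_demo` are hypothetical names).

```python
import pandas as pd

demo = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2],
    "pH": [3.51, 3.20, 3.16, 3.30],
    "quality": [5, 5, 6, 7],
})

X_demo = demo.drop(columns="quality")  # DataFrame with every column except the target
y_demo = demo["quality"]               # Series holding the target

print(X_demo.shape, y_demo.shape)  # (4, 2) (4,)
print(X_demo.columns)
print(y_demo.name)                 # quality
```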

**Q12 (2 points)** Split our `X` and `y` data into the training (`X_train`, `y_train`) and testing (`X_test`, `y_test`) sets using the `train_test_split()` function.
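A short sketch of `train_test_split` on toy arrays. The `test_size=0.2` and `random_state=42` values are assumptions chosen for illustration; the question does not prescribe them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(40).reshape(20, 2)  # 20 toy observations, 2 features
y_demo = np.arange(20)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (16, 2) (4, 2)
```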

**Q13 (2 points)** Create a `LinearRegression` object and fit the model using `X_train` and `y_train`. Evaluate the model using the `score` method on our `LinearRegression` object with the parameters `X_test` and `y_test`.
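The fit-then-score pattern can be sketched as follows on synthetic data from `make_regression` (an assumption, standing in for the wine features); the variable names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the wine data.
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=10.0,
                                 random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

reg = LinearRegression()
reg.fit(X_tr, y_tr)

# For regressors, score() returns R^2 on the given data.
r2 = reg.score(X_te, y_te)
print(round(r2, 3))
```

An R² of 1.0 means perfect predictions, and values near 0 (or below) mean the model does little better than always predicting the mean.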

### Classification
It did not seem like Linear Regression produced a good model for predicting quality, so we will use classification to try to classify the wine quality instead.

**NOTE: We need to change the maximum number of iterations for our Logistic Regression object. Add `max_iter=5000` as a parameter to your object instantiation.**

**Q14 (2 points)** In order to create a more accurate model, our grape growers agreed that wines with quality 3-5 are categorized as "good", wines with quality 6-8 are "great" and wines that scored a 9 are "amazing". Create a function called `wine_quality` that accepts a wine quality observation (3-9) and returns the mapped value based on the table below:

| `quality` | Grape Grower Assessment | Mapped Value |
| :----: | :----: | :----: |
| 3 - 5 | Good | 0 | 
| 6 - 8 | Great | 1 | 
| 9 | Amazing | 2 | 
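One possible shape for such a mapping function is sketched below. It assumes the input is always an integer from 3 to 9 and does no validation of values outside that range.

```python
def wine_quality(quality):
    """Map a raw quality score (3-9) to 0 (good), 1 (great) or 2 (amazing)."""
    if quality <= 5:
        return 0
    elif quality <= 8:
        return 1
    return 2  # assumption: anything above 8 is treated as "amazing"

print([wine_quality(q) for q in range(3, 10)])  # [0, 0, 0, 1, 1, 1, 2]
```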

**Q15 (2 points)** Create a copy of `df` and assign it to `df2`. Then `apply` the `wine_quality` function to the `quality` column in our DataFrame `df2`. Assign the resulting Series to overwrite our column `quality` in `df2`.
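Applying a function to one column and overwriting that column can be sketched like this. The function `to_label` is a hypothetical stand-in for your own mapping function, and `demo` holds made-up rows.

```python
import pandas as pd

def to_label(quality):
    # Hypothetical stand-in for the mapping function: 3-5 -> 0, 6-8 -> 1, 9 -> 2.
    return 0 if quality <= 5 else (1 if quality <= 8 else 2)

demo = pd.DataFrame({"pH": [3.5, 3.2, 3.3], "quality": [4, 7, 9]})

demo_mapped = demo.copy()  # leave the original frame untouched
# apply() calls to_label once per value and returns a new Series.
demo_mapped["quality"] = demo_mapped["quality"].apply(to_label)

print(demo_mapped["quality"].tolist())  # [0, 1, 2]
```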

**Q16 (2 points)** Divide our DataFrame `df2` into the features (`X`) and the targets (`y`). Check out the `shape` of `X` and `y`, as well as the `columns` attribute of `X` and the `name` attribute of `y`.

**Q17 (2 points)** Split our `X` and `y` data into the training (`X_train`, `y_train`) and testing (`X_test`, `y_test`) sets using the `train_test_split()` function.

**Q18 (2 points)** Iterate through values of K to find the best KNN model. In each iteration, create a `KNeighborsClassifier` object where `n_neighbors` = K, `fit` the model using `X_train` and `y_train`, and find the prediction accuracy (`score`) using `X_test` and `y_test`. Append the prediction accuracy of each model to a list and output the list.
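The search over K can be sketched as below. The data comes from `make_classification` (an assumption, standing in for the wine features), and the range of 1 to 10 neighbors is an assumed choice since the question leaves the range open.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic classification problem standing in for the wine data.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

accuracies = []
for k in range(1, 11):  # assumed range of K values
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_tr, y_tr)
    accuracies.append(knn.score(X_te, y_te))  # fraction of correct test predictions

# List position 0 corresponds to K = 1, so add 1 to recover K.
best_k = accuracies.index(max(accuracies)) + 1
print(accuracies)
print(best_k)
```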

**Q19 (2 points)** Create a `LogisticRegression` object and `fit` the model using `X_train` and `y_train`. Find the prediction accuracy (`score`) using `X_test` and `y_test`. 
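A minimal sketch of fitting and scoring a `LogisticRegression` classifier with a raised iteration limit. The data is synthetic (from `make_classification`, an assumption) and the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic classification problem standing in for the wine data.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# max_iter=5000 gives the solver more iterations to converge.
log_reg = LogisticRegression(max_iter=5000)
log_reg.fit(X_tr, y_tr)

# For classifiers, score() returns the mean accuracy on the given data.
accuracy = log_reg.score(X_te, y_te)
print(round(accuracy, 3))
```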

**Q20 (2 points)** Which model would you choose and why would you choose it?