# Data Science Assignment
By: Kevin Fang, 2021
- This task should take less than 30 minutes. If it is taking you longer, just submit what you have!

Some code taken from https://www.kaggle.com/vishalyo990/prediction-of-quality-of-wine

Data is from the UCI Wine Quality Dataset: 
https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/

### Importing & Preprocessing
First, we will import some important libraries and do some preprocessing of the data:

In [128]:
# import pandas and numpy, two important libraries
import pandas as pd
import numpy as np

In [81]:
# useful helper functions for splitting the data and scoring a model
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score
# train_test_split divides the data into "training" and "testing" portions, 
# so you don't test the model on data it has already seen

In [39]:
# load the spreadsheet
wine = pd.read_csv("winequality-red.csv", delimiter=";")

In [40]:
# convert quality ratings into "bad", "okay", and "good" ratings (encoded as 1, 2, and 3)
group_names = [1, 2, 3]
bins = (2, 5, 6, 8)
wine['quality'] = pd.cut(wine['quality'], bins = bins, labels = group_names)
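
To see what the binning above does, here is a minimal sketch on a hypothetical list of raw scores (`raw_scores` is made up for illustration and is not part of the assignment data). It assumes the default right-inclusive intervals of `pd.cut`, so scores 3 to 5 become 1, a score of 6 becomes 2, and scores 7 to 8 become 3.

```python
import pandas as pd

# Hypothetical raw quality scores, for illustration only
raw_scores = pd.Series([3, 4, 5, 6, 7, 8])

# Same bin edges and labels as the cell above.
# Intervals are right-inclusive by default:
# (2, 5] -> 1 (bad), (5, 6] -> 2 (okay), (6, 8] -> 3 (good)
labels = pd.cut(raw_scores, bins=[2, 5, 6, 8], labels=[1, 2, 3])

print(list(labels))  # [1, 1, 1, 2, 3, 3]
```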

In [129]:
# examine the dataset
wine.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 1 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 1 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 1 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 2 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 1 |


In [42]:
# split the data into the features and the result.
X = wine.drop('quality', axis=1)
y = wine['quality']

In [46]:
# print out our list of wine qualities
print(list(y))

[1, 1, 1, 2, 1, 1, 1, 3, 3, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 2, 3, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 2, 1, 1, 3, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3, 1, 3, 1, 1, 1, 2, 3, 3, 1, 1, 3, 2, 2, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 2, 1, 3, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 3, 3, 2, 1, 1, 2, 2, 2, 2, 1, 1, 2, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 3, 1, 3, 2, 2, 2, 2, 1, 1, 1, 2, 2, 2, 3, 3, 2, 3, 1, 3, 1, 1, 2, 2, 3, 1, 3, 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, 2, 3, 2, 3, 1, 1, 2, 2, 2, 3, 1, 2, 1, 2, 2, 2, 

In [44]:
# count the number of instances in each class
np.bincount(y)[1:]

array([744, 638, 217])

The output above shows that there are **744** bad wines, **638** okay wines, and **217** good wines.
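
If the `[1:]` slice in the counting cell looks mysterious, this small sketch may help. It uses a hypothetical array `toy_labels` with the same 1/2/3 encoding: `np.bincount` always starts counting at 0, so the first slot is dropped because no wine has the label 0.

```python
import numpy as np

# Hypothetical labels using the same 1/2/3 encoding as the wine data
toy_labels = np.array([1, 1, 2, 3, 1, 2])

# bincount returns counts for the values 0, 1, 2, 3;
# slicing with [1:] drops the unused slot for 0
counts = np.bincount(toy_labels)[1:]

print(counts)  # [3 2 1]
```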

Now, we must do some preprocessing to help the ML model understand the data. We didn't go over this during the talk, so you're not expected to understand what is going on here. In short, we rescale every feature to a comparable range so that features with very large or very small values do not throw off our classifier.

In [31]:
from sklearn.preprocessing import StandardScaler

In [9]:
scaler = StandardScaler()
X = scaler.fit_transform(X)
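
As a rough illustration of what the scaler does, the sketch below applies `StandardScaler` to a tiny hypothetical array (`toy_X`, not the wine data) whose two columns differ by a factor of 1000. After scaling, each column has mean 0 and standard deviation 1, so neither column dominates.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales
toy_X = np.array([[1.0, 1000.0],
                  [2.0, 2000.0],
                  [3.0, 3000.0]])

toy_scaled = StandardScaler().fit_transform(toy_X)

# Each column is shifted to mean 0 and rescaled to standard deviation 1
print(toy_scaled.mean(axis=0))
print(toy_scaled.std(axis=0))
```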

### Train-Test-Split
Next, we perform the **train test split**. This is extremely important, as we do not want to test the model on data it has already seen!

The `train_test_split` function takes two arrays (`X` and `y`) and an optional `test_size` argument, set here to `0.2`. This allocates 80% of the data to the training set and 20% to the testing set. The function returns a list of four arrays, which we unpack into `X_train, X_test, y_train, y_test` for you.

In [47]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

In [61]:
X_train.shape

(1279, 11)

In [62]:
X_test.shape

(320, 11)
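
The same split can be tried on a toy example to see where the shapes come from. This is a sketch on hypothetical arrays (`toy_X`, `toy_y`), not the wine data. With `test_size=0.2`, the number of test rows is rounded up, which is why the 1599 wines above split into 1279 training rows and 320 testing rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

a_train, a_test, b_train, b_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

print(a_train.shape, a_test.shape)  # (8, 2) (2, 2)

# No sample appears in both the training and testing portions
print(set(b_train) & set(b_test))  # set()
```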

Your task here is to explore the different classifiers on [the sklearn website](https://scikit-learn.org/stable/supervised_learning.html) and try to get the highest accuracy you can.
### MODIFY THE CODE BELOW
You shouldn't have to modify stuff above this unless you really know what you're doing and want to get the highest accuracy possible. Once you have an accuracy score >= 0.7 (70%), you can go ahead and submit what you have.

- It might be difficult to immediately get 70% by only trying different classifiers, so you might want to modify some parameters of the model. For example, things you can change for KNN include: `n_neighbors`, `weights`, `algorithm`, `leaf_size`, etc. Just try different values until you get one that works! Maybe try `n_neighbors = 5, 10, 100`. Or if you use a [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=randomforestclassifier#sklearn.ensemble.RandomForestClassifier), try different values for `n_estimators`. 
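
One way to try several parameter values at once is a small loop. The sketch below is only an illustration of the pattern: it runs on synthetic stand-in data from `make_classification` rather than the wine dataset, and the names `demo_X`, `demo_y`, and `results` are made up for this example. The accuracies it prints say nothing about what you will get on the wine data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with 11 features and 3 classes (NOT the wine dataset)
demo_X, demo_y = make_classification(
    n_samples=500, n_features=11, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42
)

# Fit one KNN model per candidate value and record its test accuracy
results = {}
for k in (5, 10, 100):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_tr, y_tr)
    results[k] = accuracy_score(y_te, model.predict(X_te))

for k, acc in results.items():
    print(f"n_neighbors={k}: accuracy={acc:.3f}")
```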

In [119]:
# Some classifiers that you may want to try out. Look at the sklearn website for more to explore!
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier

In [112]:
# Write code to initialize a classifier here (should just be one line). 
# Refer to the tutorial notebooks if you need help!

In [114]:
# Write code to fit the classifier on X_train and y_train here (should just be one line). 
# Refer to the tutorial notebooks if you need help!

Test your model with the code below:

In [115]:
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.7
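
For reference, accuracy is simply the fraction of predictions that match the true labels. A minimal sketch with hypothetical labels (`true_labels` and `predicted` are made up for illustration):

```python
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted classes for five wines
true_labels = [1, 2, 3, 1, 2]
predicted = [1, 2, 1, 1, 3]

# Three of the five predictions match, so the accuracy is 3 / 5
score = accuracy_score(true_labels, predicted)

print(score)  # 0.6
```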

Once you achieve an accuracy greater than or equal to 0.7, please click `File > Download as > HTML (.html)` and submit it [here](https://forms.gle/B6589DfHFVuLAkPw7).

### Extra Credit:
- Use cross validation to make sure your results are valid across all parts of the data!
- Use a Grid Search to find the best hyperparameters! There's a good tutorial [here](https://www.mygreatlearning.com/blog/gridsearchcv/) if you're interested.
- Try different classifiers and see how high of an accuracy you can get!

### Code to do a 10-Fold Cross Validation:

In [120]:
clf = KNeighborsClassifier()
cv_scores = cross_val_score(clf, X, y, cv=10)

In [121]:
cv_scores.mean()

0.47779088050314467
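
`cross_val_score` returns one score per fold, so it can be useful to look at the spread as well as the mean. The sketch below shows this on synthetic stand-in data (`demo_X` and `demo_y` are generated with `make_classification` and are not the wine data), so the numbers it prints are only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with 11 features and 3 classes (NOT the wine dataset)
demo_X, demo_y = make_classification(
    n_samples=300, n_features=11, n_informative=5, n_classes=3, random_state=0
)

# One accuracy value per fold: each fold is held out once for testing
fold_scores = cross_val_score(KNeighborsClassifier(), demo_X, demo_y, cv=10)

print(fold_scores.round(3))
print(f"mean={fold_scores.mean():.3f}, std={fold_scores.std():.3f}")
```

A large standard deviation suggests the score depends heavily on which part of the data is held out.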

### Starter code to do a Grid Search:
- This may be very slow: the grid below tries 18 parameter combinations with 10 folds each, for 180 fits in total.

In [125]:
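# GridSearchCV is not imported in the cells above, so import it here
from sklearn.model_selection import GridSearchCV

# candidate hyperparameter values to try
param = {
    'n_estimators': [1, 10, 50, 100, 200, 1000],
    'max_depth': [5, 10, 20]
}
clf = RandomForestClassifier()
# evaluate every combination with 10-fold cross validation, scored by accuracy
grid_svc = GridSearchCV(clf, param_grid=param, scoring='accuracy', verbose=True, cv=10, n_jobs=5)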

In [126]:
grid_svc.fit(X_train, y_train)

Fitting 10 folds for each of 18 candidates, totalling 180 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 180 out of 180 | elapsed:  1.3min finished


GridSearchCV(cv=10, estimator=RandomForestClassifier(),
             param_grid={'max_depth': [5, 10, 20],
                         'n_estimators': [1, 10, 50, 100, 200, 1000]},
             scoring='accuracy', verbose=True)
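
Once a grid search has been fitted, the winning settings can be read from its `best_params_`, `best_score_`, and `best_estimator_` attributes. The sketch below shows this on a deliberately small grid and on synthetic stand-in data so that it runs quickly. The names `demo_X`, `demo_y`, `small_grid`, and `search` are made up for this example and are not part of the starter code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with 11 features and 3 classes (NOT the wine dataset)
demo_X, demo_y = make_classification(
    n_samples=200, n_features=11, n_informative=5, n_classes=3, random_state=0
)

# A much smaller grid than the starter code, chosen only to keep this quick
small_grid = {"n_estimators": [10, 50], "max_depth": [5, 10]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=small_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(demo_X, demo_y)

# Best parameter combination and its mean cross-validated accuracy
print(search.best_params_)
print(f"{search.best_score_:.3f}")

# By default the best model is refit on all the data passed to fit()
best_model = search.best_estimator_
```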