# Assignment 3. Wine Quality Prediction

The goal of this assignment is to develop a regression model that predicts a wine's quality score.

The dataset consists of 11 predictor variables and one target variable, `quality`. Predictor variables are listed below:

- fixed_acidity (fixed acidity)
- volatile_acidity (volatile acidity)
- critic_acid (citric acid)
- sugar (residual sugar)
- chloride (chlorides)
- free_sulfer_dioxide (free sulfur dioxide)
- total_sulfer_dioxide (total sulfur dioxide)
- density (density)
- acidity (acidity, pH)
- sulfate (sulfates)
- alcohol (alcohol)

Can you build a machine learning model to accurately predict the quality scores of given wines?

First, let's load the data. The training and test data files are located in the same folder.
- training data: `wine_quality_train.csv` (4,000 samples)
- test data: `wine_quality_test.csv` (898 samples) 

In [6]:
import pandas as pd

df = pd.read_csv('wine_quality_train.csv')
df.head()

Unnamed: 0,fixed_acidity,volatile_acidity,critic_acid,sugar,chloride,free_sulfer_dioxide,total_sulfer_dioxide,density,acidity,sulfate,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


In [7]:
df.columns

Index(['fixed_acidity', 'volatile_acidity', 'critic_acid', 'sugar', 'chloride',
       'free_sulfer_dioxide', 'total_sulfer_dioxide', 'density', 'acidity',
       'sulfate', 'alcohol', 'quality'],
      dtype='object')

Define `X` and `y`. Here, `X` and `y` refer to the input and output of our regression models.

In [10]:
X = df.iloc[:, :-1] # or, df[['fixed_acidity', 'volatile_acidity', 'critic_acid', 'sugar', 'chloride','free_sulfer_dioxide', 'total_sulfer_dioxide', 'density', 'acidity', 'sulfate', 'alcohol', ]]
y = df.iloc[:, -1]  # or, df['quality']

## Problem 1. Check the average quality value

Check the average value of the wine quality in the training data (`y` in the cell above).

In [None]:
# your code here
# ...
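
As a hedged illustration of the kind of call involved, the sketch below computes the mean of a pandas Series. The `quality_demo` values are made up for the example and are not the assignment data; with the real data the same method would be called on `y`.

```python
import pandas as pd

# Hypothetical toy scores, NOT the assignment data.
quality_demo = pd.Series([6, 6, 5, 7, 6], name="quality")

# Series.mean() returns the arithmetic mean of the values.
mean_quality = quality_demo.mean()
print(mean_quality)  # 6.0
```

The mean is also a useful reference point later on: a model that always predicts the average score is the simplest baseline to compare an RMSE against.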

## Problem 2. Train Regressors

First, split the data into training and validation sets. The held-out validation set lets you detect overfitting, because it measures performance on data the model was not trained on.

In [11]:
from sklearn.model_selection import train_test_split

X_trn, X_val, y_trn, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
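
To make the effect of `test_size=0.3` concrete, here is a small sketch on placeholder arrays with the same shape as the training file (4,000 rows, 11 predictors). The `_demo` arrays are assumptions for illustration only and contain no real wine data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays shaped like the training data (NOT the real values).
X_demo = np.zeros((4000, 11))
y_demo = np.zeros(4000)

X_trn_demo, X_val_demo, y_trn_demo, y_val_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# 30% of 4,000 rows go to validation, the remaining 70% to training.
print(X_trn_demo.shape, X_val_demo.shape)  # (2800, 11) (1200, 11)
```

Fixing `random_state` makes the split reproducible, so validation scores from different runs can be compared.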

Apply feature normalization so that all features are on a comparable scale. The scaler is fit on the training split only and then applied to both splits.

In [12]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_trn)
X_trn_norm = scaler.transform(X_trn)
X_val_norm = scaler.transform(X_val)
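
The sketch below illustrates what this step does, using made-up features on very different scales. All `_demo` names and the chosen means and spreads are assumptions for the example. It shows that the training columns end up with mean 0 and standard deviation 1, while the validation columns are transformed with the training statistics and are therefore only approximately standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two hypothetical features on very different scales (NOT the wine data).
X_trn_demo = rng.normal(loc=[1.0, 140.0], scale=[0.003, 40.0], size=(200, 2))
X_val_demo = rng.normal(loc=[1.0, 140.0], scale=[0.003, 40.0], size=(50, 2))

# Fit on the training split only, then reuse the same statistics everywhere.
scaler_demo = StandardScaler().fit(X_trn_demo)
X_trn_demo_norm = scaler_demo.transform(X_trn_demo)
X_val_demo_norm = scaler_demo.transform(X_val_demo)

print(X_trn_demo_norm.mean(axis=0))  # approximately 0 for every column
print(X_trn_demo_norm.std(axis=0))   # approximately 1 for every column
print(X_val_demo_norm.mean(axis=0))  # near 0, but generally not exactly 0
```

Scaling matters most for the lasso, whose penalty treats all coefficients alike. Tree-based models are largely insensitive to feature scale.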

Train decision tree, random forest, and lasso regressors using the normalized training data. Then check the training and validation performance of each model using the root mean squared error (RMSE).
* The choice of model parameters (e.g., `max_depth` in decision tree, `n_estimators` in random forest, `alpha` in lasso) is up to you :) Note that the models have default values (in scikit-learn: `max_depth=None`, `n_estimators=100`, `alpha=1.0`). I recommend searching over several parameter candidates to find a good model with low validation error!

In [None]:
# your code here
# ...
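
One possible shape for such a search is sketched below on synthetic data from `make_regression`, not on the wine data. The candidate settings, the `rmse` helper, and all `_demo` names are hypothetical choices for illustration. The synthetic features are already on a common scale, so the scaling step is omitted here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the wine data): 500 samples, 11 features.
X_demo, y_demo = make_regression(n_samples=500, n_features=11, noise=10.0, random_state=0)
X_trn_demo, X_val_demo, y_trn_demo, y_val_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Hypothetical candidate settings; the values are illustrative only.
candidates = {
    "tree(max_depth=3)": DecisionTreeRegressor(max_depth=3, random_state=0),
    "tree(max_depth=8)": DecisionTreeRegressor(max_depth=8, random_state=0),
    "forest(n_estimators=50)": RandomForestRegressor(n_estimators=50, random_state=0),
    "lasso(alpha=0.01)": Lasso(alpha=0.01),
    "lasso(alpha=1.0)": Lasso(alpha=1.0),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_trn_demo, y_trn_demo)
    scores[name] = {
        "train": rmse(y_trn_demo, model.predict(X_trn_demo)),
        "val": rmse(y_val_demo, model.predict(X_val_demo)),
    }
    print(f"{name}: train RMSE={scores[name]['train']:.2f}, val RMSE={scores[name]['val']:.2f}")

# Pick the candidate with the lowest validation error, not the lowest training error.
best_name = min(scores, key=lambda name: scores[name]["val"])
print("lowest validation RMSE:", best_name)
```

A large gap between training and validation RMSE for the same model is the usual sign of overfitting, which is why both numbers are recorded.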

## Problem 3. Feature importance

Remember that linear regression, decision tree, and random forest models provide information about which input features are important for predicting the outcome.

In our wine quality prediction problem, which features are important? Explore your models and describe your findings.

In [None]:
# your code here
# ...
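
As a hedged sketch of where this information lives in scikit-learn: fitted tree-based models expose `feature_importances_`, and fitted linear models such as lasso expose `coef_`. The example uses made-up data in which `feat_a` drives the target most strongly; the feature names, coefficients, and `alpha` value are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Hypothetical data: feat_a has the strongest effect, feat_c has none.
X_demo = pd.DataFrame(rng.normal(size=(400, 3)), columns=["feat_a", "feat_b", "feat_c"])
y_demo = 3.0 * X_demo["feat_a"] - 1.0 * X_demo["feat_b"] + rng.normal(scale=0.1, size=400)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)
lasso = Lasso(alpha=0.05).fit(X_demo, y_demo)

# Impurity-based importances: non-negative and summing to 1.
forest_imp = pd.Series(forest.feature_importances_, index=X_demo.columns)
forest_imp = forest_imp.sort_values(ascending=False)

# Linear coefficients: the sign gives the direction, the magnitude the strength.
lasso_coef = pd.Series(lasso.coef_, index=X_demo.columns)
lasso_rank = lasso_coef.abs().sort_values(ascending=False)

print(forest_imp)
print(lasso_coef)
```

Coefficient magnitudes are only comparable across features when the inputs are on the same scale, which is one reason to fit linear models on the normalized data.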

## Problem 4. Check model performance on new data (test performance)

This month, 898 new wine samples arrived in our shop, so you want to check how your model performs on them.
The wine quality labels (answers) for the new wines were obtained from experienced wine critics in the shop.

The test data file, `wine_quality_test.csv`, is located in the same folder.

In [13]:
df_test = pd.read_csv('wine_quality_test.csv')
df_test

Unnamed: 0,fixed_acidity,volatile_acidity,critic_acid,sugar,chloride,free_sulfer_dioxide,total_sulfer_dioxide,density,acidity,sulfate,alcohol,quality
0,6.4,0.24,0.49,5.8,0.053,25.0,120.0,0.99420,3.01,0.98,10.5,6
1,6.4,0.25,0.57,1.0,0.062,21.0,122.0,0.99238,3.00,0.40,9.5,5
2,6.1,0.25,0.48,15.8,0.052,25.0,94.0,0.99782,3.07,0.45,9.2,6
3,6.8,0.14,0.35,1.5,0.047,40.0,117.0,0.99111,3.07,0.72,11.1,6
4,6.5,0.38,0.26,5.2,0.042,33.0,112.0,0.99067,3.06,0.50,12.3,7
...,...,...,...,...,...,...,...,...,...,...,...,...
893,6.2,0.21,0.29,1.6,0.039,24.0,92.0,0.99114,3.27,0.50,11.2,6
894,6.6,0.32,0.36,8.0,0.047,57.0,168.0,0.99490,3.15,0.46,9.6,5
895,6.5,0.24,0.19,1.2,0.041,30.0,111.0,0.99254,2.99,0.46,9.4,6
896,5.5,0.29,0.30,1.1,0.022,20.0,110.0,0.98869,3.34,0.38,12.8,7


Define `X` and `y` for the test data.

In [None]:
X_tst = df_test.iloc[:, :-1]
y_tst = df_test.iloc[:, -1]

Don't forget to normalize the input data `X_tst` using `scaler` obtained from the training data!

In [None]:
X_tst_norm = scaler.transform(X_tst)

Test your regressors on the test data and compute the RMSE scores of the models obtained in Problem 2.

In [None]:
# your code here
# ...
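
The evaluation pattern can be sketched as follows, again on synthetic stand-in data rather than the wine files. The `_demo` names, the two example models, and their parameter values are hypothetical. The key point is that the scaler and the models are fit on training data only and are merely applied to the test data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the wine data), split into "training" and "new" samples.
X_all, y_all = make_regression(n_samples=600, n_features=11, noise=10.0, random_state=1)
X_trn_demo, y_trn_demo = X_all[:450], y_all[:450]
X_tst_demo, y_tst_demo = X_all[450:], y_all[450:]

# Fit the scaler and the models on the training data only.
scaler_demo = StandardScaler().fit(X_trn_demo)
X_trn_demo_norm = scaler_demo.transform(X_trn_demo)
fitted_models = {
    "tree": DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_trn_demo_norm, y_trn_demo),
    "lasso": Lasso(alpha=0.1).fit(X_trn_demo_norm, y_trn_demo),
}

# Transform (never re-fit) the test inputs, then score every fitted model.
X_tst_demo_norm = scaler_demo.transform(X_tst_demo)
test_rmse = {
    name: float(np.sqrt(mean_squared_error(y_tst_demo, model.predict(X_tst_demo_norm))))
    for name, model in fitted_models.items()
}
print(test_rmse)
```

Comparing these test scores with the validation scores from model selection indicates how well the validation estimate carried over to unseen samples.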