# Wine Quality Prediction

**Objective :** Predict the quality of a wine from its physicochemical features.
For this project we will use the UCI Wine Quality dataset: https://archive.ics.uci.edu/ml/datasets/Wine+Quality
The dataset covers red variants of the Portuguese "Vinho Verde" wine.

The columns represent the following features :
<br>
**Input variables (based on physicochemical tests)**

1 - fixed acidity
<br>
2 - volatile acidity
<br>
3 - citric acid
<br>
4 - residual sugar
<br>
5 - chlorides
<br>
6 - free sulfur dioxide
<br>
7 - total sulfur dioxide
<br>
8 - density
<br>
9 - pH
<br>
10 - sulphates
<br>
11 - alcohol

**Output variable (based on sensory data)**

12 - quality (score between 0 and 10)


## Steps :
1. Importing Libraries
2. Exploring the Dataset
3. Exploratory Data Analysis
> * Univariate Analysis
4. Data Preprocessing
5. Model Building
> * Random Forest Classifier
> * Stochastic Gradient Descent Classifier
> * Support Vector Classifier (SVC)
6. Validation
> * Grid Search CV
> * Cross Validation Score
7. Conclusion

## 1. Import Libraries
Import the necessary packages to process or plot the data

### Get the Data

Use pandas to read winequality-red.csv as a dataframe called wine
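
A minimal sketch of the loading step. So that it runs without the file, it reads a few illustrative rows from an in-memory string; with the real data you would pass the path `winequality-red.csv` instead. The `csv_text` name is hypothetical, and some copies of the file are semicolon-delimited, in which case `sep=";"` would be needed.

```python
import io

import pandas as pd

# A few illustrative rows standing in for winequality-red.csv (assumption).
csv_text = (
    "fixed acidity,volatile acidity,alcohol,quality\n"
    "7.4,0.70,9.4,5\n"
    "7.8,0.88,9.8,5\n"
    "11.2,0.28,9.8,6\n"
)

# With the real file this would be: wine = pd.read_csv("winequality-red.csv")
# (add sep=";" if your copy is semicolon-delimited).
wine = pd.read_csv(io.StringIO(csv_text))
print(wine.shape)
```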

## 2. Exploring the Dataset

**Preview the first few rows of the data**
<br>
Use the head() method

**Check information about the columns**
<br>
Use the info() method
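
As a rough illustration of these two calls, the sketch below builds a tiny stand-in dataframe (the values are made up) and inspects it. The `describe()` call is an optional extra, not part of the steps above, for a quick numeric summary.

```python
import pandas as pd

# Small illustrative frame standing in for the wine dataframe (assumption).
wine = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 10.5],
    "pH": [3.51, 3.20, 3.26, 3.16],
    "quality": [5, 5, 6, 7],
})

first_rows = wine.head(3)   # first 3 rows of the dataframe
print(first_rows)

wine.info()                 # prints column dtypes and non-null counts

summary = wine.describe()   # optional: count, mean, std, min, quartiles, max
print(summary)
```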

## 3. Exploratory Data Analysis

### Univariate Analysis

Let's do some data visualization! Feel free to use whatever library you want. 

Note - Directions for a few plots are given below. We encourage you to explore further for more insights into the data!

**Goal : Observe and identify trends in quality as we vary the feature values**
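
One plot-free way to check such trends, sketched here on made-up values, is to average each feature per quality score. The numbers are purely illustrative and say nothing about the real dataset.

```python
import pandas as pd

# Made-up values for illustration only (assumption).
wine = pd.DataFrame({
    "quality": [5, 5, 6, 6, 7, 7],
    "alcohol": [9.4, 9.6, 10.0, 10.4, 11.0, 11.6],
    "volatile acidity": [0.70, 0.66, 0.50, 0.54, 0.40, 0.36],
})

# Mean of every feature for each quality score.
trend = wine.groupby("quality").mean()
print(trend)
```

The same comparison can be drawn as a bar plot, for example with seaborn's `barplot(x="quality", y="alcohol", data=wine)`, if you prefer a visual.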

In [1]:
#Observe whether, and how, fixed acidity helps to distinguish wine quality.



In [2]:
#Observe the trend in volatile acidity as quality increases



In [3]:
#Study whether citric acid content rises as the quality of the wine increases



In [4]:
#Study whether residual sugar content rises as the quality of the wine increases



In [5]:
#Observe how chloride content changes as the quality of the wine increases



In [6]:
#Observe how free sulfur dioxide content changes as the quality of the wine increases



In [7]:
#Observe how total sulfur dioxide content changes as the quality of the wine increases



In [8]:
#Observe sulphates level with respect to varying quality of wine



In [9]:
#Find out the effect of alcohol level on quality



## 4. Data Preprocessing

Notice that the quality column holds scores on a scale from 0 to 10. <br>
**We need to divide the samples into the categories 'good' and 'bad' according to a self-defined cutoff.**

We will do this by using pd.cut as follows : <br>
**data_column = pd.cut(data_column, bins=bins, labels=labels)**


In [10]:
bins = (2, 6.5, 8)

#create a list named group_names containing 2 strings, in this order : 'bad' and 'good' (labels are assigned to the bins from lowest to highest)


#Use pd.cut() on the quality column and set bins to bins and labels to group_names 
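
A minimal sketch of the binning, using the `bins` tuple from the cell above on a handful of made-up scores. Labels are matched to the intervals from lowest to highest, so `'bad'` has to come first.

```python
import pandas as pd

# A few made-up quality scores (assumption).
quality = pd.Series([3, 5, 6, 7, 8])

bins = (2, 6.5, 8)              # intervals (2, 6.5] and (6.5, 8]
group_names = ["bad", "good"]   # lowest interval first

binned = pd.cut(quality, bins=bins, labels=group_names)
print(binned.tolist())
```

Passing `bins` and `labels` by keyword matters here, because the third positional parameter of `pd.cut` is `right`, not `labels`.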



Now let's assign numeric labels to our quality variable using **LabelEncoder**

In [11]:
#Import LabelEncoder and create an instance named label_quality



In [12]:
#Use .fit_transform method to fit label_quality to the 'quality' column and return encoded labels
#Bad becomes 0 and good becomes 1 
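
To see why bad maps to 0 and good to 1, here is a small sketch on a hypothetical list of labels. `LabelEncoder` sorts the class names, so the alphabetically first one receives code 0.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for the binned 'quality' column.
labels = ["bad", "good", "bad", "good", "good"]

label_quality = LabelEncoder()
encoded = label_quality.fit_transform(labels)

print(list(label_quality.classes_))  # class names in encoded order
print(encoded.tolist())              # integer code for each label
```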



In [13]:
#Use .value_counts method on the 'quality' column to find out category size for both.
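
A short sketch of the class-size check on made-up encoded labels. The `normalize=True` variant is an optional addition that reports shares instead of counts, which makes class imbalance easier to read.

```python
import pandas as pd

# Made-up encoded labels: 0 = bad, 1 = good (assumption).
quality = pd.Series([0, 0, 0, 0, 1, 0, 1, 0])

counts = quality.value_counts()                # absolute size of each class
share = quality.value_counts(normalize=True)   # fraction of each class

print(counts)
print(share)
```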



In [14]:
#Print out a countplot for the 'quality' column



**Now separate the dataset into the response variable and the feature variables**

In [15]:
#Set the 'quality' column to y
#Drop the 'quality' column from the dataframe and set the remaining dataframe to X



**Train Test Split**

In [16]:
#Import train_test_split


#Split the data with parameter test_size = 0.2
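
The separation and the split can be sketched together as follows. The dataframe is random stand-in data, and `random_state=42` is an assumption added only to make the split reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random stand-in data with two features and a binary label (assumption).
rng = np.random.default_rng(0)
wine = pd.DataFrame({
    "alcohol": rng.normal(10.4, 1.0, 50),
    "pH": rng.normal(3.3, 0.15, 50),
    "quality": rng.integers(0, 2, 50),
})

y = wine["quality"]               # response variable
X = wine.drop("quality", axis=1)  # feature variables

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```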



**Let's apply standard scaling to get optimized results**

In [17]:
#import StandardScaler


#Create an instance of StandardScaler called sc


#Use .fit_transform on sc for X_train, then .transform (not .fit_transform) for X_test, so the scaler is fitted on the training data only
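
A sketch of the scaling step on random stand-in arrays. It follows the common practice of fitting the scaler on the training split only and reusing those statistics for the test split, so that no information from the test set leaks into preprocessing. The `_scaled` variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random stand-in feature matrices (assumption).
rng = np.random.default_rng(0)
X_train = rng.normal(10.0, 2.0, size=(40, 3))
X_test = rng.normal(10.0, 2.0, size=(10, 3))

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learn mean/std from training data
X_test_scaled = sc.transform(X_test)        # apply the same mean/std to test data

print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
```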



## 5. Model Building

### Random Forest Classifier

**Import RandomForestClassifier**

**Create an instance of RandomForestClassifier() called rfc and fit it to the training data.**

**Create predictions from the test set and name the result pred_rfc**

**Let's see how our model performed!**

In [18]:
#create a classification report and confusion matrix
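
The whole Random Forest step might look roughly like this. It uses synthetic data from `make_classification` in place of the wine features, and `n_estimators=100` and `random_state=0` are assumptions chosen for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary problem with 11 features, standing in for the wine data.
X, y = make_classification(n_samples=300, n_features=11, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

rfc = RandomForestClassifier(n_estimators=100, random_state=0)
rfc.fit(X_train, y_train)
pred_rfc = rfc.predict(X_test)

# Per-class precision/recall/F1, then the matrix of true vs predicted labels.
print(classification_report(y_test, pred_rfc))
cm = confusion_matrix(y_test, pred_rfc)
print(cm)
```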


Note the accuracy

### Stochastic Gradient Descent Classifier

In [19]:
#Import SGDClassifier


#Create an instance of SGDClassifier() called sgd and fit it to the training data.


#Create predictions from the test set and name the result pred_sgd
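
A comparable sketch for the SGD classifier, again on synthetic stand-in data. `random_state=0` is an assumption, and `accuracy_score` is used here as a quick single-number check. SGD is sensitive to feature scaling, which is one reason for the standardization step earlier.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=11, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

sgd = SGDClassifier(random_state=0)
sgd.fit(X_train, y_train)
pred_sgd = sgd.predict(X_test)

acc_sgd = accuracy_score(y_test, pred_sgd)  # fraction of correct predictions
print(acc_sgd)
```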



In [20]:
#create a classification report and confusion matrix



Note the accuracy

### Support Vector Classifier

In [21]:
#Import SVC


#Create an instance of SVC() called svc and fit it to the training data.


#Create predictions from the test set and name the result pred_svc
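
The SVC step follows the same pattern. This sketch uses synthetic stand-in data and the default `SVC()` settings; `svc.score` returns the mean accuracy on the data passed to it.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=11, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

svc = SVC()                      # default settings
svc.fit(X_train, y_train)
pred_svc = svc.predict(X_test)

acc_svc = svc.score(X_test, y_test)  # mean accuracy on the test split
print(acc_svc)
```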



In [22]:
#create a classification report and confusion matrix



Note the accuracy

## 6. Validation
Let's try to improve the accuracy of our models

### Grid Search CV for SVC

In [23]:
#import GridSearchCV


#Finding best parameters for our SVC model
param = {
    'C': [0.1,0.8,0.9,1,1.1,1.2,1.3,1.4],
    'kernel':['linear', 'rbf'],
    'gamma' :[0.1,0.8,0.9,1,1.1,1.2,1.3,1.4]
}

#Create an instance of GridSearchCV() called grid_svc, passing svc as the estimator parameter and param as the param_grid parameter



In [24]:
#fit grid_svc to the training data



In [25]:
#Use grid_svc.best_params_ to find the best parameters for our svc model
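
The grid search can be sketched as below on synthetic stand-in data. The grid is deliberately smaller than the one above so the example runs quickly, and `scoring="accuracy"` and `cv=5` are assumptions rather than values taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in training data (assumption).
X_train, y_train = make_classification(
    n_samples=200, n_features=11, random_state=0
)

# Reduced grid with the same keys as the full grid (assumption, for speed).
param = {
    "C": [0.1, 1, 1.4],
    "kernel": ["linear", "rbf"],
    "gamma": [0.1, 1],
}

svc = SVC()
grid_svc = GridSearchCV(estimator=svc, param_grid=param, scoring="accuracy", cv=5)
grid_svc.fit(X_train, y_train)

print(grid_svc.best_params_)  # best combination found on the training folds
print(grid_svc.best_score_)   # its mean cross-validated accuracy
```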



In [26]:
#Let's run our SVC again with the best parameters. Create a new instance called svc2 with the above found parameters


#fit svc2 to the training data


#Create predictions from the test set and name the result pred_svc2


#create a classification report
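
A sketch of refitting with tuned parameters. The values in `best_params` are hypothetical placeholders; in the notebook they would come from `grid_svc.best_params_`. The data is synthetic stand-in data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=11, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical values; replace with grid_svc.best_params_ from your own search.
best_params = {"C": 1.2, "gamma": 0.9, "kernel": "rbf"}

svc2 = SVC(**best_params)   # unpack the dict into keyword arguments
svc2.fit(X_train, y_train)
pred_svc2 = svc2.predict(X_test)

report = classification_report(y_test, pred_svc2)
print(report)
```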


Observe if there is any improvement in the accuracy of the SVC

### Cross Validation Score for RFC


In [27]:
#import cross_val_score


#Call cross_val_score with rfc as the estimator on the training data and store the returned array of scores in rfc_eval


#Calculate the mean of rfc_eval. This is your cross-validated accuracy
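
A minimal sketch of this step on synthetic stand-in data. `cross_val_score` is a function that returns one score per fold; `cv=10` and `n_estimators=50` are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in training data (assumption).
X_train, y_train = make_classification(
    n_samples=200, n_features=11, random_state=0
)

rfc = RandomForestClassifier(n_estimators=50, random_state=0)

# One accuracy score per fold; the model is refitted for every fold.
rfc_eval = cross_val_score(estimator=rfc, X=X_train, y=y_train, cv=10)
mean_acc = rfc_eval.mean()

print(rfc_eval)
print(mean_acc)
```

The mean is an estimate of how the model generalizes, averaged over folds, rather than a change to the model itself.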



Observe if there is any improvement in the accuracy of rfc

## 7. Conclusion

Compare the overall performance of the various models and the impact of the cross-validation techniques