<hr style="margin-bottom: 40px;">

<img src="diabetes.png"
    style="width:300px; float: right; margin: 0 40px 40px 40px;"></img>
    
    
# Wine Quality - Principal Component Analysis Project

This dataset is related to red variants of the Portuguese "Vinho Verde" wine. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistical issues, only physicochemical (input) and sensory (output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

The dataset can be treated as a classification or a regression task. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones).

This dataset is also available from the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality. I just shared it to Kaggle for convenience. (If I am mistaken and the public license type disallows this, I will take it down if requested.)

### Content

For more information, read [Cortez et al., 2009].

Input variables (based on physicochemical tests):

1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
6. free sulfur dioxide
7. total sulfur dioxide
8. density
9. pH
10. sulphates
11. alcohol

Output variable (based on sensory data):

12. quality (score between 0 and 10)

### Importing libraries.

In [3]:
import numpy as np
import matplotlib.pyplot as plt 
import pandas as pd 
import warnings
warnings.filterwarnings('ignore') 

### Importing dataset.

In [4]:
dataset=pd.read_csv('wine_quality.csv')
dataset.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


**Creating matrix of features** 

In [7]:
X=dataset.iloc[:,0:11].values
y=dataset.iloc[:,11].values
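
The positional slices above rely on `quality` being the twelfth column. As a minimal sketch, the same split can be made by column name, which does not break if the columns are reordered. The two rows below are copied from the `dataset.head()` output and stand in for the full file; the names `sample`, `target`, `X_named` and `y_named` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Two rows copied from the dataset.head() output, used only as a stand-in
# for the full CSV so that the snippet runs on its own.
sample = pd.DataFrame(
    [
        [7.4, 0.70, 0.0, 1.9, 0.076, 11.0, 34.0, 0.9978, 3.51, 0.56, 9.4, 5],
        [7.8, 0.88, 0.0, 2.6, 0.098, 25.0, 67.0, 0.9968, 3.20, 0.68, 9.8, 5],
    ],
    columns=[
        "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
        "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
        "pH", "sulphates", "alcohol", "quality",
    ],
)

target = "quality"

# Select by name: every column except the target is a feature.
X_named = sample.drop(columns=[target]).to_numpy()
y_named = sample[target].to_numpy()

# Positional selection as used in the notebook, for comparison.
X_pos = sample.iloc[:, 0:11].values
```

Both approaches give the same feature matrix here; the name-based version just makes the intent explicit.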

In [21]:
columns = dataset.columns
columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

**Splitting data into training and test sets**

In [8]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=0) 

**Feature Scaling**

In [9]:
from sklearn.preprocessing import StandardScaler 
sc_X=StandardScaler()
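# Fit the scaler on the training data only, then reuse its mean and
# standard deviation for the test data so no test information leaks in.
X_train=sc_X.fit_transform(X_train)
X_test=sc_X.transform(X_test)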
#X_train
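
A common convention is to learn the scaling statistics from the training split only and to apply those same statistics to the test split. The sketch below illustrates this on synthetic data; the arrays `train` and `test` are randomly generated placeholders, not the wine data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the wine dataset).
rng = np.random.default_rng(0)
train = rng.normal(loc=10.0, scale=2.0, size=(200, 3))
test = rng.normal(loc=10.0, scale=2.0, size=(50, 3))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean_ and scale_ from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# The same test transformation written out by hand.
manual = (test - train.mean(axis=0)) / train.std(axis=0)
```

The scaled training columns have mean zero by construction, while the scaled test columns are only approximately centred, which is expected when the test set is treated as unseen data.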

In [10]:
X_train.shape

(1279, 11)

**Applying PCA**

In [11]:
from sklearn.decomposition import PCA
pca=PCA(n_components=None)
X_train=pca.fit_transform(X_train)
X_test=pca.transform(X_test)
explained_variance=pca.explained_variance_ratio_

In [12]:
explained_variance

array([0.28263119, 0.17942214, 0.1358536 , 0.10862763, 0.08692731,
       0.06008452, 0.0533286 , 0.03926827, 0.03166857, 0.01667793,
       0.00551024])

The first three principal components together explain 0.283 + 0.179 + 0.136 ≈ 0.598 of the variance, or roughly 60%, so we will keep the first three components.
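
The running total can also be computed directly instead of adding the ratios by hand. This is a small sketch that reuses the `explained_variance_ratio_` values printed above; the 90% threshold is an arbitrary illustration, not a choice made in the notebook.

```python
import numpy as np

# Values copied from the explained_variance output above.
explained_variance = np.array([
    0.28263119, 0.17942214, 0.1358536, 0.10862763, 0.08692731,
    0.06008452, 0.0533286, 0.03926827, 0.03166857, 0.01667793,
    0.00551024,
])

# Cumulative share of variance explained by the first k components.
cumulative = np.cumsum(explained_variance)
first_three = cumulative[2]

# Smallest number of components whose cumulative share reaches a threshold
# (0.90 is a hypothetical target used only for illustration).
threshold = 0.90
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
```

With these values `first_three` is about 0.598, and seven components are needed to pass the 90% mark.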

In [13]:
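from sklearn.decomposition import PCA
# X_train and X_test already hold the 11 PCA scores from In [11], so this
# second fit effectively keeps the first three of them (up to sign).
pca=PCA(n_components=3)
X_train=pca.fit_transform(X_train)
X_test=pca.transform(X_test)
explained_variance=pca.explained_variance_ratio_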

In [14]:
pca.explained_variance_ratio_

array([0.28263119, 0.17942214, 0.1358536 ])

In [23]:
X.shape

(1599, 11)

In [15]:
np.identity(X.shape[1])

array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])

In [16]:
components=pca.transform(np.identity(X.shape[1]))

In [17]:
dataset.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

In [24]:
pd.DataFrame(
    components,columns=['pc_1','pc_2','pc_3'],
    index=['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar','chlorides',
           'free sulfur dioxide', 'total sulfur dioxide', 'density','pH', 'sulphates', 'alcohol'])

| | pc_1 | pc_2 | pc_3 |
|---|---|---|---|
| fixed acidity | 1.0 | -1.111091e-17 | -2.361069e-17 |
| volatile acidity | 0.0 | 1.0 | 1.403834e-15 |
| citric acid | 0.0 | -1.57577e-15 | 1.0 |
| residual sugar | 0.0 | 2.282708e-16 | 4.055193e-16 |
| chlorides | 0.0 | 3.618395e-16 | 1.223289e-16 |
| free sulfur dioxide | 0.0 | 4.332171e-17 | -1.808515e-16 |
| total sulfur dioxide | 0.0 | -2.128862e-16 | 5.41755e-16 |
| density | 0.0 | -3.876705e-17 | 8.803196e-17 |
| pH | 0.0 | 3.343016e-17 | 4.878793e-17 |
| sulphates | 0.0 | 7.6532e-17 | -8.570531e-17 |
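
The table is close to an identity matrix because the PCA in In [13] was fitted on `X_train` after In [11] had already rotated it into principal-component space. Its rows therefore correspond to the first-pass components rather than to the original features, so the feature names in the index are not meaningful here. To see how each original feature contributes to each component, one option is to read `components_` from a PCA fitted once on the scaled data. The sketch below shows both situations on synthetic data; the feature names `f1` to `f4` and all variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic, correlated stand-in data with hypothetical feature names.
rng = np.random.default_rng(0)
names = ["f1", "f2", "f3", "f4"]
raw = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
scaled = StandardScaler().fit_transform(raw)

# Fit once on the scaled features: components_ has shape
# (n_components, n_features), so its transpose is features x components.
pca_once = PCA(n_components=3).fit(scaled)
loadings = pd.DataFrame(
    pca_once.components_.T,
    index=names,
    columns=["pc_1", "pc_2", "pc_3"],
)

# Fit a second PCA on data that is already in component space, as the
# notebook does: the result is (up to sign) the first three unit vectors.
scores = PCA(n_components=None).fit_transform(scaled)
pca_twice = PCA(n_components=3).fit(scores)
refit = pca_twice.components_.T
```

In `loadings`, each column is a unit-length direction expressed in the original features, whereas `refit` reproduces the near-identity pattern seen in the notebook output.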


In [25]:
X_train.shape

(1279, 3)