### Principal Component Analysis

We will apply PCA to a dataset of glass samples, as a step toward identifying different types of glass.

In [8]:
import pandas as pd
from sklearn.decomposition import PCA

#### Importing data

In [3]:
df = pd.read_csv('glass.csv')

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 10 columns):
RI      214 non-null float64
Na      214 non-null float64
Mg      214 non-null float64
Al      214 non-null float64
Si      214 non-null float64
K       214 non-null float64
Ca      214 non-null float64
Ba      214 non-null float64
Fe      214 non-null float64
Type    214 non-null int64
dtypes: float64(9), int64(1)
memory usage: 16.8 KB


In [5]:
df.sample(5)

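|     | RI      | Na    | Mg   | Al   | Si    | K    | Ca   | Ba   | Fe   | Type |
|-----|---------|-------|------|------|-------|------|------|------|------|------|
| 62  | 1.52172 | 13.51 | 3.86 | 0.88 | 71.79 | 0.23 | 9.54 | 0.0  | 0.11 | 1    |
| 206 | 1.51645 | 14.94 | 0.0  | 1.87 | 73.11 | 0.0  | 8.67 | 1.38 | 0.0  | 7    |
| 30  | 1.51768 | 12.65 | 3.56 | 1.3  | 73.08 | 0.61 | 8.69 | 0.0  | 0.14 | 1    |
| 124 | 1.52177 | 13.2  | 3.68 | 1.15 | 72.75 | 0.54 | 8.52 | 0.0  | 0.0  | 2    |
| 17  | 1.52196 | 14.36 | 3.85 | 0.89 | 71.36 | 0.15 | 9.15 | 0.0  | 0.0  | 1    |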


#### Preparing the model

Separating predictor and target columns

In [7]:
X = df.drop('Type', axis = 1)
y = df['Type']

Applying PCA over X

In [11]:
pca = PCA()
pca.fit(X)

PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
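Calling `PCA()` with no arguments keeps every component (`n_components=None`), so nothing is discarded at this point. To actually reduce the dimensions, `n_components` can be set. Because PCA is sensitive to feature scale and the columns here have very different ranges (compare `RI` with `Si`), standardizing first is often worth trying. The sketch below is only an illustration: it runs on made-up random data of the same shape as `X`, and both the choice of two components and the use of `StandardScaler` are assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X (214 rows, 9 features). The values are made up;
# swap in the real predictor DataFrame to reproduce this on the glass data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(214, 9))

# Standardize each feature, then keep only the first two principal components.
# n_components=2 is an arbitrary choice for illustration.
reducer = make_pipeline(StandardScaler(), PCA(n_components=2))
X_reduced = reducer.fit_transform(X_demo)

print(X_reduced.shape)  # (214, 2)

# Share of the total variance kept by the two retained components.
retained = reducer.named_steps["pca"].explained_variance_ratio_.sum()
print(retained)
```

With the real data, `retained` shows how much variance survives the reduction, which is the figure to quote when fewer than nine components are kept.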

Checking the components

In [13]:
print(pca.components_)

[[ 9.28126899e-04  1.72248332e-02 -7.23534913e-01  4.63352227e-02
  -7.69381480e-03 -7.84042855e-02  6.79716799e-01  7.63580112e-02
   9.05695250e-04]
 [-1.52290883e-03  3.98797552e-01 -5.43050989e-01  2.58840747e-01
   1.94092491e-01  1.03826640e-01 -6.16724638e-01  2.23545134e-01
  -1.67842645e-02]
 [-1.37689385e-03 -6.54934730e-01 -1.31198879e-01  5.56521411e-02
   6.91951335e-01  2.18565071e-01 -7.87784202e-02 -1.33876425e-01
   7.21253225e-03]
 [ 3.10643441e-04 -3.46599960e-01 -9.86931157e-02  2.70893633e-01
  -5.70087029e-01  6.77700643e-01 -5.39461282e-02  9.71284426e-02
   1.10986100e-02]
 [ 7.12950233e-04 -3.98381798e-01  7.68490459e-02  3.13525755e-01
  -1.03320001e-01 -5.08016303e-01 -6.57426026e-02  6.80657156e-01
   2.67473294e-02]
 [ 1.82174928e-03 -1.55680962e-02 -4.77602532e-02 -7.80387063e-01
   6.02933593e-02  2.65187354e-01 -2.88933651e-02  5.60066050e-01
  -9.36624304e-04]
 [-3.32594524e-04 -3.76900981e-02 -7.49534298e-02 -7.48038453e-02
  -5.87295419e-02 -6.0375866
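Each row of `components_` is one principal axis: a unit-length direction in the original nine-dimensional feature space, with one weight per column of `X`. As a rough sketch of where these rows come from, the same quantities can be obtained from a singular value decomposition of the centered data. The data below is a made-up random stand-in for `X`, and the result matches scikit-learn only up to the sign of each row.

```python
import numpy as np

# Synthetic stand-in for X (214 rows, 9 features); the values are made up.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(214, 9))

# PCA works on centered data: subtract the per-column mean.
X_centered = X_demo - X_demo.mean(axis=0)

# Rows of Vt are the principal axes, ordered by decreasing singular value.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt  # shape (9, 9): one row per component, one column per feature

# Variance captured along each axis, and its share of the total.
explained_variance = S**2 / (X_demo.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

print(components.shape)
print(explained_variance_ratio)
```

The rows are mutually orthogonal, so a large weight in a row (such as the `Mg` and `Ca` entries in the first row printed above) indicates which original features dominate that component.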

Checking the explained variance ratio. This shows the fraction of the total variance captured by each component, and so how much information would be retained if only the leading components were kept.

In [15]:
print(pca.explained_variance_ratio_)

[4.76205247e-01 2.63192760e-01 1.07800432e-01 1.02024637e-01
 3.30672372e-02 1.60477360e-02 1.42743130e-03 2.34365001e-04
 1.53917702e-07]


Taking the sum

In [16]:
print(sum(pca.explained_variance_ratio_))

0.9999999999999998


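Because all nine components were kept (`n_components=None`), the ratios sum to 1 up to floating-point error: no dimensions have been dropped yet, so all of the variance is retained. The first four components alone account for about 95% of the variance, and the first five for about 98%.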
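To decide how many components to keep, the printed ratios can be accumulated and compared against a target share of variance. The sketch below copies the ratios from the output above; the 0.95 threshold is an arbitrary assumption chosen for illustration.

```python
import numpy as np

# Explained variance ratios copied from the printed output above.
ratios = np.array([
    4.76205247e-01, 2.63192760e-01, 1.07800432e-01, 1.02024637e-01,
    3.30672372e-02, 1.60477360e-02, 1.42743130e-03, 2.34365001e-04,
    1.53917702e-07,
])

# Running total: entry k is the variance retained by the first k+1 components.
cumulative = np.cumsum(ratios)
print(cumulative.round(4))

# Smallest number of components whose cumulative share reaches the threshold.
threshold = 0.95  # hypothetical target, not from the original notebook
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print(n_keep)  # 5
```

Four components fall just short of the threshold (about 0.949), so five are needed to reach it. The resulting count could then be passed as `n_components` when refitting.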