<h1 style="text-align:center">Principal Component Analysis (PCA)<br>[<a href="https://setosa.io/ev/principal-component-analysis/">source</a>]</h1>
<p style="font-size:15px">PCA is one of the most widely used unsupervised learning algorithms and is probably the most popular <strong>dimensionality reduction algorithm</strong>.</p>
<p style="font-size:15px">PCA is used for operations such as:<ul style="font-size:13px">
        <li>Noise filtering
        <li>Visualization
        <li>Feature Extraction
        <li>Stock market predictions
        <li>Gene data analysis
        <li>and more
    </ul>
</p><br>

<p style="font-size:15px">The goal of PCA:<ul style="font-size:15px">
        <li>Identify patterns in data
        <li>Detect correlation between variables<ul>
                <li>If a strong correlation between variables is found, the dimensionality can be reduced, which is what PCA is intended for
                <li>We find the directions of maximum variance in high-dimensional data and then project the data onto a lower-dimensional subspace while retaining most of the information
            </ul>
        <li>Usually, the goal of PCA is to project a d-dimensional dataset onto a k-dimensional subspace, where <i><strong>k</strong></i> is less than <i><strong>d</strong></i> (<i><strong>k < d</strong></i>)
    </ul>
</p><br>
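
<p style="font-size:15px">As a small illustration of the link between correlation and dimensionality reduction, the sketch below builds two strongly correlated synthetic variables and checks how much of the total variance falls along the first principal direction. The data, the seed, and the variable names are assumptions made for this example only; they are not part of the Wine dataset used later.</p>

```python
import numpy as np

# Synthetic data (assumption): b is almost a linear function of a.
rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = 0.9 * a + 0.1 * rng.normal(size=n)
data = np.column_stack([a, b])

# Correlation between the two variables.
corr = np.corrcoef(data, rowvar=False)[0, 1]

# Eigenvalues of the covariance matrix, largest first.
eigvals = np.linalg.eigvalsh(np.cov(data, rowvar=False))[::-1]
explained = eigvals / eigvals.sum()

print(f"correlation: {corr:.3f}")
print(f"share of variance on the first component: {explained[0]:.3f}")
```

<p style="font-size:15px">When the correlation is this strong, nearly all of the variance lies along one direction, so keeping a single component loses very little information.</p>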

<p style="font-size:15px">The main steps of the PCA algorithm are:<ul style="font-size:15px">
        <li>Standardize the data
        <li>Obtain the Eigenvectors and Eigenvalues from the covariance matrix or correlation matrix, or perform Singular Value Decomposition
        <li>Sort Eigenvalues in descending order and choose the <i>k</i> Eigenvectors that correspond to the <i>k</i> largest Eigenvalues where <i>k</i> is the number of dimensions of the new feature subspace (<i>k <= d</i>)
        <li>Construct the projection matrix <i><strong>W</strong></i> from the selected <i>k</i> Eigenvectors
        <li>Transform the original dataset <i><strong>X</strong></i> via <i><strong>W</strong></i> to obtain a k-dimensional feature subspace <i><strong>Y</strong></i>
        <li><a href="https://plot.py/ipython-notebooks/principal-component-analysis/">source</a>
    </ul>
</p><br>
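
<p style="font-size:15px">The steps listed above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the implementation scikit-learn uses: the function name <code>pca_from_scratch</code> and the synthetic input are hypothetical, and the eigendecomposition route is chosen here over SVD only because it maps directly onto the list.</p>

```python
import numpy as np

def pca_from_scratch(X, k):
    """Project X (n_samples x d) onto its first k principal components."""
    # 1. Standardize each feature to zero mean and unit variance.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # 2. Eigendecomposition of the covariance matrix.
    #    eigh is appropriate because the covariance matrix is symmetric.
    cov = np.cov(X_std, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)

    # 3. Sort eigenvalues (and their eigenvectors) in descending order.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # 4. Projection matrix W (d x k) from the top-k eigenvectors.
    W = eigvecs[:, :k]

    # 5. Transform the data: Y = X W (n_samples x k).
    Y = X_std @ W
    return Y, W, eigvals

# Synthetic input (assumption): 100 samples, 5 features, two of them correlated.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)

Y, W, eigvals = pca_from_scratch(X, k=2)
print(Y.shape)  # (100, 2)
```

<p style="font-size:15px">The variance of each column of <code>Y</code> equals the corresponding eigenvalue, which is why sorting the eigenvalues ranks the components by how much information they keep.</p>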

<p style="font-size:18px"><strong>PCA does have a weakness: it is highly sensitive to outliers in the data. Even so, it remains one of the most widely used dimensionality reduction techniques.</strong></p>
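
<p style="font-size:15px">The sensitivity to outliers can be seen in a small experiment. The sketch below uses synthetic two-dimensional data (an assumption for illustration) that varies mostly along the first axis, then adds a single extreme point and compares the first principal direction before and after. The data is only centered, not standardized, so that the effect on the direction is visible; the helper name <code>first_principal_direction</code> is hypothetical.</p>

```python
import numpy as np

def first_principal_direction(data):
    """Return the unit eigenvector of the covariance matrix with the largest eigenvalue."""
    centered = data - data.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    return eigvecs[:, np.argmax(eigvals)]

rng = np.random.default_rng(0)

# 100 points spread along the first axis, with little spread along the second.
clean = np.column_stack([rng.normal(0.0, 1.0, 100), rng.normal(0.0, 0.1, 100)])

# The same points plus one extreme outlier far along the second axis.
with_outlier = np.vstack([clean, [0.0, 50.0]])

pc_clean = first_principal_direction(clean)
pc_outlier = first_principal_direction(with_outlier)

# Angle between the two directions (the sign of an eigenvector is arbitrary).
cosine = np.clip(abs(pc_clean @ pc_outlier), 0.0, 1.0)
angle = np.degrees(np.arccos(cosine))
print(f"first direction without outlier: {np.round(np.abs(pc_clean), 3)}")
print(f"first direction with outlier:    {np.round(np.abs(pc_outlier), 3)}")
print(f"angle between them: {angle:.1f} degrees")
```

<p style="font-size:15px">A single point is enough to rotate the first component by close to 90 degrees here, because that point dominates the variance along the second axis.</p>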

---
<h2>1. Importing the Dataset</h2>

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('../../../../data/clean/Wine.csv')
display(df.head())
x = df.iloc[:, :-1].values  # features: every column except the last
y = df.iloc[:, -1].values   # target: Customer_Segment (the last column)

Unnamed: 0,Alcohol,Malic_Acid,Ash,Ash_Alcanity,Magnesium,Total_Phenols,Flavanoids,Nonflavanoid_Phenols,Proanthocyanins,Color_Intensity,Hue,OD280,Proline,Customer_Segment
0,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065,1
1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050,1
2,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185,1
3,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480,1
4,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735,1


---
<h2>2. Splitting the Dataset</h2>

In [2]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
print("train dataset size : {} observations\ntest dataset size : {} observations".format(x_train.shape[0], x_test.shape[0]))

train dataset size : 142 observations
test dataset size : 36 observations


---
<h2>3. Feature Scaling</h2>

In [3]:
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, so that no test-set statistics leak into it
stand_x = StandardScaler().fit(x_train)
x_ss = stand_x.transform(x_train)
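
The standardization step can be written out by hand to make clear which statistics are reused on the test data. This is a sketch on synthetic data (the arrays `train` and `test` are assumptions, not the Wine split). It relies on the fact that `StandardScaler` scales by the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

# Synthetic stand-ins (assumption): two features on very different scales.
rng = np.random.default_rng(0)
train = rng.normal(loc=[10.0, 500.0], scale=[2.0, 100.0], size=(80, 2))
test = rng.normal(loc=[10.0, 500.0], scale=[2.0, 100.0], size=(20, 2))

# Statistics are computed from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# The same mean and std are applied to both splits.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print("train mean:", np.round(train_scaled.mean(axis=0), 6))
print("train std: ", np.round(train_scaled.std(axis=0), 6))
print("test mean: ", np.round(test_scaled.mean(axis=0), 3))
```

The training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately standardized. That is expected: the test data must be transformed with the training statistics, not its own.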

---
<h2>4. Applying PCA and KernelPCA</h2>

In [4]:
from sklearn.decomposition import PCA, KernelPCA

pca = PCA(n_components=2).fit(x_ss)
kpca = KernelPCA(n_components=2, kernel='rbf').fit(x_ss)
x_pca = pca.transform(x_ss)
x_kpca = kpca.transform(x_ss)
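
One way to judge whether two components are enough is to look at the share of variance they retain. The sketch below is self-contained and therefore assumes that scikit-learn's bundled `load_wine` dataset is a reasonable stand-in for `Wine.csv`; the notebook does not confirm that the two are identical, and the names ending in `_demo` are hypothetical. It also fits on the full dataset rather than on the training split.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled wine data stands in for Wine.csv.
X_demo = load_wine().data
X_demo_scaled = StandardScaler().fit_transform(X_demo)

pca_demo = PCA(n_components=2).fit(X_demo_scaled)

# Fraction of the total variance captured by each retained component.
ratio = pca_demo.explained_variance_ratio_
print("per component:", ratio)
print("total retained:", ratio.sum())
```

If the retained total is low, increasing `n_components` trades a larger feature space for less information loss.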

---
<h2>5. Training the Logistic Regression Model (for Testing Purposes) on the Training Dataset</h2>

In [5]:
from sklearn.linear_model import LogisticRegression

logreg_pca = LogisticRegression(random_state=0)
logreg_kpca = LogisticRegression(random_state=0)
logreg_pca.fit(x_pca, y_train)
logreg_kpca.fit(x_kpca, y_train)

LogisticRegression(random_state=0)

---
<h2>6. Predicting the Test Dataset Results</h2>

In [6]:
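# The test data goes through the same fitted scaler and the same fitted PCA/KernelPCA as the training data
y_pred_pca = logreg_pca.predict(pca.transform(stand_x.transform(x_test)))
y_pred_kpca = logreg_kpca.predict(kpca.transform(stand_x.transform(x_test)))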

pd.DataFrame(data=np.stack((y_test, y_pred_pca, y_pred_kpca), axis=1),
             index=None, columns=['y actual', 'y pca', 'y kernel_pca'],
             copy=False).head(10)

Unnamed: 0,y actual,y pca,y kernel_pca
0,1,1,1
1,3,3,3
2,2,2,2
3,1,1,1
4,2,2,2
5,2,1,2
6,1,1,1
7,3,3,3
8,2,2,2
9,2,2,2


---
<h2>7. Making the Confusion Matrix</h2>

In [7]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, y_pred_pca))
print("\nConfusion matrix result for PCA shows that:\n\t- 14 correct predictions of the customer segment 1\
        \n\t- 15 correct predictions of the customer segment 2\
        \n\t- 6 correct predictions of the customer segment 3")
print("\n"+"="*50+"\n")
print(confusion_matrix(y_test, y_pred_kpca))
print("\nConfusion matrix result for KernelPCA shows that:\n\t- 14 correct predictions of the customer segment 1\
        \n\t- 16 correct predictions of the customer segment 2\
        \n\t- 6 correct predictions of the customer segment 3")

[[14  0  0]
 [ 1 15  0]
 [ 0  0  6]]

Confusion matrix result for PCA shows that:
	- 14 correct predictions of the customer segment 1        
	- 15 correct predictions of the customer segment 2        
	- 6 correct predictions of the customer segment 3


[[14  0  0]
 [ 0 16  0]
 [ 0  0  6]]

Confusion matrix result for KernelPCA shows that:
	- 14 correct predictions of the customer segment 1        
	- 16 correct predictions of the customer segment 2        
	- 6 correct predictions of the customer segment 3
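
The two confusion matrices can be condensed into a single accuracy figure each. The sketch below copies the matrices from the output above and divides the diagonal (correct predictions) by the total number of test observations; the helper name `accuracy_from_confusion` is hypothetical.

```python
import numpy as np

# Confusion matrices copied from the output above (rows: actual, columns: predicted).
cm_pca = np.array([[14, 0, 0],
                   [1, 15, 0],
                   [0, 0, 6]])
cm_kpca = np.array([[14, 0, 0],
                    [0, 16, 0],
                    [0, 0, 6]])

def accuracy_from_confusion(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    return np.trace(cm) / cm.sum()

acc_pca = accuracy_from_confusion(cm_pca)
acc_kpca = accuracy_from_confusion(cm_kpca)
print(f"PCA accuracy:       {acc_pca:.4f}")   # 35/36 = 0.9722
print(f"KernelPCA accuracy: {acc_kpca:.4f}")  # 36/36 = 1.0000
```

The only difference is one segment-2 observation that the PCA model assigns to segment 1. With 36 test observations, a gap of one prediction is too small to conclude that KernelPCA is the better choice in general.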
