## 1. Install libraries

In [None]:
%pip install scikit-learn pandas seaborn numpy matplotlib

In [None]:
from sklearn import datasets
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

## 2. Load the data
Note that PCA requires every feature value to be quantitative (numeric).
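
One way to check this before running PCA is to list the non-numeric columns of a DataFrame. The sketch below uses a small made-up frame (the `color` column is hypothetical and is not part of the wine dataset):

```python
import pandas as pd

# Hypothetical frame with one non-numeric column, for illustration only
frame = pd.DataFrame({
    "alcohol": [14.2, 13.2, 13.1],
    "color": ["red", "white", "red"],
})

# Columns that are not numeric would need to be encoded or dropped before PCA
non_numeric = frame.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)  # ['color']
```

An empty list means every column is numeric and can be passed to the scaler as is.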

In [None]:
wine = datasets.load_wine()

An optional second dataset for you to explore:

In [None]:
#iris = datasets.load_iris()

In [None]:
df = pd.DataFrame(wine['data'], columns = wine['feature_names'])
df.head()

## 3. Standardizing features

In [None]:
# Create a StandardScaler object (from sklearn.preprocessing)
scaler = StandardScaler()
# Scale the data and keep the original feature names as column labels
scaled_data = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
scaled_data.tail()
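
As a rough check of what `StandardScaler` computes, the sketch below standardizes a small made-up array by hand and compares the result with scikit-learn. The array `X` is illustrative only, and the comparison assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data (not the wine dataset): 5 samples, 2 features on different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 200.0],
              [4.0, 500.0],
              [5.0, 400.0]])

# Manual standardization: subtract each column's mean, divide by its standard deviation
manual = (X - X.mean(axis=0)) / X.std(axis=0)  # np.std defaults to ddof=0

# The same operation through scikit-learn
sklearn_scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, sklearn_scaled))
```

After standardizing, every feature has mean 0 and standard deviation 1, so no single feature dominates because of its units.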

### Under the hood: centering, covariance, and eigendecomposition
a. We start by converting the DataFrame into a matrix for manipulation:

In [None]:
dataMatrix = df.values
dataMatrix.shape

We get a 178 x 13 matrix where each row is one of our samples and each column is a feature, just like in the DataFrame.

b. Calculate the mean of each feature:

In [None]:
mean = np.mean(dataMatrix, axis = 0) #the axis setting specifies that we are calculating the mean along each column (feature)
print(mean)

c. Centering our data: 

In [None]:
centeredMat = dataMatrix - mean
centeredMat.shape
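
The subtraction above relies on NumPy broadcasting: the vector of column means is subtracted from every row. A minimal sketch on a made-up 3 x 2 array (the values are illustrative only):

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [6.0, 30.0]])

col_mean = X.mean(axis=0)   # one mean per feature: 3.0 and 20.0
centered = X - col_mean     # broadcasting subtracts the means from every row

# After centering, each column has mean 0 and the shape is unchanged
print(centered)
print(centered.mean(axis=0))
```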

d. Calculate the covariance:

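*_rowvar_* specifies whether each row or each column of the input represents a feature. By default (`rowvar=True`) `np.cov` treats each row as a feature, which is why the centered matrix is transposed below. Setting it to False means that each column represents a feature and each row represents an observation, so `np.cov(centeredMat, rowvar=False)` gives the same result.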

In [None]:
covMat = np.cov(centeredMat.T)
covMat.shape

covMat is a square m x m matrix, where m is the number of features (13 x 13 here).
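
For centered data $X_c$ with $n$ rows, the sample covariance matrix is $C = \frac{1}{n-1} X_c^\top X_c$. The sketch below checks this formula against both ways of calling `np.cov`, using random illustrative data rather than the wine dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))       # illustrative: 50 samples, 3 features
Xc = X - X.mean(axis=0)            # center each feature

n = Xc.shape[0]
manual_cov = Xc.T @ Xc / (n - 1)   # sample covariance from the formula

cov_transposed = np.cov(Xc.T)          # rows are features (default rowvar=True)
cov_rowvar = np.cov(Xc, rowvar=False)  # columns are features

print(np.allclose(manual_cov, cov_transposed))
print(np.allclose(manual_cov, cov_rowvar))
```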

e. Eigen decomposition of the covariance matrix:

In [None]:
eigenvalues, eigenvectors = np.linalg.eig(covMat)

In [None]:
print("values:")
eigenvalues

In [None]:
print("vectors:")
eigenvectors.shape
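
The eigenvalues are not guaranteed to come back sorted, so a typical next step is to order the eigenpairs by decreasing eigenvalue and project the centered data onto the leading eigenvectors. The sketch below shows one way to do that on random illustrative data. It swaps in `np.linalg.eigh` (intended for symmetric matrices such as a covariance matrix) in place of the `np.linalg.eig` call above, and the choice `k = 2` is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative correlated data: 200 samples, 3 features
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# eigh returns real eigenvalues in ascending order for symmetric input
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Reorder so the largest eigenvalue (most variance) comes first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]   # each column is one eigenvector

k = 2                                   # arbitrary number of components to keep
projected = Xc @ eigenvectors[:, :k]    # principal component scores

# The variance of each projected column equals the matching eigenvalue
print(np.allclose(projected.var(axis=0, ddof=1), eigenvalues[:k]))
```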

## 4. Checking correlation between features

In [None]:
sns.heatmap(scaled_data.corr())

## 5. Apply PCA

In [None]:
#Taking no. of Principal Components as 4
pca = PCA(n_components = 4)
pca.fit(scaled_data)
data_pca = pca.transform(scaled_data)
data_pca = pd.DataFrame(data_pca,columns=['PC1','PC2','PC3', 'PC4'])
data_pca.head()

In [None]:
# Checking the correlation between the principal components after PCA
sns.heatmap(data_pca.corr())
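
The heatmap after PCA should show correlation only on the diagonal, because principal components are uncorrelated with each other by construction. A small numeric check, using made-up strongly correlated features rather than the wine data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
base = rng.normal(size=(300, 1))
# Three illustrative features that are all noisy copies of the same signal
X = np.hstack([base + 0.1 * rng.normal(size=(300, 1)) for _ in range(3)])

scores = PCA(n_components=2).fit_transform(X)

corr_before = np.corrcoef(X, rowvar=False)       # large off-diagonal entries
corr_after = np.corrcoef(scores, rowvar=False)   # off-diagonal entries near 0

print(np.round(corr_before, 3))
print(np.round(corr_after, 3))
```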

## 6. How to determine the best number of components to use

In the example above, we arbitrarily chose 4 components. To pick the number more carefully, we can use methods such as cross-validation, the elbow point, or the scree plot. In this notebook, we will use the elbow point method: we plot the cumulative explained variance ratio against the number of components and look for the point where adding more components no longer increases the explained variance noticeably (the same idea Eni taught us, but in the opposite direction).

In [None]:
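# Note: this fits PCA on the unscaled df, so features with large numeric ranges dominate;
# pass scaled_data instead to stay consistent with the PCA applied above
pca = PCA().fit(df)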

*cumsum* calculates the cumulative sum of an array. It returns an array in which each element is the sum of all input elements up to and including that index.

In [None]:
cumulative_variance_ratio = np.cumsum(pca.explained_variance_ratio_)

The explained variance ratio represents the proportion of the total variance in the dataset that is explained by each principal component. When you perform PCA on a dataset, the algorithm computes the principal components, which are ordered by the amount of variance they explain in the data. The explained variance ratio for each principal component is calculated as the ratio of the variance explained by that component to the total variance in the dataset. It gives you an idea of how much information each principal component carries compared to the total information in the dataset.

`pca.explained_variance_ratio_` is an array with one entry per component. When all components are kept, as here, its entries sum to 1.0.
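
As a worked example of the ratio and its cumulative sum, the sketch below starts from made-up eigenvalues (the variances along each component, already sorted in decreasing order). The numbers are chosen for easy arithmetic and do not come from the wine dataset:

```python
import numpy as np

# Made-up eigenvalues, sorted from largest to smallest
eigenvalues = np.array([8.0, 4.0, 2.0, 2.0])

# Each component's share of the total variance: 0.5, 0.25, 0.125, 0.125
explained_variance_ratio = eigenvalues / eigenvalues.sum()

# Running total of the shares: 0.5, 0.75, 0.875, 1.0
cumulative = np.cumsum(explained_variance_ratio)

print(explained_variance_ratio)
print(cumulative)
```

Here the first two components together account for 75% of the variance, and all four account for 100%.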

In [None]:
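# Use 1..n on the x-axis so that each x value is the actual number of components
plt.plot(range(1, len(cumulative_variance_ratio) + 1), cumulative_variance_ratio)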
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('Cumulative Explained Variance Ratio vs. Number of Components')
plt.grid(True)
plt.show()

We see a sharp increase at 2. The explained variance keeps increasing slightly after that and levels off at n = 3. Therefore, we take 3 as the number of principal components to use for this dataset.
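
Reading the elbow off a plot is somewhat subjective. A common complement is to pick the smallest number of components whose cumulative explained variance ratio reaches a chosen threshold. The sketch below is one way to do that; the helper name `components_for_threshold` and the default threshold of 0.95 are assumptions made for illustration, and the example array is made up:

```python
import numpy as np

def components_for_threshold(cumulative_ratio, threshold=0.95):
    """Smallest number of components whose cumulative ratio reaches `threshold`."""
    cumulative_ratio = np.asarray(cumulative_ratio)
    # First index where cumulative_ratio[index] >= threshold (the array is non-decreasing)
    index = int(np.searchsorted(cumulative_ratio, threshold, side="left"))
    # Cap at the total number of components if the threshold is never reached
    return min(index + 1, len(cumulative_ratio))

# Made-up cumulative ratios for four components
example = [0.5, 0.75, 0.875, 1.0]
print(components_for_threshold(example, threshold=0.85))  # 3
```

In the notebook, the same helper could be called on `cumulative_variance_ratio` to compare its answer with the value read from the plot.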