## Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality reduction technique widely used in statistics and machine learning. It transforms high-dimensional data into a lower-dimensional representation while retaining as much of the original variance as possible. It does this by finding the principal components: mutually orthogonal linear combinations of the original features, ordered by how much variance each one captures.

Here's a simplified explanation of PCA with an example:

Example:
Consider a dataset with two features, "Height" and "Weight," measured for a group of individuals. Each data point is represented as (height, weight). We want to apply PCA to reduce this dataset from two dimensions to one.

### Data Preparation:

- Collect a dataset with measurements for each individual: (height1, weight1), (height2, weight2), ..., (heightN, weightN).
- Organize the data into a matrix, where each row corresponds to an individual and each column corresponds to a feature.

### Standardization:

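Standardize the data by subtracting the mean and dividing by the standard deviation for each feature. PCA is driven by variance, so without this step a feature measured on a larger numeric scale would dominate the components simply because of its units. After standardization, every feature has mean 0 and standard deviation 1.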
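
The standardization step can be sketched with NumPy as follows. The height and weight values are made up for illustration and are not part of the original example.

```python
import numpy as np

# Hypothetical toy data: rows are individuals, columns are
# (height in cm, weight in kg). These numbers are illustrative only.
X = np.array([
    [160.0, 55.0],
    [170.0, 65.0],
    [175.0, 72.0],
    [180.0, 80.0],
    [165.0, 60.0],
])

# Column-wise mean and standard deviation. NumPy's default (ddof=0) is the
# population form, which is also what scikit-learn's StandardScaler uses.
mean = X.mean(axis=0)
std = X.std(axis=0)

X_std = (X - mean) / std
print(X_std)
```

Each column of `X_std` now has mean 0 and standard deviation 1, so neither feature dominates because of its units.
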
### Compute Covariance Matrix:

Calculate the covariance matrix of the standardized data. For two features this is a 2 × 2 symmetric matrix: the diagonal holds each feature's variance, and the off-diagonal entries hold the covariance between height and weight.
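
A minimal sketch of the covariance step, again on made-up height and weight values. It uses `np.cov`, which normalizes by n − 1 by default, and checks the result against the explicit matrix product.

```python
import numpy as np

# Hypothetical (height in cm, weight in kg) measurements, illustrative only.
X = np.array([[160.0, 55.0], [170.0, 65.0], [175.0, 72.0],
              [180.0, 80.0], [165.0, 60.0]])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# rowvar=False tells NumPy that columns (not rows) are the variables.
cov = np.cov(X_std, rowvar=False)

# Equivalent explicit form: the columns of X_std have zero mean, so the
# sample covariance is X_std^T X_std / (n - 1).
n = X_std.shape[0]
cov_manual = X_std.T @ X_std / (n - 1)

print(cov)
```
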
### Eigenvalue Decomposition:

Find the eigenvalues and eigenvectors of the covariance matrix. The eigenvectors give the directions of the principal components, and each eigenvalue is the variance of the data along its eigenvector.
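
One way to carry out this step is `np.linalg.eigh`, which is designed for symmetric matrices such as a covariance matrix. The sketch below reuses the same made-up data as an assumption.

```python
import numpy as np

# Hypothetical (height in cm, weight in kg) measurements, illustrative only.
X = np.array([[160.0, 55.0], [170.0, 65.0], [175.0, 72.0],
              [180.0, 80.0], [165.0, 60.0]])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(X_std, rowvar=False)

# eigh returns real eigenvalues in ascending order and the matching
# unit-length eigenvectors as the columns of eigvecs.
eigvals, eigvecs = np.linalg.eigh(cov)

for value, vector in zip(eigvals, eigvecs.T):
    print(f"eigenvalue {value:.4f}, eigenvector {vector}")
```

The eigenvalues sum to the trace of the covariance matrix, that is, to the total variance of the standardized data.
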
### Select Principal Components:

Sort the eigenvectors by their corresponding eigenvalues in descending order and keep the first k. The eigenvectors with the largest eigenvalues are the principal components that capture the most variance in the data.
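
The sorting and selection can be sketched as below. The variable names and the choice `k = 1` are assumptions for this two-feature toy example.

```python
import numpy as np

# Hypothetical (height in cm, weight in kg) measurements, illustrative only.
X = np.array([[160.0, 55.0], [170.0, 65.0], [175.0, 72.0],
              [180.0, 80.0], [165.0, 60.0]])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X_std, rowvar=False))

# Indices that sort the eigenvalues from largest to smallest.
order = np.argsort(eigvals)[::-1]
eigvals_sorted = eigvals[order]
eigvecs_sorted = eigvecs[:, order]

# Fraction of the total variance captured by each component.
explained_ratio = eigvals_sorted / eigvals_sorted.sum()

# Keep the top k components (k = 1 is an illustrative choice).
k = 1
W = eigvecs_sorted[:, :k]

print(explained_ratio)
```
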
### Projection:

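Project the standardized data onto the selected principal components to obtain a lower-dimensional representation. In matrix terms, multiply the data matrix by the matrix whose columns are the selected eigenvectors.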
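
The projection is a single matrix product. The sketch below keeps only the first principal component of the same made-up data, reducing each individual from two numbers to one.

```python
import numpy as np

# Hypothetical (height in cm, weight in kg) measurements, illustrative only.
X = np.array([[160.0, 55.0], [170.0, 65.0], [175.0, 72.0],
              [180.0, 80.0], [165.0, 60.0]])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X_std, rowvar=False))

# eigh sorts eigenvalues in ascending order, so the last column is the
# eigenvector with the largest eigenvalue (PC1).
W = eigvecs[:, [-1]]

# Project: (n_samples, 2) @ (2, 1) -> (n_samples, 1)
X_pca = X_std @ W

print(X_pca.shape)
```

The sample variance of the projected values equals the largest eigenvalue, which is the sense in which PC1 captures the most variance.
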

Interpretation:

- The first principal component (PC1) represents the direction of maximum variance in the data.
- The second principal component (PC2) is orthogonal to PC1 and represents the next highest variance.
- Each principal component is a linear combination of the original features.

Application:

By selecting a subset of the principal components, you can effectively reduce the dimensionality of the data while retaining most of the information. This is particularly useful in cases where the original dataset has a large number of correlated features, and you want to simplify the representation while minimizing information loss.

In [1]:
import pandas as pd

df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/'
                      'machine-learning-databases/wine/wine.data',
                      header=None)

df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue',
                   'OD280/OD315 of diluted wines', 'Proline']

df_wine.head()

|   | Class label | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |


In [2]:
from sklearn.model_selection import train_test_split

X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [3]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

In [4]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

## KNeighborsClassifier
### Implement a KNN model in scikit-learn using a Euclidean distance metric:

In [5]:
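# Fit a 3-nearest-neighbors classifier on the standardized training data.
# The default metric (Minkowski with p=2) is the Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_std, y_train)
y_pred = knn.predict(X_test_std)
# accuracy_score expects (y_true, y_pred)
print(accuracy_score(y_test, y_pred))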

0.9629629629629629


In [6]:
X_train_std.shape

(124, 13)

## PCA

In [7]:
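from sklearn.decomposition import PCA

# Keep the first 3 principal components. Fit on the training data only,
# then apply the same projection to the test data.
pca = PCA(n_components=3)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)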

In [8]:
X_train_pca.shape

(124, 3)
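
To see how much of the variance three components retain, one option is to inspect the fitted model's `explained_variance_ratio_` attribute. The sketch below is self-contained and, as an assumption, loads the Wine data through `sklearn.datasets.load_wine` (scikit-learn's bundled copy) rather than the CSV URL used above.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: load_wine stands in for the CSV download used earlier.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)

pca = PCA(n_components=3).fit(X_train_std)

# Fraction of total variance captured by each kept component,
# and the running total across the three components.
ratios = pca.explained_variance_ratio_
cumulative = ratios.cumsum()

print(ratios)
print(cumulative)
```

The last value of `cumulative` is the share of the training-set variance that survives the reduction from 13 features to 3.
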

In [9]:
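# Same classifier, now trained on the 3 principal components
# instead of all 13 standardized features.
knn_pca = KNeighborsClassifier(n_neighbors=3)
knn_pca.fit(X_train_pca, y_train)
y_pred = knn_pca.predict(X_test_pca)
print(accuracy_score(y_test, y_pred))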

1.0
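
Both accuracies above come from a single train/test split. One possible way to check whether the difference holds more generally is cross-validation over pipelines, sketched below. The use of `load_wine`, `make_pipeline`, and 5 folds are assumptions for illustration, not part of the original notebook.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: load_wine stands in for the CSV download used earlier.
X, y = load_wine(return_X_y=True)

# Putting the scaler and PCA inside the pipeline means they are refit on
# the training portion of every fold, so no test data leaks into them.
knn_only = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn_with_pca = make_pipeline(
    StandardScaler(), PCA(n_components=3), KNeighborsClassifier(n_neighbors=3))

# For classifiers, an integer cv uses stratified folds.
scores_knn = cross_val_score(knn_only, X, y, cv=5)
scores_pca = cross_val_score(knn_with_pca, X, y, cv=5)

print("KNN on 13 features:", scores_knn.mean())
print("KNN on 3 components:", scores_pca.mean())
```
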
