#### Dimension Reduction

- finds patterns in the data
- uses the patterns to re-express it in a compressed form
- but the **main function** is to reduce a dataset to its "bare bones" features, discarding noisy features that may interfere with modeling tasks like regression and classification
- ultimately represents the same data using fewer features

Having **too many** features on which the classification is performed can hurt the model. The higher the number of features, the harder it gets to train the model appropriately.

The **Benefits** of Dimensionality Reduction

- Reduces Overfitting
- Improves Model Performance
- Reduces Training Time
- Utilizes Unlabelled Data
- Better Visualizations

![image.png](attachment:image.png)

### PCA - Principal Component Analysis

- is a **fundamental dimension reduction** technique
- named as such because it learns the principal components of the data
- 1st step: "decorrelation"
- 2nd step: "dimension reduction"

### When should you use PCA?

- to reduce the number of variables or to reduce noise
- to obtain variables that are linearly uncorrelated with one another
- when latent features drive the patterns in the data
- as a preprocessing step to improve the performance of an algorithm

#### What does PCA do?

- rotates data samples to be aligned with axes, and shifts samples to a mean of zero
- follows the fit/transform pattern of KMeans and StandardScaler
    - fit() learns the transformation from the given data
    - transform() applies the learned transformation and can be used on any new data

In [None]:
from sklearn.decomposition import PCA

#create PCA object and fit it to the samples
model = PCA()
model.fit(samples)

#transform the samples
transformed = model.transform(samples)
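
The cell above relies on a `samples` array defined elsewhere. As a minimal self-contained sketch, the version below uses synthetic data (the arrays `samples` and `new_samples` here are made up for illustration) to show that `transform()` centers the fitted data at zero and can be applied to data the model never saw during `fit()`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-ins (assumption): 200 fitted samples, 10 unseen samples
rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
new_samples = rng.normal(loc=5.0, scale=2.0, size=(10, 3))

model = PCA()
model.fit(samples)  # learns the mean and the rotation

transformed = model.transform(samples)          # data used for fitting
transformed_new = model.transform(new_samples)  # unseen data, same transformation

# Each PCA feature of the fitted data has (numerically) zero mean
print(np.allclose(transformed.mean(axis=0), 0.0))
print(transformed_new.shape)
```

Note that only the data passed to `fit()` is guaranteed to end up with zero mean; new data is shifted by the mean learned during fitting.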

#### The down-sides of PCA

- if the number of variables is large, the principal components become hard to interpret
- more suitable when the variables have a linear relationship with each other
- susceptible to big outliers
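
The sensitivity to outliers can be illustrated with a small sketch. The data below is synthetic and the single extreme point is chosen purely for illustration; the idea is that one far-away sample can change the direction of the first principal component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 200 points spread mostly along the x axis (std 3.0 vs 0.5)
clean = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])

# The same points plus one extreme sample far along the y axis
with_outlier = np.vstack([clean, [0.0, 200.0]])

pc_clean = PCA(n_components=1).fit(clean).components_[0]
pc_outlier = PCA(n_components=1).fit(with_outlier).components_[0]

# Absolute values, because the sign of a component is arbitrary
print(np.abs(pc_clean))    # dominated by the x direction
print(np.abs(pc_outlier))  # dominated by the y direction
```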

#### Correlation

- Features of a dataset are often correlated
- Decorrelation occurs when PCA aligns the data with the axes, resulting in features that are no longer linearly correlated

In [None]:
#imports
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

# Assign the 0th column of grains to width and the 1st column to length
width = grains[:,0]
length = grains[:,1]

# Scatter plot width vs length
plt.scatter(width, length)
plt.axis('equal')
plt.show()

#Pearson correlation
correlation, pvalue = pearsonr(width, length)

#the correlation
print(correlation)


example ![image.png](attachment:image.png)

#### Principal Components

- `components_` is a NumPy array holding the principal components (an attribute of the fitted PCA object)
- each row defines a displacement from the mean, as one principal component

In [None]:
print(model.components_)
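
To make the structure of `components_` concrete, here is a hedged sketch on synthetic data (the `samples` array is invented for this example). It checks that the rows are orthonormal and that, with the default settings (no whitening), `transform()` amounts to centering with `mean_` and projecting onto the rows of `components_`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated data with a non-zero mean (assumption)
rng = np.random.default_rng(0)
samples = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3)) + 10.0

model = PCA().fit(samples)

# One row per principal component, one column per original feature
print(model.components_.shape)

# The rows are orthonormal: their Gram matrix is the identity
gram = model.components_ @ model.components_.T
print(np.allclose(gram, np.eye(3)))

# Center with the learned mean, then project onto the components
manual = (samples - model.mean_) @ model.components_.T
print(np.allclose(manual, model.transform(samples)))
```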

In [None]:
from sklearn.decomposition import PCA

# a PCA instance: model
model = PCA()

#the fit_transform method of model to grains: pca_features
pca_features = model.fit_transform(grains)

#the coordinates data
xs = pca_features[:,0]
ys = pca_features[:,1]

#scatter plot
plt.scatter(xs, ys)
plt.axis('equal')
plt.show()

#Pearson correlation
correlation, pvalue = pearsonr(xs, ys)
#the correlation
print(correlation)
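
Since `grains` is loaded elsewhere, the following sketch repeats the before/after comparison on synthetic data. The array `grains_like` and its width/length relationship are assumptions made only so the example runs on its own.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.decomposition import PCA

# Synthetic stand-in for the grains data: length grows with width (assumption)
rng = np.random.default_rng(0)
width = rng.normal(loc=3.0, scale=0.4, size=200)
length = 1.5 * width + rng.normal(scale=0.2, size=200)
grains_like = np.column_stack([width, length])

# Correlation of the original features
corr_before, _ = pearsonr(grains_like[:, 0], grains_like[:, 1])

# Correlation of the PCA features
pca_features = PCA().fit_transform(grains_like)
corr_after, _ = pearsonr(pca_features[:, 0], pca_features[:, 1])

print(round(corr_before, 3), round(corr_after, 3))
```

The first value should be strongly positive, while the second should be zero up to floating-point error.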

#### Intrinsic Dimension

- can be estimated with PCA by counting the number of PCA features with significant variance

##### Plotting the variances of PCA Features

In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

#create a PCA model and fit it to the samples
pca = PCA()
pca.fit(samples)

#creating a range enumerating the pca features
features = range(pca.n_components_)

#bar plot of the variances
plt.bar(features, pca.explained_variance_)
plt.xticks(features)
plt.ylabel('Variance')
plt.xlabel('PCA Feature')
plt.show()
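
Instead of reading the count off the bar plot, the number of significant PCA features can also be counted in code. The sketch below uses synthetic 3D data lying close to a 2D plane, and the `threshold` value is a hypothetical cut-off chosen for this example, not a standard rule.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Three features, but the third is almost a combination of the first two
xy = rng.normal(size=(300, 2))
z = 0.5 * xy[:, 0] - 0.3 * xy[:, 1] + rng.normal(scale=0.01, size=300)
samples = np.column_stack([xy, z])

pca = PCA().fit(samples)

# Hypothetical cut-off: keep features explaining more than 1% of the variance
threshold = 0.01
intrinsic_dim = int(np.sum(pca.explained_variance_ratio_ > threshold))

print(pca.explained_variance_ratio_)
print(intrinsic_dim)
```

What counts as "significant" depends on the dataset, so the threshold is best treated as a tunable choice.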

##### Dimension Reduction of Dataset

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)  
#retaining the 2 PCA features with the highest variance
#you can tell if the distinct categories still remain as expected once the features are isolated

pca.fit(samples)

transformed = pca.transform(samples)
print(transformed.shape)

#plotting the new outcome with the 2 features
import matplotlib.pyplot as plt
xs = transformed[:, 0]
ys = transformed[:, 1]
#plotting the two features in a scatterplot
plt.scatter(xs, ys, c=categories)
plt.show()
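
For a version that runs without the `samples` and `categories` variables from elsewhere, one option is the iris dataset bundled with scikit-learn. Using iris here is an assumption for illustration only; it is not necessarily the dataset these notes are based on.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Bundled example dataset: 150 samples, 4 features, 3 species
iris = load_iris()
samples = iris.data
categories = iris.target

# Keep the 2 PCA features with the highest variance
pca = PCA(n_components=2)
transformed = pca.fit_transform(samples)

print(transformed.shape)
# Share of the total variance retained by the 2 kept features
print(pca.explained_variance_ratio_.sum())
```

The `transformed` and `categories` arrays can then be passed to `plt.scatter` exactly as in the cell above.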