### Dimension reduction
- more efficient storage and computation
- remove less-informative "noise" features
- which cause problems for prediction tasks, e.g. classification, regression
### Principal component analysis
- PCA = principal component analysis
- fundamental dimension reduction technique
- first step "decorrelation"
- second step reduces dimension
### PCA aligns data with axes
- rotates data samples to be aligned with axes
- shifts data samples so they have mean 0
- no information is lost
- ![image.png](attachment:ddc24a21-8141-4866-a4bd-638f2b9bc1a9.png)
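
The three points above (rotate, shift to mean 0, lose nothing) can be checked on a small example. This is a minimal sketch on made-up data; the synthetic `samples` array and the random seed are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic, correlated 2-D data (illustrative only, not from the notes)
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=200)
y = 0.5 * x + rng.normal(scale=0.3, size=200)
samples = np.column_stack([x, y])

model = PCA()
transformed = model.fit_transform(samples)

# Shifted: every PCA feature has mean (numerically) zero
print(np.allclose(transformed.mean(axis=0), 0.0))

# Nothing lost: with all components kept, the original samples can be rebuilt
restored = model.inverse_transform(transformed)
print(np.allclose(restored, samples))
```

Both checks should print `True` as long as all components are kept, which is the default for `PCA()`.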
### PCA follows the fit/transform pattern
- PCA is a scikit-learn component like KMeans or StandardScaler
- fit() learns the transformation from the given data
- transform() applies the learned transformation
- transform() can also be applied to new data
```python
from sklearn.decomposition import PCA
model = PCA()
model.fit(samples)
transformed = model.transform(samples)
```
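
To illustrate the last bullet, here is a small sketch of applying a fitted model to data it has not seen. The arrays `train_samples` and `new_samples` are hypothetical stand-ins, generated at random just so the code runs.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
train_samples = rng.normal(size=(100, 3))   # hypothetical training data
new_samples = rng.normal(size=(5, 3))       # hypothetical unseen data

model = PCA()
model.fit(train_samples)                    # learns the mean and the principal components

# Reuse the learned shift and rotation on the unseen data
new_transformed = model.transform(new_samples)
print(new_transformed.shape)

# With default settings, transform() subtracts the learned mean
# and projects onto the rows of components_
manual = (new_samples - model.mean_) @ model.components_.T
print(np.allclose(manual, new_transformed))
```

The key point is that `transform()` uses the mean and components learned in `fit()`; it does not re-estimate them from the new data.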
### PCA features
- rows of transformed correspond to samples
- columns of transformed are the "PCA features"
- each row gives the PCA feature values of the corresponding sample
### PCA features are not correlated
- features of a dataset are often correlated, e.g. total_phenols and od280 in the wine dataset
- PCA aligns the data with axes
- Resulting PCA features are not linearly correlated ("decorrelation")
### Pearson correlation
- measures linear correlation of features
- value between -1 and 1
- value of 0 means no linear correlation
- ![image.png](attachment:91f651f9-b6b2-4e20-b17e-b90b43efde39.png)
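
As a rough illustration of what the value means, the sketch below computes the Pearson correlation by hand (covariance divided by the product of the standard deviations). The helper name `pearson` and the toy arrays are assumptions made for this example; in practice `scipy.stats.pearsonr` does the same job.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation of two 1-D arrays (hypothetical helper for illustration)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    denominator = np.sqrt((a_centered @ a_centered) * (b_centered @ b_centered))
    return (a_centered @ b_centered) / denominator

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

print(pearson(x, 2 * x + 1))     # perfectly linear, increasing: close to 1
print(pearson(x, -3 * x))        # perfectly linear, decreasing: close to -1
print(pearson(x, (x - 3) ** 2))  # symmetric curve: close to 0
```

The last case shows why the bullet says "no linear correlation": the two arrays are clearly related, but not by a straight line.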
### Principal components
- = directions of variance
- PCA aligns principal components with the axes
- ![image.png](attachment:605f96cb-a4be-4715-90bb-8331f9e5098f.png)
### Principal components
- available as components_ attribute of PCA object
- each row defines displacement from mean
- ```python
  print(model.components_)
  ```
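
A quick way to get a feel for `components_` is to inspect its shape and check that its rows are perpendicular unit vectors. The sketch below uses made-up 2-D samples; the data and seed are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Hypothetical correlated 2-D samples
x = rng.normal(size=300)
samples = np.column_stack([x, 0.8 * x + rng.normal(scale=0.2, size=300)])

model = PCA().fit(samples)

# One row per principal component, one column per original feature
print(model.components_.shape)

# Rows are unit-length and mutually perpendicular,
# so multiplying by the transpose gives the identity matrix
gram = model.components_ @ model.components_.T
print(np.allclose(gram, np.eye(2)))
```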

Q. Correlated data in nature
You are given an array grains giving the width and length of samples of grain. You suspect that width and length will be correlated. To confirm this, make a scatter plot of width vs. length and measure their Pearson correlation.
```python
# Perform the necessary imports
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

# Assign the 0th column of grains: width
width = grains[:,0]

# Assign the 1st column of grains: length
length = grains[:,1]

# Scatter plot width vs length
plt.scatter(width,length)
plt.axis('equal')
plt.show()

# Calculate the Pearson correlation
correlation, pvalue = pearsonr(width, length)

# Display the correlationprint(correlation)
.

Q. Decorrelating the grain measurements with PCA
You observed in the previous exercise that the width and length measurements of the grain are correlated. Now, you'll use PCA to decorrelate these measurements, then plot the decorrelated points and measure their Pearson correlation.

```python

# Import PCA
from sklearn.decomposition import PCA

# Create PCA instance: model
model = PCA()

# Apply the fit_transform method of model to grains: pca_features
pca_features = model.fit_transform(grains)

# Assign 0th column of pca_features: xs
xs = pca_features[:,0]

# Assign 1st column of pca_features: ys
ys = pca_features[:,1]

# Scatter plot xs vs ys
plt.scatter(xs, ys)
plt.axis('equal')
plt.show()

# Calculate the Pearson correlation of xs and ys
correlation, pvalue = pearsonr(xs, ys)

# Display the correlation
print(correlation)

```
![image.png](attachment:952cf765-63c6-4c18-8dc8-0a46925a3fea.png)

### Intrinsic dimension
- = number of features needed to approximate the dataset
- essential idea behind dimension reduction
### PCA identifies intrinsic dimension
- scatter plots work only if samples have 2 or 3 features
- PCA identifies intrinsic dimension when samples have any number of features
- intrinsic dimension = number of PCA features with significant variance
### Variance and intrinsic dimension
- intrinsic dimension is number of PCA features with significant variance

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(samples)  # samples = array of versicolor samples
features = range(pca.n_components_)
plt.bar(features, pca.explained_variance_)
plt.xticks(features)
plt.ylabel('variance')
plt.xlabel('PCA features')
plt.show()
```
![image.png](attachment:ecfa64f4-8343-4f6f-ba8d-456078fc08cd.png)
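
Besides reading the bar chart, the count of "significant" PCA features can be estimated numerically. This is a sketch under assumptions: the synthetic data is made up, and the 1% cutoff is an arbitrary choice for illustration, not a rule from the notes.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Hypothetical data: 3 features, but the third is almost the sum of the first two
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + b + rng.normal(scale=0.01, size=500)
samples = np.column_stack([a, b, c])

pca = PCA()
pca.fit(samples)

# Share of the total variance carried by each PCA feature
ratios = pca.explained_variance_ratio_
print(ratios)

# Assumed cutoff: a PCA feature is "significant" if it holds at least 1% of the variance
threshold = 0.01
intrinsic_dim = int(np.sum(ratios >= threshold))
print(intrinsic_dim)
```

Here the data has 3 features but only 2 PCA features carry noticeable variance, so the estimated intrinsic dimension is 2. A different cutoff can give a different answer on real data, so the plot is still worth looking at.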

Q. The first principal component
The first principal component of the data is the direction in which the data varies the most. In this exercise, your job is to use PCA to find the first principal component of the length and width measurements of the grain samples, and represent it as an arrow on the scatter plot.

The array grains gives the length and width of the grain samples. PyPlot (plt) and PCA have already been imported for you.

```python
# Make a scatter plot of the untransformed points
plt.scatter(grains[:,0], grains[:,1])

# Create a PCA instance: model
model = PCA()

# Fit model to points
model.fit(grains)

# Get the mean of the grain samples: mean
mean = model.mean_

# Get the first principal component: first_pc
first_pc = model.components_[0,:]

# Plot first_pc as an arrow, starting at mean
plt.arrow(mean[0],mean[1], first_pc[0], first_pc[1], color='red', width=0.01)

# Keep axes on same scale
plt.axis('equal')
plt.show()
```

![image.png](attachment:969a21f9-fab6-4677-aa20-e150e07e6a49.png)

Q. Variance of the PCA features
The fish dataset is 6-dimensional. But what is its intrinsic dimension? Make a plot of the variances of the PCA features to find out. As before, samples is a 2D array, where each row represents a fish. You'll need to standardize the features first.

```python
# Perform the necessary imports
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt

# Create scaler: scaler
scaler = StandardScaler()

# Create a PCA instance: pca
pca = PCA()

# Create pipeline: pipeline
pipeline = make_pipeline(scaler, pca)

# Fit the pipeline to 'samples'
pipeline.fit(samples)

# Plot the explained variances
features = range(pca.n_components_)
plt.bar(features, pca.explained_variance_)
plt.xlabel('PCA feature')
plt.ylabel('variance')
plt.xticks(features)
plt.show()
```

![image.png](attachment:e291387f-f60e-4d66-b20a-58a45bcec12d.png)

### Dimension reduction
- represents same data, using fewer features
- important part of machine-learning pipelines
- can be performed using PCA

### Dimension reduction with PCA
- specify how many features to keep
- PCA(n_components=2)
- keeps the first 2 PCA features
- intrinsic dimension is a good choice
- ![image.png](attachment:390e7104-3ea3-45be-9a2d-e6fec3d5b5ac.png)
- ![image.png](attachment:441c5656-35d1-4332-bbfe-450d28455aef.png)
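
A minimal sketch of the steps above, assuming the iris measurements bundled with scikit-learn (`load_iris`) as a stand-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Stand-in data: 150 iris samples with 4 features each
samples = load_iris().data

# Keep only the first 2 PCA features
pca = PCA(n_components=2)
transformed = pca.fit_transform(samples)

print(samples.shape)       # (150, 4)
print(transformed.shape)   # (150, 2)

# Fraction of the total variance retained by the 2 kept features
retained = pca.explained_variance_ratio_.sum()
print(retained)
```

The retained fraction gives a rough sense of how much is given up by dropping the remaining PCA features.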

  

### TruncatedSVD and csr_matrix
- scikit-learn PCA has traditionally not supported csr_matrix input
- use scikit-learn TruncatedSVD instead
- performs the same kind of transformation, but without centering the data first

```python
from sklearn.decomposition import TruncatedSVD
model = TruncatedSVD(n_components=3)
model.fit(documents)
transformed = model.transform(documents)
```
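
The snippet above assumes `documents` already exists. The sketch below fills that in with a made-up sparse word-count matrix so it can be run end to end; the matrix size, the Poisson counts and the `random_state` value are all assumptions for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(4)
# Hypothetical word counts: 20 documents x 50 words, mostly zeros
counts = rng.poisson(0.3, size=(20, 50))
documents = csr_matrix(counts)

# random_state is fixed only to make the example repeatable
model = TruncatedSVD(n_components=3, random_state=0)
model.fit(documents)
transformed = model.transform(documents)

# The input is sparse, but the reduced features come back as a dense array
print(type(transformed))
print(transformed.shape)
```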