# Lab 6

Today we continue exploring **dimension reduction** to help us get a handle on high-dimensional data. Our goals are:

0. Understand how to use the `sklearn` implementation for PCA
1. Determine what the "right" lower dimension is
2. Introduce _Singular Value Decomposition_
3. Compare and contrast PCA and SVD

In [2]:
# Import block
%matplotlib inline

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

For easier comparisons, we will continue with the ever-exciting `students_info.csv` file. Please import it with `pandas` and then create a `numpy` array containing only the numerical variables.

In [3]:
# Import data
students = pd.read_csv("../Lab03/students_info.csv")

In [4]:
# Create justnum with only the numerical data
justnum = students[["coffee", "sleep", "gym", "gpa"]].to_numpy()

## PCA in `sklearn`

PCA in `sklearn` works much the way that kmeans did. We first set up how PCA will function and then apply the particular PCA that we have crafted to our data. 

As we did with kmeans, we will take each step individually, exploring the output that we generate. 

In the code block below, we have one possible setting of `PCA()` in `sklearn`.
* What _type_ is the output and what information is contained within `PCA`? 
* What are the various parameters doing? 

In [5]:
# Step one: Set up PCA
pca_alg = PCA(n_components=2)

In [7]:
# Code block for further discovery
type(pca_alg)
print(pca_alg)

PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)


As with kmeans, now that we have set up our PCA, we can _fit_ it to our data. We can `fit` our data and then `transform` it; or if we prefer, we can do both using `fit_transform`.

#### Using `.fit()`

This first fitting applies our PCA to the data. What _type_ is the output and what information is contained within `pfit`?

In [8]:
pfit = pca_alg.fit(justnum)

In [10]:
# Code block for further discovery





It's not immediately obvious how to access the principal components. The result of `.fit()` wraps this information inside a class-style object. We can use `.components_` to access the principal components:

In [12]:
print("Shape of the resulting components\n", pfit.components_.shape)

print("\n Actual components \n",pfit.components_)

Shape of the resulting components
 (2, 4)

 Actual components 
 [[-9.33356335e-03  1.39551490e-03 -9.99955051e-01 -9.13061749e-04]
 [-2.14277434e-02 -9.95515707e-01 -1.27343628e-03  9.21287514e-02]]


Is this what we expect to see? Why or why not? 

Take a minute to explain. 






##### PCA `.fit` result

The output of `fit()` is the fitted `PCA` object, and its `components_` attribute is a _transition_ matrix. This is the matrix that "carries" our data from a higher dimension down to the lower one. What `.fit` does **not** do is carry out this transformation. For that step, we need to use `.transform()`.
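
To make the distinction concrete, here is a minimal sketch. It assumes a synthetic array `X_demo` (a hypothetical stand-in for `justnum`, since the CSV is not bundled here) and simply inspects what `.fit()` hands back.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for `justnum`: 50 rows, 4 numeric columns
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))

pca_demo = PCA(n_components=2)
fitted = pca_demo.fit(X_demo)

# fit() returns the estimator itself, now carrying fitted attributes
print(fitted is pca_demo)

# One row per principal component, one column per original variable
print(fitted.components_.shape)

# The data has not been reduced yet; it still has 4 columns
print(X_demo.shape)
```

Nothing in this block changes `X_demo`; the reduction only happens once `.transform()` is called.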


#### Using `.transform()`

So far, we have set up our PCA and applied it to our data to get our transition matrix. To actually send our data to the lower dimensional space, we _transform_ our data using `transform()`.


_Note_ - `.predict` for kmeans and `.transform` for PCA play similar roles in `sklearn`: each applies an already fitted algorithm to data. However, in the case of PCA, simply stopping at the _transition_ matrix feels a bit odd as the _dimension reduction_ has not yet occurred.

In [None]:
justnum_intwo = pca_alg.transform(justnum)
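
As a rough check on what `.transform()` does, the sketch below uses a synthetic `X_demo` (a hypothetical stand-in for `justnum`) and compares the `sklearn` result with a hand-rolled projection: subtract the fitted column means, then multiply by the transposed transition matrix. This assumes the default `whiten=False`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for `justnum`
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))

pca_demo = PCA(n_components=2).fit(X_demo)
X_intwo = pca_demo.transform(X_demo)

# Manual version: center with the fitted means, then project
X_manual = (X_demo - pca_demo.mean_) @ pca_demo.components_.T

print(X_intwo.shape)                    # rows kept, columns reduced to 2
print(np.allclose(X_intwo, X_manual))   # the two routes agree
```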


`fit_transform()` does both steps in a single call.
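
A small sketch of that equivalence, again on a synthetic `X_demo` (hypothetical, standing in for `justnum`): one estimator is fit and then used to transform, the other uses `fit_transform()` directly.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for `justnum`
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))

# Route 1: fit, then transform
two_step = PCA(n_components=2).fit(X_demo).transform(X_demo)

# Route 2: fit_transform in one call
one_step = PCA(n_components=2).fit_transform(X_demo)

# The results should match up to floating point error
print(np.allclose(two_step, one_step))
```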

## Choosing the dimension

What's the right number? A lot of the time the answer is 2 (for visualization). Otherwise, you can look at how much of the variance in the data a given number of components explains, or equivalently how much it loses.
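
One way to look at "what you explain" is the `explained_variance_ratio_` attribute of a fitted `PCA`. The sketch below is only illustrative: `X_demo` is synthetic, and the 90% cutoff is an arbitrary assumption rather than a rule.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 4 columns with deliberately different spreads
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4)) * np.array([5.0, 2.0, 1.0, 0.5])

# Keep every component so we can see the full breakdown
pca_full = PCA().fit(X_demo)
ratios = pca_full.explained_variance_ratio_
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative share reaches the cutoff
cutoff = 0.90  # assumed threshold, pick what suits your problem
k = int(np.searchsorted(cumulative, cutoff) + 1)

print("share per component:", ratios)
print("cumulative share:   ", cumulative)
print("components needed:  ", k)
```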

PCA in a summary

## Singular Value Decomposition (SVD)

SVD is linear algebra at its finest (though I promise not to explore that tangent). 
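
Without going down the linear algebra path, a short sketch with `numpy.linalg.svd` shows the basic idea: a matrix is split into three factors that multiply back to the original. The matrix `A_demo` here is random and purely illustrative.

```python
import numpy as np

# Hypothetical 6 x 4 matrix to decompose
rng = np.random.default_rng(0)
A_demo = rng.normal(size=(6, 4))

# full_matrices=False gives the compact ("economy") factors
U, s, Vt = np.linalg.svd(A_demo, full_matrices=False)

print(U.shape, s.shape, Vt.shape)

# Singular values come back sorted from largest to smallest
print(s)

# Multiplying the factors together recovers the original matrix
A_rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(A_demo, A_rebuilt))
```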



### Sparsity 

What is _sparse_ data? Is it bad or good? 
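
As a starting point for that discussion, the sketch below builds a mostly-zero matrix and stores it two ways: as a dense `numpy` array and as a `scipy.sparse` CSR matrix. The sizes and the number of nonzero entries are made up for illustration.

```python
import numpy as np
from scipy import sparse

# Hypothetical 1000 x 500 matrix with at most 2000 nonzero entries
rng = np.random.default_rng(0)
dense = np.zeros((1000, 500))
rows = rng.integers(0, 1000, size=2000)
cols = rng.integers(0, 500, size=2000)
dense[rows, cols] = 1.0

# Same values, stored in compressed sparse row format
sp = sparse.csr_matrix(dense)

# Fraction of entries that are nonzero
density = sp.nnz / (sp.shape[0] * sp.shape[1])
print("density:", density)

# Memory used by each representation, in bytes
dense_bytes = dense.nbytes
sparse_bytes = sp.data.nbytes + sp.indices.nbytes + sp.indptr.nbytes
print("dense bytes: ", dense_bytes)
print("sparse bytes:", sparse_bytes)
```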

### Truncated SVD

There are a few flavors of SVD. 
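
One of those flavors is available in `sklearn` as `TruncatedSVD`, which follows the same set up, fit, transform pattern as `PCA` and accepts sparse input. The sketch below runs it on a random sparse matrix; the shape, density, and `random_state` values are assumptions chosen only to make the example reproducible.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Hypothetical sparse data: 100 rows, 20 columns, about 5% nonzero
X_sparse = sparse.random(100, 20, density=0.05, format="csr", random_state=0)

# Set up, then fit and transform in one call
svd_alg = TruncatedSVD(n_components=2, random_state=0)
X_sparse_intwo = svd_alg.fit_transform(X_sparse)

print(X_sparse_intwo.shape)        # rows kept, columns reduced to 2
print(svd_alg.components_.shape)   # one row per component
```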

## PCA & SVD - A Comparison
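
One concrete link between the two: running an SVD on data whose columns have been centered gives the same directions as PCA, up to the sign of each component. The sketch below checks this on a synthetic `X_demo` (hypothetical, standing in for `justnum`) and uses `svd_solver="full"` so that `PCA` takes the exact SVD route.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for `justnum`
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))

# PCA as provided by sklearn
pca_demo = PCA(n_components=2, svd_solver="full").fit(X_demo)

# SVD by hand: center the columns first, then decompose
X_centered = X_demo - X_demo.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Signs of individual components may differ, so compare absolute values
same_axes = np.allclose(np.abs(pca_demo.components_), np.abs(Vt[:2]))
same_values = np.allclose(pca_demo.singular_values_, s[:2])

print(same_axes, same_values)
```

The centering step is the part to watch: PCA subtracts the column means before decomposing, while a plain SVD works on the matrix exactly as given.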



### Final Thoughts

To finish up this lab, read about the [PCA implementation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) in `sklearn` and create a post to **#lab_submission** channel on slack sharing one surprising thing about PCA that you learned by first walking through it and then reading about it in `sklearn`. Your post must start with **Lab6** to get credit. 

If you have questions from this lab, post them to #lab_questions with the same preamble (i.e. starting with **Lab6**). If you have the same question as someone else, please use one of the emojis to upvote it. If you would like to answer someone's question, please use the thread function. This will tie your answer to their question.

#### References consulted
0. _Doing Data Science: Straight talk from the frontline_ by C. O'Neil & R. Schutt (2014)
1. _Python Machine Learning_
2. [PCA `sklearn` helpfile](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA)
3. [Truncated SVD `sklearn` helpfile](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD)
4. [SVD `numpy` helpfile](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html)