## Codio Activity 6.4: Adjusting Parameters for Variance

**Expected Time: 60 Minutes**

**Total Points: 20 Points**

This activity uses the singular values in $\Sigma$ to decide how many principal components to keep, based on how much variance should be retained. In the last activity, a scree plot was used to see where the gain in variance explained levels off.

Here, you will determine how many components are required to explain a given proportion of the variance. The dataset is a larger housing dataset describing individual houses and their features in Ames, Iowa. For our purposes, only the numeric columns with no missing values are used.

## Index:

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)

In [1]:
import numpy as np
from scipy.linalg import svd
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.datasets import fetch_openml

In [2]:
#fetching the data
housing = fetch_openml(name="house_prices", as_frame=True, data_home='data')

In [3]:
#examine the dataframe
housing.frame

| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 60.0 | RL | 65.0 | 8450.0 | Pave | | Reg | Lvl | AllPub | ... | 0.0 | | | | 0.0 | 2.0 | 2008.0 | WD | Normal | 208500.0 |
| 1 | 2.0 | 20.0 | RL | 80.0 | 9600.0 | Pave | | Reg | Lvl | AllPub | ... | 0.0 | | | | 0.0 | 5.0 | 2007.0 | WD | Normal | 181500.0 |
| 2 | 3.0 | 60.0 | RL | 68.0 | 11250.0 | Pave | | IR1 | Lvl | AllPub | ... | 0.0 | | | | 0.0 | 9.0 | 2008.0 | WD | Normal | 223500.0 |
| 3 | 4.0 | 70.0 | RL | 60.0 | 9550.0 | Pave | | IR1 | Lvl | AllPub | ... | 0.0 | | | | 0.0 | 2.0 | 2006.0 | WD | Abnorml | 140000.0 |
| 4 | 5.0 | 60.0 | RL | 84.0 | 14260.0 | Pave | | IR1 | Lvl | AllPub | ... | 0.0 | | | | 0.0 | 12.0 | 2008.0 | WD | Normal | 250000.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 1456.0 | 60.0 | RL | 62.0 | 7917.0 | Pave | | Reg | Lvl | AllPub | ... | 0.0 | | | | 0.0 | 8.0 | 2007.0 | WD | Normal | 175000.0 |
| 1456 | 1457.0 | 20.0 | RL | 85.0 | 13175.0 | Pave | | Reg | Lvl | AllPub | ... | 0.0 | | MnPrv | | 0.0 | 2.0 | 2010.0 | WD | Normal | 210000.0 |
| 1457 | 1458.0 | 70.0 | RL | 66.0 | 9042.0 | Pave | | Reg | Lvl | AllPub | ... | 0.0 | | GdPrv | Shed | 2500.0 | 5.0 | 2010.0 | WD | Normal | 266500.0 |
| 1458 | 1459.0 | 20.0 | RL | 68.0 | 9717.0 | Pave | | Reg | Lvl | AllPub | ... | 0.0 | | | | 0.0 | 4.0 | 2010.0 | WD | Normal | 142125.0 |


In [6]:
#select numeric columns, then drop every column that contains a missing value
df = housing.frame.select_dtypes(['float', 'int']).dropna(axis=1)

In [7]:
df.shape

(1460, 35)
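
The selection step above can be easy to misread: `dropna(axis=1)` removes whole columns, not rows, which is why all 1460 houses remain. A minimal sketch on a hypothetical toy frame (the column names and values are made up for illustration and are not taken from the Ames data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; names and values are illustrative only.
toy = pd.DataFrame({
    "area": [850.0, 920.0, 1100.0],
    "frontage": [65.0, np.nan, 68.0],  # numeric, but has a missing value
    "zoning": ["RL", "RL", "RM"],      # non-numeric
    "rooms": [5, 6, 7],
})

# Keep float and int columns only (drops "zoning").
numeric = toy.select_dtypes(["float", "int"])

# axis=1 drops any COLUMN containing a missing value (drops "frontage").
complete = numeric.dropna(axis=1)

print(list(complete.columns))
print(complete.shape)
```

Every row survives, so the number of observations is unchanged; only the set of features shrinks.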

[Back to top](#Index:) 

## Problem 1

### Scale the data

**5 Points**

Scale the `df` data using its mean and standard deviation so that it is ready for SVD.  Assign the scaled data to `df_scaled` below.  

In [11]:
### GRADED

df_scaled = (df - df.mean() ) / df.std()

print(df_scaled.head(5))
# Answer check
print(type(df_scaled))

         Id  MSSubClass   LotArea  OverallQual  OverallCond  YearBuilt  \
0 -1.730272    0.073350 -0.207071     0.651256    -0.517023   1.050634   
1 -1.727900   -0.872264 -0.091855    -0.071812     2.178881   0.156680   
2 -1.725528    0.073350  0.073455     0.651256    -0.517023   0.984415   
3 -1.723156    0.309753 -0.096864     0.651256    -0.517023  -1.862993   
4 -1.720785    0.073350  0.375020     1.374324    -0.517023   0.951306   

   YearRemodAdd  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  ...  WoodDeckSF  \
0      0.878367    0.575228   -0.288554  -0.944267  ...   -0.751918   
1     -0.429430    1.171591   -0.288554  -0.641008  ...    1.625638   
2      0.829930    0.092875   -0.288554  -0.301540  ...   -0.751918   
3     -0.720051   -0.499103   -0.288554  -0.061648  ...   -0.751918   
4      0.733056    0.463410   -0.288554  -0.174805  ...    0.779930   

   OpenPorchSF  EnclosedPorch  3SsnPorch  ScreenPorch  PoolArea   MiscVal  \
0     0.216429      -0.359202  -0.116299    -0.2701
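
As a sanity check on the scaling step, each standardized column should have a mean of about 0 and a standard deviation of 1. The sketch below verifies this on a small hypothetical frame (`toy` and its values are assumptions for illustration). Note that pandas' `std()` uses the sample standard deviation (`ddof=1`) by default.

```python
import numpy as np
import pandas as pd

# Hypothetical data; not part of the housing dataset.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [10.0, 20.0, 40.0, 50.0],
})

# Same formula as in Problem 1: subtract column means, divide by column stds.
toy_scaled = (toy - toy.mean()) / toy.std()

col_means = toy_scaled.mean().to_numpy()
col_stds = toy_scaled.std().to_numpy()

print(np.allclose(col_means, 0.0))  # means are zero up to rounding
print(np.allclose(col_stds, 1.0))   # sample stds are one
```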

[Back to top](#Index:) 

## Problem 2

### Extracting $\Sigma$

**5 Points**

Using the scaled data, extract the singular values from the data using the `scipy.linalg` function `svd`.  Assign your results to `U`, `sigma`, and `VT` below. 

In [14]:
### GRADED

U, sigma, VT = svd(df_scaled)

# Answer check
print(type(sigma))
print(sigma.shape)

<class 'numpy.ndarray'>
(35,)
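
To make the shapes of the three returned pieces concrete, here is a small sketch on a random matrix (the matrix `A` is an assumption for illustration). With the default `full_matrices=True`, `U` is square, `sigma` is a 1-D array of singular values sorted from largest to smallest, and the original matrix can be rebuilt from the first `len(sigma)` columns of `U`.

```python
import numpy as np
from scipy.linalg import svd

# Hypothetical 6 x 3 matrix standing in for a scaled dataset.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))

U, sigma, VT = svd(A)  # full_matrices=True by default
print(U.shape, sigma.shape, VT.shape)

# Only the first k columns of U pair with a singular value.
k = sigma.shape[0]
A_rebuilt = (U[:, :k] * sigma) @ VT
print(np.allclose(A, A_rebuilt))

# Singular values come back in non-increasing order.
print(bool(np.all(np.diff(sigma) <= 0)))
```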


[Back to top](#Index:) 

## Problem 3

### Percent Variance Explained

**5 Points**

Divide `sigma` by the sum of the singular values to compute the proportion of variance explained by each component. Assign the resulting array of proportions to `percent_variance_explained` below.

Note that, because of floating-point rounding, these proportions may not sum to exactly 1.

In [16]:
### GRADED

sum_of_singular_values = sigma.sum()
percent_variance_explained = sigma / sum_of_singular_values

# Answer check
print(percent_variance_explained)
print(percent_variance_explained.shape)
print(percent_variance_explained.sum())

[8.75769203e-02 5.81847410e-02 4.79455289e-02 4.60005341e-02
 3.97754492e-02 3.56510420e-02 3.51284989e-02 3.44714217e-02
 3.39391269e-02 3.37739204e-02 3.33823266e-02 3.27193966e-02
 3.26162909e-02 3.22053318e-02 3.17229223e-02 3.07745480e-02
 3.03654206e-02 2.94060550e-02 2.92735397e-02 2.86757685e-02
 2.75789503e-02 2.57443285e-02 2.50665392e-02 2.24658513e-02
 2.09267801e-02 1.83524871e-02 1.76221696e-02 1.67402510e-02
 1.44320573e-02 1.30836615e-02 1.23844825e-02 1.17587153e-02
 1.02549429e-02 3.88919653e-17 6.89316954e-18]
(35,)
1.0000000000000002
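
This activity normalizes the singular values themselves. For comparison, in standard PCA the variance along each component is proportional to the squared singular value of the centered data. The sketch below computes both ratios on synthetic data (the matrix `X` and its column scales are assumptions for illustration); it is a side note and does not change the graded answer, which uses the unsquared ratio.

```python
import numpy as np
from scipy.linalg import svd

# Synthetic data with columns of very different spread (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * np.array([5.0, 3.0, 1.0, 0.5])
X = X - X.mean(axis=0)  # center each column

_, sigma, _ = svd(X, full_matrices=False)

ratio_sigma = sigma / sigma.sum()               # ratio used in this activity
ratio_sigma_sq = sigma**2 / (sigma**2).sum()    # ratio based on variance

# For centered data, the squared singular values divided by (n - 1)
# add up to the total sample variance of the columns.
total_var = X.var(axis=0, ddof=1).sum()
print(np.isclose((sigma**2).sum() / (X.shape[0] - 1), total_var))

print(ratio_sigma)
print(ratio_sigma_sq)
```

Squaring gives more weight to the leading components, so the two ratios can suggest different component counts for the same threshold.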


[Back to top](#Index:) 

## Problem 4

### Cumulative Variance Explained

**5 Points**

Using the solution to Problem 3, how many principal components, taken in descending order of variance explained, can be kept while retaining up to 80% of the explained variance (that is, without the cumulative total exceeding 0.80)?  Assign your response to `ans4` below as an integer. 

**HINT**: explore the `np.cumsum` function.

In [18]:
### GRADED

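# cumulative proportion of variance explained, in descending order
print(np.cumsum(percent_variance_explained))
# 21 components reach about 0.791; the 22nd pushes the total past 0.80
ans4 = 21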

print(type(ans4))
print(ans4)

[0.08757692 0.14576166 0.19370719 0.23970772 0.27948317 0.31513422
 0.35026271 0.38473414 0.41867326 0.45244718 0.48582951 0.51854891
 0.5511652  0.58337053 0.61509345 0.645868   0.67623342 0.70563948
 0.73491301 0.76358878 0.79116773 0.81691206 0.8419786  0.86444445
 0.88537123 0.90372372 0.92134589 0.93808614 0.9525182  0.96560186
 0.97798634 0.98974506 1.         1.         1.        ]
<class 'int'>
21
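
Rather than reading the count off the printed array, it can be derived from the cumulative sum directly. The sketch below uses a short hypothetical array of proportions (not the housing results) and shows two readings of the threshold: the largest count that stays at or below 80%, and the smallest count that reaches at least 80%.

```python
import numpy as np

# Hypothetical proportions in descending order; they sum to 1.
ratios = np.array([0.40, 0.25, 0.20, 0.10, 0.05])
threshold = 0.80

cumulative = np.cumsum(ratios)  # running total: 0.40, 0.65, 0.85, 0.95, 1.00

# "Up to" reading: how many components keep the total at or below the threshold.
n_up_to = int(np.sum(cumulative <= threshold))

# "At least" reading: first position where the total reaches the threshold.
n_at_least = int(np.searchsorted(cumulative, threshold) + 1)

print(n_up_to, n_at_least)
```

In this toy example the two readings differ by one component, so it is worth being explicit about which one a question asks for.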
