# Vertical Line Test

## 1.1 Create two graphs, one that passes the vertical line test and one that does not.

## 1.2 Why are graphs that don't pass the vertical line test not considered "functions?"

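Because at least one x-value is paired with more than one y-value: a vertical line crosses the graph at more than one point, so the graph does not assign a single output to that input.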

# Functions as Relations

## 2.1 Which of the following relations are functions? Why?

\begin{align}
\text{Relation 1: } \{(1, 2), (3, 2), (1, 3)\}
\\
\text{Relation 2: } \{(1, 3), (2, 3), (6, 7)\}
\\
\text{Relation 3: } \{(9, 4), (2, 1), (9, 6)\}
\\
\text{Relation 4: } \{(6, 2), (8, 3), (6, 4)\}
\\
\text{Relation 5: } \{(2, 6), (2, 7), (2, 4)\}
\end{align}

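Relation 2, because it is the only relation in which no x-value (input) is paired with more than one y-value (output). Every other relation repeats an x-value with different y-values.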
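
The same check can be sketched in a few lines of Python. The helper name `is_function` is made up for this illustration; it simply collects the outputs seen for each input and confirms that each input has exactly one.

```python
from collections import defaultdict

def is_function(relation):
    """Return True if no input (x) is paired with more than one output (y)."""
    outputs = defaultdict(set)
    for x, y in relation:
        outputs[x].add(y)
    return all(len(ys) == 1 for ys in outputs.values())

# The five relations from 2.1, keyed by their number.
relations = {
    1: {(1, 2), (3, 2), (1, 3)},
    2: {(1, 3), (2, 3), (6, 7)},
    3: {(9, 4), (2, 1), (9, 6)},
    4: {(6, 2), (8, 3), (6, 4)},
    5: {(2, 6), (2, 7), (2, 4)},
}

results = {name: is_function(pairs) for name, pairs in relations.items()}
for name, ok in results.items():
    print(f"Relation {name}: {'function' if ok else 'not a function'}")
```

Note that Relation 2 repeats a y-value (3), which is fine: two inputs may share an output, but one input may not have two outputs.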

# Functions as a Mapping Between Dimensions


## 3.1 For the following functions, what is the dimensionality of the domain (input) and codomain (range/output)?

\begin{align}
m(x_1, x_2, x_3) = (x_1+x_2, x_1+x_3, x_2+x_3)
\\
n(x_1, x_2, x_3, x_4) = (x_2^2 + x_3, x_2x_4)
\end{align}
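
One way to read off the dimensions is to write each function in Python and count the inputs it takes and the outputs it returns. This is only a sketch: the sample input values are arbitrary and chosen for illustration.

```python
def m(x1, x2, x3):
    return (x1 + x2, x1 + x3, x2 + x3)

def n(x1, x2, x3, x4):
    return (x2**2 + x3, x2 * x4)

# Arbitrary sample inputs; only their lengths matter here.
m_in, m_out = (1, 2, 3), m(1, 2, 3)
n_in, n_out = (1, 2, 3, 4), n(1, 2, 3, 4)

print(f"m: R^{len(m_in)} -> R^{len(m_out)}")
print(f"n: R^{len(n_in)} -> R^{len(n_out)}")
```

Under this reading, `m` maps a 3-dimensional input to a 3-dimensional output, and `n` maps a 4-dimensional input to a 2-dimensional output.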

## 3.2 Do you think it's possible to create a function that maps from a lower dimensional space to a higher dimensional space? If so, provide an example.
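
One possible answer is yes. As a sketch, the hypothetical function `helix` below takes a single number (a 1-dimensional input) and returns a point in 3-dimensional space. The outputs all lie on a curve, so the function does not fill the higher-dimensional space, but it is still a valid mapping from lower to higher dimension.

```python
import math

def helix(t):
    """Map a single number t (1-D input) to a point in 3-D space."""
    return (math.cos(t), math.sin(t), t)

point = helix(0.0)
print(len(point), point)
```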

# Vector Transformations

## 4.1 Plug the corresponding unit vectors into each function. Use the output vectors to create a transformation matrix.

\begin{align}
p(\begin{bmatrix}x_1 \\ x_2 \end{bmatrix}) = \begin{bmatrix} x_1 + 3x_2 \\2 x_2 - x_1 \\  \end{bmatrix}
\\
\\
q(\begin{bmatrix}x_1 \\ x_2 \\ x_3\end{bmatrix}) = \begin{bmatrix} 4x_1 + x_2 + 2x_3 \\2 x_2 - x_1 + 3x_3 \\ 5x_1 - 2x_3 + x_2  \end{bmatrix}
\end{align}

## 4.2 Verify that your transformation matrices are correct by choosing an input vector and calculating the result both via the traditional functions above and also via matrix-vector multiplication.
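
The procedure in 4.1 and the check in 4.2 can be sketched in plain Python. The helper names `transformation_matrix` and `matvec` and the test vectors are assumptions made for this illustration; the key idea is that the output for each unit vector becomes one column of the matrix.

```python
def p(v):
    x1, x2 = v
    return [x1 + 3 * x2, 2 * x2 - x1]

def q(v):
    x1, x2, x3 = v
    return [4 * x1 + x2 + 2 * x3, 2 * x2 - x1 + 3 * x3, 5 * x1 - 2 * x3 + x2]

def transformation_matrix(f, dim):
    """Apply f to each unit vector; the outputs become the matrix columns."""
    columns = [f([1 if i == j else 0 for i in range(dim)]) for j in range(dim)]
    # Transpose the list of columns into a list of rows.
    return [list(row) for row in zip(*columns)]

def matvec(matrix, v):
    """Multiply a matrix (list of rows) by a column vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]

P = transformation_matrix(p, 2)
Q = transformation_matrix(q, 3)
print("P =", P)
print("Q =", Q)

# Arbitrary test vectors: both routes should agree.
v, w = [2, 5], [1, 2, 3]
print(p(v), matvec(P, v))
print(q(w), matvec(Q, w))
```

If the function and the matrix disagree for some input, the usual culprit is a term order mix-up, such as the `2 x_2 - x_1` row, where the coefficient of `x_1` is -1.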

# Eigenvalues and Eigenvectors

## 5.1 In your own words, give an explanation for the intuition behind eigenvalues and eigenvectors.
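
A small numeric example may help anchor the intuition: an eigenvector is a direction that a transformation only stretches or shrinks, and the eigenvalue is the stretch factor. The matrix `A` and its eigenpairs below were picked by hand for illustration and do not come from the assignment.

```python
def matvec(matrix, v):
    """Multiply a matrix (list of rows) by a column vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]

A = [[2, 1],
     [1, 2]]

# (vector, eigenvalue) pairs worked out by hand for this particular A.
eigenpairs = [([1, 1], 3), ([1, -1], 1)]

for v, lam in eigenpairs:
    print(v, "->", matvec(A, v), "which is", lam, "times the input")

# A vector that is not an eigenvector gets knocked off its own line.
other = [1, 0]
print(other, "->", matvec(A, other))
```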

# The Curse of Dimensionality

## 6.1 What are some of the challenges of working with high dimensional spaces?
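
One challenge can be shown with simple arithmetic: as the number of dimensions grows, almost all of a region's volume ends up near its boundary, so data points become sparse and sit far from one another. The sketch below uses a hypothetical helper `inner_fraction` to compute how much of a unit hypercube lies at least 5% away from every face.

```python
def inner_fraction(dim, margin=0.05):
    """Fraction of a unit hypercube's volume at least `margin` from every face."""
    # Trimming `margin` off both ends of each axis leaves a side of 1 - 2*margin.
    return (1 - 2 * margin) ** dim

for dim in (1, 2, 10, 100):
    print(f"{dim:>3} dims: {inner_fraction(dim):.4f} of the volume is in the interior")
```

The interior fraction is `0.9 ** dim`, which shrinks toward zero as `dim` grows. The same number of observations therefore covers a high-dimensional space far more thinly than a low-dimensional one.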

## 6.2 What is the rule of thumb for how many observations you should have compared to parameters in your model?

# Principal Component Analysis

## 7.1 Code for loading and cleaning the 2013 national housing dataset from the [Housing Affordability Data System (HADS)](https://www.huduser.gov/portal/datasets/hads/hads.html) can be found below.

## Perform PCA on the processed dataset `national_processed` (Make sure you standardize your data!) and then make a scatterplot of PC1 against PC2. Some of our discussion and work around PCA with this dataset will continue during tomorrow's lecture and assignment.

Not only does this dataset have a decent number of columns to begin with (99), but in preparing the data for PCA we also have to convert all of the categorical variables to numbers. [One-hot encoding](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f#targetText=One%20hot%20encoding%20is%20a,the%20entry%20in%20the%20dataset.) does this by creating a new column for each individual category of each categorical variable. One-hot encoded, this dataset has 64738 columns. That's a lot of columns.

Don't worry too much about the mechanics of one-hot encoding right now; you will learn and experiment with a whole bunch of categorical encoding approaches in unit 2.

The code below will read in the dataset and encode the categorical variables, here by replacing each category with an integer code so that the processed dataset keeps its 99 columns. Start adding your PCA code at the bottom of the provided code.
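
For a feel of what one-hot encoding does, compared with simply numbering the categories (which is what `.cat.codes` does later in this notebook), here is a minimal pure-Python sketch. The helper names and the `tenure` values are made up for illustration and are not taken from HADS.

```python
def integer_codes(values):
    """Replace each category with its position in the sorted list of categories."""
    categories = sorted(set(values))
    lookup = {cat: i for i, cat in enumerate(categories)}
    return [lookup[v] for v in values]

def one_hot(values):
    """Create one 0/1 column per category."""
    categories = sorted(set(values))
    return categories, [[int(v == cat) for cat in categories] for v in values]

# Hypothetical categorical column.
tenure = ["Owner", "Renter", "Owner", "Vacant"]

codes = integer_codes(tenure)
columns, rows = one_hot(tenure)

print(codes)
print(columns)
for row in rows:
    print(row)
```

Integer codes keep one column per variable but impose an arbitrary ordering on the categories. One-hot encoding avoids that ordering at the cost of one new column per category, which is why an identifier-like column with tens of thousands of unique values makes the column count explode.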

In [4]:
from urllib.request import urlopen
from zipfile import ZipFile
from io import BytesIO
import os.path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Read National Data
national_url = 'https://www.huduser.gov/portal/datasets/hads/hads2013n_ASCII.zip'
national_file = 'thads2013n.txt'

if os.path.exists(national_file):
    national = pd.read_csv(national_file)
else: 
    z_national = urlopen(national_url)
    zip_national = ZipFile(BytesIO(z_national.read())).extract(national_file)
    national = pd.read_csv(zip_national)

print(national.shape)
national.head()

(64535, 99)


Unnamed: 0,CONTROL,AGE1,METRO3,REGION,LMED,FMR,L30,L50,L80,IPOV,BEDRMS,BUILT,STATUS,TYPE,VALUE,VACANCY,TENURE,NUNITS,ROOMS,WEIGHT,PER,ZINC2,ZADEQ,ZSMHC,STRUCTURETYPE,OWNRENT,UTILITY,OTHERCOST,COST06,COST12,COST08,COSTMED,TOTSAL,ASSISTED,GLMED,GL30,GL50,GL80,APLMED,ABL30,...,COST08RELPOVCAT,COST08RELFMRPCT,COST08RELFMRCAT,COST12RELAMIPCT,COST12RELAMICAT,COST12RELPOVPCT,COST12RELPOVCAT,COST12RELFMRPCT,COST12RELFMRCAT,COSTMedRELAMIPCT,COSTMedRELAMICAT,COSTMedRELPOVPCT,COSTMedRELPOVCAT,COSTMedRELFMRPCT,COSTMedRELFMRCAT,FMTZADEQ,FMTMETRO3,FMTBUILT,FMTSTRUCTURETYPE,FMTBEDRMS,FMTOWNRENT,FMTCOST06RELPOVCAT,FMTCOST08RELPOVCAT,FMTCOST12RELPOVCAT,FMTCOSTMEDRELPOVCAT,FMTINCRELPOVCAT,FMTCOST06RELFMRCAT,FMTCOST08RELFMRCAT,FMTCOST12RELFMRCAT,FMTCOSTMEDRELFMRCAT,FMTINCRELFMRCAT,FMTCOST06RELAMICAT,FMTCOST08RELAMICAT,FMTCOST12RELAMICAT,FMTCOSTMEDRELAMICAT,FMTINCRELAMICAT,FMTASSISTED,FMTBURDEN,FMTREGION,FMTSTATUS
0,'100003130103',82,'3','1',73738,956,15738,26213,40322,11067,2,2006,'1',1,40000,-6,'1',1,6,3117.394239,1,18021,'1',533,1,'1',169.0,213.75,648.588189,803.050535,696.905247,615.156712,0,-9,73738,15738,26213,40322,51616.6,20234.571429,...,4,72.898038,2,48.402635,2,290.250487,4,84.001102,2,37.077624,2,222.339102,4,64.346936,2,'1 Adequate','-5','2000-2009','1 Single Family','2 2BR','1 Owner','4 200%+ Poverty','4 200%+ Poverty','4 200%+ Poverty','4 200%+ Poverty','3 150-200% Poverty','2 50.1 - 100% FMR','2 50.1 - 100% FMR','2 50.1 - 100% FMR','2 50.1 - 100% FMR','1 LTE 50% FMR','2 30 - 50% AMI','2 30 - 50% AMI','2 30 - 50% AMI','2 30 - 50% AMI','2 30 - 50% AMI','.','2 30% to 50%','-5','-5'
1,'100006110249',50,'5','3',55846,1100,17165,28604,45744,24218,4,1980,'1',1,130000,-6,'1',1,6,2150.725544,4,122961,'1',487,1,'1',245.333333,58.333333,1167.640781,1669.643405,1324.671218,1058.988479,123000,-9,55846,17165,28604,45744,55846.0,19911.4,...,4,120.424656,3,103.094063,6,275.768999,4,151.785764,3,65.388468,4,174.90932,3,96.27168,2,'1 Adequate','-5','1980-1989','1 Single Family','4 4BR+','1 Owner','3 150-200% Poverty','4 200%+ Poverty','4 200%+ Poverty','3 150-200% Poverty','4 200%+ Poverty','3 GT FMR','3 GT FMR','3 GT FMR','2 50.1 - 100% FMR','3 GT FMR','4 60 - 80% AMI','4 60 - 80% AMI','6 100 - 120% AMI','4 60 - 80% AMI','7 120% AMI +','.','1 Less than 30%','-5','-5'
2,'100006370140',53,'5','3',55846,1100,13750,22897,36614,15470,4,1985,'1',1,150000,-6,'1',1,7,2213.789404,2,27974,'1',1405,1,'1',159.0,37.5,1193.393209,1772.627006,1374.582175,1068.025168,28000,-9,55846,13750,22897,36614,44676.8,19937.5,...,4,124.962016,3,109.452905,6,458.339239,4,161.14791,3,65.946449,4,276.15389,4,97.093197,2,'1 Adequate','-5','1980-1989','1 Single Family','4 4BR+','1 Owner','4 200%+ Poverty','4 200%+ Poverty','4 200%+ Poverty','4 200%+ Poverty','3 150-200% Poverty','3 GT FMR','3 GT FMR','3 GT FMR','2 50.1 - 100% FMR','2 50.1 - 100% FMR','4 60 - 80% AMI','5 80 - 100% AMI','6 100 - 120% AMI','4 60 - 80% AMI','4 60 - 80% AMI','.','3 50% or More','-5','-5'
3,'100006520140',67,'5','3',55846,949,13750,22897,36614,13964,3,1985,'1',1,200000,-6,'1',1,6,2364.585097,2,32220,'1',279,1,'1',179.0,70.666667,1578.857612,2351.169341,1820.4429,1411.700224,0,-9,55846,13750,22897,36614,44676.8,17875.0,...,4,191.827492,3,161.926709,7,673.494512,4,247.752301,3,97.224801,5,404.382763,4,148.75661,3,'1 Adequate','-5','1980-1989','1 Single Family','3 3BR','1 Owner','4 200%+ Poverty','4 200%+ Poverty','4 200%+ Poverty','4 200%+ Poverty','4 200%+ Poverty','3 GT FMR','3 GT FMR','3 GT FMR','3 GT FMR','2 50.1 - 100% FMR','6 100 - 120% AMI','7 120% AMI +','7 120% AMI +','5 80 - 100% AMI','4 60 - 80% AMI','.','1 Less than 30%','-5','-5'
4,'100007130148',26,'1','3',60991,737,14801,24628,39421,15492,2,1980,'1',1,-6,-6,'2',100,4,2314.524902,2,96874,'1',759,5,'2',146.0,12.5,759.0,759.0,759.0,759.0,96900,0,60991,14801,24628,39421,48792.8,16651.125,...,3,102.985075,3,55.308707,3,195.972115,3,102.985075,3,55.308707,3,195.972115,3,102.985075,3,'1 Adequate','Central City','1980-1989','5 50+ units','2 2BR','2 Renter','3 150-200% Poverty','3 150-200% Poverty','3 150-200% Poverty','3 150-200% Poverty','4 200%+ Poverty','3 GT FMR','3 GT FMR','3 GT FMR','3 GT FMR','3 GT FMR','3 50 - 60% AMI','3 50 - 60% AMI','3 50 - 60% AMI','3 50 - 60% AMI','7 120% AMI +','0 Not Assisted','1 Less than 30%','-5','-5'


In [5]:
national.describe()

Unnamed: 0,AGE1,LMED,FMR,L30,L50,L80,IPOV,BEDRMS,BUILT,TYPE,VALUE,VACANCY,NUNITS,ROOMS,WEIGHT,PER,ZINC2,ZSMHC,STRUCTURETYPE,UTILITY,OTHERCOST,COST06,COST12,COST08,COSTMED,TOTSAL,ASSISTED,GLMED,GL30,GL50,GL80,APLMED,ABL30,ABL50,ABL80,ABLMED,BURDEN,INCRELAMIPCT,INCRELAMICAT,INCRELPOVPCT,INCRELPOVCAT,INCRELFMRPCT,INCRELFMRCAT,COST06RELAMIPCT,COST06RELAMICAT,COST06RELPOVPCT,COST06RELPOVCAT,COST06RELFMRPCT,COST06RELFMRCAT,COST08RELAMIPCT,COST08RELAMICAT,COST08RELPOVPCT,COST08RELPOVCAT,COST08RELFMRPCT,COST08RELFMRCAT,COST12RELAMIPCT,COST12RELAMICAT,COST12RELPOVPCT,COST12RELPOVCAT,COST12RELFMRPCT,COST12RELFMRCAT,COSTMedRELAMIPCT,COSTMedRELAMICAT,COSTMedRELPOVPCT,COSTMedRELPOVCAT,COSTMedRELFMRPCT,COSTMedRELFMRCAT
count,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0
mean,47.968932,68109.744309,1164.395181,17712.125436,29511.032076,46207.56763,15986.664229,2.660308,1966.432835,1.06536,142032.8,-5.458433,14.889052,5.631208,1885.195459,1.886015,61355.84,1061.613744,1.920973,183.099576,67.87303,1525.576425,2074.054208,1697.144663,1406.865145,44910.928117,-5.516262,68109.744309,17712.125436,29511.032076,46207.56763,53643.917778,20897.959515,34818.57682,54521.716416,67198.011519,2.063421,104.784829,3.377795,358.745915,2.280158,127.677315,1.463113,87.021707,3.949144,352.582333,2.487921,124.754094,2.376307,96.214197,4.162997,392.790898,2.526288,137.651042,2.424901,116.408711,4.501278,481.12305,2.574758,165.983691,2.48899,80.661251,3.7709,324.761274,2.450376,115.830454,2.328922
std,22.869374,12371.177175,394.119188,4441.564491,7407.01241,10640.616965,7219.688394,1.093778,26.304726,0.459341,249027.8,2.01973,54.746222,1.904533,1245.591074,2.566334,74405.3,982.274571,1.470134,130.169614,145.52833,1671.829974,2596.919689,1957.423626,1477.759879,64806.938997,4.499224,12371.177175,4441.564491,7407.01241,10640.616965,20217.626363,4760.655951,7938.626857,11155.611871,15999.59066,116.039132,123.34988,4.086886,430.187422,3.282772,144.158419,2.957551,84.323299,2.037007,422.726511,3.282635,109.648711,0.725381,98.155386,2.103802,492.45684,3.287995,127.422272,0.722485,129.60662,2.195668,648.790025,3.294774,168.183589,0.713639,75.058303,1.975262,375.384088,3.276692,97.841114,0.72654
min,-9.0,38500.0,394.0,10257.0,17057.0,27307.0,-6.0,0.0,1919.0,1.0,-6.0,-6.0,-7.0,1.0,0.0,-6.0,-117.0,-6.0,-9.0,0.0,0.0,0.0,0.0,0.0,0.0,-9.0,-9.0,38500.0,10257.0,17057.0,27307.0,-9.0,10293.0,17126.0,27403.0,28875.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,0.0,1.0,-9.0,-9.0,0.0,1.0,0.0,1.0,-9.0,-9.0,0.0,1.0,0.0,1.0,-9.0,-9.0,0.0,1.0,0.0,1.0,-9.0,-9.0,0.0,1.0
25%,35.0,60300.0,888.0,14449.0,24050.0,38150.0,12019.0,2.0,1950.0,1.0,-6.0,-6.0,1.0,4.0,780.192857,1.0,14987.0,424.0,1.0,91.0,0.0,682.0,750.0,710.0,658.0,0.0,-9.0,60300.0,14449.0,24050.0,38150.0,45199.8,17535.857143,29216.25,46660.0,56032.2,0.116927,29.692138,1.0,99.700897,1.0,37.578125,1.0,45.383905,2.0,141.45916,2.0,70.622475,2.0,47.433018,2.0,146.966122,2.0,73.934763,2.0,50.513942,2.0,155.065135,3.0,78.760563,2.0,43.597884,2.0,136.58077,2.0,67.620731,2.0
50%,50.0,64600.0,1100.0,16829.0,28053.0,44605.0,15452.0,3.0,1970.0,1.0,60000.0,-6.0,1.0,5.0,2010.723837,2.0,39987.0,838.0,1.0,166.25,31.666667,1100.0,1329.0,1180.304783,1045.0,24000.0,-9.0,64600.0,16829.0,28053.0,44605.0,53280.0,20077.942857,33455.625,53043.75,65025.0,0.220192,73.539308,4.0,245.196444,4.0,91.007384,2.0,68.315916,4.0,253.757542,4.0,102.924791,3.0,73.215301,4.0,271.364249,4.0,109.641873,3.0,82.699789,5.0,308.246798,4.0,122.699387,3.0,64.724154,4.0,240.498939,4.0,97.930142,2.0
75%,64.0,74008.0,1390.0,20011.0,33334.0,52086.0,18612.0,3.0,1985.0,1.0,200000.0,-6.0,4.0,7.0,2625.034116,3.0,81231.0,1399.0,3.0,249.0,83.333333,1825.181234,2485.131774,2031.718595,1683.060292,67000.0,0.0,74008.0,20011.0,33334.0,52086.0,63727.2,23362.3,38937.6,61295.0,76232.0,0.37598,140.174135,7.0,474.693878,4.0,173.702863,3.0,103.338537,6.0,431.626389,4.0,148.924146,3.0,115.153649,6.0,481.78561,4.0,165.965152,3.0,142.512939,7.0,596.378735,4.0,206.979921,3.0,95.516505,5.0,398.416412,4.0,137.76935,3.0
max,93.0,115300.0,3511.0,42550.0,70850.0,111450.0,51635.0,7.0,2013.0,9.0,2520000.0,5.0,998.0,15.0,38474.709667,20.0,1061921.0,10667.0,6.0,1613.0,6856.0,19261.472577,28992.600365,22305.447202,17155.289494,698886.0,1.0,115300.0,42550.0,70850.0,111450.0,217448.16,48302.222222,80560.0,113296.2963,160968.0,15600.0,1927.748157,7.0,6957.363063,4.0,2030.15272,3.0,1475.285156,7.0,6819.091923,4.0,2193.961702,3.0,1732.97613,7.0,7919.290576,4.0,2577.333821,3.0,2299.084334,7.0,10336.261088,4.0,3419.544546,3.0,1296.983943,7.0,6057.843858,4.0,1928.699349,3.0


In [6]:
# Look at datatypes
# a lot of object datatypes even though they seem to be strings of numbers.
national.dtypes

CONTROL            object
AGE1                int64
METRO3             object
REGION             object
LMED                int64
                    ...  
FMTINCRELAMICAT    object
FMTASSISTED        object
FMTBURDEN          object
FMTREGION          object
FMTSTATUS          object
Length: 99, dtype: object

In [7]:
# check for null values
national.isnull().sum().any()

False

In [8]:
# check for number of categorical vs numeric columns
cat_cols = national.columns[national.dtypes=='object']
num_cols = national.columns[national.dtypes!='object']

print(f'{len(cat_cols)} categorical columns')
print(f'{len(num_cols)} numerical columns')

32 categorical columns
67 numerical columns


In [9]:
# We're making a copy of our data in case we mess something up.
national_processed = national.copy()

# Categorically Encode our Variables:
# They need to all be numeric before we do PCA.
# https://pbpython.com/categorical-encoding.html

# Cast categorical columns to "category" data type
national_processed[cat_cols] = national_processed[cat_cols].astype('category')

national_processed.dtypes

CONTROL            category
AGE1                  int64
METRO3             category
REGION             category
LMED                  int64
                     ...   
FMTINCRELAMICAT    category
FMTASSISTED        category
FMTBURDEN          category
FMTREGION          category
FMTSTATUS          category
Length: 99, dtype: object

In [10]:
# Replace all category cell values with their numeric category codes
for col in cat_cols:
    national_processed[col] = national_processed[col].cat.codes

print(national_processed.shape)
national_processed.head()

(64535, 99)


Unnamed: 0,CONTROL,AGE1,METRO3,REGION,LMED,FMR,L30,L50,L80,IPOV,BEDRMS,BUILT,STATUS,TYPE,VALUE,VACANCY,TENURE,NUNITS,ROOMS,WEIGHT,PER,ZINC2,ZADEQ,ZSMHC,STRUCTURETYPE,OWNRENT,UTILITY,OTHERCOST,COST06,COST12,COST08,COSTMED,TOTSAL,ASSISTED,GLMED,GL30,GL50,GL80,APLMED,ABL30,...,COST08RELPOVCAT,COST08RELFMRPCT,COST08RELFMRCAT,COST12RELAMIPCT,COST12RELAMICAT,COST12RELPOVPCT,COST12RELPOVCAT,COST12RELFMRPCT,COST12RELFMRCAT,COSTMedRELAMIPCT,COSTMedRELAMICAT,COSTMedRELPOVPCT,COSTMedRELPOVCAT,COSTMedRELFMRPCT,COSTMedRELFMRCAT,FMTZADEQ,FMTMETRO3,FMTBUILT,FMTSTRUCTURETYPE,FMTBEDRMS,FMTOWNRENT,FMTCOST06RELPOVCAT,FMTCOST08RELPOVCAT,FMTCOST12RELPOVCAT,FMTCOSTMEDRELPOVCAT,FMTINCRELPOVCAT,FMTCOST06RELFMRCAT,FMTCOST08RELFMRCAT,FMTCOST12RELFMRCAT,FMTCOSTMEDRELFMRCAT,FMTINCRELFMRCAT,FMTCOST06RELAMICAT,FMTCOST08RELAMICAT,FMTCOST12RELAMICAT,FMTCOSTMEDRELAMICAT,FMTINCRELAMICAT,FMTASSISTED,FMTBURDEN,FMTREGION,FMTSTATUS
0,0,82,2,0,73738,956,15738,26213,40322,11067,2,2006,0,1,40000,-6,1,1,6,3117.394239,1,18021,1,533,1,0,169.0,213.75,648.588189,803.050535,696.905247,615.156712,0,-9,73738,15738,26213,40322,51616.6,20234.571429,...,4,72.898038,2,48.402635,2,290.250487,4,84.001102,2,37.077624,2,222.339102,4,64.346936,2,1,0,5,1,2,0,4,4,4,4,3,1,1,1,1,1,1,1,1,1,2,0,2,0,0
1,1,50,4,2,55846,1100,17165,28604,45744,24218,4,1980,0,1,130000,-6,1,1,6,2150.725544,4,122961,1,487,1,0,245.333333,58.333333,1167.640781,1669.643405,1324.671218,1058.988479,123000,-9,55846,17165,28604,45744,55846.0,19911.4,...,4,120.424656,3,103.094063,6,275.768999,4,151.785764,3,65.388468,4,174.90932,3,96.27168,2,1,0,3,1,4,0,3,4,4,3,4,2,2,2,1,3,3,3,5,3,7,0,1,0,0
2,2,53,4,2,55846,1100,13750,22897,36614,15470,4,1985,0,1,150000,-6,1,1,7,2213.789404,2,27974,1,1405,1,0,159.0,37.5,1193.393209,1772.627006,1374.582175,1068.025168,28000,-9,55846,13750,22897,36614,44676.8,19937.5,...,4,124.962016,3,109.452905,6,458.339239,4,161.14791,3,65.946449,4,276.15389,4,97.093197,2,1,0,3,1,4,0,4,4,4,4,3,2,2,2,1,2,3,4,5,3,4,0,3,0,0
3,3,67,4,2,55846,949,13750,22897,36614,13964,3,1985,0,1,200000,-6,1,1,6,2364.585097,2,32220,1,279,1,0,179.0,70.666667,1578.857612,2351.169341,1820.4429,1411.700224,0,-9,55846,13750,22897,36614,44676.8,17875.0,...,4,191.827492,3,161.926709,7,673.494512,4,247.752301,3,97.224801,5,404.382763,4,148.75661,3,1,0,3,1,3,0,4,4,4,4,4,2,2,2,2,2,5,6,6,4,4,0,1,0,0
4,4,26,0,2,60991,737,14801,24628,39421,15492,2,1980,0,1,-6,-6,2,100,4,2314.524902,2,96874,1,759,5,1,146.0,12.5,759.0,759.0,759.0,759.0,96900,0,60991,14801,24628,39421,48792.8,16651.125,...,3,102.985075,3,55.308707,3,195.972115,3,102.985075,3,55.308707,3,195.972115,3,102.985075,3,1,1,3,5,2,1,3,3,3,3,4,2,2,2,2,3,2,2,2,2,7,1,1,0,0


In [11]:
national_processed.describe()

Unnamed: 0,CONTROL,AGE1,METRO3,REGION,LMED,FMR,L30,L50,L80,IPOV,BEDRMS,BUILT,STATUS,TYPE,VALUE,VACANCY,TENURE,NUNITS,ROOMS,WEIGHT,PER,ZINC2,ZADEQ,ZSMHC,STRUCTURETYPE,OWNRENT,UTILITY,OTHERCOST,COST06,COST12,COST08,COSTMED,TOTSAL,ASSISTED,GLMED,GL30,GL50,GL80,APLMED,ABL30,...,COST08RELPOVCAT,COST08RELFMRPCT,COST08RELFMRCAT,COST12RELAMIPCT,COST12RELAMICAT,COST12RELPOVPCT,COST12RELPOVCAT,COST12RELFMRPCT,COST12RELFMRCAT,COSTMedRELAMIPCT,COSTMedRELAMICAT,COSTMedRELPOVPCT,COSTMedRELPOVCAT,COSTMedRELFMRPCT,COSTMedRELFMRCAT,FMTZADEQ,FMTMETRO3,FMTBUILT,FMTSTRUCTURETYPE,FMTBEDRMS,FMTOWNRENT,FMTCOST06RELPOVCAT,FMTCOST08RELPOVCAT,FMTCOST12RELPOVCAT,FMTCOSTMEDRELPOVCAT,FMTINCRELPOVCAT,FMTCOST06RELFMRCAT,FMTCOST08RELFMRCAT,FMTCOST12RELFMRCAT,FMTCOSTMEDRELFMRCAT,FMTINCRELFMRCAT,FMTCOST06RELAMICAT,FMTCOST08RELAMICAT,FMTCOST12RELAMICAT,FMTCOSTMEDRELAMICAT,FMTINCRELAMICAT,FMTASSISTED,FMTBURDEN,FMTREGION,FMTSTATUS
count,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,...,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0,64535.0
mean,32267.0,47.968932,1.227179,1.394406,68109.744309,1164.395181,17712.125436,29511.032076,46207.56763,15986.664229,2.660308,1966.432835,0.068769,1.06536,142032.8,-5.458433,1.320663,14.889052,5.631208,1885.195459,1.886015,61355.84,1.000496,1061.613744,1.920973,0.424405,183.099576,67.87303,1525.576425,2074.054208,1697.144663,1406.865145,44910.928117,-5.516262,68109.744309,17712.125436,29511.032076,46207.56763,53643.917778,20897.959515,...,2.526288,137.651042,2.424901,116.408711,4.501278,481.12305,2.574758,165.983691,2.48899,80.661251,3.7709,324.761274,2.450376,115.830454,2.328922,1.000496,0.333044,2.259053,1.921252,2.61156,0.424405,3.106841,3.145208,3.193678,3.069296,2.899078,1.376307,1.424901,1.48899,1.328922,2.082033,2.949144,3.162997,3.501278,2.7709,3.996715,0.478237,1.497823,0.175967,0.0
std,18629.794148,22.869374,1.26946,1.050114,12371.177175,394.119188,4441.564491,7407.01241,10640.616965,7219.688394,1.093778,26.304726,0.253063,0.459341,249027.8,2.01973,0.618766,54.746222,1.904533,1245.591074,2.566334,74405.3,0.417153,982.274571,1.470134,0.494256,130.169614,145.52833,1671.829974,2596.919689,1957.423626,1477.759879,64806.938997,4.499224,12371.177175,4441.564491,7407.01241,10640.616965,20217.626363,4760.655951,...,3.287995,127.422272,0.722485,129.60662,2.195668,648.790025,3.294774,168.183589,0.713639,75.058303,1.975262,375.384088,3.276692,97.841114,0.72654,0.417153,0.471306,1.58569,1.468915,0.998857,0.494256,1.32006,1.315403,1.309547,1.322897,1.414442,0.725381,0.722485,0.713639,0.72654,0.991235,2.037007,2.103802,2.195668,1.975262,2.562805,0.674264,0.913932,0.380795,0.0
min,0.0,-9.0,0.0,0.0,38500.0,394.0,10257.0,17057.0,27307.0,-6.0,0.0,1919.0,0.0,1.0,-6.0,-6.0,0.0,-7.0,1.0,0.0,-6.0,-117.0,0.0,-6.0,-9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-9.0,-9.0,38500.0,10257.0,17057.0,27307.0,-9.0,10293.0,...,-9.0,0.0,1.0,0.0,1.0,-9.0,-9.0,0.0,1.0,0.0,1.0,-9.0,-9.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,16133.5,35.0,0.0,0.0,60300.0,888.0,14449.0,24050.0,38150.0,12019.0,2.0,1950.0,0.0,1.0,-6.0,-6.0,1.0,1.0,4.0,780.192857,1.0,14987.0,1.0,424.0,1.0,0.0,91.0,0.0,682.0,750.0,710.0,658.0,0.0,-9.0,60300.0,14449.0,24050.0,38150.0,45199.8,17535.857143,...,2.0,73.934763,2.0,50.513942,2.0,155.065135,3.0,78.760563,2.0,43.597884,2.0,136.58077,2.0,67.620731,2.0,1.0,0.0,1.0,1.0,2.0,0.0,2.0,2.0,3.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0
50%,32267.0,50.0,1.0,1.0,64600.0,1100.0,16829.0,28053.0,44605.0,15452.0,3.0,1970.0,0.0,1.0,60000.0,-6.0,1.0,1.0,5.0,2010.723837,2.0,39987.0,1.0,838.0,1.0,0.0,166.25,31.666667,1100.0,1329.0,1180.304783,1045.0,24000.0,-9.0,64600.0,16829.0,28053.0,44605.0,53280.0,20077.942857,...,4.0,109.641873,3.0,82.699789,5.0,308.246798,4.0,122.699387,3.0,64.724154,4.0,240.498939,4.0,97.930142,2.0,1.0,0.0,2.0,1.0,3.0,0.0,4.0,4.0,4.0,4.0,4.0,2.0,2.0,2.0,1.0,2.0,3.0,3.0,4.0,3.0,4.0,0.0,1.0,0.0,0.0
75%,48400.5,64.0,2.0,2.0,74008.0,1390.0,20011.0,33334.0,52086.0,18612.0,3.0,1985.0,0.0,1.0,200000.0,-6.0,2.0,4.0,7.0,2625.034116,3.0,81231.0,1.0,1399.0,3.0,1.0,249.0,83.333333,1825.181234,2485.131774,2031.718595,1683.060292,67000.0,0.0,74008.0,20011.0,33334.0,52086.0,63727.2,23362.3,...,4.0,165.965152,3.0,142.512939,7.0,596.378735,4.0,206.979921,3.0,95.516505,5.0,398.416412,4.0,137.76935,3.0,1.0,1.0,3.0,3.0,3.0,1.0,4.0,4.0,4.0,4.0,4.0,2.0,2.0,2.0,2.0,3.0,5.0,5.0,6.0,4.0,7.0,1.0,2.0,0.0,0.0
max,64534.0,93.0,4.0,3.0,115300.0,3511.0,42550.0,70850.0,111450.0,51635.0,7.0,2013.0,1.0,9.0,2520000.0,5.0,3.0,998.0,15.0,38474.709667,20.0,1061921.0,3.0,10667.0,6.0,1.0,1613.0,6856.0,19261.472577,28992.600365,22305.447202,17155.289494,698886.0,1.0,115300.0,42550.0,70850.0,111450.0,217448.16,48302.222222,...,4.0,2577.333821,3.0,2299.084334,7.0,10336.261088,4.0,3419.544546,3.0,1296.983943,7.0,6057.843858,4.0,1928.699349,3.0,3.0,1.0,6.0,6.0,4.0,1.0,4.0,4.0,4.0,4.0,4.0,2.0,2.0,2.0,2.0,3.0,6.0,6.0,6.0,6.0,7.0,2.0,4.0,1.0,0.0


In [0]:
# Now we only have numeric columns (ints and floats)
national_processed.dtypes

CONTROL                  int32
AGE1                     int64
METRO3                    int8
REGION                    int8
LMED                     int64
FMR                      int64
L30                      int64
L50                      int64
L80                      int64
IPOV                     int64
BEDRMS                   int64
BUILT                    int64
STATUS                    int8
TYPE                     int64
VALUE                    int64
VACANCY                  int64
TENURE                    int8
NUNITS                   int64
ROOMS                    int64
WEIGHT                 float64
PER                      int64
ZINC2                    int64
ZADEQ                     int8
ZSMHC                    int64
STRUCTURETYPE            int64
OWNRENT                   int8
UTILITY                float64
OTHERCOST              float64
COST06                 float64
COST12                 float64
                        ...   
COSTMedRELAMICAT         int64
COSTMedR

In [0]:
### Your Code Here
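
One possible starting point is sketched below, using only NumPy: standardize each column, then take the singular value decomposition and project onto the first two principal components. The function names `standardize` and `pca` and the synthetic `demo` array are assumptions made so the sketch runs on its own; they are not part of the provided notebook code. Constant columns (the summary table above shows `FMTSTATUS` with a standard deviation of 0) would cause a division by zero, so their scale is set to 1.

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit variance."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns stay at zero instead of dividing by 0
    return (X - X.mean(axis=0)) / std

def pca(X, n_components=2):
    """Project standardized data onto its first principal components via SVD."""
    Z = standardize(X)
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_components].T          # coordinates along PC1, PC2, ...
    explained_variance_ratio = (S ** 2) / np.sum(S ** 2)
    return scores, explained_variance_ratio

# Synthetic stand-in so the sketch runs on its own (an assumption, not HADS data):
# two correlated columns, one independent column, and one constant column.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
demo = np.hstack([
    base,
    2 * base + rng.normal(scale=0.1, size=(200, 1)),
    rng.normal(size=(200, 1)),
    np.ones((200, 1)),
])

scores, ratio = pca(demo)
print(scores.shape, ratio.round(3))
```

With the notebook's data, something like `scores, ratio = pca(national_processed.values)` followed by `plt.scatter(scores[:, 0], scores[:, 1])` should give the PC1 against PC2 scatterplot. scikit-learn's `StandardScaler` and `PCA` classes are a common alternative to this hand-rolled version.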

# Stretch Goals

## 1) Perform further data exploration on the HADS national dataset (the version before we encoded the categorical variables). Make scatterplots and see if you can see any resemblance between the original scatterplots and the plot of the principal components that you made in 7.1.

(You may or may not see very much resemblance depending on the variables you choose, and that's ok!)

## 2) Study "Scree Plots" and then try to make one for your PCA dataset. How many principal components do you need to retain in order for your PCs to account for 90% of the variance?

We will present this topic formally at the beginning of tomorrow's lecture, so if you figure this stretch goal out, you're ahead of the game. 
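
As a head start on the 90% question, the sketch below takes a list of explained variance ratios (one per principal component, largest first) and finds how many components are needed to reach a threshold. The helper name `components_needed` and the example ratios are made up for illustration; real values would come from your own PCA.

```python
from itertools import accumulate

def components_needed(explained_variance_ratio, threshold=0.90):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    for k, total in enumerate(accumulate(explained_variance_ratio), start=1):
        # Small tolerance guards against floating-point round-off in the running sum.
        if total >= threshold - 1e-12:
            return k
    return len(explained_variance_ratio)

# Made-up ratios for illustration only.
example_ratios = [0.55, 0.25, 0.12, 0.05, 0.03]
k = components_needed(example_ratios)
print(f"{k} components reach 90% of the variance")
```

A scree plot is the same information drawn as a chart: the individual ratios (or their running total) on the y-axis against the component number on the x-axis.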

## 3) Explore further the intuition behind eigenvalues and eigenvectors by creating your very own eigenfaces:

Prioritize self-study over this stretch goal if you are not semi-comfortable with the topics of PCA, Eigenvalues, and Eigenvectors.

![Eigenfaces](https://i.pinimg.com/236x/1c/f1/01/1cf101a9859437a5d096a04b05be06b4--faces-tattoo.jpg)

You don't necessarily have to use this resource, but this will get you started: 
[Eigenface Tutorial](https://sandipanweb.wordpress.com/2018/01/06/eigenfaces-and-a-simple-face-detector-with-pca-svd-in-python/)