# Overview

The goal of this Jupyter notebook is to understand what SageMaker really buys me, because scikit-learn offers free versions of many of the algorithms that SageMaker provides. So I am running this experiment, using PCA as the test case.

## Questions I am trying to answer:

1. Is SageMaker really about computing?  
    - If you are training a model on a lot of data, will it take too long on your personal computer? If so, then you would use SageMaker.

2. Is it about getting the results of your training?

3.  Is it about the storage of model artifacts?
    - SageMaker wraps the results of scikit-learn a little better. You can still build the results however you want, but you would have to do this each time you ran a model. So it would be nice to use the API on my local machine and store the results on S3. I don't think using the notebook instance is that valuable.

4.  Is it about organizing and tracking model experiments?

5.  Is it about the ability to easily reload past model experiments and use them?

6. I don't care about endpoints. 


In [13]:
# data managing and display libs
import pandas as pd
import numpy as np
import os
import io

import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline 

In [2]:
# Import PCA from scikit-learn
from sklearn.decomposition import PCA

In [3]:
# Read in dataset
df = pd.read_csv("counties_scaled.csv")
df.head()

Unnamed: 0,State,TotalPop,Men,Women,Hispanic,White,Black,Native,Asian,Pacific,...,Walk,OtherTransp,WorkAtHome,MeanCommute,Employed,PrivateWork,PublicWork,SelfEmployed,FamilyWork,Unemployment
0,Alabama-Autauga,0.005475,0.005381,0.005566,0.026026,0.759519,0.215367,0.004343,0.024038,0.0,...,0.007022,0.033248,0.048387,0.55243,0.005139,0.75,0.25,0.150273,0.0,0.208219
1,Alabama-Baldwin,0.019411,0.019246,0.019572,0.045045,0.832665,0.110594,0.006515,0.016827,0.0,...,0.014045,0.035806,0.104839,0.549872,0.018507,0.884354,0.107616,0.15847,0.040816,0.205479
2,Alabama-Barbour,0.002656,0.002904,0.002416,0.046046,0.462926,0.543655,0.002172,0.009615,0.0,...,0.025281,0.038363,0.043011,0.491049,0.001819,0.719388,0.248344,0.199454,0.010204,0.482192
3,Alabama-Bibb,0.002225,0.002414,0.002042,0.022022,0.746493,0.249127,0.004343,0.002404,0.0,...,0.008427,0.038363,0.018817,0.611253,0.001754,0.804422,0.17053,0.18306,0.040816,0.227397
4,Alabama-Blount,0.005722,0.005738,0.005707,0.086086,0.880762,0.017462,0.003257,0.002404,0.0,...,0.01264,0.01023,0.061828,0.767263,0.004751,0.892857,0.127483,0.114754,0.040816,0.210959


In [7]:
df.index = df['State']
df.drop(columns='State',inplace=True)
df.head()

State,TotalPop,Men,Women,Hispanic,White,Black,Native,Asian,Pacific,Citizen,...,Walk,OtherTransp,WorkAtHome,MeanCommute,Employed,PrivateWork,PublicWork,SelfEmployed,FamilyWork,Unemployment
Alabama-Autauga,0.005475,0.005381,0.005566,0.026026,0.759519,0.215367,0.004343,0.024038,0.0,0.006702,...,0.007022,0.033248,0.048387,0.55243,0.005139,0.75,0.25,0.150273,0.0,0.208219
Alabama-Baldwin,0.019411,0.019246,0.019572,0.045045,0.832665,0.110594,0.006515,0.016827,0.0,0.024393,...,0.014045,0.035806,0.104839,0.549872,0.018507,0.884354,0.107616,0.15847,0.040816,0.205479
Alabama-Barbour,0.002656,0.002904,0.002416,0.046046,0.462926,0.543655,0.002172,0.009615,0.0,0.003393,...,0.025281,0.038363,0.043011,0.491049,0.001819,0.719388,0.248344,0.199454,0.010204,0.482192
Alabama-Bibb,0.002225,0.002414,0.002042,0.022022,0.746493,0.249127,0.004343,0.002404,0.0,0.00286,...,0.008427,0.038363,0.018817,0.611253,0.001754,0.804422,0.17053,0.18306,0.040816,0.227397
Alabama-Blount,0.005722,0.005738,0.005707,0.086086,0.880762,0.017462,0.003257,0.002404,0.0,0.00697,...,0.01264,0.01023,0.061828,0.767263,0.004751,0.892857,0.127483,0.114754,0.040816,0.210959


In [8]:
# Convert to a float32 numpy array (sklearn's PCA would also accept the DataFrame directly)
train_data_np = df.values.astype('float32')

In [9]:
N_COMPONENTS = 33

In [10]:
# Configure the algorithm (hyperparameters)
pca = PCA(n_components=N_COMPONENTS, svd_solver='full')

In [15]:
# It was fast: less than 1 second.
# The instance now stores the singular value decomposition results, the dataset descriptors,
# and the explained variance
pca.fit(train_data_np)

PCA(n_components=33, svd_solver='full')
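To put a number on "fast" for the compute question, the fit can be timed. The sketch below is only an illustration: it uses random data in place of `counties_scaled.csv`, and the row count of 3000 is an assumption, not the size of the real dataset.

```python
import time

import numpy as np
from sklearn.decomposition import PCA

# Assumption: random stand-in data, 3000 rows x 34 features (not the real counties file).
rng = np.random.default_rng(0)
X_timing = rng.normal(size=(3000, 34)).astype("float32")

pca_timed = PCA(n_components=33, svd_solver="full")

start = time.perf_counter()
pca_timed.fit(X_timing)
elapsed = time.perf_counter() - start

print(f"fit took {elapsed:.3f} s")
```

Growing the row count in `X_timing` is a cheap way to see at what data size a local fit stops being practical.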

# Results

PCA is doing a Singular Value Decomposition (SVD) of the centered data matrix X:

**X = UΣVᵀ**

where ᵀ means transpose, so Vᵀ is V transposed.

* **U**: how each subject (e.g., counties in this example) relates to the latent factors (the rotated and translated dimensions).
* **Vᵀ**: how the columns (e.g., measurement types) relate to the latent factors.
* **Σ**: a k x k diagonal matrix. It helps us decide how many factors to keep, because it shows how much of the variance each fitted dimension explains.
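The relationship between the SVD factors and the fitted PCA object can be checked directly. This is a minimal sketch on random stand-in data (an assumption, not the counties data); it compares absolute values because the sign of each component is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

# Assumption: random stand-in data, 50 subjects x 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))

# PCA centers each column before decomposing, so do the same here.
X_centered = X - X.mean(axis=0)

# Thin SVD: U is (50, 5), s is the diagonal of Sigma, Vt is (5, 5).
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

pca_check = PCA(n_components=5, svd_solver="full").fit(X)

# The diagonal of Sigma matches singular_values_.
singular_values_match = np.allclose(s, pca_check.singular_values_)

# The rows of Vt match components_ up to a sign flip per component.
components_match = np.allclose(np.abs(Vt), np.abs(pca_check.components_))

# U scaled by s gives each subject's coordinates on the latent factors.
scores_match = np.allclose(np.abs(U * s), np.abs(pca_check.transform(X)))

print(singular_values_match, components_match, scores_match)
```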

## PCA Model Attributes

Three types of model attributes are contained within the SageMaker PCA model; each has an equivalent on the sklearn PCA model.

* **mean**: The mean that was subtracted from a component in order to center it (`mean_` in sklearn).
* **v**: The makeup of the principal components (same as `components_` in an sklearn PCA model).
* **s**: The singular values of the components for the PCA transformation (`singular_values_` in sklearn). This does not exactly give the % variance from the original feature space, but can give the % variance from the projected feature space.
    
We are only interested in v and s. 

From s, we can get an approximation of the data variance that is covered by the first `n` principal components. The approximate explained variance is the sum of the squared s values for the top n components divided by the sum of the squared s values for _all_ k components:

\begin{equation*}
\frac{\sum_{i=1}^{n} s_i^2}{\sum_{i=1}^{k} s_i^2}
\end{equation*}
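The formula above can be sketched in a few lines of numpy. The function name `approx_explained_variance` and the toy singular values are illustrative assumptions, not part of any library.

```python
import numpy as np


def approx_explained_variance(s, n):
    """Share of variance covered by the top n components, given singular values s.

    Hypothetical helper: assumes s is sorted from largest to smallest.
    """
    s = np.asarray(s, dtype=float)
    return float(np.sum(s[:n] ** 2) / np.sum(s ** 2))


# Toy singular values: the squares are 9, 4 and 1, so the total is 14.
toy_s = [3.0, 2.0, 1.0]
share_top1 = approx_explained_variance(toy_s, 1)  # 9 / 14
share_top2 = approx_explained_variance(toy_s, 2)  # 13 / 14
print(share_top1, share_top2)
```

Because the denominator only sees the components that were kept, the result is exact when every component is retained and an approximation otherwise.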

From v, we can learn more about the combinations of original features that make up each principal component.


In [21]:
explained_var = pd.Series(pca.explained_variance_ratio_)
explained_var

0     3.209868e-01
1     1.421049e-01
2     1.148278e-01
3     8.666071e-02
4     5.340210e-02
5     4.951809e-02
6     3.417160e-02
7     2.991340e-02
8     2.602435e-02
9     2.106444e-02
10    1.724376e-02
11    1.579245e-02
12    1.467393e-02
13    1.142748e-02
14    1.067634e-02
15    9.639916e-03
16    7.614509e-03
17    6.587862e-03
18    6.075651e-03
19    4.759028e-03
20    4.410688e-03
21    3.924108e-03
22    2.644663e-03
23    2.129502e-03
24    1.906785e-03
25    1.658913e-03
26    1.357431e-04
27    1.348715e-05
28    7.520258e-06
29    1.057873e-06
30    9.965286e-07
31    7.975693e-07
32    6.142346e-07
dtype: float32

In [23]:
explained_var[:7].sum()

0.80167204

In general, you want your model to capture at least 80% of the variance. So, in this case, I would keep at least the top 7 dimensions.
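Rather than summing slices by hand, the cutoff can be found from the cumulative sum. This is a small sketch; `components_needed` is a hypothetical helper name, and the ratios are the first eight values printed above.

```python
import numpy as np


def components_needed(explained_variance_ratio, threshold=0.80):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(np.asarray(explained_variance_ratio, dtype=float))
    if cumulative[-1] < threshold:
        raise ValueError("threshold is not reached by the supplied components")
    # argmax returns the first index where the condition is True; +1 turns it into a count.
    return int(np.argmax(cumulative >= threshold)) + 1


# First eight explained-variance ratios from the output above.
ratios = [0.3209868, 0.1421049, 0.1148278, 0.08666071,
          0.05340210, 0.04951809, 0.03417160, 0.02991340]

n_keep = components_needed(ratios)
print(n_keep)  # → 7
```

sklearn can also make this choice itself: with `svd_solver='full'`, passing a float between 0 and 1 as `n_components` keeps just enough components to explain that share of the variance.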

In [18]:
print(pca.singular_values_)

[19.592176   13.035978   11.718245   10.180057    7.991315    7.6952195
  6.392515    5.980974    5.578649    5.018963    4.5410376   4.3457413
  4.1890187   3.6967015   3.5731425   3.3952808   3.0175886   2.806799
  2.695476    2.3856032   2.2966366   2.1662548   1.7783786   1.5957989
  1.510045    1.4084805   0.4029007   0.12699868  0.09483211  0.0355677
  0.03452105  0.03088327  0.0271023 ]


In [24]:
print(pca.components_)

[[ 5.04592387e-03  4.95318510e-03  5.13611455e-03 ... -6.34508505e-02
  -1.42857628e-02  2.33626753e-01]
 [-1.59877054e-02 -1.57302711e-02 -1.62377879e-02 ...  3.09405625e-01
   7.06733987e-02 -1.20168991e-01]
 [ 7.63924643e-02  7.59586319e-02  7.68135563e-02 ... -1.39890611e-01
  -4.17343006e-02 -6.97059631e-02]
 ...
 [ 1.17335969e-03 -9.72318370e-03  1.17554124e-02 ... -3.72649968e-01
  -9.91007686e-02 -7.19352538e-05]
 [-2.53616087e-02  6.83546364e-01 -7.13751435e-01 ...  2.52577569e-03
   7.08975422e-04 -2.98635248e-04]
 [ 4.76849731e-03 -7.12749660e-02  7.86113143e-02 ... -1.14719095e-02
  -2.81052268e-03  8.26647738e-05]]


In [26]:
v = pd.DataFrame(pca.components_)

In [27]:
v.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,24,25,26,27,28,29,30,31,32,33
0,0.005046,0.004953,0.005136,0.392619,-0.601972,0.20753,0.03843,-0.004536,0.001652,0.004569,...,-0.004034,0.013728,-0.078116,0.054965,0.003422,-0.1036,0.141609,-0.063451,-0.014286,0.233627
1,-0.015988,-0.01573,-0.016238,0.278089,-0.092913,-0.3488,0.113913,0.002378,0.014045,-0.019234,...,0.120834,0.039226,0.225706,-0.288411,-0.015866,-0.443172,0.232524,0.309406,0.070673,-0.120169
2,0.076392,0.075959,0.076814,0.322116,-0.373134,0.001155,-0.038152,0.173815,0.018965,0.08455,...,-0.014532,0.012386,-0.009739,0.120439,0.079684,0.182157,-0.085712,-0.139891,-0.041734,-0.069706
3,0.009003,0.008634,0.009362,-0.534676,-0.160297,0.622484,0.124463,0.059567,0.011659,0.011346,...,0.063648,0.042423,0.088901,-0.064774,0.010629,-0.259197,0.243868,0.00962,0.015787,0.03182
4,-0.016843,-0.016702,-0.016979,0.102067,-0.210338,0.30764,-0.136468,-0.026954,-0.003319,-0.022274,...,-0.046131,-0.033763,0.012945,0.306704,-0.016639,0.069203,-0.138455,0.112136,0.019906,-0.142847


In [32]:
# rows = principal components
# cols = features (x inputs)
# so I need to transpose this DataFrame to get it in the form I like (features x components)
# Need to double-check the order of principal components and features after I do this transpose
# This is something I wouldn't necessarily have to do with the SageMaker SDK
# Would love to run SageMaker locally
v.shape

(33, 34)
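One way to make the order easy to verify after a transpose is to label both axes before transposing. The sketch below uses a small made-up matrix and four of the feature names as stand-ins; in this notebook, `pca.components_` and `df.columns` would presumably play those roles.

```python
import numpy as np
import pandas as pd

# Assumption: made-up 3 x 4 component matrix standing in for pca.components_.
feature_names = ["TotalPop", "Men", "Women", "Hispanic"]
components = np.array([
    [0.1, 0.2, -0.9, 0.3],
    [0.7, -0.1, 0.2, 0.6],
    [-0.2, 0.8, 0.1, 0.5],
])

# Rows = principal components, columns = features, both labeled explicitly.
v_labeled = pd.DataFrame(
    components,
    index=[f"PC{i + 1}" for i in range(components.shape[0])],
    columns=feature_names,
)

# Transposing swaps the axes but the labels travel with the values.
v_t = v_labeled.T  # shape: (features, components)

# Features with the largest absolute weight in the first component.
top_pc1 = v_t["PC1"].abs().sort_values(ascending=False).head(2)
print(top_pc1)
```

sklearn documents `components_` as having shape `(n_components, n_features)`, sorted by decreasing explained variance, so `PC1` in the labeled frame is the component that explains the most variance.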

In [None]:
# The singular values are the diagonal of Sigma (a square matrix with 0's outside the diagonal)
results = {
    'explain_var_percent': pca.explained_variance_ratio_,
    'singular_values': pca.singular_values_,  # this is s; it tells you more about the variance explained
    'principal_components': pca.components_,  # this is V transposed
}
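On the questions of storing and reloading model artifacts, a fitted sklearn model can be written to disk and loaded back with `joblib`. This is a minimal local sketch under stated assumptions: the data is random, the file name `pca_model.joblib` is hypothetical, and copying the file to S3 would be a separate step that is not shown.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.decomposition import PCA

# Assumption: random stand-in data, 40 rows x 6 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 6)).astype("float32")
pca_demo = PCA(n_components=3, svd_solver="full").fit(X_demo)

# Hypothetical artifact location inside a temporary directory.
artifact_path = os.path.join(tempfile.mkdtemp(), "pca_model.joblib")
joblib.dump(pca_demo, artifact_path)

# Later, or in another session: reload the model and use it without refitting.
reloaded = joblib.load(artifact_path)
same_projection = np.allclose(pca_demo.transform(X_demo), reloaded.transform(X_demo))
print(same_projection)
```

The same `joblib.dump` call would work on a plain dictionary such as `results`, which keeps the summary values together with the model.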