___
<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Principal Components Analysis 


---

In this lab, let's try PCA on a dataset derived from the USDA National Nutrient Database.

**Import libraries**

In [None]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

%matplotlib inline

**Import the `nutrition_usda` data set and look at the variables**

#### Let's check for highly correlated features in our dataset. Remove any redundant variables.

In [None]:
# Compute the correlation matrix once instead of on every loop iteration.
# Non-numeric columns are excluded because correlation needs numeric data.
corr = df.select_dtypes(include='number').corr()

used = []
corrs = []
for i, j in enumerate(corr.columns):
    for k in range(len(corr)):
        if (corr.iloc[k, i] > 0.9
                and j not in used
                and j != corr.index[k]):
            # Record only the first highly correlated partner of each column
            used.append(j)
            corrs.append((j, corr.index[k], np.round(corr.iloc[k, i], 2)))

# One row per flagged pair: the column, its partner, and their correlation
corrsdf = pd.DataFrame(corrs, columns=['column', 'row', 'corr'])
corrsdf[:15]

**Remove redundant features**
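
One possible way to do this is sketched below on a small made-up DataFrame. The column names are hypothetical and the 0.9 cutoff simply mirrors the check above; this is an illustration, not necessarily the lab's intended solution. The idea is to look only at the upper triangle of the correlation matrix, so each pair is seen once, and drop any column that is highly correlated with an earlier one.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the nutrition table (hypothetical column names)
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "energy_kcal": base,
    # near-duplicate of energy_kcal, so it should be flagged as redundant
    "energy_kj": base * 4.184 + rng.normal(scale=0.01, size=200),
    "protein_g": rng.normal(size=200),
})

corr = toy.corr()
# Keep only the strict upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Drop every column that correlates above 0.9 with an earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
toy_reduced = toy.drop(columns=to_drop)
print(to_drop)
```

Which member of a correlated pair to keep is a judgment call; this sketch keeps whichever column comes first.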

#### Next, separate the non-numeric features
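
Later cells expect the numeric features in `df` and the descriptive columns (such as `FoodGroup`) in `df_desc`. A minimal sketch of that split, using `select_dtypes` on a toy frame with made-up values, might look like this:

```python
import pandas as pd

# Toy rows standing in for the nutrition table (values are made up)
toy = pd.DataFrame({
    "FoodGroup": ["Dairy", "Fruits"],
    "CommonName": ["", "apple"],
    "Protein_g": [3.4, 0.3],
    "Fat_g": [3.3, 0.2],
})

# Descriptive (non-numeric) columns go to one frame, numeric columns to the other
toy_desc = toy.select_dtypes(exclude="number")
toy_num = toy.select_dtypes(include="number")
print(list(toy_desc.columns), list(toy_num.columns))
```

Because both frames keep the original index, they can be joined back together later.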

### Now, look at the data distribution

In [None]:
ax = df.hist(bins=50, xlabelsize=-1, ylabelsize=-1, figsize=(11,11))

Most of the variables are zero-inflated and right-skewed. We may want to transform them to "improve" the distributions and, we hope, produce better correlations for our PCA.

Note: this is an optional step that may not always improve results.
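
To illustrate the idea before applying it to the real data, here is a sketch on a synthetic zero-inflated sample (not the USDA data). It uses a Box-Cox transform, the same transform the next cell applies, and compares skewness before and after.

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(0)
# Synthetic right-skewed, zero-inflated sample (an assumption for illustration)
x = np.concatenate([np.zeros(200), rng.exponential(scale=5.0, size=800)])

shifted = x + 1                     # Box-Cox requires strictly positive input
transformed, lam = boxcox(shifted)  # lam is the fitted power parameter

print("skew before:", skew(shifted))
print("skew after: ", skew(transformed))
```

A skewness closer to zero means a more symmetric distribution, although the spike created by the zeros does not disappear.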

In [None]:
from scipy.stats import boxcox
# add 1 because data must be positive (we have many zeros)
df = df + 1
df_TF = pd.DataFrame(index=df.index)
for i in df.columns.values:
    df_TF["%s_TF" % i] = boxcox(df.loc[:, i])[0]

In [None]:
ax = df_TF.hist(bins=50, xlabelsize=-1, ylabelsize=-1, figsize=(11,11))

The transformed distributions should be closer to symmetric, which should help our PCA.

To account for different scales of measurement, we'll standardize to mean=0, variance=1.

In [None]:
# from sklearn.preprocessing import StandardScaler
df_TF = StandardScaler().fit_transform(df_TF)

print("mean: ", np.round(df_TF.mean(), 2))
print("standard dev: ", np.round(df_TF.std(), 2))

### Implement PCA
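
The cells that follow refer to a fitted model named `fit` and an array of component scores named `pca`. One way to produce both is sketched here on synthetic data; the real input would be the standardized `df_TF`, and the intended lab solution may differ.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized feature matrix (not the USDA data)
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(100, 6)))

fit = PCA().fit(X)      # fitted model: exposes components_ and explained_variance_ratio_
pca = fit.transform(X)  # component scores, one row per observation

print(pca.shape)
print(fit.explained_variance_ratio_.sum())  # all components are kept, so the ratios sum to about 1
```

Leaving `n_components` unset keeps every component, which makes it possible to inspect the full variance profile before deciding how many to keep.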

In [None]:
# from sklearn.decomposition import PCA



#### Check the eigenvalues (as explained variance ratios) to find the most important components


In [None]:
plt.plot(fit.explained_variance_ratio_)

In [None]:
print(fit.explained_variance_ratio_)

print(fit.explained_variance_ratio_[:5].sum())
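
A cumulative view of the same ratios can make the cutoff easier to pick. The sketch below uses synthetic data and a 90% target; both are illustrative assumptions rather than a rule. It finds the smallest number of components whose cumulative ratio reaches the target.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data driven by two latent factors plus noise (not the USDA data)
latent = rng.normal(size=(300, 2))
X = latent @ rng.normal(size=(2, 8)) + 0.1 * rng.normal(size=(300, 8))

fit = PCA().fit(X)
cumulative = np.cumsum(fit.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the target
target = 0.90  # illustrative threshold
n_keep = int(np.searchsorted(cumulative, target) + 1)
print(n_keep, cumulative[:n_keep])
```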

#### Keeping the scores on the first 5 components, for example

In [None]:
pca = pd.DataFrame(pca[:, :5], index=df.index)
pca = pca.join(df_desc)
pca.drop(['CommonName','MfgName','ScientificName'], axis=1, inplace=True)
pca.rename(columns={0:'c1',1:'c2',2:'c3',3:'c4',4:'c5'}, inplace=True)

### Try to interpret the components

(this is where deep subject matter expertise, in this case nutrition, comes in handy)

**Component one** 



In [None]:
vects = fit.components_[:5]

one = pd.Series(vects[0], index=df.columns)
one.sort_values(ascending=False)

**Component two**



In [None]:
two = pd.Series(vects[1], index=df.columns)
two.sort_values(ascending=False)

In [None]:
# do the same for #3, 4 and 5.
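
Rather than repeating the cell for each component, the loadings can be collected in a loop. This is a sketch on synthetic data with hypothetical feature names; in the lab, `vects` and `df.columns` would take the place of the toy objects.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in with hypothetical feature names (not the USDA columns)
cols = ["f1", "f2", "f3", "f4", "f5", "f6"]
X = pd.DataFrame(rng.normal(size=(100, 6)), columns=cols)

fit = PCA().fit(X)
vects = fit.components_[:5]  # one row of loadings per component

# One sorted Series of loadings per component, keyed by component number
loadings = {
    n + 1: pd.Series(vec, index=X.columns).sort_values(ascending=False)
    for n, vec in enumerate(vects)
}
print(loadings[3].head(3))
```

The sign of a component is arbitrary, so when interpreting it, look at the features with the largest absolute loadings at both ends of the sorted list.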

**Now let's look at which food groups are most common among the 500 lowest-scoring foods on each component**

#### Component 1 

In [None]:
pca.sort_values(by='c1')['FoodGroup'][:500].value_counts()

#### Component 2

In [None]:
pca.sort_values(by='c2')['FoodGroup'][:500].value_counts()

#### Repeat for the other three components
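
The repetition can also be written as a loop. The sketch below uses a tiny made-up scores table in place of the lab's `pca` frame and inspects both ends of each component, since an ascending sort puts the most negative scores first.

```python
import pandas as pd

# Toy scores table with hypothetical values; the lab's frame has c1..c5 and FoodGroup
scores = pd.DataFrame({
    "c1": [-2.0, -1.5, 0.1, 1.2, 2.5, 3.0],
    "c2": [0.3, -0.7, 1.9, -2.2, 0.4, 0.0],
    "FoodGroup": ["Dairy", "Dairy", "Fruits", "Fruits", "Meats", "Meats"],
})

n = 2  # the lab uses 500 rows; 2 keeps the toy example readable
summary = {}
for comp in ["c1", "c2"]:
    ordered = scores.sort_values(by=comp)["FoodGroup"]
    low = ordered.head(n).value_counts().to_dict()   # most negative scores
    high = ordered.tail(n).value_counts().to_dict()  # most positive scores
    summary[comp] = (low, high)
    print(comp, "low:", low, "high:", high)
```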