# PCA Rundown 

Analysis largely taken from [this page](https://data.world/exercises/principal-components-exercise-1) and modified by David John Baker. 

In [None]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv("nndb_flat.csv")

In [None]:
df

In [None]:
df.shape

In [None]:
df.columns

There is a bit of redundant information here that we need to drop before we start doing any sort of analysis.

In [None]:
sns.pairplot(df[['Energy_kcal','Protein_g','Fat_g']])


In [None]:
df.drop(df.columns[df.columns.str.contains('_USRDA')].values, 
        inplace=True, axis=1)

In [None]:
df.set_index('ID', inplace=True)
df_desc = df.iloc[:, :6]
df.drop(df.columns[:6].values, axis=1, inplace=True)

In [None]:
%matplotlib inline
ax = df.hist(bins=50, xlabelsize=-1, ylabelsize=-1, figsize=(11,11))

## Most of the variables are zero-inflated and right-skewed. We may want to consider a transformation to "improve" the distributions and hopefully produce better correlations for our PCA. Note: this is an optional step that may not always improve results.
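To make the effect of such a transformation concrete, here is a minimal sketch on synthetic data (the exponential sample and the 20% share of zeros are assumptions chosen only to mimic a zero-inflated, right-skewed nutrient column). It compares skewness before and after a Box-Cox transform.

```python
import numpy as np
from scipy.stats import boxcox, skew

# Synthetic, zero-inflated, right-skewed data (illustration only, not the food data)
rng = np.random.default_rng(0)
raw = rng.exponential(scale=5.0, size=1000)
raw[:200] = 0.0  # mimic the many zeros seen in the histograms

# Box-Cox requires strictly positive input, so shift everything by 1 first
shifted = raw + 1
transformed, lam = boxcox(shifted)  # returns the transformed data and the fitted lambda

skew_before = skew(shifted)
skew_after = skew(transformed)
print(f"fitted lambda: {lam:.3f}")
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness closer to zero suggests a more symmetric distribution, though the spike of zeros does not disappear; it only moves to the low end of the transformed scale.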

In [None]:
from scipy.stats import boxcox
# add 1 because data must be positive (we have many zeros)
df = df + 1
df_TF = pd.DataFrame(index=df.index)
for i in df.columns.values:
    df_TF["%s_TF" % i] = boxcox(df.loc[:, i])[0]

In [None]:
ax = df_TF.hist(bins=50, xlabelsize=-1, ylabelsize=-1, figsize=(11,11))

In [None]:
# from sklearn.preprocessing import StandardScaler
df_TF = StandardScaler().fit_transform(df_TF)

print("mean: ", np.round(df_TF.mean(), 2))
print("standard dev: ", np.round(df_TF.std(), 2))
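The check above reports one mean and one standard deviation over the whole array. A per-column check is a slightly stricter way to confirm the scaling worked. The sketch below uses made-up data with three columns on very different scales, purely as an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: three columns with very different centers and spreads
rng = np.random.default_rng(1)
X = rng.normal(loc=[0.0, 50.0, 1000.0], scale=[1.0, 10.0, 300.0], size=(200, 3))

X_scaled = StandardScaler().fit_transform(X)

# axis=0 computes the statistic separately for each column
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print("column means:", np.round(col_means, 2))
print("column stds: ", np.round(col_stds, 2))
```

Each column should come out with mean close to 0 and standard deviation close to 1, so no single variable dominates the PCA just because of its units.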

# Run PCA 

In [None]:
# from sklearn.decomposition import PCA
fit = PCA()
pca = fit.fit_transform(df_TF)
pca
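For intuition about what `fit_transform` returns, the following sketch reproduces PCA scores with a plain singular value decomposition of the centered data and compares them to scikit-learn. The random matrix is an assumption for illustration, and `svd_solver="full"` is set explicitly here only to keep the comparison exact.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random data with correlated columns (illustration only)
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 4)) @ rng.normal(size=(4, 4))

# PCA by hand: center the columns, then take the SVD
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores_svd = U * S                      # projections onto the principal axes
var_ratio_svd = S**2 / np.sum(S**2)     # share of variance per component

# Same thing with scikit-learn
fit_demo = PCA(svd_solver="full")
scores_sk = fit_demo.fit_transform(X)

# The sign of each component is arbitrary, so compare absolute values
print(np.allclose(np.abs(scores_svd), np.abs(scores_sk)))
print(np.allclose(var_ratio_svd, fit_demo.explained_variance_ratio_))
```

The rows of `Vt` play the role of `components_`, which is why the loadings inspected later can be read as directions in the original variable space.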

In [None]:
plt.plot(fit.explained_variance_ratio_)
plt.title("Variance Explained, Given Number of Components")
plt.xlabel("Number of Eigenvectors")
plt.ylabel("Variance Explained")

In [None]:
print(fit.explained_variance_ratio_)
print("--------------------")
print(fit.explained_variance_ratio_[:5].sum())
print("--------------------")
print("If we keep ALL the components we can explain", fit.explained_variance_ratio_.sum()*100, "% of the variance!")

#### the first 5 eigenvectors account for 77% of the variance and will be kept
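One way to arrive at a cut-off like this is to accumulate the explained variance ratios and find the first component count that crosses a chosen threshold. The sketch below runs on random data, and the 0.75 threshold is a hypothetical value chosen for illustration, not a recommendation.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random correlated data standing in for a real dataset (illustration only)
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 10)) @ rng.normal(size=(10, 10))

fit_demo = PCA().fit(X)
cumulative = np.cumsum(fit_demo.explained_variance_ratio_)

threshold = 0.75  # hypothetical cut-off
# searchsorted gives the first index whose cumulative share reaches the threshold
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print("cumulative variance:", np.round(cumulative, 3))
print("components needed:", n_keep)
```

Plotting `cumulative` instead of the raw ratios gives a curve that reads directly as "variance explained, given number of components".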

In [None]:
display(pca)

In [None]:
pca = pd.DataFrame(pca[:, :5], index=df.index)
pca

In [None]:
df_desc

In [None]:
pca = pca.join(df_desc)


In [None]:
pca.drop(['CommonName','MfgName','ScientificName'], axis=1, inplace=True)
pca.rename(columns={0:'c1',1:'c2',2:'c3',3:'c4',4:'c5'}, inplace=True)

In [None]:
pca

## Try to interpret the components

(this is where deep subject matter expertise, in this case nutrition, comes in handy)

**Component one** 

foods that are high in: zinc, and other vitamins and minerals

low in: sugar, vitamin C, Carbs, and fiber
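Reading off "high" and "low" variables like this can be wrapped in a small helper. The function name `top_loadings`, the feature names, and the random data below are all hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def top_loadings(fitted_pca, feature_names, component, n=3):
    """Return the n most positive and n most negative loadings of one component."""
    loadings = pd.Series(fitted_pca.components_[component], index=feature_names)
    ordered = loadings.sort_values(ascending=False)
    return ordered.head(n), ordered.tail(n)

# Random data with made-up feature names (illustration only)
rng = np.random.default_rng(4)
names = ["a", "b", "c", "d", "e", "f"]
X = rng.normal(size=(100, 6))
fit_demo = PCA().fit(X)

high, low = top_loadings(fit_demo, names, component=0, n=2)
print("high:\n", high)
print("low:\n", low)
```

Keep in mind that the sign of a principal component is arbitrary: flipping every loading (and every score) describes the same axis, so "high" and "low" are only meaningful relative to each other.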

In [None]:
# Get the first five components (each row of components_ is one eigenvector)
vects = fit.components_[:5]

In [None]:
one = pd.Series(vects[0], index=df.columns)
one.sort_values(ascending=False)

**Component two**

High: Carbs, Fiber, Manganese, Sugar, Vitamin C...

Low: Vitamin B12, protein, selenium, Fat...

In [None]:
two = pd.Series(vects[1], index=df.columns)
two.sort_values(ascending=True)

**Component three**

High: calories, fat, carbs, sugar...

Low: vitamin A, vitamin C, folate, copper...

In [None]:
three = pd.Series(vects[2], index=df.columns)
three.sort_values(ascending=False)

**Component four**

High: vitamin A, vitamin E, fat, sugar, calcium, vitamin B12, calories...

Low: manganese, copper, iron, magnesium, fiber

In [None]:
four = pd.Series(vects[3], index=df.columns)
four.sort_values(ascending=False)

**Component five**

High: riboflavin, thiamin, niacin, sugar, vitB6, vitC, vitB12...

Low: manganese, copper, fat, vitE, calories, magnesium

In [None]:
five = pd.Series(vects[4], index=df.columns)
five.sort_values(ascending=False)

In [None]:
## Now let's look at which food groups are highest in each component (1)

In [None]:
pca

In [None]:
# sort descending so the first 500 rows are the HIGHEST scores on the component
pca.sort_values(by='c1', ascending=False)['FoodGroup'][:500].value_counts()

In [None]:
pca.sort_values(by='c2', ascending=False)['FoodGroup'][:500].value_counts()

In [None]:
pca.sort_values(by='c3', ascending=False)['FoodGroup'][:500].value_counts()

In [None]:
pca.sort_values(by='c4', ascending=False)['FoodGroup'][:500].value_counts()

In [None]:
pca.sort_values(by='c5', ascending=False)['FoodGroup'][:500].value_counts()
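Counting food groups among the 500 rows at one end of a component depends on the chosen cut-off. An alternative summary, sketched here on a tiny made-up table (the scores and group labels are invented for illustration), is the average component score per food group.

```python
import pandas as pd

# Tiny made-up table of component scores with a group label (illustration only)
scores = pd.DataFrame({
    "c1": [2.0, 1.5, -1.0, -2.0, 0.5, -0.5],
    "FoodGroup": ["Meat", "Meat", "Fruit", "Fruit", "Dairy", "Dairy"],
})

# Mean score per group, highest first
group_means = (
    scores.groupby("FoodGroup")["c1"]
    .mean()
    .sort_values(ascending=False)
)
print(group_means)
```

Groups at the top of this ranking sit toward the positive end of the component and groups at the bottom toward the negative end, without needing a cut-off such as 500 rows.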

# What About Other Ways?

In [None]:
# from sklearn.decomposition import PCA
fit_2 = PCA(n_components=2)
pca_2 = fit_2.fit_transform(df_TF)
pca_2

fit_7 = PCA(n_components=7)
pca_7 = fit_7.fit_transform(df_TF)
pca_7

In [None]:
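# Create a new dataset from principal components
# (reuse the ID index from df_desc so the join below lines up row by row)
df = pd.DataFrame(data = pca_2, 
                  columns = ['PC1', 'PC2'],
                  index = df_desc.index)


df_7 = pd.DataFrame(data = pca_7, 
                  columns = ['PC1', 'PC2','PC3','PC4','PC5','PC6','PC7'],
                  index = df_desc.index)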

# target = pd.Series(iris['target'], name='target')

# result_df = pd.concat([df, target], axis=1)
# result_df.head(5)

In [None]:
df_desc = pd.DataFrame(df_desc)
food_2 = df.join(df_desc)

display(food_2)

food_7 = df_7.join(df_desc)

#display(food_7)
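Because `DataFrame.join` matches rows on the index, it is worth checking that the component scores and the descriptive columns share one. The sketch below uses a three-row made-up table (IDs, groups, and scores are invented for illustration) to show what happens when a frame built from a NumPy array keeps its default 0, 1, 2, ... index.

```python
import numpy as np
import pandas as pd

# Made-up descriptive table indexed by ID (illustration only)
desc = pd.DataFrame(
    {"FoodGroup": ["Dairy", "Fruit", "Meat"]},
    index=pd.Index([1001, 1002, 1003], name="ID"),
)
scores = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

# Default RangeIndex (0, 1, 2) shares no labels with the IDs, so the join finds nothing
misaligned = pd.DataFrame(scores, columns=["PC1", "PC2"]).join(desc)

# Passing the ID index keeps each score row attached to the right food
aligned = pd.DataFrame(scores, columns=["PC1", "PC2"], index=desc.index).join(desc)

print(misaligned)
print(aligned)
```

A quick `isna().sum()` on the joined frame is an easy way to spot this kind of misalignment before plotting.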


In [None]:
# x and y are passed by keyword; recent seaborn versions do not accept them positionally
sns.scatterplot(x="PC3", y="PC2", hue="FoodGroup", data=food_7)

plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

In [None]:
food_7

## Your Turn!

Import the iris dataset found in this directory.
Go through the steps of PCA in the next 15 minutes to see if you can extract differences between the groups of flowers based on their physical features.