# PCA on the Nisei Matrix

Given `nisei_matrix.csv`, run PCA to figure out which lemmas/tokens should be used for clustering.


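_Why PCA (principal component analysis)?_
1. Saves storage space by representing the data with fewer dimensions
2. Shortens running time for downstream steps such as clustering
3. Can reduce overfitting by dropping low-variance dimensions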

In [98]:
# libraries and such 

import numpy as np
import pandas as pd

# plotting
import matplotlib.pyplot as plt
import seaborn as sns 
sns.set_style()

# PCA libraries
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

In [99]:
data = (pd.read_csv('nisei_matrix.csv', 
                    index_col = 0)) # if you don't set this, you get 2 columns of indices

data.head()

Unnamed: 0,念,世,れる,—,座,把,あー,十,万,内,...,学生,兵,員,母,知,婚,盛,移民,良,〕
0,1.0,2.0,1.0,2.0,1.0,1.0,2.0,1.0,2.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,3.0,1.0,4.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Scaling the Data

We scale the data so that each column has a mean of 0 and a standard deviation of 1. This way all attributes carry equal weight. Because of the nature of this matrix, it won't matter much, but it is useful to do anyway in case we later add a distance metric to the values in our matrix that skews the averages. 
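
To make the effect of standardization concrete, here is a minimal sketch on a made-up toy matrix (the values are purely illustrative and are not taken from `nisei_matrix.csv`). It checks that each column ends up with mean 0 and standard deviation 1 after `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy count matrix (made up for illustration): 4 documents x 3 tokens
toy = np.array([
    [1.0, 0.0, 10.0],
    [2.0, 0.0, 20.0],
    [3.0, 1.0, 30.0],
    [4.0, 1.0, 40.0],
])

toy_scaled = StandardScaler().fit_transform(toy)

# Each column now has mean ~0 and (population) standard deviation ~1
col_means = toy_scaled.mean(axis=0)
col_stds = toy_scaled.std(axis=0)
print(col_means.round(6))
print(col_stds.round(6))
```

Note that the first and third toy columns differ only by a factor of 10, so they become identical after scaling. That is the sense in which every attribute carries equal weight.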

In [100]:
# making scaler:

scaling = StandardScaler() 

# fit the standardizer then scale the data 
scaling.fit(data)
scaled_data = scaling.transform(data)

## Determining the Number of Components for our data 

With the data scaled, we can fit PCA and choose how many components to keep.

In [105]:
# setting the components: 

n = 100 # this is the maximum value of n 

principal = PCA(n_components = n)
principal.fit(scaled_data)
X = principal.transform(scaled_data)



ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
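
`StandardScaler` ignores NaN values when fitting and passes them through when transforming, whereas `PCA` rejects them, so this error most likely means the matrix itself contains missing (or infinite) entries. Below is a hedged sketch of how one might locate and handle them. The toy frame and its column names are hypothetical, and filling with 0 is only an assumption (a missing count treated as "token did not occur"); whether that is appropriate for the real data needs to be checked.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the real matrix, with one missing value
toy = pd.DataFrame({
    "a": [1.0, 0.0, 2.0, 1.0],
    "b": [0.0, np.nan, 1.0, 3.0],
    "c": [2.0, 1.0, 0.0, 0.0],
})

# 1. Locate the problem: count missing and infinite values per column
n_missing = toy.isna().sum()
n_infinite = np.isinf(toy).sum()
print(n_missing[n_missing > 0])

# 2. One possible fix (an assumption, not necessarily right for this data):
#    treat missing or infinite entries as zero occurrences
toy_clean = toy.replace([np.inf, -np.inf], np.nan).fillna(0.0)

# 3. Scaling and PCA now run without the ValueError
toy_scaled = StandardScaler().fit_transform(toy_clean)
toy_pcs = PCA(n_components=2).fit_transform(toy_scaled)
print(toy_pcs.shape)
```

Dropping the affected rows or columns with `dropna` is an alternative when the missing entries do not have a natural "zero" meaning.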

In [None]:
X.shape # the number of rows is unchanged, but there are now n columns (one per component)

In [None]:
principal.components_.shape
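
Since the goal is to find which lemmas/tokens matter, it can help to label the rows and columns of `components_`. The sketch below uses randomly generated stand-in data with made-up token names (`tok_a` through `tok_e`), not the real matrix, and ranks tokens by the absolute size of their weight on the first component. This is one common way to read loadings, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in: 30 documents x 5 made-up token columns
tokens = ["tok_a", "tok_b", "tok_c", "tok_d", "tok_e"]
toy = pd.DataFrame(rng.poisson(2.0, size=(30, 5)).astype(float), columns=tokens)

toy_scaled = StandardScaler().fit_transform(toy)
toy_pca = PCA(n_components=3).fit(toy_scaled)

# components_ has shape (n_components, n_features):
# one row per component, one column per original token
loadings = pd.DataFrame(
    toy_pca.components_,
    columns=toy.columns,
    index=[f"PC{i + 1}" for i in range(toy_pca.n_components_)],
)

# Tokens with the largest absolute weight on the first component
top_pc1 = loadings.loc["PC1"].abs().sort_values(ascending=False)
print(top_pc1.head(3))
```

The same pattern should carry over to the real data by building the frame from `principal.components_` with `columns=data.columns`.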

In [None]:
plt.plot(np.arange(0,n),principal.explained_variance_ratio_)
plt.title('Variance Plot for Our Components')
plt.xlabel('Component Number')
plt.grid()
plt.ylabel('Explained Variance Ratio');

_Based on inspection, it is difficult to tell where the "elbow" of the curve is, but roughly 50 looks like a reasonable number to pick._
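
When the elbow is hard to see, a complementary approach is to look at the cumulative explained variance and keep the smallest number of components that reaches a chosen threshold. The sketch below uses random stand-in data, and the 0.90 threshold is an arbitrary assumption, not a value derived from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 200 samples x 40 features
toy = rng.normal(size=(200, 40))
toy_scaled = StandardScaler().fit_transform(toy)

# Fit with all components so the cumulative ratio ends at ~1.0
toy_pca = PCA(n_components=40).fit(toy_scaled)
cumulative = np.cumsum(toy_pca.explained_variance_ratio_)

threshold = 0.90  # arbitrary choice for illustration

# First index where the cumulative ratio reaches the threshold,
# plus 1 to turn the index into a component count
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print(n_keep, cumulative[n_keep - 1])
```

scikit-learn also accepts a float between 0 and 1 for `n_components` (with `svd_solver='full'`), which selects the number of components by explained variance in a similar way.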

In [None]:
# so let's plot just the first 50 and see how it looks again: 

plt.plot(np.arange(0,50),principal.explained_variance_ratio_[0:50])
plt.title('Variance Plot for the First 50 Components')
plt.xlabel('Component Number')
plt.grid()
plt.ylabel('Explained Variance Ratio');

In [None]:
# now we can refit PCA so that only the first 50 components are kept: 
n = 50 
principal = PCA(n_components = n)
principal.fit(scaled_data)
reduced_X = principal.transform(scaled_data)

_In the scatter plot of the first two components (plotted below), the first principal component has a larger spread than the second, which is expected because components are ordered by explained variance. The cluster of values is also centered at (0,0), which is expected because the data was centered before projection._

In [None]:
PCA_X = pd.DataFrame(reduced_X)

PCA_X

In [None]:
# Let's visualize the first 2 components: 

sns.scatterplot(data = PCA_X, x = 0, y = 1)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('Data Projected onto the 1st and 2nd Principal Components');