# Unsupervised Learning

In [None]:
reset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

# 1 k-means clustering

* k-means clustering is an unsupervised learning technique that partitions the observations in a dataset into k clusters, where each observation belongs to the cluster with the nearest center point (the centroid).    
* the algorithm starts by placing k centroids at random locations in m-dimensional space (m: number of columns or features in the dataset).   
 - 1 each sample in the dataset is assigned to the nearest centroid, based (most commonly) on the squared Euclidean distance.  
 - 2 the centroid of each cluster is recomputed as the mean of all the samples that belong to that cluster (the center of those samples in m-dimensional space).   
* steps 1 and 2 are repeated until the stopping criterion is met: the centroids no longer move significantly, which means the within-cluster sum of squared Euclidean distances has reached a (local) minimum.  
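
The two steps above can be sketched in plain NumPy. This is only an illustration of the iteration, not scikit-learn's implementation; the function name `kmeans_sketch`, its arguments, and the choice to start from k randomly picked samples (rather than arbitrary random locations) are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, tol=1e-8, seed=0):
    """Plain Lloyd iteration (hypothetical helper, for illustration only)."""
    rng = np.random.default_rng(seed)
    # assumption: initial centroids are k distinct samples picked at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # step 1: squared Euclidean distance of every sample to every centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # step 2: move each centroid to the mean of the samples assigned to it
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        shift = np.abs(new_centroids - centroids).max()
        centroids = new_centroids
        if shift < tol:  # stopping criterion: centroids stopped moving
            break
    inertia = ((X - centroids[labels]) ** 2).sum()
    return centroids, labels, inertia

# two well separated synthetic blobs of 50 points each
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, labels, inertia = kmeans_sketch(X_demo, k=2)
```

The result depends on the starting centroids, which is why library implementations typically run several initializations and keep the best one.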

&nbsp;


## 1.1 **Example 1** 

we look at a dataset that contains 3 distinctly defined groups.      

From the UCI Machine Learning Repository, the <a href='https://archive.ics.uci.edu/ml/datasets/seeds'>seeds</a> dataset classifies kernels into 3 groups based on 7 features. 

In [None]:
# modify the column names for viewing convenience
col_names = ['area', 'perimeter', 'compactness', 'length of kernel', 'width of kernel', 'asymmetry coefficient',
             'length of kernel groove', 'variety' ]

In [None]:
seeds = pd.read_table('data/seeds_.txt', header = None, index_col = False, sep = r'\s+', names = col_names)

In [None]:
seeds.shape

In [None]:
seeds.head()

&nbsp;

column `variety` is the actual kernel assignment.     


isolate this column and drop it from <U>seeds</U>; we will use it later to assess the performance of the clustering algorithm 

In [None]:
seed_class = seeds.variety.copy()

&nbsp;

shift the class levels down by 1 so that they run from 0 to 2, like the cluster labels.

In [None]:
seed_class = seed_class - 1

In [None]:
seed_class[:10]

In [None]:
# value_counts is a pandas method similar to table() in R


seed_class.value_counts()

In [None]:
# drop variety column from seeds
seeds.drop(['variety'], axis = 1, inplace = True)

In [None]:
seeds.head()

&nbsp;


In [None]:
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.metrics import silhouette_score

* the `variety` subsets are named `y_of_train` and `y_of_test` (rather than `y_train` and `y_test`) because there is no supervised training component here. We will only use them to compare against the resulting groups.   

* this is an example that happens to come with training wheels.   

In [None]:
x_train, x_test, y_of_train, y_of_test = train_test_split(seeds, seed_class, test_size = .2, random_state = 33 )

`sklearn.cluster.KMeans(n_clusters=8, 
                        init='k-means++', 
                        n_init=10, 
                        max_iter=300,   
                        verbose=0, 
                        copy_x=True, n_jobs=1, 
                        algorithm='auto')`
                        
* the value of `n_clusters` is a choice we have to make up front, and there are a couple of ways to assess which `n_clusters` returns the best separation between clusters.  

* `init = 'k-means++'` selects the initial cluster centers for k-means clustering in a smart way to speed up convergence. other options: `'random'` or an `ndarray` of starting points.

* `n_init` number of times the algorithm is run with different initial centroids; the run with the lowest inertia is kept.    
* `max_iter` maximum number of iterations of steps (1) and (2) described above within a single run.  
* `copy_x = {True, False}` default `True`: the original dataset is not modified. With `False` the data is centered in place and un-centered before returning, which may introduce small numerical differences.    


the fitted `KMeans` object exposes a number of attributes, most importantly: 

`cluster_centers_` an array of shape (k, m) where k is the number of clusters and m is the number of attributes (columns) in the dataset.   
`labels_` an array of integers from 0 to k-1 that assigns each datapoint to a cluster.   
`inertia_` sum of squared distances of samples to their closest cluster center.    
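
A short sketch of these three attributes on synthetic data (the data, the variable names and the `random_state` value are assumptions for this example and are not part of the seeds analysis):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# synthetic stand-in: 90 samples, m = 7 features, 3 well separated groups
X_demo = np.vstack([rng.normal(loc, 1.0, size=(30, 7)) for loc in (0.0, 10.0, 20.0)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

centers_shape = km.cluster_centers_.shape      # (n_clusters, n_features)
label_values = np.unique(km.labels_)           # integers 0 .. k-1
# inertia_ recomputed by hand: squared distance of each sample to its own center
manual_inertia = ((X_demo - km.cluster_centers_[km.labels_]) ** 2).sum()
```

Indexing `cluster_centers_` with `labels_` gives, for every sample, the center it was assigned to, which is what makes the manual inertia a one-liner.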

<a id = 'loc_1'></a>

**VAF**    
 - `VAF` **Variance Accounted For** is one of the metrics used to assess whether the number of clusters selected is optimal.  
 - `VAF` compares the within-cluster sum of squares with the total sum of squares: it is the share of the total that is not left within the clusters, 1 - (within-cluster sum of squares / total sum of squares).    
 - `VAF` values for different `n_clusters` are calculated and the resulting values are used to generate a **scree plot**    
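
As a rough illustration, a VAF value can be computed directly from the data and a set of labels. The sketch below assumes the usual definition, 1 - (within-cluster sum of squares / total sum of squares); `vaf_sketch` is a hypothetical helper and is not the `VAF()` function loaded from `VAF.py`, whose return value may be organized differently.

```python
import numpy as np

def vaf_sketch(X, labels):
    """Share of the total sum of squares explained by a labelling (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    within_ss = 0.0
    for lab in np.unique(labels):
        members = X[labels == lab]
        within_ss += ((members - members.mean(axis=0)) ** 2).sum()
    return 1.0 - within_ss / total_ss

# four points forming a long, thin rectangle
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = vaf_sketch(X_demo, [0, 0, 1, 1])   # split across the long side
poor = vaf_sketch(X_demo, [0, 1, 0, 1])   # split across the short side
```

For a fitted `KMeans` object the within-cluster sum of squares is `inertia_`, so the same quantity could be obtained from `inertia_` and the total sum of squares of the data.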


**silhouette_score**    
 - the `silhouette_score` from `sklearn.metrics` computes the mean silhouette coefficient, which lies in `[-1, 1]`, for every `n_clusters` assignment using the original dataset and the resulting `labels_` attribute.    
 - best value is 1 and worst is -1. values near 0 indicate overlapping clusters. negative values indicate that samples have been assigned to the wrong cluster. increasing positive values indicate better cluster quality and better separation. 
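
To make the coefficient concrete, here is a minimal NumPy sketch of the textbook definition: for each sample, `a` is the mean distance to the other members of its own cluster, `b` is the smallest mean distance to the members of any other cluster, and the score is `(b - a) / max(a, b)`. The helper name `silhouette_sketch` is an assumption for this example; in the notebook itself `silhouette_score` does this work.

```python
import numpy as np

def silhouette_sketch(X, labels):
    """Mean silhouette coefficient with Euclidean distances (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # pairwise Euclidean distance matrix
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # a singleton cluster scores 0 by convention
        a = D[i, same].sum() / (n_same - 1)  # the self-distance contributes 0
        b = min(D[i, labels == lab].mean()
                for lab in np.unique(labels) if lab != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

X_demo = np.array([[0.0], [1.0], [10.0], [11.0]])
good = silhouette_sketch(X_demo, [0, 0, 1, 1])  # natural grouping
bad = silhouette_sketch(X_demo, [0, 1, 0, 1])   # neighbours split apart
```

The natural grouping scores close to 1, while the labelling that separates neighbours scores below 0.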

run `%load udf/VAF.py` method [here](#VAF)
<a href = '#VAF'></a>

In [None]:
inertia = []
silh = []
vaf = []

for k_clusters in range(2,15):
    kmc = KMeans(n_clusters = k_clusters)
    kmc.fit(x_train)
    inertia.append(kmc.inertia_)
    silh_ = silhouette_score(x_train, kmc.labels_, metric = 'euclidean')
    silh.append(silh_)
    # the VAF value is the element at index 3 of the array returned by VAF()
    vaf_ = VAF(x_train, kmc)[3]
    vaf.append(vaf_)


In [None]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize = (15,5))

ax1.plot(range(2,15), vaf )
ax1.plot(range(2,15), vaf, 'o')
ax1.set_title('scree plot')
ax1.set_xlabel('clusters')
ax1.set_ylabel('VAF')

ax2.plot(range(2,15), silh)
ax2.plot(range(2,15), silh, 'o')
ax2.set_title('silhouette coefs')
ax2.set_xlabel('clusters')
ax2.set_ylabel('coefficient')

it is clear that the `VAF` scree plot does not recover the optimal number of clusters, which we know to be 3 from the original dataset  

let us select 4 clusters  

In [None]:
kmc_4 = KMeans(n_clusters = 4).fit(x_train)

* refit the model with the number of clusters chosen based on the *scree plot* and the *silh coefficients*.   
* predict the test sample.  


* there is no standard procedure for validating a clustering model since this is an unsupervised learning method. one simple check is to assign the test set to clusters using the fitted model and compare the proportion of observations in every cluster with the training set.     


1- calculate the proportion of observations in each of the clusters in the training set 

In [None]:
np.unique(kmc_4.labels_, return_counts = True)[1]/x_train.shape[0]

&nbsp;

2- use the model fitted on the training set to assign the test set observations to clusters (the model is not refitted) 

In [None]:
pred_variety = kmc_4.predict(x_test)

&nbsp;

3- calculate the proportions of observations in the test cluster and compare to the train cluster  

In [None]:
np.unique(pred_variety, return_counts = True)[1]/x_test.shape[0]
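
One caveat with `np.unique(..., return_counts = True)` is that a cluster that receives no test observations is dropped from the result, so the two proportion vectors may no longer line up. A possible alternative, sketched here with a hypothetical helper `cluster_proportions` and made-up labels, is `np.bincount` with `minlength`:

```python
import numpy as np

def cluster_proportions(labels, n_clusters):
    """Proportion of observations per cluster, keeping empty clusters as 0."""
    counts = np.bincount(np.asarray(labels), minlength=n_clusters)
    return counts / counts.sum()

# made-up labels: cluster 2 receives no test observations
train_labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])
test_labels = np.array([0, 0, 1, 3])

p_train = cluster_proportions(train_labels, 4)
p_test = cluster_proportions(test_labels, 4)
max_gap = np.abs(p_train - p_test).max()   # largest difference in any cluster
```

In this notebook the same helper could be called as `cluster_proportions(kmc_4.labels_, 4)` and `cluster_proportions(pred_variety, 4)`.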

&nbsp;

`TSNE` or **t-distributed Stochastic Neighbor Embedding** is a visualization tool from `sklearn.manifold` that converts similarities between datapoints into joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE has a cost function that is not convex, i.e. with different initializations we can get different results.    

PCA is recommended before using `TSNE`, especially when the number of dimensions is high (>50).   

`class sklearn.manifold.TSNE(n_components=2,             
                             learning_rate=200.0,       
                             n_iter=1000,               
                             n_iter_without_progress=300,
                             metric = 'euclidean',       
                             verbose=0,                 
                             random_state=None) `       
                        
`learning_rate` usually in the range [10.0, 1000.0]. If the learning rate is too high the data may look like a ball; if the learning rate is too low the data may look compressed in a dense cloud.                          

`n_iter` maximum number of iterations for the optimization    
`n_iter_without_progress`  number of iterations without progress before the optimization is aborted    
`metric` euclidean by default    
`random_state` the seed used by the random number generator. controls the random initialization    
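
A minimal sketch of the PCA-then-TSNE order recommended above, on synthetic data. The array sizes, the number of retained components, the `perplexity` value and the fixed `random_state` are arbitrary assumptions for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_wide = rng.normal(size=(80, 60))   # synthetic: 80 samples, 60 features

# reduce to a handful of components first, then embed in 2 dimensions
X_reduced = PCA(n_components=10, random_state=0).fit_transform(X_wide)
embedding = TSNE(n_components=2, perplexity=15, init='pca',
                 random_state=0).fit_transform(X_reduced)
```

Fixing `random_state` makes the embedding repeatable; without it, reruns can produce visibly different layouts because the cost function is not convex.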


In [None]:
from sklearn.manifold import TSNE

In [None]:
tsne = TSNE(n_components = 2)
tsne_fit = tsne.fit_transform(x_train)

In [None]:
plt.figure(figsize=(10,7))

clust = zip(tsne_fit[:,0], tsne_fit[:,1], kmc_4.labels_)

# if you list clust it looks like this 
#[(3.490185, -4.8993158, 3),
# (-7.8433094, 10.558871, 1),
# (-9.051569, 8.7391624, 1),
# (6.6155434, -11.603072, 0),

for x, y, label in clust:
    plt.plot(x, y, 'o', color = plt.cm.tab10(label/10.))

plt.title('k = 4 clusters')

&nbsp;


the seeds in the original dataset are categorized into 3 categories   
we can check to see if the features of the seeds contain enough information to distinguish the seeds   
since we do not need to compare sample proportions between train and test we will use the entire <U>seeds</U> set and `seed_class`    

In [None]:
kmc_3 = KMeans(n_clusters = 3).fit(seeds)
tsne_full = TSNE(n_components = 2)
tsne_full_fit = tsne_full.fit_transform(seeds)

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(20,7))

clust_1 = zip(tsne_full_fit[:,0], tsne_full_fit[:,1], kmc_3.labels_)
for x, y, label in clust_1:
    ax1.plot(x, y, 'o', color = plt.cm.tab10(label/10.))
ax1.set_title('k = 3 clusters')
ax1.set_xlabel('component 1')
ax1.set_ylabel('component 2')
    
clust_2 = zip(tsne_full_fit[:,0], tsne_full_fit[:,1], seed_class)
for x, y, label in clust_2:
    ax2.plot(x, y, 'o', color = plt.cm.tab10(label/10.))
ax2.set_title('actual seed varieties')
ax2.set_xlabel('component 1')
ax2.set_ylabel('component 2')

note: compare the groupings, not the colors; the cluster ids (and hence the colors) are assigned arbitrarily   
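
Because the cluster ids are arbitrary, a numeric comparison also has to ignore the id values. One option, sketched here on made-up labels, is a contingency table plus the adjusted Rand index from `sklearn.metrics`, which is unaffected by relabelling:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# made-up example: identical grouping, but the cluster ids are permuted
true_classes = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
cluster_ids = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])

# rows: actual class, columns: assigned cluster
table = pd.crosstab(pd.Series(true_classes, name='variety'),
                    pd.Series(cluster_ids, name='cluster'))
ari = adjusted_rand_score(true_classes, cluster_ids)
```

In this notebook the same two calls could be applied to `seed_class` and `kmc_3.labels_`.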

_______________________________________

In [None]:
reset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.metrics import silhouette_score
from sklearn.manifold import TSNE
import matplotlib.cm as cm

## 1.2 **Example 2**

* the `flavors_cacao.csv` dataset contains the expert rating of over 1700 chocolate bars along with a plethora of information about the origins of these comestible delights.   

* more information about the dataset can be found <a href = 'https://www.kaggle.com/rtatman/chocolate-bar-ratings'>here</a>


In [None]:
flavors = pd.read_csv('data/flavors_cacao.csv', header = 0)
flavors.shape

In [None]:
flavors.head()

there's a small problem:      
missing values in the dataset are coded as blank strings of length 1, so `pandas.read_csv()` does not recognize them as missing.   
to overcome this problem use `DataFrame.replace()` with argument `regex = True` as follows:    

In [None]:
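# match cells that contain only whitespace, not every string that merely starts with whitespace
flavors.replace(to_replace = r'^\s*$', value = np.nan, regex = True, inplace = True)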
flavors.head(5)
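
A small, self-contained illustration of this replacement on made-up data. The column names and values are assumptions, and the non-breaking space `'\xa0'` is included only as a guess at what such blank cells may contain. The pattern `^\s*$` matches cells that consist of whitespace only.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'bean_type': ['Criollo', ' ', '\xa0', 'Trinitario'],
                     'rating': [3.5, 2.75, 3.0, 4.0]})

# whitespace-only strings become NaN; other cells are left untouched
cleaned = demo.replace(to_replace=r'^\s*$', value=np.nan, regex=True)
n_missing = int(cleaned['bean_type'].isna().sum())
```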

&nbsp;

* k-means clustering can only be applied to **numeric** variables. categorical variables stored as strings need to be converted to integers before distances can be computed. 


first we convert the categorical columns into datatype category using `.astype()`

In [None]:
flavors.company = flavors.company.astype('category')
flavors.origin_name = flavors.origin_name.astype('category')
flavors.company_loc = flavors.company_loc.astype('category')
flavors.bean_type = flavors.bean_type.astype('category')
flavors.broad_bean_origin = flavors.broad_bean_origin.astype('category')

In [None]:
flavors.dtypes

<a id = 'loc_2'></a>

&nbsp;

second, we replace the categories (which are strings) with the integer code of each category

In [None]:
flavors.company = flavors.company.cat.codes
flavors.origin_name = flavors.origin_name.cat.codes
flavors.company_loc = flavors.company_loc.cat.codes
flavors.bean_type = flavors.bean_type.cat.codes
flavors.broad_bean_origin = flavors.broad_bean_origin.cat.codes
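
It may help to see what `.cat.codes` produces on a tiny made-up series: codes follow the sorted order of the categories, and missing values are coded as -1. Since k-means treats these codes as ordinary numbers, the distances between codes carry no real meaning, which is worth keeping in mind when reading the clusters.

```python
import numpy as np
import pandas as pd

# made-up example values, not taken from the dataset
origin = pd.Series(['Peru', 'Ghana', np.nan, 'Peru', 'Ecuador'])

as_cat = origin.astype('category')
categories = list(as_cat.cat.categories)   # sorted unique values
codes = as_cat.cat.codes.tolist()          # position of each value in categories
```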

In [None]:
flavors.head()

In [None]:
coco_train, coco_test = train_test_split(flavors, test_size = .2, random_state = 85)

run `%load udf/VAF.py` method [here](#VAF)
<a href = '#VAF'></a>

In [None]:
inertia_coco = []
silh_coco = []
vaf_coco = []

for k_clusters in range(5,70,2):
    kmc_coco = KMeans(n_clusters = k_clusters)
    kmc_coco.fit(coco_train)
    inertia_coco.append(kmc_coco.inertia_)
    silh_ = silhouette_score(coco_train, kmc_coco.labels_, metric = 'euclidean')
    silh_coco.append(silh_)
    vaf_ = VAF(coco_train, kmc_coco)[3]
    vaf_coco.append(vaf_)


In [None]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize = (15,5))

ax1.plot(range(5,70,2), vaf_coco )
ax1.plot(range(5,70,2), vaf_coco, 'o')
ax1.set_title('scree plot')
ax1.set_xlabel('clusters')
ax1.set_ylabel('VAF')

ax2.plot(range(5,70,2), silh_coco)
ax2.plot(range(5,70,2), silh_coco, 'o')
ax2.set_title('silhouette coefs')
ax2.set_xlabel('clusters')
ax2.set_ylabel('coefficient')

In [None]:
kmc_pick = KMeans(n_clusters = 7, n_init = 10).fit(coco_train)

In [None]:
np.unique(kmc_pick.labels_, return_counts = True)[1]/coco_train.shape[0]

In [None]:
pred_variety = kmc_pick.predict(coco_test)
np.unique(pred_variety, return_counts = True)[1]/coco_test.shape[0]

In [None]:
tsne_coco = TSNE(n_components = 2)
tsne_coco_fit = tsne_coco.fit_transform(coco_train)

In [None]:
plt.figure(figsize=(14,9))

clust_coco = zip(tsne_coco_fit[:,0], tsne_coco_fit[:,1], kmc_pick.labels_)

for x, y, label in clust_coco:
    plt.plot(x, y, 'o', color = plt.cm.tab20(label/10.))

plt.title('k = 7 clusters')

___________________________________

In [None]:
reset

&nbsp;

&nbsp;


# 2 Principal Component Analysis

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import matplotlib.cm as cm
%matplotlib inline

&nbsp;


* we begin with a simple example to demonstrate how PCA works.   


* principal component analysis projects the independent variables onto a new set of orthogonal axes, chosen so that each successive axis captures the maximum remaining variance in the data.  


* PCA is an unsupervised learning technique that is primarily used for dimensionality reduction, and it can improve the outcome of many supervised learning techniques when they are applied in the new coordinate system. 
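
The projection described above can be sketched with NumPy alone, using the eigen decomposition of the covariance matrix. This is an illustration on synthetic data, not how `sklearn.decomposition.PCA` is implemented internally; all variable names are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic 3-column data with a very different spread in each direction
X_demo = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.5])

Xc = X_demo - X_demo.mean(axis=0)           # center each column
cov = np.cov(Xc, rowvar=False)              # 3 x 3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]           # reorder: largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                       # data expressed on the new axes
explained_ratio = eigvals / eigvals.sum()   # share of variance per axis
```

The columns of `eigvecs` are the new orthogonal axes, and the variance of each column of `scores` equals the corresponding eigenvalue.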

In [None]:
data = pd.read_csv('data/simple_pca_example.csv', header = 0)
data.head()

In [None]:
from sklearn.decomposition import PCA

`PCA(n_components=None, 
            copy=True, 
            svd_solver='auto', 
            iterated_power='auto', 
            random_state=None)`

In [None]:
pca = PCA(n_components = None)

testpca = pca.fit(data)
transformed = pca.fit(data).transform(data)

In [None]:
testpca.mean_

In [None]:
testpca.explained_variance_ratio_

In [None]:
plt.bar(range(3), testpca.explained_variance_ratio_)

In [None]:
# this is a manual calculation of testpca.explained_variance_ratio_

testpca.explained_variance_ / np.sum(testpca.explained_variance_)

&nbsp;

* the **eigenvectors** returned by the `components_` attribute of the fitted `PCA` object are unit vectors (magnitude = 1) that define the new model space.

In [None]:
columns_ , index_ = ['V1','V2','V3']  , ['Comp 1', 'Comp 2', 'Comp 3']

In [None]:
eigenVec = pd.DataFrame(testpca.components_, columns = columns_, index = index_).transpose().copy()
eigenVec

In [None]:
np.dot(eigenVec['Comp 1'], eigenVec['Comp 2'])

In [None]:
np.dot(eigenVec['Comp 1'], eigenVec['Comp 3'])

In [None]:
np.dot(eigenVec['Comp 2'], eigenVec['Comp 3'])

&nbsp;

creating a new frame of reference to visualize the new components 

In [None]:
# points along each principal axis: a range of scalars times the unit eigenvector
# (.iloc is used because eigenVec is indexed by 'V1', 'V2', 'V3', not by position)
t = np.arange(-50,200)

x1 = t*eigenVec['Comp 1'].iloc[0]
y1 = t*eigenVec['Comp 1'].iloc[1]
z1 = t*eigenVec['Comp 1'].iloc[2]

x2 = t*eigenVec['Comp 2'].iloc[0]
y2 = t*eigenVec['Comp 2'].iloc[1]
z2 = t*eigenVec['Comp 2'].iloc[2]

x3 = t*eigenVec['Comp 3'].iloc[0]
y3 = t*eigenVec['Comp 3'].iloc[1]
z3 = t*eigenVec['Comp 3'].iloc[2]

line_1 = pd.concat([pd.Series(x1), pd.Series(y1), pd.Series(z1)], axis = 1)
line_2 = pd.concat([pd.Series(x2), pd.Series(y2), pd.Series(z2)], axis = 1)
line_3 = pd.concat([pd.Series(x3), pd.Series(y3), pd.Series(z3)], axis = 1)

In [None]:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

# this is required to plot a 3d view
%matplotlib notebook

In [None]:
import matplotlib 
matplotlib.__version__

In [None]:
fig = plt.figure(figsize=(9,7))

ax = fig.add_subplot(111, projection='3d')

ax.scatter(xs = data.iloc[:,0], ys = data.iloc[:,1] , zs = data.iloc[:,2] )

ax.plot(xs = line_1.iloc[:,0], ys = line_1.iloc[:,1] , zs = line_1.iloc[:,2], color = 'red')
ax.plot(xs = line_2.iloc[:,0], ys = line_2.iloc[:,1] , zs = line_2.iloc[:,2], color = 'red')
ax.plot(xs = line_3.iloc[:,0], ys = line_3.iloc[:,1] , zs = line_3.iloc[:,2], color = 'red')


ax.mouse_init()

&nbsp;

* in order to obtain the new factors use the instance method `.fit_transform` on the `PCA` object using the original dataframe

In [None]:
Factors = pd.DataFrame(testpca.fit_transform(data))
Factors.head()
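
As a sanity check on what the transformation does, the factors should equal the centered data multiplied by the transposed components. A minimal sketch on synthetic data (the variable names are assumptions for this example):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(50, 3))   # synthetic stand-in for the 3-column example

pca_demo = PCA().fit(X_demo)
# subtract the column means, then project onto the components
manual = (X_demo - pca_demo.mean_) @ pca_demo.components_.T
factors = pca_demo.transform(X_demo)
```

This also shows why `mean_` is stored on the fitted object: the same centering has to be applied to any new data before projecting it.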

with a dependent variable we can use the new factors as independent variables to estimate a model

&nbsp;



In [None]:
reset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

import seaborn as sns
import matplotlib.cm as cm
%matplotlib inline

In [None]:
m490 = pd.read_csv('data/extended_pca_example.csv', header = 0)
m490.head()

In [None]:
m490.shape

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [None]:
function_call = 'Y' + ' ~ ' + '+'.join(m490.columns[1:])

In [None]:
import sys
sys.setrecursionlimit(10000)

In [None]:
lmod = smf.ols(function_call , data = m490)

In [None]:
results = lmod.fit().summary()
print(results)

<a id = 'loc_3'></a>

run `%load udf/signif.py` method [here](#signif)
<a href = '#signif'></a>

In [None]:
signif(results.tables)

&nbsp;

&nbsp;


In [None]:
from sklearn.decomposition import PCA

In [None]:
x_490 = m490.iloc[:,1:]
y_490 = m490.iloc[:,0]

In [None]:
pca_490  = PCA(n_components = None).fit(x_490)

In [None]:
len(pca_490.explained_variance_ratio_)

In [None]:
plt.figure(figsize = (12,7))
#plt.xlim(1,20)
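# use the actual number of components instead of a hard-coded count
plt.bar(range(len(pca_490.explained_variance_ratio_)), pca_490.explained_variance_ratio_)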

In [None]:
# one name per fitted component: PC1, PC2, ...
names = ['PC' + str(i) for i in range(1, pca_490.n_components_ + 1)]

In [None]:
pca_factors = pd.DataFrame(pca_490.fit_transform(x_490), columns = names)
pca_factors.head()

&nbsp;

## Running R script within Python Kernel    
&nbsp;

* package `relaimpo` in **R** has no direct equivalent in **Python** 
* using package `rpy2` we can run **R** code in a jupyter notebook inside a **Python** kernel

to begin, install the package `rpy2` from the Anaconda command prompt:

### conda install rpy2

&nbsp;


In [None]:
import rpy2

In [None]:
%load_ext rpy2.ipython

#### R code begins

#### relative importance of `output ~ PC_1 + PC_2 + . . . PC_490`

In [None]:
# do this only once
#%R install.packages("relaimpo", repos = 'https://CRAN.R-project.org', quiet = TRUE)

In [None]:
#%R install.packages("caret",  repos = 'https://CRAN.R-project.org')

In [None]:
%R rm(list = ls())

In [None]:
%%R 
library(relaimpo)
m490 = read.csv('data/extended_pca_example_r.csv', sep = ',')
pca <- prcomp(m490[,-1])

In [None]:
%%R
df = data.frame(y = m490$Y, pca$x)
model <- lm(y ~ . , df )

In [None]:
%%R
metric <- calc.relimp(model,  type = 'first')
metric.first.rank<-metric@first.rank
metric.first.rank

In [None]:
ranks = %R print(metric.first.rank)

In [None]:
rank_names = %R print(names(metric.first.rank))

#### R code ends

In [None]:
print(rank_names[70:80])
print(ranks[70:80])

In [None]:
ranked_factors = pd.DataFrame(rank_names, ranks).reset_index().sort_values(by = ['index'], axis = 0).iloc[:,1]
ranked_factors.head(10)
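
The reordering step can be hard to read in one line, so here is the same idea on a made-up example: put the names in a Series indexed by their rank and sort by that index. The arrays below are illustrative assumptions, not output from `relaimpo`.

```python
import numpy as np
import pandas as pd

# made-up names and ranks (1 = most important)
rank_names_demo = np.array(['PC1', 'PC2', 'PC3', 'PC4'])
ranks_demo = np.array([3, 1, 4, 2])

# index by rank, then sort the index to get names in order of importance
ordered = pd.Series(rank_names_demo, index=ranks_demo).sort_index().tolist()
```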

&nbsp;

rearrange the columns of `pca_factors` table using the new order obtained from relative importance 

In [None]:
pca_factors_ranked = pca_factors[ranked_factors]
pca_factors_ranked.head()

this is the order of importance of the principal components

&nbsp;

we can also rank the variables `X1` thru `X491` using relative importance   

in this process we are not assessing PCA; we are only checking how the native variables rank in importance.

#### R code begins   

#### relative importance of `output ~ x_1 + x_2 + . . . x_490`  

In [None]:
%%R 
rm(list = ls())
library(relaimpo)
m490 = read.csv('data/extended_pca_example_r.csv', sep = ',')

model_490 <- lm(Y ~ . , m490 )

metric_490 <- calc.relimp(model_490,  type = 'first')
metric_490.first.rank<-metric_490@first.rank
metric_490.first.rank

In [None]:
ranks_m490 = %R print(metric_490.first.rank)
rank_m490_names = %R print(names(metric_490.first.rank))

#### R code ends

In [None]:
print(rank_m490_names[70:80])
print(ranks_m490[70:80])

In [None]:
m490_ranked_variables = pd.DataFrame(rank_m490_names, ranks_m490).reset_index().sort_values(by = ['index'], axis = 0).iloc[:,1]
m490_ranked_variables.head(10)

In [None]:
m490_variables_ranked = m490.iloc[:,1:][m490_ranked_variables]
m490_variables_ranked.head()

&nbsp;

at this point we have developed two new dataframes, so in total we have four different dataframes   

1- Native dataframe unchanged           
2- dataframe with y: output, X: reordered according to relative importance       
3- dataframe with y: output, X: PCA components              
4- dataframe with y: output, X: PCA components reordered according to relative importance     

&nbsp;


### comparing $R^2$ of four different models 


* we would like to compare the incremental increase in $R^2$ across all four models:   
    - the first model is the native model `Y ~ X1, X2, ..... X491`
    - the second model is the native model with the independent variables ordered according to the relative importance measure `Y ~ X297, X345, ..... X242`
    - the third model replaces the variables with the PCA factors in their original order `Y ~ PC1, PC2, ..... PC491`
    - the fourth model replaces the variables with the PCA factors ranked by relative importance `Y ~ PC79, PC176, ..... PC475`
    
We loop through the columns, adding one column at each iteration and calculating $R^2$, then we plot the values of $R^2$ from each model


### 1- native model

we want to observe the improvement in $R^2$ as we add in variables sequentially to the model 

to do this we estimate a linear model about 300 times, each time adding a single variable and recording the value of $R^2$ 

In [None]:
model_m490_rsq = []
for i in range(2,300):
    func_call = 'Y ~ ' + '+'.join(m490.columns.tolist()[1:i])
    lmod_rsq = smf.ols(func_call, data = m490).fit().rsquared
    model_m490_rsq.append(lmod_rsq)
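
The same incremental $R^2$ idea can be sketched without formula strings, using ordinary least squares from NumPy. `incremental_r2` is a hypothetical helper and the data below is synthetic; for models with an intercept the values should agree with the `rsquared` reported by `statsmodels`.

```python
import numpy as np

def incremental_r2(X, y):
    """R^2 of y on the first p columns of X (plus intercept), for p = 1 .. n_columns."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = []
    for p in range(1, X.shape[1] + 1):
        A = np.column_stack([np.ones(len(y)), X[:, :p]])  # intercept + first p columns
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2.append(1.0 - (resid ** 2).sum() / ss_tot)
    return np.array(r2)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
# only columns 0 and 3 carry signal in this synthetic example
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 3] + rng.normal(scale=0.5, size=100)
r2_path = incremental_r2(X_demo, y_demo)
```

Because each model nests the previous one, the $R^2$ values can only stay level or increase as columns are added; what differs between column orderings is how quickly they rise.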

### 2- ranked variables of native model

prepare the DataFrame of output `Y` and ranked order of independent variables

In [None]:
ranked_m490 = pd.concat([pd.Series(y_490).reset_index(drop=True), m490_variables_ranked], axis = 1)
ranked_m490.head()

In [None]:
model_m490_ranked_rsq = []
for i in range(2,300):
    func_call = 'Y ~ ' + '+'.join(ranked_m490.columns.tolist()[1:i])
    lmod_rsq = smf.ols(func_call, data = ranked_m490).fit().rsquared
    model_m490_ranked_rsq.append(lmod_rsq)

&nbsp;

### 3- PCA/PCR factors without ranking

In [None]:
PCR_df = pd.concat([pd.Series(y_490).reset_index(drop=True), pca_factors], axis = 1)
PCR_df.head()

In [None]:
model_PCR_rsq = []
for i in range(2,300):
    func_call = 'Y ~ ' + '+'.join(PCR_df.columns.tolist()[1:i])
    lmod_rsq = smf.ols(func_call, data = PCR_df).fit().rsquared
    model_PCR_rsq.append(lmod_rsq)

&nbsp;


### 4- ranked pca factors

prepare the DataFrame of output `Y` and ranked PCA factors

In [None]:
PCR_df_ranked = pd.concat([pd.Series(y_490).reset_index(drop=True), pca_factors_ranked], axis = 1)
PCR_df_ranked.head()

In [None]:
model_PCR_ranked_rsq = []
for i in range(2,300):
    func_call = 'Y ~ ' + '+'.join(PCR_df_ranked.columns.tolist()[1:i])
    lmod_rsq = smf.ols(func_call, data = PCR_df_ranked).fit().rsquared
    model_PCR_ranked_rsq.append(lmod_rsq)

&nbsp;

plot the values of $R^2$ 

In [None]:
plt.figure(figsize = (15,10))

plt.plot(range(298), model_m490_rsq, color = 'blue')
plt.plot(range(298), model_m490_ranked_rsq, color = 'green')
plt.plot(range(298), model_PCR_rsq, color = 'red')
plt.plot(range(298), model_PCR_ranked_rsq, color = 'magenta')
plt.axhline(.95, linewidth = 1, linestyle = '--', color = 'black')
plt.legend(['native model','native model ranked','PCR','ranked PCR'],loc = 'best', prop = {'size':15})

&nbsp;



<a id = 'VAF'></a>

In [None]:
%load udf/VAF.py

[back to seeds analysis](#loc_1)
<a href = '#loc_1'></a>

[back to chocolate flavor analysis](#loc_2)
<a href = '#loc_2'></a>

<a id = 'signif'></a>

In [None]:
%load udf/signif.py

[back to PCA](#loc_3)
<a href = '#loc_3'></a>