# Chapter 10: Factor Analysis

In [8]:
%reset
low_memory=False
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import seaborn as sns; sns.set()
from scipy import stats
import math
import os
import random
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.decomposition import FactorAnalysis

Once deleted, variables cannot be recovered. Proceed (y/[n])? y


## 1.1 Introduction & Motivation

In the previous chapter we saw how dimensionality reduction techniques can give us a better grasp of our dataset. This week we'll look at another dimensionality reduction technique: Factor Analysis.

Factor Analysis is, in fact, a more general family of data science methods, of which PCA can be seen as one member. In factor analysis, we try to "summarize" several observed variables into a single underlying variable, called a factor. For example:

![](https://res.cloudinary.com/dchysltjf/image/upload/f_auto,q_auto:best/v1554830233/1.png)

If this sounds familiar, that's because it is. It's extremely similar to what we have done with PCA. The two main goals of factor analysis are the following:
* Reducing the number of observed variables
* Finding unobservable variables

In essence, PCA can be viewed as a specific type of Factor Analysis, which also makes it more restrictive. In general, PCA places stricter constraints on the solution (its components must be orthogonal and ordered by the variance they explain) and delivers components that are often hard to interpret, whereas FA is designed to produce factors we can interpret. PCA focuses on reducing the number of dimensions in our dataset while preserving as much of its variance as possible. FA focuses on finding variables that we cannot observe directly.

## 1.2 Problem Setting

Our setting is the same as in the previous chapter:

In [9]:
from sklearn.datasets import load_digits
digits = load_digits()
digits.data.shape

(1797, 64)

## 1.3 Model

### 1.3.1 Model

Remember how PCA reduces the number of variables: it looks for the direction of highest variance in our dataset and puts it into a first component. The variance explained by this component is then removed from the dataset. This process is repeated until the desired number of components has been constructed.
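As a minimal sketch of this behaviour, the cell below fits sklearn's `PCA` on the digits data and prints the share of variance captured by each component. The choice of 7 components and the variable names `pca` and `ratios` are only illustrative assumptions, not part of the chapter's own code.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()

# Fit PCA with 7 components (7 is only an illustrative choice).
pca = PCA(n_components=7, random_state=0)
pca.fit(digits.data)

# Share of the total variance captured by each component, in order.
ratios = pca.explained_variance_ratio_
print(np.round(ratios, 3))

# Running total: how much variance the first k components explain together.
print(np.round(np.cumsum(ratios), 3))
```

The printed ratios should never increase from one component to the next, which reflects the "highest variance first" ordering described above.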

Because the main objective of Factor Analysis is to discover unobserved variables, it works a little differently. Factor Analysis looks for variables that are correlated. Variables that are highly correlated are grouped together in a factor. Afterwards, variables can still move between factors so that the rule *the factors must not be correlated* is respected. In this way, we hope that each factor contains as much information as possible.
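One way to make this concrete, sketched here under the assumption that we use sklearn's `FactorAnalysis` with an illustrative 7 factors: the fitted model explains the correlations between variables through a shared part (the factor loadings) plus a separate noise variance for every variable. The names `fa`, `W`, `psi` and `model_cov` are hypothetical and chosen only for this example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import FactorAnalysis

digits = load_digits()

# 7 factors is only an illustrative choice.
fa = FactorAnalysis(n_components=7, random_state=0)
fa.fit(digits.data)

W = fa.components_          # factor loadings, shape (7, 64)
psi = fa.noise_variance_    # noise variance per original variable, shape (64,)

# Covariance implied by the fitted model:
# shared part (from the factors) + variable-specific part (diagonal noise).
model_cov = W.T @ W + np.diag(psi)

# sklearn exposes the same quantity through get_covariance().
print(np.allclose(model_cov, fa.get_covariance()))
```

All covariance between two different variables comes from the `W.T @ W` term, since the noise term only touches the diagonal. This is the sense in which correlated variables end up sharing factors.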

### 1.3.2 Model Estimation

##### Question 1: In the previous chapter, we saw the scree plot and the elbow method for detecting the optimal number of components. Construct the scree plot and determine the optimal number of factors for our dataset.

Luckily, sklearn comes with this algorithm built in. Let's try it out by fitting the model with 7 components:

In [12]:
transformer = FactorAnalysis(n_components=7, random_state=0)
X_transformed = transformer.fit_transform(digits.data)
X_transformed.shape

(1797, 7)

The transformed data has 7 columns, so we have indeed extracted 7 components. However, we don't yet know how good they are or what they represent. Let's take a look at the components:

In [16]:
print(transformer.components_)
print(transformer.components_.shape)

[[-4.17469616e-24  5.02577078e-01  4.53311760e+00  2.63289978e+00
  -1.51699869e-01  4.32323078e-01  1.21382881e-02 -7.14557888e-02
   4.09419003e-03  2.05001741e+00  3.75028163e+00 -1.21025910e+00
   1.96019078e-01  7.62738938e-01 -3.19641199e-01 -1.04646745e-01
  -4.34023500e-06  1.18355749e+00 -5.89688525e-01 -2.20614780e+00
   1.59528558e+00 -3.00455192e-01 -8.70776785e-01 -6.89139533e-02
  -9.95228390e-04 -6.28815395e-01 -2.43700889e+00  6.66691032e-01
   1.64493510e+00 -1.28722071e+00 -9.59180654e-01 -3.32476347e-03
  -0.00000000e+00 -1.59123526e+00 -3.14596956e+00 -3.56643018e-01
  -6.29465818e-01 -1.87877413e+00 -7.91780932e-01 -0.00000000e+00
  -1.39270375e-02 -1.08335169e+00 -2.60395724e+00 -1.10301519e+00
  -1.58739684e+00 -1.30938955e+00 -2.12753191e-01 -1.90849484e-02
  -9.50512402e-03  3.10039395e-01  1.53494455e+00 -2.49467720e-02
  -6.31571499e-01  5.62856328e-01  4.81686404e-01 -1.71975971e-03
  -3.59073940e-04  4.70877934e-01  4.83125099e+00  2.43816050e+00
  -6.73072

We get back 7 factors with 64 numbers each: these are the loadings, the weights that link each factor to each of our 64 original variables (the pixels of the image).
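Because every original variable is one pixel of an 8x8 image, the 64 numbers of a factor can be viewed as an 8x8 map. The sketch below refits the same model so that it runs on its own, and reports for each factor the pixel with the largest absolute value. The names `loading_maps`, `row` and `col` are hypothetical and used only for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import FactorAnalysis

digits = load_digits()
transformer = FactorAnalysis(n_components=7, random_state=0)
transformer.fit(digits.data)

# Each row of components_ holds one number per pixel; view it as an 8x8 image.
loading_maps = transformer.components_.reshape(7, 8, 8)

for i, loading_map in enumerate(loading_maps):
    # Position (row, column) of the pixel with the largest absolute value.
    row, col = np.unravel_index(np.abs(loading_map).argmax(), loading_map.shape)
    print(f"factor {i}: strongest pixel at row {row}, column {col}, "
          f"value {loading_map[row, col]:.2f}")
```

Each `loading_maps[i]` can also be passed to `plt.imshow` to see which regions of the image a factor responds to.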

##### Question 2: Look in the documentation to find out how to determine the score of our model. What is the score for 7 factors?

##### Question 3: Fit the factor analysis for the optimal number of components and calculate its score. Is this model indeed better than the one with 7 factors?

## 1.4 Exercises

##### Question 1: See section 1.3.2
##### Question 2: See section 1.3.2
##### Question 3: See section 1.3.2
##### Question 4: Compare the PCA and the FA of the digits dataset. Do you find the same number of factors in the best version of each model? Can you explain why?
##### Question 5: Compare the PCA and the FA of the digits dataset. Which model do you think is the best? Why is this?