# Exploratory Factor Analysis

I created a relatively large set of variables during feature building, several of which are probably correlated with one another. For instance, being born in the USA correlates highly with being born on the continent of North America. Now I would like to:

1. Reduce the *dimensionality of the feature space* to help prevent overfitting during model building.
2. Find a representation of the "measured" variables in a lower-dimensional space of "unobserved" *latent factors* that span them. Reducing the variables to factors helps with the interpretability of models.

The aim is to possibly use the output in building machine learning models that can predict *Nobel Laureates in Physics*.

[Exploratory Factor Analysis](https://en.wikipedia.org/wiki/Exploratory_factor_analysis) (EFA) is a multivariate statistical method that may help with this as it was designed to uncover latent structure in a relatively large set of variables. [Factor Analysis](https://en.wikipedia.org/wiki/Factor_analysis) uses the [correlation matrix](https://en.wikipedia.org/wiki/Correlation_and_dependence#Correlation_matrices) of the variables to examine intercorrelations between the measured variables. It reduces the dimensionality of the matrix by finding groups of variables with high intra-correlation but with low intercorrelation with other groups of variables. A group of these variables is a construct known as a *factor* and in a good factor model the factors have meaning and can easily be named.

There are several different types of factor models. Since I have only categorical (binary) features, the one that seems most appropriate is [Multiple Correspondence Analysis](https://en.wikipedia.org/wiki/Multiple_correspondence_analysis) (MCA). It is essentially the counterpart of [Principal Components Analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) (PCA) for categorical data. Fortunately, there is a nice Python library called [prince](https://github.com/MaxHalford/prince) that implements MCA along with other factor analysis methods. I will be using this library for my analysis.

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from prince import MCA

%matplotlib inline

## Reading in the Data

First let's read in the training features and the target.

In [None]:
train_features = pd.read_csv('../data/processed/train-features.csv')
train_features.tail()

In [None]:
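# Squeeze the single-column DataFrame into a Series (the `squeeze` argument
# of read_csv was removed in pandas 2.0).
target = pd.read_csv('../data/processed/train-target.csv').squeeze('columns')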
target.tail()

## Suitability of Data for Factor Analysis

Any factor analysis starts with the same question: *Is the data suitable for factor analysis?* There are a few issues to address here. The first concerns the *minimum sample size* and the *subjects-to-variables* (STV) ratio. There are numerous rules of thumb, and researchers and empirical studies differ in their findings. An excellent review of these is given in chapter 3 of [Best Practices in Exploratory Factor Analysis](https://www.researchgate.net/publication/209835856_Best_Practices_in_Exploratory_Factor_Analysis_Four_Recommendations_for_Getting_the_Most_From_Your_Analysis) and a very good short summary is given in [The Minimum Sample Size in Factor Analysis](https://www.encorewiki.org/display/~nzhao/The+Minimum+Sample+Size+in+Factor+Analysis). To cut a very long story short, my sample size of *N = 542* is deemed sufficient by all researchers and even very good by some. However, my *STV ratio = 542 / 202 = 2.68* is considered unacceptable by many researchers. That said, both references give examples of successful factor analyses with lower values than this.

The last issue concerns the *factorability of the correlation matrix* itself. According to Wikiversity's article on [Exploratory Factor Analysis](https://en.wikiversity.org/wiki/Exploratory_factor_analysis), "Factorability is the assumption that there are at least some correlations amongst the variables so that coherent factors can be identified. Basically, there should be some degree of collinearity among the variables but not an extreme degree or singularity among the variables". There are in fact two statistical tests for this: [Bartlett’s test of sphericity](https://en.wikipedia.org/wiki/Bartlett%27s_test) and the [Kaiser–Meyer–Olkin](https://www.statisticshowto.datasciencecentral.com/kaiser-meyer-olkin/) (KMO) test. However, I'm not going to say too much about these as they are based on the assumption that the data is multivariate normal, which clearly isn't the case here.

The article [Establishing Evidence for Internal Structure Using Exploratory Factor Analysis](https://www.tandfonline.com/doi/pdf/10.1080/07481756.2017.1336931) suggests that "an intercorrelation matrix is deemed factorable when the majority of the correlation coefficients computed are in the moderate range wherein r values are between .20 and .80. If a significant number of variables are producing values below .20 (i.e., items not representing same construct) or above .80 (i.e., multicollinearity), the researcher should consider eliminating these items before conducting an EFA (Field, 2013)". OK let's see if this is the case here making sure to take into consideration the fact that it should not matter if the correlations are positive or negative.

In [None]:
train_features_numerical = train_features.drop('full_name', axis='columns')
train_features_numerical = train_features_numerical.replace({'yes': 1, 'no': 0, 'male': 1, 'female': 0})
correlation = train_features_numerical.corr()
correlation

In [None]:
print('Percent of correlations with absolute value in (0.2, 0.8): {0} %'.format(
    round(100 * ((abs(correlation) > 0.2) & (abs(correlation) < 0.8)).sum().sum() /
          len(correlation) ** 2))
)

As you can see, only a small percentage of the values are within this range, so the correlation matrix clearly fails the criterion given above. However, this is not the only viewpoint on the matter. The article [Exploratory factor analysis: A five-step guide for novices](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.414.4818&rep=rep1&type=pdf) states that "Tabachnick and Fidell recommended inspecting the correlation matrix (often termed Factorability of R) for correlation coefficients over 0.30... If no correlations go beyond 0.30, then the researcher should reconsider whether factor analysis is the appropriate statistical method to utilise." Clearly, there are some correlations above an absolute value of 0.3 in the matrix, so by this criterion the correlation matrix is factorable. As you can see, there are a lot of contrasting recommendations in factor analysis! So for now I will proceed, as there are some correlations amongst the variables.
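Both factorability rules of thumb can be checked on the distinct off-diagonal entries of a correlation matrix. The sketch below is only an illustration: `summarize_off_diagonal` is a hypothetical helper, the toy matrix is made up, and the thresholds are simply the ones quoted above.

```python
import numpy as np


def summarize_off_diagonal(correlation, low=0.2, high=0.8, floor=0.3):
    """Summarize the absolute correlations above the diagonal.

    Each variable pair is counted once and the diagonal of ones is ignored.
    """
    corr = np.abs(np.asarray(correlation, dtype=float))
    upper = corr[np.triu_indices_from(corr, k=1)]
    in_range = (upper > low) & (upper < high)
    return {
        'percent_in_range': 100 * in_range.mean(),
        'any_above_floor': bool((upper > floor).any()),
    }


# Made-up 3 x 3 correlation matrix, purely for illustration.
toy_correlation = np.array([
    [1.00, 0.50, 0.10],
    [0.50, 1.00, -0.25],
    [0.10, -0.25, 1.00],
])
summary = summarize_off_diagonal(toy_correlation)
print(summary)
```

Counting each pair once and excluding the diagonal gives a slightly different denominator from dividing by the full matrix size, although with a couple of hundred variables the difference should be small.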

Let me digress briefly to explain a subtle but important point. Some readers may be wondering why I'm perfectly comfortable using a [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) to measure the correlation between pairs of binary variables. The reason is that the Pearson correlation coefficient calculated for two binary variables is identical to the [phi coefficient](https://en.wikipedia.org/wiki/Phi_coefficient), the standard measure of association between two binary variables.
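This equivalence is easy to check numerically. The sketch below computes the phi coefficient from the 2 x 2 contingency table and compares it with the Pearson correlation. The two arrays are made-up values, not rows from the training data, and `phi_coefficient` is a hypothetical helper written just for this check.

```python
import numpy as np


def phi_coefficient(x, y):
    """Phi coefficient from the 2 x 2 contingency table of two binary arrays."""
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    n11 = np.sum(x & y)    # both present
    n10 = np.sum(x & ~y)   # only x present
    n01 = np.sum(~x & y)   # only y present
    n00 = np.sum(~x & ~y)  # both absent
    numerator = n11 * n00 - n10 * n01
    denominator = np.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return numerator / denominator


# Made-up binary features, purely for illustration.
born_in_usa = np.array([1, 1, 0, 0, 1, 0, 1, 0])
born_in_north_america = np.array([1, 1, 1, 0, 1, 0, 1, 0])

phi = phi_coefficient(born_in_usa, born_in_north_america)
pearson = np.corrcoef(born_in_usa, born_in_north_america)[0, 1]
print(phi, pearson)
```

The two values agree to floating point precision, which is why `DataFrame.corr()` with its default Pearson method is a reasonable choice for these binary features.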

## How Many Factors?

It is now time to perform the factor analysis and determine how many factors to retain. Again, this is more of an art than a science, as there are numerous recommended ways of doing this. Some of the simpler, more straightforward and intuitive ways are:

- *Cumulative variance accounted for by the retained factors*. Here again there are a few recommendations, although most researchers recommend a value in the 75-90% range.
- *Scree plot*. A plot of the extracted factors against their eigenvalues in descending order of magnitude. Typically the elbow in the plot is identified where the larger eigenvalues end and the smaller eigenvalues begin. Factors to the left of the elbow are retained and those to the right are dropped. Note that this is quite subjective as sometimes there can be more than one elbow.
- *Kaiser Greater-Than-One Rule*, which says that only those factors with eigenvalues greater than 1 should be retained for interpretation. Again this is arbitrary; however, an eigenvalue of 1 is the value at which a factor accounts for at least as much variance as any individual variable.

OK let's perform the factor analysis now and use these criteria to decide on the number of factors to retain.
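Two of the three criteria above are mechanical enough to express in a few lines. The sketch below is only an illustration: `factors_to_retain` is a hypothetical helper, the eigenvalues are made up, and the 75% target is just the lower end of the range quoted above. The scree plot criterion is left out because locating the elbow is a visual judgement.

```python
import numpy as np


def factors_to_retain(eigenvalues, explained_ratio, variance_target=0.75):
    """Number of factors suggested by the Kaiser rule and by cumulative variance.

    `eigenvalues` and `explained_ratio` are assumed to be sorted in
    descending order, one entry per extracted factor.
    """
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    cumulative = np.cumsum(np.asarray(explained_ratio, dtype=float))
    # Kaiser rule: keep factors whose eigenvalue exceeds 1.
    kaiser = int(np.sum(eigenvalues > 1.0))
    # Cumulative variance: smallest number of factors reaching the target,
    # or None if the extracted factors never reach it.
    reached = np.flatnonzero(cumulative >= variance_target)
    by_variance = int(reached[0]) + 1 if reached.size else None
    return {'kaiser': kaiser, 'cumulative_variance': by_variance}


# Made-up eigenvalues, purely for illustration.
example_eigenvalues = np.array([3.2, 1.9, 1.1, 0.8, 0.5, 0.3, 0.2])
example_ratio = example_eigenvalues / example_eigenvalues.sum()
suggestion = factors_to_retain(example_eigenvalues, example_ratio)
print(suggestion)
```

When the criteria disagree, or when one of them returns zero or `None`, that is itself a sign that the factor model may not be a good description of the data.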

In [None]:
mca = MCA(
    n_components=10,
    n_iter=10,
    copy=True,
    random_state=0,
    engine='sklearn'
)
train_features = train_features.drop('full_name', axis='columns')
mca = mca.fit(train_features)

In [None]:
ax1 = sns.lineplot(x=range(1, 11), y=mca.eigenvalues_)
ax1.set_xlim(0, 10)
ax1.set_ylim(0, 1.0)
ax1.set_xlabel('Number of factors')
ax1.set_ylabel('Eigenvalues')
ax1.set_title('Scree plot')
ax1.set_xticks(range(0, 11, 2));

In [None]:
ax = sns.lineplot(x=range(1, 11), y=np.cumsum(mca.explained_inertia_))
ax.figure.set_size_inches((8, 6))
ax.set_xlabel('Number of factors')
ax.set_ylabel('Cumulative variance')
ax.set_title('Cumulative variance accounted for by factors')
ax.set_xlim((0, 10))
ax.set_ylim((0, 1.0))
ax.axhline(y=0.9, linestyle='--', color='r', linewidth=1.0);

The scree plot suggests taking 2 factors. However, the Kaiser rule suggests that these factors are very poor as the eigenvalues are small. All eigenvalues are well below 1, indicating that they explain far less variance than any individual feature. This is further backed up by the cumulative variance plot, which shows that only about 33% of the variance in the data is explained by the first 10 factors.

However, these are not the only considerations when choosing the number of factors. Very important are the following criteria:

- All factors should be interpretable. In other words, one should be able to coherently name and describe the set of collective variables in an underlying factor.
- There should be several variables that load onto each of the factors. Generally, the more variables per factor, the greater the reliability of the factor. Typically, a minimum of 3 variables per factor is recommended.
- The model should be parsimonious, meaning that each variable should load highly onto one particular factor but weakly onto the other factors. Typical recommendations are loadings with absolute values above 0.3, 0.4 or 0.5, with minimal cross-loadings.

With these criteria considered, I find that a factor model with any number of factors seems implausible. To see this, we can examine the table below. The 0th factor doesn't make any intuitive sense at all.

In [None]:
factor_loadings = mca.column_coordinates(train_features)
factor_loadings.loc[factor_loadings[0] < -0.4, 0:5].sort_values(by=0, ascending=True)

It seems pretty clear that this factor analysis is going nowhere.

## Discussion

I suspect that the factor analysis results may be invalid due to the STV ratio and / or sample size. Another theory I have is that the features are just too sparse for factor analysis to extract any meaningful information from the correlations in the data. There is some discussion in the context of PCA in [how can one extract meaningful factors from a sparse matrix](https://stats.stackexchange.com/questions/4753/how-can-one-extract-meaningful-factors-from-a-sparse-matrix). If you recall, earlier we saw that only 7% of the values in the correlation matrix had absolute values of correlation coefficients between 0.2 and 0.8. In fact most of the remaining 93% of the values have absolute values of correlation coefficients less than or equal to 0.2.

In [None]:
print('Percent of correlations with absolute value less than or equal to 0.2: {0} %'.format(
    round(100 * (abs(correlation) <= 0.2).sum().sum() / len(correlation) ** 2))
)

This sparsity was of course induced by the binary encoding of variables during feature construction. The point is that most physicists are only associated with a very small fraction of the features. Finding latent structure in such data is difficult. So where does this leave us now?

## Conclusion

The approaches I have taken so far have been fruitless in achieving the two goals I set out at the start of this EFA. However, there are alternative approaches that can be taken. One such alternative is [Multidimensional scaling](https://en.wikipedia.org/wiki/Multidimensional_scaling) (MDS). This would have to use a distance "metric" such as the [Gower distance](https://stats.stackexchange.com/questions/15287/hierarchical-clustering-with-mixed-type-data-what-distance-similarity-to-use) since Euclidean distance is just not appropriate for such binary data types. There is no well-established implementation in Python, although a [Gower Similarity Coefficient implementation in sklearn](https://github.com/scikit-learn/scikit-learn/issues/5884) may not be too far away. There is however a [rudimentary Gower Python implementation](https://datascience.stackexchange.com/questions/8681/clustering-for-mixed-numeric-and-nominal-discrete-data), although according to the previous reference, it should be using the Jaccard coefficient for "present" vs "absent" binary variables. For this data, the Gower similarity would essentially reduce to a combination of the Jaccard and Dice coefficients, so coding it up would not be too difficult.
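To give a feel for what coding this up might involve, here is a minimal NumPy sketch of the Jaccard part only. It is not a full Gower implementation: `jaccard_distance_matrix` is a hypothetical helper, the toy rows are made up, and treating two all-zero rows as identical is an assumed convention.

```python
import numpy as np


def jaccard_distance_matrix(binary_features):
    """Pairwise Jaccard distances between the rows of a binary matrix."""
    X = np.asarray(binary_features, dtype=bool).astype(int)
    both = X @ X.T                        # features present in both rows
    row_sums = X.sum(axis=1)
    either = row_sums[:, None] + row_sums[None, :] - both  # present in at least one row
    with np.errstate(divide='ignore', invalid='ignore'):
        # Assumed convention: two rows with no features present are identical.
        similarity = np.where(either > 0, both / either, 1.0)
    return 1.0 - similarity


# Made-up "present"/"absent" features for three hypothetical physicists.
toy_features = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
])
distances = jaccard_distance_matrix(toy_features)
print(distances)
```

Shared absences do not contribute to the Jaccard coefficient, which is the property that makes it attractive for sparse "present" vs "absent" data. A matrix like `distances` could then be handed to any method that accepts precomputed dissimilarities.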

Using sklearn's MDS is not actually viable for dimensionality reduction since [sklearn's MDS implementation has no `transform` method](https://stackoverflow.com/questions/21962204/sklearn-manifold-mds-has-no-transform-method), which means that new data points cannot be projected onto the embedding space that the MDS was fit on. I am not sure how far off this is in sklearn, as the [issue has been pending for a few years now](https://github.com/scikit-learn/scikit-learn/pull/6222).

Another [approach that is closely related to the previous one](https://stats.stackexchange.com/questions/87681/performing-pca-with-only-a-distance-matrix) is to use [kernel principal component analysis](https://en.wikipedia.org/wiki/Kernel_principal_component_analysis) with the Gower distance. This is possible in sklearn as [sklearn's kernel PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html) implementation accepts a precomputed kernel matrix via `kernel='precomputed'`, so a similarity derived from a "metric" other than Euclidean distance can be supplied. However, I am concerned about using any form of PCA to reduce the dimensionality of this binary data, as PCA works with a centered Gram matrix. PCA therefore does not seem like a natural fit for binary data, as there is [no reason to assume that the binary data is centered anywhere other than at the origin](https://stats.stackexchange.com/questions/16331/doing-principal-component-analysis-or-factor-analysis-on-binary-data).
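To make the centering point concrete, the sketch below double-centers a Gram matrix and checks that, for a linear kernel, the result equals the Gram matrix of the column-centered data. The matrix `X` is made up and `center_gram_matrix` is a hypothetical helper; this illustrates the centering step only, not a kernel PCA implementation.

```python
import numpy as np


def center_gram_matrix(K):
    """Double-center a Gram matrix: K_c = H K H with H = I - (1/n) 1 1^T."""
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    H = np.eye(n) - np.full((n, n), 1.0 / n)
    return H @ K @ H


# Made-up binary feature matrix (rows are samples), purely for illustration.
X = np.array([
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
], dtype=float)

K = X @ X.T                       # linear-kernel Gram matrix
K_centered = center_gram_matrix(K)
X_centered = X - X.mean(axis=0)   # subtract each feature's mean

# Both routes give the same matrix.
print(np.allclose(K_centered, X_centered @ X_centered.T))
```

After centering, the entries of `X_centered` are no longer 0 or 1, which is the source of the unease about applying PCA-style methods to binary data.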

An approach to dimensionality reduction that seems more attractive for this data is [Non-negative matrix factorization](https://en.wikipedia.org/wiki/Non-negative_matrix_factorization) (NMF), as the features matrix consists entirely of non-negative entries. Although general NMF is not suitable for the binary case, as the approximation is not bounded from above, there is an extension of NMF to the binary case known as [binary matrix factorization](http://ranger.uta.edu/~chqding/papers/icdm07-binary.pdf) (BMF). Hong LiangJie has performed [reviews on BMF](http://www.hongliangjie.com/2011/03/15/reviews-on-binary-matrix-decomposition/) and states "In all, it seems that the performance advantages of specifically designed binary data models are small. However, the biggest advantage of these models is that they can give better interpretations sometimes."

I explored a [BMF model](http://nimfa.biolab.si/nimfa.methods.factorization.bmf.html) for dimensionality reduction using [nimfa](http://nimfa.biolab.si/), the only Python implementation I could find. At first it seemed promising; however, I found two major roadblocks. The first is that the *penalty function method* it implements is only really appropriate for dense binary data. This is discussed by Zhang in [Binary Matrix Factorization with Applications](http://ranger.uta.edu/~chqding/papers/icdm07-binary.pdf) along with the *thresholding algorithm*, which is more appropriate for sparse binary data. Unfortunately, the *thresholding algorithm* is not implemented in *nimfa*. The second is the same limitation mentioned above for sklearn's MDS: there is no way of projecting new data points onto the embedding space that the model was fit on. As a last resort, I could possibly have [altered the code to perform this projection](https://github.com/marinkaz/nimfa/issues/43); however, this would require some testing and is not a completely satisfactory solution. In the end I found another, more powerful approach that looks promising, which I'll discuss in the next notebook.