# Topic Modeling

If you recall, the goals of the unsuccessful [exploratory factor analysis](4.1-exploratory-factor-analysis.ipynb) were:

1. Reduce the dimensionality of the feature space to help prevent overfitting during model building.
2. Find a representation of the "measured" variables in a lower dimensional space of "unobserved" latent factors that span them. Reducing the variables to factors helps with interpretability of models.

I still believe I can achieve these goals with a different approach and this is the aim of this notebook.

A [topic model](https://en.wikipedia.org/wiki/Topic_model) is an unsupervised method in natural language processing for discovering latent *topics* in a corpus of documents. A topic is essentially a collection of words that frequently co-occur in documents. So in the topic modeling framework, a document consists of topics and topics are composed of words.

It is important to understand that topic modeling is not restricted to words and can be used for any discrete data. In this case, the discrete data (words) are the binary features and the corpus of documents is the set of physicists. I will use topic modeling to discover latent topics, analogous to the latent factors in factor analysis, that underlie the physicists data. The number of topics is specified *a priori* and is expected to correspond to the intrinsic dimensionality of the data. As such, it is expected to be much lower than the dimensionality of the feature data.

[Correlation Explanation](https://www.transacl.org/ojs/index.php/tacl/article/view/1244/275) (*CorEx*) is a discriminative and information-theoretic approach to learning latent topics over documents. It is different from most topic models as it does not assume an underlying generative model for the data. It instead learns maximally informative topics through an information-theoretic framework. The *CorEx* topic model seeks to maximally explain the dependencies of words in documents through latent topics. *CorEx* does this by maximizing a lower bound on the [total correlation](https://en.wikipedia.org/wiki/Total_correlation) (multivariate [mutual information](https://en.wikipedia.org/wiki/Mutual_information)) of the words and topics.
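
To make total correlation concrete, here is a minimal sketch that computes it empirically for a few binary variables as the sum of the marginal entropies minus the joint entropy. The toy data and the helper names `entropy` and `total_correlation` are made up for illustration, and this is not how *CorEx* estimates or bounds TC internally.

```python
import numpy as np


def entropy(counts):
    """Shannon entropy (in nats) of the empirical distribution given by counts."""
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


def total_correlation(X):
    """Empirical total correlation (nats): sum_i H(X_i) - H(X_1, ..., X_n)."""
    X = np.asarray(X)
    marginal = sum(entropy(np.unique(X[:, j], return_counts=True)[1])
                   for j in range(X.shape[1]))
    joint = entropy(np.unique(X, axis=0, return_counts=True)[1])
    return marginal - joint


# Hypothetical toy data: columns 0 and 1 are identical, column 2 is independent of them.
X_dependent = np.array([[0, 0, 0],
                        [0, 0, 1],
                        [1, 1, 0],
                        [1, 1, 1]])

# Hypothetical toy data: two independent fair bits, so there is nothing to explain.
X_independent = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

tc_dependent = total_correlation(X_dependent)
tc_independent = total_correlation(X_independent)
print(round(tc_dependent, 4), round(tc_independent, 4))
```

The duplicated column contributes one bit (ln 2 nats) of redundancy, whereas the independent columns give a TC of zero. A latent topic that captures the shared column would "explain" that dependence.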

There are many advantages of the CorEx model that make it particularly attractive. The most relevant ones for this study are:
- No generative model is assumed for the data, which means there are no assumptions to validate that may or may not be true. The latent topics are learnt entirely from the data. This makes the model extremely flexible and powerful.
- The method can be used for any sparse binary dataset and its algorithm naturally and efficiently takes advantage of the sparsity in the data.
- Binary latent topics are learnt, which leads to highly interpretable models. A document can consist of no topics, all topics, or any number of topics in between.
- No tuning of numerous hyperparameters. There is only one hyperparameter, the *number of topics*, and there is a principled way to choose this.
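
As a rough illustration of the "documents by words" layout for binary data, the sketch below builds a tiny binary dataframe and converts it to a sparse matrix in the same way this notebook does later. The physicist and feature names are hypothetical placeholders.

```python
import pandas as pd
import scipy.sparse as ss

# Hypothetical toy data: rows are "documents" (physicists), columns are "words" (binary features).
toy_features = pd.DataFrame(
    [[1, 0, 0, 1],
     [0, 0, 1, 0],
     [1, 0, 0, 0]],
    index=['physicist_a', 'physicist_b', 'physicist_c'],
    columns=['feature_w', 'feature_x', 'feature_y', 'feature_z'])

# Compressed sparse row format stores only the non-zero entries.
X_toy = ss.csr_matrix(toy_features.values)

density = X_toy.nnz / (X_toy.shape[0] * X_toy.shape[1])
print(X_toy.shape, X_toy.nnz, round(density, 2))
```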

The mathematical and implementation details of the *CorEx* model can be found in section 2 of [Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge](https://www.transacl.org/ojs/index.php/tacl/article/view/1244/275) by Gallagher et al. I will be using the Python implementation [corextopic](https://github.com/gregversteeg/corex_topic) for the topic modeling.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.sparse as ss
import seaborn as sns
from corextopic import corextopic as ct

from src.features.features_utils import convert_categoricals_to_numerical
from src.data.progress_bar import progress_bar

%matplotlib inline

## Reading in the Data

First let's read in the training, validation and test features and convert the categorical fields to a numerical form that is suitable for building machine learning models.

In [None]:
train_features = pd.read_csv('../data/processed/train-features.csv')
train_features = convert_categoricals_to_numerical(train_features)
train_features.head()

In [None]:
validation_features = pd.read_csv('../data/processed/validation-features.csv')
validation_features = convert_categoricals_to_numerical(validation_features)
validation_features.head()

In [None]:
test_features = pd.read_csv('../data/processed/test-features.csv')
test_features = convert_categoricals_to_numerical(test_features)
test_features.head()

In [None]:
# drop these variables which have zero values for all physicists.
# need to fix the bug (probably in rank-hot encoding) that is in feature building.
train_features = train_features.drop([
    'num_alma_mater_country_alpha_3_codes_at_least_4',
    'num_physics_laureate_doctoral_advisors_at_least_2',
    'num_alma_mater_continent_codes_at_least_3'], axis='columns')
validation_features = validation_features.drop([
    'num_alma_mater_country_alpha_3_codes_at_least_4',
    'num_physics_laureate_doctoral_advisors_at_least_2',
    'num_alma_mater_continent_codes_at_least_3'], axis='columns')
test_features = test_features.drop([
    'num_alma_mater_country_alpha_3_codes_at_least_4',
    'num_physics_laureate_doctoral_advisors_at_least_2',
    'num_alma_mater_continent_codes_at_least_3'], axis='columns')
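
Rather than hard-coding the column names, the all-zero columns could be detected from the training set and the same list dropped from every split. A minimal sketch, using a hypothetical helper `find_all_zero_columns` and toy data that are not part of the original notebook:

```python
import pandas as pd


def find_all_zero_columns(features):
    """Return the names of the columns whose values are zero in every row."""
    return features.columns[(features == 0).all(axis='index')].tolist()


# Hypothetical toy features: only column 'b' is zero for every row.
toy = pd.DataFrame({'a': [0, 1, 0], 'b': [0, 0, 0], 'c': [1, 1, 1]})
zero_columns = find_all_zero_columns(toy)
print(zero_columns)

# Drop the same list from every split so that the columns stay aligned.
toy_reduced = toy.drop(zero_columns, axis='columns')
```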

## Model Selection

There is a principled way for choosing the *number of topics*. Gallagher et al. state that "Since each topic explains a certain portion of the overall total correlation, we may choose the number of topics by observing diminishing returns to the objective. Furthermore, since the CorEx implementation depends on a random initialization (as described shortly), one may restart the CorEx topic model several times and choose the one that explains the most total correlation." Following this suggestion, I have written a function that fits a CorEx topic model over a *number of topics range*. For each *number of topics*, the function fits a specified *number of topic models* and selects the topic model with the highest total correlation (TC), ignoring any models with empty topics as unstable. Finally, the topic model with the *number of topics* corresponding to the overall highest TC is chosen (namely, the model that produces topics that are most informative about the documents). This function takes a few minutes to run as it does an exhaustive search over a wide range of the number of topics, so feel free to grab a coffee.

In [None]:
def find_best_topic_model(features, num_topic_models=10, num_topics_range=range(1, 11),
                          max_iter=200, eps=1e-05, progress_bar=None):
    """Find the best topic model as measured by total correlation (TC).
    
    Fits a CorEx topic model over a number of topics range. For each number of topics,
    fits a specified number of topic models and selects the topic model with the
    highest total correlation (TC), ignoring topic models with empty topics. Finally, 
    the topic model with the value of number of topics corresponding to the overall 
    highest TC is chosen (namely, the model that produces topics that are most 
    informative about the documents).

    Args:
        features (pandas.DataFrame): Binary features dataframe.
        num_topic_models (int, optional): Defaults to 10. Number of topic models to
            fit for each number of topics.
        num_topics_range (range, optional): Defaults to range(1, 11). Range of number
            of topics to fit models over.
        max_iter (int, optional): Defaults to 200. Maximum number of iterations
            before ending.
        eps (float, optional): Defaults to 1e-05. Convergence tolerance.
        progress_bar (progressbar.ProgressBar, optional): Defaults to None.
            Progress bar.

    Returns:
        corextopic.CorEx: CorEx topic model.

        CorEx topic model with the highest total correlation.

    """
    
    if progress_bar:
        progress_bar.start()
    
    X = ss.csr_matrix(features.values)

    high_tc_topic_models = []
    for n_topic in num_topics_range:
        
        if progress_bar:
            progress_bar.update(n_topic)
        
        topic_models = []
        for n_topic_models in range(1, num_topic_models + 1):
            topic_model = ct.Corex(n_hidden=n_topic, max_iter=max_iter, eps=eps, seed=n_topic_models)
            topic_model.fit(X, words=features.columns, docs=features.index)
            if _has_empty_topics(topic_model):  # unstable model so ignore
                continue
            topic_models.append((topic_model, topic_model.tc))

        if not topic_models:
            continue
        # for given number of topics, find model with highest total correlation (TC)
        topic_models.sort(key=lambda x:x[1], reverse=True)
        high_tc_topic_models.append((topic_models[0][0], topic_models[0][1]))
        
    # find overall model with highest total correlation (TC)
    high_tc_topic_models.sort(key=lambda x:x[1], reverse=True)
    high_tc_model = high_tc_topic_models[0][0]
    
    if progress_bar:
        progress_bar.finish()
    
    return high_tc_model


def _has_empty_topics(model):
    # check every topic (including topic 0), starting from the last ones
    for n_topic in range(model.n_hidden - 1, -1, -1):
        if not model.get_topics(topic=n_topic):
            return True
    return False

In [None]:
num_topics_range=range(1, 31)
topic_model = find_best_topic_model(
    train_features, num_topic_models=20, num_topics_range=num_topics_range,
    progress_bar=progress_bar(len(num_topics_range), banner_text_begin='Running: ',
                              banner_text_end=' topics range'))

In [None]:
print('Number of latent factors (topics) = ', topic_model.n_hidden)
print('Total correlation = ', round(topic_model.tc, 2))

So the optimal number of topics is 25. Note that I have tuned the `num_topic_models` so that this number is stable. If for instance the `num_topic_models` is reduced to 10, then the value of the optimal number of topics will change due to the random initializations of the CorEx topic model. 

Let's now observe the distribution of TCs for each topic to see how much each additional topic contributes to the overall TC. We should keep adding topics until additional topics do not significantly contribute to the overall TC.

In [None]:
def plot_topics_total_correlation_distribution(
    topic_model, ylim=(0, 2.5), title='Topics total correlation distribution',
    xlabel='Topic number'):
    """Plot the total correlation distribution of a CorEx topic model.

    Args:
        topic_model (corextopic.CorEx): CorEx topic model.
        ylim (tuple of (`int`, `int`), optional): Defaults to (0, 2.5).
            y limits of the axes.
        title (str, optional): Defaults to 'Topics total correlation distribution'.
            Title for axes.
        xlabel (str, optional): Defaults to 'Topic number'. x-axis label.

    """
    
    plt.bar(range(0, topic_model.tcs.shape[0]), topic_model.tcs)
    plt.xticks(range(topic_model.n_hidden))
    plt.ylim(ylim)
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel('Total correlation (nats)')
    plt.gca().spines['right'].set_visible(False)
    plt.gca().spines['top'].set_visible(False)
    plt.tick_params(bottom=False, left=False)

In [None]:
plot_topics_total_correlation_distribution(topic_model)

Looking at the plot, you can see that judging where additional topics stop contributing significantly is fairly subjective. Should we take 10, 12, 15, 18 or 22 topics? To me, a slightly more principled way would be to look at the cumulative distribution and select the minimum number of topics that explains, say, 95% of the overall total correlation. This is similar to an explained variance cut-off value in principal component analysis. The plot is shown below.

In [None]:
def plot_topics_total_correlation_cumulative_distribution(
    topic_model,
    ylim=(0, 17),
    cutoff=None,
    title='Topics total correlation cumulative distribution', xlabel='Topic number'):
    """Plot the total correlation cumulative distribution of a CorEx topic model.

    Args:
        topic_model (corextopic.CorEx): CorEx topic model.
        ylim (tuple of (`int`, `int`), optional): Defaults to (0, 17).
            y limits of the axes.
        cutoff (float, optional): Defaults to None. If float, then 0 < cutoff < 1.0.
            The fraction of the cumulative total correlation to use as a cutoff. A
            horizontal dashed line will be drawn to indicate this value.
        title (str, optional): Defaults to 'Topics total correlation cumulative distribution'.
            Title for axes.
        xlabel (str, optional): Defaults to 'Topic number'. x-axis label.

    """

    plt.bar(range(0, topic_model.tcs.shape[0]), np.cumsum(topic_model.tcs))
    if cutoff:
        plt.axhline(cutoff * topic_model.tc, linestyle='--', color='r')
    plt.xticks(range(topic_model.n_hidden))
    plt.ylim(ylim)
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel('Total correlation (nats)')
    plt.gca().spines['right'].set_visible(False)
    plt.gca().spines['top'].set_visible(False)
    plt.tick_params(bottom=False, left=False)

In [None]:
plot_topics_total_correlation_cumulative_distribution(topic_model, cutoff=0.95)

Using this criterion suggests that 18 topics would be appropriate. However, again this is fairly subjective. Should we choose a cut-off of 90%, 95% or 99%? Each of these values would change the conclusion about the number of topics to retain. To me, since there are so few topics anyway, it makes sense to retain all 25 topics and not lose any further information. You will also see shortly that there is some interesting information in the tail of the topics.
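
To see how sensitive the conclusion is to the cut-off, the number of topics implied by each value can be read off the cumulative sum directly. The sketch below uses a hypothetical helper `num_topics_for_cutoff` and made-up per-topic TCs. With the fitted model, `topic_model.tcs` would take the place of `toy_tcs`.

```python
import numpy as np


def num_topics_for_cutoff(tcs, cutoff):
    """Smallest number of topics whose cumulative TC reaches `cutoff` of the total."""
    cumulative_fraction = np.cumsum(tcs) / np.sum(tcs)
    # Subtract a tiny tolerance to guard against floating point round-off at the boundary.
    return int(np.searchsorted(cumulative_fraction, cutoff - 1e-12) + 1)


# Hypothetical per-topic TCs in descending order (not the values from this notebook).
toy_tcs = np.array([4.0, 3.0, 1.5, 1.0, 0.3, 0.2])

topics_per_cutoff = {cutoff: num_topics_for_cutoff(toy_tcs, cutoff)
                     for cutoff in (0.90, 0.95, 0.99)}
print(topics_per_cutoff)
```

Even on this toy example, moving the cut-off from 95% to 99% changes the answer from 4 topics to all 6.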

## Topics

Now we will take a look at the produced topics, in descending order of the total correlation they explain, to see how coherent they are. The features in topics are ranked in descending order of their *mutual information* with the topic. So features with higher values of mutual information are more associated with the topic than features with low values. Do not be alarmed by the negative values of mutual information. As Gallagher explains in the [notebook example](https://github.com/gregversteeg/corex_topic/blob/master/corextopic/example/corex_topic_example.ipynb), "Theoretically, mutual information is always positive. If the CorEx output returns a negative mutual information from `get_topics()`, then the absolute value of that quantity is the mutual information between the topic and the absence of that word." I add labels to the topics to aid with their interpretability.
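
The sign convention in the quote can be sketched by splitting a topic into features whose presence is informative and features whose absence is informative. The list below is hypothetical and only mimics the shape this notebook unpacks from `get_topics()` (feature first, signed mutual information second). Other versions of the library may return a different tuple layout, so treat the shape as an assumption.

```python
# Hypothetical (feature, signed mutual information) pairs; names and values are made up.
toy_topic = [('worked_in_EU', 0.42), ('worked_in_NA', -0.17), ('alumnus_in_EU', 0.05)]

# A non-negative value: the presence of the feature is informative about the topic.
present = [(feature, mi) for feature, mi in toy_topic if mi >= 0]

# A negative value: the absence of the feature is informative, with magnitude |value|.
absent = [(feature, abs(mi)) for feature, mi in toy_topic if mi < 0]

print('presence:', present)
print('absence:', absent)
```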

In [None]:
latent_factors = {'is_eu_worker':'European Workers',
                  'is_eu_alumni':'European Alumni',
                  'is_alumni':'Alumni',
                  'is_na_eu_resident':'North American and European Residents',
                  'is_na_citizen':'North American Citizens',
                  'is_na_worker':'North American Workers',
                  'is_as_citizen':'Asian Citizens',
                  'is_na_alumni':'North American Alumni',
                  'is_gbr_citizen':'British Citizens',
                  'is_rus_citizen':'Russian Citizens',
                  'is_deu_citizen':'German Citizens',
                  'is_nld_ita_che_citizen':'Netherlands, Italian and Swiss Citizens',
                  'is_studyholic':'Studyholics',
                  'is_workhorse':'Workhorses',
                  'is_aut_citizen':'Austrian Citizens',
                  'is_eu_citizen':'European Citizens',
                  'is_gbr_worker':'British Workers',
                  'is_passport_collector':'Passport Collectors',
                  'is_born':'Born',
                  'is_fra_citizen':'French Citizens',
                  'is_other_citizen':'Other Citizens',
                  'is_emigrant':'Emigrants',
                  'is_physics_laureate_teacher':'Physics Laureate Teachers',
                  'is_physics_laureate_student':'Physics Laureate Students',
                  'is_astronomer':'Astronomers'
                 }

In [None]:
def plot_topics(topic_model, topic_labels=None, max_features_per_topic=15, xlim=(-0.5, 1),
                ylabel='Feature', figsize=None, plotting_context='notebook'):
    """Plot the topics of a CorEx topic model.

    Args:
        topic_model (corextopic.CorEx): CorEx topic model.
        topic_labels (list of `str`, optional): Defaults to None. Topic labels for each
            axis.
        max_features_per_topic (int, optional): Maximum number of features to plot
            per topic.
        xlim (tuple of (`int`, `int`), optional): Defaults to (-0.5, 1).
            x limits of the axes.
        ylabel (str, optional): Defaults to 'Feature'. y-axis label.
        figsize (tuple of (`int`, `int`), optional): Defaults to None. Figure size in
            inches x inches.
        plotting_context (str, optional): Defaults to `notebook`. Seaborn plotting
            context.

    """
    
    with sns.plotting_context(plotting_context):
        fig, ax = plt.subplots(nrows=topic_model.n_hidden, ncols=1, sharex=False, figsize=figsize)
        plt.subplots_adjust(hspace=200)
        for n_topic in range(topic_model.n_hidden):
            topic = topic_model.get_topics(n_words=max_features_per_topic, topic=n_topic)
            labels = [label[0] for label in topic]
            mutual_info = [mi[1] for mi in topic]
            ax[n_topic].barh(labels, mutual_info)
            ax[n_topic].set_xlim(xlim)
            ax[n_topic].set_ylim(-0.5, max_features_per_topic - 0.5)
            if topic_labels:
                title = topic_labels[n_topic]
            else:
                title = 'topic_' + str(n_topic)
            ax[n_topic].set(title=title, xlabel='Mutual information (nats)',
                            ylabel=ylabel)
        fig.tight_layout()

In [None]:
plot_topics(topic_model, topic_labels=list(latent_factors.values()), figsize=(20, 280),
            plotting_context='talk')

As you can see, the topic labels are self-explanatory and correspond mainly with the dominant features of each topic as measured by the mutual information. As noted above, the features with very low mutual information are not really informative about the topic. The fact that I could put a name to every topic shows just how discriminative the topic modeling is. It's impressive how coherent some of the topics are. The *North American Workers*, *North American Alumni*, *Workhorses* and *Studyholics* topics are exemplary cases. The *Born* topic is definitely the least coherent topic, which suggests that the features in this topic were probably not so useful to begin with.

## Physicist Topic Labels

As with the topic features, the most probable documents (physicists) per topic can also be easily accessed, and it is interesting to take a look at a few of these. As Gallagher says, they "are sorted according to log probabilities which is why the highest probability documents have a score of 0 ($e^0 = 1$) and other documents have negative scores (for example, $e^{-0.5} \approx 0.6$)."
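
Converting the scores back to probabilities is just an exponentiation. A minimal sketch on hypothetical (physicist, log probability) pairs, with made-up names and scores:

```python
import math

# Hypothetical (document, log probability) pairs in the shape described above.
toy_top_docs = [('physicist_a', 0.0), ('physicist_b', -0.5), ('physicist_c', -2.3)]

# exp() maps a log probability of 0 to a probability of 1; more negative scores map closer to 0.
toy_probs = [(doc, math.exp(log_prob)) for doc, log_prob in toy_top_docs]

for doc, prob in toy_probs:
    print(doc, round(prob, 2))
```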

OK, let's take a look at the top physicists in the *European Workers* (topic 0), *Workhorses* (topic 13) and *Physics Laureate Teachers* (topic 22) topics.

In [None]:
topic_model.get_top_docs(n_docs=30, topic=0, sort_by='log_prob')

The names here seem reasonable as physicists who have worked in Europe. But as you can see from the log probabilities, a lot of the physicists have a similar probability of belonging to this topic, so this ordering does not discriminate well between them. It's a different story if we sort by the TC instead. This is more discriminative, but from the warning message you can see that Gallagher does not yet recommend this.

In [None]:
topic_model.get_top_docs(n_docs=30, topic=0, sort_by='tc')

Below we see the real workhorses of physics. The probabilities here seem to discriminate between the physicists a lot better. If you examine the Wikipedia Infobox *Institutions* field of some of these physicists, you will see that the breadth of workplaces corroborates this list.

In [None]:
topic_model.get_top_docs(n_docs=30, topic=13, sort_by='log_prob')

Below we see the great teachers and influencers of physics laureates, many of whom are laureates themselves. Likewise, if you take a look at the Wikipedia Infobox *Doctoral students* and *Other notable students* fields of some of these physicists, you will see the number of laureates they have had an impact on. Interestingly, the first paragraph of [Arnold Sommerfeld's Wikipedia article](https://en.wikipedia.org/wiki/Arnold_Sommerfeld) focuses on this aspect of his career and compares him to J. J. Thomson.

In [None]:
topic_model.get_top_docs(n_docs=30, topic=22, sort_by='log_prob')

## Projecting Features to the Topic Space

CorEx is a discriminative model, which means that it estimates the probability that a document (physicist) belongs to a topic given that document's words (features). The estimated probabilities of topics for each document can be obtained through the topic model's properties `log_p_y_given_x` or `p_y_given_x` or its function `predict_proba`. A binary determination of which documents belong to each topic is obtained using a softmax and can be accessed through the topic model's `labels` property or its function `transform` (or `predict`). I will now use `transform` to reduce the dimensionality of the original binary features by projecting them into the latent space spanned by the binary topics of the topic model.
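
The step from probabilities to binary topic labels can be pictured as a simple thresholding. The sketch below uses made-up probabilities and assumes a 0.5 threshold purely for illustration. The exact rule inside the library may differ, so in practice the labels should come from `transform` itself.

```python
import numpy as np

# Hypothetical topic probabilities for 3 documents (rows) and 2 topics (columns).
toy_probs = np.array([[0.91, 0.08],
                      [0.35, 0.62],
                      [0.50, 0.99]])

# ASSUMPTION: a document is assigned to a topic when its probability exceeds 0.5.
toy_labels = toy_probs > 0.5

# A document may end up in no topics, one topic or several topics.
topics_per_document = toy_labels.sum(axis=1)
print(toy_labels)
print(topics_per_document)
```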

In [None]:
def project_features_to_topic_space(features, topic_model, columns=None):
    """Project the binary features to the latent space spanned by the binary
    topics of the topic model.

    Args:
        features (pandas.DataFrame): Binary features dataframe.
        topic_model (corextopic.CorEx): CorEx topic model.
        columns (list of `str`, optional): Defaults to None. Topic labels
            to use as columns for the dataframe.
            
    Returns:
        pandas.DataFrame: Binary features dataframe containing the topics.

    """
    
    X = ss.csr_matrix(features.values)
    X_topics = topic_model.transform(X)
    features_topics = pd.DataFrame(X_topics, index=features.index, columns=columns)
    features_topics = features_topics.applymap(lambda x: 'yes' if x == True else 'no')
    return features_topics

In [None]:
train_features_topics = project_features_to_topic_space(train_features, topic_model,
                                                        list(latent_factors.keys()))
train_features_topics.head()

In [None]:
validation_features_topics = project_features_to_topic_space(validation_features, topic_model,
                                                             list(latent_factors.keys()))
validation_features_topics.head()

In [None]:
test_features_topics = project_features_to_topic_space(test_features, topic_model,
                                                       list(latent_factors.keys()))
test_features_topics.head()

You may be wondering why I did not just use the estimated probabilities as my reduced dimension features. Most likely a model built from those features would be more accurate than one built from the binary features. Interpretability is the answer. For example, it does not make much sense to talk about the probability of a physicist being a *European Worker*. S/he is either a *European Worker* or not. It is more natural to say, for instance, that a physicist is a Nobel Laureate because s/he is a *European Worker*, a *North American Citizen* and a *Physics Laureate Teacher* etc.

The *European Alumni* and *Astronomers* topics are interesting as they both consist of only one feature. Therefore, you would expect a one-to-one correspondence between the labels in the topic and the labels in the original feature. I noticed that this is not always the case, as the topic has actually "flipped" the label for some of the physicists. I am not sure exactly why this happens, but it appears to be a quirk of the topic modeling. The cells below count the number of mismatches for each topic.

In [None]:
len(train_features) - (train_features_topics.is_eu_alumni == train_features.alumnus_in_EU.map(
    {1: 'yes', 0:'no'})).sum()

In [None]:
len(train_features) - (train_features_topics.is_astronomer == train_features.is_astronomer.map(
    {1: 'yes', 0:'no'})).sum()

## Persisting the Data

Now that I have the training, validation and test features dataframes in the topic model space, I'll persist them for future use.

In [None]:
train_features_topics.to_csv('../data/processed/train-features-topics.csv')
validation_features_topics.to_csv('../data/processed/validation-features-topics.csv')
test_features_topics.to_csv('../data/processed/test-features-topics.csv')