<a href="https://colab.research.google.com/github/cchummer/sec-api/blob/main/s1_topic_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Analysis

LDA (Latent Dirichlet Allocation) is a 'bag-of-words' style model: it treats the corpus as one big 2D matrix of token counts (rows = documents, columns = tokens), from which it estimates a topic-feature matrix and a document-topic matrix. Tokens (words, key parts of words, etc.) are simply counted and grouped by how often they appear together. Word order is not considered or analyzed.
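
To make the two matrices concrete, here is a minimal sketch on a made-up three-document corpus. The documents, the `toy_` names and the choice of two topics are illustrative assumptions, not part of the sample data. It checks that word order does not change the count vectors and prints the shapes of the fitted matrices.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus (made up for illustration). The first two documents contain
# the same words in a different order.
toy_docs = [
    "revenue growth and revenue risk",
    "risk and growth revenue revenue",
    "clinical trial drug approval",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)  # rows = documents, columns = tokens

# Bag-of-words ignores order, so the first two rows should be identical.
same_rows = (toy_counts[0].toarray() == toy_counts[1].toarray()).all()

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = toy_lda.fit_transform(toy_counts)  # document-topic matrix
topic_term = toy_lda.components_               # topic-feature matrix

print(same_rows)
print(toy_counts.shape, doc_topic.shape, topic_term.shape)
```

Each row of `doc_topic` is a distribution over topics for one document, and each row of `topic_term` holds one topic's weights over the vocabulary.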



#### Vectorization + Tokenization
There are a few options for how exactly to turn a list of documents into a matrix suitable for analysis. Scikit-learn has a nice overview [here](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction).

For LDA, only raw term counts matter, so vectorization options are limited. Choices do exist in how exactly to tokenize the data (n-grams and other configurable preprocessing steps). We will start with as little tuning as possible and build from there.

I have uploaded a sample of just under 50 summaries, along with their respective filing URLs, which I grabbed yesterday, as a [JSON file here](https://drive.google.com/file/d/1y8O3FNmzjjXkbr0VtPc1Sb7qifFbk21B/view?usp=sharing) in case you want to follow along. Please note that I have not yet gone through them or verified that every one is actually a prospectus summary.

In [None]:
from google.colab import drive
import json

drive.mount('/content/drive')

samples_summaries = []
with open('/content/drive/My Drive/Colab Notebooks/ML+DL/sample_summaries.json', 'r') as sample_file:
  samples_summaries = json.load(sample_file)

print(samples_summaries)

In [None]:
# Grab just the summaries from our list of dicts. A fun exercise in list comprehension and dictionary iteration lol
summaries_no_urls = [list(inner_dict.values())[0] for x in samples_summaries for inner_dict in x.values()]

print(summaries_no_urls)
print(len(summaries_no_urls))
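
For readers who find the nested comprehension hard to parse, here is a sketch of what it does, written against a hypothetical stand-in for the loaded JSON. The keys, texts and URLs in `toy_samples` are guesses made up for illustration; the only thing taken from the notebook is the shape the comprehension implies (a list of dicts whose values are inner dicts, with the summary as the first value).

```python
# Hypothetical stand-in for the loaded JSON (structure is an assumption).
toy_samples = [
    {"filing_a": {"summary": "We are a biotech company.", "url": "https://example.com/a"}},
    {"filing_b": {"summary": "We operate retail stores.", "url": "https://example.com/b"}},
]

# Same logic as the comprehension above.
via_comprehension = [
    list(inner_dict.values())[0]
    for x in toy_samples
    for inner_dict in x.values()
]

# Equivalent explicit loops.
via_loops = []
for outer_dict in toy_samples:
    for inner_dict in outer_dict.values():
        # Dicts keep insertion order, so this is the first value as stored in the file.
        first_value = next(iter(inner_dict.values()))
        via_loops.append(first_value)

print(via_loops)
```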

#### Getting Ready
The code below, which neatly plots our outputs, is borrowed from the scikit-learn example [on this exact topic](https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html).

In [37]:
from time import time
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF
import matplotlib.pyplot as plt

In [40]:
def plot_top_words(model, feature_names, n_top_words, title):
    fig, axes = plt.subplots(2, 5, figsize=(30, 15), sharex=True)
    axes = axes.flatten()
    for topic_idx, topic in enumerate(model.components_):
        top_features_ind = topic.argsort()[-n_top_words:]
        top_features = feature_names[top_features_ind]
        weights = topic[top_features_ind]

        ax = axes[topic_idx]
        ax.barh(top_features, weights, height=0.7)
        ax.set_title(f"Topic {topic_idx +1}", fontdict={"fontsize": 30})
        ax.tick_params(axis="both", which="major", labelsize=20)
        for i in "top right left".split():
            ax.spines[i].set_visible(False)

    fig.suptitle(title, fontsize=40)  # set the figure title once, after all panels are drawn
    plt.subplots_adjust(top=0.90, bottom=0.05, wspace=0.90, hspace=0.3)
    plt.show()
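
When a plot is more than needed, the same ranking can be printed as plain text. The helper below is a small sketch of that idea; `top_words_per_topic` and the toy arrays are names and data I made up for illustration, not part of the scikit-learn example.

```python
import numpy as np

def top_words_per_topic(components, feature_names, n_top_words):
    """Return, for each topic row, the n highest-weighted feature names (descending)."""
    feature_names = np.asarray(feature_names)
    top_words = []
    for topic in components:
        # argsort is ascending, so take the last n indices and reverse them.
        top_idx = np.argsort(topic)[-n_top_words:][::-1]
        top_words.append(feature_names[top_idx].tolist())
    return top_words

# Toy topic-feature weights: 2 topics over 4 features.
toy_components = np.array([
    [0.1, 3.0, 0.5, 2.0],
    [4.0, 0.2, 1.5, 0.3],
])
toy_features = ["drug", "revenue", "trial", "growth"]

print(top_words_per_topic(toy_components, toy_features, 2))
# [['revenue', 'growth'], ['drug', 'trial']]
```

With a fitted model from this notebook, the call would look like `top_words_per_topic(lda.components_, tf_feature_names, n_top_words)`. Note also that `plot_top_words` hardcodes a 2 x 5 grid, so it only has room for 10 topics.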

### Perform our Vectorization and apply Models
Again, look to the same [scikit-learn article](https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html) for inspiration.

In [None]:
n_features = 10000
n_topics = 10
n_top_words = 20
init = "nndsvda"

my_stop_words = ['prospectus', 'summary', 'prospectussummary', 'highlights', 'common', 'stock',
                 'share', 'shares', 'offering', 'shareholders', 'companies']

# Use tf (raw term count) features for LDA. No n-gram manipulation.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(
    # my_stop_words is deliberately not passed in: results look better without
    # stop words at the moment (still to be investigated).
    max_df=0.95, min_df=2, max_features=n_features, stop_words=[]
)

t0 = time()
tf = tf_vectorizer.fit_transform(summaries_no_urls)

print("done in %0.3fs." % (time() - t0))
print()

tf

In [None]:
# Let's get a feel for the features that have been chosen:
import numpy as np

with np.printoptions(threshold=np.inf):
  print(tf_vectorizer.get_feature_names_out())

In [None]:
print(
    "\n" * 2,
    "Fitting LDA models with tf features, n_samples=%d and max n_features=%d..."
    % (len(summaries_no_urls), n_features),
)

lda = LatentDirichletAllocation(
    n_components=n_topics,
    max_iter=5,
    learning_method="online",
    learning_offset=50.0,
    random_state=0,
)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

tf_feature_names = tf_vectorizer.get_feature_names_out()
plot_top_words(lda, tf_feature_names, n_top_words, "Topics in LDA model")

### NMF

In [None]:
# First, tokenization

# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.95, min_df=2, max_features=n_features, stop_words="english"
)
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(summaries_no_urls)
print("done in %0.3fs." % (time() - t0))


# Apply analysis

# Fit the NMF model
print(
    "Fitting the NMF model (Frobenius norm) with tf-idf features, "
    "n_samples=%d and max n_features=%d..." % (len(summaries_no_urls), n_features)
)
t0 = time()
nmf = NMF(
    n_components=n_topics,
    random_state=1,
    init=init,
    beta_loss="frobenius",
    alpha_W=0.00005,
    alpha_H=0.00005,
    l1_ratio=1,
).fit(tfidf)
print("done in %0.3fs." % (time() - t0))


tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
plot_top_words(
    nmf, tfidf_feature_names, n_top_words, "Topics in NMF model (Frobenius norm)"
)

# Fit the NMF model again, this time with the generalized Kullback-Leibler divergence
print(
    "\n" * 2,
    "Fitting the NMF model (generalized Kullback-Leibler "
    "divergence) with tf-idf features, n_samples=%d and max n_features=%d..."
    % (len(summaries_no_urls), n_features),
)
t0 = time()
nmf = NMF(
    n_components=n_topics,
    random_state=1,
    init=init,
    beta_loss="kullback-leibler",
    solver="mu",
    max_iter=1000,
    alpha_W=0.00005,
    alpha_H=0.00005,
    l1_ratio=0.5,
).fit(tfidf)
print("done in %0.3fs." % (time() - t0))

tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
plot_top_words(
    nmf,
    tfidf_feature_names,
    n_top_words,
    "Topics in NMF model (generalized Kullback-Leibler divergence)",
)

OK, right off the bat: the NMF results look much more promising, without any additional preprocessing or tuning. There is more work to do in making sense of and explaining the maths behind the two models, and the differences between the two beta loss functions shown.
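
As a starting point for that, here is a rough NumPy sketch of the two reconstruction losses being minimized, using the standard definitions of the Frobenius and generalized Kullback-Leibler divergences between the data matrix `X` and its reconstruction `WH`. The function names and toy matrices are my own, and the regularization terms controlled by `alpha_W`, `alpha_H` and `l1_ratio` are left out.

```python
import numpy as np

def frobenius_loss(X, WH):
    """0.5 * sum((X - WH)^2): half the squared Frobenius norm of the residual."""
    X = np.asarray(X, dtype=float)
    WH = np.asarray(WH, dtype=float)
    return 0.5 * np.sum((X - WH) ** 2)

def generalized_kl_loss(X, WH):
    """sum(X * log(X / WH) - X + WH), with the convention 0 * log(0) = 0.

    Assumes WH is strictly positive wherever X is positive.
    """
    X = np.asarray(X, dtype=float)
    WH = np.asarray(WH, dtype=float)
    log_term = np.zeros_like(X)
    nonzero = X > 0
    log_term[nonzero] = X[nonzero] * np.log(X[nonzero] / WH[nonzero])
    return np.sum(log_term - X + WH)

# Toy data: the reconstruction is off by 1 in a single entry.
toy_X = np.array([[1.0, 2.0], [3.0, 4.0]])
toy_WH = np.array([[1.0, 2.0], [3.0, 5.0]])

print(frobenius_loss(toy_X, toy_WH))       # 0.5 * 1^2 = 0.5
print(generalized_kl_loss(toy_X, toy_WH))  # 4 * log(4/5) - 4 + 5
```

Both losses are zero for a perfect reconstruction. The Frobenius loss penalizes absolute errors quadratically, while the KL version depends on the ratio `X / WH`, which is one plausible reason the two runs produce different topics.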