
# Identify the most important vector

A prerequisite for finding a trading signal is to understand whether the data we extract contains topics or signals related to the market and, more importantly, whether it contains information we can trade on.

This requires us to examine and evaluate the topics in the data and the vocabulary that represents them. As the saying goes: garbage in, garbage out.

To explore the topics in the FOMC documents, we use Gensim's implementation of Latent Dirichlet Allocation (LDA). LDA is a generative probabilistic model suited to collections of discrete data such as text. It is a hierarchical Bayesian model in which each item in the collection (here, a document) is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities.
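
To make the "finite mixture over topics" idea concrete, here is a minimal sketch of LDA's generative story using only the standard library. The vocabulary, the two topics, and the `alpha` values are made up for illustration and are not fitted to the FOMC text; the helper names are hypothetical as well.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, n_words, rng):
    """Generate one document following LDA's generative story."""
    # 1. Draw the document's topic mixture: theta ~ Dirichlet(alpha)
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position according to theta
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution
        words.append(rng.choices(vocab, weights=topic_word[z])[0])
    return theta, words

# Toy vocabulary and topics (assumptions for illustration only)
vocab = ["inflation", "price", "employment", "labor"]
topic_word = [
    [0.5, 0.5, 0.0, 0.0],  # hypothetical "prices" topic
    [0.0, 0.0, 0.5, 0.5],  # hypothetical "jobs" topic
]
rng = random.Random(0)
theta, doc = generate_document(topic_word, vocab, alpha=[0.5, 0.5], n_words=8, rng=rng)
print(theta, doc)
```

Fitting an LDA model runs this story in reverse: given only the observed words, it infers plausible topic-word distributions and per-document mixtures.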

In the LDA model, the number of topics is not learned from the data; we have to set it through the num_topics hyperparameter. According to the models' coherence scores, 10 and 7 are reasonable choices.
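
One common way to pick num_topics is to fit a model for each candidate value and compare coherence scores. The sketch below is an assumption about how such a sweep could look, not a record of how the values above were chosen; `sweep_num_topics` and `fit_and_score` are hypothetical names, and the commented usage assumes the Gensim objects built later in this notebook.

```python
def sweep_num_topics(candidates, fit_and_score):
    """Score each candidate topic count and return (best_k, scores)."""
    scores = {k: fit_and_score(k) for k in candidates}
    # The candidate with the highest coherence wins
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Hypothetical scoring callback for this notebook (assumes gs, cbow_statements,
# statements_dict and statements_preprocessed exist, as in the cells below):
#
# def fit_and_score(k):
#     lda = gs.models.LdaMulticore(cbow_statements, num_topics=k,
#                                  id2word=statements_dict, passes=10)
#     cm = gs.models.CoherenceModel(model=lda, texts=statements_preprocessed,
#                                   dictionary=statements_dict, coherence='c_v')
#     return cm.get_coherence()
#
# best_k, scores = sweep_num_topics(range(4, 21, 2), fit_and_score)
```

Because LDA training is stochastic, scores for neighbouring values of num_topics can vary between runs, so it is worth looking at the whole score curve rather than only the maximum.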

In [None]:
def preprocess_text(text, stop_words=all_stopwords):
    """
    Tokenize and lemmatize a single raw document (a statement or minutes text).
    Args:
      text: A String containing the raw document text.
      stop_words: A collection of Strings containing stop words to be removed.
    Returns:
      A list of preprocessed tokens of type String.
    """
    # Local imports keep the function self-contained.
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize
    lemmatizer = WordNetLemmatizer()
    # Drop stop words first (the comparison is case-sensitive).
    words = [w for w in word_tokenize(text) if w not in stop_words]
    # Keep only tokens longer than two characters and lemmatize them.
    return [lemmatizer.lemmatize(w) for w in words if len(w) > 2]

In [None]:
# Tokenize & normalise statements and minutes
pool = Pool(16)
statements_list = df_statements["statements"].to_list()
statements_preprocessed = list(
    tqdm(pool.imap(preprocess_text, statements_list), total=len(statements_list), desc='Multiprocess'))

In [None]:
minutes_list = df_minutes["minutes"].to_list()
minutes_preprocessed = list(
    tqdm(pool.imap(preprocess_text, minutes_list), total=len(minutes_list), desc='Multiprocess'))

In [None]:
pool.close()  # no more tasks will be submitted to the pool
pool.join()   # wait for the worker processes to exit

In [None]:
statements_dict = gs.corpora.Dictionary(statements_preprocessed)
minutes_dict = gs.corpora.Dictionary(minutes_preprocessed)

In [None]:
cbow_statements = [statements_dict.doc2bow(doc) for doc in statements_preprocessed]
cbow_minutes = [minutes_dict.doc2bow(doc) for doc in minutes_preprocessed]
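
For readers unfamiliar with the bag-of-words format, the following pure-Python sketch mimics what the dictionary and doc2bow steps produce: a token-to-id mapping and, per document, a sorted list of (token_id, count) pairs. The helper names are hypothetical, and the first-seen id assignment is a simplification that may differ from the ids Gensim assigns.

```python
from collections import Counter

def build_token_ids(docs):
    """Assign an integer id to every distinct token, in first-seen order."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Toy tokenized documents (made up for illustration)
docs = [["rate", "inflation", "rate"], ["inflation", "labor"]]
token2id = build_token_ids(docs)
bows = [to_bow(doc, token2id) for doc in docs]
print(token2id)  # {'rate': 0, 'inflation': 1, 'labor': 2}
print(bows)      # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded in this representation; LDA only sees how often each token occurs in each document.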

In [None]:
model_statements = gs.models.LdaMulticore(cbow_statements, num_topics=16, id2word=statements_dict, passes=10, workers=16)
# model_statements.show_topics()

In [None]:
model_minutes = gs.models.LdaMulticore(cbow_minutes, num_topics=12, id2word=minutes_dict, passes=10, workers=16)
model_minutes.show_topics()

In [None]:
topic_vis_statements = gensimvis.prepare(model_statements, cbow_statements, statements_dict)
pyLDAvis.display(topic_vis_statements)

In [None]:
topic_vis_minutes = gensimvis.prepare(model_minutes, cbow_minutes, minutes_dict)
pyLDAvis.display(topic_vis_minutes)

In [None]:
pyLDAvis.save_html(topic_vis_statements, 'topic_vis_statements.html')
pyLDAvis.save_html(topic_vis_minutes, 'topic_vis_minutes.html')

Examining the resulting topic maps, we can see that the LDA models capture the salient topics in the data and their constituent words reasonably well.

In [None]:
# Compute Coherence Score
coherence_model_statements = gs.models.CoherenceModel(model=model_statements, texts=statements_preprocessed,
                                                      dictionary=statements_dict, coherence='c_v')
coherence_score_statements = coherence_model_statements.get_coherence()
print(f'Coherence Score_Statements: {coherence_score_statements}')
coherence_model_minutes = gs.models.CoherenceModel(model=model_minutes, texts=minutes_preprocessed,
                                                   dictionary=minutes_dict, coherence='c_v')
coherence_score_minutes = coherence_model_minutes.get_coherence()
print(f'Coherence Score_Minutes: {coherence_score_minutes}')

The results of Röder, Both and Hinneburg (2015) motivated our choice of coherence measure, which can be seen in the CoherenceModel calls above. We chose the coherence='c_v' metric rather than 'u_mass' or 'c_uci'. We found that the 'c_v' measure can achieve better results than the other methods, especially when the word set is small, which is consistent with our choice. The coherence score of our model is X. We believe the model could be improved with higher-quality data. Overall, our LDA model has been trained with a reasonable number of topics and maintains a sufficient degree of semantic similarity among the highest-scoring words in each topic.
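
To give some intuition for what a coherence score measures, the sketch below computes a simplified, document-level NPMI coherence for a topic's top words. This is not the 'c_v' measure itself, which additionally relies on sliding-window co-occurrence counts and cosine similarity between context vectors; the function name, the toy documents, and the handling of the degenerate cases are assumptions made for this illustration.

```python
import math
from itertools import combinations

def npmi_coherence(top_words, docs):
    """Average document-level NPMI over all pairs of a topic's top words."""
    doc_sets = [set(d) for d in docs]
    n = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all of the given words
        return sum(all(w in s for w in words) for s in doc_sets) / n

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # words never co-occur: NPMI lower bound
        elif p12 == 1:
            scores.append(0.0)   # degenerate case; convention chosen for this sketch
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores)

# Toy tokenized documents (made up for illustration)
docs = [["inflation", "price"], ["inflation", "price"],
        ["labor", "employment"], ["labor", "employment"]]
coherent = npmi_coherence(["inflation", "price"], docs)
mixed = npmi_coherence(["inflation", "labor"], docs)
print(coherent, mixed)
```

Words that tend to appear in the same documents score close to 1, while words that never appear together score -1, so a topic whose top words belong together receives a higher average.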

## Reference

1. Kusner, M., Sun, Y., Kolkin, N., & Weinberger, K. (2015, June). From word embeddings to document distances. In International conference on machine learning (pp. 957-966). PMLR.
2. Loughran, T., & McDonald, B. (2020). Measuring firm complexity. Available at SSRN 3645372.
3. Röder, M., Both, A., & Hinneburg, A. (2015, February). Exploring the space of topic coherence measures. In Proceedings of the eighth ACM international conference on Web search and data mining (pp. 399-408).
4. Sievert, C., & Shirley, K. (2014, June). LDAvis: A method for visualizing and interpreting topics. In Proceedings of the workshop on interactive language learning, visualization, and interfaces (pp. 63-70).