
# Identify the most important vector

A prerequisite for finding trading signals is to understand whether the data we extract contains topics or signals related to the market and, more importantly, whether it contains information we can trade on.

This requires us to examine and evaluate the topics in the data and the vocabulary that represents them. As the saying goes: garbage in, garbage out.

To explore the topics in the FOMC documents, we use Gensim's implementation of Latent Dirichlet Allocation (LDA). LDA is a generative probabilistic model suited to discrete data such as text. It is a hierarchical Bayesian model in which each item in the collection (here, a document) is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities.
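
The generative story described above can be sketched in a few lines of plain Python. This is only a toy illustration, not Gensim's implementation: the vocabulary, the two topic-word distributions, and the `alpha` values are made-up assumptions, and the helper names `sample_dirichlet` and `generate_document` are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word_dists, vocab, doc_length, alpha, rng):
    """Generate one toy document following LDA's generative story."""
    # Per-document topic proportions: theta ~ Dirichlet(alpha)
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(doc_length):
        # Pick a topic for this word position, then a word from that topic
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        word = rng.choices(vocab, weights=topic_word_dists[topic])[0]
        words.append(word)
    return theta, words


rng = random.Random(0)
vocab = ["inflation", "price", "bank", "loan"]
# Two hypothetical topics: one about prices, one about lending
topic_word_dists = [
    [0.5, 0.4, 0.05, 0.05],
    [0.05, 0.05, 0.5, 0.4],
]
theta, words = generate_document(
    topic_word_dists, vocab, doc_length=8, alpha=[0.5, 0.5], rng=rng
)
print(theta, words)
```

Training an LDA model runs this story in reverse: given only the observed words, it infers the topic-word distributions and the per-document mixtures that most plausibly produced them.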

In the LDA model, the number of topics must be set in advance through the `num_topics` hyperparameter. Based on the coherence scores of models trained with different values, 15 is a reasonable choice.
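
One way such a comparison could be organized is sketched below. The helper `select_num_topics` and its two callables are hypothetical, and the toy stand-ins use a made-up score so the sketch runs without training anything; they say nothing about the real data.

```python
def select_num_topics(candidates, fit_model, score_model):
    """Return (best_k, scores), where best_k has the highest score."""
    scores = {}
    for k in candidates:
        model = fit_model(k)            # train one model per candidate k
        scores[k] = score_model(model)  # e.g. a coherence score
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Toy stand-ins (assumptions for illustration only):
# the "model" is just k, and the made-up score peaks at k = 6.
best_k, scores = select_num_topics(
    candidates=[2, 4, 6, 8],
    fit_model=lambda k: k,
    score_model=lambda k: 1.0 / (1 + abs(k - 6)),
)
print(best_k, scores)
```

With Gensim, `fit_model` could wrap the `gs.models.LdaMulticore(...)` call used later in this notebook with `num_topics=k`, and `score_model` could wrap `gs.models.CoherenceModel(...).get_coherence()`. Each candidate requires a full training run, so the candidate list is best kept short.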

In [1]:
import re
from collections import Counter
from os import listdir
import preprocessor
from multiprocess import Pool
from tqdm import tqdm
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis
import pickle

import gensim as gs
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import spacy
import swifter
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.test.utils import get_tmpfile
from nltk.tokenize import word_tokenize
# nltk.download('punkt')


In [2]:
nlp = spacy.load("en_core_web_sm")
all_stopwords = nlp.Defaults.stop_words
all_stopwords |= {'the', 'is', 'th', 's', 'm', 'would', 'The'}

In [3]:
def preprocess_text(text, stop_words=all_stopwords):
    """
    Tokenize and lemmatize the raw text of a single speech.
    Args:
      text: A String containing the raw text of one speech.
      stop_words: A collection of Strings containing stop words to be removed.
    Returns:
      res: A list of preprocessed tokens of type String.
    """
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize
    lemmatizer = WordNetLemmatizer()
    words = []
    for w in word_tokenize(text):
        if w not in stop_words:
            words.append(w)
    res = []
    for w in words:
        if len(w) > 2:
            res.append(lemmatizer.lemmatize(w))
    return res

In [4]:
speech = pd.read_pickle('frb_speeches_all.pkl')

In [5]:
speech = speech['full_text']

In [6]:
# Tokenize & normalise the speeches in parallel
pool = Pool(16)
speeches_list = speech.to_list()
speeches_preprocessed = list(
    tqdm(pool.imap(preprocess_text, speeches_list), total=len(speeches_list), desc='Multiprocess'))

Multiprocess: 100%|████████████████████████████████████████████████████████████████| 1548/1548 [00:15<00:00, 99.67it/s]


In [7]:
pool.close()

In [8]:
speeches_dict = gs.corpora.Dictionary(speeches_preprocessed)

In [9]:
cbow_speeches = [speeches_dict.doc2bow(doc) for doc in speeches_preprocessed]
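
To make the bag-of-words step concrete, here is a minimal pure-Python sketch of the idea behind `doc2bow`: each document becomes a list of `(token_id, count)` pairs. The helpers `build_vocab` and `to_bow` are hypothetical, and Gensim's `Dictionary` may assign ids differently, so treat this as an illustration of the representation rather than of Gensim's internals.

```python
from collections import Counter


def build_vocab(tokenized_docs):
    """Assign each distinct token an integer id in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Convert a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


docs = [["inflation", "price", "inflation"], ["bank", "loan", "price"]]
token2id = build_vocab(docs)
bows = [to_bow(doc, token2id) for doc in docs]
print(bows)
```

Word order is discarded in this representation; only how often each token occurs in a document is kept, which is all LDA needs.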

In [15]:
model = gs.models.LdaMulticore(cbow_speeches, num_topics=15, id2word=speeches_dict, passes=10, workers=16)
model.show_topics()

[(6,
  '0.026*"inflation" + 0.021*"price" + 0.013*"policy" + 0.011*"model" + 0.008*"expectation" + 0.008*"oil" + 0.008*"monetary" + 0.007*"rate" + 0.006*"change" + 0.006*"economy"'),
 (7,
  '0.028*"policy" + 0.019*"rate" + 0.018*"inflation" + 0.014*"monetary" + 0.009*"central" + 0.009*"Federal" + 0.008*"price" + 0.008*"bank" + 0.008*"term" + 0.007*"Reserve"'),
 (8,
  '0.020*"bank" + 0.018*"credit" + 0.016*"business" + 0.015*"loan" + 0.011*"Federal" + 0.011*"market" + 0.011*"small" + 0.010*"financial" + 0.010*"mortgage" + 0.009*"Reserve"'),
 (0,
  '0.011*"financial" + 0.010*"payment" + 0.008*"loan" + 0.008*"Federal" + 0.008*"mortgage" + 0.008*"consumer" + 0.008*"Reserve" + 0.007*"community" + 0.007*"market" + 0.007*"credit"'),
 (12,
  '0.029*"bank" + 0.028*"risk" + 0.020*"capital" + 0.012*"Basel" + 0.009*"banking" + 0.008*"market" + 0.007*"management" + 0.007*"financial" + 0.007*"supervisor" + 0.006*"regulatory"'),
 (5,
  '0.024*"financial" + 0.024*"market" + 0.020*"risk" + 0.009*"liqui

In [16]:
topic_vis = gensimvis.prepare(model, cbow_speeches, speeches_dict)
pyLDAvis.display(topic_vis)

In [17]:
pyLDAvis.save_html(topic_vis, 'topic_vis.html')

Examining the final topic map, we can see that the LDA model does a reasonable job of capturing the salient topics in the data and their constituent words.

In [18]:
# Compute Coherence Score
coherence_model = gs.models.CoherenceModel(model=model, texts=speeches_preprocessed,
                                           dictionary=speeches_dict, coherence='c_v')
coherence_score = coherence_model.get_coherence()
print(f'Coherence Score_Speeches: {coherence_score}')

Coherence Score_Speeches: 0.46324171573047507


The results of Röder, Both, and Hinneburg (2015) motivated our choice of coherence measure, which appears in the arguments passed to the coherence model above: we chose `coherence='c_v'` rather than `'u_mass'` or `'c_uci'`. We found that the `c_v` measure achieves better results than the other methods, especially when the word set is small, which fits our setting. The coherence score of our model is about 0.463. We believe the model could be improved with higher-quality data. Overall, the score suggests that our LDA model was trained with a reasonable number of topics and maintains a sufficient degree of semantic similarity among the top-scoring words in each topic.
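
To give a feel for what a coherence score rewards, the sketch below computes a simplified UMass-style score from document co-occurrence counts. It is deliberately not `c_v`, which is more involved, and it is not Gensim's implementation; the function name, the smoothing constant `eps`, and the toy documents are assumptions for illustration.

```python
import math
from itertools import combinations


def umass_style_coherence(top_words, tokenized_docs, eps=1.0):
    """Average log conditional co-occurrence over ordered pairs of a topic's top words."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents containing all of the given words
        return sum(all(w in s for w in words) for s in doc_sets)

    pair_scores = []
    # combinations keeps list order, so `earlier` ranks higher than `later`
    for earlier, later in combinations(top_words, 2):
        denom = doc_freq(earlier)
        if denom == 0:
            continue  # skip words that never occur in the reference corpus
        pair_scores.append(math.log((doc_freq(earlier, later) + eps) / denom))
    return sum(pair_scores) / len(pair_scores) if pair_scores else float("nan")


docs = [
    ["inflation", "price", "policy"],
    ["inflation", "price", "oil"],
    ["bank", "loan", "credit"],
    ["bank", "credit", "mortgage"],
]
coherent = umass_style_coherence(["inflation", "price"], docs)
mixed = umass_style_coherence(["inflation", "loan"], docs)
print(coherent, mixed)
```

Top words that tend to appear in the same documents score higher than words that rarely co-occur, which is the intuition shared by the measures compared in Röder et al. (2015).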

## Reference

1. Kusner, M., Sun, Y., Kolkin, N., & Weinberger, K. (2015, June). From word embeddings to document distances. In International conference on machine learning (pp. 957-966). PMLR.
2. Loughran, T., & McDonald, B. (2020). Measuring firm complexity. Available at SSRN 3645372.
3. Röder, M., Both, A., & Hinneburg, A. (2015, February). Exploring the space of topic coherence measures. In Proceedings of the eighth ACM international conference on Web search and data mining (pp. 399-408).
4. Sievert, C., & Shirley, K. (2014, June). LDAvis: A method for visualizing and interpreting topics. In Proceedings of the workshop on interactive language learning, visualization, and interfaces (pp. 63-70).