<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Reference" data-toc-modified-id="Reference-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Reference</a></span></li></ul></div>

# Identify the most important vector

A prerequisite for finding a trading signal is to understand whether the data we extract contains topics or signals related to the market and, more importantly, whether it contains information we can trade on.

This requires us to examine and evaluate the topics in the data and the vocabulary that represents them. As the saying goes: garbage in, garbage out.

To explore the topics in the FOMC documents, we use Gensim's Latent Dirichlet Allocation (LDA). LDA is a generative probabilistic model suited to collections of discrete data such as text. It is a hierarchical Bayesian model in which each item in the collection (here, a document) is modeled as a finite mixture over an underlying set of topics. Each topic, in turn, is modeled as an infinite mixture over an underlying set of topic probabilities.
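To make the "mixture over topics" idea concrete, the following is a minimal sketch of the generative story LDA assumes for one document, using only the standard library. The toy vocabulary, the two hand-written topics, and the helper names `sample_dirichlet` and `generate_document` are illustrative assumptions, not part of Gensim; in a fitted model the topic-word distributions are learned rather than fixed by hand.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalising independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, n_words, rng):
    """Generate one document: draw a topic mixture, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)  # document-level topic proportions
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(topic_word)), weights=theta)[0]  # topic for this word
        w = rng.choices(vocab, weights=topic_word[z])[0]           # word from that topic
        words.append(w)
    return theta, words

# Hypothetical toy setup: two topics over a four-word vocabulary.
vocab = ["inflation", "rate", "mortgage", "loan"]
topic_word = [
    [0.5, 0.5, 0.0, 0.0],  # a "monetary policy" topic
    [0.0, 0.0, 0.5, 0.5],  # a "housing credit" topic
]
rng = random.Random(0)
theta, doc = generate_document(topic_word, vocab, alpha=[0.5, 0.5], n_words=8, rng=rng)
print(theta, doc)
```

Fitting LDA runs this story in reverse: given only the observed words, it infers the topic proportions per document and the word distribution per topic.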

In an LDA model, the number of topics is set through the num_topics hyperparameter. Among the values we tried (10 to 21), 11 topics gave the highest coherence score, so we use num_topics=11.

In [1]:
import gensim as gs
import pandas as pd
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
import spacy
from gensim.models.doc2vec import Doc2Vec
from multiprocess import Pool
from tqdm import tqdm
# nltk.download('punkt')



In [2]:
nlp = spacy.load("en_core_web_sm")
all_stopwords = nlp.Defaults.stop_words
all_stopwords |= {'the', 'is', 'th', 's', 'm', 'would', 'The'}



In [3]:
def preprocess_text(text, stop_words=all_stopwords):
    """
    Tokenize and lemmatize one raw document.
    Args:
      text: A String containing the raw text of a single document.
      stop_words: A collection of Strings containing stop words to be removed.
    Returns:
      A list of preprocessed tokens of type String.
    """
    # Local imports keep the function self-contained when it is sent to worker processes.
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize
    lemmatizer = WordNetLemmatizer()
    # Drop stop words (the match is case-sensitive).
    words = [w for w in word_tokenize(text) if w not in stop_words]
    # Drop tokens of two characters or fewer, then lemmatize the rest.
    return [lemmatizer.lemmatize(w) for w in words if len(w) > 2]



In [4]:
speech = pd.read_pickle('frb_speeches_all.pkl')



In [5]:
speech = speech['full_text']



In [6]:
# Tokenize and normalise the speeches in parallel
pool = Pool(16)
speeches_list = speech.to_list()
speeches_preprocessed = list(
    tqdm(pool.imap(preprocess_text, speeches_list), total=len(speeches_list), desc='Multiprocess'))
pool.close()

Multiprocess: 100%|████████████████████████████████████████████████████████████████| 1548/1548 [00:16<00:00, 93.75it/s]


In [7]:
speeches_dict = gs.corpora.Dictionary(speeches_preprocessed)



In [8]:
cbow_speeches = [speeches_dict.doc2bow(doc) for doc in speeches_preprocessed]
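As an illustration of what the bag-of-words step produces, here is a minimal standard-library sketch that mimics the shape of `doc2bow` output: a sorted list of `(token_id, count)` pairs per document. The helpers `build_vocab` and `to_bow` are hypothetical stand-ins, and the ids are assigned in first-seen order, which may differ from the ids Gensim's `Dictionary` assigns.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id, in first-seen order (an assumption)."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Count known tokens and return (token_id, count) pairs sorted by id."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Toy tokenized documents standing in for the preprocessed speeches.
docs = [["rate", "inflation", "rate"], ["mortgage", "rate"]]
token2id = build_vocab(docs)
bow = [to_bow(d, token2id) for d in docs]
print(token2id)
print(bow)
```

Word order is discarded at this step; only the counts per token reach the LDA model.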



In [13]:
model = gs.models.LdaMulticore(cbow_speeches, num_topics=11, id2word=speeches_dict, passes=10, workers=16)
model.show_topics()



[(5,
  '0.018*"rate" + 0.014*"growth" + 0.013*"inflation" + 0.012*"price" + 0.011*"year" + 0.010*"economy" + 0.008*"productivity" + 0.008*"policy" + 0.006*"increase" + 0.006*"percent"'),
 (2,
  '0.016*"mortgage" + 0.014*"loan" + 0.009*"credit" + 0.009*"income" + 0.008*"market" + 0.008*"home" + 0.008*"financial" + 0.008*"housing" + 0.007*"rate" + 0.007*"percent"'),
 (1,
  '0.029*"business" + 0.020*"small" + 0.012*"financial" + 0.010*"credit" + 0.008*"firm" + 0.008*"Year" + 0.008*"market" + 0.008*"Federal" + 0.007*"Reserve" + 0.006*"information"'),
 (0,
  '0.009*"benefit" + 0.008*"government" + 0.008*"saving" + 0.008*"budget" + 0.007*"tax" + 0.007*"policy" + 0.007*"security" + 0.006*"fiscal" + 0.006*"future" + 0.006*"market"'),
 (7,
  '0.017*"financial" + 0.016*"bank" + 0.016*"capital" + 0.012*"risk" + 0.012*"firm" + 0.008*"Basel" + 0.007*"system" + 0.007*"requirement" + 0.007*"regulatory" + 0.006*"crisis"'),
 (10,
  '0.022*"market" + 0.017*"financial" + 0.011*"price" + 0.008*"economy" +
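Each topic above is printed as a string of `weight*"word"` terms. If the weights are needed as numbers, one option is to parse the string, as in the sketch below; the helper `parse_topic` is hypothetical and the input is a shortened copy of the topic 5 string shown above. Gensim's `show_topic(topicid, topn)` returns `(word, probability)` pairs directly, which avoids parsing when the model object is still available.

```python
import re

# Matches terms of the form 0.018*"rate" in the show_topics() string format.
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_str):
    """Turn a show_topics() string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_str)]

topic_5 = '0.018*"rate" + 0.014*"growth" + 0.013*"inflation"'
pairs = parse_topic(topic_5)
print(pairs)
```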

In [14]:
topic_vis = gensimvis.prepare(model, cbow_speeches, speeches_dict)
pyLDAvis.display(topic_vis)



In [11]:
pyLDAvis.save_html(topic_vis, 'topic_vis.html')



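Examining the final topic map, we can see that the LDA model captures the salient topics in the data and their constituent words reasonably well. For example, the top words listed above separate an inflation and growth topic (topic 5), a mortgage and housing credit topic (topic 2), and a bank capital and regulation topic (topic 7).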

In [12]:
# Compute Coherence Score
for i in range(10,22):
    model = gs.models.LdaMulticore(cbow_speeches, num_topics=i, id2word=speeches_dict, passes=10, workers=16)
    coherence_model = gs.models.CoherenceModel(model=model, texts=speeches_preprocessed,
                                                      dictionary=speeches_dict, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    print(f'num_topics: {i}')
    print(f'Coherence Score_Speeches: {coherence_score}')



num_topics: 10
Coherence Score_Speeches: 0.44559080590289096
num_topics: 11
Coherence Score_Speeches: 0.47591157418860114
num_topics: 12
Coherence Score_Speeches: 0.4555837249534085
num_topics: 13
Coherence Score_Speeches: 0.459501886059303
num_topics: 14
Coherence Score_Speeches: 0.4470811680013243
num_topics: 15
Coherence Score_Speeches: 0.4642344466407853
num_topics: 16
Coherence Score_Speeches: 0.4482846401940985
num_topics: 17
Coherence Score_Speeches: 0.4571350997610412
num_topics: 18
Coherence Score_Speeches: 0.4561429397703855
num_topics: 19
Coherence Score_Speeches: 0.45429160360880355
num_topics: 20
Coherence Score_Speeches: 0.45716256965774227
num_topics: 21
Coherence Score_Speeches: 0.44477535939994073
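The scores printed above can be compared programmatically rather than by eye. The sketch below copies them into a dictionary (rounded to four decimals) and ranks the candidate topic counts; the variable names are illustrative.

```python
# Coherence scores from the run above, rounded to four decimals.
scores = {
    10: 0.4456, 11: 0.4759, 12: 0.4556, 13: 0.4595,
    14: 0.4471, 15: 0.4642, 16: 0.4483, 17: 0.4571,
    18: 0.4561, 19: 0.4543, 20: 0.4572, 21: 0.4448,
}

# Rank candidate topic counts from highest to lowest coherence.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
best_k, best_score = ranked[0]
runner_up_k, runner_up_score = ranked[1]
gap = best_score - runner_up_score

print(f"best num_topics: {best_k} (c_v = {best_score})")
print(f"runner-up: {runner_up_k} (c_v = {runner_up_score}), gap = {gap:.4f}")
```

These figures come from a single training run per topic count. LDA training is stochastic unless a seed is fixed, so a gap of this size may partly reflect run-to-run variation; repeating the loop with several seeds would give a firmer comparison.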


Our choice of coherence measure was motivated by the results of Röder, Both and Hinneburg (2015), as reflected in the CoherenceModel call above. We chose coherence='c_v' for the model instead of 'u_mass', 'c_uci', or 'c_npmi'. We found that the 'c_v' measure achieves better results than the other methods, especially when the word set is small, which is in line with our choice. The coherence score of our model with 11 topics is about 0.476, the highest among the topic counts we tried. We believe the model could improve with higher-quality data. Overall, our LDA model has been trained on a reasonable number of topics and maintains a sufficient degree of semantic similarity between the top-scoring words in each topic.
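For intuition about what a coherence score measures, the sketch below computes normalised pointwise mutual information (NPMI) between two words, one of the ingredients of 'c_v' in Röder et al. It is a deliberately simplified, document-level version: the actual 'c_v' measure estimates probabilities over sliding windows and compares NPMI context vectors by cosine similarity. The function `npmi`, the toy documents, and the handling of the edge cases are assumptions made for illustration.

```python
import math

def npmi(w1, w2, docs):
    """Document-level NPMI of two words; ranges from -1 (never together) to 1."""
    n = len(docs)
    p1 = sum(w1 in d for d in docs) / n
    p2 = sum(w2 in d for d in docs) / n
    p12 = sum(w1 in d and w2 in d for d in docs) / n
    if p12 == 0:
        return -1.0  # the words never co-occur
    if p12 == 1:
        return 0.0   # both words are in every document; treated as 0 here (an assumption)
    pmi = math.log(p12 / (p1 * p2))
    return pmi / -math.log(p12)

# Toy documents represented as sets of tokens.
docs = [
    {"bank", "capital"},
    {"bank", "capital"},
    {"mortgage", "loan"},
    {"mortgage", "bank"},
]
print(npmi("bank", "capital", docs))  # positive: they co-occur more than chance
print(npmi("bank", "loan", docs))     # -1.0: they never co-occur
```

A topic whose top words frequently appear together in the corpus scores high on measures built this way, which is the sense in which coherence is used to compare topic counts here.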

## Reference

1. Kusner, M., Sun, Y., Kolkin, N., & Weinberger, K. (2015, June). From word embeddings to document distances. In International conference on machine learning (pp. 957-966). PMLR.
2. Loughran, T., & McDonald, B. (2020). Measuring firm complexity. Available at SSRN 3645372.
3. Röder, M., Both, A., & Hinneburg, A. (2015, February). Exploring the space of topic coherence measures. In Proceedings of the eighth ACM international conference on Web search and data mining (pp. 399-408).
4. Sievert, C., & Shirley, K. (2014, June). LDAvis: A method for visualizing and interpreting topics. In Proceedings of the workshop on interactive language learning, visualization, and interfaces (pp. 63-70).