
# Identify the most important vector

Though we didn't find trading signals in the last part, we would still like to explore whether the speeches contain information that we could trade on. Are the speeches of Federal Reserve officials meaningless for the market? Answering this requires us to examine the topics in the data and the vocabulary that represents them.

To explore the topics in the FOMC speeches, we will use Gensim's Latent Dirichlet Allocation (LDA). LDA is a generative probabilistic model suited to collections of discrete data such as text. It is a hierarchical Bayesian model in which each item in a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities.
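To make the "generative" part concrete, here is a minimal sketch of the story LDA assumes for how a document is produced. It uses only the standard library. The topics, words, and probabilities are hypothetical toy values (not output from our model), and the topic-word distributions are held fixed for brevity, whereas full LDA also places a Dirichlet prior on them.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, doc_alpha, length, rng):
    """Generate one toy document following LDA's generative story."""
    # 1. Draw the document's topic mixture theta ~ Dirichlet(alpha).
    theta = sample_dirichlet(doc_alpha, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position according to theta.
        k = rng.choices(range(len(topics)), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        vocab, probs = zip(*topics[k].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical toy topics: made-up words and probabilities for illustration.
toy_topics = [
    {"inflation": 0.5, "price": 0.3, "rate": 0.2},
    {"loan": 0.4, "bank": 0.4, "mortgage": 0.2},
]
rng = random.Random(0)
theta, doc = generate_document(toy_topics, doc_alpha=[0.5, 0.5], length=8, rng=rng)
print(theta, doc)
```

Fitting LDA runs this story in reverse: given only the words, it infers the topic mixtures and topic-word distributions that most plausibly generated them.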

The LDA model has several hyperparameters that we need to set. We tuned them, including the number of topics, according to the models' coherence scores, which are discussed in the last paragraph.
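A search over the number of topics could look roughly like the sketch below. This is an illustration under assumptions rather than the exact procedure we ran: the helper names `fit_and_score_gensim` and `select_num_topics` are hypothetical, the single-core `LdaModel` is used for simplicity, and the `passes` and `random_state` values are placeholders.

```python
def fit_and_score_gensim(num_topics, corpus, dictionary, texts):
    """Fit one LDA model and return its c_v coherence (requires gensim)."""
    # Imported lazily so the selection helper below works without gensim.
    from gensim.models import CoherenceModel, LdaModel

    model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics,
                     passes=10, random_state=0)  # placeholder settings
    coherence = CoherenceModel(model=model, texts=texts,
                               dictionary=dictionary, coherence="c_v")
    return coherence.get_coherence()

def select_num_topics(candidates, score_fn):
    """Score every candidate topic count; return (best_count, all_scores)."""
    scores = {k: score_fn(k) for k in candidates}
    best = max(scores, key=scores.get)
    return best, scores

# Example call, once the bag-of-words corpus, dictionary, and tokenized
# texts built later in this notebook are available:
#
# best, scores = select_num_topics(
#     range(4, 21, 2),
#     lambda k: fit_and_score_gensim(
#         k, cbow_speeches, speeches_dict, speeches_preprocessed),
# )
```

Keeping the scoring function separate from the selection loop makes it easy to swap in a different coherence measure or a different model class.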

In [1]:
import gensim as gs
import pandas as pd
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
import spacy
from gensim.models.doc2vec import Doc2Vec
from multiprocess import Pool
from tqdm import tqdm
# nltk.download('punkt')



In [2]:
nlp = spacy.load("en_core_web_sm")
all_stopwords = nlp.Defaults.stop_words
all_stopwords |= {'the', 'is', 'th', 's', 'm', 'would', 'The'}



In [3]:
def preprocess_text(text, stop_words=all_stopwords):
    """
    Tokenize and lemmatize a single raw speech.
    Args:
      text: A String containing the raw text of one speech.
      stop_words: A collection of Strings containing stop words to be removed.
    Returns:
      A list of preprocessed tokens of type String.
    """
    # Imported locally so the function is self-contained in worker processes
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize
    lemmatizer = WordNetLemmatizer()
    # Drop stop words (the membership check is case-sensitive)
    words = [w for w in word_tokenize(text) if w not in stop_words]
    # Keep only tokens longer than two characters and lemmatize them
    return [lemmatizer.lemmatize(w) for w in words if len(w) > 2]

In [4]:
speech = pd.read_pickle('./data/frb_speeches_all.pkl')

In [5]:
speech = speech['full_text']

In [6]:
# Tokenize & normalise the speeches in parallel
pool = Pool(16)
speeches_list = speech.to_list()
speeches_preprocessed = list(
    tqdm(pool.imap(preprocess_text, speeches_list), total=len(speeches_list), desc='Multiprocess'))
pool.close()

Multiprocess: 100%|███████████████████████████████████████████████████████████████| 1548/1548 [00:08<00:00, 177.98it/s]


In [7]:
speeches_dict = gs.corpora.Dictionary(speeches_preprocessed)

In [8]:
cbow_speeches = [speeches_dict.doc2bow(doc) for doc in speeches_preprocessed]

In [33]:
model = gs.models.LdaMulticore(cbow_speeches, alpha='asymmetric', eta='auto', 
                               num_topics=12, id2word=speeches_dict, passes=10, workers=16)
model.show_topics()

[(11,
  '0.012*"price" + 0.011*"oil" + 0.007*"market" + 0.007*"trade" + 0.006*"year" + 0.006*"economic" + 0.006*"economy" + 0.005*"growth" + 0.005*"new" + 0.004*"energy"'),
 (10,
  '0.025*"inflation" + 0.020*"policy" + 0.019*"rate" + 0.013*"price" + 0.011*"monetary" + 0.007*"economy" + 0.007*"year" + 0.006*"market" + 0.006*"growth" + 0.005*"economic"'),
 (9,
  '0.013*"financial" + 0.008*"Federal" + 0.008*"Reserve" + 0.007*"system" + 0.006*"Year" + 0.006*"public" + 0.005*"policy" + 0.005*"information" + 0.005*"bank" + 0.005*"market"'),
 (8,
  '0.016*"growth" + 0.014*"productivity" + 0.011*"year" + 0.010*"price" + 0.010*"rate" + 0.009*"investment" + 0.009*"capital" + 0.008*"economy" + 0.007*"business" + 0.007*"increase"'),
 (7,
  '0.019*"Federal" + 0.017*"Reserve" + 0.009*"financial" + 0.007*"bank" + 0.007*"economic" + 0.005*"policy" + 0.005*"rate" + 0.005*"Board" + 0.005*"credit" + 0.005*"economy"'),
 (4,
  '0.018*"community" + 0.013*"loan" + 0.012*"bank" + 0.011*"mortgage" + 0.010*"cre

In [34]:
topic_vis = gensimvis.prepare(model, cbow_speeches, speeches_dict)
pyLDAvis.display(topic_vis)



The speeches of Federal Reserve officials are closely related to the Fed's two main goals: keeping the inflation rate at its 2% target and maximizing sustainable employment. Fed officials devote much attention to these topics, as words such as "inflation," "price," and "labor" carry high weights in the speeches.
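To check such claims without reading the raw output by eye, the topic strings printed by `show_topics()` can be parsed into (word, weight) pairs. The sketch below uses two entries copied from the output above; the helper names `parse_topic_string` and `topics_mentioning` are hypothetical. With a live model, gensim's `show_topic(topic_id)` should return such pairs directly, so parsing is mainly useful for saved output.

```python
import re

# Matches terms of the form 0.025*"inflation" in a show_topics() string.
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic_string(topic_string):
    """Turn '0.025*"inflation" + 0.020*"policy"' into [(word, weight), ...]."""
    return [(word, float(weight))
            for weight, word in TERM_PATTERN.findall(topic_string)]

def topics_mentioning(topics, word):
    """Return {topic_id: weight} for topics whose top terms include `word`."""
    hits = {}
    for topic_id, topic_string in topics:
        weights = dict(parse_topic_string(topic_string))
        if word in weights:
            hits[topic_id] = weights[word]
    return hits

# Two entries copied verbatim from the show_topics() output above.
sample = [
    (11, '0.012*"price" + 0.011*"oil" + 0.007*"market" + 0.007*"trade" + 0.006*"year" + 0.006*"economic" + 0.006*"economy" + 0.005*"growth" + 0.005*"new" + 0.004*"energy"'),
    (10, '0.025*"inflation" + 0.020*"policy" + 0.019*"rate" + 0.013*"price" + 0.011*"monetary" + 0.007*"economy" + 0.007*"year" + 0.006*"market" + 0.006*"growth" + 0.005*"economic"'),
]
print(topics_mentioning(sample, "inflation"))  # {10: 0.025}
print(topics_mentioning(sample, "price"))      # {11: 0.012, 10: 0.013}
```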

In [35]:
pyLDAvis.save_html(topic_vis, 'topic_vis.html')

Examining the final topic map, we can see that the LDA model does a reasonable job of capturing the salient topics in the data and their constituent words.

In [32]:
# Compute Coherence Score
coherence_model = gs.models.CoherenceModel(model=model, texts=speeches_preprocessed, dictionary=speeches_dict, coherence='c_v')
coherence_score = coherence_model.get_coherence()
print(f'Coherence Score_Speeches: {coherence_score}')

Coherence Score_Speeches: 0.45993622688536584


The results of Röder, Both, and Hinneburg (2015) motivated our choice of coherence measure. We chose the coherence='c_v' metric for the model rather than alternatives such as 'u_mass' and 'c_uci'. We found that the 'c_v' measure can achieve better results than the other methods, especially when the word set is small, which fits our setting. The coherence score of our model is 0.46. We believe the model could be improved with higher-quality data. Overall, the score suggests that our LDA model was trained with a reasonable number of topics and that the highest-weighted words in each topic remain semantically similar.
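For intuition about what a coherence score rewards, the sketch below computes a simplified, document-level NPMI average over pairs of a topic's top words. This is not gensim's 'c_v' implementation, which is more elaborate; it only illustrates the shared idea that a topic scores higher when its top words tend to appear together. The function name and the toy documents are hypothetical.

```python
import math
from itertools import combinations

def npmi_coherence(top_words, documents):
    """Average document-level NPMI over all pairs of a topic's top words."""
    if len(top_words) < 2:
        raise ValueError("need at least two words to form a pair")
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # the pair never co-occurs: NPMI lower bound
        elif p12 == 1:
            scores.append(0.0)   # NPMI undefined here; treated as 0 by assumption
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)

# Hypothetical toy corpus of already tokenized documents.
toy_docs = [
    ["inflation", "price", "rate"],
    ["inflation", "price", "policy"],
    ["loan", "bank", "mortgage"],
    ["loan", "bank", "credit"],
]
coherent = npmi_coherence(["inflation", "price"], toy_docs)
mixed = npmi_coherence(["inflation", "price", "loan"], toy_docs)
print(coherent, mixed)
```

In this toy corpus the pair that always appears together scores higher than the set that mixes words from unrelated documents, which is the behavior a coherence measure is meant to capture.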

## Reference

1. Röder, M., Both, A., & Hinneburg, A. (2015, February). Exploring the space of topic coherence measures. In Proceedings of the eighth ACM international conference on Web search and data mining (pp. 399-408).
2. Sievert, C., & Shirley, K. (2014, June). LDAvis: A method for visualizing and interpreting topics. In Proceedings of the workshop on interactive language learning, visualization, and interfaces (pp. 63-70).