### Topic Models:

Topic modeling is an unsupervised approach that analyzes the distribution of words across the documents in a corpus and uses that distribution, together with the co-occurrence of words across documents, to identify patterns (clusters) of words. These clusters are typically referred to as topics. Each topic is defined as a distribution over the keywords (tokens) in the lexicon, e.g., a topic labeled “financial fraud” may include keywords such as “financial wrongdoing”, “theft”, “financial mismanagement”, and “misleading lending practices”. A weight is typically associated with each keyword in the topic.

After the topics are identified, each document in the corpus is described as a mix of topics, e.g., document X may be a mix of two topics: “financial fraud” and “private equity investment”. A weight for each topic in the document's mix may also be available. Because topic modeling is unsupervised, much like clustering, there is usually no labeled training data. The (unsupervised) training is controlled by several parameters, including the maximum number of topics. The topics come without names: the user has to examine the keywords of each topic and label it manually.
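To make the two kinds of weights concrete, here is a toy sketch in plain Python. The topics, keywords, and weights in `toy_topics` are hand-made for illustration and are not learned from data; a real topic model would estimate both the keyword weights and the document mixes from the corpus. The helper name `topic_mix` is also hypothetical.

```python
# Toy illustration only: hand-made topics, each mapping keywords to weights.
# A real topic model would learn these weights from the corpus.
toy_topics = {
    "financial fraud": {"fraud": 0.5, "theft": 0.3, "mismanagement": 0.2},
    "private equity investment": {"equity": 0.4, "investment": 0.4, "fund": 0.2},
}


def topic_mix(document, topics):
    """Score a document against each topic and normalize the scores to sum to 1."""
    tokens = document.lower().split()
    scores = {
        name: sum(weights.get(token, 0.0) for token in tokens)
        for name, weights in topics.items()
    }
    total = sum(scores.values())
    if total == 0:
        # No keyword from any topic appears in the document.
        return {name: 0.0 for name in topics}
    return {name: score / total for name, score in scores.items()}


doc_x = "The fund reported theft and fraud tied to one equity investment"
mix = topic_mix(doc_x, toy_topics)
for name, weight in mix.items():
    print(f"{name}: {weight:.2f}")
```

The output is one weight per topic for document X, which is the "mix of topics" described above.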

#### Software:
Latent Dirichlet Allocation (LDA) (LDA_bleiref, LDAexample) has been very successful for topic modeling, and there have been many variants, including some supervised approaches and extensions that accommodate topics evolving over time. The Gensim package is widely used for topic modeling.

BERT is a transformer-based language model. FinBERT is a BERT variant that has been trained on financial documents.
FinBERT-based topic modeling extends prior approaches, e.g., LDA topic modeling, by using the embeddings produced by FinBERT to construct topics. The following model has been pre-trained on a general corpus and then fine-tuned on financial data.
It classifies text into 20 predefined, labeled topics.
https://huggingface.co/nickmuchi/finbert-tone-finetuned-finance-topic-classification


#### Data Collection
Consider the following corpus of quarterly earnings reports (transcripts) from 25 companies in the S&P 500 index; the dataset covers the period 2017-2021.
https://drive.google.com/drive/folders/1e_FpIjoKTHNNUHl-HBCKHnhMGUVODc2U?usp=sharing
Note: For each company / quarter there are multiple entries (one per speaker statement).

#### Exercise - Please complete this
Read the transcripts into a Pandas dataframe, with one entry (row) for each statement. Use the FinBERT model fine-tuned for financial topic classification to compute the topic(s) for each statement in each transcript. Save the output in a CSV file.
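The layout of the transcript files is not described here, so the following is only a sketch of the target table shape: one row per statement, written to CSV. The column names (`company`, `quarter`, `speaker`, `statement`), the sample records, and the output file name are all hypothetical and would need to be adapted to the actual files.

```python
import os
import tempfile

import pandas as pd

# Hypothetical records: one dict per statement (not taken from the real dataset).
records = [
    {"company": "AAA", "quarter": "2017Q1", "speaker": "CEO",
     "statement": "Revenue grew this quarter."},
    {"company": "AAA", "quarter": "2017Q1", "speaker": "CFO",
     "statement": "We are raising the dividend."},
    {"company": "BBB", "quarter": "2017Q1", "speaker": "CEO",
     "statement": "We completed the acquisition."},
]

# One row per statement, so a company / quarter pair can appear several times.
df = pd.DataFrame(records)

# Write to a temporary directory here; a real run would pick its own output path.
out_path = os.path.join(tempfile.mkdtemp(), "statement_topics.csv")
df.to_csv(out_path, index=False)

# Read the file back to confirm the round trip.
df_check = pd.read_csv(out_path)
print(df_check.groupby(["company", "quarter"]).size())
```

A `topic` column can be added to `df` before saving, once a topic has been computed for each statement.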

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import pandas as pd
import os
import warnings
warnings.filterwarnings("ignore")


# The model is hosted publicly on the Hugging Face Hub, so no access token is passed.
MODEL_NAME = "nickmuchi/finbert-tone-finetuned-finance-topic-classification"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
topics = {
   0: "Analyst Update",
   1: "Fed | Central Banks",
   2: "Company | Product News",
   3: "Treasuries | Corporate Debt",
   4: "Dividend",
   5: "Earnings",
   6: "Energy | Oil",
   7: "Financials",
   8: "Currencies",
   9: "General News | Opinion",
   10: "Gold | Metals | Materials",
   11: "IPO",
   12: "Legal | Regulation",
   13: "M&A | Investments",
   14: "Macro",
   15: "Markets",
   16: "Politics",
   17: "Personnel Change",
   18: "Stock Commentary",
   19: "Stock Movement",
}
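A sequence classification model returns one raw score (logit) per topic for each input text. The sketch below shows, in plain Python, one way to turn such scores into ranked topic labels with a softmax. It assumes the logits have already been obtained and converted to a list, for example via `model(**tokenizer(texts, return_tensors="pt", padding=True, truncation=True)).logits.tolist()`. The helper names `softmax` and `top_topics`, the shortened label map, and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_topics(logits, id2label, k=2):
    """Return the k most probable (label, probability) pairs for one statement."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Hypothetical three-label map and made-up logits for a single statement.
example_labels = {0: "Analyst Update", 1: "Fed | Central Banks", 2: "Company | Product News"}
example_logits = [0.1, 2.3, -1.0]

result = top_topics(example_logits, example_labels, k=2)
for label, prob in result:
    print(f"{label}: {prob:.3f}")
```

Keeping the top `k` labels rather than only the best one matches the exercise wording "topic(s)", since a statement may plausibly touch more than one topic.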
