## Task 4 - Earnings Announcement Language
You’ve taken a new position at a boutique brokerage firm that wishes to focus on a “value-based” investment strategy. The firm has several analysts devoted to analyzing the extent to which quantitative financial statement information predicts future returns. You’ve been asked to consider qualitative information, particularly the narrative information accompanying firms’ earnings announcements.

Your supervisor has provided you with a sample of 2,000 earnings announcements, which data scientists (err… a PhD student) at the firm have already cleaned and parsed. You’ve also been provided with abnormal market returns surrounding each firm’s earnings announcement.

The firm would like to understand which words appear to be most predictive of the immediate investor reaction to earnings news. You have been asked to identify words conveying both “positive tone” (words that strongly predict positive returns) and “negative tone” (words that strongly predict negative returns). You remember from your applied analytics class that two finance professors, Tim Loughran and Bill McDonald, maintain a finance sentiment dictionary. You think it’d be worthwhile to consider financial sentiment from this dictionary as well.

Your task involves the following requirements:
1.	Construct a case insensitive document-term matrix using all language from the earnings announcements. Your matrix should only include the 2,000 most common words (assuming they meet the criteria below), and you should allow for bigrams *and* trigrams. You should exclude:

    a)	stop words
    
    b)	any non-alpha tokens
    
    c)	tokens shorter than 3 characters in length
    
    d)	tokens appearing in more than 90% of earnings announcements

    Report the top 25 words in your matrix.

2. Using the Loughran and McDonald financial sentiment dictionary, compute overall sentiment using three measures:

    a) positive words / total words
   
    b) negative words / total words
   
    c) (positive words – negative words) / total words

    Report descriptive statistics for these three measures (i.e., `.describe().transpose()`).

3. Compute the correlations between each of the three sentiment measures and returns. Answer these questions:

    a) How does each of these three measures correlate with returns?
   
    b) Are you surprised by this pattern? Why or why not?

4. Scale each row of the document term matrix by the total words in the document (i.e., so that the “counts” are proportions and sum to 1). Using these percentages, list the 25 words that correlate most positively with returns and most negatively with returns (50 total words). Report how many of these appear in the financial sentiment dictionary used in question 2.

5. (6046 only) Fit an LDA model using `sklearn` with 50 topics. Identify the topics that correlate most positively and most negatively with returns. For these two topics, summarize the 10 most relevant words for each, and state whether you find this pattern intuitive.

### Requirement 1
In this step, we will first load the data, and then generate the document term matrix needed to answer the specific requirements.

First, load the data and briefly inspect. There are two data files:
- `Task4_ea_sample.zip`: sample of earnings announcements
- `EA_list.csv`: Returns data for each earnings announcement

We haven't done much with zip files, so I will provide the code to load this data.


In [14]:
import pandas as pd
from zipfile import ZipFile

folder = '/Users/cooperdenning/Documents/GitHub/MGT-6000/Task4' # Update with your own path
archive = f'{folder}/Task4_ea_sample.zip'

eadata = []
with ZipFile(archive,'r') as arc:
    for mem in arc.namelist(): #namelist() lists files in zip archive, so this loop iterates over those
        if mem.endswith('txt'):  # if the member is a txt file, it's extracted and added to the "eadata" list
            contents = arc.read(mem)
            eadata.append([mem,contents])

textdf = pd.DataFrame(eadata,columns=['File_Name','text'])
textdf['text'] = textdf['text'].str.decode('utf8')
textdf['File_Name'] = textdf['File_Name'].str.split("/").str[-1]

Now load the `EA_list.csv` data into a dataframe called `ealist` and merge it with `textdf` using the field `File_Name`:

In [15]:
ealist = pd.read_csv(f'{folder}/EA_list.csv')
both = pd.merge(ealist, textdf, how='inner',on='File_Name')
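An inner merge silently drops any row whose `File_Name` appears in only one of the two frames, so it can be worth confirming that nothing was lost. The following is a minimal sketch on made-up data (the toy frames and file names are hypothetical, not from the assignment) that uses the `indicator` and `validate` options of `pd.merge`:

```python
import pandas as pd

# Hypothetical toy frames standing in for `ealist` and `textdf`
toy_list = pd.DataFrame({"File_Name": ["a.txt", "b.txt", "c.txt"],
                         "AbnReturn": [1.2, -0.5, 3.1]})
toy_text = pd.DataFrame({"File_Name": ["a.txt", "b.txt", "d.txt"],
                         "text": ["alpha", "beta", "delta"]})

# An outer merge with indicator=True labels which side each row came from;
# validate raises MergeError if File_Name is duplicated on either side
check = pd.merge(toy_list, toy_text, how="outer", on="File_Name",
                 indicator=True, validate="one_to_one")

# Rows present in only one of the two frames
unmatched = check.loc[check["_merge"] != "both", "File_Name"].tolist()
print(unmatched)
```

If `unmatched` is empty on the real data, the inner merge kept every announcement.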

Report `info()` and `head(10)` for both:

In [16]:
# print info() and head(10) for both datasets

both.info()
both.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   File_Name      2000 non-null   object 
 1   CIK            2000 non-null   int64  
 2   datadate       2000 non-null   object 
 3   announce_date  2000 non-null   object 
 4   tic            2000 non-null   object 
 5   AbnReturn      2000 non-null   float64
 6   text           2000 non-null   object 
dtypes: float64(1), int64(1), object(5)
memory usage: 109.5+ KB


| | File_Name | CIK | datadate | announce_date | tic | AbnReturn | text |
|---|---|---|---|---|---|---|---|
| 0 | 790705-0000950123-10-100486.txt | 790705 | 9/30/2010 | 2010-11-04 | TKLC | -1.826796 | EX-99. 1 2 v57726exv99w1. htm EX-99. 1 exv99w... |
| 1 | 814547-0001144204-14-003352.txt | 814547 | 12/31/2013 | 2014-01-22 | FICO | -3.337049 | EX-99. 1 2 v365966_ex99-1. htm EXHIBIT 99. 1 ... |
| 2 | 71829-0001437749-14-019134.txt | 71829 | 9/30/2014 | 2014-10-30 | NR | 4.051841 | EX-99 2 ex99-1. htm EXHIBIT 99. 1 ex99-1. htm... |
| 3 | 71691-0001157523-19-000236.txt | 71691 | 12/31/2018 | 2019-02-06 | NYT | 16.480962 | EX-99. 1 2 a51935890ex99_1. htm EXHIBIT 99. 1 ... |
| 4 | 320193-0001193125-10-230992.txt | 320193 | 9/30/2010 | 2010-10-18 | AAPL | -1.228231 | EX-99. 1 2 dex991. htm TEXT OF PRESS RELEASE I... |
| 5 | 1666134-0001171843-18-005691.txt | 1666134 | 6/30/2018 | 2018-08-02 | BL | 10.491547 | EX-99. 1 2 exh_991. htm PRESS RELEASE EdgarFi... |
| 6 | 29332-0000029332-11-000064.txt | 29332 | 6/30/2011 | 2011-08-04 | DXYN | 13.407912 | EX-99. 1 2 ex99_1pressrel-01. htm 2Q 2011 PRE... |
| 7 | 1318605-0001564590-20-033069.txt | 1318605 | 6/30/2020 | 2020-07-22 | TSLA | -9.107826 | EX-99. 1 2 tsla-ex991_63. htm EX-99. 1 tsla-... |
| 8 | 731012-0000950123-11-010074.txt | 731012 | 12/31/2010 | 2011-02-07 | HCSG | 4.447709 | EX-99. 1 2 c11979exv99w1. htm EXHIBIT 99. 1 E... |
| 9 | 1282637-0001193125-11-103679.txt | 1282637 | 3/31/2011 | 2011-04-20 | NEU | 13.225461 | EX-99. 1 2 dex991. htm PRESS RELEASE REGARDING... |


Now, construct a document term matrix with `CountVectorizer`, following the specifications provided above. Your matrix should exclude:

a)	stop words (for simplicity, you can set `stop_words = 'english'`)

b)	any non-alpha tokens

c)	tokens shorter than 3 characters in length

d)	tokens appearing in more than 90% of earnings announcements

Also, limit the DTM to 2,000 words, bigrams, or trigrams, and do not consider case sensitivity.


In [17]:
from sklearn.feature_extraction.text import CountVectorizer

stops = []

# Open the file in read mode
with open(f'{folder}/english', 'r') as file:
    # Iterate through each line in the file
    for line in file:
        # Strip the newline character and append the line to the list
        stops.append(line.strip())

# Now, `stops` contains each line of the file as an element.
# It is not passed to the vectorizer below, which uses scikit-learn's built-in 'english' list.

vec =  CountVectorizer(
    stop_words='english',             # Exclude stop words
    token_pattern=r'\b[a-zA-Z]{3,}\b',  # Only include alpha tokens with 3+ chars
    max_df=0.9,                       # Exclude tokens in more than 90% of docs
    ngram_range=(1, 3),               # Include unigrams, bigrams, and trigrams
    max_features=2000,                # Limit vocabulary to 2000 features
    lowercase=True                    # Case insensitive processing
)
dtm = vec.fit_transform(both['text'])




Report the top 25 words in your matrix (BONUS POINT: Report the top 25 words AND the count of each).

In [18]:
import numpy as np
vocab = vec.vocabulary_

# Get feature names (columns in the resulting DataFrame)
feature_names = vec.get_feature_names_out()

# Convert sparse matrix to dense if needed and create DataFrame
dtm_df = pd.DataFrame(dtm.toarray(), columns=feature_names)
# Insert code to generate top 25 word counts (and for bonus point, count of each)
vocab_df = pd.DataFrame(list(vocab.items()), columns=["Word", "Index"])
vocab_df['Count'] = [dtm_df.iloc[:, idx].sum() for idx in vocab_df['Index']]

# Sort by count and select the top 25 words
top_25_vocab_df = vocab_df.sort_values(by='Count', ascending=False).head(25)[['Word', 'Count']]

top_25_vocab_df

| | Word | Count |
|---|---|---|
| 8 | gaap | 46793 |
| 1055 | adjusted | 30494 |
| 353 | non gaap | 28293 |
| 203 | sales | 27446 |
| 38 | loss | 25300 |
| 383 | net income | 24976 |
| 2 | revenue | 23094 |
| 388 | months ended | 18163 |
| 1570 | ebitda | 17013 |
| 33 | measures | 15944 |
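The same term totals can be obtained without converting the matrix to a dense DataFrame, by summing the sparse matrix down its columns. A minimal sketch on a tiny made-up count matrix (the values and the names `toy_counts` and `toy_terms` are hypothetical):

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical 3-document, 3-term count matrix
toy_counts = csr_matrix(np.array([[2, 0, 1],
                                  [1, 3, 0],
                                  [4, 0, 0]]))
toy_terms = np.array(["gaap", "sales", "loss"])

# Column sums of a sparse matrix come back as a 1 x n matrix; flatten to 1-D
totals = np.asarray(toy_counts.sum(axis=0)).ravel()

# Pair each total with its term and keep the most frequent ones
top_terms = (pd.Series(totals, index=toy_terms, name="Count")
             .sort_values(ascending=False)
             .head(2))
print(top_terms)
```

On the real data, `toy_counts` would be `dtm`, `toy_terms` would be `vec.get_feature_names_out()`, and `head(25)` would give the requested list.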


### Requirement 2

Using the Loughran and McDonald financial sentiment dictionary, compute overall sentiment using three measures:

a) positive words / total words

b) negative words / total words

c) (positive words – negative words) / total words
    
First, load the LM Dictionary (file name is `LoughranMcDonald_SentimentWordLists_2018.xlsx`). Recall that the positive and negative terms are in separate sheets. Once you've loaded the words, store the positive and negative words in separate lists. I recommend converting to lower case at this stage.


In [19]:
lmdict = f"{folder}/LoughranMcDonald_SentimentWordLists_2018.xlsx"
lmpos = pd.read_excel(lmdict,sheet_name='Positive',header=None,names=['word'])
lmneg = pd.read_excel(lmdict,sheet_name='Negative',header=None,names=['word'])

In [20]:
# save lower case version of each word in lmpos and lmneg in two sets
pos = set(lmpos['word'].str.lower())  # Convert to lowercase and store in a set
neg = set(lmneg['word'].str.lower()) 

Second, identify the indices in the vocabulary that correspond to positive and negative words.

In [21]:
# Identify indices for positive and negative words in the vocabulary
posidx = [v for k, v in vec.vocabulary_.items() if k in pos]
negidx = [v for k, v in vec.vocabulary_.items() if k in neg]


Third, compute the required sentiment measures and add them to the combined dataframe, `both`. Specifically add these three measures to the dataframe:

a) `pos_pct` = positive words / total words

b) `neg_pct` = negative words / total words

c) `net_pos` = (positive words – negative words) / total words

In [22]:
poswords = np.asarray(dtm[:,posidx].sum(axis=1)).flatten()
negwords = np.asarray(dtm[:,negidx].sum(axis=1)).flatten()
totwords = np.asarray(dtm.sum(axis=1)).flatten()

# Compute the sentiment measures
both['pos_pct'] = poswords / totwords
both['neg_pct'] = negwords / totwords
both['net_pos'] = (poswords - negwords) / totwords

In [23]:
both[['pos_pct', 'neg_pct', 'net_pos']].describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| pos_pct | 2000.0 | 0.012726 | 0.005795 | 0.000583 | 0.008749 | 0.011907 | 0.015798 | 0.053219 |
| neg_pct | 2000.0 | 0.017444 | 0.010692 | 0.0 | 0.009298 | 0.015707 | 0.023848 | 0.063321 |
| net_pos | 2000.0 | -0.004718 | 0.01152 | -0.049675 | -0.011567 | -0.003674 | 0.00298 | 0.041064 |
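To make the three measures concrete, here is a small worked example on invented counts (the terms, word sets, and numbers are hypothetical). It also shows a feature of this approach: the denominator is the number of tokens retained in the matrix, n-grams included, not the raw word count of the document, and a bigram such as `net loss` adds to the denominator without matching the single-word dictionary.

```python
import numpy as np

# Hypothetical vocabulary and dictionary entries
toy_terms = ["strong", "loss", "revenue", "net loss"]
toy_pos = {"strong"}
toy_neg = {"loss"}

# Rows are documents, columns follow toy_terms
toy_counts = np.array([[3, 1, 5, 1],
                       [0, 4, 4, 2]])

pos_cols = [j for j, term in enumerate(toy_terms) if term in toy_pos]
neg_cols = [j for j, term in enumerate(toy_terms) if term in toy_neg]

tot = toy_counts.sum(axis=1)                        # tokens kept per document
pos_pct = toy_counts[:, pos_cols].sum(axis=1) / tot
neg_pct = toy_counts[:, neg_cols].sum(axis=1) / tot
net_pos = pos_pct - neg_pct                         # same as (pos - neg) / total
print(pos_pct, neg_pct, net_pos)
```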


### Requirement 3
Answer these questions:

a) How does each of these three measures correlate with returns?

b) Are you surprised by this pattern? Why or why not?

To answer 3A, simply generate a correlation matrix with the variables needed to assess the correlation between returns and sentiment:

In [24]:
# Generating the correlation matrix between returns (`AbnReturn`) and sentiment measures
correlation_matrix = both[['AbnReturn', 'pos_pct', 'neg_pct', 'net_pos']].corr()

# Displaying the correlation matrix
correlation_matrix

| | AbnReturn | pos_pct | neg_pct | net_pos |
|---|---|---|---|---|
| AbnReturn | 1.0 | 0.036189 | -0.048489 | 0.063213 |
| pos_pct | 0.036189 | 1.0 | 0.122707 | 0.389189 |
| neg_pct | -0.048489 | 0.122707 | 1.0 | -0.866441 |
| net_pos | 0.063213 | 0.389189 | -0.866441 | 1.0 |


**Answer to 3B:**
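Not really. The correlations with returns have the expected signs (positive for `pos_pct` and `net_pos`, negative for `neg_pct`), but all three are small in magnitude (below 0.07). The strongest relationship in the matrix is between `net_pos` and `neg_pct` (-0.866), which follows mechanically from how the measures are constructed.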

Returns are influenced by a variety of factors, including macroeconomic conditions, geopolitical environment, and media portrayal. Textual sentiment is only one piece of the puzzle and may not always have a strong predictive relationship with returns.
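The mechanical link can be checked by hand. Because `net_pos = pos_pct - neg_pct`, its correlation with each component is determined by the two standard deviations and the correlation between `pos_pct` and `neg_pct`. The sketch below plugs in the rounded figures reported in the `describe()` and correlation outputs, so small rounding differences are expected:

```python
import math

# Rounded summary statistics copied from the outputs reported earlier
sd_pos, sd_neg = 0.005795, 0.010692
rho_pos_neg = 0.122707

cov_pos_neg = rho_pos_neg * sd_pos * sd_neg

# Var(pos - neg) = Var(pos) + Var(neg) - 2 Cov(pos, neg)
sd_net = math.sqrt(sd_pos**2 + sd_neg**2 - 2 * cov_pos_neg)

# Cov(pos - neg, neg) = Cov(pos, neg) - Var(neg)
implied_net_neg = (cov_pos_neg - sd_neg**2) / (sd_net * sd_neg)

# Cov(pos - neg, pos) = Var(pos) - Cov(pos, neg)
implied_net_pos = (sd_pos**2 - cov_pos_neg) / (sd_net * sd_pos)

print(round(sd_net, 5), round(implied_net_neg, 3), round(implied_net_pos, 3))
```

The implied values closely reproduce the reported standard deviation of `net_pos` and its correlations with `neg_pct` and `pos_pct`. Since `neg_pct` varies almost twice as much as `pos_pct`, it dominates `net_pos`.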



### Requirement 4
We'll now look at how you might start to construct your own dictionary. First, scale each row of the document term matrix by the total words in the document (i.e., so that the “counts” are proportions and sum to 1).

In [25]:
dtm_scaled = dtm / totwords[:, np.newaxis]  # Broadcasting to scale rows

# Convert to dense array
dtm_scaled = dtm_scaled.toarray()
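As an alternative sketch, row scaling can also be done with `sklearn.preprocessing.normalize` using the L1 norm, which keeps the matrix sparse. For non-negative counts, dividing by the L1 norm is the same as dividing by the row total. The small matrix here is made up for illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

# Hypothetical 2-document, 3-term count matrix
toy_counts = csr_matrix(np.array([[1, 3, 0],
                                  [2, 2, 4]], dtype=float))

# L1 normalization along rows turns counts into proportions
toy_scaled = normalize(toy_counts, norm="l1", axis=1)

row_sums = np.asarray(toy_scaled.sum(axis=1)).ravel()
print(toy_scaled.toarray())
print(row_sums)
```

Whether to stay sparse is a judgment call: a 2,000 x 2,000 dense array is small enough that the dense version used in this notebook is also reasonable.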

Using these percentages, generate correlations between each word count and returns. List the 25 words that correlate most positively with returns and most negatively with returns (50 total words). Report how many of these appear in the financial sentiment dictionary used in requirement 2.

In [26]:
dtm_ret = np.hstack([both['AbnReturn'].values.reshape(-1,1),dtm_scaled])
corrs = np.corrcoef(dtm_ret,rowvar=False)

# AbnReturn is column 0 of dtm_ret, so row 0 of the correlation matrix
# (skipping the self-correlation) holds each word's correlation with returns
word_correlations = corrs[0, 1:]

# Get the top 25 positively and negatively correlated words
sorted_indices = np.argsort(word_correlations)  # Sort correlations

# 25 most positively correlated
top_positive_indices = sorted_indices[-25:]
top_positive_words = [(vec.get_feature_names_out()[i], word_correlations[i]) for i in top_positive_indices]

# 25 most negatively correlated
top_negative_indices = sorted_indices[:25]
top_negative_words = [(vec.get_feature_names_out()[i], word_correlations[i]) for i in top_negative_indices]

# Combine into a single list
top_words = top_positive_words + top_negative_words

# Check how many of these words appear in the financial sentiment dictionary
lm_dict_words = pos.union(neg)  # Combine positive and negative dictionary words
dictionary_matches = [word for word, _ in top_words if word in lm_dict_words]
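`np.corrcoef` on the stacked array computes every pairwise correlation among all columns, although only the correlations with returns are needed. A lighter option, sketched here with a hypothetical helper named `column_correlations` and random toy data, is to compute the Pearson correlation of each column with the return vector directly:

```python
import numpy as np

def column_correlations(X, y):
    """Pearson correlation between each column of X and the vector y."""
    Xc = X - X.mean(axis=0)          # center each column
    yc = y - y.mean()                # center the target
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return (Xc.T @ yc) / denom

# Random toy data standing in for dtm_scaled and AbnReturn
rng = np.random.default_rng(0)
X_toy = rng.random((50, 4))
y_toy = rng.normal(size=50)

fast = column_correlations(X_toy, y_toy)

# Reference: first row of the full correlation matrix, as in the cell above
full = np.corrcoef(np.hstack([y_toy.reshape(-1, 1), X_toy]), rowvar=False)[0, 1:]
print(np.allclose(fast, full))
```

A column with no variation would make the denominator zero, so this sketch assumes every term varies across documents.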

List the 25 words that correlate most positively with returns and most negatively with returns (50 total words) (BONUS POINT FOR REPORTING WORD + CORRELATION). 

In [27]:
ret_corrs = corrs[1:,0].flatten() # Grabs the first column, skipping upper left element
print("Most Negative Words")
top_negative_words

Most Negative Words


[('net loss', -0.10571868850311193),
 ('net loss income', -0.08571840188975391),
 ('site', -0.07773283069185974),
 ('loss income', -0.07617007189789415),
 ('rent', -0.0754569426245722),
 ('conjunction', -0.07446446307367027),
 ('employee', -0.07209179265624727),
 ('included', -0.07172953177691292),
 ('operating loss', -0.07152646753189354),
 ('start', -0.06893282079350574),
 ('publicly', -0.06871491761308089),
 ('million compared', -0.06813421957379075),
 ('contribution', -0.06759510723427935),
 ('gaap gross profit', -0.06743012623080652),
 ('loss', -0.06536294149176902),
 ('ceo', -0.06493621954878402),
 ('enhance', -0.0630490538981664),
 ('million compared million', -0.06179089723417775),
 ('nasdaq', -0.06174798298469522),
 ('involve', -0.061089418249149846),
 ('described', -0.05914426064073346),
 ('net current', -0.057358000615496396),
 ('write', -0.0566766768137),
 ('parties', -0.055859749558654796),
 ('fiscal year', -0.05577877769395761)]

In [31]:
print("Most Positive Words")
top_positive_words.reverse()
top_positive_words

Most Positive Words


[('excluding', 0.06925928128556731),
 ('diluted eps', 0.06779897060473143),
 ('free', 0.06484286019833677),
 ('record', 0.06453959230372946),
 ('different', 0.06346869395055169),
 ('free cash flow', 0.06279697673379618),
 ('net income', 0.06266065652395256),
 ('statements income', 0.06206529744318952),
 ('free cash', 0.06078639603148409),
 ('consolidated statements income', 0.06077706488256738),
 ('ended june june', 0.06005425468430923),
 ('cash flow', 0.059511539739231824),
 ('june june', 0.05941922033512571),
 ('average shares', 0.05804716965880158),
 ('option', 0.05759084276838067),
 ('weighted average shares', 0.05506712723626777),
 ('rate', 0.05476673957949416),
 ('selling general administrative', 0.054684764255307064),
 ('strong', 0.05436175411637299),
 ('flow', 0.053569348989459926),
 ('diluted net income', 0.05351204849585309),
 ('risks associated', 0.053482787218680494),
 ('achieved', 0.05309693361785794),
 ('expenses current assets', 0.05278734438315649),
 ('divestiture', 0.0

Finally, report how many of these appear in the financial sentiment dictionary used in requirement 2.

In [35]:
# Create sets of the top correlated words
top_positive_set = set(word for word, _ in top_positive_words)
top_negative_set = set(word for word, _ in top_negative_words)

# Find overlaps with the financial sentiment dictionary
negative_overlap = top_negative_set.intersection(neg)  # Overlapping negative words
positive_overlap = top_positive_set.intersection(pos)  # Overlapping positive words

# Print the counts of overlapping words
print(f"There are {len(negative_overlap)} overlapping words for negative")
print(f"There are {len(positive_overlap)} overlapping words for positive")

There are 1 overlapping words for negative
There are 2 overlapping words for positive
{'strong', 'achieved'}


### Requirement 5 (6046 only)

Fit an LDA model using `sklearn` with 50 topics. Set the `random_state` to 123, and leave everything else at default settings. You can use the `dtm` you generated earlier. I recommend using `fit_transform()` so you have the topic matrix ready to analyze correlations.

In [63]:
from sklearn.decomposition import LatentDirichletAllocation as LDA
lda = LDA(n_components=50, random_state=123)  # Create LDA model with 50 topics

# Fit the LDA model to the data and transform the document-term matrix to the topic matrix
topics = lda.fit_transform(dtm)

Next, generate correlations between `AbnReturn` and topic weights for each of the 2,000 earnings press releases (HINT: This will be very similar to the prior step, except you do not need to scale the topics matrix by total words).

In [83]:
# Convert AbnReturn Series to a NumPy array and reshape
AbnReturn = both['AbnReturn'].to_numpy().reshape(-1, 1)

# Combine AbnReturn with the topic weights
top_ret = np.hstack([AbnReturn, topics])  # Shape: (2000, 51)

# Compute the correlation matrix
corrs_top = np.corrcoef(top_ret, rowvar=False)  # Correlation matrix of shape (51, 51)
# Extract correlations of AbnReturn with topic weights
ret_top_corrs = corrs_top[0, 1:]  # First row, excluding the self-correlation
ret_top_corrs

[1.05708245e-05 1.05708245e-05 1.05708245e-05 4.11041950e-01
 1.05708245e-05 1.05708245e-05 1.05708245e-05 1.05708245e-05
 1.05708245e-05 1.05708245e-05 1.05708245e-05 1.05708245e-05
 1.05708245e-05 1.05708245e-05 1.05708245e-05 1.05708245e-05
 1.05708245e-05 1.05708245e-05 1.05708245e-05 1.05708245e-05
 1.05708245e-05 2.14176997e-02 3.39668085e-02 6.43049400e-02
 1.05708245e-05 1.05708245e-05 1.05708245e-05 1.05708245e-05
 1.05708245e-05 1.05708245e-05 3.78399891e-02 1.05708245e-05
 1.05708245e-05 1.74268611e-01 1.05708245e-05 1.35799554e-01
 1.05708245e-05 1.54677941e-02 3.13227425e-02 1.05708245e-05
 1.05708245e-05 1.05708245e-05 1.05708245e-05 1.05708245e-05
 7.43133048e-03 1.05708245e-05 1.05708245e-05 1.05708245e-05
 4.09661402e-02 2.57707489e-02]
1.0
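A useful sanity check for this step: `fit_transform` returns topic weights that sum to 1 within each document, so the covariances between returns and the topic weights must sum to zero across topics. The correlations therefore cannot all share the same sign. The sketch below demonstrates this on random toy data (the counts and returns are simulated, and only 3 topics are used to keep it fast):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Simulated data: 40 fake documents, 12 fake terms, fake returns
rng = np.random.default_rng(123)
toy_counts = rng.poisson(lam=2.0, size=(40, 12))
toy_returns = rng.normal(size=40)

toy_lda = LatentDirichletAllocation(n_components=3, random_state=123)
toy_topics = toy_lda.fit_transform(toy_counts)   # shape (40, 3)

# Correlation of returns with each topic's weights
toy_corrs = np.array([np.corrcoef(toy_returns, toy_topics[:, k])[0, 1]
                      for k in range(toy_topics.shape[1])])

print(toy_topics.sum(axis=1)[:5])   # each row sums to 1
print(toy_corrs)
```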


Now, identify the topics that correlate most positively and most negatively with returns. Report the 10 words most relevant to each of those topics. I've provided a function you can use to access those words.

In [75]:
# Extract the topic-word matrix
top_word = lda.components_  # Shape: (50, 2000)

# Verify vocab alignment

# Define function to get top words for a topic
def get_topic_words(topic, top_word, vocab, topn=10):
    top_words = top_word[topic, :].argsort()[-topn:][::-1].tolist()  # Top word indices
    return [vocab[i] for i in top_words]  # Map indices to vocab

# Find the most positively and negatively correlated topics
mostpos = np.argmax(ret_top_corrs)  # Topic index with highest positive correlation
mostneg = np.argmin(ret_top_corrs)  # Topic index with the most negative correlation

# Convert vocab dictionary to a sorted list
vocab_list = [word for word, index in sorted(vocab.items(), key=lambda item: item[1])]


# Retrieve the top 10 words for each topic
pos_words = get_topic_words(mostpos, top_word, vocab_list, topn=10)
neg_words = get_topic_words(mostneg, top_word, vocab_list, topn=10)

print("Most positively correlated topic:", mostpos)

print("Most negatively correlated topic:", mostneg)

Most positively correlated topic: 11
Most negatively correlated topic: 44
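The ranking logic is easier to follow on a tiny example. The sketch below uses a made-up 2-topic weight matrix in place of `lda.components_` and a hypothetical helper `top_words_for_topic`. It also normalizes each row so the weights can be read as per-topic word probabilities, which does not change the ranking within a topic:

```python
import numpy as np

# Hypothetical topic-word weights: 2 topics, 4 terms
toy_components = np.array([[0.1, 5.0, 2.0, 0.9],
                           [4.0, 0.2, 0.8, 3.0]])
toy_vocab = ["gaap", "loss", "net loss", "record"]

# Normalize each topic's row so its weights sum to 1
toy_word_probs = toy_components / toy_components.sum(axis=1, keepdims=True)

def top_words_for_topic(topic, weights, vocab, topn=2):
    """Return the topn highest-weighted words for one topic."""
    order = np.argsort(weights[topic])[::-1][:topn]   # largest weights first
    return [vocab[i] for i in order]

print(top_words_for_topic(0, toy_word_probs, toy_vocab))
print(top_words_for_topic(1, toy_word_probs, toy_vocab))
```

The vocabulary list must be ordered by column index, which is what sorting `vec.vocabulary_` by value (or calling `vec.get_feature_names_out()`) provides.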


In [76]:
print("Words for most positive topic:")
print(f"{'|'.join(get_topic_words(mostpos,lda.components_,vocab_list,topn=10))}")

Words for most positive topic:
continuing|continuing operations|loss|gaap|discontinued|income continuing|income continuing operations|discontinued operations|net income|income loss


In [77]:
print("Words for most negative topic:")
print(f"{'|'.join(get_topic_words(mostneg,lda.components_,vocab_list,topn=10))}")

Words for most negative topic:
loss|net loss|loss income|adjusted|sales|net loss income|impairment|gaap|million quarter|primarily


**Final Question:** 
Do you find the topic pattern intuitive? In your opinion, does the topic-based approach appear to better identify themes that should correlate with returns?