# Demo 5 - Measuring Sentiment
In this demonstration, we'll learn how to construct a measure of sentiment from a larger corpus. Note that we are starting with cleaned textual data so we can focus on the actual measurement of sentiment, but cleaning the data is a vital step in any NLP setting.

To measure sentiment, we'll first rely on two different sentiment dictionaries. The first is a basic, general-purpose dictionary and the second is a dictionary developed specifically for financial sentiment. We'll conclude with a pattern-based method that is available through a package called `textblob`. We'll go through these steps:

1. Load the dataset and generate a document-term matrix with `sklearn`.
2. Measure sentiment using the Harvard General Inquirer Dictionary
3. Measure sentiment using the Loughran-McDonald Dictionary
4. Measure sentiment using `textblob`
5. Evaluate correlations among measures

### Step 1: Load and Setup Data
I've provided you with a small set of MD&A disclosures from a random sample of 10-K filings. For ease of use, these are in a compressed CSV file, so we can load them directly with pandas:

In [None]:
import pandas as pd

folder = "./"
df = pd.read_csv(f"{folder}MDAs201.csv.gz")
df

Let's look at one specific disclosure so we can get a sense of general structure:

In [None]:
print(df.iloc[0]['mda'][:1000])

These tend to be fairly long disclosures, suggesting we'll have a large volume of text we can use for our analyses.

To generate the document-term matrix, we will use `sklearn`'s __[`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)__. Note that one can generate TF-IDF weighted counts with `TfidfVectorizer`, which is used in exactly the same way.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(max_features=1500,min_df=5)
dtm = vec.fit_transform(df['mda'])

I left most parameters at their defaults. The two I changed, `max_features` and `min_df`, control the size and sparsity of the vocabulary. We can inspect the shape of the `dtm` like this:

In [None]:
dtm.shape

We have 141 rows (# of documents) and 1500 unique words. If we want to access the vocabulary, we can use the `vocabulary_` attribute:

In [None]:
vec.vocabulary_

This vocabulary allows us to find the index of certain words, but often it's more useful to go the other way. We could flip this dictionary (`{v:k for k,v in vec.vocabulary_.items()}`), but I usually just grab the vocabulary in `np.array` form:

In [None]:
vocab = vec.get_feature_names_out()
vocab

This highlights that we have a lot of numbers here. Numbers won't tell us much in terms of qualitative sentiment, so let's restrict our vocabulary to words only. We can do this with `token_pattern`, which is a parameter of the vectorizer:

In [None]:
# Require two or more letters. The vectorizer lowercases text by default,
# so the pattern can leave out capital letters.
vec = CountVectorizer(max_features=1500,min_df=5,token_pattern=r'[a-z]{2,}')
dtm = vec.fit_transform(df['mda'])
vocab = vec.get_feature_names_out()
vocab
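
To see what a token pattern does without re-fitting the vectorizer, we can apply the same regular expression with Python's `re` module. This is only a minimal sketch on a made-up sentence (`sample_text` is an assumption, not part of the dataset). It lowercases the text first to mirror the vectorizer's default behavior:

```python
import re

# Hypothetical sentence for illustration only (not taken from the MD&A data)
sample_text = "Revenue increased 12% in 2019 to $4.5 million"

# Two or more consecutive lowercase letters; digits and symbols never match
pattern = re.compile(r"[a-z]{2,}")

tokens = pattern.findall(sample_text.lower())
print(tokens)
```

The numbers, the percent sign, and the dollar amount all drop out, which is the behavior we want for sentiment word counts.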

This looks better. Now, let's look at which words are most common in our corpus. To do so, recall that we can sum *by column*, or `axis=0`:

In [None]:
dtm.sum(axis=0)

How do we figure out which word has the highest word count? We first need to convert the result, which is a `matrix`, into a flat numpy array. Then we can look at some statistics:

In [None]:
import numpy as np
wcs = np.asarray(dtm.sum(axis=0)).flatten()
wcs.max()

One word appears almost 80,000 times! What word is it? We can use a method called __[`argmax()`](https://numpy.org/doc/stable/reference/generated/numpy.argmax.html)__ to figure that out:

In [None]:
wcs.argmax()

Which word is 1360?

In [None]:
vocab[wcs.argmax()]

We did not address **stop words**. Do we have a lot of stop words in our data? We can use `argsort()`, which behaves similarly to `argmax()`, to figure this out. For instance, suppose we wanted to look at the 10 most frequently occurring words:

In [None]:
vocab[wcs.argsort()[-10:][::-1]] # the [::-1] selects all, start from last to first
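
If the chained indexing is hard to read, it can help to run it on a tiny example. The sketch below uses made-up counts and a made-up vocabulary (`toy_vocab` and `toy_counts` are illustrative names only):

```python
import numpy as np

# Toy vocabulary and word counts (made up for illustration)
toy_vocab = np.array(["cash", "the", "loss", "growth"])
toy_counts = np.array([12, 40, 3, 7])

order = toy_counts.argsort()   # indices from the smallest count to the largest
top2 = order[-2:][::-1]        # last two indices, reversed into descending order

print(order)
print(toy_vocab[top2])
```

`argsort()` returns positions rather than values, which is why we can pass the result straight into the vocabulary array.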

Fortunately, it's easy to remove stop words. We could use `sklearn`'s built-in English stop word list, but its own documentation notes several known issues with that list and suggests considering an alternative, so we'll use the list from `nltk`. Let's load these stop words and then re-vectorize:

In [None]:
# If the stop word list is not installed yet, run: import nltk; nltk.download('stopwords')
from nltk.corpus import stopwords
stops = stopwords.words('english')
stops[:10]

Notice that the stop words include contractions. We'll adjust our token pattern to allow those to match as well:

In [None]:
# One or more letters, an optional apostrophe, then one or more letters (so "don't" stays one token)
vec = CountVectorizer(max_features=1500,min_df=5,token_pattern=r"[a-z]+'?[a-z]+",stop_words=stops)
dtm = vec.fit_transform(df['mda'])
vocab = vec.get_feature_names_out()
wcs = np.asarray(dtm.sum(axis=0)).flatten()
vocab[wcs.argsort()[-10:]]
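
Whether a token pattern keeps a contraction intact depends on how many letters it allows before the apostrophe. Here is a quick check with `re`, comparing two candidate patterns on a made-up sentence (the names `narrow` and `wide` are just labels for this sketch):

```python
import re

# Hypothetical sentence for illustration only
text = "we don't believe it's likely"

narrow = re.compile(r"[a-z]'?[a-z]+")   # exactly one letter may precede the apostrophe
wide = re.compile(r"[a-z]+'?[a-z]+")    # any number of letters may precede it

narrow_tokens = narrow.findall(text)
wide_tokens = wide.findall(text)

print(narrow_tokens)
print(wide_tokens)
```

The narrow pattern splits "don't" into "don" and drops the stray "t", while the wide pattern keeps "don't" and "it's" whole, so they can be matched against stop words that contain apostrophes.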

Much better! Let's make one more adjustment to help deal with some of these other very common words. We could explicitly identify them and add them to our stop word list, or we can set a threshold. Let's say we only want words that appear in no more than 50% of documents:

In [None]:
vec = CountVectorizer(max_features=1500,min_df=5,token_pattern=r"[a-z]+'?[a-z]+",stop_words=stops,max_df=0.50)
dtm = vec.fit_transform(df['mda'])
vocab = vec.get_feature_names_out()
wcs = np.asarray(dtm.sum(axis=0)).flatten()
vocab[wcs.argsort()[-10:]]
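
To make the two thresholds concrete, here is a rough pure-Python sketch of document-frequency filtering on a toy corpus. It assumes the documented `sklearn` semantics (an integer `min_df` is a document count, a float `max_df` is a share of documents, and terms outside those bounds are ignored). The corpus is made up:

```python
# Toy corpus: each document is already tokenized (made up for illustration)
docs = [
    ["company", "revenue", "growth"],
    ["company", "revenue", "loss"],
    ["company", "litigation", "loss"],
    ["company", "growth"],
]
n_docs = len(docs)

# Document frequency: the number of documents that contain each term at least once
doc_freq = {}
for tokens in docs:
    for term in set(tokens):
        doc_freq[term] = doc_freq.get(term, 0) + 1

min_df = 2      # integer: minimum number of documents
max_df = 0.50   # float: maximum share of documents

kept = sorted(t for t, f in doc_freq.items() if min_df <= f <= max_df * n_docs)
print(kept)
```

"company" appears in every document and is dropped by `max_df`, while "litigation" appears only once and is dropped by `min_df`.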

### Step 2: Measure sentiment using Harvard dictionary
Let's go ahead and load the dictionary, which is in the file called "inquirerbasic.xls":

In [None]:
hvd = pd.read_excel(f"{folder}inquirerbasic.xls")
hvd

How many unique words are in the data? How about textual attributes?

To grab words in a given list, we simply need to identify where a given column is not `NaN`, and then keep the `Entry` for those rows. For instance, which words are deemed "Hostile"?

In [None]:
hvd.loc[hvd['Hostile'].notnull(),'Entry'].str.lower()

This short list illustrates one issue with the Inquirer dictionary that we'll need to address. Some words have multiple meanings or word forms, and each one gets its own entry. To address this, we'll remove the "#n" suffix, where *n* is some number. We can do this with `split`, and then we'll remove duplicates:

In [None]:
hvd.loc[hvd['Hostile'].notnull(),'Entry'].str.lower().str.split("#").str[0].unique()

Let's go ahead and create our list of positive and negative words. We'll use `Positiv` and `Negativ`:

In [None]:
positives = set(hvd.loc[hvd['Positiv'].notnull(),'Entry'].str.lower().str.split("#").str[0].unique())
negatives = set(hvd.loc[hvd['Negativ'].notnull(),'Entry'].str.lower().str.split("#").str[0].unique())
print(f"There are {len(positives)} positive words and {len(negatives)} negative words.")

Now, we can very easily generate a measure of sentiment from our `dtm` by subsetting on the columns that appear in one of these lists. Recall we have the `vec.vocabulary_` dictionary. We can use that to generate a list of indices:

In [None]:
posidx = [v for k,v in vec.vocabulary_.items() if k in positives]
negidx = [v for k,v in vec.vocabulary_.items() if k in negatives]

Now, we select the columns from `dtm`. Note that indexing a sparse matrix (like a numpy array) works much like `iloc`: we identify rows by position, then columns. In this case, we want all rows (`:`), and the columns specified by `posidx`:

In [None]:
dtm[:,posidx]

To count the positive words in each document, we sum across columns (`axis=1`), which gives one total per row. We need to do some `numpy` gymnastics to get the result into a flat array.

In [None]:
np.asarray(dtm[:,posidx].sum(axis=1)).flatten()

We can now directly add this to our original dataset.

In [None]:
df['pos_words'] = np.asarray(dtm[:,posidx].sum(axis=1)).flatten()
df['neg_words'] = np.asarray(dtm[:,negidx].sum(axis=1)).flatten()
df['tot_words'] = np.asarray(dtm.sum(axis=1)).flatten()
df[['pos_words','neg_words','tot_words']].describe()
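
If the `axis` argument is hard to keep straight, a tiny example can help. In this sketch a small dense array stands in for the sparse `dtm` (so no `np.asarray(...).flatten()` is needed), and the vocabulary and positive word list are made up:

```python
import numpy as np

# Toy document-term matrix: 3 documents (rows) by 4 words (columns), made up
toy_vocab = {"gain": 0, "loss": 1, "strong": 2, "report": 3}
toy_dtm = np.array([
    [2, 0, 1, 5],
    [0, 3, 0, 4],
    [1, 1, 1, 2],
])

toy_positives = {"gain", "strong"}   # hypothetical positive word list
pos_cols = [idx for word, idx in toy_vocab.items() if word in toy_positives]

per_word = toy_dtm[:, pos_cols].sum(axis=0)   # one total per positive word
per_doc = toy_dtm[:, pos_cols].sum(axis=1)    # one total per document

print(per_word)
print(per_doc)
```

The per-document totals are the ones that line up with the rows of the dataframe, which is why the columns added above use `axis=1`.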

Before we move on, I'll highlight some limitations of this approach:
1. We did not consider alternative word forms. This dictionary is pretty comprehensive, but it is possible that a negative word appears in one form and not in others (e.g., singular vs. plural). To address this, we could lemmatize our data, or use word stems (though this would be noisy).
2. We do not consider *negation*, which can invert the sentiment of a given word. If you wished to consider negation, you would need to handle it before generating the document-term matrix.
3. We limited the size of our `dtm`, which could leave out some relatively rare words. In practice, this shouldn't have a large influence, but it's worth considering. For instance, let's compare the average positive word count here to one that uses all words:

In [None]:
vec2 = CountVectorizer(min_df=1,token_pattern=r"[a-z]+'?[a-z]+",stop_words=stops,max_df=0.50)
dtm2 = vec2.fit_transform(df['mda'])
vocab2 = vec2.get_feature_names_out()
posidx2 = [v for k,v in vec2.vocabulary_.items() if k in positives]
negidx2 = [v for k,v in vec2.vocabulary_.items() if k in negatives]
np.asarray(dtm2[:,posidx2].sum(axis=1)).flatten().mean()
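
On the second limitation above (negation), one simple option is to glue a negator to the word that follows it before vectorizing, so that "not favorable" no longer counts as "favorable". The sketch below is only one rough way to do this. The negator list and the `mark_negation` helper are my own assumptions for illustration, not part of any package:

```python
import re

# Hypothetical negators; a real application would need a more careful list
NEGATORS = r"(?:not|no|never)"

def mark_negation(text):
    """Join a negator to the next word, e.g. 'not profitable' -> 'not_profitable'."""
    return re.sub(rf"\b({NEGATORS})\s+([a-z]+)", r"\1_\2", text.lower())

marked = mark_negation("Results were not favorable and demand was strong")
print(marked)
```

Note that the vectorizer's `token_pattern` would also need to allow the underscore for the merged token to survive tokenization.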

### Step 3: Financial Sentiment

To measure financial sentiment, we will use the Loughran and McDonald dictionary. The portion of the dictionary we will use was described in their __[2011 paper](https://onlinelibrary.wiley.com/doi/10.1111/j.1540-6261.2010.01625.x)__. The data is in the Excel workbook I've provided, titled "LoughranMcDonald_SentimentWordLists_2018.xlsx".

Note that in this workbook, each tab is a different set of words, and there are no headers. We can adjust our import to deal with this:

In [None]:
lmpos = pd.read_excel(f'{folder}LoughranMcDonald_SentimentWordLists_2018.xlsx',
                      sheet_name='Positive',header=None,names=['word'])
lmpos

In [None]:
lmneg = pd.read_excel(f'{folder}LoughranMcDonald_SentimentWordLists_2018.xlsx',
                      sheet_name='Negative',header=None,names=['word'])
lmneg

Next, let's generate the same type of indices reference we did earlier:

In [None]:
lmposidx = [v for k,v in vec.vocabulary_.items() if k in lmpos['word'].values]
lmnegidx = [v for k,v in vec.vocabulary_.items() if k in lmneg['word'].values]

How many positive words appear at least once in the corpus?

In [None]:
len(lmposidx)

Zero?!

We didn't adjust for case:

In [None]:
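# Lowercase each word list once and store it as a set for fast lookups
lmpos_set = set(lmpos['word'].str.lower())
lmneg_set = set(lmneg['word'].str.lower())
lmposidx = [v for k,v in vec.vocabulary_.items() if k in lmpos_set]
lmnegidx = [v for k,v in vec.vocabulary_.items() if k in lmneg_set]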
len(lmposidx)

Now we can generate word counts for these two measures:

In [None]:
df['lmpos_words'] = np.asarray(dtm[:,lmposidx].sum(axis=1)).flatten()
df['lmneg_words'] = np.asarray(dtm[:,lmnegidx].sum(axis=1)).flatten()
df[['lmpos_words','lmneg_words']].describe()

### Step 4: Pattern-based sentiment
Our final measure of sentiment will be based on a pattern-based approach, which combines specific patterns of text with dictionary weights to generate sentiment. Note that these pattern-based approaches require the original text since they examine how words appear together.

To generate these measures, we'll use `textblob`, a popular NLP package that allows access to an array of pre-built methods in a very simple API. I'll illustrate a few things on a single example:

In [None]:
from textblob import TextBlob
sample = df['mda'][0]
blob = TextBlob(sample)

We can easily look at words or sentences:

In [None]:
print(f"There are {len(blob.tokens)} tokens in this text.")
print(f"There are {len(blob.sentences)} sentences in this text.")

You can also identify specific attributes, like noun phrases:

In [None]:
list(blob.noun_phrases)[:10] # avoid naming a loop variable `np`, which would shadow numpy

We are interested in *sentiment*, though. This is available as an attribute:

In [None]:
blob.sentiment

If we want polarity, we can access that directly or within "sentiment":

In [None]:
print(blob.sentiment.polarity)
print(blob.polarity)

Note that this polarity measure is meant to be applied to shorter spans of text, so I will usually generate polarity individually for all sentences in the data, and compute the average. Let's set up a function to do this:

In [None]:
def measure_polarity(txt):
    blob = TextBlob(txt)
    return np.mean([s.polarity for s in blob.sentences])

measure_polarity(sample)

Finally, we can generate this measure as another column in our dataframe using `apply()`:

In [None]:
df['tb_polarity'] = df['mda'].apply(measure_polarity)
df['tb_polarity'].describe()

### Step 5: Evaluate Correlations
The final step in this demo is to evaluate the correlations among sentiment measures, and then we'll do one other exercise with correlations.

The first step is relatively simple, but we first have to define our measures of general and financial sentiment. We'll use the formula (positive - negative) / total:

In [None]:
df['gen_sent'] = df.eval('(pos_words - neg_words)/tot_words')
df['lm_sent'] = df.eval('(lmpos_words - lmneg_words)/tot_words')
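
As a quick arithmetic check on the formula, here is a small stand-alone sketch. The `net_sentiment` helper is hypothetical (the demo itself uses `df.eval`), and the explicit guard for a zero total is my own addition:

```python
import math

def net_sentiment(pos, neg, total):
    """(positive - negative) / total, or NaN when the document has no counted words."""
    if total == 0:
        return math.nan
    return (pos - neg) / total

score = net_sentiment(30, 45, 1500)   # more negative than positive words
empty = net_sentiment(0, 0, 0)        # document with no counted words

print(score)
print(empty)
```

A document with 30 positive and 45 negative words out of 1,500 scores -0.01, so the measure is negative whenever negative words outnumber positive ones.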

Now, we can examine overall correlations:

In [None]:
df[['gen_sent','lm_sent','tb_polarity']].corr()

That's an interesting pattern. The sentiment based on the LM dictionary exhibits a *negative* correlation with the polarity-based sentiment. 

Before we conclude, let's see which words correlate most strongly with tb_polarity. We first need to generate a DTM that's deflated by the total word count (so each word is represented as a proportion of words), and then we'll examine correlations.

Scaling the DTM is relatively straightforward with a little `numpy` manipulation. We use __[`reshape()`](https://numpy.org/doc/stable/reference/generated/numpy.reshape.html#numpy.reshape)__ to turn the total word counts into a single column, so that each row of the DTM is divided by its own document's total:

In [None]:
dtm_scaled = dtm.toarray() / df['tot_words'].values.reshape(-1,1)
dtm_scaled
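
To see why the `reshape(-1,1)` matters, here is a minimal sketch on a made-up 2-by-3 matrix. Turning the row totals into a column lets numpy broadcast the division across each row:

```python
import numpy as np

# Toy counts: 2 documents (rows) by 3 words (columns), made up for illustration
toy_dtm = np.array([
    [2, 0, 2],
    [1, 3, 0],
])

row_totals = toy_dtm.sum(axis=1)               # shape (2,): one total per document
shares = toy_dtm / row_totals.reshape(-1, 1)   # shape (2, 1) broadcasts across the columns

print(shares)
print(shares.sum(axis=1))
```

Each row of `shares` sums to 1, confirming that every word is now expressed as a proportion of its document's words.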

Next, I'm going to correlate each column of this dtm with the measure of polarity in the dataframe. To do so, we'll combine the polarity with this dtm using numpy:

In [None]:
dtm_pol = np.hstack([df['tb_polarity'].values.reshape(-1,1),dtm_scaled])
# replace missing values (NaN) with 0 so they don't propagate into the correlations
dtm_pol = np.nan_to_num(dtm_pol)

Compute the correlations:

In [None]:
corrs = np.corrcoef(dtm_pol,rowvar=False)

In [None]:
corrs
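
The layout of this matrix is easier to see on a small example. In the sketch below (all numbers made up), column 0 plays the role of the polarity measure and the other two columns play the role of word shares:

```python
import numpy as np

# Toy data: column 0 is the target; columns 1 and 2 are word shares (made up)
data = np.array([
    [0.1, 1.0, 4.0],
    [0.2, 2.0, 3.0],
    [0.3, 3.0, 2.0],
    [0.4, 4.0, 1.0],
])

# rowvar=False treats each column as a variable and each row as an observation
toy_corrs = np.corrcoef(data, rowvar=False)

print(toy_corrs.shape)
print(toy_corrs[0].round(2))
```

With three columns we get a 3-by-3 matrix, and the first row holds the correlation of the target with itself (always 1) followed by its correlation with each remaining column.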

Now examine which words correlate most strongly. To do this, we can focus on the first row (or column) of this matrix, which holds the correlation of `tb_polarity` with every word. Let's use `argsort` on that row. We can then grab the top 5 and bottom 5 for our most positive and most negative correlations:

In [None]:
top5 = corrs[0,1:].argsort()[-5:] # start at 1 to skip the top left corner
top5

In [None]:
bottom5 = corrs[0,1:].argsort()[:5] # start at 1 to skip the top left corner
bottom5

Let's now print each set of words:

In [None]:
print("5 words with most positive correlation:\n")
print("|".join(vocab[top5]))

In [None]:
print("5 words with most negative correlation:\n")
print("|".join(vocab[bottom5]))

If we want to see what these correlations are, we can look them up in the original matrix, but this time we have to add 1 to each index, since `top5` and `bottom5` were computed on a slice that left out the first column:

In [None]:
corrs[0,top5+1]

In [None]:
corrs[0,bottom5+1]
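
The index offset is easy to get wrong, so here is a small sketch with made-up correlations and a made-up vocabulary. Positions found on the sliced row index the vocabulary directly, but need `+ 1` when going back to the full row:

```python
import numpy as np

# Toy first row of a correlation matrix; entry 0 is the target with itself (made up)
first_row = np.array([1.0, 0.2, -0.6, 0.9, 0.1])
toy_vocab = np.array(["stable", "impairment", "strong", "lease"])

word_corrs = first_row[1:]          # drop the target itself
best = word_corrs.argsort()[-1]     # position within the slice, and within toy_vocab
worst = word_corrs.argsort()[0]

print(toy_vocab[best], first_row[best + 1])
print(toy_vocab[worst], first_row[worst + 1])
```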