# NER and Sentiment
In this section we apply basic sentiment analysis to our data using a pre-trained DistilBERT model from the Flair library. We then use the organization labels captured through NER in the previous section to find the organizations with the highest and lowest average sentiment scores.




In [3]:
!pip install flair

Collecting flair
  Downloading flair-0.11.3-py3-none-any.whl (401 kB)
Collecting gdown==4.4.0
  Downloading gdown-4.4.0.tar.gz (14 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting more-itertools
  Downloading more_itertools-8.13.0-py3-none-any.whl (51 kB)
Collecting deprecated>=1.2.4
  Downloading Deprecated-1.2.13-py2.py3-none-any.whl (9.6 kB)
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ..

Building wheels for collected packages: gdown, mpld3, sqlitedict, langdetect, pptree, wikipedia-api, overrides
  Building wheel for gdown (pyproject.toml) ... done
  Created wheel for gdown: filename=gdown-4.4.0-py3-none-any.whl size=14759 sha256=79041606fd7eb8cc0520aa05cf8f9a3595417a81c2a4d1eb660947b15f9830cb
  Stored in directory: /Users/ankush.singal/Library/Caches/pip/wheels/7d/37/b6/b2a79c75e898c0b8e46ff255102602d7159a10d9af0d80641a
  Building wheel for mpld3 (setup.py) ... done
  Created wheel for mpld3: filename=mpld3-0.3-py3-none-any.whl size=116686 sha256=8e0d48567e6c6e79e8937e7748e75c2c9e89cfdb746e229d0f48f7d19219e233
  Stored in directory: /Users/ankush.singal/Library/Caches/pip/wheels/a6/f4/e6/e40ff9021f6b3854af70fa8ea004f5ab95672817462df08fed
  Building wheel for sqlitedict (setup.py) ... done
  Created wheel for sqlitedict: filename=sqlitedict-2.0.0-py3-none-any.whl size=15734 sha256=c254ef621526a39c3c73c06444596649d94d55c8ccc799b1cab1d

In [4]:
import pandas as pd
import flair

In [23]:
model = flair.models.TextClassifier.load('en-sentiment')


2022-07-07 16:21:20,417 loading file /Users/ankush.singal/.flair/models/sentiment-en-mix-distillbert_4.pt


In [24]:
def get_sentiment(text):
    # tokenize input text
    sentence = flair.data.Sentence(text)
    # make sentiment prediction
    model.predict(sentence)
    # extract sentiment direction and confidence (label and score) object
    sentiment = sentence.labels[0]
    return sentiment
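
The object returned by get_sentiment carries a direction (`value`, either POSITIVE or NEGATIVE) and a confidence (`score`). As a rough sketch of how the two can be folded into one signed number, the snippet below uses a hypothetical `StubLabel` stand-in so that it runs without downloading the model. Only the `value` and `score` attribute names are taken from the code in this section; `StubLabel`, `signed_score` and the example values are assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for the label object returned by get_sentiment;
# it only mimics the two attributes used later in this section.
StubLabel = namedtuple("StubLabel", ["value", "score"])


def signed_score(label):
    """Return +score for POSITIVE labels and -score for NEGATIVE ones."""
    return label.score if label.value == "POSITIVE" else -label.score


# Made-up example labels, not model output.
examples = [StubLabel("POSITIVE", 0.98), StubLabel("NEGATIVE", 0.60)]
signed = [signed_score(label) for label in examples]
print(signed)
```

A confident negative prediction therefore ends up close to -1 and a confident positive one close to +1.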

We now need to load our previously processed dataframe (which includes the organizations column) and apply the get_sentiment function to the selftext column. These sentiment scores will then be stored in a new sentiment column.

In [25]:
# load data
df = pd.read_csv('reddit_investing_ner.csv', sep='|')
df.head()

| | id | created_utc | subreddit | title | selftext | upvote_ratio | ups | downs | score | organizations |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | t3_vte51d | 1657184469 | investing | Daily General Discussion and Advice Thread - J... | Have a general question? Want to offer some c... | 1.0 | 1 | 0 | 1 | ['FAQ'] |
| 1 | t3_vt8mmp | 1657164103 | investing | Can I make my own index fund? | I've taken an interest to certain medical stoc... | 0.6 | 1 | 0 | 1 | [] |
| 2 | t3_vt59tx | 1657154251 | investing | Tool that combines the holdings of multiple ET... | I'm looking for a tool that would show concent... | 0.83 | 7 | 0 | 7 | [] |
| 3 | t3_vt44ns | 1657150930 | investing | Why doesn't the Fed just say fuck it and hike ... | \nIf a recession is coming why not do this, ta... | 0.71 | 113 | 0 | 113 | [] |
| 4 | t3_vt0kre | 1657141552 | investing | GameStop board approves stock split plan, shar... | [https://www.reuters.com/markets/us/gamestop-... | 0.88 | 973 | 0 | 973 | ['Reuters', 'GME', 'NFT', 'Mixer'] |


In [26]:
# get sentiment
df['sentiment'] = df['selftext'].apply(get_sentiment)
df.head()

| | id | created_utc | subreddit | title | selftext | upvote_ratio | ups | downs | score | organizations | sentiment |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | t3_vte51d | 1657184469 | investing | Daily General Discussion and Advice Thread - J... | Have a general question? Want to offer some c... | 1.0 | 1 | 0 | 1 | ['FAQ'] | Sentence: "Have a general question ? Want to o... |
| 1 | t3_vt8mmp | 1657164103 | investing | Can I make my own index fund? | I've taken an interest to certain medical stoc... | 0.6 | 1 | 0 | 1 | [] | Sentence: "I 've taken an interest to certain ... |
| 2 | t3_vt59tx | 1657154251 | investing | Tool that combines the holdings of multiple ET... | I'm looking for a tool that would show concent... | 0.83 | 7 | 0 | 7 | [] | Sentence: "I 'm looking for a tool that would ... |
| 3 | t3_vt44ns | 1657150930 | investing | Why doesn't the Fed just say fuck it and hike ... | \nIf a recession is coming why not do this, ta... | 0.71 | 113 | 0 | 113 | [] | Sentence: "If a recession is coming why not do... |
| 4 | t3_vt0kre | 1657141552 | investing | GameStop board approves stock split plan, shar... | [https://www.reuters.com/markets/us/gamestop-... | 0.88 | 973 | 0 | 973 | ['Reuters', 'GME', 'NFT', 'Mixer'] | Sentence: "[ https :// www.reuters.com / marke... |


Now we need to extract each organization alongside its sentiment score. We will then loop through each, tallying up a total sentiment score and count.

Before we do that, we need to convert each value in the organizations column back to a list. They are currently strings, because a CSV file cannot store Python lists and Pandas writes each list out as its string representation.

In [29]:
import ast

# parse each string such as "['Reuters', 'GME']" back into a Python list
df['organizations'] = df['organizations'].apply(ast.literal_eval)
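
The string-to-list round trip described above can be seen in a small standalone sketch. It assumes nothing beyond the standard library; the example list is copied from the organizations column shown earlier, and the variable names are illustrative only.

```python
import ast

# What ends up in the CSV for a list-valued cell: the list's string form.
stored = str(['Reuters', 'GME', 'NFT', 'Mixer'])

# Iterating over the raw string yields single characters, not organizations.
first_chars = list(stored)[:3]

# ast.literal_eval safely parses the string back into a Python list.
restored = ast.literal_eval(stored)

print(first_chars)
print(restored)
```

Without the conversion, a loop over the column would treat each bracket, quote and letter as its own "organization".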

In [30]:
# initialize sentiment dictionary
sentiment = {}

# loop through dataframe and extract org labels and sentiment scores into sentiment dictionary
for i, row in df.iterrows():
    # extract sentiment direction and score
    direction = row['sentiment'].value
    score = row['sentiment'].score
    # loop through each label in organizations column
    for org in row['organizations']:
        # check if org label exists in sentiment dictionary already
        if org not in sentiment.keys():
            # if it doesn't, initialize new entry in dictionary
            sentiment[org] = {'POSITIVE': [], 'NEGATIVE': []}
        # append positive/negative score to respective dictionary entry
        sentiment[org][direction].append(score)

In [31]:
sentiment['ARK']

{'POSITIVE': [], 'NEGATIVE': [0.5963033437728882]}
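
The grouping step can also be sketched on toy data without the dataframe or the model. This version uses `collections.defaultdict` in place of the explicit membership check; that is a stylistic alternative, not the approach used in the cell above, and the rows are made up for illustration.

```python
from collections import defaultdict

# Toy rows of (organizations, direction, score); the values are made up.
rows = [
    (['GME', 'Reuters'], 'NEGATIVE', 0.99),
    (['GME'], 'POSITIVE', 0.75),
    ([], 'NEGATIVE', 0.80),  # a post with no organizations adds nothing
]

# Each new organization starts with empty POSITIVE and NEGATIVE lists.
grouped = defaultdict(lambda: {'POSITIVE': [], 'NEGATIVE': []})

for orgs, direction, score in rows:
    for org in orgs:
        # a post's score is credited to every organization it mentions
        grouped[org][direction].append(score)

print(dict(grouped))
```

Note that a post's single sentiment score is attributed to every organization mentioned in it, which is the same simplification made in the loop above.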

In [32]:
# initialize sentiment list
avg_sentiment = []

# loop through each organization and its lists of scores
for org, scores in sentiment.items():
    # get number of positive and negative ratings
    freq = len(scores['POSITIVE']) + len(scores['NEGATIVE'])
    # total confidence in each direction (an empty list sums to 0.0);
    # the sentiment dictionary itself is left unchanged so the cell can be re-run
    positive = float(sum(scores['POSITIVE']))
    negative = float(sum(scores['NEGATIVE']))
    # net total: positive confidence minus negative confidence
    total = positive - negative
    # and the average score per mention
    avg = total / freq
    # add to sentiment list
    avg_sentiment.append({
        'entity': org,
        'positive': positive,
        'negative': negative,
        'frequency': freq,
        'score': avg
    })

In [33]:
sentiment_df = pd.DataFrame(avg_sentiment)
sentiment_df.head()

| | entity | positive | negative | frequency | score |
|---|---|---|---|---|---|
| 0 | FAQ | 0.0 | 53.961282 | 54 | -0.999283 |
| 1 | Reuters | 0.0 | 9.876138 | 10 | -0.987614 |
| 2 | GME | 0.0 | 2.990292 | 3 | -0.996764 |
| 3 | NFT | 0.0 | 0.999891 | 1 | -0.999891 |
| 4 | Mixer | 0.0 | 0.999891 | 1 | -0.999891 |
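
The score column is the positive total minus the negative total, divided by the frequency. A short check of that arithmetic against the FAQ and Reuters rows above; the helper name `average_sentiment` is introduced here for illustration only.

```python
def average_sentiment(positive_total, negative_total, frequency):
    """Net sentiment per mention: (positive - negative) / frequency."""
    return (positive_total - negative_total) / frequency


# Totals and frequencies copied from the FAQ and Reuters rows above.
faq = average_sentiment(0.0, 53.961282, 54)
reuters = average_sentiment(0.0, 9.876138, 10)

print(round(faq, 6), round(reuters, 6))
```

Because every individual score is a confidence between 0 and 1, the average always falls between -1 and +1.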


We can immediately see that many entities appear only once in our dataset, which pushes their score to one extreme or the other. Keeping only entities with a frequency greater than 3 removes many of these cases:

In [34]:
sentiment_df = sentiment_df[sentiment_df['frequency'] > 3]
sentiment_df

| | entity | positive | negative | frequency | score |
|---|---|---|---|---|---|
| 0 | FAQ | 0.0 | 53.961282 | 54 | -0.999283 |
| 1 | Reuters | 0.0 | 9.876138 | 10 | -0.987614 |
| 7 | Robinhood | 0.0 | 4.976206 | 5 | -0.995241 |
| 15 | Fed | 1.748395 | 42.858616 | 46 | -0.8937 |
| 20 | fed | 0.967102 | 10.963172 | 12 | -0.833006 |
| 26 | Nasdaq | 0.0 | 8.457225 | 9 | -0.939692 |
| 27 | NVDA | 0.982667 | 3.99473 | 5 | -0.602413 |
| 31 | IRS | 0.0 | 5.992541 | 6 | -0.998757 |
| 32 | Apple | 0.0 | 8.210319 | 9 | -0.912258 |
| 36 | USD | 0.782329 | 9.413846 | 11 | -0.784683 |


This gives us more relevant information. A few entries, such as Fed and Treasury, can be removed with the BLACKLIST covered in earlier sections, but the list already looks much better than before. We can sort by score to find the entities with the highest average sentiment:

In [35]:
sentiment_df.sort_values('score', ascending=False).head(10)

| | entity | positive | negative | frequency | score |
|---|---|---|---|---|---|
| 132 | Disney | 1.530298 | 1.652645 | 4 | -0.030587 |
| 139 | QQQ | 2.940655 | 4.983747 | 8 | -0.255387 |
| 47 | Amazon | 5.617441 | 10.275467 | 18 | -0.258779 |
| 66 | ETFs | 3.673527 | 8.930993 | 13 | -0.40442 |
| 533 | YoY | 0.998812 | 2.80804 | 4 | -0.452307 |
| 50 | NASDAQ | 1.531971 | 4.723426 | 7 | -0.455922 |
| 55 | EPS | 1.567437 | 5.782712 | 8 | -0.526909 |
| 138 | Atlanta Fed | 0.837226 | 2.976176 | 4 | -0.534737 |
| 338 | AMZN | 0.733974 | 2.999198 | 4 | -0.566306 |
| 62 | VOO | 3.45265 | 15.129776 | 20 | -0.583856 |
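
The frequency filter, the blacklist and the sort can be combined in one pass over records shaped like the entries of avg_sentiment. This is only a sketch: the `BLACKLIST` contents here are a hypothetical placeholder for the one defined in earlier sections, the case-insensitive comparison is an assumption added so that both "Fed" and "fed" are caught, and the records are a handful of rows copied from the tables above.

```python
# Hypothetical blacklist; the real one is defined in an earlier section.
BLACKLIST = {'fed', 'treasury'}
MIN_FREQUENCY = 3  # keep entities mentioned more than this many times

# A few records copied from the tables above, in the shape of avg_sentiment.
records = [
    {'entity': 'Fed', 'frequency': 46, 'score': -0.8937},
    {'entity': 'fed', 'frequency': 12, 'score': -0.833006},
    {'entity': 'NFT', 'frequency': 1, 'score': -0.999891},
    {'entity': 'Amazon', 'frequency': 18, 'score': -0.258779},
    {'entity': 'Disney', 'frequency': 4, 'score': -0.030587},
]

# Drop rare entities and anything whose lowercased name is blacklisted.
kept = [
    r for r in records
    if r['frequency'] > MIN_FREQUENCY and r['entity'].lower() not in BLACKLIST
]

# Highest average sentiment first.
ranked = sorted(kept, key=lambda r: r['score'], reverse=True)
print([r['entity'] for r in ranked])
```

Reversing the sort order (`reverse=False`) would list the entities with the lowest average sentiment first.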
