In [1]:
import pandas as pd
import spacy

In [2]:
nlp = spacy.load('en_core_web_sm')

The NER code will be reformatted into a function that takes a string of text and returns a list of the organizations mentioned in it:

In [3]:
def get_orgs(text):
    # process the text with our spaCy model to get named entities
    doc = nlp(text)
    # initialize list to store identified organizations
    org_list = []
    # loop through the identified entities and append ORG entities to org_list
    for entity in doc.ents:
        if entity.label_ == 'ORG':
            org_list.append(entity.text)
    # if organization is identified more than once it will appear multiple times in list
    # we use set() to remove duplicates then convert back to list
    org_list = list(set(org_list))
    return org_list
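
One side effect of `list(set(...))` is that the returned list is not guaranteed to follow the order in which organizations were mentioned. If that order matters, a minimal sketch of an alternative is to deduplicate with `dict.fromkeys`, which keeps insertion order in Python 3.7+. The helper name `dedupe_keep_order` and the sample list are hypothetical and not part of the original notebook.

```python
def dedupe_keep_order(items):
    """Remove duplicates while keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(items))

# hypothetical entity texts, in the order they might appear in a post
sample = ['Reuters', 'GME', 'Reuters', 'NFT', 'GME']
deduped = dedupe_keep_order(sample)
print(deduped)
```

Inside `get_orgs`, this would stand in for the `list(set(org_list))` line if mention order is worth keeping.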


# Applying NER
All we need to do now is load in the /r/investing data and apply the get_orgs function to our text column to create a new organizations column.

Load the data and view the top five rows with df.head():

In [5]:
df = pd.read_csv('reddit_investing.csv', sep='|')
df.head()

Unnamed: 0,id,created_utc,subreddit,title,selftext,upvote_ratio,ups,downs,score
0,t3_vte51d,1657184469,investing,Daily General Discussion and Advice Thread - J...,Have a general question? Want to offer some c...,1.0,1,0,1
1,t3_vt8mmp,1657164103,investing,Can I make my own index fund?,I've taken an interest to certain medical stoc...,0.6,1,0,1
2,t3_vt59tx,1657154251,investing,Tool that combines the holdings of multiple ET...,I'm looking for a tool that would show concent...,0.83,7,0,7
3,t3_vt44ns,1657150930,investing,Why doesn't the Fed just say fuck it and hike ...,"\nIf a recession is coming why not do this, ta...",0.71,113,0,113
4,t3_vt0kre,1657141552,investing,"GameStop board approves stock split plan, shar...",[https://www.reuters.com/markets/us/gamestop-...,0.88,973,0,973


In [6]:
df['organizations'] = df['selftext'].apply(get_orgs)
df.head()

Unnamed: 0,id,created_utc,subreddit,title,selftext,upvote_ratio,ups,downs,score,organizations
0,t3_vte51d,1657184469,investing,Daily General Discussion and Advice Thread - J...,Have a general question? Want to offer some c...,1.0,1,0,1,[FAQ]
1,t3_vt8mmp,1657164103,investing,Can I make my own index fund?,I've taken an interest to certain medical stoc...,0.6,1,0,1,[]
2,t3_vt59tx,1657154251,investing,Tool that combines the holdings of multiple ET...,I'm looking for a tool that would show concent...,0.83,7,0,7,[ETF]
3,t3_vt44ns,1657150930,investing,Why doesn't the Fed just say fuck it and hike ...,"\nIf a recession is coming why not do this, ta...",0.71,113,0,113,[]
4,t3_vt0kre,1657141552,investing,"GameStop board approves stock split plan, shar...",[https://www.reuters.com/markets/us/gamestop-...,0.88,973,0,973,"[Reuters, GME, NFT, Mixer]"


In [7]:
# merge organizations column into one big list
orgs = df['organizations'].to_list()
orgs = [org for sublist in orgs for org in sublist]
orgs[:10]

['FAQ',
 'ETF',
 'Reuters',
 'GME',
 'NFT',
 'Mixer',
 'FOMC',
 'Order',
 'Robinhood',
 'Inflation Protected']
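
The nested list comprehension above flattens a list of lists into a single list. As a sketch of an equivalent approach, the standard library's `itertools.chain.from_iterable` does the same thing. The `nested` list below is a hand-written stand-in for the first five rows of the organizations column, not data loaded from the CSV.

```python
from itertools import chain

# stand-in for df['organizations'].to_list() on the first five rows
nested = [['FAQ'], [], ['ETF'], [], ['Reuters', 'GME', 'NFT', 'Mixer']]

# chain.from_iterable walks each sublist in turn; empty sublists add nothing
flat = list(chain.from_iterable(nested))
print(flat)
```

Both forms produce the same result, so the choice between them is mostly a matter of readability.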

In [8]:
from collections import Counter

In [9]:
# create dictionary of organization mention frequency
org_freq = Counter(orgs)

We now have a Counter dictionary containing all of our organization labels as keys, and their mention frequency as values. The most_common(n) method allows us to view the n most frequently mentioned organizations:

In [10]:
org_freq.most_common(10)

[('FAQ', 54),
 ('Fed', 46),
 ('ETF', 44),
 ('VOO', 20),
 ('DCA', 20),
 ('Amazon', 18),
 ('Fidelity', 15),
 ('the Federal Reserve', 15),
 ('Tesla', 14),
 ('ETFs', 13)]

Clearly, the data needs further pruning to remove non-organization labels such as EV (electric vehicle), COVID, ETF (exchange-traded fund), NYSE, and SPAC (special purpose acquisition company). Depending on the use case, it may be useful to keep a few of these, or to remove others such as SEC and FDA.

To do this, we create a custom blacklist and check against it in our get_orgs function like so:

In [11]:
BLACKLIST = ['ev', 'covid', 'etf', 'nyse', 'sec', 'spac', 'fda']

def get_orgs(text):
    doc = nlp(text)
    org_list = []
    for entity in doc.ents:
        # here we modify the original code to skip any entity whose text
        # matches one of our 'blacklisted' organizations; lowercasing with
        # .lower() lets 'nyse' match both 'nyse' and 'NYSE'
        if entity.label_ == 'ORG' and entity.text.lower() not in BLACKLIST:
            org_list.append(entity.text)
    # if organization is identified more than once it will appear multiple times in list
    # we use set() to remove duplicates then convert back to list
    org_list = list(set(org_list))
    return org_list
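
Note that an exact lowercase match lets simple plurals through: 'ETFs' lowercases to 'etfs', which is not in the blacklist. A rough sketch of one way to handle this is shown below, assuming a set-based blacklist and a naive plural check. The names `BLACKLIST_SET` and `is_blacklisted` are hypothetical, and the plural rule only strips a single trailing 's'.

```python
# a frozenset gives constant-time membership checks and cannot be modified by accident
BLACKLIST_SET = frozenset(['ev', 'covid', 'etf', 'nyse', 'sec', 'spac', 'fda'])

def is_blacklisted(entity_text, blacklist=BLACKLIST_SET):
    """Return True if the entity text, or its naive singular form, is blacklisted."""
    key = entity_text.strip().lower()
    if key in blacklist:
        return True
    # also catch simple plurals such as 'ETFs' -> 'etf'
    return key.endswith('s') and key[:-1] in blacklist

candidates = ['ETF', 'ETFs', 'Reuters', 'nyse', 'GME']
kept = [c for c in candidates if not is_blacklisted(c)]
print(kept)
```

Because the plural form is only checked for membership, names that happen to end in 's' (such as 'Reuters') are kept unless their stripped form is itself blacklisted.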

In [12]:
df['organizations'] = df['selftext'].apply(get_orgs)
df.head()

Unnamed: 0,id,created_utc,subreddit,title,selftext,upvote_ratio,ups,downs,score,organizations
0,t3_vte51d,1657184469,investing,Daily General Discussion and Advice Thread - J...,Have a general question? Want to offer some c...,1.0,1,0,1,[FAQ]
1,t3_vt8mmp,1657164103,investing,Can I make my own index fund?,I've taken an interest to certain medical stoc...,0.6,1,0,1,[]
2,t3_vt59tx,1657154251,investing,Tool that combines the holdings of multiple ET...,I'm looking for a tool that would show concent...,0.83,7,0,7,[]
3,t3_vt44ns,1657150930,investing,Why doesn't the Fed just say fuck it and hike ...,"\nIf a recession is coming why not do this, ta...",0.71,113,0,113,[]
4,t3_vt0kre,1657141552,investing,"GameStop board approves stock split plan, shar...",[https://www.reuters.com/markets/us/gamestop-...,0.88,973,0,973,"[Reuters, GME, NFT, Mixer]"
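
The blacklist removes unwanted labels, but it does not merge different labels that refer to the same organization. In the earlier frequency output, 'Fed' and 'the Federal Reserve' are counted separately. One possible approach, sketched here under the assumption that a hand-maintained alias map is acceptable, is to fold variants into a canonical label before ranking. `ALIASES`, `merge_aliases`, and `sample_freq` are hypothetical names; the counts are copied from the `most_common(10)` output above.

```python
from collections import Counter

# hypothetical alias map: variant label -> canonical label
ALIASES = {'the Federal Reserve': 'Fed'}

def merge_aliases(freq, aliases=ALIASES):
    """Return a new Counter with alias counts added to their canonical label."""
    merged = Counter()
    for label, count in freq.items():
        merged[aliases.get(label, label)] += count
    return merged

# counts copied from the most_common(10) output shown earlier
sample_freq = Counter({'Fed': 46, 'Amazon': 18, 'Fidelity': 15,
                       'the Federal Reserve': 15, 'Tesla': 14})
merged = merge_aliases(sample_freq)
print(merged.most_common(2))
```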


In [13]:
df.to_csv('reddit_investing_ner.csv', sep='|', index=False)
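
One thing to be aware of when reading this file back: a CSV cell can only hold text, so each list in the organizations column is written as its string representation (for example `"['Reuters', 'GME']"`). The sketch below uses only the standard library to illustrate the round trip, with `ast.literal_eval` turning the string back into a list. The in-memory buffer and the single sample row are stand-ins for the real file.

```python
import ast
import csv
import io

# simulate what gets written for a list-valued column: the list's string repr
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='|')
writer.writerow(['id', 'organizations'])
writer.writerow(['t3_vt0kre', str(['Reuters', 'GME', 'NFT', 'Mixer'])])

# read it back: the cell comes back as a plain string, not a list
buffer.seek(0)
rows = list(csv.DictReader(buffer, delimiter='|'))
raw = rows[0]['organizations']

# literal_eval safely parses Python literals (unlike eval, it runs no code)
parsed = ast.literal_eval(raw)
print(type(raw).__name__, type(parsed).__name__)
```

With pandas, the same conversion can be applied at load time by passing `converters={'organizations': ast.literal_eval}` to `pd.read_csv`.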