# Reddit sentiment <a class="anchor" id="top"></a>

## TOC:
* [Reddit filters and sentiment methods](#bullet1)
* [General functions and imports](#bullet2)
    - [Filter methods](#sub-bullet2.1)
* [Reddit method 1](#bullet3)
* [Reddit method 2](#bullet4)
* [Sentiment calculations](#bullet5)
    - [VADER sentiment](#sub-bullet5.1)
    - [finBERT sentiment](#sub-bullet5.2)
* [ToDo](#ToDo)

## Reddit filters and sentiment methods <a class="anchor" id="bullet1"></a>

Reddit sentiment is calculated in different ways. I use 2 different 'Reddit-methods' and 2 different [relevancy filters](#sub-bullet2.1) to approach the Reddit data. 

| Filter | Method 1 | Method 2 |
| --- | --- | --- |
| Filter 1 | m1f1 | m2f1 |
| Filter 2 | m1f2 | m2f2 |


On these four combinations, two different sentiment analysis methods will be performed.


**Reddit methods**
- The first way is similar to the method used for Twitter posts. All Reddit comments will be scanned and filtered for stock mentions. These stock mentions are then counted for each stock and subsequent sentiment analysis will be performed on them.
- The second method will not only look at comments, but also consider the Reddit posts. This is done in line with [Hu, Jones, Zhang and Zhang (2021) page 10-12](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3807655). The method starts by checking whether a post or comment contains a company ticker or name. Once such a mention is found, any direct replies are searched. All posts or comments mentioning a company, together with any direct replies to these posts or comments, are counted as social media activity for this particular company. 

These methods differ slightly from the approach used for Twitter. The reason is that the `$ticker` naming convention does not seem popular on Reddit. Searching only for `$tsla` returns just 58 observations for the r/stocks subreddit during the 2020_07 period. When searching for the company name instead (Tesla), 4192 observations are found. Hence Reddit comments will also be searched using company names.
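The difference between the two search rules can be illustrated with a minimal sketch. The comments below are invented for illustration, and the helper names `mentions_cashtag` and `mentions_cashtag_or_name` are hypothetical; they are not the notebook's own filter functions, and unlike those they add word boundaries around the ticker and name.

```python
import re

# Hypothetical comments, invented for illustration only
comments = [
    "Thinking about buying $TSLA before earnings",
    "Tesla deliveries were better than expected",
    "I would rather hold an index fund",
    "tesla is overvalued imo",
]

def mentions_cashtag(text, ticker):
    # Matches the `$ticker` convention only, e.g. "$TSLA"
    return re.search(rf"\${ticker}\b", text, re.IGNORECASE) is not None

def mentions_cashtag_or_name(text, ticker, name):
    # Matches either the cashtag or the company name as a whole word
    pattern = rf"\${ticker}\b|\b{re.escape(name)}\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

cashtag_hits = sum(mentions_cashtag(c, "TSLA") for c in comments)
combined_hits = sum(mentions_cashtag_or_name(c, "TSLA", "Tesla") for c in comments)
print(cashtag_hits, combined_hits)
```

On this toy input the cashtag-only rule finds one comment, while the combined rule finds three, which mirrors the direction of the 58 versus 4192 gap reported above.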


A total of `20.511.418` comments are scraped from the 5 subreddits investing, pennystocks, stockmarket, stocks and wallstreetbets. `844.715` comments are found for method 1. Method 2 ... `36.244` posts

**Sentiment methods**
The sentiment methods which are used are:
- cjhutto VADER sentiment
- finBERT sentiment



## General functions and imports <a class="anchor" id="bullet2"></a>

**Imports**

In [106]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import os
import time
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# import statsmodels.formula.api as sm


**General functions**

In [2]:
# List of company names
company_names = {'TSLA': 'Tesla', 
                 'MU': 'Micron Technology', 
                 'SNAP': 'Snapchat', 
                 'AMD': 'AMD', 
                 'DIS': 'Disney', 
                 'MSFT': 'Microsoft', 
                 'AAPL': 'Apple', 
                 'AMZN': 'Amazon', 
                 'SQ': 'Block', 
                 'BABA': 'Alibaba', 
                 'V': 'Visa', 
                 'NFLX': 'Netflix', 
                 'IQ': 'IQIYI', 
                 'ATVI': 'Activision Blizzard', 
                 'SHOP': 'Shopify', 
                 'BA': 'Boeing', 
                 'NVDA': 'NVIDIA', 
                 'GE': 'General Electric', 
                 'WMT': 'Walmart', 
                 'SBUX': 'Starbucks', 
                 'F': 'Ford', 
                 'TLRY': 'Tilray', 
                 'LULU': 'Lululemon', 
                 'BAC': 'Bank of America', 
                 'GME': 'GameStop'}

# Return the company name for a ticker from the company_names dict
def find_company_name(ticker):
    ticker = ticker.upper()
    
    return company_names[ticker]
    
    
find_company_name("gme")

'GameStop'

In [97]:
# A list of top NASDAQ and S&P companies
company_list = ['MetLife', 'Exelon', "O'Reilly Automotive", 'Baker Hughes', 'Rivian',
                'Applied Materials', 'Palo Alto Networks', 'Alphabet', 'Nike',
                'ASML', 'Lululemon', 'Bank of America', 'Cadence Design Systems',
                'BNY Mellon', 'Pfizer', 'Cintas', 'Google', 'Marriott International', 'American Tower',
                'Thermo Fisher Scientific', 'CrowdStrike', 'Ross Stores', 'Emerson', 'Lilly', 'Linde',
                'Salesforce', 'Qualcomm', 'CoStar Group', 'T-Mobile','TMobile', 'Caterpillar', 'Cognizant',
                'FedEx', 'JD.com', 'Johnson & Johnson', 'Alibaba', 'Netflix', 'Seagen', 'Sirius XM',
                'Procter & Gamble', 'Microchip Technology', 'PepsiCo', 'Nvidia', 'Dow', 'Zscaler',
                'Abbott', 'Charter Communications', 'Charles Schwab', "McDonald's", 'Mastercard',
                'Ford', 'MercadoLibre', 'Booking Holdings', 'Diamondback Energy', 'Dollar Tree',
                'Verisk', 'Zoom Video Communications', 'Altria', 'IQIYI', 'Constellation Energy',
                'Wells Fargo', 'Starbucks', 'NXP', 'Adobe', 'eBay', 'Gilead', 'American Electric Power',
                'Lucid Motors', 'Disney', 'Coca-Cola', 'Visa', 'Illumina, Inc.', 'Moderna', 'Fiserv',
                'Boeing', 'Medtronic', 'Gilead Sciences', 'GlobalFoundries', 'NextEra Energy',
                'Texas Instruments', 'Intel', 'Monster Beverage', 'Block', 'Meta Platforms', 'Tesla',
                'KLA Corporation', 'Broadcom', 'Airbnb', 'PayPal', 'CVS Health', 'AT&T', 'Cisco',
                'Raytheon Technologies', 'U.S. Bank', 'Duke Energy', 'Union Pacific', '3M',
                'Lockheed Martin', 'Ansys', 'ExxonMobil', 'Verizon', 'Xcel Energy', 'Paccar', 'Costco',
                'Analog Devices', 'AbbVie', 'Lam Research', 'Vertex Pharmaceuticals', 'Intuitive Surgical',
                'Synopsys', 'AMD', 'American Express', 'Regeneron', 'Activision Blizzard', 'Target',
                'Atlassian', 'Advanced Micro Devices', 'Warner Bros. Discovery', 'Tilray', 'BlackRock',
                'Amazon', 'Walmart', 'Workday', 'Dr Pepper', 'JPMorgan Chase', 'Meta', 'Copart',
                'Biogen', 'Paychex', 'Capital One', 'UnitedHealth Group', 'Fastenal', 'Microsoft',
                'AstraZeneca', "Lowe's", 'Datadog', 'General Electric', 'ADP', 'Goldman Sachs', 'PDD Holdings',
                'Accenture', 'Mondelēz International', 'United Parcel Service', 'Comcast', 'Micron Technology',
                'CSX Corporation', 'Electronic Arts', 'Marvell Technology', 'ConocoPhillips',
                'Walgreens Boots Alliance', 'Home Depot', 'Idexx Laboratories', 'IBM', 'Oracle', 'Shopify',
                'Apple', 'GE', 'Morgan Stanley', 'Merck', 'Citigroup', 'Simon', 'Amgen',
                'Philip Morris International', 'General Dynamics', 'Bristol Myers Squibb', 
                'Old Dominion Freight Line', 'Snapchat', 'GameStop', 'Align Technology', 'Colgate-Palmolive', 
                'Berkshire Hathaway', 'Enphase Energy', 'Fortinet', 'Intuit', 'Danaher', 'Kraft Heinz', 'Chevron', 
                'GM', 'DexCom', 'Southern Company', 'Autodesk', 'American International Group', 'Honeywell']


In [5]:
pattern = re.compile('|'.join(f"\\b{name}\\b" for name in company_list))
pattern

re.compile(r"\bMetLife\b|\bExelon\b|\bO'Reilly Automotive\b|\bBaker Hughes\b|\bRivian\b|\bApplied Materials\b|\bPalo Alto Networks\b|\bAlphabet\b|\bNike\b|\bASML\b|\bLululemon\b|\bBank of America\b|\bCadence Design Systems\b|\bBNY Mellon\b|\bPfizer\b|\bCintas\b|\bGoogle\b|\bMarriott International\b|\bAmerican Tower\b|\bThermo Fisher Scientific\b|\bCrowdStrike\b|\bRoss Stores\b|\bEmerson\b|\bLilly\b|\bLinde\b|\bSalesforce\b|\bQualcomm\b|\bCoStar Group\b|\bT-Mobile\b|\bTMobile\b|\bCaterpillar\b|\bCognizant\b|\bFedEx\b|\bJD.com\b|\bJohnson & Johnson\b|\bAlibaba\b|\bNetflix\b|\bSeagen\b|\bSirius XM\b|\bProcter & Gamble\b|\bMicrochip Technology\b|\bPepsiCo\b|\bNvidia\b|\bDow\b|\bZscaler\b|\bAbbott\b|\bCharter Communications\b|\bCharles Schwab\b|\bMcDonald's\b|\bMastercard\b|\bFord\b|\bMercadoLibre\b|\bBooking Holdings\b|\bDiamondback Energy\b|\bDollar Tree\b|\bVerisk\b|\bZoom Video Communications\b|\bAltria\b|\bIQIYI\b|\bConstellation Energy\b|\bWells Fargo\b|\bStarbucks\b|\bNXP\b|\bAdobe\b

### Filter methods <a class="anchor" id="sub-bullet2.1"></a>

Besides the two [Reddit-methods](#bullet1) described in the introduction, I also make use of two different filters. These filters are used to check if a comment or post is relevant to a specific company.

To summarize:
- `filter_data_1` Checks whether the name of the company or the ticker of the company is mentioned. If one of these is true, the comment or post is seen as relevant.
- `filter_data_2` Is stricter than `filter_data_1`. Just like filter 1, it checks for the company name and ticker. However, this time it also checks whether other tickers or company names are mentioned. It does this using NASDAQ and S&P company names, which are listed in `company_list`. These company names are compiled into a regex search. If other tickers or company names are mentioned in the post or comment, the post or comment is not considered relevant. If only the company's ticker or name is mentioned, the post <u>is</u> considered relevant.

In [4]:
# Functions borrowed from Twitter 'Data cleaning and sentiment' Jupyter Notebook
def clean_text(text):
    # Remove Twitter retweet handles (RT @xxx:)
    # Raw strings avoid invalid-escape warnings for sequences such as \w and \s
    text = re.sub(r"RT @[\w]*:", "", text)

    # Remove Twitter handles (@xxx)
    text = re.sub(r"@[\w]*", "", text)

    # Remove URL links (httpxxx)
    url_matcher = r"((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*"
    text = re.sub(url_matcher, "", text)
    
    # Remove any multiple white spaces, tabs or newlines
    text = re.sub(r'\s+',' ', text)
    
    #remove “”
    text = re.sub("“|”", "", text)
    
    return text

# This function is slightly adjusted. Not only are company tickers searched ("$ticker"), but also company names are searched.
# This change increases the amount of mentions for Tesla on the r/stocks subreddit during the 2020_07 period from 58 to 4192 observations
def filter_data_1(post, ticker):
    # Filter out posts that do not mention the company ticker.
    company_name = find_company_name(ticker)
    if bool(re.search(fr"(\${ticker})|({company_name})", post, re.IGNORECASE)):
        return True
    else:
        return False
    
# Filter 2 filters the comments based on the rule that exactly 1 ticker or company name is mentioned and ...
# ... that this ticker is the ticker of the company of which the sentiment is being calculated   
def filter_data_2(post, ticker):
    # ---- Tickers ----
    # Count the number of tickers in the post
    ticker_matches = len(re.findall(r"\$[a-zA-Z]+", post, re.IGNORECASE))
    company_ticker_matches = len(re.findall(fr"\${ticker}", post, re.IGNORECASE))
    
    # The ticker_diff needs to be equal to zero, else other tickers than the company ticker are mentioned.
    ticker_diff = ticker_matches - company_ticker_matches
    
    # print(f"{len(ticker_matches)} - {len(company_ticker_matches)} = {ticker_diff}")
    
    # ---- Company name ----
    # Create pattern which matches any of the company names in the company_list
    # '\\b' prevents any occurrences of short company names from being picked up mid-string:
    # GE --> 'Vegetable' would be picked up without this rule
    # Flags are passed to re.compile: the second positional argument of a compiled
    # pattern's findall() is a start position, not a flags value.
    pattern = re.compile('|'.join(f"\\b{name}\\b" for name in company_list), re.IGNORECASE)
    name_matches = len(pattern.findall(post))
    
    # Next the company of the ticker is searched.
    company_name = find_company_name(ticker)
    company_name_matches = len(re.findall(f"\\b{company_name}\\b", post, re.IGNORECASE))
    
    # The company_name_diff needs to be equal to zero, else company names other than the company's own name are mentioned.
    company_name_diff = name_matches - company_name_matches
    
    # print(f"{name_matches_count} - {len(company_name_matches)} = {company_name_diff}")
    
    # Filter out posts with more or less than 1 ticker, and check whether this 1 ticker is the company ticker.
    if ticker_diff == 0 and company_name_diff == 0 and (company_ticker_matches > 0 or company_name_matches > 0):
        return True
    else:
        return False

text = "This is a test text. If Apple is in the text, without any other companies mentioned, the functions returns True."
print(filter_data_2(text, "aapl"))

text = "If another company is mentioned, like Gamestop, it returns False"
print(filter_data_2(text, "aapl"))

text = "If no company is mentioned, it also returns False"
print(filter_data_2(text, "aapl"))

True
False
False
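Two regex details matter for name-based filters like these, and the following minimal sketch illustrates them on a small, hypothetical subset of names. First, names containing punctuation (such as `JD.com`) match more than intended unless they are escaped with `re.escape`. Second, flags such as `re.IGNORECASE` belong in `re.compile`; when passed as the second positional argument to a compiled pattern's `findall`, the value is interpreted as a start position instead.

```python
import re

names = ["JD.com", "GE", "GameStop"]  # small illustrative subset

# re.escape keeps the "." in "JD.com" literal; flags go into re.compile
pattern = re.compile("|".join(rf"\b{re.escape(n)}\b" for n in names), re.IGNORECASE)

text = "I sold gamestop and bought JD.com, not JDxcom or vegetables"
matches = pattern.findall(text)
print(matches)

# Pitfall: here re.IGNORECASE (integer value 2) is taken as the start position,
# so the search begins at index 2 and the leading "GE" is skipped.
skipped = re.compile(r"\bGE\b").findall("GE is up", re.IGNORECASE)
print(skipped)
```

The word boundaries keep `GE` from matching inside "vegetables", and escaping keeps `JDxcom` from being counted as `JD.com`.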


In [8]:
def filter_mentions(df, ticker, social_type='comment', filter_type=1):
    if social_type == 'comment':
        # Dropping any potential NA's
        df = df[df['body'].notna()]

        if filter_type == 1:
            # Filter out comments that do not mention company ticker using 'filter_data_1'
            df_filter = df['body'].apply(filter_data_1, ticker=ticker)
        elif filter_type == 2:
            # Filter out comments that do not mention company ticker using 'filter_data_2'
            df_filter = df['body'].apply(filter_data_2, ticker=ticker)
        else:
            raise Exception("Please select a valid filter_type for func [filter_mentions]. Either 1 or 2.")
        df = df[df_filter].copy()  # copy so the cleaning step below does not write to a slice
        
        # Skip cleaning if df is empty
        if df.shape[0] > 0:
            # Clean text
            df['body'] = df['body'].apply(clean_text)  
            
    elif social_type == 'post':
        # Dropping any potential NA's
        df = df[df['selftext'].notna()]

        if filter_type == 1:
            # Filter out posts that do not mention company ticker using 'filter_data_1'
            selftext_filter = df['selftext'].apply(filter_data_1, ticker=ticker)
            title_filter = df['title'].apply(filter_data_1, ticker=ticker)
        elif filter_type == 2:
            # Filter out posts that do not mention company ticker using 'filter_data_2'
            selftext_filter = df['selftext'].apply(filter_data_2, ticker=ticker)
            title_filter = df['title'].apply(filter_data_2, ticker=ticker)
        else:
            raise Exception("Please select a valid filter_type for func [filter_mentions]. Either 1 or 2.")
        df = df[selftext_filter | title_filter].copy()  # copy so the cleaning step below does not write to a slice
        
        
        # Skip cleaning if df is empty
        if df.shape[0] > 0:
            # Clean text
            df['selftext'] = df['selftext'].apply(clean_text)
           
        # For posts specifically, I drop observations for
        # - Authors posting multiple posts on 1 day (spam or duplicate posts)
        df = df.drop_duplicates(subset=['author', 'date'], keep=False)
        # - Posts with duplicate texts
        df = df.drop_duplicates(subset=['selftext'], keep=False)
        
    else:
        raise Exception("Please select a valid social_type for func [filter_mentions]. Either 'comment' or 'post'.")
    
    return df

def save_df(df, save_path):
    # Check if file already exists. Ignore headers if true
    if os.path.isfile(save_path):
        df.to_csv(save_path, mode='a', header=False, index=False)
    else: 
        df.to_csv(save_path, encoding='utf-8', index=False)
        print(f"Creating new file at [{save_path}]")
        

## Reddit method 1 <a class="anchor" id="bullet3"></a>

**Extract all comments which mention a company ticker or name**

In [40]:
save = False

start_time = time.time()

ticker_list = ['AAPL', 'AMD', 'AMZN', 'ATVI', 'BA', 'BABA', 'BAC', 'DIS', 'F', 'GE', 'GME', 'IQ', 'LULU', 'MSFT', 'MU', 'NFLX', 'NVDA', 'SBUX', 'SHOP', 'SNAP', 'SQ', 'TLRY', 'TSLA', 'V', 'WMT']
rootdir = r"E:\Users\Christiaan\Large_Files\Thesis\reddit\comments\unfiltered"
save_dir = r"E:\Users\Christiaan\Large_Files\Thesis\reddit\comments\filtered"

if save:
    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            # Create csv_path
            csv_path = os.path.join(subdir, file)


            # Read csv
            df = pd.read_csv(csv_path)

            # Dropping certain columns
            df.drop(columns=['score_hidden', 'total_awards_received', 'nest_level', 'author_fullname'], inplace=True)
            
            # Loop tickers
            for ticker in ticker_list:
                print(f"Time passed: {str(round((time.time() - start_time), 1)).ljust(8)} - Processing {ticker.ljust(4)} for [{csv_path}]", end='\r')

                # ----------------------- Filter_1 -----------------------
                work_df = df.copy()
                # Apply filter
                filtered_df = filter_mentions(work_df, ticker, social_type='comment', filter_type=1)

                # Creating save_path and saving file (either new file or appending)
                save_path = os.path.join(save_dir, "filter_1", f"{ticker}.csv").replace('\\', '/')
                save_df(filtered_df, save_path)

                # ----------------------- Filter_2 -----------------------
                work_df = df.copy()
                # Apply filter
                filtered_df = filter_mentions(work_df, ticker, social_type='comment', filter_type=2)

                # Creating save_path and saving file (either new file or appending)
                save_path = os.path.join(save_dir, "filter_2", f"{ticker}.csv").replace('\\', '/')
                save_df(filtered_df, save_path)

Creating new file at [E:/Users/Christiaan/Large_Files/Thesis/reddit/comments/filtered/filter_2/AAPL.csv]
Creating new file at [E:/Users/Christiaan/Large_Files/Thesis/reddit/comments/filtered/filter_2/AMD.csv]
Creating new file at [E:/Users/Christiaan/Large_Files/Thesis/reddit/comments/filtered/filter_2/AMZN.csv]
Creating new file at [E:/Users/Christiaan/Large_Files/Thesis/reddit/comments/filtered/filter_2/ATVI.csv]
Creating new file at [E:/Users/Christiaan/Large_Files/Thesis/reddit/comments/filtered/filter_2/BA.csv]
Creating new file at [E:/Users/Christiaan/Large_Files/Thesis/reddit/comments/filtered/filter_2/BABA.csv]
Creating new file at [E:/Users/Christiaan/Large_Files/Thesis/reddit/comments/filtered/filter_2/BAC.csv]
Creating new file at [E:/Users/Christiaan/Large_Files/Thesis/r

## Reddit method 2 <a class="anchor" id="bullet4"></a>

Method 2 checks which posts contain mentions of companies, in the form of a ticker or a company name. 

In [9]:
save = False

ticker_list = ['AAPL', 'AMD', 'AMZN', 'ATVI', 'BA', 'BABA', 'BAC', 'DIS', 'F', 'GE', 'GME', 'IQ', 'LULU', 'MSFT', 'MU', 'NFLX', 'NVDA', 'SBUX', 'SHOP', 'SNAP', 'SQ', 'TLRY', 'TSLA', 'V', 'WMT']
rootdir = r"E:\Users\Christiaan\Large_Files\Thesis\reddit\posts\unfiltered"
save_dir = r"E:\Users\Christiaan\Large_Files\Thesis\reddit\posts\filtered"

if save:
    start_time = time.time()
    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            # Create csv_path
            csv_path = os.path.join(subdir, file)


            # Read csv
            df = pd.read_csv(csv_path)

            # Loop tickers
            for ticker in ticker_list:
                print(f"Time passed: {str(round((time.time() - start_time), 1)).ljust(8)} - Processing {ticker.ljust(4)} for [{csv_path}]", end='\r')
                # ----------------------- Filter_1 -----------------------
                work_df = df.copy()
                # Apply filter (on the working copy, as for filter 2 below)
                filtered_df = filter_mentions(work_df, ticker, social_type='post', filter_type=1)


                # Creating save_path and saving file (either new file or appending)
                save_path = os.path.join(save_dir, "filter_1", f"{ticker}.csv").replace('\\', '/')
                save_df(filtered_df, save_path)

                # ----------------------- Filter_2 -----------------------
                work_df = df.copy()
                # Apply filter
                filtered_df = filter_mentions(work_df, ticker, social_type='post', filter_type=2)

                # Creating save_path and saving file (either new file or appending)
                save_path = os.path.join(save_dir, "filter_2", f"{ticker}.csv").replace('\\', '/')
                save_df(filtered_df, save_path)


Creating new file at [E:/Users/Christiaan/Large_Files/Thesis/reddit/posts/filtered/filter_1/AAPL.csv]
Creating new file at [E:/Users/Christiaan/Large_Files/Thesis/reddit/posts/filtered/filter_2/AAPL.csv]
Creating new file at [E:/Users/Christiaan/Large_Files/Thesis/reddit/posts/filtered/filter_1/AMD.csv]
Creating new file at [E:/Users/Christiaan/Large_Files/Thesis/reddit/posts/filtered/filter_2/AMD.csv]
Creating new file at [E:/Users/Christiaan/Large_Files/Thesis/reddit/posts/filtered/filter_1/AMZN.csv]
Creating new file at [E:/Users/Christiaan/Large_Files/Thesis/reddit/posts/filtered/filter_2/AMZN.csv]
Creating new file at [E:/Users/Christiaan/Large_Files/Thesis/reddit/posts/filtered/filter_1/ATVI.csv]
Creating new file at [E:/Users/Christiaan/Large_Files/Thesis/reddit/posts/filtered/filter_2/ATVI.csv]
Creating new file at [E:/Users/Christiaan/Large_Files/Thesis/reddit/posts/filtered/filter_1/B

**Merging all comments to a single file**

After finding all posts mentioning a company (in addition to the comments found under method 1), it is now time to see which replies were posted to these posts and comments. For posts, these replies are top-level comments. Replies to other comments can be found at any level. 

I begin by merging all comments into a single file. This serves no purpose other than making it easier to loop over the comments.

In [5]:
save = False

# --- Merge all comments into single file ---
if save:
    rootdir = r"E:\Users\Christiaan\Large_Files\Thesis\reddit\comments\unfiltered"
    save_path = r"E:\Users\Christiaan\Large_Files\Thesis\reddit\comments\all_comments.csv"
    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            # Create csv_path
            csv_path = os.path.join(subdir, file)

            # Read csv
            df = pd.read_csv(csv_path)

            df.drop(columns=['score_hidden', 'total_awards_received', 'nest_level', 'author_fullname'], inplace=True)

            # Save new csv or append to existing csv
            save_df(df, save_path)

Creating new file at [E:\Users\Christiaan\Large_Files\Thesis\reddit\comments\all_comments.csv]


**Searching for replies**

Method 2 also finds all comments which are replies to posts or comments containing the company name or ticker. 

To find out whether a comment is a reply, each comment will be checked to see if the `parent_id` matches an `id` of a comment containing a company mention. For responses to posts, the `parent_id` will be empty. For these, the `link_id` will need to match the post's id.

For replies to comments:
- `df['parent_id']` matches an `df['id']` of a comment containing a company mention.

For top-level replies to posts:
- `df['parent_id']` is empty and `df['link_id']` is equal to the post's `df['id']` which mentions a company.
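The two rules above can be sketched on toy data. The ids below are invented, and the column layout (an empty `parent_id` for top-level comments and a `t3_` prefix on `link_id`) is assumed from the description in this notebook rather than taken from any official data specification.

```python
import pandas as pd

# Hypothetical toy data; column names follow the description above
mention_comments = pd.DataFrame({"id": ["c1"]})   # comments mentioning the company
mention_posts = pd.DataFrame({"id": ["p1"]})      # posts mentioning the company

all_comments = pd.DataFrame({
    "id":        ["c2", "c3", "c4", "c5"],
    "parent_id": ["c1", None, None, "c9"],
    "link_id":   ["t3_p7", "t3_p1", "t3_p8", "t3_p1"],
})

# Rule 1: reply to a comment that mentions the company
is_comment_reply = all_comments["parent_id"].isin(mention_comments["id"])

# Rule 2: top-level reply to a post that mentions the company
post_ids = all_comments["link_id"].str[3:]  # strip the assumed 't3_' prefix
is_post_reply = all_comments["parent_id"].isna() & post_ids.isin(mention_posts["id"])

replies = all_comments[is_comment_reply | is_post_reply]
print(replies["id"].tolist())
```

Here `c2` is kept by rule 1 and `c3` by rule 2. `c5` sits in a mentioning post's thread but is not top-level and does not reply to a mentioning comment, so it is not counted.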

In [24]:
def filter_replies(df_chunk, comments_df, posts_df):
    #           --- Comment filter ---
    # Create a dataframe filter which checks if comment is a reply to other comment
    comment_id_list = comments_df['id'].to_list()
    comment_filter = df_chunk['parent_id'].isin(comment_id_list)
    
    #           --- Post filter ---
    # Removing 't3_' from the comment 'link_id' field
    df_chunk['link_id'] = df_chunk['link_id'].str[3:]
    
    # Create a dataframe filter which checks if comment is a top-level reply to a post
    post_id_list = posts_df['id'].to_list()
    post_filter = (df_chunk['parent_id'].isna() & df_chunk['link_id'].isin(post_id_list))
    
    return_df = df_chunk[comment_filter | post_filter]
    return return_df

# Checks replies for the chunk dataframe input
def check_replies(df):
    
    ticker_list = ['AAPL', 'AMD', 'AMZN', 'ATVI', 'BA', 'BABA', 'BAC', 'DIS', 'F', 'GE', 'GME', 'IQ', 'LULU', 'MSFT', 'MU', 'NFLX', 'NVDA', 'SBUX', 'SHOP', 'SNAP', 'SQ', 'TLRY', 'TSLA', 'V', 'WMT']
    comment_dir = r"E:\Users\Christiaan\Large_Files\Thesis\reddit\comments\filtered"
    posts_dir = r"E:\Users\Christiaan\Large_Files\Thesis\reddit\posts\filtered"
    save_dir = r"E:\Users\Christiaan\Large_Files\Thesis\reddit\replies"
    
    for ticker in ticker_list:
        # ----------------------- Filter_1 -----------------------
        work_df = df.copy()
        # Creating paths
        comment_path = os.path.join(comment_dir, "filter_1", f"{ticker}.csv").replace('\\', '/')
        posts_path = os.path.join(posts_dir, "filter_1", f"{ticker}.csv").replace('\\', '/')
        
        # Reading csv's to df
        comments_df = pd.read_csv(comment_path)
        posts_df = pd.read_csv(posts_path)
        
        # Run function
        return_df = filter_replies(work_df, comments_df, posts_df)

        # Creating save_path and saving file (either new file or appending)
        save_path = os.path.join(save_dir, "filter_1", f"{ticker}.csv").replace('\\', '/')
        save_df(return_df, save_path)

        # ----------------------- Filter_2 -----------------------
        work_df = df.copy()
        # Creating paths
        comment_path = os.path.join(comment_dir, "filter_2", f"{ticker}.csv").replace('\\', '/')
        posts_path = os.path.join(posts_dir, "filter_2", f"{ticker}.csv").replace('\\', '/')

        # Reading csv's to df
        comments_df = pd.read_csv(comment_path)
        posts_df = pd.read_csv(posts_path)
        
        # Run function
        return_df = filter_replies(work_df, comments_df, posts_df)

        # Creating save_path and saving file (either new file or appending)
        save_path = os.path.join(save_dir, "filter_2", f"{ticker}.csv").replace('\\', '/')
        save_df(return_df, save_path)


**Search for replies by chunking all_comments.csv**


To loop over all comments, I read all_comments.csv in chunks.
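As a small illustration of chunked reading, the sketch below uses an in-memory CSV as a hypothetical stand-in for the real file. With `chunksize` set, `pd.read_csv` yields ordinary DataFrames one chunk at a time, so the full file never has to fit in memory.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for all_comments.csv
csv_text = "id,body\n" + "\n".join(f"c{i},comment {i}" for i in range(10))

rows_per_chunk = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Each chunk is an ordinary DataFrame with at most `chunksize` rows
    rows_per_chunk.append(len(chunk))

print(rows_per_chunk)
```

Ten rows with a chunk size of four give chunks of 4, 4 and 2 rows; the last chunk is simply shorter.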

In [25]:
save = False

start_time = time.time()

csv_path = r"E:\Users\Christiaan\Large_Files\Thesis\reddit\comments\all_comments.csv"
chunksize = 10 ** 6
counter = 1

if save:
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        print(f"Processing chunk {str(counter).ljust(2)} --- time passed: {round((time.time() - start_time), 1)}", end='\r')
        check_replies(chunk)
        counter += 1

Creating new file at [E:/Users/Christiaan/Large_Files/Thesis/reddit/replies/filter_1/AAPL.csv]
Creating new file at [E:/Users/Christiaan/Large_Files/Thesis/reddit/replies/filter_2/AAPL.csv]
Creating new file at [E:/Users/Christiaan/Large_Files/Thesis/reddit/replies/filter_1/AMD.csv]
Creating new file at [E:/Users/Christiaan/Large_Files/Thesis/reddit/replies/filter_2/AMD.csv]
Creating new file at [E:/Users/Christiaan/Large_Files/Thesis/reddit/replies/filter_1/AMZN.csv]
Creating new file at [E:/Users/Christiaan/Large_Files/Thesis/reddit/replies/filter_2/AMZN.csv]
Creating new file at [E:/Users/Christiaan/Large_Files/Thesis/reddit/replies/filter_1/ATVI.csv]
Creating new file at [E:/Users/Christiaan/Large_Files/Thesis/reddit/replies/filter_2/ATVI.csv]
Creating new file at [E:/Users/Christiaan/Large_Files/Thesis/reddit/replies/filter_1/BA.csv]
Creating new file at [E:/Users/Christiaan/Large_Files/Thesis/reddit/replies/filter_2/BA.csv]
Creating new file at [E:/Users/Christiaan/Large_Files/Th

**Merging comments with replies**

Besides merging, duplicates are also removed.
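Duplicates can arise because a comment may both mention the company itself and be a reply to another mention, in which case it appears in both files. A minimal sketch with invented permalinks shows the merge and de-duplication step:

```python
import pandas as pd

# Hypothetical rows; '/r/a/2' appears in both inputs
comments_df = pd.DataFrame({"permalink": ["/r/a/1", "/r/a/2"], "body": ["x", "y"]})
reply_df = pd.DataFrame({"permalink": ["/r/a/2", "/r/a/3"], "body": ["y", "z"]})

# Stack both frames, then keep the first occurrence of each permalink
merged_df = pd.concat([comments_df, reply_df], ignore_index=True)
merged_df = merged_df.drop_duplicates(subset=["permalink"])
print(merged_df["permalink"].tolist())
```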

In [95]:
save = False

ticker_list = ['AAPL', 'AMD', 'AMZN', 'ATVI', 'BA', 'BABA', 'BAC', 'DIS', 'F', 'GE', 'GME', 'IQ', 'LULU', 'MSFT', 'MU', 'NFLX', 'NVDA', 'SBUX', 'SHOP', 'SNAP', 'SQ', 'TLRY', 'TSLA', 'V', 'WMT']
comment_dir = r"E:\Users\Christiaan\Large_Files\Thesis\reddit\comments\filtered"
reply_dir = r"E:\Users\Christiaan\Large_Files\Thesis\reddit\replies"
save_dir = r"E:\Users\Christiaan\Large_Files\Thesis\reddit\method_2"

if save:
    for ticker in ticker_list:
        # ----------------------- Filter_1 -----------------------
        comment_path = os.path.join(comment_dir, "filter_1", f"{ticker}.csv").replace('\\', '/')
        reply_path = os.path.join(reply_dir, "filter_1", f"{ticker}.csv").replace('\\', '/')
        save_path = os.path.join(save_dir, "filter_1", f"{ticker}.csv").replace('\\', '/')

        # Reading csv's to df
        comments_df = pd.read_csv(comment_path)
        reply_df = pd.read_csv(reply_path)

        # Merge files
        merged_df = pd.concat([comments_df, reply_df], ignore_index=True)

        # Delete duplicates
        merged_df.drop_duplicates(subset=['permalink'], inplace=True)

        # Saving file
        merged_df.to_csv(save_path, encoding='utf-8', index=False)

        # ----------------------- Filter_2 -----------------------
        comment_path = os.path.join(comment_dir, "filter_2", f"{ticker}.csv").replace('\\', '/')
        reply_path = os.path.join(reply_dir, "filter_2", f"{ticker}.csv").replace('\\', '/')
        save_path = os.path.join(save_dir, "filter_2", f"{ticker}.csv").replace('\\', '/')

        # Reading csv's to df
        comments_df = pd.read_csv(comment_path)
        reply_df = pd.read_csv(reply_path)

        # Merge files
        merged_df = pd.concat([comments_df, reply_df], ignore_index=True)

        # Delete duplicates
        merged_df.drop_duplicates(subset=['permalink'], inplace=True) 

        # Saving file
        merged_df.to_csv(save_path, encoding='utf-8', index=False)

## Sentiment calculations <a class="anchor" id="bullet5"></a>

Having the results for all four different 'Reddit-method' - filter combinations, I can now start with sentiment calculations. 

### VADER sentiment <a class="anchor" id="sub-bullet5.1"></a>

**Functions**

These functions are partially copied from the Twitter sentiment Jupyter Notebook.

In [107]:
# Adding word-sentiment pairs to the cjhutto vaderSentiment library.
new_words = {}

# Adding custom positive words
positive_words = {
    'buy': 2.0,
    'buying': 2.0,
    'bullish': 2.0,
    'long': 1.0,
    'call': 1.0,
    'calls': 1.0,
    'rocket': 3.0,        # Added for 'rocket' emoji 🚀
    'increasing': 2.0,     # Added for 'chart increasing' emoji 📈
    'to the moon': 2.5,
    "undervalued": 2.0
}
# Adding custom negative words
negative_words = {
    'decreasing': -2.0,   # Added for 'chart decreasing' emoji 📉
    'sell': -2.0,
    'selling': -2.0,
    'bearish': -2.0,
    'put': -1,
    'puts': -1,
    'short': -1.0,
    'shorting': -1.5,
    "overvalued": -2.0,
    'expensive': -1.5
}

# Adding positive and negative words to the new_words dictionary
new_words.update(positive_words)
new_words.update(negative_words)

In [108]:
#   ---------------------------   Sentiment   ---------------------------
# Creating SIA, which uses standard words.
SIA = SentimentIntensityAnalyzer()

def calc_sentiment_1(text, sent_type):
    result = SIA.polarity_scores(text)
    return result[sent_type]

# Creating SIA2 to add custom words.
SIA2 = SentimentIntensityAnalyzer()
SIA2.lexicon.update(new_words)

def calc_sentiment_2(text, sent_type):
    result = SIA2.polarity_scores(text)
    return result[sent_type]

In [147]:
def clean_data(df, ticker):
    start_time = time.time()
    # Prepping data by setting date column
    df['created_at'] = pd.to_datetime(df['utc_datetime_str'])
    df['date'] = df['created_at'].dt.date
    
    #   ---------------------------   Sentiment   ---------------------------
    # Calculate sentiment scores
    df['compound_sent_1'] = df['body'].astype(str).apply(calc_sentiment_1, sent_type='compound')
    df['compound_sent_2'] = df['body'].astype(str).apply(calc_sentiment_2, sent_type='compound')
    
    print(f"[{ticker}] Done calculating sentiment after --- {time.time() - start_time} seconds ---")
    
    """Converting this to pos, neg or neu sentiment
    - positive sentiment: compound score >= 0.05
    - neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
    - negative sentiment: compound score <= -0.05
    """
    
    # Flag positive and negative posts for compound_sent_1 (neutral posts get 0 in both columns)
    df['s1_pos'] = np.where(df['compound_sent_1'] >= 0.05, 1, 0)
    df['s1_neg'] = np.where(df['compound_sent_1'] <= -0.05, 1, 0)

    # Flag positive and negative posts for compound_sent_2 (neutral posts get 0 in both columns)
    df['s2_pos'] = np.where(df['compound_sent_2'] >= 0.05, 1, 0)
    df['s2_neg'] = np.where(df['compound_sent_2'] <= -0.05, 1, 0)
   
    return df

In [129]:
def count_posts(df):
    # Create results_df with [sentiment_1]
    results_df = df[['date', 's1_pos', 's1_neg']].groupby('date', as_index=False).sum().rename(columns={"s1_pos": "[s1]pos", "s1_neg": "[s1]neg"})
    results_df['[s1]total'] = results_df['[s1]pos'] + results_df['[s1]neg']

    # Merge [sentiment_2]
    to_merge_df = df[['date', 's2_pos', 's2_neg']].groupby('date', as_index=False).sum().rename(columns={"s2_pos": "[s2]pos", "s2_neg": "[s2]neg"})
    to_merge_df['[s2]total'] = to_merge_df['[s2]pos'] + to_merge_df['[s2]neg']
    results_df = results_df.merge(to_merge_df, how='left', left_on='date', right_on='date')
    
    return results_df
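
To make the counting concrete, the following toy example applies the same ±0.05 thresholds and the same daily aggregation to made-up compound scores. The column names `compound`, `pos`, `neg` and `total` and the data itself are illustrative assumptions; the notebook's own columns are `s1_pos`, `[s1]pos` and so on.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical compound scores for five posts on two days.
toy = pd.DataFrame({
    "date": ["2020-07-01", "2020-07-01", "2020-07-01", "2020-07-02", "2020-07-02"],
    "compound": [0.60, -0.30, 0.01, 0.05, -0.04],
})

# Same thresholds as clean_data: >= 0.05 is positive, <= -0.05 is negative.
toy["pos"] = np.where(toy["compound"] >= 0.05, 1, 0)
toy["neg"] = np.where(toy["compound"] <= -0.05, 1, 0)

# Daily counts; neutral posts contribute to neither column.
daily = toy.groupby("date", as_index=False)[["pos", "neg"]].sum()
daily["total"] = daily["pos"] + daily["neg"]
print(daily)
```

Note that `total` counts only positive and negative posts, so the neutral post on each day drops out of the volume as well.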

**Different types of VADER sentiment measures**

Although different data filters and sentiment scoring methods have been described, they do not settle how daily sentiment should be measured. Do I take the total number of positive posts as a proxy for sentiment? Do I use the ratio of positive to negative posts? Or do I subtract the negative posts from the positive posts and divide that figure by the total number of posts?

There is no clear winner here. I propose that sentiment should be measured in two parts. The first part contains the `relative volume` of the social media posts, as large volumes are more likely to be noticeable. Part 1 thus acts as a way to strengthen or weaken the total sentiment score of the day, by comparing the volume of that day with the average volume of the last 7 days. The second part contains the actual `sentiment`: is it positive or negative? It should also capture the severity of the sentiment.


**Part 1**
The first part is the easier of the two to measure. Because the sentiment tools all return either a positive, negative or neutral label, I focus solely on the positive and negative posts. These are therefore also the posts that are taken into consideration when looking at volume. The volume is measured by counting the total number of positive and negative posts for a given day. This daily total is then divided by the average daily total of the last 7 days.

- $\text{Relative volume}_{t0} = \frac{\text{Total positive posts}_{t0}+\text{Total negative posts}_{t0}}{\text{Rolling mean}_{7}\left(\text{Total positive posts}+\text{Total negative posts}\right)}$
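
A minimal sketch of the relative-volume calculation, assuming the 7-day window mentioned in the text and assuming that the baseline excludes the current day (the text does not specify this). The function name `relative_volume`, the `window` parameter and the toy numbers are illustrative and not taken from the notebook.

```python
import pandas as pd

def relative_volume(total_posts: pd.Series, window: int = 7) -> pd.Series:
    """Daily post count divided by the mean of the previous `window` days."""
    # shift(1) keeps the current day out of its own baseline (an assumption).
    baseline = total_posts.shift(1).rolling(window=window, min_periods=window).mean()
    return total_posts / baseline

# Hypothetical daily totals (positive + negative posts), one value per day.
totals = pd.Series(
    [10, 10, 10, 10, 10, 10, 10, 20],
    index=pd.date_range("2020-07-01", periods=8, freq="D"),
)
rel_vol = relative_volume(totals)
print(rel_vol)
```

The first seven days have no complete baseline and come out as `NaN`; the eighth day has twice the volume of the preceding week. Days without any positive or negative posts would be missing from the grouped counts, so the series may need to be reindexed to a full daily calendar first; otherwise the window spans 7 rows rather than 7 days.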

**Part 2**
The second part is harder, as it is unclear what the best way to measure sentiment is. To tackle this problem, I calculate the sentiment in different ways.
- Method 1: positive / (positive + negative)
- <strike>Method 2: (positive - negative) / (positive + negative)</strike>
- Method 3: Daily mean of compound sentiment score, counting only posts categorised as positive or negative
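
A short derivation shows why Method 2 adds no information once Method 1 is computed as positive / (positive + negative), which is how `calc_sent_measures` computes it. Writing $P$ and $N$ (notation introduced here for illustration) for the daily counts of positive and negative posts:

$$\frac{P - N}{P + N} = \frac{2P - (P + N)}{P + N} = 2 \cdot \frac{P}{P + N} - 1$$

Method 2 is therefore an increasing linear transformation of Method 1, so both measures rank days identically and are perfectly correlated.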


In [136]:
def calc_sent_measures(return_df):
    # Get results dataframe
    sentiment_measures = count_posts(return_df)

    # Keep only the date column and the sentiment (s1/s2) columns
    sentiment_measures = sentiment_measures.loc[:, sentiment_measures.columns.str.contains('s1|s2|date')]
    
    # Method 1 - ratio
    sentiment_measures['[s1]method_1'] = sentiment_measures['[s1]pos'] / sentiment_measures['[s1]total']
    sentiment_measures['[s2]method_1'] = sentiment_measures['[s2]pos'] / sentiment_measures['[s2]total']

    # Method 2 - discontinued: it is a linear transformation of method 1 (method_2 = 2 * method_1 - 1)
    #     sentiment_measures['[f1s2]method_2'] = (sentiment_measures['[f1s2]pos'] - sentiment_measures['[f1s2]neg']) / sentiment_measures['[f1s2]total']
    #     sentiment_measures['[f2s2]method_2'] = (sentiment_measures['[f2s2]pos'] - sentiment_measures['[f2s2]neg']) / sentiment_measures['[f2s2]total']

    # Method 3 - daily mean compound score over posts labelled positive or negative
    s1_mask = (return_df['s1_pos'] == 1) | (return_df['s1_neg'] == 1)
    to_merge = return_df.loc[s1_mask, ['date', 'compound_sent_1']].groupby('date').mean().rename(columns={'compound_sent_1': '[s1]method_3'})
    sentiment_measures = sentiment_measures.merge(to_merge, how='left', left_on='date', right_on='date')

    s2_mask = (return_df['s2_pos'] == 1) | (return_df['s2_neg'] == 1)
    to_merge = return_df.loc[s2_mask, ['date', 'compound_sent_2']].groupby('date').mean().rename(columns={'compound_sent_2': '[s2]method_3'})
    sentiment_measures = sentiment_measures.merge(to_merge, how='left', left_on='date', right_on='date')
    return sentiment_measures


**Actual sentiment calculations**

In [150]:
ticker_list = ['AAPL', 'AMD', 'AMZN', 'ATVI', 'BA', 'BABA', 'BAC', 'DIS', 'F', 'GE', 'GME', 'IQ', 'LULU', 'MSFT', 'MU', 'NFLX', 'NVDA', 'SBUX', 'SHOP', 'SNAP', 'SQ', 'TLRY', 'TSLA', 'V', 'WMT']

save_dir = r"E:\Users\Christiaan\Large_Files\Thesis\reddit\sentiment\VADER"
methods = ['m1f1', 'm2f1', 'm1f2', 'm2f2']
method_1_dir = r"E:\Users\Christiaan\Large_Files\Thesis\reddit\comments\filtered"
method_2_dir = r"E:\Users\Christiaan\Large_Files\Thesis\reddit\method_2"


def loop_func(ticker, comment_dir, filter_method="filter_1", method_name="m1f1"):
    # ----------------------- m1f1/m1f2/m2f1/m2f2 -----------------------
    comment_path = os.path.join(comment_dir, filter_method, f"{ticker}.csv").replace('\\', '/')

    # Read the csv into a df
    comments_df = pd.read_csv(comment_path)

    # Clean data and perform VADER sentiment scoring.
    return_df = clean_data(comments_df, ticker)

    # Calculate sentiment scores for each method
    sentiment_measures = calc_sent_measures(return_df)

    # Creating save path and saving file
    save_path = os.path.join(save_dir, method_name, f"{ticker}.csv").replace('\\', '/')
    sentiment_measures.to_csv(save_path, encoding='utf-8', index=False)

    print(f"[{ticker}]", comment_path)
    print(f"[{ticker}]", save_path)
    print()


for ticker in ticker_list:
    # For each of the 4 method-filter combinations, run the functions
    loop_func(ticker, comment_dir=method_1_dir, filter_method="filter_1", method_name="m1f1")
    loop_func(ticker, comment_dir=method_1_dir, filter_method="filter_2", method_name="m1f2")
    loop_func(ticker, comment_dir=method_2_dir, filter_method="filter_1", method_name="m2f1")
    loop_func(ticker, comment_dir=method_2_dir, filter_method="filter_2", method_name="m2f2")

[AAPL] Done calculating sentiment after --- 204.51891446113586 seconds ---
[AAPL] E:/Users/Christiaan/Large_Files/Thesis/reddit/comments/filtered/filter_1/AAPL.csv
[AAPL] E:/Users/Christiaan/Large_Files/Thesis/reddit/sentiment/m1f1/AAPL.csv

[AAPL] Done calculating sentiment after --- 59.460816621780396 seconds ---
[AAPL] E:/Users/Christiaan/Large_Files/Thesis/reddit/comments/filtered/filter_2/AAPL.csv
[AAPL] E:/Users/Christiaan/Large_Files/Thesis/reddit/sentiment/m1f2/AAPL.csv

[AAPL] Done calculating sentiment after --- 263.22150135040283 seconds ---
[AAPL] E:/Users/Christiaan/Large_Files/Thesis/reddit/method_2/filter_1/AAPL.csv
[AAPL] E:/Users/Christiaan/Large_Files/Thesis/reddit/sentiment/m2f1/AAPL.csv

[AAPL] Done calculating sentiment after --- 82.80350613594055 seconds ---
[AAPL] E:/Users/Christiaan/Large_Files/Thesis/reddit/method_2/filter_2/AAPL.csv
[AAPL] E:/Users/Christiaan/Large_Files/Thesis/reddit/sentiment/m2f2/AAPL.csv

[AMD] Done calculating sentiment after --- 94.48554

[GE] Done calculating sentiment after --- 3.4391849040985107 seconds ---
[GE] E:/Users/Christiaan/Large_Files/Thesis/reddit/comments/filtered/filter_1/GE.csv
[GE] E:/Users/Christiaan/Large_Files/Thesis/reddit/sentiment/m1f1/GE.csv

[GE] Done calculating sentiment after --- 0.3592112064361572 seconds ---
[GE] E:/Users/Christiaan/Large_Files/Thesis/reddit/comments/filtered/filter_2/GE.csv
[GE] E:/Users/Christiaan/Large_Files/Thesis/reddit/sentiment/m1f2/GE.csv

[GE] Done calculating sentiment after --- 5.959677457809448 seconds ---
[GE] E:/Users/Christiaan/Large_Files/Thesis/reddit/method_2/filter_1/GE.csv
[GE] E:/Users/Christiaan/Large_Files/Thesis/reddit/sentiment/m2f1/GE.csv

[GE] Done calculating sentiment after --- 0.8680615425109863 seconds ---
[GE] E:/Users/Christiaan/Large_Files/Thesis/reddit/method_2/filter_2/GE.csv
[GE] E:/Users/Christiaan/Large_Files/Thesis/reddit/sentiment/m2f2/GE.csv

[GME] Done calculating sentiment after --- 7.820158004760742 seconds ---
[GME] E:/Users/Chr

[SHOP] Done calculating sentiment after --- 12.072792053222656 seconds ---
[SHOP] E:/Users/Christiaan/Large_Files/Thesis/reddit/comments/filtered/filter_1/SHOP.csv
[SHOP] E:/Users/Christiaan/Large_Files/Thesis/reddit/sentiment/m1f1/SHOP.csv

[SHOP] Done calculating sentiment after --- 3.521402359008789 seconds ---
[SHOP] E:/Users/Christiaan/Large_Files/Thesis/reddit/comments/filtered/filter_2/SHOP.csv
[SHOP] E:/Users/Christiaan/Large_Files/Thesis/reddit/sentiment/m1f2/SHOP.csv

[SHOP] Done calculating sentiment after --- 19.06176519393921 seconds ---
[SHOP] E:/Users/Christiaan/Large_Files/Thesis/reddit/method_2/filter_1/SHOP.csv
[SHOP] E:/Users/Christiaan/Large_Files/Thesis/reddit/sentiment/m2f1/SHOP.csv

[SHOP] Done calculating sentiment after --- 5.842756986618042 seconds ---
[SHOP] E:/Users/Christiaan/Large_Files/Thesis/reddit/method_2/filter_2/SHOP.csv
[SHOP] E:/Users/Christiaan/Large_Files/Thesis/reddit/sentiment/m2f2/SHOP.csv

[SNAP] Done calculating sentiment after --- 8.4758396

### finBERT sentiment <a class="anchor" id="sub-bullet5.2"></a>

## ToDo <a class="anchor" id="ToDo"></a>

[Go back up](#top)

In [16]:
dtypes = {'author': 'str',
         'author_fullname': 'str',
         'created_utc': 'int64',
         'permalink': 'str',
         'score': 'int64',
         'body': 'str',
         'is_submitter': 'bool',
         'id': 'str',
         'link_id': 'str',
         'parent_id': 'str',
         'nest_level': 'float64',
         'subreddit': 'str',
         'subreddit_id': 'str'}

csv_path = r"E:/Users/Christiaan/Large_Files/Thesis/reddit/comments/filtered/AAPL.csv"
df = pd.read_csv(csv_path, dtype=dtypes)

In [61]:
chunk['parent_id'].isna()

2000000     True
2000001    False
2000002    False
2000003     True
2000004     True
           ...  
2999995    False
2999996     True
2999997    False
2999998     True
2999999    False
Name: parent_id, Length: 1000000, dtype: bool

In [67]:
csv_path = r"E:\Users\Christiaan\Large_Files\Thesis\reddit\replies\filter_1\AMD.csv"
df = pd.read_csv(csv_path)


csv_path = r"E:\Users\Christiaan\Large_Files\Thesis\reddit\comments\filtered\filter_1\AMD.csv"
df2 = pd.read_csv(csv_path)

In [86]:
df3 = pd.concat([df, df2], ignore_index=True)
# df3.dtypes

(267018, 12)

In [87]:
df3.drop_duplicates(subset=['author', 'created_utc']).shape


(267018, 12)

In [88]:
df3.drop_duplicates(subset=['author', 'id']).shape


(267066, 12)

In [92]:
df3.drop_duplicates(subset=['permalink']).shape


(267066, 12)

In [57]:
df2
# df2.drop(columns=['nest_level', 'author_fullname'])
# df2.drop(columns=['nest_level', 'author_fullname'], inplace=True)

Unnamed: 0,author,author_fullname,created_utc,utc_datetime_str,permalink,score,body,is_submitter,id,link_id,parent_id,nest_level,subreddit,subreddit_id
0,moldyjellybean,,1527772177,2018-05-31 13:09:37,/r/investing/comments/8nhv7o/daily_advice_thre...,2,"Interactive Brokers, Fidelity, Schwab. Which p...",False,dzvm8me,t3_8nhv7o,,1.0,investing,t5_2qhhq
1,Archdukeprinceking,,1527755558,2018-05-31 08:32:38,/r/investing/comments/8nc4vr/warren_buffett_re...,4,Yikes most all of their revenue is from consul...,False,dzvck31,t3_8nc4vr,dzv28hu,,investing,t5_2qhhq
2,andreabrodycloud,,1527752907,2018-05-31 07:48:27,/r/investing/comments/8nc4vr/warren_buffett_re...,5,Intel is working on 10nm for 2019 while AMD is...,False,dzvbb5b,t3_8nc4vr,dzv7ki7,,investing,t5_2qhhq
3,hqtitan,,1527749100,2018-05-31 06:45:00,/r/investing/comments/8nc4vr/warren_buffett_re...,7,IBM is mainly in high performance computing th...,False,dzv9fg3,t3_8nc4vr,dzv28hu,,investing,t5_2qhhq
4,hakkzpets,,1527745756,2018-05-31 05:49:16,/r/investing/comments/8nc4vr/warren_buffett_re...,15,&gt; Do you think Intel and AMD are the ones a...,False,dzv7ki7,t3_8nc4vr,dzv28hu,,investing,t5_2qhhq
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
135158,MoonRei_Razing,t2_2511dxfd,1596240630,2020-08-01 00:10:30,/r/wallstreetbets/comments/i1czqd/how_screwed_...,1,"Being in tech, and making the decisions to pro...",False,fzxobcb,t3_i1czqd,fzwtvxk,,wallstreetbets,t5_2th52
135159,TrapHouseLessons,t2_1evejy95,1596240393,2020-08-01 00:06:33,/r/wallstreetbets/comments/i1ejxp/weekend_disc...,2,If you are a long term investor with a low ris...,False,fzxnuja,t3_i1ejxp,fzxk3kc,,wallstreetbets,t5_2th52
135160,Frostfright,t2_5tgc5n5,1596240363,2020-08-01 00:06:03,/r/wallstreetbets/comments/i1fcfi/which_is_the...,7,"AMD has higher meme potential, but TSM is a gu...",False,fzxnsd1,t3_i1fcfi,,1.0,wallstreetbets,t5_2th52
135161,Coffeepillow,t2_4h49c,1596240140,2020-08-01 00:02:20,/r/wallstreetbets/comments/i1ejxp/weekend_disc...,1,"Down $2500, feeling pretty bad. I could have b...",False,fzxnbxh,t3_i1ejxp,fzxkm2n,,wallstreetbets,t5_2th52


KeyError: "['nest_level', 'author_fullname'] not found in axis"

In [69]:
csv_path = r"E:\Users\Christiaan\Large_Files\Thesis\reddit\comments\filtered\AAPL.csv"
df = pd.read_csv(csv_path)

In [73]:
df['link_id'].str[3:]

0         8nc4vr
1         8nj0th
2         8nj0th
3         8nds9s
4         8n80pn
           ...  
124552    i1ejxp
124553    i1ejxp
124554    i15376
124555    i1ejxp
124556    i1ejxp
Name: link_id, Length: 124557, dtype: object

In [46]:
chunk['link_id']

2000000    t3_hfwyw9
2000001    t3_hfww9a
2000002    t3_hfudwo
2000003    t3_hfqy3p
2000004    t3_hfwyw9
             ...    
2999995    t3_b1lden
2999996    t3_b1nenl
2999997    t3_b1grpq
2999998    t3_b1grpq
2999999    t3_b1ab6k
Name: link_id, Length: 1000000, dtype: object

In [53]:
(chunk['parent_id'].isna() | chunk['link_id'].isin(['t3_hfqy3p']))

2000000     True
2000001    False
2000002    False
2000003     True
2000004     True
           ...  
2999995    False
2999996     True
2999997    False
2999998     True
2999999    False
Length: 1000000, dtype: bool

In [52]:
reply_filter = chunk['parent_id'].isin(['fw0dv3f', 'fw0cet2'])
reply_filter

2000000    False
2000001     True
2000002     True
2000003    False
2000004    False
           ...  
2999995    False
2999996    False
2999997    False
2999998    False
2999999    False
Name: parent_id, Length: 1000000, dtype: bool

In [8]:
dtypes = {'author': 'str',
         'created_utc': 'int64',
         'permalink': 'str',
         'score': 'int64',
         'body': 'str',
         'is_submitter': 'bool',
         'id': 'str',
         'link_id': 'str',
         'parent_id': 'str',
         'subreddit': 'str',
         'subreddit_id': 'str'}

# csv_path = r"E:/Users/Christiaan/Large_Files/Thesis/reddit/comments/filtered/AAPL.csv"
# df = pd.read_csv(csv_path, dtype=dtypes)

In [11]:
csv_path = r"E:\Users\Christiaan\Large_Files\Thesis\reddit\comments\all_comments.csv"

chunksize = 10 ** 6
for chunk in pd.read_csv(csv_path, chunksize=chunksize):
    print(chunk.shape)
    print(chunk.dtypes)

(1000000, 12)
author              object
created_utc          int64
utc_datetime_str    object
permalink           object
score                int64
body                object
is_submitter          bool
id                  object
link_id             object
parent_id           object
subreddit           object
subreddit_id        object
dtype: object
(1000000, 12)
author              object
created_utc          int64
utc_datetime_str    object
permalink           object
score                int64
body                object
is_submitter          bool
id                  object
link_id             object
parent_id           object
subreddit           object
subreddit_id        object
dtype: object
(1000000, 12)
author              object
created_utc          int64
utc_datetime_str    object
permalink           object
score                int64
body                object
is_submitter          bool
id                  object
link_id             object
parent_id           object
subreddit   

KeyboardInterrupt: 

In [20]:
if 'fzxo7s0' in df['id'].to_list():
    print("Yes")

Yes


In [14]:
if 'fzxo7s0' in df['id']:
    print("yes")
else:
    print("NO")

NO
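
The two cells above give different answers because the `in` operator on a pandas Series tests membership in the index labels, not in the values. A minimal sketch with made-up IDs (the series `ids` and its contents are hypothetical):

```python
import pandas as pd

ids = pd.Series(["abc123", "def456", "ghi789"])  # hypothetical comment IDs

in_index = "def456" in ids            # checks the index labels 0, 1, 2
label_in_index = 1 in ids             # 1 is an index label
in_list = "def456" in ids.to_list()   # checks the values via a Python list
in_values = bool(ids.isin(["def456"]).any())  # vectorised check on the values

print(in_index, label_in_index, in_list, in_values)  # False True True True
```

For repeated lookups against a large frame, `isin` (or a `set` built once from the column) is likely to be faster than converting the column to a list for every check.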


In [36]:


csv_path = r"E:\Users\Christiaan\Large_Files\Thesis\reddit\comments\filtered\AAPL.csv"
df = pd.read_csv(csv_path, dtype=dtypes, parse_dates=['utc_datetime_str'])

In [78]:
csv_path = r"E:\Users\Christiaan\Large_Files\Thesis\reddit\posts\filtered\AAPL.csv"
df = pd.read_csv(csv_path)
df.shape

(4597, 14)

In [81]:
df

Unnamed: 0,author,created_utc,full_link,id,num_comments,score,selftext,subreddit,subreddit_id,subreddit_subscribers,title,url,datetime,date
0,el_spidermonkey,1522725419,https://www.reddit.com/r/investing/comments/89...,898p8x,7,3,OLED has dropped 43% in the past three months ...,investing,t5_2qhhq,507760,OLED: Buy That Dip?,https://www.reddit.com/r/investing/comments/89...,2018-04-03 03:16:59,2018-04-03
1,Timelapze,1522691994,https://www.reddit.com/r/investing/comments/89...,892e8d,24,7,When the S&amp;P500 takes a dive at the hands ...,investing,t5_2qhhq,507476,Holding an S&amp;P500 fund opens you up to an ...,https://www.reddit.com/r/investing/comments/89...,2018-04-02 17:59:54,2018-04-02
2,trulytrulyisay,1522633918,https://www.reddit.com/r/investing/comments/88...,88vw7q,13,18,I’m curious to know of executives in the past ...,investing,t5_2qhhq,507103,Have there ever been company executives so suc...,https://www.reddit.com/r/investing/comments/88...,2018-04-02 01:51:58,2018-04-02
3,ziadmiqdadi,1523335130,https://www.reddit.com/r/investing/comments/8b...,8b50z5,1,0,They’re at an all year low and show promise fo...,investing,t5_2qhhq,510126,(STOCK)Thoughts on Cirrus Logic?,https://www.reddit.com/r/investing/comments/8b...,2018-04-10 04:38:50,2018-04-10
4,Zodiac2119,1524812210,https://www.reddit.com/r/investing/comments/8f...,8f9oxn,15,3,"Hey all, First time posting here but I have a ...",investing,t5_2qhhq,515514,"I have 15,000 sitting in a savings account. Bu...",https://www.reddit.com/r/investing/comments/8f...,2018-04-27 06:56:50,2018-04-27
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4592,Dolecavis04,1598545512,https://www.reddit.com/r/wallstreetbets/commen...,iho5mp,0,1,"FUCK this poopy headed, shitty stock that won’...",wallstreetbets,t5_2th52,1439962,Apple... FUCK,https://www.reddit.com/r/wallstreetbets/commen...,2020-08-27 16:25:12,2020-08-27
4593,SanderGGs,1598540832,https://www.reddit.com/r/wallstreetbets/commen...,ihmp4a,52,1,"Alright boys, I know it’s hard as fuck to move...",wallstreetbets,t5_2th52,1439584,Apple Price Forecast,https://www.reddit.com/r/wallstreetbets/commen...,2020-08-27 15:07:12,2020-08-27
4594,spauldingzero,1598450187,https://www.reddit.com/r/wallstreetbets/commen...,igz9wt,8,1,Never seen a split happen for calls I was hold...,wallstreetbets,t5_2th52,1434832,How will calls be affected for the Tesla and A...,https://www.reddit.com/r/wallstreetbets/commen...,2020-08-26 13:56:27,2020-08-26
4595,M_lotta,1598437426,https://www.reddit.com/r/wallstreetbets/commen...,igwcux,60,1,"Hey guys, just wanted any advice relating to m...",wallstreetbets,t5_2th52,1434517,$TSLA &amp; $AAPL Calls,https://www.reddit.com/r/wallstreetbets/commen...,2020-08-26 10:23:46,2020-08-26


In [82]:
df.loc[4595,'full_link']

'https://www.reddit.com/r/wallstreetbets/comments/igwcux/tsla_aapl_calls/'

In [60]:
pd.set_option('display.max_rows', 500)
print(df.shape)
df = df.drop_duplicates(subset=['author', 'date'], keep=False)
print(df.shape)

df = df.drop_duplicates(subset=['selftext'], keep=False)
print(df.shape)


(5121, 14)
(4375, 14)
(4233, 14)


In [77]:
ticker_list = ['AAPL', 'AMD', 'AMZN', 'ATVI', 'BA', 'BABA', 'BAC', 'DIS', 'F', 'GE', 'GME', 'IQ', 'LULU', 'MSFT', 'MU', 'NFLX', 'NVDA', 'SBUX', 'SHOP', 'SNAP', 'SQ', 'TLRY', 'TSLA', 'V', 'WMT']
rootdir = r"E:\Users\Christiaan\Large_Files\Thesis\reddit\posts\filtered"

counter = 0
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        # Create csv_path
        csv_path = os.path.join(subdir, file)
        
        
        # Read csv
        df = pd.read_csv(csv_path)
        
        obs = df.shape[0]
        counter = counter + obs
        print(f"Obs [{str(obs).ljust(12)}] - Total counter [{str(counter).ljust(12)}]", end='\r')

            
counter

Obs [974         ] - Total counter [36244       ]

36244

In [49]:
ticker_list = ['AAPL']
save_dir = r"E:\Users\Christiaan\Large_Files\Thesis\reddit\comments\filtered"

csv_path = r"E:\Users\Christiaan\Large_Files\Thesis\reddit\comments\unfiltered\investing\2018_06.csv"

# Read csv
df = pd.read_csv(csv_path)

print(df.shape)

print(df.shape)
df = df[df['body'].notna()]
df['body'].apply(filter_data_1, ticker=ticker)
# # Loop tickers
# for ticker in ticker_list:
#     print(f"Processing {ticker.ljust(4)} for [{csv_path}]", end='\r')
#     # Apply filter
#     filtered_df = filter_mentions(df, ticker)

#     # Creating save_path and saving file (either new file or appending)
#     save_path = os.path.join(save_dir, f"{ticker}.csv").replace('\\', '/')
#     save_df(filtered_df, save_path)

(38535, 16)
(38535, 16)


0        False
1        False
2        False
3        False
4        False
         ...  
38530    False
38531    False
38532    False
38533    False
38534    False
Name: body, Length: 38534, dtype: bool

In [11]:
@loop_tickers
def calculate_sentiment_ticker(*args, **kwargs):
    print(args)
    return "yes"
#     # Setting up save location
#     ticker = kwargs['ticker']
#     save_dir = r"E:\Users\Christiaan\Large_Files\Thesis\Twitter\sentiment\VADER"
#     save_path = os.path.join(save_dir, f"{ticker}.csv").replace('\\', '/')
    
#     # Check if file already exists and skip sentiment calculation if file exists
#     if os.path.isfile(save_path):
#         print(f"File exists: [{save_path}]")
        
    
#     else:  
#         csv_path = kwargs['csv_path']

#         # Read csv
#         df = pd.read_csv(csv_path)

#         # Filter and clean data. Also perform VADER sentiment scoring.
#         return_df = clean_data(df, ticker)

#         # Count the posts for each filter 
#         results_df = count_posts(return_df)

#         # Calculate sentiment scores for each method
#         sentiment_measures = calc_sent_measures(return_df)
        
#         # Saving the dataframe
#         sentiment_measures.to_csv(save_path, encoding='utf-8', index=False)
    

filedir = r"E:\Users\Christiaan\Large_Files\Thesis\reddit\comments\unfiltered"
calculate_sentiment_ticker(file_dir= filedir)

In [13]:
ticker_list = ['AAPL', 'AMD', 'AMZN', 'ATVI', 'BA', 'BABA', 'BAC', 'DIS', 'F', 'GE', 'GME', 'IQ', 'LULU', 'MSFT', 'MU', 'NFLX', 'NVDA', 'SBUX', 'SHOP', 'SNAP', 'SQ', 'TLRY', 'TSLA', 'V', 'WMT']
import time

for ticker in ticker_list:
    print(f"Processing: {ticker.ljust(4)}", end='\r')
    time.sleep(1)


Processing: ATVI

KeyboardInterrupt: 