# Project 3: Natural Language Processing of Subreddit Posts

## Problem Statement

As a new investment company which has two main trading desks - one for traditional securities and another for cryptocurrency, we are looking to automate the monitoring of reddit posts related to investing for new investing leads for both desks. Through this new leads and hot trends, we hope to filter this information to the specific trading desks. As such, we need a model to analyse and categorise the reddit posts for further review & investigation by either the securities or crypto trading desks. 

### Contents:

- [Background](#Background)
- [Data Import & Cleaning](#Data-Import-&-Cleaning)
- [Data Dictionary](#Data-Dictionary)

## Background

Reddit is home to thousands of communities, endless conversation, and authentic human connection. Whether one is into breaking news, sports, TV fan theories, or a never-ending stream of the internet's cutest animals, there's a community on Reddit for everyone. With approximately 52 million active users and 50 billion monthly views, Reddit is a platform for people of all walks of life to come together and join the communities of their interest.
* [Reddit](https://www.redditinc.com/)

The subreddits we have chosen to focus on are the Investing and CryptoCurrency subreddits. 
* [r/investing](https://www.reddit.com/r/investing/)
* [r/CryptoCurrency](https://www.reddit.com/r/CryptoCurrency/)

In this subreddits & communities, we observe the engagement of authentic daily discussions. The discussions on the Investing is known to accomodate a wide range of financial discussions, from financial news, to market data, and traditonal securities. On the other hand, CryptoCurrency is a digital or virtual currency that is secured by cryptography, which makes it nearly impossible to counterfeit or double-spend. Many cryptocurrencies are decentralized networks based on blockchain technology—a distributed ledger enforced by a disparate network of computers. Hence, this makes CryptoCurrency an emerging form of investment that has been gaining traction for many individuals as well. 

In [2]:
# Import standard libraries
import pandas as pd
import numpy as np

# Import time- and API- related libraries
import time, requests
from datetime import datetime
import json

# Import warnings to remove flags when project is complete
import warnings
warnings.filterwarnings('ignore')

#import pre-processing libraries for cleaning the data
import string
import re
import nltk
from nltk.tokenize import RegexpTokenizer, sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

pd.set_option('display.max_colwidth', 100)

## Data Import & Cleaning

Under Data Import, we will collect data from Reddit using Pushift API for Reddit. The two subreddits we have chosen are: r/investing and r/CryptoCurrency. In collecting this data, we have several objectives and observations:

### Objective

1. Collect non-duplicate & unique posts before 25 Oct 2021 GMT +8 2359. 
2. Collect 'title', 'subreddit', 'selftext' and 'created_utc'.

### Scrap Data

#### Create a scrapper function

In [3]:
def scrape_subreddit(subreddit, num_of_posts = 1_050):
# subreddit: str, name of subreddit to search for
# loops: int, number of times to request posts

    # Define pushshift's base URL
    url = 'https://api.pushshift.io/reddit/search/submission'

    # Define columns i want to keep
    columns_keep = ['author', 'created_utc', 'selftext',
                    'subreddit', 'title']
        
    # Set current time in UTC timestamp format
    df_time = 1635177599
    
    # Create empty dataframe for concating loop later on
    df = pd.DataFrame(columns = columns_keep)
    df_length = len(df)
    
    # Create dataframe of posts
    while df_length < num_of_posts:

        # Set params
        # size: int, number of posts per request (max 100 per pushshift api guide)
        params = {
            'subreddit' : subreddit,
            'size' : 100,
            'before' : df_time
        }

        # Get request from pushshift.io, return error message if it is not from 200 series.
        #res = requests.get(url, params)
        try:
            res = requests.get(url, params)

            # Consider any status other than 2xx an error
            if (res.status_code // 100) != 2:
                return "Error: Unexpected response {}".format(res)

        except res.exceptions.RequestException as e:
            # A serious problem happened, like an SSLError or InvalidURL
            return "Error: {}".format(e)
        
        # define the .json data
        data = res.json()
        
        # Concat relevant .json data into dataframe
        df = pd.concat([
            df,
            pd.DataFrame(data['data'])[columns_keep]
        ], 
            axis = 0
        )
        
        # Clean data
        ##  1. drop empty posts
        df = df.loc[
            ((df['selftext'] != '') & (df['selftext'] != '[removed]') & (df['selftext'] != '[deleted]')),
            :
        ].sort_values('created_utc', ascending = False)
        
        ##  2. drop duplicates by author with same text
        df = df.drop_duplicates(subset=['selftext', 'author'], keep = 'first').reset_index(drop=True)
        
        ##  3. drop nulls, especially true for selftext column
        df.dropna(inplace = True)
        
        # Find earliest-dated post in dataframe
        df_time = df['created_utc'].min()
        print(f"min time is {df_time}.")
        
        # Find current length of dataframe
        df_length = len(df)
        print(f"df for r\\{subreddit} has {df_length} posts now.")
       
        # Being polite when scrapping, and to not overload pushshift server 
        # Sleep every 3 seconds
        time.sleep(3)
        
    return df[:num_of_posts]

**Scrapper function description:**

Arguments to feed in are (subreddit, num_of_posts)
- "subreddit" is the name of the target subreddit
- "num_of_posts" is the number of good-quality reddit posts to extract (default value is set at 500)

The size is set at 100 for each request for the posts. This is because per the pushshift api guide, the maximum number of requests per push is 100.

To ensure that non-duplicate & unique posts are collected, we do the following steps in our data cleaning:
1. Drop empty posts
- Since our model is to identify the latest trends for the investing & cryptocurrency subreddits via keywords, we will exclude posts that do not have any text. This could be empty posts or posts with images only.
- Notwithstanding the above, a relatively significant number of posts that are advertisements are labelled as "[removed]" in the 'selftext' columm. Hence, this posts are removed as well so that we will be able to obtain a more organic data from the subreddits. 

2. Drop duplicate posts by the same author with the same text
- This ensures that we obtain non-duplicate & unique posts only. This is because posts could be reposted by the same author. 

3. Drop nulls
- Nulls are removed to ensure that our model runs against posts with text only. 

### Get Subreddits and saving to CSV

#### investing1.csv

In [4]:
investing1_5 = scrape_subreddit('investing', num_of_posts=5)
print('shape', investing1_5.shape)

# saving the data to a different location
investing1_5.to_csv ('./data/investing1_5.csv', index = False)

min time is 1635087207.
df for r\investing has 17 posts now.
shape (5, 5)


In [5]:
investing1 = scrape_subreddit('investing')
print('shape', investing1.shape)

# saving the data to a different location
investing1.to_csv ('./data/investing1.csv', index = False)

min time is 1635087207.
df for r\investing has 17 posts now.
min time is 1634996375.
df for r\investing has 26 posts now.
min time is 1634905905.
df for r\investing has 38 posts now.
min time is 1634840262.
df for r\investing has 48 posts now.
min time is 1634769075.
df for r\investing has 56 posts now.
min time is 1634711031.
df for r\investing has 71 posts now.
min time is 1634636512.
df for r\investing has 84 posts now.
min time is 1634565433.
df for r\investing has 95 posts now.
min time is 1634442388.
df for r\investing has 111 posts now.
min time is 1634329894.
df for r\investing has 122 posts now.
min time is 1634258023.
df for r\investing has 137 posts now.
min time is 1634178743.
df for r\investing has 146 posts now.
min time is 1634102594.
df for r\investing has 158 posts now.
min time is 1634005480.
df for r\investing has 171 posts now.
min time is 1633922608.
df for r\investing has 183 posts now.
min time is 1633817055.
df for r\investing has 198 posts now.
min time is 1633

#### crypto1.csv

In [6]:
crypto1 = scrape_subreddit('CryptoCurrency')
print('shape', crypto1.shape)

# saving the data to a different location
crypto1.to_csv ('./data/crypto1.csv', index = False)

min time is 1635174926.
df for r\CryptoCurrency has 25 posts now.
min time is 1635172622.
df for r\CryptoCurrency has 58 posts now.
min time is 1635170275.
df for r\CryptoCurrency has 92 posts now.
min time is 1635167428.
df for r\CryptoCurrency has 115 posts now.
min time is 1635163385.
df for r\CryptoCurrency has 146 posts now.
min time is 1635158371.
df for r\CryptoCurrency has 171 posts now.
min time is 1635152366.
df for r\CryptoCurrency has 192 posts now.
min time is 1635144575.
df for r\CryptoCurrency has 215 posts now.
min time is 1635138325.
df for r\CryptoCurrency has 246 posts now.
min time is 1635133070.
df for r\CryptoCurrency has 263 posts now.
min time is 1635127038.
df for r\CryptoCurrency has 287 posts now.
min time is 1635120785.
df for r\CryptoCurrency has 309 posts now.
min time is 1635115479.
df for r\CryptoCurrency has 333 posts now.
min time is 1635110721.
df for r\CryptoCurrency has 362 posts now.
min time is 1635106717.
df for r\CryptoCurrency has 389 posts now

In [7]:
investing1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1050 entries, 0 to 1049
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   author       1050 non-null   object
 1   created_utc  1050 non-null   object
 2   selftext     1050 non-null   object
 3   subreddit    1050 non-null   object
 4   title        1050 non-null   object
dtypes: object(5)
memory usage: 49.2+ KB


In [8]:
crypto1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1050 entries, 0 to 1049
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   author       1050 non-null   object
 1   created_utc  1050 non-null   object
 2   selftext     1050 non-null   object
 3   subreddit    1050 non-null   object
 4   title        1050 non-null   object
dtypes: object(5)
memory usage: 49.2+ KB


In [9]:
#Check that all nan posts are removed by my scrapper
print(f"# of Nulls for investing1 df \n {investing1.isnull().sum()}")
print()
print(f"# of Nulls for crypto1 df \n {crypto1.isnull().sum()}")

# of Nulls for investing1 df 
 author         0
created_utc    0
selftext       0
subreddit      0
title          0
dtype: int64

# of Nulls for crypto1 df 
 author         0
created_utc    0
selftext       0
subreddit      0
title          0
dtype: int64


In [10]:
# Check that reposts by authors are removed by my scrapper
investing1.duplicated().sum(), crypto1.duplicated().sum()

(0, 0)

#### Combining Dataframe for both subreddits

In [11]:
combined_df = pd.concat(objs=[investing1, crypto1], axis=0)
combined_df.drop_duplicates(subset=['selftext'], inplace=True)
combined_df.reset_index(inplace=True, drop=True)
combined_df['subreddit'].value_counts()

investing         1050
CryptoCurrency    1049
Name: subreddit, dtype: int64

In [12]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2099 entries, 0 to 2098
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   author       2099 non-null   object
 1   created_utc  2099 non-null   object
 2   selftext     2099 non-null   object
 3   subreddit    2099 non-null   object
 4   title        2099 non-null   object
dtypes: object(5)
memory usage: 82.1+ KB


In [13]:
#subreddit is the target variable, hence we shall let 'investing' subreddit be 1, and the 'cryptocurrency' subreddit be 0
combined_df.drop(columns=['created_utc'], inplace=True)
combined_df['subreddit'] = combined_df['subreddit'].map(lambda x: 1 if x == 'investing' else 0)

In [14]:
combined_df.head()

Unnamed: 0,author,selftext,subreddit,title
0,mildcharts,[Snapshot](https://www.tradingview.com/x/xdfUPjmd/)\n\n&amp;#x200B;\n\n* Double top on ATH level...,1,A Great Short Setup on NYSE
1,mildcharts,* Double top on ATH level\n* Bullish RSI divergence,1,A Great Short Setup on NYSE
2,EselSchwanz,I'm 25 years old and just inherited a portion of the family company which sums up to about $950k...,1,Inherited large amount of equity.. Not sure whether to liquidate and invest now or wait
3,AutoModerator,Have a general question? Want to offer some commentary on markets? Maybe you would just like t...,1,"Daily General Discussion and spitballin thread - October 25, 2021"
4,AutoModerator,"If your question is ""I have $10,000, what do I do?"" or other ""advice for my personal situation"" ...",1,"Daily Advice Thread - All basic help or advice questions must be posted here. October 25, 2021"


## Additional Data Cleaning

Additional data cleaning steps were engaged to do the following:
- removal of links or URLs
    - This refers to links that start with either 'http' or 'https'.
- removal of HTML symbols
    - For example, code behind symbols like '&amp' and greater than symbol like '&gt' are removed. 
- removal of other HTML special terms and texts
    - For example, non-breaking space like '#xa0;' and white spaces like '#x200b;' are removed. 
- removal of numbers
- removal of special characters
- ensure that all text are within the Basic Multilingual Plane (BMP) of Unicode. 
    - This allows us to filter text beyond the Basic Multilingual Plane (BMP) of Unicode. For example, emojis, and alphabetical characters with accents (like the tilde) are removed as well.
    
The following links were referenced against for the additional Data Cleaning steps:
- https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python
- https://www.w3schools.com/html/html_symbols.asp
- https://www.w3.org/TR/html4/struct/text.html
- https://www.programiz.com/python-programming/regex
- https://stackoverflow.com/questions/36283818/remove-characters-outside-of-the-bmp-emojis-in-python-3
- https://towardsdatascience.com/the-real-world-as-seen-on-twitter-sentiment-analysis-part-one-5ac2d06b63fb
- https://stackoverflow.com/questions/4328500/how-can-i-strip-all-punctuation-from-a-string-in-javascript-using-regex

In [38]:
pd.read_csv('./data/investing1.csv')

Unnamed: 0,author,created_utc,selftext,subreddit,title
0,mildcharts,1635167989,[Snapshot](https://www.tradingview.com/x/xdfUPjmd/)\n\n&amp;#x200B;\n\n* Double top on ATH level...,investing,A Great Short Setup on NYSE
1,mildcharts,1635167785,* Double top on ATH level\n* Bullish RSI divergence,investing,A Great Short Setup on NYSE
2,EselSchwanz,1635167216,I'm 25 years old and just inherited a portion of the family company which sums up to about $950k...,investing,Inherited large amount of equity.. Not sure whether to liquidate and invest now or wait
3,AutoModerator,1635152534,Have a general question? Want to offer some commentary on markets? Maybe you would just like t...,investing,"Daily General Discussion and spitballin thread - October 25, 2021"
4,AutoModerator,1635152476,"If your question is ""I have $10,000, what do I do?"" or other ""advice for my personal situation"" ...",investing,"Daily Advice Thread - All basic help or advice questions must be posted here. October 25, 2021"
...,...,...,...,...,...
1045,DisjointedHuntsville,1629116301,Bloomberg just reported that Tesla’s autopilot software is under investigation by the NHTSA\n\nh...,investing,U.S. Opens Formal Probe Into Tesla Autopilot System
1046,dotasolosafi,1629115515,"I wanted to get some stocks on the hongkong stock exchange and in hkd, but instead of using the ...",investing,USD.HKD sell trade went wrong?
1047,brieobreis,1629115364,I was talking to a friend about investing in something and he advised me to invest in airplane c...,investing,Should I invest in airplane companies?
1048,shinobifujin,1629104860,I was taking to someone in Japan who recommended it to me and so I checked it out and signed up ...,investing,Has anyone heard of Alitass


In [39]:
investing1 = pd.read_csv('./data/investing1.csv')

In [40]:
crypto1 = pd.read_csv('./data/crypto1.csv')

In [41]:
crypto1

Unnamed: 0,author,created_utc,selftext,subreddit,title
0,Markmanus,1635177574,"So I have to assume as the title says above. In the last couple of weeks, especially since CDC (...",CryptoCurrency,Is Binance afraid of Crypto.com?
1,frostybitz,1635177350,We've all made silly mistakes since our inception in the cryptoverse. Whether that be selling to...,CryptoCurrency,What's been your worst move in Crypto? (So far)
2,Mediocre-Sale8473,1635177349,Ok so I'll start with the article link from NPR:\n\nhttps://www.npr.org/2021/10/25/1048485043/ir...,CryptoCurrency,IRS wants to monitor your bank account flow. This should have people worried regardless of wheth...
3,Many_Arm7466,1635177185,"Hello there do you wish you could put some more money into Crypto dip, but you already allocated...",CryptoCurrency,Financial Lifehacks To Get Extra Money for Buying Dips
4,silver_sean,1635177106,Check out the best play-to-earn crypto game on the market - [Coin Hunt World](https://coinhunt.g...,CryptoCurrency,Want to earn some extra Crypto on the side? Check out this new game and start stacking extra Bit...
...,...,...,...,...,...
1045,qqwe22,1634984837,"Hello, hello traders, stakers and moon farmers from r/CryptoCurrency, the best online cryptocurr...",CryptoCurrency,What crypto you own perform the best right now?
1046,BerthjeTTV,1634984822,Hi guys\n\nJust a reminder for people who doesn´t know that there is a faucet out here for MOONs...,CryptoCurrency,Know There Is A Moon Faucet!
1047,kiratiiiii,1634984761,"Yesterday (22 October 2021) General Prayut Chan-O-Cha, Prime Minister of Thailand who overthrew ...",CryptoCurrency,Why old dictators fear cryptocurrency? Thai junta is warning young investors to beware of the ri...
1048,Embarrassed_Glass676,1634984438,"Everything seems really overwhelming! there are so many different coins, staking, lending, I rea...",CryptoCurrency,Many of us in this subreddit are extremely new to crypto currency. What are some important rules...


In [44]:
combined_df = pd.concat(objs=[investing1, crypto1], axis=0)
combined_df.drop_duplicates(subset=['selftext'], inplace=True)
combined_df.reset_index(inplace=True, drop=True)
combined_df['subreddit'].value_counts()

investing         1050
CryptoCurrency    1049
Name: subreddit, dtype: int64

In [45]:
combined_df

Unnamed: 0,author,created_utc,selftext,subreddit,title
0,mildcharts,1635167989,[Snapshot](https://www.tradingview.com/x/xdfUPjmd/)\n\n&amp;#x200B;\n\n* Double top on ATH level...,investing,A Great Short Setup on NYSE
1,mildcharts,1635167785,* Double top on ATH level\n* Bullish RSI divergence,investing,A Great Short Setup on NYSE
2,EselSchwanz,1635167216,I'm 25 years old and just inherited a portion of the family company which sums up to about $950k...,investing,Inherited large amount of equity.. Not sure whether to liquidate and invest now or wait
3,AutoModerator,1635152534,Have a general question? Want to offer some commentary on markets? Maybe you would just like t...,investing,"Daily General Discussion and spitballin thread - October 25, 2021"
4,AutoModerator,1635152476,"If your question is ""I have $10,000, what do I do?"" or other ""advice for my personal situation"" ...",investing,"Daily Advice Thread - All basic help or advice questions must be posted here. October 25, 2021"
...,...,...,...,...,...
2094,qqwe22,1634984837,"Hello, hello traders, stakers and moon farmers from r/CryptoCurrency, the best online cryptocurr...",CryptoCurrency,What crypto you own perform the best right now?
2095,BerthjeTTV,1634984822,Hi guys\n\nJust a reminder for people who doesn´t know that there is a faucet out here for MOONs...,CryptoCurrency,Know There Is A Moon Faucet!
2096,kiratiiiii,1634984761,"Yesterday (22 October 2021) General Prayut Chan-O-Cha, Prime Minister of Thailand who overthrew ...",CryptoCurrency,Why old dictators fear cryptocurrency? Thai junta is warning young investors to beware of the ri...
2097,Embarrassed_Glass676,1634984438,"Everything seems really overwhelming! there are so many different coins, staking, lending, I rea...",CryptoCurrency,Many of us in this subreddit are extremely new to crypto currency. What are some important rules...


In [48]:
#subreddit is the target variable, hence we shall let 'investing' subreddit be 1, and the 'cryptocurrency' subreddit be 0
combined_df.drop(columns=['created_utc'], inplace=True)
combined_df['subreddit'] = combined_df['subreddit'].map(lambda x: 1 if x == 'investing' else 0)

In [49]:
combined_df.to_csv('./data/combined_df.csv',index = False)

In [50]:
combined_df

Unnamed: 0,author,selftext,subreddit,title
0,mildcharts,[Snapshot](https://www.tradingview.com/x/xdfUPjmd/)\n\n&amp;#x200B;\n\n* Double top on ATH level...,1,A Great Short Setup on NYSE
1,mildcharts,* Double top on ATH level\n* Bullish RSI divergence,1,A Great Short Setup on NYSE
2,EselSchwanz,I'm 25 years old and just inherited a portion of the family company which sums up to about $950k...,1,Inherited large amount of equity.. Not sure whether to liquidate and invest now or wait
3,AutoModerator,Have a general question? Want to offer some commentary on markets? Maybe you would just like t...,1,"Daily General Discussion and spitballin thread - October 25, 2021"
4,AutoModerator,"If your question is ""I have $10,000, what do I do?"" or other ""advice for my personal situation"" ...",1,"Daily Advice Thread - All basic help or advice questions must be posted here. October 25, 2021"
...,...,...,...,...
2094,qqwe22,"Hello, hello traders, stakers and moon farmers from r/CryptoCurrency, the best online cryptocurr...",0,What crypto you own perform the best right now?
2095,BerthjeTTV,Hi guys\n\nJust a reminder for people who doesn´t know that there is a faucet out here for MOONs...,0,Know There Is A Moon Faucet!
2096,kiratiiiii,"Yesterday (22 October 2021) General Prayut Chan-O-Cha, Prime Minister of Thailand who overthrew ...",0,Why old dictators fear cryptocurrency? Thai junta is warning young investors to beware of the ri...
2097,Embarrassed_Glass676,"Everything seems really overwhelming! there are so many different coins, staking, lending, I rea...",0,Many of us in this subreddit are extremely new to crypto currency. What are some important rules...


In [60]:
combined_df = pd.read_csv('./data/combined_df.csv')

In [61]:
def clean(row):
               
    # Remove links or URLs
    row['selftext'] = re.sub(
        pattern=r'https?:\/\/.*\/\w*', 
        repl='', 
        string=row['selftext'],
        flags=re.M)
    row['title'] = re.sub(
        pattern=r'https?:\/\/.*\/\w*', 
        repl='', 
        string=row['title'],
        flags=re.M)
    
    # Remove HTML special entities (e.g.. &amp, &gt;)
    row['selftext'] = re.sub(
        pattern=r'\&\w*;',
        repl='',
        string=row['selftext'])
    row['title'] = re.sub(
        pattern=r'\&\w*;',
        repl='',
        string=row['title'])
      
    # Remove other html special terms like #x200B; #xa0; etc
    row['selftext'] = re.sub(
        pattern='#x\w*;?_?',
        repl='',
        string=row['selftext'])
    row['title'] = re.sub(
        pattern='#x\w*;?_?',
        repl='',
        string=row['title'])
    
    # Remove all digits
    row['selftext'] = re.sub(
        pattern=r'\d+',
        repl='',
        string=row['selftext'])
    row['title'] = re.sub(
        pattern=r'\d+',
        repl='',
        string=row['title'])
    
    # Remove all special characters
    row['selftext'] = re.sub(
        pattern=r'\W+',
        repl=' ',
        string=row['selftext'])
    row['title'] = re.sub(
        pattern=r'\W+',
        repl=' ',
        string=row['title'])
    
    return row

In [62]:
combined_df2 = combined_df.apply(clean, axis=1)

In [63]:
combined_df2.head()

Unnamed: 0,author,selftext,subreddit,title
0,mildcharts,Snapshot Double top on ATH level Bullish RSI divergence,1,A Great Short Setup on NYSE
1,mildcharts,Double top on ATH level Bullish RSI divergence,1,A Great Short Setup on NYSE
2,EselSchwanz,I m years old and just inherited a portion of the family company which sums up to about k and I ...,1,Inherited large amount of equity Not sure whether to liquidate and invest now or wait
3,AutoModerator,Have a general question Want to offer some commentary on markets Maybe you would just like to th...,1,Daily General Discussion and spitballin thread October
4,AutoModerator,If your question is I have what do I do or other advice for my personal situation questions you ...,1,Daily Advice Thread All basic help or advice questions must be posted here October


In [64]:
def only_BMP(text):
    # Remove characters beyond Basic Multilingual Plane (BMP) of Unicode. 
    # Plane 0 (U+0000 - U+FFFF) is called the Basic Multilingual Plane (BMP)
    # it contains the most frequent characters. It was populated starting in Unicode 1.0.
    text = ''.join(c for c in text if c <= '\uFFFF') 
    return text  

In [65]:
combined_df2['selftext'] = combined_df2['selftext'].apply(only_BMP)

In [66]:
combined_df2['title'] = combined_df2['title'].apply(only_BMP)

In [67]:
combined_df2

Unnamed: 0,author,selftext,subreddit,title
0,mildcharts,Snapshot Double top on ATH level Bullish RSI divergence,1,A Great Short Setup on NYSE
1,mildcharts,Double top on ATH level Bullish RSI divergence,1,A Great Short Setup on NYSE
2,EselSchwanz,I m years old and just inherited a portion of the family company which sums up to about k and I ...,1,Inherited large amount of equity Not sure whether to liquidate and invest now or wait
3,AutoModerator,Have a general question Want to offer some commentary on markets Maybe you would just like to th...,1,Daily General Discussion and spitballin thread October
4,AutoModerator,If your question is I have what do I do or other advice for my personal situation questions you ...,1,Daily Advice Thread All basic help or advice questions must be posted here October
...,...,...,...,...
2094,qqwe22,Hello hello traders stakers and moon farmers from r CryptoCurrency the best online cryptocurrenc...,0,What crypto you own perform the best right now
2095,BerthjeTTV,Hi guys Just a reminder for people who doesn t know that there is a faucet out here for MOONs Wh...,0,Know There Is A Moon Faucet
2096,kiratiiiii,Yesterday October General Prayut Chan O Cha Prime Minister of Thailand who overthrew the democra...,0,Why old dictators fear cryptocurrency Thai junta is warning young investors to beware of the ris...
2097,Embarrassed_Glass676,Everything seems really overwhelming there are so many different coins staking lending I really ...,0,Many of us in this subreddit are extremely new to crypto currency What are some important rules ...


In [68]:
combined_df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2099 entries, 0 to 2098
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   author     2099 non-null   object
 1   selftext   2099 non-null   object
 2   subreddit  2099 non-null   int64 
 3   title      2099 non-null   object
dtypes: int64(1), object(3)
memory usage: 65.7+ KB


In [55]:
combined_df2[['selftext', 'title']] = combined_df2[['selftext', 'title']].apply(only_BMP)

In [56]:
# Final check for null values
print(f"# of Nulls for investing1 df \n {investing1.isnull().sum()}")
print()
print(f"# of Nulls for crypto1 df \n {crypto1.isnull().sum()}")

# of Nulls for investing1 df 
 author         0
created_utc    0
selftext       0
subreddit      0
title          0
dtype: int64

# of Nulls for crypto1 df 
 author         0
created_utc    0
selftext       0
subreddit      0
title          0
dtype: int64


In [58]:
combined_df2.head(20)

Unnamed: 0,author,selftext,subreddit,title
0,mildcharts,Snapshot Double top on ATH level Bullish RSI divergence Double top on ATH level Bullish RSI div...,1,A Great Short Setup on NYSEA Great Short Setup on NYSEInherited large amount of equity Not sure ...
1,mildcharts,Snapshot Double top on ATH level Bullish RSI divergence Double top on ATH level Bullish RSI div...,1,A Great Short Setup on NYSEA Great Short Setup on NYSEInherited large amount of equity Not sure ...
2,EselSchwanz,Snapshot Double top on ATH level Bullish RSI divergence Double top on ATH level Bullish RSI div...,1,A Great Short Setup on NYSEA Great Short Setup on NYSEInherited large amount of equity Not sure ...
3,AutoModerator,Snapshot Double top on ATH level Bullish RSI divergence Double top on ATH level Bullish RSI div...,1,A Great Short Setup on NYSEA Great Short Setup on NYSEInherited large amount of equity Not sure ...
4,AutoModerator,Snapshot Double top on ATH level Bullish RSI divergence Double top on ATH level Bullish RSI div...,1,A Great Short Setup on NYSEA Great Short Setup on NYSEInherited large amount of equity Not sure ...
5,Worried_individual_,Snapshot Double top on ATH level Bullish RSI divergence Double top on ATH level Bullish RSI div...,1,A Great Short Setup on NYSEA Great Short Setup on NYSEInherited large amount of equity Not sure ...
6,Mtns_to_Sea,Snapshot Double top on ATH level Bullish RSI divergence Double top on ATH level Bullish RSI div...,1,A Great Short Setup on NYSEA Great Short Setup on NYSEInherited large amount of equity Not sure ...
7,somalley3,Snapshot Double top on ATH level Bullish RSI divergence Double top on ATH level Bullish RSI div...,1,A Great Short Setup on NYSEA Great Short Setup on NYSEInherited large amount of equity Not sure ...
8,DesertAlpine,Snapshot Double top on ATH level Bullish RSI divergence Double top on ATH level Bullish RSI div...,1,A Great Short Setup on NYSEA Great Short Setup on NYSEInherited large amount of equity Not sure ...
9,Hobb7T,Snapshot Double top on ATH level Bullish RSI divergence Double top on ATH level Bullish RSI div...,1,A Great Short Setup on NYSEA Great Short Setup on NYSEInherited large amount of equity Not sure ...


In [22]:
combined_df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2099 entries, 0 to 2098
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   author     2099 non-null   object
 1   selftext   2099 non-null   object
 2   subreddit  2099 non-null   int64 
 3   title      2099 non-null   object
dtypes: int64(1), object(3)
memory usage: 65.7+ KB


### List of Stopwords

In [69]:
#create list of additional stopwords

stopword = nltk.corpus.stopwords.words('english')

additional_stopwords = ['like', 'blah', 'poll',
                       'get', 'thank', 'thankyou',
                       'you', 'word', 'does',
                       'anyone', 'know', 'blah', 
                       'hear', 'words']

all_stopwords = stopwords.words('english')
all_stopwords.extend(additional_stopwords)

### Tokenizing, Dropping of Stopwords and Lemmatizing

The words in the 'selftext' and 'title' columns are tokenize, stop words were then dropped, and finally, lemmatized. A new column was created for each step. Additionally, a new column to combine the words in the 'selftext' and 'title' columns were created, and this words were stored as a string.

In [70]:
def tok_drop_lem(text):
    
    # Instantiate tokenizer.
    tokenizer = RegexpTokenizer(r'\w+')
    #tokenize
    tokens = re.split('W\+', text)
    
    # Drop stop words
    stopword = nltk.corpus.stopwords.words('english')
    text_stop = [word for word in tokens if word not in stopword]

    # Lemmatizing
    wn = nltk.WordNetLemmatizer()
    lemma = [wn.lemmatize(word) for word in text_stop if word not in all_stopwords]
    
    return lemma

In [71]:
combined_df2['selftext'] = combined_df2['selftext'].apply(tok_drop_lem)

In [None]:
combined_df2['title'] = combined_df2['title'].apply(tok_drop_lem)

In [37]:
combined_df2.head(5)

Unnamed: 0,subreddit,all_text
0,1,[A Great Short Setup on NYSEA Great Short Setup on NYSEInherited large amount of equity Not sure...
1,1,[A Great Short Setup on NYSEA Great Short Setup on NYSEInherited large amount of equity Not sure...
2,1,[A Great Short Setup on NYSEA Great Short Setup on NYSEInherited large amount of equity Not sure...
3,1,[A Great Short Setup on NYSEA Great Short Setup on NYSEInherited large amount of equity Not sure...
4,1,[A Great Short Setup on NYSEA Great Short Setup on NYSEInherited large amount of equity Not sure...


In [28]:
combined_df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2099 entries, 0 to 2098
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   author     2099 non-null   object
 1   selftext   2099 non-null   object
 2   subreddit  2099 non-null   int64 
 3   title      2099 non-null   object
dtypes: int64(1), object(3)
memory usage: 65.7+ KB


In [29]:
#combine the title & selftext

combined_df2['all_text'] = combined_df2['title'] + combined_df2['selftext']

In [30]:
combined_df2.head(1)

Unnamed: 0,author,selftext,subreddit,title,all_text
0,mildcharts,[ Snapshot Double top on ATH level Bullish RSI divergence Double top on ATH level Bullish RSI di...,1,[A Great Short Setup on NYSEA Great Short Setup on NYSEInherited large amount of equity Not sure...,[A Great Short Setup on NYSEA Great Short Setup on NYSEInherited large amount of equity Not sure...


All columns, with the exception of the 'subreddit' and the 'all_text' were dropped. 

The 'author' column was no longer needed our initial data cleaning had ensured that there were no duplicated and/or repeated posts by the same author. The remaining dropped columns were dropped as the new 'all_text' column is already a combination of the 'title' and 'selftext'.

The 'subreddit' column is not dropped as it is our target variable that allows us to identify which subreddit the post belongs to.

In [31]:
combined_df2.drop(columns=['author', 'selftext', 'title'], inplace=True)

In [32]:
combined_df2.head(1)

Unnamed: 0,subreddit,all_text
0,1,[A Great Short Setup on NYSEA Great Short Setup on NYSEInherited large amount of equity Not sure...


In [33]:
combined_df2.shape

(2099, 2)

In [34]:
combined_df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2099 entries, 0 to 2098
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  2099 non-null   int64 
 1   all_text   2099 non-null   object
dtypes: int64(1), object(1)
memory usage: 32.9+ KB


In [None]:
# saving the merged data to a different location

combined_df2.to_csv('./data/combined_df2.csv',index = False)

## Data Dictionary

|Feature|Type|Dataset|Description|
|---|---|---|---| 
|subreddit|*int*|combined_df2.csv|The subreddit from Reddit where the data is collected from. The chosen subreddits are r/investing & r/CryptoCurrency and they are indicated by the integers '1' and '0' respectively | 
|all_text|*object*|combined_df2.csv|A list of all the words in the title and selftext for each post in the subreddit| 