## Navigation
1. **Subreddit Scraping**
2. Working Notebook
    1. Background
    2. Executive Summary
    3. Problem Statement
    4. Methodology 
    5. Data Import
    6. Data Preprocessing
    7. Exploratory Data Analysis (EDA)
    8. Model and Evaluation
    9. Final Model Selection and Threshold Selection
    10. Conclusion and Recommendation

In [1]:
import requests
import json
import pandas as pd
import time
import re

In [2]:
#function to get the reddit posts, paging backwards in time through the Pushshift API
def request_post(subreddit,n):
    url = 'https://api.pushshift.io/reddit/search/submission'
    post = []
    utc = 1667738423 #fixed starting cutoff in epoch seconds
    while len(post) < n: #1000 for validation set and 5000 for training set & test set
        params = {
            'subreddit': subreddit,
            'size': 100,
            'before': utc
        }
        req = requests.get(url,params=params)
        if req.status_code == 200:
            new_data = req.json()['data']
            if not new_data: #no older posts left, stop instead of looping forever
                break
            for new in new_data:
                post.append(new)
                if new['created_utc'] < utc:
                    utc = new['created_utc'] #oldest post seen so far becomes the next cutoff
        time.sleep(2) #pause between requests
        print(utc) #making sure it pulls earlier data
    return post
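
The loop in `request_post` is cursor-style pagination: each request asks only for posts older than the oldest timestamp seen so far. The sketch below shows that idea offline. It assumes a hypothetical `fetch_page` callable standing in for the HTTP request, and a hypothetical `max_empty` cap so the loop ends once nothing older comes back; neither name is part of the notebook or of any library.

```python
def paginate_before(fetch_page, n, start_utc, max_empty=3):
    """Collect up to n items by repeatedly asking for items older than a cursor.

    fetch_page(before) is a hypothetical stand-in for the HTTP call; it should
    return a list of dicts that each carry a 'created_utc' key.
    """
    items, cursor, empty = [], start_utc, 0
    while len(items) < n and empty < max_empty:
        page = fetch_page(cursor)
        if not page:
            empty += 1  # tolerate a few empty pages, then give up
            continue
        empty = 0
        items.extend(page)
        # the oldest post in this page becomes the cutoff for the next request
        cursor = min(item['created_utc'] for item in page)
    return items[:n], cursor


# Toy data source: posts at timestamps 1..250, served newest first, 100 per page
def fake_fetch(before, size=100):
    older = [t for t in range(250, 0, -1) if t < before]
    return [{'created_utc': t} for t in older[:size]]


posts, last_utc = paginate_before(fake_fetch, n=1000, start_utc=251)
print(len(posts), last_utc)
```

Passing the fetcher in as an argument keeps the paging logic testable without network access; a real fetcher would wrap `requests.get` and return the `data` list.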

In [3]:
stocks = request_post('stocks',10000) #the printed timestamps should keep decreasing, showing that each request reaches further back in time

1667584128
1667507671
1667418236
1667325290
1667227085
1667120148
1666979572
1666902699
1666827724
1666750397
1666667036
1666574197
1666454411
1666316019
1666208093
1666115465
1666015020
1665897013
1665785671
1665685510
1665597893
1665515066
1665434783
1665340835
1665225813
1665112456
1665011460
1664916269
1664835505
1664749083
1664648227
1664555302
1664472696
1664383098
1664308891
1664219982
1664130761
1664006880
1663931551
1663836824
1663735193
1663647033
1663532250
1663412461
1663303897
1663194183
1663097514
1662993164
1662860823
1662739722
1662644361
1662547546
1662456608
1662343395
1662230265
1662133291
1662048227
1661961642
1661889080
1661805149
1661704912
1661554280
1661470144
1661419809
1661345234
1661264313
1661187810
1661111689
1661002684
1660920968
1660836672
1660765294
1660694273
1660638141
1660558211
1660437861
1660341811
1660280825
1660194280
1660133714
1660049383
1659975330
1659889890
1659755952
1659657356
1659591983
1659535410
1659456674
1659376457
1659279978
1659155556
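
The values printed above are Unix epoch seconds, which are hard to read at a glance. A minimal sketch for turning them into UTC dates and checking how far back the pull reached; `utc_to_str` is a hypothetical helper name, and the two timestamps are copied from the cell above.

```python
from datetime import datetime, timezone


def utc_to_str(epoch_seconds):
    """Format Unix epoch seconds as a UTC timestamp string."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime('%Y-%m-%d %H:%M:%S')


start_cutoff = 1667738423  # initial `utc` value inside request_post
last_printed = 1659155556  # final value printed for the stocks subreddit

print(utc_to_str(start_cutoff))
print(utc_to_str(last_printed))
span_days = (start_cutoff - last_printed) / 86400  # 86400 seconds per day
print(f'span: {span_days:.1f} days')
```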

In [4]:
crypto = request_post('CryptoCurrency',10000) #the printed timestamps should keep decreasing, showing that each request reaches further back in time

1667718719
1667692721
1667677850
1667666722
1667657446
1667645911
1667624330
1667605929
1667591465
1667577691
1667567425
1667553994
1667537296
1667520218
1667502654
1667491499
1667479184
1667466874
1667450100
1667432415
1667418418
1667406899
1667395256
1667380857
1667361525
1667337944
1667324858
1667312204
1667301265
1667282352
1667260803
1667240023
1667227304
1667214228
1667195331
1667166002
1667150116
1667135177
1667113444
1667085476
1667068257
1667053420
1667039446
1667020796
1666996824
1666982034
1666969217
1666954768
1666935872
1666916176
1666898166
1666887481
1666876915
1666866018
1666850275
1666831867
1666816613
1666806737
1666795907
1666781419
1666766553
1666746336
1666732155
1666721687
1666710306
1666700565
1666687084
1666668772
1666649328
1666638257
1666627888
1666616533
1666602419
1666576162
1666555974
1666537802
1666524244
1666499117
1666477236
1666459186
1666439330
1666409177
1666384506
1666371868
1666360176
1666348442
1666329541
1666306929
1666294684
1666282292
1666271311

In [33]:
stocks = pd.DataFrame(stocks)
stocks = stocks[['selftext','title','author','subreddit']]

crypto = pd.DataFrame(crypto)
crypto = crypto[['selftext','title','author','subreddit']]

In [43]:
df = pd.concat([stocks,crypto])

In [44]:
#first, remove duplicated posts (same selftext, title and author)
df['check'] = df['selftext'] + df['title'] + df['author']
initial = df.shape[0]
df.drop_duplicates(subset = ['check'],keep = 'first',inplace=True)
after = df.shape[0]
print(f'There are {initial-after} duplicated data dropped.')

There are 503 duplicated data dropped.
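
One caveat with a concatenated `check` key: if any of the joined columns is null, the concatenation is NaN, and `drop_duplicates` treats all NaN keys as duplicates of each other. The sketch below shows the effect on made-up rows and compares it with passing the columns to `drop_duplicates` directly. Whether this affects the real data depends on how many null `selftext` values it contains, which is not checked here.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'selftext': [np.nan, np.nan, 'same body', 'same body'],
    'title':    ['first title', 'second title', 'same title', 'same title'],
    'author':   ['alice', 'bob', 'carol', 'carol'],
})

# Concatenated key: a null selftext turns the whole key into NaN
toy['check'] = toy['selftext'] + toy['title'] + toy['author']
by_key = toy.drop_duplicates(subset=['check'], keep='first')

# Column-wise comparison: rows match only if all three columns match
by_cols = toy.drop_duplicates(subset=['selftext', 'title', 'author'], keep='first')

print(len(by_key), len(by_cols))
```

With the concatenated key, the two distinct posts with a null body collapse into one row; the column-wise version keeps both and still drops the true duplicate.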


In [45]:
#authors who posted in both subreddits
stocks_author = df[df['subreddit']=='stocks']['author']
crypto_author = df[df['subreddit']=='CryptoCurrency']['author']
common_author = set(stocks_author)&set(crypto_author)
print(f'There are {len(common_author)} common authors between stocks and CryptoCurrency subreddits.')

There are 67 common authors between stocks and CryptoCurrency subreddits.


In [46]:
#stacking title and selftext as separate rows under a single 'post' column
#rename returns a new frame here, which avoids the SettingWithCopyWarning raised by inplace=True on a slice
df_self = df[['selftext','author','subreddit']].rename(columns={'selftext':'post'})
df_title = df[['title','author','subreddit']].rename(columns={'title':'post'})
df_separated = pd.concat([df_self,df_title])
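
The same wide-to-long reshaping can also be sketched with `DataFrame.melt`, shown here on made-up rows. The extra `source` column is an assumption added for illustration; it records whether each row came from the title or the body, which the notebook itself does not keep.

```python
import pandas as pd

toy = pd.DataFrame({
    'selftext': ['body a', 'body b'],
    'title': ['title a', 'title b'],
    'author': ['alice', 'bob'],
    'subreddit': ['stocks', 'CryptoCurrency'],
})

# One output row per (original row, text column) pair
long_df = toy.melt(
    id_vars=['author', 'subreddit'],
    value_vars=['selftext', 'title'],
    var_name='source',   # hypothetical column: 'selftext' or 'title'
    value_name='post',
)
print(long_df)
```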


In [47]:
df_separated.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39358 entries, 0 to 10085
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   post       39357 non-null  object
 1   author     39358 non-null  object
 2   subreddit  39358 non-null  object
dtypes: object(3)
memory usage: 1.2+ MB


In [48]:
#counting placeholder posts ('[removed]' by moderators, '[deleted]', or empty strings) before deciding how to handle them
df_separated.loc[df_separated['post'].isin(['[removed]','[deleted]','']),'post'].value_counts()

[removed]    8621
             5034
[deleted]       8
Name: post, dtype: int64
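
Since these placeholders make up a large share of the rows, it may be worth checking whether they are spread evenly across the two subreddits before dropping them. A minimal sketch on made-up rows, assuming the same three placeholder values; `PLACEHOLDERS` is a hypothetical name.

```python
import pandas as pd

PLACEHOLDERS = ['[removed]', '[deleted]', '']

toy = pd.DataFrame({
    'post': ['[removed]', 'real text', '', '[deleted]', 'more text', '[removed]'],
    'subreddit': ['stocks', 'stocks', 'stocks',
                  'CryptoCurrency', 'CryptoCurrency', 'CryptoCurrency'],
})

# Boolean mask: True where the post is only a placeholder
is_placeholder = toy['post'].isin(PLACEHOLDERS)

# Placeholder count per subreddit (True sums as 1)
per_subreddit = is_placeholder.groupby(toy['subreddit']).sum()
print(per_subreddit)

# Rows that would remain after dropping placeholders
kept = toy[~is_placeholder]
print(kept['subreddit'].value_counts())
```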

In [49]:
#removing null, removed, deleted and empty posts
#isin is used rather than replace([...],None), which in older pandas versions forward-fills instead of inserting nulls
placeholder = df_separated['post'].isin(['[deleted]','[removed]',''])
df_separated = df_separated[df_separated['post'].notnull() & ~placeholder]
#df_separated.drop(columns='author',inplace=True)
df_separated.to_csv('../data/subreddit_data.csv',index=False)