# > **REDDIT SCRAPER SCRIPT 🍎 🐿**

Duygu Ider, Nov. 30th, 2021 ☕

MSc Thesis: Cryptocurrency Price Forecasting Using BERT-Based Sentiment Analysis

# Setup

In [2]:
pip install pmaw

Collecting pmaw
  Downloading pmaw-2.1.3-py3-none-any.whl (25 kB)
Collecting praw
  Downloading praw-7.5.0-py3-none-any.whl (176 kB)
Collecting update-checker>=0.18
  Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Collecting websocket-client>=0.54.0
  Downloading websocket_client-1.2.3-py3-none-any.whl (53 kB)
Collecting prawcore<3,>=2.1
  Downloading prawcore-2.3.0-py3-none-any.whl (16 kB)
Installing collected packages: websocket-client, update-checker, prawcore, praw, pmaw
Successfully installed pmaw-2.1.3 praw-7.5.0 prawcore-2.3.0 update-checker-0.18.0 websocket-client-1.2.3

In [3]:
# import packages
from pmaw import PushshiftAPI
import datetime as dt
import numpy as np
import pandas as pd
import time
from google.colab import files, drive


In [None]:
import os
print("Number of processors:", os.cpu_count())

Number of processors: 2


In [None]:
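# Inspect the shard metadata of the most recent Pushshift request.
# Run this only after `api` has been initialized and a search has completed;
# otherwise `api` is undefined and the cell raises a NameError, as below.
# `failed` should be 0 and `successful` should equal `total`. Anything else means a node
# fell out of the cluster (a temporary issue) and the results are only partial.
api.metadata_.get('shards')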

NameError: ignored
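The check described in the comments above can be written as a small helper. This is only a sketch: the function name `shards_complete` is hypothetical, and it assumes the shard metadata is a dictionary with `total`, `successful` and `failed` keys, as the comments suggest.

```python
def shards_complete(shards):
    """Return True when no shard failed and every shard answered.

    `shards` is assumed to be a dict with 'total', 'successful' and
    'failed' keys (an assumption based on the comments in this notebook).
    """
    if not shards:
        # No metadata available, so completeness cannot be confirmed.
        return False
    no_failures = shards.get("failed", 0) == 0
    all_answered = shards.get("successful") == shards.get("total")
    return no_failures and all_answered


# Toy metadata for illustration only, not real API output.
print(shards_complete({"total": 4, "successful": 4, "failed": 0}))
print(shards_complete({"total": 4, "successful": 3, "failed": 1}))
```

Once a search has run, the helper could be called as `shards_complete(api.metadata_.get('shards'))` before the results are used.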

# Entire Dataset Scraping

Initialize the Pushshift API client with exponential backoff and equal jitter (`jitter='equal'` in the call below):
- Check PMAW documentation for details and specifications: https://github.com/mattpodolak/pmaw

In [4]:
api = PushshiftAPI(shards_down_behavior=None, num_workers=10, limit_type='backoff', jitter='equal')
#default num_workers=10
#recommended num_workers = number of processors * 5
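To make the `limit_type='backoff'` and `jitter` settings more concrete, here is a minimal sketch of how exponential backoff delays are commonly computed with full and equal jitter. It is not PMAW's internal code: the function name `backoff_delay` and the `base` and `cap` values are assumptions chosen for illustration.

```python
import random


def backoff_delay(attempt, base=0.5, cap=60.0, jitter="equal", rng=random):
    """Return a sleep time in seconds for a 0-based retry attempt.

    `base` and `cap` are hypothetical values, not PMAW defaults.
    """
    # Exponential growth, capped so the delay never grows without bound.
    ceiling = min(cap, base * 2 ** attempt)
    if jitter == "full":
        # Full jitter: anywhere between 0 and the ceiling.
        return rng.uniform(0, ceiling)
    if jitter == "equal":
        # Equal jitter: keep half of the ceiling, randomize the other half.
        return ceiling / 2 + rng.uniform(0, ceiling / 2)
    if jitter is None:
        # No jitter: plain exponential backoff.
        return ceiling
    raise ValueError("unknown jitter mode: %r" % (jitter,))


for attempt in range(5):
    print(attempt, round(backoff_delay(attempt, jitter="equal"), 2))
```

Equal jitter always waits at least half of the current ceiling, whereas full jitter can return delays close to zero. The random component spreads retries from parallel workers so they do not all hit the server at the same moment.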

Set parameters for relevant subreddit and start/end dates:

In [12]:
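# Note: in this run the years are used only to build the output filenames;
# the actual scraping window is set by `after` and `before` in the next cell.
start_year = 2022
end_year = 2022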

subreddit="Bitcoin"
reddit_all = []

Scrape selected subreddit for the given time period:

In [13]:
start_time = time.time()

after = int(dt.datetime(2021,11,1,0,0).timestamp())
before = int(dt.datetime(2022,2,22,0,0).timestamp())
#before = int(dt.datetime(end_year,1,1,0,0).timestamp())
posts = api.search_submissions(subreddit=subreddit, limit=None, before=before, after=after)
posts_df = pd.DataFrame(posts)
posts_df['datetime'] = posts_df['created_utc'].map(lambda t: dt.datetime.fromtimestamp(t))
posts_df = posts_df.sort_values(by='datetime')
#print(posts_df)

end_time = time.time()
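The cell above builds `after`/`before` and the `datetime` column with naive datetimes, which Python interprets in the machine's local time zone. On a runtime set to UTC this matches `created_utc`, but elsewhere it can shift the window by a few hours. A minimal sketch of a time-zone-explicit variant follows; the helper name `utc_epoch` and the toy timestamps are assumptions for illustration, not part of the original script.

```python
import datetime as dt

import pandas as pd


def utc_epoch(year, month, day):
    """Return the Unix timestamp of midnight UTC on the given date."""
    moment = dt.datetime(year, month, day, tzinfo=dt.timezone.utc)
    return int(moment.timestamp())


# Same window as in the cell above, pinned to UTC.
after = utc_epoch(2021, 11, 1)
before = utc_epoch(2022, 2, 22)

# Toy frame standing in for the scraped posts (illustrative values only).
toy = pd.DataFrame({"created_utc": [1635724986, 1635725295]})
# Interpret the epoch seconds as UTC instead of local time.
toy["datetime"] = pd.to_datetime(toy["created_utc"], unit="s", utc=True)
toy = toy.sort_values(by="datetime")
print(after, before)
print(toy)
```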

# Output Organization

Runtime and execution results:

In [14]:
print("Execution runtime: %s minutes" % round((end_time - start_time)/60, 2))
print("Number of posts scraped: %s samples" % len(posts_df))

Execution runtime: 7.03 minutes
Number of posts scraped: 30919 samples


In [None]:
print("Column names:", posts_df.columns)

Column names: Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_patreon_flair',
       'author_premium',
       ...
       'content_categories', 'removal_reason', 'poll_data', 'archived',
       'can_gild', 'hidden', 'quarantine', 'subreddit_name_prefixed',
       'top_awarded_type', 'datetime'],
      dtype='object', length=110)


Select relevant columns:

In [15]:
posts_selected = posts_df[['subreddit', 'url', 'datetime', 'author', 'num_comments', 'score', 'title', 'selftext']]
posts_selected.head(3)

| | subreddit | url | datetime | author | num_comments | score | title | selftext |
|---|---|---|---|---|---|---|---|---|
| 22200 | Bitcoin | https://www.cnbc.com/2021/10/31/bitcoin-mining... | 2021-11-01 00:03:06 | CryptoCurrencEEE | 35 | 1 | Two of the biggest bitcoin mining companies in... | |
| 22199 | Bitcoin | https://www.reddit.com/r/Bitcoin/comments/qk23... | 2021-11-01 00:08:15 | Curious_County_2069 | 0 | 1 | GIVEAWAY🎁🎁🎁 We will increase your wallet price... | [removed] |
| 22198 | Bitcoin | https://www.reddit.com/r/Bitcoin/comments/qk24... | 2021-11-01 00:10:33 | obesefamily | 0 | 1 | Umbrell vs Embassy OS - please help me understand | What are the differences between Umbrell and E... |


Filter out posts whose body is empty, deleted or removed, then drop any rows with missing (NA) values:

In [16]:
posts_filtered = posts_selected[~np.isin(posts_selected.selftext, ['','[deleted]','[removed]'])]
posts_filtered = posts_filtered.dropna()
print("Number of samples after filtering: %s samples" % len(posts_filtered))
posts_filtered.head(3)

Number of samples after filtering: 11785 samples


| | subreddit | url | datetime | author | num_comments | score | title | selftext |
|---|---|---|---|---|---|---|---|---|
| 22198 | Bitcoin | https://www.reddit.com/r/Bitcoin/comments/qk24... | 2021-11-01 00:10:33 | obesefamily | 0 | 1 | Umbrell vs Embassy OS - please help me understand | What are the differences between Umbrell and E... |
| 22196 | Bitcoin | https://www.reddit.com/r/Bitcoin/comments/qk2i... | 2021-11-01 00:32:55 | dr_h-donna-gust | 19 | 1 | Coinbase prices | I’m wondering why the coinbase price for Bitco... |
| 22192 | Bitcoin | https://www.reddit.com/r/Bitcoin/comments/qk2u... | 2021-11-01 00:54:52 | asdvlkjkjdos | 25 | 1 | Implementing Michael Saylor's thesis of never ... | Michael always says that you should never sell... |
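The filtering step can be checked on a small toy frame. The sketch below uses pandas' own `isin` instead of `np.isin`, which should be equivalent here. The helper name `filter_posts` and the toy rows are hypothetical.

```python
import pandas as pd

# Placeholder bodies that carry no usable text.
PLACEHOLDERS = ["", "[deleted]", "[removed]"]


def filter_posts(df, text_col="selftext"):
    """Drop posts with a placeholder body, then rows with any missing value."""
    keep = ~df[text_col].isin(PLACEHOLDERS)
    return df[keep].dropna()


# Toy data for illustration only.
toy = pd.DataFrame({
    "title": ["link post", "gone", "moderated", "missing body", "real question"],
    "selftext": ["", "[deleted]", "[removed]", None, "What is a cold wallet?"],
})
filtered = filter_posts(toy)
print(len(filtered), "of", len(toy), "posts kept")
```

Note that `dropna()` removes a row if any selected column is missing, not only `selftext`, so posts without an author or title are dropped as well.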


Set the filenames for the full, selected-column (raw) and filtered output dataframes:

In [None]:
filename = "reddit_" + subreddit.lower() + "_" + str(start_year) + "_" + str(end_year)
filename_raw = filename+"_raw.csv"
filename_filtered = filename+"_filtered.csv"

Mount and authenticate Google Drive usage:

In [10]:
drive.mount('/drive')

Mounted at /drive


In [17]:
# Test export of the filtered posts to Google Drive
posts_filtered.to_csv("/drive/My Drive/Colab Notebooks/thesis_all_datasets/reddit_btc_test.csv", header=True, index=False, columns=list(posts_filtered.axes[1]))

Save output files as .csv in Google Drive:

In [None]:
posts_df.to_csv("/drive/My Drive/Colab Notebooks/thesis_all_datasets/"+filename+".csv", header=True, index=False, columns=list(posts_df.axes[1]))
posts_selected.to_csv("/drive/My Drive/Colab Notebooks/thesis_all_datasets/"+filename_raw, header=True, index=False, columns=list(posts_selected.axes[1]))
posts_filtered.to_csv("/drive/My Drive/Colab Notebooks/thesis_all_datasets/"+filename_filtered, header=True, index=False, columns=list(posts_filtered.axes[1]))

Save the output files as .csv in the Colab runtime and download them:

In [None]:
posts_df.to_csv(filename+".csv", header=True, index=False, columns=list(posts_df.axes[1]))
files.download(filename+".csv")

posts_selected.to_csv(filename_raw, header=True, index=False, columns=list(posts_selected.axes[1]))
files.download(filename_raw)

posts_filtered.to_csv(filename_filtered, header=True, index=False, columns=list(posts_filtered.axes[1]))
files.download(filename_filtered)

# Addition: Date Matching

In [None]:
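# All calendar dates from 2015-01-01 to 2021-11-30, formatted as day-month-year strings
dates_all = pd.date_range(start="2015-01-01", end="2021-11-30").strftime("%d-%m-%Y")
# Calendar date (without the time of day) of each filtered post
dates_data = posts_filtered['datetime'].map(lambda t: t.date())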
#posts_filtered.groupby(['date'])['title'].count()

 #- reddit_filtered.date
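The cell above is unfinished, so the intended matching logic is not shown. One plausible reading is that it should find the calendar days with no posts. The sketch below counts posts per day and reindexes the counts over a full date range. The function name `daily_post_counts` and the toy timestamps are assumptions, and this is not necessarily the matching the thesis used.

```python
import pandas as pd


def daily_post_counts(datetimes, start, end):
    """Count posts per calendar day over [start, end], with 0 for empty days."""
    # Truncate each timestamp to midnight so posts group by day.
    days = pd.to_datetime(datetimes).dt.normalize()
    counts = days.value_counts().sort_index()
    # Reindex over every day in the range; days without posts get 0.
    full_range = pd.date_range(start=start, end=end, freq="D")
    return counts.reindex(full_range, fill_value=0)


# Toy timestamps for illustration only.
toy = pd.Series([
    "2021-11-01 00:10:33",
    "2021-11-01 00:32:55",
    "2021-11-03 12:00:00",
])
counts = daily_post_counts(toy, "2021-11-01", "2021-11-03")
missing_days = counts[counts == 0].index
print(counts)
print("Days without posts:", list(missing_days.strftime("%d-%m-%Y")))
```

With the real data, `posts_filtered['datetime']` would take the place of `toy`, and the range would match the study period.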
