# Data Collection

Referenced: https://www.youtube.com/watch?v=AcrjEWsMi_E&feature=youtu.be <br>
Referenced: https://git.generalassemb.ly/DSIR-Lancelot/5.02-lesson-webscraping/blob/master/Codealong-Adi-InClass.ipynb

In [1]:
# Imports
import pandas as pd
import requests
import time

In [2]:
# Choose two subreddits
subreddit_1 = 'wallstreetbets'
subreddit_2 = 'SatoshiStreetBets'

In [3]:
# Define base url
base_url = 'https://api.pushshift.io/reddit/search/submission'

In [4]:
# Define present utc
present_utc = 1614650800

**Explanation of the Pushshift API scraping process:** Pushshift's API returns at most 100 posts per request, so I begin by defining a function that retrieves the 100 posts preceding a given UTC timestamp. Because this project compares two subreddits, the same function fetches the first batch for each. I then build a separate data frame for each subreddit from that first batch and use a for loop to scrape additional batches as needed. Each request starts from the `created_utc` of the last post already collected, and each new batch is concatenated onto the existing data frame. Once I have verified that each data frame contains enough unique posts, I combine the data frames for r/wallstreetbets and r/SatoshiStreetBets.
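
The process described above can be sketched as a single helper. This is a minimal sketch under stated assumptions, not the notebook's actual code: `fetch_batch`, `collect_posts`, and the `fetch` and `pause` parameters are hypothetical names introduced for illustration. The sketch assumes, as the notebook's loops do, that each response lists posts newest first, so the last post's `created_utc` can serve as the next `before` value. It also stops early if a request returns no posts.

```python
import time

import pandas as pd
import requests

BASE_URL = 'https://api.pushshift.io/reddit/search/submission'


def fetch_batch(subreddit, before, size=100):
    """Request one batch of posts created before the given UTC timestamp."""
    params = {'subreddit': subreddit, 'size': size, 'before': before}
    res = requests.get(BASE_URL, params=params)
    res.raise_for_status()  # fail loudly on an HTTP error
    return res.json()['data']


def collect_posts(subreddit, before, n_batches, fetch=fetch_batch, pause=3):
    """Page backwards through a subreddit, one batch per request.

    `fetch` and `pause` are hypothetical parameters: `fetch` lets a stub
    replace the real API call, and `pause` is the delay between requests.
    """
    frames = []
    for _ in range(n_batches):
        posts = fetch(subreddit, before)
        if not posts:
            break  # nothing older to collect
        frames.append(pd.DataFrame(posts)[['subreddit', 'title']])
        # Assumption: posts arrive newest first, so the last one is the oldest
        before = posts[-1]['created_utc']
        time.sleep(pause)
    if not frames:
        return pd.DataFrame(columns=['subreddit', 'title'])
    return pd.concat(frames, ignore_index=True)
```

For example, `collect_posts('wallstreetbets', 1614650800, n_batches=20)` would request up to 2,000 posts. Passing a stub as `fetch` allows the loop to be exercised without network access.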

In [5]:
# Define a function that returns the first 100 posts from a subreddit
def posts_100(subreddit,before):
    
    # Define parameters
    params = {
        'subreddit': subreddit,
        'size': 100,
        'before': before
    }

    # Request content
    res = requests.get(base_url, params=params)

    # Convert to json
    json_dict = res.json()

    # First 100 posts
    posts = json_dict['data']

    return posts

In [6]:
# First 100 posts for WSB
wsb_posts_100 = posts_100('wallstreetbets',present_utc)

# WSB last utc
# Received help from Gabrielle Burgos
wsb_last_utc = wsb_posts_100[-1]['created_utc']
wsb_last_utc

1614486808

In [7]:
# WSB dataframe with 100 posts
wsb_df = pd.DataFrame(wsb_posts_100)[['subreddit','title']]

# Check code execution
wsb_df.head(3)

Unnamed: 0,subreddit,title
0,wallstreetbets,You all should try to SPAC Jeremy Fragrance's ...
1,wallstreetbets,Are you ready for marriage love test
2,wallstreetbets,Serious question: Is it too late to buy in now?


In [8]:
# Extend the WSB dataframe to up to 2000 posts using a for loop
for i in range(19):
    params = {'subreddit':'wallstreetbets',
              'size':100,
              'before':wsb_last_utc}
    res = requests.get(base_url,params).json()
    posts = res['data']
    wsb_last_utc = posts[-1]['created_utc']
    new_df = pd.DataFrame(posts)[['subreddit','title']]
    wsb_df = pd.concat([wsb_df,new_df])
    time.sleep(3)

In [9]:
# Check code execution
wsb_df

Unnamed: 0,subreddit,title
0,wallstreetbets,You all should try to SPAC Jeremy Fragrance's ...
1,wallstreetbets,Are you ready for marriage love test
2,wallstreetbets,Serious question: Is it too late to buy in now?
3,wallstreetbets,Need some advice...
4,wallstreetbets,tin foil hat time! Is Vlad really that stupid?
...,...,...
95,wallstreetbets,"Is there really paid shills, positive or negat..."
96,wallstreetbets,GME
97,wallstreetbets,"3 Years - 3,000 to 3,000,000. - Day 1 (Saturda..."
98,wallstreetbets,We like the Stock.


In [10]:
# Check number of unique titles
len(wsb_df['title'].unique())

1900

In [11]:
# Drop duplicate rows from dataframe
wsb_df.drop_duplicates(inplace=True)

In [12]:
# Check shape
wsb_df.shape

(1900, 2)
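
The duplicate check in the cells above can be illustrated on a toy frame. This is a sketch with made-up data (`batch_1`, `batch_2`, `stacked`, and `deduped` are hypothetical names), showing that counting unique titles and dropping duplicate rows agree when every row has the same subreddit. It also shows that `pd.concat` without `ignore_index=True` repeats index labels across batches, which is why the displayed index restarts in each batch.

```python
import pandas as pd

# Two small batches that share one title
batch_1 = pd.DataFrame({'subreddit': ['demo', 'demo'],
                        'title': ['GME', 'Need some advice...']})
batch_2 = pd.DataFrame({'subreddit': ['demo', 'demo'],
                        'title': ['GME', 'Crash?']})

# Stacking keeps each batch's own index labels: 0, 1, 0, 1
stacked = pd.concat([batch_1, batch_2])

# Number of distinct titles
n_unique_titles = stacked['title'].nunique()

# Drop repeated rows, then renumber the index from 0
deduped = stacked.drop_duplicates().reset_index(drop=True)
```

Resetting the index is optional here because the CSV is later written with `index=False`, but it avoids ambiguous label-based lookups in the meantime.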

In [13]:
# First 100 posts for SSB
ssb_posts_100 = posts_100('SatoshiStreetBets',present_utc)

# SSB last utc
ssb_last_utc = ssb_posts_100[-1]['created_utc']
ssb_last_utc

1614463124

In [14]:
# SSB dataframe with 100 posts
ssb_df = pd.DataFrame(ssb_posts_100)[['subreddit','title']]

# Check code execution
ssb_df.head(3)

Unnamed: 0,subreddit,title
0,SatoshiStreetBets,Why are alts still so attached to bitcoin...
1,SatoshiStreetBets,"Market Cap should not be an issue, Lets Moonsh..."
2,SatoshiStreetBets,$SRK Sparkpoint


In [15]:
# Extend the SSB dataframe to up to 2000 posts using a for loop
for i in range(19):
    params = {'subreddit':'SatoshiStreetBets',
              'size':100,
              'before':ssb_last_utc}
    res = requests.get(base_url,params).json()
    posts = res['data']
    ssb_last_utc = posts[-1]['created_utc']
    new_df = pd.DataFrame(posts)[['subreddit','title']]
    ssb_df = pd.concat([ssb_df,new_df])
    time.sleep(3)

In [16]:
# Check code execution
ssb_df

Unnamed: 0,subreddit,title
0,SatoshiStreetBets,Why are alts still so attached to bitcoin...
1,SatoshiStreetBets,"Market Cap should not be an issue, Lets Moonsh..."
2,SatoshiStreetBets,$SRK Sparkpoint
3,SatoshiStreetBets,Hidden gem and sleeping giant
4,SatoshiStreetBets,Grayscale Bitcoin Premium Shrinks As Cardano R...
...,...,...
95,SatoshiStreetBets,🚀$MITX Massive Buy Opportunity 7x Inbound🚀
96,SatoshiStreetBets,"Buy CELR, here's why!"
97,SatoshiStreetBets,Crash?
98,SatoshiStreetBets,Why are prices crashing?


In [17]:
# Check number of unique titles
len(ssb_df['title'].unique())

1921

In [18]:
# Drop duplicate rows from dataframe
ssb_df.drop_duplicates(inplace=True)

In [19]:
# Check shape
ssb_df.shape

(1921, 2)

In [20]:
# Combine WSB and SSB dataframes
combined_df = pd.concat([wsb_df,ssb_df])
combined_df

Unnamed: 0,subreddit,title
0,wallstreetbets,You all should try to SPAC Jeremy Fragrance's ...
1,wallstreetbets,Are you ready for marriage love test
2,wallstreetbets,Serious question: Is it too late to buy in now?
3,wallstreetbets,Need some advice...
4,wallstreetbets,tin foil hat time! Is Vlad really that stupid?
...,...,...
95,SatoshiStreetBets,🚀$MITX Massive Buy Opportunity 7x Inbound🚀
96,SatoshiStreetBets,"Buy CELR, here's why!"
97,SatoshiStreetBets,Crash?
98,SatoshiStreetBets,Why are prices crashing?


In [21]:
# Check number of unique titles
len(combined_df['title'].unique())

3814
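
The combined count is 7 lower than the sum of the two individual counts (1900 + 1921 = 3821), which means 7 titles appear in both subreddits. A minimal sketch of how such shared titles could be listed, using made-up data (`wsb_demo`, `ssb_demo`, `combined_demo`, and `shared` are hypothetical names, not part of the notebook):

```python
import pandas as pd

# Toy stand-ins for the two deduplicated dataframes
wsb_demo = pd.DataFrame({'subreddit': 'wallstreetbets',
                         'title': ['GME', 'Crash?', 'Need some advice...']})
ssb_demo = pd.DataFrame({'subreddit': 'SatoshiStreetBets',
                         'title': ['Crash?', '$SRK Sparkpoint']})

combined_demo = pd.concat([wsb_demo, ssb_demo], ignore_index=True)

# Titles present in both subreddits
shared = set(wsb_demo['title']) & set(ssb_demo['title'])

# Unique titles overall = total rows minus the shared titles
n_unique = combined_demo['title'].nunique()
```

Whether to keep or drop such titles is a modeling decision. The notebook keeps them, since the rows differ in their `subreddit` value.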

In [22]:
# Save as csv file
combined_df.to_csv('../datasets/posts.csv', index=False)