# Data Collection
---

Using Pushshift's API, posts were gathered from the twitch and youtube subreddits. The code cells below are written as follows:

1. Create the base URL, an empty dataframe, and the request parameters (the subreddit and the number of posts per request, capped at 100).
2. While the main dataframe holds fewer than 10,000 posts:
   - Wait 5 seconds
   - Request 100 posts
   - Create a temporary dataframe from the response
   - Append the temporary dataframe to the main dataframe and re-index
   - Drop unwanted rows: removed or deleted posts, empty or null text, media-only posts, and duplicates
   - Set the `before` parameter to the `created_utc` of the last post in the response, so the next request returns older posts
   - Repeat
3. Save the resulting dataframes
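
The filtering described in step 2 can be sketched as a standalone function, which makes it possible to check the rules on a small hand-made dataframe without calling the API. This is a minimal sketch, not the notebook's own code: the name `clean_posts` and the `PLACEHOLDERS` list are hypothetical, and the handling of a missing `media` column is an assumption about what a response page might look like.

```python
import pandas as pd

# Values of 'selftext' that carry no usable text (hypothetical helper constant)
PLACEHOLDERS = ['[removed]', '[deleted]', '']


def clean_posts(df):
    """Return a copy of df without placeholder, null, media-only, or duplicate posts."""
    # keep rows whose text is present and is not a placeholder
    has_text = df['selftext'].notna() & ~df['selftext'].isin(PLACEHOLDERS)

    # assumption: if a page has no 'media' column, treat every row as having no media
    if 'media' in df.columns:
        no_media = df['media'].isna()
    else:
        no_media = pd.Series(True, index=df.index)

    cleaned = df[has_text & no_media]
    # keep the first occurrence of each distinct text
    cleaned = cleaned.drop_duplicates(subset=['selftext'])
    return cleaned.reset_index(drop=True)


sample = pd.DataFrame({
    'selftext': ['hello', '[removed]', '', None, 'hello', 'clip', 'bye'],
    'media': [None, None, None, None, None, {'type': 'video'}, None],
})
cleaned_sample = clean_posts(sample)
```

Returning a new dataframe instead of dropping rows in place keeps the raw data available for inspection if a rule turns out to be too aggressive.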

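**Please note**: The retrieval cells take a long time to run because of the 5-second wait before each request. At 100 posts per request, 10,000 posts require at least 100 requests (more than 8 minutes) per subreddit, and more in practice because filtered posts do not count toward the total.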

## Imports

In [4]:
# Import libraries here
import numpy as np
import pandas as pd

import time
import datetime
import requests

## Twitch data

In [2]:
# create URL, parameters, and twitch_df
url = 'https://api.pushshift.io/reddit/search/submission'
params = {
    'subreddit':'twitch',
    'size':100, #100 is the limit
}

twitch_df = pd.DataFrame()

In [3]:
# while-loop to get 10000 posts
while twitch_df.shape[0] < 10000:
    # wait 5 seconds before requesting
    time.sleep(5)
    res = requests.get(url, params=params)

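    # turning response into JSON, then grabbing 'data' and turning it into dataframe
    data = res.json()
    posts = data['data']
    # stop when no older posts are returned; posts[-1] below would fail on an empty list
    if not posts:
        break
    posts_temp_df = pd.DataFrame(posts)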
    
    # appending posts (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
    twitch_df = pd.concat([twitch_df, posts_temp_df])
    
    # resetting index
    twitch_df.reset_index(drop = True, inplace = True)

    # removing rows where 'selftext' is '[removed]', '[deleted]', empty, or null
    rmv_dlt_drop = twitch_df[(twitch_df['selftext'] == '[removed]') |
                                (twitch_df['selftext'] == '[deleted]') |
                                (twitch_df['selftext'] == '')]
    twitch_df.drop(index = rmv_dlt_drop.index, inplace = True)
    twitch_df.dropna(subset=['selftext'],inplace = True)

    # removing where only media
    media = twitch_df[twitch_df['media'].notnull()]
    twitch_df.drop(index = media.index, inplace = True)

    # removing duplicates
    duplicates = twitch_df[twitch_df.duplicated(subset=['selftext'])]
    twitch_df.drop(index = duplicates.index, inplace = True)
    
    # updating 'before' with 'created_utc' to get older posts
    params.update({'before':posts[-1]['created_utc']})

In [4]:
# saving as CSV
twitch_df.to_csv('../data/twitch_df_full.csv',index=False)

## YouTube data

In [5]:
# create URL, parameters, and youtube_df
url = 'https://api.pushshift.io/reddit/search/submission'
params = {
    'subreddit':'youtube',
    'size':100, #100 is the limit
}

youtube_df = pd.DataFrame()

In [6]:
# while-loop to get 10000 posts
while youtube_df.shape[0] < 10000:
    # wait 5 seconds before requesting
    time.sleep(5)
    res = requests.get(url, params=params)

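    # turning response into JSON, then grabbing 'data' and turning it into dataframe
    data = res.json()
    posts = data['data']
    # stop when no older posts are returned; posts[-1] below would fail on an empty list
    if not posts:
        break
    posts_temp_df = pd.DataFrame(posts)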

    # appending posts (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
    youtube_df = pd.concat([youtube_df, posts_temp_df])
    
    # resetting index
    youtube_df.reset_index(drop = True, inplace = True)

    # removing where only media
    media = youtube_df[youtube_df['media'].notnull()]
    youtube_df.drop(index = media.index, inplace = True)

    # removing rows where 'selftext' is '[removed]', '[deleted]', empty, or null
    rmv_dlt_drop = youtube_df[(youtube_df['selftext'] == '[removed]') |
                                (youtube_df['selftext'] == '[deleted]') |
                                (youtube_df['selftext'] == '')]
    youtube_df.drop(index = rmv_dlt_drop.index, inplace = True)
    youtube_df.dropna(subset=['selftext'],inplace = True)

    # removing duplicates
    duplicates = youtube_df[youtube_df.duplicated(subset=['selftext'])]
    youtube_df.drop(index = duplicates.index, inplace = True)
    
    # updating 'before' with 'created_utc' to get older posts
    params.update({'before':posts[-1]['created_utc']})

In [7]:
# saving as CSV
youtube_df.to_csv('../data/youtube_df_full.csv',index=False)
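
The Twitch and YouTube loops differ only in the subreddit name, so the shared logic could be factored into one function. The sketch below is one possible way to do that, not the notebook's method: `collect_posts`, `fetch_page`, and `max_requests` are hypothetical names, and it works on plain lists of dictionaries rather than dataframes. The request itself is passed in as a function, which lets the paging and filtering be exercised with canned pages instead of live API calls.

```python
import time

# Values of 'selftext' that carry no usable text
UNUSABLE_TEXT = (None, '', '[removed]', '[deleted]')


def collect_posts(fetch_page, target, wait_seconds=5.0, max_requests=1000):
    """Page backwards in time until `target` usable posts are kept.

    fetch_page(before) must return a list of post dicts older than `before`
    (or the newest posts when `before` is None).
    """
    kept = []
    seen_text = set()
    before = None
    for _ in range(max_requests):  # hard cap so the loop always terminates
        if len(kept) >= target:
            break
        time.sleep(wait_seconds)
        posts = fetch_page(before)
        if not posts:
            break  # no older posts are available
        for post in posts:
            text = post.get('selftext')
            if text in UNUSABLE_TEXT:
                continue  # removed, deleted, empty, or missing text
            if post.get('media') is not None:
                continue  # media post
            if text in seen_text:
                continue  # duplicate text
            seen_text.add(text)
            kept.append(post)
        # the last post of the page is the oldest; ask for posts before it next
        before = posts[-1]['created_utc']
    return kept


# Canned pages keyed by the `before` value, standing in for API responses
fake_pages = {
    None: [{'selftext': 'a', 'media': None, 'created_utc': 30},
           {'selftext': '[removed]', 'media': None, 'created_utc': 20}],
    20: [{'selftext': 'a', 'media': None, 'created_utc': 15},
         {'selftext': 'b', 'media': {'type': 'video'}, 'created_utc': 12},
         {'selftext': 'c', 'created_utc': 10}],
    10: [],
}
collected = collect_posts(lambda before: fake_pages[before], target=5, wait_seconds=0)
```

For real use, `fetch_page` would wrap the `requests.get(url, params=params)` call from the cells above, adding `before` to `params` when it is not `None` and returning `res.json()['data']`. The resulting list can then be passed to `pd.DataFrame` and saved with `to_csv` as before.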