# Reddit Data Collection — Workbook

*Don't forget to rename this notebook if you want to save changes!*

In this lesson, we're going to learn how to collect Reddit posts with the Pushshift API wrapper known as [PSAW](https://github.com/dmarx/psaw).

> Why do people use Pushshift’s API instead of the official Reddit API?

> In short, Pushshift makes it much easier for researchers to query and retrieve historical Reddit data, provides extended functionality by providing fulltext search against comments and submissions, and has larger single query limits.

>— Jason Baumgartner, et al., ["The Pushshift Reddit Dataset"](https://arxiv.org/pdf/2001.08435.pdf)

## Install PSAW

First, we're going to install the PSAW package with pip. The `!` allows us to run a command that is normally used on the command line.

In [None]:
!pip install psaw

Then we will import pandas and set the default display options.

In [39]:
import pandas as pd
# Show up to 400 characters per cell and up to 50 columns
pd.options.display.max_colwidth = 400
pd.options.display.max_columns = 50

Next we will import a specific part of the PSAW package, PushshiftAPI.

In [40]:
from psaw import PushshiftAPI

Then we will "initialize" the PushshiftAPI, so we can work with it below.

In [41]:
api = PushshiftAPI()

## Collect Reddit submissions for a subreddit

PSAW works in two steps. First, we set up an "API request generator," then we loop through the generator to extract individual Reddit posts.

In [65]:
api_request_generator = api.search_submissions(subreddit='TodayILearned',
                                               score = ">10000",
                                               limit=200)

Here we loop through the API request generator and, for each Reddit post, append its data, which is stored as a dictionary in the attribute `submission.d_`, to a list.

In [66]:
all_submissions = []
for submission in api_request_generator:
    all_submissions.append(submission.d_)
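One thing worth knowing about generators is that they can only be looped through once. The sketch below illustrates this with a hypothetical stand-in function, `fake_search_submissions`, which is not part of PSAW; it only mimics objects that carry a `d_` attribute so the example runs without network access.

```python
from types import SimpleNamespace


def fake_search_submissions(records):
    """Hypothetical stand-in for api.search_submissions (not part of PSAW)."""
    for record in records:
        # Each yielded object exposes its data through a d_ attribute
        yield SimpleNamespace(d_=record)


# Made-up records for illustration only
records = [
    {"title": "TIL one", "score": 15000},
    {"title": "TIL two", "score": 12000},
]

generator = fake_search_submissions(records)

# The first loop consumes every item in the generator
first_pass = [submission.d_ for submission in generator]

# A second loop over the same generator finds nothing left
second_pass = [submission.d_ for submission in generator]

print(len(first_pass), len(second_pass))  # 2 0
```

This is why the cells in this workbook create a fresh `api_request_generator` before each new collection.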

How would we calculate the length of the list `all_submissions`?

In [None]:
all_submissions

How would we examine the first item in the list `all_submissions`?

In [None]:
all_submissions

How would we create a DataFrame from `all_submissions`?

In [None]:
all_submissions

In [68]:
reddit_submissions = pd.DataFrame(all_submissions)

We could do all of the above in a single line of code, like so:
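A minimal sketch of that one-liner follows. It assumes a stand-in generator built from made-up records (the name `stand_in_generator` and the data are hypothetical) so that it runs without calling the API; with PSAW, the real `api_request_generator` would take its place.

```python
from types import SimpleNamespace

import pandas as pd

# Hypothetical stand-in for the PSAW generator: objects with a d_ attribute
stand_in_generator = (
    SimpleNamespace(d_=record)
    for record in [
        {"title": "TIL one", "score": 15000, "created_utc": 1577836800},
        {"title": "TIL two", "score": 12000, "created_utc": 1577923200},
    ]
)

# Loop, extract, and build the DataFrame in a single line
reddit_submissions = pd.DataFrame([submission.d_ for submission in stand_in_generator])

print(reddit_submissions.shape)  # (2, 3)
```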

## Examine Data

In [None]:
reddit_submissions 

Check what columns/metadata exist in this data:

In [None]:
reddit_submissions.columns

In [None]:
reddit_submissions[['title', 'score']].sample(5)

Convert the `created_utc` column, which stores each post's creation time as Unix epoch seconds, to a readable date:

In [70]:
reddit_submissions['date'] = pd.to_datetime(reddit_submissions['created_utc'], utc=True, unit='s')
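To see what this conversion does, here is a small sketch on a toy DataFrame. The two epoch values are made up for illustration; `unit='s'` tells pandas the numbers are seconds, and `utc=True` makes the result timezone-aware.

```python
import pandas as pd

# Toy data: two Unix timestamps in seconds (illustrative values)
toy = pd.DataFrame({"created_utc": [1577836800, 1577923200]})

# Same call as in the cell above, applied to the toy column
toy["date"] = pd.to_datetime(toy["created_utc"], utc=True, unit="s")

print(toy["date"].iloc[0])  # 2020-01-01 00:00:00+00:00
```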

Select columns of interest

In [None]:
reddit_submissions = reddit_submissions[['date','score', 'title', 'author', 'selftext',
                  'url', 'subreddit',  'num_comments',
                  'num_crossposts']]
reddit_submissions
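Selecting with a list of column names raises a `KeyError` if any name is missing from the DataFrame. The sketch below shows one possible way to guard against that, assuming some collected data might lack a field; the toy DataFrame and the variable names are hypothetical.

```python
import pandas as pd

# Toy DataFrame that lacks one of the columns we would like to keep
toy = pd.DataFrame({"title": ["a", "b"], "score": [3, 9], "extra": [0, 1]})
columns_of_interest = ["score", "title", "num_crossposts"]

# Keep only the requested columns that actually exist, in the requested order
available = [column for column in columns_of_interest if column in toy.columns]
selected = toy[available]

print(list(selected.columns))  # ['score', 'title']
```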

## Your Turn!

Sort the DataFrame to look at the 10 Reddit posts with the highest upvote score (note that upvote score is stored in the column `score`):

In [None]:
reddit_submissions...

Now choose your own subreddit to collect data from:

In [45]:
subreddit = 'CHOOSE YOUR OWN'

In [84]:
api_request_generator = api.search_submissions(subreddit=subreddit,
                                               score = ">3000", limit=100)

In [85]:
reddit_submissions = pd.DataFrame([submission.d_ for submission in api_request_generator])
reddit_submissions['date'] = pd.to_datetime(reddit_submissions['created_utc'], utc=True, unit='s')
reddit_submissions = reddit_submissions[['date','score', 'title', 'author', 'selftext',
                  'url', 'subreddit',  'num_comments',
                  'num_crossposts']]

Sort the DataFrame to look at the 10 Reddit posts with the highest upvote score:

In [None]:
reddit_submissions...

## Collect Reddit submissions based on search keyword

Now search through Reddit posts based on a query word.

In [55]:
query = 'CHOOSE YOUR OWN QUERY'

In [56]:
api_request_generator = api.search_submissions(q= query,
                                                score = ">2000", limit=100)

In [57]:
reddit_submissions = pd.DataFrame([submission.d_ for submission in api_request_generator])
reddit_submissions['date'] = pd.to_datetime(reddit_submissions['created_utc'], utc=True, unit='s')
reddit_submissions = reddit_submissions[['date','score', 'title', 'author', 'selftext',
                  'url', 'subreddit',  'num_comments',
                  'num_crossposts']]

Find all the subreddits where this query word appears (in other words, find the unique values in the `subreddit` column and how many of them there are):

In [None]:
reddit_submissions...

## Bonus (If You Finish Early or Want to Explore More)

### Collect Reddit *comments* based on search keyword

In [None]:
api_request_generator = api.search_comments(q='Missy Elliott',
                                            score = ">2000")
reddit_submissions = pd.DataFrame([submission.d_ for submission in api_request_generator])
reddit_submissions['date'] = pd.to_datetime(reddit_submissions['created_utc'], utc=True, unit='s')
reddit_submissions = reddit_submissions[['date','score', 'subreddit','body', 'author']]
reddit_submissions.head()

### Collect Reddit submissions/comments based on multiple search keywords

To search for any one of multiple phrases, such as George Orwell OR J. R. R. Tolkien, wrap each phrase in parentheses and join them with the `|` (OR) operator:

In [None]:
api_request_generator = api.search_comments(q='(George Orwell)|(J. R. R. Tolkien)', limit=100)
reddit_submissions = pd.DataFrame([submission.d_ for submission in api_request_generator])
reddit_submissions['date'] = pd.to_datetime(reddit_submissions['created_utc'], utc=True, unit='s')
reddit_submissions = reddit_submissions[['date','score', 'subreddit','body', 'author']]
reddit_submissions.head()
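When the list of phrases grows, typing the query string by hand gets error-prone. The sketch below builds the same kind of string from a Python list; `build_or_query` is a hypothetical helper name, and the output format simply mirrors the query used in the cell above.

```python
def build_or_query(phrases):
    """Wrap each phrase in parentheses and join them with the | operator."""
    return "|".join(f"({phrase})" for phrase in phrases)


query = build_or_query(["George Orwell", "J. R. R. Tolkien"])
print(query)  # (George Orwell)|(J. R. R. Tolkien)
```

The resulting string could then be passed as `q=query` to `api.search_comments`.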

To search for multiple phrases together, such as Shakespeare AND Beyonce, wrap each phrase in parentheses and join them with the `&` (AND) operator:

In [None]:
api_request_generator = api.search_comments(q='(Shakespeare)&(Beyonce)')
reddit_submissions = pd.DataFrame([submission.d_ for submission in api_request_generator])
reddit_submissions['date'] = pd.to_datetime(reddit_submissions['created_utc'], utc=True, unit='s')
reddit_submissions = reddit_submissions[['date','score', 'subreddit','body', 'author']]
reddit_submissions.head()

## Collect Reddit submissions/comments with start and end dates

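To limit results to a date range, here from January 1, 2020 to January 10, 2020, convert the start and end dates to epoch timestamps and pass them as `after` and `before`: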

In [None]:
import datetime as dt

# Convert the start and end dates to Unix epoch seconds
start_epoch = int(dt.datetime(2020, 1, 1).timestamp())
end_epoch = int(dt.datetime(2020, 1, 10).timestamp())

# Search comments created after the start date and before the end date
api_request_generator = api.search_comments(q='(Shakespeare)&(Beyonce)', after=start_epoch, before=end_epoch)
reddit_submissions = pd.DataFrame([submission.d_ for submission in api_request_generator])
reddit_submissions['date'] = pd.to_datetime(reddit_submissions['created_utc'], utc=True, unit='s')
reddit_submissions = reddit_submissions[['date','score', 'subreddit','body', 'author']]
reddit_submissions.head()
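The cell above builds its epochs from naive `datetime` objects, which Python interprets in the computer's local time zone, so the exact boundaries can shift by a few hours from one machine to another. A minimal sketch of a variant that pins the boundaries to UTC, which is the time zone `created_utc` is expressed in:

```python
import datetime as dt

# Attach an explicit UTC time zone so the result is the same on every machine
start_epoch = int(dt.datetime(2020, 1, 1, tzinfo=dt.timezone.utc).timestamp())
end_epoch = int(dt.datetime(2020, 1, 10, tzinfo=dt.timezone.utc).timestamp())

print(start_epoch, end_epoch)  # 1577836800 1578614400
```

These values can be passed to `after` and `before` in the same way as in the cell above.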