# Linguistics
####  Web Scraping Reddit

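Reddit has its own API, but a popular alternative for collecting Reddit data is **Pushshift**, a third-party archive of Reddit posts and comments with its own API. This notebook accesses it through `pmaw`, a Python wrapper for the Pushshift API. You can read more about Pushshift in this [arXiv article](https://arxiv.org/abs/2001.08435).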


In [16]:
%pip install pmaw pandas matplotlib

Note: you may need to restart the kernel to use updated packages.


Import packages: [pandas](https://pandas.pydata.org/pandas-docs/stable/) and [matplotlib](https://matplotlib.org/3.1.1/contents.html).

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

Import `PushshiftAPI` from `pmaw` to access the Pushshift API.

In [2]:
from pmaw import PushshiftAPI

Initialize `PushshiftAPI`.

In [3]:
api = PushshiftAPI()

#### PMAW Usage


To collect Reddit posts:

`api.search_submissions(subreddit="subreddit of interest", score=">certain upvote score", q="search keyword", before=date, after=date)`

To collect Reddit comments:

`api.search_comments(subreddit="subreddit of interest", score=">certain upvote score", q="search keyword", before=date, after=date)`
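The `before` and `after` arguments are passed as Unix timestamps in the cells that follow. Calling `.timestamp()` on a naive `datetime` interprets it in the machine's local timezone, so the same code can produce slightly different windows on different machines. The sketch below pins the conversion to UTC instead; `to_epoch_utc` is a hypothetical helper, not part of `pmaw`, and the dates are the ones used later in this notebook.

```python
import datetime as dt


def to_epoch_utc(year, month, day):
    """Return the Unix timestamp (in seconds) for midnight UTC on the given date."""
    moment = dt.datetime(year, month, day, tzinfo=dt.timezone.utc)
    return int(moment.timestamp())


after = to_epoch_utc(2020, 3, 6)    # start of the collection window
before = to_epoch_utc(2021, 5, 31)  # end of the collection window

# Keyword arguments following the usage pattern above; the subreddit is only an example.
query = {"subreddit": "Cornell", "before": before, "after": after}
print(query)
```

With an initialized client, these could then be passed along as `api.search_submissions(**query)`.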

#### Collect Reddit submissions for a subreddit (with more than a certain upvote score)

Set up the generator that makes the API request.

In [18]:
# Date range: March 6, 2020 to May 31, 2021

import datetime as dt
before = int(dt.datetime(2021,5,31,0,0).timestamp())
after = int(dt.datetime(2020,3,6,0,0).timestamp())

submissions = api.search_submissions(subreddit='Cornell', before=before, after=after)

INFO:pmaw.PushshiftAPIBase:23594 result(s) available in Pushshift
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 100 - Batches: 10 - Items Remaining: 13636


ConnectionError: HTTPSConnectionPool(host='api.pushshift.io', port=443): Max retries exceeded with url: /reddit/submission/search?subreddit=Cornell&before=1597634450&after=1597478604&size=100&sort=desc&metadata=true (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x12b176880>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))
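A `ConnectionError` like the one above usually points to a transient network or DNS problem rather than a bad query, so one option is simply to try the request again after a pause. The following is a minimal sketch of a generic retry wrapper with exponential backoff. `call_with_retries` is a hypothetical helper, not something `pmaw` provides, and it assumes the error raised is a subclass of `OSError` (which, to my understanding, holds for both the built-in `ConnectionError` and the `requests` exceptions).

```python
import time


def call_with_retries(func, attempts=3, base_delay=1.0, retry_on=(OSError,)):
    """Call func(); if a listed exception is raised, wait and try again.

    The wait doubles after each failed attempt (base_delay, 2x, 4x, ...).
    The last failure is re-raised so the caller still sees the real error.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Assuming `api`, `before`, and `after` are defined as in the cell above, the request could be wrapped as `submissions = call_with_retries(lambda: api.search_submissions(subreddit='Cornell', before=before, after=after))`.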

Grab data for each Reddit submission and make it into a dataframe.

In [7]:
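cornell_submissions = pd.DataFrame(submissions)
# `comments` is never defined in this notebook; collect it first with
# api.search_comments(...) before building a comments dataframe.
# cornell_comments = pd.DataFrame(comments)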

Check how many Reddit posts have been collected.

In [8]:
cornell_submissions.shape

(23584, 82)

Check what columns/metadata are in the dataframe.

In [14]:
cornell_submissions.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_patreon_flair',
       'author_premium', 'awarders', 'can_mod_post', 'contest_mode',
       'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_crosspostable', 'is_meta', 'is_original_content',
       'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_richtext',
       'link_flair_text_color', 'link_flair_type', 'locked', 'media_only',
       'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'post_hint',
       'preview', 'pwls', 'retrieved_on', 'score', 'selftext', 'send_replies',
       'spoiler', 'stickied', 'subreddit', 'subreddit_id',
       'subreddit_subscribers', 'subreddit_type', 'thumbnail',
       'thumbnail_height', 'thumbnail_width', 'tit

In [10]:
cornell_submissions[['title', 'score']].sample(10)

| | title | score |
|---|---|---|
| 1375 | CS in arts or engineering | 0 |
| 7783 | iPad Pro vs Air | 1 |
| 22611 | Christmas music from the clock tower in April ... | 1 |
| 22666 | INFO 1300 is the worst course ive ever taken | 1 |
| 17159 | SA come on y'all,, promote optional S/U we're ... | 1 |
| 1475 | CS TA work hours | 1 |
| 19865 | Is there any way to cancel a GET app order? | 1 |
| 7147 | Can you take Math 2930 and Math 2940 at the sa... | 1 |
| 19150 | Graduating Early: Pre-Enroll Status Change? | 1 |
| 209 | Whether to return that is the question | 24 |


Select only the columns of interest and assign them to a new dataframe.
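The notebook does not show which columns were kept, so the following is only a sketch of what that selection step might look like. The column list is an assumption, and the small `sample` dataframe stands in for `cornell_submissions` (with illustrative `created_utc` values) so the example runs without the API.

```python
import pandas as pd

# Stand-in rows so the sketch runs offline; the real dataframe has 82 columns.
# The created_utc values are illustrative, not taken from the collected data.
sample = pd.DataFrame({
    "title": ["CS TA work hours", "iPad Pro vs Air"],
    "score": [1, 1],
    "created_utc": [1583452800, 1622419200],
    "thumbnail": ["self", "self"],
})

# Assumed selection: keep only the columns needed for later analysis.
columns_of_interest = ["title", "score", "created_utc"]
selected = sample[columns_of_interest].copy()

# Convert epoch seconds into readable timestamps.
selected["created"] = pd.to_datetime(selected["created_utc"], unit="s")
print(selected)
```

Taking a `.copy()` of the selection keeps later column assignments from triggering pandas' `SettingWithCopyWarning`.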

Now we can export the dataframe to a CSV file.

In [15]:
cornell_submissions.to_csv("cornell_submissions.csv", encoding='utf-8', index=False)