# Linguistics
#### Web Scraping Reddit

Although Reddit has its own API, a more popular option for working with Reddit data is **Pushshift**. You can read more about Pushshift in this [arXiv article](https://arxiv.org/abs/2001.08435).

> Why do people use Pushshift’s API instead of the official Reddit API?
>
> In short, Pushshift makes it much easier for researchers to query and retrieve historical Reddit data, provides extended functionality by providing fulltext search against comments and submissions, and has larger single query limits.
>
> Jason Baumgartner, et al., "The Pushshift Reddit Dataset"

#### Install PSAW

To work with the Pushshift API, we're going to install and use a Python wrapper called [PSAW](https://github.com/dmarx/psaw).

In [2]:
!pip3 install psaw



Import packages: [pandas](https://pandas.pydata.org/pandas-docs/stable/) and [matplotlib](https://matplotlib.org/3.1.1/contents.html).

In [3]:
import pandas as pd
import matplotlib.pyplot as plt

Import the `PushshiftAPI` class to access the API.

In [4]:
from psaw import PushshiftAPI

Initialize `PushshiftAPI`.

In [5]:
api = PushshiftAPI()

#### PSAW Usage


To collect Reddit posts:

`api.search_submissions(subreddit="subreddit of interest", score=">certain upvote score", q="search keyword", before=date, after=date)`

To collect Reddit comments:

`api.search_comments(subreddit="subreddit of interest", score=">certain upvote score", q="search keyword", before=date, after=date)`
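As a concrete illustration of the two templates above, the sketch below assembles the keyword arguments for one search. The helper `build_search_kwargs` and the example values (score threshold, keyword) are hypothetical and not part of PSAW; only the parameter names come from the templates. The actual API calls are left as comments because they need network access to Pushshift.

```python
import datetime as dt


def build_search_kwargs(subreddit, min_score=None, keyword=None, after=None, before=None):
    """Hypothetical helper: collect only the filters that were actually given."""
    kwargs = {"subreddit": subreddit}
    if min_score is not None:
        # The templates pass the score filter as a comparison string, e.g. ">10".
        kwargs["score"] = f">{min_score}"
    if keyword is not None:
        kwargs["q"] = keyword
    if after is not None:
        # Dates are sent as integer Unix timestamps.
        kwargs["after"] = int(after.timestamp())
    if before is not None:
        kwargs["before"] = int(before.timestamp())
    return kwargs


example_kwargs = build_search_kwargs(
    "Cornell",
    min_score=10,      # example threshold, chosen only for illustration
    keyword="exam",    # example keyword, chosen only for illustration
    after=dt.datetime(2020, 3, 13, tzinfo=dt.timezone.utc),
    before=dt.datetime(2021, 5, 31, tzinfo=dt.timezone.utc),
)
print(example_kwargs)

# With an initialized client, the same filters work for posts or comments:
# posts = api.search_submissions(**example_kwargs)
# comments = api.search_comments(**example_kwargs)
```

Leaving out a filter simply omits that key, so the same helper covers both a broad search and a narrow one.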

#### Collect Reddit submissions for a subreddit (with more than a certain upvote score)

Define the start and end of the date range as Unix timestamps, which the API request will use.

In [6]:
import datetime as dt
end = int(dt.datetime(2021,5,31,0,0,0).timestamp())
start = int(dt.datetime(2020,3,13,0,0,0).timestamp())
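Note that `timestamp()` on a naive `datetime` interprets it in the computer's local time zone, so the cell above gives slightly different boundaries on machines in different zones. If reproducible boundaries matter, one option (an assumption here, not what the original notebook does) is to make the datetimes explicitly UTC. The names `start_utc` and `end_utc` are introduced only for this sketch.

```python
import datetime as dt

# Attach UTC explicitly so the result does not depend on the local time zone.
start_utc = int(dt.datetime(2020, 3, 13, tzinfo=dt.timezone.utc).timestamp())
end_utc = int(dt.datetime(2021, 5, 31, tzinfo=dt.timezone.utc).timestamp())

print(start_utc, end_utc)

# Converting back confirms that the boundary is midnight UTC.
print(dt.datetime.fromtimestamp(start_utc, tz=dt.timezone.utc).isoformat())
```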

Set up a generator to make the API request, then collect the data for each Reddit submission into a dataframe.

In [7]:
api_request_generator = api.search_submissions(subreddit='Cornell', after=start, before=end)

In [8]:
cornell_submissions = pd.DataFrame([submission.d_ for submission in api_request_generator])
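Because the generator keeps requesting results until the whole date range is exhausted, building the dataframe for a busy subreddit can run for a long time. Below is a minimal sketch of capping the number of results with `itertools.islice`. `FakeSubmission` and `fake_generator` are stand-ins invented here so the example runs offline; with the real client you would pass `api_request_generator` instead. PSAW's README also shows a `limit` argument to `search_submissions` that serves a similar purpose.

```python
from itertools import islice


class FakeSubmission:
    """Stand-in for a PSAW result, which exposes its fields as a dict in `.d_`."""

    def __init__(self, title, score):
        self.d_ = {"title": title, "score": score}


def fake_generator():
    """Stand-in for api.search_submissions(...): yields results lazily."""
    n = 0
    while True:  # never ends on its own, like a very large query
        n += 1
        yield FakeSubmission(f"post {n}", score=n)


max_posts = 5  # hypothetical cap; choose whatever sample size you need

# islice stops pulling from the generator after max_posts items.
rows = [submission.d_ for submission in islice(fake_generator(), max_posts)]

print(len(rows))
print(rows[0])
# rows is a list of dicts, so pd.DataFrame(rows) builds the dataframe as before.
```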




Check how many Reddit posts have been collected.

In [None]:
cornell_submissions.shape

Check what columns/metadata are in the dataframe.

In [None]:
cornell_submissions.columns

In [None]:
cornell_submissions[['title', 'score']].sample(10)

Select only the columns of interest and assign them to a new dataframe.

In [None]:
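# .copy() makes an independent dataframe, so later column assignments do not trigger SettingWithCopyWarning
cornell_final = cornell_submissions[['author', 'title', 'selftext', 'created_utc', 'created', 'score', 'num_comments', 'num_crossposts']].copy()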

cornell_final

Clean the data by converting the Unix timestamps in `created_utc` to standard datetimes.

In [None]:
cornell_final['created_utc'] = pd.to_datetime(cornell_final['created_utc'], unit='s')
cornell_final

Now we can export the finalized, cleaned dataframe to a CSV file.

In [None]:
cornell_final.to_csv("cornell_final.csv", encoding='utf-8', index=False)