# Linguistics

**Code adapted from a web scraping workshop**


####  Web Scraping Reddit

Though Reddit has its own API, there is a more popular API for working with Reddit called **Pushshift**. You can read more about Pushshift in this [arXiv article](https://arxiv.org/abs/2001.08435). (PDF)

> Why do people use Pushshift’s API instead of the official Reddit API?
>
> In short, Pushshift makes it much easier for researchers to query and retrieve historical Reddit data, provides extended functionality by providing fulltext search against comments and submissions, and has larger single query limits.
>
> Jason Baumgartner, et al., "The Pushshift Reddit Dataset"

#### Install PSAW

To work with the Pushshift API, we're going to install and use a Python wrapper called [PSAW](https://github.com/dmarx/psaw).

In [1]:
!pip3 install psaw



Import packages: [pandas](https://pandas.pydata.org/pandas-docs/stable/) and [matplotlib](https://matplotlib.org/3.1.1/contents.html).

In [2]:
import pandas as pd
import matplotlib.pyplot as plt

Import the `PushshiftAPI` class from PSAW.

In [3]:
from psaw import PushshiftAPI

Initialize the `PushshiftAPI` client.

In [4]:
api = PushshiftAPI()

#### PSAW Usage


To collect Reddit posts:

`api.search_submissions(subreddit="subreddit of interest", score=">certain upvote score", q="search keyword", before=date, after=date)`

To collect Reddit comments:

`api.search_comments(subreddit="subreddit of interest", score=">certain upvote score", q="search keyword", before=date, after=date)`
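
As a rough illustration of the call patterns above, the sketch below collects the search arguments into a dictionary that could then be unpacked into either method. The helper `build_search_params` and the example values (score threshold, keyword) are hypothetical and not part of PSAW; the `">N"` string form for `score` simply follows the usage shown above, and the dates are assumed to be in UTC.

```python
import datetime as dt

def build_search_params(subreddit, min_score=None, keyword=None, after=None, before=None):
    """Collect search arguments, skipping any that were not supplied (hypothetical helper)."""
    params = {"subreddit": subreddit}
    if min_score is not None:
        params["score"] = f">{min_score}"  # same ">N" string form as in the usage above
    if keyword is not None:
        params["q"] = keyword
    if after is not None:
        params["after"] = int(after.timestamp())  # Unix timestamp in seconds
    if before is not None:
        params["before"] = int(before.timestamp())
    return params

params = build_search_params(
    "Cornell",
    min_score=10,
    keyword="prelim",
    after=dt.datetime(2020, 3, 13, tzinfo=dt.timezone.utc),
    before=dt.datetime(2021, 5, 31, tzinfo=dt.timezone.utc),
)
print(params)
# With an initialized client, the dictionary could be unpacked as
# api.search_submissions(**params) or api.search_comments(**params).
```

Keeping the arguments in one dictionary makes it easy to reuse the same filters for both submissions and comments.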

#### Collect Reddit submissions for a subreddit (with more than a certain upvote score)

Define the date range for the API request as Unix timestamps.

In [5]:
import datetime as dt
end = int(dt.datetime(2021,5,31,0,0,0).timestamp())
start = int(dt.datetime(2020,3,13,0,0,0).timestamp())
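
Note that `timestamp()` on a naive `datetime` interprets it in the local timezone of the machine running the notebook, so the values above depend on where the code is run. A minimal sketch, assuming the date range is meant in UTC (the original does not say), pins the timezone explicitly; `start_utc` and `end_utc` are illustrative names.

```python
import datetime as dt

# Attach UTC explicitly so the result does not depend on the local timezone.
start_utc = int(dt.datetime(2020, 3, 13, tzinfo=dt.timezone.utc).timestamp())
end_utc = int(dt.datetime(2021, 5, 31, tzinfo=dt.timezone.utc).timestamp())

print(start_utc, end_utc)  # 1584057600 1622419200
```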

Set up a generator for the API request, then grab the data for each Reddit submission and load it into a dataframe.

In [10]:
api_request_generator = api.search_submissions(subreddit='Cornell', after=start, before=end)

In [11]:
cornell_submissions = pd.DataFrame([submission.d_ for submission in api_request_generator])
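
To see what the list comprehension does without calling the API, here is a small sketch using stand-in objects. The records and their values are made up for illustration; the only thing borrowed from the code above is that each result exposes its fields as a dictionary in a `d_` attribute.

```python
from types import SimpleNamespace

import pandas as pd

# Stand-ins for API results: each one carries its fields in a `d_` dictionary.
fake_results = (
    SimpleNamespace(d_={"title": f"Example post {i}", "score": i, "created_utc": 1584057600 + i})
    for i in range(3)
)

# One dictionary per result becomes one row; dictionary keys become columns.
demo_submissions = pd.DataFrame([result.d_ for result in fake_results])
print(demo_submissions.shape)  # (3, 3)
```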

Check how many Reddit posts have been collected.

In [12]:
cornell_submissions.shape

(23145, 83)

Check what columns/metadata are in the dataframe.

In [13]:
cornell_submissions.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_patreon_flair',
       'author_premium', 'awarders', 'can_mod_post', 'contest_mode',
       'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_crosspostable', 'is_meta', 'is_original_content',
       'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_richtext',
       'link_flair_text_color', 'link_flair_type', 'locked', 'media_only',
       'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler',
       'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers',
       'subreddit_type', 'thumbnail', 'title', 'total_awards_received',
       'treatment_tags', 'upvote_ratio',

In [14]:
cornell_submissions[['title', 'score']].sample(10)

| | title | score |
| --- | --- | --- |
| 11945 | I’m really tired of this lol. I’m like positiv... | 1 |
| 4632 | I’m doing an internship over the summer that h... | 1 |
| 11240 | @2940 | 1 |
| 20880 | Spring @ Cornell | 1 |
| 22953 | From the Luna Collegetown Grubhub menu | 1 |
| 8137 | When are you going back to campus? | 1 |
| 11531 | Just to confirm, everyone needs to arrange the... | 10 |
| 13831 | If anyone is feeling down, just remember this. | 1 |
| 18606 | Returning to Campus fromNYC | 2 |
| 11719 | Change my mind: MechEs are the peasants of the... | 1 |
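
Posts can also be filtered by upvote score after collection. A minimal sketch on a toy dataframe (titles and scores are taken from the sample above; the threshold of 1 and the names `sample` and `popular` are arbitrary choices for illustration):

```python
import pandas as pd

sample = pd.DataFrame({
    "title": [
        "Spring @ Cornell",
        "Returning to Campus fromNYC",
        "When are you going back to campus?",
    ],
    "score": [1, 2, 1],
})

# Boolean mask: keep only rows whose score exceeds the threshold.
min_score = 1
popular = sample[sample["score"] > min_score]
print(popular)
```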


Select only the columns of interest and assign them to a new dataframe.

In [15]:
cornell_final = cornell_submissions[['author', 'title', 'selftext', 'created_utc', 'created', 'score', 'num_comments', 'num_crossposts']]

cornell_final

| | author | title | selftext | created_utc | created | score | num_comments | num_crossposts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | sweet_sticky_sobol3v | HAPPY GRADUATION MY HILBERT SPACES | MAY YOUR NORMS ALWAYS SATISFY THE PARALLELOGRA... | 1622432312 | 1.622450e+09 | 2 | 0 | 0 |
| 1 | imahuman232323 | Where can we find the graduation photo slide? | Earlier this month, grads are encouraged to up... | 1622431734 | 1.622450e+09 | 2 | 6 | 0 |
| 2 | dragonslikepi | Spring '21 Medians? | Hello all,\n\nWhen you request your transcript... | 1622431013 | 1.622449e+09 | 1 | 5 | 0 |
| 3 | Financial-Trade-7064 | are any other 2020s sad | I'm thrilled that the class of 2021 got to enj... | 1622427121 | 1.622445e+09 | 1 | 10 | 0 |
| 4 | rickyrichboy | Thanks for all the memories Cornell 🐻❤️ | Although it wasn’t quite the end we were hopin... | 1622425715 | 1.622444e+09 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 23140 | okurrrr2348 | Orgo prelim 2 | Thoughts? Curious how it went for everybody! | 1584077864 | 1.584096e+09 | 1 | 3 | 0 |
| 23141 | College_Sadness | 2800 | wtf am I even doing in this class? choked so h... | 1584075269 | 1.584093e+09 | 1 | 7 | 0 |
| 23142 | vaani23 | Found out I did really badly on the CS 2800 pr... | As the title says, I just got my grade back fo... | 1584075261 | 1.584093e+09 | 1 | 3 | 0 |
| 23143 | Cu1106 | Did Ruttledge actually retire? | If so that's such an asshole move. Literally a... | 1584074794 | 1.584093e+09 | 1 | 16 | 0 |


Before exporting the finalized dataframe to a CSV file, clean the data by converting the Unix timestamps in `created_utc` to human-readable datetimes.

In [21]:
# assign() returns a new dataframe, which avoids the SettingWithCopyWarning
# that pandas raises when writing into a slice of cornell_submissions
cornell_final = cornell_final.assign(created_utc=pd.to_datetime(cornell_final['created_utc'], unit='s'))
cornell_final


| | author | title | selftext | created_utc | created | score | num_comments | num_crossposts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | sweet_sticky_sobol3v | HAPPY GRADUATION MY HILBERT SPACES | MAY YOUR NORMS ALWAYS SATISFY THE PARALLELOGRA... | 2021-05-31 03:38:32 | 1.622450e+09 | 2 | 0 | 0 |
| 1 | imahuman232323 | Where can we find the graduation photo slide? | Earlier this month, grads are encouraged to up... | 2021-05-31 03:28:54 | 1.622450e+09 | 2 | 6 | 0 |
| 2 | dragonslikepi | Spring '21 Medians? | Hello all,\n\nWhen you request your transcript... | 2021-05-31 03:16:53 | 1.622449e+09 | 1 | 5 | 0 |
| 3 | Financial-Trade-7064 | are any other 2020s sad | I'm thrilled that the class of 2021 got to enj... | 2021-05-31 02:12:01 | 1.622445e+09 | 1 | 10 | 0 |
| 4 | rickyrichboy | Thanks for all the memories Cornell 🐻❤️ | Although it wasn’t quite the end we were hopin... | 2021-05-31 01:48:35 | 1.622444e+09 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 23140 | okurrrr2348 | Orgo prelim 2 | Thoughts? Curious how it went for everybody! | 2020-03-13 05:37:44 | 1.584096e+09 | 1 | 3 | 0 |
| 23141 | College_Sadness | 2800 | wtf am I even doing in this class? choked so h... | 2020-03-13 04:54:29 | 1.584093e+09 | 1 | 7 | 0 |
| 23142 | vaani23 | Found out I did really badly on the CS 2800 pr... | As the title says, I just got my grade back fo... | 2020-03-13 04:54:21 | 1.584093e+09 | 1 | 3 | 0 |
| 23143 | Cu1106 | Did Ruttledge actually retire? | If so that's such an asshole move. Literally a... | 2020-03-13 04:46:34 | 1.584093e+09 | 1 | 16 | 0 |
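
The conversion can be spot-checked on a couple of `created_utc` values from the table. This is a small sketch rather than part of the original workflow; the variable names are illustrative, and the `utc=True` variant is an optional extra that makes the timezone explicit.

```python
import pandas as pd

# Unix timestamps count seconds since 1970-01-01 00:00:00 UTC.
created = pd.Series([1622432312, 1584074794])
converted = pd.to_datetime(created, unit="s")
print(converted.iloc[0])  # 2021-05-31 03:38:32

# Passing utc=True makes the timezone explicit instead of implied.
converted_utc = pd.to_datetime(created, unit="s", utc=True)
print(converted_utc.iloc[0])  # 2021-05-31 03:38:32+00:00
```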


In [22]:
cornell_final.to_csv("cornell_final.csv", encoding='utf-8', index=False)
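
When the CSV is read back later, the datetime column comes in as plain text unless pandas is told to parse it. A minimal sketch of the round trip, using a one-row toy dataframe and an in-memory buffer in place of `cornell_final.csv` (both are stand-ins chosen for illustration):

```python
import io

import pandas as pd

demo = pd.DataFrame({
    "title": ["Orgo prelim 2"],
    "created_utc": pd.to_datetime([1584077864], unit="s"),
    "score": [1],
})

# Write to an in-memory buffer instead of a file so the sketch leaves nothing on disk.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# Without parse_dates, created_utc would come back as plain strings.
restored = pd.read_csv(buffer, parse_dates=["created_utc"])
print(restored.dtypes)
```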