# r/Cornell Webscraping 

**Code adapted from a web scraping workshop**


####  Web Scraping Reddit

Though Reddit has its own API, a widely used alternative for research is **Pushshift**. You can read more about Pushshift in this [arXiv article](https://arxiv.org/abs/2001.08435).

> Why do people use Pushshift’s API instead of the official Reddit API?
>
>In short, Pushshift makes it much easier for researchers to query and retrieve historical Reddit data, provides extended functionality by providing fulltext search against comments and submissions, and has larger single query limits.
>
>Jason Baumgartner, et al., "The Pushshift Reddit Dataset"

#### Install PSAW

To work with the Pushshift API, we're going to install and use a Python wrapper called [PSAW](https://github.com/dmarx/psaw).

In [1]:
!pip3 install psaw



Import packages: [pandas](https://pandas.pydata.org/pandas-docs/stable/) and [matplotlib](https://matplotlib.org/3.1.1/contents.html).

In [4]:
import pandas as pd
import matplotlib.pyplot as plt

Import the `PushshiftAPI` class from PSAW.

In [5]:
from psaw import PushshiftAPI

Initialize the `PushshiftAPI` client.

In [6]:
api = PushshiftAPI()

#### PSAW Usage


To collect Reddit posts:

`api.search_submissions(subreddit="subreddit of interest", score=">certain upvote score", q="search keyword", before=date, after=date)`

To collect Reddit comments:

`api.search_comments(subreddit="subreddit of interest", score=">certain upvote score", q="search keyword", before=date, after=date)`
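
As a concrete illustration of the templates above, the keyword arguments can be assembled in a plain dictionary and unpacked into either search call. This is only a sketch: `build_query` is a hypothetical helper that is not part of PSAW, and the score threshold, keyword, and dates are made-up example values.

```python
import datetime as dt


def build_query(subreddit, min_score=None, keyword=None, after=None, before=None):
    """Assemble keyword arguments for a PSAW search call (hypothetical helper)."""
    params = {"subreddit": subreddit}
    if min_score is not None:
        # Score comparisons are passed as strings such as ">100"
        params["score"] = f">{min_score}"
    if keyword is not None:
        params["q"] = keyword
    if after is not None:
        # Dates are sent as integer Unix timestamps
        params["after"] = int(after.timestamp())
    if before is not None:
        params["before"] = int(before.timestamp())
    return params


utc = dt.timezone.utc
params = build_query(
    "Cornell",
    min_score=100,
    keyword="housing",
    after=dt.datetime(2021, 1, 1, tzinfo=utc),
    before=dt.datetime(2022, 1, 1, tzinfo=utc),
)
print(params)

# With an initialized client, the same dictionary would be unpacked into either call:
# api.search_submissions(**params)
# api.search_comments(**params)
```

Keeping the parameters in one dictionary makes it easy to run the same query against both posts and comments.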

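#### Collect Reddit submissions for a subreddit

Define the collection window as Unix timestamps. The request below collects every submission in the window; to keep only posts above a certain upvote score, pass a `score` argument as in the usage templates above.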

In [7]:
import datetime as dt
start = int(dt.datetime(2019,3,1,0,0,0).timestamp())
end = int(dt.datetime(2022,3,1,0,0,0).timestamp())
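
One detail worth checking: `datetime.timestamp()` interprets a datetime without timezone information as local time, so the window above shifts with the timezone of the machine running the notebook. A minimal sketch that pins the window to UTC instead (the names `start_utc` and `end_utc` are illustrative, not from the workshop code):

```python
import datetime as dt

# Attaching tzinfo makes the result independent of the machine's local timezone
start_utc = int(dt.datetime(2019, 3, 1, tzinfo=dt.timezone.utc).timestamp())
end_utc = int(dt.datetime(2022, 3, 1, tzinfo=dt.timezone.utc).timestamp())

print(start_utc)  # 1551398400
print(end_utc)    # 1646092800
```

The collected data further down includes posts from around 04:30 UTC on 2022-03-01, which is what one would expect if the original window was computed in a local timezone behind UTC.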

Set up the generator that makes the API requests, then collect the data for each Reddit submission into a dataframe.

In [8]:
api_request_generator = api.search_submissions(subreddit='Cornell', after=start, before=end)

In [9]:
cornell_submissions = pd.DataFrame([submission.d_ for submission in api_request_generator])
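
The `.d_` attribute in the list comprehension is what turns each result into a row. The sketch below imitates that pattern with a stand-in object rather than PSAW itself, assuming each result behaves like a namedtuple that carries its raw dictionary under `d_`; `Submission` and `fake_results` are hypothetical names with made-up data.

```python
from collections import namedtuple

# Stand-in for a PSAW result: named fields plus the raw dictionary under `d_`
Submission = namedtuple("Submission", ["title", "score", "d_"])


def fake_results():
    """Yield stand-in results lazily, the way an API generator would."""
    raw_items = (
        {"title": "First post", "score": 3},
        {"title": "Second post", "score": 7},
    )
    for raw in raw_items:
        yield Submission(title=raw["title"], score=raw["score"], d_=raw)


# Same shape as the comprehension above: one dictionary per result
rows = [result.d_ for result in fake_results()]
print(rows)
```

Passing a list of dictionaries like `rows` to `pd.DataFrame` produces one row per dictionary, with the keys as column names.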



Check how many Reddit posts have been collected.

In [10]:
cornell_submissions.shape

(49851, 90)

Check what columns/metadata are in the dataframe.

In [11]:
cornell_submissions.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext',
       'author_flair_template_id', 'author_flair_text',
       'author_flair_text_color', 'author_flair_type', 'author_fullname',
       'author_is_blocked', 'author_patreon_flair', 'author_premium',
       'awarders', 'can_mod_post', 'contest_mode', 'created_utc', 'domain',
       'full_link', 'gildings', 'id', 'is_created_from_ads_ui',
       'is_crosspostable', 'is_meta', 'is_original_content',
       'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_richtext',
       'link_flair_text_color', 'link_flair_type', 'locked', 'media_only',
       'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler',
       'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscriber

In [12]:
cornell_submissions[['title', 'score']].sample(10)

| | title | score |
|---|---|---|
| 15018 | My shot at making new friends | 1 |
| 36963 | Approved mini fridges? | 1 |
| 979 | Is it fair/allowed for a proffesor to offer li... | 1 |
| 23040 | Difference of a minor and concentration? | 1 |
| 829 | Is there like a badminton club or similar? I’m... | 1 |
| 23284 | Academic Standing in CALS | 1 |
| 22520 | I’m Kan, the Louis Vuitton Don, | 1 |
| 39573 | feeling heavy | 1 |
| 25268 | Where to order a cake from for Valentines day? | 1 |
| 24059 | I got ya homie! | 1 |


Select only the columns of interest and assign them to a new dataframe.

In [13]:
cornell_final = cornell_submissions[['author', 'title', 'selftext', 'created_utc', 'created', 'score', 'num_comments', 'num_crossposts']]

cornell_final

| | author | title | selftext | created_utc | created | score | num_comments | num_crossposts |
|---|---|---|---|---|---|---|---|---|
| 0 | emzow | Good places to Slavic squat on campus? | | 1646109247 | 1.646124e+09 | 1 | 0 | 0 |
| 1 | Its3amInRisley | COVID MEGATHREAD | Please guys remember to direct COVID stuff to ... | 1646108528 | 1.646123e+09 | 1 | 0 | 0 |
| 2 | sellingminifridge | Selling hasan minhaj ticket! | dm if anyone wants to buy one ticket for wedne... | 1646108470 | 1.646123e+09 | 1 | 0 | 0 |
| 3 | SwordfishInfamous | Is Louise open rn | Title | 1646108392 | 1.646123e+09 | 1 | 0 | 0 |
| 4 | HaliTheGreat | Anyone here in the Johnstown, Altoona, or Pitt... | I have a 4 hour drive to my town in West PA bu... | 1646108279 | 1.646123e+09 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49846 | bumchala | aight fuck it lets start this shit, i want eri... | willing to buy up to 4 tix | 1551455855 | 1.551470e+09 | 0 | 4 | 0 |
| 49847 | Broilier | waiting for Eric Andre tickets like | | 1551453199 | 1.551468e+09 | 113 | 12 | 0 |
| 49848 | Liberalscums | Off campus housing smh... | So is there a clearly labeled and comprehensiv... | 1551444808 | 1.551459e+09 | 6 | 3 | 0 |
| 49849 | hawaiianbarrels | Cornell Dining stealthy replacing meat with ve... | | 1551442085 | 1.551456e+09 | 0 | 5 | 0 |


Before exporting the dataframe to a CSV file, clean the data by converting the Unix timestamps in `created_utc` to standard datetimes.

In [14]:
# assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning
cornell_final = cornell_final.assign(created_utc=pd.to_datetime(cornell_final['created_utc'], unit='s'))
cornell_final


| | author | title | selftext | created_utc | created | score | num_comments | num_crossposts |
|---|---|---|---|---|---|---|---|---|
| 0 | emzow | Good places to Slavic squat on campus? | | 2022-03-01 04:34:07 | 1.646124e+09 | 1 | 0 | 0 |
| 1 | Its3amInRisley | COVID MEGATHREAD | Please guys remember to direct COVID stuff to ... | 2022-03-01 04:22:08 | 1.646123e+09 | 1 | 0 | 0 |
| 2 | sellingminifridge | Selling hasan minhaj ticket! | dm if anyone wants to buy one ticket for wedne... | 2022-03-01 04:21:10 | 1.646123e+09 | 1 | 0 | 0 |
| 3 | SwordfishInfamous | Is Louise open rn | Title | 2022-03-01 04:19:52 | 1.646123e+09 | 1 | 0 | 0 |
| 4 | HaliTheGreat | Anyone here in the Johnstown, Altoona, or Pitt... | I have a 4 hour drive to my town in West PA bu... | 2022-03-01 04:17:59 | 1.646123e+09 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49846 | bumchala | aight fuck it lets start this shit, i want eri... | willing to buy up to 4 tix | 2019-03-01 15:57:35 | 1.551470e+09 | 0 | 4 | 0 |
| 49847 | Broilier | waiting for Eric Andre tickets like | | 2019-03-01 15:13:19 | 1.551468e+09 | 113 | 12 | 0 |
| 49848 | Liberalscums | Off campus housing smh... | So is there a clearly labeled and comprehensiv... | 2019-03-01 12:53:28 | 1.551459e+09 | 6 | 3 | 0 |
| 49849 | hawaiianbarrels | Cornell Dining stealthy replacing meat with ve... | | 2019-03-01 12:08:05 | 1.551456e+09 | 0 | 5 | 0 |
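
The `unit='s'` conversion can be checked by hand with the standard library. The sketch below takes the `created_utc` value of the first row before conversion and interprets it as seconds since the Unix epoch in UTC; the variable names are illustrative.

```python
import datetime as dt

created_utc = 1646109247  # first row's created_utc before conversion

# Interpret the integer as seconds since 1970-01-01 00:00:00 UTC
converted = dt.datetime.fromtimestamp(created_utc, tz=dt.timezone.utc)
print(converted.strftime("%Y-%m-%d %H:%M:%S"))  # 2022-03-01 04:34:07
```

The result matches the converted value in the first row, which also confirms that the displayed times are in UTC rather than Ithaca local time.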


In [15]:
cornell_final.to_csv("cornell_final.csv", encoding='utf-8', index=False)
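
When the CSV is loaded again later, the datetime column comes back as plain text unless pandas is asked to parse it. A minimal sketch of that round trip, using a tiny made-up stand-in for `cornell_final` and an in-memory buffer instead of a file on disk:

```python
import io

import pandas as pd

# Tiny stand-in for cornell_final with a made-up title
df = pd.DataFrame({
    "title": ["example post"],
    "created_utc": pd.to_datetime([1646109247], unit="s"),
})

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # datetimes are written out as text

buffer.seek(0)
plain = pd.read_csv(buffer)  # created_utc is read back as text

buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["created_utc"])  # parsed into datetimes

print(plain["created_utc"].dtype)
print(parsed["created_utc"].dtype)
```

For the real file, the same `parse_dates` argument would be passed along with the filename `"cornell_final.csv"`.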