# Data Scraping

Thanks to Reddit's semi-relaxed conditions on data scraping, [as laid out in its robots.txt](https://www.reddit.com/robots.txt), we can obtain the data needed to build a model that answers our problem statement. There is no actual subreddit called "r/dangerouslycute" from which to pull the jumbled data my girlfriend is complaining about, but we can simulate the conundrum by pulling data from the two original subreddits whose content was merged together: "r/natureismetal" and "r/aww." After pulling both sets of data from these subreddits, we can combine them into one large dataframe to work with.

The easiest way of doing this, without an arduous web-scraping process of navigating ugly HTML, is to use an API. APIs are interfaces made by website developers and/or data-enthusiastic communities which let users (such as ourselves) access a piece of software's features and data. Some APIs are sophisticated enough to allow full modding support for certain software applications. In this case, we are going to use a web API known as [pushshift](https://pushshift.io) to obtain information from Reddit's different communities. Pushshift is a [community-generated API originally made by moderators from "r/datasets"](https://github.com/pushshift/api). Its documentation can be found through both Pushshift links in this paragraph. It is sophisticated and very useful for quickly obtaining data as JSON key-value pairs, which can be readily parsed in Python.
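
To make the key-value structure concrete, here is a minimal sketch of parsing such a response. The payload is a hand-written stand-in shaped like the `data` list that the function later in this notebook reads; it is not real API output, and the titles and scores are made up for illustration.

```python
import json

# Hypothetical payload: a hand-written stand-in for a Pushshift response body.
sample_body = '''
{"data": [
  {"title": "Example post A", "subreddit": "aww", "score": 1},
  {"title": "Example post B", "subreddit": "natureismetal", "score": 3}
]}
'''

parsed = json.loads(sample_body)  # JSON text -> nested dicts and lists
records = parsed["data"]          # the list of submissions lives under "data"

# Each record is a plain dict, so fields are read by key.
titles = [record["title"] for record in records]
print(titles)
```

With a live request, `requests` performs the same parsing step when `response.json()` is called.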

In [1]:
import pandas as pd #dataframes for storing and combining the scraped posts
import datetime as dt #converts epoch timestamps into dates
import time #pauses between API requests
import requests #sends HTTP requests to the API

## What Access Do We Have to the Data & What Data Do We Have Access to?

The API can pull data through two main pathways. One pathway searches through Reddit submissions (mainly posts) while the other searches through comments. For this study, we are only going to use Reddit submissions to gather the data for our model. The submission pathway exposes a lot of other accessible data, including the author name of the post, the popularity of the post as a whole, and more. As a frequent Reddit user and commenter, I can personally vouch for why it may not be wise to build a model on comments. The general point to remember is this: *posts on Reddit go through bot-moderated scrutiny, whereas comments do not*. To elaborate, each successful subreddit has strict guidelines and rules for submitting content relevant to the "sub" and often has a very dedicated team of moderators who see that these rules are followed as closely as possible. These rules keep the majority of submitted content relevant to the community's interests and to the subreddit's purpose. Comments are typically more plentiful than submissions and are often overlooked by moderators. Comments are always left open to interpretation and do not undergo as much scrutiny as posts do. This allows commenters to write narratives that go on complete tangents from the original context of the submission, solely in the spirit of online discussion. As a result, we may find people talking about their favorite movies or foods on subreddits devoted only to how cute cats are.
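
As a rough illustration of the two pathways, the sketch below builds a search URL for each. The base URL and query string follow the pattern used by the function later in this notebook; the `build_endpoint` helper is hypothetical, and the comment path is assumed to mirror the submission path, as that function's `kind` parameter implies.

```python
BASE = "https://api.pushshift.io/reddit/search"

def build_endpoint(kind, subreddit, size=500):
    """Hypothetical helper: return a search URL for 'submission' or 'comment'."""
    if kind not in ("submission", "comment"):
        raise ValueError(f"unknown kind: {kind!r}")
    return f"{BASE}/{kind}?subreddit={subreddit}&size={size}"

submission_url = build_endpoint("submission", "aww")  # the pathway used in this study
comment_url = build_endpoint("comment", "aww")        # the pathway we skip
print(submission_url)
print(comment_url)
```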

The next step is to consider which features will be important for our model. Using the API's documentation, we can decide on these features now, which makes any necessary data cleanup easier later on. The features thought to be relevant for our model are:

- `title` (title of the submission)
- `selftext` (text of the post submission)
- `subreddit` (name of the submission's associated subreddit)
- `created_utc` (time stamp of submission)
- `author` (name of the submission author)
- `num_comments` (number of comments on the submission)
- `score` (aggregated score of the submission, the difference between upvotes and downvotes)
- `is_self` (boolean to determine if the submission is solely a text post)
- `over_18` (boolean to determine if content is NSFW)
- `author_flair_text` (flair text native to the author when posting on a specific subreddit)
- `total_awards_received` (number of awards received)
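
Fixing the feature list up front also makes column selection mechanical. The sketch below is one way to do it, assuming `DataFrame.reindex` is acceptable here; the function later in this notebook selects with `full[SUBFIELDS]` instead, which raises a `KeyError` if a field is absent. The two raw records are made up for illustration.

```python
import pandas as pd

SUBFIELDS = ['title', 'selftext', 'subreddit', 'created_utc', 'author',
             'num_comments', 'score', 'is_self', 'over_18',
             'author_flair_text', 'total_awards_received']

# Hypothetical raw records: one carries an extra field, both lack most fields.
raw = pd.DataFrame([
    {"title": "Example A", "subreddit": "aww", "score": 1, "thumbnail": "x.jpg"},
    {"title": "Example B", "subreddit": "natureismetal", "score": 2},
])

# reindex keeps only the listed columns, in order, and fills absent ones with NaN.
selected = raw.reindex(columns=SUBFIELDS)
print(selected.columns.tolist())
```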

## Why These Features?

These features were chosen from the plethora of features we could pull because of their potential predictive value for our model. Below is the reasoning behind each variable's selection:

- `title` (necessary to determine titular key words)
- `selftext` (useful to find any key words to be added to our NLP modeling)
- `subreddit` (our prediction target)
- `created_utc` (helps verify the uniqueness of each record)
- `author` (may be useful in determining user interest per subreddit)
- `num_comments` (useful to find popular submissions within a subreddit)
- `score` (useful in determining the validity of a post to a subreddit by its popularity)
- `is_self` (useful in identifying posts with added text)
- `over_18` (may be useful in separating the two subreddits if one carries more NSFW content)
- `author_flair_text` (may be useful in finding relevant text related to a subreddit per author)
- `total_awards_received` (useful in understanding a post's weight on the subreddit)

## The Pushshift API Custom Function

In [2]:
#Credit to Mahdi Shadkam-Farrokhi for function
#The below function obtains and "cleans" the data from a subreddit. 
#The below function utilizes the pushshift API

def query_pushshift(subreddit, kind = 'submission', day_window = 30, n = 5):
    SUBFIELDS = ['title', 'selftext', 'subreddit', 'created_utc', 'author', 'num_comments', 'score', 'is_self', 'over_18', 'author_flair_text', 'total_awards_received'] #relevant subfields
    
    # establish base url and stem
    BASE_URL = f"https://api.pushshift.io/reddit/search/{kind}" # also known as the "API endpoint" 
    stem = f"{BASE_URL}?subreddit={subreddit}&size=500" # always pulling max of 500
    
    # instantiate empty list for temp storage
    posts = []
    
    # loop over the n time windows, pausing with `time.sleep` between requests
    for i in range(1, n + 1):
        URL = "{}&after={}d".format(stem, day_window * i) #calls the URL we are searching based on function input
        print("Querying from: " + URL) #displays the reddit and path we are querying from
        response = requests.get(URL) #grabs the actual data
        assert response.status_code == 200 #will give us an error if the request is not met
        mine = response.json()['data'] #grabs the json file of the relevant data
        df = pd.DataFrame.from_dict(mine) #converts the json data into a dataframe
        posts.append(df) #appends this dataframe into a list known as posts
        time.sleep(5) #pauses for 5 seconds between requests so we do not overload the API
    
    # pd.concat storage list
    full = pd.concat(posts, sort=False) #concats the list of dataframes into a giant dataframe
    
    # if submission
    if kind == "submission":
        # select desired columns
        full = full[SUBFIELDS] #putting in our subfields
        # drop duplicates
        full.drop_duplicates(inplace = True) #drops duplicates in our giant dataframe 

    # create `timestamp` column
    full['timestamp'] = full["created_utc"].map(dt.date.fromtimestamp) #converts created_utc (epoch seconds) into a calendar date in the machine's local time zone


    full.reset_index(inplace = True) #resets the index to eliminate the index repetition; the old index is kept as an `index` column
    print("Query Complete!") #lets us know when the query is complete    
    return full 
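
One detail of the function worth noting: `dt.date.fromtimestamp` converts using the local time zone of the machine running the notebook, so the same `created_utc` value can land on different dates on different machines. Below is a minimal sketch of a time-zone-independent alternative, assuming pandas' `to_datetime` with `unit="s"` and `utc=True` suits the analysis. This is not the approach the function above takes, and the column names `timestamp_utc` and `date_utc` are made up for illustration.

```python
import pandas as pd

# Two created_utc values taken from the r/natureismetal output in this notebook.
frame = pd.DataFrame({"created_utc": [1584925587, 1584940215]})

# Interpret the integers as seconds since the epoch, explicitly in UTC.
frame["timestamp_utc"] = pd.to_datetime(frame["created_utc"], unit="s", utc=True)

# Keep only the calendar date, mirroring the `timestamp` column's granularity.
frame["date_utc"] = frame["timestamp_utc"].dt.date
print(frame)
```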

In [3]:
nature_is_metal = query_pushshift("natureismetal") #pulls from the subreddit r/natureismetal

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=natureismetal&size=500&after=30d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=natureismetal&size=500&after=60d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=natureismetal&size=500&after=90d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=natureismetal&size=500&after=120d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=natureismetal&size=500&after=150d
Query Complete!


In [4]:
aww = query_pushshift("aww") #pulls from the subreddit r/aww

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=aww&size=500&after=30d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=aww&size=500&after=60d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=aww&size=500&after=90d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=aww&size=500&after=120d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=aww&size=500&after=150d
Query Complete!


In [5]:
nature_is_metal.head()

| | index | title | selftext | subreddit | created_utc | author | num_comments | score | is_self | over_18 | author_flair_text | total_awards_received | timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Huge Grizzly Bear | | natureismetal | 1584925587 | cobrakiller2000 | 198 | 1 | False | False | | 0 | 2020-03-22 |
| 1 | 1 | In my kitchen houseplant.. | | natureismetal | 1584929238 | Bronco7771 | 2 | 1 | False | False | | 0 | 2020-03-22 |
| 2 | 2 | In my kitchen houseplant.. | | natureismetal | 1584929255 | Bronco7771 | 2 | 1 | False | False | | 0 | 2020-03-22 |
| 3 | 3 | Deathlock | | natureismetal | 1584931304 | Hamstah_Huey | 1 | 1 | False | False | | 0 | 2020-03-22 |
| 4 | 4 | Seal eats an octopus | | natureismetal | 1584940215 | huntergill123 | 231 | 1 | False | False | | 0 | 2020-03-23 |
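
Rows 1 and 2 above share a title and author but differ in `created_utc`, so the whole-row `drop_duplicates` inside the function keeps both. The sketch below shows that behavior and one possible alternative, assuming that a repeated title by the same author should count as a single post. That assumption is mine, not part of the original function.

```python
import pandas as pd

# Rows copied from the nature_is_metal.head() output (relevant columns only).
sample = pd.DataFrame({
    "title": ["Huge Grizzly Bear", "In my kitchen houseplant..",
              "In my kitchen houseplant..", "Deathlock"],
    "author": ["cobrakiller2000", "Bronco7771", "Bronco7771", "Hamstah_Huey"],
    "created_utc": [1584925587, 1584929238, 1584929255, 1584931304],
})

# Whole-row comparison keeps the repost because created_utc differs.
exact = sample.drop_duplicates()

# Comparing only title and author treats the repost as a duplicate (keeps the first).
by_post = sample.drop_duplicates(subset=["title", "author"])
print(len(exact), len(by_post))
```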


In [6]:
aww.head()

| | index | title | selftext | subreddit | created_utc | author | num_comments | score | is_self | over_18 | author_flair_text | total_awards_received | timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | I looked out the window and this is what I saw | | aww | 1584924687 | dvsnlsn321 | 4 | 1 | False | False | | 0 | 2020-03-22 |
| 1 | 1 | Why is he looking at me like that help | | aww | 1584924690 | EaliyXX | 1 | 1 | False | False | | 0 | 2020-03-22 |
| 2 | 2 | My foster, Princess Noodle Tophat, wishes you ... | | aww | 1584924702 | SheburnsAZ | 6 | 1 | False | False | | 0 | 2020-03-22 |
| 3 | 3 | Cat and dog crazy funny video - funny animals ... | | aww | 1584924706 | dilshan9 | 0 | 1 | False | False | | 0 | 2020-03-22 |
| 4 | 4 | Afternoon snooze | | aww | 1584924710 | codadoda | 0 | 1 | False | False | | 0 | 2020-03-22 |


In [7]:
nature_is_metal.shape

(2500, 13)

In [8]:
aww.shape

(2500, 13)

In [9]:
all_data = pd.concat([nature_is_metal, aww])

In [10]:
all_data.shape

(5000, 13)
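
Because both frames were re-indexed from 0 inside the function, a plain `pd.concat` leaves repeated index labels in the combined frame. The sketch below uses tiny stand-in frames (hypothetical values) to show the difference `ignore_index=True` makes, should a unique index be wanted.

```python
import pandas as pd

# Tiny stand-ins for the two subreddit frames (hypothetical values).
left = pd.DataFrame({"subreddit": ["natureismetal"] * 2})
right = pd.DataFrame({"subreddit": ["aww"] * 2})

stacked = pd.concat([left, right])                         # index labels repeat: 0, 1, 0, 1
renumbered = pd.concat([left, right], ignore_index=True)   # fresh labels: 0, 1, 2, 3

print(stacked.index.tolist())
print(renumbered.index.tolist())
```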

In [11]:
all_data.to_csv('../data/dangerouslycute_data.csv')
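
By default `to_csv` also writes the dataframe index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below demonstrates this with an in-memory buffer and a made-up two-row frame; whether to pass `index=False` is a choice for this project rather than a requirement.

```python
import io
import pandas as pd

# Made-up two-row frame standing in for the combined data.
frame = pd.DataFrame({"title": ["Example A", "Example B"],
                      "subreddit": ["aww", "natureismetal"]})

# Default behavior: the index is written as an unnamed first column.
buffer = io.StringIO()
frame.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)

# index=False leaves the index out, so no extra column appears on reload.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```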

*All sources are referenced more fully in the `project_3_main.ipynb` file under the section "Sources and References."*