# Data Scraping

Thanks to Reddit's relatively relaxed conditions for data scraping ([see its robots.txt](https://www.reddit.com/robots.txt)), we can obtain the data we need to build a model that answers our problem statement. There is no actual subreddit called "r/dangerouslycuteanimals" from which to pull the jumbled data my girlfriend is complaining about, but we can simulate the problem by pulling data from the two original subreddits whose content was merged together: "r/natureismetal" and "r/aww". After pulling both sets of data from these subreddits, we can combine them into one large dataframe to work with.

The easiest way to do this, without going through an arduous web-scraping process of navigating ugly HTML, is to use an API. APIs are interfaces, made by website developers and/or data-enthusiast communities, that let users (such as ourselves) access a piece of software's features and data programmatically. Some APIs are sophisticated enough to allow full modding support for certain software applications. In this case, we are going to use a web API known as [pushshift](https://pushshift.io) to obtain information from Reddit's different communities. Pushshift is a [community-generated API originally made by moderators from "r/datasets"](https://github.com/pushshift/api). Its documentation can be found at both of the pushshift links in this paragraph. It is very useful for quickly obtaining data as JSON key-value pairs, which can be readily parsed in Python.
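
To make the "JSON key-value pairs" point concrete, here is a minimal sketch of turning such a response into a dataframe. The payload below is a hypothetical stand-in rather than real API output; it only assumes that the posts sit in a list under a top-level `data` key, which is how the query function later in this notebook reads the response.

```python
import json

import pandas as pd

# Hypothetical payload shaped like an API response: a "data" list of posts,
# each post being a flat set of key-value pairs.
sample_response = """
{"data": [
  {"title": "Example post A", "subreddit": "aww", "score": 12},
  {"title": "Example post B", "subreddit": "natureismetal", "score": 7}
]}
"""

records = json.loads(sample_response)["data"]  # list of dicts, one per post
sample_df = pd.DataFrame(records)              # one row per post, one column per key
print(sample_df[["title", "subreddit"]])
```

Each key in a post becomes a column, which is what makes this format convenient to work with in pandas.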

In [1]:
import pandas as pd #imports pandas package
import datetime as dt #imports datetime package
import time #imports time package
import requests #imports requests package

The API can pull data through two main pathways. One searches Reddit submissions (mainly posts), while the other searches comments. For this study, we will only use submissions to gather data for our model. The submission pathway also exposes a lot of other data, including the author of the post, the popularity of the post as a whole, and more.

As a frequent Reddit user and commenter, I can personally vouch for why it may not be wise to build a model on comments. The general point to remember is this: *posts on Reddit go through bot-moderated scrutiny, whereas comments do not*. To elaborate, each successful subreddit has strict guidelines and rules for submitting content relevant to the "sub", and often has a very dedicated team of moderators who see that these rules are followed. These rules keep the majority of submitted content relevant to the community's interests and the subreddit's purpose. Comments are typically more plentiful than submissions and are often overlooked by moderators, so they do not undergo as much scrutiny as posts do. This allows commenters to go on complete tangents from the original context of the submission, solely in the spirit of online discussion. As a result, we may find people talking about their favorite movies or foods on a subreddit that is only about how cute cats are.
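
The two pathways differ only in the last segment of the endpoint URL. The sketch below builds both URLs without sending any request. `build_search_url` is a hypothetical helper written for illustration; the URL pattern is the one used by the query function in this notebook, and the `comment` path is an assumption based on that function's `kind` parameter.

```python
from urllib.parse import urlencode

BASE = "https://api.pushshift.io/reddit/search"

def build_search_url(kind, subreddit, size=500):
    """Return a search URL for `kind`, either "submission" or "comment"."""
    if kind not in ("submission", "comment"):
        raise ValueError(f"unknown kind: {kind!r}")
    # urlencode escapes the values and joins them as key=value pairs
    query = urlencode({"subreddit": subreddit, "size": size})
    return f"{BASE}/{kind}?{query}"

print(build_search_url("submission", "aww"))
print(build_search_url("comment", "aww"))
```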

The next step is to consider which features are going to be important for our model. Using the API's documentation, we can decide on these features.
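
Once a list of features is chosen, narrowing the raw records down to those fields might look like the sketch below. The field list and the toy records are illustrative assumptions. `DataFrame.reindex` is used here instead of plain column indexing so that a field missing from the records becomes an empty column rather than raising a `KeyError`.

```python
import pandas as pd

FIELDS = ["title", "selftext", "subreddit", "score"]  # illustrative subset

# Toy records: they carry an extra "url" key and lack "selftext".
raw = pd.DataFrame([
    {"title": "Post A", "subreddit": "aww", "score": 3, "url": "https://example.com/a"},
    {"title": "Post B", "subreddit": "aww", "score": 9, "url": "https://example.com/b"},
])

# reindex keeps the requested columns in order, drops the rest,
# and fills any column absent from `raw` with NaN.
selected = raw.reindex(columns=FIELDS)
print(selected.columns.tolist())
```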

In [44]:
# Credit to Mahdi Shadkam-Farrokhi for the function.
# The function below uses the pushshift API to obtain and "clean" the data from a subreddit.

def query_pushshift(subreddit, kind = 'submission', day_window = 30, n = 5):
    SUBFIELDS = ['title', 'selftext', 'subreddit', 'created_utc', 'author', 'num_comments', 'score', 'is_self', 'over_18', 'author_flair_text', 'total_awards_received']
    
    # establish base url and stem
    BASE_URL = f"https://api.pushshift.io/reddit/search/{kind}" # also known as the "API endpoint" 
    stem = f"{BASE_URL}?subreddit={subreddit}&size=500" # always pulling max of 500
    
    # instantiate empty list for temp storage
    posts = []
    
    # query one window per iteration, pausing between requests with `time.sleep`
    for i in range(1, n + 1):
        URL = "{}&after={}d".format(stem, day_window * i)
        print("Querying from: " + URL)
        response = requests.get(URL)
        response.raise_for_status()  # stop with an HTTPError if the request failed
        mine = response.json()['data']
        df = pd.DataFrame.from_dict(mine)
        posts.append(df)
        time.sleep(5)
    
    # pd.concat storage list
    full = pd.concat(posts, sort=False)
    
    # if submission
    if kind == "submission":
        # select desired columns and drop duplicate rows
        # (reassigning avoids modifying a slice of the original frame in place)
        full = full[SUBFIELDS].drop_duplicates()
        # select `is_self` == True
        #full = full.loc[full['is_self'] == True]

    # create `timestamp` column
    full['timestamp'] = full["created_utc"].map(dt.date.fromtimestamp)


    full.reset_index(inplace = True)
    print("Query Complete!")    
    return full 
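
The loop above sets only an `after` cutoff, so each request covers everything newer than that cutoff and successive requests can overlap, which is presumably why duplicates are dropped afterwards. One possible alternative, sketched below, pairs each `after` with a `before` so the windows do not overlap. This assumes the API accepts a `before` parameter in the same `Nd` form as `after`; `day_windows` and `window_query` are hypothetical helpers, not part of the original function.

```python
def day_windows(day_window=30, n=5):
    """Return (after, before) pairs in days ago, newest window first."""
    windows = []
    for i in range(1, n + 1):
        after = day_window * i         # older edge of the window
        before = day_window * (i - 1)  # newer edge; 0 means "now"
        windows.append((after, before))
    return windows

def window_query(after, before):
    """Format the time-range part of a query string (assumed parameter names)."""
    parts = [f"after={after}d"]
    if before > 0:
        parts.append(f"before={before}d")
    return "&".join(parts)

for after, before in day_windows(30, 3):
    print(window_query(after, before))
```

Each resulting string could be appended to the `stem` in place of the `after`-only suffix.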

In [45]:
nature_is_metal = query_pushshift("natureismetal")

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=natureismetal&size=500&after=30d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=natureismetal&size=500&after=60d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=natureismetal&size=500&after=90d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=natureismetal&size=500&after=120d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=natureismetal&size=500&after=150d
Query Complete!


In [None]:
aww = query_pushshift("aww")
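
With both subreddits pulled, combining them into one dataframe can be sketched as follows. The two small frames are hypothetical stand-ins for `nature_is_metal` and `aww`, so that the example runs without calling the API.

```python
import pandas as pd

# Hypothetical stand-ins for the two queried dataframes.
metal_demo = pd.DataFrame({"title": ["Jaguar hunting"], "subreddit": ["natureismetal"]})
aww_demo = pd.DataFrame({"title": ["Sleepy kitten", "Puppy yawns"], "subreddit": ["aww", "aww"]})

# ignore_index=True gives the combined frame a fresh 0..n-1 index,
# so no row label is repeated across the two sources.
combined = pd.concat([metal_demo, aww_demo], ignore_index=True)
print(combined["subreddit"].value_counts())
```

The `subreddit` column stays in the combined frame, so it can serve as the label that tells the two sources apart.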

In [46]:
nature_is_metal.loc[2, 'title']

'Jaguar killing big Caiman in water'

In [None]:
#['author', 'domain', 'full_link', 'is_self', 'num_comments','over_18', 'selftext', 'subreddit_type','subreddit', 'total_awards_received', 'created_utc']
    

In [102]:
full.reset_index(inplace = True)
print(type(full)) 

| Unnamed: 0 | index | all_awardings | associated_award | author | author_flair_background_color | author_flair_css_class | author_flair_richtext | author_flair_template_id | author_flair_text | author_flair_text_color | ... | send_replies | stickied | subreddit | subreddit_id | total_awards_received | author_cakeday | distinguished | steward_reports | edited | timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | [] |  | dyllmatic777 |  |  | [] |  |  |  | ... | True | False | natureismetal | t5_324zi | 0 |  |  |  |  | 2020-03-21 |
| 1 | 1 | [] |  | reddatazz |  |  | [] |  |  |  | ... | True | False | natureismetal | t5_324zi | 0 |  |  |  |  | 2020-03-21 |
| 2 | 2 | [] |  | Titaniumspyborgbear |  |  | [] |  |  |  | ... | True | False | natureismetal | t5_324zi | 0 |  |  |  |  | 2020-03-21 |
| 3 | 3 | [] |  | sm1rr0r |  |  | [] |  |  |  | ... | True | False | natureismetal | t5_324zi | 0 |  |  |  |  | 2020-03-21 |
| 4 | 4 | [] |  | Shadowbanned_User |  |  | [] |  |  |  | ... | True | False | natureismetal | t5_324zi | 0 |  |  |  |  | 2020-03-21 |


domain, full_link, is_original_content, is_self, link_flair_background_color, link_flair_text_color, link_flair_type, num_comments, over_18, selftext, subreddit_type, title, url, crosspost_parent, crosspost_parent_list, link_flair_css_class, link_flair_text

In [170]:
nature_is_metal.shape

(0, 9)

In [85]:
nature_is_metal.loc[2,'all_awardings']

2    []
2    []
2    []
2    []
2    []
Name: all_awardings, dtype: object