# Project 3: Reddit Post Sorting

- **ExplainLikeImFive (ELI5)** - Explain Like I'm Five is the best forum and archive on the internet for layperson-friendly explanations. Don't Panic!
- **AskScience** - Ask a science question, get a science answer.


---

We will be analyzing a random collection of posts from two subReddits, **ExplainLikeImFive** and **AskScience**, in order to build a model to predict if an individual posts belong to ELI5 or AskScience; we will be analyzing the Title and Body of the Post.

**What am I hoping to achieve with this?**
> If ELI5 is distinguishable from AskScience.

**Why?**
> To see if a subreddit focused on explaining things in a simple manner is that much different than a subreddit that wants to explain it any way they can.

# Data Collection
To collect our data we will be utilizing the Requests library. This library allows us to access websites and, in our case for this project, utilize the websites API to control and obtain desired information.

In [1]:
import requests
import pandas as pd

import time

## Testing and Exploring the Pushshift API

We will be using the pushshift API wrapper to access Reddit posts, obtained through the following url.

In [2]:
url = 'https://api.pushshift.io/reddit/search/submission'

Now we identify the subreddits we wish to access and what information we wish to pull.

In [3]:
# params1 and params2 completes the url above according to the pushshift API.
params1 = {
    'subreddit': 'explainlikeimfive',
    'size': 100,
    'before': 1631249809
}

We are ready to pull the data from the API, given our parameters. We then check to ensure it was connected successfully via status_code; 200 being successful.

In [4]:
res = requests.get(url, params1)
res.status_code

200

Our requests.get() gives us a large text document that can be organized into a json format in what appears to be a dictionary of dictionaries. We can format this with the .json() method. Specifically we want the data section of the encompassing dictionary.

In [5]:
posts = res.json()['data']

In [6]:
len(posts)

100

We see that we were able to pull the 100 posts, as indicated by our 'size' param. This is also the maximum amount of posts we can pull in a single request.

What kind of data does this represent?

In [7]:
posts[0].keys()

dict_keys(['all_awardings', 'allow_live_comments', 'author', 'author_flair_css_class', 'author_flair_richtext', 'author_flair_text', 'author_flair_type', 'author_fullname', 'author_is_blocked', 'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post', 'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id', 'is_created_from_ads_ui', 'is_crosspostable', 'is_meta', 'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video', 'link_flair_background_color', 'link_flair_css_class', 'link_flair_richtext', 'link_flair_template_id', 'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked', 'media_only', 'no_follow', 'num_comments', 'num_crossposts', 'over_18', 'parent_whitelist_status', 'permalink', 'pinned', 'pwls', 'removed_by_category', 'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler', 'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers', 'subreddit_type', 'suggested_sort', 'thumbnail', 'tit

To make it more readable, we can turn our data into a dataframe. We don't need all of the informational categories from above, so let's pick what we believe would be most useful.

In [8]:
df = pd.DataFrame(posts)[['id','subreddit', 'title', 'author', 'created_utc', 'selftext', 'created_utc']]
df.head()

Unnamed: 0,id,subreddit,title,author,created_utc,selftext,created_utc.1
0,plee4p,explainlikeimfive,ELI5: How long does it take for defensins on s...,alisensei,1631249765,[removed],1631249765
1,pled6q,explainlikeimfive,ELI5: What is the difference between the words...,Danaaerys,1631249646,,1631249646
2,plebya,explainlikeimfive,ELi5: Why can I seemingly breathe out two diff...,KingRexxi,1631249494,[removed],1631249494
3,ple9hy,explainlikeimfive,ELI5: Why is it impossible to get comfortable ...,smore-phine,1631249200,Have to be up in six hours for work but my min...,1631249200
4,ple8i3,explainlikeimfive,ELI5: How are ideas formed?,Mjosaphine,1631249085,,1631249085


In [9]:
df.tail()

Unnamed: 0,id,subreddit,title,author,created_utc,selftext,created_utc.1
95,pl6ihz,explainlikeimfive,Eli5 How can the skin regenerate but a cut off...,klusterxx,1631221154,,1631221154
96,pl6e4z,explainlikeimfive,ELI5: How do wind turbines work?,urfavefilipina,1631220760,How does each significant part of the turbine ...,1631220760
97,pl6cgb,explainlikeimfive,ELI5: What is particle horizon?,iahimide,1631220606,,1631220606
98,pl6ah2,explainlikeimfive,ELI5: how do colorblind glasses work?,Jhams3,1631220429,[removed],1631220429
99,pl68us,explainlikeimfive,ELI5: What is the difference between the dot p...,camthedps,1631220288,,1631220288


**Note: 'metadata' is a good way to check total posts to see if the subreddit is a feasible choice for the project.**

Also, use time.sleep(x) between subsequent pulls in order to get past the max size of 100 per request without getting blocked.

In [10]:
params2 = {
    'subreddit': 'explainlikeimfive',
    'metadata': 'true'
}

In [11]:
res2 = requests.get(url, params2)
res2.status_code

200

In [12]:
res2.json()['metadata']

{'after': None,
 'agg_size': 100,
 'api_version': '3.0',
 'before': None,
 'es_query': {'query': {'bool': {'filter': {'bool': {'must': [{'terms': {'subreddit': ['explainlikeimfive']}}],
      'should': []}},
    'must_not': []}},
  'size': 25,
  'sort': {'created_utc': 'desc'}},
 'execution_time_milliseconds': 56.79,
 'index': 'rs',
 'metadata': 'true',
 'ranges': [],
 'results_returned': 25,
 'shards': {'failed': 0, 'skipped': 0, 'successful': 20, 'total': 24},
 'size': 25,
 'sort': 'desc',
 'sort_type': 'created_utc',
 'subreddit': ['explainlikeimfive'],
 'timed_out': False,
 'total_results': 1312328}

**1.3 million posts should be enough. ExplainLikeImFive will be an acceptable choice.**

In [13]:
params3 = {
    'subreddit': 'askscience',
    'metadata': 'true'
}

In [14]:
res3 = requests.get(url, params3)
res3.status_code

200

In [15]:
res3.json()['metadata']['total_results']

1112795

**1.1 million posts should be enough. AskScience will be an acceptable choice.**

---

## Collect Desired Data

Now that we have done some exploration on our SubReddits and found what is available to us through Pushshift's API, let's pull enough data to perform an analysis.

In [44]:
def pull_data(subreddit):
    '''
    This function will attempt to pull 10,000 posts from the provided subreddit in chunks of 100.
    
    Parameter 'subreddit' should be a str, referring to the portion of the url that represents the subreddit.
    '''
    
    # Used to initial our dataframe
    i=0
    big_df = [[]]
    
    # We will be finding new posts by using the 'before' paramater available to us by pushshift api.
    # Initially, we do not want there to be any 'before' parameter.
    set_time = 1631249809
    
    # Keep running until we have 10,000 posts
    while len(big_df) < 10_000:
    
        # The parameters we pass to the pushshift api to pull data. size=100 is the maximum available 
        #and we only require data from the listed columns in 'fields'
        params = {
            'subreddit': subreddit,
            'size': 100,
            'fields': ['id', 'subreddit', 'title', 'author', 'created_utc', 'selftext', 'body'],
            'selftext:not': '[removed]',
            'title:not': 'AMA Series',
            'num_comments': '>5',
            'before': set_time
        }

        # Access the api through our requests library
        res = requests.get(url, params)

        # Ensure that our connection was successful (=200); if it isnt, exit and tell us the code it got.
        if res.status_code != 200:
            return print('There as been an error: res.status_code =', res.status_code)

        # Continue with our cycling program if status_code == 200
        else:
            
            # Pull in our data into a variable 'posts'
            posts = res.json()['data']
            
            # Find the post that was submitted at the farthest time from present and set it to 'set_time'
            set_time = posts[-1]['created_utc']
            
            # Initialize our dataframe that will be holding all of the posts for this subreddit
            # Grab our first 100 posts
            if i == 0:
                big_df = pd.DataFrame(posts)
                
                # Change i so this will not be ran again
                i = 1
                
                # Set a pause time so we don't get blocked by Reddit for abusing the API
                time.sleep(1)
            
            # After our first 100 posts, grab each subsequent 100 posts and concat onto our dataframe
            else:
                df = pd.DataFrame(posts)
                big_df = pd.concat([big_df, df])

                # Set a pause time so we don't get blocked by Reddit for abusing the API
                time.sleep(1)
                
    return big_df

In [45]:
ask_df = pull_data('askscience')
print(len(ask_df))
ask_df.head()

10095


Unnamed: 0,author,created_utc,id,selftext,subreddit,title
0,ChrisGnam,1630428742,pf9tvb,So most of my peers (26 y/o and older) don't h...,askscience,Are there physiological or psychological diffe...
1,MaoGo,1629571284,p8wued,Neutrinos are neutrally charged particles that...,askscience,How do we know that the neutrinos have spin?
2,the_protagonist,1629571059,p8ws1c,How does that “memory” work? \n\nThis comes f...,askscience,If white blood cells are constantly dying and ...
3,CyKii,1629567841,p8vtoe,Obviously it's best to be careful about these ...,askscience,"If mRNA vaccines remain proven safe, is it act..."
4,hairycoo,1629566821,p8vinv,,askscience,Can't we include multiple virus traits rather ...


In [46]:
ex_df = pull_data('explainlikeimfive')
print(len(ex_df))
ex_df.head()

10000


Unnamed: 0,author,created_utc,id,selftext,subreddit,title
0,j_d0tnet,1631248750,ple5nz,Disclaimer: I did see a previous question touc...,explainlikeimfive,"ELI5: Seriously, WTF is up with surface area a..."
1,ImpossibleZero,1631247022,pldqkn,I have a 30 year VA loan at 3.75% and my prope...,explainlikeimfive,ELI5: What does Refinancing a Mortgage Mean an...
2,80sKidCA,1631246964,pldq29,,explainlikeimfive,ELI5: Why and how does your body store tension...
3,Chardington,1631244279,pld1sd,"I’ve been getting into finance, stonks and cry...",explainlikeimfive,ELI5: What exactly is “liquidity”?
4,DentonJoe,1631244183,pld0wi,Always wondered why it doesn’t make sense to u...,explainlikeimfive,Eli5 why are diesel/electric powertrains econo...


### Check unique ID's to ensure we pulled our data correctly using the 'before' parameters.

The number of unique IDs should equal the length of our dataframe.

In [47]:
assert len(ask_df['id'].unique()) == len(ask_df)

In [48]:
assert len(ex_df['id'].unique()) == len(ex_df)

**Make sure we don't have any deleted or removed posts**

In [49]:
ex_df[ex_df['selftext']!= '[removed]'].shape

(10000, 6)

In [50]:
ask_df[ask_df['selftext']!= '[removed]'].shape

(10095, 6)

**Success!**

Now let's save these dataframes into csv's to be explored and analyzed.

In [51]:
ask_df.to_csv('../data/ask_df.csv', index=False)
ex_df.to_csv('../data/ex_df.csv', index=False)

In [109]:
stop

NameError: name 'stop' is not defined

# Comment Extraction

I really wanted this to work but my attempts thus far have been unsuccessful.
- Currently stuck connecting to the specific submission via 'parent_id'. There is a successful connection, but no comments are pulled. Either the 'parent_id' is wrong, there is an issue on the backend, or there is an issue with the res.json().
 - Successful pulls top comments in the subreddit when 'parent_id' is not given.
 - When 'parent_id' is given, gives empty list for posts and empty dataset for temp.

---

Top comments will not be included in the model.
- It would be very helpful in our classification model as the top comment by score is usually the 'accepted answer' and is in a specific vernacular depending on which subreddit we are in.

In [110]:
def get_comments(df, subreddit):
        
    text_url = 'https://api.pushshift.io//reddit/search/comment'
    new_df = df.copy()
    new_df['top_comment'] = ['']*len(df)
    
    for j, Id in enumerate(df['id']):
        
                            
        params = {
        'subreddit': subreddit,
        'size': 1,
        'sort_type': 'score',
        'parent_id': 't1' + new_df.iloc[1]['id'],
        }

        # Access the api through our requests library
        res = requests.get(text_url, params)

        # Ensure that our connection was successful (=200); if it isnt, exit and tell us the code it got.
        if res.status_code != 200:
            return print('There as been an error: res.status_code =', res.status_code)

        # Continue with our cycling program if status_code == 200
        else:

            # Pull in our data into a variable 'posts'
            posts = res.json()['data']
            temp = pd.DataFrame(posts)         
            
            if posts != []:
                new_df.iloc[j]['top_comment'] = temp['body'][0]
            
            # Set a pause time so we don't get blocked by Reddit for abusing the API
            time.sleep(1)

          
    return new_df

In [111]:
get_comments(ex_df, 'explainlikeimfive')

KeyboardInterrupt: 

In [76]:
ex_df.iloc[12]

author                                              mintee-fresh
created_utc                                           1631239520
id                                                        plbsi4
selftext       I don't really understand how seeding works wh...
subreddit                                      explainlikeimfive
title                    ELI5: How does seeding work in weather?
Name: 12, dtype: object

In [60]:
new_ask = get_comments(ask_df, 'explainlikeimfive')
new_ask.head()

KeyError: 'body'

In [25]:
new_ex = get_comments(ex_df, 'askscience')
new_ex.head()