# Project 3: Reddit Post Sorting

- **ExplainLikeImFive (ELI5)** - Explain Like I'm Five is the best forum and archive on the internet for layperson-friendly explanations. Don't Panic!
- **AskScience** - Ask a science question, get a science answer.


---

We will analyze a random collection of posts from two subreddits, **ExplainLikeImFive** and **AskScience**, in order to build a model that predicts whether an individual post belongs to ELI5 or AskScience. We will be analyzing the title and body of each post.

**What am I hoping to achieve with this?**
> To find out whether ELI5 is distinguishable from AskScience.

**Why?**
> To see if a subreddit focused on explaining things in a simple manner is that much different than a subreddit that wants to explain it any way they can.

# Data Collection
To collect our data we will use the Requests library. This library allows us to access websites and, for this project, to use a website's API to request the information we want.

In [1]:
import requests
import pandas as pd

import time

## Testing and Exploring the Pushshift API

We will be using the Pushshift API to access Reddit posts through the following URL.

In [2]:
url = 'https://api.pushshift.io/reddit/search/submission'

Now we identify the subreddits we wish to access and what information we wish to pull.

In [3]:
# params1 completes the url above according to the Pushshift API.
params1 = {
    'subreddit': 'explainlikeimfive',
    'size': 100,
    'before': 1631249809
}
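The 'before' value is a Unix timestamp, in seconds. As a quick sanity check, the standard library can turn it into a readable UTC date. This is only a small sketch; the variable names are illustrative and not part of the original notebook.

```python
from datetime import datetime, timezone

# The cutoff passed as 'before' above, in seconds since the Unix epoch
before_ts = 1631249809

# Convert to a timezone-aware UTC datetime for readability
cutoff = datetime.fromtimestamp(before_ts, tz=timezone.utc)
print(cutoff.isoformat())  # 2021-09-10T04:56:49+00:00
```

The same conversion works for any 'created_utc' value returned by the API.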

We are ready to pull the data from the API, given our parameters. We then check to ensure it was connected successfully via status_code; 200 being successful.

In [4]:
res = requests.get(url, params1)
res.status_code

200

Our requests.get() call returns the response body as JSON text. The .json() method parses it into a dictionary, and the posts themselves sit under its 'data' key as a list of dictionaries, one per post. That 'data' section is the part we want.

In [5]:
posts = res.json()['data']

In [6]:
len(posts)

100

We see that we were able to pull 100 posts, as requested by our 'size' param. This is also the maximum number of posts we can pull in a single request.
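To make the shape of the response concrete, here is a minimal offline sketch using the standard library's json module. The payload is made up for illustration (two fake posts with three fields each); a real response carries many more keys per post.

```python
import json

# A tiny, made-up payload shaped like the response described above
sample_text = '''
{"data": [
    {"id": "abc123", "subreddit": "explainlikeimfive", "title": "ELI5: example one"},
    {"id": "def456", "subreddit": "explainlikeimfive", "title": "ELI5: example two"}
]}
'''

payload = json.loads(sample_text)   # roughly what res.json() does with the response text
sample_posts = payload['data']      # a list with one dictionary per post

print(type(payload).__name__, type(sample_posts).__name__, len(sample_posts))
print(sorted(sample_posts[0].keys()))
```

Because each post is a flat dictionary, a list like this can be handed directly to pd.DataFrame.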

What kind of data does this represent?

In [7]:
posts[0].keys()

dict_keys(['all_awardings', 'allow_live_comments', 'author', 'author_flair_css_class', 'author_flair_richtext', 'author_flair_text', 'author_flair_type', 'author_fullname', 'author_is_blocked', 'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post', 'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id', 'is_created_from_ads_ui', 'is_crosspostable', 'is_meta', 'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video', 'link_flair_background_color', 'link_flair_css_class', 'link_flair_richtext', 'link_flair_template_id', 'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked', 'media_only', 'no_follow', 'num_comments', 'num_crossposts', 'over_18', 'parent_whitelist_status', 'permalink', 'pinned', 'pwls', 'removed_by_category', 'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler', 'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers', 'subreddit_type', 'suggested_sort', 'thumbnail', 'tit

To make it more readable, we can turn our data into a dataframe. We don't need all of the informational categories from above, so let's pick what we believe would be most useful.

In [8]:
df = pd.DataFrame(posts)[['id','subreddit', 'title', 'author', 'created_utc', 'selftext', 'created_utc']]
df.head()

Unnamed: 0,id,subreddit,title,author,created_utc,selftext,created_utc.1
0,plee4p,explainlikeimfive,ELI5: How long does it take for defensins on s...,alisensei,1631249765,[removed],1631249765
1,pled6q,explainlikeimfive,ELI5: What is the difference between the words...,Danaaerys,1631249646,,1631249646
2,plebya,explainlikeimfive,ELi5: Why can I seemingly breathe out two diff...,KingRexxi,1631249494,[removed],1631249494
3,ple9hy,explainlikeimfive,ELI5: Why is it impossible to get comfortable ...,smore-phine,1631249200,Have to be up in six hours for work but my min...,1631249200
4,ple8i3,explainlikeimfive,ELI5: How are ideas formed?,Mjosaphine,1631249085,,1631249085


In [9]:
df.tail()

Unnamed: 0,id,subreddit,title,author,created_utc,selftext,created_utc.1
95,pl6ihz,explainlikeimfive,Eli5 How can the skin regenerate but a cut off...,klusterxx,1631221154,,1631221154
96,pl6e4z,explainlikeimfive,ELI5: How do wind turbines work?,urfavefilipina,1631220760,How does each significant part of the turbine ...,1631220760
97,pl6cgb,explainlikeimfive,ELI5: What is particle horizon?,iahimide,1631220606,,1631220606
98,pl6ah2,explainlikeimfive,ELI5: how do colorblind glasses work?,Jhams3,1631220429,[removed],1631220429
99,pl68us,explainlikeimfive,ELI5: What is the difference between the dot p...,camthedps,1631220288,,1631220288


**Note: 'metadata' is a good way to check total posts to see if the subreddit is a feasible choice for the project.**

Also, use time.sleep(x) between subsequent pulls in order to get past the max size of 100 per request without getting blocked.

In [10]:
params2 = {
    'subreddit': 'explainlikeimfive',
    'metadata': 'true'
}

In [11]:
res2 = requests.get(url, params2)
res2.status_code

200

In [12]:
res2.json()['metadata']

{'after': None,
 'agg_size': 100,
 'api_version': '3.0',
 'before': None,
 'es_query': {'query': {'bool': {'filter': {'bool': {'must': [{'terms': {'subreddit': ['explainlikeimfive']}}],
      'should': []}},
    'must_not': []}},
  'size': 25,
  'sort': {'created_utc': 'desc'}},
 'execution_time_milliseconds': 30.36,
 'index': 'rs',
 'metadata': 'true',
 'ranges': [],
 'results_returned': 25,
 'shards': {'failed': 0, 'skipped': 0, 'successful': 20, 'total': 24},
 'size': 25,
 'sort': 'desc',
 'sort_type': 'created_utc',
 'subreddit': ['explainlikeimfive'],
 'timed_out': False,
 'total_results': 1329080}

**1.3 million posts should be enough. ExplainLikeImFive will be an acceptable choice.**

In [13]:
params3 = {
    'subreddit': 'askscience',
    'metadata': 'true'
}

In [14]:
res3 = requests.get(url, params3)
res3.status_code

200

In [15]:
res3.json()['metadata']['total_results']

1129348

**1.1 million posts should be enough. AskScience will be an acceptable choice.**

---

## Data Collection

Now that we have done some exploration of our subreddits and found what is available to us through Pushshift's API, let's pull enough data to perform an analysis.

In [16]:
def pull_data(subreddit):
    '''
    This function will attempt to pull 10,000 posts from the provided subreddit in chunks of 100.
    
    Parameter 'subreddit' should be a str, referring to the portion of the url that represents the subreddit.
    '''
    
    # Flag for the first request, and a placeholder that is replaced by our dataframe
    i=0
    big_df = [[]]
    
    # We will be finding older posts by using the 'before' parameter made available by the Pushshift API.
    # We start from a fixed Unix timestamp, the same cutoff used in the exploration above.
    set_time = 1631249809
    
    # Keep running until we have 10,000 posts
    while len(big_df) < 10_000:
    
        # The parameters we pass to the Pushshift API to pull data. size=100 is the maximum available
        # and we only require data from the columns listed in 'fields'
        params = {
            'subreddit': subreddit,
            'size': 100,
            'fields': ['id', 'subreddit', 'title', 'author', 'created_utc', 'selftext', 'body'],
            'selftext:not': '[removed]',
            'title:not': 'AMA Series',
            'num_comments': '>5',
            'before': set_time
        }

        # Access the api through our requests library
        res = requests.get(url, params)

        # Ensure that our connection was successful (=200); if it isn't, exit and tell us the code it got.
        if res.status_code != 200:
            print('There has been an error: res.status_code =', res.status_code)
            return None

        # Continue with our cycling program if status_code == 200
        else:
            
            # Pull in our data into a variable 'posts'
            posts = res.json()['data']
            
            # Find the oldest post in this batch and use its timestamp as the next 'before' value
            set_time = posts[-1]['created_utc']
            
            # Initialize our dataframe that will be holding all of the posts for this subreddit
            # Grab our first 100 posts
            if i == 0:
                big_df = pd.DataFrame(posts)
                
                # Change i so this will not be run again
                i = 1
                
                # Set a pause time so we don't get blocked by Reddit for abusing the API
                time.sleep(0.25)
            
            # After our first 100 posts, grab each subsequent 100 posts and concat onto our dataframe
            else:
                df = pd.DataFrame(posts)
                big_df = pd.concat([big_df, df])

                # Set a pause time so we don't get blocked by Reddit for abusing the API
                time.sleep(1)
                
    return big_df
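One edge case in the loop above is a request that returns no posts, in which case indexing the last element would fail. The sketch below shows the same paging idea with an explicit stop condition. It is not the notebook's function: `pull_pages` and `fetch_page` are hypothetical names, and the API call is replaced by an offline stand-in so the logic can be run without network access.

```python
import time

def pull_pages(fetch_page, start_before, target=10_000, pause=0.0):
    """Collect posts by paging backwards in time.

    fetch_page is a hypothetical callable: it takes a 'before' timestamp and
    returns a list of post dicts (newest first), e.g. a thin wrapper around
    requests.get(url, params).json()['data'].
    """
    collected = []
    before = start_before
    while len(collected) < target:
        batch = fetch_page(before)
        if not batch:            # nothing older is left, so stop instead of indexing [-1]
            break
        collected.extend(batch)
        before = batch[-1]['created_utc']   # oldest post in this batch
        time.sleep(pause)        # pause between requests to be polite to the API
    return collected


# Offline stand-in for the API: 250 fake posts with descending timestamps
fake_posts = [{'id': f'p{t}', 'created_utc': t} for t in range(250, 0, -1)]

def fake_fetch(before, size=100):
    return [p for p in fake_posts if p['created_utc'] < before][:size]

demo_posts = pull_pages(fake_fetch, start_before=251, target=1_000)
print(len(demo_posts), demo_posts[0]['created_utc'], demo_posts[-1]['created_utc'])
```

Collecting plain dictionaries in a list and building the dataframe once at the end also avoids repeated pd.concat calls, though either approach works at this scale.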

In [32]:
ask_df = pull_data('askscience')
print(len(ask_df))
ask_df.head()

10093


Unnamed: 0,author,created_utc,id,selftext,subreddit,title
0,ChrisGnam,1630428742,pf9tvb,So most of my peers (26 y/o and older) don't h...,askscience,Are there physiological or psychological diffe...
1,MaoGo,1629571284,p8wued,Neutrinos are neutrally charged particles that...,askscience,How do we know that the neutrinos have spin?
2,the_protagonist,1629571059,p8ws1c,How does that “memory” work? \n\nThis comes f...,askscience,If white blood cells are constantly dying and ...
3,CyKii,1629567841,p8vtoe,Obviously it's best to be careful about these ...,askscience,"If mRNA vaccines remain proven safe, is it act..."
4,hairycoo,1629566821,p8vinv,,askscience,Can't we include multiple virus traits rather ...


In [33]:
ex_df = pull_data('explainlikeimfive')
print(len(ex_df))
ex_df.head()

10096


Unnamed: 0,author,created_utc,id,selftext,subreddit,title
0,j_d0tnet,1631248750,ple5nz,Disclaimer: I did see a previous question touc...,explainlikeimfive,"ELI5: Seriously, WTF is up with surface area a..."
1,ImpossibleZero,1631247022,pldqkn,I have a 30 year VA loan at 3.75% and my prope...,explainlikeimfive,ELI5: What does Refinancing a Mortgage Mean an...
2,80sKidCA,1631246964,pldq29,,explainlikeimfive,ELI5: Why and how does your body store tension...
3,Chardington,1631244279,pld1sd,"I’ve been getting into finance, stonks and cry...",explainlikeimfive,ELI5: What exactly is “liquidity”?
4,DentonJoe,1631244183,pld0wi,Always wondered why it doesn’t make sense to u...,explainlikeimfive,Eli5 why are diesel/electric powertrains econo...


### Check unique IDs to ensure we pulled our data correctly using the 'before' parameter.

The number of unique IDs should equal the length of our dataframe.

In [34]:
assert len(ask_df['id'].unique()) == len(ask_df)

In [35]:
assert len(ex_df['id'].unique()) == len(ex_df)

**Make sure we don't have any deleted or removed posts**

In [36]:
ex_df[ex_df['selftext']!= '[removed]'].shape

(10096, 6)

In [37]:
ask_df[ask_df['selftext']!= '[removed]'].shape

(10093, 6)

**Success!**

Now let's save these dataframes as CSV files and read them back in. By default, pandas reads empty fields as null values (NaN), which makes it easy to see how many posts have no body text.

In [38]:
ask_df.to_csv('../data/ask_df.csv', index=False)
ex_df.to_csv('../data/ex_df.csv', index=False)

In [39]:
ex_df = pd.read_csv('../data/ex_df.csv')
print(ex_df.shape)
ex_df.head()

(10096, 6)


Unnamed: 0,author,created_utc,id,selftext,subreddit,title
0,j_d0tnet,1631248750,ple5nz,Disclaimer: I did see a previous question touc...,explainlikeimfive,"ELI5: Seriously, WTF is up with surface area a..."
1,ImpossibleZero,1631247022,pldqkn,I have a 30 year VA loan at 3.75% and my prope...,explainlikeimfive,ELI5: What does Refinancing a Mortgage Mean an...
2,80sKidCA,1631246964,pldq29,,explainlikeimfive,ELI5: Why and how does your body store tension...
3,Chardington,1631244279,pld1sd,"I’ve been getting into finance, stonks and cry...",explainlikeimfive,ELI5: What exactly is “liquidity”?
4,DentonJoe,1631244183,pld0wi,Always wondered why it doesn’t make sense to u...,explainlikeimfive,Eli5 why are diesel/electric powertrains econo...


In [40]:
ask_df = pd.read_csv('../data/ask_df.csv')
print(ask_df.shape)
ask_df.head()

(10093, 6)


Unnamed: 0,author,created_utc,id,selftext,subreddit,title
0,ChrisGnam,1630428742,pf9tvb,So most of my peers (26 y/o and older) don't h...,askscience,Are there physiological or psychological diffe...
1,MaoGo,1629571284,p8wued,Neutrinos are neutrally charged particles that...,askscience,How do we know that the neutrinos have spin?
2,the_protagonist,1629571059,p8ws1c,How does that “memory” work? \n\nThis comes f...,askscience,If white blood cells are constantly dying and ...
3,CyKii,1629567841,p8vtoe,Obviously it's best to be careful about these ...,askscience,"If mRNA vaccines remain proven safe, is it act..."
4,hairycoo,1629566821,p8vinv,,askscience,Can't we include multiple virus traits rather ...


## Data Cleaning

In [41]:
ex_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10096 entries, 0 to 10095
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   author       10096 non-null  object
 1   created_utc  10096 non-null  int64 
 2   id           10096 non-null  object
 3   selftext     5154 non-null   object
 4   subreddit    10096 non-null  object
 5   title        10096 non-null  object
dtypes: int64(1), object(5)
memory usage: 473.4+ KB


In [42]:
ask_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10093 entries, 0 to 10092
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   author       10093 non-null  object
 1   created_utc  10093 non-null  int64 
 2   id           10093 non-null  object
 3   selftext     6801 non-null   object
 4   subreddit    10093 non-null  object
 5   title        10093 non-null  object
dtypes: int64(1), object(5)
memory usage: 473.2+ KB


**We appear to have some empty selftext variables, which represent the body of the post. These posts without any selftext would only provide value through the title, and may be difficult for our model to analyze.**
> Since we still have a significant amount of data without these posts, we will remove posts without selftext.

In [43]:
ex_df.dropna(inplace=True)
ask_df.dropna(inplace=True)

In [114]:
ex_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5154 entries, 0 to 10095
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   author       5154 non-null   object
 1   created_utc  5154 non-null   int64 
 2   id           5154 non-null   object
 3   selftext     5154 non-null   object
 4   subreddit    5154 non-null   object
 5   title        5154 non-null   object
dtypes: int64(1), object(5)
memory usage: 410.9+ KB


In [115]:
ask_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6801 entries, 0 to 10089
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   author       6801 non-null   object
 1   created_utc  6801 non-null   int64 
 2   id           6801 non-null   object
 3   selftext     6801 non-null   object
 4   subreddit    6801 non-null   object
 5   title        6801 non-null   object
dtypes: int64(1), object(5)
memory usage: 371.9+ KB


**Beyond this simple cleaning, we want to keep everything we possibly can for our model. If variables happen to cause issues while modeling, they will be handled at that time.**

# Comment Extraction

**Responses to the questions asked in these threads are largely the point of the subreddits, and this data science project. We really want to know how different the answers are and the top comment is typically the best answer to the posted question.**

We want to go through our remaining submissions obtained above and retrieve the comments associated with these submissions. 

We have to do this in 2 steps:
1. We will grab comment ids as well as the number of comments on that submission.
2. We look up the individual comments associated with each submission and keep only the comment with the most upvotes, discarding the rest.

## 1. Get Comment IDs and Number of Comments for each Submission

In [122]:
def get_comments_list(dataframe):
        
    new_df = dataframe.copy()
    comment_ids = []
    comment_amnt = []
    
    for j, Id in enumerate(new_df['id']):
        
        comment_url = 'https://api.pushshift.io/reddit/submission/comment_ids/' + str(Id)

        # Access the api through our requests library
        res = requests.get(comment_url)

        # Ensure successful connection (=200); if it isn't, exit and print the code.
        if res.status_code != 200:
            print('There as been an error: res.status_code =', res.status_code)
            
            # Fill the rest of the lists with 0's in order to add it to our dataframe
            for i in range(new_df.shape[0] - len(comment_amnt)):
                comment_ids.append(0)
                comment_amnt.append(0)
            
            new_df['comment_ids'] = comment_ids
            new_df['comment_amnt'] = comment_amnt
            return new_df

        # Continue with our cycling program if status_code == 200
        else:
            
            # Load the returned comment IDs into a variable 'posts'
            posts = res.json()['data']
            temp = pd.DataFrame(posts)         
            
            if posts != []:
                comment_ids.append(temp.values)
                comment_amnt.append(temp.shape[0])
            
            else:
                comment_ids.append(0)
                comment_amnt.append(0)
            
            # Set a pause time so we don't get blocked by Reddit for abusing the API
            time.sleep(0.25)

    # add our created columns to the dataframe
    new_df['comment_ids'] = comment_ids
    new_df['comment_amnt'] = comment_amnt
    
    # Return the new dataframe with its 2 new columns
    return new_df

## Comment IDs for ELI5

In [123]:
new_ex_df = get_comments_list(ex_df)

There as been an error: res.status_code = 500


In [124]:
new_ex_df.head()

Unnamed: 0,author,created_utc,id,selftext,subreddit,title,comment_ids,comment_amnt
0,j_d0tnet,1631248750,ple5nz,Disclaimer: I did see a previous question touc...,explainlikeimfive,"ELI5: Seriously, WTF is up with surface area a...","[[hc9xp88], [hc9xubv], [hc9y66a], [hc9ybwi], [...",13
1,ImpossibleZero,1631247022,pldqkn,I have a 30 year VA loan at 3.75% and my prope...,explainlikeimfive,ELI5: What does Refinancing a Mortgage Mean an...,"[[hc9uc6o], [hc9umen], [hc9usj0], [hc9uxi7], [...",6
3,Chardington,1631244279,pld1sd,"I’ve been getting into finance, stonks and cry...",explainlikeimfive,ELI5: What exactly is “liquidity”?,"[[hc9om2y], [hc9ortz], [hc9osmq], [hc9osnr], [...",10
4,DentonJoe,1631244183,pld0wi,Always wondered why it doesn’t make sense to u...,explainlikeimfive,Eli5 why are diesel/electric powertrains econo...,"[[hc9okcf], [hc9pkte], [hc9r0vn], [hc9rl2i], [...",17
5,Aleitei,1631243837,plcxn4,"Having trouble grasping this, I know time and ...",explainlikeimfive,ELI5: Why does time slow down near stronger gr...,"[[hc9nq9s], [hc9omqr], [hc9thei], [hc9ti4w], [...",17


In [126]:
new_ex_df.shape

(5154, 8)

**These took a very long time to run, so we are saving them immediately.**

In [128]:
#checkpoint 
new_ex_df.to_csv('../data/ex_df_comments.csv', index=False)

## Comment IDs for AskScience

In [129]:
new_ask_df = get_comments_list(ask_df)

There as been an error: res.status_code = 500


In [130]:
new_ask_df.head()

Unnamed: 0,author,created_utc,id,selftext,subreddit,title,comment_ids,comment_amnt
0,ChrisGnam,1630428742,pf9tvb,So most of my peers (26 y/o and older) don't h...,askscience,Are there physiological or psychological diffe...,"[[hb2u26q], [hb2w2r6], [hb2xjpe], [hb2zvcg], [...",20
1,MaoGo,1629571284,p8wued,Neutrinos are neutrally charged particles that...,askscience,How do we know that the neutrinos have spin?,"[[h9tnotf], [h9tq6r5], [h9tuejy], [h9u1ubp], [...",12
2,the_protagonist,1629571059,p8ws1c,How does that “memory” work? \n\nThis comes f...,askscience,If white blood cells are constantly dying and ...,"[[h9tuq0p], [h9tuuqg], [h9u10c7], [h9umpe1], [...",12
3,CyKii,1629567841,p8vtoe,Obviously it's best to be careful about these ...,askscience,"If mRNA vaccines remain proven safe, is it act...","[[h9tem11], [h9tfbjt], [h9tfk2j], [h9tfspy], [...",292
5,Comprehensive_Lead41,1629564364,p8ushd,Why does it look like this? https://images.app...,askscience,Why is the floor of the Uyuni salt desert part...,"[[h9t8m72], [h9t8rg2], [h9t9h9s], [h9t9zht], [...",94


In [132]:
new_ask_df.shape

(6801, 8)

**These took a very long time to run, so we are saving them immediately.**

In [131]:
#checkpoint 
new_ask_df.to_csv('../data/ask_df_comments.csv', index=False)

### Load in saved comment datasets

In [4]:
new_ex_df = pd.read_csv('../data/ex_df_comments.csv')
new_ex_df.head()

Unnamed: 0,author,created_utc,id,selftext,subreddit,title,comment_ids,comment_amnt
0,j_d0tnet,1631248750,ple5nz,Disclaimer: I did see a previous question touc...,explainlikeimfive,"ELI5: Seriously, WTF is up with surface area a...",[['hc9xp88']\n ['hc9xubv']\n ['hc9y66a']\n ['h...,13
1,ImpossibleZero,1631247022,pldqkn,I have a 30 year VA loan at 3.75% and my prope...,explainlikeimfive,ELI5: What does Refinancing a Mortgage Mean an...,[['hc9uc6o']\n ['hc9umen']\n ['hc9usj0']\n ['h...,6
2,Chardington,1631244279,pld1sd,"I’ve been getting into finance, stonks and cry...",explainlikeimfive,ELI5: What exactly is “liquidity”?,[['hc9om2y']\n ['hc9ortz']\n ['hc9osmq']\n ['h...,10
3,DentonJoe,1631244183,pld0wi,Always wondered why it doesn’t make sense to u...,explainlikeimfive,Eli5 why are diesel/electric powertrains econo...,[['hc9okcf']\n ['hc9pkte']\n ['hc9r0vn']\n ['h...,17
4,Aleitei,1631243837,plcxn4,"Having trouble grasping this, I know time and ...",explainlikeimfive,ELI5: Why does time slow down near stronger gr...,[['hc9nq9s']\n ['hc9omqr']\n ['hc9thei']\n ['h...,17


In [5]:
new_ask_df = pd.read_csv('../data/ask_df_comments.csv')
new_ask_df.head()

Unnamed: 0,author,created_utc,id,selftext,subreddit,title,comment_ids,comment_amnt
0,ChrisGnam,1630428742,pf9tvb,So most of my peers (26 y/o and older) don't h...,askscience,Are there physiological or psychological diffe...,[['hb2u26q']\n ['hb2w2r6']\n ['hb2xjpe']\n ['h...,20
1,MaoGo,1629571284,p8wued,Neutrinos are neutrally charged particles that...,askscience,How do we know that the neutrinos have spin?,[['h9tnotf']\n ['h9tq6r5']\n ['h9tuejy']\n ['h...,12
2,the_protagonist,1629571059,p8ws1c,How does that “memory” work? \n\nThis comes f...,askscience,If white blood cells are constantly dying and ...,[['h9tuq0p']\n ['h9tuuqg']\n ['h9u10c7']\n ['h...,12
3,CyKii,1629567841,p8vtoe,Obviously it's best to be careful about these ...,askscience,"If mRNA vaccines remain proven safe, is it act...",[['h9tem11']\n ['h9tfbjt']\n ['h9tfk2j']\n ['h...,292
4,Comprehensive_Lead41,1629564364,p8ushd,Why does it look like this? https://images.app...,askscience,Why is the floor of the Uyuni salt desert part...,[['h9t8m72']\n ['h9t8rg2']\n ['h9t9h9s']\n ['h...,94


### Iterate through ```comment_ids``` and make it a cleaner list

In [22]:
def clean_comment_ids(dataframe):
    
    new_df = dataframe.copy()
    new_comment_ids = []
    
    for lists in new_df['comment_ids']:
        temp = lists.replace('\n', ',').replace('[','').replace(']','')
        new_comment_ids.append(temp)
    
    new_df['comment_ids'] = new_comment_ids
    return new_df
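As an alternative to chained replace calls, the IDs can be pulled out of the stringified array with a regular expression, which returns a real Python list. This is a sketch under the assumption that every ID is a lowercase alphanumeric string wrapped in single quotes, as in the rows shown above; `parse_comment_ids` is a hypothetical helper, not part of the notebook.

```python
import re

def parse_comment_ids(raw):
    """Extract comment IDs from the stringified array stored in the CSV.

    Assumes IDs are lowercase alphanumeric strings wrapped in single quotes.
    Anything else (e.g. the 0 placeholder for 'no comments') yields [].
    """
    return re.findall(r"'([0-9a-z]+)'", str(raw))

raw_cell = "[['hc9xp88']\n ['hc9xubv']\n ['hc9y66a']]"
print(parse_comment_ids(raw_cell))            # ['hc9xp88', 'hc9xubv', 'hc9y66a']
print(parse_comment_ids(0))                   # []
print(','.join(parse_comment_ids(raw_cell)))  # hc9xp88,hc9xubv,hc9y66a
```

The comma-joined form on the last line is the shape the comment search URL expects for its ids parameter.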

In [23]:
new_ex_df = clean_comment_ids(new_ex_df)
new_ex_df.head()

Unnamed: 0,author,created_utc,id,selftext,subreddit,title,comment_ids,comment_amnt
0,j_d0tnet,1631248750,ple5nz,Disclaimer: I did see a previous question touc...,explainlikeimfive,"ELI5: Seriously, WTF is up with surface area a...","'hc9xp88', 'hc9xubv', 'hc9y66a', 'hc9ybwi', 'h...",13
1,ImpossibleZero,1631247022,pldqkn,I have a 30 year VA loan at 3.75% and my prope...,explainlikeimfive,ELI5: What does Refinancing a Mortgage Mean an...,"'hc9uc6o', 'hc9umen', 'hc9usj0', 'hc9uxi7', 'h...",6
2,Chardington,1631244279,pld1sd,"I’ve been getting into finance, stonks and cry...",explainlikeimfive,ELI5: What exactly is “liquidity”?,"'hc9om2y', 'hc9ortz', 'hc9osmq', 'hc9osnr', 'h...",10
3,DentonJoe,1631244183,pld0wi,Always wondered why it doesn’t make sense to u...,explainlikeimfive,Eli5 why are diesel/electric powertrains econo...,"'hc9okcf', 'hc9pkte', 'hc9r0vn', 'hc9rl2i', 'h...",17
4,Aleitei,1631243837,plcxn4,"Having trouble grasping this, I know time and ...",explainlikeimfive,ELI5: Why does time slow down near stronger gr...,"'hc9nq9s', 'hc9omqr', 'hc9thei', 'hc9ti4w', 'h...",17


In [24]:
new_ask_df = clean_comment_ids(new_ask_df)
new_ask_df.head()

Unnamed: 0,author,created_utc,id,selftext,subreddit,title,comment_ids,comment_amnt
0,ChrisGnam,1630428742,pf9tvb,So most of my peers (26 y/o and older) don't h...,askscience,Are there physiological or psychological diffe...,"'hb2u26q', 'hb2w2r6', 'hb2xjpe', 'hb2zvcg', 'h...",20
1,MaoGo,1629571284,p8wued,Neutrinos are neutrally charged particles that...,askscience,How do we know that the neutrinos have spin?,"'h9tnotf', 'h9tq6r5', 'h9tuejy', 'h9u1ubp', 'h...",12
2,the_protagonist,1629571059,p8ws1c,How does that “memory” work? \n\nThis comes f...,askscience,If white blood cells are constantly dying and ...,"'h9tuq0p', 'h9tuuqg', 'h9u10c7', 'h9umpe1', 'h...",12
3,CyKii,1629567841,p8vtoe,Obviously it's best to be careful about these ...,askscience,"If mRNA vaccines remain proven safe, is it act...","'h9tem11', 'h9tfbjt', 'h9tfk2j', 'h9tfspy', 'h...",292
4,Comprehensive_Lead41,1629564364,p8ushd,Why does it look like this? https://images.app...,askscience,Why is the floor of the Uyuni salt desert part...,"'h9t8m72', 'h9t8rg2', 'h9t9h9s', 'h9t9zht', 'h...",94


In [135]:
new_ask_df.to_csv('../data/ask_df_comments.csv', index=False)
new_ex_df.to_csv('../data/ex_df_comments.csv', index=False)

## 2. Find Top Comment for each Submission

**Brief overview of this process:**
- We pass our list of comment_ids into the api call to bring up all comments for a submission.
- We then sort by score and keep the top comment.

**Issues:**
- The maximum number of rows the API can pull at once is 500. When comment_amnt is > 500, the program fails and all remaining top_comments are set to 0.
- Sometimes the top comment is 'removed'.

**Fixes:**
- Skip all submissions with > 500 comments. Each subreddit has <150 of these types of submissions.
- Drop all submissions with top_comment = 'removed'. This is also a small number of submissions.
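Skipping is the approach taken here. A possible alternative, not used in this notebook, would be to split the IDs of a large thread into batches that stay within the 500-row limit described above and request each batch separately. The sketch below only builds the URLs; the helper names are hypothetical and the batching has not been verified against the live API.

```python
def chunk_ids(comment_ids, max_per_request=500):
    """Split a list of comment IDs into batches no larger than max_per_request."""
    return [comment_ids[i:i + max_per_request]
            for i in range(0, len(comment_ids), max_per_request)]

def build_comment_urls(comment_ids, max_per_request=500):
    """Build one comment-search URL per batch of IDs."""
    base = 'https://api.pushshift.io/reddit/comment/search?ids='
    return [base + ','.join(batch) for batch in chunk_ids(comment_ids, max_per_request)]

# 1,200 made-up IDs stand in for a very active thread
fake_ids = [f'id{n}' for n in range(1200)]
urls = build_comment_urls(fake_ids)
print(len(urls), [len(b) for b in chunk_ids(fake_ids)])  # 3 [500, 500, 200]
```

The results of each batch would then need to be combined before sorting by score.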

In [18]:
new_ask_df = pd.read_csv('../data/ask_df_comments.csv')
new_ex_df = pd.read_csv('../data/ex_df_comments.csv')

In [59]:
def find_top_comment(dataframe):
    
    new_df = dataframe.copy()
    top_comment = []
    
    for j, ids in enumerate(new_df['comment_ids']):
        #print(j)
        if j >= new_df.shape[0]:
            new_df['top_comment'] = top_comment
            return new_df
        
        if new_df['comment_amnt'].iloc[j] > 500:
            top_comment.append(0)
            
        else:
            comment_url = 'https://api.pushshift.io/reddit/comment/search?ids=' + ids.replace("'",'').replace(' ','')

            # Access the api through our requests library
            res = requests.get(comment_url)

            # Ensure successful connection (=200); if it isn't, exit and print the code.
            if res.status_code != 200:
                print('There has been an error: res.status_code =', res.status_code)

                # Fill the rest of the list with 1's (the error sentinel) so it matches the dataframe length
                for i in range(new_df.shape[0] - len(top_comment)):
                    top_comment.append(1)

                new_df['top_comment'] = top_comment
                return new_df

            # Continue with our cycling program if status_code == 200
            else:
                if new_df['comment_amnt'].iloc[j] == 0:
                    top_comment.append(0)
                
                else:
                    # Pull in our data into a variable 'posts'
                    posts = res.json()['data']
                    temp = pd.DataFrame(posts)       
                    temp = temp[['score','body']].sort_values(by='score', ascending=False)
                    temp = temp.iloc[0]['body']

                    if posts != []:
                        top_comment.append(temp)

                    else:
                        top_comment.append(0)

                    # Set a pause time so we don't get blocked by Reddit for abusing the API
                    time.sleep(0.1)

    # add our created columns to the dataframe
    new_df['top_comment'] = top_comment
    
    # Return the new dataframe with its new 'top_comment' column
    return new_df
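The core selection step can also be written without a dataframe. The sketch below is a variant, not the function above: instead of keeping a removed top comment and dropping the submission later, it ignores removal placeholders and falls back to the next-best comment. The helper name and the '[deleted]' placeholder are assumptions for illustration.

```python
def pick_top_comment(comments, skip=('[removed]', '[deleted]')):
    """Return the body of the highest-scoring usable comment, or None.

    comments is a list of dicts with 'score' and 'body' keys, shaped like
    the 'data' list returned by the comment search. Entries missing either
    key, or whose body is a removal placeholder, are ignored.
    """
    usable = [c for c in comments
              if 'score' in c and 'body' in c and c['body'] not in skip]
    if not usable:
        return None
    return max(usable, key=lambda c: c['score'])['body']

sample = [
    {'score': 3, 'body': 'A short answer.'},
    {'score': 41, 'body': '[removed]'},
    {'score': 17, 'body': 'The most upvoted surviving answer.'},
    {'body': 'No score recorded.'},
]
print(pick_top_comment(sample))  # The most upvoted surviving answer.
print(pick_top_comment([]))      # None
```

Checking for an empty result before indexing is the main point: it avoids the failure that occurs when a request succeeds but returns no usable rows.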

## Top Comments for ELI5

In [50]:
new_ex_df2 = find_top_comment(new_ex_df)

In [51]:
new_ex_df2.head()

Unnamed: 0,author,created_utc,id,selftext,subreddit,title,comment_ids,comment_amnt,top_comment
0,j_d0tnet,1631248750,ple5nz,Disclaimer: I did see a previous question touc...,explainlikeimfive,"ELI5: Seriously, WTF is up with surface area a...","'hc9xp88', 'hc9xubv', 'hc9y66a', 'hc9ybwi', 'h...",13,The surface area to volume ratio is unit-indep...
1,ImpossibleZero,1631247022,pldqkn,I have a 30 year VA loan at 3.75% and my prope...,explainlikeimfive,ELI5: What does Refinancing a Mortgage Mean an...,"'hc9uc6o', 'hc9umen', 'hc9usj0', 'hc9uxi7', 'h...",6,I just refinanced mine to get a lower interest...
2,Chardington,1631244279,pld1sd,"I’ve been getting into finance, stonks and cry...",explainlikeimfive,ELI5: What exactly is “liquidity”?,"'hc9om2y', 'hc9ortz', 'hc9osmq', 'hc9osnr', 'h...",10,Liquidity is how quickly an asset you possess ...
3,DentonJoe,1631244183,pld0wi,Always wondered why it doesn’t make sense to u...,explainlikeimfive,Eli5 why are diesel/electric powertrains econo...,"'hc9okcf', 'hc9pkte', 'hc9r0vn', 'hc9rl2i', 'h...",17,Diesel electric drivelines in a locomotive are...
4,Aleitei,1631243837,plcxn4,"Having trouble grasping this, I know time and ...",explainlikeimfive,ELI5: Why does time slow down near stronger gr...,"'hc9nq9s', 'hc9omqr', 'hc9thei', 'hc9ti4w', 'h...",17,This is ELI5 so bare with me with a made up ex...


In [52]:
new_ex_df.shape

(5154, 8)

In [44]:
new_ex_df2.shape

(5154, 9)

In [46]:
new_ex_df2[new_ex_df2['top_comment'] != 0].shape

(3776, 9)

In [47]:
new_ex_df2 = new_ex_df2[new_ex_df2['top_comment'] != 0]

In [48]:
new_ex_df2[new_ex_df2['top_comment'] != 1].shape

(3776, 9)

In [49]:
new_ex_df2 = new_ex_df2[new_ex_df2['top_comment'] != 1]

In [50]:
new_ex_df2.to_csv('../data/ex_df_final.csv', index=False)

## Top Comments for AskScience

The original function, which worked well for ELI5, had issues with AskScience, so we will modify it to deal with the specific issues it encountered on the ask_df dataset.

In [29]:
def find_top_comment2(dataframe):
    
    new_df = dataframe.copy()
    top_comment = []
    error_check = 0
    
    for j, ids in enumerate(new_df['comment_ids']):

        if j >= new_df.shape[0]:
            new_df['top_comment'] = top_comment
            return new_df
        
        if new_df['comment_amnt'].iloc[j] > 500:
            top_comment.append(0)
            
        else:
            comment_url = 'https://api.pushshift.io/reddit/comment/search?ids=' + ids.replace("'",'').replace(' ','')

            # Access the api through our requests library
            res = requests.get(comment_url)

            # Ensure successful connection (=200); if it isn't, print the code and tolerate up to 5 consecutive errors before exiting.
            if res.status_code != 200:
                print('There as been an error: res.status_code =', res.status_code)
                
                if error_check < 5:
                    error_check+=1
                    top_comment.append(1)
                    time.sleep(0.2)
                    
                else: 
                    
                    if (new_df.shape[0] - len(top_comment)) != 0:
                        # Too many consecutive errors: fill the remaining rows with 2's so the list matches the dataframe length
                        for i in range(new_df.shape[0] - len(top_comment)):
                            top_comment.append(2)

                        new_df['top_comment'] = top_comment
                        return new_df
                    
                    else:
                        new_df['top_comment'] = top_comment
                        return new_df

            # Continue with our cycling program if status_code == 200
            else:
                error_check = 0
                
                if new_df['comment_amnt'].iloc[j] == 0:
                    top_comment.append(0)
                
                else:
                    # Pull in our data into a variable 'posts'
                    posts = res.json()['data']
                    temp = pd.DataFrame(posts)
                    
                    if 'score' in temp.columns and 'body' in temp.columns:
                        temp = temp[['score','body']].sort_values(by='score', ascending=False)
                        temp = temp.iloc[0]['body']

                        if posts != []:
                            top_comment.append(temp)

                        else:
                            top_comment.append(0)
                            
                    else:
                        top_comment.append(3)

                    # Set a pause time so we don't get rate-limited for abusing the Pushshift API
                    time.sleep(0.2)

    # Add the collected top comments to the dataframe as a new column
    new_df['top_comment'] = top_comment
    
    # Return the new dataframe with its new 'top_comment' column
    return new_df
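
The step inside `find_top_comment2` that selects the highest-scoring comment can be looked at in isolation, without any network access. The following is only a minimal sketch: the helper name `pick_top_comment` and the sample records are hypothetical, and it returns `None` instead of the numeric codes used above.

```python
import pandas as pd

def pick_top_comment(comments):
    """Return the body of the highest-scoring comment, or None if unavailable.

    `comments` is assumed to be the list stored under the 'data' key of a
    comment-search response (a list of dicts).
    """
    temp = pd.DataFrame(comments)
    # An empty list, or records without the expected fields, cannot be ranked.
    if temp.empty or not {'score', 'body'}.issubset(temp.columns):
        return None
    # Sort by score (highest first) and keep the text of the first row.
    return temp.sort_values(by='score', ascending=False).iloc[0]['body']

# Made-up records, for illustration only
sample = [
    {'score': 3, 'body': 'short answer'},
    {'score': 12, 'body': 'detailed answer'},
]
print(pick_top_comment(sample))  # detailed answer
print(pick_top_comment([]))      # None
```

Checking for the `score` and `body` columns before sorting is what guards against responses that come back empty or malformed.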

In [51]:
new_ask_df2 = find_top_comment2(new_ask_df)

There as been an error: res.status_code = 500
There as been an error: res.status_code = 502
There as been an error: res.status_code = 500
There as been an error: res.status_code = 500
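
The messages above come from the error branch of `find_top_comment2`. Its bookkeeping (tolerate up to five consecutive failures, then give up and mark the remaining rows) can be sketched without calling the API. The function name `label_responses` and the `'ok'` placeholder are assumptions made for illustration; the real function stores the comment text instead.

```python
def label_responses(status_codes, max_errors=5):
    """Label a sequence of HTTP status codes the way find_top_comment2 does."""
    labels = []
    consecutive_errors = 0
    for code in status_codes:
        if code == 200:
            # A successful call resets the error counter.
            consecutive_errors = 0
            labels.append('ok')
        elif consecutive_errors < max_errors:
            # Tolerated failure: record sentinel 1 and move on.
            consecutive_errors += 1
            labels.append(1)
        else:
            # Too many failures in a row: mark every remaining row with 2 and stop.
            labels.extend([2] * (len(status_codes) - len(labels)))
            break
    return labels

print(label_responses([200, 500, 502, 200]))  # ['ok', 1, 1, 'ok']
```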


In [21]:
new_ask_df2.head()

Unnamed: 0,author,created_utc,id,selftext,subreddit,title,comment_ids,comment_amnt,top_comment
0,ChrisGnam,1630428742,pf9tvb,So most of my peers (26 y/o and older) don't h...,askscience,Are there physiological or psychological diffe...,"'hb2u26q', 'hb2w2r6', 'hb2xjpe', 'hb2zvcg', 'h...",20,Have you ruled out synaesthesia? \n\nhttps://e...
1,MaoGo,1629571284,p8wued,Neutrinos are neutrally charged particles that...,askscience,How do we know that the neutrinos have spin?,"'h9tnotf', 'h9tq6r5', 'h9tuejy', 'h9u1ubp', 'h...",12,Neutrinos were originally proposed as a way to...
2,the_protagonist,1629571059,p8ws1c,How does that “memory” work? \n\nThis comes f...,askscience,If white blood cells are constantly dying and ...,"'h9tuq0p', 'h9tuuqg', 'h9u10c7', 'h9umpe1', 'h...",12,When a T- or B-cell is activated by it's assoc...
3,CyKii,1629567841,p8vtoe,Obviously it's best to be careful about these ...,askscience,"If mRNA vaccines remain proven safe, is it act...","'h9tem11', 'h9tfbjt', 'h9tfk2j', 'h9tfspy', 'h...",292,"For a completely new virus, sure - you'd defin..."
4,Comprehensive_Lead41,1629564364,p8ushd,Why does it look like this? https://images.app...,askscience,Why is the floor of the Uyuni salt desert part...,"'h9t8m72', 'h9t8rg2', 'h9t9h9s', 'h9t9zht', 'h...",94,[Convective currents in the evaporating surfac...


In [31]:
new_ask_df.shape

(6801, 8)

In [23]:
new_ask_df2.shape

(6801, 9)

In [32]:
new_ask_df2[new_ask_df2['top_comment'] != 0].shape

(3916, 9)

In [33]:
new_ask_df2 = new_ask_df2[new_ask_df2['top_comment'] != 0]

In [25]:
new_ask_df2[new_ask_df2['top_comment'] != 1].shape

(6792, 9)

In [34]:
new_ask_df2 = new_ask_df2[new_ask_df2['top_comment'] != 1]

In [35]:
new_ask_df2[new_ask_df2['top_comment'] != 2].shape

(3907, 9)

In [36]:
new_ask_df2 = new_ask_df2[new_ask_df2['top_comment'] != 2]

In [37]:
new_ask_df2[new_ask_df2['top_comment'] != 3].shape

(3360, 9)

In [38]:
new_ask_df2 = new_ask_df2[new_ask_df2['top_comment'] != 3]

In [39]:
new_ask_df2.to_csv('../data/ask_df_final.csv', index=False)
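
The four filtering cells above could also be collapsed into a single step with `Series.isin`. This is only a sketch on a made-up toy dataframe, not the project data; the names `SENTINELS`, `toy`, and `kept` are hypothetical.

```python
import pandas as pd

# Numeric codes that find_top_comment2 writes in place of a comment
SENTINELS = [0, 1, 2, 3]

# Toy data, for illustration only
toy = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd', 'e'],
    'top_comment': ['A real comment', 0, 1, 3, 'Another comment'],
})

# Keep only the rows whose top_comment is not one of the sentinel codes
kept = toy[~toy['top_comment'].isin(SENTINELS)]
print(kept.shape)  # (2, 2)
```

Filtering one code at a time, as done above, has the advantage of showing how many rows each failure mode accounts for.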

# Summary

We were able to pull submissions and their top comment. Some top comments could not be retrieved, either because the API call failed or because the response did not contain the expected fields, and other posts had no comments at all. After removing those rows we are left with **3,360 entries for AskScience** and **3,776 entries for ELI5**.