#### Project 3: Reddit NLP
#### Corey J Sinnott
# Data Collection

## Executive Summary

This report was commissioned to perform natural language processing (NLP) and analysis on two subreddits of Reddit.com. The data comprise 8,000 posts: 4,000 from r/AskALiberal and 4,000 from r/AskAConservative. The problem statement is: can we classify which subreddit a post belongs to? Conclusions and recommendations follow the in-depth analysis.

*See model_classification_exec_summary.ipynb for the full summary, data dictionary, and findings.*

## Contents:
- [API Testing](#API-Testing)
- [Defining a Function](#Defining-a-Function)
- [Data Collection](#Data-Collection)

#### Importing Libraries

In [79]:
import pandas as pd
import requests 
import numpy as np
import time

# API Testing 
 - Testing the basic functions of the Pushshift API.

In [27]:
url = ('https://api.pushshift.io/reddit/search/submission')

In [28]:
params = {
    'subreddit' : 'askaliberal',
    'size' : 25,
    #'before': last_post
}

In [29]:
res = requests.get(url, params)

In [30]:
type(res)

requests.models.Response

In [31]:
#initial test of url successful
res.status_code

200
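
The status check above can also be built into each request so that a failed call is retried rather than parsed. The following is a minimal sketch, not the method used in this project: `get_json_with_retry` is a hypothetical helper, and `fetch` is an assumed callable that returns a `(status_code, payload)` pair. With `requests`, `fetch` could wrap `requests.get` and return `(res.status_code, res.json() if res.ok else None)`. The retry count and wait time are illustrative values.

```python
import time

def get_json_with_retry(fetch, retries=3, wait_seconds=0.0):
    """Call fetch() until it reports status 200 or the retries run out.

    fetch is a hypothetical callable returning (status_code, payload).
    """
    last_status = None
    for attempt in range(retries):
        status, payload = fetch()
        if status == 200:
            return payload
        last_status = status
        time.sleep(wait_seconds)  # pause before trying again
    raise RuntimeError(
        f'request failed after {retries} attempts (last status {last_status})'
    )


# Stand-in for the API: fails once with 429, then succeeds.
responses = iter([(429, None), (200, {'data': [{'id': 'kznxgi'}]})])
result = get_json_with_retry(lambda: next(responses))
print(result['data'][0]['id'])
```

Passing the request in as a callable keeps the retry logic separate from the network call, so it can be tried out without touching the API.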

In [60]:
#test successful
data = res.json()['data']
data[0]

{'all_awardings': [],
 'allow_live_comments': False,
 'author': 'EsperantistoUsona',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_template_id': '69751e82-c00e-11e7-bf56-0e79ff121398',
 'author_flair_text': 'Social Democrat',
 'author_flair_text_color': 'dark',
 'author_flair_type': 'text',
 'author_fullname': 't2_7u4gzwvn',
 'author_patreon_flair': False,
 'author_premium': False,
 'awarders': [],
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1610947103,
 'domain': 'self.AskALiberal',
 'full_link': 'https://www.reddit.com/r/AskALiberal/comments/kznxgi/whats_a_good_political_thriller_that_talks_about/',
 'gildings': {},
 'id': 'kznxgi',
 'is_crosspostable': True,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': True,
 'is_self': True,
 'is_video': False,
 'link_flair_background_color': '',
 'link_flair_richtext': [],
 'link_flair_text_color': 'dark',
 'link_flair_type': 'text',
 

In [80]:
full_lib_df = pd.DataFrame(columns = ['title', 'selftext', 'subreddit', 'created_utc'])
full_cons_df = pd.DataFrame(columns = ['title', 'selftext', 'subreddit', 'created_utc'])


# Defining a Function
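 - The function uses the Pushshift API and works around its per-request size limit by pulling posts in batches, passing the oldest timestamp from each batch as the next `before` cutoff.
 - Sleep timers between requests keep the pulls within the API's rate limits.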

In [81]:
def full_pull(subreddit_1, subreddit_2, size, iterations):
    """
    Pulls posts by a specified amount, with time breaks between iterations.

    Args:
        subreddit_1 (string): name of first subreddit
        subreddit_2 (string): name of second subreddit
        size (int)          : number of posts to be pulled per iteration
        iterations (int)    : the number of pulls to be performed
        
    Returns:
        full_pull_df (pandas DataFrame): dataframe containing the complete, 
        raw collection of pulled posts
    """
    utc = 1610946272
    full_lib_df = pd.DataFrame(columns = ['title', 'selftext', 'subreddit', 'created_utc'])
    full_cons_df = pd.DataFrame(columns = ['title', 'selftext', 'subreddit', 'created_utc'])
    
    for pull in range(iterations):
        url = f'https://api.pushshift.io/reddit/search/submission/?subreddit={subreddit_1}&size={size}&before={utc}'
        res = requests.get(url)
        data = res.json()['data']
        pull_dict_1 = {
            'title'      : [],
            'selftext'   : [],
            'subreddit'  : [],
            'created_utc': []
                            }
        #if res.status_code == 200: #optional; helpful for troubleshooting
        for i in data:
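            try:
                # read the required fields first so all lists stay the same length
                title     = i['title']
                subreddit = i['subreddit']
                created   = i['created_utc']
            except KeyError:
                continue # skip posts missing a required field
            pull_dict_1['title'].append(title)
            pull_dict_1['subreddit'].append(subreddit)
            pull_dict_1['created_utc'].append(created)
            # some posts are missing 'selftext'; store a placeholder string
            pull_dict_1['selftext'].append(i.get('selftext', 'not there'))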
            
            #print(f'You have obtained {len(pull_dict_1)} posts') #real-time counter
        temp_posts_1 = pd.DataFrame(pull_dict_1)
        full_lib_df = pd.concat([full_lib_df, temp_posts_1])
        utc = full_lib_df['created_utc'].astype('int64').min() #pulls the final timestamp
        time.sleep(30)                                            #to obtain unique data
        
    print(f'Pull complete; you have obtained {len(full_lib_df)} total posts from r/{subreddit_1}')
    
#-------------- second subreddit ---------------------------------------------------------------------------#        
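    # note: utc is not reset here, so the second pull starts from the oldest
    # timestamp collected for the first subreddit
    time.sleep(30)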
    for pull_2 in range(iterations):
        url_2 = f'https://api.pushshift.io/reddit/search/submission/?subreddit={subreddit_2}&size={size}&before={utc}'
        res_2 = requests.get(url_2)
        data_2 = res_2.json()['data']
        pull_dict_2 = {
            'title'      : [],
            'selftext'   : [],
            'subreddit'  : [],
            'created_utc': []
                            }
        #if res.status_code == 200: #optional; helpful for troubleshooting
        for i in data_2:
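            try:
                # read the required fields first so all lists stay the same length
                title     = i['title']
                subreddit = i['subreddit']
                created   = i['created_utc']
            except KeyError:
                continue # skip posts missing a required field
            pull_dict_2['title'].append(title)
            pull_dict_2['subreddit'].append(subreddit)
            pull_dict_2['created_utc'].append(created)
            # some posts are missing 'selftext'; store a placeholder string
            pull_dict_2['selftext'].append(i.get('selftext', 'not there'))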
            
            #print(f'You have obtained {len(pull_dict_2)} posts') #real-time counter
        temp_posts_2 = pd.DataFrame(pull_dict_2)
        full_cons_df = pd.concat([full_cons_df, temp_posts_2])
        utc = full_cons_df['created_utc'].astype('int64').min() #pulls the final timestamp
        time.sleep(30)                                            #to obtain unique data
    
    print(f'Pull complete; you have obtained {len(full_cons_df)} total posts from r/{subreddit_2}')
        
        #else:
            #print(res.status_code)                    
#---------- combine dfs -----------------------
    
    full_pull_df = pd.concat([full_lib_df, full_cons_df])
    
    return full_pull_df
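
The two loops in `full_pull` apply the same field extraction to every post. One way to share that logic is a small helper that turns a raw post into a row. This is a sketch only: `extract_fields` is a hypothetical name, and the sample posts are made-up stand-ins for API results.

```python
def extract_fields(post, default_selftext='not there'):
    """Return the four columns kept by the notebook, or None if a required key is missing."""
    required = ('title', 'subreddit', 'created_utc')
    if any(key not in post for key in required):
        return None
    return {
        'title': post['title'],
        'selftext': post.get('selftext', default_selftext),
        'subreddit': post['subreddit'],
        'created_utc': post['created_utc'],
    }


sample = [
    {'title': 'A', 'selftext': 'body', 'subreddit': 'AskALiberal', 'created_utc': 1610945588},
    {'title': 'B', 'subreddit': 'AskALiberal', 'created_utc': 1610943721},  # no selftext
    {'selftext': 'orphan'},  # missing required keys, so it is dropped
]
rows = [row for row in map(extract_fields, sample) if row is not None]
print(rows)
```

A list of row dicts like `rows` can be passed directly to `pd.DataFrame(rows)`, which would replace the per-column lists in `pull_dict_1` and `pull_dict_2`.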

# Data Collection
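 - Successfully obtained 8,000 posts: 4,000 from each subreddit.
 - The call below is a small example pull of 40 posts (20 per subreddit); the `.info()` and `.shape` outputs further down are from the full 8,000-post pull.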

In [82]:
full_pull_df = full_pull('AskALiberal', 'askaconservative', 10, 2)

Pull complete; you have obtained 20 total posts from r/AskALiberal
Pull complete; you have obtained 20 total posts from r/askaconservative


In [72]:
full_pull_df.head()

| | title | selftext | subreddit | created_utc |
|---|---|---|---|---|
| 0 | Biden plans to cancel the Keystone XL pipeline... | [https://www.cbc.ca/amp/1.5877038](https://www... | AskALiberal | 1610945588 |
| 1 | 2020 Best of r/AskALiberal Results | #Good afternoon, everyone!\n\n\nThe winners an... | AskALiberal | 1610943721 |
| 2 | Place your bets: will Trump be removed by forc... | We already know Trump will not attend Biden's ... | AskALiberal | 1610942754 |
| 3 | Have you ever gotten conservatives to rethink ... | I’m had both positive/negative conversations f... | AskALiberal | 1610942080 |
| 4 | Who is winning the culture war right now? | Liberals? Conservatives? China? | AskALiberal | 1610939740 |
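
The `created_utc` values are Unix timestamps in seconds. A short sketch of converting one to a readable date, using only the standard library (`utc_to_datetime` is a hypothetical helper name):

```python
from datetime import datetime, timezone

def utc_to_datetime(created_utc):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(created_utc), tz=timezone.utc)

# timestamp of the first row shown above
print(utc_to_datetime(1610945588).isoformat())  # 2021-01-18T04:53:08+00:00
```

For a whole column, `pd.to_datetime(..., unit='s')` should give the same result after casting `created_utc` to an integer type.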


In [74]:
full_pull_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8000 entries, 0 to 99
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        8000 non-null   object
 1   selftext     8000 non-null   object
 2   subreddit    8000 non-null   object
 3   created_utc  8000 non-null   object
dtypes: object(4)
memory usage: 312.5+ KB


In [73]:
full_pull_df.shape

(8000, 4)

In [78]:
full_pull_df.to_csv('full_pull_4000_each_incl_self_text.csv')
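
Because `to_csv` writes the index by default, reading this file back without further options yields an extra `Unnamed: 0` column. A minimal sketch of the round trip, using a small made-up frame and an in-memory buffer in place of the real file:

```python
import io
import pandas as pd

df = pd.DataFrame({
    'title': ['A', 'B'],
    'subreddit': ['AskALiberal', 'askaconservative'],
})

buffer = io.StringIO()
df.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
with_extra = pd.read_csv(buffer)  # index comes back as 'Unnamed: 0'

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # index restored, no extra column

print(list(with_extra.columns))
print(list(restored.columns))
```

Downstream notebooks that load the CSV can pass `index_col=0` to avoid the extra column. Alternatively, the file could be written with `index=False`, though that changes the file layout and is a choice this notebook does not make.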