# NOTEBOOK 01: DATA COLLECTION

## Problem Statement

In this project I will collect posts from the sub-reddits r/conservative and r/libertarian to identify terminology commonly used to self-identify within each community. Because the two communities share considerable ideological overlap, it's my hope that this project delivers interesting insights into how each group views itself. 

I will use the Pushshift API to build a corpus of text data that I will parse by Natural Language Processing (NLP) and then model with an optimized classification algorithm. Text will be pulled from comment threads as many posts contain only images in the body. 

In [2]:
import requests, time, csv, json, re
import pandas as pd

## Setting the base query syntax:

Setting the query url to the pushshift api

In [2]:
url = 'https://api.pushshift.io/reddit/search/'

Setting the parameters for the query

In [None]:
params = {'searchType':'submission',
          'subreddit':'conservative,libertarian',
          'sort':'desc',
          'size':10,
#           'before':,
#           'after':,
         }

Making the request.

In [None]:
response = requests.get(url, params=params)

Checking the url to make sure the query terms are correct and the server is responsive

In [None]:
response.status_code

The status code returned from the server is 200, meaning the query was accepted and there aren't any connection issues. Checking length of the json file.

In [None]:
len(response.json()['data'])

Length is 10, as expected. Assessing the file structure for keys of interest.

In [None]:
response.json()

Keys of interest are:
- author
- body
- created_utc
- link_id
- parent_id
- permalink
- subreddit
- subreddit_id

In [3]:
col_list = ['author',
            'body',
            'subreddit',
            'subreddit_id',
            'created_utc',
            'retrieved_on',
            'link_id',
            'parent_id',
            'permalink',
            ]

 ## Querying Reddit and saving raw data in .json format:

Writing a function for creating a logfile and formatting file names with a unique timestamp.

In [4]:
def filename_format_log(file_path, 
                        logfile = '../assets/file_log.txt', 
                        now = round(time.time()), 
                        file_description = None): 
   
    try:
        ext = re.search('(?<!^)(?<!\.)\.(?!\.)', file_path).start() 
    except:
        raise NameError('Please enter a relative path with a file extension.') 
    
    stamp = re.search('(?<!^)(?<!\.)[a-z]+_[a-z]+(?=\.)', file_path).start()
    formatted_name = f'{file_path[:stamp]}{now}_{file_path[stamp:]}'  
    if not file_description:
        file_description = f'Pull: {time.asctime(time.gmtime(now))}'
    with open(logfile, 'a+') as f:
        f.write(f'{formatted_name}: {file_description}\n')
    return formatted_name, now, file_description

Writing a function for collecting comments and parsing into a dataframe with the features of interest, saving out the raw data for each pull. Request loop inspired: [(Source)](https://www.reddit.com/r/pushshift/comments/89pxra/pushshift_api_with_large_amounts_of_data/).

In [5]:
def reddit_query(subreddits, n_samples=1500, searchType='comment', before=None, after=None):
    url = f'https://api.pushshift.io/reddit/search/'
    last_comment = round(time.time())
    comment_list = []
    
    run = 1
    while len(comment_list) < n_samples:
        
        try:
            print(f'Starting query {run}')
            
            params = {'searchType':searchType,
              'subreddit':subreddits,
              'sort':'desc',
              'size':1500,
              'before':last_comment-1,
              'after':after,
             }
                
            response = requests.get(url, params = params)
            posts = response.json()['data']
            
            if len(posts) == 0:
                last_comment = last_comment
            else:
                last_comment = posts[-1]['created_utc']
                comment_list.extend(posts)
                timestamp = posts[-1]['created_utc']
                time.sleep(1) 
                run += 1
        except:
            if response.status_code != 200:
                return f'Check status. Error code: {response.status_code}'
            else:
                return 'Error. Pull not completed.'
    
    formatted_name, now, file_description = filename_format_log(file_path =f'../data/raw_{searchType}s.json', now=timestamp)
    with open(formatted_name, 'w+') as f:
        json.dump(comment_list, f)
    
    print(f'Saved and completed query and returned {len(comment_list)} {searchType}s.')
    print(f'Reddit text is ready for processing.')
    return print(f'Last timestamp was {timestamp}.')

Using the query function to collect 100k comments from the conservative subreddit.

In [None]:
reddit_query(subreddits='conservative',
             n_samples=100000,
             searchType='comment')

Using the query function to collect 100k comments from the libertarian subreddit beginning at the same time as the conservative comments.

In [None]:
reddit_query(subreddits='libertarian',
             n_samples=100000,
             before=1544922850,
             searchType='comment')

Checking final timestamps to understand the time over which the comments were collected. While our data is sensative to current events, in an effort to preserve class balance and minimize bootstrapping or other class rebalancing methods we will assume that overall syntatical choices are consistent over our timefreame. 

In [7]:
x=1543634134
time.asctime(time.gmtime(x))

'Sat Dec  1 03:15:34 2018'

In [6]:
x=1543611587
time.asctime(time.gmtime(x))

'Fri Nov 30 20:59:47 2018'

## Extracting features of interest and converting into a DataFrame:

Loading in the conservative samples as a .json file.

In [3]:
with open(f'../data/1541516104_raw_comments.json', 'r') as f:
    cons_sample_list = json.load(f)

Checking file length to ensure complete dataset.

In [36]:
len(cons_sample_list)

100000

Reviewing the structure of the first entry to compare across both datasets and ensure the correct samples were collected from the query.

In [37]:
cons_sample_list[0]

{'author': 'rojindahar',
 'author_flair_background_color': None,
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_template_id': None,
 'author_flair_text': None,
 'author_flair_text_color': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_26vla1s8',
 'author_patreon_flair': False,
 'body': 'He already does Ketamine, 58 second mark: https://youtu.be/Pmrp3JVFrb8',
 'created_utc': 1544922841,
 'gildings': {'gid_1': 0, 'gid_2': 0, 'gid_3': 0},
 'id': 'ebvq8at',
 'link_id': 't3_a6krv2',
 'no_follow': True,
 'parent_id': 't3_a6krv2',
 'permalink': '/r/Conservative/comments/a6krv2/10yearold_boy_dances_on_stage_for_money_at_adult/ebvq8at/',
 'retrieved_on': 1544922850,
 'score': 1,
 'send_replies': True,
 'stickied': False,
 'subreddit': 'Conservative',
 'subreddit_id': 't5_2qh6p'}

Repeating the import and file review steps for the libertarian samples.

In [38]:
with open(f'../data/1543468607_raw_comments.json', 'r') as f:
    libr_sample_list = json.load(f)

In [39]:
len(libr_sample_list)

100000

In [40]:
libr_sample_list[0]

{'author': 'spacefish3',
 'author_flair_background_color': None,
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_template_id': None,
 'author_flair_text': None,
 'author_flair_text_color': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_j15fb',
 'author_patreon_flair': False,
 'body': '"Workers\' collective ownership of capital" never implies that a state must exist.',
 'created_utc': 1544931627,
 'gildings': {'gid_1': 0, 'gid_2': 0, 'gid_3': 0},
 'id': 'ebw07aq',
 'link_id': 't3_a21e9n',
 'no_follow': True,
 'parent_id': 't1_eax5n3o',
 'permalink': '/r/Libertarian/comments/a21e9n/the_admins_lied_our_mods_did_not_approve_the/ebw07aq/',
 'retrieved_on': 1544931628,
 'score': 1,
 'send_replies': True,
 'stickied': False,
 'subreddit': 'Libertarian',
 'subreddit_id': 't5_2qh63'}

Parsing the json file into a dataframe containing the features of interest.

In [7]:
def reddit_parse(sample):
    
    col_list = ['author',
                'body',
                'subreddit',
                'subreddit_id',
                'created_utc',
                'link_id',
                'parent_id',
                'permalink',
                ]
    
    comments_df = pd.DataFrame(sample)
    comments_df = comments_df[col_list]
    
    comments_df.rename(columns={'subreddit':'libertarian'}, inplace=True)
    comments_df['libertarian'] = comments_df['libertarian'].map({'Conservative':0, 'Libertarian':1})
    
    col_order = ['author',
                 'body',
                 'libertarian',
                 'created_utc',
                 'subreddit_id',
                 'parent_id',
                 'link_id',
                 'permalink',
                ]

    return comments_df[col_order]

Reviewing the shape of the dataframe to ensure correct transformation

In [8]:
cons_comments = reddit_parse(cons_sample_list)

In [43]:
cons_comments_df.shape

(100000, 10)

Shape corresponds with expected values. Reviewing the head of the dataframe to ensure data was correctly labeled. 

In [44]:
cons_comments_df.head()

Unnamed: 0,author,body,libertarian,created_on,retrieved_on,created_utc,subreddit_id,parent_id,link_id,permalink
0,rojindahar,"He already does Ketamine, 58 second mark: http...",0,Sun Dec 16 01:14:01 2018,Sun Dec 16 01:14:10 2018,1544922841,t5_2qh6p,t3_a6krv2,t3_a6krv2,/r/Conservative/comments/a6krv2/10yearold_boy_...
1,[deleted],[removed],0,Sun Dec 16 01:13:28 2018,Sun Dec 16 01:13:39 2018,1544922808,t5_2qh6p,t1_ebvclrz,t3_a6icni,/r/Conservative/comments/a6icni/to_guarantee_a...
2,[deleted],[removed],0,Sun Dec 16 01:13:15 2018,Sun Dec 16 01:13:26 2018,1544922795,t5_2qh6p,t1_ebvg44o,t3_a6icni,/r/Conservative/comments/a6icni/to_guarantee_a...
3,leadrain86,Actually that is quite the opposite. Conservat...,0,Sun Dec 16 01:13:09 2018,Sun Dec 16 01:13:20 2018,1544922789,t5_2qh6p,t1_ebvhrk2,t3_a6a7h7,/r/Conservative/comments/a6a7h7/one_year_ago_t...
4,[deleted],[removed],0,Sun Dec 16 01:11:49 2018,Sun Dec 16 01:12:00 2018,1544922709,t5_2qh6p,t3_a4llsj,t3_a4llsj,/r/Conservative/comments/a4llsj/keep_tyrants_l...


Dropping any duplicates

In [45]:
cons_comments_df.drop_duplicates().shape

(100000, 10)

No duplicates were found, so shape is the same as the original dataframe. Repeating these steps for the libertarian dataset.

In [None]:
libr_comments = reddit_parse(libr_sample_list)

In [48]:
libr_comments_df.shape

(100000, 10)

Shape output corresponds to expected values.

In [49]:
libr_comments_df.head()

Unnamed: 0,author,body,libertarian,created_on,retrieved_on,created_utc,subreddit_id,parent_id,link_id,permalink
0,spacefish3,"""Workers' collective ownership of capital"" nev...",1,Sun Dec 16 03:40:27 2018,Sun Dec 16 03:40:28 2018,1544931627,t5_2qh63,t1_eax5n3o,t3_a21e9n,/r/Libertarian/comments/a21e9n/the_admins_lied...
1,EatsPandas,hmmmmmmm interesting,1,Sun Dec 16 03:40:20 2018,Sun Dec 16 03:40:21 2018,1544931620,t5_2qh63,t3_a6lw8o,t3_a6lw8o,/r/Libertarian/comments/a6lw8o/libertarianism_...
2,KruglorTalks,"To be fair, credit to Cratchit just for litera...",1,Sun Dec 16 03:40:05 2018,Sun Dec 16 03:40:07 2018,1544931605,t5_2qh63,t3_a6k2kt,t3_a6k2kt,/r/Libertarian/comments/a6k2kt/scroogedidnothi...
3,Itl_chi_15,"Game show hosts, they’re are truly the worst. ...",1,Sun Dec 16 03:39:53 2018,Sun Dec 16 03:39:54 2018,1544931593,t5_2qh63,t3_a686wz,t3_a686wz,/r/Libertarian/comments/a686wz/not_every_actor...
4,seabreezeintheclouds,I think the better explanation is who are the ...,1,Sun Dec 16 03:39:48 2018,Sun Dec 16 03:39:49 2018,1544931588,t5_2qh63,t3_a6jrtf,t3_a6jrtf,/r/Libertarian/comments/a6jrtf/the_most_libert...


In [50]:
libr_comments_df.drop_duplicates().shape

(100000, 10)

No duplicates found in this dataset. Since both dataframes have aligning features we will merge to create our training data.

In [52]:
comments_df = pd.concat([cons_comments_df, libr_comments_df],axis=0, ignore_index=True)

Reviewing the structure of the combined dataframe.

In [53]:
comments_df.columns

Index(['author', 'body', 'libertarian', 'created_on', 'retrieved_on',
       'created_utc', 'subreddit_id', 'parent_id', 'link_id', 'permalink'],
      dtype='object')

Sorting samples in chronilogical order and resetting the index.

In [54]:
comments_df.sort_values(by=['created_utc'], ascending=False, inplace=True)

In [55]:
comments_df.reset_index(drop=True, inplace=True)

Reviewing the datatypes and checking for any null values.

In [56]:
comments_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 10 columns):
author          200000 non-null object
body            200000 non-null object
libertarian     200000 non-null int64
created_on      200000 non-null object
retrieved_on    200000 non-null object
created_utc     200000 non-null int64
subreddit_id    200000 non-null object
parent_id       200000 non-null object
link_id         200000 non-null object
permalink       200000 non-null object
dtypes: int64(2), object(8)
memory usage: 15.3+ MB


In [75]:
comments_df.isna().sum()

author          0
body            0
libertarian     0
created_utc     0
subreddit_id    0
parent_id       0
link_id         0
permalink       0
dtype: int64

Ensuring balanced classes.

In [57]:
comments_df['libertarian'].value_counts()

1    100000
0    100000
Name: libertarian, dtype: int64

Replacing any newline, carrage reuturns, or long whitespace elements with a single whitespace character to prevent any errors when saving out / reading in the data.

In [76]:
comments_df = comments_df.body.map(lambda x :re.sub('\s+', ' ', x))

## Saving out comments_df for Preprocessing:

Saving out the combined dataframe as a csv for preprocessing in the next notebook.

In [65]:
formatted_name, now, file_description = filename_format_log(file_path ='../assets/comments_df.csv')
comments_df.to_csv(formatted_name, index=False)

# CONTINUE TO NOTEBOOK 02: PREPROCESSING