## Data Scraping

This notebook scrapes Reddit comments from the competitivehs subreddit through the pushshift.io API, cleans them, and saves them for EDA.

In [2]:
import numpy as np
import requests, time, csv, json, re  
import pandas as pd

This imports the libraries used to request, parse, and store the data.

## Querying Reddit and saving raw data in .json format

Writing a function for creating a logfile and formatting file names with a unique timestamp.

In [3]:
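def filename_format_log(file_path,
                        logfile='../assets/file_log.txt',
                        now=None,
                        file_description=None):
    # Resolve the timestamp at call time; a default of round(time.time())
    # would be evaluated only once, when the function is defined.
    if now is None:
        now = round(time.time())

    # Require a file extension: a single dot that is not at the start of the
    # path and is not part of a '..' sequence.
    if re.search(r'(?<!^)(?<!\.)\.(?!\.)', file_path) is None:
        raise NameError('Please enter a relative path with a file extension.')

    # Locate the start of the base name (e.g. 'raw_comments' in
    # '../data/raw_comments.json') and insert the timestamp before it.
    match = re.search(r'(?<!^)(?<!\.)[a-z]+_[a-z]+(?=\.)', file_path)
    if match is None:
        raise NameError('Expected a lowercase file name of the form word_word.ext.')
    stamp = match.start()
    formatted_name = f'{file_path[:stamp]}{now}_{file_path[stamp:]}'
    if not file_description:
        file_description = f'Pull: {time.asctime(time.gmtime(now))}'
    # Append the new file name and its description to the log.
    with open(logfile, 'a+') as f:
        f.write(f'{formatted_name}: {file_description}\n')
    return formatted_name, now, file_description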
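
To illustrate the naming convention on its own, here is a minimal sketch that prefixes the base file name with a timestamp. It uses `posixpath` instead of the regular expressions above, assumes forward-slash paths like the ones in this notebook, and skips the log file entirely. The helper name `stamp_filename` is hypothetical.

```python
import posixpath

def stamp_filename(file_path, now):
    """Hypothetical helper: prefix the base name of file_path with a timestamp."""
    directory, base = posixpath.split(file_path)
    root, ext = posixpath.splitext(base)
    if not ext:
        raise ValueError('Please enter a path with a file extension.')
    return posixpath.join(directory, f'{now}_{base}')

stamped = stamp_filename('../data/raw_comments.json', 1505757232)
print(stamped)
```

The result has the same shape as the file names loaded later in this notebook, such as `../data/1505757232_raw_comments.json`.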

Writing a function for collecting comments and saving out the raw data for each pull; parsing into a dataframe with the features of interest happens in a later step. The request loop was inspired by a thread on r/pushshift: [(Source)](https://www.reddit.com/r/pushshift/comments/89pxra/pushshift_api_with_large_amounts_of_data/).

In [4]:
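def reddit_query(url, n_samples, before=None, after=None):
    # Default to the current time when no upper bound is given.
    last_comment = round(before) if before is not None else round(time.time())
    comment_list = []

    run = 1
    while len(comment_list) < n_samples:
        print(f'Starting query {run}')

        params = {'subreddit': 'competitivehs',
                  'sort': 'desc',
                  'size': 1000,
                  'before': last_comment - 1,
                  'after': after,
                  }

        try:
            response = requests.get(url, params=params)
        except requests.exceptions.RequestException:
            return 'Error. Pull not completed.'
        if response.status_code != 200:
            return f'Check status. Error code: {response.status_code}'
        try:
            posts = response.json()['data']
        except (ValueError, KeyError):
            return 'Error. Pull not completed.'

        if len(posts) == 0:
            # No older comments remain; stop instead of looping forever.
            break

        # Results are sorted newest first, so the last item is the oldest.
        last_comment = posts[-1]['created_utc']
        comment_list.extend(posts)
        time.sleep(1)  # pause between requests
        run += 1

    if not comment_list:
        return 'Error. Pull not completed.'

    # Name the output file after the oldest timestamp collected.
    timestamp = last_comment
    formatted_name, now, file_description = filename_format_log(
        file_path='../data/raw_comments.json', now=timestamp)
    with open(formatted_name, 'w+') as f:
        json.dump(comment_list, f)

    print(f'Saved and completed query and returned {len(comment_list)} comments.')
    print('Reddit text is ready for processing.')
    print(f'Last timestamp was {timestamp}.')
    # Return the comments themselves rather than the None produced by print().
    return comment_list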

This function queries the pushshift.io API for up to 1,000 comments at a time from the competitivehs subreddit, paging backward in time until at least `n_samples` comments have been collected. It then writes all of the comments to a single JSON file whose name is prefixed with the earliest `created_utc` timestamp retrieved.
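
The backward-paging idea can be sketched without any network access. In the sketch below, `fake_fetch` is a hypothetical stand-in for the API and `paginate` is an illustrative name; it assumes the fetch function returns comments strictly older than `before`, newest first, which may not match the real API's boundary behavior exactly.

```python
def fake_fetch(before, size):
    """Hypothetical stand-in for the API: returns up to `size` comments with
    created_utc strictly below `before`, newest first."""
    all_times = range(100, 0, -1)  # pretend comments exist at t = 100 .. 1
    hits = [t for t in all_times if t < before][:size]
    return [{'created_utc': t} for t in hits]

def paginate(fetch, n_samples, before, size=10):
    """Collect batches, moving the `before` cursor back to the oldest item seen."""
    collected = []
    last_seen = before
    while len(collected) < n_samples:
        batch = fetch(before=last_seen, size=size)
        if not batch:
            break  # nothing older is left
        collected.extend(batch)
        last_seen = batch[-1]['created_utc']
    return collected, last_seen

comments, oldest = paginate(fake_fetch, n_samples=25, before=51)
print(len(comments), oldest)
```

Because whole batches are appended, the loop can return more than `n_samples` items, just as the function above can.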

In [6]:
KOFT_comments = reddit_query(url='https://api.pushshift.io/reddit/search/comment/', n_samples=10000, before=1507593600)

Starting query 1
Starting query 2
Starting query 3
Starting query 4
Starting query 5
Starting query 6
Starting query 7
Starting query 8
Starting query 9
Starting query 10
Saved and completed query and returned 10000 comments.
Reddit text is ready for processing.
Last timestamp was 1505757232.


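This queries pushshift.io for 10,000 Reddit comments from competitivehs during the KOFT (Knights of the Frozen Throne) time period, working backward from the `before` timestamp 1507593600 (10 October 2017, UTC).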

In [7]:
with open('../data/1505757232_raw_comments.json', 'r') as f:
    KOFT_comments = json.load(f)

This loads the comments from the KOFT json into memory. 

In [8]:
def reddit_parse(sample):
    # Columns to keep, in output order.
    col_list = [
        'author',
        'body',
        'created_utc',
    ]

    comments_df = pd.DataFrame(sample)
    return comments_df[col_list]

This function is designed to parse the json files we load into memory and return a DataFrame containing the comments' author, body, and time created. 
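
As a small illustration of the column selection, the sketch below builds a dataframe from two made-up comment records. The `score` field and all values are hypothetical; real Pushshift records carry many more fields.

```python
import pandas as pd

# Made-up records standing in for the loaded JSON.
sample = [
    {'author': 'user_a', 'body': 'First comment', 'created_utc': 1505757232, 'score': 3},
    {'author': 'user_b', 'body': 'Second comment', 'created_utc': 1505757300, 'score': 1},
]
keep = ['author', 'body', 'created_utc']

# Selecting a list of columns drops everything else and fixes the column order.
parsed = pd.DataFrame(sample)[keep]
print(parsed)
```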

In [9]:
KOFT_comments_df =  reddit_parse(KOFT_comments)

This turns the KOFT json into a dataframe. 

In [10]:
# Build both masks from the same frame so the boolean index stays aligned.
deleted_author = KOFT_comments_df.author.str.contains(r"\[deleted\]")
deleted_body = KOFT_comments_df.body.str.contains(r"\[deleted\]")
KOFT_comments_cleaned = KOFT_comments_df[~deleted_author & ~deleted_body]

This removes every comment whose author or body is marked `[deleted]` from the KOFT dataframe.
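
A minimal sketch of this filter on made-up rows follows. It passes `regex=False` so the brackets need no escaping, and it computes both conditions on the same frame before indexing, which avoids applying a mask built from an unfiltered frame to an already filtered one. The data is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'author': ['user_a', '[deleted]', 'user_c', 'user_d'],
    'body': ['hello', 'text', '[deleted]', 'world'],
})

# Literal substring match; both masks share df's index.
deleted_author = df['author'].str.contains('[deleted]', regex=False)
deleted_body = df['body'].str.contains('[deleted]', regex=False)

cleaned = df[~(deleted_author | deleted_body)]
print(cleaned)
```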

In [11]:
KOFT_comments_cleaned = KOFT_comments_cleaned[~KOFT_comments_cleaned.author.str.contains("bot")].copy()

This removes comments from the KOFT dataframe whose author name contains the lowercase substring "bot", which is used here as a rough proxy for bot accounts.
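
The substring test is a heuristic, and the sketch below shows what it does and does not match on a few made-up usernames. The stricter pattern (`bot` at the end of the name, case-insensitive) is only one possible alternative and is not what this notebook uses.

```python
import pandas as pd

# Hypothetical usernames chosen to show the edge cases.
authors = pd.Series(['AutoModerator', 'card-lookup-bot', 'Abbott42',
                     'DeckHelperBot', 'user_a'])

# Case-sensitive substring match, as in the cell above.
substring_hits = authors[authors.str.contains('bot')].tolist()

# Alternative: require the name to end in "bot", ignoring case.
suffix_hits = authors[authors.str.contains(r'bot$', case=False)].tolist()

print(substring_hits)
print(suffix_hits)
```

Neither pattern catches an account such as `AutoModerator`, so a list of known bot names may still be needed.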

In [12]:
KOFT_comments_cleaned['KOFT'] = 1

This labels every KOFT comment with `KOFT = 1`, the positive class.

In [13]:
MSOG_comments = reddit_query(url='https://api.pushshift.io/reddit/search/comment/', n_samples=10000, before=1488283200)

Starting query 1
Starting query 2
Starting query 3
Starting query 4
Starting query 5
Starting query 6
Starting query 7
Starting query 8
Starting query 9
Starting query 10
Saved and completed query and returned 10000 comments.
Reddit text is ready for processing.
Last timestamp was 1485281935.


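This queries pushshift.io for 10,000 Reddit comments from competitivehs during the MSOG (Mean Streets of Gadgetzan) time period, working backward from the `before` timestamp 1488283200 (28 February 2017, UTC).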

In [14]:
with open('../data/1485281935_raw_comments.json', 'r') as f:
    MSOG_comments = json.load(f)

This loads the MSOG json into memory. 

In [15]:
MSOG_comments_df =  reddit_parse(MSOG_comments)

This turns the MSOG json into a dataframe. 

In [16]:
# Build both masks from the same frame so the boolean index stays aligned.
deleted_author = MSOG_comments_df.author.str.contains(r"\[deleted\]")
deleted_body = MSOG_comments_df.body.str.contains(r"\[deleted\]")
MSOG_comments_cleaned = MSOG_comments_df[~deleted_author & ~deleted_body]

This removes every comment whose author or body is marked `[deleted]` from the MSOG dataframe.

In [17]:
MSOG_comments_cleaned = MSOG_comments_cleaned[~MSOG_comments_cleaned.author.str.contains("bot")].copy()

This removes comments from the MSOG dataframe whose author name contains the lowercase substring "bot".

In [18]:
MSOG_comments_cleaned['KOFT'] = 0

This labels every MSOG comment with `KOFT = 0`, the negative class.

In [19]:
Complete_df = pd.concat([KOFT_comments_cleaned,MSOG_comments_cleaned])

This combines the KOFT and MSOG dataframes into a new dataframe called `Complete_df`.
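
The labeling and concatenation steps can be sketched on two tiny made-up frames. The names `koft` and `msog` and all values are hypothetical; the point is that `pd.concat` keeps each frame's original index labels, so labels repeat in the combined frame.

```python
import pandas as pd

koft = pd.DataFrame({'author': ['user_a', 'user_b'], 'body': ['x', 'y']})
msog = pd.DataFrame({'author': ['user_c'], 'body': ['z']})

# Add the class label before combining.
koft['KOFT'] = 1
msog['KOFT'] = 0

combined = pd.concat([koft, msog])
print(combined)
```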

In [20]:
Complete_df['body'] = Complete_df.body.map(lambda x: re.sub(r'https?://\S*', ' ', x))

This replaces every `http://` or `https://` link remaining in the comment text with a single space.
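
As a quick check of the pattern, the sketch below applies the same substitution to one made-up sentence. The helper name `strip_links` and the example URL are hypothetical.

```python
import re

# "http", an optional "s", "://", then everything up to the next whitespace.
URL_PATTERN = re.compile(r'https?://\S*')

def strip_links(text):
    """Replace each URL in text with a single space."""
    return URL_PATTERN.sub(' ', text)

cleaned = strip_links('See https://example.com/deck?id=1 for the list')
print(cleaned.split())
```

The substitution leaves extra whitespace where the link used to be, which is harmless if the text is tokenized on whitespace afterward.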

In [21]:
Complete_df.drop_duplicates(inplace=True)
Complete_df.reset_index(drop=True, inplace=True)

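This drops all duplicate rows and resets the index. The reset matters because the concatenated dataframe has repeated index labels, and `to_json` with its default orientation requires a unique index.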
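
A minimal sketch of the two steps on made-up rows follows; it uses the non-inplace forms chained together, which is equivalent to the two `inplace=True` calls above.

```python
import pandas as pd

# Three rows with a repeated index label; the first two rows are identical.
df = pd.DataFrame(
    {'author': ['user_a', 'user_a', 'user_b'], 'body': ['x', 'x', 'y']},
    index=[0, 0, 0],
)

# drop_duplicates compares column values only; reset_index renumbers from 0.
deduped = df.drop_duplicates().reset_index(drop=True)
print(deduped)
```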

In [22]:
Complete_df.to_json(path_or_buf = '../data/Complete_dataframe.json')

This exports the combined dataframe to json format to be used for EDA. 
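
For the EDA step, the exported file can be read back with `pd.read_json`. The sketch below shows the round trip in memory on a made-up frame, assuming the default orientation on both sides; the column values are hypothetical.

```python
import io
import pandas as pd

df = pd.DataFrame({'author': ['user_a', 'user_b'],
                   'body': ['x', 'y'],
                   'KOFT': [1, 0]})

# With no path argument, to_json returns the JSON text instead of writing a file.
json_text = df.to_json()
restored = pd.read_json(io.StringIO(json_text))
print(restored)
```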