# CLASSIFICATION OF R/SOURDOUGH POSTS PER FLAIR TAGS: 00 - DATA COLLECTION

## 1) Introduction

The aim of this project is to predict the category (flair) of a post on the r/sourdough subreddit from its text content: its title, the body of the post, and its comments. More details on the general context of the project are provided in the following notebook: [01 - Exploratory Data Analysis](./01_exploratory_data_analysis.ipynb).

This notebook collects the data used in the project.

The data is collected using the Pushshift API, a RESTful API for searching Reddit data that also supports data aggregations.

The API allows us to query Reddit and retrieve posts and comments using different filters. For example:
* between different dates
* from a specific subreddit
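
As an illustration of these filters, the sketch below builds a submission-search URL with the standard library. The endpoint and the parameter names (`title`, `size`, `after`, `before`, `subreddit`) are the ones used in the queries later in this notebook; the helper name `build_search_url` is hypothetical, and no request is sent.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/search/submission/"

def build_search_url(subreddit, after, before, title="*", size=1000):
    """Return a search URL restricted to one subreddit and one time window (hypothetical helper)."""
    params = {
        "title": title,
        "size": size,
        "after": after,        # start of the period, as a Unix timestamp
        "before": before,      # end of the period, as a Unix timestamp
        "subreddit": subreddit,
    }
    # safe="*" keeps the wildcard readable instead of encoding it as %2A
    return BASE_URL + "?" + urlencode(params, safe="*")

url = build_search_url("Sourdough", after=1605650400, before=1638867600)
print(url)
```

Letting `urlencode` assemble the query string avoids manual concatenation and escapes any special characters in the filter values.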

The result is a JSON response that can then be converted into a CSV file.
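
A minimal sketch of this conversion, using a made-up response in place of a real API call. The two records and their values are invented for illustration; only the top-level `data` key and the field names (`id`, `title`, `link_flair_text`) mirror what the notebook reads later.

```python
import csv
import io
import json

# Invented stand-in for the text of an API response
raw = (
    '{"data": ['
    '{"id": "abc123", "title": "First loaf", "link_flair_text": "Crumb help"}, '
    '{"id": "def456", "title": "Starter question"}'
    ']}'
)

posts = json.loads(raw)["data"]  # list of dictionaries, one per post

buffer = io.StringIO()  # in-memory target; a file opened with newline='' could be used instead
writer = csv.writer(buffer)
writer.writerow(["Post ID", "Title", "Flair"])
for post in posts:
    # .get() supplies a default when a field is absent from the record
    writer.writerow([post["id"], post["title"], post.get("link_flair_text", "")])

csv_text = buffer.getvalue()
print(csv_text)
```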

The data was collected in two steps: one query for the posts and one query for the comments. 
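
Each of the two queries is repeated many times, because a single response only holds a limited number of items (100 per call in the runs recorded in this notebook). The sketch below shows the cursor pattern used for both: request a page, keep its items, then move `after` to the `created_utc` of the last item until an empty page comes back. The `fake_fetch` function and its timestamps are invented so that the loop can run offline.

```python
def fake_fetch(after, before, page_size=3):
    """Invented stand-in for the API: items created strictly between after and before, oldest first."""
    timestamps = range(100, 200, 10)  # ten made-up items created at t = 100, 110, ..., 190
    matches = [{"id": f"post{t}", "created_utc": t} for t in timestamps if after < t < before]
    return matches[:page_size]

def collect_all(fetch, after, before):
    """Page through the results by advancing the `after` cursor."""
    collected = []
    page = fetch(after, before)
    while len(page) > 0:
        collected.extend(page)
        # The next request starts right after the last item already received
        after = page[-1]["created_utc"]
        page = fetch(after, before)
    return collected

items = collect_all(fake_fetch, after=95, before=175)
print(len(items), items[0]["id"], items[-1]["id"])
```

One caveat of this pattern: if several items share the timestamp of the last item on a page and the filter is strict, the ones not yet returned would be skipped.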

In [1]:
# Load libraries

import pandas as pd
import requests
import json
import csv
import time
import datetime
import numpy as np

In [2]:
# retrieve labelled data
# Define function to query Reddit posts.

def getPushshiftData(query, after, before, sub):
    '''
    Query the Pushshift API to retrieve the posts of interest.
    Parameters:
        - query: substring that the title must contain, if we are only interested in posts mentioning certain terms
        - after: start of the period we want to query (Unix timestamp)
        - before: end of the period we want to query (Unix timestamp)
        - sub: the subreddit to query
    Output:
        - A list of dictionaries (the 'data' field of the parsed JSON), one per post matching the criteria defined in the parameters
    '''
    while True:
        # size=1000 is requested, but the output below shows 100 posts returned per call
        url = 'https://api.pushshift.io/reddit/search/submission/?title='+str(query)+'&size=1000&after='+str(after)+'&before='+str(before)+'&subreddit='+str(sub)
        print(url)
        r = requests.get(url)
        try:
            data = json.loads(r.text)
            return data['data']
        except json.decoder.JSONDecodeError:
            # The response was not valid JSON: wait one minute, then retry the same request
            print('corrupted json')
            time.sleep(60)

# Define function to collect specific data from the parsed JSON:
# id, subreddit, title, text, permalink, creation date and flair of each post
def collectSubData(subm):
    '''
    Retrieve specific items from one parsed post and store them in the global dictionary subStats.
    Parameters:
        - subm: dictionary describing one post, as returned by getPushshiftData
    Output:
        - None; subStats[post id] is set to a list holding one tuple with the items of interest
    '''
    subData = list() #list to store data points
    try:
        sub = subm['subreddit']
    except KeyError:
        sub = 'Error: No subreddit'
    # Non-ASCII characters are dropped from all text fields
    title = subm['title'].encode('ascii', 'ignore').decode()
    permalink = subm['permalink']

    # Converted to the local time of the machine running the notebook
    created = datetime.datetime.fromtimestamp(subm['created_utc']) # e.g. 1520561700.0
    try:
        flair = subm['link_flair_text'].encode('ascii', 'ignore').decode()
    except (KeyError, AttributeError): # flair missing or None
        flair = " "
    sub_id = subm['id']
    try:
        text = subm['selftext'].encode('ascii', 'ignore').decode()
    except (KeyError, AttributeError): # text missing or None
        text = ""

    subData.append((sub_id, sub, title, text, permalink, created, flair))
    subStats[sub_id] = subData

# Define parameters 
#Subreddit to query
sub='Sourdough'
#before and after dates
#before = "1635721200" #01.11.2021
#after = "1572562800"  #01.11.2019 
#after = "1635611150"  #test

before = "1638867600" #07.12.2021
#after =  "1638800000" # small period for test
after = "1605650400"  #17.11.2020 
#after =  "1291680000" # since creation

query = "*"
subCount = 0
subStats = {}

data = getPushshiftData(query, after, before, sub)
# Will run until all posts have been gathered
# from the 'after' date up until the 'before' date
while len(data) > 0:
    for submission in data:
        collectSubData(submission)
        subCount+=1
    print(len(data))
    print(str(datetime.datetime.utcfromtimestamp(data[-1]['created_utc'])))
    # Call getPushshiftData() again, starting from the creation date of the last submission received
    after = data[-1]['created_utc']
    data = getPushshiftData(query, after, before, sub)
    
print(len(data))

def updateSubs_file():
    '''
    Write the collected posts (global dictionary subStats) to a CSV file, one row per post.
    '''
    upload_count = 0
    path = './data/sourdough_flairs_df.csv'
    with open(path, 'w', newline='', encoding='utf-8') as file:
        a = csv.writer(file, delimiter=',')
        headers = ["Post ID", "Subreddit", "Title", "OP Text", "Permalink", "Publish Date", "Flair"]
        a.writerow(headers)
        for sub_id in subStats:
            # Each entry is a list holding a single tuple of post fields
            a.writerow(subStats[sub_id][0])
            upload_count+=1

        print(str(upload_count) + " submissions have been uploaded")
updateSubs_file()

https://api.pushshift.io/reddit/search/submission/?title=*&size=1000&after=1605650400&before=1638867600&subreddit=Sourdough
100
2020-11-19 01:50:51
https://api.pushshift.io/reddit/search/submission/?title=*&size=1000&after=1605750651&before=1638867600&subreddit=Sourdough
100
2020-11-20 16:33:17
https://api.pushshift.io/reddit/search/submission/?title=*&size=1000&after=1605889997&before=1638867600&subreddit=Sourdough
100
2020-11-22 02:20:11
https://api.pushshift.io/reddit/search/submission/?title=*&size=1000&after=1606011611&before=1638867600&subreddit=Sourdough
100
2020-11-22 23:54:55
https://api.pushshift.io/reddit/search/submission/?title=*&size=1000&after=1606089295&before=1638867600&subreddit=Sourdough
100
2020-11-24 02:27:29
https://api.pushshift.io/reddit/search/submission/?title=*&size=1000&after=1606184849&before=1638867600&subreddit=Sourdough
100
2020-11-25 13:01:48
https://api.pushshift.io/reddit/search/submission/?title=*&size=1000&after=1606309308&before=1638867600&subreddi

In [3]:
# retrieve comments associated to the collected posts
# Define function to query Reddit comments.

def getPushshiftDataComments(query, after, before, sub):
    '''
    Query the Pushshift API to retrieve the comments of interest.
    Parameters:
        - query: search term passed as the 'q' parameter, if we are only interested in comments mentioning certain terms
        - after: start of the period we want to query (Unix timestamp)
        - before: end of the period we want to query (Unix timestamp)
        - sub: the subreddit to query
    Output:
        - A list of dictionaries (the 'data' field of the parsed JSON), one per comment matching the criteria defined in the parameters
    '''
    while True:
        # author=!AutoModerator is meant to exclude the comments posted by AutoModerator
        # size=500 is requested, but the output below shows 100 comments returned per call
        url = 'https://api.pushshift.io/reddit/search/comment/?q='+str(query)+'&size=500&after='+str(after)+'&before='+str(before)+'&subreddit='+str(sub)+'&author=!AutoModerator'
        print(url)
        r = requests.get(url)
        try:
            data = json.loads(r.text)
            return data['data']
        except json.decoder.JSONDecodeError:
            # The response was not valid JSON: wait one minute, then retry the same request
            print('corrupted json')
            time.sleep(60)



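# Define function to collect specific data from the parsed JSON:
# id, parent post and body of each comment
def collectCommentData(comm):
    '''
    Retrieve specific items from one parsed comment and store them in the global dictionary commStats.
    Parameters:
        - comm: dictionary describing one comment, as returned by getPushshiftDataComments
    Output:
        - None; a tuple (comment id, post id, body) is appended to the list commStats[post id]
    '''
    commData = list() #list to store data points

    comm_id = comm['id']
    try:
        # Non-ASCII characters are dropped from the comment body
        comm_body = comm['body'].encode('ascii', 'ignore').decode()
    except (KeyError, AttributeError): # body missing or None
        comm_body = ""
    comm_parent = comm['link_id'] # identifier of the post the comment belongs to

    commData.append((comm_id, comm_parent, comm_body))
    # Group the comments by parent post
    if comm_parent in commStats:
        commStats[comm_parent] += commData
    else:
        commStats[comm_parent] = commData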

#before = "1638867600" #07.12.2021
before = "1639040400" #09.12.2021 (to retrieve all comments of the last post)
#after =  "1638800000" # small period for test
after = "1605650400"  #17.11.2020 
#after =  "1291680000" # since creation
query = "*"
commCount = 0
commStats = {}
data_comm = getPushshiftDataComments(query, after, before, sub)
# Will run until all comments have been gathered
# from the 'after' date up until the 'before' date
while len(data_comm) > 0:
    for comment in data_comm:
        collectCommentData(comment)
        commCount+=1
    print(len(data_comm))
    print(str(datetime.datetime.utcfromtimestamp(data_comm[-1]['created_utc'])))
    # Call getPushshiftDataComments() again, starting from the creation date of the last comment received
    after = data_comm[-1]['created_utc']
    data_comm = getPushshiftDataComments(query, after, before, sub)
    
print(len(data_comm))
    

https://api.pushshift.io/reddit/search/comment/?q=*&size=500&after=1605650400&before=1639040400&subreddit=Sourdough&author=!AutoModerator
100
2020-11-18 02:13:35
https://api.pushshift.io/reddit/search/comment/?q=*&size=500&after=1605665615&before=1639040400&subreddit=Sourdough&author=!AutoModerator
100
2020-11-18 04:59:37
https://api.pushshift.io/reddit/search/comment/?q=*&size=500&after=1605675577&before=1639040400&subreddit=Sourdough&author=!AutoModerator
100
2020-11-18 14:14:46
https://api.pushshift.io/reddit/search/comment/?q=*&size=500&after=1605708886&before=1639040400&subreddit=Sourdough&author=!AutoModerator
100
2020-11-18 18:53:04
https://api.pushshift.io/reddit/search/comment/?q=*&size=500&after=1605725584&before=1639040400&subreddit=Sourdough&author=!AutoModerator
100
2020-11-18 21:04:00
https://api.pushshift.io/reddit/search/comment/?q=*&size=500&after=1605733440&before=1639040400&subreddit=Sourdough&author=!AutoModerator
100
2020-11-19 02:24:34
https://api.pushshift.io/red

In [4]:
def updateComms_file():
    '''
    Write the collected comments (global dictionary commStats) to a CSV file, one row per comment.
    '''
    upload_count = 0
    path = './data/sourdough_comments_df.csv'
    with open(path, 'w', newline='', encoding='utf-8') as file:
        a = csv.writer(file, delimiter=',')
        headers = ["Comment ID", "Post ID", "Body"]
        a.writerow(headers)
        # commStats maps each post to the list of its comments
        for key,value in commStats.items():
            for comm in value:
                a.writerow(comm)
                upload_count+=1

        print(str(upload_count) + " submissions have been uploaded")
updateComms_file()

110198 submissions have been uploaded
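
To use the two files together, comments have to be matched to their posts. One point worth checking: the `Post ID` column of the comments file holds `link_id`, which in Reddit data is normally the full name of the post, i.e. the post `id` prefixed with `t3_`, whereas the posts file stores the bare `id`. The sketch below assumes that prefix and uses invented rows reduced to a few columns; the helper `strip_post_prefix` is hypothetical and not part of the original notebook.

```python
def strip_post_prefix(link_id):
    """Turn a link full name such as 't3_abc123' into the bare post id 'abc123' (assumed format)."""
    prefix = "t3_"
    return link_id[len(prefix):] if link_id.startswith(prefix) else link_id

# Invented rows: (post id, subreddit, title) and (comment id, link_id, body)
post_rows = [
    ("abc123", "Sourdough", "First loaf"),
    ("def456", "Sourdough", "Starter question"),
    ("ghi789", "Sourdough", "No comments yet"),
]
comment_rows = [
    ("c1", "t3_abc123", "Nice crumb!"),
    ("c2", "t3_abc123", "What hydration?"),
    ("c3", "t3_def456", "Feed it more often."),
]

# Group comment bodies by bare post id
comments_by_post = {}
for comment_id, link_id, body in comment_rows:
    comments_by_post.setdefault(strip_post_prefix(link_id), []).append(body)

# Attach the comments to each post, with an empty list for posts without comments
merged = {
    post_id: (title, comments_by_post.get(post_id, []))
    for post_id, _subreddit, title in post_rows
}
print(merged)
```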
