## PM3 - Revised Data Collection
5/2/2022  
Anna Lieb

In my previous data collection notebook for PM2, I used the Pushshift API to collect Reddit data from the r/technology subreddit. However, my search was too narrow and did not generate enough text data for thorough analysis. 

In this notebook, I will use downloaded Reddit dumps from https://files.pushshift.io/reddit/submissions/ and https://files.pushshift.io/reddit/comments/ to widen my search to all subreddits instead. To get relevant submissions, I will still slightly narrow my search to posts that contain the words ("data" and "privacy") OR ("data" and "personal"). Note that the search does not consider these terms as bigrams; instead, it searches for both words individually.
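
As a minimal sketch of this search rule, the check below treats each keyword as an independent, case-sensitive substring rather than as a bigram, which mirrors the test used in `excludeFn` later in this notebook. The function name `matches_keywords` and the sample titles are hypothetical and exist only for illustration.

```python
def matches_keywords(text):
    """Return True when text contains "data" plus "privacy" or "personal".

    Each word is matched on its own as a case-sensitive substring,
    so the words may appear anywhere in the text and in any order.
    """
    return "data" in text and ("privacy" in text or "personal" in text)


# Hypothetical sample titles, for illustration only.
samples = [
    "How do apps use my personal data?",    # match: both words present
    "New privacy law limits data brokers",  # match: words are not adjacent
    "Data privacy explained",               # no match: "Data" is capitalized
    "Best privacy-focused browsers",        # no match: "data" is missing
]
results = [matches_keywords(title) for title in samples]
print(results)
```

Because the match is a plain substring test, capitalized variants such as "Data" are skipped, and words that merely contain a keyword (for example "metadata") count as matches.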

### Table of Contents
1. [Helper functions](#sec1)
2. [Collect submissions](#sec2)
3. [Collect comments](#sec3)

<a id="sec1"></a>
## 1. Helper functions
This section includes helper functions that decompress the Reddit dump .zst file, determine whether a post is "relevant" (i.e., contains the keywords), and write relevant Reddit posts to an output .csv file.

Each line of the decompressed .zst file is a JSON object that represents one Reddit post. Some of the object attributes used in this code are listed below. 

Useful attributes: 
- author ==> username of the author of the post
- subreddit ==> subreddit forum where the post was posted
- title (submission only) ==> title of submissions
- selftext (submission only) ==> body text of submissions; often, Reddit users leave this blank.
- body (comment only) ==> text of comment
- created_utc ==> time the post was created, as a UTC timestamp
- stickied ==> boolean for whether the post was pinned to top of thread
- id ==> post id
- permalink ==> link to the post 
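
As a rough sketch of how these attributes can be read, the snippet below parses one record and converts its timestamp. The record is hypothetical (invented for illustration), and it assumes `created_utc` holds a Unix timestamp in seconds.

```python
import json
from datetime import datetime, timezone

# Hypothetical comment record, invented for illustration; real dump lines
# carry many more fields than the ones shown here.
line = ('{"author": "example_user", "subreddit": "privacy", '
        '"body": "Who owns my personal data?", "created_utc": 1614556800, '
        '"stickied": false, "id": "abc123", '
        '"permalink": "/r/privacy/comments/xyz/example/abc123/"}')

entry = json.loads(line)  # each line of the decompressed dump is parsed on its own

# Assumption: created_utc is a Unix timestamp in seconds.
created = datetime.fromtimestamp(int(entry["created_utc"]), tz=timezone.utc)

print(entry["subreddit"], entry["id"], created.isoformat())

# .get() supplies a default for attributes that a record lacks
# (comments have no "title", so this is an empty string).
missing_title = entry.get("title", "")
```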

In [2]:
import os # for file management
import zstandard as zstd # enables parsing of .zst files
import pandas as pd # for data handling and storage
import datetime # for UTC parsing
import csv
import json 

In [3]:
def excludeFn(entry, textAttr):
    '''
        Determines whether or not an entry should be discarded 
        based on a set of conditional checks.
        
        return True ==> exclude
        return False ==> include
        
        Parameters: 
        entry - Pushshift object that represents the post
        textAttr - 'body' for comments and 'title' for submissions 
    '''
    valueof = lambda key: entry.get(key, "")
    
    text = valueof(textAttr) # textAttr ="body" for comments, "title" for submissions
    
    # do not include if post has been removed
    if (text in ['', '[deleted]', '[removed]']): 
        return True
    
    # do not include if post is "stickied", ie. pinned to top of thread
    if entry.get('stickied', False): # default to False; unary + on the "" default raised a TypeError
        return True
    
    # must be in given subreddit
    #if (valueof('subreddit').lower() in ['amitheasshole']): 
    #    return False
    
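    # keep only posts that contain "data" plus "privacy" or "personal"
    # (case-sensitive substring match; the words need not be adjacent)
    if ("data" in text) and (("privacy" in text) or ("personal" in text)): 
        return False
    
    return True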

### extractZst
Based on code snippets posted in this Reddit thread: https://www.reddit.com/r/pushshift/comments/ajmcc0/information_and_code_examples_on_how_to_use_the/

In [4]:
def extractZst(inPath, outPath, attributes, excludeFn = lambda entry, textAttr: False):
    '''
        Decompresses the given Reddit dump, reads from it as a stream, 
        and continuously writes relevant posts to an output .csv file. 
        
        Parameters: 
        inPath ==> file path to input .zst file
        outPath ==> file path to output .csv file 
        attributes ==> attributes of the post objects that you wish to include 
        in the output file
            *For my purposes I used the following attributes: 
            ["created_utc", "author", "title", "selftext", "subreddit", "id", "permalink"] for submissions
            ["created_utc", "author", "body", "subreddit", "id", "permalink"] for comments
            *Note that the text attribute must be the third item, because it is read from attributes[2].
        excludeFn ==> predicate function to determine relevance of post
    '''
    print(f'decompressing {inPath}...')
    textAttr = attributes[2] # 'body' for comments, 'title' for submissions
    
    # open the output once (instead of once per row) and append below the header
    with open(inPath, 'rb') as fh, open(outPath, "a", newline="", encoding="utf-8") as outF:
        writer = csv.writer(outF)
        # iterate through zst contents via a filestream to minimize memory load
        dctx = zstd.ZstdDecompressor(max_window_size=2147483648)
        with dctx.stream_reader(fh) as reader:
            previous_line = b""
            while True:
                chunk = reader.read(2**24)
                # split on raw bytes so a multi-byte UTF-8 character cut in half
                # at a chunk boundary cannot raise a decode error
                lines = (previous_line + chunk).split(b"\n")
                # the last piece may be an incomplete line; carry it into the next chunk
                previous_line = lines.pop() if chunk else b""
                
                for line in lines:
                    if not line.strip():
                        continue
                    
                    # entry is the Reddit post as an object
                    entry = json.loads(line)
                    
                    if not excludeFn(entry, textAttr):
                        # write the selected attributes; missing ones are left blank
                        writer.writerow([entry.get(attr, "") for attr in attributes])
                
                if not chunk:
                    break
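
The chunked reading pattern in `extractZst` can be tried without a real dump. The sketch below is a simplified stand-in, not the notebook's actual pipeline: it replaces the zstandard reader with an in-memory `io.BytesIO` stream, and the name `iter_json_lines` and the sample records are hypothetical. It shows how a partial line at the end of one chunk is carried over and joined with the start of the next.

```python
import io
import json


def iter_json_lines(stream, chunk_size):
    """Yield one parsed object per newline-delimited JSON line in a byte stream."""
    leftover = b""
    while True:
        chunk = stream.read(chunk_size)
        pieces = (leftover + chunk).split(b"\n")
        # While data remains, the last piece may be a partial line: hold it back.
        leftover = pieces.pop() if chunk else b""
        for piece in pieces:
            if piece.strip():
                yield json.loads(piece)
        if not chunk:
            break


# Hypothetical two-record stream standing in for a decompressed dump.
raw = b'{"id": "a1", "title": "caf\xc3\xa9 data"}\n{"id": "b2", "title": "privacy"}\n'

# A tiny chunk size forces records (and the two-byte character) to straddle chunks.
records = list(iter_json_lines(io.BytesIO(raw), chunk_size=7))
print([r["id"] for r in records])
```

Splitting on bytes before parsing means a chunk boundary can fall anywhere, even inside a multi-byte character, without corrupting a record.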

In [5]:
def relevantZstToCsv(inPath, outPath, attributes, excludeFn): 
    '''
    Takes a .zst file and writes the relevant posts to a .csv file.
    
    inPath ==> file path to input .zst 
    outPath ==> file path to output .csv
    attributes ==> list of desired post attributes to be included in output csv
    excludeFn ==> function that returns boolean value based on post relevance
    '''
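    if not os.path.exists(inPath):
        print(f"Path error: input file {inPath} does not exist")
    elif os.path.exists(outPath):
        # extractZst appends rows, so refuse to duplicate data in an existing file
        print(f"Path error: output file {outPath} already exists")
    else:
        with open(outPath, "w", newline="") as outF: 
            writer = csv.writer(outF)
    
            # header of output csv file 
            writer.writerow(attributes)
            
        extractZst(inPath, outPath, attributes, excludeFn)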

<a id="sec2"></a>
## 2. Collect submissions

In [None]:
# submission attributes to be collected
subAttributes = ["created_utc", "author", "title", "selftext", "subreddit", "id", "permalink"]

# collect submissions from March 2021 to June 2021
for month in range(3, 7): 
    subInPath = f"Data.nosync/Dumps/RS_2021-{month:02d}.zst"
    subOutPath = f"Data.nosync/Extracted/RS_2021-{month:02d}_extracted.csv"
    relevantZstToCsv(subInPath, subOutPath, subAttributes, excludeFn)

<a id="sec3"></a>
## 3. Collect comments

In [None]:
# comment attributes to be collected
commAttributes = ["created_utc", "author", "body", "subreddit", "id", "permalink"]

# collect comments from March 2021 to June 2021
for month in range(3, 7): 
    commInPath = f"Data.nosync/Dumps/RC_2021-{month:02d}.zst"
    commOutPath = f"Data.nosync/Extracted/RC_2021-{month:02d}_extracted.csv"
    relevantZstToCsv(commInPath, commOutPath, commAttributes, excludeFn)
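
Once the extraction has finished, a quick sanity check is to read an output file back and count rows per subreddit. The sketch below is only an illustration: it uses a small in-memory CSV with hypothetical rows in place of a real extracted file, with the same header that `relevantZstToCsv` writes for comments.

```python
import csv
import io
from collections import Counter

# Hypothetical contents of an extracted comments file, invented for illustration.
sample_csv = io.StringIO(
    "created_utc,author,body,subreddit,id,permalink\n"
    "1614556800,user_a,my personal data,privacy,c1,/r/privacy/c1\n"
    "1614556900,user_b,data privacy rules,technology,c2,/r/technology/c2\n"
    "1614557000,user_c,personal data leak,privacy,c3,/r/privacy/c3\n"
)

# DictReader uses the header row as field names.
rows = list(csv.DictReader(sample_csv))
per_subreddit = Counter(row["subreddit"] for row in rows)
print(len(rows), per_subreddit.most_common())
```

For a real file, the `io.StringIO` object would be replaced by `open(path, newline="")` on one of the extracted .csv paths.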