
# Classifying Passive vs. Active Revenge in Related Subreddits using NLP

--- 
# Part 1: Web Scraping

--- 

## Contents
- [Initial Exploration of Using Pushshift on Chosen Subreddits](#Initial-Exploration-of-Using-Pushshift-on-Chosen-Subreddits)
- [Slight Modification of Gwen's Script](#Slight-Modification-of-Gwen's-Script)
- [Alternative Function I Reworked](#Alternative-Function-I-Reworked)

Using the Pushshift API, I queried and pulled 550 days of submissions from three subreddits: MaliciousCompliance, ProRevenge, and pettyrevenge.

I initially had trouble looping through requests and passing the earliest `created_utc` from each response as the next `before` parameter, so I used a slightly modified version of Gwen's script in this notebook to get all my data collected. I then went back, worked with Devin, and got an alternative function working. To save time, I did not re-pull my data with it.

In [1]:
# Import libraries here

import pandas as pd
import requests
import datetime as dt
import time
import sys
import csv
from datetime import datetime
import tldextract

## Initial Exploration of Using Pushshift on Chosen Subreddits

In [2]:
# Subreddits I want to query

url_1 = 'https://api.pushshift.io/reddit/search/submission?subreddit=MaliciousCompliance'
url_2 = 'https://api.pushshift.io/reddit/search/submission?subreddit=ProRevenge'
url_3 = 'https://api.pushshift.io/reddit/search/submission?subreddit=pettyrevenge'

In [3]:
params1 = {'subreddit': 'MaliciousCompliance',
           'size':'100',
           'before': 1627207406}

params2 = {'subreddit': 'ProRevenge',
           'size':'100'}

In [4]:
res_1 = requests.get(url_1, params1)
res_1.status_code

200

In [5]:
res_2 = requests.get(url_2, params2)
res_2.status_code

200

In [6]:
res_3 = requests.get(url_3)
res_3.status_code

200

In [7]:
json_1 = res_1.json()
#json_1

In [8]:
json_2 = res_2.json()
#json_2

In [9]:
json_1['data'][-1]['created_utc']

1626858928
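
The `created_utc` of the oldest post in a response can serve as the `before` value for the next request, which is the paging idea used later in this notebook. Below is a minimal sketch of that step. `next_page_params` is a hypothetical helper name (not part of Pushshift or `requests`), and the sample posts are hand-written stand-ins rather than a live API response. Taking the minimum avoids relying on the order in which posts are returned.

```python
def next_page_params(params, posts):
    """Return a copy of params whose 'before' is the oldest created_utc in posts."""
    if not posts:
        return None  # nothing returned, so there is no further page to request
    oldest = min(post['created_utc'] for post in posts)
    updated = dict(params)  # copy so the caller's dict is left unchanged
    updated['before'] = oldest
    return updated

# Hand-written stand-in data instead of a live API response
sample_posts = [{'created_utc': 1627207000}, {'created_utc': 1626858928}]
sample_params = {'subreddit': 'MaliciousCompliance', 'size': '100'}
print(next_page_params(sample_params, sample_posts))
```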

In [11]:
# Checking to see what features I want to keep and also the UTC timestamp to use - COMMENTING OUT FOR SPACE
#json_1['data'][:-1]

In [12]:
#json_2['data'][:-1] - COMMENTING OUT FOR SPACE

## Slight Modification of Gwen's Script

In [2]:
#REFERENCE: Gwen's script

subs = ['MaliciousCompliance', 'pettyrevenge', 'ProRevenge']


# Set number of days of data to gather
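try:
    days = int(sys.argv[1])
except (IndexError, ValueError):
    # No usable command-line argument (e.g. when run in a notebook), so prompt instead
    days = int(input('Please enter the number of days: '))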
    
base_url =  'https://api.pushshift.io/reddit/'

# Function to make an individual Pushshift API request
# Returns dictionary of the .json API response
def request_posts(subreddit, days_ago, base_url=base_url, 
                  endpoint='search/submission/', is_video='is_video=false'):
    # Request the one-day window that ends `days_ago` days before now
    url = (f'{base_url}{endpoint}?subreddit={subreddit}&{is_video}'
           f'&before={days_ago}d&after={days_ago+1}d&size=100')
    response = requests.get(url)
    # Raise on an error status so the caller can skip this day
    response.raise_for_status()
    return response

# Function to make n requests of 100 posts from n days
# Returns dataframe of API responses from a subreddit
def make_requests(subreddit, days_of_data):
    all_results = []
    
    for i in range(1, days_of_data):
        try:
            entry = request_posts(subreddit,i)
            all_results.append(pd.DataFrame(entry.json()['data']))
        except Exception:
            # Skip days whose request or parsing fails; Ctrl-C still interrupts the loop
            pass
        if i % 100 == 0:
            print(f'{i} of {days_of_data} requests completed')
        time.sleep(1.5)
        
    return pd.concat(all_results)

# Function to make n requests of 100 posts from n days over m subreddits
# Returns dataframe of API responses from all subreddits
def request_all_subs(list_of_subreddits, days_of_data):
    all_results = []
    for sub in list_of_subreddits:
        print(f'Querying {sub}...')
        sub_df = make_requests(sub,days_of_data)
        all_results.append(sub_df)
    return pd.concat(all_results)

# Executes all requests for n days of data across the subreddits list and writes results to a .csv
def main(days=days):
    df = request_all_subs(subs,days)
    df.to_csv('../data/subreddit_data.csv', index=False)

if __name__ == "__main__":
    main()

Please enter the number of days: 550
Querying MaliciousCompliance...
100 of 550 requests completed
200 of 550 requests completed
300 of 550 requests completed
400 of 550 requests completed
500 of 550 requests completed
Querying pettyrevenge...
100 of 550 requests completed
200 of 550 requests completed
300 of 550 requests completed
400 of 550 requests completed
500 of 550 requests completed
Querying ProRevenge...
100 of 550 requests completed
200 of 550 requests completed
300 of 550 requests completed
400 of 550 requests completed
500 of 550 requests completed
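
Each request in the script above asks for a one-day window: `before={i}d` together with `after={i+1}d`. The sketch below converts those offsets into explicit epoch bounds to make the windows visible. It assumes Pushshift measures the `d` offsets back from the time of the request, and `day_window` is a hypothetical helper introduced only for illustration.

```python
from datetime import datetime, timedelta, timezone

def day_window(days_ago, now):
    """Return (after_ts, before_ts) in epoch seconds for the one-day window
    that ends `days_ago` days before `now`."""
    before = now - timedelta(days=days_ago)
    after = now - timedelta(days=days_ago + 1)
    return int(after.timestamp()), int(before.timestamp())

# Fixed reference time so the example is reproducible
now = datetime(2021, 8, 2, tzinfo=timezone.utc)
for i in range(1, 4):
    after_ts, before_ts = day_window(i, now)
    # Each window is 86400 seconds wide and starts where the next one ends
    print(i, after_ts, before_ts, before_ts - after_ts)
```

Because every request is capped at `size=100`, a day with more than 100 submissions would presumably be truncated under this scheme.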


## Alternative Function I Reworked

In [6]:
#REFERENCES: pushshift_demo lesson, help from Devin Fay

def query_pushshift(subreddit):
    SUBFIELDS = ['title', 'selftext', 'subreddit', 'created_utc', 'author', 'num_comments', 'score', 'is_self']
    
    # establish base url and stem
    BASE_URL = "https://api.pushshift.io/reddit/search/submission"
    
    #params to pass to url
    params = {'subreddit': subreddit,
             'size': 100
             }
    
    # instantiate empty list for temp storage
    posts = []
    
    # implement for loop with time.sleep
    for i in range(1, 3):
        response = requests.get(BASE_URL, params) #added params to url request
        assert response.status_code == 200
        mine = response.json()['data']
        df = pd.DataFrame.from_dict(mine)
        params['before'] = df["created_utc"].min()  #setting the before param to min of last scrape instance
        posts.append(df)
        time.sleep(1.5)
    
    # pd.concat storage list
    full = pd.concat(posts, sort=False)
    
    # select desired columns
    full = full[SUBFIELDS]
    # drop duplicates
    full = full.drop_duplicates()
    # select `is_self` == True; .copy() avoids a SettingWithCopyWarning below
    full = full.loc[full['is_self'] == True].copy()

    # create `timestamp` column (calendar date of each post)
    full['timestamp'] = full["created_utc"].map(dt.date.fromtimestamp)
    
    print("Query Complete!")    
    return full 

In [7]:
# Testing to make sure it worked
malicious = query_pushshift('MaliciousCompliance')

Query Complete!


| | title | selftext | subreddit | created_utc | author | num_comments | score | is_self | timestamp |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Manager forces me to get a doctor's note despi... | I posted this but it got removed and I think i... | MaliciousCompliance | 1627928961 | kathjoy | 0 | 1 | True | 2021-08-02 |
| 1 | Not mine, reposted from Quora | [removed] | MaliciousCompliance | 1627921742 | ericherde | 0 | 1 | True | 2021-08-02 |
| 2 | What pool? | Obligatory reminded of this by another post/wa... | MaliciousCompliance | 1627919788 | robbie5643 | 0 | 1 | True | 2021-08-02 |
| 3 | Put your jammies in the laundry hamper | I had the laundry hamper on the floor waiting ... | MaliciousCompliance | 1627917011 | exhaustedmommyof2 | 0 | 1 | True | 2021-08-02 |
| 4 | Make Money Online Free Sign up | [removed] | MaliciousCompliance | 1627915625 | zqw004 | 0 | 1 | True | 2021-08-02 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | Go home and iron my trousers? Okay | Short but sweet MC.\n\nI used to work in a fac... | MaliciousCompliance | 1627069084 | annieseesyou | 49 | 1 | True | 2021-07-23 |
| 96 | Using the "training purposes" recording agains... | [removed] | MaliciousCompliance | 1627063206 | RepresentativeFit527 | 0 | 1 | True | 2021-07-23 |
| 97 | Ignorant manager wanted me to build a training... | So this has been a little while coming and not... | MaliciousCompliance | 1627063054 | ex-turpi-causa | 82 | 1 | True | 2021-07-23 |
| 98 | Can't wear swimming trunks to the pool? Fine I... | Few years back I was on a camping with some fr... | MaliciousCompliance | 1627062948 | StunkRebel | 23 | 1 | True | 2021-07-23 |
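
The loop in `query_pushshift` makes a fixed number of requests. One possible variation, sketched below, keeps paging backwards until a target number of posts is reached or a request comes back empty. This is not the function used to collect the data: `collect_posts`, `fake_fetch`, and `FAKE_POSTS` are hypothetical names, and the fetcher serves made-up posts so the paging logic can be checked without calling the API.

```python
import time

def collect_posts(fetch_page, subreddit, max_posts, pause=0.0):
    """Page backwards through a subreddit until max_posts are collected
    or a request returns no data.

    fetch_page(params) must return a list of post dicts with 'created_utc'.
    """
    params = {'subreddit': subreddit, 'size': 100}
    posts = []
    while len(posts) < max_posts:
        page = fetch_page(params)
        if not page:
            break  # no older posts available
        posts.extend(page)
        # The next request asks only for posts older than the oldest one seen
        params['before'] = min(post['created_utc'] for post in page)
        time.sleep(pause)
    return posts[:max_posts]

# Made-up posts, newest first, standing in for the API's data
FAKE_POSTS = [{'created_utc': t} for t in range(1000, 750, -1)]

def fake_fetch(params):
    """Stand-in for a live request: return up to `size` posts older than `before`."""
    before = params.get('before', float('inf'))
    older = [p for p in FAKE_POSTS if p['created_utc'] < before]
    return older[:params['size']]

result = collect_posts(fake_fetch, 'MaliciousCompliance', max_posts=220)
print(len(result), result[0]['created_utc'], result[-1]['created_utc'])
```

With a live source, `fetch_page` could wrap the same `requests.get(BASE_URL, params)` call used in `query_pushshift` and return `response.json()['data']`, with `pause=1.5` to match the delay used above.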


In [None]:
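# Save data to CSVs
# `petty` and `pro` are not created earlier in this notebook, so query them first

petty = query_pushshift('pettyrevenge')
pro = query_pushshift('ProRevenge')

malicious.to_csv('../data/malicious_pushshift.csv')
petty.to_csv('../data/petty_pushshift.csv')
pro.to_csv('../data/pro_pushshift.csv')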