<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP 

# Classifying Active vs. Passive Revenge in Subreddits

--- 
# Part 2: Overview

--- 

## Problem Statement

Can we distinguish between active and passive revenge in the subreddits r/MaliciousCompliance, r/pettyrevenge, and r/ProRevenge? Posts in r/MaliciousCompliance tend to describe passive, unplanned revenge, while the other 2 subreddits tend to describe active, planned revenge. If the model can detect predictive features that distinguish r/MaliciousCompliance from the other 2 subreddits, the moderators of r/MaliciousCompliance could use it.

## Contents
- [Background on Subreddits Chosen](#Background-on-Subreddits-Chosen)
- [References](#References)
- [Data Cleaning](#Part-3-:-Data-Cleaning)
- [Data Dictionary](#Data-Dictionary)

## Overview

The psychology community has undertaken numerous studies of aggression and revenge. One paper I read [Source: https://doi.org/10.1093/scan/nsv082] found evidence to suggest that areas of the brain that control aggression were more activated when preceded by instances of provocation. While acting on that impulse may provide the pleasure and reward of retaliatory aggression, the paper concluded that this effect cannot be fully separated from the possible significance of simply observing someone be punished for their wrongdoing. This distinction between active and passive revenge is interesting to explore and attempt to classify.

According to Wikipedia, malicious compliance is the passive-aggressive "behaviour of intentionally inflicting harm by strictly following the orders of a superior while knowing or intending that compliance with the orders will have an unintended or negative result." It differs from other acts of revenge in that it is unplanned but intentional: the person is simply following orders. In contrast, revenge is a planned act of retaliation in order to get even.

A big part of getting Reddit submission data ready for modeling is cleaning: removing artifacts or remnants of HTML, punctuation, characters like emojis, etc. Removed or deleted submissions were dropped. The particular subreddits I chose did not appear to have any moderator bot messages that needed to be removed. After cleaning, there were 11,189 subreddit submissions in total: 5,303 from r/MaliciousCompliance, 3,360 from r/pettyrevenge, and 2,526 from r/ProRevenge. Words were lemmatized and stop words were removed. Preliminary EDA showed that of the 30 most frequent words in each class, approximately 1/3 were unique to the class and 2/3 were shared by both classes. After cleaning and EDA, several different models were built to classify passive vs. active revenge in subreddits.
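
The overlap check on the most frequent words can be sketched as follows. This is a minimal illustration on made-up, already-cleaned toy documents, not the notebook's actual EDA code: `top_words`, `passive_docs`, and `active_docs` are hypothetical names, and `n = 3` stands in for the 30 used in the analysis.

```python
from collections import Counter


def top_words(docs, n):
    """Return the set of the n most frequent whitespace-separated tokens."""
    counts = Counter(word for doc in docs for word in doc.split())
    return {word for word, _ in counts.most_common(n)}


# Hypothetical toy documents, one list per class (assumed already cleaned).
passive_docs = ["boss said follow policy", "boss said follow rule exactly"]
active_docs = ["boss said plan worked", "neighbor boss said plan"]

n = 3  # the analysis described above used the top 30
passive_top = top_words(passive_docs, n)
active_top = top_words(active_docs, n)

shared = passive_top & active_top          # frequent in both classes
passive_only = passive_top - active_top    # frequent only in the passive class
active_only = active_top - passive_top     # frequent only in the active class

print(sorted(shared), sorted(passive_only), sorted(active_only))
```

Words that are frequent in both classes carry little signal on their own, so the class-specific sets are the more interesting output of a check like this.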


## Background on Subreddits Chosen
### r/MaliciousCompliance

**Fast Facts:**
* Created on Jan 4, 2016
* 1.6 million Members
* All posts must be a story that must contain some form of malicious compliance. Malicious is interpreted broadly, but posts where people do not comply with rules will be removed. Update posts must link to the previous post on this subreddit and are subject to moderator approval.
* Compliance must be intentional.


### r/pettyrevenge
**Fast Facts:**
* Created Nov 1, 2012
* 1.1 million Members
* Stories should be revenge-based. Karma =/= Revenge Someone has wronged you, but you got your revenge, oh yes, you got your revenge. Reporting someone to the police is not revenge, it is simply just reporting someone to the police.
* Stories should be petty. I messed with their toothpaste. I turned their disk upside-down in their XBox. I gasp put shaving cream in their shoes. The more creative the better


### r/ProRevenge
**Fast Facts:**
* Created Nov 14, 2012
* 1.1 million Members
* Your story should be about getting back at someone who wronged you in generally an interesting and/or funny way. In order for your story to be pro revenge, it should involve you going out of your way and going above and beyond to get revenge. If that isn't the case with your story, it may be better suited for another revenge subreddit
* All posts must contain a full revenge story (no requests for ideas or excessive minor updates)


## References
1. Data was scraped from Pushshift Reddit API https://github.com/pushshift/api
2. https://www.psychologytoday.com/us/blog/tech-support/201707/the-psychology-revenge-and-vengeful-people
3. David S. Chester, C. Nathan DeWall, The pleasure of revenge: retaliatory aggression arises from a neural imbalance toward reward, Social Cognitive and Affective Neuroscience, Volume 11, Issue 7, July 2016, Pages 1173–1182, https://doi.org/10.1093/scan/nsv082
4. https://en.wikipedia.org/wiki/Malicious_compliance

--- 
# Part 3 : Data Cleaning

--- 

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import requests
import datetime as dt
from datetime import datetime
import time
import sys
import string
import re
from bs4 import BeautifulSoup
from sklearn.feature_extraction import text
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split

In [58]:
# Read in CSV & create df

raw = pd.read_csv('./data/subreddit_data.csv')

In [3]:
raw.shape

(26680, 79)

In [4]:
raw.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,...,author_flair_background_color,author_flair_text_color,banned_by,edited,gilded,top_awarded_type,author_flair_template_id,distinguished,thumbnail_height,thumbnail_width
0,[],False,Erahth,,[],,text,t2_115qx5,False,False,...,,,,,,,,,,
1,[],False,MorrisonsLament,,[],,text,t2_22r3dlgw,False,False,...,,,,,,,,,,
2,[],False,infiniteknights,,[],,text,t2_2whl5ei,False,False,...,,,,,,,,,,
3,[],False,SimRayB,,[],,text,t2_9vd8p2kn,False,False,...,,,,,,,,,,
4,[],False,mathzak,,[],,text,t2_kfje6,False,False,...,,,,,,,,,,


In [5]:
# Dropping irrelevant columns
raw = raw.drop(columns = ['all_awardings', 'allow_live_comments',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_is_blocked',
       'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post',
       'contest_mode', 'full_link', 'gildings', 'id',
       'is_created_from_ads_ui', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable', 'link_flair_background_color',
       'link_flair_css_class', 'link_flair_richtext', 'link_flair_template_id',
       'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked',
       'media_only', 'no_follow', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'retrieved_on', 'send_replies', 'spoiler',
       'stickied', 'subreddit_subscribers',
       'subreddit_type', 'thumbnail', 'total_awards_received',
       'treatment_tags', 'upvote_ratio', 'url', 'whitelist_status', 'wls',
       'removed_by_category', 'crosspost_parent', 'crosspost_parent_list',
       'url_overridden_by_dest', 'author_cakeday',
       'author_flair_background_color', 'author_flair_text_color', 'banned_by',
       'post_hint', 'preview', 'edited', 'gilded', 'top_awarded_type',
       'author_flair_template_id', 'distinguished', 'thumbnail_height',
       'thumbnail_width','domain', 'is_video', 'subreddit_id'])

In [6]:
# Convert created_utc (epoch seconds) into a timestamp column
# Note: datetime.fromtimestamp returns the machine's local time, not UTC
raw['timestamp'] = raw["created_utc"].map(datetime.fromtimestamp)
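
Called without a `tz` argument, `datetime.fromtimestamp` returns a naive datetime in the local timezone of the machine running the notebook, so the `timestamp` values depend on where the code is run. A minimal sketch of a timezone-aware alternative, using the `created_utc` value of the first submission in the data as the example input:

```python
from datetime import datetime, timezone

created_utc = 1627798261  # epoch seconds for the first submission in the data

# Passing tz=timezone.utc makes the result independent of the local timezone.
utc_time = datetime.fromtimestamp(created_utc, tz=timezone.utc)
print(utc_time.isoformat())  # 2021-08-01T06:11:01+00:00
```

For a whole column, `pd.to_datetime(raw['created_utc'], unit='s', utc=True)` should give the equivalent result in one vectorized call.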

In [7]:
raw.head()

Unnamed: 0,author,created_utc,is_self,num_comments,score,selftext,subreddit,title,timestamp
0,Erahth,1627798261,True,12,1,"So, this just happened. My 3.5yo son is showin...",MaliciousCompliance,One more sip…,2021-08-01 00:11:01
1,MorrisonsLament,1627798365,True,39,1,Many years ago I was working for a fairly larg...,MaliciousCompliance,"""You can't fire me. But you can make me stop w...",2021-08-01 00:12:45
2,infiniteknights,1627798884,True,215,1,"I've been doing all my shopping online, from g...",MaliciousCompliance,"""Personal responsibility""? Ok!",2021-08-01 00:21:24
3,SimRayB,1627800166,True,19,1,"According to my Mother, this happened when I w...",MaliciousCompliance,You put all of those in your mouth or you can’...,2021-08-01 00:42:46
4,mathzak,1627807535,True,0,1,[removed],MaliciousCompliance,Socially ostracize all people that don’t take ...,2021-08-01 02:45:35


In [8]:
raw.isnull().sum()

author            0
created_utc       0
is_self           0
num_comments      0
score             0
selftext        885
subreddit         0
title             0
timestamp         0
dtype: int64

In [9]:
# Drop NaNs in selftext
raw = raw.dropna()

In [10]:
raw.shape

(25795, 9)

In [11]:
# Drop posts where text has been removed or deleted
raw = raw[raw['selftext']!='[removed]']
raw = raw[raw['selftext']!='[deleted]']
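
The missing-value drop and the two placeholder filters can also be expressed as a single boolean mask. The sketch below runs on a small made-up frame, where `posts` and its rows are hypothetical stand-ins for `raw`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the scraped submissions.
posts = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "selftext": ["A real story.", "[removed]", None, "[deleted]"],
})

placeholders = ["[removed]", "[deleted]"]

# Keep rows whose selftext is present and is not a removal placeholder.
keep = posts["selftext"].notna() & ~posts["selftext"].isin(placeholders)
kept = posts[keep].reset_index(drop=True)

print(kept)
```

Restricting the missing-value check to `selftext` also avoids dropping rows because of NaNs in unrelated columns.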

In [12]:
raw.shape

(11189, 9)

In [13]:
# Check for posts by AutoModerator (worth dropping if you see really high-count words like "Daily", "Discussion", etc.)
# The check below returns no rows, so the drop is left commented out
#raw = raw.drop(raw[raw['author'] == 'AutoModerator'].index)
raw[raw['author'].str.contains('AutoModerator', case = False)]

Unnamed: 0,author,created_utc,is_self,num_comments,score,selftext,subreddit,title,timestamp


In [14]:
# Creating a new column with all text combined and not including [removed] selftext rows

raw['all_text']=raw['title'] +' '+ raw['selftext']
raw.head()

Unnamed: 0,author,created_utc,is_self,num_comments,score,selftext,subreddit,title,timestamp,all_text
0,Erahth,1627798261,True,12,1,"So, this just happened. My 3.5yo son is showin...",MaliciousCompliance,One more sip…,2021-08-01 00:11:01,"One more sip… So, this just happened. My 3.5yo..."
1,MorrisonsLament,1627798365,True,39,1,Many years ago I was working for a fairly larg...,MaliciousCompliance,"""You can't fire me. But you can make me stop w...",2021-08-01 00:12:45,"""You can't fire me. But you can make me stop w..."
2,infiniteknights,1627798884,True,215,1,"I've been doing all my shopping online, from g...",MaliciousCompliance,"""Personal responsibility""? Ok!",2021-08-01 00:21:24,"""Personal responsibility""? Ok! I've been doing..."
3,SimRayB,1627800166,True,19,1,"According to my Mother, this happened when I w...",MaliciousCompliance,You put all of those in your mouth or you can’...,2021-08-01 00:42:46,You put all of those in your mouth or you can’...
6,CSPhCT,1627819267,True,28,1,Patient comes into my pharmacy. “I’m here for ...,MaliciousCompliance,Patient wants what he wants so I just need to ...,2021-08-01 06:01:07,Patient wants what he wants so I just need to ...


In [15]:
# Reset index
raw.reset_index(inplace = True, drop = True)
raw.head()

Unnamed: 0,author,created_utc,is_self,num_comments,score,selftext,subreddit,title,timestamp,all_text
0,Erahth,1627798261,True,12,1,"So, this just happened. My 3.5yo son is showin...",MaliciousCompliance,One more sip…,2021-08-01 00:11:01,"One more sip… So, this just happened. My 3.5yo..."
1,MorrisonsLament,1627798365,True,39,1,Many years ago I was working for a fairly larg...,MaliciousCompliance,"""You can't fire me. But you can make me stop w...",2021-08-01 00:12:45,"""You can't fire me. But you can make me stop w..."
2,infiniteknights,1627798884,True,215,1,"I've been doing all my shopping online, from g...",MaliciousCompliance,"""Personal responsibility""? Ok!",2021-08-01 00:21:24,"""Personal responsibility""? Ok! I've been doing..."
3,SimRayB,1627800166,True,19,1,"According to my Mother, this happened when I w...",MaliciousCompliance,You put all of those in your mouth or you can’...,2021-08-01 00:42:46,You put all of those in your mouth or you can’...
4,CSPhCT,1627819267,True,28,1,Patient comes into my pharmacy. “I’m here for ...,MaliciousCompliance,Patient wants what he wants so I just need to ...,2021-08-01 06:01:07,Patient wants what he wants so I just need to ...


In [16]:
# REFERENCE: In-class coding challenge

def clean_strings(sentences, stopwords=None):
    """Lowercase an iterable of strings and strip URLs, digits, punctuation and stopwords."""
    # Use None instead of a mutable default argument; a set gives fast lookups
    stopwords = {st.lower() for st in (stopwords or [])}

    output = []

    # lowercasing all
    sentences = [st.lower() for st in sentences]

    # remove URLs before punctuation otherwise we won't be able to find URLs
    sentences = [re.sub(r'^https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE) for text in sentences]

    # new lines and tabs (the result must be assigned back to take effect)
    sentences = [st.replace('\n', ' ').replace('\t', ' ') for st in sentences]

    # digits and punctuation
    for st in sentences:
        new_st = ''.join([char for char in st if char.isalpha() or char == ' '])

        # stopwords
        new_st = ' '.join([word for word in new_st.split() if word not in stopwords])

        output.append(new_st)

    return output
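
The URL pattern in `clean_strings` is anchored with `^`, so under `re.MULTILINE` it only removes URLs that start a line (along with the rest of that line). The sketch below contrasts it with an unanchored pattern on a made-up sample. The `anywhere` pattern is an assumption about what may be wanted, not part of the original function.

```python
import re

# Pattern used in clean_strings: matches only when a line begins with a URL.
anchored = re.compile(r'^https?:\/\/.*[\r\n]*', flags=re.MULTILINE)

# Alternative (assumption): match a URL anywhere, up to the next whitespace.
anywhere = re.compile(r'https?://\S+')

sample = "read this https://example.com/post first\nhttps://example.com/other\nthe end"

anchored_result = anchored.sub('', sample)
anywhere_result = anywhere.sub('', sample)

print(repr(anchored_result))  # the mid-line URL survives
print(repr(anywhere_result))  # both URLs are removed
```

Which behavior is preferable depends on whether text that follows a URL on the same line should be kept.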

In [17]:
# clean_strings expects an iterable of strings, so pass the whole column at once
# (passing a single string would make it iterate over individual characters)
clean_text = clean_strings(raw['all_text'])

In [18]:
raw.shape

(11189, 10)

In [19]:
# Checking is_self is all true
raw['is_self'].value_counts()

True    11189
Name: is_self, dtype: int64

In [20]:
# Can drop is_self and also created_utc now that we have a timestamp column
raw = raw.drop(columns = ['is_self', 'created_utc'])
raw.shape

(11189, 8)

In [35]:
# Remove more reddit markdown/html artifacts I found below (odd character combinations that often crop up when you vectorize)

#raw['all_text'] = [re.sub(r'http\S+', '', text) for text in raw['all_text']]
raw['all_text'] = raw['all_text'].str.replace('x200B', '', regex=False)
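
The replacement above removes the substring `x200B` only. Assuming the artifact is the HTML entity for a zero-width space, which may appear in scraped text as `&#x200B;` or in escaped form as `&amp;#x200B;`, the surrounding `&#` and `;` characters would be left behind. A minimal sketch on a made-up string that removes the whole entity instead (the `entity` pattern is an illustration, not the notebook's code):

```python
import re

# Hypothetical sample containing both forms of the entity.
sample = "Intro paragraph.\n\n&amp;#x200B;\n\nSecond paragraph. &#x200B; Done."

# Removing only the substring leaves the rest of the entity in place.
partial = sample.replace("x200B", "")

# Match the full entity, escaped or not, plus the literal zero-width space.
entity = re.compile(r"(?:&amp;|&)#x200B;|\u200b", flags=re.IGNORECASE)
full = entity.sub("", sample)

print(repr(partial))
print(repr(full))
```

If punctuation is stripped in a later step, the leftover characters disappear anyway, so this matters mainly when the text is vectorized as saved.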

In [36]:
malicious = raw[raw['subreddit']=='MaliciousCompliance']
petty = raw[raw['subreddit']=='pettyrevenge']
pro = raw[raw['subreddit']=='ProRevenge']

In [37]:
malicious.shape

(5303, 8)

In [38]:
petty.shape

(3360, 8)

In [39]:
pro.shape

(2526, 8)

In [55]:
# Final amount of submissions per subreddit
raw['subreddit'].value_counts()

MaliciousCompliance    5303
pettyrevenge           3360
ProRevenge             2526
Name: subreddit, dtype: int64
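
Since the problem statement treats r/MaliciousCompliance as passive revenge and the other two subreddits as active revenge, these counts determine the class balance of the binary target. The arithmetic can be sketched as below, where the `0`/`1` encoding in `label_map` is a hypothetical choice made for illustration.

```python
# Submission counts per subreddit after cleaning (from the output above).
counts = {"MaliciousCompliance": 5303, "pettyrevenge": 3360, "ProRevenge": 2526}

# Hypothetical binary encoding: 0 = passive revenge, 1 = active revenge.
label_map = {"MaliciousCompliance": 0, "pettyrevenge": 1, "ProRevenge": 1}

class_totals = {0: 0, 1: 0}
for subreddit, n in counts.items():
    class_totals[label_map[subreddit]] += n

total = sum(class_totals.values())
majority_share = max(class_totals.values()) / total

print(class_totals)              # {0: 5303, 1: 5886}
print(round(majority_share, 3))  # 0.526
```

Under this encoding the classes are close to balanced, so a model that always predicts the larger class would be right about 52.6% of the time.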

In [56]:
# Save cleaned file
raw.to_csv('./data/cleaned_subreddit_data.csv', index=False)

## Data Dictionary