<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP 

# Classifying Active vs. Passive Revenge in Similar Subreddits

--- 
# Part 2: Overview

--- 

## Problem Statement

Can a classifier tell active revenge from passive revenge, using posts from r/MaliciousCompliance on one side and r/pettyrevenge and r/ProRevenge on the other? r/MaliciousCompliance represents passive, unpremeditated revenge, while the other two subreddits feature more active, planned acts of vengeance.

Moderators for r/MaliciousCompliance can use this classification model to help filter submissions that might be better suited for another subreddit.

## Contents
- [Background on Subreddits Chosen](#Background-on-Subreddits-Chosen)
- [References](#References)
- [Data Cleaning](#Part-3-:-Data-Cleaning)
- [Summary](#Summary)

## Background

The psychology community has produced numerous studies of aggression and revenge. One paper I read found evidence that areas of the brain that control aggression were more activated following provocation, but noted that this effect cannot be fully separated from the possible effect of simply observing someone be punished for their wrongdoing, as opposed to taking part in an actual act of revenge. This distinction between active and passive revenge is interesting to explore and attempt to classify.

According to Wikipedia, malicious compliance is the passive-aggressive "behaviour of intentionally inflicting harm by strictly following the orders of a superior while knowing or intending that compliance with the orders will have an unintended or negative result." It differs from other acts of revenge in that it is unplanned but intentional: the person simply follows orders. In contrast, revenge is a planned act of retaliation carried out in order to get even.

Moderating subreddits can be a time-consuming process. This project aims to help the r/MaliciousCompliance community better filter submissions that might belong in subreddits with a more active-revenge focus.

## Background on Subreddits Chosen
### r/MaliciousCompliance

**Fast Facts:**
* Created Jan 4, 2016
* 1.6 million Members
* All posts must be a story that must contain some form of malicious compliance. Malicious is interpreted broadly, but posts where people do not comply with rules will be removed. Update posts must link to the previous post on this subreddit and are subject to moderator approval.
* Compliance must be intentional.


### r/pettyrevenge
**Fast Facts:**
* Created Nov 1, 2012
* 1.1 million Members
* Stories should be revenge-based. Karma =/= Revenge. Someone has wronged you, but you got your revenge, oh yes, you got your revenge. Reporting someone to the police is not revenge, it is simply just reporting someone to the police.
* Stories should be petty. I messed with their toothpaste. I turned their disk upside-down in their XBox. I gasp put shaving cream in their shoes. The more creative the better


### r/ProRevenge
**Fast Facts:**
* Created Nov 14, 2012
* 1.1 million Members
* Your story should be about getting back at someone who wronged you in generally an interesting and/or funny way. In order for your story to be pro revenge, it should involve you going out of your way and going above and beyond to get revenge. If that isn't the case with your story, it may be better suited for another revenge subreddit
* All posts must contain a full revenge story (no requests for ideas or excessive minor updates)


## References
1. Data was scraped from Pushshift Reddit API https://github.com/pushshift/api
2. https://www.psychologytoday.com/us/blog/tech-support/201707/the-psychology-revenge-and-vengeful-people
3. David S. Chester, C. Nathan DeWall, The pleasure of revenge: retaliatory aggression arises from a neural imbalance toward reward, Social Cognitive and Affective Neuroscience, Volume 11, Issue 7, July 2016, Pages 1173–1182, https://doi.org/10.1093/scan/nsv082
4. https://en.wikipedia.org/wiki/Malicious_compliance
5. Gwen's Repo was MASSIVELY helpful - https://github.com/gwenrathgeber/subreddit_text_classification

--- 
# Part 3 : Data Cleaning

--- 

In [31]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# import requests
import datetime as dt
from datetime import datetime
# import time
# import sys
# import string
import re
# from bs4 import BeautifulSoup
# from sklearn.feature_extraction import text
# from nltk.tokenize import word_tokenize, sent_tokenize
# from nltk.corpus import stopwords
# from nltk.stem import WordNetLemmatizer
# from nltk.stem.porter import PorterStemmer
# from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# from sklearn.model_selection import train_test_split

In [32]:
# Read in CSV & create df

raw = pd.read_csv('../data/subreddit_data.csv')


In [33]:
raw.shape

(26680, 79)

In [34]:
raw.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,...,author_flair_background_color,author_flair_text_color,banned_by,edited,gilded,top_awarded_type,author_flair_template_id,distinguished,thumbnail_height,thumbnail_width
0,[],False,Erahth,,[],,text,t2_115qx5,False,False,...,,,,,,,,,,
1,[],False,MorrisonsLament,,[],,text,t2_22r3dlgw,False,False,...,,,,,,,,,,
2,[],False,infiniteknights,,[],,text,t2_2whl5ei,False,False,...,,,,,,,,,,
3,[],False,SimRayB,,[],,text,t2_9vd8p2kn,False,False,...,,,,,,,,,,
4,[],False,mathzak,,[],,text,t2_kfje6,False,False,...,,,,,,,,,,


In [35]:
# Dropping irrelevant columns
raw = raw.drop(columns = ['all_awardings', 'allow_live_comments',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_is_blocked',
       'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post',
       'contest_mode', 'full_link', 'gildings', 'id',
       'is_created_from_ads_ui', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable', 'link_flair_background_color',
       'link_flair_css_class', 'link_flair_richtext', 'link_flair_template_id',
       'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked',
       'media_only', 'no_follow', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'retrieved_on', 'send_replies', 'spoiler',
       'stickied', 'subreddit_subscribers',
       'subreddit_type', 'thumbnail', 'total_awards_received',
       'treatment_tags', 'upvote_ratio', 'url', 'whitelist_status', 'wls',
       'removed_by_category', 'crosspost_parent', 'crosspost_parent_list',
       'url_overridden_by_dest', 'author_cakeday',
       'author_flair_background_color', 'author_flair_text_color', 'banned_by',
       'post_hint', 'preview', 'edited', 'gilded', 'top_awarded_type',
       'author_flair_template_id', 'distinguished', 'thumbnail_height',
       'thumbnail_width','domain', 'is_video', 'subreddit_id'])
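
An alternative to listing every column to drop is to name the handful to keep. The following is a minimal sketch, assuming the eight columns that survive the drop above (as they appear in the later `raw.head()` output). The `keep_cols` name and the toy frame are illustrative only and are not part of the original notebook.

```python
import pandas as pd

# Columns that remain after the drop above (taken from the later raw.head() output)
keep_cols = ['author', 'created_utc', 'is_self', 'num_comments',
             'score', 'selftext', 'subreddit', 'title']

# Toy stand-in for the raw export, with two extra columns to discard
toy = pd.DataFrame({
    'author': ['a'], 'created_utc': [1627798261], 'is_self': [True],
    'num_comments': [12], 'score': [1], 'selftext': ['text'],
    'subreddit': ['MaliciousCompliance'], 'title': ['t'],
    'thumbnail': ['self'], 'wls': [6],
})

# Selecting by name keeps only the listed columns, in the listed order
slim = toy[keep_cols]
print(slim.shape)
```

The same list could also be passed to `pd.read_csv` through its `usecols` parameter, so the unwanted columns are never loaded.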

In [36]:
# Convert the created_utc epoch seconds to a timestamp column
# (datetime.fromtimestamp returns times in the machine's local time zone, not UTC)
raw['timestamp'] = raw["created_utc"].map(datetime.fromtimestamp)
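
`datetime.fromtimestamp` interprets the epoch value in the local time zone of the machine running the notebook, so the resulting column depends on where the code runs. Here is a sketch of a time-zone-independent alternative using `pd.to_datetime` with `unit='s'`. The `timestamp_utc` column name is hypothetical.

```python
import pandas as pd

# First created_utc value from the data shown in this notebook
toy = pd.DataFrame({'created_utc': [1627798261]})

# unit='s' treats the integers as seconds since the epoch;
# utc=True makes the result time-zone-aware UTC
toy['timestamp_utc'] = pd.to_datetime(toy['created_utc'], unit='s', utc=True)
print(toy['timestamp_utc'].iloc[0])
```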

In [37]:
raw.head()

Unnamed: 0,author,created_utc,is_self,num_comments,score,selftext,subreddit,title,timestamp
0,Erahth,1627798261,True,12,1,"So, this just happened. My 3.5yo son is showin...",MaliciousCompliance,One more sip…,2021-08-01 00:11:01
1,MorrisonsLament,1627798365,True,39,1,Many years ago I was working for a fairly larg...,MaliciousCompliance,"""You can't fire me. But you can make me stop w...",2021-08-01 00:12:45
2,infiniteknights,1627798884,True,215,1,"I've been doing all my shopping online, from g...",MaliciousCompliance,"""Personal responsibility""? Ok!",2021-08-01 00:21:24
3,SimRayB,1627800166,True,19,1,"According to my Mother, this happened when I w...",MaliciousCompliance,You put all of those in your mouth or you can’...,2021-08-01 00:42:46
4,mathzak,1627807535,True,0,1,[removed],MaliciousCompliance,Socially ostracize all people that don’t take ...,2021-08-01 02:45:35


In [38]:
raw.isnull().sum()

author            0
created_utc       0
is_self           0
num_comments      0
score             0
selftext        885
subreddit         0
title             0
timestamp         0
dtype: int64

In [39]:
# Drop rows with NaN in selftext (the only column with missing values)
raw = raw.dropna(subset=['selftext'])

In [40]:
raw.shape

(25795, 9)

In [41]:
raw[raw['selftext']=='[removed]']

Unnamed: 0,author,created_utc,is_self,num_comments,score,selftext,subreddit,title,timestamp
4,mathzak,1627807535,True,0,1,[removed],MaliciousCompliance,Socially ostracize all people that don’t take ...,2021-08-01 02:45:35
5,Justanerd582,1627813401,True,0,1,[removed],MaliciousCompliance,"Boss tells me to change departments? Ok, I'll ...",2021-08-01 04:23:21
8,BigBossSnake,1627828322,True,0,1,[removed],MaliciousCompliance,Company's 4th Prime Directive policy wouldn't ...,2021-08-01 08:32:02
9,zqw004,1627832051,True,0,1,[removed],MaliciousCompliance,Make Money Online Free Sign up,2021-08-01 09:34:11
10,littlesharkieshark,1627834593,True,0,1,[removed],MaliciousCompliance,Don't want to use his stuff? Ok!,2021-08-01 10:16:33
...,...,...,...,...,...,...,...,...,...
26672,zJochen1,1580497124,True,2,1,[removed],ProRevenge,Pray for death and it will come.,2020-01-31 11:58:44
26675,FearXHusky1,1580520073,True,2,1,[removed],ProRevenge,How I scammed a Fallout 76 scammer,2020-01-31 18:21:13
26677,bean-anator,1580531362,True,1,1,[removed],ProRevenge,This pedo sent my underage friend nudes. Here’...,2020-01-31 21:29:22
26678,Kurai-Okami1906,1580534369,True,2,1,[removed],ProRevenge,Cheater Gets A Score of Nine In Simple Geometry,2020-01-31 22:19:29


In [42]:
# Drop posts where text has been removed or deleted
raw = raw[raw['selftext']!='[removed]']
raw = raw[raw['selftext']!='[deleted]']
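
The missing-value and placeholder filters can also be expressed as a single boolean mask. This is a small sketch on toy data. The `placeholders` list holds the two strings filtered above, and the frame itself is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd'],
    'selftext': ['a real story', '[removed]', '[deleted]', None],
})

placeholders = ['[removed]', '[deleted]']

# Keep rows whose selftext is present and is not one of the placeholder strings
mask = toy['selftext'].notna() & ~toy['selftext'].isin(placeholders)
kept = toy[mask].reset_index(drop=True)
print(kept)
```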

In [43]:
raw.shape

(11189, 9)

In [44]:
# Checking if there are posts by AutoModerator (if you see really high-count words like "Daily" "Discussion" etc.)
raw[raw['author'].str.contains('AutoModerator', case = False)]

Unnamed: 0,author,created_utc,is_self,num_comments,score,selftext,subreddit,title,timestamp


In [45]:
# Creating a new column with all text combined and not including [removed] selftext rows

raw['all_text']=raw['title'] +' '+ raw['selftext']
raw.head()

Unnamed: 0,author,created_utc,is_self,num_comments,score,selftext,subreddit,title,timestamp,all_text
0,Erahth,1627798261,True,12,1,"So, this just happened. My 3.5yo son is showin...",MaliciousCompliance,One more sip…,2021-08-01 00:11:01,"One more sip… So, this just happened. My 3.5yo..."
1,MorrisonsLament,1627798365,True,39,1,Many years ago I was working for a fairly larg...,MaliciousCompliance,"""You can't fire me. But you can make me stop w...",2021-08-01 00:12:45,"""You can't fire me. But you can make me stop w..."
2,infiniteknights,1627798884,True,215,1,"I've been doing all my shopping online, from g...",MaliciousCompliance,"""Personal responsibility""? Ok!",2021-08-01 00:21:24,"""Personal responsibility""? Ok! I've been doing..."
3,SimRayB,1627800166,True,19,1,"According to my Mother, this happened when I w...",MaliciousCompliance,You put all of those in your mouth or you can’...,2021-08-01 00:42:46,You put all of those in your mouth or you can’...
6,CSPhCT,1627819267,True,28,1,Patient comes into my pharmacy. “I’m here for ...,MaliciousCompliance,Patient wants what he wants so I just need to ...,2021-08-01 06:01:07,Patient wants what he wants so I just need to ...


In [46]:
# Reset index
raw.reset_index(inplace = True, drop = True)
raw.head()

Unnamed: 0,author,created_utc,is_self,num_comments,score,selftext,subreddit,title,timestamp,all_text
0,Erahth,1627798261,True,12,1,"So, this just happened. My 3.5yo son is showin...",MaliciousCompliance,One more sip…,2021-08-01 00:11:01,"One more sip… So, this just happened. My 3.5yo..."
1,MorrisonsLament,1627798365,True,39,1,Many years ago I was working for a fairly larg...,MaliciousCompliance,"""You can't fire me. But you can make me stop w...",2021-08-01 00:12:45,"""You can't fire me. But you can make me stop w..."
2,infiniteknights,1627798884,True,215,1,"I've been doing all my shopping online, from g...",MaliciousCompliance,"""Personal responsibility""? Ok!",2021-08-01 00:21:24,"""Personal responsibility""? Ok! I've been doing..."
3,SimRayB,1627800166,True,19,1,"According to my Mother, this happened when I w...",MaliciousCompliance,You put all of those in your mouth or you can’...,2021-08-01 00:42:46,You put all of those in your mouth or you can’...
4,CSPhCT,1627819267,True,28,1,Patient comes into my pharmacy. “I’m here for ...,MaliciousCompliance,Patient wants what he wants so I just need to ...,2021-08-01 06:01:07,Patient wants what he wants so I just need to ...


In [47]:
# REFERENCES: In-class coding challenge, https://stackoverflow.com/questions/1276764/stripping-everything-but-alphanumeric-chars-from-a-string-in-python
   
# lowercasing all text
raw['all_text'] = [st.lower() for st in raw['all_text']]
    
# remove URLs first; once punctuation is stripped they can no longer be recognized
raw['all_text'] = [re.sub(r'https?://\S+', ' ', text) for text in raw['all_text']]
    
# remove new lines and tabs
raw['all_text'] = [st.replace('\n', ' ').replace('\t', ' ') for st in raw['all_text']]
    
# replace runs of punctuation and other non-word characters with a space (\W keeps letters, digits, and underscores)
raw['all_text'] = [re.sub(r'\W+', ' ', text,  flags=re.MULTILINE) for text in raw['all_text']]
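
The cleaning steps in this cell can be collected into one function and checked on a single string before applying them to the whole column. This is a minimal sketch, not the notebook's own code. `clean_text` and the sample string are hypothetical, and the URL pattern `https?://\S+` is an assumption about what counts as a URL (anything up to the next whitespace).

```python
import re

def clean_text(text: str) -> str:
    """Lowercase, strip URLs, and collapse non-word runs to single spaces."""
    text = text.lower()
    # Remove URLs anywhere in the text, not only at the start of a line
    text = re.sub(r'https?://\S+', ' ', text)
    # \W+ matches runs of anything other than letters, digits, and underscores,
    # which covers punctuation, newlines, and tabs
    text = re.sub(r'\W+', ' ', text)
    return text.strip()

sample = "One more sip…\nSee https://imgur.com/abc123 for proof!\t3 times."
print(clean_text(sample))
```

Applied to the DataFrame, this would be `raw['all_text'].map(clean_text)`. Note that digits survive, since `\W` treats them as word characters.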

In [48]:
# Quickly perusing text to see if I missed anything - COMMENTING OUT for space
#[i for i in raw['all_text']]

In [49]:
# Remove more reddit markdown/html artifacts I found above

# Match whole tokens only, so words such as "compliance" keep their "com";
# the text is already lowercase, hence x200b rather than x200B
artifacts = r'\b(?:https?|x200b|imgur|com|www)\b'
raw['all_text'] = raw['all_text'].str.replace(artifacts, ' ', regex=True)
raw['all_text'] = raw['all_text'].str.replace(r'\s+', ' ', regex=True).str.strip()
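
Short artifact tokens such as `com` also occur inside ordinary words, so it matters whether they are removed as substrings or as whole tokens. The sketch below compares the two on made-up strings. The `tokens` pattern is illustrative.

```python
import pandas as pd

text = pd.Series(['malicious compliance at www imgur com', 'the company complied'])

# Plain substring replacement also removes "com" inside longer words
substring = text.str.replace('com', '', regex=False)

# \b anchors restrict the match to whole tokens
tokens = r'\b(?:https?|x200b|imgur|com|www)\b'
whole = (text.str.replace(tokens, ' ', regex=True)
             .str.replace(r'\s+', ' ', regex=True)
             .str.strip())

print(substring.tolist())
print(whole.tolist())
```

With substring replacement, "the company complied" becomes "the pany plied". The token-based version leaves that sentence untouched.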

In [50]:
raw.shape

(11189, 10)

In [51]:
# Checking is_self is all true
raw['is_self'].value_counts()

True    11189
Name: is_self, dtype: int64

In [52]:
# title and selftext are now combined in all_text, so drop them along with is_self and created_utc
raw = raw.drop(columns = ['selftext', 'title', 'is_self', 'created_utc'])
raw.shape

(11189, 6)

In [53]:
malicious = raw[raw['subreddit']=='MaliciousCompliance']
petty = raw[raw['subreddit']=='pettyrevenge']
pro = raw[raw['subreddit']=='ProRevenge']

In [57]:
# Final amount of submissions per subreddit
raw['subreddit'].value_counts()

MaliciousCompliance    5303
pettyrevenge           3360
ProRevenge             2526
Name: subreddit, dtype: int64

In [58]:
# Save cleaned file
raw.to_csv('../data/cleaned_subreddit_data.csv', index=False)

## Summary

A big part of getting Reddit submission data ready for modeling is cleaning out artifacts and remnants of HTML, punctuation, characters like emojis, etc. Removed or deleted submissions were dropped. The particular subreddits I chose did not appear to have any moderator bot messages that needed to be removed. After cleaning, there were 11,189 submissions in total: 5,303 from r/MaliciousCompliance, 3,360 from r/pettyrevenge, and 2,526 from r/ProRevenge.
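
Given these counts, grouping the two active-revenge subreddits against r/MaliciousCompliance, as the problem statement describes, gives a rough sense of class balance and of the accuracy a model must beat. The sketch below works through that arithmetic. The binary grouping and variable names are assumptions for illustration, not the notebook's modeling code.

```python
import pandas as pd

# Final submission counts reported in the summary
counts = pd.Series({'MaliciousCompliance': 5303, 'pettyrevenge': 3360, 'ProRevenge': 2526})

# Hypothetical binary grouping: passive (r/MaliciousCompliance) vs. active (the other two)
passive = counts['MaliciousCompliance']
active = counts['pettyrevenge'] + counts['ProRevenge']
total = passive + active

# Share of the larger class = accuracy of always predicting that class
baseline = max(passive, active) / total
print(f'passive={passive}, active={active}, baseline accuracy={baseline:.3f}')
```

Under this grouping the classes are close to balanced (5,303 passive vs. 5,886 active), so always guessing the majority class would be right roughly 52.6% of the time.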