<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP

--- 
# Part 2: Overview

--- 

## Problem Statement

Can we distinguish active from passive revenge in posts from the subreddits r/MaliciousCompliance, r/pettyrevenge, and r/ProRevenge?

## Contents
- [Background on Subreddits Chosen](#Background-on-Subreddits-Chosen)
- [References](#References)
- [Data Cleaning](#Part-3-:-Data-Cleaning)

## Overview
We will vectorize subreddit submission text scraped from Reddit with the Pushshift API and use it to train a classification model.

Cleaning is done in Part 3. Posts whose selftext was missing, `[removed]`, or `[deleted]` were dropped, as were posts by AutoModerator. Each post's title and selftext were combined into a single `all_text` column, and hyperlinks were stripped from it. Of the 17,104 submissions scraped, 7,047 remained after cleaning: 3,595 from r/MaliciousCompliance, 1,911 from r/pettyrevenge, and 1,541 from r/ProRevenge. Stop-word removal and lemmatization are set up at the end of Part 3.
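The plan stated in this overview (vectorize the submission text, then fit a classifier) can be sketched roughly as follows. This is only an illustration under assumptions: the notebook does not say which vectorizer or model is used, so `TfidfVectorizer` and `LogisticRegression` are stand-ins, and the six example posts are made-up toy strings, not scraped data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for post text and subreddit labels (invented for illustration)
texts = [
    "my boss said follow the policy exactly so i did",
    "manager demanded i clock out on time every day so i complied",
    "neighbor parked in my spot so i boxed his car in",
    "roommate ate my lunch so i hid the remote",
    "landlord cheated tenants so i reported him and he lost everything",
    "ex coworker stole my work so i got him fired and sued",
]
labels = [
    "MaliciousCompliance", "MaliciousCompliance",
    "pettyrevenge", "pettyrevenge",
    "ProRevenge", "ProRevenge",
]

# Vectorizer and classifier chained so both are fit on the same training text
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

pred = model.predict(["the policy said to do it exactly this way"])
print(pred[0])
```

Keeping the vectorizer inside the pipeline means it is only ever fit on training text, which matters once the data is split into train and test sets.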


## Background on Subreddits Chosen
### r/MaliciousCompliance


### r/pettyrevenge


### r/ProRevenge


## References

--- 
# Part 3 : Data Cleaning

--- 

In [106]:
import pandas as pd
import numpy as np
import requests
import datetime as dt
from datetime import datetime
import time
import sys
import string
import re
from bs4 import BeautifulSoup
from sklearn.feature_extraction import text
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

In [130]:
# Read in the scraped subreddit CSV and create the DataFrame

raw = pd.read_csv('./data/subreddit_data.csv')



In [109]:
raw.shape

(17104, 79)

In [110]:
raw.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,...,banned_by,post_hint,preview,edited,gilded,top_awarded_type,author_flair_template_id,distinguished,thumbnail_height,thumbnail_width
0,[],False,Able_Engine_9515,,[],,text,t2_a32pwgrv,False,False,...,,,,,,,,,,
1,[],False,PurveyorOfSapristi,,[],,text,t2_vwbcw1c,False,False,...,,,,,,,,,,
2,[],False,No-Cartoonist-2079,,[],,text,t2_5xuf5a6i,False,False,...,,,,,,,,,,
3,[],False,BreWanKenobi,,[],,text,t2_8lpt0vr8,False,False,...,,,,,,,,,,
4,[],False,Electronic_Nebula999,,[],,text,t2_cwtom2ev,False,False,...,,,,,,,,,,


In [131]:
# Dropping irrelevant columns
raw = raw.drop(columns = ['all_awardings', 'allow_live_comments',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_is_blocked',
       'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post',
       'contest_mode', 'full_link', 'gildings', 'id',
       'is_created_from_ads_ui', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable', 'link_flair_background_color',
       'link_flair_css_class', 'link_flair_richtext', 'link_flair_template_id',
       'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked',
       'media_only', 'no_follow', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'retrieved_on', 'send_replies', 'spoiler',
       'stickied', 'subreddit_subscribers',
       'subreddit_type', 'thumbnail', 'total_awards_received',
       'treatment_tags', 'upvote_ratio', 'url', 'whitelist_status', 'wls',
       'removed_by_category', 'crosspost_parent', 'crosspost_parent_list',
       'url_overridden_by_dest', 'author_cakeday',
       'author_flair_background_color', 'author_flair_text_color', 'banned_by',
       'post_hint', 'preview', 'edited', 'gilded', 'top_awarded_type',
       'author_flair_template_id', 'distinguished', 'thumbnail_height',
       'thumbnail_width','domain', 'is_video', 'subreddit_id'])
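An alternative to listing every column to drop is to select the handful of columns to keep. The sketch below assumes the eight columns that survive the drop above (as shown in the next `raw.head()` output) are the only ones needed; `toy` and `keep_cols` are illustrative names, and the toy frame stands in for the scraped data.

```python
import pandas as pd

# Toy frame standing in for the scraped data; the last two columns are examples to discard
toy = pd.DataFrame({
    'author': ['alice', 'bob'],
    'created_utc': [1627673154, 1627674014],
    'is_self': [True, True],
    'num_comments': [24, 8],
    'score': [1, 1],
    'selftext': ['first story', 'second story'],
    'subreddit': ['MaliciousCompliance', 'pettyrevenge'],
    'title': ['First title', 'Second title'],
    'all_awardings': ['[]', '[]'],
    'thumbnail': [None, None],
})

keep_cols = ['author', 'created_utc', 'is_self', 'num_comments',
             'score', 'selftext', 'subreddit', 'title']

# Selecting by name raises a KeyError if an expected column is missing
trimmed = toy[keep_cols].copy()
print(trimmed.columns.tolist())
```

Selecting columns this way also keeps working if the API later returns extra fields that are not in the drop list.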

In [132]:
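# Create a timestamp column from created_utc (epoch seconds, UTC).
# Note: datetime.fromtimestamp returns local time, so the values below are in the machine's time zone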
raw['timestamp'] = raw["created_utc"].map(datetime.fromtimestamp)
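Because `datetime.fromtimestamp` converts to the local time zone by default, the timestamps depend on the machine that runs the notebook. If UTC timestamps are wanted instead, a minimal sketch is to pass an explicit time zone; `to_utc` is a hypothetical helper name, and the epoch value is the first `created_utc` shown in the output below.

```python
from datetime import datetime, timezone

def to_utc(epoch_seconds):
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

# created_utc of the first post in the data
first_post = to_utc(1627673154)
print(first_post.isoformat())
```

The helper can be mapped over the column in the same way as above, for example `raw['created_utc'].map(to_utc)`.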

In [113]:
raw.head()

Unnamed: 0,author,created_utc,is_self,num_comments,score,selftext,subreddit,title,timestamp
0,Able_Engine_9515,1627673154,True,24,1,Like many stories I've read here I'm not entir...,MaliciousCompliance,Jeans are against the dress code now?,2021-07-30 13:25:54
1,PurveyorOfSapristi,1627674014,True,8,1,When I found myself jobless I took a chance on...,MaliciousCompliance,Ex-Boss asked me for a shoutout due to some of...,2021-07-30 13:40:14
2,No-Cartoonist-2079,1627674459,True,1,1,After about 100 comments from other subs sayin...,MaliciousCompliance,Make you a sandwich? Ok,2021-07-30 13:47:39
3,BreWanKenobi,1627674792,True,657,1,While I was putting myself through university ...,MaliciousCompliance,Don’t want this saleswoman? Let me find anothe...,2021-07-30 13:53:12
4,Electronic_Nebula999,1627676223,True,0,1,[removed],MaliciousCompliance,You don’t want my help with your workload? Fin...,2021-07-30 14:17:03


In [114]:
raw.isnull().sum()

author            0
created_utc       0
is_self           0
num_comments      0
score             0
selftext        479
subreddit         0
title             0
timestamp         0
dtype: int64

In [133]:
# Drop rows with NaN in selftext (the only column with missing values)
raw = raw.dropna(subset=['selftext'])

In [134]:
raw.shape

(16625, 9)

In [137]:
# Drop posts where text has been removed or deleted
raw = raw[raw['selftext']!='[removed]']
raw = raw[raw['selftext']!='[deleted]']

In [141]:
raw.shape

(7047, 10)

In [140]:
# Drop posts by AutoModerator (these show up as very high-count words like "Daily", "Discussion", etc.)
raw = raw.drop(raw[raw['author'] == 'AutoModerator'].index)
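The three row filters above (missing selftext, removed or deleted text, AutoModerator posts) can also be expressed as one boolean mask. This is a sketch on a made-up five-row frame, not the scraped data; `posts`, `keep`, and `filtered` are illustrative names.

```python
import pandas as pd

# Toy data: only the first row should survive all three filters
posts = pd.DataFrame({
    'author': ['alice', 'AutoModerator', 'bob', 'carol', 'dave'],
    'selftext': ['a real story', 'Daily discussion thread', '[removed]', None, '[deleted]'],
})

keep = (
    posts['selftext'].notna()                                # has text at all
    & ~posts['selftext'].isin(['[removed]', '[deleted]'])    # text not removed or deleted
    & (posts['author'] != 'AutoModerator')                   # not a moderator bot post
)
filtered = posts[keep].reset_index(drop=True)
print(filtered)
```

A single mask makes it easy to count how many rows each condition removes before committing to the filter.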

In [142]:
# Create a new column combining title and selftext ([removed] selftext rows were dropped above)

raw['all_text'] = raw['title'] + ' ' + raw['selftext']
raw.tail()

Unnamed: 0,author,created_utc,is_self,num_comments,score,selftext,subreddit,title,timestamp,all_text
17091,Prestige_Memes,1596315772,True,14,1,A bit of background; I was physically assaulte...,ProRevenge,I ruined my bully's life.,2020-08-01 15:02:52,I ruined my bully's life. A bit of background;...
17094,thisismew2king,1596330893,True,8,1,"So when I was around 20 (M), I was working in ...",ProRevenge,Fucking the girl who hates me the most as revenge,2020-08-01 19:14:53,Fucking the girl who hates me the most as reve...
17096,Roonwogsamduff,1596340005,True,47,1,I’ve been thinking about posting this for a wh...,ProRevenge,Landlord doesn't play nice and pays for it,2020-08-01 21:46:45,Landlord doesn't play nice and pays for it I’v...
17100,EpicWinterWolf,1596360257,True,29,1,"So, this happened about over a month back, and...",ProRevenge,Don't mess with my cat. It won't end well for ...,2020-08-02 03:24:17,Don't mess with my cat. It won't end well for ...
17103,Torsod,1596387491,True,14,1,"So about 2 months ago, I had a falling out wit...",ProRevenge,Scam dozens of people out from a MINECRAFT SER...,2020-08-02 10:58:11,Scam dozens of people out from a MINECRAFT SER...


In [143]:
# REFERENCE: In-class coding challenge

def clean_strings(sentences, stop_words=None):
    """Lowercase each string, strip URLs, digits, and punctuation, and drop stop words."""
    # Avoid a mutable default argument; lowercase the stop words for matching
    stop_words = {w.lower() for w in (stop_words or [])}

    output = []
    for st in sentences:
        # lowercasing all
        st = st.lower()

        # remove URLs before punctuation otherwise we won't be able to find URLs
        st = re.sub(r'https?://\S+', '', st)

        # new lines and tabs become spaces so words on either side stay separate
        st = st.replace('\n', ' ').replace('\t', ' ')

        # digits and punctuation
        new_st = ''.join([char for char in st if char.isalpha() or char == ' '])

        # stopwords (split() also collapses repeated whitespace)
        new_st = ' '.join([word for word in new_st.split() if word not in stop_words])

        output.append(new_st)

    return output
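One detail worth checking in the URL-removal step: a pattern anchored with `^` only removes URLs that start a line, while an unanchored `https?://\S+` removes them anywhere in the text. The sketch below compares the two on a made-up sample string; the `example.com` URLs are placeholders.

```python
import re

sample = (
    "check this out https://example.com/page it was great\n"
    "https://example.com/other\n"
    "the end"
)

# Anchored: matches only when the URL begins a line (with re.MULTILINE)
anchored = re.sub(r'^https?:\/\/.*[\r\n]*', '', sample, flags=re.MULTILINE)

# Unanchored: matches a URL wherever it appears
unanchored = re.sub(r'https?://\S+', '', sample)

print(repr(anchored))
print(repr(unanchored))
```

In Reddit posts most links appear mid-sentence, so the anchored form would leave many of them in place to be mangled by the punctuation step.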

In [144]:
# clean_strings expects a collection of strings, so pass the whole column at once
# (passing one post at a time would iterate over its individual characters)
clean_text = clean_strings(raw['all_text'])


In [145]:
# Remove any hyperlinks from the combined text (URL fragments often crop up when you vectorize)

raw['all_text'] = raw['all_text'].str.replace(r'http\S+', '', regex=True)

In [146]:
malicious = raw[raw['subreddit']=='MaliciousCompliance']
petty = raw[raw['subreddit']=='pettyrevenge']
pro = raw[raw['subreddit']=='ProRevenge']

In [147]:
malicious.shape

(3595, 10)

In [148]:
petty.shape

(1911, 10)

In [149]:
pro.shape

(1541, 10)
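The three shapes above show that the classes are imbalanced, which sets the baseline any classifier has to beat. A quick calculation from the reported counts (no new data, just the `.shape` outputs above):

```python
# Post counts per subreddit after cleaning, copied from the .shape outputs above
counts = {'MaliciousCompliance': 3595, 'pettyrevenge': 1911, 'ProRevenge': 1541}

total = sum(counts.values())
shares = {name: round(n / total, 3) for name, n in counts.items()}

# Always predicting the largest class gives the baseline accuracy
baseline_class = max(counts, key=counts.get)
print(total, shares, baseline_class)
```

Always predicting r/MaliciousCompliance would be right about 51% of the time, so a three-class model needs to clear that figure to be useful.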

In [None]:
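# Remove more stop words
# (requires the NLTK 'stopwords' and 'wordnet' data to be downloaded)

my_stops = stopwords.words('english')
my_stops.extend(['none','\n', 'www', 'reddit', 'com', 'comment', 'http'])
my_stops = set(my_stops)

lemmatizer = WordNetLemmatizer()

# The original cell referenced an undefined `words` list; wrapping the steps in a
# function (the name is a placeholder) lets them run on any cleaned string
def remove_stops_and_lemmatize(cleaned):
    # Split the string into words and drop stop words
    meaningful_words = [w for w in cleaned.split() if w not in my_stops]

    # Lemmatize the remaining words
    tokens_lem = [lemmatizer.lemmatize(w) for w in meaningful_words]

    # Join the words back into one string
    return ' '.join(tokens_lem)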
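The notebook imports `text` from `sklearn.feature_extraction` but does not use it. One possible use, assumed here rather than taken from the notebook, is to extend scikit-learn's built-in English stop-word list with the same extra words and hand the result to a vectorizer. The two documents are made-up examples.

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Same extra words as above, minus the newline entry (the tokenizer never emits it)
extra_stops = ['none', 'www', 'reddit', 'com', 'comment', 'http']

# ENGLISH_STOP_WORDS is a frozenset; CountVectorizer expects a list
custom_stops = list(text.ENGLISH_STOP_WORDS.union(extra_stops))

docs = [
    'posted this on reddit before',
    'my manager said none of this mattered',
]
cvec = CountVectorizer(stop_words=custom_stops)
counts = cvec.fit_transform(docs)

# vocabulary_ maps each kept token to its column index
vocab = set(cvec.vocabulary_)
print(sorted(vocab))
```

Letting the vectorizer drop stop words keeps that choice inside the modeling step, where it can be tuned alongside other vectorizer settings.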

In [None]:
def tokenize(column):
    # Tokenize each row of the given column (the source is raw, not the undefined name `word`)
    raw[column] = [word_tokenize(str(text)) for text in raw[column]]
    return raw[column]

def lemmatize(tokenized_column):
    lemmatizer = WordNetLemmatizer()
    raw[tokenized_column] = [' '.join([lemmatizer.lemmatize(word) for word in text]) for text in raw[tokenized_column]]
    return raw[tokenized_column]


In [1]:
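# Most Common Words by Subreddit
import matplotlib.pyplot as plt
import seaborn as sns

# ASSUMPTION: the most_common_* tables were never defined in this notebook. Here they are
# built as word counts over the cleaned text; most_common is a placeholder helper.
def most_common(df, n=20):
    words = ' '.join(clean_strings(df['all_text'], my_stops)).split()
    return pd.Series(words).value_counts().head(n)

most_common_malicious = most_common(malicious)
most_common_petty = most_common(petty)
most_common_pro = most_common(pro)

tables = [most_common_malicious, most_common_petty, most_common_pro]
titles = ['r/MaliciousCompliance', 'r/pettyrevenge', 'r/ProRevenge']

fig, axs = plt.subplots(1, 3, sharex=True, figsize=(16, 8))
fig.suptitle('Top 20 Most Common Words by Subreddit', fontsize=20, y=1.07)
for i, ax in enumerate(axs):
    # seaborn expects x and y as keyword arguments
    sns.barplot(x=tables[i].values, y=tables[i].index.str.title(), orient='h', ax=ax, ec='k', linewidth=1)
    ax.tick_params(axis='y', labelsize=14)
    ax.set_title(titles[i], fontsize=16)
    ax.set_xlabel('Word Frequency', fontsize=18)
    ax.set_ylabel('Word', fontsize=18)
fig.tight_layout()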

In [None]:
# Save cleaned file
# Same ./data folder the raw CSV was read from
raw.to_csv('./data/cleaned_subreddit_data.csv', index=False)