# Executive Summary
How many times have you opened a random subreddit only to find that it wasn't the one you were looking for? We've all been there. And have you ever wondered, "golly, just how similar are subreddits that focus on one concept but from entirely different points of view?" Well, we hear you. We scraped data from two active subreddits that both center on sexuality and used it to build a model that can tell which subreddit a post came from with over 80% accuracy. With further exploratory data analysis, we hope to one day be able to describe the defining features of the subculture each subreddit represents.

# Imports

In [1]:
import requests
import json
import time
import pandas as pd
import numpy as np
from nltk import RegexpTokenizer
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import regex as re

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC



The function below scrapes a subreddit's JSON listing and returns it as a pandas DataFrame.
It is then used on the actuallesbians and Braincels subreddits.

In [2]:
def scrape_reddit(the_subreddit, pages = 40):
    all_posts = []
    first_url = 'http://www.reddit.com/r/' + the_subreddit + '.json'
    url = first_url
    list_of_df = []
    
    # Sanity check: make sure the first request succeeds before scraping
    quick_check = requests.get(first_url, headers = {'User-agent':'Electronic Goddess'})
    if quick_check.status_code == 200:
        print("Get request successful.")
        time.sleep(3)
        print("Initiating Scrape...")
    else:
        print("Get request not 200, instead received: " + str(quick_check))
        return
    
    # Scraping: follow the 'after' token from one page to the next
    for page_num in range(pages):
        try:
            res = requests.get(url, headers = {'User-agent':'Electronic Goddess'})
            data = res.json()
            list_of_posts = data['data']['children']
            all_posts = all_posts + list_of_posts
            after = data['data']['after']
            url = first_url +'?after=' + after
            print('Round: '+ str(page_num + 1))
            time.sleep(3)
        except Exception:
            # e.g. 'after' is None on the last page, or the response was not valid JSON
            print('Limit likely hit.  Returning available posts.')
            break
#        return all_posts # This can be un-commented out for a straight forward raw scrape

    # Formats the parts we care about into a list of dictionaries that'll become the dataframe
    for i in range(len(all_posts)):
        index_dictionary = {
                'title' : all_posts[i]['data']['title'],
                'selftext': all_posts[i]['data']['selftext'],
                'subreddit' : all_posts[i]['data']['subreddit']
            }
        list_of_df.append(index_dictionary)
    return pd.DataFrame(list_of_df, columns = ['title','selftext','subreddit'])
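
The pagination logic in `scrape_reddit` can be looked at separately from the network calls. The sketch below is an illustration, not the notebook's code: `collect_listing` and `fetch_page` are hypothetical names, and the fake pages only mimic the `data` / `children` / `after` layout that the function above reads.

```python
def collect_listing(fetch_page, max_pages=40):
    """Follow 'after' cursors until they run out or max_pages is reached.

    fetch_page is a hypothetical callable: it takes a cursor (None for the
    first page) and returns the decoded JSON dict for that page.
    """
    posts, cursor = [], None
    for _ in range(max_pages):
        data = fetch_page(cursor)['data']
        posts.extend(child['data'] for child in data['children'])
        cursor = data.get('after')
        if cursor is None:  # no further pages to request
            break
    return posts


# Stand-in for the network: two made-up pages keyed by cursor.
fake_pages = {
    None:   {'data': {'children': [{'data': {'title': 'a'}}], 'after': 't3_x'}},
    't3_x': {'data': {'children': [{'data': {'title': 'b'}}], 'after': None}},
}
titles = [post['title'] for post in collect_listing(fake_pages.__getitem__)]
print(titles)  # ['a', 'b']
```

Checking the cursor explicitly ends the loop on the last page without relying on an exception.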

### Scraped, saved, and available to be loaded from CSV

In [3]:
# df_lesbians = scrape_reddit('actuallesbians')
# df_incels = scrape_reddit('braincels')

In [4]:
# Export to csv (Commented out to avoid re-saving errors)
#df_lesbians.to_csv('actuallesbians_9_9_400', index=False)
#df_incels.to_csv('braincels_9_9_400', index=False)

In [5]:
# Import from CSV
df_lesbians = pd.read_csv('./Laboritory/Data/actuallesbians_9_9_400')
df_incels = pd.read_csv('./Laboritory/Data/braincels_9_9_400')

# Natural Language Processing
Using CountVectorizer to generate features from the title and body text of each post.

In [6]:
# Instantiations of the tokenizer, lemmatizer and Count Vectorizer (with preprocessor)
tokenizer = RegexpTokenizer(r'\w+')
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))  # built once rather than on every call

def preprocess(text):
    # Lowercase, replace anything that isn't a letter with a space, then tokenize
    text = re.sub(r'[^a-zA-Z]',' ', text.lower())
    tokens = word_tokenize(text)
    # Drop one-letter tokens and stop words, lemmatize the rest
    return " ".join([lemmatizer.lemmatize(word) for word in tokens
                     if len(word) > 1 and word not in stop_words])

cvec = CountVectorizer(analyzer = "word",
                       tokenizer = tokenizer.tokenize,
                       preprocessor = preprocess,
                       stop_words = 'english',
                       min_df = 2)
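
To make the idea of a count matrix with `min_df = 2` concrete, here is a plain-Python sketch. It is not scikit-learn's implementation: `bag_of_words` is a hypothetical helper, the documents are made up, and whitespace splitting stands in for the tokenizer and preprocessor used above.

```python
from collections import Counter

def bag_of_words(docs, min_df=2):
    """Return (vocabulary, count rows), keeping words found in >= min_df docs."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: the number of documents that contain each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vocab = sorted(word for word, freq in doc_freq.items() if freq >= min_df)
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocab])
    return vocab, rows

docs = ["cute cat cute", "cute dog", "dog barks"]
vocab, rows = bag_of_words(docs)
print(vocab)  # ['cute', 'dog']
print(rows)   # [[2, 0], [1, 1], [0, 1]]
```

Words that appear in only one document ("cat", "barks") are dropped, which is the effect `min_df = 2` has on the vocabulary.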

Combining and altering the dataframes to be modeled.

In [7]:
# Identifying the y Values
df_lesbians['is_lesbians'] = 1
df_incels['is_lesbians'] = 0

# Concatenation of the two subreddits
les_or_inc = pd.concat([df_lesbians.drop('subreddit', axis=1),
                        df_incels.drop('subreddit', axis=1)])

# Filling Nulls
les_or_inc.fillna('', inplace=True)

# Combining the title and selftext columns
les_or_inc['all_text'] = les_or_inc['title'] + ' ' + les_or_inc['selftext']

# Resetting the index (drop=True avoids keeping the old index as an extra column)
les_or_inc.reset_index(drop=True, inplace=True)

## Exploratory Data Analysis

In [9]:
# Creating Cvec DataFrame of both forums
# (get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names())
df_words = pd.DataFrame(cvec.fit_transform(les_or_inc['all_text']).toarray(),
                        columns=cvec.get_feature_names_out())

# Inserting the target column
df_words['is_lesbians'] = les_or_inc['is_lesbians']

In [10]:
# Correlation of every word count with the target column.
# 1 = post comes from the lesbians subreddit.
# 0 = post comes from the incels subreddit.
df_corrs = df_words.corr().sort_values(['is_lesbians'])['is_lesbians']
print("Most correlated to Lesbians subreddit")
# The last entry is is_lesbians itself (correlation 1.0), so skip it and reverse the rest
df_corrs.tail(20)[18::-1]

Most correlated to Lesbians subreddit


lesbian       0.263913
really        0.201454
gay           0.200222
know          0.183887
like          0.172065
girlfriend    0.166913
feel          0.157275
straight      0.156200
time          0.155890
long          0.147156
ago           0.144012
week          0.142556
thing         0.141777
friend        0.138071
tell          0.136536
little        0.136055
cute          0.135052
pretty        0.134702
month         0.132970
Name: is_lesbians, dtype: float64

In [10]:
print("Most correlated to Incels subreddit")
df_corrs.head(20)

Most correlated to Incels subreddit


chad               -0.221560
incel              -0.185280
incels             -0.154910
oneitis            -0.143180
ugly               -0.142970
cope               -0.124894
blackpill          -0.118361
bro                -0.105852
jfl                -0.105735
ascend             -0.105704
braincels          -0.100735
fuck               -0.100300
black              -0.097455
blackpillcentral   -0.097165
normies            -0.096021
pill               -0.090948
subhuman           -0.090072
reminder           -0.088896
normie             -0.088658
beta               -0.088407
Name: is_lesbians, dtype: float64

By looking at the words most correlated with one subreddit or the other, we can infer where these forums differ most from each other. Some results are obvious, such as words associated with sexuality having a high correlation with the lesbian forum. Others are more surprising: words like "month", "time", "week" and "ago" seem to point to more frequent mentions of recent or future timeframes compared to the incels forum. On the incels side, the strongest correlations cluster around their specific in-group terminology, such as "chad", "oneitis" and "blackpill".
 - Planned: Topic Modeling and Word2Vec/Doc2Vec functions.
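
The values listed above come from `DataFrame.corr()`, which computes Pearson correlation by default; here that is the correlation between a word's per-post count and the 0/1 subreddit label. A minimal sketch of the calculation for a single word follows. The `pearson` helper and the toy numbers are made up for illustration and are not taken from the scraped data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up toy data: counts of one word in six posts, and each post's label.
word_counts = [2, 1, 1, 0, 0, 0]
labels      = [1, 1, 1, 0, 0, 0]
r = pearson(word_counts, labels)
print(round(r, 3))  # 0.894
```

A word used mostly in posts labelled 1 gets a positive value, and a word used mostly in posts labelled 0 gets a negative one, which is why the incels terms above all carry a minus sign.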

## Modeling

Setting up X and y and the train/test split

In [11]:
# Defining X and y
X = les_or_inc['all_text']
y = les_or_inc['is_lesbians']

# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=76)

# Count vectorizing the train and test X's, fitting on the training X only
# (get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names())
X_train = pd.DataFrame(cvec.fit_transform(X_train).toarray(), columns=cvec.get_feature_names_out())
X_test  = pd.DataFrame(cvec.transform(X_test).toarray(),      columns=cvec.get_feature_names_out())

The baseline accuracy for this model is about 50%: assuming the two subreddits contribute roughly equal numbers of posts, guessing 1 (or 0) for every row would be correct about half the time.
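
The baseline can also be computed rather than assumed. The sketch below uses a hypothetical `majority_baseline` helper and made-up labels; in the notebook, the same function could be applied to `y`.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Made-up labels for illustration; the notebook's own `y` would go here.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1]
print(majority_baseline(toy_labels))  # 0.5
```

If the classes turn out to be unbalanced, this number rises above 0.5 and becomes the bar the models need to clear.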

In [12]:
multi_model = MultinomialNB().fit(X_train,y_train)
print("Multinomial Naive Bayes")
print("Train:", multi_model.score(X_train,y_train))
print("Test:", multi_model.score(X_test, y_test))
print("")
extra_trees = ExtraTreesClassifier().fit(X_train, y_train)
print("Extra Trees")
print("Train:", extra_trees.score(X_train,y_train))
print("Test:", extra_trees.score(X_test,y_test))
print("")
log_reg = LogisticRegression().fit(X_train, y_train)
print("Logistic Regression")
print("Train:", log_reg.score(X_train,y_train))
print("Test:", log_reg.score(X_test,y_test))
print("")
gradient = GradientBoostingClassifier().fit(X_train, y_train)
print("Gradient Boost")
print("Train:", gradient.score(X_train,y_train))
print("Test:", gradient.score(X_test,y_test))
print("")
KNN = KNeighborsClassifier().fit(X_train, y_train)
print("K Nearest Neighbors")
print("Train:", KNN.score(X_train,y_train))
print("Test:", KNN.score(X_test,y_test))
print("")
support = SVC().fit(X_train, y_train)
print("Support Vector Machine")
print("Train:", support.score(X_train,y_train))
print("Test:", support.score(X_test,y_test))

Multinomial Naive Bayes
Train: 0.9263513513513514
Test: 0.8663967611336032

Extra Trees
Train: 0.9898648648648649
Test: 0.8016194331983806

Logistic Regression
Train: 0.977027027027027
Test: 0.8421052631578947

Gradient Boost
Train: 0.8722972972972973
Test: 0.7975708502024291

K Nearest Neighbors
Train: 0.7763513513513514
Test: 0.6497975708502024

Support Vector Machine
Train: 0.6054054054054054
Test: 0.6376518218623481
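
One way to read the scores above is to rank the models by test accuracy and look at the gap between train and test accuracy as a rough sign of overfitting. The sketch below only re-uses the printed numbers, rounded to four decimals; the `scores` dictionary and the ranking are an illustration, not part of the original run.

```python
# (train accuracy, test accuracy), rounded from the scores printed above.
scores = {
    "Multinomial Naive Bayes": (0.9264, 0.8664),
    "Extra Trees":             (0.9899, 0.8016),
    "Logistic Regression":     (0.9770, 0.8421),
    "Gradient Boost":          (0.8723, 0.7976),
    "K Nearest Neighbors":     (0.7764, 0.6498),
    "Support Vector Machine":  (0.6054, 0.6377),
}

# Rank by test accuracy; a large positive gap suggests the model memorised the training set.
ranked = sorted(scores.items(), key=lambda item: item[1][1], reverse=True)
for name, (train, test) in ranked:
    print(f"{name:<24} test={test:.3f}  gap={train - test:+.3f}")
```

On these numbers, Multinomial Naive Bayes has the best test accuracy, while Extra Trees shows the largest train/test gap.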
