## Problem Statement

Differentiate between posts from the r/confessions and r/relationships subreddits.

In [414]:
import numpy as np
import requests
import pandas as pd
import time
import random
from bs4 import BeautifulSoup
import regex as re
from nltk.corpus import stopwords # Import the stop word list
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from nltk.stem import WordNetLemmatizer 
from nltk import word_tokenize

# Suppress warnings in the notebook output
import warnings
warnings.filterwarnings(action='ignore')

## Define the URLs

The two classes are posts from r/confessions and posts from r/relationships.

In [246]:
urls = {'confessions' : 'https://www.reddit.com/r/confessions.json', 
        'relationships' : 'https://www.reddit.com/r/relationships.json'}

Create our scraping function:

In [247]:
%time

def reddit_scrapper(key,url,n_iterations=10):
    
    #load_previous_file
    prev_posts = pd.read_csv('./data/' + str(key) + '.csv')
    print("Number of records loaded : {}".format(prev_posts.shape[0]))
    
    posts = []
    after = None

    for a in range(n_iterations):
        if after is None:
            current_url = url + '?limit=100'
        else:
            current_url = url + '?after=' + after + '&limit=100'
        print(current_url)
        res = requests.get(current_url, headers={'User-agent': 'Falcon 2.0'})

        if res.status_code != 200:
            print('Status error', res.status_code)
            break

        current_dict = res.json()
        current_posts = [p['data'] for p in current_dict['data']['children']]
        posts.extend(current_posts)
        after = current_dict['data']['after']

        # generate a random sleep duration to look more 'natural'
        sleep_duration = random.randint(2,6)
        
        time.sleep(sleep_duration)
    
    #add_to_existing
    posts = pd.DataFrame(posts)
    posts_df = pd.concat([posts, prev_posts], ignore_index=True)
    #remove duplicates
    #posts_df.drop_duplicates(inplace=True)
    print("Number of records stored : {}".format(posts_df.shape[0]))
    posts_df.to_csv('./data/' + str(key) + '.csv', index = False)

CPU times: user 5 µs, sys: 1 µs, total: 6 µs
Wall time: 10 µs
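The scraper pages through a listing by passing the `after` token from the previous response back in the next request. The following is a minimal sketch of just that URL-building step, using only the standard library; the helper name `build_listing_url` is hypothetical and is not part of the notebook.

```python
from urllib.parse import urlencode

def build_listing_url(base_url, after=None, limit=100):
    """Return the URL for one page of a JSON listing."""
    params = {}
    if after is not None:
        # `after` is the token returned with the previous page
        params["after"] = after
    params["limit"] = limit
    return base_url + "?" + urlencode(params)

base = "https://www.reddit.com/r/confessions.json"
first_page = build_listing_url(base)
second_page = build_listing_url(base, after="t3_bz5m1u")
print(first_page)
print(second_page)
```

Building the query string with `urlencode` avoids hand-concatenating `?` and `&`, and it escapes any unusual characters in the token.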


In [248]:
%time

for subreddit,url in urls.items():
    reddit_scrapper(subreddit,url)

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 4.77 µs
Number of records loaded : 2082
https://www.reddit.com/r/confessions.json?limit=100
https://www.reddit.com/r/confessions.json?after=t3_bz5m1u&limit=100
https://www.reddit.com/r/confessions.json?after=t3_bz846y&limit=100
https://www.reddit.com/r/confessions.json?after=t3_bypsma&limit=100
https://www.reddit.com/r/confessions.json?after=t3_byhgr9&limit=100
https://www.reddit.com/r/confessions.json?after=t3_by95fh&limit=100
https://www.reddit.com/r/confessions.json?after=t3_by40bz&limit=100
https://www.reddit.com/r/confessions.json?after=t3_bxs4zx&limit=100
https://www.reddit.com/r/confessions.json?after=t3_bxaifo&limit=100
https://www.reddit.com/r/confessions.json?after=t3_bwzxtr&limit=100
Number of records stored : 3073
Number of records loaded : 1026
https://www.reddit.com/r/relationships.json?limit=100
https://www.reddit.com/r/relationships.json?after=t3_bzlq65&limit=100
https://www.reddit.com/r/relationships.json?after=t3
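Each run appends the freshly scraped posts to the previously saved CSV, and the duplicate removal inside the scraper is commented out, so repeated runs can store the same post more than once. One possible way to handle this before building the DataFrame is sketched below. It assumes every post dictionary carries a unique `id` field; the helper `dedupe_posts` and the sample data are hypothetical.

```python
def dedupe_posts(posts, key="id"):
    """Keep the first occurrence of each post, preserving order."""
    seen = set()
    unique = []
    for post in posts:
        post_id = post[key]
        if post_id in seen:
            continue  # already stored from an earlier page or run
        seen.add(post_id)
        unique.append(post)
    return unique

# Made-up sample: the third entry repeats the first post
scraped = [
    {"id": "bz5m1u", "title": "first"},
    {"id": "bz846y", "title": "second"},
    {"id": "bz5m1u", "title": "first"},
]
unique_posts = dedupe_posts(scraped)
print(len(unique_posts))
```

Deduplicating on a single identifier is more robust than comparing whole rows, because fields such as scores or comment counts may change between runs.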

## Load in Data

In [253]:
df_relationships = pd.read_csv('./data/relationships.csv')
df_confessions = pd.read_csv('./data/confessions.csv')

## Data Cleaning

We create a `filter_columns` function that keeps only the title, selftext, subreddit name (our target) and author columns.

In [254]:
def filter_columns(df):
    columns_to_retain = ['title','selftext','subreddit','author']
    return df[columns_to_retain]

In [255]:
df_relationships_clean = filter_columns(df_relationships)
df_conf_clean = filter_columns(df_confessions)

In [256]:
display(df_relationships_clean.count())
display(df_conf_clean.count())

title        1948
selftext     1948
subreddit    1948
author       1948
dtype: int64

title        3073
selftext     2710
subreddit    3073
author       3073
dtype: int64

We can observe that the classes are imbalanced. For our classification dataset, we will aim for a 1:1 class balance by sampling the same number of posts from each subreddit.

In [257]:
df_relationships_clean.head()

| | title | selftext | subreddit | author |
|---|---|---|---|---|
| 0 | Should I (28M) tell my good friend (26F) what ... | I've (28M) been very good friends with M (26F)... | relationships | mr_phyr |
| 1 | My (28f) mom (53f) is wanting to move in with ... | My mom text me yesterday and asked me if she c... | relationships | lablife28 |
| 2 | My [27F] husband [27M] has betrayed me and lie... | I originally posted some of this on a throwawa... | relationships | glassballerina |
| 3 | A guy that I know turned out to be an ex-boyfr... | She (38F) finally told me (40M) because she th... | relationships | sospecial77 |
| 4 | Am I being an ungrateful bitch?? | My husband of 28 years keeps insisting on gett... | relationships | tundracatz907 |


In [258]:
df_conf_clean.head()

| | title | selftext | subreddit | author |
|---|---|---|---|---|
| 0 | As a call center employee in my early 20s, I p... | Context:\n\nThis took place in the late 2000s.... | confessions | IBoris |
| 1 | Whenever someone asks about my parents, I tell... | It happened when I was 15. My father was alway... | confessions | chicagodrama |
| 2 | As I scroll reddit, I save all the things I th... | He thinks that's what I'm looking at when I'm ... | confessions | earthlingmollyrising |
| 3 | I am a feminist at work but at home I am a wife | I believe in equal rights for men and women. I... | confessions | AlohaWorld18 |
| 4 | I miss the days where there was no social media | I feel like social media is what is wrong with... | confessions | EmperorJoker |


Before balancing the classes, we remove posts whose author name contains 'moderator', so that the model is trained on more 'authentic' posts.

In [259]:
df_relationships_clean.loc[:,'author'] = df_relationships_clean.author.map(lambda x : x.lower())
df_conf_clean.loc[:,'author'] = df_conf_clean.author.map(lambda x : x.lower())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item_labels[indexer[info_axis]]] = value
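The warning above typically appears when a value is assigned into a frame that was itself selected from another frame, as happens with the output of `filter_columns`. A common way to make the intent explicit is to take a `.copy()` of the selection before modifying it. The sketch below uses made-up data and is only an illustration of that pattern, not the notebook's own code.

```python
import pandas as pd

# Made-up stand-in for the scraped data
raw = pd.DataFrame({
    "title": ["Rules", "A post"],
    "author": ["AutoModerator", "Alice"],
    "score": [1, 2],
})

# Take an explicit copy so later assignments act on an independent frame
clean = raw[["title", "author"]].copy()
clean["author"] = clean["author"].str.lower()

print(clean["author"].tolist())
print(raw["author"].tolist())  # the original frame is left unchanged
```

Applying the same idea in the notebook would mean returning `df[columns_to_retain].copy()` from `filter_columns`.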


In [260]:
df_relationships_clean = df_relationships_clean[~df_relationships_clean.author.str.contains('moderator')]
df_conf_clean = df_conf_clean[~df_conf_clean.author.str.contains('moderator')]

In [261]:
df_relationships_clean.isna().sum()

title        0
selftext     0
subreddit    0
author       0
dtype: int64

In [262]:
df_conf_clean.isna().sum()

title          0
selftext     363
subreddit      0
author         0
dtype: int64

We also observe missing selftext in the confessions data (363 rows); the relationships data has none. We drop rows with missing selftext.

In [263]:
df_relationships_clean = df_relationships_clean.dropna(axis=0)
df_conf_clean = df_conf_clean.dropna(axis=0)

Keep only posts whose selftext contains more than 10 words.

In [264]:
df_relationships_clean['selftext_len'] = df_relationships_clean.selftext.map(lambda x: len(x.split()))
df_relationships_clean = df_relationships_clean[df_relationships_clean.selftext_len > 10]
df_conf_clean['selftext_len'] = df_conf_clean.selftext.map(lambda x: len(x.split()))
df_conf_clean = df_conf_clean[df_conf_clean.selftext_len > 10]

In [265]:
df_relationships_clean.drop_duplicates(inplace=True)
df_conf_clean.drop_duplicates(inplace=True)

In [266]:
display(df_relationships_clean.count())
display(df_conf_clean.count())

title           595
selftext        595
subreddit       595
author          595
selftext_len    595
dtype: int64

title           837
selftext        837
subreddit       837
author          837
selftext_len    837
dtype: int64

We will then randomly select 500 posts from each class, since removing moderator posts, missing or short text and duplicates left only 595 relationships posts and 837 confessions posts.

In [341]:
subset_relationships_clean = df_relationships_clean.sample(n=500,random_state=666)
subset_conf_clean = df_conf_clean.sample(n=500,random_state=666)

In [342]:
# combine both subsets into a DF
df = pd.concat([subset_relationships_clean, subset_conf_clean], ignore_index=True)
df.subreddit.value_counts()

confessions      500
relationships    500
Name: subreddit, dtype: int64

In [343]:
# create target class columns 0 = relationships, 1 = confessions 

df['label'] = df.subreddit.map({'relationships':'0','confessions':'1'}) 
df.head()

| | title | selftext | subreddit | author | selftext_len | label |
|---|---|---|---|---|---|---|
| 0 | A HS friend [23F] I haven’t spoken to in a yea... | I had a friend in high school, T, that I got c... | relationships | cthegoldfish | 740 | 0 |
| 1 | Im (26M) seeing a (23F) that draws block eyebr... | I (26M) and seeing A (23F) Taiwanese girl who ... | relationships | kynewt | 204 | 0 |
| 2 | Me (24F) and my boyfriend (24M) haven't spoken... | Long story short - we had agreed to go to a fe... | relationships | bigeasterbunny | 281 | 0 |
| 3 | Should I request my(21F) Fiance(25M) track dow... | So my fiance and I have been together for 1.5 ... | relationships | hamburgleryourgirl | 589 | 0 |
| 4 | My [28m] GF [29m] of about 4 months declined t... | So I need to start by saying I realize how stu... | relationships | tsaaron | 345 | 0 |


Clean the text by:
- converting everything to lower case
- removing groups of words in parentheses or square brackets
- removing line breaks
- removing special characters
- removing words shorter than 3 characters
- removing stop words


In [359]:
# convert the stop words to a set.
stops = set(stopwords.words('english'))

def clean_text(text):
    #01 convert titles, selftext into lowercase
    lower_text = str(text).lower()
    #02 remove bracketed and parenthesised groups from the title and selftext
    no_br_paret_text = re.sub(r'\(.+?\)|\[.+?\]', '', lower_text)
    #03 replace line breaks and repeated whitespace with single spaces
    strip_text = re.sub(r'\s+', ' ', no_br_paret_text).strip()
    #04 remove special characters
    removed_special = re.sub(r'[^0-9a-z ]+', '', strip_text)
    #05 remove words shorter than 3 characters
    words_length = re.sub(r'\b\w{1,2}\b', '', removed_special)
    #06 split into individual words
    words = words_length.split()
    #07 remove stop words
    meaningful_words = [w for w in words if w not in stops]
    return " ".join(meaningful_words)
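To see what each cleaning step does, the sketch below runs the same kind of pipeline on a single made-up sentence. It is an illustration under stated assumptions: it uses the standard-library `re` module rather than `regex`, a tiny hypothetical stop-word set in place of NLTK's English list, and it replaces line breaks with spaces so that words on adjacent lines are not fused together.

```python
import re

# Tiny stand-in stop-word set; NLTK's English list is much larger
DEMO_STOPS = {"the", "and", "that", "with", "have", "been"}

def clean_text_demo(text, stops=DEMO_STOPS):
    text = text.lower()
    text = re.sub(r"\(.+?\)|\[.+?\]", "", text)  # drop (..) and [..] groups
    text = re.sub(r"\s+", " ", text).strip()      # line breaks -> single spaces
    text = re.sub(r"[^0-9a-z ]+", "", text)       # drop special characters
    text = re.sub(r"\b\w{1,2}\b", "", text)       # drop 1-2 character words
    return " ".join(w for w in text.split() if w not in stops)

sample = "My (28F) mom [53F] is wanting\nto move in with me!"
cleaned = clean_text_demo(sample)
print(cleaned)  # mom wanting move
```

Note that removing apostrophes as special characters turns contractions such as "don't" into "dont", which no longer matches a stop-word entry spelled with an apostrophe.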

In [360]:
df[['title','selftext']] = df[['title','selftext']].applymap(clean_text)
df.tail()

Unnamed: 0,title,selftext,subreddit,author,selftext_len,label
995,girl fell love didnt,far friends told crazy loved much couldnt shar...,confessions,ali4069,323,1
996,wish morrissey would run president,weird obsession want morrissey run president d...,confessions,theriderreturns,127,1
997,want join ram ranch,idea ram ranch really arouses anybody advice o...,confessions,theriderreturns,21,1
998,dont think really believe anything anymore,feel like dont strong beliefs religion really ...,confessions,mrlurkety,89,1
999,think parents gas lighted ever since young,first events really havent happened long ago o...,confessions,sofw2424,1066,1


Split the title and selftext into separate datasets, so that a title classifier and a selftext classifier can each indicate which subreddit a post belongs to.

In [361]:
# split titles and selftext into separate DataFrames

df_title = df[['title','label']]
df_selftext = df[['selftext','label']]

### Split selftext 

In [362]:
X_text = df_selftext['selftext']
y_text = df_selftext['label']

X_text_train, X_text_test, y_text_train, y_text_test = train_test_split(X_text,y_text,stratify=y_text) 

## Create pipelines 

In [395]:
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

In [411]:
pipe = Pipeline([
    ('cvec', CountVectorizer(tokenizer=LemmaTokenizer())),
    ('lr', LogisticRegression(solver='saga',max_iter=300))
])

In [417]:
pipe_params = {
    'cvec__max_features': [2500, 3000, 3500],
    'cvec__ngram_range': [(1,1), (1,2)],
    'lr__penalty' : ['elasticnet'],
    'lr__C' : np.arange(0.1,5,0.1),
    'lr__l1_ratio' : np.arange(0.1,1.1,0.1)
}

gs = GridSearchCV(pipe, param_grid=pipe_params, cv=3,verbose=1,n_jobs=2)
gs.fit(X_text_train, y_text_train)
print(gs.best_score_)

Fitting 3 folds for each of 2940 candidates, totalling 8820 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:  1.3min
[Parallel(n_jobs=2)]: Done 196 tasks      | elapsed:  4.2min
[Parallel(n_jobs=2)]: Done 446 tasks      | elapsed: 10.1min
[Parallel(n_jobs=2)]: Done 796 tasks      | elapsed: 19.8min
[Parallel(n_jobs=2)]: Done 1246 tasks      | elapsed: 34.8min
[Parallel(n_jobs=2)]: Done 1796 tasks      | elapsed: 51.7min
[Parallel(n_jobs=2)]: Done 2446 tasks      | elapsed: 71.6min


KeyboardInterrupt: 
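The search was interrupted after more than an hour, which follows from the size of the grid: every combination of parameter values is fitted once per fold. The sketch below reproduces the count reported in the log using plain Python lists that mirror the `np.arange` ranges, and then counts a coarser grid. The coarser values are hypothetical, chosen only to show how quickly the number of fits shrinks.

```python
from itertools import product

grid = {
    "cvec__max_features": [2500, 3000, 3500],
    "cvec__ngram_range": [(1, 1), (1, 2)],
    "lr__penalty": ["elasticnet"],
    "lr__C": [round(0.1 * i, 1) for i in range(1, 50)],         # 0.1 .. 4.9
    "lr__l1_ratio": [round(0.1 * i, 1) for i in range(1, 11)],  # 0.1 .. 1.0
}

def count_fits(param_grid, cv):
    """Return (number of candidates, total fits) for an exhaustive search."""
    n_candidates = sum(1 for _ in product(*param_grid.values()))
    return n_candidates, n_candidates * cv

full = count_fits(grid, cv=3)
print(full)  # (2940, 8820)

# Hypothetical coarser grid: fewer values for C and l1_ratio
small_grid = dict(grid)
small_grid["lr__C"] = [0.1, 1.0, 5.0]
small_grid["lr__l1_ratio"] = [0.25, 0.5, 0.75]
small = count_fits(small_grid, cv=3)
print(small)  # (54, 162)
```

A coarse grid like this could be run first and then refined around the best values; a randomized search over the same ranges is another option.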