# Project 3: Binary Classification of Subreddits
# Notebook 1: Introduction and Extraction

## Problem Statement

Mental health issues have been on the rise in recent times, with cases increasing as a result of factors like post-traumatic stress disorder (PTSD), bullying (real-life or cyber) and suicide. All of these can lead to depression, which is a cause for concern for the government.

As part of a research team at the Institute of Mental Health (IMH), my team is tasked with coming up with a solution that can accurately classify a post on Reddit according to its subreddit (depression or forever alone). The two topics are commonly thought to be similar, but they are not. "Forever alone" is a phrase used by a single person to express loneliness at not having a significant other or, in a broader sense, a lack of friends. It is also used humorously at times as slang or as an internet meme. The term is associated with people who claim to be "depressed" when in fact they are not.

On the other hand, depression is a serious matter and, more often than not, people do not even realize that they have it. It is also extremely challenging to determine whether a person has depression by relying on behavioral cues, even for a highly qualified psychiatrist. Furthermore, a depressed person who is introverted might not want to share their problems openly with family or friends, or seek professional help, for fear of being stigmatized by society. As such, the person might turn to Reddit, an online forum, to share personal feelings and thoughts anonymously with other users in the same subreddit, as they might feel more comfortable that way. That being said, a user with depression might confuse the terms depression and forever alone, thinking they are the same, and thus post in the forever alone subreddit rather than the depression subreddit.

Hence, we aim to build a model that can accurately classify a post into its respective subreddit (depression/forever alone). We also aim to examine misclassified posts by diving deeper into them, unraveling the false positives and identifying the top 10 words of each subreddit. This would allow us to provide early detection of potential depression cases as well as identify existing cases, so that we can look to get them the help they need. The target audience of this project is everyone, especially those working in the psychological and healthcare departments.

## Executive Summary

The objective of this project is to run different binary classification algorithms/models and find the best model over several attempts. Firstly, I will extract about 2000 posts from each of the depression and forever alone subreddits. I will then clean the data to remove duplicates, missing values and outliers.

Next up is the exploratory data analysis (EDA), where I analyze the data, identify relationships and display their distributions. This gives a quick view of which variables are decisive in classifying posts into the respective subreddits.

This is followed by pre-processing, where I prepare my data for the modeling process. This step includes lemmatizing words to their base form and removing both the default English stopwords and my own customised stopwords, which consist of the top common words that appear in both datasets.
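The customised stopword idea can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual pre-processing: the helper names `top_words` and `shared_stopwords`, the simple regex tokenizer, the cut-off `n` and the toy posts are all assumptions made up for the example, and lemmatization is left out.

```python
from collections import Counter
import re


def top_words(posts, n):
    """Return the n most frequent lowercase word tokens across a list of posts."""
    counts = Counter()
    for post in posts:
        # Very simple tokenizer (an assumption): runs of letters and apostrophes
        counts.update(re.findall(r"[a-z']+", post.lower()))
    return [word for word, _ in counts.most_common(n)]


def shared_stopwords(posts_a, posts_b, n=50):
    """Return the words that rank in the top n of both corpora."""
    return set(top_words(posts_a, n)) & set(top_words(posts_b, n))


# Toy posts invented for illustration, not taken from the dataset
depression_posts = [
    "I feel so tired and I feel empty",
    "I feel like nothing matters",
]
forever_alone_posts = [
    "I feel like nobody wants to date me",
    "I never had a date and I feel lonely",
]

custom_stopwords = shared_stopwords(depression_posts, forever_alone_posts, n=3)
print(sorted(custom_stopwords))  # ['feel', 'i']
```

Words that are frequent in both subreddits carry little signal for telling them apart, which is why they are candidates for removal alongside the default English stopwords.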

As for modeling, my initial models will be run without any adjustments to hyperparameters. My second attempt will run the models with tuned hyperparameters to find the most accurate model. Lastly, I will evaluate my final model further and conclude.

I will be using count and TF-IDF vectorizers to convert the text columns into numeric features. I decided to make use of four different classifiers (Logistic Regression, Random Forest, K Nearest Neighbours and Multinomial Naive Bayes) to create different variations. I will then compare their respective scores along with their confusion matrices to select my final model.

The final model selected was the Multinomial Naive Bayes model paired with the TF-IDF vectorizer.
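A TF-IDF plus Multinomial Naive Bayes combination of this kind could look roughly like the sketch below, assuming scikit-learn is installed. The toy texts and labels are invented for illustration, the step names `tfidf` and `nb` are arbitrary, and all hyperparameters are left at their defaults rather than the tuned values used in the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data invented for illustration, not taken from the dataset
texts = [
    "I feel hopeless and exhausted every day",
    "Nothing brings me joy anymore",
    "I have never had a girlfriend",
    "Nobody wants to date me",
]
labels = ["depression", "depression", "ForeverAlone", "ForeverAlone"]

# Chain the vectorizer and the classifier so both are fitted together
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("nb", MultinomialNB()),
])
pipe.fit(texts, labels)

# Classify a new, unseen post
predictions = pipe.predict(["I feel exhausted and hopeless"])
print(predictions)
```

Wrapping both steps in a `Pipeline` keeps the vectorizer's vocabulary tied to the training data, which matters once cross-validation or hyperparameter tuning is introduced.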

## Imports

In [1]:
# Imports
import requests
import pandas as pd
import numpy as np
import time

## Extraction

First, I will pull posts from the two subreddits using the Pushshift API and store them as DataFrames. I will create a function to carry out the extraction. The Pushshift API limits me to 100 posts per pull, so the function pulls repeatedly until it has collected at least 2000 posts. I have included a sleep timer to avoid getting blocked for making repeated requests in short intervals.

The function also prints the pull number and the cumulative number of posts pulled.

In [2]:
# This code was adapted and referenced from my groupmates Mark and Chin Xia who guided me on the process

def pull_reddit(subreddit):
    '''Pull at least 2000 text posts from a subreddit via the Pushshift API.

    Returns a DataFrame of the posts pulled, excluding removed, deleted,
    empty and video posts.
    '''
    df = pd.DataFrame()
    url = 'https://api.pushshift.io/reddit/search/submission'
    bef_counter = '0d'  # Only fetch posts older than this offset
    i = 0
    while len(df) < 2000:
        time.sleep(np.random.randint(1, 5))  # Sleep 1 to 4 seconds between pulls
        params = {
            'subreddit': subreddit,
            'size': 100,  # Maximum number of posts per pull
            'before': bef_counter,
            'unique': 'selftext'}
        res = requests.get(url, params=params)
        print(f'Pull Number {i}, Status Code: {res.status_code}')
        if res.status_code == 200:  # Only process successful responses
            data = res.json()
            new_df = pd.DataFrame(data['data'])
            cleaned_df = new_df.drop(new_df[
                (new_df['selftext'] == '[removed]') |  # Dropping removed posts
                (new_df['selftext'] == '[deleted]') |  # Dropping deleted posts
                (new_df['selftext'] == '') |  # Dropping empty posts
                (new_df['is_video'] == True)  # Dropping non-text posts
            ].index)
            # i is incremented only after this line, so pulls 0 and 1 both
            # use '0d'; the resulting duplicates are removed during cleaning
            bef_counter = str(i * 10) + 'd'
            df = pd.concat([df, cleaned_df], axis=0)
            i += 1
            print(f'Number of post pulled : {len(df)}')
        else:
            print(f'Loop {i} failed, pausing script')
            time.sleep(np.random.randint(2, 3))  # Always 2 seconds (upper bound is exclusive)
    return df
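The filtering step inside `pull_reddit` can also be written as a standalone helper, which makes it easy to try on a small frame without calling the API. The sketch below is an illustration only: the helper name `drop_unusable_posts` and the toy batch are assumptions, and the extra `drop_duplicates` call is an addition that anticipates the duplicate removal planned for the cleaning stage.

```python
import pandas as pd


def drop_unusable_posts(batch):
    """Keep text posts with a real body, then drop repeated bodies."""
    # Bodies that Reddit/Pushshift return for removed, deleted or empty posts
    placeholder = batch["selftext"].isin(["[removed]", "[deleted]", ""])
    # Treat missing is_video values as "not a video"
    is_video = batch["is_video"].eq(True)
    kept = batch[~placeholder & ~is_video]
    return kept.drop_duplicates(subset="selftext").reset_index(drop=True)


# Toy batch invented for illustration
batch = pd.DataFrame({
    "selftext": ["first post", "[removed]", "", "first post", "video post", "[deleted]"],
    "is_video": [False, False, False, False, True, False],
})

clean = drop_unusable_posts(batch)
print(clean["selftext"].tolist())  # ['first post']
```

Using `isin` with a list of placeholder strings keeps the condition short and makes it simple to add further placeholders later.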

In [3]:
depression = pull_reddit('depression')
depression

Pull Number 0, Status Code: 200
Number of post pulled : 85
Pull Number 1, Status Code: 200
Number of post pulled : 170
Pull Number 2, Status Code: 200
Number of post pulled : 252
Pull Number 3, Status Code: 200
Number of post pulled : 337
Pull Number 4, Status Code: 200
Number of post pulled : 417
Pull Number 5, Status Code: 200
Number of post pulled : 500
Pull Number 6, Status Code: 200
Number of post pulled : 575
Pull Number 7, Status Code: 200
Number of post pulled : 654
Pull Number 8, Status Code: 200
Number of post pulled : 738
Pull Number 9, Status Code: 200
Number of post pulled : 807
Pull Number 10, Status Code: 200
Number of post pulled : 894
Pull Number 11, Status Code: 200
Number of post pulled : 977
Pull Number 12, Status Code: 200
Number of post pulled : 1056
Pull Number 13, Status Code: 200
Number of post pulled : 1136
Pull Number 14, Status Code: 200
Number of post pulled : 1219
Pull Number 15, Status Code: 200
Number of post pulled : 1304
Pull Number 16, Status Code: 20

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,whitelist_status,wls,author_cakeday,author_flair_background_color,author_flair_text_color,banned_by,post_hint,preview,edited,gilded
1,[],False,rasdo357,,[],,text,t2_eifao,False,False,...,no_ads,0.0,,,,,,,,
2,[],False,gone2go,,[],,text,t2_bfnnbn0q,False,False,...,no_ads,0.0,,,,,,,,
3,[],False,raysofdavies,,[],,text,t2_ebgwx,False,False,...,no_ads,0.0,,,,,,,,
4,[],False,tristenisawesome,,[],,text,t2_4vdsn09b,False,True,...,no_ads,0.0,,,,,,,,
5,[],False,MrJacobLeblanc,,[],,text,t2_45a2ixnx,False,False,...,no_ads,0.0,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,[],False,lifeishard99,,[],,text,t2_6wnxioaq,False,False,...,no_ads,0.0,,,,,,,,
96,[],False,CBT144,,[],,text,t2_1e66wl3y,False,False,...,no_ads,0.0,,,,,,,,
97,[],False,Fahrenplz,,[],,text,t2_3apxjjl7,False,False,...,no_ads,0.0,,,,,,,,
98,[],False,lifeishard99,,[],,text,t2_6wnxioaq,False,False,...,no_ads,0.0,,,,,,,,


In [4]:
forever_alone = pull_reddit('ForeverAlone')
forever_alone

Pull Number 0, Status Code: 200
Number of post pulled : 51
Pull Number 1, Status Code: 200
Number of post pulled : 102
Pull Number 2, Status Code: 200
Number of post pulled : 169
Pull Number 3, Status Code: 200
Number of post pulled : 230
Pull Number 4, Status Code: 200
Number of post pulled : 289
Pull Number 5, Status Code: 200
Number of post pulled : 346
Pull Number 6, Status Code: 200
Number of post pulled : 402
Pull Number 7, Status Code: 200
Number of post pulled : 461
Pull Number 8, Status Code: 200
Number of post pulled : 517
Pull Number 9, Status Code: 200
Number of post pulled : 572
Pull Number 10, Status Code: 200
Number of post pulled : 627
Pull Number 11, Status Code: 200
Number of post pulled : 675
Pull Number 12, Status Code: 200
Number of post pulled : 733
Pull Number 13, Status Code: 200
Number of post pulled : 764
Pull Number 14, Status Code: 200
Number of post pulled : 802
Pull Number 15, Status Code: 200
Number of post pulled : 851
Pull Number 16, Status Code: 200
Nu

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,secure_media,secure_media_embed,author_flair_background_color,banned_by,media_metadata,is_gallery,author_cakeday,edited,gilded,gallery_data
1,[],False,DegisterMallgurius,,[],,text,t2_4s9xzpzh,False,False,...,,,,,,,,,,
2,[],False,AnarchyFire,wojak,[],,text,t2_r5fxe,False,False,...,,,,,,,,,,
3,[],False,Onlyhere4help_,,[],,text,t2_3g6evx1k,False,False,...,,,,,,,,,,
6,[],False,OutcastByChoice,,[],,text,t2_9myipoyt,False,False,...,,,,,,,,,,
10,[],False,Jake3572,,[],,text,t2_8h7s3jv7,False,False,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
87,[],False,lindan44,,[],,text,t2_60s2wuxn,False,False,...,,,,,,,,,,
89,[],False,artin_kafshi,,[],,text,t2_5bjtqkv8,False,False,...,,,,,,,,,,
90,[],False,nurseandmeddoctor,,[],,text,t2_41fky9sh,False,False,...,,,,,"{'9iw9c6p8o5x41': {'e': 'Image', 'id': '9iw9c6...",,,,,
95,[],False,smalltiddyblasiangf,,[],,text,t2_49yxblh9,False,False,...,,,,,,,,,,


In [5]:
print(depression.shape)
print(forever_alone.shape)

(2046, 67)
(2013, 80)


In [6]:
# Keep only the columns needed for classification
depression = depression[['subreddit', 'selftext', 'title', 'author']].copy()
depression.head()

| | subreddit | selftext | title | author |
|---|---|---|---|---|
| 1 | depression | I'm a worthless excuse for a human and a below... | I hate myself more than I hate anything | rasdo357 |
| 2 | depression | through all the tribulations I've gone through... | tired of living with this guilt and pain every... | gone2go |
| 3 | depression | I start my first job for over a year next Tues... | Advice for work fear/anxiety | raysofdavies |
| 4 | depression | I constantly think about killing myself, I tho... | I long for death | tristenisawesome |
| 5 | depression | I've told myself today, that this is me.\nTher... | It will always stay like this. | MrJacobLeblanc |


In [7]:
# Keep only the columns needed for classification
forever_alone = forever_alone[['subreddit', 'selftext', 'title', 'author']].copy()
forever_alone.head()

| | subreddit | selftext | title | author |
|---|---|---|---|---|
| 1 | ForeverAlone | The setting:\n\nYou are a male third worlder i... | Choose your own adventure | DegisterMallgurius |
| 2 | ForeverAlone | When I was between relationships for a couple ... | I know exactly how you feel | AnarchyFire |
| 3 | ForeverAlone | I’m a girl in a long term relationship but I l... | I appreciate you all | Onlyhere4help_ |
| 6 | ForeverAlone | I've met this woman, she is 33 years old...\n\... | Too Good to Be True? | OutcastByChoice |
| 10 | ForeverAlone | lets say i or you do crawl out of this crappy ... | Is it possible to get over it ? | Jake3572 |


In [8]:
#depression.to_csv('./data/depression.csv', index=False)
#forever_alone.to_csv('./data/forever_alone.csv', index=False)