# Project 3: Subreddit Classification using NLP

## Problem Statement
Develop a classifier that predicts whether a Reddit user is suffering from depression or is at risk of suicide, by classifying posts from two subreddits, "depression_help" and "SuicideWatch".

## Background
Suicide is the 10th leading cause of death in the US. According to statistics from the American Foundation for Suicide Prevention, there were an estimated 3,836 suicide attempts per day in 2017. Suicide prevention is one way to combat the problem.

Depression usually develops before suicide. The purpose of this project is to create an NLP classifier that can, hopefully, distinguish someone with suicidal intent from someone with depression by analyzing the words they use. This is done by classifying posts from two subreddits, "depression_help" and "SuicideWatch".

Reddit is an American social news aggregation, web content rating, and discussion website. Registered members submit content to the site such as links, text posts, and images, which are then voted up or down by other members. Posts are organized by subject into user-created boards called "subreddits". According to Alexa Internet, as of July 2019, Reddit ranked as the No. 5 most visited website in the U.S. and No. 13 in the world. The "SuicideWatch" subreddit has 179K members, while "depression_help" has 47.3K members. As their names suggest, these two subreddits are places where users with depression or suicidal thoughts can seek help. Text from these two subreddits forms the dataset for this classifier project.

## Part I: Data Query

In [6]:
import re
import requests
import pandas as pd
import random
import time

In [3]:
url = 'https://www.reddit.com/r/depression_help.json'

In [4]:
res = requests.get(url, headers={'User-agent': 'Campeltown Inc 1.0'})

In [5]:
res

<Response [200]>

In [6]:
ask_dict = res.json()

In [7]:
ask_dict

{'kind': 'Listing',
 'data': {'modhash': '',
  'dist': 26,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'depression_help',
     'selftext': '&amp;#x200B;\n\nhttps://preview.redd.it/8akni1zthjs31.jpg?width=840&amp;format=pjpg&amp;auto=webp&amp;s=7acbfa2b903ba415b333bc01ece2ca02b77efa66\n\nExperiencing or relate to any of the above? Do you wish to speak to people in a safe, well-moderated online community? [Our Discord server](https://discord.gg/JTGxsjb) offers 1:1 support, off-topic channels to talk with AMAZING people, and chats for mental health.',
     'author_fullname': 't2_xmlxk',
     'saved': False,
     'mod_reason_title': None,
     'gilded': 0,
     'clicked': False,
     'title': 'Immediate Life Support - Discord Server',
     'link_flair_richtext': [{'e': 'text', 't': 'ANNOUNCEMENT'}],
     'subreddit_name_prefixed': 'r/depression_help',
     'hidden': False,
     'pwls': None,
     'link_flair_css_class': '',
     'downs': 0,
     'h

In [8]:
ask_dict.keys()

dict_keys(['kind', 'data'])

In [9]:
ask_dict['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

In [10]:
len(ask_dict['data']['children'])

26

In [11]:
ask_dict['data']['children'][0].keys()

dict_keys(['kind', 'data'])

In [12]:
ask_dict['data']['children'][0]['data'].keys()

dict_keys(['approved_at_utc', 'subreddit', 'selftext', 'author_fullname', 'saved', 'mod_reason_title', 'gilded', 'clicked', 'title', 'link_flair_richtext', 'subreddit_name_prefixed', 'hidden', 'pwls', 'link_flair_css_class', 'downs', 'hide_score', 'media_metadata', 'name', 'quarantine', 'link_flair_text_color', 'author_flair_background_color', 'subreddit_type', 'ups', 'total_awards_received', 'media_embed', 'author_flair_template_id', 'is_original_content', 'user_reports', 'secure_media', 'is_reddit_media_domain', 'is_meta', 'category', 'secure_media_embed', 'link_flair_text', 'can_mod_post', 'score', 'approved_by', 'author_premium', 'thumbnail', 'edited', 'author_flair_css_class', 'author_flair_richtext', 'gildings', 'content_categories', 'is_self', 'mod_note', 'created', 'link_flair_type', 'wls', 'removed_by_category', 'banned_by', 'author_flair_type', 'domain', 'allow_live_comments', 'selftext_html', 'likes', 'suggested_sort', 'banned_at_utc', 'view_count', 'archived', 'no_follow', 

In [13]:
len(ask_dict['data']['children'][0]['data'].keys())

101

In [14]:
ask_dict['data']['children'][0]['data']['subreddit']

'depression_help'

In [15]:
ask_dict['data']['children'][0]['data']['title']

'Immediate Life Support - Discord Server'

In [17]:
ask_dict['data']['children'][25]['data']['name']

't3_etsvbz'
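
The exploration above shows that each listing nests its posts under `data` → `children` → `data`, next to an `after` key. A minimal sketch of pulling both out in one step follows. The helper name `parse_listing` and the miniature `sample_listing` are made up for illustration and are not part of the original notebook.

```python
def parse_listing(listing):
    """Split a listing dict into its posts and its pagination token."""
    data = listing['data']
    # Each child wraps the actual post fields in its own 'data' dict.
    posts = [child['data'] for child in data['children']]
    # 'after' is the token for requesting the next page, if there is one.
    return posts, data.get('after')


# Made-up miniature listing with the same nesting as ask_dict.
sample_listing = {
    'kind': 'Listing',
    'data': {
        'children': [
            {'kind': 't3', 'data': {'name': 't3_aaa', 'title': 'First post'}},
            {'kind': 't3', 'data': {'name': 't3_bbb', 'title': 'Second post'}},
        ],
        'after': 't3_bbb',
        'before': None,
    },
}

sample_posts, sample_after = parse_listing(sample_listing)
print([p['title'] for p in sample_posts], sample_after)
```

Calling `parse_listing(ask_dict)` on the real response should return the 26 post dicts together with the token for the next page.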

## Collect depression_help data

In [None]:
posts = []
after = None

for a in range(50):
    print(f"Loop {a}")
    
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Campeltown Inc 1.0'})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']

    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,6)
    print(sleep_duration)
    time.sleep(sleep_duration)

pd.DataFrame(posts).to_csv('../data/depression_help.csv', index = False)
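
The paging loop above can also be packaged as a reusable function. The sketch below is one possible shape, not the notebook's own code. `collect_posts` and `fetch_page` are hypothetical names: `fetch_page` stands for any callable that takes the current `after` token (`None` for the first page) and returns the decoded JSON dict. One deliberate difference, based on an assumption about the desired behavior: the sketch stops when `after` comes back as `None` instead of requesting the first page again.

```python
import time


def collect_posts(fetch_page, max_pages=50, pause=0.0):
    """Follow the 'after' token from page to page and gather all posts."""
    posts = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)
        posts.extend(child['data'] for child in listing['data']['children'])
        after = listing['data']['after']
        if after is None:
            # No further pages: stop instead of starting over from page one.
            break
        time.sleep(pause)
    return posts


# Made-up two-page listing, keyed by the 'after' token that requests each page.
fake_pages = {
    None: {'data': {'children': [{'data': {'name': 't3_aaa'}}], 'after': 't3_aaa'}},
    't3_aaa': {'data': {'children': [{'data': {'name': 't3_bbb'}}], 'after': None}},
}

collected = collect_posts(fake_pages.get)
print([p['name'] for p in collected])
```

Against the live site, `fetch_page` could wrap the same `requests.get` call used above, passing the token through `params={'after': after}` and returning `res.json()`. Taking the fetcher as an argument keeps the paging logic testable without network access.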

## Collect SuicideWatch data

In [7]:
url = 'https://www.reddit.com/r/SuicideWatch.json'

In [8]:
posts = []
after = None

for a in range(50):
    print(f"Loop {a}")
    
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Campeltown Inc 1.0'})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']

    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,6)
    print(sleep_duration)
    time.sleep(sleep_duration)

pd.DataFrame(posts).to_csv('../data/suicide.csv', index = False)

Loop 0
https://www.reddit.com/r/SuicideWatch.json
4
Loop 1
https://www.reddit.com/r/SuicideWatch.json?after=t3_evg3ww
4
Loop 2
https://www.reddit.com/r/SuicideWatch.json?after=t3_evg5j6
6
Loop 3
https://www.reddit.com/r/SuicideWatch.json?after=t3_ev8ix3
5
Loop 4
https://www.reddit.com/r/SuicideWatch.json?after=t3_evckcu
3
Loop 5
https://www.reddit.com/r/SuicideWatch.json?after=t3_evf0qm
2
Loop 6
https://www.reddit.com/r/SuicideWatch.json?after=t3_evaacg
5
Loop 7
https://www.reddit.com/r/SuicideWatch.json?after=t3_ev67qc
6
Loop 8
https://www.reddit.com/r/SuicideWatch.json?after=t3_evanz7
4
Loop 9
https://www.reddit.com/r/SuicideWatch.json?after=t3_ev45lm
4
Loop 10
https://www.reddit.com/r/SuicideWatch.json?after=t3_ev74qg
5
Loop 11
https://www.reddit.com/r/SuicideWatch.json?after=t3_euzexh
2
Loop 12
https://www.reddit.com/r/SuicideWatch.json?after=t3_ev0j32
4
Loop 13
https://www.reddit.com/r/SuicideWatch.json?after=t3_euz8zf
6
Loop 14
https://www.reddit.com/r/SuicideWatch.json?after=t3_

In [9]:
len(posts)

1243
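
The count above is the raw number of collected rows. If the same post can appear on more than one page (an assumption, not something verified here), some rows may be repeats. A minimal sketch of dropping repeats by the `name` field seen earlier follows. The helper `dedupe_posts` and the sample rows are hypothetical.

```python
def dedupe_posts(posts):
    """Keep the first occurrence of each post, identified by its 'name'."""
    seen = set()
    unique = []
    for post in posts:
        if post['name'] not in seen:
            seen.add(post['name'])
            unique.append(post)
    return unique


# Made-up rows: the first post shows up twice.
raw_posts = [
    {'name': 't3_aaa', 'title': 'First post'},
    {'name': 't3_bbb', 'title': 'Second post'},
    {'name': 't3_aaa', 'title': 'First post'},
]

unique_posts = dedupe_posts(raw_posts)
print(len(raw_posts), len(unique_posts))
```

Comparing `len(dedupe_posts(posts))` with `len(posts)` would show how many of the collected rows are distinct posts.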