# Problem Statement 

The objective of this project is to use the [API](https://www.reddit.com/wiki/api) of [reddit.com](https://www.reddit.com/) to pull data from two different subreddits and build a model that predicts which subreddit a given post comes from. Because the data is text, natural language processing (NLP) is needed to bring it into a shape on which machine learning algorithms can be run. Several machine learning models will be built, evaluated, and compared with one another for their predictive power, and one will be chosen as the best at predicting the subreddit from which a given post originated. The criteria for the 'best' model are:

1. high accuracy score 
2. low bias and low variance

The candidate models will therefore be evaluated on how well they balance these two criteria.
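One way to make the two criteria concrete is to compare accuracy on the training data with accuracy on held-out data: low accuracy on both hints at high bias, while a large gap between the two hints at high variance. The sketch below only illustrates that idea. The helper names `accuracy` and `fit_report` and the toy predictions are assumptions for illustration and are not part of the project's code.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def fit_report(train_acc, test_acc):
    """Summarise both criteria: held-out accuracy and the train/test gap."""
    return {
        "test_accuracy": test_acc,               # criterion 1: higher is better
        "train_test_gap": train_acc - test_acc,  # a large positive gap hints at high variance
    }


# Toy predictions (made up); labels follow the notebook: 0 = r/depression, 1 = r/anxiety
train_true, train_pred = [0, 1, 0, 1], [0, 1, 0, 1]
test_true, test_pred = [0, 1, 1, 0], [0, 1, 0, 0]

report = fit_report(accuracy(train_true, train_pred),
                    accuracy(test_true, test_pred))
print(report)
```

In this toy case the model is perfect on the training labels but misses one of four held-out labels, so the gap is the quantity to watch when comparing candidates.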

![reddit_logo](../images/reddit_logo.png)

# 1. Data

The data used for this project is pulled from two subreddits: [r/depression](https://www.reddit.com/r/depression/) and [r/anxiety](https://www.reddit.com/r/Anxiety/). The relevant data that will be used for this project is the text body of the posts. There are 2 reasons why these subreddits were chosen:

1. **Personal interest in understanding how well machine learning algorithms can separate narratives about psychological disorders.** The insight gained could have larger implications for how social media platforms on which people discuss mental health can be leveraged, with the help of machine learning, to categorize mental health disorders and potentially discover previously unrecognized symptoms hidden within people's narratives of a given disorder. These insights could extend beyond mental health disorders into the more general medical field.

2. **Similarity of narratives within the two subreddits.** It is well established in psychology that people who suffer from anxiety often also suffer from depression; the two disorders frequently go hand in hand. This makes the problem I want to investigate more challenging because the data gathered from the two subreddits is not dissimilar: posts in the depression subreddit are expected to discuss depression, but many of them are also expected to discuss anxiety. The reverse is true for posts from the anxiety subreddit. Thus, there will be a lot of narrative overlap between the two subreddits, which is expected to make it more difficult for an ML model to separate the two.

# 2. Checking Assumptions

In point 1 above, it is assumed that an ML algorithm should be able to successfully categorize posts coming from the depression and anxiety subreddits. To test this assumption, I will use the conventional definitions of depression and anxiety as they are known today within the field of psychology. These definitions will serve as the control for the experiment. After the best model is chosen, I will look into which features (words) the model finds most important for this classification problem. Those features will then be qualitatively assessed for how well they fit the general definitions of depression and anxiety. Further details are discussed in the notebook **5. Conclusive Insights and Recommendations**

In point 2 above, it is assumed that it will be more difficult for an ML model to classify posts from two subreddits if the narratives in their posts are similar. This assumption needs to be formally tested with a hypothesis test. To do this, I introduce a control group to the experiment - the [r/datascience subreddit](https://www.reddit.com/r/datascience/). This subreddit is assumed to be highly dissimilar to the r/depression and r/anxiety subreddits because its theme, and therefore the subject matter of its posts, is completely unrelated to psychological disorders. After introducing this control group, a hypothesis test will be conducted. A more thorough discussion can be found in notebook **2. Data Cleaning and NLP**

In [5]:
# import all libraries that will be used in this notebook

import pandas as pd
import requests
import time

# 3. Data Collection from Reddit's API 

**Note**: Credit goes to Riley from General Assembly for the code to set up a personal user agent and for iteratively pulling data from reddit's API. 

In [6]:
# setting up a personal user agent to pull data from reddit's API in the form of a json file. 

url = 'https://www.reddit.com/r/depression.json'
headers = {'User-agent': 'annie_bot'}
res = requests.get(url, headers = headers)
res.status_code
ds_json = res.json()

The data from reddit's API comes as JSON, which `res.json()` parses into a dictionary-like object. To pull the relevant information out of it, the nested dictionary has to be traversed further.
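The nesting can be illustrated offline with a small hand-made dictionary that mimics the layout explored in the cells that follow (`data`, then `children`, then each child's `data`, then `selftext`). This is a minimal sketch: `sample_listing` and the helper `extract_selftexts` are hypothetical names, and the values are made up rather than taken from the API.

```python
# A hand-made stand-in for the JSON returned by a subreddit listing.
# Only the keys used in this notebook are included; the values are made up.
sample_listing = {
    "kind": "Listing",
    "data": {
        "after": "t3_example",
        "children": [
            {"kind": "t3", "data": {"selftext": "first post body"}},
            {"kind": "t3", "data": {"selftext": "second post body"}},
        ],
    },
}


def extract_selftexts(listing):
    """Return the text body of every post in one listing page."""
    children = listing["data"]["children"]
    # .get() falls back to an empty string if a post has no 'selftext' key
    return [child["data"].get("selftext", "") for child in children]


texts = extract_selftexts(sample_listing)
print(texts)
```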

In [9]:
# exploring the data to see where the actual posts are
sorted(ds_json.keys())

['data', 'kind']

In [10]:
# looking further into the 'data' key of the dictionary
sorted(ds_json['data'].keys())

['after', 'before', 'children', 'dist', 'modhash']

This is where the actual posts hide within the dictionary: 

In [11]:
# view an actual post 
ds_json['data']['children'][1]['data']['selftext']

'Welcome to /r/depression\'s check-in post - a place to take a moment and share what is going on and how you are doing. If you have an accomplishment you want to talk about (these shouldn\'t be standalone posts in the sub as they violate the "role model" rule, but are welcome here), or are having a tough time but prefer not to make your own post, this is a place you can share.\n\nWe try our best to keep this space as safe and supportive as possible on reddit\'s wide-open anonymity-friendly platform. The community rules can be found in the sidebar, or under "Community Info" in the official mobile apps. If you aren\'t sure about a rule, [please ask us](https://www.reddit.com/message/compose?to=%2Fr%2Fdepression). \n\n*****\n\nSorry about letting the last post get archived.  We\'ve been super-busy keeping up with the modqueue and that\'s slowing down our work on making the community rules more clear and visible and migrated to the standard reddit rules system - we *are* spending all the a

As is apparent above, each post comes with a lot of text and with HTML and markdown artifacts (escaped characters, newline markers, link syntax) that will need to be cleaned and removed before the posts can be used for modeling. Data cleaning is done and discussed in the **Natural Language Processing and Data Cleaning** notebook.
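To give a feel for the kind of artifacts involved, here is a rough first-pass sketch using only the standard library. It is not the cleaning procedure used in the later notebook; the function name `rough_clean` and the sample string (stitched together from fragments visible in the outputs of this notebook) are assumptions for illustration.

```python
import html
import re


def rough_clean(text):
    """A rough first pass; not the cleaning used in the later notebook."""
    text = html.unescape(text)                             # '&amp;' becomes '&'
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)   # keep link text, drop the URL
    text = re.sub(r"\*+", " ", text)                       # drop markdown asterisks
    return re.sub(r"\s+", " ", text).strip()               # collapse newlines and spaces


sample = ("entering &amp; transitioning\n\n*****\n\n"
          "[please ask us](https://www.reddit.com/message/compose)")
print(rough_clean(sample))
```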

Reddit's API returns 25 posts per request by default, and I need a few hundred posts per subreddit for this project. To collect them, I will take the `after` value returned with each batch, which is the ID of the last (25th) post, and pass it as a parameter in the next request inside a for loop, so that each iteration picks up where the previous one stopped.
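The paging logic can be sketched independently of the network by injecting the function that fetches one page. This is an illustrative sketch under stated assumptions: `collect_posts`, `fetch_page`, and `fake_pages` are hypothetical names, the page contents are made up, and the sketch assumes that the last page is signalled by an `after` value of `None`.

```python
import time


def collect_posts(fetch_page, pages, pause=0.0):
    """Follow the 'after' token across pages and gather all children.

    fetch_page(after) must return one parsed listing (a dict).
    """
    posts, after = [], None
    for _ in range(pages):
        listing = fetch_page(after)
        posts.extend(listing["data"]["children"])
        after = listing["data"]["after"]
        if after is None:      # assumption: no token means no further page
            break
        time.sleep(pause)      # pause between requests to spare the server
    return posts


# Fake two-page source standing in for the API (made-up data).
fake_pages = {
    None: {"data": {"children": ["post_1", "post_2"], "after": "t3_aaa"}},
    "t3_aaa": {"data": {"children": ["post_3"], "after": None}},
}

collected = collect_posts(lambda after: fake_pages[after], pages=5)
print(collected)
```

With the real API, `fetch_page` would wrap the `requests.get(...)` call used in this notebook and return `res.json()`, and `pause` would be set to a couple of seconds.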

In [12]:
# the ID of the last (25th) post
ds_json['data']['after']

't3_b9dmk5'

In [14]:
# request the next batch of posts, starting after the last post seen so far
params = {'after': 't3_b9dmk5'}
requests.get(url, params = params, headers = headers)

In [14]:
# save the urls of the subreddits we will be using to variables 
datascience = 'https://www.reddit.com/r/datascience.json'
depression = 'https://www.reddit.com/r/depression.json'
anxiety = 'https://www.reddit.com/r/Anxiety.json'

## 3.1. Get data from r/depression 

To avoid hitting reddit's API excessively with the for loop and overwhelming reddit's server, I will use the **time.sleep(2)** command to pause for 2 seconds between iterations.

In [15]:
posts = []
after = None

for i in range(0,30):
    # on the first pass there is no 'after' token yet
    if after is None:
        params = {}
    else: 
        params = {'after': after}
    res = requests.get(depression, params = params, headers = headers)
    if res.status_code == 200:
        ds_json = res.json()
        posts.extend(ds_json['data']['children'])
        # token that tells reddit where the next batch should start
        after = ds_json['data']['after']
    else: 
        print(res.status_code)
        break
    # pause between requests to avoid overloading reddit's server
    time.sleep(2)

In [16]:
post_text = []
# note: the slice starts at index 1, so the first collected post is skipped
for post in posts[1:]:
    post_text.append(post['data']['selftext'])

dep = pd.DataFrame(post_text)
dep['label'] = 0   # 0 = r/depression

In [17]:
dep.head()

| | 0 | label |
|---|---|---|
| 0 | Welcome to /r/depression's check-in post - a p... | 0 |
| 1 | Yesterday a friend told me that. To be honest,... | 0 |
| 2 | As bad as things may be, right now... the worl... | 0 |
| 3 | I just want to get this shit out and feel bett... | 0 |
| 4 | I wonder how many of us can relate to [this ar... | 0 |


In [19]:
# save data from r/depression to a csv file
dep.to_csv('./data/dep.csv') 

## 3.2. Get data from r/anxiety

In [20]:
posts = []
after = None

for i in range(0,30):
    # on the first pass there is no 'after' token yet
    if after is None:
        params = {}
    else: 
        params = {'after': after}
    res = requests.get(anxiety, params = params, headers = headers)
    if res.status_code == 200:
        ds_json = res.json()
        posts.extend(ds_json['data']['children'])
        # token that tells reddit where the next batch should start
        after = ds_json['data']['after']
    else: 
        print(res.status_code)
        break
    # pause between requests to avoid overloading reddit's server
    time.sleep(2)

In [21]:
post_text = []
# note: the slice starts at index 1, so the first collected post is skipped
for post in posts[1:]:
    post_text.append(post['data']['selftext'])

anx = pd.DataFrame(post_text)
anx['label'] = 1   # 1 = r/anxiety

In [26]:
anx.head()

| | 0 | label |
|---|---|---|
| 0 | What have you accomplished this week? Share yo... | 1 |
| 1 | I’ve been diagnosed with anxiety and depressio... | 1 |
| 2 | Lately I've been watching Star Trek Discovery.... | 1 |
| 3 | My timid personality isn't just reflected outs... | 1 |
| 4 | ...so I could not give a shit about life and j... | 1 |


In [23]:
# save the data from r/anxiety to a csv file 
anx.to_csv('./data/anx.csv')

## 3.3. Get data from r/datascience (the control)

In [8]:
posts = []
after = None

for i in range(0,30):
    # on the first pass there is no 'after' token yet
    if after is None:
        params = {}
    else: 
        params = {'after': after}
    res = requests.get(datascience, params = params, headers = headers)
    if res.status_code == 200:
        ds_json = res.json()
        posts.extend(ds_json['data']['children'])
        # token that tells reddit where the next batch should start
        after = ds_json['data']['after']
    else: 
        print(res.status_code)
        break
    # pause between requests to avoid overloading reddit's server
    time.sleep(2)

In [10]:
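post_text = []
# note: the slice starts at index 1, so the first collected post is skipped
for post in posts[1:]:
    post_text.append(post['data']['selftext'])

ds = pd.DataFrame(post_text)
ds['label'] = 1   # 1 = r/datascience (the control), paired later with r/depression (0)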

In [11]:
ds.head()

| | 0 | label |
|---|---|---|
| 0 | Welcome to this week's entering &amp; transiti... | 1 |
| 1 | | 1 |
| 2 | Hello everyone,\n\nThe project is called Deep ... | 1 |
| 3 | | 1 |
| 4 | Currently, I'm working as as a Customer Servic... | 1 |


In [12]:
# save data from r/datascience to a csv file 
ds.to_csv('./data/ds.csv')

## 3.4. Create a dataframe for posts coming from r/depression and r/anxiety

In [27]:
depanx = pd.concat([dep, anx], axis = 0)
depanx.rename(columns = {0:'post'}, inplace = True)

In [28]:
depanx.label.value_counts()

1    751
0    751
Name: label, dtype: int64

In [29]:
# save depression-anxiety data to a csv file 
depanx.to_csv('./data/depanx.csv')

## 3.5. Create a dataframe for posts coming from r/depression and r/datascience

In [18]:
depds = pd.concat([dep, ds], axis = 0)
depds.rename(columns = {0:'post'}, inplace = True)

In [21]:
depds['label'].value_counts()

0    751
1    733
Name: label, dtype: int64

In [23]:
# save depression-datascience data to a csv file 
depds.to_csv('./data/depds.csv')

So we have two dataframes, each with roughly 1,500 posts: 1,502 in the depression-anxiety set and 1,484 in the depression-datascience set.