# Project 3: Identifying Depression in r/domesticviolence Subreddit Posts

<img src="./images/domestic_violence.jpg" alt="domestic violence pic"/>

---
## Problem statement 

As a result of the COVID-19 pandemic, there has been an increase in domestic violence incidents at home as people are confined to their homes for long periods of time<sup>[[1]](https://www.nytimes.com/2020/04/06/world/coronavirus-domestic-violence.html)</sup>. Research has also found that people suffering from domestic violence are at higher risk of depression<sup>[[2]](https://www.theguardian.com/society/2019/jun/07/domestic-abuse-victims-more-likely-to-suffer-mental-illness-study)</sup>. In addition, many r/domesticviolence posts about domestic violence experiences seemingly contain words that indicate depressive feelings. Due to the large number of posts on the r/domesticviolence subreddit daily, authors who suffer from depression may go unnoticed and may not be identified to be offered help. This model seeks to solve this problem by taking a proactive approach to identifying such posts in order to direct help to these authors. 

Using two subreddits, r/domesticviolence and r/depression, in which people post about their domestic violence and depression experiences respectively, a model will be developed to learn the words that typically appear in posts from each subreddit and predict which subreddit a post belongs in: ```is_depression``` = 1 for an r/depression post and ```is_depression``` = 0 for an r/domesticviolence post.

The model can then be run on posts extracted from r/domesticviolence. If a post contains enough words indicating depression to be predicted as a r/depression post (```is_depression``` = 1), the authors of these posts can be identified by their user ID. These people, who are at risk of depression, could be identified to mental health support groups who will reach out to offer support and assistance. 

---
## Executive summary

From the r/domesticviolence and r/depression subreddits, 998 and 960 posts respectively were collected as JSON from Reddit's API. These subreddits were chosen because they contain predominantly text-based data, which is advantageous for NLP, and have a high number of posts. 

Following this, data cleaning was done to replace null values. Preprocessing was then done on the raw data, which consisted of tokenizing, removing stop words and lemmatizing. From the cleaned and preprocessed data, two key features were retained: words in title (```title```) and words in post (```selftext```). An ```is_depression``` column was added to show if the post was from r/depression (```is_depression``` = 1) or r/domesticviolence (```is_depression``` = 0). Another column (```all_words```), which combines the words from ```title``` and ```selftext```, was also added.
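As a rough illustration of the preprocessing described above, the sketch below tokenizes text, drops stop words and builds an ```all_words``` field from ```title``` and ```selftext```. The tiny `STOP_WORDS` set, the `preprocess` helper and the example post are hypothetical stand-ins, and lemmatizing is left out because it needs an external library; the actual steps are in the Preprocessing and EDA notebook.

```python
import re

# Hypothetical, deliberately tiny stop-word list (a real one would be much longer)
STOP_WORDS = {"i", "a", "the", "and", "to", "of", "my", "is", "am"}

def preprocess(text):
    """Lowercase, tokenize on letters/apostrophes, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

# Invented post with the two text fields retained from the raw data
post = {"title": "Not sure what to do", "selftext": "I am tired and alone."}

# all_words combines the words from title and selftext
post["all_words"] = " ".join(preprocess(post["title"]) + preprocess(post["selftext"]))
print(post["all_words"])
```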

EDA was conducted to take a preliminary look at the top words in each subreddit and verify that there are indeed some similarities in words between both subreddits. The EDA also identified that posts in r/domesticviolence tend to be longer than posts in r/depression. In addition, the data showed that authors who posted in r/domesticviolence did not post in r/depression, and vice versa. This gives additional credence to the hypothesis that authors of r/domesticviolence posts showing indications of depression through their posts may go unnoticed. 
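The two EDA checks mentioned above (post length and author overlap) can be sketched roughly as follows. The records are invented for illustration and the helper name `mean_word_count` is hypothetical; only the ```author``` and ```selftext``` field names come from the scraped data.

```python
from statistics import mean

# Invented records; only the 'author' and 'selftext' keys mirror the scraped data
dv_posts = [
    {"author": "user_a", "selftext": "he shouted at me again last night and I left"},
    {"author": "user_b", "selftext": "I do not know where to go from here"},
]
dep_posts = [
    {"author": "user_c", "selftext": "I feel empty"},
    {"author": "user_d", "selftext": "nothing helps anymore"},
]

def mean_word_count(posts):
    """Average number of whitespace-separated words per post."""
    return mean(len(p["selftext"].split()) for p in posts)

# Authors appearing in both subreddits (an empty set means no overlap)
shared_authors = {p["author"] for p in dv_posts} & {p["author"] for p in dep_posts}

print(mean_word_count(dv_posts), mean_word_count(dep_posts))
print(shared_authors)
```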

Based on the cleaned data, the feature matrix (X) was created with either ```title``` (words from title), ```post``` (words from post) or ```all_words``` (words from both title and post). The target vector is ```is_depression```. Modeling was done with three classification models: Multinomial Naive Bayes, Logistic Regression and Random Forest, each with either Count Vectorization or TF-IDF Vectorization. The models were assessed based on their performance (accuracy score, ROC AUC score, variance between train and test scores, etc). All of the models outperformed the baseline accuracy score of 0.50. The results are below:

<img src="./images/bestmodels.png" alt="Best models"/>
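For context on the baseline accuracy mentioned above: a classifier that always predicts the majority class scores that class's share of the data. The quick check below uses the raw post counts collected in this notebook; the modeling notebooks may work with slightly different counts after cleaning (an assumption here), so the result is only expected to be close to the reported 0.50.

```python
# Raw post counts collected in this notebook
n_domestic_violence = 998  # is_depression = 0
n_depression = 960         # is_depression = 1

# Accuracy of always predicting the majority class
baseline_accuracy = max(n_domestic_violence, n_depression) / (n_domestic_violence + n_depression)
print(f"{baseline_accuracy:.3f}")
```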

The final production model was determined to be the Multinomial Naive Bayes model + Count Vectorization on the ```all_words``` feature matrix due to its high accuracy, ROC AUC and sensitivity (low Type-II error) scores as well as low variance, which means the model will generalise well on unseen data. In addition, the model was tested on two posts from r/domesticviolence - one containing many words indicating depression and one without words indicating depression. It predicted ```is_depression``` = 1 for the 'depressive' post and ```is_depression``` = 0 for the 'non-depressive' post. This is evidence that the model, in addition to being able to differentiate between r/domesticviolence and r/depression posts, is able to identify posts from r/domesticviolence that contain many words indicating depression, so that mental health groups may reach out to the author. 
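To make the idea of a word-count Naive Bayes classifier concrete, here is a minimal from-scratch sketch with add-one smoothing. It is not the project's actual implementation (that is in the modeling notebooks); the function names `train_nb` and `predict_nb` are hypothetical and the four training sentences are invented.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on whitespace-tokenized docs with additive smoothing."""
    classes = sorted(set(labels))
    word_counts = {c: Counter() for c in classes}
    doc_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = set().union(*word_counts.values())
    model = {"classes": classes, "vocab": vocab, "log_prior": {}, "log_lik": {}}
    for c in classes:
        total = sum(word_counts[c].values())
        model["log_prior"][c] = math.log(doc_counts[c] / len(docs))
        model["log_lik"][c] = {
            w: math.log((word_counts[c][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
    return model

def predict_nb(model, doc):
    """Return the class with the highest log-posterior; unseen words are ignored."""
    scores = {}
    for c in model["classes"]:
        score = model["log_prior"][c]
        for w in doc.lower().split():
            if w in model["vocab"]:
                score += model["log_lik"][c][w]
        scores[c] = score
    return max(scores, key=scores.get)

# Invented toy data: 1 = r/depression-like, 0 = r/domesticviolence-like
docs = ["i feel hopeless and empty", "so tired and hopeless",
        "he hit me again", "my partner threatened me"]
labels = [1, 1, 0, 0]

model = train_nb(docs, labels)
print(predict_nb(model, "i feel so hopeless"), predict_nb(model, "he threatened me"))
```

On real data the same idea is usually applied through a vectorizer plus a library classifier rather than hand-rolled code like this.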

This model is just the tip of the iceberg in the relatively new usage of Natural Language Processing to predict mental health issues<sup>[[3]](https://www.hindawi.com/journals/cmmm/2016/8708434/)</sup>. As the use of social media and discussion sites such as Reddit increases, such tools will become increasingly important, particularly in light of increased domestic violence incidents due to COVID-19. Going forward, several things could be done to improve this model: gathering more data (words) to improve model performance, introducing an ```is_suicide``` classification (the EDA revealed words indicating suicidal thoughts in r/domesticviolence posts), and adapting the model to predict ```is_depression``` in other subreddits where people post about their traumatic experiences (similar to r/domesticviolence). 

---
## This project is split into the following notebooks
- <b>Webscraping and Data Collection</b> 
- [Preprocessing and EDA](./2_Preprocessing_and_EDA.ipynb)
- [Modeling - Multinomial Naive Bayes](./3_Modeling_Multinomial_Naive_Bayes.ipynb)
- [Modeling - Logistic Regression](./4_Modeling_Logistic_Regression.ipynb)
- [Modeling - Random Forest](./5_Modeling_Random_Forest.ipynb)
- [Production Model and Insights](./6_Production_Model_and_Insights.ipynb)

---
## Overview

In this notebook, I will scrape the subreddits r/domesticviolence and r/depression to extract the data from about 1000 posts in each subreddit via Reddit's API. After converting the data into dataframes, I will export the dataframes as .csv files for use in the next notebook.

---
## Contents of this notebook
- [Exploring r/domesticviolence page](#Exploring-r/domesticviolence-page)
- [Webscraping](#Webscraping)
- [Exporting csv files](#Exporting-csv-files)

In [1]:
import requests
import pandas as pd
import time
import random
import numpy as np

pd.set_option('display.max_columns', 200)

## Exploring r/domesticviolence page

In [2]:
# Urls for r/domesticviolence and r/depression 
url_1= 'https://www.reddit.com/r/domesticviolence/.json'
url_2= 'https://www.reddit.com/r/depression/.json'

In [3]:
res_d_violence = requests.get(url_1, headers = {'User-agent': 'Unicorn'})

In [4]:
res_d_violence.status_code

200

In [6]:
d_violence = res_d_violence.json()

In [7]:
print(d_violence)



In [8]:
d_violence['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

In [9]:
len(d_violence['data']['children'])

26

In [10]:
d_violence['data']['children'][0].keys()

dict_keys(['kind', 'data'])

In [11]:
d_violence['data']['children'][0]

{'kind': 't3',
 'data': {'approved_at_utc': None,
  'subreddit': 'domesticviolence',
  'selftext': 'We know many of you are struggling to manage with already traumatic events and now are dealing with a global pandemic of COVID-19. Many of you may be quarantined with an abuser or dealing with their ramped up abuse due to their proximity or need for that outlet. Abusers are coming back from years ago to get to you with hoovers. Being isolated is very hard and this is a situation completely unexpected and anxiety driving in and of itself. So we wanted to put together a listing of resources, many of which are listed in our resource listing in the sidebar for you in this difficult time. Stay safe out there, folks. We are right here with you, and we will get through this together. \n\n\nSupport for Domestic Abuse:\n\n* [Thehotline.org]( https://www.thehotline.org/) is available 24/7 for chat and calls (1800-787-3224) during this crisis for women, men as well as LGBTQ folks. Please be sure to

In [12]:
d_violence['data']['children'][0]['data']

{'approved_at_utc': None,
 'subreddit': 'domesticviolence',
 'selftext': 'We know many of you are struggling to manage with already traumatic events and now are dealing with a global pandemic of COVID-19. Many of you may be quarantined with an abuser or dealing with their ramped up abuse due to their proximity or need for that outlet. Abusers are coming back from years ago to get to you with hoovers. Being isolated is very hard and this is a situation completely unexpected and anxiety driving in and of itself. So we wanted to put together a listing of resources, many of which are listed in our resource listing in the sidebar for you in this difficult time. Stay safe out there, folks. We are right here with you, and we will get through this together. \n\n\nSupport for Domestic Abuse:\n\n* [Thehotline.org]( https://www.thehotline.org/) is available 24/7 for chat and calls (1800-787-3224) during this crisis for women, men as well as LGBTQ folks. Please be sure to use safe electronics to c

In [13]:
d_violence['data']['after']

't3_ha5ken'
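Putting the exploration above together: a listing response keeps its posts under `data -> children`, each child wraps the post fields in its own `data` dict, and `data -> after` holds the token for the next page. The sketch below walks a hand-written dict shaped like that response. Most fields are left out, and `summarize_listing` is a hypothetical helper.

```python
# Hand-written stand-in for res.json(); same nesting as the response explored above
listing = {
    "data": {
        "after": "t3_ha5ken",
        "children": [
            {"kind": "t3", "data": {"name": "t3_fsrd59", "subreddit": "domesticviolence"}},
            {"kind": "t3", "data": {"name": "t3_hcgbh8", "subreddit": "domesticviolence"}},
        ],
    }
}

def summarize_listing(listing_json):
    """Return (list of post names, pagination token) from one listing page."""
    names = [child["data"]["name"] for child in listing_json["data"]["children"]]
    return names, listing_json["data"]["after"]

names, after = summarize_listing(listing)
print(names, after)
```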

## Webscraping 

I will now scrape r/domesticviolence and r/depression to obtain about 1000 unique posts from each subreddit and place their data into dataframes.

I collected 998 unique posts from r/domesticviolence and 960 unique posts from r/depression.

In [1]:
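# Webscraping function
def scrape(url, number_of_scrapes, output_list_name):
    """Fetch `number_of_scrapes` pages from a listing URL and append each page's posts to `output_list_name`."""

    after = None  # pagination token returned by the previous page

    for i in range(number_of_scrapes):
        # Progress messages: the first batch, then every 4th batch
        if i == 0:
            print(f"scraping {url}")
            print(f"scraping batch {1} of {number_of_scrapes}")
        elif (i + 1) % 4 == 0:
            print(f"scraping batch {i+1} of {number_of_scrapes}")

        # The first request has no token; later requests continue from `after`.
        # If `after` comes back as None, the next request starts again from the first page.
        params = {} if after is None else {"after": after}
        res = requests.get(url, params=params, headers={'User-agent': 'Unicorn'})

        if res.status_code == 200:
            the_json = res.json()
            output_list_name.extend(the_json["data"]["children"])
            after = the_json["data"]["after"]
        else:
            # Stop on any non-200 response and report the status code
            print(res.status_code)
            break

        # Random pause between requests
        time.sleep(random.randint(1, 6))

    print("scraping done")
    print(f"no. of posts: {len(output_list_name)}")
    print(f"no. of unique posts: {len({p['data']['name'] for p in output_list_name})}")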
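One thing to note about the loop above: when a page returns `after` as `None`, the next request is sent without a token and starts again from the first page, which may explain why the raw lists scraped below contain duplicates. A variant that stops at that point could look like the following sketch. `scrape_until_exhausted` and `fetch_page` are hypothetical names; `fetch_page` stands in for the `requests.get(...).json()` call so the example runs offline.

```python
def scrape_until_exhausted(fetch_page, max_pages):
    """Collect posts page by page, stopping early once no 'after' token is returned."""
    posts = []
    after = None
    for _ in range(max_pages):
        page = fetch_page(after)
        posts.extend(page["data"]["children"])
        after = page["data"]["after"]
        if after is None:  # no further pages; looping on would restart from page one
            break
    return posts

# Fake two-page listing used in place of real HTTP requests
FAKE_PAGES = {
    None: {"data": {"children": [{"data": {"name": "t3_a"}}], "after": "t3_a"}},
    "t3_a": {"data": {"children": [{"data": {"name": "t3_b"}}], "after": None}},
}

posts = scrape_until_exhausted(lambda after: FAKE_PAGES[after], max_pages=5)
print([p["data"]["name"] for p in posts])
```

With this stopping rule the deduplication step would still be a sensible safeguard, since new posts arriving during a scrape can shift items between pages.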

In [15]:
# Create list of only unique data
def unique_list(original_list, new_list_name):
    """Append each post's data dict to `new_list_name`, skipping posts whose 'name' was already seen."""
    seen_names = set()
    for post in original_list:
        name = post["data"]["name"]
        if name not in seen_names:
            new_list_name.append(post["data"])
            seen_names.add(name)
    print(f"unique list contains {len(new_list_name)} unique posts")

In [16]:
# Scraping r/domesticviolence 52 times to try to get close to 1000 unique posts
dv_scraped = []
scrape(url_1, 52, dv_scraped)

scraping https://www.reddit.com/r/domesticviolence/.json
scraping batch 1 of 52
scraping batch 4 of 52
scraping batch 8 of 52
scraping batch 12 of 52
scraping batch 16 of 52
scraping batch 20 of 52
scraping batch 24 of 52
scraping batch 28 of 52
scraping batch 32 of 52
scraping batch 36 of 52
scraping batch 40 of 52
scraping batch 44 of 52
scraping batch 48 of 52
scraping batch 52 of 52
scraping done
no. of posts: 1299
no. of unique posts: 998


In [17]:
#list of unique posts in r/domesticviolence 
dv_unique = []
unique_list(dv_scraped, dv_unique)

unique list contains 998 unique posts


In [26]:
domestic_violence = pd.DataFrame(dv_unique)
domestic_violence['is_depression'] = 0
domestic_violence.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,downs,thumbnail_height,top_awarded_type,hide_score,name,quarantine,link_flair_text_color,upvote_ratio,author_flair_background_color,subreddit_type,ups,total_awards_received,media_embed,thumbnail_width,author_flair_template_id,is_original_content,user_reports,secure_media,is_reddit_media_domain,is_meta,category,secure_media_embed,link_flair_text,can_mod_post,score,approved_by,author_premium,thumbnail,edited,author_flair_css_class,author_flair_richtext,gildings,content_categories,is_self,mod_note,created,link_flair_type,wls,removed_by_category,banned_by,author_flair_type,domain,allow_live_comments,selftext_html,likes,suggested_sort,banned_at_utc,view_count,archived,no_follow,is_crosspostable,pinned,over_18,all_awardings,awarders,media_only,can_gild,spoiler,locked,author_flair_text,treatment_tags,visited,removed_by,num_reports,distinguished,subreddit_id,mod_reason_by,removal_reason,link_flair_background_color,id,is_robot_indexable,report_reasons,author,discussion_type,num_comments,send_replies,whitelist_status,contest_mode,mod_reports,author_patreon_flair,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,link_flair_template_id,post_hint,preview,is_depression
0,,domesticviolence,We know many of you are struggling to manage w...,t2_2egrzrvq,False,,0,False,COVID-19 RESOURCES FOR ABUSE VICTIMS,[],r/domesticviolence,False,,new,0,,,False,t3_fsrd59,False,dark,1.0,,public,67,0,{},,,False,[],,False,False,,{},[new],False,67,,False,self,False,,[],{},,True,,1585739000.0,text,,,,text,self.domesticviolence,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,False,False,False,False,[],[],False,False,False,True,,[],False,,,moderator,t5_2s2fr,,,,fsrd59,True,,ImYesILeffHisAss2398,,0,True,,False,[],False,,/r/domesticviolence/comments/fsrd59/covid19_re...,,True,https://www.reddit.com/r/domesticviolence/comm...,10666,1585710000.0,1,,False,,,,0
1,,domesticviolence,Maybe this doesnt belong here Im not sure wher...,t2_6dnmfknj,False,,0,False,Im stupid. How long to feel better after minor...,[],r/domesticviolence,False,,m-be TW Multiple Triggers,0,,,False,t3_hcgbh8,False,light,0.9,,public,8,0,{},,,False,[],,False,False,,{},Trigger Warning: Multiple Triggers,False,8,,False,nsfw,False,,[],{},,True,,1592659000.0,text,,,,text,self.domesticviolence,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,False,False,False,True,[],[],False,False,False,False,,[],False,,,,t5_2s2fr,,,#014980,hcgbh8,True,,Outofsight9123,,24,True,,False,[],False,,/r/domesticviolence/comments/hcgbh8/im_stupid_...,,False,https://www.reddit.com/r/domesticviolence/comm...,10666,1592630000.0,0,,False,7d985224-5cda-11ea-aa5c-0e4c53184455,,,0
2,,domesticviolence,My main questions are at the bottom if you jus...,t2_6yb1144r,False,,0,False,Vent but advice/knowledge is appreciated. My b...,[],r/domesticviolence,False,,m-be TW Multiple Triggers,0,,,False,t3_hcha3d,False,light,1.0,,public,2,0,{},,,False,[],,False,False,,{},Trigger Warning: Multiple Triggers,False,2,,False,self,False,,[],{},,True,,1592664000.0,text,,,,text,self.domesticviolence,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2s2fr,,,#014980,hcha3d,True,,doerayisme,,0,True,,False,[],False,,/r/domesticviolence/comments/hcha3d/vent_but_a...,,False,https://www.reddit.com/r/domesticviolence/comm...,10666,1592635000.0,0,,False,7d985224-5cda-11ea-aa5c-0e4c53184455,,,0
3,,domesticviolence,I'm a 34 year old guy who lost his job to covi...,t2_6zh3tj02,False,,0,False,I feel stupid for posting this,[],r/domesticviolence,False,,new,0,,,False,t3_hch4oy,False,dark,1.0,,public,2,0,{},,,False,[],,False,False,,{},[new],False,2,,False,self,False,,[],{},,True,,1592663000.0,text,,,,text,self.domesticviolence,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2s2fr,,,,hch4oy,True,,Snoo_12868,,1,True,,False,[],False,,/r/domesticviolence/comments/hch4oy/i_feel_stu...,,False,https://www.reddit.com/r/domesticviolence/comm...,10666,1592634000.0,0,,False,,,,0
4,,domesticviolence,Not sure if what I'm experiencing could be ver...,t2_6zd0zopf,False,,0,False,Is this verbal or emotional abuse or am I over...,[],r/domesticviolence,False,,new,0,,,False,t3_hcbfp4,False,dark,1.0,,public,6,0,{},,,False,[],,False,False,,{},[new],False,6,,False,self,False,,[],{},,True,,1592639000.0,text,,,,text,self.domesticviolence,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2s2fr,,,,hcbfp4,True,,throwawayyy654765,,7,True,,False,[],False,,/r/domesticviolence/comments/hcbfp4/is_this_ve...,,False,https://www.reddit.com/r/domesticviolence/comm...,10666,1592610000.0,0,,False,,,,0


In [22]:
# Scraping r/depression 52 times to try to get close to 1000 unique posts
depression_scraped = []
scrape(url_2, 52, depression_scraped)

scraping https://www.reddit.com/r/depression/.json
scraping batch 1 of 52
scraping batch 4 of 52
scraping batch 8 of 52
scraping batch 12 of 52
scraping batch 16 of 52
scraping batch 20 of 52
scraping batch 24 of 52
scraping batch 28 of 52
scraping batch 32 of 52
scraping batch 36 of 52
scraping batch 40 of 52
scraping batch 44 of 52
scraping batch 48 of 52
scraping batch 52 of 52
scraping done
no. of posts: 1287
no. of unique posts: 960


In [23]:
depression_unique = []
unique_list(depression_scraped, depression_unique)

unique list contains 960 unique posts


In [24]:
depression = pd.DataFrame(depression_unique)
depression['is_depression'] = 1
depression.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,downs,top_awarded_type,hide_score,name,quarantine,link_flair_text_color,upvote_ratio,author_flair_background_color,subreddit_type,ups,total_awards_received,media_embed,author_flair_template_id,is_original_content,user_reports,secure_media,is_reddit_media_domain,is_meta,category,secure_media_embed,link_flair_text,can_mod_post,score,approved_by,author_premium,thumbnail,edited,author_flair_css_class,author_flair_richtext,gildings,content_categories,is_self,mod_note,created,link_flair_type,wls,removed_by_category,banned_by,author_flair_type,domain,allow_live_comments,selftext_html,likes,suggested_sort,banned_at_utc,view_count,archived,no_follow,is_crosspostable,pinned,over_18,all_awardings,awarders,media_only,can_gild,spoiler,locked,author_flair_text,treatment_tags,visited,removed_by,num_reports,distinguished,subreddit_id,mod_reason_by,removal_reason,link_flair_background_color,id,is_robot_indexable,report_reasons,author,discussion_type,num_comments,send_replies,whitelist_status,contest_mode,mod_reports,author_patreon_flair,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,author_cakeday,is_depression
0,,depression,We understand that most people who reply immed...,t2_1t70,False,,0,False,Our most-broken and least-understood rules is ...,[],r/depression,False,0,,0,,False,t3_doqwow,False,dark,1.0,,public,2324,1,{},,False,[],,False,False,,{},,False,2324,,True,,False,,[],{},,True,,1572390000.0,text,0,,,text,self.depression,True,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,confidence,,,True,False,False,False,False,"[{'giver_coin_reward': 0, 'subreddit_id': None...",[],False,False,False,False,,[],False,,,moderator,t5_2qqqf,,,,doqwow,True,,SQLwitch,,176,True,no_ads,False,[],False,,/r/depression/comments/doqwow/our_mostbroken_a...,no_ads,True,https://www.reddit.com/r/depression/comments/d...,648266,1572361000.0,0,,False,,1
1,,depression,Welcome to /r/depression's check-in post - a p...,t2_64qjj,False,,0,False,Regular Check-In Post,[],r/depression,False,0,,0,,False,t3_exo6f1,False,dark,1.0,,public,971,0,{},,False,[],,False,False,,{},,False,971,,False,,False,,[],{},,True,,1580678000.0,text,0,,,text,self.depression,True,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,new,,,False,False,False,False,False,[],[],False,False,False,False,,[],False,,,moderator,t5_2qqqf,,,,exo6f1,True,,circinia,,5205,False,no_ads,False,[],False,,/r/depression/comments/exo6f1/regular_checkin_...,no_ads,True,https://www.reddit.com/r/depression/comments/e...,648266,1580649000.0,0,,False,,1
2,,depression,Even if some posts blow up and have a bit of a...,t2_5xpk5iif,False,,1,False,This sub is counterproductive,[],r/depression,False,0,,0,,False,t3_hcco2h,False,dark,0.98,,public,1473,4,{},,False,[],,False,False,,{},,False,1473,,True,,False,,[],"{'gid_1': 1, 'gid_2': 1}",,True,,1592643000.0,text,0,,,text,self.depression,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,confidence,,,False,False,False,False,False,"[{'giver_coin_reward': None, 'subreddit_id': N...",[],False,False,False,False,,[],False,,,,t5_2qqqf,,,,hcco2h,True,,thiswhereipost,,103,True,no_ads,False,[],False,,/r/depression/comments/hcco2h/this_sub_is_coun...,no_ads,False,https://www.reddit.com/r/depression/comments/h...,648266,1592614000.0,0,,False,,1
3,,depression,As i go down the rabbit hole of why any of thi...,t2_564vn2mq,False,,0,False,The more depressed i get the more music i list...,[],r/depression,False,0,,0,,False,t3_hca12w,False,dark,1.0,,public,250,0,{},,False,[],,False,False,,{},,False,250,,False,,False,,[],{},,True,,1592634000.0,text,0,,,text,self.depression,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,confidence,,,False,False,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qqqf,,,,hca12w,True,,JinxedlLeague1,,32,True,no_ads,False,[],False,,/r/depression/comments/hca12w/the_more_depress...,no_ads,False,https://www.reddit.com/r/depression/comments/h...,648266,1592605000.0,0,,False,,1
4,,depression,I miss having someone to love. I miss holding ...,t2_121smx8c,False,,0,False,I miss having someone to love,[],r/depression,False,0,,0,,False,t3_hchfc7,False,dark,1.0,,public,30,0,{},,False,[],,False,False,,{},,False,30,,False,,False,,[],{},,True,,1592665000.0,text,0,,,text,self.depression,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,confidence,,,False,False,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qqqf,,,,hchfc7,True,,goatfucker21,,21,True,no_ads,False,[],False,,/r/depression/comments/hchfc7/i_miss_having_so...,no_ads,False,https://www.reddit.com/r/depression/comments/h...,648266,1592636000.0,0,,False,,1


## Exporting csv files

Exporting the ```domestic_violence``` and ```depression``` dataframes for use in the subsequent notebook.

In [27]:
domestic_violence.to_csv('./data/domestic_violence_raw.csv', index=False)
depression.to_csv('./data/depression_raw.csv', index=False)