<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span><ul class="toc-item"><li><span><a href="#Executive-Summary" data-toc-modified-id="Executive-Summary-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Executive Summary</a></span></li><li><span><a href="#Problem-Statement" data-toc-modified-id="Problem-Statement-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Problem Statement</a></span><ul class="toc-item"><li><span><a href="#About-the-Source" data-toc-modified-id="About-the-Source-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>About the Source</a></span></li></ul></li><li><span><a href="#Objective" data-toc-modified-id="Objective-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Objective</a></span><ul class="toc-item"><li><span><a href="#Data-Science-Problem" data-toc-modified-id="Data-Science-Problem-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>Data Science Problem</a></span></li></ul></li></ul></li><li><span><a href="#Data-Scraping" data-toc-modified-id="Data-Scraping-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Scraping</a></span></li></ul></div>

## Introduction

### Executive Summary
ABC Research Group was tasked with creating a chat bot for its client, the Department of Psychology of the National University of Singapore. The chat bot uses NLP to predict the nature of enquirers' problems and direct them to the right staff-on-hand, reducing both the need for paid volunteer staff and the cost of miscategorized enquiries.

In Phase I of this project, a model trained on posts from Reddit's r/ADHD and r/OCD subreddits predicted which subreddit a post belonged to with an accuracy of nearly 90%, using lemmatized text, sklearn's TF-IDF Vectorizer and a Multinomial Naive Bayes classifier. This model was also better at inference than a competing model using sklearn's Count Vectorizer and Logistic Regression. Data from Reddit was used in the absence of prior data from the Department itself.

Our process is summarized as follows:

1. Data from r/ADHD and r/OCD is scraped and then wrangled.
2. Duplicated data is removed from our dataset. The text-based data we will use for NLP is merged into a single column.
3. Data is lemmatized and stop words are removed.
4. Data is processed by combinations of the TF-IDF and Count Vectorizers with the Multinomial Naive Bayes and Logistic Regression classifiers.
5. Important words and patterns in the predictive features are analyzed for insights into the data.
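
Steps 2 and 3 above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual preprocessing: the helper names `merge_text` and `tokenize` and the tiny stop-word list are assumptions made for the example, and lemmatization is left out because it needs an external NLP library.

```python
import re

# Illustrative stop-word list only; a real run would use a much fuller list.
STOP_WORDS = {"a", "an", "and", "i", "is", "it", "my", "of", "the", "to"}


def merge_text(post):
    """Join a post's title and body into a single string (step 2)."""
    title = post.get("title") or ""
    body = post.get("selftext") or ""
    return f"{title} {body}".strip()


def tokenize(text):
    """Lowercase the text, split it into word tokens and drop stop words (step 3)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


# A shortened version of a post from r/ADHD.
post = {
    "title": "What are you proud of today?",
    "selftext": "Let us celebrate your successes with you!",
}
tokens = tokenize(merge_text(post))
print(tokens)
```

Merging the title and body first means a post with an empty body still contributes its title to the model's vocabulary.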

While the Phase I model is sufficient for determining whether an enquiry pertains to ADHD or OCD, other algorithms will have to be utilized if finer classifications are needed. For example, the chat bot may need to distinguish an enquirer who is already diagnosed and simply needs to buy medicine for his condition from an enquirer who has yet to see a professional for a diagnosis and is attempting to find out more about the condition he believes he has.

### Problem Statement
In 2020, the Department of Psychology of the National University of Singapore noticed a steep increase in the use of its internet-based chat and phone helpline tools by patients or enquirers seeking psychiatric help. This is likely a direct result of the Covid-19 pandemic and the restricted mobility of both professionals and their patients.

It was noted during this period that a large subset of participants who reached out had no formal diagnosis, but had diagnosed themselves and misrepresented themselves as formally diagnosed. Many human volunteer chatline and phone desk handlers, many of whom had only short periods of professional training and worked with word prompt charts, spent a significant amount of time investigating and attempting to guess at the true nature of participants' issues before they could forward enquirers to the right staff-on-hand for follow-up. These forwarding attempts were sometimes also incorrect because the nature of an enquirer's issue had been wrongly assessed, necessitating a second redirect to another staff-on-hand, which increased waiting time for enquirers and reduced the productivity of professional staff-on-hand.

The ABC Research Group is tasked with creating a chat bot using NLP to correctly predict the nature of enquirers' problems and to accurately direct them to the right staff-on-hand. It is envisioned that doing so will save the Department of Psychology approximately SGD 10,000 per month in work hours spent resolving incorrect volunteer diagnoses and redirecting enquirers.

In Phase I of this project, we will attempt to study the semantics and attitudes of teenagers and adults with ADHD and OCD, two of the most common neuropsychiatric disorders in paediatric populations. These two disorders are chosen over other common disorders like depression because of the similarity in their coping mechanisms (which enquirers have been observed to describe in great detail). This similarity leads to easy misdiagnosis by both enquirers and volunteers, making the pair a suitable starting point for the development of our chat bot.

For the chat bot to operate, it will need to be trained with supervised learning on already available data. Unfortunately, the current chatline program was designed not to save its data for later reference (for the sake of enquirer privacy), and so alternative sources of data will have to be found until the new chat bot collects new anonymous data.

#### About the Source
Reddit is a popular web forum where authors ('redditors') write about their personal experiences under anonymous pseudonyms. As with the Department of Psychology's helplines, an increase in forum participation was noted in 2020. Redditors often write about their personal experiences in an attempt to find companionship in a greater online community. While Reddit is a modern and continually updated source of information for our NLP model, we note that its participants are mostly from the USA, and so the data collected may reflect trends in the treatment of ADHD and OCD in the USA.

For example, Adderall, the most popular drug used to treat ADHD in the USA, is not available in Singapore.

We will account for these biases in our study.

### Objective
Our objective is to determine patterns in the contemporary language used to express experiences related to ADHD and OCD on Reddit, for the purpose of training a chat bot to accurately determine the context of an enquiry. The model will be scored based on its overall accuracy.
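
As a reminder of what "overall accuracy" measures, here is a minimal sketch: the share of posts whose predicted subreddit matches the true one. The function name `accuracy` and the labels below are made up for illustration and are not results from this project.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty set of predictions")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels purely for illustration.
y_true = ["ADHD", "OCD", "ADHD", "OCD", "ADHD"]
y_pred = ["ADHD", "OCD", "OCD", "OCD", "ADHD"]
score = accuracy(y_true, y_pred)
print(score)  # 4 of 5 predictions are correct
```

Accuracy is a reasonable headline score here only if the two subreddits contribute a similar number of posts; with very unbalanced classes it can look high even for a poor model.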

We will do this by first developing a model to accurately predict if a post originated from r/ADHD or r/OCD. While we attempt to make the model adequate at prediction, improvements to the chat bot will require that the chat bot is optimized for inference. Further improvements will be built upon this model. 

#### Data Science Problem
From the model:
1. What are the 35 most important words used by redditors on r/ADHD and r/OCD when discussing their respective conditions?
2. How do these words, broadly, compare to each other in terms of importance?
3. Are there patterns in the words that we can observe and describe?
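
To make the idea of "important words" concrete, the sketch below ranks words by a smoothed log-odds score between two groups of tokens. This is a stand-in built on raw counts, not the fitted model used in this project, and it only resembles the way a Multinomial Naive Bayes classifier compares per-class word probabilities. The function names and the toy tokens are assumptions made for the example.

```python
import math
from collections import Counter


def log_odds(tokens_a, tokens_b):
    """Return {word: log P(word | A) - log P(word | B)} with add-one smoothing."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + len(vocab)
    total_b = sum(counts_b.values()) + len(vocab)
    return {
        w: math.log((counts_a[w] + 1) / total_a) - math.log((counts_b[w] + 1) / total_b)
        for w in vocab
    }


def top_words(scores, n):
    """Words most indicative of group A (highest scores) and of group B (lowest scores)."""
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:n], ranked[::-1][:n]


# Toy tokens, invented for illustration only.
adhd_tokens = ["focus", "focus", "medication", "forget", "today"]
ocd_tokens = ["intrusive", "intrusive", "compulsion", "thought", "today"]
scores = log_odds(adhd_tokens, ocd_tokens)
adhd_top, ocd_top = top_words(scores, 2)
print(adhd_top, ocd_top)
```

A word used equally often in both groups, such as "today" in the toy data, scores zero and so carries no information about which subreddit a post came from.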

## Data Scraping

We begin by scraping Reddit posts from r/ADHD and r/OCD, using the 'requests' library for Python.

In [1]:
# Import required libraries
import re
import requests
import pandas as pd
import time
import random

In [2]:
# Target r/ADHD and r/OCD via their .json links.
url = 'https://www.reddit.com/r/adhd.json'
res = requests.get(url, headers={'User-agent': 'Paradis Iron Inc 1.03'})
res.status_code

200

In the code below, we will look for the appropriate nested dictionaries from which to obtain relevant data.

In [3]:
reddit_dict = res.json()

In [4]:
print(reddit_dict)

{'kind': 'Listing', 'data': {'modhash': '', 'dist': 27, 'children': [{'kind': 't3', 'data': {'approved_at_utc': None, 'subreddit': 'ADHD', 'selftext': "Got a good grade on a test? A new promotion at work? Finally finished a chore you've been putting off? We want to hear about it! Let us celebrate your successes with you!", 'author_fullname': 't2_6l4z3', 'saved': False, 'mod_reason_title': None, 'gilded': 0, 'clicked': False, 'title': 'What are you proud of today?', 'link_flair_richtext': [], 'subreddit_name_prefixed': 'r/ADHD', 'hidden': False, 'pwls': 0, 'link_flair_css_class': '', 'downs': 0, 'thumbnail_height': None, 'top_awarded_type': None, 'hide_score': False, 'name': 't3_m95m9x', 'quarantine': False, 'link_flair_text_color': 'light', 'upvote_ratio': 0.91, 'author_flair_background_color': None, 'subreddit_type': 'public', 'ups': 16, 'total_awards_received': 1, 'media_embed': {}, 'thumbnail_width': None, 'author_flair_template_id': None, 'is_original_content': False, 'user_reports

In [5]:
reddit_dict.keys()

dict_keys(['kind', 'data'])

In [6]:
reddit_dict['data']['children']

[{'kind': 't3',
  'data': {'approved_at_utc': None,
   'subreddit': 'ADHD',
   'selftext': "Got a good grade on a test? A new promotion at work? Finally finished a chore you've been putting off? We want to hear about it! Let us celebrate your successes with you!",
   'author_fullname': 't2_6l4z3',
   'saved': False,
   'mod_reason_title': None,
   'gilded': 0,
   'clicked': False,
   'title': 'What are you proud of today?',
   'link_flair_richtext': [],
   'subreddit_name_prefixed': 'r/ADHD',
   'hidden': False,
   'pwls': 0,
   'link_flair_css_class': '',
   'downs': 0,
   'thumbnail_height': None,
   'top_awarded_type': None,
   'hide_score': False,
   'name': 't3_m95m9x',
   'quarantine': False,
   'link_flair_text_color': 'light',
   'upvote_ratio': 0.91,
   'author_flair_background_color': None,
   'subreddit_type': 'public',
   'ups': 16,
   'total_awards_received': 1,
   'media_embed': {},
   'thumbnail_width': None,
   'author_flair_template_id': None,
   'is_original_content':

In [7]:
# Posts are found at this level
posts = [p['data'] for p in reddit_dict['data']['children']]
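
The nesting explored above can be summarized with a small hand-written stand-in for `res.json()`. The dictionary below keeps only a few fields and is an illustration, not a real response; the helper name `extract_posts` is an assumption for the example.

```python
# A hand-written stand-in for res.json(); real listings carry many more fields.
sample_listing = {
    "kind": "Listing",
    "data": {
        "after": "t3_m9mnny",
        "children": [
            {
                "kind": "t3",
                "data": {
                    "name": "t3_m95m9x",
                    "subreddit": "ADHD",
                    "title": "What are you proud of today?",
                },
            },
        ],
    },
}


def extract_posts(listing):
    """Return the per-post dictionaries and the pagination cursor of a listing."""
    data = listing["data"]
    posts = [child["data"] for child in data["children"]]
    return posts, data.get("after")


sample_posts, cursor = extract_posts(sample_listing)
print(sample_posts[0]["title"], cursor)
```

Each entry under `children` wraps one post, and the `after` value marks where the next request should continue.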

In [8]:
posts

[{'approved_at_utc': None,
  'subreddit': 'ADHD',
  'selftext': "Got a good grade on a test? A new promotion at work? Finally finished a chore you've been putting off? We want to hear about it! Let us celebrate your successes with you!",
  'author_fullname': 't2_6l4z3',
  'saved': False,
  'mod_reason_title': None,
  'gilded': 0,
  'clicked': False,
  'title': 'What are you proud of today?',
  'link_flair_richtext': [],
  'subreddit_name_prefixed': 'r/ADHD',
  'hidden': False,
  'pwls': 0,
  'link_flair_css_class': '',
  'downs': 0,
  'thumbnail_height': None,
  'top_awarded_type': None,
  'hide_score': False,
  'name': 't3_m95m9x',
  'quarantine': False,
  'link_flair_text_color': 'light',
  'upvote_ratio': 0.91,
  'author_flair_background_color': None,
  'subreddit_type': 'public',
  'ups': 16,
  'total_awards_received': 1,
  'media_embed': {},
  'thumbnail_width': None,
  'author_flair_template_id': None,
  'is_original_content': False,
  'user_reports': [],
  'secure_media': None,


In [9]:
pd.DataFrame(posts)

| | approved_at_utc | subreddit | selftext | author_fullname | saved | mod_reason_title | gilded | clicked | title | link_flair_richtext | ... | author_flair_text_color | permalink | parent_whitelist_status | stickied | url | subreddit_subscribers | created_utc | num_crossposts | media | is_video |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | ADHD | Got a good grade on a test? A new promotion at... | t2_6l4z3 | False | | 0 | False | What are you proud of today? | [] | ... | | /r/ADHD/comments/m95m9x/what_are_you_proud_of_... | no_ads | True | https://www.reddit.com/r/ADHD/comments/m95m9x/... | 1162301 | 1616242000.0 | 0 | | False |
| 1 | | ADHD | Hey y'all!\n\nI got a suggestion on our Discor... | t2_1op0u22q | False | | 0 | False | Productivity tools: looking for recommendations! | [] | ... | dark | /r/ADHD/comments/m3pm79/productivity_tools_loo... | no_ads | True | https://www.reddit.com/r/ADHD/comments/m3pm79/... | 1162301 | 1615579000.0 | 0 | | False |
| 2 | | ADHD | Is it just me or anyone else face the situatio... | t2_15k1zt | False | | 0 | False | Starting a new tv show or a movie is so hard | [] | ... | | /r/ADHD/comments/m9bbvy/starting_a_new_tv_show... | no_ads | False | https://www.reddit.com/r/ADHD/comments/m9bbvy/... | 1162301 | 1616260000.0 | 0 | | False |
| 3 | | ADHD | "That's not good, they usually reply back to t... | t2_9ifauv4i | False | | 0 | False | That feeling of rejection because you took 10 ... | [] | ... | | /r/ADHD/comments/m9fop4/that_feeling_of_reject... | no_ads | False | https://www.reddit.com/r/ADHD/comments/m9fop4/... | 1162301 | 1616273000.0 | 0 | | False |
| 4 | | ADHD | **TW: language, depression, alcohol, death, su... | t2_4utg1byl | False | | 0 | False | Confession: I cheated. A lot. Can you relate? | [] | ... | | /r/ADHD/comments/m97xla/confession_i_cheated_a... | no_ads | False | https://www.reddit.com/r/ADHD/comments/m97xla/... | 1162301 | 1616250000.0 | 0 | | False |
| 5 | | ADHD | It's like clockwork for me. I will feel fine, ... | t2_h4kdz | False | | 0 | False | Do you guys get depressed after hanging out wi... | [] | ... | | /r/ADHD/comments/m9ja84/do_you_guys_get_depres... | no_ads | False | https://www.reddit.com/r/ADHD/comments/m9ja84/... | 1162301 | 1616284000.0 | 0 | | False |
| 6 | | ADHD | At least my wife and I were able to get a good... | t2_hbe1n | False | | 0 | False | I just got up to refill my giant cup of ice wa... | [] | ... | | /r/ADHD/comments/m9lhub/i_just_got_up_to_refil... | no_ads | False | https://www.reddit.com/r/ADHD/comments/m9lhub/... | 1162301 | 1616291000.0 | 0 | | False |
| 7 | | ADHD | I was just talking to my wife about my recent ... | t2_59vtkihv | False | | 0 | False | Irony while opening up to my wife about feelin... | [] | ... | | /r/ADHD/comments/m9e1c2/irony_while_opening_up... | no_ads | False | https://www.reddit.com/r/ADHD/comments/m9e1c2/... | 1162301 | 1616268000.0 | 0 | | False |
| 8 | | ADHD | i don’t think we talk enough about how traumat... | t2_7qd1nohe | False | | 0 | False | i don’t think we talk enough about how traumat... | [] | ... | | /r/ADHD/comments/m8y9ut/i_dont_think_we_talk_e... | no_ads | False | https://www.reddit.com/r/ADHD/comments/m8y9ut/... | 1162301 | 1616210000.0 | 0 | | False |
| 9 | | ADHD | I’m angry because I’m only few days into start... | t2_36zr8jlm | False | | 0 | False | I’ve recently been diagnosed at 22 and I’m jus... | [] | ... | | /r/ADHD/comments/m97k88/ive_recently_been_diag... | no_ads | False | https://www.reddit.com/r/ADHD/comments/m97k88/... | 1162301 | 1616249000.0 | 0 | | False |


Using the code below, we iteratively build a list of posts like the one shown in the pd.DataFrame above. The loop prints each URL it requests, which includes the 'after' marker of the last post received, and requests 25 Reddit posts' worth of information at a time. It also pauses for a randomly selected duration between requests to avoid being disconnected or rejected by Reddit's servers.

In [10]:
posts = []
after = None

for _ in range(56): # number of requests to make; we are trying to obtain up to this many times 25 posts in total
    if after is None: # if this is a new request loop, start from the latest post
        current_url = url
    else:
        # otherwise, if we are already iterating, select the posts
        # after the final post from the last request
        current_url = url + '?after=' + after
    print(current_url) # let the user know where we currently are
    res = requests.get(current_url, headers={'User-agent': 'Buragini Inc 563'})
    
    if res.status_code != 200: # if the request is not accepted (any HTTP status code other than 200),
                                # break out of the for loop and end requests
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts) # add on requested posts to collection of posts requested
    after = current_dict['data']['after']
    
    # generate a random sleep duration to attempt to avoid being disconnected
    sleep_duration = random.randint(10,70)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/adhd.json
69
https://www.reddit.com/r/adhd.json?after=t3_m9mnny
42
https://www.reddit.com/r/adhd.json?after=t3_m9ozro
41
https://www.reddit.com/r/adhd.json?after=t3_m9mjsv
46
https://www.reddit.com/r/adhd.json?after=t3_m9oiyx
65
https://www.reddit.com/r/adhd.json?after=t3_m9mcrq
45
https://www.reddit.com/r/adhd.json?after=t3_m93473
60
https://www.reddit.com/r/adhd.json?after=t3_m9cakv
14
https://www.reddit.com/r/adhd.json?after=t3_m9008v
20
https://www.reddit.com/r/adhd.json?after=t3_m96j4m
69
https://www.reddit.com/r/adhd.json?after=t3_m8x7l9
32
https://www.reddit.com/r/adhd.json?after=t3_m8z2oh
50
https://www.reddit.com/r/adhd.json?after=t3_m8xmie
48
https://www.reddit.com/r/adhd.json?after=t3_m8rl98
70
https://www.reddit.com/r/adhd.json?after=t3_m8fipa
23
https://www.reddit.com/r/adhd.json?after=t3_m8muug
41
https://www.reddit.com/r/adhd.json?after=t3_m8nr6f
45
https://www.reddit.com/r/adhd.json?after=t3_m8lucn
47
https://www.reddit.com/r/adhd.json?after=t3_
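
The pagination pattern used above can also be written as a function that can be tried without contacting Reddit. This is a sketch under stated assumptions, not the code that produced the output above: `collect_posts`, `fetch_json` and the fake pages are hypothetical names and data, and the sketch assumes a listing reports `after` as null on its last page, in which case it stops rather than starting again from the newest post.

```python
import random
import time


def collect_posts(fetch_json, base_url, max_pages, sleep_range=(0, 0)):
    """Follow a listing's 'after' cursor for up to max_pages requests.

    fetch_json is any callable that takes a URL and returns the parsed
    listing dictionary; it is injected so the loop can be tried offline.
    """
    posts, after = [], None
    for _ in range(max_pages):
        url = base_url if after is None else f"{base_url}?after={after}"
        listing = fetch_json(url)
        posts.extend(child["data"] for child in listing["data"]["children"])
        after = listing["data"]["after"]
        if after is None:  # no further pages: stop instead of restarting from the top
            break
        time.sleep(random.randint(*sleep_range))  # pause between requests
    return posts


# Two fake pages standing in for Reddit's responses (hypothetical data).
BASE = "https://example.invalid/r/demo.json"
FAKE_PAGES = {
    BASE: {
        "data": {
            "after": "t3_b",
            "children": [{"data": {"name": "t3_a"}}, {"data": {"name": "t3_b"}}],
        }
    },
    BASE + "?after=t3_b": {
        "data": {"after": None, "children": [{"data": {"name": "t3_c"}}]}
    },
}

collected = collect_posts(FAKE_PAGES.__getitem__, BASE, max_pages=5)
print([p["name"] for p in collected])
```

For a real run, `fetch_json` could wrap `requests.get(...).json()` and `sleep_range` could be set to `(10, 70)` to match the pauses used above.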

In [11]:
data = pd.DataFrame(posts)

In [12]:
# Save collected posts to a csv file for use in the next section,
# stamped with today's date to keep a version history
from datetime import date
today = date.today().strftime("%y%m%d")
data.to_csv(f'../datasets/adhd_{today}.csv', index=False)
today

'210321'

This process was repeated for r/OCD. We will look closely into our collected data in the next section.
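
One way to repeat the process for both subreddits without editing the cells by hand is to derive the URL and the output path from the subreddit name. The sketch below is an assumption about how that could look: the helper names are hypothetical, and the r/OCD URL and file name are inferred by analogy with the r/ADHD ones used above.

```python
from datetime import date
from pathlib import PurePosixPath


def listing_url(subreddit):
    """Build the .json listing URL for a subreddit, e.g. 'ADHD' or 'OCD'."""
    return f"https://www.reddit.com/r/{subreddit.lower()}.json"


def dated_csv_path(subreddit, day, folder="../datasets"):
    """Build a date-stamped file path such as ../datasets/adhd_210321.csv."""
    filename = f"{subreddit.lower()}_{day.strftime('%y%m%d')}.csv"
    return str(PurePosixPath(folder) / filename)


# The date is fixed here so that the example is reproducible.
for name in ("ADHD", "OCD"):
    print(listing_url(name), dated_csv_path(name, date(2021, 3, 21)))
```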