# NLP, DeepLearning, & Reddit Project: Scrape Data

**Background**: 
Having a mental health issue can be a confusing and isolating experience. One way individuals seek information and support is through social media forums such as Reddit. People often share their emotions looking for connection or somewhere to just voice their feelings. This presents an opportunity for researchers to learn more about how those with emotional pain and/or a psychiatric illness may reach out for help, what words they use, and what type of language is more characteristic of one illness over another. The use of NLP in analyzing support communities for psychiatric illness can help us gain insight into hard to understand diseases in order to offer better support for patients.

**Problem Statement**: 
Building on a previous project using Natural Language Processing (NLP) to analyze texts from two subreddits r/Anorexia and r/Bulimia and further inspired by a *Nature*  Scientific Report paper titled "A deep learning model for detecting mental illness from user content on social media", this project aims to employ a Convolutional Neural Network (CNN) classification model  based on user comments from Reddit to identify whether a user's post belongs to a mental illness subreddit category.

**Research Questions**: 
- Can we identify whether a user's post belongs to a mental illness subreddit category? 
- How have posts changed pre and post onset of covid-19?
- Which mental illness category seems to have changed the most? 

TO DO -
- Mention the motivation behind including eating disorders categories, and reference the MIT article

**Plan**:
Scrape data from the following subreddits: 
- r/mentalhealth
- r/depression
- r/Anxiety
- r/bipolar
- r/BPD
- r/schizophrenia
- r/autism
- r/AnorexiaNervosa
- r/Bulimia

Start with 4000(?) comments for each subreddit, 2000 pre and 2000 post covid onset
- !! May want more files - Can or should I use a cloud for this if my computer doesn't have enough storage?
- the Nature article had a lot more, but my model from the previous project had a high accuracy score with only 2000 comments
- This could be too much data for my computer to handle
- May need more from the r/mentalhealth subreddit... come back to this

In [69]:
import pandas as pd
import requests
import json
import csv
import time
import datetime

In [70]:
# Use pushshift api
url = "https://api.pushshift.io/reddit/search/submission"

- TO DO: Comment code more to explain function - save function as script???????
- Edit function that returns data frame to anonymize df within function... 
- Edit function to collect data based on UTC, find out what the utc is for pre-Feb 2020
- Change 'created_UTC' into a date 

In [3]:
# Code help from: https://www.textjuicer.com/2019/07/crawling-all-submissions-from-a-subreddit/

In [71]:
#### ADD MORE COMMENTS TO EXPLAIN WHAT THE CODE IS DOING

def crawl_page(subreddit: str, before: str, after: str, last_page = None):
    
    '''Crawl a page of results from a given subreddit.
    subreddit: The subreddit page, last_page: The last downloaded page.'''
        
    params = {"subreddit": subreddit,
              "before": before,
              "after": after,
              "size": 100, 
              "sort": "desc", 
              "sort_type": "created_utc"}
    
    if last_page is not None:
        if len(last_page) > 0:
            # resume from where we left at the last page
            params["before"] = last_page[-1]["created_utc"]
        else:
            # the last page was empty, or we are past the last page
            return []
    results = requests.get(url, params)
    if not results.ok:
        # alerts us that something went wrong
        raise Exception("Server returned status code {}".format(results.status_code))
    return results.json()["data"]

In [73]:
def crawl_subreddit(subreddit, before, after, max_submissions = 1000):
    
    '''Crawl submissions from a subreddit. 
    subreddit: The subreedit page, max_submissions: The max number of
    submissions to download.'''
    
    submissions = []
    last_page = None
    while last_page != [] and len(submissions) < max_submissions:
        last_page = crawl_page(subreddit, after, last_page)
        submissions += last_page
        # import time as to not flood the server
        time.sleep(2)
        #return submissions as a dataframe
    return pd.DataFrame(submissions[:max_submissions])

In [None]:
## Before and after time stamps don't seem to work unless it's a small number (i.e. 100)
# To Do- see if I can find other code that helps do this

In [74]:
# r/bulimia
bulimia_comments_PreCovid = crawl_subreddit('bulimia', 
                                            before = '1584743344', # before March 20, 2021  
                                            after = '1584487370') # after March 20, 2017

Exception: Server returned status code 414

In [75]:
# r/bulimia
bulimia_comments_PostCovid = crawl_subreddit('bulimia', 
                                            before = 'None', # most recent
                                            after = '1584743344') # after March 20, 2021

Exception: Server returned status code 414

In [None]:
# Check dataframe 



In [None]:
# Save to csv



In [62]:
# r/mentalhealth
mentalhealth_comments_PreCovid = crawl_subreddit('mentalhealth', 
                                            before = '1584743344', # before March 20, 2021  
                                            after = '1584487370') # after March 20, 2017

TypeError: crawl_subreddit() got an unexpected keyword argument 'before'

In [63]:
# r/mentalhealth
mentalhealth_comments_PostCovid = crawl_subreddit('mentalhealth') # after March 20, 2021

ConnectionError: HTTPSConnectionPool(host='api.pushshift.io', port=443): Max retries exceeded with url: /reddit/search/submission?subreddit=mentalhealth&size=100&sort=desc&sort_type=created_utc&before=1613692712 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x12f071ed0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))

In [53]:
mentalhealth_comments_PostCovid.shape

(2000, 73)

In [59]:
mentalhealth_comments_PostCovid['created_utc']

0       1614034399
1       1614034247
2       1614034110
3       1614033826
4       1614033823
           ...    
1995    1613387769
1996    1613387437
1997    1613387375
1998    1613386701
1999    1613386502
Name: created_utc, Length: 2000, dtype: int64

In [None]:
mentalhealth_reddit_comments.shape 

In [None]:
mentalhealth_reddit_comments.head()# It doesn't seem I'm getting the comments or self text here

In [None]:
pd.set_option('display.max_columns', 500)

In [None]:
mentalhealth_reddit_comments.sort_values(['created_utc'],ascending=True).head(3)

In [None]:
# r/depression
depression_reddit_comments = crawl_subreddit('depression')

In [None]:
depression_reddit_comments.shape

In [None]:
depression_reddit_comments.head()

In [None]:
depression_reddit_comments.columns

In [None]:
depression_reddit_comments['title']

In [None]:
depression_reddit_comments['selftext']