# NLP, Deep Learning, & Reddit Project: Scrape Data

**Background**: 


**Problem Statement**: 
This project builds on a previous project that used Natural Language Processing (NLP) to analyze text from two subreddits, r/Anorexia and r/Bulimia. It is further inspired by a *Nature* *Scientific Reports* article titled "A deep learning model for detecting mental illness from user content on social media". The aim is to build a model, based on user comments from Reddit, that can identify whether a user's post belongs to a mental illness subreddit category.

**Research Questions**: 
- Can we identify whether a user's post belongs to a mental illness subreddit category? 
- How have posts changed between the periods before and after the onset of COVID-19?
- Which mental illness category seems to have changed the most? 

**Plan**:
Scrape data from the following subreddits: 
- r/mentalhealth
- r/depression
- r/Anxiety
- r/bipolar
- r/BPD
- r/schizophrenia
- r/autism
- r/AnorexiaNervosa
- r/Bulimia

Start with 2000(?) comments for each subreddit. The Nature article used far more, but my model from the previous project reached a high accuracy score with only 2000 comments. Collect comments dated both before and after March 2020.
- This could be too much data for my computer to handle
- May need more from the r/mentalhealth subreddit... come back to this

In [2]:
# Load libraries
import pandas as pd
pd.set_option('display.max_rows', 500)
import requests
import time

In [3]:
# Pushshift API endpoint for searching Reddit submissions
url = "https://api.pushshift.io/reddit/search/submission"

Code below adapted from: https://www.textjuicer.com/2019/07/crawling-all-submissions-from-a-subreddit/
- TO DO: Add more comments explaining each function. Save the functions as a script??? Ask Riley
- Edit the function that returns the data frame so the df is anonymized within the function...
- Edit the function to collect data based on UTC; find out what the UTC timestamp is for pre-Feb 2020
- Change `created_utc` into a date
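
For the UTC item in the list above, a minimal sketch of how a cutoff timestamp could be computed. The cutoff of 1 March 2020 is an assumption taken from the plan ("pre and post March 2020"); the TO DO mentions pre-Feb 2020, so adjust the date as needed. The names `COVID_CUTOFF`, `cutoff_utc` and `is_pre_covid` are hypothetical.

```python
from datetime import datetime, timezone

# Assumed cutoff: midnight UTC on 1 March 2020 (change if Feb 2020 is preferred)
COVID_CUTOFF = datetime(2020, 3, 1, tzinfo=timezone.utc)

# Epoch seconds, the same unit as the created_utc field
cutoff_utc = int(COVID_CUTOFF.timestamp())


def is_pre_covid(created_utc: int) -> bool:
    '''Return True if a created_utc value falls before the assumed cutoff.'''
    return created_utc < cutoff_utc
```

With this cutoff, `cutoff_utc` is 1583020800.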

In [4]:
def crawl_page(subreddit: str, last_page=None):
    '''Crawl one page of results from a given subreddit.

    subreddit: The subreddit name (without the "r/" prefix).
    last_page: The previously downloaded page, or None for the first request.
    '''
    params = {"subreddit": subreddit,
              "size": 500,
              "sort": "desc",
              "sort_type": "created_utc"}

    if last_page is not None:
        if len(last_page) > 0:
            # Resume from where the last page left off
            params["before"] = last_page[-1]["created_utc"]
        else:
            # The last page was empty, so we are past the final page
            return []
    results = requests.get(url, params=params)
    if not results.ok:
        # Alert us that something went wrong
        raise Exception("Server returned status code {}".format(results.status_code))
    return results.json()["data"]

In [16]:
def crawl_subreddit(subreddit, max_submissions=2000):
    '''Crawl submissions from a subreddit.

    subreddit: The subreddit name.
    max_submissions: The maximum number of submissions to download.
    '''
    submissions = []
    last_page = None
    while last_page != [] and len(submissions) < max_submissions:
        last_page = crawl_page(subreddit, last_page)
        submissions += last_page
        # Pause between requests so as not to flood the server
        time.sleep(2)
    # Return the submissions as a DataFrame, capped at max_submissions
    return pd.DataFrame(submissions[:max_submissions])
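
Because the plan calls for posts from both before and after March 2020, the request parameters could be bounded by date. The sketch below only builds the parameter dictionary and makes no network call. It assumes the Pushshift endpoint accepts `before` and `after` as epoch-second values, as it did when this notebook was written; `build_params` is a hypothetical helper, not part of the adapted code.

```python
def build_params(subreddit, before=None, after=None, size=500):
    '''Build query parameters for one page, optionally bounded by epoch times.'''
    params = {"subreddit": subreddit,
              "size": size,
              "sort": "desc",
              "sort_type": "created_utc"}
    if before is not None:
        # Only return submissions created before this epoch time
        params["before"] = int(before)
    if after is not None:
        # Only return submissions created after this epoch time
        params["after"] = int(after)
    return params


# Example: a window ending at midnight UTC on 1 March 2020
pre_covid_params = build_params("depression", before=1583020800)
```

A dictionary like this could be passed to `requests.get(url, params=...)` in place of the fixed `params` in `crawl_page`.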

In [6]:
# r/mentalhealth
mentalhealth_reddit_comments = crawl_subreddit('mentalhealth')

In [7]:
mentalhealth_reddit_comments.shape 

(5000, 74)

In [13]:
mentalhealth_reddit_comments.head()  # It doesn't seem I'm getting the comments or self text here

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,crosspost_parent_list,url_overridden_by_dest,poll_data,author_flair_background_color,author_flair_text_color,author_cakeday,removed_by_category,edited,author_flair_template_id,banned_by
0,[],False,Outrageous_Writer600,,[],,text,t2_aghacp0f,False,False,...,,,,,,,,,,
1,[],False,KiimberlyCollins,,[],,text,t2_6itp09ql,False,False,...,,,,,,,,,,
2,[],False,Notyour-averagesquid,,[],,text,t2_8k6mc3qb,False,False,...,,,,,,,,,,
3,[],False,SleepAllDay247,,[],,text,t2_ag4a5x0n,False,False,...,,,,,,,,,,
4,[],False,ccccccxy,,[],,text,t2_79gj8amd,False,False,...,,,,,,,,,,


In [17]:
# r/depression
depression_reddit_comments = crawl_subreddit('depression')

In [18]:
depression_reddit_comments.shape

(2000, 66)

In [19]:
depression_reddit_comments.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,whitelist_status,wls,removed_by_category,post_hint,preview,author_flair_background_color,author_flair_text_color,banned_by,author_cakeday,edited
0,[],False,painocchio,,[],,text,t2_9tp4k1n9,False,False,...,no_ads,0.0,,,,,,,,
1,[],False,unclemegan69,,[],,text,t2_7jp7ue8g,False,False,...,no_ads,0.0,automod_filtered,,,,,,,
2,[],False,Cherryblossom0104,,[],,text,t2_5cod2vmr,False,False,...,no_ads,0.0,reddit,,,,,,,
3,[],False,Lucaded,,[],,text,t2_a7srcsjx,False,False,...,no_ads,0.0,,,,,,,,
4,[],False,PrinceChicken209,,[],,text,t2_9bj6uk86,False,False,...,no_ads,0.0,,,,,,,,


In [22]:
depression_reddit_comments.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_patreon_flair',
       'author_premium', 'awarders', 'can_mod_post', 'contest_mode',
       'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_crosspostable', 'is_meta', 'is_original_content',
       'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_richtext',
       'link_flair_text_color', 'link_flair_type', 'locked', 'media_only',
       'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler',
       'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers',
       'subreddit_type', 'suggested_sort', 'thumbnail', 'title',
       'total_awards_received', 'treatment_tags
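
Most of these columns are metadata, and several identify the author. A rough sketch of the anonymization step from the TO DO list, assuming only the subreddit label, timestamp, title and self text are needed for modelling. `KEEP_COLUMNS` and `select_text_columns` are hypothetical names, and the sample row is made up for illustration.

```python
import pandas as pd

# Assumed set of columns needed for modelling; none of them identify the author
KEEP_COLUMNS = ["subreddit", "created_utc", "title", "selftext"]


def select_text_columns(df):
    '''Keep only the non-identifying columns that are present in df.'''
    present = [col for col in KEEP_COLUMNS if col in df.columns]
    return df[present].copy()


# Made-up example row
sample = pd.DataFrame({"author": ["example_user"],
                       "author_fullname": ["t2_example"],
                       "subreddit": ["depression"],
                       "created_utc": [1613758854],
                       "title": ["Example title"],
                       "selftext": ["Example text"]})
anonymized = select_text_columns(sample)
```

The same selection could be applied inside `crawl_subreddit` before returning, so author fields are never kept in the returned data frame.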

In [24]:
depression_reddit_comments['created_utc']

0       1613758854
1       1613758830
2       1613758730
3       1613758674
4       1613758119
           ...    
1995    1613397599
1996    1613397585
1997    1613397494
1998    1613397468
1999    1613397449
Name: created_utc, Length: 2000, dtype: int64
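
The `created_utc` values are epoch seconds. A minimal sketch of converting them to dates with `pd.to_datetime`, using the first and last values from the output above; the column name `created_date` is an assumption.

```python
import pandas as pd

# First and last created_utc values from the output above
sample = pd.DataFrame({"created_utc": [1613758854, 1613397449]})

# Interpret the integers as seconds since the Unix epoch, in UTC
sample["created_date"] = pd.to_datetime(sample["created_utc"], unit="s", utc=True)

# Plain YYYY-MM-DD strings for quick inspection
dates = sample["created_date"].dt.strftime("%Y-%m-%d").tolist()
# dates == ["2021-02-19", "2021-02-15"]
```

Both values fall in mid-February 2021, which suggests this crawl of 2000 posts covers only a few days and contains nothing from before March 2020.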

In [20]:
depression_reddit_comments['title']

0                   Late menstruation on antidepressants?
1       Has anyone here been prescribed and taken Vray...
2                       When will things get better (18F)
3                         I don’t know what I should do??
4                                                 Autism.
                              ...                        
1995    .... Is this improvement....? Or just another ...
1996    The closest I've come to a relationship in six...
1997                Fuck growing up. Life is meaningless.
1998               I was molested by a doctor as a child.
1999    I have so much anger inside me I don't wanna c...
Name: title, Length: 2000, dtype: object

In [21]:
depression_reddit_comments['selftext']

0       Sorry for the bad english.\n\nIm on week 3 of ...
1                                               [removed]
2                                               [removed]
3       Basically I met someone online and we were tal...
4       Just a few days ago I realized who I really am...
                              ...                        
1995    I started daydreaming. I haven't daydreamed si...
1996    I think at this point I am just doomed to be a...
1997                                            [deleted]
1998    I have suffered with depression and anxiety si...
1999    God I feel so lonely,angry depressed dead emot...
Name: selftext, Length: 2000, dtype: object
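
Some rows have `[removed]` or `[deleted]` in place of the post body, so they carry no text for the model. A minimal sketch of filtering them out, assuming such rows should be dropped rather than kept with their titles only; `PLACEHOLDERS` and `drop_missing_selftext` are hypothetical names, and the sample rows are made up.

```python
import pandas as pd

# Placeholder strings that appear instead of the post body in the output above
PLACEHOLDERS = {"[removed]", "[deleted]"}


def drop_missing_selftext(df):
    '''Drop rows whose selftext is empty, missing, or a placeholder.'''
    text = df["selftext"].fillna("").str.strip()
    mask = (text != "") & ~text.isin(PLACEHOLDERS)
    return df[mask].reset_index(drop=True)


# Made-up example rows
sample = pd.DataFrame({"title": ["a", "b", "c", "d"],
                       "selftext": ["Some text", "[removed]", "[deleted]", None]})
cleaned = drop_missing_selftext(sample)
```

Since dropping rows reduces the count below 2000, the crawl limit may need to be raised to end up with the planned number of usable posts per subreddit.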