# Data Collection
This notebook collects data and updates our datasets with the newest posts from Reddit, using Reddit's API. The data comes from two subreddits, r/depression and r/SuicideWatch; the cells below also scrape r/CasualConversation, whose posts are labeled `is_suicide = 0`. This will allow our classifier to sort posts into three broad classes: depression, suicidal, or none. After that, we will develop a model to determine the stages of depression and suicidality that the user is going through.

In [2]:
import requests
import time
import pandas as pd
from random import randint


In [3]:
# Begin scraping; the first test request goes to r/CasualConversation
url_1 = "https://www.reddit.com/r/CasualConversation.json"

In [5]:
# creating user agent
headers = {"User-agent" : "ayushi7564321"}
res = requests.get(url_1, headers=headers)
res.status_code

200

In [7]:
# Preview of our data
depress_json = res.json()
depress_json

{'kind': 'Listing',
 'data': {'after': 't3_12r57l8',
  'dist': 26,
  'modhash': '',
  'geo_filter': None,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'CasualConversation',
     'selftext': "Welcome to r/CasualConversation! Thank you for joining and coming to our corner of Reddit. \n\n&gt;The friendlier part of Reddit. Have a fun conversation about anything that is on your mind. Ask a question or start a conversation about (almost) anything you desire. Maybe you'll make some friends in the process.\n\nIf you are here, lurking, feel free to create an account and say hi. \n\nHow are you? What brings you here? \n\n&amp;#x200B;\n\nPS, we got rules, please [read 'em](https://www.reddit.com/r/CasualConversation/about/rules)!",
     'author_fullname': 't2_6l4z3',
     'saved': False,
     'mod_reason_title': None,
     'gilded': 0,
     'clicked': False,
     'title': 'r/CasualConversation Welcome Thread - Month of April 01, 2023',
     'link_flair_ric

This data is long and extensive. The request can be repeated whenever we want to create a new dataset.
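Since the request is meant to be repeated, it may help to wrap it in a small helper. The sketch below is one possible way to do that, not part of the original notebook: `fetch_listing`, its `session` parameter, and the placeholder User-agent string are all hypothetical names introduced here.

```python
# Placeholder User-agent; substitute your own descriptive string.
DEFAULT_HEADERS = {"User-agent": "example-data-collector"}


def fetch_listing(url, after=None, session=None, headers=None, timeout=10):
    """Fetch one page of a subreddit listing and return its "data" dict.

    `session` is any object with a requests-style .get() method. When it is
    omitted, the requests library is used. (Hypothetical helper.)
    """
    if session is None:
        # Imported lazily so the helper can also be exercised with a stub session.
        import requests
        session = requests
    # Only send the "after" token when we have one; otherwise ask for the first page.
    params = {"after": after} if after is not None else {}
    res = session.get(url, params=params, headers=headers or DEFAULT_HEADERS, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of continuing silently.
    res.raise_for_status()
    return res.json()["data"]
```

Calling `fetch_listing(url_1)` would return the same dictionary as `depress_json["data"]` above, and passing its `"after"` value back in requests the following page.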

In [8]:
sorted(depress_json["data"].keys())

['after', 'before', 'children', 'dist', 'geo_filter', 'modhash']

In [10]:
depress_json["data"]["after"]

't3_12k9e18'

In [11]:
[post["data"]["name"] for post in depress_json["data"]["children"]]

['t3_128ov86',
 't3_12jlhdq',
 't3_12jtvrh',
 't3_12k1t2n',
 't3_12jw4ge',
 't3_12j6m2m',
 't3_12k59tz',
 't3_12k5ymz',
 't3_12k7vqe',
 't3_12jj6ml',
 't3_12jpk6z',
 't3_12ikxxh',
 't3_12jsl4z',
 't3_12k1pvd',
 't3_12jhvx7',
 't3_12k40zr',
 't3_12jpzkq',
 't3_12k8bxa',
 't3_12j17ni',
 't3_12k3sdm',
 't3_12k997t',
 't3_12jrwk3',
 't3_12jnhv3',
 't3_12jm38q',
 't3_12k9hb9',
 't3_12k9e18']

In [12]:
# checking posts per page
len(depress_json["data"]["children"])

26

In [13]:
# dataframe the posts
pd.DataFrame(depress_json["data"]["children"])

Unnamed: 0,kind,data
0,t3,"{'approved_at_utc': None, 'subreddit': 'Casual..."
1,t3,"{'approved_at_utc': None, 'subreddit': 'Casual..."
2,t3,"{'approved_at_utc': None, 'subreddit': 'Casual..."
3,t3,"{'approved_at_utc': None, 'subreddit': 'Casual..."
4,t3,"{'approved_at_utc': None, 'subreddit': 'Casual..."
5,t3,"{'approved_at_utc': None, 'subreddit': 'Casual..."
6,t3,"{'approved_at_utc': None, 'subreddit': 'Casual..."
7,t3,"{'approved_at_utc': None, 'subreddit': 'Casual..."
8,t3,"{'approved_at_utc': None, 'subreddit': 'Casual..."
9,t3,"{'approved_at_utc': None, 'subreddit': 'Casual..."
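In the DataFrame above, each post is still a nested dictionary inside the `data` column. A minimal sketch of one way to get one column per field is to build the DataFrame from the inner dictionaries instead. The records below are made-up stand-ins for items of `depress_json["data"]["children"]`, not real API output.

```python
import pandas as pd

# Made-up stand-ins for items of depress_json["data"]["children"].
sample_children = [
    {"kind": "t3", "data": {"name": "t3_aaa", "title": "First post", "selftext": "Hello"}},
    {"kind": "t3", "data": {"name": "t3_bbb", "title": "Second post", "selftext": "World"}},
]

# Keep only the inner "data" dictionary so each field becomes its own column.
posts_df = pd.DataFrame([child["data"] for child in sample_children])
print(posts_df.columns.tolist())
```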


In [14]:
# view real data
depress_json["data"]["children"][0]["data"]

{'approved_at_utc': None,
 'subreddit': 'CasualConversation',
 'selftext': "Welcome to r/CasualConversation! Thank you for joining and coming to our corner of Reddit. \n\n&gt;The friendlier part of Reddit. Have a fun conversation about anything that is on your mind. Ask a question or start a conversation about (almost) anything you desire. Maybe you'll make some friends in the process.\n\nIf you are here, lurking, feel free to create an account and say hi. \n\nHow are you? What brings you here? \n\n&amp;#x200B;\n\nPS, we got rules, please [read 'em](https://www.reddit.com/r/CasualConversation/about/rules)!",
 'author_fullname': 't2_6l4z3',
 'saved': False,
 'mod_reason_title': None,
 'gilded': 0,
 'clicked': False,
 'title': 'r/CasualConversation Welcome Thread - Month of April 01, 2023',
 'link_flair_richtext': [{'a': ':chat:',
   'e': 'emoji',
   'u': 'https://emoji.redditmedia.com/04fpiw4fukg21_t5_323oy/chat'},
  {'e': 'text', 't': ' Just Chatting'}],
 'subreddit_name_prefixed': 'r/

In [1]:
# automate a function to scrape reddit

def reddit_scrape(url_string, number_of_scrapes, output_list):
    """Request number_of_scrapes pages from url_string and append the raw posts to output_list."""
    # NOTE: uses the global `headers` dict defined in an earlier cell
    after = None
    for batch in range(number_of_scrapes):
        if batch == 0:
            print("SCRAPING {}\n--------------------------------------------------".format(url_string))
            print("<<<SCRAPING COMMENCED>>>")
            print("Downloading Batch {} of {}...".format(1, number_of_scrapes))
        elif (batch + 1) % 5 == 0:
            print("Downloading Batch {} of {}...".format(batch + 1, number_of_scrapes))

        # Reddit's "after" token tells the API to return the page that follows it.
        # With no token, the first page is returned again (a possible source of duplicates).
        params = {} if after is None else {"after": after}
        res = requests.get(url_string, params=params, headers=headers)
        if res.status_code == 200:
            the_json = res.json()
            output_list.extend(the_json["data"]["children"])
            after = the_json["data"]["after"]
        else:
            # stop on any non-200 response and report the status code
            print(res.status_code)
            break
        # wait a random 1 to 6 seconds between requests
        time.sleep(randint(1, 6))

    print("<<<SCRAPING COMPLETED>>>")
    print("Number of posts downloaded: {}".format(len(output_list)))
    print("Number of unique posts: {}".format(len({p["data"]["name"] for p in output_list})))
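If a response comes back with `after` set to `None`, the loop above sends its next request without a token and starts again from the first page, which is one likely source of the gap between downloaded and unique posts. The sketch below is a variant, not the notebook's method: it stops when no `after` token is returned and skips posts it has already seen. `collect_posts` and `fetch_page` are hypothetical names; `fetch_page` stands in for the HTTP request so the example runs without network access.

```python
def collect_posts(fetch_page, max_pages):
    """Collect unique posts page by page.

    fetch_page(after) must return a dict shaped like the "data" part of a
    Reddit listing: {"children": [...], "after": <token or None>}.
    """
    posts, seen = [], set()
    after = None
    for _ in range(max_pages):
        page = fetch_page(after)
        for child in page["children"]:
            name = child["data"]["name"]
            if name not in seen:  # skip posts that were already collected
                seen.add(name)
                posts.append(child["data"])
        after = page["after"]
        if after is None:  # no further pages: stop instead of restarting
            break
    return posts


# Fake two-page listing, used only to demonstrate the function.
fake_pages = {
    None: {"children": [{"data": {"name": "t3_a"}}, {"data": {"name": "t3_b"}}], "after": "t3_b"},
    "t3_b": {"children": [{"data": {"name": "t3_b"}}, {"data": {"name": "t3_c"}}], "after": None},
}
demo_posts = collect_posts(lambda after: fake_pages[after], max_pages=10)
print([post["name"] for post in demo_posts])
```

Passing the request as a callable keeps the paging logic separate from the network code, so it can be checked against fake pages like the ones above.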

In [17]:
# call function for the CasualConversation subreddit
casual_scraped = []
reddit_scrape("https://www.reddit.com/r/CasualConversation.json", 50, casual_scraped)

SCRAPING https://www.reddit.com/r/CasualConversation.json
--------------------------------------------------
<<<SCRAPING COMMENCED>>>
Downloading Batch 1 of 50...
Downloading Batch 5 of 50...
Downloading Batch 10 of 50...
Downloading Batch 15 of 50...
Downloading Batch 20 of 50...
Downloading Batch 25 of 50...
Downloading Batch 30 of 50...
Downloading Batch 35 of 50...
Downloading Batch 40 of 50...
Downloading Batch 45 of 50...
Downloading Batch 50 of 50...
<<<SCRAPING COMPLETED>>>
Number of posts downloaded: 1238
Number of unique posts: 789


In [19]:
# output list of unique posts
def create_unique_list(original_scrape_list, new_list_name):
    """Append each post's data to new_list_name once, identified by the post's "name"."""
    seen_names = set()
    for post in original_scrape_list:
        name = post["data"]["name"]
        if name not in seen_names:
            new_list_name.append(post["data"])
            seen_names.add(name)
    # check that the new list has the same length as the number of unique posts
    print("LIST NOW CONTAINS {} UNIQUE SCRAPED POSTS".format(len(new_list_name)))

In [20]:
# call function on our data
casual_scraped_unique = []
create_unique_list(casual_scraped, casual_scraped_unique)

LIST NOW CONTAINS 789 UNIQUE SCRAPED POSTS


In [21]:
# input CasualConversation data to dataframe and csv
casualConvo = pd.DataFrame(casual_scraped_unique)
casualConvo["is_suicide"] = 0
casualConvo.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,author_cakeday,is_suicide
0,,CasualConversation,Welcome to r/CasualConversation! Thank you for...,t2_6l4z3,False,,0,False,r/CasualConversation Welcome Thread - Month of...,"[{'a': ':chat:', 'e': 'emoji', 'u': 'https://e...",...,all_ads,True,https://www.reddit.com/r/CasualConversation/co...,2118491,1680361000.0,0,,False,,0
1,,CasualConversation,"Yes, a bidet. And I would like to tell everyon...",t2_39jbqndq,False,,0,False,"So, we bought a bidet...",[],...,all_ads,False,https://www.reddit.com/r/CasualConversation/co...,2118491,1681304000.0,0,,False,,0
2,,CasualConversation,Tldr; Birth name holds trauma. After 8 years o...,t2_gt9gu3ns,False,,0,False,I started the process of legally changing my n...,"[{'a': ':party:', 'e': 'emoji', 'u': 'https://...",...,all_ads,False,https://www.reddit.com/r/CasualConversation/co...,2118491,1681322000.0,0,,False,,0
3,,CasualConversation,I always wanted to be an artist but never been...,t2_3x73tah5,False,,0,False,I pushed myself out of my comfort zone and sta...,"[{'a': ':party:', 'e': 'emoji', 'u': 'https://...",...,all_ads,False,https://www.reddit.com/r/CasualConversation/co...,2118491,1681338000.0,0,,False,,0
4,,CasualConversation,After over two years (about 4-5 months were af...,t2_7u3te049,False,,0,False,I got the job!,"[{'a': ':party:', 'e': 'emoji', 'u': 'https://...",...,all_ads,False,https://www.reddit.com/r/CasualConversation/co...,2118491,1681328000.0,0,,False,,0


In [29]:
casualConvo.to_csv('casual_conversation_vs_suicide.csv', index = False)

In [31]:
# call function on our data
# NOTE: reddit_scrape was not called for r/depression in this run, so depress_scraped is empty
depress_scraped_unique = []
depress_scraped = []
create_unique_list(depress_scraped, depress_scraped_unique)

LIST NOW CONTAINS 0 UNIQUE SCRAPED POSTS


In [33]:
# input depression data to dataframe and csv
depression = pd.DataFrame(depress_scraped_unique)
depression["is_suicide"] = 0
depression.head()

Unnamed: 0,is_suicide


In [35]:
# calling scraping on suicidewatch data
suicide_scraped = []
reddit_scrape("https://www.reddit.com/r/SuicideWatch.json", 50, suicide_scraped)

SCRAPING https://www.reddit.com/r/SuicideWatch.json
--------------------------------------------------
<<<SCRAPING COMMENCED>>>
Downloading Batch 1 of 50...
Downloading Batch 5 of 50...
Downloading Batch 10 of 50...
Downloading Batch 15 of 50...
Downloading Batch 20 of 50...
Downloading Batch 25 of 50...
Downloading Batch 30 of 50...
Downloading Batch 35 of 50...
Downloading Batch 40 of 50...
Downloading Batch 45 of 50...
Downloading Batch 50 of 50...
<<<SCRAPING COMPLETED>>>
Number of posts downloaded: 1241
Number of unique posts: 990


In [36]:
# using unique function on suicide data
suicide_scraped_unique = []
create_unique_list(suicide_scraped, suicide_scraped_unique)

LIST NOW CONTAINS 990 UNIQUE SCRAPED POSTS


In [37]:
# inputting suicidewatch data into dataframe and csv
suicide_watch = pd.DataFrame(suicide_scraped_unique)
suicide_watch["is_suicide"] = 1
suicide_watch.head() #CHECK IF THERE ARE 100 COLS AND LAST DUMMY is_suicide COL

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,author_cakeday,is_suicide
0,,SuicideWatch,We've been seeing a worrying increase in pro-s...,t2_1t70,False,,1,False,New wiki on how to avoid accidentally encourag...,[],...,,True,https://www.reddit.com/r/SuicideWatch/comments...,414924,1567526000.0,0,,False,,1
1,,SuicideWatch,"Activism, i.e. advocating or fundraising for s...",t2_1t70,False,,0,False,Please remember that NO ACTIVISM of any kind i...,[],...,,True,https://www.reddit.com/r/SuicideWatch/comments...,414924,1631232000.0,0,,False,,1
2,,SuicideWatch,Sick of hearing it. It's all just platitudes t...,t2_22dy1wt4,False,,0,False,"""iT gEtS bEtTeR"", ""yOu'Re LoVeD"" and other cli...",[],...,,False,https://www.reddit.com/r/SuicideWatch/comments...,414924,1681335000.0,0,,False,,1
3,,SuicideWatch,That’s it that’s the post lmao,t2_xxfb7,False,,0,False,My therapist fired me for being too depressed.,[],...,,False,https://www.reddit.com/r/SuicideWatch/comments...,414924,1681329000.0,0,,False,,1
4,,SuicideWatch,i just need to kill myself.,t2_r8k51zao,False,,0,False,if i kill myself all my problems will go away.,[],...,,False,https://www.reddit.com/r/SuicideWatch/comments...,414924,1681320000.0,0,,False,,1
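The `is_suicide` column is binary, while the introduction describes three broad classes. One possible way to encode them, sketched below, is to derive a label from the `subreddit` column. The label names, the `SUBREDDIT_TO_LABEL` mapping, and the `add_class_label` helper are assumptions for illustration and do not come from the notebook.

```python
import pandas as pd

# Assumed mapping from subreddit to class label (illustrative only).
SUBREDDIT_TO_LABEL = {
    "CasualConversation": "none",
    "depression": "depression",
    "SuicideWatch": "suicidal",
}


def add_class_label(df):
    """Return a copy of df with a `label` column derived from `subreddit`."""
    labeled = df.copy()
    labeled["label"] = labeled["subreddit"].map(SUBREDDIT_TO_LABEL)
    return labeled


# Tiny made-up frame standing in for the scraped DataFrames.
sample = pd.DataFrame({"subreddit": ["CasualConversation", "SuicideWatch", "depression"]})
print(add_class_label(sample)["label"].tolist())
```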


### Collection Complete
These functions should be rerun regularly. The goal is to grow our dataset and thus improve our model. Since new posts are made all the time, periodically scraping the subreddits yields more comprehensive datasets and, in turn, more comprehensive models.

In [38]:
# saving datasets to csv (depression contains no posts in this run)
suicide_watch.to_csv('suicide_watch6-13.csv', index = False)
depression.to_csv('depression6-13.csv', index = False)
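Because the scrape is meant to be repeated, each new run has to be merged into the dataset already saved on disk. A minimal sketch of that step, assuming posts are identified by their `name` field: concatenate the old and new rows and drop repeated names. `merge_new_posts` is a hypothetical helper, and the frames below are made-up data; in practice `existing` would come from `pd.read_csv` on a previously saved file.

```python
import pandas as pd


def merge_new_posts(existing, new):
    """Combine a saved dataset with a fresh scrape, keeping one row per post name."""
    combined = pd.concat([existing, new], ignore_index=True)
    # keep="last" prefers the most recently scraped copy of a post
    return combined.drop_duplicates(subset="name", keep="last").reset_index(drop=True)


# Made-up frames standing in for a saved CSV and a fresh scrape.
existing = pd.DataFrame({"name": ["t3_a", "t3_b"], "title": ["one", "two"], "is_suicide": [1, 1]})
new = pd.DataFrame({"name": ["t3_b", "t3_c"], "title": ["two (edited)", "three"], "is_suicide": [1, 1]})
merged = merge_new_posts(existing, new)
print(merged["name"].tolist())
```

Keeping the last copy means an edited post replaces its older version; `keep="first"` would preserve the originally scraped text instead.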