---
# Data Scraping
---

Intro paragraph

In [1]:
import time
import pandas as pd
import requests

pd.set_option('display.max_rows', None)

---
## 1. Fetch the content by URL
---

### Loop to pull Reddit API posts
To collect Reddit data, the site's existing .json API format was used: appending .json to a subreddit URL returns the listing as a JSON dictionary. A headers dictionary with a custom user agent was included to access Reddit, which made it possible to run an API loop that accumulates the maximum allowed posts (~1000 per subreddit). To keep Reddit from perceiving the data extraction as abusive traffic, a time.sleep pause was also introduced between requests.

In [2]:
#soup = BeautifulSoup(open("sample.html"), "lxml")

In [3]:
# target web page

url1 = "https://www.reddit.com/r/SuicideWatch/.json"
url2 = "https://www.reddit.com/r/depression/.json"

In [4]:
headers = {'user-agent': 'mother_monkey'}  # identifies the client to Reddit; requests sent with the default user agent are often rejected

# Establishing the connection to the web page:
response1 = requests.get(url1, headers=headers)
response2 = requests.get(url2, headers=headers)

# You can use status codes to understand how the target server responds to your request.
# Ex., 200 = OK, 400 = Bad Request, 403 = Forbidden, 404 = Not Found.
print('Status Code 1: ',response1.status_code)
print('Status Code 2: ',response2.status_code)

Status Code 1:  200
Status Code 2:  200


In [5]:
response1.json()

{'kind': 'Listing',
 'data': {'modhash': '',
  'dist': 27,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'SuicideWatch',
     'selftext': 'We\'ve been seeing a worrying increase in pro-suicide content showing up here and, and also going unreported. This undermines our purpose here, so we wanted to highlight and clarify our guidelines about both direct and indirect incitement of suicide.  \n\nWe\'ve created a wiki that covers these issues.  We hope this will be helpful to anyone who\'s wondering whether something\'s okay here and which responses to report.  It explains in detail why *any* validation of suicidal intent, even an "innocent" message like "if you\'re 100% committed, I\'ll just wish you peace" is likely to increase people\'s pain, and why it\'s important to report even subtle pro-suicide comments. The full text of the wiki\'s current version is below, and it is maintained at [/r/SuicideWatch/wiki/incitement](http://www.reddit.com/r/Suic

In [6]:
response2.json()

{'kind': 'Listing',
 'data': {'modhash': '',
  'dist': 27,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'depression',
     'selftext': 'We understand that most people who reply immediately to an OP with an invitation to talk privately  mean only to help, but this type of response usually leads to either disappointment or disaster.  it usually works out quite differently here than when you say "PM me anytime" in a casual social context.  \n\nWe have huge admiration and appreciation for the goodwill and good citizenship of so many of you who support others here and flag inappropriate content - even more so because we know that so many of you are struggling yourselves.  We\'re hard at work behind the scenes on more information and resources to make it easier to give and get quality help here - this is just a small start.  \n\nOur new wiki page explains in detail why it\'s much better to respond in public comments, at least until you\'ve gotten to k

In [7]:
# loop to extract posts from url1 (r/SuicideWatch)

post_1 = []
after = None
for i in range(41):  ## at roughly 25 posts per page (less pinned posts),
                     ## about 40 pages reach ~1000 posts, which is also
                     ## the listing limit for Reddit; range(41) allows one extra page
    print(i)
    if after is None:
        params = {}
    else:
        params = {'after': after}
    url1 = "https://www.reddit.com/r/SuicideWatch/.json"
    response1 = requests.get(url1, params=params, headers=headers)
    if response1.status_code == 200:
        res1_json = response1.json()
        post_1.extend(res1_json['data']['children'])
        after = res1_json['data']['after']
    else:
        print(response1.status_code)
        break
    time.sleep(1)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40


In [8]:
# 'name' being the unique id for each post,
# find the number of unique posts
len(set([p['data']['name'] for p in post_1]))

985

In [9]:
# loop to extract posts from url2 (r/depression)

post_2 = []
after = None
for i in range(41):  ## at roughly 25 posts per page (less pinned posts),
                     ## about 40 pages reach ~1000 posts, which is also
                     ## the listing limit for Reddit; range(41) allows one extra page
    print(i)
    if after is None:
        params = {}
    else:
        params = {'after': after}
    url2 = "https://www.reddit.com/r/depression/.json"
    response2 = requests.get(url2, params=params, headers=headers)
    if response2.status_code == 200:
        res2_json = response2.json()
        post_2.extend(res2_json['data']['children'])
        after = res2_json['data']['after']
    else:
        print(response2.status_code)
        break
    time.sleep(1)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40


In [10]:
# 'name' being the unique id for each post,
# find the number of unique posts
len(set([p['data']['name'] for p in post_2]))

931
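
The two loops above differ only in the subreddit, and both send an empty `params` dictionary again whenever `after` comes back as `None`, which restarts from the first page and may account for some of the repeated posts. A minimal sketch of a shared helper follows, assuming the listing shape shown in the outputs above; `collect_posts` and `reddit_page_getter` are hypothetical names, not part of the original notebook.

```python
import time
from typing import Callable, Optional


def collect_posts(get_page: Callable[[Optional[str]], dict],
                  max_pages: int = 40,
                  pause: float = 1.0) -> list:
    """Accumulate listing children until pages run out or max_pages is reached."""
    posts = []
    after = None
    for _ in range(max_pages):
        listing = get_page(after)["data"]
        posts.extend(listing["children"])
        after = listing.get("after")
        if after is None:   # no further page: stop instead of restarting at page one
            break
        time.sleep(pause)   # pause between requests
    return posts


def reddit_page_getter(subreddit: str, headers: dict) -> Callable[[Optional[str]], dict]:
    """Hypothetical adapter that fetches one listing page with requests.get."""
    import requests  # imported here so collect_posts itself needs no network library

    url = f"https://www.reddit.com/r/{subreddit}/.json"

    def get_page(after: Optional[str]) -> dict:
        params = {"after": after} if after else {}
        response = requests.get(url, params=params, headers=headers)
        response.raise_for_status()  # raise on any 4xx/5xx status
        return response.json()

    return get_page
```

Under these assumptions, `collect_posts(reddit_page_getter('depression', headers))` would stand in for one of the loops. Passing the page getter as an argument also lets the pagination logic be checked with canned pages, without contacting Reddit.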

In [11]:
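# create DataFrames from posting lists
# NOTE: post_1 was pulled from r/SuicideWatch and post_2 from r/depression,
# so the names below are swapped relative to their contents (see the
# 'subreddit' values in the outputs that follow)
depression_df = pd.DataFrame(post_1)
suicide_df = pd.DataFrame(post_2)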


In [12]:
depression_df.head()

Unnamed: 0,kind,data
0,t3,"{'approved_at_utc': None, 'subreddit': 'Suicid..."
1,t3,"{'approved_at_utc': None, 'subreddit': 'Suicid..."
2,t3,"{'approved_at_utc': None, 'subreddit': 'Suicid..."
3,t3,"{'approved_at_utc': None, 'subreddit': 'Suicid..."
4,t3,"{'approved_at_utc': None, 'subreddit': 'Suicid..."


In [13]:
suicide_df.head()

Unnamed: 0,kind,data
0,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
1,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
2,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
3,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."
4,t3,"{'approved_at_utc': None, 'subreddit': 'depres..."


In [14]:
depression_df['data'][911].keys()

dict_keys(['approved_at_utc', 'subreddit', 'selftext', 'author_fullname', 'saved', 'mod_reason_title', 'gilded', 'clicked', 'title', 'link_flair_richtext', 'subreddit_name_prefixed', 'hidden', 'pwls', 'link_flair_css_class', 'downs', 'hide_score', 'name', 'quarantine', 'link_flair_text_color', 'author_flair_background_color', 'subreddit_type', 'ups', 'total_awards_received', 'media_embed', 'author_flair_template_id', 'is_original_content', 'user_reports', 'secure_media', 'is_reddit_media_domain', 'is_meta', 'category', 'secure_media_embed', 'link_flair_text', 'can_mod_post', 'score', 'approved_by', 'author_premium', 'thumbnail', 'edited', 'author_flair_css_class', 'author_flair_richtext', 'gildings', 'content_categories', 'is_self', 'mod_note', 'created', 'link_flair_type', 'wls', 'removed_by_category', 'banned_by', 'author_flair_type', 'domain', 'allow_live_comments', 'selftext_html', 'likes', 'suggested_sort', 'banned_at_utc', 'view_count', 'archived', 'no_follow', 'is_crosspostable'

In [15]:
depression_df['data'][911]['selftext']

'I’m just tired and don’t know how much longer I can deal with it. I have put everything into living and I just hit the point where I don’t care anymore.'

In [16]:
depression_df['data'][911]['title']

'I’m tired.'

In [17]:
depression_df['data'][911]['author_fullname']

't2_5vqjs8fv'

In [18]:
depression_df['data'][911]['downs']

0

In [19]:
depression_df['data'][911]['ups']

3

In [20]:
suicide_df['data'][912].keys()

dict_keys(['approved_at_utc', 'subreddit', 'selftext', 'author_fullname', 'saved', 'mod_reason_title', 'gilded', 'clicked', 'title', 'link_flair_richtext', 'subreddit_name_prefixed', 'hidden', 'pwls', 'link_flair_css_class', 'downs', 'hide_score', 'name', 'quarantine', 'link_flair_text_color', 'author_flair_background_color', 'subreddit_type', 'ups', 'total_awards_received', 'media_embed', 'author_flair_template_id', 'is_original_content', 'user_reports', 'secure_media', 'is_reddit_media_domain', 'is_meta', 'category', 'secure_media_embed', 'link_flair_text', 'can_mod_post', 'score', 'approved_by', 'author_premium', 'thumbnail', 'edited', 'author_flair_css_class', 'author_flair_richtext', 'gildings', 'content_categories', 'is_self', 'mod_note', 'created', 'link_flair_type', 'wls', 'removed_by_category', 'banned_by', 'author_flair_type', 'domain', 'allow_live_comments', 'selftext_html', 'likes', 'suggested_sort', 'banned_at_utc', 'view_count', 'archived', 'no_follow', 'is_crosspostable'

In [21]:
suicide_df['data'][912]['selftext']

"Does anyone here experience depression due to being in a new country or culture that you don't understand? What are some difficulties you're dealing with?"

In [22]:
suicide_df['data'][912]['title']

'Depression from Culture Shock/Moving to a New Country'

In [23]:
suicide_df['data'][912]['author_fullname']

't2_4j9we4tp'

In [24]:
suicide_df['data'][912]['downs']

0

In [25]:
suicide_df['data'][912]['ups']

1

---
## 2. Collecting post attributes
---
Using list comprehensions, iterate through every collected post dictionary (roughly 40 pages of 25 posts per subreddit) to collect the key attributes of each post (e.g. title, selftext, etc.).

In [26]:
# 'dep' for depression
# 'sw' for suicidewatch

---
### Selftext

In [27]:
dep_selftext = [depression_df['data'][i]['selftext'] for i in range(len(depression_df))]
dep_selftext[:5]

['We\'ve been seeing a worrying increase in pro-suicide content showing up here and, and also going unreported. This undermines our purpose here, so we wanted to highlight and clarify our guidelines about both direct and indirect incitement of suicide.  \n\nWe\'ve created a wiki that covers these issues.  We hope this will be helpful to anyone who\'s wondering whether something\'s okay here and which responses to report.  It explains in detail why *any* validation of suicidal intent, even an "innocent" message like "if you\'re 100% committed, I\'ll just wish you peace" is likely to increase people\'s pain, and why it\'s important to report even subtle pro-suicide comments. The full text of the wiki\'s current version is below, and it is maintained at [/r/SuicideWatch/wiki/incitement](http://www.reddit.com/r/SuicideWatch/wiki/incitement). \n\nWe deeply appreciate everyone who gives responsive, empathetic, non-judgemental support to our OPs, and we particularly thank everyone who\'s alre

In [28]:
sw_selftext = [suicide_df['data'][i]['selftext'] for i in range(len(suicide_df))]
sw_selftext[:5]

['We understand that most people who reply immediately to an OP with an invitation to talk privately  mean only to help, but this type of response usually leads to either disappointment or disaster.  it usually works out quite differently here than when you say "PM me anytime" in a casual social context.  \n\nWe have huge admiration and appreciation for the goodwill and good citizenship of so many of you who support others here and flag inappropriate content - even more so because we know that so many of you are struggling yourselves.  We\'re hard at work behind the scenes on more information and resources to make it easier to give and get quality help here - this is just a small start.  \n\nOur new wiki page explains in detail why it\'s much better to respond in public comments, at least until you\'ve gotten to know someone.  It will be maintained at /r/depression/wiki/private_contact, and the full text of the current version is below.\n\n*****\n\n###Summary###\n\n**Anyone who, while 

In [29]:
len(dep_selftext)

1012

In [30]:
len(sw_selftext)

1008

---
### Titles

In [31]:
dep_titles = [depression_df['data'][i]['title'] for i in range(len(depression_df))]
dep_titles[:5]

['New wiki on how to avoid accidentally encouraging suicide, and how to spot covert incitement',
 'Reminder: Absolutely no activism of any kind is allowed here. Any day.',
 'Haha help',
 'Someone please talk to me😢 anyone at all please',
 "I'm usually responding to these posts"]

In [32]:
sw_titles = [suicide_df['data'][i]['title'] for i in range(len(suicide_df))]
sw_titles[:5]

['Our most-broken and least-understood rules is "helpers may not invite private contact as a first resort", so we\'ve made a new wiki to explain it',
 'Regular Check-In Post',
 'Does anyone else find it increasingly hard to pretend to be a normal functioning human being?',
 'I’m going to a psychiatrist for the first time and I’m a little freaked',
 "I made rice and didn't burn it."]

In [33]:
len(dep_titles)

1012

In [34]:
len(sw_titles)

1008

---
### Author

In [35]:
#dep_authors = [depression_df['data'][i]['author_fullname'] for i in range(len(depression_df))]
#dep_authors[0]

In [36]:
#sw_authors = [suicide_df['data'][i]['author_fullname'] for i in range(len(suicide_df))]
#sw_authors[:5]

In [37]:
# a KeyError in the lines above indicates that some posts lack the 'author_fullname' key

In [38]:
dep_authors = [] # empty lists to store results
sw_authors = []

for i in range(len(depression_df)): # for each post
    try:
        dep_authors.append(depression_df['data'][i]['author_fullname']) # attempt to add to list
    except KeyError:
        dep_authors.append('anonymous') # if the key is missing, record 'anonymous'

for i in range(len(suicide_df)): # for each post
    try:
        sw_authors.append(suicide_df['data'][i]['author_fullname']) # attempt to add to list
    except KeyError:
        sw_authors.append('anonymous') # if the key is missing, record 'anonymous'

In [39]:
len(dep_authors)

1012

In [40]:
len(sw_authors)

1008
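
The try/except loops above can also be written with `dict.get`, which returns a default when a key is absent. The records below are made-up stand-ins for the Reddit post dictionaries, so this is only a sketch of the pattern.

```python
# Toy records mimicking the shape of the listing children (values are invented).
posts = [
    {"kind": "t3", "data": {"title": "first", "author_fullname": "t2_abc"}},
    {"kind": "t3", "data": {"title": "second"}},  # 'author_fullname' is missing here
]

# dict.get falls back to 'anonymous' instead of raising KeyError
authors = [p["data"].get("author_fullname", "anonymous") for p in posts]
print(authors)  # ['t2_abc', 'anonymous']
```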

---
### Down votes

In [41]:
dep_downs = [depression_df['data'][i]['downs'] for i in range(len(depression_df))]
dep_downs[:5]

[0, 0, 0, 0, 0]

In [42]:
sw_downs = [suicide_df['data'][i]['downs'] for i in range(len(suicide_df))]
sw_downs[:5]

[0, 0, 0, 0, 0]

In [43]:
len(dep_downs)

1012

In [44]:
len(sw_downs)

1008

---
### Up votes

In [45]:
dep_ups = [depression_df['data'][i]['ups'] for i in range(len(depression_df))]
dep_ups[:5]

[1728, 1216, 158, 113, 32]

In [46]:
sw_ups = [suicide_df['data'][i]['ups'] for i in range(len(suicide_df))]
sw_ups[:5]

[1867, 339, 1285, 712, 47]

In [47]:
len(dep_ups)

1012

In [48]:
len(sw_ups)

1008

---
### Compiling lists to DataFrame

In [49]:
# Depression DataFrame
dep_df = pd.DataFrame([dep_titles, dep_selftext, dep_authors, dep_ups, dep_downs],
                      index=['title','post','author','upvotes','downvotes'])
# SuicideWatch DataFrame
sw_df = pd.DataFrame([sw_titles, sw_selftext, sw_authors, sw_ups, sw_downs], 
                     index=['title','post','author','upvotes','downvotes'])

In [50]:
dep_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011
title,New wiki on how to avoid accidentally encourag...,Reminder: Absolutely no activism of any kind i...,Haha help,Someone please talk to me😢 anyone at all please,I'm usually responding to these posts,I am a horrible abusive person who undeniably ...,I want to Die!!,Suicidal Ideation,I'm done,I'm a lonely worthless loser with no friends a...,...,I just want to rest.,how do i deal with suicidal thoughts if they s...,I’m done,I just slit my wrists,goodbye,wait with me,Is almost jumping off a building a suicide att...,I'm tired.,I wish I was never born.,Maybe its for the best that I don't live anymo...
post,We've been seeing a worrying increase in pro-s...,"If you want to recognise an occasion, please d...","Yes, I am suicidal. Yes, I am ""getting help"". ...","I have absolutely no one to turn to, I feel so...",I don't feel like I shouldn't be posting this ...,I'm going to get straight to the point. When I...,"&amp;#x200B;\n\nI so badly want to die, Everyd...","I’ve been on Prozac, Lexapro, Abilify, and Lor...",I really don't want to be considering suicide....,I have no friends. Everyone forgets I even exi...,...,"I have a beautiful girlfriend, I finally moved...",school makes me horribly depressed. i am a soc...,I’m fucking done,Go ahead and ban me now. You know you want to....,,i'm currently getting through a month-long bou...,The title says it all. I don't want to go into...,I don't know what to do. I have no one. I'm ...,I want to kill myself. I can't bring myself...,"I am almost 40, on disability morbidly obese a..."
author,t2_1t70,t2_1t70,t2_u5zigd9,t2_5a7y2qvy,t2_5w7amdt9,t2_5wgahit1,t2_1153k8,t2_4oposcku,t2_epni2,t2_5wht3pwg,...,t2_15s36q,t2_57zg8jc3,t2_580rrxyh,t2_5wh6itmf,t2_54svfch2,t2_42j8iug9,t2_5rdu7b8c,t2_4nxz31nb,t2_2c27arp7,t2_4cfx78jz
upvotes,1728,1216,158,113,32,38,410,25,6,7,...,1,82,3,5,24,4,3,4,4,8
downvotes,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [51]:
sw_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,998,999,1000,1001,1002,1003,1004,1005,1006,1007
title,Our most-broken and least-understood rules is ...,Regular Check-In Post,Does anyone else find it increasingly hard to ...,I’m going to a psychiatrist for the first time...,I made rice and didn't burn it.,This life is a curse not a blessing.,The COVID-19 massive decrease in social activi...,I can’t fucking get a job,"""Learn to love yourself""",I’ve given up on 2020 and it’s a struggle to e...,...,i wish i'm forever asleep,I feel unwanted,Hi...,How do you tell all your loved wants you want ...,Drowning,Do you get more sadder the older you get?,Apparently you can’t have depression if *some*...,15 year old with nothing to do,do you believe depression always has a cause?,How long does booking a session with a therapi...
post,We understand that most people who reply immed...,Welcome to /r/depression's check-in post - a p...,My only social interaction is at work but it's...,So bit of a story I’m 22 I’ve never been to th...,I gained some self confidence from this.,Nothing in life can out weigh the CONSISTENT p...,"Ok, hear me out. I've been pretty severely dep...",I’m just done. I had an interview with this co...,Does anyone else feel really irritated when th...,"I’ve been depressed ever since June of 2019, I...",...,i always have these thoughts every time that i...,I may look like I’m having fun at parties but ...,"Hey everyone, how has your week been so far?","I don't want to like, seem dramatic. It makes ...","The time 5 am once again, I sit here in the da...",I feel like the older I get the more sadder li...,I should be living in bliss because I can be a...,im 15 years old and i have nothing to do...i s...,when I started becoming depressed I had a norm...,Just a few months ago i finally got the courag...
author,t2_1t70,t2_64qjj,t2_5nqv8mcx,t2_2cd736ed,t2_3yfpqhqs,t2_3q8ltal1,t2_hscw8,t2_5c3k3z5l,t2_4mzvnfpt,t2_5cedh0wm,...,t2_5ar1umn2,t2_5t3gu6a6,t2_38b9m61h,t2_9kky4w2,t2_5gdokfjf,t2_3vyyio99,t2_4ujxb8he,t2_4srx8kxs,t2_5ftp3mbi,t2_34w98hzm
upvotes,1867,339,1285,712,47,52,146,19,74,176,...,4,46,63,13,2,248,9,5,2,2
downvotes,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [52]:
# transposing df
dep_df = dep_df.T
sw_df = sw_df.T

In [53]:
dep_df.head()

Unnamed: 0,title,post,author,upvotes,downvotes
0,New wiki on how to avoid accidentally encourag...,We've been seeing a worrying increase in pro-s...,t2_1t70,1728,0
1,Reminder: Absolutely no activism of any kind i...,"If you want to recognise an occasion, please d...",t2_1t70,1216,0
2,Haha help,"Yes, I am suicidal. Yes, I am ""getting help"". ...",t2_u5zigd9,158,0
3,Someone please talk to me😢 anyone at all please,"I have absolutely no one to turn to, I feel so...",t2_5a7y2qvy,113,0
4,I'm usually responding to these posts,I don't feel like I shouldn't be posting this ...,t2_5w7amdt9,32,0


In [54]:
sw_df.head()

Unnamed: 0,title,post,author,upvotes,downvotes
0,Our most-broken and least-understood rules is ...,We understand that most people who reply immed...,t2_1t70,1867,0
1,Regular Check-In Post,Welcome to /r/depression's check-in post - a p...,t2_64qjj,339,0
2,Does anyone else find it increasingly hard to ...,My only social interaction is at work but it's...,t2_5nqv8mcx,1285,0
3,I’m going to a psychiatrist for the first time...,So bit of a story I’m 22 I’ve never been to th...,t2_2cd736ed,712,0
4,I made rice and didn't burn it.,I gained some self confidence from this.,t2_3yfpqhqs,47,0
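
Building the frame from five lists and then transposing works, but the transposed columns hold mixed Python objects until the CSV round trip below restores numeric types. As an alternative sketch, one dictionary per post can be passed to `pd.DataFrame` directly; `to_frame` is a hypothetical helper and the sample records are invented.

```python
import pandas as pd

# Invented records with the same keys the notebook reads from each post.
posts = [
    {"data": {"title": "a", "selftext": "x", "author_fullname": "t2_1", "ups": 3, "downs": 0}},
    {"data": {"title": "b", "selftext": "", "ups": 1, "downs": 0}},  # no author key
]


def to_frame(children):
    """Turn listing children into one row per post with the five columns used here."""
    rows = [
        {
            "title": c["data"]["title"],
            "post": c["data"]["selftext"],
            "author": c["data"].get("author_fullname", "anonymous"),
            "upvotes": c["data"]["ups"],
            "downvotes": c["data"]["downs"],
        }
        for c in children
    ]
    return pd.DataFrame(rows)


toy_df = to_frame(posts)
print(toy_df.dtypes)
```

With this layout no transpose is needed, and the vote columns are inferred as integers from the start.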


In [55]:
# to be dropped
dep_df['downvotes'].value_counts()

0    1012
Name: downvotes, dtype: int64

---
### Save the initial DataFrames.


In [56]:
dep_df.to_csv('./dataset/dep_df.csv', index=False)
sw_df.to_csv('./dataset/sw_df.csv', index=False)

---

In [57]:
# open dataframes
dep_df = pd.read_csv('./dataset/dep_df.csv')
sw_df = pd.read_csv('./dataset/sw_df.csv')
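
One side effect of this round trip is worth noting: posts whose self text is an empty string are written as empty CSV fields, and `pd.read_csv` reads those back as missing values by default. This is likely where the null posts counted later come from. A small sketch with an in-memory buffer in place of the `./dataset/` files:

```python
import io

import pandas as pd

toy = pd.DataFrame({"title": ["a", "b"], "post": ["text", ""]})

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# default behaviour: the empty field comes back as NaN
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded["post"].isnull().sum())

# keep_default_na=False keeps the empty string as it was
buffer.seek(0)
kept = pd.read_csv(buffer, keep_default_na=False)
print(kept["post"].tolist())
```

Note that `keep_default_na=False` also stops strings such as "NA" from being read as missing, which may or may not be wanted for post text.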

---

Identify duplicate posts and titles. Note that titles may be duplicated with different posts.

---
**dep_df**

In [58]:
# number of duplicated posts
dep_df['post'].duplicated(keep=False).sum()

122

In [59]:
# check out post that are duplicates
dep_df[dep_df['post'].duplicated(keep=False)].sort_values(by=['post'])


Unnamed: 0,title,post,author,upvotes,downvotes
1010,I wish I was never born.,I want to kill myself. I can't bring myself...,t2_2c27arp7,4,0
25,I wish I was never born.,I want to kill myself. I can't bring myself...,t2_2c27arp7,4,0
998,I need something to kill me. I am tired of tak...,"""...They say the ocean's blue but it's black r...",t2_4otqifif,29,0
13,I need something to kill me. I am tired of tak...,"""...They say the ocean's blue but it's black r...",t2_4otqifif,29,0
991,I want to Die!!,"&amp;#x200B;\n\nI so badly want to die, Everyd...",t2_1153k8,415,0
6,I want to Die!!,"&amp;#x200B;\n\nI so badly want to die, Everyd...",t2_1153k8,410,0
16,17 and Empty,"**Hey reddit, I'm a 17 year old boy from Engla...",t2_uvjdp79,8,0
1001,17 and Empty,"**Hey reddit, I'm a 17 year old boy from Engla...",t2_uvjdp79,9,0
15,I'm tired.,Every day I wake up and go through the same ex...,t2_48tcsbp1,6,0
1000,I'm tired.,Every day I wake up and go through the same ex...,t2_48tcsbp1,6,0


In [60]:
# drop duplicate posts that are not null values
dep_df = dep_df[dep_df['post'].isnull() | ~dep_df.duplicated(subset='post', keep='first')]

In [61]:
# number of duplicated posts
dep_df['post'].duplicated(keep=False).sum()

66

In [62]:
dep_df['post'].isnull().sum()

66
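
The two counts match because `duplicated` treats missing values as equal to one another, so every null post is flagged as a duplicate of the others. The toy frame below, with invented values, sketches how the mask keeps all null posts while dropping repeated non-null ones.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "title": ["t1", "t1", "t2", "t3", "t4"],
    "post": ["same", "same", np.nan, np.nan, "unique"],
})

# both 'same' rows and both null rows are flagged
flagged = toy["post"].duplicated(keep=False)

# keep every null post, and the first occurrence of each non-null post
keep_mask = toy["post"].isnull() | ~toy.duplicated(subset="post", keep="first")
deduped = toy[keep_mask]

# what remains flagged afterwards is exactly the set of null posts
remaining = deduped["post"].duplicated(keep=False).sum()
nulls = deduped["post"].isnull().sum()
print(remaining, nulls)
```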

In [63]:
# number of duplicated titles
dep_df['title'].duplicated(keep=False).sum()

41

In [64]:
# check out titles that are duplicates
dep_df[dep_df['title'].duplicated(keep=False)].sort_values(by=['title'])


Unnamed: 0,title,post,author,upvotes,downvotes
739,.. #5,At this point i realize I dont want to kill my...,t2_5vge2ohk,1,0
584,.. #5,Decided im too much of a pussy to kill myself....,t2_5vge2ohk,5,0
960,Help,I’ve tried to overdose on paracetamol. I took ...,anonymous,2,0
857,Help,"No helplines will answer, nobody to talk to, g...",t2_4e4wij5o,3,0
219,Help,I want to die I’m just feeling so much at once...,t2_58yrdw26,1,0
287,Help,Help. I need help but want to chat in private ...,t2_3n09fmv1,2,0
91,Help,Can anyone help. I've lost all motivation to l...,t2_5wgvebfl,3,0
646,I give up,I really do think it’s too hard to do this any...,t2_4tvqcss5,7,0
147,I give up,"I’m done, I give up. I thought I may be tough ...",t2_4bqul4lw,1,0
116,I want to die,I want to die I want to die I want to die I wa...,t2_5qagjgtl,3,0


In [65]:
dep_df = dep_df.drop_duplicates(subset=['title', 'post'], keep='first')

In [66]:
# number of duplicated titles
dep_df['title'].duplicated(keep=False).sum()

39

In [67]:
# check out titles that are duplicates
dep_df[dep_df['title'].duplicated(keep=False)].sort_values(by=['title'])

Unnamed: 0,title,post,author,upvotes,downvotes
739,.. #5,At this point i realize I dont want to kill my...,t2_5vge2ohk,1,0
584,.. #5,Decided im too much of a pussy to kill myself....,t2_5vge2ohk,5,0
960,Help,I’ve tried to overdose on paracetamol. I took ...,anonymous,2,0
287,Help,Help. I need help but want to chat in private ...,t2_3n09fmv1,2,0
857,Help,"No helplines will answer, nobody to talk to, g...",t2_4e4wij5o,3,0
91,Help,Can anyone help. I've lost all motivation to l...,t2_5wgvebfl,3,0
219,Help,I want to die I’m just feeling so much at once...,t2_58yrdw26,1,0
646,I give up,I really do think it’s too hard to do this any...,t2_4tvqcss5,7,0
147,I give up,"I’m done, I give up. I thought I may be tough ...",t2_4bqul4lw,1,0
204,I want to die,I am 13 and get severly bullied everyday im no...,t2_5cpmq1rk,4,0


**sw_df**

In [68]:
# number of duplicated posts
sw_df['post'].duplicated(keep=False).sum()

172

In [69]:
# check out post that are duplicates
sw_df[sw_df['post'].duplicated(keep=False)].sort_values(by=['post'])


Unnamed: 0,title,post,author,upvotes,downvotes
41,I wish i had friends,"(To start off I’m 16, male, haven’t been to sc...",t2_5u7u5b6q,9,0
972,I wish i had friends,"(To start off I’m 16, male, haven’t been to sc...",t2_5u7u5b6q,8,0
977,Life sucks,Anyone else feel like ur just living the same ...,t2_3651p0eg,5,0
46,Life sucks,Anyone else feel like ur just living the same ...,t2_3651p0eg,5,0
942,Dating sucks when your a loner,Been seeing this guy for a few months and he's...,t2_3srcn08x,12,0
11,Dating sucks when your a loner,Been seeing this guy for a few months and he's...,t2_3srcn08x,11,0
8,"""Learn to love yourself""",Does anyone else feel really irritated when th...,t2_4mzvnfpt,74,0
939,"""Learn to love yourself""",Does anyone else feel really irritated when th...,t2_4mzvnfpt,78,0
15,A darker future,"Don't say that you didn't see it coming, how t...",t2_2ba83lsd,17,0
946,A darker future,"Don't say that you didn't see it coming, how t...",t2_2ba83lsd,16,0


In [70]:
# drop duplicate posts that are not null values
sw_df = sw_df[sw_df['post'].isnull() | ~sw_df.duplicated(subset='post', keep='first')]

In [71]:
# number of duplicated posts
sw_df['post'].duplicated(keep=False).sum()

12

In [72]:
sw_df['post'].isnull().sum()

12

In [73]:
# number of duplicated titles
sw_df['title'].duplicated(keep=False).sum()

8

In [74]:
# check out titles that are duplicates
sw_df[sw_df['title'].duplicated(keep=False)].sort_values(by=['title'])


Unnamed: 0,title,post,author,upvotes,downvotes
250,.,The reproduction of hope is there was nothing ...,t2_2mpw45ts,1,0
285,.,Wanting to die is the most alive you’ll ever feel,t2_2mpw45ts,2,0
865,Apethetic and depressed - M18,"\nHey everyone, I’ve been super depressed off ...",t2_5kfevpzr,1,0
868,Apethetic and depressed - M18,"Hey everyone, I’ve been super depressed off la...",t2_5kfevpzr,1,0
381,Eh,I’ve been having so many anxiety attacks latel...,t2_5nm4iu9d,1,0
767,Eh,"I've reached a point where I'm constantly sad,...",t2_4zrultbw,3,0
183,Life,"Truth is, I really wish my fucks ran out. I do...",t2_5a3gy45c,3,0
621,Life,My life is full of tragedies I hate it so much...,t2_2oogwu6z,4,0


In [75]:
sw_df = sw_df.drop_duplicates(['title', 'post', 'author'], keep='first')

In [76]:
# number of duplicated titles
sw_df['title'].duplicated(keep=False).sum()

8

In [77]:
# check out titles that are duplicates
sw_df[sw_df['title'].duplicated(keep=False)].sort_values(by=['title'])

Unnamed: 0,title,post,author,upvotes,downvotes
250,.,The reproduction of hope is there was nothing ...,t2_2mpw45ts,1,0
285,.,Wanting to die is the most alive you’ll ever feel,t2_2mpw45ts,2,0
865,Apethetic and depressed - M18,"\nHey everyone, I’ve been super depressed off ...",t2_5kfevpzr,1,0
868,Apethetic and depressed - M18,"Hey everyone, I’ve been super depressed off la...",t2_5kfevpzr,1,0
381,Eh,I’ve been having so many anxiety attacks latel...,t2_5nm4iu9d,1,0
767,Eh,"I've reached a point where I'm constantly sad,...",t2_4zrultbw,3,0
183,Life,"Truth is, I really wish my fucks ran out. I do...",t2_5a3gy45c,3,0
621,Life,My life is full of tragedies I hate it so much...,t2_2oogwu6z,4,0


---
## 3. Concatenate DataFrames
---
Designate each of the initial data tables with a binary value for the status 'is_sw', where 0 is for all posts from r/depression, and 1 for r/SuicideWatch.

In [78]:
# create binary target label: 'is_sw' (is 'suicide watch')
dep_df['is_sw'] = 0
sw_df['is_sw'] = 1

Combine both tables into one DataFrame.

In [79]:
df = pd.concat([dep_df, sw_df])
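
By default `pd.concat` keeps each table's own row labels, so the combined frame carries repeated index values (visible in `df.head()` only as 0, 1, 2, ...). This does not affect the CSV saved later with `index=False`, but label-based lookups such as `df.loc[0]` would return two rows. A sketch on toy frames, with `ignore_index=True` as one possible remedy:

```python
import pandas as pd

a = pd.DataFrame({"title": ["a1", "a2"], "is_sw": 0})
b = pd.DataFrame({"title": ["b1", "b2"], "is_sw": 1})

stacked = pd.concat([a, b])                        # labels 0, 1, 0, 1
renumbered = pd.concat([a, b], ignore_index=True)  # labels 0, 1, 2, 3

print(stacked.index.is_unique, renumbered.index.is_unique)
```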

Inspect null values.

In [80]:
df.isnull().sum()

title         0
post         77
author        0
upvotes       0
downvotes     0
is_sw         0
dtype: int64

The null values are submissions that have a title but no self text, so fill them with empty strings.

In [81]:
df['post'] = df['post'].fillna("")

In [82]:
df.head(10)

Unnamed: 0,title,post,author,upvotes,downvotes,is_sw
0,New wiki on how to avoid accidentally encourag...,We've been seeing a worrying increase in pro-s...,t2_1t70,1728,0,0
1,Reminder: Absolutely no activism of any kind i...,"If you want to recognise an occasion, please d...",t2_1t70,1216,0,0
2,Haha help,"Yes, I am suicidal. Yes, I am ""getting help"". ...",t2_u5zigd9,158,0,0
3,Someone please talk to me😢 anyone at all please,"I have absolutely no one to turn to, I feel so...",t2_5a7y2qvy,113,0,0
4,I'm usually responding to these posts,I don't feel like I shouldn't be posting this ...,t2_5w7amdt9,32,0,0
5,I am a horrible abusive person who undeniably ...,I'm going to get straight to the point. When I...,t2_5wgahit1,38,0,0
6,I want to Die!!,"&amp;#x200B;\n\nI so badly want to die, Everyd...",t2_1153k8,410,0,0
7,Suicidal Ideation,"I’ve been on Prozac, Lexapro, Abilify, and Lor...",t2_4oposcku,25,0,0
8,I'm done,I really don't want to be considering suicide....,t2_epni2,6,0,0
9,I'm a lonely worthless loser with no friends a...,I have no friends. Everyone forgets I even exi...,t2_5wht3pwg,7,0,0


Combine the 'title' and 'post' columns into a single column.

In [83]:
df['comb'] = df['title'] + ' ' + df['post']
df.head()

Unnamed: 0,title,post,author,upvotes,downvotes,is_sw,comb
0,New wiki on how to avoid accidentally encourag...,We've been seeing a worrying increase in pro-s...,t2_1t70,1728,0,0,New wiki on how to avoid accidentally encourag...
1,Reminder: Absolutely no activism of any kind i...,"If you want to recognise an occasion, please d...",t2_1t70,1216,0,0,Reminder: Absolutely no activism of any kind i...
2,Haha help,"Yes, I am suicidal. Yes, I am ""getting help"". ...",t2_u5zigd9,158,0,0,"Haha help Yes, I am suicidal. Yes, I am ""getti..."
3,Someone please talk to me😢 anyone at all please,"I have absolutely no one to turn to, I feel so...",t2_5a7y2qvy,113,0,0,Someone please talk to me😢 anyone at all pleas...
4,I'm usually responding to these posts,I don't feel like I shouldn't be posting this ...,t2_5w7amdt9,32,0,0,I'm usually responding to these posts I don't ...


In [84]:
# check for any null data in columns
df.isnull().sum()

title        0
post         0
author       0
upvotes      0
downvotes    0
is_sw        0
comb         0
dtype: int64
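
Filling the null posts before building 'comb' matters because string concatenation with a missing value yields a missing value, which would discard the title as well. A short sketch on invented rows:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"title": ["Help", "Tired"], "post": ["please", np.nan]})

# without filling, the second row's combined text is lost entirely
unfilled = toy["title"] + " " + toy["post"]

# with filling, the title survives (followed by a trailing space)
filled = toy["title"] + " " + toy["post"].fillna("")

# optional: strip the trailing space left by empty posts
stripped = filled.str.strip()
print(stripped.tolist())
```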

In [85]:
df.shape

(1909, 7)

In [86]:
df.is_sw.value_counts()

0    983
1    926
Name: is_sw, dtype: int64
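
The same counts can be expressed as class shares with `value_counts(normalize=True)`. The series below is rebuilt from the two counts printed above rather than from the real data, so it is only an illustration.

```python
import pandas as pd

# stand-in label column reconstructed from the counts shown above
labels = pd.Series([0] * 983 + [1] * 926, name="is_sw")

# fraction of rows in each class
share = labels.value_counts(normalize=True).round(3)
print(share)
```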

Of the 1909 posts in total, 983 are labeled is_sw = 0 (from dep_df) and 926 are labeled is_sw = 1 (from sw_df).

---
### Save new combined DataFrame
---

This DataFrame contains the titles, posts, authors, upvotes, downvotes, and combined text, as well as our target vector is_sw.

In [87]:
# save combined data in new DataFrame
df.to_csv('./dataset/df.csv', index=False)

---