# Data Collection

In [1]:
# Import Basic Packages
import requests
import time
import random
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
pd.set_option('display.max_columns', 999)

## Making a Reddit Scraping Function

First, let's scrape a single Reddit URL to see which steps a successful scrape requires. The `requests` library submits HTTP requests from Python. We start with the JSON listing of the software engineering subreddit.

In [2]:
url = 'https://www.reddit.com/r/softwareengineering.json'
headers = {'User-agent': 'GA boy'}
res = requests.get(url, headers=headers)
res.status_code

200

An HTTP response status code of `200` means that the request has succeeded. Since we used the `GET` method, the requested resource has been fetched and is transmitted in the message body.
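Other status codes are worth recognising when a scrape fails. As a small illustration using only Python's standard library (the helper name `describe_status` is hypothetical and not part of `requests`), the sketch below maps a numeric code to its standard reason phrase and flags whether it falls in the 2xx success range:

```python
from http import HTTPStatus


def describe_status(code):
    """Return (reason phrase, is_success) for a numeric HTTP status code."""
    status = HTTPStatus(code)        # raises ValueError for unknown codes
    is_success = 200 <= code < 300   # 2xx codes indicate success
    return status.phrase, is_success


for code in (200, 404, 429):
    phrase, ok = describe_status(code)
    print(code, phrase, "success" if ok else "failure")
```

A code such as `429` (Too Many Requests) is a hint to slow down rather than to retry immediately.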

In [3]:
# Parse the JSON response body into a Python dictionary using the .json() method
url_dict = res.json()
url_dict

{'kind': 'Listing',
 'data': {'modhash': '',
  'dist': 26,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'SoftwareEngineering',
     'selftext': "Hi /r/SoftwareEngineering - This is just to let you all know that automoderator has been enabled for this subreddit and we will be fine-tuning the options as we go, if you get a post filtered and you think it's an error please contact me so that I can take a look.\n\n&amp;nbsp;\n\nI also enabled a [dark theme](https://www.reddit.com/r/darkserene/) for the people not using RES' nightmode already that I hope it's easier on your eyes.\n\n&amp;nbsp;\n\nPlease use this space for any feedback you have for the community.\n\n&amp;nbsp;\n\n\nThat would be all!\n\n&amp;nbsp;\n\nAll the best,\n\nThe modteam.",
     'author_fullname': 't2_6w5wq',
     'saved': False,
     'mod_reason_title': None,
     'gilded': 0,
     'clicked': False,
     'title': 'Style changes and automoderator',
     'link_flair_richtext': [

Based on the output above, we want to extract the 'children' key and, within each child, fields such as 'subreddit', 'selftext' and 'title', which contain the text we will use for our analysis.
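As a rough sketch of that extraction, the helper below walks the `data -> children -> data` nesting and keeps only the requested fields. The function name `extract_fields` and the sample listing are made up for illustration; the sample only mirrors the shape of the response and is not real Reddit data.

```python
def extract_fields(listing, fields=("subreddit", "title", "selftext")):
    """Pull selected fields out of every post in a Reddit-style listing dict."""
    rows = []
    for child in listing["data"]["children"]:
        post = child["data"]
        # .get() returns None when a post lacks one of the requested fields
        rows.append({field: post.get(field) for field in fields})
    return rows


# Hand-written sample with the same nesting as the output above (not real data)
sample_listing = {
    "kind": "Listing",
    "data": {
        "children": [
            {"kind": "t3", "data": {"subreddit": "example", "title": "First post", "selftext": "Hello"}},
            {"kind": "t3", "data": {"subreddit": "example", "title": "Link post", "selftext": ""}},
        ],
        "after": None,
    },
}

rows = extract_fields(sample_listing)
print(rows[0]["title"])
```

The cells that follow do the same thing by hand, one key at a time, on the real response.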

In [4]:
url_dict['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

In [5]:
# Shows the subreddit topic for the first post
url_dict['data']['children'][0]['data']['subreddit']

'SoftwareEngineering'

In [6]:
# Shows the title of the second post
url_dict['data']['children'][1]['data']['title']

'How do your companies keep track of multiple systems, products and teams to better inform decision making?'

In [7]:
# Shows the selftext of the first post
url_dict['data']['children'][0]['data']['selftext']

"Hi /r/SoftwareEngineering - This is just to let you all know that automoderator has been enabled for this subreddit and we will be fine-tuning the options as we go, if you get a post filtered and you think it's an error please contact me so that I can take a look.\n\n&amp;nbsp;\n\nI also enabled a [dark theme](https://www.reddit.com/r/darkserene/) for the people not using RES' nightmode already that I hope it's easier on your eyes.\n\n&amp;nbsp;\n\nPlease use this space for any feedback you have for the community.\n\n&amp;nbsp;\n\n\nThat would be all!\n\n&amp;nbsp;\n\nAll the best,\n\nThe modteam."

In [8]:
# Convert these subreddit posts into a pandas dataframe
url_posts = [p['data'] for p in url_dict['data']['children']]
print('The number of subreddit posts in this dataframe is: ', len(url_posts))
pd.DataFrame(url_posts).head()

The number of subreddit posts in this dataframe is:  26


Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,downs,thumbnail_height,top_awarded_type,hide_score,name,quarantine,link_flair_text_color,upvote_ratio,author_flair_background_color,subreddit_type,ups,total_awards_received,media_embed,thumbnail_width,author_flair_template_id,is_original_content,user_reports,secure_media,is_reddit_media_domain,is_meta,category,secure_media_embed,link_flair_text,can_mod_post,score,approved_by,author_premium,thumbnail,edited,author_flair_css_class,author_flair_richtext,gildings,content_categories,is_self,mod_note,created,link_flair_type,wls,removed_by_category,banned_by,author_flair_type,domain,allow_live_comments,selftext_html,likes,suggested_sort,banned_at_utc,view_count,archived,no_follow,is_crosspostable,pinned,over_18,all_awardings,awarders,media_only,can_gild,spoiler,locked,author_flair_text,treatment_tags,visited,removed_by,num_reports,distinguished,subreddit_id,mod_reason_by,removal_reason,link_flair_background_color,id,is_robot_indexable,report_reasons,author,discussion_type,num_comments,send_replies,whitelist_status,contest_mode,mod_reports,author_patreon_flair,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,poll_data,post_hint,url_overridden_by_dest,preview
0,,SoftwareEngineering,Hi /r/SoftwareEngineering - This is just to le...,t2_6w5wq,False,,0,False,Style changes and automoderator,[],r/SoftwareEngineering,False,6,,0,,,False,t3_cbpgfx,False,dark,0.72,,public,3,0,{},,,False,[],,False,False,,{},,False,3,,False,self,False,,[],{},,True,,1562837000.0,text,6,,,text,self.SoftwareEngineering,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,True,False,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qmng,,,,cbpgfx,True,,Tred27,,0,True,all_ads,False,[],False,,/r/SoftwareEngineering/comments/cbpgfx/style_c...,all_ads,True,https://www.reddit.com/r/SoftwareEngineering/c...,24671,1562808000.0,0,,False,,,,
1,,SoftwareEngineering,I have been at several large companies now and...,t2_59dwm06,False,,0,False,How do your companies keep track of multiple s...,[],r/SoftwareEngineering,False,6,,0,,,False,t3_inz7gf,False,dark,0.84,,public,4,0,{},,,False,[],,False,False,,{},,False,4,,False,self,False,,[],{},,True,,1599475000.0,text,6,,,text,self.SoftwareEngineering,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qmng,,,,inz7gf,True,,arven_ekargon,,10,True,all_ads,False,[],False,,/r/SoftwareEngineering/comments/inz7gf/how_do_...,all_ads,False,https://www.reddit.com/r/SoftwareEngineering/c...,24671,1599446000.0,0,,False,"{'user_won_amount': None, 'resolved_option_id'...",,,
2,,SoftwareEngineering,"I am doing a Uni course on Data Engineering, w...",t2_57j4x5fp,False,,0,False,How to learn programming fundamentals?,[],r/SoftwareEngineering,False,6,,0,,,False,t3_ins3wm,False,dark,0.82,,public,11,0,{},,,False,[],,False,False,,{},,False,11,,False,self,1.59942e+09,,[],{},,True,,1599449000.0,text,6,,,text,self.SoftwareEngineering,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,False,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qmng,,,,ins3wm,True,,AMGraduate564,,13,True,all_ads,False,[],False,,/r/SoftwareEngineering/comments/ins3wm/how_to_...,all_ads,False,https://www.reddit.com/r/SoftwareEngineering/c...,24671,1599420000.0,0,,False,,,,
3,,SoftwareEngineering,,t2_ayl6m,False,,0,False,On house building and software development pro...,[],r/SoftwareEngineering,False,6,,0,78.0,,False,t3_innvzx,False,dark,0.7,,public,4,0,{},140.0,,False,[],,False,False,,{},,False,4,,False,https://b.thumbs.redditmedia.com/Ly_ESOgCG5bCG...,False,,[],{},,False,,1599435000.0,text,6,,,text,blog.frankel.ch,False,,,,,,False,False,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qmng,,,,innvzx,True,,nfrankel,,0,True,all_ads,False,[],False,,/r/SoftwareEngineering/comments/innvzx/on_hous...,all_ads,False,https://blog.frankel.ch/on-house-building-soft...,24671,1599406000.0,0,,False,,link,https://blog.frankel.ch/on-house-building-soft...,{'images': [{'source': {'url': 'https://extern...
4,,SoftwareEngineering,I’ve been learning to code using Udemy and thi...,t2_60gqusa6,False,,0,False,Learning to code,[],r/SoftwareEngineering,False,6,,0,,,False,t3_ineb6z,False,dark,0.95,,public,14,0,{},,,False,[],,False,False,,{},,False,14,,False,self,False,,[],{},,True,,1599389000.0,text,6,,,text,self.SoftwareEngineering,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qmng,,,,ineb6z,True,,jrsteck_,,10,True,all_ads,False,[],False,,/r/SoftwareEngineering/comments/ineb6z/learnin...,all_ads,False,https://www.reddit.com/r/SoftwareEngineering/c...,24671,1599360000.0,0,,False,,,,


In [9]:
# Unique IDs/names given to each post in the subreddit
pd.DataFrame(url_posts)['name']

0     t3_cbpgfx
1     t3_inz7gf
2     t3_ins3wm
3     t3_innvzx
4     t3_ineb6z
5     t3_inhaxx
6     t3_in1l8s
7     t3_imsnpc
8     t3_imys4v
9     t3_imi669
10    t3_impuyf
11    t3_im52nv
12    t3_imd8wh
13    t3_im7334
14    t3_im9s8c
15    t3_ilul2d
16    t3_im3v63
17    t3_ilswz1
18    t3_ilsny2
19    t3_ilk9u6
20    t3_ilsogp
21    t3_ilw6nr
22    t3_ilmg3e
23    t3_il6c72
24    t3_iller3
25    t3_ilg75b
Name: name, dtype: object

In [10]:
# The 'after' key holds the name of the last post on the current page; it acts as the cursor for the next page
url_dict['data']['after']

't3_ilg75b'

In [11]:
# Generate the url that gives us the next page with the next 25 posts
url + '?after=' + url_dict['data']['after']

'https://www.reddit.com/r/softwareengineering.json?after=t3_ilg75b'

Based on the steps we have performed on this URL, we can now write our own scraping loop for a subreddit page.
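The steps so far can be sketched as one reusable function. This is a minimal sketch under stated assumptions, not the code used in this notebook: `scrape_listing` and its `fetch_json` parameter are hypothetical names, and `fetch_json` stands for any callable that takes a URL and returns the parsed listing dictionary (or `None` on failure). Passing the fetcher in lets the pagination logic run against fake pages without touching the network. One design choice that is an assumption here: the function stops as soon as a page returns no 'after' cursor.

```python
import time


def scrape_listing(base_url, fetch_json, max_pages=40, delay=0.0):
    """Collect posts page by page, following the 'after' cursor.

    fetch_json is a hypothetical callable: it takes a URL and returns the
    parsed listing dict, or None if the request failed.
    """
    posts = []
    after = None
    for _ in range(max_pages):
        url = base_url if after is None else base_url + "?after=" + after
        listing = fetch_json(url)
        if listing is None:          # request failed: stop early
            break
        posts.extend(child["data"] for child in listing["data"]["children"])
        after = listing["data"]["after"]
        if after is None:            # no cursor returned: last page reached
            break
        time.sleep(delay)            # pause between requests
    return posts


# Stand-in for the network: two fake pages keyed by URL (illustrative data only)
fake_pages = {
    "https://example.com/r/demo.json": {
        "data": {"children": [{"data": {"name": "t3_a"}}, {"data": {"name": "t3_b"}}], "after": "t3_b"}
    },
    "https://example.com/r/demo.json?after=t3_b": {
        "data": {"children": [{"data": {"name": "t3_c"}}], "after": None}
    },
}

posts = scrape_listing("https://example.com/r/demo.json", fake_pages.get)
print([p["name"] for p in posts])
```

In the notebook, `fetch_json` could wrap `requests.get(url, headers=headers)` and return `res.json()` only when `res.status_code == 200`.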

## Scraping Software Engineering Subreddit Data Using Reddit's API

In [12]:
# Scraping Software Engineering Subreddit Data

swe_posts=[] # Storing Software Engineering Subreddit Posts
after = None # after will be empty for the first iteration
swe_url = 'https://www.reddit.com/r/softwareengineering.json'

for a in range(40): #range number will indicate how many pages of 25 posts to scrape
    if after is None:
        current_url = swe_url
    else:
        current_url = swe_url + '?after=' + after
    print(current_url)
        
    res = requests.get(current_url, headers={'User-agent': 'GA Boy'})
    
    if res.status_code == 200: #to check if request is successful
        current_dict = res.json()
        current_posts = [p['data'] for p in current_dict['data']['children']]
        swe_posts.extend(current_posts) # Adding all posts scraped into this list
        after = current_dict['data']['after']
    else:
        print(res.status_code)  # report the failing status code before stopping
        break
        
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,6)
    print(sleep_duration)
    time.sleep(sleep_duration)
    time.sleep(1) # one extra second on top of the random pause

https://www.reddit.com/r/softwareengineering.json
4
https://www.reddit.com/r/softwareengineering.json?after=t3_ilg75b
6
https://www.reddit.com/r/softwareengineering.json?after=t3_ijqcb3
3
https://www.reddit.com/r/softwareengineering.json?after=t3_ihkilo
2
https://www.reddit.com/r/softwareengineering.json?after=t3_ifotao
3
https://www.reddit.com/r/softwareengineering.json?after=t3_idd0wf
2
https://www.reddit.com/r/softwareengineering.json?after=t3_ib9bqs
3
https://www.reddit.com/r/softwareengineering.json?after=t3_i8n0qw
6
https://www.reddit.com/r/softwareengineering.json?after=t3_i75nw7
4
https://www.reddit.com/r/softwareengineering.json?after=t3_i4mgnc
4
https://www.reddit.com/r/softwareengineering.json?after=t3_i2ja4v
5
https://www.reddit.com/r/softwareengineering.json?after=t3_i041d0
5
https://www.reddit.com/r/softwareengineering.json?after=t3_hyhhx3
3
https://www.reddit.com/r/softwareengineering.json?after=t3_hvq26m
2
https://www.reddit.com/r/softwareengineering.json?after=t3_htr2o

In [13]:
# Check the total number of posts scraped
print('The total number of subreddit posts scraped was:', len(swe_posts))

The total number of subreddit posts scraped was: 994


In [14]:
# Converting these scraped posts into a dataframe
swe_df = pd.DataFrame(swe_posts)
swe_df.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,downs,thumbnail_height,top_awarded_type,hide_score,name,quarantine,link_flair_text_color,upvote_ratio,author_flair_background_color,subreddit_type,ups,total_awards_received,media_embed,thumbnail_width,author_flair_template_id,is_original_content,user_reports,secure_media,is_reddit_media_domain,is_meta,category,secure_media_embed,link_flair_text,can_mod_post,score,approved_by,author_premium,thumbnail,edited,author_flair_css_class,author_flair_richtext,gildings,content_categories,is_self,mod_note,created,link_flair_type,wls,removed_by_category,banned_by,author_flair_type,domain,allow_live_comments,selftext_html,likes,suggested_sort,banned_at_utc,view_count,archived,no_follow,is_crosspostable,pinned,over_18,all_awardings,awarders,media_only,can_gild,spoiler,locked,author_flair_text,treatment_tags,visited,removed_by,num_reports,distinguished,subreddit_id,mod_reason_by,removal_reason,link_flair_background_color,id,is_robot_indexable,report_reasons,author,discussion_type,num_comments,send_replies,whitelist_status,contest_mode,mod_reports,author_patreon_flair,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,poll_data,post_hint,url_overridden_by_dest,preview,crosspost_parent_list,crosspost_parent,media_metadata
0,,SoftwareEngineering,Hi /r/SoftwareEngineering - This is just to le...,t2_6w5wq,False,,0,False,Style changes and automoderator,[],r/SoftwareEngineering,False,6,,0,,,False,t3_cbpgfx,False,dark,0.72,,public,3,0,{},,,False,[],,False,False,,{},,False,3,,False,self,False,,[],{},,True,,1562837000.0,text,6,,,text,self.SoftwareEngineering,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,True,False,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qmng,,,,cbpgfx,True,,Tred27,,0,True,all_ads,False,[],False,,/r/SoftwareEngineering/comments/cbpgfx/style_c...,all_ads,True,https://www.reddit.com/r/SoftwareEngineering/c...,24671,1562808000.0,0,,False,,,,,,,
1,,SoftwareEngineering,I have been at several large companies now and...,t2_59dwm06,False,,0,False,How do your companies keep track of multiple s...,[],r/SoftwareEngineering,False,6,,0,,,False,t3_inz7gf,False,dark,0.72,,public,3,0,{},,,False,[],,False,False,,{},,False,3,,False,self,False,,[],{},,True,,1599475000.0,text,6,,,text,self.SoftwareEngineering,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qmng,,,,inz7gf,True,,arven_ekargon,,10,True,all_ads,False,[],False,,/r/SoftwareEngineering/comments/inz7gf/how_do_...,all_ads,False,https://www.reddit.com/r/SoftwareEngineering/c...,24671,1599446000.0,0,,False,"{'user_won_amount': None, 'resolved_option_id'...",,,,,,
2,,SoftwareEngineering,"I am doing a Uni course on Data Engineering, w...",t2_57j4x5fp,False,,0,False,How to learn programming fundamentals?,[],r/SoftwareEngineering,False,6,,0,,,False,t3_ins3wm,False,dark,0.76,,public,8,0,{},,,False,[],,False,False,,{},,False,8,,False,self,1.59942e+09,,[],{},,True,,1599449000.0,text,6,,,text,self.SoftwareEngineering,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,False,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qmng,,,,ins3wm,True,,AMGraduate564,,13,True,all_ads,False,[],False,,/r/SoftwareEngineering/comments/ins3wm/how_to_...,all_ads,False,https://www.reddit.com/r/SoftwareEngineering/c...,24671,1599420000.0,0,,False,,,,,,,
3,,SoftwareEngineering,,t2_ayl6m,False,,0,False,On house building and software development pro...,[],r/SoftwareEngineering,False,6,,0,78.0,,False,t3_innvzx,False,dark,0.58,,public,2,0,{},140.0,,False,[],,False,False,,{},,False,2,,False,https://b.thumbs.redditmedia.com/Ly_ESOgCG5bCG...,False,,[],{},,False,,1599435000.0,text,6,,,text,blog.frankel.ch,False,,,,,,False,False,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qmng,,,,innvzx,True,,nfrankel,,0,True,all_ads,False,[],False,,/r/SoftwareEngineering/comments/innvzx/on_hous...,all_ads,False,https://blog.frankel.ch/on-house-building-soft...,24671,1599406000.0,0,,False,,link,https://blog.frankel.ch/on-house-building-soft...,{'images': [{'source': {'url': 'https://extern...,,,
4,,SoftwareEngineering,I’ve been learning to code using Udemy and thi...,t2_60gqusa6,False,,0,False,Learning to code,[],r/SoftwareEngineering,False,6,,0,,,False,t3_ineb6z,False,dark,1.0,,public,15,0,{},,,False,[],,False,False,,{},,False,15,,False,self,False,,[],{},,True,,1599389000.0,text,6,,,text,self.SoftwareEngineering,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qmng,,,,ineb6z,True,,jrsteck_,,10,True,all_ads,False,[],False,,/r/SoftwareEngineering/comments/ineb6z/learnin...,all_ads,False,https://www.reddit.com/r/SoftwareEngineering/c...,24671,1599360000.0,0,,False,,,,,,,


In [15]:
swe_df.shape

(994, 111)

Going one step further, let's count the posts that have a unique entry in the 'selftext' column. We will then remove every row whose 'selftext' duplicates that of an earlier row.

In [16]:
# Check the number of scraped posts that have a unique entry in the 'selftext' column
print(str('The number of subreddit posts that have a unique entry in the "selftext" column was:'), len(set([p['selftext'] for p in swe_posts])))

The number of subreddit posts that have a unique entry in the "selftext" column was: 724


In [17]:
# Check on duplicate rows/posts
swe_df.duplicated(subset=['selftext']).sum()

270

In [18]:
# Removing duplicate rows
swe_df.drop_duplicates(subset = ['selftext'], inplace = True)

In [19]:
# Verification Check
swe_df.duplicated(subset=['selftext']).sum()

0

Next, we will also remove posts whose 'title' duplicates that of an earlier post.

In [20]:
# Check on duplicate rows/posts
swe_df.duplicated(subset=['title']).sum()

3

In [21]:
# Removing duplicate rows
swe_df.drop_duplicates(subset = ['title'], inplace = True)

In [22]:
# Verification Check
swe_df.duplicated(subset=['title']).sum()

0

In [23]:
# Final Remaining Rows in our swe_df
swe_df.shape

(721, 111)

After removing duplicate rows, we are left with 721 posts from the software engineering subreddit.
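To see what de-duplicating on a single column does, here is a plain-Python sketch. The helper `drop_duplicates_by` is hypothetical and only mimics the default keep-first behaviour of `drop_duplicates(subset=[...])`; the rows are invented for illustration. It highlights one side effect that may be worth checking in the real data: link posts have an empty 'selftext' (as row 3 of the table above shows), so they count as duplicates of one another and only the first is kept.

```python
def drop_duplicates_by(rows, key):
    """Keep the first row seen for each distinct value of row[key]."""
    seen = set()
    kept = []
    for row in rows:
        value = row[key]
        if value not in seen:
            seen.add(value)
            kept.append(row)
    return kept


# Illustrative rows, not scraped data
rows = [
    {"title": "Text post", "selftext": "Some body text"},
    {"title": "Link post A", "selftext": ""},
    {"title": "Link post B", "selftext": ""},               # distinct post, same empty selftext
    {"title": "Text post", "selftext": "Some body text"},   # true duplicate
]

kept = drop_duplicates_by(rows, "selftext")
print([row["title"] for row in kept])
```

Here "Link post B" is dropped even though it is a different post, which is the behaviour to keep in mind when reading the duplicate counts above.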

## Scraping Data Science Subreddit Data using Reddit's API

In [24]:
# Scraping Data Science Subreddit Data

ds_posts=[] # Storing Data Science Subreddit Posts
after = None # after will be empty for the first iteration
ds_url = 'https://www.reddit.com/r/datascience.json'

for a in range(40): #range number will indicate how many pages of 25 posts to scrape
    if after is None:
        current_url = ds_url
    else:
        current_url = ds_url + '?after=' + after
    print(current_url)
        
    res = requests.get(current_url, headers={'User-agent': 'GA Boy'})
    
    if res.status_code==200: #to check if request successful
        current_dict = res.json()
        current_posts = [p['data'] for p in current_dict['data']['children']]
        ds_posts.extend(current_posts) 
        after = current_dict['data']['after']
    else:
        print(res.status_code)  # report the failing status code before stopping
        break
        
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,6)
    print(sleep_duration)
    time.sleep(sleep_duration)
    time.sleep(1) #seconds to sleep

https://www.reddit.com/r/datascience.json
3
https://www.reddit.com/r/datascience.json?after=t3_in5vde
6
https://www.reddit.com/r/datascience.json?after=t3_il6rux
4
https://www.reddit.com/r/datascience.json?after=t3_ii0idb
3
https://www.reddit.com/r/datascience.json?after=t3_ih1sd5
2
https://www.reddit.com/r/datascience.json?after=t3_igewo5
6
https://www.reddit.com/r/datascience.json?after=t3_iflow4
5
https://www.reddit.com/r/datascience.json?after=t3_ie7rzz
3
https://www.reddit.com/r/datascience.json?after=t3_idu7pn
6
https://www.reddit.com/r/datascience.json?after=t3_icsad1
6
https://www.reddit.com/r/datascience.json?after=t3_ib1enn
2
https://www.reddit.com/r/datascience.json?after=t3_i98wum
5
https://www.reddit.com/r/datascience.json?after=t3_i87f3l
3
https://www.reddit.com/r/datascience.json?after=t3_i639kx
6
https://www.reddit.com/r/datascience.json?after=t3_i5cwxj
6
https://www.reddit.com/r/datascience.json?after=t3_i2jogq
6
https://www.reddit.com/r/datascience.json?after=t3_i1yfl

In [25]:
# Check the total number of posts scraped
print('The total number of subreddit posts scraped was:', len(ds_posts))

The total number of subreddit posts scraped was: 995


In [26]:
# Converting these scraped posts into a dataframe
ds_df = pd.DataFrame(ds_posts)
ds_df.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,downs,thumbnail_height,top_awarded_type,hide_score,name,quarantine,link_flair_text_color,upvote_ratio,author_flair_background_color,subreddit_type,ups,total_awards_received,media_embed,thumbnail_width,author_flair_template_id,is_original_content,user_reports,secure_media,is_reddit_media_domain,is_meta,category,secure_media_embed,link_flair_text,can_mod_post,score,approved_by,author_premium,thumbnail,edited,author_flair_css_class,author_flair_richtext,gildings,content_categories,is_self,mod_note,created,link_flair_type,wls,removed_by_category,banned_by,author_flair_type,domain,allow_live_comments,selftext_html,likes,suggested_sort,banned_at_utc,view_count,archived,no_follow,is_crosspostable,pinned,over_18,all_awardings,awarders,media_only,can_gild,spoiler,locked,author_flair_text,treatment_tags,visited,removed_by,num_reports,distinguished,subreddit_id,mod_reason_by,removal_reason,link_flair_background_color,id,is_robot_indexable,report_reasons,author,discussion_type,num_comments,send_replies,whitelist_status,contest_mode,mod_reports,author_patreon_flair,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,link_flair_template_id,post_hint,url_overridden_by_dest,preview,crosspost_parent_list,crosspost_parent,author_cakeday,poll_data
0,,datascience,Welcome to this week's entering &amp; transiti...,t2_4l4cxw07,False,,0,False,Weekly Entering &amp; Transitioning Thread | 0...,[],r/datascience,False,6,,0,,,False,t3_inkv4p,False,dark,1.0,,public,1,0,{},,,False,[],,False,False,,{},Discussion,False,1,,False,self,False,,[],{},,True,,1599422000.0,text,6,,,text,self.datascience,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,new,,,False,True,False,False,False,[],[],False,False,False,False,v0.5.1,[],False,,,moderator,t5_2sptq,,,,inkv4p,True,,datascience-bot,,8,False,all_ads,False,[],False,dark,/r/datascience/comments/inkv4p/weekly_entering...,all_ads,True,https://www.reddit.com/r/datascience/comments/...,285442,1599394000.0,0,,False,,,,,,,,
1,,datascience,"I wrote this post 6 months ago, and I know tha...",t2_3pmyfxcf,False,,0,False,What we look for in hiring,[],r/datascience,False,6,career,0,,,False,t3_innazn,False,dark,0.97,,public,671,1,{},,,False,[],,False,False,,{},Career,False,671,,False,self,False,,[],{},,True,,1599433000.0,text,6,,,text,self.datascience,True,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,confidence,,,False,False,False,False,False,"[{'giver_coin_reward': None, 'subreddit_id': N...",[],False,False,False,False,,[],False,,,,t5_2sptq,,,,innazn,True,,Pinkpenguin438,,120,True,all_ads,False,[],False,,/r/datascience/comments/innazn/what_we_look_fo...,all_ads,False,https://www.reddit.com/r/datascience/comments/...,285442,1599404000.0,1,,False,a6ee6fa0-d780-11e7-b6d0-0e0bd8823a7e,,,,,,,
2,,datascience,Does anyone have thoughts on Chartered Statist...,t2_6324ybwh,False,,0,False,What professional accreditation and certificat...,[],r/datascience,False,6,discussion,0,,,False,t3_io4qai,False,dark,0.83,,public,4,0,{},,,False,[],,False,False,,{},Discussion,False,4,,False,self,False,,[],{},,True,,1599501000.0,text,6,,,text,self.datascience,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,confidence,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2sptq,,,,io4qai,True,,AxelJShark,,0,True,all_ads,False,[],False,,/r/datascience/comments/io4qai/what_profession...,all_ads,False,https://www.reddit.com/r/datascience/comments/...,285442,1599473000.0,0,,False,4fad7108-d77d-11e7-b0c6-0ee69f155af2,,,,,,,
3,,datascience,I'm really interested in GIS and environmental...,t2_2448korq,False,,0,False,"Any data scientist work with geography, GIS, E...",[],r/datascience,False,6,career,0,,,False,t3_inwz0c,False,dark,0.92,,public,9,0,{},,,False,[],,False,False,,{},Career,False,9,,False,self,1.59944e+09,,[],{},,True,,1599466000.0,text,6,,,text,self.datascience,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,confidence,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2sptq,,,,inwz0c,True,,ArAMITAS,,8,True,all_ads,False,[],False,,/r/datascience/comments/inwz0c/any_data_scient...,all_ads,False,https://www.reddit.com/r/datascience/comments/...,285442,1599437000.0,0,,False,a6ee6fa0-d780-11e7-b6d0-0e0bd8823a7e,,,,,,,
4,,datascience,Goal: Predict the amount of open bug tickets d...,t2_5j5ny4xv,False,,0,False,Bug prediction with AI | Question,[],r/datascience,False,6,discussion,0,,,True,t3_io5uxk,False,dark,1.0,,public,1,0,{},,,False,[],,False,False,,{},Discussion,False,1,,False,self,False,,[],{},,True,,1599507000.0,text,6,,,text,self.datascience,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,confidence,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2sptq,,,,io5uxk,True,,DDyos,,0,True,all_ads,False,[],False,,/r/datascience/comments/io5uxk/bug_prediction_...,all_ads,False,https://www.reddit.com/r/datascience/comments/...,285442,1599479000.0,0,,False,4fad7108-d77d-11e7-b0c6-0ee69f155af2,,,,,,,


In [27]:
# Check on duplicate rows/posts
ds_df.duplicated(subset=['selftext']).sum()

433

In [28]:
# Removing duplicate rows
ds_df.drop_duplicates(subset = ['selftext'], inplace = True)

In [29]:
# Verification Check
ds_df.duplicated(subset=['selftext']).sum()

0

In [30]:
# Check on duplicate rows/posts
ds_df.duplicated(subset=['title']).sum()

0

In [31]:
# Final Remaining Rows in our ds_df
ds_df.shape

(562, 112)

After removing duplicate rows, we are left with 562 posts from the data science subreddit.

## Saving our Data into CSV Files

In [32]:
# Saving swe_df into a CSV file
swe_df.to_csv('./data/swe_data.csv', index = False)

In [33]:
# Saving ds_df into a CSV file
ds_df.to_csv('./data/ds_data.csv', index = False)
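The calls above assume that the `./data` folder already exists. As a hedged, standard-library sketch of the same idea, the hypothetical helper `save_rows` below creates the parent folder if it is missing before writing a CSV file. It uses `csv.DictWriter` as a stand-in for `to_csv` and writes into a temporary folder so that nothing is left behind.

```python
import csv
import tempfile
from pathlib import Path


def save_rows(rows, path):
    """Write a list of dicts to a CSV file, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)   # no error if it already exists
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path


# Demonstration in a temporary folder (illustrative row, not scraped data)
with tempfile.TemporaryDirectory() as tmp:
    out = save_rows([{"title": "Example", "selftext": "Body"}], Path(tmp) / "data" / "demo.csv")
    saved_text = out.read_text(encoding="utf-8")

print(saved_text)
```

In the notebook itself, a single `Path('./data').mkdir(parents=True, exist_ok=True)` before the `to_csv` calls would serve the same purpose.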