# Scraping Entries from Reddit

There are two TV series that I absolutely love because they reflect real-life problems. That's why I scraped the subreddits of Black Mirror and Westworld. In this notebook, I will go through the scraping.

In [1]:
# libraries needed:
import requests
import time
import pandas as pd
import numpy as np

## Black Mirror Posts

To scrape the data, I used the Pushshift API, which let me collect the data in less than a minute.

In [22]:
def get_author_comments(**kwargs):
    # Despite its name, this queries the submission (post) search endpoint
    r = requests.get("https://api.pushshift.io/reddit/submission/search/",params=kwargs)
    data = r.json()
    return data['data']

before = None
all_posts = []
for i in range(20):
    print('grabbing {} posts...'.format((i+1)*500))
    posts = get_author_comments(subreddit="blackmirror",size=500,before=before,sort='desc',sort_type='created_utc')

    before = posts[-1]['created_utc'] # Timestamp of the oldest post fetched so far; the next request in the for loop starts from here

    all_posts.extend(posts)

    time.sleep(1)

grabbing 500 posts...
grabbing 1000 posts...
grabbing 1500 posts...
grabbing 2000 posts...
grabbing 2500 posts...
grabbing 3000 posts...
grabbing 3500 posts...
grabbing 4000 posts...
grabbing 4500 posts...
grabbing 5000 posts...
grabbing 5500 posts...
grabbing 6000 posts...
grabbing 6500 posts...
grabbing 7000 posts...
grabbing 7500 posts...
grabbing 8000 posts...
grabbing 8500 posts...
grabbing 9000 posts...
grabbing 9500 posts...
grabbing 10000 posts...
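
The loop above assumes every request returns at least one post; `posts[-1]` raises an `IndexError` as soon as a page comes back empty. Below is a minimal sketch of a more defensive version. The names `collect_posts`, `fetch_page` and `fake_fetch` are hypothetical and not part of the original notebook; `fetch_page` stands in for a function like `get_author_comments`, so the paging logic can be shown without any network access.

```python
def collect_posts(fetch_page, pages=20, size=500):
    """Page backwards through posts, stopping early if a page is empty.

    fetch_page is any callable that accepts `size` and `before` keyword
    arguments and returns a list of post dicts, newest first.
    """
    before = None
    collected = []
    for _ in range(pages):
        posts = fetch_page(size=size, before=before)
        if not posts:  # nothing older is available, so stop paging
            break
        collected.extend(posts)
        # The oldest post on this page marks where the next page starts
        before = posts[-1]['created_utc']
    return collected


# Stand-in for the real API call: 7 fake posts with descending timestamps
fake_posts = [{'id': str(t), 'created_utc': t} for t in range(7, 0, -1)]

def fake_fetch(size, before):
    older = [p for p in fake_posts if before is None or p['created_utc'] < before]
    return older[:size]

result = collect_posts(fake_fetch, pages=10, size=3)
print(len(result))
```

With the real API, `fetch_page` could be a small wrapper around `get_author_comments` that fixes the subreddit and sort arguments. The one-second pause between requests is left out of the sketch and would still be needed.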


In the end, I have the following number of posts:

In [3]:
len(all_posts)

10000

Then I needed to check the number of unique IDs. This matters because duplicated posts are a common problem when pulling as many as 10000 posts. Thanks to the Pushshift API, all of them were distinct.

In [4]:
ids = [post['id'] for post in all_posts]
len(set(ids))

10000
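
Had the two counts differed, the duplicates would have needed to be removed before going further. A minimal sketch of one way to do that, keeping the first occurrence of each ID; the function name `drop_duplicate_posts` and the sample data are made up for illustration:

```python
def drop_duplicate_posts(posts):
    """Return posts with repeated ids removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for post in posts:
        if post['id'] not in seen:
            seen.add(post['id'])
            unique.append(post)
    return unique


sample = [{'id': 'a'}, {'id': 'b'}, {'id': 'a'}]
unique_sample = drop_duplicate_posts(sample)
print(len(sample), len(unique_sample))
```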

Below is all the data related to the first post:

In [5]:
all_posts[0]

{'author': 'ismael676',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_q2mzyzm',
 'author_patreon_flair': False,
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1545417655,
 'domain': 'news.mit.edu',
 'full_link': 'https://www.reddit.com/r/blackmirror/comments/a8czdm/mit_was_able_to_reconstruct_sound_from_the/',
 'gildings': {'gid_1': 0, 'gid_2': 0, 'gid_3': 0},
 'id': 'a8czdm',
 'is_crosspostable': True,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': True,
 'is_self': False,
 'is_video': False,
 'link_flair_background_color': '',
 'link_flair_css_class': 'yellow',
 'link_flair_richtext': [{'e': 'text', 't': 'FLUFF'}],
 'link_flair_text': 'FLUFF',
 'link_flair_text_color': 'dark',
 'link_flair_type': 'richtext',
 'locked': False,
 'media_only': False,
 'no_follow': True,
 'num_comments': 0,
 'num_crossposts': 0,
 ...

Each post has many fields, most of which I won't include in the analysis. To keep only the relevant ones, I extracted each field individually and appended it to a list. The important columns are:
 - created_utc
 - id
 - is_video
 - num_comments
 - score
 - selftext
 - spoiler
 - subreddit
 - title
 
Below I created an empty list for each column and then appended each post's value to the matching list. When a value is empty, I appended a placeholder instead (None, or 'NA' for created_utc) so that each list still has 10000 elements.

In [6]:
bm_utc = []
bm_id = []
bm_num_comments = []
bm_score = []
bm_spoiler = []
bm_subreddit = []
bm_title = []

# 'selftext' and 'is_video' may be missing from a post, so they are collected in the next cell
for post in all_posts:
    # An empty string is replaced by a placeholder so every list keeps one entry per post
    bm_utc.append(post['created_utc'] if post['created_utc'] != '' else 'NA')
    bm_id.append(post['id'] if post['id'] != '' else None)
    bm_num_comments.append(post['num_comments'] if post['num_comments'] != '' else None)
    bm_score.append(post['score'] if post['score'] != '' else None)
    bm_spoiler.append(post['spoiler'] if post['spoiler'] != '' else None)
    bm_subreddit.append(post['subreddit'] if post['subreddit'] != '' else None)
    bm_title.append(post['title'] if post['title'] != '' else None)
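
The same append-with-placeholder pattern repeats for every field, so it can also be written once and driven by a list of field names. This is only a sketch of that idea, not the code used in this notebook; `extract_columns` and the sample posts are hypothetical, and like the cell above it assumes every requested key is present in every post.

```python
def extract_columns(posts, fields, placeholder=None):
    """Build one list per field, with exactly one entry per post."""
    columns = {field: [] for field in fields}
    for post in posts:
        for field in fields:
            value = post[field]
            # Empty strings become the placeholder; other values are kept as-is
            columns[field].append(value if value != '' else placeholder)
    return columns


sample_posts = [
    {'id': 'x1', 'title': 'First', 'score': 3},
    {'id': 'x2', 'title': '', 'score': 0},
]
columns = extract_columns(sample_posts, ['id', 'title', 'score'])
print(columns['title'])
print(columns['score'])
```

Note that a score of 0 is kept, since only the empty string is treated as missing.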


The 'selftext' column is a little different from the rest. Some posts do not have this key at all, and looking it up would raise a KeyError, so I used a try-except block that fills in an empty string whenever the key is missing. I handled 'is_video' the same way.

In [7]:
bm_text = []
bm_is_video = []
for post in all_posts:
    # Fill in an empty string when the post has no 'selftext' key
    try:
        post['selftext']
    except KeyError:
        post['selftext'] = ''
    bm_text.append(post['selftext'])
for post in all_posts:
    # Same treatment for 'is_video'
    try:
        post['is_video']
    except KeyError:
        post['is_video'] = ''
    bm_is_video.append(post['is_video'])
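
For a missing key, `dict.get` with a default value gives the same result as the try-except approach in fewer lines. A small sketch with two made-up posts, one of which lacks both keys:

```python
sample_posts = [
    {'id': 'p1', 'selftext': 'some body text', 'is_video': False},
    {'id': 'p2'},  # a post with neither key present
]

# dict.get returns the default instead of raising KeyError
texts = [post.get('selftext', '') for post in sample_posts]
video_flags = [post.get('is_video', '') for post in sample_posts]
print(texts)
print(video_flags)
```

Unlike the cell above, this version leaves the original post dictionaries unchanged.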

Below I created a DataFrame for the posts from the Black Mirror subreddit:

In [8]:
black_mirror = pd.DataFrame(
    {'created_utc' : bm_utc,
     'id': bm_id,
     'is_video' : bm_is_video,
     'num_comments' : bm_num_comments,
     'score' : bm_score,
     'selftext' : bm_text,
     'spoiler' : bm_spoiler,
     'subreddit' : bm_subreddit,
     'title': bm_title
    }
)

In [9]:
counter = 0
for i in range(0,len(black_mirror)):
    if black_mirror.selftext[i] == '':
        counter += 1

For our analysis, it is very important to know the number of empty posts, in other words posts without a body. With the help of the above code, we see that there are 4321 empty elements. This number might vary since we are always pulling new data.

In [10]:
counter

4321
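
The same count can be obtained without an explicit loop by comparing the whole column at once. A minimal sketch on a made-up four-row frame rather than the real data:

```python
import pandas as pd

sample = pd.DataFrame({'selftext': ['', 'a body', '', 'another body']})

# Comparing the column to '' gives a boolean Series; summing it counts the True values
empty_count = int((sample['selftext'] == '').sum())
print(empty_count)
```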

## Westworld Posts

To do the same for our next favourite TV series, I reused all of the code above:

In [11]:
def get_author_comments(**kwargs):
    # Despite its name, this queries the submission (post) search endpoint
    r = requests.get("https://api.pushshift.io/reddit/submission/search/",params=kwargs)
    data = r.json()
    return data['data']

before = None
all_posts = []
for i in range(20):
    print('grabbing {} posts...'.format((i+1)*500))
    posts = get_author_comments(subreddit='westworld', size=500,before=before,sort='desc',sort_type='created_utc')

    before = posts[-1]['created_utc'] # Timestamp of the oldest post fetched so far; the next request in the for loop starts from here

    all_posts.extend(posts)

    time.sleep(1)

grabbing 500 posts...
grabbing 1000 posts...
grabbing 1500 posts...
grabbing 2000 posts...
grabbing 2500 posts...
grabbing 3000 posts...
grabbing 3500 posts...
grabbing 4000 posts...
grabbing 4500 posts...
grabbing 5000 posts...
grabbing 5500 posts...
grabbing 6000 posts...
grabbing 6500 posts...
grabbing 7000 posts...
grabbing 7500 posts...
grabbing 8000 posts...
grabbing 8500 posts...
grabbing 9000 posts...
grabbing 9500 posts...
grabbing 10000 posts...


In [12]:
ww_utc = []
ww_id = []
ww_num_comments = []
ww_score = []
ww_spoiler = []
ww_subreddit = []
ww_title = []

# 'selftext' and 'is_video' may be missing from a post, so they are collected in the next cell
for post in all_posts:
    # An empty string is replaced by a placeholder so every list keeps one entry per post
    ww_utc.append(post['created_utc'] if post['created_utc'] != '' else 'NA')
    ww_id.append(post['id'] if post['id'] != '' else None)
    ww_num_comments.append(post['num_comments'] if post['num_comments'] != '' else None)
    ww_score.append(post['score'] if post['score'] != '' else None)
    ww_spoiler.append(post['spoiler'] if post['spoiler'] != '' else None)
    ww_subreddit.append(post['subreddit'] if post['subreddit'] != '' else None)
    ww_title.append(post['title'] if post['title'] != '' else None)

In [13]:
ww_text = []
ww_is_video = []

for post in all_posts:
    # Fill in an empty string when the post has no 'selftext' key
    try:
        post['selftext']
    except KeyError:
        post['selftext'] = ''
    ww_text.append(post['selftext'])

for post in all_posts:
    # Same treatment for 'is_video'
    try:
        post['is_video']
    except KeyError:
        post['is_video'] = ''
    ww_is_video.append(post['is_video'])

In [14]:
west_world = pd.DataFrame(
    {'created_utc' : ww_utc,
     'id': ww_id,
     'is_video' : ww_is_video,
     'num_comments' : ww_num_comments,
     'score' : ww_score,
     'selftext' : ww_text,
     'spoiler' : ww_spoiler,
     'subreddit' : ww_subreddit,
     'title': ww_title
    }
)

    ######################################

In [15]:
counter = 0
for i in range(0,len(west_world)):
    if west_world.selftext[i] == '':
        counter += 1
print(counter)

4110


## Convert Them to CSV Files

Let's check both of the dataframes one last time before converting them to CSV files:

In [16]:
black_mirror.head()

|   | created_utc | id | is_video | num_comments | score | selftext | spoiler | subreddit | title |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1545417655 | a8czdm | False | 0 | 1 |  | False | blackmirror | MIT was able to reconstruct sound from the vib... |
| 1 | 1545415295 | a8cklh | False | 1 | 1 |  | False | blackmirror | 'Black Mirror: Bandersnatch' Synopsis, Runtime... |
| 2 | 1545413557 | a8c9tm | False | 0 | 1 |  | True | blackmirror | What is Black Mirror: Bandersnatch? |
| 3 | 1545410154 | a8bozn | False | 4 | 1 | not sure if i need to mark this as possible sp... | True | blackmirror | White Christmas Ending |
| 4 | 1545404674 | a8at5v | False | 1 | 1 |  | False | blackmirror | This is a playlist with songs from some of the... |


In [17]:
west_world.head()

|   | created_utc | id | is_video | num_comments | score | selftext | spoiler | subreddit | title |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1545416118 | a8cprr | False | 2 | 1 |  | False | westworld | Is this....Bernard? Nah, just my friend's mom'... |
| 1 | 1545377064 | a87law | False | 1 | 1 |  | False | westworld | Is there cheaper ways to watch than Hulu? |
| 2 | 1545371104 | a86vwk | False | 3 | 1 |  | False | westworld | Anyone else think it’s funny that Tessa Thomps... |
| 3 | 1545370046 | a86r6o | False | 0 | 1 |  | False | westworld | Black Mirror [Online Game Code] is 67% OFF |
| 4 | 1545364029 | a860qt | False | 2 | 1 |  | False | westworld | Clicked on Trending and thought that they rele... |


In [18]:
black_mirror.to_csv('blackmirror.csv')
west_world.to_csv('westworld.csv')
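
By default, `to_csv` also writes the row index as an extra first column. If that column is not wanted when the file is read back later, passing `index=False` leaves it out. A small sketch with a made-up two-row frame, writing to a string instead of a file so nothing is created on disk:

```python
import io

import pandas as pd

sample = pd.DataFrame({'id': ['a1', 'b2'], 'score': [1, 5]})

# With no path given, to_csv returns the CSV text instead of writing a file
csv_text = sample.to_csv(index=False)

# Reading the text back yields only the original columns
restored = pd.read_csv(io.StringIO(csv_text))
print(list(restored.columns))
```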