# Data Gathering notebook

Data collection is done in this notebook. The code is kept out of the main notebook to avoid the long wait while looping through the posts, and to avoid re-downloading and overwriting the existing datasets, which would change the results.

The steps of the process are briefly outlined here.

<ol>
    <li>Use the Python `requests` library to connect to the subreddit URL via Reddit's JSON endpoint.</li>
    <li>Check that the status code is 200, meaning the request succeeded.</li>
    <li>Run a loop that appends the `after` query to the base URL to fetch the next 25 posts, repeating 40 times to collect about 1000 posts, and export them to a CSV file.</li>
    <li>Check the length of the list and the number of empty posts.</li>
    <li>Repeat for the other subreddit.</li>
</ol>
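
The steps above can be sketched as one reusable function, which also covers the last step of repeating the process for another subreddit. This is a minimal sketch, not the code used in this notebook: `collect_posts` and `fetch_page` are hypothetical names, and `fetch_page` stands in for the `requests.get(...).json()` call so the pagination logic can be shown offline. It also assumes the listing signals that there is no further page by returning an empty `after`.

```python
def collect_posts(fetch_page, base_url, max_pages=40):
    """Follow `after`-style pagination and return the collected posts.

    `fetch_page` is a hypothetical callable that takes a URL and returns the
    decoded JSON listing (a dict), standing in for requests.get(url).json().
    """
    posts = []
    after = None
    for _ in range(max_pages):
        # the first page has no `after` tag; later pages continue from the last post seen
        url = base_url if after is None else base_url + '?after=' + after
        listing = fetch_page(url)
        posts.extend(child['data'] for child in listing['data']['children'])
        after = listing['data']['after']
        if after is None:
            # assumption: an empty `after` means there are no further pages
            break
    return posts


# Made-up two-page listing, used only to demonstrate the function offline.
fake_pages = {
    'https://example.com/r/demo.json': {
        'data': {'children': [{'data': {'name': 't3_a'}},
                              {'data': {'name': 't3_b'}}],
                 'after': 't3_b'}},
    'https://example.com/r/demo.json?after=t3_b': {
        'data': {'children': [{'data': {'name': 't3_c'}}],
                 'after': None}},
}

demo_posts = collect_posts(fake_pages.__getitem__, 'https://example.com/r/demo.json')
print([p['name'] for p in demo_posts])
```

Stopping when `after` comes back empty is a design choice of this sketch; without it, the next request would go to the base URL again and start over from the first page.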

In [1]:
# import the relevant libraries
import requests
import pandas as pd
import time
import random

In [2]:
# the URL for r/nosleep
url = 'https://www.reddit.com/r/nosleep.json'

## Looping through the posts, 25 posts at a time, 40 times, to get 1000 posts

For reddit r/nosleep

In [3]:
posts = []
after = None

for a in range(40):
    # the process is long, so print a running count of the loops
    print('Running loop '+str(a+1)+' out of 40')
    
    # `after` is the tag of the last post on the previous page; the first request has none
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Zoogle Inc 1.0'})
    
    # check the status code
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    # append this page's posts to the list and note where the next page starts
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']
    
    # save everything collected so far to a csv (the file is overwritten on every loop)
    pd.DataFrame(posts).to_csv('nosleep.csv', index = False)

    # pause between requests; the duration is a fixed 3 seconds, not random
    sleep_duration = 3
    time.sleep(sleep_duration)

Running loop 1 out of 40
https://www.reddit.com/r/nosleep.json
Running loop 2 out of 40
https://www.reddit.com/r/nosleep.json?after=t3_dku4if
Running loop 3 out of 40
https://www.reddit.com/r/nosleep.json?after=t3_dk8wx9
Running loop 4 out of 40
https://www.reddit.com/r/nosleep.json?after=t3_dkjis8
Running loop 5 out of 40
https://www.reddit.com/r/nosleep.json?after=t3_dk1q9e
Running loop 6 out of 40
https://www.reddit.com/r/nosleep.json?after=t3_dkcp9m
Running loop 7 out of 40
https://www.reddit.com/r/nosleep.json?after=t3_dk8pzr
Running loop 8 out of 40
https://www.reddit.com/r/nosleep.json?after=t3_djyjsq
Running loop 9 out of 40
https://www.reddit.com/r/nosleep.json?after=t3_djnx8x
Running loop 10 out of 40
https://www.reddit.com/r/nosleep.json?after=t3_dj6gue
Running loop 11 out of 40
https://www.reddit.com/r/nosleep.json?after=t3_dje9t9
Running loop 12 out of 40
https://www.reddit.com/r/nosleep.json?after=t3_djcp2x
Running loop 13 out of 40
https://www.reddit.com/r/nosleep.json?a
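
The loop above imports `random` but sleeps for a fixed 3 seconds between requests. A randomised pause could be sketched as below; the helper name `polite_pause` and the 2 to 6 second range are assumptions for illustration, not values taken from this notebook.

```python
import random
import time


def polite_pause(low=2.0, high=6.0, sleep=time.sleep):
    """Sleep for a random duration between `low` and `high` seconds and return it."""
    duration = random.uniform(low, high)
    sleep(duration)
    return duration


# Demonstration with a stand-in for time.sleep so the example finishes instantly.
waited = polite_pause(sleep=lambda seconds: None)
print(round(waited, 2))
```

Inside the scraping loop, a call to `polite_pause()` would take the place of `time.sleep(sleep_duration)`.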

In [4]:
len(posts)

981

In [5]:
df = pd.DataFrame(posts)

In [6]:
df['selftext'].isnull().sum()

0

In [7]:
# added after spotting null values when importing the csv data: an empty-string field is written as an empty csv cell and read back as null
df[df['selftext'] == '']

Unnamed: 0,all_awardings,allow_live_comments,approved_at_utc,approved_by,archived,author,author_cakeday,author_flair_background_color,author_flair_css_class,author_flair_richtext,...,thumbnail,title,total_awards_received,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,[],True,,,False,cmd102,,,,[],...,,Spooktober! 31 Days of Horror Nosleep Event,0,393,https://old.reddit.com/r/NoSleepOOC/comments/d...,[],,False,all_ads,6
1,[],False,,,False,TheCusterWolf,,,,[],...,,August 2019 Winners!,0,12,https://redd.it/dkiw8h,[],,False,all_ads,6
666,[],True,,,False,TheCusterWolf,,,,[],...,,August 2019 Voting Thread,0,60,https://redd.it/dfwups,[],,False,all_ads,6
829,[],True,,,False,cmd102,,,,[],...,,Spooktober! 31 Days of Horror Nosleep Event,0,393,https://old.reddit.com/r/NoSleepOOC/comments/d...,[],,False,all_ads,6
830,[],False,,,False,TheCusterWolf,,,,[],...,,August 2019 Winners!,0,12,https://redd.it/dkiw8h,[],,False,all_ads,6
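
The difference between the two checks above can be reproduced on a tiny, made-up frame: an empty string is not null in memory, but it is written as an empty CSV cell, which `pd.read_csv` reads back as `NaN` by default. The sample data below is invented for illustration, and `keep_default_na=False` is shown as one possible way to keep such cells as empty strings.

```python
import io

import pandas as pd

# Made-up sample: the second post has an empty body, like the rows listed above.
sample = pd.DataFrame({'title': ['a story', 'a link post'],
                       'selftext': ['once upon a time', '']})

# In memory, the empty string is a real value, so nothing counts as null.
nulls_in_memory = int(sample['selftext'].isnull().sum())

# Round-trip through CSV text (an in-memory buffer instead of a file on disk).
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

buffer.seek(0)
reloaded = pd.read_csv(buffer)
nulls_after_reload = int(reloaded['selftext'].isnull().sum())

# With keep_default_na=False, empty cells stay as empty strings.
buffer.seek(0)
reloaded_kept = pd.read_csv(buffer, keep_default_na=False)
nulls_when_kept = int(reloaded_kept['selftext'].isnull().sum())

print(nulls_in_memory, nulls_after_reload, nulls_when_kept)
```

Note that `keep_default_na=False` applies to every column, so genuinely missing values in other fields would also be read as empty strings.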


For reddit r/UnresolvedMysteries

In [8]:
url = 'https://www.reddit.com/r/UnresolvedMysteries.json'

## Looping through the posts, 25 posts at a time, 40 times, to get 1000 posts

For reddit r/UnresolvedMysteries

In [9]:
posts = []
after = None

for a in range(40):
    print('Running loop '+str(a+1)+' out of 40')
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Zoogle Inc 1.0'})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']
    
    pd.DataFrame(posts).to_csv('UnresolvedMysteries.csv', index = False)

    # pause between requests; the duration is a fixed 3 seconds, not random
    sleep_duration = 3
    time.sleep(sleep_duration)

Running loop 1 out of 40
https://www.reddit.com/r/UnresolvedMysteries.json
Running loop 2 out of 40
https://www.reddit.com/r/UnresolvedMysteries.json?after=t3_djbteq
Running loop 3 out of 40
https://www.reddit.com/r/UnresolvedMysteries.json?after=t3_di7g88
Running loop 4 out of 40
https://www.reddit.com/r/UnresolvedMysteries.json?after=t3_dgxznl
Running loop 5 out of 40
https://www.reddit.com/r/UnresolvedMysteries.json?after=t3_dfy0sz
Running loop 6 out of 40
https://www.reddit.com/r/UnresolvedMysteries.json?after=t3_df5skz
Running loop 7 out of 40
https://www.reddit.com/r/UnresolvedMysteries.json?after=t3_dci990
Running loop 8 out of 40
https://www.reddit.com/r/UnresolvedMysteries.json?after=t3_dayux9
Running loop 9 out of 40
https://www.reddit.com/r/UnresolvedMysteries.json?after=t3_d9piw9
Running loop 10 out of 40
https://www.reddit.com/r/UnresolvedMysteries.json?after=t3_d8cppt
Running loop 11 out of 40
https://www.reddit.com/r/UnresolvedMysteries.json?after=t3_d6cjy3
Running loop 

In [10]:
len(posts)

995

In [11]:
df = pd.DataFrame(posts)

In [12]:
df['selftext'].isnull().sum()

0

In [13]:
df[df['selftext'] == '']

Unnamed: 0,all_awardings,allow_live_comments,approved_at_utc,approved_by,archived,author,author_cakeday,author_flair_background_color,author_flair_css_class,author_flair_richtext,...,thumbnail,title,total_awards_received,ups,url,user_reports,view_count,visited,whitelist_status,wls
308,[],False,,,False,SmartNegotiation,,,,[],...,,The Mysterious 9/11 Disappearance of Sneha Ann...,0,144,https://www.reddit.com/r/TrueCrime/comments/d2...,[],,False,all_ads,6
485,[],False,,,False,closingbelle,,,,[],...,,Patricia Webb - one of Nebraska's oldest and c...,0,100,https://www.reddit.com/r/coldcases/comments/cs...,[],,False,all_ads,6
511,[],False,,,False,HasanTheSyrian_,,,,[],...,,/r/Syriacus__1144 Linked,0,1,https://www.reddit.com/r/UnresolvedMysteries/c...,[],,False,all_ads,6
876,[],False,,,False,SmartNegotiation,,,,[],...,,The Mysterious 9/11 Disappearance of Sneha Ann...,0,143,https://www.reddit.com/r/TrueCrime/comments/d2...,[],,False,all_ads,6
