# Reddit

The goal is to collect posts from Reddit using the following criteria with the [PushShift API](https://github.com/pushshift/api):

* 2020-01-01 - 2020-12-31
* "police violence"

To make things easier, we can write a generator function that searches for submissions between a start and end time and yields results until there are no more. Any extra keyword arguments are passed through to the API as query parameters.

In [152]:
import time
import pandas
import requests

from datetime import datetime

def search_pushshift(start, end, endpoint='submission', max_errors=30, sleep=1, **kwargs):
    if endpoint == 'submission':
        url = "https://api.pushshift.io/reddit/search/submission"
    else:
        url = "https://api.pushshift.io/reddit/search/comment"
    
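    # Copy the keyword arguments so the caller's dict is not modified
    params = dict(kwargs)
    # The Pushshift README documents the page-size parameter as `size`
    params['size'] = 500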
        
    now = datetime.now()
    hour = int((now - end).total_seconds() / (60 * 60))
    num_hours = int((now - start).total_seconds() / (60 * 60))
    step = 1
    errs = 0

    while hour < num_hours:
        params['before'] = f'{hour}h'
        params['after'] = f'{hour + step}h'
        try:
            resp = requests.get(url, params=params)
            if resp.status_code != 200:
                errs += 1
            else:
                errs = 0
                results = resp.json()['data']
                if len(results) > 0: 
                    for result in results:
                        result['created'] = datetime.fromtimestamp(result['created_utc'])
                        if result['created'] < start:
                            break                            
                        yield result
                    hour += step
                    step = 1            
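                else:
                    if hour + step >= num_hours:
                        # The empty window already reaches back to start,
                        # so there is nothing left to fetch
                        return
                    step = 2 * step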
                        
        except Exception as e:
            print(f'got exception: {e}')
            errs += 1
            
        if errs > max_errors:
            print(f'bailing after {max_errors} consecutive errors')
            # Stop the generator; without this the loop would retry forever
            return
            
        time.sleep(sleep)
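
To make the windowing easier to follow, here is a minimal offline sketch of the same idea: times become whole-hour offsets counted back from now, each request covers one window, and the window doubles whenever it comes back empty. The names `hour_offsets`, `walk_windows` and `has_results` are hypothetical, the fake `has_results` stands in for a real API call, and stopping once an empty window already reaches `last_hour` is an assumption of this sketch.

```python
from datetime import datetime


def hour_offsets(now, start, end):
    """Return (first_hour, last_hour): whole hours from now back to end and start."""
    first_hour = int((now - end).total_seconds() / 3600)
    last_hour = int((now - start).total_seconds() / 3600)
    return first_hour, last_hour


def walk_windows(first_hour, last_hour, has_results):
    """Return the (before, after) hour windows that would be requested."""
    windows = []
    hour, step = first_hour, 1
    while hour < last_hour:
        windows.append((hour, hour + step))
        if has_results(hour, hour + step):
            # Results found: move past this window and reset the step
            hour += step
            step = 1
        elif hour + step >= last_hour:
            # Empty window that already spans the rest of the range: stop
            break
        else:
            # Empty window: widen it and try again from the same hour
            step *= 2
    return windows


# A fixed "now" keeps the example reproducible
now = datetime(2020, 1, 5, 12)
first_hour, last_hour = hour_offsets(now, datetime(2020, 1, 1), datetime(2020, 1, 4))
print(first_hour, last_hour)

# Pretend the only matching post is 5 hours old
post_hours = {5}


def has_results(lo, hi):
    return any(lo <= h < hi for h in post_hours)


windows = walk_windows(0, 10, has_results)
print([(f'{lo}h', f'{hi}h') for lo, hi in windows])
```

Each tuple corresponds to one request, formatted the same way as the `before` and `after` parameters in `search_pushshift`.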

So, for example, you can search for "police violence" posts from 2020-01-01 to 2020-01-04:

In [120]:
for result in search_pushshift(q='"police violence"', start=datetime(2020, 1, 1), end=datetime(2020, 1, 4)):
    print(result['id'], result['created'], result['title'])

ejr7go 2020-01-03 22:51:08 'Old habits die hard': Syed Akbaruddin slams Pak PM Imran Khan for posting fake video about 'police violence in India'
ejp5uw 2020-01-03 20:12:40 Can we talk about S4:E8?
ejmjlz 2020-01-03 17:01:59 Imran Khan Tweets Indian Police Pogrom On Muslims
ejhke1 2020-01-03 11:16:32 Pakistan's Prime Minister Imran Khan Tweets Fake Video From Bangladesh, Tries To Pass It Off As Police Violence In India.
ejhvga 2020-01-03 11:38:04 [IN] - Videos of police violence surface in Bihar
eji17g 2020-01-03 11:49:16 [IN] - Videos of police violence surface in Bihar | The Hindu


Now we can try the same but searching only the title of the submission:

In [81]:
for result in search_pushshift(title='"police violence"', start=datetime(2020, 1, 1), end=datetime(2020, 1, 4)):
    print(result['id'], result['created'], result['title'])


ejr7go 2020-01-03 22:51:08 'Old habits die hard': Syed Akbaruddin slams Pak PM Imran Khan for posting fake video about 'police violence in India'
ejhke1 2020-01-03 11:16:32 Pakistan's Prime Minister Imran Khan Tweets Fake Video From Bangladesh, Tries To Pass It Off As Police Violence In India.
ejhvga 2020-01-03 11:38:04 [IN] - Videos of police violence surface in Bihar
eji17g 2020-01-03 11:49:16 [IN] - Videos of police violence surface in Bihar | The Hindu
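
As a rough illustration of how keyword arguments such as `q` or `title` end up in the request, the sketch below builds the URL for a single window without contacting the API. `build_search_url` is a hypothetical helper, `urllib.parse.urlencode` is used as a stand-in for the query-string encoding that `requests` performs, and the `36h`/`37h` offsets are made-up values.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/search/submission"


def build_search_url(before, after, **kwargs):
    """Combine API options and a time window into one request URL."""
    params = dict(kwargs)
    params['before'] = before
    params['after'] = after
    return f"{BASE_URL}?{urlencode(params)}"


# Search only submission titles within a one-hour window
url = build_search_url('36h', '37h', title='"police violence"')
print(url)
```

Swapping `title=` for `q=` changes only the name of the query parameter, which is why the same function can serve both searches.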


## Data Collection

OK, let's try to get all the Reddit posts from 2020 whose title includes "police violence" and save them to a CSV file, `reddit-police-violence-20200101-20201231.csv`, for later analysis.

Experimentation has shown that the following properties come back from the PushShift API. We can use them to create a CSV where each row is a submission and each column is a property of that submission.

In [107]:
cols = ['pinned', 'secure_media_embed', 'post_hint', 'parent_whitelist_status', 'author_premium', 'pwls', 'all_awardings', 'author_flair_richtext', 'whitelist_status', 'gildings', 'media_metadata', 'score', 'author_cakeday', 'thumbnail_height', 'treatment_tags', 'created', 'media_embed', 'is_crosspostable', 'upvote_ratio', 'is_gallery', 'author_flair_template_id', 'spoiler', 'secure_media', 'crosspost_parent_list', 'url_overridden_by_dest', 'allow_live_comments', 'subreddit_id', 'link_flair_css_class', 'url', 'can_mod_post', 'author_flair_text', 'media', 'domain', 'preview', 'wls', 'author_flair_type', 'is_original_content', 'locked', 'removed_by_category', 'thumbnail_width', 'permalink', 'awarders', 'suggested_sort', 'link_flair_background_color', 'content_categories', 'link_flair_text_color', 'selftext', 'thumbnail', 'link_flair_type', 'author_flair_background_color', 'is_robot_indexable', 'media_only', 'crosspost_parent', 'over_18', 'author_patreon_flair', 'total_awards_received', 'author_flair_text_color', 'author_fullname', 'contest_mode', 'gallery_data', 'id', 'num_comments', 'retrieved_on', 'is_video', 'send_replies', 'is_self', 'author', 'subreddit', 'stickied', 'full_link', 'no_follow', 'poll_data', 'author_flair_css_class', 'subreddit_type', 'is_reddit_media_domain', 'num_crossposts', 'link_flair_template_id', 'is_meta', 'subreddit_subscribers', 'created_utc', 'link_flair_text', 'link_flair_richtext', 'title', 'edited', 'banned_by', 'rpan_video', 'event_start', 'event_enshowsd', 'collections', 'steward_reports', 'discussion_type', 'gilded', 'event_is_live']

Now we are ready to do the data collection.

In [98]:
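import csv
import sys

# Keys returned by the API that are not listed in cols
missing_cols = set()

# newline='' lets the csv module control line endings
with open('data/reddit-police-violence-20200101-20201231.csv', 'w', newline='') as fh:
    # extrasaction='ignore' skips unknown keys instead of raising ValueError
    out = csv.DictWriter(fh, fieldnames=cols, extrasaction='ignore')
    out.writeheader()
    for result in search_pushshift(title='"police violence"', start=datetime(2020, 1, 1), end=datetime(2020, 12, 31), sleep=1):
        missing_cols.update(set(result) - set(cols))
        out.writerow(result)
        sys.stdout.write('.')
        sys.stdout.flush()

print(missing_cols)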

........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
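
Because not every submission carries every property, it can help to see how `csv.DictWriter` treats rows whose keys do not line up exactly with the column list. The sketch below writes to an in-memory buffer; the three-column `fieldnames` list and the sample rows are made up for illustration and are not data from the API.

```python
import csv
import io

# Hypothetical, shortened column list
fieldnames = ['id', 'title', 'num_comments']

# Made-up rows: one complete, one missing a column, one with an extra key
rows = [
    {'id': 'abc123', 'title': 'First post', 'num_comments': 3},
    {'id': 'def456', 'title': 'Second post'},
    {'id': 'ghi789', 'title': 'Third post', 'num_comments': 0, 'new_field': True},
]

unknown_keys = set()
buffer = io.StringIO()
# restval fills in missing columns; extrasaction='ignore' drops unknown keys
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval='', extrasaction='ignore')
writer.writeheader()
for row in rows:
    # Remember any key that has no matching column
    unknown_keys.update(set(row) - set(fieldnames))
    writer.writerow(row)

csv_text = buffer.getvalue()
print(csv_text)
print(unknown_keys)
```

With the default `extrasaction='raise'`, the third row would raise a `ValueError` instead of being written, so collecting the unknown keys is one way to find out which columns to add.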

## Comments

It would also be interesting to get the comments for these submissions. PushShift offers a similar API endpoint for fetching comments for a given submission id. First, let's find a submission that has some comments. We can read in the submission dataset we just created to do that.

In [135]:
import pandas

df = pandas.read_csv('data/reddit-police-violence-20200101-20201231.csv')
df = df.sort_values('num_comments', ascending=False)

df.head(10)[['id', 'created', 'title', 'num_comments']]    

| | id | created | title | num_comments |
|---|---|---|---|---|
| 5678 | gt6s4a | 2020-05-29 22:54:36 | Trudeau: Canadians watching US unrest and poli... | 8160 |
| 5019 | gvc4tw | 2020-06-02 13:53:52 | FBI Asks for Evidence of Individuals Inciting ... | 3053 |
| 3867 | gy6pg6 | 2020-06-07 01:37:09 | A group of 66 United Nations human rights moni... | 2657 |
| 5614 | gtja0t | 2020-05-30 14:37:49 | A girl who lost her father to police violence. | 1935 |
| 2941 | h8qi36 | 2020-06-14 05:13:26 | Global press urges President Trump to curb pol... | 1768 |
| 2796 | ha18d6 | 2020-06-16 06:16:27 | The moment police in Richmond, VA escalated la... | 1447 |
| 1928 | hv7lsq | 2020-07-21 09:46:41 | [Update] Players will likely be allowed to wea... | 1329 |
| 3305 | gzyqbl | 2020-06-09 19:12:20 | Why filming police violence has done nothing t... | 1255 |
| 5785 | gs7612 | 2020-05-28 10:23:13 | Tucker Carlson Calls Protests Against Police V... | 1010 |
| 1458 | ihbge7 | 2020-08-26 21:00:22 | Naomi Osaka pulls out of semifinal tomorrow in... | 788 |


Let's try to get the comments for `gt6s4a`.

In [153]:
for result in search_pushshift(ids='gt6s4a', start=datetime(2020, 5, 29), end=datetime.now(), endpoint='comment'):
    print(result)
    break

KeyboardInterrupt: 