## Using the Pushshift API to get data from Reddit

_Adjusted by Jeff Hale_

## Learning Objectives

By the end of this lesson, students will be able to:

- Use the Pushshift API with the requests library to get data from Reddit
- Parse the data into a pandas DataFrame

## Pushshift API Example

This document gives an example of a successful call to the Pushshift API.

The documentation found at [https://github.com/pushshift/api](https://github.com/pushshift/api) is a good reference and was used to make this guide.

Another helpful walkthrough is here: https://www.osrsbox.com/blog/2019/03/18/watercooler-scraping-an-entire-subreddit-2007scape/. I do not know why a RuneScape blog produced such a great walkthrough, but we'll take it.

### Setup

```python
import pandas as pd
import requests  # library for making HTTP requests to web pages and APIs
import time      # time-related functions, such as pausing between requests
import datetime  # tools for working with dates and timestamps
```

### Making our first API call

#### Try 1

First we need to tell the computer where to go to get the data.

#### url where we want to request data.

```python
reddit_url = 'https://api.pushshift.io/reddit/search/submisssion/?subreddit=PrideandPrejudice&size=100'
```

Hold up, that's ugly as sin. What are all the parts of this?
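- https://api.pushshift.io : the base URL, the server that hosts the Pushshift API
- /reddit/ : the part of the API that deals with Reddit data
- /search/ : we want to search that data
- /submission/ : search submissions (posts) rather than comments
- ?subreddit=PrideandPrejudice : the `?` starts the query parameters; this one limits results to the PrideandPrejudice subreddit
- &size=100 : the `&` adds another parameter; this one asks for up to 100 results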
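To see how those pieces fit together, here is a small sketch that assembles the same kind of URL from its parts using only the standard library. The variable names `base_url`, `path`, and `query` are illustrative choices, not anything the API requires.

```python
from urllib.parse import urlencode

# The three building blocks of the request URL
base_url = "https://api.pushshift.io"
path = "/reddit/search/submission/"
query = {"subreddit": "PrideandPrejudice", "size": 100}

# urlencode turns the dictionary into "key=value" pairs joined by "&"
reddit_url = base_url + path + "?" + urlencode(query)
print(reddit_url)
```

Keeping the parameters in a dictionary makes it easier to change the subreddit or the size later without retyping the whole string.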

More to come...

#### Then we make the request using the requests library:

This is the actual request. Storing the response in a variable named `r` is a common convention; it holds the raw, unprocessed response from the server.
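A minimal sketch of that request, assuming the API is reachable from your machine. The helper name `make_request` and the `timeout` value are illustrative assumptions; the underlying call is the standard `requests.get`.

```python
import requests

# URL from the cell above, copied exactly as it was typed there
reddit_url = "https://api.pushshift.io/reddit/search/submisssion/?subreddit=PrideandPrejudice&size=100"


def make_request(url, timeout=10):
    """Send a GET request and return the raw requests.Response object."""
    # The timeout (an assumed value) keeps the call from waiting forever
    return requests.get(url, timeout=timeout)
```

In the notebook you would then run `r = make_request(reddit_url)` and inspect `r.text` and `r.status_code` in the following cells.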

#### the attribute `.text` will present the full text

Hmmm that doesn't look right...let's check if our request connected properly

#### attribute to check the status of the request (`.status_code`)

Common Errors with API Requests:
- 401 : Unauthorized
- 403 : Forbidden
- 404 : Page does not exist
- 429 : Too many requests

200 indicates a success!

See more here: https://blog.runscope.com/posts/how-to-debug-common-api-errors and https://realpython.com/python-requests/#status-codes.
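If you would rather not memorize the codes, the standard library can translate them for you. This is a small sketch; `describe_status` is a hypothetical helper, and in the notebook you would pass it `r.status_code`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes it does not know
    outcome = "success" if 200 <= code < 300 else "problem"
    return f"{code} {status.phrase} ({outcome})"


# The codes listed above
for code in (200, 401, 403, 404, 429):
    print(describe_status(code))
```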

#### Try 2

We found our error (a typo in the word `submission` in the URL), so let's try the same API call with the corrected URL.

#### list the url where we want to request data.

```python
reddit_url = 'https://api.pushshift.io/reddit/search/submission/?subreddit=PrideandPrejudice&size=100'
```

What if we followed this link in Chrome instead of making the request?

#### try with repaired URL

#### what is r exactly?

#### the attribute `.text` will present the full text

Ok now THAT is ugly... How might we make this cleaner?

#### the `.json()` method parses the JSON text into Python objects

Woah lots of stuff! Does this look familiar to anyone...?

Let's save it.
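To see what that parsing step does without touching the network, here is a sketch using the standard `json` module on a made-up, heavily shortened response body. The titles and field values are invented for illustration; a real response has many more fields per post. Calling `r.json()` does roughly the same thing to the real `r.text`.

```python
import json

# Hypothetical stand-in for r.text: one long string of JSON
sample_text = (
    '{"data": ['
    '{"title": "First sample post", "subreddit": "PrideandPrejudice"}, '
    '{"title": "Second sample post", "subreddit": "PrideandPrejudice"}'
    ']}'
)

# Parse the string into Python objects and save the result
reddit_json = json.loads(sample_text)

print(type(sample_text))   # the raw text is a str
print(type(reddit_json))   # the parsed result is a dict
```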

### Processing JSON 

#### let's explore this a bit...what type is this new json object?

![btfgif](https://i.kym-cdn.com/photos/images/newsfeed/001/480/980/897.gif)

We know how to work with dictionaries!

Let's use what we know about dictionaries to explore further

#### look at keys

Oddly only one key: 'data'...what is its corresponding value?

Now we have it as a list...? What is each item of the list? 
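The exploration so far can be sketched on a small stand-in dictionary. The contents of `reddit_json` below are invented sample values that only mimic the shape described above (a single `'data'` key holding a list).

```python
# Hypothetical, shortened stand-in for the parsed response
reddit_json = {
    "data": [
        {"title": "First sample post", "subreddit": "PrideandPrejudice"},
        {"title": "Second sample post", "subreddit": "PrideandPrejudice"},
    ]
}

keys = list(reddit_json.keys())
print(keys)                       # the only key is 'data'
print(type(reddit_json["data"]))  # its value is a list
print(len(reddit_json["data"]))   # number of items in that list
```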

#### Let's make this list a variable so we can explore each item further

#### Look at the first item in the list.

#### What type is the first item?

#### Get the title of the first post.
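Those four steps can be sketched like this. The `posts` list here is a made-up sample with invented titles; with real data you would take it from the `'data'` key of the parsed response.

```python
# Hypothetical sample standing in for the list stored under 'data'
posts = [
    {"title": "First sample post", "subreddit": "PrideandPrejudice"},
    {"title": "Second sample post", "subreddit": "PrideandPrejudice"},
]

first_post = posts[0]        # look at the first item in the list
print(first_post)
print(type(first_post))      # each item is itself a dictionary
print(first_post["title"])   # so we can pull out the title by its key
```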

### Let's make a dataframe out of a post

![collins-gif](https://media1.tenor.com/images/e25c4f2744fa7c5d96bcd2c5ca4c1435/tenor.gif?itemid=5782673)
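A minimal sketch of the DataFrame step, again on invented sample posts. The field names `title`, `subreddit`, and `created_utc` are the kind of fields a submission carries, but the values here are made up. Wrapping a single post in a list gives a one-row DataFrame; passing the whole list gives one row per post.

```python
import pandas as pd

# Hypothetical sample posts; real posts have many more fields
posts = [
    {"title": "First sample post", "subreddit": "PrideandPrejudice", "created_utc": 1577836800},
    {"title": "Second sample post", "subreddit": "PrideandPrejudice", "created_utc": 1577923200},
]

# One post -> one row (note the list around the dictionary)
single_post_df = pd.DataFrame([posts[0]])

# All posts -> one row per post, one column per key
all_posts_df = pd.DataFrame(posts)

print(single_post_df.shape)
print(all_posts_df[["title", "created_utc"]])
```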

## Summary

You saw how to get Reddit data from the Pushshift API using requests, parse the JSON response, and save the data in a DataFrame.

#### Check for understanding
- What is JSON?

Ready to get some data!? ⬇️
