In [1]:
from IPython.display import HTML
def css_styling():
    # Read the stylesheet and close the file when done
    with open("../Data/www/styles/custom.css", "r") as css_file:
        styles = css_file.read()
    return HTML(styles)
css_styling()

In [2]:
import requests 
import requests.auth
import json
#from PIL import Image
#from io import BytesIO
from IPython.display import display, Image


# Digital Trace Data

The Internet is a gigantic data dump. There is all the social networking data from Facebook, Twitter, and so on. There is the news from all the traditional media sources plus Quartz, Vox, and so on. Then there is the data from organizations such as the World Bank, the Bureau of Labor Statistics, the US Census, or Chicago's Data Portal. Finally, you have all your scientific data sources: the National Cancer Institute, the Protein Data Bank, or the Kyoto Encyclopedia of Genes and Genomes.

How can you use Python to access those sites and retrieve data for your research, your business, or your hobby?

There are two main approaches to retrieving data from web sources. The preferred approach is using **Application Programming Interfaces**, or APIs. If an organization has decided to share its data, and it has the forethought and resources to do so, it will develop an API that lets you interact with its data.

If the organization does not have the forethought or resources to create an API (or if they do not want to share their data), then you have to **crawl** their website and **scrape** their data.



# Application Programming Interfaces (APIs)


APIs simplify the process of obtaining specific information from a data source. You do not have to worry about figuring out the **format** in which the information is stored, or **where** the information is stored. All of those matters are handled seamlessly by the API.

But convenience is not the only advantage of an API. APIs are also particularly useful when:

* **The data is a small part of a whole.** Reddit comments are one example. What if you want to just pull your own comments on Reddit? It doesn’t make much sense to download the entire Reddit database, then filter just your own comments.
    
* **Massive imbalance in data availability.** Spotify has an API that can tell you the genre of a piece of music. You could theoretically create your own classifier, and use it to categorize music, but you’ll never have as much data as Spotify does.
    
* **Data is changing quickly.** An example of this is stock price data. It doesn’t really make sense to regenerate a dataset and download it every minute – this will take a lot of bandwidth, and be pretty slow.


## Schematic of an API transaction

Using an API is essentially the same as viewing a web page at a URL. The rough transaction is as follows:

<img src='../images/api_schema.jpeg'>

1. Formulate and send the query to the server
2. The server runs the specified query on its database
3. Server returns the query data object

The difference between this process and viewing a web page is that (i) you are not making a specific query when you go to a web page, and (ii) the server returns the data embedded in HTML (a web page is data, after all).

## Returned data types

APIs will almost always return data in one of two formats: XML or JSON. These formats are nothing particularly special; they simply specify how the data object will be structured. XML is used **far less** than JSON, but both are presented here for completeness.

### eXtensible Markup Language (XML)

XML is a markup language like HTML; the main difference is that it accepts arbitrary tag names that HTML would not recognize. As an example:

<img src='../images/xml_example.png'>

Each ``<>`` is called a tag. It is an opening tag when the name is spelled normally (``<food>``) and a closing tag when a forward slash precedes the name (``</food>``). For an XML file to be read without issue, every opening tag must be closed.

As far as the structure of the rest of the file (i.e. what line goes where), that is left to the author to decide. One navigates an XML file by moving from sibling to sibling (tags that are at the same level and next to each other) or parent to child (a tag that is inside of another tag).
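
To make the parent-to-child and sibling-to-sibling navigation concrete, here is a minimal sketch using Python's built-in `xml.etree.ElementTree` module. The small `<breakfast_menu>` document is made up for illustration; its tag names and values are assumptions, not the contents of the image above.

```python
import xml.etree.ElementTree as ET

# A small, made-up XML document used only for illustration
xml_text = """
<breakfast_menu>
    <food>
        <name>Belgian Waffles</name>
        <price>5.95</price>
    </food>
    <food>
        <name>French Toast</name>
        <price>4.50</price>
    </food>
</breakfast_menu>
"""

root = ET.fromstring(xml_text)   # parse the string; root is the outermost tag
print(root.tag)

# Iterating over a tag visits its children, i.e. the siblings in document order
menu = {}
for food in root:
    name = food.find('name').text            # move from parent to child by tag name
    price = float(food.find('price').text)   # tag contents are always text
    menu[name] = price

print(menu)
```

Note that everything inside a tag comes back as text, so numeric values have to be converted explicitly.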

### JavaScript Object Notation (JSON)

JSON is a set of key-value pairs, where the keys must be unique. If this sounds familiar, it is because it is effectively the same definition as a dictionary in Python. When you look at a JSON object, you will effectively see a dictionary.

<img src='../images/json_example.png'>

The ``{}`` denote the start and end of the data object, and each key is separated from its value by a ``:``. Something to note: a Python dictionary is not itself JSON; you must serialize the dictionary to a JSON string.

In [3]:
sample = {'key1': 'value1', 'key2': 'value2'}
sample

{'key1': 'value1', 'key2': 'value2'}

In [4]:
import json 
json.dumps(sample)

'{"key1": "value1", "key2": "value2"}'

You can tell that you have JSON when the data is represented as a string.
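
The conversion works in both directions. As a short sketch, `json.dumps()` turns a dictionary into a JSON string and `json.loads()` turns the string back into a dictionary:

```python
import json

sample = {'key1': 'value1', 'key2': 'value2'}

as_json = json.dumps(sample)     # dictionary -> JSON string
print(type(as_json))

restored = json.loads(as_json)   # JSON string -> dictionary
print(type(restored))
print(restored == sample)
```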

## Making requests

In order to learn how APIs work, we will first use the APIs developed to retrieve data on the **International Space Station (ISS)**.  The relevant APIs can be found at http://open-notify.org/.  We will first consider the API for retrieving the location (latitude and longitude) of the ISS (http://open-notify.org/Open-Notify-API/ISS-Location-Now/). The API is hosted at http://api.open-notify.org/iss-now.json. 

So, how do we make requests for information with this API?

Like standard webpages, APIs are hosted on web servers. When you type http://www.google.com in your browser's address bar, your computer is actually asking the server at http://www.google.com for a webpage, which the server then returns to your browser for display. That action is called a `request`.

There are many possible types of requests. The most common, and the one we will be using throughout this unit, is the `GET` request. A `GET` request simply accesses and downloads the webpage found at the URL you specified as an input. 

We will use the [`requests`](http://docs.python-requests.org/en/latest/user/quickstart/) package to crawl (load) webpages and scrape (download) their contents.

In [5]:
import requests 

response = requests.get("http://api.open-notify.org/iss-now.json")
print( response )
print( response.status_code )


<Response [200]>
200


Methods from the `requests` package return `Response` objects. One of the most important properties of the response is its `status code`, which is printed by default but which we can also get explicitly.

Here are some of the most common status codes you might encounter:
* 200, **OK**. Standard response for successful HTTP requests. The actual response will depend on the request method used.
* 301, **Moved Permanently**. The server is redirecting you to a different endpoint. This and all future requests should be directed to the given URL. This can happen when a company switches domain names, or an endpoint name is changed.
* 303, **See Other**. The response to the request can be found under another URI using a GET method. When received in response to a POST (or PUT/DELETE), the client should presume that the server has received the data and should issue a redirect with a separate GET message. Your web browser automatically fetches the new URL but web crawlers do not usually do this unless you specify it.
* 400, **Bad Request**. The server cannot or will not process the request due to an apparent client error (e.g., malformed request syntax, invalid request message framing, or deceptive request routing).
* 401, **Unauthorized**. Similar to `403 Forbidden`, but specifically for use when authentication is required and has failed or has not yet been provided. The response must include a WWW-Authenticate header field containing a challenge applicable to the requested resource.
* 403, **Forbidden**. The request was a valid request, but the server is refusing to respond to it. `403` error semantically means "unauthorized", i.e. the user does not have the necessary permissions for the resource.
* 404, **Not Found**. The requested resource could not be found but may be available in the future. Subsequent requests by the client are permissible.
* 500, **Internal Server Error**. A generic error message, given when an `unexpected` condition was encountered and no more specific message is suitable.
* 503, **Service Unavailable**. The server is currently unavailable (because it is overloaded or down for maintenance). Generally, this is a temporary state.
* 504, **Gateway Timeout**. The server was acting as a gateway or proxy and did not receive a timely response from the upstream server.



More codes: http://en.wikipedia.org/wiki/List_of_HTTP_status_codes

The status code of our request was **200**. It means that all went well -- we successfully connected to the web address we wanted and downloaded its contents.
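
The first digit of a status code tells you the broad outcome, which is often all your code needs to know. The sketch below uses Python's built-in `http.HTTPStatus` to look up the standard phrase for a code; `describe_status` is a hypothetical helper written for this illustration, not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description of an HTTP status code (hypothetical helper)."""
    phrase = HTTPStatus(code).phrase      # standard phrase, e.g. 'Not Found' for 404
    if code < 200:
        outcome = 'informational'
    elif code < 300:
        outcome = 'success'
    elif code < 400:
        outcome = 'redirect'
    elif code < 500:
        outcome = 'client error'
    else:
        outcome = 'server error'
    return '{0} {1} ({2})'.format(code, phrase, outcome)

for code in [200, 301, 404, 503]:
    print(describe_status(code))
```

In practice you could pass `response.status_code` to a function like this. The `requests` package also offers `response.raise_for_status()`, which raises an exception for any 4xx or 5xx code.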

But the status code is not the only attribute available:

In [6]:
print( response.url )

http://api.open-notify.org/iss-now.json


In [7]:
response.text

'{"timestamp": 1515012863, "message": "success", "iss_position": {"latitude": "35.0142", "longitude": "-152.8480"}}'

This is the content format specified at http://open-notify.org/Open-Notify-API/ISS-Location-Now/. It is in JSON format, which means that we can easily parse it using the `json` module.

In [8]:
data = json.loads(response.text)

data

{'iss_position': {'latitude': '35.0142', 'longitude': '-152.8480'},
 'message': 'success',
 'timestamp': 1515012863}

**YES**. 

The function `json.loads()` parses a JSON-formatted string and returns it as a dictionary. We can print whatever information we need from the dictionary using the appropriate `keys`.

In [9]:
print( "The ISS current position is {0} of latitude and {1} of longitude.".format( 
        data['iss_position']['latitude'], 
        data['iss_position']['longitude'] ) )

The ISS current position is 35.0142 of latitude and -152.8480 of longitude.


### Another example

The [`Open Notify`](http://open-notify.org/) API can handle different kinds of data requests. Let's now consider the overhead pass prediction API (http://open-notify.org/Open-Notify-API/ISS-Pass-Times/).  Before we write any code, it is important to check the requirements of the API.

**Overview**

`The API returns a list of upcoming ISS passes for a particular location formatted as JSON.`

`As input it expects a latitude/longitude pair, altitude and how many results to return. All fields are required.`

`As output you get the same inputs back (for checking) and a time stamp when the API ran in addition to a success or failure message and a list of passes. Each pass has a duration in seconds and a rise time as a unix time stamp.`

Notice the second line. We need to provide inputs. Let's see how we can do this!

Any suggestions?

.

.

.

.

.

.

.

.

.

.

.

.


Yes, we need to read the documentation for [requests](http://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls). Specifically, the documentation concerning passing parameters.

This is the relevant information: `Requests allows you to provide these arguments as a dictionary, using the params keyword argument.`
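
Under the hood, a parameter dictionary is turned into a query string that is appended to the URL after a `?`. The sketch below reproduces that step with the standard library's `urlencode`, so you can see what the final URL looks like without sending a request. The endpoint and the `a`/`b` parameter names are placeholders invented for this illustration.

```python
from urllib.parse import urlencode

# Placeholder endpoint and parameters, made up for illustration
base_url = 'http://example.com/api'
params = {'a': 1.5, 'b': 'two words'}

# Each key/value pair becomes key=value; pairs are joined with '&'
# and characters that are not URL-safe (such as spaces) are escaped
query_string = urlencode(params)
full_url = base_url + '?' + query_string
print(full_url)
```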

And what are the parameters we need to provide? The API's documentation has the answer:

**Input**

`This API has 2 required input values and 2 optional ones.`

| Input | Description | Query string | Valid range | Units | Required? |
|---|---|---|---|---|---|
| Latitude | The latitude of the place to predict passes | `lat` | -80..80 | degrees | YES |
| Longitude | The longitude of the place to predict passes | `lon` | -180..180 | degrees | YES |
| Altitude | The altitude of the place to predict passes | `alt` | 0..10,000 | meters | No |
| Number | The number of passes to return | `n` | 1..100 | – | No |

The beauty of code documentation. If you recall, the `Overview` stated that all inputs were required. The `Input` section lets us know that only `lat` and `lon` are actually required.  Let's try and see what happens.

In [10]:
#our_location = {'lat': ###, 'lon': ####}

response = requests.get("http://api.open-notify.org/iss-pass.json", params = our_location)

response.url

NameError: name 'our_location' is not defined

In [11]:
# Answer

our_location = {'lat': 42.04, 'lon': 87.687}
response = requests.get("http://api.open-notify.org/iss-pass.json", params = our_location)
response.url

'http://api.open-notify.org/iss-pass.json?lat=42.04&lon=87.687'

In [12]:
response.content

b'{\n  "message": "success", \n  "request": {\n    "altitude": 100, \n    "datetime": 1515012865, \n    "latitude": 42.04, \n    "longitude": 87.687, \n    "passes": 5\n  }, \n  "response": [\n    {\n      "duration": 641, \n      "risetime": 1515016951\n    }, \n    {\n      "duration": 599, \n      "risetime": 1515022782\n    }, \n    {\n      "duration": 573, \n      "risetime": 1515028637\n    }, \n    {\n      "duration": 624, \n      "risetime": 1515034444\n    }, \n    {\n      "duration": 627, \n      "risetime": 1515040234\n    }\n  ]\n}\n'

**What have we learned?** 

That providing values for `alt` and `n` is indeed not required.  The API just assigns those variables default values of `100` and `5`, respectively.

We can also see that, as stated in the documentation, the output returns both the values of our inputs and the data we are requesting.
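
The rise times are Unix time stamps, which are hard to read. As a sketch, the code below parses the first two passes from the response above and converts each rise time to a UTC date and time with the standard library. The `content` value is a shortened copy of the response, kept small for illustration.

```python
import json
from datetime import datetime, timezone

# The first two passes, copied from the response shown above
content = b'{"response": [{"duration": 641, "risetime": 1515016951}, {"duration": 599, "risetime": 1515022782}]}'

passes = json.loads(content)['response']

# Convert each Unix time stamp (seconds since 1970-01-01 UTC) to a datetime
rise_times = [datetime.fromtimestamp(p['risetime'], tz=timezone.utc) for p in passes]

for rise, p in zip(rise_times, passes):
    print(rise.isoformat(), 'visible for', p['duration'], 'seconds')
```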

**Exercise:** It is now time for you to try to use an API on your own. The last API available at [`Open Notify`](http://open-notify.org/) returns the number of people currently in space. Write the code to access that information.

## Using The US Census' API

The United States Census is a decennial census mandated by the United States Constitution. The United States Census Bureau (officially the Bureau of the Census) is responsible for the United States Census.

The first census after the American Revolution was taken in 1790, under Secretary of State Thomas Jefferson; there have been 22 federal censuses since that time. The current national census was held in 2010; the next census is scheduled for 2020 and will be largely conducted using the Internet. For years between the decennial censuses, the Census Bureau issues estimates made using surveys and statistical models.

The Census Bureau has begun rolling out their datasets via [APIs](http://www.census.gov/developers/). You can find a full list of APIs [here](http://www.census.gov/data/developers/data-sets.html).  In this unit, we will focus on the [decennial census](http://www.census.gov/data/developers/data-sets/decennial-census-data.html).

Because we are dealing with US data, we will start by loading some helpful data: US city names, their states, and their geographic codes.  The relevant data is stored in `json` format in `../data/`



In [13]:
ls ../data/us*json

../data/us_cities4analysis.json   ../data/us_state_names.json
../data/us_state_city_names.json


In [14]:
with open('../Data/us_state_names.json') as file_in:
    states_w_codes = json.load( file_in )  
    
states_w_codes

{'AK': {'Name': 'Alaska', 'fips_state': '02'},
 'AL': {'Name': 'Alabama', 'fips_state': '01'},
 'AR': {'Name': 'Arkansas', 'fips_state': '05'},
 'AZ': {'Name': 'Arizona', 'fips_state': '04'},
 'CA': {'Name': 'California', 'fips_state': '06'},
 'CO': {'Name': 'Colorado', 'fips_state': '08'},
 'CT': {'Name': 'Connecticut', 'fips_state': '09'},
 'DE': {'Name': 'Delaware', 'fips_state': '10'},
 'FL': {'Name': 'Florida', 'fips_state': '12'},
 'GA': {'Name': 'Georgia', 'fips_state': '13'},
 'HI': {'Name': 'Hawaii', 'fips_state': '15'},
 'IA': {'Name': 'Iowa', 'fips_state': '19'},
 'ID': {'Name': 'Idaho', 'fips_state': '16'},
 'IL': {'Name': 'Illinois', 'fips_state': '17'},
 'IN': {'Name': 'Indiana', 'fips_state': '18'},
 'KS': {'Name': 'Kansas', 'fips_state': '20'},
 'KY': {'Name': 'Kentucky', 'fips_state': '21'},
 'LA': {'Name': 'Louisiana', 'fips_state': '22'},
 'MA': {'Name': 'Massachusetts', 'fips_state': '25'},
 'MD': {'Name': 'Maryland', 'fips_state': '24'},
 'ME': {'Name': 'Maine', 'f

In [15]:
with open('../Data/us_state_city_names.json') as file_in:
    cities_by_state = json.load( file_in )
    
cities_by_state

{'AK': {'Akhiok': {'fips_city': '00650',
   'fips_state': '02',
   'gnis_city': '2419347',
   'type': 'CDP'},
  'Akiachak': {'fips_city': '00760',
   'fips_state': '02',
   'gnis_city': '2418756',
   'type': 'CDP'},
  'Akiak': {'fips_city': '00870',
   'fips_state': '02',
   'gnis_city': '2419348',
   'type': 'CDP'},
  'Akutan': {'fips_city': '01090',
   'fips_state': '02',
   'gnis_city': '2419349',
   'type': 'CDP'},
  'Alakanuk': {'fips_city': '01200',
   'fips_state': '02',
   'gnis_city': '2419350',
   'type': 'CDP'},
  'Alatna': {'fips_city': '01305',
   'fips_state': '02',
   'gnis_city': '2418761',
   'type': 'CDP'},
  'Alcan Border': {'fips_city': '01390',
   'fips_state': '02',
   'gnis_city': '2418762',
   'type': 'CDP'},
  'Aleknagik': {'fips_city': '01420',
   'fips_state': '02',
   'gnis_city': '2419351',
   'type': 'CDP'},
  'Aleneva': {'fips_city': '01560',
   'fips_state': '02',
   'gnis_city': '2418764',
   'type': 'CDP'},
  'Allakaket': {'fips_city': '01860',
   'fip

**FIPS state codes** are numeric and two-letter alphabetic codes defined in U.S. Federal Information Processing Standard Publication ("FIPS PUB") 5-2 to identify U.S. states and certain other associated areas. The codes are used in Geographic Names Information System, overseen by the U.S. Board on Geographic Names. 

In [16]:
print(states_w_codes['MT'])
print(cities_by_state['MT']['Antelope'])

{'Name': 'Montana', 'fips_state': '30'}
{'fips_city': '02050', 'type': 'CDP', 'gnis_city': '2407745', 'fips_state': '30'}
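
API responses often identify a state only by its FIPS code, so it helps to be able to go from code back to name. The sketch below inverts a small excerpt of `states_w_codes`; the variable names `sample_states` and `fips_to_state` are made up for this illustration.

```python
# A few entries copied from states_w_codes above
sample_states = {
    'AK': {'Name': 'Alaska', 'fips_state': '02'},
    'AL': {'Name': 'Alabama', 'fips_state': '01'},
    'MT': {'Name': 'Montana', 'fips_state': '30'},
}

# Invert the mapping: FIPS code -> state name
fips_to_state = {info['fips_state']: info['Name'] for info in sample_states.values()}

# FIPS codes are strings, so the leading zero matters: '02', not 2
print(fips_to_state['02'])
```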


Now that we have the basic information, we can start using the API to retrieve data. The Census Bureau has a number of helpful resources.  The [decennial census page](http://www.census.gov/data/developers/data-sets/decennial-census-data.html) contains basic instructions on how to construct queries. There is also a [page with examples](http://api.census.gov/data/2010/sf1/examples.html), and a page with a list of all (and I *really* mean **all**) [variable codes](http://api.census.gov/data/2010/sf1/variables.html).

But before we can do anything, we need to obtain a `key` that will identify us as the person doing the queries.

This is an important first step in learning how to interact with an API. 

**Exercise.** Request and obtain an API key to use with the Census. 

In [17]:
#Answer

auth = json.load(open('../data/census_auth.json'))
auth

{'my_key': '8690204fa4d0045c42c42fa63af43f7a7751db74'}

Now I would like you to retrieve the total population for all of the states from the following URL: http://api.census.gov/data/2010/sf1

In [18]:
#Answer

census_url = 'http://api.census.gov/data/2010/sf1'
response = requests.get( census_url, params = {'key': auth['my_key'], 'get': 'P0010001,NAME', 'for': 'state: *'})
print(response.status_code)
print('http://api.census.gov/data/2010/sf1?key=1eb4e956ec8bdfd987960641728d0fed68589575&get=P0010001,NAME&for=state:*')
print(response.url)
print(response.text)

200
http://api.census.gov/data/2010/sf1?key=1eb4e956ec8bdfd987960641728d0fed68589575&get=P0010001,NAME&for=state:*
https://api.census.gov/data/2010/sf1?for=state%3A+%2A&key=8690204fa4d0045c42c42fa63af43f7a7751db74&get=P0010001%2CNAME
[["P0010001","NAME","state"],
["4779736","Alabama","01"],
["710231","Alaska","02"],
["6392017","Arizona","04"],
["2915918","Arkansas","05"],
["37253956","California","06"],
["5029196","Colorado","08"],
["3574097","Connecticut","09"],
["897934","Delaware","10"],
["601723","District of Columbia","11"],
["18801310","Florida","12"],
["9687653","Georgia","13"],
["1360301","Hawaii","15"],
["1567582","Idaho","16"],
["12830632","Illinois","17"],
["6483802","Indiana","18"],
["3046355","Iowa","19"],
["2853118","Kansas","20"],
["4339367","Kentucky","21"],
["4533372","Louisiana","22"],
["1328361","Maine","23"],
["5773552","Maryland","24"],
["6547629","Massachusetts","25"],
["9883640","Michigan","26"],
["5303925","Minnesota","27"],
["2967297","Mississippi","28"],
["598
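
The response is a list of lists in which the first row holds the column names. A minimal sketch of turning it into something easier to work with, using the first few rows of the output above:

```python
import json

# The first rows of the response shown above
text = '''[["P0010001","NAME","state"],
["4779736","Alabama","01"],
["710231","Alaska","02"],
["6392017","Arizona","04"]]'''

rows = json.loads(text)
header, records = rows[0], rows[1:]

# Pair each value with its column name and convert the count to an integer
population = {}
for record in records:
    row = dict(zip(header, record))
    population[row['NAME']] = int(row['P0010001'])

print(population)
```

All values arrive as strings, including the population counts, so convert them before doing any arithmetic.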

We can also write queries that obtain several data sets all at once. For example, we can obtain population by age and ethnicity using the codes:

* P012A018 -- Sex By Age (White Alone) MALE 15 yrs old
* P012A038 -- Sex By Age (White Alone) MALE 35 yrs old
* P012B018 -- Sex By Age (Black Or African American Alone) MALE 15 yrs old

And we can restrict the query to a single state.

**Exercise** Pull the population totals for all three of those categories for Alaska only.

In [19]:
#Answer

# Join the variable codes into a single comma-separated string
data_codes = ','.join(['P012A018', 'P012A038', 'P012B018', 'NAME'])
print(data_codes)

state_fips = 'state:' + states_w_codes['AK']['fips_state']

response = requests.get( census_url, params = {'key': auth['my_key'], 'get': data_codes, 'for': state_fips})
print(response.status_code)
print(response.url)
print(response.text)

P012A018,P012A038,P012B018,NAME
200
https://api.census.gov/data/2010/sf1?for=state%3A02&key=8690204fa4d0045c42c42fa63af43f7a7751db74&get=P012A018%2CP012A038%2CP012B018%2CNAME
[["P012A018","P012A038","P012B018","NAME","state"],
["7003","15520","231","Alaska","02"]]


We can also retrieve the population for specific cities. Pull the population totals for those three categories for Chicago, Evanston, and Wilmette.

In [20]:
#Answer

# Build 'place:code1,code2,...' from the FIPS code of each city
cities = ['Chicago', 'Evanston', 'Wilmette']
location_codes = 'place:' + ','.join('0' + cities_by_state['IL'][city]['fips_city'] for city in cities)

state_fips = 'state:' + states_w_codes['IL']['fips_state']

response = requests.get( census_url, params = {'key': auth['my_key'], 'get': data_codes, 'for': location_codes, 'in': state_fips})

print(response.status_code)
print(response.url)
print(response.text)


200
https://api.census.gov/data/2010/sf1?for=place%3A014000%2C024582%2C082075&key=8690204fa4d0045c42c42fa63af43f7a7751db74&get=P012A018%2CP012A038%2CP012B018%2CNAME&in=state%3A17
[["P012A018","P012A038","P012B018","NAME","state","place"],
["12005","37878","8115","Chicago city","17","14000"],
["593","1746","110","Evanston city","17","24582"],
["346","924","1","Wilmette village","17","82075"]]


### Refactor the code 

We have written code that can retrieve specific decennial census information. However, that code is neither modular nor generalizable. To write better code, it is useful to refactor it into reusable functions.

In [21]:
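def create_query_for_census_API(ages, cities, state_code, census_key, ethnicity_code = 'A' ):
    """
    Creates a query for retrieving male populations of given ethnicity for a given set of cities
    
    input:
        ages - list : ages of male population to query
        cities - list : fips codes of cities to query 
        state_code - str : fips code of state for cities
        census_key - str : user personal key for census API
        ethnicity_code - str : ethnicity census code (A, B, C, D, H)
        
    output:
        query - dict : params for API query
    """
    # Assumption: variable codes follow the pattern of the examples above,
    # where age 15 maps to 'P012A018' and age 35 to 'P012A038' (suffix = age + 3)
    data_codes = ['P012{0}{1:03d}'.format(ethnicity_code, age + 3) for age in ages]
    data_codes.append('NAME')
    
    query = {'key': census_key,
             'get': ','.join(data_codes),
             'for': 'place:' + ','.join(cities),
             'in': 'state:' + state_code}
    
    return query

#response = requests.get( census_url, params = create_query_for_census_API([15, 35], ['14000'], '17', auth['my_key']) )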

## Analysis

Which state has the largest African American population as a percentage of total population? Pull the data from the Census and plot the percentage of African American population per state in a graph.

## Social Platforms

Reddit is a link/image sharing website. It has 234 million unique monthly users at last count and serves as an on-line watering hole for a huge and diverse number of groups. Because of these groups and their interests, it can serve as an extremely interesting data source for an observational study.

The API docs are here: https://www.reddit.com/dev/api/

**Exercise** Obtain an API key for an OAuth app and authenticate with Reddit

In [22]:
# Answer

auth = json.load(open('../data/reddit_auth.json'))
    
client_auth = requests.auth.HTTPBasicAuth(auth['app_client_ID'], auth['app_client_secret']) 
post_data = {"grant_type": "password", "username": auth['username'], "password": auth['password']}
headers = {"User-Agent": "kphd540:0.1 by brixtonandcash"}
response = requests.post("https://www.reddit.com/api/v1/access_token", 
                         auth = client_auth, data = post_data, headers = headers)
response.json()

{'access_token': 'jqXsCdogax5dxasVbkyi17e9QlU',
 'expires_in': 3600,
 'scope': '*',
 'token_type': 'bearer'}

Now, once we do that we have to update our header to start querying the API. We will start with querying for our own user profile.

In [23]:
access_token = response.json()['access_token']
headers.update( {'Authorization': 'bearer ' + access_token } )
response = requests.get("https://oauth.reddit.com/api/v1/me", headers = headers)
response.json()

{'comment_karma': 0,
 'created': 1512015267.0,
 'created_utc': 1511986467.0,
 'features': {'activity_service_read': True,
  'activity_service_write': True,
  'ad_moderation': True,
  'adblock_test': True,
  'ads_auction': True,
  'ads_auto_extend': True,
  'ads_auto_refund': True,
  'adserver_reporting': True,
  'adzerk_do_not_track': True,
  'adzerk_reporting_2': True,
  'block_user_by_report': True,
  'chat_group_rollout': True,
  'chat_menu_notification': True,
  'chat_rollout': True,
  'crossposting_ga': True,
  'crossposting_recent': True,
  'default_srs_holdout': {'experiment_id': 171,
   'owner': 'relevance',
   'variant': 'control_1'},
  'do_not_track': True,
  'eu_cookie_policy': True,
  'expando_events': True,
  'force_https': True,
  'geopopular': True,
  'geopopular_ie_holdout': {'experiment_id': 209,
   'owner': 'relevance',
   'variant': 'geopopular_holdout'},
  'geopopular_mobile_holdout': {'experiment_id': 211,
   'owner': 'relevance',
   'variant': 'control_2'},
  'geo

Now I would like you to search the `puppies` subreddit for mentions of the word `adorable`.

In [24]:
#Answer 

reddit_query = 'https://oauth.reddit.com/r/puppies/.json'
options = {'q': 'adorable', 'sort': 'new', 'restrict_sr': 'on'}

response = requests.get( reddit_query, params = options, headers=headers)
print(response.url)
response.json()

https://oauth.reddit.com/r/puppies/.json?q=adorable&restrict_sr=on&sort=new


{'data': {'after': 't3_7lvie4',
  'before': None,
  'children': [{'data': {'approved_at_utc': None,
     'approved_by': None,
     'archived': False,
     'author': 'stevetimothy',
     'author_flair_css_class': None,
     'author_flair_text': None,
     'banned_at_utc': None,
     'banned_by': None,
     'brand_safe': True,
     'can_gild': True,
     'can_mod_post': False,
     'clicked': False,
     'contest_mode': False,
     'created': 1515022740.0,
     'created_utc': 1514993940.0,
     'distinguished': None,
     'domain': 'imgur.com',
     'downs': 0,
     'edited': False,
     'gilded': 0,
     'hidden': False,
     'hide_score': False,
     'id': '7nve50',
     'is_crosspostable': True,
     'is_reddit_media_domain': False,
     'is_self': False,
     'is_video': False,
     'likes': None,
     'link_flair_css_class': None,
     'link_flair_text': None,
     'locked': False,
     'media': None,
     'media_embed': {},
     'mod_note': None,
     'mod_reason_by': None,
     'm

In [25]:
goodies = response.json()
print(type(goodies['data']['children']))
stories = goodies['data']['children']
print(stories[1])

<class 'list'>
{'kind': 't3', 'data': {'is_crosspostable': True, 'can_mod_post': False, 'subreddit_id': 't5_2qm5g', 'subreddit_type': 'public', 'mod_reason_title': None, 'approved_by': None, 'banned_at_utc': None, 'subreddit': 'puppies', 'num_crossposts': 0, 'downs': 0, 'id': '7numl0', 'num_reports': None, 'contest_mode': False, 'permalink': '/r/puppies/comments/7numl0/rambos_puppy_sploot/', 'locked': False, 'created_utc': 1514986088.0, 'visited': False, 'approved_at_utc': None, 'link_flair_css_class': None, 'media': None, 'gilded': 0, 'author_flair_css_class': None, 'pinned': False, 'preview': {'images': [{'variants': {}, 'source': {'url': 'https://i.redditmedia.com/0vwpRDeQ2PgkF_1QQ1FXgzJ3Ix-tgL0n5lgAsMGfS1E.jpg?s=1a22710fffee5b4dc7870afdf88ea7ee', 'width': 3264, 'height': 1952}, 'id': 'vZfH9nNGhfS862OyjNM-Z-Lbxd4FRFgonX7tau6dEnM', 'resolutions': [{'url': 'https://i.redditmedia.com/0vwpRDeQ2PgkF_1QQ1FXgzJ3Ix-tgL0n5lgAsMGfS1E.jpg?fit=crop&amp;crop=faces%2Centropy&amp;arh=2&amp;w=108&a

`['data']['children']` is where the posts are.  We can see that we have a bunch of `keys` storing different types of information for each post.  We can look in more detail at the contents of our stories/posts.

In [26]:
print(stories[1].keys())

dict_keys(['kind', 'data'])


In [27]:
print(stories[0]['kind'])
print(stories[0]['data'].keys())

t3
dict_keys(['is_crosspostable', 'can_mod_post', 'subreddit_id', 'subreddit_type', 'mod_reason_title', 'approved_by', 'banned_at_utc', 'subreddit', 'num_crossposts', 'downs', 'id', 'num_reports', 'contest_mode', 'permalink', 'locked', 'created_utc', 'visited', 'approved_at_utc', 'link_flair_css_class', 'media', 'gilded', 'author_flair_css_class', 'pinned', 'preview', 'brand_safe', 'edited', 'suggested_sort', 'name', 'is_video', 'created', 'report_reasons', 'subreddit_name_prefixed', 'parent_whitelist_status', 'thumbnail_height', 'ups', 'selftext', 'post_hint', 'hide_score', 'stickied', 'spoiler', 'thumbnail', 'secure_media_embed', 'domain', 'title', 'is_self', 'is_reddit_media_domain', 'can_gild', 'quarantine', 'archived', 'removal_reason', 'whitelist_status', 'author_flair_text', 'mod_note', 'hidden', 'clicked', 'link_flair_text', 'url', 'banned_by', 'mod_reports', 'media_embed', 'view_count', 'score', 'distinguished', 'user_reports', 'saved', 'mod_reason_by', 'author', 'secure_media

In [28]:
print(stories[0]['data']['author'])
print(stories[0]['data']['id'])
print(stories[0]['data']['score'])
print(stories[0]['data']['thumbnail'])

stevetimothy
7nve50
158
https://b.thumbs.redditmedia.com/beroWTyrmkV2BrvQT6fFxIHLURiAp2JA-DKszccCQDo.jpg
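
Rather than printing fields one post at a time, we can collect the fields we care about from every post. The sketch below uses a small stand-in called `sample_stories` with the same nesting as `stories`; the first entry uses the values printed above and the second is made up.

```python
# A stand-in for `stories`, with the same nesting as the API response.
# The first entry uses values printed above; the second is made up.
sample_stories = [
    {'kind': 't3', 'data': {'author': 'example_user', 'id': 'abc123', 'score': 12}},
    {'kind': 't3', 'data': {'author': 'stevetimothy', 'id': '7nve50', 'score': 158}},
]

fields = ['author', 'id', 'score']

# Build one flat dictionary per post, keeping only the fields we care about
posts = [{field: story['data'].get(field) for field in fields} for story in sample_stories]

# Sort from highest to lowest score
posts.sort(key=lambda post: post['score'], reverse=True)
for post in posts:
    print(post)
```

Using `.get()` instead of square brackets returns `None` when a post lacks a field, which avoids a `KeyError` on incomplete records.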


Notice that each of the stories has the URL of a thumbnail for the posted image. We can show that image directly in the notebook using the already loaded libraries.

To display an image:

``display(Image(url=image_url))``

**Exercise** Display the images for all of the stories.

In [29]:
#Answer

for i, story in enumerate(stories):
    if 'http' in story['data']['thumbnail']:
        figure_url = story['data']['thumbnail']
        print(i, story['data']['name'], story['data']['created_utc'])
        display(Image(url=figure_url))

0 t3_7nve50 1514993940.0


1 t3_7numl0 1514986088.0


2 t3_7nr83t 1514943430.0


3 t3_7nmcd0 1514898808.0


4 t3_7npsvc 1514930554.0


5 t3_7nfcn3 1514816770.0


6 t3_7nf6jn 1514814133.0


7 t3_7ncq3s 1514774065.0


8 t3_7nd3fk 1514779039.0


9 t3_7n4c75 1514670719.0


10 t3_7n6qpv 1514697400.0


11 t3_7n62at 1514689215.0


12 t3_7myqtu 1514598750.0


13 t3_7msmlf 1514528808.0


14 t3_7mtd4d 1514539851.0


15 t3_7mvcjq 1514565635.0


16 t3_7mkfyy 1514436313.0


17 t3_7mdaw6 1514355184.0


18 t3_7mhg4h 1514405539.0


19 t3_7me8he 1514369662.0


20 t3_7m7a9t 1514287427.0


21 t3_7m4t9h 1514250966.0


22 t3_7m7881 1514286478.0


23 t3_7m04bo 1514184798.0


24 t3_7lvie4 1514127464.0


**Exercise** Now I want you to pull the comments for the last posted story.

In [30]:
#Answer

stories[-1]['data']['id']

post_query = 'https://oauth.reddit.com/r/puppies/comments/' + stories[-1]['data']['id'] + '.json'

post_response = requests.get( post_query, headers=headers)
print(post_response.url)
post_response.json()

https://oauth.reddit.com/r/puppies/comments/7lvie4.json


[{'data': {'after': None,
   'before': None,
   'children': [{'data': {'approved_at_utc': None,
      'approved_by': None,
      'archived': False,
      'author': 'Tipperton',
      'author_flair_css_class': None,
      'author_flair_text': None,
      'banned_at_utc': None,
      'banned_by': None,
      'brand_safe': True,
      'can_gild': True,
      'can_mod_post': False,
      'clicked': False,
      'contest_mode': False,
      'created': 1514156264.0,
      'created_utc': 1514127464.0,
      'distinguished': None,
      'domain': 'imgur.com',
      'downs': 0,
      'edited': False,
      'gilded': 0,
      'hidden': False,
      'hide_score': False,
      'id': '7lvie4',
      'is_crosspostable': True,
      'is_reddit_media_domain': False,
      'is_self': False,
      'is_video': False,
      'likes': None,
      'link_flair_css_class': None,
      'link_flair_text': None,
      'locked': False,
      'media': None,
      'media_embed': {},
      'mod_note': None,
      'mo
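
The response is a list of two listings: the first holds the post itself, and the second holds the comments. Comments are nested, because each comment can carry its own listing of replies. The sketch below walks that tree recursively. The `sample` data is made up and only mimics the nesting; the field names `body` and `replies` and the kind `t1` for comments follow Reddit's conventions as I understand them, so check them against a real response.

```python
def collect_comments(listing, depth=0):
    """Return (depth, author, body) tuples for every comment in a listing."""
    found = []
    for child in listing['data']['children']:
        if child['kind'] != 't1':        # 't1' marks a comment; skip anything else
            continue
        comment = child['data']
        found.append((depth, comment['author'], comment['body']))
        replies = comment.get('replies')
        if replies:                      # an empty string means no replies
            found.extend(collect_comments(replies, depth + 1))
    return found

# Made-up data with the same nesting as post_response.json()
sample = [
    {'kind': 'Listing', 'data': {'children': [{'kind': 't3', 'data': {'id': '7lvie4'}}]}},
    {'kind': 'Listing', 'data': {'children': [
        {'kind': 't1', 'data': {'author': 'user_a', 'body': 'So cute!', 'replies': {
            'kind': 'Listing', 'data': {'children': [
                {'kind': 't1', 'data': {'author': 'user_b', 'body': 'Agreed.', 'replies': ''}}]}}}},
    ]}},
]

# Indent each comment according to its depth in the reply tree
for depth, author, body in collect_comments(sample[1]):
    print('  ' * depth + author + ': ' + body)
```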

# BotOrNot

APIs also allow you to provide important services and research to others. The assigned paper for this week was CA Davis, O Varol, E Ferrara, A Flammini, F Menczer. (2016) BotOrNot: A system to evaluate social bots. Proceedings of the 25th International Conference Companion on World Wide Web. These researchers have worked to identify whether a social media account on Twitter is a bot based on its posting patterns and content.

Without an API, you can go to the tool at https://botometer.iuni.iu.edu/#!/. This is excellent but it only lets you check one account at a time manually. If we want to use this for research, this method of using the tool is simply not feasible/scalable.

However, with the API we can now easily use this research (except at the moment - the API microservice provider just...changed :( ).