Data Collection
===
Up until now, the data that you've been working with has been given to you in the form of a text file, a database, or a Python pickle. In this module, we'll learn a few techniques for writing our own Python scripts to collect data from the internet.

Collecting data from the internet can be done in one of two ways:
1. through an API (Application Program Interface)
2. by parsing html

The API route is always preferable, as it is (usually) sanctioned by whoever is hosting the data, which allows them to dictate terms of use and to monitor and restrict how their service is used. An access token (or "key") is typically required to access data from an official API.

Not all websites provide a public API for accessing data. In those cases, you may have to write a custom HTML parser. HTML is the language that the static components of web pages are written in. This code is usually well structured, and it is often not difficult to write a short script to find the elements of interest in the HTML.

Getting Data via an API
---
Each API is different, as is the process of getting an API key. Usually, getting an API key is as simple as creating an account and registering for a key. Almost all companies that offer a public API also offer documentation on how to use their API and how to get a key. Here's Goodreads' developer page: https://www.goodreads.com/api

We will want to retrieve a Goodreads API key. In order to do so, create an account with Goodreads, and then head to the developer page and apply for an API key. 

>**Note:** Most API users have their own web application and use the Goodreads API to let the users of their web app integrate their Goodreads data with the app. The client ID and client secret are specific to the web application, and the access token is specific to the user of the web application. You can think of the access token as a custom username and password that the client can use on behalf of its users to get access to certain data that is only available to a user logged into Goodreads. Something similar happens every time you use Facebook or Google to "log in" to a third-party site. The third-party site gets an access token from Facebook or Google and can use it to get your name, birthdate, gender, or any other information that you've agreed to give it access to.
Although we won't be making any requests on anybody else's behalf, we'll still need an access token to access Goodreads data as though we were logged in.

When filling out the form to register your client you will only need to fill in an application name and company name. Feel free to leave the other fields blank. Agree to the terms of service listed below and apply. Goodreads will instantly provide you with an API key and a client secret. 


In order to get data from Goodreads, we'll be making "requests" to their API. We'll be making our requests through python, although you could also make a request directly from your browser. Your browser is in fact making dozens or even hundreds of requests every time it loads a web page. In the cases where your browser loads a web page, it's converting HTML into well formatted human-readable content. When we make requests to an API we'll usually get back content that was designed for a computer to read. Paste the following url (replacing YOUR-API-KEY with your API key) into your browser.

````
https://www.goodreads.com/book/review_counts.json?isbns=0141439602&key=YOUR-API-KEY

````

What you get back is a JSON object that contains review counts for the book `A Tale of Two Cities`. An ISBN is essentially a serial number for a book. To find out which book the ISBN refers to, we used the 'book.show' method listed in the Goodreads API documentation. Goodreads lists all available methods here. Replacing the book 'id' in the sample URL returns metadata about the book in question, including the title.

We briefly saw JSON objects in the previous module. They're identical in structure to python dictionaries, and python comes with a library for converting JSON strings into a dictionary.

Let's make that same API request, now using Python's `requests` library. Store your API key in a variable so that it can easily be substituted into the code.

In [2]:
import requests
my_key = "miQ3kVWsjPTA7UmAWM8Dg"
url = "https://www.goodreads.com/book/review_counts.json?isbns=0141439602&key="+my_key
response = requests.get(url)
response.text

u'{"books":[{"id":1953,"isbn":"0141439602","isbn13":"9780141439600","ratings_count":552448,"reviews_count":873363,"text_reviews_count":8614,"work_ratings_count":581897,"work_reviews_count":957951,"work_text_reviews_count":10860,"average_rating":"3.78"}]}'

The response object has other metadata about the response (success/error codes, headers, etc.), but we're only interested in the content (or "text") of the response. As we saw before, the response text is a JSON string. We'll use Python's `json` library to turn that string into a dictionary.

In [3]:
import json
response_data = json.loads(response.text)
response_data

{u'books': [{u'average_rating': u'3.78',
   u'id': 1953,
   u'isbn': u'0141439602',
   u'isbn13': u'9780141439600',
   u'ratings_count': 552448,
   u'reviews_count': 873363,
   u'text_reviews_count': 8614,
   u'work_ratings_count': 581897,
   u'work_reviews_count': 957951,
   u'work_text_reviews_count': 10860}]}

In [4]:
response.headers

{'status': '200 OK', 'x-request-id': '1WBFB1VP06Y9XM6BM6FN', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff, nosniff', 'content-encoding': 'gzip', 'set-cookie': 'csid=BAhJIhg5MTEtNjI1ODM1My0wMTE2NzM4BjoGRVQ%3D--b9491e8a0a5a6f555f2081872818441467ede936; path=/; expires=Sat, 15 Dec 2035 15:50:01 -0000, locale=en; path=/, _session_id2=e04b0663638384f87a36f4fa9f0b260c; path=/; expires=Tue, 15 Dec 2015 21:50:01 -0000; HttpOnly', 'vary': 'Accept-Encoding,User-Agent', 'content-length': '160', 'server': 'Server', 'x-runtime': '0.028118', 'etag': 'W/"65747fc8dfcac017cf4c2657fda8daf1-gzip"', 'cache-control': 'max-age=0, private, must-revalidate', 'date': 'Tue, 15 Dec 2015 15:50:01 GMT', 'x-frame-options': 'ALLOWALL', 'content-type': 'application/json; charset=utf-8'}

We saw how to work with dictionaries like `response_data` in the previous module:

In [5]:
ave_rating = response_data['books'][0]['average_rating']
ave_rating

u'3.78'

>**Note:** The "books" value is actually a list of dictionaries (with only one element), so we need to specify that we want the first element in that list.

This is the basic workflow for using Python to get data from the internet:
1. Make a request and get the content
2. Convert the content into a parsable Python object (like a dictionary)
3. Get the data of interest and do something with it
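
The three steps above can be sketched as a single function. This is only a minimal sketch: the names `get_average_rating`, `fetch`, and `fake_fetch` are made up for illustration, and the stand-in returns a shortened copy of the response text shown earlier so that the example runs without an API key.

```python
import json


def get_average_rating(url, fetch):
    """Return the average rating of the first book in a review_counts response.

    `fetch` is a hypothetical argument: any callable that takes a URL and
    returns the response text, e.g. `lambda u: requests.get(u).text`.
    """
    content = fetch(url)                        # 1. make a request and get the content
    data = json.loads(content)                  # 2. convert the content into a dictionary
    return data["books"][0]["average_rating"]   # 3. get the data of interest


# Stand-in for a real request (assumption: a trimmed copy of the earlier response).
def fake_fetch(url):
    return '{"books":[{"id":1953,"isbn":"0141439602","average_rating":"3.78"}]}'


rating = get_average_rating("https://www.goodreads.com/book/review_counts.json", fake_fetch)
print(rating)
```

Passing the fetching step in as an argument is a design choice made here so the parsing logic can be tried out offline; with a real key you could pass a function built on `requests.get` instead.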

**Exercise 3.1:**
Write a function that takes a list of ISBNs and returns the corresponding Goodreads IDs.


### API Parameters
Instagram's [recent media user endpoint](https://instagram.com/developer/endpoints/users/#get_users_media_recent) returns recent posts for a given user id. By default, it will return 20 of the most recent posts, but you can supply parameters to override some of the defaults. For instance, if you wanted the last 20 posts from 2014, you would include the `max_timestamp` parameter in your request with the value 1420070400.

>**side note: what is a unix timestamp?**  A Unix timestamp is the number of seconds since January 1, 1970 (UTC). While it's a pretty inconvenient way to represent time from the point of view of human readability, you'll often see it used in data storage because it's easier to use in calculations, takes up less space, and is faster to look up than a datetime string. Python's `datetime` library has a few functions for converting to and from Unix time. You can also do these conversions for Unix timestamps stored in a dataframe with pandas. And, if you have a single Unix timestamp that you're going to look up just once, you can also do so manually [here](http://www.epochconverter.com).
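
As a small illustration of the `datetime` conversions mentioned above, the sketch below (written for Python 3) converts the timestamp 1420070400 to a date and back. Working in UTC is an assumption made here to keep the result independent of your local time zone.

```python
from datetime import datetime, timezone

# Unix timestamp -> timezone-aware datetime (UTC)
moment = datetime.fromtimestamp(1420070400, tz=timezone.utc)
print(moment.isoformat())

# datetime -> Unix timestamp
timestamp = int(datetime(2015, 1, 1, tzinfo=timezone.utc).timestamp())
print(timestamp)
```

The timestamp corresponds to midnight UTC on January 1, 2015, which is why using it as an upper bound restricts results to posts from 2014 or earlier.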

The API parameters are passed in as a suffix to the URL. The first parameter is preceded by a `?` and all subsequent parameters are separated by an `&`. All API requests require at least one parameter (the access token). A request URL using the `max_timestamp` parameter would look like this:
````
https://api.instagram.com/v1/users/260133476/media/recent/?access_token=703990239.f1d54a3.8cf94c42373e43a2951bce01d7ac0bf1&max_timestamp=1420070400
````
You can paste this URL into your browser to see the results (be sure to use your own access_token). Note that, despite what is implied by the Instagram API documentation, your request params should be all lowercase.
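
Rather than concatenating the `?` and `&` separators by hand, you can let Python assemble the query string. The sketch below (Python 3) uses the standard library's `urlencode`; the token value is a placeholder, not a working token.

```python
from urllib.parse import urlencode

base_url = "https://api.instagram.com/v1/users/260133476/media/recent/"
params = {
    "access_token": "YOUR-ACCESS-TOKEN",  # placeholder: substitute your own token
    "max_timestamp": 1420070400,
}

# urlencode joins the key=value pairs with "&"; we add the leading "?" ourselves
url = base_url + "?" + urlencode(params)
print(url)
```

The `requests` library can also do this for you: `requests.get(base_url, params=params)` appends the parameters to the URL before sending the request.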

>**PRO-TIP:** There are browser extensions that make it much easier to view JSON in your browser (everything is well formatted and you can collapse values). For those of you who use the Chrome web browser, [here's the JSON formatting extension that I use](https://chrome.google.com/webstore/detail/jsonview/chklaanhfefbnpoihckbnefhakgolnmc?hl=en). You may have to restart your browser to start the extension.

**Exercise 3.2:** Get the most recent 30 instagram posts by The Alabama Shakes (instagram username: alabama_shakes). Record the id, the link, the number of comments, the number of likes, and the filter. Put the results in a dataframe.

### Pagination
An API request doesn't always return all available data. For instance, by default you only get the 20 most recent posts from Instagram's recent media endpoint, and even with the `count` param you can only get a maximum of 33 posts per request. One way to get data for all posts is to pass the lowest id of your current results to the `max_id` param and keep doing so (using a `while` loop) until the API returns an empty data set. If you're lucky, the API results will include a pagination link that you can follow to get the next page. Then, to get all of the results, you'd just keep following the pagination link until there were no more pagination links.
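
The `max_id` loop described above might look something like the sketch below. It is not Instagram-specific: `fetch_page` is a hypothetical callable standing in for the real request, and the fake data uses plain integer ids so that the example runs offline.

```python
def collect_all_posts(fetch_page):
    """Keep requesting pages until the API returns an empty page.

    `fetch_page` is a hypothetical callable: fetch_page(max_id) returns a list
    of post dictionaries with ids lower than max_id (or the most recent posts
    when max_id is None).
    """
    all_posts = []
    max_id = None
    while True:
        page = fetch_page(max_id)
        if not page:  # an empty page means there is nothing left to fetch
            break
        all_posts.extend(page)
        max_id = min(post["id"] for post in page)  # lowest id seen so far
    return all_posts


# Stand-in for an API holding 50 posts and returning at most 20 per request.
FAKE_POSTS = [{"id": i} for i in range(50, 0, -1)]


def fake_fetch_page(max_id, page_size=20):
    older = [p for p in FAKE_POSTS if max_id is None or p["id"] < max_id]
    return older[:page_size]


posts = collect_all_posts(fake_fetch_page)
print(len(posts))
```

With a real API, `fetch_page` would build the request URL with the `max_id` param, make the request, and return the list of posts from the parsed response.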

**Exercise 3.3:** Get *every* instagram post by the Alabama Shakes and put the same values from above in another dataframe. Determine which was the band's most popular post and what their favorite filter is.

### Rate Limits
One important consideration when building a crawler that will make lots of API requests is that it adheres to the rate limits (i.e. the number of requests allowed per hour) [specified in the documentation](https://www.goodreads.com/api/terms). You can do this either proactively or reactively. The proactive solution is to put a pause between each request. Goodreads has a rate limit of 1 request per second. You can use the `sleep` function from Python's `time` library to implement these pauses:

In [6]:
import time
response = requests.get(url)  # make a request
time.sleep(1)  # sleep for 1 second
response = requests.get(url)  # make the next request
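
When there are many URLs to request, the same pause can be placed inside a loop. The helper below is a sketch with made-up names (`fetch_all`, `get`, `pause_seconds`); the `get` and `sleep` arguments are passed in so that the demo runs without making real requests or actually waiting.

```python
import time


def fetch_all(urls, get, pause_seconds=1.0, sleep=time.sleep):
    """Request each URL in turn, pausing between consecutive requests."""
    responses = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(pause_seconds)  # wait before every request except the first
        responses.append(get(url))
    return responses


# Demo with stand-ins: nothing is requested and no time is spent sleeping.
pauses = []
results = fetch_all(
    ["url-1", "url-2", "url-3"],
    get=lambda u: "response for " + u,
    sleep=pauses.append,
)
print(results, pauses)
```

In real use you would call something like `fetch_all(urls, requests.get)` and leave `sleep` at its default.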

Sometimes an API response will give you feedback to let you know that you've hit its rate limit. In that case, you can take a reactive approach to limiting your requests. Instagram provides this feedback in the form of a response code (available on the response object as `status_code`). When all is well, the code is 200, but if you've exceeded your rate limit, the code will change to 429.

In [7]:
response.status_code  # code should be 200 if all is well and 429 if we've exceeded our rate limit

200
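
A reactive approach could be sketched as follows: make the request, and if the status code is 429, wait and try again. The function name, the 60-second wait, and the retry limit are all assumptions chosen for illustration, and `FakeResponse` is a stand-in so the example runs offline.

```python
import time


class FakeResponse:
    """Stand-in for a requests response; it only carries a status code."""

    def __init__(self, status_code):
        self.status_code = status_code


def get_with_retry(get, url, wait_seconds=60, max_attempts=5, sleep=time.sleep):
    """Call get(url), waiting and retrying while the status code is 429."""
    response = get(url)
    attempts = 1
    while response.status_code == 429 and attempts < max_attempts:
        sleep(wait_seconds)  # back off before trying again
        response = get(url)
        attempts += 1
    return response


# Demo: the first two calls are rate limited and the third succeeds.
canned = iter([FakeResponse(429), FakeResponse(429), FakeResponse(200)])
waits = []
final = get_with_retry(lambda u: next(canned), "https://example.com/api", sleep=waits.append)
print(final.status_code, waits)
```

With the real library you would call `get_with_retry(requests.get, url)`; how long to wait depends on the rate limit documented by the API you are using.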

>**Note:** Exceeding the rate limit isn't the only thing that can go wrong with an API call, and there's a [different status code for each issue you might have](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes). For instance, if we make a request to the recent media endpoint using an invalid user id, we get a 400 status code (Bad Request). If we make a request to a non-existing endpoint, we get a 404 (Not Found) status code.

In [11]:
url = "https://www.goodreads.com/book/review_counts.json?isbns=-1&key="+my_key  # -1 is not a valid ISBN
response = requests.get(url)
response.status_code

404
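
Because a failed request will not contain the JSON you expect, it is worth checking the status code before parsing. The helper below is a hypothetical sketch (the name `parse_or_raise` and the error messages are made up) that takes the status code and text of a response.

```python
import json


def parse_or_raise(status_code, text):
    """Parse the response text as JSON only if the request succeeded."""
    if status_code == 200:
        return json.loads(text)
    if status_code == 429:
        raise RuntimeError("rate limit exceeded (429): slow down and retry later")
    raise RuntimeError("request failed with status code %d" % status_code)


data = parse_or_raise(200, '{"books": []}')

try:
    parse_or_raise(404, "")
except RuntimeError as error:
    message = str(error)

print(data, message)
```

With a real response you would call `parse_or_raise(response.status_code, response.text)`. The `requests` library also offers `response.raise_for_status()`, which raises an exception for 4xx and 5xx status codes.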

Parsing HTML for data
---
An official API is always the best way to get data from a host, but sometimes a website either doesn't have an API or, if they do have an API, some data of interest may not be available via an API endpoint. In those cases you may be able to get the raw HTML and use an HTML parsing library to extract the data of interest.

We can get the raw HTML the same way we got API data, with the `requests` library. Let's grab the HTML from the Wikipedia page on Conan episodes from 2014.

In [12]:
url = "https://en.wikipedia.org/wiki/List_of_Conan_episodes_(2014)"
response = requests.get(url)
html_raw = response.text

The response text is raw html which is difficult to work with, so we'll use the [beautiful soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to help us parse out the data we're interested in.

In [13]:
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(html_raw, "html.parser")  # name the parser explicitly; "html.parser" ships with Python

HTML is a nested set of opening and closing tags which define the page layout and properties. Everything is typically wrapped in html open (<code>&lt;html></code>) and close (<code>&lt;/html></code>) tags. Between <code>&lt;html>&lt;/html></code> there's usually a <code>&lt;head>&lt;/head></code> section where metadata and JavaScript code live, and a <code>&lt;body>&lt;/body></code> section where the page content lives. Once you've soupified your HTML, you can access nested elements as though they were attributes of their parent object. For instance, to get the `head` section of the page you would do:

In [14]:
head = html_soup.html.head

and to get the title element within the `head` section:

In [15]:
head.title

<title>List of Conan episodes (2014) - Wikipedia, the free encyclopedia</title>

You can also perform a search on a beautiful soup object. Let's say you want to get the Table of Contents element from this wikipedia page. Suppose you know that the Table of Contents HTML element has an `id` attribute with value "toc".

In [16]:
table_of_contents = html_soup.find(id='toc')

Beautiful soup objects have a few useful methods and attributes. The `text` attribute is a string of all text within the element that would be displayed in a browser, and the `attrs` attribute is a dictionary of HTML attributes that are part of the tag (like the `id` or `class`).

In [17]:
table_of_contents.attrs  # the table of contents element has 2 attributes, a class (equal to ['toc']) and an 'id' (equal to 'toc')

{'class': ['toc'], 'id': 'toc'}

In [18]:
print(table_of_contents.text)



Contents


1 2014

1.1 January
1.2 February
1.3 March
1.4 April
1.5 May
1.6 June
1.7 July
1.8 August
1.9 September
1.10 October
1.11 November
1.12 December


2 References




If we want to find *all* instances of an element meeting our search criteria, we can instead use the `findAll` method. Let's get a list of all months included in the Table of Contents. If you print the `table_of_contents` object, you'll see that each month is wrapped in a list item tag (<code>&lt;li></code>) with class attribute 'toclevel-2'.

In [19]:
month_elements = table_of_contents.findAll('li', {'class':'toclevel-2'})
len(month_elements)

12

As expected, there are 12 month_elements. Let's see what the first month_element looks like:

In [20]:
month_elements[0]

<li class="toclevel-2 tocsection-2"><a href="#January"><span class="tocnumber">1.1</span> <span class="toctext">January</span></a></li>

Each month element has an `a` tag (in this case, a link to somewhere else in the page) and two spans. The text of the spans is the table of contents number and the month. Let's create lists for the href, number, and month and put the results in a dataframe.

In [21]:
hrefs = []
toc_numbers = []
months  = []
for element in month_elements:
    a_tag = element.a  # a tags in HTML are used for links within a page or to another page
    hrefs.append(a_tag.attrs['href'])  # the href attribute determines where the link goes
    toc_number = element.find('span',{'class':'tocnumber'})
    toc_numbers.append(toc_number.text)
    month = element.find('span',{'class':'toctext'})
    months.append(month.text)

import pandas as pd
pd.DataFrame({'month':months, 'href':hrefs}, index=toc_numbers)

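| | href | month |
|---|---|---|
| 1.1 | #January | January |
| 1.2 | #February | February |
| 1.3 | #March | March |
| 1.4 | #April | April |
| 1.5 | #May | May |
| 1.6 | #June | June |
| 1.7 | #July | July |
| 1.8 | #August | August |
| 1.9 | #September | September |
| 1.10 | #October | October |
| 1.11 | #November | November |
| 1.12 | #December | December |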


Notice that with the `find` and `findAll` methods, we can specify both the tag type (a, span, div, etc.) and attribute values. It's worth looking over the [beautiful soup Quick Start](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start) documentation to see all of the available methods. If you're not very familiar with HTML, you might also consider looking over [this tutorial](http://www.w3schools.com/html/) or taking the [Codecademy course](https://www.codecademy.com/learn/web) in HTML/CSS for a deeper dive.
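
To experiment with these methods without fetching a page, you can parse a small HTML string directly. The snippet below is a self-contained sketch: the HTML is a made-up, stripped-down imitation of the Wikipedia table of contents, and it uses `find_all`, the current spelling of `findAll` in Beautiful Soup 4.

```python
from bs4 import BeautifulSoup

# Made-up HTML that loosely imitates the structure of the table of contents.
html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <div id="toc" class="toc">
      <ul>
        <li class="toclevel-2"><a href="#January"><span class="toctext">January</span></a></li>
        <li class="toclevel-2"><a href="#February"><span class="toctext">February</span></a></li>
      </ul>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

title = soup.html.head.title.text                    # attribute-style navigation
toc = soup.find(id="toc")                            # search by attribute value
items = toc.find_all("li", {"class": "toclevel-2"})  # search by tag type and class
hrefs = [item.a.attrs["href"] for item in items]
months = [item.find("span", {"class": "toctext"}).text for item in items]
print(title, hrefs, months)
```

Trying out searches on a tiny document like this can make it easier to see what each call returns before running it against a full page.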

>**Pro-Tip:** Most web browsers have their own built-in developer tools for investigating all the HTML and other elements that go into displaying a web page. In Chrome, you can right-click on any element on a page and click on "Inspect Element" to look at its HTML. Use the arrows on the left of each element to expand or collapse it in order to show or hide all of its children.

**Exercise 3.4:** Use beautiful soup to get the show number ("No."), air date, lists of guests, and list of entertainment guests for every Conan show from 2014. Put the results in a dataframe.  **Bonus:** make the date a datetime object.