# Getting Data from the Web
## STA 141B


## Getting data from the Web

### The Internet and HTTP

<blockquote> "It's a series of tubes" - Sen. Ted Stevens </blockquote>



<center>
<img src="pipe.gif" alt="tubes" style="width: 200px;margin-top:1.5cm;"/>
</center>

## Getting data from the Web

### The Internet and HTTP

- Internet: a system of computer networks that uses internet protocols to link communicating devices
- Network protocols: rules by which computers communicate.
- Each protocol is designed for a certain task
 - Simple Mail Transfer Protocol (SMTP) and Post Office Protocol (POP) are agreements on how email clients and servers create and parse messages
 - Most of these protocols, such as HTTP, are open standards, but some are proprietary, like the Yahoo! Instant Messenger Protocol.
 - The cryptocurrency Bitcoin has an associated protocol that specifies how bitcoins are sent and received.

## Internet Protocol (IP)

The internet protocol suite is a stack of interdependent protocols that power the internet. Its layers handle:

- the routing of packets (IP)
- the interaction with physical components (MAC/Ethernet/etc.)
- the error-free transmission of data (TCP)
- standardized communication between applications (HTTP)

## Some history

![](http://geektrio.net/wp-content/themes/arras-theme/library/timthumb.php?src=http://geektrio.net/wp-content/uploads/2017/08/vintbob.jpg&w=630&h=250&zc=1)

Bob Kahn and Vinton Cerf developed TCP/IP for DARPA, the research arm of the Department of Defense.

## Some history

- Kahn and Cerf designed TCP/IP for ARPANET; the original protocol was later split into TCP and IP, and other researchers joined the project.
- Tim Berners-Lee at the European Organization for Nuclear Research (CERN) developed the Hypertext Transfer Protocol (HTTP).
- HTTP is the application protocol that is the basis for the World Wide Web.
- HTTP standards are maintained and updated by the Internet Engineering Task Force (IETF) and the World Wide Web Consortium (W3C).

## RFCs

Network protocols that are maintained by the IETF are published in Request for Comments (RFC) documents.

<center><pre>Network Working Group                                     T. Berners-Lee
Request for Comments: 1945                                       MIT/LCS
Category: Informational                                      R. Fielding
                                                               UC Irvine
                                                              H. Frystyk
                                                                 MIT/LCS
                                                                May 1996


Hypertext Transfer Protocol -- HTTP/1.0
    
Abstract

   The Hypertext Transfer Protocol (HTTP) is an application-level
   protocol with the lightness and speed necessary for distributed,
   collaborative, hypermedia information systems. It is a generic,
   stateless, object-oriented protocol which can be used for many tasks,
   such as name servers and distributed object management systems,
   through extension of its request methods (commands). A feature of
   HTTP is the typing of data representation, allowing systems to be
   built independently of the data being transferred.
</pre></center>

## HTTP

- HTTP is based on the client-server computing model
 - the client is typically a web browser
 - the web server serves content
- One common open-source web server application is Apache
- HTTP is a request-response protocol: clients make requests for content and the server sends back a response.
- HTTP request methods include:
 - GET, which is used to retrieve data
 - POST, which asks the server to accept data at a URI
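
To make the request-response cycle concrete, here is a minimal sketch that runs both sides on one machine using only the Python standard library. The handler class `EchoPathHandler` and the helper `demo_get` are hypothetical names made up for this illustration; a real web server such as Apache does far more than this.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener


class EchoPathHandler(BaseHTTPRequestHandler):
    """Toy server: answers every GET with the path that was requested."""

    def do_GET(self):
        body = f"you asked for {self.path}".encode("utf-8")
        self.send_response(200)  # status line: 200 OK
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)  # response body

    def log_message(self, format, *args):
        pass  # keep the demo quiet: no per-request log lines


def demo_get(path="/hello"):
    """Start the toy server, send one GET request to it, return (status, text)."""
    server = HTTPServer(("127.0.0.1", 0), EchoPathHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        opener = build_opener(ProxyHandler({}))  # talk to localhost directly, no proxy
        with opener.open(f"http://127.0.0.1:{port}{path}", timeout=5) as resp:
            return resp.status, resp.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()


status, text = demo_get()
print(status, text)
```

The client only ever sees the status code, the headers, and the body; everything the server did to produce them is hidden behind the protocol.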

## Making requests in Python

- Requests package: because we are extracting data from the web, we will focus on GET requests and ignore POST requests
- a basic GET request is what the browser does when you navigate to a URL
- you can pass parameters in the URL
- URL can be broken into the following components: 
 - scheme
 - network location
 - hierarchical path
 - parameters, query, and fragment identifier.  

## Making requests in Python

- basic format: ``scheme://netloc/path;parameters?query#fragment``. 
- scheme can be <code>file, ftp, http, https,...</code>
- the netloc is typically the host name, like <code>www.google.com</code>
- the path locates a resource on that host, much like a file path
- the parameters and query carry the arguments sent with a GET, POST, PUT, etc. request

**Example:** <code>http://api.petfinder.com/my.method?key=12345&arg1=foo&token=67890&sig=abcdef</code> 
- scheme <code>http</code>, 
- netloc of <code>api.petfinder.com</code>, 
- path of <code>my.method</code>, and 
- query is <code>key=12345&arg1=foo&token=67890&sig=abcdef</code>
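
The same breakdown can be checked with the standard library's `urllib.parse` module. This is only a sketch applied to the example URL above; note that `urlparse` reports the path with its leading slash.

```python
from urllib.parse import urlparse, parse_qs

url = "http://api.petfinder.com/my.method?key=12345&arg1=foo&token=67890&sig=abcdef"
parts = urlparse(url)

print(parts.scheme)   # http
print(parts.netloc)   # api.petfinder.com
print(parts.path)     # /my.method
print(parts.query)    # key=12345&arg1=foo&token=67890&sig=abcdef

# parse_qs turns the query string into a dict; each value is a list
# because a key may legally appear more than once in a query.
args = parse_qs(parts.query)
print(args["key"])    # ['12345']
```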

## Requests

The basic use of the Requests package is through the GET method, as in the following.

In [7]:
import requests

r = requests.get('https://api.github.com/events')

- establishes a connection to api.github.com
- makes the GET request
- this is the same thing your browser does when you type https://api.github.com/events into the address bar (go ahead and try it)

In [8]:
r.text[:300]

'[{"id":"8530767475","type":"PushEvent","actor":{"id":324298,"login":"jimkang","display_login":"jimkang","gravatar_id":"","url":"https://api.github.com/users/jimkang","avatar_url":"https://avatars.githubusercontent.com/u/324298?"},"repo":{"id":129019986,"name":"jimkang/self-tagging-bot","url":"https:'

We are seeing a serialization of JSON data from a web API, which we will discuss shortly.

**Note:** You can use requests to make an overwhelmingly large number of requests to a webserver in a very short period of time.  This is how denial of service attacks work, and it is an unkind thing to do.  Please use requests with care.
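
One simple way to stay polite is to pause between consecutive requests. The sketch below assumes a helper named `fetch_politely` (hypothetical, not part of Requests) that takes any fetching function; `fake_fetch` stands in for `requests.get` so the example runs without contacting a real server, and the delay value is an arbitrary choice.

```python
import time


def fetch_politely(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, sleeping `delay` seconds between calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # pause before every request after the first
        results.append(fetch(url))
    return results


def fake_fetch(url):
    """Stand-in for requests.get, so no real server is contacted."""
    return f"response from {url}"


pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"], fake_fetch, delay=0.1
)
print(pages)
```

In real use you would pass `requests.get` as `fetch` and pick a delay that respects the server's published limits.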

## Web APIs and JSON

- application programming interface (API): a set of methods by which software components communicate
- Web APIs define how HTTP request methods can be used to access and modify data on the server
- the URL names the method to call, as in <code>http://api.petfinder.com/subsystem.method</code>

- JSON, specified in a Request for Comments (RFC 7159, since superseded by RFC 8259), is a simple data exchange format.
- it is designed so that programs can efficiently parse complex data
- it is not meant to be especially human-readable and writable, unlike markup languages

- Serialization: turning an object (like a dictionary) into a string that can be sent via HTTP or written to a file
- converting the string back into an object is called deserialization
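
As a small illustration of serialization and deserialization, the sketch below round-trips a dictionary through Python's built-in `json` module. The record itself is made up for the example.

```python
import json

# An illustrative record (the values are invented for this example).
record = {"name": "Murray", "breeds": ["Rat Terrier"], "age": 2, "adopted": False}

text = json.dumps(record)    # serialization: dict -> str
print(type(text).__name__)   # str
print(text)

restored = json.loads(text)  # deserialization: str -> dict
print(restored == record)    # True
```

Note how Python's `False` becomes JSON's lowercase `false` in the string and is converted back on the way in.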

## Reading JSON

JSON is based on dictionaries and lists (called objects and arrays in JSON), and you can nest these in complex ways.  
- a matrix could be written as a list of lists
- a DataFrame is a dictionary with column names for keys and lists for values, etc
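
A short sketch of both layouts, using a hand-written JSON string (the names and numbers are invented). It shows that nested arrays and objects come back as nested lists and dictionaries, which you index step by step.

```python
import json

text = """
{
  "matrix": [[1, 2], [3, 4]],
  "table": {"name": ["Rex", "Fido"], "age": [3, 5]}
}
"""
data = json.loads(text)

# A matrix as a list of lists: row 1, column 0.
print(data["matrix"][1][0])   # 3

# A table as a dictionary of columns: column names are keys, columns are lists.
table = data["table"]
print(table["age"])           # [3, 5]

# Turning the columns into one dictionary per row.
rows = [dict(zip(table, values)) for values in zip(*table.values())]
print(rows)
```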

![](object.gif)

![](array.gif)

![](value.gif)

```
[
  {
    "id": "8537980078",
    "type": "PushEvent",
    "actor": {
      "id": 43446077,
      "login": "nguadarrama",
      "display_login": "nguadarrama",
      "gravatar_id": "",
      "url": "https://api.github.com/users/nguadarrama",
      "avatar_url": "https://avatars.githubusercontent.com/u/43446077?"
    },
    "repo": {
      "id": 150308400,
      "name": "nguadarrama/sicoa-services",
      "url": "https://api.github.com/repos/nguadarrama/sicoa-services"
    },
    "payload": {
      "push_id": 3021746006,
      "size": 1,
      "distinct_size": 1,
```

In [12]:
gh_json = r.json()

In [13]:
type(gh_json)

list

Arrays are interpreted as lists, and similarly, objects are interpreted as dictionaries.

In [14]:
type(gh_json[0])

dict

So for example, we could get the id of the first entry of the response with,

In [15]:
gh_json[0]['id']

'8530767475'

- there is no standard format for the responses of web APIs
- how you read the data will differ from API to API

### Petfinder API

In [2]:
import requests_cache
import pandas as pd
from matplotlib import pyplot as plt

plt.style.use('ggplot')
requests_cache.install_cache('pet_cache')

In [152]:
key = "12345" #get your own key!

In [4]:
## Specify parameters
params = {'key': key, 'animal': 'dog', 'format': 'json'}

In [5]:
## Specify method URL
breed_url = "http://api.petfinder.com/breed.list"

These are combined to make the request using ``requests.get``.  

In [8]:
req = requests.get(breed_url,params=params)

In [156]:
req.url # look at url

'http://api.petfinder.com/breed.list?animal=dog&format=json&key=12345'
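
To see roughly what happens to the `params` dictionary, the query string can be built by hand with the standard library's `urlencode`. This is a sketch of the idea, not a description of Requests internals; the order of the parameters may differ from the URL shown above, which does not matter to the server.

```python
from urllib.parse import urlencode

base_url = "http://api.petfinder.com/breed.list"
params = {"key": "12345", "animal": "dog", "format": "json"}

query = urlencode(params)          # key=value pairs joined by '&'
full_url = base_url + "?" + query
print(full_url)

# Characters that are not URL-safe are percent-encoded (spaces become '+').
print(urlencode({"breed": "Pit Bull Terrier"}))
```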

In [9]:
js = req.json() # parse the JSON

print(repr(js)[:300])

{'@encoding': 'iso-8859-1', '@version': '1.0', 'petfinder': {'@xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance', 'breeds': {'breed': [{'$t': 'Affenpinscher'}, {'$t': 'Afghan Hound'}, {'$t': 'Airedale Terrier'}, {'$t': 'Akbash'}, {'$t': 'Akita'}, {'$t': 'Alaskan Malamute'}, {'$t': 'American Bu


In [25]:
req.status_code

200

In [27]:
print(js.keys())

dict_keys(['@encoding', '@version', 'petfinder'])


In [28]:
js['petfinder'].keys()

dict_keys(['@xmlns:xsi', 'breeds', 'header', '@xsi:noNamespaceSchemaLocation'])

In [10]:
js['petfinder']['breeds'].keys()

dict_keys(['breed', '@animal'])
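
Calling `.keys()` one level at a time works, but a small helper can print the outline of an unfamiliar response in one go. The function `describe` below is a hypothetical helper written for this illustration, and `sample` is a made-up fragment shaped like the response above (the `'@animal'` value is an assumption).

```python
def describe(obj, depth=2, indent=0):
    """Return lines outlining the keys and value types of nested JSON data."""
    lines = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(" " * indent + f"{key}: {type(value).__name__}")
            if depth > 1:
                lines.extend(describe(value, depth - 1, indent + 2))
    elif isinstance(obj, list) and obj:
        # Only look at the first element; the items in a list usually look alike.
        lines.extend(describe(obj[0], depth, indent))
    return lines


# Made-up fragment shaped like the breed.list response above.
sample = {
    "@version": "1.0",
    "petfinder": {
        "breeds": {
            "breed": [{"$t": "Affenpinscher"}, {"$t": "Afghan Hound"}],
            "@animal": "dog",
        }
    },
}

outline = describe(sample, depth=3)
print("\n".join(outline))
```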

In [12]:
## Extract dog breeds
breeds = [b['$t'] for b in js['petfinder']['breeds']['breed']]

In [14]:
len(breeds)

257

In [13]:
print(", ".join(breeds[:30]))

Affenpinscher, Afghan Hound, Airedale Terrier, Akbash, Akita, Alaskan Malamute, American Bulldog, American Eskimo Dog, American Foxhound, American Hairless Terrier, American Staffordshire Terrier, American Water Spaniel, Anatolian Shepherd, Appenzell Mountain Dog, Australian Cattle Dog / Blue Heeler, Australian Kelpie, Australian Shepherd, Australian Terrier, Basenji, Basset Hound, Beagle, Bearded Collie, Beauceron, Bedlington Terrier, Belgian Shepherd / Laekenois, Belgian Shepherd / Malinois, Belgian Shepherd / Sheepdog, Belgian Shepherd / Tervuren, Bernese Mountain Dog, Bichon Frise


The Petfinder API also includes a ``pet.getRandom`` method, which returns a randomly selected pet matching the given criteria (below, any dog).

In [16]:
pet_url_ex = "http://api.petfinder.com/pet.getRandom"
randomparms = {'key':key,'animal':'dog','format':'json','output':'basic'}
randreq = requests.get(pet_url_ex,params = randomparms)
js = randreq.json()

In [18]:
print(repr(js['petfinder'])[:700])

{'pet': {'options': {'option': [{'$t': 'altered'}, {'$t': 'hasShots'}, {'$t': 'housetrained'}]}, 'status': {'$t': 'A'}, 'contact': {'phone': {'$t': '210-535-5480'}, 'state': {'$t': 'TX'}, 'address2': {'$t': 'P.O. Box 743'}, 'email': {'$t': 'teresakopacki@gmail.com'}, 'city': {'$t': 'Lytle'}, 'zip': {'$t': '78052'}, 'fax': {}, 'address1': {'$t': '17971 W. FM 2790 S.'}}, 'age': {'$t': 'Adult'}, 'size': {'$t': 'S'}, 'media': {'photos': {'photo': [{'@size': 'pnt', '$t': 'http://photos.petfinder.com/photos/pets/40700638/1/?bust=1516246159&width=60&-pnt.jpg', '@id': '1'}, {'@size': 'fpm', '$t': 'http://photos.petfinder.com/photos/pets/40700638/1/?bust=1516246159&width=95&-fpm.jpg', '@id': '1'}, {'


In [24]:
def print_pet(js):
    """
    Input: Petfinder JSON object from getRandom method
    Output: String describing the pet
    """
    pet = js['petfinder']['pet']
    breed_obj = pet['breeds']['breed']
    if isinstance(breed_obj, list):
        # several breeds are listed for mixes
        breeds = [a['$t'] for a in breed_obj]
        breed = ", ".join(breeds) + " mix"
    else:
        breed = breed_obj['$t']
    name = pet['name']['$t']
    # some pets have no description, so fall back to an empty string
    desc = pet.get('description', {}).get('$t', '')
    return "{} is a {}. {}".format(name, breed, desc)

In [40]:
print(print_pet(js))

Murray is a Rat Terrier. Meet MURRAY!!

Murray is an approximately 2 year old, 18-20 pound, neutered, male terrier mix breed. Possibly a Rat Terrier blend.

Murray was originally found as a stray and was in a semi-rural outdoor shelter near Lytle. The ACO reached out to us and we happily brought Murray into our program.

Murray is a gentlemanly little guy. He is very social, but generally minds his own business, and likes to be active. Murray is a chunky guy and needs an active person to help get him fit. He does well with the other dogs in his foster home, but does not like other pushy or bossy males picking on him. He has not shown alot of interest in cats, but as a typical terrier may give chase. He is crate trained and does his business outside when kept on a consistent schedule. Murray is loyal and loves his people.

Murray is heartworm negative, up to date on vaccinations, is current on flea and heartworm preventive, and is micro-chipped. He comes with a health certificate if tra

In [20]:
image_recs = js[u'petfinder'][u'pet'][u'media'][u'photos'][u'photo']
image_recs[0]

{'@size': 'pnt',
 '$t': 'http://photos.petfinder.com/photos/pets/40700638/1/?bust=1516246159&width=60&-pnt.jpg',
 '@id': '1'}

In [21]:
from IPython.display import HTML

In [22]:
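def display_pet(js):
    """
    Input: petfinder json object
    Output: html string with image (empty string if there is no photo)
    """
    try:
        image_recs = js['petfinder']['pet']['media']['photos']['photo']
    except KeyError:
        return ""
    if isinstance(image_recs, dict):
        # assumed: a lone photo may arrive as a dict, as a lone breed does
        image_recs = [image_recs]
    image_uri = None
    for rec in image_recs:
        image_uri = rec['$t']
        # prefer the size labeled 'x'; otherwise keep the last one seen
        if rec['@size'] == 'x':
            break
    if image_uri is None:
        return ""
    return "<center><img src='{}'></center>".format(image_uri)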

In [25]:
HTML(display_pet(js) + "<pre>" + print_pet(js) + "</pre>") 

Let's streamline this process with the following function.

In [26]:
def random_dog(key):
    """
    Input: api key
    Output: parsed JSON (a dict) for one randomly chosen dog
    """
    pet_url_ex = "http://api.petfinder.com/pet.getRandom"
    randomparms = {'key':key,'animal':'dog','format':'json','output':'basic'}
    randreq = requests.get(pet_url_ex,params = randomparms)
    js = randreq.json()
    return js

In [27]:
with requests_cache.disabled():
    js = random_dog(key)
HTML(display_pet(js) + "<pre>" + print_pet(js) + "</pre>") 

Now we are ready to get a random sample of the dataset.  This should only be done sparingly because it makes many requests to the webserver in a short period of time.  Many APIs, including the Petfinder API, have rate limits that restrict the number of queries a single key can make.

In [30]:
## Make 500 requests to the petfinder API of random dogs
samp_size = 500

with requests_cache.disabled():
    dog_data = [random_dog(key) for s in range(samp_size)]

In [31]:
def extract_breeds(pet):
    """Extract the breed information for petfinder json"""
    try:
        breed_obj = pet['breeds']['breed']
        if isinstance(breed_obj, list):
            breeds = [a['$t'] for a in breed_obj]
        else:
            breeds = [breed_obj['$t']]
        return breeds
    except KeyError:
        return None

In [32]:
def catch_missing(var_dict,key):
    """Catch missingness in a variable and return None"""
    try:
        var = var_dict[key]['$t']
        return var
    except KeyError:
        return None
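
The same idea can be written without `try`/`except` by leaning on `dict.get`. The helper name `get_text` and the sample `contact` record are hypothetical, modeled on the contact block in the response printed earlier; this is a sketch of an alternative, not a replacement for `catch_missing`.

```python
def get_text(record, key, default=None):
    """Return record[key]['$t'], or `default` if the key or the '$t' leaf is absent."""
    node = record.get(key)
    if isinstance(node, dict):
        return node.get("$t", default)
    return default


# Sample shaped like the 'contact' block above: 'fax' is present but empty.
contact = {"state": {"$t": "TX"}, "city": {"$t": "Lytle"}, "fax": {}}

print(get_text(contact, "state"))  # TX
print(get_text(contact, "fax"))    # None
print(get_text(contact, "zip"))    # None
```

Unlike the `try`/`except` version, this one also returns the default when the value stored under `key` is not a dictionary at all.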

In [33]:
def extract_pet_vars(js):
    """Extract the desired variables from petfinder json"""
    pet = js['petfinder']['pet']
    pet_breeds = extract_breeds(pet)
    pet_cont = pet['contact']
    pet_state = catch_missing(pet_cont,'state')
    pet_age = catch_missing(pet,'age')
    pet_size = catch_missing(pet,'size')
    # guard the conversion: int(None) would raise a TypeError if the id were missing
    pet_id = catch_missing(pet,'id')
    pet_id = int(pet_id) if pet_id is not None else None
    pet_desc = catch_missing(pet,'description')
    pet_shelterId = catch_missing(pet,'shelterId')
    return {'breeds':pet_breeds, 'state':pet_state, 'age':pet_age, 'size':pet_size, 'id':pet_id, 
           'desc': pet_desc, 'shelter_id': pet_shelterId}

In [34]:
extract_pet_vars(js)

{'breeds': ['Great Pyrenees'],
 'state': 'NC',
 'age': 'Baby',
 'size': 'L',
 'id': 43203450,
 'desc': 'This is Snowman he is a 10 month old Great  Pyrenees that now needs a new home through no fault of his. He is house trained current on shots and still does dumb puppy things. Snowman needs a home that will continue his training and be able to provide lots of hugs and belly rubs.',
 'shelter_id': 'NC820'}

In [35]:
dog_df = pd.DataFrame(extract_pet_vars(js) for js in dog_data)

dog_df = dog_df.set_index('id')

In [36]:
dog_df.describe()

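|        | age   | breeds             | desc     | shelter_id | size | state |
|--------|-------|--------------------|----------|------------|------|-------|
| count  | 500   | 500                | 447      | 500        | 500  | 500   |
| unique | 4     | 215                | 438      | 440        | 4    | 49    |
| top    | Adult | [Pit Bull Terrier] | No Notes | IL192      | M    | TX    |
| freq   | 236   | 51                 | 4        | 4          | 245  | 74    |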


In [37]:
dog_df.groupby('age').describe()

| age | breeds count | breeds unique | breeds top | breeds freq | desc count | desc unique | desc top | desc freq | shelter_id count | shelter_id unique | shelter_id top | shelter_id freq | size count | size unique | size top | size freq | state count | state unique | state top | state freq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Adult | 236 | 113 | [Pit Bull Terrier] | 28 | 213 | 208 | No Notes | 4 | 236 | 223 | TN75 | 2 | 236 | 4 | M | 100 | 236 | 43 | TX | 31 |
| Baby | 80 | 60 | [Labrador Retriever] | 8 | 73 | 73 | Meet Cosmo! Cosmo is a 6 month old Beagle/Houn... | 1 | 80 | 77 | MN289 | 2 | 80 | 4 | M | 47 | 80 | 31 | TX | 10 |
| Senior | 48 | 32 | [Chihuahua] | 9 | 45 | 45 | HIGHLY ADOPTABLE!!! ROSIE is a Jack Russell Te... | 1 | 48 | 45 | NV200 | 2 | 48 | 4 | S | 22 | 48 | 23 | CA | 7 |
| Young | 136 | 79 | [Pit Bull Terrier] | 15 | 116 | 114 | Craigie D Boss came to us from a kill shelter ... | 3 | 136 | 129 | IN434 | 3 | 136 | 3 | M | 84 | 136 | 31 | TX | 27 |


In [109]:
from collections import Counter
from itertools import chain

# flatten the per-dog breed lists and count each breed
all_breeds = Counter(chain.from_iterable(dog_df['breeds']))

In [113]:
all_breeds.most_common()[:5]

[('Labrador Retriever', 89),
 ('Pit Bull Terrier', 76),
 ('Mixed Breed', 67),
 ('Chihuahua', 37),
 ('Shepherd', 26)]

In [116]:
max(len(br) for br in dog_df['breeds'].values)

2

In [117]:
## Encode breeds into two vars

dog_df['breed1'] = [br[0] for br in dog_df['breeds'].values]
dog_df['breed2'] = [br[-1] for br in dog_df['breeds'].values]

In [126]:
lab_dogs = dog_df.query('breed1 == "{0}" or breed2 == "{0}"'.format("Labrador Retriever"))
lab_dog = lab_dogs.iloc[0,:]

In [127]:
lab_dog

age                                                       Adult
breeds                                     [Labrador Retriever]
desc          This is the Animal Description Header\nRover i...
shelter_id                                                CA387
size                                                          L
state                                                        CA
breed1                                       Labrador Retriever
breed2                                       Labrador Retriever
Name: 42874092, dtype: object

In [129]:
print(lab_dog['desc'])

This is the Animal Description Header
Rover is all dog, hence the name! He loves people and is very confident and playful in his surroundings. He will need some training since he can be mouthy in play and is jumpy with excitement when meeting new people. Rover is a hunting stock kind of Lab mix, lanky and energetic - he needs space and adventures!; This is a high energy boy who will need and love lots of exercise. He will make an excellent dog for outdoorsy folks who have the time and sense of adventure to go on long hikes, to do some training and recall work - he'll likely be really good and he does want to please - but he needs an outlet for his enthusiasm!This is the Animal Description Footer
10/23/18 2:49 AM


In [132]:
dog_df.groupby('state').count().sort_values('age',ascending=False).iloc[:5,0]

state
TX    75
CA    55
FL    30
GA    28
OH    19
Name: age, dtype: int64

In [136]:
pitt_dog = dog_df.query('state == "TX" and breed1 == "Pit Bull Terrier"').iloc[0,:]

In [137]:
pitt_dog

age                                                       Adult
breeds                                       [Pit Bull Terrier]
desc          I'm a petite little lady who loves to give kis...
shelter_id                                                TX198
size                                                          M
state                                                        TX
breed1                                         Pit Bull Terrier
breed2                                         Pit Bull Terrier
Name: 41596889, dtype: object

In [139]:
pitt_dog['desc']

"I'm a petite little lady who loves to give kisses! I'm sweet and enjoy lots of belly rubs, and I'm also a pretty amazing athlete! I like playing with toys, going for runs, and sleeping next you once I'm all tuckered out. I would prefer to have all your attention to myself, so I'd do best an only pet - I'm such a good girl, I'm the only companion you need! Primary Color: Black Secondary Color: White Weight: 43.4lbs Age: 3yrs 11mths 1wks Animal has been Spayed"