# Data Acquisition Lab

This lab is divided into short sections, one for each section of theory.

## Accessing Unprotected Web pages

In [1]:
# import the Python requests library so that you can use it in your program
import requests

In [3]:
# Go to the Australian Bureau of Meteorology website and work out which page corresponds to the
# Sydney weather forecast. Store that in a variable here
url = 'http://www.bom.gov.au/nsw/forecasts/sydney.shtml'

In [4]:
# Use the requests.get() method to fetch that page
weather = requests.get(url)

In [6]:
# Did that succeed? What was the .status_code?
weather.status_code

200

In [10]:
# What was the .text or .content of that page? Save it in a variable, because we will be using it
# a little later
weather.text



In [11]:
len(weather.text)

31217

## Accessing forms

The pandas library already has a module for getting information from the Yahoo Finance pages, 
so you are unlikely to use the following code in any normal environment. But it's an example of
a simple web API

In [26]:
# There is a stock price lookup form on https://au.finance.yahoo.com (it says Enter Symbol)
# Inspect that element, and identify:
# - The <INPUT> tag with the name "s"
# - The <INPUT> tag with the name "ql" (which has a type of "hidden")
# - The <FORM> tag surrounding them with the action of "/q" and the method of GET
#
# Create a dictionary with appropriate keys to provide values for the input tags.
# Create a variable with the full URL to submit to
params = {'s' : 'IBM', 
          'ql' : '0'}
answer = requests.get('https://au.finance.yahoo.com/q', data = params)

In [27]:
# Use requests.get to retrieve that page
#answer.status_code
answer.text

u'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\n<html lang="en-AU">\n<head>\n<title>Quotes &amp; Info- Yahoo!7 Finance</title>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8"><meta name="description" xml:space="default" content="Get the latest index performance and chart outlook for  ()."><meta name="keywords" content="index, market indices, dow jones industrial average, S&amp;P 500, Nasdaq 100, industrial average, transportation average asx"><script>\n          window.yfinBucket = \'\';\n        </script>\n<link rel="stylesheet" href="https://s.yimg.com/zz/combo?kx/yucs/uh2/uh/295/css/yunivhead-min.css&amp;kx/yucs/uh2/uh/295/css/logo-min.css&amp;kx/ucs/avatar/css/17/avatar-min.css&amp;kx/yucs/uh2/mail-link/85/css/mailcount_v2-min.css&amp;kx/yucs/uh2/mail-link/85/css/mail_preview-min.css&amp;kx/ucs/search/css/190/search_all-min.css&amp;kx/ucs/search/css/190/search_buttons-min.css&amp;kx/yucs/uh2/uh/295/css/yunivhead_http

In [28]:
open('yahoofinance-dump.html', 'w').write(answer.content)

## Secured pages

The username for files under http://www.ifost.org.au/ga/protected is "ga" and the password is "s3cr3t"

In this section we will fetch a file from a website that requires authentication.

In [36]:
# What happens if you use the requests library to fetch http://www.ifost.org.au/ga/protected/data.json 
# without supplying a password? What is the .status_code?
r = requests.get('http://kemek.ifost.org.au/ga/protected/data.json')
r.status_code

401

In [37]:
# Try again, but this time supplying a username and password
r = requests.get('http://kemek.ifost.org.au/ga/protected/data.json', auth=('ga', 's3cr3t'))
r.status_code

200

## Parsing HTML

In this section we will find the prediction for tomorrow's weather.

In [38]:
# import BeautifulSoup library (version 4)
from bs4 import BeautifulSoup

In [39]:
# Create a variable called "soup" with the result of parsing the Bureau of Meteorology prediction for
# Sydney that you captured at the start of this notebook.
soup = BeautifulSoup(weather.content, 'lxml')

In [41]:
def has_the_word_tuesday(x):
    return 'Tuesday' in x

# Find the first element in "soup" which has the word Tuesday in it
# You might find the function "has_the_word_tuesday" helpful
first_elmt = soup.find(string = has_the_word_tuesday)
first_elmt

u'Tuesday 28 June'

In [55]:
# The weather prediction is obviously going to be in a <DIV> that includes it
# Display the parent of the element you found in the previous cell. You might
# find the .prettify() method makes it easier to display
div = first_elmt.parent.parent
print div.prettify()

<div class="day main">
 <h2>
  Tuesday 28 June
 </h2>
 <div class="forecast">
  <dl>
   <dt>
    Summary
   </dt>
   <dd class="image">
    <img alt="" height="42" src="/images/symbols/large/partly-cloudy.png" width="45"/>
   </dd>
   <dd>
    Min
    <em class="min">
     10
    </em>
   </dd>
   <dd>
    Max
    <em class="max">
     18
    </em>
   </dd>
   <dd class="summary">
    Cloud clearing.
   </dd>
   <dd class="rain">
    Possible rainfall:
    <em class="rain">
     0 mm
    </em>
   </dd>
   <dd class="rain">
    Chance of any rain:
    <em class="pop">
     10%
     <img alt="" height="10" src="/images/ui/weather/rain_10.gif" width="69"/>
    </em>
   </dd>
  </dl>
  <h3>
   Sydney area
  </h3>
  <p>
   Sunny. Winds southwesterly 20 to 30 km/h becoming light in the late afternoon.
  </p>
 </div>
 <p class="alert">
  No UV Alert, UV Index predicted to reach 2 [Low]
 </p>
</div>



In [50]:
# Can you find a <DD> element with a CSS class "summary"? (Use the parameter class_ in BeautifulSoup)
div.find('dd', class_='summary')

<dd class="summary">Cloud clearing.</dd>

In [51]:
# Display the "string" attribute of this summary element. Do you need to bring an umbrella?
div.find('dd', class_='summary').string

u'Cloud clearing.'

## JSON APIs

Many websites display their information in JSON format. In this section we will interact
with the Pokemon database http://pokeapi.co/

In [60]:
# Look up their documentation. What is the base URL for querying a Pokemon? What URL
# would you use to look up the Pokemon called "Groudon"? Store it in a variable
url = 'http://pokeapi.co/api/v2/pokemon/groudon'

In [61]:
# Use the requests library to fetch the Groudon data
groudon = requests.get(url)

In [62]:
# Check the status code to make sure that it worked
groudon.status_code

200

In [64]:
# Is the content of the response in JSON format? Use the requests library function
# to decode it from JSON format into a Python dictionary
g_json = groudon.json()

In [65]:
# What are the keys of this python dictionary?
g_json.keys()

[u'is_default',
 u'abilities',
 u'stats',
 u'name',
 u'weight',
 u'held_items',
 u'location_area_encounters',
 u'height',
 u'forms',
 u'base_experience',
 u'id',
 u'game_indices',
 u'species',
 u'moves',
 u'order',
 u'sprites',
 u'types']

In [68]:
# Is "weight" listed there? If so, then the value in it should be a number
# If you play Pokemon, does this number look reasonable?
g_json['weight']

9500