# Data Acquisition Lab

This lab is divided into short sections, one for each section of theory.

## Accessing unprotected web pages

In [1]:
# import the Python requests library so that you can use it in your program
import requests

In [2]:
# Go to the Australian Bureau of Meteorology website and work out which page corresponds to the
# Sydney weather forecast. Store that in a variable here
url = "http://m.smh.com.au"

In [3]:
# Use the requests.get() method to fetch that page
r = requests.get(url)

In [4]:
# Did that succeed? What was the .status_code?
r.status_code

200
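
A status code is easier to read once it is paired with its standard reason phrase. The following is a minimal sketch in Python 3 using only the standard library; `describe_status` is a hypothetical helper, not part of requests. With a real response object, `r.ok` and `r.raise_for_status()` give a similar success check.

```python
from http import HTTPStatus

def describe_status(code):
    """Pair a numeric HTTP status code with its standard reason phrase."""
    status = HTTPStatus(code)  # raises ValueError for codes the module does not know
    outcome = "success" if 200 <= code < 300 else "not successful"
    return "{} {} ({})".format(code, status.phrase, outcome)

for code in (200, 401, 404):
    print(describe_status(code))
```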

In [5]:
# What was the .text or .content of that page? Save it in a variable, because we will be using it
# a little later
'election' in r.text

True

In [11]:
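'avocado' in r.text
# Query a trip-planner form: origin, destination and departure time go in the query string
params = {'name_origin': 'Central Station', 'name_destination': 'Wynyard Station', 'itdHour': 20, 'itdMinute': 0, 'itdDay': 'today'}
answer = requests.get('http://transportnsw.info/en/index.page', params=params)
answer.status_code
answer.text
# Save the raw bytes so that the page can be inspected offline
with open('transport-dump.html', 'wb') as f:
    f.write(answer.content)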
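
In `requests.get`, a dictionary passed as `params=` is encoded into the URL's query string, whereas `data=` places it in the request body, which most servers ignore on a GET. The sketch below uses the standard library's `urllib.parse.urlencode` (not requests itself) to show roughly what that query string looks like, so it runs without network access. Only a subset of the form fields is used, and `full_url` is an illustrative name.

```python
from urllib.parse import urlencode

base_url = "http://transportnsw.info/en/index.page"
fields = {"name_origin": "Central Station", "itdHour": 20, "itdMinute": 0}

# urlencode joins key=value pairs with "&" and escapes spaces as "+"
query = urlencode(fields)
full_url = base_url + "?" + query
print(full_url)
```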

## Accessing forms

The pandas library already has a module for getting information from the Yahoo Finance pages,
so you are unlikely to use the following code in any normal environment. It is, however, an example of
a simple web API.

In [45]:
# There is a stock price lookup form on https://au.finance.yahoo.com (it says Enter Symbol)
# Inspect that element, and identify:
# - The <INPUT> tag with the name "s"
# - The <INPUT> tag with the name "ql" (which has a type of "hidden")
# - The <FORM> tag surrounding them with the action of "/q" and the method of GET
#
# Create a dictionary with appropriate keys to provide values for the input tags.
# Create a variable with the full URL to submit to
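url = "https://au.finance.yahoo.com/q"
params = {'s': "ibm", 'ql': "1"}

# The form uses GET, so its fields belong in the query string: pass them as params=, not data=
r = requests.get(url, params=params)
r.text.index("time_rtq_ticker")
# The price follows the marker; this offset is specific to the page as it was fetched
r.text[24559:24565]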

u'146.59'
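
A fixed character offset stops working as soon as the page changes length, so it can help to search relative to the marker instead. The following is a minimal sketch: `sample_html` is a hypothetical fragment standing in for `r.text` (the real Yahoo markup may differ), and `price_after_marker` is an illustrative helper.

```python
import re

# Hypothetical fragment standing in for r.text; the real page markup may differ.
sample_html = ('<div><span class="time_rtq_ticker">'
               '<span id="yfs_l84_ibm">146.59</span></span></div>')

def price_after_marker(html, marker="time_rtq_ticker"):
    """Return the first number found as element text after `marker`, or None."""
    start = html.find(marker)
    if start == -1:
        return None
    # Look for digits sitting between a closing ">" and the next "<"
    match = re.search(r">\s*(\d+(?:\.\d+)?)\s*<", html[start:])
    return float(match.group(1)) if match else None

price = price_after_marker(sample_html)
print(price)
```

An HTML parser is usually more robust than a regular expression for this kind of extraction; the regex here only keeps the sketch short.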

In [36]:
# Use requests.get to retrieve that page

## Secured pages

The username for files under http://www.ifost.org.au/ga/protected is "ga" and the password is "s3cr3t"

In this section we will fetch a file from a website that requires authentication.

In [65]:
# What happens if you use the requests library to fetch http://www.ifost.org.au/ga/protected/data.json 
# without supplying a password? What is the .status_code?
url = "http://kemek.ifost.org.au/ga/protected/data.json"
r = requests.get(url)
r.status_code

401

In [75]:
# Try again, but this time supplying a username and password through the auth= parameter
r = requests.get(url, auth=('ga', 's3cr3t'))
r.status_code

200
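
A `(username, password)` tuple passed to `auth=` selects HTTP Basic authentication in requests. Assuming the server uses that scheme, the sketch below shows how the `Authorization` header value is built for ASCII credentials; `basic_auth_header` is a hypothetical helper written with the standard library only.

```python
import base64

def basic_auth_header(username, password):
    """Build the value of an HTTP Basic Authorization header."""
    credentials = "{}:{}".format(username, password).encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")

header = basic_auth_header("ga", "s3cr3t")
print(header)

# Base64 is an encoding, not encryption: anyone who sees the header can reverse it
print(base64.b64decode(header.split(" ", 1)[1]))
```

Because the credentials are only encoded, Basic authentication over plain `http://` exposes the password to anyone who can observe the traffic.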

In [76]:
r.text

u'{\n "result": "success",\n "message": "you have accessed data from a protected page"\n}\n'

## Parsing HTML

In this section we will find the prediction for tomorrow's weather.

In [6]:
# import BeautifulSoup library (version 4)
import requests
from bs4 import BeautifulSoup
import lxml
url = "http://www.bom.gov.au/nsw/forecasts/sydney.shtml"
r = requests.get(url)

def day_finder(x):
    # Match only strings that mention Tuesday, guarding against None
    return x is not None and 'Tuesday' in x

In [7]:
# Create a variable called "soup" with the result of parsing the Bureau of Meteorology prediction for
# Sydney that you captured at the start of this notebook.
soup = BeautifulSoup(r.content, 'lxml')

In [9]:
Finder = soup.find_all(string=day_finder)

In [54]:
print(Finder[0].string)

print(Finder[0].parent.parent)

Tuesday 28 June
<div class="day main">
<h2>Tuesday 28 June</h2>
<div class="forecast">
<dl>
<dt>Summary</dt>
<dd class="image">
<img alt="" height="42" src="/images/symbols/large/partly-cloudy.png" width="45"/>
</dd>
<dd>Min <em class="min">10</em></dd>
<dd>Max <em class="max">18</em></dd>
<dd class="summary">Cloud clearing.</dd>
<dd class="rain">Possible rainfall: <em class="rain">0 mm</em></dd>
<dd class="rain">Chance of any rain: <em class="pop">10%
					<img alt="" height="10" src="/images/ui/weather/rain_10.gif" width="69"/></em></dd>
</dl>
<h3>Sydney area</h3>
<p>Sunny. Winds southwesterly 20 to 30 km/h becoming light in the late afternoon.</p>
</div>
<p class="alert">No UV Alert, UV Index predicted to reach 2 [Low]</p>
</div>


In [61]:
raw = Finder[0].parent.parent
raw

<div class="day main">\n<h2>Tuesday 28 June</h2>\n<div class="forecast">\n<dl>\n<dt>Summary</dt>\n<dd class="image">\n<img alt="" height="42" src="/images/symbols/large/partly-cloudy.png" width="45"/>\n</dd>\n<dd>Min <em class="min">10</em></dd>\n<dd>Max <em class="max">18</em></dd>\n<dd class="summary">Cloud clearing.</dd>\n<dd class="rain">Possible rainfall: <em class="rain">0 mm</em></dd>\n<dd class="rain">Chance of any rain: <em class="pop">10%\n\t\t\t\t\t<img alt="" height="10" src="/images/ui/weather/rain_10.gif" width="69"/></em></dd>\n</dl>\n<h3>Sydney area</h3>\n<p>Sunny. Winds southwesterly 20 to 30 km/h becoming light in the late afternoon.</p>\n</div>\n<p class="alert">No UV Alert, UV Index predicted to reach 2 [Low]</p>\n</div>

In [63]:
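# raw is already a BeautifulSoup Tag, so call .prettify() on it directly.
# Passing the Tag back into BeautifulSoup() is what raised
# "TypeError: 'NoneType' object is not callable", and it also overwrote soup.
print(raw.prettify())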

In [None]:
def has_the_word_tuesday(x):
    return 'Tuesday' in x

# Find the first element in "soup" which has the word Tuesday in it
# You might find the function "has_the_word_tuesday" helpful

In [36]:
# The weather prediction is obviously going to be in a <DIV> that includes it
# Display the parent of the element you found in the previous cell. You might
# find the .prettify() method makes it easier to display
# Finder[0].parent.parent is a BeautifulSoup Tag, so its own .prettify() method applies
print(Finder[0].parent.parent.prettify())

In [None]:
# Can you find a <DD> element with a CSS class "summary"? (Use the parameter class_ in BeautifulSoup)

In [None]:
# Display the "string" attribute of this summary element. Do you need to bring an umbrella?
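
For comparison, the same lookup can be sketched with the standard library's `html.parser` instead of BeautifulSoup, so that it runs with no third-party packages (Python 3). `SummaryFinder` is a hypothetical class, and `forecast_html` is an excerpt of the forecast markup printed earlier in this section. With BeautifulSoup the equivalent lookup would be along the lines of `soup.find('dd', class_='summary')`.

```python
from html.parser import HTMLParser

# Excerpt of the forecast markup printed earlier in this section
forecast_html = """
<dl>
<dd>Min <em class="min">10</em></dd>
<dd>Max <em class="max">18</em></dd>
<dd class="summary">Cloud clearing.</dd>
<dd class="rain">Possible rainfall: <em class="rain">0 mm</em></dd>
</dl>
"""

class SummaryFinder(HTMLParser):
    """Collect the text of every <dd> whose class attribute is 'summary'."""

    def __init__(self):
        super().__init__()
        self.in_summary = False
        self.summaries = []

    def handle_starttag(self, tag, attrs):
        if tag == "dd" and dict(attrs).get("class") == "summary":
            self.in_summary = True
            self.summaries.append("")

    def handle_endtag(self, tag):
        if tag == "dd":
            self.in_summary = False

    def handle_data(self, data):
        # Only text inside a matching <dd> is kept
        if self.in_summary:
            self.summaries[-1] += data

finder = SummaryFinder()
finder.feed(forecast_html)
print(finder.summaries)
```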

## JSON APIs

Many websites expose their data in JSON format. In this section we will interact
with the Pokemon database at http://pokeapi.co/.

In [None]:
# Look up their documentation. What is the base URL for querying a Pokemon? What URL
# would you use to look up the Pokemon called "Groudon"? Store it in a variable


In [None]:
# Use the requests library to fetch the Groudon data

In [None]:
# Check the status code to make sure that it worked

In [None]:
# Is the content of the response in JSON format? Use the requests library function
# to decode it from JSON format into a Python dictionary

In [None]:
# What are the keys of this Python dictionary?

In [None]:
# Is "weight" listed there? If so, then the value in it should be a number
# If you play Pokemon, does this number look reasonable?
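
No PokeAPI response is reproduced in this notebook, so the sketch below reuses the JSON text returned in the Secured pages section to illustrate the same steps: decoding, listing the keys, and checking whether a key holds a number. `has_numeric_value` is a hypothetical helper. With a requests response, `r.json()` performs the decoding step.

```python
import json

# JSON text copied from the response shown in the "Secured pages" section
text = '{\n "result": "success",\n "message": "you have accessed data from a protected page"\n}\n'

data = json.loads(text)  # a JSON object decodes to a Python dict
print(sorted(data.keys()))

def has_numeric_value(mapping, key):
    """True if `key` is present and its value is an int or float (but not a bool)."""
    value = mapping.get(key)
    return isinstance(value, (int, float)) and not isinstance(value, bool)

print(has_numeric_value(data, "weight"))
```

The same check applied to a decoded Pokemon record would show whether its "weight" entry is present and numeric.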