# Retrieving data from the web, Part 2


![image.png](attachment:image.png)

## 5.1 Programming exercise
Create a new Jupyter Notebook on your system.

Try downloading some web pages with a Python program and extracting information from them. Look at the page in your web browser and use the inspector (right-click, then View Source or Inspect) to locate areas of interest. You could stick with beautifulsoup4 https://pypi.org/project/beautifulsoup4/ or try out a new library like pyquery https://pythonhosted.org/pyquery/. You will likely have to engage with the documentation a bit to find out how to search the HTML for more specific things, for example, tags with particular class attributes. We give you some examples, but HTML is made up of all manner of things and each page has a different set of semantic components.
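
As a starting point for searching by class attribute, here is a minimal sketch using beautifulsoup4 on an inline HTML snippet. The markup, the class names (`story`, `headline`, `teaser`) and the link paths are invented for illustration; a real page will need its own selectors, found with the browser inspector.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page
# (in practice this would be something like requests.get(url).text).
SAMPLE_HTML = """
<html><body>
  <div class="story">
    <h2 class="headline"><a href="/news/1">First story</a></h2>
    <p class="teaser">A short summary.</p>
  </div>
  <div class="story">
    <h2 class="headline"><a href="/news/2">Second story</a></h2>
    <p class="teaser">Another summary.</p>
  </div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all with class_ matches tags whose class attribute contains the name.
headlines = [h.get_text(strip=True) for h in soup.find_all("h2", class_="headline")]

# select takes a CSS selector, which is handy for nested structures.
links = [a["href"] for a in soup.select("div.story h2.headline a")]

print(headlines)
print(links)
```

`find_all` suits simple tag-plus-class lookups, while `select` is often easier to read once the path to an element involves several levels of nesting.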

Can you scrape an interesting bit of data from the web and share it with us on the forums?

If any of the links are broken, let us know via the Student Portal. 

## 5.11 Revision quiz – HTML and HTTP
Practice Quiz • 15 MIN

**Question 1**
HTTP is the protocol used for describing the markup of web pages.
* True
* `False`

`HTML describes the markup of web pages.
HTTP is the Hypertext Transfer Protocol, the mechanism by which data is transmitted via the web.`

**Question 2**
Hypertext Markup Language describes the semantic elements of a webpage using a DOM (Document Object Model) structure.
* `True`
* False

## 5.12 Scraping, APIs and libraries
Read through the documentation https://pygithub.readthedocs.io/en/latest/introduction.html for PyGithub. This will give you a sense of what a library for accessing an API looks like in Python. There are quite a few of these!


## 5.13 Web APIs
See if you can find some web APIs, e.g.

https://jsonapi.org/examples/

https://www.programmableweb.com/

World Bank: https://data.worldbank.org/

TV Maze: http://www.tvmaze.com/api

or some JSON-based APIs, e.g.

Ofcom radio frequency spectrum allocations

http://static.ofcom.org.uk/static/spectrum/data/spectrumMapping.json

UK police forces

https://data.police.uk/api/forces

Exchange rates

https://api.exchangerate-api.com/v4/latest/GBP

Current location of the International Space Station

http://api.open-notify.org/iss-now.json

Who is in space right now

http://api.open-notify.org/astros.json

Meteorites that have hit the Earth

https://data.nasa.gov/resource/y77d-th95.json

Earthquakes happening now

https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_hour.geojson

Nobel prize winners

http://api.nobelprize.org/v1/prize.json

London Crime

https://data.london.gov.uk/dataset/recorded_crime_summary

Can you find one that relates to your own research interest(s)?

Post a link to the developer documentation for the API that you have found on the discussion forum. Describe in a sentence or two the utility of the API for a given set of tasks. Can you show an example of your code working in practice?
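
For a JSON-based API, a working example can be very short. The sketch below uses only the standard library and assumes the UK police forces endpoint listed above returns a JSON array of objects with `id` and `name` fields; the helper names `fetch_json` and `force_names` and the sample records are hypothetical, so check the API's own documentation for the actual response shape.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=10):
    """Download a URL and decode the response body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def force_names(forces):
    """Return the 'name' field of each record in a list of dicts."""
    return [force["name"] for force in forces]


# Placeholder records in the assumed response shape, so the parsing step
# can be tried without a network connection. These are not real API data.
SAMPLE_BODY = (
    '[{"id": "force-a", "name": "Example Force A"},'
    ' {"id": "force-b", "name": "Example Force B"}]'
)

names = force_names(json.loads(SAMPLE_BODY))
print(names)

# With network access, the live call would look like this:
# names = force_names(fetch_json("https://data.police.uk/api/forces"))
```

Keeping the download and the parsing in separate functions makes it easy to test the parsing on saved responses before pointing the code at the live service.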


Participation is optional

## Open Data

According to the World Bank, open data has many benefits. Among them are:
* Transparency. It supports `monitoring governments` and helps `reduce corruption` by enabling greater transparency.
<br>
* Innovation and Economic Value. It is a key resource for social innovation and economic growth, providing opportunities for governments to collaborate with citizens. `Businesses` can use Open Data to better `understand potential markets` and build new `data-driven` products.
<br>
* Efficiency. It provides `easier and less costly access` to data held by government and other ministries, reducing acquisition costs, redundancy and overhead. Open Data can also empower citizens with the ability to `alert governments to gaps` in public datasets and to provide `more accurate information`.
<br>
* Public Service Improvement. Open Data gives citizens the raw materials they need to engage their governments and contribute to the `improvement of public services`. For instance, citizens can use Open Data to `contribute to public planning`, or provide feedback to government ministries on service quality.


To improve our lives and other people's lives in Singapore, it is important to get data related to our country. The Singapore government has provided many open datasets for us to use. By providing these datasets, our government gains the benefits of open data described by the World Bank.

We can possibly find many interesting patterns that we have not seen before (i.e., `getting insights from data`). For instance:
* Is the weather in Singapore (i.e., hot and humid) correlated with dengue?
* Which months have the highest number of dengue cases?
* Is the price of HDB flats correlated with the price of COE? E.g., when the price of HDB flats is lower, is the price of COE higher?
* Is the location of money changers correlated with the location of MRT stations?

If we find patterns or hidden insights that are important, we contribute knowledge for the good of society.
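
Questions of the form "is X correlated with Y?" can be checked with a correlation coefficient once two matching series have been downloaded. The following is a minimal sketch of the Pearson coefficient in plain Python; the function name `pearson` and the two short series are placeholders chosen for illustration and are not real measurements.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: this divides by zero if either series is constant.
    return cov / sqrt(var_x * var_y)


# Placeholder values only, NOT real temperature or case counts.
weekly_temperature = [1.0, 2.0, 3.0, 4.0]
weekly_cases = [2.0, 4.0, 6.0, 8.0]

r = pearson(weekly_temperature, weekly_cases)
print(round(r, 3))
```

A value near 1 or -1 suggests a strong linear relationship and a value near 0 suggests little linear relationship; on its own, a high value does not show that one quantity causes the other.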


## CLASS ACTIVITY

## Explore data.gov.sg (15 minutes)

* Go to https://data.gov.sg
* Explore and search datasets
* Find 2 or 3 interesting datasets that you could use for a project in this module or other modules
* Write down the title of each dataset and the URL where it can be found, e.g.,
1. Title: Weekly Infectious Disease    URL: https://data.gov.sg/dataset/weekly-infectious-disease-bulletin-cases
2. Title: Historical weather data in Singapore for the past 1 year (i.e., 2002)    URL: http://www.weather.gov.sg/climate-historical-daily/


Potential analysis to try:
* To see whether certain infectious diseases emerge more frequently when the temperature is higher.

3. Post your 2 or 3 potential analyses on Menti: https://www.menti.com



See other examples as `indicated above`:

* Is the weather in Singapore (i.e., hot and humid) correlated with dengue?
* Which months have the highest number of dengue cases?
* Is the price of HDB flats correlated with the price of COE? E.g., when the price of HDB flats is lower, is the price of COE higher?
* Is the location of money changers correlated with the location of MRT stations?

Some other hacking data activities

##  5.13 Discussion Prompt: Web APIs


## 5.15 Considering alternative ways to parse text
Access the documentation on regular expression operations https://docs.python.org/3/library/re.html for reference.

We learned about these in a lab earlier in the course. Now it is time to compare and contrast techniques with web scraping.

Try to give an account of the similarities and differences of navigation through regular expressions versus utilising semantic elements such as those present in the DOM. 
Can you see the difference in applications for both techniques? 
Are there any domains where both cases offer utility? 
Try to post a four- or five-sentence answer in the discussion forums. Reply to at least one of your peers.
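
To make the comparison concrete, the sketch below extracts link targets from the same snippet twice: once with a regular expression that treats the page as plain text, and once with a structure-aware parser. The standard library's `html.parser` stands in here for a DOM library such as beautifulsoup4; the snippet and the class name `LinkCollector` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Hypothetical snippet: the second link has its attributes in a different order.
SAMPLE_HTML = (
    '<p>See <a href="/a">first</a> and '
    '<a class="ext" href="/b">second</a>.</p>'
)

# Regular expression: matches a fixed textual pattern, so it only finds
# links where href comes immediately after the tag name.
regex_links = re.findall(r'<a\s+href="([^"]*)"', SAMPLE_HTML)


class LinkCollector(HTMLParser):
    """Collect href values from the start tags reported by the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


collector = LinkCollector()
collector.feed(SAMPLE_HTML)

print(regex_links)      # ['/a']
print(collector.links)  # ['/a', '/b']
```

The regular expression misses the second link because the pattern is tied to one particular spelling of the tag, whereas the parser works from the element structure. Regular expressions remain useful for text inside an element once it has been located, such as pulling a year out of a citation string.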


## 5.16 Advanced programming exercise

Create a new Jupyter Notebook on your system. 

There are multiple parts to this final revision activity. You can choose to do one, two or all three activities depending on how confident you feel. You will probably have to rely on the DOM inspection capabilities of your browser. I would personally suggest Chrome for this as it can point to specific elements both visually and in the markup. If you get stuck you can always ask for help in the tutor forum and we will give you hints and tips!


Exercise
Web scrape a list of all of my publications since 2015 (for example from my Google Scholar profile). 

We want to scrape this in the same way that we did with the table in the video lecture. We can even specify the element we want to look at as follows:

`table = soup.find("table", attrs={"id":"gsc_a_t"})`

That means we need to focus on some special HTML elements:

`<tr>` and `<td>`

Caveat: there are numerous people with my name and not all of my publications are at the same institution so this search may not be as easy as it sounds! You may also find yourself blocked by captchas and such, in which case you might have to find workarounds. 
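
One way to walk the `<tr>` and `<td>` elements and keep only entries since 2015 is sketched below on a simplified inline table, so it runs without contacting Google Scholar. The table id `gsc_a_t` and the link class `gsc_a_at` appear in the notebook output further down; the row class `gsc_a_tr`, the use of `gsc_a_y` on data cells, and the two "Hypothetical" rows are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the publications table (not the real page).
SAMPLE_HTML = """
<table id="gsc_a_t"><tbody>
  <tr class="gsc_a_tr">
    <td class="gsc_a_t"><a class="gsc_a_at">GeoTracks: Adaptive music for everyday journeys</a></td>
    <td class="gsc_a_y">2016</td>
  </tr>
  <tr class="gsc_a_tr">
    <td class="gsc_a_t"><a class="gsc_a_at">Hypothetical older paper</a></td>
    <td class="gsc_a_y">2012</td>
  </tr>
  <tr class="gsc_a_tr">
    <td class="gsc_a_t"><a class="gsc_a_at">Hypothetical undated entry</a></td>
    <td class="gsc_a_y"></td>
  </tr>
</tbody></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", attrs={"id": "gsc_a_t"})

since_2015 = []
for row in table.find_all("tr"):
    title_link = row.find("a", class_="gsc_a_at")
    year_cell = row.find("td", class_="gsc_a_y")
    if title_link is None or year_cell is None:
        continue  # header row or a row without the expected cells
    year_text = year_cell.get_text(strip=True)
    # Skip rows with a missing or non-numeric year instead of failing.
    if year_text.isdigit() and int(year_text) >= 2015:
        since_2015.append((title_link.get_text(strip=True), int(year_text)))

print(since_2015)
```

Checking each cell for `None` and each year for digits keeps the loop from failing on header rows or entries that have no year.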

Advanced exercise
Scrape a list of all the co-authors of my papers including a numerical value that corresponds to the number of co-authorships.
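
Once the author string of each paper has been scraped, the counting step can be sketched with `collections.Counter`. The author strings below are the ones shown in full in the results table at the end of this notebook (truncated entries are left out), and the variable names are illustrative.

```python
from collections import Counter

# Author fields copied from the fully visible rows of the results table.
author_fields = [
    "Sean McGrath, Alan Chamberlain, Steve Benford",
    "Alan Chamberlain, Sean McGrath, Steve Benford",
    "Sean McGrath, Alan Chamberlain, Steve Benford",
    "Sean McGrath, Steve Love",
    "Sean McGrath",
]

ME = "Sean McGrath"  # the profile owner is not counted as a co-author

coauthor_counts = Counter(
    name.strip()
    for field in author_fields
    for name in field.split(",")
    if name.strip() and name.strip() != ME
)

# most_common() lists co-authors from most to least frequent.
for name, count in coauthor_counts.most_common():
    print(name, count)
```

Splitting on commas assumes that no author name contains a comma, which holds for the entries above but may not hold for every profile.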

Expert exercise
If you feel really brave and want to go out on your own you can try this activity:

Scrape the abstract/keywords from these papers.


In [1]:
import requests
import bs4

In [2]:
Sean=requests.get('https://scholar.google.com/citations?user=ETIBghkAAAAJ&hl=en')

In [3]:
Sean.text

'<!doctype html><html><head><title>Sean McGrath - Google Scholar</title><meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><meta name="referrer" content="always"><meta name="viewport" content="width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=2"><meta name="format-detection" content="telephone=no"><link rel="shortcut icon" href="/favicon.ico"><link rel="canonical" href="http://scholar.google.co.uk/citations?user=ETIBghkAAAAJ&amp;hl=en"><meta name="description" content="Lecturer in Computer Science Goldsmiths University - Cited by 91 - HCI - User Experience - Audio - Technology - Creativity"><meta property="og:description" content="Lecturer in Computer Science Goldsmiths University - Cited by 91 - HCI - User Experience - Audio - Technology - Creativity"><meta property="og:title" content="Sean McGrath"><meta property="og:image" content="https://scholar.googleusercontent.com/citations?view_op=medium_ph

In [4]:
Sean_parse=bs4.BeautifulSoup(Sean.text,'html.parser') 

In [5]:
Sean_parse

<!DOCTYPE html>
<html><head><title>Sean McGrath - Google Scholar</title><meta content="text/html;charset=utf-8" http-equiv="Content-Type"/><meta content="IE=Edge" http-equiv="X-UA-Compatible"/><meta content="always" name="referrer"/><meta content="width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=2" name="viewport"/><meta content="telephone=no" name="format-detection"/><link href="/favicon.ico" rel="shortcut icon"/><link href="http://scholar.google.co.uk/citations?user=ETIBghkAAAAJ&amp;hl=en" rel="canonical"/><meta content="Lecturer in Computer Science Goldsmiths University - Cited by 91 - HCI - User Experience - Audio - Technology - Creativity" name="description"/><meta content="Lecturer in Computer Science Goldsmiths University - Cited by 91 - HCI - User Experience - Audio - Technology - Creativity" property="og:description"/><meta content="Sean McGrath" property="og:title"/><meta content="https://scholar.googleusercontent.com/citations?view_op=medium_photo&amp;user=ET

In [6]:
Sean_parse.find("table", attrs={"id":"gsc_a_t"})


<table id="gsc_a_t"><thead><tr aria-hidden="true" id="gsc_a_tr0"><th class="gsc_a_t"></th><th class="gsc_a_c"></th><th class="gsc_a_y"></th></tr><tr id="gsc_a_trh"><th class="gsc_a_t" scope="col"><span id="gsc_a_ta"><a class="gsc_a_a" href="/citations?hl=en&amp;oe=ASCII&amp;user=ETIBghkAAAAJ&amp;view_op=list_works&amp;sortby=title">Title</a></span><div class="gs_md_r gs_md_rmb gs_md_rmbl" id="gsc_dd_sort-r"><button aria-controls="gsc_dd_sort-d" aria-haspopup="true" class="gs_in_se gs_btn_mnu gs_btn_flat gs_btn_lrge gs_btn_half gs_btn_lsu gs_press gs_md_tb" id="gsc_dd_sort-b" ontouchstart="gs_evt_dsp(event)" type="button"><span class="gs_wr"><span class="gs_lbl">Sort</span><span class="gs_icm"></span></span></button><div class="gs_md_d gs_md_ulr" id="gsc_dd_sort-d" role="menu" tabindex="-1"><div class="gs_oph gsc_dd_sec gsc_dd_sep" id="gsc_dd_sort-s"><a class="gs_md_li gsc_dd_sort-sel" href="/citations?hl=en&amp;oe=ASCII&amp;user=ETIBghkAAAAJ&amp;view_op=list_works" role="menuitem" tabi

In [7]:
Sean_table=Sean_parse.find("table", attrs={"id":"gsc_a_t"})


In [8]:
Sean_table.find_all("a", class_='gsc_a_at')

[<a class="gsc_a_at" data-href="/citations?view_op=view_citation&amp;hl=en&amp;oe=ASCII&amp;user=ETIBghkAAAAJ&amp;citation_for_view=ETIBghkAAAAJ:u-x6o8ySG0sC" href="javascript:void(0)">Designing for exploratory play with a hackable digital musical instrument</a>,
 <a class="gsc_a_at" data-href="/citations?view_op=view_citation&amp;hl=en&amp;oe=ASCII&amp;user=ETIBghkAAAAJ&amp;citation_for_view=ETIBghkAAAAJ:UeHWp8X0CEIC" href="javascript:void(0)">Making music together: An exploration of amateur and pro-am grime music production</a>,
 <a class="gsc_a_at" data-href="/citations?view_op=view_citation&amp;hl=en&amp;oe=ASCII&amp;user=ETIBghkAAAAJ&amp;citation_for_view=ETIBghkAAAAJ:d1gkVwhDpl0C" href="javascript:void(0)">GeoTracks: Adaptive music for everyday journeys</a>,
 <a class="gsc_a_at" data-href="/citations?view_op=view_citation&amp;hl=en&amp;oe=ASCII&amp;user=ETIBghkAAAAJ&amp;citation_for_view=ETIBghkAAAAJ:u5HHmVD_uO8C" href="javascript:void(0)">Understanding social media and sound: mu

In [9]:
Sean_cell=Sean_table.find_all("a", class_='gsc_a_at')

In [18]:
for i in Sean_cell:
    print(i.text, "https://scholar.google.com" + i['data-href'] + "\n")

Designing for exploratory play with a hackable digital musical instrument https://scholar.google.com/citations?view_op=view_citation&hl=en&oe=ASCII&user=ETIBghkAAAAJ&citation_for_view=ETIBghkAAAAJ:u-x6o8ySG0sC

Making music together: An exploration of amateur and pro-am grime music production https://scholar.google.com/citations?view_op=view_citation&hl=en&oe=ASCII&user=ETIBghkAAAAJ&citation_for_view=ETIBghkAAAAJ:UeHWp8X0CEIC

GeoTracks: Adaptive music for everyday journeys https://scholar.google.com/citations?view_op=view_citation&hl=en&oe=ASCII&user=ETIBghkAAAAJ&citation_for_view=ETIBghkAAAAJ:d1gkVwhDpl0C

Understanding social media and sound: music, meaning and membership, the case of SoundCloud https://scholar.google.com/citations?view_op=view_citation&hl=en&oe=ASCII&user=ETIBghkAAAAJ&citation_for_view=ETIBghkAAAAJ:u5HHmVD_uO8C

The Grime scene: social media, music, creation and consumption https://scholar.google.com/citations?view_op=view_citation&hl=en&oe=ASCII&user=ETIBghkAAAAJ&

#### get the last URL

In [55]:
#Sean_cell[2]

In [56]:
#[2]['data-href']

In [57]:
#last_article=requests.get("https://scholar.google.com" +  Sean_cell[2]['data-href'])

In [58]:
# i still refers to the last link visited by the for loop above
requests.get("https://scholar.google.com" + i['data-href'])

<Response [200]>

In [59]:
last_article=requests.get("https://scholar.google.com" + i['data-href'])

In [60]:
last_article.text

'<style>.gsc_oms_mm{font-size:24px;line-height:16px;display:inline-block;margin:-10px 0 0 4px;position:relative;top:8px;}.gsc_oms_link{white-space:nowrap;margin-right:12px;}.gsc_oms_link:last-child{margin-right:0;}#gsc_ocd_upload{max-width:500px;margin:0 auto;}.gsc_upl_title{font-size:20px;margin:0 0 4px 0;}.gsc_upl_desc{font-size:15px;padding-top:12px;line-height:1.24;}.gsc_upl_desc>p{margin:12px 0;}.gs_el_ph .gsc_upl_desc{padding-top:4px;}#gsc_upl_error:empty,#gsc_upl_form{display:none;}#gsc_upl_error{padding:8px;margin-bottom:16px;}#gsc_vcd_title_wrapper{font-size:16px;margin:0 0 16px 0;}#gsc_vcd_title{font-size:20px;}.gs_el_sm #gsc_vcd_title{font-size:18px;}#gsc_vcd_title_gg{float:right;padding:0 0 8px 16px;}.gs_el_ph #gsc_vcd_title_gg{float:none;text-align:right;padding:0;margin:-4px 0 8px 0;}.gsc_vcd_title_ggt{font-size:13px;font-weight:bold;}.gsc_vcd_title_ggut{font-size:13px;color:#777;padding-bottom:8px;}.gsc_vcd_field{float:left;width:100px;text-align:right;color:#777;}.gs_el

In [61]:
last_parse=bs4.BeautifulSoup(last_article.text, "html.parser")

In [62]:
last_parse.find("div", "gsc_vcd_field", string="Authors")

<div class="gsc_vcd_field">Authors</div>

In [63]:
Authors=last_parse.find("div", "gsc_vcd_field", string="Authors").find_next().text

In [64]:
Authors

'Sean McGrath'

In [65]:
#Authors.split(",")

In [53]:
#len(Authors.split(","))

In [45]:
Publication_Date=last_parse.find("div", "gsc_vcd_field", string="Publication date").find_next().text

In [46]:
Publication_Date

'2016/10/1'

In [10]:
import pandas as pd

In [11]:
df=pd.read_excel("books.xls")

In [12]:
df

Unnamed: 0.1,Unnamed: 0,Title,URL,Date,Type,Name of Publication,Authors,Number,Abstract
0,0,Designing for exploratory play with a hackable...,https://scholar.google.com/citations?view_op=v...,2016/6/4,Book,Proceedings of the 2016 ACM Conference on Desi...,"Andrew P McPherson, Alan Chamberlain, Adrian H...",5,This paper explores the design of digital musi...
1,1,Making music together: An exploration of amate...,https://scholar.google.com/citations?view_op=v...,2016/10/4,Book,Proceedings of the Audio Mostly 2016,"Sean McGrath, Alan Chamberlain, Steve Benford",3,This novel research presents the results of an...
2,2,GeoTracks: Adaptive music for everyday journeys,https://scholar.google.com/citations?view_op=v...,2016/10/1,Book,Proceedings of the 24th ACM international conf...,"Chris Greenhalgh, Adrian Hazzard, Sean McGrath...",4,Listening to music on the move is an everyday ...
3,3,"Understanding social media and sound: music, m...",https://scholar.google.com/citations?view_op=v...,2015/12/22,Conference,DMRN+10: Digital Music Research Network One-da...,"Alan Chamberlain, Sean McGrath, Steve Benford",3,Social media technologies have meant that peop...
4,4,"The Grime scene: social media, music, creation...",https://scholar.google.com/citations?view_op=v...,2016/10/4,Book,Proceedings of the Audio Mostly 2016,"Sean McGrath, Alan Chamberlain, Steve Benford",3,In this paper we start to explore and unpack t...
5,5,The Rough Mile: Testing a framework of immersi...,https://scholar.google.com/citations?view_op=v...,2017/6/10,Book,Proceedings of the 2017 Conference on Designin...,"Jocelyn Spence, Adrian Hazzard, Sean McGrath, ...",5,We present our case study on gifting digital m...
6,6,The user experience of mobile music making: An...,https://scholar.google.com/citations?view_op=v...,2017/7/1,Journal,Computers in Human Behavior,"Sean McGrath, Steve Love",2,The research herein describes the investigatio...
7,7,The Rough Mile: Reframing Location Through Loc...,https://scholar.google.com/citations?view_op=v...,2017/8/23,Book,Proceedings of the 12th International Audio Mo...,"Adrian Hazzard, Jocelyn Spence, Chris Greenhal...",4,We chart the design and deployment of The Roug...
8,8,An ethnographic exploration of studio producti...,https://scholar.google.com/citations?view_op=v...,2016/9/2,Conference,Proceedings of the 2nd AES Workshop on Intelli...,"Sean McGrath, Adrian Hazzard, Alan Chamberlain...",4,Tools for music production range from full sca...
9,9,DESIGNING AND DEVELOPING USER-CENTRED SYSTEMS,https://scholar.google.com/citations?view_op=v...,2018/8/14,Conference,The 4th Workshop on Intelligent Music Production,Sean McGrath,1,Our work explores the implications for the des...


# END