# Lab 10 - How to ethically scrape the web
*© 2023 Colin Conrad*

Welcome to Week 10 of ECMM 6014! This week we are going to start exploring data science topics in earnest by applying our programming skills to web data.


**This week, we will achieve the following objectives:**
- Search for books using the Open Library API
- Use an API to retrieve Open Library collections data
- Retrieve and process webpage data
- Ethically scrape XKCD comics

Reference: Sweigart (2014) Ch. 12.



# Case: Open Library
Though there are virtually infinite uses for web application programming interfaces (APIs), one of the most tangible and easy to use is that provided by [Internet Archive's](https://archive.org/) Open Library. The vision of the [Open Library](https://openlibrary.org/) is to make all of humanity's published works freely available to everyone in the world. It does this by providing a digital collection of books in a variety of formats, ranging from text to Kindle.

The [Open Library's API](https://openlibrary.org/developers/api) provides detailed documentation about how to access and use the data held on their system. We will not use the API to retrieve book content; instead, we will use it to navigate their collection and retrieve their library system data. Though book content may be copyrighted, their system data is [freely available for web developers to use](https://openlibrary.org/developers/licensing). For more information about open data licenses, please refer to the documentation on [opensource.org](https://opensource.org/licenses).

There are many APIs which may be useful to you in your research. You may be interested in checking out some of the following free APIs:
- [Open Corporates](https://api.opencorporates.com/) - A large repository of company information;
- [NASA](https://api.nasa.gov/) - NASA images galore! (Requires a key);
- [Chuck Norris jokes](http://www.icndb.com/api/) - A simple Chuck Norris joke generator;
- [REST Countries](https://restcountries.eu/) - Country information;
- [Reddit](https://www.reddit.com/dev/api/) - Access to social media data.

# Objective 1: Search for books using the Open Library API
As many of you have likely encountered in the past, application programming interfaces (APIs) are a critical piece of computer infrastructure, particularly for web applications. An API is a set of rules that governs how pieces of software communicate with each other. Web APIs, in particular, often govern how computers exchange data over the internet. These days, that data is often exchanged in JavaScript Object Notation (JSON) format.
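
To make the JSON format more concrete, here is a minimal sketch of how JSON text maps onto Python types. The JSON string below is hand-written for illustration and is not taken from any real API; the standard library's `json.loads` function does the conversion.

```python
import json

# A hypothetical, hand-written JSON string shaped like a small API reply.
raw_text = '{"title": "Brave New World", "year": 1932, "available": true, "tags": ["fiction", "dystopia"]}'

# json.loads converts JSON text into built-in Python types.
book = json.loads(raw_text)

print(type(book).__name__)  # the JSON object becomes a Python dictionary
print(book["title"])        # JSON strings become Python strings
print(book["tags"][0])      # JSON arrays become Python lists
print(book["available"])    # JSON true becomes Python True
```

Once the text has been converted, the usual dictionary and list operations apply, which is what makes JSON convenient to work with in Python.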

Python has a few great libraries for retrieving and managing JSON data. The first library we will explore is `requests`. Before we proceed, please ensure that you have this library installed. If not, install it using `pip install requests`, as explored in previous weeks.

In [2]:
import requests # import the library

Though it might seem like wizardry at times, all the `requests` library does is allow us to make web requests much as a web browser does. Building on Sweigart's example, we can use `requests` to retrieve a web page from a particular URL. For example, the following code retrieves the result of a request (in this case, the HTML code of the Python home page) and saves it in the variable `resp`.

In [3]:
resp = requests.get('https://python.org') # retrieves python homepage
resp # tell us what this is

<Response [200]>

If you execute the code above successfully, you should get something along the lines of `<Response [200]>`. This denotes a response object with status code 200, which is HTTP's code for success. If we wanted to see the content of the response, we could type the following.

In [4]:
resp.text # give us the text result from the request



The text generated above is actually the HTML code for *Python.org*. When rendered by a web browser, it creates a nice interface, though here it does not. You can take a look at the code next to the page in Chrome below. 

Similarly, we can make web requests to an API to retrieve data. Open a web browser then copy and paste the following URL and see what happens: `http://openlibrary.org/search.json?q=brave+new+world`. 

You will likely be given a wall of JSON text. These are the results of a request to the Open Library API for 'Brave New World', one of my all-time favorite novels. We can retrieve the same results using `requests` in Python.

In [5]:
# save the response as a variable and retrieve the JSON data

response = requests.get('http://openlibrary.org/search.json?q=brave+new+world')
response.json()

{'numFound': 629,
 'start': 0,
 'numFoundExact': True,
 'docs': [{'key': '/works/OL64468W',
   'type': 'work',
   'seed': ['/books/OL34075704M',
    '/books/OL19557074M',
    '/books/OL37907144M',
    '/books/OL27279269M',
    '/books/OL22123296M',
    '/books/OL26423807M',
    '/books/OL26896558M',
    '/books/OL28390491M',
    '/books/OL26376236M',
    '/books/OL23757347M',
    '/books/OL13806517M',
    '/books/OL9229851M',
    '/books/OL14130759M',
    '/books/OL40234250M',
    '/books/OL26376932M',
    '/books/OL6770033M',
    '/books/OL6474536M',
    '/books/OL19834566M',
    '/books/OL7275142M',
    '/books/OL6289093M',
    '/books/OL19303376M',
    '/books/OL17349234M',
    '/books/OL20879326M',
    '/books/OL16824082M',
    '/books/OL22604809M',
    '/books/OL24979469M',
    '/books/OL24854289M',
    '/books/OL28660780M',
    '/books/OL23266684M',
    '/books/OL6504102M',
    '/books/OL22810116M',
    '/books/OL23757350M',
    '/books/OL14243013M',
    '/books/OL17727850M',
   

You can probably see where this is going. When we do this in Python we retrieve the data in JSON format easily and use it like a Python dictionary. This gives us a lot of power; in the wise words of Uncle Ben 'with great power comes great responsibility'.
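
As a rough illustration of using the results like a dictionary, the sketch below navigates a hand-trimmed sample shaped like the search output above. The `sample` variable and the idea of stripping the `/books/` prefix are assumptions made for this illustration; a real response contains many more fields.

```python
# A hand-trimmed sample shaped like the search output shown above;
# a real response contains many more fields and entries.
sample = {
    "numFound": 629,
    "start": 0,
    "docs": [
        {
            "key": "/works/OL64468W",
            "type": "work",
            "seed": ["/books/OL34075704M", "/books/OL19557074M", "/books/OL37907144M"],
        }
    ],
}

first_doc = sample["docs"][0]  # "docs" is a list of dictionaries
work_key = first_doc["key"]

# Keep only the identifier after the last slash of each "/books/..." path
book_ids = [path.rsplit("/", 1)[-1] for path in first_doc["seed"]]

print(work_key)
print(book_ids)
```

The same indexing pattern works on the dictionary returned by `response.json()`, since it is an ordinary Python dictionary.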

The Open Library API provides [documentation on performing searches](https://openlibrary.org/dev/docs/api/search) using the API. In the previous example, we searched for Aldous Huxley's *Brave New World*. Modify the previously used code to conduct a different search, retrieve the results as JSON data, and display the JSON data in Jupyter.

In [15]:
# conduct a different search and retrieve the JSON data

response = requests.get('http://openlibrary.org/search.json?q=the+lord+of+the+rings')
response.json()

{'numFound': 769,
 'start': 0,
 'numFoundExact': True,
 'docs': [{'key': '/works/OL27448W',
   'type': 'work',
   'seed': ['/books/OL9158246M',
    '/books/OL9177076M',
    '/books/OL7883890M',
    '/books/OL21217116M',
    '/books/OL6165495M',
    '/books/OL24200787M',
    '/books/OL5975400M',
    '/books/OL17990125M',
    '/books/OL16539692M',
    '/books/OL23795326M',
    '/books/OL5574175M',
    '/books/OL5535578M',
    '/books/OL4382055M',
    '/books/OL20943851M',
    '/books/OL16791443M',
    '/books/OL22917263M',
    '/books/OL10681058M',
    '/books/OL10681579M',
    '/books/OL21392110M',
    '/books/OL22470927M',
    '/books/OL10682160M',
    '/books/OL5237526M',
    '/books/OL24353781M',
    '/books/OL10682337M',
    '/books/OL23821472M',
    '/books/OL9129627M',
    '/books/OL9117897M',
    '/books/OL22984886M',
    '/books/OL17885449M',
    '/books/OL7465857M',
    '/books/OL20943862M',
    '/books/OL27037515M',
    '/books/OL22510662M',
    '/books/OL9228715M',
    '/book

### *Challenge Question 1 (1 point)*
Retrieve the number of items found from your search above. **Hint:** you should be able to do that by saving your `response.json()` as a variable and retrieving the dictionary value of the `numFound` key.

In [21]:
result = response.json()
result["numFound"]


769

# Objective 2: Use an API to retrieve Open Library collections data
Let's take a closer look at how the Open Library API works. [According to their documentation](https://openlibrary.org/dev/docs/api/books), we also have the ability to retrieve particular book information. The books are indexed by many keys, including ISBN numbers and a unique Open Library ID key (OLID). Using these keys we can retrieve data about the particular books in question.

When you conducted a search for Brave New World earlier, you retrieved a series of OLID keys, one of which was `OL22123296M`. Using that key, we can retrieve the data for this particular collection item. Just like everything else in this lab, we make a request over the internet using a specific URL. The structure of an Open Library query is as follows:
- It begins with the call for book data (rather than, say, a search): `https://openlibrary.org/api/books?` 
- It then adds the key information: `bibkeys=OLID:OL22123296M`
- And completes by stating the desired format: `&format=json`

This leads us to the following URL call: `https://openlibrary.org/api/books?bibkeys=OLID:OL22123296M&format=json`. Try calling this request below.

In [22]:
import requests # retrieve the requests library
request = requests.get('https://openlibrary.org/api/books?bibkeys=OLID:OL22123296M&format=json')
bnw_info = request.json() # as before
bnw_info

{'OLID:OL22123296M': {'bib_key': 'OLID:OL22123296M',
  'info_url': 'https://openlibrary.org/books/OL22123296M/Brave_new_world',
  'preview': 'noview',
  'preview_url': 'https://openlibrary.org/books/OL22123296M/Brave_new_world'}}
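
The three URL parts described above can also be assembled programmatically. The following is a minimal sketch using the standard library's `urlencode`; the helper name `build_books_url` is hypothetical and is not part of the Open Library API.

```python
from urllib.parse import urlencode


def build_books_url(olid, fmt="json"):
    """Assemble an Open Library books URL from its three parts (hypothetical helper)."""
    base = "https://openlibrary.org/api/books?"
    params = {"bibkeys": f"OLID:{olid}", "format": fmt}
    # safe=":" keeps the colon in "OLID:..." readable instead of encoding it as %3A
    return base + urlencode(params, safe=":")


url = build_books_url("OL22123296M")
print(url)
```

Building the URL this way makes it easy to swap in a different OLID without retyping the whole address.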

The data saved in the `bnw_info` variable can now be accessed like a dictionary. If we want to retrieve the preview URL we can execute the code below. Consider copying this into your web browser!

In [23]:
bnw_info['OLID:OL22123296M']['preview_url'] # note that there are two levels in this dictionary

'https://openlibrary.org/books/OL22123296M/Brave_new_world'
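
Because the outer key depends on which book was requested, it can be convenient to loop over the dictionary rather than hard-code the key. The sketch below assumes the response has the two-level shape shown above; the dictionary is copied from that output so the example runs without a network connection.

```python
# The dictionary below is copied from the output shown above.
bnw_info = {
    "OLID:OL22123296M": {
        "bib_key": "OLID:OL22123296M",
        "info_url": "https://openlibrary.org/books/OL22123296M/Brave_new_world",
        "preview": "noview",
        "preview_url": "https://openlibrary.org/books/OL22123296M/Brave_new_world",
    }
}

# Loop over the outer level, then read from the inner dictionary.
preview_urls = {}
for bib_key, record in bnw_info.items():
    # .get() returns None instead of raising KeyError when a field is absent
    preview_urls[bib_key] = record.get("preview_url")

print(preview_urls)
```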

### *Challenge Question 2 (1 point)*
[Sweigart (2020)](https://automatetheboringstuff.com/2e/chapter12/) provides code for ordering Python to open your web browser to a specified URL. In theory, we could combine this code with the Open Library API to create a simple app for reading books. Retrieve the `preview_url` for your book as demonstrated above and use the `webbrowser.open()` function to order your web browser to open the book preview. 

Don't forget to import `webbrowser` before calling `webbrowser.open()`.

In [24]:
import webbrowser

In [26]:
bookopen = bnw_info['OLID:OL22123296M']['preview_url']
webbrowser.open(bookopen)

True

# Objective 3: Retrieve and process webpage data
In addition to APIs, we can also use Python to retrieve and process regular web data. The last time we tried this using the `requests` module, we retrieved a wall of hard-to-read HTML text. It would be much easier to process this type of data with a library designed for it.

Fortunately, Python has `Beautiful Soup` which is designed exactly for this task. This library structures HTML data retrieved using requests in a way that is not only readable, but also manageable. For instance, if we wanted to retrieve the Open Library home page, we could execute the following code.

**Note: It is possible that the Beautiful Soup `bs4` library is not installed. If not, use `pip install bs4` before executing this code.**

In [28]:
pip install bs4

Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1257 sha256=08124d1cbc6e26e01eabcfeeae4828412f0cb3e08f42d65ae3d8dffcfdf83826
  Stored in directory: /Users/samibashir/Library/Caches/pip/wheels/73/2b/cb/099980278a0c9a3e57ff1a89875ec07bfa0b6fcbebb9a8cad3
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1
Note: you may need to restart the kernel to use updated packages.


In [32]:
import bs4 # import the Beautiful Soup library
res = requests.get('https://openlibrary.org/') # the target URL
res.raise_for_status() # checks for errors
librarySoup = bs4.BeautifulSoup(res.text, 'html.parser') # retrieve the text and format it as HTML
print(librarySoup) #print the HTML


<!DOCTYPE html>

<html lang="" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="" name="title"/>
<meta content="free books, books to read, free ebooks, audio books, read books for free, read books online, online library" name="keywords"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="OpenLibrary.org" name="author">
<meta content="OpenLibrary.org" name="creator">
<meta content="Original content copyright; 2007-2015" name="copyright"/>
<meta content="Global" name="distribution"/>
<meta content="#e2dcc5" name="theme-color"/>
<link href="https://openlibrary.org/" rel="canonical"/>
<link href="https://analytics.archive.org" rel="preconnect"/>
<link href="/static/opensearch.xml" rel="search" title="Open Library" type="application/opensearchdescription+xml"/>
<link href="/static/manifest.json" rel="manifest"/>
<link href="/static/images/openlibrary-128x128.png" rel="apple-

Beautiful Soup has a few handy functions that greatly lighten our load when processing web data. We could also save this HTML data by opening a file and writing the content of the retrieved website to our local computer. For instance, the following code saves Open Library's web page as `example.html` in the current working directory.

In [34]:
exampleFile = open('example.html', 'w', encoding='utf-8') # we need to explicitly state UTF8 encoding
exampleFile.write(str(librarySoup)) # writes the file
exampleFile.close() # closes the html file

Try opening the file using a code editor such as Notepad++. You will see that you have just copied Open Library's web page; that is to say, you **scraped** Open Library's web page. This example illustrates how computers access and process web data. Web scrapers also form the backbone of search engine technology and of the Internet Archive's software.
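
The file-saving step can also be written with a `with` block, which closes the file automatically even if an error occurs. The following is a minimal sketch; the `html_text` string is a stand-in for `str(librarySoup)`, and the temporary folder is used only so that the illustration leaves no files behind.

```python
import os
import tempfile

# Stand-in for str(librarySoup); any HTML string works for this illustration.
html_text = "<html><body><h1>Open Library</h1></body></html>"

# Write to a temporary folder so the illustration leaves no files behind.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.html")
    with open(path, "w", encoding="utf-8") as example_file:
        example_file.write(html_text)  # the file is closed when this block ends
    with open(path, "r", encoding="utf-8") as example_file:
        saved_text = example_file.read()

print(saved_text == html_text)
```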

Web scrapers are ubiquitous, though they are not necessarily legal in all circumstances. Many (or perhaps even most) web materials are copyrighted (e.g. many newspaper articles), and their owners may not permit you to access their data in this way. Fortunately, the Internet Archive allows scholars to access their materials. Other sites may not be so generous.
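
One common courtesy, offered here as a sketch rather than as legal guidance, is to check a site's `robots.txt` file before scraping it. The standard library's `urllib.robotparser` can read these rules. The rules and the `example.com` addresses below are hypothetical and written only for illustration; a site's terms of use and copyright still apply regardless of what `robots.txt` says.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, written here only for illustration.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse rules from a list of lines (no network needed)

# can_fetch(user_agent, url) reports whether the rules permit fetching the URL
allowed_public = parser.can_fetch("*", "https://example.com/books/")
allowed_private = parser.can_fetch("*", "https://example.com/private/data")
print(allowed_public, allowed_private)
```

For a live site, the parser's `set_url()` and `read()` methods can be used to download the real `robots.txt` instead of the hand-written lines.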

### Retrieving specific web data
Using Beautiful Soup we can also access particular page elements. HTML documents consist of a series of elements, which are written as tags (e.g. `<div>`) and can carry attributes such as a class (e.g. the logo class, selected with `.logo`). Beautiful Soup helps us to navigate these elements so that we can retrieve the data that we want, rather than whole web pages.

This is better expressed using an example. If we wanted to retrieve data from specific elements from the Open Library web page, we can use the `select` method to retrieve that data. The following code retrieves only data which is contained in their `page-banner` class (usually reserved for important catch phrases).  

In [35]:
res = requests.get('https://openlibrary.org/') # the target URL
res.raise_for_status() # checks for errors
librarySoup = bs4.BeautifulSoup(res.text, 'html.parser') # retrieve the text and format it as HTML
elems = librarySoup.select('.page-banner') # select only elements with the page-banner class
elems # print the data retrieved

[<div class="page-banner page-banner-black page-banner-center">
 <div class="iaBar">
 <a class="iaLogo" href="https://archive.org"><img alt="Internet Archive logo" src="/static/images/ia-logo.svg" width="160"/></a>
 <a class="ghost-btn" data-ol-link-track="IABar|DonateButton" href="https://archive.org/donate/?platform=ol&amp;origin=olwww-TopNavDonateButton">Donate <span aria-hidden="true" class="heart">♥</span></a>
 <div class="language-component header-dropdown" id="footer-locale-menu">
 <details>
 <summary>
 <img alt="Change Website Language" class="translate-icon" src="/static/images/language-icon.svg" title="Change Website Language">
 </img></summary>
 <div class="language-dropdown-component">
 <ul class="locale-options dropdown-menu">
 <li><a data-lang-id="cs" href="#" lang="cs" title="Czech">Čeština (cs)</a></li>
 <li><a data-lang-id="de" href="#" lang="de" title="German">Deutsch (de)</a></li>
 <li><a data-lang-id="en" href="#" lang="en" title="English">English (en)</a></li>
 <li

Beautiful Soup detected two elements with this class. The first was the banner containing the donate button and the second was their catch phrase "Together, let's build an Open Library for the world". Beautiful Soup returns these in a list, so we can retrieve an individual element by its index. The following code retrieves the first of these elements (index `0`).

In [36]:
elems[0]

<div class="page-banner page-banner-black page-banner-center">
<div class="iaBar">
<a class="iaLogo" href="https://archive.org"><img alt="Internet Archive logo" src="/static/images/ia-logo.svg" width="160"/></a>
<a class="ghost-btn" data-ol-link-track="IABar|DonateButton" href="https://archive.org/donate/?platform=ol&amp;origin=olwww-TopNavDonateButton">Donate <span aria-hidden="true" class="heart">♥</span></a>
<div class="language-component header-dropdown" id="footer-locale-menu">
<details>
<summary>
<img alt="Change Website Language" class="translate-icon" src="/static/images/language-icon.svg" title="Change Website Language">
</img></summary>
<div class="language-dropdown-component">
<ul class="locale-options dropdown-menu">
<li><a data-lang-id="cs" href="#" lang="cs" title="Czech">Čeština (cs)</a></li>
<li><a data-lang-id="de" href="#" lang="de" title="German">Deutsch (de)</a></li>
<li><a data-lang-id="en" href="#" lang="en" title="English">English (en)</a></li>
<li><a data-lang-i

Beautiful Soup's element objects also have a `getText()` method for retrieving only the text they contain. Using this we can retrieve the readable text of an element, such as the slogan on their web page. A picture of the exact element retrieved is provided for your reference.

In [37]:
elems[0].getText()

'\n\n\nDonate ♥\n\n\n\n\n\n\n\nČeština (cs)\nDeutsch (de)\nEnglish (en)\nEspañol (es)\nFrançais (fr)\nHrvatski (hr)\nPortuguês (pt)\nతెలుగు (te)\nУкраїнська (uk)\n中文 (zh)\n\n\n\n\n\n'
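
The text returned above contains many newline characters. As a minimal sketch of one way to tidy it, Beautiful Soup's `get_text()` method (the same method as `getText()`) accepts a separator and a `strip` argument. The HTML fragment below is hand-written to resemble the banner and is not the real Open Library page.

```python
import bs4

# A small hand-written HTML fragment standing in for a real page banner.
html = """
<div class="page-banner">
  <a class="ghost-btn" href="#">Donate</a>
  <ul>
    <li>Deutsch (de)</li>
    <li>English (en)</li>
  </ul>
</div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")
banner = soup.select(".page-banner")[0]

# The separator joins the text pieces; strip=True drops surrounding whitespace
clean_text = banner.get_text(" | ", strip=True)
print(clean_text)
```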

Beautiful Soup also works well for locating images in web content. To find all images and print each image source, we use `find_all`.

In [38]:
import os
import bs4 # import the Beautiful Soup library
res = requests.get('https://xkcd.com') # the target URL
res.raise_for_status() # checks for errors
librarySoup = bs4.BeautifulSoup(res.text, 'html.parser') # retrieve the text and format it as HTML
#print(librarySoup)
for item in librarySoup.find_all('img'):
    print(item['src'])
   

/s/0b7742.png
https://xkcd.com/s/5bef6b.png
//imgs.xkcd.com/comics/salt_dome.png
//imgs.xkcd.com/s/a899e84.jpg
//imgs.xkcd.com/s/temperature.png
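
Notice that some of the sources above are relative (`/s/...`) or omit the scheme (`//imgs...`). If full addresses are needed, for example to download the images, one possible approach is the standard library's `urljoin`. The sketch below reuses a few `src` values copied from the output above.

```python
from urllib.parse import urljoin

page_url = "https://xkcd.com"

# src values copied from the output above
sources = [
    "/s/0b7742.png",
    "https://xkcd.com/s/5bef6b.png",
    "//imgs.xkcd.com/comics/salt_dome.png",
]

# urljoin resolves each src against the address of the page it came from
absolute = [urljoin(page_url, src) for src in sources]
for link in absolute:
    print(link)
```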


### *Challenge Question 3 (1 point)*
Using the Beautiful Soup library, retrieve and print the HTML data from `https://dal.ca`. You can modify the code we used to retrieve the Open Library page for this task.

In [47]:
import bs4 # import the Beautiful Soup library
res = requests.get('https://dal.ca/') # the target URL
res.raise_for_status() # checks for errors
librarySoup = bs4.BeautifulSoup(res.text, 'html.parser') # retrieve the text and format it as HTML
print(librarySoup) #print the HTML

<!DOCTYPE html>

<html lang="en">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="" name="keywords"/>
<meta content="With campuses located in Halifax and Truro, NS, Dalhousie is a research-intensive university offering over 190 degrees in 13 diverse faculties." name="description"/>
<meta content="2023-03-21T11:40:15Z" name="coveoDate"/><link href="https://cdn.dal.ca/etc/designs/dalhousie/clientlibs/global/default/images/favicon/icon-192x192.png.lt_52d8b16a1d0bc6e6e2f65bda92d6c7fa.res/icon-192x192.png" rel="icon" sizes="192x192"/>
<link href="https://cdn.dal.ca/etc/designs/dalhousie/clientlibs/global/default/images/favicon/apple-touch-icon-180x180.png.lt_0769fa00e34e46c4d1386dc7853ec114.res/apple-touch-icon-180x180.png" rel="apple-touch-icon" sizes="180x180"/>
<link href="https://cdn.dal.ca/etc/designs/dalhousie/clientlibs/global/default/images/favicon/apple-touch-icon-167x167.png.lt_96c78712ff221466887c36a6900b4720.res/apple-touch-icon-167x167

### *Challenge Question 4 (1 point)*
Again using Beautiful Soup, retrieve and print the text from Dalhousie University's `mainLogo` class. Your result should look something like `\nDalhousie University\n\n`.

In [48]:
res = requests.get('https://dal.ca') # the target URL
res.raise_for_status() # checks for errors
librarySoup = bs4.BeautifulSoup(res.text, 'html.parser') # retrieve the text and format it as HTML
elems = librarySoup.select('.mainLogo') # select only elements with the mainLogo class
elems # print the data retrieved

[<div class="mainLogo"><h2>
 <a href="https://www.dal.ca/" title="Back to Dalhousie University Home Page">Dalhousie University</a>
 </h2>
 </div>]

In [49]:
elems[0].getText()

'\nDalhousie University\n\n'

### *Challenge Question 5 (1 point)*
Find all the image sources in the HTML data and save them in a file.

In [63]:
import os
import bs4 # import the Beautiful Soup library
res = requests.get('https://dal.ca') # the target URL
res.raise_for_status() # checks for errors

# parse the HTML content and retrieve all image links
soup = bs4.BeautifulSoup(res.text, 'html.parser')
image_links = [img['src'] for img in soup.find_all('img', src=True)] # skip <img> tags without a src attribute

# save the image links to a .txt file
with open('image_links.txt', 'w') as f:
    for link in image_links:
        f.write(link + '\n')