Last week, we practiced importing various files into Python and opening them as pandas DataFrames. Concretely, we saw how to import .txt, .csv and .xlsx files and, last but not least, data from relational databases. 


However, all of this assumed that we had the files locally. Often that is not the case. Going to a website and manually downloading or copying information is neither scalable nor reproducible, so you will usually want to do it in code. This is the topic of this workshop. We will practice:

- Importing data from the web and loading it into pandas
- Making HTTP requests
- Scraping and parsing HTML code 


Some jargon: what some popular abbreviations mean

- **URL**: Uniform Resource Locator
- **HTTP**: HyperText Transfer Protocol
- **HTML**: HyperText Markup Language
- **API**: Application Programming Interface
- **JSON**: JavaScript Object Notation
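
To make these terms a bit more concrete, here is a minimal sketch that uses only the Python standard library. The URL is the UCI file used later in this workshop; the JSON string is made up for illustration and is not the output of any real API.

```python
import json
from urllib.parse import urlsplit

# A URL is made of several parts: scheme (protocol), host and path
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'
parts = urlsplit(url)
print(parts.scheme)   # the protocol, here https (HTTP over an encrypted connection)
print(parts.netloc)   # the host name of the server
print(parts.path)     # the location of the resource on that server

# JSON is a text format that maps naturally onto Python dicts and lists.
# This string is a made-up example of what a web API might return.
json_text = '{"dataset": "breast-cancer-wisconsin", "n_columns": 11}'
record = json.loads(json_text)
print(record['n_columns'])
```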

In [1]:
import pandas as pd
import numpy as np

In [336]:
# Packages you will need

!pip install BeautifulSoup4
!pip install MechanicalSoup
!pip install lxml



### Download data from the web using the urllib module

We will work with a popular data set, the so-called 'Wisconsin breast cancer' data, which is uploaded to the UCI Machine learning repository. The data is stored in one file ('breast-cancer-wisconsin.data'). However, it does NOT contain the column names. That information is stored in another file, called 'breast-cancer-wisconsin.names'. We will retrieve them both. 


Here is an example of importing that data using the **urllib** module. We will download it from the ***UCI Machine Learning Repository***. The parent directory can be found here: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/

In [7]:
from urllib.request import urlretrieve

url_data = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'
urlretrieve(url_data, 'breast-cancer-wisconsin.data')

('breast-cancer-wisconsin.data', <http.client.HTTPMessage at 0x2237f143be0>)

**Q1. Using the same module and structure, retrieve the names of the breast cancer data.**

In [8]:
url_names = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names'
urlretrieve(url_names, 'breast-cancer-wisconsin.names')

('breast-cancer-wisconsin.names', <http.client.HTTPMessage at 0x2237f143df0>)
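
Re-running the cells above downloads both files again every time. One possible way around that is sketched below: a small helper (`download_if_missing` is a hypothetical name, not part of urllib) that skips the download when the file is already on disk. So that the sketch runs offline, a local temporary file stands in for the remote one; this relies on `urlretrieve` also accepting `file://` URLs.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, filename):
    """Download url to filename unless the file already exists (hypothetical helper)."""
    if os.path.exists(filename):
        return False  # nothing downloaded, the local copy is reused
    urlretrieve(url, filename)
    return True


# Offline demonstration: a local file plays the role of the remote data file
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / 'source.data'
    source.write_text('1000025,5,1,1,1,2,1,3,1,1,2\n')
    target = os.path.join(tmp, 'copy.data')

    first = download_if_missing(source.as_uri(), target)   # file is fetched
    second = download_if_missing(source.as_uri(), target)  # file is reused
    copied_text = Path(target).read_text()

print(first, second)
```

In the workshop itself you would call it as `download_if_missing(url_data, 'breast-cancer-wisconsin.data')`.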

**Q2. Import the 'breast-cancer-wisconsin.data' file as a pandas DataFrame. Print the top 5 rows**

In [18]:
# Read the data file in pandas, making sure there is no header here
bc = pd.read_csv('breast-cancer-wisconsin.data', header=None)
bc.head()

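| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |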
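
One detail worth knowing before analysing this file: the data set contains a few missing values, and they are written as `?` rather than left empty. The sketch below shows how `na_values` could be used to turn them into proper NaN values. The first two rows are taken from the output above; the third row is made up for illustration and is not part of the real data.

```python
import io
import pandas as pd

# Two rows from the output above, plus one hypothetical row containing a '?'
sample = io.StringIO(
    '1000025,5,1,1,1,2,1,3,1,1,2\n'
    '1002945,5,4,4,5,7,10,3,2,1,2\n'
    '9999999,8,4,5,1,2,?,7,3,1,4\n'
)

# Without na_values, column 6 would be read as text because of the '?'
bc_sample = pd.read_csv(sample, header=None, na_values='?')
print(bc_sample.isna().sum().sum())  # number of missing cells
print(bc_sample[6].dtype)
```

The same argument can be added to the `pd.read_csv` call on the downloaded file.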


**Q3. So far so good. Sometimes it can be quite challenging to work with HTML files or, more broadly, any poorly structured files we download from the web. Inspect 'breast-cancer-wisconsin.names' in a text editor. As you can see, it is a long file and contains quite a lot of information, not all of which is relevant for us. We will extract the relevant information in a bit.** 

    First, import the file, making sure you read it line by line and remove the new line character ('\n')

In [8]:
# Open the text file and read it line by line, removing the new line character

lines = []
with open('breast-cancer-wisconsin.names', mode='r') as file:
    for line in file:
        lines.append(line.strip('\n'))

Inspect the output, which should look like the one below.

In [9]:
lines

['Citation Request:',
 '   This breast cancer databases was obtained from the University of Wisconsin',
 '   Hospitals, Madison from Dr. William H. Wolberg.  If you publish results',
 '   when using this database, then please include this information in your',
 '   acknowledgements.  Also, please cite one or more of:',
 '',
 '   1. O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear ',
 '      programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.',
 '',
 '   2. William H. Wolberg and O.L. Mangasarian: "Multisurface method of ',
 '      pattern separation for medical diagnosis applied to breast cytology", ',
 '      Proceedings of the National Academy of Sciences, U.S.A., Volume 87, ',
 '      December 1990, pp 9193-9196.',
 '',
 '   3. O. L. Mangasarian, R. Setiono, and W.H. Wolberg: "Pattern recognition ',
 '      via linear programming: Theory and application to medical diagnosis", ',
 '      in: "Large-scale numerical optimization", Thomas F. Coleman

The information relevant to us is under **7. Attribute Information**. It contains the names of the features and a brief description of their meaning and/or the values they can take. 

**Q4. The next task is quite challenging and may involve quite a few steps. Using the file you imported, which contains the names, extract the names of the features. In this case the file is not very complicated, but do not just copy the feature names into a new list. Make sure you use your Python coding skills to extract the relevant information. In the end, we want the names of the features in a list.** 

    NOTE: This is a task which can be solved in many different ways. It is likely to take a few lines of code. Utilize what you have learned so far about lists, list slicing, string splitting and cleaning, and/or joining. If you are familiar with regular expressions (regex), you can utilize them as well. My solution does not make use of regex, only material we have covered so far. 

In [109]:
print(lines.index('7. Attribute Information: (class attribute has been moved to last column)'))
print(lines.index('8. Missing attribute values: 16'))

101
117


In [12]:
extracted = lines[105:116]
extracted

['   1. Sample code number            id number',
 '   2. Clump Thickness               1 - 10',
 '   3. Uniformity of Cell Size       1 - 10',
 '   4. Uniformity of Cell Shape      1 - 10',
 '   5. Marginal Adhesion             1 - 10',
 '   6. Single Epithelial Cell Size   1 - 10',
 '   7. Bare Nuclei                   1 - 10',
 '   8. Bland Chromatin               1 - 10',
 '   9. Normal Nucleoli               1 - 10',
 '  10. Mitoses                       1 - 10',
 '  11. Class:                        (2 for benign, 4 for malignant)']
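
The slice `lines[105:116]` above was found by looking at the two printed indices and counting by hand. As a rough sketch of how the same rows could be located without hard-coded numbers, the block below searches for the two section headings and keeps only the numbered rows in between. `sample_lines` is a shortened stand-in for `lines` (the placeholder rows are not the real file contents), and `section_between` is a hypothetical helper name.

```python
import re

# Shortened stand-in for the contents of `lines`; the real file is much longer
sample_lines = [
    '(earlier sections of the file)',
    '',
    '7. Attribute Information: (class attribute has been moved to last column)',
    '',
    '   (table header rows)',
    '   1. Sample code number            id number',
    '   2. Clump Thickness               1 - 10',
    '  11. Class:                        (2 for benign, 4 for malignant)',
    '',
    '8. Missing attribute values: 16',
]


def section_between(lines, start_prefix, end_prefix):
    """Return the lines strictly between the first lines starting with each prefix."""
    start = next(i for i, line in enumerate(lines) if line.startswith(start_prefix))
    end = next(i for i, line in enumerate(lines) if line.startswith(end_prefix))
    return lines[start + 1:end]


section = section_between(sample_lines, '7. Attribute Information', '8. Missing attribute values')

# Keep only the numbered rows such as '   1. Sample code number ...'
numbered = re.compile(r'^\s*\d+\.\s')
extracted_sample = [line for line in section if numbered.match(line)]
print(extracted_sample)
```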

In [13]:
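# Drop the colon after 'Class' and split every line on whitespace
extracted_split = [line.replace(':', '').split() for line in extracted]
print(extracted_split)
# Keep only purely alphabetic tokens; this drops the numbering, the value ranges and tokens with punctuation
extracted_split_no_digits = [[item for item in line if item.isalpha()] for line in extracted_split]
extracted_split_no_digits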

[['1.', 'Sample', 'code', 'number', 'id', 'number'], ['2.', 'Clump', 'Thickness', '1', '-', '10'], ['3.', 'Uniformity', 'of', 'Cell', 'Size', '1', '-', '10'], ['4.', 'Uniformity', 'of', 'Cell', 'Shape', '1', '-', '10'], ['5.', 'Marginal', 'Adhesion', '1', '-', '10'], ['6.', 'Single', 'Epithelial', 'Cell', 'Size', '1', '-', '10'], ['7.', 'Bare', 'Nuclei', '1', '-', '10'], ['8.', 'Bland', 'Chromatin', '1', '-', '10'], ['9.', 'Normal', 'Nucleoli', '1', '-', '10'], ['10.', 'Mitoses', '1', '-', '10'], ['11.', 'Class', '(2', 'for', 'benign,', '4', 'for', 'malignant)']]


[['Sample', 'code', 'number', 'id', 'number'],
 ['Clump', 'Thickness'],
 ['Uniformity', 'of', 'Cell', 'Size'],
 ['Uniformity', 'of', 'Cell', 'Shape'],
 ['Marginal', 'Adhesion'],
 ['Single', 'Epithelial', 'Cell', 'Size'],
 ['Bare', 'Nuclei'],
 ['Bland', 'Chromatin'],
 ['Normal', 'Nucleoli'],
 ['Mitoses'],
 ['Class', 'for', 'for']]

In [14]:
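# Keep only the first three tokens of the first line (dropping 'id' and 'number')
# and only 'Class' from the last line (note: this stores a plain string, not a list)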
extracted_split_no_digits[0] = extracted_split_no_digits[0][:3]
extracted_split_no_digits[-1] = extracted_split_no_digits[-1][0]
print(extracted_split_no_digits)

[['Sample', 'code', 'number'], ['Clump', 'Thickness'], ['Uniformity', 'of', 'Cell', 'Size'], ['Uniformity', 'of', 'Cell', 'Shape'], ['Marginal', 'Adhesion'], ['Single', 'Epithelial', 'Cell', 'Size'], ['Bare', 'Nuclei'], ['Bland', 'Chromatin'], ['Normal', 'Nucleoli'], ['Mitoses'], 'Class']


In [15]:
# Join the tokens of each line with underscores; the last entry ('Class') is already a string and is added separately
newlines = []
for line in extracted_split_no_digits[:-1]:
    newlines.append('_'.join(line))

In [16]:
newlines.append('Class')
print(newlines)

['Sample_code_number', 'Clump_Thickness', 'Uniformity_of_Cell_Size', 'Uniformity_of_Cell_Shape', 'Marginal_Adhesion', 'Single_Epithelial_Cell_Size', 'Bare_Nuclei', 'Bland_Chromatin', 'Normal_Nucleoli', 'Mitoses', 'Class']


Finally, we have the names of the columns in a list called 'newlines'. We can assign it to the DataFrame we imported.

In [19]:
bc.columns = newlines
bc.head()

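| | Sample_code_number | Clump_Thickness | Uniformity_of_Cell_Size | Uniformity_of_Cell_Shape | Marginal_Adhesion | Single_Epithelial_Cell_Size | Bare_Nuclei | Bland_Chromatin | Normal_Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |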


In [33]:
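# Alternative, shorter solution:
# step 1 keeps the text after the leading number, step 2 keeps everything before the first double space
step_1 = [x.split(".")[1] for x in extracted]
step_2 = [x.replace(':', '').split("  ")[0].strip() for x in step_1]
step_2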

['Sample code number',
 'Clump Thickness',
 'Uniformity of Cell Size',
 'Uniformity of Cell Shape',
 'Marginal Adhesion',
 'Single Epithelial Cell Size',
 'Bare Nuclei',
 'Bland Chromatin',
 'Normal Nucleoli',
 'Mitoses',
 'Class']
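
The note in Q4 mentions regular expressions as another option. For readers who know them, here is one possible regex-based sketch; it is not the solution used above, just an alternative under the assumption that every row has the shape "number, dot, name, at least two spaces, domain". A few rows of `extracted` are copied in so the block runs on its own.

```python
import re

# A few of the lines from `extracted`, copied here so the block is self-contained
extracted_sample = [
    '   1. Sample code number            id number',
    '   2. Clump Thickness               1 - 10',
    '   3. Uniformity of Cell Size       1 - 10',
    '  10. Mitoses                       1 - 10',
    '  11. Class:                        (2 for benign, 4 for malignant)',
]

# Leading number and dot, then the name (captured), an optional colon,
# and finally the run of two or more spaces that separates name from domain
pattern = re.compile(r'^\s*\d+\.\s+(.+?):?\s{2,}')

names = [pattern.match(line).group(1).replace(' ', '_') for line in extracted_sample]
print(names)
```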

### Extract from the web

Extracting data from the web is also referred to as scraping. There are many use cases in data science where you may want to do that. Usually, you'd like to enrich your data with some extra information. Say you are building a house price forecasting model: extra information about the area or neighborhood could increase its predictive power. 

However, not all websites can be scraped freely. Some websites explicitly forbid automated scraping, and there can be valid reasons for that: 
1. The site wants to protect its data (e.g. the data is not publicly available, or is sensitive/proprietary).
2. Making many repeated requests to a website may use too much bandwidth, slowing down the website/service for other users. 

**Always check the website's policy regarding scraping before you go ahead and try to scrape it!**
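
Many sites publish part of their policy for automated clients in a `robots.txt` file. The sketch below shows how the standard library's `urllib.robotparser` could be used to check a URL against such a file. The rules and the `example.com` URLs are made up so the block runs offline; for a real site you would point the parser at the site's own file with `set_url(...)` and `read()`. Note that `robots.txt` is only one part of a policy; the terms of service still apply.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for a hypothetical site
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(user_agent, url) tells us whether the rules allow the request
allowed_public = parser.can_fetch('*', 'https://example.com/catalogue/page-1.html')
allowed_private = parser.can_fetch('*', 'https://example.com/private/data.html')
print(allowed_public, allowed_private)
```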

The first package we will work with is part of the Python standard library and is called *urllib*.

In [36]:
from urllib.request import urlopen, Request

# Define the url
url = 'https://www.theguardian.com/international'
# Send a request to it
request = Request(url)
# Catches the response, returning an HTTPResponse object
response = urlopen(request) 
# Use the .read() method of the object
html = response.read()

In [35]:
response

<http.client.HTTPResponse at 0x2e601390fd0>

In [39]:
# We see that when we read the response, we get sequence of bytes
#html

In [40]:
## Another solution, without using Request (saving a line of code):
url = 'https://www.theguardian.com/international'
page = urlopen(url)
html2 = page.read()

In [41]:
type(html2)

bytes

In [42]:
# We can use .decode() to decode the bytes to a string, using a 'utf-8' encoding
html2 = html2.decode('utf-8')
type(html2)

str

In [43]:
response.close()
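
The cells above close the response by hand and assume the page is encoded as UTF-8. As a hedged alternative, the sketch below uses a `with` block, which closes the response automatically, and asks the response headers for the character set before falling back to UTF-8. To keep it runnable offline, a `data:` URL (which carries its content inside the URL) stands in for a real web page; the HTML in it is made up.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Stand-in for a real page: the content is embedded in the URL itself
demo_url = 'data:text/html;charset=utf-8,' + quote('<h1>Café news</h1>')

# The with-block closes the response automatically, even if an error occurs
with urlopen(demo_url) as response:
    raw = response.read()  # bytes
    # Use the charset announced in the headers, or fall back to utf-8
    charset = response.headers.get_content_charset() or 'utf-8'

page_text = raw.decode(charset)  # str
print(type(raw), type(page_text))
print(page_text)
```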

### Extract from the web using Requests package

The Requests package is another way to do this. It is a very popular package that does the same job but provides a higher-level interface, so we need to write fewer lines of code. 

**requests** is one of the most widely used Python packages for making HTTP requests.

In [44]:
import requests

url = 'https://www.theguardian.com/international'
r = requests.get(url)
text_output = r.text  # the .text attribute returns the HTML as a string

In [46]:
type(text_output)

str

### HTML parser: BeautifulSoup

We see that the result is the same in both cases: a string containing the HTML of a web page. It can be a bit difficult to work with, as it mixes structured and unstructured data. You can extract structure from it manually, for example with regular expressions.

Alternatively, you can use a parser, which is a package that can do that for you. Enter **beautifulsoup**

In [265]:
# Install the package if you don't have it:

#!pip install beautifulsoup4

In [47]:
from bs4 import BeautifulSoup

In [48]:
# We create a BeautifulSoup object, passing the HTML string and the name of the parser to use
soup = BeautifulSoup(text_output, "lxml")

In [50]:
#print(soup.prettify())  # prints the parsed HTML with each tag on its own line, indented

Real web pages are quite complex and can be very long, so the print statement above is commented out. We can, however, select any tag within the page (where a tag is an element that starts with < name > and ends with < /name >).

Let's select all the buttons, for example. Note that the result will always be a list. 

Another note: one of the main difficulties with scraping tasks (especially if you are not very familiar with HTML and CSS) is figuring out what to pass to the select method. 
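
As a small cheat sheet for that difficulty, the sketch below runs a few common CSS selector patterns against a made-up HTML snippet (the tag names and classes are invented for illustration and only loosely resemble a real page). It uses Python's built-in `'html.parser'` so that it does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet, for illustration only
snippet = """
<div id="menu">
  <button class="menu-item News">News</button>
  <button class="menu-item Sport">Sport</button>
  <a href="/about" class="menu-item">About</a>
  <a>No link here</a>
</div>
"""

demo_soup = BeautifulSoup(snippet, 'html.parser')

by_tag = demo_soup.select('button')          # all <button> tags
by_class = demo_soup.select('.menu-item')    # anything with class "menu-item"
by_both = demo_soup.select('button.Sport')   # <button> tags that also have class "Sport"
by_id_child = demo_soup.select('#menu > a')  # <a> tags directly inside the tag with id "menu"
with_attr = demo_soup.select('a[href]')      # <a> tags that have an href attribute

print([tag.text for tag in by_tag])
print(len(by_class), by_both[0].text, len(by_id_child), with_attr[0]['href'])
```

A practical way to find the right selector on a real page is to use the browser's "Inspect" tool and read off the tag name, class or id of the element you want.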

In [51]:
soup.select('button')

[<button aria-expanded="true" aria-haspopup="true" class="menu-item__title menu-item__title--News hide-from-desktop js-navigation-toggle" data-link-name="nav2 : secondary : News" role="menuitem">
 <i class="menu-item__toggle"></i>
 News
 </button>,
 <button aria-expanded="true" aria-haspopup="true" class="menu-item__title menu-item__title--Opinion hide-from-desktop js-navigation-toggle" data-link-name="nav2 : secondary : Opinion" role="menuitem">
 <i class="menu-item__toggle"></i>
 Opinion
 </button>,
 <button aria-expanded="true" aria-haspopup="true" class="menu-item__title menu-item__title--Sport hide-from-desktop js-navigation-toggle" data-link-name="nav2 : secondary : Sport" role="menuitem">
 <i class="menu-item__toggle"></i>
 Sport
 </button>,
 <button aria-expanded="true" aria-haspopup="true" class="menu-item__title menu-item__title--Culture hide-from-desktop js-navigation-toggle" data-link-name="nav2 : secondary : Culture" role="menuitem">
 <i class="menu-item__toggle"></i>
 Cul

**Q: Extract the text of the Lifestyle button.**

In [289]:
### Your code goes here

soup.select('button')[4].text

'\n\nLifestyle\n'

**Q: Extract and print out the names of all the buttons, removing new line characters!**

In [294]:
## Your code goes here

for item in soup.select('button'):
    print(item.text.replace('\n', ''))

News
Opinion
Sport
Culture
Lifestyle
Search with google
International edition
More
     More Opinion 
     More Sport 
     More This is Europe 
     More Climate crisis 
     More Culture 
About 
     More Lifestyle 
     More Explore 
     More In pictures 
Close


In [270]:
soup.title  # extract the title

<title>News, sport and opinion from the Guardian's global edition | The Guardian</title>

In [274]:
#soup.get_text() # extract the text

Often, you only need to extract particular information from a page, such as images or hyperlinks. 

In [296]:
#soup.find_all('img')

The result is a list, which we can unpack a little. Each item in the list is a Tag object. Let's check the type of the first item and then extract it. 

In [213]:
type(soup.find_all('img')[0])

bs4.element.Tag

In [215]:
image1 = soup.find_all('img')[0]
image1

<img alt="" class="responsive-img" loading="lazy" src="https://i.guim.co.uk/img/media/69c583a21604059f50de525369a1749aa1c754a5/0_9_3500_2102/master/3500.jpg?width=700&amp;quality=85&amp;auto=format&amp;fit=max&amp;s=f355341ca002b8a12c6bf4b3515ed2fe"/>

In [217]:
# You can access the source attribute, stored in src 
image1['src']

'https://i.guim.co.uk/img/media/69c583a21604059f50de525369a1749aa1c754a5/0_9_3500_2102/master/3500.jpg?width=700&quality=85&auto=format&fit=max&s=f355341ca002b8a12c6bf4b3515ed2fe'

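Not all images can be downloaded. Requesting the URL below returns a '401 Unauthorized - missing signature' error. Note that this URL was copied from the printed tag, so its query parameters are separated by the HTML-escaped `&amp;` rather than a plain `&`. That is a likely reason the server does not find the `s` (signature) parameter, and passing `image1['src']` instead may give a different result. Let's look at the response.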


In [300]:
image = requests.get('https://i.guim.co.uk/img/media/69c583a21604059f50de525369a1749aa1c754a5/0_9_3500_2102/master/3500.jpg?width=700&amp;quality=85&amp;auto=format&amp;fit=max&amp;s=f355341ca002b8a12c6bf4b3515ed2fe')

image.content

b'\n<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n<html>\n  <head>\n    <title>401 Unauthorized - missing signature</title>\n  </head>\n  <body>\n    <h1>Error 401 Unauthorized - missing signature</h1>\n    <p>Unauthorized - missing signature</p>\n    <h3>Guru Mediation:</h3>\n    <p>Details: cache-ams21078-AMS 1610208940 1702990014</p>\n    <hr>\n    <p>Varnish cache server</p>\n  </body>\n</html>\n'
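
One thing worth checking after an error like this is whether the URL still contains HTML escapes: the URL passed to `requests.get` above uses `&amp;`, while `image1['src']` contains plain `&`. The sketch below shows how the standard library's `html.unescape` turns one form into the other. The URL is a shortened, made-up one in the same style, and whether the real server accepts the cleaned-up URL is not something this sketch can confirm.

```python
import html
from urllib.parse import parse_qs, urlsplit

# Shortened, made-up URL in the same style as the image URL above
escaped_url = 'https://example.com/img/3500.jpg?width=700&amp;quality=85&amp;s=abc123'

# html.unescape turns HTML entities such as &amp; back into plain characters
clean_url = html.unescape(escaped_url)
params = parse_qs(urlsplit(clean_url).query)

print(clean_url)
print(sorted(params))  # the parameter names a server would see
```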

**Take-home challenge (not graded and not to be submitted): Go to https://books.toscrape.com/ and scrape it. Return the title of every book with a 2-star rating (not >= 2 stars, only those books with exactly 2 stars).** 

In [301]:
# Note that the books are not all listed on one page but spread across multiple pages. 

base_url = 'https://books.toscrape.com/catalogue/page-{}.html'

# Inspecting a few items shows that we need the class 'star-rating Two', which sits inside an element of class 'product_pod'.
# Start with the first page

res = requests.get(base_url.format(1))

In [305]:
soup = BeautifulSoup(res.text, 'lxml')

In [312]:
# Select all elements with the class product_pod; the length is 20 since there are 20 books per page
products = soup.select('.product_pod')
len(products)

20

In [321]:
# Now we want to grab the title for 2-star rated books, let's first do it for the first book: 

example = products[0]
print(example)

<article class="product_pod">
<div class="image_container">
<a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">Â£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>


In [323]:
# We see this book has a 3-star rating; let's confirm that by selecting the class
example.select('.star-rating.Three')

[<p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>]

In [324]:
# How can we grab the title? We see it in the linking element 'a'
example.select('a')

[<a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>,
 <a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>]

In [329]:
# First result is image, the second is the title
print(example.select('a')[1])

# Now we need to grab the title of this result; we call ['title']
print(example.select('a')[1]['title'])

<a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>
A Light in the Attic
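
Note that the `href` and `src` values in the printed book are relative links. If you wanted to follow them, they would first have to be combined with the URL of the page they came from. A minimal sketch using `urljoin` from the standard library, with the two relative links copied from the output above:

```python
from urllib.parse import urljoin

# URL of the page the links were found on (base_url with page number 1)
page_url = 'https://books.toscrape.com/catalogue/page-1.html'

# Relative links copied from the printed `example`
book_href = 'a-light-in-the-attic_1000/index.html'
image_src = '../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg'

book_url = urljoin(page_url, book_href)
image_url = urljoin(page_url, image_src)
print(book_url)
print(image_url)
```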


In [334]:
# Now we need to do this for all books in the page, then all pages

two_star_titles = []

for n in range(1, 51):
    scrape_url = base_url.format(n)
    res = requests.get(scrape_url)
    
    soup = BeautifulSoup(res.text, 'lxml')
    books = soup.select('.product_pod')
    
    for book in books:
        # If the list is not empty, then I have a 2-stars book
        if len(book.select('.star-rating.Two')) != 0: 
            book_title = book.select('a')[1]['title']
            two_star_titles.append(book_title)

In [335]:
len(two_star_titles)

196
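
The per-page logic of the loop above can also be pulled out into a function, which makes it easy to try on a small piece of HTML before running it against all 50 pages. The sketch below is one possible way to do that; `titles_with_rating` is a hypothetical helper name, and the HTML fragment and book titles are made up (only the class names match the real site).

```python
from bs4 import BeautifulSoup

# Made-up page fragment that reuses the class names of the real site
page_html = """
<article class="product_pod">
  <p class="star-rating Two"></p>
  <h3><a href="book-a/index.html" title="Hypothetical Book A">Hypothetical ...</a></h3>
</article>
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="book-b/index.html" title="Hypothetical Book B">Hypothetical ...</a></h3>
</article>
"""


def titles_with_rating(html_text, rating_word):
    """Return titles of books whose star-rating class contains rating_word, e.g. 'Two'."""
    page = BeautifulSoup(html_text, 'html.parser')
    titles = []
    for book in page.select('.product_pod'):
        # select_one returns the first match, or None when there is none
        if book.select_one('.star-rating.' + rating_word) is not None:
            titles.append(book.select_one('h3 a')['title'])
    return titles


print(titles_with_rating(page_html, 'Two'))
```

Inside the loop you would then call `titles_with_rating(res.text, 'Two')` for each page and extend the result list. Adding a short `time.sleep(...)` between requests is a common courtesy toward the server.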

### Interact with HTML forms

If we want to scrape a website using a specific query, then **BeautifulSoup** may not get us very far (though it is still a very useful package). 

When we need to interact with a website (click a button, for example), we need another approach. One package that allows us to do that is **MechanicalSoup**.

In [219]:
!pip install MechanicalSoup

Collecting MechanicalSoup
  Downloading MechanicalSoup-1.0.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: MechanicalSoup
Successfully installed MechanicalSoup-1.0.0


In [220]:
import mechanicalsoup

We first create a Browser object, which represents a headless web browser. 

In [221]:
browser = mechanicalsoup.Browser()

Then we specify the URL and request it using the browser's .get(url) method. Let's work with a simple URL, which is a login page.

In [244]:
url = 'http://olympus.realpython.org/login'
page = browser.get(url)

Again we get a Response object; the number in [200] is the HTTP status code returned by the request, where 200 means the request was successful.

In [245]:
page 

<Response [200]>
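
If you come across a status code you do not recognise, the standard library's `http.HTTPStatus` can translate it into its standard phrase. A minimal sketch: 200 and 401 both appear in this workshop, 404 is added for comparison, and `is_success` is a hypothetical helper.

```python
from http import HTTPStatus

# Look up the standard phrase for a few status codes
for code in (200, 401, 404):
    status = HTTPStatus(code)
    print(code, status.phrase)


def is_success(code):
    """Codes from 200 to 299 signal a successful request (hypothetical helper)."""
    return 200 <= code < 300


print(is_success(200), is_success(401))
```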

MechanicalSoup uses BeautifulSoup to parse the HTML object, as we can see below:

In [246]:
type(page.soup)

bs4.BeautifulSoup

In [247]:
page.soup

<html>
<head>
<title>Log In</title>
</head>
<body bgcolor="yellow">
<center>
<br/><br/>
<h2>Please log in to access Mount Olympus:</h2>
<br/><br/>
<form action="/login" method="post" name="login">
Username: <input name="user" type="text"/><br/>
Password: <input name="pwd" type="password"/><br/><br/>
<input type="submit" value="Submit"/>
</form>
</center>
</body>
</html>

Notice that the page has a < form > with 'Username' and 'Password' inputs. The correct username is 'zeus' and the password is 'ThunderDude'. If you enter them manually on the page, you will be directed to the 'Profiles' page. 

Note from above that the < form > has its **name** attribute set to *login*; the first *input* element is the 'Username', the second is the 'Password' and the third is the 'Submit' button. 

In [261]:
# We save the soup object as html and extract the form.

html = page.soup
form = html.select("form")[0]  # select() returns a list, so [0] picks out the Tag itself
print(type(form))
print(form)

<class 'bs4.element.Tag'>
<form action="/login" method="post" name="login">
Username: <input name="user" type="text" value="zeus"/><br/>
Password: <input name="pwd" type="password" value="ThunderDude"/><br/><br/>
<input type="submit" value="Submit"/>
</form>


In [260]:
# Select the username input (0) and password input (1), and assign to them the correct values
form.select("input")[0]["value"] = "zeus"
form.select("input")[1]["value"] = "ThunderDude"


# Submit the form using the 'SUBMIT' button, passing the form object and the url of the login page
profiles_page = browser.submit(form, page.url)
# To confirm we successfully logged in
profiles_page.url

'http://olympus.realpython.org/profiles'
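
It can help to see roughly what a form submission sends to the server. A form with `method="post"` and no explicit encoding is sent as URL-encoded name/value pairs in the body of a POST request. The sketch below builds such a request with the standard library but deliberately does not send it; it illustrates the idea and is not a description of MechanicalSoup's internals.

```python
from urllib.parse import urlencode
from urllib.request import Request

login_url = 'http://olympus.realpython.org/login'

# Field names come from the name attributes of the <input> tags in the form.
# The submit button has no name attribute, so it contributes nothing.
form_data = {'user': 'zeus', 'pwd': 'ThunderDude'}
body = urlencode(form_data).encode('ascii')

# Build the POST request but do not send it; this only shows its shape
login_request = Request(login_url, data=body, method='POST')
print(login_request.get_method())
print(login_request.data)
```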

Word of caution: attackers can use a similar approach to brute-force logins by trying many different combinations.

**Do not try this at home!** Attempting to break into accounts that are not yours is illegal, and many websites will lock you out after too many failed requests and block your IP address.

**Take-home challenge 2 (again, for your own practice, no grading or submission!): Pick a website on which you have an account and try to replicate the exercise above. In other words, use MechanicalSoup to get a response from your URL of choice, then find the appropriate log-in/sign-in form and enter your credentials to log in.** 

In [263]:
###### Your code goes here




