The beginning of this lecture is based on chapter 11 of *Automate the Boring Stuff with Python* by Al Sweigart.

**Note**: To make the given links work correctly, it is assumed that you start the notebook server on your VM from within the directory `/synced_folder/lecture_notes/`. That is,

```bash
vagrant@vagrant:~$ cd /synced_folder/lecture_notes/
vagrant@vagrant:/synced_folder/lecture_notes$ notebook
```


# HTML Refresher

HTML files are plain text files containing *tags*, which are words enclosed in angle brackets. Tags tell the browser how to format the web page. A starting tag and closing tag can enclose some text to form an element. The text (or inner HTML) is the content between the starting and closing tags.

There are many different tags in HTML. Some of these tags have extra properties in the form of attributes within the angle brackets. For example, the `<a>` tag encloses text that should be a link, and its `href` attribute holds the URL that the text links to.

Some elements have an `id` attribute that is used to uniquely identify the element in the page. You will often instruct your programs to seek out an element by its id attribute, so figuring out an element’s id attribute using the browser’s developer tools is a common task in writing web scraping programs.
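
To make the terms tag, attribute, and `id` concrete, here is a minimal sketch that lists the starting tags of a small snippet together with their attributes. It uses only `html.parser` from the standard library; the snippet, the class name `TagLister`, and the URL in it are made up for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical snippet, made up for this illustration.
SNIPPET = '<p>Taught by <a id="lecturer" href="https://example.com">Jane Doe</a>.</p>'


class TagLister(HTMLParser):
    """Collect (tag name, attribute dict) for every starting tag."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.start_tags.append((tag, dict(attrs)))


lister = TagLister()
lister.feed(SNIPPET)
for tag, attrs in lister.start_tags:
    print(tag, attrs)
```

The `<p>` tag carries no attributes, while the `<a>` tag carries both an `id` and an `href`.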

# View a Page's HTML Sources

Here, I will only describe how to use Firefox's developer tools.

To view a page's source, right-click on it and choose **View Page Source**, which opens a new tab with the HTML source. For example, the Jupyter Notebook server serves static files under http://127.0.0.1:8888/files. Open, for example, the file `boliga_1050-1549_1.html` in Firefox: http://127.0.0.1:8888/files/data/boliga_1050-1549_1.html.

![screenshot](images/view_source.png)


In Firefox, you can bring up the Web Developer Tools Inspector by pressing `CTRL-SHIFT-C` on Windows and Linux or `CMD-OPTION-C` on OS X.

![screenshot](images/inspector.png)

# Parsing HTML with BeautifulSoup

BeautifulSoup is a module for parsing and extracting information from HTML sources. The module’s name is `bs4`. In case it is not already installed on your VM, install it with `pip install beautifulsoup4`. While `beautifulsoup4` is the name used for installation, you import BeautifulSoup with `import bs4`.

According to its documentation (https://www.crummy.com/software/BeautifulSoup/) *"Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match "foo.com", or "Find the table heading that's got bold text, then give me that text.""*


## Creating a BeautifulSoup Object from a Local HTML File

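The `bs4.BeautifulSoup()` function needs to be called with a string containing the HTML it will parse, and it returns a `BeautifulSoup` object. The second argument names the parser to use; `'html5lib'` requires the separate `html5lib` package.

You can load a local HTML file and pass either its contents as a string, as in the cell below, or the open file object itself to `bs4.BeautifulSoup()`.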

In [None]:
import bs4


with open('./data/example.html') as f:
    example_html = f.read()
    
soup = bs4.BeautifulSoup(example_html, 'html5lib')
type(soup)
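
The cell above reads the file into a string first. As a minimal sketch of the other option mentioned, passing a file object directly, the following uses `io.StringIO` as a stand-in for an open file so that it runs without `example.html`. The markup is made up, and the built-in `'html.parser'` is used here instead of `'html5lib'` only to avoid an extra dependency.

```python
import io

import bs4

# io.StringIO stands in for an open file here; the markup is made up.
fake_file = io.StringIO('<html><head><title>Example</title></head>'
                        '<body><p>Hello</p></body></html>')

# BeautifulSoup accepts an open file handle as well as a string.
soup_from_file = bs4.BeautifulSoup(fake_file, 'html.parser')
print(soup_from_file.title.string)
```

With a real file, the equivalent would be to pass `f` itself inside the `with open(...) as f:` block.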

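## Creating a BeautifulSoup Object from a Remote HTML File

To parse a page from the web, download it with the `requests` module and pass the response text to `bs4.BeautifulSoup()`.
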
In [None]:
import bs4
import requests


r = requests.get('https://github.com/datsoftlyngby/soft2017fall-business-intelligence-teaching-material')
r.raise_for_status()
soup = bs4.BeautifulSoup(r.text, 'html5lib')

print(soup.prettify()[:1500])

## Finding an Element with the `select()` Method

You can retrieve HTML elements from a `BeautifulSoup` object by calling the `select()` method and passing a string of a CSS selector for the element you are looking for. Selectors are like regular expressions: They specify a pattern to look for, in this case, in HTML pages instead of general text strings.

Common CSS selector patterns include:

  * `soup.select('div')` ... selects all elements named `<div>`
  * `soup.select('#lecturer')`  ... selects the element with an id attribute of lecturer
  * `soup.select('.notice')` ... selects all elements that use a CSS class attribute named notice
  * `soup.select('div span')` ... selects all elements named `<span>` that are within an element named `<div>`
  * `soup.select('div > span')` ... selects all elements named `<span>` that are directly within an element named `<div>`, with no other element in between
  * `soup.select('input[name]')` ... selects all elements named `<input>` that have a name attribute with any value
  * `soup.select('input[type="button"]')` ... selects all elements named `<input>` that have an attribute named type with value button
  
See more in the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
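
The selector patterns listed above can be tried out on a small document. This is only a sketch: the markup is made up so that each pattern matches something, and `'html.parser'` is used instead of `'html5lib'` to keep the example free of extra dependencies.

```python
import bs4

# Hypothetical markup, made up to exercise the selectors listed above.
html = """
<div>
  <span id="lecturer">Jane Doe</span>
  <p class="notice"><span>Nested span</span></p>
</div>
<form>
  <input name="q" type="text">
  <input type="button" value="Go">
</form>
"""
demo_soup = bs4.BeautifulSoup(html, 'html.parser')

print(len(demo_soup.select('div')))                   # elements named <div>
print(demo_soup.select('#lecturer')[0].get_text())    # element with id "lecturer"
print(len(demo_soup.select('.notice')))               # elements with class "notice"
print(len(demo_soup.select('div span')))              # <span> anywhere inside a <div>
print(len(demo_soup.select('div > span')))            # <span> directly inside a <div>
print(len(demo_soup.select('input[name]')))           # <input> with a name attribute
print(len(demo_soup.select('input[type="button"]')))  # <input> with type "button"
```

Note the difference between `'div span'` and `'div > span'`: the span nested inside the `<p>` is matched only by the first pattern.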

In [None]:
import bs4


with open('./data/example.html') as f:
    example_html = f.read()
    
soup = bs4.BeautifulSoup(example_html, 'html5lib')

elems = soup.select('#lecturer')

#print(soup.prettify())
print(type(elems))
print(len(elems))
print(type(elems[0]))
print(elems[0].getText())
print(elems[0].text)
print(str(elems[0]))
print(elems[0].attrs)
print(elems[0].contents)

In [None]:
p_elems = soup.select('p')

for el in p_elems:
    # str(p_elems[0]), str(p_elems[1]),...
    print(str(el))
    print(el.getText())
    print('------------')

## Getting Data from an Element’s Attributes

The `get()` method for Tag objects makes it simple to access attribute values from an element. The method is passed a string of an attribute name and returns that attribute’s value.

In [None]:
import bs4


with open('./data/example.html') as f:
    example_html = f.read()
    
soup = bs4.BeautifulSoup(example_html, 'html5lib')

span_elem = soup.select('span')[0]
print(str(span_elem))
print(span_elem.get('id'))
print(span_elem.get('some_nonexistent_attr') is None)
print(span_elem.attrs)
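
As a small sketch of how `get()` behaves for missing attributes: it returns `None` (or a fallback value passed as second argument), whereas square-bracket access raises a `KeyError`. The one-element document below is made up for illustration.

```python
import bs4

# Hypothetical one-element document for illustration.
tag = bs4.BeautifulSoup('<span id="lecturer">Jane Doe</span>',
                        'html.parser').span

print(tag.get('id'))                 # attribute exists
print(tag.get('class', 'no-class'))  # missing attribute: fallback value is returned

try:
    tag['class']                     # square brackets raise for a missing attribute
except KeyError:
    print('no class attribute')
```

Using `get()` is therefore the safer choice when you are not sure that every element carries the attribute.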

# Example: Extract Data from a Page


Usually, you will use web scraping to collect information that you cannot gather otherwise. For example, we want to collect sales data for flats and houses in Copenhagen.

Since we cannot find an API or any other open dataset, we decide to scrape a publicly available homepage. In your exercises, you will scrape a mirror of boliga.dk/sales. During class we will consider similar but smaller pages from within this repository. These are

  * http://127.0.0.1:8888/files/data/boliga_1050-1549_1.html
  * http://127.0.0.1:8888/files/data/boliga_1050-1549_2.html
  * http://127.0.0.1:8888/files/data/boliga_1550-1799_1.html

The first two pages list some sales in the zipcode areas 1050 to 1549. The third page lists sales in the zipcode areas 1550 to 1799.

**Note**: Many web pages are not built to support high traffic, or they explicitly discourage automatic access. Keep this in mind when writing your scraping tool.
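
One common way for a site to state its wishes is a `robots.txt` file. The following is a minimal sketch, using `urllib.robotparser` from the standard library, of how such rules could be checked before scraping. The rules and the `example.com` URLs are made up; for a real site you would download its `/robots.txt` first, and you could additionally pause between requests with `time.sleep()`.

```python
from urllib import robotparser

# Hypothetical robots.txt content; a real one would be downloaded from
# the site's /robots.txt before scraping.
robots_lines = [
    'User-agent: *',
    'Disallow: /private/',
    'Crawl-delay: 2',
]

rp = robotparser.RobotFileParser()
rp.parse(robots_lines)

print(rp.can_fetch('*', 'https://example.com/sales'))         # allowed path
print(rp.can_fetch('*', 'https://example.com/private/data'))  # disallowed path
print(rp.crawl_delay('*'))                                    # requested pause in seconds
```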


Let's have a look at the first page http://127.0.0.1:8888/files/data/boliga_1050-1549_1.html

In [None]:
from IPython.display import IFrame

# This code is just to inline the webpage into this notebook
url = 'http://127.0.0.1:8888/files/data/boliga_1050-1549_1.html'
IFrame(url, width=700, height=400)

In [None]:
import bs4
import requests


url = 'http://127.0.0.1:8888/files/data/boliga_1050-1549_1.html'
r = requests.get(url)
soup = bs4.BeautifulSoup(r.content.decode('utf-8'), 'html5lib')
print(soup.prettify())

Now, let's say that we want to create two CSV files: one containing all data for Copenhagen's city center (`1050-1549.csv`) and one for Vesterbro (`1550-1799.csv`). To do that, we need a function that scrapes all data from a single page.

In [None]:
def scrape_housing_data(url):

    data = []
    
    r = requests.get(url)
    r.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
    soup = bs4.BeautifulSoup(r.content.decode('utf-8'), 'html5lib')

    table = soup.find('table')
    table_body = table.find('tbody')

    rows = table_body.find_all('tr')
    for row in rows:
        cols = row.find_all('td')

        # Decode address column
        address_str = cols[0].text.strip()
        street_str = ' '.join(address_str.split(' ')[:-3])
        city_str = ' '.join(address_str.split(' ')[-3:])
        zip_number = int(address_str.split(' ')[-3])

        # Decode number of rooms
        no_rooms_str = cols[1].text.strip()
        no_rooms = int(no_rooms_str)
        
        # Decode size in square meters
        size_in_sq_m_str = cols[2].text.strip()
        size_in_sq_m = int(size_in_sq_m_str)

        # Decode year of construction
        year_of_construction_str = cols[3].text.strip()
        year_of_construction = int(year_of_construction_str)
        
        # Decode price
        price_str = cols[4].text.strip()
        price = float(price_str)   

        # Decode sales date
        sale_date_str = cols[5].text.strip()

        decoded_row = (street_str, city_str, zip_number, no_rooms,
                       size_in_sq_m, year_of_construction, price, 
                       sale_date_str)
        data.append(decoded_row)
    
    print('Scraped {} sales...'.format(len(data)))
    
    return data
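
The address decoding in the function above relies on the last three space-separated tokens being zip code, city, and district. Here is a worked sketch of that splitting on a single made-up address string (the address itself is hypothetical, not taken from the data):

```python
# Hypothetical address string in the "<street> <zip> <city> <district>" shape
# that the splitting logic above assumes.
demo_address = 'Nyhavn 12, 2. tv 1051 København K'

parts = demo_address.split(' ')
demo_street = ' '.join(parts[:-3])   # everything before the last three tokens
demo_city = ' '.join(parts[-3:])     # zip code, city, and district
demo_zip = int(parts[-3])            # first of the last three tokens

print(demo_street)
print(demo_city)
print(demo_zip)
```

Keep in mind that this approach breaks for addresses whose city part does not consist of exactly three tokens, so real data may need a more robust rule.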

In [None]:
base_url = 'http://127.0.0.1:8888/files/data/boliga_1050-1549_1.html' 
housing_data = scrape_housing_data(base_url)
housing_data[:3]

## Saving a CSV File with the Scraped Data 

After scraping the data from each HTML page, we receive a list of tuples. Each tuple will be a line in a CSV file.

In [None]:
import csv


def save_to_csv(data, path='./out/boliga.csv'):
    
    # newline='' lets the csv module control line endings itself
    with open(path, 'w', encoding='utf-8', newline='') as output_file:
        output_writer = csv.writer(output_file)
        output_writer.writerow(['street', 'city', 'zipcode', 
                                'no_rooms', 'size_in_sq_m', 
                                'year_of_construction', 'price', 
                                'sale_date_str'])

        for row in data:
            output_writer.writerow(row)
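
To see what the resulting CSV lines look like without touching the file system, the same `csv.writer` calls can be pointed at an in-memory buffer. This is a sketch with two made-up rows in the column order used above; none of the values come from the real data.

```python
import csv
import io

# Two made-up rows in the same column order as the tuples built above.
demo_rows = [
    ('Nyhavn 12, 2. tv', '1051 København K', 1051, 3, 85, 1780, 4500000.0, '01-02-2017'),
    ('Examplegade 1', '1100 København K', 1100, 2, 60, 1900, 2750000.0, '15-03-2017'),
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['street', 'city', 'zipcode', 'no_rooms', 'size_in_sq_m',
                 'year_of_construction', 'price', 'sale_date_str'])
writer.writerows(demo_rows)

print(buffer.getvalue())
```

Note that the first street contains a comma, so the writer wraps that field in double quotes; `csv.reader` undoes this when the file is read back.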

In [None]:
import os


out_dir = './data/out'
# Create the output directory if it does not exist yet
os.makedirs(out_dir, exist_ok=True)

base_url = 'http://127.0.0.1:8888/files/data'
pages = ['boliga_1050-1549_1.html', 'boliga_1050-1549_2.html', 'boliga_1550-1799_1.html']
# URLs always use forward slashes, so join with '/' rather than os.path.join
urls = ['{}/{}'.format(base_url, page) for page in pages]

# The first two pages both cover the zip codes 1050 to 1549
fst_forty_results = scrape_housing_data(urls[0])
snd_forty_results = scrape_housing_data(urls[1])
fst_results = fst_forty_results + snd_forty_results

# 'boliga_1050-1549_1.html' -> '1050-1549.csv'
save_to_file = os.path.join(out_dir, os.path.basename(urls[0]).split('_')[1] + '.csv')
save_to_csv(fst_results, save_to_file)

# The third page covers the zip codes 1550 to 1799
last_results = scrape_housing_data(urls[2])
save_to_file = os.path.join(out_dir, os.path.basename(urls[2]).split('_')[1] + '.csv')
save_to_csv(last_results, save_to_file)
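
The cell above builds page URLs and derives the CSV file name from the page name. As a hedged alternative sketch, the same two steps can be done with `urllib.parse`, which treats the strings as URLs rather than file system paths. The variable names starting with `demo_` are introduced here only for illustration.

```python
import posixpath
from urllib.parse import urljoin, urlparse

demo_base = 'http://127.0.0.1:8888/files/data/'   # trailing slash matters for urljoin
demo_page = 'boliga_1050-1549_1.html'

demo_url = urljoin(demo_base, demo_page)
demo_file = posixpath.basename(urlparse(demo_url).path)   # 'boliga_1050-1549_1.html'
demo_zip_range = demo_file.split('_')[1]                  # middle part of the name
demo_csv_name = demo_zip_range + '.csv'

print(demo_url)
print(demo_csv_name)
```

Without the trailing slash on the base URL, `urljoin` would replace the last path segment (`data`) instead of appending to it.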

In [None]:
%%bash

ls -ltr ./data/out/1*.csv

In [None]:
%%bash

head ./data/out/1050-1549.csv

# Your Turn

![](https://camo.githubusercontent.com/320b4791da998fd94e34ad4a85d44d8d5a581ca4/68747470733a2f2f732d6d656469612d63616368652d616b302e70696e696d672e636f6d2f6f726967696e616c732f39662f37332f65332f39663733653366386139353864626530336230663736663838313161353461312e676966)

Now, you will extend the above example to some real data. See the assignment description at:
https://github.com/datsoftlyngby/soft2017fall-business-intelligence-teaching-material/tree/master/assignments/assignment_2


# A Small Detour...

If you want to install a new package, such as the `tqdm` package, you can search for it in the Anaconda repository and, if it is available there, install it with `conda`.

In [None]:
%%bash
conda search tqdm

In [None]:
%%bash
conda install -y tqdm

In [None]:
from time import sleep
from tqdm import tqdm


for i in tqdm(range(10)):
    sleep(1)