# Mini Intro to BeautifulSoup & Requests
by Dr Liang Jin

Part of MiniPy Sessions: [github.com/drliangjin/minipy](https://github.com/drliangjin/minipy)

Official BeautifulSoup Docs: [crummy.com/software/BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/)

Official Requests Docs: [python-requests.org](http://python-requests.org)

References: [Web Scraping with Python](http://www.pythonscraping.com/)

### Main Components of a web page
- `HTML`: main contents of the page, such as text and data
- `CSS`: styles of the page, such as colors, borders, and so on
- `JavaScript`: advanced functionality such as interactions

### HTML, a tag-based language
- `<html>` `</html>`: everything inside these tags is `HTML`
- `<head>` `</head>` tag for metadata about the page
- `<body>` `</body>` tag for the main contents of the page
- `<h1>` `</h1>` tag for a level 1 header
- `<p>` `</p>` tag for a paragraph
- `<div>` `</div>` tag for a division
- `<a>` `</a>` tag for an anchor (hyperlink); its destination is given by the `href` attribute
- `<table>` `</table>` tag for a table, built from `<tr>` (rows), `<th>` (header cells), and `<td>` (data cells)
- `<form>` `</form>` tag for an input form
- See [MDN web docs](https://developer.mozilla.org/en-US/docs/Web/HTML/Element) for a full list of tags
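
To make the tag list above concrete, here is a minimal sketch that parses a small hand-written page with BeautifulSoup. The HTML string and its contents are made up for illustration, and `html.parser` is used so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page that uses the tags listed above
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1>Main header</h1>
    <div>
      <p>First paragraph with a <a href="https://example.com">link</a>.</p>
    </div>
    <table>
      <tr><th>Name</th><th>Value</th></tr>
      <tr><td>alpha</td><td>1</td></tr>
    </table>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

title = soup.title.get_text()      # text inside <title>
header = soup.h1.get_text()        # first <h1> in the page
paragraph = soup.p.get_text()      # text of the first <p>, including nested tags
link_target = soup.a['href']       # href attribute of the first <a>
cells = [td.get_text() for td in soup.find_all('td')]

print(title, header, paragraph, link_target, cells, sep='\n')
```

Accessing `soup.h1` or `soup.p` returns only the first matching tag; `find_all()` returns every match.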

### HTML attributes
- `class`: assigns one or more class names (such as `city`) to an element; CSS uses them to apply styles
- `id`: specifies a unique id for an HTML element
- `src`: specifies the URL of an embedded resource, such as an image
- `href`: specifies the destination address of the link
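
As a rough sketch of how these attributes can be read with BeautifulSoup: the snippet below uses a made-up piece of HTML, so the `id`, `class`, `src` and `href` values are illustrative only.

```python
from bs4 import BeautifulSoup

# Made-up snippet: all attribute values here are illustrative only
html = """
<div id="profile" class="card city">
  <img src="/images/london.png" alt="London">
  <a href="https://example.com/london">Read more</a>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

div = soup.find('div', id='profile')   # look up an element by its unique id
classes = div['class']                 # class is multi-valued, so bs4 returns a list
img_src = div.img['src']               # index a tag like a dict to read an attribute
href = div.a.get('href')               # .get() returns None if the attribute is missing
missing = div.a.get('title')           # this <a> has no title attribute

print(classes, img_src, href, missing)
```

Indexing with `tag['name']` raises a `KeyError` when the attribute is absent, so `.get()` is the safer choice when scraping pages you do not control.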

In [None]:
import webbrowser
from urllib.request import urlopen, urlretrieve
import requests
# conda install beautifulsoup4
# conda install lxml <= optional third-party HTML parser (faster than the built-in one)
from bs4 import BeautifulSoup
import pandas as pd
import re

### Open webpage

In [None]:
url = 'https://drliangjin.github.io/simple-webpage/'

webbrowser.open(url); # please spend some time reading through the page source

### `BeautifulSoup` to the Rescue!

In [None]:
# let's cook soup!
url = 'https://drliangjin.github.io/simple-webpage/'
html = urlopen(url)

soup = BeautifulSoup(html, 'lxml') # <== or 'html.parser'
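
If `lxml` is not installed, `BeautifulSoup(html, 'lxml')` fails. One possible way to handle that is sketched below; `make_soup` is a hypothetical helper name, not part of BeautifulSoup, and the markup is made up.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup):
    """Parse markup with lxml if available, otherwise with the built-in parser."""
    try:
        return BeautifulSoup(markup, 'lxml')
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed
        return BeautifulSoup(markup, 'html.parser')

soup = make_soup('<html><body><h1>Hello</h1></body></html>')
print(soup.h1.get_text())
```

The two parsers can build slightly different trees from badly formed HTML, so it is worth sticking to one parser within a project where possible.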

### Let's Scrape something!

In [None]:
# scrape some text
all_h1 = soup.find_all('h1')
for h1 in all_h1:
    print(h1.text)
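
The difference between `find()` and `find_all()` is easy to trip over, so here is a small sketch on made-up markup that needs no network access.

```python
from bs4 import BeautifulSoup

html = "<h1> First </h1><p>text</p><h1>Second</h1>"   # made-up markup
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('h1')        # a single Tag, or None if nothing matches
every = soup.find_all('h1')    # a list-like ResultSet with all matches
nothing = soup.find('h2')      # there is no <h2> in this markup

# strip=True trims leading and trailing whitespace from the text
headers = [h1.get_text(strip=True) for h1 in every]
print(first.get_text(strip=True), headers, nothing)
```

Because `find()` can return `None`, check the result before calling methods on it when a tag might be missing.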

### Scrape and store web data

In [None]:
# scrape a table
table = soup.find('table')

# let's write to a local csv file
with open('karabiner.csv', 'w') as f:
    for row in table.find_all('tr')[1:]: # skip the header row
        cells = [cell.get_text() for cell in row.find_all('td')]
        f.write(','.join(cells)) # commas between cells, none after the last one
        f.write('\n')
# Pandas approach...
dfs = pd.read_html(url, header=0) # returns a list of DataFrames, one per table
df = dfs[0]
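
Writing the commas by hand, as in the cell above, breaks when a cell itself contains a comma. A possible alternative is the standard library `csv` module, which quotes such cells. The table below is made up, and an in-memory buffer stands in for the output file.

```python
import csv
import io
from bs4 import BeautifulSoup

# Made-up table; note the comma inside one of the cells
html = """
<table>
  <tr><th>Key</th><th>Action</th></tr>
  <tr><td>caps_lock</td><td>escape, control</td></tr>
  <tr><td>right_command</td><td>hyper</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')

rows = []
for tr in table.find_all('tr'):
    # collect the header (<th>) and data (<td>) cells of each row
    cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    rows.append(cells)

buffer = io.StringIO()             # stands in for open('some_file.csv', 'w', newline='')
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file, the `csv` documentation recommends opening it with `newline=''` so that line endings are not translated twice.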

In [None]:
# we want to capture highlighted text
url = 'http://www.pythonscraping.com/pages/warandpeace.html'

webbrowser.open(url);

In [None]:
# the names are highlighted in a distinct colour (class 'green'), so we can filter on that class
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')

names = soup.find_all('span', {'class': 'green'})
for name in names:
    print(name.get_text())
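
Filtering by class can be written in a few equivalent ways. The sketch below uses made-up markup (the names are placeholders, not text from the page above) to compare them.

```python
from bs4 import BeautifulSoup

# Made-up markup with two colour classes
html = """
<p><span class="red">"Hello,"</span> said <span class="green">Alice</span>
to <span class="green">Bob</span>.</p>
"""
soup = BeautifulSoup(html, 'html.parser')

by_dict = soup.find_all('span', {'class': 'green'})   # attribute dictionary
by_kwarg = soup.find_all('span', class_='green')      # class_ avoids the Python keyword
by_css = soup.select('span.green')                    # CSS selector syntax

names = [tag.get_text() for tag in by_dict]
names_kwarg = [tag.get_text() for tag in by_kwarg]
names_css = [tag.get_text() for tag in by_css]
print(names, names_kwarg, names_css)
```

All three calls select the same two `<span>` tags here; which form to use is mostly a matter of taste.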

### `requests.get()` to access HTTP (more robust)

In [None]:
# similar to urlopen(), we can use requests.get() to retrieve the raw HTML and pass it to BeautifulSoup
url = 'https://drliangjin.github.io/simple-webpage/'
# this is the response we get from the server
res = requests.get(url)

print(res)
print(dir(res)) # <== what attributes and methods are available for this response?
# or we can call help(res) to read a detailed description

In [None]:
# BeautifulSoup only needs the body of the response, in this case the HTML text
html = res.text
soup = BeautifulSoup(html, 'lxml')

In [None]:
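# find every link that ends in .txt and download it with urlretrieve
a_tags = soup.find_all('a', {'href': re.compile(r'.+\.txt$')})
for a in a_tags:
    href = a.attrs['href']
    full_href = 'https://drliangjin.github.io' + href # hrefs start with '/', so prepend the site root
    print(full_href)
    # name the local file after the parent folder plus the file name
    filename = href.split('/')[-2] + href.split('/')[-1]
    urlretrieve(full_href, './{}'.format(filename));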
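
Concatenating the site root and the `href` by hand only works for root-relative links. As a sketch of a more general approach, `urllib.parse.urljoin` from the standard library resolves an `href` against the page URL. The `href` values below are hypothetical examples, not taken from the actual page.

```python
from urllib.parse import urljoin

page_url = 'https://drliangjin.github.io/simple-webpage/'

# Hypothetical href values, as they might appear in <a> tags
hrefs = [
    '/simple-webpage/docs/2012/test.txt',   # root-relative
    'docs/2013/test.txt',                   # relative to the current page
    'https://example.com/other.txt',        # already absolute
]

full_urls = [urljoin(page_url, href) for href in hrefs]
for full_url in full_urls:
    print(full_url)
```

The trailing slash in `page_url` matters: without it, `urljoin` treats the last path segment as a file and resolves relative links against its parent folder.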

In [None]:
# or we can use another way
# construct urls and fetch files using requests directly
years = list(range(2012, 2017))
urls = ['https://drliangjin.github.io/simple-webpage/docs/{}/test.txt'.format(year) for year in years]
with open('secret.txt', 'w') as wf:
    for url in urls:
        lines = requests.get(url).text.splitlines() # split the contents into a list of lines
        secret_loc = lines[0].find('Secret') # column where 'Secret' starts in the first line
        secret_msg = lines[1][secret_loc:] # second line, from that column onwards
        wf.write(secret_msg)
        wf.write(' ') # <= try '\n', what happens?
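
The `find()` and slicing steps above can be tried offline. The sketch below assumes a two-line, column-aligned layout; the real files may be laid out differently, so treat the sample text as hypothetical.

```python
# Hypothetical file contents: a header line and a data line in aligned columns
text = "Year  Secret\n2012  hello"

lines = text.splitlines()               # ['Year  Secret', '2012  hello']
secret_loc = lines[0].find('Secret')    # column where the 'Secret' header starts
secret_msg = lines[1][secret_loc:]      # the data line from that column onwards

print(secret_loc, secret_msg)
```

Note that `str.find()` returns `-1` when the text is not found, in which case the slice would silently return only the last character of the line.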

### What if errors occur?

In [None]:
# we need to check if we actually downloaded the page
url_404 = 'https://drliangjin.github.io/no_such_webpage/'
res_404 = requests.get(url_404)
# error handler:
try:
    res_404.raise_for_status() # <= raises HTTPError for 4xx/5xx; also check the status_code attribute
except requests.exceptions.HTTPError as exc:
    print("There was an issue: {}".format(exc))
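
The same check can be wrapped in a small function so that it is easy to reuse before parsing. This is only a sketch: `check_response` is a hypothetical helper name, and the live request is left commented out because it needs network access.

```python
import requests

def check_response(res):
    """Return True if the response is OK, otherwise print the problem and return False."""
    try:
        res.raise_for_status()   # raises HTTPError for 4xx and 5xx status codes
    except requests.exceptions.HTTPError as exc:
        print('There was an issue: {}'.format(exc))
        return False
    return True

# Usage with a live request (needs network access):
# res = requests.get('https://drliangjin.github.io/no_such_webpage/', timeout=10)
# if check_response(res):
#     soup_input = res.text
```

Passing `timeout` to `requests.get()` is a common precaution, since without it a request can wait indefinitely for a slow server.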

### Google search using `requests`

In [None]:
keyword = {'q': 'python'}
url = 'https://www.google.co.uk/search'
res = requests.get(url, params=keyword)
print(res.url)
webbrowser.open(res.url);
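
To see how `params` ends up in the URL without sending anything, one option is to build and prepare the request by hand. A minimal sketch:

```python
import requests

url = 'https://www.google.co.uk/search'
keyword = {'q': 'python'}

# Build the request without sending it, to inspect the final URL
prepared = requests.Request('GET', url, params=keyword).prepare()
print(prepared.url)   # https://www.google.co.uk/search?q=python
```

Letting `requests` encode the parameters is safer than formatting the query string by hand, because special characters in the search term are escaped for you.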


In [None]:
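# It's the Google search results page that we dig further into...
soup = BeautifulSoup(res.text, 'lxml')
# NOTE: '.r a' depends on Google's markup, which changes over time, so it may match nothing
links = soup.select('.r a')

for link in links:
    href = link.get('href')
    if href: # skip <a> tags without an href
        print('https://google.co.uk' + href)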

### `BeautifulSoup` + `re` ROCK!
We will cover regular expressions in the next lecture, together with basic textual analysis.

In [None]:
# use a regular expression to keep only links to other Wikipedia articles
html = urlopen('https://en.wikipedia.org/wiki/Linux')
soup = BeautifulSoup(html, 'lxml')
for link in soup.find('div', {'id':'bodyContent'}).find_all(
    'a', {'href':re.compile(r'^(/wiki/)((?!:).)*$')}):
    if 'href' in link.attrs:
        # hrefs are root-relative (/wiki/...), so prepend the site root only
        print('https://en.wikipedia.org' +
              link.attrs['href'])
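
The pattern above can be tested on its own, without downloading anything. The sketch below runs it against a few hypothetical `href` values of the kind found on a Wikipedia page.

```python
import re

# Same pattern as above: starts with /wiki/ and has no colon anywhere after it
article_link = re.compile(r'^(/wiki/)((?!:).)*$')

# Hypothetical href values
hrefs = [
    '/wiki/Linux_kernel',          # ordinary article
    '/wiki/File:Tux.svg',          # namespaced page (contains a colon)
    '/wiki/Category:Linux',        # namespaced page
    'https://www.kernel.org/',     # external link
    '#cite_note-1',                # in-page anchor
]

articles = [href for href in hrefs if article_link.search(href)]
full_urls = ['https://en.wikipedia.org' + href for href in articles]
print(full_urls)
```

The negative lookahead `(?!:)` is what rejects links such as `File:` and `Category:` pages, which are not regular articles.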

### Other tools:
- `selenium`: a tool for controlling your web browser; it can automate a sequence of actions:
    - clicking
    - filling out and submitting forms
    - scraping
- `scrapy`: a powerful and complete scraping framework
    - multiple processes
    - process data
    - store data