# Web Scraping
---
- Author: Diego Inácio
- GitHub: [github.com/diegoinacio](https://github.com/diegoinacio)
- Notebook: [web-scraping.ipynb](https://github.com/diegoinacio/creative-coding-notebooks/blob/master/Tips-and-Tricks/web-scraping.ipynb)
---
A few demonstrations of how to scrape data from the web.

[Web scraping](https://en.wikipedia.org/wiki/Web_scraping), or **web data extraction**, is a process that lets us collect structured (or even unstructured) data from the web via HTTP requests.

In [None]:
# Scraping libraries
import requests
from bs4 import BeautifulSoup

# Display libraries
from IPython.display import display, HTML, Image, Audio

## Request and HTML Parsing
---
The libraries we need are:
- **Requests**: Allows us to send *HTTP requests* very easily.
- **Beautiful Soup**: Allows us to parse HTML files into Python objects and extract data from them.
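
As a minimal sketch of what Beautiful Soup provides, the snippet below parses a small inline HTML string instead of a live page, so no request is needed. The HTML content and its `toc` id are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a downloaded page
html_text = """
<div id="toc">
  <ul>
    <li><a href="#Intro">Intro</a></li>
    <li><a href="#Methods">Methods</a></li>
  </ul>
</div>
"""

parse = BeautifulSoup(html_text, "html.parser")

# find returns the first matching element; find_all returns every match
toc = parse.find("div", {"id": "toc"})
links = [(a.get_text(), a["href"]) for a in toc.find_all("a")]
print(links)
```

Note that `find` returns `None` when nothing matches, so chained `find` calls on a live page are worth guarding.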

For the following example, let's take the table of contents from the [Ordinary differential equation](https://en.wikipedia.org/wiki/Ordinary_differential_equation) page on Wikipedia.

In [None]:
# Request and parse
URL = "https://en.wikipedia.org/wiki/Ordinary_differential_equation"
html_text = requests.get(URL).text
parse = BeautifulSoup(html_text, "html.parser")

# Get content table
content_table = (
    parse
        .find("div", {"id": "mw-content-text"})
        .find("div", {"id": "toc"})
        .find("ul")
)

# Change href for each item to redirect to actual page
for a in content_table.find_all("a"):
    new_URL = URL + a["href"]
    a["href"] = new_URL

# Display content
HTML(str(content_table))
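
Concatenating `URL + a["href"]` works in the cell above because each table-of-contents link is a fragment (it starts with `#`). As a hedged alternative for pages that mix link styles, the standard library's `urljoin` resolves fragments, root-relative paths, and protocol-relative URLs against the page URL. The href values below are illustrative, not taken from the page.

```python
from urllib.parse import urljoin

URL = "https://en.wikipedia.org/wiki/Ordinary_differential_equation"

# Illustrative href values of the kinds commonly found in scraped pages
hrefs = ["#History", "/wiki/Calculus", "//upload.wikimedia.org/a.ogg"]

# Resolve each href against the URL of the page it was scraped from
resolved = [urljoin(URL, href) for href in hrefs]
for link in resolved:
    print(link)
```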

## Structured Data
---
Scraping structured data with examples.

### HTML Tables
---
Getting data from HTML tables.

For the following example, let's take a currency exchange table for the Brazilian real.

In [None]:
URL = "https://www.x-rates.com/table/?from=BRL&amount=1"
html_text = requests.get(URL).text
parse = BeautifulSoup(html_text, "html.parser")

table = parse.find("table")

HTML(str(table))

Once we have the HTML table, we can load it into a DataFrame using the *pandas* library.


In [None]:
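from io import StringIO

import pandas as pd

# Wrap the markup in StringIO: passing a literal HTML string is deprecated in recent pandas
df_table = pd.read_html(StringIO(str(table)))[0]
df_table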

### Consuming open data
---
We can read a CSV file from the web by passing its URL directly to `pd.read_csv`.

In [None]:
URL = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
df_csv = pd.read_csv(URL)
df_csv
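
`pd.read_csv` accepts a local path, a URL, or a file-like object, which makes it easy to try the same call without a network connection. The sketch below uses a hypothetical two-row CSV string; its column names and values are assumptions for illustration only.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV text standing in for a file downloaded from the web
csv_text = "Country,Region\nBrazil,SOUTH AMERICA\nJapan,ASIA\n"

# StringIO makes the string behave like an open file
df = pd.read_csv(StringIO(csv_text))
print(df.shape)
```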

## Unstructured Data
---
Scraping unstructured data with examples.

### NoSQL file
---
Reading static *JSON* files from the web. The process is almost the same for dynamic APIs.

In [None]:
URL = "https://www.plus2net.com/php_tutorial/student.json"
json_file = requests.get(URL).json()

json_file
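
The `.json()` method of a Requests response decodes the body into plain Python lists and dictionaries, the same structures `json.loads` produces from text. The sketch below uses a hypothetical inline payload (the field names are assumptions, not the actual contents of the file above) to show how such a result can be filtered once loaded.

```python
import json

# Hypothetical payload standing in for a downloaded JSON response body
payload = '[{"name": "Ana", "mark": 75}, {"name": "Bo", "mark": 58}]'

# Decode the text into a list of dicts, like response.json() would
records = json.loads(payload)

# Keep the names of records whose mark is at least 60
passed = [r["name"] for r in records if r["mark"] >= 60]
print(passed)
```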

### Image data
---
Scraping *img* elements from Google Images search results.

In [None]:
# Request | image search for "zebra"
URL = "https://www.google.com/search?q=zebra&tbm=isch"
html_text = requests.get(URL).text
parse = BeautifulSoup(html_text, 'html.parser')

# Find all img elements and keep 5 of them, skipping the first match
IMG = parse.find_all("img")[1:6]

mount = ""
for img in IMG:
    img["style"] = "float: left"
    mount += str(img)

mount = f'<div><h1 style="color: red;">Image Scraping</h1><br>{mount}</div>'

HTML(mount)

### Audio data
---
Scraping audio data from web pages.

For the following example, let's find all audio elements in a page.

In [None]:
mount = ""
TITLE, SOURCE = [], []

# Request and parsing
URL = "https://en.wikipedia.org/wiki/Additive_synthesis"
html_text = requests.get(URL).text
parse = BeautifulSoup(html_text, 'html.parser')

# Find all audio elements
AUDIO = parse.find_all("audio")

# Procedure for each audio element
for audio in AUDIO:
    title = audio["data-mwtitle"]
    source = audio.find("source")
    src = source["src"]
    mount += f'''
    <div>
        <h4>{title}</h4><br>
        <audio controls>
          <source src="{src}" type="{source["type"]}">
        Your browser does not support the audio element.
        </audio><br>
        <a href="{src}">{src}</a>
    </div>
    '''
    TITLE.append(title)
    SOURCE.append(src)

# Output
mount = f'<div><h1 style="color: red;">Audio Scraping</h1><br>{mount}</div>'
HTML(mount)