# Web Scraping
---
- Author: Diego Inácio
- GitHub: [github.com/diegoinacio](https://github.com/diegoinacio)
- Notebook: [web-scraping.ipynb](https://github.com/diegoinacio/creative-coding-notebooks/blob/master/Tips-and-Tricks/web-scraping.ipynb)
---
A few demonstrations of how to scrape data from the web.

[Web scraping](https://en.wikipedia.org/wiki/Web_scraping) or **web data extraction** is the process of collecting structured (or even unstructured) data from the web via HTTP requests.

In [1]:
# Scraping libraries
import requests
from bs4 import BeautifulSoup

# Display libraries
from IPython.display import display, HTML, Image, Audio

## 1. Request and HTML Parsing
---
The libraries we need are:
- **Requests**: Allows us to send *HTTP requests* in an extremely easy way.
- **Beautiful Soup**: Allows us to parse HTML files into Python objects and extract data from them.
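
As a minimal sketch of how the two libraries fit together, the helper below wraps `requests.get` with a timeout and a status check, and a second helper hands the text to Beautiful Soup. The names `fetch_html` and `parse_html`, the `session` parameter, and the default timeout are assumptions made for illustration; they are not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10, session=None):
    """Return the body of `url` as text, raising on HTTP errors.

    `fetch_html` and its parameters are hypothetical names for this sketch.
    """
    http = requests if session is None else session
    response = http.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text


def parse_html(html_text):
    """Parse an HTML string with the parser that ships with Python."""
    return BeautifulSoup(html_text, "html.parser")


# Parsing works on any HTML string, so it can be tried without a network call
soup = parse_html("<html><body><h1 id='title'>Hello</h1></body></html>")
heading = soup.find("h1", {"id": "title"}).get_text()
print(heading)  # Hello
```

With a network connection, `parse_html(fetch_html(URL))` would do the same job as a plain `requests.get(URL).text` followed by `BeautifulSoup(...)`, with the difference that a slow or failing server raises an exception instead of silently returning an error page.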

For the following example, let's take the table of contents from the page [Ordinary Differential Equation](https://en.wikipedia.org/wiki/Ordinary_differential_equation) on Wikipedia.

In [2]:
# Request and parse
URL = "https://en.wikipedia.org/wiki/Ordinary_differential_equation"
html_text = requests.get(URL).text
parse = BeautifulSoup(html_text, "html.parser")

# Get content table
content_table = (
    parse
        .find("div", {"id": "mw-content-text"})
        .find("div", {"id": "toc"})
        .find("ul")
)

# Change href for each item to redirect to actual page
for a in content_table.find_all("a"):
    new_URL = URL + a["href"]
    a["href"] = new_URL

# Display content
HTML(str(content_table))
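
Concatenating `URL + a["href"]` works in the cell above because every link it rewrites is a fragment that starts with `#`. As a hedged alternative for pages that mix fragment, relative, and absolute links, `urllib.parse.urljoin` from the standard library resolves each case against the page URL. The href values below are made up for illustration.

```python
from urllib.parse import urljoin

URL = "https://en.wikipedia.org/wiki/Ordinary_differential_equation"

# Hypothetical href values covering the common cases
hrefs = [
    "#Definitions",                 # fragment on the same page
    "/wiki/Differential_equation",  # path relative to the site root
    "https://example.org/page",     # already absolute
]

resolved = [urljoin(URL, href) for href in hrefs]
for link in resolved:
    print(link)
```

Inside the loop of the original cell, this would amount to writing `a["href"] = urljoin(URL, a["href"])`.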

## 2. Structured Data
---
Scraping structured data with examples.

### 2.1. HTML Tables
---
Getting data from HTML tables.

For the following example, let's take a currency exchange table for the Brazilian Real.

In [3]:
URL = "https://www.x-rates.com/table/?from=BRL&amount=1"
html_text = requests.get(URL).text
parse = BeautifulSoup(html_text, "html.parser")

table = parse.find("table")

HTML(str(table))

| Brazilian Real | 1.00 BRL | inv. 1.00 BRL |
|---|---|---|
| US Dollar | 0.191833 | 5.212876 |
| Euro | 0.169593 | 5.89647 |
| British Pound | 0.141776 | 7.053376 |
| Indian Rupee | 14.506304 | 0.068936 |
| Australian Dollar | 0.26879 | 3.720378 |
| Canadian Dollar | 0.244163 | 4.095624 |
| Singapore Dollar | 0.258403 | 3.869917 |
| Swiss Franc | 0.177584 | 5.631139 |
| Malaysian Ringgit | 0.803826 | 1.244051 |
| Japanese Yen | 22.185911 | 0.045074 |
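
To see what is inside a `table` element, here is a small sketch that walks the rows by hand and turns them into Python dictionaries. The HTML snippet is a simplified stand-in written for this example (the real page has more markup), and the column names `Currency` and `Rate` are assumptions.

```python
from bs4 import BeautifulSoup

# Simplified, hand-written table; not the markup of the real page
SAMPLE_HTML = """
<table>
  <tr><th>Currency</th><th>Rate</th></tr>
  <tr><td>US Dollar</td><td>0.191833</td></tr>
  <tr><td>Euro</td><td>0.169593</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Collect the text of every header or data cell, one list per row
rows = []
for tr in soup.find("table").find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    if cells:
        rows.append(cells)

# First row is the header, the remaining rows are the data
header, *body = rows
records = [dict(zip(header, row)) for row in body]

# Cell text is always a string, so numbers need an explicit conversion
usd_rate = float(records[0]["Rate"])
print(records)
```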


Having the HTML data, we can load it into a DataFrame using the *pandas* library.


In [4]:
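from io import StringIO

import pandas as pd

# Wrap the HTML string in StringIO: recent pandas versions deprecate passing literal HTML
df_table = pd.read_html(StringIO(str(table)))[0]
df_table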

|  | Brazilian Real | 1.00 BRL | inv. 1.00 BRL |
|---|---|---|---|
| 0 | US Dollar | 0.191833 | 5.212876 |
| 1 | Euro | 0.169593 | 5.89647 |
| 2 | British Pound | 0.141776 | 7.053376 |
| 3 | Indian Rupee | 14.506304 | 0.068936 |
| 4 | Australian Dollar | 0.26879 | 3.720378 |
| 5 | Canadian Dollar | 0.244163 | 4.095624 |
| 6 | Singapore Dollar | 0.258403 | 3.869917 |
| 7 | Swiss Franc | 0.177584 | 5.631139 |
| 8 | Malaysian Ringgit | 0.803826 | 1.244051 |
| 9 | Japanese Yen | 22.185911 | 0.045074 |
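
Once the table is a DataFrame, the usual pandas tools apply. As a small sketch, the snippet below rebuilds two rows copied from the output above and uses them to convert an amount; the amount itself and the names `df_rates`, `rates`, and `amount_brl` are assumptions for illustration.

```python
import pandas as pd

# Two rows copied from the output above; column names follow the scraped header
df_rates = pd.DataFrame(
    {
        "Brazilian Real": ["US Dollar", "Euro"],
        "1.00 BRL": [0.191833, 0.169593],
        "inv. 1.00 BRL": [5.212876, 5.89647],
    }
)

# Index by currency name so a rate can be looked up by label
rates = df_rates.set_index("Brazilian Real")["1.00 BRL"]

amount_brl = 250  # hypothetical amount to convert
amount_usd = amount_brl * rates["US Dollar"]
print(round(amount_usd, 2))
```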


### 2.2. Consuming open data
---
We can read a CSV file from the web simply by passing its URL as an argument.

In [5]:
URL = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
df_csv = pd.read_csv(URL)
df_csv

|  | Country | Region |
|---|---|---|
| 0 | Algeria | AFRICA |
| 1 | Angola | AFRICA |
| 2 | Benin | AFRICA |
| 3 | Botswana | AFRICA |
| 4 | Burkina | AFRICA |
| ... | ... | ... |
| 189 | Paraguay | SOUTH AMERICA |
| 190 | Peru | SOUTH AMERICA |
| 191 | Suriname | SOUTH AMERICA |
| 192 | Uruguay | SOUTH AMERICA |
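
`pd.read_csv` accepts a URL, a local path, or a file-like object, so the same call can be tried offline. The sketch below feeds it a few rows copied from the output above through `StringIO`; the `Country,Region` header line is an assumption about the layout of the file.

```python
from io import StringIO

import pandas as pd

# A few rows copied from the output above; the header line is assumed
SAMPLE_CSV = """Country,Region
Algeria,AFRICA
Angola,AFRICA
Peru,SOUTH AMERICA
"""

df_sample = pd.read_csv(StringIO(SAMPLE_CSV))

# Number of countries per region
countries_per_region = df_sample["Region"].value_counts()

# Countries that belong to a given region
africa = df_sample.loc[df_sample["Region"] == "AFRICA", "Country"].tolist()
print(africa)  # ['Algeria', 'Angola']
```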


## 3. Unstructured Data
---
Scraping unstructured data with examples.

### 3.1. JSON file
---
Reading static *JSON* files from the web. For dynamic APIs, the process is almost the same.

In [6]:
URL = "https://www.plus2net.com/php_tutorial/student.json"
json_file = requests.get(URL).json()

json_file

[{'id': 1, 'name': 'John Deo', 'class': 'Four', 'mark': 75, 'sex': 'female'},
 {'id': 2, 'name': 'Max Ruin', 'class': 'Three', 'mark': 85, 'sex': 'male'},
 {'id': 3, 'name': 'Arnold', 'class': 'Three', 'mark': 55, 'sex': 'male'},
 {'id': 4, 'name': 'Krish Star', 'class': 'Four', 'mark': 60, 'sex': 'female'},
 {'id': 5, 'name': 'John Mike', 'class': 'Four', 'mark': 60, 'sex': 'female'},
 {'id': 6, 'name': 'Alex John', 'class': 'Four', 'mark': 55, 'sex': 'male'},
 {'id': 7, 'name': 'My John Rob', 'class': 'Fifth', 'mark': 78, 'sex': 'male'},
 {'id': 8, 'name': 'Asruid', 'class': 'Five', 'mark': 85, 'sex': 'male'},
 {'id': 9, 'name': 'Tes Qry', 'class': 'Six', 'mark': 78, 'sex': 'male'},
 {'id': 10, 'name': 'Big John', 'class': 'Four', 'mark': 55, 'sex': 'female'},
 {'id': 11, 'name': 'Ronald', 'class': 'Six', 'mark': 89, 'sex': 'female'},
 {'id': 12, 'name': 'Recky', 'class': 'Six', 'mark': 94, 'sex': 'female'},
 {'id': 13, 'name': 'Kty', 'class': 'Seven', 'mark': 88, 'sex': 'female'},
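
`response.json()` returns plain Python lists and dictionaries, so the result can be processed with the standard library alone. As a small sketch, the snippet below takes the first three records from the output above and averages the marks per class; the variable names are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

# First three records copied from the output above
records = [
    {"id": 1, "name": "John Deo", "class": "Four", "mark": 75, "sex": "female"},
    {"id": 2, "name": "Max Ruin", "class": "Three", "mark": 85, "sex": "male"},
    {"id": 3, "name": "Arnold", "class": "Three", "mark": 55, "sex": "male"},
]

# Group the marks by class
marks_by_class = defaultdict(list)
for record in records:
    marks_by_class[record["class"]].append(record["mark"])

# Average mark for each class
average_mark = {name: mean(marks) for name, marks in marks_by_class.items()}
print(average_mark)
```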
 

### 3.2. Image data
---
Scraping *img* elements from the Google Images search engine.

In [7]:
# Request | image search for "zebra"
URL = "https://www.google.com/search?q=zebra&tbm=isch"
html_text = requests.get(URL).text
parse = BeautifulSoup(html_text, 'html.parser')

# Find all img and show 5 of them
IMG = parse.find_all("img")[1:6]

mount = ""
for img in IMG:
    img["style"] = "float: left"
    mount += str(img)

mount = f'<div><h1 style="color: red;">Image Scraping</h1><br>{mount}</div>'

HTML(mount)
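
The markup returned by a search engine may change over time and may differ depending on the request headers, so a fixed slice such as `[1:6]` is best read as tied to the page as it was when the notebook ran. As a hedged sketch of a more defensive approach, the snippet below collects image addresses from a hand-written HTML sample, skipping elements without a usable `src`. The sample markup, the `example.com` URLs, and the variable names are all made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

PAGE_URL = "https://www.example.com/search?q=zebra"  # hypothetical page

# Hand-written sample; real result pages are far more complex
SAMPLE_HTML = """
<img alt="logo">
<img src="/images/zebra-1.jpg" alt="zebra">
<img src="https://cdn.example.com/zebra-2.jpg" alt="zebra">
<img src="data:image/gif;base64,R0lGOD" alt="placeholder">
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

image_urls = []
for img in soup.find_all("img"):
    src = img.get("src")  # None when the attribute is missing
    if not src or src.startswith("data:"):
        continue  # skip missing sources and inline placeholders
    image_urls.append(urljoin(PAGE_URL, src))

print(image_urls)
```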

### 3.3. Audio data
---
Scraping audio data from web pages.

For the following example, let's find all audio elements in a page.

In [8]:
mount = ""
TITLE, SOURCE = [], []

# Request and parsing
URL = "https://en.wikipedia.org/wiki/Additive_synthesis"
html_text = requests.get(URL).text
parse = BeautifulSoup(html_text, 'html.parser')

# Find all audio elements
AUDIO = parse.find_all("audio")

# Procedure for each audio element
for audio in AUDIO:
    title = audio["data-mwtitle"]
    source = audio.find("source")
    src = source["src"]
    mount += f'''
    <div>
        <h4>{title}</h4><br>
        <audio controls>
          <source src="{src}" type="{source["type"]}">
        Your browser does not support the audio element.
        </audio><br>
        <a href="{src}">{src}</a>
    </div>
    '''
    TITLE.append(title)
    SOURCE.append(src)

# Output
mount = f'<div><h1 style="color: red;">Audio Scraping</h1><br>{mount}</div>'
HTML(mount)
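
Indexing a tag with `audio["data-mwtitle"]` raises a `KeyError` when the attribute is missing, and `audio.find("source")` returns `None` when there is no `source` child. As a minimal sketch of a more tolerant loop, the snippet below uses `.get()` with a default and resolves each `src` against the page URL. The sample markup, the `upload.example.org` host, and the `(untitled)` placeholder are assumptions made for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

PAGE_URL = "https://en.wikipedia.org/wiki/Additive_synthesis"

# Made-up markup that mimics three situations a scraper may meet
SAMPLE_HTML = """
<audio data-mwtitle="Example.ogg">
  <source src="//upload.example.org/Example.ogg" type="audio/ogg"/>
</audio>
<audio>
  <source src="/media/untitled.mp3" type="audio/mpeg"/>
</audio>
<audio data-mwtitle="Empty.ogg"></audio>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

tracks = []
for audio in soup.find_all("audio"):
    source = audio.find("source")
    if source is None or not source.get("src"):
        continue  # nothing playable inside this element
    tracks.append(
        {
            "title": audio.get("data-mwtitle", "(untitled)"),
            # urljoin also handles protocol-relative sources such as //host/file
            "src": urljoin(PAGE_URL, source["src"]),
            "type": source.get("type", ""),
        }
    )

for track in tracks:
    print(track["title"], track["src"])
```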