# Web scraping. 1

Web scraping is a technique for extracting information from web pages with programs that simulate a person browsing the web.

Web scraping can be a good alternative when a website does not offer its information through an API.
APIs offer many advantages:
* They are documented
* They offer structured information
* They are relatively stable over time

Web scraping, by contrast, has some disadvantages:
* The information is not structured
* The page's code changes frequently
* The page structure is complex and undocumented
* Some sites block or restrict scraping
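
One common way sites signal scraping restrictions is a `robots.txt` file. The sketch below uses Python's standard `urllib.robotparser` to check whether a URL may be fetched. The rules, the `my-scraper` user agent and the `example.com` URLs are made-up values for illustration; a real script would load the site's actual file instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, given inline so the example runs offline.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

# can_fetch() reports whether the given user agent may request the URL.
allowed_public = parser.can_fetch("my-scraper", "https://example.com/index.html")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data.html")

print(allowed_public)
print(allowed_private)
```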

## Scraping Steps
A scraping process consists of the following steps:
* Inspect the HTML code of the web page
* Download HTML content from the site
* Parse HTML code with Beautiful Soup
* Work with the extracted data
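
The download, parse and extract steps can be sketched as follows. To keep the example runnable offline, an inline HTML string (invented for illustration) stands in for the page that `requests.get(url).text` would normally download.

```python
from bs4 import BeautifulSoup

# Step 2 stand-in: in a real run this string would come from
# requests.get(url).text; the markup here is a made-up sample.
html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Fruits</h1>
    <ul><li>Apple</li><li>Pear</li></ul>
  </body>
</html>
"""

# Step 3: parse the HTML with Beautiful Soup.
soup = BeautifulSoup(html, "html.parser")

# Step 4: work with the extracted data.
page_title = soup.title.text
fruits = [li.text for li in soup.find_all("li")]

print(page_title)
print(fruits)
```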

In [None]:
# The fundamental libraries are:
# requests <- for HTTP requests
# BeautifulSoup <- for parsing HTML code

In [None]:
import requests
from bs4 import BeautifulSoup

In [None]:
# Example of a very simple page
# https://bigdatawirtz.github.io/exemplo-web/01.html

In [None]:
# HTTP GET request to the page URL
url = 'https://bigdatawirtz.github.io/exemplo-web/01.html'
paxina = requests.get(url)

# paxina.content holds the raw bytes of the response;
# paxina.text holds the same content decoded to a string
print(paxina.text)
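
A request can fail or return an error page, so it may be worth checking the response before parsing it. The helper below is a minimal sketch; the name `fetch_html` and the 10-second timeout are illustrative choices, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text.

    Raises requests.HTTPError if the server answers with a 4xx/5xx status.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # turn HTTP error statuses into exceptions
    return response.text
```

It could be called as `fetch_html(url)` in place of the bare `requests.get(url)` call, with the returned string passed on to Beautiful Soup.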

In [None]:
# Parse the content of the response
soup = BeautifulSoup(paxina.content, 'html.parser')

In [None]:
type(soup)

In [None]:
# Show the content of 'the soup'
soup
# prettify() returns the markup as an indented string, which is easier to read
#print(soup.prettify())

In [None]:
# Elements can be searched for by their tag name
# Find the <title> element
soup.find("title")
# Equivalent shortcut: soup.title

In [None]:
type(soup.title)

In [None]:
# Every Tag has a name
soup.title.name

In [None]:
type(soup.title.name)

In [None]:
# Tags enclose text
soup.title.text

In [None]:
type(soup.title.text)

In [None]:
# Find the first <p> (paragraph) element
soup.find('p')
# Equivalent shortcut: soup.p

In [None]:
soup.p.name

In [None]:
soup.p.text

In [None]:
# In addition to a name and text, tags can have attributes
soup.p.attrs

In [None]:
# Access the value of the attributes
soup.p.attrs['data-attribute']
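
Indexing `attrs` with a missing key raises `KeyError`, so `Tag.get()` can be a safer way to read optional attributes. A small sketch on an inline snippet; the markup and attribute values are invented for illustration:

```python
from bs4 import BeautifulSoup

# Made-up markup with one custom attribute on the paragraph.
html = '<p id="intro" data-attribute="demo">Hello</p>'
soup = BeautifulSoup(html, "html.parser")

# Tags support dictionary-style access to their attributes.
value = soup.p["data-attribute"]

# get() returns None (or a chosen default) when the attribute is absent.
missing = soup.p.get("title")
fallback = soup.p.get("title", "no title")

print(value, missing, fallback)
```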

In [None]:
# New page: https://bigdatawirtz.github.io/exemplo-web/02.html
url = 'https://bigdatawirtz.github.io/exemplo-web/02.html'
paxina = requests.get(url)

print(paxina.text)

In [None]:
# Parse the web content
soup = BeautifulSoup(paxina.content, 'html.parser')

In [None]:
# Search for h1
soup.h1

In [None]:
soup.h1.text

In [None]:
# Search for unordered list <ul>
soup.ul

In [None]:
soup.ul.li

In [None]:
soup.ul.li.text

In [None]:
# What if an element occurs several times?
# Attribute access returns only the first occurrence
soup.h3

In [None]:
# find() also returns only the first occurrence
soup.find('h3')

In [None]:
# find_all() returns all occurrences of the element
soup.find_all('h3')

In [None]:
type(soup.find_all('h3'))

In [None]:
elementos_h3 = soup.find_all('h3')
for i in elementos_h3:
    print(i.text)

In [None]:
# Get the ordered lists (<ol>) and print the text of each one
elementos_lista = soup.find_all('ol')
for i in elementos_lista:
    print(i.text)
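
Calling `.text` on an `<ol>` returns the text of the whole list as a single string. To handle each item separately, one option is to search for `<li>` elements inside each list, as sketched here on made-up markup:

```python
from bs4 import BeautifulSoup

# Invented sample containing a single ordered list.
html = "<ol><li>First</li><li>Second</li><li>Third</li></ol>"
soup = BeautifulSoup(html, "html.parser")

items = []
for ordered_list in soup.find_all("ol"):
    # find_all() can also be called on a tag to search only its descendants.
    for item in ordered_list.find_all("li"):
        items.append(item.text)

print(items)
```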

In [None]:
# List the paragraphs present on the page
lista_paragrafos = soup.find_all('p')
for contador, i in enumerate(lista_paragrafos, start=1):
    print("Paragraph ", contador, ": ", i.text)

In [None]:
# New web page with links
url = 'https://bigdatawirtz.github.io/exemplo-web/04.html'
paxina = requests.get(url)

print(paxina.text)

In [None]:
# Parsing
soup = BeautifulSoup(paxina.content,'html.parser')
soup

In [None]:
# Search for all occurrences of a tag
soup.find_all("a")

In [None]:
# Remember: the result is a ResultSet, not a plain list (although it can be used like one)
type(soup.find_all('a'))
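
The value returned by `find_all()` is a `ResultSet`. In Beautiful Soup 4 this class derives from `list`, so indexing and `len()` work as usual. A quick check on invented markup:

```python
from bs4 import BeautifulSoup

html = '<a href="a.html">A</a><a href="b.html">B</a>'  # made-up sample
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")

type_name = type(links).__name__   # name of the returned class
is_list = isinstance(links, list)  # ResultSet derives from list
count = len(links)
first_text = links[0].text

print(type_name, is_list, count, first_text)
```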

In [None]:
# Iterate over the result
for enlace in soup.find_all('a'):
    print(enlace)

In [None]:
# Iterate over the <a> elements to extract their text
for enlace in soup.find_all('a'):
    print(enlace.text)

In [None]:
# The links are not in the text, but in the href attribute
for enlace in soup.find_all('a'):
    print(enlace.get('href'))
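
`href` values are often relative to the page they appear on. Assuming that is the case, the standard library's `urljoin` can turn them into absolute URLs. The `hrefs` list below is hypothetical and does not reflect the actual links on the example page:

```python
from urllib.parse import urljoin

base_url = "https://bigdatawirtz.github.io/exemplo-web/04.html"

# Hypothetical href values, as enlace.get('href') might return them.
hrefs = ["01.html", "/about", "https://example.com/", "#top"]

# Resolve every href against the URL of the page it was found on.
absolute_links = [urljoin(base_url, href) for href in hrefs]

for link in absolute_links:
    print(link)
```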

In [None]:
# Other attributes
for enlace in soup.find_all('a'):
    print(enlace.get('title'))