# ETL I: Data Extraction with Web Scraping

<br>

<img width=80 src="https://media.giphy.com/media/KAq5w47R9rmTuvWOWa/giphy.gif">

<img width=150 src="../Images/assblr.png">

***

## Scraping data from the web

Sometimes the data we need is displayed on a website that offers neither a download function nor an API to access the content. In that case, the remaining option is to read the page "as a user" would and extract the content from it.

To do this we'll use the `urllib` library and Beautiful Soup.

Imagine we need to extract the filmography of Anne Hathaway and we have found this information on Wikipedia (https://es.wikipedia.org/wiki/Anne_Hathaway). Let's walk through the process of getting the data.

### Getting the raw data

To get the raw data we need to request the page we want. Some sites block requests that don't look like they come from a browser, so we send a browser-like `User-Agent` header with the request.

In [1]:
from urllib.request import Request, urlopen

webpage = 'https://es.wikipedia.org/wiki/Anne_Hathaway'

req = Request(webpage, headers={'User-Agent': 'Mozilla/5.0'})
raw_web = urlopen(req, timeout=10).read()
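The call above can fail if the server answers with an error status or cannot be reached at all. As a minimal sketch, the request could be wrapped as follows; the helper names `build_request` and `fetch_html` are hypothetical and not part of the original notebook:

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def build_request(url, user_agent='Mozilla/5.0'):
    """Return a Request that carries a browser-like User-Agent header."""
    return Request(url, headers={'User-Agent': user_agent})


def fetch_html(url, timeout=10):
    """Return the raw bytes of the page, or None if the request fails."""
    try:
        with urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # The server answered, but with an error status such as 403 or 404
        print('HTTP error', err.code, 'for', url)
    except URLError as err:
        # No usable answer (for example a DNS failure or a refused connection)
        print('Could not reach', url, ':', err.reason)
    return None
```

With these helpers, `raw_web = fetch_html(webpage)` would return the same bytes as the cell above, or `None` when the request fails. `HTTPError` is caught first because it is a subclass of `URLError`.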

With this, we have the whole HTML of the website in the `raw_web`
variable.

Here are the first 400 bytes:

In [2]:
raw_web[:400]

b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design'
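Note the `b` prefix in the output: `read()` returns `bytes`, not `str`, so a slice counts bytes rather than characters. Here is a small illustration on a sample string, assuming UTF-8 encoding (an assumption; the real encoding is declared by the page itself):

```python
# A short sample standing in for raw_web
sample = 'Año'.encode('utf-8')

print(len(sample))             # number of bytes
text = sample.decode('utf-8')  # bytes -> str
print(len(text))               # number of characters
```

Beautiful Soup accepts bytes directly, so decoding by hand is only needed if you want to work with the text yourself.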

### Looking for the data we need

Now that we have the content, we need to discard everything we don't want. To do this, we first need to know where the data we want is located. The easiest way to find out is to open the website in a new tab, right-click on the content we want, and inspect it.

Once we do that, we can see that the content is inside a table tag like this:
`<table class="wikitable sortable jquery-tablesorter"></table>`

We'll use the Beautiful Soup library to get the table data (docs:
<https://www.crummy.com/software/BeautifulSoup/bs4/doc/>):

In [3]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(raw_web, 'html.parser')
tables = soup.find_all('table')

With the command above we have extracted all the tables on the page.

But how many tables did we get?

In [4]:
len(tables)

6

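We can try another way of extracting, limiting the results to the class we saw when inspecting. Note that we leave out `jquery-tablesorter`: the browser's inspector shows it, but the search below matches without it, which suggests it is added in the browser and is not part of the HTML we downloaded.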

In [5]:
tables = soup.find_all('table', attrs={"class": "wikitable sortable"})
len(tables)

2
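How Beautiful Soup compares the `class` value matters here. The sketch below uses made-up HTML (not the real page) to contrast three ways of filtering by class; the variable names are illustrative only:

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a real page
html = """
<table class="wikitable sortable"><tr><td>a</td></tr></table>
<table class="wikitable sortable plainrowheaders"><tr><td>b</td></tr></table>
<table class="infobox"><tr><td>c</td></tr></table>
"""
soup_demo = BeautifulSoup(html, 'html.parser')

# A string with spaces is compared with the whole class attribute
exact = soup_demo.find_all('table', attrs={'class': 'wikitable sortable'})

# A single class name matches any table that carries that class
any_wikitable = soup_demo.find_all('table', class_='wikitable')

# A CSS selector matches tables carrying both classes, whatever else they have
both = soup_demo.select('table.wikitable.sortable')

print(len(exact), len(any_wikitable), len(both))
```

If a page adds extra classes to some of its tables, the CSS selector form is the more tolerant choice.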

Now we only have two tables and we can explore them. For now, we'll work only with the first one.

First we need to get the headers and rows of the table:

In [6]:
table_head = tables[0].find_all('th')[1:]  # Skip the first header cell, which we don't need
table_rows = tables[0].find_all('tr')[2:]  # Skip the first two rows

Once we have that data, we need to strip the HTML markup and keep only the text content. We'll do this with Beautiful Soup as well.

In [7]:
theads = []
trows = []

for col in table_head:
    theads.append(col.text.strip())  # strip() removes surrounding whitespace, including the newline character (\n)

for row in table_rows:
    content = []
    for col in row:
        content.append(col.text.strip())
    try:
        content = [content[1], content[3], content[5], content[7]]  # Keep only the columns we need
    except IndexError:
        pass  # Rows with fewer cells are expected here; keep them as they are

    trows.append(content)

In [8]:
print('Table headers:', theads)
print('Table first row', trows[0])

Table headers: ['Año', 'Título original', 'Papel', 'Notas']
Table first row ['2001', 'The Princess Diaries', 'Mia Thermopolis', '']
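The odd indices 1, 3, 5 and 7 are needed most likely because iterating over a row also yields the whitespace between its cells. A possible alternative, shown here on made-up HTML, is to ask each row only for its cell tags; the helper name `row_cells` is hypothetical:

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics a table with line breaks between cells
html = """<table>
<tr>
<th>Year</th>
<th>Title</th>
</tr>
<tr>
<td>2001</td>
<td><i>The Princess Diaries</i>
</td>
</tr>
</table>"""


def row_cells(row):
    """Return the stripped text of every <th> or <td> directly inside the row."""
    cells = row.find_all(['th', 'td'], recursive=False)
    return [cell.text.strip() for cell in cells]


rows = [row_cells(tr) for tr in BeautifulSoup(html, 'html.parser').find_all('tr')]
print(rows)
```

With this approach the cells are numbered 0, 1, 2 and so on, so the column selection no longer depends on how the HTML happens to be indented.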


Let's save this data to a file with the built-in `open` function so we can use it later.

In [9]:
with open('../sources/Anne_Hathaway.txt', 'w', encoding='utf-8') as file:

    content = ''  # Start with an empty string

    content = content + ','.join(theads) + '\n'  # Add the headers as the first line

    for row in trows:
        content = content + ','.join(row) + '\n'  # Add each row as a line

    file.write(content)  # Write the content to the file
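Joining fields with `','` breaks as soon as a value contains a comma itself. As a hedged alternative, the standard `csv` module quotes such fields automatically. In this sketch the helper `write_table` and the second sample row are made up for illustration:

```python
import csv
import io


def write_table(file, headers, rows):
    """Write the headers and rows as CSV, quoting fields where needed."""
    writer = csv.writer(file)
    writer.writerow(headers)
    writer.writerows(rows)


# Sample data: the first row comes from the output above, the second is made up
sample_heads = ['Año', 'Título original', 'Papel', 'Notas']
sample_rows = [
    ['2001', 'The Princess Diaries', 'Mia Thermopolis', ''],
    ['0000', 'Hypothetical, Title', 'Somebody', 'made-up row'],
]

buffer = io.StringIO()  # In-memory stand-in for a real file
write_table(buffer, sample_heads, sample_rows)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open(path, 'w', newline='', encoding='utf-8')`, as the `csv` documentation recommends.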

This covers what you need to extract most of the types of data you'll work with. Remember that you can also get data from other sources, for example an SQL database, as you saw in the SQL lesson (you'll practice with this source in the next class).

Pay attention to every source you connect to: each one can be different and have particularities you'll need to handle. But don't worry, practice and experience will let you solve any case.