# Web Data Extraction - Part II

__WEB SCRAPING:__ extracting data from the human-readable output (HTML) that a website serves to a web browser.

__HTML parsing library for Python:__ [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) ==> _a Python package for parsing HTML and XML documents_ (the HTTP request itself is made with `requests`)

---

In [1]:
# Import libraries
import pandas as pd
import requests
import bs4   # !pip install beautifulsoup4
import re

---

In [3]:
# DOM content
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'}
# https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent

url = 'https://www.marca.com/'
# Send the browser-like headers defined above and set a timeout so the call cannot hang
response = requests.get(url, headers=headers, timeout=10)
html = response.content
print(f'Status code is: {response.status_code}')
print(type(html))

Status code is: 200
<class 'bytes'>
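The request above can be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: the name `fetch_html`, the `HEADERS` constant, and the 10-second timeout are assumptions chosen for illustration. It uses `raise_for_status()` so that a 4xx or 5xx response raises an exception instead of being parsed silently.

```python
import requests

# Assumption: same user-agent string as in the cell above; any browser-like value works
HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) '
                  'Gecko/20100101 Firefox/47.0'
}


def fetch_html(url, timeout=10.0):
    """Return the raw response body (bytes) of `url`, raising on HTTP errors."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.content


# Usage (performs a real network request, so it is left commented out):
# html = fetch_html('https://www.marca.com/')
```

`response.content` is `bytes`; BeautifulSoup accepts bytes directly and works out the encoding itself, so no manual decoding is needed before parsing.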


In [4]:
html[:1000]

b' <!DOCTYPE html><html lang="es"><head><script>/\\/radio(\\/parrilla)?.html/gmi.test(location.href||"")&&/MSIE|Trident/gm.test(navigator.userAgent||"")&&!!window.MSInputMethodContext&&!!document.documentMode&&function(){var a=document.createElement("script");a.src="//e00-elmundo.uecdn.es/js/ue-polyfills.min.js",a.type="text/javascript";var b=document.getElementsByTagName("script")[0];b.parentNode.insertBefore(a,b)}();</script>\n<script type="text/javascript" language="javascript" src="https://e00-ue.uecdn.es/cookies/js/policy_v4.js"></script>\n<script>window.googlefc=window.googlefc||{},window.googlefc.callbackQueue=window.googlefc.callbackQueue||[],googlefc.controlledMessagingFunction=function(b){var a=!1;try{if(/https?:\\/\\/(www.marca.com(\\/claro-mx|\\/en)?|(us|co|ar).marca.com\\/claro)\\/$/.test(window.location.origin+window.location.pathname))console.log("GFC in homepage"),a=!1;else{var c=function(a){var b=document.cookie.match("(^|;) ?"+a+"=([^;]*)(;|$)");return b?b[2]:null},d=

---

__Let's make some broth...__

![Image](./img/web_data_01.png)

In [6]:
html_sample = '<a href="url.com" title="Web Scraping" itemprop="url" id="example" class="intro">Hello World</a>'  # step 1: a sample HTML string
broth = bs4.BeautifulSoup(html_sample, "html.parser")  # step 2: parse it
print(type(broth))
broth

<class 'bs4.BeautifulSoup'>


<a class="intro" href="url.com" id="example" itemprop="url" title="Web Scraping">Hello World</a>

In [7]:
tag = broth.a  # step 3: store the first <a> tag in a variable
print(type(tag))
tag

<class 'bs4.element.Tag'>


<a class="intro" href="url.com" id="example" itemprop="url" title="Web Scraping">Hello World</a>

In [8]:
tag.name
#tag.name = 'b'
#tag.name

'a'

In [9]:
tag.attrs

{'href': 'url.com',
 'title': 'Web Scraping',
 'itemprop': 'url',
 'id': 'example',
 'class': ['intro']}

In [10]:
tag['class']

['intro']

In [11]:
tag.string

'Hello World'

In [13]:
broth.name

'[document]'
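Two details of the attributes shown above are easy to trip over, so here is a short sketch using the same sample tag. Indexing with `tag['...']` raises `KeyError` when the attribute is missing, whereas `tag.get(...)` returns `None` (or a default). Also, `.string` is `None` when a tag has more than one child, while `get_text()` concatenates all nested text. The second HTML snippet (`nested`) is a made-up example, not taken from the notebook.

```python
import bs4

html_sample = '<a href="url.com" title="Web Scraping" itemprop="url" id="example" class="intro">Hello World</a>'
tag = bs4.BeautifulSoup(html_sample, "html.parser").a

href = tag["href"]            # direct access: fine, the attribute exists
rel = tag.get("rel")          # missing attribute: .get() returns None instead of raising
classes = tag.get("class")    # 'class' is multi-valued, so it comes back as a list

# Hypothetical snippet with nested markup, to compare .string and get_text()
nested = bs4.BeautifulSoup("<p>Hello <b>World</b></p>", "html.parser").p
only_string = nested.string   # None: <p> has two children (a text node and <b>)
full_text = nested.get_text() # concatenates the text of all descendants

print(href, rel, classes, only_string, full_text)
```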

---

__Now, let's make some soup...__ `find_all()`

In [14]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = bs4.BeautifulSoup(html_doc, "html.parser") 
print(type(soup))
soup

<class 'bs4.BeautifulSoup'>



<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

In [15]:
print(type(soup.html))
soup.html # soup.find('html')

<class 'bs4.element.Tag'>


<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

In [16]:
print(type(soup.find_all('a')))
soup.find_all('a')
#soup.find_all(["a", "b"])

<class 'bs4.element.ResultSet'>


[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [17]:
all_tags = [tag.name for tag in soup.find_all(True)]
print(type(all_tags))
all_tags

<class 'list'>


['html', 'head', 'title', 'body', 'p', 'b', 'p', 'a', 'a', 'a', 'p']

In [18]:
some_tags = [tag.name for tag in soup.find_all(re.compile("^b"))]
some_tags

['body', 'b']

In [19]:
# Using a function
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

In [20]:
soup.find_all(has_class_but_no_id)

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]
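Besides a tag name, `True`, a regular expression, or a function, `find_all()` accepts a list of names, a `limit`, and keyword filters on attribute values. The sketch below runs on a shortened copy of the same document. It also shows `find()`, which returns only the first match, or `None` when nothing matches.

```python
import re
import bs4

# Shortened copy of the "three sisters" document used above
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = bs4.BeautifulSoup(html_doc, "html.parser")

names = [t.name for t in soup.find_all(["a", "b"])]  # a list matches any of the names
first_two = soup.find_all("a", limit=2)              # stop after two matches
by_href = soup.find_all(href=re.compile("lacie"))    # filter on an attribute value
first_link = soup.find("a")                          # first match only
missing = soup.find("table")                         # no match: None, not an empty list

print(names, [t["id"] for t in first_two], by_href[0]["id"], first_link["id"], missing)
```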

---

__Finally, let's make a stew...__

In [21]:
stew = soup.find_all("a", {"class": "sister"})
stew

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [22]:
all_strings = [tag.string for tag in stew]
all_strings

['Elsie', 'Lacie', 'Tillie']
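The same selection can be written in other ways. As a sketch on a shortened copy of the document: the `class_` keyword is equivalent to the `{"class": "sister"}` dictionary, and `select()` takes a CSS selector. Pairing each link's text with its `href` is usually the next step when scraping links.

```python
import bs4

# Shortened copy of the "three sisters" document used above
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = bs4.BeautifulSoup(html_doc, "html.parser")

by_keyword = soup.find_all("a", class_="sister")  # keyword form of {"class": "sister"}
by_selector = soup.select("a.sister")             # CSS selector form

# Pair the visible text of each link with its target URL
pairs = [(tag.get_text(strip=True), tag.get("href")) for tag in by_selector]
print(pairs)
```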

---

__Back to the original HTML content...__

In [24]:
parsed_html = bs4.BeautifulSoup(html, "html.parser")
parsed_tags = {tag.name for tag in parsed_html.find_all(True)}  # a set keeps only the unique tag names
parsed_tags

{'a',
 'address',
 'article',
 'aside',
 'body',
 'button',
 'circle',
 'div',
 'fieldset',
 'figure',
 'footer',
 'form',
 'g',
 'h1',
 'h2',
 'h3',
 'head',
 'header',
 'html',
 'i',
 'iframe',
 'img',
 'input',
 'label',
 'legend',
 'li',
 'link',
 'main',
 'meta',
 'nav',
 'noscript',
 'p',
 'path',
 'picture',
 'polygon',
 'polyline',
 'rect',
 'script',
 'section',
 'source',
 'span',
 'strong',
 'style',
 'svg',
 'time',
 'title',
 'ue-sport-events-carousel',
 'ul'}

In [34]:
element = parsed_html.find_all("a", {"itemprop": "url"})
print(element[0])
element[0]['title']
links = [tag.attrs for tag in element]
links

<a href="https://www.marca.com/futbol/barcelona/2021/12/11/61b327edca47415c588b45c5.html" itemprop="url" title="Quedan señalados"> Quedan señalados
</a>


[{'href': 'https://www.marca.com/futbol/barcelona/2021/12/11/61b327edca47415c588b45c5.html',
  'title': 'Quedan señalados',
  'itemprop': 'url'},
 {'href': 'https://www.marca.com/motor/formula1/gp-abu-dhabi/2021/12/11/61b47b4646163f98498b45ad.html',
  'title': 'Hamilton y Verstappen esconden sus cartas',
  'itemprop': 'url'},
 {'href': 'https://www.marca.com/futbol/real-madrid/2021/12/11/61b23c9be2704e2e728b4619.html',
  'title': 'Así ficha el Madrid',
  'itemprop': 'url'},
 {'href': 'https://www.marca.com/motor/formula1/gp-abu-dhabi/2021/12/11/61b45a1246163f0a9d8b45e6.html',
  'title': 'La F1 se rinde a Sainz y bendice el regreso de Alonso',
  'itemprop': 'url'},
 {'href': 'https://www.marca.com/baloncesto/nba/2021/12/11/61b4630b268e3e51168b45ab.html',
  'title': 'La versión más bestial de Antetokounmpo: acabará siendo el logo de los Bucks',
  'itemprop': 'url'},
 {'href': 'https://noesfutboleslaliga.marca.com/directos-al-futuro/un-nuevo-impulso-para-el-futuro-del-futbol-profesional-e

In [35]:
# Pandas!!!

df = pd.DataFrame(links, columns=['title', 'href', 'itemprop', 'rel'])
df

|   | title | href | itemprop | rel |
|---|-------|------|----------|-----|
| 0 | Quedan señalados | https://www.marca.com/futbol/barcelona/2021/12... | url | |
| 1 | Hamilton y Verstappen esconden sus cartas | https://www.marca.com/motor/formula1/gp-abu-dh... | url | |
| 2 | Así ficha el Madrid | https://www.marca.com/futbol/real-madrid/2021/... | url | |
| 3 | La F1 se rinde a Sainz y bendice el regreso de... | https://www.marca.com/motor/formula1/gp-abu-dh... | url | |
| 4 | La versión más bestial de Antetokounmpo: acaba... | https://www.marca.com/baloncesto/nba/2021/12/1... | url | |
| 5 | | https://noesfutboleslaliga.marca.com/directos-... | url | [sponsored] |
| 6 | Nagelsmann explica por qué rechazó al Madrid | https://www.marca.com/futbol/bundesliga/2021/1... | url | |
| 7 | La 'venganza' de Ricky en Minnesota | https://www.marca.com/baloncesto/nba/2021/12/1... | url | |
| 8 | El reinado de Carlsen: ¿quién podrá acabar con... | https://www.marca.com/ajedrez/2021/12/11/61b47... | url | |
| 9 | Sainz o el carácter que forjó a un Número 1 | https://leyendas.marca.com/sainz/sainz-o-el-ca... | url | |
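The empty cells come from tags that lack a given attribute: when a list of dictionaries is passed with an explicit `columns` list, pandas fills missing keys with `NaN`. The sketch below reproduces this on two made-up records (the URLs and titles are placeholders, not scraped data). It also flattens the list-valued `rel` attribute into a plain string, which is an optional clean-up step and not something the original notebook does.

```python
import pandas as pd

# Hypothetical records shaped like tag.attrs; the second has no 'title' but has 'rel'
links = [
    {"href": "https://example.com/a", "title": "First story", "itemprop": "url"},
    {"href": "https://example.com/b", "itemprop": "url", "rel": ["sponsored"]},
]

# 'columns' fixes the column order; keys missing from a record become NaN
df = pd.DataFrame(links, columns=["title", "href", "itemprop", "rel"])

# Optional: turn the list-valued 'rel' into a space-separated string, leaving NaN untouched
df["rel"] = df["rel"].apply(lambda v: " ".join(v) if isinstance(v, list) else v)

print(df.isna().sum())
```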


---

In [45]:
# A string passed as the second argument is matched against the CSS class
element2 = parsed_html.find_all("script", "ueDataPage")
element2


[]
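The empty list is consistent with how `find_all()` treats its second positional argument: a string there is matched against the CSS class, so the call only finds `<script class="ueDataPage">` tags. Where `ueDataPage` actually appears in the page is not shown here. The sketch below therefore uses made-up markup to compare three ways of searching, by class, by `id`, and by the text inside the script.

```python
import re
import bs4

# Hypothetical markup: the same name used as a class, as an id, and inside a script body
sample = """
<script class="ueDataPage">var a = 1;</script>
<script id="ueDataPage">var b = 2;</script>
<script>var ueDataPage = {"section": "home"};</script>
"""
doc = bs4.BeautifulSoup(sample, "html.parser")

by_class = doc.find_all("script", "ueDataPage")                    # same as class_="ueDataPage"
by_id = doc.find_all("script", id="ueDataPage")                    # match on the id attribute
by_text = doc.find_all("script", string=re.compile("ueDataPage"))  # match on the script contents

print(len(by_class), len(by_id), len(by_text))
```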

__More info:__

- [An example of building a pipeline whose acquisition step combines a REST API and web scraping](https://towardsdatascience.com/data-engineering-create-your-own-dataset-9c4d267eb838)

- If the page builds its content dynamically with JavaScript, consider using [Selenium](https://selenium-python.readthedocs.io/)

- [What would happen if you tried to scrape Idealista?](https://www.idealista.com/ayuda/articulos/legal-statement/?lang=en#:~:text=Specifically%2C%20it%20is,prior%20written%20permission.)