# Scraping sports news websites

The main objective of this notebook is to print all the headlines found on each sports news website.

The first step is to load the only two Python packages required for this: *requests* and *BeautifulSoup*.

In [1]:
import requests
from bs4 import BeautifulSoup

A brief introduction to these two packages:

> **Requests** is an elegant and simple HTTP library for Python, built for human beings. (...) **Requests** allows you to send HTTP/1.1 requests extremely easily. There’s no need to manually add query strings to your URLs, or to form-encode your POST data. Keep-alive and HTTP connection pooling are 100% automatic, thanks to urllib3. (text from **https://requests.readthedocs.io/en/master/**, 2020-09-07)

> **Beautiful Soup** is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. (text from **https://www.crummy.com/software/BeautifulSoup/bs4/doc/**, 2020-09-07)

## MARCA

This first section scrapes www.marca.com. MARCA is a Spanish daily sports newspaper.

After creating the **url** variable, which holds the address of the site, we fetch the page with the **requests** package and store the result in a *Response* object called **page**.

In [2]:
url = "https://www.marca.com/"
page = requests.get(url)
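
The request above assumes that the server answers quickly and successfully. As a minimal sketch, the same fetch can be wrapped in a helper that sets a timeout and raises an exception on an HTTP error status. The name `fetch_page` and the 10-second default are assumptions made for illustration; they are not part of the original notebook.

```python
import requests


def fetch_page(url, timeout=10):
    """Return the HTML text of `url` (hypothetical helper, not from the notebook)."""
    # Without a timeout, requests can wait indefinitely for a slow server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

With this helper, `fetch_page(url)` would return the same string that `page.text` holds in the cells below.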

The next code cell prints the text content of the page, that is, its raw HTML. Only the beginning of the output is shown:

In [4]:
print(page.text)

 <!DOCTYPE html><html lang="es"><head><script>/\/radio(\/parrilla)?.html/gmi.test(location.href||"")&&/MSIE|Trident/gm.test(navigator.userAgent||"")&&!!window.MSInputMethodContext&&!!document.documentMode&&function(){var a=document.createElement("script");a.src="//e00-elmundo.uecdn.es/js/ue-polyfills.min.js",a.type="text/javascript";var b=document.getElementsByTagName("script")[0];b.parentNode.insertBefore(a,b)}();</script>
<script type="text/javascript" language="javascript" src="https://e00-ue.uecdn.es/cookies/js/policy_v3.js"></script>
<script>window.googlefc=window.googlefc||{},window.googlefc.callbackQueue=window.googlefc.callbackQueue||[],googlefc.controlledMessagingFunction=function(o){var e,c,n=!1;try{if(new RegExp("https?://(www.marca.com(/claro-mx|/en)?|(us|co|ar).marca.com/claro)/$").test(window.location.origin+window.location.pathname))console.log("GFC in homepage"),n=!1;else{var l=(e="REGMARCA",(c=document.cookie.match("(^|;) ?"+e+"=([^;]*)(;|$)"))?c[2]:null),a=new RegExp(

Searching the previous output for the headlines shows that they sit inside **a** and **h2** tags. We can therefore use the **BeautifulSoup** package to extract just these headlines.

In [6]:
# Parse the HTML; the following cells refer to this object as `soup`
soup = BeautifulSoup(page.text, "html.parser")
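
Reading the raw HTML by eye to find the right tags can be tedious. One possible shortcut, sketched here under the assumption that headlines share a common class, is to count the class values carried by a given tag name. The helper `count_classes` is hypothetical and only meant as an exploration aid.

```python
from collections import Counter

from bs4 import BeautifulSoup


def count_classes(html, tag_name):
    """Count the class values used by every `tag_name` tag in `html`."""
    doc = BeautifulSoup(html, "html.parser")
    counts = Counter()
    for tag in doc.find_all(tag_name):
        # `class` is multi-valued, so BeautifulSoup returns it as a list.
        for css_class in tag.get("class", []):
            counts[css_class] += 1
    return counts
```

For this page, `count_classes(page.text, "h2").most_common(5)` would list the five most frequent **h2** classes, which are reasonable candidates to inspect.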

The next step is to search the parsed page for these **a** and **h2** tags. More precisely, the headlines are located in **a** tags with **itemprop="url"** and in **h2** tags with **class="flex-article__heading"**. Because *class* is a reserved word in Python, BeautifulSoup takes this filter as the keyword argument **class_**, with a trailing underscore:

In [7]:
tit1 = soup.find_all("a", itemprop="url")
tit2 = soup.find_all("h2", class_="flex-article__heading")
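
To see how these filters behave without downloading anything, here is a small sketch on a hand-written HTML fragment. The fragment and the variable names are invented for illustration; only the tag names, `itemprop="url"`, and the `flex-article__heading` class come from the text above.

```python
from bs4 import BeautifulSoup

# Invented fragment that mimics the two kinds of headline markup.
sample_html = """
<article>
  <a itemprop="url" href="/news/1.html">First headline</a>
  <a href="/about.html">Not a headline</a>
  <h2 class="flex-article__heading">Second headline</h2>
  <h2 class="other">Also not a headline</h2>
</article>
"""
sample = BeautifulSoup(sample_html, "html.parser")

# Any attribute can be passed as a keyword argument...
links = sample.find_all("a", itemprop="url")
# ...except `class`, which needs the trailing underscore.
headings = sample.find_all("h2", class_="flex-article__heading")
# Equivalent forms: an explicit attrs dictionary, or a CSS selector.
headings_attrs = sample.find_all("h2", attrs={"class": "flex-article__heading"})
headings_css = sample.select("h2.flex-article__heading")

print([tag.text for tag in links])
print([tag.text for tag in headings])
```

All three ways of filtering by class should select the same single **h2** tag here, so the choice between them is mostly a matter of taste.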

The last step is to print the text inside each of the matched tags:

In [8]:
print("###################\n" +
      "# MARCA headlines #\n" +
      "###################\n")
# tit1 and tit2 already hold the matching tags, so iterate over them directly
for tag in tit1:
    print("-", tag.text.strip())

for tag in tit2:
    print("-", tag.text.strip())


###################
# MARCA headlines #
###################

- 50 jugadores y seis teclas mágicas: Así es la 'era Ansu'
- Djokovic: sin puntos, sin dinero y tocado para Roland Garros
- El Lyon pone precio a Depay, cumbre por Lautaro...
- Ansu: Renovación automática... según el Barça
- El pelotón tiembla ante los resultados de la PCR
- Bale-Zidane, tenso reencuentro
- Locura en casa de los Fati: ¡así celebró su familia el golazo de Ansu a Ucrania!
- Pedro Sánchez: "En diciembre se vacunará a una parte de la población en España"
- Crece la presión en los hospitales: "Estamos en una segunda ola "
- Sale a pasear por el ala de un avión porque tenía demasiado calor y quería tomar el aire
- Así es la nueva camiseta rosa del Barcelona
- El Atlético presenta su tercera equipación
- "Queremos un Sevilla de gestión transparente"
- El Madrid vuelve a los entrenamientos sin Asensio... y sin Bale
- Daniela Blume evita la censura de Instagram con unas imágenes que dejan poco a la imagina
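
Since the goal is to repeat this for several sites, the steps above could be gathered into one function. The following is only a sketch: `extract_headlines` is a hypothetical name, the whitespace normalisation and duplicate removal are additions that the cells above do not perform, and the two filters are the ones found for MARCA, so other sites would need their own.

```python
from bs4 import BeautifulSoup

# (tag name, attribute filter) pairs observed for MARCA in the cells above.
MARCA_FILTERS = [
    ("a", {"itemprop": "url"}),
    ("h2", {"class": "flex-article__heading"}),
]


def extract_headlines(html, filters=MARCA_FILTERS):
    """Return the headline texts in `html`, in page order per filter, without repeats."""
    doc = BeautifulSoup(html, "html.parser")
    seen = set()
    headlines = []
    for name, attrs in filters:
        for tag in doc.find_all(name, attrs=attrs):
            # Collapse runs of whitespace and trim both ends.
            text = " ".join(tag.get_text().split())
            if text and text not in seen:
                seen.add(text)
                headlines.append(text)
    return headlines
```

Calling `extract_headlines(page.text)` would then return a list of strings that can be printed with a simple loop.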