# Web scraping Marca and El Pais 2.0

In a <a href="https://github.com/edu9as/web-scraping/blob/master/Scraping-Marca-and-El-Pais.ipynb">notebook</a> I wrote a few weeks ago, I showed some complex code to obtain headlines or full news articles using Python. And when I say complex, I mean really complex :(

Over the past few weeks I have been reading about web scraping. In particular, this <a href="https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491985577">book</a> by Ryan Mitchell has been very useful to me. Now I can perform the same task as in the previous notebook with much simpler code and better results.

## Step 1: Load some libraries

In this case, we need the same two libraries as in the previous notebook (**requests** and **BeautifulSoup**), plus **json**. This last package is useful for parsing information that some webpages embed in JSON format.

In [None]:
import requests
from bs4 import BeautifulSoup
import json

## Step 2: Create a function to extract the news from the webpages

Here, I define a function that scrapes the text of any news article from these two websites.

In [None]:
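def scrape_new(url):
    """Print the text of the news article at `url` (Marca or El Pais)."""
    # Marca keeps its text in p and h3 tags; El Pais also uses h2 and li.
    tags = ["p", "h3"] if "marca" in url else ["p", "h3", "h2", "li"]
    try:
        print("#"*80)
        page = requests.get(url, timeout=10).text
        soup = BeautifulSoup(page, "html.parser")
        for article in soup.find_all("article"):
            for p in article.find_all(tags):
                text = p.text.strip()
                if p.get("class") == ["nombre"]:
                    # Marca: user name of a reader comment
                    print(text, "says:")
                elif (p.get("class") and p.name == "p") or text == "":
                    # Skip p tags that carry a class, and empty tags
                    continue
                else:
                    print(text, "\n")
                    if text == "Inicia sesión para seguir leyendo":
                        # El Pais login wall: the body is also stored as JSON
                        # in the second script tag of the page
                        print("I'm sorry, I can't log in, but here is the article:\n")
                        print(json.loads(soup.find_all("script")[1].string)["articleBody"])
                        break
        print("#"*80)

    except Exception as error:
        print("wowwwww problem with:\n", url, "\n", error)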


- I want exactly 80 hash signs (#) printed right before and after each article.
- All articles in these two online newspapers are found within **article** tags.
- In the case of <a href="https://www.marca.com">Marca</a>, the text of an article can be found in **p** and **h3** tags.
- In the case of <a href="https://www.elpais.com">El Pais</a>, the article body can be found in **p**, **h3**, **h2** and **li** tags.
- In Marca, at the end of each article we can find some comments that readers have made. I considered these to be useful. The username is found in **p** tags whose **class** attribute is **nombre**, and the comment itself comes right after the username, in another pair of **p** tags.
- For some articles in El Pais, it is mandatory to log in before reading them (*Inicia sesión para seguir leyendo* in **p** tags). I am not able to do that using Python, but I noticed that the body of the article can be found near the beginning of the source code, in **script** tags and in JSON format.
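
The function above reads the JSON from the second **script** tag, which breaks if the page layout changes. As a minimal sketch of a more tolerant approach, the helper below (`find_article_body` is a hypothetical name, not part of the original notebook) tries every script content in turn and returns the first `articleBody` it finds. The sample strings are made up for illustration.

```python
import json


def find_article_body(script_texts):
    """Return the first "articleBody" found in a list of <script> contents.

    script_texts: iterable of str or None (tag.string can be None).
    Returns None when no script holds JSON with an "articleBody" key.
    """
    for text in script_texts:
        if not text:
            continue
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue  # ordinary JavaScript, not JSON
        # The JSON may be a single object or a list of objects.
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if isinstance(item, dict) and "articleBody" in item:
                return item["articleBody"]
    return None


# Made-up script contents, standing in for
# [s.string for s in soup.find_all("script")]
scripts = [
    "var x = 1;",
    None,
    '{"@type": "NewsArticle", "articleBody": "Texto de la noticia."}',
]
print(find_article_body(scripts))
```

Inside `scrape_new`, the call could look like `find_article_body(s.string for s in soup.find_all("script"))`, assuming the body is still stored under the `articleBody` key.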

## Step 3: Let's read Marca news

In this step, I request the main <a href="https://www.marca.com">marca.com</a> webpage. All headlines are found within the **main** tag, and more specifically within **a** tags whose **itemprop** attribute is **url**. For each **a** tag holding a headline, the article text can be reached through its **href** attribute. If the user types anything before pressing Enter in response to my question, the article text is printed to the console.

Note: some news items are just a video. As a video cannot be shown in the console, those items are skipped with a warning message.

In [None]:
titulo = "# MARCA.com #"
sep_lat = " "*int((80-len(titulo))/2)

print(sep_lat + "#"*len(titulo) + sep_lat + "\n" + 
      sep_lat + titulo + sep_lat + "\n" +
      sep_lat + "#"*len(titulo) + sep_lat)

url = "https://www.marca.com"
page = requests.get(url).text
soup = BeautifulSoup(page, "html.parser")


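# Headlines are a tags with itemprop="url" inside the main tag
for titular in soup.main.find_all("a", itemprop="url"):
    print("\n-", titular.text.strip())
    # The check assumes video links look like "https://vid...",
    # so that characters 8 to 10 of the link are "vid"
    if titular.get("href")[8:11] == "vid":
        print("### This news item is a video, not available here ###")
    # input() returns an empty string (falsy) if the user just presses Enter
    elif input("Do you want to read more about it? "):
        scrape_new(titular.get("href"))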
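
The video check in the cell above slices the link at fixed positions, so it only works for links that start with exactly `https://`. A possible alternative, sketched here with the standard library, is to look at the hostname instead. `is_video_link` is a hypothetical helper and the example hostnames are placeholders, not real Marca addresses.

```python
from urllib.parse import urlparse


def is_video_link(href):
    """Guess whether a link points to a video page from its hostname."""
    if not href:
        return False  # a tag without href
    # hostname is None for relative links, hence the fallback to ""
    host = urlparse(href).hostname or ""
    return host.startswith("vid")


# Placeholder links for illustration only
print(is_video_link("https://videos.example.com/clip"))
print(is_video_link("https://www.example.com/video.html"))
```

This version also copes with `http://` links and with **a** tags that have no **href** at all, which would make the slice in the original check fail.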

## Step 4: Let's read El Pais news

I have done the same for <a href="https://www.elpais.com">elpais.com</a>.

One particularity: some El Pais links are relative ("/...") while others are absolute ("https://..."). It is important to distinguish between these two types of links in order to build a valid address for each article.

In [None]:
titulo = "# ElPais.com #"
sep_lat = " "*int((80-len(titulo))/2)

print(sep_lat + "#"*len(titulo) + sep_lat + "\n" + 
      sep_lat + titulo + sep_lat + "\n" +
      sep_lat + "#"*len(titulo) + sep_lat)


url = "https://elpais.com"
page = requests.get(url).text
soup = BeautifulSoup(page, "html.parser")


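for titular in soup.find("div", class_="flex_grid background_white").find_all("a"):
    # Only a tags without a class attribute are treated as headlines
    if titular.get("class") is None:
        if titular.text == "Suscripciones":
            break  # stop once the Suscripciones link is reached
        print(titular.text)
        if input("More information? "):
            # Relative links ("/...") need the site URL in front
            if titular.get("href")[0:4] != "http":
                scrape_new(url + titular.get("href"))
            else:
                scrape_new(titular.get("href"))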
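
The relative versus absolute link check in the cell above can also be delegated to `urljoin` from the standard library, which leaves absolute links untouched and completes relative ones. This is only a sketch: `absolute_url` is a hypothetical helper and the article paths are invented for the example.

```python
from urllib.parse import urljoin


def absolute_url(base, href):
    """Return href as an absolute URL, resolving it against base if needed."""
    return urljoin(base, href)


base = "https://elpais.com"

# Invented paths for illustration only
print(absolute_url(base, "/seccion/noticia.html"))
print(absolute_url(base, "https://elpais.com/otra/noticia.html"))
```

With this helper, the two branches of the `if` could collapse into a single call such as `scrape_new(absolute_url(url, titular.get("href")))`.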

## Step 5: Have fun!

This is the end of this notebook. I hope you have enjoyed it.

I think this code can be used not only to print the news to the console, but also to analyze trends in the news media with natural language processing. The same techniques can be applied to other goals, and I will surely keep working along these lines over the coming weeks.
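
As a very rough first step in that direction, the sketch below counts the most frequent words in a list of headlines. It is not real natural language processing, just word counting, and both the `top_words` helper and the headlines are made up for the example.

```python
from collections import Counter
import re


def top_words(headlines, n=3, min_length=4):
    """Return the n most common words with at least min_length letters."""
    words = []
    for headline in headlines:
        # Lowercase and split into words; short words are mostly articles
        # and prepositions, so they are dropped
        words.extend(
            w for w in re.findall(r"\w+", headline.lower())
            if len(w) >= min_length
        )
    return Counter(words).most_common(n)


# Made-up headlines for illustration only
headlines = [
    "El Madrid gana la Liga",
    "La Liga vuelve en agosto",
    "Madrid y Liga, de la mano",
]
print(top_words(headlines, n=2))
```

To feed it real data, the loops in steps 3 and 4 would need to collect `titular.text` into a list instead of only printing it.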

See you!