# Do you want to escape ads when you are visiting a news website?

Maybe it is your favourite Spanish newspaper, which might be www.elpais.com, or your favourite Spanish sports newspaper, www.marca.com. Some days, the sheer number of ads on these websites makes them hard to read.

What if you could read all the news you wanted quickly, with a single click and no ads? Keep reading, because this post is for you. Welcome!

## Web scraping

Web scraping is a technique for extracting data from a website, and it works on (almost) any website you can imagine. You read the page's source code, find where the data you need is located, and extract it from there.

In this post, I use web scraping to obtain news (title, subtitle, theme, body, whatever) from www.elpais.com and www.marca.com. The language I am using is Python, a very powerful programming language. Thanks to the **requests** package, which downloads the pages, and the **BeautifulSoup** package, which parses their HTML, we can easily search the web for some good (or bad) news.

In [None]:
import requests
from bs4 import BeautifulSoup
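Before touching a real website, here is a minimal sketch of the idea on a small HTML string. The snippet in `SAMPLE_HTML` and the function `extract_headline_and_links` are made up for illustration; they are not taken from either newspaper.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <h1>Example headline</h1>
  <a href="/sports/match.html">Match report</a>
  <a href="https://example.com/other.html">Other story</a>
</body></html>
"""


def extract_headline_and_links(html):
    """Return the first <h1> text and a list of (text, href) pairs."""
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find("h1")  # None when the page has no <h1>
    headline = h1.get_text(strip=True) if h1 else None
    links = [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]
    return headline, links


headline, links = extract_headline_and_links(SAMPLE_HTML)
print(headline)
for text, href in links:
    print(text, "->", href)
```

With a real page, the only difference is that the HTML string comes from `requests.get(url).text` instead of a constant.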

## Marca

Quoting Wikipedia, "Marca is a Spanish national daily sport newspaper owned by Unidad Editorial. The newspaper focuses primarily on football, in particular the day-to-day activities of Real Madrid, FC Barcelona and Atlético Madrid" (source: https://en.wikipedia.org/wiki/Marca_(newspaper)). Reading this newspaper, you can stay up to date on sports news, not only about these three football teams but also about many other sports such as basketball, tennis or Formula 1.

Years ago, I read this newspaper daily on the web (www.marca.com). Over the years, however, they have added tons of advertisements to the site, and reading news there is now very slow. I am pretty impatient, so I have written a piece of code that lists all the trending news on the home page and lets you read only the articles you are interested in, without any advertisements.

First, I have defined a function called **print_marca_new** that takes a link to an article, downloads that page, and prints its content to the console in a clean manner.

In [None]:
def print_marca_new(new_):
    """Download one Marca article and print its title, summary and body."""
    url = new_.get("href")
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")

    print("Title:")
    try:
        title = soup.find_all("h1")[0].text.strip()
        print("   " + title)
    except IndexError:
        print("   Not available")
    try:
        # The first <h2> holds the theme and the summary in two <span> tags.
        spans = soup.find_all("h2")[0].find_all("span")
        theme = spans[0].text.strip()
        subtitle = spans[1].text.strip()
        print("Theme:\n   ", theme)
        print("Summary:\n   ", subtitle)
    except IndexError:
        # Fall back to the first paragraph when that structure is missing.
        first_p = soup.find("p")
        print("Summary:\n   ", first_p.text.strip() if first_p else "Not available")

    print("\n")

    if input("Continue reading?(y): ") == "y":
        try:
            body = soup.find_all("div", class_="row content cols-30-70")[0]
            # Drop the last matched element.
            paragraphs = body.find_all(["p", "h3"])[0:-1]
            print("Content:\n")
            for paragraph in paragraphs:
                print(paragraph.text, "\n")
        except IndexError:
            print("Article not available.")
    print("\n\n")
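Most of the function above repeats one pattern: take the first tag of a given kind, read its text, and fall back to a default when the tag is missing. As a sketch, that pattern could be wrapped in a small helper. The name `first_text` and the sample markup are my own assumptions, not part of BeautifulSoup or of marca.com.

```python
from bs4 import BeautifulSoup


def first_text(node, tag_name, default="Not available"):
    """Return the stripped text of the first matching tag, or a default."""
    tag = node.find(tag_name)  # find() returns None when nothing matches
    return tag.get_text(strip=True) if tag is not None else default


# Hypothetical article markup, not copied from marca.com.
soup = BeautifulSoup("<h1> A title </h1><p>First paragraph.</p>", "html.parser")
print(first_text(soup, "h1"))
print(first_text(soup, "h2"))
```

Because `find` returns `None` instead of raising an error, the helper needs no `try`/`except` at all.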

Knowing how Marca's home page is structured, we can collect the links to all its news very easily:

In [None]:
url = "https://www.marca.com/"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")

# Article links appear in two different places on the home page.
tit = soup.find_all("a", itemprop="url")
subt = soup.find_all("h2", class_="flex-article__heading")

print("##################\n" +
      "# marca.com news #\n" +
      "##################\n")
for link in tit:
    a = link.text.strip()
    if input("-" + a + "     Read this article?(y): ") == "y":
        print_marca_new(link)

for heading in subt:
    links = heading.find_all("a")
    if not links:
        continue  # heading without a link
    a = links[0]
    t = a.text.strip()
    if input("-" + t + "     Read this article?(y): ") == "y":
        print_marca_new(a)

## El País

Once again, I'm quoting Wikipedia to briefly explain what this newspaper is: "El País is a Spanish-language daily newspaper in Spain. El País is based in the capital city of Madrid and it is owned by the Spanish media conglomerate PRISA. According to the Office of Justification of Dissemination, it is the second most circulated daily newspaper in Spain as of December 2017." (source: https://en.wikipedia.org/wiki/El_Pa%C3%ADs)

This is a newspaper that I read sometimes, and I have scraped it just for fun. The code is also useful for avoiding adverts when you want to read the body of an article.

This is the function I have defined to extract and print an article in a clean manner:

In [None]:
def print_elpais_new(new_):
    """Download one El País article and print its theme, headline and body."""
    url = new_.get("href")

    # Only site-relative links (starting with "/") are followed.
    if not url or url[0] != "/":
        return
    url = "https://elpais.com" + url
    page = requests.get(url)
    if not page:  # a Response is falsy for 4xx/5xx status codes
        return

    soup = BeautifulSoup(page.text, "html.parser")

    # Re-parsing str(result) merges all matches into one searchable tree.
    article = BeautifulSoup(str(soup.find_all("article")), "html.parser")
    titles = BeautifulSoup(str(article.find_all("div", id="article_header")), "html.parser")

    theme = titles.find_all("span")
    theme = theme[0].text if theme else ""

    title = titles.find_all("h1")
    if not title:
        print("Article not available")
        return
    title = title[0].text

    subtitle = titles.find_all("h2")
    subtitle = subtitle[0].text if subtitle else ""

    print("\n   Theme: {}\n   Headline: {}\n   Subheading: {}\n   Article:\n".format(theme, title, subtitle))

    paragraphs = BeautifulSoup(str(article.find_all("div", class_="a_b article_body | color_gray_dark")), "html.parser")
    for paragraph in paragraphs.find_all("p"):
        # Skip the paragraphs that begin with "Este navegador" ("This browser").
        if not paragraph.text.startswith("Este navegador"):
            print(paragraph.text, "\n")
    print("\n")
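The function above only follows links that start with "/" and builds the full address by prepending the domain by hand. As an alternative sketch, the standard library's `urljoin` resolves relative and absolute links alike. The name `absolute_url` and the example paths are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://elpais.com"


def absolute_url(href, base=BASE_URL):
    """Resolve a link found in a page against the site's base URL."""
    return urljoin(base, href)


# Site-relative link: the base domain is prepended.
print(absolute_url("/deportes/example.html"))
# Already absolute link: returned unchanged.
print(absolute_url("https://example.com/story.html"))
```

Note that this changes the behaviour slightly: links that point to other domains would be followed too, so you may still want to check the domain before downloading the page.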

And with this last block of code, you can access the text of every article linked from the elpais.com home page:

In [None]:
url = "https://www.elpais.com/"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")

# Display names for the three blocks of the home page.
sections = {"section_b": "front page", "section_c": "featured",
            "thematic_section": "thematic section"}

for section, label in sections.items():
    name = label.capitalize()
    banner = "#" * (len(name) + 4)
    print("\n\n" + banner + "\n" + "# %s #\n" % name + banner + "\n")
    block = BeautifulSoup(str(soup.find_all("div", class_=section + " | col desktop_12 tablet_8 mobile_4")), "html.parser")
    if section != "thematic_section":
        for _class_ in ["", "related_story_headline"]:
            for title in block.find_all("a", class_=_class_):
                if input("-" + title.text + "    Read article?(y): ") == "y":
                    print_elpais_new(title)
    else:
        # The thematic section has one <div> per topic, identified by its id.
        for topic in ["internacional", "espana", "el-pais-economia",
                      "sociedad", "cultura", "ciencia-y-tecnologia",
                      "deportes", "television", "estilo-y-vida", "gente",
                      "motor"]:
            print("\n##", topic.upper(), "##\n")
            category = BeautifulSoup(str(block.find_all("div", id=topic)), "html.parser")
            for _class_ in ["", "related_story_headline"]:
                for title in category.find_all("a", class_=_class_):
                    if input("-" + title.text + "   Read the full article?(y): ") == "y":
                        print_elpais_new(title)
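The code above matches each block by the exact, full value of its `class` attribute, which breaks as soon as the site adds or reorders a class. As a sketch of a more tolerant approach, BeautifulSoup's `select` method accepts CSS selectors, so a block can be matched by a single class name. The HTML below is a simplified stand-in that I made up, not the real elpais.com markup, and `links_in_section` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup loosely modelled on the home page blocks.
HTML = """
<div class="section_b | col desktop_12">
  <h2><a href="/a.html">Front-page story</a></h2>
  <a class="related_story_headline" href="/b.html">Related story</a>
</div>
<div class="section_c | col desktop_12">
  <h2><a href="/c.html">Featured story</a></h2>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")


def links_in_section(soup, section_class):
    """Return (text, href) for every link inside divs with the given class."""
    selector = f"div.{section_class} a[href]"
    return [(a.get_text(strip=True), a["href"]) for a in soup.select(selector)]


for text, href in links_in_section(soup, "section_b"):
    print(text, "->", href)
```

The selector `div.section_b` matches any `div` whose class list contains `section_b`, whatever other classes come with it.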

I hope you have enjoyed this notebook. Web scraping is a whole world of its own and opens up a lot of possibilities for data science. If you are interested in this topic, I recommend learning some HTML and JavaScript, because knowing how pages are built makes it much easier to find the data you want.

Have fun!