# El Comercio daily scraper

This notebook takes the article links from the El Comercio RSS feeds and downloads the article contents. This version is mostly meant as a demonstration.

### To do/limitations

- Check whether we are missing articles that appear on the front page.
- This is only a daily scraper. Older articles can be scraped from the PDF files in the archives.
- Check whether it really works with all types of articles (that it does not leave out content we want or include content we don't want).

In [8]:
import requests
from bs4 import BeautifulSoup
import datetime
import os
import unidecode
import re
import json

import sqlite3 # I'm putting the article list in a database. But you can use whatever you're comfortable with

`unidecode` is used purely for transliterating titles into ASCII for file naming. It needs to be installed with `pip install unidecode`. Alternatives are discussed here: <https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string#2633310>
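As a rough sketch of one such alternative, the standard library's `unicodedata` module can strip accents without an extra install. The function name `title_to_ascii_stdlib` is made up for this illustration. Unlike `unidecode`, this approach simply drops characters that have no ASCII decomposition (for example `ß`) instead of transliterating them.

```python
import re
import unicodedata

def title_to_ascii_stdlib(title):
    "Hypothetical stdlib-only helper: strip accents, lowercase, and make the title file-name friendly."
    # NFKD splits accented letters into a base letter plus a combining mark
    decomposed = unicodedata.normalize("NFKD", title)
    # Encoding to ASCII with errors="ignore" drops the combining marks (and anything else non-ASCII)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Spaces become underscores, then every remaining non-word character is removed
    return re.sub(r"\W", "", ascii_only.lower().replace(" ", "_"))

print(title_to_ascii_stdlib("Incidentes aislados y una gran concentración en Chile"))
```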

In [9]:
rss_feeds = [
    "http://www.elcomercio.com/rss",
    "https://www.elcomercio.com/rss/actualidad",
    "http://www.elcomercio.com/rss/tendencias",
    "http://www.elcomercio.com/rss/deportes",
    "http://www.elcomercio.com/rss/opinion"
]
output_dir = "./elcomercio/"

In [10]:
# Convert dates from rss to simple YYYYMMDD dates

ectimefmt = "%a, %d %b %Y %H:%M:%S %z"
def ecdate2mydate(d):
    # %m is the month; %M would insert the minute instead
    return datetime.datetime.strptime(d, ectimefmt).strftime("%Y%m%d")
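A quick standalone check of this kind of conversion can help, since `%m` (month) and `%M` (minute) are easy to confuse. The sample date string and the function name `rss_date_to_yyyymmdd` are made up for illustration. The sketch also assumes the default C locale, because `%a` and `%b` parse English day and month abbreviations.

```python
import datetime

RSS_TIME_FORMAT = "%a, %d %b %Y %H:%M:%S %z"

def rss_date_to_yyyymmdd(d):
    "Hypothetical helper: parse an RSS-style date string and return it as YYYYMMDD."
    parsed = datetime.datetime.strptime(d, RSS_TIME_FORMAT)
    # Lowercase %m is the zero-padded month number
    return parsed.strftime("%Y%m%d")

sample = "Sat, 26 Oct 2019 10:30:00 -0500"  # invented sample, not taken from the feed
print(rss_date_to_yyyymmdd(sample))
```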

In [11]:
cxn = sqlite3.connect("ecuador_news_elcomercio_urlqueue.sqlite")
cs = cxn.cursor()

cs.execute("""
create table if not exists articles
(
    guid text,
    url text,
    author text,
    title text,
    pubdate text,
    fetched integer
)
""")

<sqlite3.Cursor at 0x110beb730>

## Download item list

The feeds only list the last 20 entries. That might be fine for monitoring, but it would then make sense to have a script that keeps downloading the feeds periodically and stores new entries in a database.
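Such a script could be as simple as a loop around the download step. The following is only a sketch under assumptions: `poll_feeds` and `fetch_once` are hypothetical names, `fetch_once` stands for whatever function reads the feeds and returns the number of newly stored items, and the 30-minute interval is an arbitrary choice.

```python
import time

def poll_feeds(fetch_once, interval_seconds=1800, max_rounds=None, sleep=time.sleep):
    "Hypothetical polling loop: call fetch_once() repeatedly and return the total number of new items."
    total = 0
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        try:
            total += fetch_once()
        except Exception as exc:
            # One failed round (e.g. a network error) should not stop the monitoring
            print("Round %d failed: %s" % (rounds + 1, exc))
        rounds += 1
        # Only wait if another round is coming
        if max_rounds is None or rounds < max_rounds:
            sleep(interval_seconds)
    return total
```

With `max_rounds=None` the loop runs until interrupted. For unattended use, a scheduler such as cron calling a single-round script would be a reasonable alternative.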

In [12]:
articles = {}

for feed in rss_feeds:
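    # Try each feed up to 5 times; skip it if every attempt fails
    for i in range(5):
        req = requests.get(feed)
        if req.ok: break
    else:
        print("Could not fetch feed: %s" % feed)
        continue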
    for newsitem in BeautifulSoup(req.content, "xml").find_all("item"):
        
        # For each news item, get the metadata
        itemtitle = newsitem.find("title").get_text()
        itemurl = newsitem.find("link").get_text()
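        # find() returns None when a tag is absent; the author is assumed here to be the optional field
        authortag = newsitem.find("author")
        itemauthor = authortag.get_text() if authortag is not None else ""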
        itemguid = newsitem.find("guid").get_text() # Is identical to link, but we'll use as identifier
        itempubdate = ecdate2mydate(newsitem.find(re.compile("pub[dD]ate")).get_text())
        
        # check if exists
        if cs.execute("select count(*) from articles where guid=?", (itemguid,)).fetchone()[0] > 0:
            print("Item duplicated: '%s'" % itemtitle)
            continue
        
        # save it
        cs.execute(
            "insert into articles (guid, url, author, title, pubdate, fetched) values (?,?,?,?,?,0)",
            (itemguid, itemurl, itemauthor, itemtitle, itempubdate)
        )

cxn.commit()

Item duplicated: 'Incidentes aislados y una gran concentración en Chile'


AttributeError: 'NoneType' object has no attribute 'get_text'
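The duplicate check could also be delegated to the database itself. The sketch below uses a separate in-memory database and a reduced table, not the notebook's actual schema. It assumes `guid` can serve as a primary key, and the helper name `add_article` is made up. With `insert or ignore`, SQLite silently skips rows whose key already exists.

```python
import sqlite3

# Separate in-memory database, so the real URL queue is not touched
demo_cxn = sqlite3.connect(":memory:")
demo_cxn.execute(
    "create table articles (guid text primary key, title text, fetched integer default 0)"
)

def add_article(cxn, guid, title):
    "Hypothetical helper: insert the article unless its guid is already stored. Returns True if inserted."
    cur = cxn.execute(
        "insert or ignore into articles (guid, title) values (?, ?)",
        (guid, title),
    )
    # rowcount is 1 for a new row and 0 when the insert was ignored
    return cur.rowcount == 1

print(add_article(demo_cxn, "https://example.org/a", "First"))
print(add_article(demo_cxn, "https://example.org/a", "First again"))
demo_cxn.commit()
```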

## Downloading the actual articles

In [6]:
def get_contents(soup):
    "Gets the content of an El Comercio article, from the soup."
    return "\n\n".join(p.get_text() for p in soup.find(class_="paragraphs").find_all("p"))

def get_todays_date():
    "Gets today's date, outputs it in YYYYMMDD format."
    return datetime.datetime.now().strftime("%Y%m%d")

def title2ascii(title):
    "Converts a unicode title into something that can conveniently be added in a file name."
    r = unidecode.unidecode(title).lower().replace(" ", "_")
    return re.sub(r"\W", "", r)

In [7]:
lastdate = ""
datecounter = 0

# Fetch the articles we haven't processed yet
cs.execute("select url, guid, title, author, pubdate from articles where fetched=0")

# Roll through them
for itemurl, itemguid, itemtitle, itemauthor, itempubdate in cs.fetchall():
    
    # Get what's missing
    # Name the parser explicitly to avoid bs4's "no parser was explicitly specified" warning
    itemcontents = get_contents(BeautifulSoup(requests.get(itemurl).content, "lxml"))
    itemretrdate = get_todays_date()
    
    # Stuff we need for the file name
    asciititle = title2ascii(itemtitle)
    if lastdate == itempubdate: datecounter += 1
    else:
        datecounter = 1
        lastdate = itempubdate
    
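    # Write into output_dir, creating it if needed
    os.makedirs(output_dir, exist_ok=True)
    filenameroot = os.path.join(output_dir, "%s%02d_%s" % (lastdate, datecounter, asciititle))
    
    # Save text (UTF-8, since the articles contain accented characters)
    with open(filenameroot + ".txt", "w", encoding="utf-8") as f: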
        f.write(itemcontents)
    
    # Save metadata
    with open(filenameroot + ".json", "w") as f:
        json.dump(
            {
                "title": itemtitle,
                "author": itemauthor,
                "url": itemurl,
                "date_published": itempubdate,
                "date_retrieved": itemretrdate
            },
            f
        )
    
    # Mark it down in the database, so that if something fails, we don't need to redo the whole url list.
    cs.execute("update articles set fetched=1 where guid=?", (itemguid,))
    cxn.commit()
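
Each article ends up as a `.txt` and `.json` pair sharing the same file name root. A minimal sketch for reading such a pair back, assuming both files were written as UTF-8 (`load_article` is a hypothetical name):

```python
import json

def load_article(filenameroot):
    "Hypothetical helper: read back one saved article as (metadata dict, text)."
    with open(filenameroot + ".txt", encoding="utf-8") as f:
        text = f.read()
    with open(filenameroot + ".json", encoding="utf-8") as f:
        meta = json.load(f)
    return meta, text
```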