# WEBSCRAPING *LA GACETA DE MADRID*

## Elena Fernández Fernández

In this notebook you will find the code needed to webscrape the historic government-owned newspaper *La Gaceta de Madrid*. You can find the original documents (with OCR already applied) on this website run by the Spanish government: https://www.boe.es/buscar/ayudas/gazeta_ayuda.php

Please, have a look at the Spanish Government guidelines regarding *La Gaceta de Madrid* data use: https://www.boe.es/buscar/ayudas/gazeta_ayuda.php

**1.** The first thing that you need to do is to import the necessary libraries for webscraping

In [1]:
import urllib.request
from urllib.request import urlopen
from bs4 import BeautifulSoup
import shutil

**2.** Then, you need to choose the documents that you need. For example, let's say that you would like to retrieve all the *Gacetas de Madrid* published during the times of Fernando VII and the Napoleonic invasion of Spain. Go to the *Gaceta colección histórica* and run your search: https://www.boe.es/buscar/gazeta.php

**3.** Once you have found the documents that you are looking for, you will need to have a look at the source code of your search results. The source code is the original code in which a website is written (if you would like to read more: https://en.wikipedia.org/wiki/Source_code), and you will need to be able to read and understand it in order to select the exact information that you want to retrieve. To see the source code of any given website, right-click on the page and select the "View page source" option (the exact wording depends on your browser).

The target of this notebook will be to download all the PDFs in this website: https://www.boe.es/buscar/gazeta.php?campo%5B0%5D=ID_HIST&dato%5B0%5D=16&campo%5B1%5D=TITULOS&dato%5B1%5D=&operador%5B1%5D=and&campo%5B2%5D=RNG.ID&dato%5B2%5D=&operador%5B2%5D=and&campo%5B3%5D=DEM.ID&dato%5B3%5D=&operador%5B3%5D=and&campo%5B4%5D=DOC&dato%5B4%5D=&operador%5B4%5D=and&campo%5B5%5D=TITULOS&dato%5B5%5D=&operador%5B5%5D=and&campo%5B6%5D=GAZ.ID&dato%5B6%5D=&campo%5B7%5D=NBO&dato%5B7%5D=&operador%5B8%5D=and&campo%5B8%5D=FPU&dato%5B8%5D%5B0%5D=&dato%5B8%5D%5B1%5D=&operador%5B9%5D=and&campo%5B9%5D=FAP&dato%5B9%5D%5B0%5D=&dato%5B9%5D%5B1%5D=&page_hits=50&sort_field%5B0%5D=FPU&sort_order%5B0%5D=desc&sort_field%5B1%5D=REF&sort_order%5B1%5D=asc&accion=Buscar

**4**. The next step is to use Beautiful Soup, a Python library specializing in webscraping. With Beautiful Soup, we load the source code that you just had a look at into our notebook. If you would like to familiarize yourself more with Beautiful Soup, have a look at the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/.

In [2]:
html_doc = urlopen("https://www.boe.es/buscar/gazeta.php?campo%5B0%5D=ID_HIST&dato%5B0%5D=16&campo%5B1%5D=TITULOS&dato%5B1%5D=&operador%5B1%5D=and&campo%5B2%5D=RNG.ID&dato%5B2%5D=&operador%5B2%5D=and&campo%5B3%5D=DEM.ID&dato%5B3%5D=&operador%5B3%5D=and&campo%5B4%5D=DOC&dato%5B4%5D=&operador%5B4%5D=and&campo%5B5%5D=TITULOS&dato%5B5%5D=&operador%5B5%5D=and&campo%5B6%5D=GAZ.ID&dato%5B6%5D=&campo%5B7%5D=NBO&dato%5B7%5D=&operador%5B8%5D=and&campo%5B8%5D=FPU&dato%5B8%5D%5B0%5D=&dato%5B8%5D%5B1%5D=&operador%5B9%5D=and&campo%5B9%5D=FAP&dato%5B9%5D%5B0%5D=&dato%5B9%5D%5B1%5D=&page_hits=50&sort_field%5B0%5D=FPU&sort_order%5B0%5D=desc&sort_field%5B1%5D=REF&sort_order%5B1%5D=asc&accion=Buscar")
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.prettify())

<!DOCTYPE html>
<html lang="es">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="telephone=no" name="format-detection"/>
  <meta content="noindex" name="robots">
   <meta content="Gazeta: formulario de b&amp;uacute;squeda de disposiciones y noticias" name="Description"/>
   <title>
    BOE.es - Gazeta: formulario de búsqueda de disposiciones y noticias
   </title>
   <link href="/favicon.ico" rel="shortcut icon"/>
   <link href="/favicon.ico" rel="icon" type="image/x-icon"/>
   <link href="/apple-touch-icon.png" rel="apple-touch-icon"/>
   <base target="_top"/>
   <link href="/estilos/boe.css" rel="stylesheet" type="text/css">
    <link href="/estilos/buscador.css" rel="stylesheet" type="text/css"/>
    <!--[if lt IE 10]>
    <link rel="stylesheet" type="text/css" href="/estilos/boe_ie9.css" />
    <![endif]-->
    <!--[if lt IE 9]>
    <link rel="stylesheet" type="text/css" href="/estilos/boe_ie8.css" />
    <![endif]-->


**5**. The links to the PDFs, which are what we are looking for, are contained in the "a" tags (the HTML tag used for links), so we need to filter our search to get all the information stored within the "a" tags.

In [3]:
soup.find_all("a")

[<a accesskey="c" href="#contenedor" tabindex="-1">Ir a contenido</a>,
 <a accesskey="5" href="/diario_boe/" tabindex="-1">Consultar el diario oficial BOE</a>,
 <a href="http://www.mpr.es/" title="Ir al Ministerio de la Presidencia"><img alt="Ministerio de la Presidencia" src="/imagenes/logoMPRmovil.png" srcset="/imagenes/logoMPRmovil.svg"/></a>,
 <a href="http://www.mpr.es/" title="Ir al Ministerio de la Presidencia"><img alt="Ministerio de la Presidencia" src="/imagenes/logoMPR.png" srcset="/imagenes/logoMPR.svg"/></a>,
 <a accesskey="1" href="/" title="Ir a la página de inicio"><img alt="Agencia Estatal Boletín Oficial del Estado" src="/imagenes/logoBOE.gif" srcset="/imagenes/logoBOE.svg"/></a>,
 <a href="/" title="Ir a la página de inicio"><img alt="Agencia Estatal Boletín Oficial del Estado" src="/imagenes/logoBlanco128.png"/></a>,
 <a href="gazeta.php?lang=es" hreflang="es" lang="es" title="Cambiar a español/castellano"><span aria-hidden="true" class="idioma"><abbr title="español

**6**. Once we have all the "a" tags, we need to narrow our search even more, as the PDF links that we are looking for are stored in the "href" attribute of each "a" tag. We will store them in a list (pdfs).

In [4]:
pdfs = []
for link in soup.find_all('a'):
    pdfs.append(link.get("href"))
pdfs

['#contenedor',
 '/diario_boe/',
 'http://www.mpr.es/',
 'http://www.mpr.es/',
 '/',
 '/',
 'gazeta.php?lang=es',
 'gazeta.php?lang=ca',
 'gazeta.php?lang=gl',
 'gazeta.php?lang=eu',
 'gazeta.php?lang=va',
 'gazeta.php?lang=en',
 'gazeta.php?lang=fr',
 '/buscar/',
 '/mi_boe/',
 '/buscar/',
 '/mi_boe/',
 '/index.php#diarios',
 '/diario_boe',
 '/diario_borme',
 '/legislacion/otros_diarios_oficiales.php',
 '/index.php#juridico',
 '/legislacion/',
 '/biblioteca_juridica/',
 '/index.php#servicios-adicionales',
 '/notificaciones',
 '/edictos_judiciales',
 'https://subastas.boe.es',
 '/anuncios',
 '/datosabiertos/api/api.php',
 '/',
 '/buscar/',
 './gazeta.php?accion=&id_busqueda=TUYyNVhOUXRCZ2Fkdi9jbUM0S09RZ1cyWUd6Q2FidkszUTRzZloyUGMwbnlpUXdqdUpGTi9EcHhZdk9yN1B1K0Z5Lzdod0tibXl4ek1KV3ArL2xoRzA3enBQTG1jVVNrK2lnNmV5bDg1YVllMHVTWHkwd0lDREp4T2FFQk5NUGloVkFkNCtwUmowakxhRFZTcVpvV1JPSFplelFSSFFDdHI4eXdRNDRMQ25EdHhHSmNDRytabXpQVjNKckswS3VHcUN4VWpZNlVHUGp2QU5iaFRUck40b2E4MDZmaytrNjBxQU9XMjk0OHJwWi91ZX
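To experiment with this step without contacting the website, the same idea can be tried on a small hand-written page. This is only a sketch: `sample_html` and its links are made up for illustration and are not taken from the real search results. It also shows that Beautiful Soup can be asked to skip "a" tags that have no "href" attribute at all, for which `link.get("href")` would return `None`.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page (hypothetical) standing in for the search results
sample_html = """
<html><body>
  <a href="/datos/pdfs/example-1.pdf">PDF</a>
  <a name="top">Anchor without href</a>
  <a href="/buscar/">Search</a>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# href=True keeps only the <a> tags that actually carry an href attribute
hrefs = [a["href"] for a in sample_soup.find_all("a", href=True)]
print(hrefs)
```

The second "a" tag has no "href", so it does not appear in the resulting list.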

**7**. Now we need to narrow our search even more. What we need are all the strings in which the PDFs are stored, and they all have a very similar format '/datos/pdfs/BOE//1833/120/A00509-00510.pdf', all of them ending with ".pdf". So, let's filter our search using the string method .endswith(".pdf"). 

In [5]:
links = [x for x in pdfs if x.endswith('.pdf')]
links

['/datos/pdfs/BOE//1870/320/A00001-00001.pdf',
 '/datos/pdfs/BOE//1870/320/A00001-00001.pdf',
 '/datos/pdfs/BOE//1870/320/A00001-00003.pdf',
 '/datos/pdfs/BOE//1870/320/A00003-00003.pdf',
 '/datos/pdfs/BOE//1870/320/A00003-00004.pdf',
 '/datos/pdfs/BOE//1870/320/A00004-00004.pdf',
 '/datos/pdfs/BOE//1870/320/A00004-00004.pdf',
 '/datos/pdfs/BOE//1870/320/A00004-00004.pdf',
 '/datos/pdfs/BOE//1870/320/A00004-00004.pdf',
 '/datos/pdfs/BOE//1870/320/A00004-00006.pdf',
 '/datos/pdfs/BOE//1870/320/A00006-00007.pdf',
 '/datos/pdfs/BOE//1870/320/A00007-00007.pdf',
 '/datos/pdfs/BOE//1870/320/A00007-00008.pdf',
 '/datos/pdfs/BOE//1870/320/A00008-00008.pdf',
 '/datos/pdfs/BOE//1870/320/A00008-00008.pdf',
 '/datos/pdfs/BOE//1870/320/A00008-00008.pdf',
 '/datos/pdfs/BOE//1870/320/A00008-00008.pdf',
 '/datos/pdfs/BOE//1870/320/A00008-00008.pdf',
 '/datos/pdfs/BOE//1870/320/A00008-00008.pdf',
 '/datos/pdfs/BOE//1870/319/A00001-00001.pdf',
 '/datos/pdfs/BOE//1870/319/A00001-00001.pdf',
 '/datos/pdfs

In [6]:
len(links)

50
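The filter above works because every entry in pdfs is a string here. On other pages, some "a" tags may have no "href" (giving `None`) or may use an upper-case extension. A slightly more defensive variant could look like the sketch below; the entries of `sample_hrefs` are invented for illustration.

```python
# Hypothetical href values, including a missing one (None) and an upper-case extension
sample_hrefs = [
    "/buscar/",
    None,
    "/datos/pdfs/example-1.pdf",
    "/datos/pdfs/EXAMPLE-2.PDF",
]

# "h and ..." skips None and empty strings; lower() makes the test case-insensitive
pdf_links = [h for h in sample_hrefs if h and h.lower().endswith(".pdf")]
print(pdf_links)
```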

**8**. The next step is to turn our list of relative links (as they appear in the source code) into a list of complete URLs. To do this, we will use string concatenation: in Python, two strings can be joined simply with the "+" operator.

In [7]:
beginning = "https://www.boe.es"

files = [beginning + x for x in links]
files

['https://www.boe.es/datos/pdfs/BOE//1870/320/A00001-00001.pdf',
 'https://www.boe.es/datos/pdfs/BOE//1870/320/A00001-00001.pdf',
 'https://www.boe.es/datos/pdfs/BOE//1870/320/A00001-00003.pdf',
 'https://www.boe.es/datos/pdfs/BOE//1870/320/A00003-00003.pdf',
 'https://www.boe.es/datos/pdfs/BOE//1870/320/A00003-00004.pdf',
 'https://www.boe.es/datos/pdfs/BOE//1870/320/A00004-00004.pdf',
 'https://www.boe.es/datos/pdfs/BOE//1870/320/A00004-00004.pdf',
 'https://www.boe.es/datos/pdfs/BOE//1870/320/A00004-00004.pdf',
 'https://www.boe.es/datos/pdfs/BOE//1870/320/A00004-00004.pdf',
 'https://www.boe.es/datos/pdfs/BOE//1870/320/A00004-00006.pdf',
 'https://www.boe.es/datos/pdfs/BOE//1870/320/A00006-00007.pdf',
 'https://www.boe.es/datos/pdfs/BOE//1870/320/A00007-00007.pdf',
 'https://www.boe.es/datos/pdfs/BOE//1870/320/A00007-00008.pdf',
 'https://www.boe.es/datos/pdfs/BOE//1870/320/A00008-00008.pdf',
 'https://www.boe.es/datos/pdfs/BOE//1870/320/A00008-00008.pdf',
 'https://www.boe.es/dato
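Adding the same beginning to every link works here because all the PDF links start with "/". As a possible alternative, the standard library function `urljoin` resolves a link against the address of the page it was found on, whatever form the link takes. The sketch below uses made-up example links (`sample_links`), not the real search results.

```python
from urllib.parse import urljoin

# Address of the page the links were found on
page_url = "https://www.boe.es/buscar/gazeta.php"

# Hypothetical links in the three forms seen in the list of hrefs:
# starting with "/", relative to the current page, and already complete
sample_links = [
    "/datos/pdfs/example-1.pdf",
    "gazeta.php?lang=en",
    "https://subastas.boe.es",
]

absolute = [urljoin(page_url, link) for link in sample_links]
for url in absolute:
    print(url)
```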

**9**. The next thing that we need to do is to store those URLs (with their corresponding PDFs) on our computer. To do this, we need to follow several steps:

- Firstly, we need to create a folder on our computer where we will save our data. Please go to your current Jupyter Notebook location (where you have stored the notebook that we are currently using), and create a folder called "files_1". If you want to double-check where you are at the moment, simply type pwd, run the cell, and you will see your current location.

In [8]:
pwd

'C:\\Users\\usuario\\ELENA\\it-training uzh\\it-training uzh\\Python for Digital Humanities\\Day 1\\Webscraping 2. HTML\\Webscraping'

- Afterwards, we need to create a list of file names that will match the PDF URLs that we want to save. For this, ideally, we should write a function that takes pdf (a string, such as the URLs in our list files) and returns an easily identifiable file name.

In [9]:
def local_name(pdf):
    return 'a'.join(pdf.split('/')[-3:])

To clarify this even more, let's have a look at what happens if we use the split method on one of our files: it breaks the URL down into several parts, and it allows us to see (and then choose) how we want to structure our file names.

In [10]:
files[0].split("/")

['https:',
 '',
 'www.boe.es',
 'datos',
 'pdfs',
 'BOE',
 '',
 '1870',
 '320',
 'A00001-00001.pdf']

After that, we can choose whatever character we want (such as letters, or a white space, for example) with the join string method to name our file. 

In [11]:
"a".join(files[0].split("/")[-3:])

'1870a320aA00001-00001.pdf'

In [12]:
def local_name(pdf):
    return 'a'.join(pdf.split('/')[-3:])

To clarify things even more: we are creating one file name per PDF by calling the function local_name() on each element of the list files (which contains the URLs of the PDFs that we want to store). Afterwards, we will save each PDF on our computer under that name.

In [13]:
files[0], local_name(files[0])

('https://www.boe.es/datos/pdfs/BOE//1870/320/A00001-00001.pdf',
 '1870a320aA00001-00001.pdf')
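Since any separator works with join, a variant of the naming function could take the separator as a parameter. The sketch below is only an illustration: `readable_name` and its `separator` parameter are hypothetical names, chosen so that the notebook's own local_name() is not overwritten.

```python
def readable_name(pdf_url, separator="_"):
    """Build a file name from the last three parts of the URL."""
    return separator.join(pdf_url.split("/")[-3:])


example = "https://www.boe.es/datos/pdfs/BOE//1870/320/A00001-00001.pdf"
print(readable_name(example))        # 1870_320_A00001-00001.pdf
print(readable_name(example, "a"))   # 1870a320aA00001-00001.pdf
```

With "a" as the separator the result is the same as local_name(); an underscore simply makes the three parts easier to tell apart.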

Now, before we get the PDFs and save them on our computer, we need to create a folder. We can do it directly from a Jupyter notebook! Let's call it "files_1".

In [14]:
pwd #to check where we are

'C:\\Users\\usuario\\ELENA\\it-training uzh\\it-training uzh\\Python for Digital Humanities\\Day 1\\Webscraping 2. HTML\\Webscraping'

In [15]:
mkdir files_1 

A subdirectory or file files_1 already exists.
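As the message above shows, mkdir complains when the folder is already there. If you prefer to create the folder from Python code, one option is `Path.mkdir` with `exist_ok=True`, which does nothing when the folder already exists. This is a sketch only: `ensure_folder` is a hypothetical helper, and the demonstration runs inside a temporary directory so that it does not touch your real files.

```python
import tempfile
from pathlib import Path


def ensure_folder(name):
    """Create the folder if it is missing and return it as a Path."""
    folder = Path(name)
    # exist_ok=True: no error if the folder is already there
    folder.mkdir(parents=True, exist_ok=True)
    return folder


# Demonstration inside a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    target = ensure_folder(Path(tmp) / "files_1")
    ensure_folder(Path(tmp) / "files_1")  # second call: no error
    print(target.is_dir())
```

In the notebook itself, the equivalent call would be ensure_folder("files_1").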


- Finally, we need to loop over the list of files. To do this, we will need to address several different aspects: 

**1. Exception Handling**. To prevent a single failed download from stopping the whole loop, we will use something called "exception handling" (have a look at this Wikipedia article if you would like to know more: https://en.wikipedia.org/wiki/Exception_handling).

**2. Storing PDFs in our computer**. We will use the Python shutil module to do this. If you would like to know more about it, have a look at the Python documentation (https://docs.python.org/3/library/shutil.html).

**3. Checking that we got all our files in our computer**. If you had a panic attack on seeing the mismatch between the length of the list (len(files)) and the number of files that ended up on your computer, don't worry ;) Just check len(set(files)) instead: some URLs were repeated in the list, and repeated URLs share the same file name, so the numbers now make sense.
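The check described in the third point can be tried on a short list. The sketch below uses the first three URLs from the output of files shown earlier (the first one appears twice); `sample_files` and `unique_files` are names introduced here for illustration. `dict.fromkeys` is one way to drop the repeated entries while keeping the original order, which a plain set does not guarantee.

```python
# First three entries of the list files shown earlier (the first URL is repeated)
sample_files = [
    "https://www.boe.es/datos/pdfs/BOE//1870/320/A00001-00001.pdf",
    "https://www.boe.es/datos/pdfs/BOE//1870/320/A00001-00001.pdf",
    "https://www.boe.es/datos/pdfs/BOE//1870/320/A00001-00003.pdf",
]

# Dictionary keys are unique and keep insertion order
unique_files = list(dict.fromkeys(sample_files))

print(len(sample_files))       # 3
print(len(set(sample_files)))  # 2
print(len(unique_files))       # 2
```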

In [16]:
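def download(remote_fn, local_fn):
    # Save the file inside the "files_1" folder created above
    fname = 'files_1/' + local_fn
    try:
        # Copy the remote PDF straight into a local file opened in binary mode
        with urllib.request.urlopen(remote_fn) as response, open(fname, mode="wb") as out_file:
            shutil.copyfileobj(response, out_file)
        return True
    except Exception:  # any download or file error: report failure instead of stopping
        return False

# Download every PDF in the list under its local file name
for remote_fn in files:
    local_fn = local_name(remote_fn)
    download(remote_fn, local_fn)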
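The loop above ignores the True/False value returned by download(), so a failed download goes unnoticed. One possible variant, sketched below, collects the URLs that could not be saved so that they can be inspected or retried later. `download_all` and its `folder` parameter are hypothetical names; local_name() is repeated only to keep the sketch self-contained.

```python
import shutil
import urllib.request
from pathlib import Path


def local_name(pdf):
    # Same naming scheme as in the notebook
    return 'a'.join(pdf.split('/')[-3:])


def download_all(urls, folder):
    """Download each URL into folder and return the list of URLs that failed."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    failed = []
    for url in urls:
        target = folder / local_name(url)
        try:
            # urlopen runs first, so no empty file is created when the request fails
            with urllib.request.urlopen(url) as response, open(target, "wb") as out_file:
                shutil.copyfileobj(response, out_file)
        except Exception as error:
            print(f"Could not download {url}: {error}")
            failed.append(url)
    return failed
```

In the notebook this could be called as failed = download_all(files, "files_1"); an empty list would mean that every download succeeded.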


And now you have your files stored in your computer ready for doing some Text Data Mining research!

# Programming Acknowledgements: 

Thanks so much to Dr. Raji Ghawi for his help with this notebook! 

# Works Cited

https://stackoverflow.com/questions/16304146/in-python-how-do-i-extract-a-sublist-from-a-list-of-strings-by-matching-a-string

https://stackoverflow.com/questions/9304408/how-to-add-an-integer-to-each-element-in-a-list

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

https://stackoverflow.com/questions/24844729/download-pdf-using-urllib

# Recommended Readings

I have found this book extremely useful: https://www.oreilly.com/library/view/web-scraping-with/9781491985564/