# Miscellaneous sources

## Import statements

In [1]:
import pandas as pd
import pathlib

## HTML-specific package

To handle HTML pages, an additional package (`lxml`) is needed:

In [2]:
import lxml
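
Before calling `pd.read_html`, it can help to confirm that the package is installed. The following is a minimal sketch using only the standard library; the variable name `has_lxml` is illustrative and not part of the class material.

```python
import importlib.util

# find_spec returns None when the package cannot be located in the current environment
has_lxml = importlib.util.find_spec("lxml") is not None

if has_lxml:
    print("lxml is available")
else:
    print("lxml is missing; install it before using pd.read_html")
```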

## Setup

The files for this class are stored in the `data/misc` directory:

In [3]:
directory = pathlib.Path("data") / "misc"

## Demo 1: Import data from a table in a webpage

The URL for this demo is the [Comunidad autónoma](https://es.wikipedia.org/wiki/Comunidad_aut%C3%B3noma) page on [Wikipedia](https://es.wikipedia.org).

### Import all tables from a webpage into a list of dataframes

In [4]:
url = "https://es.wikipedia.org/wiki/Comunidad_aut%C3%B3noma"

In [5]:
df_list = pd.read_html(url)

In [6]:
type(df_list)

list

In [7]:
len(df_list)

12
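
With a dozen tables in the list, a compact overview can be quicker than displaying them one by one. The sketch below builds one summary row per table; the `tables` list is a small hypothetical stand-in for the result of `pd.read_html(url)`, so that the example runs without network access.

```python
import pandas as pd

# Hypothetical stand-in for the list returned by pd.read_html(url)
tables = [
    pd.DataFrame({0: ["map caption"]}),
    pd.DataFrame({"Posición": [1, 2], "Nombre": ["Andalucía", "Cataluña"]}),
    pd.DataFrame(
        {
            "Year": [1990, 1991],
            "Name": ["Sather", "Python"],
            "Predecessor(s)": ["Eiffel", "ABC, C"],
        }
    ),
]

# One row per table: number of rows, number of columns, and the first few column labels
overview = pd.DataFrame(
    {
        "rows": [t.shape[0] for t in tables],
        "columns": [t.shape[1] for t in tables],
        "first_labels": [list(t.columns[:3]) for t in tables],
    }
)
print(overview)
```

The index of `overview` is the position of each table in the list, so the row of interest tells which item to select.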

In [8]:
df = df_list[0]
df.head()

Unnamed: 0,0
0,Galicia Asturias Cantabria PaísVasco Navarra L...
1,Comunidades autónomas de España.


In [9]:
df = df_list[1]
df.head()

Unnamed: 0,0
0,C LU O S BI SS NA HU L GI B T CS PM V A MU Z T...
1,Provincias de España identificadas según el es...


In [10]:
df = df_list[2]
df.head()

Unnamed: 0,Bandera,Escudo,Comunidadautónoma,Provincias,Capital provincial,Capital de la CCAA,Estatuto de Autonomía,Localización
0,,,Andalucía[6]​,Almería,Almería,Sevilla,"2007 (en sustitución del anterior Estatuto, de...",
1,,,Andalucía[6]​,Cádiz,Cádiz,Sevilla,"2007 (en sustitución del anterior Estatuto, de...",
2,,,Andalucía[6]​,Córdoba,Córdoba,Sevilla,"2007 (en sustitución del anterior Estatuto, de...",
3,,,Andalucía[6]​,Granada,Granada,Sevilla,"2007 (en sustitución del anterior Estatuto, de...",
4,,,Andalucía[6]​,Huelva,Huelva,Sevilla,"2007 (en sustitución del anterior Estatuto, de...",


In [11]:
df = df_list[3]
df.head()

Unnamed: 0,Comunidad o ciudad autónoma,Comunidad o ciudad autónoma.1,Superficie (km²),Porcentaje
0,1,Castilla y León,94 226,"18,6 %"
1,2,Andalucía,87 268,"17,2 %"
2,3,Castilla-La Mancha,79 463,"15,7 %"
3,4,Aragón,47 719,"9,4 %"
4,5,Extremadura,41 635,"8,2 %"
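
The area and percentage columns above are imported as text, because the page uses a space as thousands separator and a comma as decimal separator. One possible cleanup is sketched below, assuming the values look like those in the output; the `area` DataFrame is a stand-in built from the first rows shown above.

```python
import pandas as pd

# Stand-in for df_list[3], using values copied from the output above
area = pd.DataFrame(
    {
        "Comunidad o ciudad autónoma.1": ["Castilla y León", "Andalucía", "Castilla-La Mancha"],
        "Superficie (km²)": ["94 226", "87 268", "79 463"],
        "Porcentaje": ["18,6 %", "17,2 %", "15,7 %"],
    }
)

# Remove every whitespace character (regular or non-breaking), then convert to integers
area["Superficie (km²)"] = (
    area["Superficie (km²)"].str.replace(r"\s", "", regex=True).astype(int)
)

# Remove whitespace and the percent sign, replace the decimal comma, then convert to floats
area["Porcentaje"] = (
    area["Porcentaje"]
    .str.replace(r"[\s%]", "", regex=True)
    .str.replace(",", ".", regex=False)
    .astype(float)
)
print(area.dtypes)
```

`pd.read_html` also accepts `thousands` and `decimal` arguments, which may avoid part of this cleanup when a page uses one consistent number format.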


In [12]:
df = df_list[4]
df.head()

Unnamed: 0,Posición,Nombre,Población(hab.),%,Densidad(hab./km²)
0,1,Andalucía,8 379 248,1799,9602
1,2,Cataluña,7 596 131,1622,23533
2,3,Comunidad de Madrid,6 576 009,1397,81117
3,4,Comunidad Valenciana,4 959 243,1061,21249
4,5,Galicia,2 700 970,582,9158


## Exercise 1

The URL for this exercise is the [Timeline of programming languages](https://en.wikipedia.org/wiki/Timeline_of_programming_languages) page from [Wikipedia](https://en.wikipedia.org/).

In [18]:
url = "https://en.wikipedia.org/wiki/Timeline_of_programming_languages"

Import the list of DataFrames from the `url`:

In [19]:
list_of_tables = pd.read_html(url)

In [20]:
type(list_of_tables)

list

Check the number of items in the list of DataFrames:

In [21]:
len(list_of_tables)

15

Identify the table with programming languages from the 1990s:

In [29]:
list_of_tables[9].head()

Unnamed: 0,Year,Name,"Chief developer, company",Predecessor(s)
0,1990,Sather,Steve Omohundro,Eiffel
1,1990,AMOS BASIC,François Lionet and Constantin Sotiropoulos,STOS BASIC
2,1990,AMPL,"Robert Fourer, David Gay and Brian Kernighan a...",
3,1990,Object Oberon,"H Mössenböck, J Templ, R Griesemer",Oberon
4,1990,J,"Kenneth E. Iverson, Roger Hui at Iverson Software","APL, FP"


Store the table in a variable:

In [31]:
prog_lang = list_of_tables[9]

Select the row(s) for which the "Name" column is equal to "Python":

In [35]:
prog_lang.loc[prog_lang["Name"]=="Python",:]

Unnamed: 0,Year,Name,"Chief developer, company",Predecessor(s)
12,1991,Python,Guido van Rossum,"ABC, C"


## Demo 2: Import data from a specific table in a webpage

The URL for this demo is the [Comunidad autónoma](https://es.wikipedia.org/wiki/Comunidad_aut%C3%B3noma) page on [Wikipedia](https://es.wikipedia.org).

In [36]:
url = "https://es.wikipedia.org/wiki/Comunidad_aut%C3%B3noma"

Import the subset of tables matching "Población" from the webpage into a list of DataFrames:

In [37]:
df_list = pd.read_html(url, match="Población")

In [38]:
type(df_list)

list

In [39]:
len(df_list)

1

In [40]:
df = df_list[0]
df.head()

Unnamed: 0,Posición,Nombre,Población(hab.),%,Densidad(hab./km²)
0,1,Andalucía,8 379 248,1799,9602
1,2,Cataluña,7 596 131,1622,23533
2,3,Comunidad de Madrid,6 576 009,1397,81117
3,4,Comunidad Valenciana,4 959 243,1061,21249
4,5,Galicia,2 700 970,582,9158


In [41]:
# Raises an error because `match` is case-sensitive by default:
df_list = pd.read_html(url, match="población")

ValueError: No tables found matching pattern 'población'
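
When the exact spelling is not known in advance, one alternative is to read all the tables and filter them afterwards. The sketch below uses a hypothetical helper, `tables_with_label`, which is not a pandas function. Unlike `match`, which searches the text of the whole table, it only inspects the column labels. The `tables` list is a stand-in for the result of `pd.read_html(url)`.

```python
import pandas as pd


def tables_with_label(tables, keyword):
    """Return the DataFrames with at least one column label containing keyword, ignoring case."""
    keyword = keyword.casefold()
    return [
        t
        for t in tables
        if any(keyword in str(label).casefold() for label in t.columns)
    ]


# Stand-in for the list returned by pd.read_html(url)
tables = [
    pd.DataFrame({"Superficie (km²)": ["94 226"]}),
    pd.DataFrame({"Nombre": ["Andalucía"], "Población(hab.)": ["8 379 248"]}),
]

found = tables_with_label(tables, "población")
print(len(found))
```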

## Exercise 2

The URL for this exercise is the [Economy of Spain](https://en.wikipedia.org/wiki/Economy_of_Spain) page from [Wikipedia](https://en.wikipedia.org/).

In [42]:
url = "https://en.wikipedia.org/wiki/Economy_of_Spain"

Import a list of DataFrames containing "Government debt" from the `url`:

In [43]:
lists = pd.read_html(url, match="Government debt")

In [44]:
type(lists)

list

Check the number of items in the list of DataFrames:

In [45]:
len(lists)

1

Identify the table of interest:

In [46]:
my_list = lists[0]

Check the first 5 rows of the DataFrame:

In [47]:
my_list.head()

Unnamed: 0,Year,GDP(in bil. euro),GDP per capita(in euro),GDP growth(real),Inflation rate(in percent),Unemployment (in percent),Government debt(in % of GDP)
0,1980,99.0,2627,1.2%,15.6%,11.0%,16.6%
1,1981,112.8,2967,−0.4%,14.5%,13.8%,20.0%
2,1982,129.6,3390,1.2%,14.4%,15.8%,25.1%
3,1983,147.9,3851,"1,7%","12,2%","17,2%","30,3%"
4,1984,165.7,4299,1.7%,11.3%,19.9%,37.1%


### Bonus: Plot the debt vs the year

Set the "Year" column to be the index:
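
A minimal sketch of this step, assuming the debt column holds text such as `16.6%` or `30,3%` as in the output above. The `debt` DataFrame is a stand-in built from those rows, and the conversion to numbers is an addition needed before plotting, since text values cannot be plotted.

```python
import pandas as pd

# Stand-in for the imported table, using values copied from the output above
debt = pd.DataFrame(
    {
        "Year": [1980, 1981, 1982, 1983, 1984],
        "Government debt(in % of GDP)": ["16.6%", "20.0%", "25.1%", "30,3%", "37.1%"],
    }
)

# Use the year as the index, so that it becomes the horizontal axis of a plot
debt = debt.set_index("Year")

# Strip the percent sign, normalise decimal commas, and convert the text to floats
column = "Government debt(in % of GDP)"
debt[column] = (
    debt[column]
    .str.rstrip("%")
    .str.replace(",", ".", regex=False)
    .astype(float)
)
print(debt)
```

With matplotlib installed, `debt[column].plot()` would then draw the debt against the year.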

## Demo 3: Import data from the clipboard

### Import data from the clipboard into a dataframe

In [52]:
filename = "demo-3.csv"
file = directory / filename

**Open the text file, select some lines, and copy them to the clipboard:**

In [53]:
! explorer {file}
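
The `explorer` command is specific to Windows. A possible way to pick the equivalent command on other systems is sketched below; `opener_command` is a hypothetical helper, and `xdg-open` is an assumption that holds on many Linux desktops but not all.

```python
import platform


def opener_command(system=None):
    """Return the command commonly used to open a file with its default application."""
    system = system or platform.system()
    if system == "Windows":
        return ["explorer"]
    if system == "Darwin":  # macOS
        return ["open"]
    return ["xdg-open"]  # assumption: a Linux desktop providing xdg-open


print(opener_command())
```

The file could then be opened with `subprocess.run([*opener_command(), str(file)])`.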

In [54]:
df = pd.read_clipboard()

In [55]:
df

Unnamed: 0,"Country,Capital"
0,"Spain,Madrid"
1,"France,Paris"
2,"Germany,Berlin"


In [56]:
df = pd.read_clipboard(sep=",")

In [57]:
df

Unnamed: 0,Country,Capital
0,Spain,Madrid
1,France,Paris
2,Germany,Berlin
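
`pd.read_clipboard` passes the clipboard text to `pd.read_csv`, with whitespace (`\s+`) as the default separator. The difference between the two calls above can be reproduced without a clipboard, as in this sketch where the `text` variable stands in for the copied lines.

```python
import io

import pandas as pd

# Stand-in for the lines copied to the clipboard
text = "Country,Capital\nSpain,Madrid\nFrance,Paris\nGermany,Berlin\n"

# Splitting on whitespace finds no separator, so each line ends up in a single column
by_whitespace = pd.read_csv(io.StringIO(text), sep=r"\s+")

# Splitting on commas yields the two expected columns
by_comma = pd.read_csv(io.StringIO(text), sep=",")

print(by_whitespace.columns.tolist())
print(by_comma.columns.tolist())
```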


## Exercise 3

The dataset for this exercise is taken from the [List of European Union member states by population](https://en.wikipedia.org/wiki/List_of_European_Union_member_states_by_population) page on [Wikipedia](https://en.wikipedia.org).

In [59]:
filename = "exercise-3.csv"
file = directory / filename

**Open the exercise-3.csv file with a text editor, then copy the 3 most populated EU countries to the clipboard:**

In [65]:
! explorer {file}

Import the data from the clipboard:

In [69]:
df = pd.read_clipboard(sep='\t')

Check the first 5 rows of the DataFrame:

In [70]:
df.head()

Unnamed: 0,Country,Population
0,Germany,83166711
1,France,67098824
2,Italy,60244639
3,Spain,47329981
4,Poland,37958138
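
If more rows than needed end up in the DataFrame, the most populated countries can also be selected after the import. The sketch below uses `nlargest` on a stand-in DataFrame built from the output above.

```python
import pandas as pd

# Stand-in for the DataFrame read from the clipboard, using values from the output above
population = pd.DataFrame(
    {
        "Country": ["Germany", "France", "Italy", "Spain", "Poland"],
        "Population": [83166711, 67098824, 60244639, 47329981, 37958138],
    }
)

# Keep the 3 rows with the largest values in the "Population" column
top3 = population.nlargest(3, "Population")
print(top3)
```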


## Demo 4: Import data from a PDF file

The PDF for this demo is the [Informe diario de situación COVID-19 del 7 de marzo 2021](https://www.comunidad.madrid/sites/default/files/doc/sanidad/210307_cam_covid19.pdf) from the [Coronavirus page](https://www.comunidad.madrid/servicios/salud/coronavirus) provided by the Comunidad de Madrid.

In [None]:
filename = "demo-4.pdf"
file = directory / filename

**Open the PDF file, and use Tabula to extract data from some tables:**

In [None]:
! open {file}

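Use [Tabula](http://127.0.0.1:8080) to extract tables from the PDF file. Tabula runs as a local web application, so the link only works while Tabula is running on this machine.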
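
Once a table has been exported from Tabula as CSV, it can be loaded with `pd.read_csv`. The sketch below assumes the export uses Spanish number formatting (a dot for thousands and a comma for decimals). The `exported` text, its column names and its values are hypothetical placeholders, not figures from the report.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file exported from Tabula; the values are placeholders
exported = io.StringIO(
    'zone,cases,rate\n'
    'A,"1.234","56,7"\n'
    'B,"987","12,3"\n'
)

# thousands and decimal tell pandas how to interpret the Spanish number format
table = pd.read_csv(exported, thousands=".", decimal=",")
print(table.dtypes)
```

For a real export, the `io.StringIO` object would be replaced by the path to the CSV file.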

[Example of Tabula usage by journalists](https://github.com/nytimes/gunsales):
> Statistical analysis of monthly background checks of gun purchases,  
> for the New York Times story [*What Drives Gun Sales: Terrorism, Obama and Calls for Restrictions*](http://www.nytimes.com/interactive/2015/12/10/us/gun-sales-terrorism-obama-restrictions.html?).

> The source data comes from the [FBI's National Instant Criminal Background Check System](https://www.fbi.gov/about-us/cjis/nics)  
> and was converted from the original [PDF format](https://www.fbi.gov/file-repository/nics_firearm_checks_-_month_year_by_state_type.pdf) to CSV using [Tabula](http://tabula.technology/).
