# Session 18

[![Open and Execute in Google Colaboratory](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/astrojuanlu/ie-mbd-python-data-analysis-i/blob/main/sessions/Session%2018.ipynb)

- Practice

## Waiting lists in the Madrid health system

The region of Madrid publishes monthly data on the [outpatient waiting lists of its health service](https://www.comunidad.madrid/servicios/salud/lista-espera-consultas-externas). Having the data in CSV format is much better than what other services do (PDFs full of tables).

However, doing a systematic analysis needs a bit of data preprocessing.

Let's get into it!

### 0. Reading the data

Oops, reading these datasets won't be easy (see below).

- Looks like the `encoding` is not what it should be. To find which one to use, read the raw data with `requests` first, and display the `encoding` property of the result. Pass the appropriate `encoding=` to `pd.read_csv`. If the error message changes, move on to the next step.
- After reading the raw data with `requests`, print it out using the `.text` property. What character is used as a separator? Specify it as the `sep=` parameter in `pd.read_csv`. You should already see a pandas DataFrame!
- It seems that the first few rows and the last few are bogus. Investigate how you can skip the first N rows in `pd.read_csv`, and adjust the function call accordingly.
- Similarly, the last few rows have all `NaN` or contain no information. Cut them out of the final DataFrame.
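
The steps above can be sketched on an in-memory sample, so that nothing depends on the network. The sample bytes are invented, and `latin-1`, `;` and `skiprows=2` are placeholder assumptions rather than the values of the real file; the point of the exercise is to find the real ones. The helper name `read_waiting_list` is hypothetical.

```python
import io

import pandas as pd

# Invented stand-in for the downloaded file: two junk lines on top,
# a ";"-separated table, and an empty trailing row.
# In the notebook, the bytes would come from `requests.get(DATA_URL_2025_10).content`.
SAMPLE_BYTES = (
    "Lista de espera (ejemplo inventado)\n"
    "Datos no oficiales\n"
    "OCTUBRE 2025;Comunidad de Madrid\n"
    "Población Asignada;7.059.453\n"
    "Tasa por 1000 habitantes;137,23\n"
    ";\n"
).encode("latin-1")


def read_waiting_list(raw: bytes, encoding: str, sep: str, skiprows: int) -> pd.DataFrame:
    """Parse the raw bytes and drop rows that contain only missing values."""
    df = pd.read_csv(io.BytesIO(raw), encoding=encoding, sep=sep, skiprows=skiprows)
    return df.dropna(how="all")


# Assumed parameters for the invented sample, not for the real dataset.
df_sample = read_waiting_list(SAMPLE_BYTES, encoding="latin-1", sep=";", skiprows=2)
print(df_sample)
```

`dropna(how="all")` only removes rows where every cell is missing. Trailing rows that contain text but no useful information still need to be cut explicitly, for example by slicing.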

In [43]:
DATA_URL_2025_10 = (
    "https://www.comunidad.madrid/sites/default/files/doc/sanidad/gest/le_c_octubre_2025.csv"
)

In [44]:
df_oct25.head()

| | OCTUBRE 2025 | Comunidad de Madrid |
|---|---|---|
| 0 | Población Asignada | 7.059.453 |
| 1 | Número de pacientes en espera estructural para... | 733.359 |
| 2 | Tasa por 1000 habitantes | 13723 |
| 3 | Demora media de espera para CONSULTA EXTERNA (... | 6860 |
| 4 | Desglose por días de espera de pacientes pendi... | |


In [45]:
df_oct25.tail()

| | OCTUBRE 2025 | Comunidad de Madrid |
|---|---|---|
| 14 | Tasa por 1000 habitantes de pacientes atendidos | 6367.0 |
| 15 | NUMERO TOTAL DE PACIENTES ATENDIDOS | |
| 16 | Número total de pacientes atendidos | 449.455 |
| 17 | Espera media estructural para pacientes atendi... | 3944.0 |
| 18 | Demora media prospectiva | 4905.0 |


### 1. Adjusting the data

It looks like every monthly file has variable names in the left column and numerical values in the second column.

Transform the October 2025 data into a pandas `DataFrame` where the variable names are the index and there is only 1 column, named after the month in `YYYY-MM` format (here, `2025-10`).

In [60]:
df_oct25_pre.head()

| | 2025-10 |
|---|---|
| Población Asignada | 7059453.0 |
| Número de pacientes en espera estructural para PRIMERA CONSULTA | 733359.0 |
| Tasa por 1000 habitantes | 137.23 |
| Demora media de espera para CONSULTA EXTERNA (F.CORTE) | 68.6 |
| Desglose por días de espera de pacientes pendientes por F.CITA | |
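
One possible way to get to that shape is sketched below on a tiny hand-made frame. It assumes that the raw values are strings using a dot as thousands separator and a comma as decimal separator, which should be checked against the real file. The helper names `parse_spanish_number` and `to_month_column` are hypothetical.

```python
import pandas as pd


def parse_spanish_number(value) -> float:
    """Turn '7.059.453' or '137,23' into a float; missing values become NaN."""
    if pd.isna(value):
        return float("nan")
    if isinstance(value, (int, float)):
        return float(value)  # already numeric, nothing to clean
    # Assumption: "." separates thousands and "," marks the decimals.
    return float(str(value).strip().replace(".", "").replace(",", "."))


def to_month_column(df: pd.DataFrame, label: str) -> pd.DataFrame:
    """Use the first column as index and keep the second one, renamed to `label`."""
    names, values = df.columns[:2]
    out = df.set_index(names)[[values]].rename(columns={values: label})
    out.index.name = None
    out[label] = out[label].map(parse_spanish_number)
    return out


# Hand-made sample that mimics the layout of the raw data.
raw = pd.DataFrame(
    {
        "OCTUBRE 2025": [
            "Población Asignada",
            "Tasa por 1000 habitantes",
            "Desglose por días de espera",
        ],
        "Comunidad de Madrid": ["7.059.453", "137,23", None],
    }
)
df_pre = to_month_column(raw, "2025-10")
print(df_pre)
```

Rows that act as section headers, such as the "Desglose" one, keep a missing value here. Whether to drop them is a separate decision.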


### 2. Prepare further data gathering

The URLs from https://www.comunidad.madrid/servicios/salud/lista-espera-consultas-externas look something like this:

```
https://www.comunidad.madrid/sites/default/files/doc/sanidad/gest/le_c_octubre_2025.csv
https://www.comunidad.madrid/sites/default/files/doc/sanidad/gest/le_c_septiembre_2025.csv
https://www.comunidad.madrid/sites/default/files/doc/sanidad/gest/le_c_agosto_2025.csv
...
```

(notice "octubre" = October, "septiembre" = September, ...)

And there are actually 2 other waiting list datasets:

- Surgery https://www.comunidad.madrid/servicios/salud/lista-espera-quirurgica
  - Example: `https://www.comunidad.madrid/sites/default/files/doc/sanidad/gest/le_q_octubre_2025.csv`
- Diagnostic & Therapeutic https://www.comunidad.madrid/servicios/salud/lista-espera-pruebas-diagnosticas-terapeuticas
  - Example: `https://www.comunidad.madrid/sites/default/files/doc/sanidad/gest/le_t_octubre_2025.csv`

As you can see, the 3 of them have similar, but slightly different, filename patterns.

Complete the implementation of this function:

```python
def get_url(hc_type: str, year: int, month: int) -> str:
    ...
```

In [61]:
get_url("outpatient", 2025, 10)

'https://www.comunidad.madrid/sites/default/files/doc/sanidad/gest/le_c_octubre_2025.csv'
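
A minimal sketch of one way to fill in the function follows. The letters `c`, `q` and `t` come from the example URLs above, while the keys `"surgery"` and `"diagnostic"` are hypothetical names chosen for this sketch. It also assumes that every month follows the same lowercase Spanish naming as the examples.

```python
BASE_URL = "https://www.comunidad.madrid/sites/default/files/doc/sanidad/gest"

# Letter used in the filename for each dataset, taken from the example URLs.
# The keys "surgery" and "diagnostic" are hypothetical names for this sketch.
TYPE_CODES = {"outpatient": "c", "surgery": "q", "diagnostic": "t"}

SPANISH_MONTHS = (
    "enero", "febrero", "marzo", "abril", "mayo", "junio",
    "julio", "agosto", "septiembre", "octubre", "noviembre", "diciembre",
)


def get_url(hc_type: str, year: int, month: int) -> str:
    """Build the CSV URL for a dataset type, year and month number (1 to 12)."""
    if hc_type not in TYPE_CODES:
        raise ValueError(f"Unknown hc_type: {hc_type!r}")
    if not 1 <= month <= 12:
        raise ValueError(f"month must be between 1 and 12, got {month}")
    month_name = SPANISH_MONTHS[month - 1]
    return f"{BASE_URL}/le_{TYPE_CODES[hc_type]}_{month_name}_{year}.csv"


print(get_url("outpatient", 2025, 10))
```

Validating `month` explicitly avoids a subtle bug: without the check, `month=0` would silently index the last element of the tuple and return the December URL.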

### 3. Gather all the data

Now we're in a better position to actually do something with the data. But first, we want to see the historical evolution.

For that, create another function that returns a list of URLs for a given type and year.

```python
def get_urls(hc_type: str, year: int) -> list[str]:
    ...
```

_For now, assume that you can go up to December 2025, which hasn't been published yet_.

In [64]:
get_urls("outpatient", 2025)

['https://www.comunidad.madrid/sites/default/files/doc/sanidad/gest/le_c_octubre_2025.csv',
 'https://www.comunidad.madrid/sites/default/files/doc/sanidad/gest/le_c_septiembre_2025.csv',
 'https://www.comunidad.madrid/sites/default/files/doc/sanidad/gest/le_c_agosto_2025.csv']

### 4. Assemble the data

Finally, assemble all the data of 2025 into 1 big DataFrame, where every column is a month.
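
As a rough sketch of the assembly step, single-column monthly frames can be joined side by side with `pd.concat(..., axis=1)`. The frames and values below are invented for illustration, and the helper name `assemble` is hypothetical.

```python
import pandas as pd


def assemble(monthly_frames: list[pd.DataFrame]) -> pd.DataFrame:
    """Join single-column monthly frames side by side, ordered by month label."""
    combined = pd.concat(monthly_frames, axis=1)
    # "YYYY-MM" labels sort chronologically when sorted as plain strings.
    return combined[sorted(combined.columns)]


variables = ["Población Asignada", "Tasa por 1000 habitantes"]

# Invented values, only meant to show the resulting shape.
df_oct = pd.DataFrame({"2025-10": [1.0, 2.0]}, index=variables)
df_sep = pd.DataFrame({"2025-09": [3.0, 4.0]}, index=variables)

df_2025 = assemble([df_oct, df_sep])
print(df_2025)
```

`pd.concat` aligns rows by index label, so a variable whose name changes between months ends up in separate rows with missing values.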