# pandas tutorial

<img align="left" src="https://pandas.pydata.org/static/img/pandas.svg" alt="pandas" width="15%" height="15%"/>

<table align="left">
    <tr>
    <td><a href="https://colab.research.google.com/github/airnandez/numpandas/blob/master/notebooks/pandas.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a></td>
    <td><a href="https://mybinder.org/v2/gh/airnandez/numpandas/master?filepath=notebooks%2Fpandas.ipynb">
  <img src="https://mybinder.org/badge_logo.svg" alt="Launch Binder"/>
</a></td>
  </tr>
</table>

*Author: Fabio Hernandez*

*Last updated: 2024-03-06*

*Location:* https://github.com/airnandez/numpandas

--------------------
## Introduction

This is a short tutorial to help you get familiar with the **pandas** library, which is built on top of NumPy. You can find an introduction to NumPy in [this notebook](NumPy.ipynb).

This tutorial draws inspiration, ideas and sometimes material from several publicly available sources. Please see the [Acknowledgements](#Acknowledgements) section for details.

-----------------------
## Reference documentation

The entry point to the documentation of the stable release of pandas is http://pandas.pydata.org/pandas-docs/stable. It includes a [user guide](http://pandas.pydata.org/pandas-docs/stable/user_guide/index.html), an [API reference](http://pandas.pydata.org/pandas-docs/stable/reference/index.html) and a [cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf).

The [DataCamp pandas Cheat Sheet](https://assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf) can also be a useful resource.

----------------------
## ⚠️ Important: installing dependencies ⚠️

If you are running this notebook in Google Colab, the cell below installs a recent version of pandas. This notebook is tested against pandas v2.2.1, whereas Google Colab uses v1.5.3 by default, which does not support some features used by this notebook:

In [1]:
%%bash

if [[ -n ${COLAB_RELEASE_TAG} ]]; then
  pip install --upgrade openpyxl
  pip install 'pandas>=2.2.1'  # quoted so that the shell does not treat '>' as a redirection
fi

---------------------
## Import

**pandas** is customarily imported as shown below:

In [None]:
import pandas as pd
pd.__version__

In addition, for the examples given in this notebook we will need some packages from the Python standard library so we import them here:

In [None]:
import datetime

----------
## Overview

**pandas** offers three main data structures designed to facilitate the programmatic manipulation of datasets with flexibility. Those data structures are [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe), [`Series`](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#series) and `Index`. We will start exploring what a `DataFrame` is and what we can do with it.

![dataframe](../images/dataframe-axis.png)

---------------------
## Load the dataset

Read a sample dataset, located in the `data` subdirectory, which is formatted as a sequence of lines, each line composed of a series of values separated by a delimiter character. Our sample dataset contains some data about the European Union, extracted from several sources, including [Wikipedia](https://en.wikipedia.org/wiki/European_Union), [EuroStat](https://ec.europa.eu/eurostat) and the [EU Budget](http://ec.europa.eu/budget) site.

In [None]:
import os
import requests

def download(url: str, path: str):
    """Download file at url and save it locally at path."""
    with requests.get(url, stream=True) as resp:
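        if not resp.ok:
            # Only exception instances can be raised: raising a plain string is a TypeError
            raise RuntimeError(f'Could not download file at URL {url} (HTTP status {resp.status_code})')

        mode, data = 'wb', resp.content
        # Use .get() so that a response without a Content-Type header does not raise KeyError
        if 'text/plain' in resp.headers.get('Content-Type', ''):
            mode, data = 'wt', resp.text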
        with open(path, mode) as f:
            f.write(data)

In [None]:
# Download the dataset if necessary to the directory 'data'
data_dir = 'data'
path = os.path.join('..', data_dir, 'european_union-2020.csv')

if not os.path.isfile(path):
    os.makedirs(os.path.join('..', data_dir), exist_ok=True)
    url = 'https://raw.githubusercontent.com/airnandez/numpandas/master/data/european_union-2020.csv'
    download(url, path)

In [None]:
# This particular dataset uses ';' as column separator (instead of the more usual ',')
# and uses ',' as the decimal separator
df = pd.read_csv(path, sep=';', decimal=',')

In [None]:
# Inspect the dimensions of the dataframe
rows, columns = df.shape
print(f'This dataframe has {rows} rows and {columns} columns')

**pandas** has built-in methods for doing I/O with files in several formats, including flat files (csv, fixed-width format), Excel, JSON, HTML, HDF5, parquet, SQL, etc. See the [documentation](http://pandas.pydata.org/pandas-docs/stable/reference/io.html#flat-file) for details.
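
As a small illustration of the `sep` and `decimal` options used above, the sketch below parses a tiny in-memory sample instead of a file. The sample rows and the column name `budget_millions` are hypothetical and are not taken from the tutorial's dataset.

```python
import io
import pandas as pd

# Hypothetical sample (not the tutorial's dataset): ';' separates columns, ',' marks decimals
raw = (
    "country;population;budget_millions\n"
    "Atlantis;1200000;12,5\n"
    "Utopia;800000;7,25\n"
)

# Without decimal=',' the last column would be read as text instead of numbers
sample = pd.read_csv(io.StringIO(raw), sep=';', decimal=',')

print(sample.dtypes)
print(sample['budget_millions'].sum())
```

Reading from `io.StringIO` behaves like reading from a file, which makes it convenient for trying out parsing options on a few lines before loading a large dataset.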

---------------------
## Exploring the dataset contents

To get an idea of what data is included in the dataset, you can explore the contents of the whole dataframe.

⚠️ **WARNING** ⚠️: generally speaking, displaying the entire dataset is not a good idea when the dataset is large. It is recommended to first inspect the size of the dataframe, as we did above. Our dataset is small, so we can display all of it:

In [None]:
df

You can also explore a fraction of the dataset by displaying, for instance, a few rows at the beginning or at the end of the dataframe:

In [None]:
# Display the first 3 rows of the dataset. By default, the first 5 rows will be displayed
df.head(3)

You can also explore the last rows of the dataset or any intermediate rows, by using notation similar to the one used with NumPy arrays, on top of which **pandas** is built:

In [None]:
# Display the last 3 rows of the dataset
df.tail(3)

In [None]:
# Display the rows from position 10 up to position 14 (not included)
df[10:14]

Displaying a small random sample of the dataframe rows is generally good practice:

In [None]:
# Display 5 randomly selected rows and all columns
df.sample(5)

**pandas** is designed for efficient handling of datasets organized as follows:

* each **observation** is saved in its own row
* each **variable** is saved in its own column

Our sample dataset is organized in exactly this way.

### An aside: understanding the dataset

In order to analyse any dataset, you need first to understand the meaning of the data. Here are the details of our sample data set:

| column                                    | meaning |
| ------------------------------------------|----------|
| `country`                                 | name of the country, in English |
| `country_code`                            | code of the country, as used by [Eurostat](https://ec.europa.eu/eurostat/statistics-explained/index.php/Glossary:Country_codes) |
| `accession_date`                          | date of accession of the country to the European Union (format: `yyyy-mm-dd`) |
| `population`                              | the number of persons having their usual residence in each country as of January 1st, 2020 (source: [Eurostat](https://ec.europa.eu/eurostat/tgm/table.do?tab=table&plugin=1&language=en&pcode=tps00001)) |
| `euro_zone_member`                        | `True` if the country is member of the [Eurozone](https://en.wikipedia.org/wiki/Eurozone)  |
| `immigration`                             | total number of long-term immigrants arriving into the country in 2019, as reported by each country (source: [Eurostat](https://ec.europa.eu/eurostat/tgm/table.do?tab=table&plugin=1&language=en&pcode=tps00176)) |
| `emigration`                              | total number of long-term emigrants leaving from the reporting country in 2019, as reported by each country (source: [Eurostat](https://ec.europa.eu/eurostat/tgm/table.do?tab=table&plugin=1&language=en&pcode=tps00177))  | 
| `contribution_to_eu_budget_millions_euro` | contribution to the EU budget for each country for year 2019, in millions of euros (source: [European Commission](http://ec.europa.eu/budget/graphs/revenue_expediture.html)) |
| `expenditure_eu_budget_millions_euro`     | expenditure of the EU budget per country (for all programs), for year 2019, in millions of euros (source: [European Commission](http://ec.europa.eu/budget/graphs/revenue_expediture.html)) |

Generally speaking, in order to draw sensible conclusions from any dataset you are analysing, make sure you understand precisely what is contained in the dataset and you understand where the data comes from.

### `dataframe` properties

**pandas** provides some methods for retrieving information about a [dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) object. [`pandas.DataFrame.info`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html#pandas-dataframe-info) gives a summary of the dataframe:

In [None]:
# Get a summary of the dataframe
df.info()

We can also get the amount of memory (RAM), in bytes, used by the index and by each column of the dataframe:

In [None]:
# Get the number of bytes the index and each column of the dataframe are using (in RAM)
df.memory_usage(deep=True)

The attribute `DataFrame.columns` is an object of type `Index` (see [reference documentation](https://pandas.pydata.org/pandas-docs/stable/reference/indexing.html#index)):

In [None]:
# Retrieve the list of column names in the dataframe
df.columns

In [None]:
# df.columns is a Python iterable (like a list)
for col in df.columns:
    print(col)

In [None]:
# Get the number of values (of any type) contained in the dataframe
print(f'This dataframe contains {df.size} values')

### Cleaning the data

Very often, the *raw* data needs some cleaning before we can easily manipulate it with **pandas**. For instance, in this particular example, we need to make sure that **pandas** understands that the column `accession_date` is a date and not just a string. We need this for comparisons and filtering, which we will visit later on.

In [None]:
# Display the types of each column in the dataframe
df.dtypes

In [None]:
# Convert column 'accession_date' to a date
df['accession_date'] = df['accession_date'].astype('datetime64[s]')
df['accession_date'].dtype
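
The sketch below shows why the conversion matters, using `pd.to_datetime` as an alternative to `astype`. The three dates are hypothetical sample values written in the same `yyyy-mm-dd` format as the dataset, and the variable names are illustrative only.

```python
import datetime
import pandas as pd

# Hypothetical dates stored as text, in the same yyyy-mm-dd format as the dataset
dates = pd.Series(['1957-03-25', '1995-01-01', '2004-05-01'])

# pd.to_datetime parses the strings; 'format' makes the expected layout explicit
as_dates = pd.to_datetime(dates, format='%Y-%m-%d')

# Once converted, the values can be compared with Python datetime objects
joined_since_1989 = as_dates >= datetime.datetime(1989, 1, 1)

print(joined_since_1989.tolist())
print(as_dates.dt.year.tolist())
```

Converted columns also expose the `.dt` accessor, which gives access to components such as the year or the month of each date.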

-------
## Selecting and filtering

**pandas** provides powerful built-in tools for filtering the data both row-wise and column-wise.

In [None]:
# This is a utility function we use for displaying the dataframe, which we use later
def highlight_column(s):
    return 'background-color: PaleGoldenrod'

### select all values in a given column

Selecting all the values in a column is a frequent operation we need to perform on any dataframe:

In [None]:
# Highlight the column 'population' that we want to select
df.head(3).style.map(highlight_column, subset=['population'])

In [None]:
# Retrieve the values of the column 'population' for all rows
df['population']

The value returned by this selection operation is a [pandas.Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) object. A `Series` is a **one-dimensional array with axis labels**. In this particular case the labels are integers but they may be of other types.

**NOTE**: it is possible to use the notation `df.population` to select all the values of the column `"population"`. However, this notation is not recommended, since it only works when the name of the column is a valid Python identifier. For instance, if the name of the column is `budget-contribution`, this notation cannot be used, as it would be written `df.budget-contribution`, which Python interprets as a subtraction:

In [None]:
# WARNING: This notation is NOT recommended because it is not robust. Use df['population'] instead
df.population
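
To make the note above concrete, here is a minimal sketch with a hypothetical two-column dataframe. One column name is not a valid identifier, and the other one (`count`) has the same name as an existing dataframe method, which is a second case where attribute notation does not return the column.

```python
import pandas as pd

# Hypothetical columns: one name is not a valid identifier, the other clashes with a method
toy = pd.DataFrame({'budget-contribution': [1.5, 2.5], 'count': [10, 20]})

# Bracket notation works for any column name
total = toy['budget-contribution'].sum()
counts = toy['count'].tolist()

# Attribute notation returns the DataFrame.count method here, not the column
print(callable(toy.count))
```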

We can perform operations on all the numerical values of a column (i.e. a `pandas.Series` object), such as descriptive statistics:

In [None]:
# Display some descriptive statistics of the values in the 'population' column
df['population'].describe()

You can also compute a subset of the descriptive statistics or perform an arithmetic operation on all the values of the `Series`:

In [None]:
# Retrieve the number of values in the column 'population'
df['population'].count()

In [None]:
# Compute the mean and standard deviation of the values in the column 'population'
mean, std = df['population'].mean(), df['population'].std()
print(f'Population: µ={mean:,.0f}  σ={std:,.0f}')

In [None]:
# Sum all the values of the column 'population'
eu_population = df['population'].sum()
print(f'The population of the EU in 2020 was {eu_population:,} people')

You can also perform an operation on all the values of one (or more) columns. For instance, let's convert all the population values to millions before performing some additional operations:

In [None]:
population = df['population'] / 1_000_000
population

In [None]:
# Divide all the values of the column 'population' by one million
population = df['population'] / 1_000_000  # you can also use the notations 1e6 or 1000000

# Retrieve the min and max values of the series
min_population, max_population = population.min(), population.max()

# Sum all the values of the series
total_population = population.sum()

# Count the number of values in the series
num_countries = population.count()

print(f'The least populous country has {min_population:.1f} millions')
print(f'The most populous country has {max_population:.1f} millions')
print(f'Total EU population in 2020 was {total_population:.1f} millions located in {num_countries} countries')

### select rows satisfying one or more conditions

You can select the rows of the dataframe that satisfy one or more conditions on the values of a column. You can combine those conditions in logical expressions (the boolean operations *and*, *or* and *not*, written `&`, `|` and `~` in **pandas**) to select the rows of interest:

In [None]:
# Let's first visualize the column 'euro_zone_member'
df.style.map(highlight_column, subset=['euro_zone_member'])

In [None]:
# Select all the rows with boolean value 'True' in the column 'euro_zone_member'. This operation
# returns a "mask" that we will use afterwards to select the rows. A mask is a pandas.Series object
# which contains boolean values.
is_eurozone_member = df['euro_zone_member'] == True

is_eurozone_member

In [None]:
# Use the mask created above to select the rows in 'df' for which the mask is 'True'. The returned value
# of this operation is a new pandas.DataFrame which contains only the selected rows of 'df'
euro_zone_df = df[is_eurozone_member]

# Note that the dataframe 'euro_zone_df' only contains rows whose value in the column 'euro_zone_member' is 'True'
euro_zone_df.style.map(highlight_column, subset=['euro_zone_member'])

In [None]:
# This is another, more compact way of expressing the same filter, although it is less readable
euro_zone_df = df[df['euro_zone_member'] == True]
euro_zone_df

The result of this kind of selection operation is generally a dataframe object. You can perform operations on that dataframe as you would on any other dataframe.

In [None]:
# Compute the population of the eurozone, in millions
is_eurozone_member = df['euro_zone_member'] == True
eurozone_population = df[is_eurozone_member]['population'].sum() / 1_000_000

print(f'The population of the Euro zone in 2020 was {eurozone_population:.2f} millions')

You can also select the rows of a dataframe that satisfy *several conditions*, by combining several masks using the element-wise boolean operators `&` (and), `|` (or) and `~` (not):

In [None]:
# Select the countries of the Euro zone which joined the EU in 1989 or later
is_eurozone_member = df['euro_zone_member'] == True
joined_since_1989  = df['accession_date'] >= datetime.datetime(1989, 1, 1)

# Combine the two masks obtained above with an 'and' (&) operator
df[is_eurozone_member & joined_since_1989]

In [None]:
# Select the rows for countries which are either EU founder members or have a population
# of at least 20M people. EU founder members are those which joined on the date of the
# foundation of the EU, that is 1957-03-25.
is_founder         = df['accession_date'] == datetime.datetime(1957, 3, 25)
is_bigger_than_20m = df['population'] >= 20_000_000

# Combine the two masks obtained above with an "or" (|) operator
df[is_founder | is_bigger_than_20m]

In [None]:
# Compare the populations of EU founder members vs. non-founder member countries
# Founder members are those whose 'accession_date' is 1957-03-25
is_founder = df['accession_date'] == datetime.datetime(1957, 3, 25)

founders_population     = df[ is_founder]['population'].sum() / 1e6
non_founders_population = df[~is_founder]['population'].sum() / 1e6  # Note the notation '~' which is the logical NOT

print(f"Founder countries population:     {founders_population:.0f} millions")
print(f"Non-founder countries population: {non_founders_population:.0f} millions")
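
The masks above can also be combined in a single expression. The sketch below uses a hypothetical four-row dataframe to show two details that are easy to get wrong: each comparison needs its own parentheses, and the Python keywords `and`, `or` and `not` cannot be used in place of `&`, `|` and `~`.

```python
import pandas as pd

# Hypothetical data, with populations in millions
toy = pd.DataFrame({
    'population': [5, 30, 60, 2],
    'euro_zone_member': [True, False, True, True],
})

# Each comparison needs its own parentheses because & and | bind tighter than >= or <
big_and_euro = toy[(toy['population'] >= 20) & toy['euro_zone_member']]
small_or_non_euro = toy[(toy['population'] < 20) | ~toy['euro_zone_member']]

# The Python keyword 'and' does not work element-wise on a Series: it raises ValueError
try:
    toy[(toy['population'] >= 20) and toy['euro_zone_member']]
    keyword_failed = False
except ValueError:
    keyword_failed = True

print(big_and_euro.index.tolist(), small_or_non_euro.index.tolist(), keyword_failed)
```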

### select specific rows

One of the most useful tools for selecting rows, and columns within those rows, is [DataFrame.loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html). It accepts several forms as input, but a general one is:

`df.loc[rows, cols]`

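where `rows` and `cols` can each be a single label, a list of labels, a slice of labels (e.g. `10:15`, `10:`, etc.) or a boolean mask. Unlike ordinary Python slices, a slice of labels includes its end point.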

In [None]:
# Retrieve the values of all the columns for row with index 15
df.loc[15]

In [None]:
# Retrieve the rows with indices in the interval [10,15]
df.loc[10:15]

In [None]:
# Retrieve the value of the column 'population' for the row with index 11
df.loc[11, 'population']

In [None]:
# Retrieve all the columns of the row for France
# We need to provide the index of the row we want to retrieve, 11 in this particular case and
# the interval of the columns of interest (all, in this particular case)
df.loc[11, :]

In [None]:
# Retrieve specific columns of a given row
df.loc[11, ['capital', 'contribution_to_eu_budget_millions_euro']]

In [None]:
# Retrieve specific columns of a range of rows
df.loc[5:11, ['country', 'population']]
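
The difference between selecting by label and selecting by position is easier to see when the index labels do not start at zero. The sketch below uses a hypothetical dataframe for that purpose; `DataFrame.iloc`, the position-based counterpart of `loc`, is not used elsewhere in this tutorial and is shown here only for comparison.

```python
import pandas as pd

# Hypothetical frame whose integer labels do not start at zero
toy = pd.DataFrame(
    {'country': ['A', 'B', 'C', 'D'], 'population': [1, 2, 3, 4]},
    index=[10, 11, 12, 13],
)

by_label = toy.loc[10:12]      # labels 10, 11 and 12: the end label IS included
by_position = toy.iloc[0:2]    # positions 0 and 1: the end position is NOT included
one_value = toy.loc[11, 'population']

print(by_label.index.tolist(), by_position.index.tolist(), one_value)
```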

## Set a meaningful index

It is convenient to use an index which allows us to select an entire row or specific columns within a row using a meaningful label. You can set the index of a dataframe when you load it from a file or after the dataframe is already in memory:

In [None]:
df.head(3)

In the case of our dataframe, the index for each row is automatically assigned by pandas as an integer (leftmost column in the table above). For convenience, we can modify that label to use instead the country code as the index of the rows:

In [None]:
# Use the contents of the `country_code` column as the dataframe index
# We don't want the original dataframe to be modified, so we use a new variable
df_new = df.set_index('country_code')

df_new.head(3)

We can now use that more meaningful index to select the rows of interest for our analysis, without actually needing to know their row numbers:

In [None]:
# Retrieve specific columns of a given row using the country code which is
# now the dataframe index
df_new.loc['FR', ['capital', 'population']]

In [None]:
# Retrieve the populations for countries ES and DE
df_new.loc[['ES', 'DE'], ['population']]
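
The sketch below repeats these steps on a small hypothetical dataframe (the population figures are illustrative round numbers, not the dataset's values). It also shows `reset_index`, which is not used above and which moves the index back into a regular column.

```python
import pandas as pd

# Illustrative values only, not the figures of the tutorial's dataset
toy = pd.DataFrame({
    'country_code': ['FR', 'DE', 'ES'],
    'country': ['France', 'Germany', 'Spain'],
    'population': [67, 83, 47],
})

# set_index returns a new dataframe: 'toy' itself is left unchanged
by_code = toy.set_index('country_code')

fr_population = by_code.loc['FR', 'population']        # a single value
two_rows = by_code.loc[['ES', 'DE'], ['population']]   # a dataframe with two rows

# reset_index moves the index back into a regular column
restored = by_code.reset_index()
print(list(restored.columns))
```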

We can also set the index when loading the data to memory, by specifying the position (counting from zero) of the column we want to use as the index of the dataframe:

In [None]:
# Load the dataset and set the dataframe index to the column at position 1 (i.e. the second
# column), which contains the country code, instead of the default row number
df = pd.read_csv('../data/european_union-2020.csv', sep=';', decimal=',', index_col=1)

df.sample(5)
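
The `index_col` argument of `read_csv` accepts either the position of a column or its name. A minimal sketch on a hypothetical in-memory sample (the country names and codes are made up):

```python
import io
import pandas as pd

# Hypothetical sample with made-up country names and codes
raw = "country;country_code;population\nAtlantis;XA;10\nUtopia;XU;20\n"

# The index column can be given by position (counting from zero) or by name
by_position = pd.read_csv(io.StringIO(raw), sep=';', index_col=1)
by_name = pd.read_csv(io.StringIO(raw), sep=';', index_col='country_code')

print(by_position.index.tolist())
print(by_position.equals(by_name))
```

Using the column name is usually more robust than using its position, since it keeps working if the order of the columns in the file changes.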

## Filtering

It is possible to work with a projection of the dataframe by filtering the rows or columns we need to act on (e.g. query, modify, etc):

In [None]:
# Select the columns 'country' and 'capital' on all the rows of the dataset
df.filter(items=['country', 'capital']).sample(4)

You can also select rows or columns which match a regular expression:

In [None]:
# Retrieve the rows with country code ending in 'E'. Note that the 'filter' method acts on the index of the
# dataframe or on the names of the columns, not on their values
df.filter(regex='.E$', axis=0)

In [None]:
# Retrieve the rows for countries whose value in the column 'capital' starts with 'B' and ends with 't'
df[ df['capital'].str.contains('^B.+t$', regex=True) ]
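
The sketch below contrasts the two approaches on a three-row dataframe built by hand for illustration: `filter` matches index labels or column names, whereas matching on cell values requires a boolean mask.

```python
import pandas as pd

# Small frame built by hand for illustration, indexed by country code
toy = pd.DataFrame(
    {'country': ['Belgium', 'Germany', 'Hungary'],
     'capital': ['Brussels', 'Berlin', 'Budapest']},
    index=['BE', 'DE', 'HU'],
)

# filter() matches index labels (axis=0) or column names (axis=1), never cell values
codes_ending_in_e = toy.filter(regex='E$', axis=0)
only_capital = toy.filter(items=['capital'])

# Matching on cell values requires a boolean mask built with the .str accessor
b_to_t = toy[toy['capital'].str.contains('^B.+t$', regex=True)]

print(codes_ending_in_e.index.tolist(), b_to_t.index.tolist())
```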

------------
## Sorting

You can sort the contents of a dataframe, according to the values of a set of columns:

In [None]:
mediterraneans = ('Spain', 'France', 'Italy', 'Slovenia', 'Croatia', 'Greece')
is_mediterranean = df['country'].isin(mediterraneans)
is_mediterranean

In [None]:
# Select the rows of the mediterranean countries
mediterraneans = ('Spain', 'France', 'Italy', 'Slovenia', 'Croatia', 'Greece')
is_mediterranean = df['country'].isin(mediterraneans)

# Sort the selected rows according to the values of columns 'accession_date' and then 'population'
df[is_mediterranean].sort_values(by=['accession_date', 'population'])

You can also retrieve the N largest (or N smallest) rows, according to the value of some columns:

In [None]:
# Get the top 5 countries according to the values in the 'emigration' column
df.nlargest(5, columns=['emigration'])

In [None]:
# Get the 3 least populous countries
df.nsmallest(3, columns=['population'])
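
When sorting by several columns, the `ascending` argument of `sort_values` also accepts one boolean per column. The sketch below uses hypothetical data and a hypothetical column `accession_year` to sort by oldest accession first and, within the same year, by largest population first.

```python
import pandas as pd

# Hypothetical data, with populations in millions
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'accession_year': [1957, 1986, 1957, 1986],
    'population': [60, 10, 80, 47],
})

# Oldest accession first; within the same year, largest population first
ordered = toy.sort_values(by=['accession_year', 'population'], ascending=[True, False])

# nlargest is a shortcut for "sort in descending order, then keep the first N rows"
top_two = toy.nlargest(2, columns=['population'])

print(ordered['country'].tolist(), top_two['country'].tolist())
```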

---------------------
## Plotting

When exploring a dataset, visualizing a projection of its contents is often useful. **pandas** provides some built-in tools for quick visualisations, based on [matplotlib](https://matplotlib.org).

In [None]:
# Plot the population of the countries, in descending order
populations = df['population'].sort_values(ascending=False)
populations.plot.bar(figsize=(15,8))    # figure size in inches: 1 inch ≃ 2.5 cm

It is possible to improve the plots, such as adding a title for the figure and modifying the axes labels. You can refer to the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html#pandas.DataFrame.plot).

In [None]:
# Plot a histogram of the population of EU countries (in millions)
populations = df['population'] / 1e6
figure = populations.plot.hist(figsize=(15,8), title="Distribution of the population of EU countries (2020)", grid=True)
figure.set_xlabel("millions")
figure.set_ylabel("countries")

---------------------
## Serializing a dataframe

You can save the contents of a dataframe to a disk file. **pandas** natively supports several formats (see [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#serialization-io-conversion)):

In [None]:
df

In [None]:
# Save the dataset in 'parquet' format
parquet_path = os.path.join('..', 'data', 'european_union-2020.parquet')
df.to_parquet(parquet_path, compression='gzip')

In [None]:
# Check that now we have a 'parquet' file in our 'data' directory
import glob

glob.glob('../data/european_union-2020.*')

In [None]:
! ls -al ../data/european_union-2020.*

In [None]:
# Read back the dataset from the parquet file just created
new_df = pd.read_parquet(parquet_path)
new_df.sample(5)
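
The same write-then-read-back check can be done with other formats. The sketch below does a CSV round trip entirely in memory, reusing the `;` separator and `,` decimal mark of the original file. The data is hypothetical, and CSV is used here only because it needs no additional package.

```python
import io
import pandas as pd

# Hypothetical data, indexed by made-up country codes
toy = pd.DataFrame(
    {'country': ['A', 'B'], 'budget': [12.5, 7.25]},
    index=pd.Index(['XA', 'XB'], name='country_code'),
)

# Write the dataframe as text, using ';' as separator and ',' as decimal mark
buffer = io.StringIO()
toy.to_csv(buffer, sep=';', decimal=',')
text = buffer.getvalue()

# Read it back with the same options and restore the index
restored = pd.read_csv(io.StringIO(text), sep=';', decimal=',', index_col='country_code')

print(text)
print(restored['budget'].tolist())
```

Unlike parquet, CSV does not store column types, so the options used for writing must be repeated when reading the file back.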

---------------------
## Modifying the dataframe

You will often need to modify the dataframe, for instance, for cleaning it, for extending it or for computing new values useful in the data analysis process.

Please note that the modifications are applied to the in-memory data, not to the disk file, unless you explicitly save the dataframe to disk.

In [None]:
# Rename some dataframe columns to use shorter, more convenient names
df = df.rename(columns={
    # current column name                      new column name
    'contribution_to_eu_budget_millions_euro': 'budget_contribution',
    'expenditure_eu_budget_millions_euro':     'budget_expenditure',
})
df.head(3)

You can also **extend the dataframe** by creating new columns, whose values may be computed from other columns:

In [None]:
# Add two new columns 'budget_contribution_per_capita' and 'budget_expenditure_per_capita' to store the computed
# budget contribution and budget expenditure per capita.
# Note that the budget figures in the dataset are in millions
df['budget_contribution_per_capita'] = (df['budget_contribution'] * 1_000_000 ) / df['population']
df['budget_expenditure_per_capita']  = (df['budget_expenditure']  * 1_000_000 ) / df['population']

df.head(4).style.map(highlight_column, subset=['budget_contribution_per_capita', 'budget_expenditure_per_capita'])
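
As an alternative sketch, `DataFrame.assign` adds computed columns and returns a new dataframe, leaving the original untouched. The figures below are hypothetical and chosen so that the arithmetic is easy to check by hand: 10 million euros shared among 2 million people is 5 euros per capita.

```python
import pandas as pd

# Hypothetical figures, indexed by made-up country codes
toy = pd.DataFrame(
    {'budget_contribution': [10.0, 3.0],       # millions of euros
     'budget_expenditure': [4.0, 6.0],         # millions of euros
     'population': [2_000_000, 1_000_000]},
    index=['XA', 'XB'],
)

# assign() returns a new dataframe with the extra columns; 'toy' is left untouched
extended = toy.assign(
    budget_contribution_per_capita=toy['budget_contribution'] * 1_000_000 / toy['population'],
    budget_expenditure_per_capita=toy['budget_expenditure'] * 1_000_000 / toy['population'],
)

# Operations between columns are applied row by row, aligned on the index
net = extended['budget_contribution_per_capita'] - extended['budget_expenditure_per_capita']
print(net.tolist())
```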

We can also compute a `pandas.Series` of values from the values in the columns of the dataframe:

In [None]:
# Compute the net per-capita contribution to the EU budget for each country
net_contribution_per_capita = df['budget_contribution_per_capita'] - df['budget_expenditure_per_capita']
net_contribution_per_capita.sort_values(ascending=True)

In [None]:
print(f'The net contribution by France to the 2019 EU budget was approx. {net_contribution_per_capita["FR"]:.0f}€ per capita')

We can use the methods `idxmin()` (or `idxmax()`) of a `pandas.Series` to retrieve the **index label** of the entry which contains the minimum (or maximum) value, as opposed to the minimum (or maximum) value itself. In our case, we can use this to retrieve the country code (i.e. the value of the dataframe index) and then the country name:

In [None]:
# Retrieve the index labels of the minimum and maximum values in the series 'net_contribution_per_capita'
net_contribution_per_capita.idxmin(), net_contribution_per_capita.idxmax()

In [None]:
# Retrieve the value of the minimum net contribution per capita
net_contribution_per_capita.min()

In [None]:
index_min, index_max = net_contribution_per_capita.idxmin(), net_contribution_per_capita.idxmax()
df.loc[[index_min, index_max]]

In [None]:
# Retrieve the names of the countries with those minimum and maximum values
df.loc[[index_min, index_max], 'country']

In [None]:
# Retrieve the names of the countries with the minimum and maximum values in the
# 'net_contribution_per_capita' series (computed above)
country_min_expenditure = df.loc[net_contribution_per_capita.idxmin(), 'country']
value_min_expenditure = net_contribution_per_capita.min()

country_max_expenditure = df.loc[net_contribution_per_capita.idxmax(), 'country']
value_max_expenditure = net_contribution_per_capita.max()

# Note that the values printed below are net contributions (contribution minus expenditure), per capita
print(f'The country with the lowest net contribution per capita to the 2019 EU budget was:  {country_min_expenditure:>10} ({value_min_expenditure:,.0f} €)')
print(f'The country with the highest net contribution per capita to the 2019 EU budget was: {country_max_expenditure:>10} ({value_max_expenditure:,.0f} €)')
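
The distinction between `idxmin()` and `min()` can be seen in isolation on a small hypothetical series, where the index labels and the country names are made up:

```python
import pandas as pd

# Hypothetical net contributions per capita, indexed by made-up country codes
net = pd.Series({'XA': 120.0, 'XB': -45.5, 'XC': 30.0})
names = pd.Series({'XA': 'Atlantis', 'XB': 'Borduria', 'XC': 'Cascadia'})

# idxmin/idxmax return index labels, whereas min/max return the values themselves
label_min, label_max = net.idxmin(), net.idxmax()
value_min, value_max = net.min(), net.max()

# The labels can then be used with .loc to look up related information
extremes = names.loc[[label_min, label_max]].tolist()
print(label_min, value_min, label_max, value_max, extremes)
```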

----------
## Grouping

In some datasets, data is organized so that **grouping the observations** (i.e. the rows) is necessary to answer some analysis questions. **pandas** provides useful tools for grouping rows based on the values of one or more columns.

The data in the dataset we have been working on does not require grouping. We load a different, more complex and bigger dataset to explore how grouping works.

### load another dataset

The dataset we will use contains data about the names given to babies in France from 1900 to 2021. For each given name you can find the sex of the baby (male or female), the year of birth, the department and the number of babies registered with that given name per year and per department.

You can find details of this public dataset, including the exact meaning of each variable (in French), at https://www.insee.fr/fr/statistiques/2540004

In [None]:
# Download the dataset if necessary
data_dir = 'data'
path = os.path.join('..', data_dir, 'prenoms-fr-1900-2021.zip')

if not os.path.isfile(path):
    os.makedirs(os.path.join('..', data_dir), exist_ok=True)
    url = 'https://www.insee.fr/fr/statistiques/fichier/2540004/dpt2021_csv.zip'
    download(url, path)

In [None]:
# Load another dataset. Its fields are separated by ';'.
# We ask pandas to interpret the columns 'annais' and 'dpt' as strings to avoid errors caused by
# the markers used for missing values
names_df = pd.read_csv(path, sep=';', dtype={'annais':str, 'dpt':str})
rows, cols = names_df.shape
print(f'This dataset contains {rows:,} rows and {cols} columns')

In [None]:
names_df.sample(8)

Below you can find an edited excerpt, translated from French, of the [meaning and coding conventions of the columns](https://www.insee.fr/fr/statistiques/2540004#dictionnaire) of this dataset. You may also want to read the excellent [documentation associated with this dataset](https://www.insee.fr/fr/statistiques/2540004#documentation):

*The second file, organized by department, contains 3,784,673 records and five variables, described below.*
*This file is sorted by the variables `SEXE`, `PREUSUEL`, `ANNAIS`, `DPT`.*

* `SEXE`: sex - Type: character - Length: 1 - Values: 1 for male, 2 for female
* `PREUSUEL`: first given name - Type: character - Length: 25
* `ANNAIS`: year of birth - Type: character - Length: 4 - Values: 1900 to 2021, XXXX
* `DPT`: department of birth - Type: character - Length: 3 - Values: list of departments, XX
* `NOMBRE`: frequency - Type: numeric - Length: 8

In [None]:
# Inspect the types of the columns of the dataset
names_df.dtypes

### cleaning the dataset

In [None]:
# This is a utility function we use for displaying the dataframe
def highlight_missing(s):
    missings = ('XX', 'XXXX', '_PRENOMS_RARES')
    return 'color: white; background-color: Crimson' if s in missings else ''

In [None]:
# Rename some columns to use more meaningful names
names_df = names_df.rename(columns={
    'sexe':      'sex',
    'preusuel':  'name',
    'annais':    'year',
    'dpt':       'department',
    'nombre':    'count'})

names_df.head().style.map(highlight_missing, subset=['name', 'year', 'department'])

There are rows with missing values, which in this case are represented by the strings `XXXX` for year or `XX` for the department or `_PRENOMS_RARES` for the name column. For the purposes of this tutorial, we ignore those rows:

In [None]:
# Drop rows with missing department and year
names_df.drop(names_df[names_df['department'] == 'XX'].index, inplace=True)
names_df.drop(names_df[names_df['year'] == 'XXXX'].index, inplace=True)

# Convert column 'year' to numeric values
names_df['year'] = pd.to_numeric(names_df['year'])

names_df.dtypes
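
As an alternative sketch of the same cleaning step, the rows to keep can be selected with a single boolean mask instead of two calls to `drop`. The four rows below are hypothetical and only mimic the layout of the real dataset.

```python
import pandas as pd

# Hypothetical rows mimicking the layout of the dataset, including its missing-value markers
toy = pd.DataFrame({
    'name': ['ANNA', 'ANNA', 'LOUIS', 'LOUIS'],
    'year': ['1900', 'XXXX', '1901', '1902'],
    'department': ['75', '13', 'XX', '69'],
    'count': [5, 3, 4, 7],
})

# Keep only the rows where neither column holds its "missing" marker
is_complete = (toy['year'] != 'XXXX') & (toy['department'] != 'XX')
clean = toy[is_complete].copy()   # copy() makes 'clean' independent from 'toy'

# With the markers gone, the years can be converted to numbers
clean['year'] = pd.to_numeric(clean['year'])

print(clean)
```

This variant leaves the original dataframe untouched, which makes it easy to re-run the cleaning step with different criteria.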

In this dataset, some given names are coded as `_PRENOMS_RARES`, to represent a group of given names used very few times (see the dataset documentation for details). For this exercise, we are not interested in that data, so we remove the rows in the dataset which contain that given name:

In [None]:
is_unusual = names_df['name'] == '_PRENOMS_RARES'
names_df[is_unusual].head().style.map(highlight_missing, subset=['name'])

In [None]:
# Compute the fraction of unusual names we will exclude from our analysis
rows, _ = names_df.shape
unusual_rows, _ = names_df[is_unusual].shape
print(f'There are {unusual_rows:,} out of {rows:,} rows with unusual names, that is {100*unusual_rows/rows:.2f}% of the dataset')

In [None]:
names_df.drop(names_df[is_unusual].index, inplace=True)

Check that there are no rows which contain `_PRENOMS_RARES` in the column `name`:

In [None]:
names_df[names_df['name'] == '_PRENOMS_RARES'].count()

### explore the dataset

In this dataset, the column `sex` is coded as `1` (one) for males and `2` (two) for females. Create two subsets of the dataset, one for boys and one for girls:

In [None]:
# For convenience, create two subsets of the dataset
boys  = names_df[names_df['sex'] == 1]
girls = names_df[names_df['sex'] == 2]

# Count the number of babies of each sex contained in the dataset
print(f"Babies registered from 1900 to 2021:")
print(f"   boys: {boys['count'].sum():,}")
print(f"  girls: {girls['count'].sum():,}")
print(f"  total: {names_df['count'].sum():,}")

### aside: configure matplotlib

In [None]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams["figure.figsize"] = (8,4)
matplotlib.rcParams["figure.dpi"] = 150
matplotlib.rcParams["font.size"] = 12

### grouping rows

We want to plot the evolution of the number of registered babies over time. We group the rows by the value in the column `year` and for each resulting group we sum the values of the `count` column to obtain the total number of babies registered each year:

In [None]:
babies_per_year = names_df.groupby(['year'])['count'].sum()
babies_per_year
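
To see what `groupby` followed by `sum` computes, the sketch below applies the same operation to five hypothetical rows that are small enough to check by hand. It also groups by two columns at once, which produces a series with a two-level index.

```python
import pandas as pd

# Five hypothetical rows with the same column names as the cleaned dataset
toy = pd.DataFrame({
    'sex': [1, 2, 1, 2, 1],
    'year': [1900, 1900, 1901, 1901, 1901],
    'count': [5, 3, 4, 7, 1],
})

# One group per distinct year; the 'count' values of each group are summed
per_year = toy.groupby('year')['count'].sum()

# Grouping by two columns yields one group per (year, sex) combination
per_year_and_sex = toy.groupby(['year', 'sex'])['count'].sum()

print(per_year.to_dict())
print(per_year_and_sex.loc[(1901, 1)])
```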

In [None]:
# plot.line() returns a matplotlib Axes object, which we use to customise the plot
ax = babies_per_year.plot.line(title="Evolution of the number of registered babies", grid=True)
ax.set_ylabel("registered babies")
ax.set_ylim(0)

We may want to focus on a subset of the rows. For instance, zoom in on the data over the period 1910 to 1925:

In [None]:
first_ww = babies_per_year.loc[1910:1925].plot.line(title="Babies registered around the First World War", grid=True)
first_ww.set_ylabel("registered babies")
first_ww.set_ylim(0)
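
The selection `.loc[1910:1925]` above is label-based, so both endpoints are included. The sketch below illustrates this on a hypothetical series indexed by year (`per_year` and its values are made up), and contrasts it with position-based slicing:

```python
import pandas as pd

# Hypothetical series indexed by year (values are made up)
per_year = pd.Series([100, 110, 90, 95, 120],
                     index=[1910, 1911, 1912, 1913, 1914])

# Label-based slicing with .loc includes both endpoints
by_label = per_year.loc[1911:1913]

# Position-based slicing with .iloc excludes the end position
by_position = per_year.iloc[1:3]

print(list(by_label.index))
print(list(by_position.index))
```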

We want to know how many boys were given the name **Zinedine** before 1998 and from 1998 onwards, the year France won the football World Cup:

In [None]:
zinedines = boys[boys['name'] == 'ZINEDINE']
zinedines_before_1998 = zinedines[zinedines['year'] <  1998]['count'].sum()
zinedines_after_1998  = zinedines[zinedines['year'] >= 1998]['count'].sum()

print("Number of boys named 'Zinedine' in France:")
print(f"   before 1998: {zinedines_before_1998: 5}")
print(f"    since 1998: {zinedines_after_1998: 5}")

We want more detail about the years in which those babies named **Zinedine** were born, so we group the data by year:

In [None]:
# Group the "zinedines" per year and sum the values of column 'count' for each year
zinedines_per_year = zinedines.groupby(['year'])['count'].sum()
zinedines_per_year.tail()

Make a plot to visually explore the results of the operation above. For this we use **matplotlib**. Please ignore for now the details of how to use matplotlib. We look in more detail at some aspects of data visualisation in [this notebook](visualisation.ipynb).

In [None]:
# Plot the number of "zinedines" as a function of the year
zinedines_per_year.plot.bar(title="Evolution of the number of Zinedines")

----------
## Joining dataframes

Analysing a dataset often requires combining information found in several distinct datasets. **pandas** provides mechanisms for joining dataframes.

Motivating example: we want to identify the 5 departments where the most boys named *Zinedine* were born in 1998.

In [None]:
# Select the boys, named 'ZINEDINE', born in year 1998
zinedines_1998 = boys[(boys['name'] == 'ZINEDINE') & (boys['year'] == 1998)]
zinedines_1998.nlargest(5, 'count').loc[:, ['department', 'count']]
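
As a side note on `nlargest`, the sketch below uses a hypothetical toy dataframe (`toy_df` and its values are made up) to show that it selects the rows with the largest values in a column, and that, when there are no ties, the result matches sorting in descending order and taking the first rows:

```python
import pandas as pd

# Hypothetical counts per department code (values are made up)
toy_df = pd.DataFrame({
    'department': ['13', '59', '69', '75', '93'],
    'count': [25, 18, 21, 30, 12],
})

# The 3 rows with the largest values in the column 'count'
top3 = toy_df.nlargest(3, 'count')

# Alternative: sort in descending order and keep the first 3 rows
same = toy_df.sort_values('count', ascending=False).head(3)

print(top3)
print(top3.equals(same))
```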

The data in our example dataframe does not include the names of the departments associated with those codes (e.g. 13, 59, 69, etc.). We will use an additional helper dataframe to retrieve the names of those departments.

### download another dataset

In [None]:
# Download the dataset if necessary
data_dir = 'data'
path = os.path.join('..', data_dir, 'departements-region.csv')

if not os.path.isfile(path):
    os.makedirs(os.path.join('..', data_dir), exist_ok=True)
    url = 'https://www.data.gouv.fr/en/datasets/r/987227fb-dcb2-429e-96af-8979f97c9c84'
    download(url, path)

In [None]:
dept_df = pd.read_csv(path, index_col=0)
rows, cols = dept_df.shape
print(f'This dataset contains {rows:,} rows and {cols} columns')

In [None]:
dept_df.sample(5)

In [None]:
# Reminder: select the Zinedines born in 1998
zinedines_1998 = boys[(boys['name'] == 'ZINEDINE') & (boys['year'] == 1998)]
zinedines_1998.head()

Now we can **join** both dataframes to include all the data we need in each row, in particular the name of the department:

In [None]:
# Reminder: select the Zinedines born in 1998
zinedines_1998 = boys[(boys['name'] == 'ZINEDINE') & (boys['year'] == 1998)]

# Create a new dataframe by joining 'zinedines_1998' and 'dept_df': the values in the column
# 'department' of the first dataframe are matched against the index of 'dept_df'.
zinedines_1998_full = zinedines_1998.join(dept_df, on='department')

zinedines_1998_full.sample(5).style.map(highlight_column, subset=['department', 'dep_name'])
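
The mechanics of `join` with the `on` argument can be sketched on two small hypothetical dataframes (`left`, `right` and all values below are made up for illustration; only the column names `department`, `count` and `dep_name` mirror the ones used above). The sketch also shows what is assumed here to be an equivalent call to `merge` with explicit keys:

```python
import pandas as pd

# Hypothetical left dataframe: counts per department code (made-up values)
left = pd.DataFrame({
    'department': ['13', '59', '69'],
    'count': [25, 18, 21],
})

# Hypothetical helper dataframe, indexed by department code
right = pd.DataFrame(
    {'dep_name': ['Bouches-du-Rhône', 'Nord', 'Rhône']},
    index=['13', '59', '69'],
)

# join: match the column 'department' of 'left' against the index of 'right'
joined = left.join(right, on='department')

# merge: the same left join expressed with explicit keys
merged = left.merge(right, left_on='department', right_index=True, how='left')

print(joined)
print(joined.equals(merged))
```

Note that the keys must have the same type on both sides to match: in this sketch the department codes are strings in both dataframes.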

We now have a dataframe which contains the selected rows, each with all the information we need to answer the question: *what are the names of the top 5 departments where the boys born in 1998 were named 'Zinedine'?*

In [None]:
top_depts = zinedines_1998_full.nlargest(5, 'count')
top_depts[['count', 'dep_name']]

In [None]:
# Extract the values of the series
for count, dept in zip(top_depts['count'].values, top_depts['dep_name'].values):
    print(f'{count}  {dept}')

-------------
## Acknowledgements
<a id='Acknowledgements'></a>

These are the sources this notebook is based on. You are encouraged to consult them to dig deeper:

* [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) by Jake VanderPlas (highly recommended book)
* [Intro to pandas](https://pandas.pydata.org/docs/getting_started/index.html#intro-to-pandas)
* Data School [Pandas best practices](https://youtu.be/hl-TGI4550M) (video)
* Dunder Data's [Intro to Pandas](https://youtu.be/31wa8tmrkPU) video series
* Python Bootcamp organised by the [Berkeley Institute for Data Science (BIDS)](https://bids.berkeley.edu) in Fall 2016: [videos](https://bids.berkeley.edu/news/python-boot-camp-fall-2016-training-videos-available-online) and [notebooks](https://github.com/profjsb/python-bootcamp)
* [Python for Data Analysis](https://www.oreilly.com/library/view/python-for-data/9781491957653) 2nd Edition, by Wes McKinney