# pandas tutorial

![pandas logo](https://pandas.pydata.org/_static/pandas_logo.png)

*Author: Fabio Hernandez*

*Last updated: 2019-05-15*

*Location:* https://github.com/airnandez/numpandas

--------------------
## Introduction

This is a short tutorial to help you get familiar with the **pandas** library, which is built on top of NumPy: you can find an introduction to NumPy in [this notebook](NumPy.ipynb).

This tutorial draws inspiration, ideas and sometimes material from several publicly available sources. Please see the [Acknowledgements](#Acknowledgements) section for details.

-----------------------
## Reference documentation

The entry point to the documentation of the stable release of pandas is http://pandas.pydata.org/pandas-docs/stable. It includes a [user guide](http://pandas.pydata.org/pandas-docs/stable/user_guide/index.html), an [API reference](http://pandas.pydata.org/pandas-docs/stable/reference/index.html) and a [cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf).

The [DataCamp pandas Cheat Sheet](https://assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf) can also be a useful resource.

---------------------
## Import

**pandas** is customarily imported as shown below:

In [None]:
import pandas as pd
pd.__version__

In addition, for the examples given in this notebook we will need some packages from the Python standard library so we import them here:

In [None]:
import datetime

----------
## Overview

**pandas** offers three main data structures designed to facilitate the flexible, programmatic manipulation of datasets. Those data structures are `DataFrame`, `Series` and `Index`. We will start by exploring what a `DataFrame` is and what we can do with it.
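As a minimal sketch of how these three structures relate to each other, the toy example below builds a small `DataFrame` by hand. The column names and values are made up for illustration and are not taken from the sample dataset:

```python
import pandas as pd

# A toy dataframe with made-up values, for illustration only
toy = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'value': [10, 20, 30],
})

column = toy['value']      # a single column is a Series
row_labels = toy.index     # the row labels are an Index
col_labels = toy.columns   # the column labels are also an Index

print(type(toy).__name__, type(column).__name__)
print(list(col_labels))
```

In other words, a `DataFrame` can be seen as a collection of `Series` which share the same row labels.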

---------------------
## Load the dataset

Read a sample dataset, located in the `data` subdirectory, which is formatted as a sequence of lines, each line composed of a series of delimiter-separated values (this particular file uses `;` as the delimiter). Our sample dataset contains some data about the European Union, extracted from several sources, including [Wikipedia](https://en.wikipedia.org/wiki/European_Union), [EuroStat](https://ec.europa.eu/eurostat) and the [EU Budget](http://ec.europa.eu/budget) site.

In [None]:
%%bash

# Inspect a few lines of the text file containing the dataset
head -5 "./data/european_union.csv"

In [None]:
# This particular dataset uses ';' as separator, instead of the more usual ','
df = pd.read_csv('./data/european_union.csv', sep=';')

In [None]:
# Inspect the dimensions of the dataframe
rows, columns = df.shape
print(f'This dataframe has {rows} rows and {columns} columns')

**pandas** has built-in methods for doing I/O with files in several formats, including flat files (csv, fixed-width format, msgpack), Excel, JSON, HTML, HDF5, parquet, SQL, etc. See the [documentation](http://pandas.pydata.org/pandas-docs/stable/reference/io.html#flat-file) for details.
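To give an idea of how these I/O methods fit together, here is a small sketch which reads semicolon-separated text from an in-memory buffer (standing in for a file on disk) and then converts the result to JSON and back. The content of the buffer is hypothetical:

```python
import io
import pandas as pd

# Hypothetical semicolon-separated content, standing in for a file on disk
text = "country;population\nAA;100\nBB;250\n"

toy = pd.read_csv(io.StringIO(text), sep=';')

# Serialize the dataframe to a JSON string, then read that string back
as_json = toy.to_json(orient='records')
restored = pd.read_json(io.StringIO(as_json), orient='records')

print(as_json)
print(restored)
```

The same pattern (a `read_*` function and a matching `to_*` method) applies to most of the other supported formats, some of which require additional packages to be installed.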

---------------------
## Exploring the dataset contents

To get an idea of what data is included in the dataset, you can explore the contents of the whole dataframe.

**WARNING**: it is not always a good idea to display the entire dataset, depending on the size of the data. It is recommended to first inspect the size of the dataframe as done above. This dataset is small, so we display it:

In [None]:
df

You can also explore a fraction of the dataset by displaying, for instance, a few rows at the beginning or at the end of the dataframe:

In [None]:
# Display the first 3 rows of the dataset. By default, the first 5 rows will be displayed
df.head(3)

You can also explore the last rows of the dataset or any intermediate rows, by using notation similar to the one used with NumPy arrays, on top of which **pandas** is built:

In [None]:
# Display the last 3 rows of the dataset
df.tail(3)

In [None]:
# Display the rows from position 10 up to position 14 (not included)
df[10:14]

**pandas** is designed for efficient handling of datasets organized as follows:

* each observation is saved in its own row
* each variable is saved in its own column

Our sample dataset is organized in exactly that way.

### An aside: understanding the dataset

In order to analyse any dataset, you need a good understanding of the meaning of the data. Here are the details of our sample dataset:

| column                                    | meaning |
| ------------------------------------------|----------|
| `country`                                 | name of the country, in English |
| `country_code`                            | code of the country, as used by [Eurostat](https://ec.europa.eu/eurostat/statistics-explained/index.php/Glossary:Country_codes) |
| `accession_date`                          | date of accession of the country to the European Union (format: yyyy-mm-dd) |
| `population`                              | the number of persons having their usual residence in each country as of January 1st, 2018 (source: [Eurostat](https://ec.europa.eu/eurostat/tgm/table.do?tab=table&plugin=1&language=en&pcode=tps00001)) |
| `euro_zone_member`                        | `True` if the country is a member of the [Eurozone](https://en.wikipedia.org/wiki/Eurozone)  |
| `immigration`                             | total number of long-term immigrants arriving into each country in 2017, as reported by each country (source: [Eurostat](https://ec.europa.eu/eurostat/tgm/table.do?tab=table&plugin=1&language=en&pcode=tps00176)) |
| `emigration`                              | total number of long-term emigrants leaving from the reporting country in 2017, as reported by each country (source: [Eurostat](https://ec.europa.eu/eurostat/tgm/table.do?tab=table&plugin=1&language=en&pcode=tps00177))  | 
| `contribution_to_eu_budget_millions_euro` | contribution to the EU budget for each country (based on the GNI), for year 2017, in millions of euros (source: [European Commission](http://ec.europa.eu/budget/graphs/revenue_expediture.html)) |
| `expenditure_eu_budget_millions_euro`     | expenditure of the EU budget per country (for all programs), for year 2017, in millions of euros (source: [European Commission](http://ec.europa.eu/budget/graphs/revenue_expediture.html)) |

Generally speaking, in order to draw sensible conclusions from any dataset you are analysing, make sure you understand precisely what is contained in the dataset and you understand where the data comes from.

### `dataframe` properties

You can get some information on the properties of a dataframe. The attribute `DataFrame.columns` is an object of type `Index` (see [reference documentation](https://pandas.pydata.org/pandas-docs/stable/reference/indexing.html#index)):

In [None]:
# Get the name of the columns in this dataframe
for c in df.columns:
    print(c)

In [None]:
# Get the number of values (of any type) contained in the dataframe
print(f'This dataframe contains {df.size} values')

In [None]:
# Get the amount of RAM (in bytes) used for storing each column of this dataframe
df.memory_usage()

### Cleaning the data

Very often, the *raw* data needs some cleaning, so that we can easily manipulate it in **pandas**. For instance, in this particular example, we need to make sure that **pandas** understands that the column `accession_date` is a date and not just a string. This is useful for comparisons and filtering, which we will visit later on.

In [None]:
# Display the types of each column in the dataframe
df.dtypes

In [None]:
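# Convert column 'accession_date' to a date (the values use the format yyyy-mm-dd)
# pd.to_datetime is used here because recent pandas releases reject the 'datetime64[D]' dtype
df['accession_date'] = pd.to_datetime(df['accession_date'], format='%Y-%m-%d')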
df['accession_date'].dtype
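The function `pd.to_datetime` is a common way of converting strings to dates. The sketch below uses made-up values (not the sample dataset) to show how it can handle a value which cannot be parsed, and how date components can be extracted afterwards:

```python
import pandas as pd

# Made-up dates in yyyy-mm-dd format, for illustration only
dates = pd.Series(['1957-03-25', '1986-01-01', 'not a date'])

# errors='coerce' turns values that cannot be parsed into NaT (missing)
parsed = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')

print(parsed.dtype)
print(parsed.dt.year)        # year component of each date
print(parsed.isna().sum())   # number of values that failed to parse
```

Whether silently coercing invalid values is appropriate depends on the dataset: the default behaviour, which raises an error, is often the safer choice.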

-------
## Selecting and filtering

**pandas** provides powerful built-in tools for filtering the data row-wise and column-wise.

In [None]:
# This is a utility function we use for displaying the dataframe, which we use later
def highlight_column(s):
    return 'background-color: PaleGoldenrod'

### select all values in a given column

Selecting all the values in a column is a frequent operation we need to perform on any dataframe:

In [None]:
df.head(3).style.applymap(highlight_column, subset=['population'])

In [None]:
# Get the values of the column 'population' for all rows
df['population']

**NOTE**: it is possible to use the notation `df.population` to select all the values of the column `population`. However, this notation is not recommended since the name of the column must be a valid Python identifier for it to work. For instance, if the name of a column is `budget-contribution`, this notation cannot be used:

In [None]:
# WARNING: This notation is NOT recommended. Use instead: df['population']
df.population
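To make the limitation of the attribute notation more concrete, here is a small sketch with a toy dataframe. The column names are chosen on purpose to cause trouble: one is not a valid Python identifier and the other one collides with an existing `DataFrame` method:

```python
import pandas as pd

# Toy dataframe: column names chosen to illustrate the problem
toy = pd.DataFrame({
    'budget-contribution': [1.5, 2.5],
    'count': [3, 4],
})

# Bracket notation works for any column name
total = toy['budget-contribution'].sum()

# 'count' is also the name of a DataFrame method, so attribute access
# returns the method, not the column
is_method = callable(toy.count)

print(total, is_method)
```

Writing `toy.budget-contribution` would be parsed by Python as a subtraction, which is why the bracket notation is the one used throughout this notebook.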

We can perform operations on all the numerical values of a column, such as descriptive statistics:

In [None]:
# Display some descriptive statistics of the values in the 'population' column
df['population'].describe()

In [None]:
# Sum all the values of the column 'population'
eu_population = df['population'].sum()
print(f'The population of the EU is {eu_population:,} people')

You can also perform an operation on all the values of one (or more) columns. For instance, let's convert all the population values to millions before performing some additional operations:

In [None]:
population = df['population'] / 1_000_000  # you can also use the notations 1e6 or 1000000
min_population, max_population = population.min(), population.max()
total_population = population.sum()
num_countries = population.count()

print(f'Least populous country has {min_population:.1f} million inhabitants')
print(f'Most populous country has {max_population:.1f} million inhabitants')
print(f'Total EU population is {total_population:.1f} million inhabitants located in {num_countries} countries')

### select rows satisfying one or more conditions

You can select the rows of the dataframe that satisfy one or more conditions on the values of a column. You can combine those conditions into logical expressions, using the element-wise operators `&` (and), `|` (or) and `~` (not), to select the rows of interest:

In [None]:
# Select all the rows with value True in the column 'euro_zone_member'
is_eurozone_member = df['euro_zone_member'] == True
euro_zone_df = df[is_eurozone_member]

euro_zone_df.style.applymap(highlight_column, subset=['euro_zone_member'])

In [None]:
# This is another more compact way of expressing the same filter,
# but is not necessarily easier to read
euro_zone_df = df[df['euro_zone_member'] == True]
euro_zone_df.head(3)

The result of a select operation is generally a dataframe object. You can perform operations on that dataframe as you would on any other dataframe.

In [None]:
# Compute the population of the eurozone
is_eurozone_member = df['euro_zone_member'] == True
eurozone_population = df[is_eurozone_member]['population'].sum() / 1_000_000
print(f'The population of the Euro zone is {eurozone_population:.2f} million')

In [None]:
# Select the countries which joined the EU since year 1989 and adopted the Euro
is_eurozone_member = df['euro_zone_member'] == True
joined_since_1989  = df['accession_date'] >= datetime.datetime(1989, 1, 1)

df[is_eurozone_member & joined_since_1989]

In [None]:
# Select the countries which are either founder members or have a population
# of at least 20M people
is_founder = df['accession_date'] == datetime.datetime(1957, 3, 25)
is_bigger_than_20m = df['population'] >= 20_000_000
df[is_founder | is_bigger_than_20m]

In [None]:
# Compare the populations of founder members vs. non-founder member countries
is_founder = df['accession_date'] == datetime.datetime(1957, 3, 25)

founders_population = df[is_founder]['population'].sum() / 1e6
non_founders_population = df[~is_founder]['population'].sum() / 1e6

print(f"Founders' population:     {founders_population:.0f} million")
print(f"Non-founders' population: {non_founders_population:.0f} million")
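The filters in this section combine boolean `Series` element by element. The sketch below, built on a toy dataframe with made-up values, summarises the three operators and shows what happens if Python's own `and` keyword is used instead:

```python
import pandas as pd

# Toy dataframe with made-up values, for illustration only
toy = pd.DataFrame({
    'member': [True, False, True, False],
    'population': [5, 50, 30, 1],
})

is_member = toy['member']
is_big = toy['population'] >= 20

both = toy[is_member & is_big]        # element-wise "and"
either = toy[is_member | is_big]      # element-wise "or"
neither = toy[~(is_member | is_big)]  # element-wise "not"

# Python's own `and` is ambiguous for a Series and raises ValueError
try:
    is_member and is_big
except ValueError as err:
    print('ValueError:', err)
```

When the conditions are written inline, each comparison needs its own parentheses, e.g. `toy[(toy['population'] >= 20) & toy['member']]`, because `&` binds more tightly than the comparison operators.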

### select specific rows

One of the most useful methods for selecting rows and values within a row is [DataFrame.loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html). It accepts several forms as input, but a general one is:

`df.loc[rows, cols]`


In [None]:
# Retrieve the entire row with data for France
# We need to provide the index of the row we want to retrieve
df.loc[9, :]

In [None]:
# Retrieve specific columns of a given row
df.loc[9, ['capital', 'contribution_to_eu_budget_millions_euro']]

In [None]:
# Retrieve specific columns of a range of rows
df.loc[9:11, ['country', 'population']]
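Note that `loc` selects by *label*, and that a slice of labels includes its end point. Its counterpart `iloc` selects by integer *position* and excludes the end point, like NumPy arrays do. The following sketch uses a toy dataframe, whose index deliberately does not start at zero, to show the difference:

```python
import pandas as pd

# Toy dataframe whose row labels (10..13) differ from the row positions (0..3)
toy = pd.DataFrame(
    {'country': ['A', 'B', 'C', 'D'], 'population': [1, 2, 3, 4]},
    index=[10, 11, 12, 13],
)

by_label = toy.loc[10:12, ['country']]   # labels 10, 11 and 12 (end included)
by_position = toy.iloc[0:2]              # positions 0 and 1 (end excluded)

print(by_label)
print(by_position)
```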

### set a meaningful index

It is convenient to use an index which is meaningful to allow us to select an entire row or specific columns within a row using a meaningful label. You can set the index of a dataframe when you load it from a file or after the dataframe is already in memory:

In [None]:
df.head(3)

In [None]:
# Use the contents of the `country_code` column as the dataframe index
# We don't want the original dataframe to be modified, so we use a new variable
df_new = df.set_index('country_code')
df_new.head(3)

We can now use that more meaningful index to select the rows of interest:

In [None]:
# Retrieve the population for countries ES and DE
df_new.loc[['ES', 'DE'], ['population']]

We can also set the index when loading the data to memory, by specifying the column number we want to use as the index of the dataframe:

In [None]:
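# Reloading discards the earlier date conversion, so we also ask pandas to parse 'accession_date'
df = pd.read_csv('./data/european_union.csv', sep=';', index_col=1, parse_dates=['accession_date'])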
df.head(3)
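The argument `index_col` also accepts a column name, which may be easier to read than a column number. Here is a minimal sketch using an in-memory buffer with hypothetical content in place of the file on disk. It also shows `reset_index()`, which turns the index back into an ordinary column:

```python
import io
import pandas as pd

# Hypothetical content, standing in for a file on disk
text = "country;country_code;population\nAland;AA;100\nBland;BB;250\n"

toy = pd.read_csv(io.StringIO(text), sep=';', index_col='country_code')

# Select a single value using the row label and the column name
population_bb = toy.loc['BB', 'population']

# Turn the index back into an ordinary column
flat = toy.reset_index()

print(population_bb)
print(flat.columns.tolist())
```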

------------
## Sorting

You can sort the contents of a dataframe, according to the values of a set of columns:

In [None]:
# Show some Mediterranean countries, sorted according to their accession date and population
mediterraneans = ['Spain', 'France', 'Italy', 'Slovenia', 'Croatia', 'Greece']
is_mediterranean = df['country'].isin(mediterraneans)
df[is_mediterranean].sort_values(by=['accession_date', 'population'])

You can also retrieve the N largest (or N smallest) rows, according to the value of some columns:

In [None]:
# Get the top 5 countries according to their emigration values
df.nlargest(5, columns=['emigration'])
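The sort order can be chosen independently for each column by passing a list to the `ascending` argument. The sketch below, based on a toy dataframe with made-up values, sorts by year in ascending order and then by population in descending order, and also retrieves the two smallest rows with `nsmallest`:

```python
import pandas as pd

# Toy dataframe with made-up values, for illustration only
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'year': [1957, 1986, 1957, 1986],
    'population': [60, 46, 82, 10],
})

# Oldest year first; within the same year, largest population first
ordered = toy.sort_values(by=['year', 'population'], ascending=[True, False])

# The 2 rows with the smallest population
smallest = toy.nsmallest(2, columns=['population'])

print(ordered)
print(smallest)
```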

---------------------
## Modifying the dataframe

You will often need to modify the dataframe, for instance, for cleaning it, for extending it or for computing new values useful in the data analysis process.

Please note that the modifications are applied to the in-memory data, not to the file on disk, unless you explicitly save them.

In [None]:
# Rename some dataframe columns to use shorter names
df = df.rename(columns={
    # current column name                      new column name
    'contribution_to_eu_budget_millions_euro': 'budget_contribution',
    'expenditure_eu_budget_millions_euro':     'budget_expenditure',
})
df.head(3)

You can also extend the dataframe by creating new columns, whose values may be computed using other columns:

In [None]:
# Add the budget contribution and the budget expenditure per capita
df['budget_contribution_per_capita'] = (df['budget_contribution'] * 1_000_000 ) / df['population']
df['budget_expenditure_per_capita']  = (df['budget_expenditure'] * 1_000_000 ) / df['population']

df.head(4).style.applymap(highlight_column, subset=['budget_contribution_per_capita', 'budget_expenditure_per_capita'])

We can also compute a `Series` of values from the values in the columns of the dataframe:

In [None]:
# Compute the net contribution per capita to the EU budget for each country
net_contribution = df['budget_contribution_per_capita'] - df['budget_expenditure_per_capita']
net_contribution.sort_values(ascending=True)

In [None]:
print(f'The net contribution by France to the 2017 EU budget was approx. {net_contribution["FR"]:.0f}€ per capita')

We can use the methods `idxmin()` (or `idxmax()`) to retrieve the index of the minimum (or maximum) value of a column. In our case, we can use this to retrieve the country code (i.e. the value of the dataframe index) and then the country name:

In [None]:
# Retrieve the name of the countries with minimum and maximum value on the
# 'budget_expenditure' column
expenditure = df['budget_expenditure']
country_min_expenditure = df.loc[expenditure.idxmin(), 'country']
country_max_expenditure = df.loc[expenditure.idxmax(), 'country']

print(f'The country with lowest EU budget expenditure in 2017 was:  {country_min_expenditure}')
print(f'The country with highest EU budget expenditure in 2017 was: {country_max_expenditure}')

---------------------
## Serializing a dataframe

You can save the contents of a dataframe to disk. **pandas** natively supports several formats (see [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#serialization-io-conversion)):

In [None]:
# Save the (modified) dataframe to disk, in CSV format
df.to_csv('./data/my_dataset.csv')

In [None]:
%%bash

head -5 './data/my_dataset.csv'
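By default, `to_csv` writes the index as the first column of the output, so it has to be designated as the index again when the data is read back. The following sketch illustrates this round trip with a toy dataframe and an in-memory buffer, used here in place of a file on disk:

```python
import io
import pandas as pd

# Toy dataframe with a named index, for illustration only
toy = pd.DataFrame(
    {'country': ['Aland', 'Bland'], 'population': [100, 250]},
    index=pd.Index(['AA', 'BB'], name='country_code'),
)

buffer = io.StringIO()
toy.to_csv(buffer)   # the index is written as the first column
buffer.seek(0)

# Restore the index when reading the data back
restored = pd.read_csv(buffer, index_col='country_code')

print(restored)
```

If the index carries no useful information, you can pass `index=False` to `to_csv` to leave it out of the file.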

----------
## Grouping

In some datasets, data is organized so that grouping the observations (i.e. the rows) is necessary to answer some analysis questions. **pandas** provides useful tools for grouping data based on the values of one or more columns.

The data in the dataset we have been working on does not require grouping. We load a different, more complex and bigger dataset to explore how grouping works.

The dataset we will use contains data about the names given to babies in France from year 1900 to year 2017. For each given name you can find the sex of the baby (male or female), the year of birth, the department and the number of babies registered with that given name per year and per department.

You can find details of this public dataset, including the exact meaning of each variable (in French), at https://www.insee.fr/fr/statistiques/2540004

In [None]:
# Forget the dataset we have been using so far
del df

In [None]:
# Load the new dataset. Its fields are separated by tabs.
# We ask pandas to interpret the columns 'annais' and 'dpt' as strings to avoid errors
# with missing values
df = pd.read_csv('./data/prenoms-fr-1900-2017.tsv.gz', sep='\t', dtype={'annais':str, 'dpt':str})
df.shape

In [None]:
# Inspect the types of the columns of the dataset
df.dtypes

In [None]:
# This is a utility function we use for displaying the dataframe
def highlight_missing(s):
    missings = ('XX', 'XXXX')
    return 'color: white; background-color: Crimson' if s in missings else ''

In [None]:
# Rename some columns to use more meaningful names
df = df.rename(columns={
    'sexe':      'sex',
    'preusuel':  'name',
    'annais':    'year',
    'dpt':       'department',
    'nombre':    'count'})

df.head().style.applymap(highlight_missing, subset=['year', 'department'])

As you can see above, there are rows with missing values, which in this case are represented by the strings `XXXX` for year or `XX` for the department. For the purposes of this tutorial, we ignore those rows:

In [None]:
# Drop rows with missing department and year
df.drop(df[df['department'] == 'XX'].index, inplace=True)
df.drop(df[df['year'] == 'XXXX'].index, inplace=True)

# Convert columns 'department' and 'year' to numeric values
df['department'] = pd.to_numeric(df['department'])
df['year']       = pd.to_numeric(df['year'])

df.dtypes
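An alternative way of expressing this kind of cleaning is to build a boolean mask and keep only the rows which satisfy it, as we did in the section on filtering. This is a sketch on a toy dataframe with made-up values, not a replacement for the cell above:

```python
import pandas as pd

# Toy dataframe using the same placeholders for missing values
toy = pd.DataFrame({
    'year': ['1998', 'XXXX', '2001'],
    'department': ['75', '69', 'XX'],
    'count': [3, 5, 7],
})

# Keep only the rows where neither column holds a placeholder
is_complete = (toy['year'] != 'XXXX') & (toy['department'] != 'XX')
clean = toy[is_complete].copy()

# Convert the remaining strings to numeric values
clean['year'] = pd.to_numeric(clean['year'])
clean['department'] = pd.to_numeric(clean['department'])

print(clean.dtypes)
```

The call to `copy()` makes `clean` an independent dataframe, so that assigning new columns to it does not trigger warnings about modifying a view of `toy`.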

In [None]:
# In this dataset, the sex is represented as 1 for males and 2 for females
# Define some convenient constants
MALE = 1
FEMALE = 2

In [None]:
# Count the number of babies of each sex contained in the dataset
boys = df[df['sex'] == MALE]
girls = df[df['sex'] == FEMALE]

print(f"Babies registered from 1900 to 2017:")
print(f"   boys: {boys['count'].sum():,}")
print(f"  girls: {girls['count'].sum():,}")
print(f"  total: {df['count'].sum():,}")

We want to know how many boys were given the name **Zinedine** before and after year 1998, when France won the football world cup:

In [None]:
zinedines = boys[boys['name'] == 'ZINEDINE']
zinedines_before_1998 = zinedines[zinedines['year'] < 1998]['count'].sum()
zinedines_after_1998  = zinedines[zinedines['year'] >= 1998]['count'].sum()

print(f"Number of boys named 'Zinedine' in France:")
print(f"   before 1998: {zinedines_before_1998: 5}")
print(f"    since 1998:  {zinedines_after_1998: 5}")

We want to get more details about the years those babies were named **Zinedine**, so we group the data by year: 

In [None]:
# Group the "zinedines" per year and sum the values of column 'count' for each year
zinedines_per_year = zinedines.groupby(['year'])['count'].sum()
zinedines_per_year.tail()
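Grouping is not limited to a single column nor to a single aggregation. The sketch below uses a toy dataframe, with made-up names and counts which mimic the structure of the dataset, to group by two columns at once and to compute several aggregates per group:

```python
import pandas as pd

# Toy dataframe with made-up values mimicking the structure of the dataset
toy = pd.DataFrame({
    'name': ['ANNE', 'ANNE', 'LUC', 'LUC', 'ANNE'],
    'year': [2000, 2000, 2000, 2001, 2001],
    'department': [75, 69, 75, 75, 69],
    'count': [4, 6, 3, 5, 2],
})

# One total per (name, year) pair: the result is a Series with a two-level index
per_name_year = toy.groupby(['name', 'year'])['count'].sum()

# Several aggregates per name: the result is a dataframe with one column per aggregate
summary = toy.groupby('name')['count'].agg(['sum', 'max'])

print(per_name_year)
print(summary)
```

Individual groups can then be retrieved using `loc`, e.g. `per_name_year.loc[('ANNE', 2000)]`.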

Make a plot to visually explore the results of the operation above. For this we use **matplotlib**. Please ignore for now the details of how to use matplotlib. We look in more detail at some aspects of data visualisation in [this notebook](visualisation.ipynb).

In [None]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams["figure.figsize"] = (8,6)
matplotlib.rcParams["figure.dpi"] = 100

In [None]:
# Plot the number of "zinedines" as a function of the year
zinedines_per_year.plot.bar()

-------------
## Acknowledgements
<a id='Acknowledgements'></a>

These are the sources this notebook is based on. You are encouraged to consult them to dig deeper:

* [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) by Jake VanderPlas (highly recommended book)
* [10 minutes to pandas](http://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html)
* Data School [Pandas best practices](https://youtu.be/hl-TGI4550M) (video)
* Dunder Data's [Intro to Pandas](https://youtu.be/31wa8tmrkPU) video series
* Python Bootcamp organised by the [Berkeley Institute for Data Science (BIDS)](https://bids.berkeley.edu) in the Fall 2016: [videos](https://bids.berkeley.edu/news/python-boot-camp-fall-2016-training-videos-available-online) and [notebooks](https://github.com/profjsb/python-bootcamp)
* [Python for Data Analysis](https://www.oreilly.com/library/view/python-for-data/9781491957653) 2nd Edition, by Wes McKinney