# Analyse de données avec Python
## Combiner des DataFrames avec Pandas
Questions
* Peut-on travailler avec plusieurs sources de données?
* Comment combiner les données de deux DataFrames?

Objectifs
* Combiner les données de plusieurs fichiers en utilisant `concat` et `merge`.
* Combiner deux DataFrames en utilisant un identifiant commun.

# Data Analysis with Python
## Combining DataFrames with pandas
Questions
* Can I work with data from multiple sources?
* How can I combine data from different data sets?

Objectives
* Combine data from multiple files into a single DataFrame using `concat` and `merge`.
* Combine two DataFrames using a unique ID found in both DataFrames.

## Charger nos données

## Loading our data

In [None]:
# Charger le module pandas
import pandas as pd

# Lister une collection de fichiers CSV
from glob import glob
glob('../data/by_year/*.csv')

In [None]:
# First make sure pandas is loaded
import pandas as pd

# List a collection of CSV files
from glob import glob
glob('../data/by_year/*.csv')

## Concaténer des DataFrames

## Concatenating DataFrames

In [None]:
annee2001 = pd.read_csv('../data/by_year/surveys_2001.csv')
annee2002 = pd.read_csv('../data/by_year/surveys_2002.csv')

print(annee2001.shape, annee2002.shape)

In [None]:
year2001 = pd.read_csv('../data/by_year/surveys_2001.csv')
year2002 = pd.read_csv('../data/by_year/surveys_2002.csv')

print(year2001.shape, year2002.shape)

In [None]:
# Concaténer les dataframes verticalement
vertical = pd.concat([annee2001, annee2002], axis='index')
vertical

In [None]:
# Stack the DataFrames on top of each other
vertical = pd.concat([year2001, year2002], axis='index')
vertical

In [None]:
# Réinitialiser l'index du dataframe
# L'option drop=True évite l'ajout d'une colonne avec l'ancien index
vertical = vertical.reset_index(drop=True)
vertical

In [None]:
# Reset the index values of the DataFrame
# The drop=True option avoids adding a new column holding the old index values
vertical = vertical.reset_index(drop=True)
vertical
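
To see why the index is reset after stacking, the sketch below uses two tiny in-memory tables (hypothetical values standing in for the yearly files). Each input keeps its own row labels, so the stacked result ends up with repeated labels until `reset_index(drop=True)` renumbers them.

```python
import pandas as pd

# Toy stand-ins for two yearly tables (hypothetical values)
part_a = pd.DataFrame({'record_id': [1, 2], 'year': [2001, 2001]})
part_b = pd.DataFrame({'record_id': [3, 4], 'year': [2002, 2002]})

# Stack the two tables; each one brings its own 0, 1 row labels
stacked = pd.concat([part_a, part_b], axis='index')
labels_before = list(stacked.index)
unique_before = stacked.index.is_unique

# Renumber the rows from 0 and discard the old labels
stacked = stacked.reset_index(drop=True)
labels_after = list(stacked.index)
unique_after = stacked.index.is_unique

print(labels_before, unique_before)
print(labels_after, unique_after)
```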

In [None]:
# Accumuler les données de tous les fichiers de la collection
surveys_df = pd.DataFrame()  # DataFrame vide

for fichier in glob('../data/by_year/*.csv'):
    df_annee = pd.read_csv(fichier)
    surveys_df = pd.concat([surveys_df, df_annee], axis='index')

surveys_df = surveys_df.reset_index(drop=True)
surveys_df

In [None]:
# Accumulate data from all files in the collection
surveys_df = pd.DataFrame()  # Empty DataFrame

for filename in glob('../data/by_year/*.csv'):
    df_year = pd.read_csv(filename)
    surveys_df = pd.concat([surveys_df, df_year], axis='index')

surveys_df = surveys_df.reset_index(drop=True)
surveys_df
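
The loop above grows `surveys_df` one file at a time. A possible alternative, sketched here under the assumption that all files fit in memory, is to collect the DataFrames in a list and call `pd.concat` once with `ignore_index=True`. The dictionary `frames_by_file` is a hypothetical stand-in for what `pd.read_csv()` would return for each file; sorting the names makes the row order independent of the order in which `glob` lists the files.

```python
import pandas as pd

# Hypothetical in-memory stand-ins for the frames pd.read_csv() would return
frames_by_file = {
    'surveys_2002.csv': pd.DataFrame({'record_id': [3], 'year': [2002]}),
    'surveys_2001.csv': pd.DataFrame({'record_id': [1, 2],
                                      'year': [2001, 2001]}),
}

# Sort the names so the row order does not depend on listing order
frames = [frames_by_file[name] for name in sorted(frames_by_file)]

# A single concat call; ignore_index=True renumbers the rows from 0
combined = pd.concat(frames, axis='index', ignore_index=True)
print(combined)
```

With real files, the list could be built as `[pd.read_csv(f) for f in sorted(glob(pattern))]`, where `pattern` is the same path pattern used above.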

## Exercice - Concaténer des DataFrames
* Chargez les données de tous les fichiers CSV du répertoire
  `../data/by_species_id/` et accumulez-les dans `surveys_sp`.
* Réinitialisez l'index sans préserver celui accumulé.

(4 min.)

## Exercise - Concatenating DataFrames
* Load the data from all CSV files in the directory
  `../data/by_species_id/` and accumulate them in `surveys_sp`.
* Reset the index while dropping the accumulated one.

(4 min.)

In [None]:
surveys_sp = pd.DataFrame()  # DataFrame vide

for fichier in glob('../data/by_species_id/*.csv'):
    nouveau_df = pd.read_csv(fichier)
    surveys_sp = pd.concat([surveys_sp, nouveau_df], axis='index')

surveys_sp = surveys_sp.reset_index(drop=True)
surveys_sp

In [None]:
surveys_sp = pd.DataFrame()  # DataFrame vide

for fichier in ###('../data/by_species_id/*.csv'):
    nouveau_df = pd.read_csv(fichier)
    surveys_sp = pd.###([###, nouveau_df], ###='index')

surveys_sp = surveys_sp.###(drop=###)
surveys_sp

In [None]:
surveys_sp = pd.DataFrame()  # Empty DataFrame

for filename in glob('../data/by_species_id/*.csv'):
    new_df = pd.read_csv(filename)
    surveys_sp = pd.concat([surveys_sp, new_df], axis='index')

surveys_sp = surveys_sp.reset_index(drop=True)
surveys_sp

In [None]:
surveys_sp = pd.DataFrame()  # Empty DataFrame

for filename in ###('../data/by_species_id/*.csv'):
    new_df = pd.read_csv(filename)
    surveys_sp = pd.###([###, new_df], ###='index')

surveys_sp = surveys_sp.###(drop=###)
surveys_sp

* Calculez le poids moyen selon l'espèce et le sexe (1 min.)

* Compute the average weight by sex for each species. (1 min.)

In [None]:
# Calculer le poids moyen par espèce et par sexe
poids_espece = surveys_sp.groupby(
    ['species_id', 'sex'])['weight'].mean().unstack()
poids_espece

In [None]:
# Calculer le poids moyen par espèce et par sexe
poids_espece = surveys_sp.groupby(
    ['species_id', 'sex'])###.unstack()
poids_espece

In [None]:
# Get the average weight by sex for each species
weight_species = surveys_sp.groupby(
    ['species_id', 'sex'])['weight'].mean().unstack()
weight_species

In [None]:
# Get the average weight by sex for each species
weight_species = surveys_sp.groupby(
    ['species_id', 'sex'])###.unstack()
weight_species

* Sauvegardez le tableau des moyennes
  dans un fichier CSV et rechargez-le (3 min.)

* Export your results as a CSV file and make sure
  it reads back into Python properly. (3 min.)

In [None]:
# Écrire dans un fichier - garder l'index 'species_id' cette fois-ci
fichier_csv = 'poids_par_espece.csv'
poids_espece.to_csv(fichier_csv, index=True)

# Relire les données, fournir le nom de l'index
pd.read_csv(fichier_csv, index_col='species_id')

In [None]:
# Écrire dans un fichier - garder l'index 'species_id' cette fois-ci
fichier_csv = 'poids_par_espece.csv'
poids_espece###

# Relire les données, fournir le nom de l'index
pd.read_csv(fichier_csv, index_col=###)

In [None]:
# Writing to file while keeping the index
csv_file = 'weight_by_species.csv'
weight_species.to_csv(csv_file, index=True)

# Reading it back in with a specified index column
pd.read_csv(csv_file, index_col='species_id')

In [None]:
# Writing to file while keeping the index
csv_file = 'weight_by_species.csv'
weight_species###

# Reading it back in with a specified index column
pd.read_csv(csv_file, index_col=###)
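
One way to check that a table survives the round trip is to compare the reloaded copy with the original. The sketch below is only an illustration: it writes to an in-memory `io.StringIO` buffer instead of a file on disk, and the `means` table holds hypothetical values rather than the lesson's results.

```python
import io

import pandas as pd

# Hypothetical table of mean weights, indexed by species_id
means = pd.DataFrame(
    {'F': [10.0, 42.5], 'M': [11.0, 40.0]},
    index=pd.Index(['AB', 'DM'], name='species_id'),
)

# Write to an in-memory text buffer, keeping the index as a column
buffer = io.StringIO()
means.to_csv(buffer, index=True)

# Rewind the buffer and read it back, restoring the index
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col='species_id')

print(reloaded)
```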

## Joindre deux DataFrames

## Joining Two DataFrames

In [None]:
# Importer un sous-ensemble des espèces pour cet exemple
trois_especes = pd.read_csv('../data/speciesSubset.csv')
trois_especes

In [None]:
# Import a small subset of the species data designed for this part of the lesson
species_sub = pd.read_csv('../data/speciesSubset.csv')
species_sub

### Identifier les clés de jonction

### Identifying join keys

In [None]:
surveys_df.columns

In [None]:
trois_especes.columns

In [None]:
species_sub.columns
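
Rather than comparing the two column listings by eye, the shared column names can be computed. This is a minimal sketch with toy tables (`surveys_toy` and `species_toy` are hypothetical stand-ins); a shared name is only a candidate key, and it is still worth confirming that both columns hold the same kind of values.

```python
import pandas as pd

# Toy stand-ins for the survey and species tables (hypothetical values)
surveys_toy = pd.DataFrame({'record_id': [1, 2],
                            'species_id': ['NL', 'DM'],
                            'weight': [40.0, 44.0]})
species_toy = pd.DataFrame({'species_id': ['DM', 'NL'],
                            'genus': ['Dipodomys', 'Neotoma']})

# Column names present in both tables are candidate join keys
shared = surveys_toy.columns.intersection(species_toy.columns)
candidate_keys = list(shared)
print(candidate_keys)
```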

### Une intersection ou "inner join"

### Inner joins

![Inner join of tables A and B](https://datacarpentry.org/python-ecology-lesson/fig/inner-join.png)

In [None]:
premiers10 = surveys_df.head(10)

# Calculer l'intersection de premiers10 et trois_especes
cle = 'species_id'
intersection = pd.merge(left=premiers10, right=trois_especes,
                        left_on=cle, right_on=cle)
# Quelle est la taille de la jonction?
intersection.shape

In [None]:
head10 = surveys_df.head(10)

# Computing the inner join of head10 and species_sub
key = 'species_id'
merged_inner = pd.merge(left=head10, right=species_sub,
                        left_on=key, right_on=key)
# What's the size of the output data?
merged_inner.shape

In [None]:
intersection

In [None]:
merged_inner

### Jonction de gauche

### Left joins

![Left join of tables A and B](https://datacarpentry.org/python-ecology-lesson/fig/left-join.png)

In [None]:
jonc_gauche = pd.merge(left=premiers10, right=trois_especes,
                       on=cle, how='left')
# Quelle est la taille de la jonction?
jonc_gauche.shape

In [None]:
merged_left = pd.merge(left=head10, right=species_sub,
                       on=key, how='left')
# What's the size of the output data?
merged_left.shape

In [None]:
jonc_gauche

In [None]:
merged_left
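
In a left join, rows of the left table without a match in the right table are kept, with missing values in the columns that come from the right table. The sketch below uses toy tables (hypothetical values) and the `indicator=True` option of `pd.merge`, which adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Toy tables: species 'PF' has no entry in the species table
surveys_toy = pd.DataFrame({'record_id': [1, 2, 3],
                            'species_id': ['NL', 'DM', 'PF']})
species_toy = pd.DataFrame({'species_id': ['DM', 'NL'],
                            'genus': ['Dipodomys', 'Neotoma']})

# Left join; the _merge column is 'both' or 'left_only' for each row
left_toy = pd.merge(left=surveys_toy, right=species_toy,
                    on='species_id', how='left', indicator=True)

# Rows whose species was not found in the right table
unmatched = left_toy[left_toy['genus'].isna()]
print(left_toy)
```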

### Les autres types de jonction
* `how='right'` : toutes les lignes du second DataFrame sont gardées
* `how='outer'` : équivalent d'une union, toutes les lignes sont gardées

### Other join types
* `how='right'` : all rows from the right DataFrame are kept
* `how='outer'` : equivalent to a union, all rows from both DataFrames are kept
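
To compare the join types side by side, the following sketch merges the same two toy tables (hypothetical values) with each value of `how` and counts the resulting rows. Species `PF` appears only in the left table and `PB` only in the right one.

```python
import pandas as pd

# 'PF' is only in the left table, 'PB' only in the right table
surveys_toy = pd.DataFrame({'record_id': [1, 2, 3],
                            'species_id': ['NL', 'DM', 'PF']})
species_toy = pd.DataFrame({'species_id': ['DM', 'NL', 'PB'],
                            'genus': ['Dipodomys', 'Neotoma', 'Chaetodipus']})

# Count the rows produced by each join type
row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    joined = pd.merge(left=surveys_toy, right=species_toy,
                      on='species_id', how=how)
    row_counts[how] = len(joined)

print(row_counts)
```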

## Exercice - Joindre toutes les données
`1`. Créez un nouveau DataFrame tel que tous les
enregistrements de `surveys_df` sont gardés dans une jonction
impliquant les informations correspondantes de `species.csv`.
(3 min.)

## Exercise - Joining all data
`1`. Create a new DataFrame by joining the contents of the
`surveys_df` and `species.csv` tables. Keep all survey records.
(3 min.)

In [None]:
species_df = pd.read_csv('../data/species.csv')

jonc_gauche = pd.merge(
    left=surveys_df, right=species_df, on='species_id', how='left')
jonc_gauche.shape

In [None]:
species_df = pd.read_csv('../data/species.csv')

jonc_gauche = pd.merge(
    left=surveys_df, right=###, on=###, how=###)
jonc_gauche.shape

In [None]:
species_df = pd.read_csv('../data/species.csv')

merged_left = pd.merge(
    left=surveys_df, right=species_df, on='species_id', how='left')
merged_left.shape

In [None]:
species_df = pd.read_csv('../data/species.csv')

merged_left = pd.merge(
    left=surveys_df, right=###, on=###, how=###)
merged_left.shape
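
A left join keeps every survey record exactly once only if each `species_id` appears at most once in the species table. One possible safeguard, sketched here with toy tables (hypothetical values), is the `validate='many_to_one'` option of `pd.merge`, which raises `pandas.errors.MergeError` when the right-hand keys are not unique.

```python
import pandas as pd

surveys_toy = pd.DataFrame({'record_id': [1, 2, 3, 4],
                            'species_id': ['NL', 'DM', 'DM', 'PF']})
species_toy = pd.DataFrame({'species_id': ['DM', 'NL'],
                            'genus': ['Dipodomys', 'Neotoma']})

# Unique keys on the right: the row count of the left table is preserved
merged_toy = pd.merge(left=surveys_toy, right=species_toy,
                      on='species_id', how='left', validate='many_to_one')
same_length = len(merged_toy) == len(surveys_toy)

# A duplicated species entry on the right is reported as an error
species_dup = pd.concat([species_toy, species_toy.iloc[[0]]])
try:
    pd.merge(left=surveys_toy, right=species_dup,
             on='species_id', how='left', validate='many_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(same_length, duplicate_detected)
```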

`2`. Calculez et créez un graphique montrant l'évolution de la
longueur moyenne des arrière-pieds (`'hindfoot_length'`) pour
chaque genre d'espèce (`'genus'`) d'une année à l'autre. (3 min.)

`2`. Calculate and plot the evolution of the average
hindfoot length for each genus from year to year. (3 min.)

In [None]:
longueurs_moyennes = jonc_gauche.groupby(
    ['year', 'genus'])['hindfoot_length'].mean().unstack()
longueurs_moyennes.tail()

In [None]:
longueurs_moyennes = jonc_gauche.###(
    ###)['hindfoot_length']###
longueurs_moyennes.tail()

In [None]:
average_lengths = merged_left.groupby(
    ['year', 'genus'])['hindfoot_length'].mean().unstack()
average_lengths.tail()

In [None]:
average_lengths = merged_left.###(
    ###)['hindfoot_length']###
average_lengths.tail()

In [None]:
longueurs_moyennes.plot(kind='line')

In [None]:
average_lengths.plot(kind='line')

`3`. Calculez et créez un graphique (*bar-plot*) montrant
le poids moyen selon le sexe pour chaque genre d'espèce.
Pour cet exercice, nous allons utiliser une
table de pivot à la place de `unstack()`.
(2 min.)

`3`. Calculate and create a bar plot showing
the average weight per sex for each genus.
For this exercise, we will use a pivot table instead of `unstack()`.
(2 min.)

In [None]:
poids_par_genre_sexe = jonc_gauche.groupby(
    ['genus', 'sex'])['weight'].mean().reset_index()
poids_par_genre_sexe.tail()

In [None]:
poids_par_genre_sexe = jonc_gauche.groupby(
    ['genus', 'sex'])['weight'].###()#.reset_index()
poids_par_genre_sexe.tail()

In [None]:
weights_by_genus_sex = merged_left.groupby(
    ['genus', 'sex'])['weight'].mean().reset_index()
weights_by_genus_sex.tail()

In [None]:
weights_by_genus_sex = merged_left.groupby(
    ['genus', 'sex'])['weight'].###()#.reset_index()
weights_by_genus_sex.tail()

In [None]:
# Utiliser pivot_table() au lieu de unstack()
pivot_weight_genus_sex = poids_par_genre_sexe.pivot_table(
    values='weight', index='genus', columns='sex')
pivot_weight_genus_sex

In [None]:
# Use pivot_table() instead of unstack()
pivot_weight_genus_sex = weights_by_genus_sex.pivot_table(
    values='weight', index='genus', columns='sex')
pivot_weight_genus_sex

In [None]:
pivot_weight_genus_sex.plot(kind="bar")
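
Since `pivot_table()` aggregates with the mean by default, it can also be applied to the unaggregated rows directly, without a prior `groupby`. The sketch below, using a toy table with hypothetical values, compares that shortcut with the `groupby(...).mean().unstack()` approach.

```python
import pandas as pd

# Toy table with several records per genus and sex (hypothetical values)
toy = pd.DataFrame({
    'genus': ['Dipodomys', 'Dipodomys', 'Dipodomys',
              'Neotoma', 'Neotoma', 'Neotoma'],
    'sex': ['F', 'F', 'M', 'F', 'M', 'M'],
    'weight': [40.0, 42.0, 45.0, 150.0, 160.0, 170.0],
})

# pivot_table groups and averages in a single step
pivot_direct = toy.pivot_table(values='weight', index='genus',
                               columns='sex', aggfunc='mean')

# The same table built with groupby, mean and unstack
via_unstack = toy.groupby(['genus', 'sex'])['weight'].mean().unstack()

print(pivot_direct)
```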

## Résumé technique
* **Concaténer** des DataFrames avec `pandas.concat()`
  * Requiert une liste de DataFrames
  * Verticalement si `axis='index'` (par défaut)
  * Horizontalement si `axis='columns'`
  * Réinitialiser l'index au besoin : `reset_index(drop=True)`
* **Joindre** des DataFrames avec `pandas.merge()`
  * `left=`, `right=` : les deux DataFrames à joindre
  * `left_on=`, `right_on=` : les clés de jonction de chaque DataFrame
  * `on=` : clés de jonction communes aux deux DataFrames
  * `how=` : `'inner'` (défaut), `'left'`, `'right'`, `'outer'`
* **Table de pivot** : `pivot_table()`
  * `values=colX`
  * `index=[col_ind]`
  * `columns=[categorie1, categorie2]`
  * `aggfunc='mean'` (défaut: moyenne)

## Technical Summary
* **Concatenate** DataFrames with `pandas.concat()`
  * Requires a list of DataFrames
  * Vertically if `axis='index'` (by default)
  * Horizontally if `axis='columns'`
  * Reset the index if needed: `reset_index(drop=True)`
* **Join** DataFrames with `pandas.merge()`
  * `left=`, `right=`: both DataFrames to join
  * `left_on=`, `right_on=`: join key for each DataFrame
  * `on=`: join key for both DataFrames
  * `how=`: `'inner'` (default), `'left'`, `'right'`, `'outer'`
* **Pivot table**: `pivot_table()`
  * `values=colX`
  * `index=[col_ind]`
  * `columns=[category1, category2]`
  * `aggfunc='mean'` (default: mean)
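
The summary mentions horizontal concatenation with `axis='columns'`, which the cells above do not demonstrate. As a minimal sketch with toy tables (hypothetical values), the example below places two DataFrames side by side; rows are aligned on their index labels, and labels missing from one table produce missing values.

```python
import pandas as pd

# Two toy tables that share only the index label 'b'
weights = pd.DataFrame({'weight': [40.0, 44.0]}, index=['a', 'b'])
lengths = pd.DataFrame({'hindfoot_length': [32.0, 36.0]}, index=['b', 'c'])

# Place the tables side by side, aligning rows on the index labels
side_by_side = pd.concat([weights, lengths], axis='columns')
missing_count = int(side_by_side.isna().sum().sum())

print(side_by_side)
```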