# Exam pandas, with answers

## Instructions

For this exam we use the dataset we explored in class about the given names of French babies over the period 1900-2018. The documentation for this dataset is online at https://www.insee.fr/fr/statistiques/2540004 and the dataset is downloaded to `../data/prenoms-fr-1900-2018.csv.zip`.

For your convenience, this notebook is partially populated with code for loading and cleaning the dataset, and a sample of the dataset is also displayed: **you only have to focus on answering the questions.**

## Download the dataset

In [2]:
%%bash

force_download=false  # Set to 'true' to force download
dataset_location='https://www.insee.fr/fr/statistiques/fichier/2540004/dpt2018_csv.zip'
dataset_file_name='../data/prenoms-fr-1900-2018.csv.zip'

if [[ ${force_download} == true || ! -e ${dataset_file_name} ]]; then
   curl --silent -L --output ${dataset_file_name} ${dataset_location}
fi

ls -lh ${dataset_file_name}

-rw-r--r--  1 fabio  staff    12M Feb 13 11:00 ../data/prenoms-fr-1900-2018.csv.zip


## What you are expected to do

Before you start answering the questions, execute the cells of the notebook that are already populated, one by one. For each question you are given one (or more) prepared cells **in which to add your own code**. These cells contain some variables that you need to initialize.

## &#9888; **ATTENTION**  &#9888; **ATTENTION**  &#9888; **ATTENTION**  &#9888;

When you are done answering the questions, please download the notebook file (extension `.ipynb`) to your personal computer and **send that file by e-mail to the address written on the whiteboard**. That notebook is what we will use to give a score for your work.

----------------
## Load the dataset

In [3]:
import pandas as pd

In [6]:
# Load the data. Its fields are separated by ';'.
# We ask pandas to interpret the columns 'annais' and 'dpt' as strings to avoid errors
# with missing values, which this file encodes as 'XXXX' and 'XX'
df = pd.read_csv('../data/prenoms-fr-1900-2018.csv.zip', sep=';', dtype={'annais':str, 'dpt':str})
rows, cols = df.shape
print(f'This dataset contains {rows:,} rows and {cols} columns')

This dataset contains 3,624,994 rows and 5 columns
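
To illustrate why the two columns are read as strings, here is a minimal sketch on a few made-up rows laid out like the INSEE file. The rows and the names `sample_csv` and `sample` are assumptions for illustration only. Only the `sep` and `dtype` arguments mirror the cell above.

```python
import io
import pandas as pd

# Made-up rows in the same ';'-separated layout as the real file (illustrative only)
sample_csv = io.StringIO(
    "sexe;preusuel;annais;dpt;nombre\n"
    "1;JEAN;1950;01;120\n"
    "2;MARIE;XXXX;XX;15\n"
    "2;MARIE;1950;75;40\n"
)

# Reading 'annais' and 'dpt' as strings keeps the placeholders 'XXXX' and 'XX'
# as plain text instead of forcing pandas to guess a type for a mixed column
sample = pd.read_csv(sample_csv, sep=';', dtype={'annais': str, 'dpt': str})

print(sample['annais'].tolist())
print(sample['dpt'].tolist())
```

Reading the columns as text also means the placeholder rows can later be dropped with a plain string comparison.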


## Clean the dataset

In [7]:
# Rename some columns to use more meaningful names
df = df.rename(columns={
    'sexe':      'sex',
    'preusuel':  'name',
    'annais':    'year',
    'dpt':       'department',
    'nombre':    'count'})

# Drop rows with a missing department ('XX') or year ('XXXX'), and the special name '_PRENOMS_RARES'
df.drop(df[df['department'] == 'XX'].index, inplace=True)
df.drop(df[df['year'] == 'XXXX'].index, inplace=True)
df.drop(df[df['name'] == '_PRENOMS_RARES'].index, inplace=True)

# Convert column 'year' to numeric values
df['year'] = pd.to_numeric(df['year'])
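
The same cleaning can also be written with a single boolean mask instead of three in-place drops. The sketch below uses a made-up toy dataframe; the names `toy`, `keep` and `clean` are hypothetical. It is an alternative formulation, not the method used in this notebook.

```python
import pandas as pd

# Made-up rows using the renamed columns (illustrative only)
toy = pd.DataFrame({
    'name':       ['JEAN', 'MARIE', '_PRENOMS_RARES', 'MARIE'],
    'year':       ['1950', 'XXXX', '1950', '1951'],
    'department': ['75', 'XX', '75', '13'],
    'count':      [120, 15, 300, 40],
})

# A row is kept only if it has a known department, a known year and a real name
keep = (
    (toy['department'] != 'XX')
    & (toy['year'] != 'XXXX')
    & (toy['name'] != '_PRENOMS_RARES')
)

# .copy() gives an independent dataframe, so the assignment below is unambiguous
clean = toy[keep].copy()
clean['year'] = pd.to_numeric(clean['year'])

print(clean)
```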

## Display a sample

In [8]:
df.sample(8)

Unnamed: 0,sex,name,year,department,count
2089428,2,CLAUDETTE,1942,72,31
829384,1,JOAN,2004,31,6
1127738,1,MAXIME,1913,30,6
3146001,2,MYLÈNE,1968,38,8
353566,1,DENIS,1976,8,8
1895863,2,ASSYA,2013,94,9
2119282,2,COLETTE,1920,73,5
3129951,2,MONIQUE,1936,12,46


## Subset the dataframe for convenience

In [9]:
# In this dataset, the sex is represented as 1 for males and 2 for females
# For convenience, create two subsets of the dataframe: one for boys and one for girls
is_boy  = df['sex'] == 1
is_girl = df['sex'] == 2

boys, girls = df[is_boy], df[is_girl]

In [19]:
boys.head(8)

Unnamed: 0,sex,name,year,department,count
3,1,AADIL,1983,84,3
4,1,AADIL,1992,92,3
6,1,AAHIL,2016,95,3
9,1,AARON,1962,75,3
10,1,AARON,1976,75,3
11,1,AARON,1982,75,3
12,1,AARON,1984,75,3
13,1,AARON,1985,75,5


In [21]:
girls.sample(8)

Unnamed: 0,sex,name,year,department,count
1919812,2,AURÉLIE,1998,37,5
2025124,2,CHARLENE,2012,13,5
3330899,2,REINE,1928,24,33
3510149,2,TALI,1996,75,4
2880283,2,MAEVA,1989,3,4
3170506,2,NADEGE,1983,92,26
3503398,2,SYLVIE,1963,3,195
2346150,2,FANNY,2011,60,3


## Questions 1 & 2:

**1)** Determine the year when the largest number of girls named `'MARIE'` were born. How many girls were named `'MARIE'` that particular year?

In [12]:
# Group the rows for 'MARIE' by year and, for each year, sum the column 'count' over all the departments
is_marie = girls['name'] == "MARIE"
maries_per_year = girls[is_marie].groupby(['year'])['count'].sum()

year = maries_per_year.idxmax()
count_maries = maries_per_year.max()

print(f'The year with largest number of girls named MARIE was {year}: there were {count_maries:,} of them')

The year with largest number of girls named MARIE was 1901: there were 52,149 of them
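
The group-then-sum pattern used above can be checked by hand on a tiny made-up table. The data and the names `toy`, `per_year`, `best_year` and `best_count` are assumptions for illustration.

```python
import pandas as pd

# Made-up counts: two departments per year for MARIE, plus one unrelated name
toy = pd.DataFrame({
    'name':       ['MARIE', 'MARIE', 'MARIE', 'MARIE', 'JEANNE'],
    'year':       [1901, 1901, 1902, 1902, 1901],
    'department': ['75', '13', '75', '13', '75'],
    'count':      [30, 25, 20, 10, 99],
})

# Sum over departments within each year, for MARIE only
per_year = toy[toy['name'] == 'MARIE'].groupby('year')['count'].sum()

best_year = per_year.idxmax()   # index label (the year) of the largest sum
best_count = per_year.max()     # the largest sum itself

print(best_year, best_count)  # 1901 55
```

`idxmax()` returns the index label, here the year, while `max()` returns the value, which is why the answer needs both calls.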


**2)** What **percentage** of all the girls born that year were named `'MARIE'`?

In [13]:
# Count the total number of girls born in the year computed in the previous question
total_girls = girls[girls['year'] == year]['count'].sum()

# Compute the percentage of MARIEs over the total number of girls born that year
percent_maries = (count_maries * 100) / total_girls

print(f'{percent_maries:.0f}% of the girls born in {year} were named MARIE')

21% of the girls born in 1901 were named MARIE
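
The same percentage can be computed for every year at once by dividing two per-year series, since pandas aligns them on their shared `year` index. This is a sketch on made-up numbers; `toy`, `totals`, `maries` and `share` are hypothetical names.

```python
import pandas as pd

# Made-up counts for two names over two years (illustrative only)
toy = pd.DataFrame({
    'name':  ['MARIE', 'JEANNE', 'MARIE', 'JEANNE'],
    'year':  [1901, 1901, 1902, 1902],
    'count': [25, 75, 10, 90],
})

# Total births per year, and MARIE births per year
totals = toy.groupby('year')['count'].sum()
maries = toy[toy['name'] == 'MARIE'].groupby('year')['count'].sum()

# Element-wise division: both series are indexed by year, so rows match up by label
share = 100 * maries / totals

print(share)
```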


## Questions 3 & 4:

**3)** Determine the most popular name for boys and for girls for the whole period included in the dataset.

In [14]:
# Group the boys by name and sum the values of the column 'count' over all values of
# the columns 'year' and 'department'. Then get the index (i.e. the name) of the
# maximum of that sum
top_boys  = boys.groupby(['name'])['count'].sum().idxmax()

# Idem for girls
top_girls = girls.groupby(['name'])['count'].sum().idxmax()

print(f'The most popular names over the period 1900-2018 are {top_girls} and {top_boys}')

The most popular names over the period 1900-2018 are MARIE and JEAN
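
When a ranking is more useful than the single top name, `Series.nlargest` can be applied to the same per-name sums. This is a sketch on made-up counts; `toy`, `totals` and `top2` are hypothetical names.

```python
import pandas as pd

# Made-up counts spread over several rows per name (illustrative only)
toy = pd.DataFrame({
    'name':  ['JEAN', 'PIERRE', 'JEAN', 'MICHEL', 'PIERRE'],
    'count': [50, 40, 30, 20, 5],
})

# Total per name, as in the answer above
totals = toy.groupby('name')['count'].sum()

# The two names with the largest totals, sorted in decreasing order
top2 = totals.nlargest(2)

print(top2)
```

The first label of `top2` is the same name that `idxmax()` returns.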


**4)** Determine the most popular name among the girls who are aged 20 years or less in 2019.

In [15]:
# Girls aged 20 years or less in 2019 were born in 1999 or later
girls_20y_or_less = girls[girls['year'] >= 1999]

# Same method used in the previous question
top_girl_up_to_20years = girls_20y_or_less.groupby(['name'])['count'].sum().idxmax()

print(f'The most popular name for girls aged 20 years or less in 2019 is {top_girl_up_to_20years}')

The most popular name for girls aged 20 years or less in 2019 is EMMA


## Question 5:

Answer `True` or `False` to the question below:

*Among the girls born in 1970, were there more named `"ISABELLE"` than `"BRIGITTE"`?*

In [16]:
# Select the girls born in 1970
girls_1970 = girls[girls['year'] == 1970]

# Group those girls by their given name, and for each group sum the values of the column 'count'
girls_per_name_1970 = girls_1970.groupby(['name'])['count'].sum()

# Select the rows for ISABELLE and BRIGITTE
isabelles_1970 = girls_per_name_1970.loc['ISABELLE']
brigittes_1970 = girls_per_name_1970.loc['BRIGITTE']

print(f'{isabelles_1970 > brigittes_1970}: in 1970 {isabelles_1970:,} girls were named ISABELLE and {brigittes_1970:,} were named BRIGITTE')

True: in 1970 16,543 girls were named ISABELLE and 1,837 were named BRIGITTE
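
Note that `.loc['SOME_NAME']` raises a `KeyError` when the name does not occur in the selected year. A defensive variant is sketched below, assuming that an absent name should count as zero births. The helper `count_for` and the numbers are hypothetical.

```python
import pandas as pd

# Made-up per-name totals for one year (illustrative only)
per_name = pd.Series({'ISABELLE': 120, 'BRIGITTE': 45}, name='count')

def count_for(series, name):
    """Return the count for `name`, or 0 if the name is absent from the index."""
    return int(series.get(name, 0))

isabelles = count_for(per_name, 'ISABELLE')
brigittes = count_for(per_name, 'BRIGITTE')
absent = count_for(per_name, 'ZOE')  # not in the index, so the default 0 is used

print(isabelles > brigittes)  # True
print(absent)                 # 0
```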
