# Exam pandas, with answers

## Instructions

For this exam we use the dataset we explored in class about the given names of French babies over the period 1900-2017. The documentation about this dataset is online at https://www.insee.fr/fr/statistiques/2540004 and the dataset is at `./data/prenoms-fr-1900-2017.tsv.gz`.

For your convenience, this notebook is partially populated with code for loading and cleaning the dataset. A sample of the dataset is also displayed: **you have to focus on answering the questions.**

## What you are expected to do

Execute the notebook cells that are already populated before you start answering the questions one by one. For each question, one or more cells are already prepared for you **to add your own code**. In those cells you will find some variables that you need to initialize.

## &#9888; **ATTENTION**  &#9888; **ATTENTION**  &#9888; **ATTENTION**  &#9888;

When you are done answering the questions, please download the notebook file (extension `.ipynb`) to your personal computer and **send that file by e-mail to the address written on the whiteboard**. That notebook is what we will use to give a score for your work.

----------------
## Load the dataset

In [1]:
import pandas as pd

In [2]:
# Load the new dataset. Its fields are separated by tabs.
# We ask pandas to read the columns 'annais' and 'dpt' as strings, because missing
# values in those columns are encoded as 'XXXX' and 'XX'
df = pd.read_csv('./data/prenoms-fr-1900-2017.tsv.gz', sep='\t', dtype={'annais':str, 'dpt':str})
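
To see what the `dtype` argument does, here is a minimal sketch on a made-up, two-row extract that uses the same column names as the real file. The rows and the name `toy_tsv` are hypothetical and only serve the illustration.

```python
import io
import pandas as pd

# Hypothetical miniature of the TSV file: made-up rows, same column names
toy_tsv = (
    "sexe\tpreusuel\tannais\tdpt\tnombre\n"
    "1\tJEAN\t1900\t01\t12\n"
    "2\tMARIE\tXXXX\tXX\t7\n"
)

toy = pd.read_csv(io.StringIO(toy_tsv), sep="\t", dtype={"annais": str, "dpt": str})

# Both columns are kept as text, so the placeholders and the leading zero survive
print(toy["annais"].tolist())  # ['1900', 'XXXX']
print(toy["dpt"].tolist())     # ['01', 'XX']
```

Reading these columns as text first lets the cleaning step decide explicitly what to do with the placeholder rows before converting to numbers.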

## Clean the dataset

In [3]:
# Rename some columns to use more meaningful names
df = df.rename(columns={
    'sexe':      'sex',
    'preusuel':  'name',
    'annais':    'year',
    'dpt':       'department',
    'nombre':    'count'})

# Drop rows with a missing 'department' or 'year', and rows with the special name '_PRENOMS_RARES'
df.drop(df[df['department'] == 'XX'].index, inplace=True)
df.drop(df[df['year'] == 'XXXX'].index, inplace=True)
df.drop(df[df['name'] == '_PRENOMS_RARES'].index, inplace=True)

# Convert columns 'department' and 'year' to numeric values
df['department'] = pd.to_numeric(df['department'])
df['year']       = pd.to_numeric(df['year'])
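
The same cleaning can also be written with a single boolean mask instead of three in-place `drop` calls. The sketch below is one possible alternative, shown on made-up rows; the names `toy`, `keep` and `clean` are hypothetical.

```python
import pandas as pd

# Made-up rows that mimic the renamed columns, for illustration only
toy = pd.DataFrame({
    "sex":        [1, 2, 2, 1],
    "name":       ["JEAN", "MARIE", "_PRENOMS_RARES", "PAUL"],
    "year":       ["1900", "XXXX", "1901", "1902"],
    "department": ["01", "75", "13", "XX"],
    "count":      [12, 7, 30, 4],
})

# Keep only the rows where none of the three placeholders appears
keep = (
    (toy["department"] != "XX")
    & (toy["year"] != "XXXX")
    & (toy["name"] != "_PRENOMS_RARES")
)
clean = toy[keep].copy()

# The remaining strings can now be converted safely
clean["department"] = pd.to_numeric(clean["department"])
clean["year"] = pd.to_numeric(clean["year"])

print(clean)
```

Only the `JEAN` row survives in this toy example, since each of the other rows carries one of the placeholders.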

## Display a sample

In [4]:
df.sample(8)

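|  | sex | name | year | department | count |
|---|---|---|---|---|---|
| 3175289 | 2 | NOEMIE | 1930 | 46 | 4 |
| 579417 | 1 | GEORGES | 1918 | 8 | 22 |
| 2782150 | 2 | LUDIVINE | 1981 | 24 | 8 |
| 1732050 | 2 | ALYSON | 1991 | 77 | 5 |
| 1530421 | 1 | TRISTAN | 1984 | 31 | 3 |
| 262152 | 1 | CHARLES | 1966 | 68 | 5 |
| 784876 | 1 | JEAN-PAUL | 1948 | 73 | 34 |
| 2315145 | 2 | FATIHA | 1977 | 33 | 8 |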


## Subset the dataframe for convenience

In [5]:
# In this dataset, the sex is represented as 1 for males and 2 for females
MALE, FEMALE = 1, 2

# Create two subsets of the dataframe: one for boys and one for girls
is_boy  = df['sex'] == MALE
is_girl = df['sex'] == FEMALE

boys, girls = df[is_boy], df[is_girl]
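
A quick sanity check of this kind of split can be sketched as follows, on made-up rows. It verifies that the two masks do not overlap, that together they cover every row, and that a subset obtained by boolean indexing (here with an explicit `.copy()`) can be modified without touching the original dataframe. All `toy_*` names are hypothetical.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "sex":   [1, 2, 2, 1],
    "name":  ["JEAN", "MARIE", "JEANNE", "PAUL"],
    "count": [12, 7, 30, 4],
})
MALE, FEMALE = 1, 2
toy_is_boy = toy["sex"] == MALE
toy_is_girl = toy["sex"] == FEMALE

# No row matches both masks, and every row matches one of them
no_overlap = not (toy_is_boy & toy_is_girl).any()
full_cover = bool((toy_is_boy | toy_is_girl).all())

# Modifying an explicit copy of the subset leaves the original untouched
toy_boys = toy[toy_is_boy].copy()
toy_boys["count"] = 0
original_untouched = toy["count"].tolist() == [12, 7, 30, 4]

print(no_overlap, full_cover, original_untouched)
```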

In [6]:
boys.sample(5)

|  | sex | name | year | department | count |
|---|---|---|---|---|---|
| 86879 | 1 | ALPHONSE | 1947 | 972 | 5 |
| 776019 | 1 | JEAN-MARIE | 1966 | 69 | 19 |
| 633121 | 1 | GWENDAL | 1987 | 75 | 4 |
| 1264103 | 1 | PETER | 1985 | 28 | 5 |
| 1355678 | 1 | RODOLPHE | 1972 | 14 | 9 |


In [7]:
girls.sample(5)

|  | sex | name | year | department | count |
|---|---|---|---|---|---|
| 2447055 | 2 | HENRIETTE | 1907 | 50 | 18 |
| 2169335 | 2 | DOMINIQUE | 1961 | 40 | 25 |
| 2100017 | 2 | CORALIE | 1994 | 89 | 15 |
| 2282084 | 2 | ESTHER | 1960 | 971 | 10 |
| 3191476 | 2 | OCÉANE | 1991 | 86 | 7 |


## Questions 1 & 2:

**1)** Determine the year when the largest number of girls named `'MARIE'` were born. How many girls were named `'MARIE'` that particular year?

In [8]:
# Group the 'MARIE' rows per year and, for each group, sum the column 'count' over all the departments
maries_per_year = girls[girls['name'] == "MARIE"].groupby(['year'])['count'].sum()

year = maries_per_year.idxmax()
count_maries =  maries_per_year.max()

print(f'The year with the largest number of girls named MARIE was {year}: there were {count_maries:,} of them')

The year with the largest number of girls named MARIE was 1901: there were 52,167 of them
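
The group-then-`idxmax` pattern is easier to follow on a handful of rows. The sketch below uses made-up counts (not taken from the dataset) where the per-year totals can be checked by hand.

```python
import pandas as pd

# Made-up rows: MARIE appears once in 1900 and in two departments in 1901
toy_girls = pd.DataFrame({
    "name":       ["MARIE", "MARIE", "MARIE", "JEANNE"],
    "year":       [1900, 1901, 1901, 1901],
    "department": [1, 1, 75, 75],
    "count":      [10, 8, 9, 20],
})

# Sum over departments: 1900 -> 10, 1901 -> 8 + 9 = 17
toy_maries_per_year = (
    toy_girls[toy_girls["name"] == "MARIE"].groupby("year")["count"].sum()
)

best_year = toy_maries_per_year.idxmax()   # index label (the year) of the maximum
best_count = toy_maries_per_year.max()     # the maximum itself

print(best_year, best_count)  # 1901 17
```

`idxmax()` returns the index label, which is the year here because the series is indexed by `year` after the `groupby`.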


**2)** What **percentage** of all the girls born that year were named `'MARIE'`?

In [9]:
# Count the total number of girls born in the year computed in the previous question
total_girls = girls[girls['year'] == year]['count'].sum()

# Compute the percentage of MARIEs among all the girls born that year
percent_maries = (count_maries * 100) / total_girls

print(f'{percent_maries:.0f}% of the girls born in {year} were named MARIE')

21% of the girls born in 1901 were named MARIE
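
The same percentage can be computed for every year at once by dividing two series that share the `year` index. This is a sketch on made-up counts, not the exam's expected answer; the `toy_*` names are hypothetical.

```python
import pandas as pd

# Made-up rows, for illustration only
toy_girls = pd.DataFrame({
    "name":  ["MARIE", "MARIE", "MARIE", "JEANNE"],
    "year":  [1900, 1901, 1901, 1901],
    "count": [10, 8, 9, 20],
})

toy_maries = toy_girls[toy_girls["name"] == "MARIE"].groupby("year")["count"].sum()
toy_totals = toy_girls.groupby("year")["count"].sum()

# pandas aligns both series on the year before dividing
toy_share = 100 * toy_maries / toy_totals

print(toy_share.round(1))
```

In this toy data, 1901 has 17 MARIEs out of 37 girls, so the share for that year is 100 × 17 / 37.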


## Questions 3 & 4:

**3)** Determine the most popular name for boys and for girls for the whole period included in the dataset.

In [10]:
# Group the boys by name and sum the column 'count' over all the values of
# the columns 'year' and 'department'. Then get the index of the maximum
# of that sum
top_boys  = boys.groupby(['name'])['count'].sum().idxmax()

# Same for girls
top_girls = girls.groupby(['name'])['count'].sum().idxmax()

print(f'The most popular names over the period 1900-2017 are {top_girls} and {top_boys}')

The most popular names over the period 1900-2017 are MARIE and JEAN
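
`idxmax()` only reports a single name. When it is useful to see the runners-up as well, `Series.nlargest` is one option. A minimal sketch on made-up counts:

```python
import pandas as pd

# Made-up rows, for illustration only
toy_boys = pd.DataFrame({
    "name":  ["JEAN", "JEAN", "PIERRE", "LOUIS"],
    "year":  [1900, 1901, 1900, 1901],
    "count": [10, 5, 12, 3],
})

# Totals per name: JEAN 15, LOUIS 3, PIERRE 12
toy_totals = toy_boys.groupby("name")["count"].sum()

top_name = toy_totals.idxmax()   # 'JEAN'
top2 = toy_totals.nlargest(2)    # JEAN 15, then PIERRE 12

print(top_name)
print(top2)
```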


**4)** Determine the most popular name among the girls who are aged 20 years or less in 2019.

In [11]:
# Girls aged 20 years or less in 2019 were born in 1999 or later
girls_20y_or_less = girls[girls['year'] >= 1999]

# Same method used in the previous question
top_girl_up_to_20years = girls_20y_or_less.groupby(['name'])['count'].sum().idxmax()

print(f'The most popular name for girls aged 20 years or less in 2019 is {top_girl_up_to_20years}')

The most popular name for girls aged 20 years or less in 2019 is LÉA
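
The 1999 cutoff comes from subtracting the maximum age from the reference year. The sketch below makes that arithmetic explicit on made-up rows, using the same calendar-year convention as the cell above; `REFERENCE_YEAR`, `MAX_AGE` and the `toy_*` names are hypothetical.

```python
import pandas as pd

REFERENCE_YEAR = 2019
MAX_AGE = 20
first_birth_year = REFERENCE_YEAR - MAX_AGE  # 2019 - 20 = 1999

# Made-up rows, for illustration only
toy_girls = pd.DataFrame({
    "name":  ["LÉA", "EMMA", "LÉA", "MARIE"],
    "year":  [1999, 2005, 2010, 1998],
    "count": [9, 8, 6, 50],
})

# The 1998 row is excluded by the cutoff, however large its count
toy_recent = toy_girls[toy_girls["year"] >= first_birth_year]
toy_top_recent = toy_recent.groupby("name")["count"].sum().idxmax()

print(first_birth_year, toy_top_recent)  # 1999 LÉA
```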


## Question 5:

Answer `True` or `False` to the question below:

*Among the girls born in 1970, were there more named `"ISABELLE"` than `"BRIGITTE"`?*

In [14]:
# Select the girls born in 1970
girls_1970 = girls[girls['year'] == 1970]

# Group those girls by their given name, and for each group sum the values of the column 'count'
girls_per_name_1970 = girls_1970.groupby(['name'])['count'].sum()

# Select the totals for ISABELLE and BRIGITTE
isabelles_1970 = girls_per_name_1970.loc['ISABELLE']
brigittes_1970 = girls_per_name_1970.loc['BRIGITTE']

print(f'{isabelles_1970 > brigittes_1970}: in 1970 {isabelles_1970:,} girls were named ISABELLE and {brigittes_1970:,} were named BRIGITTE')

True: in 1970 16,543 girls were named ISABELLE and 1,837 were named BRIGITTE
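
Note that `.loc` raises a `KeyError` when the requested name does not appear in the grouped series. If a name might be absent for a given year, `Series.get` with a default of 0 is one way to guard against that. A minimal sketch with made-up totals:

```python
import pandas as pd

# Made-up per-name totals, for illustration only
toy_per_name = pd.Series({"BRIGITTE": 12, "ISABELLE": 30}, name="count")

isabelles = toy_per_name.get("ISABELLE", 0)  # present: returns the stored total
zoes = toy_per_name.get("ZOE", 0)            # absent: returns the default

print(isabelles > zoes, isabelles, zoes)  # True 30 0
```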
