# Exam pandas

## Instructions

This exam uses the dataset we explored in class: the given names of babies born in France over the period 1900-2017. The dataset is documented online at https://www.insee.fr/fr/statistiques/2540004 and the file itself is at `./data/prenoms-fr-1900-2017.tsv.gz`.

For your convenience, this notebook already contains the code that loads and cleans the dataset, and it displays a sample of the data, so **you can focus on answering the questions.**

## What you are expected to do

Before you start answering the questions, execute the cells of the notebook that are already populated, one by one. Each question comes with one or more cells prepared for you **to add your own code**. In these cells you will find some variables that you need to initialize.

## ⚠ **ATTENTION** ⚠ **ATTENTION** ⚠ **ATTENTION** ⚠

When you have finished answering the questions, please download the notebook file (extension `.ipynb`) to your personal computer and **send that file by e-mail to the address written on the whiteboard**. That notebook is what we will use to grade your work.

----------------
## Load the dataset

In [1]:
import pandas as pd

In [2]:
# Load the dataset. Its fields are separated by tabs.
# We ask pandas to read the columns 'annais' and 'dpt' as strings to avoid errors
# with missing values, which these columns encode as 'XXXX' and 'XX'
df = pd.read_csv('./data/prenoms-fr-1900-2017.tsv.gz', sep='\t', dtype={'annais':str, 'dpt':str})

## Clean the dataset

In [3]:
# Rename some columns to use more meaningful names
df = df.rename(columns={
    'sexe':      'sex',
    'preusuel':  'name',
    'annais':    'year',
    'dpt':       'department',
    'nombre':    'count'})

# Drop rows whose 'department' or 'year' is missing (coded 'XX' and 'XXXX'),
# and rows with the special name '_PRENOMS_RARES' (rare names)
df.drop(df[df['department'] == 'XX'].index, inplace=True)
df.drop(df[df['year'] == 'XXXX'].index, inplace=True)
df.drop(df[df['name'] == '_PRENOMS_RARES'].index, inplace=True)

# Convert columns 'department' and 'year' to numeric values
df['department'] = pd.to_numeric(df['department'])
df['year']       = pd.to_numeric(df['year'])
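
The loading and cleaning steps above can be tried out on a few rows before running them on the full file. The sketch below is only an illustration: the four rows in `raw` are made up, the names `raw`, `sample`, `keep` and `clean` are hypothetical, and it selects the rows to keep with a single boolean mask rather than the three `drop` calls used in the notebook.

```python
import io

import pandas as pd

# Hypothetical four-row extract in the same tab-separated layout as the real file.
# 'XX' and 'XXXX' stand for a missing department and a missing year.
raw = (
    "sexe\tpreusuel\tannais\tdpt\tnombre\n"
    "1\tPAUL\t1950\t75\t12\n"
    "2\tANNE\tXXXX\tXX\t4\n"
    "2\t_PRENOMS_RARES\t1950\t75\t30\n"
    "2\tANNE\t1951\t01\t7\n"
)

# Read 'annais' and 'dpt' as strings so the placeholders can be compared as text
sample = pd.read_csv(io.StringIO(raw), sep="\t", dtype={"annais": str, "dpt": str})

sample = sample.rename(columns={
    "sexe": "sex",
    "preusuel": "name",
    "annais": "year",
    "dpt": "department",
    "nombre": "count",
})

# Keep only the rows with a known department, a known year and a regular name
keep = (
    (sample["department"] != "XX")
    & (sample["year"] != "XXXX")
    & (sample["name"] != "_PRENOMS_RARES")
)
clean = sample[keep].copy()

# With the placeholders gone, both columns convert cleanly to integers
clean["department"] = pd.to_numeric(clean["department"])
clean["year"] = pd.to_numeric(clean["year"])

print(clean)
```

Note that the conversion turns the department string `'01'` into the integer `1`, which is why single-digit departments appear without a leading zero in the samples.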

## Display a sample

In [4]:
df.sample(8)

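| | sex | name | year | department | count |
|---|---|---|---|---|---|
| 2245074 | 2 | EMELINE | 2011 | 972 | 5 |
| 2923566 | 2 | MARIE-CLAUDE | 1938 | 81 | 5 |
| 800737 | 1 | JEREMI | 1985 | 93 | 3 |
| 1942914 | 2 | CAMILLE | 1923 | 8 | 3 |
| 79399 | 1 | ALI | 2016 | 59 | 24 |
| 3260990 | 2 | PRISCILLA | 1986 | 1 | 4 |
| 1576406 | 1 | WILLIAM | 2007 | 31 | 13 |
| 3194379 | 2 | ODETTE | 1906 | 10 | 21 |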


## Subset the dataframe for convenience

In [5]:
# In this dataset, the sex is represented as 1 for males and 2 for females
MALE, FEMALE = 1, 2

# Create two subsets of the dataframe: one for boys and one for girls
# (boolean indexing returns copies, not views)
is_boy  = df['sex'] == MALE
is_girl = df['sex'] == FEMALE

boys, girls = df[is_boy], df[is_girl]

In [6]:
boys.sample(5)

| | sex | name | year | department | count |
|---|---|---|---|---|---|
| 853990 | 1 | JOSUÉ | 2016 | 93 | 4 |
| 1500053 | 1 | THIERRY | 1972 | 3 | 65 |
| 1579391 | 1 | WILSON | 2009 | 60 | 3 |
| 803290 | 1 | JEREMY | 1975 | 79 | 5 |
| 1128920 | 1 | METIN | 1992 | 68 | 4 |


In [7]:
girls.sample(5)

| | sex | name | year | department | count |
|---|---|---|---|---|---|
| 1898100 | 2 | AÏNA | 2017 | 75 | 6 |
| 3020222 | 2 | MARYVONNE | 1981 | 29 | 3 |
| 3346841 | 2 | SAMIA | 1974 | 75 | 36 |
| 1732197 | 2 | ALYSON | 2002 | 31 | 3 |
| 2545972 | 2 | JENNIFER | 1988 | 22 | 23 |


## Questions 1 & 2:

**1)** Determine the year when the largest number of girls named `'MARIE'` were born. How many girls were named `'MARIE'` that particular year?

In [None]:
# Your code goes here

year = ... # Initialize this variable with the year when most girls were named 'MARIE'
count_maries = ... # Initialize this variable with the number of girls named 'MARIE' that particular year

print(f'The year with the largest number of girls named MARIE was {year}: there were {count_maries:,} of them')
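
Each row of the dataset holds a count for one name, one year and one department, so a yearly total has to be summed over departments before years can be compared. The sketch below shows that pattern on a made-up dataframe: the data, the name `'ANNA'` and the variables `toy`, `per_year`, `peak_year` and `peak_count` are all hypothetical, and it is one possible approach rather than the expected answer.

```python
import pandas as pd

# Hypothetical toy data, not taken from the real dataset
toy = pd.DataFrame({
    "name":       ["ANNA", "ANNA", "ANNA", "LEA", "ANNA"],
    "year":       [1950, 1950, 1951, 1951, 1952],
    "department": [1, 2, 1, 1, 1],
    "count":      [10, 15, 30, 99, 5],
})

# Select one name, then sum over departments so each year has a single total
per_year = toy[toy["name"] == "ANNA"].groupby("year")["count"].sum()

peak_year = per_year.idxmax()   # index label (the year) of the largest total
peak_count = per_year.max()     # the largest total itself

print(peak_year, peak_count)
```

`idxmax()` returns the index label rather than a position, which is convenient here because the index of `per_year` is the year.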

**2)** What **percentage** of all the girls born that year were named `'MARIE'`?

In [None]:
# Your code goes here

total_girls = ... # Initialize this variable with the total number of girls born that year

percent_maries = (count_maries * 100) / total_girls

print(f'{percent_maries:.0f}% of the girls born in {year} were named MARIE')

## Questions 3 & 4:

**3)** Determine the most popular boys' name and the most popular girls' name over the whole period covered by the dataset.

In [8]:
# Your code goes here

top_boys  = ... # Initialize this variable with the most popular name for boys
top_girls = ... # Initialize this variable with the most popular name for girls

print(f'The most popular names over the period 1900-2017 are {top_girls} and {top_boys}')


**4)** Determine the most popular name among the girls who are aged 20 years or less in 2019.

In [None]:
# Your code goes here

top_girl_up_to_20years = ... # Initialize this variable with the most popular name for girls aged up to 20 years

print(f'The most popular name for girls aged 20 years or less in 2019 is {top_girl_up_to_20years}')
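
An age limit translates into a lower bound on the birth year, which can then be used as a row filter before aggregating. The sketch below illustrates this on made-up data. It assumes that age is computed as a plain difference of calendar years, so "20 or less in 2019" is read as "born in 1999 or later"; the constants `REFERENCE_YEAR` and `MAX_AGE`, the dataframe `toy` and its values are hypothetical.

```python
import pandas as pd

REFERENCE_YEAR = 2019
MAX_AGE = 20

# Assumption: age = reference year - birth year (birthdays within the year are ignored)
earliest_birth_year = REFERENCE_YEAR - MAX_AGE

# Hypothetical toy data, not taken from the real dataset
toy = pd.DataFrame({
    "name":  ["EMMA", "LEA", "EMMA", "LEA", "LEA"],
    "year":  [1990, 1990, 2005, 2005, 2010],
    "count": [500, 10, 20, 30, 25],
})

# Keep only the births that fall inside the age window, then total by name
recent = toy[toy["year"] >= earliest_birth_year]
totals = recent.groupby("name")["count"].sum()
top_name = totals.idxmax()

# For comparison: the most frequent name when no age filter is applied
top_name_all_years = toy.groupby("name")["count"].sum().idxmax()

print(top_name, top_name_all_years)
```

In this toy example the filter changes the result, which shows why the restriction has to be applied before the totals are computed.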

## Question 5:

Answer `True` or `False` to the question below:

*"Among the girls born in 1970, were there more named `'ISABELLE'` than `'BRIGITTE'`?"*

In [None]:
# Your code goes here

isabelles_1970 = ... # Initialize this variable with the number of girls named 'ISABELLE' born in 1970
brigittes_1970 = ... # Initialize this variable with the number of girls named 'BRIGITTE' born in 1970

print(isabelles_1970 > brigittes_1970)
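
Counting the births for one name in one year amounts to combining two conditions in a single boolean mask and summing the matching rows. The sketch below shows this on made-up data; the helper `births`, the dataframe `toy` and the names `'ALICE'` and `'BERTHE'` are hypothetical and are not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data, not taken from the real dataset
toy = pd.DataFrame({
    "name":       ["ALICE", "ALICE", "BERTHE", "BERTHE", "ALICE"],
    "year":       [1960, 1960, 1960, 1960, 1961],
    "department": [1, 2, 1, 2, 1],
    "count":      [4, 6, 7, 8, 50],
})


def births(frame, name, year):
    """Total count for `name` in `year`, summed over all departments."""
    # Each comparison needs its own parentheses because & binds tighter than ==
    mask = (frame["name"] == name) & (frame["year"] == year)
    return frame.loc[mask, "count"].sum()


alices_1960 = births(toy, "ALICE", 1960)
berthes_1960 = births(toy, "BERTHE", 1960)

print(alices_1960 > berthes_1960)
```

The row for 1961 is excluded by the year condition, so it does not affect the comparison.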