# Exam pandas

## Instructions

For this exam we use the dataset we explored in class, which records the given names of babies born in France over the period 1900-2019. The documentation for this dataset is online at https://www.insee.fr/fr/statistiques/2540004 and the dataset is downloaded to `../data/prenoms-fr-1900-2019.zip`.

For your convenience, this notebook is partially populated with code for loading and cleaning the dataset. A sample of the dataset is also displayed, so **you only have to focus on answering the questions.**

## Download the dataset

In [None]:
import requests
import os

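def download(url, path):
    """Download the file at url and save it locally at path"""
    with requests.get(url, stream=True) as resp:
        # Fail early instead of saving an HTTP error page as the dataset
        resp.raise_for_status()
        mode, data = 'wb', resp.content
        # Save plain-text responses as text, anything else as raw bytes
        if 'text/plain' in resp.headers.get('Content-Type', ''):
            mode, data = 'wt', resp.text
        with open(path, mode) as f:
            f.write(data)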

In [None]:
# Download the dataset if necessary
path = os.path.join('..', 'data', 'prenoms-fr-1900-2019.zip')

if not os.path.isfile(path):
    os.makedirs(os.path.join('..', 'data'), exist_ok=True)
    url = 'https://www.insee.fr/fr/statistiques/fichier/2540004/dpt2019_csv.zip'
    download(url, path)

## What you are expected to do

Before you start answering the questions, execute the already populated cells of the notebook, one by one. For each question, you are provided with one or more cells prepared for you **to add your own code**. These cells contain some variables that you need to initialize.

## ⚠️ **ATTENTION**  ⚠️  **ATTENTION**  ⚠️  **ATTENTION**  ⚠️

When you are done answering the questions, please download the notebook file (extension `.ipynb`) to your personal computer and **send that file by e-mail to the address written on the whiteboard**. It is that notebook that we will use to give a score for your work.

----------------
## Load the dataset

In [None]:
import pandas as pd

In [None]:
# Load the data. Its fields are separated by ';'.
# We ask pandas to interpret the columns 'annais' and 'dpt' as strings to avoid errors with
# missing values
df = pd.read_csv(path, sep=';', dtype={'annais':str, 'dpt':str})
rows, cols = df.shape
print(f'This dataset contains {rows:,} rows and {cols} columns')
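
The effect of the `dtype` argument can be checked on a small in-memory sample. The sketch below uses made-up rows in the same `;`-separated layout (the values are illustrative, not taken from the INSEE file) and assumes that placeholders such as `XXXX` and `XX` stand for missing years and departments. The names `sample_csv` and `sample` are hypothetical.

```python
import io
import pandas as pd

# Made-up sample in the same layout as the dataset (illustrative values only)
sample_csv = (
    "sexe;preusuel;annais;dpt;nombre\n"
    "1;JEAN;1900;01;12\n"
    "2;MARIE;XXXX;XX;7\n"
    "2;MARIE;1901;2A;5\n"
)

sample = pd.read_csv(io.StringIO(sample_csv), sep=';', dtype={'annais': str, 'dpt': str})

# Both columns are read as strings, so '01' keeps its leading zero
# and placeholders such as 'XXXX' are stored unchanged
print(sample['dpt'].tolist())
print(sample['annais'].tolist())
```

Reading these columns as strings postpones the numeric conversion until the placeholder rows have been removed.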

## Clean the dataset

In [None]:
# Rename some columns to use more meaningful names
df = df.rename(columns={
    'sexe':      'sex',
    'preusuel':  'name',
    'annais':    'year',
    'dpt':       'department',
    'nombre':    'count'})

# Drop rows with a missing department or year, and rows with the special name '_PRENOMS_RARES'
df.drop(df[df['department'] == 'XX'].index, inplace=True)
df.drop(df[df['year'] == 'XXXX'].index, inplace=True)
df.drop(df[df['name'] == '_PRENOMS_RARES'].index, inplace=True)

# Convert column 'year' to numeric values
df['year'] = pd.to_numeric(df['year'])
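
The same cleaning can also be expressed as a single boolean filter. The sketch below is an alternative formulation, not the approach used in this notebook, and it runs on a made-up frame so that the exam data is left untouched (`toy`, `keep` and `clean` are hypothetical names).

```python
import pandas as pd

# Made-up rows using the renamed columns (illustrative values only)
toy = pd.DataFrame({
    'sex': [1, 2, 2, 1],
    'name': ['JEAN', 'MARIE', '_PRENOMS_RARES', 'PAUL'],
    'year': ['1900', 'XXXX', '1901', '1902'],
    'department': ['01', '75', '13', 'XX'],
    'count': [12, 7, 30, 4],
})

# Keep only rows with a known department, a known year and a regular name
keep = (
    (toy['department'] != 'XX')
    & (toy['year'] != 'XXXX')
    & (toy['name'] != '_PRENOMS_RARES')
)
clean = toy[keep].copy()

# Once the placeholders are gone, the year can be converted to numbers
clean['year'] = pd.to_numeric(clean['year'])

print(clean)
```

Filtering into a new frame avoids `inplace=True` and leaves the original rows available for inspection.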

## Display a sample

In [None]:
df.head(8)

## Subset the dataframe for convenience

In [None]:
# In this dataset, the sex is represented as 1 for males and 2 for females.
# For convenience, create two subsets of the dataframe: one for boys and one for girls
is_boy  = df['sex'] == 1
is_girl = df['sex'] == 2

boys, girls = df[is_boy], df[is_girl]

In [None]:
boys.sample(5)

In [None]:
girls.sample(5)
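
Assuming each row counts the births for one name, sex, year and department (as the columns suggest), a total for a whole year has to be summed over departments. Here is a minimal sketch of that pattern on made-up data; the names and numbers are illustrative and do not answer any of the questions below.

```python
import pandas as pd

# Made-up rows: one count per (name, year, department)
toy = pd.DataFrame({
    'name': ['ANNE', 'ANNE', 'ANNE', 'LUCIE'],
    'year': [1950, 1950, 1951, 1950],
    'department': ['01', '75', '01', '75'],
    'count': [3, 10, 8, 6],
})

# Sum the per-department counts to get one total per (name, year)
totals = toy.groupby(['name', 'year'])['count'].sum()

# idxmax returns the index label of the largest total, here a (name, year) pair
best_name, best_year = totals.idxmax()
print(best_name, best_year, totals.max())
```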

## Questions 1 & 2:

**1)** Determine the year when the largest number of girls named `'MARIE'` were born. How many girls were named `'MARIE'` that particular year?

In [None]:
# Your code goes here

year = ... # Initialize this variable with the year when most girls were named 'MARIE'
count_maries =  ... # Initialize this variable with the number of girls named 'MARIE' that particular year

print(f'The year with the largest number of girls named MARIE was {year}: there were {count_maries:,} of them')

**2)** What **percentage** of all the girls born that year were named `'MARIE'`?

In [None]:
# Your code goes here

total_girls = ... # Initialize this variable with the total number of girls born that year

percent_maries = (count_maries * 100) / total_girls

print(f'{percent_maries:.0f}% of the girls born in {year} were named MARIE')

## Questions 3 & 4:

**3)** Determine the most popular name for boys and for girls for the whole period included in the dataset.

In [None]:
# Your code goes here

top_boys  = ... # Initialize this variable with the most popular name for boys
top_girls = ... # Initialize this variable with the most popular name for girls

print(f'The most popular names over the period 1900-2019 are {top_girls} and {top_boys}')

**4)** Determine the most popular name among the girls who are aged 20 years or less in 2019.

In [None]:
# Your code goes here

top_girl_up_to_20years = ... # Initialize this variable with the most popular name for girls aged up to 20 years

print(f'The most popular name for girls aged 20 years or less in 2019 is {top_girl_up_to_20years}')

## Question 5:

Answer `True` or `False` to the question below:

*"Among the girls born in 1970, were there more named `'ISABELLE'` than `'BRIGITTE'`?"*

In [None]:
# Your code goes here

isabelles_1970 = ... # Initialize this variable with the number of 'ISABELLE' born in 1970
brigittes_1970 = ... # Initialize this variable with the number of 'BRIGITTE' born in 1970

print(isabelles_1970 > brigittes_1970)