# Homework 1 

Due date : **2023-02-24 @23h55** (this is a **hard deadline**)

## Fill this cell with your names

- Name, First Name, Informatique/Mathématique-Informatique
- Name, First Name, Informatique/Mathématique-Informatique

## Carefully follow instructions

**If you don't: no evaluation!**

Write in English or French

The deliverable is a file

- `xxx_yyy.ipynb` file (jupyter notebook) or 
- `xxx_yyy.py` file (if you are using `jupytext`) or
- `xxx_yyy.qmd` file (if you are using `quarto`)

where `xxx` and `yyy` are your names, for example `lagarde_michard.ipynb`. 

The deliverable is not meant to contain cell outputs.  

The data files used to execute cells are meant to sit in the same directory as the deliverable. Use relative filepaths or urls to denote the data files.   

We **will** execute the code in your notebook: make sure that running all the cells works well. 



## Grading

Here is the way we'll assess your work

| Criterion | Points | Details |
|:----------|:-------:|:----|
|Spelling and syntax | 3 | English/French  |
|Plot correctness | 3 |  Clarity / answers the question  |
|Plot style and cleanliness | 3 | Titles, legends, labels, breaks ... |
|Table wrangling | 4 | ETL, SQL like manipulations |
|Computing Statistics | 5 | SQL `group by`  and aggregation  |
|DRY compliance | 2 | DRY principle at [Wikipedia](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself)|

If we see a single (or more) `for` loop in your code: **-5 points**.  Everything can be done using high-level `pandas` methods.

# Preliminaries

## Notebooks: Modus operandi

- This is a [Jupyter Notebook](https://jupyter.org).
- When you execute code within the notebook, the results appear beneath the code.
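- [Jupytext](https://github.com/mwouts/jupytext) lets you store and edit the notebook as a plain `.py` script.
- [Quarto](https://quarto.org) works with `.qmd` documents that mix text and executable code.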

## Packages

- Base `Python` can do a lot. But the full power of `Python` comes from a fast growing collection of `packages`/`modules`.

- Packages are first installed (that is using `pip install` or `conda install`), and if
needed, imported during a session.

- The `docker` image you are supposed to use already offers a lot of packages. You should not need to install new packages.

- Once a package has been installed on your drive, import it to use it in your session. `from pkg import *` makes all objects exported by the package available under their bare names, at the risk of shadowing names you already use.

- If you just want to pick some objects from the package, use `import pkg` and then qualified names like `pkg.object_name` to access the object (function, dataset, class...).
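
The two import styles can be contrasted on the standard-library `math` module. This is only an illustrative sketch: `math` is chosen because it is always available, and the variable names `area` and `radius` are made up for the example.

```python
# Style 1: import the package and use qualified names.
import math

area = math.pi * 2.0 ** 2          # the origin of `pi` is explicit

# Style 2: import selected objects into the current namespace.
from math import sqrt

radius = sqrt(area / math.pi)      # `sqrt` is used under its bare name

print(round(area, 2), radius)
```

Qualified names are longer to type but make it obvious where each object comes from, which helps when several packages export objects with the same name.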


In [239]:
# importing basic tools
import numpy as np
import pandas as pd

from pandas.api.types import CategoricalDtype

import os            # file operations
import requests      # networking
import zipfile
import io
from pathlib import Path

from datetime import date  # if needed

In [240]:
# importing plotting packages
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio

In [241]:
# make pandas plotly-friendly
np.set_printoptions(precision=2, suppress=True)
%matplotlib inline
pd.options.plotting.backend = "plotly"

# Getting the data

## French data

The French data are built and made available by [INSEE](https://www.insee.fr/fr/accueil) (the French government statistics institute).

Prénoms:
- [https://www.insee.fr/fr/statistiques/fichier/2540004/nat2021_csv.zip](https://www.insee.fr/fr/statistiques/fichier/2540004/nat2021_csv.zip)

This dataset has been growing for a while. It has been considered by
social scientists for decades.  Given names are meant to give insights into a variety
of phenomena, including religious observance.

- A glimpse at the body of work can be found in [_L'archipel français_ by Jérôme Fourquet, Le Seuil, 2019](https://www.seuil.com/ouvrage/l-archipel-francais-jerome-fourquet/9782021406023)

- Read the [File documentation](https://www.insee.fr/fr/statistiques/2540004?sommaire=4767262#documentation)

## US data 

US data may be gathered from 

[Baby Names USA from 1910 to 2021 (SSA)](https://www.kaggle.com/datasets/donkea/ssa-names-1910-2021?resource=download)

See [https://www.ssa.gov/oact/babynames/background.html](https://www.ssa.gov/oact/babynames/background.html)


## British data 

English and Welsh data can be gathered from 

[https://www.ons.gov.uk/](https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/livebirths/datasets/babynamesinenglandandwalesfrom1996?utm_source=pocket_saves)




## Download the French data

**QUESTION:** Download the data into a file whose relative path is `'./nat2021_csv.zip'`

__Hints:__

- Have a look at  package [`requests`](https://requests.readthedocs.io/en/master/).
- Use magic commands to navigate across the file hierarchy and create subdirectories when needed

In [242]:
# for French data 

#dataPath = os.path.realpath(os.getcwd()) + os.sep + 'data' + os.sep   # current working directory of a process + 'data'
dataPath = "." + os.sep + 'data' + os.sep

params = dict(
    furl = 'https://www.insee.fr/fr/statistiques/fichier/2540004/nat2021_csv.zip',
    burl = 'https://www.ons.gov.uk/file?uri=/peoplepopulationandcommunity/birthsdeathsandmarriages/livebirths/datasets/babynamesinenglandandwalesfrom1996/1996tocurrent/babynames1996to2021.xlsx',
    usaurl = 'https://www.ssa.gov/oact/babynames/names.zip',
    dirpath = dataPath,
    timecourse = '',    
    fpath = dataPath + 'nat2021.csv',                   # csv  file in 'data' folder
    bpath = dataPath + 'babynames1996to2021.xlsx',      # xlsx file in 'data' folder
    upath = dataPath + 'usa_names' + os.sep             # a list of txt files in 'data/usa_names' folder
)
    
  # WARNING: this may need to be changed, because the initial pattern was:
  #  url = 'https://www.insee.fr/fr/statistiques/fichier/2540004/nat2021_csv.zip',
  #  dirpath = './',
  #  timecourse = '',
  #  datafile = 'nat2021.hdf',
  #  fpath = 'nat2021_csv.zip'

In [243]:
if not Path(params['fpath']).exists():
    r = requests.get(params['furl'])       
    if (r.status_code == 200):
        z = zipfile.ZipFile(io.BytesIO(r.content))
        z.extractall(path=params['dirpath'])
        print("French data file has been loaded")

    else:
        print("WARNING: download of French data file failed")
else:
    print("French data file already exists")

French data file already exists


## Download US and British data 



In [244]:
if not Path(params['bpath']).exists():
    r = requests.get(params['burl'])
    if (r.status_code == 200):
        # the context manager closes the file even if the write fails
        with open(params['bpath'], 'wb') as f:
            f.write(r.content)
        print("British data file has been loaded")
    else:
        print("WARNING: download of British data file failed")
else:
    print("British data file already exists")

if not Path(params['upath']).exists():
    r = requests.get(params['usaurl'])
    if (r.status_code == 200):
        z = zipfile.ZipFile(io.BytesIO(r.content))
        z.extractall(path=params['upath'])
        print("USA data file has been loaded")
    else:
        print("WARNING: download of USA data file failed")
else:
    print("USA data file already exists")

British data file already exists
USA data file already exists


## Load the French data in memory

**QUESTION:** Load the data in a `pandas` `DataFrame` called `data`

__Hints:__

- You should obtain a `Pandas dataframe` with 4 columns.
- Mind the conventions used to build the `csv` file.
- Package `pandas` provides the convenient tools.
- The dataset, though not too large, is already demanding.
- Don't hesitate to test your methods on a sample of rows: method `sample()` from class `DataFrame` can be helpful.
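
The hints above can be tried on a tiny in-memory file before touching the real one. The sketch below assumes a `;` separator, as used in the loading cell of this notebook; the rows in `raw` are made up and only mimic the layout of the INSEE file.

```python
import io

import pandas as pd

# Made-up rows that mimic the layout of the INSEE file (';' as separator).
raw = "sexe;preusuel;annais;nombre\n1;PAUL;1950;120\n2;MARIE;1950;300\n2;MARIE;XXXX;12\n"

toy = pd.read_csv(io.StringIO(raw), sep=";")

# Try a transformation on a small random sample before running it on every row.
subset = toy.sample(n=2, random_state=0)

print(toy.dtypes)
print(subset)
```

Because one row carries `XXXX` instead of a year, `annais` is read as text rather than as a number, which is what the dtype listing of the real file shows as well.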

In [245]:
# load French data in memory (1900 - 2021)
df_fr = pd.read_csv(params['fpath'], sep=';')
df_fr.sample()

Unnamed: 0,sexe,preusuel,annais,nombre
119048,1,HAZIZ,1985,3


In [246]:
df_fr.loc[df_fr.annais == '2021']

Unnamed: 0,sexe,preusuel,annais,nombre
121,1,_PRENOMS_RARES,2021,27222
184,1,AARISH,2021,3
243,1,AARON,2021,2496
263,1,AARONE,2021,10
279,1,AARONN,2021,14
...,...,...,...,...
686466,2,ZÜMRA,2021,17
686485,2,ZUZANNA,2021,4
686507,2,ZYA,2021,19
686518,2,ZYNA,2021,9


## Load US and British data in memory

In [247]:
# load British data in memory (1996 - 2021)

# used later to count the NaNs
statistics = dict(
    countNullFr = 0,
    linesNullFr = 0,
    countNullBr = 0,
    linesNullBr = 0,
    countNullUSA = 0,
    linesNullUSA = 0
)

# create dataframe for boys
df_boys = pd.read_excel(io = params['bpath'], engine='openpyxl', sheet_name='1', header = 7)

# take only "Name" and "Count <year>" columns and cut the word "Count"
columns =  df_boys.columns.str.contains('Count|Name')
indices = [i for i, col in enumerate(columns) if col]
df_boys = df_boys.iloc[:, indices]
df_boys.columns = [x.replace(' Count', '') for x in df_boys.columns]

# collect all years in one column "Year"
df_boys = df_boys.melt(id_vars='Name', var_name='Year', value_name='Nb')

statistics['countNullBr'] += df_boys['Nb'].value_counts()['[x]']          #For Q:How many missing values (NA) have been introduced?   

# Replace Nb [x] -> NaN, drop lines with NaN and define type
df_boys = (df_boys.replace('[x]', np.nan)
                .dropna()
                .sort_values(['Name','Year'], ascending=[True,False])
                .astype({'Nb': np.int64})
                  # .sort_values(by=['Year', 'Name'])
          )

# add "Gender" column with value = 1 for all rows
df_boys.insert(0, 'Gender', 1)


# create dataframe for girls
df_girls = pd.read_excel(io = params['bpath'], engine='openpyxl', sheet_name='2', header = 7)

# take only "Name" and "Count <year>" columns and cut the word "Count"
columns =  df_girls.columns.str.contains('Count|Name')
indices = [i for i, col in enumerate(columns) if col]
df_girls = df_girls.iloc[:, indices]
df_girls.columns = [x.replace(' Count', '') for x in df_girls.columns]

# collect all years in one column "Year"
df_girls = df_girls.melt(id_vars='Name', var_name='Year', value_name='Nb')

statistics['countNullBr'] += df_girls['Nb'].value_counts()['[x]']          #For Q:How many missing values (NA) have been introduced?                                                                          

# Replace Nb [x] -> NaN, drop lines with NaN and define type
df_girls = (df_girls.replace('[x]', np.nan)
                .dropna()
                .sort_values(['Name','Year'], ascending=[True,False])
                .astype({'Nb': np.int64})
                  # .sort_values(by=['Year', 'Name'])
          )

# add "Gender" column with value = 2 for all rows
df_girls.insert(0, 'Gender', 2)


# concatenate them to one DataFrame
df_br = pd.concat([df_boys, df_girls])

df_br

1
2
1
2


Unnamed: 0,Gender,Name,Year,Nb
0,1,A,2021,5
16777,1,A,2020,4
33554,1,A,2019,10
50331,1,A,2018,4
67108,1,A,2017,4
...,...,...,...,...
87831,2,Zyva,2018,5
109789,2,Zyva,2017,5
153705,2,Zyva,2015,3
285453,2,Zyva,2009,5


In [248]:
# Load USA data in memory  (1880 - 2021)

# paths of the files in the directory; the years are taken from the filenames
pathGenerator = Path(params['upath']).rglob('*.txt')
pathsList = [x for x in pathGenerator]

# create DataFrame from .txt file
def fromTxtToDataFrame(filename, year):
    df = pd.read_csv(filename, sep=',', names = [ 'Name', 'Gender', 'Nb'])
    df.insert(2, 'Year', year)
    return df

# compose list of DataFrames
def doForAll():
    yearsList = [ int( (str(x))[-8:-4] ) for x in pathsList ]
    length = len(yearsList)   
    return [fromTxtToDataFrame(pathsList[i], yearsList[i]) for i in range(length)]

# concatenate all DataFrames in one
df_usa = pd.concat(doForAll(), ignore_index=True)

df_usa = (df_usa.replace({'Gender': {'F': 2, 'M': 1}})
                 # .sort_values(by=['Gender', 'Year', 'Name'])
         )      

df_usa

Unnamed: 0,Name,Gender,Year,Nb
0,Mary,2,1880,7065
1,Anna,2,1880,2604
2,Emma,2,1880,2003
3,Elizabeth,2,1880,1939
4,Minnie,2,1880,1746
...,...,...,...,...
2052776,Zyeire,1,2021,5
2052777,Zyel,1,2021,5
2052778,Zyian,1,2021,5
2052779,Zylar,1,2021,5


## Explore the data

**QUESTION:** Look at the data. Use the attributes `columns`, `dtypes` and the methods `head`, `describe` to get a feeling of the data.

- This dataset is supposed to report all given names used
for either sex during a year in France since 1900

- The file is made of `686 538` lines and 4 columns.

```
|-- preusuel : object
|-- nombre: int64
|-- sexe: int64
|-- annais: object
```

Each row indicates for a given `preusuel` (prénom usuel, given name), `sexe` (sex), and `annais` (année naissance, birthyear) the `nombre` (number) of babies of the given sex who were given that name during the given year.

|sexe    |preusuel     | annais|   nombre|
|:------|:--------|----:|---:|
|2     |SYLVETTE | 1953| 577|
|1   |BOUBOU   | 1979|   4|
|1   |NILS     | 1959|   3|
|2   |NICOLE   | 2003|  36|
|1   |JOSÉLITO | 2013|   4|


**QUESTION:** Compare memory usage and disk space used by data

**Hints:**

- The method `info`  prints a concise summary of a `DataFrame`.
- With optional parameter `memory_usage`, you can get an estimate
of the amount of memory used by the `DataFrame`.
- Beware that the resulting estimate depends on the argument fed.

In [249]:
print(df_fr.info(memory_usage = 'deep'))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 686538 entries, 0 to 686537
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   sexe      686538 non-null  int64 
 1   preusuel  686536 non-null  object
 2   annais    686538 non-null  object
 3   nombre    686538 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 93.3 MB
None


**QUESTION:** Display the output of `.describe()` with style.

In [250]:
df_fr.describe().style.format("{:.0f}").bar(color ='lightblue')

Unnamed: 0,sexe,nombre
count,686538,686538
mean,2,127
std,0,875
min,1,1
25%,1,4
50%,2,8
75%,2,25
max,2,53547


**QUESTION:** For each column compute the number of distinct values

In [251]:
df_fr.apply(pd.Series.nunique)

sexe            2
preusuel    36170
annais        123
nombre       7281
dtype: int64

# Transformations

## Improving the data types

**QUESTION:** Make `sexe` a category with two levels `Female` and `Male`. Call the new column `gender`. Do you see any reason why this factor should be ordered?

__Hint:__ Read [Pandas and categorical variables](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html?highlight=category)

In [252]:
def insertFemaleMale(sexe):
    if sexe == 2:
        return "Female"
    return "Male"

df_fr['gender'] = df_fr.apply(lambda x: insertFemaleMale(x.sexe), axis=1)
df_fr['gender'] = df_fr['gender'].astype('category')
df_fr = df_fr[['gender', 'sexe', 'preusuel', 'annais', 'nombre']]
print(df_fr.dtypes)

gender      category
sexe           int64
preusuel      object
annais        object
nombre         int64
dtype: object
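
An alternative to the row-by-row `apply` is to translate the whole column at once with `map`. The sketch below runs on a made-up column and assumes the same coding as above (1 for boys, 2 for girls); `gender_type` is a name introduced for the example.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

toy = pd.DataFrame({"sexe": [1, 2, 2, 1]})   # made-up codes, same convention as above

# Unordered categorical: sex is a nominal variable, no level comes "before" the other.
gender_type = CategoricalDtype(categories=["Female", "Male"], ordered=False)

# `map` translates the whole column in one call instead of one function call per row.
toy["gender"] = toy["sexe"].map({1: "Male", 2: "Female"}).astype(gender_type)

print(toy["gender"].cat.categories.tolist(), toy["gender"].cat.ordered)
```

Declaring the dtype explicitly also fixes the set of levels, so the same `gender_type` could be reused for the British and US dataframes.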


In [253]:
# Q: Do you see any reason why this factor should be ordered?
# A: It might be interesting to see the names given to both girls and boys 

df_fr[['gender', 'preusuel','nombre']].groupby(['preusuel', 'gender']).mean(numeric_only = True).round(0)

Unnamed: 0_level_0,Unnamed: 1_level_0,nombre
preusuel,gender,Unnamed: 2_level_1
A,Female,
A,Male,9.0
AADAM,Female,
AADAM,Male,4.0
AADEL,Female,
...,...,...
ÖZKAN,Male,7.0
ÖZLEM,Female,6.0
ÖZLEM,Male,
ÜMMÜ,Female,5.0


**QUESTION:** Compare memory usage of columns `sexe` and `gender`

In [254]:
print("Memory usage of column 'gender' = ", df_fr['gender'].memory_usage(deep=True)/1000, "kB")
print("Memory usage of column 'sexe'   =", df_fr['sexe'].memory_usage(deep=True)/1000, "kB")

Memory usage of column 'gender' =  686.898 kB
Memory usage of column 'sexe'   = 5492.432 kB


**QUESTION:** Would it be more memory-efficient to recode `sexe` using modalities `F` and `M` instead of `Male` and `Female` ?
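
One way to check this empirically is to build the same categorical column with long and with short labels. The sketch below uses random made-up codes; `long_bytes` and `short_bytes` are names chosen for the example.

```python
import numpy as np
import pandas as pd

# 100 000 made-up codes (0 or 1), shared by both columns.
codes = np.random.default_rng(0).integers(0, 2, size=100_000)

long_labels = pd.Series(pd.Categorical.from_codes(codes, categories=["Female", "Male"]))
short_labels = pd.Series(pd.Categorical.from_codes(codes, categories=["F", "M"]))

long_bytes = long_labels.memory_usage(deep=True, index=False)
short_bytes = short_labels.memory_usage(deep=True, index=False)

# Both columns store one small integer code per row; only the two labels differ in size.
print(long_bytes, short_bytes, long_bytes - short_bytes)
```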

_Insert your answer here_

> No, the impact would be negligible: a `category` column stores one small integer code per row plus each label only once, so the length of the labels hardly matters.

## Dealing with missing values

**QUESTION:** Variable `annais` class is `object`. Make `annais` of type `float`. Note that missing years are encoded as "XXXX", find a way to deal with that.

__Hint:__  As of releasing this Homework (2023-01-18), `Pandas` is not very good at managing missing values,
see [roadmap](https://pandas.pydata.org/docs/development/roadmap.html). Don't try to convert `annais` into an integer column.

In [256]:
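# Count the masked years only while `annais` still holds strings, so that re-running the cell is safe
if not pd.api.types.is_float_dtype(df_fr['annais']):
    statistics['countNullFr'] = int((df_fr['annais'] == 'XXXX').sum())

# 'XXXX' cannot be parsed as a number: errors='coerce' turns it into NaN
df_fr = df_fr.assign(annais=pd.to_numeric(df_fr['annais'], errors='coerce').astype(np.float64))
df_fr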

## Rename and remove columns

**QUESTION:** Remove useless columns (now that you've created new ones) and rename the remaining ones. You should end up with a dataframe with columns called `"gender"`, `"year"`, `"count"`, `"firstname"` with the following dtypes:

```python
gender        category
firstname     object
count         int64
year          float64
```

In [257]:
# your code here
df_fr = (df_fr.drop(columns = 'sexe', errors = 'ignore')
                .rename(columns = {'preusuel': 'firstname', 'annais': 'year', 'nombre': 'count'})
        )
df_fr

Unnamed: 0,gender,firstname,year,count
0,Male,_PRENOMS_RARES,1900.0,1249
1,Male,_PRENOMS_RARES,1901.0,1342
2,Male,_PRENOMS_RARES,1902.0,1330
3,Male,_PRENOMS_RARES,1903.0,1286
4,Male,_PRENOMS_RARES,1904.0,1430
...,...,...,...,...
686533,Female,ZYNEB,2018.0,5
686534,Female,ZYNEB,2019.0,7
686535,Female,ZYNEB,2020.0,8
686536,Female,ZYNEB,2021.0,6


**Question:** Do the same thing for British and US data. You should eventually obtain dataframes with the same schema.  

**QUESTION:** How many missing values (NA) have been introduced? How many births are concerned?

In [258]:
# British data
df_br['gender'] = df_br.apply(lambda x: insertFemaleMale(x.Gender), axis=1)
df_br['gender'] = df_br['gender'].astype('category')
df_br = (df_br.drop(columns = 'Gender', errors = 'ignore')
                .rename(columns = {'Name': 'firstname', 'Year': 'year', 'Nb': 'count'})
        )
df_br = df_br[['gender', 'firstname', 'year', 'count']]
df_br

Unnamed: 0,gender,firstname,year,count
0,Male,A,2021,5
16777,Male,A,2020,4
33554,Male,A,2019,10
50331,Male,A,2018,4
67108,Male,A,2017,4
...,...,...,...,...
87831,Female,Zyva,2018,5
109789,Female,Zyva,2017,5
153705,Female,Zyva,2015,3
285453,Female,Zyva,2009,5


In [259]:
# USA data
df_usa['gender'] = df_usa.apply(lambda x: insertFemaleMale(x.Gender), axis=1)
df_usa['gender'] = df_usa['gender'].astype('category')
df_usa = (df_usa.drop(columns = 'Gender', errors = 'ignore')
                .rename(columns = {'Name': 'firstname', 'Year': 'year', 'Nb': 'count'})
        )
df_usa = df_usa[['gender', 'firstname', 'year', 'count']]
df_usa

Unnamed: 0,gender,firstname,year,count
0,Female,Mary,1880,7065
1,Female,Anna,1880,2604
2,Female,Emma,1880,2003
3,Female,Elizabeth,1880,1939
4,Female,Minnie,1880,1746
...,...,...,...,...
2052776,Male,Zyeire,2021,5
2052777,Male,Zyel,2021,5
2052778,Male,Zyian,2021,5
2052779,Male,Zylar,2021,5


In [260]:
print(f"Missing values inserted:\n    {statistics['countNullFr']} NaN's for French data \n    {statistics['countNullBr']} NaN's for UK data\n    {statistics['countNullUSA']} NaN's for USA data" )

Missing values inserted:
    37924 NaN's for French data 
    698780 NaN's for UK data
    0 NaN's for USA data


**QUESTION:** Read the documentation and describe the origin of rows containing the missing values.

In [261]:
print(
    "In the USA data there is no need to insert NaN because each file only contains the records for one particular year.\n"
    "The starting point is the year, and a name is listed only if someone was given this name in that year.\n\n"
    "In the UK data, the names are the starting point: all the names given to children from 1996 to 2021 are listed. "
    "If a name was not given to anyone in some year, a missing value (in the form of '[x]') is indicated.\n\n"
    "The French data contain missing years in the form 'XXXX' for privacy reasons: if a name is given fewer than 3 times in a year, "
    "it is not published for that year, because doing so would make it easier to identify the people concerned. "
    "The number of such births for a given name is summed and published in column 'year' with label 'XXXX'."
)

In the USA data there is no need to insert NaN because each file only contains the records for one particular year.
The starting point is the year, and a name is listed only if someone was given this name in that year.

In the UK data, the names are the starting point: all the names given to children from 1996 to 2021 are listed. If a name was not given to anyone in some year, a missing value (in the form of '[x]') is indicated.

The French data contain missing years in the form 'XXXX' for privacy reasons: if a name is given fewer than 3 times in a year, it is not published for that year, because doing so would make it easier to identify the people concerned. The number of such births for a given name is summed and published in column 'year' with label 'XXXX'.


In [262]:
# WARNING concatenation of 3 dataFrames
col = 'country'
df_fr[col] = "France"
df_br[col] = "UK"
df_usa[col] = "USA"

df_total = pd.concat([df_fr, df_br, df_usa], ignore_index=True)

df_total

Unnamed: 0,gender,firstname,year,count,country
0,Male,_PRENOMS_RARES,1900.0,1249,France
1,Male,_PRENOMS_RARES,1901.0,1342,France
2,Male,_PRENOMS_RARES,1902.0,1330,France
3,Male,_PRENOMS_RARES,1903.0,1286,France
4,Male,_PRENOMS_RARES,1904.0,1430,France
...,...,...,...,...,...
3047644,Male,Zyeire,2021,5,USA
3047645,Male,Zyel,2021,5,USA
3047646,Male,Zyian,2021,5,USA
3047647,Male,Zylar,2021,5,USA


## Checkpointing: save your transformed dataframes

**QUESTION:** Save the transformed dataframe (retyped and renamed) to `./nat2021_csv.zip`. Try several compression methods.
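
As a possible starting point, the sketch below writes the same made-up dataframe with several codecs and compares the file sizes. It relies on `to_csv` inferring the compression from the file suffix; the helper `saved_size`, the file names and the data are all hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up rows with the homogeneous schema; the real dataframe would be used instead.
toy = pd.DataFrame({
    "firstname": ["MARIE", "PAUL"] * 5000,
    "year": [1950.0, 1951.0] * 5000,
    "count": [300, 120] * 5000,
})

def saved_size(df, path):
    """Write `df` to `path` and return the size of the resulting file in bytes."""
    df.to_csv(path, index=False)      # the codec is inferred from the file suffix
    return path.stat().st_size

with tempfile.TemporaryDirectory() as tmp:
    sizes = {suffix: saved_size(toy, Path(tmp) / f"toy{suffix}")
             for suffix in (".csv", ".csv.gz", ".csv.bz2", ".csv.zip")}

print(sizes)
```

Note that a CSV file does not record dtypes, so a `category` column has to be retyped after reloading.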

In [None]:
# your code here


**QUESTION:** Save the transformed dataframes (retyped and renamed) to `./nat2021.hdf` using `.hdf` format

In [None]:
# your code here


At that point your working directory should look like:

```
├── homework01.py      # if you use `jupytext`
├── homework01.qmd     # if you use `quarto`
├── homework01.ipynb   # if you use `jupyter` `notebook`
├── babies-fr.hdf
├── babies-fr.zip
├── babies-us.hdf
├── babies-us.zip
├── babies-ew.hdf
├── babies-ew.zip
├── births-fr.csv
└── births-fr.hdf
```

**QUESTION:** Reload the data using `read_hdf(...)` so that the resulting dataframes  are properly typed with meaningful and homogeneous column names.

__Hint:__ use `try: ... except` to handle exceptions such as `FileNotFoundError`

In [None]:
# your code here


## Some data "analytics" and visualization

**QUESTION**: For each year, compute the total number of Female and Male births and the proportion of Female  births among total births

__Hints:__

- Groupby operations using several columns for the groups return a dataframe with a `MultiIndex` index see [Pandas advanced](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)

- Have a look at `MultiIndex`, `reset_index`, `pivot`, `columns.droplevel`
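
The hints above can be illustrated on a toy table. This is only a sketch under the assumption that the data follow the homogeneous schema (`year`, `gender`, `count`); the counts are made up and `prop_female` is a name chosen for the example.

```python
import pandas as pd

# Made-up counts with the homogeneous schema used in this notebook.
toy = pd.DataFrame({
    "year":   [2000, 2000, 2000, 2001, 2001],
    "gender": ["Female", "Male", "Male", "Female", "Male"],
    "count":  [48, 30, 22, 45, 55],
})

births = (
    toy.groupby(["year", "gender"])["count"].sum()   # Series indexed by a (year, gender) MultiIndex
       .unstack("gender")                             # one column per gender, one row per year
       .assign(prop_female=lambda d: d["Female"] / (d["Female"] + d["Male"]))
)

print(births)
```

`unstack` moves one level of the `MultiIndex` to the columns, which makes the ratio a plain column operation.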

In [None]:
# your code here


**QUESTION:** Plot the proportion of female births as a function of the year for the French, US and British babynames data. Compare with what you get from `births-fr.hdf`.

Don't forget: title, axes labels, ticks, scales, etc.

Because of what we did before, the `plot` method of a `DataFrame` will be rendered using `plotly`, so you can use this. But you can also use `seaborn` or any other available plotting library that you want.

__Hint:__ Mind the missing values in the `year` column

In [None]:
# your code here


**QUESTION:** Make any sensible comment about these plots.

_Insert your answer here_

> ...

**QUESTION:** Explore the fluctuations of sex ratio around its mean value since 1945 in the US, in France and in Great Britain.

Plot deviations of sex ratio around its mean since 1945 as a function of time.

In [None]:
# your code here


**QUESTION:**  Assume that baby gender is chosen at random according to a Bernoulli distribution with success probability $.48$ and that baby genders are i.i.d. Perform simulations of the sex ratios for French and US data since 1945.

Plot the results, compare with your plots above.  
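
A minimal simulation sketch follows. It assumes that "success" means a female birth and uses made-up yearly totals in place of the real ones; the seed and the names `totals` and `simulated` are only there for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)     # fixed seed, only to make the sketch reproducible

# Made-up yearly totals; in the homework they would come from the real data.
totals = pd.Series([800_000, 750_000, 700_000], index=[1945, 1946, 1947], name="births")

# The sum of n i.i.d. Bernoulli(0.48) draws is Binomial(n, 0.48): one draw per year is enough.
female = rng.binomial(n=totals.to_numpy(), p=0.48)

simulated = pd.Series(female / totals.to_numpy(), index=totals.index, name="prop_female")
print(simulated)
```

With several hundred thousand births per year, the simulated proportions stay very close to $0.48$, which gives a yardstick for the size of the fluctuations seen in the real data.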

# The rise, decline and fall of firstnames

**Question:** For each year, country, gender and firstname, compute the popularity rank of the firstname among the names given to babies with that gender, in that country, in that year. The most popular name should be given rank $1$.  
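
Ranking within groups can be sketched with `groupby(...).rank(...)`. The data below are made up, and giving tied names the same (best) rank through `method="min"` is an assumption, since the question does not say how ties should be handled.

```python
import pandas as pd

toy = pd.DataFrame({
    "country":   ["France"] * 3 + ["USA"] * 3,
    "year":      [2000] * 6,
    "gender":    ["Female"] * 6,
    "firstname": ["LEA", "MANON", "CHLOE", "Emily", "Hannah", "Madison"],
    "count":     [8000, 6000, 6000, 25000, 23000, 20000],   # made-up values
})

# ascending=False gives rank 1 to the largest count within each group;
# method="min" lets tied names share the best rank.
toy["rank"] = (
    toy.groupby(["country", "year", "gender"])["count"]
       .rank(method="min", ascending=False)
       .astype(int)
)

print(toy)
```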


**QUESTION:** For each firstname and sex (some names may be given to girls and boys), compute the total number of times this firstname has been given during `1900-2019`. Print the top 20 firstnames given and style your result dataframe using `background_gradient` for instance.

In [None]:
# your code here


## Rare firstnames

**QUESTION:** In the French data, for each sex, plot the proportion of births given `_PRENOMS_RARES` as a function of the year.

In [None]:
# your code here


# A study of the "Marie" firstname

**QUESTION:** Plot the proportion of female births given name `'MARIE'` or `'MARIE-...'` (compounded names) as a function of the year.
Proceed in such a way that the reader can see the share of compounded names. We are expecting an _area plot_.

__Hints:__

- Have a look at the `.str` accessor (to apply a string method over a whole column containing string)
- Have a look at [r-graph-gallery: stacked area](https://www.r-graph-gallery.com/stacked-area-graph.html)  and
at [ggplot documentation](https://ggplot2.tidyverse.org/reference/geom_ribbon.html). Pay attention to the way you stack the area corresponding to names matching pattern 'MARIE-.*' over or under the area corresponding to babies named 'MARIE'
- See Graphique 3, page 48, of _L'archipel français_ by J. Fourquet. Le Seuil. Essais. Vol. 898.

- Add annotations: 1st World War, Front Populaire, 2nd World War, 1968
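
The first hint can be sketched as follows. The names and counts are made up, and the sketch assumes that only names starting with `MARIE-` count as compounded names (so `ANNE-MARIE` falls into the remaining group); the column `kind` is introduced for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "firstname": ["MARIE", "MARIE-CLAIRE", "MARIELLE", "ANNE-MARIE", "MARIE-ANGE"],
    "count": [100, 20, 5, 8, 12],    # made-up values
})

# Classify every name with vectorised comparisons and string methods.
is_plain = toy["firstname"].eq("MARIE")
is_compound = toy["firstname"].str.startswith("MARIE-")

toy["kind"] = "other"
toy.loc[is_plain, "kind"] = "MARIE"
toy.loc[is_compound, "kind"] = "MARIE-..."

by_kind = toy.groupby("kind")["count"].sum()
print(by_kind)
```

Once each row carries its `kind`, the two shares needed for the stacked area plot come from a group-by on year and `kind`.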

In [None]:
# your code here


# Top 10 firstnames of year 2000

**QUESTION:** For each sex, select the ten most popular names in year 2000, and plot the proportion
of newborns given that name over time. Take into account that some names might have
zero occurrence during certain years.

__Hint:__ Leave aside the rows with '_PRENOMS_RARES'.
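
Missing (year, name) pairs can be turned into explicit zeros by reshaping to a wide table. The sketch below uses made-up counts; `wide` is a name chosen for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "year":      [1999, 2000, 2001, 1999, 2001],
    "firstname": ["LEA", "LEA", "LEA", "KEVIN", "KEVIN"],   # KEVIN has no row for 2000
    "count":     [7000, 8000, 7500, 900, 400],               # made-up values
})

# One row per year, one column per name; absent (year, name) pairs become zeros.
wide = toy.pivot_table(index="year", columns="firstname", values="count",
                       aggfunc="sum", fill_value=0)

print(wide)
```

Without the explicit zeros, a line plot would join 1999 and 2001 directly and hide the year in which the name was not recorded.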

In [None]:
# your code here


# Picturing concentration of babynames distributions


Every year, the name counts define a discrete probability distribution over the set of names (the universe).

This distribution, just as an income or wealth distribution, is (usually) far from being uniform. We want to assess how uneven it is.

We use the tools developed in econometrics.

Without loss of generality, we assume that we handle a distribution over positive integers $1, \ldots, n$ where $n$ is the number of distinct names given during a year.

We assume that frequencies $p_1, p_2, \ldots, p_n$ are given in ascending order, ties are broken arbitrarily.

The `Lorenz function` ([Lorenz](https://en.wikipedia.org/wiki/Lorenz_curve) not `Lorentz`) maps $[0, 1] \to [0, 1]$.

$$L(x) = \sum_{i=1}^{\lfloor nx \rfloor} p_i .$$

Note that this is a piecewise constant function. 
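
The definition can be turned into a short `numpy` function. This is a sketch, not a reference implementation: the function name `lorenz`, the made-up counts and the small tolerance added before taking the floor (to guard against floating-point rounding of $nx$) are all choices made for the example.

```python
import numpy as np

def lorenz(counts, x):
    """Evaluate L(x) = p_1 + ... + p_floor(n x), with frequencies sorted in ascending order."""
    p = np.sort(np.asarray(counts, dtype=float))
    p = p / p.sum()                                  # turn counts into frequencies
    cum = np.concatenate([[0.0], np.cumsum(p)])      # cum[k] = p_1 + ... + p_k, cum[0] = 0
    # Small tolerance so that products such as n * x = 2.9999999 are floored to 3.
    k = np.floor(len(p) * np.asarray(x, dtype=float) + 1e-9).astype(int)
    return cum[np.clip(k, 0, len(p))]

counts = [1, 1, 2, 4]                                # made-up name counts for one year
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
values = lorenz(counts, grid)
print(values)
```

For plotting, the function can be evaluated on a fine grid of $x$ values, or drawn as a step function from the cumulative sums directly.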


**Question:** Compute and plot the Lorenz function for a given `sex`, `year` and `country`

**Question:** Design an animated plot that shows the evolution of the Lorenz curve of babynames distribution through the years for a given sex and country.


The Lorenz curve summarizes how far a discrete probability distribution is from the uniform distribution. This is a very rich summary and it is difficult to communicate this message to a wide audience. People tend to favor numerical indices (they don't really understand, but they get used to it): Gini, Atkinson, Theil, ...

The [Gini index](https://en.wikipedia.org/wiki/Gini_coefficient) is twice the area between the curves $y=x$ and $y=L(x)$.

$$G = 2 \times \int_0^1 (x -L(x)) \mathrm{d}x$$

The next formula  allows us to compute it efficiently.

$$G={\frac {2\sum _{i=1}^{n}i p_{i}}{n\sum _{i=1}^{n}p_{i}}}-{\frac {n+1}{n}}.$$
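
The closed-form expression above translates directly into a few `numpy` operations. The function name `gini` and the two made-up count vectors are only illustrative.

```python
import numpy as np

def gini(counts):
    """Gini index from the closed-form expression, with p sorted in ascending order."""
    p = np.sort(np.asarray(counts, dtype=float))
    n = len(p)
    i = np.arange(1, n + 1)
    return 2.0 * np.sum(i * p) / (n * p.sum()) - (n + 1) / n

print(gini([5, 5, 5, 5]))     # all names equally frequent: 0.0
print(gini([0, 0, 0, 20]))    # one name takes every birth: 0.75
```

Since the expression divides by $\sum_i p_i$, the function works on raw counts as well as on normalized frequencies.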


**Question:** Compute and plot the Gini index of the names distribution over time for each sex and country


# Picturing surges of popularity

In the sequel, the *popularity* of a gendered name in a population during a given year is the proportion of babies of that gender born during that year in that country,  that are given this name. 

**Question:** Prepare a data frame that contains for each hype name the 20 years before and 30 years after the maximum popularity is achieved, and, for each such year, the rank and popularity of the hype name. Do this for US and French data. 


**Question:** Plot offset popularity (share of given names within year, country, gender) curves of hype names. Facet by sex and country. 

**Question:** Rescale popularity curves so that all of them have maximum $1$. 
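
Both the offset (years relative to the peak) and the rescaling can be sketched with group-wise operations. The two names and their popularity values are made up; the columns `offset` and `rescaled` are introduced for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "firstname":  ["KEVIN"] * 3 + ["EMMA"] * 3,
    "year":       [1989, 1990, 1991, 2004, 2005, 2006],
    "popularity": [0.02, 0.04, 0.03, 0.010, 0.020, 0.015],   # made-up shares
})

# Year in which each name reaches its maximum popularity.
peak_rows = toy.loc[toy.groupby("firstname")["popularity"].idxmax()]
peak_year = peak_rows.set_index("firstname")["year"]

# Offset: number of years since (or before) the peak, 0 at the peak itself.
toy["offset"] = toy["year"] - toy["firstname"].map(peak_year)

# Rescaling: divide by the group-wise maximum so that every curve peaks at 1.
toy["rescaled"] = toy["popularity"] / toy.groupby("firstname")["popularity"].transform("max")

print(toy)
```

Plotting `rescaled` against `offset` puts all curves on a common time axis and a common scale, which makes their shapes comparable.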

# Getting help

- [pandas](https://pandas.pydata.org/pandas-docs/stable/reference/)

- [plotly](https://plotly.com/python/) for animated plots

- [stackoverflow](https://stackoverflow.com)

- [stackoverflow: pandas](https://stackoverflow.com/questions/tagged/pandas)

- [stackoverflow: plotly+python](https://stackoverflow.com/questions/tagged/plotly+python)

- The US `babynames` analogue of the INSEE file has been a playground for data scientists,
 see [https://github.com/hadley/babynames](https://github.com/hadley/babynames)

- Don't Repeat Yourself (DRY) principle  at [Wikipedia](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself)