# Visualization 3

**Are there gender effects in the data? Does popularity of names given to both sexes evolve consistently? (Note: this data set treats sex as binary; this is a simplification that carries into this assignment but does not generally hold.)**

In [212]:
import altair as alt
import pandas as pd
import math 

# Use the default data transformer and lift its 5,000-row limit so Altair can handle larger data sets.
# (Enabling 'json' first had no effect, because the 'default' call right after it overrode it.)
alt.data_transformers.enable('default')
alt.data_transformers.disable_max_rows()

alt.renderers.enable('default') # In some old versions of Jupyter, you may need to enable this.

RendererRegistry.enable('default')

## Sketches


![title](img_1.jpeg)

For this first sketch, we wanted to establish a top 10 of the most used names and to differentiate whether each was given more often to boys or to girls.
This first one wasn't very convincing and didn't show any evolution over the years, so we set it aside.


![title](img_2.jpeg)


For the second one, we wanted to show the evolution over the years of the most mixed names. This was supposed to give a clear insight into which names were once considered mostly female or mostly male and how this perception evolved over the years. However, we thought it wasn't the best suited to answer both questions.

![title](img_3.jpeg)


For the third one, we plotted the log ratio of female to male usage of each name over the years. This is the one we implemented because a single graph
shows how a name could lean toward one gender during certain periods and how this fluctuates over time. The closer the
curve is to 0, the more evenly the name was given to male and female babies.


## Reading our names data

Now, let's read in our dataset. The exported data is in CSV format, but with a `;` separator instead of commas. The INSEE data collapses rare names into a single `_PRENOMS_RARES` entry and replaces the year or department with `XXXX` or `XX` where that information has been elided (presumably to protect individuals with uncommon names, or who were among the only ones given that name in a given year). We'll strip those rows out.

In [182]:
baby_names = pd.read_csv('dpt2020.csv', delimiter=';')
baby_names.head()

Unnamed: 0,sexe,preusuel,annais,dpt,nombre
0,1,_PRENOMS_RARES,1900,2,7
1,1,_PRENOMS_RARES,1900,4,9
2,1,_PRENOMS_RARES,1900,5,8
3,1,_PRENOMS_RARES,1900,6,23
4,1,_PRENOMS_RARES,1900,7,9


In [183]:
baby_names_cleaned = baby_names[baby_names['preusuel'] != '_PRENOMS_RARES']
baby_names_cleaned.head()

Unnamed: 0,sexe,preusuel,annais,dpt,nombre
10882,1,A,XXXX,XX,27
10883,1,AADAM,XXXX,XX,30
10884,1,AADEL,XXXX,XX,56
10885,1,AADIL,1983,84,3
10886,1,AADIL,1992,92,3


In [184]:
baby_names_cleaned = baby_names_cleaned[baby_names_cleaned['annais'] != 'XXXX']
baby_names_cleaned.head()

Unnamed: 0,sexe,preusuel,annais,dpt,nombre
10885,1,AADIL,1983,84,3
10886,1,AADIL,1992,92,3
10888,1,AAHIL,2016,95,3
10892,1,AARON,1962,75,3
10893,1,AARON,1976,75,3


We check whether any value starting with 'X' remains in the columns 'annais' or 'dpt'. The empty result below confirms that none does.

In [185]:
baby_names_cleaned[(baby_names_cleaned['annais'].str.startswith('X'))| (baby_names_cleaned['dpt'].str.startswith('X'))]


Unnamed: 0,sexe,preusuel,annais,dpt,nombre
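The filtering cells above can also be folded into a single boolean mask. The following is a minimal sketch on a tiny in-memory sample: the rows are illustrative stand-ins shaped like the INSEE file, and `clean_names` is a hypothetical helper name that does not appear in the notebook.

```python
import io
import pandas as pd

# Illustrative rows in the same `;`-separated layout as dpt2020.csv.
SAMPLE = """sexe;preusuel;annais;dpt;nombre
1;_PRENOMS_RARES;1900;02;7
1;AADAM;XXXX;XX;30
1;AADIL;1983;84;3
1;AARON;1962;75;3
"""

def clean_names(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the rare-name bucket and rows with an elided year or department."""
    keep = (
        (df["preusuel"] != "_PRENOMS_RARES")
        & (df["annais"] != "XXXX")
        & (df["dpt"] != "XX")
    )
    # .copy() returns an independent frame, so later edits do not touch the source.
    return df.loc[keep].copy()

# Reading year and department as strings keeps 'XXXX' and 'XX' comparable.
sample = pd.read_csv(io.StringIO(SAMPLE), sep=";", dtype={"annais": str, "dpt": str})
cleaned = clean_names(sample)
print(cleaned)
```

Taking an explicit `.copy()` at the end of the filter is one way to avoid pandas' `SettingWithCopyWarning` when the cleaned frame is modified later on.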


## Solution


We will group the dataset by year, name and gender and drop the department as we're looking for a temporal evolution of the gender effects.

In [186]:
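# Sum only the counts; selecting 'nombre' keeps the string column 'dpt' out of the aggregation.
baby_names_grouped = baby_names_cleaned.groupby(['annais', 'preusuel', 'sexe'], as_index=False)['nombre'].sum()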
baby_names_grouped

Unnamed: 0,annais,preusuel,sexe,nombre
0,1900,ABEL,1,382
1,1900,ABRAHAM,1,9
2,1900,ACHILLE,1,152
3,1900,ACHILLES,1,4
4,1900,ADAM,1,9
...,...,...,...,...
257341,2020,ÉVA,2,156
257342,2020,ÉVAN,1,62
257343,2020,ÉZIO,1,12
257344,2020,ÉZÉCHIEL,1,11
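Collapsing the departments amounts to summing `nombre` within each (year, name, sex) group. As a small sketch of that step, here are three made-up rows (illustrative values, not INSEE figures), two of which share a year but come from different departments:

```python
import pandas as pd

# Illustrative rows: the same name in two departments in 1962, and one row in 1976.
rows = pd.DataFrame({
    "sexe": [1, 1, 1],
    "preusuel": ["AARON", "AARON", "AARON"],
    "annais": ["1962", "1962", "1976"],
    "dpt": ["75", "92", "75"],
    "nombre": [3, 4, 3],
})

# Group on everything except the department and add up the counts.
per_year = rows.groupby(["annais", "preusuel", "sexe"], as_index=False)["nombre"].sum()
print(per_year)
```

The two 1962 rows merge into a single row with a count of 7, while the 1976 row is left unchanged.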


In [187]:
top_names = baby_names_grouped.groupby(['preusuel'])['nombre'].sum().sort_values(ascending= False)
top_names.head(10)

preusuel
MARIE       2256072
JEAN        1913130
PIERRE       891794
MICHEL       818025
ANDRÉ        709633
JEANNE       556903
PHILIPPE     535355
LOUIS        523576
RENÉ         514560
ALAIN        504106
Name: nombre, dtype: int64

We notice a clear disparity in the number of occurrences between female and male names: the top 10 names are mostly male names. We will therefore take the global top 40 into account, so that a significant number of female names is included.

In [218]:
top_40_names = top_names[:40].index.to_list()
top_40_names

['MARIE',
 'JEAN',
 'PIERRE',
 'MICHEL',
 'ANDRÉ',
 'JEANNE',
 'PHILIPPE',
 'LOUIS',
 'RENÉ',
 'ALAIN',
 'JACQUES',
 'BERNARD',
 'MARCEL',
 'CLAUDE',
 'DANIEL',
 'ROGER',
 'PAUL',
 'ROBERT',
 'DOMINIQUE',
 'GEORGES',
 'HENRI',
 'CHRISTIAN',
 'NICOLAS',
 'FRANÇOISE',
 'MONIQUE',
 'FRANÇOIS',
 'PATRICK',
 'CATHERINE',
 'NATHALIE',
 'GÉRARD',
 'ISABELLE',
 'JOSEPH',
 'CHRISTOPHE',
 'JACQUELINE',
 'ANNE',
 'SYLVIE',
 'JULIEN',
 'MAURICE',
 'LAURENT',
 'FRÉDÉRIC']

To see the gender effect for each name, we add two columns holding the male and female counts. For a given name, this tells us how many times it was given to boys and to girls in the same year.

In [200]:
baby_names_gender = baby_names_grouped.merge(baby_names_grouped, how ='left', on = ['annais','preusuel'], suffixes = ['_l','_r'])

baby_names_gender

Unnamed: 0,annais,preusuel,sexe_l,nombre_l,sexe_r,nombre_r
0,1900,ABEL,1,382,1,382
1,1900,ABRAHAM,1,9,1,9
2,1900,ACHILLE,1,152,1,152
3,1900,ACHILLES,1,4,1,4
4,1900,ADAM,1,9,1,9
...,...,...,...,...,...,...
273781,2020,ÉVA,2,156,2,156
273782,2020,ÉVAN,1,62,1,62
273783,2020,ÉZIO,1,12,1,12
273784,2020,ÉZÉCHIEL,1,11,1,11


In [201]:
baby_names_gender = baby_names_gender.loc[baby_names_gender['sexe_r']> baby_names_gender['sexe_l']]
baby_names_gender

Unnamed: 0,annais,preusuel,sexe_l,nombre_l,sexe_r,nombre_r
16,1900,AGATHE,1,3,2,62
50,1900,ALIX,1,6,2,47
83,1900,ANDRÉ,1,5530,2,4
89,1900,ANGE,1,157,2,23
144,1900,ARSENE,1,209,2,12
...,...,...,...,...,...,...
273543,2020,YACINE,1,205,2,3
273549,2020,YAEL,1,6,2,10
273590,2020,YAËL,1,89,2,71
273730,2020,ÉDEN,1,215,2,13


In [202]:
# rename/drop return a new DataFrame here, which avoids pandas' SettingWithCopyWarning
baby_names_gender = baby_names_gender.rename(columns={"nombre_l": "male", "nombre_r": "female"})
baby_names_gender = baby_names_gender.drop(columns=['sexe_l', 'sexe_r'])
baby_names_gender.reset_index()


Unnamed: 0,index,annais,preusuel,male,female
0,16,1900,AGATHE,3,62
1,50,1900,ALIX,6,47
2,83,1900,ANDRÉ,5530,4
3,89,1900,ANGE,157,23
4,144,1900,ARSENE,209,12
...,...,...,...,...,...
8215,273543,2020,YACINE,205,3
8216,273549,2020,YAEL,6,10
8217,273590,2020,YAËL,89,71
8218,273730,2020,ÉDEN,215,13
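The same wide layout (one row per year and name, with a `male` and a `female` column) can also be reached by pivoting instead of self-merging. This is only a sketch of that alternative, not the approach used in this notebook; the small `grouped` frame below is built by hand, with counts chosen to mirror a few rows shown above.

```python
import pandas as pd

# Hand-built yearly counts per (year, name, sex); sexe 1 = male, 2 = female.
grouped = pd.DataFrame({
    "annais": ["1900", "1900", "1900", "1900", "1900"],
    "preusuel": ["AGATHE", "AGATHE", "ALIX", "ALIX", "ABEL"],
    "sexe": [1, 2, 1, 2, 1],
    "nombre": [3, 62, 6, 47, 382],
})

wide = (
    grouped.pivot_table(index=["annais", "preusuel"], columns="sexe",
                        values="nombre", aggfunc="sum")
    .rename(columns={1: "male", 2: "female"})
    .dropna()        # keep only names given to both sexes in that year
    .astype(int)     # counts became floats because of the NaN placeholders
    .reset_index()
)
wide.columns.name = None  # drop the leftover 'sexe' label on the column axis
print(wide)
```

Names given to only one sex in a year (ABEL here) end up with a missing value in the other column and are removed by `dropna()`, which plays the same role as the `sexe_r > sexe_l` filter after the merge.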


As seen in the visual mappings course, slide 33, we will apply the log ratio to display the gender effect on a graph.

$$ \text{Ratio} = \log\left(\frac{\text{count}(\text{female})}{\text{count}(\text{male})}\right) $$
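As a quick worked example of this formula, the sketch below applies it to two 1900 rows from the table above (AGATHE: 3 male, 62 female; ANDRÉ: 5530 male, 4 female). `log_ratio` is a hypothetical helper name used only for illustration.

```python
import math

def log_ratio(female: int, male: int) -> float:
    """Natural log of female/male: positive means mostly female, negative mostly male."""
    return math.log(female / male)

# Counts for 1900 taken from the table above.
agathe = log_ratio(female=62, male=3)
andre = log_ratio(female=4, male=5530)
print(round(agathe, 6), round(andre, 6))  # prints 3.028522 -7.231649

# Swapping the two counts flips the sign and keeps the magnitude.
print(math.isclose(log_ratio(62, 3), -log_ratio(3, 62)))  # prints True
```

This symmetry around 0 is what makes the log ratio convenient for plotting: a name given twice as often to girls sits as far above 0 as a name given twice as often to boys sits below it. The ratio is undefined when either count is zero, which is why only names given to both sexes in the same year are kept.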


In [207]:
# assign returns a new DataFrame, so no SettingWithCopyWarning is raised
baby_names_gender = baby_names_gender.assign(
    ratio=(baby_names_gender['female'] / baby_names_gender['male']).apply(math.log)
)

baby_names_gender


Unnamed: 0,annais,preusuel,male,female,ratio
16,1900,AGATHE,3,62,3.028522
50,1900,ALIX,6,47,2.058388
83,1900,ANDRÉ,5530,4,-7.231649
89,1900,ANGE,157,23,-1.920752
144,1900,ARSENE,209,12,-2.857428
...,...,...,...,...,...
273543,2020,YACINE,205,3,-4.224398
273549,2020,YAEL,6,10,0.510826
273590,2020,YAËL,89,71,-0.225956
273730,2020,ÉDEN,215,13,-2.805689


In [224]:
baby_names_gender['preusuel'].isin(top_40_names)

16        False
50        False
83         True
89        False
144       False
          ...  
273543    False
273549    False
273590    False
273730    False
273742    False
Name: preusuel, Length: 8220, dtype: bool

## Chart

We chose to plot the evolution of the gender effect with the Altair library because it works directly with pandas DataFrames.

In [262]:
baby_names_top_40 = baby_names_gender.loc[baby_names_gender['preusuel'].isin(top_40_names)]

graph = alt.Chart(baby_names_top_40, width=800, height=800 ).mark_line().encode(
    x = alt.X('annais:T', title = 'Year of birth'),
    y = alt.Y('ratio:Q', title = 'Log ratio of female to male counts'),
    color=alt.Color('preusuel:N'),
    
).properties(title='Evolution of gender effect on birth names')


highlight = alt.selection(type='single', on='mouseover',fields=['preusuel'], nearest=True)

points = graph.mark_circle().encode(
    opacity=alt.value(0),
    tooltip='preusuel'
).add_selection(
    highlight
).properties(
    width=800, height=800
)

lines = graph.mark_line().encode(
    size=alt.condition(~highlight, alt.value(1), alt.value(3))
)

points + lines

The name Dominique, which is given to both sexes, was given mainly to girls for a short period between 1950 and 1955. After that, it was again given more often to boys.
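A reading like this can be checked against the numbers rather than the plot by listing the years in which a name's log ratio is positive. The sketch below shows the idea on a toy frame: the ratio values are invented for illustration and are not INSEE figures, and `female_majority_years` is a hypothetical helper name.

```python
import pandas as pd

def female_majority_years(df: pd.DataFrame, name: str) -> list:
    """Years in which `name` has a positive log ratio (more girls than boys)."""
    rows = df[(df["preusuel"] == name) & (df["ratio"] > 0)]
    return sorted(rows["annais"].tolist())

# Invented ratios, only to exercise the helper; replace with the real frame.
toy = pd.DataFrame({
    "annais": ["1949", "1950", "1951", "1956"],
    "preusuel": ["DOMINIQUE"] * 4,
    "ratio": [-0.20, 0.05, 0.10, -0.15],
})
years = female_majority_years(toy, "DOMINIQUE")
print(years)
```

Calling the helper on `baby_names_top_40` instead of the toy frame would return the actual years for any name in the chart.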