# Making choropleth maps in Altair

Here's a quick example of how to make a choropleth map in Altair.  In this example, we'll work with a fairly large data set of baby names in France from 1900 to 2019, broken down by department.

To work with geographical data, we'll use `geopandas`, which provides `pandas`-style dataframes with support for geographical outlines, such as those stored in the `geojson` format.  You can use these dataframes just as you would a regular `pandas` dataframe, but they will include that extra geographical outline data.

To get started, we'll need to import our libraries.

In [1]:
import altair as alt
import pandas as pd
import geopandas as gpd # Requires geopandas -- e.g.: conda install -c conda-forge geopandas
alt.data_transformers.enable('json') # Let Altair/Vega-Lite work with large data sets

pass

# Reading our names data

Now, let's read in our dataset.  The exported data is in CSV format, but with a `;` separator instead of commas.  The INSEE data collapses rare names into a single `_PRENOMS_RARES` entry and replaces the department with `XX` where department-level information has been elided (presumably to protect individuals with uncommon names or who were among the only ones born with that name in a given year).  We'll strip those rows out.

In [2]:
names = pd.read_csv("dpt2020.csv", sep=";")
names.drop(names[names.preusuel == '_PRENOMS_RARES'].index, inplace=True)
names.drop(names[names.dpt == 'XX'].index, inplace=True)

names.sample(5)

Unnamed: 0,sexe,preusuel,annais,dpt,nombre
2513650,2,GEORGINA,1904,78,3
2479467,2,FRANÇOISE,1940,10,64
3727471,2,ZULMA,1904,62,11
1667205,1,XAVIER,1944,31,4
978059,1,LENNY,2019,73,9
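
The filtering step above can be tried on a tiny in-memory file. This is only a sketch: the rows are made up, and passing `dtype=str` for the code columns is an assumption of this example (the cell above relies on the default type inference).

```python
import io
import pandas as pd

# Hypothetical miniature of the INSEE file: same columns, made-up rows.
raw = io.StringIO(
    "sexe;preusuel;annais;dpt;nombre\n"
    "1;_PRENOMS_RARES;1900;02;7\n"
    "1;LUCIEN;1900;XX;12\n"
    "1;LUCIEN;1900;75;40\n"
    "2;MARIE;XXXX;XX;5\n"
    "2;MARIE;1900;75;90\n"
)

# Reading the code columns as strings keeps leading zeros such as "02".
sample = pd.read_csv(raw, sep=";", dtype={"annais": str, "dpt": str})

# Keep only rows with a real name and a known department.
keep = (sample["preusuel"] != "_PRENOMS_RARES") & (sample["dpt"] != "XX")
cleaned = sample[keep].reset_index(drop=True)
print(cleaned)
```

Building a boolean mask and indexing with it gives the same rows as the two `drop(..., inplace=True)` calls, without modifying the original frame.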


# Loading map data

Next, let's load some map data of regions in France using `geopandas`.  These map data come from the [INSEE] and [IGN] and were processed into the `geojson` format we'll need to work with by [Grégoire David].  Here's the [github] repository.

In this example, we'll work with the simplified departments tiles for the Hexagon, but that repository contains higher-resolution versions, the DOM-TOM, and more.

[Grégoire David]: https://gregoiredavid.fr
[INSEE]: http://www.insee.fr/fr/methodes/nomenclatures/cog/telechargement.asp
[IGN]: https://geoservices.ign.fr/adminexpress
[github]: https://github.com/gregoiredavid/france-geojson/

In [3]:
depts = gpd.read_file('departements-version-simplifiee.geojson')

depts.sample(5)

Unnamed: 0,code,nom,geometry
57,57,Moselle,"POLYGON ((5.89340 49.49691, 5.93994 49.50097, ..."
48,48,Lozère,"POLYGON ((3.36134 44.97141, 3.38637 44.95274, ..."
49,49,Maine-et-Loire,"POLYGON ((-1.24588 47.77672, -1.23825 47.80999..."
3,4,Alpes-de-Haute-Provence,"POLYGON ((5.67604 44.19143, 5.69209 44.18648, ..."
76,76,Seine-Maritime,"POLYGON ((1.38155 50.06577, 1.40926 50.05707, ..."


Notice how `depts` is a geopandas dataframe.  We'll use it just as a regular `pandas` dataframe, but it includes the geometry info we need to be able to draw those regions when we pass them into Altair.  We just need to make sure that when we work with our data, we keep them in a geopandas dataframe and not a plain dataframe if we want to draw the departments.

In the next cell, notice how we do a right merge to bring department data into `names`.  We call the merge on `depts` because we need a geopandas dataframe: `depts` is a geopandas dataframe, while `names` is a regular dataframe.  If we did a left merge starting from `names`, we'd end up with a regular pandas dataframe. After this merge, both `names` and `depts` will be geopandas dataframes.

**Hint:** Be careful when you do your data joins here.  It's easy to merge the wrong way and accidentally create a _much bigger_ dataset.

In [4]:
# Keep a reference around to the plain pandas dataframe, without geometry data, just in case
just_names = names

names = depts.merge(names, how='right', left_on='code', right_on='dpt')

names.sample(5)

Unnamed: 0,code,nom,geometry,sexe,preusuel,annais,dpt,nombre
3134855,7.0,Ardèche,"POLYGON ((4.48313 45.23645, 4.54055 45.23475, ...",2,MAURICETTE,1922,7,3
686172,,,,1,HERVÉ,1975,974,6
988515,59.0,Nord,"MULTIPOLYGON (((3.04040 50.15971, 3.06301 50.1...",1,LOÉVAN,2018,59,3
528095,,,,1,FORTUNE,1912,972,3
2523387,88.0,Vosges,"POLYGON ((5.47006 48.42093, 5.51099 48.41822, ...",2,HÉLÈNE,1922,88,37
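
One way to guard against a merge that silently multiplies rows is to let pandas check the key relationship. The sketch below uses plain `pandas` frames with made-up rows; `regions` and `counts` are hypothetical stand-ins for `depts` and `names`, and it is an assumption that the same keyword arguments behave identically on a geopandas dataframe.

```python
import pandas as pd

# Hypothetical stand-ins: `regions` plays the role of `depts`, `counts` of `names`.
regions = pd.DataFrame({"code": ["01", "02"], "nom": ["Ain", "Aisne"]})
counts = pd.DataFrame({
    "dpt": ["01", "01", "02", "974"],
    "preusuel": ["LUCIEN", "MARIE", "LUCIEN", "LUCIEN"],
    "nombre": [10, 20, 30, 6],
})

merged = regions.merge(
    counts,
    how="right",
    left_on="code",
    right_on="dpt",
    validate="one_to_many",  # raises MergeError if a code is duplicated in `regions`
    indicator=True,          # adds a `_merge` column: "both" or "right_only"
)

# A right merge on a unique left key keeps exactly one row per row of `counts`.
assert len(merged) == len(counts)

# Rows whose department has no outline (here "974") get NaN in the left-hand columns.
unmatched = merged[merged["_merge"] == "right_only"]
print(unmatched[["dpt", "preusuel", "nombre"]])
```

The unmatched rows correspond to the `NaN` entries visible in the sample above, where overseas department codes such as 972 and 974 have no outline in the simplified map file.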


# Show a name over all years

Now we'll choose a name to show across all years.  To do that, we'll group the rows for each name in a department together (squashing the years together) and sum the counts.

In [5]:
grouped = names.groupby(['dpt', 'preusuel', 'sexe'], as_index=False).sum(numeric_only=True)
grouped = depts.merge(grouped, how='right', left_on='code', right_on='dpt') # Add geometry data back in
grouped

Unnamed: 0,code,nom,geometry,dpt,preusuel,sexe,nombre
0,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,AARON,1,160
1,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,ABBY,2,3
2,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,ABDALLAH,1,7
3,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,ABDEL,1,3
4,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,ABDELKADER,1,3
...,...,...,...,...,...,...,...
239574,,,,974,ÉSAÏE,1,3
239575,,,,974,ÉTHAN,1,53
239576,,,,974,ÉTIENNE,1,3
239577,,,,974,ÉVA,2,32
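
To see what "squashing the years together" does, here is a minimal sketch on made-up rows. It selects the `nombre` column explicitly before summing, which is a choice of this example rather than what the cell above does (that cell uses `sum(numeric_only=True)`).

```python
import pandas as pd

# Made-up yearly counts: LUCIEN appears in department 01 in two different years.
yearly = pd.DataFrame({
    "dpt": ["01", "01", "01", "02"],
    "preusuel": ["LUCIEN", "LUCIEN", "MARIE", "LUCIEN"],
    "sexe": [1, 1, 2, 1],
    "annais": ["1900", "1901", "1900", "1900"],
    "nombre": [10, 5, 20, 30],
})

# One row per (department, name, sex); the per-year counts are added up.
totals = yearly.groupby(["dpt", "preusuel", "sexe"], as_index=False)["nombre"].sum()
print(totals)
```

The two LUCIEN rows for department 01 collapse into a single row with a count of 15, and the year column disappears because it is neither a grouping key nor the summed column.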


Now let's pick a name and check out its distribution over the last 120 years across Metropolitan France.  In this example, I chose the name “Lucien,” which I rather like for some reason.

In [6]:
name = 'LUCIEN'
subset = grouped[grouped.preusuel == name]
alt.Chart(subset).mark_geoshape(stroke='white').encode(
    tooltip=['nom', 'code', 'nombre'],
    color='nombre',
).properties(width=800, height=600)

# Visualization 1

## Question:
How do baby names evolve over time? Are there names that have consistently remained popular or unpopular? Are there some that were suddenly or briefly popular or unpopular? Are there trends over time?

To answer these questions, we need to create a visualization that includes the following information:
1. A time series of the evolution of the top baby names over time.
2. A filter for consistently popular and unpopular names.
3. A filter for names that suddenly become popular or unpopular.

In [7]:
# A set of parameters to control the chart
number_of_names = 50  # number of names displayed in the chart by default
min_year = 1900  # minimum year for which data is displayed
max_year = 2020  # maximum year for which data is displayed
top_popularity = 200  # rank cutoff for names that remain consistently popular
top_percentage = 0.8
bottom_popularity = 800  # rank cutoff for names that remain consistently unpopular
bottom_percentage = 0.8
sudden_switch = 3000  # names that suddenly become popular/unpopular: rank change of at least 3000
best_rank = 30

typical_name = 'LUCIEN'  # name to display specifically
# choose what to display
default = True
consistent = False
popularity = False
changement = False
display_typcial_name = False

To make them easier to manipulate, we redefine these parameters in the code blocks that follow; the block above only lists their names, so change their values in the later blocks. In general, these parameters control filtering and searching, and we explain them in more detail below. Note that although the output looks like several figures, it is in fact one visualization with different parameters.

In [8]:
names_period = just_names[(just_names['annais'].values.astype(int)>=min_year) & (just_names['annais'].values.astype(int)<=max_year)]
topnames = names_period.groupby(['preusuel'])['nombre'].sum().sort_values(ascending= False).head(number_of_names).index.tolist()
default_top_name = names_period[names_period.preusuel.isin(topnames)]
default_top_name = default_top_name.groupby(['preusuel', 'annais'])['nombre'].sum().to_frame()
default_top_name.reset_index(inplace=True)
default_top_name  # the top names during the period [min_year, max_year]

Unnamed: 0,preusuel,annais,nombre
0,ALAIN,1900,83
1,ALAIN,1901,99
2,ALAIN,1902,106
3,ALAIN,1903,120
4,ALAIN,1904,136
...,...,...,...
5854,THIERRY,2015,9
5855,THIERRY,2016,7
5856,THIERRY,2017,8
5857,THIERRY,2019,5


In [9]:
all_names_per_year = names_period.groupby(['annais', 'preusuel'])['nombre'].sum().to_frame()
all_names_per_year.reset_index(inplace=True)

all_names_per_year = all_names_per_year.groupby('annais', group_keys=False).apply(lambda x: x.sort_values('nombre', ascending=False)).reset_index(drop=True)

top10_names_per_year = all_names_per_year.groupby('annais').head(10)
top10_names_per_year[top10_names_per_year.annais == '1900']


Unnamed: 0,annais,preusuel,nombre
0,1900,MARIE,49752
1,1900,JEAN,14100
2,1900,JEANNE,13981
3,1900,LOUIS,9052
4,1900,MARGUERITE,8058
5,1900,PIERRE,7461
6,1900,JOSEPH,7259
7,1900,GERMAINE,6980
8,1900,HENRI,6919
9,1900,LOUISE,6696


By default, we display one main chart: the evolution of the top names in France from `min_year` to `max_year`.

Hovering over a name makes the second chart show the evolution of that name over the whole period.

Selecting a time period on the main chart makes the third chart show the 10 most popular names for each year in that period (ordered by name).

In [10]:
# parameters for default chart
min_year = 1900
max_year = 2020
default = True

if default:
    single = alt.selection_single(on='mouseover', fields=['preusuel'])
    year_selection = alt.selection_interval(encodings=['x'])

    main_chart = alt.Chart(default_top_name).mark_area().encode(
        alt.X("annais:T", title='Year'),
        alt.Y("nombre:Q", title='Occurrence number'),
        color=alt.Color("preusuel:N"),
        opacity = alt.condition(single, alt.value(1.0), alt.value(0.3)),
        tooltip=[alt.Tooltip(field='preusuel', title="Chosen name")],
    ).add_selection(
        single, year_selection
    ).properties(width=1000, height=200, title=f'The evolution of top {number_of_names} names in France over time from {min_year} to {max_year}')

    chosen_name_over_time = alt.Chart(default_top_name).mark_line().encode(
        alt.X("year(annais):O", title='Year'),
        alt.Y('sum(nombre):Q', title='Occurrence number'),
        text=alt.condition(single, 'preusuel', alt.value('')),
        color = alt.Color('preusuel:N'),
    ).transform_filter(
        single
    ).transform_filter(
        year_selection
    ).properties(width=1000, height=200, title=f'Line chart of the chosen name over time')

    chosen_period = alt.Chart(top10_names_per_year).mark_bar().encode(
        alt.X("year(annais):O", title='Year'),
        alt.Y('sum(nombre):Q', title='Occurrence number'),
        color=alt.Color('preusuel:N', legend=None),
        tooltip=[alt.Tooltip('year(annais):T', title='Year'),
             alt.Tooltip('preusuel:N', title='Name'),
             alt.Tooltip('nombre:Q', title='Occurrence')],
    ).transform_filter(
        year_selection
    ).properties(width=1000, height=200,  title='The 10 most popular names in the selected time period each year')
    
    display(main_chart & chosen_name_over_time & chosen_period)



Here we can search for a specific name to see its evolution over time.

In [26]:
display_typcial_name = True
typical_name = 'LUCIEN'
if display_typcial_name:
    subset = just_names[just_names["preusuel"].values == typical_name]
    typical_chart = alt.Chart(subset).mark_line(strokeWidth=4).encode(
        alt.X("year(annais):O", title='Year'),
        alt.Y('sum(nombre):Q', title='Occurrence number'),
        color = alt.Color('preusuel:N'),
    ).properties(width=800, height=600)
    display(typical_chart)

In [27]:
just_names_proportion = just_names.groupby(['annais', 'preusuel'])['nombre'].sum().to_frame()
just_names_proportion.reset_index(inplace=True)
# calculate the proportion of each name in each year, add a new column 'proportion'
just_names_proportion = just_names_proportion.groupby('annais').apply(lambda x: x.assign(proportion = x['nombre']*100 / x['nombre'].sum())).reset_index(drop=True)
# sort the proportion in descending order in each year
just_names_proportion = just_names_proportion.groupby('annais', group_keys=False).apply(lambda x: x.sort_values('proportion', ascending=False)).reset_index(drop=True)
just_names_proportion_ascend = just_names_proportion.groupby('annais', group_keys=False).apply(lambda x: x.sort_values('proportion', ascending=True)).reset_index(drop=True)
# add a new column 'rank' holding each name's rank within its year (tied names share their average rank)
just_names_proportion = just_names_proportion.groupby('annais').apply(lambda x: x.assign(rank = x['proportion'].rank(ascending=False))).reset_index(drop=True)
just_names_proportion_ascend = just_names_proportion_ascend.groupby('annais').apply(lambda x: x.assign(rank = x['proportion'].rank(ascending=True)-511.5)).reset_index(drop=True)
display(just_names_proportion[just_names_proportion.annais == '1900'])
display(just_names_proportion_ascend[just_names_proportion_ascend.annais == '1900'])
just_names_proportion_ascend[just_names_proportion_ascend.annais == '1900'].proportion.sum()

Unnamed: 0,annais,preusuel,nombre,proportion,rank
0,1900,MARIE,49752,12.740426,1.0
1,1900,JEAN,14100,3.610709,2.0
2,1900,JEANNE,13981,3.580236,3.0
3,1900,LOUIS,9052,2.318024,4.0
4,1900,MARGUERITE,8058,2.063482,5.0
...,...,...,...,...,...
992,1900,EDME,3,0.000768,919.0
993,1900,MATHEA,3,0.000768,919.0
994,1900,JULIANA,3,0.000768,919.0
995,1900,SEBASTIENNE,3,0.000768,919.0


Unnamed: 0,annais,preusuel,nombre,proportion,rank
0,1900,LEONTIN,3,0.000768,-432.5
1,1900,VENANT,3,0.000768,-432.5
2,1900,HEDWIG,3,0.000768,-432.5
3,1900,NOEMI,3,0.000768,-432.5
4,1900,ALZIRE,3,0.000768,-432.5
...,...,...,...,...,...
992,1900,MARGUERITE,8058,2.063482,481.5
993,1900,LOUIS,9052,2.318024,482.5
994,1900,JEANNE,13981,3.580236,483.5
995,1900,JEAN,14100,3.610709,484.5


100.0
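
The per-year proportion and rank can also be computed without `apply`, using `transform` and the grouped `rank` method. This is an alternative sketch on made-up counts, not the code used above, and it leaves out the descending sort and the shifted ascending ranking.

```python
import pandas as pd

# Made-up counts for two years; 1901 contains a tie.
per_year = pd.DataFrame({
    "annais": ["1900", "1900", "1900", "1901", "1901"],
    "preusuel": ["MARIE", "JEAN", "LUCIEN", "MARIE", "JEAN"],
    "nombre": [60, 30, 10, 50, 50],
})

# Total births per year, broadcast back onto every row of that year.
year_total = per_year.groupby("annais")["nombre"].transform("sum")
per_year["proportion"] = per_year["nombre"] * 100 / year_total

# Rank within each year, most common name first; ties get their average rank.
per_year["rank"] = per_year.groupby("annais")["proportion"].rank(ascending=False)
print(per_year)
```

The tie in 1901 gives both names rank 1.5, which is the same averaging that produces shared ranks such as 919.0 for the names with a count of 3 in the 1900 output above.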

In [28]:
consistent = True
top_popularity = 200  # rank cutoff for names that remain consistently popular
top_percentage = 0.8
bottom_popularity = 800  # rank cutoff for names that remain consistently unpopular
bottom_percentage = 0.8
# Here it means that the name is in the top 200 names in at least 80% of the years
top_names = just_names_proportion[just_names_proportion['rank'] <= top_popularity]
# count the number of years in which each preusuel appears in the top_names frame
top_names = top_names.groupby('preusuel')['annais'].count().to_frame() 
top_names.reset_index(inplace=True)
top_names = top_names[top_names['annais'] >= int((max_year-min_year+1)*top_percentage)]
display(top_names)

# Here it means that the name is in the bottom 800 names in at least 80% of the years
bottom_names = just_names_proportion_ascend[just_names_proportion_ascend['rank'] <= bottom_popularity]
# count the number of years in which each preusuel appears in the bottom_names frame
bottom_names = bottom_names.groupby('preusuel')['annais'].count().to_frame()
bottom_names.reset_index(inplace=True)
bottom_names = bottom_names[bottom_names['annais'] >= int((max_year-min_year+1)*bottom_percentage)]
display(bottom_names)

top_names_list = top_names['preusuel'].tolist()
bottom_names_list = bottom_names['preusuel'].tolist()
bottom_names_list

Unnamed: 0,preusuel,annais
22,ALEXANDRE,97
70,ANTOINE,121
136,CHARLES,111
154,CLAIRE,106
181,CÉCILE,100
274,FRANÇOIS,102
323,HÉLÈNE,96
351,JEAN,120
457,LOUIS,100
507,MARC,98


Unnamed: 0,preusuel,annais
256,ADRIENNE,97
1963,BLAISE,103
3720,ELVIRE,101
5621,IDA,98
9244,MARIETTE,98


['ADRIENNE', 'BLAISE', 'ELVIRE', 'IDA', 'MARIETTE']
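
The "within the cutoff in at least 80% of the years" rule can be illustrated on a small made-up ranking. In this sketch `rank_cutoff` and `share` are hypothetical names, and the number of years is taken from the data, whereas the cell above uses `max_year - min_year + 1`.

```python
import pandas as pd

# Made-up ranks for two names over five years.
ranked = pd.DataFrame({
    "annais": ["1900", "1901", "1902", "1903", "1904"] * 2,
    "preusuel": ["JEAN"] * 5 + ["LUCIEN"] * 5,
    "rank": [1, 1, 1, 1, 2, 2, 2, 2, 2, 1],
})

rank_cutoff = 1  # hypothetical cutoff: only the top-ranked name counts
share = 0.8      # a name must make the cutoff in at least 80% of the years
n_years = ranked["annais"].nunique()

# Number of distinct years in which each name is within the cutoff.
years_in_top = (
    ranked[ranked["rank"] <= rank_cutoff]
    .groupby("preusuel")["annais"]
    .nunique()
)

# Keep the names that reach the required number of years.
consistent_names = years_in_top[years_in_top >= int(n_years * share)].index.tolist()
print(consistent_names)
```

JEAN is ranked first in four of the five years and passes the threshold of four, while LUCIEN is first only once and is dropped.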

In [29]:
# show the names that rank in the top `top_popularity` in at least `top_percentage` of the years
if consistent:
    consistent_top_name = just_names_proportion[just_names_proportion["preusuel"].isin(top_names_list)]
    single = alt.selection_single(on='mouseover', fields=['preusuel'])
    year_selection = alt.selection_interval(encodings=['x'])

    main_chart_popular = alt.Chart(consistent_top_name).mark_area().encode(
        alt.X("annais:T", title='Year'),
        alt.Y("proportion:Q", title='Occurrence percentage'),
        color=alt.Color("preusuel:N"),
        opacity = alt.condition(single, alt.value(1.0), alt.value(0.3)),
        tooltip=[alt.Tooltip(field='preusuel', title="Chosen name")],
    ).add_selection(
        single, year_selection
    ).properties(width=1100, height=300, title=f'The evolution of consistently popular names in France over time from {min_year} to {max_year}')

    chosen_name_over_time_popular = alt.Chart(consistent_top_name).mark_line().encode(
        alt.X("year(annais):O", title='Year'),
        alt.Y('sum(proportion):Q', title='Occurrence percentage'),
        text=alt.condition(single, 'preusuel', alt.value('')),
        color = alt.Color('preusuel:N'),
    ).transform_filter(
        single
    ).transform_filter(
        year_selection
    ).properties(width=1100, height=300, title=f'Line chart of the consistently popular name over time')
    display(main_chart_popular & chosen_name_over_time_popular)



In [30]:
if consistent:
    consistent_bottom_name = just_names_proportion_ascend[just_names_proportion_ascend["preusuel"].isin(bottom_names_list)]
    single = alt.selection_single(on='mouseover', fields=['preusuel'])
    year_selection = alt.selection_interval(encodings=['x'])

    main_chart_unpopular = alt.Chart(consistent_bottom_name).mark_area().encode(
        alt.X("annais:T", title='Year'),
        alt.Y("proportion:Q", title='Occurrence percentage'),
        color=alt.Color("preusuel:N"),
        opacity = alt.condition(single, alt.value(1.0), alt.value(0.3)),
        tooltip=[alt.Tooltip(field='preusuel', title="Chosen name")],
    ).add_selection(
        single, year_selection
    ).properties(width=1100, height=300, title=f'The evolution of consistently unpopular names in France over time from {min_year} to {max_year}')

    chosen_name_over_time_unpopular = alt.Chart(consistent_bottom_name).mark_line().encode(
        alt.X("year(annais):O", title='Year'),
        alt.Y('sum(proportion):Q', title='Occurrence percentage'),
        text=alt.condition(single, 'preusuel', alt.value('')),
        color = alt.Color('preusuel:N'),
    ).transform_filter(
        single
    ).transform_filter(
        year_selection
    ).properties(width=1100, height=300, title=f'Line chart of the consistently unpopular name over time')
    display(main_chart_unpopular & chosen_name_over_time_unpopular)



In [31]:
sudden_switch = 3000  # names that suddenly become popular/unpopular: rank change of at least 3000
best_rank = 20
# list names whose rank changes by at least sudden_switch between their best and worst years
just_names_proportion_change = just_names_proportion.groupby('preusuel').apply(lambda x: x.assign(rank_change = x['rank'].max() - x['rank'].min())).reset_index(drop=True)
just_names_proportion_change = just_names_proportion_change[(just_names_proportion_change['rank_change'] >= sudden_switch) & (just_names_proportion_change['rank'] <= best_rank) ]

In [32]:
# deduplicate the name list
just_names_proportion_change_unique = just_names_proportion_change.drop_duplicates(subset='preusuel')
change_name_list = just_names_proportion_change_unique['preusuel'].values.tolist()
len(change_name_list)

53
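
The selection of "sudden" names boils down to two per-name quantities: the spread between the worst and best rank, and the best rank itself. The sketch below computes both with a grouped aggregation on made-up data; the thresholds are scaled down for the toy example, and the approach is an alternative to the `apply` and `drop_duplicates` steps above rather than the original code.

```python
import pandas as pd

# Made-up ranks: KEVIN jumps from rank 900 to rank 5, JEAN stays near the top.
ranked = pd.DataFrame({
    "annais": ["1900", "1950", "2000"] * 2,
    "preusuel": ["KEVIN"] * 3 + ["JEAN"] * 3,
    "rank": [900.0, 400.0, 5.0, 2.0, 3.0, 10.0],
})

sudden_switch = 500  # toy threshold for this example
best_rank = 20

by_name = ranked.groupby("preusuel")["rank"]
summary = pd.DataFrame({
    "rank_change": by_name.max() - by_name.min(),  # worst rank minus best rank
    "best": by_name.min(),                         # best rank ever reached
})

# A name qualifies if it moved a lot and reached the top ranks at some point.
mask = (summary["rank_change"] >= sudden_switch) & (summary["best"] <= best_rank)
volatile = summary[mask].index.tolist()
print(volatile)
```

Only KEVIN qualifies: its rank moves by 895 places and it reaches rank 5, while JEAN never moves by more than 8 places.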

In [33]:
changement = True
if changement:
    change_name = just_names_proportion[just_names_proportion["preusuel"].isin(change_name_list)]
    single = alt.selection_single(on='mouseover', fields=['preusuel'])
    year_selection = alt.selection_interval(encodings=['x'])

    main_chart_change = alt.Chart(change_name).mark_area().encode(
        alt.X("annais:T", title='Year'),
        alt.Y("proportion:Q", title='Occurrence percentage'),
        color=alt.Color("preusuel:N"),
        opacity = alt.condition(single, alt.value(1.0), alt.value(0.3)),
        tooltip=[alt.Tooltip(field='preusuel', title="Chosen name")],
    ).add_selection(
        single, year_selection
    ).properties(width=1100, height=300, title=f'The evolution of suddenly popular/unpopular names in France over time from {min_year} to {max_year}')

    chosen_name_over_time_change = alt.Chart(change_name).mark_line().encode(
        alt.X("year(annais):O", title='Year'),
        alt.Y('sum(proportion):Q', title='Occurrence percentage'),
        text=alt.condition(single, 'preusuel', alt.value('')),
        color = alt.Color('preusuel:N'),
    ).transform_filter(
        single
    ).transform_filter(
        year_selection
    ).properties(width=1100, height=300, title=f'Line chart of the suddenly popular/unpopular name over time')
    display(main_chart_change & chosen_name_over_time_change)

