# Making chloropleth maps in Altair

Here's a quick example of how to make a chloropleth map in Altair.  In this example, we'll work with a fairly large data set of baby names in France from 1900-2019, broken down by department.

To work with geographical data, we'll use the `geopandas`, which loads `pandas` dataframes, but with support for geographical outlines in the `geojson` format.  You can use these dataframes just as you would a regular `pandas` dataframe, but they will include that extra geographical outline data.

To get started, we'll need to import our libraries.

In [1]:
import altair as alt
import pandas as pd
import geopandas as gpd # Requires geopandas -- e.g.: conda install -c conda-forge geopandas
alt.data_transformers.enable('json') # Let Altair/Vega-Lite work with large data sets
import numpy as np
pass

In [2]:
import os 
os.getcwd()

'C:\\Users\\CAMARA Yoane Ange E\\Desktop\\WPy64-3860\\notebooks'

# Reading our names data

Now, let's read in our dataset.  The exported data is in CSV format, but with a `;` separator instead of commas.  The INSEE data collapses rare names or where department-level information has been elided (presumably to protect individuals with uncommon names or who were one of the only ones born with that name in a given year).  We'll strip those out.

In [3]:
names = pd.read_csv("dpt2020.csv", sep=";")
names.drop(names[names.preusuel == '_PRENOMS_RARES'].index, inplace=True)
names.drop(names[names.dpt == 'XX'].index, inplace=True)

names.sample(5)

Unnamed: 0,sexe,preusuel,annais,dpt,nombre
691152,1,HENRI,1914,63,90
588517,1,GABRIEL,1996,84,14
1892660,2,ANITA,2009,75,4
3040176,2,MARIA,1959,4,3
1153648,1,MATTHIAS,2005,68,4


In [4]:
names.size

18341370

In [5]:
data_viz1=names[["annais","preusuel","nombre","sexe"]].groupby(["annais","preusuel","sexe"],as_index=False).sum()
total_annais=data_viz1.groupby('annais').sum().to_dict()["nombre"]
data_viz1.head(5)

Unnamed: 0,annais,preusuel,sexe,nombre
0,1900,ABEL,1,382
1,1900,ABRAHAM,1,9
2,1900,ACHILLE,1,152
3,1900,ACHILLES,1,4
4,1900,ADAM,1,9


In [6]:
data_viz1["%"]=data_viz1.apply(lambda x: 100*(x.nombre/total_annais[x.annais]),axis=1)

In [7]:
data_viz1["%"].describe()

count    257346.000000
mean          0.047018
std           0.220308
min           0.000348
25%           0.000653
50%           0.001968
75%           0.011166
max          12.474360
Name: %, dtype: float64

In [8]:
## Affect a pair (x,y) to each name
import random
list_names=data_viz1["preusuel"].drop_duplicates()
name_position={}
def position(name):
    if name.preusuel not in name_position.keys():
        name_position[name.preusuel]=(random.randrange(0,500),random.randrange(0,500))
    return name_position[name.preusuel]
        
data_viz1["position"]=data_viz1.apply(position,axis=1)
data_viz1["x"]=data_viz1["position"].apply(lambda x:x[0]) 
data_viz1["y"]=data_viz1["position"].apply(lambda x:x[1])

In [9]:
data_viz1["annais"]=data_viz1["annais"].astype("int64")
data_viz1.describe()

Unnamed: 0,annais,sexe,nombre,%,x,y
count,257346.0,257346.0,257346.0,257346.0,257346.0,257346.0
mean,1975.738683,1.535959,296.43663,0.047018,251.681639,247.717781
std,34.159966,0.498706,1372.815209,0.220308,142.788711,144.735266
min,1900.0,1.0,3.0,0.000348,0.0,0.0
25%,1951.0,1.0,4.0,0.000653,130.0,120.0
50%,1983.0,2.0,12.0,0.001968,252.0,246.0
75%,2006.0,2.0,69.0,0.011166,375.0,371.0
max,2020.0,2.0,53584.0,12.47436,499.0,499.0


### VISUALIZATION I

In [11]:
slider = alt.binding_range(min=1900, max=2019, step=1)
select_year = alt.selection_single(name='select', fields=['annais'],
                                   bind=slider, init={'annais': 1905})
points=alt.Chart(data_viz1,height=500,width=850).mark_text(size=3).encode(
    x=alt.X('x:Q',scale=alt.Scale(zero=False),axis=None),
    y=alt.Y('y:Q',scale=alt.Scale(zero=False),axis=None),
    size=alt.Size('%:Q',scale=alt.Scale(type='sqrt',domain=[0,5]),title='% of baby named x this year'),
    text='preusuel:N',
    color='orange:N',
    tooltip=[alt.Tooltip("preusuel:N"),alt.Tooltip('nombre:Q')]
).add_selection(
    select_year
).transform_filter(
    select_year
)

points

In [12]:
points.save("visualization1.json")

# Loading map data

Next, let's load some map data of regions in France using `geopandas`.  These map data come from the [INSEE] and [IGN] and were processed into the `geojson` format we'll need to work with by [Grégoire David].  Here's the [github] repository.

In this example, we'll work with the simplified departments tiles for the Hexagon, but that repository contains higher-resolution versions, the DOM-TOM, and more.

[Grégoire David]: https://gregoiredavid.fr
[INSEE]: http://www.insee.fr/fr/methodes/nomenclatures/cog/telechargement.asp
[IGN]: https://geoservices.ign.fr/adminexpress
[github]: https://github.com/gregoiredavid/france-geojson/

In [13]:
depts = gpd.read_file('departements-version-simplifiee.geojson')

depts.sample(5)

Unnamed: 0,code,nom,geometry
95,95,Val-d'Oise,"POLYGON ((2.59052 49.07965, 2.57203 49.06149, ..."
51,51,Marne,"POLYGON ((4.04797 49.40564, 4.07691 49.40161, ..."
34,34,Hérault,"POLYGON ((3.35836 43.91383, 3.42445 43.91160, ..."
41,41,Loir-et-Cher,"POLYGON ((0.84122 48.10306, 0.87589 48.10944, ..."
35,35,Ille-et-Vilaine,"MULTIPOLYGON (((-2.12371 48.60441, -2.14142 48..."


Notice how `depts` is a geopandas dataframe.  We'll use it just as a regular `pandas` dataframe, but it includes the geometry info we need to be able to draw those regions when we pass them into Altair.  We just need to make sure that when we work with our data, we keep them in a geopandas dataframe and not a plain dataframe if we want to draw the departments.

In the next cell, notice how we do a right-merge to bring in department data into names.  We do this as a merge on `depts` because we need a geopandas dataframe.  Remember, `depts` is a geopandas dataframe, while `names` is a regular dataframe.  If we did a left merge on `names`, we'd end up with a regular pandas dataframe. After this merge, both `names` and `depts` will be geopandas dataframes.

**Hint:** Be careful when you do your data joins here.  It's easy to accidentally merge the wrong way to accidentally create a _much bigger_ dataset.

In [14]:
# Keep a reference around to the plain pandas dataframe, without geometry data, just in case
just_names = names

names = depts.merge(names, how='right', left_on='code', right_on='dpt')

names.sample(5)

Unnamed: 0,code,nom,geometry,sexe,preusuel,annais,dpt,nombre
3659521,46,Lot,"POLYGON ((1.44826 45.01931, 1.47632 45.01845, ...",1,SERGE,1943,46,13
255498,75,Paris,"POLYGON ((2.41634 48.84924, 2.46226 48.84254, ...",2,PEGGY,1978,75,44
1672049,25,Doubs,"POLYGON ((6.80701 47.56280, 6.81666 47.54792, ...",2,MÉLANIE,1993,25,56
524763,76,Seine-Maritime,"POLYGON ((1.38155 50.06577, 1.40926 50.05707, ...",1,LOÏC,1952,76,3
2117147,83,Var,"MULTIPOLYGON (((6.43480 43.01554, 6.45520 43.0...",2,LILAS,2013,83,3


# Show a name over all years

Now we'll choose a name to show across all years.  To that, we'll group all of the names in a department together (squashing the years together) and use the sum.

In [15]:
grouped = names.groupby(['dpt', 'preusuel', 'sexe'], as_index=False).sum()
grouped = depts.merge(grouped, how='right', left_on='code', right_on='dpt') # Add geometry data back in
grouped.head(10)

Unnamed: 0,code,nom,geometry,dpt,preusuel,sexe,nombre
0,1,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",1,AARON,1,160
1,1,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",1,ABBY,2,3
2,1,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",1,ABDALLAH,1,7
3,1,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",1,ABDEL,1,3
4,1,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",1,ABDELKADER,1,3
5,1,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",1,ABDULLAH,1,3
6,1,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",1,ABEL,1,38
7,1,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",1,ABIGAELLE,2,3
8,1,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",1,ACHRAF,1,3
9,1,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",1,ADAM,1,314


### VISULIZATION II

In [16]:
## Most popular names by dept:
data_viz2=grouped.sort_values(["dpt","nombre"],ascending=False).drop_duplicates(["dpt"])
data_viz2.head(5)

Unnamed: 0,code,nom,geometry,dpt,preusuel,sexe,nombre
237876,,,,974,MARIE,2,199132
234433,,,,973,MARIE,2,1949
231899,,,,972,MARIE,2,19918
228481,,,,971,MARIE,2,12538
225058,95.0,Val-d'Oise,"POLYGON ((2.59052 49.07965, 2.57203 49.06149, ...",95,NICOLAS,1,6388


In [17]:
V2=alt.Chart(data_viz2).mark_geoshape(stroke='white').encode(
                    color="preusuel:N",
                    tooltip=["code","nom","preusuel","nombre"])
V2

In [18]:
V2.save("visualization2.json")

In [20]:
data_viz2_2=grouped.copy()
data_viz2_2["rank"]=grouped.groupby(["code"])["nombre"].rank(method='dense',ascending=False)
data_viz2_2=data_viz2_2[data_viz2_2["rank"]<5].sort_values(["dpt","rank"],ascending=False)
data_viz2_2.head(5)

Unnamed: 0,code,nom,geometry,dpt,preusuel,sexe,nombre,rank
237001,,,,974,JOSEPH,1,33393,3.0
236833,,,,974,JEAN,1,64514,2.0
237876,,,,974,MARIE,2,199132,1.0
222114,95.0,Val-d'Oise,"POLYGON ((2.59052 49.07965, 2.57203 49.06149, ...",95,ALEXANDRE,1,5091,4.0
225789,95.0,Val-d'Oise,"POLYGON ((2.59052 49.07965, 2.57203 49.06149, ...",95,SÉBASTIEN,1,5185,3.0


In [21]:
dropdown=alt.binding_select(options=[1,2,3,4], name='popularity_rank')
selection = alt.selection_single(fields=['rank'], bind=dropdown)
V2_2=alt.Chart(data_viz2_2).mark_geoshape().encode(
                  color="preusuel:N",
                  tooltip=["code","nom","nombre","preusuel"]
).add_selection(selection
).transform_filter(selection)
V2_2

In [22]:
V2_2.save("visualization2_2.json")

In [23]:
subsets=grouped[grouped.preusuel.isin(["JEANNE","MARIE","LUCIEN","MOHAMED","THOMAS","SANDRA"])].sort_values("preusuel",ascending=False)
subsets=subsets.sort_values("preusuel")
subsets.head(4)

Unnamed: 0,code,nom,geometry,dpt,preusuel,sexe,nombre
795,1,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",1,JEANNE,2,4101
6089,4,Alpes-de-Haute-Provence,"POLYGON ((5.67604 44.19143, 5.69209 44.18648, ...",4,JEANNE,2,841
83050,42,Loire,"POLYGON ((3.89953 46.27591, 3.90940 46.25773, ...",42,JEANNE,2,14314
106542,54,Meurthe-et-Moselle,"POLYGON ((5.47091 49.49721, 5.54118 49.51526, ...",54,JEANNE,2,8440


Now let's pick a name and check out how it's distribution over the last 120 years across Metropolitan France.  In this example, I choose the name “Lucien,” which I rather like for some reason.

In [24]:
dropdown=alt.binding_select(options=["JEANNE","MOHAMED","LUCIEN","MARIE","THOMAS","SANDRA"], name='name')
selection = alt.selection_single(fields=['preusuel'], bind=dropdown)
map1=alt.Chart(subsets).mark_geoshape(stroke='white').encode(
    tooltip=['nom', 'code', 'nombre'],
    color='nombre:Q',
).properties(width=800, height=600
).add_selection(selection
).transform_filter(selection
)
map1

In [25]:
map1.save("visualization2_3.json")

In [151]:
name = 'LUCIEN'
subset = grouped[grouped.preusuel == name]
map1=alt.Chart(subset).mark_geoshape(stroke='white').encode(
    tooltip=['nom', 'code', 'nombre'],
    color='nombre',
).properties(width=800, height=600
).add_selection()

map1

### VISUALISATION III

In [26]:
slider = alt.binding_range(min=1900, max=2019, step=1)
select_year = alt.selection_single(name='select', fields=['annais'],
                                   bind=slider, init={'annais': 1905})
points=alt.Chart(data_viz1,height=500,width=850).mark_text(size=3).encode(
    x=alt.X('x:Q',scale=alt.Scale(zero=False),axis=None),
    y=alt.Y('y:Q',scale=alt.Scale(zero=False),axis=None),
    size=alt.Size('%:Q',scale=alt.Scale(type='sqrt',domain=[0,5]),title='% of baby named x this year'),
    text='preusuel:N',
    color='sexe:N',
    tooltip=[alt.Tooltip("preusuel:N"),alt.Tooltip('nombre:Q')]
).add_selection(
    select_year
).transform_filter(
    select_year
)

points

In [27]:
points.save("visualization3.json")

In [28]:
data_viz3=names[["annais","preusuel","nombre","sexe"]].groupby(["annais","preusuel","sexe"],as_index=False).sum()
total_annais_sex=data_viz3.groupby(['annais',"sexe"],as_index=False).sum().sort_values(["annais","sexe"])
total_annais_sex=total_annais_sex.drop("sexe",axis=1).groupby("annais").agg({"nombre":lambda x:list(x)}).to_dict()["nombre"]
data_viz3["%"]=data_viz3.apply(lambda x: 100*(x.nombre/total_annais_sex[x.annais][(int)(x.sexe)-1]),axis=1)
data_viz3.head(4)

Unnamed: 0,annais,preusuel,sexe,nombre,%
0,1900,ABEL,1,382,0.228585
1,1900,ABRAHAM,1,9,0.005386
2,1900,ACHILLE,1,152,0.090955
3,1900,ACHILLES,1,4,0.002394


In [29]:
import operator
M_name_position={}
F_name_position={}
def position(name):
    if name.sexe=="1":
        if name.preusuel not in M_name_position.keys():
            if name.preusuel not in F_name_position.keys():
                M_name_position[name.preusuel]=(random.randrange(0,500),random.randrange(0,500))
            else:
                M_name_position[name.preusuel]=tuple(map(operator.add,F_name_position[name.preusuel],(random.randrange(50,100),random.randrange(50,100))))
        return M_name_position[name.preusuel]
    else:
        if name.preusuel not in F_name_position.keys():
            if name.preusuel not in M_name_position.keys():
                F_name_position[name.preusuel]=(random.randrange(0,500),random.randrange(0,500))
            else :
                F_name_position[name.preusuel]=tuple(map(operator.add,M_name_position[name.preusuel],(random.randrange(50,100),random.randrange(50,100))))
        return F_name_position[name.preusuel]

In [30]:
data_viz3["position"]=data_viz3.apply(position,axis=1)
data_viz3["x"]=data_viz3["position"].apply(lambda x:x[0]) 
data_viz3["y"]=data_viz3["position"].apply(lambda x:x[1])

In [31]:
## Keep only the name valid for both sex:
check=data_viz3[["preusuel","sexe"]].drop_duplicates().groupby("preusuel",as_index=False).agg({"sexe":lambda x:list(x)})
check["len"]=check.apply(lambda x:len(x.sexe),axis=1)
check_list=check[check.len==2].preusuel.values

In [32]:
data_viz3["annais"]=data_viz3["annais"].astype("int64")
data_viz3=data_viz3[data_viz3.preusuel.isin(check_list)]

In [33]:
data_viz3[["preusuel","sexe","position"]].drop_duplicates().sort_values("preusuel").head(25)

Unnamed: 0,preusuel,sexe,position
1032,ABDON,1,"(85, 240)"
20521,ABDON,2,"(85, 240)"
0,ABEL,1,"(7, 266)"
17391,ABEL,2,"(7, 266)"
2,ACHILLE,1,"(458, 101)"
16154,ACHILLE,2,"(458, 101)"
4,ADAM,1,"(113, 179)"
212607,ADAM,2,"(113, 179)"
107249,ADAMA,1,"(485, 315)"
83084,ADAMA,2,"(485, 315)"


In [34]:
slider = alt.binding_range(min=1900, max=2019, step=1)
select_year = alt.selection_single(name='select', fields=['annais'],
                                   bind=slider, init={'annais': 1905})
points=alt.Chart(data_viz3,height=500,width=850).mark_text(size=3).encode(
    x=alt.X('x:Q',scale=alt.Scale(zero=False),axis=None),
    y=alt.Y('y:Q',scale=alt.Scale(zero=False),axis=None),
    size=alt.Size('%:Q',scale=alt.Scale(type='sqrt',domain=[0,10]),title='% of baby named x this year'),
    text='preusuel:N',
    color="sexe:N",
    tooltip=[alt.Tooltip("preusuel:N"),alt.Tooltip('nombre:Q'),alt.Tooltip("sexe:N")]
).add_selection(
    select_year
).transform_filter(
    select_year
)



points

In [35]:
points.save("visualization3_1.json")