# Visualization 2

--- **Group B** ---
 
Is there a regional effect in the data?

Are some names more popular in some regions?

Are popular names generally popular across the whole country?

-------------------------------------------------------------

 
   This project was based on a `Chloropleth maps in Altair.ipynb` file, provided in `Names hints.zip`:
 

# Making choropleth maps in Altair

Here's a quick example of how to make a choropleth map in Altair.  In this example, we'll work with a fairly large data set of baby names in France from 1900-2019, broken down by department.

To work with geographical data, we'll use `geopandas`, which extends `pandas` dataframes with support for geographical outlines, here loaded from the `geojson` format.  You can use these dataframes just as you would a regular `pandas` dataframe, but they will include that extra geographical outline data.

To get started, we'll need to import our libraries.

In [1]:
import altair as alt
import pandas as pd
import geopandas as gpd # Requires geopandas -- e.g.: conda install -c conda-forge geopandas
alt.data_transformers.enable('json') # Let Altair/Vega-Lite work with large data sets

pass

# Reading our names data

Now, let's read in our dataset.  The exported data is in CSV format, but with a `;` separator instead of commas.  The INSEE data collapses rare names into a single `_PRENOMS_RARES` entry and elides department-level information in some rows (marked `XX`), presumably to protect individuals with uncommon names or who were among the only ones born with that name in a given year.  We'll strip those rows out.

In [2]:
# Replace with 'dpttest.csv' for debug
names = pd.read_csv("dpt2020.csv", sep=";")

# Uncomment for debug
# names['dpt'] = names['dpt'].astype(str)

# Remove rows with three-digit codes in the 'dpt' column
names.drop(names[names['dpt'].apply(lambda x: len(x) == 3)].index, inplace=True)

names.drop(names[names.preusuel == '_PRENOMS_RARES'].index, inplace=True)
names.drop(names[names.dpt == 'XX'].index, inplace=True)

names.sample(5)

Unnamed: 0,sexe,preusuel,annais,dpt,nombre
2394363,2,ESTELLE,1960,45,3
1089037,1,MARC,1964,10,23
151444,1,ANTONIO,1921,34,9
1867830,2,ANDRÉE,1930,11,46
801424,1,JEAN-GUY,1953,24,3
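
The same filtering can also be expressed as a single boolean mask instead of repeated `drop` calls. The sketch below is only an illustration on a tiny made-up table (the rows are invented, not taken from `dpt2020.csv`), and it assumes the `dpt` column is read as strings; the names `raw`, `toy`, and `toy_clean` are hypothetical.

```python
import io
import pandas as pd

# Tiny made-up sample in the same ';'-separated layout (not real INSEE rows)
raw = """sexe;preusuel;annais;dpt;nombre
1;MARC;1964;10;23
2;_PRENOMS_RARES;1964;10;150
1;JEAN;XXXX;XX;12
2;ANNE;1970;971;5
"""

# Reading 'dpt' as str keeps department codes such as '01' or '2A' intact
toy = pd.read_csv(io.StringIO(raw), sep=";", dtype={"dpt": str, "annais": str})

# Keep rows with a real name, a known department, and a code that is not three characters long
mask = (
    (toy["preusuel"] != "_PRENOMS_RARES")
    & (toy["dpt"] != "XX")
    & (toy["dpt"].str.len() != 3)
)
toy_clean = toy[mask].reset_index(drop=True)
print(toy_clean)
```

Building the mask once avoids mutating the dataframe in place several times, which can make the cell easier to re-run.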


# Loading map data

Next, let's load some map data of regions in France using `geopandas`.  These map data come from the [INSEE] and [IGN] and were processed into the `geojson` format we'll need to work with by [Grégoire David].  Here's the [github] repository.

In this example, we'll work with the simplified departments tiles for the Hexagon, but that repository contains higher-resolution versions, the DOM-TOM, and more.

[Grégoire David]: https://gregoiredavid.fr
[INSEE]: http://www.insee.fr/fr/methodes/nomenclatures/cog/telechargement.asp
[IGN]: https://geoservices.ign.fr/adminexpress
[github]: https://github.com/gregoiredavid/france-geojson/

In [3]:
depts = gpd.read_file('departements-version-simplifiee.geojson')

depts.sample(5)

Unnamed: 0,code,nom,geometry
79,79,Deux-Sèvres,"POLYGON ((-0.89196 46.97582, -0.85592 46.97908..."
26,28,Eure-et-Loir,"POLYGON ((0.81482 48.67017, 0.82767 48.68072, ..."
25,27,Eure,"POLYGON ((0.29722 49.42986, 0.33898 49.44093, ..."
18,19,Corrèze,"POLYGON ((1.89873 45.69828, 1.91552 45.71126, ..."
56,56,Morbihan,"MULTIPOLYGON (((-3.42179 47.62000, -3.44067 47..."


Notice how `depts` is a geopandas dataframe.  We'll use it just as a regular `pandas` dataframe, but it includes the geometry info we need to be able to draw those regions when we pass them into Altair.  We just need to make sure that when we work with our data, we keep them in a geopandas dataframe and not a plain dataframe if we want to draw the departments.

In the next cell, notice how we do a right merge to bring department data into `names`.  We call `merge` on `depts` because we need the result to be a geopandas dataframe.  Remember, `depts` is a geopandas dataframe, while `names` is a regular dataframe.  If we instead called `merge` on `names` with a left merge, we'd end up with a regular pandas dataframe. After this merge, both `names` and `depts` will be geopandas dataframes.

**Hint:** Be careful when you do your data joins here.  It's easy to merge the wrong way and accidentally create a _much bigger_ dataset.

In [4]:
# Keep a reference around to the plain pandas dataframe, without geometry data, just in case
just_names = names

# Replace '2A' with '20' in the 'code' column
depts.loc[depts['code'] == '2A', 'code'] = '20'

# Remove rows with '2B' in the 'code' column
depts.drop(depts[depts['code'] == '2B'].index, inplace=True)

# Remove rows with three-digit codes in the 'code' column
depts.drop(depts[depts['code'].apply(lambda x: len(x) == 3)].index, inplace=True)

# Merge the modified 'depts' dataframe into 'names'
names = depts.merge(names, how='right', left_on='code', right_on='dpt')

names.sample(5)

Unnamed: 0,code,nom,geometry,sexe,preusuel,annais,dpt,nombre
1708391,68,Haut-Rhin,"POLYGON ((7.19828 48.31047, 7.24173 48.30243, ...",2,ANAE,2010,68,4
2181272,69,Rhône,"POLYGON ((4.38808 46.21979, 4.39205 46.26302, ...",2,ELSA,1997,69,34
3079662,49,Maine-et-Loire,"POLYGON ((-1.24588 47.77672, -1.23825 47.80999...",2,NELLY,1940,49,4
302961,53,Mayenne,"POLYGON ((-1.07016 48.50849, -1.06055 48.51534...",1,CORENTIN,2007,53,12
3377456,60,Oise,"POLYGON ((1.78384 49.75831, 1.80898 49.75433, ...",2,TANYA,2010,60,4
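
One way to guard against the accidental blow-up mentioned in the hint is the `validate` argument of pandas' `merge`. The sketch below uses plain pandas dataframes with made-up rows (`toy_depts` and `toy_names` are hypothetical stand-ins for `depts` and `names`); it is an assumption here that the geopandas `merge` accepts the same keyword, since it builds on the pandas one.

```python
import pandas as pd

# Hypothetical stand-ins: one row per department code, many name rows per code
toy_depts = pd.DataFrame({"code": ["01", "02"], "nom": ["Ain", "Aisne"]})
toy_names = pd.DataFrame({
    "dpt": ["01", "01", "02"],
    "preusuel": ["LUCIEN", "MARC", "LUCIEN"],
    "nombre": [5, 7, 3],
})

# validate="one_to_many" raises pandas.errors.MergeError
# if 'code' is duplicated on the left side
merged = toy_depts.merge(
    toy_names, how="right", left_on="code", right_on="dpt", validate="one_to_many"
)

# With unique keys on the left, a right merge keeps one output row per name row
print(len(merged), len(toy_names))
print(merged)
```

Comparing the row count before and after the merge is a cheap additional check that no rows were multiplied.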


# Show a name over all years

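Now we'll choose a name to show across all years.  To do that, we group the counts by department, name, and sex and sum them.  Note that the cell below also keeps the year (`annais`) in the grouping key, so the years are not yet squashed together; the interactive code further down sums over the years when 'All Years' is selected.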

In [5]:
# Exclude 'geometry' column from names DataFrame
names_no_geometry = names.drop('geometry', axis=1)

# Group by 'dpt', 'preusuel', 'sexe', and 'annais' columns and sum the 'nombre' column
grouped = names_no_geometry.groupby(['dpt', 'preusuel', 'sexe', 'annais'], as_index=False)['nombre'].sum()

# Merge with 'depts' DataFrame to add geometry data back in
grouped_with_geometry = depts.merge(grouped, how='right', left_on='code', right_on='dpt')

grouped_with_geometry

Unnamed: 0,code,nom,geometry,dpt,preusuel,sexe,annais,nombre
0,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,AARON,1,2005,3
1,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,AARON,1,2007,4
2,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,AARON,1,2008,6
3,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,AARON,1,2009,7
4,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,AARON,1,2010,8
...,...,...,...,...,...,...,...,...
3471082,95,Val-d'Oise,"POLYGON ((2.59052 49.07965, 2.57203 49.06149, ...",95,ÉVA,2,2019,7
3471083,95,Val-d'Oise,"POLYGON ((2.59052 49.07965, 2.57203 49.06149, ...",95,ÉVA,2,2020,6
3471084,95,Val-d'Oise,"POLYGON ((2.59052 49.07965, 2.57203 49.06149, ...",95,ÉVAN,1,2019,4
3471085,95,Val-d'Oise,"POLYGON ((2.59052 49.07965, 2.57203 49.06149, ...",95,ÖMER,1,2011,3
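
To make the effect of the grouping key concrete, here is a minimal sketch on made-up counts (the dataframe `toy` and its values are invented for illustration). It contrasts a grouping that keeps `annais`, as in the cell above, with one that drops it so the years are summed together.

```python
import pandas as pd

# Invented counts for one name in two departments
toy = pd.DataFrame({
    "dpt": ["01", "01", "01", "02"],
    "preusuel": ["AARON", "AARON", "AARON", "AARON"],
    "sexe": [1, 1, 1, 1],
    "annais": ["2005", "2007", "2008", "2005"],
    "nombre": [3, 4, 6, 2],
})

# 'annais' in the key: one row per department, name, sex, and year
per_year = toy.groupby(["dpt", "preusuel", "sexe", "annais"], as_index=False)["nombre"].sum()

# 'annais' left out of the key: counts are summed over all years
all_years = toy.groupby(["dpt", "preusuel", "sexe"], as_index=False)["nombre"].sum()

print(per_year)
print(all_years)
```

Keeping the year in the key preserves the option to filter by year later, at the cost of a larger table.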


In [6]:
# Now let's pick a name and check out its distribution over the last 120 years across Metropolitan France.  In this example, I choose the name “Lucien,” which I rather like for some reason.

In [7]:
#name = 'LUCIEN'

# Filter the subset based on the specified name
# subset = grouped_with_geometry[grouped_with_geometry.preusuel == name]

# Generate the chart using Altair
#import altair as alt

#chart = alt.Chart(subset).mark_geoshape(stroke='white').encode(
#    tooltip=['nom', 'code', 'nombre'],
#    color='nombre',
#).properties(width=800, height=600)

#chart

# IMPLEMENTATION

Description:

The idea is based on the initial sketch made for the second visualisation. The program lets the user manually pick a *year* and a *name*, and then view the number of occurrences of that name in two ways: as raw counts shown on the map (using a color-scheme gradient), and as a difference relative to the other departments shown in a table (with text colored green or red depending on whether the difference is positive or negative).

-------------------------------------------------------------

In [8]:
# Andrei Ostanin, Group B, IGD M1
# 2023

# External libraries might be required

import ipywidgets as widgets
from IPython.display import display, clear_output, HTML
pd.options.mode.chained_assignment = None

# Get the range of years from the 'annais' column in the 'names' DataFrame
years = sorted(names['annais'].unique(), reverse=True)

# Add an option for combined data
years.insert(0, 'All Years')

# Create a dropdown widget for year selection
year_dropdown = widgets.Dropdown(options=years, description='Year:')
display(year_dropdown)

# Create a text input widget for name
name_input = widgets.Text(placeholder='Enter a name')
display(name_input)

show_button = widgets.Button(description='Show')
display(show_button)

# Create an output widget for displaying the chart
output = widgets.Output()
display(output)

def create_table(selected_year, subset, column_name):
    # Get the unique department names present in the filtered subset
    department_names = subset['nom'].unique()

    # Create a dropdown widget for department selection
    department_dropdown = widgets.Dropdown(options=department_names, description='Department:')
    display(department_dropdown)

    # Create a button widget for table display
    table_button = widgets.Button(description='Compare')
    display(table_button)
    
    table_output = widgets.Output()
    display(table_output)

    # Define a function to handle button click for the table
    def on_table_button_click(button):
        nonlocal subset

        with table_output:
            clear_output()
            
            # Get the selected department from the dropdown widget
            selected_department = department_dropdown.value

            # Filter the subset based on the selected department
            subset_department = subset[subset.nom == selected_department]
            
            if selected_year != 'All Years':
                region_totals = grouped_with_geometry[grouped_with_geometry['annais'] == selected_year].groupby('nom')['nombre'].sum().reset_index()
            else:
                region_totals = grouped_with_geometry.groupby('nom')['nombre'].sum().reset_index()
            
            # The calculation is not polished for now! Fix required
            department_count = subset_department[column_name].iloc[0]
            department_total = region_totals[region_totals['nom'] == selected_department]['nombre'].iloc[0]
            
            subset['percentage_difference'] = ((department_count - subset[column_name]) / department_total) * 100
            subset = subset.sort_values('percentage_difference', ascending=False)

            # Create the table with the required columns
            table = subset[[column_name, 'percentage_difference', 'nom']].drop_duplicates().reset_index(drop=True).rename_axis(' ')
            table.columns = ['Amount', 'Difference', 'Department']

            # Control the text colors
            def format_table_cell(value):
                if value > 0:
                    return '<span style="color: green;">{:.2f}%</span>'.format(value)
                elif value < 0:
                    return '<span style="color: red;">{:.2f}%</span>'.format(value)
                else:
                    return '{:.2f}%'.format(value)

            table['Difference'] = table['Difference'].apply(format_table_cell)

            # Display the table
            display(HTML(table.to_html(index=False, escape=False)))

    # Attach the button click event handler for table display
    table_button.on_click(on_table_button_click)

def on_button_click(button):
    
    with output:
        clear_output()
        
        # Ignore surrounding whitespace and letter case in the entered name
        name = name_input.value.strip().lower()
        selected_year = year_dropdown.value

        # Filter the subset based on the entered name and selected year
        if selected_year == 'All Years':
            
            subset = grouped_with_geometry[grouped_with_geometry.preusuel.str.lower() == name]
            
            if subset.empty:
                print(f"No data found for the name '{name}' in the year '{selected_year}'. Please enter another name or select another year.")
                
            else:
                total_counts = subset.groupby('code')['nombre'].sum().reset_index()
                subset = subset.merge(total_counts, on='code', suffixes=('', '_total'))

                # Generate the chart map
                chart_map = alt.Chart(subset).mark_geoshape(stroke='white').encode(
                    tooltip=[
                        alt.Tooltip('nom', title='Region'),
                        alt.Tooltip('code', title='Code'),
                        alt.Tooltip('nombre_total', title='Total Count')
                    ],
                    color=alt.Color('nombre_total', legend=alt.Legend(title='Total occurrences')).scale(scheme='plasma',reverse=True),
                ).properties(width=600, height=400)
                
                chart_map.display()
            
                create_table(selected_year, subset, 'nombre_total')
            
        else:
            
            subset = grouped_with_geometry[(grouped_with_geometry.preusuel.str.lower() == name) & (grouped_with_geometry.annais == selected_year)]

            # Check if the subset is empty
            if subset.empty:
                print(f"No data found for the name '{name}' in the year '{selected_year}'. Please enter another name or select another year.")
                
            else:
                # Generate the chart using Altair
                chart_map = alt.Chart(subset).mark_geoshape(stroke='white').encode(
                    tooltip=[
                        alt.Tooltip('nom', title='Region'),
                        alt.Tooltip('code', title='Code'),
                        alt.Tooltip('nombre', title='Count')
                    ],
                    #color='nombre',
                    color=alt.Color('nombre', legend=alt.Legend(title='Occurrences')).scale(scheme='magma',reverse=True),
                ).properties(width=600, height=400)
                
                chart_map.display()
                
                create_table(selected_year, subset, 'nombre')  # Pass the column name 'nombre'
            
show_button.on_click(on_button_click)


Dropdown(description='Year:', options=('All Years', '2020', '2019', '2018', '2017', '2016', '2015', '2014', '2…

Text(value='', placeholder='Enter a name')

Button(description='Show', style=ButtonStyle())

Output()

---------------------------------------------------------------------------------------------------------------------------

Percentage difference = (( Name occurrence in selected dept - Name occurrence in other dept) / Total population in selected dept) x 100

Example:

    Total population of dept 1: 150
    Number of occurrences of the name in dept 1: 30

    Total population of dept 2: 200
    Number of occurrences of the name in dept 2: 40

Then:

    Percentage difference = ((30 - 40) / 150) * 100
    Percentage difference = (-10 / 150) * 100
    Percentage difference = -6.67%
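
The worked example above can be reproduced in a few lines of Python. This is only a sketch of the formula as written; the function name `percentage_difference` and its parameter names are hypothetical and do not appear in the notebook code.

```python
def percentage_difference(selected_count, other_count, selected_total):
    """Difference between two name counts, as a percentage of the selected department's total."""
    return (selected_count - other_count) / selected_total * 100


# Values from the example: dept 1 is the selected department, dept 2 the other one
result = percentage_difference(selected_count=30, other_count=40, selected_total=150)
print(f"{result:.2f}%")  # prints "-6.67%"
```

As in the formula, only the selected department's total enters the calculation; the total of the other department (200 in the example) is not used.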