# Formatting the INSEE data files

For each decade, INSEE provides a zip file containing ten csv files, each covering a single year.
Visit https://www.insee.fr/fr/information/4190491 to obtain the zip file(s) of interest.

The code below merges all the files and returns a single csv covering the whole decade.

Dr. Morgane FORTIN, Sept. 2020

### Importing libraries

In [1]:
import pandas as pd
import datetime as dt
import numpy as np
from zipfile import ZipFile

### Choosing the INSEE zip file to format

In [2]:
zip_file = ZipFile('deces-1970-1979-csv.zip')

### Formatting the data

In [3]:
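# Read every csv file in the archive and merge them into a single dataframe.
# DataFrame.append was removed in pandas 2.0, so the yearly frames are
# collected in a list and concatenated once with pd.concat.
frames = []
for text_file in zip_file.infolist():
    if text_file.filename.endswith('.csv'):
        # dtype=str keeps the leading zeros of codes and dates
        with zip_file.open(text_file.filename) as csv_file:
            frames.append(pd.read_csv(csv_file, delimiter=';', dtype=str))
insee = pd.concat(frames, ignore_index=True)
insee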

Unnamed: 0,nomprenom,sexe,datenaiss,lieunaiss,commnaiss,paysnaiss,datedeces,lieudeces,actedeces
0,DUCRET*MARIE ANTOINETTE/,2,19220109,01004,AMBERIEU-EN-BUGEY,,19701210,01421,6
1,GRANGEON*ERIC JEAN REMY/,1,19690329,01004,AMBERIEU-EN-BUGEY,,19700425,69383,1059
2,VELLET*PHILIPPE/,1,19700201,01004,AMBERIEU-EN-BUGEY,,19700203,01004,12
3,PRESSAVIN*LYDIE/,2,19700406,01004,AMBERIEU-EN-BUGEY,,19700406,01004,33
4,DOUAT*MARIE-SYLVIA MARTINE/,2,19700708,01004,AMBERIEU-EN-BUGEY,,19700708,01053,457
...,...,...,...,...,...,...,...,...,...
3329654,LOUBEAU*NADEGE CATHERINE PASCALE/,2,19710612,95680,VILLIERS-LE-BEL,,19790204,75114,421
3329655,DENEL*YASMINE ALEXANDRA/,2,19770517,95680,VILLIERS-LE-BEL,,19791225,60175,348
3329656,WEISSENSEEL*CELINE/,2,19790220,95680,VILLIERS-LE-BEL,,19790222,93048,164
3329657,GUSTAN*JEAN-LOUIS/,1,19790509,95680,VILLIERS-LE-BEL,,19790905,95500,390


#### Columns:
1. nomprenom: family name (*nom*) and first name(s) (*prenoms*), separated by `*` and terminated by `/`
2. sexe: gender, 1 for male and 2 for female
3. datenaiss: date of birth; format YYYYMMDD
4. lieunaiss: INSEE commune code of the place of birth (not the postcode, e.g. 01004 for AMBERIEU-EN-BUGEY)
5. commnaiss: name of the place of birth
6. paysnaiss: country of birth (empty when born in France)
7. datedeces: date of death; format YYYYMMDD
8. lieudeces: INSEE commune code of the place of death
9. actedeces: reference of the death record
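
To make the column layout concrete, here is a minimal sketch that decodes a single raw record with the standard library only. The helpers `parse_yyyymmdd` and `parse_record` are hypothetical names introduced for illustration; they are not part of the notebook. The sketch assumes that a date which cannot be parsed (for instance a month of `00`) should simply become `None`.

```python
from datetime import datetime


def parse_yyyymmdd(text):
    """Return a date for a YYYYMMDD string, or None if it is not a valid date."""
    try:
        return datetime.strptime(text, "%Y%m%d").date()
    except ValueError:
        return None


def parse_record(row):
    """Decode one raw row (a dict of strings) into separate, typed fields."""
    # The name field looks like 'FAMILYNAME*FIRST NAMES/'
    nom, _, prenoms = row["nomprenom"].rstrip("/").partition("*")
    return {
        "nom": nom,
        "prenoms": prenoms,
        "sexe": {"1": "male", "2": "female"}.get(row["sexe"]),
        "naissance": parse_yyyymmdd(row["datenaiss"]),
        "deces": parse_yyyymmdd(row["datedeces"]),
    }


# First record of the table above
sample = {
    "nomprenom": "DUCRET*MARIE ANTOINETTE/",
    "sexe": "2",
    "datenaiss": "19220109",
    "datedeces": "19701210",
}
record = parse_record(sample)
print(record["nom"], record["prenoms"], record["naissance"], record["deces"])
```

The notebook itself keeps year, month and day as separate strings rather than building date objects, which avoids losing records whose dates are incomplete.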

In [4]:
# Split the date of birth into year (anneenaiss), month (moisnaiss) and day (journaiss)
insee['anneenaiss'] = insee.datenaiss.str[0:4]
insee['moisnaiss'] = insee.datenaiss.str[4:6]
insee['journaiss'] = insee.datenaiss.str[6:]

# Similarly for the date of death (anneedeces, moisdeces, jourdeces)
insee['anneedeces'] = insee.datedeces.str[0:4]
insee['moisdeces'] = insee.datedeces.str[4:6]
insee['jourdeces'] = insee.datedeces.str[6:] 

# Drop the columns with the original date of birth and death
# (drop returns a new dataframe, so the result must be assigned)
insee = insee.drop(columns=['datedeces', 'datenaiss'])

# When the country of birth (paysnaiss) is missing, replace it by FRANCE
insee.paysnaiss = insee.paysnaiss.fillna('FRANCE')

# Split the name column (nomprenom) into two: family name (nom) and first name(s) (prenoms)
insee[["nom","prenoms"]] = insee["nomprenom"].str.replace('/', '', regex=False).str.split("*", n=1, expand=True)

# Rearrange the dataframe:
# 1-family name, 
# 2-first names, 
# 3-gender 1 for male and 2 for female,
# 4-day of birth, 
# 5-month of birth, 
# 6-year of birth, 
# 7-INSEE commune code of the place of birth, 
# 8-place of birth, 
# 9-country of birth, 
# 10-day of death, 
# 11-month of death, 
# 12-year of death, 
# 13-INSEE commune code of the place of death, 
# 14-reference of the death record
insee=insee[["nom","prenoms", "sexe", "journaiss","moisnaiss","anneenaiss","lieunaiss","commnaiss","paysnaiss","jourdeces","moisdeces","anneedeces","lieudeces","actedeces"]]
insee

Unnamed: 0,nom,prenoms,sexe,journaiss,moisnaiss,anneenaiss,lieunaiss,commnaiss,paysnaiss,jourdeces,moisdeces,anneedeces,lieudeces,actedeces
0,DUCRET,MARIE ANTOINETTE,2,09,01,1922,01004,AMBERIEU-EN-BUGEY,FRANCE,10,12,1970,01421,6
1,GRANGEON,ERIC JEAN REMY,1,29,03,1969,01004,AMBERIEU-EN-BUGEY,FRANCE,25,04,1970,69383,1059
2,VELLET,PHILIPPE,1,01,02,1970,01004,AMBERIEU-EN-BUGEY,FRANCE,03,02,1970,01004,12
3,PRESSAVIN,LYDIE,2,06,04,1970,01004,AMBERIEU-EN-BUGEY,FRANCE,06,04,1970,01004,33
4,DOUAT,MARIE-SYLVIA MARTINE,2,08,07,1970,01004,AMBERIEU-EN-BUGEY,FRANCE,08,07,1970,01053,457
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3329654,LOUBEAU,NADEGE CATHERINE PASCALE,2,12,06,1971,95680,VILLIERS-LE-BEL,FRANCE,04,02,1979,75114,421
3329655,DENEL,YASMINE ALEXANDRA,2,17,05,1977,95680,VILLIERS-LE-BEL,FRANCE,25,12,1979,60175,348
3329656,WEISSENSEEL,CELINE,2,20,02,1979,95680,VILLIERS-LE-BEL,FRANCE,22,02,1979,93048,164
3329657,GUSTAN,JEAN-LOUIS,1,09,05,1979,95680,VILLIERS-LE-BEL,FRANCE,05,09,1979,95500,390


In [5]:
# Export the file to a single csv
insee.to_csv('Insee_70s.csv')  
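
One point to keep in mind when reading the exported file back: by default `pd.read_csv` infers numeric types, so codes and zero-padded days or months lose their leading zeros. The sketch below illustrates this on a tiny stand-in table written to an in-memory buffer (the names `formatted`, `inferred` and `as_text` are only for illustration); passing `dtype=str`, as done when reading the INSEE files, keeps the values as text.

```python
import io

import pandas as pd

# Tiny stand-in for the formatted table (values taken from the output above)
formatted = pd.DataFrame(
    {
        "nom": ["DUCRET", "GRANGEON"],
        "journaiss": ["09", "29"],
        "lieunaiss": ["01004", "01004"],
    }
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
formatted.to_csv(buffer)

# Default type inference turns '01004' into the integer 1004
buffer.seek(0)
inferred = pd.read_csv(buffer, index_col=0)

# dtype=str preserves the text exactly as exported
buffer.seek(0)
as_text = pd.read_csv(buffer, index_col=0, dtype=str)

print(inferred["lieunaiss"].tolist())
print(as_text["lieunaiss"].tolist())
```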

### Follow-up
Once a single csv has been generated for each decade of interest, I recommend zipping them all into one archive.
Insee_search.ipynb shows how to run searches in the resulting zip of formatted csv files.
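
The zipping step can be done by hand or, as a minimal sketch, with the standard `zipfile` module. The function name `zip_csv_files` and the file names in the commented usage are assumptions for illustration; adapt them to the csv files actually generated and to whatever archive name Insee_search.ipynb expects.

```python
from pathlib import Path
from zipfile import ZIP_DEFLATED, ZipFile


def zip_csv_files(csv_paths, archive_path):
    """Write the given csv files into one compressed zip archive."""
    with ZipFile(archive_path, "w", compression=ZIP_DEFLATED) as archive:
        for path in map(Path, csv_paths):
            # arcname stores only the file name, not the directory it came from
            archive.write(path, arcname=path.name)
    return archive_path


# Hypothetical usage, assuming these csv files were produced by this notebook:
# zip_csv_files(["Insee_70s.csv", "Insee_80s.csv"], "Insee_formatted.zip")
```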