# Notebook 1/4: General introduction and initial data exploration
***

# Table of contents
1. [General introduction to the data](#1)
    1. [Source](#1A)
    2. [Datasets](#1B)
    3. [Scope of the analysis](#1C)
2. [Initial exploration](#2)
    1. [Purpose of this notebook](#2A)
    2. [Daily traffic *1st semester 2018*](#2B)
        1. [Observations](#2Ba)
        2. [Ticket Types](#2Bb)
    3. [Daily traffic *2nd semester 2018*](#2C)
        1. [Observations](#2Ca)
        2. [Next steps](#2Cb)
    4. [Hourly profiles *1st semester 2018*](#2D)
        1. [Observations](#2Da)
        2. [Day Categories](#2Db)
    5. [Hourly profiles *2nd semester 2018*](#2E)
        1. [Observations](#2Ea)
        2. [Next steps](#2Eb)
    6. [Repository](#2F)
        1. [Identification Codes](#2Fa)
        2. [Stop Types](#2Fb)
        3. [Observations](#2Fc)
    7. [Geographical data](#2G)
        1. [Observations](#2Ga)
        2. [Next steps](#2Gb)

# 1. General introduction to the data <a name="1"></a>


## 1.A. Source: <a name="1A"></a>

Île-de-France Mobilités, formerly STIF, is the organising authority that controls and coordinates the transport companies operating the public transport network of Paris and the rest of the Île-de-France region.

Since 2016, the STIF has given access to some of its raw data through an [open data portal](https://opendata.stif.info/explore/?sort=modified).

The STIF operates both a road network (bus) and a rail network (train, metro, RER, funicular).

For the purpose of this analysis, we will focus on the **rail network**.


## 1.B. Datasets: <a name="1B"></a>

The STIF provides the following data about the rail network:
- daily traffic per stop (number of checkins per day and per ticket type)
- hourly profiles per stop (traffic distribution per hour of a typical day)
- geographical coordinates (arranged by stop or by line of transport)
- repositories of all stops (arranged by stop or by line of transport)

Data about daily traffic and hourly profiles is available for the years 2015 through 2018. 

We will focus our analysis on the year **2018**.

For the year 2018, data is split across 2 datasets, corresponding to the 1st and 2nd semesters of the year.

Below are the links to the aforementioned datasets:
- [Daily traffic *1st semester 2018*](https://opendata.stif.info/explore/dataset/validations-sur-le-reseau-ferre-nombre-de-validations-par-jour-1er-sem/information/)
- [Daily traffic *2nd semester 2018*](https://opendata.stif.info/explore/dataset/validations-sur-le-reseau-ferre-nombre-de-validations-par-jour-2e-sem/information/)
- [Hourly profiles *1st semester 2018*](https://opendata.stif.info/explore/dataset/validations-sur-le-reseau-ferre-profils-horaires-par-jour-type-1er-sem/information/)
- [Hourly profiles *2nd semester 2018*](https://opendata.stif.info/explore/dataset/validations-sur-le-reseau-ferre-profils-horaires-par-jour-type-2e-sem/information/)
- [Repository](https://opendata.stif.info/explore/dataset/referentiel-arret-tc-idf/information/)
> This page also contains information about identification codes.
- [Geographical coordinates](https://opendata.stif.info/explore/dataset/emplacement-des-gares-idf-data-generalisee/information/)


## 1.C. Scope of the analysis: <a name="1C"></a>

The ultimate goal is to analyze the traffic across Parisian metro stations for the year 2018.

# 2. Initial exploration <a name="2"></a>


## 2.A. Purpose of this notebook <a name="2A"></a>

In this notebook, we will explore these datasets to gain a first understanding of the nature of the data at hand and identify the necessary cleaning steps.

We will perform the cleaning tasks and analysis in separate notebooks (see 3. Cleaning and 4. Analysis).

## 2.B. Daily traffic *1st semester 2018* <a name="2B"></a>

In [49]:
# Import libraries

import numpy as np
import pandas as pd

In [5]:
nb_2018s1 = pd.read_csv("../../datasets/validations-sur-le-reseau-ferre-nombre-de-validations-par-jour-1er-sem.csv", sep=";")

In [6]:
nb_2018s1.head()

Unnamed: 0,JOUR,CODE_STIF_TRNS,CODE_STIF_RES,CODE_STIF_ARRET,LIBELLE_ARRET,ID_REFA_LDA,CATEGORIE_TITRE,NB_VALD
0,2018-05-09,100,110,1006,OLYMPIADES,71557.0,AMETHYSTE,462
1,2018-05-09,100,110,1006,OLYMPIADES,71557.0,NAVIGO,10764
2,2018-05-09,100,110,1007,LES AGNETTES-ASNIERES-GENNEVILLIERS,72240.0,NAVIGO,3972
3,2018-05-09,100,110,1008,LES COURTILLES,72286.0,TST,813
4,2018-05-09,100,110,104,BOURSE,73635.0,?,Moins de 5


In [7]:
nb_2018s1.tail()

Unnamed: 0,JOUR,CODE_STIF_TRNS,CODE_STIF_RES,CODE_STIF_ARRET,LIBELLE_ARRET,ID_REFA_LDA,CATEGORIE_TITRE,NB_VALD
777431,2018-06-27,800,853,308,GARE DU NORD,71410.0,FGT,11
777432,2018-06-27,800,853,308,GARE DU NORD,71410.0,IMAGINE R,14
777433,2018-06-27,800,853,784,SARCELLES-SAINT-BRICE,66079.0,?,Moins de 5
777434,2018-06-27,800,853,784,SARCELLES-SAINT-BRICE,66079.0,FGT,80
777435,2018-06-27,800,853,784,SARCELLES-SAINT-BRICE,66079.0,TST,96


In [8]:
nb_2018s1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777436 entries, 0 to 777435
Data columns (total 8 columns):
JOUR               777436 non-null object
CODE_STIF_TRNS     777436 non-null int64
CODE_STIF_RES      777436 non-null object
CODE_STIF_ARRET    777436 non-null object
LIBELLE_ARRET      777436 non-null object
ID_REFA_LDA        767645 non-null float64
CATEGORIE_TITRE    777436 non-null object
NB_VALD            777436 non-null object
dtypes: float64(1), int64(1), object(6)
memory usage: 47.5+ MB


In [9]:
nb_2018s1.nunique()

JOUR                 181
CODE_STIF_TRNS         3
CODE_STIF_RES         14
CODE_STIF_ARRET      754
LIBELLE_ARRET        732
ID_REFA_LDA          721
CATEGORIE_TITRE        8
NB_VALD            18583
dtype: int64

In [10]:
nb_2018s1['CODE_STIF_RES'].unique()

array(['110', '804', '805', '853', '852', 'ND', '800', '803', '854',
       '802', '801', '850', '822', '851'], dtype=object)

In [11]:
# Number of unique stop names with a CODE_STIF_RES of '110'

nb_2018s1.loc[nb_2018s1['CODE_STIF_RES'] == '110', 'LIBELLE_ARRET'].nunique()

301

In [12]:
nb_2018s1['CATEGORIE_TITRE'].unique()

array(['AMETHYSTE', 'NAVIGO', 'TST', '?', 'FGT', 'AUTRE TITRE',
       'IMAGINE R', 'NON DEFINI'], dtype=object)

### 2.B.a. Observations: <a name="2Ba"></a>

> nb_2018s1 has 8 columns:

> - JOUR (= Day): Day of the year. This column has the correct number of unique values (corresponds to the number of days for the 1st semester of 2018). The dtype is object instead of datetime.


> - CODE_STIF_TRNS: Code allocated by the STIF to identify the carrier  


> - CODE_STIF_RES: Code allocated by the STIF to identify the network. Some missing values are designated by 'ND' ('Non Disponible' in French, i.e. 'Unavailable'). Some research revealed that '110' designates Parisian metro stations. There are 301 unique stop names with a CODE_STIF_RES of '110' in the dataframe, which is close to the 302 stations indicated on the Wikipedia page for the Paris Métro (https://en.wikipedia.org/wiki/Paris_M%C3%A9tro).


> - CODE_STIF_ARRET: Code allocated by the STIF to identify the stop.   


> - LIBELLE_ARRET (= Stop name): Commercial name of the stop. It has fewer unique values than CODE_STIF_ARRET, suggesting that one stop name can have several CODE_STIF_ARRET values attached to it.


> - ID_REFA_LDA: Identification code of the LDA (= Lieu d'Arrêt). An LDA is a place where different vehicles can stop. This column has missing values. See below for more details about the different identification codes.


> - CATEGORIE_TITRE (= Ticket type): Rate plan of the ticket used to check in. Some missing values are designated by '?'. See below for more details about the different rate plans.


> - NB_VALD (= Number of checkins): Total number of checkins. These only count entries into the rail network, not transfers: if a person enters the network at station A and then travels through the network to station B, the trip is counted only once, at station A.
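
NB_VALD is read as object rather than as a number. The `head()` and `tail()` outputs suggest that this is due to entries such as 'Moins de 5' ('fewer than 5'). Below is a minimal sketch, on a toy sample rather than on the real file, of how one might list the entries that prevent a numeric cast. How these entries should be treated is left to the cleaning notebook.

```python
import pandas as pd

# Toy sample mimicking the NB_VALD values seen in the outputs above
# (not the real dataset).
sample = pd.DataFrame({
    "LIBELLE_ARRET": ["OLYMPIADES", "LES COURTILLES", "BOURSE"],
    "NB_VALD": ["462", "813", "Moins de 5"],
})

# Coerce to numeric: any entry that is not a number becomes NaN.
as_number = pd.to_numeric(sample["NB_VALD"], errors="coerce")

# Distinct raw values that could not be parsed as numbers.
non_numeric = sample.loc[as_number.isna(), "NB_VALD"].unique()
print(non_numeric)
```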

### 2.B.b. Ticket types: <a name="2Bb"></a>

> The STIF gives the following description of each rate plan:

> - IMAGINE R: Annual rate plan reserved for pupils and students.
> - NAVIGO: Annual, monthly or weekly rate plan.
> - AMETHYSTE: Annual rate plan reserved for elderly and disabled people.
> - TST: Discount monthly or weekly rate plan, reserved for eligible people.
> - FGT: Discount annual, monthly or weekly rate plan, reserved for eligible people.
> - AUTRE TITRE: Other special fares.
> - NON DEFINI: Indicates missing data

## 2.C. Daily traffic *2nd semester 2018* <a name="2C"></a>

In [15]:
nb_2018s2 = pd.read_csv("../../datasets/validations-sur-le-reseau-ferre-nombre-de-validations-par-jour-2e-sem.csv", sep = ";")

In [16]:
nb_2018s2.head()

Unnamed: 0,JOUR,CODE_STIF_TRNS,CODE_STIF_RES,CODE_STIF_ARRET,LIBELLE_ARRET,ID_REFA_LDA,CATEGORIE_TITRE,NB_VALD
0,2018-11-14,800,851.0,880.0,MORET-VENEUX-LES-SABLONS,61410,AMETHYSTE,18.0
1,2018-11-14,800,851.0,882.0,VERNOU-SUR-SEINE,61450,IMAGINE R,14.0
2,2018-11-14,800,852.0,176.0,CHAVILLE RIVE GAUCHE,73718,?,
3,2018-11-14,800,852.0,176.0,CHAVILLE RIVE GAUCHE,73718,FGT,18.0
4,2018-11-14,800,852.0,176.0,CHAVILLE RIVE GAUCHE,73718,IMAGINE R,342.0


In [17]:
nb_2018s2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 883145 entries, 0 to 883144
Data columns (total 8 columns):
JOUR               883145 non-null object
CODE_STIF_TRNS     883145 non-null int64
CODE_STIF_RES      881677 non-null float64
CODE_STIF_ARRET    881677 non-null float64
LIBELLE_ARRET      883145 non-null object
ID_REFA_LDA        881869 non-null object
CATEGORIE_TITRE    883145 non-null object
NB_VALD            765356 non-null float64
dtypes: float64(3), int64(1), object(4)
memory usage: 53.9+ MB


In [18]:
nb_2018s2.nunique()

JOUR                 184
CODE_STIF_TRNS         3
CODE_STIF_RES         13
CODE_STIF_ARRET      753
LIBELLE_ARRET        732
ID_REFA_LDA          722
CATEGORIE_TITRE        8
NB_VALD            18546
dtype: int64

In [19]:
nb_2018s2['CODE_STIF_RES'].unique()

array([851., 852., 853., 804., 805., 850., 801., 802., 110., 803., 854.,
       822.,  nan, 800.])

In [20]:
nb_2018s2['CATEGORIE_TITRE'].unique()

array(['AMETHYSTE', 'IMAGINE R', '?', 'FGT', 'NAVIGO', 'AUTRE TITRE',
       'TST', 'NON DEFINI'], dtype=object)

### 2.C.a. Observations (and comparison of the 2 dataframes for the 1st and 2nd semester): <a name="2Ca"></a>

> nb_2018s2 contains the exact same columns as nb_2018s1.

> The number of days is correct (184, the number of days in the 2nd semester of 2018), which brings the total to 365 unique values for the year 2018. However, in both dataframes, the dtype of the JOUR column is object instead of datetime.

> We see that these 2 dataframes have different dtypes, so we will have to standardize the data.

> The columns CODE_STIF_RES, CODE_STIF_ARRET and ID_REFA_LDA have a different number of unique values in each dataframe. 

> Looking at the CODE_STIF_RES column, we see that missing values are designated by 'ND' in one dataset and np.nan in the other. The issue is probably similar for CODE_STIF_ARRET & ID_REFA_LDA. We'll have to normalize the missing values for these columns.

> We see that some missing values are designated by a question mark in the CATEGORIE_TITRE column. Again, we'll have to normalize missing values in this column.
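
One possible way to spot the dtype differences at a glance is to put the `dtypes` of both dataframes side by side. The sketch below uses two small toy dataframes (`s1` and `s2` are illustrative stand-ins, not the real data) that reproduce the kind of mismatch observed above.

```python
import pandas as pd

# Toy stand-ins for nb_2018s1 and nb_2018s2 (illustrative values only).
s1 = pd.DataFrame({
    "JOUR": ["2018-05-09", "2018-05-09"],
    "CODE_STIF_RES": ["110", "ND"],
    "ID_REFA_LDA": [71557.0, None],
    "NB_VALD": ["462", "Moins de 5"],
})
s2 = pd.DataFrame({
    "JOUR": ["2018-11-14", "2018-11-14"],
    "CODE_STIF_RES": [851.0, None],
    "ID_REFA_LDA": ["61410", "73718"],
    "NB_VALD": [18.0, None],
})

# Side-by-side comparison of dtypes, as strings for easy comparison.
dtype_cmp = pd.concat(
    [s1.dtypes.astype(str), s2.dtypes.astype(str)], axis=1, keys=["s1", "s2"]
)
dtype_cmp["same"] = dtype_cmp["s1"] == dtype_cmp["s2"]

# Columns whose dtype differs between the two dataframes.
mismatched = dtype_cmp.index[~dtype_cmp["same"]].tolist()
print(dtype_cmp)
```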

### 2.C.b. Next steps for the cleaning & merging of nb_2018s1 and nb_2018s2 (daily traffic for the 1st and 2nd semesters of 2018): <a name="2Cb"></a>

> Rename columns (translate to English, except columns that contain IDs)

> Handle misdesignated missing values ('ND', '?')

> Cast the CODE_STIF_RES, CODE_STIF_ARRET and NB_VALD columns of nb_2018s1 to numeric

> Cast the ID_REFA_LDA column of nb_2018s2 to numeric

> Convert the JOUR column to datetime in both nb_2018s1 and nb_2018s2

> Concatenate nb_2018s1 & nb_2018s2 into a new dataframe

> Filter Paris metro stations

> Handle remaining missing values

> Export to csv
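
The steps above could be sketched roughly as follows. This is only an outline on toy data, not the actual cleaning notebook: the helper name `standardize` is hypothetical, renaming and export are omitted, and coercing 'Moins de 5' to NaN is an assumption made here for brevity.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for nb_2018s1 and nb_2018s2 (illustrative values only).
s1 = pd.DataFrame({
    "JOUR": ["2018-05-09", "2018-05-09"],
    "CODE_STIF_RES": ["110", "ND"],
    "CATEGORIE_TITRE": ["NAVIGO", "?"],
    "NB_VALD": ["10764", "Moins de 5"],
})
s2 = pd.DataFrame({
    "JOUR": ["2018-11-14", "2018-11-14"],
    "CODE_STIF_RES": [110.0, 851.0],
    "CATEGORIE_TITRE": ["NAVIGO", "FGT"],
    "NB_VALD": [342.0, 18.0],
})

def standardize(df):
    """Hypothetical helper: harmonize missing values and dtypes."""
    # Replace the ad hoc missing-value markers with NaN.
    out = df.replace(["ND", "?"], np.nan)
    # Parse the day column as datetime.
    out["JOUR"] = pd.to_datetime(out["JOUR"], format="%Y-%m-%d")
    # Cast code and count columns to numeric.
    # Assumption: non-numeric entries such as 'Moins de 5' become NaN.
    for col in ["CODE_STIF_RES", "NB_VALD"]:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out

# Concatenate both semesters, then keep Paris metro rows (network code 110).
daily_2018 = pd.concat([standardize(s1), standardize(s2)], ignore_index=True)
metro = daily_2018[daily_2018["CODE_STIF_RES"] == 110]
print(metro)
```

Note that `errors="coerce"` silently turns every unparseable entry into NaN, so it is worth listing those entries first before deciding how to handle them.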

## 2.D. Hourly profiles *1st semester 2018* <a name="2D"></a>

In [21]:
hp_2018s1 = pd.read_csv('../../datasets/validations-sur-le-reseau-ferre-profils-horaires-par-jour-type-1er-sem.csv', sep=';')

In [22]:
hp_2018s1.head()

Unnamed: 0,CODE_STIF_TRNS,CODE_STIF_RES,CODE_STIF_ARRET,LIBELLE_ARRET,ID_REFA_LDA,CAT_JOUR,TRNC_HORR_60,pourc_validations
0,100,110,717,QUAI DE LA GARE,71597.0,DIJFP,11H-12H,3.83
1,100,110,717,QUAI DE LA GARE,71597.0,DIJFP,13H-14H,5.33
2,100,110,717,QUAI DE LA GARE,71597.0,DIJFP,14H-15H,6.44
3,100,110,717,QUAI DE LA GARE,71597.0,DIJFP,15H-16H,7.9
4,100,110,717,QUAI DE LA GARE,71597.0,DIJFP,18H-19H,11.28


In [23]:
hp_2018s1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82744 entries, 0 to 82743
Data columns (total 8 columns):
CODE_STIF_TRNS       82744 non-null int64
CODE_STIF_RES        82744 non-null object
CODE_STIF_ARRET      82744 non-null object
LIBELLE_ARRET        82744 non-null object
ID_REFA_LDA          81576 non-null float64
CAT_JOUR             82744 non-null object
TRNC_HORR_60         82744 non-null object
pourc_validations    82744 non-null float64
dtypes: float64(2), int64(1), object(5)
memory usage: 5.1+ MB


In [24]:
hp_2018s1.nunique()

CODE_STIF_TRNS          3
CODE_STIF_RES          14
CODE_STIF_ARRET       754
LIBELLE_ARRET         732
ID_REFA_LDA           721
CAT_JOUR                5
TRNC_HORR_60           25
pourc_validations    2578
dtype: int64

In [25]:
hp_2018s1['CODE_STIF_RES'].unique()

array(['110', '852', '853', 'ND', '800', '803', '851', '854', '801',
       '802', '805', '804', '822', '850'], dtype=object)

In [26]:
hp_2018s1['CAT_JOUR'].unique()

array(['DIJFP', 'JOHV', 'JOVS', 'SAHV', 'SAVS'], dtype=object)

In [27]:
hp_2018s1['TRNC_HORR_60'].unique()

array(['11H-12H', '13H-14H', '14H-15H', '15H-16H', '18H-19H', '1H-2H',
       '21H-22H', '5H-6H', '6H-7H', '20H-21H', '2H-3H', '9H-10H',
       '22H-23H', '23H-0H', '8H-9H', '0H-1H', '12H-13H', '16H-17H',
       '3H-4H', '4H-5H', '7H-8H', '10H-11H', '19H-20H', '17H-18H', 'ND'],
      dtype=object)

### 2.D.a. Observations: <a name="2Da"></a>

> hp_2018s1 has 8 columns.

> Some of these columns are the same as we saw above, and present the same challenges with missing values:

> - CODE_STIF_TRNS
> - CODE_STIF_RES
> - CODE_STIF_ARRET
> - LIBELLE_ARRET
> - ID_REFA_LDA


> These 3 columns, however, are new:

> - CAT_JOUR (= Day type): Designates the type of day. There are 5 categories. See below for more details about these categories.


> - TRNC_HORR_60 (= Time slot): One-hour time slots. This column also has missing values indicated by 'ND'.


> - pourc_validations (= Percentage of checkins): Percentage of checkins per hour for a given day category and a given stop.
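
If pourc_validations is indeed a distribution over the hours of a typical day, the values for a given stop and day category should add up to roughly 100. This is an assumption that has not been verified here. A minimal sketch of such a sanity check, on toy data and with an arbitrary tolerance of 0.5:

```python
import pandas as pd

# Toy hourly profile (illustrative values only, not the real dataset).
hp = pd.DataFrame({
    "CODE_STIF_ARRET": ["717", "717", "717", "1008", "1008"],
    "CAT_JOUR": ["DIJFP", "DIJFP", "DIJFP", "JOHV", "JOHV"],
    "TRNC_HORR_60": ["11H-12H", "13H-14H", "ND", "7H-8H", "8H-9H"],
    "pourc_validations": [40.0, 59.5, 0.5, 55.0, 40.0],
})

# Sum of the percentages for each (stop, day category) pair.
totals = hp.groupby(["CODE_STIF_ARRET", "CAT_JOUR"])["pourc_validations"].sum()

# Pairs whose total deviates from 100 by more than the chosen tolerance.
incomplete = totals[(totals - 100).abs() > 0.5]
print(incomplete)
```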

### 2.D.b. Day Categories: <a name="2Db"></a>

- JOHV (Jour Ouvré Hors Vacances): Working day outside school holidays
- JOVS (Jour Ouvré Vacances Scolaires): Working day during school holidays
- SAHV (Samedi Hors Vacances): Saturday outside school holidays
- SAVS (Samedi Vacances Scolaires): Saturday during school holidays
- DIJFP (Dimanche Jour Férié Pont): Sundays, national holidays and bridge days

## 2.E. Hourly profiles *2nd semester 2018* <a name="2E"></a>

In [29]:
hp_2018s2 = pd.read_csv('../../datasets/validations-sur-le-reseau-ferre-profils-horaires-par-jour-type-2e-sem.csv', sep=';')



In [30]:
hp_2018s2.head()

Unnamed: 0,CODE_STIF_TRNS,CODE_STIF_RES,CODE_STIF_ARRET,LIBELLE_ARRET,ID_REFA_LDA,CAT_JOUR,TRNC_HORR_60,pourc_validations
0,100,110,1007,LES AGNETTES-ASNIERES-GENNEVILLIERS,72240,SAVS,17H-18H,6.61
1,100,110,1007,LES AGNETTES-ASNIERES-GENNEVILLIERS,72240,SAVS,23H-0H,1.68
2,100,110,1007,LES AGNETTES-ASNIERES-GENNEVILLIERS,72240,SAVS,7H-8H,5.04
3,100,110,1008,LES COURTILLES,72286,DIJFP,11H-12H,6.09
4,100,110,1008,LES COURTILLES,72286,DIJFP,12H-13H,6.66


In [31]:
hp_2018s2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83532 entries, 0 to 83531
Data columns (total 8 columns):
CODE_STIF_TRNS       83532 non-null int64
CODE_STIF_RES        83532 non-null object
CODE_STIF_ARRET      83532 non-null object
LIBELLE_ARRET        83532 non-null object
ID_REFA_LDA          83414 non-null object
CAT_JOUR             83532 non-null object
TRNC_HORR_60         83532 non-null object
pourc_validations    83532 non-null float64
dtypes: float64(1), int64(1), object(6)
memory usage: 5.1+ MB


In [32]:
hp_2018s2.nunique()

CODE_STIF_TRNS          3
CODE_STIF_RES          25
CODE_STIF_ARRET      1265
LIBELLE_ARRET         732
ID_REFA_LDA           722
CAT_JOUR                5
TRNC_HORR_60           25
pourc_validations    2473
dtype: int64

In [33]:
hp_2018s2['CODE_STIF_RES'].unique()

array(['110', '803', '804', '800', 'ND', '805', '822', '850', '801',
       '851', '853', '802', '854', '852', 853, 110, 822, 850, 852, 854,
       801, 851, 802, 805, 804], dtype=object)

In [34]:
hp_2018s2['CAT_JOUR'].unique()

array(['SAVS', 'DIJFP', 'SAHV', 'JOHV', 'JOVS'], dtype=object)

In [35]:
hp_2018s2['TRNC_HORR_60'].unique()

array(['17H-18H', '23H-0H', '7H-8H', '11H-12H', '12H-13H', '15H-16H',
       '16H-17H', '18H-19H', '14H-15H', '1H-2H', '19H-20H', '21H-22H',
       '9H-10H', '3H-4H', '5H-6H', '6H-7H', '10H-11H', '0H-1H', '2H-3H',
       '22H-23H', '4H-5H', '20H-21H', '8H-9H', '13H-14H', 'ND'],
      dtype=object)

### 2.E.a. Observations: <a name="2Ea"></a>

> This dataframe contains the exact same columns as the one for the 1st semester of 2018.

> The 2 dataframes don't have the same dtypes. We'll have to standardize the data.

> Both dataframes have missing values in the ID_REFA_LDA column and don't have the same number of unique values for that column.

> Missing values seem to be indicated by 'ND' in CODE_STIF_RES column and TRNC_HORR_60. We'll have to normalize those missing values.
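
The higher number of unique values for CODE_STIF_RES and CODE_STIF_ARRET in hp_2018s2 appears to come from the same codes being stored both as strings and as integers, as the `unique()` output above shows ('110' and 110, for instance). A minimal sketch on a toy Series of how one might confirm this and see the effect of normalizing to a single type:

```python
import pandas as pd

# Toy column mixing strings and integers, as in hp_2018s2['CODE_STIF_RES'].
codes = pd.Series(["110", "803", "ND", 853, 110], dtype=object)

# Count the Python types present in the column.
type_counts = codes.map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# '110' (str) and 110 (int) count as two distinct values until
# everything is converted to a single type.
as_text = codes.astype(str)
print(codes.nunique(), as_text.nunique())
```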

### 2.E.b. Next steps for the cleaning & merging of hp_2018s1 and hp_2018s2 (hourly profiles for the 1st and 2nd semesters of 2018): <a name="2Eb"></a>

> Rename columns (translate to English, except columns that contain IDs)

> Handle misdesignated missing values ('ND')

> Cast the ID_REFA_LDA column of hp_2018s2 to numeric

> Cast the CODE_STIF_RES & CODE_STIF_ARRET to numeric for both dataframes
 
> Concatenate hp_2018s1 & hp_2018s2 into a new dataframe

## 2.F. Repository <a name="2F"></a>

### 2.F.a.  Identification Codes: <a name="2Fa"></a>

The STIF assigns 3 levels of identification to each stop:

- LDA (Lieu D'Arrêt): Designates a place where vehicles from different lines can stop.
- ZDL (Zone De Lieu): Designates an area within an LDA that groups several ZDEs with the same operating name.
- ZDE (Zone D'Embarquement): Designates a precise spot where people can get in and out of a vehicle (e.g. a metro platform).

The STIF also allocates internal codes to identify carriers (CODE_STIF_TRNS), network types (CODE_STIF_RES), and stops (CODE_STIF_ARRET).

In [37]:
ref_stops = pd.read_csv('../../datasets/referentiel-arret-tc-idf.csv', sep=';')

In [38]:
ref_stops.head()

Unnamed: 0,ZDEr_ID_REF_A,ZDEr_NOM,ZDEr_ID_TYPE_ARRET,ZDEr_LIBELLE_TYPE_ARRET,ZDEr_X_Y,ZDLr_ID_REF_A,ZDLr_NOM,ZDLr_ID_TYPE_ARRET,ZDLr_LIBELLE_TYPE_ARRET,LDA_ID_REF_A,LDA_NOM,LDA_ID_TYPE_ARRET,LDA_LIBELLE_TYPE_ARRET
0,36757,Saint-Exupéry,5,Arrêt de bus,6408376863829,50507,Saint-Exupéry,5,Arrêt de bus,70831,Saint-Exupéry,5,Arrêt de bus
1,39355,La Paix,5,Arrêt de bus,6410346863697,50508,La Paix,5,Arrêt de bus,70820,La Paix,5,Arrêt de bus
2,39358,Victorien Sardou,5,Arrêt de bus,6411286863107,50510,Victorien Sardou,5,Arrêt de bus,70796,Victorien Sardou,5,Arrêt de bus
3,27653,Aristide Briand,5,Arrêt de bus,6588496869841,50518,Aristide Briand / Centre Culturel,5,Arrêt de bus,72619,Aristide Briand / Centre Culturel,5,Arrêt de bus
4,19289,Bois d'Amour,5,Arrêt de bus,"660488.75,6868398.5",50520,Jean Jaurès / Bois d'Amour,5,Arrêt de bus,72527,Jean Jaurès / Bois d'Amour,5,Arrêt de bus


In [39]:
ref_stops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39925 entries, 0 to 39924
Data columns (total 13 columns):
ZDEr_ID_REF_A              39925 non-null int64
ZDEr_NOM                   39925 non-null object
ZDEr_ID_TYPE_ARRET         39925 non-null int64
ZDEr_LIBELLE_TYPE_ARRET    39925 non-null object
ZDEr_X_Y                   39925 non-null object
ZDLr_ID_REF_A              39925 non-null int64
ZDLr_NOM                   39925 non-null object
ZDLr_ID_TYPE_ARRET         39925 non-null int64
ZDLr_LIBELLE_TYPE_ARRET    39925 non-null object
LDA_ID_REF_A               39925 non-null int64
LDA_NOM                    39925 non-null object
LDA_ID_TYPE_ARRET          39925 non-null int64
LDA_LIBELLE_TYPE_ARRET     39925 non-null object
dtypes: int64(6), object(7)
memory usage: 4.0+ MB


In [40]:
ref_stops.nunique()

ZDEr_ID_REF_A              39925
ZDEr_NOM                   14500
ZDEr_ID_TYPE_ARRET             4
ZDEr_LIBELLE_TYPE_ARRET        4
ZDEr_X_Y                   39763
ZDLr_ID_REF_A              18412
ZDLr_NOM                   13001
ZDLr_ID_TYPE_ARRET             4
ZDLr_LIBELLE_TYPE_ARRET        4
LDA_ID_REF_A               15361
LDA_NOM                    11379
LDA_ID_TYPE_ARRET              4
LDA_LIBELLE_TYPE_ARRET         4
dtype: int64

Each row corresponds to a unique ZDE ID, but several ZDE IDs can share the same commercial name. 

There are no missing values.

Data types are consistent.

There seem to be 4 types of stops, whether across ZDEs, ZDLs or LDAs. Let's look into it.

In [41]:
ref_stops['ZDEr_LIBELLE_TYPE_ARRET'].value_counts()

Arrêt de bus            36695
Station ferrée / Val     2049
Station de métro          768
Arrêt de tram             413
Name: ZDEr_LIBELLE_TYPE_ARRET, dtype: int64

### 2.F.b.  Stop Types: <a name="2Fb"></a>

> There are 4 types of stops:

> - 5: Arrêt de bus (=bus stop)
> - 1: Station ferrée / Val (=rail station)
> - 2: Station de métro (=metro station)
> - 6: Arrêt de tram (=tram stop)

In [42]:
ZDE_val_counts = ref_stops['ZDEr_ID_TYPE_ARRET'].value_counts()
ZDL_val_counts = ref_stops['ZDLr_ID_TYPE_ARRET'].value_counts()
LDA_val_counts = ref_stops['LDA_ID_TYPE_ARRET'].value_counts()

ZDE_counts = pd.DataFrame(data=ZDE_val_counts.values, index=ZDE_val_counts.index, columns = ['ZDE'])
ZDL_counts = pd.DataFrame(data=ZDL_val_counts.values, index=ZDL_val_counts.index, columns = ['ZDL'])
LDA_counts = pd.DataFrame(data=LDA_val_counts.values, index=LDA_val_counts.index, columns = ['LDA'])

In [43]:
ids_counts = pd.concat([ZDE_counts, ZDL_counts, LDA_counts], axis=1)

ids_counts 

Unnamed: 0,ZDE,ZDL,LDA
5,36695,36691,31819
1,2049,2059,5065
2,768,762,2208
6,413,413,833


The counts per stop type are not the same across the three ID levels.

In [44]:
gbo_ids = ref_stops.groupby(['LDA_ID_TYPE_ARRET', 'ZDLr_ID_TYPE_ARRET', 'ZDEr_ID_TYPE_ARRET'])

gp_ids_counts = pd.DataFrame(gbo_ids.count()['LDA_ID_REF_A'])

gp_ids_counts

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,LDA_ID_REF_A
LDA_ID_TYPE_ARRET,ZDLr_ID_TYPE_ARRET,ZDEr_ID_TYPE_ARRET,Unnamed: 3_level_1
1,1,1,2049
1,1,2,6
1,1,5,4
1,2,2,78
1,5,5,2884
1,6,6,44
2,2,2,684
2,5,5,1471
2,6,6,53
5,5,5,31819


### 2.F.c. Observations: <a name="2Fc"></a>

This dataset does not need to be cleaned since we will use it in a fairly superficial way. 

There are no missing values. Data types are consistent.

It contains 13 columns, all containing identification codes along with the names of the stops.

There are 3 identification levels: LDA, ZDL and ZDE. Each LDA can contain several ZDLs, which themselves contain a number of ZDEs. Each level uses the same 4 stop types: metro, rail, bus and tram.

This dataset does not contain internal STIF codes (CODE_STIF_RES for example).

Each row corresponds to a unique ZDE.
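
To illustrate the LDA > ZDL > ZDE hierarchy, the sketch below counts how many ZDLs and ZDEs fall under each LDA. It runs on a toy dataframe with made-up IDs that only reuses the column names of `ref_stops`.

```python
import pandas as pd

# Toy repository with one row per ZDE (made-up IDs, illustrative only).
ref = pd.DataFrame({
    "ZDEr_ID_REF_A": [1, 2, 3, 4],
    "ZDLr_ID_REF_A": [10, 10, 11, 12],
    "LDA_ID_REF_A": [100, 100, 100, 101],
})

# Number of distinct ZDLs and ZDEs attached to each LDA.
per_lda = ref.groupby("LDA_ID_REF_A").agg(
    n_zdl=("ZDLr_ID_REF_A", "nunique"),
    n_zde=("ZDEr_ID_REF_A", "nunique"),
)
print(per_lda)
```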

## 2.G. Geographical data <a name="2G"></a>

In [45]:
geo_stops = pd.read_csv('../../datasets/emplacement-des-gares-idf-data-generalisee.csv', sep=';')

In [46]:
geo_stops.head()

Unnamed: 0,Geo Point,Geo Shape,id_ref_zdl,nom_long,label,idrefliga,idrefligc,res_com,mode_,fer,...,terrer,termetro,tertram,ternavette,terval,exploitant,principal,idf,x,y
0,"48.8463569889, 2.41947990037","{""type"": ""Point"", ""coordinates"": [2.4194799003...",47247,SAINT-MANDE,Saint-Mandé,A01534,C01371,M1,Metro,0,...,0,0,0,0,0,RATP,0,1,657397.0779,6860858.0
1,"48.8662858046, 2.32294341224","{""type"": ""Point"", ""coordinates"": [2.3229434122...",45676,CONCORDE,Concorde,A01534 / A01541 / A01545,C01371 / C01378 / C01382,M1 / M8 / M12,Metro,0,...,0,0,0,0,0,RATP,0,1,650331.6676,6863130.0
2,"48.8828686476, 2.34413063372","{""type"": ""Point"", ""coordinates"": [2.3441306337...",42210,ANVERS,Anvers,A01535,C01372,M2,Metro,0,...,0,0,0,0,0,RATP,0,1,651901.2249,6864961.0
3,"48.9078125468, 2.45435282652","{""type"": ""Point"", ""coordinates"": [2.4543528265...",47334,JEAN ROSTAND,Jean Rostand,A01191,C01389,T1,Tramway,0,...,0,0,0,0,0,RATP,0,1,660003.4947,6867673.0
4,"48.8930946286, 2.48791098763","{""type"": ""Point"", ""coordinates"": [2.4879109876...",44603,LA REMISE A JORELLE,La Remise à Jorelle,A01761,C01843,T4,Tramway,0,...,0,0,0,0,0,SNCF,0,1,662452.6934,6866020.0


In [47]:
geo_stops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 923 entries, 0 to 922
Data columns (total 28 columns):
Geo Point     923 non-null object
Geo Shape     923 non-null object
id_ref_zdl    923 non-null int64
nom_long      923 non-null object
label         923 non-null object
idrefliga     890 non-null object
idrefligc     847 non-null object
res_com       923 non-null object
mode_         923 non-null object
fer           923 non-null int64
train         923 non-null int64
rer           923 non-null int64
metro         923 non-null int64
tramway       923 non-null int64
navette       923 non-null int64
val           923 non-null int64
terfer        923 non-null object
tertrain      923 non-null object
terrer        923 non-null object
termetro      923 non-null object
tertram       923 non-null object
ternavette    923 non-null object
terval        923 non-null object
exploitant    923 non-null object
principal     923 non-null int64
idf           923 non-null int64
x             923 non

In [48]:
geo_stops['id_ref_zdl'].nunique()

923

### 2.G.a. Observations: <a name="2Ga"></a>

> This dataset contains the geo coordinates of all stations of the rail network in Île-de-France, by ZDL, along with some other information that is of lesser use to us (28 columns in total).

> The Geo Point column contains the geographical coordinates of each stop. We will have to split this column into columns containing latitudes and longitudes.

> It contains 923 unique ZDLs (corresponding to the total number of rows). As we saw earlier, there are multiple ZDLs for one LDA.

> The datasets we want to analyze (daily traffic and hourly profiles) only contain LDAs (301 unique values).

> The only way to merge these dataframes with geo_stops is to match the station names (column 'nom_long'). First, we will have to filter geo_stops in order to keep metro stations only.
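
Splitting the Geo Point column could look like the sketch below. It assumes the "latitude, longitude" order seen in the `head()` output above and uses a toy dataframe built from two of those rows. The new column names `latitude` and `longitude` are illustrative.

```python
import pandas as pd

# Two rows copied from the head() output above.
geo = pd.DataFrame({
    "nom_long": ["SAINT-MANDE", "CONCORDE"],
    "Geo Point": ["48.8463569889, 2.41947990037", "48.8662858046, 2.32294341224"],
})

# Split "lat, lon" on the comma, strip whitespace, and cast to float.
parts = geo["Geo Point"].str.split(",", expand=True)
geo["latitude"] = parts[0].str.strip().astype(float)
geo["longitude"] = parts[1].str.strip().astype(float)
print(geo[["nom_long", "latitude", "longitude"]])
```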

### 2.G.b. Next steps to clean geo_stops (geographical data): <a name="2Gb"></a>

> Filter metro stations

> Match station names of geo_stops with those of the other dataframes

> Update column names and drop useless columns

> Create latitude and longitude columns
 
> Export to .csv
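
As a rough illustration of the first two steps, the sketch below filters metro stations using the `mode_` column and uses a left merge with `indicator=True` to list the stop names that find no counterpart in geo_stops. The data is a toy sample, and the spelling mismatch ('SAINT MANDE' vs 'SAINT-MANDE') is hypothetical, included only to show what an unmatched name looks like.

```python
import pandas as pd

# Toy geographical data (names and modes taken from the head() output).
geo = pd.DataFrame({
    "nom_long": ["SAINT-MANDE", "CONCORDE", "JEAN ROSTAND"],
    "mode_": ["Metro", "Metro", "Tramway"],
})

# Toy list of stop names from the traffic data.
# 'SAINT MANDE' is a hypothetical spelling variant, not an observed one.
traffic_names = pd.DataFrame({"LIBELLE_ARRET": ["CONCORDE", "SAINT MANDE"]})

# Keep metro stations only.
metro_geo = geo[geo["mode_"] == "Metro"]

# A left merge with indicator=True flags the names without a match.
check = traffic_names.merge(
    metro_geo, how="left", left_on="LIBELLE_ARRET", right_on="nom_long",
    indicator=True,
)
unmatched = check.loc[check["_merge"] == "left_only", "LIBELLE_ARRET"].tolist()
print(unmatched)
```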