Data exploration and analysis
===================

Here we explore the SBB dataset _ist-daten-sbb_. We select the features that can be of use for the rest of the project, clean the dataset where needed, and then compute some statistics on it.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from scipy.stats import powerlaw

In [64]:
filepath='./data/ist-daten-sbb.csv'
df=pd.read_csv(filepath,delimiter=';')

In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61716 entries, 0 to 61715
Data columns (total 26 columns):
BETRIEBSTAG            61716 non-null object
FAHRT_BEZEICHNER       61716 non-null object
BETREIBER_ID           61716 non-null object
BETREIBER_ABK          61716 non-null object
BETREIBER_NAME         61716 non-null object
PRODUKT_ID             61636 non-null object
LINIEN_ID              61716 non-null int64
LINIEN_TEXT            61716 non-null object
UMLAUF_ID              0 non-null float64
VERKEHRSMITTEL_TEXT    61716 non-null object
ZUSATZFAHRT_TF         61716 non-null bool
FAELLT_AUS_TF          61716 non-null bool
BPUIC                  61716 non-null int64
HALTESTELLEN_NAME      61716 non-null object
ANKUNFTSZEIT           56478 non-null object
AN_PROGNOSE            52344 non-null object
AN_PROGNOSE_STATUS     61716 non-null object
ABFAHRTSZEIT           56477 non-null object
AB_PROGNOSE            52343 non-null object
AB_PROGNOSE_STATUS     61716 non-null object


The dataset contains 26 columns, not all of which will be of use to us. We will now discuss the meaning of these columns and translate them.

Each row corresponds to the arrival and departure of one train at one station.

***Columns***
- BETRIEBSTAG is the day the train departed, but there are some inconsistencies, as some trains depart after midnight, i.e. on the following day.
- FAHRT_BEZEICHNER is a unique identifier for a given train, which doesn't seem that useful since LINIEN_ID also does that.
- BETREIBER_ID is the id of the operator of the train, which is always SBB in this dataset.
- BETREIBER_ABK is the abbreviation of the operator; as above, it is always SBB.
- BETREIBER_NAME is the full name of the operator.
- PRODUKT_ID is an id that tells us the type of transport. It is always a train (Zug), apart from a few missing values that should also be trains.
- LINIEN_ID is an id for a given course.
- LINIEN_TEXT is the type of course.
- UMLAUF_ID is an id that tells us if there has been any change in the schedule. It is always null.
- VERKEHRSMITTEL_TEXT is just LINIEN_TEXT without the number at the end.
- ZUSATZFAHRT_TF is a boolean that tells us if the train was not planned ahead.
- FAELLT_AUS_TF is a boolean that tells us if the train was cancelled ("down").
- BPUIC is the UIC number of the station (it changes with HALTESTELLEN_NAME in the data).
- HALTESTELLEN_NAME is the name of the station.
- ANKUNFTSZEIT is the planned arrival time of the train at that station.
- AN_PROGNOSE is the actual arrival time of the train.
- AN_PROGNOSE_STATUS is a status flag for the arrival of the train, which we read as an indicator of its cancellation; the sample rows contain the values REAL and PROGNOSE.
- ABFAHRTSZEIT is the planned departure time of the train.
- AB_PROGNOSE is the actual departure time of the train.
- AB_PROGNOSE_STATUS is a status flag for the departure of the train, which we read as an indicator of its cancellation; the sample rows contain the values REAL and PROGNOSE.
- DURCHFAHRT_TF: we don't know what it is, but it is always false.
- ankunftsverspatung is a boolean that tells us if the arrival of the train was delayed.
- abfahrtsverspatung is a boolean that tells us if the departure of the train was delayed.
- lod is a link to a webpage that gives more information about the station.
- geopos is the geoposition of the station.
- GdeNummer: we don't know what it is.
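
The statements above that some columns are constant or empty (always SBB, always null, always false) can be checked programmatically rather than by eye. The following is a minimal sketch on a toy frame with made-up values; `constant_columns` is a hypothetical helper name, not a pandas function, and it would be applied to the real `df` before any columns are dropped.

```python
import pandas as pd


def constant_columns(frame: pd.DataFrame) -> dict:
    """Map every column with at most one distinct non-null value
    to that value (or to None if the column is entirely null)."""
    result = {}
    for name in frame.columns:
        distinct = frame[name].dropna().unique()
        if len(distinct) <= 1:
            result[name] = distinct[0] if len(distinct) == 1 else None
    return result


# Toy frame with made-up values, standing in for the real `df`.
toy = pd.DataFrame({
    "BETREIBER_ABK": ["SBB", "SBB", "SBB"],
    "UMLAUF_ID": [None, None, None],
    "DURCHFAHRT_TF": [False, False, False],
    "LINIEN_ID": [4148, 42, 4212],
})

constants = constant_columns(toy)
print(sorted(constants))
```

Note that a column with a single value plus some missing entries (as described for PRODUKT_ID) is also reported by this check, since nulls are ignored.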

We are going to drop some columns and rename the remaining ones.
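
Assigning a new list to `df.columns` renames by position, so it silently mislabels columns if the CSV column order ever changes. A name-based mapping avoids that. The sketch below uses a toy frame with illustrative values and only a subset of the columns; the English names are the ones used in this notebook, while `RENAME`, `toy` and `renamed` are names introduced here for illustration.

```python
import pandas as pd

# Subset of the German-to-English names used in this notebook.
RENAME = {
    "BETRIEBSTAG": "departure_day",
    "HALTESTELLEN_NAME": "station_name",
    "ANKUNFTSZEIT": "planned_arrival_time",
    "AN_PROGNOSE": "actual_arrival_time",
}

# Toy frame whose column order deliberately differs from the mapping order.
toy = pd.DataFrame({
    "HALTESTELLEN_NAME": ["Moutier"],
    "BETRIEBSTAG": ["2019-11-02"],
    "AN_PROGNOSE": ["2019-11-03T01:37:05"],
    "ANKUNFTSZEIT": ["2019-11-03T01:37:00"],
    "lod": ["(link)"],
})

# Fail loudly if an expected column is absent instead of mislabelling data.
missing = set(RENAME) - set(toy.columns)
if missing:
    raise KeyError(f"columns not found: {sorted(missing)}")

# Drop by name, then rename by name; column order no longer matters.
renamed = toy.drop(columns=["lod"]).rename(columns=RENAME)
print(list(renamed.columns))
```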

In [65]:
# Drop the columns that are constant, empty, redundant or of unknown meaning.
df.drop(['lod', 'GdeNummer', 'DURCHFAHRT_TF', 'UMLAUF_ID', 'BETREIBER_NAME', 'BETREIBER_ABK',
         'BETREIBER_ID', 'FAHRT_BEZEICHNER', 'VERKEHRSMITTEL_TEXT'], axis=1, inplace=True)
# Rename the 17 remaining columns by position (relies on the column order of the CSV).
df.columns = ['departure_day', 'product_id', 'course_id', 'transport_type', 'planned', 'down', 'BPUIC',
              'station_name', 'planned_arrival_time', 'actual_arrival_time', 'cancelled_arrival',
              'planned_departure_time', 'actual_departure_time', 'cancelled_departure',
              'arrival_delay', 'departure_delay', 'geopos']
df.head()

Unnamed: 0,departure_day,product_id,course_id,transport_type,planned,down,BPUIC,station_name,planned_arrival_time,actual_arrival_time,cancelled_arrival,planned_departure_time,actual_departure_time,cancelled_departure,arrival_delay,departure_delay,geopos
0,2019-11-02,Zug,4148,RE,False,False,8500159,Grenchen Nord,2019-11-03T01:29:00,2019-11-03T01:29:22,REAL,2019-11-03T01:30:00,2019-11-03T01:30:34,REAL,False,False,"47.1918033634,7.38946302189"
1,2019-11-02,Zug,4148,RE,False,False,8500105,Moutier,2019-11-03T01:37:00,2019-11-03T01:37:05,REAL,2019-11-03T01:38:00,2019-11-03T01:38:30,REAL,False,False,"47.2806661619,7.38103944266"
2,2019-11-02,Zug,42,EC,False,False,8501120,Lausanne,2019-11-02T22:42:00,2019-11-02T22:59:17,REAL,2019-11-02T22:45:00,,PROGNOSE,True,False,"46.5167786487,6.62909314109"
3,2019-11-02,Zug,4212,RE,False,False,8505300,Lugano,,,PROGNOSE,2019-11-02T11:08:00,2019-11-02T11:09:08,REAL,False,False,"46.0055057275,8.94687441524"
4,2019-11-02,Zug,4212,RE,False,False,8505219,Lamone-Cadempino,2019-11-02T11:12:00,2019-11-02T11:12:55,REAL,2019-11-02T11:12:00,2019-11-02T11:13:28,REAL,False,False,"46.0397205271,8.93212223946"
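
The time columns in this output are ISO 8601 strings, so delays cannot be computed until they are parsed. The sketch below shows one way to do that, and also how to flag the after-midnight rows mentioned for BETRIEBSTAG. It uses three rows copied from the output above (a subset of columns); the names `sample`, `arrival_delay_s` and `after_midnight` are introduced here for illustration and are not part of the dataset.

```python
import pandas as pd

# Three rows copied from the df.head() output (subset of columns).
sample = pd.DataFrame({
    "departure_day": ["2019-11-02", "2019-11-02", "2019-11-02"],
    "station_name": ["Grenchen Nord", "Lausanne", "Lugano"],
    "planned_arrival_time": ["2019-11-03T01:29:00", "2019-11-02T22:42:00", None],
    "actual_arrival_time": ["2019-11-03T01:29:22", "2019-11-02T22:59:17", None],
})

# Parse the strings into datetimes; missing values become NaT.
for col in ["departure_day", "planned_arrival_time", "actual_arrival_time"]:
    sample[col] = pd.to_datetime(sample[col])

# Arrival delay in seconds (NaN where either timestamp is missing).
sample["arrival_delay_s"] = (
    sample["actual_arrival_time"] - sample["planned_arrival_time"]
).dt.total_seconds()

# True where the planned arrival falls on a later calendar day than departure_day.
sample["after_midnight"] = (
    sample["planned_arrival_time"].dt.normalize() > sample["departure_day"]
)

print(sample[["station_name", "arrival_delay_s", "after_midnight"]])
```

On these rows the Grenchen Nord arrival is 22 seconds late and the Lausanne arrival 1037 seconds late; only the latter is marked True in the dataset's own arrival_delay column.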


In [None]:
# TODO: clean the data a little before the analysis
