We try different models to predict the waiting time at a specific station (Goldbrunnenplatz) for a specific line (tram 9), given the time of day.

Since we are doing prediction rather than classical statistics, I decided to work with `scikit-learn` rather than `statsmodels`, even though `statsmodels` may be better suited to regression analysis.

In [1]:
%matplotlib inline
import numpy

from datetime import datetime

import matplotlib
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

import pandas as pd

# Not necessary, but I like the ggplot style better
plt.style.use('ggplot') 

# Read csv to dataframe
df = pd.read_csv('small.csv')
df_hpunkt = pd.read_csv('haltepunkt.csv')
df_hstelle = pd.read_csv('haltestelle.csv')

Let's get the rows for this station and line.

In [15]:
halt_id = df_hstelle[df_hstelle['halt_lang'] == 'Zürich, Goldbrunnenplatz']['halt_id'].item()
linie = 9
print(df[(df['halt_id_von'] == halt_id) & (df['linie'] == linie)].head())

       linie  richtung betriebsdatum  fahrzeug  kurs  seq_von  halt_diva_von  \
25910      9         2      02.07.17      3045     9        4           1012   
25924      9         1      02.07.17      2087     6       30           1012   
25978      9         1      02.07.17      3054     7       30           1012   
26002      9         1      02.07.17      3045     9       30           1012   
26061      9         2      02.07.17      3054     7        4           1012   

       halt_punkt_diva_von halt_kurz_von1 datum_von         ...          \
25910                    1           GOLP  02.07.17         ...           
25924                    0           GOLP  02.07.17         ...           
25978                    0           GOLP  02.07.17         ...           
26002                    0           GOLP  02.07.17         ...           
26061                    1           GOLP  02.07.17         ...           

       fahrweg_id  fw_no  fw_typ  fw_kurz      fw_lang  umlauf_von  

## Multiple linear regression
First, let's just try something "interesting":
1. We take the continuous variables as independent variables.
2. We put them into a regression to predict delays, which gives a regression model.
3. We then drop the insignificant variables from the model and rerun it, to see what R^2 we get.
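
The three steps above could be sketched as follows. This is only an illustration on synthetic data, not the tram dataset: the feature matrix, the coefficients, and the 0.05 threshold are all made up. Because `LinearRegression` in `scikit-learn` does not report p-values, the sketch swaps in univariate F-tests from `sklearn.feature_selection.f_regression` as a rough stand-in for "insignificant"; `statsmodels` would give proper per-coefficient tests.

```python
import numpy as np
from sklearn.feature_selection import f_regression
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): 4 continuous features, only the first two matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)

# Steps 1 and 2: regress the target on all continuous variables
full = LinearRegression().fit(X, y)
r2_full = full.score(X, y)

# Step 3: keep features whose univariate F-test p-value is below 0.05
# (hypothetical threshold), then refit on the reduced set
_, p_values = f_regression(X, y)
keep = p_values < 0.05
reduced = LinearRegression().fit(X[:, keep], y)
r2_reduced = reduced.score(X[:, keep], y)

print("kept features:", np.flatnonzero(keep))
print("R^2 full: %.3f, reduced: %.3f" % (r2_full, r2_reduced))
```

Measured on the training data, R^2 cannot increase when variables are dropped, so the comparison is more meaningful on a held-out test set.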

In [21]:
from sklearn import linear_model

data_x = []  # features (not filled yet)
data_y = []  # target: arrival delay
for index, row in df[(df['halt_id_von'] == halt_id) & (df['linie'] == linie)].iterrows():
    # Delay = actual arrival (ist) minus scheduled arrival (soll)
    diff = row['ist_an_von'] - row['soll_an_von']
    data_y.append(diff)


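# NOTE: diabetes and diabetes_X are defined in cell In [20], shown further down
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]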

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % numpy.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
# Coefficient of determination (R^2): 1 is perfect prediction
print('R^2 score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test,  color='black')
plt.plot(diabetes_X_test, regr.predict(diabetes_X_test), color='blue',
         linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()

linie                             9
richtung                          2
betriebsdatum              02.07.17
fahrzeug                       3045
kurs                              9
seq_von                           4
halt_diva_von                  1012
halt_punkt_diva_von               1
halt_kurz_von1                 GOLP
datum_von                  02.07.17
soll_an_von                   62994
ist_an_von                    62973
soll_ab_von                   63012
ist_ab_von                    62988
seq_nach                          5
halt_diva_nach                 2256
halt_punkt_diva_nach              1
halt_kurz_nach1                SWIE
datum_nach                 02.07.17
soll_an_nach                  63084
ist_an_nach1                  63061
soll_ab_nach                  63114
ist_ab_nach                   63087
fahrt_id                       9476
fahrweg_id                    50260
fw_no                             2
fw_typ                            1
fw_kurz                     

Name: 30536, dtype: object
linie                             9
richtung                          1
betriebsdatum              02.07.17
fahrzeug                       2097
kurs                              8
seq_von                          30
halt_diva_von                  1012
halt_punkt_diva_von               0
halt_kurz_von1                 GOLP
datum_von                  02.07.17
soll_an_von                   79986
ist_an_von                    79983
soll_ab_von                   80004
ist_ab_von                    79995
seq_nach                         31
halt_diva_nach                 2638
halt_punkt_diva_nach              0
halt_kurz_nach1                TALW
datum_nach                 02.07.17
soll_an_nach                  80046
ist_an_nach1                  80039
soll_ab_nach                  80058
ist_ab_nach                   80048
fahrt_id                      10159
fahrweg_id                    50259
fw_no                             1
fw_typ                            1
f
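
The loop in cell `In [21]` collects the delays one row at a time and never fills `data_x`. Below is a minimal sketch of how both could be built column-wise instead. It assumes the scheduled arrival `soll_an_von` serves as the time-of-day feature; that is an assumption, since the notebook does not say which feature is intended. The `toy` frame is a hypothetical stand-in for the filtered `df`: its first two rows reuse values from the printed records, and the other two are invented.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy rows using column names from the data above; the first two
# (scheduled, actual) pairs come from the printed records, the rest are made up
toy = pd.DataFrame({
    "soll_an_von": [62994, 79986, 30000, 45000],
    "ist_an_von": [62973, 79983, 30040, 45010],
})

# Target: delay = actual arrival minus scheduled arrival, computed column-wise
delay = toy["ist_an_von"] - toy["soll_an_von"]

# Feature (assumption): scheduled arrival as the time-of-day variable
X = toy[["soll_an_von"]]

# Fit a one-feature linear model of delay on time of day
model = LinearRegression().fit(X, delay)
print(delay.tolist())
print(model.coef_, model.intercept_)
```

Working on whole columns avoids `iterrows()`, which is slow on large frames, and keeps the feature and target aligned by index.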

In [20]:
from sklearn import datasets, linear_model

# Load the diabetes dataset
diabetes = datasets.load_diabetes()


# Use only one feature
diabetes_X = diabetes.data[:, numpy.newaxis, 2]
print(diabetes_X)

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into a training set (the test targets are set in cell In [21])
diabetes_y_train = diabetes.target[:-20]
print(diabetes.target)

[[ 0.06169621]
 [-0.05147406]
 [ 0.04445121]
 [-0.01159501]
 [-0.03638469]
 [-0.04069594]
 [-0.04716281]
 [-0.00189471]
 [ 0.06169621]
 [ 0.03906215]
 [-0.08380842]
 [ 0.01750591]
 [-0.02884001]
 [-0.00189471]
 [-0.02560657]
 [-0.01806189]
 [ 0.04229559]
 [ 0.01211685]
 [-0.0105172 ]
 [-0.01806189]
 [-0.05686312]
 [-0.02237314]
 [-0.00405033]
 [ 0.06061839]
 [ 0.03582872]
 [-0.01267283]
 [-0.07734155]
 [ 0.05954058]
 [-0.02129532]
 [-0.00620595]
 [ 0.04445121]
 [-0.06548562]
 [ 0.12528712]
 [-0.05039625]
 [-0.06332999]
 [-0.03099563]
 [ 0.02289497]
 [ 0.01103904]
 [ 0.07139652]
 [ 0.01427248]
 [-0.00836158]
 [-0.06764124]
 [-0.0105172 ]
 [-0.02345095]
 [ 0.06816308]
 [-0.03530688]
 [-0.01159501]
 [-0.0730303 ]
 [-0.04177375]
 [ 0.01427248]
 [-0.00728377]
 [ 0.0164281 ]
 [-0.00943939]
 [-0.01590626]
 [ 0.0250506 ]
 [-0.04931844]
 [ 0.04121778]
 [-0.06332999]
 [-0.06440781]
 [-0.02560657]
 [-0.00405033]
 [ 0.00457217]
 [-0.00728377]
 [-0.0374625 ]
 [-0.02560657]
 [-0.02452876]
 [-0.01806