# Data Quality Plan
Analysing the Dublin Bus Data provided to optimise model performance

## Summary
Each file has about 750,000 rows. Three quarters of a million.

## Ideas to handle data.
1. __Incremental Handing__
__Route Division__
Logic: Partition data into subsets of data which are internally comparable.
         Does it make sense that a single model should describe all routes?
         Since routes are different they should be modeled individually.

A machine can maintain 12 DF max in a single object. (atleast mine can)
If instead we extract each route from the data to build a dataframe of a single 
route. Then model it, take the best model. Now we have 100th of the data to handle in each increment.

When we have derived the model, we store it and its route, but can discard the data then. 

If we create a function that returns the model. We can either use re-assignment or scoping discard the dataframe in python.

For each route we store a model which is the sum of the models for the sections in which it is composed.

__Section Division__
Logic: The sum of the parts gives the value of the whole.

Many smaller data sets. This models the travel time for different sections in which the busses pass through. This reduces redundancy in the analysis of bus routes (which overlap)

Though since there are more sections than bus routes, there will be many more models to store, problem then becomes iterative. given t_0 the time we , we can calculate t_1 (the time we arrive in section1 after traversing seciont_0) then we apply the model for section 1 with t_1 as an input. (recirsive problem, easily described iteritavly)

the total time is the sum of all the predicted times for each section.

### Task
Create a function that tests different models and chooses the best.
Run this for each section storing each model. Then given a departure point and a destination we can derive the appropriate route and the sum of its sections.

How to derive the route. How often do time stamps occur. Accuracy of estimation of length of a section depends on the frequancy in which a time stamp is put out. Assume a section takes 100.5 seconds to cross. if a stamp is put out every second and there are 100 time stamps in a journey we know that the time taken is between 100 and 101 (small percentage error). The best estimate that can be made is that that time taken is in between 100 and 102.

If a time stamp occurs only every 50 seconds, then in a 100 second interval there could be 2 or 3 time stamps. i.e. we will predict it takes between 100 and 150.

Max-Min <= x <= (n+1/n)*(Max-Min), where n is the number of timestamps for a specific journey id occuring within a section)



##
2. __


We can use sparse data, take only every third entry for a given bus on a route. We then can fit the whole frame?


Best approach may be to develop models route by route, then save sparse data for increasing performance (if nexesary?)

SELECT t.id, t.key
FROM	
(
    SELECT id, key, ROW_NUMBER() OVER (ORDER BY key) AS rownum
    FROM datatable
) AS t
WHERE t.rownum % 30 = 0    -- or % 40 etc
ORDER BY t.key




In [1]:
#import sections
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from patsy import dmatrices
import matplotlib.patches as mpatches
import statsmodels.formula.api as sm
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from statsmodels.formula.api import logit
from sklearn.cross_validation import train_test_split

import os
import glob




In [2]:
# #method 1, ends at 12 dataframes.

# path=os.getcwd() + "\\data"

# def merge_frames(path):
#     column_names = ["Timestamp",
#                     "LineID", 
#                     "Direction",
#                     "Journey_Pattern_ID", 
#                     "Timeframe", 
#                     "Vehicle_Journey_ID", 
#                     "Operator", 
#                     "Congestion", 
#                     "Lon",
#                     "Lat", 
#                     "Delay", 
#                     "Block_ID",
#                     "Vehicle_ID",
#                     "Stop_ID",
#                     "At_Stop"]
    
#     #should help loop run a little faster.
#     read = pd.read_csv
#     concat = pd.concat
    
#     #stores contents of a folder as a list.
#     contents = os.listdir(path) 
#     content_length = len(contents)
    
    
#     #reads csv from data folder in cwd
#     accumulator = read(path+"\\"+contents[0], index_col=None, header=0, encoding="utf-8")
#     accumulator.columns = column_names
    
#     for i in range(content_length):

#         next_df = pd.read_csv(path+"\\"+contents[i], index_col=None, header=0)
#         next_df.columns = column_names

#         accumulator = pd.concat([accumulator[(accumulator["LineID"]==4)], next_df[(next_df["LineID"]==4)]], axis=0)
#         print(accumulator.shape, "acc")
        
#     return accumulator

# df_1 = merge_frames(path)

# df_1.shape


In [3]:

path=os.getcwd() + "\\data"

def extract_merge_frames(path):
    column_names = ["Timestamp",
                    "LineID", 
                    "Direction",
                    "Journey_Pattern_ID", 
                    "Timeframe", 
                    "Vehicle_Journey_ID", 
                    "Operator", 
                    "Congestion", 
                    "Lon",
                    "Lat", 
                    "Delay", 
                    "Block_ID",
                    "Vehicle_ID",
                    "Stop_ID",
                    "At_Stop"]
    
    
    #stores contents of a folder as a list.
    contents = os.listdir(path) 
    content_length = len(contents)
    
    
    #reads csv from data folder in cwd
    read = pd.read_csv
    accumulator = read(path+"\\"+contents[0], index_col=None, header=0)
    accumulator.columns = column_names
    set_of_routes = set(accumulator.LineID.unique())
    comparison_set = set_of_routes
    
    #should help loop run a little faster.
    concat = pd.concat
    unite = set_of_routes.union
    
        
    #read through all files once and determine maximum set of routes
    
    for i in range(1,content_length):
        next_df = pd.read_csv(path+"\\"+contents[i], index_col=None, header=0)
        next_df.columns = column_names
        next_df.columns = column_names
        next_set_of_routes = set(next_df.LineID.unique())
        set_of_routes = unite(next_set_of_routes)
        print("routes extracted from file:", i)
    
    print("total set of routes\n", set(set_of_routes)==set(comparison_set))
    print(comparison_set - set_of_routes)
    print(set_of_routes - comparison_set)
    
    for route in set_of_routes: 
        try:
            print(route)
            int(route) #if i get past this op, i can continue

            for i in range(1, content_length):
                print("extracting route:",route,"\tfrom file ",i)
                next_df = pd.read_csv(path+"\\"+contents[i], index_col=None, header=0)
                next_df.columns = column_names
                accumulator = pd.concat([accumulator[(accumulator["LineID"]==route)], next_df[(next_df["LineID"]==route)]], axis=0)
        
            accumulator.to_csv( "route-" + str(int(route)) + "-raw1.csv", encoding = "utf-8" )
        
            print ("route:", route , "complete\n")
        
        except:
            print("You just got NaN-ed!")
        
        
            
    return accumulator #i return is so i can assess the dataframe after

df_1 = extract_merge_frames(path)

routes extracted from file: 1
routes extracted from file: 2
routes extracted from file: 3


  if self.run_code(code, result):


routes extracted from file: 4
routes extracted from file: 5
routes extracted from file: 6
routes extracted from file: 7
routes extracted from file: 8
routes extracted from file: 9
routes extracted from file: 10
routes extracted from file: 11
routes extracted from file: 12
routes extracted from file: 13
routes extracted from file: 14
routes extracted from file: 15
routes extracted from file: 16
routes extracted from file: 17
routes extracted from file: 18
routes extracted from file: 19
routes extracted from file: 20
routes extracted from file: 21
routes extracted from file: 22
routes extracted from file: 23
routes extracted from file: 24
routes extracted from file: 25
routes extracted from file: 26
routes extracted from file: 27
routes extracted from file: 28
routes extracted from file: 29
routes extracted from file: 30
total set of routes
 False
set()
{nan, 32.0, 142.0, 111.0, 114.0, 51.0, 116.0, 118.0}
nan
You just got NaN-ed!
1.0
extracting route: 1.0 	from file  1
extracting route: 

In [4]:
df_1.head(5)

Unnamed: 0,Timestamp,LineID,Direction,Journey_Pattern_ID,Timeframe,Vehicle_Journey_ID,Operator,Congestion,Lon,Lat,Delay,Block_ID,Vehicle_ID,Stop_ID,At_Stop
11187,1357106972000000,238.0,0,,2013-01-02,1588,HN,0,-6.2765,53.416916,0,238003,33214,,0
11470,1357106991000000,238.0,0,,2013-01-02,1588,HN,0,-6.2765,53.416916,0,238003,33214,,0
11509,1357107001000000,238.0,0,,2013-01-02,1588,HN,0,-6.2765,53.416916,0,238003,33214,,0
11531,1357107003000000,238.0,0,,2013-01-02,1588,HN,0,-6.2765,53.416916,0,238003,33214,,0
11899,1357107042000000,238.0,0,,2013-01-02,1570,HN,0,-6.266617,53.417118,0,238002,33153,,0


array([ 238.])

In [None]:
# extra routes 32, 142, 111,114, 51, 116 , 118

path=os.getcwd() + "\\by_day-Nov2012"

def extract_merge_frames(path):
    column_names = ["Timestamp",
                    "LineID", 
                    "Direction",
                    "Journey_Pattern_ID", 
                    "Timeframe", 
                    "Vehicle_Journey_ID", 
                    "Operator", 
                    "Congestion", 
                    "Lon",
                    "Lat", 
                    "Delay", 
                    "Block_ID",
                    "Vehicle_ID",
                    "Stop_ID",
                    "At_Stop"]
    
    
    #stores contents of a folder as a list.
    contents = os.listdir(path) 
    content_length = len(contents)
    
    
    #reads csv from data folder in cwd
    read = pd.read_csv
    accumulator = read(path+"\\"+contents[0], index_col=None, header=0)
    accumulator.columns = column_names
    set_of_routes = set(accumulator.LineID.unique())
    comparison_set = set_of_routes
    
    #should help loop run a little faster.
    concat = pd.concat
    unite = set_of_routes.union
    
        
    #read through all files once and determine maximum set of routes
    
    for i in range(1,content_length):
        next_df = pd.read_csv(path+"\\"+contents[i], index_col=None, header=0)
        next_df.columns = column_names
        next_df.columns = column_names
        next_set_of_routes = set(next_df.LineID.unique())
        set_of_routes = unite(next_set_of_routes)
        print("routes extracted from file:", i)
    
    print("total set of routes\n", set(set_of_routes)==set(comparison_set))
    print(comparison_set - set_of_routes)
    print(set_of_routes - comparison_set)
    
    for route in set_of_routes: 
        try:
            print(route)
            int(route) #if i get past this op, i can continue

            for i in range(1, content_length):
                print("extracting route:",route,"\tfrom file ",i)
                next_df = pd.read_csv(path+"\\"+contents[i], index_col=None, header=0)
                next_df.columns = column_names
                accumulator = pd.concat([accumulator[(accumulator["LineID"]==route)], next_df[(next_df["LineID"]==route)]], axis=0)
        
            accumulator.to_csv( "route-" + str(int(route)) + "-raw2.csv", encoding = "utf-8" )
        
            print ("route:", route , "complete\n")
        
        except:
            print("You just got NaN-ed!")
        
        
            
    return accumulator #i return is so i can assess the dataframe after


df_1 = extract_merge_frames(path)



  if self.run_code(code, result):
  if self.run_code(code, result):


routes extracted from file: 1
routes extracted from file: 2
routes extracted from file: 3
routes extracted from file: 4
routes extracted from file: 5
routes extracted from file: 6
routes extracted from file: 7
routes extracted from file: 8
routes extracted from file: 9
routes extracted from file: 10
routes extracted from file: 11
routes extracted from file: 12
routes extracted from file: 13
routes extracted from file: 14
routes extracted from file: 15
routes extracted from file: 16
routes extracted from file: 17
routes extracted from file: 18
routes extracted from file: 19
routes extracted from file: 20
routes extracted from file: 21
routes extracted from file: 22
routes extracted from file: 23
routes extracted from file: 24
total set of routes
 False
set()
{'27B', '116', '51D', '33A', '185', 185.0, '17A', '45A'}
nan
You just got NaN-ed!
1
extracting route: 1 	from file  1
extracting route: 1 	from file  2
extracting route: 1 	from file  3
extracting route: 1 	from file  4
extracting r