# Data Quality Plan
Analysing the Dublin Bus Data provided to optimise model performance

## Summary
Each file has about 750,000 rows (three quarters of a million). The data has been analysed extensively. The purpose of this document is to analyse the raw data and its weaknesses, then organise it in a usable way and synthesise it into a form suitable for modelling.


## Ideas to handle data.

### __Incremental Handling__:


Each file is large, so handling multiple files, and multiple dataframes of the same scale as the files, is expensive on memory.
The files are currently supplied as the global Dublin Bus data for a given day. For handling, instead of a CSV file representing the full fleet for one day, it would be preferable to look at a single route for the month (that is, to partition by route pattern instead of by day).

Partitioning the data by route makes it more internally consistent. Rows and sets of rows are more comparable with each other because they pass the same sets of stops. Different routes operate relatively independently, so we can split the data into sizeable yet relatively independent parts. This structure suits our purposes, since we wish to analyse and model each route individually.

This process should be completed iteratively. Iterate through each file to extract every possible journey pattern.
Then, for each journey pattern, iterate through each file, extracting and re-constructing information only for that journey pattern. If a programmer attempts all these tasks at once, the dataframes rapidly use up the available memory. As the machine begins to run low on RAM it has to access the disk (even an SSD), and performance decreases substantially. The process could also fail with a memory error.
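
As a rough illustration of keeping memory use low, the sketch below reads a file in chunks and keeps only the rows for one route. It is a minimal example rather than the pipeline used in this document: `SAMPLE_DAY`, `rows_for_route` and the three-column layout are hypothetical stand-ins for a real daily file.

```python
import io
import pandas as pd

# Hypothetical miniature of one daily file: two routes mixed together.
SAMPLE_DAY = io.StringIO(
    "Timestamp,LineID,Vehicle_ID\n"
    "1,4,33001\n"
    "2,13,33002\n"
    "3,4,33001\n"
    "4,13,33002\n"
)

def rows_for_route(source, route, chunksize=2):
    """Collect the rows of one route while holding only one chunk in memory."""
    kept = []
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows.
    for chunk in pd.read_csv(source, dtype={"LineID": str}, chunksize=chunksize):
        kept.append(chunk[chunk["LineID"] == route])
    return pd.concat(kept, ignore_index=True)

route_4 = rows_for_route(SAMPLE_DAY, "4")
print(route_4)
```

On a real file the chunk size would be far larger (tens of thousands of rows); the tiny value here only makes the chunking visible.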


### __Re-Construction__:


From analysing the data, we can see that many columns are incorrect or null. However, by knowing the relationships and semantics between different columns, it is possible to reconstruct "lost" information with a reasonable level of certainty. In handling this data, Journey_Pattern_ID will be used to re-construct LineID and Direction, and Vehicle_ID, Timeframe and Vehicle_Journey_ID will be used to re-construct missing Journey_Pattern_ID values.


### __Analysis and Test Cases__:


Other sections of this document were used simply to explore the data and tease out properties of different columns, to help the programmer understand the data's structure and properties. These include certain observations about missing data, inconsistent data types and some other anomalies.

Other considerations and challenges arising from this data are: how to derive the route, and how often time stamps occur.



### __Section Division__: (A potential analytic model for the data supplied)


Logic: The sum of the parts gives the value of the whole.

    - Many smaller data sets. This models the travel time for the different sections through which the buses pass.
        - This reduces redundancy in the analysis of bus routes (which overlap)

    - Though since there are more sections than bus routes, there will be many more models to store. 
        - Given t_0, the start time, we can calculate t_1 (the time we arrive in section_1 after traversing section_0)
            - Then we apply the model for section_1 with t_1 as an input. (recursive problem, easily described iteratively)
                - The total time is the sum of all the predicted times for each section.

    - Accuracy of the estimated length (in time) of a section depends on the frequency with which time stamps are put out.
        - Assume a section takes 100.5 seconds to cross. 
            - If a stamp is put out every second and there are 100 time stamps in a journey.
                - We know that the time taken is between 100 and 101 (small percentage error).
                    - The best estimate that can be made is that the time taken is between 100 and 102.

    - If a time stamp occurs only every 50 seconds.
        - Then in a 100 second interval there could be 2 or 3 time stamps.
            - i.e. we will predict it takes between 100 and 150.
    
    - In general: Max-Min <= x <= ((n+1)/n)*(Max-Min) 
        - where n is the number of timestamps for a specific journey id occurring within a section
            - x is the actual time taken to cross the section
                - Maximum margin of error is 1/n
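
The general bound above can be written as a small helper. This is only a sketch that applies the inequality exactly as stated, taking n to be the number of timestamps supplied; `crossing_time_bounds` and the sample timestamps are hypothetical.

```python
def crossing_time_bounds(timestamps):
    """Return (lower, upper) bounds in seconds on the time spent in a section.

    Assumes `timestamps` holds the report times (in seconds) of one journey
    inside one section, and applies Max-Min <= x <= ((n+1)/n)*(Max-Min).
    """
    n = len(timestamps)
    if n < 2:
        raise ValueError("need at least two timestamps to bound a crossing time")
    observed = max(timestamps) - min(timestamps)
    return observed, observed * (n + 1) / n

# Five reports, 20 seconds apart (made-up values).
lower, upper = crossing_time_bounds([0, 20, 40, 60, 80])
print(lower, upper)
```

The gap between the two bounds shrinks as n grows, which is the 1/n margin noted in the list.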

### __General Goal__:


Create a function that tests different models on a route (or a section), or whatever scope the model has.
Run this for each section or route (whichever scope is used), storing each model. Then, given a departure point and a destination, we can derive the appropriate route and the sum of its sections.
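
The section-by-section chaining described under Section Division might look like the sketch below. It assumes each stored model can be called with the time of entry to a section and returns the predicted seconds to cross it; `total_travel_time`, `section_0` and `section_1` are hypothetical placeholders, not fitted models.

```python
def total_travel_time(section_models, departure_time):
    """Chain per-section models: each maps entry time -> seconds to cross."""
    current_time = departure_time
    for predict in section_models:
        # The arrival time at the next section is the input to its model.
        current_time += predict(current_time)
    return current_time - departure_time

# Hypothetical stand-ins for fitted models (times in seconds since midnight).
def section_0(entry_time):
    return 120.0

def section_1(entry_time):
    # Pretend this section is slower once the 08:00 rush has started.
    return 90.0 if entry_time < 8 * 3600 else 150.0

total = total_travel_time([section_0, section_1], departure_time=8 * 3600 - 60)
print(total)
```

Leaving one minute before 08:00 means the bus enters the second section after 08:00, so the slower prediction is the one that gets summed.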

In [None]:
#import section: a broad set of imports to cover all bases
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from patsy import dmatrices
import matplotlib.patches as mpatches
import statsmodels.formula.api as sm
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from statsmodels.formula.api import logit
from sklearn.model_selection import train_test_split

import os
import glob


# Processing Overview
 
Re-construction: First fixes Journey_Pattern_ID (takes time), then fixes LineID and Direction (very fast).
                 Issues: for the November data there may be data typing issues; these should be resolved here.

Extraction: Partitions the data by LineID.
            Issues: column types when the data is read in.


If I had to do this manually, what would I do?
I would open a raw data file and perform the re-constructions on each file, so I wouldn't have to do it again.
     - Est. 1 min per file ==> unfortunately about 1 hour to fully re-construct the raw data.
     - Improvements: time-profile this function in Eclipse (no improvement yielded; try horizontal scaling on EC2).
     - Possibly distribute the task across a number of t2.micro instances to reduce the overall runtime.
     
I would iterate over the re-constructed files extracting info route by route, then deleting it from the source file so I wouldn't have to read it again (iterations should speed up towards the end).
Files will be saved to a separate folder.

All LineID and Journey_Pattern_ID values (they can contain both numbers and letters) should be converted to strings.
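
One way to get strings from the start is to tell `read_csv` the column type instead of converting afterwards. The sketch below uses a made-up two-row column to show what type inference does to zero-padded IDs; `numeric_only`, `inferred` and `as_text` are illustrative names only.

```python
import io
import pandas as pd

# A column that happens to contain only digits (hypothetical values).
numeric_only = "LineID\n0004\n0013\n"

# Default inference parses the values as integers and drops the leading zeros.
inferred = pd.read_csv(io.StringIO(numeric_only))

# Declaring the dtype keeps each value exactly as written in the file.
as_text = pd.read_csv(io.StringIO(numeric_only), dtype={"LineID": str})

print(list(inferred["LineID"]), list(as_text["LineID"]))
```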

Note: I have cut code that does not fit the purpose of this document; the code demonstrating Journey_Pattern_ID characteristics can be found in the previous commit. The test cases in this document are also commented out, so the reader can run them selectively and restart the kernel at their convenience.





## Re-Construction 1
This function teases out the most exclusive set of combinations of Timeframe, Vehicle_Journey_ID and Vehicle_ID that have a "null" string for Journey_Pattern_ID (hopefully these conditions narrow down to a series of rows that represents a single journey). It then looks for any Journey_Pattern_ID that was transmitted during that journey and replaces all the null values with the known journey pattern.

Limitation: re-constructing a single route in a single file is about as expensive as re-constructing all routes.
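
For comparison, the same rule (fill the nulls of a journey when exactly one known pattern was transmitted during it) can be expressed with a pandas `groupby` instead of nested loops. This is an alternative sketch, not the function used in this document, and it has not been timed against the real files; `fill_jpid_by_group` and the sample frame are hypothetical.

```python
import pandas as pd

def fill_jpid_by_group(df):
    """Fill "null" pattern IDs where a journey has exactly one known ID."""
    keys = ["Timeframe", "Vehicle_Journey_ID", "Vehicle_ID"]
    out = df.copy()
    # Temporary column: the known IDs, with "null" turned into a real missing value.
    out["_known"] = out["Journey_Pattern_ID"].where(out["Journey_Pattern_ID"] != "null")
    grouped = out.groupby(keys)["_known"]
    n_known = grouped.transform("nunique")    # distinct known IDs per journey
    first_known = grouped.transform("first")  # first known ID per journey
    fillable = out["Journey_Pattern_ID"].eq("null") & n_known.eq(1)
    out.loc[fillable, "Journey_Pattern_ID"] = first_known[fillable]
    return out.drop(columns="_known")

sample = pd.DataFrame({
    "Timeframe": ["2012-12-01"] * 5,
    "Vehicle_Journey_ID": [1, 1, 1, 2, 2],
    "Vehicle_ID": [33001, 33001, 33001, 33002, 33002],
    "Journey_Pattern_ID": ["null", "00040001", "null", "null", "null"],
})
fixed = fill_jpid_by_group(sample)
print(fixed["Journey_Pattern_ID"].tolist())
```

The second journey in the sample never transmitted a pattern, so its nulls are left alone, which corresponds to the "not enough info" case.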


In [None]:

#This restores about 20% of data from nulls
def fix_JPID(dataframe):
    
    #should help loop run a little faster
    loc = dataframe.loc
 
    #every day
    time_frames = set(dataframe[ dataframe["Journey_Pattern_ID"] == "null" ].Timeframe.unique())
    
    total1=len(time_frames)
    increment1 = 0
    
    not_enough_info = 0
    too_much_info = 0
    total_attempts = 0
    
    
    for day in time_frames:
        increment1 += 100
        
        #find list of vehicle journey ids with nulls during that day
        vjid = set( dataframe[ (dataframe   [ "Timeframe" ]      ==  day  ) &\
                               (dataframe ["Journey_Pattern_ID"] == "null") ]\
                               .Vehicle_Journey_ID.unique())                     
        total2 = len(vjid)
        increment2 = 0
        
        for run in vjid: #vehicle_journey_id
#             print("just working away...")
            increment2 += 100
            print(increment1/total1,increment2/total2, sep="\t")
            #vehicle ids
            vids = set(dataframe[ (dataframe    [ "Timeframe" ]     ==  day   ) &\
                                  (dataframe ["Journey_Pattern_ID"] == "null" ) &\
                                  (dataframe ["Vehicle_Journey_ID"] ==  run   ) ]\
                                   .Vehicle_ID.unique())

            for vehicle in vids:
                #list of potential journeys eligible for re-construction
                re_construct = list(loc[ (dataframe    ["Timeframe"]       ==   day )  &\
                                         (dataframe ["Vehicle_Journey_ID"] ==   run )  &\
                                         (dataframe    ["Vehicle_ID"]      == vehicle ),\
                                         "Journey_Pattern_ID" ].unique() )  

                total_attempts += 1

                #re-constructs journey pattern to non-null entry
                if len(re_construct) == 2:
                    if re_construct[0] != "null":
                        #replaces nulls
                        loc[ (dataframe    ["Timeframe"]       ==  day    ) &\
                             (dataframe ["Vehicle_Journey_ID"] ==  run    ) &\
                             (dataframe    ["Vehicle_ID"]      == vehicle ), \
                             "Journey_Pattern_ID" ] = re_construct[0]
                    
                    else:
                        #replaces nulls
                        loc[ (dataframe    ["Timeframe"]        == day     ) &\
                             (dataframe ["Vehicle_Journey_ID"]  == run     ) &\
                             (dataframe    ["Vehicle_ID"]       == vehicle ), \
                            "Journey_Pattern_ID" ] = re_construct[1]
                
                #if re-construction cannot occur, tally reason why it couldn't.
                else:
                    
                    #print("Exception for", day, run, re_construct) #displays exception circumstances
                    if len(re_construct) > 2:
                        too_much_info +=1
                    else:
                        not_enough_info +=1

    print("Total candidates:", total_attempts ,"\tnot enough info", not_enough_info, "\ntoo much info:", too_much_info)
    
    return dataframe
    


In [None]:
# test_me = pd.read_csv("siri.20130122.csv")
# this could take about an hour

# columns   =    ["Timestamp",
#                 "LineID", 
#                 "Direction",
#                 "Journey_Pattern_ID", 
#                 "Timeframe", 
#                 "Vehicle_Journey_ID", 
#                 "Operator", 
#                 "Congestion", 
#                 "Lon",
#                 "Lat", 
#                 "Delay", 
#                 "Block_ID",
#                 "Vehicle_ID",
#                 "Stop_ID",
#                 "At_Stop"]

# test_me.columns = columns
# before ="__Before:__\n" + str(test_me[test_me["Journey_Pattern_ID"] == "null"].count())
# test_me = fix_JPID(test_me)
# print(before)
# print("__After:__\n",test_me[test_me["Journey_Pattern_ID"] == "null"].count())
# # this one takes unbelievably long :(
# it has to go from 0 to 100 once (it's only one day)

In [None]:
# test_me = pd.read_csv("route-4-raw1.csv", index_col=0)
# # print(test_me.head())

# columns   =    ["Timestamp",
#                 "LineID", 
#                 "Direction",
#                 "Journey_Pattern_ID", 
#                 "Timeframe", 
#                 "Vehicle_Journey_ID", 
#                 "Operator", 
#                 "Congestion", 
#                 "Lon",
#                 "Lat", 
#                 "Delay", 
#                 "Block_ID",
#                 "Vehicle_ID",
#                 "Stop_ID",
#                 "At_Stop"]

# test_me.columns = columns
# print("__Before:__\n",test_me[test_me["Journey_Pattern_ID"] == "null"].count())
# test_me = fix_JPID(test_me)
# print("__After:__\n",test_me[test_me["Journey_Pattern_ID"] == "null"].count())

# test_me.to_csv("re_constructed_Dec\\route4-reconstructed.csv", index = False)
# # this one runs faster. it goes from 0 to 100 30 times (it's a month)

## Re-Construction 2. 
A simple section that re-derives LineID and Direction from the Journey_Pattern_ID string.
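
The slicing can be tried on a few values first. In this sketch the pattern IDs are made-up examples in the eight-character layout the function assumes (four characters of LineID, then one character of Direction).

```python
import pandas as pd

# Hypothetical pattern IDs, plus one row that is still "null".
patterns = pd.Series(["00040001", "046A1001", "null"])

line_id = patterns.str[:4]   # first four characters
direction = patterns.str[4]  # fifth character; missing if the string is too short

print(line_id.tolist())
print(direction.tolist())
```

Rows that are still "null" end up with the LineID "null" and a missing Direction, so Re-Construction 1 should run first.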


In [None]:

def fix_LID_and_Dir(dataframe):
    #note: this is an alias, not a copy, so the frame passed in is modified in place
    modified_frame = dataframe
    #make sure every pattern id is a string before slicing it
    modified_frame["Journey_Pattern_ID"] = dataframe["Journey_Pattern_ID"].astype(str)
    #characters 0-3 hold the zero-padded LineID
    modified_frame["LineID"] =  dataframe["Journey_Pattern_ID"].str[:4]
#     modified_frame[modified_frame["LineID"].str[:3] == "000"] =  dataframe["Journey_Pattern_ID"].str[3]
#     modified_frame[modified_frame["LineID"].str[:2] == "00"] =  dataframe["Journey_Pattern_ID"].str[2:4]
#     modified_frame[modified_frame["LineID"].str[:1] == "0"] =  dataframe["Journey_Pattern_ID"].str[1:4]
    #character 4 holds the Direction
    modified_frame["Direction"] = dataframe["Journey_Pattern_ID"].str[4]
    
    return modified_frame
    

In [None]:

# test_me = pd.read_csv("route4-reconstructed.csv", index_col = None)
# test_me = fix_LID_and_Dir(test_me)
# test_me.head()

## Re-Construction 3.
Takes both re-construction methods and applies them to all files.


In [None]:
def re_construct(path, month): #month should be Nov or Dec and is case sensitive
    
    read = pd.read_csv
    fix_jpid = fix_JPID
    fix_lidir = fix_LID_and_Dir
    
    #stores contents of a folder as a list.
    contents = os.listdir(path)
    
    #small performance improvement.
    content_length = len(contents)
    
    columns   =    ["Timestamp",
                    "LineID", 
                    "Direction",
                    "Journey_Pattern_ID", 
                    "Timeframe", 
                    "Vehicle_Journey_ID", 
                    "Operator", 
                    "Congestion", 
                    "Lon",
                    "Lat", 
                    "Delay", 
                    "Block_ID",
                    "Vehicle_ID",
                    "Stop_ID",
                    "At_Stop"]
    
    
    for i in range(content_length):
        #reads csv from the folder given by path
        #os.path.join picks the right separator for windows, mac or linux.
        modify_me = read(os.path.join(path, contents[i]), index_col=None, header=0, encoding="utf-8")
        
        #Sets columns names for dataframe (so it can be operated on easily and readably)
        modify_me.columns = columns
        modify_me = fix_jpid(modify_me)
        modify_me = fix_lidir(modify_me)
        #index=False keeps the saved file at 15 columns, so the column names above still fit when it is read back.
        modify_me.to_csv(os.path.join("re_constructed_"+month, "re_con_"+contents[i]), index=False, encoding="utf-8")
        
    #returns only the last re-constructed frame (handy for a quick check)
    return modify_me


    

In [None]:
# #this is the path to november data 
# path=os.getcwd() + "\\data1"

# re_construct(path, "Dec")

In [None]:
# #this is the path to november data 
# path=os.getcwd() + "\\data1"

# #this will save to Nov file
# re_construct(path, "Nov")
    

## Extraction Part 1.
- This function is used to extract a single route from all files.
- Requirement: the folder this file is contained in __MUST CONTAIN__ a folder called __"data"__ containing all the files where extraction will occur. 


- This will need to run for every LineID. Calling it by hand for every route does not scale, so this must be automated within another function.

In [None]:

def extract_route(path, route):
    
    #should help loop run a little faster.
    read = pd.read_csv
    concat = pd.concat
    
    #stores contents of a folder as a list.
    contents = os.listdir(path)
    
    #small performance improvement.
    content_length = len(contents)
    
    columns   =    ["Timestamp",
                    "LineID", 
                    "Direction",
                    "Journey_Pattern_ID", 
                    "Timeframe", 
                    "Vehicle_Journey_ID", 
                    "Operator", 
                    "Congestion", 
                    "Lon",
                    "Lat", 
                    "Delay", 
                    "Block_ID",
                    "Vehicle_ID",
                    "Stop_ID",
                    "At_Stop"]
    
    
    #Reads the first csv from the data folder and keeps only the requested route.
    #os.path.join picks the right separator for windows, mac or linux.
    accumulator = read(os.path.join(path, contents[0]), index_col=None, header=0, encoding="utf-8")
    
    #Sets columns names for dataframe (so it can be operated on easily and readably)
    accumulator.columns = columns
    accumulator = accumulator[accumulator["LineID"] == route]
    
    #Starts at 1: file 0 is already in the accumulator, so re-reading it would duplicate its rows.
    for i in range(1, content_length):
        print("extracting", route, "from file", i)
        next_df = read(os.path.join(path, contents[i]), index_col=None, header=0, encoding="utf-8")
        next_df.columns = columns

        #Appends this file's rows for the route to the rows collected so far.
        accumulator = concat([accumulator, next_df[next_df["LineID"] == route]], axis=0)
        
#         print(accumulator.shape, "acc") (use this to track concats are happening correctly; debugging)
        
    return accumulator







In [None]:
# # the first line takes the path to the folder containing this document.
# # for me this path looks like this. C:\Users\Andy\Desktop\DataAnalytics
# # then \data is appended to the path. (now it is the path to the data folder)
# # on a windows system the separator for paths is "\" , on mac or linux use /
# # I also use two separators as the "\" is the esc char in python, so to use it in the path it must be escaped from itself.

# #if you change the folder name or need to make this run on mac or linux, this is the line you change.
# path=os.getcwd() + "\\data1"
# #calls function
# df_1 = extract_route(path, "13") #this would extract for route 13.

# #shows "shape of result", important for debugging.
# df_1.shape
# df_1.to_csv("route-13-raw1.csv" , encoding="utf-8")

# path=os.getcwd() + "\\data2"

# #extract_route takes only a path and a route, so no encoding argument is passed here.
# df_1 = extract_route(path, "26")
# df_1.to_csv( "route-26-raw2.csv", encoding="utf-8")

## Extraction Part 2.
This function takes the first function (which extracts a single route from the files) and scales it up.
We don't have a complete list of routes, so we will skim over the files as quickly as we can and put together a list of routes.

It would be more efficient to construct this complete list of routes while performing the other functions, but to save programmer time I will develop this independently, run it once, and save the result. This also saves time if I need to do partial test runs of the other extract functions.
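
Skimming can be made cheaper by reading only the LineID column of each file. The sketch below assumes each file has a header row naming that column (the functions in this document rename the columns after reading instead, so this would need adapting); `routes_in`, `day_1` and `day_2` are hypothetical.

```python
import io
import pandas as pd

def routes_in(sources):
    """Return the sorted set of LineID values found across all sources."""
    routes = set()
    for source in sources:
        # usecols skips every other column; dtype keeps IDs such as "046A" and "0004" as text.
        line_ids = pd.read_csv(source, usecols=["LineID"], dtype={"LineID": str})["LineID"]
        routes.update(line_ids.dropna().unique())
    return sorted(routes)

# Two made-up daily files with one route in common.
day_1 = io.StringIO("Timestamp,LineID\n1,0004\n2,0013\n")
day_2 = io.StringIO("Timestamp,LineID\n3,0013\n4,046A\n")

all_routes = routes_in([day_1, day_2])
print(all_routes)
```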



In [None]:

def find_routes(path):
    #should help loop run a little faster.
    read = pd.read_csv
    
    #stores contents of a folder as a list.
    contents = os.listdir(path)
    
    #small performance improvement.
    content_length = len(contents)
    
    columns   =    ["Timestamp",
                    "LineID", 
                    "Direction",
                    "Journey_Pattern_ID", 
                    "Timeframe", 
                    "Vehicle_Journey_ID", 
                    "Operator", 
                    "Congestion", 
                    "Lon",
                    "Lat", 
                    "Delay", 
                    "Block_ID",
                    "Vehicle_ID",
                    "Stop_ID",
                    "At_Stop"]
    
    
    #reads csv from data folder in cwd    
    #os.path.join picks the right separator for windows, mac or linux.
    accumulator = read(os.path.join(path, contents[0]), index_col=None, header=0, encoding="utf-8")
    
    #Sets columns names for dataframe (so it can be operated on easily and readably)
    accumulator.columns = columns
    
    set_of_routes = set(accumulator.LineID.unique())
#     comparison_set = set(set_of_routes) #this is used to test the necessity of looping through all files
    
    #read through all files once and determine maximum set of routes
    for i in range(1,content_length):
        
        #if you change the folder name, path is the argument you change.
        next_df = read(os.path.join(path, contents[i]), index_col=None, header=0, encoding ="utf-8")
        next_df.columns = columns
        
        #update() grows the running set in place; a union method bound to the
        #first file's set would drop routes found in the files in between.
        set_of_routes.update(next_df.LineID.unique())
        print("Constructing Route Set... Current File:", i)
    
    # This block of code can be used to show that extracting all routes from a single file does not work; it is necessary to go through each file exhaustively
#
#     print("total set of routes\n", set(set_of_routes)==set(comparison_set))
#     print(comparison_set - set_of_routes)
#     print(set_of_routes - comparison_set)
    
    #numpy integers (what pandas returns for a numeric column) are not instances of int, so both are checked
    intset = {str(x) for x in set_of_routes if isinstance(x, (int, np.integer))}
    strset = {x for x in set_of_routes if isinstance(x, str)}
    set_of_routes = sorted(intset.union(strset), key=str)
    
    return set_of_routes


    

    
    
    
    




In [None]:
# #if you change the folder name or need to make this run on mac or linux, this is the line you change.
# path=os.getcwd() + "\\data1"

# all_routes = find_routes(path)

# print(all_routes)

## Extraction Part 3.
This function takes the full set of routes from the re-constructed files and then, for each route, scrapes all the re-constructed files (which are sorted by date) into a file dedicated to that single route.

At this stage the route should be passed in the format 0004 for the number 4 bus. The path is to the re-constructed folder for a particular month.
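
If a route arrives as a plain number or an unpadded string, it can be brought into the four-character format with `str.zfill`. This is a small sketch; `to_route_code` is a hypothetical helper that is not used elsewhere in this document.

```python
def to_route_code(line_id, width=4):
    """Left-pad a route with zeros, e.g. 4 -> "0004" and "46A" -> "046A"."""
    return str(line_id).strip().zfill(width)

codes = [to_route_code(4), to_route_code("46A"), to_route_code("0013")]
print(codes)
```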


In [None]:
def complete_extraction(path):
    
    all_routes = find_routes(path)
    
    for route in all_routes:
        dataframe = extract_route(path, route)
        dataframe.to_csv("alpha-"+route+".csv")
        

In [None]:
# #if you change the folder name or need to make this run on mac or linux, this is the line you change.
path = os.getcwd() + "\\data1"

complete_extraction(path)
    

# I cut some code used for experimentation
It can be seen in the previous commit. Now that we have explored the data, many of these properties are known and documented. I have a spare copy of the code that inhabited this section, but excluded it from this document as it was not clean and clear-cut and might only serve to disorientate the reader.