## Intro

We have been running into multiple issues with our data, so we need to revisit how we clean and separate it. The plan is as follows:

* Transform LineID using JourneyPatternID (we have to check that this step works properly)
* Separate the data into individual files based on this.

AFTERWARDS we can clean the data

* Transform AtStop based on StopID? 
* Remove rows where bus is not AtStop? 
* Normalize time and location? 


* Get stops, routes, and routestation information? Link external data to our data, extracting headsigns? NOTE: We must only add StopIDs which show up in our data. Adding other StopIDs breaks our model. 
* Find the order of each stop on each route - how to go about this? 
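One possible way to approach the stop-order question is to take, for each journey, the sequence of StopIDs in time order and then pick the sequence that occurs most often. The sketch below assumes the data has already been restricted to a single line and journey pattern, and that a journey can be identified by the pair (`TimeFrame`, `VehicleJourneyID`); both are assumptions, and `most_common_stop_sequence` is a hypothetical helper name.

```python
from collections import Counter

import pandas as pd


def most_common_stop_sequence(df: pd.DataFrame) -> list:
    """Return the StopID sequence observed most often across journeys.

    Assumes df is non-empty and holds rows for one line and one pattern.
    """
    sequences = Counter()
    ordered = df.sort_values("Timestamp")
    # Assumption: (TimeFrame, VehicleJourneyID) identifies a single journey
    for _, journey in ordered.groupby(["TimeFrame", "VehicleJourneyID"]):
        stops = journey["StopID"].dropna()
        # Keep a stop only when it differs from the previous row's stop
        visited = stops[stops != stops.shift()]
        sequences[tuple(visited)] += 1
    return list(sequences.most_common(1)[0][0])
```

Journeys with missed or skipped stops produce shorter sequences, so the most common sequence is only a candidate ordering and would still need checking against the external route data.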


Retrain models based on this new cleaned data

## New plan

* Transform LineID using JourneyPatternID - Include only major variations of each line (1001, 0001)
* Separate the data into individual files based on this.

AFTERWARDS we can clean the data

* Remove all journeys which don't start at the most common first stop
* Use only the rows where the StopID changes (instead of filtering on AtStop)
* Normalize time, include Hour & Day 
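The three cleaning steps above could be sketched roughly as follows. This is not the final cleaning code: it assumes that a journey is identified by (`TimeFrame`, `VehicleJourneyID`) and that `Timestamp` is in microseconds since the Unix epoch, neither of which is confirmed in this notebook. `clean_line` and `JOURNEY_KEYS` are hypothetical names.

```python
import pandas as pd

# Assumption: these two columns together identify one journey
JOURNEY_KEYS = ["TimeFrame", "VehicleJourneyID"]


def clean_line(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the three cleaning steps to the rows of a single line."""
    df = df.sort_values(JOURNEY_KEYS + ["Timestamp"]).copy()

    # 1. Keep only journeys that start at the most common first stop
    first_per_journey = df.groupby(JOURNEY_KEYS)["StopID"].first()
    common_first = first_per_journey.mode().iloc[0]
    first_per_row = df.groupby(JOURNEY_KEYS)["StopID"].transform("first")
    df = df[first_per_row == common_first]

    # 2. Keep only rows where StopID changes within a journey
    previous_stop = df.groupby(JOURNEY_KEYS)["StopID"].shift()
    df = df[df["StopID"] != previous_stop].copy()

    # 3. Add Hour and Day (assumes Timestamp is microseconds since epoch)
    when = pd.to_datetime(df["Timestamp"], unit="us")
    df["Hour"] = when.dt.hour
    df["Day"] = when.dt.dayofweek  # Monday is 0
    return df
```

The first row of every journey is always kept in step 2, because there is no previous stop to compare it with.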

STOPS, ROUTES, ROUTESTATIONS

* Use external data - it is accurate for the main variations according to Charlotte
* Collect StopID, Lat, Lon, order, headsigns etc 
* Validate this using the cleaned data? 
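A minimal sketch of how the external stop data might be checked against the cleaned data. It assumes the external data has been loaded into a dataframe with a `StopID` column; `compare_stops` and the keys of the returned dictionary are hypothetical names.

```python
import pandas as pd


def compare_stops(external: pd.DataFrame, observed: pd.DataFrame) -> dict:
    """Compare StopIDs in the external data with those seen in our data."""
    external_ids = set(external["StopID"])
    observed_ids = set(observed["StopID"].dropna())
    return {
        # External rows that are safe to use: the stop appears in our data
        "usable": external[external["StopID"].isin(observed_ids)],
        # Stops we observed that the external data does not describe
        "missing_from_external": sorted(observed_ids - external_ids),
        # External stops that never show up in our data
        "unseen_in_data": sorted(external_ids - observed_ids),
    }
```

Only the `usable` rows would be carried forward, which keeps to the rule that we must not add StopIDs that do not show up in our data.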

TIMETABLE 

* Can use Eoghan's timetable information, as we are only using the main variations

Retrain models based on this new cleaned data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import statsmodels.formula.api as sm

import pprint as pp

import time
import os

%matplotlib inline

## Reading Data & Formatting

In [2]:
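def get_line(string):
    """Return the line number: the first four characters of a
    JourneyPatternID with leading zeros stripped."""
    if isinstance(string, str) and len(string) > 4:
        return string[:4].lstrip("0")
    else:
        print("Error with", string, "on get_line")
        return None

def get_journey(string):
    """Return the journey pattern: the last four characters of a
    JourneyPatternID."""
    if isinstance(string, str) and len(string) > 4:
        return string[-4:]
    else:
        print("Error with", string, "on get_journey")
        return None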
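The two helpers work on one value at a time. The same split could also be sketched on a whole column using pandas string methods, which avoids the per-row function call. `split_journey_pattern` is a hypothetical name, and unlike the helpers above it stays silent on bad values and simply leaves them missing.

```python
import pandas as pd


def split_journey_pattern(patterns: pd.Series) -> pd.DataFrame:
    """Split full JourneyPatternID values into LineID and journey pattern."""
    s = patterns.astype(object)
    # Same rule as get_line / get_journey: only values longer than 4 chars
    valid = s.str.len() > 4
    line = s.str[:4].str.lstrip("0").where(valid)
    journey = s.str[-4:].where(valid)
    return pd.DataFrame({"LineID": line, "JourneyPatternID": journey})
```

For example, `00150001` would split into LineID `15` and pattern `0001`, while a value such as `null` ends up missing in both columns.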

In [3]:
def insert_into_file(df, writefile):
    """  This function appends a dataframe (df) to a file (writefile)
        without a header row. Append mode creates the file if it
        doesn't exist, so no separate write branch is needed.
    """
    df.to_csv(writefile, mode='a', header=False, index=False)

In [4]:
def filter(filename, target_dir):
    df = pd.read_csv(filename, low_memory=False, header=None)
    df.columns = ["Timestamp", "LineID", "Direction", "JourneyPatternID", "TimeFrame", 
                  "VehicleJourneyID", "Operator", "Congestion", "Lon", "Lat", 
                  "Delay", "BlockID", "VehicleID", "StopID", "AtStop"]
  
    #Select all columns of type 'object'
    object_columns = df.select_dtypes(['object']).columns

    #Convert selected columns to type 'category'
    for column in object_columns:
        df[column] = df[column].astype('category')

    # Convert other features to categorical
    for column in ['Congestion', 'BlockID', 'VehicleID', 'AtStop']:
        df[column] = df[column].astype('category')
    
    # Convert LineID & VehicleJourneyID features to str
    for column in ['LineID', 'VehicleJourneyID']:
        df[column] = df[column].astype('str')

    # Derive LineID and the journey pattern from the full JourneyPatternID
    # (LineID must be derived first, before JourneyPatternID is overwritten)
    df['LineID'] = df['JourneyPatternID'].apply(get_line)
    df['JourneyPatternID'] = df['JourneyPatternID'].apply(get_journey)

    # Keep only the primary journey patterns
    df = df[df.JourneyPatternID.isin(['1001', '0001'])]

    # Write each line's rows to its own file
    for line in df.LineID.unique():
        line_df = df[df.LineID == line]
        try:
            insert_into_file(line_df, target_dir + line + ".csv")
        except Exception as err:
            # Report the failure instead of silently skipping the line
            print("Could not write line", line, ":", err)
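As an alternative to filtering the dataframe once per line, the per-line split could be sketched with `groupby`, which yields each line's rows in a single pass. `write_lines` is a hypothetical name; the sketch assumes `LineID` holds plain strings and that `target_dir` already exists.

```python
import os

import pandas as pd


def write_lines(df: pd.DataFrame, target_dir: str) -> list:
    """Append each line's rows to <target_dir>/<LineID>.csv."""
    written = []
    # groupby skips rows whose LineID is missing
    for line, line_df in df.groupby("LineID"):
        path = os.path.join(target_dir, f"{line}.csv")
        line_df.to_csv(path, mode="a", header=False, index=False)
        written.append(path)
    return written
```

Because the files are opened in append mode, running this twice over the same input duplicates the rows, so the target directory should be emptied before a full rerun.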

In [5]:
def main(read_directory, write_directory):
    for read_file in os.listdir(read_directory):
        if read_file.endswith(".csv"):
            print("Reading", read_file, "from", read_directory)
            read_file = os.path.join(read_directory, read_file)
            filter(read_file, write_directory)
            print("Finished", read_file)
            print()
    print("Finished main!")

In [6]:
# Main section - running the filter over both raw data directories

start = time.time()

# read_directory1 = "bus_data/test_data"

read_directory1 = "bus_data/Dcc"
read_directory2 = "bus_data/Sir"
write_directory = "bus_data/line_data2/"

main(read_directory1, write_directory)
main(read_directory2, write_directory)

end = time.time()

print()
print("Program took", end - start, "seconds")

Reading siri.20121106.csv from bus_data/Dcc
Error with null on get_line
Error with null on get_journey


Setting NaNs in `categories` is deprecated and will be removed in a future version of pandas.
  ordered=self.ordered)


Finished bus_data/Dcc/siri.20121106.csv

Reading siri.20121107.csv from bus_data/Dcc
Error with null on get_line
Error with null on get_journey
Finished bus_data/Dcc/siri.20121107.csv

Reading siri.20121108.csv from bus_data/Dcc
Error with null on get_line
Error with null on get_journey
Finished bus_data/Dcc/siri.20121108.csv

Reading siri.20121109.csv from bus_data/Dcc
Error with null on get_line
Error with null on get_journey
Finished bus_data/Dcc/siri.20121109.csv

Reading siri.20121110.csv from bus_data/Dcc
Error with null on get_line
Error with null on get_journey
Finished bus_data/Dcc/siri.20121110.csv

Reading siri.20121111.csv from bus_data/Dcc
Error with null on get_line
Error with null on get_journey
Finished bus_data/Dcc/siri.20121111.csv

Reading siri.20121112.csv from bus_data/Dcc
Error with null on get_line
Error with null on get_journey
Finished bus_data/Dcc/siri.20121112.csv

Reading siri.20121113.csv from bus_data/Dcc
Error with null on get_line
Error with null on get_

In [7]:
# df = pd.read_csv("bus_data/Dcc/siri.20121106.csv", low_memory=False, header=None)
# df.columns = ["Timestamp", "LineID", "Direction", "JourneyPatternID", "TimeFrame", 
#               "VehicleJourneyID", "Operator", "Congestion", "Lon", "Lat", 
#               "Delay", "BlockID", "VehicleID", "StopID", "AtStop"]



In [8]:
# #Select all columns of type 'object'
# object_columns = df.select_dtypes(['object']).columns

# #Convert selected columns to type 'category'
# for column in object_columns:
#     df[column] = df[column].astype('category')
    
# # Convert other features to categorical
# for column in ['Congestion', 'BlockID', 'VehicleID', 'AtStop']:
#     df[column] = df[column].astype('category')

# # Convert LineID & VehicleJourneyID features to str
# for column in ['LineID', 'VehicleJourneyID',]:
#     df[column] = df[column].astype('str')

### Separate LineID & JourneyPatternID

In [9]:

        
# df['LineID'] = df['JourneyPatternID'].apply(lambda x: get_line(x))
# df['JourneyPatternID'] = df['JourneyPatternID'].apply(lambda x: get_journey(x))

# # df.head(1000)

## Separating journey patterns

In [10]:
# newdf = df[(df.JourneyPatternID == '1001') | (df.JourneyPatternID == '0001')]


### Saving newdf to a CSV file


In [11]:
# newdf.to_csv("bus_data/cleaned_data/line15_00150001.csv")