# **Time series matrix construction**

Now that the data are processed and cleaned, everything is ready to create a time series matrix in which each row corresponds to a region ID. Each column corresponds to one hour of 2017, starting at midnight on January 1st. The sections below fill the matrix through the end of December.
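To make the column indexing concrete, here is a minimal sketch of how a zero-based column index could map back to a calendar hour. It assumes that the matrix covers calendar year 2017 and that column 0 is midnight on January 1st. `column_to_datetime` and `hours_in_year` are hypothetical helpers for illustration, not functions from the `prep` folder.

```python
from datetime import datetime, timedelta

YEAR = 2017  # assumption: the matrix covers calendar year 2017


def column_to_datetime(column, year=YEAR):
    """Return the start of the hour that a zero-based column index represents."""
    return datetime(year, 1, 1) + timedelta(hours=column)


def hours_in_year(year=YEAR):
    """Return the number of hourly columns needed to cover the whole year."""
    span = datetime(year + 1, 1, 1) - datetime(year, 1, 1)
    return int(span.total_seconds() // 3600)


print(hours_in_year())           # 8760
print(column_to_datetime(0))     # 2017-01-01 00:00:00
print(column_to_datetime(2170))  # 2017-04-01 10:00:00
```

The total of 8,760 matches the size of the null time series reported when the matrix is created later in this notebook.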


Element (i, j) of the matrix is a two-element array for region i at hour j: the first element is the new flow and the second element is the end flow.
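The sketch below illustrates one way such a matrix could be filled. It is not the code from the `prep` folder: `hour_of_year` and `add_trips` are hypothetical names, and the sketch assumes that the new flow counts trips starting in a region while the end flow counts trips ending there. The column names follow the cleaned dataframes used in this notebook.

```python
from datetime import datetime


def hour_of_year(month, day, hour, year=2017):
    """Return the zero-based hourly column for a given month, day and hour."""
    delta = datetime(year, month, day, hour) - datetime(year, 1, 1)
    return delta.days * 24 + delta.seconds // 3600


def add_trips(matrix, region_index, trips):
    """Increment [new flow, end flow] counters; collect trips with unknown regions."""
    skipped = []
    for trip in trips:
        # slot 0 = new flow (pickup side), slot 1 = end flow (dropoff side); assumption
        for prefix, slot in (("pickup", 0), ("dropoff", 1)):
            row = region_index.get(trip[prefix + "_regionIDs"])
            if row is None:
                skipped.append(trip)  # region is not part of the matrix
                continue
            col = hour_of_year(
                trip[prefix + "_month"], trip[prefix + "_day"], trip[prefix + "_hour"]
            )
            matrix[row][col][slot] += 1
    return matrix, skipped


# Toy data: two known regions and one trip in a region outside the matrix
region_index = {66: 0, 77: 1}
matrix = [[[0, 0] for _ in range(8760)] for _ in region_index]
trips = [
    {"pickup_regionIDs": 66, "dropoff_regionIDs": 77,
     "pickup_month": 1, "pickup_day": 1, "pickup_hour": 0,
     "dropoff_month": 1, "dropoff_day": 1, "dropoff_hour": 1},
    {"pickup_regionIDs": 122, "dropoff_regionIDs": 122,
     "pickup_month": 4, "pickup_day": 1, "pickup_hour": 10,
     "dropoff_month": 4, "dropoff_day": 1, "dropoff_hour": 10},
]
matrix, skipped = add_trips(matrix, region_index, trips)
print(matrix[0][0], matrix[1][1], len(skipped))
```

In this sketch a trip whose pickup and dropoff regions are both unknown is recorded twice in `skipped`, once per endpoint. That is a design choice of the example, not a statement about how the repository code behaves.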

An important note: the matrix only contains the region IDs that had at least one taxi trip between January 1st and March 31st. Since that period will serve as the basis for training the models, it makes sense to predict future taxi trips (from April 1st 2017) for those region IDs only.
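As a rough illustration of that selection rule, the sketch below collects the region IDs seen in at least one trip during the training months and assigns each a row index. `training_region_ids` is a hypothetical helper, the trips are toy dictionaries rather than real data, and treating both pickup and dropoff regions as "having a trip" is an assumption.

```python
def training_region_ids(trips, last_training_month=3):
    """Collect region IDs seen in at least one trip during the training months."""
    regions = set()
    for trip in trips:
        if trip["pickup_month"] <= last_training_month:
            regions.add(trip["pickup_regionIDs"])
        if trip["dropoff_month"] <= last_training_month:
            regions.add(trip["dropoff_regionIDs"])
    return sorted(regions)


# Toy trips: region 122 only appears in April, so it gets no row
trips = [
    {"pickup_regionIDs": 66, "dropoff_regionIDs": 77, "pickup_month": 1, "dropoff_month": 1},
    {"pickup_regionIDs": 77, "dropoff_regionIDs": 66, "pickup_month": 3, "dropoff_month": 3},
    {"pickup_regionIDs": 122, "dropoff_regionIDs": 122, "pickup_month": 4, "dropoff_month": 4},
]
region_ids = training_region_ids(trips)
region_index = {region_id: row for row, region_id in enumerate(region_ids)}
print(region_ids)  # [66, 77]
```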

The functions used to get that time series matrix are all located inside the folder: https://github.com/aissahm/urban-data/tree/master/prep

The dataframes of the taxi trips are located in the folder: https://github.com/aissahm/urban-data/tree/master/data

## **Including the taxi trips data from January to end of May 2017**

After including the functions mentioned above in the Jupyter Notebook environment (simply by uploading the files), along with the dataframes of the taxi trips listed below, I can now construct the time series matrix.

I start with the first 5 months of 2017 because the data is massive and had to be processed in iterations. This also makes it easier, during construction of the time series, to check that everything makes sense (that the time series were properly constructed and, notably, which region IDs are missing).

In [None]:
import timeSeriesMatrix
import pandas as pd

csvFilePathArray = ["/content/clean_df_jan.csv","/content/clean_df_feb.csv", '/content/clean_df_mar.csv', "/content/clean_df_apr.csv", '/content/clean_df_may.csv']

taxiTripsTimeSeriesMatrix = timeSeriesMatrix.returnNullTimeSeriesMatrix()

OutOfTimeSeries = []

for csvFilePath in csvFilePathArray:
  df = pd.read_csv(csvFilePath)
  results = timeSeriesMatrix.returnTimeSeriesMatrix(df, taxiTripsTimeSeriesMatrix)
  
  taxiTripsTimeSeriesMatrix = results[0]
  OutOfTimeSeries.append(results[1])

print("done")
print("OutOfTimeSeries: ", OutOfTimeSeries)

size of null time series 8760
done
OutOfTimeSeries:  [[], [], [], [Unnamed: 0           86111
pickup_regionIDs       122
dropoff_regionIDs      122
pickup_month             4
pickup_day               1
pickup_hour             10
dropoff_month            4
dropoff_day              1
dropoff_hour            10
Name: 56671, dtype: int64, Unnamed: 0           86111
pickup_regionIDs       122
dropoff_regionIDs      122
pickup_month             4
pickup_day               1
pickup_hour             10
dropoff_month            4
dropoff_day              1
dropoff_hour            10
Name: 56671, dtype: int64, Unnamed: 0           86122
pickup_regionIDs       122
dropoff_regionIDs      122
pickup_month             4
pickup_day               1
pickup_hour             10
dropoff_month            4
dropoff_day              1
dropoff_hour            10
Name: 56679, dtype: int64, Unnamed: 0           86122
pickup_regionIDs       122
dropoff_regionIDs      122
pickup_month             4
pickup_day     

The list above shows the trips whose region IDs weren't added to the time series. The reason is that those regions didn't have any taxi trips in January, February, or March 2017. Looking at the list, region ID 139 appears to have only 3 taxi trips from January to May 2017. Since the history for that region is too sparse to be significant, there is good reason to ignore it. The same can be said for regions 122 and 149.
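Reading the raw output to count trips per region is tedious. A minimal sketch of one way to summarize it is shown here. `count_skipped_regions` and `known_region_ids` are hypothetical names, and the rows are toy dictionaries standing in for the pandas rows stored in `OutOfTimeSeries`.

```python
from collections import Counter


def count_skipped_regions(out_of_time_series, known_region_ids):
    """Count how often each region ID outside the matrix appears in the skipped rows."""
    counts = Counter()
    for monthly_rows in out_of_time_series:  # one list of rows per CSV file
        for row in monthly_rows:
            for column in ("pickup_regionIDs", "dropoff_regionIDs"):
                region = int(row[column])
                if region not in known_region_ids:
                    counts[region] += 1
    return counts


# Toy stand-in for OutOfTimeSeries: one inner list per month
out_of_time_series = [
    [],
    [{"pickup_regionIDs": 122, "dropoff_regionIDs": 122}],
    [{"pickup_regionIDs": 77, "dropoff_regionIDs": 122}],
]
counts = count_skipped_regions(out_of_time_series, known_region_ids={66, 77})
print(counts.most_common())  # [(122, 3)]
```

The same trip can appear more than once in the list (as in the output above), so these are counts of list entries rather than of distinct trips.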

**We test the time series built on region ID 66:**

In [None]:
regionID = 66

indexForRegionID = timeSeriesMatrix.returnIndexForRegionID(regionID)

print(taxiTripsTimeSeriesMatrix[indexForRegionID])

[[1296, 869], [1785, 1349], [1785, 1362], [1497, 1028], [987, 621], [367, 232], [95, 75], [46, 30], [35, 23], [95, 64], [190, 166], [300, 281], [414, 357], [403, 368], [373, 367], [374, 365], [388, 419], [450, 518], [501, 479], [465, 539], [441, 456], [439, 460], [361, 336], [285, 284], [317, 319], [310, 257], [195, 188], [105, 81], [31, 17], [10, 3], [20, 4], [46, 36], [68, 56], [103, 102], [182, 146], [226, 230], [285, 289], [382, 382], [464, 450], [380, 408], [411, 449], [370, 446], [387, 441], [263, 316], [218, 283], [161, 265], [115, 201], [87, 113], [44, 66], [10, 32], [1, 9], [4, 3], [11, 1], [27, 12], [121, 51], [370, 170], [610, 378], [664, 524], [548, 481], [494, 494], [520, 576], [541, 553], [533, 572], [516, 573], [525, 704], [608, 842], [731, 912], [588, 741], [402, 544], [308, 387], [209, 308], [122, 209], [49, 59], [43, 64], [38, 30], [11, 5], [13, 0], [25, 10], [162, 65], [539, 239], [849, 489], [881, 630], [657, 573], [647, 681], [633, 730], [721, 765], [661, 698], [67

In order to export the data to CSV and save the work for later, the structure of the time series matrix must be modified. A CSV file can only hold a 2D matrix, and the time series matrix is 3D. So the following code removes the array structure around the new flow and end flow at each hour: the two values are simply listed one after the other.

In [None]:
timeSeriesListMatrix = []

# Flatten each region's [[new, end], ...] series into [new, end, new, end, ...]
for timeSeriesArray in taxiTripsTimeSeriesMatrix:
  timeSeriesList = []
  for timeSeriesElem in timeSeriesArray:
    timeSeriesList.append(timeSeriesElem[0])  # new flow
    timeSeriesList.append(timeSeriesElem[1])  # end flow
  timeSeriesListMatrix.append(timeSeriesList)

[1296, 869, 1785, 1349, 1785, 1362, 1497, 1028, 987, 621, 367, 232, 95, 75, 46, 30, 35, 23, 95, 64, 190, 166, 300, 281, 414, 357, 403, 368, 373, 367, 374, 365, 388, 419, 450, 518, 501, 479, 465, 539, 441, 456, 439, 460, 361, 336, 285, 284, 317, 319, 310, 257, 195, 188, 105, 81, 31, 17, 10, 3, 20, 4, 46, 36, 68, 56, 103, 102, 182, 146, 226, 230, 285, 289, 382, 382, 464, 450, 380, 408, 411, 449, 370, 446, 387, 441, 263, 316, 218, 283, 161, 265, 115, 201, 87, 113, 44, 66, 10, 32, 1, 9, 4, 3, 11, 1, 27, 12, 121, 51, 370, 170, 610, 378, 664, 524, 548, 481, 494, 494, 520, 576, 541, 553, 533, 572, 516, 573, 525, 704, 608, 842, 731, 912, 588, 741, 402, 544, 308, 387, 209, 308, 122, 209, 49, 59, 43, 64, 38, 30, 11, 5, 13, 0, 25, 10, 162, 65, 539, 239, 849, 489, 881, 630, 657, 573, 647, 681, 633, 730, 721, 765, 661, 698, 674, 708, 677, 894, 883, 1091, 949, 1287, 819, 1093, 650, 741, 485, 556, 337, 459, 195, 266, 117, 134, 48, 59, 30, 19, 20, 14, 12, 6, 38, 17, 163, 81, 559, 268, 994, 593, 999, 7

Now, the new 2D matrix can be saved to CSV.

In [None]:
from numpy import asarray
from numpy import savetxt
# define data
data = asarray( timeSeriesListMatrix )
# save to csv file
savetxt('timeseriesuptomay.csv', data, delimiter=',')

## **Including the taxi trips data from June to end of October 2017**

Below, I extract the taxi trips from June to October 2017 and add them to the original (3D) time series matrix.

In [None]:
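import copy

import timeSeriesMatrix
import pandas as pd

csvFilePathArray = ["/content/clean_df_jun.csv","/content/clean_df_jul.csv", '/content/clean_df_aug.csv', "/content/clean_df_sep.csv", '/content/clean_df_oct.csv']

# Keep a backup of the January-May matrix before adding more months.
# copy.deepcopy is used because list.copy() is shallow and would share the nested rows.
taxiTripsTimeSeriesMatrixCopy = copy.deepcopy(taxiTripsTimeSeriesMatrix)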

OutOfTimeSeries = []

for csvFilePath in csvFilePathArray:
  df = pd.read_csv(csvFilePath)
  results = timeSeriesMatrix.returnTimeSeriesMatrix(df, taxiTripsTimeSeriesMatrix)
  
  taxiTripsTimeSeriesMatrix = results[0]
  OutOfTimeSeries.append(results[1])

print("done")
print("OutOfTimeSeries: ", OutOfTimeSeries)

done
OutOfTimeSeries:  [[Unnamed: 0           1289610
pickup_regionIDs          72
dropoff_regionIDs        143
pickup_month               6
pickup_day                15
pickup_hour               18
dropoff_month              6
dropoff_day               15
dropoff_hour              18
Name: 877741, dtype: int64], [], [Unnamed: 0           349908
pickup_regionIDs         77
dropoff_regionIDs       122
pickup_month              8
pickup_day                4
pickup_hour              18
dropoff_month             8
dropoff_day               4
dropoff_hour             19
Name: 234906, dtype: int64, Unnamed: 0           782856
pickup_regionIDs        139
dropoff_regionIDs       139
pickup_month              8
pickup_day               11
pickup_hour              10
dropoff_month             8
dropoff_day              11
dropoff_hour             10
Name: 519040, dtype: int64, Unnamed: 0           782856
pickup_regionIDs        139
dropoff_regionIDs       139
pickup_month              8
pickup_d

Below, the same process that transformed the 3D matrix into a 2D matrix is repeated.

In [None]:
timeSeriesList = []

timeSeriesListMatrix = []

for timeSeriesArray in taxiTripsTimeSeriesMatrix:
  timeSeriesList = []
  for timeSeriesElem in timeSeriesArray:
    timeSeriesList.append(timeSeriesElem[0])
    timeSeriesList.append(timeSeriesElem[1])
  timeSeriesListMatrix.append(timeSeriesList)

from numpy import asarray
from numpy import savetxt
# define data
data = asarray( timeSeriesListMatrix )
# save to csv file
savetxt('timeseriesuptooct.csv', data, delimiter=',')

!cp  '/content/timeseriesuptooct.csv' '/content/gdrive/My Drive/chicago_taxi_trips/timeseriesuptooct.csv'

## **Including the taxi trips data from November to end of December 2017**

In [None]:
import timeSeriesMatrix
import pandas as pd

import numpy as np
from numpy import asarray
from numpy import savetxt
from numpy import loadtxt

In [None]:
data = np.loadtxt("/content/timeseriesuptooct.csv", delimiter=',')

print(data[33][:8])

[286. 194. 355. 296. 241. 229. 137. 199.]


In [None]:
# The CSV file stores the time series in a 2D matrix format.
# The function below converts it back to the 3D matrix format.
def return3DTaxiTimeSeriesMatrix(taxiTimeSeries2DMatrix):
  taxiTimeSeriesMatrix = []
  for row in taxiTimeSeries2DMatrix:
    # Each consecutive pair of values is [new flow, end flow] for one hour
    taxiTimeSeriesArray = [[row[j], row[j + 1]] for j in range(0, len(row), 2)]
    taxiTimeSeriesMatrix.append(taxiTimeSeriesArray)

  return taxiTimeSeriesMatrix

In [None]:
taxiTimeSeriesMatrixUptoOct =  return3DTaxiTimeSeriesMatrix(data)

print(taxiTimeSeriesMatrixUptoOct[33][:4])

[[286.0, 194.0], [355.0, 296.0], [241.0, 229.0], [137.0, 199.0]]


The check above confirms that the entries are exactly the same as in the CSV file, now in the proper [new flow, end flow] format.
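The same check can be made for a whole matrix rather than a few printed entries. The sketch below is a pure-Python round trip on a tiny sample. `flatten_rows` and `unflatten_rows` are hypothetical helpers that mirror the flattening loop and the 3D conversion used in this notebook, and the sample values are the first entries printed earlier.

```python
def flatten_rows(matrix_3d):
    """Turn [[new, end], ...] per region into [new, end, new, end, ...]."""
    return [[value for pair in row for value in pair] for row in matrix_3d]


def unflatten_rows(matrix_2d):
    """Rebuild [new flow, end flow] pairs from each flat row."""
    return [
        [[row[j], row[j + 1]] for j in range(0, len(row), 2)]
        for row in matrix_2d
    ]


sample = [
    [[1296, 869], [1785, 1349]],
    [[286, 194], [355, 296]],
]
flat = flatten_rows(sample)
restored = unflatten_rows(flat)
print(flat[0])             # [1296, 869, 1785, 1349]
print(restored == sample)  # True

# Values read back from the CSV are floats; they still compare equal to the counts
as_floats = [[float(value) for value in row] for row in flat]
print(unflatten_rows(as_floats) == sample)  # True
```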

In [None]:
import timeSeriesMatrix
import pandas as pd

csvFilePathArray = ["/content/clean_df_nov.csv","/content/clean_df_dec.csv" ]

taxiTripsTimeSeriesMatrix = taxiTimeSeriesMatrixUptoOct.copy()

OutOfTimeSeries = []

for csvFilePath in csvFilePathArray:
  df = pd.read_csv(csvFilePath)
  results = timeSeriesMatrix.returnTimeSeriesMatrix(df, taxiTripsTimeSeriesMatrix)
  
  taxiTripsTimeSeriesMatrix = results[0]
  OutOfTimeSeries.append(results[1])

print("done")
print("OutOfTimeSeries: ", OutOfTimeSeries)

done
OutOfTimeSeries:  [[Unnamed: 0           80884
pickup_regionIDs        77
dropoff_regionIDs      122
pickup_month            11
pickup_day               1
pickup_hour             12
dropoff_month           11
dropoff_day              1
dropoff_hour            13
Name: 51628, dtype: int64, Unnamed: 0           365544
pickup_regionIDs         60
dropoff_regionIDs       122
pickup_month             11
pickup_day                5
pickup_hour               4
dropoff_month            11
dropoff_day               5
dropoff_hour              4
Name: 246625, dtype: int64, Unnamed: 0           366182
pickup_regionIDs        122
dropoff_regionIDs       122
pickup_month             11
pickup_day                5
pickup_hour               4
dropoff_month            11
dropoff_day               5
dropoff_hour              4
Name: 246837, dtype: int64, Unnamed: 0           366182
pickup_regionIDs        122
dropoff_regionIDs       122
pickup_month             11
pickup_day                5
picku

In [None]:
timeSeriesList = []

timeSeriesListMatrix = []

for timeSeriesArray in taxiTripsTimeSeriesMatrix:
  timeSeriesList = []
  for timeSeriesElem in timeSeriesArray:
    timeSeriesList.append(timeSeriesElem[0])
    timeSeriesList.append(timeSeriesElem[1])
  timeSeriesListMatrix.append(timeSeriesList)

# define data
data = asarray( timeSeriesListMatrix )
# save to csv file
savetxt('timeseriesuptodec.csv', data, delimiter=',')

## **Conclusion**

In this Jupyter notebook, the raw taxi trips data obtained from the Chicago Open Data website are pre-processed, cleaned and used to build a time series matrix.


The time series matrix includes the taxi trips from January to end of December 2017. To summarize, the time series matrix is structured as follows:



*   each row corresponds to a region ID
*   each column corresponds to a one-hour unit, from midnight on January 1st to the end of December 31st 2017.
*   the element i,j of the matrix corresponds to the [new flow, end flow] for a region i, at hour j





