# Project INFOH600 - Group 10

- **Amaury Lekens** (student ID: 000496361)

- **Marlene Silva Marchena** (student ID: 000498403)

# Sampled Dataset exploration, meta-data collection

### 1. Data


The data used for this analysis is the trip record dataset published by the New York City Taxi and Limousine Commission (TLC). It consists of four sub-dataset types:

• **Yellow** taxi records (2009.01 - 2019.06): trip information for New York's famous yellow taxi cars. <br>
• **Green** taxi records (2013.08 - 2019.06): trips by the so-called 'boro' taxis, a service introduced to improve taxi service and availability in the boroughs. <br>
• **FHV** records (2015.01 - 2019.11): For-Hire Vehicle records, covering services that offer for-hire vehicles (such as Uber, Lyft, Via, and Juno) as well as luxury limousine bases. <br>
• **FHVHV** records (2019.02 - 2019.06): High-Volume FHV records, i.e. FHV trips offered by services that make more than 10,000 trips per day.

For more information about the dataset and the variables of each sub-dataset type, see https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

### 2. Data Exploration

In what follows, we build a dataframe containing the relevant information (metadata) about each file in our database. To do so, we first define some helper functions.

In [1]:
# Convenience functions
import os
import glob
import pandas as pd

def get_schema(filename):
    '''Extracts the schema from the given file
    
    Assumes that the first line of the file includes the schema
    '''
    with open(filename, 'r') as f:
        return tuple([attr.strip('" ').lower() for attr in f.readline().strip().split(',')])

def get_month(filename):
    '''Returns the month that the TLC file reports on.
    
       Assumes that the filename uses the TLC conventions:
       $(fileSource)_tripdata_$(year)-$(month).csv
    '''
    return int(filename[-6:-4])

def get_year(filename):
    '''Returns the year that the TLC file reports on.
    
       Assumes that the filename uses the TLC conventions:
       $(fileSource)_tripdata_$(year)-$(month).csv
    '''
    return int(filename[-11:-7])

def get_type(filename):
    '''Returns the type of trip that the TLC file reports on (yellow, green, fhv, fhvhv).
    
       Assumes that the filename uses the TLC conventions:
       $(fileSource)_tripdata_$(year)-$(month).csv
    '''
    basename = os.path.basename(filename)
    transport_class = basename.split('_', 1)[0]
    return transport_class

def get_numrecords(filename):
    '''Returns the number of records in a TLC file.
       
       Equals the number of lines in the file minus one 
       (the header, which is the schema, not a record)
    '''
    with open(filename) as f:
        lines = 0
        for line in f:
            lines += 1
        return lines - 1                   
    
def get_metadata(filename):
    '''Returns all metadata associated to the `filename` datafile as one big tuple'''
    return (filename, 
            get_type(filename),
            get_year(filename),
            get_month(filename),
            os.path.getsize(filename),
            get_numrecords(filename),
            get_schema(filename) )
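
The slicing in get_month() and get_year() relies on fixed character positions at the end of the filename. As a possible alternative, the sketch below parses the same naming convention with a regular expression, so that a file that does not follow the convention fails loudly instead of yielding a wrong value. The helper name `parse_tlc_filename` and the example path are hypothetical and not part of the original notebook.

```python
import os
import re

# Pattern for the assumed TLC convention: <type>_tripdata_<year>-<month>.csv
_TLC_PATTERN = re.compile(
    r'^(?P<type>[a-z]+)_tripdata_(?P<year>\d{4})-(?P<month>\d{2})\.csv$')

def parse_tlc_filename(filename):
    '''Hypothetical helper: returns (type, year, month) parsed from a TLC filename.'''
    match = _TLC_PATTERN.match(os.path.basename(filename))
    if match is None:
        raise ValueError(f'not a TLC trip data filename: {filename}')
    return match['type'], int(match['year']), int(match['month'])

# Example path, made up for illustration
print(parse_tlc_filename('data/sampled/green_tripdata_2013-08.csv'))
```

The three values correspond to what get_type(), get_year() and get_month() return separately.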

With the help of the auxiliary functions defined above, we compute the metadata for each file of our database and store it in a dataframe.

In [2]:
# Get a sorted list of all files
files = sorted(glob.glob("/home/marlene/Documents/ULB/CFDS/Final_Project_CFDS/data/sampled/*.csv"))
# Compute the metadata for each such file
metadata = [ get_metadata(f) for f in files ]
# Put the metadata in a Pandas dataframe
metadata_labels = [ 'filename',  'type', 'year', 'month', 'size', 'num_records', 'schema']
df = pd.DataFrame.from_records(metadata, columns=metadata_labels)
# Save the dataframe
df.to_csv('/home/marlene/Documents/ULB/CFDS/Final_Project_CFDS/dataset-description.csv')

In [6]:
df.head()

Unnamed: 0,filename,type,year,month,size,num_records,schema
0,/home/marlene/Documents/ULB/CFDS/Final_Project...,fhv,2015,1,4126514,136556,"(dispatching_base_num, pickup_date, locationid)"
1,/home/marlene/Documents/ULB/CFDS/Final_Project...,fhv,2015,2,4712489,155514,"(dispatching_base_num, pickup_date, locationid)"
2,/home/marlene/Documents/ULB/CFDS/Final_Project...,fhv,2015,3,4922012,163232,"(dispatching_base_num, pickup_date, locationid)"
3,/home/marlene/Documents/ULB/CFDS/Final_Project...,fhv,2015,4,5845469,195182,"(dispatching_base_num, pickup_date, locationid)"
4,/home/marlene/Documents/ULB/CFDS/Final_Project...,fhv,2015,5,6434970,214016,"(dispatching_base_num, pickup_date, locationid)"


We have a database of 14.876 GB distributed over 261 files. The total number of records (lines) is 122.3 million.

In [7]:
df['size'].describe()

count    2.610000e+02
mean     5.699806e+07
std      5.040927e+07
min      5.787400e+04
25%      1.033738e+07
50%      4.048584e+07
75%      1.111178e+08
max      1.489577e+08
Name: size, dtype: float64

In [29]:
df['size'].sum()

14876493159

In [30]:
df['num_records'].sum()

122302140
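
The figures quoted in the text follow directly from these two sums. Here is a small sketch of the conversion, assuming decimal units (1 GB = 10^9 bytes), which is the convention that reproduces the 14.876 GB figure:

```python
total_bytes = 14876493159   # output of df['size'].sum()
total_records = 122302140   # output of df['num_records'].sum()

size_gb = total_bytes / 10**9            # decimal gigabytes
records_millions = total_records / 10**6

print(f'{size_gb:.3f} GB, {records_millions:.1f} million records')
```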

Here we define, for each sub-dataset, a pandas dataframe that contains its files together with their metadata.

In [28]:
fhv_files = df[ df['type'] == 'fhv']
fhvhv_files = df[ df['type'] == 'fhvhv']
green_files = df[ df['type'] == 'green']
yellow_files = df[ df['type'] == 'yellow']

#### 2.1. FHV dataset exploration


The fhv sub-dataset has a size of 1.821 GB with 34.5 million records (lines)

In [27]:
fhv_files.describe()

Unnamed: 0,year,month,size,num_records
count,59.0,59.0,59.0,59.0
mean,2016.966102,6.40678,30869320.0,584325.4
std,1.4138,3.434704,28285480.0,370667.9
min,2015.0,1.0,2884698.0,51875.0
25%,2016.0,3.5,6404827.0,212969.5
50%,2017.0,6.0,16849120.0,564428.0
75%,2018.0,9.0,59925560.0,897739.5
max,2019.0,12.0,83483020.0,1191813.0


In [13]:
fhv_files['size'].sum()

1821290072

In [15]:
fhv_files['num_records'].sum()

34475199

#### 2.2. FHVHV dataset exploration

The fhvhv sub-dataset has a size of 0.34 GB with 5.45 million records

In [16]:
fhvhv_files.describe()

Unnamed: 0,year,month,size,num_records
count,5.0,5.0,5.0,5.0
mean,2019.0,4.0,68055150.0,1089970.0
std,0.0,1.581139,4374677.0,70184.86
min,2019.0,2.0,62947750.0,1007262.0
25%,2019.0,3.0,65453500.0,1049133.0
50%,2019.0,4.0,67782050.0,1085520.0
75%,2019.0,5.0,69620810.0,1115337.0
max,2019.0,6.0,74471630.0,1192596.0


In [17]:
fhvhv_files['size'].sum()

340275757

In [18]:
fhvhv_files['num_records'].sum()

5449848

#### 2.3. Green dataset exploration

The green sub-dataset has a size of 0.494 GB with 3.8 million records

In [19]:
green_files.describe()

Unnamed: 0,year,month,size,num_records
count,71.0,71.0,71.0,71.0
mean,2016.042254,6.492958,6953828.0,53537.577465
std,1.768265,3.500503,4205176.0,21385.229774
min,2013.0,1.0,57874.0,390.0
25%,2015.0,3.5,3446216.0,37654.0
50%,2016.0,6.0,5173788.0,53869.0
75%,2017.5,9.5,11657320.0,74096.5
max,2019.0,12.0,14267140.0,88661.0


In [20]:
green_files['size'].sum()

493721820

In [21]:
green_files['num_records'].sum()

3801168

#### 2.4. Yellow dataset exploration

The yellow sub-dataset has a size of 12.221 GB with 78.58 million records

In [22]:
yellow_files.describe()

Unnamed: 0,year,month,size,num_records
count,126.0,126.0,126.0,126.0
mean,2013.761905,6.357143,96993690.0,623618.452381
std,3.050059,3.462864,39263690.0,127771.104832
min,2009.0,1.0,31795330.0,346739.0
25%,2011.0,3.0,45069570.0,508161.0
50%,2014.0,6.0,114246000.0,665162.0
75%,2016.0,9.0,129091100.0,728899.0
max,2019.0,12.0,148957700.0,807519.0


In [23]:
yellow_files['size'].sum()

12221205510

In [24]:
yellow_files['num_records'].sum()

78575925
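
The per-type totals reported in sections 2.1 to 2.4 could also be obtained in a single step with a groupby. The sketch below is self-contained: instead of the real 261-row `df`, it uses a stand-in dataframe (the name `summary_input` is hypothetical) with one row per type holding the totals printed above. It also checks that the four sub-datasets add up to the overall totals.

```python
import pandas as pd

# Stand-in for df: one row per type, using the totals printed in sections 2.1-2.4
summary_input = pd.DataFrame({
    'type': ['fhv', 'fhvhv', 'green', 'yellow'],
    'size': [1821290072, 340275757, 493721820, 12221205510],
    'num_records': [34475199, 5449848, 3801168, 78575925],
})

# Total size and record count per sub-dataset type
per_type = summary_input.groupby('type')[['size', 'num_records']].sum()
per_type['size_gb'] = per_type['size'] / 10**9   # decimal gigabytes
print(per_type)

# The sub-datasets should add up to the overall totals computed earlier
assert per_type['size'].sum() == 14876493159
assert per_type['num_records'].sum() == 122302140
```

On the real dataframe, the same `groupby('type')` call would replace the eight separate `sum()` cells.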

### 3. Schema evolution

We use the diff_schema() and analyze_schema_changes() functions defined below to compare the schemas of consecutive files and to compute the difference between two schemas. To understand how the schema evolves over time, we create for each sub-dataset a new dataframe with two extra columns: the schema columns removed and the schema columns added with respect to the previous file. Finally, we show only the rows where a change happened, i.e., where something was added or removed relative to the previous file. Our analysis focuses on how those changes evolve over time.

In [15]:
# defining some useful functions
def diff_schema(schema1, schema2):
    ''' Compute a tuple containing all elements of schema1 that are not in schema2        
    
        Example: if  schema1= ("a", "b", "c") and schema2 = ("b", "d", "e") the result = ("a", "c")
    '''
    lschema1 = list(schema1) # schema1 is a tuple, convert it to a list
    lschema2 = list(schema2) # schema2 is a tuple, convert it to a list
    removed = [ x for x in lschema1 if x not in lschema2 ]
    return tuple(removed) # removed is a list, convert it back to a tuple

def analyze_schema_changes(dataset):
    '''Analyze schema changes over time for all files in the dataset
    
    dataset: A dataframe that lists all files belonging to a given sub-dataset (fhv, yellow, green, ...)
             with their metadata
    
    output: a dataframe that contains for each file two extra columns, removed and added, holding
            the columns removed from and added to the schema relative to the previous file
    '''
  
    labels = ['type', 'year', 'month',  'schema', 'removed', 'added'] # The column labels of the resulting dataframe

    # Sort chronologically so that each file is compared with its predecessor
    dataset = dataset.sort_values(by=['year', 'month'])
    removed = []  # per file: columns removed relative to the previous file
    added = []  # per file: columns added relative to the previous file

    prev_schema = None # no previous file yet
    for schema in dataset['schema']:
        if prev_schema is None:
            # the first file has no predecessor, so report no change
            removed.append(())
            added.append(())
        else:
            removed.append(diff_schema(prev_schema, schema))
            added.append(diff_schema(schema, prev_schema))
        prev_schema = schema
    dataset['removed'] = removed
    dataset['added'] = added

    # keep only the columns of interest
    return pd.DataFrame(dataset, columns=labels)
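
To illustrate what the two functions compute, here is a self-contained sketch on a made-up three-file dataset. It re-implements the same idea in compact form (the names `diff` and `toy` are hypothetical). It filters the changed rows on the tuple length, which is one way to avoid comparing a whole column against an empty tuple.

```python
import pandas as pd

def diff(schema1, schema2):
    '''Elements of schema1 that are not in schema2, in their original order.'''
    return tuple(x for x in schema1 if x not in schema2)

# Made-up dataset: the second file replaces column b with column c
toy = pd.DataFrame({
    'year': [2015, 2015, 2015],
    'month': [1, 2, 3],
    'schema': [('a', 'b'), ('a', 'c'), ('a', 'c')],
}).sort_values(by=['year', 'month'])

schemas = list(toy['schema'])
# Pair each schema with its predecessor; the first file is compared with itself
previous = [schemas[0]] + schemas[:-1]
toy['removed'] = [diff(p, s) for p, s in zip(previous, schemas)]
toy['added'] = [diff(s, p) for p, s in zip(previous, schemas)]

# Keep only the rows where something was added or removed
changed = toy[(toy['added'].map(len) > 0) | (toy['removed'].map(len) > 0)]
print(changed)
```

Only the second file (month 2) appears in `changed`, with b reported as removed and c as added.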

#### 3.1. Schema evolution for fhv

From the results for the FHV cab data files presented below we observe:

- **From 2015.01 to 2016.12** the original schema does not change (dispatching_base_num, pickup_date, locationid)  <br>
- **In 2017.01** columns (pickup_date, locationid) were removed and columns (pickup_datetime, dropoff_datetime, pulocationid, dolocationid) were added. <br>
- **In 2017.07**  column (sr_flag) was added. <br>
- **In 2018.01**  column (dispatching_base_number,) was added.  <br>
- **In 2019.01**  column (dispatching_base_number,) was removed.  <br>

In [19]:
# Analyze the schema changes
fhv_changes = analyze_schema_changes(fhv_files)
# Keeping only lines where something is added or removed
fhv_delta = fhv_changes[(fhv_changes['added'] != ()) |  (fhv_changes['removed'] != ())]
fhv_delta

Unnamed: 0,type,year,month,schema,removed,added
24,fhv,2017,1,"(dispatching_base_num, pickup_datetime, dropof...","(pickup_date, locationid)","(pickup_datetime, dropoff_datetime, pulocation..."
30,fhv,2017,7,"(dispatching_base_num, pickup_datetime, dropof...",(),"(sr_flag,)"
36,fhv,2018,1,"(pickup_datetime, dropoff_datetime, pulocation...",(),"(dispatching_base_number,)"
48,fhv,2019,1,"(dispatching_base_num, pickup_datetime, dropof...","(dispatching_base_number,)",()


#### 3.2. Schema evolution for fhvhv

The FHVHV data schema does not change over time. The original schema (hvfhs_license_num, dispatching_base_num, pickup_datetime, dropoff_datetime, pulocationid, dolocationid, sr_flag) remains the same from 2019.02 to 2019.06.

In [23]:
# Analyze the schema changes for fhvhv data set
fhvhv_changes = analyze_schema_changes(fhvhv_files)
fhvhv_delta = fhvhv_changes[(fhvhv_changes['added'] != ()) |  (fhvhv_changes['removed'] != ())]
fhvhv_delta

Unnamed: 0,type,year,month,schema,removed,added


#### 3.3. Schema evolution for green data 

From the results for the green cab data files presented below we observe:

- **From 2013.08 to 2014.12** the schema in use is: (vendorid, lpep_pickup_datetime, lpep_dropoff_datetime, store_and_fwd_flag, ratecodeid, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, passenger_count, trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee, total_amount, payment_type, trip_type) <br>
- **In 2015.01**  column (improvement_surcharge) was added <br>
- **In 2016.07**  columns (pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude) were removed and columns (pulocationid, dolocationid) were added <br>
- **In 2019.01**  column (congestion_surcharge) was added  <br>

In [24]:
# Analyze the schema changes
green_changes = analyze_schema_changes(green_files)
green_delta = green_changes[(green_changes['added'] != ()) |  (green_changes['removed'] != ())]
green_delta

Unnamed: 0,type,year,month,schema,removed,added
81,green,2015,1,"(vendorid, lpep_pickup_datetime, lpep_dropoff_datetime, store_and_fwd_flag, ratecodeid, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, passenger_count, trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee, improvement_surcharge, total_amount, payment_type, trip_type)",(),"(improvement_surcharge,)"
99,green,2016,7,"(vendorid, lpep_pickup_datetime, lpep_dropoff_datetime, store_and_fwd_flag, ratecodeid, pulocationid, dolocationid, passenger_count, trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee, improvement_surcharge, total_amount, payment_type, trip_type)","(pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude)","(pulocationid, dolocationid)"
129,green,2019,1,"(vendorid, lpep_pickup_datetime, lpep_dropoff_datetime, store_and_fwd_flag, ratecodeid, pulocationid, dolocationid, passenger_count, trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee, improvement_surcharge, total_amount, payment_type, trip_type, congestion_surcharge)",(),"(congestion_surcharge,)"


#### 3.4. Schema evolution for yellow dataset

From the results for the yellow cab data files presented below we observe:

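- **From 2009.01 to 2009.12** the original schema does not change <br>
- **In 2010.01**  columns (vendor_name, trip_pickup_datetime, trip_dropoff_datetime, start_lon, start_lat, store_and_forward, end_lon, end_lat, fare_amt, tip_amt, tolls_amt, total_amt) were removed and columns (vendor_id, pickup_datetime, dropoff_datetime, pickup_longitude, pickup_latitude, store_and_fwd_flag, dropoff_longitude, dropoff_latitude, fare_amount, tip_amount, tolls_amount, total_amount) were added <br>
- **In 2015.01**  columns (vendor_id, pickup_datetime, dropoff_datetime, rate_code, surcharge) were removed and columns (vendorid, tpep_pickup_datetime, tpep_dropoff_datetime, ratecodeid, extra, improvement_surcharge) were added <br>
- **In 2016.07**  columns (pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude) were removed and columns (pulocationid, dolocationid) were added <br>
- **In 2019.01**  column (congestion_surcharge) was added  <br>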

In [25]:
# Analyze the schema changes
yellow_changes = analyze_schema_changes(yellow_files)
yellow_delta = yellow_changes[(yellow_changes['added'] != ()) |  (yellow_changes['removed'] != ())]
yellow_delta

Unnamed: 0,type,year,month,schema,removed,added
147,yellow,2010,1,"(vendor_id, pickup_datetime, dropoff_datetime, passenger_count, trip_distance, pickup_longitude, pickup_latitude, rate_code, store_and_fwd_flag, dropoff_longitude, dropoff_latitude, payment_type, fare_amount, surcharge, mta_tax, tip_amount, tolls_amount, total_amount)","(vendor_name, trip_pickup_datetime, trip_dropoff_datetime, start_lon, start_lat, store_and_forward, end_lon, end_lat, fare_amt, tip_amt, tolls_amt, total_amt)","(vendor_id, pickup_datetime, dropoff_datetime, pickup_longitude, pickup_latitude, store_and_fwd_flag, dropoff_longitude, dropoff_latitude, fare_amount, tip_amount, tolls_amount, total_amount)"
207,yellow,2015,1,"(vendorid, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, pickup_longitude, pickup_latitude, ratecodeid, store_and_fwd_flag, dropoff_longitude, dropoff_latitude, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount)","(vendor_id, pickup_datetime, dropoff_datetime, rate_code, surcharge)","(vendorid, tpep_pickup_datetime, tpep_dropoff_datetime, ratecodeid, extra, improvement_surcharge)"
225,yellow,2016,7,"(vendorid, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, ratecodeid, store_and_fwd_flag, pulocationid, dolocationid, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount)","(pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude)","(pulocationid, dolocationid)"
255,yellow,2019,1,"(vendorid, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, ratecodeid, store_and_fwd_flag, pulocationid, dolocationid, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge)",(),"(congestion_surcharge,)"
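
A complementary view of the same information is the period during which each distinct schema is in use. The sketch below, on made-up metadata, groups the files by schema and reports the first and last month of each one. The `schema_key` and `period` columns are hypothetical helpers; the schema tuple is joined into a string so that it can serve as a plain group key. Note that this groups identical schemas together even if they were in use during non-contiguous periods.

```python
import pandas as pd

# Made-up metadata: the schema changes in 2019.01
toy = pd.DataFrame({
    'year': [2018, 2018, 2019, 2019],
    'month': [11, 12, 1, 2],
    'schema': [('a', 'b'), ('a', 'b'), ('a', 'b', 'c'), ('a', 'b', 'c')],
})

# A sortable year-month key, e.g. 201811
toy['period'] = toy['year'] * 100 + toy['month']
# String version of the schema, used as the group key
toy['schema_key'] = toy['schema'].map(', '.join)

# For each distinct schema: first month, last month, and number of files
periods = toy.groupby('schema_key')['period'].agg(['min', 'max', 'count'])
print(periods.sort_values('min'))
```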
