# INFOH600 Project Report

## 1. Data exploration

## 2. Integration

For each sub-dataset, the data are spread over several files. As seen in part 1, for 3 sub-datasets (fhv, green, yellow) the schema of the csv files has changed over time. This integration step therefore makes it possible to work later on data that all share the same schema.

In practice, for these 3 sub-datasets, the most recent schema is taken as the reference schema. All files with a different schema are converted to the reference schema. A conversion can be a simple change of column name, but it can also be a function of several columns.

The first step will therefore be to identify these transformations.

### 2.1. Identification of the transformations

The 3 following figures show the correspondences between the different schemas of each sub-dataset. A reference schema is composed of the entries in the blue boxes. The names or functions listed below each entry make it possible to retrieve or rebuild the data from a file built on a different schema. For example, in the FHV sub-dataset the _pulocationid_ data can be found in a _pulocationid_ column but also in a _locationid_ column. In the green sub-dataset, the same data can be found in the _pulocationid_ column or computed by a function of the _pickup_longitude_ and _pickup_latitude_ columns.

<img src="img/transformation_fhv.png" width="800" align="left"/>

<img src="img/transformation_green.png" width="800" align="left"/>

<img src="img/transformation_yellow.png" width="800" align="left"/>

### 2.2. Implementation

To perform these transformations, we decided to implement a single general function rather than a specific function for each dataset. Driven by a json configuration file, this function transforms a row of data that does not conform to the reference schema into a conforming row.

<img src="img/transformation_function.png" width="800" align="left"/>

The json configuration file is specific to a sub-dataset, i.e. one configuration file must be defined for each sub-dataset. These configurations describe the hierarchical diagrams shown above. Part of one of these files (green dataset) is shown below. The complete files are in the repository, in the _config_files_ folder.

```json
{ 
 "vendorid": [
              {
               "type": "column", 
               "content": "vendorid"
              },
              {
               "type": "column",
               "content": "vendor_id"
              }
             ],

 ...
 
 "store_and_fwd_flag": [
                        {
                         "type": "column", 
                         "content": "store_and_fwd_flag"
                        },
                        {
                         "type": "column", 
                         "content": "store_and_forward"
                        }
                       ],

 "pulocationid": [
                  {
                   "type": "column", 
                   "content": "pulocationid"
                  },
                  {
                   "type": "function",
                   "content": {
                               "func_name": "compute_location_id",
                               "params": ["pickup_longitude", "pickup_latitude"]
                              }
                  },
                  {
                   "type": "function",
                   "content": {
                               "func_name": "compute_location_id",
                               "params": ["start_lon", "start_lat"]
                              }
                  }
                 ], 
                         
 ...
}

```

In this json, the keys are the entries of the reference schema and the values are the _aliases_ of the key. An alias has a different name but represents, in another file, the same data as the key. There are two types of aliases:

* *column* : the data is retrieved from another column whose name is specified by _content_.
* *function* : the data is computed from the data of several other columns. The name of the function and the names of its column parameters are specified by _content_.

The function simply reads the json and loops through its keys. For each key, it checks whether the alias column name, or all the alias parameter names, are present in the schema of the file being converted. If so, it retrieves the associated data and copies or computes the value to be associated with the key in the reference schema. If no alias matches, the value is left empty.
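The alias lookup described above can be sketched on its own. The snippet below is only an illustration: the helper `resolve_alias` and the header `old_schema` are hypothetical names that do not come from the project code, and the configuration is reduced to two entries of the green file shown earlier.

```python
import json

# Two entries of the green configuration shown above
conf = json.loads("""
{
  "vendorid": [
    {"type": "column", "content": "vendorid"},
    {"type": "column", "content": "vendor_id"}
  ],
  "pulocationid": [
    {"type": "column", "content": "pulocationid"},
    {"type": "function",
     "content": {"func_name": "compute_location_id",
                 "params": ["pickup_longitude", "pickup_latitude"]}}
  ]
}
""")


def resolve_alias(schema, alias_list):
    """Return the first alias whose source columns all exist in `schema`, else None."""
    available = set(schema)
    for alias in alias_list:
        if alias["type"] == "column":
            needed = [alias["content"]]
        else:  # "function": every parameter column must be present
            needed = alias["content"]["params"]
        if all(name in available for name in needed):
            return alias
    return None


# hypothetical header of an older green file
old_schema = ["vendor_id", "pickup_longitude", "pickup_latitude"]
plan = {column: resolve_alias(old_schema, aliases)
        for column, aliases in conf.items()}
print(plan["vendorid"]["content"])
print(plan["pulocationid"]["content"]["func_name"])
```

Computing such a plan once per file, instead of once per row, would be a possible optimisation, since all the rows of a file share the same schema.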

In our case, there are column name changes and a single multi-column function that transforms geographic coordinates (latitude, longitude) into a zone identifier. Besides the column names, this function needs one more parameter: a geopandas dataframe of the geographical zones. The integration function must therefore allow this parameter to be passed.
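To make the idea of a coordinate-to-zone function concrete, here is a dependency-free sketch. It is not the project's function: the geopandas dataframe is replaced by a plain dict mapping a zone identifier to a list of (longitude, latitude) vertices, and the point-in-polygon test is a simple ray casting. The zone shapes used in the example are made up.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting: count the edges crossed by a horizontal ray starting at the point."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # the edge straddles the horizontal line through the point
        if (y1 > lat) != (y2 > lat):
            # longitude at which the edge crosses that line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def compute_location_id(lon, lat, zones):
    """Return the id of the first zone containing the point, or '' if there is none."""
    try:
        lon, lat = float(lon), float(lat)
    except (TypeError, ValueError):
        return ''  # missing or malformed coordinates
    for zone_id, polygon in zones.items():
        if point_in_polygon(lon, lat, polygon):
            return zone_id
    return ''


# two made-up square zones, for illustration only
zones = {1: [(0, 0), (1, 0), (1, 1), (0, 1)],
         2: [(1, 0), (2, 0), (2, 1), (1, 1)]}
print(compute_location_id('1.5', '0.5', zones))
```

With real taxi zones, a spatial index (such as the one geopandas can build) avoids testing every polygon for every row.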

In [None]:
def integrate(data, schema, integration_conf, params_f):

    """
    Transforms data into the desired schema 
    by following the configuration

    :param data: list of original data
    :param schema: schema of original data
    :param integration_conf: dict with the configuration
    :param params_f: params for functions of columns
    """

    data = dict(zip(schema, data))
    t_data = dict() # transformed data

    # loop through all the columns of the reference schema
    for column, alias_list in integration_conf.items():
        found = False

        # loop through all alias (column or function)
        for alias in alias_list:
            category = alias['type']
            content = alias['content']

            # check category of the alias
            # the alias is an other column name
            if category == 'column':

                # check that this name is in of the schema data
                if content in list(data.keys()):
                    t_data[column] = data[content]
                    found = True
                    break

            # the alias is a function with other column name as param
            elif category == 'function':

                func_name = content['func_name']
                param_names = content['params']
                params = []

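                # check that all the params are in the schema of the data
                eval_func = True
                for param_name in param_names:
                    if param_name not in data:
                        eval_func = False
                    else:
                        params.append(data[param_name])

                # eval the function if all the params are there
                if eval_func:
                    # build the string "func_name(params[0], ..., params_f)";
                    # assumption: func_name is callable from this scope and
                    # takes params_f as its last argument
                    args = ['params[{}]'.format(i) for i in range(len(params))]
                    call = '{}({}, params_f)'.format(func_name, ', '.join(args))
                    t_data[column] = eval(call)
                    found = True
                    break
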
        # if there is no valid alias add empty data
        if not found:
            t_data[column] = ''

    data = list(t_data.values())

    return data

The biggest advantage of this function is its flexibility: if a schema changes in a different way in the future (an extra column, a new date format, a new taxi zone, ...), the function remains valid. We have simply separated the fixed part of the transformation from the variable part. The fixed part is the code of the function and the variable part is passed as a parameter.

To make this flexibility possible, we use the python function _eval()_, which evaluates python code passed as a string. For functions of several columns, a string "name_of_the_function(param1, param2, ...)" is built from the config file and passed to _eval_.
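The construction of such a call string can be sketched as follows. This is a minimal illustration under assumptions: `scale_sum`, `call_by_name` and the `functions` registry are hypothetical names, and the values are referenced by index in the string rather than written into it, which avoids quoting problems with csv values.

```python
def scale_sum(a, b, factor):
    """Hypothetical multi-column function; `factor` plays the role of the extra parameter."""
    return (float(a) + float(b)) * factor


def call_by_name(func_name, params, extra, functions):
    # build the string "func_name(params[0], params[1], ..., extra)"
    args = ["params[{}]".format(i) for i in range(len(params))] + ["extra"]
    call = "{}({})".format(func_name, ", ".join(args))
    # evaluate it in a namespace that contains only what the call needs
    namespace = {func_name: functions[func_name], "params": params, "extra": extra}
    return eval(call, {"__builtins__": {}}, namespace)


result = call_by_name("scale_sum", ["1.5", "2.5"], 10, {"scale_sum": scale_sum})
print(result)
```

Restricting the namespace given to _eval_ limits what a configuration file can execute. Since the registry already maps names to functions, calling `functions[func_name](*params, extra)` directly would give the same result without _eval_.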

In the rest of the project, we will have to implement other functions that work on rows. To group these functions, we define a static Row class in which each of them is a static method. This class can be found in the _row.py_ file.

### 2.3. Implementation with spark

Since the integration of one row is independent of the integration of any other row, we can use Spark to parallelize this operation.

1. read the csv file from HDFS
2. apply the _process_ function to the rows: it transforms a _string_ into a _list_
3. apply the _integrate_ function to the rows
4. apply the _join_ function to the rows: it transforms a _list_ into a _string_
5. write the rows to HDFS
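Steps 2 and 4 above can be sketched with the standard `csv` module. The two functions below are only an assumption about what _process_ and _join_ do (the real ones live in _row.py_), and a plain list comprehension stands in for the Spark `map` calls.

```python
import csv
import io


def process(line):
    """Split one csv line into a list of values, honouring quoted fields."""
    return next(csv.reader([line]))


def join(values):
    """Turn a list of values back into one csv line, quoting where needed."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator='').writerow(values)
    return buffer.getvalue()


lines = ['1,"a,b",3', '2,c,4']
# step 3 (integrate) would be applied between process and join
out = [join(process(line)) for line in lines]
print(out)
```

Using a csv parser rather than `str.split(',')` keeps values that contain a comma inside quotes in a single field.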

<img src="img/spark.png" width="800" align="left"/>

In [19]:
# spark configuration
import sys
from pyspark import SparkFiles

sc.addFile("./row.py")
sc.addFile("./transformations.py")
sc.addFile("./shape_files/location.shp")
sc.addFile("./shape_files/location.dbf")
sc.addFile("./shape_files/location.shx")
sys.path.insert(0,SparkFiles.getRootDirectory())

In [26]:
import glob
import json
import os
from row import Row
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
from pyspark import SparkFiles

# load integration configuration
integration_confs = dict()
conf_filenames = sorted(glob.glob('/home/ceci18/INFOH600-project/integration_conf/*.json'))
for conf_filename in conf_filenames:    
    with open(conf_filename, 'r') as f:
        integration_conf = json.load(f)
        dataset_name = os.path.basename(conf_filename)[:-5]
        integration_confs[dataset_name] = integration_conf

# get all the filename
hdfs_path = 'hdfs://public00:8020/user/hpda000034/infoh600/sampled'
local_path = '/home/hpda00034/infoh600/sampled'
#filenames = sorted(glob.glob("{}/yellow_*.csv".format(local_path)))
#filenames = [os.path.basename(filename) for filename in filenames]
filenames = ["green_tripdata_2014-01.csv", "green_tripdata_2014-02.csv", "green_tripdata_2016-07.csv"]

# construct a rtree index with the geopanda df
zones = gpd.read_file("./shape_files/location.shp")
zones.set_geometry('geometry', crs=(u'epsg:'+str(4326)), inplace=True)

# get the right schema (the files processed below are green files)
right_schema = list(integration_confs['green'].keys())

# Integrate each file separately and write the result to HDFS
for filename in filenames:
    # write the right schema on the first line
    green_rdd = sc.parallelize([','.join(right_schema)])
    
    # get the schema with the local file
    with open('{}/{}'.format(local_path, filename), 'r') as file:
        schema = file.readline().replace('\n', '')
    schema = schema.replace('"','').replace("'", '').lower().split(',')
    
    rdd = sc.textFile('{}/{}'.format(hdfs_path, filename))
    # Each original file contains also the header (schema). 
    # We ignore this first line and convert everything else
    rdd = rdd.zipWithIndex() \
             .filter(lambda x:x[1] > 1) \
             .map(lambda x: Row.process(x[0]))\
             .map(lambda x: Row.integrate(x, schema, integration_confs['green'], zones))\
             .map(lambda x: Row.join(x))
    # add it to the rdd that we already have
    green_rdd = green_rdd.union(rdd)

    # saves this to HDFS in your home HDFS folder
    green_rdd.saveAsTextFile('./integrated/green/{}'.format(filename[:-4]))

In [17]:
# Stop spark when we are done with it! This frees resources on the cluster so that your peers can also use the cluster
try: 
    spark.stop()
except Exception: 
    pass

## 3. Cleaning

This step removes invalid data from the datasets. An example of invalid data is a negative value in a column representing a distance. For each column of the 4 sub-datasets, assertions are made based on the TLC specifications.
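As an illustration of what such assertions look like, here is a dependency-free sketch with one predicate per column. The column name `trip_distance` and the helper names are assumptions made for the example. The date format and the 1 to 266 range for location identifiers are the ones used in the FHV validation schema of this report, and empty values are accepted, as they are there.

```python
from datetime import datetime


def non_negative(value):
    try:
        return float(value) >= 0
    except ValueError:
        return False


def is_datetime(value):
    try:
        datetime.strptime(value, '%Y-%m-%d %H:%M:%S')
        return True
    except ValueError:
        return False


def is_location_id(value):
    return value.isdigit() and 1 <= int(value) <= 266


# one assertion per column (column names are illustrative)
ASSERTIONS = {
    "trip_distance": non_negative,
    "pickup_datetime": is_datetime,
    "pulocationid": is_location_id,
}


def violations(row):
    """Return the columns of `row` (a dict) whose non-empty value breaks its assertion."""
    return [column for column, check in ASSERTIONS.items()
            if row.get(column, "") != "" and not check(row[column])]


row = {"trip_distance": "-1.2",
       "pickup_datetime": "2014-01-01 00:10:00",
       "pulocationid": "300"}
print(violations(row))
```

A row is considered valid when `violations` returns an empty list.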

### 3.1. Identification of the assertions

<img src="img/assertion_fhv_fhvhv.png" width="800" align="left"/>

<img src="img/assertion_green.png" width="800" align="left"/>

<img src="img/assertion_yellow.png" width="800" align="left"/>

### 3.2. Implementation

A python module that checks such assertions exists: _pandas_schema_. This module is very easy to use. An example is given for the FHV dataset (the implementation for the other sub-datasets can be found in the _cleaning.py_ file).

In [None]:
import pandas as pd
from pandas_schema import Column, Schema
import pandas_schema.validation as validation 

# validation schema for fhv dataset
schema_fhv = Schema([
    Column('dispatching_base_num', 
           [validation.MatchesPatternValidation('^B[0-9]{5}$')], 
           allow_empty=True),
    Column('pickup_datetime', 
           [validation.DateFormatValidation('%Y-%m-%d %H:%M:%S')],
           allow_empty=True),
    Column('dropoff_datetime', 
           [validation.DateFormatValidation('%Y-%m-%d %H:%M:%S')],
           allow_empty=True),
    Column('pulocationid',
           [validation.InRangeValidation(1, 266)],
           allow_empty=True),
    Column('dolocationid', 
           [validation.InRangeValidation(1, 266)],
           allow_empty=True),
    Column('sr_flag', 
           [validation.InListValidation([1, None])],
           allow_empty=True)
])


def validate(data, schema, validation_schema):

    """
    Validate the entries of the row with
    the validation schema

    :param data: data to validate
    :param schema: the csv schema of the data to validate
    :param validation_schema: schema to validate the data 
    :return: boolean, true if validated
    """

    # validate the data
    df = pd.DataFrame([data], columns=schema)
    errors = validation_schema.validate(df)

    validated = len(errors) == 0

    return validated



The _validate_ function is a static method of the Row class.

### 3.3. Implementation with spark

<img src="img/spark_validation.png" width="800" align="left"/>

In [None]:
import glob
import json
import os
from row import Row
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
from pyspark import SparkFiles

# load integration configuration
integration_confs = dict()
conf_filenames = sorted(glob.glob('/home/ceci18/INFOH600-project/integration_conf/*.json'))
for conf_filename in conf_filenames:    
    with open(conf_filename, 'r') as f:
        integration_conf = json.load(f)
        dataset_name = os.path.basename(conf_filename)[:-5]
        integration_confs[dataset_name] = integration_conf

# get all the filename
hdfs_path = 'hdfs://public00:8020/user/hpda000034/infoh600/sampled'
local_path = '/home/hpda00034/infoh600/sampled'
#filenames = sorted(glob.glob("{}/yellow_*.csv".format(local_path)))
#filenames = [os.path.basename(filename) for filename in filenames]
filenames = ["green_tripdata_2014-01.csv", "green_tripdata_2014-02.csv", "green_tripdata_2016-07.csv"]

# validation_schema is assumed to be the pandas_schema Schema of the green
# dataset (see cleaning.py); it must be defined before running this cell

# Validate each file separately and write the result to HDFS
for filename in filenames:
    # get the schema with the local file
    with open('{}/{}'.format(local_path, filename), 'r') as file:
        schema = file.readline().replace('\n', '')
    schema = schema.replace('"','').replace("'", '').lower().split(',')

    # write the schema on the first line
    validated_rdd = sc.parallelize([','.join(schema)])
    unvalidated_rdd = sc.parallelize([','.join(schema)])
    
    rdd = sc.textFile('{}/{}'.format(hdfs_path, filename))
    # Each original file contains also the header (schema). 
    # We ignore this first line and convert everything else
    rdd = rdd.zipWithIndex() \
             .filter(lambda x:x[1] > 1) \
             .map(lambda x: Row.process(x[0]))\
             .persist()
    
    # keep the rows that satisfy all the assertions
    validated = rdd.filter(lambda x: Row.validate(x, schema, validation_schema))\
                   .map(lambda x: Row.join(x))
    validated_rdd = validated_rdd.union(validated)
    
    # keep the rows that violate at least one assertion
    unvalidated = rdd.filter(lambda x: not Row.validate(x, schema, validation_schema))\
                     .map(lambda x: Row.join(x))
    unvalidated_rdd = unvalidated_rdd.union(unvalidated)

    # saves this to HDFS in your home HDFS folder
    # (the two output folders are placeholders, adapt them to your layout)
    validated_rdd.saveAsTextFile('./cleaned/green/validated/{}'.format(filename[:-4]))
    unvalidated_rdd.saveAsTextFile('./cleaned/green/unvalidated/{}'.format(filename[:-4]))

## 4. Analysis