In [6]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Purpose
This notebook describes the typical activities carried out at the beginning of a project/thread when a customer shares new data. We try to understand the tables, columns and information flow. We also look for data issues and confirm them with the respective owners for resolution. At the end of this activity, the data sources and their treatment are finalized. Code in this notebook will not be part of the production code.

This data is currently stored in the Tiger Databricks storage. The notebooks are configured to connect directly to the Databricks filestore and pull/save the relevant files, so there is no need to download them.

Contact [code templates support](mailto:code-templates-support@tigeranalytics.com) for access to Databricks.

# Imports

In [7]:
# Standard Library Imports
import os
import os.path as op
import sys
import time
import warnings
import re
import random

# Third Party imports
import yaml
import hvplot
import pandas as pd
import numpy as np
import holoviews as hv
import panel as pn
from pyspark_dist_explore import (
    Histogram,
    hist,
    distplot,
    pandas_histogram
)
from IPython.display import (
    display,
    display_html
)

# Spark imports
from pyspark.sql import (
    types as DT,
    functions as F,
    Window
)
from pyspark.ml import Pipeline
from pyspark.ml.tuning import (
    ParamGridBuilder,
    CrossValidator,
    CrossValidatorModel
)
from pyspark.ml.feature import (
    VectorAssembler,
    StandardScaler,
    StringIndexer,
    OneHotEncoder,
    Imputer
)
from pyspark.ml.evaluation import RegressionEvaluator

# Project Imports
from ta_lib.pyspark import (
    dp,
    features,
    eda,
)
from ta_lib.pyspark.core import (
    utils,
    context
)
# Setting Options
random_seed = 0
pn.extension('bokeh')
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)



# Initialization

`config.yml` stores all the parameters required by the template.

In [8]:
config_path = op.join(os.getcwd(),'conf', 'config.yml')
with open(config_path, 'r') as fp:
    config = yaml.safe_load(fp)
config

{'all': {'core': 'default',
  'log_catalog': 'production',
  'data_catalog': 'remote',
  'job_catalog': 'local'},
 'spark': {'spark.executer.cores': 4, 'spark.cores.max': 4}}

In [9]:
data_config_path = op.join(os.getcwd(),'conf/data_catalog', 'local.yml')
with open(data_config_path, 'r') as fp:
    data_config = yaml.safe_load(fp)
data_config

{'reference_date': datetime.date(2020, 8, 31),
 'num_days_prediction': 7,
 'raw': {'filesystem': 'file',
  'base_path': './../../data/raw/',
  'carrier_data_path': 'carrier_data.csv',
  'fuel_prices_data_path': 'fuel_prices.csv',
  'market_carrier_rates_data_path': 'market_carrier_rates_data.csv',
  'route_mapping_data_path': 'route_mapping.csv'},
 'clean': {'filesystem': 'file',
  'base_path': './../../data/cleaned/',
  'carrier_data_path': 'carrier_data',
  'fuel_prices_data_path': 'fuel_prices',
  'market_carrier_rates_data_path': 'market_carrier_rates',
  'final_routes_data_path': 'final_data',
  'trasnformed_routes_data_path': 'transformed_data'},
 'processed': {'filesystem': 'file',
  'base_path': './../../data/processed/',
  'train': 'train_carrier',
  'test': 'test_carrier',
  'preds': 'predictions_carrier'},
 'spark': {'spark.executer.cores': 4, 'spark.cores.max': 4}}
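
Both configuration dumps above contain the key `spark.executer.cores`, whereas the Spark property for executor cores is spelled `spark.executor.cores`. A misspelled property name is likely to have no effect and raise no error, so it may be worth checking the keys before building the session. The sketch below is one way to do that with the standard library; `KNOWN_SPARK_KEYS` and `suggest_key_fixes` are hypothetical names introduced here for illustration (they are not part of `ta_lib`), and the list of known keys is deliberately short.

```python
import difflib

# Assumption: a small, hand-maintained list of property names the project uses.
KNOWN_SPARK_KEYS = (
    "spark.executor.cores",
    "spark.executor.memory",
    "spark.driver.memory",
    "spark.cores.max",
)


def suggest_key_fixes(spark_conf, known=KNOWN_SPARK_KEYS, cutoff=0.8):
    """Map each unknown key to the closest known key, if one is similar enough."""
    suggestions = {}
    for key in spark_conf:
        if key in known:
            continue
        close = difflib.get_close_matches(key, known, n=1, cutoff=cutoff)
        if close:
            suggestions[key] = close[0]
    return suggestions


# The 'spark' block exactly as printed in the config output.
spark_conf = {"spark.executer.cores": 4, "spark.cores.max": 4}
suggestions = suggest_key_fixes(spark_conf)
print(suggestions)
```

Keys that are unknown and have no close match are silently skipped here; a stricter check could report those as well.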

## Create spark session

The `ta_lib.pyspark.core.context` module is used to build the Spark session, so that the Spark-related parameters in the config file are applied when the session is created.

In [10]:
%%time
session = context.CustomSparkSession(config)
session.CreateSparkSession()
spark = session.spark
sc = session.sc
# setLogLevel returns None, so call it separately to keep `sc` bound to the SparkContext
sc.setLogLevel("ERROR")


CPU times: user 13.4 ms, sys: 22.6 ms, total: 36 ms
Wall time: 55.6 ms


# Background
The client is a goods carrier serving multiple trade lanes across cities. They set the carrier price for each trip based on the distance of the trip and the prevailing fuel price in the market, and negotiate with their customers to arrive at the final trip price. The client wants to understand whether machine learning can be used to determine an optimal price for each trip.

# Data Read

### Carrier data

The `carrier_data` dataset contains information about the price for trips completed by the client across various routes. For every trip `trip_id` covering a certain `distance` between `origin_zip` and `destination_zip` with a specific type of vehicle (`vehicle_type`), the price for that trip is provided in column `carrier_price`.

In [11]:
df_carrier_data = utils.read_data(
    spark=spark,
    paths=[data_config['raw']['base_path'] + data_config['raw']['carrier_data_path']],
    fmt="csv",
    fs=data_config['raw']['filesystem'],
)
df_carrier_data.printSchema()

                                                                                

root
 |-- trip_id: integer (nullable = true)
 |-- distance: double (nullable = true)
 |-- vehicle_type: integer (nullable = true)
 |-- pickup_date: string (nullable = true)
 |-- origin_city: string (nullable = true)
 |-- origin_state: string (nullable = true)
 |-- origin_zip: string (nullable = true)
 |-- origin_country: string (nullable = true)
 |-- destination_city: string (nullable = true)
 |-- destination_state: string (nullable = true)
 |-- destination_zip: string (nullable = true)
 |-- destination_country: string (nullable = true)
 |-- carrier_price: double (nullable = true)



### Fuel prices data
The fuel prices dataset contains the prevailing market fuel price on a given date. These prices are reported weekly, on the Monday of each week. For any trip that the client schedules during a given week, the reference fuel price is the price prevailing as of the Monday of that week.
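
The Monday lookup described above can be sketched in plain Python as follows. This is a minimal illustration, assuming weeks run Monday to Sunday; `week_monday` and `reference_fuel_price` are hypothetical helper names, and the prices in `weekly_prices` are made-up values, not taken from the dataset.

```python
from datetime import date, timedelta


def week_monday(day):
    """Return the Monday of the week containing `day` (Monday = weekday 0)."""
    return day - timedelta(days=day.weekday())


def reference_fuel_price(pickup_date, weekly_prices):
    """Look up the price reported on the Monday of the pickup week, or None."""
    return weekly_prices.get(week_monday(pickup_date))


# Illustrative prices only, keyed by the Monday on which they were reported.
weekly_prices = {
    date(2020, 8, 24): 2.40,
    date(2020, 8, 31): 2.45,
}

print(week_monday(date(2020, 9, 3)))                          # a Thursday
print(reference_fuel_price(date(2020, 9, 3), weekly_prices))
```

In Spark the same key could be derived as a column and used to join trips to fuel prices; the exact approach depends on how `pickup_date` and `date` are parsed, which this notebook does not yet cover.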

In [12]:
df_fuel_prices_data = utils.read_data(
    spark=spark,
    paths=[data_config['raw']['base_path'] + data_config['raw']['fuel_prices_data_path']],
    fmt="csv",
    fs=data_config['raw']['filesystem'],
)
df_fuel_prices_data.printSchema()


root
 |-- date: string (nullable = true)
 |-- national_price: double (nullable = true)



                                                                                

### Market carrier rates data

This is a third-party market dataset consolidated across competing carrier providers in the freight market, containing information on the total cost and total distance serviced by each `vehicle_type` across all carrier providers in the geography. 

In [13]:
df_market_carrier_rates_data = utils.read_data(
    spark=spark,
    paths=[data_config['raw']['base_path'] + data_config['raw']['market_carrier_rates_data_path']],
    fmt="csv",
    fs=data_config['raw']['filesystem'],
)
df_market_carrier_rates_data.printSchema()



root
 |-- week_ending_date: string (nullable = true)
 |-- origin_city: string (nullable = true)
 |-- origin_state: string (nullable = true)
 |-- origin_zip: string (nullable = true)
 |-- destination_city: string (nullable = true)
 |-- destination_state: string (nullable = true)
 |-- destination_zip: string (nullable = true)
 |-- vehicle_type: integer (nullable = true)
 |-- first_monday_of_week: string (nullable = true)
 |-- total_cost_all_providers: double (nullable = true)
 |-- total_distance_all_providers: double (nullable = true)



                                                                                

### Route mapping

This dataset maps each zip code to a custom market identifier, `market_id`.

In [14]:
df_route_mapping_data = utils.read_data(
    spark=spark,
    paths=[data_config['raw']['base_path'] + data_config['raw']['route_mapping_data_path']],
    fmt="csv",
    fs=data_config['raw']['filesystem'],
)
df_route_mapping_data.printSchema()


root
 |-- reference_state: string (nullable = true)
 |-- market_id: string (nullable = true)
 |-- zipcode: integer (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
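
Note that `zipcode` is read as an integer here, while `origin_zip` and `destination_zip` in the carrier data are strings, so the two would not match directly if used as join keys. The sketch below shows one way to normalize both sides before a lookup. It assumes 5-digit zip codes whose leading zeros were lost when read as integers, which the source does not confirm; `normalize_zip`, `market_for_zip` and the sample mapping are hypothetical.

```python
def normalize_zip(value, width=5):
    """Return a zero-padded string zip code, or None for missing values."""
    if value is None:
        return None
    text = str(value).strip()
    if text.endswith(".0"):  # value was read as a float, e.g. 2134.0
        text = text[:-2]
    # Only pad purely numeric codes; leave anything else untouched.
    return text.zfill(width) if text.isdigit() else text


def market_for_zip(zip_value, zip_to_market):
    """Look up the market for a zip code after normalizing it."""
    return zip_to_market.get(normalize_zip(zip_value))


# Made-up mapping rows: integer zip codes, as in the route mapping schema.
route_mapping = [(2134, "M1"), (90210, "M2")]
zip_to_market = {normalize_zip(z): m for z, m in route_mapping}

print(market_for_zip("02134", zip_to_market))
```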



                                                                                

### Consolidating data objects in a dictionary

In [15]:
data = {
    'carrier_data': df_carrier_data,
    'fuel_prices_data':df_fuel_prices_data,
    'market_carrier_rates_data':df_market_carrier_rates_data,
    'route_mapping_data':df_route_mapping_data,
}

# Data Discovery

Given the raw data from data ingestion, we would now like to explore and learn more details about the data.

The output of the step would be a summary report and discussion of any pertinent findings.

## Shape of Data

In [16]:
%%time
utils.display_as_tabs([(k, dp.get_shape(v)) for k,v in data.items()])


CPU times: user 148 ms, sys: 3.45 ms, total: 151 ms
Wall time: 14.7 s


                                                                                

## Clean Column Names

Standardize the column names of each dataframe by converting camelCase names to snake_case.
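
As a rough illustration of this kind of conversion, here is a regex-based sketch. It is not the implementation of `dp.clean_columns`, whose exact rules are not shown in this notebook; `to_snake_case` is a hypothetical helper.

```python
import re


def to_snake_case(name):
    """Convert a column name such as 'originZip' or 'Carrier Price' to snake_case."""
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())      # spaces, dots, dashes -> _
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)      # fooBar -> foo_Bar
    name = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", name)   # HTTPCode -> HTTP_Code
    return name.strip("_").lower()


for raw in ["originZip", "Carrier Price", "HTTPCode", "trip_id"]:
    print(raw, "->", to_snake_case(raw))
```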

In [17]:
%%time
data = {k:dp.clean_columns(v) for k,v in data.items()}
utils.display_as_tabs([(k, v.columns) for k,v in data.items()])

CPU times: user 68.5 ms, sys: 1.85 ms, total: 70.4 ms
Wall time: 204 ms


## Identification of column types in the data

Group the columns of each dataset by data type (numerical, categorical, date-like and boolean).
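
To make the grouping concrete, the sketch below classifies columns from `(name, type)` pairs of the kind a Spark DataFrame exposes through `df.dtypes`. It is an illustration only: the `dp.list_*_columns` helpers may apply different rules, and `group_columns_by_type` is a hypothetical name.

```python
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}
DATE_TYPES = {"date", "timestamp"}


def group_columns_by_type(dtypes):
    """Group column names by a coarse type, given (name, spark_type) pairs."""
    groups = {"numerical": [], "categorical": [], "datelike": [], "boolean": []}
    for name, dtype in dtypes:
        if dtype in NUMERIC_TYPES or dtype.startswith("decimal"):
            groups["numerical"].append(name)
        elif dtype in DATE_TYPES:
            groups["datelike"].append(name)
        elif dtype == "boolean":
            groups["boolean"].append(name)
        elif dtype == "string":
            groups["categorical"].append(name)
    return groups


# A subset of the carrier data schema printed earlier.
carrier_dtypes = [
    ("trip_id", "int"),
    ("distance", "double"),
    ("vehicle_type", "int"),
    ("pickup_date", "string"),
    ("origin_zip", "string"),
    ("carrier_price", "double"),
]
groups = group_columns_by_type(carrier_dtypes)
print(groups)
```

With a purely type-based rule, `pickup_date` lands in the categorical group because it was read as a string, and `vehicle_type` counts as numerical although it looks like a code. Both are worth reviewing before modelling.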

In [18]:
%%time
types = {
    'numerical': dp.list_numerical_columns,
    'cat_cols': dp.list_categorical_columns,
    'date_cols': dp.list_datelike_columns,
    'bool_cols': dp.list_boolean_columns
}
res = [(datakey, {typekey: typeval(dataval) for typekey, typeval in types.items()}) for datakey, dataval in data.items()]
utils.display_as_tabs(res)

CPU times: user 54.4 ms, sys: 0 ns, total: 54.4 ms
Wall time: 52.8 ms


## Check for data consistency in Columns

Data consistency here refers to case-related inconsistencies in the values of a string (object) column.

> Example -  Having "APPLE" and "apple" as part of cell values in the same column is considered as an inconsistency
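
A minimal sketch of such a check on a plain list of values is shown below. It illustrates the idea only and is not the logic of `dp.check_column_data_consistency`; `find_case_inconsistencies` is a hypothetical helper.

```python
from collections import defaultdict


def find_case_inconsistencies(values):
    """Return spellings that differ only by letter case, keyed by the folded value."""
    variants = defaultdict(set)
    for value in values:
        if isinstance(value, str):  # skip nulls and non-string values
            variants[value.casefold()].add(value)
    return {key: sorted(found) for key, found in variants.items() if len(found) > 1}


values = ["APPLE", "apple", "Apple", "pear", None]
inconsistent = find_case_inconsistencies(values)
print(inconsistent)
```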

In [19]:
%%time
utils.display_as_tabs([(k, dp.check_column_data_consistency(v)) for k,v in data.items()])

                                                                                

CPU times: user 1.27 s, sys: 1.24 s, total: 2.52 s
Wall time: 3min 6s


## Columns Unique Values Summary

For each column, this step reports the number of distinct values and the ratio of distinct values to the total row count.

This helps identify categorical features that are stored as numerical columns.
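
The two quantities can be sketched on a small in-memory table as follows. The rows are made up, and `unique_value_summary` is a hypothetical helper rather than the implementation of `eda.column_values_summary`.

```python
def unique_value_summary(rows, columns):
    """Return the distinct count and distinct-to-total ratio for each column."""
    total = len(rows)
    summary = {}
    for column in columns:
        distinct = len({row.get(column) for row in rows})
        summary[column] = {
            "distinct": distinct,
            "ratio": distinct / total if total else 0.0,
        }
    return summary


# Made-up rows: trip_id is unique per row, vehicle_type takes only two values.
rows = [
    {"trip_id": 1, "vehicle_type": 1},
    {"trip_id": 2, "vehicle_type": 2},
    {"trip_id": 3, "vehicle_type": 1},
    {"trip_id": 4, "vehicle_type": 2},
]
summary = unique_value_summary(rows, ["trip_id", "vehicle_type"])
print(summary)
```

A ratio close to 1 suggests an identifier, while a small distinct count on an integer column, such as `vehicle_type` here, suggests a categorical feature stored as a number.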

In [20]:
%%time
utils.display_as_tabs([(k, eda.column_values_summary(v).T) for k,v in data.items()])

                                                                                

CPU times: user 3.79 s, sys: 10.2 s, total: 14 s
Wall time: 2min 5s


## Identification of Missing Values

This step summarizes the number of missing values in each column of the data.

In [21]:
%%time
utils.display_as_tabs([(k, dp.identify_missing_values(v).toPandas()) for k,v in data.items()])

                                                                                

CPU times: user 600 ms, sys: 217 ms, total: 817 ms
Wall time: 1min 19s


## Health Analysis of the data

This step generates a set of data analyses that could be useful to showcase to clients.

1. % of numerical columns in the data

2. % of missing values in the data

3. % of duplicated data points
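
The three percentages listed above can be sketched on an in-memory table as follows. This is an illustration under simple assumptions: a column counts as numerical when all of its non-null values are numbers, and a row counts as a duplicate when an identical row appeared earlier. `health_summary` and the sample rows are hypothetical, and `eda.plot_health` may define these measures differently.

```python
from numbers import Number


def is_number(value):
    """True for int/float values, excluding booleans."""
    return isinstance(value, Number) and not isinstance(value, bool)


def health_summary(rows, columns):
    """Return % numerical columns, % missing cells and % duplicated rows."""
    n_rows, n_cols = len(rows), len(columns)
    numeric = 0
    missing = 0
    for column in columns:
        values = [row.get(column) for row in rows]
        non_null = [v for v in values if v is not None]
        missing += len(values) - len(non_null)
        if non_null and all(is_number(v) for v in non_null):
            numeric += 1
    unique_rows = {tuple(row.get(c) for c in columns) for row in rows}
    duplicates = n_rows - len(unique_rows)
    return {
        "pct_numerical_columns": 100 * numeric / n_cols if n_cols else 0.0,
        "pct_missing_values": 100 * missing / (n_rows * n_cols) if n_rows * n_cols else 0.0,
        "pct_duplicate_rows": 100 * duplicates / n_rows if n_rows else 0.0,
    }


# Made-up rows; the last row repeats the third.
columns = ["trip_id", "origin_city", "carrier_price"]
rows = [
    {"trip_id": 1, "origin_city": "Austin", "carrier_price": 100.0},
    {"trip_id": 2, "origin_city": None, "carrier_price": 250.0},
    {"trip_id": 3, "origin_city": "Dallas", "carrier_price": None},
    {"trip_id": 3, "origin_city": "Dallas", "carrier_price": None},
]
health = health_summary(rows, columns)
print(health)
```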

In [22]:
%%time
utils.display_as_tabs([(k, eda.plot_health(v)) for k,v in data.items()])

                                                                                

CPU times: user 1.61 s, sys: 286 ms, total: 1.89 s
Wall time: 2min 32s


## Missing values plot

In [23]:
%%time
utils.display_as_tabs([(k, eda.missing_plot(v)) for k,v in data.items()])


CPU times: user 980 ms, sys: 620 ms, total: 1.6 s
Wall time: 1min 16s


                                                                                

## Missing data summary

In [24]:
%%time
utils.display_as_tabs([(k, eda.missing_value_summary(v)) for k,v in data.items()])

                                                                                

CPU times: user 688 ms, sys: 586 ms, total: 1.27 s
Wall time: 1min 10s


## Cardinality check of tables with respect to carrier data

In [25]:
data_to_check = {
    'carrier_data': df_carrier_data,
    'market_carrier_rates_data':df_market_carrier_rates_data,
}


In [26]:
%%time
utils.display_as_tabs([(k, eda.setanalyse(df_carrier_data, v,"origin_zip")) for k,v in data_to_check.items()])

                                                                                

CPU times: user 52.3 s, sys: 9.67 s, total: 1min 1s
Wall time: 1min 46s
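
The output format of `eda.setanalyse` is not shown here, so the following is only a sketch of what a key-overlap comparison between two tables might compute: how many distinct key values appear in the left table only, in both, and in the right table only. `key_overlap` and the sample zip codes are hypothetical.

```python
def key_overlap(left_values, right_values):
    """Count distinct non-null keys found only on the left, on both sides, or only on the right."""
    left = {v for v in left_values if v is not None}
    right = {v for v in right_values if v is not None}
    return {
        "left_only": len(left - right),
        "both": len(left & right),
        "right_only": len(right - left),
    }


# Made-up origin_zip values from two tables.
carrier_zips = ["02134", "90210", "73301", "73301", None]
market_zips = ["90210", "73301", "10001"]
overlap = key_overlap(carrier_zips, market_zips)
print(overlap)
```

A large `left_only` count would mean many carrier trips have no matching market rate for that key, which is worth raising with the data owners before the tables are joined.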
