Tutorial demonstrating pipeline-oriented approach in data analytics

bbiletskyy/pipeline-oriented-analytics

pipeline-oriented-analytics

This is a tutorial demonstrating a pipeline-oriented approach to data analytics, applied to taxi trip duration data. This project should NOT be viewed as an example of how to solve a particular regression problem. Rather, it demonstrates how to organize computation when solving data analytics problems. Some features of the toy problem were introduced artificially, purely for demonstration purposes.
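
As a rough illustration of what "organizing computation as a pipeline" can mean, here is a minimal sketch in plain Python. It is not code from this repository: the `pipeline` helper, the stage functions, and the `duration_sec` / `duration_min` field names are all hypothetical, chosen only to show small single-purpose stages composed into one callable.

```python
from functools import reduce
from typing import Callable, Dict, List

# Hypothetical data model: one dict per taxi trip.
Row = Dict[str, float]
Stage = Callable[[List[Row]], List[Row]]


def pipeline(*stages: Stage) -> Stage:
    """Compose stages left to right into a single callable."""
    def run(rows: List[Row]) -> List[Row]:
        return reduce(lambda acc, stage: stage(acc), stages, rows)
    return run


def drop_invalid(rows: List[Row]) -> List[Row]:
    # Keep only trips with a positive duration (assumed validity rule).
    return [r for r in rows if r["duration_sec"] > 0]


def add_duration_min(rows: List[Row]) -> List[Row]:
    # Derive a new feature without mutating the input rows.
    return [{**r, "duration_min": r["duration_sec"] / 60} for r in rows]


# The same composed object can be reused for train and test data.
prepare = pipeline(drop_invalid, add_duration_min)

trips = [{"duration_sec": 600.0}, {"duration_sec": -5.0}]
result = prepare(trips)
print(result)
```

Each stage takes rows and returns rows, so stages can be tested in isolation and reordered or swapped without touching the others.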

The proposed approach is described in the following articles:

Prerequisites

Getting started

  1. Run make init test to initialize the conda environment and run the tests.
  2. (Optional, since sample datasets are included) Download the complete train and test datasets from Kaggle's New York City Taxi Trip Duration competition, extract them, and overwrite train.csv and test.csv in the data/raw folder.

Running examples

  1. Run make distance_matrix to generate the distance matrix.
  2. Run make prepare_train features_train train to pre-process the train data, extract train features, and train the model.
  3. Run make prepare_test features_test predict to pre-process the test data, extract test features, and predict.
  4. Run make select_params to run hyper-parameter tuning.
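
The steps above can also be driven from a small script. The following is a sketch only: the make targets are the ones listed above, while the `STEPS` list, the `run_examples` function, and its `dry_run` flag are hypothetical helpers that are not part of this repository. With `dry_run=True` it only prints the commands; with `dry_run=False` it assumes it is run from the repository root with the environment already initialized.

```python
import subprocess
from typing import List

# Make targets grouped per step, in the order given in "Running examples".
STEPS: List[List[str]] = [
    ["distance_matrix"],
    ["prepare_train", "features_train", "train"],
    ["prepare_test", "features_test", "predict"],
    ["select_params"],
]


def run_examples(dry_run: bool = True) -> List[List[str]]:
    """Build one `make` command per step and optionally execute it."""
    commands = [["make", *targets] for targets in STEPS]
    for cmd in commands:
        if dry_run:
            # Only show what would be executed.
            print(" ".join(cmd))
        else:
            # check=True stops at the first failing step.
            subprocess.run(cmd, check=True)
    return commands


commands = run_examples(dry_run=True)
```

Stopping at the first failure matters here because later steps depend on the outputs of earlier ones.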
