# Kaggle Data Pipelines


## Introduction

The following are summary notes (see the Jupyter notebook for more complete notes) on [Kaggle](https://www.kaggle.com/)'s [Getting Started with Automated Data Pipelines](https://www.kaggle.com/professional-skills-series#pipelines?utm_medium=email&utm_source=intercom&utm_campaign=pipelines-event) professional skills series by Rachael Tatman.


## Day 1 - Dataset Versioning

Open-source tools for versioning datasets:

* [Data Version Control (DVC)](https://dvc.org/) - a command-line-only tool based on [Git](https://git-scm.com/)
* [Git Large File Storage](https://git-lfs.github.com/) - replaces large files such as audio samples, videos, datasets, and graphics with text pointers inside Git, while storing the file contents on a remote server like [GitHub.com](https://github.com/).
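
To make the "text pointer" idea concrete, here is a minimal Python sketch that builds the small text file Git LFS commits in place of a large file. It follows the published pointer layout (a version line, a SHA-256 object id, and a size in bytes). The function name `lfs_pointer` is hypothetical and used only for illustration. In practice the `git lfs` client writes these pointers for you.

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Return Git LFS-style pointer text for the given file content.

    Hypothetical helper for illustration; the real pointer is
    generated by the `git lfs` client, not by user code.
    """
    # The object id is the SHA-256 digest of the original file content.
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


if __name__ == "__main__":
    # A tiny stand-in for a large dataset file.
    print(lfs_pointer(b"col_a,col_b\n1,2\n"), end="")
```

Whatever the size of the dataset, the pointer stays three short lines, which is why the Git history remains small while the file contents live on the remote server.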

### Versioning a Dataset on Kaggle

Example: https://www.kaggle.com/frankhjung/debian-data/settings

![Set versioning](https://raw.githubusercontent.com/frankhjung/jupyter-datapipelines/master/images/dataset-versioning.png)

### Further Reading

* [A Practical Taxonomy of Reproducibility for Machine Learning Research (PDF)](https://openreview.net/pdf?id=B1eYYK5QgX)
* [Dashboarding with Notebooks](https://www.kaggle.com/rtatman/dashboarding-with-notebooks-day-1)

## Day 2 - Validation and URLs

* getting data from a URL
* [Representational State Transfer (REST)](https://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm)
* Google has a [Dataset Search](https://toolbox.google.com/datasetsearch)
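
Getting data from a URL can be sketched with the Python standard library alone. This is a minimal sketch rather than the course's own code. The function name `fetch_csv_rows` and the 30 second default timeout are assumptions, and the resource is assumed to be UTF-8 encoded CSV with a header row.

```python
import csv
import io
from urllib.request import urlopen


def fetch_csv_rows(url: str, timeout: float = 30.0) -> list[dict[str, str]]:
    """Download a CSV resource and return its rows as dictionaries.

    Hypothetical helper: assumes UTF-8 text with a header row.
    """
    with urlopen(url, timeout=timeout) as response:
        # "utf-8-sig" also strips a byte-order mark if one is present.
        text = response.read().decode("utf-8-sig")
    # DictReader uses the first row as the column names.
    return list(csv.DictReader(io.StringIO(text)))
```

Calling `fetch_csv_rows(url)` with the address of a CSV file returns one dictionary per data row, keyed by column name. All values are strings, so any numeric conversion is left to a later step.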

My implementation in both Python and R is on GitHub [here](https://github.com/frankhjung/jupyter-datapipelines).
There is also a [GitLab version](https://gitlab.com/theMarloGroup/jupyter-notebooks/datapipelines) that uses pipelines to execute the scripts.

### Creating a Dataset on Kaggle

The source data, manufacturing employment as a proportion of total employment, is available as a CSV file:

* [2018-Australian SDG-indicator-9-2-2.csv](https://data.gov.au/dataset/7f90d314-fa64-4bae-8609-2e26ff48f6fa/resource/05390642-bdeb-4174-9bd5-c1df6e6d1e9e/download/2018-australian-sdg-indicator-9-2-2v2.csv)

The dataset was added to Kaggle here: https://www.kaggle.com/frankhjung/australian-manufacturing-employment-2018

![Kaggle Dataset](https://raw.githubusercontent.com/frankhjung/jupyter-datapipelines/master/images/dataset-validation.png)

### Validating Dataset

Before we use the dataset, we should ensure that it is valid. For instance, we could check that:

* the file is of a valid type (CSV is expected)
* there is no missing data (no NAs in the data)

To perform the validation, it is recommended that a script be run before the data is processed, rather than performing the validation in a notebook.

The script is [validation.py](/validation.py).
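
The linked script is not reproduced here. As an independent illustration of the two checks listed above, the following minimal sketch parses text as CSV and reports rows with the wrong number of fields or with missing values. The function name `validate_csv` and the set of strings treated as missing are assumptions, not taken from `validation.py`.

```python
import csv
import io

# Assumed markers for missing data; adjust to suit the dataset.
MISSING = {"", "NA", "NaN", "null"}


def validate_csv(text: str) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]

    problems = []
    header, body = rows[0], rows[1:]
    if not body:
        problems.append("no data rows")

    # Row numbers start at 2 because row 1 is the header.
    for row_no, row in enumerate(body, start=2):
        if len(row) != len(header):
            problems.append(
                f"row {row_no}: expected {len(header)} fields, found {len(row)}"
            )
            continue
        for name, cell in zip(header, row):
            if cell.strip() in MISSING:
                problems.append(f"row {row_no}: missing value in column {name!r}")
    return problems
```

Run as a script ahead of the processing step, a check like this could exit with a non-zero status whenever the returned list is not empty, so that the pipeline stops before invalid data is used.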

**HINT** You could save scripts on GitHub, then import them into Kaggle as a GitHub data file.

### Further Reading

* [Crisis to Calm: Story of Data Validation @ Netflix](https://www.infoq.com/presentations/data-validation-netflix)

## Day 3 - ETL & Creating Datasets from Kernel Output

* basic principles of Extract, Transform & Load (aka ETL) pipelines
* creating datasets from Kaggle Kernel outputs

The notebook for this section is [here](https://www.kaggle.com/rtatman/automating-data-pipelines-day-3).

### ETL

* **Extracting**: get the raw data you need from where it is being stored
* **Transforming**: rearrange that data to fit your needs
* **Loading**: store your transformed data in a different place so it can be used
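
The three steps above can be sketched end to end with the Python standard library. This is an illustrative sketch, not the course notebook's code. It assumes the raw data is CSV text and uses an SQLite database as the "different place". The function, table and column names are hypothetical.

```python
import csv
import io
import sqlite3


def extract(text: str) -> list[dict[str, str]]:
    """Extract: parse raw CSV text into one dictionary per row."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows: list[dict[str, str]], keep: list[str]) -> list[tuple[str, ...]]:
    """Transform: keep selected columns, dropping rows with an empty value in them."""
    result = []
    for row in rows:
        values = tuple((row.get(col) or "").strip() for col in keep)
        if all(values):
            result.append(values)
    return result


def load(conn: sqlite3.Connection, table: str, columns: list[str],
         rows: list[tuple[str, ...]]) -> None:
    """Load: store the transformed rows in an SQLite table.

    Table and column names are interpolated into the SQL, so they
    must come from trusted code, never from user input.
    """
    cols = ", ".join(f'"{c}"' for c in columns)
    marks = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', rows)
    conn.commit()
```

Keeping each stage as a separate function means a stage can be tested or replaced on its own, for example swapping the SQLite load for writing a new CSV file.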

Useful packages:

* [rvest](https://cran.r-project.org/web/packages/rvest/index.html) - [easy web scraping](https://blog.rstudio.com/2014/11/24/rvest-easy-web-scraping-with-r/)
* [tidytext](https://cran.r-project.org/web/packages/tidytext/index.html) - text mining for word processing and sentiment analysis
* [mice](https://cran.r-project.org/web/packages/mice/index.html) - [Mice: Multivariate Imputation By Chained Equations](https://www.rdocumentation.org/packages/mice/versions/3.3.0/topics/mice) for imputation of missing values, see also [MCAR](https://rdrr.io/cran/mice/man/ampute.html)
