# Lecture 22: Data Pipelines

Applying machine learning algorithms and approaches to real-world data
involves a number of practical challenges. In this series of lectures,
we will look at tools and ways of working that address questions like:

- How can I store and access large amounts of data remotely?
- How can I keep track of different versions of datasets?
- How can I share my results and make my analyses reproducible by others?

Before turning to the individual questions, let's look at the broader picture of how we handle data, both before and after running any analysis or inference methods.

## Pipelines

More often than not, the data of interest will not be in a directly usable form, and it may not even be immediately available to you.

Before we can train any model, we first need to make sure that the data is available and properly formatted. This can involve a number of steps:

- Accessing the data
- Cleaning and other preprocessing
- Reformatting

After this, the data is ready to be used in our algorithm of choice.

Figure: access -> preprocess -> reformat -> modelling/prediction -> publishing

This sequence of steps is sometimes called a **data pipeline**. Another related term used to describe similar workflows is ETL: Extract-Transform-Load.
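
To make the idea concrete, here is a minimal sketch of a pipeline as a chain of small functions, one per step. The function names, the toy records and the choice of JSON as output format are all assumptions made for illustration; a real pipeline would read from an actual source and write to wherever the model code expects its input.

```python
import json


def extract() -> list[dict]:
    """Access: stand-in for reading a file, a database or a web service."""
    return [
        {"name": " Alice ", "height_cm": "170"},
        {"name": "Bob", "height_cm": ""},  # missing measurement
        {"name": "Carol", "height_cm": "165"},
    ]


def transform(rows: list[dict]) -> list[dict]:
    """Preprocess: drop incomplete rows, tidy strings, convert units and types."""
    cleaned = []
    for row in rows:
        if not row["height_cm"]:
            continue  # skip rows without a measurement
        cleaned.append(
            {
                "name": row["name"].strip(),
                "height_m": int(row["height_cm"]) / 100,
            }
        )
    return cleaned


def load(rows: list[dict]) -> str:
    """Reformat: serialise the rows into the format the next stage expects."""
    return json.dumps(rows)


def run_pipeline() -> str:
    # Each step consumes the output of the previous one
    return load(transform(extract()))


result = run_pipeline()
```

Because every step is an ordinary function, each one can be tested, replaced or rerun on its own.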

## Accessing data

If we haven't collected the data ourselves, we will first need to get our (virtual) hands on it! This can be done in a number of ways. For example:
- A colleague gives us a file
- We connect to a web service that produces the data
- We "scrape" a web page or other source to extract the data ourselves
- We query a database for the particular data we want

Sometimes we will need to combine more than one source to get the full set of data that we require.

We will talk more about databases in the next lecture. 

A common element in the above examples is that the data can exist in some **remote** location. How we get it onto our own computer will depend on the source, format and size of the data. However, this can often be done programmatically.
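
As a rough illustration of fetching remote data programmatically, the sketch below streams whatever a URL points at into a local file, using only the Python standard library. The helper name `fetch` is made up for this example, and real sources may additionally require authentication, retries or pagination, none of which is handled here.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch(url: str, destination: Path) -> Path:
    """Stream the resource at `url` into `destination` and return the path."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    # Copy in chunks so that large files never have to fit in memory at once
    with urllib.request.urlopen(url) as response, destination.open("wb") as out:
        shutil.copyfileobj(response, out)
    return destination
```

The same function works for `https://` and `file://` URLs, which makes it easy to try out on a local file before pointing it at a remote server.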

### Example: Downloading data from S3

S3 is a storage service offered by Amazon Web Services (AWS). Users can upload datasets which can then be accessed by others. For this example we will look at an open dataset that can be downloaded by anyone.

(To run this locally, you will need to install the Python package `boto3`, which lets you communicate with AWS.)

    import boto3
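
The download itself might look roughly like the sketch below. This is not the lecture's original code: the helper names `split_s3_uri` and `download_public_object` are invented for illustration, and the sketch assumes the bucket allows anonymous reads, which is why the requests are sent unsigned.

```python
from pathlib import Path


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/path/to/key' into ('bucket', 'path/to/key')."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URI must name both a bucket and a key: {uri!r}")
    return bucket, key


def download_public_object(uri: str, destination: Path) -> Path:
    """Download one object from a publicly readable bucket (assumed setup)."""
    # Imported here so the helper above can be used without boto3 installed
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    bucket, key = split_s3_uri(uri)
    # Open datasets can be read anonymously, so no credentials are attached
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    s3.download_file(bucket, key, str(destination))
    return destination
```

A call would then look like `download_public_object("s3://some-open-bucket/some/file.csv", Path("file.csv"))`, where the bucket and key are placeholders to be replaced with those of the dataset you actually want.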

## Preprocessing

It is not always enough to get hold of the data you want to work with. Sometimes this raw or preliminary data has to be transformed. There are many reasons why:

- Errors in data
- Dimensionality or size of dataset is too high
- We want to focus only on a subset of interest
- Raw data does not directly contain variables of interest 
- Some algorithms are negatively impacted by e.g. imbalances in class frequencies or extreme values

Preprocessing steps can include:
- Replacing values that are incorrect or cause problems
- Filtering, subsampling (discarding samples) or oversampling (repeating samples)
- Creating new variables such as by combining existing ones
- Removing outliers

Aspects of this are often referred to as **"cleaning"** the data. This is an important and often undervalued step in the pipeline. These transformations can be performed manually, although tools like [OpenRefine](https://openrefine.org/) can simplify and automate the process.
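
The same kinds of steps can also be scripted. The sketch below uses `pandas` on a small made-up table; the column names, the convention that `-1` marks a missing height, and the plausibility range used to drop outliers are all assumptions chosen for illustration, not general rules.

```python
import pandas as pd

# Toy data: one sentinel value (-1) and one implausible height (1750 cm)
raw = pd.DataFrame(
    {
        "height_cm": [170.0, 165.0, -1.0, 180.0, 1750.0, 172.0],
        "weight_kg": [65.0, 58.0, 70.0, 81.0, 77.0, 68.0],
    }
)


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()  # leave the raw data untouched
    # Replace problematic values: -1 is assumed to encode "not measured"
    df["height_cm"] = df["height_cm"].replace(-1.0, float("nan"))
    # Filter: discard rows where the height is missing
    df = df.dropna(subset=["height_cm"])
    # Remove outliers: keep only heights in an assumed plausible range
    df = df[df["height_cm"].between(50, 250)]
    # Create a new variable by combining existing ones (body mass index)
    df = df.assign(bmi=df["weight_kg"] / (df["height_cm"] / 100) ** 2)
    return df.reset_index(drop=True)


clean = preprocess(raw)
```

Keeping the raw table unchanged and returning a new one means the preprocessing can be rerun or adjusted without downloading the data again.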

## Reformatting

The code you have written may expect to read in data in a particular format, such as a CSV file, or a collection of files. The result of your preprocessing must therefore be put into the same format.

This can often be done through a library. For instance, `pandas` offers several methods for writing out a data frame to a number of commonly-used formats.
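
As a small illustration, the sketch below writes the same data frame out as CSV and as JSON. The table contents are made up; `to_csv` and `to_json` return a string when no path is given, and write to a file when one is passed.

```python
import io
import json

import pandas as pd

df = pd.DataFrame({"sample_id": [1, 2, 3], "label": ["a", "b", "a"]})

# With no path argument, both methods return the serialised text
csv_text = df.to_csv(index=False)  # index=False omits the row numbers
json_text = df.to_json(orient="records")  # one JSON object per row

# Reading the CSV back is a quick check that nothing was lost
round_trip = pd.read_csv(io.StringIO(csv_text))
```

To write to disk instead, pass a file name as the first argument, for example `df.to_csv("samples.csv", index=False)`.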



## Publishing
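Finally, make your results and models available to others:
- Serialize results in a standard format so that the analysis can be reproduced
- Upload them to a remote location where others can access them
- Include metadata that explains the provenance and meaning of the data
- Follow the FAIR principles (Findable, Accessible, Interoperable, Reusable)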
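
One possible way to cover the first and third points is sketched below: results are written as JSON, and a small metadata file records where they came from, when they were produced, and a checksum that lets others verify the file. The function name, file names and metadata fields are assumptions for this example rather than an established standard.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def publish_results(results: dict, source: str, output_dir: Path) -> Path:
    """Write results plus a metadata file; return the metadata file's path."""
    output_dir.mkdir(parents=True, exist_ok=True)

    # Sorted keys give the same bytes for the same results on every run
    payload = json.dumps(results, indent=2, sort_keys=True).encode("utf-8")
    results_path = output_dir / "results.json"
    results_path.write_bytes(payload)

    metadata = {
        "results_file": results_path.name,
        "source": source,  # provenance: where the input data came from
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(payload).hexdigest(),  # integrity check
    }
    metadata_path = output_dir / "results.metadata.json"
    metadata_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return metadata_path
```

Both files can then be uploaded together, so that anyone who downloads the results can check the hash and see how they were produced.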

## Summary
- Data needs to go through a number of steps before it can be used in a model.
- Doing this process programmatically, not manually, leaves a record and facilitates repetition and verification.
- Remote access to data is becoming increasingly important as datasets grow in size.