## Motivation

You have probably heard that **80%** of a data scientist's work consists of finding, preparing, and cleaning data, and only the remaining **20%** is dedicated to modeling and analysis.

<img src="./images/datascienceprocess.png">

In all these workflows, it is very important to keep your code clean and organized. Moreover, there are standard procedures (e.g., replacing missing values) that need to be consistent and automated, so that when new data arrives, exactly the same transformation is applied.
**Pipelines** are a way to overcome these problems and keep your code clean!
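
As a minimal sketch of what "exactly the same transformation" means, assume a toy numeric column with missing values (the data and variable names below are invented for illustration). scikit-learn's `SimpleImputer` learns the replacement value from the training data in `fit` and reuses that value in `transform` when new data arrives:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data, invented for illustration only
train = pd.DataFrame({"age": [20.0, 30.0, np.nan, 40.0]})
new_data = pd.DataFrame({"age": [np.nan, 50.0]})

# fit() learns the mean of the training column (missing values are ignored)
imputer = SimpleImputer(strategy="mean")
imputer.fit(train)

# transform() fills missing values in the new data with the mean learned
# above, not with a statistic recomputed on the new data
new_filled = imputer.transform(new_data)

print(imputer.statistics_)  # mean learned from the training column
print(new_filled)
```

The learned statistic lives inside the fitted object, so the replacement rule does not have to be copied around by hand.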

Most of the time, the data you obtain is very heterogeneous: it is not already in matrix form, and it contains a mixture of data types such as categorical features, numerical features, and text. Often you also have to deal with missing values.

**Pipelines** can be seen as a series of transformations applied to your data one after the other, bringing it into a format that a model can "read". You can even add the final step (training your model and making predictions) to the pipeline:

<img src="./images/diagram_data.png">

```python
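# Illustrative pseudocode: the helper functions below are placeholders,
# not functions from a real library.
data = read_file('..')

# Split the columns by type
data_num = extract_num(data)
data_cat = extract_cat(data)
data_bool = extract_bool(data)

# Replace missing values, then turn categorical features into numbers
data_num = replace_missing(data_num)
data_cat = replace_missing(data_cat)
data_cat = get_num_features(data_cat)

# concatenate ...
# split_train_test ...
# model fit ...
# model predict ...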
```

Suppose you want to make some changes: add other transformations, normalize, etc. With the code above, you need to keep track of every step, and you might end up with a mess of functions.
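
For comparison, here is a rough sketch of how the same workflow could look as a scikit-learn pipeline. The toy data, column names, and choice of model are assumptions made for illustration, and the sketch uses `ColumnTransformer` to route columns by type, which may differ from the approach taken later in the course:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data, invented for illustration only
X = pd.DataFrame({
    "age": [22.0, 35.0, np.nan, 58.0, 41.0, 29.0],
    "city": ["Paris", "Rome", np.nan, "Rome", "Paris", "Rome"],
})
y = np.array([0, 1, 0, 1, 1, 0])

# Numerical columns: replace missing values, then normalize
num_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: replace missing values, then one-hot encode
cat_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Apply each sub-pipeline to its own columns and concatenate the results
preprocess = ColumnTransformer([
    ("num", num_steps, ["age"]),
    ("cat", cat_steps, ["city"]),
])

# Final step: the model itself
model = Pipeline([
    ("preprocess", preprocess),
    ("classify", LogisticRegression()),
])

model.fit(X, y)
predictions = model.predict(X)
```

Adding or changing a transformation then means editing one entry in a list of steps, rather than tracking a chain of intermediate variables.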

## Assumptions

The word "pipeline" in data science can refer to different concepts. **Here** we consider pipelines in scikit-learn, with the following assumptions:

- we will use Python
- we will work with scikit-learn pipelines (https://scikit-learn.org/stable/modules/compose.html)
- we will use pandas DataFrames as inputs

## Course Summary

1. Object-oriented programming: Classes in Python
2. Special types of classes in Python: Transformers and Estimators
3. Pipelines
4. Feature Union

Throughout the course we will have a lot of exercises and hands-on practice!