# [Pipelines](https://www.kaggle.com/dansbecker/pipelines/code)

Pipelines are a simple way to keep your data processing and modeling code organized.  
Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.  
Many data scientists hack together models without pipelines, but pipelines have some important benefits, including:  
* **Cleaner Code:** Accounting for your training (and validation) data at each step of processing can get messy. With a pipeline, you don't need to manually keep track of the data at each step.  
* **Fewer Bugs:** There are fewer opportunities to misapply a step or forget a preprocessing step.  
* **Easier to Productionize:** It can be surprisingly hard to transition a model from a prototype to something deployable at scale. We won't go into the many related concerns here, but pipelines can help.  
* **More Options For Model Testing:** Because the whole process is a single object, it can be passed directly to model-testing tools. You will see an example in the next tutorial, which covers cross-validation.
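
As a brief preview of the last point, here is a minimal sketch of passing a pipeline to cross-validation. The synthetic data and the names `X_demo`, `y_demo`, and `demo_pipeline` are assumptions made for illustration and are not part of this tutorial's dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (an assumption, not the tutorial's dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=100)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan  # knock out roughly 10% of values

demo_pipeline = make_pipeline(
    SimpleImputer(),
    RandomForestRegressor(n_estimators=20, random_state=0),
)

# The pipeline is passed as one estimator, so the imputer is refit
# on the training portion of every fold along with the model.
scores = cross_val_score(
    demo_pipeline, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
)
print(scores.shape)
```

Each entry of `scores` is the negated mean absolute error for one fold, so values closer to zero are better.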

## Example

In [1]:
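import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Melbourne housing data and keep a few numeric predictors
data = pd.read_csv('input/melbourne_data.csv')
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]
y = data.Price  # prediction target
# Hold out part of the data for testing (by default 25% of the rows)
train_X, test_X, train_y, test_y = train_test_split(X, y)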

Suppose your modeling process uses an imputer (scikit-learn's SimpleImputer) to fill in missing values, followed by a RandomForestRegressor to make predictions.  
These steps can be bundled together with the **make_pipeline()** function.

In [2]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
# SimpleImputer fills each missing value with its column mean by default
from sklearn.impute import SimpleImputer

my_pipeline = make_pipeline(SimpleImputer(), RandomForestRegressor())

Now you can use this pipeline for fitting and prediction:

In [5]:
my_pipeline.fit(train_X, train_y)
predictions = my_pipeline.predict(test_X)
predictions[:5]

array([1413955.55555556,  805350.        ,  805000.        ,
        767800.        ,  963264.73673385])

### Compared to the code without pipelines:

In [6]:
my_imputer = SimpleImputer()
my_model = RandomForestRegressor()

# Fit the imputer on the training data only, then apply it to both sets
imputed_train_X = my_imputer.fit_transform(train_X)
imputed_test_X = my_imputer.transform(test_X)
my_model.fit(imputed_train_X, train_y)
predictions = my_model.predict(imputed_test_X)
predictions[:5]

array([1409800.        ,  853666.66666667,  810050.        ,
        805950.        ,  972176.63690476])

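With this simplified example, it is difficult to appreciate the utility of pipelines.  
The two sets of predictions above differ slightly only because a random forest is randomized each time it is fit, not because the pipeline behaves differently.  
As data processing tasks become more complex, the elegance and portability of pipelines can be a great help.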
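
To hint at how this scales, here is a minimal sketch of a pipeline with one extra preprocessing step. The StandardScaler step, the median strategy, and the synthetic data (`X_toy`, `y_toy`) are assumptions chosen for illustration, not part of the example above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with some missing values in one column
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(60, 4))
y_toy = X_toy.sum(axis=1)
X_toy[::7, 2] = np.nan

# Three steps, still used as one object
longer_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    RandomForestRegressor(n_estimators=10, random_state=0),
)
longer_pipeline.fit(X_toy, y_toy)
toy_predictions = longer_pipeline.predict(X_toy[:5])

# make_pipeline names each step after its lowercased class name
step_names = list(longer_pipeline.named_steps)
print(step_names)
# ['simpleimputer', 'standardscaler', 'randomforestregressor']
```

Adding a step changes only the call to make_pipeline; the fit and predict calls stay the same.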

## Understanding Pipelines

Most scikit-learn objects are either **transformers** or **models**.  
* *Transformers* are for preprocessing before modeling.  
An imputer (for filling in missing values), such as scikit-learn's SimpleImputer, is an example of a transformer.  
Over time, you will learn many more transformers, and you will frequently use multiple transformers sequentially.  
* *Models* are used to make predictions.  
You will usually preprocess your data (with transformers) before putting it in a model.  
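
To make the distinction concrete, here is a minimal sketch on a tiny hand-made array. The data and the choice of LinearRegression as the model are assumptions for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Tiny hand-made data with one missing value
X_small = np.array([[1.0, 2.0],
                    [np.nan, 4.0],
                    [3.0, 6.0]])
y_small = np.array([10.0, 20.0, 30.0])

# A transformer: fit it, then apply it with transform
imputer = SimpleImputer()  # default strategy is the column mean
imputer.fit(X_small)
X_filled = imputer.transform(X_small)  # the NaN becomes 2.0, the mean of 1.0 and 3.0

# A model: fit it, then apply it with predict
model = LinearRegression()
model.fit(X_filled, y_small)
small_predictions = model.predict(X_filled)

print(X_filled)
```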

You can tell if an object is a transformer or a model by how you apply it.  
After fitting a transformer, you apply it with the transform method.  
After fitting a model, you apply it with the predict method.  
In a pipeline used for prediction, every step before the last must be a transformer, and the last step is a model.  
Eventually you will want to apply more transformers and combine them more flexibly.  
We will cover this later in an Advanced Pipelines tutorial.
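
The following sketch illustrates what a pipeline does with its steps by comparing it to the same steps applied by hand. The synthetic data and variable names are assumptions, and random_state is fixed in both models so that the two sets of predictions can be compared directly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data with missing values in both sets
rng = np.random.default_rng(2)
X_train = rng.normal(size=(80, 3))
y_train = X_train[:, 0] - X_train[:, 1]
X_new = rng.normal(size=(10, 3))
X_train[::9, 1] = np.nan
X_new[::4, 0] = np.nan

# Pipeline: transformer first, model last
pipe = make_pipeline(
    SimpleImputer(),
    RandomForestRegressor(n_estimators=10, random_state=0),
)
pipe.fit(X_train, y_train)
pipeline_preds = pipe.predict(X_new)

# The same steps by hand: fit_transform on training data, transform on new data
manual_imputer = SimpleImputer()
manual_model = RandomForestRegressor(n_estimators=10, random_state=0)
manual_model.fit(manual_imputer.fit_transform(X_train), y_train)
manual_preds = manual_model.predict(manual_imputer.transform(X_new))

same = np.allclose(pipeline_preds, manual_preds)
print(same)
```

With the random seed fixed, both approaches should produce the same predictions, which is the sense in which the pipeline is just the individual steps bundled together.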