# Modeling workflow

Although we cover many different models, the steps needed to get from raw data to results are usually very similar.

We assume you already have a dataset with the features and the target you want to model (this is the output you get from doing the EDA).

Basic Workflow:
-----

1. Do Training/Test split on the dataframe.

2. Create an Instance of the model.

3. Fit the model with the training data.

4. Assess the model we just fitted with the test data.

5. (Optional) Use predict to generate predictions for new data.

## Example

First we load the libraries that we will need. 

In [None]:
import numpy as np
import pandas as pd
from sklearn import datasets, metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

Now we load the Boston Housing dataset, which ships with older versions of sklearn (`load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2).

In [None]:
boston = datasets.load_boston()

features_df = pd.DataFrame(boston.data, columns=boston.feature_names)

features_df.head(3)

In [None]:
target_df = pd.DataFrame(boston.target, columns=["MEDV"])
target_df.head(3)

### Linear regression

Now that we have the data (features and target), we will go through the steps in the workflow, so we can fit a linear regression on _**all**_ of our features.

#### 1: Do Training/Test split on the dataframe.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    features_df, target_df, test_size=0.33, random_state=42
)
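As a quick sanity check on the split, the sketch below uses a small synthetic DataFrame (a hypothetical stand-in, not the Boston data) to show that `test_size=0.33` puts roughly a third of the rows in the test set and that features and target stay aligned row by row. The names `demo_features`, `demo_target`, and the `Xd_`/`yd_` variables are made up for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 features, 1 target column.
rng = np.random.default_rng(0)
demo_features = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
demo_target = pd.DataFrame({"y": rng.normal(size=100)})

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo_features, demo_target, test_size=0.33, random_state=42
)

# Roughly two thirds of the rows go to training, one third to testing.
print(Xd_train.shape, Xd_test.shape)

# Features and target are shuffled together, so their row labels still match.
print((Xd_train.index == yd_train.index).all())
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the test set on every run.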

#### 2: Create an Instance of the model.

In [None]:
regr = LinearRegression()

#### 3: Fit the model with the training data.

In [None]:
fitted_regr = regr.fit(X_train, y_train)

#### 4: Assess the model we just fitted with the test data.

In [None]:
# For regressors, score() returns the coefficient of determination (R^2).
print('R^2 score: %.2f' % regr.score(X_test, y_test))
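For a regressor, `score` returns the coefficient of determination (R^2) on the data you pass in. The sketch below, run on synthetic data from `make_regression` (an assumption, used only to keep the example self-contained), checks that `score` agrees with `metrics.r2_score` and also reports the mean squared error as a second view of the fit. All variable names are illustrative.

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for a real dataset.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

demo_regr = LinearRegression().fit(X_tr, y_tr)
y_pred = demo_regr.predict(X_te)

# Two routes to the same R^2 value.
r2_from_score = demo_regr.score(X_te, y_te)
r2_from_metrics = metrics.r2_score(y_te, y_pred)

# Mean squared error is expressed in squared units of the target.
mse = metrics.mean_squared_error(y_te, y_pred)

print("R^2 via score():    %.3f" % r2_from_score)
print("R^2 via r2_score(): %.3f" % r2_from_metrics)
print("MSE:                %.3f" % mse)
```

R^2 is unitless and convenient for comparing models, while an error metric such as MSE tells you how far off predictions are in terms of the target itself.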

#### 5: (Optional) Use predict to generate predictions for new data.

In [None]:
# This simulates a feature row for a house whose price we want to predict.
new_sample = np.array([0.00632, 18.0, 2.31, 0.0, 0.538,
                       6.575, 65.2, 4.0900, 1.0, 296.0, 15.3, 396.90, 4.98]
                      ).reshape(1, -1)  # predict expects a 2D array: one row per sample

# predict returns a 2D array here because the target was a one-column DataFrame.
print("New house price: " + str(fitted_regr.predict(new_sample)[0][0]))
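Because the model above was fitted on a DataFrame, recent scikit-learn versions (1.0 and later) emit a warning when `predict` receives a bare NumPy array without feature names. One way around this, sketched here on hypothetical stand-in data with made-up column names, is to build the new sample as a one-row DataFrame with the same columns used for fitting.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in data with an exactly linear target: 2*f1 + f2 - f3.
rng = np.random.default_rng(0)
cols = ["f1", "f2", "f3"]
demo_X = pd.DataFrame(rng.normal(size=(50, 3)), columns=cols)
demo_y = pd.DataFrame({"target": 2.0 * demo_X["f1"] + demo_X["f2"] - demo_X["f3"]})

demo_model = LinearRegression().fit(demo_X, demo_y)

# A one-row DataFrame with the same column names as the training features.
new_row = pd.DataFrame([[0.1, -0.2, 0.3]], columns=cols)
new_pred = demo_model.predict(new_row)

# One sample, one target column, so the result has shape (1, 1).
print(new_pred.shape)
print(float(new_pred[0][0]))
```

Keeping the column names also guards against accidentally passing features in the wrong order.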

# More advanced workflows:

We have just covered a basic workflow, but there are more steps you can add that might improve your final performance.

Things we haven't covered in this notebook:

- EDA (especially feature selection and feature engineering).
- Hyperparameter optimization with grid search.
- Cross-validation for model assessment.
- More advanced model assessment (especially important for classification).
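As a rough preview of two of these items, the sketch below runs 5-fold cross-validation and a small grid search on synthetic data from `make_regression`. Two assumptions to note: the data is a stand-in for a real dataset, and `Ridge` is swapped in for the grid search because plain `LinearRegression` has no regularization strength to tune. The `alpha` grid is arbitrary and chosen only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic regression problem standing in for a real dataset.
X_cv, y_cv = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
X_cv_train, X_cv_test, y_cv_train, y_cv_test = train_test_split(
    X_cv, y_cv, test_size=0.33, random_state=42
)

# Cross-validation: fit and score on 5 different train/validation folds.
# The default scoring for a regressor is R^2.
cv_scores = cross_val_score(LinearRegression(), X_cv_train, y_cv_train, cv=5)
print("Mean CV R^2: %.3f" % cv_scores.mean())

# Grid search: try each alpha with 5-fold CV and keep the best one.
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_cv_train, y_cv_train)
print("Best params:", search.best_params_)

# The held-out test set is only used once, after model selection is done.
print("Test R^2: %.3f" % search.score(X_cv_test, y_cv_test))
```

Both tools operate on the training data only, which keeps the test set untouched until the final assessment.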