## MLeap Scikit-Learn Demo

The goal of this demo is to:

1. Put together an ML pipeline using scikit-learn transformers, pipelines, and feature unions
2. Train a linear regression and a random forest regression to predict listing prices
3. Demonstrate how to serialize scikit-learn transformers and models to bundle.ml
4. TODO: use .deploy() to deploy a model to combust cloud
5. TODO: deserialize the pipeline in Spark

Note: The MLeap <> Scikit-Learn integration is experimental. We are planning to release a stable version with mleap-0.7.0.

## Background on the Dataset

The dataset used for this demo was assembled from the per-city data published by [Inside Airbnb](http://insideairbnb.com/get-the-data.html). We've also combined the individual city datasets and the relevant features into this [research dataset](https://s3-us-west-2.amazonaws.com/mleap-demo/datasources/airbnb.csv), stored as a CSV file.

### Step 0: Load libraries and data

In [1]:
import pandas as pd

import mleap.sklearn.pipeline
import mleap.sklearn.feature_union
import mleap.sklearn.base
import mleap.sklearn.logistic
import mleap.sklearn.preprocessing.data
from mleap.sklearn.ensemble import forest

from mleap.sklearn.preprocessing.data import FeatureExtractor, LabelEncoder, ReshapeArrayToN1

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline, FeatureUnion

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Imputer
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import OneHotEncoder

import numpy as np

In [3]:
df = pd.read_csv('/tmp/airbnb.csv', error_bad_lines=False, warn_bad_lines=False)
df[:5]

Unnamed: 0,id,name,price,bedrooms,bathrooms,room_type,square_feet,host_is_superhost,state,cancellation_policy,security_deposit,cleaning_fee,extra_people,number_of_reviews,price_per_bedroom,review_scores_rating,instant_bookable
0,1949687,Delectable Victorian Flat for two,80.0,1.0,1.0,Entire home/apt,,0.0,London,moderate,100.0,20.0,10.0,8,80.0,94.0,0.0
1,3863509,Fully Furnished 3 Bed House/Garden,40.0,1.0,1.0,Private room,,0.0,London,flexible,0.0,0.0,0.0,5,40.0,55.0,0.0
2,1988980,Cozy Double Room in Victorian House,35.0,1.0,1.0,Private room,,0.0,Greenwich,flexible,0.0,5.0,10.0,32,35.0,89.0,0.0
3,2347198,Double Room In Central London.,42.0,1.0,1.0,Private room,,0.0,London,strict,0.0,0.0,15.0,24,42.0,71.0,0.0
4,144337,Fast WIFI Breakfast FREE Parking 5,200.0,1.0,1.5,Private room,250.0,0.0,London,strict,300.0,0.0,20.0,24,200.0,84.0,0.0


### Step 1: Standardize the data for our demo

In [4]:
def _transform_state(state):
    if state in ['NY', 'CA', 'London', 'Berlin', 'TX', 'IL', 'OR', 'DC', 'WA']:
        return state
    return 'Other'


### Step 1.1: Take a look at some summary statistics of the data

In [5]:
df[['state', 'price']].groupby('state').agg([np.size, np.mean]).sort_values(by=('price', 'size'), ascending=False)[:10]

| state | price (size) | price (mean) |
|---|---|---|
| NY | 52737.0 | 142.010695 |
| CA | 48467.0 | 159.67287 |
| Île-de-France | 47371.0 | 98.960166 |
| Berlin | 23842.0 | 60.337807 |
| London | 22873.0 | 99.098413 |
| NSW | 15356.0 | 170.841039 |
| VIC | 9788.0 | 135.221087 |
| Noord-Holland | 9256.0 | 125.68669 |
| Catalunya | 8929.0 | 65.640945 |
| Catalonia | 8728.0 | 82.069317 |


In [6]:
price_stats=df[['state', 'price']].groupby('state').agg([np.size, np.mean, np.max]).sort_values(by=('price', 'mean'), ascending=False)
price_stats[price_stats[('price','size')]>25][:10]

| state | price (size) | price (mean) | price (amax) |
|---|---|---|---|
| TX | 7515.0 | 213.240319 | 2549.0 |
| TN | 3347.0 | 184.74873 | 1570.0 |
| LA | 4196.0 | 174.78122 | 1840.0 |
| MA | 4067.0 | 173.470371 | 1200.0 |
| NSW | 15356.0 | 170.841039 | 9002.0 |
| Vic | 36.0 | 162.722222 | 701.0 |
| New South Wales | 1474.0 | 161.00882 | 2001.0 |
| CA | 48467.0 | 159.67287 | 10000.0 |
| NY | 52737.0 | 142.010695 | 6001.0 |
| IL | 8208.0 | 137.878533 | 2000.0 |


In [7]:
# convert to categorical feature
df['host_is_superhost'] = df['host_is_superhost'].apply(str)
df['instant_bookable'] = df['instant_bookable'].apply(str)

# normalize state
df['state'] = df.state.apply(_transform_state)
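
One detail of the string conversion above is worth a quick look. The sketch below uses plain Python rather than pandas, and the `values` list is made up; it only shows what calling `str` does to the float flags, including a missing value.

```python
# Made-up flag values, including a missing one (NaN).
values = [0.0, 1.0, float('nan')]

# Same per-value conversion as .apply(str) in the cell above.
as_categories = [str(v) for v in values]
print(as_categories)
```

A missing flag turns into the literal string `'nan'`, so any nulls in these two columns would presumably be treated as a category of their own by the label encoder later on.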


### Step 2: Define continuous and categorical features and filter nulls

In [8]:
continuous_features = ["bathrooms",
  "bedrooms",
  "security_deposit",
  "cleaning_fee",
  "extra_people",
  "number_of_reviews",
  "square_feet",
  "review_scores_rating"]

categorical_features = ["room_type",
  "host_is_superhost",
  "cancellation_policy",
  "state",
  "instant_bookable"]


In [9]:
imputed_continuous_features = ['imp_{}'.format(x) for x in continuous_features]

feature_extractor2_tf = FeatureExtractor(continuous_features, 'imputed_features', imputed_continuous_features)

impute_security_deposit_tf = Imputer(strategy='mean', axis=0)
impute_security_deposit_tf.mlinit(input_features=feature_extractor2_tf.output_vector, output_features='imputed_features')

impute_pipeline = Pipeline([
        (feature_extractor2_tf.name, feature_extractor2_tf),
        (impute_security_deposit_tf.name, impute_security_deposit_tf)
    ])
impute_pipeline.mlinit()

# Consider doing this via a feature union
df2 = df.join(pd.DataFrame(impute_pipeline.fit_transform(df), columns=feature_extractor2_tf.output_vector_items))

all_features = imputed_continuous_features + categorical_features

vector_assembler_d2bcd4d1-e7e9-11e6-8ddc-acbc329465af
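
For readers unfamiliar with mean imputation, the following is a minimal sketch of what the `strategy='mean'` step does for a single column. It is plain Python, not the scikit-learn implementation; the function name `impute_column_mean` and the sample `square_feet` values are made up for this example.

```python
import math

def impute_column_mean(column):
    """Replace NaN entries with the mean of the non-missing entries.

    Assumes the column has at least one non-missing value.
    """
    observed = [v for v in column if not math.isnan(v)]
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(v) else v for v in column]

# Made-up column with two missing entries.
square_feet = [float('nan'), 250.0, float('nan'), 750.0]
imputed = impute_column_mean(square_feet)
print(imputed)
```

Columns such as `square_feet` are mostly empty in the sample rows shown earlier, so after this step many of their values are simply the column mean.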


### Step 3: Split data into training and validation 

In [10]:
# First filter out outlier prices
df2 = df2[(df2.price>=50)&(df2.price<=500)]

# Split into training and test
X_train, X_test, y_train, y_test = train_test_split(df2[all_features], df2[['price']], test_size=0.33, random_state=42)
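
The two operations in this cell can be sketched without pandas or scikit-learn. The snippet below is an illustration under our own assumptions: `filter_and_split` is a hypothetical helper, the prices are made up, the test count is rounded up, and `random.Random(42)` will not reproduce the exact rows that `train_test_split(random_state=42)` selects.

```python
import math
import random

def filter_and_split(rows, low=50, high=500, test_size=0.33, seed=42):
    """Keep rows with low <= price <= high, then shuffle and split them."""
    kept = [r for r in rows if low <= r['price'] <= high]
    shuffled = kept[:]
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_test = math.ceil(test_size * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]

# Made-up prices; 35, 900 and 10000 fall outside the 50..500 range.
prices = [35, 60, 80, 120, 200, 300, 450, 500, 900, 10000, 75, 99]
rows = [{'price': p} for p in prices]
train, test = filter_and_split(rows)
print(len(train), len(test))
```

Note that the price filter runs before the split, so both the training and the test set only ever contain listings priced between 50 and 500.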

### Step 4: Continuous Feature Pipeline

In [11]:
feature_extractor_tf = FeatureExtractor(imputed_continuous_features, 'unscaled_cont_features', ["scaled_{}".format(x) for x in imputed_continuous_features])

standard_scaler_tf = StandardScaler()
standard_scaler_tf.mlinit(input_features=feature_extractor_tf.output_vector, output_features='scaled_cont_features')

standard_scaler_pipeline = Pipeline([(feature_extractor_tf.name, feature_extractor_tf),
                            (standard_scaler_tf.name, standard_scaler_tf)])
standard_scaler_pipeline.mlinit()
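
As a reminder of what standard scaling computes, here is a small sketch for one column: subtract the mean and divide by the population standard deviation. The `standardize` helper and the `bedrooms` values are made up for the example, and the sketch assumes the column is not constant (otherwise the standard deviation is zero).

```python
import statistics

def standardize(column):
    """Scale a column to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)  # population std, i.e. ddof=0
    return [(v - mean) / std for v in column]

# Made-up column values.
bedrooms = [1.0, 2.0, 3.0, 4.0, 5.0]
scaled = standardize(bedrooms)
print(scaled)
```

After scaling, every continuous feature is on a comparable scale, which keeps the linear regression coefficients comparable across features.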

### Step 5: Categorical Feature Pipeline

In [12]:
# TODO: Need to fix scikit's One-Hot-Encoder to drop the last column of a matrix if we're using it for ML
def _create_le_one_hot_pipeline(feature_name):
    feature_extractor3_tf = FeatureExtractor([feature_name], '{}'.format(feature_name), 
                                         ['{}'.format(feature_name)])

    # Label-encode the categorical feature into integer labels
    label_encoder_tf = LabelEncoder(input_features = [feature_extractor3_tf.output_vector], output_features='{}_label_le'.format(feature_name))

    # Reshape the output of the LabelEncoder to N-by-1 array
    reshape_le_tf = ReshapeArrayToN1()

    # One-hot encode the integer labels produced by the label encoder
    one_hot_encoder_tf = OneHotEncoder(sparse=False)
    one_hot_encoder_tf.mlinit(input_features = label_encoder_tf.output_features, output_features = '{}_label_one_hot_encoded'.format(feature_name))

    one_hot_encoder_pipeline_x0 = Pipeline([
                                             (feature_extractor3_tf.name, feature_extractor3_tf),
                                             (label_encoder_tf.name, label_encoder_tf),
                                             (reshape_le_tf.name, reshape_le_tf),
                                             (one_hot_encoder_tf.name, one_hot_encoder_tf)
                                            ])
    
    one_hot_encoder_pipeline_x0.mlinit()
    
    return one_hot_encoder_pipeline_x0
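
The TODO at the top of this cell mentions dropping the last column of the one-hot matrix. The sketch below shows the idea in plain Python: label-encode a feature, one-hot encode it, and optionally leave out the last category so that it becomes the implicit reference level. The helpers `label_encode` and `one_hot` and the sample room types are made up here; this is not how MLeap or scikit-learn implement it.

```python
def label_encode(values):
    """Map each distinct value to an integer, using sorted order."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return classes, [index[v] for v in values]

def one_hot(labels, n_classes, drop_last=False):
    """One row per label; with drop_last the final category gets all zeros."""
    width = n_classes - 1 if drop_last else n_classes
    return [[1.0 if label == col else 0.0 for col in range(width)]
            for label in labels]

# Illustrative values for the room_type feature.
room_types = ['Private room', 'Entire home/apt', 'Shared room', 'Private room']
classes, labels = label_encode(room_types)
full = one_hot(labels, len(classes))
reduced = one_hot(labels, len(classes), drop_last=True)
print(classes, labels)
print(full)
print(reduced)
```

With all columns kept, each row sums to one, which is perfectly collinear with the intercept of a linear model. Dropping one column removes that redundancy.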

In [13]:
oh_pipelines = [_create_le_one_hot_pipeline(x) for x in categorical_features]

### Step 6: Assemble our features and feature pipeline

In [14]:
feature_union = FeatureUnion([
        (standard_scaler_pipeline.name, standard_scaler_pipeline)
    ] + [(x.name, x) for x in oh_pipelines])
feature_union.mlinit()
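
A feature union runs each sub-pipeline on the same input rows and concatenates their outputs column-wise. The sketch below mimics only that concatenation step; `hstack` is a hypothetical helper and the three small blocks are made-up stand-ins for the scaled continuous features and two one-hot encoded features.

```python
from itertools import chain

def hstack(*blocks):
    """Concatenate row-aligned feature blocks column-wise."""
    return [list(chain.from_iterable(rows)) for rows in zip(*blocks)]

# Made-up outputs of three sub-pipelines for the same two rows.
scaled_cont = [[-1.0, 0.5], [1.0, -0.5]]
room_type_oh = [[1.0, 0.0], [0.0, 1.0]]
state_oh = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

features = hstack(scaled_cont, room_type_oh, state_oh)
print(features)
```

The column order follows the order of the blocks, which is why the scaler pipeline comes first in the union and the one-hot pipelines follow in the order of `categorical_features`.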

### Step 7: Define our linear regression model

In [15]:
# Put all of the categorical features into a list
oh_features_lists = [[y[1].output_features for y in x.steps if y[1].op == 'one_hot_encoder'] for x in oh_pipelines]
oh_features = [item for sublist in oh_features_lists for item in sublist]
oh_features

['room_type_label_one_hot_encoded',
 'host_is_superhost_label_one_hot_encoded',
 'cancellation_policy_label_one_hot_encoded',
 'state_label_one_hot_encoded',
 'instant_bookable_label_one_hot_encoded']

In [16]:
# Vector Assembler, for serialization purposes only
feature_extractor_lr_model_tf = FeatureExtractor([standard_scaler_tf.output_features] + oh_features, 'input_features', [standard_scaler_tf.output_features] + oh_features)
feature_extractor_lr_model_tf.skip_fit_transform = True

# Define our linear regression
lr_model = LinearRegression()
lr_model.mlinit(input_features='input_features', prediction_column='price_prediction')

lr_model_pipeline = Pipeline([
        (feature_extractor_lr_model_tf.name, feature_extractor_lr_model_tf),
        (lr_model.name, lr_model)
    ])
lr_model_pipeline.mlinit()

In [17]:
model_pipeline = Pipeline([(feature_union.name, feature_union),
                            (lr_model_pipeline.name, lr_model_pipeline)])

model_pipeline.mlinit()

### Step 7.1: Define our random forest regression model

In [18]:
# Vector Assembler, for serialization purposes only
feature_extractor_rf_model_tf = FeatureExtractor(imputed_continuous_features, 'input_features', imputed_continuous_features)
feature_extractor_rf_model_tf.skip_fit_transform = True


rf = RandomForestRegressor(max_depth=4, n_estimators=11)
rf.mlinit(input_features=feature_extractor_rf_model_tf.output_vector, prediction_column='price_prediction', feature_names=imputed_continuous_features)

rf_model_pipeline = Pipeline([
        (feature_extractor_rf_model_tf.name, feature_extractor_rf_model_tf),
        (rf.name, rf)
    ])
rf_model_pipeline.mlinit()


In [3]:
rf_model_pipeline.fit(X_train[imputed_continuous_features], y_train)

### Step 8: Fit our pipeline and regression


In [2]:
from sklearn.model_selection import GridSearchCV
params = {
    "{}__max_depth".format(rf.name): [5, 10],
    "{}__n_estimators".format(rf.name): [10, 15, 20]
}

rf_grid = GridSearchCV(estimator=rf_model_pipeline, param_grid=params, n_jobs=-1)
rf_grid.fit(X_train[imputed_continuous_features], y_train)
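
To make the size of this search concrete, the sketch below enumerates the parameter combinations that a grid like the one above describes. It uses simplified keys without the `rf.name` prefix and is only an illustration of the Cartesian product, not the `GridSearchCV` internals.

```python
from itertools import product

# Same values as the grid above, with simplified (unprefixed) keys.
param_grid = {'max_depth': [5, 10], 'n_estimators': [10, 15, 20]}

names = sorted(param_grid)
candidates = [dict(zip(names, combo))
              for combo in product(*(param_grid[n] for n in names))]

for candidate in candidates:
    print(candidate)
print(len(candidates))
```

Each of the six candidates is fit once per cross-validation fold, so the total number of fits grows quickly as values are added to the grid.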

In [21]:
best_rf = rf_grid.best_params_
best_max_depth = best_rf["{}__max_depth".format(rf.name)]
best_n_estimators = best_rf["{}__n_estimators".format(rf.name)]

rf_optimal = RandomForestRegressor(max_depth=best_max_depth, n_estimators=best_n_estimators)

rf_optimal.mlinit(input_features=feature_extractor_rf_model_tf.output_vector, 
                  prediction_column='price_prediction', 
                  feature_names=imputed_continuous_features)

rf_model_pipeline = Pipeline([
        (feature_extractor_rf_model_tf.name, feature_extractor_rf_model_tf),
        (rf_optimal.name, rf_optimal)
    ])
rf_model_pipeline.mlinit()

In [1]:
model_pipeline.fit(X_train, y_train)
rf_model_pipeline.fit(X_train[imputed_continuous_features], y_train)
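
The split in Step 3 also produced `X_test` and `y_test`, which this notebook does not use. One possible way to compare the two fitted pipelines on that held-out data is the root mean squared error. The sketch below defines the metric in plain Python; the `rmse` helper and the numbers are made up, and obtaining real predictions (for example through the pipelines' `predict` method) is not run here.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up prices and predictions, only to exercise the function.
y_true = [80.0, 120.0, 200.0]
y_pred = [90.0, 110.0, 170.0]
error = rmse(y_true, y_pred)
print(error)
```

Because prices were restricted to the 50 to 500 range before splitting, an error computed this way would be in the same units and on that scale.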

### Step 9: Serialize our pipelines to bundle.ml

In [23]:
# Serialize the linear regression model
model_pipeline.serialize_to_bundle('/tmp', 'scikit-airbnb.lr', init=True)

# Serialize the random forest model
rf_model_pipeline.serialize_to_bundle('/tmp', 'scikit-airbnb.rf', init=True)
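
After serializing, it can be useful to check what was actually written. The sketch below is a generic directory listing and makes no claim about the bundle.ml layout: `list_files` is a hypothetical helper, and it assumes the bundle was written as a directory such as `/tmp/scikit-airbnb.lr` (if it was written as a single archive file instead, the listing is simply empty). The demo part runs against a temporary directory so that it works without MLeap.

```python
import os
import tempfile

def list_files(root):
    """Return all file paths under root, relative to root, sorted."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(found)

# Demo on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as demo_root:
    os.makedirs(os.path.join(demo_root, 'nested'))
    for rel in ['a.json', os.path.join('nested', 'b.json')]:
        with open(os.path.join(demo_root, rel), 'w') as handle:
            handle.write('{}')
    demo_listing = list_files(demo_root)

print(demo_listing)
```

In the notebook, the same helper could be pointed at the serialized path, for example `list_files('/tmp/scikit-airbnb.lr')`.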