In [1]:
# Learning how to create a web service for a trained model so it can be consumed by apps
#A Flask API for serving scikit-learn models
#reference, by Amir Ziai
#https://medium.com/towards-data-science/a-flask-api-for-serving-scikit-learn-models-c8bcdaa41daa
#https://towardsdatascience.com/a-flask-api-for-serving-scikit-learn-models-c8bcdaa41daa
#a working solution is available on GitHub.
#https://github.com/amirziai/sklearnflask/


In [2]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier as rf
from sklearn.externals import joblib
from flask import Flask, jsonify, request
import traceback

# A Flask API for serving scikit-learn models

Scikit-learn models can be persisted (pickled) to avoid retraining the model every time they are used. You can use Flask to create an API that can provide predictions based on a set of input variables using a pickled model.

This notebook uses the Titanic dataset from Kaggle:  
https://www.kaggle.com/c/titanic

Only four variables are used: Age, Sex, Embarked, and Survived (the target).

In [3]:
df = pd.read_csv('kaggle_titanic_train.csv')
include = ['Age', 'Sex', 'Embarked', 'Survived']
df_ = df[include]  # only using 4 variables

In [4]:
print df_.shape
df_.head()

(891, 4)


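|   | Age  | Sex    | Embarked | Survived |
|---|------|--------|----------|----------|
| 0 | 22.0 | male   | S        | 0        |
| 1 | 38.0 | female | C        | 1        |
| 2 | 26.0 | female | S        | 1        |
| 3 | 35.0 | female | S        | 1        |
| 4 | 35.0 | male   | S        | 0        |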


In [5]:
#transform categorical vars and impute missing values
categoricals = []
for col, col_type in df_.dtypes.iteritems():
    if col_type == 'O':
        categoricals.append(col)
    else:
        df_[col].fillna(0, inplace=True)
        
print "categoricals: ", categoricals

categoricals:  ['Sex', 'Embarked']


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)


In [6]:
# pandas provides a simple function, get_dummies, for creating one-hot encoded (OHE) variables for a given DataFrame.
#http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
df_ohe = pd.get_dummies(df_, columns=categoricals, dummy_na=True)
print df_ohe.shape
df_ohe.head()

(891, 9)


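|   | Age  | Survived | Sex_female | Sex_male | Sex_nan | Embarked_C | Embarked_Q | Embarked_S | Embarked_nan |
|---|------|----------|------------|----------|---------|------------|------------|------------|--------------|
| 0 | 22.0 | 0        | 0          | 1        | 0       | 0          | 0          | 1          | 0            |
| 1 | 38.0 | 1        | 1          | 0        | 0       | 1          | 0          | 0          | 0            |
| 2 | 26.0 | 1        | 1          | 0        | 0       | 0          | 0          | 1          | 0            |
| 3 | 35.0 | 1        | 1          | 0        | 0       | 0          | 0          | 1          | 0            |
| 4 | 35.0 | 0        | 0          | 1        | 0       | 0          | 0          | 1          | 0            |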


Now that we’ve successfully transformed our dataset we’re ready to train our model.

In [7]:
#fit model
# using a random forest classifier (can be any classifier)
#from sklearn.ensemble import RandomForestClassifier as rf
dependent_variable = 'Survived'
X = df_ohe[df_ohe.columns.difference([dependent_variable])]
y = df_ohe[dependent_variable]

clf = rf()
clf.fit(X,y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [8]:
#the trained model is ready to be pickled
#from sklearn.externals import joblib
joblib.dump(clf, 'model.pkl')

#We have persisted our model. We can load this model into memory in a single line.
#clf = joblib.load('model.pkl')

['model.pkl']

In [9]:
model_columns = list(X.columns)
joblib.dump(model_columns, 'model_columns.pkl')
print model_columns

['Age', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Embarked_nan', 'Sex_female', 'Sex_male', 'Sex_nan']



## Flask
We’re now ready to use Flask to serve our persisted model.

In [None]:
#Flask is pretty minimalistic. Here’s what you need to start a bare bones Flask application (on port 8080 in this case).
"""# now ready to use Flask to serve our persisted model
from flask import Flask

app = Flask(__name__)

if __name__ == '__main__':
    app.run(port=8080)
"""

We have to do two things: 

(1) load our persisted model into memory when the application starts, and   
(2) create an endpoint that takes input variables, transforms them into the appropriate format, and returns predictions.


In [None]:
#from flask import Flask, jsonify
#from sklearn.externals import joblib
#import pandas as pd

app = Flask(__name__)
@app.route('/predict', methods=['POST'])
def predict():
    json_ = request.json
    query_df = pd.DataFrame(json_)
    query = pd.get_dummies(query_df)
    #query = query.reindex(columns=model_columns, fill_value=0) #may work without this line
    prediction = clf.predict(query).tolist()
    return jsonify({'prediction': prediction})
    
if __name__ == '__main__':
    clf = joblib.load('model.pkl')
    print 'model loaded'
    model_columns = joblib.load('model_columns.pkl')
    print 'model columns loaded'
    app.run(port=8080)

 * Running on http://127.0.0.1:8080/ (Press CTRL+C to quit)


model loaded
model columns loaded


127.0.0.1 - - [08/Nov/2017 13:10:16] "POST /predict HTTP/1.1" 200 -
127.0.0.1 - - [08/Nov/2017 13:10:20] "POST /predict HTTP/1.1" 200 -


This only works under ideal circumstances, where the incoming request contains every possible value of each categorical variable. Otherwise, get_dummies generates a dataframe with fewer columns than the classifier expects, which results in a runtime error. Missing numerical values also need to be imputed with the same method used in training (here, filled with 0).  

One solution to the missing columns is to persist the list of columns from training. Remember that Python objects (including lists and dictionaries) can be pickled. This is what the model_columns.pkl file dumped earlier with joblib is for: the endpoint can load it and reindex the incoming dataframe against it.
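The idea of persisting the training columns and aligning each request against them can be sketched without any third-party packages. This is only an illustration under assumptions: the standard-library `pickle` stands in for joblib, `align_row` is a hypothetical helper, and the temporary file path is made up for the example. In pandas, the equivalent step would presumably be the commented-out `query.reindex(columns=model_columns, fill_value=0)` line in the endpoint above.

```python
import os
import pickle
import tempfile

# Column order copied from the training output above.
model_columns = ['Age', 'Embarked_C', 'Embarked_Q', 'Embarked_S',
                 'Embarked_nan', 'Sex_female', 'Sex_male', 'Sex_nan']

# Persist and reload the list; pickle is a stand-in for joblib here,
# and the temporary directory is only for the sake of the example.
path = os.path.join(tempfile.mkdtemp(), 'model_columns.pkl')
with open(path, 'wb') as f:
    pickle.dump(model_columns, f)
with open(path, 'rb') as f:
    loaded_columns = pickle.load(f)


def align_row(encoded_row, columns):
    """Return the values in training column order.

    Columns absent from the request become 0; keys the model
    never saw during training are dropped.
    """
    return [encoded_row.get(col, 0) for col in columns]


# A request whose one-hot encoding produced only some of the dummy columns.
partial_row = {'Age': 17.0, 'Sex_male': 1, 'Embarked_S': 1}
aligned = align_row(partial_row, loaded_columns)
print(aligned)
```

With this alignment in place, a request no longer has to contain every categorical value for the classifier to receive the expected number of columns.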

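## Note
I only managed to make the endpoint work when the request body was already one-hot encoded, in this specific format. One caveat: the dummy values below are quoted strings, and get_dummies encodes string-valued columns again, so numeric 0/1 values are probably what the model actually needs.  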

Single record:  
[{"Age": 17.0, "Embarked_C": "0","Embarked_Q": "0", "Embarked_S": "1", "Embarked_nan": "0","Sex_female": "0","Sex_male": "1","Sex_nan": "0"}]

Multi:  
[  
{"Age": 17.0, "Embarked_C": "0","Embarked_Q": "0", "Embarked_S": "1", "Embarked_nan": "0","Sex_female": "0","Sex_male": "1","Sex_nan": "0"},  
{"Age": 77.0, "Embarked_C": "0","Embarked_Q": "0", "Embarked_S": "1", "Embarked_nan": "0","Sex_female": "0","Sex_male": "1","Sex_nan": "0"}  
]


Here's the CLI command:  
$ curl -X POST http://127.0.0.1:8080/predict -H "Content-Type: application/json" -d '[{"Age": 17.0, "Embarked_C": "0","Embarked_Q": "0", "Embarked_S": "1", "Embarked_nan": "0","Sex_female": "0","Sex_male": "1","Sex_nan": "0"}]'
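
The same request can also be built from Python. The sketch below uses only the standard library and mirrors the curl command; `build_request` and `send` are hypothetical helper names, and `send` assumes the Flask app above is running locally on port 8080.

```python
import json
import urllib.request

URL = 'http://127.0.0.1:8080/predict'  # same endpoint as the curl command


def build_request(records, url=URL):
    """Build a POST request carrying the records as a JSON body."""
    body = json.dumps(records).encode('utf-8')
    return urllib.request.Request(
        url,
        data=body,
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


def send(records, url=URL):
    """POST the records and return the decoded JSON response.

    Only works while the Flask app is running.
    """
    with urllib.request.urlopen(build_request(records, url)) as resp:
        return json.loads(resp.read().decode('utf-8'))


# Same body as the curl command above.
records = [{"Age": 17.0, "Embarked_C": "0", "Embarked_Q": "0",
            "Embarked_S": "1", "Embarked_nan": "0",
            "Sex_female": "0", "Sex_male": "1", "Sex_nan": "0"}]

req = build_request(records)
print(req.get_method(), req.full_url)
```

Calling `send(records)` would then return the parsed JSON response, which for the endpoint above is a dictionary with a `prediction` key.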

In practice, clients should be able to send the raw, untransformed features:  
[
    {"Age": 85, "Sex": "male", "Embarked": "S"},
    {"Age": 24, "Sex": "female", "Embarked": "C"},
    {"Age": 3, "Sex": "male", "Embarked": "C"},
    {"Age": 21, "Sex": "male", "Embarked": "S"}
]

In [None]:
#try to figure out how to pass input in real format
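
One possible way to accept the raw format is to encode each record against the persisted training columns before predicting. The sketch below is pure Python and makes several assumptions: `encode_record` and `is_missing` are hypothetical helpers, the column list is copied from the training output, and missing numeric values are filled with 0 to match the imputation used in training. In the Flask app itself, re-enabling the commented-out `reindex` line after `get_dummies` would presumably achieve the same alignment.

```python
import math

# Column order and categorical names copied from the training steps above.
model_columns = ['Age', 'Embarked_C', 'Embarked_Q', 'Embarked_S',
                 'Embarked_nan', 'Sex_female', 'Sex_male', 'Sex_nan']
CATEGORICALS = ['Sex', 'Embarked']


def is_missing(value):
    """True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def encode_record(record, columns, categoricals=CATEGORICALS):
    """Turn one raw record into a row in training column order."""
    row = dict.fromkeys(columns, 0)
    # Numeric features: copy the value; missing ones stay 0, as in training.
    for col in columns:
        if col in record and not is_missing(record[col]):
            row[col] = record[col]
    # Categorical features: set the matching dummy column to 1.
    for cat in categoricals:
        value = record.get(cat)
        dummy = '{}_{}'.format(cat, 'nan' if is_missing(value) else value)
        if dummy in row:  # categories unseen in training are ignored
            row[dummy] = 1
    return [row[col] for col in columns]


raw = [
    {'Age': 85, 'Sex': 'male', 'Embarked': 'S'},
    {'Age': 24, 'Sex': 'female', 'Embarked': 'C'},
]
encoded = [encode_record(r, model_columns) for r in raw]
print(encoded)
```

The resulting rows could be passed to `clf.predict` in place of the `get_dummies` output, since they always have the same width and order as the training matrix.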

# Summary
Complete:  
- fit a model
- pickle it
- publish as RESTful API using Flask