# Demo of Web API Creation for a Machine Learning Model

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import requests, json

In [2]:
df = pd.read_csv("./dataset/salary_data.csv")

In [3]:
type(df)

pandas.core.frame.DataFrame

In [4]:
df.describe()

| | YearsExperience | Salary |
|---|---|---|
| count | 30.0 | 30.0 |
| mean | 5.313333 | 76003.0 |
| std | 2.837888 | 27414.429785 |
| min | 1.1 | 37731.0 |
| 25% | 3.2 | 56720.75 |
| 50% | 4.7 | 65237.0 |
| 75% | 7.7 | 100544.75 |
| max | 10.5 | 122391.0 |


In [5]:
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)

In [6]:
train_set.head()

| | YearsExperience | Salary |
|---|---|---|
| 28 | 10.3 | 122391.0 |
| 24 | 8.7 | 109431.0 |
| 12 | 4.0 | 56957.0 |
| 0 | 1.1 | 39343.0 |
| 4 | 2.2 | 39891.0 |


In [7]:
test_set.head()

| | YearsExperience | Salary |
|---|---|---|
| 27 | 9.6 | 112635.0 |
| 15 | 4.9 | 67938.0 |
| 23 | 8.2 | 113812.0 |
| 17 | 5.3 | 83088.0 |
| 8 | 3.2 | 64445.0 |


In [8]:
train_set.describe()

| | YearsExperience | Salary |
|---|---|---|
| count | 24.0 | 24.0 |
| mean | 5.1875 | 74207.625 |
| std | 2.943129 | 28240.733473 |
| min | 1.1 | 37731.0 |
| 25% | 2.975 | 55456.75 |
| 50% | 4.3 | 62164.5 |
| 75% | 7.3 | 99030.25 |
| max | 10.5 | 122391.0 |


In [9]:
test_set.describe()

| | YearsExperience | Salary |
|---|---|---|
| count | 6.0 | 6.0 |
| mean | 5.816667 | 83184.5 |
| std | 2.546697 | 24757.930695 |
| min | 3.2 | 57189.0 |
| 25% | 4.0 | 65318.25 |
| 50% | 5.1 | 75513.0 |
| 75% | 7.475 | 105248.25 |
| max | 9.6 | 113812.0 |


In [10]:
df_copy = train_set.copy()
test_set_full = test_set.copy()

In [11]:
test_set = test_set.drop(["Salary"], axis=1)

In [12]:
train_labels = df_copy["Salary"]
train_set_full = train_set.copy()
train_set = train_set.drop(["Salary"], axis=1)

In [13]:
lin_reg = LinearRegression()
lin_reg.fit(train_set, train_labels)
salary_pred = lin_reg.predict(test_set)
salary_pred

array([ 115790.21011287,   71498.27809463,  102596.86866063,
         75267.80422384,   55477.79204548,   60189.69970699])

### Model persistence

In [14]:
import pickle

with open("./pickled_files/python_lin_reg_model.pkl", "wb") as file_handler:
    pickle.dump(lin_reg, file_handler)
    
with open("./pickled_files/python_lin_reg_model.pkl", "rb") as file_handler:
    loaded_pickle = pickle.load(file_handler)
    
loaded_pickle

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [15]:
BASE_URL = "http://localhost:5000"
joblib.dump(lin_reg, "./pickled_files/linear_regression_model.pkl")
joblib.dump(train_set, "./pickled_files/training_data.pkl")
joblib.dump(train_labels, "./pickled_files/training_labels.pkl")

['./pickled_files/training_labels.pkl']

### Prediction API

We feed the relevant input to this API, in this case `YearsExperience` (sent as the JSON field `yearsOfExperience`), and pass it to the `predict` method of our model. The following snippet shows the `predict()` function from the script `linear_regression_API.py`, which can be found in the same repository. An explanation follows the snippet.


```python
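@app.route("/predict", methods=['POST'])
def predict():
    if request.method == 'POST':
        try:
            # Parse the JSON body and read the single feature the model expects.
            data = request.get_json()
            years_of_experience = float(data["yearsOfExperience"])

            lin_reg = joblib.load("./pickled_files/linear_regression_model.pkl")
        except (ValueError, TypeError, KeyError):
            # Raised when the field is missing or is not a number.
            return jsonify("Please enter a number.")

        # scikit-learn expects a 2D input of shape (n_samples, n_features).
        return jsonify(lin_reg.predict([[years_of_experience]]).tolist())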
```

We call the `@app.route` decorator with two pieces of information: the URI it should handle, `/predict`, and the HTTP methods it should accept, here only `POST`.
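The effect of the `methods` argument can be checked without starting a server. The following is a minimal sketch, assuming Flask is installed; the stub handler only echoes its input instead of loading the pickled model, and it relies on Flask's built-in test client.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/predict", methods=["POST"])
def predict():
    # Stub handler: echoes the parsed input instead of calling a real model.
    data = request.get_json()
    return jsonify({"received": float(data["yearsOfExperience"])})


# The test client exercises the route in-process, with no running server.
client = app.test_client()
post_response = client.post("/predict", json={"yearsOfExperience": 8})
get_response = client.get("/predict")

print(post_response.status_code)  # the POST request reaches predict()
print(get_response.status_code)   # the GET request is rejected by the router
```

Because the route lists only `POST`, Flask answers the `GET` request with status 405 (Method Not Allowed) before the handler is ever called.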

With the API running, execute the following snippet to get the predicted salary for a given `YearsExperience`. We use the `requests` package to make the API call: its `post()` function sends a POST request to the URL we pass in. We supply our data as a Python dictionary through the `json` parameter, and `post()` automatically serializes it and sends it to the API as JSON.

We then store the response in a variable and call its `json()` method to decode the response body from JSON. As a result, we get a predicted salary of about $100,712 for `YearsExperience=8`.

In [16]:
years_exp = {"yearsOfExperience": 8}
response = requests.post("{}/predict".format(BASE_URL), json = years_exp)
response.json()

[100712.10559602463]
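To see what the `json` parameter does to the outgoing request, the request can be built without being sent. This is a small sketch that needs no running server; it uses `requests.Request(...).prepare()` and assumes the same `BASE_URL` as above.

```python
import json

import requests

BASE_URL = "http://localhost:5000"  # same value as in the notebook
payload = {"yearsOfExperience": 8}

# Build the request without sending it, so the API does not have to be up.
prepared = requests.Request(
    "POST", "{}/predict".format(BASE_URL), json=payload
).prepare()

print(prepared.method, prepared.url)
print(prepared.headers["Content-Type"])  # set automatically by json=
print(json.loads(prepared.body))         # the dictionary, serialized as JSON
```

The `Content-Type` header is what lets `request.get_json()` on the Flask side parse the body.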

To see how close we are to the data in the training set, let's query the training data around the same value. The prediction is very close to the recorded salaries and only marginally off.

In [17]:
df_copy.query('YearsExperience > 7 and YearsExperience <= 8')

| | YearsExperience | Salary |
|---|---|---|
| 22 | 7.9 | 101302.0 |
| 21 | 7.1 | 98273.0 |
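As a rough check of "marginally off", the prediction for 8 years can be compared with the closest training row (7.9 years). The snippet below is only an illustration using the numbers printed above; the two values are for slightly different inputs, so the gap is not a formal error metric.

```python
predicted = 100712.10559602463  # API response for yearsOfExperience = 8
nearest_salary = 101302.0       # training row with YearsExperience = 7.9

abs_diff = abs(predicted - nearest_salary)
rel_diff = abs_diff / nearest_salary

print("absolute difference: {:.2f}".format(abs_diff))
print("relative difference: {:.2%}".format(rel_diff))
```

The difference is under one percent of the recorded salary.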


### Retrain API

In a production environment, as we keep on collecting more data, we'd like to improve our model for better accuracy. We'll have to retrain the model with all of the old and the new data. To be able to do this, we will make use of the saved training data and labels which were pickled earlier. 

#### Review of the `retrain` API
In this section, we walk through the implementation of the `retrain()` API. The implementation consists of the following steps; the code snippet follows the explanation.

1. Get the JSON data from the API request. Load the training data and training labels, stored in pickle files, into memory.
2. Using pandas, create a data frame from the request data, which is in JSON format.
3. From the new data frame, separate the training data and labels.
4. Use pandas to concatenate the new training data with the old training data. Do the same for the new and old training labels.
5. Call the `fit` method to create a new model, then save the model, the training data, and the labels. The snippet deletes the existing files first; note that `joblib.dump` overwrites an existing file, so the deletion is a precaution rather than a requirement.
6. Finally, load the saved regression model back into memory.
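Steps 2 to 5 can be tried in isolation, without Flask or pickle files. This is a minimal sketch under stated assumptions: the in-memory `training_set` and `training_labels` stand in for the pickled files and reuse the five rows shown by `train_set.head()`, the payload is parsed with `json.loads` rather than `pd.read_json`, and all variable names are illustrative.

```python
import json

import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-ins for the pickled training data and labels.
training_set = pd.DataFrame({"YearsExperience": [10.3, 8.7, 4.0, 1.1, 2.2]})
training_labels = pd.Series(
    [122391.0, 109431.0, 56957.0, 39343.0, 39891.0], name="Salary"
)

# Step 2: build a data frame from the JSON payload of the request.
payload = json.dumps([{"YearsExperience": 12, "Salary": 140000},
                      {"YearsExperience": 12.1, "Salary": 142000}])
new_df = pd.DataFrame(json.loads(payload))

# Step 3: separate features and labels.
new_features = new_df.drop(["Salary"], axis=1)
new_labels = new_df["Salary"]

# Step 4: append the new rows to the old ones.
all_features = pd.concat([training_set, new_features], ignore_index=True)
all_labels = pd.concat([training_labels, new_labels], ignore_index=True)

# Step 5: fit a fresh model on the combined data.
model = LinearRegression()
model.fit(all_features, all_labels)

print(all_features.shape, all_labels.shape)
```

Passing `ignore_index=True` gives the combined frame a clean index; the fit itself does not depend on it.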

Once we have added the following snippet to our API, we can call it. We do not have to restart the server manually: when Flask runs in debug mode, it reloads the server automatically as soon as it detects a change in the code.
```python
@app.route("/retrain", methods=['POST'])
def retrain():
    if request.method == 'POST':
        data = request.get_json()

        try:
            training_set = joblib.load("./pickled_files/training_data.pkl")
            training_labels = joblib.load("./pickled_files/training_labels.pkl")

            df = pd.read_json(data)

            df_training_set = df.drop(["Salary"], axis=1)
            df_training_labels = df["Salary"]

            df_training_set = pd.concat([training_set, df_training_set])
            df_training_labels = pd.concat([training_labels, df_training_labels])

            new_lin_reg = LinearRegression()
            new_lin_reg.fit(df_training_set, df_training_labels)

            os.remove("./pickled_files/linear_regression_model.pkl")
            os.remove("./pickled_files/training_data.pkl")
            os.remove("./pickled_files/training_labels.pkl")

            joblib.dump(new_lin_reg, "./pickled_files/linear_regression_model.pkl")
            joblib.dump(df_training_set, "./pickled_files/training_data.pkl")
            joblib.dump(df_training_labels, "./pickled_files/training_labels.pkl")

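            lin_reg = joblib.load("./pickled_files/linear_regression_model.pkl")
        except ValueError as e:
            return jsonify("Error when retraining - {}".format(e))

        # Report success to the caller once the new model has been saved.
        return jsonify("Retrained model successfully.")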
```

#### Create new data and call the `retrain` API

In [22]:
data = json.dumps([{"YearsExperience": 12,"Salary": 140000}, 
                   {"YearsExperience": 12.1,"Salary": 142000}])

data

'[{"Salary": 140000, "YearsExperience": 12}, {"Salary": 142000, "YearsExperience": 12.1}]'

In [23]:
response = requests.post("{}/retrain".format(BASE_URL), json = data)

response.json()

u'Retrained model successfully.'

#### Another prediction on the same input as before

In [20]:
response = requests.post("{}/predict".format(BASE_URL), json = years_exp)
response.json()

[101090.28347749889]

### Model Details API

Another useful endpoint for our API is one that returns details of the model, such as its coefficients, its intercept, and its current score.

#### Model Details API Definition
The implementation of this functionality involves the following steps. Code snippet follows the explanation. 

1. Use `GET` as a request method, as we are requesting some information from the endpoint instead of passing any to it. 
2. Load the training data and labels along with the model. 
3. Call the `score` method of the model with the training set and labels to get the score. The coefficients and intercept are attributes of the model. We call `tolist()` on the coefficients, as we did for the predictions earlier, because objects of type `ndarray` are not JSON serializable.
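The serialization point in step 3 can be demonstrated with the standard `json` module. This is a small sketch on made-up data (`y = 2x + 1`), not the salary model; it only shows why `tolist()` is needed.

```python
import json

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following y = 2x + 1, used only for illustration.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])
model = LinearRegression().fit(X, y)

# Converting to plain Python types makes the details JSON serializable.
details = {
    "score": float(model.score(X, y)),
    "coefficients": model.coef_.tolist(),
    "intercepts": float(model.intercept_),
}
encoded = json.dumps(details)
print(encoded)

# Passing the ndarray directly fails.
raised = False
try:
    json.dumps({"coefficients": model.coef_})
except TypeError as error:
    raised = True
    print("Not serializable:", error)
```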

```python
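# Route taken from the client call below; the function name is illustrative.
@app.route("/currentDetails", methods=['GET'])
def current_details():
    if request.method == 'GET':
        try:
            # Paths match the files written to ./pickled_files/ earlier.
            lr = joblib.load("./pickled_files/linear_regression_model.pkl")
            training_set = joblib.load("./pickled_files/training_data.pkl")
            labels = joblib.load("./pickled_files/training_labels.pkl")

            return jsonify({"score": lr.score(training_set, labels),
                            "coefficients": lr.coef_.tolist(), "intercepts": lr.intercept_})
        except (ValueError, TypeError) as e:
            return jsonify("Error when getting details - {}".format(e))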
```

#### Model Details API Call

In [24]:
response = requests.get("{}/currentDetails".format(BASE_URL))
response.json()

{u'coefficients': [9562.490457571217],
 u'intercepts': 24769.317784908955,
 u'score': 0.980380285350829}