# Simple machine learning project

The goal of this project is to build a very simple machine learning model (linear regression) with no preprocessing, save it, and then deploy it as a simple API using Flask inside a Docker container.

So let's get started.

## Importing necessary libraries

In [41]:
import requests
import pickle

import pandas as pd
import numpy as np

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer

from sklearn.linear_model import LinearRegression

### Downloading the data


In [2]:
cal_data = fetch_california_housing()

In [3]:
X, Y = cal_data.data, cal_data.target

### Basic overview of the data

In [4]:
X

array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
          37.88      , -122.23      ],
       [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
          37.86      , -122.22      ],
       [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
          37.85      , -122.24      ],
       ...,
       [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
          39.43      , -121.22      ],
       [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
          39.43      , -121.32      ],
       [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
          39.37      , -121.24      ]])

In [6]:
X.shape

(20640, 8)

In [7]:
Y

array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894])

In [8]:
Y.shape

(20640,)

In [9]:
cal_data.feature_names

['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude']

In [10]:
print(cal_data.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

### Splitting the data to train and test sets

In [11]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=2020)

Checking the sizes of the train and test sets:


In [12]:
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

((14448, 8), (6192, 8), (14448,), (6192,))
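
These shapes follow from `test_size=0.3`: the test share is rounded up to a whole number of rows and the training set gets the remainder. A quick check of the arithmetic, using only the row count reported by `X.shape` earlier:

```python
import math

n_samples = 20640      # rows in X, from X.shape
test_size = 0.3        # fraction passed to train_test_split

# The test share is rounded up; the training set gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```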

### Defining the machine learning model

We are using a simple machine learning model, linear regression, and performing no preprocessing on the data, because the focus of this tutorial is learning how to deploy machine learning models, not building them or tuning hyperparameters to get the best possible result.

Note: For a real-world application, you will need data preprocessing, feature engineering and hyperparameter tuning to get the best result, or the best result within your speed constraints.

In [13]:
linear_model = LinearRegression()

In [14]:
linear_model.fit(X_train, Y_train)

LinearRegression()

### Evaluating our model

This is just to get an idea of how good or poor our model is; it is not the focus of the tutorial.

In [15]:
linear_model.score(X_train, Y_train)

0.61247824295048

In [16]:
linear_model.score(X_test, Y_test)

0.5916017337619133
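
For a regressor, `score` returns the coefficient of determination (R squared), so the test value above means the model explains roughly 59% of the variance in the test targets. A minimal sketch of that formula in plain Python, with made-up numbers; the function name `r_squared` is only illustrative:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy values, not taken from the housing data.
y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, y_true)          # every prediction exact
baseline = r_squared(y_true, [2.5] * 4)      # always predict the mean

print(perfect, baseline)
```

A perfect fit scores 1, and a model that always predicts the mean scores 0, which gives a scale for reading the two scores above.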

### Saving the model

In [100]:
with open("assets/test_logit.pkl", "wb") as file:
    pickle.dump(linear_model, file)
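
Before relying on the saved file, it can help to confirm that a pickled model survives the round trip. The sketch below uses a tiny stand-in dataset and in-memory bytes instead of the `assets/` file, so the names `X_demo`, `y_demo` and `restored` are assumptions made for illustration:

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in data (y = 2x + 1), not the housing data.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X_demo, y_demo)

# Serialize to bytes and restore; a file opened in "wb"/"rb" mode works the same way.
restored = pickle.loads(pickle.dumps(model))

# The restored model should give the same predictions as the original.
same = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.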

### Preprocessing our data

This preprocessing step converts our test data to dictionary format. Data is usually sent over the internet as JSON, which looks very similar to Python dictionaries, so it may be easier to handle the data as dictionaries from the start.

In [17]:
test_data = pd.DataFrame(X_test, columns=cal_data.feature_names)

In [18]:
dict_test_data = test_data.to_dict(orient="records")

In [19]:
dict_test_data[0].values()

dict_values([4.1902, 42.0, 4.818604651162791, 0.9255813953488372, 656.0, 3.0511627906976746, 34.07, -118.14])
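
To see why the dictionary form is convenient, here is a small sketch of the JSON round trip a record goes through when it is sent in a request body. The values are copied from the first test record; the variable names are illustrative:

```python
import json

# First test record, copied from the output above.
record = {
    "MedInc": 4.1902,
    "HouseAge": 42.0,
    "AveRooms": 4.818604651162791,
    "AveBedrms": 0.9255813953488372,
    "Population": 656.0,
    "AveOccup": 3.0511627906976746,
    "Latitude": 34.07,
    "Longitude": -118.14,
}

payload = json.dumps(record)      # text that travels in the request body
decoded = json.loads(payload)     # what the server gets back after parsing

print(decoded == record)
```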

### Vectorizer

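While a dictionary-like format may be easier to send over the web, our model still expects the data to be fed into it as arrays. Using a dictionary vectorizer, we can easily convert a dictionary to an array. Note that `DictVectorizer` orders the columns alphabetically by feature name, which differs from the column order the model was trained on, so predictions made directly on its output are not meaningful unless the columns are put back in training order.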

In [21]:
vect = DictVectorizer(sparse=False)

In [22]:
new_test = vect.fit_transform(dict_test_data)

In [23]:
new_test

array([[ 9.25581395e-01,  3.05116279e+00,  4.81860465e+00, ...,
        -1.18140000e+02,  4.19020000e+00,  6.56000000e+02],
       [ 9.80487805e-01,  2.36585366e+00,  6.12520325e+00, ...,
        -1.22290000e+02,  6.87870000e+00,  1.45500000e+03],
       [ 1.03284072e+00,  3.07881773e+00,  7.23973727e+00, ...,
        -1.17200000e+02,  5.54300000e+00,  1.87500000e+03],
       ...,
       [ 1.01204819e+00,  2.79116466e+00,  5.22088353e+00, ...,
        -1.18320000e+02,  5.16690000e+00,  6.95000000e+02],
       [ 1.02729885e+00,  1.97844828e+00,  3.49425287e+00, ...,
        -1.22260000e+02,  2.58980000e+00,  1.37700000e+03],
       [ 1.08881356e+00,  2.93084746e+00,  5.47254237e+00, ...,
        -1.22670000e+02,  3.95180000e+00,  4.32300000e+03]])

In [24]:
linear_model.predict(new_test)

array([ -288.49927765,  -632.57790721,  -825.81610109, ...,
        -304.16363334,  -587.80926542, -1875.16312662])

In [25]:
vect.transform(dict_test_data[0])

array([[   0.9255814 ,    3.05116279,    4.81860465,   42.        ,
          34.07      , -118.14      ,    4.1902    ,  656.        ]])
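
The transformed row above starts with `AveBedrms` rather than `MedInc`, while the model was fitted on columns in `cal_data.feature_names` order. That mismatch would explain the large negative predictions. One possible fix, sketched in plain Python, is to build each row explicitly in training order; `FEATURE_ORDER` and `record_to_row` are hypothetical names, not part of the original code:

```python
# Column order the model was trained on (cal_data.feature_names).
FEATURE_ORDER = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                 "Population", "AveOccup", "Latitude", "Longitude"]

def record_to_row(record):
    """Turn one feature dictionary into a list in training column order."""
    return [record[name] for name in FEATURE_ORDER]

# Keys deliberately listed alphabetically, the way DictVectorizer sorts them.
record = {"AveBedrms": 0.9255813953488372, "AveOccup": 3.0511627906976746,
          "AveRooms": 4.818604651162791, "HouseAge": 42.0,
          "Latitude": 34.07, "Longitude": -118.14,
          "MedInc": 4.1902, "Population": 656.0}

row = record_to_row(record)
print(row)
```

Passing `[row]` to `linear_model.predict` would then present the features in the same layout as `X_test[0]`.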

### Saving the vectorizer

In [100]:
with open("assets/vectorizer.bin", "wb") as file:
    pickle.dump(vect, file)

### Testing the raw test data 

Testing with the raw test data (as it is in X_test) will result in an error, but seeing is believing, as they say.

In [68]:
X_test[0]

array([   4.1902    ,   42.        ,    4.81860465,    0.9255814 ,
        656.        ,    3.05116279,   34.07      , -118.14      ])

### Sending requests to our model

We have created our Flask app (in deployment.py) and it is running as an API service, so we now want to send a request to it and get back a prediction.

In [69]:
url = "http://0.0.0.0:1200/predict"
result = requests.post(url, json=dict_test_data[0])

In [70]:

result.content

b'<!doctype html>\n<html lang=en>\n<title>500 Internal Server Error</title>\n<h1>Internal Server Error</h1>\n<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>\n'
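
`deployment.py` is not shown in this notebook, so the following is only a guess at its general shape, based on the payload (`{"data": [[...]]}`) and the response (`{"result": ...}`) used in the rest of this tutorial. The `create_app` factory and the `StubModel` class are hypothetical; the real app would load the pickled model instead:

```python
from flask import Flask, jsonify, request

def create_app(model):
    """Build a Flask app that serves predictions from `model`."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        body = request.get_json()
        rows = body["data"]                 # assumed shape: list of feature rows
        prediction = model.predict(rows)
        return jsonify({"result": float(prediction[0])})

    return app

class StubModel:
    """Stand-in with the same predict() interface as the pickled model."""
    def predict(self, rows):
        return [sum(row) for row in rows]

app = create_app(StubModel())

# Exercise the route without starting a server.
response = app.test_client().post("/predict", json={"data": [[1.0, 2.0, 3.0]]})
print(response.status_code, response.get_json())
```

With a handler like this, a payload that lacks the `data` key raises inside the route, which would be consistent with the 500 response above.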

### Converting the data to an internet-portable format

To send our data to the model, it has to be in a dictionary format that can easily be converted to JSON.

In [71]:
X_test[0]

array([   4.1902    ,   42.        ,    4.81860465,    0.9255814 ,
        656.        ,    3.05116279,   34.07      , -118.14      ])

The `list` function converts the data from a NumPy array to a regular Python list that `requests` can serialize.
We also wrap the result in an outer pair of brackets because the model expects 2D input, and `list` alone would give a 1D list.

In [85]:
try_data = {"data": [list(X_test[129])]}

In [86]:
try_data["data"]

[[11.706,
  52.0,
  8.97553516819572,
  1.0428134556574924,
  975.0,
  2.981651376146789,
  34.13,
  -118.12]]
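
`list(...)` keeps NumPy `float64` scalars inside the Python list. They serialize fine here, but `tolist()` is an alternative that converts everything to built-in floats. A short sketch, with the row values copied from the output above and the variable names chosen for illustration:

```python
import json

import numpy as np

# One feature row, copied from the try_data output above.
row = np.array([11.706, 52.0, 8.97553516819572, 1.0428134556574924,
                975.0, 2.981651376146789, 34.13, -118.12])

# tolist() yields plain Python floats; the outer list adds the batch dimension.
payload = {"data": [row.tolist()]}

body = json.dumps(payload)                    # serializes without errors
shape = np.asarray(payload["data"]).shape     # layout the model will receive
print(shape)
```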

### Sending request to our model again

In [87]:
%time

url = "http://0.0.0.0:1200/predict"
result = requests.post(url, json=try_data)

CPU times: user 6 µs, sys: 1e+03 ns, total: 7 µs
Wall time: 11.4 µs
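
The timings above are in microseconds because `%time` on a line by itself times an empty statement, not the request that follows it. Putting `%%time` on the first line of the cell would time the whole cell instead. Outside IPython, a similar measurement can be sketched with `time.perf_counter`; the helper name `timed` is illustrative, and `time.sleep` stands in for the HTTP call:

```python
import time

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# time.sleep stands in for requests.post(url, json=try_data).
_, elapsed = timed(time.sleep, 0.05)
print(f"Wall time: {elapsed * 1000:.1f} ms")
```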


### Response 200: the request succeeded

In [88]:
result

<Response [200]>

In [89]:
result.json()

{'result': 5.336929763835805}