# 0. Prerequisites

We will use boto3 later in this script to upload our model to an object store. boto3 has some issues with urllib3 and raises an error if we install it after importing other libraries, so we install it first in this script.


In [1]:
!pip install boto3
# !pip install pandas tensorflow

Collecting boto3
  Downloading boto3-1.20.12-py3-none-any.whl (131 kB)
Collecting botocore<1.24.0,>=1.23.12
  Downloading botocore-1.23.12-py3-none-any.whl (8.2 MB)
Collecting s3transfer<0.6.0,>=0.5.0
  Downloading s3transfer-0.5.0-py3-none-any.whl (79 kB)
Collecting jmespath<1.0.0,>=0.7.1
  Downloading jmespath-0.10.0-py2.py3-none-any.whl (24 kB)
Collecting urllib3<1.27,>=1.25.4
  Downloading urllib3-1.26.7-py2.py3-none-any.whl (138 kB)
Installing collected packages: urllib3, jmespath, botocore, s3transfer, boto3
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.24.3
    Uninstalling urllib3-1.24.3:
      Successfully uninstalled urllib3-1.24.3
ERROR: pip's dependency resolver does not currently take 

# 1. Prepare Training Data

Creating a dataset rarely happens next to where you run the training. In this first step we use the Data Preparation notebook that we created earlier to extract the data required for training.

We will load the pickled data produced by the Data Preparation notebook. Although the code uses pickle to load the data, that data is exported via pickle when we execute the `%run` block below. Pickle can be unsafe when used on data downloaded from third parties, but here we generate the pickle file ourselves (again using `%run`), so it is safe to use: it is never downloaded.

Please note: we need to re-run the data preparation script because we are running the modeling and training script in Google Colab to use its GPUs. If we were running the scripts locally, we would skip the first three blocks of code and start by unpickling the training data saved by the data preparation script.



In [2]:
import pathlib
import pickle

In [3]:
# path where the data preparation notebook is stored on github
DATA_PREPARATION_NOTEBOOK_LINK = "https://raw.githubusercontent.com/abdullahaleem/spam-detection-microservice/master/app/notebooks/1.%20Spam%20Detection%20-%20Data%20Preparation.ipynb?token=AKTTIMEIAHWZJIIC43VMH4DBU7AT2"

# path where the notebook we created for data preprocessing would be downloaded
NOTEBOOKS_DIR = pathlib.Path("/notebooks/")
NOTEBOOKS_DIR.mkdir(exist_ok=True, parents=True)
DATA_PREPARATION_NOTEBOOK = NOTEBOOKS_DIR / "Data Preparation.ipynb"

In [4]:
!curl $DATA_PREPARATION_NOTEBOOK_LINK  -o "$DATA_PREPARATION_NOTEBOOK"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 20648  100 20648    0     0  51620      0 --:--:-- --:--:-- --:--:-- 51620


In [5]:
%run "$DATA_PREPARATION_NOTEBOOK"



  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  198k  100  198k    0     0   152k      0  0:00:01  0:00:01 --:--:--  152k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  159k  100  159k    0     0   173k      0 --:--:-- --:--:-- --:--:--  172k
Archive:  /data/zips/sms-spam-dataset.zip
  inflating: /data/spam-classifier/sms_spam/SMSSpamCollection  
  inflating: /data/spam-classifier/sms_spam/readme  
Archive:  /data/zips/youtube-spam-dataset.zip
  inflating: /data/spam-classifier/youtube_spam/Youtube01-Psy.csv  
   creating: /data/spam-classifier/youtube_spam/__MACOSX/
  inflating: /data/spam-classifier/youtube_spam/__MACOSX/._Youtube01-Psy.csv  
  inflating: /data/spam-classifier/youtube_spam/Youtube02-KatyPerry.csv  
  inflating: /data/spam-classifier/youtube

**If you are running the scripts locally and have already run the Data Preparation script on this system, start from here.**

In [6]:
# path where the training data is pickled by our data preparation notebook
EXPORT_DIR = pathlib.Path('/data/exports/')
TRAINING_DATA_PATH = EXPORT_DIR / 'spam-training-data.pkl'

In [7]:
with open(TRAINING_DATA_PATH, 'rb') as f:
    data = pickle.load(f)

In [8]:
x_train, y_train = data['x_train'], data['y_train']
x_test, y_test  = data['x_test'], data['y_test']

label_legend = data['label_legend']
label_legend_inverted = data['label_legend_inverted']

max_sequence_length = data['max_sequence_length']
max_words = data['max_words']
tokenizer = data['tokenizer']
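
The unpacking above assumes every key is present in the pickle. Here is a minimal sketch of a loader that checks this up front. `load_training_data` and `REQUIRED_KEYS` are hypothetical names that are not part of the original notebook, and the usual pickle caveat applies: only load files you generated yourself.

```python
import pathlib
import pickle

# Keys the training notebook reads from the pickle (taken from the cell above).
REQUIRED_KEYS = (
    "x_train", "y_train", "x_test", "y_test",
    "label_legend", "label_legend_inverted",
    "max_sequence_length", "max_words", "tokenizer",
)


def load_training_data(path):
    """Load the pickled export and fail early if any expected key is missing."""
    path = pathlib.Path(path)
    if not path.exists():
        raise FileNotFoundError(f"{path} not found; run the Data Preparation notebook first")
    with open(path, "rb") as f:
        data = pickle.load(f)  # only safe because we generated this file ourselves
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise KeyError(f"training data is missing keys: {missing}")
    return data
```

With this in place, `data = load_training_data(TRAINING_DATA_PATH)` could stand in for the `open` / `pickle.load` cell, and a missing export would produce a clear message instead of a later `KeyError`.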

# 2. Create and Train our LSTM Model

In [9]:
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Embedding, LSTM, SpatialDropout1D
from tensorflow.keras.models import Model, Sequential

In [10]:
embed_dim = 128
lstm_out = 196
batch_size = 32
epochs = 5

In [11]:
model = Sequential()
model.add(Embedding(max_words, embed_dim, input_length=max_sequence_length))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.3, recurrent_dropout=0.3))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 300, 128)          35840     
                                                                 
 spatial_dropout1d (SpatialD  (None, 300, 128)         0         
 ropout1D)                                                       
                                                                 
 lstm (LSTM)                 (None, 196)               254800    
                                                                 
 dense (Dense)               (None, 2)                 394       
                                                                 
Total params: 291,034
Trainable params: 291,034
Non-trainable params: 0
_________________________________________________________________
None
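
The parameter counts in the summary can be reproduced by hand. The sketch below assumes `max_words` is 280, which is inferred from the embedding's 35,840 parameters divided by `embed_dim`, and uses the standard parameter formulas for these layer types.

```python
embed_dim = 128
lstm_out = 196
max_words = 280  # assumption: inferred from 35840 / 128 in the summary

# Embedding: one embed_dim-sized vector per vocabulary entry.
embedding_params = max_words * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (lstm_out * (embed_dim + lstm_out) + lstm_out)

# Dense: weights from every LSTM output to each of the 2 classes, plus biases.
dense_params = lstm_out * 2 + 2

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# prints: 35840 254800 394 291034
```

Most of the model's capacity sits in the LSTM layer, so `lstm_out` is the setting that changes the model size the most.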


In [12]:
model.fit(x_train, y_train, validation_data=(x_test, y_test), batch_size=batch_size, verbose=1, epochs=epochs)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fb430156dd0>

# 3. Inferencing Trained Model

Once our model is trained, we can test it on any custom string. We will also write a basic structure for inference here, which we will build upon in the future.

In [13]:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [14]:
def predict(text_str, max_words=280, max_sequence=280, tokenizer=None):
  # Without a tokenizer we cannot turn text into model input
  if tokenizer is None:
    return None
  sequences = tokenizer.texts_to_sequences([text_str])
  x_input = pad_sequences(sequences, maxlen=max_sequence)
  y_output = model.predict(x_input)
  # y_output has shape (1, number of labels); take the scores for our single input
  preds = y_output[0]
  labeled_preds = [{f"{label_legend_inverted[i]}": x} for i, x in enumerate(preds)]
  return labeled_preds

In [15]:
predict("Hello world", max_words=max_words, max_sequence=max_sequence_length, tokenizer=tokenizer)

[{'ham': 0.96547115}, {'spam': 0.03452892}]
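
`predict` returns one dictionary per label. If a caller only needs the most likely label, a small helper can reduce that list. `top_prediction` is a hypothetical name, and the sample scores below are the ones from the output above.

```python
def top_prediction(labeled_preds):
    """Return (label, score) for the highest-scoring entry of predict()'s output."""
    # Flatten [{'ham': 0.96}, {'spam': 0.03}] into (label, score) pairs.
    pairs = [
        (label, float(score))
        for item in labeled_preds
        for label, score in item.items()
    ]
    return max(pairs, key=lambda pair: pair[1])


sample = [{"ham": 0.96547115}, {"spam": 0.03452892}]
label, score = top_prediction(sample)
print(label)  # ham
```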

# 4. Exporting Model, Tokenizer & Metadata Locally
 
We export the tokenizer as JSON in this step. Later, we can load `tokenizer_as_json` back with `tensorflow.keras.preprocessing.text.tokenizer_from_json`.

In [27]:
MODEL_EXPORT_PATH = EXPORT_DIR / 'spam-detection-model.h5'
model.save(str(MODEL_EXPORT_PATH))

In [28]:
import json
metadata = {
    "label_legend_inverted": label_legend_inverted,
    "label_legend": label_legend,
    "max_sequence_length": max_sequence_length,
    "max_words": max_words,
}

METADATA_EXPORT_PATH = EXPORT_DIR / 'spam-detection-metadata.json'
METADATA_EXPORT_PATH.write_text(json.dumps(metadata, indent=4))

199
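
One detail to keep in mind when reading this metadata back: JSON object keys are always strings. Assuming `label_legend_inverted` is keyed by integers (as the `label_legend_inverted[i]` lookup in `predict` suggests), the keys come back as `"0"` and `"1"` and need converting. The sample values below are illustrative stand-ins, not the notebook's actual data.

```python
import json

# Stand-in metadata; the integer-keyed legend is an assumption.
metadata = {
    "label_legend_inverted": {0: "ham", 1: "spam"},
    "label_legend": {"ham": 0, "spam": 1},
    "max_sequence_length": 300,
    "max_words": 280,
}

restored = json.loads(json.dumps(metadata, indent=4))

# json.dumps turned the integer keys into strings.
print(list(restored["label_legend_inverted"].keys()))  # ['0', '1']

# Convert them back so that label_legend_inverted[i] works with an integer i.
restored["label_legend_inverted"] = {
    int(key): value for key, value in restored["label_legend_inverted"].items()
}
print(restored["label_legend_inverted"][1])  # spam
```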

In [29]:
tokenizer_as_json = tokenizer.to_json()

TOKENIZER_EXPORT_PATH = EXPORT_DIR / 'spam-detection-tokenizer.json'
TOKENIZER_EXPORT_PATH.write_text(tokenizer_as_json)

1090335

# 5. Upload Model, Tokenizer, & Metadata to Object Storage

Notebooks on Colab are ephemeral and only keep stored files while our session is active. We also need a place from which our deployed inference scripts can access the model. Hence, we will upload our model to an object store. Here, we use AWS S3 to store it.

Object Storage options include:

- AWS S3
- Linode Object Storage
- DigitalOcean Spaces


All three of these options can use `boto3`.

In [30]:
import os
import boto3

#### AWS S3 Object Storage Config

In [32]:
# AWS S3 Config
# Never commit real credentials to a notebook; use your own values here
ACCESS_KEY = "<your_aws_access_key_id>"
SECRET_KEY = "<your_aws_secret_access_key>"

# You should not have to set this
ENDPOINT = None

# Your s3-bucket region
REGION = 'us-east-1'

BUCKET_NAME = 'spam-detection-object-store'

#### Linode Object Storage Config

In [None]:
ACCESS_KEY = "<your_linode_object_storage_access_key>"
SECRET_KEY = "<your_linode_object_storage_secret_key>"

# Object Storage Endpoint URL
ENDPOINT = "https://cfe3.us-east-1.linodeobjects.com"

# Object Storage Endpoint Region (also in your endpoint url)
REGION = 'us-east-1'

# Set this to a valid slug (without a "/" )
BUCKET_NAME = 'datasets'

#### DigitalOcean Spaces Config

In [None]:
ACCESS_KEY = "<your_do_spaces_access_key>"
SECRET_KEY = "<your_do_spaces_secret_key>"

# Space Endpoint URL
ENDPOINT = "https://ai-cfe-1.nyc3.digitaloceanspaces.com"

# Space Region (also in your endpoint url)
REGION = 'nyc3'

# Set this to a valid slug (without a "/" )
BUCKET_NAME = 'datasets'
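
Whichever provider you choose, avoid committing real keys to a notebook. Here is a minimal sketch that reads the same settings from environment variables instead. `load_storage_config` is a hypothetical helper, and the variable names other than `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` (which boto3 reads on its own) are made-up choices for this example.

```python
import os


def load_storage_config(env=None):
    """Build the object storage settings from environment variables."""
    env = os.environ if env is None else env
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {missing}")
    return {
        "access_key": env["AWS_ACCESS_KEY_ID"],
        "secret_key": env["AWS_SECRET_ACCESS_KEY"],
        # Hypothetical variable names; leave the endpoint unset for AWS S3.
        "endpoint": env.get("OBJECT_STORAGE_ENDPOINT") or None,
        "region": env.get("OBJECT_STORAGE_REGION", "us-east-1"),
        "bucket": env.get("OBJECT_STORAGE_BUCKET", "datasets"),
    }
```

On Colab, one option is to fill `os.environ` from an interactive prompt such as `getpass.getpass()` so the values never appear in a saved cell.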

## Perform Upload with Boto3

In [33]:
os.environ["AWS_ACCESS_KEY_ID"] = ACCESS_KEY
os.environ["AWS_SECRET_ACCESS_KEY"] = SECRET_KEY

In [34]:
# Upload paths 
MODEL_KEY_NAME = f"exports/spam-detection/{MODEL_EXPORT_PATH.name}"
TOKENIZER_KEY_NAME = f"exports/spam-detection/{TOKENIZER_EXPORT_PATH.name}"
METADATA_KEY_NAME = f"exports/spam-detection/{METADATA_EXPORT_PATH.name}"

In [35]:
session = boto3.session.Session()
client = session.client('s3', region_name=REGION, endpoint_url=ENDPOINT)
client.upload_file(str(MODEL_EXPORT_PATH), BUCKET_NAME,  MODEL_KEY_NAME) 
client.upload_file(str(TOKENIZER_EXPORT_PATH), BUCKET_NAME,  TOKENIZER_KEY_NAME) 
client.upload_file(str(METADATA_EXPORT_PATH), BUCKET_NAME,  METADATA_KEY_NAME)  
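
The three `upload_file` calls follow one pattern, so they can be driven from a list. This is a sketch that works with any object exposing boto3's `upload_file(Filename, Bucket, Key)` method. `upload_exports` and its `prefix` default are hypothetical, and a small stand-in client is used so the example runs without credentials.

```python
import pathlib


def upload_exports(client, bucket, paths, prefix="exports/spam-detection"):
    """Upload each local file to <prefix>/<file name> and return the keys used."""
    keys = []
    for path in map(pathlib.Path, paths):
        key = f"{prefix}/{path.name}"
        client.upload_file(str(path), bucket, key)
        keys.append(key)
    return keys


class RecordingClient:
    """Stand-in for a boto3 S3 client that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def upload_file(self, filename, bucket, key):
        self.calls.append((filename, bucket, key))


fake = RecordingClient()
keys = upload_exports(fake, "datasets", ["/data/exports/spam-detection-model.h5"])
print(keys)  # ['exports/spam-detection/spam-detection-model.h5']
```

In the notebook, the real `client` from the cell above would take the place of `RecordingClient`.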

In [26]:
client.download_file(BUCKET_NAME, MODEL_KEY_NAME, pathlib.Path(MODEL_KEY_NAME).name)
client.download_file(BUCKET_NAME, TOKENIZER_KEY_NAME, pathlib.Path(TOKENIZER_KEY_NAME).name)
client.download_file(BUCKET_NAME, METADATA_KEY_NAME, pathlib.Path(METADATA_KEY_NAME).name)

# 6. Model Download Pipeline
In [this blog post](https://www.codingforentrepreneurs.com/blog/ai-model-download-pipeline) I'll show you how to turn the `client.download_file()` portion into a pipeline so you can reuse it in future projects. Further, if you ever need to bundle these models into a Docker image, you will be able to use the pipeline.

It is not recommended to upload and manage the model on GitHub, as it can get very big. An object store is a very good option for storing these models. One drawback is that we don't get versioning and history, which we would like to have.

We will download three files pretty much every time we deploy this code. We will use pypyr for the automation pipeline (similar to GitHub Actions for CI/CD pipelines). We will create a pipeline and a script to download these files in a repeatable manner.
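
To make the repeatable part concrete, here is a rough sketch of the download step as a plain Python function. It is not pypyr configuration and not the pipeline from the code base. `download_artifacts` is a hypothetical name, and it relies only on boto3's `download_file(Bucket, Key, Filename)` method.

```python
import pathlib


def download_artifacts(client, bucket, keys, dest_dir):
    """Download each object key into dest_dir, skipping files already present."""
    dest_dir = pathlib.Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    local_paths = []
    for key in keys:
        # Object keys use "/" separators regardless of the local platform.
        local_path = dest_dir / pathlib.PurePosixPath(key).name
        if not local_path.exists():
            client.download_file(bucket, key, str(local_path))
        local_paths.append(local_path)
    return local_paths
```

Skipping files that already exist keeps repeated runs cheap, at the cost of not noticing when the object in the bucket has changed. A deployment that needs fresh models every time would drop that check.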

The pipeline is created in the code base.

We will create a `.env` file, which is a very common way to set environment variables when working locally. We will never put these files in git. We can have the same variables in the production environment as well.
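
A `.env` file is just a list of `KEY=value` lines. Libraries such as python-dotenv handle this in practice. The stdlib-only sketch below just shows the idea, with `parse_env` as a hypothetical helper and placeholder sample values.

```python
def parse_env(text):
    """Parse KEY=value lines, ignoring blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


sample = """
# object storage credentials (placeholders)
AWS_ACCESS_KEY_ID=<your_access_key>
AWS_SECRET_ACCESS_KEY="<your_secret_key>"
"""
print(parse_env(sample)["AWS_SECRET_ACCESS_KEY"])  # <your_secret_key>
```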

When we go into production, we will have to run `python -m pypyr pipelines/ml-model-download` as one of the deployment steps.

