# Deploying an NLTK + scikit-learn (spam detection) model on Verta

Within Verta, a "Model" can be any arbitrary function: a traditional ML model (e.g., sklearn, PyTorch, TF, etc.); a function (e.g., squaring a number, making a DB call, etc.); or a mixture of the above (e.g., pre-processing code, a DB call, and then a model application). See more [here](https://docs.verta.ai/verta/registry/concepts).

This notebook provides an example of how to deploy a spam detection model built using NLTK and scikit-learn on Verta as a Verta Standard Model by extending [VertaModelBase](https://verta.readthedocs.io/en/master/_autogen/verta.registry.VertaModelBase.html?highlight=VertaModelBase#verta.registry.VertaModelBase).

Updated for Verta version: 0.18.2

This example features:
- word similarity detection using [WordNet](https://github.com/nltk/wordnet) from **NLTK**
- [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) vectorization using **scikit-learn**
- **verta**'s Python client logging a `class` as a model to be instantiated at deployment time
- predictions against a deployed model

<a href="https://colab.research.google.com/github/VertaAI/examples/blob/main/deployment/nltk-sklearn/spam-detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 0. Imports

In [1]:
!python -m pip install wget

In [2]:
from __future__ import print_function

import json
import os
import pickle
import re
import time

import wget

import numpy as np
import pandas as pd

from sklearn import linear_model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, precision_recall_curve, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

### 0.1 Verta import and setup

In [3]:
# restart your notebook if prompted on Colab
try:
    import verta
except ImportError:
    !python -m pip install verta

In [4]:
# import os
# os.environ['VERTA_EMAIL'] = 
# os.environ['VERTA_DEV_KEY'] = 
# os.environ['VERTA_HOST'] = 

In [5]:
import os
from verta import Client
from verta.utils import ModelAPI

client = Client(os.environ['VERTA_HOST'])

---

## 1. Log model

### 1.1 Prepare data

In [6]:
train_data_url = "http://s3.amazonaws.com/verta-starter/spam.csv"
train_data_filename = wget.detect_filename(train_data_url)
if not os.path.isfile(train_data_filename):
    wget.download(train_data_url)

In [7]:
raw_data = pd.read_csv(train_data_filename, delimiter=',', encoding='latin-1')

raw_data.head()

In [8]:
# remove unnecessary columns and encode labels as integers
# (LabelEncoder sorts the classes, so ham -> 0 and spam -> 1)
raw_data.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'],axis=1,inplace=True)
raw_data.v1 = LabelEncoder().fit_transform(raw_data.v1)

raw_data.head()
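
`LabelEncoder` assigns integer codes to the class labels in sorted order, which is why `ham` ends up as 0 and `spam` as 1. The following pure-Python sketch mirrors that mapping for illustration only; the helper `encode_labels` is hypothetical and is not part of scikit-learn.

```python
def encode_labels(labels):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(labels))
    lookup = {label: code for code, label in enumerate(classes)}
    return classes, [lookup[label] for label in labels]

classes, codes = encode_labels(["ham", "spam", "ham", "ham", "spam"])
print(classes)  # ['ham', 'spam']
print(codes)    # [0, 1, 0, 0, 1]
```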

In [9]:
# lemmatize text
total_stopwords = set([word.replace("'",'') for word in stopwords.words('english')])
lemma = WordNetLemmatizer()

def preprocess_text(text):
    text = text.lower()
    text = text.replace("'",'')
    text = re.sub('[^a-zA-Z]',' ',text)
    words = text.split()
    words = [lemma.lemmatize(word) for word in words if (word not in total_stopwords) and (len(word)>1)] # Remove stop words
    text = " ".join(words)
    return text

raw_data.v2 = raw_data.v2.apply(preprocess_text)

raw_data.head()
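
To see what the cleanup steps do to a single message, here is a small sketch that runs without any NLTK downloads. It is a simplification: `SAMPLE_STOPWORDS` is a tiny made-up stand-in for NLTK's English stop-word list, the name `clean_text` is hypothetical, and lemmatization is skipped entirely.

```python
import re

# Hypothetical, tiny stop-word set standing in for NLTK's English list.
SAMPLE_STOPWORDS = {"you", "have", "a", "to", "the", "now"}

def clean_text(text, stop_words=SAMPLE_STOPWORDS):
    text = text.lower()
    text = text.replace("'", "")        # "don't" -> "dont"
    text = re.sub("[^a-z]", " ", text)  # digits and punctuation -> spaces
    # drop stop words and single-character tokens
    words = [w for w in text.split() if w not in stop_words and len(w) > 1]
    return " ".join(words)

cleaned = clean_text("You have WON a $1000 prize! Call 0800-123 now, don't wait")
print(cleaned)  # won prize call dont wait
```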

In [10]:
x_train, x_test, y_train, y_test = train_test_split(
    raw_data.v2,
    raw_data.v1,
    test_size=0.15,
    stratify=raw_data.v1,
)
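
Passing `stratify=raw_data.v1` keeps the ham/spam ratio roughly the same in the train and test sets, which matters when one class is much rarer than the other. The sketch below illustrates the idea in pure Python on made-up labels; `stratified_split` is a hypothetical helper, not scikit-learn's implementation.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed=0):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # same fraction from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [0] * 80 + [1] * 20  # made-up data: 80 ham, 20 spam
train_idx, test_idx = stratified_split(labels, test_size=0.15)
test_labels = [labels[i] for i in test_idx]
print(len(test_idx), test_labels.count(1))  # 15 3
```

Separately, passing `random_state=<int>` to `train_test_split` makes the split reproducible across notebook runs.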

### 1.2 Train model

In [11]:
proj = client.set_project("Spam Detection")
expt = client.set_experiment("tf–idf")
run = client.set_experiment_run()

In [12]:
vectorizer = TfidfVectorizer()
vectorizer.fit(x_train)

x_train_vec = vectorizer.transform(x_train).toarray()

model = linear_model.LogisticRegression()
model.fit(x_train_vec, y_train)
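
For intuition about what the vectorizer produces, here is a pure-Python sketch of tf-idf weighting, assuming scikit-learn's default smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each document vector. It tokenizes by whitespace only, so it will not match `TfidfVectorizer` exactly on real text; `tfidf_vectors` is a hypothetical name.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using smoothed idf and L2 normalisation."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({word for words in tokenized for word in words})
    n_docs = len(docs)
    # document frequency: number of documents containing each word
    df = Counter(word for words in tokenized for word in set(words))
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

    vectors = []
    for words in tokenized:
        counts = Counter(words)
        raw = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard empty documents
        vectors.append([v / norm for v in raw])
    return vocab, vectors

vocab, vectors = tfidf_vectors(["free prize call", "call later", "free free entry"])
print(vocab)
print([round(v, 3) for v in vectors[1]])
```

Words that appear in fewer documents receive a larger idf, so in the second document `later` is weighted more heavily than `call`.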

In [13]:
x_test_vec = vectorizer.transform(x_test).toarray()
y_pred = model.predict(x_test_vec)

m_confusion_test = confusion_matrix(y_test, y_pred)
display(pd.DataFrame(data=m_confusion_test,
                     columns=['Predicted 0', 'Predicted 1'],
                     index=['Actual 0', 'Actual 1']))

print("This model misclassifies {} genuine SMS as spam"
      " and misses only {} SPAM.".format(m_confusion_test[0,1], m_confusion_test[1,0]))

In [14]:
accuracy = accuracy_score(y_test, y_pred)

run.log_metric("accuracy", accuracy)

accuracy
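
Accuracy alone can hide how the model treats the spam class, so it can help to read precision and recall off the confusion matrix as well. The sketch below uses made-up counts arranged the way `confusion_matrix` returns them (rows are actual classes, columns are predicted classes); the numbers are illustrative and are not results from this notebook.

```python
# Made-up counts laid out like scikit-learn's confusion_matrix:
# rows are actual classes, columns are predicted classes (0 = ham, 1 = spam).
matrix = [
    [720, 4],   # actual ham:  true negatives, false positives
    [18, 94],   # actual spam: false negatives, true positives
]

tn, fp = matrix[0]
fn, tp = matrix[1]

precision = tp / (tp + fp)  # share of flagged messages that really are spam
recall = tp / (tp + fn)     # share of spam messages that were flagged
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```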

### 1.3 Log model artifacts

#### This example logs a `class` (instead of an object instance) as a model.
This allows for custom setup configuration in the class's `__init__()` method,  
and access to logged artifacts at deployment time.

In [15]:
# save and upload weights
model_param = {}
model_param['coef'] = model.coef_.reshape(-1).tolist()
model_param['intercept'] = model.intercept_.tolist()

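# write the weights and close the file so the contents are flushed before upload
with open("weights.json", "w") as f:
    json.dump(model_param, f)

with open("weights.json", "rb") as f:
    run.log_artifact("weights", f)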

In [16]:
# serialize and upload vectorizer
run.log_artifact("vectorizer", vectorizer)

### 1.4 Define model class

Our model—with its pre-trained weights and serialized vectorizer—will require some setup at deployment time.

To support this, the Verta platform allows a model to be defined as a `class` that will be instantiated when it's deployed.  
This class should provide the following interface:

- `__init__(self, artifacts)` where `artifacts` is a mapping of artifact keys to filepaths. Verta supplies this mapping (explained below) so you can open the artifact files and set up your model. Any other initialization steps belong in this method as well.
- `predict(self, data)` where `data`—like in other custom Verta models—is a list of input values for the model.
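
As a minimal sketch of this interface, consider a toy class with a single JSON artifact. Everything here is hypothetical (`ScaleModel`, the `"params"` key, the temporary file), and the `artifacts` dictionary is built by hand to mimic the key-to-filepath mapping.

```python
import json
import os
import tempfile

class ScaleModel:
    """Hypothetical toy model: multiplies each input by a stored factor."""

    def __init__(self, artifacts):
        # `artifacts` maps artifact keys to local file paths
        with open(artifacts["params"], "r") as f:
            self.factor = json.load(f)["factor"]

    def predict(self, data):
        # `data` is a list of input values; return one output per input
        return [value * self.factor for value in data]

# Build the mapping by hand to mimic what would be supplied at deployment time.
params_path = os.path.join(tempfile.mkdtemp(), "params.json")
with open(params_path, "w") as f:
    json.dump({"factor": 3}, f)

toy_model = ScaleModel(artifacts={"params": params_path})
print(toy_model.predict([1, 2, 5]))  # [3, 6, 15]
```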

In [17]:
class SpamModel():    
    def __init__(self, artifacts):
        from nltk.corpus import stopwords  # needs to be re-imported to remove local file link
        
        # get artifact filepaths from `artifacts` mapping
        weights_filepath = artifacts['weights']
        vectorizer_filepath = artifacts['vectorizer']

        # load artifacts
        with open(weights_filepath, "r") as f:
            self.weights = json.load(f)
        with open(vectorizer_filepath, "rb") as f:
            self.vectorizer = pickle.load(f)
        
        # reconstitute logistic regression
        self.coef_ = np.array(self.weights["coef"])
        self.intercept_ = self.weights["intercept"]
        
        # configure text preprocessing
        self.total_stopwords = set([word.replace("'",'') for word in stopwords.words('english')])
        self.lemma = WordNetLemmatizer()

    def preprocess_text(self, text):
        text = text.lower()
        text = text.replace("'",'')
        text = re.sub('[^a-zA-Z]',' ',text)
        words = text.split()
        words = [self.lemma.lemmatize(word) for word in words if (word not in self.total_stopwords) and (len(word)>1)] # Remove stop words
        text = " ".join(words)
        return text     
        
    def predict(self, data):
        predictions = []
        for inp in data:
            # preprocess input the same way the training data was preprocessed
            processed_text = self.preprocess_text(inp)
            inp_vec = self.vectorizer.transform([processed_text]).toarray()
            
            # compute the raw linear score (log-odds), not a probability
            prediction = (np.dot(inp_vec.reshape(-1), self.coef_.reshape(-1)) + self.intercept_)[0]
            predictions.append(prediction)
            
        return predictions
    
    def example(self):
        return ["FREE FREE FREE"]
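
Note that `predict` returns the raw linear score `w·x + b` rather than a label or a probability. For a binary logistic regression, applying the logistic sigmoid to that score gives the estimated probability of class 1 (spam, since `LabelEncoder` orders `ham` before `spam`). The sketch below shows one way a caller might interpret the scores; the helper names and the 0.5 threshold are illustrative assumptions.

```python
import math

def score_to_probability(score):
    """Logistic sigmoid, written to avoid overflow for large negative scores."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)

def label_from_score(score, threshold=0.5):
    # threshold is an illustrative choice, not something set by the notebook
    return "spam" if score_to_probability(score) >= threshold else "ham"

for score in (-2.0, 0.0, 3.5):  # made-up scores
    print(score, round(score_to_probability(score), 3), label_from_score(score))
```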

Earlier we logged artifacts with the keys `"weights"` and `"vectorizer"`.  
You can obtain the `artifacts` mapping mentioned above by calling `run.fetch_artifacts(keys)`, which lets you instantiate and test the model locally.  
A similar mapping—that works identically—will be passed into `__init__()` when the model is deployed.

In [18]:
artifacts = run.fetch_artifacts(["weights", "vectorizer"])

spam_model = SpamModel(artifacts=artifacts)

In [19]:
data = spam_model.example()
prediction = spam_model.predict(data)

print(data, prediction)

## 2. Log model for deployment

In [20]:
run.log_model(
    model=SpamModel,
    model_api=ModelAPI(data, prediction),
    artifacts=['weights', 'vectorizer'],
)

We also have to make sure we provide every package involved in the model.

In [21]:
from verta.environment import Python

run.log_environment(Python([
    "nltk",
    "numpy",
    "scikit-learn",  # PyPI distribution name for the `sklearn` import
]))
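
Because the vectorizer is unpickled at deployment time, the deployed package versions need to be compatible with the ones used for training. As a quick local check, this sketch looks up installed versions with the standard library's `importlib.metadata` (Python 3.8+). The helper `installed_versions` is hypothetical, and the names passed in are PyPI distribution names.

```python
from importlib import metadata

def installed_versions(packages):
    """Return {package: version string, or None if not installed locally}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

print(installed_versions(["nltk", "numpy", "scikit-learn"]))
```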

And we need to ensure that the appropriate NLTK packages are available during deployment.

In [22]:
run.log_setup_script("""
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
""")

## 3. Deploy model to endpoint

In [23]:
endpoint = client.get_or_create_endpoint("spam-detection")
endpoint.update(run, wait=True)

In [24]:
endpoint.get_deployed_model().predict(spam_model.example())

---