# NLP training example
In this example, we'll train an NLP model for sentiment analysis of tweets using spaCy.

First, we download spaCy's small English model, en_core_web_sm.

In [1]:
!python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
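
Because `spacy download` installs the model as a regular Python package, one way to avoid re-downloading it on every run is to check whether that package is importable first. The helper below is a minimal sketch; `model_installed` is a hypothetical name, not part of spaCy.

```python
import importlib.util


def model_installed(package_name):
    """Return True if a top-level package with this name can be imported."""
    # find_spec returns None when no such package is on the import path.
    return importlib.util.find_spec(package_name) is not None


# Hypothetical guard: only prompt for a download when the model is absent.
if not model_installed("en_core_web_sm"):
    print("Model missing; run: python -m spacy download en_core_web_sm")
```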


Next, we import the libraries used throughout the notebook.

In [2]:
from __future__ import unicode_literals, print_function

import boto3
import json
import numpy as np
import pandas as pd
import spacy

## Data prep

We download the tweet dataset from S3.

In [3]:
S3_BUCKET = "verta-strata"
S3_KEY = "english-tweets.csv"
FILENAME = S3_KEY

boto3.client('s3').download_file(S3_BUCKET, S3_KEY, FILENAME)
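
When re-running the notebook, it can be convenient to skip the download if the file is already on disk. The following is a sketch under that assumption; `download_if_missing` is a hypothetical helper, and it only relies on the client exposing `download_file(bucket, key, filename)` as the boto3 S3 client does.

```python
import os


def download_if_missing(s3_client, bucket, key, filename):
    """Download s3://bucket/key to filename unless it already exists locally.

    Returns True if a download was performed, False if the local file was reused.
    """
    if os.path.exists(filename):
        return False
    s3_client.download_file(bucket, key, filename)
    return True


# Usage in this notebook (requires AWS credentials and network access):
# download_if_missing(boto3.client('s3'), S3_BUCKET, S3_KEY, FILENAME)
```

Passing the client in as an argument keeps the helper easy to exercise with a stand-in object when no AWS access is available.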

Load and shuffle the data, then clean it using our local utils module.

In [4]:
import utils

data = pd.read_csv(FILENAME).sample(frac=1).reset_index(drop=True)
utils.clean_data(data)

data.head()

Unnamed: 0,text,sentiment
0,plus countless wine/spirit/liqueur tastings an...,1
1,can we ask questions to thru you ? (vangees...,1
2,within a day i have gone from hating PS3 to lo...,1
3,watching last episode of *One Liter of Tears* ...,0
4,i need more time for my tests!!!!!! (poor madd...,0
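
The contents of `utils.clean_data` are not shown in this notebook. As an illustration only, a tweet-cleaning step might strip URLs and @-mentions and normalize whitespace. The sketch below is a guess at that kind of pre-processing, not the actual implementation; `clean_text` and the regular expressions are hypothetical.

```python
import re

# Hypothetical patterns; the real utils module may use different rules.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Remove URLs and @-mentions, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()
```

With pandas, such a function could be applied to the text column with `data['text'] = data['text'].map(clean_text)`.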


## Set up ModelDB
ModelDB organizes our work and lets us log and version metadata.

In [5]:
from verta import Client

client = Client('https://dev.verta.ai')
client.set_project('Tweet Classification')
client.set_experiment('SpaCy')
run = client.set_experiment_run()

set email from environment
set developer key from environment
connection successfully established
set existing Project: Tweet Classification from personal workspace
set existing Experiment: SpaCy
created new ExperimentRun: Run 70151584411368372998
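
The output above shows the client picking up an email and developer key from the environment. A quick pre-flight check can make a missing credential easier to diagnose. This sketch assumes the variables are named `VERTA_EMAIL` and `VERTA_DEV_KEY`; confirm the names against the Verta client documentation for your version. `missing_credentials` is a hypothetical helper.

```python
import os

# Assumed variable names; verify against the Verta client documentation.
REQUIRED_VARS = ("VERTA_EMAIL", "VERTA_DEV_KEY")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example: report problems before constructing the Client.
# missing = missing_credentials()
# if missing:
#     raise RuntimeError("Set these environment variables: " + ", ".join(missing))
```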


We'll first record our code, configuration, dataset, and environment versions to a ModelDB repository.

In [6]:
repo = client.set_repository('Verta Strata')
commit = repo.get_commit(branch='master').new_branch("utils")

set existing Repository: Verta Strata from personal workspace


In [7]:
from verta.code import Notebook
from verta.configuration import Hyperparameters
from verta.dataset import S3
from verta.environment import Python

code_ver = Notebook()
config_ver = Hyperparameters({'n_iter': 20})
dataset_ver = S3("s3://{}/{}".format(S3_BUCKET, S3_KEY))
env_ver = Python()

commit.update("notebooks/tweet-analysis", code_ver)
commit.update("config/hyperparams", config_ver)
commit.update("data/tweets", dataset_ver)
commit.update("env/python", env_ver)
commit.save("Improved pre-processing utils")

commit

(Branch: utils)
Commit 601704439f202d2204f0806946ea5aec1cf0c19c0900d395c6462915f2520f1b containing:
config/hyperparams (Blob)
data/tweets (Blob)
env/python (Blob)
notebooks/tweet-analysis (Blob)

## Train the model
We'll use a pre-trained model from spaCy and fine-tune it on our new dataset.

In [8]:
nlp = spacy.load('en_core_web_sm')

We update the model on the tweet data using our local training module, running 20 iterations.

In [9]:
import training

training.train(nlp, data, n_iter=20)

Using 16000 examples (12800 training, 3200 evaluation)
Training the model...
LOSS 	  P  	  R  	  F  
15.909	0.746	0.717	0.731
0.352	0.756	0.723	0.739
0.103	0.762	0.726	0.743
0.091	0.760	0.711	0.734
0.081	0.761	0.716	0.738
0.068	0.758	0.724	0.740
0.059	0.754	0.727	0.740
0.050	0.753	0.722	0.737
0.042	0.749	0.713	0.731
0.037	0.746	0.711	0.728
0.032	0.747	0.713	0.730
0.029	0.747	0.719	0.732
0.025	0.743	0.714	0.728
0.024	0.738	0.701	0.719
0.021	0.740	0.698	0.718
0.020	0.736	0.699	0.717
0.018	0.734	0.704	0.719
0.019	0.736	0.704	0.720
0.017	0.733	0.704	0.718
0.016	0.741	0.709	0.725
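
The log reports an 80/20 split (12800 of 16000 examples for training) and, per iteration, the loss plus precision (P), recall (R), and F-score (F) on the evaluation set. The internals of `training.train` are not shown, so the following is only a sketch of how such a split and such metrics are commonly computed for binary labels; `split_data` and `precision_recall_f` are hypothetical names.

```python
def split_data(examples, split=0.8):
    """Split a sequence into (train, evaluation) parts at the given fraction."""
    cut = int(len(examples) * split)
    return examples[:cut], examples[cut:]


def precision_recall_f(gold, predicted):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # true positives
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # false positives
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score
```

The first row of the log is consistent with F being the harmonic mean of P and R: 2 × 0.746 × 0.717 / (0.746 + 0.717) ≈ 0.731. Note also that F peaks at 0.743 on the third iteration and drifts down to 0.725 while the loss keeps falling, which suggests that the later iterations overfit the training split.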


## Save and version the model
We log the model itself as an artifact to ModelDB.

In [10]:
run.log_model(nlp)

upload complete (custom_modules.zip)
upload complete (model.pkl)
upload complete (model_api.json)


Finally, we link the commit to our Experiment Run, mapping descriptive keys to paths within the commit.

In [11]:
run.log_commit(
    commit,
    {
        'notebook': "notebooks/tweet-analysis",
        'hyperparameters': "config/hyperparams",
        'training_data': "data/tweets",
        'python_env': "env/python",
    },
)
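
Each value in the mapping above is a path that was passed to `commit.update` earlier, so a typo in either place would presumably leave a key pointing at nothing. A small local check can catch that before logging. This is a sketch in plain Python that does not call the Verta API; `unknown_paths` is a hypothetical helper.

```python
def unknown_paths(committed_paths, key_paths):
    """Return the {key: path} entries whose path is not among the committed paths."""
    committed = set(committed_paths)
    return {key: path for key, path in key_paths.items() if path not in committed}


# Paths used with commit.update() earlier in this notebook.
committed_paths = [
    "notebooks/tweet-analysis",
    "config/hyperparams",
    "data/tweets",
    "env/python",
]

# Mapping passed to run.log_commit().
key_paths = {
    "notebook": "notebooks/tweet-analysis",
    "hyperparameters": "config/hyperparams",
    "training_data": "data/tweets",
    "python_env": "env/python",
}

problems = unknown_paths(committed_paths, key_paths)
```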

## Deployment

Great! You now have a trained, versioned model that you can run predictions against. The next step of this tutorial shows how to deploy it.