# NLP training example
In this example, we'll train an NLP model for sentiment analysis of tweets using spaCy.

First, we download spaCy's small English language model.

In [1]:
!python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


Next, we import the libraries we need.

In [2]:
from __future__ import unicode_literals, print_function

import boto3
import json
import numpy as np
import pandas as pd
import spacy

## Data prep

Download the dataset from S3.

In [3]:
S3_BUCKET = "verta-strata"
S3_KEY = "english-tweets.csv"
FILENAME = S3_KEY

boto3.client('s3').download_file(S3_BUCKET, S3_KEY, FILENAME)

Load the data, shuffle the rows, and clean it using our utils helper module.

In [4]:
import utils

data = pd.read_csv(FILENAME).sample(frac=1).reset_index(drop=True)
utils.clean_data(data)

data.head()

| | text | sentiment |
|---|---|---|
| 0 | i wont be able to see you today i miss you. | 0 |
| 1 | Well I'd buy it! How's your son doing now? | 1 |
| 2 | do you still use LJ? i use mine mostly for com... | 1 |
| 3 | Congrats, Scott! That's awesome. | 1 |
| 4 | jealous of everyone at Summertime Ball today. ... | 0 |
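
The utils module is not shown in this notebook, so what `clean_data` actually does is not specified here. As a rough illustration only, a tweet cleaner of this kind might unescape HTML entities, strip URLs and @-mentions, and collapse whitespace. The function name `clean_tweet` and every rule below are assumptions, not the tutorial's implementation.

```python
import html
import re

# Hypothetical patterns; the real utils.clean_data may use different rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Unescape HTML entities, drop URLs and @-mentions, collapse whitespace."""
    text = html.unescape(text)
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()
```

With a DataFrame like the one above, such a function could be applied column-wise, for example `data["text"] = data["text"].map(clean_tweet)`.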


## Set up ModelDB
ModelDB organizes our work and lets us log and version metadata.

In [5]:
from verta import Client

client = Client('https://dev.verta.ai')
client.set_project('Tweet Classification')
client.set_experiment('SpaCy')
run = client.set_experiment_run()

set email from environment
set developer key from environment
connection successfully established
set existing Project: Tweet Classification from personal workspace
set existing Experiment: SpaCy
created new ExperimentRun: Run 866041584393615914012
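
The log above shows the client picking up an email and a developer key from the environment. A minimal sketch of a pre-flight check that could run before `Client(...)` is created follows. The variable names `VERTA_EMAIL` and `VERTA_DEV_KEY` are an assumption based on the Verta client's conventions and should be checked against its documentation; the helper `missing_credentials` is hypothetical.

```python
import os

# Assumed variable names; verify against the Verta client documentation.
REQUIRED_VARS = ("VERTA_EMAIL", "VERTA_DEV_KEY")


def missing_credentials(environ=None):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]
```

Calling `missing_credentials()` with no arguments inspects the current process environment; an empty list means both values are present.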


We'll first record our code, configuration, dataset, and environment versions to a ModelDB repository.

In [6]:
repo = client.set_repository('Verta Strata')
commit = repo.get_commit(branch='master')

set existing Repository: Verta Strata from personal workspace


In [7]:
from verta.code import Notebook
from verta.configuration import Hyperparameters
from verta.dataset import S3
from verta.environment import Python

code_ver = Notebook()
config_ver = Hyperparameters({'n_iter': 10})
dataset_ver = S3("s3://{}/{}".format(S3_BUCKET, S3_KEY))
env_ver = Python()

commit.update("notebooks/tweet-analysis", code_ver)
commit.update("config/hyperparams", config_ver)
commit.update("data/tweets", dataset_ver)
commit.update("env/python", env_ver)
commit.save("Deployment-ready sentiment analysis")

commit

Commit cd802a73d78dbf40cef3c85e16ae79e39a4f5d147eb8bbaf903996a62c40acda containing:
config/hyperparams (Blob)
data/tweets (Blob)
env/python (Blob)
notebooks/tweet-analysis (Blob)

## Train the model
We'll use a pre-trained model from spaCy and fine-tune it on our new dataset.

In [8]:
nlp = spacy.load('en_core_web_sm')

Update the model with our data using the training helper module. Here we train on only the first 100 rows.

In [9]:
import training

training.train(nlp, data[:100])

Using 16000 examples (80 training, 20 evaluation)
Training the model...
LOSS 	  P  	  R  	  F  
0.635	0.437	0.875	0.583
0.521	0.714	0.625	0.667
0.232	0.667	0.250	0.364
0.033	0.600	0.375	0.462
0.003	0.600	0.375	0.462
0.000	0.600	0.375	0.462
0.000	0.500	0.250	0.333
0.000	0.500	0.250	0.333
0.000	0.600	0.375	0.462
0.000	0.750	0.375	0.500
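
The `P`, `R`, and `F` columns are presumably precision, recall, and F1 score on the evaluation examples; the training module is not shown, so this reading is an assumption. A minimal sketch of how these metrics follow from true-positive, false-positive, and false-negative counts is given below. The counts in the usage note are hypothetical, chosen because they are consistent with the first row of the table.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion counts.

    Returns 0.0 for any metric whose denominator would be zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, `precision_recall_f1(tp=7, fp=9, fn=1)` gives a precision of 7/16, a recall of 7/8, and an F1 of 7/12, which is roughly 0.44, 0.88, and 0.58, in line with the first row above. The later rows show the loss falling to zero while F1 fluctuates, which is what one would expect when fitting only 80 training examples.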


## Save and version the model
We log the model itself as an artifact to ModelDB.

In [None]:
run.log_model(nlp)

upload complete (custom_modules.zip)


And finally, link the commit to our Experiment Run.

In [None]:
run.log_commit(
    commit,
    {
        'notebook': "notebooks/tweet-analysis",
        'hyperparameters': "config/hyperparams",
        'training_data': "data/tweets",
        'python_env': "env/python",
    },
)
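
Each value in the dictionary passed to `run.log_commit` should be a path that was previously added with `commit.update`. A small sketch of a check that catches typos before logging is shown below. The helper name `unknown_paths` is hypothetical, and it works on plain Python collections rather than on Verta objects.

```python
def unknown_paths(commit_paths, key_paths):
    """Return the {key: path} entries whose path is not in the commit."""
    known = set(commit_paths)
    return {key: path for key, path in key_paths.items() if path not in known}


# Paths as listed in the commit summary earlier in this notebook.
commit_paths = [
    "notebooks/tweet-analysis",
    "config/hyperparams",
    "data/tweets",
    "env/python",
]
key_paths = {
    "notebook": "notebooks/tweet-analysis",
    "hyperparameters": "config/hyperparams",
    "training_data": "data/tweets",
    "python_env": "env/python",
}
problems = unknown_paths(commit_paths, key_paths)  # empty dict when all paths match
```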

## Deployment

Great! You now have a model that you can run predictions against. The next step of this tutorial shows how to deploy it.