# Batch inference using a BERT model for named entity recognition
This notebook demonstrates how to do the following tasks:
- Build a `pyfunc` model encapsulating a [BERT language model](https://github.com/google-research/bert) for named entity recognition (NER).
- Deploy the [`pyfunc`](https://mlflow.org/docs/latest/python_api/mlflow.pyfunc.html) model to a Mosaic AI Model Serving endpoint.
- Perform batch inference using `ai_query` ([AWS](https://docs.databricks.com/sql/language-manual/functions/ai_query.html) | [Azure](https://learn.microsoft.com/azure/databricks/sql/language-manual/functions/ai_query)) on the Mosaic AI Model Serving endpoint.

To test the model before deploying, run this notebook on a cluster with a GPU.

## Download and import libraries

Download and install the latest versions of `torch`, `torchvision`, `transformers`, `mlflow`, and `databricks-sdk`, then restart the Python process so that the upgraded packages are loaded.

In [0]:
%pip install --upgrade torch transformers mlflow torchvision databricks-sdk
dbutils.library.restartPython()

In [0]:
import torch
import torchvision
import mlflow
import pandas as pd
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
import transformers
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import EndpointCoreConfigInput, ServedEntityInput
import datetime
import json
import pyspark.sql.functions as F

## Set registry for the model

The following cell configures MLflow to use the Unity Catalog model registry.

In [0]:
mlflow.set_registry_uri("databricks-uc")

## Define a `pyfunc` to load the model and create the pipeline

Define an MLflow `pyfunc` to take input text and return the NER results from the model.

In [0]:
# Define an MLflow pyfunc class that can be served from a model serving endpoint.
class ner_pyfunc(mlflow.pyfunc.PythonModel):
  def __init__(self, hf_model_name : str) -> None:
    self.model_name = hf_model_name
  
  # `load_context` runs once when the model is initialized on an endpoint. Download and initialize the model here.
  def load_context(self, context : dict) -> None:
    from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

    # Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(self.model_name)
    model = AutoModelForTokenClassification.from_pretrained(self.model_name)

    # Create the NER pipeline, specify it will be put onto the GPU
    self.ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, device="cuda:0")
  
  # `predict` is called each time a batch of data is passed to the model serving endpoint.
  def predict(self, context : dict, model_input : pd.DataFrame) -> pd.DataFrame:
    
    # Run the NER pipeline
    results = self.ner_pipeline(model_input["text"].tolist())

    # Convert outputs to a pandas dataframe
    return pd.DataFrame({"ner" : pd.Series(results)})

## Test the BERT model

In [0]:
_model = ner_pyfunc("dslim/bert-base-NER-uncased")
# The serving endpoint calls `load_context` automatically; in a notebook, call it manually.
_model.load_context({})

# Create some test inputs.
test_inputs = pd.DataFrame({"text" : ["My name is Wolfgang and I live in Berlin", "My name is Colton, I work at Databricks."]})
outputs = _model.predict({}, test_inputs)

In [0]:
display(outputs)
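Each row of the `ner` column holds token-level predictions, so a name that spans several tokens or word pieces shows up as several entries. The sketch below shows one way to merge them into whole entities. It assumes each prediction is a dict with `entity`, `word`, `start`, and `end` keys and that labels use `B-`/`I-` prefixes. The helper `group_entities` is hypothetical, and the sample values are hand-written for illustration, not real model output.

```python
from typing import Dict, List


def group_entities(text: str, tokens: List[Dict]) -> List[Dict]:
    """Merge token-level B-/I- predictions into entity spans (hypothetical helper)."""
    spans: List[Dict] = []
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        last = spans[-1] if spans else None
        # An I- tag or a "##" word piece with the same label extends the previous span.
        continues = last is not None and last["label"] == label and (
            prefix == "I" or tok["word"].startswith("##")
        )
        if continues:
            last["end"] = tok["end"]
        else:
            spans.append({"label": label, "start": tok["start"], "end": tok["end"]})
    # Recover the surface text of each entity from the character offsets.
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans


# Hand-written sample in the assumed output shape (illustrative values only).
sample_text = "My name is Wolfgang and I live in Berlin"
sample_tokens = [
    {"entity": "B-PER", "score": 0.99, "index": 4, "word": "wolfgang", "start": 11, "end": 19},
    {"entity": "B-LOC", "score": 0.99, "index": 9, "word": "berlin", "start": 34, "end": 40},
]
entities = group_entities(sample_text, sample_tokens)
print(entities)
```

The `transformers` pipeline also accepts an `aggregation_strategy` argument that performs similar grouping inside the pipeline. Post-processing in Python, as sketched here, leaves the served model's output unchanged.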

## Register the BERT model

First, create an MLflow signature to tell MLflow what inputs and outputs the model expects.

In [0]:
# Load MLflow infer_signature
from mlflow.models import infer_signature

# Infer the signature from our test inputs and the associated outputs from the test model
sig = infer_signature(test_inputs, outputs)

Create the dependency list so that the endpoint has all of the necessary libraries at runtime.

In [0]:
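# Get the default conda environment, whose last dependency entry holds the pip requirements
pip_reqs = mlflow.pyfunc.get_default_conda_env()
# Add the libraries the model needs, pinned to the versions installed in this notebook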
pip_reqs["dependencies"][-1]["pip"] += [
  f"torch=={torch.__version__.split('+')[0]}",
  f"torchvision=={torchvision.__version__.split('+')[0]}",
  f"transformers=={transformers.__version__}"
]
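The `.split('+')[0]` calls above strip any local version label from the installed version string. PyTorch builds often report versions such as `2.1.0+cu121`, and pinning the part before the `+` gives a requirement that does not depend on a build-specific suffix. A minimal sketch of that step follows, using a hypothetical helper `pin` and made-up version strings.

```python
def pin(package: str, version: str) -> str:
    """Return a pip requirement pinned to the public version, dropping any local label."""
    public_version = version.split("+")[0]
    return f"{package}=={public_version}"


# Hypothetical version strings for illustration.
print(pin("torch", "2.1.0+cu121"))    # torch==2.1.0
print(pin("transformers", "4.38.2"))  # transformers==4.38.2
```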

Log the model and register it in the Unity Catalog model registry.

In [0]:
with mlflow.start_run() as run:
  info = mlflow.pyfunc.log_model(
    "model",
    python_model=ner_pyfunc("dslim/bert-base-NER-uncased"),
    conda_env=pip_reqs,
    input_example=test_inputs,
    signature=sig,
    registered_model_name="autobricks.default.ner_pyfunc"
  )
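Models registered in Unity Catalog are addressed by a three-level name of the form `catalog.schema.model`, and this notebook uses the same name again when creating the endpoint. The sketch below shows one way to validate such a name and keep it in a single place. `UCModelName` and `parse_uc_model_name` are hypothetical helpers, not part of MLflow or the Databricks SDK.

```python
from typing import NamedTuple


class UCModelName(NamedTuple):
    """Three-level Unity Catalog model name (hypothetical helper type)."""
    catalog: str
    schema: str
    model: str

    @property
    def full_name(self) -> str:
        return f"{self.catalog}.{self.schema}.{self.model}"


def parse_uc_model_name(name: str) -> UCModelName:
    """Split 'catalog.schema.model' into its parts, rejecting malformed names."""
    parts = name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"Expected 'catalog.schema.model', got {name!r}")
    return UCModelName(*parts)


model_name = parse_uc_model_name("autobricks.default.ner_pyfunc")
print(model_name.catalog, model_name.schema, model_name.model)
print(model_name.full_name)
```

Defining the name once and passing `model_name.full_name` to both the registration call and the endpoint configuration avoids the two strings drifting apart.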

## Deploy model to Mosaic AI Model Serving endpoint

Use the [Databricks Python SDK](https://databricks-sdk-py.readthedocs.io/en/latest/index.html) to create the model serving endpoint. Building the container and bringing it online can take some time, so the timeout is set to 30 minutes.

In [0]:
# Initialize a workspace client
w = WorkspaceClient()

# Create the serving endpoint
w.serving_endpoints.create_and_wait(
  name="ner-serving-endpoint",
  config=EndpointCoreConfigInput(
    name="ner-serving-endpoint",
    served_entities=[
      ServedEntityInput(
        entity_name="autobricks.default.ner_pyfunc",
        entity_version=info.version,
        name="ner",
        scale_to_zero_enabled=True,
        workload_type="GPU_MEDIUM",
        workload_size="Small"
      )
    ],
  ),
  timeout=datetime.timedelta(minutes=30)
)

Now, you can run a test query.

In [0]:
response = w.serving_endpoints.query(
  name="ner-serving-endpoint",
  dataframe_records=json.loads(test_inputs.to_json(orient="records"))
)
print(response)
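The `dataframe_records` argument is a list with one dict per row, keyed by column name. For a single `text` column, the sketch below builds the same structure without pandas and prints the JSON body as it would look in a raw request to the endpoint. The helper name `to_dataframe_records` is hypothetical.

```python
import json
from typing import Dict, List


def to_dataframe_records(texts: List[str], column: str = "text") -> List[Dict[str, str]]:
    """Build one record per input string, keyed by the column name the model expects."""
    return [{column: t} for t in texts]


records = to_dataframe_records([
    "My name is Wolfgang and I live in Berlin",
    "My name is Colton, I work at Databricks.",
])
payload = json.dumps({"dataframe_records": records}, indent=2)
print(payload)
```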

## Batch inference using `ai_query`

To run inference for several records at a time, you can perform batch inference using the `ai_query` function.

In [0]:
# First create a PySpark dataframe
df = (
  spark.createDataFrame(
    pd.DataFrame({
      "text" : ["My name is John, and I love databricks!" for _ in range(1000)]
    })
  )
  # Now add a column where we query our model serving endpoint
  .withColumn("outputs", F.expr("""ai_query(
      "ner-serving-endpoint",
      text
    )"""))
)

Spark evaluates transformations lazily, so the endpoint is not queried until an action runs. Call the `display` function to execute the Spark code above.

In [0]:
display(df)
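When the endpoint or column name varies, the `ai_query` expression passed to `F.expr` can be generated rather than written by hand. The sketch below is one way to do that, assuming default Spark SQL string-literal and backtick-identifier escaping. `ai_query_expr` is a hypothetical helper, not a Databricks API.

```python
def ai_query_expr(endpoint: str, column: str) -> str:
    """Build a Spark SQL `ai_query` expression string for use with F.expr."""
    # Escape single quotes so the endpoint name stays a valid SQL string literal.
    escaped_endpoint = endpoint.replace("'", "\\'")
    # Backtick-quote the column name, doubling any embedded backticks.
    quoted_column = "`" + column.replace("`", "``") + "`"
    return f"ai_query('{escaped_endpoint}', {quoted_column})"


expr = ai_query_expr("ner-serving-endpoint", "text")
print(expr)  # ai_query('ner-serving-endpoint', `text`)
```

The resulting string can be used in place of the literal expression, for example `.withColumn("outputs", F.expr(expr))`.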

Now you have performed batch inference for 1000 data points on a BERT NER model serving endpoint!