# Serving a simple ML model

I wanted to start by training a simple profanity detection model, using TF-IDF as explained in my blog post.

In this part, we're going to be in `/model`. Let's start by generating our data from the raw data that's available on GitHub.

In [1]:
import os
os.chdir("../model")

In [2]:
import subprocess
subprocess.run(["python", "generate_data.py"])

CompletedProcess(args=['python', 'generate_data.py'], returncode=0)

And that generated a cleaned dataset for us to use at `data/tweets.csv`. We can use that for training now.

In [3]:
subprocess.run(["python", "train.py"])

Model + val data saved.


CompletedProcess(args=['python', 'train.py'], returncode=0)

In that file, we used a pretty simple TF-IDF model (as explained in the blog post). I set it up so that the model is saved as a `.joblib` file in the `/app` directory.
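`train.py` itself isn't shown here, so this is only a minimal sketch of what a TF-IDF pipeline saved with joblib can look like. I'm assuming scikit-learn's `TfidfVectorizer` feeding a `LogisticRegression`; the toy sentences, the label convention (1 = profane), and the temporary output path are all made up for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data for illustration only; the real script trains on data/tweets.csv.
# The label convention (1 = profane) is an assumption.
texts = [
    "what a damn mess",
    "this is damn boring",
    "damn you",
    "the weather is lovely",
    "ice cream is great",
    "see you tomorrow",
]
labels = [1, 1, 1, 0, 0, 0]

# Vectorizer and classifier live in one pipeline, so the saved artifact
# accepts raw strings at inference time.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts, labels)

# Persist and reload, mirroring the .joblib hand-off between /model and /app.
model_path = os.path.join(tempfile.mkdtemp(), "profanity.joblib")
joblib.dump(pipeline, model_path)
reloaded = joblib.load(model_path)

# One input sentence gives one row with two class probabilities.
probabilities = reloaded.predict_proba(["ice cream is damn good"])
print(probabilities.shape)
```

Keeping the vectorizer inside the saved pipeline is the design choice that matters: whatever loads the `.joblib` later doesn't need to know anything about the preprocessing.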

Let's run some quick evals so we can make sure that the model isn't horrible (I'm not *too* concerned about eval metrics but I also don't want a garbage model).

In [4]:
subprocess.run(["python", "evaluate.py"])

X_val shape: (4954,)
y_val shape: (4954,)
Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.66      0.76       870
           1       0.93      0.98      0.96      4084

    accuracy                           0.93      4954
   macro avg       0.91      0.82      0.86      4954
weighted avg       0.92      0.93      0.92      4954

Confusion Matrix:
[[ 578  292]
 [  74 4010]]
Validation Accuracy: 0.9261203068227695


CompletedProcess(args=['python', 'evaluate.py'], returncode=0)
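As a quick cross-check, the numbers in the classification report can be re-derived by hand from the confusion matrix above. This sketch assumes scikit-learn's convention that rows are true labels and columns are predicted labels, which matches the support counts in the report (578 + 292 = 870 for class 0).

```python
# Confusion matrix copied from the evaluate.py output above.
# Rows are true labels, columns are predicted labels (assumed convention).
confusion = [[578, 292],
             [74, 4010]]


def per_class_metrics(matrix, cls):
    """Return (precision, recall, f1) for class index `cls`."""
    true_pos = matrix[cls][cls]
    predicted = sum(row[cls] for row in matrix)  # column sum
    actual = sum(matrix[cls])                    # row sum
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(len(confusion))) / total

for cls in range(len(confusion)):
    precision, recall, f1 = per_class_metrics(confusion, cls)
    print(f"class {cls}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.4f}")
# class 0: precision=0.89 recall=0.66 f1=0.76
# class 1: precision=0.93 recall=0.98 f1=0.96
# accuracy=0.9261
```

The weak spot is class 0 recall: 292 of the 870 class 0 examples end up predicted as class 1.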

Okay so this means that our model isn't horrible (actually it's not bad at all - ~93% accuracy and decent F1). Let's try it on a few examples so that we know it's actually fine (and will work for people who want to hit this API eventually).

In [5]:
inputSentence1 = "this is so fucking boring"
inputSentence2 = "what a bitch bruh"
inputSentence3 = "this ice cream is actually amazing"

inputs = [inputSentence1, inputSentence2, inputSentence3]

But we have to switch directories to `/app` first.

In [6]:
os.chdir("../app")

In [7]:
import subprocess

for i, sentence in enumerate(inputs, start=1):
    result = subprocess.run(
        ["python", "check_model.py", sentence],
        capture_output=True, text=True
    )
    print(f"stdout {i}:", result.stdout.strip())
    if result.stderr:
        print(f"stderr {i}:", result.stderr.strip())

stdout 1: Text: this is so fucking boring
is_profane: True | confidence: 91.59%
stdout 2: Text: what a bitch bruh
is_profane: True | confidence: 98.0%
stdout 3: Text: this ice cream is actually amazing
is_profane: False | confidence: 38.22%


That was a nice sanity check: both sentences containing profanity were flagged with over 90% confidence, and the clean sentence was not flagged.


Our machine learning is done now and for the rest we are going to focus on engineering a system around serving this model.

## BentoML Deployment

In the blog post, I talked about how much BentoML helps. I'll now show how much it helps in action.

If you take a look at `app/service_bento.py`, you can see that we load our model, set up a service, and expose an endpoint for requests to hit. We can start testing it by serving it locally.

Ideally you would do

```bash
cd app
bentoml serve service_bento:ProfanityService
```

but since we're in a `.ipynb`, I'm going to have `subprocess` run it instead. First, though, I register the saved pipeline in BentoML's local model store.

In [10]:
import joblib
import bentoml

pipeline = joblib.load("profanity.joblib")
bentoml.sklearn.save_model("profanity_detector", pipeline)

Model(tag="profanity_detector:att6b7ddiwqbxbxv", path="/Users/akhilvreddy/bentoml/models/profanity_detector/att6b7ddiwqbxbxv/")

In [11]:
process = subprocess.Popen(
    ["bentoml", "serve", "service_bento:ProfanityService"]
)

  __import__("pkg_resources").declare_namespace(__name__)  # type: ignore


2025-07-17T15:34:16-0400 [INFO] [cli] Starting production HTTP BentoServer from "service_bento:ProfanityService" listening on http://localhost:3000 (Press CTRL+C to quit)


  __import__("pkg_resources").declare_namespace(__name__)  # type: ignore


2025-07-17T15:34:17-0400 [INFO] [entry_service:ProfanityService:1] Service ProfanityService initialized
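Since `Popen` returns immediately, the server may not be accepting requests yet when the next cell runs. A small polling helper avoids that race. This is a sketch using only the standard library; the function name and defaults are my own, and it treats any HTTP 200 as "ready".

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url, timeout=30.0, interval=0.5):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections (or not healthy) yet
        time.sleep(interval)
    return False
```

With the server from the cell above, this would be called as `wait_until_ready("http://localhost:3000/readyz")`, assuming the readiness route is `/readyz` on the default port. When the experiment is over, `process.terminate()` stops the background server.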


Let's make some API calls now to make sure it's working well.

In [13]:
import requests

url = "http://localhost:3000/predict"
headers = {"Content-Type": "application/json"}
data = {"text": "you fucking dumbass"}

response = requests.post(url, json=data, headers=headers)

print("Status Code:", response.status_code)
print("Response:", response.json())

2025-07-17T15:37:05-0400 [INFO] [entry_service:ProfanityService:1] 127.0.0.1:55281 (scheme=http,method=POST,path=/predict,type=application/json,length=31) (status=200,type=application/json,length=42) 1.779ms (trace=f319d0ad54bea97cbb1cb3b7ef535f2d,span=be13ee9505ec893e,sampled=0,service.name=ProfanityService)
Status Code: 200
Response: {'is_profane': True, 'confidence': 0.9841}


In [18]:
data = {"text": "ice cream is good"}

response = requests.post(url, json=data, headers=headers)

print("Status Code:", response.status_code)
print("Response:", response.json())

2025-07-17T15:37:58-0400 [INFO] [entry_service:ProfanityService:1] 127.0.0.1:55300 (scheme=http,method=POST,path=/predict,type=application/json,length=29) (status=200,type=application/json,length=43) 1.587ms (trace=02ed65ae8554d29b5eb16f3fbb93da31,span=15043a3ec72bc5ef,sampled=0,service.name=ProfanityService)
Status Code: 200
Response: {'is_profane': False, 'confidence': 0.3354}


Okay so that's good too. We're hitting our local API and getting the same kind of responses we got when we called the model directly with `check_model.py`.

The next part would be to actually ship this as a containerized API with the model on a registry.

We would have to run `bentoml build` first, which packages the model and service into a Bento. Then we can containerize it.

Again, ideally I would've liked to run

```bash
bentoml build
bentoml containerize profanity_api:latest
```

but we have to use our workaround.

In [19]:
import subprocess

subprocess.run(["bentoml", "build"], check=True)
subprocess.run(["bentoml", "containerize", "profanity_api:latest"], check=True)

  __import__("pkg_resources").declare_namespace(__name__)  # type: ignore
Error: [bentos] `build` failed: Failed to load bento or import service ''. The directory '/Users/akhilvreddy/Documents/deploying-profanity-service/app' does not contain a valid bentofile.yaml or service.py.


CalledProcessError: Command '['bentoml', 'build']' returned non-zero exit status 1.

Ignore the above error - I ran the rest on my local terminal and I'll attach pictures.
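For what it's worth, the error message says the build looks for a `bentofile.yaml` or `service.py` in the working directory, and our service lives in `service_bento.py`. A pre-flight check before shelling out could be sketched like this; the helper name is hypothetical and the two file names are taken straight from the error message.

```python
from pathlib import Path

# File names copied from the `bentoml build` error message above.
BUILD_ENTRYPOINTS = ("bentofile.yaml", "service.py")


def find_build_entrypoint(directory="."):
    """Return the first build entrypoint file found in `directory`, else None."""
    base = Path(directory)
    for name in BUILD_ENTRYPOINTS:
        candidate = base / name
        if candidate.is_file():
            return candidate
    return None


if find_build_entrypoint() is None:
    print("No bentofile.yaml or service.py here; `bentoml build` would fail.")
```

Running this before `subprocess.run(["bentoml", "build"], check=True)` turns the stack trace into a one-line hint about the missing file.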

I'll just talk through the next few steps because it makes 0 sense to run these from ipynb python cells. 

After those two commands, I created a Docker image that serves our model by calling

```bash
bentoml containerize profanity_service:p2cgpvddjc4szbxv 
```

Optionally, I could have pushed this to a model registry (if I were deploying this as part of a true production service) with:

```bash
bentoml push profanity_service:p2cgpvddjc4szbxv 
```

This would've pushed my local Bento to BentoCloud, which would've worked as a registry.

Here are some results:

![one](../assets/pic1.png)
![two](../assets/pic2.png)



To recap, we went from:

1) Raw data
2) Model
3) Service
4) Bento
5) Docker container

Referring back to what I had on my blog, we can wrap this up by pushing to GHCR, Docker Hub, or ECR. We're basically in the "lock and ship" phase.

I'm going to go with GHCR for simplicity (it's already well integrated with the GitHub ecosystem). Here are the packages getting uploaded:

![yo](../assets/pic3.png)

So once this has been uploaded, it means that any machine that can authenticate to GHCR can now run `docker pull` and run this API.

Let's move to the last part, which is deploying this on the cloud. I'm going to go with fly.io since it's free to start.

Essentially, I want fly.io to host and serve my dockerized app without me having to worry about the infra overhead. It's supposed to run it on a public server and usually returns a URL for you to make calls to your service. 

For this case, we should expect a link like https://profanity-service.fly.dev/predict, and anyone can then curl or `requests.post()` to that endpoint and hit the API. **I would say the work here is done when we can curl a phrase with profanity to that link straight from my terminal.**
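That "done" criterion can also be written down as a tiny smoke test. This is a sketch with the standard library only: the helper names are mine, and the request and response shapes (`{"text": ...}` in, `is_profane` out) are the ones we saw from the local server earlier.

```python
import json
import urllib.request


def post_json(url, payload, timeout=10.0):
    """POST `payload` as JSON and return the decoded JSON response body."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def smoke_test(url, profane_text, clean_text, post=post_json):
    """Return True if the endpoint flags `profane_text` and not `clean_text`."""
    flagged = post(url, {"text": profane_text})
    clean = post(url, {"text": clean_text})
    return flagged["is_profane"] is True and clean["is_profane"] is False
```

Once the app is live, `smoke_test("https://profanity-service.fly.dev/predict", "<some profane phrase>", "ice cream is good")` should return `True`. The `post` argument is injectable so the check itself can be exercised without a network.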

After this, I ran some commands to set up fly:

```bash
flyctl deploy --image ghcr.io/akhilvreddy/profanity_service:latest
```

That deployed my base image to Fly. However, I quickly got an email saying that I ran out of compute because of how big the application is. And that makes sense because this isn't a tiny microservice - it's a logistic regression model that's trained on a decent amount of data.

I had to scale it up to serve

```bash
fly scale memory 512 -a deploying-profanity-service
```

and then to start the application again

```bash
flyctl start -a deploying-profanity-service
```

And that worked for me! Here's some proof.

The page (https://profanity-service.fly.dev/predict) opened to a Swagger page:

![](../assets/pic4.png)

And as we can see, it is predicting quickly! 

Here's what the whole page looked like:

![](../assets/pic5.png)

As I mentioned in the blog post, BentoML automatically gives us those health endpoints, which is so useful.

Here's proof that it was working on my phone as well:

![](../assets/pic6.png)

And it was super nice to see that Prometheus metrics were automatically generated:

![](../assets/pic7.png)



Okay so the app is fully hosted, and the last (but most important) part would be to set up a re-training loop. I currently don't have any way to get the "latest profane words", so I'll simulate a pipeline that assumes some new data is coming in.

Here's a pipeline that we would use in practice to fully automate this loop:

deploy.yaml
```yaml
name: Retrain, Containerize & Deploy

on:
  push:
    branches: [main]
    paths:
      # retrigger pipeline if relevant files changed
      - "app/train.py"
      - "app/evaluate.py"
      - "app/generate_data.py"
      - "data/**"
      - "app/service_bento.py"
      - "app/bentofile.yaml"

  # nightly drift detection
  schedule:
    - cron: '0 4 * * *'

  # manual trigger support
  workflow_dispatch: {}

jobs:
  deploy:
    runs-on: ubuntu-latest

    permissions:
      contents: read
      packages: write

    steps:
      # checkout and setup
      - name: Checkout repo
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: pip

      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install -r requirements.txt

      # once new data is in, generate a new cleaned file
      - name: Generate training data
        run: python app/generate_data.py

      # retrain & evaluate Model
      - name: Train model
        run: python app/train.py

      - name: Evaluate model quality
        run: python app/evaluate.py

      # in production, fail here if performance drops too low
      # like if evaluate.py results in bad F1 or accuracy

      # BentoML
      - name: Set up BentoML
        uses: bentoml/setup-bentoml-action@v1
        with:
          python-version: "3.11"

      - name: Build Bento
        run: bentoml build

      - name: Login to GitHub Container Registry
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.repository_owner }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Containerize & push to GHCR
        uses: bentoml/containerize-push-action@v1
        with:
          bento-tag: profanity_service:latest
          push: true
          tags: ghcr.io/${{ github.repository_owner }}/profanity_service:latest
          platform: linux/amd64

      # deploy to Fly.io      
      - name: Install flyctl
        uses: superfly/flyctl-actions/setup-flyctl@master

      - name: Deploy to Fly.io
        run: flyctl deploy --image ghcr.io/${{ github.repository_owner }}/profanity_service:latest --remote-only
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}

      # drift monitoring and prometheus hook
      - name: (Optional) Send Prometheus Drift Metric
        if: github.event_name == 'schedule'
        run: |
          echo "::notice ::Insert Prometheus drift check & pushgateway curl here"
          # e.g. curl -X POST http://prometheus-server/push/...
```
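The workflow leaves the quality gate as a comment ("fail here if performance drops too low"). Here's a minimal sketch of what that gate could look like. It assumes `evaluate.py` is extended to write its metrics to a JSON file, which the current script doesn't do, and the metric names and thresholds are hypothetical.

```python
import json

# Hypothetical minimums; tune them to the model and use case.
THRESHOLDS = {"accuracy": 0.90, "macro_f1": 0.80}


def failed_checks(metrics, thresholds=THRESHOLDS):
    """Return the names of metrics that are missing or below their minimum."""
    return [
        name
        for name, minimum in thresholds.items()
        if metrics.get(name, 0.0) < minimum
    ]


def gate_exit_code(metrics_path, thresholds=THRESHOLDS):
    """Read a metrics JSON file and return 0 on pass, 1 on fail."""
    with open(metrics_path, encoding="utf-8") as handle:
        metrics = json.load(handle)
    failures = failed_checks(metrics, thresholds)
    for name in failures:
        print(f"{name}: {metrics.get(name)} is below the minimum {thresholds[name]}")
    return 1 if failures else 0


# The numbers from the validation run earlier in this notebook pass the gate.
print(failed_checks({"accuracy": 0.9261, "macro_f1": 0.86}))  # prints []
```

In the workflow, a step after "Evaluate model quality" could call `sys.exit(gate_exit_code("metrics.json"))`. A non-zero exit code fails the step, so the build, push, and deploy steps after it never run.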

That gives us an end-to-end pipeline that would retrain, evaluate, containerize, and serve our application again. This is a bare-bones version, but in production settings companies probably have thousands of pipelines like this running on various triggers.