<style>
    div:has(> hr) h1 {
        font-size: 5em !important;
    }
    .reveal h3 {
        margin-top: 1em;
    }
</style>

# Model Deployment

*Model formats, batch vs realtime scoring, and deployment pipelines*

---

Ethan Swan, 2023

# Today's Topic: Deployment

Once a model has been selected, we usually need to do something with it.

*e.g.* We want to predict what products we should suggest to an online user. 

Setting up a model to run automatically on new data is called **deployment**.

# Agenda

1. About Me
2. Exporting a Model
3. Batch, Streaming, & On-demand Scoring
4. Deployment Pipelines

# About Me
---

# About Me

- Senior Backend Developer at [ReviewTrackers](https://www.reviewtrackers.com/)
    - Startup, ~100 employees
    - SaaS platform for online reputation management
- Analytics & ML Engineering Team
- NLP Microservice (Python), Main API Layer (Go)

- Started in February 2022
- Wanted to see a techy startup from the inside
    - especially engineering practices

# About Me

- Previously: [84.51˚](https://www.8451.com/)
    - Marketing Analytics Branch of Kroger
    - Lead Data Scientist - Internal Tools & Infrastructure
- Education: University of Notre Dame
    - B.S. in Computer Science
    - M.B.A.

- 84.51˚:
    - Did some measurement work
    - quickly transitioned to functional support
    - taught classes and helped with tech strategy

# What is Model Deployment?
---

# Why Do We Deploy Models?

- A model is ultimately a function that maps inputs (*features*) to outputs (*targets*).
    - Usually Python or R code
- How can we use that function in a real-world application?
    - How do we get new data into the model?
    - What happens if we shut down the session? Is the model gone forever?

# Deploying the Model

1. Export (save) the model in a reusable place and format.
2. Build a batch, streaming, or on-demand scoring system that loads the model and uses it to score new data.

Bonus: whenever you create a new model, you can just drop it into the deployment system and overwrite the old one.

*(though modern MLOps best practice is to tag your models with versions and keep the old ones.)*
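
A minimal sketch of version-tagged exports, assuming a local directory stands in for the model store. The helper names and the `model_v{N}.pkl` naming scheme are hypothetical, not a standard.

```python
import pickle
from pathlib import Path


def save_model(model, model_dir, version):
    """Pickle `model` under a version-tagged filename so old versions are kept."""
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    path = model_dir / f"model_v{version}.pkl"
    if path.exists():
        raise FileExistsError(f"version {version} already exists; refusing to overwrite")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path


def latest_version(model_dir):
    """Return the highest version number saved in `model_dir`."""
    versions = [int(p.stem.split("_v")[1]) for p in Path(model_dir).glob("model_v*.pkl")]
    if not versions:
        raise FileNotFoundError("no saved models found")
    return max(versions)


def load_model(model_dir, version=None):
    """Load a specific version, or the newest one when `version` is None."""
    if version is None:
        version = latest_version(model_dir)
    with open(Path(model_dir) / f"model_v{version}.pkl", "rb") as f:
        return pickle.load(f)
```

The scoring system calls `load_model(model_dir)` to pick up the newest model, and can pin `version=` to fall back to an older one.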

# Exporting Models
---

# Exporting Models: Formats

- Pickle
    - Special, non--human-readable binary format
    - Can save any Python object
    - Some compatibility issues (e.g. across Python or library versions)
- Raw weights/parameters
    - Just a bunch of numbers in a file
    - More common for TensorFlow, PyTorch, etc.

- Pickle and similar libraries are easier and **more flexible**
    - but raise compatibility concerns
- Raw model weights are **more portable**
    - but not necessarily easy to reload
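
A small sketch of the trade-off, using only the standard library. The toy model, its values, and the `predict` helper are made up for illustration; JSON stands in for "a bunch of numbers in a file".

```python
import json
import pickle
from datetime import date

# Toy "model": parameters plus metadata (all values are made up).
model = {
    "weights": (0.5, -1.0),
    "intercept": 2.0,
    "trained_on": date(2023, 1, 15),
}


def predict(params, features):
    """Linear model: intercept + sum(weight * feature)."""
    return params["intercept"] + sum(w * x for w, x in zip(params["weights"], features))


# Pickle: round-trips the whole object, including the tuple and the date.
# Only unpickle data you trust, and expect trouble across library versions.
restored = pickle.loads(pickle.dumps(model))

# Raw parameters: just the numbers, readable from any language,
# but the loader has to know how to rebuild a model around them.
raw = json.dumps({"weights": list(model["weights"]), "intercept": model["intercept"]})
rebuilt = json.loads(raw)

print(restored == model)
print(predict(restored, [1.0, 2.0]) == predict(rebuilt, [1.0, 2.0]))
```

Both lines print `True`: the two formats give the same predictions, but only the pickle kept the metadata.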


# Exporting Models: Locations


- Local filesystem
    - Only if the model is going to be deployed locally
- Cloud storage
    - S3, GCS, Azure Blob Storage, etc.
    - Works for almost any deployment location

# Deployment Approaches

---

# Batch, Streaming, and On-demand Scoring
- **Batch**
    - Run the model in advance and save the output
    - Think: Spotify Discover Weekly
- Realtime
    - Run the model on new data as it comes in
    - **Streaming**
        - Queue up new data and run the model as each item reaches the front of the queue
        - Think: Facebook makes new friend recommendations soon after you add a friend
    - **On-demand**
        - Run the model only when the prediction is needed via an API
        - Think: GitHub Copilot recommends code as you type

# Batch Scoring

A system runs the model on a batch of data every hour, day, week, etc., and saves the output to a database to be used when needed.

### Architecture
- Airflow or cloud-based scheduler to kick off the model
- Chain parts of the job (tasks) together
- Save output to a persistent location: database or cloud storage
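
A rough sketch of a single batch run, assuming SQLite as the persistent store. `load_model`, `fetch_batch`, and the `scores` table are hypothetical stand-ins; in practice a scheduler such as Airflow would call something like `run_batch_job` as a task.

```python
import sqlite3


def load_model():
    """Stand-in for loading an exported model; returns a scoring function."""
    weights, intercept = [0.5, -1.0], 2.0
    return lambda features: intercept + sum(w * x for w, x in zip(weights, features))


def fetch_batch():
    """Stand-in for reading the rows to score (e.g. users active since the last run)."""
    return [("user_1", [1.0, 2.0]), ("user_2", [4.0, 0.0])]


def run_batch_job(conn):
    """One scheduled run: load the model, score every row, persist the scores."""
    model = load_model()
    scores = [(row_id, model(features)) for row_id, features in fetch_batch()]
    conn.execute("CREATE TABLE IF NOT EXISTS scores (id TEXT PRIMARY KEY, score REAL)")
    # Overwrite any score left over from the previous run.
    conn.executemany("INSERT OR REPLACE INTO scores (id, score) VALUES (?, ?)", scores)
    conn.commit()
    return len(scores)


conn = sqlite3.connect(":memory:")
run_batch_job(conn)
# Downstream consumers read the stored score instead of calling the model.
print(conn.execute("SELECT score FROM scores WHERE id = ?", ("user_1",)).fetchone())
```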

# Batch Scoring

![Batch Scoring Architecture](images/batch_architecture.jpeg)

# Batch Scoring

Pros
- Predictable workload (always one run per hour/day/week)
- Relatively easy to set up

Cons
- Predictions are stale until the next run
- Reruns happen even if nothing has changed -- wasting resources

# Realtime Scoring

- **Streaming** – Trigger the model on new "events"
- **On-demand** – Access the model via an API when new predictions are needed

# Streaming

Track new events and send them into a queue to process. This is **realtime-ish**.
- Typically on a delay, which can be large if the queue is long


### Architecture
- A "publisher" sends a message to a "subscriber" when an event occurs
- Messages are "queued" up until the subscriber is ready to process them
    - Thus not truly realtime
- Platform: Kafka, RabbitMQ, cloud-based pub/sub, etc.
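
A toy sketch of the publisher/subscriber flow, assuming an in-process `queue.Queue` stands in for Kafka, RabbitMQ, or a cloud pub/sub service. The `score` function and the event shape are hypothetical.

```python
import queue
import threading


def score(event):
    """Stand-in for the model: any function of the event payload."""
    return len(event["text"])


def subscriber(events, results):
    """Pull events off the queue and score them until a None sentinel arrives."""
    while True:
        event = events.get()
        if event is None:
            break
        results.append((event["id"], score(event)))


events = queue.Queue()  # in-process stand-in for a real message queue
results = []
worker = threading.Thread(target=subscriber, args=(events, results))
worker.start()

# Publisher side: events are enqueued as they occur and wait their turn.
for i, text in enumerate(["great service", "slow", "would return"]):
    events.put({"id": i, "text": text})

events.put(None)  # tell the subscriber to stop
worker.join()
print(results)
```

The publisher never waits for the model; it only enqueues. That decoupling is what absorbs spikes, and also what adds the delay.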

# Streaming

![Streaming Architecture](images/streaming_architecture.jpeg)

# Streaming

Pros
- Queues help manage spikes in workload
- Consumers (scorers) can run in parallel
    - Enables easy horizontal scaling
- Queues add resilience
    - A consumer crash doesn't result in lost data

Cons
- Requires an additional system (the message queue service)
- Data flow through queues is difficult to trace and reason about
- Large workloads can cause long backups

# On-demand

Build an API that accepts new data and returns scores immediately
- Typically a response time of under a second

### Architecture
- Write the API server in a language that can run the model
    - If Python, use a framework such as Flask or FastAPI
    - Alternatively, sometimes people rebuild the model in a different language for performance (not recommended)
- Deploy the API server to a cloud service that supports scalability
- Each instance loads a copy of the model
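
A bare-bones sketch of a scoring API, assuming the standard library's `http.server` in place of Flask or FastAPI so the example runs without extra installs. The `/score` path, the JSON payload shape, and the toy `MODEL` are hypothetical.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Loaded once per server instance, not once per request.
MODEL = {"weights": [0.5, -1.0], "intercept": 2.0}


def predict(features):
    return MODEL["intercept"] + sum(w * x for w, x in zip(MODEL["weights"], features))


class ScoreHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body and answer with the score.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"score": predict(payload["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def request_score(port, features):
    """Client side: POST features and return the score from the response."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request("POST", "/score", body=json.dumps({"features": features}).encode(),
                 headers={"Content-Type": "application/json"})
    score = json.loads(conn.getresponse().read())["score"]
    conn.close()
    return score


server = HTTPServer(("127.0.0.1", 0), ScoreHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
print(request_score(server.server_port, [1.0, 2.0]))
server.shutdown()
server.server_close()
```

This sketch has no input validation, authentication, or rate limiting, all of which a real deployment would need.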

# On-demand

![On-demand Architecture](images/on_demand_architecture.jpeg)

# On-demand

Pros
- Only runs the model when needed
- No queueing system that can back up
- Easy to parallelize -- just run many instances of the API

Cons
- Slow scoring process can cause problems
    - APIs responding slowly can cause timeouts
- System can be overwhelmed by a large number of requests
    - Need to rate limit

# How to Choose?

- Start with batch if you can: easiest to set up
    - No need for a queueing system or complex API deployment
    - Latency won't be an issue -- scores are just stored in a database to be retrieved any time
- If you need realtime...
    - Use on-demand if predictions must be served immediately
        - If users are waiting for the prediction, you need to get it to them
    - If delay is tolerable, or if system overload is a real risk, use streaming and queues

# Deployment Pipelines
---

# Deployment Pipelines

- When deployed, a model goes from the domain of *data science* to the domain of *software engineering*, and that means...
- Version control
    - Can track the model file itself or just a reference to the model's name/version
- CI/CD
    - Continuous Integration - automated testing
    - Continuous Deployment - automated deployment

# Deployment Pipelines

- What's important:
    - The model can't get to production without being tested
    - It's possible to "roll back" to a past version of the model if issues are found
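
One way to picture both requirements is a toy registry: promotion is gated on tests, and every promoted version is kept so production can step back. `ModelRegistry` and `predicts_in_range` are hypothetical names, not a real library.

```python
class ModelRegistry:
    """Tracks promoted versions so production can be rolled back."""

    def __init__(self):
        self.history = []  # versions that have been in production, oldest first

    @property
    def production(self):
        return self.history[-1] if self.history else None

    def promote(self, version, model, tests):
        """Promote `model` only if every test passes (the CI gate)."""
        failed = [t.__name__ for t in tests if not t(model)]
        if failed:
            raise ValueError(f"version {version} failed tests: {failed}")
        self.history.append(version)

    def rollback(self):
        """Return production to the previously promoted version."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self.history.pop()
        return self.production


def predicts_in_range(model):
    """Example check: the score for a known input stays within [0, 1]."""
    return 0.0 <= model([1.0, 2.0]) <= 1.0


registry = ModelRegistry()
registry.promote("v1", lambda features: 0.3, [predicts_in_range])
registry.promote("v2", lambda features: 0.7, [predicts_in_range])
registry.rollback()
print(registry.production)
```

In a real pipeline the CI/CD system plays the role of `promote`, and the version history lives in version control or a model registry service.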

# Questions