# AI Engineer Technical Assessment

## Overview
Build an AI-powered solution for sentiment analysis of movie reviews that leverages the existing dataset to improve accuracy. This assessment is designed to be completed in 2-3 hours; we do NOT expect very detailed answers or long explanations.

## Notes
- AI assistance is allowed and, in fact, encouraged. Caveats:
    - Concise explanations and simple code are preferred.
    - Solutions that use newer information and go beyond the LLM's cutoff date are valuable.
    - You must be able to explain the code you write here.

- Look up any information you need; copying and pasting code is allowed.
- Set up the environment as needed. You can use your local environment, Colab, or any other environment of your preference.
- Focus on working solutions; leave iteration and improvements for any extra time.

## Setup
The following cells will download and prepare the IMDB dataset. 

In [1]:
import pandas as pd
from datasets import load_dataset

# Load IMDB dataset
dataset = load_dataset("imdb")
train_df = pd.DataFrame(dataset['train'])
test_df = pd.DataFrame(dataset['test'])

# Sample subset for quicker development
train_df = train_df.sample(n=5000, random_state=42)
test_df = test_df.sample(n=10, random_state=42)

print(f"Training samples: {len(train_df)}")
print(f"Test samples: {len(test_df)}")

# Display sample data
print("\nSample review:")
sample = train_df.iloc[0]
print(f"Text: {sample['text'][:200]}...")
print(f"Sentiment: {'Positive' if sample['label'] == 1 else 'Negative'}")

Training samples: 5000
Test samples: 10

Sample review:
Text: Dumb is as dumb does, in this thoroughly uninteresting, supposed black comedy. Essentially what starts out as Chris Klein trying to maintain a low profile, eventually morphs into an uninspired version...
Sentiment: Negative


In [2]:
train_df.head()

|       | text | label |
|------:|------|------:|
| 6868  | Dumb is as dumb does, in this thoroughly unint... | 0 |
| 24016 | I dug out from my garage some old musicals and... | 1 |
| 9668  | After watching this movie I was honestly disap... | 0 |
| 13640 | This movie was nominated for best picture but ... | 1 |
| 14018 | Just like Al Gore shook us up with his painful... | 1 |


## Task 1: Model Implementation
Implement a solution that analyzes sentiment in movie reviews. This part is explicitly open-ended: explore ways to leverage the example dataset to enhance predictions. You can consider a pre-trained language model that can understand and generate text, external APIs, RAG systems, etc.
Feel free to use any library or tool you are comfortable with.

To address this sentiment analysis assessment, I have chosen to evaluate two approaches:

1. **Lightweight Encoder Model (ModernBERT):** Using an encoder-only transformer model trained for sentiment classification. Specifically, I have decided to use [ModernBERT](https://arxiv.org/pdf/2412.13663), a next-generation encoder model (2024) that introduces several architectural advancements over the original BERT.
2. **Sentiment Classification with a Small LLM:** The second approach explores the use of a small-scale Large Language Model (LLM), in the range of roughly 1B to 4B parameters (Phi-4-mini-instruct, used below, has about 3.8B), to perform sentiment classification through prompt engineering.

In [3]:
# Lightweight Encoder Model (ModernBERT)
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class LightweightModelService:
    def __init__(self, model_name: str = "clapAI/modernBERT-base-multilingual-sentiment") -> None:
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        # Half precision is only used on GPU; float32 is the safer default on CPU
        dtype = torch.float16 if self.device == "cuda" else torch.float32

        # Load the tokenizer and model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = (
            AutoModelForSequenceClassification.from_pretrained(model_name, torch_dtype=dtype)
            .to(self.device)
            .eval()
        )

    def predict(self, movie_review: str) -> tuple[str, float]:
        # Truncate reviews that exceed the model's maximum sequence length
        inputs = self.tokenizer(movie_review, return_tensors="pt", truncation=True).to(self.device)

        with torch.inference_mode():
            outputs = self.model(**inputs)
            probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
            confidence, prediction = torch.max(probs, dim=-1)

        label_map = self.model.config.id2label
        sentiment = label_map[prediction.item()].lower()  

        # Map neutral to negative for binary compatibility
        if sentiment == "neutral":
            sentiment = "negative"

        return sentiment, round(confidence.item(), 4)
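
To make the post-processing in `predict` concrete, here is a minimal pure-Python sketch of the same steps: softmax over the logits, pick the most probable class, and fold `neutral` into `negative`. The logits and the three-label `example_id2label` mapping are made-up illustrative values, not output or configuration taken from the actual model.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_binary_sentiment(logits: list[float], id2label: dict[int, str]) -> tuple[str, float]:
    """Pick the most probable label and map 'neutral' to 'negative'."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = id2label[best].lower()
    if label == "neutral":
        label = "negative"
    return label, round(probs[best], 4)


# Hypothetical label mapping and logits, for illustration only.
example_id2label = {0: "negative", 1: "neutral", 2: "positive"}
print(to_binary_sentiment([0.2, 0.1, 2.5], example_id2label))
```

Note that after the `neutral` to `negative` mapping, the returned confidence is still the probability of the original `neutral` class, not a combined negative probability.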

In [4]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

In [None]:
import re
from typing import Any, cast

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline  # type: ignore

### Instructions:
# System-level prompt defining the assistant's behavior
def get_system_prompt() -> str:
    return """You are an expert in sentiment analysis of movie reviews. Your task is to evaluate the sentiment of the given review and classify it as "positive" or "negative", and provide a confidence between 0-1. Your analysis must consider both explicit and implicit sentiment cues while maintaining a focus on the target entity. 
    
    1. **Sentiment Classification Criteria**:  
    - **Positive**: The review expresses favorable opinions about the movie, highlighting its strengths or praising specific aspects.  
    - **Negative**: The review conveys unfavorable opinions, criticizing elements of the movie or expressing disappointment.

    2. **Handling Mixed Sentiments and Implicit Sentiment**:  
    - If both positive and negative elements exist, classify based on the **dominant sentiment**, considering intensity and frequency.  
    - Detect subtle tones, including **irony, implied sentiment, and framing biases** (e.g., selective comparisons, loaded phrases). 

    3. **Classification Constraints**:
        - Never return "neutral". If the sentiment is unclear or confidence is low, default to "negative". 

    4. **Output Format**:  
    - Return:
        classification: <positive|negative>  
        confidence: <float between 0 and 1>
    - Do not include explanations, additional text, or punctuation.
    """


def get_system_prompt_with_examples(examples: list[dict]) -> str:
    # Reuse the base system prompt and append the few-shot examples under their own heading
    base_prompt = get_system_prompt().rstrip() + "\n\n    ## Examples\n    "

    formatted_examples = ""
    for ex in examples:
        formatted_examples += (
            f"Review: {ex['text']}\n"
            f"classification: {ex['label']}\n"
        )

    return base_prompt + "\n" + formatted_examples.strip()


# User-level prompt defining the interaction model
def get_user_prompt(movie_review: str) -> str:
    return f"""Analyze the sentiment of the provided movie review: {movie_review}"""

class Phi4LLM:
    """
    A service class for interacting with the Phi-4-mini model.
    """

    def __init__(
        self, model_name: str = "microsoft/Phi-4-mini-instruct", temperature: float = 0.1, max_new_tokens: int = 500
    ) -> None:
        torch.random.manual_seed(0)
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.temperature = temperature
        self.max_new_tokens = max_new_tokens

        self.model_name = model_name

        self.initialize_model()

    def initialize_model(self) -> None:

        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_name,
            **self.get_model_kwargs(),
        ).to(self.device).eval()

        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)

        with torch.no_grad():
            self.completion_pipeline = pipeline(
                "text-generation",
                model=self.model,
                tokenizer=self.tokenizer,
            )

    def get_completions(self, movie_review: str, examples: "pd.DataFrame | None" = None) -> tuple[str, float]:
        """
        Classify a review, optionally using few-shot examples.
        `examples` is a DataFrame with `text` and `label` (0/1) columns.
        """
        if examples is not None and not examples.empty:
            few_shot_examples = [
                {
                    "text": row["text"],
                    "label": "positive" if row["label"] == 1 else "negative",
                }
                for _, row in examples.iterrows()
            ]
            system_prompt = get_system_prompt_with_examples(few_shot_examples)
        else:
            system_prompt = get_system_prompt()
            
        messages = [
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": get_user_prompt(movie_review),
            },
        ]

        generation_args = {
            "max_new_tokens": self.max_new_tokens,
            "return_full_text": False,
            "temperature": self.temperature,
            "do_sample": True,
        }

        with torch.no_grad():
            output = self.completion_pipeline(messages, **generation_args)

        return self.parse_llm_output(cast(str, output[0]["generated_text"]))
    
    def parse_llm_output(self, llm_output: str) -> tuple[str, float]:
        """
        Parse the LLM response to extract sentiment and confidence.
        """
        classification_match = re.search(r"classification:\s*(positive|negative|neutral)", llm_output, re.IGNORECASE)
        confidence_match = re.search(r"confidence:\s*([0-1](?:\.\d+)?)", llm_output)

        if classification_match and confidence_match:
            classification = classification_match.group(1).lower()
            confidence = float(confidence_match.group(1))

            # Map neutral to negative for binary compatibility
            if classification == "neutral":
                classification = "negative"
            
            return classification, confidence

        raise ValueError(f"Could not parse output: {llm_output}")

    def get_model_kwargs(self) -> dict[str, Any]:
        """
        Return default kwargs for huggingface model loading.
        """
        model_kwargs = {
            "device_map": self.device,
            "torch_dtype": "auto",
            "trust_remote_code": True,
        }
    
        # if self.device.type == "cuda":
        #     model_kwargs["attn_implementation"] = "flash_attention_2"

        return model_kwargs
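
The parsing step can be exercised without loading the model. The sketch below mirrors the regular expressions in `parse_llm_output` as a standalone function, with one difference: the confidence pattern is also matched case-insensitively. The function name and the sample strings are hypothetical and written for illustration; they are not captured model outputs.

```python
import re


def parse_sentiment_output(llm_output: str) -> tuple[str, float]:
    """Extract (label, confidence) from text shaped like the prompt's output format."""
    label_match = re.search(
        r"classification:\s*(positive|negative|neutral)", llm_output, re.IGNORECASE
    )
    conf_match = re.search(r"confidence:\s*([0-1](?:\.\d+)?)", llm_output, re.IGNORECASE)
    if not (label_match and conf_match):
        raise ValueError(f"Could not parse output: {llm_output!r}")

    label = label_match.group(1).lower()
    if label == "neutral":  # same binary mapping as the service class
        label = "negative"
    return label, float(conf_match.group(1))


# Hypothetical outputs, not captured from the model.
print(parse_sentiment_output("classification: positive\nconfidence: 0.92"))
print(parse_sentiment_output("Classification: Neutral\nConfidence: 0.4"))
```

Because unparseable text raises `ValueError`, callers such as the API layer need to decide whether to surface the error or fall back to a default label.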

## Task 2: API Implementation
Create a simple API using FastAPI that serves your solution. The API should accept a review text and return the sentiment analysis result.

Expected format:
```python
# Request
{
    "review_text": "This movie exceeded my expectations..."
}

# Response
{
    "sentiment": "positive",
    "confidence": 0.92,
    "similar_reviews": [
        {},
        {}
    ]
}
```
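
The `similar_reviews` field suggests returning labelled reviews that are close to the input. One way this could be sketched, using only the standard library, is bag-of-words cosine similarity over a small corpus. The function names, the toy corpus, and the choice of raw token counts are all assumptions for illustration; a TF-IDF index or an embedding model would likely work better on the real IMDB sample.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words token counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def find_similar_reviews(query: str, corpus: list[dict], top_k: int = 2) -> list[dict]:
    """Return the top_k corpus entries most similar to the query, with their scores."""
    query_vec = tokenize(query)
    scored = [
        {**item, "score": round(cosine(query_vec, tokenize(item["text"])), 4)}
        for item in corpus
    ]
    return sorted(scored, key=lambda item: item["score"], reverse=True)[:top_k]


# Toy corpus standing in for rows of the training sample (hypothetical texts).
toy_corpus = [
    {"text": "A wonderful film with great acting", "label": 1},
    {"text": "Terrible plot and dull acting", "label": 0},
    {"text": "The soundtrack was loud", "label": 0},
]
print(find_similar_reviews("great acting and a wonderful story", toy_corpus))
```

The returned dictionaries already have a shape that fits `similar_reviews: List[Dict]`, and the same neighbours could double as few-shot examples for the LLM prompt.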

In [None]:
from datetime import datetime
from fastapi import FastAPI, HTTPException

from pydantic import BaseModel
from typing import Optional, List, Dict

# Your API implementation here
class ClassificationInputRequest(BaseModel):
    review_text: str

class ClassificationResponse(BaseModel):
    sentiment: str # Should be positive or negative
    confidence: Optional[float] 
    similar_reviews: List[Dict]

class HealthCheckResponse(BaseModel):
    status: str
    version: str
    timestamp: str  

app = FastAPI()

lightweight_model = LightweightModelService()
llm_model = Phi4LLM()

@app.post("/sentiment-classification-lightweight", response_model=ClassificationResponse)
async def sentiment_predict_lightweight(request: ClassificationInputRequest):
    try:
        result = lightweight_model.predict(request.review_text)
        sentiment, confidence = result

        return ClassificationResponse(
            sentiment=sentiment,
            confidence=confidence,
            similar_reviews=[]
        )

    except Exception as e:
        # Unexpected failures during inference are server-side errors, not bad requests
        raise HTTPException(status_code=500, detail=str(e))

@app.post("/sentiment-classification-llm", response_model=ClassificationResponse)
async def sentiment_predict_llm(request: ClassificationInputRequest):
    try:
        result = llm_model.get_completions(request.review_text)
        sentiment, confidence = result

        return ClassificationResponse(
            sentiment=sentiment,
            confidence=confidence,
            similar_reviews=[]
        )
    except Exception as e:
        # Unexpected failures during inference are server-side errors, not bad requests
        raise HTTPException(status_code=500, detail=str(e))
    
@app.get("/health", response_model=HealthCheckResponse)
async def health_check():
    return HealthCheckResponse(
        status="healthy",
        version="1.0.0",
        timestamp=datetime.utcnow().isoformat()
    )

# Code to run a FastAPI server inside a Jupyter notebook

# import uvicorn
# import threading

# def run_api():
#     uvicorn.run(app, host="0.0.0.0", port=8000, reload=False)

# thread = threading.Thread(target=run_api, daemon=True)
# thread.start()

Device set to use cuda
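
Once the server is running (for example via the commented-out `uvicorn` snippet above), the endpoints can be called from any HTTP client. A minimal standard-library sketch follows. The base URL `http://localhost:8000` is an assumption that matches the port in that snippet, and the helper names `build_request` and `classify_review` are hypothetical.

```python
import json
import urllib.request


def build_request(review_text: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a POST request for the lightweight sentiment endpoint."""
    payload = json.dumps({"review_text": review_text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/sentiment-classification-lightweight",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify_review(review_text: str, base_url: str = "http://localhost:8000") -> dict:
    """Send the review to the API and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(review_text, base_url), timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


# Inspect the request without sending it (no server needed for this part).
request = build_request("This movie exceeded my expectations...")
print(request.full_url)
print(request.data.decode("utf-8"))
```

Calling `classify_review(...)` requires the API to be reachable at `base_url`; it should return a dictionary with the `sentiment`, `confidence`, and `similar_reviews` keys defined by `ClassificationResponse`.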


## Task 3: Testing and Performance
Evaluate your solution's performance on the test set. Include:
1. Accuracy metrics (precision, recall, F1-score)
2. Inference speed (average time per prediction)

Compare performance with and without using the example data to demonstrate any improvements.

In [None]:
import time
from sklearn.metrics import classification_report

# Your testing code here
def evaluate_model(model, model_type="lightweight"):
    y_true = []
    y_pred = []
    times = []

    for review, label in zip(test_df["text"], test_df["label"]):
        start = time.time()

        if model_type == "lightweight":
            pred_label, _ = model.predict(review)
        elif model_type == "llm_with_examples":
            few_shot_df = train_df.sample(n=3, random_state=123)
            pred_label, _ = model.get_completions(review, few_shot_df)
        else:
            pred_label, _ = model.get_completions(review)

        end = time.time()

        pred_bin = 1 if pred_label == "positive" else 0
        y_true.append(label)
        y_pred.append(pred_bin)
        times.append(end - start)

    print(f"\nEvaluation for {model_type} model")
    print(classification_report(y_true, y_pred, target_names=["negative", "positive"]))
    print(f"Avg inference time per review: {round(sum(times) / len(times), 4)} seconds")

In [None]:
# Prepare a small evaluation set to select the best model before running the complete evaluation.
eval_test_df = pd.DataFrame(dataset['test'])
test_df = eval_test_df.sample(n=50, random_state=42)

In [9]:
evaluate_model(lightweight_model, model_type="lightweight")


Evaluation for lightweight model
              precision    recall  f1-score   support

    negative       0.97      1.00      0.98        28
    positive       1.00      0.95      0.98        22

    accuracy                           0.98        50
   macro avg       0.98      0.98      0.98        50
weighted avg       0.98      0.98      0.98        50

Avg inference time per review: 0.4188 seconds


In [10]:
evaluate_model(llm_model, model_type="llm")

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset



Evaluation for llm model
              precision    recall  f1-score   support

    negative       0.80      1.00      0.89        28
    positive       1.00      0.68      0.81        22

    accuracy                           0.86        50
   macro avg       0.90      0.84      0.85        50
weighted avg       0.89      0.86      0.85        50

Avg inference time per review: 2.7652 seconds


In [13]:
evaluate_model(llm_model, model_type="llm_with_examples")


Evaluation for llm_with_examples model
              precision    recall  f1-score   support

    negative       0.90      0.96      0.93        28
    positive       0.95      0.86      0.90        22

    accuracy                           0.92        50
   macro avg       0.93      0.91      0.92        50
weighted avg       0.92      0.92      0.92        50

Avg inference time per review: 6.2335 seconds


## Evaluation Analysis
Based on the results from the small evaluation set, the **best performing and fastest model** is the `lightweight_model` (ModernBERT). In my view, this is expected, as it is a task-specific model trained explicitly for sentiment classification.

The LLM (`Phi-4-mini`), although not optimized for this task, still achieves good results. However, it is significantly slower and more resource-intensive, making it less suitable for production use in this context. 

Note that the few-shot prompting approach (LLM with examples) improves accuracy (0.86 to 0.92) and macro F1 (0.85 to 0.92) over the zero-shot version. This suggests that providing examples in the prompt helps the LLM better align with task expectations, although inference time more than doubles (2.77 s to 6.23 s per review).

Given these findings, I will now perform a full evaluation on the entire IMDB test set using only the `lightweight_model`, which is more appropriate for this use case.
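
With only 50 reviews, each accuracy figure carries a fairly wide margin of error. As a rough check, the sketch below computes approximate 95% Wilson score intervals for the three reported accuracies (0.98, 0.86, and 0.92 on n = 50). The helper name and the choice of interval are assumptions added here; they are not part of the original evaluation.

```python
import math


def wilson_interval(accuracy: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion observed on n samples."""
    denom = 1 + z**2 / n
    center = (accuracy + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Accuracies reported above on the 50-review subset.
reported = {"lightweight": 0.98, "llm": 0.86, "llm_with_examples": 0.92}
for name, acc in reported.items():
    low, high = wilson_interval(acc, n=50)
    print(f"{name}: {acc:.2f} (95% CI roughly {low:.2f} to {high:.2f})")
```

The zero-shot and few-shot intervals overlap, so the few-shot gain is best read as indicative until it is measured on a larger sample.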

In [15]:
eval_complete_test_df = pd.DataFrame(dataset['test'])
test_df = eval_complete_test_df.sample(n=25000, random_state=42)

evaluate_model(lightweight_model, model_type="lightweight")


Evaluation for lightweight model
              precision    recall  f1-score   support

    negative       0.98      0.99      0.98     12500
    positive       0.99      0.98      0.98     12500

    accuracy                           0.98     25000
   macro avg       0.98      0.98      0.98     25000
weighted avg       0.98      0.98      0.98     25000

Avg inference time per review: 0.0426 seconds
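
The per-review latency on the full run (0.0426 s) is roughly a tenth of the figure from the 50-review run (0.4188 s). One possible explanation, not measured here, is that one-off warm-up costs on the first calls were averaged over very few samples. The sketch below shows a timing helper that runs a few untimed warm-up calls first and reports the median alongside the mean; `time_predictions` and the stand-in `fake_predict` are hypothetical names.

```python
import statistics
import time
from typing import Callable, Sequence


def time_predictions(
    predict: Callable[[str], object], texts: Sequence[str], warmup: int = 3
) -> dict[str, float]:
    """Run `warmup` untimed calls, then time predict() on every text."""
    for text in texts[:warmup]:
        predict(text)  # warm-up calls, excluded from the statistics

    durations = []
    for text in texts:
        start = time.perf_counter()
        predict(text)
        durations.append(time.perf_counter() - start)

    return {
        "mean_s": statistics.mean(durations),
        "median_s": statistics.median(durations),
        "max_s": max(durations),
    }


def fake_predict(text: str) -> tuple[str, float]:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    return ("positive" if "good" in text else "negative", 0.5)


print(time_predictions(fake_predict, ["good film", "bad film", "good cast", "dull"]))
```

Passing `lightweight_model.predict` in place of `fake_predict` would apply the same measurement to the real model. A median that sits well below the mean would point to a few slow outlier calls.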


## Task 4: Deployment Strategy

1. Describe your deployment strategy considering:
   - Data storage and retrieval
   - Scalability
   - Resource requirements
   - Cost considerations

2. Create a simple Dockerfile to package your solution

In [None]:
# Write your deployment strategy here as a markdown cell

deployment_strategy = """
# Deployment Strategy

## Infrastructure

For a simple and portable setup, we will containerize the ModernBERT sentiment analysis API using Docker. The API itself will be built with FastAPI and Uvicorn. The idea is to package the model and the app in a Docker container, which ensures consistency across environments and makes it easy to deploy on any cloud or platform.
    - Deployment Target: We use Docker because the image can be deployed on a range of cloud services (AWS, Baseten, Hugging Face Inference Endpoints). A typical approach would involve AWS Elastic Container Service (ECS) or Google Cloud Run. These services can run the container directly without needing to manage servers, aligning with the goal of simplicity and portability.
    - Load Balancer: In a production setup, the container would be placed behind a load balancer. For instance, if deploying on AWS, an Application Load Balancer (ALB) would handle HTTPS termination and route traffic to the FastAPI container. If self-hosting, an Nginx reverse proxy could be used to handle TLS and route traffic to Uvicorn.

Overall, this infrastructure prioritizes simplicity and portability. Once the Docker image is built, deploying to AWS, GCP, or Render is as simple as pushing the image and letting the platform run it.
 
## Scalability Approach

To handle increasing traffic, the deployment will use horizontal scaling:

    - Horizontal Scaling with Load Balancing: As demand grows, we can run multiple instances of the Docker container across different nodes. A load balancer will distribute incoming requests to avoid overloading a single instance. For example, on AWS, this could be implemented via ECS tasks or EC2 instances behind an ALB; in Kubernetes, using a Service with autoscaled pods.
    - Autoscaling Strategies: We could define autoscaling rules based on CPU or GPU usage (e.g., add a new container when usage exceeds 70%). Platforms like Cloud Run, Baseten, and AWS Lambda also support scale-to-zero: if there is no traffic, instances scale down to zero to save cost and scale back up on incoming requests.

Note: Platforms like Baseten or Hugging Face Endpoints also handle autoscaling internally, so if operational simplicity is preferred, these are viable alternatives.

## Model & Data Storage

The ModernBERT model should be accessible at runtime:

    - Model Packaging: The model weights will be downloaded at build time and included in the Docker image using the Hugging Face `transformers` library. This ensures fast startup and avoids downloading weights during runtime.
    - Data Storage and Logging: The API is stateless and does not persist input/output data. Optionally, we can log request metadata and results for monitoring and future model refinement.
        - Logs: Log requests and responses (excluding sensitive data) for debugging or analytics.
        - Caching Layer: If certain inputs are very frequent, a cache could be used to speed up repeated queries. Given this task is quite fast, caching is not critical unless we encounter specific hot inputs.

## Resource & Cost Considerations

One of the reasons to choose ModernBERT (a lightweight model) is its small size and fast inference. This allows us to deploy it on low-cost infrastructure, so CPU-only instances are probably enough for cost efficiency. A small instance (e.g., 2 vCPU) should be sufficient for moderate traffic, although the latencies reported above were measured on a GPU, so CPU latency would need to be checked.
"""

print(deployment_strategy)

# Write your Dockerfile content
dockerfile_content = """
FROM python:3.10-slim

# Set working directory
WORKDIR /app

# Install uv (Python package manager)
RUN apt-get update && apt-get install -y curl && \\
    curl -LsSf https://astral.sh/uv/install.sh | sh

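# Add uv to PATH (recent installer versions use ~/.local/bin, older ones ~/.cargo/bin)
ENV PATH="/root/.local/bin:/root/.cargo/bin:$PATH"

# Copy project files and install dependencies
COPY pyproject.toml .
# Optional: only needed if the project has a lock file
COPY uv.lock .
RUN uv sync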

# Copy the actual app code
COPY . /app

# Expose FastAPI port
EXPOSE 8000

# Start the API server inside the environment created by `uv sync`
CMD ["uv", "run", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
"""

print("\nDockerfile:")
print(dockerfile_content)
print("Note: This Dockerfile assumes your FastAPI app is defined in a standalone Python file (e.g., `main.py`). If you're currently running everything in a Jupyter notebook, you'd need to extract the API logic into a file for deployment.")
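
The 70% utilization rule in the strategy can be turned into a replica count with the proportional formula documented for the Kubernetes Horizontal Pod Autoscaler: desired = ceil(current × observed / target). The sketch below applies it. The function name, the replica bounds, and the sample numbers are assumptions for illustration, and managed platforms may use different policies.

```python
import math


def desired_replicas(
    current_replicas: int,
    observed_utilization: float,
    target_utilization: float = 0.70,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Proportional scaling rule, clamped to [min_replicas, max_replicas]."""
    raw = math.ceil(current_replicas * observed_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


# Two containers at 90% utilization against a 70% target: scale out.
print(desired_replicas(current_replicas=2, observed_utilization=0.90))
# Four containers at 30% utilization: scale in.
print(desired_replicas(current_replicas=4, observed_utilization=0.30))
```

Setting `min_replicas=0` would approximate the scale-to-zero behaviour mentioned in the strategy, at the cost of cold-start latency on the first request.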



## Evaluation Criteria
- Implementation that can process reviews and return sentiments
- Use of extra data to improve predictions
- Proper API design
- Reasonable deployment strategy

Good luck!