# AI Engineer Technical Assessment

## Overview
Build an AI-powered solution for sentiment analysis of movie reviews that leverages the existing dataset to improve accuracy. This assessment is designed to be completed in 2-3 hours; we do NOT expect very detailed answers or long explanations.

## Notes
- AI assistance is allowed and, in fact, encouraged, with these caveats:
    - Concise explanations and simple code are preferred.
    - Solutions that use newer information and go beyond the LLM's knowledge cutoff date are valuable.
    - You must be able to explain the code you write here.

- Look up any information you need; copying and pasting code is allowed.
- Set up the environment as needed. You can use your local environment, Colab, or any other environment of your preference.
- Focus on working solutions; leave iteration and improvements for any extra time.

## Setup
The following cells will download and prepare the IMDB dataset. 

In [3]:
import pandas as pd
import numpy as np
from datasets import load_dataset

# Load IMDB dataset
dataset = load_dataset("imdb")
train_df = pd.DataFrame(dataset['train'])
test_df = pd.DataFrame(dataset['test'])

# Sample subset for quicker development
train_df = train_df.sample(n=5000, random_state=42)
test_df = test_df.sample(n=10, random_state=42)

print(f"Training samples: {len(train_df)}")
print(f"Test samples: {len(test_df)}")

# Display sample data
print("\nSample review:")
sample = train_df.iloc[0]
print(f"Text: {sample['text'][:200]}...")
print(f"Sentiment: {'Positive' if sample['label'] == 1 else 'Negative'}")

Training samples: 5000
Test samples: 10

Sample review:
Text: Dumb is as dumb does, in this thoroughly uninteresting, supposed black comedy. Essentially what starts out as Chris Klein trying to maintain a low profile, eventually morphs into an uninspired version...
Sentiment: Negative


In [4]:
train_df.head()

| | text | label |
|---|---|---|
| 6868 | Dumb is as dumb does, in this thoroughly unint... | 0 |
| 24016 | I dug out from my garage some old musicals and... | 1 |
| 9668 | After watching this movie I was honestly disap... | 0 |
| 13640 | This movie was nominated for best picture but ... | 1 |
| 14018 | Just like Al Gore shook us up with his painful... | 1 |


## Task 1: Model Implementation
Implement a solution that analyzes sentiment in movie reviews. This part is explicitly open-ended: explore ways to leverage the example dataset to enhance predictions. You can consider a pre-trained language model that can understand and generate text, external APIs, RAG systems, etc.
Feel free to use any library or tool you are comfortable with.

In [5]:
# Your implementation here
# Feel free to create additional cells as needed
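
One possible way to leverage the example dataset is retrieval: find the labeled reviews most similar to the input and let them vote. The sketch below is a minimal, standard-library-only illustration of that idea, not a recommended final solution. The names `tokenize`, `cosine`, and `predict_with_neighbors`, the bag-of-words similarity, and the toy `examples` list are all assumptions made for illustration; a real submission would more likely use embeddings or a pre-trained model.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters; a deliberately crude tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_with_neighbors(review_text, examples, k=3):
    """Majority vote over the k most similar labeled examples.

    examples: non-empty list of (text, label) pairs, label 1 = positive, 0 = negative.
    Returns (sentiment, confidence, similar_reviews).
    """
    if not examples:
        raise ValueError("at least one labeled example is required")
    query = Counter(tokenize(review_text))
    scored = [
        (cosine(query, Counter(tokenize(text))), text, label)
        for text, label in examples
    ]
    top = sorted(scored, key=lambda item: item[0], reverse=True)[:k]
    positive_votes = sum(label for _, _, label in top)
    sentiment = "positive" if positive_votes * 2 > len(top) else "negative"
    # Confidence here is only the share of neighbors agreeing with the vote.
    confidence = max(positive_votes, len(top) - positive_votes) / len(top)
    similar = [
        {"text": text, "label": label, "score": round(score, 3)}
        for score, text, label in top
    ]
    return sentiment, confidence, similar


# Toy labeled data standing in for rows of train_df (hypothetical examples).
examples = [
    ("a wonderful, moving film", 1),
    ("wonderful acting and a great story", 1),
    ("dull and boring plot", 0),
    ("boring, a waste of time", 0),
]
sentiment, confidence, similar = predict_with_neighbors(
    "A wonderful story with great acting", examples, k=3
)
print(sentiment, round(confidence, 2))
```

With the notebook's data, the `examples` list could be built from `zip(train_df["text"], train_df["label"])`. The returned neighbors also map naturally onto the `similar_reviews` field requested in Task 2.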

## Task 2: API Implementation
Create a simple API using FastAPI that serves your solution. The API should accept a review text and return the sentiment analysis result.

Expected format:
```python
# Request
{
    "review_text": "This movie exceeded my expectations..."
}

# Response
{
    "sentiment": "positive",
    "confidence": 0.92,
    "similar_reviews": [
        {},
        {}
    ]
}
```

In [6]:
from fastapi import FastAPI
from pydantic import BaseModel

# Your API implementation here
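
As a framework-agnostic sketch of the request and response contract above, the following uses standard-library dataclasses to validate and serialize the payload. In a FastAPI service the same fields would normally be declared on Pydantic models instead. The class names, `validate_request`, and the fields inside each similar review (`text`, `label`, `score`) are assumptions for illustration, since the expected format leaves those objects empty.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class SimilarReview:
    # Hypothetical fields: the expected format does not specify them.
    text: str
    label: str
    score: float


@dataclass
class SentimentResponse:
    sentiment: str
    confidence: float
    similar_reviews: List[SimilarReview] = field(default_factory=list)

    def __post_init__(self):
        # Reject values that would break the documented response shape.
        if self.sentiment not in ("positive", "negative"):
            raise ValueError(f"unexpected sentiment: {self.sentiment!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


def validate_request(payload: dict) -> str:
    """Return the stripped review text, or raise ValueError on a bad request."""
    text = payload.get("review_text")
    if not isinstance(text, str) or not text.strip():
        raise ValueError("review_text must be a non-empty string")
    return text.strip()


response = SentimentResponse(
    sentiment="positive",
    confidence=0.92,
    similar_reviews=[SimilarReview(text="Loved it", label="positive", score=0.81)],
)
payload = asdict(response)  # nested dataclasses become plain dicts
print(payload)
```

Keeping validation next to the schema means a malformed request or an out-of-range confidence fails early, before it reaches the client.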

## Task 3: Testing and Performance
Evaluate your solution's performance on the test set. Include:
1. Accuracy metrics (precision, recall, F1-score)
2. Inference speed (average time per prediction)

Compare performance with and without using the example data to demonstrate any improvements.

In [7]:
import time
from sklearn.metrics import classification_report

# Your testing code here

ModuleNotFoundError: No module named 'sklearn'
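
The cell above relies on scikit-learn's `classification_report`, which was not installed in the recorded run. As a fallback, the requested metrics and the average time per prediction can be computed by hand. The sketch below assumes binary labels (1 = positive, 0 = negative) and non-empty inputs; `binary_metrics` and `average_latency` are hypothetical helper names, and the label lists are made-up values, not results from the dataset.

```python
import time


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def average_latency(predict_fn, texts):
    """Average wall-clock seconds per call of predict_fn over texts."""
    start = time.perf_counter()
    for text in texts:
        predict_fn(text)
    return (time.perf_counter() - start) / len(texts)


# Made-up labels purely to exercise the functions.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
latency = average_latency(lambda text: len(text), ["good", "bad"])
print(metrics, latency)
```

Running the same two functions once for a predictor that uses the example data and once for one that does not gives the comparison the task asks for.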

## Task 4: Deployment Strategy

1. Describe your deployment strategy considering:
   - Data storage and retrieval
   - Scalability
   - Resource requirements
   - Cost considerations

2. Create a simple Dockerfile to package your solution

In [None]:
# Write your deployment strategy here as a markdown cell
deployment_strategy = """
# Deployment Strategy

## Infrastructure
...

## Scalability Approach
...

## Model & Data Storage
...

## Resource & Cost Considerations
...
"""

print(deployment_strategy)

# Write your Dockerfile content
dockerfile_content = """
# Your Dockerfile here
...
"""

print("\nDockerfile:")
print(dockerfile_content)

## Evaluation Criteria
- Implementation that can process reviews and return sentiments
- Use of extra data to improve predictions
- Proper API design
- Reasonable deployment strategy

Good luck!