# Mistral NeMo: A Comparative Analysis of Text Models

This notebook provides a comprehensive comparison of Mistral NeMo with Mistral 7B and Mixtral 8x7B, three advanced language models developed by Mistral AI. Our primary objective is to evaluate Mistral NeMo's performance, identify its strengths and limitations, and establish best practices for integrating it into workflows that require accurate and efficient natural language processing.

To accomplish this, we will conduct a series of controlled tests and qualitative assessments, utilizing appropriate APIs and inference endpoints for each model. We will explore model outputs on various natural language processing tasks, including but not limited to:

- General text generation and completion
- Question answering
- Sentiment analysis
- Text summarization
- Language translation

Additionally, we will employ a judging model to systematically evaluate and rank the quality of responses from each model.

Through this process, the notebook will:

- Demonstrate how to efficiently use Mistral NeMo's endpoints for real-time inference.
- Compare Mistral NeMo's capabilities to Mistral 7B and Mixtral 8x7B using standardized prompts and test datasets.
- Help you understand the relative advantages of Mistral NeMo, guiding you in deciding when and how to deploy it in your own applications.

We have included licensing details and quick-start references for further exploration. By the end of this analysis, you should have a clear perspective on Mistral NeMo's performance profile and actionable insights into optimizing its use in your specific scenarios.

All example outputs have been preserved in this notebook, allowing you to review the results without needing to run the code on your own instance or pay for compute costs.

## Model Overview
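- **Mistral NeMo**: A state-of-the-art 12-billion-parameter multilingual model with a 128k-token context length
- **Mistral 7B**: A 7-billion-parameter model known for its efficiency and performance
- **Mixtral 8x7B**: A sparse mixture-of-experts model with 8 experts per layer; because the experts share the attention layers, it has about 47 billion parameters in total, of which roughly 13 billion are active per token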


By conducting these comparisons, we aim to provide a clear understanding of how Mistral NeMo stands in relation to its predecessors and guide users in selecting the most appropriate model for their specific use cases.

## Model License

- **License:** Apache 2.0 - Mistral NeMo

## Getting Started

Instructions for deploying Mistral NeMo from Bedrock Marketplace, along with an overview of its capabilities, can be found in the [Deploy-Mistral-NeMo-from-Bedrock-Marketplace-and-its-Capabilities.ipynb](https://github.com/aws-samples/mistral-on-aws/blob/main/notebooks/Deploy-Mistral-NeMo-from-Bedrock-Marketplace-and-its-Capabilities.ipynb) notebook.

Want to learn more about Mistral NeMo?  
[Mistral AI Blog](https://mistral.ai/news/mistral-nemo/)  
[NVIDIA Model Card](https://build.nvidia.com/nv-mistralai/mistral-nemo-12b-instruct/modelcard)  
[AWS Blog](https://aws.amazon.com/blogs/machine-learning/mistral-nemo-instruct-2407-and-mistral-nemo-base-2407-are-now-available-on-sagemaker-jumpstart/)


## Installation and configuration

In this section, we install the Python modules needed by the notebook.

In [1]:
%pip install --upgrade pip
%pip install botocore boto3 sagemaker --upgrade --quiet

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Note: you may need to restart the kernel to use updated packages.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
autogluon-multimodal 1.1.1 requires nvidia-ml-py3==7.352.0, which is not installed.
aiobotocore 2.13.3 requires botocore<1.34.163,>=1.34.70, but you have botocore 1.36.6 which is incompatible.
amazon-sagemaker-sql-magic 0.1.3 requires sqlparse==0.5.0, but you have sqlparse 0.5.3 which is incompatible.
autogluon-core 1.1.1 requires scikit-learn<1.4.1,>=1.3.0, but you have scikit-learn 1.5.2 which is incompatible.
autogluon-core 1.1.1 requires scipy<1.13,>=1.5.4, but you have scipy 1.14.1 which is incompatible.
autogluon-features 1.1.1 requires scikit-learn<1.4.1,>=1.3.0, but you have scikit-learn 1.5.2 which is incompatible.
autogluon-multimodal 1.1.1 requires js

In [2]:
import boto3
import json
import sagemaker
import time
from botocore.exceptions import ClientError
from sagemaker.djl_inference import DJLModel



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


In [3]:
# Colors to display information
RESET = "\033[0m"
GREEN = "\033[38;5;29m"
BLUE = "\033[38;5;43m"
ORANGE = "\033[38;5;208m"
PURPLE = "\033[38;5;93m"
RED = "\033[38;5;196m"

In [4]:
# Set configuration
MODEL_SOURCE_ID = 'huggingface-llm-mistral-nemo-instruct-2407'
MODEL_SOURCE_ARN = 'arn:aws:sagemaker:{region}:aws:hub-content/SageMakerPublicHub/Model/huggingface-llm-mistral-nemo-instruct-2407/1.0.3'
INSTANCE_TYPE = 'ml.g6.24xlarge'
ENDPOINT_NAME = 'nemo-on-bedrock'

MISTRAL_7B_MODEL_ID = 'mistral.mistral-7b-instruct-v0:2'
MIXTRAL_8x7B_MODEL_ID = 'mistral.mixtral-8x7b-instruct-v0:1'
JUDGE_MODEL_ID = 'anthropic.claude-3-5-sonnet-20241022-v2:0'

In [5]:
# Return the AWS account ID, region, and SageMaker execution role for the current session
def get_current_session_info():
    sagemaker_role_arn = sagemaker.get_execution_role()
    session = sagemaker.Session()
    account_id = session.account_id()
    # Use the public attribute rather than the private _region_name
    region = session.boto_region_name

    return account_id, region, sagemaker_role_arn

aws_account_id, aws_region, sagemaker_role_arn = get_current_session_info()

print(f'aws region: {aws_region}')

MODEL_SOURCE_ARN = MODEL_SOURCE_ARN.format(region=aws_region)

aws region: us-west-2


In [6]:
# Create bedrock client object
bedrock_client = boto3.client('bedrock')

In [7]:
# create bedrock marketplace endpoint
def create_endpoint(model_source_arn: str, 
                    endpoint_name: str,
                    instance_type: str, 
                    instance_count: int = 1):

    response = bedrock_client.create_marketplace_model_endpoint(
            modelSourceIdentifier=model_source_arn,
            endpointConfig={
                'sageMaker': {
                    'initialInstanceCount': instance_count,
                    'instanceType': instance_type,
                    'executionRole': sagemaker_role_arn,
                }
            },
            acceptEula=True,
            endpointName=endpoint_name
        )
    return response

create_response = create_endpoint(model_source_arn=MODEL_SOURCE_ARN, endpoint_name=ENDPOINT_NAME, instance_type=INSTANCE_TYPE)

In [8]:
# Retrieve endpoint arn from response text

endpoint_arn = create_response['marketplaceModelEndpoint']['endpointArn']
MISTRAL_NEMO_MODEL_ID = endpoint_arn

In [9]:
# Poll the endpoint status every 30 seconds until it is no longer 'Creating'

while True:
    endpoint_response = bedrock_client.get_marketplace_model_endpoint(endpointArn=endpoint_arn)
    status = endpoint_response['marketplaceModelEndpoint']['endpointStatus']
    print(f'endpoint status: {status}')
    if status != 'Creating':
        break

    # wait for 30 seconds before polling again
    time.sleep(30)

endpoint status: Creating
endpoint status: Creating
endpoint status: Creating
endpoint status: Creating
endpoint status: Creating
endpoint status: Creating
endpoint status: Creating
endpoint status: Creating
endpoint status: Creating
endpoint status: Creating
endpoint status: Creating
endpoint status: Creating
endpoint status: InService
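
The polling loop above runs until the status changes, however long that takes. As a minimal sketch (not part of the original notebook), the same idea can be wrapped in a helper with a timeout. The name `wait_for_endpoint`, its parameters, and the 30-minute default are assumptions chosen for illustration; the status lookup is passed in as a callable so the helper does not depend on a live endpoint.

```python
import time
from typing import Callable


def wait_for_endpoint(get_status: Callable[[], str],
                      poll_seconds: float = 30.0,
                      timeout_seconds: float = 1800.0,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll get_status() until it stops returning 'Creating' or the timeout expires.

    Hypothetical helper: returns the first status that is not 'Creating',
    so the caller still has to check whether it is 'InService'.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        print(f"endpoint status: {status}")
        if status != "Creating":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"endpoint still 'Creating' after {timeout_seconds} seconds"
            )
        sleep(poll_seconds)


# Possible usage with the notebook's client (requires a real endpoint):
# final_status = wait_for_endpoint(
#     lambda: bedrock_client.get_marketplace_model_endpoint(
#         endpointArn=endpoint_arn
#     )['marketplaceModelEndpoint']['endpointStatus']
# )
```

Returning the final status, rather than assuming success, lets the caller stop early if the endpoint ends up in a state other than `InService`.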


## Helper Functions

In [11]:
# Create bedrock runtime object

bedrock_runtime = boto3.client("bedrock-runtime")

In [10]:
# Helper function to invoke Bedrock model using invoke APIs
def invoke_model(model_id: str, prompt: str, display_usage=False):
    
    prompt = f"<s>[INST] {prompt} [/INST]"

    body = json.dumps({
        "prompt": prompt,
        "max_tokens": 2000,
        "temperature": 0.7,
        "top_p": 0.7,
        "top_k": 50
    })
    accept = 'application/json'
    contentType = 'application/json'
    response = bedrock_runtime.invoke_model(body=body,
                                            modelId=model_id,
                                            accept=accept,
                                            contentType=contentType)
    
    response_body = json.loads(response.get('body').read())
    outputs = response_body.get('outputs', [])
    # Concatenate the text of every returned output
    return ''.join(output['text'] for output in outputs)

In [12]:
# Helper function to invoke Bedrock model using invoke APIs and message format
def invoke_model_with_message_format(model_id: str, prompt: str, display_usage=False):
    
    payload = {
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ],
        "max_tokens": 2000,
        "temperature": 0.7,
        "top_p": 0.7,
        "top_k": 50
    }

    body = json.dumps(payload)
    accept = 'application/json'
    contentType = 'application/json'
    response = bedrock_runtime.invoke_model(body=body,
                                            modelId=model_id,
                                            accept=accept,
                                            contentType=contentType)
    
    response = json.loads(response.get('body').read())
    response = response['choices'][0]['message']['content']
    return response

Next, we'll use an LLM as a "judge" to assess the quality of each response. While this automated evaluation can provide useful insights, it's important to complement it with human judgment to ensure the chosen response aligns with your specific goals. If all three outputs seem equally strong, your personal criteria and preferences will help make the final decision.

For this demo, we’ll use Claude 3.5 Sonnet as the judge. We’ll present the original task prompt along with the three responses to determine which one is the most accurate and helpful.

In [13]:
# Helper function to invoke judge model using invoke APIs
def invoke_judge_model(model_id: str, prompt: str, display_usage=False):
    
    payload = {
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ],
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2000,
        "temperature": 0.7,
        "top_p": 0.7,
        "top_k": 50
    }

    body = json.dumps(payload)
    accept = 'application/json'
    contentType = 'application/json'
    response = bedrock_runtime.invoke_model(body=body,
                                            modelId=model_id,
                                            accept=accept,
                                            contentType=contentType)
    
    response = json.loads(response.get('body').read())

    # Claude's messages API returns a list of content blocks; take the first block's text
    return response['content'][0]['text']
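
The three helpers above each read a different response shape: `outputs[].text` for Mistral 7B and Mixtral 8x7B, `choices[0].message.content` for the Mistral NeMo endpoint, and `content[0].text` for the judge. As a rough sketch, that parsing could be gathered in one place. The function name `extract_text` is hypothetical, and the shapes are only those visible in the helper code in this notebook.

```python
def extract_text(response_body: dict) -> str:
    """Return the generated text from one of the three response shapes seen above.

    Hypothetical helper; the key names mirror the parsing done by the
    notebook's invoke_* functions.
    """
    if "outputs" in response_body:
        # Shape parsed by invoke_model (Mistral 7B / Mixtral 8x7B)
        return "".join(output["text"] for output in response_body["outputs"])
    if "choices" in response_body:
        # Shape parsed by invoke_model_with_message_format (Mistral NeMo)
        return response_body["choices"][0]["message"]["content"]
    if "content" in response_body:
        # Shape parsed by invoke_judge_model (Claude messages format)
        return response_body["content"][0]["text"]
    raise ValueError(f"unrecognized response keys: {sorted(response_body)}")


# Example with hand-written dictionaries (no endpoint call involved):
print(extract_text({"outputs": [{"text": "Hello "}, {"text": "world"}]}))
```

Raising on an unknown shape, instead of returning an empty string, makes a changed response format visible immediately.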

In [24]:
def evaluate_responses(task: str, mistral7b_response: str, mixtral8x7b_response: str, nemo_response: str):

    evaluation_prompt = f"""
        You are evaluating output of 3 models for a given task. Please evaluate which model produced the best output and explain why.
        
        Task: {task}
        
        Model A (Mistral 7b): {mistral7b_response}
        
        Model B (Mixtral 8x7b): {mixtral8x7b_response}

        Model C (Mistral NeMo): {nemo_response}
        
        Which model provided the best output? Please explain your reasoning and declare a winner."""

    judge_response = invoke_judge_model(
        model_id=JUDGE_MODEL_ID,
        prompt=evaluation_prompt
    )
    print(f"{RED}Judge's Evaluation:{RESET}")
    print(f"{GREEN}{judge_response}{RESET}")
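
`evaluate_responses` prints the judge's free-form verdict. If the verdicts were to be tallied across tasks, the declared winner would need to be pulled out of that text. The following is a minimal sketch under two assumptions: the judge text is available as a string, and it contains a line such as `Winner: Model C`. The prompt asks the judge to declare a winner but does not enforce a format, so the hypothetical `parse_winner` returns `None` when no such line is found.

```python
import re
from typing import Optional

# Matches "Winner: Model A", "WINNER: Model C (Mistral NeMo)", and similar
_WINNER_PATTERN = re.compile(r"winner\s*:\s*model\s+([ABC])\b", re.IGNORECASE)


def parse_winner(judge_text: str) -> Optional[str]:
    """Return 'A', 'B', or 'C' if the judge text names a winner, else None."""
    match = _WINNER_PATTERN.search(judge_text)
    return match.group(1).upper() if match else None


# Example on a hand-written verdict string:
print(parse_winner("Reasoning follows.\nWinner: Model A (Mistral 7b)"))
```

A regular expression is a deliberately simple choice here; asking the judge for structured output would be more robust but changes the prompt used in this notebook.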

## Comparative Analysis

### Summarization

Text summarization is a common use case for large language models, and Mistral models have performed well in this area. They are effective at condensing long texts into clear and concise summaries, making them valuable for tasks that require quick and accurate information extraction.

In [16]:
prompt = '''You are a technical writer. Summarize the following technical content:


In recent years, machine learning (ML) has moved from research and development to the mainstream, driven by the increasing number of data sources and scalable cloud-based compute resources. AWS’ customers currently use AI/ML for a wide variety of applications such as call center operations, personalized recommendations, identifying fraudulent activities, social media content moderation, audio and video content analysis, product design services, and identity verification. Industries using AI/ML include healthcare and life sciences, industrial and manufacturing, financial services, media and entertainment, and telecom.

Machine learning, through its use of algorithms to find patterns in data, can bring considerable power to its customers and thus recommends responsibility in its use. AWS is committed to developing fair and accurate AI and ML services and providing you with the tools and guidance needed to build AI and ML applications responsibly. For more information on this important topic, refer to AWS' Responsible AI.

This whitepaper provides you with a set of proven best practices. You can apply this guidance and architectural principles when designing your ML workloads, and after your workloads have entered production as part of continuous improvement. Although the guidance is cloud- and technology-agnostic, the paper also includes guidance and resources to help you implement these best practices on AWS.

The AWS Well-Architected Framework helps you understand the benefits and risks of decisions you make while building workloads on AWS. By using the Framework, you learn operational and architectural best practices for designing and operating workloads in the cloud. It provides a way to consistently measure your operations and architectures against best practices and identify areas for improvement.

Your ML models depend on the quality of input data to generate accurate results. As data changes with time, monitoring is required to continually detect, correct, and mitigate issues with accuracy and performance. This monitoring step might require you to retrain your model over time using the latest refined data.

While application workloads rely on step-by-step instructions to solve a problem, ML workloads enable algorithms to learn from data through an iterative and continuous cycle. The ML Lens complements and builds upon the Well-Architected Framework to address this difference between these two types of workloads.

This paper is intended for those in a technology role, such as chief technology officers (CTOs), architects, developers, data scientists, and ML engineers. After reading this paper, you will understand the best practices and strategies to use when you design and operate ML workloads on AWS.

'''


mistral7b_output = invoke_model(MISTRAL_7B_MODEL_ID, prompt)
mixtral8x7b_output = invoke_model(MIXTRAL_8x7B_MODEL_ID, prompt)
nemo_output = invoke_model_with_message_format(MISTRAL_NEMO_MODEL_ID, prompt)

print(f"{RED}#### Mistral 7b Response:{RESET}")
print(f"{BLUE}{mistral7b_output}{RESET}\n\n")

print(f"{RED}#### Mixtral 8x7b Response:{RESET}")
print(f"{BLUE}{mixtral8x7b_output}{RESET}\n\n")

print(f"{RED}#### Mistral NeMo Response:{RESET}")
print(f"{BLUE}{nemo_output}{RESET}")

#### Mistral 7b Response:
Machine learning (ML) has become mainstream due to the abundance of data and scalable cloud resources. AWS customers utilize ML for various applications such as call center operations, recommendations, fraud detection, content moderation, analysis, design services, and identity verification across industries like healthcare, industrial manufacturing, financial services, media, and telecom. AWS is dedicated to creating fair and accurate AI/ML services and providing resources for responsible use.

This whitepaper offers best practices for designing and operating ML workloads, applicable to both cloud-agnostic and AWS environments. The AWS Well-Architected Framework helps understand the benefits and risks of decisions made while building workloads on AWS, ensuring operational and architectural best practices.

ML models rely on high-quality input data for accurate results, necessitating continuous monitoring and potential retraining as d

In [25]:
evaluate_responses(task=prompt,
                    mistral7b_response=mistral7b_output,
                    mixtral8x7b_response=mixtral8x7b_output,
                    nemo_response=nemo_output)

Judge's Evaluation:
Let me analyze each model's output based on key criteria:

1. Completeness of Information:
- Model A covers all major points but is more condensed
- Model B misses some specific details about applications
- Model C uses bullet points and headers, making key information easily digestible and covers all major themes

2. Organization:
- Model A presents information in traditional paragraph form, flowing logically
- Model B follows a similar paragraph structure but is less detailed
- Model C uses clear headers and bullet points, making it more scannable and organized

3. Technical Accuracy:
All three models maintain technical accuracy, but Model C's structured format helps prevent any confusion between concepts

4. Clarity and Readability:
- Model A is clear but dense
- Model B is concise but might be too brief
- Model C's bullet-point format with headers makes it easiest to read and reference

WINNER: Model C (Mistral NeMo)

Reasoning:
Model C 
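
The invoke-and-print pattern used for the summarization task recurs for every task in this notebook. As a sketch only, it could be factored into a small helper. The name `run_comparison` and the idea of passing a dictionary of callables are assumptions made for illustration; the callables stand in for the notebook's `invoke_model` and `invoke_model_with_message_format` helpers, which keeps the sketch runnable without an endpoint.

```python
from typing import Callable, Dict


def run_comparison(prompt: str,
                   invokers: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    """Send the same prompt to every model and print each labelled response."""
    outputs: Dict[str, str] = {}
    for label, invoke in invokers.items():
        outputs[label] = invoke(prompt)
        print(f"#### {label} Response:")
        print(f"{outputs[label]}\n")
    return outputs


# Possible wiring to the notebook's helpers (requires live endpoints):
# outputs = run_comparison(prompt, {
#     "Mistral 7b": lambda p: invoke_model(MISTRAL_7B_MODEL_ID, p),
#     "Mixtral 8x7b": lambda p: invoke_model(MIXTRAL_8x7B_MODEL_ID, p),
#     "Mistral NeMo": lambda p: invoke_model_with_message_format(MISTRAL_NEMO_MODEL_ID, p),
# })
```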

### Code Generation

Mistral LLMs are increasingly capable of generating high-quality code in various programming languages based on user prompts. These models can assist with tasks ranging from writing simple functions to generating complex algorithms, saving time and reducing errors. By understanding context and logic, these LLMs can also help debug code, suggest optimizations, and provide documentation, making them valuable tools for developers. In this section, we test code generation capabilities of Mistral models.

In [26]:
prompt = '''
You are a software engineer with expertise on React.

Create a React component that calculates body mass index.
'''

mistral7b_output = invoke_model(MISTRAL_7B_MODEL_ID, prompt)
mixtral8x7b_output = invoke_model(MIXTRAL_8x7B_MODEL_ID, prompt)
nemo_output = invoke_model_with_message_format(MISTRAL_NEMO_MODEL_ID, prompt)

print(f"{RED}#### Mistral 7b Response:{RESET}")
print(f"{BLUE}{mistral7b_output}{RESET}\n\n")

print(f"{RED}#### Mixtral 8x7b Response:{RESET}")
print(f"{BLUE}{mixtral8x7b_output}{RESET}\n\n")

print(f"{RED}#### Mistral NeMo Response:{RESET}")
print(f"{BLUE}{nemo_output}{RESET}")


#### Mistral 7b Response:
I'd be happy to help you create a simple React component for calculating Body Mass Index (BMI). Here's an example of how you might implement this:

```javascript
import React, { useState } from 'react';

const BmiCalculator = () => {
  const [weight, setWeight] = useState('');
  const [height, setHeight] = useState('');
  const [bmi, setBmi] = useState(0);

  const calculateBmi = () => {
    if (weight && height) {
      const calculatedBmi = parseFloat(weight) / (parseFloat(height) * parseFloat(height)) * 7030;
      setBmi(calculatedBmi.toFixed(1));
    }
  };

  return (
    <div>
      <h1>Body Mass Index (BMI) Calculator</h1>
      <label htmlFor="weight">Weight (kg):</label>
      <input
        type="number"
        id="weight"
        value={weight}
        onChange={(e) => setWeight(e.target.value)}
      />
      <label htmlFor="height">Height (cm):</label>
      <input
        type="number"
        id="height"
        value

In [27]:
evaluate_responses(task=prompt,
                    mistral7b_response=mistral7b_output,
                    mixtral8x7b_response=mixtral8x7b_output,
                    nemo_response=nemo_output)

Judge's Evaluation:
Let me analyze each model's output based on several key criteria:

1. Functionality:
- Model A: Basic BMI calculation with input validation
- Model B: Basic BMI calculation with proper height conversion
- Model C: Advanced BMI calculation with categorization and proper form handling

2. Code Quality:
- Model A: Has a calculation error (multiplies by 7030)
- Model B: Clean code with correct BMI formula
- Model C: Most comprehensive with proper error handling and form submission

3. User Experience:
- Model A: Basic input and output
- Model B: Basic input and output with better formatting
- Model C: Enhanced UX with:
  - Form submission
  - BMI categories
  - Step attribute for weight input
  - Input validation
  - Clear feedback

4. Technical Implementation:
- Model C stands out with:
  - Form handling with preventDefault()
  - Proper type conversion with parseFloat
  - Input validation
  - Comprehensive state management
  - BMI categorizatio
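
The judge flags the `* 7030` multiplier in Model A's code as a calculation error. For reference, body mass index is defined as weight in kilograms divided by the square of height in metres, so with height entered in centimetres the only conversion needed is dividing the height by 100. The short Python sketch below works through that arithmetic; the function name `bmi_from_metric` is illustrative and does not come from any of the model outputs.

```python
def bmi_from_metric(weight_kg: float, height_cm: float) -> float:
    """Compute BMI = weight (kg) / height (m)^2 from kilograms and centimetres."""
    if weight_kg <= 0 or height_cm <= 0:
        raise ValueError("weight and height must be positive")
    height_m = height_cm / 100.0  # convert centimetres to metres
    return weight_kg / (height_m ** 2)


# Worked example: 70 kg at 175 cm is 70 / 1.75^2 = 70 / 3.0625
print(round(bmi_from_metric(70, 175), 1))  # prints 22.9
```

A reference calculation like this gives a quick way to spot-check the numeric output of any generated component, independently of the judge's assessment.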

### Email Drafting

Mistral LLMs are highly effective at drafting professional emails, offering users a quick way to compose clear and concise messages. They can adapt to various tones—whether formal, casual, or persuasive—ensuring the right communication style for any situation. Additionally, these LLMs can help structure emails, suggest relevant content, and even proofread for grammar and clarity, making them invaluable tools for efficient communication. In this section, we evaluate email drafting capabilities of Mistral LLMs.

In [36]:
prompt = '''
You are a skilled customer support agent known for your polite and helpful communication. Your role involves revising poorly written emails to ensure they are clear, professional, and appropriately toned.


hi, this is to inform u that you that u don't qualify for this offer.

rgds,
support team
'''

mistral7b_output = invoke_model(MISTRAL_7B_MODEL_ID, prompt)
mixtral8x7b_output = invoke_model(MIXTRAL_8x7B_MODEL_ID, prompt)
nemo_output = invoke_model_with_message_format(MISTRAL_NEMO_MODEL_ID, prompt)

print(f"{RED}#### Mistral 7b Response:{RESET}")
print(f"{BLUE}{mistral7b_output}{RESET}\n\n")

print(f"{RED}#### Mixtral 8x7b Response:{RESET}")
print(f"{BLUE}{mixtral8x7b_output}{RESET}\n\n")

print(f"{RED}#### Mistral NeMo Response:{RESET}")
print(f"{BLUE}{nemo_output}{RESET}")


#### Mistral 7b Response:
Subject: Regrettably, Your Application for the Offer Did Not Meet the Requirements

Dear Valued Customer,

I hope this message finds you well. I regret to inform you that based on the current eligibility criteria, your application for the offer did not meet the necessary qualifications.

We appreciate your interest in our program and value your continued support. If you have any questions or concerns regarding this decision, please do not hesitate to contact us. Our team is always here to help and provide clarification.

Thank you for choosing us as your preferred service provider. We look forward to the opportunity to serve you in the future.

Best regards,
Your Support Team[0m


[38;5;196m#### Mixtral 8x7b Response:[0m
[38;5;43m Dear [Customer],

Thank you for your interest in our offer. After reviewing your information, I regret to inform you that you do not meet the qualifications for this promotion.

We appreciate your unders

In [37]:
evaluate_responses(task=prompt,
                    mistral7b_response=mistral7b_output,
                    mixtral8x7b_response=mixtral8x7b_output,
                    nemo_response=nemo_output)

Judge's Evaluation:
Let me analyze each response based on key criteria for professional customer service communication:

1. Professionalism & Structure
2. Clarity of message
3. Tone & Empathy
4. Completeness

Model A (Mistral 7b):
+ Very professional and well-structured
+ Comprehensive with clear opening and closing
+ Shows empathy and maintains positive tone
+ Includes subject line
+ Offers clear next steps
+ Maintains relationship-building elements

Model B (Mixtral 8x7b):
+ Clear and concise
+ Professional tone
+ Includes next steps
+ Somewhat brief but complete
- Less empathetic compared to A and C

Model C (Mistral NeMo):
+ Professional and well-structured
+ Shows empathy
+ Includes placeholder for company details
+ Good balance of information
- Missing subject line

Winner: Model A (Mistral 7b)

Reasoning:
Model A provides the most complete and professional response while maintaining the best balance of all desired elements. It stands out for:
1. Includin

### Sentiment Analysis

Mistral LLMs excel at sentiment analysis by identifying the tone and emotions expressed in text, whether positive, negative, or neutral. They can analyze customer reviews, social media posts, or any form of written content to gauge sentiment, providing valuable insights for businesses and researchers. With their ability to understand context and nuance, these LLMs offer a powerful tool for monitoring brand perception and customer feedback. In this section, we evaluate sentiment analysis capabilities of Mistral models.
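
When a sentiment prompt asks for one of a fixed set of categories, models may still answer with a bare label, a full sentence, or a formatted explanation. To compare such answers programmatically, the label has to be normalized first. The sketch below is one simple way to do that, assuming the categories Positive, Negative, and Neutral; the function name `extract_sentiment_label` is hypothetical, and taking the first mentioned label is a heuristic that can misfire on phrasing such as "not negative".

```python
import re
from typing import Optional

# First standalone occurrence of one of the three category words, any casing
_LABEL_PATTERN = re.compile(r"\b(positive|negative|neutral)\b", re.IGNORECASE)


def extract_sentiment_label(model_output: str) -> Optional[str]:
    """Return 'Positive', 'Negative', or 'Neutral' based on the first label mentioned."""
    match = _LABEL_PATTERN.search(model_output)
    return match.group(1).capitalize() if match else None


# Examples on hand-written strings in different answer styles:
for text in [" Negative", 'I would classify it as "Neutral".', "The sentiment is overwhelmingly positive."]:
    print(extract_sentiment_label(text))
```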

In [30]:
prompt = '''You are a language model trained to classify customer feedback into categories based on sentiment. 

Customer Review: I recently purchased this product, and I must say, I'm really impressed with it. The build quality is top-notch, and it performs exactly as advertised. 
The setup was easy, and I had no issues at all. However, I did encounter a small delay in delivery, which was a bit frustrating, but it was handled well by the customer service team. 
Overall, I'm happy with my purchase and would definitely recommend it to others. It's been a great addition to my routine!"

Task: classify customer review into one of the following categories: Positive, Negative, Neutral
'''

mistral7b_output = invoke_model(MISTRAL_7B_MODEL_ID, prompt)
mixtral8x7b_output = invoke_model(MIXTRAL_8x7B_MODEL_ID, prompt)
nemo_output = invoke_model_with_message_format(MISTRAL_NEMO_MODEL_ID, prompt)

print(f"{RED}#### Mistral 7b Response:{RESET}")
print(f"{BLUE}{mistral7b_output}{RESET}\n\n")

print(f"{RED}#### Mixtral 8x7b Response:{RESET}")
print(f"{BLUE}{mixtral8x7b_output}{RESET}\n\n")

print(f"{RED}#### Mistral NeMo Response:{RESET}")
print(f"{BLUE}{nemo_output}{RESET}")

#### Mistral 7b Response:
Based on the given customer review, the sentiment can be classified as Positive. The customer expressed their satisfaction with the product's build quality, performance, and ease of setup. Although they mentioned a delay in delivery, they acknowledged that the customer service team handled it well. The overall tone of the review is positive, and the customer expressed their intention to recommend the product to others.


#### Mixtral 8x7b Response:
Based on the customer's positive comments about the product's build quality, performance, and ease of setup, as well as their overall satisfaction and recommendation, this review falls into the Positive category.


#### Mistral NeMo Response:
Based on the provided customer review, the sentiment is overwhelmingly positive. Here's why:

1. The customer expresses satisfaction with the product's build quality and performance.
2. They fo

In [31]:
evaluate_responses(task=prompt,
                    mistral7b_response=mistral7b_output,
                    mixtral8x7b_response=mixtral8x7b_output,
                    nemo_response=nemo_output)

Judge's Evaluation:
Let's analyze each model's response:

Model A (Mistral 7b):
- Provides clear classification (Positive)
- Explains reasoning with specific examples from the review
- Acknowledges the negative point (delivery delay) but explains why it doesn't affect overall sentiment
- Concise yet comprehensive

Model B (Mixtral 8x7b):
- Provides classification (Positive)
- Lists some supporting points
- Very brief, missing some nuance
- Doesn't address the negative aspect at all

Model C (Mistral NeMo):
- Provides clear classification (Positive)
- Structured response with numbered points
- Comprehensive analysis of both positive and negative aspects
- Explains how the negative point is mitigated
- Clear formatting with bold conclusion
- Well-balanced and thorough explanation

Winner: Model C (Mistral NeMo)

Reasoning:
1. Most structured and easy to follow response
2. Most comprehensive analysis
3. Balanced treatment of both positive and negative aspects
4. C

In [32]:
prompt = '''You are a language model trained to classify customer feedback into categories based on sentiment. 

I’m really disappointed with this product. It stopped working after just a few uses, and I had high hopes based on the description. 
The quality feels cheap, and it doesn't match what was promised in the ad. 
I reached out to customer service, but they took too long to respond, and when they did, their solution didn’t fix the issue. I wouldn’t recommend this to anyone.

Task: classify customer review into one of the following categories: Positive, Negative, Neutral
'''

mistral7b_output = invoke_model(MISTRAL_7B_MODEL_ID, prompt)
mixtral8x7b_output = invoke_model(MIXTRAL_8x7B_MODEL_ID, prompt)
nemo_output = invoke_model_with_message_format(MISTRAL_NEMO_MODEL_ID, prompt)

print(f"{RED}#### Mistral 7b Response:{RESET}")
print(f"{BLUE}{mistral7b_output}{RESET}\n\n")

print(f"{RED}#### Mixtral 8x7b Response:{RESET}")
print(f"{BLUE}{mixtral8x7b_output}{RESET}\n\n")

print(f"{RED}#### Mistral NeMo Response:{RESET}")
print(f"{BLUE}{nemo_output}{RESET}")

#### Mistral 7b Response:
Negative. The customer expressed disappointment with the product's performance, perceived it as having low quality, and had a negative experience with customer service.


#### Mixtral 8x7b Response:
Negative


#### Mistral NeMo Response:
Based on the provided customer feedback, the review can be classified as **Negative**. Here's why:

1. The customer expresses disappointment with the product ("I’m really disappointed").
2. They mention that the product stopped working after a short period of use ("It stopped working after just a few uses"), indicating dissatisfaction with its durability.
3. The customer feels that the product's quality is poor ("The quality feels cheap") and doesn't meet their expectations based on the product description and advertisement.
4. They had a negative experience with customer service, stating that it took too long to respond and that the solution 

In [33]:
evaluate_responses(task=prompt,
                    mistral7b_response=mistral7b_output,
                    mixtral8x7b_response=mixtral8x7b_output,
                    nemo_response=nemo_output)

Judge's Evaluation:
Let's analyze each model's response:

Model A (Mistral 7b):
- Provides the correct classification (Negative)
- Includes a brief but clear explanation with key points
- Mentions three main aspects: product performance, quality, and customer service

Model B (Mixtral 8x7b):
- Provides only the classification (Negative)
- No explanation or reasoning provided
- Most minimal response of all three

Model C (Mistral NeMo):
- Provides the correct classification (Negative)
- Offers the most comprehensive explanation
- Breaks down the analysis into 5 specific points
- Supports each point with direct quotes from the review
- Considers alternative classifications (mentions absence of positive/neutral aspects)
- Well-structured response with clear formatting

Winner: Model C (Mistral NeMo)

Reasoning:
While all models provided the correct classification, Model C stands out as the best output because it:
1. Provides the most thorough and well-structured a

In [34]:
prompt = '''You are a language model trained to classify customer feedback into categories based on sentiment. 

The product works as expected, but it’s nothing special. The setup was straightforward, and it functions like most similar products out there. 
I haven’t had any major issues, but I also haven’t been blown away by its performance. It’s okay, but I might look for alternatives next time.

Task: classify customer review into one of the following categories: Positive, Negative, Neutral
'''

mistral7b_output = invoke_model(MISTRAL_7B_MODEL_ID, prompt)
mixtral8x7b_output = invoke_model(MIXTRAL_8x7B_MODEL_ID, prompt)
nemo_output = invoke_model_with_message_format(MISTRAL_NEMO_MODEL_ID, prompt)

print(f"{RED}#### Mistral 7b Response:{RESET}")
print(f"{BLUE}{mistral7b_output}{RESET}\n\n")

print(f"{RED}#### Mixtral 8x7b Response:{RESET}")
print(f"{BLUE}{mixtral8x7b_output}{RESET}\n\n")

print(f"{RED}#### Mistral NeMo Response:{RESET}")
print(f"{BLUE}{nemo_output}{RESET}")

#### Mistral 7b Response:
Based on the customer feedback provided, I would classify this review as Neutral. The customer did not express any major issues or dissatisfaction with the product, but they also did not express any significant praise or excitement. The feedback suggests that the product is functional and meets expectations, but does not stand out from similar products in the market.


#### Mixtral 8x7b Response:
Neutral


#### Mistral NeMo Response:
Based on the provided customer feedback, I would classify it as "Neutral". Here's why:

- The customer states that the product "works as expected" and the setup was "straightforward", which are neutral statements and not particularly positive or negative.
- They also mention that they "haven't had any major issues", which is a positive aspect, but it's presented in a neutral manner.
- However, the customer expresses that they "haven't been blown a
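
The three responses above range from a bare label (Mixtral 8x7B) to a multi-point explanation (Mistral NeMo). If the label needs to be compared or consumed programmatically, one option is to pull it out of the free text. The sketch below is a heuristic and not part of the original notebook: `extract_label` is a hypothetical helper that returns the first label word it finds, so it can misfire when a response discusses other labels before stating its verdict.

```python
import re
from typing import Optional

# Labels allowed by the prompt's task description.
_LABEL_RE = re.compile(r"\b(positive|negative|neutral)\b", re.IGNORECASE)


def extract_label(response: str) -> Optional[str]:
    """Return the first sentiment label mentioned in a response, or None.

    Heuristic (assumption): the verdict is the first label word in the text.
    """
    match = _LABEL_RE.search(response)
    return match.group(1).capitalize() if match else None


# Snippets taken from the preserved outputs in this notebook.
print(extract_label(" Neutral"))
print(extract_label('I would classify it as "Neutral". Here\'s why:'))
print(extract_label("the review can be classified as **Negative**."))
```

Asking the model to answer with the label on its own first line would make this kind of extraction more reliable than scanning an explanation.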

In [35]:
evaluate_responses(task=prompt,
                   mistral7b_response=mistral7b_output,
                   mixtral8x7b_response=mixtral8x7b_output,
                   nemo_response=nemo_output)

Judge's Evaluation:
Let me analyze each model's response:

Model A (Mistral 7b):
- Provides clear classification as Neutral
- Explains reasoning with specific references to the review
- Balanced analysis of both functional aspects and lack of standout features

Model B (Mixtral 8x7b):
- Provides only the classification "Neutral"
- No explanation or reasoning provided
- Too minimal to be helpful

Model C (Mistral NeMo):
- Provides clear classification as Neutral
- Offers detailed explanation with bullet points
- Breaks down specific phrases from the review
- Shows balanced analysis of positive, negative, and neutral aspects
- Provides clear conclusion based on the evidence presented

Winner: Model C (Mistral NeMo)

Reasoning:
While Model A provides a good response with clear reasoning, Model C offers the most comprehensive and well-structured analysis. The bullet-point format makes it easier to follow the reasoning, and it specifically addresses multiple aspects

## Observations

Across the tasks in this notebook, Mistral NeMo consistently delivered the most comprehensive and structured responses of the three models. Its outputs were well organized, with clear reasoning and a logical flow of information, and it often outperformed both Mistral 7B and Mixtral 8x7B in the following areas:

- **Summarization**: NeMo showed a stronger ability to condense long texts while retaining key details and clarity.
- **Code generation**: it provided precise and efficient solutions that reflected a solid understanding of programming concepts.
- **Sentiment analysis**: in the negative and neutral review examples above, all three models returned the correct label, but NeMo supported its classification with quotes from the review, whereas Mixtral 8x7B returned only the label. That level of explanation makes NeMo a good fit for customer feedback analysis and market research.

Mistral 7B remained the stronger model for email drafting. Its responses were more contextually appropriate and professionally toned, which makes it a better fit for generating emails in business scenarios.

Overall, NeMo stands out in most of the use cases tested here, while Mistral 7B still holds an edge in email drafting.
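
To back these observations with counts rather than impressions, the judge's verdicts can be tallied per model. The sketch below is not part of the original notebook. It assumes each judge evaluation is available as a string (here `evaluate_responses` appears only to print it) and that it contains a line of the form `Winner: Model C (...)`, as in the preserved outputs. `parse_winner` and `tally_winners` are hypothetical helper names.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Letter-to-model mapping as used in the judge's evaluations above.
MODEL_NAMES = {"A": "Mistral 7B", "B": "Mixtral 8x7B", "C": "Mistral NeMo"}

# Matches a line such as "Winner: Model C (Mistral NeMo)".
_WINNER_RE = re.compile(r"^Winner:\s*Model\s+([A-C])\b", re.MULTILINE)


def parse_winner(judge_text: str) -> Optional[str]:
    """Return the winning model's name, or None if no verdict line is found."""
    match = _WINNER_RE.search(judge_text)
    return MODEL_NAMES[match.group(1)] if match else None


def tally_winners(judge_texts: Iterable[str]) -> Counter:
    """Count wins per model, skipping evaluations without a verdict line."""
    winners = (parse_winner(text) for text in judge_texts)
    return Counter(name for name in winners if name is not None)


# Illustrative input shaped like the judge outputs in this notebook.
sample = "Model A (Mistral 7b):\n- ...\n\nWinner: Model C (Mistral NeMo)\n\nReasoning:\n..."
print(tally_winners([sample]).most_common())
```

Collecting the judge text for every task in a list and passing it to `tally_winners` would give a per-model win count to set alongside the qualitative summary.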

## Cleanup

In [None]:
# Delete the Bedrock Marketplace endpoint so it stops incurring charges.
bedrock_client.delete_marketplace_model_endpoint(endpointArn=endpoint_arn)