# LLM Evaluation Framework for Smartphone Review Analysis

This notebook presents an evaluation of several large language models (LLMs) from Hugging Face, focusing on their performance across key natural language processing tasks. The dataset, generated by AI and manually verified, contains over 80 smartphone reviews.

The evaluation framework covers four essential NLP tasks:

* **Sentiment Classification**: Identifying positive or negative sentiment with performance metrics.
* **Cross-language Translation**: English-to-Spanish translation evaluated using BLEU scores.
* **Question Answering**: Extractive QA to assess reading comprehension abilities.
* **Text Summarization**: Producing concise summaries while retaining key information.

In [23]:
# Import necessary packages
import pandas as pd
import torch
from transformers import pipeline
import evaluate
from transformers import logging

In [24]:
# Configuration
logging.set_verbosity(logging.WARNING)

# Dataset
The dataset comprises 83 customer reviews about mobile phones, each labeled as either positive or negative sentiment. The collection includes intentionally ambiguous reviews that challenge classification boundaries, which makes it useful for testing model robustness and decision-making under uncertainty. The reviews vary in length and complexity, approximating the user-generated content found on e-commerce platforms.

In [25]:
reviews_df = pd.read_csv("../data/phone_reviews.csv", sep=";", dtype={'Review':'string', 'Class':'string'})
print(reviews_df.info())
display(reviews_df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83 entries, 0 to 82
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Review  83 non-null     string
 1   Class   83 non-null     string
dtypes: string(2)
memory usage: 1.4 KB
None


Unnamed: 0,Review,Class
0,The iPhone 14 Pro has some impressive features...,NEGATIVE
1,I regret buying this Samsung Galaxy S23. The p...,NEGATIVE
2,My OnePlus 11 has been a mixed experience over...,POSITIVE
3,This Google Pixel 7 has been a nightmare. Cons...,NEGATIVE
4,The iPhone 13 Mini is exactly what I wanted in...,NEGATIVE


# Sentiment Analysis

The analysis employs [DistilBERT-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english), a lightweight yet powerful model specifically optimized for sentiment classification. This model represents an ideal balance between computational efficiency and accuracy, making it suitable for production environments.
The evaluation focuses on challenging edge cases where sentiment boundaries are unclear, testing the model's ability to handle nuanced language and mixed opinions commonly found in real customer feedback.

In [26]:
# Load the model pipeline
pipe_sentiment = pipeline(task="sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

# Sentiment analysis
predicted_labels = pipe_sentiment(reviews_df['Review'].tolist())
display(predicted_labels[:5])

# Map predictions to 1 (POSITIVE) and 0 (NEGATIVE)
predictions = [1 if prediction['label'] == 'POSITIVE' else 0 for prediction in predicted_labels]
predictions[:5]

Device set to use mps:0


[{'label': 'NEGATIVE', 'score': 0.9684767723083496},
 {'label': 'NEGATIVE', 'score': 0.999528169631958},
 {'label': 'POSITIVE', 'score': 0.9947659969329834},
 {'label': 'NEGATIVE', 'score': 0.9997991919517517},
 {'label': 'NEGATIVE', 'score': 0.7173128724098206}]

[0, 0, 1, 0, 0]

The output shows the first five predictions, each with its assigned label and a confidence score between 0 and 1. The labels are then mapped to 0 and 1 so that metrics can be computed.

In [27]:
# Load metrics
accuracy_evaluation = evaluate.load("accuracy")
f1_evaluation = evaluate.load("f1")

# Map review classes
validate_labels = reviews_df['Class'].map({'POSITIVE':1, 'NEGATIVE':0})
validate_labels = validate_labels.tolist()

# Calculate metrics
accuracy_result = accuracy_evaluation.compute(references=validate_labels, predictions=predictions)
accuracy_result = accuracy_result['accuracy']
f1_result = f1_evaluation.compute(references=validate_labels, predictions=predictions)
f1_result = f1_result['f1']
print(f" Accuracy: {accuracy_result:.2f}\n", f"F1 Score: {f1_result:.2f}\n")


 Accuracy: 0.93
 F1 Score: 0.95



The model demonstrates strong classification performance with **93% accuracy and an F1 score of 0.95**. Because F1 is the harmonic mean of precision and recall, a value this high means both are high, so the model produces few false positives and few false negatives.
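To make the relationship between accuracy, precision, recall, and F1 concrete, the sketch below computes all four from raw label lists. The function name `binary_classification_report` and the toy labels are hypothetical and are not the notebook's data; the `evaluate` metrics used above remain the reference results.

```python
def binary_classification_report(references, predictions, positive=1):
    """Compute confusion counts and basic metrics for binary labels."""
    pairs = list(zip(references, predictions))
    tp = sum(1 for r, p in pairs if r == positive and p == positive)
    fp = sum(1 for r, p in pairs if r != positive and p == positive)
    fn = sum(1 for r, p in pairs if r == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical toy labels, used only to illustrate the arithmetic
toy_references = [1, 1, 1, 0, 0, 1]
toy_predictions = [1, 1, 0, 0, 1, 1]
report = binary_classification_report(toy_references, toy_predictions)
print(report)
```

Passing `validate_labels` and `predictions` from the cell above would give the per-class error counts behind the reported scores.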

Let's examine the misclassified reviews.

In [28]:
# Find positions where the prediction differs from the true label
diff_positions = [i for i, (a, b) in enumerate(zip(predictions, validate_labels)) if a != b]

# Change config to show full review content
pd.set_option('display.max_colwidth', None)

# Show reviews
wrong_classified = reviews_df.iloc[diff_positions]
display(wrong_classified)

# Restore default
pd.reset_option('display.max_colwidth')

Unnamed: 0,Review,Class
23,"My Samsung Galaxy A54 works fine for what it is, though I have mixed feelings about the overall experience. The design looks modern and feels decent in hand. Camera produces acceptable photos for social media, but nothing spectacular. Battery life gets me through the day with normal usage. One UI has nice features, but feels a bit sluggish on this hardware. For the price, it's reasonable, but I sometimes wonder if spending a bit more would have been better. It does what I need, just not with much excitement.",POSITIVE
25,"The iPhone SE 3rd generation is a confusing phone to evaluate. The A15 chip is genuinely flagship-level and handles everything smoothly, which is impressive. Touch ID works reliably, especially with masks becoming common. However, the design feels dated even if some prefer the classic look. Battery life is concerning for moderate to heavy users. Camera quality is good but lacks modern features like Night mode. It's affordable for an iPhone, but expensive compared to Android alternatives with similar features. Whether it's worth it depends entirely on how much you value iOS and long-term updates versus modern features.",POSITIVE
53,"The Samsung Galaxy A34 offers budget-friendly smartphone functionality with predictable compromises. Performance handles essential apps adequately, though multitasking can feel sluggish. Camera quality is acceptable for social media sharing, but lacks advanced features. One UI provides familiar Samsung experience, though it feels heavy on the limited hardware. Battery life gets through typical days with careful management. Build quality is basic but functional for the price point. It delivers on core smartphone needs without excitement, making it suitable for users with modest expectations and tight budgets.",NEGATIVE
55,"The Vivo X100 Pro showcases impressive camera hardware that doesn't always translate to consistent results. Zeiss partnership produces excellent photos in optimal conditions, though performance varies in challenging scenarios. Low-light capabilities are good when they work properly. Video stabilization helps with content creation. Performance generally matches the camera's demands smoothly. Funtouch OS has improved but still feels cluttered compared to stock Android. Premium pricing puts it in competition with established flagship cameras. For photography enthusiasts, it offers professional tools, but general users might find better overall value elsewhere.",POSITIVE
61,"My iPhone 12 Mini presents a challenging trade-off between size and functionality. The compact form factor is perfect for one-handed use and portability, which I genuinely appreciate. Performance remains solid for most current applications. Camera quality matches larger iPhones in good lighting conditions. However, battery life is genuinely problematic, often requiring midday charging with moderate use. Screen size limitations become apparent with modern apps and media consumption. It's the right choice for users who prioritize compactness above all else, but requires significant compromises that may frustrate average users.",NEGATIVE
81,The Samsung Galaxy A05s provides ultra-budget smartphone access. Performance handles very basic tasks slowly. Camera quality is poor but functional for emergencies. Battery life is acceptable for light usage. One UI Core offers simplified Samsung experience. Build quality is basic plastic but functional. Perfect option for elderly users or emergency backup device needs.,POSITIVE


The misclassified reviews reveal inherent challenges in sentiment analysis: subjective interpretation and contextual ambiguity. These reviews contain mixed signals where positive features are mentioned alongside limitations, creating uncertainty even for human annotators.

This analysis highlights the importance of considering confidence scores and potentially implementing **multi-class sentiment categories** (positive, negative, neutral, mixed) for more nuanced sentiment understanding in production systems.
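One simple way to act on confidence scores is to route low-confidence predictions to a separate bucket instead of forcing a positive or negative label. The sketch below is a minimal illustration: the function name `flag_uncertain`, the `UNCERTAIN` label, and the 0.9 threshold are assumptions chosen for the example, while the sample predictions are the first five shown earlier in this notebook.

```python
def flag_uncertain(predicted, threshold=0.9):
    """Keep the model's label when its score reaches the threshold, else mark it UNCERTAIN."""
    return [
        p["label"] if p["score"] >= threshold else "UNCERTAIN"
        for p in predicted
    ]


# First five pipeline outputs from the sentiment cell above
sample_predictions = [
    {"label": "NEGATIVE", "score": 0.9684767723083496},
    {"label": "NEGATIVE", "score": 0.999528169631958},
    {"label": "POSITIVE", "score": 0.9947659969329834},
    {"label": "NEGATIVE", "score": 0.9997991919517517},
    {"label": "NEGATIVE", "score": 0.7173128724098206},
]

flagged = flag_uncertain(sample_predictions, threshold=0.9)
print(flagged)
```

With this threshold only the fifth review, scored at about 0.72, is flagged. A suitable threshold would need to be tuned on labeled data rather than picked by hand.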

# Translation

To assess translation quality and explore sentiment consistency across languages, the reviews are translated from English to Spanish with [Helsinki-NLP's OPUS-MT](https://huggingface.co/Helsinki-NLP/opus-mt-tc-big-en-es) model. This transformer-based translation system is trained on large-scale multilingual corpora and offers robust performance for European language pairs.

The translation approach handles long-form text by processing it sentence by sentence, which keeps every input well within the model's limits. The translated sentences are then rejoined into the complete review.

In [29]:
# Load the model pipeline
pipe_translate = pipeline(task="translation", model="Helsinki-NLP/opus-mt-tc-big-en-es")

Device set to use mps:0


Long-form reviews can exceed the model's input limit, so each review is segmented into sentences before translation. Each sentence fits within the model's constraints, and the full translation is reconstructed by joining the translated sentences in order.

In [30]:
# Load translation pipeline
pipe_translate = pipeline(task="translation", model="Helsinki-NLP/opus-mt-tc-big-en-es")

# Example input
texts = wrong_classified['Review'].tolist()

# Simple sentence splitting using periods
def translate_by_sentence_simple(text_list):
    translated_texts = []  # List to store the translated paragraphs

    for paragraph in text_list:
        # Split the paragraph into sentences using periods and remove empty strings or extra spaces
        sentences = [s.strip() for s in paragraph.split('.') if s.strip()]

        translated_sentences = []

        for sentence in sentences:
            # Translate each sentence individually using the translation pipeline
            translation = pipe_translate(sentence, clean_up_tokenization_spaces=True)[0]['translation_text']
            translated_sentences.append(translation)

        # Reconstruct the translated paragraph by joining the translated sentences with periods
        translated_paragraph = '. '.join(translated_sentences) + '.'
        translated_texts.append(translated_paragraph)

    return translated_texts

# Apply the sentence-by-sentence translation to the list of reviews
spanish_texts = translate_by_sentence_simple(texts)


Device set to use mps:0
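Splitting on every period also splits decimals such as "6.5", and removing the period before translation can produce doubled periods when the model adds its own. The sketch below shows one possible alternative that splits only after sentence-ending punctuation followed by whitespace and keeps that punctuation attached. The names `split_sentences` and `translate_by_sentence`, and the `translate_fn` callback, are hypothetical; a stand-in function replaces the real pipeline so the example runs without downloading a model.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace, keeping the punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def translate_by_sentence(text, translate_fn):
    """Translate each sentence with translate_fn and rejoin with single spaces."""
    return " ".join(translate_fn(sentence) for sentence in split_sentences(text))


# Stand-in translator so the sketch is self-contained; with the real pipeline one
# could pass: lambda s: pipe_translate(s)[0]["translation_text"]
def fake_translate(sentence):
    return sentence.upper()


example = "Battery lasts 6.5 hours. Great phone! Would buy again?"
print(split_sentences(example))
print(translate_by_sentence(example, fake_translate))
```

Because each sentence keeps its terminal punctuation, no extra period is appended when rejoining. Whether the model preserves that punctuation in every case is an assumption worth checking on real output.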


In [31]:
from IPython.display import display, HTML

# Get original and translated lists
english_texts = wrong_classified['Review'].tolist()

# Build the HTML table with centered headers and left-aligned text
html = """
<table style='width:100%; border-collapse: collapse;'>
    <tr>
        <th style='border: 1px solid black; padding: 8px; text-align: center;'>English</th>
        <th style='border: 1px solid black; padding: 8px; text-align: center;'>Spanish</th>
    </tr>
"""

for eng, esp in zip(english_texts, spanish_texts):
    html += f"""
    <tr>
        <td style='border: 1px solid black; padding: 8px; text-align: left;'>{eng}</td>
        <td style='border: 1px solid black; padding: 8px; text-align: left;'>{esp}</td>
    </tr>
    """

html += "</table>"

# Display the table
display(HTML(html))



English,Spanish
"My Samsung Galaxy A54 works fine for what it is, though I have mixed feelings about the overall experience. The design looks modern and feels decent in hand. Camera produces acceptable photos for social media, but nothing spectacular. Battery life gets me through the day with normal usage. One UI has nice features, but feels a bit sluggish on this hardware. For the price, it's reasonable, but I sometimes wonder if spending a bit more would have been better. It does what I need, just not with much excitement.","Mi Samsung Galaxy A54 funciona bien para lo que es, aunque tengo sentimientos encontrados sobre la experiencia general. El diseño se ve moderno y se siente decente en la mano. La cámara produce fotos aceptables para las redes sociales, pero nada espectacular. La duración de la batería me ayuda a pasar el día con un uso normal. Una interfaz de usuario tiene buenas características, pero se siente un poco lento en este hardware. Por el precio, es razonable, pero a veces me pregunto si gastar un poco más habría sido mejor. Hace lo que necesito, pero no con mucha emoción."
"The iPhone SE 3rd generation is a confusing phone to evaluate. The A15 chip is genuinely flagship-level and handles everything smoothly, which is impressive. Touch ID works reliably, especially with masks becoming common. However, the design feels dated even if some prefer the classic look. Battery life is concerning for moderate to heavy users. Camera quality is good but lacks modern features like Night mode. It's affordable for an iPhone, but expensive compared to Android alternatives with similar features. Whether it's worth it depends entirely on how much you value iOS and long-term updates versus modern features.","El iPhone SE de 3a generación es un teléfono confuso para evaluar. El chip A15 es genuinamente de nivel insignia y maneja todo sin problemas, lo cual es impresionante.. Touch ID funciona de forma fiable, especialmente con máscaras cada vez más comunes. Sin embargo, el diseño se siente anticuado incluso si algunos prefieren el aspecto clásico.. La duración de la batería es preocupante para usuarios moderados a pesados. La calidad de la cámara es buena, pero carece de características modernas como el modo nocturno. Es asequible para un iPhone, pero caro en comparación con las alternativas de Android con características similares. Si vale la pena depende completamente de cuánto valoras iOS y las actualizaciones a largo plazo en comparación con las funciones modernas.."
"The Samsung Galaxy A34 offers budget-friendly smartphone functionality with predictable compromises. Performance handles essential apps adequately, though multitasking can feel sluggish. Camera quality is acceptable for social media sharing, but lacks advanced features. One UI provides familiar Samsung experience, though it feels heavy on the limited hardware. Battery life gets through typical days with careful management. Build quality is basic but functional for the price point. It delivers on core smartphone needs without excitement, making it suitable for users with modest expectations and tight budgets.","El Samsung Galaxy A34 ofrece funcionalidad de smartphone económica con compromisos predecibles. El rendimiento maneja las aplicaciones esenciales de manera adecuada, aunque la multitarea puede parecer lenta. La calidad de la cámara es aceptable para compartir en las redes sociales, pero carece de funciones avanzadas. Una interfaz de usuario proporciona una experiencia familiar de Samsung, aunque se siente pesado en el hardware limitado.. La duración de la batería supera los días típicos con una gestión cuidadosa. La calidad de construcción es básica pero funcional para el punto de precio. Cumple con las necesidades básicas de los teléfonos inteligentes sin emoción, por lo que es adecuado para usuarios con expectativas modestas y presupuestos ajustados.."
"The Vivo X100 Pro showcases impressive camera hardware that doesn't always translate to consistent results. Zeiss partnership produces excellent photos in optimal conditions, though performance varies in challenging scenarios. Low-light capabilities are good when they work properly. Video stabilization helps with content creation. Performance generally matches the camera's demands smoothly. Funtouch OS has improved but still feels cluttered compared to stock Android. Premium pricing puts it in competition with established flagship cameras. For photography enthusiasts, it offers professional tools, but general users might find better overall value elsewhere.","La Vivo X100 Pro muestra un hardware de cámara impresionante que no siempre se traduce en resultados consistentes. La asociación de Zeiss produce excelentes fotos en condiciones óptimas, aunque el rendimiento varía en escenarios desafiantes.. Las capacidades de poca luz son buenas cuando funcionan correctamente. La estabilización de video ayuda con la creación de contenido. El rendimiento generalmente coincide con las demandas de la cámara sin problemas. El sistema operativo Funtouch ha mejorado, pero aún se siente desordenado en comparación con el Android de stock. El precio premium lo pone en competencia con las cámaras insignia establecidas. Para los entusiastas de la fotografía, ofrece herramientas profesionales, pero los usuarios en general pueden encontrar un mejor valor general en otros lugares.."
"My iPhone 12 Mini presents a challenging trade-off between size and functionality. The compact form factor is perfect for one-handed use and portability, which I genuinely appreciate. Performance remains solid for most current applications. Camera quality matches larger iPhones in good lighting conditions. However, battery life is genuinely problematic, often requiring midday charging with moderate use. Screen size limitations become apparent with modern apps and media consumption. It's the right choice for users who prioritize compactness above all else, but requires significant compromises that may frustrate average users.","Mi iPhone 12 Mini presenta un compromiso desafiante entre tamaño y funcionalidad. El factor de forma compacto es perfecto para el uso con una sola mano y la portabilidad, que realmente aprecio. El rendimiento sigue siendo sólido para la mayoría de las aplicaciones actuales. La calidad de la cámara coincide con los iPhones más grandes en buenas condiciones de iluminación. Sin embargo, la duración de la batería es realmente problemática, a menudo requiere carga al mediodía con un uso moderado.. Las limitaciones de tamaño de pantalla se hacen evidentes con las aplicaciones modernas y el consumo de medios. Es la opción correcta para los usuarios que priorizan la compacidad por encima de todo, pero requiere compromisos significativos que pueden frustrar a los usuarios promedio.."
The Samsung Galaxy A05s provides ultra-budget smartphone access. Performance handles very basic tasks slowly. Camera quality is poor but functional for emergencies. Battery life is acceptable for light usage. One UI Core offers simplified Samsung experience. Build quality is basic plastic but functional. Perfect option for elderly users or emergency backup device needs.,El Samsung Galaxy A05s proporciona acceso a teléfonos inteligentes de ultra-presupuesto. El rendimiento maneja tareas muy básicas lentamente. La calidad de la cámara es pobre pero funcional para emergencias. La duración de la batería es aceptable para el uso ligero. One UI Core ofrece una experiencia simplificada de Samsung. La calidad de construcción es plástico básico pero funcional. Opción perfecta para usuarios mayores o necesidades de dispositivos de respaldo de emergencia.


The translations read well, apart from occasional doubled periods where the model's own trailing period meets the one added when the sentences are rejoined. To measure translation quality more objectively, the next step is to calculate a BLEU score.
For this, the file translation_references.txt contains several reference translations of the first misclassified review.

In [32]:
# Load references
with open('../data/translation_references.txt', 'r', encoding='utf-8') as file:
    translated_references = file.read().splitlines()


# Show sentences to analyze
first_english_sentence = english_texts[0].split(".")[0]
first_spanish_sentence = spanish_texts[0].split(".")[0]
reference_sentences = [s.split('.')[0] for s in translated_references]
print("Original english sentence: \t", first_english_sentence)
print("Translated spanish sentence: \t", first_spanish_sentence)
display("Reference sentences to compare with:", reference_sentences)

Original english sentence: 	 My Samsung Galaxy A54 works fine for what it is, though I have mixed feelings about the overall experience
Translated spanish sentence: 	 Mi Samsung Galaxy A54 funciona bien para lo que es, aunque tengo sentimientos encontrados sobre la experiencia general


'Reference sentences to compare with:'

['Mi Samsung Galaxy A54 funciona bien para lo que es, a pesar de que tengo sentimientos encontrados sobre la experiencia en general',
 'Mi Samsung Galaxy A54 funciona correctamente para lo que necesito, aunque la experiencia general me deja con sensaciones mixtas',
 'Mi Samsung Galaxy A54 cumple bien con lo que se espera, aunque tengo opiniones divididas sobre la experiencia general']

In [33]:
# Calculate Bleu Metric
bleu = evaluate.load("bleu")
bleu_score = bleu.compute(predictions=[first_spanish_sentence], references=[reference_sentences])
display(bleu_score)

{'bleu': 0.8556936239970985,
 'precisions': [1.0, 1.0, 0.8823529411764706, 0.75],
 'brevity_penalty': 0.9487294800164372,
 'length_ratio': 0.95,
 'translation_length': 19,
 'reference_length': 20}

**The BLEU evaluation** gives a **score of 0.856 (85.6%)** for this sentence, indicating close agreement with the reference translations. The component analysis shows:

* **Perfect 1-gram and 2-gram precision**: Every word and word pair in the translation appears in a reference
* **Strong 3-gram precision (88%)**: Good medium-length phrase consistency
* **Moderate 4-gram precision (75%)**: Some variation in longer phrase structures
* **Mild brevity penalty (0.95)**: The translation has 19 tokens against a reference length of 20

Because the score comes from a single sentence, it is an encouraging spot check rather than a corpus-level measure. Within that limit, the translation quality looks suitable for cross-linguistic sentiment analysis and content localization applications.
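The reported score can be reproduced by hand from the components in the output above. The sketch below applies the standard BLEU formula, the brevity penalty multiplied by the geometric mean of the four n-gram precisions, to the values printed by `bleu.compute`. The variable names are chosen for this example.

```python
import math

# Values copied from the bleu.compute output above
precisions = [1.0, 1.0, 0.8823529411764706, 0.75]
translation_length = 19
reference_length = 20

# Brevity penalty: 1 when the translation is longer than the reference,
# otherwise exp(1 - reference_length / translation_length)
if translation_length > reference_length:
    brevity_penalty = 1.0
else:
    brevity_penalty = math.exp(1 - reference_length / translation_length)

# Geometric mean of the 1- to 4-gram precisions
geometric_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))

bleu_value = brevity_penalty * geometric_mean
print(f"brevity penalty: {brevity_penalty:.4f}")
print(f"BLEU: {bleu_value:.4f}")
```

The brevity penalty is the only factor driven by length: a translation one token shorter than the reference costs about five percent of the score here.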

# Question Answering

The evaluation assesses reading comprehension capabilities using [deepset's MiniLM](https://huggingface.co/deepset/minilm-uncased-squad2) model, fine-tuned on SQuAD 2.0. This model excels at extractive QA, identifying precise text spans that answer specific questions. 
The evaluation tests the model's ability to understand context, locate relevant information, and provide accurate, contextually appropriate responses from customer review content.

In [34]:
pipe_qa = pipeline("question-answering", model="deepset/minilm-uncased-squad2")

Some weights of the model checkpoint at deepset/minilm-uncased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


In [35]:
# Define a question
question = "Why he regrets to buy the phone?"

# Provide context, review in this case
second_review = reviews_df.loc[1, 'Review']
display(second_review)

# Run the QA pipeline and extract the answer span
pipe_answer = pipe_qa(question=question, context=second_review)
answer = pipe_answer['answer']
print("Answer: ", answer)

'I regret buying this Samsung Galaxy S23. The phone constantly overheats during normal usage, making it uncomfortable to hold. Battery drains incredibly fast, barely lasting 6 hours with moderate use. The fingerprint sensor fails to recognize my finger half the time. Customer service has been unhelpful and dismissive of these issues. For the premium price I paid, I expected much better quality and reliability. Switching to iPhone next.'

Answer:  constantly overheats during normal usage, making it uncomfortable to hold


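The model successfully identifies and extracts a specific reason for the customer's dissatisfaction, demonstrating reading comprehension and information retrieval capabilities. The review lists several complaints, and the pipeline returns only its highest-scoring span, here the overheating issue.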
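This section checks a single answer by eye. If reference answers were available, extracted spans could be scored with a token-overlap F1 in the style of the SQuAD evaluation. The sketch below is a simplified version of that idea and is not part of this notebook's evaluation: the function names and the reference answer string are hypothetical, while the predicted answer is the one printed above.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, and split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction, reference):
    """Token-level F1 between a predicted answer span and a reference answer."""
    pred_tokens = normalize_answer(prediction)
    ref_tokens = normalize_answer(reference)
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


predicted_answer = "constantly overheats during normal usage, making it uncomfortable to hold"
# Hypothetical reference answer written for this example
reference_answer = "The phone constantly overheats during normal usage"

score = token_f1(predicted_answer, reference_answer)
print(f"token F1: {score:.3f}")
```

A partial-credit metric like this rewards spans that cover the reference without requiring an exact match, which suits extractive answers of varying length.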

# Summarization

The implementation uses abstractive summarization with [T5-small-booksum](https://huggingface.co/cnicu/t5-small-booksum), a compact yet effective model for generating concise summaries. Unlike extractive approaches, this model creates new sentences that capture essential information, making it valuable for content curation and summary generation from lengthy customer feedback.

In [36]:
# Load the model pipeline
pipe_summarizer = pipeline(task="summarization", model="cnicu/t5-small-booksum")

# Pass the long text to the model
pipe_summarized = pipe_summarizer(second_review, max_length=25)

Device set to use mps:0
Your min_length=30 must be inferior than your max_length=25.
Both `max_new_tokens` (=256) and `max_length`(=25) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
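The two warnings indicate that `max_length=25` conflicted with the default `min_length=30` and was then overridden by the default `max_new_tokens=256`. One way to avoid both conflicts is to pass a consistent pair of length arguments. The sketch below assumes the summarization pipeline forwards `max_new_tokens` and `min_length` to the model's generation step; the helper `summary_length_kwargs` and its default values are hypothetical.

```python
def summary_length_kwargs(max_new_tokens, min_length=5):
    """Build a consistent pair of length arguments for text generation."""
    if min_length >= max_new_tokens:
        raise ValueError("min_length must be smaller than max_new_tokens")
    return {"max_new_tokens": max_new_tokens, "min_length": min_length}


length_kwargs = summary_length_kwargs(max_new_tokens=25)
print(length_kwargs)

# Hypothetical usage with the pipeline defined above (not executed here):
# pipe_summarized = pipe_summarizer(second_review, **length_kwargs)
```

Validating the pair up front surfaces a bad configuration as an error instead of a warning that is easy to miss in notebook output.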


In [37]:
# Access and print the summarized text
summarized_text = pipe_summarized[0]["summary_text"]

# Build the HTML table with centered headers and left-aligned text
summary_html = f"""
<table style='width:100%; border-collapse: collapse;'>
    <tr>
        <th style='border: 1px solid black; padding: 8px; text-align: center;'>Original</th>
        <th style='border: 1px solid black; padding: 8px; text-align: center;'>Summarized</th>
    </tr>
    <tr>
        <td style='border: 1px solid black; padding: 8px; text-align: left;'>{second_review}</td>
        <td style='border: 1px solid black; padding: 8px; text-align: left;'>{summarized_text}</td>
    </tr>
    """

summary_html += "</table>"

# Display the table
display(HTML(summary_html))

Original,Summarized
"I regret buying this Samsung Galaxy S23. The phone constantly overheats during normal usage, making it uncomfortable to hold. Battery drains incredibly fast, barely lasting 6 hours with moderate use. The fingerprint sensor fails to recognize my finger half the time. Customer service has been unhelpful and dismissive of these issues. For the premium price I paid, I expected much better quality and reliability. Switching to iPhone next.","I regret buying this Samsung Galaxy S23. The phone constantly overheats during normal usage, making it uncomfortable to hold. Battery drains incredibly fast, barely lasting 6 hours with moderate use. The fingerprint sensor fails to recognize my finger half the time."


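Here the output is simply the first four sentences of the review copied verbatim, so the model shortens the text rather than producing a true abstractive summary. The warnings above also show that `max_length=25` was overridden by the default `max_new_tokens` (256), which is why the output runs longer than 25 tokens. Larger, more capable models would likely provide better summarization quality, but hardware constraints limit the evaluation to smaller models that can run locally.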
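The copying behavior can be quantified with a simple check: the share of summary sentences that occur verbatim in the source. The sketch below is one rough way to do this; the function name `copied_sentence_ratio` is hypothetical, and the two strings are the review and the summary shown in the table above.

```python
import re


def copied_sentence_ratio(source, summary):
    """Return the fraction of summary sentences that appear verbatim in the source."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", summary) if s.strip()]
    if not sentences:
        return 0.0
    copied = sum(1 for sentence in sentences if sentence in source)
    return copied / len(sentences)


review_text = (
    "I regret buying this Samsung Galaxy S23. The phone constantly overheats during "
    "normal usage, making it uncomfortable to hold. Battery drains incredibly fast, "
    "barely lasting 6 hours with moderate use. The fingerprint sensor fails to "
    "recognize my finger half the time. Customer service has been unhelpful and "
    "dismissive of these issues. For the premium price I paid, I expected much better "
    "quality and reliability. Switching to iPhone next."
)
summary_text = (
    "I regret buying this Samsung Galaxy S23. The phone constantly overheats during "
    "normal usage, making it uncomfortable to hold. Battery drains incredibly fast, "
    "barely lasting 6 hours with moderate use. The fingerprint sensor fails to "
    "recognize my finger half the time."
)

ratio = copied_sentence_ratio(review_text, summary_text)
print(f"copied sentences: {ratio:.0%}")
```

A ratio of 1.0 means every summary sentence was lifted from the source, which is the signature of extraction rather than abstraction.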

# Key Findings and Model Performance Summary

This evaluation reveals distinct strengths and limitations across the four models tested:

* **Sentiment Analysis**: DistilBERT achieves excellent performance (93% accuracy, 0.95 F1) but struggles with genuinely ambiguous reviews containing mixed sentiments - a challenge that reflects real-world complexity rather than model limitations.
  
* **Translation Quality**: OPUS-MT delivers high-quality English-Spanish translation (BLEU 0.856 on the one sentence scored against references) with particularly strong performance on vocabulary and short phrases, making it useful for cross-linguistic content analysis.
  
* **Question Answering**: MiniLM shows strong extractive precision, effectively identifying specific information within context. It’s well-suited for applications like automated customer insight extraction.
  
* **Summarization**: T5-small shows limitations in true abstractive summarization, primarily reducing text length rather than creating meaningful summaries. This highlights the importance of model size and capability for complex generative tasks - larger models would likely provide significantly better results but remain computationally prohibitive for local execution.