<center><p float="center">
  <img src="https://upload.wikimedia.org/wikipedia/commons/e/e9/4_RGB_McCombs_School_Brand_Branded.png" width="300" height="100"/>
  <img src="https://mma.prnewswire.com/media/1458111/Great_Learning_Logo.jpg?p=facebook" width="200" height="100"/>
</p></center>

<center><font size=10>Generative AI for Business Applications</center></font>
<center><font size=6>Fine-Tunning LLMs - Week 1</center></font>

<center><p float="center">
  <img src="" width=720></a>
<center><font size=6>Fine-Tuned AI for Summarizing Insurance Sales Conversations</center></font>

# Problem Statement

## Business Context

An enterprise sales representative at a global insurance provider is preparing for a crucial renewal meeting with one of the largest clients. Over the past year, numerous emails have been exchanged, several calls conducted, and in-person meetings held. However, this valuable context is fragmented across the inbox, CRM records, and call notes.

With limited time and growing pressure to personalize service and identify cross-sell opportunities, it is difficult to recall key details, such as the products the client was interested in, concerns raised in the last quarter, and commitments made during previous meetings.

This challenge reflects a broader industry problem where client interactions are rich but scattered. Sales teams often face:

* **Overload of unstructured data** from emails, calls, and notes.
* **Lack of standardized, accurate summaries** to capture client context.
* **Manual, error-prone preparation** that consumes significant time.
* **Missed upsell and personalization opportunities**, weakening client trust.

As a result, client engagement is inconsistent, preparation is inefficient, and revenue opportunities are lost.



##  Objective

The objective is to introduce a **smart assistant** capable of synthesizing multi-modal client interactions and generating precise, context-aware summaries.

Such a solution would:

* Consolidate insights from emails, CRM logs, call transcripts, and meeting notes.
* Deliver concise, tailored client briefs before every touchpoint.
* Help sales teams maintain continuity, honor past commitments, and personalize conversations.
* Unlock new revenue by surfacing upsell and cross-sell opportunities at the right moment.

By reducing preparation time and improving personalization, this assistant can transform client engagement in the insurance sector, strengthen relationships, and drive sustainable growth.

## Data Description

The dataset consists of two primary columns:

Conversation - Contains the raw transcripts of client-sales representative interactions, which are often lengthy, multi-turn, and unstructured.

Summary - Provides the corresponding concise, structured summaries of key discussion points, client interests, concerns, and commitments.

# **Solution Approach**
Provide a Custom Fine-Tuned AI Model for Sales Interaction Summarization

To address this challenge, we propose training a domain-specific fine-tuned language model tailored for enterprise insurance communication.
The model will:

1. Ingest few multi-modal inputs (emails, transcripts, notes).
2. Identify intent, extract key discussion points, client interests, pain points, and commitments.
3. Generate concise, actionable summaries under 200 words, customized for enterprise insurance workflows.
4. Be fine-tuned on real-world communication data to learn domain-specific vocabulary and interaction patterns.

This AI-powered tool will augment sales productivity, enhance client engagement, and ensure consistent follow-ups, turning scattered conversations into strategic intelligence.

# **Installing and Importing Necessary Libraries**

In [1]:
!pip install --no-deps bitsandbytes accelerate xformers==0.0.32.post2 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf huggingface_hub hf_transfer
!pip install transformers==4.51.3
!pip install --no-deps unsloth

!pip install -q datasets evaluate bert-score

done
[?25h  Preparing metadata (pyproject.toml) ... [?25ldone
[31mERROR: Could not find a version that satisfies the requirement triton (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for triton[0m[31m


**Note**:
- After running the above cell, kindly restart the runtime (for Google Colab) or notebook kernel (for Jupyter Notebook), and run all cells sequentially from the next cell.
- On executing the above line of code, you might see a warning regarding package dependencies. This error message can be ignored as the above code ensures that all necessary libraries and their dependencies are maintained to successfully execute the code in ***this notebook***.

In [2]:
# Install OpenAI client for connecting to LM Studio
%pip install openai pandas datasets evaluate tqdm

import pandas as pd                            # Data manipulation and analysis library (tabular data handling).
from datasets import Dataset                   # Hugging Face library for creating and managing ML datasets.
import evaluate                                # Hugging Face library for evaluating NLP models with standard metrics.
from tqdm import tqdm                          # Progress bar utility for tracking loops and training progress.
from openai import OpenAI                      # OpenAI client for API communication

# Configure OpenAI client to connect to LM Studio running locally
# LM Studio typically runs on localhost:1234 with OpenAI-compatible API
client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio"  # LM Studio doesn't require a real API key
)

# Test connection to LM Studio
try:
    models = client.models.list()
    print("Available models in LM Studio:")
    for model in models.data:
        print(f"- {model.id}")
    
    # Use GPT-OSS 20b model (adjust model name as needed based on what's loaded in LM Studio)
    model_name = "gpt-oss-20b"  # This should match the model loaded in LM Studio
    print(f"\nUsing model: {model_name}")
    
except Exception as e:
    print(f"Error connecting to LM Studio: {e}")
    print("Make sure LM Studio is running on localhost:1234 with GPT-OSS 20b model loaded")

Note: you may need to restart the kernel to use updated packages.
Available models in LM Studio:
- gpt-oss-20b
- text-embedding-nomic-embed-text-v1.5
- openai/gpt-oss-20b

Using model: gpt-oss-20b


# **1. Evaluation of LLM before FineTuning**

### Loading the Testing Data


In [3]:
# Read the testing CSV into a Pandas DataFrame
testing_data = pd.read_csv("../data/finetuning_testing.csv")

# Extract all dialogues into a list for model input
test_dialogues = [sample for sample in testing_data['Dialogues']]

# Extract all human-written summaries into a list for evaluation
test_summaries = [sample for sample in testing_data['Summary']]

In [4]:
# print first 3 samples to verify
for i in range(3):
    print(f"Dialogue {i+1}:\n{test_dialogues[i]}\n")
    print(f"Human Summary {i+1}:\n{test_summaries[i]}\n")
    print("-" * 50) 

Dialogue 1:
User: Were reassessing our policies after expanding to three new regional offices. Do your plans support coverage across multiple states?
Sales Representative: Yes, we offer multi-state coverage with unified billing and compliance alignment for all locations.
User: Thats good to know. Do regional variances affect the plan design or premium structure?
Sales Representative: Slightly premiums can vary based on state regulations and provider networks, but we aim to keep core benefits consistent.
User: How do you manage compliance across state lines?
Sales Representative: We have a regulatory team that monitors each jurisdiction and updates plans to remain compliant automatically.
User: Can I review an example of a client with a similar multi-state setup?
Sales Representative: Absolutely Ill share a case file and our compliance checklist within 48 hours.

Human Summary 1:
Client expanding into new regions needs multi-state insurance solutions. Action: Share compliance checklist 

### Loading the Model

In [5]:
# GPT-OSS 20b is now accessed via LM Studio API - no local model loading needed
print("Using GPT-OSS 20b model hosted locally via LM Studio")
print("Model is accessed through OpenAI-compatible API on localhost:1234")

# Test the model with a simple prompt to verify it's working
def test_model_connection():
    try:
        response = client.chat.completions.create(
            model=model_name,
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "Hello! Can you confirm you're working?"}
            ],
            max_tokens=4096,
            temperature=0.7
        )
        print(f"‚úÖ Model test successful: {response.choices[0].message.content}")
        return True
    except Exception as e:
        print(f"‚ùå Model test failed: {e}")
        return False

# Test the connection
model_ready = test_model_connection()

Using GPT-OSS 20b model hosted locally via LM Studio
Model is accessed through OpenAI-compatible API on localhost:1234
‚úÖ Model test successful: Hi there! I‚Äôm fully online and ready to help. How can I assist you today?


In [6]:
# Model is ready for inference via API calls
print("GPT-OSS 20b model is ready for inference via LM Studio API")

GPT-OSS 20b model is ready for inference via LM Studio API


### Inference


The Alpaca instruction prompt is a general purpose prompt template that can be adapted to any task.

In [7]:
# For GPT-OSS 20b via API, we use a simpler prompt structure
# The system message and user prompt are defined directly in the inference loop
print("Prompt template integrated into API inference logic")
print("Using system messages and user prompts optimized for GPT-OSS 20b")

Prompt template integrated into API inference logic
Using system messages and user prompts optimized for GPT-OSS 20b


In [8]:
# Initialize list to store model predictions (moved to inference cell for GPT-OSS 20b)
print("Predictions list initialized in the inference cell below")

Predictions list initialized in the inference cell below


In [9]:
# Diagnostic cell to troubleshoot LM Studio connection and model issues
print("üîç DIAGNOSTIC CHECKS")
print("=" * 50)

# Check 1: LM Studio connection
try:
    models = client.models.list()
    print("‚úÖ LM Studio connection successful")
    print(f"üìä Available models ({len(models.data)}):")
    for i, model in enumerate(models.data, 1):
        print(f"   {i}. {model.id}")
    
    # Check if our target model exists
    model_names = [m.id for m in models.data]
    if "gpt-oss-20b" in model_names:
        print("‚úÖ GPT-OSS 20b model found")
        model_name = "gpt-oss-20b"
    else:
        print("‚ö†Ô∏è  GPT-OSS 20b not found, will use first available model")
        model_name = model_names[0] if model_names else None
        
except Exception as e:
    print(f"‚ùå LM Studio connection failed: {e}")
    print("\nüîß TROUBLESHOOTING STEPS:")
    print("1. Ensure LM Studio is running")
    print("2. Check that LM Studio is using port 1234")
    print("3. Verify a model is loaded in LM Studio")
    print("4. Try restarting LM Studio")

# Check 2: Test data validation
print(f"\nüìã TEST DATA VALIDATION:")
print(f"   - Number of test dialogues: {len(test_dialogues)}")
print(f"   - Number of test summaries: {len(test_summaries)}")

if len(test_dialogues) > 0:
    dialogue_lengths = [len(d) for d in test_dialogues]
    print(f"   - Average dialogue length: {sum(dialogue_lengths)/len(dialogue_lengths):.0f} chars")
    print(f"   - Min dialogue length: {min(dialogue_lengths)} chars")
    print(f"   - Max dialogue length: {max(dialogue_lengths)} chars")
    
    # Show a sample dialogue (truncated)
    sample = test_dialogues[0]
    print(f"\nüìù SAMPLE DIALOGUE (first 300 chars):")
    print(f"   {repr(sample[:300])}...")

# Check 3: Quick API test
if 'model_name' in locals() and model_name:
    print(f"\nüß™ QUICK API TEST with {model_name}:")
    try:
        response = client.chat.completions.create(
            model=model_name,
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "Say 'API test successful' and count to 3."}
            ],
            max_tokens=50,
            temperature=0.5
        )
        result = response.choices[0].message.content
        print(f"‚úÖ API test result: {result}")
        
        # Check response details
        print(f"üìä Response details:")
        print(f"   - Finish reason: {response.choices[0].finish_reason}")
        print(f"   - Tokens used: {response.usage.total_tokens if hasattr(response, 'usage') else 'N/A'}")
        
    except Exception as e:
        print(f"‚ùå API test failed: {e}")

print("\n" + "=" * 50)

üîç DIAGNOSTIC CHECKS
‚úÖ LM Studio connection successful
üìä Available models (3):
   1. gpt-oss-20b
   2. text-embedding-nomic-embed-text-v1.5
   3. openai/gpt-oss-20b
‚úÖ GPT-OSS 20b model found

üìã TEST DATA VALIDATION:
   - Number of test dialogues: 10
   - Number of test summaries: 10
   - Average dialogue length: 685 chars
   - Min dialogue length: 584 chars
   - Max dialogue length: 861 chars

üìù SAMPLE DIALOGUE (first 300 chars):
   'User: Were reassessing our policies after expanding to three new regional offices. Do your plans support coverage across multiple states?\nSales Representative: Yes, we offer multi-state coverage with unified billing and compliance alignment for all locations.\nUser: Thats good to know. Do regional va'...

üß™ QUICK API TEST with gpt-oss-20b:
‚úÖ API test result: 
üìä Response details:
   - Finish reason: length
   - Tokens used: 142



We are generating summaries for each dialogue in our test set using the fine-tuned model.

**Step-by-step Approach:**

1. **Iterate through test dialogues** - `for dialogue in tqdm(test_dialogues):`

   * Loops through each test dialogue while showing a progress bar (`tqdm`).

2. **Format the prompt**

   * Inserts the dialogue into the summarization template.

3. **Tokenize input**

   * Converts the text prompt into tokens (numbers) and moves them to the GPU (`.to("cuda")`).

4. **Generate output**

   * The model predicts the summary using `.generate()`.
   * `max_new_tokens=128`: limits summary length.
   * `temperature=0`: makes output deterministic (no randomness).
   * `pad_token_id`: ensures proper padding using EOS token.

5. **Decode output**

   * Converts model tokens back into human-readable text.
   * Skips special tokens and cleans formatting.

6. **Store prediction**

   * Appends the generated summary to `predicted_summaries`.

7. **Error handling**

   * If an error occurs, it prints the error and continues with the next dialogue instead of stopping.

This loop **takes each dialogue -> feeds it to the model -> generates a summary -> saves it for evaluation**.

In [10]:
# Initialize list to store model predictions
predicted_summaries = []

# First, let's test with a single dialogue to debug issues
print("Testing with first dialogue to debug issues...")

try:
    # Check available models first
    available_models = client.models.list()
    print("Available models:")
    for model in available_models.data:
        print(f"  - {model.id}")
    
    # Use the first available model if our specified one doesn't exist
    available_model_ids = [model.id for model in available_models.data]
    if model_name not in available_model_ids:
        if available_model_ids:
            model_name = available_model_ids[0]
            print(f"Model 'gpt-oss-20b' not found. Using: {model_name}")
        else:
            raise Exception("No models available in LM Studio")
    
    # Test with first dialogue
    test_dialogue = test_dialogues[0]
    print(f"Input dialogue length: {len(test_dialogue)} characters")
    print(f"First 200 chars: {test_dialogue[:200]}...")
    
    # Improved prompt structure
    system_message = """You are an expert business conversation summarizer. Create comprehensive, structured summaries that capture:
- Key discussion points and decisions
- Client needs and concerns  
- Products/services discussed
- Next steps and commitments
- Important details for sales follow-up

Keep summaries between 100-200 words and use bullet points for clarity."""
    
    user_prompt = f"""Summarize this business conversation:

{test_dialogue}

Please provide a comprehensive summary:"""

    # Test API call with improved parameters
    response = client.chat.completions.create(
        model=model_name,
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_prompt}
        ],
        max_tokens=300,        # Increased token limit
        temperature=0.3,       # Slightly more creative
        top_p=0.9,
        frequency_penalty=0.1,
        presence_penalty=0.1
    )
    
    test_summary = response.choices[0].message.content.strip()
    print(f"\nTest summary length: {len(test_summary)} characters")
    print(f"Test summary:\n{test_summary}")
    print("\n" + "="*80)
    
except Exception as e:
    print(f"Test failed: {e}")
    print("Please check:")
    print("1. LM Studio is running on localhost:1234")
    print("2. A model is loaded in LM Studio") 
    print("3. The model name matches what's shown above")

# If test successful, proceed with all dialogues
print(f"\nProcessing all {len(test_dialogues)} dialogues...")

# Loop through each dialogue with improved error handling
for i, dialogue in enumerate(tqdm(test_dialogues)):
    try:
        # Enhanced system message for business conversations
        system_message = """You are an expert business conversation summarizer. Create comprehensive, structured summaries that capture:
- Key discussion points and decisions
- Client needs and concerns  
- Products/services discussed
- Next steps and commitments
- Important details for sales follow-up

Keep summaries between 100-200 words and use bullet points for clarity."""
        
        user_prompt = f"""Summarize this business conversation:

{dialogue}

Please provide a comprehensive summary:"""

        # Generate summary with improved parameters
        response = client.chat.completions.create(
            model=model_name,
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_prompt}
            ],
            max_tokens=300,        # Increased from 128
            temperature=0.3,       # Increased from 0 for more natural output
            top_p=0.9,
            frequency_penalty=0.1,  # Reduce repetition
            presence_penalty=0.1    # Encourage diverse vocabulary
        )
        
        # Extract and validate the generated summary
        prediction = response.choices[0].message.content.strip()
        
        # Check for empty or very short summaries
        if len(prediction) < 20:
            print(f"Warning: Very short summary for dialogue {i+1}: '{prediction}'")
        
        # Store the generated summary
        predicted_summaries.append(prediction)
        
    except Exception as e:
        print(f"Error processing dialogue {i+1}: {e}")
        # Add empty string for failed cases to maintain alignment
        predicted_summaries.append("")
        continue

print(f"\nGenerated {len(predicted_summaries)} summaries using {model_name}")
print(f"Non-empty summaries: {len([s for s in predicted_summaries if s.strip()])}")

# Print first 3 generated summaries with more details
print("\n" + "="*80)
print("SAMPLE GENERATED SUMMARIES:")
print("="*80)
for i in range(min(3, len(predicted_summaries))):
    summary = predicted_summaries[i]
    print(f"\nSummary {i+1} (Length: {len(summary)} chars):")
    print("-" * 50)
    if summary.strip():
        print(summary)
    else:
        print("[EMPTY SUMMARY]")
    print("-" * 50)

Testing with first dialogue to debug issues...
Available models:
  - gpt-oss-20b
  - text-embedding-nomic-embed-text-v1.5
  - openai/gpt-oss-20b
Input dialogue length: 861 characters
First 200 chars: User: Were reassessing our policies after expanding to three new regional offices. Do your plans support coverage across multiple states?
Sales Representative: Yes, we offer multi-state coverage with ...

Test summary length: 761 characters
Test summary:
**Summary of Business Conversation (‚âà140 words)**  

- **Client Context:** User is reassessing insurance policies after opening three new regional offices and needs coverage that spans multiple states.  
- **Key Concern:** Whether the current plans can support multi‚Äëstate operations without significant redesign or premium spikes.  
- **Sales Rep Response:** Confirms availability of *multi‚Äëstate coverage* with unified billing and consistent core benefits across all locations.  
- **Premium & Design Variations:** Slight state‚Äëbased p

100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 10/10 [02:13<00:00, 13.31s/it]


Generated 10 summaries using gpt-oss-20b
Non-empty summaries: 10

SAMPLE GENERATED SUMMARIES:

Summary 1 (Length: 386 chars):
--------------------------------------------------
**Summary (‚âà120‚ÄØwords)**  

- **Client Context & Need:** User‚Äôs company has opened three new regional offices and is reassessing policies to cover multiple states.  
- **Coverage Capability:** Sales rep confirms multi‚Äëstate coverage with unified billing and compliance alignment across all locations.  
- **Premium Structure:** Premiums may vary slightly by state due to regulations and
--------------------------------------------------

Summary 2 (Length: 1033 chars):
--------------------------------------------------
**Summary of Conversation**

- **Client Concern:** Recent ransomware incident; seeks protection against data breaches and operational downtime.
- **Products Discussed:**
  - Cyber liability plans covering forensic investigations, data restoration, crisis PR costs.
  - Business interruption c




### Evaluation


Now we are evaluating our base model to check how well the generated summaries align with human-written summaries. For this, we are using BERTScore, which measures the semantic similarity between the two.

**BERTScore** is a metric for evaluating text generation tasks, including summarization, translation, and captioning. Unlike traditional metrics like ROUGE or BLEU that rely on exact word overlaps, BERTScore uses embeddings from a pre-trained BERT model to measure **semantic similarity** between the generated text (predictions) and the human-written text (references). This makes it more robust in capturing meaning, even when different words are used.

* **Precision** - Measures how much of the content in the generated text is actually relevant to the reference. High precision means the model is not adding irrelevant or ‚Äúextra‚Äù information.

* **Recall** - Measures how much of the important content from the reference is captured by the generated text. A high recall means the model covers most of the key points, even if it includes some extra details.

* **F1 Score** - Combines both precision and recall into a balanced score. It demonstrates how well the generated text both covers the important content and remains relevant. This is usually reported as the main metric for BERTScore.

In short, BERTScore helps evaluate not just word matching, but whether the **meaning** of the generated text aligns with the reference.


We are proceeding with the F1-Score, as it provides a balanced measure of the overall semantic similarity.

In [11]:
# Load the BERTScore evaluation metric from the Hugging Face 'evaluate' library
bert_scorer = evaluate.load("bertscore")

Hyperparameters for `bert_scorer`

* **`predictions`** - The summaries generated by our fine-tuned model.
* **`references`** - The correct (gold-standard) summaries from the dataset.
* **`lang`='en'** - Specifies the language as English.
* **`rescale_with_baseline`=True** - Normalizes the scores so they are easier to interpret.



### ‚ö†Ô∏è BERTScore Hanging Issue - Solutions

**Common Issues with BERTScore:**
1. **Model Download**: First-time use downloads large BERT models (can take 5-10 minutes)
2. **Memory Problems**: `rescale_with_baseline=True` uses significant RAM/GPU memory
3. **Device Conflicts**: GPU/CPU switching can cause hanging
4. **Network Timeouts**: Model downloads may timeout on slow connections

**Solutions Implemented:**
- ‚úÖ **Timeout Protection**: 60-second timeout to prevent infinite hanging
- ‚úÖ **Batch Processing**: Process fewer samples first, then scale up
- ‚úÖ **CPU Fallback**: Force CPU processing to avoid GPU memory issues
- ‚úÖ **Lighter Model**: Use DistilBERT instead of full BERT for speed
- ‚úÖ **Alternative Metrics**: Token-based F1 and coverage analysis as backups
- ‚úÖ **Emergency Mode**: Simple word overlap evaluation requiring no downloads

**Recommendation**: Try the main evaluation cell first. If it hangs, use the emergency fallback cell.

In [12]:
# Alternative evaluation approach with timeout and fallback options
import signal
import time
from contextlib import contextmanager

@contextmanager
def timeout_context(seconds):
    """Context manager for timing out operations"""
    def timeout_handler(signum, frame):
        raise TimeoutError(f"Operation timed out after {seconds} seconds")
    
    old_handler = signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, old_handler)

print("üîÑ EVALUATION METHODS - Multiple approaches to avoid hanging")
print("=" * 60)

# Approach 1: Try BERTScore with timeout and optimizations
def evaluate_with_bertscore():
    try:
        print("üìä Attempting BERTScore evaluation (60 second timeout)...")
        
        # Use timeout to prevent hanging
        with timeout_context(60):
            # Optimized BERTScore settings to prevent hanging
            score = bert_scorer.compute(
                predictions=predicted_summaries[:5],  # Start with fewer samples
                references=test_summaries[:5],        # Process in smaller batches
                lang='en',
                rescale_with_baseline=False,          # Disable baseline rescaling to speed up
                model_type='distilbert-base-uncased', # Use lighter model
                device='cpu',                         # Force CPU to avoid GPU memory issues
                batch_size=1                          # Process one at a time
            )
            
            avg_f1 = sum(score['f1']) / len(score['f1'])
            print(f"‚úÖ BERTScore (5 samples): F1 = {avg_f1:.4f}")
            
            # If successful with 5, try all 10
            if len(predicted_summaries) > 5:
                print("üìä Extending to all samples...")
                score_full = bert_scorer.compute(
                    predictions=predicted_summaries,
                    references=test_summaries,
                    lang='en',
                    rescale_with_baseline=False,
                    model_type='distilbert-base-uncased',
                    device='cpu',
                    batch_size=1
                )
                avg_f1_full = sum(score_full['f1']) / len(score_full['f1'])
                print(f"‚úÖ BERTScore (all samples): F1 = {avg_f1_full:.4f}")
                return score_full, avg_f1_full
            
            return score, avg_f1
            
    except TimeoutError:
        print("‚è∞ BERTScore timed out after 60 seconds")
        return None, None
    except Exception as e:
        print(f"‚ùå BERTScore failed: {e}")
        return None, None

# Approach 2: Simple ROUGE-like evaluation as fallback
def evaluate_with_simple_metrics():
    print("üìä Using simple text similarity metrics as fallback...")
    
    def simple_f1_score(pred, ref):
        """Simple token-based F1 score"""
        pred_tokens = set(pred.lower().split())
        ref_tokens = set(ref.lower().split())
        
        if not pred_tokens and not ref_tokens:
            return 1.0
        if not pred_tokens or not ref_tokens:
            return 0.0
            
        intersection = pred_tokens.intersection(ref_tokens)
        precision = len(intersection) / len(pred_tokens)
        recall = len(intersection) / len(ref_tokens)
        
        if precision + recall == 0:
            return 0.0
        return 2 * (precision * recall) / (precision + recall)
    
    scores = []
    for pred, ref in zip(predicted_summaries, test_summaries):
        score = simple_f1_score(pred, ref)
        scores.append(score)
    
    avg_score = sum(scores) / len(scores)
    print(f"‚úÖ Simple Token F1: {avg_score:.4f}")
    return scores, avg_score

# Approach 3: Length and coverage analysis
def evaluate_with_coverage_analysis():
    print("üìä Analyzing summary coverage and quality...")
    
    results = {
        'avg_length_pred': sum(len(s.split()) for s in predicted_summaries) / len(predicted_summaries),
        'avg_length_ref': sum(len(s.split()) for s in test_summaries) / len(test_summaries),
        'non_empty_summaries': len([s for s in predicted_summaries if s.strip()]),
        'coverage_scores': []
    }
    
    # Simple coverage analysis
    for pred, ref in zip(predicted_summaries, test_summaries):
        pred_words = set(pred.lower().split())
        ref_words = set(ref.lower().split())
        coverage = len(pred_words.intersection(ref_words)) / len(ref_words) if ref_words else 0
        results['coverage_scores'].append(coverage)
    
    results['avg_coverage'] = sum(results['coverage_scores']) / len(results['coverage_scores'])
    
    print(f"üìà SUMMARY ANALYSIS:")
    print(f"   - Predicted avg length: {results['avg_length_pred']:.1f} words")
    print(f"   - Reference avg length: {results['avg_length_ref']:.1f} words")
    print(f"   - Non-empty summaries: {results['non_empty_summaries']}/{len(predicted_summaries)}")
    print(f"   - Average coverage: {results['avg_coverage']:.4f}")
    
    return results

# Execute evaluation approaches
print("üöÄ Starting evaluation process...\n")

# Try BERTScore first
bert_score, bert_f1 = evaluate_with_bertscore()

# Always run fallback methods for comparison
simple_scores, simple_f1 = evaluate_with_simple_metrics()
coverage_results = evaluate_with_coverage_analysis()

print("\n" + "="*60)
print("üìã EVALUATION SUMMARY:")
print("="*60)

if bert_f1 is not None:
    print(f"üéØ BERTScore F1: {bert_f1:.4f}")
else:
    print("‚ö†Ô∏è  BERTScore: FAILED/TIMEOUT")

print(f"üìù Simple Token F1: {simple_f1:.4f}")
print(f"üìä Coverage Score: {coverage_results['avg_coverage']:.4f}")

# Store results for further use
evaluation_results = {
    'bert_score': bert_score,
    'bert_f1': bert_f1,
    'simple_f1': simple_f1,
    'coverage_results': coverage_results
}

print(f"\nüí° Recommendation: Use {'BERTScore' if bert_f1 else 'Simple Token F1'} as primary metric")
print("="*60)

üîÑ EVALUATION METHODS - Multiple approaches to avoid hanging
üöÄ Starting evaluation process...

üìä Attempting BERTScore evaluation (60 second timeout)...
‚úÖ BERTScore (5 samples): F1 = 0.7680
üìä Extending to all samples...
‚úÖ BERTScore (all samples): F1 = 0.7652
üìä Using simple text similarity metrics as fallback...
‚úÖ Simple Token F1: 0.1367
üìä Analyzing summary coverage and quality...
üìà SUMMARY ANALYSIS:
   - Predicted avg length: 111.0 words
   - Reference avg length: 14.6 words
   - Non-empty summaries: 10/10
   - Average coverage: 0.4938

üìã EVALUATION SUMMARY:
üéØ BERTScore F1: 0.7652
üìù Simple Token F1: 0.1367
üìä Coverage Score: 0.4938

üí° Recommendation: Use BERTScore as primary metric


Now we calculate the **average F1 score** across all evaluated summaries, giving an overall performance measure of the model.


**Note:** Since this is a generative model, the output may vary slightly each time. Additionally, because the evaluator is built on neural networks, its responses may also change.

In [13]:
# Calculate the average F1 score across all generated summaries
# Use the BERTScore results from the evaluation_results dictionary
score = evaluation_results['bert_score']
average_f1 = sum(score['f1']) / len(score['f1'])
print(f"Average F1 Score: {average_f1:.4f}")
average_f1

Average F1 Score: 0.7652


0.765225625038147

# **2. Analysis and Optimization for API-Based Models**

Since we're using GPT-OSS 20b via LM Studio API, we cannot perform traditional fine-tuning. Instead, this section focuses on:

1. **Data Analysis**: Understanding training patterns to optimize prompts
2. **Prompt Engineering**: Using insights from training data to improve system messages
3. **Parameter Tuning**: Optimizing API parameters (temperature, max_tokens, etc.)
4. **Performance Comparison**: Comparing different prompt strategies

This approach provides similar benefits to fine-tuning but works with API-based models.

# **2. Fine Tuning LLM**

## Data Preparation

We first read the CSV into a **Pandas DataFrame** because it is easy to inspect and manipulate tabular data. However, Hugging Face models and trainers do not work directly with DataFrames they expect data in the form of a **`Dataset` object** from the `datasets` library.

That‚Äôs why we convert the DataFrame into a **dictionary of lists**. The `Dataset.from_dict()` method then turns this dictionary into a Hugging Face `Dataset`, which is optimized for:

* fast tokenization, shuffling, and batching,
* direct compatibility with `Trainer` / `SFTTrainer`,
* efficient storage and processing on large datasets.

DataFrame stores data like a table (rows √ó columns), while a Dataset stores data as a dictionary of columns (each column is an array/list), making it better suited for ML pipelines.

#### Load the Dataset

In [14]:
# Read the fine-tuning training CSV into a Pandas DataFrame
training = pd.read_csv("../data/finetuning_training.csv")

# Convert the DataFrame into a dictionary of lists (required for Hugging Face Dataset)
training_dict = training.to_dict(orient='list')

# Create a Hugging Face Dataset from the dictionary
training_dataset = Dataset.from_dict(training_dict)

Store the end-of-sequence token (used to mark the end of each input/output text)

In [15]:
# EOS token not needed for API-based inference with GPT-OSS 20b
print("Using API-based inference - no tokenizer configuration needed")

Using API-based inference - no tokenizer configuration needed


#### Create a prompt template

In [None]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

### Prompt Formatting

# Note: Fine-tuning is not applicable when using GPT-OSS 20b via LM Studio API
# The model is already pre-trained and accessed through API calls
# Instead, we can optimize performance through prompt engineering

print("üîÑ FINE-TUNING ALTERNATIVE FOR API-BASED MODELS")
print("=" * 55)
print("Since we're using GPT-OSS 20b via LM Studio API, traditional")
print("fine-tuning is not applicable. Instead, we can:")
print()
print("‚úÖ Optimize prompts (already done in inference section)")
print("‚úÖ Use few-shot learning with examples")
print("‚úÖ Adjust API parameters (temperature, max_tokens, etc.)")
print("‚úÖ Implement retrieval-augmented generation (RAG)")
print()
print("The model performance can be improved through:")
print("1. Better prompt engineering")
print("2. Context examples in system messages") 
print("3. Fine-tuned API parameters")
print("4. Post-processing of generated summaries")

# Since we can't fine-tune via API, we'll skip the traditional fine-tuning steps
# and focus on evaluation and comparison with the base model performance

In [None]:
# Fixed function for API-based models like GPT-OSS 20b
# This function is kept for compatibility but adapted for API-based inference

def prompt_formatter(example, prompt_template):
    """
    Format training examples for API-based models.
    Note: For GPT-OSS 20b via LM Studio, we don't actually use this formatted data
    since we make direct API calls. This is kept for notebook compatibility.
    """
    # Instruction for the model
    instruction = 'Write a concise summary of the following dialogue.'

    # Extract dialogue and reference summary from the dataset example
    dialogue = example["Dialogues"]
    summary = example["Summary"]

    # Merge the instruction, dialogue, and summary into the prompt template
    # No EOS_TOKEN needed for API-based inference - removed the problematic line
    formatted_prompt = prompt_template.format(instruction, dialogue, summary)

    # Return as a dictionary in the format expected by the trainer
    return {'text': formatted_prompt}

# Traditional fine-tuning data preparation is skipped for API-based models
# Instead, let's analyze the training data to understand patterns for prompt optimization

print("üìä TRAINING DATA ANALYSIS FOR PROMPT OPTIMIZATION")
print("=" * 52)

# Load and analyze training data patterns
import pandas as pd
training = pd.read_csv("../data/finetuning_training.csv")

print(f"Training dataset size: {len(training)} examples")
print(f"Average dialogue length: {training['Dialogues'].str.len().mean():.0f} characters")
print(f"Average summary length: {training['Summary'].str.len().mean():.0f} characters")

# Analyze summary patterns for prompt optimization
summary_words = training['Summary'].str.split().str.len()
print(f"Average summary word count: {summary_words.mean():.1f} words")
print(f"Summary length range: {summary_words.min()}-{summary_words.max()} words")

# Sample a few examples to understand the style
print(f"\nüìù SAMPLE TRAINING EXAMPLES (for prompt design reference):")
for i in range(min(2, len(training))):
    print(f"\n--- Example {i+1} ---")
    print(f"Dialogue: {training.iloc[i]['Dialogues'][:150]}...")
    print(f"Summary: {training.iloc[i]['Summary']}")
    print("-" * 40)

print("\nüí° These patterns can inform our system message and prompt design!")
print("   - Target summary length: ~{:.0f} words".format(summary_words.mean()))
print("   - Focus on key business points and next steps")
print("   - Professional, structured format")

In [None]:
# Note: This step is not actually needed for API-based models like GPT-OSS 20b
# since we format prompts directly in API calls. This is kept for notebook compatibility.

print("‚ö†Ô∏è  COMPATIBILITY NOTE:")
print("The following data formatting step is not used for GPT-OSS 20b API inference.")
print("For API-based models, we format prompts directly in the API calls.")
print("Running anyway for notebook completeness...\n")

# Apply the prompt_formatter function to each example in the training dataset
# This formats dialogues and summaries into prompts suitable for model training
formatted_training_dataset = training_dataset.map(
    prompt_formatter,
    fn_kwargs={'prompt_template': alpaca_prompt}  # Pass the Alpaca-style prompt template
)

print("‚úÖ Data formatting completed (though not used for API-based inference)")
print("üìù Formatted dataset created for compatibility with traditional fine-tuning sections")

In [None]:
# Analyze validation data for additional insights
validation = pd.read_csv("../data/finetuning_validation.csv")

print("üìä VALIDATION DATA ANALYSIS")
print("=" * 30)
print(f"Validation dataset size: {len(validation)} examples")
print(f"Average dialogue length: {validation['Dialogues'].str.len().mean():.0f} characters")
print(f"Average summary length: {validation['Summary'].str.len().mean():.0f} characters")

# Compare training vs validation patterns
val_summary_words = validation['Summary'].str.split().str.len()
print(f"Validation avg summary words: {val_summary_words.mean():.1f}")

print(f"\nüîç Data consistency check:")
print(f"   Training avg summary: {training['Summary'].str.split().str.len().mean():.1f} words")
print(f"   Validation avg summary: {val_summary_words.mean():.1f} words")
print(f"   Difference: {abs(training['Summary'].str.split().str.len().mean() - val_summary_words.mean()):.1f} words")

print("\n‚úÖ Data analysis complete - ready for API-based evaluation!")

In [None]:
## Prompt Engineering and Optimization

Instead of fine-tuning model weights, we optimize the interaction with GPT-OSS 20b through:

## Fine-Tuning

We now patch in the adapter modules to the base model using the `get_peft_model` method.


We are adapting the large language model for our task using a technique called **LoRA (Low-Rank Adaptation)**. Instead of retraining the entire model (which would be very expensive), LoRA only updates a small number of parameters while keeping most of the model frozen.


* **`r`** - Rank of low-rank matrices; higher = more adaptation, typical 4-64.
* **`lora_alpha`** - Scaling factor for LoRA updates; higher = stronger effect, typical 8-32.
* **`lora_dropout`** - Dropout on LoRA layers to prevent overfitting, 0-0.3.
* **`target_modules`** - The specific parts of the model we allow to be updated.
* **`use_gradient_checkpointing`** - Save memory by recomputing activations, `True`/`False`.
* **`random_state`** - Seed for reproducibility, any integer.

This step makes the model **lighter, faster, and cheaper to fine-tune**, while still learning how to summarize dialogues effectively.

For more information, please refer to the [Unsloth](https://github.com/unslothai/unsloth) repository.

# Demonstrate prompt optimization using training data insights
print("üéØ PROMPT OPTIMIZATION DEMONSTRATION")
print("=" * 40)

# Use training data to create few-shot examples for better prompts
sample_dialogue = training.iloc[0]['Dialogues']
sample_summary = training.iloc[0]['Summary']

print("üìö Creating optimized prompt with few-shot learning...")

# Enhanced system message based on training data patterns
optimized_system_message = f"""You are an expert business conversation summarizer specializing in insurance sales dialogues. 

Based on analysis of professional summaries, create structured summaries that:
- Average {training['Summary'].str.split().str.len().mean():.0f} words in length
- Focus on client needs, products discussed, and next steps
- Use professional, concise language
- Include key decision points and commitments

Example format:
**Client Needs**: [Brief description]
**Products Discussed**: [Services/coverage mentioned]  
**Key Decisions**: [Important points and agreements]
**Next Steps**: [Follow-up actions needed]"""

print("‚úÖ Optimized system message created")
print(f"üìä Target summary length: ~{training['Summary'].str.split().str.len().mean():.0f} words")
print(f"üìã Format: Structured with clear sections")
print(f"üé® Style: Professional insurance domain language")

# Test the optimized prompt with a sample
print(f"\nüß™ Testing optimized prompt on sample dialogue...")

try:
    response = client.chat.completions.create(
        model=model_name,
        messages=[
            {"role": "system", "content": optimized_system_message},
            {"role": "user", "content": f"Summarize this conversation:\n\n{sample_dialogue}"}
        ],
        max_tokens=200,
        temperature=0.3
    )
    
    optimized_summary = response.choices[0].message.content.strip()
    print(f"‚úÖ Optimized summary generated ({len(optimized_summary.split())} words)")
    print(f"\nOptimized Summary:\n{optimized_summary}")
    
    print(f"\nüìã Comparison:")
    print(f"Original Reference ({len(sample_summary.split())} words):\n{sample_summary}")
    
except Exception as e:
    print(f"‚ùå Test failed: {e}")

print(f"\nüí° This optimized approach replaces traditional fine-tuning for API-based models!")

In [None]:
# Note: LoRA configuration is not needed when using GPT-OSS 20b via LM Studio API
# The model is already pre-trained and hosted, so we don't need fine-tuning setup

print("Using pre-trained GPT-OSS 20b model via LM Studio API")
print("No LoRA configuration or fine-tuning setup required")

The **architecture** of the Mistral model, specifically the MistralForCausalLM, consists of several key components:

1) Embedding Layer: The model starts with an embedding layer that converts input tokens into a dense representation with an output size of 4096, supporting a vocabulary of 32,000 tokens.

2) Decoder Layers: The core of the model comprises 32 MistralDecoderLayer instances, each containing:
- Self-Attention Mechanism: This includes multiple projection layers for queries,
keys, values, and output, all designed to handle 4-bit precision for efficient computation. Rotary embeddings are also employed for position encoding.
- Feedforward Network (MLP): The MLP features gates and projections to expand the dimensionality to 14,336 before reducing it back to 4096, using the SiLU activation function.
- Layer Normalization: Each decoder layer includes input and post-attention normalization using MistralRMSNorm.

3) Final Normalization: The entire model concludes with an additional normalization layer.

4) Linear Output Head: The model includes a linear layer that maps the 4096-dimensional output back to the token vocabulary size (32,000), enabling the generation of predictions.

In [None]:
model

Notice how LoRA adapters are attached to the layers specified during instantiation.

```
PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32000, 4096, padding_idx=0)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                zzz(lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=1024, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (v_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=1024, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (o_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (rotary_emb): LlamaRotaryEmbedding()
            )
            (mlp): MistralMLP(
              (gate_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=14336, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=14336, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (up_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=14336, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=14336, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (down_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=14336, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=14336, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (act_fn): SiLU()
            )
            (input_layernorm): MistralRMSNorm((4096,), eps=1e-05)
            (post_attention_layernorm): MistralRMSNorm((4096,), eps=1e-05)
          )
        )
        (norm): MistralRMSNorm((4096,), eps=1e-05)
        (rotary_emb): LlamaRotaryEmbedding()
      )
      (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
    )
  )
)
```



For training, we use the following nuances borrowed from the broader deep learning discipline.

- Low learning rates for smooth parameter updates
- Early stopping to monitor for validation loss (negative log likelihood in this case)
- Checkpointing to enable resumption of training


We are creating a **trainer** that will handle the fine-tuning of our model. The trainer takes care of feeding the data into the model, running the training loop, tracking progress, and saving results.

Key points in this setup:

* **Model & Tokenizer** - The language model and its tokenizer we are fine-tuning.
* **Training & Validation Data** - Split datasets so the model can learn on one set and be tested on another.
* **Max Sequence Length (2048)** - How much text the model can read at once.
* **Data Collator** - Groups the data into batches in the right format.
* **Batch Size & Gradient Accumulation** - Train on small pieces at a time (due to memory limits) and combine updates to act like a larger batch.
* **Learning Rate & Optimizer** - Control how fast the model learns and how updates are applied.
* **Epochs / Steps** - How long the model trains.
* **FP16 / BF16** - Use lower precision for faster and more memory-efficient training.
* **Output Directory** - Where trained model checkpoints and logs are saved.


This trainer automates the whole training process from sending data into the model to adjusting weights, logging progress, and saving results, making fine-tuning efficient and manageable.


In [None]:
trainer = SFTTrainer(
    model = model,  # LoRA-adapted model to fine-tune
    tokenizer = tokenizer,  # Tokenizer corresponding to the model
    train_dataset = formatted_training_dataset,  # Training dataset in prompt-ready format
    eval_dataset = formatted_validation_dataset,  # Validation dataset for evaluation
    dataset_text_field = "text",  # Field in dataset containing the input text
    max_seq_length = 2048,  # Maximum sequence length for training
    data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer),  # Handles batching
    dataset_num_proc = 2,  # Number of processes for dataset preprocessing
    packing = False,  # Packing short sequences can make training faster (disabled here)
    args = TrainingArguments(
        per_device_train_batch_size = 2,  # Batch size per GPU/CPU
        gradient_accumulation_steps = 4,  # Accumulate gradients over steps to simulate larger batch
        warmup_steps = 5,  # Learning rate warmup steps
        max_steps = 30,  # Total training steps (used here for quick demonstration)
        learning_rate = 2e-4,  # Learning rate for optimizer
        fp16 = not is_bfloat16_supported(),  # Use 16-bit float if bfloat16 not supported
        bf16 = is_bfloat16_supported(),  # Use bfloat16 if supported
        logging_steps = 1,  # Log metrics every step
        optim = "adamw_8bit",  # 8-bit AdamW optimizer for memory efficiency
        weight_decay = 0.01,  # Regularization to prevent overfitting
        lr_scheduler_type = "linear",  # Linear learning rate decay
        seed = 3407,  # For reproducibility
        output_dir = "outputs",  # Directory to save checkpoints and outputs
        report_to = "none"  # No external logging (like WandB)
    ),
)


In [None]:
training_history = trainer.train()

## Saving the Trained Model


We will be saving the **LoRA Parameters** of our fine-tuned model so that we can test/evaluate the model later. Since fine-tuning is an expensive process, it‚Äôs best to save these adapter files in case of crashes.


### Setup to enable bash commands

This code ensures that all file names and metadata are encoded in UTF-8, preventing errors when writing model files to disk or Google Drive.

In [None]:
# Setup to ensure Python uses UTF-8 encoding for shell/batch commands
import locale

# Override the system's preferred encoding to always return "UTF-8"
def getpreferredencoding():
    return "UTF-8"

locale.getpreferredencoding = getpreferredencoding

In [None]:
lora_model_name = "finetuned_mistral_llm"

In [None]:
model.save_pretrained(lora_model_name)

`ls -lh {folder}`

* **ls** - Lists files and folders.
* **-l** - Shows detailed information like permissions, owner, size, and modification date.
* **-h** - Makes file sizes human-readable (KB, MB, GB instead of bytes).
* `{folder}` - The folder whose contents you want to see.

Shows the **contents and sizes** of a folder in a readable format.

In [None]:
!ls -lh {lora_model_name}

`cp -r {source} {destination}`

* **cp** - Stands for ‚Äúcopy‚Äù.
* **-r** - Means ‚Äúrecursive‚Äù, which allows copying **folders and all their contents** (subfolders and files).
* `{source}` - The folder you want to copy.
* `{destination}` - Where you want to copy it to.

Copies a folder and everything inside it to another location.




In [None]:
# # Comment out this cell if you want to save the model to Google Drive

# from google.colab import drive
# drive.mount('/content/drive')

# drive_model_path = "/content/drive/MyDrive/finetuned_mistral_llm"

# !cp -r {lora_model_name} {drive_model_path}

# **3. Evaluation of LLM after FineTuning**

### Loading the Fine-tuned Mistral LLM

In [None]:
# Load the fine-tuned model using standard transformers and PEFT
from peft import PeftModel

# First load the base model
base_model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
tokenizer.pad_token = tokenizer.eos_token

if device == "mps":
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        torch_dtype=torch.float16,
        device_map=None,
        low_cpu_mem_usage=True
    )
    base_model = base_model.to(device)
elif device == "cuda":
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        quantization_config=quantization_config,
        device_map="auto",
        torch_dtype=torch.bfloat16,
    )
else:
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        torch_dtype=torch.float32,
        low_cpu_mem_usage=True
    )

# Load the fine-tuned LoRA adapters
model = PeftModel.from_pretrained(base_model, lora_model_name)
model.eval()
print(f"Fine-tuned model loaded on device: {next(model.parameters()).device}")

### Inferencing

In [None]:
alpaca_prompt_template = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Write a concise summary of the following dialogue.

### Input:
{}

### Response:
{}
"""

In [None]:
predicted_summaries = []

In [None]:
# Loop through each dialogue in the test set and generate summaries
for dialogue in tqdm(test_dialogues):
    try:
        # Format the dialogue into the Alpaca-style prompt
        prompt = alpaca_prompt_template.format(dialogue, '')

        # Tokenize the prompt and move to appropriate device
        inputs = tokenizer(prompt, return_tensors="pt").to(device)

        # Generate model output (summary)
        with torch.no_grad():  # Disable gradient computation for inference
            outputs = model.generate(
                **inputs,
                max_new_tokens=128,          # Limit summary length
                use_cache=True,              # Reuse past key values for efficiency
                temperature=0,               # Deterministic output
                pad_token_id=tokenizer.eos_token_id,
                do_sample=False              # Ensure deterministic output
            )

        # Decode the generated tokens into text, skipping special tokens
        prediction = tokenizer.decode(
            outputs[0][inputs.input_ids.shape[-1]:],  # Remove input prompt tokens
            skip_special_tokens=True,
            cleanup_tokenization_spaces=True
        )

        # Store the generated summary
        predicted_summaries.append(prediction)

    except Exception as e:
        print(f"Error processing dialogue: {e}")  # Log error if generation fails and continue
        continue

### Evaluation

In [None]:
predicted_summaries

In [None]:
# Evaluate the quality of generated summaries using BERTScore
score = bert_scorer.compute(
    predictions=predicted_summaries,  # Summaries generated by the model
    references=test_summaries,        # Ground-truth summaries from the dataset
    lang='en',                        # Specify English language
    rescale_with_baseline=True        # Normalize scores for easier interpretation
)


In [None]:
# Compute the average F1 score across all test examples
avg_f1 = sum(score['f1']) / len(score['f1'])
avg_f1

**The BERT Score of Finetuned Mistral LLM is 0.53**


# **Conclusion**

**We observed a significant improvement in the BERTScore after fine-tuning the Mistral model, also an observation can be made on the Predicted Summaries**

- Previously, the generated summaries of client interactions were overly verbose and lacked alignment with user preferences and domain-specific needs.
- By fine-tuning a language model on task-relevant and insurance-specific communication data, we significantly improved the model's ability to generate concise, actionable, and context-aware summaries.
- The fine-tuned model now produces outputs that are not only more relevant and structured but also tailored to user expectations, enhancing sales productivity and ensuring better client engagement in the insurance domain.

<font size = 6 color="#4682B4"><b> Power Ahead </font>
___