# Dataset Enrichment for LLM Fine-Tuning

## Purpose
This notebook focuses on **enhancing raw textual data** before fine-tuning a Large Language Model (LLM).



This notebook explores **two dataset enrichment strategies**:
1. POS-based linguistic enrichment
2. LLM-based semantic enrichment

Although POS tagging provides syntactic structure, **it does not capture semantic context**.
In practice, fine-tuning on POS-tagged data led to:
- Severe hallucinations
- Loss of contextual grounding
- Overfitting to syntactic patterns instead of meaning

As a result:
✅ POS-based enrichment is used **only for analysis and experimentation**  
❌ POS-enriched data is **NOT used** in the final fine-tuning dataset  

Only **LLM-generated, context-rich samples** are saved and used for training.


The enriched dataset is later used to fine-tune an LLM for high-quality article generation.


## Environment Setup and Imports

In this section, we import all required libraries for:
- Text preprocessing
- Linguistic analysis (POS tagging)
- Local LLM inference
- Dataset manipulation and storage

These tools form the foundation for both enrichment pipelines used later in the notebook.


In [4]:
pip install spacy



In [3]:
pip install datasets


In [5]:
import spacy
from datasets import load_dataset, Dataset
from collections import defaultdict
import re
from tqdm import tqdm
import json
import time
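# Load the small English spaCy pipeline
try:
    nlp = spacy.load("en_core_web_sm")
    print("✅ Spacy loaded successfully")
except OSError:
    # spacy.load raises OSError when the model package is not installed
    print("❌ Spacy not found. Run: python -m spacy download en_core_web_sm")
    raise SystemExit(1)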



❌ Spacy not found. Run: python -m spacy download en_core_web_sm


## Experimental: POS-Based Linguistic Analysis 

This section investigates whether POS-based enrichment can improve fine-tuning quality.

POS tagging adds grammatical structure but **does not inject missing semantic context**.
We evaluate this approach to understand its limitations before moving to LLM-based enrichment.


## Step 1: Factual Skeleton Extraction Using POS Tagging

Rather than blindly expanding text, we first extract a **factual skeleton** from each article.

The goal is to isolate:
- Named entities (people, locations, organizations)
- Numerical facts and dates
- Core subject–verb–object relationships

The aim is for downstream LLM enrichment to:
- Preserve factual correctness
- Avoid hallucinations
- Maintain grounding in the source material


### Factual Skeleton Extraction Function

This function processes an article using spaCy and extracts:
- Named entities (NER)
- Numeric values
- Key noun–verb–object relationships
- Direct quotations (matched with a regular expression)

Only **verifiable information** is retained, discarding stylistic or narrative fluff.
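
The quote and number steps can be tried without spaCy. The sketch below is illustrative only: the helper names are hypothetical, and whitespace splitting stands in for spaCy tokens, so the context windows will differ slightly from what the notebook's function produces.

```python
import re

# Same straight-double-quote pattern the notebook uses for quote extraction
QUOTE_PATTERN = r'"([^"]+)"'

def extract_quotes(text, min_len=5):
    """Return quoted spans longer than min_len characters."""
    return [q for q in re.findall(QUOTE_PATTERN, text) if len(q) > min_len]

def numbers_with_context(text, window=3):
    """Pair each digit-bearing token with up to `window` neighbours per side.

    Assumption: whitespace tokens replace spaCy tokens in this sketch.
    """
    tokens = text.split()
    results = []
    for i, tok in enumerate(tokens):
        if any(ch.isdigit() for ch in tok):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            results.append({"value": tok, "context": " ".join(tokens[start:end])})
    return results

sample = 'Radcliffe turns 18 on Monday. "I don\'t think I\'ll be particularly extravagant," he said.'
quotes = extract_quotes(sample)
numbers = numbers_with_context(sample)
```

Note that the pattern only matches straight double quotes; text using typographic quotes would need a wider character class.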


In [None]:

def extract_factual_skeleton(article):
    """Extract ONLY verifiable facts using POS tagging"""
    
    doc = nlp(article)
    
    facts = {
        'entities': [],
        'numbers': [],
        'actions': [],
        'quotes': [],
    }
    
    # Extract Named Entities (People, Places, Organizations, Dates, Money)
    seen_entities = set()
    for ent in doc.ents:
        if ent.text not in seen_entities:
            facts['entities'].append({
                'text': ent.text,
                'label': ent.label_,
            })
            seen_entities.add(ent.text)
    
    # Extract Numbers with Context
    for token in doc:
        if token.like_num or token.pos_ == "NUM":
            start = max(0, token.i - 3)
            end = min(len(doc), token.i + 4)
            context = doc[start:end].text
            facts['numbers'].append({
                'value': token.text,
                'context': context
            })
    
    # Extract Main Actions (Subject-Verb-Object)
    for token in doc:
        if token.pos_ == "VERB" and token.dep_ == "ROOT":
            subject = None
            obj = None
            
            for child in token.children:
                if child.dep_ in ["nsubj", "nsubjpass"]:
                    subject = child.text
                elif child.dep_ in ["dobj", "attr", "prep"]:
                    obj = child.text
            
            if subject:  # Only keep if we have a subject
                facts['actions'].append({
                    'subject': subject,
                    'verb': token.lemma_,
                    'object': obj
                })
    
    # Extract Direct Quotes (EXACT preservation)
    quote_pattern = r'"([^"]+)"'
    quotes = re.findall(quote_pattern, article)
    facts['quotes'] = [q for q in quotes if len(q)>5]  # Skip very short
    
    return facts


In [3]:
def format_facts_as_input(facts):
    """Convert facts into training input"""
    
    lines = []
    
    # Group entities by type
    entities_by_type = defaultdict(list)
    for ent in facts['entities']:
        entities_by_type[ent['label']].append(ent['text'])
    
    # Format entity types
    label_names = {
        'PERSON': 'People',
        'ORG': 'Organizations',
        'GPE': 'Locations',
        'DATE': 'Dates',
        'MONEY': 'Money/Amounts',
        'CARDINAL': 'Numbers',
        'EVENT': 'Events',
    }
    
    for label, name in label_names.items():
        if label in entities_by_type:
            items = list(set(entities_by_type[label]))[:10]  # Max 10 per type
            if items:
                lines.append(f"{name}: {', '.join(items)}")
    
    # Add key actions
    if facts['actions']:
        actions_str = []
        for action in facts['actions'][:5]:  # Top 5 actions
            if action['object']:
                actions_str.append(f"{action['subject']} {action['verb']} {action['object']}")
            else:
                actions_str.append(f"{action['subject']} {action['verb']}")
        if actions_str:
            lines.append(f"Key events: {'; '.join(actions_str)}")
    
    # Add quotes (PRESERVE EXACTLY!)
    if facts['quotes']:
        for i, quote in enumerate(facts['quotes'][:3], 1):
            lines.append(f'Quote {i}: "{quote}"')
    
    # Add numbers with context
    if facts['numbers']:
        seen = set()
        num_strs = []
        for num in facts['numbers'][:8]:  # Max 8 numbers
            if num['value'] not in seen:
                num_strs.append(f"{num['value']}")
                seen.add(num['value'])
        if num_strs:
            lines.append(f"Numbers mentioned: {', '.join(num_strs)}")
    
    return "\n".join(lines)


### Applying Factual Extraction Across the Dataset

Each article in the dataset is processed to generate a compact factual representation.

This representation acts as:
- A high-signal conditioning input
- A factual anchor for later semantic expansion
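
As a rough illustration of how such a compact representation is assembled, the sketch below groups entities by label and emits one line per type. The function name and the reduced label table are assumptions for the example. It de-duplicates with `dict.fromkeys` so the order of first appearance is kept, whereas de-duplicating through `set` gives no ordering guarantee.

```python
from collections import defaultdict

# Reduced label table for the example; the notebook's mapping has more entries
LABEL_NAMES = {"PERSON": "People", "ORG": "Organizations", "GPE": "Locations"}

def format_entities(entities, max_per_type=10):
    """Group entity texts by label and render one 'Name: a, b' line per type."""
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ent["label"]].append(ent["text"])

    lines = []
    for label, name in LABEL_NAMES.items():
        # dict.fromkeys removes duplicates while preserving first-seen order
        items = list(dict.fromkeys(grouped.get(label, [])))[:max_per_type]
        if items:
            lines.append(f"{name}: {', '.join(items)}")
    return "\n".join(lines)

entities = [
    {"text": "Daniel Radcliffe", "label": "PERSON"},
    {"text": "Reuters", "label": "ORG"},
    {"text": "Daniel Radcliffe", "label": "PERSON"},
    {"text": "London", "label": "GPE"},
]
formatted = format_entities(entities)
```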


In [None]:


def create_pos_dataset(num_examples=10000):
    """Generate complete training dataset"""
    
    print("📚 Loading CNN/DailyMail dataset...")
    cnn_dataset = load_dataset("cnn_dailymail", "3.0.0", split="train")
    
    print(f"🔍 Processing {num_examples} articles with POS tagging...")
    
    training_pairs = []
    skipped = 0
    
    for i in tqdm(range(min(num_examples, len(cnn_dataset)))):
        article = cnn_dataset[i]['article']
        
        # Skip very short articles
        # if len(article) < 200:
        #     skipped += 1
        #     continue
        
        # Extract facts
        
        facts = extract_factual_skeleton(article)
        
        
        # Format as input
        facts_input = format_facts_as_input(facts)
        
        # # Skip if too few facts
        # if len(facts_input) < 50 or not facts['entities']:
        #     skipped += 1
        #     continue
        
        # Create training pair
        messages = [
            {
                "role": "system",
                "content": "You are a professional journalist. Write clear, factual news articles using ONLY the information provided. Do not invent names, quotes, numbers, or any other details. Use exact quotes as given."
            },
            {
                "role": "user",
                "content": f"Write a professional news article using these facts:\n\n{facts_input}"
            },
            {
                "role": "assistant",
                "content": article
            }
        ]
        
        training_pairs.append({
            "messages": messages,
            "facts_input": facts_input,
            "article": article,
            "extracted_facts": facts
        })

        #print( "facts_input:", facts_input,"extracted_facts: ", facts)
    
    print(f"\n✅ Created {len(training_pairs)} training pairs")
    print(f"⚠️  Skipped {skipped} articles (too short or insufficient facts)")
    
    return Dataset.from_list(training_pairs)


### Why POS Tagging Falls Short for Fine-Tuning

While POS tagging helps models learn syntax, it fails to:
- Capture implicit reasoning
- Preserve narrative flow
- Encode domain-specific context

When used for fine-tuning, POS-enriched inputs caused the model to:
- Generate syntactically valid but semantically incorrect text
- Hallucinate facts due to missing contextual grounding

Therefore, POS tagging is treated as an **exploratory preprocessing step**, not a training signal.


In [None]:
def main():
    print("\n" + "="*70)
    print("POS-BASED FACT-CONSTRAINED DATASET GENERATOR")
    print("="*70)
    
    # Generate dataset
    print("\n🚀 Starting dataset generation...")
    print("This will take ~1 hour for 10,000 examples")
    print("(Much faster than LLM-based approach!)\n")
    dataset = create_pos_dataset(num_examples=10000)
    
    
    # Save
    output_path = "pos_constrained_cnn_dataset"
    dataset.save_to_disk(output_path)
    print(f"\n💾 Dataset saved to: {output_path}")
    
    # Show examples
    print("\n" + "="*70)
    print("📋 EXAMPLE 1:")
    print("="*70)
    example = dataset[0]
    print("\n🔵 INPUT (Extracted Facts):")
    print(example['facts_input'])
    print("\n🟢 OUTPUT (Article):")
    print(example['article'][:400] + "...")
    print("="*70)
    
    print("\n" + "="*70)
    print("📋 EXAMPLE 2:")
    print("="*70)
    example = dataset[5]
    print("\n🔵 INPUT (Extracted Facts):")
    print(example['facts_input'])
    print("\n🟢 OUTPUT (Article):")
    print(example['article'][:400] + "...")
    print("="*70)
    
    # Statistics
    print("\n📊 DATASET STATISTICS:")
    print(f"  Total examples: {len(dataset)}")
    
    avg_facts_length = sum(len(ex['facts_input']) for ex in dataset) / len(dataset)
    avg_article_length = sum(len(ex['article']) for ex in dataset) / len(dataset)
    
    print(f"  Avg facts length: {avg_facts_length:.0f} chars")
    print(f"  Avg article length: {avg_article_length:.0f} chars")
    print(f"  Expansion ratio: {avg_article_length/avg_facts_length:.1f}x")
    
    print("\n✅ Dataset ready for training!")
    print("\nNext steps:")
  


main()


POS-BASED FACT-CONSTRAINED DATASET GENERATOR

🚀 Starting dataset generation...
This will take ~1 hour for 10,000 examples
(Much faster than LLM-based approach!)

📚 Loading CNN/DailyMail dataset...

🔍 Processing 10000 articles with POS tagging...

100%|██████████| 10000/10000 [17:02<00:00,  9.78it/s]

✅ Created 10000 training pairs
⚠️  Skipped 0 articles (too short or insufficient facts)

Saving the dataset (1/1 shards): 100%|██████████| 10000/10000 [00:00<00:00, 74420.27 examples/s]

💾 Dataset saved to: pos_constrained_cnn_dataset

📋 EXAMPLE 1:

🔵 INPUT (Extracted Facts):
People: the Order of the Phoenix, Peter Shaffer's, Potter, Harry Potter, Radcliffe, Londoner, Daniel Radcliffe, Rudyard Kipling
Organizations: Reuters
Locations: England, UK, LONDON
Dates: Earlier this year, Monday, 2007, later this year, earlier this month, last month
Money/Amounts: $41.1 million, £20 million
Numbers: two, about four, six, 18, five, one
Key events: LONDON gain access; actor say To; he tell interviewer; I think; agent have comment
Quote 1: "Harry Potter and the Order of the Phoenix"
Quote 2: "I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar,"
Quote 3: "I don't think I'll be particularly extravagant. "
Numbers mentioned: 20, million, 41.1, 18, one, 10

🟢 OUTPUT (Article):
LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 millio

In [17]:
print("\n📚 Loading POS-constrained dataset...")

try:
    dataset = Dataset.load_from_disk("pos_constrained_cnn_dataset")
    print(f"✅ Loaded {len(dataset)} examples")
except Exception as e:
    # Avoid a bare `except`, which would also swallow KeyboardInterrupt
    print(f"❌ Dataset not found! ({e})")
    print("Run: python generate_pos_dataset.py first")
    raise SystemExit(1)


📚 Loading POS-constrained dataset...
✅ Loaded 10000 examples


### Observations from POS-Based Enrichment

Key observations:
- POS tagging preserves grammar but strips meaning
- Context loss outweighs syntactic benefits
- Fine-tuned models trained on this data hallucinate heavily

This confirms that **semantic richness is more important than grammatical annotation** for this task.
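
One way to make the context-loss observation concrete is to measure how much of an article's vocabulary survives in its condensed fact list. The metric below is a hypothetical diagnostic, not something computed in this notebook, and the two strings are made-up inputs in the style of the example output.

```python
import re

def token_coverage(source, condensed):
    """Fraction of distinct lowercase word tokens in `source` found in `condensed`."""
    def tokenize(text):
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    src, cond = tokenize(source), tokenize(condensed)
    return len(src & cond) / len(src) if src else 0.0

article_text = "Daniel Radcliffe gains access to a fortune as he turns 18"
facts_text = "People: Daniel Radcliffe\nNumbers: 18"
coverage = token_coverage(article_text, facts_text)
```

A low value indicates that most of the connective wording the model is asked to reproduce never appears in its input.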


The POS-tagged dataset looked like the example above, but a model fine-tuned on it did not produce the desired output.

The likely reason is that the model is trained only on the extracted facts, with no signal for how those facts are used within a sentence (final loss: 1.9 over 3000 epochs).

## Using Ollama to Generate Rough Notes

## Moving Beyond POS: Semantic Enrichment with LLMs

Given the limitations of POS-based enrichment, we shift focus to **LLM-driven dataset enhancement**.

LLMs can:
- Infer missing context
- Preserve factual grounding
- Generate coherent explanatory text

This makes LLM-enriched data far more suitable for fine-tuning.


Here we use Ollama to generate rough, journalist-style notes for a given CNN article. In effect we reverse-engineer each article, which gives us proper training pairs: given these rough notes, this is what the final article looks like.

This lets us efficiently capture the writing style expected of a CNN article writer.
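
The pairing of rough notes with the original article can be sketched as a small helper that builds one chat-format record. The function name is hypothetical and the article text is a placeholder. Returning `None` for missing notes is an assumption of this sketch, so that a failed generation is never written into the dataset as text.

```python
import json

SYSTEM_PROMPT = (
    "You are a professional journalist. Expand rough notes into complete, "
    "well-written news articles. Maintain all facts while adding proper "
    "structure and professional language."
)

def build_training_entry(rough_notes, article, style="bullet"):
    """Return one chat-format record, or None when the notes are missing."""
    if not rough_notes or not rough_notes.strip():
        return None
    user_content = (
        "Expand these rough notes into a professional news article:\n\n"
        + rough_notes
    )
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": article},
        ],
        "rough_notes": rough_notes,
        "polished_article": article,
        "style": style,
    }

entry = build_training_entry("- Radcliffe turns 18 on Monday", "LONDON, England (Reuters) -- ...")
line = json.dumps(entry)  # one JSONL line
```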

In [6]:


try:
    import ollama
    print("✅ Ollama module found")
except ImportError:
    print("❌ Ollama not installed!")
    print("\nInstall:")
    print("  pip install ollama")
    raise SystemExit(1)


✅ Ollama module found


## Final Dataset Pipeline: LLM-Based Semantic Enrichment

This section generates the **final dataset used for fine-tuning**.

All samples produced here:
- Retain original factual content
- Add explanatory context
- Are saved and exported for training


In [6]:
def generate_rough_notes(article, style="bullet"):
    """
    Use local LLM to convert polished article into rough notes
    
    Styles:
    - "bullet": Bullet point format
    - "brief": Very short sentences
    - "fragments": Sentence fragments
    - "minimal": Absolute minimum info
    """
    
    prompts = {
        "bullet": """Convert this polished article into rough bullet-point notes that a journalist might write:

Article:
{article}

Requirements:
- Bullet points only
- Keep all key facts (names, numbers, dates, quotes)
- Remove fancy language
- Remove transitions and context
- Keep it brief

Rough notes:""",

        "brief": """Convert this into very brief rough notes with short sentences:

Article:
{article}

Requirements:
- Very short, simple sentences
- Keep facts but remove fluff
- No fancy words
- Sound like quick notes

Rough notes:""",

        "fragments": """Convert this into rough note fragments (incomplete sentences):

Article:
{article}

Requirements:
- Sentence fragments OK
- Keep key facts only
- Remove adjectives
- Very rough style

Notes:""",

        "minimal": """Extract ONLY the core facts from this article in minimal note form:

Article:
{article}

What happened? Who? When? How much? Quote?

Brief facts:"""
    }
    
    prompt = prompts.get(style, prompts["bullet"]).format(article=article[:1500])
    
    try:
        response = ollama.generate(
            model='llama3.2:3b',  # Fast, good quality
            prompt=prompt,
            options={
                'temperature': 0.3,  # Low = more consistent
                'top_p': 0.9,
                'num_predict': 200,  # Short responses
            }
        )
        
        rough_notes = response['response'].strip()
        
        # Clean up
        rough_notes = rough_notes.replace('**', '')  # Remove markdown bold
        rough_notes = rough_notes.strip()
        
        return rough_notes
        
    except Exception as e:
        print(f"Error generating: {e}")
        return None
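
The prompt-building step of the function above can be isolated and checked without a running model. This is a minimal sketch with shortened templates and a hypothetical `build_prompt` helper. It shows the fallback to the bullet style for unknown style names and the 1500-character truncation of the article.

```python
# Shortened templates for illustration; the notebook's prompts are longer
PROMPTS = {
    "bullet": "Convert this polished article into rough bullet-point notes:\n\n{article}\n\nRough notes:",
    "brief": "Convert this into very brief rough notes:\n\n{article}\n\nRough notes:",
}

def build_prompt(article, style="bullet", max_chars=1500):
    """Fill the template for `style`, falling back to the bullet template."""
    template = PROMPTS.get(style, PROMPTS["bullet"])
    # str.format only parses the template, so braces inside the article stay literal
    return template.format(article=article[:max_chars])

long_article = "{curly} " + "z" * 2000
prompt = build_prompt(long_article, style="unknown-style")
```

Truncating by characters can cut an article mid-sentence, so any facts past the cut-off never reach the notes.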



In [9]:
ollama.list()

ListResponse(models=[Model(model='llama3.2:3b', modified_at=datetime.datetime(2026, 1, 14, 0, 27, 52, 40183, tzinfo=TzInfo(-28800)), digest='a80c4f17acd55265feec403c7aef86be0c25983ab279d83f3bcd3abbcb5b8b72', size=2019393189, details=ModelDetails(parent_model='', format='gguf', family='llama', families=['llama'], parameter_size='3.2B', quantization_level='Q4_K_M')), Model(model='gemma2:2b', modified_at=datetime.datetime(2025, 7, 11, 11, 17, 27, 384119, tzinfo=TzInfo(-25200)), digest='8ccf136fdd5298f3ffe2d69862750ea7fb56555fa4d5b18c04e3fa4d82ee09d7', size=1629518495, details=ModelDetails(parent_model='', format='gguf', family='gemma2', families=['gemma2'], parameter_size='2.6B', quantization_level='Q4_0')), Model(model='llama3:latest', modified_at=datetime.datetime(2025, 7, 11, 11, 15, 21, 694869, tzinfo=TzInfo(-25200)), digest='365c0bd3c000a25d28ddbf732fe1c6add414de7275464c4e4d1c3b5fcb5d8ad1', size=4661224676, details=ModelDetails(parent_model='', format='gguf', family='llama', families

In [None]:

cnn_dataset = load_dataset("cnn_dailymail", "3.0.0", split="train")
training_pairs = []
skipped = 0
start_time = time.time()
completed = 4459      # start index for this run; earlier indices were handled before
num_examples = 5000
style = "bullet"

# Append mode keeps the records already present in data.jsonl
with open("data.jsonl", "a", encoding="utf-8") as f:
    for i in tqdm(range(completed, min(num_examples, len(cnn_dataset)))):
        article = cnn_dataset[i]['article']

        # Skip very short articles
        if len(article) < 300:
            skipped += 1
            continue

        # Generate rough notes using LLM
        rough_notes = generate_rough_notes(article, style=style)
        if rough_notes is None:
            # Generation failed; do not write the literal string "None" as notes
            skipped += 1
            continue

        messages = [
            {
                "role": "system",
                "content": "You are a professional journalist. Expand rough notes into complete, well-written news articles. Maintain all facts while adding proper structure and professional language."
            },
            {
                "role": "user",
                "content": f"Expand these rough notes into a professional news article:\n\n{rough_notes}"
            },
            {
                "role": "assistant",
                "content": article
            }
        ]
        entry = {
            "messages": messages,
            "rough_notes": rough_notes,
            "polished_article": article,
            "style": style
        }
        training_pairs.append(entry)

        # One JSON object per line (JSONL)
        json.dump(entry, f)
        f.write("\n")

new_df = Dataset.from_list(training_pairs)

Using the latest cached version of the dataset since cnn_dailymail couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration '3.0.0' at /Users/aryanparab/.cache/huggingface/datasets/cnn_dailymail/3.0.0/0.0.0/96df5e686bee6baa90b8bee7c28b81fa3fa6223d (last modified on Tue Jan 13 00:41:11 2026).
100%|██████████| 541/541 [1:22:55<00:00,  9.20s/it]
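
Because generation takes several seconds per article, long runs are likely to be interrupted and resumed. The sketch below shows one possible way to derive the resume index from the JSONL file instead of editing it by hand. It assumes each record stores the dataset index it came from in a `source_index` field, which the notebook's records do not contain. Counting lines would not be enough, since skipped articles leave no line behind.

```python
import json
import os
import tempfile

def next_index(jsonl_path, key="source_index"):
    """Return the index after the highest one recorded in the file (0 if none)."""
    if not os.path.exists(jsonl_path):
        return 0
    highest = -1
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                highest = max(highest, json.loads(line).get(key, -1))
    return highest + 1

def append_records(jsonl_path, records):
    """Append one JSON object per line."""
    with open(jsonl_path, "a", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

path = os.path.join(tempfile.mkdtemp(), "data.jsonl")
start_empty = next_index(path)
# Index 2 is absent, as if that article had been skipped for being too short
append_records(path, [{"source_index": 0}, {"source_index": 1}, {"source_index": 3}])
resume_from = next_index(path)
```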


In [None]:


data = training_pairs

# Specify the filename
filename = "my_data.json"

# Open the file in write mode ('w') and use json.dump()
try:
    with open(filename, 'w') as json_file:
        json.dump(data, json_file, indent=4)
    print(f"Successfully wrote data to {filename}")
except IOError as e:
    print(f"Error writing to file {filename}: {e}")



Successfully wrote data to my_data.json


## Parsing the Data into the Correct Format

In [None]:


def fix_json_file(input_file, output_file):
    """
    Fixes JSON file with multiple objects (JSONL format) to proper JSON array

    Converts:
    {"key": "value1"}
    {"key": "value2"}

    To:
    [
      {"key": "value1"},
      {"key": "value2"}
    ]
    """

    print(f"🔧 Fixing {input_file}...")

    data = []

    with open(input_file, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:  # Skip empty lines
                continue

            try:
                obj = json.loads(line)
                data.append(obj)
            except json.JSONDecodeError as e:
                print(f"⚠️ Error on line {line_num}: {e}")
                print(f"   Line content: {line[:100]}...")

    print(f"✅ Successfully parsed {len(data)} objects")

    # Save as proper JSON array
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

    print(f"💾 Saved fixed JSON to: {output_file}")
    print(f"\n📊 Stats:")
    print(f"  Total records: {len(data)}")

    if data:
        print(f"\n📋 First record preview:")
        first = data[0]
        print(f"  Keys: {list(first.keys())}")
        if 'rough_notes' in first:
            print(f"  Rough notes length: {len(first['rough_notes'])} chars")
        if 'polished_article' in first:
            print(f"  Article length: {len(first['polished_article'])} chars")

    return data

In [None]:


EXAMPLE_FORMAT = [
    {
        "messages": [
            {
                "role": "system",
                "content": "You are a professional journalist. Expand rough notes into complete, well-written news articles. Maintain all facts while adding proper structure and professional language."
            },
            {
                "role": "user",
                "content": "Expand these rough notes into a professional news article:\n\n• Bullet point 1\n• Bullet point 2\n• etc..."
            },
            {
                "role": "assistant",
                "content": "Full polished article text here..."
            }
        ],
        "rough_notes": "• Bullet point 1\n• Bullet point 2\n• etc...",
        "polished_article": "Full polished article text here..."
    }
   
]



def validate_dataset(json_path):
    """Validate that your dataset is in the correct format"""

    print("🔍 Validating dataset...")

    try:
        with open(json_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
    except FileNotFoundError:
        print(f"❌ Error: File '{json_path}' not found!")
        return False
    except json.JSONDecodeError as e:
        print(f"❌ Error: Invalid JSON format - {e}")
        return False

    if not isinstance(data, list):
        print("❌ Error: JSON must be a list of objects")
        return False

    if len(data) == 0:
        print("❌ Error: Dataset is empty!")
        return False

    print(f"✅ Found {len(data)} examples")

    # Check first example
    example = data[0]
    required_keys = ['messages', 'rough_notes', 'polished_article']

    for key in required_keys:
        if key not in example:
            print(f"❌ Error: Missing key '{key}' in first example")
            return False

    # Validate messages structure
    if not isinstance(example['messages'], list) or len(example['messages']) != 3:
        print("❌ Error: 'messages' must be a list with 3 items (system, user, assistant)")
        return False

    roles = [msg.get('role') for msg in example['messages']]
    if roles != ['system', 'user', 'assistant']:
        print(f"❌ Error: Message roles must be ['system', 'user', 'assistant'], got {roles}")
        return False

    # Check for empty content
    for i, msg in enumerate(example['messages']):
        if not msg.get('content'):
            print(f"❌ Error: Empty content in message {i}")
            return False

    print("✅ Dataset format is valid!")

    # Statistics
    print("\n📊 DATASET STATISTICS:")
    print(f"  Total examples: {len(data)}")

    avg_notes = sum(len(ex['rough_notes']) for ex in data) / len(data)
    #avg_article = sum(len(ex['polished_article']) for ex in data) / len(data)

    print(f"  Average rough notes length: {avg_notes:.0f} chars")
    # print(f"  Average article length: {avg_article:.0f} chars")
    # print(f"  Expansion ratio: {avg_article/avg_notes:.1f}x")

    # Show first example
    print("\n📋 FIRST EXAMPLE:")
    print("\nRough notes:")
    print(example['rough_notes'][:200] + "..." if len(example['rough_notes']) > 200 else example['rough_notes'])
    print("\nPolished article:")
    print(example['polished_article'][:300] + "..." if len(example['polished_article']) > 300 else example['polished_article'])

    return True

In [None]:

def convert_to_training_format(input_json_path, output_json_path):
    """
    Convert your raw data to the training format

    Assumes input format like:
    [
        {
            "rough_notes": "...",
            "polished_article": "...",
            "style": "bullet"  # optional
        }
    ]
    """

    print(f"🔄 Converting {input_json_path} to training format...")

    with open(input_json_path, 'r', encoding='utf-8') as f:
        raw_data = json.load(f)

    training_data = []

    for item in raw_data:
        # Create the messages format
        # Skip records that lack the polished article
        if 'polished_article' not in item:
            continue
        messages = [
            {
                "role": "system",
                "content": "You are a professional journalist. Expand rough notes into complete, well-written news articles. Maintain all facts while adding proper structure and professional language."
            },
            {
                "role": "user",
                "content": f"Expand these rough notes into a professional news article:\n\n{item['rough_notes']}"
            },
            {
                "role": "assistant",
                "content": item['polished_article']
            }
        ]

        training_data.append({
            "messages": messages,
            "rough_notes": item['rough_notes'],
            "polished_article": item['polished_article']
        })

    # Save
    with open(output_json_path, 'w', encoding='utf-8') as f:
        json.dump(training_data, f, indent=2, ensure_ascii=False)

    print(f"✅ Converted {len(training_data)} examples")
    print(f"💾 Saved to: {output_json_path}")


In [None]:
input_filename = "/content/data.json"  # ← Your broken file
output_filename = "/content/cnn_training_data_fixed.json"  # ← Fixed output

fixed_data = fix_json_file(input_filename, output_filename)

🔧 Fixing /content/data.json...
✅ Successfully parsed 4999 objects
💾 Saved fixed JSON to: /content/cnn_training_data_fixed.json

📊 Stats:
  Total records: 4999

📋 First record preview:
  Keys: ['messages', 'rough_notes', 'polished_article', 'style']
  Rough notes length: 571 chars
  Article length: 2527 chars

In [None]:

validate_dataset(output_filename)
convert_to_training_format(output_filename, "/content/cnn_training_data.json")

🔍 Validating dataset...
✅ Found 4999 examples
✅ Dataset format is valid!

📊 DATASET STATISTICS:
  Total examples: 4999
  Average rough notes length: 585 chars

📋 FIRST EXAMPLE:

Rough notes:
• Daniel Radcliffe turns 18 on Monday with a £20 million fortune
• He won't spend the money on luxury items like cars or parties
• "I don't plan to be one of those people who, as soon as they turn 18,...

Polished article:
LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappoi...
🔄 Converting /content/cnn_training_data_fixed.json to training format...
✅ Converted 4999 examples
💾 Saved to: /content/cnn_training_data.json
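
The validation above inspects the structure of the first example only. A per-record check could look like the sketch below. The helper names are hypothetical and the two records are made-up inputs. It reports the index and the problems of every malformed record rather than stopping at the first.

```python
EXPECTED_ROLES = ["system", "user", "assistant"]
REQUIRED_KEYS = ("messages", "rough_notes", "polished_article")

def record_problems(example):
    """Return a list of human-readable problems found in one record."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in example]
    messages = example.get("messages")
    if isinstance(messages, list):
        roles = [m.get("role") for m in messages if isinstance(m, dict)]
        if roles != EXPECTED_ROLES:
            problems.append(f"unexpected roles: {roles}")
        for i, m in enumerate(messages):
            if not (isinstance(m, dict) and m.get("content")):
                problems.append(f"empty content in message {i}")
    elif "messages" in example:
        problems.append("'messages' is not a list")
    return problems

def find_invalid(data):
    """Map record index to its problems, for records with at least one problem."""
    invalid = {}
    for i, example in enumerate(data):
        problems = record_problems(example)
        if problems:
            invalid[i] = problems
    return invalid

def make_messages(assistant_text):
    return [
        {"role": "system", "content": "sys"},
        {"role": "user", "content": "notes"},
        {"role": "assistant", "content": assistant_text},
    ]

good = {"messages": make_messages("article"), "rough_notes": "notes", "polished_article": "article"}
bad = {"messages": make_messages(""), "rough_notes": "notes"}
invalid = find_invalid([good, bad])
```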

## Saving Only LLM-Enriched Data

Although multiple enrichment strategies were explored, **only LLM-generated outputs are persisted**.

This decision is based on empirical results showing:
- Lower hallucination rates
- Better contextual coherence
- Superior article-writing capability

POS-enriched representations are intentionally excluded from the training dataset.


## Final Summary

In this notebook, we evaluated multiple dataset enrichment strategies:

- POS-based enrichment was tested and rejected due to context loss and hallucination
- LLM-based enrichment was selected for its semantic depth and factual grounding

The final dataset:
- Contains **only LLM-generated enriched text**
- Is optimized for supervised fine-tuning
- Produces significantly more coherent and reliable models

This reinforces a key insight:
**For LLM fine-tuning, semantic context matters far more than syntactic annotation.**
