## Import Libraries

In [1]:
from datasets import load_dataset
import random
import re
from collections import Counter
import json

In [2]:
# Load Cosmopedia-100K dataset
print("Loading Cosmopedia-100K dataset...")
cosmopedia = load_dataset("HuggingFaceTB/cosmopedia-100k", split="train")

print(f"Dataset size: {len(cosmopedia)} examples")
print(f"Dataset features: {cosmopedia.features}")

Loading Cosmopedia-100K dataset...


README.md:   0%|          | 0.00/944 [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


train-00000-of-00002.parquet:   0%|          | 0.00/153M [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


train-00001-of-00002.parquet:   0%|          | 0.00/153M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/100000 [00:00<?, ? examples/s]

Dataset size: 100000 examples
Dataset features: {'prompt': Value(dtype='string', id=None), 'text_token_length': Value(dtype='int64', id=None), 'text': Value(dtype='string', id=None), 'seed_data': Value(dtype='string', id=None), 'format': Value(dtype='string', id=None), 'audience': Value(dtype='string', id=None)}


In [3]:
# Explore the structure of the dataset
print("\n" + "="*60)
print("DATASET STRUCTURE EXPLORATION")
print("="*60)

# Look at first few examples
for i in range(3):
    example = cosmopedia[i]
    print(f"\nExample {i+1}:")
    print(f"Prompt: {example['prompt'][:200]}...")
    print(f"Text length: {len(example['text'])} characters")
    print(f"Text preview: {example['text'][:300]}...")
    if 'audience' in example:
        print(f"Audience: {example['audience']}")
    if 'format' in example:
        print(f"Format: {example['format']}")
    print("-" * 40)



DATASET STRUCTURE EXPLORATION

Example 1:
Prompt: Here is an extract from a webpage: "What can cause my settlement offer to be delayed?
When you’ve been injured in an Austin truck accident, one of the most common questions is how long it will take fo...
Text length: 3057 characters
Text preview:  When you've been involved in an auto accident, particularly one involving a commercial truck, receiving a settlement offer from the insurance company is often top of mind. After all, medical bills, lost wages, and property damage can quickly add up, leaving you financially strained. However, the ti...
Audience: general
Format: blogpost
----------------------------------------

Example 2:
Prompt: Here is an extract from a webpage: "The LISA Pathfinder scientific collaboration will meet in Trento, together with key representatives of the space agencies and industries that made the mission possi...
Text length: 4183 characters
Text preview:  Course Unit: LISA Pathfinder Mission and Gravitational

In [5]:
# Analyze prompt patterns
print("\n" + "="*60)
print("PROMPT ANALYSIS")
print("="*60)

prompts = [example['prompt'] for example in cosmopedia]
audiences = [example.get('audience', 'unknown') for example in cosmopedia]
formats = [example.get('format', 'unknown') for example in cosmopedia]

print("Audience distribution:")
audience_counts = Counter(audiences)
for audience, count in audience_counts.most_common(10):
    print(f"  {audience}: {count}")

print("\nFormat distribution:")
format_counts = Counter(formats)
for format_type, count in format_counts.most_common(10):
    print(f"  {format_type}: {count}")

# Analyze prompt keywords and patterns
print("\nCommon prompt patterns:")
prompt_words = []
for prompt in prompts[:1000]:  # Sample first 1000 prompts
    words = prompt.lower().split()
    prompt_words.extend(words)

common_prompt_words = Counter(prompt_words).most_common(20)
print("Most common words in prompts:")
for word, count in common_prompt_words:
    print(f"  {word}: {count}")


PROMPT ANALYSIS
Audience distribution:
  general: 57597
  college_students: 32161
  young_children: 5081
  grade_school_students: 3153
  researchers: 972
  high_school_studnets: 823
  college_studnets: 108
  middle_school_students: 105

Format distribution:
  blogpost: 37927
  textbook_academic_tone: 28261
  educational_piece: 6203
  story_reddit: 4235
  story_children: 4160
  story_life_lessons: 4056
  wikihow: 3979
  textbook_narrative_tone: 3606
  textbook_narrative: 3364
  story_morality: 1915

Common prompt patterns:
Most common words in prompts:
  the: 12808
  and: 7852
  a: 5893
  of: 5577
  to: 5420
  in: 4240
  -: 3408
  that: 2714
  an: 2631
  is: 2586
  with: 2425
  or: 2167
  for: 2136
  write: 2040
  not: 1659
  as: 1483
  do: 1458
  extract: 1343
  from: 1326
  you: 1266


In [6]:
# Quality Analysis Functions
def analyze_text_quality(text):
    quality_issues = []
    
    # Check for repetition
    sentences = text.split('.')
    if len(sentences) > 5:
        sentence_counts = Counter(sentences)
        repeated_sentences = [s for s, count in sentence_counts.items() if count > 1 and len(s.strip()) > 10]
        if repeated_sentences:
            quality_issues.append(f"Repeated sentences: {len(repeated_sentences)}")
    
    # Check for coherence issues (basic patterns)
    if "I'm sorry" in text or "I cannot" in text:
        quality_issues.append("Contains refusal patterns")
    
    # Check for formatting issues
    if text.count('\n\n') > len(text) / 100:  # Too many line breaks
        quality_issues.append("Excessive line breaks")
    
    # Check for incomplete sentences
    if text.endswith((',', ';', 'and', 'or', 'but')):
        quality_issues.append("Incomplete ending")
    
    # Basic factual consistency check (very simple)
    numbers = re.findall(r'\b\d+\b', text)
    if len(numbers) > 10:
        # Check for obviously wrong calculations or dates
        for num in numbers:
            if len(num) == 4 and num.startswith('2') and int(num) > 2025:
                quality_issues.append("Future date mentioned")
                break
    
    return quality_issues

In [7]:
# Analyze quality across dataset sample
print("\n" + "="*60)
print("QUALITY ANALYSIS")
print("="*60)

sample_size = 500
sample_indices = random.sample(range(len(cosmopedia)), sample_size)
quality_report = {
    'total_issues': 0,
    'issue_types': Counter(),
    'examples_with_issues': []
}

print(f"Analyzing quality of {sample_size} random examples...")

for idx in sample_indices[:100]:  # Analyze first 100 for detailed output
    example = cosmopedia[idx]
    issues = analyze_text_quality(example['text'])
    
    if issues:
        quality_report['total_issues'] += len(issues)
        quality_report['issue_types'].update(issues)
        if len(quality_report['examples_with_issues']) < 5:  # Store first 5 examples
            quality_report['examples_with_issues'].append({
                'idx': idx,
                'prompt': example['prompt'][:100],
                'issues': issues,
                'text_preview': example['text'][:200]
            })

print(f"\nQuality Analysis Results:")
print(f"Examples with issues: {len([1 for idx in sample_indices[:100] if analyze_text_quality(cosmopedia[idx]['text'])])}")
print(f"Total issues found: {quality_report['total_issues']}")

print("\nMost common issue types:")
for issue_type, count in quality_report['issue_types'].most_common():
    print(f"  {issue_type}: {count}")

print("\nExamples with quality issues:")
for i, example in enumerate(quality_report['examples_with_issues']):
    print(f"\nExample {i+1}:")
    print(f"Prompt: {example['prompt']}...")
    print(f"Issues: {', '.join(example['issues'])}")
    print(f"Text preview: {example['text_preview']}...")



QUALITY ANALYSIS
Analyzing quality of 500 random examples...

Quality Analysis Results:
Examples with issues: 0
Total issues found: 0

Most common issue types:

Examples with quality issues:


In [8]:
# Look for specific factual errors
print("\n" + "="*60)
print("FACTUAL ERROR DETECTION")
print("="*60)

def check_factual_errors(text):
    errors = []
    
    # Check for obviously wrong facts
    if "the sun revolves around the earth" in text.lower():
        errors.append("Geocentric model error")
    
    # Check for impossible dates
    current_year = 2025
    years = re.findall(r'\b(19|20)\d{2}\b', text)
    for year in years:
        if int(year) > current_year:
            errors.append(f"Future year mentioned: {year}")
    
    # Check for mathematical errors in simple cases
    math_expressions = re.findall(r'(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)', text)
    for expr in math_expressions:
        if int(expr[0]) + int(expr[1]) != int(expr[2]):
            errors.append(f"Math error: {expr[0]} + {expr[1]} ≠ {expr[2]}")
    
    return errors

factual_errors = []
for idx in random.sample(range(len(cosmopedia)), 200):
    example = cosmopedia[idx]
    errors = check_factual_errors(example['text'])
    if errors:
        factual_errors.append({
            'idx': idx,
            'prompt': example['prompt'][:100],
            'errors': errors,
            'text_preview': example['text'][:300]
        })

print(f"Found {len(factual_errors)} examples with potential factual errors:")
for i, example in enumerate(factual_errors[:5]):  # Show first 5
    print(f"\nExample {i+1}:")
    print(f"Prompt: {example['prompt']}...")
    print(f"Errors: {', '.join(example['errors'])}")
    print(f"Text: {example['text_preview']}...")



FACTUAL ERROR DETECTION
Found 0 examples with potential factual errors:


In [9]:
# Generate diverse prompts
print("\n" + "="*60)
print("PROMPT VARIATION EXPERIMENTS")
print("="*60)

# Analyze existing successful prompts
successful_prompts = []
for i in range(50):
    example = cosmopedia[i]
    if len(example['text']) > 500 and len(analyze_text_quality(example['text'])) == 0:
        successful_prompts.append(example['prompt'])

print("Examples of high-quality prompts:")
for i, prompt in enumerate(successful_prompts[:5]):
    print(f"{i+1}. {prompt[:150]}...")

# Create prompt variations
def create_prompt_variations(base_topic, audience_level="general"):
    """Generate diverse prompts for the same topic"""
    
    variations = [
        f"Write a comprehensive guide about {base_topic} for {audience_level} readers.",
        f"Explain {base_topic} using real-world examples and analogies.",
        f"Create a step-by-step tutorial on {base_topic} with practical applications.",
        f"Discuss the history and evolution of {base_topic} in an engaging narrative style.",
        f"Write a Q&A format explanation covering the most important aspects of {base_topic}.",
        f"Create an educational article about {base_topic} that includes common misconceptions and facts.",
        f"Write a problem-solving guide for {base_topic} with worked examples.",
        f"Explain {base_topic} from multiple perspectives, including benefits and challenges."
    ]
    
    return variations

# Example prompt variations
topics = ["renewable energy", "machine learning basics", "ancient civilizations", "nutrition science"]
audiences = ["middle school students", "college students", "general public", "professionals"]

print("\nGenerated prompt variations:")
for topic in topics[:2]:
    for audience in audiences[:2]:
        variations = create_prompt_variations(topic, audience)
        print(f"\nTopic: {topic} | Audience: {audience}")
        for i, variation in enumerate(variations[:3]):
            print(f"  {i+1}. {variation}")


PROMPT VARIATION EXPERIMENTS
Examples of high-quality prompts:
1. Here is an extract from a webpage: "What can cause my settlement offer to be delayed?
When you’ve been injured in an Austin truck accident, one of the...
2. Here is an extract from a webpage: "The LISA Pathfinder scientific collaboration will meet in Trento, together with key representatives of the space a...
3. Here is an extract from a webpage: "Recording of Present Day: Math’s Greatest Hits, “Analysis” with Alex Kontorovich
This is a recording of a live-str...
4. Here is an extract from a webpage: "DoP Jules O'Loughlin ASC ACS Talks Us Through Angel Has Fallen
Angel Has Fallen (2019) is the third instalment in ...
5. Write an educational story (3-5 paragraphs) targeted at young children using simple words. The story should be inspired from this text snippet: 
“Not ...

Generated prompt variations:

Topic: renewable energy | Audience: middle school students
  1. Write a comprehensive guide about renewable energy for m

In [10]:
# Diversity metrics
print("\n" + "="*60)
print("DIVERSITY ANALYSIS")
print("="*60)

def calculate_diversity_metrics(texts):
    
    # Vocabulary diversity
    all_words = []
    for text in texts:
        words = text.lower().split()
        all_words.extend(words)
    
    unique_words = len(set(all_words))
    total_words = len(all_words)
    vocab_diversity = unique_words / total_words if total_words > 0 else 0
    
    # Topic diversity (simple keyword-based)
    topics = {
        'science': ['research', 'study', 'experiment', 'theory', 'scientific'],
        'history': ['historical', 'ancient', 'century', 'civilization', 'era'],
        'technology': ['computer', 'digital', 'software', 'internet', 'technology'],
        'education': ['learning', 'student', 'teaching', 'knowledge', 'academic'],
        'health': ['medical', 'health', 'treatment', 'disease', 'medicine']
    }
    
    topic_counts = {topic: 0 for topic in topics}
    for text in texts:
        text_lower = text.lower()
        for topic, keywords in topics.items():
            if any(keyword in text_lower for keyword in keywords):
                topic_counts[topic] += 1
    
    return {
        'vocab_diversity': vocab_diversity,
        'unique_words': unique_words,
        'total_words': total_words,
        'topic_distribution': topic_counts
    }

# Analyze diversity of sample
sample_texts = [cosmopedia[i]['text'] for i in range(100)]
diversity_metrics = calculate_diversity_metrics(sample_texts)

print("Diversity Analysis Results:")
print(f"Vocabulary diversity: {diversity_metrics['vocab_diversity']:.4f}")
print(f"Unique words: {diversity_metrics['unique_words']:,}")
print(f"Total words: {diversity_metrics['total_words']:,}")

print("\nTopic distribution in sample:")
for topic, count in diversity_metrics['topic_distribution'].items():
    percentage = (count / 100) * 100
    print(f"  {topic}: {count} examples ({percentage}%)")


DIVERSITY ANALYSIS
Diversity Analysis Results:
Vocabulary diversity: 0.2377
Unique words: 13,451
Total words: 56,595

Topic distribution in sample:
  science: 51 examples (51.0%)
  history: 90 examples (90.0%)
  technology: 20 examples (20.0%)
  education: 44 examples (44.0%)
  health: 32 examples (32.0%)


In [11]:
print("\n" + "="*60)
print("RECOMMENDATIONS FOR IMPROVEMENT")
print("="*60)

print("""
Based on the analysis, here are recommendations for generating more diverse and higher-quality synthetic data:

1. PROMPT DIVERSITY:
   - Use varied instruction formats (explain, describe, analyze, compare)
   - Include different perspectives and viewpoints
   - Vary the complexity and depth of explanations
   - Mix formal and informal tones

2. QUALITY IMPROVEMENTS:
   - Add fact-checking prompts
   - Include requests for specific examples and evidence
   - Ask for structured content with clear sections
   - Request citations or references where appropriate

3. CONTENT DIVERSITY:
   - Cover underrepresented topics more thoroughly
   - Include interdisciplinary content
   - Add more practical, hands-on examples
   - Include diverse cultural perspectives

4. FORMAT VARIATIONS:
   - Mix educational formats (tutorials, explanations, discussions)
   - Include dialogue and conversational formats
   - Add problem-solving scenarios
   - Create comparative analyses
""")

print("\nSample improved prompts:")
improved_prompts = [
    "Write a balanced analysis of renewable energy sources, including both advantages and limitations, with specific examples from different countries.",
    "Create a step-by-step guide for understanding machine learning concepts, using everyday analogies and avoiding technical jargon.",
    "Explain the cultural and scientific significance of ancient astronomical observations, connecting historical practices to modern understanding.",
    "Develop a comprehensive nutrition guide that addresses common myths and provides evidence-based recommendations for different age groups."
]

for i, prompt in enumerate(improved_prompts):
    print(f"{i+1}. {prompt}")


RECOMMENDATIONS FOR IMPROVEMENT

Based on the analysis, here are recommendations for generating more diverse and higher-quality synthetic data:

1. PROMPT DIVERSITY:
   - Use varied instruction formats (explain, describe, analyze, compare)
   - Include different perspectives and viewpoints
   - Vary the complexity and depth of explanations
   - Mix formal and informal tones

2. QUALITY IMPROVEMENTS:
   - Add fact-checking prompts
   - Include requests for specific examples and evidence
   - Ask for structured content with clear sections
   - Request citations or references where appropriate

3. CONTENT DIVERSITY:
   - Cover underrepresented topics more thoroughly
   - Include interdisciplinary content
   - Add more practical, hands-on examples
   - Include diverse cultural perspectives

4. FORMAT VARIATIONS:
   - Mix educational formats (tutorials, explanations, discussions)
   - Include dialogue and conversational formats
   - Add problem-solving scenarios
   - Create comparative ana