# Weeks 5-6: Text & Language Track
## Deep Dive into Language AI

**Track Focus:** Build tools that understand, analyze, and generate text.

This notebook covers TWO weeks. Work through it at your own pace, but aim to:
- Complete Parts 1-3 by the end of Week 5
- Complete Parts 4-5 and start your project by the end of Week 6

---

## Setup

In [None]:
# Install everything we need
!pip install transformers torch -q
print("Ready!")

In [None]:
from transformers import pipeline
print("Imports complete!")

### A Note on Professional Tools

The models you load from Hugging Face in this notebook are openly published, pretrained models of the kind used in real products. They are not simplified "educational" versions; the free `transformers` library just makes them easy to run.

If your code ever seems slow or you get memory errors, that's because you're running professional-grade AI models that normally require expensive hardware. Go to **Runtime → Restart runtime** to clear memory and try again.
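
If a full restart feels heavy-handed, a lighter option is to delete the models you no longer need and ask Python to reclaim the memory. The helper below is a sketch, not part of the original notebook: `free_memory` is a made-up name, and how much memory actually comes back depends on your runtime. A restart is still the most reliable fix.

```python
import gc

def free_memory(namespace, names):
    """Remove the named variables from `namespace`, then run garbage collection.

    Returns the list of names that were actually removed.
    """
    removed = [n for n in names if namespace.pop(n, None) is not None]
    gc.collect()
    try:
        # Only relevant when a GPU is in use; skipped if torch is missing.
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass
    return removed

# Example with a plain dictionary standing in for your notebook's variables.
# In a notebook you would pass globals() instead.
demo_namespace = {"qa": object(), "keep_me": 1}
print(free_memory(demo_namespace, ["qa", "not_loaded"]))
```

In the notebook itself the call would look like `free_memory(globals(), ["qa", "classifier"])`.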

---

# Part 1: Beyond Basic Sentiment (45 min)

## Emotion Detection vs. Sentiment

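Sentiment analysis tells you whether text is positive or negative. Emotion detection goes further and scores specific feelings. The model below scores seven labels: anger, disgust, fear, joy, neutral, sadness, and surprise.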

In [None]:
# Load emotion classifier
emotions = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",
    top_k=None
)

# Test it
text = "I can't believe they cancelled my favorite show. I'm so disappointed."
result = emotions(text)[0]

print(f"Text: {text}\n")
for e in sorted(result, key=lambda x: x['score'], reverse=True):
    print(f"  {e['label']:10} {'*' * int(e['score'] * 20)} {e['score']:.1%}")
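
Because the classifier was created with `top_k=None`, it returns a score for every label, not just the winner, as a list of `{'label', 'score'}` dictionaries. The sketch below works on that shape without loading a model. `top_emotions` is a hypothetical helper and `sample_scores` is made-up data for illustration, not real model output.

```python
def top_emotions(scores, n=3, min_score=0.05):
    """Return up to n (label, score) pairs at or above min_score, highest first."""
    ranked = sorted(scores, key=lambda s: s["score"], reverse=True)
    return [(s["label"], s["score"]) for s in ranked[:n] if s["score"] >= min_score]

# Made-up scores in the same shape as emotions(text)[0]
sample_scores = [
    {"label": "joy", "score": 0.02},
    {"label": "sadness", "score": 0.71},
    {"label": "anger", "score": 0.20},
    {"label": "neutral", "score": 0.07},
]

for label, score in top_emotions(sample_scores):
    print(f"{label:10} {score:.1%}")
```

With the real model you would call `top_emotions(emotions(text)[0])`.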

## Zero-Shot Classification: Any Categories You Want

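Zero-shot classification is one of the most flexible tools in this track. You supply the categories when you call the model, and it scores the text against each one with no extra training.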

In [None]:
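# Load zero-shot classifier (model named explicitly so results stay the same if the default changes)
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")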

# Example: Classify customer feedback
text = "The app crashes every time I try to upload a photo"
categories = ["bug report", "feature request", "compliment", "general question"]

result = classifier(text, categories)

print(f"Text: {text}\n")
print("Categories:")
for label, score in zip(result['labels'], result['scores']):
    print(f"  {label}: {score:.1%}")
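
One thing to watch: by default the scores are shared out across your categories so they add up to 100%, which means some label always comes out on top even when none of them fit. A simple guard is to treat a low top score as "unsure". This is a sketch under assumptions: `classify_or_unsure` and the 0.5 threshold are made up for illustration, and `fake_classifier` is a stand-in that returns fixed scores so the cell runs without the model.

```python
def classify_or_unsure(classifier, text, labels, threshold=0.5):
    """Return (label, score) for the top label, or ('unsure', score) if below threshold."""
    result = classifier(text, labels)
    best_label, best_score = result["labels"][0], result["scores"][0]
    if best_score < threshold:
        return "unsure", best_score
    return best_label, best_score

# Stand-in for the real pipeline: same output keys, fixed made-up scores.
def fake_classifier(text, labels):
    scores = [0.4, 0.3, 0.2, 0.1][: len(labels)]
    return {"sequence": text, "labels": list(labels), "scores": scores}

print(classify_or_unsure(fake_classifier, "Hello?", ["bug report", "feature request"]))
```

To use it for real, pass the notebook's `classifier` in place of `fake_classifier`.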

In [None]:
# YOUR TURN: Create your own classification system
# Ideas: homework types, email categories, social media posts, game genres

my_text = ""  # Your text here
my_categories = []  # Your categories here

if my_text and my_categories:
    result = classifier(my_text, my_categories)
    for label, score in zip(result['labels'], result['scores']):
        print(f"  {label}: {score:.1%}")

## Exercise: Build a Homework Sorter

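Create a tool that takes homework descriptions and categorizes them several ways. The starter code below handles subject, type, and a difficulty guess; urgency is left for you to add:
1. Subject (math, english, science, etc.)
2. Urgency (due today, due this week, due later)
3. Type (reading, writing, problem-solving, creative)
4. Difficulty (easy, medium, hard)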

In [None]:
def homework_analyzer(homework_description):
    """
    Analyze homework and categorize it multiple ways.
    """
    print(f"Homework: {homework_description}\n")
    
    # Subject classification
    subjects = ["math", "english", "science", "history", "art", "computer science"]
    subject_result = classifier(homework_description, subjects)
    print(f"Subject: {subject_result['labels'][0]} ({subject_result['scores'][0]:.1%})")
    
    # Type classification
    types = ["reading assignment", "writing assignment", "problem solving", "creative project", "research"]
    type_result = classifier(homework_description, types)
    print(f"Type: {type_result['labels'][0]} ({type_result['scores'][0]:.1%})")
    
    # Difficulty guess
    difficulty = ["easy", "medium", "hard"]
    diff_result = classifier(homework_description, difficulty)
    print(f"Difficulty: {diff_result['labels'][0]} ({diff_result['scores'][0]:.1%})")
    
    return {
        'subject': subject_result['labels'][0],
        'type': type_result['labels'][0],
        'difficulty': diff_result['labels'][0]
    }

# Test it!
homework_analyzer("Read chapters 5-7 of To Kill a Mockingbird and write a one-page summary")

In [None]:
# Test with more homework
homeworks = [
    "Solve problems 1-20 on page 145 about quadratic equations",
    "Create a poster about the water cycle",
    "Write a 500-word essay about the American Revolution",
    "Build a simple website using HTML and CSS"
]

for hw in homeworks:
    print("=" * 50)
    homework_analyzer(hw)
    print()
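
The exercise lists urgency as a category, but `homework_analyzer` does not classify it. One way to add it, and any other dimension, is to keep the label lists in a dictionary and loop over them. This is a sketch: `classify_dimensions` and `HOMEWORK_DIMENSIONS` are hypothetical names. Keep in mind the model can only judge urgency from wording in the description itself (like "due tomorrow"); it does not know today's date.

```python
def classify_dimensions(classifier, text, dimensions):
    """Run one zero-shot classification per dimension and return the top label for each."""
    return {
        name: classifier(text, labels)["labels"][0]
        for name, labels in dimensions.items()
    }

HOMEWORK_DIMENSIONS = {
    "subject": ["math", "english", "science", "history", "art", "computer science"],
    "urgency": ["due today", "due this week", "due later"],
    "type": ["reading assignment", "writing assignment", "problem solving"],
}

# Usage with the notebook's classifier (needs the model to be loaded):
# classify_dimensions(classifier, "Essay on the water cycle due tomorrow", HOMEWORK_DIMENSIONS)
```

Each dimension costs one extra model call, so more dimensions means a slower analyzer.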

---

# Part 2: Question Answering Systems (45 min)

Build tools that can answer questions based on text you provide.

In [None]:
# Load QA pipeline
qa = pipeline("question-answering")

In [None]:
# Example: Answer questions about a text passage
context = """
The Amazon rainforest is the world's largest tropical rainforest, covering over 
5.5 million square kilometers. It is home to approximately 10% of all species on Earth,
including jaguars, pink river dolphins, and poison dart frogs. The Amazon River, 
which flows through the forest, is the second longest river in the world and 
carries more water than any other river. The rainforest produces about 20% of the 
world's oxygen, earning it the nickname "the lungs of the Earth." However, 
deforestation threatens this vital ecosystem, with about 17% of the forest 
lost in the last 50 years.
"""

questions = [
    "How big is the Amazon rainforest?",
    "What percentage of species live there?",
    "Why is it called the lungs of the Earth?",
    "How much forest has been lost?"
]

print("AMAZON RAINFOREST Q&A")
print("=" * 50)
for q in questions:
    result = qa(question=q, context=context)
    print(f"\nQ: {q}")
    print(f"A: {result['answer']} (confidence: {result['score']:.1%})")
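
This kind of question answering is extractive: the model does not write a new sentence, it copies a span of text out of the context. Alongside `answer` and `score`, the result also includes `start` and `end` character positions, so you can show where the answer came from. The helper below is a sketch with a hypothetical name, run here on a made-up result so it works without the model.

```python
def show_answer_in_context(context, result, window=40):
    """Return the answer in [brackets] with `window` characters of context on each side."""
    start, end = result["start"], result["end"]
    before = context[max(0, start - window):start]
    after = context[end:end + window]
    return f"...{before}[{context[start:end]}]{after}..."

# Made-up result in the same shape the QA pipeline returns
sample_context = "The Amazon rainforest covers over 5.5 million square kilometers."
answer = "5.5 million square kilometers"
start = sample_context.index(answer)
sample_result = {"answer": answer, "score": 0.9, "start": start, "end": start + len(answer)}

print(show_answer_in_context(sample_context, sample_result, window=15))
```

If the bracketed span looks unrelated to the question, treat the answer with suspicion no matter what the confidence says.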

## Exercise: Build a Study Helper

Create a tool where you paste your notes, and it answers questions about them.

In [None]:
def study_helper(notes, questions):
    """
    A study helper that answers questions based on your notes.
    """
    print("STUDY HELPER")
    print("=" * 50)
    print(f"\nNotes length: {len(notes)} characters")
    print(f"Number of questions: {len(questions)}")
    print("\n" + "-" * 50)
    
    results = []
    for q in questions:
        answer = qa(question=q, context=notes)
        results.append({
            'question': q,
            'answer': answer['answer'],
            'confidence': answer['score']
        })
        
        # Color-code by confidence: green above 70%, yellow above 40%, red otherwise
        confidence_emoji = "🟢" if answer['score'] > 0.7 else "🟡" if answer['score'] > 0.4 else "🔴"
        print(f"\n{confidence_emoji} Q: {q}")
        print(f"   A: {answer['answer']}")
        print(f"   Confidence: {answer['score']:.1%}")
    
    return results

# Test with sample notes
my_notes = """
The Civil War began in 1861 when Confederate forces attacked Fort Sumter. 
Abraham Lincoln was president during the war. The main cause was disagreement 
over slavery and states' rights. The war ended in 1865 with the surrender of 
General Robert E. Lee at Appomattox Court House. About 620,000 soldiers died, 
making it the deadliest war in American history. The 13th Amendment, which 
abolished slavery, was passed shortly after the war ended.
"""

my_questions = [
    "When did the Civil War start?",
    "Who was president during the war?",
    "Where did the war end?",
    "How many soldiers died?",
    "What amendment abolished slavery?"
]

study_helper(my_notes, my_questions)
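
Since `study_helper` returns its results as a list of dictionaries, you can reuse them after the printout. One possible follow-up, sketched below, is pulling out the questions the model was unsure about so you know what to double-check in your notes. `questions_to_review` and the 0.4 cutoff are assumptions for illustration, and `sample_results` is made-up data in the same shape.

```python
def questions_to_review(results, threshold=0.4):
    """Return the questions whose answer confidence is below threshold."""
    return [r["question"] for r in results if r["confidence"] < threshold]

# Made-up results in the shape study_helper returns
sample_results = [
    {"question": "When did the Civil War start?", "answer": "1861", "confidence": 0.95},
    {"question": "Where did the war end?", "answer": "Appomattox Court House", "confidence": 0.31},
]

print(questions_to_review(sample_results))
```

With real results you would write `questions_to_review(study_helper(my_notes, my_questions))`.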

---

### Restart Your Runtime

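Good checkpoint! Finish the "YOUR TURN" cell just below first, because it still needs `qa` and `study_helper`. Then, before continuing to Part 3, restart your runtime to free up memory.

**Runtime → Restart runtime**, then re-run the Setup cells (cells 2-3) before continuing.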

---

In [None]:
# YOUR TURN: Use it with your own notes!
# Paste notes from a class you're taking

your_notes = """
PASTE YOUR NOTES HERE
"""

your_questions = [
    "Question 1?",
    "Question 2?",
    "Question 3?"
]

# study_helper(your_notes, your_questions)

---

# Part 3: Summarization & Key Information (45 min)

Extract the most important information from long texts.

In [None]:
# Load summarization pipeline
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

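# Load NER for extracting key entities
# aggregation_strategy="simple" merges word pieces into whole entities
# (it replaces the deprecated grouped_entities=True)
ner = pipeline("ner", aggregation_strategy="simple")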

In [None]:
# Example: Summarize and extract key info from an article
article = """
SpaceX successfully launched its Starship rocket on its latest test flight, marking 
a significant milestone in the company's quest to create a fully reusable spacecraft. 
The massive rocket, standing at 400 feet tall, lifted off from the company's facility 
in Boca Chica, Texas early Thursday morning. CEO Elon Musk watched from the control 
center as the rocket completed a successful stage separation and the booster returned 
to the launch pad for the first time. While the upper stage was intentionally destroyed 
over the ocean as planned, the successful booster catch represents a major advancement 
toward SpaceX's goal of making space travel more affordable. NASA has contracted SpaceX 
to use Starship for its Artemis program, which aims to return humans to the Moon by 2026.
"""

# Summarize
summary = summarizer(article, max_length=60, min_length=20)[0]['summary_text']

# Extract key entities
entities = ner(article)

print("ARTICLE ANALYSIS")
print("=" * 50)
print(f"\nOriginal length: {len(article)} characters")
print(f"\nSUMMARY:\n{summary}")
print("\nKEY ENTITIES:")
for e in entities:
    print(f"  {e['word']}: {e['entity_group']}")
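
Summarization models can only read a limited amount of text at once, so a long article has to be split up and summarized piece by piece. The sketch below splits on sentence endings and packs sentences into chunks. It is an illustration, not the notebook's method: `chunk_text` is a hypothetical helper, and the character budget is a rough stand-in because the real limit is counted in tokens, not characters.

```python
import re

def chunk_text(text, max_chars=2000):
    """Split text into chunks of whole sentences, each within max_chars where possible.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

demo = "First sentence here. Second sentence follows! Is this the third? Yes it is."
for chunk in chunk_text(demo, max_chars=45):
    print(repr(chunk))
```

You could then summarize each chunk with `summarizer(chunk, ...)` and join the pieces, at the cost of one model call per chunk.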

---

### Restart Your Runtime

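Finish the News Digest exercise below first, because it still needs `summarizer` and `ner`. Then, before Part 4, restart again: **Runtime → Restart runtime**

Re-run the Setup cells (cells 2-3), then reload the models Part 4 uses: `emotions` and `classifier` from Part 1, and `summarizer` and `ner` from Part 3.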

---

## Exercise: Build a News Digest Tool

Create a tool that takes multiple articles and creates a digest.

In [None]:
def create_news_digest(articles):
    """
    Takes a list of articles and creates a summary digest.
    """
    print("NEWS DIGEST")
    print("=" * 50)
    
    all_entities = []
    
    for i, article in enumerate(articles, 1):
        print(f"\n--- Article {i} ---")
        
        # Summarize
        if len(article) > 100:
            summary = summarizer(article, max_length=50, min_length=15)[0]['summary_text']
            print(f"Summary: {summary}")
        else:
            print(f"Summary: {article}")
        
        # Get entities
        entities = ner(article)
        for e in entities:
            all_entities.append(e['word'])
    
    # Find most mentioned entities
    from collections import Counter
    entity_counts = Counter(all_entities)
    
    print(f"\n{'='*50}")
    print("TRENDING TOPICS:")
    for entity, count in entity_counts.most_common(5):
        print(f"  {entity}: mentioned {count} time(s)")

# Test with sample articles
sample_articles = [
    "Apple announced new iPhone models at its annual event in Cupertino. Tim Cook presented the iPhone 16 with improved AI features.",
    "Google released an update to its Gemini AI model, claiming it now outperforms competitors on most benchmarks.",
    "Microsoft's partnership with OpenAI continues to expand, with new AI features coming to Windows and Office products."
]

create_news_digest(sample_articles)
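
The digest counts every entity the same way, whether it is a person, a company, or a place, and it trusts low-confidence guesses as much as strong ones. Each entity from the NER pipeline also carries an `entity_group` and a `score`, so you can sort and filter. A possible refinement is sketched below; `group_entities` and the 0.8 cutoff are assumptions, and `sample_entities` is made-up data, not real model output.

```python
from collections import defaultdict

def group_entities(entities, min_score=0.8):
    """Group entity words by type, skipping low-confidence and duplicate entries."""
    grouped = defaultdict(list)
    for e in entities:
        word = e["word"].strip()
        if e["score"] >= min_score and word not in grouped[e["entity_group"]]:
            grouped[e["entity_group"]].append(word)
    return dict(grouped)

# Made-up entities in the shape the NER pipeline returns
sample_entities = [
    {"entity_group": "ORG", "word": "Apple", "score": 0.99},
    {"entity_group": "PER", "word": "Tim Cook", "score": 0.98},
    {"entity_group": "ORG", "word": "Apple", "score": 0.97},
    {"entity_group": "MISC", "word": "AI", "score": 0.55},
]

print(group_entities(sample_entities))
```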

---

# Part 4: Combining Models for Powerful Tools (45 min)

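The real power comes from combining multiple AI capabilities. The function below uses `emotions` and `classifier` from Part 1 and `summarizer` and `ner` from Part 3, so reload those models first if you restarted your runtime.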

In [None]:
def comprehensive_text_analysis(text):
    """
    Perform multiple analyses on a piece of text.
    """
    print("COMPREHENSIVE TEXT ANALYSIS")
    print("=" * 60)
    print(f"\nAnalyzing text ({len(text)} characters)...\n")
    
    results = {}
    
    # 1. Emotion Analysis
    print("1. EMOTIONAL TONE")
    emotion_result = emotions(text[:500])[0]  # Limit length
    top_emotions = sorted(emotion_result, key=lambda x: x['score'], reverse=True)[:3]
    results['emotions'] = top_emotions
    for e in top_emotions:
        print(f"   {e['label']}: {e['score']:.1%}")
    
    # 2. Topic Classification
    print("\n2. LIKELY TOPIC")
    topics = ["technology", "politics", "sports", "entertainment", "science", "business", "personal"]
    topic_result = classifier(text[:500], topics)
    results['topic'] = topic_result['labels'][0]
    print(f"   Primary topic: {topic_result['labels'][0]} ({topic_result['scores'][0]:.1%})")
    
    # 3. Key Entities
    print("\n3. KEY ENTITIES")
    entity_result = ner(text)
    results['entities'] = entity_result
    if entity_result:
        for e in entity_result[:5]:
            print(f"   {e['word']}: {e['entity_group']}")
    else:
        print("   No named entities found")
    
    # 4. Summary
    print("\n4. SUMMARY")
    if len(text) > 100:
        summary = summarizer(text, max_length=60, min_length=20)[0]['summary_text']
        results['summary'] = summary
        print(f"   {summary}")
    else:
        results['summary'] = text
        print("   (Text too short to summarize)")
    
    # 5. Writing Style
    print("\n5. WRITING STYLE")
    styles = ["formal", "casual", "academic", "journalistic", "creative"]
    style_result = classifier(text[:500], styles)
    results['style'] = style_result['labels'][0]
    print(f"   Style: {style_result['labels'][0]} ({style_result['scores'][0]:.1%})")
    
    return results

# Test it!
test_text = """
The breakthrough discovery in quantum computing announced yesterday by researchers 
at MIT has sent shockwaves through the tech industry. Dr. Sarah Chen and her team 
demonstrated a new method for maintaining quantum coherence at room temperature, 
potentially solving one of the biggest obstacles to practical quantum computers. 
"This changes everything," said Chen at the press conference. Google and IBM stocks 
rose sharply on the news, while investors scrambled to understand the implications. 
Critics caution that scaling the technology remains a challenge, but optimists 
believe commercial quantum computers could arrive within five years.
"""

comprehensive_text_analysis(test_text)
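
When several models run in a row, one failure (a missing model, text that is too long) stops the whole analysis. One way to make a combined tool sturdier is to run each step separately and record an error instead of crashing. This is a sketch of that pattern, not a rewrite of the function above: `run_steps` is a hypothetical helper, and `demo_steps` uses plain Python functions as stand-ins for the models.

```python
def run_steps(text, steps):
    """Run each named step on text; store an error message if a step fails."""
    results = {}
    for name, step in steps.items():
        try:
            results[name] = step(text)
        except Exception as exc:  # keep going even if one step breaks
            results[name] = f"failed: {exc}"
    return results

# Stand-in steps; in the notebook these could wrap emotions, classifier, ner, summarizer
demo_steps = {
    "length": len,
    "first_word": lambda t: t.split()[0],
}

print(run_steps("Quantum computers are coming", demo_steps))
print(run_steps("", demo_steps))  # first_word fails on empty text, length still works
```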

---

# Part 5: Your Track Project (Week 6 Focus)

## Project Ideas for Text & Language Track

### Idea 1: Smart Homework Assistant
- Input: Homework questions and your notes
- Output: Suggested answers with confidence levels
- Features: Highlights when it's unsure, suggests what to review

### Idea 2: Writing Feedback Tool
- Input: An essay or writing sample
- Output: Emotion analysis, readability score, key points summary
- Features: Suggests if tone matches intended audience
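
None of the models in this notebook produce a readability score, so that part of Idea 2 needs plain Python. The sketch below uses two rough signals, average sentence length and average word length. It is a starting point under simple assumptions, not a standard readability formula, and `readability_stats` is a made-up name.

```python
import re

def readability_stats(text):
    """Return average words per sentence and average letters per word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return {"words_per_sentence": 0.0, "letters_per_word": 0.0}
    return {
        "words_per_sentence": len(words) / len(sentences),
        "letters_per_word": sum(len(w) for w in words) / len(words),
    }

print(readability_stats("The cat sat. The dog ran far away."))
```

Longer sentences and longer words usually mean harder reading, but abbreviations like "Dr." will throw off the sentence count.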

### Idea 3: Social Media Analyzer
- Input: Collection of posts or comments
- Output: Overall sentiment, trending topics, emotion breakdown
- Features: Identifies potentially toxic content

### Idea 4: Study Quiz Generator
- Input: Your notes or a textbook chapter
- Output: Generated quiz questions with answers
- Features: Uses QA to verify answers are in the source
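
The "verify answers" feature in Idea 4 could work like this: ask the QA model each quiz question against the source and keep only the items where the model finds the expected answer with decent confidence. Everything here is a sketch under assumptions: `verify_quiz`, the 0.3 cutoff, and the loose text matching are made up for illustration, and `fake_qa` is a stand-in so the cell runs without a model.

```python
def verify_quiz(qa, source, quiz_items, min_score=0.3):
    """Keep quiz items whose expected answer matches what the QA model extracts."""
    verified = []
    for item in quiz_items:
        result = qa(question=item["question"], context=source)
        found = result["answer"].strip().lower()
        expected = item["answer"].strip().lower()
        matches = bool(found) and (expected in found or found in expected)
        if result["score"] >= min_score and matches:
            verified.append(item)
    return verified

# Stand-in for the QA pipeline: fixed made-up answers
def fake_qa(question, context):
    if "start" in question:
        return {"answer": "1861", "score": 0.9}
    return {"answer": "Fort Sumter", "score": 0.1}

quiz = [
    {"question": "When did the war start?", "answer": "1861"},
    {"question": "Who was president?", "answer": "Abraham Lincoln"},
]
print(verify_quiz(fake_qa, "The Civil War began in 1861.", quiz))
```

Pass the notebook's `qa` pipeline instead of `fake_qa` to check a real quiz.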

### Idea 5: Your Own Idea!
What problem do YOU want to solve with text AI?

## Project Planning Template

**My Project:** 

**What it does:** 

**Who would use it:** 

**AI capabilities I'll use:**
- [ ] Sentiment/Emotion analysis
- [ ] Zero-shot classification
- [ ] Question answering
- [ ] Summarization
- [ ] Named entity recognition
- [ ] Text generation
- [ ] Other: ___________

**What the input will look like:**

**What the output will look like:**

**Stretch goals (if I have time):**

In [None]:
# START YOUR PROJECT HERE!
# Use AI (ChatGPT/Claude) to help you build it.
# Remember the CLEAR framework for prompting.



In [None]:
# PROJECT CODE CELL 2



In [None]:
# PROJECT CODE CELL 3



In [None]:
# PROJECT TESTING



---

## Checklist: Weeks 5-6

**Week 5:**
- [ ] Completed Part 1: Emotion & Classification
- [ ] Completed Part 2: Question Answering
- [ ] Completed Part 3: Summarization
- [ ] Chose a project idea

**Week 6:**
- [ ] Completed Part 4: Combining Models
- [ ] Started building project
- [ ] Got something working (even if basic)
- [ ] Saved to GitHub

---

## Looking Ahead: Week 7

Next week begins the **Project Phase**. You'll:
- Define your project clearly
- Build a working prototype
- Get feedback and iterate

Come prepared with your project started!

---

*Youth Horizons AI Researcher Program - Level 2 | Text & Language Track*