# Unit 1 Assignment: spaCy Basics
## Exploring spaCy's Core Objects

**Learning Goals:**
- Load spaCy models and process text
- Understand the relationship between `nlp`, `Doc`, and `Token` objects
- Explore basic token attributes
- Practice iterating through documents and tokens

---

## Setup

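First, let's install and import spaCy. If you're running this in Colab, uncomment the two install commands (`!pip install spacy` and `!python -m spacy download en_core_web_sm`) at the top of the cell below.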

In [None]:
# Uncomment these lines if running in Google Colab:
# !pip install spacy
# !python -m spacy download en_core_web_sm

import spacy
import pandas as pd

# Load the English model
nlp = spacy.load("en_core_web_sm")
print(f"spaCy version: {spacy.__version__}")
print(f"Model loaded: {nlp.meta['name']}")

## Exercise 1: Your First spaCy Document

Let's start by processing a simple sentence and exploring the resulting `Doc` object.

In [None]:
# Process a sample text
text = "Natural language processing with spaCy is powerful and efficient."
doc = nlp(text)

print(f"Original text: {text}")
print(f"Doc object: {doc}")
print(f"Type of doc: {type(doc)}")
print(f"Number of tokens: {len(doc)}")
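
A `Doc` behaves like a sequence: indexing it returns a `Token`, and slicing it returns a `Span`. The sketch below illustrates this under one assumption that is not part of the assignment: it uses a tokenizer-only pipeline from `spacy.blank("en")`, so it runs without downloading a model. The names `blank_nlp` and `demo_doc` are illustrative only.

```python
import spacy
from spacy.tokens import Doc, Span, Token

# Assumption: a blank English pipeline (tokenizer only), not en_core_web_sm
blank_nlp = spacy.blank("en")
demo_doc = blank_nlp("spaCy turns raw text into tokens.")

first = demo_doc[0]      # integer index -> Token
phrase = demo_doc[1:3]   # slice -> Span (a view onto the Doc)

print(type(demo_doc).__name__, type(first).__name__, type(phrase).__name__)
print(first.text, "|", phrase.text)
print(first.i, len(demo_doc))  # token.i is the token's position in the Doc
```

The same indexing and slicing should work on the `doc` created with `en_core_web_sm` above.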

### Task 1.1: Process Your Own Text

Create a variable called `my_text` with a sentence of your choice, process it with spaCy, and print basic information about the resulting document.

In [None]:
# TODO: Create your own text and process it
my_text = ""  # Replace with your sentence
my_doc = None  # Process my_text with nlp()

# TODO: Print the text, document, and token count
# Your code here

In [None]:
# Test your solution
assert len(my_text) > 0, "Please provide some text"
assert my_doc is not None, "Please process the text with nlp()"
assert len(my_doc) > 0, "The document should contain tokens"
print("✅ Task 1.1 completed successfully!")

## Exercise 2: Exploring Tokens

Now let's look at individual tokens and their attributes.

In [None]:
# Let's examine each token in our document
text = "The quick brown fox jumps over the lazy dog."
doc = nlp(text)

print("Token Analysis:")
print("-" * 50)
for token in doc:
    print(f"Text: '{token.text}' | Lemma: '{token.lemma_}' | POS: '{token.pos_}' | Is Stop: {token.is_stop}")
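
Note the trailing underscore in `lemma_` and `pos_`. In spaCy, the underscore form of an attribute returns a readable string, while the form without the underscore returns an integer ID. A minimal sketch of the difference, assuming a tokenizer-only `spacy.blank("en")` pipeline and using the `lower` / `lower_` pair (which needs no trained model); `blank_nlp`, `demo_doc`, and `tok` are illustrative names:

```python
import spacy

# Assumption: a blank English pipeline, so no model download is required
blank_nlp = spacy.blank("en")
demo_doc = blank_nlp("The Quick brown fox")

for token in demo_doc:
    # token.lower is an integer ID; token.lower_ is the readable string
    print(token.text, token.lower, token.lower_)

tok = demo_doc[1]
# The vocabulary's string store maps the integer ID back to its string
recovered = demo_doc.vocab.strings[tok.lower]
print(recovered == tok.lower_)
```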

### Task 2.1: Token Attribute Explorer

Complete the function below to extract specific token attributes.

In [None]:
def analyze_tokens(text):
    """
    Analyze tokens in a text and return lists of different attributes.
    
    Args:
        text (str): Input text to analyze
        
    Returns:
        dict: Dictionary with lists of token attributes
    """
    doc = nlp(text)
    
    result = {
        'tokens': [],
        'lemmas': [],
        'pos_tags': [],
        'is_alpha': [],
        'is_stop': []
    }
    
    # TODO: Fill in the lists by iterating through tokens
    for token in doc:
        result['tokens'].append(token.text)  # Example - complete the rest
        # TODO: Add token.lemma_ to lemmas list
        # TODO: Add token.pos_ to pos_tags list  
        # TODO: Add token.is_alpha to is_alpha list
        # TODO: Add token.is_stop to is_stop list
    
    return result

# Test your function
sample_text = "I'm learning spaCy for NLP analysis!"
analysis = analyze_tokens(sample_text)

# Display results as a DataFrame for easy viewing
# (pandas needs every list to have the same length, so complete the TODOs first)
df = pd.DataFrame(analysis)
print(df)

In [None]:
# Test your solution
test_analysis = analyze_tokens("Hello world!")
assert len(test_analysis['lemmas']) > 0, "Please fill in the lemmas list"
assert len(test_analysis['pos_tags']) > 0, "Please fill in the pos_tags list"
assert len(test_analysis['is_alpha']) > 0, "Please fill in the is_alpha list"
assert len(test_analysis['is_stop']) > 0, "Please fill in the is_stop list"
print("✅ Task 2.1 completed successfully!")

## Exercise 3: Filtering Tokens

Often we want to filter tokens based on certain criteria. Let's practice this.
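
As a warm-up, here is a sketch of the general pattern: a list comprehension that keeps only the tokens passing a boolean test. It deliberately uses different criteria from the task below (punctuation and token length) and assumes a tokenizer-only `spacy.blank("en")` pipeline; the variable names are illustrative.

```python
import spacy

# Assumption: a blank English pipeline; is_punct is a lexical attribute,
# so it works without a trained model
blank_nlp = spacy.blank("en")
demo_doc = blank_nlp("Wait, what? Filtering is easy!")

# Keep every token that is not punctuation
no_punct = [t.text for t in demo_doc if not t.is_punct]

# Keep tokens longer than four characters (len(token) counts characters)
long_tokens = [t.text for t in demo_doc if len(t) > 4]

print(no_punct)
print(long_tokens)
```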

### Task 3.1: Content Words Only

Create a function that extracts only "content words" (words that are alphabetic and not stop words).

In [None]:
def extract_content_words(text):
    """
    Extract content words (alphabetic, non-stop words) from text.
    
    Args:
        text (str): Input text
        
    Returns:
        list: List of content words (lowercased)
    """
    doc = nlp(text)
    content_words = []
    
    # TODO: Iterate through tokens and add content words
    for token in doc:
        # TODO: Check if token is alphabetic AND not a stop word
        if ...:  # TODO: replace ... with your condition
            content_words.append(token.text.lower())
    
    return content_words

# Test your function
test_text = "The quick brown fox jumps over the lazy dog in the park."
content = extract_content_words(test_text)
print(f"Original: {test_text}")
print(f"Content words: {content}")

In [None]:
# Test your solution
test_content = extract_content_words("The cat sat on the mat.")
assert 'cat' in test_content, "Should include content words like 'cat'"
assert 'sat' in test_content, "Should include content words like 'sat'"
assert 'the' not in test_content, "Should not include stop words like 'the'"
assert 'on' not in test_content, "Should not include stop words like 'on'"
print("✅ Task 3.1 completed successfully!")

## Exercise 4: Document Statistics

Let's create a comprehensive analysis function that provides various statistics about a document.

### Task 4.1: Document Analyzer

Complete the function below to calculate various document statistics.
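
Two of the statistics rely on plain Python idioms: a `set` to count unique items and a running total to compute an average. The sketch below shows both on an ordinary list of strings rather than on spaCy tokens, so it is a hint and not the solution; `words` is made-up sample data.

```python
# Made-up sample data standing in for token texts
words = ["the", "cat", "saw", "the", "other", "cat"]

# A set keeps one copy of each distinct item
unique_words = set(words)

# Sum the lengths, then divide by the count (guarding against an empty list)
total_chars = sum(len(w) for w in words)
avg_len = total_chars / len(words) if words else 0.0

print(len(unique_words))
print(round(avg_len, 2))
```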

In [None]:
def analyze_document(text):
    """
    Provide comprehensive statistics about a text document.
    
    Args:
        text (str): Input text
        
    Returns:
        dict: Dictionary with various statistics
    """
    doc = nlp(text)
    
    # TODO: Calculate these statistics
    stats = {
        'total_tokens': 0,           # Total number of tokens
        'alphabetic_tokens': 0,      # Number of alphabetic tokens
        'punctuation_tokens': 0,     # Number of punctuation tokens
        'stop_words': 0,             # Number of stop words
        'content_words': 0,          # Number of content words (alphabetic + non-stop)
        'unique_lemmas': 0,          # Number of unique lemmas
        'avg_token_length': 0.0      # Average token length in characters
    }
    
    # TODO: Implement the calculations
    lemmas_set = set()  # To track unique lemmas
    total_length = 0    # To calculate average length
    
    for token in doc:
        stats['total_tokens'] += 1
        
        # TODO: Update other statistics based on token properties
        # Hint: Use token.is_alpha, token.is_punct, token.is_stop
        # Remember to add to lemmas_set and total_length
    
    # Derive the final values from lemmas_set and total_length (filled in above)
    stats['unique_lemmas'] = len(lemmas_set)
    if stats['total_tokens'] > 0:
        stats['avg_token_length'] = total_length / stats['total_tokens']
    
    return stats

# Test with sample text
sample = "Natural language processing is fascinating! It involves computational linguistics, machine learning, and artificial intelligence."
stats = analyze_document(sample)

print("Document Statistics:")
print("-" * 30)
for key, value in stats.items():
    print(f"{key}: {value}")

In [None]:
# Test your solution
test_stats = analyze_document("Hello, world! This is a test.")
assert test_stats['total_tokens'] > 0, "Should count total tokens"
assert test_stats['punctuation_tokens'] > 0, "Should count punctuation"
assert test_stats['alphabetic_tokens'] > 0, "Should count alphabetic tokens"
assert test_stats['avg_token_length'] > 0, "Should calculate average length"
print("✅ Task 4.1 completed successfully!")

## Exercise 5: Real-world Application

Let's apply what we've learned to analyze a longer piece of text.

In [None]:
# Sample article text
article = """
Artificial intelligence has revolutionized many industries in recent years. 
From healthcare to finance, AI systems are transforming how we work and live. 
Natural language processing, a subset of AI, enables computers to understand 
and generate human language. This technology powers chatbots, translation 
services, and text analysis tools. Machine learning algorithms continue to 
improve, making AI more accurate and efficient than ever before.
""".strip()

print("Article Analysis:")
print("=" * 50)
print(f"Text: {article[:100]}...")
print()

# Analyze the article
article_stats = analyze_document(article)
content_words = extract_content_words(article)

print("Statistics:")
for key, value in article_stats.items():
    print(f"  {key}: {value}")

print(f"\nFirst 10 content words: {content_words[:10]}")

### Task 5.1: Word Frequency Analysis

Create a simple word frequency counter for content words.
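
If `collections.Counter` is new to you, the sketch below shows how it behaves on a small hand-written list. The list `toy_words` is made-up data, not output from spaCy.

```python
from collections import Counter

# Made-up sample data
toy_words = ["ai", "language", "ai", "learning", "language", "ai"]

# Counter maps each distinct item to the number of times it appears
counts = Counter(toy_words)

# most_common(n) returns the n most frequent items as (item, count) tuples
top_two = counts.most_common(2)

print(counts["ai"])
print(top_two)
```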

In [None]:
from collections import Counter

def word_frequency_analysis(text, top_n=10):
    """
    Analyze word frequency in text (content words only).
    
    Args:
        text (str): Input text
        top_n (int): Number of top words to return
        
    Returns:
        list: List of (word, frequency) tuples
    """
    # TODO: Get content words and count their frequency
    content_words = extract_content_words(text)
    # TODO: Use Counter to count word frequencies
    # TODO: Return the top_n most common words
    
    return []  # Placeholder so the cell runs: replace with your implementation

# Test with the article
top_words = word_frequency_analysis(article, top_n=5)
print("Top 5 content words:")
for word, freq in top_words:
    print(f"  {word}: {freq}")

## Reflection Questions

Answer these questions based on what you've learned:

1. **What is the relationship between `nlp`, `Doc`, and `Token` objects in spaCy?**

2. **Why might you want to filter out stop words in text analysis?**

3. **What are some advantages of using spaCy over simple string methods for text processing?**

4. **How could the token attributes we explored be useful in real NLP applications?**

---

## Summary

Congratulations! You've completed Unit 1. You now understand:

✅ How to load spaCy models and process text  
✅ The hierarchy of spaCy objects (`nlp` → `Doc` → `Token`)  
✅ Key token attributes like `text`, `lemma_`, `pos_`, `is_stop`  
✅ How to filter and analyze tokens programmatically  
✅ Basic text statistics and frequency analysis  

**Next up:** [Unit 2 - Tokens & POS](../units/02-tokens-pos/index.md) where we'll dive deeper into part-of-speech tagging and linguistic analysis!
