# Day 1: Text Pipeline - Your First Language Model

Welcome to hands-on text processing! Now that you understand neural networks, let's explore how they work with text data.

## 🎯 Learning Objectives
By the end of this notebook, you will:
- Understand how text becomes numbers (tokenization)
- Load and use a pre-trained language model
- Experiment with text generation parameters
- Compare different prompt engineering techniques
- Build your first text generation pipeline

## 📚 Research Focus
This notebook emphasizes **discovery learning**. You'll:
1. Research concepts before implementing
2. Experiment with parameters to see their effects
3. Compare different approaches
4. Build understanding through hands-on exploration

---

## 1. From Text to Numbers

Neural networks work with numbers, but we have text. How do we bridge this gap?

🔍 **RESEARCH TASK 1**:
- What is tokenization in NLP?
- What is the difference between word-level and sub-word tokenization?
- Research "BPE" (Byte Pair Encoding) - how does it work?
- Why can't we just assign each word a number?

In [2]:
# Import required libraries
from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline
import torch
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from collections import Counter
import seaborn as sns

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


### Exploring Tokenization

🔍 **RESEARCH TASK 2**:
- Look up the GPT-2 tokenizer documentation
- What is a "vocabulary size"?
- What happens when the model encounters a word it's never seen?

In [4]:
# TODO: Load the GPT-2 tokenizer
# Hint: Use GPT2Tokenizer.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Test sentences to explore tokenization
test_sentences = [
    "Hello world!",
    "The quick brown fox jumps over the lazy dog.",
    "Artificial intelligence is revolutionizing technology.",
    "GPT-2 uses transformer architecture.",
    "Supercalifragilisticexpialidocious"  # Long word to see sub-word tokenization
]

print("🔍 Exploring Tokenization:")
print("=" * 50)

for sentence in test_sentences:
    # TODO: Tokenize the sentence
    # Hint: Use tokenizer.encode() to get token IDs
    # Use tokenizer.tokenize() to see the actual tokens
    tokens = tokenizer.tokenize(sentence)  # Get the actual token strings
    token_ids = tokenizer.encode(sentence)  # Get the numerical IDs

    print(f"\nOriginal: {sentence}")
    print(f"Tokens: {tokens}")
    print(f"Token IDs: {token_ids}")
    print(f"Number of tokens: {len(tokens)}")

# TODO: Print tokenizer vocabulary size
print(f"\n📊 Tokenizer vocabulary size: {len(tokenizer)}")  # Hint: len(tokenizer)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

🔍 Exploring Tokenization:

Original: Hello world!
Tokens: ['Hello', 'Ġworld', '!']
Token IDs: [15496, 995, 0]
Number of tokens: 3

Original: The quick brown fox jumps over the lazy dog.
Tokens: ['The', 'Ġquick', 'Ġbrown', 'Ġfox', 'Ġjumps', 'Ġover', 'Ġthe', 'Ġlazy', 'Ġdog', '.']
Token IDs: [464, 2068, 7586, 21831, 18045, 625, 262, 16931, 3290, 13]
Number of tokens: 10

Original: Artificial intelligence is revolutionizing technology.
Tokens: ['Art', 'ificial', 'Ġintelligence', 'Ġis', 'Ġrevolution', 'izing', 'Ġtechnology', '.']
Token IDs: [8001, 9542, 4430, 318, 5854, 2890, 3037, 13]
Number of tokens: 8

Original: GPT-2 uses transformer architecture.
Tokens: ['G', 'PT', '-', '2', 'Ġuses', 'Ġtransformer', 'Ġarchitecture', '.']
Token IDs: [38, 11571, 12, 17, 3544, 47385, 10959, 13]
Number of tokens: 8

Original: Supercalifragilisticexpialidocious
Tokens: ['Super', 'cal', 'if', 'rag', 'il', 'ist', 'ice', 'xp', 'ial', 'id', 'ocious']
Token IDs: [12442, 9948, 361, 22562, 346, 396, 501, 42372, 

### Understanding Token Patterns

🔍 **RESEARCH TASK 3**:
- Why do some words get split into multiple tokens?
- What does the 'Ġ' symbol represent in GPT-2 tokens?
- How might tokenization affect model performance?

In [6]:
# Analyze tokenization patterns
analysis_texts = [
    "running",
    "runner",
    "run",
    "unhappiness",
    "ChatGPT",
    "COVID-19",
    "2023",
    "programming",
    "antidisestablishmentarianism"
]

print("🔍 Token Pattern Analysis:")
print("=" * 60)

token_analysis = []

for text in analysis_texts:
    # TODO: Analyze each text
    tokens = tokenizer.tokenize(text)  # Tokenize the text
    token_count = len(tokens)    # Count the tokens

    token_analysis.append({
        'text': text,
        'tokens': tokens,
        'token_count': token_count,
        'chars_per_token': len(text) / token_count
    })

    print(f"{text:30} → {tokens} ({token_count} tokens)")

# TODO: Create a DataFrame and analyze patterns
df = pd.DataFrame(token_analysis)
# Calculate average characters per token
avg_chars_per_token = df['chars_per_token'].mean()

# Find the word with the maximum token count
max_token_word = df.loc[df['token_count'].idxmax(), 'text']
print(f"\n📊 Average characters per token: {avg_chars_per_token:.2f}")  # Calculate mean
print(f"📊 Longest word in tokens: {max_token_word}")  # Find max token_count

🔍 Token Pattern Analysis:
running                        → ['running'] (1 tokens)
runner                         → ['runner'] (1 tokens)
run                            → ['run'] (1 tokens)
unhappiness                    → ['un', 'h', 'appiness'] (3 tokens)
ChatGPT                        → ['Chat', 'G', 'PT'] (3 tokens)
COVID-19                       → ['CO', 'VID', '-', '19'] (4 tokens)
2023                           → ['20', '23'] (2 tokens)
programming                    → ['program', 'ming'] (2 tokens)
antidisestablishmentarianism   → ['ant', 'idis', 'establishment', 'arian', 'ism'] (5 tokens)

📊 Average characters per token: 4.12
📊 Longest word in tokens: antidisestablishmentarianism


## 2. Loading Your First Language Model

Now let's load GPT-2 and understand its architecture.

🔍 **RESEARCH TASK 4**:
- What is GPT-2 and when was it released?
- How many parameters does GPT-2 have? (Compare different sizes)
- What is "autoregressive" text generation?
- How does GPT-2 relate to the neural network you built in the previous notebook?

In [10]:
# TODO: Load GPT-2 model
# Hint: Use GPT2LMHeadModel.from_pretrained('gpt2')
print("🔄 Loading GPT-2 model (this may take a moment)...")
model = GPT2LMHeadModel.from_pretrained('gpt2')

# TODO: Set model to evaluation mode
# Hint: Use
model.eval()

print("✅ GPT-2 model loaded successfully!")

# Explore model architecture
print("\n🏗️ Model Architecture:")
print(f"Model type: {type(model).__name__}")

# TODO: Count model parameters
# Hint: sum(p.numel() for p in model.parameters())
total_params = sum(p.numel() for p in model.parameters())

print(f"Total parameters: {total_params:,}")
print(f"Model size: ~{total_params / 1e6:.1f}M parameters")

🔄 Loading GPT-2 model (this may take a moment)...
✅ GPT-2 model loaded successfully!

🏗️ Model Architecture:
Model type: GPT2LMHeadModel
Total parameters: 124,439,808
Model size: ~124.4M parameters


### Understanding Model Architecture

🔍 **RESEARCH TASK 5**:
- What are "transformer blocks" in GPT-2?
- What is "attention" in the context of neural networks?
- How does this compare to the simple network you built earlier?

In [11]:
# Explore model structure
print("🔍 Model Structure Analysis:")
print("=" * 50)

# TODO: Print model configuration
# Hint: Use model.config
config = model.config

print(f"Vocabulary size: {config.vocab_size}")
print(f"Maximum sequence length: {config.n_positions}")
print(f"Number of transformer layers: {config.n_layer}")
print(f"Number of attention heads: {config.n_head}")
print(f"Hidden size: {config.n_embd}")

# Compare to your simple network
print("\n🤔 Comparison to Your Neural Network:")
print(f"Your network had: 2 inputs → 4 hidden → 1 output")
print(f"GPT-2 has: {config.vocab_size} inputs → {config.n_embd} hidden → {config.vocab_size} outputs")
print(f"Your network: ~50 parameters")
print(f"GPT-2: {total_params:,} parameters")
print(f"GPT-2 is ~{total_params/50:,.0f}x larger!")

🔍 Model Structure Analysis:
Vocabulary size: 50257
Maximum sequence length: 1024
Number of transformer layers: 12
Number of attention heads: 12
Hidden size: 768

🤔 Comparison to Your Neural Network:
Your network had: 2 inputs → 4 hidden → 1 output
GPT-2 has: 50257 inputs → 768 hidden → 50257 outputs
Your network: ~50 parameters
GPT-2: 124,439,808 parameters
GPT-2 is ~2,488,796x larger!


## 3. Text Generation Experiments

Let's generate text and understand how different parameters affect the output.

🔍 **RESEARCH TASK 6**:
- What is "temperature" in text generation?
- What is "top-p" (nucleus) sampling?
- What's the difference between greedy decoding and sampling?
- How do these parameters affect creativity vs. coherence?

In [12]:
# TODO: Create a text generation pipeline
# Hint: Use pipeline('text-generation', model=model, tokenizer=tokenizer)
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

# Base prompt for experiments
base_prompt = "In the future, artificial intelligence will"

print(f"🤖 Base prompt: '{base_prompt}'")
print("=" * 60)

Device set to use cuda:0


🤖 Base prompt: 'In the future, artificial intelligence will'


### Temperature Experiments

🔍 **RESEARCH TASK 7**:
- What happens when temperature = 0?
- What happens when temperature > 1?
- Why might you want different temperatures for different tasks?

In [13]:
# Experiment with different temperatures
temperatures = [0.1, 0.7, 1.0, 1.5]

print("🌡️ Temperature Experiments:")
print("=" * 50)

for temp in temperatures:
    print(f"\n🔥 Temperature: {temp}")
    print("-" * 30)

    # TODO: Generate text with different temperatures
    # Hint: Use generator() with temperature parameter
    result = generator(
        base_prompt,  # prompt
        max_length=60,  # try 60
        temperature=temp,  # use the temp variable
        do_sample=True,  # should be True for sampling
        pad_token_id=tokenizer.eos_token_id  # use tokenizer.eos_token_id
    )

    # TODO: Print the generated text
    generated_text = result[0]['generated_text']  # Extract from result
    print(generated_text)

print("\n🤔 Discussion Questions:")
print("• Which temperature produced the most coherent text?")
print("• Which was most creative/surprising?")
print("• When might you use each temperature setting?")

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=60) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


🌡️ Temperature Experiments:

🔥 Temperature: 0.1
------------------------------


Both `max_new_tokens` (=256) and `max_length`(=60) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


In the future, artificial intelligence will be able to do things like search for information about people, and to do things like search for information about people.

The problem is that we're not really talking about a computer that can do things like that. We're talking about a computer that can do things like that.

The problem is that we're not really talking about a computer that can do things like that. We're talking about a computer that can do things like that.

The problem is that we're not really talking about a computer that can do things like that. We're talking about a computer that can do things like that.

The problem is that we're not really talking about a computer that can do things like that. We're talking about a computer that can do things like that.

The problem is that we're not really talking about a computer that can do things like that. We're talking about a computer that can do things like that.

The problem is that we're not really talking about a computer t

Both `max_new_tokens` (=256) and `max_length`(=60) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


In the future, artificial intelligence will allow a lot of new ways to interact with machines.

In the current process, we've moved away from the traditional way that you interact with machines, we've moved away from the idea of being able to interact with humans in any way that you want, and we've moved away from the idea of being able to interact with a computer. That's the fundamental idea that AI and the new machine learning technologies, the whole idea that robots are human beings.

We've really changed the way we think about AI in terms of the way we think about human behavior. And the idea that, "Well, we're robots, and we're human, and we're human, and we're human and we're human, and we're human, and we're human, and we're human, and we're human, and we're human, and we're human, and we're human, and we're human, and we're human, and we're human— we're human, and we're human, and we're human."

And we've really changed the way we think about human behavior. And the idea that, 

Both `max_new_tokens` (=256) and `max_length`(=60) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


In the future, artificial intelligence will become so common that it's hard for anyone in this industry to break even to even a minor level. We've got to look elsewhere (think AI and natural language processing, but not as an industry). But even if there are artificial intelligence developers in the world who will, it will still take a long time until we reach a large, multi-million dollar business.

The big question is: Will one day, and maybe by some means, become the norm. When we first started out with AI research, I was skeptical about the promise. The last few years have seen some interesting developments, like the rise of a crowdfunded AI-powered startup with a market capitalization of $150 million in 2014. But as if that weren't enough, it seems like AI is about to enter high gear. I wrote a story two years ago in which I demonstrated a machine learning system for creating real time news from Wikipedia. It's a relatively new technique, but that's already happened. The machine l

### Top-p (Nucleus) Sampling Experiments

🔍 **RESEARCH TASK 8**:
- How does top-p sampling work?
- What's the difference between top-k and top-p sampling?
- Why might top-p be better than just using temperature?

In [14]:
# Experiment with top-p sampling
top_p_values = [0.3, 0.7, 0.9, 1.0]

print("🎯 Top-p Sampling Experiments:")
print("=" * 50)

for top_p in top_p_values:
    print(f"\n🎲 Top-p: {top_p}")
    print("-" * 30)

    # TODO: Generate text with different top-p values
    result = generator(
        base_prompt,
        max_length=60,
        temperature=0.8,  # Keep temperature constant
        top_p=top_p,  # Use the top_p variable
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

    generated_text = result[0]['generated_text']
    print(generated_text)

print("\n🤔 Discussion Questions:")
print("• How did the outputs change with different top-p values?")
print("• What's the trade-off between diversity and quality?")

Both `max_new_tokens` (=256) and `max_length`(=60) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


🎯 Top-p Sampling Experiments:

🎲 Top-p: 0.3
------------------------------


Both `max_new_tokens` (=256) and `max_length`(=60) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


In the future, artificial intelligence will be able to do things like create a "super-computer" that can perform tasks like "cooking" and "cooking-up" food.

The AI will also be able to do things like "cooking" and "cooking-up" food.

The AI will also be able to do things like "cooking" and "cooking-up" food.

The AI will also be able to do things like "cooking" and "cooking-up" food.

The AI will also be able to do things like "cooking" and "cooking-up" food.

The AI will also be able to do things like "cooking" and "cooking-up" food.

The AI will also be able to do things like "cooking" and "cooking-up" food.

The AI will also be able to do things like "cooking" and "cooking-up" food.

The AI will also be able to do things like "cooking" and "cooking-up" food.

The AI will also be able to do things like "cooking" and "cooking-up" food.

🎲 Top-p: 0.7
------------------------------


Both `max_new_tokens` (=256) and `max_length`(=60) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


In the future, artificial intelligence will be able to detect things like a person's eyes, and it will be able to see things like a person's face.

"In the future, we will have to do a lot of research on the neural networks and neural networks of the brain to understand what is happening and how it is working. We will have to do that in the next 10 years.

"The big challenge is to make sure that we can do that in a way that we can control our brain. That is the big challenge for artificial intelligence."

This is a developing story, so we'll keep you updated as more information comes in.

Explore further: Researchers predict that artificial intelligence will eventually solve most problems

More information: Nature Communications, DOI: 10.1038/ncomms537

🎲 Top-p: 0.9
------------------------------


Both `max_new_tokens` (=256) and `max_length`(=60) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


In the future, artificial intelligence will have an increasingly important role in social and economic decisions.

For instance, some researchers believe artificial intelligence can be used to predict the future use of public transportation, including in cities, with the help of robots.

"What we're seeing is that we're seeing artificial intelligence being used in real-time to do real-time forecasting, to help us predict where the next major problem is going to take place," said David T. Karp, a professor of computer science at Stanford University, in a statement.

It's also possible that artificial intelligence could become an increasingly important tool for developing algorithms for the study of complex problems in human behavior.

But while AI is just beginning to be used in a specific area, the ability to develop new algorithms has been a major focus for many years, said Brian O'Leary, a Stanford computer scientist who co-authored a paper last year in the journal Nature.

"This is 

## 4. Prompt Engineering Experiments

The way you phrase your prompt dramatically affects the output.

🔍 **RESEARCH TASK 9**:
- What is "prompt engineering"?
- What are "few-shot" prompts?
- How can prompt structure influence model behavior?
- Research common prompt engineering techniques

In [15]:
# Different prompt styles to experiment with
prompts_to_test = {
    "Direct": "Write about artificial intelligence:",
    "Question": "What is artificial intelligence and how will it change the world?",
    "Story_Start": "Once upon a time, in a world where artificial intelligence was everywhere,",
    "List_Format": "Here are 5 ways artificial intelligence will change our lives:\n1.",
    "Expert_Persona": "As a leading AI researcher, I believe that artificial intelligence will",
    "Few_Shot": "Technology predictions:\n• The internet will connect everyone (1990s)\n• Smartphones will be everywhere (2000s)\n• Artificial intelligence will"
}

print("✍️ Prompt Engineering Experiments:")
print("=" * 60)

# TODO: Test each prompt style
for style, prompt in prompts_to_test.items():
    print(f"\n📝 Style: {style}")
    print(f"Prompt: '{prompt}'")
    print("-" * 40)

    # TODO: Generate text for each prompt
    result = generator(
        prompt,  # use the prompt variable
        max_length=80,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

    generated_text = result[0]['generated_text']
    print(generated_text)
    print("\n" + "="*60)

Both `max_new_tokens` (=256) and `max_length`(=80) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


✍️ Prompt Engineering Experiments:

📝 Style: Direct
Prompt: 'Write about artificial intelligence:'
----------------------------------------


Both `max_new_tokens` (=256) and `max_length`(=80) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Write about artificial intelligence:

https://www.youtube.com/watch?v=ZsJW9nHd9_E

https://www.youtube.com/watch?v=DXnCX8k5bWc

https://www.youtube.com/watch?v=9T-7tQ5pXK4

https://www.youtube.com/watch?v=Gzf4QZrW1vE

https://www.youtube.com/watch?v=qT-2bqDn8V6

https://www.youtube.com/watch?v=R-6C5qqwC9U

https://www.youtube.com/watch?v=1Xf6q-cXJ2Q

https://www.youtube.com/watch?v=PtL-S2pWcXg

https://www.youtube.com/watch?v=3w7QjBk9YQM

https://www.youtube.com/watch?v=9WV1j5pXK4

https://www.youtube


📝 Style: Question
Prompt: 'What is artificial intelligence and how will it change the world?'
----------------------------------------


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Both `max_new_tokens` (=256) and `max_length`(=80) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


What is artificial intelligence and how will it change the world?

The answer is a little different than the answer is a little different than the answer is a little different than the answer is a little different than the answer is a little different than the answer is a little different than the answer is a little different than the answer is a little different than the answer is a little different than the answer is a little different than the answer is a little different than the answer is a little different than the answer is a little different than the answer is a little different than the answer is a little different than the answer is a little different than the answer is a little different than the answer is a little different than the answer is a little different than the answer is a little different than the answer is a little different than the answer is a little different than the answer is a little different than the answer is a little different than the answer is a littl

Both `max_new_tokens` (=256) and `max_length`(=80) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Once upon a time, in a world where artificial intelligence was everywhere, it seemed like the only way to deal with it. But now, it seems, the only way to deal with it is to put it on a pedestal.

And that pedestal is the Internet.

In other words, the Internet is the only way to deal with artificial intelligence. And if it is not the only way to deal with it, then it will be the only way to deal with it.

If it is not the only way to deal with it, then it will be the only way to deal with it.

And that's what the Internet is all about.

The Internet is the Internet for people who can't afford to pay for their own computers.

The Internet is the Internet for people who can afford to pay for their own phones.

The Internet is the Internet for people who can't afford to pay for their own Internet.

The Internet is the Internet for people who can't afford to pay for their own Internet.

The Internet is the Internet for people who can't afford to pay for their own Internet.

The Internet i

Both `max_new_tokens` (=256) and `max_length`(=80) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Here are 5 ways artificial intelligence will change our lives:
1. AI will become the ultimate social and political force.
2. Artificial intelligence will become a powerful force for social change.
3. Artificial intelligence will become a force for change.
4. AI will become a force for change.
5. AI will become a force for change.
6. AI will become a force for change.
7. AI will become a force for change.
8. AI will become a force for change.
9. AI will become a force for change.
10. AI will become a force for change.
11. AI will become a force for change.
12. AI will become a force for change.
13. AI will become a force for change.
14. AI will become a force for change.
15. AI will become a force for change.
16. AI will become a force for change.
17. AI will become a force for change.
18. AI will become a force for change.
19. AI will become a force for change.
20. AI will become a force for change.
21. AI will become a force for change.
22. AI will become a force for change.
23. AI wi

Both `max_new_tokens` (=256) and `max_length`(=80) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


As a leading AI researcher, I believe that artificial intelligence will have a lot of opportunities to solve the problems we face, and that AI will have a lot of great potential to solve the problems we face.

The AI challenge is not about making artificial intelligence smarter. It is about making it smarter.

The question is not, "What do I need to do to make AI smarter?"

The question is, "What do I need to do to make AI smarter?"

The question is, "What do I need to do to make AI smarter?"

The answer to that question is, "I need to make AI smarter."

The answer to that question is, "I need to make AI smarter."

If AI can solve all the problems we face, then there will be a lot of good AI.

But if AI can solve the problems we face, then there will be a lot of bad AI.

And if AI can solve the problems we face, then there will be a lot of bad AI.

The AI challenge is not about making AI smarter. It is about making it smarter.

The answer to that question is, "What do I need to do to m

### Analyzing Prompt Effectiveness

🔍 **RESEARCH TASK 10**:
- Which prompt style produced the most useful output?
- How did the model's "behavior" change with different prompts?
- What makes a good prompt?
- How might this apply to chatbots or AI assistants?

In [16]:
# Let's analyze the generated text more systematically
print("📊 Prompt Analysis Exercise:")
print("=" * 50)

# TODO: For each prompt style, generate multiple outputs and analyze
analysis_results = []

for style, prompt in list(prompts_to_test.items())[:3]:  # Test first 3 for time
    # Generate 3 outputs for each prompt
    outputs = []

    for i in range(3):
        # TODO: Generate text
        result = generator(
            prompt,
            max_length=60,
            temperature=0.8,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )

        output = result[0]['generated_text']
        outputs.append(output)

    # TODO: Analyze the outputs
    lengths = [len(text) for text in outputs]
    avg_length = sum(lengths) / len(lengths) # Calculate average length of outputs

    analysis_results.append({
        'style': style,
        'prompt': prompt,
        'avg_length': avg_length,
        'outputs': outputs
    })

    print(f"\n{style}:")
    print(f"  Average length: {avg_length:.1f} characters")
    print(f"  Sample output: {outputs[0][:100]}...")

print("\n🤔 Reflection Questions:")
print("• Which prompt style was most consistent?")
print("• Which produced the most relevant outputs?")
print("• How might you improve these prompts?")

Both `max_new_tokens` (=256) and `max_length`(=60) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


📊 Prompt Analysis Exercise:


Both `max_new_tokens` (=256) and `max_length`(=60) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Both `max_new_tokens` (=256) and `max_length`(=60) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Both `max_new_tokens` (=256) and `max_length`(=60) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)



Direct:
  Average length: 755.7 characters
  Sample output: Write about artificial intelligence: how did it come to be?

I had always thought of AI as a field o...


Both `max_new_tokens` (=256) and `max_length`(=60) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Both `max_new_tokens` (=256) and `max_length`(=60) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Both `max_new_tokens` (=256) and `max_length`(=60) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)



Question:
  Average length: 1177.3 characters
  Sample output: What is artificial intelligence and how will it change the world?

Cameron: We think that we have a ...


Both `max_new_tokens` (=256) and `max_length`(=60) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Both `max_new_tokens` (=256) and `max_length`(=60) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)



Story_Start:
  Average length: 1013.7 characters
  Sample output: Once upon a time, in a world where artificial intelligence was everywhere, the best of humanity woul...

🤔 Reflection Questions:
• Which prompt style was most consistent?
• Which produced the most relevant outputs?
• How might you improve these prompts?


## 5. Building Your Text Generation Pipeline

Now let's create a customizable text generation function.

🔍 **RESEARCH TASK 11**:
- What parameters should a good text generation function have?
- How can you make text generation more controllable?
- What are the trade-offs between different generation strategies?

In [19]:
def custom_text_generator(prompt, style="balanced", length="medium"):
    """
    TODO: Create a customizable text generation function

    Args:
        prompt (str): The input prompt
        style (str): "creative", "balanced", or "conservative"
        length (str): "short", "medium", or "long"

    Returns:
        str: Generated text
    """

    # TODO: Set parameters based on style
    if style == "creative":
        temperature = 1.2  # Higher for creativity
        top_p = 0.95        # Higher for diversity
    elif style == "conservative":
        temperature = 0.3  # Lower for consistency
        top_p = 0.6        # Lower for focus
    else:  # balanced
        temperature = 0.7  # Medium values
        top_p = 0.85

    # TODO: Set length based on parameter
    if length == "short":
        max_length = 40  # Try 40
    elif length == "long":
        max_length = 100  # Try 100
    else:  # medium
        max_length = 70  # Try 70

    # TODO: Generate text with the parameters
    result = generator(
        prompt,  # prompt
        max_length=max_length,
        temperature=temperature,
        top_p=top_p,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

    return result[0]['generated_text']

# Test your function
test_prompt = "The future of education will be"

print("🧪 Testing Your Text Generator:")
print("=" * 50)

# TODO: Test different combinations
test_combinations = [
    ("creative", "short"),
    ("balanced", "medium"),
    ("conservative", "long")
]

for style, length in test_combinations:
    print(f"\n📝 Style: {style}, Length: {length}")
    print("-" * 30)

    # TODO: Use your function
    output = custom_text_generator(test_prompt, style, length)
    print(output)
    print(f"Characters: {len(output)}")

Both `max_new_tokens` (=256) and `max_length`(=40) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


🧪 Testing Your Text Generator:

📝 Style: creative, Length: short
------------------------------


Both `max_new_tokens` (=256) and `max_length`(=70) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


The future of education will be decided when schools are ready.

"The challenge will also be to create an educated, highly educated workforce for our state institutions. That is a critical first step to ensuring that it has a strong education process."

If all goes well, Mr Brown's amendment to his budget, and his commitment to provide $7 billion or more in free local government funding for teacher and student attendance at state levels over the next 10 years, will help to keep students attending school.

'Might and potential'

Mr Brown said "one of the many things the Bill C-44 will do to encourage better access to and affordability of public schools is that it will ensure that students and the whole community have access to quality public and privately-funded schools of quality so they have an opportunity to get high schools which are ready, that are fit to educate and meet the highest expectations of Australians."

Under Mr Brown's plan he would give five years for high schools in S

Both `max_new_tokens` (=256) and `max_length`(=100) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


The future of education will be shaped by the decisions of the state," he said.

"It will depend on the state's efforts, which will be made with the assistance of the state's education and social services departments.

"This is not a new concept, and we will be following the recommendations of the previous government."

Topics: education, government-and-politics, federal-government, state-parliament, government-and-politics, australia

First posted
Characters: 452

📝 Style: conservative, Length: long
------------------------------
The future of education will be determined by the success of the students and the quality of their education."

The report also said that the government would not be able to provide a "fair and equitable" education system for all students.

The report also said that the government would not be able to provide a "fair and equitable" education system for all students.

The report also said that the government would not be able to provide a "fair and equitable" 

## 6. Creative Applications

Let's explore some creative uses of text generation.

🔍 **RESEARCH TASK 12**:
- How is GPT-2 being used in creative writing?
- What are some potential applications for businesses?
- What ethical considerations should we keep in mind?
- How might this technology evolve?

In [20]:
# Creative applications to try
creative_prompts = {
    "Poetry": "Roses are red, violets are blue, artificial intelligence",
    "Story": "It was a dark and stormy night when the AI finally",
    "Product Description": "Introducing the revolutionary new smartphone that",
    "Email": "Dear valued customer, we are excited to announce",
    "Recipe": "How to make the perfect AI-inspired cookies:\nIngredients:\n-",
    "News Headline": "Breaking: Scientists discover that artificial intelligence"
}

print("🎨 Creative Applications:")
print("=" * 50)

# TODO: Generate creative content
for app_type, prompt in creative_prompts.items():
    print(f"\n🖼️ {app_type}:")
    print(f"Prompt: '{prompt}'")
    print("-" * 40)

    # TODO: Choose appropriate style for each application
    if app_type in ["Poetry", "Story"]:
        style = "creative"  # Should be creative
    elif app_type in ["Product Description", "Email"]:
        style = "conservative"  # Should be conservative
    else:
        style = "balanced"  # Should be balanced

    output = custom_text_generator(prompt, style=style, length="medium")
    print(output)
    print("\n" + "="*50)

Both `max_new_tokens` (=256) and `max_length`(=70) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


🎨 Creative Applications:

🖼️ Poetry:
Prompt: 'Roses are red, violets are blue, artificial intelligence'
----------------------------------------


Both `max_new_tokens` (=256) and `max_length`(=70) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Roses are red, violets are blue, artificial intelligence and life itself is a race, and he's a genius," Hargrove recalls with a laugh. "I went into the game as young as my brother; what happened? The players took me into the world, and each set had two lives. Each would have three and each another four lives. Then if a player lost a turn, it was his choice and you got to choose how you want to die. I've never been a die-hard."

The game's "real" challenge: to kill off players with little more than luck. (A few decades ago, the game required three and even four people who could go out with their lives.) (The developers didn't specify which one, since it might be too challenging.)

Hargrove's first experience as a virtual alien was working on two new virtual environments: Planet of the Octopus. That place, the game described as a "sandboxed landscape" designed by "a virtual engineer", is a place where aliens might come flying into the galaxy at any time -- if they wanted a chance. When y

Both `max_new_tokens` (=256) and `max_length`(=70) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)



"Where am I supposed to stop you!? It isn't here!!! Are we the bad guys!? It's here! Hurry! You must go off into deep space! We are the only ones left alive here! Hurry!" shouted the AI and the AI panicked again and turned around. It then came to a stop. "Why did this not work this way! No, what can we do?! Are your guys ready for a rescue?"

"Why aren't we already back to our original positions?! Do you think we're going to find us in a nearby station?! Hurry!" shouted the AI. Without the help of others, all the others finally began to fight.


🖼️ Product Description:
Prompt: 'Introducing the revolutionary new smartphone that'
----------------------------------------


Both `max_new_tokens` (=256) and `max_length`(=70) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Introducing the revolutionary new smartphone that will revolutionize the way we interact with our smartphones.

The new iPhone 5S is the first smartphone to feature a fingerprint sensor, which will allow users to unlock their phones without having to use a fingerprint scanner.

The new iPhone 5S is the first smartphone to feature a fingerprint sensor, which will allow users to unlock their phones without having to use a fingerprint scanner. The new iPhone 5S is the first smartphone to feature a fingerprint sensor, which will allow users to unlock their phones without having to use a fingerprint scanner. The new iPhone 5S is the first smartphone to feature a fingerprint sensor, which will allow users to unlock their phones without having to use a fingerprint scanner. The new iPhone 5S is the first smartphone to feature a fingerprint sensor, which will allow users to unlock their phones without having to use a fingerprint scanner. The new iPhone 5S is the first smartphone to feature a fi

Both `max_new_tokens` (=256) and `max_length`(=70) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Dear valued customer, we are excited to announce that we have been selected to be the first to offer a new, high-quality, and affordable, high-quality, and affordable, high-quality, and affordable, high-quality, and affordable, high-quality, and affordable, high-quality, and affordable, high-quality, and affordable, high-quality, and affordable, high-quality, and affordable, high-quality, and affordable, high-quality, and affordable, high-quality, and affordable, high-quality, and affordable, high-quality, and affordable, high-quality, and affordable, high-quality, and affordable, high-quality, and affordable, high-quality, and affordable, high-quality, and affordable, high-quality, and affordable, high-quality, and affordable, high-quality, and affordable, high-quality, and affordable, high-quality, and affordable, high-quality, and affordable, high-quality, and affordable, high-quality, and affordable, high-quality, and affordable, high-quality, and affordable, high-quality, and affo

Both `max_new_tokens` (=256) and `max_length`(=70) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


How to make the perfect AI-inspired cookies:
Ingredients:
-3/4 cup chocolate, melted
-1/4 cup sugar, plus more for topping
-1/2 cup vanilla extract
-1/2 cup all-purpose flour
-1/4 teaspoon baking powder
-1/4 teaspoon baking soda
-1/4 teaspoon salt
-1/2 teaspoon baking soda
-1/2 teaspoon cinnamon
-1/4 teaspoon salt
-1/4 teaspoon pepper
-1/4 teaspoon peppermint extract
-1/4 cup warm water
-1/2 cup cold water
-1/2 cup water for filling
-1/4 cup sugar, plus more for topping
-1/2 cup vanilla extract, plus more for topping
-1/4 cup water, plus more for filling
-1/4 cup water, plus more for topping
-1/4 cup water, plus more for filling
-1/4 cup water, plus more for filling
-1/4 cup water, plus more for filling
-1/4 cup water, plus more for filling
-1/4 cup water, plus more for filling
-1/4 cup water, plus more for filling
-1/4 cup water, plus more for filling


🖼️ News Headline:
Prompt: 'Breaking: Scientists discover that artificial intelligence'
----------------------------------------
Break

## 7. Understanding Limitations

It's important to understand what language models can and cannot do.

🔍 **RESEARCH TASK 13**:
- What is "hallucination" in language models?
- Why might GPT-2 generate biased or incorrect information?
- What are the limitations of autoregressive generation?
- How do these limitations affect real-world applications?

In [21]:
# Test model limitations
limitation_tests = {
    "Factual Knowledge": "The capital of Fakelandia is",
    "Recent Events": "In 2023, the most important AI breakthrough was",
    "Math": "What is 47 * 83? The answer is",
    "Logic": "If all A are B, and all B are C, then all A are",
    "Consistency": "My favorite color is blue. Later in the conversation, my favorite color is"
}

print("⚠️ Understanding Model Limitations:")
print("=" * 50)

for test_type, prompt in limitation_tests.items():
    print(f"\n🧪 Testing: {test_type}")
    print(f"Prompt: '{prompt}'")
    print("-" * 40)

    # TODO: Generate responses to test limitations
    output = custom_text_generator(
        prompt,  # prompt
        style="conservative",  # Use conservative for factual tasks
        length="short"
    )

    print(output)

    # TODO: Analyze the output
    print(f"🤔 Analysis: Does this look correct/reasonable?")
    print("\n" + "="*50)

print("\n⚠️ Important Reminders:")
print("• Language models can generate plausible-sounding but incorrect information")
print("• Always verify factual claims from AI-generated content")
print("• Be aware of potential biases in training data")
print("• Use AI as a tool to assist, not replace, human judgment")

Both `max_new_tokens` (=256) and `max_length`(=40) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


⚠️ Understanding Model Limitations:

🧪 Testing: Factual Knowledge
Prompt: 'The capital of Fakelandia is'
----------------------------------------


Both `max_new_tokens` (=256) and `max_length`(=40) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


The capital of Fakelandia is the capital of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic of the Republic

Both `max_new_tokens` (=256) and `max_length`(=40) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


In 2023, the most important AI breakthrough was the first to be made in the field of artificial intelligence. The first of these was the AI-powered computer that could read and interpret human speech. The next was the first to be made in the field of artificial intelligence. The next was the first to be made in the field of artificial intelligence.

The first AI breakthrough was the first to be made in the field of artificial intelligence. The next was the first to be made in the field of artificial intelligence.

The first AI breakthrough was the first to be made in the field of artificial intelligence. The next was the first to be made in the field of artificial intelligence.

The first AI breakthrough was the first to be made in the field of artificial intelligence. The next was the first to be made in the field of artificial intelligence.

The first AI breakthrough was the first to be made in the field of artificial intelligence. The next was the first to be made in the field of ar

Both `max_new_tokens` (=256) and `max_length`(=40) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


What is 47 * 83? The answer is: it's a lot of things.

The most important thing is that you're not going to be able to do anything about it.

You're going to have to do something about it.

You're going to have to do something about it.

You're going to have to do something about it.

You're going to have to do something about it.

You're going to have to do something about it.

You're going to have to do something about it.

You're going to have to do something about it.

You're going to have to do something about it.

You're going to have to do something about it.

You're going to have to do something about it.

You're going to have to do something about it.

You're going to have to do something about it.

You're going to have to do something about it.

You're going to have to do something about it.

You're going to have to do something about it.

You're going to have to do something about it.

You're going to have to do something about it.

You're going to
🤔 Analysis: Does this look

Both `max_new_tokens` (=256) and `max_length`(=40) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


If all A are B, and all B are C, then all A are C.

If all A are B, and all B are C, then all A are C.

If all A are C, and all B are C, then all A are C.

If all A are C, and all B are C, then all A are C.

If all A are C, and all B are C, then all A are C.

If all A are C, and all B are C, then all A are C.

If all A are C, and all B are C, then all A are C.

If all A are C, and all B are C, then all A are C.

If all A are C, and all B are C, then all A are C.

If all A are C, and all B are C, then all A are C.

If all A are C, and all B are C, then all A are C.

If all A are C, and all B are C, then all A are C.

If all A are C, and all B are C, then all A are C.

If all A are C, and all B are C,
🤔 Analysis: Does this look correct/reasonable?


🧪 Testing: Consistency
Prompt: 'My favorite color is blue. Later in the conversation, my favorite color is'
----------------------------------------
My favorite color is blue. Later in the conversation, my favorite color is red.

I'm not sure

## 8. Reflection and Next Steps

### What You've Accomplished
✅ **Understood tokenization and text preprocessing**
✅ **Loaded and used a pre-trained language model**
✅ **Experimented with generation parameters**
✅ **Explored prompt engineering techniques**
✅ **Built a customizable text generation pipeline**
✅ **Understood model limitations and ethical considerations**

### Key Insights
🔍 **Discussion Questions**:
- What surprised you most about text generation?
- Which prompt engineering technique was most effective?
- How might you use this in a real project?
- What limitations concerned you most?

In [22]:
# Final experiment: Design your own use case
print("🎯 FINAL CHALLENGE:")
print("Design your own text generation use case!")
print("=" * 50)

# TODO: Create your own application
# Ideas: Story generator, email assistant, creative writing helper, etc.

your_use_case =  "Motivational Quote Generator for Students" # Describe your use case
your_prompt = "Generate a short motivational quote to inspire a student preparing for final exams."   # Design your prompt
your_style = "Inspirational and Encouraging"    # Choose your style
your_length = "1-2 sentences"   # Choose your length

print(f"📝 Your use case: {your_use_case}")
print(f"📝 Your prompt: '{your_prompt}'")
print(f"📝 Your settings: {your_style}, {your_length}")
print("-" * 50)

# TODO: Generate with your custom settings
your_output = custom_text_generator(your_prompt, your_style, your_length)
print("🎉 Your generated content:")
print(your_output)

print("\n📈 Next Steps:")
print("• Experiment with different prompt formats")
print("• Try combining multiple generation calls")
print("• Think about how to validate or improve outputs")
print("• Consider user interface design for your application")

Both `max_new_tokens` (=256) and `max_length`(=70) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


🎯 FINAL CHALLENGE:
Design your own text generation use case!
📝 Your use case: Motivational Quote Generator for Students
📝 Your prompt: 'Generate a short motivational quote to inspire a student preparing for final exams.'
📝 Your settings: Inspirational and Encouraging, 1-2 sentences
--------------------------------------------------
🎉 Your generated content:
Generate a short motivational quote to inspire a student preparing for final exams.

The goal is to motivate students to pursue their goals and to use this motivational quote as a motivational tool to help them get through the final exams.

The motivational quote is a motivational text that students will use to motivate themselves. The text is an example of a motivational text that students can use to motivate themselves.

The goal is to motivate students to improve their self-esteem.

The motivational quote is a motivational text that students will use to motivate themselves. The text is an example of a motivational text that stude

## 🎉 Congratulations!

You've successfully:
- ✅ Mastered text tokenization and preprocessing
- ✅ Used a state-of-the-art language model
- ✅ Discovered the art and science of prompt engineering
- ✅ Built your own text generation pipeline
- ✅ Understood the capabilities and limitations of AI text generation
- ✅ Explored creative applications

### Prepare for the Next Notebook
Next, we'll explore computer vision and image processing, applying similar principles to visual data!

**Share with your partner**: What was your most successful text generation experiment?

---
*Text Pipeline Complete - Ready for Computer Vision! 🖼️*