# ⚙️ Notebook 01: An Introduction to Large Language Models (LLMs)

**Week 5-6: LLMs, Prompt Engineering & RAG**  
**Gen AI Masters Program**

---

## 🎯 Objectives

Welcome to the world of Large Language Models! LLMs are the engine behind the generative AI revolution. This notebook provides a comprehensive introduction to what LLMs are, how they are built, and how to start using them effectively.

By the end of this session, you will be able to:

1.  **Explain the LLM Lifecycle:** Understand the key stages of pre-training, instruction tuning, and alignment that create modern LLMs.
2.  **Compare Model Types:** Differentiate between major open-source (like Llama, Mixtral) and proprietary (like GPT-4, Claude) model families.
3.  **Perform Basic Inference:** Use the Hugging Face `pipeline` to perform tasks like summarization and text generation with powerful pre-trained models.
4.  **Develop a Model Selection Strategy:** Learn to choose the right model for a given task based on criteria like performance, cost, latency, and privacy.
5.  **Understand Core Concepts:** Grasp key terminology such as "foundation model," "RLHF," and "context window."

**Estimated Time:** 2-3 hours

---

## 🤔 What is a Large Language Model?

At its core, an LLM is a massive neural network trained on vast quantities of text data. Its fundamental capability is to **predict the next word in a sequence** (more precisely, the next *token*, which is a word or a word fragment). While this sounds simple, when scaled up to billions of parameters and trained on terabytes of data, this core function gives rise to emergent abilities like translation, summarization, question-answering, and even reasoning.

This notebook will demystify these powerful tools and give you the foundational knowledge to harness their capabilities.
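
To make "predict the next word" concrete, here is a minimal sketch of the same idea at toy scale: a bigram model that counts which word follows which. It is not how an LLM works internally (an LLM uses a neural network over tokens rather than a lookup table of counts), and the corpus and function names are invented for this illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count, for every word, which words follow it in the corpus."""
    follow_counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            follow_counts[current_word][next_word] += 1
    return follow_counts


def predict_next(follow_counts, word):
    """Return the most frequent follower of `word`, or None if the word is unseen."""
    followers = follow_counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy corpus, invented for this illustration
toy_corpus = [
    "the pump is running",
    "the pump is leaking",
    "the pump is running hot",
    "the valve is closed",
]

model = train_bigram_model(toy_corpus)
print(predict_next(model, "is"))    # the word seen most often after "is"
print(predict_next(model, "pump"))  # the word seen most often after "pump"
```

An LLM replaces the count table with billions of learned parameters and conditions on the whole preceding context instead of a single word, but the training signal is the same: guess what comes next.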

## 🕰️ A Brief History: The Evolution of Language Models

The journey to today's powerful LLMs has been one of increasing scale and architectural innovation.

| Era         | Representative Models        | Key Breakthrough                                       | Impact on AI                                                    |
| :---------- | :--------------------------- | :----------------------------------------------------- | :-------------------------------------------------------------- |
| **Pre-2018**  | word2vec, GloVe, FastText    | **Static Word Embeddings:** Words have fixed vector representations. | Enabled basic semantic search and text classification.          |
| **2018-2020** | ELMo, BERT, T5, GPT-2        | **Contextual Embeddings & the Transformer:** A word's vector changes based on its context; the Transformer architecture (introduced in 2017) made this approach scalable. | Revolutionized NLP, achieving state-of-the-art results on most benchmarks. |
| **2020-2023** | GPT-3, PaLM, Llama, Claude   | **Massive Scale & Instruction Tuning:** Models with tens to hundreds of billions of parameters, trained to follow human instructions. | Gave rise to conversational AI and powerful "zero-shot" abilities. |
| **2023+**     | GPT-4, Mixtral, Llama 3, Phi-3 | **Multi-modality & Extreme Efficiency:** Models that understand images, are safer, and can run on smaller devices. | Pushing AI towards general-purpose assistants and on-device intelligence. |

This rapid progress is built on three pillars: **more data**, **bigger models**, and **better training techniques**.

### Key Terminology You Need to Know

*   **Foundation Model:** A large-scale model pre-trained on a massive, broad dataset (e.g., a large crawl of the public web). It's not specialized for any single task but can be adapted to many different tasks. `GPT-4` and `Llama 3` are foundation models.
*   **Instruction Tuning (or Supervised Fine-Tuning, SFT):** The process of further training a pre-trained model on a dataset of high-quality `(prompt, response)` examples. This teaches the model to be more helpful, follow instructions, and engage in conversation.
*   **Alignment:** The process of ensuring an LLM's behavior aligns with human values and intentions. This involves making the model more helpful, honest, and harmless.
*   **RLHF (Reinforcement Learning from Human Feedback):** A key alignment technique. Humans rank different model responses to the same prompt. A "reward model" is trained on these rankings, and then reinforcement learning is used to fine-tune the LLM to generate responses that maximize the reward score.
*   **Context Window:** The maximum number of tokens (words and subword pieces) the model can consider at one time, counting both the prompt and the text generated so far. A larger context window allows the model to process longer documents and maintain longer conversations.
*   **Parameters:** The weights and biases of the neural network. The number of parameters is a rough measure of the model's size and capacity (e.g., 7 billion, 70 billion).

## ⚙️ The Anatomy of an LLM Training Pipeline

Creating a state-of-the-art LLM is a monumental engineering effort. The process can be broken down into three main stages:

1.  **Pre-training:**
    *   **Goal:** To learn general knowledge about language, facts, and reasoning.
    *   **Data:** A massive, diverse corpus of text and code from the public internet (e.g., Common Crawl, Wikipedia, GitHub).
    *   **Process:** The model is trained on a simple, self-supervised objective, typically **next-token prediction**. Given a sequence of tokens, it learns to predict the most likely next token. This is done for trillions of tokens on thousands of powerful GPUs over weeks or months. The result is a **base foundation model**.

2.  **Instruction Tuning (Supervised Fine-Tuning - SFT):**
    *   **Goal:** To teach the base model how to follow instructions and be a helpful assistant.
    *   **Data:** A smaller, high-quality dataset of curated `(prompt, response)` pairs. These examples are often created by human labelers.
    *   **Process:** The base model is fine-tuned on this dataset. It learns the *style* and *format* of being a helpful chatbot.

3.  **Alignment (RLHF / DPO):**
    *   **Goal:** To make the model safer, more honest, and better aligned with human preferences.
    *   **Data:** Human preference data. For a given prompt, humans rank several model-generated responses from best to worst.
    *   **Process:** Two common techniques are **RLHF** (Reinforcement Learning from Human Feedback) and **DPO** (Direct Preference Optimization). In RLHF, a "reward model" is trained to predict the human preference score for any given response, and the LLM is then fine-tuned with reinforcement learning to generate responses that maximize this reward score. DPO skips the separate reward model and fine-tunes the LLM directly on the preference pairs.

This multi-stage process transforms a raw, next-word predictor into a capable and safe AI assistant.
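
The alignment stage is easier to picture with the loss functions written out. The sketch below shows, for a single preference pair, the pairwise (Bradley-Terry) loss commonly used to train an RLHF reward model and the DPO loss that works directly on log-probabilities. The function names, the example numbers, and the `beta` value are assumptions chosen for illustration. Real training code operates on batches of tensors rather than plain floats.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise loss for a reward model: small when the human-preferred
    response receives the higher reward score."""
    return -log_sigmoid(reward_chosen - reward_rejected)


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair (no separate reward model).

    logp_*     : summed log-probability of a response under the model being trained
    ref_logp_* : the same quantity under the frozen reference (SFT) model
    beta       : strength of the pull toward the reference model (illustrative value)
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# Hypothetical numbers, invented for this illustration
print(f"Reward loss, correct ranking: {reward_model_loss(2.0, 0.0):.4f}")
print(f"Reward loss, wrong ranking:   {reward_model_loss(0.0, 2.0):.4f}")
print(f"DPO loss, policy prefers the chosen response: {dpo_loss(-10.0, -12.0, -11.0, -11.0):.4f}")
```

Both losses shrink as the model separates the preferred response from the rejected one. The difference is what gets compared: learned reward scores in RLHF, and log-probability shifts relative to a reference model in DPO.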

## 🚀 Hands-On: Basic Inference with a Foundation Model

Enough theory! Let's use a real LLM. We'll start with a simple task: summarizing a technical report using the Hugging Face `pipeline`. This is a good test of how well a pre-trained model generalizes: we apply it to a kind of text it did not see during fine-tuning, without any additional training.

We will use `facebook/bart-large-cnn`, a model specifically fine-tuned for summarizing news articles. While not trained on technical logs, its general understanding of language often carries over reasonably well.
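
A minimal sketch of this step might look like the cell below. The report text, the helper names (`build_summarizer`, `summarize_report`, `run_demo`), and the length limits are assumptions made for illustration. The `pipeline("summarization", ...)` call and the `summary_text` output key are the standard Hugging Face interface.

```python
def build_summarizer(model_name="facebook/bart-large-cnn"):
    """Create a summarization pipeline (downloads the model weights on first use)."""
    # Imported lazily so the helper below can be reused without loading the library.
    from transformers import pipeline
    return pipeline("summarization", model=model_name)


def summarize_report(report, summarizer, max_length=60, min_length=15):
    """Summarize `report` with any callable that returns the pipeline's output format."""
    outputs = summarizer(
        report,
        max_length=max_length,  # upper bound on summary length, in tokens
        min_length=min_length,  # lower bound on summary length, in tokens
        do_sample=False,        # deterministic decoding, so repeated runs agree
    )
    return outputs[0]["summary_text"].strip()


# Hypothetical technical report, invented for this illustration
technical_report = (
    "During the night shift, conveyor line 3 stopped twice because the drive motor "
    "overheated. Technicians found a clogged cooling fan and a worn belt. The fan was "
    "cleaned and the belt was replaced. The line was restarted after 45 minutes and "
    "has run without further faults. A follow-up inspection of the motor bearings is "
    "scheduled for next week."
)


def run_demo():
    """Load the real model and print a summary of the report."""
    summarizer = build_summarizer()
    print(summarize_report(technical_report, summarizer))
```

Calling `run_demo()` in a notebook cell loads the model and prints the summary. The first call downloads the model weights, so it can take a few minutes.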

## 🧮 Model Sizes, Hardware, and The Trade-Offs

LLMs come in a wide range of sizes, measured by their number of **parameters**. This size has a direct impact on their performance, cost, and hardware requirements.

| Model Family                               | Parameters | VRAM (16-bit) | Typical Use Case                               |
| :----------------------------------------- | :--------- | :------------ | :--------------------------------------------- |
| `distilbert-base-uncased`                  | 66 Million | ~0.13 GB      | Fast, simple text classification on the edge.  |
| `meta-llama/Meta-Llama-3-8B-Instruct`      | 8 Billion  | ~16 GB        | High-quality chat & instruction-following on a single GPU. |
| `mistralai/Mixtral-8x7B-Instruct-v0.1`     | 47 Billion (MoE) | ~94 GB        | Top-tier performance, requires multiple GPUs. Only about 13B parameters are active per token, so inference is fast for its size. |
| `meta-llama/Meta-Llama-3-70B-Instruct`     | 70 Billion | ~140 GB       | Frontier-level reasoning, requires powerful server hardware. |

*VRAM estimates cover the model weights only, for inference at 16-bit precision (2 bytes per parameter). Activations and the KV cache add to this, and training requires significantly more.*

> **Key Insight:** There is no "best" model, only the **best model for your specific use case**. A 70B parameter model is overkill for simple sentiment analysis, while a 66M parameter model cannot perform complex reasoning.

> **Quantization:** Techniques like 4-bit or 8-bit quantization can dramatically reduce the VRAM footprint (by 2-4x), allowing larger models to run on smaller hardware. However, this often comes with a small trade-off in accuracy.
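
The VRAM figures above follow from simple arithmetic: parameters times bytes per parameter. The helper below is a rough sketch of that calculation. The function name is made up for this notebook, and the estimate covers the weights only. Activations, the KV cache, and framework overhead are not included.

```python
def estimate_weight_memory_gb(num_parameters, bits_per_parameter=16):
    """Approximate memory needed just to hold the model weights, in GB (1 GB = 1e9 bytes)."""
    total_bytes = num_parameters * bits_per_parameter / 8
    return total_bytes / 1e9


# Parameter counts taken from the table above
model_sizes = {
    "Llama 3 8B": 8e9,
    "Mixtral 8x7B": 47e9,
    "Llama 3 70B": 70e9,
}

for name, num_parameters in model_sizes.items():
    estimates = ", ".join(
        f"{bits}-bit: ~{estimate_weight_memory_gb(num_parameters, bits):.0f} GB"
        for bits in (16, 8, 4)
    )
    print(f"{name:<14} {estimates}")
```

Under these assumptions, 4-bit quantization brings a 70B model's weights from about 140 GB down to about 35 GB, which is the 4x reduction mentioned above.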

### Tokenization: The Language of LLMs

LLMs don't see words; they see **tokens**. Tokenization is the process of breaking down raw text into smaller units that the model can understand. These tokens are then mapped to unique integer IDs.

**Common Tokenization Strategies:**
1.  **Word-Based:** Each word is a token. Fails on typos, variations ("run", "running"), and rare words.
2.  **Character-Based:** Each character is a token. Handles any word but loses semantic meaning and creates very long sequences.
3.  **Subword (e.g., BPE, SentencePiece):** The modern standard. It breaks words into frequently occurring chunks, which often line up with meaningful word parts. Common words get their own token (e.g., "hello"), while rare words are broken down (e.g., "tokenization" -> "token" + "ization"). This provides a practical balance between vocabulary size and sequence length.
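
To see where subword vocabularies come from, here is a minimal sketch of the core Byte-Pair Encoding (BPE) training loop: start from characters and repeatedly merge the most frequent adjacent pair. The word frequencies are a hypothetical toy corpus, and production tokenizers add many details (byte-level handling, pre-tokenization, special tokens) that are left out here.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged_vocab = {}
    for symbols, freq in vocab.items():
        new_symbols = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        key = tuple(new_symbols)
        merged_vocab[key] = merged_vocab.get(key, 0) + freq
    return merged_vocab


def learn_bpe(word_freqs, num_merges):
    """Learn `num_merges` merge rules, starting from single characters."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best_pair = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best_pair)
        vocab = merge_pair(best_pair, vocab)
    return merges, vocab


# Hypothetical word frequencies, invented for this illustration
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}

merges, vocab = learn_bpe(word_freqs, num_merges=3)
print("Learned merges:", merges)
print("Segmented words:", list(vocab.keys()))
```

Frequent endings such as "est" become single tokens after a few merges, while rarer words stay split into smaller pieces. That is the behavior described for subword tokenizers above.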

In [None]:
# Hands-On: Exploring Tokenization
from transformers import AutoTokenizer

# Load a tokenizer for a specific model
# Each model has its own unique tokenizer trained on its specific dataset.
# Using the wrong tokenizer for a model will lead to poor performance.
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
# Note: You may need to request access to this model on Hugging Face Hub.
# If you don't have access, you can use "distilbert-base-uncased" as an alternative.
tokenizer = AutoTokenizer.from_pretrained(model_name)

# --- Example 1: A simple sentence ---
text = "Hello, world! This is a test."
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text)

print(f"Original Text: '{text}'")
print(f"Tokens: {tokens}")
print(f"Token IDs: {token_ids}")
print("-" * 20)

# --- Example 2: Subword tokenization in action ---
text_complex = "Tokenization is foundational to LLMs."
tokens_complex = tokenizer.tokenize(text_complex)
token_ids_complex = tokenizer.encode(text_complex)

print(f"Original Text: '{text_complex}'")
print(f"Tokens: {tokens_complex}")
print(f"Token IDs: {token_ids_complex}")

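# With a subword tokenizer, "Tokenization" is typically split into pieces such as 'Token' and 'ization'.
# The exact split depends on the tokenizer you loaded.
# Note: encode() adds special tokens by default, so the ID list can be longer than the token list.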

## Case Study: Building a Downtime Incident Assistant

Let's apply these concepts to a practical industrial scenario.

**Goal:** Create an intelligent assistant that can help technicians respond to production line incidents more effectively.

**Our Plan:**
1.  **Step 1: Zero-Shot Classification:** Use a general-purpose LLM to instantly categorize incoming incident reports without any initial training.
2.  **Step 2: Suggest First Actions:** Use an instruction-tuned LLM to recommend the immediate next step for a technician based on the report.
3.  **Step 3: Evaluate the Suggestion:** Use a question-answering model to validate the LLM's suggestion against a trusted Standard Operating Procedure (SOP).

This multi-step approach demonstrates how different models can be chained together to build a robust and reliable system.

### Step 1: Zero-Shot Incident Classification

When a new incident report comes in, the first task is to route it to the correct department. A **zero-shot classification** model is perfect for this. It can classify text into predefined categories, even if it has never been explicitly trained on those specific labels. This allows for rapid prototyping and deployment.

In [None]:
# Hands-On: Zero-Shot Classification
from transformers import pipeline

# The incident report from the factory floor
incident_report = 'Vision system flagged misaligned solder joints on PCB lot 2025-A34 during the night shift.'

# The categories we want to classify the report into
possible_labels = ['safety', 'quality', 'maintenance', 'supply-chain']

# Load a pre-trained zero-shot classification pipeline
# Model: facebook/bart-large-mnli is a popular choice for this task.
classifier = pipeline(
    'zero-shot-classification',
    model='facebook/bart-large-mnli'
)

# Run the classification
result = classifier(incident_report, candidate_labels=possible_labels, multi_label=False)

# Display the result
print(f"Incident Report: '{result['sequence']}'")
print(f"Predicted Category: {result['labels'][0]}")
print(f"Confidence Score: {result['scores'][0]:.4f}")
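
How can a model classify into labels it was never trained on? NLI-based zero-shot classifiers such as `facebook/bart-large-mnli` turn each label into a hypothesis sentence and ask whether the input text entails it. The sketch below mimics that bookkeeping in plain Python. The template matches the pipeline's default (`"This example is {}."`), but the entailment scores are made-up numbers standing in for real model outputs.

```python
import math


def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


labels = ['safety', 'quality', 'maintenance', 'supply-chain']
hypotheses = build_hypotheses(labels)

# Hypothetical entailment scores, one per hypothesis, invented for this illustration.
# In the real pipeline an NLI model produces these for each (report, hypothesis) pair.
entailment_scores = [0.3, 2.9, 1.1, -0.8]

probabilities = softmax(entailment_scores)
ranked = sorted(zip(labels, probabilities), key=lambda item: item[1], reverse=True)

for hypothesis in hypotheses:
    print(hypothesis)
for label, probability in ranked:
    print(f"{label:<14} {probability:.3f}")
```

With `multi_label=False`, the scores are normalized across labels as shown, so the probabilities compete and sum to 1.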

### Step 2: Suggesting a First Action with an Instruction-Tuned Model

Now that the incident is categorized, we can use a more powerful **instruction-tuned LLM** to suggest a concrete first action for the technician. These models are specifically trained to follow commands and provide helpful, direct responses.

We will formulate a clear prompt that provides the model with the necessary context (the incident report) and asks for a specific output (the recommended first action). This is a form of **prompt engineering**, a critical skill for getting the best results from LLMs.

In [None]:
# Hands-On: Generating a First Action
# Note: Running this cell requires significant GPU memory (~15GB for a 7B model) and may be slow.
# It is provided as a demonstration of how to use a large instruction-tuned model.
# If you have limited resources, this cell may fail, which is expected.

try:
    import torch  # Needed to request 16-bit weights below

    # Load an instruction-tuned model pipeline
    # tiiuae/falcon-7b-instruct is a powerful and popular open-source choice.
    assistant = pipeline(
        'text-generation',
        model='tiiuae/falcon-7b-instruct',
        torch_dtype=torch.bfloat16,  # 16-bit weights; the 32-bit default needs about twice the memory
        trust_remote_code=True,
        device_map="auto"  # Automatically uses GPU if available
    )

    # Engineer a prompt to guide the model
    # (built piece by piece so the cell's indentation does not leak into the prompt)
    prompt = (
        f"Given the incident report: '{incident_report}'\n"
        "What is the single most important first action for a technician to take? "
        "Be concise and direct.\n"
    )

    # Generate the response
    result = assistant(
        prompt,
        max_new_tokens=50,       # Limit the length of the response
        do_sample=True,          # Sample from the distribution instead of always taking the top token
        temperature=0.7,         # A lower temperature makes the output more deterministic
        top_k=50,                # Consider the top 50 most likely tokens at each step
        top_p=0.95,              # Nucleus sampling: keep the smallest token set covering 95% probability
        return_full_text=False   # Return only the generated answer, not the prompt
    )
    print(result[0]['generated_text'])

except Exception as e:
    print(f"Could not run text generation pipeline, likely due to resource constraints. Error: {e}")
    print("Skipping this step. This is expected on most consumer hardware.")
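
The `temperature`, `top_k`, and `top_p` settings are easier to reason about once you see them applied to a small distribution. The following is a simplified sketch of one sampling step, not the library's actual implementation. The token scores are invented for illustration.

```python
import math
import random


def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.95, rng=random):
    """Pick one token from `logits` (a dict of token -> raw score)."""
    # 1. Temperature: dividing by a value below 1 sharpens the distribution.
    scaled = {token: score / temperature for token, score in logits.items()}

    # 2. Top-k: keep only the k highest-scoring tokens.
    ranked = sorted(scaled.items(), key=lambda item: item[1], reverse=True)[:top_k]

    # 3. Softmax over the remaining tokens.
    highest = ranked[0][1]
    exps = [(token, math.exp(score - highest)) for token, score in ranked]
    total = sum(value for _, value in exps)
    probs = [(token, value / total) for token, value in exps]

    # 4. Top-p (nucleus): keep the smallest prefix whose cumulative probability reaches top_p.
    kept = []
    cumulative = 0.0
    for token, prob in probs:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # 5. Sample from what is left (random.choices renormalizes the weights).
    tokens = [token for token, _ in kept]
    weights = [prob for _, prob in kept]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token scores, invented for this illustration
example_logits = {"quarantine": 5.0, "inspect": 4.0, "ignore": -5.0}

rng = random.Random(0)  # fixed seed so the demo is repeatable
samples = [sample_next_token(example_logits, rng=rng) for _ in range(10)]
print(samples)
```

With these numbers, "ignore" is cut by the nucleus filter and never appears, while the two plausible tokens are sampled in proportion to their probabilities. Setting `top_k=1` reduces this to greedy decoding.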

### Step 3: Evaluating the Response with a QA Model

An LLM's suggestion is powerful, but is it correct? For safety and reliability, we should not blindly trust the model's output, especially in a production environment.

One way to validate the response is to check it against a **trusted knowledge source**, like a Standard Operating Procedure (SOP). We can use a **Question-Answering (QA) model** for this task.

The process is:
1.  Provide the QA model with a `context` (the SOP text).
2.  Ask a specific `question` ("What is the first action?").
3.  The model will find and extract the most likely answer from the provided context.

This creates a verifiable, document-grounded check on the generative model's suggestion. Grounding answers in trusted documents is the core idea behind **RAG (Retrieval-Augmented Generation)**, which we will explore deeply later in this course.

In [None]:
# Hands-On: Validating with a QA Model
from transformers import pipeline

# The official Standard Operating Procedure for this type of incident
sop_context = """
Standard Operating Procedure for Quality Alerts (QA-SOP-004):
1. Upon receiving a quality alert, the first and most critical action is to quarantine the affected batch to prevent it from entering the main production flow or being shipped to customers.
2. Immediately notify the shift supervisor and the Quality Assurance (QA) department.
3. Document the incident in the Quality Management System (QMS) with all relevant details, including lot number, timestamp, and a description of the defect.
4. An assigned engineer will then conduct a root cause analysis (RCA).
"""

# The question we want to answer based on the SOP
question = "What is the first action for a quality alert?"

# Load a QA pipeline. 'distilbert-base-cased-distilled-squad' is a lightweight but effective model.
qa_pipeline = pipeline('question-answering', model='distilbert-base-cased-distilled-squad')

# Find the answer within the context
answer = qa_pipeline(question=question, context=sop_context)

print(f"Question: {question}")
print(f"Answer from SOP: '{answer['answer']}'")
print(f"Confidence Score: {answer['score']:.4f}")
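
The QA model gives us the SOP's answer, but we still need to compare it with what the assistant suggested. One simple way to do that, sketched below, is to measure word overlap between the two. The function names, the stop-word list, the 0.5 threshold, and the example suggestions are all assumptions made for illustration. A production system would more likely use semantic similarity or a human review step.

```python
import re

# Small illustrative stop-word list (an assumption, not a standard resource)
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "is", "it", "for", "in", "on"}


def content_words(text):
    """Lowercase the text and return its set of non-stop-words."""
    return {word for word in re.findall(r"[a-z]+", text.lower()) if word not in STOP_WORDS}


def overlap_score(suggestion, sop_answer):
    """Fraction of the SOP answer's content words that also appear in the suggestion."""
    reference = content_words(sop_answer)
    if not reference:
        return 0.0
    return len(reference & content_words(suggestion)) / len(reference)


def check_suggestion(suggestion, sop_answer, threshold=0.5):
    """Flag whether the suggestion is roughly consistent with the SOP answer."""
    score = overlap_score(suggestion, sop_answer)
    return {"score": score, "consistent_with_sop": score >= threshold}


# Hypothetical values, invented for this illustration
sop_answer_text = "quarantine the affected batch"
good_suggestion = "Quarantine the affected batch of PCBs immediately."
bad_suggestion = "Restart the vision system and continue production."

print(check_suggestion(good_suggestion, sop_answer_text))
print(check_suggestion(bad_suggestion, sop_answer_text))
```

In the notebook, `answer['answer']` from the QA cell would take the place of `sop_answer_text`, and the generated text from Step 2 would take the place of the suggestion.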

## 🛡️ Safety, Governance, and Responsible AI

Deploying LLMs, especially in industrial settings, carries significant responsibility. A mistake can impact physical processes, safety, and product quality. Building a robust AI system requires a strong governance framework.

**Key Pillars of Responsible AI:**

*   **Guardrails:** Implement automated checks on both the inputs to the model and the outputs it generates.
    *   *Input Filtering:* Block prompts that are off-topic, malicious, or seek to exploit the model.
    *   *Output Filtering:* Prevent the model from generating harmful, toxic, or nonsensical content. For example, if an assistant suggests "increase the boiler pressure to maximum," a guardrail should flag this as a potentially dangerous and invalid action.

*   **Hallucination Mitigation:** LLMs can sometimes "hallucinate" or invent facts. One of the most effective ways to combat this is with **Retrieval-Augmented Generation (RAG)**, which we will cover in detail later. RAG grounds the model's responses in a specific set of trusted documents, steering it to base its answers on verifiable sources rather than only on its internal knowledge. It reduces hallucinations but does not eliminate them, so the other safeguards in this list still apply.

*   **Bias Audits:** Models can inherit biases from their training data. It's crucial to audit your system for potential biases related to different factory shifts, employee roles, or equipment types to ensure fairness and equity.

*   **Human-in-the-Loop (HITL):** For any high-stakes decision, an LLM should assist, not decide. A human expert must always have the final say. For example, an LLM can *suggest* shutting down a machine, but the action should require explicit approval from a qualified operator.
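
As a rough sketch of how an output guardrail and a human-in-the-loop gate can fit together, consider the cell below. The blocked patterns, function names, and return values are hypothetical and far too simple for real use. A deployed system would draw its rules from validated process limits and safety documentation.

```python
import re
from dataclasses import dataclass

# Hypothetical blocked patterns, invented for this illustration
DANGEROUS_PATTERNS = [
    r"\bpressure\b.*\bmaximum\b",
    r"\b(disable|bypass|override)\b.*\b(interlock|safety|alarm)\b",
]


@dataclass
class GuardrailResult:
    allowed: bool
    reason: str


def check_output(suggestion):
    """Output filter: reject suggestions that match a known-dangerous pattern."""
    text = suggestion.lower()
    for pattern in DANGEROUS_PATTERNS:
        if re.search(pattern, text):
            return GuardrailResult(False, f"matched blocked pattern: {pattern}")
    return GuardrailResult(True, "no blocked pattern matched")


def decide(suggestion, operator_approved):
    """Human-in-the-loop gate: nothing is executed without explicit operator approval."""
    if not check_output(suggestion).allowed:
        return "blocked"
    return "execute" if operator_approved else "awaiting approval"


print(check_output("Increase the boiler pressure to maximum."))
print(decide("Quarantine the affected batch.", operator_approved=False))
print(decide("Quarantine the affected batch.", operator_approved=True))
```

The ordering matters: the automated check runs first, so an operator is never asked to approve a suggestion that the guardrail has already rejected.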

---

## 📚 Further Reading & Resources

This notebook provides a high-level introduction. To deepen your understanding, we highly recommend exploring the original papers and technical reports for the models discussed.

*   **Foundational Paper:** ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762) (Vaswani et al., 2017) - The paper that introduced the Transformer architecture, the bedrock of modern LLMs.
*   **Safety & Ethics:** ["Building Safe LLM Systems"](https://www.anthropic.com/news/building-safe-llm-systems) (Anthropic, 2024) - A comprehensive guide to practical safety measures in LLM development.
*   **Model-Specific Details:**
    *   [Meta Llama 3 Announcement (Meta AI blog)](https://ai.meta.com/blog/meta-llama-3/)
    *   [Mixtral of Experts Technical Report](https://arxiv.org/abs/2401.04088)
    *   [Phi-3 Technical Report](https://arxiv.org/abs/2404.14219)