# Advanced List Generation with Structured Output

## 🎯 Learning Objectives
By the end of this notebook, you will be able to:
1. Generate structured content (outlines, hierarchies) using LLMs
2. Work with different output formats (text, JSON, YAML)
3. Parse and extract structured data from LLM responses
4. Handle format inconsistencies and clean LLM outputs

---

## 📚 What You'll Learn

**The Challenge:**
LLMs naturally produce unstructured text, but modern applications need structured data (JSON, YAML, XML) that can be programmatically processed.

**Three Approaches:**
1. **Text with Regex Parsing**: Generate formatted text, then extract structure with patterns
2. **Direct JSON Output**: Prompt the LLM to return valid JSON
3. **YAML Format**: Use YAML for more human-readable structured data

**Key Insight:**
The same information can be represented in multiple formats. Choosing the right format depends on:
- How you'll process the data
- Integration requirements with other systems
- Readability vs. parseability tradeoffs
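
To make this concrete, here is a minimal sketch that renders one outline in two of these formats using only the standard library. The `outline` dictionary and the `to_numbered_text` helper are illustrative names, not part of this notebook's code.

```python
import json

# Hypothetical outline used only for illustration.
outline = {
    "Introduction": ["Definition of data engineering"],
    "Key Concepts": ["Data Collection", "Data Storage"],
}

def to_numbered_text(outline):
    """Render {heading: [subheadings]} as a numbered, lettered outline."""
    lines = []
    for number, (heading, subheadings) in enumerate(outline.items(), start=1):
        lines.append(f"{number}. {heading}")
        for offset, subheading in enumerate(subheadings):
            letter = chr(ord("a") + offset)  # a, b, c, ... (fine for up to 26 items)
            lines.append(f"    {letter}. {subheading}")
    return "\n".join(lines)

text_version = to_numbered_text(outline)      # easy to read, needs parsing to reuse
json_version = json.dumps(outline, indent=2)  # less pleasant to read, trivial to parse

print(text_version)
print(json_version)
```

The text version is what a reader wants to see; the JSON version is what downstream code wants to consume. Both carry the same information.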

---

## Section 1: Setup and Imports

In [2]:

import re
import json
from src.fnUtils import render_markdown
from src.openai_client import generate_text

### Test the LLM Connection

Create a simple wrapper function and test it with a basic prompt.

In [3]:
def complete(prompt):
    output = generate_text(prompt)
    return output

complete("Tell me an inspiring quote.")

'"Believe you can and you\'re halfway there." – Theodore Roosevelt'

---

## Section 2: Approach 1 - Text Output with Regex Parsing

### 📝 Strategy: Generate Human-Readable Text

**Advantages:**
- Natural for LLMs to produce
- Easy to read and verify
- Flexible formatting

**Disadvantages:**
- Requires custom parsing logic
- Fragile to format variations
- Regex patterns can be complex

### The Prompt Design:

We provide a clear example of the desired structure to guide the LLM's output format.

In [5]:
# Set up the topic title
topic = "What is data engineering?"

# Inject the topic into the base prompt
base_prompt = f"""
Write a numbered, hierarchical outline for an article on "{topic}"

Here is an example of the structure:

1. Introduction
    a. Definition of digital marketing
2. Types of Digital Marketing
    a. Search Engine Optimization
    b. Social Media Marketing
    c. Content Marketing
    d. Pay-Per-Click Advertising
    e. Email Marketing
3. Benefits of Digital Marketing
    a. Cost-Effective
    b. Targeted Audience
    c. Measurable Results
    d. Increased Reach
"""
print(base_prompt)

# Goal: parse the model's outline into a dictionary that maps each section
# title to its list of sub-headings, for example:
# {
#     "1. Introduction": ["a. Definition of digital marketing"],
#     "2. Types of Digital Marketing": [
#         "a. Search Engine Optimization",
#         "b. Social Media Marketing",
#         "c. Content Marketing",
#         "d. Pay-Per-Click Advertising",
#         "e. Email Marketing",
#     ],
# }


Write a numbered, hierarchical outline for an article on "What is data engineering?"

Here is an example of the structure:

1. Introduction
    a. Definition of digital marketing
2. Types of Digital Marketing
    a. Search Engine Optimization
    b. Social Media Marketing
    c. Content Marketing
    d. Pay-Per-Click Advertising
    e. Email Marketing
3. Benefits of Digital Marketing
    a. Cost-Effective
    b. Targeted Audience
    c. Measurable Results
    d. Increased Reach



### Generate the Outline

Run the prompt and examine the raw text output.

In [6]:
result = complete(base_prompt)
print(result)

# Outline for "What is Data Engineering?"

1. Introduction  
    a. Definition of data engineering  
    b. Importance of data engineering in the modern data landscape  

2. Key Concepts in Data Engineering  
    a. Data Collection  
        i. Sources of data (e.g., databases, APIs, web scraping)  
        ii. Data ingestion methods (batch vs. real-time)  
    b. Data Storage  
        i. Types of storage solutions (e.g., relational databases, NoSQL, data lakes)  
        ii. Data warehousing concepts  
    c. Data Processing  
        i. Data transformation and ETL (Extract, Transform, Load)  
        ii. Real-time data processing vs. batch processing  

3. Tools and Technologies in Data Engineering  
    a. Programming Languages  
        i. Python  
        ii. Scala  
        iii. SQL  
    b. Data Pipeline Tools  
        i. Apache Airflow  
        ii. Apache Kafka  
        iii. Luigi  
    c. Cloud Platforms  
        i. Amazon Web Services (AWS)  
        ii. Google Cloud Pla

### Parse the Text into Structured Data

**The Parsing Strategy:**
1. Use regex to find main sections (numbered items like "1. Introduction")
2. Extract sub-sections (lettered items like "a. Definition")
3. Build a dictionary mapping sections to their sub-sections

**Regex Patterns Explained:**
- `r'\d+\..*?(?=^\d+|\Z)'`: Captures main sections (stops at next number or end)
- `r'\d+\..+'`: Extracts the section title
- `r'\s+[a-z]\..+'`: Finds sub-sections (letter followed by dot)

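**⚠️ Limitation:** This approach breaks if the LLM changes the format slightly. Even on this output, the single-letter sub-section pattern also captures third-level items such as `i.` while missing `ii.` and `iii.`, as the parsed result below shows.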
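
To see what each pattern actually matches, the sketch below runs all three against a tiny hand-written outline. The `sample` text is made up for illustration and only mimics the shape of the model output above.

```python
import re

# Hypothetical miniature outline, shaped like the model output above.
sample = (
    "1. Introduction\n"
    "    a. Definition\n"
    "2. Key Concepts\n"
    "    a. Data Collection\n"
    "        i. Sources of data\n"
    "        ii. Ingestion methods\n"
)

# Pattern 1: each main section, up to the next line that starts with a digit.
blocks = re.findall(r'\d+\..*?(?=^\d+|\Z)', sample, re.MULTILINE | re.DOTALL)

# Pattern 2: the first "N. Title" line inside each block.
titles = [re.search(r'\d+\..+', block).group(0) for block in blocks]

# Pattern 3: indented lines that start with ONE lowercase letter and a dot.
subs = [s.strip() for s in re.findall(r'\s+[a-z]\..+', blocks[1])]

print(titles)  # ['1. Introduction', '2. Key Concepts']
print(subs)    # ['a. Data Collection', 'i. Sources of data']
```

Note that `i.` is picked up as if it were a lettered sub-section, while `ii.` is silently dropped because the pattern allows only one letter before the dot.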

In [7]:
def extract_sections(outline_text):
    # Extract main sections
    main_sections = re.findall(r'\d+\..*?(?=^\d+|\Z)', outline_text, re.MULTILINE | re.DOTALL)

    # Extract sub-sections
    sections = {}
    for section in main_sections:
        section_title = re.search(r'\d+\..+', section).group(0)
        sub_sections = re.findall(r'\s+[a-z]\..+', section, re.MULTILINE)
        sections[section_title] = [heading.strip() for heading in  sub_sections]
    return sections

print(extract_sections(result))

{'1. Introduction  ': ['a. Definition of data engineering', 'b. Importance of data engineering in the modern data landscape'], '2. Key Concepts in Data Engineering  ': ['a. Data Collection', 'i. Sources of data (e.g., databases, APIs, web scraping)', 'b. Data Storage', 'i. Types of storage solutions (e.g., relational databases, NoSQL, data lakes)', 'c. Data Processing', 'i. Data transformation and ETL (Extract, Transform, Load)'], '3. Tools and Technologies in Data Engineering  ': ['a. Programming Languages', 'i. Python', 'b. Data Pipeline Tools', 'i. Apache Airflow', 'c. Cloud Platforms', 'i. Amazon Web Services (AWS)'], '4. Roles and Responsibilities of a Data Engineer  ': ['a. Designing data infrastructure', 'b. Ensuring data quality and integrity', 'c. Collaborating with data scientists and analysts', 'd. Implementing data security measures'], '5. Challenges in Data Engineering  ': ['a. Dealing with large volumes of data', 'b. Ensuring data accuracy and consistency', 'c. Integratio
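
One possible way to make the parser less brittle is to read the outline line by line and use indentation to decide which items belong directly under a section. The sketch below is an alternative to `extract_sections`, not a drop-in from any library; the name `extract_sections_by_indent` and the sample text are assumptions made for this example.

```python
import re

def extract_sections_by_indent(outline_text):
    """Map each numbered heading to its directly nested lettered items.

    Items indented deeper than the first lettered item of a section
    (for example roman-numeral sub-points) are skipped.
    """
    sections = {}
    current = None      # title of the section currently being filled
    sub_indent = None   # indentation width of that section's lettered items
    for raw_line in outline_text.splitlines():
        line = raw_line.rstrip()  # drop trailing spaces the model may add
        main = re.match(r'(\d+)\.\s+(.+)', line)
        if main:
            current = f"{main.group(1)}. {main.group(2)}"
            sections[current] = []
            sub_indent = None
            continue
        sub = re.match(r'(\s+)([a-z])\.\s+(.+)', line)
        if sub and current is not None:
            indent = len(sub.group(1))
            if sub_indent is None:
                sub_indent = indent  # first lettered item sets the expected depth
            if indent == sub_indent:
                sections[current].append(f"{sub.group(2)}. {sub.group(3)}")
    return sections

# Hypothetical sample with trailing spaces and a third nesting level.
sample_outline = (
    "# Outline\n\n"
    "1. Introduction  \n"
    "    a. Definition  \n"
    "2. Key Concepts  \n"
    "    a. Data Collection  \n"
    "        i. Sources of data  \n"
    "        ii. Ingestion methods  \n"
    "    b. Data Storage  \n"
)
parsed_outline = extract_sections_by_indent(sample_outline)
print(parsed_outline)
```

This version strips trailing whitespace from titles and ignores third-level items, but it still assumes numbered headings start at column zero and that sub-items are indented consistently.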

---

## Section 3: Approach 2 - Direct JSON Output

### 🔧 Strategy: Request Structured Data Directly

**Advantages:**
- No custom parsing logic needed (`json.loads()` does the work)
- Standardized format
- Language-agnostic
- Easy integration with APIs

**Disadvantages:**
- LLMs sometimes return invalid JSON
- May include markdown code blocks (` ```json `)
- Need error handling for malformed output

### The Prompt Design:

We explicitly:
1. Ask for JSON output
2. Show the exact structure we want
3. Remind the model to make it parsable
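
One way to keep the format example itself valid is to generate it with `json.dumps` instead of typing it by hand. The sketch below assumes a hypothetical helper named `build_json_outline_prompt`; the wording of the prompt is illustrative and is not the prompt used in the next cell.

```python
import json

def build_json_outline_prompt(topic):
    """Build a prompt whose format example is itself valid JSON."""
    # Placeholder headings; json.dumps guarantees double quotes and no trailing comma.
    example = {
        "top heading one": ["subheading one", "subheading two"],
        "top heading two": ["subheading one", "subheading two"],
    }
    return (
        f'Produce an article outline for "{topic}" as JSON.\n\n'
        "Output format (an object mapping each heading to a list of subheadings):\n"
        f"{json.dumps(example, indent=2)}\n\n"
        "Return only the JSON object, with no surrounding text."
    )

example_prompt = build_json_outline_prompt("What is data engineering?")
print(example_prompt)
```

Because the example is produced by the same serializer that will later parse the response, the model never sees single quotes or trailing commas that it might imitate.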

In [8]:
prompt = f"""Produce an article outline for "{topic}" as JSON.

**Output format**:
{{
"top heading one": ["subheading_one", "subheading_two", ...],
"top heading two": ["subheading_one", "subheading_two", ...],
...
"top heading n": ["subheading_one", "subheading_two", ...]
}}

Remember that the output must be parsable JSON.
"""

json_string = complete(prompt)

### Inspect the Raw JSON String

Check if the LLM wrapped the JSON in markdown code blocks.

In [9]:
print(json_string)

{
  "Introduction": ["Definition of Data Engineering", "Importance of Data Engineering"],
  "Key Concepts": ["Data Pipeline", "Data Warehousing", "ETL Processes", "Data Modeling"],
  "Roles and Responsibilities": ["Data Engineer vs. Data Scientist", "Typical Tasks of a Data Engineer", "Skills Required for Data Engineering"],
  "Tools and Technologies": ["Popular Data Engineering Tools", "Cloud Platforms for Data Engineering", "Programming Languages Used"],
  "Challenges in Data Engineering": ["Data Quality Issues", "Scalability Concerns", "Integration with Existing Systems"],
  "Future Trends": ["Emerging Technologies", "The Role of AI and Machine Learning", "The Evolving Landscape of Data Engineering"],
  "Conclusion": ["Summary of Key Points", "The Importance of Data Engineering in Modern Businesses"]
}


### Clean and Parse the JSON

**Common Issue:** LLMs often wrap JSON in markdown code blocks like ` ```json ... ``` `

**Solution:** Strip these markers before parsing with `json.loads()`

**💡 Pro Tip:** Always add error handling in production:
```python
try:
    json_object = json.loads(json_string)
except json.JSONDecodeError:
    # Handle invalid JSON
    pass
```
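
Putting the two ideas together, a minimal sketch of a cleaning-and-parsing helper might look like the following. The name `parse_llm_json` is hypothetical, and the fence handling only covers the common case of a single fenced block wrapping the whole response.

```python
import json

FENCE = "`" * 3  # three backticks, built this way so the example is easy to embed

def parse_llm_json(raw):
    """Return the parsed JSON value, or None if the text is not valid JSON."""
    cleaned = raw.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line, with or without a language tag such as "json".
        first_newline = cleaned.find("\n")
        cleaned = cleaned[first_newline + 1:] if first_newline != -1 else ""
    if cleaned.rstrip().endswith(FENCE):
        # Drop the closing fence.
        cleaned = cleaned.rstrip()[:-len(FENCE)]
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError as exc:
        print(f"Could not parse model output as JSON: {exc}")
        return None

fenced_reply = FENCE + 'json\n{"Introduction": ["Definition"]}\n' + FENCE
print(parse_llm_json(fenced_reply))
```

Returning `None` on failure keeps the calling code simple; a caller could equally re-prompt the model or raise an exception, depending on the application.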

In [10]:
# Parse the JSON string into a Python dictionary
json_object = json.loads(json_string)

# Print the parsed dictionary
print(json_object)

{'Introduction': ['Definition of Data Engineering', 'Importance of Data Engineering'], 'Key Concepts': ['Data Pipeline', 'Data Warehousing', 'ETL Processes', 'Data Modeling'], 'Roles and Responsibilities': ['Data Engineer vs. Data Scientist', 'Typical Tasks of a Data Engineer', 'Skills Required for Data Engineering'], 'Tools and Technologies': ['Popular Data Engineering Tools', 'Cloud Platforms for Data Engineering', 'Programming Languages Used'], 'Challenges in Data Engineering': ['Data Quality Issues', 'Scalability Concerns', 'Integration with Existing Systems'], 'Future Trends': ['Emerging Technologies', 'The Role of AI and Machine Learning', 'The Evolving Landscape of Data Engineering'], 'Conclusion': ['Summary of Key Points', 'The Importance of Data Engineering in Modern Businesses']}
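
Valid JSON is not necessarily the JSON you asked for, so it can help to check the shape before using it. The following is a minimal sketch, assuming the expected shape is an object that maps each heading to a list of strings; `validate_outline` is a hypothetical helper name.

```python
def validate_outline(obj):
    """Raise ValueError unless obj is a non-empty {heading: [str, ...]} mapping."""
    if not isinstance(obj, dict) or not obj:
        raise ValueError("outline must be a non-empty JSON object")
    for heading, subheadings in obj.items():
        if not isinstance(subheadings, list):
            raise ValueError(f"{heading!r} must map to a list")
        if not all(isinstance(item, str) for item in subheadings):
            raise ValueError(f"{heading!r} must contain only strings")
    return obj

# Hypothetical parsed response in the expected shape.
checked = validate_outline({"Introduction": ["Definition", "Importance"]})
print(list(checked))
```

A check like this turns a silent downstream failure (for example, iterating over a string instead of a list) into an immediate, descriptive error.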


---

## Section 4: Approach 3 - YAML Format

### 📄 Strategy: Human-Readable Structured Data

**Advantages:**
- More readable than JSON (fewer brackets and quotes)
- Supports comments
- Better for configuration files
- Handles multi-line strings naturally

**Disadvantages:**
- Requires additional library (`pyyaml`)
- Indentation-sensitive (errors if spacing is wrong)
- Less common in APIs than JSON

### Install YAML Library

In [28]:
!pip install pyyaml

import yaml



### Request YAML Output

**Prompt Strategy:**
- Explicitly ask for `.yml` format
- Show example YAML structure with proper indentation
- Emphasize "Always return valid YML"

**Note the syntax differences from JSON:**
- Uses `-` for list items
- Uses `:` for key-value pairs
- Uses `|` for multi-line strings
- No quotes needed around strings (usually)

In [32]:
prompt = f"""Produce an article outline as a .yml file for {topic}.

Always return valid YML.

**Output format**:
- name: Example YAML File
  description: This is an example YAML file.
  sections:
    - title: Introduction
      content: |
        This is the introduction.
    - title: Conclusion
      content: |
        This is the conclusion.
"""

text = complete(prompt)
text

'name: What is data engineering?\ndescription: Data engineering is the process of building, maintaining, and optimizing data systems to meet the needs of an organization. This involves collecting, cleaning, and transforming data into a format that is usable for analysis and decision-making.\nsections:\n  - title: Introduction\n    content: |\n      Data engineering is a critical part of the modern data-driven organization. By building and maintaining efficient and reliable data systems, data engineers ensure that the organization has the data it needs to make informed decisions.\n  - title: The role of data engineers\n    content: |\n      Data engineers play a variety of roles in an organization, including:\n      - Collecting data from a variety of sources\n      - Cleaning and transforming data to make it usable for analysis\n      - Building and maintaining data pipelines\n      - Monitoring data quality\n      - Providing support to data analysts and other users of data\n  - title

In [33]:
print(text)

name: What is data engineering?
description: Data engineering is the process of building, maintaining, and optimizing data systems to meet the needs of an organization. This involves collecting, cleaning, and transforming data into a format that is usable for analysis and decision-making.
sections:
  - title: Introduction
    content: |
      Data engineering is a critical part of the modern data-driven organization. By building and maintaining efficient and reliable data systems, data engineers ensure that the organization has the data it needs to make informed decisions.
  - title: The role of data engineers
    content: |
      Data engineers play a variety of roles in an organization, including:
      - Collecting data from a variety of sources
      - Cleaning and transforming data to make it usable for analysis
      - Building and maintaining data pipelines
      - Monitoring data quality
      - Providing support to data analysts and other users of data
  - title: The benefits 

### Inspect the YAML String

Review the raw YAML output to check formatting.

In [34]:
# Load the YAML string into a Python object
data = yaml.load(text, Loader=yaml.FullLoader)

### Parse the YAML

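Convert the YAML string to a Python dictionary using `yaml.load()`.

**⚠️ Security Note:** LLM output is untrusted input. Prefer `yaml.safe_load()` in production: it builds only plain Python types (dictionaries, lists, strings, numbers) and rejects tags that could construct arbitrary objects.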
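
A minimal sketch of the safer variant, assuming PyYAML is installed as in the install cell above, could look like this. The helper name `parse_llm_yaml` is hypothetical.

```python
import yaml

def parse_llm_yaml(text):
    """Parse YAML from a model response with the safe loader; return None on failure."""
    try:
        # safe_load only constructs standard YAML types (dict, list, str, int, ...).
        return yaml.safe_load(text)
    except yaml.YAMLError as exc:
        print(f"Could not parse model output as YAML: {exc}")
        return None

# Hypothetical response in the same shape as the output above.
reply = "name: Example\nsections:\n  - title: Introduction\n"
print(parse_llm_yaml(reply))
```

With the safe loader, a response containing a Python-specific tag such as `!!python/object/apply:...` is rejected with an error instead of being acted on.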

In [35]:
print(data)

{'name': 'What is data engineering?', 'description': 'Data engineering is the process of building, maintaining, and optimizing data systems to meet the needs of an organization. This involves collecting, cleaning, and transforming data into a format that is usable for analysis and decision-making.', 'sections': [{'title': 'Introduction', 'content': 'Data engineering is a critical part of the modern data-driven organization. By building and maintaining efficient and reliable data systems, data engineers ensure that the organization has the data it needs to make informed decisions.\n'}, {'title': 'The role of data engineers', 'content': 'Data engineers play a variety of roles in an organization, including:\n- Collecting data from a variety of sources\n- Cleaning and transforming data to make it usable for analysis\n- Building and maintaining data pipelines\n- Monitoring data quality\n- Providing support to data analysts and other users of data\n'}, {'title': 'The benefits of data enginee

### View the Parsed Data Structure

The YAML is now a Python dictionary/list that you can manipulate programmatically.
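
As a small illustration of working with the parsed result, the sketch below walks the `sections` list and reports each title with a word count. The `summarize_outline` helper and the `example` data are assumptions made for this example; the data only mirrors the shape of the parsed YAML above.

```python
def summarize_outline(data):
    """Return the outline name followed by one line per section."""
    lines = [data.get("name", "(untitled)")]
    for section in data.get("sections") or []:
        title = section.get("title", "(no title)")
        # Content may be missing or None if the model left a field empty.
        content = (section.get("content") or "").strip()
        lines.append(f"- {title} ({len(content.split())} words)")
    return lines

# Hypothetical data in the same shape as the parsed YAML above.
example = {
    "name": "What is data engineering?",
    "sections": [
        {"title": "Introduction", "content": "Data engineering is critical.\n"},
        {"title": "Conclusion", "content": None},
    ],
}
summary = summarize_outline(example)
print("\n".join(summary))
```

Using `.get()` with defaults keeps the loop from failing when the model omits a key, which is a common inconsistency in generated YAML.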