# Generating a Jupyter book about medical data integration
In this notebook, we generate a Jupyter book using a large language model. We use Claude Sonnet 4 (`claude-sonnet-4-20250514`).

In [1]:
import anthropic
import openai
import datetime
import os
from pathlib import Path
from functools import partial
from IPython.display import Markdown, display
openai.__version__, anthropic.__version__

('1.106.1', '0.60.0')

## Defining the content of the book
First, we specify the topic of the book, its table of contents, and some extra hints for the language model:

In [2]:
topic = "Foundations of Medical Data Integration"

In [3]:
# The table of contents must be a markdown list with * at the beginning of every line.
toc = """
* Files, Folders, and Filesystems
* Path handling and portable I/O (pathlib)
* Sidecar metadata (JSON/YAML), data dictionaries
* Local and cloud storage with fsspec (s3fs/gcsfs/adlfs), streaming I/O
* Next-cloud access to files
* Integrity checks (hashing), small-to-large file strategies
* Reading/writing CSV/NDJSON (pandas/polars), dtype control, missing values
* Schemas and validation (pandera), coercion, constraints
* Efficient storage: Parquet/Arrow, compression, partitioning
* Joins/merges, grouping, windowing; SQL-in-notebooks with DuckDB
* Export patterns: subsets, snapshots, reproducible filters
* numpy/xarray fundamentals: shapes, dtypes, chunking
* Array storage: HDF5/Zarr; compression (blosc/zstd), chunk size trade-offs
* Image I/O: tifffile/imageio; DICOM with pydicom; OME-TIFF/OME-Zarr basics
* CSV/Excel/NDJSON → Parquet/Arrow (batch vs streaming)
* In-notebook SQL analytics and joins with DuckDB/SQLite
* DICOM → OME-TIFF/OME-Zarr; NIfTI round-trips for neuro
* De-identification workflows for EHR, DICOM, and text (hashing, redaction, re-keying)
* NLP pipelines for clinical notes (spaCy, rule-based extraction, terminology normalization)
* Timezones, units, and categorical normalization
* Unit normalization, time alignment, resampling, and windowing for longitudinal data
* Entity resolution and patient matching (splink, recordlinkage) with evaluation
* Out-of-core/parallel conversion with Dask; progress, retries, idempotency
* Map to clinical vocabularies (SNOMED/LOINC/ICD/RxNorm) using Python lookup tables/services
"""

In [4]:
extra_hints = """
By the end of every notebook, add a markdown cell with the headline "## Exercise" 
and an exercise description allowing the reader to practically apply what they just
learned in the notebook.
"""

We will also specify the location where to store the book:

In [5]:
base_dir = ""
repository_url = "https://github.com/generated-books/medical-data-integration"

We will use this language model to generate the book:

In [6]:
model = "claude-sonnet-4-20250514"

## Helper functions
Here we create some helper functions for prompting and for file format handling.

In [7]:
def prompt_chatGPT(message:str, model="gpt-4o-2024-05-13"):
    """
    A prompt helper function that sends a message to OpenAI
    and returns only the text response.
    """
    import os
    import openai
    
    # convert message in the right format if necessary
    if isinstance(message, str):
        message = [{"role": "user", "content": message}]
        
    # setup connection to the LLM
    client = openai.OpenAI()
    
    # submit prompt
    response = client.chat.completions.create(
        model=model,
        messages=message
    )
    
    # extract answer
    return response.choices[0].message.content

In [8]:
def prompt_claude(message:str, model="claude-3-5-sonnet-20240620"):
    """
    A prompt helper function that sends a message to Anthropic
    and returns only the text response.

    Example models: claude-3-5-sonnet-20240620 or claude-3-opus-20240229
    """
    import os
    from anthropic import Anthropic
    
    # convert message in the right format if necessary
    if isinstance(message, str):
        message = [{"role": "user", "content": message}]
        
    # setup connection to the LLM
    client = Anthropic()
    
    results = []
    with client.messages.stream(
        max_tokens=16000,
        messages=message,
        model=model,
    ) as stream:
        for text in stream.text_stream:
            results.append(text)
    return "".join(results)

In [9]:
if "gpt" in model:
    prompt = partial(prompt_chatGPT, model=model)
else:
    prompt = partial(prompt_claude, model=model)    

In [10]:
prompt("Hello world")

"Hello! It's nice to meet you. How are you doing today? Is there anything I can help you with?"

In [11]:
def prompt_with_memory(message:str):
    """
    This function allows using an LLM in chat mode.
    The LLM is equipped with a memory (the global chat_history),
    so that we can refer back to earlier conversation steps.
    """
    
    # convert message in the right format and store it in memory
    question = {"role": "user", "content": message}
    chat_history.append(question)
    
    # receive answer
    response = prompt(chat_history)
    
    # convert answer in the right format and store it in memory
    answer = {"role": "assistant", "content": response}
    chat_history.append(answer)
    
    return response
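
The memory pattern used by `prompt_with_memory` can be tried without calling any API. The following is a minimal sketch, not the code used to generate the book: `make_chat` and `echo_llm` are hypothetical names, and `echo_llm` is a stand-in for a real model that only reports how many messages it received.

```python
def make_chat(llm):
    """Return a (chat, history) pair; `llm` maps a list of messages to a reply string."""
    history = []

    def chat(message: str) -> str:
        # store the question, ask the model with the full history, store the answer
        history.append({"role": "user", "content": message})
        reply = llm(history)
        history.append({"role": "assistant", "content": reply})
        return reply

    return chat, history


def echo_llm(messages):
    """Hypothetical stand-in for a real model: reports how many messages it saw."""
    return f"I have seen {len(messages)} message(s)."


chat, history = make_chat(echo_llm)
first = chat("Hi")
second = chat("What did I say?")
print(first)
print(second)
print([m["role"] for m in history])
```

Each call sends the entire history, so the second call already sees three messages. This is also why the history grows quickly when long responses are stored in it.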

In [12]:
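def is_valid_json(test_string):
    """Return True if the string is valid JSON, False otherwise."""
    import json
    try:
        json.loads(test_string)
        return True
    except (ValueError, TypeError):
        # json.JSONDecodeError is a subclass of ValueError;
        # TypeError covers inputs that are not str/bytes
        return False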

def ensure_json(notebook):
    """This function makes sure that the passed notebook is indeed a json-formatted ipynb file."""
    if is_valid_json(notebook):
        return notebook
        
    return prompt(f"""
Take the following text and extract the Jupyter 
notebook ipynb/json from it:

{notebook}

Make sure the output is in ipynb/json format. 
Respond only the JSON content.
""").strip("```json").strip("```python").strip("```")

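Note that `str.strip` treats its argument as a set of characters rather than as a prefix or suffix, so the chained `.strip(...)` calls above only work as long as the payload starts with `{` and ends with `}`. A more explicit alternative could look like the following sketch; `strip_code_fence` is a hypothetical helper, not part of the original notebook.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built here to keep this example easy to embed


def strip_code_fence(text: str) -> str:
    """Remove one surrounding Markdown code fence with an optional language tag."""
    text = text.strip()
    pattern = rf"{FENCE}[\w+-]*[ \t]*\n(.*?)\n?{FENCE}"
    match = re.fullmatch(pattern, text, flags=re.DOTALL)
    # return the fenced payload if a fence was found, otherwise the text unchanged
    return match.group(1).strip() if match else text


fenced = FENCE + 'json\n{"cells": [], "nbformat": 4}\n' + FENCE
cleaned = strip_code_fence(fenced)
print(json.loads(cleaned))

# str.strip removes any of the given characters, which can eat real content:
print("json_data".strip(FENCE + "json"))
```

The last line prints `_data`, because the leading letters `j`, `s`, `o`, `n` are all in the set of characters to strip.
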
## Context
Here we provide some context to the language model. Because the OpenAI and Anthropic APIs handle system messages differently, we do not use a system message at all and instead send this context as the first user message of the conversation.

In [13]:
system_message = f"""
You are data scientist and statistician, and you work in the medical domain. 
You have didactic skills and you can explain data analysis very well.
You are about to write a Jupyter book consisting of multiple Jupyter notebooks about a given topic.

In front of every code-cell, add a markdown cell with an explanation of the next code cell. Write 1-3 sentences in these markdown cells.
When writing a notebook, always keep the code in the code cells concise. 
Do only one thing and let the user see the intermediate result.
Then, continue with the next thing in a new code cell.

{extra_hints}

Confirm this with "ok".
"""

chat_history = [{"role": "user", "content": system_message}, {"role": "assistant", "content": "ok"}]

We briefly test whether the chat mode works:

In [14]:
prompt_with_memory("Hi, my name is Robert Haase.")

'Hello Robert! Nice to meet you. I\'m ready to help you create a comprehensive Jupyter book with multiple notebooks on your chosen topic. \n\nAs confirmed, I\'ll structure each notebook with:\n- Clear markdown explanations (1-3 sentences) before each code cell\n- Concise, focused code cells that demonstrate one concept at a time\n- Intermediate results displayed for better understanding\n- An "## Exercise" section at the end of each notebook for hands-on practice\n\nWhat medical data analysis topic would you like to focus on for this Jupyter book?'

In [15]:
prompt_with_memory("What is my name?")

'Your name is Robert Haase.'

## Chatting about book content
We start chatting with the LLM about the book's content. It is key that the LLM _knows_ about all the content of the book before it starts generating the first notebook.

In [16]:
Markdown(prompt_with_memory(f"""
I would like to teach others in {topic} and cover these aspects:
{toc}

Therefore, it would be great to have training material in the form of a Jupyter book.

Which Python libraries are relevant in this context? Do not write any Python code yet.
"""))

Based on your comprehensive curriculum for "Foundations of Medical Data Integration," here are the key Python libraries that would be relevant:

## Core Data Handling & I/O
- **pathlib** - Modern path handling
- **pandas/polars** - Tabular data manipulation
- **numpy/xarray** - Array operations and n-dimensional data
- **fsspec, s3fs, gcsfs, adlfs** - Unified filesystem interfaces
- **dask** - Parallel and out-of-core computing

## File Formats & Storage
- **pyarrow** - Arrow format and Parquet I/O
- **h5py/tables** - HDF5 storage
- **zarr** - Chunked array storage
- **tifffile/imageio** - Image I/O
- **pydicom** - DICOM medical imaging
- **nibabel** - NIfTI neuroimaging format
- **ome-zarr** - OME-ZARR for microscopy

## Data Validation & Schemas
- **pandera** - Data validation and schemas
- **pydantic** - Data validation using Python type annotations
- **jsonschema** - JSON schema validation

## Database & Query
- **duckdb** - In-process analytical database
- **sqlite3** - Lightweight database
- **sqlalchemy** - SQL toolkit

## Medical/Clinical Specific
- **spacy** - NLP processing
- **splink/recordlinkage** - Entity resolution and record matching
- **cryptography/hashlib** - Hashing and de-identification

## Metadata & Configuration
- **pyyaml** - YAML parsing
- **json** - JSON handling
- **toml** - Configuration files

## Utilities & Quality
- **tqdm** - Progress bars
- **pytest** - Testing framework
- **logging** - Logging and monitoring
- **blosc/zstd** - Compression algorithms

This creates a solid foundation for teaching medical data integration workflows from basic file handling through advanced clinical data processing pipelines.

## Generating the book
Here we start generating the notebooks for the content listed in the table of contents.
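
The generation loop splits `toc` into subtopics with plain string operations, which is why every entry must start with `* `. A minimal sketch of that parsing on a shortened, hypothetical two-entry list (`example_toc` and `example_contents` are illustrative names):

```python
# Shortened example list; the real `toc` defined earlier has more entries.
example_toc = """
* Files, Folders, and Filesystems
* Path handling and portable I/O (pathlib)
"""

# Same expression as in the generation loop: trim the surrounding newlines,
# trim the leading "* " of the first entry, then split on the remaining bullets.
example_contents = example_toc.strip("\n").strip("* ").split("\n* ")
print(example_contents)
```

An entry that used `-` instead of `*` would not be split off and would end up merged into the preceding subtopic.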

In [17]:
contents = toc.strip("\n").strip("* ").split("\n* ")

for i, subtopic in enumerate(contents):
    # Skip the first five subtopics; the output below starts at the sixth entry.
    if i < 5:
        continue
    notebook = ensure_json(prompt(
        [{"role": "user", "content": system_message},
         {"role": "assistant", "content": "ok"},
         {"role": "user", "content": f"""
    Please write a Jupyter notebook in json format about "{subtopic}" as part of a course about {topic}.
    Respond only the JSON content.
    """}])).strip("```json").strip("```python").strip("```")

    # f"{i:02}_" + 
    filename = Path(base_dir) / "docs" / prompt(f"What would be a good filename for '{subtopic}' jupyter notebook? Make sure it contains no spaces and ends with .ipynb . Respond with the filename only.")

    directory = Path(filename).parent
    os.makedirs(directory, exist_ok=True)
    with open(filename, 'w', encoding='utf-8') as file:
        file.write(notebook)

    print(subtopic, ":", filename)

Integrity checks (hashing), small-to-large file strategies : docs\integrity_checks_hashing_file_strategies.ipynb
Reading/writing CSV/NDJSON (pandas/polars), dtype control, missing values : docs\data_io_csv_ndjson_pandas_polars_dtypes_missing.ipynb
Schemas and validation (pandera), coercion, constraints : docs\pandera_schemas_validation_coercion_constraints.ipynb
Efficient storage: Parquet/Arrow, compression, partitioning : docs\efficient_storage_parquet_arrow_compression_partitioning.ipynb
Joins/merges, grouping, windowing; SQL-in-notebooks with DuckDB : docs\joins-grouping-windowing-duckdb-sql.ipynb
Export patterns: subsets, snapshots, reproducible filters : docs\export_patterns_subsets_snapshots_reproducible_filters.ipynb
numpy/xarray fundamentals: shapes, dtypes, chunking : docs\numpy-xarray-fundamentals-shapes-dtypes-chunking.ipynb
Array storage: HDF5/Zarr; compression (blosc/zstd), chunk size trade-offs : docs\Array-Storage-HDF5-Zarr-Compression-Chunking-Analysis.ipynb
Image I/O: 
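
The filename in the loop above is taken directly from the model's reply, so stray quotes, spaces, or path separators would end up in the path. One way to guard against that is to sanitize the reply first. The following is a minimal sketch under that assumption; `safe_notebook_filename` is a hypothetical helper and was not used to generate the book.

```python
import re
from pathlib import Path


def safe_notebook_filename(raw: str) -> str:
    """Turn a free-text model reply into a conservative .ipynb filename."""
    # drop surrounding whitespace and quotes, then any directory part
    name = Path(raw.strip().strip("`'\"")).name
    # separate the extension so that it is not mangled below
    stem = name[:-6] if name.lower().endswith(".ipynb") else name
    # replace every run of other characters with a single underscore
    stem = re.sub(r"[^A-Za-z0-9_-]+", "_", stem).strip("_")
    return (stem or "notebook") + ".ipynb"


print(safe_notebook_filename(" `Image I-O basics.ipynb`\n"))
print(safe_notebook_filename("../outside/evil.ipynb"))
```

In the loop, such a helper could wrap the model's reply before it is joined with `Path(base_dir) / "docs"`.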

## Generating additional text and config files
We would like to build the book automatically, and we also need some introduction texts and documentation. Now that the individual notebooks have been built, we can generate those additional files as well.

In [18]:
docs_folder = Path(base_dir) / "docs"
today = datetime.date.today().strftime("%B %d, %Y")

more_files = {
    Path(base_dir) / "docs" / "intro.md": 
f"""
Create an intro.md file for a jupyter book that contains all Jupyter notebooks we just created. 
The introduction should give an overview in text form and with bullet points linking to the notebooks.
Mention that the entire book is AI-generated.
The repository url of the book is `{repository_url}`.
Mention that the `generator.ipynb` file in the github repository contains all the code used for generating the book. Add a link to this file.
Respond the content of this file only.
""",
    
    Path(base_dir) / "docs" / "_toc.yml": 
"""
Build a table of contents in Jupyter book yml format.
First, mention the intro.md file.
Please give me the list of all notebook filenames we just created. 
Put them in a _yml file for a Jupyter book.
Respond the content of this file only.
""",

    Path(base_dir) / "docs" / "requirements.txt":
f"""
A requirements.txt file in the `docs` folder containing all python libraries used in this Jupyter book.
Respond the content of this file only.
""",
    
    Path(base_dir) / "docs" / "_config.yml": 
f"""
Create a minimal config.yml file for the jupyter book.
The book will be uploaded to this github repository: {repository_url}
Make sure the notebooks will be executed when the book is built.
The icon for the book is saved in ../icon.png
Note that today is {today}.
Respond the content of this file only.
""",
    
    Path(base_dir) / ".github" / "workflows" / "book.yml": 
f"""
Write a Github workflow file that builds the book and uploads the content to the gh_pages branch.
The book is stored in the `{docs_folder}` folder of the repository.
Respond the content of this file only.
""",

    Path(base_dir) / "readme.md": 
f"""
Create a readme.md file for the jupyter book. 
Give instructions how to build the book.
Mention that the entire book is AI-generated. 
Mention that the `generator.ipynb` file in the github repository contains all the code used for generating the book.
Respond the content of this file only.
""",

}

for filename, task in more_files.items():
    file_content = prompt_with_memory(task)

    directory = Path(filename).parent
    os.makedirs(directory, exist_ok=True)
    
    with open(filename, 'w', encoding='utf-8') as file:
        file.write(file_content)

    print(filename)

docs\intro.md
docs\_toc.yml
docs\requirements.txt
docs\_config.yml
.github\workflows\book.yml
readme.md


## Chat history
For documentation purposes, we output the entire chat with the LLM. Note: The notebooks were generated without storing them in the chat history, because that would make the history grow too long too quickly.

In [19]:
chat_history

[{'role': 'user',
  'content': '\nYou are data scientist and statistician, and you work in the medical domain. \nYou have didactic skills and you can explain data analysis very well.\nYou are about to write a Jupyter book consisting of multiple Jupyter notebooks about a given topic.\n\nIn front of every code-cell, add a markdown cell with an explanation of the next code cell. Write 1-3 sentences in these markdown cells.\nWhen writing a notebook, always keep the code in the code cells concise. \nDo only one thing and let the user see the intermediate result.\nThen, continue with the next thing in a new code cell.\n\n\nBy the end of every notebook, add a markdown cell with the headline "## Exercise" \nand an exercise description allowing the reader to practically apply what they just\nlearned in the notebook.\n\n\nConfirm this with "ok".\n'},
 {'role': 'assistant', 'content': 'ok'},
 {'role': 'user', 'content': 'Hi, my name is Robert Haase.'},
 {'role': 'assistant',
  'content': 'Hello R

The following is only a rough approximation of the number of tokens in the chat history, obtained by counting space-separated words:

In [20]:
len(str(chat_history).split(" "))

2553
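
Word counts and token counts can differ noticeably, so it can help to compare two crude estimates. The sketch below is an illustration only: `rough_token_estimates` is a hypothetical helper, and the factor of roughly four characters per token is a commonly quoted rule of thumb for English text, treated here as an assumption. Exact numbers require the tokenizer of the model in use.

```python
def rough_token_estimates(history):
    """Return (word_count, chars_div_4) as two crude size estimates for a chat history."""
    text = " ".join(message["content"] for message in history)
    word_count = len(text.split())
    char_estimate = len(text) // 4  # assumption: roughly 4 characters per token
    return word_count, char_estimate


example_history = [
    {"role": "user", "content": "Hi, my name is Robert Haase."},
    {"role": "assistant", "content": "Hello Robert!"},
]
print(rough_token_estimates(example_history))
```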