## Introduction to the Book Generation Code

This Jupyter notebook contains Python code that automatically generates a book in LaTeX format. It relies on the following technologies:

*   **OpenAI's language models (like GPT-4o):** To generate the actual content of each book section based on a defined outline and relevant background information.
*   **Pandas:** To efficiently manage and load background data, which is expected to be pre-processed and saved in a pickle file. This background data contains text and pre-calculated embeddings for similarity searches.
*   **Pickle:** To load the background data quickly from a `.pkl` file, preserving the data structure and embeddings.
*   **LaTeX:** To format the generated book content into a professional, high-quality PDF document.

**Here's a high-level overview of what the code does:**

1.  **Loads Book Outline:** Reads a JSON file (`book_outline.json`) that defines the structure of your book (sections, titles, goals, and required background for each section). In the notebook as shown, the outline is defined inline as a Python dict and the file-loading call is commented out.
2.  **Loads Background Data:** Loads a pre-processed Pandas DataFrame from a pickle file (`regulations_with_embeddings.pkl`). This DataFrame should contain background text and its corresponding embeddings.
3.  **Iterates Through Book Sections:**  Loops through each section defined in the book outline.
4.  **Finds Relevant Background Text:** For each section, it uses cosine similarity to find the most relevant background text from the loaded DataFrame based on the "required background" description in the outline.
5.  **Generates Section Content with OpenAI:**  Uses OpenAI's API to generate the text content for each section, providing the section title, goal, and the most similar background text as context to the language model.
6.  **Formats Content in LaTeX:** Structures the generated text into LaTeX sections and wraps them in a preamble and postamble to form a complete document. The only escaping applied is replacing `#` with `\#` in the assembled document; other LaTeX special characters are left as generated.
7.  **Saves LaTeX File:**  Saves the complete LaTeX code to a `.tex` file (`generated_book.tex`).
8.  **Compiles LaTeX to PDF (Optional):**  Attempts to automatically compile the generated `.tex` file into a PDF document using `pdflatex`.

This code provides a framework for automated book generation. You can customize the outline, background data, prompts, and LaTeX formatting to create your own book. Run the cells in order to start the book generation process.
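The similarity search in step 4 can be illustrated without any API calls. The sketch below is written in plain Python so the arithmetic is visible; the notebook itself relies on `sklearn`'s `cosine_similarity`. The toy vectors and the helper names `cosine` and `rank_by_similarity` are assumptions made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query, candidates):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

# Toy 3-dimensional "embeddings"; real embeddings have many more dimensions.
background_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
query_vector = [0.9, 0.1, 0.0]

ranking = rank_by_similarity(query_vector, background_vectors)
print(ranking)
```

The first index in `ranking` plays the role of `np.argmax(similarities)` in the notebook; keeping the full ranking makes it easy to pass more than one background passage to the model if desired.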

In [None]:
import openai
import os
import pickle
import json
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

### A. Example

In [None]:
def count_words(text):
    """
    Counts the number of words in a string of text.

    Args:
        text: The string of text to count words in.

    Returns:
        int: The number of words in the text.
    """
    if not text:
        return 0
    words = text.split()
    return len(words)

def generate_technical_book_section(prompt_text):
    """
    Generates a section of a technical book using the OpenAI API (updated for openai>=1.0.0).

    Args:
        prompt_text: The prompt string containing technical context, expectations, and word count.

    Returns:
        str: The generated text section, or None if there's an error.
    """
    try:
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "user", "content": prompt_text}
            ],
            temperature=0.7,
            max_tokens=1000,
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(f"Error generating text: {e}")
        return None

dummy_prompt = """
Generate a section for a technical book based on the following specifications:

**Technical Context:** This section is for a chapter on cloud computing architectures.  The reader is assumed to have a basic understanding of cloud concepts like Infrastructure as a Service (IaaS) and Platform as a Service (PaaS).  We are now moving into more advanced serverless concepts.

**Section Expectations:** This section should introduce the concept of serverless functions (also known as Functions as a Service or FaaS).  It should:
    * Define what serverless functions are and how they differ from traditional server-based applications.
    * Explain the key benefits of using serverless functions, such as scalability, cost efficiency, and reduced operational overhead.
    * Briefly discuss common use cases for serverless functions, like event-driven processing and API backends.
    * Provide a simple, conceptual example of a serverless function being triggered by an event.

**Word Count:** Aim for approximately 400 words.

The section should be written in a technical and informative tone, suitable for a technical book.  Please structure the section logically, using headings, subheadings, bullet points, and code examples (if applicable) to enhance readability and clarity. Ensure the information is accurate and technically sound.

---
    """

generated_section = generate_technical_book_section(dummy_prompt)

if generated_section:
    print("Generated Technical Book Section:\n")
    print(generated_section)
else:
    print("Failed to generate the technical book section.")


count_words(generated_section)
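The example prompt above has three parts: technical context, section expectations, and a target word count. A minimal sketch of assembling such a prompt from its parts, and of checking a generated section against the target, could look as follows. The names `build_section_prompt` and `within_word_budget` and the 25% tolerance are assumptions for illustration, not part of the notebook.

```python
def build_section_prompt(context, expectations, word_count):
    """Assemble a prompt with the same three parts as the hand-written example."""
    bullets = "\n".join(f"    * {item}" for item in expectations)
    return (
        "Generate a section for a technical book based on the following specifications:\n\n"
        f"**Technical Context:** {context}\n\n"
        f"**Section Expectations:** This section should:\n{bullets}\n\n"
        f"**Word Count:** Aim for approximately {word_count} words.\n"
    )

def within_word_budget(text, target, tolerance=0.25):
    """Return True if the word count is within +/- tolerance of the target."""
    n_words = len(text.split()) if text else 0
    return abs(n_words - target) <= tolerance * target

prompt = build_section_prompt(
    context="A chapter on cloud computing architectures.",
    expectations=[
        "Define what serverless functions are.",
        "Explain their key benefits.",
    ],
    word_count=400,
)
print(prompt)
```

Because `within_word_budget` treats `None` and empty strings as zero words, it can be called directly on the return value of `generate_technical_book_section`, which is `None` on failure.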

### B. Create a book

In [3]:
book_outline_data  = {
  "book_outline": [
    {
      "section_number": "1.1",
      "section_title": "General principles",
      "goal": "Introduce the fundamental principles of regulatory compliance and effective validation in the context of IRB models. Explain why validation is crucial for regulatory approval and maintaining model integrity.",
      "expected_length": "Short (2-3 paragraphs)",
      "required_background": "Basic understanding of financial regulations and the concept of model validation. Familiarity with the importance of risk management in financial institutions."
    },
    {
      "section_number": "1.2",
      "section_title": "Definitions",
      "goal": "Define key terms and concepts that are essential for understanding IRB model validation. This section should create a shared vocabulary for the reader. Examples include: IRB models, validation, regulatory compliance, model risk, performance, etc.",
      "expected_length": "Short (List of 5-10 key definitions, 1-2 sentences each)",
      "required_background": "General understanding of risk management and financial modeling terminology. No deep technical knowledge required at this stage."
    },
    {
      "section_number": "1.3",
      "section_title": "Types of validation",
      "goal": "Categorize and describe the different types of validation activities conducted on IRB models. Explain the purpose and scope of each type, such as development validation, independent validation, ongoing monitoring, and model review.",
      "expected_length": "Short (2-3 paragraphs)",
      "required_background": "Understanding of the model lifecycle and different stages where validation is necessary. General knowledge of quality assurance processes."
    },
    {
      "section_number": "2",
      "section_title": "The most common validation tests",
      "goal": "Introduce the section as a comprehensive overview of frequently used validation tests for IRB models. Explain that this section will delve into the specifics of each test, its purpose, application, and interpretation.",
      "expected_length": "Introductory paragraph for Section 2",
      "required_background": "Basic understanding of statistical testing and model evaluation concepts."
    },
    {
      "section_number": "2.1",
      "section_title": "Binomial test",
      "goal": "Explain the Binomial test in detail, including its statistical foundation, how it is applied in IRB model validation (specifically for binary outcomes like default), how to interpret the results, and its strengths and limitations.",
      "expected_length": "Medium (3-4 paragraphs)",
      "required_background": "Solid understanding of basic statistics, specifically the binomial distribution and hypothesis testing. Knowledge of how default events are modeled in credit risk."
    },
    {
      "section_number": "2.2",
      "section_title": "Chi-Square test (Hosmer-Lemeshow test)",
      "goal": "Explain the Chi-Square test, focusing on the Hosmer-Lemeshow variant, and its use in assessing the calibration of probability models. Describe how it groups observations and compares expected vs. observed defaults. Explain interpretation and limitations.",
      "expected_length": "Medium (3-4 paragraphs)",
      "required_background": "Understanding of Chi-Square distribution, goodness-of-fit tests, and logistic regression or similar probability models. Knowledge of model calibration in risk modeling."
    }
  ]
}
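Since the main script can also read the outline from `book_outline.json`, it may help to check the outline's structure and write it to disk first. The sketch below assumes every section needs the five keys used in the outline above; `validate_outline` and `save_outline` are hypothetical helpers, and the shortened outline exists only to keep the example self-contained.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {
    "section_number",
    "section_title",
    "goal",
    "expected_length",
    "required_background",
}

def validate_outline(outline):
    """Return a list of problems; an empty list means every section has the required keys."""
    problems = []
    for i, section in enumerate(outline.get("book_outline", [])):
        missing = REQUIRED_KEYS - section.keys()
        if missing:
            problems.append(f"section {i}: missing {sorted(missing)}")
    return problems

def save_outline(outline, path):
    """Write the outline to a JSON file that a JSON loader can read back."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(outline, f, indent=2)

# Shortened, illustrative outline with a single section.
example_outline = {
    "book_outline": [
        {
            "section_number": "1.1",
            "section_title": "General principles",
            "goal": "Introduce the fundamental principles.",
            "expected_length": "Short (2-3 paragraphs)",
            "required_background": "Basic understanding of model validation.",
        }
    ]
}

problems = validate_outline(example_outline)

# Round-trip through a temporary file to confirm the JSON form is lossless.
with tempfile.TemporaryDirectory() as tmp_dir:
    outline_path = os.path.join(tmp_dir, "book_outline.json")
    save_outline(example_outline, outline_path)
    with open(outline_path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)
```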

In [15]:
# --- Configuration ---
JSON_OUTLINE_FILE = "book_outline.json" # Path to your JSON outline file
PANDAS_DF_FILE = "regulations_with_embeddings.pkl" # Path to the pickle file with background data
TEXT_COLUMN_NAME = "body_of_the_text" # DataFrame column with the text content
EMBEDDING_COLUMN_NAME = "combined_text_embedding" # DataFrame column with pre-calculated embeddings
OPENAI_MODEL = "gpt-4o" # Or your preferred OpenAI model
LATEX_OUTPUT_FILE = "generated_book.tex" # Name of the output LaTeX file

# --- Helper Functions ---
def load_json_outline(json_file):
    """Loads the book outline from a JSON file."""
    with open(json_file, 'r') as f:
        return json.load(f)

def load_pandas_dataframe(pkl_file):
    """Loads the pandas DataFrame from a pickle file and converts embeddings stored as JSON strings to NumPy arrays."""
    df = load_dataframe_from_pickle(pkl_file)

    # Convert embedding strings back to numpy arrays; embeddings that are already decoded are kept as arrays
    df[EMBEDDING_COLUMN_NAME] = df[EMBEDDING_COLUMN_NAME].apply(
        lambda x: np.array(json.loads(x)) if isinstance(x, str) else np.asarray(x)
    )
    return df

def load_dataframe_from_pickle(pickle_filepath):
    with open(pickle_filepath, 'rb') as f: # 'rb' for read binary mode
        loaded_df = pickle.load(f)
    print(f"DataFrame loaded from pickle file: {pickle_filepath}")
    return loaded_df
        

def get_embedding(text, model="text-embedding-ada-002"):  # Using text embeddings for background similarity
    """Generates embedding for a given text using OpenAI."""
    text = text.replace("\n", " ")
    return openai.embeddings.create(input=[text], model=model).data[0].embedding

def find_similar_background_text(df, background_description, text_column, embedding_column):
    """Finds the most similar text in the DataFrame to the background description using embeddings."""
    description_embedding = get_embedding(background_description)
    similarities = cosine_similarity(np.array([description_embedding]), np.vstack(df[embedding_column].to_numpy()))
    most_similar_index = np.argmax(similarities)
    return df[text_column].iloc[most_similar_index]


def generate_section_text(section_title, section_goal, background_text, model=OPENAI_MODEL):
    """Generates text for a book section using OpenAI."""
    prompt = f"""
    You are a helpful AI assistant specialized in writing technical books about regulatory compliance and model validation in finance.

    Section Title: {section_title}

    Goal of this section: {section_goal}

    Background information to consider when writing this section:
    {background_text}

    ---

    Write the content for the section above, keeping in mind the goal and background information.
    Format the output as plain text.
    """

    try:
        response = openai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a helpful AI assistant."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.7, # Adjust for creativity vs. focus
            max_tokens=500 # Adjust based on expected section length
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(f"Error generating text for section '{section_title}': {e}")
        return "**Error generating content for this section. Please check logs.**"


def create_latex_section(section_number, section_title, section_text):
    """Formats section content into LaTeX."""
    latex_section = f"""
\\section{{{section_number}. {section_title}}}
\\par % Ensure text starts in paragraph mode
{section_text}

"""
    return latex_section

def create_latex_preamble(title, author):
    """Creates the LaTeX preamble."""
    preamble = f"""
\\documentclass{{book}}
\\usepackage[utf8]{{inputenc}}
\\usepackage{{amsmath}}
\\usepackage{{amsfonts}}
\\usepackage{{amssymb}}
\\usepackage{{graphicx}}
\\usepackage{{geometry}}
\\geometry{{a4paper, margin=1in}}

\\title{{{title}}}
\\author{{{author}}}
\\date{{\\today}}

\\begin{{document}}
\\maketitle
\\tableofcontents

"""
    return preamble

def create_latex_postamble():
    """Creates the LaTeX postamble."""
    postamble = """

\\end{document}
"""
    return postamble


# --- Main Script ---
print("Loading book outline...")
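# The outline defined inline in the previous cell is used as-is.
# Uncomment the next line to load it from JSON_OUTLINE_FILE instead:
# book_outline_data = load_json_outline(JSON_OUTLINE_FILE)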
print("Loading pandas DataFrame...")
background_df = load_dataframe_from_pickle(PANDAS_DF_FILE)

latex_content = create_latex_preamble(
    title="Regulatory compliance through Effective Validation",
    author="AI Generated Book" # You can customize this
)

print("Generating book content section by section...")
for section_data in book_outline_data["book_outline"]:
    section_number = section_data["section_number"]
    section_title = section_data["section_title"]
    section_goal = section_data["goal"]
    required_background = section_data["required_background"]

    print(f"Processing section {section_number}: {section_title}")

    print("  Finding similar background...")
    similar_background_text = find_similar_background_text(
        background_df, required_background, TEXT_COLUMN_NAME, EMBEDDING_COLUMN_NAME
    )

    print("  Generating text with OpenAI...")
    generated_text = generate_section_text(
        section_title, section_goal, similar_background_text
    )

    latex_content += create_latex_section(
        section_number, section_title, generated_text
    )

latex_content += create_latex_postamble()

# Replace '#' with '\#' to avoid LaTeX errors
latex_content = latex_content.replace("#", "\\#")

print(f"Saving LaTeX output to {LATEX_OUTPUT_FILE}...")
with open(LATEX_OUTPUT_FILE, 'w', encoding='utf-8') as outfile:
    outfile.write(latex_content)

print("Book generation complete!")
print(f"Now you can compile '{LATEX_OUTPUT_FILE}' with LaTeX (pdflatex).")

Loading book outline...
Loading pandas DataFrame...
DataFrame loaded from pickle file: regulations_with_embeddings.pkl
Generating book content section by section...
Processing section 1.1: General principles
  Finding similar background...
  Generating text with OpenAI...
Processing section 1.2: Definitions
  Finding similar background...
  Generating text with OpenAI...
Processing section 1.3: Types of validation
  Finding similar background...
  Generating text with OpenAI...
Processing section 1.4: Validation framework
  Finding similar background...
  Generating text with OpenAI...
Processing section 1.5: PD models
  Finding similar background...
  Generating text with OpenAI...
Processing section 2: The most common validation tests
  Finding similar background...
  Generating text with OpenAI...
Processing section 2.1: Binomial test
  Finding similar background...
  Generating text with OpenAI...
Processing section 2.2: Chi-Square test (Hosmer-Lemeshow test)
  Finding similar back
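The main script above escapes only `#`. Text returned by the model can also contain `%`, `&`, `_` or `$`, any of which can stop `pdflatex`. A minimal sketch of a fuller escaping helper follows. `escape_latex` is a hypothetical name, not part of the notebook, and the sketch assumes the section text is plain text with no intentional LaTeX markup, since any such markup would be escaped as well.

```python
import re

# Replacement for each character that LaTeX treats specially in running text.
_LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}
_SPECIALS_PATTERN = re.compile("|".join(re.escape(ch) for ch in _LATEX_SPECIALS))

def escape_latex(text):
    """Escape LaTeX special characters in one pass over the text."""
    # A single regex pass avoids re-escaping the braces that some
    # replacements (such as \textbackslash{}) introduce.
    return _SPECIALS_PATTERN.sub(lambda m: _LATEX_SPECIALS[m.group(0)], text)

print(escape_latex("50% of A&B #1"))
```

Such a helper would be applied to each title and generated section before they are passed to `create_latex_section`, not to the assembled document, since escaping the preamble would break its commands.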

In [None]:
# Sample Python script to compile a .tex (LaTeX) file into a PDF using pdflatex
import subprocess

tex_file = "generated_book.tex"  # specify your .tex file
try:
    subprocess.run(["pdflatex", tex_file], check=True)
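    print("Conversion complete.")
except FileNotFoundError:
    print("pdflatex was not found. Install a LaTeX distribution and make sure it is on your PATH.")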
except subprocess.CalledProcessError as e:
    print("Error during conversion:", e)
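Because the preamble calls `\tableofcontents`, a single `pdflatex` run leaves the table of contents empty: the entries are written to a `.toc` file on one pass and read back on the next. A hedged sketch of a two-pass compile is shown below. The helper names `pdflatex_command` and `compile_latex` are assumptions for illustration, and the code assumes a standard `pdflatex` installation that accepts the `-interaction=nonstopmode` and `-halt-on-error` options.

```python
import shutil
import subprocess

def pdflatex_command(tex_file):
    """Build a non-interactive pdflatex command line for the given file."""
    return ["pdflatex", "-interaction=nonstopmode", "-halt-on-error", tex_file]

def compile_latex(tex_file, passes=2):
    """Run pdflatex several times; return True only if every pass succeeds."""
    if shutil.which("pdflatex") is None:
        print("pdflatex not found on PATH; skipping compilation.")
        return False
    for pass_number in range(1, passes + 1):
        result = subprocess.run(
            pdflatex_command(tex_file),
            capture_output=True,
            text=True,
            errors="replace",  # LaTeX logs are not guaranteed to be valid UTF-8
        )
        if result.returncode != 0:
            # Show only the tail of the log, where the error usually appears.
            print(f"Pass {pass_number} failed:\n{result.stdout[-2000:]}")
            return False
    return True

print(pdflatex_command("generated_book.tex"))
```

Calling `compile_latex("generated_book.tex")` after the `.tex` file has been written would then run both passes and report the end of the log if either fails.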