## Generate documents

This Jupyter notebook contains Python code that automatically generates a book in LaTeX format. It relies on several technologies:

*   **OpenAI's language models (like o1):** To generate the actual content of each book section based on a defined outline and relevant background information.
*   **Pandas:** To efficiently manage and load background data, which is expected to be pre-processed and saved in a pickle file. This background data contains text and pre-calculated embeddings for similarity searches.
*   **Pickle:** To load the background data quickly from a `.pkl` file, preserving the data structure and embeddings.
*   **LaTeX:** To format the generated book content into a professional, high-quality PDF document.

**Here's a high-level overview of what the code does:**

1.  **Loads Book Outline:** Reads a JSON file (`validation_book_outline.json`) that defines the structure of your book (sections, titles, goals, and required background for each section).
2.  **Loads Background Data:** Loads a pre-processed Pandas DataFrame from a pickle file (`regulations_with_embeddings.pkl`). This DataFrame should contain the background texts and their corresponding embeddings.
3.  **Iterates Through Book Sections:**  Loops through each section defined in the book outline.
4.  **Finds Relevant Background Text:** For each section, it uses cosine similarity to find the most relevant background text from the loaded DataFrame based on the "required background" description in the outline.
5.  **Generates Section Content with OpenAI:**  Uses OpenAI's API to generate the text content for each section, providing the section title, goal, and the most similar background text as context to the language model.
6.  **Formats Content in LaTeX:**  Wraps the generated text in a LaTeX preamble and postamble to form a complete document, and converts Markdown Python code fences into `lstlisting` environments. Titles and other strings are inserted into the preamble as-is, without escaping LaTeX special characters.
7.  **Saves LaTeX File:**  Saves the complete LaTeX code to a `.tex` file (`generated_book_v2.tex`).
8.  **Compiles LaTeX to PDF (Optional):**  Attempts to automatically compile the generated `.tex` file into a PDF document using `pdflatex`.

This code provides a framework for automated book generation; you can customize the outline, background data, prompts, and LaTeX formatting to produce your own book. Run the cells in order to start the generation process.
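
The outline file itself is not shown in this notebook, so the following is only a sketch of the shape the code appears to expect, inferred from the keys that `main()` reads (`table_of_contents`, `level`, `heading`, `goal`, `required_background`). The entry values and the `validate_outline` helper are hypothetical and not part of the notebook.

```python
import json

# Hypothetical outline illustrating the keys the notebook reads; the
# headings, goals and background descriptions are placeholders.
EXAMPLE_OUTLINE = {
    "table_of_contents": [
        {
            "level": 1,
            "heading": "Example part heading",
            "goal": "Explain why the part exists.",
            "required_background": "Text describing the background to retrieve.",
        },
        {
            "level": 2,
            "heading": "Example subsection heading",
            "goal": "Describe one topic in detail.",
            "required_background": "Another background description.",
        },
    ]
}

REQUIRED_KEYS = {"level", "heading", "goal", "required_background"}


def validate_outline(outline: dict) -> list:
    """Return a list of human-readable problems; an empty list means the outline is usable."""
    entries = outline.get("table_of_contents")
    if not isinstance(entries, list):
        return ["'table_of_contents' must be a list"]
    problems = []
    for i, entry in enumerate(entries):
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            problems.append(f"entry {i}: missing keys {sorted(missing)}")
        elif entry["level"] not in (1, 2, 3):
            # The prompt only describes heading levels 1 to 3.
            problems.append(f"entry {i}: level must be 1, 2 or 3")
    return problems


# Round-trip through JSON to mimic loading the outline from disk.
outline = json.loads(json.dumps(EXAMPLE_OUTLINE))
print(validate_outline(outline))  # → []
```

Running a check like this before the generation loop catches a malformed outline before any paid API call is made.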

In [1]:
import json
import numpy as np
import pickle
import re
import logging
from sklearn.metrics.pairwise import cosine_similarity
import openai
from typing import Any, Dict
import textwrap
import os
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.callbacks import get_openai_callback
import subprocess

# --- Logging Configuration ---
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

### A. Parametrisation

In [2]:
# --- Configuration ---
JSON_OUTLINE_FILE = "validation_book_outline.json"  # Path to your JSON outline file
PANDAS_DF_FILE = "regulations_with_embeddings.pkl"  # Path to your pickle file with background data
TEXT_COLUMN_NAME = "body_of_the_text"  # Column with text content
EMBEDDING_COLUMN_NAME = "combined_text_embedding"  # Column with pre-calculated embeddings
OPENAI_MODEL = "o1-preview"  # "gpt-4o"  # Your preferred OpenAI model
LATEX_OUTPUT_FILE = "generated_book_v2.tex"  # Name of the output LaTeX file

# Global variable to track the total cost for the whole run
TOTAL_COST = 0.0

### B. Helper functions

In [3]:
# --- Helper Functions ---
def load_json_outline(json_file: str) -> Dict[str, Any]:
    """Loads the book outline from a JSON file."""
    try:
        with open(json_file, "r", encoding="utf-8") as f:
            outline = json.load(f)
        logging.info(f"Book outline loaded from {json_file}")
        return outline
    except Exception as e:
        logging.error(f"Error loading JSON outline: {e}")
        raise


def load_dataframe_from_pickle(pickle_filepath: str) -> Any:
    """Loads a DataFrame from a pickle file."""
    try:
        with open(pickle_filepath, "rb") as f:
            loaded_df = pickle.load(f)
        logging.info(f"DataFrame loaded from pickle file: {pickle_filepath}")
        return loaded_df
    except Exception as e:
        logging.error(f"Error loading DataFrame pickle: {e}")
        raise


def load_pandas_dataframe(pkl_file: str) -> Any:
    """
    Loads a pandas DataFrame from a pickle file and converts embedding strings to numpy arrays.
    Use this if your embeddings are stored as strings.
    """
    df = load_dataframe_from_pickle(pkl_file)
    try:
        df[EMBEDDING_COLUMN_NAME] = df[EMBEDDING_COLUMN_NAME].apply(
            lambda x: np.array(json.loads(x)) if isinstance(x, str) else x
        )
        logging.info("Embeddings converted to numpy arrays (if necessary).")
    except Exception as e:
        logging.error(f"Error converting embeddings: {e}")
        raise
    return df


def get_embedding(text: str, model: str = "text-embedding-ada-002") -> list:
    """Generates an embedding for the given text using OpenAI."""
    text = text.replace("\n", " ")
    try:
        response = openai.embeddings.create(input=[text], model=model)
        return response.data[0].embedding
    except Exception as e:
        logging.error(f"Error generating embedding: {e}")
        raise


def find_similar_background_text(df: Any, background_description: str,
                                 text_column: str, embedding_column: str) -> str:
    """
    Finds the most similar text in the DataFrame to the background description using cosine similarity.
    """
    try:
        description_embedding = get_embedding(background_description)
        background_embeddings = np.vstack(df[embedding_column].to_numpy())
        similarities = cosine_similarity(np.array([description_embedding]), background_embeddings)
        most_similar_index = np.argmax(similarities)
        return df[text_column].iloc[most_similar_index]
    except Exception as e:
        logging.error(f"Error finding similar background text: {e}")
        raise
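
`find_similar_background_text` returns only the single best match. The sketch below shows the same cosine-similarity ranking on toy vectors, extended to the top `k` matches. It is written in plain Python so it runs without an API key or scikit-learn; the names `cosine` and `top_k_indices` and the three-dimensional "embeddings" are assumptions for illustration, not part of the notebook.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_indices(query: Sequence[float], rows: List[Sequence[float]], k: int = 1) -> List[int]:
    """Indices of the k rows most similar to the query, best match first."""
    scores = [cosine(query, row) for row in rows]
    ranked = sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy 3-dimensional "embeddings"; real ones have many more dimensions.
texts = ["capital requirements", "model validation", "data privacy"]
embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.2], [0.0, 0.1, 1.0]]
query = [0.0, 0.9, 0.1]

best = top_k_indices(query, embeddings, k=2)
print([texts[i] for i in best])  # → ['model validation', 'data privacy']
```

Passing several matches to the model instead of one could give it more context, at the price of a longer and more expensive prompt.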

### C. Content functions

In [4]:
def generate_content_text(title: str, 
                          goal: str,
                          level: int,
                          background_text: str,
                          model: str = OPENAI_MODEL,
                          use_langchain: bool = False) -> str:
    """
    Generates LaTeX-formatted text for either a section or a subsection.
    
    Parameters:
      - title: The title of the section/subsection.
      - goal: The goal of the section/subsection.
      - level: Heading level from the outline (1 = section, 2 = subsection, 3 = sub-subsection); it is passed into the prompt.
      - background_text: Background content to guide the writing.
      - model: The OpenAI model to use.
      - use_langchain: If True, use LangChain's ChatOpenAI; otherwise, use openai.chat.completions.create.
    """
    global TOTAL_COST
    
    header = f"""
        You are a helpful AI assistant specialized in writing technical books about regulatory compliance and model validation in finance.

        Level {level} heading | Title: {title}

        Goal of this part of the book: {goal}

        Background information to consider when writing:
        {background_text}
    """
    
    instructions = f"""
        ---
        Write the content for the {title} above, keeping in mind the goal and background information.
        Format the output as LaTeX, suitable for inclusion in a LaTeX document.
        A level 1 heading is a \\section, level 2 a \\subsection, and level 3 a \\subsubsection; begin the output with the matching LaTeX heading command.
        Please use standard LaTeX commands for formatting (e.g., \\textbf{{important text}}, \\textit{{emphasized text}}).
        If you need to include lists, use LaTeX list environments like \\begin{{itemize}} ... \\end{{itemize}} or \\begin{{enumerate}} ... \\end{{enumerate}}.
    """
    
    # Plain (non-f) string: single braces here, doubled braces would reach the model literally.
    instructions += "For mathematical formulas, use inline math mode $...$ or display math mode \\begin{equation} ... \\end{equation}.\n"
    instructions += "Do not use mathematical formulas.\n\n"
    instructions += "----\n"
    instructions += "When writing Python code, **format the output as valid Python code enclosed in ```python code blocks.**\n"
    instructions += "Include comments to explain the code where necessary.\n"
    instructions += "Focus on clarity, correctness, and efficiency of the Python code.\n"
    instructions += "Do not include any explanations outside of the code block.\n"

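    prompt = header + instructions
    
    # textwrap.dedent() only removes whitespace common to every line, and the lines
    # appended above start at column 0, so strip each line's indentation instead.
    prompt = "\n".join(line.lstrip() for line in prompt.splitlines())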

    try:
        if use_langchain:
            llm = ChatOpenAI(model_name=model, temperature=1.0)
            with get_openai_callback() as cb:
                response = llm.invoke(prompt)
                content = response.content.strip()
                logging.info("LLM call for content generation completed.")
                logging.info(f"Generation token usage: {cb.total_tokens} (Prompt: {cb.prompt_tokens}, "
                             f"Completion: {cb.completion_tokens}, Cost: ${cb.total_cost:.4f})")

                # Accumulate the cost from this call
                TOTAL_COST += cb.total_cost
                
            return content
        else:
            response = openai.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": "You are a helpful AI assistant specialized in LaTeX output for technical books."},
                    {"role": "user", "content": prompt}
                ],
                # temperature=0.7,
                max_completion_tokens=1700
            )
            return response.choices[0].message.content.strip()
    except Exception as e:
        logging.error(f"Error generating text for Level {level} '{title}': {e}")
        return f"**Error generating content for this Level {level}. Please check logs.**"


def create_latex_section(section_text: str) -> str:
    """Returns the section text unchanged; the model output is expected to contain its own \\section heading."""
    return section_text


def create_latex_subsection(subsection_text: str) -> str:
    """Returns the subsection text unchanged; the model output is expected to contain its own \\subsection heading."""
    return subsection_text


def create_latex_preamble(title: str, author: str, header_text: str = 'Validation Standards') -> str:
    """
    Creates the LaTeX preamble for the document, including a custom title page.
    """
    preamble = f"""
\\documentclass[12pt,a4paper]{{article}}

\\usepackage[utf8]{{inputenc}}
\\usepackage[T1]{{fontenc}}
\\usepackage{{lmodern}}
\\usepackage[margin=1in]{{geometry}}
\\usepackage{{setspace}}
\\usepackage{{titlesec}}
\\usepackage{{etoolbox}}
\\usepackage{{fancyhdr}}
\\usepackage{{graphicx}}
\\usepackage{{amsmath}}
\\usepackage{{listings}} % For code listings
\\lstset{{
  basicstyle=\\ttfamily\\footnotesize,
  breaklines=true,
  showstringspaces=false
}}

% Ensure each \\section begins on a new page
\\preto\\section{{\\clearpage}}

% Format for section and subsection titles
\\titleformat{{\\section}}{{\\large\\bfseries}}{{\\thesection}}{{1em}}{{}}
\\titleformat{{\\subsection}}{{\\normalsize\\bfseries}}{{\\thesubsection}}{{1em}}{{}}

\\setlength{{\\parindent}}{{10pt}}
\\setlength{{\\parskip}}{{0.5\\baselineskip}}
\\setlength{{\\headheight}}{{14.5pt}}

\\pagestyle{{fancy}}
\\fancyhf{{}}
\\fancyhead[C]{{{header_text}}}
\\fancyfoot[C]{{\\thepage}}
\\renewcommand{{\\headrulewidth}}{{0pt}}

\\begin{{document}}
\\pagenumbering{{gobble}}

% --- Custom Title Page ---
\\begin{{titlepage}}
    \\begin{{center}}
        \\vspace*{{3cm}}
        
        {{\\Huge \\textbf{{{title}}}}}\\\\[0.8em]
        {{\\Huge \\textbf{{credit risk validation tests}}}}\\\\[2.0cm]
        
        {{\\Large \\textit{{Review and application of key validation tests}}}}\\\\[2.5cm]
        
        {{\\large \\textbf{{{author}}}}}\\\\[0.5cm]
        
        \\vfill
        {{\\large \\today}}
    \\end{{center}}
\\end{{titlepage}}

\\thispagestyle{{empty}}
\\tableofcontents
\\pagenumbering{{arabic}}
\\setcounter{{page}}{{1}}
"""
    return preamble

def create_latex_postamble() -> str:
    """Creates the LaTeX postamble for the document."""
    return "\n\\end{document}\n"


def convert_markdown_code_blocks_to_lstlisting(text: str) -> str:
    """
    Converts Markdown code blocks (```python ... ```) into LaTeX lstlisting environments.
    This helps prevent errors from raw backticks in the LaTeX document.
    """
    pattern = re.compile(r"```python\s*(.*?)\s*```", re.DOTALL)
    def repl(match):
        code_content = match.group(1)
        return "\\begin{lstlisting}[language=Python]\n" + code_content + "\n\\end{lstlisting}"
    return pattern.sub(repl, text)


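def compile_latex_to_pdf(tex_filename: str) -> None:
    """
    Compiles a .tex file to PDF using pdflatex (requires a LaTeX installation on PATH).
    """
    try:
        # Run pdflatex twice so the table of contents and references are up to date
        subprocess.run(["pdflatex", tex_filename], check=True)
        subprocess.run(["pdflatex", tex_filename], check=True)
        print("PDF successfully generated.")
    except FileNotFoundError:
        # Raised when the pdflatex executable cannot be found
        print("pdflatex not found. Install a LaTeX distribution or add it to PATH.")
    except subprocess.CalledProcessError as e:
        # Raised on a non-zero exit status; the .log file next to the .tex file has the details
        print(f"Error during LaTeX compilation: {e}")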
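
`create_latex_preamble` interpolates `title`, `author`, and `header_text` directly into the LaTeX source, so a character such as `&`, `%`, or `_` in any of them would break compilation. A minimal sketch of an escaping helper follows; the name `escape_latex` is hypothetical and the function is not part of the notebook. It is meant for plain strings such as titles, not for model output that already contains LaTeX commands.

```python
import re

# Replacement for each character that LaTeX treats specially in running text.
_LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}
_SPECIALS_PATTERN = re.compile("|".join(re.escape(ch) for ch in _LATEX_SPECIALS))


def escape_latex(text: str) -> str:
    """Escape LaTeX special characters in a single pass so replacements are not re-escaped."""
    return _SPECIALS_PATTERN.sub(lambda m: _LATEX_SPECIALS[m.group(0)], text)


print(escape_latex("Risk & Return: 100% of model_v2"))
# → Risk \& Return: 100\% of model\_v2
```

The single regular-expression pass matters: replacing characters one after another with `str.replace` would escape the backslashes and braces introduced by earlier replacements.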

### D. Main function

In [5]:
# --- Main Function (Modified for Flat Structure) ---
def main():
    """Main function to generate the book content."""
    global TOTAL_COST  # Make sure to update this where you incur costs

    # Load the book outline
    book_outline = load_json_outline(JSON_OUTLINE_FILE)

    # Load the background DataFrame.
    try:
        background_df = load_pandas_dataframe(PANDAS_DF_FILE)
    except Exception as e:
        logging.info(f"Falling back to direct pickle load for background data: {e}")
        background_df = load_dataframe_from_pickle(PANDAS_DF_FILE)

    # Create the LaTeX preamble
    latex_content = create_latex_preamble(
        title="Validation of credit risk models",
        author="Collaboration between Human and AI"
    )
    logging.info("Generating book content section by section...")

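    # Limit the run to the first six outline entries (useful for a cheap test run).
    contents = book_outline["table_of_contents"][:6]
    # For the full book, use every entry instead:
    # contents = book_outline["table_of_contents"]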

    for section_data in contents:
        level = section_data["level"]
        heading = section_data["heading"]
        goal = section_data["goal"]
        required_background = section_data["required_background"]

        logging.info(f"Processing section (Level {level}): {heading}")
        logging.info("Finding similar background for section intro...")

        # Retrieve background using the outline's "required_background" description
        similar_background_text = find_similar_background_text(
            background_df, required_background, TEXT_COLUMN_NAME, EMBEDDING_COLUMN_NAME
        )

        logging.info("Generating text with OpenAI for section intro...")
    
        generated_section_text = generate_content_text(
            title=heading,
            goal=goal,
            level=level,
            background_text=similar_background_text,
            model=OPENAI_MODEL,
            use_langchain=True  # Adjust as needed
        )

        # Convert any Markdown code blocks to lstlisting environments
        generated_section_text = convert_markdown_code_blocks_to_lstlisting(generated_section_text)
        latex_content += create_latex_section(generated_section_text)


    latex_content += create_latex_postamble()

    # Save the LaTeX output to a file
    try:
        with open(LATEX_OUTPUT_FILE, "w", encoding="utf-8") as outfile:
            outfile.write(latex_content)
        logging.info(f"Saved LaTeX output to {LATEX_OUTPUT_FILE}")
    except Exception as e:
        logging.error(f"Error saving LaTeX file: {e}")

    logging.info("Book generation complete!")
    logging.info(f"Total cost of the run: ${TOTAL_COST:.2f}")
    logging.info(f"Now you can compile '{LATEX_OUTPUT_FILE}' with LaTeX (e.g., pdflatex).")

main()

2025-02-06 06:56:25,574 - INFO - Book outline loaded from validation_book_outline.json
2025-02-06 06:56:27,246 - INFO - DataFrame loaded from pickle file: regulations_with_embeddings.pkl
2025-02-06 06:56:27,253 - INFO - Embeddings converted to numpy arrays (if necessary).
2025-02-06 06:56:27,253 - INFO - Generating book content section by section...
2025-02-06 06:56:27,254 - INFO - Processing section (Level 1): Part I: Foundations of Credit Risk
2025-02-06 06:56:27,255 - INFO - Finding similar background for section intro...
2025-02-06 06:56:29,694 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-02-06 06:56:31,092 - INFO - Generating text with OpenAI for section intro...
2025-02-06 06:57:03,809 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-02-06 06:57:03,821 - INFO - LLM call for content generation completed.
2025-02-06 06:57:03,822 - INFO - Generation token usage: 4490 (Prompt: 376, Completion: 411

### E. Convert to pdf

In [6]:
compile_latex_to_pdf("generated_book_v2.tex")

Error during LaTeX compilation: Command '['pdflatex', 'generated_book_v2.tex']' returned non-zero exit status 1.
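
The exit status alone does not say what went wrong. pdflatex writes a `.log` file next to the `.tex` file, and TeX error messages in that log start with `!`. The sketch below pulls those lines out with a little context; the helper name `extract_tex_errors` and the sample log excerpt are hypothetical, and the actual cause of the failure above is not known from this notebook.

```python
from typing import List


def extract_tex_errors(log_text: str, context_lines: int = 2) -> List[str]:
    """Collect each line starting with '!' plus a few following lines of context."""
    lines = log_text.splitlines()
    errors = []
    for i, line in enumerate(lines):
        if line.startswith("!"):
            errors.append("\n".join(lines[i : i + 1 + context_lines]))
    return errors


# Hypothetical log excerpt for illustration. A real run would read the log
# written by pdflatex, for example:
#   Path("generated_book_v2.log").read_text(errors="replace")
sample_log = """This is pdfTeX
! Undefined control sequence.
l.42 \\foo
         bar
Some later line
"""

for err in extract_tex_errors(sample_log):
    print(err)
```

Model-generated LaTeX often fails on unescaped special characters or leftover Markdown, so the first `!` entry in the log is usually the place to start.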


In [7]:
# pdflatex -output-directory=C:/projects/generate_documents/latex_materials generated_book_v19.tex