# Task
Build a detailed model in a Colab notebook for an AI Audit Assistant that automates the summarization and risk mapping of SOC1/SOC2 audit documents (PDF, Word, Excel) into concise insights and draft reports using LangChain and LLMs. The model should include data ingestion, chunking and embedding, evidence summarization, risk/control mapping against SOC2 Trust Principles, and draft report creation, with the entire process running automatically upon execution.

## Set up the environment

### Subtask:
Install necessary libraries (LangChain, PyPDF2, python-docx, pandas, faiss-cpu, chromadb, transformers, torch).


**Reasoning**:
The subtask is to install several libraries. I will use pip to install all the required libraries in a single code block for efficiency.



In [1]:
%pip install -q langchain pypdf2 python-docx pandas faiss-cpu chromadb transformers torch


## Data ingestion

### Subtask:
Develop code to automatically load documents (PDF, Word, Excel) from a specified directory in Colab using LangChain Document Loaders.


**Reasoning**:
Define a function to load documents from a directory using LangChain loaders for different file types and handle potential errors. Then, call this function with a placeholder directory and print the results.



In [2]:
import os
from langchain.document_loaders import PyPDFLoader, Docx2txtLoader, UnstructuredExcelLoader
from langchain.schema import Document

def load_documents_from_directory(directory_path: str) -> list[Document]:
    """
    Loads documents from a specified directory using appropriate LangChain loaders.

    Args:
        directory_path: The path to the directory containing the documents.

    Returns:
        A list of loaded LangChain Document objects.
    """
    loaded_documents = []
    if not os.path.isdir(directory_path):
        print(f"Error: Directory not found at {directory_path}")
        return loaded_documents

    for filename in os.listdir(directory_path):
        file_path = os.path.join(directory_path, filename)
        if os.path.isfile(file_path):
            try:
                if filename.lower().endswith('.pdf'):
                    loader = PyPDFLoader(file_path)
                    loaded_documents.extend(loader.load())
                    print(f"Loaded PDF: {filename}")
                elif filename.lower().endswith('.docx'):
                    loader = Docx2txtLoader(file_path)
                    loaded_documents.extend(loader.load())
                    print(f"Loaded DOCX: {filename}")
                elif filename.lower().endswith('.xlsx'):
                    # UnstructuredExcelLoader might require additional dependencies like 'unstructured'
                    # You may need to install them if you encounter errors.
                    try:
                        loader = UnstructuredExcelLoader(file_path)
                        loaded_documents.extend(loader.load())
                        print(f"Loaded XLSX: {filename}")
                    except ImportError:
                        print("Install 'unstructured' and other dependencies for Excel loading.")
                        continue
                else:
                    print(f"Skipping unsupported file type: {filename}")
            except Exception as e:
                print(f"Error loading {filename}: {e}")

    return loaded_documents

# Create a placeholder directory and upload some dummy files (manual step required before running)
placeholder_dir = "/content/audit_documents"
os.makedirs(placeholder_dir, exist_ok=True)

# Call the function to load documents
documents = load_documents_from_directory(placeholder_dir)

# Verify successful loading
print(f"\nNumber of loaded documents: {len(documents)}")
if documents:
    print(f"First document content snippet: {documents[0].page_content[:500]}...")

ModuleNotFoundError: Module langchain_community.document_loaders not found. Please install langchain-community to access this module. You can install it using `pip install -U langchain-community`

**Reasoning**:
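The previous cell failed because the legacy `langchain.document_loaders` import path forwards to `langchain_community`, which was not installed. Installing `langchain-community` resolves the `ModuleNotFoundError`; `unstructured` is installed alongside it for the Excel loader.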
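Several cells in this notebook fail with `ModuleNotFoundError`, so a quick import check before running the pipeline can surface every missing package at once. The sketch below uses only the standard library. The helper name `find_missing_modules` and the module list are assumptions for illustration: the list reflects the imports used in this notebook plus the backends I believe the loaders and embeddings rely on (`pypdf`, `docx2txt`, `sentence_transformers`), and may need adjusting.

```python
from importlib.util import find_spec

# Import names (not pip package names) assumed to be needed by this notebook.
REQUIRED_MODULES = [
    "langchain",
    "langchain_community",
    "langchain_openai",
    "pypdf",
    "docx2txt",
    "unstructured",
    "faiss",
    "sentence_transformers",
]


def find_missing_modules(module_names):
    """Return the top-level module names that cannot be imported in this runtime."""
    return [name for name in module_names if find_spec(name) is None]


missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

Anything reported as missing can then be installed in a single `%pip install` cell instead of being discovered one failure at a time.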



In [3]:
# pypdf and docx2txt are the backends used by PyPDFLoader and Docx2txtLoader
%pip install -q langchain-community unstructured pypdf docx2txt


**Reasoning**:
The required packages are now installed. Re-run the loading code, this time importing the loaders from `langchain_community.document_loaders`.



In [4]:
import os
from langchain_community.document_loaders import PyPDFLoader, Docx2txtLoader, UnstructuredExcelLoader
from langchain.schema import Document

def load_documents_from_directory(directory_path: str) -> list[Document]:
    """
    Loads documents from a specified directory using appropriate LangChain loaders.

    Args:
        directory_path: The path to the directory containing the documents.

    Returns:
        A list of loaded LangChain Document objects.
    """
    loaded_documents = []
    if not os.path.isdir(directory_path):
        print(f"Error: Directory not found at {directory_path}")
        return loaded_documents

    for filename in os.listdir(directory_path):
        file_path = os.path.join(directory_path, filename)
        if os.path.isfile(file_path):
            try:
                if filename.lower().endswith('.pdf'):
                    loader = PyPDFLoader(file_path)
                    loaded_documents.extend(loader.load())
                    print(f"Loaded PDF: {filename}")
                elif filename.lower().endswith('.docx'):
                    loader = Docx2txtLoader(file_path)
                    loaded_documents.extend(loader.load())
                    print(f"Loaded DOCX: {filename}")
                elif filename.lower().endswith('.xlsx'):
                    # UnstructuredExcelLoader might require additional dependencies like 'unstructured'
                    # You may need to install them if you encounter errors.
                    try:
                        loader = UnstructuredExcelLoader(file_path)
                        loaded_documents.extend(loader.load())
                        print(f"Loaded XLSX: {filename}")
                    except ImportError:
                        print("Install 'unstructured' and other dependencies for Excel loading.")
                        continue
                else:
                    print(f"Skipping unsupported file type: {filename}")
            except Exception as e:
                print(f"Error loading {filename}: {e}")

    return loaded_documents

# Create a placeholder directory and upload some dummy files (manual step required before running)
placeholder_dir = "/content/audit_documents"
os.makedirs(placeholder_dir, exist_ok=True)

# Call the function to load documents
documents = load_documents_from_directory(placeholder_dir)

# Verify successful loading
print(f"\nNumber of loaded documents: {len(documents)}")
if documents:
    print(f"First document content snippet: {documents[0].page_content[:500]}...")


Number of loaded documents: 0


## Document processing

### Subtask:
Implement document splitting and embedding using LangChain's text splitters and a chosen embedding model. Set up a vector store (FAISS or ChromaDB) for efficient retrieval.


**Reasoning**:
Implement document splitting and embedding using LangChain's text splitters and a chosen embedding model, and set up a vector store for efficient retrieval.



In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Split the loaded documents into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
document_chunks = text_splitter.split_documents(documents)

print(f"Number of document chunks: {len(document_chunks)}")
if document_chunks:
    print(f"First chunk snippet: {document_chunks[0].page_content[:500]}...")

# Embed the chunks with a common sentence-transformers model from Hugging Face
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Build a FAISS vector store from the chunks and keep it for later retrieval
vector_store = FAISS.from_documents(document_chunks, embeddings)

print("\nVector store created successfully.")

Number of document chunks: 0


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


IndexError: list index out of range

**Reasoning**:
The `IndexError: list index out of range` is most likely raised inside `FAISS.from_documents`: the loading step found no files, so `document_chunks` is empty, no embedding vectors are produced, and FAISS has no first vector from which to determine the index dimension. The code should skip vector store creation when there are no chunks.



In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Split the loaded documents into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
document_chunks = text_splitter.split_documents(documents)

print(f"Number of document chunks: {len(document_chunks)}")

# Check if there are any chunks before proceeding
if document_chunks:
    print(f"First chunk snippet: {document_chunks[0].page_content[:500]}...")

    # Embed the chunks with a common sentence-transformers model from Hugging Face
    embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

    # Build a FAISS vector store from the chunks and keep it for later retrieval
    vector_store = FAISS.from_documents(document_chunks, embeddings)

    print("\nVector store created successfully.")
else:
    print("No document chunks were created, skipping vector store creation.")
    vector_store = None  # Later cells must check for None before querying the store


Number of document chunks: 0
No document chunks were created, skipping vector store creation.
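To make the `chunk_size=1000` and `chunk_overlap=200` settings concrete, here is a simplified illustration of overlapping chunking. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and lines); it only shows how a fixed window and overlap interact. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "x" * 2500
print([len(piece) for piece in chunk_text(sample)])  # [1000, 1000, 900]
print(chunk_text(""))  # [] (no text, no chunks, as in the run above)
```

Each chunk repeats the last 200 characters of the previous one, which helps keep a control statement intact when it straddles a chunk boundary.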


## Evidence summarization

### Subtask:
Create a LangChain chain or agent that takes document chunks and uses an LLM to generate concise summaries of key control objectives.


**Reasoning**:
Create a LangChain chain to summarize document chunks using an LLM and a prompt template focused on control objectives.



In [7]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.output_parsers import StrOutputParser

# 1. Define a prompt template
prompt = ChatPromptTemplate.from_template(
    """Summarize the following document chunks, focusing on key control objectives relevant to SOC1/SOC2 audits.
    Provide a concise summary that highlights the main controls and their purpose.

    Document chunks:
    "{context}"

    Summary:
    """
)

# 2. Instantiate the chosen LLM (replace with your actual API key or setup)
# Ensure you have your OpenAI API key set as an environment variable OPENAI_API_KEY
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.2)

# 3. Create a LangChain chain
# Using create_stuff_documents_chain to pass all chunks as context
summarization_chain = create_stuff_documents_chain(llm, prompt)

# 4. Test the summarization chain with a small sample of document chunks (if available)
# Since 'document_chunks' might be empty from the previous step, create a dummy sample
if not document_chunks:
    dummy_chunks = [
        Document(page_content="Section 3.1: Access Control Objective. Ensure only authorized personnel have access to sensitive systems and data."),
        Document(page_content="Section 4.2: Change Management Objective. Implement procedures to approve, test, and deploy system changes in a controlled manner."),
        Document(page_content="Section 5.5: Data Encryption Standard. All sensitive data must be encrypted at rest and in transit using approved algorithms.")
    ]
    print("Using dummy document chunks for testing.")
    sample_chunks_for_testing = dummy_chunks
else:
    # Use a small sample of the actual chunks if they exist
    sample_chunks_for_testing = document_chunks[:5] # Take the first 5 chunks

print(f"Number of sample chunks for testing: {len(sample_chunks_for_testing)}")

if sample_chunks_for_testing:
    try:
        # Invoke the chain with the sample chunks
        summary = summarization_chain.invoke({"context": sample_chunks_for_testing})
        print("\nGenerated Summary:")
        print(summary)
    except Exception as e:
        print(f"Error during summarization chain invocation: {e}")
else:
    print("No sample chunks available for testing the summarization chain.")


ModuleNotFoundError: No module named 'langchain_openai'

**Reasoning**:
The previous command failed because the `langchain_openai` package was not installed. Install the missing package.



In [8]:
%pip install -q langchain-openai


**Reasoning**:
Re-run the code to create the summarization chain now that `langchain-openai` is installed.



In [9]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.output_parsers import StrOutputParser
from langchain.schema import Document

# 1. Define a prompt template
prompt = ChatPromptTemplate.from_template(
    """Summarize the following document chunks, focusing on key control objectives relevant to SOC1/SOC2 audits.
    Provide a concise summary that highlights the main controls and their purpose.

    Document chunks:
    "{context}"

    Summary:
    """
)

# 2. Instantiate the chosen LLM (replace with your actual API key or setup)
# Ensure you have your OpenAI API key set as an environment variable OPENAI_API_KEY
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.2)

# 3. Create a LangChain chain
# Using create_stuff_documents_chain to pass all chunks as context
summarization_chain = create_stuff_documents_chain(llm, prompt)

# 4. Test the summarization chain with a small sample of document chunks (if available)
# Since 'document_chunks' might be empty from the previous step, create a dummy sample
if 'document_chunks' not in locals() or not document_chunks:
    dummy_chunks = [
        Document(page_content="Section 3.1: Access Control Objective. Ensure only authorized personnel have access to sensitive systems and data."),
        Document(page_content="Section 4.2: Change Management Objective. Implement procedures to approve, test, and deploy system changes in a controlled manner."),
        Document(page_content="Section 5.5: Data Encryption Standard. All sensitive data must be encrypted at rest and in transit using approved algorithms.")
    ]
    print("Using dummy document chunks for testing.")
    sample_chunks_for_testing = dummy_chunks
else:
    # Use a small sample of the actual chunks if they exist
    sample_chunks_for_testing = document_chunks[:5] # Take the first 5 chunks

print(f"Number of sample chunks for testing: {len(sample_chunks_for_testing)}")

if sample_chunks_for_testing:
    try:
        # Invoke the chain with the sample chunks
        summary = summarization_chain.invoke({"context": sample_chunks_for_testing})
        print("\nGenerated Summary:")
        print(summary)
    except Exception as e:
        print(f"Error during summarization chain invocation: {e}")
else:
    print("No sample chunks available for testing the summarization chain.")


OpenAIError: The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable
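The `OpenAIError` above means no API key was visible to the client. One way to handle this in Colab is to look in the environment first and then fall back to Colab Secrets. The sketch below assumes the secret is stored under the name `OPENAI_API_KEY`; `resolve_openai_api_key` is a hypothetical helper, and the `google.colab.userdata` import only succeeds inside a Colab runtime.

```python
import os


def resolve_openai_api_key(env=None):
    """Return the key from the environment, then from Colab Secrets, else None."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY")
    if key:
        return key
    try:
        # Only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        return None
    try:
        return userdata.get("OPENAI_API_KEY") or None
    except Exception:
        # Raised when the secret is missing or notebook access was not granted.
        return None


api_key = resolve_openai_api_key()
if api_key:
    os.environ["OPENAI_API_KEY"] = api_key
    print("OPENAI_API_KEY is available.")
else:
    print("OPENAI_API_KEY not found; LLM-backed cells will be skipped.")
```

Running this before any cell that instantiates `ChatOpenAI` avoids hardcoding the key in the notebook.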

## Risk/control mapping

### Subtask:
Develop a process to compare the extracted control information against SOC2 Trust Principles using the LLM and identify potential gaps. This might involve prompt engineering or fine-tuning.


**Reasoning**:
Define the prompt template, instantiate the LLM, and create the LangChain chain to compare control summaries against SOC2 Trust Principles. Use dummy data for control summaries since the previous summarization step might not have produced output.



In [10]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain.chains import LLMChain
from langchain_core.output_parsers import StrOutputParser

# 1. Define a prompt template
soc2_mapping_prompt = ChatPromptTemplate.from_template(
    """You are an AI Audit Assistant. Your task is to compare the provided control summaries against the SOC2 Trust Principles:
    - Security
    - Availability
    - Processing Integrity
    - Confidentiality
    - Privacy

    For each control summary, identify which of the SOC2 Trust Principles it addresses. Also, highlight any potential gaps or areas that the controls do not appear to cover based on the SOC2 principles.

    Control Summaries:
    {control_summaries}

    Analysis:
    """
)

# 2. Reuse the `llm` object from the summarization step.
# If that step failed (for example, because OPENAI_API_KEY was missing), `llm` is undefined and this cell raises a NameError.

# 3. Compose the prompt, LLM, and parser so that invoke() returns a plain string
# (LLMChain.invoke would instead return a dict with the response under "text")
soc2_mapping_chain = soc2_mapping_prompt | llm | StrOutputParser()

# 4. Test the mapping process with sample control summaries
# Using dummy sample control summaries
sample_control_summaries = """
Summary 1: Access to systems is restricted to authorized personnel through multi-factor authentication.
Summary 2: Daily backups of critical data are performed and stored offsite.
Summary 3: Data is encrypted at rest using AES-256.
Summary 4: All code changes are reviewed and tested before deployment.
Summary 5: Employee training includes awareness of privacy policies and data handling procedures.
"""

print("Testing SOC2 mapping with sample control summaries...")

# 5. Print the output of the mapping process
try:
    # Invoke the chain with the sample control summaries
    mapping_output = soc2_mapping_chain.invoke({"control_summaries": sample_control_summaries})
    print("\nSOC2 Mapping Output:")
    print(mapping_output)
except Exception as e:
    print(f"Error during SOC2 mapping chain invocation: {e}")


NameError: name 'llm' is not defined

**Reasoning**:
The previous command failed because the `llm` variable was not defined. This is because the previous subtask, which was supposed to instantiate the LLM, failed due to a missing API key. To fix this, I need to include the LLM instantiation in this code block.



In [11]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain.chains import LLMChain
from langchain_core.output_parsers import StrOutputParser
import os

# Ensure the OPENAI_API_KEY environment variable is set
# Replace 'YOUR_OPENAI_API_KEY' with your actual key or set it as an environment variable
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

# 1. Define a prompt template
soc2_mapping_prompt = ChatPromptTemplate.from_template(
    """You are an AI Audit Assistant. Your task is to compare the provided control summaries against the SOC2 Trust Principles:
    - Security
    - Availability
    - Processing Integrity
    - Confidentiality
    - Privacy

    For each control summary, identify which of the SOC2 Trust Principles it addresses. Also, highlight any potential gaps or areas that the controls do not appear to cover based on the SOC2 principles.

    Control Summaries:
    {control_summaries}

    Analysis:
    """
)

# 2. Instantiate the chosen LLM (replace with your actual API key or setup)
# Ensure you have your OpenAI API key set as an environment variable OPENAI_API_KEY
try:
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.2)
    print("LLM instantiated successfully.")
except Exception as e:
    print(f"Error instantiating LLM. Make sure OPENAI_API_KEY is set correctly: {e}")
    llm = None # Set llm to None if instantiation fails


# 3. Compose the prompt, LLM, and parser so that invoke() returns a plain string
# (LLMChain.invoke would instead return a dict with the response under "text")
if llm:
    soc2_mapping_chain = soc2_mapping_prompt | llm | StrOutputParser()

    # 4. Test the mapping process with sample control summaries
    # Using dummy sample control summaries
    sample_control_summaries = """
    Summary 1: Access to systems is restricted to authorized personnel through multi-factor authentication.
    Summary 2: Daily backups of critical data are performed and stored offsite.
    Summary 3: Data is encrypted at rest using AES-256.
    Summary 4: All code changes are reviewed and tested before deployment.
    Summary 5: Employee training includes awareness of privacy policies and data handling procedures.
    """

    print("Testing SOC2 mapping with sample control summaries...")

    # 5. Print the output of the mapping process
    try:
        # Invoke the chain with the sample control summaries
        mapping_output = soc2_mapping_chain.invoke({"control_summaries": sample_control_summaries})
        print("\nSOC2 Mapping Output:")
        print(mapping_output)
    except Exception as e:
        print(f"Error during SOC2 mapping chain invocation: {e}")
else:
    print("Skipping SOC2 mapping process because LLM was not instantiated.")


Error instantiating LLM. Make sure OPENAI_API_KEY is set correctly: The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable
Skipping SOC2 mapping process because LLM was not instantiated.
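Because the LLM could not be instantiated, the mapping step produced no output. As a rough offline baseline, control summaries can be matched against keyword lists per Trust Principle to flag principles with no apparent coverage. This is a heuristic sketch, not a substitute for the LLM analysis or auditor judgment: the keyword lists are illustrative assumptions, and both function names are hypothetical.

```python
# Illustrative keyword lists (assumptions); tune them before relying on the output.
PRINCIPLE_KEYWORDS = {
    "Security": ["access", "authentication", "authorized", "change"],
    "Availability": ["backup", "uptime", "disaster recovery", "failover"],
    "Processing Integrity": ["validation", "reconciliation", "accuracy", "completeness"],
    "Confidentiality": ["encrypt", "confidential", "classification"],
    "Privacy": ["privacy", "personal data", "consent"],
}


def map_controls_to_principles(summaries):
    """Map each summary to the principles whose keywords appear in it."""
    mapping = {}
    for summary in summaries:
        text = summary.lower()
        mapping[summary] = [
            principle
            for principle, keywords in PRINCIPLE_KEYWORDS.items()
            if any(keyword in text for keyword in keywords)
        ]
    return mapping


def find_uncovered_principles(mapping):
    """Return principles that no summary was mapped to, in declaration order."""
    covered = {p for principles in mapping.values() for p in principles}
    return [p for p in PRINCIPLE_KEYWORDS if p not in covered]


summaries = [
    "Access to systems is restricted to authorized personnel through multi-factor authentication.",
    "Daily backups of critical data are performed and stored offsite.",
    "Data is encrypted at rest using AES-256.",
    "All code changes are reviewed and tested before deployment.",
    "Employee training includes awareness of privacy policies and data handling procedures.",
]
mapping = map_controls_to_principles(summaries)
print("Potential gaps:", find_uncovered_principles(mapping))
```

With these keyword lists, the five sample summaries leave Processing Integrity unmatched, which is the kind of gap the LLM prompt is meant to surface with more nuance.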


## Draft report creation

### Subtask:
Design a process to generate a structured draft report (markdown or PDF) summarizing the findings, including control areas, evidence summaries, identified gaps, and suggested test procedures.


**Reasoning**:
Define a prompt template that asks the LLM for a structured draft report in markdown, instantiate the LLM, build a LangChain chain, prepare sample data, and invoke the chain to generate and print the report.



In [12]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain.chains import LLMChain
from langchain_core.output_parsers import StrOutputParser
import os

# 1. Define a prompt template for the LLM
report_prompt = ChatPromptTemplate.from_template(
    """Generate a structured draft audit report in markdown format based on the following information.
    The report should include sections for Control Areas, Evidence Summaries, Identified Gaps, and Suggested Test Procedures.

    Control Areas:
    {control_areas}

    Evidence Summaries:
    {evidence_summaries}

    Identified Gaps:
    {identified_gaps}

    Suggested Test Procedures:
    Provide general suggestions for testing the identified control areas and addressing gaps.

    Draft Audit Report:
    """
)

# 2. Instantiate the chosen LLM (replace with your actual API key or setup)
# Ensure you have your OpenAI API key set as an environment variable OPENAI_API_KEY
try:
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.2)
    print("LLM instantiated successfully.")
except Exception as e:
    print(f"Error instantiating LLM. Make sure OPENAI_API_KEY is set correctly: {e}")
    llm = None # Set llm to None if instantiation fails

# 3. Create a LangChain chain
if llm:
    # Compose prompt, LLM, and parser so that invoke() returns the report as a plain string
    report_chain = report_prompt | llm | StrOutputParser()
    print("LangChain report generation chain created.")
else:
    report_chain = None
    print("Skipping chain creation as LLM was not instantiated.")


# 4. Prepare sample data for testing
sample_control_areas = """
- Access Control
- Change Management
- Data Encryption
"""

sample_evidence_summaries = """
- Summary of Access Control Evidence: User access logs reviewed, MFA implementation confirmed.
- Summary of Change Management Evidence: Change request forms and approval workflows examined.
- Summary of Data Encryption Evidence: Encryption configurations for databases and transit inspected.
"""

sample_identified_gaps = """
- Gap 1: Lack of periodic access reviews for dormant accounts.
- Gap 2: Insufficient testing of rollback procedures for major changes.
"""

# 5. Invoke the LangChain chain with the sample data
if report_chain:
    print("Generating draft report with sample data...")
    try:
        draft_report = report_chain.invoke({
            "control_areas": sample_control_areas,
            "evidence_summaries": sample_evidence_summaries,
            "identified_gaps": sample_identified_gaps
        })

        # 6. Print the generated draft report
        print("\n--- Generated Draft Audit Report ---")
        print(draft_report)
        print("------------------------------------")

    except Exception as e:
        print(f"Error during report generation chain invocation: {e}")
else:
    print("Skipping report generation because the chain was not created.")


Error instantiating LLM. Make sure OPENAI_API_KEY is set correctly: The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable
Skipping chain creation as LLM was not instantiated.
Skipping report generation because the chain was not created.


**Reasoning**:
The previous attempt failed because the `OPENAI_API_KEY` environment variable was not set, which is required to instantiate the LLM. To fix this, the `OPENAI_API_KEY` environment variable needs to be set before instantiating the `ChatOpenAI` model.



In [13]:
import os
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain.chains import LLMChain
from langchain_core.output_parsers import StrOutputParser


# Set the OPENAI_API_KEY environment variable
# Replace 'YOUR_OPENAI_API_KEY' with your actual key
# WARNING: Hardcoding API keys is not recommended for production.
# Use Colab Secrets or environment variables in a secure manner.
# For demonstration purposes, you can uncomment and replace the line below:
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

# Check if the API key is set
if "OPENAI_API_KEY" not in os.environ:
    print("OPENAI_API_KEY environment variable not set. Please set it to proceed.")
    llm = None
    report_chain = None
else:
    # 1. Define a prompt template for the LLM
    report_prompt = ChatPromptTemplate.from_template(
        """Generate a structured draft audit report in markdown format based on the following information.
        The report should include sections for Control Areas, Evidence Summaries, Identified Gaps, and Suggested Test Procedures.

        Control Areas:
        {control_areas}

        Evidence Summaries:
        {evidence_summaries}

        Identified Gaps:
        {identified_gaps}

        Suggested Test Procedures:
        Provide general suggestions for testing the identified control areas and addressing gaps.

        Draft Audit Report:
        """
    )

    # 2. Instantiate the chosen LLM
    try:
        llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.2)
        print("LLM instantiated successfully.")
    except Exception as e:
        print(f"Error instantiating LLM: {e}")
        llm = None

    # 3. Create a LangChain chain
    if llm:
        # Compose prompt, LLM, and parser so that invoke() returns the report as a plain string
        report_chain = report_prompt | llm | StrOutputParser()
        print("LangChain report generation chain created.")
    else:
        report_chain = None
        print("Skipping chain creation as LLM was not instantiated.")


# 4. Prepare sample data for testing
sample_control_areas = """
- Access Control
- Change Management
- Data Encryption
"""

sample_evidence_summaries = """
- Summary of Access Control Evidence: User access logs reviewed, MFA implementation confirmed.
- Summary of Change Management Evidence: Change request forms and approval workflows examined.
- Summary of Data Encryption Evidence: Encryption configurations for databases and transit inspected.
"""

sample_identified_gaps = """
- Gap 1: Lack of periodic access reviews for dormant accounts.
- Gap 2: Insufficient testing of rollback procedures for major changes.
"""

# 5. Invoke the LangChain chain with the sample data
if report_chain:
    print("Generating draft report with sample data...")
    try:
        draft_report = report_chain.invoke({
            "control_areas": sample_control_areas,
            "evidence_summaries": sample_evidence_summaries,
            "identified_gaps": sample_identified_gaps
        })

        # 6. Print the generated draft report
        print("\n--- Generated Draft Audit Report ---")
        print(draft_report)
        print("------------------------------------")

    except Exception as e:
        print(f"Error during report generation chain invocation: {e}")
else:
    print("Skipping report generation because the chain was not created (API key not set).")


OPENAI_API_KEY environment variable not set. Please set it to proceed.
Skipping report generation because the chain was not created (API key not set).


## Output generation

### Subtask:
Implement code to generate the final outputs, including a dashboard (placeholder or simple text summary) and the draft report in the desired format.


**Reasoning**:
Implement the code to save the draft report to a markdown file and generate/print a simple text dashboard summary as requested by the subtask instructions.



In [14]:
import os

# 1. Create a function or code block to take the generated draft_report (markdown string) and save it to a file
def save_report_to_markdown(report_content: str, filename: str = "audit_report.md"):
    """Saves the draft audit report content to a markdown file."""
    if report_content:
        try:
            with open(filename, "w", encoding="utf-8") as f:
                f.write(report_content)
            print(f"\nDraft report saved to {filename}")
        except Exception as e:
            print(f"Error saving report to {filename}: {e}")
    else:
        print("\nNo report content to save.")

# Check if draft_report variable exists and is not empty from previous steps
# If not, create a placeholder for demonstration
if 'draft_report' not in locals() or not draft_report:
    print("\n'draft_report' variable not found or is empty. Using placeholder report content.")
    draft_report_content = """# Draft Audit Report

## Control Areas
- Placeholder Control Area 1
- Placeholder Control Area 2

## Evidence Summaries
- Placeholder Evidence Summary 1
- Placeholder Evidence Summary 2

## Identified Gaps
- Placeholder Gap 1: Description of gap.

## Suggested Test Procedures
- Suggested Test Procedure 1
- Suggested Test Procedure 2
"""
else:
    draft_report_content = draft_report

# Save the report
save_report_to_markdown(draft_report_content)


# 2. For the dashboard placeholder, generate a simple text summary
# 3. Print this text summary to the console to represent the dashboard output.
print("\n--- AI Audit Assistant Dashboard Summary ---")

# Add placeholder metrics or findings. These would ideally come from the audit process.
# For this placeholder, we'll use static or inferred values.
num_documents_processed = len(documents) if 'documents' in locals() else 0
num_chunks_created = len(document_chunks) if 'document_chunks' in locals() else 0
# Heuristic only: this is "Yes" whenever the report contains an "Identified Gaps" heading, which the placeholder report always does
gaps_identified = "Yes" if draft_report_content and "Identified Gaps" in draft_report_content else "No"

print(f"Number of documents processed: {num_documents_processed}")
print(f"Number of document chunks created: {num_chunks_created}")
print(f"High-level summary: Gaps identified during the audit process: {gaps_identified}")
print("------------------------------------------")

# 4. Ensure the code handles the case where the draft_report variable might not exist or is empty (handled above)



'draft_report' variable not found or is empty. Using placeholder report content.

Draft report saved to audit_report.md

--- AI Audit Assistant Dashboard Summary ---
Number of documents processed: 0
Number of document chunks created: 0
High-level summary: Gaps identified during the audit process: Yes
------------------------------------------
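
The dashboard above reports "Yes" for gaps whenever the report contains an "Identified Gaps" heading, so it cannot distinguish an empty section from a populated one. A minimal sketch of a stricter metric follows, assuming the report uses `##` headings and top-level `-` or `*` bullets as in the placeholder; `count_section_items` is a hypothetical helper, not part of the notebook.

```python
import re


def count_section_items(report_markdown: str, section_title: str) -> int:
    """Count top-level bullet items under the markdown heading named section_title."""
    count = 0
    in_section = False
    for line in report_markdown.splitlines():
        heading = re.match(r"^#{1,6}\s+(.*)$", line.strip())
        if heading:
            # Entering a new heading: track whether it is the section we want
            in_section = heading.group(1).strip().lower() == section_title.lower()
            continue
        # Only unindented bullets count, so nested detail lines are ignored
        if in_section and re.match(r"^[-*]\s+\S", line):
            count += 1
    return count


sample_report = """# Draft Audit Report

## Identified Gaps
- Gap 1: Lack of periodic access reviews for dormant accounts.
- Gap 2: Insufficient testing of rollback procedures for major changes.

## Suggested Test Procedures
- Re-perform a sample of access reviews.
"""

gap_count = count_section_items(sample_report, "Identified Gaps")
print(f"Gaps identified: {gap_count}")
```

In the dashboard cell, `gap_count` could replace the `gaps_identified` flag so the summary shows a number rather than a fixed "Yes".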


## Integration and automation

### Subtask:
Integrate all the components into a single workflow that can be executed automatically in the Colab notebook.


**Reasoning**:
Consolidate all previous code components into a single script, add necessary imports, comments, error handling, a main function, and execution call to integrate the workflow.



In [15]:
import os
from langchain_community.document_loaders import PyPDFLoader, Docx2txtLoader, UnstructuredExcelLoader
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import LLMChain
from langchain_core.output_parsers import StrOutputParser

# Ensure the OPENAI_API_KEY environment variable is set
# Replace 'YOUR_OPENAI_API_KEY' with your actual key or set it as an environment variable
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

def load_documents_from_directory(directory_path: str) -> list[Document]:
    """
    Loads documents from a specified directory using appropriate LangChain loaders.
    """
    print(f"Starting document loading from {directory_path}...")
    loaded_documents = []
    if not os.path.isdir(directory_path):
        print(f"Error: Directory not found at {directory_path}")
        return loaded_documents

    for filename in os.listdir(directory_path):
        file_path = os.path.join(directory_path, filename)
        if os.path.isfile(file_path):
            try:
                if filename.lower().endswith('.pdf'):
                    loader = PyPDFLoader(file_path)
                elif filename.lower().endswith('.docx'):
                    loader = Docx2txtLoader(file_path)
                elif filename.lower().endswith('.xlsx'):
                    loader = UnstructuredExcelLoader(file_path)
                else:
                    print(f"Skipping unsupported file type: {filename}")
                    continue
                loaded_documents.extend(loader.load())
                print(f"Loaded: {filename}")
            except Exception as e:
                print(f"Error loading {filename}: {e}")

    print(f"Finished document loading. Loaded {len(loaded_documents)} documents.")
    return loaded_documents

def process_and_embed_documents(documents: list[Document]):
    """
    Splits documents into chunks, embeds them, and creates a vector store.
    """
    print("Starting document processing and embedding...")
    if not documents:
        print("No documents to process. Skipping chunking and embedding.")
        return None, None

    # Document splitting
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    document_chunks = text_splitter.split_documents(documents)
    print(f"Created {len(document_chunks)} document chunks.")

    if not document_chunks:
        print("No document chunks created. Skipping embedding and vector store creation.")
        return document_chunks, None

    # Embedding and Vector Store creation
    try:
        embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
        vector_store = FAISS.from_documents(document_chunks, embeddings)
        print("Vector store created successfully.")
        return document_chunks, vector_store
    except Exception as e:
        print(f"Error during embedding or vector store creation: {e}")
        return document_chunks, None


def summarize_evidence(document_chunks, llm):
    """
    Generates summaries of document chunks focusing on control objectives.
    """
    print("Starting evidence summarization...")
    if not document_chunks:
        print("No document chunks available for summarization.")
        return "No evidence summaries generated."
    if llm is None:
        print("LLM not instantiated. Cannot perform summarization.")
        return "LLM not available for summarization."

    prompt = ChatPromptTemplate.from_template(
        """Summarize the following document chunks, focusing on key control objectives relevant to SOC1/SOC2 audits.
        Provide a concise summary that highlights the main controls and their purpose.

        Document chunks:
        "{context}"

        Summary:
        """
    )

    try:
        summarization_chain = create_stuff_documents_chain(llm, prompt)
        # For demonstration, only the first five chunks are summarized.
        # A more robust approach would iterate over all chunks in batches or use map_reduce.
        sample_chunks_for_summary = document_chunks[:5]

        summary = summarization_chain.invoke({"context": sample_chunks_for_summary})
        print("Evidence summarization finished.")
        return summary
    except Exception as e:
        print(f"Error during summarization chain invocation: {e}")
        return f"Error generating summaries: {e}"

def map_risk_control(control_summaries: str, llm):
    """
    Compares control summaries against SOC2 Trust Principles and identifies gaps.
    """
    print("Starting risk and control mapping...")
    unavailable_summaries = ("No evidence summaries generated.", "LLM not available for summarization.")
    if not control_summaries or control_summaries in unavailable_summaries:
        print("No control summaries available for mapping.")
        return "No risk/control mapping performed due to missing summaries."
    if llm is None:
        print("LLM not instantiated. Cannot perform mapping.")
        return "LLM not available for mapping."

    soc2_mapping_prompt = ChatPromptTemplate.from_template(
        """You are an AI Audit Assistant. Your task is to compare the provided control summaries against the SOC2 Trust Principles:
        - Security
        - Availability
        - Processing Integrity
        - Confidentiality
        - Privacy

        For each control summary, identify which of the SOC2 Trust Principles it addresses. Also, highlight any potential gaps or areas that the controls do not appear to cover based on the SOC2 principles.

        Control Summaries:
        {control_summaries}

        Analysis:
        """
    )

    try:
        # LCEL pipeline so invoke() returns a string; LLMChain.invoke returns a dict, which breaks the slicing in main_workflow
        soc2_mapping_chain = soc2_mapping_prompt | llm | StrOutputParser()
        mapping_output = soc2_mapping_chain.invoke({"control_summaries": control_summaries})
        print("Risk and control mapping finished.")
        return mapping_output
    except Exception as e:
        print(f"Error during SOC2 mapping chain invocation: {e}")
        return f"Error during risk/control mapping: {e}"

def generate_draft_report(control_summaries: str, mapping_output: str, llm):
    """
    Generates a structured draft audit report in markdown format.
    """
    print("Starting draft report generation...")
    if llm is None:
        print("LLM not instantiated. Cannot generate report.")
        return "LLM not available for report generation."

    # Control areas are a simplified placeholder here; a more sophisticated
    # approach would parse them out of mapping_output. The gaps passed to the
    # prompt below are the raw mapping output.
    control_areas = "Based on summaries" if control_summaries else "Not identified"

    report_prompt = ChatPromptTemplate.from_template(
        """Generate a structured draft audit report in markdown format based on the following information.
        The report should include sections for Control Areas, Evidence Summaries, Identified Gaps, and Suggested Test Procedures.

        Control Areas:
        {control_areas}

        Evidence Summaries:
        {evidence_summaries}

        Identified Gaps:
        {identified_gaps}

        Suggested Test Procedures:
        Provide general suggestions for testing the identified control areas and addressing gaps based on the analysis.

        Draft Audit Report:
        """
    )

    try:
        # LCEL pipeline so invoke() returns the report as a string that can be written to a file
        report_chain = report_prompt | llm | StrOutputParser()
        draft_report = report_chain.invoke({
            "control_areas": control_areas,
            "evidence_summaries": control_summaries, # Using summaries as evidence summaries
            "identified_gaps": mapping_output # Using mapping output to represent gaps
        })
        print("Draft report generation finished.")
        return draft_report
    except Exception as e:
        print(f"Error during report generation chain invocation: {e}")
        return f"Error generating draft report: {e}"

def save_output(draft_report_content: str, dashboard_summary: str):
    """
    Saves the draft report and prints the dashboard summary.
    """
    print("Starting output generation and saving...")
    # Save report to markdown
    report_filename = "audit_report.md"
    if draft_report_content:
        try:
            with open(report_filename, "w", encoding="utf-8") as f:
                f.write(draft_report_content)
            print(f"Draft report saved to {report_filename}")
        except Exception as e:
            print(f"Error saving report to {report_filename}: {e}")
    else:
        print("No report content to save.")

    # Print dashboard summary
    print("\n--- AI Audit Assistant Dashboard Summary ---")
    print(dashboard_summary)
    print("------------------------------------------")
    print("Output generation finished.")


def main_workflow(directory_path: str = "/content/audit_documents"):
    """
    Orchestrates the entire AI Audit Assistant workflow.
    """
    print("--- Starting AI Audit Assistant Workflow ---")

    # --- Configuration ---
    llm_model_name = "gpt-4o-mini"
    temperature = 0.2
    report_filename = "audit_report.md"

    # --- LLM Instantiation (with error handling) ---
    llm = None
    if "OPENAI_API_KEY" not in os.environ:
        print("Warning: OPENAI_API_KEY environment variable not set.")
        print("LLM-based steps will be skipped.")
    else:
        try:
            llm = ChatOpenAI(model=llm_model_name, temperature=temperature)
            print(f"LLM ({llm_model_name}) instantiated successfully.")
        except Exception as e:
            print(f"Error instantiating LLM: {e}")
            print("LLM-based steps will be skipped.")
            llm = None


    # --- Stage 1: Data Ingestion ---
    documents = load_documents_from_directory(directory_path)
    num_documents_processed = len(documents)


    # --- Stage 2: Document Processing, Chunking, and Embedding ---
    document_chunks, vector_store = process_and_embed_documents(documents)
    num_chunks_created = len(document_chunks) if document_chunks is not None else 0


    # --- Stage 3: Evidence Summarization ---
    # Note: For simplicity, we summarize directly from chunks, not using vector store retrieval here.
    # A real application would query the vector store for relevant chunks per control area.
    control_summaries = summarize_evidence(document_chunks, llm)


    # --- Stage 4: Risk/Control Mapping ---
    mapping_output = map_risk_control(control_summaries, llm)


    # --- Stage 5: Draft Report Creation ---
    draft_report_content = generate_draft_report(control_summaries, mapping_output, llm)


    # --- Stage 6: Output Generation ---
    dashboard_summary = f"""
Number of documents processed: {num_documents_processed}
Number of document chunks created: {num_chunks_created}
High-level summary from mapping: {mapping_output[:500] + '...' if mapping_output else 'N/A'}
"""
    save_output(draft_report_content, dashboard_summary)


    print("--- AI Audit Assistant Workflow Finished ---")


# Create a placeholder directory and upload some dummy files (manual step required before running)
placeholder_dir = "/content/audit_documents"
os.makedirs(placeholder_dir, exist_ok=True)
# Add real sample files here manually, or the workflow will process an empty directory.
# Note: a plain-text file renamed to .pdf is not a valid PDF, so PyPDFLoader will most likely report a load error for it.

# Execute the main workflow
main_workflow(placeholder_dir)


--- Starting AI Audit Assistant Workflow ---
LLM-based steps will be skipped.
Starting document loading from /content/audit_documents...
Finished document loading. Loaded 0 documents.
Starting document processing and embedding...
No documents to process. Skipping chunking and embedding.
Starting evidence summarization...
No document chunks available for summarization.
Starting risk and control mapping...
No control summaries available for mapping.
Starting draft report generation...
LLM not instantiated. Cannot generate report.
Starting output generation and saving...
Draft report saved to audit_report.md

--- AI Audit Assistant Dashboard Summary ---

Number of documents processed: 0
Number of document chunks created: 0
High-level summary from mapping: No risk/control mapping performed due to missing summaries....

------------------------------------------
Output generation finished.
--- AI Audit Assistant Workflow Finished ---
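
`summarize_evidence` above only summarizes the first five chunks, and its comments point to batching or map-reduce as a more robust approach. The sketch below illustrates that pattern in plain Python. `batch`, `map_reduce_summarize`, and the stand-in `first_sentence` summarizer are hypothetical; in the notebook, the `summarize` callable would wrap an LLM call instead.

```python
from typing import Callable, List, Sequence


def batch(items: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split items into consecutive groups of at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


def map_reduce_summarize(
    texts: Sequence[str],
    summarize: Callable[[str], str],
    batch_size: int = 5,
) -> str:
    """Summarize each batch (map), then summarize the batch summaries (reduce)."""
    if not texts:
        return ""
    partial = [summarize("\n\n".join(group)) for group in batch(texts, batch_size)]
    if len(partial) == 1:
        return partial[0]
    return summarize("\n\n".join(partial))


def first_sentence(text: str) -> str:
    """Stand-in summarizer so the sketch runs without an API key."""
    return text.split(".")[0].strip() + "."


chunks = [f"Control {i} is reviewed quarterly. Extra detail." for i in range(12)]
combined_summary = map_reduce_summarize(chunks, first_sentence, batch_size=5)
print(combined_summary)
```

This sketch performs a single reduce step; for very large document sets the combined batch summaries could still exceed the model's context window and would need another round of batching.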


## Testing and refinement

### Subtask:
Test the entire pipeline with sample audit documents and refine the prompts, models, and processes based on the results.


**Reasoning**:
Testing the pipeline requires two manual steps: uploading sample documents and setting the API key. The previous step already created the directory and the integrated workflow, so this cell gives instructions for the manual steps and then re-runs the workflow.



In [16]:
# Instructions for manual steps:
# 1. **Manually upload sample audit documents (PDF, Word, Excel)** into the `/content/audit_documents` directory.
#    You can do this by clicking the folder icon on the left sidebar in Colab, navigating to `/content/audit_documents`,
#    right-clicking, and selecting "Upload".
# 2. **Ensure your OPENAI_API_KEY environment variable is set.**
#    Recommended: Use Colab Secrets. Click the "🔑" icon on the left sidebar, click "New secret",
#    set "Name" to `OPENAI_API_KEY`, and "Value" to your OpenAI API key. Ensure the "Notebook access" toggle is on.
#    Colab Secrets are not exported as environment variables automatically, so copy the secret into os.environ first:
#    # from google.colab import userdata
#    # os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
#    Alternatively, you can uncomment and run the line below, replacing "YOUR_ACTUAL_API_KEY" with your key (NOT RECOMMENDED FOR SECURITY):
#    # os.environ["OPENAI_API_KEY"] = "YOUR_ACTUAL_API_KEY"

print("Please complete the manual steps (upload documents and set OPENAI_API_KEY) as instructed above.")
print("After completing the manual steps, re-run this cell to execute the workflow.")

# Re-execute the main workflow after manual steps are completed
# This will run the entire pipeline: load, process, summarize, map, report, save output.
main_workflow("/content/audit_documents")


Please complete the manual steps (upload documents and set OPENAI_API_KEY) as instructed above.
After completing the manual steps, re-run this cell to execute the workflow.
--- Starting AI Audit Assistant Workflow ---
LLM-based steps will be skipped.
Starting document loading from /content/audit_documents...
Finished document loading. Loaded 0 documents.
Starting document processing and embedding...
No documents to process. Skipping chunking and embedding.
Starting evidence summarization...
No document chunks available for summarization.
Starting risk and control mapping...
No control summaries available for mapping.
Starting draft report generation...
LLM not instantiated. Cannot generate report.
Starting output generation and saving...
Draft report saved to audit_report.md

--- AI Audit Assistant Dashboard Summary ---

Number of documents processed: 0
Number of document chunks created: 0
High-level summary from mapping: No risk/control mapping performed due to missing summaries....
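
The run above processed zero documents because neither manual step had been completed. A small pre-flight check run before `main_workflow` can make that explicit. This is a sketch only: `preflight_check` is a hypothetical helper, and `SUPPORTED_EXTENSIONS` mirrors the extensions handled by `load_documents_from_directory`.

```python
import os
import tempfile
from collections import Counter
from pathlib import Path

# Mirrors the file types handled by load_documents_from_directory
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".xlsx"}


def preflight_check(directory_path: str) -> dict:
    """Report whether the workflow has documents and an API key to work with."""
    directory = Path(directory_path)
    files = [p for p in directory.iterdir() if p.is_file()] if directory.is_dir() else []
    supported = [p for p in files if p.suffix.lower() in SUPPORTED_EXTENSIONS]
    return {
        "directory_exists": directory.is_dir(),
        "supported_files": Counter(p.suffix.lower() for p in supported),
        "skipped_files": sorted(p.name for p in files if p not in supported),
        "api_key_set": bool(os.environ.get("OPENAI_API_KEY")),
    }


# Demonstration against a temporary directory with placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("policy.pdf", "controls.XLSX", "notes.txt"):
        Path(tmp, name).write_text("placeholder")
    report = preflight_check(tmp)

print(report)
```

The check only inspects file extensions; it does not verify that a file is a valid PDF, Word, or Excel document.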



## Summary:

### Data Analysis Key Findings

*   The project successfully set up the necessary Python environment by installing libraries such as `langchain`, `pypdf2`, `python-docx`, `pandas`, `faiss-cpu`, `chromadb`, `transformers`, and `torch`.
*   Code was developed to load PDF, DOCX, and XLSX documents from a directory using LangChain loaders, including handling unsupported file types and errors.
*   The process for splitting documents into chunks using `RecursiveCharacterTextSplitter` and creating embeddings with `HuggingFaceEmbeddings` was implemented, along with setting up a `FAISS` vector store.
*   LangChain chains for evidence summarization and risk/control mapping against SOC2 Trust Principles were designed using `ChatPromptTemplate` and `ChatOpenAI`.
*   A process to generate a structured draft audit report in markdown format was designed using a LangChain chain.
*   Code was implemented to save the generated draft report to a markdown file and print a simple text-based dashboard summary.
*   All the individual components were integrated into a single, automatically executable workflow function (`main_workflow`) with basic error handling for missing prerequisites like the OpenAI API key.

### Insights or Next Steps

*   The core functionality of the AI Audit Assistant, particularly the steps involving LLMs (summarization, mapping, reporting), is contingent on the user setting the `OPENAI_API_KEY` environment variable. This is a critical manual step required for the workflow to function beyond data loading and processing.
*   To fully test and refine the pipeline, sample audit documents (PDF, Word, Excel) must be manually uploaded to the specified directory (`/content/audit_documents`). This manual step is essential for the data ingestion and subsequent processing stages to have actual content to work with.
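
Because every LLM step depends on `OPENAI_API_KEY`, it can help to resolve the key in one place before the workflow starts. A minimal sketch, assuming the key is stored as a Colab secret named `OPENAI_API_KEY`; `ensure_openai_api_key` is a hypothetical helper. Colab secrets are read through `google.colab.userdata` and are not placed in `os.environ` automatically.

```python
import os


def ensure_openai_api_key(secret_name: str = "OPENAI_API_KEY") -> bool:
    """Return True if the key is in os.environ, copying it from Colab Secrets if needed."""
    if os.environ.get(secret_name):
        return True
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import userdata
        key = userdata.get(secret_name)
    except Exception:
        # Not running in Colab, secret missing, or notebook access not granted
        key = None
    if key:
        os.environ[secret_name] = key
        return True
    return False


print("API key available:", ensure_openai_api_key())
```

Calling this once at the top of `main_workflow` would let the existing `"OPENAI_API_KEY" not in os.environ` check work unchanged.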


# Task
Build an interactive interface for the AI Audit Assistant workflow in Google Colab, allowing users to upload audit documents, view the progress of summarization and risk mapping, and see the generated reports and insights directly within the notebook.

## Set up the environment for interactive components

### Subtask:
Install libraries for building interactive interfaces in Colab, such as `ipywidgets` or `gradio`.


**Reasoning**:
Install the necessary libraries for building interactive interfaces as requested by the subtask.



In [17]:
%pip install -q ipywidgets gradio


## Develop the user interface

### Subtask:
Design and implement the interactive elements for file upload, displaying the progress of the workflow, showing the generated summaries, mapping results, and the final report.


**Reasoning**:
Import necessary modules from ipywidgets and IPython.display, create the required widgets (FileUpload, Output, Button), arrange them using VBox, and display the interface. This covers steps 1, 2, 3, 4, 5, and 6 of the instructions.



In [18]:
import ipywidgets as widgets
from IPython.display import display
import os

# 1. Import necessary modules from ipywidgets and IPython.display (done above)

# 2. Create a FileUpload widget for users to upload documents.
uploader = widgets.FileUpload(
    accept='.pdf,.docx,.xlsx',  # Accepted file extensions
    multiple=True  # Allow multiple file uploads
)
print("Please upload your audit documents using the uploader below.")

# 3. Create Output widgets to display workflow progress, summaries, mapping results, and the final report.
progress_output = widgets.Output()
summary_output = widgets.Output()
mapping_output_widget = widgets.Output() # Renamed to avoid conflict with variable
report_output = widgets.Output()

# 4. Create a button widget to trigger the workflow execution.
run_button = widgets.Button(description="Run Audit Workflow")

# Define the directory for uploaded files
upload_dir = "/content/uploaded_audit_documents"
os.makedirs(upload_dir, exist_ok=True)

# Define the event handler for the button click
def on_button_click(b):
    with progress_output:
        progress_output.clear_output()
        print("Starting workflow...")
        # Save uploaded files
        uploaded_files = []
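        # ipywidgets 7 exposes uploader.value as a dict keyed by filename;
        # ipywidgets 8 exposes a tuple of dicts that each carry a 'name' key.
        if isinstance(uploader.value, dict):
            upload_items = list(uploader.value.items())
        else:
            upload_items = [(item['name'], item) for item in uploader.value]
        for name, content_dict in upload_items:
            file_path = os.path.join(upload_dir, name)
            with open(file_path, 'wb') as f:
                f.write(content_dict['content'])
            uploaded_files.append(file_path)
            print(f"Uploaded: {name}")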

        if not uploaded_files:
            print("No files uploaded. Please upload documents to proceed.")
            return

        # Clear previous outputs
        summary_output.clear_output()
        mapping_output_widget.clear_output()
        report_output.clear_output()

        # --- Execute the main workflow ---
        # main_workflow (defined in the integration step) must be available in the
        # global scope and accept the upload directory as its argument.
        try:
            # main_workflow prints its results rather than returning them, so its
            # stdout is captured here. A better approach would be to modify
            # main_workflow to return the outputs.
            from contextlib import redirect_stdout
            from io import StringIO

            captured_stdout = StringIO()
            # redirect_stdout restores sys.stdout even if main_workflow raises
            with redirect_stdout(captured_stdout):
                main_workflow(upload_dir) # Call the integrated workflow

            workflow_output_text = captured_stdout.getvalue()
            print(workflow_output_text) # Print the captured output

            # This is a simplified approach. In a real scenario, you would modify
            # main_workflow to return structured data for summaries, mapping, and report,
            # and then update the respective output widgets with that data.
            # For now, we'll indicate where these outputs would go.

            with summary_output:
                print("\n--- Summaries (Placeholder - see full output above) ---")
                # In a real implementation, extract summaries from workflow_output_text or from function return
                # print("Summary 1: ...")

            with mapping_output_widget:
                print("\n--- Mapping Results (Placeholder - see full output above) ---")
                # In a real implementation, extract mapping results from workflow_output_text or function return
                # print("Mapping Analysis: ...")

            with report_output:
                print("\n--- Draft Report (Placeholder - see full output above or check audit_report.md) ---")
                # In a real implementation, display content of audit_report.md or function return
                # print("Report Content: ...")


        except NameError:
            print("Error: The 'main_workflow' function is not defined. Please ensure the previous integration step was run successfully.")
        except Exception as e:
            print(f"An error occurred during workflow execution: {e}")

    with progress_output:
        print("Workflow finished.")


run_button.on_click(on_button_click)


# 5. Arrange these widgets using layout containers (e.g., VBox, HBox) for a clear user interface.
ui = widgets.VBox([
    widgets.Label("Upload Audit Documents:"),
    uploader,
    run_button,
    widgets.Label("Workflow Progress:"),
    progress_output,
    widgets.Label("Evidence Summaries:"),
    summary_output,
    widgets.Label("Risk/Control Mapping Results:"),
    mapping_output_widget,
    widgets.Label("Draft Audit Report:"),
    report_output
])

# 6. Display the created interface using display().
display(ui)

Please upload your audit documents using the uploader below.


VBox(children=(Label(value='Upload Audit Documents:'), FileUpload(value={}, accept='.pdf,.docx,.xlsx', descrip…
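
The click handler's comments note that `main_workflow` should return structured data so each output widget can show its own section instead of a placeholder. A minimal sketch of such a return type follows; `WorkflowResult` and `render_sections` are hypothetical names, and the field layout is an assumption based on the stages of the workflow.

```python
from dataclasses import dataclass


@dataclass
class WorkflowResult:
    """Structured outputs that main_workflow could return instead of only printing."""
    num_documents: int = 0
    num_chunks: int = 0
    summaries: str = ""
    mapping: str = ""
    report: str = ""


def render_sections(result: WorkflowResult) -> dict:
    """Map each UI panel name to the text it should display."""
    return {
        "summary": result.summaries or "No summaries generated.",
        "mapping": result.mapping or "No mapping results.",
        "report": result.report or "No report generated.",
    }


result = WorkflowResult(num_documents=2, num_chunks=14, summaries="Access reviews are performed quarterly.")
sections = render_sections(result)
for panel, text in sections.items():
    print(f"[{panel}] {text}")
```

Inside `on_button_click`, each entry could then be printed within its widget, for example `with summary_output: print(sections["summary"])`.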

## Modify the workflow to accept user inputs

### Subtask:
Update the existing functions to accept inputs from the interactive interface, such as the directory path for documents or parameters for chunking and summarization.


**Reasoning**:
Modify the `main_workflow` function to accept the `directory_path` as an argument and ensure all subsequent steps within the workflow correctly use the data originating from this input. This directly addresses instructions 1, 2, and 3 of the subtask.
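
The subtask also mentions parameters for chunking and summarization. One way to pass them alongside the directory path is a single configuration object, sketched below. `WorkflowConfig` is a hypothetical name; the defaults are the values already hard-coded in the workflow (chunk size 1000, overlap 200, five chunks summarized, `gpt-4o-mini`, temperature 0.2).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowConfig:
    """User-adjustable settings that the interface could pass to main_workflow."""
    directory_path: str = "/content/uploaded_audit_documents"
    chunk_size: int = 1000
    chunk_overlap: int = 200
    max_chunks_to_summarize: int = 5
    llm_model_name: str = "gpt-4o-mini"
    temperature: float = 0.2

    def __post_init__(self):
        # Validate early so bad widget input fails before any documents are loaded
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("chunk_overlap must be in [0, chunk_size)")
        if self.max_chunks_to_summarize < 1:
            raise ValueError("max_chunks_to_summarize must be at least 1")


config = WorkflowConfig(chunk_size=800, chunk_overlap=100)
print(config)
```

Widget values (for example from `widgets.IntSlider`) could be used to build the config in the click handler, with `main_workflow` reading `config.chunk_size` and the other fields instead of literals.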



In [19]:
import os
from langchain_community.document_loaders import PyPDFLoader, Docx2txtLoader, UnstructuredExcelLoader
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import LLMChain
from langchain_core.output_parsers import StrOutputParser

# Ensure the OPENAI_API_KEY environment variable is set
# Replace 'YOUR_OPENAI_API_KEY' with your actual key or set it as an environment variable
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

def load_documents_from_directory(directory_path: str) -> list[Document]:
    """
    Loads documents from a specified directory using appropriate LangChain loaders.
    """
    print(f"Starting document loading from {directory_path}...")
    loaded_documents = []
    if not os.path.isdir(directory_path):
        print(f"Error: Directory not found at {directory_path}")
        return loaded_documents

    for filename in os.listdir(directory_path):
        file_path = os.path.join(directory_path, filename)
        if os.path.isfile(file_path):
            try:
                if filename.lower().endswith('.pdf'):
                    loader = PyPDFLoader(file_path)
                elif filename.lower().endswith('.docx'):
                    loader = Docx2txtLoader(file_path)
                elif filename.lower().endswith('.xlsx'):
                    loader = UnstructuredExcelLoader(file_path)
                else:
                    print(f"Skipping unsupported file type: {filename}")
                    continue
                loaded_documents.extend(loader.load())
                print(f"Loaded: {filename}")
            except Exception as e:
                print(f"Error loading {filename}: {e}")

    print(f"Finished document loading. Loaded {len(loaded_documents)} documents.")
    return loaded_documents

def process_and_embed_documents(documents: list[Document]):
    """
    Splits documents into chunks, embeds them, and creates a vector store.
    """
    print("Starting document processing and embedding...")
    if not documents:
        print("No documents to process. Skipping chunking and embedding.")
        return [], None # Return empty list and None for consistency

    # Document splitting
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    document_chunks = text_splitter.split_documents(documents)
    print(f"Created {len(document_chunks)} document chunks.")

    if not document_chunks:
        print("No document chunks created. Skipping embedding and vector store creation.")
        return document_chunks, None

    # Embedding and Vector Store creation
    try:
        embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
        vector_store = FAISS.from_documents(document_chunks, embeddings)
        print("Vector store created successfully.")
        return document_chunks, vector_store
    except Exception as e:
        print(f"Error during embedding or vector store creation: {e}")
        return document_chunks, None


def summarize_evidence(document_chunks, llm):
    """
    Generates summaries of document chunks focusing on control objectives.
    """
    print("Starting evidence summarization...")
    if not document_chunks:
        print("No document chunks available for summarization.")
        return "No evidence summaries generated."
    if llm is None:
        print("LLM not instantiated. Cannot perform summarization.")
        return "LLM not available for summarization."

    prompt = ChatPromptTemplate.from_template(
        """Summarize the following document chunks, focusing on key control objectives relevant to SOC1/SOC2 audits.
        Provide a concise summary that highlights the main controls and their purpose.

        Document chunks:
        "{context}"

        Summary:
        """
    )

    try:
        summarization_chain = create_stuff_documents_chain(llm, prompt)
        # For demonstration, only the first five chunks are summarized, so the
        # summary does not cover the full document set. A more robust approach
        # would summarize all chunks in batches (e.g. map-reduce).
        sample_chunks_for_summary = document_chunks[:5]

        if not sample_chunks_for_summary:
            return "No suitable chunks for summarization."

        # Combining sample chunks content for summarization chain
        # The create_stuff_documents_chain expects a list of Documents, so we pass the sample_chunks_for_summary directly
        summary = summarization_chain.invoke({"context": sample_chunks_for_summary})
        print("Evidence summarization finished.")
        return summary
    except Exception as e:
        print(f"Error during summarization chain invocation: {e}")
        return f"Error generating summaries: {e}"

def map_risk_control(control_summaries: str, llm):
    """
    Compares control summaries against SOC2 Trust Principles and identifies gaps.
    """
    print("Starting risk and control mapping...")
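    # Sentinel and error strings returned by summarize_evidence when no usable summary exists
    missing_summary_messages = (
        "No evidence summaries generated.",
        "LLM not available for summarization.",
        "No suitable chunks for summarization.",
    )
    if (not control_summaries
            or control_summaries in missing_summary_messages
            or control_summaries.startswith("Error generating summaries:")):
        print("No control summaries available for mapping.")
        return "No risk/control mapping performed due to missing summaries."
    if llm is None:
        print("LLM not instantiated. Cannot perform mapping.")
        return "LLM not available for mapping."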

    soc2_mapping_prompt = ChatPromptTemplate.from_template(
        """You are an AI Audit Assistant. Your task is to compare the provided control summaries against the SOC2 Trust Principles:
        - Security
        - Availability
        - Processing Integrity
        - Confidentiality
        - Privacy

        For each control summary, identify which of the SOC2 Trust Principles it addresses. Also, highlight any potential gaps or areas that the controls do not appear to cover based on the SOC2 principles.

        Control Summaries:
        {control_summaries}

        Analysis:
        """
    )

    try:
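        # LCEL pipeline: StrOutputParser makes invoke() return a plain string.
        # LLMChain.invoke() returns a dict (inputs plus a "text" key), which
        # breaks the downstream string handling such as writing the report file.
        soc2_mapping_chain = soc2_mapping_prompt | llm | StrOutputParser()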
        mapping_output = soc2_mapping_chain.invoke({"control_summaries": control_summaries})
        print("Risk and control mapping finished.")
        return mapping_output
    except Exception as e:
        print(f"Error during SOC2 mapping chain invocation: {e}")
        return f"Error during risk/control mapping: {e}"

def generate_draft_report(control_summaries: str, mapping_output: str, llm):
    """
    Generates a structured draft audit report in markdown format.
    """
    print("Starting draft report generation...")
    if llm is None:
        print("LLM not instantiated. Cannot generate report.")
        return "LLM not available for report generation."

    # Simplified placeholder; a more sophisticated approach would parse
    # mapping_output to extract the control areas. The gaps themselves reach
    # the prompt through mapping_output in the invoke() call below.
    control_areas = "Based on summaries" if control_summaries else "Not identified"

    report_prompt = ChatPromptTemplate.from_template(
        """Generate a structured draft audit report in markdown format based on the following information.
        The report should include sections for Control Areas, Evidence Summaries, Identified Gaps, and Suggested Test Procedures.

        Control Areas:
        {control_areas}

        Evidence Summaries:
        {evidence_summaries}

        Identified Gaps:
        {identified_gaps}

        Suggested Test Procedures:
        Provide general suggestions for testing the identified control areas and addressing gaps based on the analysis.

        Draft Audit Report:
        """
    )

    try:
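        # LCEL pipeline so that invoke() returns the report as a plain string
        # (LLMChain.invoke() would return a dict, which cannot be written to the file).
        report_chain = report_prompt | llm | StrOutputParser()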
        draft_report = report_chain.invoke({
            "control_areas": control_areas,
            "evidence_summaries": control_summaries, # Using summaries as evidence summaries
            "identified_gaps": mapping_output # Using mapping output to represent gaps
        })
        print("Draft report generation finished.")
        return draft_report
    except Exception as e:
        print(f"Error during report generation chain invocation: {e}")
        return f"Error generating draft report: {e}"

def save_output(draft_report_content: str, dashboard_summary: str, report_filename: str = "audit_report.md"):
    """
    Saves the draft report and prints the dashboard summary.
    """
    print("Starting output generation and saving...")
    # Save report to markdown
    if draft_report_content:
        try:
            # Explicit UTF-8 so non-ASCII characters in the LLM output are written reliably
            with open(report_filename, "w", encoding="utf-8") as f:
                f.write(draft_report_content)
            print(f"Draft report saved to {report_filename}")
        except Exception as e:
            print(f"Error saving report to {report_filename}: {e}")
    else:
        print("No report content to save.")

    # Print dashboard summary
    print("\n--- AI Audit Assistant Dashboard Summary ---")
    print(dashboard_summary)
    print("------------------------------------------")
    print("Output generation finished.")


def main_workflow(directory_path: str):
    """
    Orchestrates the entire AI Audit Assistant workflow.

    Args:
        directory_path: The path to the directory containing the audit documents.
    """
    print("--- Starting AI Audit Assistant Workflow ---")

    # --- Configuration ---
    llm_model_name = "gpt-4o-mini"
    temperature = 0.2
    report_filename = "audit_report.md"

    # --- LLM Instantiation (with error handling) ---
    llm = None
    if "OPENAI_API_KEY" not in os.environ:
        print("Warning: OPENAI_API_KEY environment variable not set.")
        print("LLM-based steps will be skipped.")
    else:
        try:
            llm = ChatOpenAI(model=llm_model_name, temperature=temperature)
            print(f"LLM ({llm_model_name}) instantiated successfully.")
        except Exception as e:
            print(f"Error instantiating LLM: {e}")
            print("LLM-based steps will be skipped.")
            llm = None


    # --- Stage 1: Data Ingestion ---
    documents = load_documents_from_directory(directory_path)
    num_documents_processed = len(documents)


    # --- Stage 2: Document Processing, Chunking, and Embedding ---
    document_chunks, vector_store = process_and_embed_documents(documents)
    num_chunks_created = len(document_chunks) if document_chunks is not None else 0


    # --- Stage 3: Evidence Summarization ---
    # Note: For simplicity, we summarize directly from chunks, not using vector store retrieval here.
    # A real application would query the vector store for relevant chunks per control area.
    control_summaries = summarize_evidence(document_chunks, llm)


    # --- Stage 4: Risk/Control Mapping ---
    mapping_output = map_risk_control(control_summaries, llm)


    # --- Stage 5: Draft Report Creation ---
    draft_report_content = generate_draft_report(control_summaries, mapping_output, llm)


    # --- Stage 6: Output Generation ---
    # Truncate the mapping output to a 500-character preview for the dashboard
    mapping_preview = mapping_output or "N/A"
    if len(mapping_preview) > 500:
        mapping_preview = mapping_preview[:500] + "..."
    dashboard_summary = f"""
Number of documents processed: {num_documents_processed}
Number of document chunks created: {num_chunks_created}
High-level summary from mapping: {mapping_preview}
"""
    save_output(draft_report_content, dashboard_summary, report_filename)


    print("--- AI Audit Assistant Workflow Finished ---")

# Note: The actual execution of main_workflow with the interactive UI
# happens in the code cell that defines and displays the widgets.
# This code block only defines the updated main_workflow function.
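
The `summarize_evidence` function above summarizes only the first five chunks and notes that a map-reduce approach would be more robust. A minimal sketch of that idea follows. It assumes a hypothetical `summarize_fn` callable standing in for the LLM call (for example, a thin wrapper around `summarization_chain.invoke`), and `batched`, `map_reduce_summarize`, and `fake_summarize` are illustrative names that do not appear in the notebook.

```python
from typing import Callable, Sequence


def batched(items: Sequence, batch_size: int):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def map_reduce_summarize(chunks: Sequence[str],
                         summarize_fn: Callable[[str], str],
                         batch_size: int = 5) -> str:
    """Summarize each batch (map), then summarize the partial summaries (reduce).

    `summarize_fn` is a hypothetical stand-in for an LLM call; it is an
    assumption of this sketch, not part of the notebook.
    """
    if not chunks:
        return ""
    partial = [summarize_fn("\n\n".join(batch)) for batch in batched(chunks, batch_size)]
    if len(partial) == 1:
        return partial[0]
    return summarize_fn("\n\n".join(partial))


# Toy stand-in so the sketch runs without an API key: keep the first 20 characters.
def fake_summarize(text: str) -> str:
    return text[:20]


chunks = [f"chunk {i} text" for i in range(12)]
print(map_reduce_summarize(chunks, fake_summarize))
```

With many documents the reduce step can itself grow too long for one prompt, in which case the reduction would need to be applied repeatedly; the sketch does not handle that.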

## Connect the interface to the workflow

### Subtask:
Integrate the interactive components with the `main_workflow` function so that actions in the interface trigger the corresponding steps in the audit process.


**Reasoning**:
Integrate the interactive components with the `main_workflow` function by calling `main_workflow` in the button click handler and attempting to capture and display its outputs in the designated output widgets.
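
Capturing a function's printed output, as the reasoning above describes, can be sketched with the standard library's `contextlib.redirect_stdout`, which restores `sys.stdout` even when the wrapped call raises. The helper `run_and_capture` and the stand-in `demo_workflow` are hypothetical names used only for illustration; the notebook's own handler may differ.

```python
from contextlib import redirect_stdout
from io import StringIO


def run_and_capture(func, *args, **kwargs):
    """Run `func` and return (result, captured_stdout, error).

    redirect_stdout restores sys.stdout when the block exits, including
    when `func` raises, so later prints are not lost in the buffer.
    """
    buffer = StringIO()
    result, error = None, None
    try:
        with redirect_stdout(buffer):
            result = func(*args, **kwargs)
    except Exception as exc:  # report the failure instead of propagating it
        error = exc
    return result, buffer.getvalue(), error


# Hypothetical stand-in for main_workflow, used only to exercise the helper.
def demo_workflow(directory_path):
    print(f"Processing {directory_path}")
    return "done"


result, log_text, error = run_and_capture(demo_workflow, "/content/uploaded_audit_documents")
print(log_text, end="")
```

Returning the error alongside the captured text lets the caller show both the partial progress log and the failure message in the same output widget.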



In [20]:
import ipywidgets as widgets
from IPython.display import display
import os
import sys
from io import StringIO

# Ensure the main_workflow function is defined (assuming it was defined in a previous cell)
# If main_workflow is not defined, the NameError in the button click handler will catch it.


# Define the directory for uploaded files
upload_dir = "/content/uploaded_audit_documents"
os.makedirs(upload_dir, exist_ok=True)

# Create a FileUpload widget for users to upload documents.
uploader = widgets.FileUpload(
    accept='.pdf,.docx,.xlsx',  # Accepted file extensions
    multiple=True  # Allow multiple file uploads
)
print("Please upload your audit documents using the uploader below.")

# Create Output widgets to display workflow progress, summaries, mapping results, and the final report.
progress_output = widgets.Output()
summary_output = widgets.Output()
mapping_output_widget = widgets.Output()
report_output = widgets.Output()

# Create a button widget to trigger the workflow execution.
run_button = widgets.Button(description="Run Audit Workflow")

# Define the event handler for the button click
def on_button_click(b):
    with progress_output:
        progress_output.clear_output()
        summary_output.clear_output()
        mapping_output_widget.clear_output()
        report_output.clear_output()

        print("Starting workflow...")
        # Save uploaded files
        uploaded_files_count = 0
        for name, content_dict in uploader.value.items():
            file_path = os.path.join(upload_dir, name)
            try:
                with open(file_path, 'wb') as f:
                    f.write(content_dict['content'])
                print(f"Uploaded: {name}")
                uploaded_files_count += 1
            except Exception as e:
                print(f"Error saving {name}: {e}")


        if uploaded_files_count == 0:
            print("No files uploaded or saved successfully. Please upload documents to proceed.")
            return

        # --- Execute the main workflow ---
        try:
            # Temporarily redirect stdout to capture prints from main_workflow.
            # This is a simple way to show progress in the progress_output widget;
            # a more robust approach would have main_workflow accept output
            # widgets and write to them directly.
            old_stdout = sys.stdout
            sys.stdout = mystdout = StringIO()
            try:
                # Run the workflow on the upload directory
                main_workflow(upload_dir)
            finally:
                # Always restore stdout, even if main_workflow raises; otherwise
                # the error messages printed below would end up in the buffer.
                sys.stdout = old_stdout
                # Print whatever was captured to progress_output
                workflow_output_text = mystdout.getvalue()
                print(workflow_output_text)

            # --- Display results in respective output widgets ---
            # This part is a placeholder. Ideally, main_workflow would return
            # the actual summaries, mapping results, and report content.
            # For now, we'll just indicate where they would appear.

            with summary_output:
                print("\n--- Evidence Summaries ---")
                # In a real implementation, extract summaries from workflow_output_text or from function return
                # Example placeholder:
                # print("Summary 1: Key control objectives related to access control were identified.")

            with mapping_output_widget:
                print("\n--- Risk/Control Mapping Results ---")
                # In a real implementation, extract mapping results from workflow_output_text or function return
                # Example placeholder:
                # print("Mapping Analysis: Access control maps to the Security principle. Potential gap: Periodic reviews.")

            with report_output:
                print("\n--- Draft Audit Report ---")
                # Read the generated audit_report.md file and display its content.
                try:
                    with open("audit_report.md", "r", encoding="utf-8") as f:
                        report_content = f.read()
                        print(report_content)
                except FileNotFoundError:
                    print("Draft report file not found (audit_report.md).")
                except Exception as e:
                    print(f"Error reading draft report file: {e}")


        except NameError:
            print("Error: The 'main_workflow' function is not defined. Please ensure the previous integration step was run successfully.")
        except Exception as e:
            print(f"An error occurred during workflow execution: {e}")

    with progress_output:
        print("Workflow execution completed.")


run_button.on_click(on_button_click)


# Arrange these widgets using layout containers (e.g., VBox, HBox) for a clear user interface.
ui = widgets.VBox([
    widgets.Label("Upload Audit Documents:"),
    uploader,
    run_button,
    widgets.Label("Workflow Progress:"),
    progress_output,
    widgets.Label("Evidence Summaries:"),
    summary_output,
    widgets.Label("Risk/Control Mapping Results:"),
    mapping_output_widget,
    widgets.Label("Draft Audit Report:"),
    report_output
])

# Display the created interface using display().
display(ui)

Please upload your audit documents using the uploader below.


VBox(children=(Label(value='Upload Audit Documents:'), FileUpload(value={}, accept='.pdf,.docx,.xlsx', descrip…
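
The handler in this cell iterates `uploader.value.items()`, which matches the dict-shaped `value={}` visible in the widget repr above. As far as I know, ipywidgets 8 changed `FileUpload.value` to a tuple of per-file dicts (each with a `name` and a `content` entry), so that loop would fail there. The sketch below normalizes both layouts under that assumption; `iter_uploaded_files` is a hypothetical helper, not part of the notebook, and the sample values are hand-built stand-ins rather than real widget output.

```python
def iter_uploaded_files(value):
    """Yield (filename, content_bytes) pairs from a FileUpload `value`.

    Assumes two layouts: a dict keyed by filename (ipywidgets 7) or a
    sequence of per-file dicts with a "name" key (ipywidgets 8).
    """
    if isinstance(value, dict):
        entries = ((name, info["content"]) for name, info in value.items())
    else:
        entries = ((info["name"], info["content"]) for info in value)
    for name, content in entries:
        # bytes() converts a memoryview and leaves bytes content unchanged
        yield name, bytes(content)


# Hand-built examples of the two assumed layouts
v7_style = {"policy.pdf": {"metadata": {}, "content": b"%PDF"}}
v8_style = ({"name": "controls.xlsx", "content": memoryview(b"PK")},)

for value in (v7_style, v8_style):
    for name, content in iter_uploaded_files(value):
        print(name, len(content))
```

In the click handler, the loop over `uploader.value.items()` could then be written as `for name, content in iter_uploaded_files(uploader.value):`, writing `content` to disk directly.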

## Display results in the interface

### Subtask:
Update the workflow to output the results of each stage (loaded documents, chunk count, summaries, mapping output, draft report) to the interactive interface instead of just printing to the console.


**Reasoning**:
Modify the main_workflow function to accept output widgets and return results, then update the on_button_click function to use these outputs and remove stdout redirection, as per the subtask instructions.
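
The pattern described above, where each stage writes progress to a widget and returns its result, can be sketched without ipywidgets because `widgets.Output` is used purely as a context manager. In this sketch, `emit` and `demo_stage` are hypothetical names, and `contextlib.nullcontext` stands in when no widget is supplied.

```python
from contextlib import nullcontext


def emit(message, output_widget=None):
    """Print `message` inside `output_widget` if given, else to plain stdout."""
    with (output_widget if output_widget is not None else nullcontext()):
        print(message)


def demo_stage(items, output_widget=None):
    """Hypothetical stage: reports progress via emit() and returns its result."""
    emit(f"Processing {len(items)} items...", output_widget)
    result = [item.upper() for item in items]
    emit("Stage finished.", output_widget)
    return result


print(demo_stage(["access control", "backups"]))
```

Keeping the return value separate from the progress messages means the caller can route the result to a different widget than the log, which is what the subtask asks for.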



In [21]:
import os
from langchain_community.document_loaders import PyPDFLoader, Docx2txtLoader, UnstructuredExcelLoader
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import LLMChain
from langchain_core.output_parsers import StrOutputParser
import ipywidgets as widgets
from IPython.display import display
import sys
from io import StringIO


# The workflow needs the OPENAI_API_KEY environment variable; without it, the LLM steps are skipped.
# Avoid hardcoding the key in the notebook. In Colab, one option is to store it as a secret:
# from google.colab import userdata
# os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")

def load_documents_from_directory(directory_path: str, output_widget: widgets.Output) -> list[Document]:
    """
    Loads documents from a specified directory using appropriate LangChain loaders.
    Outputs progress to the specified widget.
    """
    with output_widget:
        print(f"Starting document loading from {directory_path}...")
    loaded_documents = []
    if not os.path.isdir(directory_path):
        with output_widget:
            print(f"Error: Directory not found at {directory_path}")
        return loaded_documents

    for filename in os.listdir(directory_path):
        file_path = os.path.join(directory_path, filename)
        if os.path.isfile(file_path):
            try:
                if filename.lower().endswith('.pdf'):
                    loader = PyPDFLoader(file_path)
                elif filename.lower().endswith('.docx'):
                    loader = Docx2txtLoader(file_path)
                elif filename.lower().endswith('.xlsx'):
                    loader = UnstructuredExcelLoader(file_path)
                else:
                    with output_widget:
                        print(f"Skipping unsupported file type: {filename}")
                    continue
                loaded_documents.extend(loader.load())
                with output_widget:
                    print(f"Loaded: {filename}")
            except Exception as e:
                with output_widget:
                    print(f"Error loading {filename}: {e}")

    with output_widget:
        print(f"Finished document loading. Loaded {len(loaded_documents)} documents.")
    return loaded_documents

def process_and_embed_documents(documents: list[Document], output_widget: widgets.Output):
    """
    Splits documents into chunks, embeds them, and creates a vector store.
    Outputs progress to the specified widget.
    """
    with output_widget:
        print("Starting document processing and embedding...")
    if not documents:
        with output_widget:
            print("No documents to process. Skipping chunking and embedding.")
        return [], None # Return empty list and None for consistency

    # Document splitting
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    document_chunks = text_splitter.split_documents(documents)
    with output_widget:
        print(f"Created {len(document_chunks)} document chunks.")

    if not document_chunks:
        with output_widget:
            print("No document chunks created. Skipping embedding and vector store creation.")
        return document_chunks, None

    # Embedding and Vector Store creation
    try:
        embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
        vector_store = FAISS.from_documents(document_chunks, embeddings)
        with output_widget:
            print("Vector store created successfully.")
        return document_chunks, vector_store
    except Exception as e:
        with output_widget:
            print(f"Error during embedding or vector store creation: {e}")
        return document_chunks, None


def summarize_evidence(document_chunks, llm, output_widget: widgets.Output):
    """
    Generates summaries of document chunks focusing on control objectives.
    Outputs progress to the specified widget.
    """
    with output_widget:
        print("Starting evidence summarization...")
    if not document_chunks:
        with output_widget:
            print("No document chunks available for summarization.")
        return "No evidence summaries generated."
    if llm is None:
        with output_widget:
            print("LLM not instantiated. Cannot perform summarization.")
        return "LLM not available for summarization."

    prompt = ChatPromptTemplate.from_template(
        """Summarize the following document chunks, focusing on key control objectives relevant to SOC1/SOC2 audits.
        Provide a concise summary that highlights the main controls and their purpose.

        Document chunks:
        "{context}"

        Summary:
        """
    )

    try:
        summarization_chain = create_stuff_documents_chain(llm, prompt)
        # For demonstration, only the first five chunks are summarized, so the
        # summary does not cover the full document set. A more robust approach
        # would summarize all chunks in batches (e.g. map-reduce).
        sample_chunks_for_summary = document_chunks[:5]

        if not sample_chunks_for_summary:
            with output_widget:
                print("No suitable chunks for summarization.")
            return "No suitable chunks for summarization."

        summary = summarization_chain.invoke({"context": sample_chunks_for_summary})
        with output_widget:
            print("Evidence summarization finished.")
        return summary
    except Exception as e:
        with output_widget:
            print(f"Error during summarization chain invocation: {e}")
        return f"Error generating summaries: {e}"

def map_risk_control(control_summaries: str, llm, output_widget: widgets.Output):
    """
    Compares control summaries against SOC2 Trust Principles and identifies gaps.
    Outputs progress to the specified widget.
    """
    with output_widget:
        print("Starting risk and control mapping...")
    # Sentinel and error strings returned by summarize_evidence when no usable summary exists
    missing_summary_messages = (
        "No evidence summaries generated.",
        "LLM not available for summarization.",
        "No suitable chunks for summarization.",
    )
    if (not control_summaries
            or control_summaries in missing_summary_messages
            or control_summaries.startswith("Error generating summaries:")):
        with output_widget:
            print("No control summaries available for mapping.")
        return "No risk/control mapping performed due to missing summaries."
    if llm is None:
        with output_widget:
            print("LLM not instantiated. Cannot perform mapping.")
        return "LLM not available for mapping."

    soc2_mapping_prompt = ChatPromptTemplate.from_template(
        """You are an AI Audit Assistant. Your task is to compare the provided control summaries against the SOC2 Trust Principles:
        - Security
        - Availability
        - Processing Integrity
        - Confidentiality
        - Privacy

        For each control summary, identify which of the SOC2 Trust Principles it addresses. Also, highlight any potential gaps or areas that the controls do not appear to cover based on the SOC2 principles.

        Control Summaries:
        {control_summaries}

        Analysis:
        """
    )

    try:
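        # LCEL pipeline: StrOutputParser makes invoke() return a plain string.
        # LLMChain.invoke() returns a dict (inputs plus a "text" key), which
        # breaks the downstream string handling such as writing the report file.
        soc2_mapping_chain = soc2_mapping_prompt | llm | StrOutputParser()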
        mapping_output = soc2_mapping_chain.invoke({"control_summaries": control_summaries})
        with output_widget:
            print("Risk and control mapping finished.")
        return mapping_output
    except Exception as e:
        with output_widget:
            print(f"Error during SOC2 mapping chain invocation: {e}")
        return f"Error during risk/control mapping: {e}"

def generate_draft_report(control_summaries: str, mapping_output: str, llm, output_widget: widgets.Output):
    """
    Generates a structured draft audit report in markdown format.
    Outputs progress to the specified widget.
    """
    with output_widget:
        print("Starting draft report generation...")
    if llm is None:
        with output_widget:
            print("LLM not instantiated. Cannot generate report.")
        return "LLM not available for report generation."

    # Simplified placeholder; a more sophisticated approach would parse
    # mapping_output to extract the control areas. The gaps themselves reach
    # the prompt through mapping_output in the invoke() call below.
    has_summaries = bool(control_summaries) and control_summaries not in (
        "No evidence summaries generated.",
        "LLM not available for summarization.",
        "No suitable chunks for summarization.",
    )
    control_areas = "Based on summaries" if has_summaries else "Not identified"


    report_prompt = ChatPromptTemplate.from_template(
        """Generate a structured draft audit report in markdown format based on the following information.
        The report should include sections for Control Areas, Evidence Summaries, Identified Gaps, and Suggested Test Procedures.

        Control Areas:
        {control_areas}

        Evidence Summaries:
        {evidence_summaries}

        Identified Gaps:
        {identified_gaps}

        Suggested Test Procedures:
        Provide general suggestions for testing the identified control areas and addressing gaps based on the analysis.

        Draft Audit Report:
        """
    )

    try:
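        # LCEL pipeline so that invoke() returns the report as a plain string
        # (LLMChain.invoke() would return a dict, which cannot be written to the file).
        report_chain = report_prompt | llm | StrOutputParser()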
        draft_report = report_chain.invoke({
            "control_areas": control_areas,
            "evidence_summaries": control_summaries, # Using summaries as evidence summaries
            "identified_gaps": mapping_output # Using mapping output to represent gaps
        })
        with output_widget:
            print("Draft report generation finished.")
        return draft_report
    except Exception as e:
        with output_widget:
            print(f"Error during report generation chain invocation: {e}")
        return f"Error generating draft report: {e}"

from contextlib import nullcontext  # no-op context manager used when no widget is given


def save_output(draft_report_content: str, dashboard_summary: str, report_filename: str = "audit_report.md", output_widget: widgets.Output = None):
    """
    Saves the draft report and prints the dashboard summary.
    Outputs progress to the specified widget if provided, otherwise to stdout.
    """
    # widgets.Output is a context manager, so a single `with` block covers both cases.
    with (output_widget if output_widget is not None else nullcontext()):
        print("Starting output generation and saving...")

        # Save report to markdown
        if draft_report_content and "LLM not available for report generation." not in draft_report_content:
            try:
                # Explicit UTF-8 so non-ASCII characters in the LLM output are written reliably
                with open(report_filename, "w", encoding="utf-8") as f:
                    f.write(draft_report_content)
                print(f"Draft report saved to {report_filename}")
            except Exception as e:
                print(f"Error saving report to {report_filename}: {e}")
        else:
            print("No report content to save.")

        # Print dashboard summary
        print("\n--- AI Audit Assistant Dashboard Summary ---")
        print(dashboard_summary)
        print("------------------------------------------")
        print("Output generation finished.")


def main_workflow(directory_path: str, progress_output: widgets.Output, summary_output: widgets.Output, mapping_output_widget: widgets.Output, report_output: widgets.Output):
    """
    Orchestrates the entire AI Audit Assistant workflow, outputting to widgets.

    Args:
        directory_path: The path to the directory containing the audit documents.
        progress_output: Widget for displaying workflow progress.
        summary_output: Widget for displaying evidence summaries.
        mapping_output_widget: Widget for displaying risk/control mapping results.
        report_output: Widget for displaying the draft report.

    Returns:
        A tuple containing:
            - control_summaries (str): The generated evidence summaries.
            - mapping_output (str): The risk/control mapping results.
            - draft_report_content (str): The generated draft report content.
    """
    with progress_output:
        print("--- Starting AI Audit Assistant Workflow ---")

    # --- Configuration ---
    llm_model_name = "gpt-4o-mini"
    temperature = 0.2
    report_filename = "audit_report.md"

    # --- LLM Instantiation (with error handling) ---
    llm = None
    if "OPENAI_API_KEY" not in os.environ:
        with progress_output:
            print("Warning: OPENAI_API_KEY environment variable not set.")
            print("LLM-based steps will be skipped.")
    else:
        try:
            llm = ChatOpenAI(model=llm_model_name, temperature=temperature)
            with progress_output:
                print(f"LLM ({llm_model_name}) instantiated successfully.")
        except Exception as e:
            with progress_output:
                print(f"Error instantiating LLM: {e}")
                print("LLM-based steps will be skipped.")
            llm = None


    # --- Stage 1: Data Ingestion ---
    documents = load_documents_from_directory(directory_path, progress_output)
    num_documents_processed = len(documents)


    # --- Stage 2: Document Processing, Chunking, and Embedding ---
    document_chunks, vector_store = process_and_embed_documents(documents, progress_output)
    num_chunks_created = len(document_chunks) if document_chunks is not None else 0


    # --- Stage 3: Evidence Summarization ---
    control_summaries = summarize_evidence(document_chunks, llm, summary_output)


    # --- Stage 4: Risk/Control Mapping ---
    mapping_output = map_risk_control(control_summaries, llm, mapping_output_widget)


    # --- Stage 5: Draft Report Creation ---
    draft_report_content = generate_draft_report(control_summaries, mapping_output, llm, report_output)


    # --- Stage 6: Output Generation ---
    # Truncate the mapping output to a 500-character preview for the dashboard
    mapping_preview = mapping_output or "N/A"
    if len(mapping_preview) > 500:
        mapping_preview = mapping_preview[:500] + "..."
    dashboard_summary = f"""
Number of documents processed: {num_documents_processed}
Number of document chunks created: {num_chunks_created}
High-level summary from mapping: {mapping_preview}
"""
    save_output(draft_report_content, dashboard_summary, report_filename, progress_output)

    with progress_output:
        print("--- AI Audit Assistant Workflow Finished ---")

    return control_summaries, mapping_output, draft_report_content


# --- Interactive Interface Setup ---

# Define the directory for uploaded files
upload_dir = "/content/uploaded_audit_documents"
os.makedirs(upload_dir, exist_ok=True)

# Create a FileUpload widget for users to upload documents.
uploader = widgets.FileUpload(
    accept='.pdf,.docx,.xlsx',  # Accepted file extensions
    multiple=True  # Allow multiple file uploads
)
print("Please upload your audit documents using the uploader below.")

# Create Output widgets to display workflow progress, summaries, mapping results, and the final report.
progress_output = widgets.Output()
summary_output = widgets.Output()
mapping_output_widget = widgets.Output()
report_output = widgets.Output()

# Create a button widget to trigger the workflow execution.
run_button = widgets.Button(description="Run Audit Workflow")

# Define the event handler for the button click
def on_button_click(b):
    # Clear all output widgets at the start of a new run
    progress_output.clear_output()
    summary_output.clear_output()
    mapping_output_widget.clear_output()
    report_output.clear_output()

    with progress_output:
        print("Starting workflow...")
        # Save uploaded files
        uploaded_files_count = 0
        # Clear the upload directory before saving new files
        for filename in os.listdir(upload_dir):
            file_path = os.path.join(upload_dir, filename)
            try:
                if os.path.isfile(file_path):
                    os.remove(file_path)
            except Exception as e:
                print(f"Error cleaning upload directory: {e}")


        for name, content_dict in uploader.value.items():
            file_path = os.path.join(upload_dir, name)
            try:
                with open(file_path, 'wb') as f:
                    f.write(content_dict['content'])
                print(f"Uploaded: {name}")
                uploaded_files_count += 1
            except Exception as e:
                print(f"Error saving {name}: {e}")


        if uploaded_files_count == 0:
            print("No files uploaded or saved successfully. Please upload documents to proceed.")
            return

        print("File upload complete.")

    # --- Execute the main workflow ---
    try:
        # Call the main workflow function and pass the output widgets
        # Receive the returned values
        control_summaries, mapping_output_result, draft_report_content_result = main_workflow(
            upload_dir,
            progress_output,
            summary_output,
            mapping_output_widget,
            report_output
        )

        # --- Display results in respective output widgets ---
        with summary_output:
            summary_output.clear_output() # Clear before displaying the actual summary
            print("\n--- Evidence Summaries ---")
            print(control_summaries)

        with mapping_output_widget:
            mapping_output_widget.clear_output() # Clear before displaying the actual mapping
            print("\n--- Risk/Control Mapping Results ---")
            print(mapping_output_result)

        with report_output:
            report_output.clear_output() # Clear before displaying the actual report
            print("\n--- Draft Audit Report ---")
            print(draft_report_content_result) # Display the content returned by the workflow

    except NameError as e:
        with progress_output:
            # A NameError can come from main_workflow itself or from any helper it calls
            print(f"Error: a required name is not defined ({e}). Please ensure the previous integration step was run successfully.")
    except Exception as e:
        with progress_output:
            print(f"An error occurred during workflow execution: {e}")

    with progress_output:
        print("Workflow execution completed.")


run_button.on_click(on_button_click)


# Arrange these widgets using layout containers (e.g., VBox, HBox) for a clear user interface.
ui = widgets.VBox([
    widgets.Label("Upload Audit Documents:"),
    uploader,
    run_button,
    widgets.Label("Workflow Progress:"),
    progress_output,
    widgets.Label("Evidence Summaries:"),
    summary_output,
    widgets.Label("Risk/Control Mapping Results:"),
    mapping_output_widget,
    widgets.Label("Draft Audit Report:"),
    report_output
])

# Display the created interface using display().
display(ui)

Please upload your audit documents using the uploader below.


VBox(children=(Label(value='Upload Audit Documents:'), FileUpload(value={}, accept='.pdf,.docx,.xlsx', descrip…
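
The handler above reads `uploader.value` as a dict keyed by filename, which matches the `FileUpload(value={}, ...)` shown in the output (ipywidgets 7.x). In ipywidgets 8.x the same attribute is a tuple of dict-like entries with `name` and `content` keys. The helper below is a minimal sketch that normalizes both shapes; the name `iter_uploaded_files` is hypothetical and is not part of the notebook or of ipywidgets.

```python
def iter_uploaded_files(value):
    """Yield (filename, bytes) pairs from a FileUpload.value in either format."""
    if isinstance(value, dict):
        # ipywidgets 7.x: {filename: {"metadata": {...}, "content": b"..."}}
        for name, entry in value.items():
            yield name, bytes(entry["content"])
    else:
        # ipywidgets 8.x: a tuple of entries with "name" and "content" (a memoryview)
        for entry in value:
            yield entry["name"], bytes(entry["content"])


# Example with hand-built values standing in for uploader.value
v7_value = {"policy.pdf": {"metadata": {}, "content": b"%PDF"}}
v8_value = ({"name": "controls.xlsx", "content": memoryview(b"PK")},)
for filename, data in list(iter_uploaded_files(v7_value)) + list(iter_uploaded_files(v8_value)):
    print(filename, len(data))
```

Inside the click handler, the save loop could then iterate `iter_uploaded_files(uploader.value)` instead of `uploader.value.items()`, assuming the rest of the handler stays unchanged.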

## Refine the user experience

### Subtask:
Improve the layout, add progress indicators, and ensure smooth interaction within the Colab environment.


**Reasoning**:
Implement the remaining instructions to enhance the layout, add a progress label, disable/enable the button, clear outputs, add upload confirmation, and improve error handling within the existing interactive interface code.



In [22]:
import ipywidgets as widgets
from IPython.display import display
import os

# Assuming main_workflow, load_documents_from_directory, process_and_embed_documents,
# summarize_evidence, map_risk_control, generate_draft_report, and save_output
# functions are defined in previous cells and are accessible.

# Ensure the OPENAI_API_KEY environment variable is set
# Replace 'YOUR_OPENAI_API_KEY' with your actual key or set it as an environment variable
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

# Define the directory for uploaded files
upload_dir = "/content/uploaded_audit_documents"
os.makedirs(upload_dir, exist_ok=True)

# Create a FileUpload widget for users to upload documents.
uploader = widgets.FileUpload(
    accept='.pdf,.docx,.xlsx',  # Accepted file extensions
    multiple=True  # Allow multiple file uploads
)

# Create Output widgets to display workflow progress, summaries, mapping results, and the final report.
progress_output = widgets.Output()
summary_output = widgets.Output()
mapping_output_widget = widgets.Output()
report_output = widgets.Output()

# Add a label for explicit progress indication
progress_label = widgets.Label("Ready to upload documents.")

# Create a button widget to trigger the workflow execution.
run_button = widgets.Button(description="Run Audit Workflow")

# Define the event handler for the button click
def on_button_click(b):
    # Clear previous outputs in all relevant output widgets
    progress_output.clear_output()
    summary_output.clear_output()
    mapping_output_widget.clear_output()
    report_output.clear_output()

    # Disable the button to indicate processing
    run_button.disabled = True
    progress_label.value = "Workflow started..."


    try:
        with progress_output:
            print("Starting workflow...")
            # Save uploaded files
            uploaded_files_count = 0
            # Clear the upload directory before saving new files
            for filename in os.listdir(upload_dir):
                file_path = os.path.join(upload_dir, filename)
                try:
                    if os.path.isfile(file_path):
                        os.remove(file_path)
                except Exception as e:
                    print(f"Error cleaning upload directory: {e}")

            if not uploader.value:
                print("No files selected for upload. Please select documents to proceed.")
                progress_label.value = "No files selected."
                return # Stop if no files are selected

            for name, content_dict in uploader.value.items():
                file_path = os.path.join(upload_dir, name)
                try:
                    with open(file_path, 'wb') as f:
                        f.write(content_dict['content'])
                    print(f"Uploaded: {name}")
                    uploaded_files_count += 1
                except Exception as e:
                    print(f"Error saving {name}: {e}")


            if uploaded_files_count == 0:
                print("No files uploaded or saved successfully. Please upload documents to proceed.")
                progress_label.value = "Upload failed or no files saved."
                return # Stop if no files were saved

            # Confirm the upload and report how many files were saved
            print(f"File upload complete. {uploaded_files_count} file(s) saved to {upload_dir}.")
            progress_label.value = "File upload complete. Starting processing..."


        # --- Execute the main workflow ---
        try:
            # Call the main workflow function and pass the output widgets
            # Receive the returned values
            control_summaries, mapping_output_result, draft_report_content_result = main_workflow(
                upload_dir,
                progress_output,
                summary_output,
                mapping_output_widget,
                report_output
            )

            # --- Display results in respective output widgets ---
            with summary_output:
                summary_output.clear_output() # Clear before displaying the actual summary
                print("\n--- Evidence Summaries ---")
                print(control_summaries)

            with mapping_output_widget:
                mapping_output_widget.clear_output() # Clear before displaying the actual mapping
                print("\n--- Risk/Control Mapping Results ---")
                print(mapping_output_result)

            with report_output:
                report_output.clear_output() # Clear before displaying the actual report
                print("\n--- Draft Audit Report ---")
                print(draft_report_content_result) # Display the content returned by the workflow

        except NameError as e:
            with progress_output:
                # A NameError can come from main_workflow itself or from any helper it calls
                print(f"Error: a required name is not defined ({e}). Please ensure the previous integration step was run successfully.")
        # Report any other failure in the progress area instead of raising
        except Exception as e:
            with progress_output:
                print(f"An error occurred during workflow execution: {e}")

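    finally:
        # Re-enable the button. Keep an early-exit status such as "No files selected."
        # rather than overwriting it with a completion message.
        run_button.disabled = False
        if progress_label.value in ("Workflow started...", "File upload complete. Starting processing..."):
            progress_label.value = "Workflow execution completed."
        with progress_output:
            print("Workflow execution completed.")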


run_button.on_click(on_button_click)


# Arrange the widgets using layout containers (VBox) for a clear user interface.

# Input section layout
input_layout = widgets.VBox([
    widgets.Label("Upload Audit Documents:"),
    uploader,
    run_button,
    progress_label # Add the progress label here
])

# Output sections layout
output_layout = widgets.VBox([
    widgets.Label("Workflow Progress:"),
    progress_output,
    widgets.Label("Evidence Summaries:"),
    summary_output,
    widgets.Label("Risk/Control Mapping Results:"),
    mapping_output_widget,
    widgets.Label("Draft Audit Report:"),
    report_output
])

# Combine input and output sections
ui = widgets.VBox([
    input_layout,
    output_layout
])

# Display the created interface using display().
display(ui)

VBox(children=(VBox(children=(Label(value='Upload Audit Documents:'), FileUpload(value={}, accept='.pdf,.docx,…

## Summary:

### Data Analysis Key Findings

*   The interactive interface for the AI Audit Assistant workflow was successfully built using `ipywidgets` in Google Colab.
*   The interface includes file upload functionality, a run button, and dedicated output areas for workflow progress, evidence summaries, risk/control mapping results, and the draft audit report.
*   The `main_workflow` function was modified to accept the path to the uploaded documents directory as input and to direct its outputs (progress messages, summaries, mapping results, report content) to the corresponding interactive widgets.
*   The interface includes improvements for user experience such as a progress label, button state management (disabling during execution), clearing previous outputs before each run, explicit file upload confirmation, and error handling displayed within the progress area.
*   LLM-dependent steps check for the `OPENAI_API_KEY` environment variable and are skipped if the key is not set or LLM instantiation fails.
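
The button state management described above (disable on start, re-enable on exit, final status in the label) can be factored into a small context manager. This is a sketch under assumptions, not the notebook's implementation: `busy` is a hypothetical name, and it only relies on the `disabled` attribute of the button and the `value` attribute of the label, so it is exercised here with stand-in objects instead of real widgets.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def busy(button, label, start_msg="Workflow started...",
         done_msg="Workflow execution completed."):
    """Disable `button` while the body runs and write the final status to `label`."""
    button.disabled = True
    label.value = start_msg
    status = SimpleNamespace(message=done_msg)  # the body may overwrite status.message
    try:
        yield status
    except Exception as exc:
        # Surface the failure in the label instead of propagating it
        status.message = f"Workflow failed: {exc}"
    finally:
        button.disabled = False
        label.value = status.message


# Stand-ins for widgets.Button and widgets.Label
button = SimpleNamespace(disabled=False)
label = SimpleNamespace(value="Ready to upload documents.")

with busy(button, label) as status:
    files = {}  # pretend nothing was uploaded
    if not files:
        status.message = "No files selected."

print(button.disabled, label.value)
```

With this shape, an early exit sets `status.message` before leaving the block, so the label keeps the specific reason rather than a generic completion message.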

### Insights or Next Steps

*   Refine the `main_workflow` function to return structured data for summaries, mapping, and the report instead of output that is only printed, allowing for more granular display and manipulation within the interface widgets.
*   Implement a visual progress bar or step-by-step indicator within the `progress_output` widget to provide more detailed feedback on the current workflow stage (e.g., Loading Documents, Processing, Summarizing, Mapping, Reporting).
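
Both next steps can be sketched together: a result container that the workflow returns, and a stage runner that reports which step is active. Everything here is an assumption for illustration. `WorkflowResult`, its field names and types, and `run_stages` are hypothetical, and the stage functions are placeholders rather than the notebook's real `load_documents_from_directory` and related helpers.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class WorkflowResult:
    """Hypothetical structured return value for main_workflow."""
    num_documents: int = 0
    num_chunks: int = 0
    control_summaries: Dict[str, str] = field(default_factory=dict)
    mapping: Optional[str] = None
    draft_report: Optional[str] = None


def run_stages(stages: List[Tuple[str, Callable[[], None]]],
               report: Callable[[int, int, str], None]) -> None:
    """Run each (name, function) stage in order, reporting progress before each one."""
    total = len(stages)
    for index, (name, stage_fn) in enumerate(stages, start=1):
        report(index - 1, total, f"Step {index}/{total}: {name}")
        stage_fn()
    report(total, total, "Done")


result = WorkflowResult()

def load():       # placeholder for the ingestion stage
    result.num_documents = 3

def process():    # placeholder for chunking and embedding
    result.num_chunks = 12

def show(done: int, total: int, message: str) -> None:
    # In the notebook this callback could update a progress widget and the status label
    print(f"[{done}/{total}] {message}")

run_stages([("Loading Documents", load), ("Processing", process)], show)
print(result)
```

In the interface, the `report` callback could set the `value` and `max` of an `ipywidgets.IntProgress` bar and update `progress_label.value`, while the handler reads fields from the returned `WorkflowResult` instead of unpacking a tuple.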
