# Generate documentation using Amazon Q for Business and evaluate it against existing documentation

In this notebook we use Amazon Q for Business to generate documentation for the files of a code repository and ingest it into an Amazon Q index. We then clone a documentation repository, iterate through its files, ask Amazon Q about the functionality each file documents, and compare the AI-generated documentation with the human-written documentation.

In [84]:
%pip install boto3 --upgrade
%pip install GitPython python-dotenv

Collecting boto3
  Downloading boto3-1.34.72-py3-none-any.whl.metadata (6.6 kB)
Collecting botocore<1.35.0,>=1.34.72 (from boto3)
  Downloading botocore-1.34.72-py3-none-any.whl.metadata (5.7 kB)
Downloading boto3-1.34.72-py3-none-any.whl (139 kB)
Downloading botocore-1.34.72-py3-none-any.whl (12.0 MB)
Installing collected packages: botocore, boto3
  Attempting uninstall: botocore
    Found existing installation: botocore 1.34.71
    Uninstalling botocore-1.34.71:
      Successfully uninstalled botocore-1.34.71
  Attempting uninstall: boto3
    Found existing installation: boto3 1.34.71
    Uninstalling boto3-1.34.71:
      Successfully uninstalled boto3-1.34.71
ERROR: pip's dependency resolver does not currently ta

In [39]:
import os
import boto3
import time
import json
import datetime
import requests
import re
import logging
import shutil
import uuid
import git
from dotenv import load_dotenv

In [86]:
%%writefile .env
TOKEN=<your-github-personal-access-token>
DOC_REPO_URL=https://github.com/WStobieniecka/test.git
CODE_REPO_URL=https://github.com/WStobieniecka/amazon-q-use-case-1.git
DOC_REPO_SUBDIR=untitled_folder
FILE_DOC_SUFFIX=API
USERNAME=WStobieniecka

Overwriting .env


In [88]:
load_dotenv(override=True)

True

In [89]:
# GitHub authentication token
token = os.environ['TOKEN']

# Repository owner, name etc
# this will be env vars in Lambda
# owner = os.environ['OWNER']
code_repo = os.environ['CODE_REPO_URL']
doc_repo = os.environ['DOC_REPO_URL']
doc_repo_subdir = os.environ['DOC_REPO_SUBDIR']
suffix = os.environ['FILE_DOC_SUFFIX']
username = os.environ['USERNAME']
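
If a variable is missing from `.env`, the cell above fails with a bare `KeyError` on the first missing name. A minimal sketch of an up-front check follows; `require_env` and `REQUIRED_VARS` are hypothetical names that are not part of this notebook, and the variable list simply mirrors the cell above.

```python
import os

# Names read by the configuration cell above
REQUIRED_VARS = ["TOKEN", "CODE_REPO_URL", "DOC_REPO_URL",
                 "DOC_REPO_SUBDIR", "FILE_DOC_SUFFIX", "USERNAME"]


def require_env(names, env=None):
    """Return {name: value} for every name, or raise listing all missing or empty ones."""
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}
```

Calling `config = require_env(REQUIRED_VARS)` right after `load_dotenv` would report every missing setting at once instead of one per run.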

In [52]:
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S',
    handlers=[
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

## Creating the Amazon Q Application

In the steps below we will create an Amazon Q application that will be used to process and then answer questions about the code of a repository.

In [6]:
amazon_q_user_id = "splwis@amazon.pl"
role_arn = None
amazon_q_app_id = None
reuse_existing_q_app = False

If you want to reuse the application, uncomment the cell below, fill in the values, and run it.

In [7]:
# amazon_q_app_id = "c717d18f-9b37-45cf-9808-8bdc75c09663"
# role_arn = "arn:aws:iam::760804086109:role/QBusiness-Application-Code-Analysis-Demo-App-2024-03-14-11-06-48"
# reuse_existing_q_app = True

Next we create the IAM role that will be used by the Amazon Q application to access the repository.

In [8]:
# Create the Q Business IAM service role from iam-policy.json and trust-policy.json
if not role_arn:

    # Create IAM role
    project_name = f"Doc-Eval-Demo-App-{datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
    iam = boto3.client('iam')
    # Note: to work with Q Business, the role name MUST start with "QBusiness-Application-"
    role_name = f"QBusiness-Application-{project_name}"
    role_policy_file = "../security/iam-policy.json"
    trust_policy_file = "../security/trust-policy.json"

    # Load the permissions policy and the trust policy
    with open(role_policy_file) as f:
        role_policy = json.load(f)
    with open(trust_policy_file) as f:
        trust_policy = json.load(f)

    # Create the role with the trust policy
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(trust_policy)
    )
    role_arn = role['Role']['Arn']

    # Attach the permissions policy inline (the policy name is an arbitrary choice)
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=f"{role_name}-policy",
        PolicyDocument=json.dumps(role_policy)
    )

2024-03-27 21:04:17 INFO Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
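
The naming rule mentioned in the role cell can be checked before calling IAM. The sketch below assumes a hypothetical helper `build_role_name`; the prefix is the one from the code comment above, and the 64-character cap is IAM's limit on role names.

```python
import datetime

ROLE_PREFIX = "QBusiness-Application-"  # prefix required by the comment in the role cell
IAM_ROLE_NAME_MAX = 64                  # IAM role names are limited to 64 characters


def build_role_name(project_name):
    """Prefix the project name and fail early if IAM would reject the length."""
    role_name = f"{ROLE_PREFIX}{project_name}"
    if len(role_name) > IAM_ROLE_NAME_MAX:
        raise ValueError(
            f"Role name is {len(role_name)} characters; IAM allows at most {IAM_ROLE_NAME_MAX}"
        )
    return role_name


stamp = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
print(build_role_name(f"Doc-Eval-Demo-App-{stamp}"))
```

With the timestamped project name used in this notebook the result stays under the limit, but a longer project prefix would not.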


Now we create the Amazon Q application, passing the role created above. Note that we enable attachments so that
we can later send every code file in the repository to the application.

In [9]:
amazon_q = boto3.client('qbusiness')

if not reuse_existing_q_app:
    response = amazon_q.create_application(
        attachmentsConfiguration={
            'attachmentsControlMode': 'ENABLED'
        },
        description=f"{project_name}-{datetime.datetime.now().strftime('%Y-%m-%d')}",
        displayName=project_name,
        roleArn=role_arn,
    )
    amazon_q_app_id = response["applicationId"]

After creating the application, we will create an index that will be used to store the information about the code of the repository.

In [10]:
if not reuse_existing_q_app:
    response = amazon_q.create_index(
        applicationId=amazon_q_app_id,
        capacityConfiguration={
            'units': 1
        },
        description=f"{project_name}-{datetime.datetime.now().strftime('%Y-%m-%d')}",
        displayName=project_name,
    )
    index_id = response["indexId"]
else:
    response = amazon_q.list_indices(
        applicationId=amazon_q_app_id
    )
    index_id = response["indices"][0]['indexId']

Lastly we will create a retriever to fetch the relevant information when we ask questions about the repository.

In [11]:
if not reuse_existing_q_app:
    response = amazon_q.create_retriever(
        applicationId=amazon_q_app_id,
        configuration={
            'nativeIndexConfiguration': {
                'indexId': index_id
            }
        },
        displayName=project_name,
        roleArn=role_arn,
        type='NATIVE_INDEX'
    )
    retriever_id = response["retrieverId"]
else:
    retriever_id = amazon_q.list_retrievers(
        maxResults=1,
        applicationId=amazon_q_app_id,
    )["retrievers"][0]["retrieverId"]

In [12]:
while True:
    response = amazon_q.get_index(
        applicationId=amazon_q_app_id,
        indexId=index_id,
    )
    status = response.get('status')
    print(f"Create index status {status}")
    if status == 'ACTIVE':
        break
    time.sleep(10)

Create index status CREATING
Create index status CREATING
Create index status CREATING
Create index status CREATING
Create index status CREATING
Create index status CREATING
Create index status CREATING
Create index status CREATING
Create index status CREATING
Create index status CREATING
Create index status CREATING
Create index status CREATING
Create index status CREATING
Create index status ACTIVE
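
The polling loop above runs until the index becomes `ACTIVE`, so it never ends if creation fails. A minimal sketch of a bounded variant follows; `wait_for_status` is a hypothetical helper, and treating `FAILED` as a terminal status is an assumption about the index lifecycle rather than something this notebook shows.

```python
import time


def wait_for_status(get_status, target="ACTIVE", failed=("FAILED",),
                    timeout_s=1800, poll_s=10, sleep=time.sleep):
    """Poll get_status() until it returns `target`; raise on a failed status or timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == target:
            return status
        if status in failed:
            raise RuntimeError(f"Resource entered status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {status} after {timeout_s} seconds")
        sleep(poll_s)
```

In this notebook the status getter could be `lambda: amazon_q.get_index(applicationId=amazon_q_app_id, indexId=index_id).get('status')`.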


## Generating and Ingesting Documentation
If we only ingest the code, we will be retrieving random code chunks that may not be relevant to the questions we want to ask. To avoid this, we will generate concise documentation for the repository and ingest it into the index.

In [31]:
def include_code_file_type(filename):
    """Return True for non-hidden files that are not images, archives, or Markdown."""
    excluded = ('.png', '.jpg', '.jpeg', '.gif', '.zip', '.md')
    return not filename.endswith(excluded) and not filename.startswith('.')


def include_doc_file_type(filename):
    """Return True for non-hidden Markdown files."""
    return filename.endswith('.md') and not filename.startswith('.')


def check_if_missing_response(answer):
    """Return True when Amazon Q reports that it found nothing relevant."""
    return answer == "Sorry, I could not find relevant information to complete your request."

In [29]:
def clone_repo(repo, local_path):
    _, repo_url = repo.split("https://")
    token_url = f"https://{username}:{token}@{repo_url}"
    git.Repo.clone_from(token_url, local_path)
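
`clone_repo` splices the credentials into the URL by splitting on `https://`, which leaves the token visible in anything that later prints that URL. The sketch below does the same job with `urllib.parse`; both helper names are hypothetical and not part of this notebook.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def with_credentials(repo_url, username, token):
    """Return the HTTPS URL with percent-encoded credentials embedded."""
    parts = urlsplit(repo_url)
    if parts.scheme != "https":
        raise ValueError(f"Expected an https:// URL, got: {repo_url}")
    netloc = f"{quote(username, safe='')}:{quote(token, safe='')}@{parts.hostname}"
    if parts.port:
        netloc += f":{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


def mask_credentials(url):
    """Replace any embedded credentials with *** so the URL is safe to log."""
    parts = urlsplit(url)
    if parts.username is None:
        return url
    host = parts.hostname + (f":{parts.port}" if parts.port else "")
    return urlunsplit((parts.scheme, f"***@{host}", parts.path, parts.query, parts.fragment))
```

Passing `with_credentials(repo, username, token)` to `git.Repo.clone_from` and logging only `mask_credentials(...)` keeps the token out of the notebook output.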

In [35]:
def ask_question_to_gen_code_doc(prompt, filename):
    # Read the file as bytes and send it to Amazon Q as an attachment
    with open(filename, 'rb') as f:
        data = f.read()
    answer = amazon_q.chat_sync(
        applicationId=amazon_q_app_id,
        userId=amazon_q_user_id,
        userMessage=prompt,
        attachments=[
            {
                'data': data,
                'name': filename
            },
        ],
    )
    return answer['systemMessage']

In [36]:
def ask_question_to_gen_doc_summary(prompt):
    answer = amazon_q.chat_sync(
        applicationId=amazon_q_app_id,
        userId=amazon_q_user_id,
        userMessage=prompt,
    )
    return answer['systemMessage']

In [37]:
def ask_question_to_compare_docs(prompt, human_gen_doc, ai_gen_doc):
    answer = amazon_q.chat_sync(
        applicationId=amazon_q_app_id,
        userId=amazon_q_user_id,
        userMessage=prompt,
        attachments=[
            {
                'data': human_gen_doc,
                'name': "Human-Generated-Summary"
            },
            {
                'data': ai_gen_doc,
                'name': "AI-Generated-Summary"
            },
        ],
    )
    return answer['systemMessage']

In [18]:
def upload_prompt_answer_and_file_name(filename, prompt, answer, repo_url):
    cleaned_file_name = os.path.join(repo_url[:-4], '/'.join(filename.split('/')[2:]))
    amazon_q.batch_put_document(
        applicationId=amazon_q_app_id,
        indexId=index_id,
        roleArn=role_arn,
        documents=[
            {
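                # Include the prompt in the ID so each prompt/answer pair is stored as its own
                # document instead of replacing the previous one for the same file
                'id': str(uuid.uuid5(uuid.NAMESPACE_URL, f"{cleaned_file_name}|{prompt}")),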
                'contentType': 'PLAIN_TEXT',
                'title': cleaned_file_name,
                'content': {
                    'blob': f"{cleaned_file_name} | {prompt} | {answer}".encode('utf-8')
                },
                'attributes': [
                    {
                        'name': 'url',
                        'value': {
                            'stringValue': cleaned_file_name
                        }
                    }
                ]
            },
        ]
    )
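
Document IDs in the upload function come from `uuid.uuid5`, which is deterministic: the same name always yields the same UUID, so uploading a document again reuses its ID rather than creating a new one. A small sketch of that property follows, using a hypothetical helper `document_id` and made-up example paths.

```python
import uuid


def document_id(path, prompt=""):
    """Deterministic ID: the same (path, prompt) pair always maps to the same UUID."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{path}|{prompt}"))


a = document_id("https://github.com/owner/repo/app.py", "Describe the file")
b = document_id("https://github.com/owner/repo/app.py", "Describe the file")
c = document_id("https://github.com/owner/repo/app.py", "List anti-patterns")
print(a == b, a == c)  # True False
```

Whatever goes into the name decides which uploads share an ID, so it should contain everything that makes a document distinct.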

In [19]:
def save_answers(answer, filepath, folder):
    filepath = f"{filepath}.out"
    with open(f"{folder}/{filepath}", "w") as f:
        f.write(answer)

In [20]:
file_doc_gen_prompts = [
    """Come up with a list of questions and answers about the attached file. 
    Keep answers dense with information. A good question for a database related file would
    be 'What is the database technology and architecture?' or for a file that executes SQL commands
    'What are the SQL commands and what do they do?' or for a file that contains a list of API endpoints
    'What are the API endpoints and what do they do?'""",

    """Generate comprehensive documentation about the attached file. Make sure you include
    what dependencies and other files are being referenced as well as function names, class names, and what they do.""",

    """Identify anti-patterns in the attached file. Make sure to include examples of how to fix them. Try Q&A like 
    'What are some anti-patterns in the file?' or 'What could be causing high latency?'""",

    """Suggest improvements to the attached file. Try Q&A like 'What are some ways to improve the file?'
    or 'Where can the file be optimized?'""",
    
    """Please provide a description of the attached file.
    Summarize the intent, resources and capabilities in separate sections.""",

    """Please describe each API method. Then list its inputs and outputs in a table.
    Include a sample invocation.""",
]

In [21]:
file_doc_summary_gen_prompt = """Please provide a description of the <functionality> API in <repo_name>. Include the intent within <intent></intent> tags,
    the resources within <resources></resources> tags, and the capabilities within <capabilities></capabilities> tags in subsequent paragraphs.
    In the capabilities section, explain every API method: its purpose, inputs, and outputs.
    Please include as much detail as you can. This is important."""
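
`<functionality>` and `<repo_name>` in the prompt above are placeholders that get substituted for each documentation file. A minimal sketch of that substitution follows, assuming a hypothetical helper `fill_prompt`; the template and names in the example are illustrative only.

```python
def fill_prompt(template, functionality, repo_name):
    """Return a filled-in copy of the template; the template itself is not reassigned."""
    return (template
            .replace("<functionality>", functionality)
            .replace("<repo_name>", repo_name))


template = "Describe the <functionality> API in <repo_name>."
for name in ("Orders", "Billing"):
    # The same template is reused on every iteration
    print(fill_prompt(template, name, "example-repo"))
```

Returning a new string each time means the placeholders are still present when the next file is processed.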

In [22]:
file_doc_eval_prompt = """<instruction>Evaluate the AI-Generated-Summary against the Human-Generated-Summary on a 10-point scale. Justify your score. Write your score inside score tags: <score></score>
Then write your explanation inside explanation tags: <explanation></explanation> </instruction>
"""
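
The evaluation prompt asks for the score and the explanation inside tags, which makes the response machine-readable. A minimal sketch of extracting both follows, assuming a hypothetical helper `parse_evaluation`; because the model is not guaranteed to follow the format, missing tags yield `None` instead of an error.

```python
import re


def parse_evaluation(response):
    """Return (score, explanation); either is None when its tag is absent or malformed."""
    # Take the first number after <score>, so "7" and "7/10" both give 7.0
    score_match = re.search(r"<score>\s*(\d+(?:\.\d+)?)", response)
    expl_match = re.search(r"<explanation>(.*?)</explanation>", response, re.DOTALL)
    score = float(score_match.group(1)) if score_match else None
    explanation = expl_match.group(1).strip() if expl_match else None
    return score, explanation
```

The parsed score could be stored next to the saved evaluation text so results from several files can be compared numerically.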

In [23]:
# def generate_file_doc(prompts, file_str, prefix_path, file_path):
#     answers = ""
#     for idx, prompt in enumerate(prompts):
#         answer = ask_question_with_attachment(prompt, file_path, file_str)
#         answers = answers + f"{idx+1}. {prompt}:\n\n{answer}\n\n"
#         answer = ask_question_with_attachment(prompt, file_path, file_str)
#         if not check_if_missing_response(answer):
#             upload_prompt_answer_and_file_name(file_path, prompt, answer,
#                                                repo, prompt_type)
#             answers = answers + f"{idx+1}. {prompt}:\n\n{answer}\n\n"
#     save_answers(answers, file_path, f"documentation/{prefix_path}")
#     return None

In [24]:
def should_ignore_path(path):
    path_components = path.split(os.sep)
    for component in path_components:
        if component.startswith('.'):
            return True
        elif component == 'node_modules':
            return True
        elif component == '__pycache__':
            return True
    return False

In [25]:
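def parse_repo_url(repo_url):
    """Extract (owner, repo_name) from a GitHub URL, dropping a trailing '.git' if present."""
    # Match the pattern github.com/owner/repo
    match = re.search(r'github\.com/([^/]+)/([^/]+)', repo_url)
    if not match:
        raise ValueError(f"Invalid repo URL: {repo_url}")

    owner = match.group(1)
    repo_name = match.group(2)
    if repo_name.endswith('.git'):
        repo_name = repo_name[:-4]
    return owner, repo_name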

In [32]:
def process_code_repo_files(repo_url, prompts):
    # Temporary clone location
    repo_owner, repo_name = parse_repo_url(repo_url)
    tmp_dir = f"tmp/code/{datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
    destination_folder = 'repositories/code'

    if not os.path.exists(destination_folder):
        os.makedirs(destination_folder)

    # Clone the repository
    logger.info(f"Cloning repository... {repo_url}")
    clone_repo(repo_url, tmp_dir)
    logger.info(f"Finished cloning repository {repo_url}")
    
    # Copy all files to destination folder
    for src_dir, dirs, files in os.walk(tmp_dir):
        dst_dir = src_dir.replace(tmp_dir, destination_folder)
        if not os.path.exists(dst_dir):
            os.mkdir(dst_dir)
        for file_ in files:
            src_file = os.path.join(src_dir, file_)
            dst_file = os.path.join(dst_dir, file_)
            if os.path.exists(dst_file):
                os.remove(dst_file)
            shutil.copy(src_file, dst_dir)

    # Delete temp clone
    shutil.rmtree(tmp_dir)
    processed_files = []
    failed_files = []
    
    logger.info(f"Processing code files in {destination_folder}")
    for root, dirs, files in os.walk(destination_folder):
        if not should_ignore_path(root):
            for file in files:
                if include_code_file_type(file):
                    
                    file_path = os.path.join(root, file)

                    for attempt in range(3):
                        try:
                            logger.info(f"\033[92mProcessing code file: {file_path}\033[0m")
                            for prompt in prompts:
                                answer = ask_question_to_gen_code_doc(prompt, file_path)
                                upload_prompt_answer_and_file_name(file_path, prompt, answer, repo_url) 
                            # Upload the file itself to the index
                            with open(file_path, 'r') as code:
                                upload_prompt_answer_and_file_name(file_path, "", code.read(), repo_url)
                            processed_files.append(file)
                            break
                        except Exception as e:
                            logger.error(f"Error: {e}")
                            time.sleep(15)
                    else:
                        logger.info(f"\033[93mSkipping file: {file_path}\033[0m")
                        failed_files.append(file_path)
    # Log the summary before returning so these lines are actually reached
    logger.info(f"Processed files: {processed_files}")
    logger.info(f"Failed files: {failed_files}")
    return repo_name
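
The function above retries each file up to three times with a `for`/`else` loop and a fixed 15-second pause. The same pattern can be factored out; this is a sketch with a hypothetical helper `retry`, not code the notebook uses.

```python
import time


def retry(fn, attempts=3, delay_s=15, sleep=time.sleep):
    """Call fn() up to `attempts` times; return (True, result) or (False, last_error)."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return True, fn()
        except Exception as exc:  # broad on purpose, mirroring the loop above
            last_error = exc
            if attempt < attempts:
                sleep(delay_s)  # pause only when another attempt follows
    return False, last_error
```

A caller would append the file to `processed_files` when the first element is `True` and to `failed_files` otherwise.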

In [94]:
def process_doc_repo_files(repo_url, doc_subdir, suffix_exc, gen_doc_summary_prompt, doc_eval_prompt, code_repo):
    # Temporary clone location
    repo_owner, repo_name = parse_repo_url(repo_url)
    tmp_dir = f"tmp/doc/{datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
    destination_folder = "repositories/doc"
    eval_folder = "repositories/eval_results"

    if not os.path.exists(destination_folder):
        os.makedirs(destination_folder)

    # Clone the repository
    logger.info(f"Cloning repository... {repo_url}")
    clone_repo(repo_url, tmp_dir)
    logger.info(f"Finished cloning repository {repo_url}")
    
    # Copy all files from a selected dir
    subdir = f"{tmp_dir}/{doc_subdir}"
    for src_dir, dirs, files in os.walk(subdir):
        for file_ in files:
            if include_doc_file_type(file_) and src_dir == subdir and suffix_exc in file_:
                src_file = os.path.join(src_dir, file_)
                dst_file = os.path.join(destination_folder, file_)
                if os.path.exists(dst_file):
                    os.remove(dst_file)
                shutil.copy(src_file, destination_folder)

    # Delete temp clone
    shutil.rmtree(tmp_dir)
    
    processed_files = []
    failed_files = []
    
    logger.info(f"Processing doc files in {destination_folder}")
    for file in os.listdir(destination_folder):
        file_path = os.path.join(destination_folder, file)
        filename = file.split(".")[0]
        functionality = filename.replace(suffix_exc, "")
        eval_path = f"{eval_folder}/{filename}"
        if not os.path.exists(eval_path):
            os.makedirs(eval_path)
        for attempt in range(3):
            try:
                logger.info(f"\033[92mProcessing doc file: {file}\033[0m")
                # Fill the placeholders in a local copy so the template stays reusable for the next file
                prompt = gen_doc_summary_prompt.replace("<functionality>", functionality)
                prompt = prompt.replace("<repo_name>", code_repo)
                ai_doc = ask_question_to_gen_doc_summary(prompt)
                with open(file_path, 'r') as f:
                    human_doc = f.read()
                eval_results = ask_question_to_compare_docs(doc_eval_prompt, human_doc, ai_doc)
                save_answers(ai_doc, "AI-Generated-Doc", eval_path)
                save_answers(human_doc, "Human-Generated-Doc", eval_path)
                save_answers(eval_results, "Evaluation", eval_path)
                shutil.move(file_path, eval_path)
                processed_files.append(file)
                break
            except Exception as e:
                logger.error(f"Error: {e}")
                time.sleep(15)
        else:
            logger.info(f"\033[93mSkipping file: {file_path}\033[0m")
            failed_files.append(file_path)

    logger.info(f"Processed files: {processed_files}")
    logger.info(f"Failed files: {failed_files}")

In [40]:
code_repo_name = process_code_repo_files(code_repo, file_doc_gen_prompts)

2024-03-27 21:17:10 INFO Cloning repository... https://github.com/WStobieniecka/amazon-q-use-case-1.git
2024-03-27 21:17:12 INFO Finished cloning repository https://github.com/WStobieniecka/amazon-q-use-case-1.git
2024-03-27 21:17:12 INFO Processing code files in repositories/code
2024-03-27 21:17:12 INFO Processing code file: repositories/code/LICENSE
2024-03-27 21:17:34 INFO Processing code file: repositories/code/assets/cloudformation.yml
2024-03-27 21:18:44 INFO Processing code file: repositories/code/assets/sample.py
2024-03-27 21:19:15 INFO Processing code file: repositories/code/assets/edit-prompts.sh
2024-03-27 21:19:25 INFO Processing code file: repositories/code/assets/create-webhook.sh
2024-03-27 21:19:36 INFO Processing code file: repositories/code/cdk/tsconfig.json
2024-03-27 21:20:01 INFO Processing code file: repositories/code/cdk/cdk.json
2024-03-27 21:20:30 INFO Processing code file: repositories/code/

In [96]:
process_doc_repo_files(doc_repo, doc_repo_subdir, suffix, file_doc_summary_gen_prompt, file_doc_eval_prompt, code_repo_name)

2024-03-28 09:43:08 INFO Cloning repository... https://github.com/WStobieniecka/test.git
2024-03-28 09:43:10 INFO Finished cloning repository https://github.com/WStobieniecka/test.git
2024-03-28 09:43:10 INFO Processing doc files in repositories/doc
2024-03-28 09:43:10 INFO Processing doc file: READMEAPI.md
2024-03-28 09:43:16 INFO Processed files: ['READMEAPI.md']
2024-03-28 09:43:16 INFO Failed files: []
