# Q Business by API
## Creating an Amazon Q Business instance, a Webcrawler Datasource, and debugging the datasource by JSON configuration

This notebook provides a comprehensive guide to programmatically setting up and managing Amazon Q Business with web crawler data sources using the AWS SDK for Python (boto3). Unlike the console experience, this API-driven approach gives you greater control and automation capabilities for your Q Business deployments.

What you'll learn
- How to create an Amazon Q Business instance with proper IAM roles and permissions
- Setting up a web crawler data source with customized configuration
- Adding advanced configurations like crawling delays that aren't available in the console
- Monitoring and troubleshooting data source synchronization
- Working with indexes and retrievers for effective document search
- Managing AWS session tokens for long-running operations

Why use this approach?
The API-driven approach demonstrated in this notebook offers several advantages:
- Automation: Easily replicate your Q Business setup across multiple environments
- Advanced Configuration: Access undocumented features like crawling delays through JSON configuration
- Troubleshooting: Gain deeper insights into synchronization issues and crawler behavior
- Comparison: Export configurations to compare different instances and identify discrepancies
- Support: Generate detailed configuration exports to share with AWS Support for troubleshooting

This notebook serves as both a practical implementation guide and a reference for common debugging scenarios when working with Amazon Q Business web crawlers.




## Prerequisites

1. AWS CLI configured with appropriate credentials
2. Required Python packages installed
3. Appropriate IAM permissions to create Q Business instances and IAM roles

In [None]:
# Install required packages if not already installed
!pip install boto3

import boto3
import json
import botocore
from datetime import datetime
import time
import random
import string

## Define some variables

Set these values to match your environment.


In [None]:


# Generate a random 5-character suffix
def generate_suffix(length=5):
    return ''.join(random.choices(string.ascii_lowercase + string.digits, k=length))


# Configuration parameters - Edit these values as needed
# AWS Region
region = "us-west-2"  # Replace with your desired region

# Role names
crawler_role_name = "QBusiness-DataSource-"

# Application name
instance_name = "my-q-business-application-"

# Add random suffix to names
suffix = generate_suffix()
crawler_role_name = f"{crawler_role_name}{suffix}"
instance_name = f"{instance_name}{suffix}"

print(f"Using crawler role name: {crawler_role_name}")
print(f"Using instance name: {instance_name}")


# User configuration
email = "john.doe@example.com"
first_name = "John"
family_name = "Doe"

# Webcrawler configuration
seed_urls = [
    "https://en.wikipedia.org/wiki/Amazon_Q"  # Replace with your target URLs
]
crawl_depth = 2
max_urls = 100
crawl_mode = "ALL"  # Options: ALL, HTML_ONLY, PDF_ONLY
exclude_patterns = []


## Validate credentials

Verify that your AWS credentials are configured. If they are not, configure them before continuing with this notebook.


In [None]:
# Check if AWS credentials are properly configured
try:
    # Try to get caller identity (lightweight AWS API call)
    sts = boto3.client('sts')
    identity = sts.get_caller_identity()
    
    print("✅ AWS credentials are properly configured!")
    print(f"Account ID: {identity['Account']}")
    print(f"User ID: {identity['UserId']}")
    print(f"ARN: {identity['Arn']}")
except Exception as e:
    print("❌ AWS credentials are not properly configured!")
    print(f"Error: {str(e)}")



## Create IAM Role for Q Business

This cell uses the AWS service-linked role for Q Business (`AWSServiceRoleForQBusiness`). In most accounts the role already exists; if it does not, the cell creates it. The Q Business instance uses this role to access AWS resources.

In [None]:
# Refresh session to ensure credentials are valid
session = boto3.Session(region_name=region)
iam = session.client('iam')

# Use the managed service-linked role for Amazon Q Business application
app_role_arn = f"arn:aws:iam::{identity['Account']}:role/aws-service-role/qbusiness.amazonaws.com/AWSServiceRoleForQBusiness"

# Check if the service-linked role exists, create it if it doesn't
try:
    iam.get_role(RoleName="AWSServiceRoleForQBusiness")
    print("✅ Service-linked role for Q Business already exists")
except iam.exceptions.NoSuchEntityException:
    print("Creating service-linked role for Q Business...")
    iam.create_service_linked_role(AWSServiceName="qbusiness.amazonaws.com")
    print("✅ Service-linked role for Q Business created")

print(f"Using service-linked role for application: {app_role_arn}")


## Create a role for the webcrawler/datasource

As a best practice, use a separate role for each data source.

In [None]:

# Create a separate role for the webcrawler with specific permissions
crawler_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "qbusiness.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

# Define custom policy for the webcrawler based on the example
crawler_policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowsAmazonQToGetS3Objects",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::sitelist6117/*"  # Example bucket name; replace with your own bucket
            ],
            "Effect": "Allow",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": f"{identity['Account']}"
                }
            }
        },
        {
            "Sid": "AllowsAmazonQToGetSecret",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue"
            ],
            "Resource": [
                f"arn:aws:secretsmanager:{region}:{identity['Account']}:secret:*"
            ]
        },
        {
            "Sid": "AllowsAmazonQToDecryptSecret",
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt"
            ],
            "Resource": [
                f"arn:aws:kms:{region}:{identity['Account']}:key/*"
            ],
            "Condition": {
                "StringLike": {
                    "kms:ViaService": [
                        "secretsmanager.*.amazonaws.com"
                    ]
                }
            }
        },
        {
            "Sid": "AllowsAmazonQToIngestDocuments",
            "Effect": "Allow",
            "Action": [
                "qbusiness:BatchPutDocument",
                "qbusiness:BatchDeleteDocument"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowsAmazonQToIngestPrincipalMapping",
            "Effect": "Allow",
            "Action": [
                "qbusiness:PutGroup",
                "qbusiness:CreateUser",
                "qbusiness:DeleteGroup",
                "qbusiness:UpdateUser",
                "qbusiness:ListGroups"
            ],
            "Resource": "*"
        }
    ]
}

try:
    # Create the crawler role
    crawler_role_response = iam.create_role(
        RoleName=crawler_role_name,
        AssumeRolePolicyDocument=json.dumps(crawler_trust_policy),
        Description="Role for Amazon Q Business Webcrawler"
    )
    print("Webcrawler role created successfully.")
    
    # Create and attach the custom inline policy
    iam.put_role_policy(
        RoleName=crawler_role_name,
        PolicyName="QBusinessWebcrawlerPermissions",
        PolicyDocument=json.dumps(crawler_policy_document)
    )
    print("Custom policy attached to webcrawler role.")
    
    # Get the webcrawler role ARN
    crawler_role_arn = crawler_role_response['Role']['Arn']
    print(f"Webcrawler Role ARN: {crawler_role_arn}")
    
except iam.exceptions.EntityAlreadyExistsException:
    print(f"Role '{crawler_role_name}' already exists.")
    crawler_role_response = iam.get_role(RoleName=crawler_role_name)
    crawler_role_arn = crawler_role_response['Role']['Arn']
    print(f"Webcrawler Role ARN: {crawler_role_arn}")
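
The trust policy in the cell above does not restrict which Q Business application can cause the service to assume the role. A tighter variant, sketched below, adds `aws:SourceAccount` and `aws:SourceArn` conditions so that only applications in your own account qualify. The helper name `build_crawler_trust_policy`, the placeholder account ID, and the wildcard application ID are assumptions for illustration; once the application exists you could pass its real ID instead.

```python
import json


def build_crawler_trust_policy(account_id, region, application_id="*"):
    """Return a trust policy limited to Q Business applications in one account.

    application_id defaults to a wildcard because, in this notebook, the role
    is created before the application exists (an assumption of this sketch).
    """
    source_arn = f"arn:aws:qbusiness:{region}:{account_id}:application/{application_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "qbusiness.amazonaws.com"},
                "Action": "sts:AssumeRole",
                "Condition": {
                    # Only requests made on behalf of this account are accepted
                    "StringEquals": {"aws:SourceAccount": account_id},
                    # Only requests originating from a matching application ARN
                    "ArnLike": {"aws:SourceArn": source_arn},
                },
            }
        ],
    }


# Placeholder account ID, for illustration only
example_policy = build_crawler_trust_policy("111122223333", "us-west-2")
print(json.dumps(example_policy, indent=2))
```

In the notebook, the result could be passed to `iam.create_role` in place of `crawler_trust_policy`, using `identity['Account']` and `region` as arguments.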



## Figure out if IDC exists in your account or create IDC

Check whether the account already has an IAM Identity Center (IDC) instance. If it does, get its ARN; if not, create one. The IDC instance for your account may be in a different region than your Q Business application (cross-region IDC).

In [None]:
# Find IAM Identity Center instance across all regions
def find_idc_instance():
    # List of AWS regions to check
    regions_to_check = [
        "us-east-1", "us-east-2", "us-west-1", "us-west-2", 
        "eu-west-1", "eu-central-1", "ap-northeast-1", "ap-southeast-1"
    ]
    
    print("Searching for IAM Identity Center instances across regions...")
    
    # Check each region for an Identity Center instance
    for check_region in regions_to_check:
        try:
            # Create SSO admin client for this region
            sso_admin = boto3.client('sso-admin', region_name=check_region)
            
            # List existing Identity Center instances
            response = sso_admin.list_instances()
            
            # Check if any instances exist in this region
            if response['Instances']:
                instance_arn = response['Instances'][0]['InstanceArn']
                instance_region = check_region
                print(f"✅ Found IAM Identity Center instance in {check_region}: {instance_arn}")
                return {
                    "identityType": "AWS_IAM_IDC",
                    "identityCenterInstanceArn": instance_arn,
                    "region": instance_region
                }
        except Exception as e:
            # Continue to next region if there's an error
            continue
    
    # If no instance found, try to create one in the current region
    print("No existing IAM Identity Center instances found. Attempting to create one...")
    
    try:
        # Create Organizations client to check if organization exists
        org_client = boto3.client('organizations')
        org_client.describe_organization()
        
        # Create SSO admin client in current region
        sso_admin = boto3.client('sso-admin', region_name=region)
        
        # Create Identity Center instance
        response = sso_admin.create_instance()
        print(f"✅ Created new IAM Identity Center instance in {region}")
        
        # Wait for instance to be available
        print("Waiting for instance to be available...")
        time.sleep(10)
        
        # Get the ARN of the new instance
        list_response = sso_admin.list_instances()
        if list_response['Instances']:
            instance_arn = list_response['Instances'][0]['InstanceArn']
            print(f"New instance ARN: {instance_arn}")
            return {
                "identityType": "AWS_IAM_IDC",
                "identityCenterInstanceArn": instance_arn,
                "region": region
            }
    except Exception as e:
        print(f"Error creating IAM Identity Center instance: {str(e)}")
    
    # If we couldn't find or create an instance, return None
    print("❌ Could not find or create an IAM Identity Center instance")
    return None

# Get IAM Identity Center configuration
idc_config = find_idc_instance()


## Create a user

Create a user in that IDC instance. The example uses "John Doe" with the username "john.doe@example.com".

In [None]:
def create_idc_user(username, given_name, family_name):
    try:
        # List of AWS regions to check
        regions_to_check = [
            "us-east-1", "us-east-2", "us-west-1", "us-west-2", 
            "eu-west-1", "eu-central-1", "ap-northeast-1", "ap-southeast-1"
        ]
        
        # Variables to store instance information
        instance_arn = None
        identity_store_id = None
        idc_region = None
        
        # Check each region for an Identity Center instance
        for check_region in regions_to_check:
            try:
                # Create SSO admin client for this region
                sso_admin = boto3.client('sso-admin', region_name=check_region)
                
                # List existing Identity Center instances
                response = sso_admin.list_instances()
                
                # Check if any instances exist in this region
                if response.get('Instances'):
                    instance = response['Instances'][0]
                    instance_arn = instance['InstanceArn']
                    
                    # Get the Identity Store ID
                    instance_response = sso_admin.describe_instance(InstanceArn=instance_arn)
                    identity_store_id = instance_response['IdentityStoreId']
                    idc_region = check_region
                    break
            except Exception:
                continue
        
        if not instance_arn or not identity_store_id:
            print("No IAM Identity Center instances found in any region")
            return None
        
        # Create an Identity Store client in the correct region
        identity_store = boto3.client('identitystore', region_name=idc_region)
        
        # Create a new user with the provided username, given name, and family name
        user_response = identity_store.create_user(
            IdentityStoreId=identity_store_id,
            UserName=username,
            Name={
                'GivenName': given_name,
                'FamilyName': family_name
            },
            DisplayName=f"{given_name} {family_name}",
            Emails=[
                {
                    'Value': username if '@' in username else f'{username}@example.com',
                    'Type': 'Work',
                    'Primary': True
                }
            ]
        )
        
        user_id = user_response['UserId']
        print(f"\nUser '{given_name} {family_name}' created successfully. id: {user_id}, username: {username}")
        print("\nNote: For security reasons, the password must be set manually through the AWS Console. Log in to the console, navigate to IAM Identity Center, and reset the user's password with a temporary password.")
        
        return user_id
        
    except Exception as e:
        print(f"Error creating user in IAM Identity Center: {str(e)}")
        return None


# Example usage:
user_id = create_idc_user(email, first_name, family_name)
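
Both `find_idc_instance` and `create_idc_user` scan the same list of regions for an Identity Center instance. A shared helper along the following lines could remove that duplication. This is a sketch: the name `locate_idc_instance` and the `client_factory` parameter are assumptions introduced here so the lookup can be exercised without AWS access. It also reads `IdentityStoreId` from the `list_instances` response instead of making a second call.

```python
def locate_idc_instance(regions, client_factory):
    """Return details of the first IAM Identity Center instance found, or None.

    client_factory(service_name, region_name=...) must return a client object;
    in the notebook this would be boto3.client.
    """
    for region_name in regions:
        try:
            sso_admin = client_factory("sso-admin", region_name=region_name)
            instances = sso_admin.list_instances().get("Instances", [])
        except Exception as exc:
            # Report the failure instead of hiding it, then try the next region
            print(f"Skipping {region_name}: {exc}")
            continue
        if instances:
            first = instances[0]
            return {
                "instance_arn": first["InstanceArn"],
                "identity_store_id": first["IdentityStoreId"],
                "region": region_name,
            }
    return None
```

With boto3 imported, a call such as `locate_idc_instance(["us-east-1", "us-west-2"], boto3.client)` would search those two regions in order.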




## Create Q Business Instance

We'll create a Q Business instance using the service-linked role and the IAM Identity Center instance from the previous steps.

In [None]:
# Initialize the Q Business client with region
q_business = boto3.client('qbusiness', region_name=region)

try:
    if idc_config:
        # Create the Q Business instance with IAM Identity Center
        create_params = {
            "displayName": instance_name,
            "roleArn": app_role_arn,
            "identityType": idc_config["identityType"],
            "identityCenterInstanceArn": idc_config["identityCenterInstanceArn"]
        }
        
        print(f"Creating Q Business application with IAM Identity Center from {idc_config['region']}")
        response = q_business.create_application(**create_params)
        
        print("Q Business instance creation initiated successfully!")
        print("\nResponse:")
        print(json.dumps(response, indent=2))
        
        # Store the application ID for future reference
        application_id = response['applicationId']
    else:
        print("Cannot create Q Business application without IAM Identity Center")
        
except Exception as e:
    print(f"Error creating Q Business instance: {str(e)}")


## Check Instance Status

Let's create a function to check the status of our Q Business instance

In [None]:
def check_instance_status(application_id):
    try:
        response = q_business.get_application(
            applicationId=application_id
        )
        status = response['status']
        print(f"Current status: {status}")
        return status
    except Exception as e:
        print(f"Error checking status: {str(e)}")
        return None


# Check status periodically until the instance is ready
max_attempts = 30
attempt = 0

while attempt < max_attempts:
    status = check_instance_status(application_id)
    if status == 'ACTIVE':
        print("\nQ Business instance is now active!")
        break
    elif status == 'FAILED':
        print("\nQ Business instance creation failed!")
        break
    
    print(f"Waiting for instance to become active... (Attempt {attempt + 1}/{max_attempts})")
    time.sleep(60)  # Wait for 60 seconds before checking again
    attempt += 1
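
The same polling loop appears again for the index, the data source, and the sync job. A reusable helper is one way to avoid repeating it. The sketch below is an illustration, not part of the original notebook: the name `wait_for_status` and its parameters are assumptions, and the `sleep` argument exists only so the wait can be replaced in tests.

```python
import time


def wait_for_status(get_status, success=("ACTIVE",), failure=("FAILED",),
                    max_attempts=30, delay_seconds=60, sleep=time.sleep):
    """Call get_status() until it returns a success or failure value.

    Returns the last status seen, which may be neither a success nor a
    failure value if max_attempts is exhausted.
    """
    status = None
    for attempt in range(1, max_attempts + 1):
        status = get_status()
        if status in success or status in failure:
            return status
        print(f"Status is {status}; waiting... (Attempt {attempt}/{max_attempts})")
        # No need to wait after the final attempt
        if attempt < max_attempts:
            sleep(delay_seconds)
    return status
```

For the application this could be called as `wait_for_status(lambda: check_instance_status(application_id))`, and for the sync job with `success=("SUCCEEDED",)` and `failure=("FAILED", "ABORTED")`.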

## Add user to the application, and configure the subscription


In [None]:
def add_user_with_pro_subscription(application_id, user_id):
    """
    Add a user to the Q Business instance with a Pro subscription.
    
    Parameters:
    - application_id: The ID of the Q Business application
    - user_id: The user ID from IAM Identity Center
    """
    try:
        # Initialize the Q Business client
        q_business = boto3.client('qbusiness', region_name=region)
        
        # Create the user in Q Business
        # user_response = q_business.create_user(
        #     applicationId=application_id,
        #     userId=user_id  # This is the IAM Identity Center user ID
        # )
        
        
        # Create a Pro subscription for the user
        subscription_response = q_business.create_subscription(
            applicationId=application_id,
            principal={
                'user': user_id  # IAM Identity Center user ID
            },
            type="Q_BUSINESS"  # Pro subscription type
        )
        
        print(f"✅ User with ID {user_id} added to Q Business with Pro subscription!")
        print(f"Q Business User ID: {user_id}")
        print(f"Subscription ID: {subscription_response['subscriptionId']}")
        print (json.dumps(subscription_response, indent=5))
        
        return user_id  # Return the ID so the caller can store it
    except Exception as e:
        print(f"❌ Error adding user to Q Business: {str(e)}")
        return None


# Add the user with Pro subscription
q_business_user_id = add_user_with_pro_subscription(application_id, user_id)


In [None]:
print (f"application_id: {application_id}")
print (f"user {user_id}")

print ("list subscriptions:")
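# Call list_subscriptions once and reuse the response
subscriptions_response = q_business.list_subscriptions(applicationId=application_id)
print (json.dumps(subscriptions_response, indent=5))
sub_id = subscriptions_response["subscriptions"][0]['subscriptionId']
print (f'subscription_id: {sub_id}')

# user_response is only defined if the create_user call below is uncommented
# print (f"user_response: {user_response}")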
#Create the user in Q Business
# user_response = q_business.create_user(
#     applicationId=application_id,
#     userId=user_id  # This is the IAM Identity Center user ID
# )

# print ("----")
# q_business.cancel_subscription(applicationId=application_id, subscriptionId=sub_id)
# q_business.delete_user(applicationId=application_id, userId=user_id)
# print (json.dumps(q_business.list_subscriptions(applicationId=application_id), indent=5))



## Create Index

Before creating a data source, we need to create an index that will store the crawled data.

In [None]:
try:
    # Create the index with displayName and capacity configuration
    index_response = q_business.create_index(
        applicationId=application_id,
        displayName="webcrawler-index",
        description="Index for webcrawler data source",
        type="ENTERPRISE",
        capacityConfiguration={
            "units": 1  # Specify the number of capacity units (1-10)
        }
    )
    
    print("Index creation initiated successfully!")
    print("\nResponse:")
    print(json.dumps(index_response, indent=2))
    
    # Store the index ID for future reference
    index_id = index_response['indexId']
    
except Exception as e:
    print(f"Error creating index: {str(e)}")


## Check Index Status

Let's monitor the status of our index creation.

In [None]:


def check_index_status(application_id, index_id):
    try:
        response = q_business.get_index(
            applicationId=application_id,
            indexId=index_id
        )
        status = response['status']
        print(f"Current status: {status}")
        return status
    except Exception as e:
        print(f"Error checking status: {str(e)}")
        return None

# Check status periodically until the index is ready
max_attempts = 30
attempt = 0

while attempt < max_attempts:
    status = check_index_status(application_id, index_id)
    if status == 'ACTIVE':
        print("\nIndex is now active!")
        break
    elif status == 'FAILED':
        print("\nIndex creation failed!")
        break
    
    print(f"Waiting for index to become active... (Attempt {attempt + 1}/{max_attempts})")
    time.sleep(60)  # Wait for 60 seconds before checking again
    attempt += 1


## Create retriever

Along with the index, you'll need a retriever to use the index.

In [None]:
try:
    # Create a retriever with correct configuration
    retriever_response = q_business.create_retriever(
        applicationId=application_id,
        displayName="webcrawler-retriever",
        type="NATIVE_INDEX",
        configuration={
            "nativeIndexConfiguration": {
                "indexId": index_id,
                
            }
        },
        roleArn=app_role_arn
    )
    
    print("Retriever creation initiated successfully!")
    print("\nResponse:")
    print(json.dumps(retriever_response, indent=2))
    
    # Store the retriever ID for future reference
    retriever_id = retriever_response['retrieverId']
    
except Exception as e:
    print(f"Error creating retriever: {str(e)}")


## Configure Webcrawler Data Source

Now that our Q Business instance and index are active, let's configure a webcrawler data source.

In [None]:
try:
    # Create the data source with the correct configuration structure
    webcrawler_config = {
        "type": "WEBCRAWLER",
        "version": "1.0.0",
        "syncMode": "FORCED_FULL_CRAWL",
        "connectionConfiguration": {
            "repositoryEndpointMetadata": {
                "seedUrlConnections": [
                    # One connection entry per configured seed URL
                    {"seedUrl": url} for url in seed_urls
                ],
                "authentication": "NoAuthentication"
            }
        },
        "repositoryConfigurations": {
            "webPage": {
                "fieldMappings": [
                    {
                        "dataSourceFieldName": "category",
                        "indexFieldName": "_category",
                        "indexFieldType": "STRING"
                    },
                    {
                        "dataSourceFieldName": "sourceUrl",
                        "indexFieldName": "_source_uri",
                        "indexFieldType": "STRING"
                    },
                    {
                        "dataSourceFieldName": "title",
                        "indexFieldName": "_document_title",
                        "indexFieldType": "STRING"
                    }
                ]
            },
            "attachment": {
                "fieldMappings": [
                    {
                        "dataSourceFieldName": "category",
                        "indexFieldName": "_category",
                        "indexFieldType": "STRING"
                    },
                    {
                        "dataSourceFieldName": "sourceUrl",
                        "indexFieldName": "_source_uri",
                        "indexFieldType": "STRING"
                    }
                ]
            }
        },
        "additionalProperties": {
            "rateLimit": "300",
            "maxFileSize": "50",
            "crawlDepth": str(crawl_depth),
            "maxLinksPerUrl": str(max_urls),
            "crawlSubDomain": True,
            "crawlAllDomain": False,
            "honorRobots": True,
            "crawlAttachments": False,
            "inclusionURLCrawlPatterns": [],
            "exclusionURLCrawlPatterns": [],
            "inclusionURLIndexPatterns": [],
            "exclusionURLIndexPatterns": [],
            "inclusionFileIndexPatterns": [],
            "exclusionFileIndexPatterns": [],
            "proxy": {},
            "enableDeletionProtection": False,
            "deletionProtectionThreshold": "0"
        }
    }
    
    # Create the data source
    response = q_business.create_data_source(
        applicationId=application_id,
        indexId=index_id,
        displayName="webcrawler-source",
        configuration=webcrawler_config,
        roleArn=crawler_role_arn,
        description="Web crawler data source"
    )
    
    print("Webcrawler data source creation initiated successfully!")
    print("\nResponse:")
    print(json.dumps(response, indent=2))
    
    # Store the data source ID for future reference
    data_source_id = response['dataSourceId']
    
except Exception as e:
    print(f"Error creating webcrawler data source: {str(e)}")



## Check Data Source Status

Let's monitor the status of our webcrawler data source.

In [None]:
def check_data_source_status(application_id, index_id, data_source_id):
    try:
        response = q_business.get_data_source(
            applicationId=application_id,
            indexId=index_id,
            dataSourceId=data_source_id
        )
        status = response['status']
        print(f"Current status: {status}")
        return status
    except Exception as e:
        print(f"Error checking status: {str(e)}")
        return None

# Check status periodically until the data source is ready
max_attempts = 30
attempt = 0

while attempt < max_attempts:
    status = check_data_source_status(application_id, index_id, data_source_id)
    if status == 'ACTIVE':
        print("\nWebcrawler data source is now active!")
        break
    elif status == 'FAILED':
        print("\nWebcrawler data source creation failed!")
        break
    
    print(f"Waiting for data source to become active... (Attempt {attempt + 1}/{max_attempts})")
    time.sleep(60)  # Wait for 60 seconds before checking again
    attempt += 1


## Start a Data Source Sync

This cell initiates a sync job for the webcrawler data source

In [None]:
try:
    # Start a sync job for the data source
    sync_response = q_business.start_data_source_sync_job(
        applicationId=application_id,
        indexId=index_id,
        dataSourceId=data_source_id
    )
    
    if sync_response['ResponseMetadata']['HTTPStatusCode'] == 200:
        print("Data source sync job initiated successfully!")
        print("\nResponse:")
        print(json.dumps(sync_response, indent=2))
        
        # Store the execution ID for future reference
        execution_id = sync_response['executionId']
    else:
        print(f"Error starting sync job. Status code: {sync_response['ResponseMetadata']['HTTPStatusCode']}")
        print("\nResponse:")
        print(json.dumps(sync_response, indent=2))
    
except Exception as e:
    print(f"Error starting data source sync job: {str(e)}")


## Check Sync Job Status

Monitor the status of the data source sync job

In [None]:
def check_sync_job_status(application_id, index_id, data_source_id):
    try:
        response = q_business.list_data_source_sync_jobs(
            applicationId=application_id,
            indexId=index_id,
            dataSourceId=data_source_id
        )
        
        # Get the most recent sync job from history
        if response['history']:
            latest_job = response['history'][0]  # Jobs are returned in descending order
            status = latest_job['status']
            print(f"Current sync job status: {status}")
            
            # Print metrics if available
            if 'metrics' in latest_job:
                print("\nSync metrics:")
                print(json.dumps(latest_job['metrics'], indent=2))
            
            # Print error if job failed
            if status == 'FAILED' and 'error' in latest_job:
                print("\nError details:")
                print(json.dumps(latest_job['error'], indent=2))
                
            return status
        else:
            print("No sync jobs found in history")
            return None
            
    except Exception as e:
        print(f"Error checking sync job status: {str(e)}")
        return None

# Check status periodically until the sync job is complete
max_attempts = 30
attempt = 0

while attempt < max_attempts:
    status = check_sync_job_status(application_id, index_id, data_source_id)
    if status == 'SUCCEEDED':
        print("\nSync job completed successfully!")
        break
    elif status in ['FAILED', 'ABORTED']:
        print("\nSync job failed!")
        break
    elif status in ['SYNCING', 'SYNCING_INDEXING']:
        print(f"Waiting for sync job to complete... (Attempt {attempt + 1}/{max_attempts})")
        time.sleep(60)  # Wait for 60 seconds before checking again
        attempt += 1
    else:
        print(f"\nUnexpected status: {status}")
        break
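
When troubleshooting, a one-line summary of each sync job is often easier to scan than the raw metrics. The sketch below condenses one entry of `response['history']` into a small dict. The function name `summarize_sync_job` is hypothetical, and the metric keys (`documentsScanned`, `documentsAdded`, and so on) and the `errorMessage` key are assumptions about the response shape; check them against the actual output printed by the cell above.

```python
def summarize_sync_job(job):
    """Condense one sync job history entry into a small dict."""
    metrics = job.get("metrics", {})

    def as_int(value):
        # Metric values may arrive as strings; treat missing or malformed values as 0
        try:
            return int(value)
        except (TypeError, ValueError):
            return 0

    summary = {
        "status": job.get("status"),
        "scanned": as_int(metrics.get("documentsScanned")),
        "added": as_int(metrics.get("documentsAdded")),
        "modified": as_int(metrics.get("documentsModified")),
        "deleted": as_int(metrics.get("documentsDeleted")),
        "failed": as_int(metrics.get("documentsFailed")),
    }
    error = job.get("error") or {}
    if error:
        summary["error"] = error.get("errorMessage")
    return summary


# Hand-written sample entry for illustration, not real service output
sample_job = {
    "status": "FAILED",
    "metrics": {
        "documentsScanned": "12",
        "documentsAdded": "10",
        "documentsModified": "0",
        "documentsDeleted": "0",
        "documentsFailed": "2",
    },
    "error": {"errorCode": "InternalError", "errorMessage": "example message"},
}
print(summarize_sync_job(sample_job))
```

A scanned count far below what you expect, or a non-zero failed count, is usually the first sign that the crawl depth, URL patterns, or page rendering need attention.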



## Debugging: Dump the web crawler configuration to a json file
Export the data source configuration to a JSON file using the AWS CLI. We use the shell rather than boto3 here so you can capture the CLI command for your own reference.


In [None]:


# Create the AWS CLI command
aws_cli_command = f"aws qbusiness get-data-source --application-id {application_id} --index-id {index_id} --data-source-id {data_source_id} --region {region} --output json > webcrawler_datasource_config.json"

# Display the command for reference
print("Executing command:")
print(aws_cli_command)

# Execute the command
!{aws_cli_command}

# Display the file content
print("\nFile content:")
!cat webcrawler_datasource_config.json

# Check file size
print("\nFile details:")
!ls -lh webcrawler_datasource_config.json
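
Once two data sources have been exported this way, their configurations can be compared key by key to find discrepancies. The following sketch flattens each JSON document into dotted paths and reports the paths whose values differ. The function names `flatten` and `diff_configs` and the sample dicts are assumptions for illustration; in practice you would load the two exported files with `json.load`.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into a {dotted.path: value} mapping."""
    items = {}
    if isinstance(obj, dict) and obj:
        for key, value in obj.items():
            items.update(flatten(value, f"{prefix}.{key}" if prefix else str(key)))
    elif isinstance(obj, list) and obj:
        for index, value in enumerate(obj):
            items.update(flatten(value, f"{prefix}[{index}]"))
    else:
        # Scalars and empty containers are stored as leaf values
        items[prefix] = obj
    return items


def diff_configs(left, right):
    """Return {path: (left_value, right_value)} for every path that differs."""
    flat_left, flat_right = flatten(left), flatten(right)
    marker = "<missing>"
    differences = {}
    for path in sorted(set(flat_left) | set(flat_right)):
        left_value = flat_left.get(path, marker)
        right_value = flat_right.get(path, marker)
        if left_value != right_value:
            differences[path] = (left_value, right_value)
    return differences


# Two hand-written fragments standing in for exported configurations
config_a = {"configuration": {"additionalProperties": {"crawlDepth": "2", "proxy": {}}}}
config_b = {"configuration": {"additionalProperties": {
    "crawlDepth": "3", "proxy": {}, "implicitWaitDuration": "5"}}}

for path, (a_value, b_value) in diff_configs(config_a, config_b).items():
    print(f"{path}: {a_value!r} -> {b_value!r}")
```

Fields such as `dataSourceId`, `createdAt`, and `updatedAt` will always differ between two data sources, so it can help to drop them before comparing.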



## Modify Webcrawler Configuration

The following cell adds an `implicitWaitDuration` parameter to the webcrawler configuration. This parameter adds a 5-second wait after a page is loaded but before the content is read. This wait time allows JavaScript-rendered content to fully load, which is particularly useful for modern web applications where content is dynamically generated.

We do this with shell commands in the following cell, but it's just as easy to open the JSON in your editor and add the following to the "additionalProperties" section:

"implicitWaitDuration": "5"

We also need to remove several metadata fields from the JSON file before updating the data source. This is necessary to work around known bugs in the AWS CLI when importing modified configurations. The fields we remove are: "dataSourceArn", "type", "createdAt", "updatedAt", "status", and "error", as these are managed by the service and should not be included in update requests.
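
If `jq` is not available, the same transformation can be done in plain Python. The sketch below removes the service-managed fields listed above and sets `implicitWaitDuration` on a dict loaded from the exported file. The function name `prepare_update_payload` and the sample dict are assumptions for illustration only.

```python
import copy
import json

# Top-level fields that the service manages and that update requests should omit
SERVICE_MANAGED_FIELDS = ("dataSourceArn", "type", "createdAt", "updatedAt", "status", "error")


def prepare_update_payload(exported, wait_seconds=5):
    """Return a copy of an exported data source that is ready to use in an update."""
    payload = copy.deepcopy(exported)
    for field in SERVICE_MANAGED_FIELDS:
        payload.pop(field, None)
    properties = payload.setdefault("configuration", {}).setdefault("additionalProperties", {})
    # The connector configuration stores this value as a string, as in the snippet above
    properties["implicitWaitDuration"] = str(wait_seconds)
    return payload


# Hand-written stand-in for the exported JSON, not real service output
sample_export = {
    "applicationId": "app-123",
    "dataSourceArn": "arn:example",
    "type": "WEBCRAWLER",
    "status": "ACTIVE",
    "configuration": {"type": "WEBCRAWLER", "additionalProperties": {"crawlDepth": "2"}},
}
print(json.dumps(prepare_update_payload(sample_export), indent=2))
```

Only the top-level `type` is removed; the `type` inside `configuration` is part of the connector configuration and stays. If the dict comes from boto3's `get_data_source` rather than the CLI export, the `ResponseMetadata` key would also need to be dropped.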



In [None]:
# Modify the webcrawler configuration using AWS CLI

# First, export the current configuration to a file
export_cmd = f"aws qbusiness get-data-source --application-id {application_id} --index-id {index_id} --data-source-id {data_source_id} --region {region} --output json > webcrawler_config.json"
print(f"Executing: {export_cmd}")
!{export_cmd}

# Install jq if not already installed (uncomment if needed)
# !apt-get update && apt-get install -y jq || brew install jq || echo "Please install jq manually"

# Modify the configuration file to:
# 1. Add implicitWaitDuration parameter
# 2. Remove fields that cause issues during import
modify_cmd = """jq 'del(.dataSourceArn, .type, .createdAt, .updatedAt, .status, .error) | 
    .configuration.additionalProperties.implicitWaitDuration = "5"' webcrawler_config.json > webcrawler_config_modified.json"""
print(f"Executing: {modify_cmd}")
!{modify_cmd}

# Display the modified section
display_cmd = "jq '.configuration.additionalProperties' webcrawler_config_modified.json"
print(f"Executing: {display_cmd}")
!{display_cmd}

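# Update the data source with the modified configuration.
# The file holds the whole request (IDs, display name, configuration, ...), so it is passed
# with --cli-input-json; --configuration would expect only the inner "configuration" document.
update_cmd = f"aws qbusiness update-data-source --application-id {application_id} --index-id {index_id} --data-source-id {data_source_id} --region {region} --cli-input-json file://webcrawler_config_modified.json --output json"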
print(f"Executing: {update_cmd}")
!{update_cmd}

# Start a new sync job to apply the changes
sync_cmd = f"aws qbusiness start-data-source-sync-job --application-id {application_id} --index-id {index_id} --data-source-id {data_source_id} --region {region} --output json"
print(f"Executing: {sync_cmd}")
!{sync_cmd}
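
To confirm that the update was stored, the configuration can be read back with boto3's `get_data_source` and checked for the new property. The sketch below is one possible way to do that; the helper name `get_implicit_wait` and the stub client are hypothetical and exist only so the example runs without AWS access.

```python
def get_implicit_wait(client, application_id, index_id, data_source_id):
    """Return the stored implicitWaitDuration value, or None if it is not set."""
    response = client.get_data_source(
        applicationId=application_id,
        indexId=index_id,
        dataSourceId=data_source_id,
    )
    extra = response.get("configuration", {}).get("additionalProperties", {})
    return extra.get("implicitWaitDuration")


class _StubClient:
    """Stand-in for boto3.client('qbusiness') so the sketch runs offline."""

    def get_data_source(self, **kwargs):
        return {"configuration": {"additionalProperties": {"implicitWaitDuration": "5"}}}


stub_value = get_implicit_wait(_StubClient(), "app-id", "index-id", "ds-id")
print(f"implicitWaitDuration reported by the stub: {stub_value}")

# In the notebook, pass the real client and IDs instead:
# get_implicit_wait(q_business, application_id, index_id, data_source_id)
```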



In [None]:
def create_idc_user(username, given_name, family_name):
    try:
        # List of AWS regions to check
        regions_to_check = [
            "us-east-1", "us-east-2", "us-west-1", "us-west-2", 
            "eu-west-1", "eu-central-1", "ap-northeast-1", "ap-southeast-1"
        ]
        
        # Variables to store instance information
        instance_arn = None
        identity_store_id = None
        idc_region = None
        
        # Check each region for an Identity Center instance
        for check_region in regions_to_check:
            try:
                # Create SSO admin client for this region
                sso_admin = boto3.client('sso-admin', region_name=check_region)
                
                # List existing Identity Center instances
                response = sso_admin.list_instances()
                
                # Check if any instances exist in this region
                if response.get('Instances'):
                    instance = response['Instances'][0]
                    instance_arn = instance['InstanceArn']
                    
                    # Get the Identity Store ID
                    instance_response = sso_admin.describe_instance(InstanceArn=instance_arn)
                    identity_store_id = instance_response['IdentityStoreId']
                    idc_region = check_region
                    break
            except Exception:
                continue
        
        if not instance_arn or not identity_store_id:
            print("No IAM Identity Center instances found in any region")
            return None
        
        # Create an Identity Store client in the correct region
        identity_store = boto3.client('identitystore', region_name=idc_region)
        
        # Create a new user with the provided username, given name, and family name
        user_response = identity_store.create_user(
            IdentityStoreId=identity_store_id,
            UserName=username,
            Name={
                'GivenName': given_name,
                'FamilyName': family_name
            },
            DisplayName=f"{given_name} {family_name}",
            Emails=[
                {
                    'Value': username if '@' in username else f'{username}@example.com',
                    'Type': 'Work',
                    'Primary': True
                }
            ]
        )
        
        user_id = user_response['UserId']
        print(f"\nUser '{given_name} {family_name}' created successfully. id: {user_id}, username: {username}")
        print("\nNote: For security reasons, the password must be set manually through the AWS Console. Log in to the console, navigate to IAM Identity Center, and reset the user's password with a temporary password.")
        
        return user_id
        
    except Exception as e:
        print(f"Error creating user in IAM Identity Center: {str(e)}")
        return None


# Example usage:
user_id = create_idc_user("john.doe@example.com", "John", "Doe")




## Setting a Password for Your Test User (must be done outside of this notebook)

Before setting up a password for your test user, you need to verify that MFA is turned off for your IAM Identity Center instance to allow password-based login. This can't be done via API/CLI for security reasons. 

### Step 1: Verify MFA Settings in IAM Identity Center
1. Open the [AWS Management Console](https://console.aws.amazon.com/)
2. Navigate to **IAM Identity Center**
3. In the left navigation pane, click on **Settings**
4. Select the **Authentication** tab
5. Under **Multi-factor authentication**, check whether MFA is set to **Not required** (this allows password-only login)
6. If MFA is set to **Required**, consider changing it temporarily for testing purposes

### Step 2: Set a Password for Your Test User
1. In the left navigation pane, click on **Users**
2. Find the test user you just created
3. Click on the user's name to view their details
4. Click the **Reset password** button
5. Select **Generate a one-time password**
6. Click **Reset password** to confirm
7. Copy the temporary password that appears on the screen

### Step 3: Log In to Reset the Password
This step is required to use the test user with Amazon Q Business:

1. Open the AWS access portal URL (shown in the IAM Identity Center console)
2. Enter the username and temporary password you copied
3. When prompted, create a new password that meets the security requirements
4. This user can now be used to access the Amazon Q Business web experience

> **Important:** You must complete the password reset process before the user can be used with Amazon Q Business. In a real production environment, MFA should be enabled and you would want to securely communicate the temporary password to a real user. 



## Clean Up (Optional)

If you want to delete the Q Business instance, data source, index, and the IAM roles, you can use the following code:

In [None]:
# Uncomment and run this cell to delete the instance, data source, index, and roles
# try:
#     # Delete the data source (indexId is required along with applicationId and dataSourceId)
#     q_business.delete_data_source(
#         applicationId=application_id,
#         indexId=index_id,
#         dataSourceId=data_source_id
#     )
#     print("Data source deletion initiated successfully!")
#     
#     # Delete the index
#     q_business.delete_index(
#         applicationId=application_id,
#         indexId=index_id
#     )
#     print("Index deletion initiated successfully!")
#     
#     # Delete the Q Business instance
#     q_business.delete_application(
#         applicationId=application_id
#     )
#     print("Q Business instance deletion initiated successfully!")
#     
#     # Detach the policies from the application role
#     iam.detach_role_policy(
#         RoleName=app_role_name,
#         PolicyArn="arn:aws:iam::aws:policy/AmazonQBusinessFullAccess"
#     )
#     print(f"Policy detached from {app_role_name}")
#     
#     # Detach the policies from the crawler role
#     for policy_arn in crawler_policies:
#         iam.detach_role_policy(
#             RoleName=crawler_role_name,
#             PolicyArn=policy_arn
#         )
#         print(f"Policy {policy_arn} detached from {crawler_role_name}")
#     
#     # Delete the IAM roles
#     iam.delete_role(RoleName=app_role_name)
#     print(f"IAM role {app_role_name} deleted successfully!")
#     
#     iam.delete_role(RoleName=crawler_role_name)
#     print(f"IAM role {crawler_role_name} deleted successfully!")
#     
# except Exception as e:
#     print(f"Error during cleanup: {str(e)}")
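
IAM refuses to delete a role that still has managed or inline policies. If the roles carry policies beyond the ones detached above, a more general helper can be sketched as follows. `delete_role_completely` and the recording stub are hypothetical; the sketch also assumes each role has few enough policies that the list calls return them in a single page.

```python
def delete_role_completely(iam, role_name):
    """Detach managed policies, delete inline policies, then delete the role."""
    attached = iam.list_attached_role_policies(RoleName=role_name)
    for policy in attached.get("AttachedPolicies", []):
        iam.detach_role_policy(RoleName=role_name, PolicyArn=policy["PolicyArn"])

    inline = iam.list_role_policies(RoleName=role_name)
    for policy_name in inline.get("PolicyNames", []):
        iam.delete_role_policy(RoleName=role_name, PolicyName=policy_name)

    iam.delete_role(RoleName=role_name)


class _RecordingIam:
    """Offline stand-in for boto3.client('iam') that records the calls it receives."""

    def __init__(self):
        self.calls = []

    def list_attached_role_policies(self, RoleName):
        return {"AttachedPolicies": [{"PolicyArn": "arn:aws:iam::aws:policy/Example"}]}

    def list_role_policies(self, RoleName):
        return {"PolicyNames": ["inline-example"]}

    def detach_role_policy(self, RoleName, PolicyArn):
        self.calls.append(("detach", PolicyArn))

    def delete_role_policy(self, RoleName, PolicyName):
        self.calls.append(("delete_inline", PolicyName))

    def delete_role(self, RoleName):
        self.calls.append(("delete_role", RoleName))


stub_iam = _RecordingIam()
delete_role_completely(stub_iam, "example-role")
print(stub_iam.calls)

# In the notebook: delete_role_completely(iam, app_role_name)
```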
