# Kendra Sample

* Wget a URL
* Upload files to S3
* Create a kendra index
* Add the source
* Sync the source
* Query the index
* Clean up

## How to Use the Notebook

Code is broken up into cells like the one below. There's a triangular `Run` button at the top of this page you can click to execute each cell and move onto the next, or you can press `Shift` + `Enter` while in the cell to execute it and move onto the next one.

As a cell is executing you'll notice a line to the side showcase an `*` while the cell is running or it will update to a number to indicate the last cell that completed executing after it has finished exectuting all the code within a cell.


Simply follow the instructions below and execute the cells to get started with Amazon Personalize.

## Imports 

Python ships with a broad collection of libraries and we need to import those as well as the ones installed to help us like boto3(The AWS SDK) and Pandas/Numpy which are core data science tools

In [1]:
# Imports
import boto3
import json
import time

Next you will want to validate that your environment can communicate successfully with Amazon Personalize, the lines below do just that.

In [2]:
# Configure the SDK to Kendra:
kendra = boto3.client('kendra')

## Configure the data

Data is imported into Amazon Personalize through Amazon S3, below we will specify a bucket that you have created within AWS for the purposes of this exercise.

Below you will update the `bucket` variable to instead be set to the value that you created earlier in the CloudFormation steps, this should be in a text file from your earlier work. the `filename` does not need to be changed.

### Specify a Bucket and Data Output Location

In [3]:
region_name = "us-west-2"
bucket_name = "hunkim-kendra-sample"       # replace with the name of your S3 bucket
index_name = "hunkim-kendra-sample"
url2crawl = "http://www.cse.ust.hk/~hunkim"

### Download, Prepare, and Upload Training Data

At present you do not have the MovieLens data loaded locally yet for examination, execute the lines below to download the latest copy and to examine it quickly.

#### Download and Explore the Dataset

In [4]:
!rm -rf data
!mkdir data
!wget -P data --reject jpg,png,ico,jpeg,css  --accept html,pdf -m -p -E -k -K -np http://www.cse.ust.hk/~hunkim/
!ls data

nkim/images/c/c9/Papers_kim2006fse.pdf
Reusing existing connection to www.cse.ust.hk:80.
HTTP request sent, awaiting response...200 OK
Length: 876260 (856K) [application/pdf]
Saving to: ‘data/www.cse.ust.hk/~hunkim/images/c/c9/Papers_kim2006fse.pdf’


2020-05-31 22:23:04 (10.4 MB/s) - ‘data/www.cse.ust.hk/~hunkim/images/c/c9/Papers_kim2006fse.pdf’ saved [876260/876260]

--2020-05-31 22:23:04--  http://www.cse.ust.hk/~hunkim/images/5/59/Papers_kim2006ase.pdf
Reusing existing connection to www.cse.ust.hk:80.
HTTP request sent, awaiting response...200 OK
Length: 1315887 (1.3M) [application/pdf]
Saving to: ‘data/www.cse.ust.hk/~hunkim/images/5/59/Papers_kim2006ase.pdf’


2020-05-31 22:23:04 (11.2 MB/s) - ‘data/www.cse.ust.hk/~hunkim/images/5/59/Papers_kim2006ase.pdf’ saved [1315887/1315887]

--2020-05-31 22:23:04--  http://www.cse.ust.hk/~hunkim/images/f/f5/Bevan2005fse.pdf
Reusing existing connection to www.cse.ust.hk:80.
HTTP request sent, awaiting response...200 OK
Length: 327201 (320K)

#### Prepare and Upload Data

As you can see the data contains a UserID, ItemID, Rating, and Timestamp.

We are now going to remove the items with low rankings, and remove that column before we build our model.

Once done we will now save the file as a new CSV and then upload it to S3.

All of that is done by simply executing the lines below.

In [5]:
# https://www.developerfiles.com/upload-files-to-s3-with-python-keeping-the-original-folder-structure/
def upload_dir_s3(local_path):
    s3_client = boto3.Session().resource('s3')
    s3_bucket = s3_client.Bucket(bucket_name)

    for subdir, dirs, files in os.walk(local_path):
        for file in files:
            full_path = os.path.join(subdir, file)
            with open(full_path, 'rb') as data:
                print("Uploadng " + full_path)
                s3_bucket.put_object(Key=full_path[len(local_path)+1:], Body=data)

In [6]:
if boto3.Session().resource('s3').Bucket(bucket_name).creation_date is None:
    try:
        s3_client = boto3.client('s3', region_name=region_name)
        location = {'LocationConstraint': region_name}
        s3_client.create_bucket(Bucket=bucket_name,
                                CreateBucketConfiguration=location)
    except ClientError as e:
        logging.error(e)
else:
    print("We found " + bucket_name)
    
upload_dir_s3("data")

Uploadng data/www.cse.ust.hk/robots.txt.tmp
Uploadng data/www.cse.ust.hk/~hunkim/Students.html
Uploadng data/www.cse.ust.hk/~hunkim/Software.html
Uploadng data/www.cse.ust.hk/~hunkim/Research.html.orig
Uploadng data/www.cse.ust.hk/~hunkim/index.html
Uploadng data/www.cse.ust.hk/~hunkim/Bio.html
Uploadng data/www.cse.ust.hk/~hunkim/Teaching.html.orig
Uploadng data/www.cse.ust.hk/~hunkim/Contact.html
Uploadng data/www.cse.ust.hk/~hunkim/Research.html
Uploadng data/www.cse.ust.hk/~hunkim/index.html.orig
Uploadng data/www.cse.ust.hk/~hunkim/Bio.html.orig
Uploadng data/www.cse.ust.hk/~hunkim/Software.html.orig
Uploadng data/www.cse.ust.hk/~hunkim/Service.html.orig
Uploadng data/www.cse.ust.hk/~hunkim/Contact.html.orig
Uploadng data/www.cse.ust.hk/~hunkim/Service.html
Uploadng data/www.cse.ust.hk/~hunkim/Publications.html.orig
Uploadng data/www.cse.ust.hk/~hunkim/Students.html.orig
Uploadng data/www.cse.ust.hk/~hunkim/Teaching.html
Uploadng data/www.cse.ust.hk/~hunkim/Publications.html
Uploa

### Create Index


#### Create IAM for index and data source

In [7]:
iam = boto3.client("iam")

role_name = "hunkim-kendra-sample-role"
policy_name = "hunkim-kendra-sample-policy"

# Delete if exist
try:
    response = iam.list_attached_role_policies(RoleName=role_name)
    for policy in response["AttachedPolicies"]:
        iam.detach_role_policy(PolicyArn=policy["PolicyArn"], RoleName=role_name)
        iam.delete_policy(PolicyArn=policy["PolicyArn"])
    
    iam.delete_role(RoleName=role_name)
except Exception as e: 
    print (str(e))

# Create a policy
kendra_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "cloudwatch:PutMetricData",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "cloudwatch:namespace": "Kendra"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": "logs:DescribeLogGroups",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "logs:CreateLogGroup",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": "*"
        },
         {
            "Effect": "Allow",
            "Action": "kendra:*",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:DescribeLogStreams",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "*"
        }
    ]
}

try:
    response = iam.create_policy(
        PolicyName=policy_name,
        PolicyDocument=json.dumps(kendra_policy))

    kendra_policy_arn = response["Policy"]["Arn"]
    print(kendra_policy_arn)
except Exception as e:
    print("Creating Policy", str(e))

assume_role_policy_document =  {
    "Version": "2012-10-17",
    "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "kendra.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
    ]
}


create_role_response = iam.create_role(
    RoleName = role_name,
    AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)
)

# Add Kendra policy
iam.attach_role_policy(
    PolicyArn=kendra_policy_arn,
    RoleName=role_name
)

time.sleep(60) # wait for a minute to allow IAM role policy attachment to propagate

role_arn = create_role_response["Role"]["Arn"]
print(role_arn)

An error occurred (NoSuchEntity) when calling the ListAttachedRolePolicies operation: The role with name hunkim-kendra-sample-role cannot be found.
arn:aws:iam::294038372338:policy/hunkim-kendra-sample-policy
arn:aws:iam::294038372338:role/hunkim-kendra-sample-role


#### Create Kendra Index

In [8]:
kendra_response = kendra.create_index(
    Name=index_name,
    Edition='DEVELOPER_EDITION',
    RoleArn=role_arn,
    Description='Kendra Sample',
    ClientToken='hunkim'
)

kendra_index_id = kendra_response['Id']
print(kendra_index_id)

93dd4515-79e8-4634-93af-b64bd1ec36cd


#### Wait for Kendra Index to Have ACTIVE Status

Before we can use the Dataset Group in any items below it must be active, execute the cell below and wait for it to show active.

In [9]:
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_kendra_index_response = kendra.describe_index(
        Id = kendra_index_id
    )
    status = describe_kendra_index_response["Status"]
    print("KendraIndexStatus: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

KendraIndexStatus: CREATING
KendraIndexStatus: CREATING
KendraIndexStatus: CREATING
KendraIndexStatus: CREATING
KendraIndexStatus: CREATING
KendraIndexStatus: CREATING
KendraIndexStatus: CREATING
KendraIndexStatus: CREATING
KendraIndexStatus: CREATING
KendraIndexStatus: CREATING
KendraIndexStatus: CREATING
KendraIndexStatus: CREATING
KendraIndexStatus: ACTIVE


### Create data source
Use simple S3 and reuse the role arn

In [10]:
kendra_response = kendra.create_data_source(
    Name='hunkim-kendra-s3-data-source',
    IndexId=kendra_index_id,
    Type='S3',
    Configuration={
        'S3Configuration': {
            'BucketName': bucket_name
        }
    },
    Description='Kendra S3 DataSource',
    RoleArn=role_arn,
)

kendra_data_source_id = kendra_response['Id']
print("kendra_data_source_id:", kendra_data_source_id)

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_data_source_response = kendra.describe_data_source(
        Id = kendra_data_source_id,
        IndexId = kendra_index_id
    )
    status = describe_kendra_index_response["Status"]
    print("DataSourceStatus: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

kendra_data_source_id: afeeda0b-70be-4b5c-b21e-ced264200800
DataSourceStatus: ACTIVE


### Sync the data source

In [11]:
kendra_response = kendra.start_data_source_sync_job(
    Id = kendra_data_source_id,
    IndexId=kendra_index_id,
)

kendra_data_sync_id = kendra_response['ExecutionId']
print("kendra_data_sync_id:", kendra_data_sync_id)

kendra_data_sync_id: 2f97ff3c-5294-4e50-ade1-2940b418b655


In [12]:
# FIXME: What is the best way to check the sync status?
# We cannot search for a given sync execution id
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_data_source_sync_response = kendra.list_data_source_sync_jobs(
        Id = kendra_data_source_id,
        IndexId = kendra_index_id
    )

    status = None
    for history in describe_data_source_sync_response['History']:
        if history['ExecutionId'] == kendra_data_sync_id:
            status = history["Status"]
            print("DataSourceStatus: {}".format(status))
            break
    
    # FIXME: INCOMPLETE?
    if status != "SYNCING":
        break

    time.sleep(60)

DataSourceStatus: SYNCING
DataSourceStatus: INCOMPLETE
DataSourceStatus: INCOMPLETE
DataSourceStatus: INCOMPLETE
DataSourceStatus: INCOMPLETE
DataSourceStatus: INCOMPLETE
DataSourceStatus: INCOMPLETE
DataSourceStatus: INCOMPLETE


KeyboardInterrupt: 

### Sample Query
Fire query

In [13]:
kendra_response = kendra.query(
    IndexId = kendra_index_id,
    QueryText = 'Who is Sung?'
)

print(json.dumps(kendra_response, indent=2, sort_keys=True))

{
  "FacetResults": [],
  "QueryId": "b8a3b37b-630d-4a64-9656-e275a41e40e4",
  "ResponseMetadata": {
    "HTTPHeaders": {
      "content-length": "2776",
      "content-type": "application/x-amz-json-1.1",
      "date": "Sun, 31 May 2020 13:44:37 GMT",
      "x-amzn-requestid": "8d8d28bc-98cc-4ef5-9120-fd1dba4389a3"
    },
    "HTTPStatusCode": 200,
    "RequestId": "8d8d28bc-98cc-4ef5-9120-fd1dba4389a3",
    "RetryAttempts": 0
  },
  "ResultItems": [
    {
      "AdditionalAttributes": [
        {
          "Key": "AnswerText",
          "Value": {
            "TextWithHighlightsValue": {
              "Highlights": [
                {
                  "BeginOffset": 5,
                  "EndOffset": 9,
                  "TopAnswer": false
                }
              ],
              "Text": "From Sung Kim\n\n\n\t\t\t\n\n\t\t\t\n\n\n\n\n  Academic Service \n\n\n\n\n  2016 \n\n\n\t \nThe 38th International Conference on Software Engineering (ICSE 2016)\n\n\t Program Board\n\n\n\n\

## Delete the index and Wait
Makr sure you have deleted it to save unnecessary cost

In [13]:
print("Input DELETE if you want to delete: ")
if input() != "DELETE":
    raise 


# This will delete index and data sources
kendra_response = kendra.delete_index(
    Id=kendra_index_id
)

max_time = time.time() + 3*60*60 # 3 hours
try:
    while time.time() < max_time:
        describe_kendra_index_response = kendra.describe_index(
            Id = kendra_index_id
        )
        status = describe_kendra_index_response["Status"]
        print("KendraIndexStatus: {}".format(status))
            
        time.sleep(60)
except Exception as e: 
    # Should be ResourceNotFoundException
    print (str(e))
    print("Deleted")


Input DELETE if you want to delete: 


RuntimeError: No active exception to reraise

#### Delete S3 and IAM

Clean up others

In [14]:
print("Input DELETE if you want to delete: ")
if input() != "DELETE":
    raise 

# Delete Policy and IAM 
try:
    response = iam.list_attached_role_policies(RoleName=role_name)
    for policy in response["AttachedPolicies"]:
        iam.detach_role_policy(PolicyArn=policy["PolicyArn"], RoleName=role_name)
        iam.delete_policy(PolicyArn=policy["PolicyArn"])
    
    iam.delete_role(RoleName=role_name)
except Exception as e: 
    print (str(e))

# Delete S3
s3_client = boto3.Session().resource('s3')
s3_bucket = s3_client.Bucket(bucket_name)
for key in s3_bucket.objects.all():
    print("Deleting " + str(key))
    key.delete()
s3_bucket.delete()

Deleting s3.ObjectSummary(bucket_name='hunkim-kendra-sample', key='www.cse.ust.hk/robots.txt.tmp')
Deleting s3.ObjectSummary(bucket_name='hunkim-kendra-sample', key='www.cse.ust.hk/~hunkim/Bio.html')
Deleting s3.ObjectSummary(bucket_name='hunkim-kendra-sample', key='www.cse.ust.hk/~hunkim/Bio.html.orig')
Deleting s3.ObjectSummary(bucket_name='hunkim-kendra-sample', key='www.cse.ust.hk/~hunkim/Contact.html')
Deleting s3.ObjectSummary(bucket_name='hunkim-kendra-sample', key='www.cse.ust.hk/~hunkim/Contact.html.orig')
Deleting s3.ObjectSummary(bucket_name='hunkim-kendra-sample', key='www.cse.ust.hk/~hunkim/Publications.html')
Deleting s3.ObjectSummary(bucket_name='hunkim-kendra-sample', key='www.cse.ust.hk/~hunkim/Publications.html.orig')
Deleting s3.ObjectSummary(bucket_name='hunkim-kendra-sample', key='www.cse.ust.hk/~hunkim/Research.html')
Deleting s3.ObjectSummary(bucket_name='hunkim-kendra-sample', key='www.cse.ust.hk/~hunkim/Research.html.orig')
Deleting s3.ObjectSummary(bucket_name

{'ResponseMetadata': {'RequestId': 'DB7D93BEADADCEA5',
  'HostId': '/ZA4bT1+EAQ7HyQlrxaE9ZxkMIxmqtvcE9atOkWCnNDbNaV/w02V6lJWTMMhc24mWINVG9zaVQg=',
  'HTTPStatusCode': 204,
  'HTTPHeaders': {'x-amz-id-2': '/ZA4bT1+EAQ7HyQlrxaE9ZxkMIxmqtvcE9atOkWCnNDbNaV/w02V6lJWTMMhc24mWINVG9zaVQg=',
   'x-amz-request-id': 'DB7D93BEADADCEA5',
   'date': 'Sun, 31 May 2020 13:18:59 GMT',
   'server': 'AmazonS3'},
  'RetryAttempts': 0}}