# DBMaint: Create Everything
This notebook uses the AWS boto3 APIs to create the resources needed for the dbmaint example.

## AWS Resources Created
- Database   
- Changeset to add data to database   
- Scaling Group that will contain the two clusters   
- Shared Volume to contain the two views (dbmaint and query)    
- Dataviews: two, one for dbmaint and one for query
- Clusters: two, dbmaint (GP type) and query (GP type)

## Architecture
<img src="images/Deepdive Diagrams-dbmaint.drawio.png"  width="50%">


In [1]:
import os
import subprocess
import boto3
import json
import datetime

import pandas as pd  # used below to display changeset requests
import pykx as kx

from env import *
from config import *
from managed_kx import *

# set q console width and height
kx.q.system.display_size = [50, 1000]

# ----------------------------------------------------------------
# Source data directory
SOURCE_DATA_DIR="hdb"

# Code directory
CODEBASE="dbmaint"

# S3 Destinations
S3_CODE_PATH="code"
S3_DATA_PATH="data"

NODE_TYPE="kx.sg.xlarge"

MAINT_DATABASE_CONFIG=[{ 
    'databaseName': DB_NAME,
    'dataviewName': MAINT_DBVIEW_NAME
    }]

QUERY_DATABASE_CONFIG=[{ 
    'databaseName': DB_NAME,
    'dataviewName': QUERY_DBVIEW_NAME
    }]

CODE_CONFIG={ 's3Bucket': S3_BUCKET, 's3Key': f'{S3_CODE_PATH}/{CODEBASE}.zip' }

NAS1_CONFIG= {
        'type': 'SSD_250',
        'size': 1200
}

INIT_SCRIPT='init.q'
CMD_ARGS=[
    { 'key': 's', 'value': '4' }, 
    { 'key': 'AWS_ZIP_DEFAULT', 'value': '17,2,6' },
]

VPC_CONFIG={ 
    'vpcId': VPC_ID,
    'securityGroupIds': SECURITY_GROUPS,
    'subnetIds': SUBNET_IDS,
    'ipAddressType': 'IP_V4' 
}


In [2]:
# create finspace client
session = boto3.Session()
client = get_client(session=session)

# Create the Database
Create a database from the supplied data in hdb.tar.gz.  

## Untar HDB Data in hdb.tar.gz
The data is extracted into the hdb directory.

In [3]:
!rm -rf hdb

In [4]:
!tar -xf hdb.tar.gz

In [5]:
!ls -la hdb

total 68
drwxr-xr-x 12 ec2-user ec2-user  4096 Apr 24  2023 .
drwxrwxr-x  7 ec2-user ec2-user  4096 Nov 12 17:15 ..
drwxr-xr-x  3 ec2-user ec2-user  4096 Nov 12 17:15 2023.04.14
drwxr-xr-x  3 ec2-user ec2-user  4096 Nov 12 17:15 2023.04.15
drwxr-xr-x  3 ec2-user ec2-user  4096 Nov 12 17:15 2023.04.16
drwxr-xr-x  3 ec2-user ec2-user  4096 Nov 12 17:15 2023.04.17
drwxr-xr-x  3 ec2-user ec2-user  4096 Nov 12 17:15 2023.04.18
drwxr-xr-x  3 ec2-user ec2-user  4096 Apr 24  2023 2023.04.19
drwxr-xr-x  3 ec2-user ec2-user  4096 Nov 12 17:15 2023.04.20
drwxr-xr-x  3 ec2-user ec2-user  4096 Nov 12 17:15 2023.04.21
drwxr-xr-x  3 ec2-user ec2-user  4096 Nov 12 17:15 2023.04.22
drwxr-xr-x  3 ec2-user ec2-user  4096 Nov 12 17:15 2023.04.23
-rw-r--r--  1 ec2-user ec2-user 16392 Apr 24  2023 sym
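
Where a shell `tar` is not available, the same clean-and-extract step can be sketched with Python's standard `tarfile` module. The function name `extract_hdb` and its parameters are hypothetical; the sketch assumes the archive holds a top-level `hdb` directory, as the listing above shows.

```python
import shutil
import tarfile
from pathlib import Path


def extract_hdb(archive="hdb.tar.gz", target="hdb", dest="."):
    """Remove any previous extraction, then unpack the archive into dest.

    Returns the sorted names of the entries found under dest/target.
    """
    dest_path = Path(dest)
    target_path = dest_path / target

    # equivalent of `rm -rf hdb`: start from a clean directory
    shutil.rmtree(target_path, ignore_errors=True)

    with tarfile.open(archive) as tar:
        # use the safer "data" extraction filter where this Python supports it
        if hasattr(tarfile, "data_filter"):
            tar.extractall(path=dest_path, filter="data")
        else:
            tar.extractall(path=dest_path)

    return sorted(p.name for p in target_path.iterdir())
```

Calling `extract_hdb()` from the notebook directory would replace the two shell cells above; the returned names can be compared against the expected partition dates.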


## Stage HDB Data on S3
Using the AWS CLI, copy the hdb directory to the staging bucket.

In [6]:
S3_DEST=f"s3://{S3_BUCKET}/{S3_DATA_PATH}/{SOURCE_DATA_DIR}/"

cp = ""

if AWS_ACCESS_KEY_ID is not None:
    cp = f"""
export AWS_ACCESS_KEY_ID={AWS_ACCESS_KEY_ID}
export AWS_SECRET_ACCESS_KEY={AWS_SECRET_ACCESS_KEY}
export AWS_SESSION_TOKEN={AWS_SESSION_TOKEN}
"""
    
cp += f"""
aws s3 sync  --exclude .DS_Store {SOURCE_DATA_DIR} {S3_DEST} --quiet
aws s3 ls {S3_DEST}
"""
    
# execute the S3 copy
os.system(cp)

                           PRE 2023.04.14/
                           PRE 2023.04.15/
                           PRE 2023.04.16/
                           PRE 2023.04.17/
                           PRE 2023.04.18/
                           PRE 2023.04.19/
                           PRE 2023.04.20/
                           PRE 2023.04.21/
                           PRE 2023.04.22/
                           PRE 2023.04.23/
                           PRE 2024.10.22/
                           PRE 2024.10.23/
                           PRE 2024.10.24/
                           PRE 2024.10.25/
                           PRE 2024.10.28/
                           PRE 2024.10.29/
                           PRE 2024.10.30/
                           PRE 2024.10.31/
2024-11-04 20:01:11      16392 sym


0
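
As an alternative to building a shell script string, the credentials can be handed to the AWS CLI through the process environment, which keeps them out of the command text. The sketch below is one possible shape, assuming the `aws` CLI is on the PATH; the names `build_aws_env`, `sync_command`, and `sync_to_s3` are hypothetical and not part of the notebook's helper modules.

```python
import os
import subprocess


def build_aws_env(access_key=None, secret_key=None, session_token=None, base_env=None):
    """Return a copy of the environment with AWS credentials added when given."""
    env = dict(os.environ if base_env is None else base_env)
    creds = {
        "AWS_ACCESS_KEY_ID": access_key,
        "AWS_SECRET_ACCESS_KEY": secret_key,
        "AWS_SESSION_TOKEN": session_token,
    }
    # only set the values that were actually supplied
    env.update({k: v for k, v in creds.items() if v is not None})
    return env


def sync_command(source_dir, s3_dest):
    """Build the same `aws s3 sync` call used above as an argument list."""
    return ["aws", "s3", "sync", "--exclude", ".DS_Store", source_dir, s3_dest, "--quiet"]


def sync_to_s3(source_dir, s3_dest, env):
    """Run the sync; check=True raises CalledProcessError on a non-zero exit."""
    return subprocess.run(sync_command(source_dir, s3_dest), env=env, check=True)
```

With `check=True` a failed copy stops the notebook, whereas `os.system` only returns the exit status (the `0` shown above) and lets later cells run.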

## Create Managed Database
Using the AWS APIs, create a managed database in Managed kdb Insights.

In [7]:
# assume it exists
create_db=False

try:
    resp = client.get_kx_database(environmentId=ENV_ID, databaseName=DB_NAME)
    resp.pop('ResponseMetadata', None)
except client.exceptions.ResourceNotFoundException:
    # does not exist, will create
    create_db=True

if create_db:
    print(f"CREATING Database: {DB_NAME}")
    resp = client.create_kx_database(environmentId=ENV_ID, databaseName=DB_NAME, description="Basictick kdb database")
    resp.pop('ResponseMetadata', None)

    print(f"CREATED Database: {DB_NAME}")

print(json.dumps(resp,sort_keys=True,indent=4,default=str))

CREATING Database: dbmaintdb
CREATED Database: dbmaintdb
{
    "createdTimestamp": "2024-11-12 17:15:28.315000+00:00",
    "databaseArn": "arn:aws:finspace:us-east-1:829845998889:kxEnvironment/jlcenjvtkgzrdek2qqv7ic/kxDatabase/dbmaintdb",
    "databaseName": "dbmaintdb",
    "description": "Basictick kdb database",
    "environmentId": "jlcenjvtkgzrdek2qqv7ic",
    "lastModifiedTimestamp": "2024-11-12 17:15:28.315000+00:00"
}
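
The get-or-create step can also be written as a small reusable function that creates the database only when the lookup reports "not found" and re-raises every other error, such as a permissions or throttling failure. This is a sketch under two assumptions: the client raises botocore-style exceptions that carry `response["Error"]["Code"]`, and a missing database is reported with the code `ResourceNotFoundException`. The names `error_code` and `get_or_create_database` are hypothetical.

```python
def error_code(exc):
    """Read the AWS error code from a botocore-style exception, if present."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code")


def get_or_create_database(client, environment_id, database_name, description=""):
    """Return (response, created) for the named database."""
    try:
        resp = client.get_kx_database(
            environmentId=environment_id, databaseName=database_name
        )
        created = False
    except Exception as exc:
        # anything other than "not found" is a real failure: re-raise it
        if error_code(exc) != "ResourceNotFoundException":
            raise
        resp = client.create_kx_database(
            environmentId=environment_id,
            databaseName=database_name,
            description=description,
        )
        created = True
    resp.pop("ResponseMetadata", None)
    return resp, created
```

The same pattern would apply to the scaling group, volume, dataview, and cluster steps later in the notebook.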


## Add HDB Data to Database
Add the data in the local hdb directory to the managed database using the changeset mechanism. The data was copied to S3 in the previous step and is now ingested with the create-kx-changeset API.

In [8]:
# Check if there is a changeset in the database, if so, no need to add another
c_set_list = list_kx_changesets(client, environmentId=ENV_ID, databaseName=DB_NAME)

if len(c_set_list) == 0:

    changes=[]

    for f in os.listdir(f"{SOURCE_DATA_DIR}"):
        if os.path.isdir(f"{SOURCE_DATA_DIR}/{f}"):
            changes.append( { 'changeType': 'PUT', 's3Path': f"{S3_DEST}{f}/", 'dbPath': f"/{f}/" } )
        else:
            changes.append( { 'changeType': 'PUT', 's3Path': f"{S3_DEST}{f}", 'dbPath': f"/" } )

    resp = client.create_kx_changeset(environmentId=ENV_ID, databaseName=DB_NAME, 
        changeRequests=changes)

    resp.pop('ResponseMetadata', None)
    changeset_id = resp['changesetId']

    print("Changeset...")
    print(json.dumps(resp,sort_keys=True,indent=4,default=str))
else:
    c_set_list=sorted(c_set_list, key=lambda d: d['createdTimestamp']) 
    changeset_id=c_set_list[-1]['changesetId']
    print(f"Using Last changeset: {changeset_id}")
    

Changeset...
{
    "changeRequests": [
        {
            "changeType": "PUT",
            "dbPath": "/2023.04.14/",
            "s3Path": "s3://kdb-demo-829845998889-kms/data/hdb/2023.04.14/"
        },
        {
            "changeType": "PUT",
            "dbPath": "/2023.04.16/",
            "s3Path": "s3://kdb-demo-829845998889-kms/data/hdb/2023.04.16/"
        },
        {
            "changeType": "PUT",
            "dbPath": "/2023.04.22/",
            "s3Path": "s3://kdb-demo-829845998889-kms/data/hdb/2023.04.22/"
        },
        {
            "changeType": "PUT",
            "dbPath": "/2023.04.20/",
            "s3Path": "s3://kdb-demo-829845998889-kms/data/hdb/2023.04.20/"
        },
        {
            "changeType": "PUT",
            "dbPath": "/2023.04.23/",
            "s3Path": "s3://kdb-demo-829845998889-kms/data/hdb/2023.04.23/"
        },
        {
            "changeType": "PUT",
            "dbPath": "/2023.04.15/",
            "s3Path": "s3://kdb-demo-829
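
The mapping from local entries to change requests can be isolated in a function so that it can be checked without calling AWS. The sketch below follows the rule used in the cell above: a partition directory is written to a database path of the same name, and a top-level file such as `sym` is written to the database root. The name `build_change_requests` is hypothetical, and sorting the entries is an addition that only makes the request order reproducible.

```python
import os


def build_change_requests(source_dir, s3_dest):
    """Map each entry of source_dir to a PUT change request.

    s3_dest is expected to end with "/", as S3_DEST does above.
    """
    changes = []
    # sorted() only makes the request order reproducible
    for name in sorted(os.listdir(source_dir)):
        if os.path.isdir(os.path.join(source_dir, name)):
            # a partition directory keeps its name inside the database
            changes.append(
                {"changeType": "PUT", "s3Path": f"{s3_dest}{name}/", "dbPath": f"/{name}/"}
            )
        else:
            # a top-level file such as sym goes to the database root
            changes.append(
                {"changeType": "PUT", "s3Path": f"{s3_dest}{name}", "dbPath": "/"}
            )
    return changes
```

The returned list has the shape passed as `changeRequests` in the cell above.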

In [9]:
wait_for_changeset_status(client, environmentId=ENV_ID, databaseName=DB_NAME, changesetId=changeset_id, show_wait=True)
print("**Done**")

Status is IN_PROGRESS, total wait 0:00:00, waiting 10 sec ...
Status is IN_PROGRESS, total wait 0:00:10, waiting 10 sec ...
**Done**
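
`wait_for_changeset_status` comes from the `managed_kx` module and its implementation is not shown in this notebook. A helper of that kind can be sketched as a generic polling loop. The name `wait_for_status`, its parameters, and the choice of `COMPLETED` and `FAILED` as terminal states are assumptions made for illustration.

```python
import time


def wait_for_status(get_status, target="COMPLETED", failed=("FAILED",),
                    interval=10, timeout=600, sleep=time.sleep):
    """Poll get_status() until it returns target.

    Raises RuntimeError on a failed status and TimeoutError once the
    accumulated wait reaches timeout seconds.
    """
    waited = 0
    while True:
        status = get_status()
        if status == target:
            return status
        if status in failed:
            raise RuntimeError(f"status is {status}")
        if waited >= timeout:
            raise TimeoutError(f"still {status} after {waited} sec")
        print(f"Status is {status}, total wait {waited} sec, waiting {interval} sec ...")
        sleep(interval)
        waited += interval
```

For a changeset, `get_status` could be a lambda that calls `client.get_kx_changeset(...)` and returns the `status` field of the response. Passing `sleep` as a parameter lets the loop be exercised without real delays.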


In [10]:
note_str = ""

c_set_list = list_kx_changesets(client, environmentId=ENV_ID, databaseName=DB_NAME)

if len(c_set_list) == 0:
    note_str = "<<Could not get changesets>>"
    
print(100*"=")
print(f"Database: {DB_NAME}, Changesets: {len(c_set_list)} {note_str}")
print(100*"=")

# sort by create time
c_set_list = sorted(c_set_list, key=lambda d: d['createdTimestamp']) 

for c in c_set_list:
    c_set_id = c['changesetId']
    print(f"  Changeset: {c_set_id}: Created: {c['createdTimestamp']} ({c['status']})")
    c_rqs = client.get_kx_changeset(environmentId=ENV_ID, databaseName=DB_NAME, changesetId=c_set_id)['changeRequests']

    chs_pdf = pd.DataFrame.from_dict(c_rqs).style.hide(axis='index')
    display(chs_pdf)

Database: dbmaintdb, Changesets: 1 
  Changeset: VsmQr8Rc9WND6QaFsw2bdg: Created: 2024-11-12 17:15:29.339000+00:00 (COMPLETED)


changeType,s3Path,dbPath
PUT,s3://kdb-demo-829845998889-kms/data/hdb/2023.04.14/,/2023.04.14/
PUT,s3://kdb-demo-829845998889-kms/data/hdb/2023.04.16/,/2023.04.16/
PUT,s3://kdb-demo-829845998889-kms/data/hdb/2023.04.22/,/2023.04.22/
PUT,s3://kdb-demo-829845998889-kms/data/hdb/2023.04.20/,/2023.04.20/
PUT,s3://kdb-demo-829845998889-kms/data/hdb/2023.04.23/,/2023.04.23/
PUT,s3://kdb-demo-829845998889-kms/data/hdb/2023.04.15/,/2023.04.15/
PUT,s3://kdb-demo-829845998889-kms/data/hdb/2023.04.18/,/2023.04.18/
PUT,s3://kdb-demo-829845998889-kms/data/hdb/2023.04.17/,/2023.04.17/
PUT,s3://kdb-demo-829845998889-kms/data/hdb/sym,/
PUT,s3://kdb-demo-829845998889-kms/data/hdb/2023.04.19/,/2023.04.19/


# Create Scaling Group
The scaling group represents the total compute available to the application. All clusters will be placed into the scaling group and share its compute and memory.

In [11]:
# Check if scaling group exists, only create if it does not
resp = get_kx_scaling_group(client=client, environmentId=ENV_ID, scalingGroupName=SCALING_GROUP_NAME)

if resp is None:
    resp = client.create_kx_scaling_group(
        environmentId = ENV_ID, 
        scalingGroupName = SCALING_GROUP_NAME,
        hostType=NODE_TYPE,
        availabilityZoneId = AZ_ID
    )
else:
    print(f"Scaling Group {SCALING_GROUP_NAME} exists")

Scaling Group SCALING_GROUP_dbmaint exists


# Create Shared Volume
The shared volume is a common storage device for the application. Every cluster using the shared volume has a writable directory named after the cluster and can read the directories named after the other clusters in the application that use the volume. Also, there is a common 

In [12]:
# Check if volume already exists before trying to create one
resp = get_kx_volume(client=client, environmentId=ENV_ID, volumeName=VOLUME_NAME)

if resp is None:
    resp = client.create_kx_volume(
        environmentId = ENV_ID, 
        volumeType = 'NAS_1',
        volumeName = VOLUME_NAME,
        description = 'Shared volume',
        nas1Configuration = NAS1_CONFIG,
        azMode='SINGLE',
        availabilityZoneIds=[ AZ_ID ]    
    )
else:
    print(f"Volume {VOLUME_NAME} exists")    

Volume DBMAINT_VOLUME exists


# Wait for Volume and Scaling Group
Before using the volume and scaling group, wait for their creation to complete.

In [13]:
# wait for the scaling group to create
wait_for_scaling_group_status(client=client, environmentId=ENV_ID, scalingGroupName=SCALING_GROUP_NAME, show_wait=True)
print("** DONE **")

# wait for the volume to create
wait_for_volume_status(client=client, environmentId=ENV_ID, volumeName=VOLUME_NAME, show_wait=True)
print("** DONE **")

Scaling Group: SCALING_GROUP_dbmaint status is now ACTIVE, total wait 0:00:00
** DONE **
Volume: DBMAINT_VOLUME status is now ACTIVE, total wait 0:00:00
** DONE **


# Create Dataviews
Create two dataviews for a specific (static) version of the database and have their data cached using the shared volume.

In [14]:
# Check if the dataview already exists, only create it if it does not
resp = get_kx_dataview(client=client, environmentId=ENV_ID, databaseName=DB_NAME, dataviewName=MAINT_DBVIEW_NAME)

if resp is None:
    # sort by create time
    c_set_list = sorted(c_set_list, key=lambda d: d['createdTimestamp']) 

    resp = client.create_kx_dataview(
        environmentId = ENV_ID, 
        databaseName=DB_NAME, 
        dataviewName=MAINT_DBVIEW_NAME,
        azMode='SINGLE',
        availabilityZoneId=AZ_ID,
        changesetId=c_set_list[-1]['changesetId'],
        segmentConfigurations=[
            { 
                'dbPaths': ['/*'],
                'volumeName': VOLUME_NAME,
                'onDemand': True,
            }
        ],
        autoUpdate=False,
        readWrite=True,
        description = f'Dataview of database {DB_NAME}'
    )
else:
    print(f"Dataview {MAINT_DBVIEW_NAME} exists")            

In [15]:
# Check if the dataview already exists, only create it if it does not
resp = get_kx_dataview(client=client, environmentId=ENV_ID, databaseName=DB_NAME, dataviewName=QUERY_DBVIEW_NAME)

if resp is None:
    # sort by create time
    c_set_list = sorted(c_set_list, key=lambda d: d['createdTimestamp']) 

    resp = client.create_kx_dataview(
        environmentId = ENV_ID, 
        databaseName=DB_NAME, 
        dataviewName=QUERY_DBVIEW_NAME,
        azMode='SINGLE',
        availabilityZoneId=AZ_ID,
        changesetId=c_set_list[-1]['changesetId'],
        segmentConfigurations=[
            { 
                'dbPaths': ['/*'],
                'volumeName': VOLUME_NAME,
            }
        ],
        autoUpdate=False,
        description = f'Dataview of database {DB_NAME}'
    )
else:
    print(f"Dataview {QUERY_DBVIEW_NAME} exists")            

In [16]:
# wait for the view to create
for v in all_views:
    wait_for_dataview_status(client=client, environmentId=ENV_ID, databaseName=DB_NAME, dataviewName=v, show_wait=True)
print("** DONE **")

Dataview: dbmaintdb_DBVIEW_MAINT status is CREATING, total wait 0:00:00, waiting 30 sec ...
Dataview: dbmaintdb_DBVIEW_MAINT status is CREATING, total wait 0:00:30, waiting 30 sec ...
Dataview: dbmaintdb_DBVIEW_MAINT status is now ACTIVE, total wait 0:01:00
Dataview: dbmaintdb_DBVIEW_QUERY status is CREATING, total wait 0:00:00, waiting 30 sec ...
Dataview: dbmaintdb_DBVIEW_QUERY status is CREATING, total wait 0:00:30, waiting 30 sec ...
Dataview: dbmaintdb_DBVIEW_QUERY status is CREATING, total wait 0:01:00, waiting 30 sec ...
Dataview: dbmaintdb_DBVIEW_QUERY status is CREATING, total wait 0:01:30, waiting 30 sec ...
Dataview: dbmaintdb_DBVIEW_QUERY status is CREATING, total wait 0:02:00, waiting 30 sec ...
Dataview: dbmaintdb_DBVIEW_QUERY status is now ACTIVE, total wait 0:02:30
** DONE **
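
The two `create_kx_dataview` calls above differ only in the dataview name and in two settings that the maintenance view adds: `onDemand` on the segment and `readWrite` on the view. One way to make that difference explicit is to build the keyword arguments in a single function. This is a sketch; the name `dataview_request` and the `writable` flag are hypothetical.

```python
def dataview_request(environment_id, database_name, dataview_name, az_id,
                     changeset_id, volume_name, writable=False):
    """Build keyword arguments for create_kx_dataview.

    writable=True adds the two settings that only the maintenance view uses
    in the cells above: onDemand on the segment and readWrite on the view.
    """
    segment = {"dbPaths": ["/*"], "volumeName": volume_name}
    request = {
        "environmentId": environment_id,
        "databaseName": database_name,
        "dataviewName": dataview_name,
        "azMode": "SINGLE",
        "availabilityZoneId": az_id,
        "changesetId": changeset_id,
        "segmentConfigurations": [segment],
        "autoUpdate": False,
        "description": f"Dataview of database {database_name}",
    }
    if writable:
        segment["onDemand"] = True
        request["readWrite"] = True
    return request
```

The result would be passed as `client.create_kx_dataview(**request)`, once with `writable=True` for the dbmaint view and once with the default for the query view.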


# Create Clusters
With the foundational resources now created, create the clusters needed for the application.

## Stage Code to S3
Code used in this application must be staged to an S3 bucket that the service can read from. That code is then deployed to the clusters as part of their creation workflow.

In [17]:
# zip the code
os.system(f"cd {CODEBASE}; zip -r -X ../{CODEBASE}.zip . -x '*.ipynb_checkpoints*';")

cp = ""

# copy code to S3
if AWS_ACCESS_KEY_ID is not None:
    cp = f"""
export AWS_ACCESS_KEY_ID={AWS_ACCESS_KEY_ID}
export AWS_SECRET_ACCESS_KEY={AWS_SECRET_ACCESS_KEY}
export AWS_SESSION_TOKEN={AWS_SESSION_TOKEN}
"""

cp += f"""
aws s3 cp  --exclude .DS_Store {CODEBASE}.zip s3://{S3_BUCKET}/{S3_CODE_PATH}/{CODEBASE}.zip
aws s3 ls s3://{S3_BUCKET}/{S3_CODE_PATH}/
"""
    
# execute the S3 copy
os.system(cp)

updating: initdb.q (deflated 22%)
updating: dbmaint.q (deflated 66%)
updating: init.q (deflated 13%)
upload: ./dbmaint.zip to s3://kdb-demo-829845998889-kms/code/dbmaint.zip
2023-06-05 21:25:21          0 
2024-11-01 13:59:23      16585 basictick.zip
2024-11-07 17:07:12       1184 bmll.zip
2024-11-06 21:20:01        455 code.zip
2023-12-21 19:47:37        574 codebundle.zip
2024-02-02 21:34:56        582 codebundle1.zip
2023-12-21 21:26:00        582 codebundle2.zip
2024-11-12 17:19:28       2607 dbmaint.zip
2024-09-04 17:42:17        556 foo.q.zip
2023-11-22 14:58:53       1530 jpmc_code.zip
2024-01-01 19:57:08      33781 kdb-tick-flat-largetable.zip
2023-12-30 22:56:33      38867 kdb-tick-flat.zip
2024-01-08 13:05:33      28741 kdb-tick.zip
2023-08-22 16:58:18        765 qcode.zip
2024-10-16 22:31:45        465 taqcode.zip
2024-04-26 16:38:46     487423 torq_app.zip
2024-03-06 19:01:11    5807282 torq_app_20240306_1901.zip
2024-03-06 19:13:22    5807290 torq_app_20240306_1913.zip
202

0
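
The `updating:` lines in the output above show that the shell `zip -r` call modifies an existing archive in place. A sketch of the same packaging step with Python's standard `zipfile` module, which rewrites the archive from scratch on every run, is given below. The name `zip_codebase` and the `skip_dirs` parameter are hypothetical, and the archive path is assumed to lie outside the code directory, as it does in the cell above.

```python
import os
import zipfile


def zip_codebase(code_dir, zip_path, skip_dirs=(".ipynb_checkpoints",)):
    """Write every file under code_dir into zip_path, relative to code_dir.

    Returns the archive names in the order they were written.
    """
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(code_dir):
            # prune skipped directories in place so os.walk does not descend
            dirs[:] = sorted(d for d in dirs if d not in skip_dirs)
            for name in sorted(files):
                full = os.path.join(root, name)
                arcname = os.path.relpath(full, code_dir)
                zf.write(full, arcname)
                names.append(arcname)
    return names
```

For this notebook the call would be `zip_codebase(CODEBASE, f"{CODEBASE}.zip")`, after which the archive is copied to S3 as before.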

## Create Clusters

Create one cluster for performing dbmaint and another for queries.

In [18]:
# cluster already exists?
resp = get_kx_cluster(client, environmentId=ENV_ID, clusterName=MAINT_CLUSTER_NAME)

if resp is None:
    resp = client.create_kx_cluster(
        environmentId=ENV_ID, 
        clusterName=MAINT_CLUSTER_NAME,
        clusterType='GP',
        releaseLabel = '1.0',
        executionRole=EXECUTION_ROLE,
        databases=MAINT_DATABASE_CONFIG,
        scalingGroupConfiguration={
            'memoryReservation': 6,
            'nodeCount': 1,
            'scalingGroupName': SCALING_GROUP_NAME,
        },
        clusterDescription=f"{MAINT_CLUSTER_NAME} cluster created with create_all notebook",
        code=CODE_CONFIG,
        initializationScript=INIT_SCRIPT,
        commandLineArguments=CMD_ARGS,
        azMode=AZ_MODE,
        availabilityZoneId=AZ_ID,
        vpcConfiguration=VPC_CONFIG
    )
else:
    print(f"Cluster: {MAINT_CLUSTER_NAME} already exists")  
    

In [19]:
# cluster already exists?
resp = get_kx_cluster(client, environmentId=ENV_ID, clusterName=QUERY_CLUSTER_NAME)

if resp is None:
    resp = client.create_kx_cluster(
        environmentId=ENV_ID, 
        clusterName=QUERY_CLUSTER_NAME,
        clusterType='GP',
        releaseLabel = '1.0',
        executionRole=EXECUTION_ROLE,
        databases=QUERY_DATABASE_CONFIG,
        scalingGroupConfiguration={
            'memoryReservation': 6,
            'nodeCount': 1,
            'scalingGroupName': SCALING_GROUP_NAME,
        },
        clusterDescription=f"{QUERY_CLUSTER_NAME} cluster created with create_all notebook",
        code=CODE_CONFIG,
        initializationScript="initdb.q",
        commandLineArguments=CMD_ARGS,
        azMode=AZ_MODE,
        availabilityZoneId=AZ_ID,
        vpcConfiguration=VPC_CONFIG
    )
else:
    print(f"Cluster: {QUERY_CLUSTER_NAME} already exists")  
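
The two `create_kx_cluster` calls above share most of their arguments and differ in the cluster name, the database configuration, the initialization script, and the description derived from the name. A sketch of merging shared settings with the per-cluster values follows; the name `cluster_request` and the idea of a shared `base` dictionary are hypothetical.

```python
def cluster_request(base, cluster_name, databases, init_script):
    """Merge shared settings with the values that differ per cluster."""
    request = dict(base)  # shallow copy so the shared dict is not modified
    request.update(
        clusterName=cluster_name,
        databases=databases,
        initializationScript=init_script,
        clusterDescription=f"{cluster_name} cluster created with create_all notebook",
    )
    return request
```

Here `base` would hold the common keys from the cells above, such as `environmentId`, `clusterType`, `releaseLabel`, `executionRole`, `scalingGroupConfiguration`, `code`, `commandLineArguments`, `azMode`, `availabilityZoneId`, and `vpcConfiguration`, and the result would be passed as `client.create_kx_cluster(**request)`.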
    

### Wait for Clusters to Create

In [20]:
for c in all_clusters:
    wait_for_cluster_status(client, environmentId=ENV_ID, clusterName=c, show_wait=True)
print("** ALL DONE **")

Cluster: dbmaint_cluster_maint status is PENDING, total wait 0:00:00, waiting 30 sec ...
Cluster: dbmaint_cluster_maint status is CREATING, total wait 0:00:30, waiting 30 sec ...
Cluster: dbmaint_cluster_maint status is CREATING, total wait 0:01:00, waiting 30 sec ...
Cluster: dbmaint_cluster_maint status is CREATING, total wait 0:01:30, waiting 30 sec ...
Cluster: dbmaint_cluster_maint status is CREATING, total wait 0:02:00, waiting 30 sec ...
Cluster: dbmaint_cluster_maint status is CREATING, total wait 0:02:30, waiting 30 sec ...
Cluster: dbmaint_cluster_maint status is CREATING, total wait 0:03:00, waiting 30 sec ...
Cluster: dbmaint_cluster_maint status is CREATING, total wait 0:03:30, waiting 30 sec ...
Cluster: dbmaint_cluster_maint status is CREATING, total wait 0:04:00, waiting 30 sec ...
Cluster: dbmaint_cluster_maint status is CREATING, total wait 0:04:30, waiting 30 sec ...
Cluster: dbmaint_cluster_maint status is CREATING, total wait 0:05:00, waiting 30 sec ...
Cluster: db

# All Processes Running

In [21]:
print( f"Last Run: {datetime.datetime.now()}" )

Last Run: 2024-11-12 17:32:43.494714
