# CSV Processing: Create Resources

This notebook creates all the infrastructure needed to demonstrate how to use FinSpace with Managed kdb Insights to process csv.gz files on S3 into a managed database using managed clusters.

## Architecture
![Architecture](csv_arch.png "Architecture")

## Imports and Constants
Import the necessary Python libraries and define global variables.

In [1]:
import os
import subprocess
import boto3
import json
import datetime

import pandas as pd
import pykx as kx

from managed_kx import *
from env import *

# ----------------------------------------------------------------
DB_NAME="DEMO_DB"
DBVIEW_NAME=f"{DB_NAME}_VIEW"
SCALING_GROUP_NAME="DEMO_SCALING_GROUP"
VOLUME_NAME="DEMO_SHARED_VOLUME"
CODEBASE="demo"
CLUSTER_NAME="demo_csv_cluster"

HDB_CLUSTER_NAME="demo_hdb_cluster"

# S3 Destinations
S3_CODE_PATH="code"
S3_DATA_PATH="data"
SOURCE_DATA_DIR="demo"

# this file will seed the database (used for table schema)
CSV_FILE='AMZN-100.csv.gz'
# ----------------------------------------------------------------

NODE_TYPE="kx.sg.4xlarge"

DATABASE_CONFIG=[{ 
    'databaseName': DB_NAME,
    'dataviewName': DBVIEW_NAME
    }]
CODE_CONFIG={ 's3Bucket': S3_BUCKET, 's3Key': f'{S3_CODE_PATH}/{CODEBASE}.zip' }

NAS1_CONFIG= {
        'type': 'SSD_250',
        'size': 1200
}

CMD_ARGS=[
    { 'key': 's', 'value': '4' }, 
    { 'key': 'dbname', 'value': DB_NAME}, 
    { 'key': 'AWS_ZIP_DEFAULT', 'value': '17,2,6' },
]

HDB_CMD_ARGS=[
    { 'key': 's', 'value': '4' }, 
]

# Local q instance
gp = kx.q

In [2]:
# Get credentials and create service client
session=None

if AWS_ACCESS_KEY_ID is None:
    print("Using Defaults ...")
    # create AWS session: using the default credential chain
    session = boto3.Session()
else:
    print("Using variables ...")
    session = boto3.Session(
        aws_access_key_id=AWS_ACCESS_KEY_ID,
        aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
        aws_session_token=AWS_SESSION_TOKEN
    )

# create finspace client
client = session.client(service_name='finspace', endpoint_url=ENDPOINT_URL)

Using Defaults ...
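The credential branch above can also be written as a small helper that builds the keyword arguments for `boto3.Session`. This is a minimal sketch, not part of the notebook: the names `session_kwargs` and `make_finspace_client` are hypothetical, and `boto3` is imported lazily so the first helper can be exercised without it.

```python
from typing import Optional


def session_kwargs(access_key: Optional[str],
                   secret_key: Optional[str],
                   session_token: Optional[str]) -> dict:
    """Return keyword arguments for boto3.Session.

    An empty dict means "let boto3 resolve credentials on its own".
    """
    if access_key is None:
        return {}
    kwargs = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }
    # A session token is only present for temporary credentials.
    if session_token is not None:
        kwargs["aws_session_token"] = session_token
    return kwargs


def make_finspace_client(access_key=None, secret_key=None,
                         session_token=None, endpoint_url=None):
    """Hypothetical wrapper: build a session, then the finspace client."""
    import boto3  # lazy import, only needed when a client is really created

    session = boto3.Session(**session_kwargs(access_key, secret_key, session_token))
    return session.client(service_name="finspace", endpoint_url=endpoint_url)
```

With the variables from `env`, the call would be `make_finspace_client(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN, ENDPOINT_URL)`.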


# Create Managed Database
Create a managed database in Managed kdb Insights using the API [create_kx_database](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/finspace/client/create_kx_database.html).

In [3]:
# assume it exists
create_db=False

try:
    resp = client.get_kx_database(environmentId=ENV_ID, databaseName=DB_NAME)
    resp.pop('ResponseMetadata', None)
except client.exceptions.ResourceNotFoundException:
    # does not exist, will create
    create_db=True

if create_db:
    print(f"CREATING Database: {DB_NAME}")
    resp = client.create_kx_database(environmentId=ENV_ID, databaseName=DB_NAME, description="Basictick kdb database")
    resp.pop('ResponseMetadata', None)

    print(f"CREATED Database: {DB_NAME}")

print(json.dumps(resp,sort_keys=True,indent=4,default=str))

CREATING Database: DEMO_DB
CREATED Database: DEMO_DB
{
    "createdTimestamp": "2024-05-09 13:55:16.925000+00:00",
    "databaseArn": "arn:aws:finspace:us-east-1:829845998889:kxEnvironment/jlcenjvtkgzrdek2qqv7ic/kxDatabase/DEMO_DB",
    "databaseName": "DEMO_DB",
    "description": "Basictick kdb database",
    "environmentId": "jlcenjvtkgzrdek2qqv7ic",
    "lastModifiedTimestamp": "2024-05-09 13:55:16.925000+00:00"
}
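The check-then-create step can be packaged as a reusable function. The sketch below is illustrative only: `get_or_create_database` is a hypothetical name, and it assumes the client signals a missing database with `ResourceNotFoundException`, so that other errors (for example permission problems) are not mistaken for "does not exist".

```python
def get_or_create_database(client, environment_id, database_name, description=None):
    """Return (response, created) for a Managed kdb Insights database.

    Assumes `client` is a boto3 finspace client, or any object exposing the
    same two methods and an `exceptions.ResourceNotFoundException` class.
    """
    try:
        resp = client.get_kx_database(
            environmentId=environment_id, databaseName=database_name
        )
        created = False
    except client.exceptions.ResourceNotFoundException:
        kwargs = {"environmentId": environment_id, "databaseName": database_name}
        # Only send a description when one was supplied.
        if description:
            kwargs["description"] = description
        resp = client.create_kx_database(**kwargs)
        created = True

    # Drop the HTTP bookkeeping so the result prints cleanly.
    resp.pop("ResponseMetadata", None)
    return resp, created
```

Calling it twice with the same name should create the database once and simply return it the second time.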


# Create and Stage Data on S3
## Create Empty Database
Using a local q instance, create an in-memory table from a small CSV file, then empty it so that only the schema remains, and add it to the managed database. Think of this as the first date partition of the database; the table is deliberately left empty.


In [4]:
!rm -rf demo

In [5]:
# have the local q instance process the csv.gz file into a table
gp(f'''
taq:("DNSSFJSS";enlist csv) 0: .Q.gz "c"$read1 hsym `$"{CSV_FILE}";
''')

# delete the Date column
gp("delete Date from `taq")

# delete all rows from table
gp("delete from `taq")

# set attribute on table
gp("update `g#Ticker from `taq")

# schema 
display(gp('meta taq'))

# table contents
display(gp('taq').pd())

c,t,f,a
Timestamp,n,,
EventType,s,,
Ticker,s,,g
Price,f,,
Quantity,j,,
Exchange,s,,
Conditions,s,,


Timestamp,EventType,Ticker,Price,Quantity,Exchange,Conditions
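In the load expression, the string `"DNSSFJSS"` supplies one q type character per CSV column (D date, N timespan, S symbol, F float, J long). Before handing a file to q, it can be useful to check that the header has as many columns as the type string has characters. The sketch below does that with the Python standard library only; the function name is hypothetical and the sample header is an assumption inferred from the schema shown in this notebook.

```python
import csv
import gzip
import os
import tempfile

# q type characters that appear in the load expression of this notebook.
Q_TYPE_NAMES = {"D": "date", "N": "timespan", "S": "symbol", "F": "float", "J": "long"}


def pair_columns_with_types(csv_gz_path, q_types):
    """Pair each header column of a gzipped CSV with its q type name."""
    with gzip.open(csv_gz_path, "rt", newline="") as fh:
        header = next(csv.reader(fh))
    if len(header) != len(q_types):
        raise ValueError(
            f"{len(header)} columns but {len(q_types)} type characters"
        )
    return [(col, Q_TYPE_NAMES[t]) for col, t in zip(header, q_types)]


# Demo on a throwaway file; the header below is assumed, not read from the real data.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.csv.gz")
    with gzip.open(sample, "wt", newline="") as fh:
        fh.write("Date,Timestamp,EventType,Ticker,Price,Quantity,Exchange,Conditions\n")
    pairs = pair_columns_with_types(sample, "DNSSFJSS")

for column, type_name in pairs:
    print(f"{column}: {type_name}")
```

A mismatch raises `ValueError` early, which is easier to diagnose than a load error inside q.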


In [6]:
# Save the table in a date partition (Jan 1 2024)
kx.q('''
d:2024.01.01;
path:"demo";

{.Q.dpft[hsym`$x;y;`Ticker;z]}[path;d] each tables`.;
''')


pykx.Identity(pykx.q('::'))

### Contents of Database on Disk
Show the files of the saved database on the local file system.

In [7]:
!ls demo

2024.01.01  sym


In [8]:
!ls -la demo/2024.01.01/*

total 36
drwxrwxr-x 2 ec2-user ec2-user  127 May  9 13:55 .
drwxrwxr-x 3 ec2-user ec2-user   17 May  9 13:55 ..
-rw-rw-r-- 1 ec2-user ec2-user 4096 May  9 13:55 Conditions
-rw-rw-r-- 1 ec2-user ec2-user   70 May  9 13:55 .d
-rw-rw-r-- 1 ec2-user ec2-user 4096 May  9 13:55 EventType
-rw-rw-r-- 1 ec2-user ec2-user 4096 May  9 13:55 Exchange
-rw-rw-r-- 1 ec2-user ec2-user   16 May  9 13:55 Price
-rw-rw-r-- 1 ec2-user ec2-user   16 May  9 13:55 Quantity
-rw-rw-r-- 1 ec2-user ec2-user 4176 May  9 13:55 Ticker
-rw-rw-r-- 1 ec2-user ec2-user   16 May  9 13:55 Timestamp


### Stage the Database on S3
Copy the database files to S3 using the AWS CLI command [aws s3 sync](https://docs.aws.amazon.com/cli/latest/reference/s3/).

In [9]:
# Stage the local hdb database to S3
S3_DEST=f"s3://{S3_BUCKET}/{S3_DATA_PATH}/{SOURCE_DATA_DIR}/"

if AWS_ACCESS_KEY_ID is not None:
    cp = f"""
export AWS_ACCESS_KEY_ID={AWS_ACCESS_KEY_ID}
export AWS_SECRET_ACCESS_KEY={AWS_SECRET_ACCESS_KEY}
export AWS_SESSION_TOKEN={AWS_SESSION_TOKEN}

aws s3 rm --recursive {S3_DEST} --quiet
aws s3 sync --exclude .DS_Store {SOURCE_DATA_DIR} {S3_DEST} --quiet
aws s3 ls {S3_DEST}
"""
else:
    cp = f"""
aws s3 rm --recursive {S3_DEST} --quiet
aws s3 sync --exclude .DS_Store {SOURCE_DATA_DIR} {S3_DEST} --quiet
aws s3 ls {S3_DEST}
"""
    
# execute the S3 copy
os.system(cp)

                           PRE 2024.01.01/
2024-05-09 13:55:19          8 sym


0
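An alternative to building a shell script string is to pass the credentials through the process environment and run each CLI command with `subprocess.run`. This keeps secrets out of the command text and avoids shell quoting issues. The following is a sketch under that assumption; the helper names are hypothetical, and the three commands are the same ones used in the cell above.

```python
import os
import subprocess


def aws_env(base_env, access_key=None, secret_key=None, session_token=None):
    """Copy an environment mapping and add AWS credentials when provided."""
    env = dict(base_env)
    if access_key is not None:
        env["AWS_ACCESS_KEY_ID"] = access_key
        env["AWS_SECRET_ACCESS_KEY"] = secret_key
        if session_token is not None:
            env["AWS_SESSION_TOKEN"] = session_token
    return env


def staging_commands(source_dir, s3_dest):
    """Return the CLI invocations as argument lists (no shell involved)."""
    return [
        ["aws", "s3", "rm", "--recursive", s3_dest, "--quiet"],
        ["aws", "s3", "sync", "--exclude", ".DS_Store", source_dir, s3_dest, "--quiet"],
        ["aws", "s3", "ls", s3_dest],
    ]


def stage_to_s3(source_dir, s3_dest, env=None):
    """Run the commands in order and return their exit codes."""
    env = aws_env(os.environ) if env is None else env
    return [
        subprocess.run(cmd, env=env).returncode
        for cmd in staging_commands(source_dir, s3_dest)
    ]
```

In this notebook the call would look like `stage_to_s3(SOURCE_DATA_DIR, S3_DEST, aws_env(os.environ, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN))`.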

## Add Data to Database
Add the disk data to the managed database using the API [create_kx_changeset](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/finspace/client/create_kx_changeset.html)

In [10]:
changes=[]

dir_list = os.listdir(f"{SOURCE_DATA_DIR}")

for f in dir_list:
    if os.path.isdir(f"{SOURCE_DATA_DIR}/{f}"):
        changes.append( { 'changeType': 'PUT', 's3Path': f"{S3_DEST}{f}/", 'dbPath': f"/{f}/" } )
    else:
        changes.append( { 'changeType': 'PUT', 's3Path': f"{S3_DEST}{f}", 'dbPath': f"/" } )

if len(dir_list) == 0:
    changes.append( { 'changeType': 'PUT', 's3Path': f"{S3_DEST}", 'dbPath': f"/" } )
        
resp = client.create_kx_changeset(environmentId=ENV_ID, databaseName=DB_NAME, 
    changeRequests=changes)

resp.pop('ResponseMetadata', None)
changeset_id = resp['changesetId']

print("Changeset...")
print(json.dumps(resp,sort_keys=True,indent=4,default=str))

Changeset...
{
    "changeRequests": [
        {
            "changeType": "PUT",
            "dbPath": "/",
            "s3Path": "s3://kdb-demo-829845998889-kms/data/demo/sym"
        },
        {
            "changeType": "PUT",
            "dbPath": "/2024.01.01/",
            "s3Path": "s3://kdb-demo-829845998889-kms/data/demo/2024.01.01/"
        }
    ],
    "changesetId": "MMeu0Yw31LcK9SXhXVYOXQ",
    "createdTimestamp": "2024-05-09 13:55:20.816000+00:00",
    "databaseName": "DEMO_DB",
    "environmentId": "jlcenjvtkgzrdek2qqv7ic",
    "lastModifiedTimestamp": "2024-05-09 13:55:20.816000+00:00",
    "status": "PENDING"
}
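The change-request list follows a simple rule: each date partition directory becomes a `PUT` to `/<partition>/`, and each top-level file (such as `sym`) becomes a `PUT` to `/`. As a sketch, the same logic can be isolated in a pure function that is easy to test without AWS access. The function name and the example bucket are hypothetical.

```python
import os
import tempfile


def build_change_requests(source_dir, s3_dest):
    """Map a local kdb database directory to changeset change requests.

    `s3_dest` is expected to end with a slash, as S3_DEST does above.
    """
    changes = []
    for name in sorted(os.listdir(source_dir)):
        if os.path.isdir(os.path.join(source_dir, name)):
            # A partition directory keeps its name inside the database.
            changes.append(
                {"changeType": "PUT", "s3Path": f"{s3_dest}{name}/", "dbPath": f"/{name}/"}
            )
        else:
            # Top-level files such as the sym file go to the database root.
            changes.append(
                {"changeType": "PUT", "s3Path": f"{s3_dest}{name}", "dbPath": "/"}
            )
    if not changes:
        # Empty directory: fall back to putting the whole prefix at the root.
        changes.append({"changeType": "PUT", "s3Path": s3_dest, "dbPath": "/"})
    return changes


# Demo on a throwaway directory shaped like the database saved earlier.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "2024.01.01"))
    open(os.path.join(tmp, "sym"), "w").close()
    demo_changes = build_change_requests(tmp, "s3://example-bucket/data/demo/")

for change in demo_changes:
    print(change)
```

Sorting the directory listing makes the request order deterministic, which `os.listdir` alone does not guarantee.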


In [11]:
wait_for_changeset_status(client, environmentId=ENV_ID, databaseName=DB_NAME, changesetId=changeset_id, show_wait=True)
print("**Done**")

Status is IN_PROGRESS, total wait 0:00:00, waiting 10 sec ...
Status is IN_PROGRESS, total wait 0:00:10, waiting 10 sec ...
Status is IN_PROGRESS, total wait 0:00:20, waiting 10 sec ...
Status is IN_PROGRESS, total wait 0:00:30, waiting 10 sec ...
Status is IN_PROGRESS, total wait 0:00:40, waiting 10 sec ...
**Done**
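The `wait_for_*` helpers come from `managed_kx` and their implementation is not shown here. As a rough illustration of what such a helper does, the following is a generic polling loop. It is a sketch and not the library's code: the function name, parameters, and default timings are all assumptions.

```python
import time


def wait_for_status(fetch_status, done_states, failed_states=(),
                    interval=10.0, timeout=1800.0, sleep=time.sleep):
    """Poll `fetch_status()` until it returns a state in `done_states`.

    Raises RuntimeError on a failure state and TimeoutError when the
    accumulated wait reaches `timeout` seconds.
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if status in done_states:
            return status
        if status in failed_states:
            raise RuntimeError(f"Reached failure status: {status}")
        if waited >= timeout:
            raise TimeoutError(f"Still {status} after {waited:.0f} sec")
        print(f"Status is {status}, total wait {waited:.0f} sec, waiting {interval:.0f} sec ...")
        sleep(interval)
        waited += interval
```

For a changeset, `fetch_status` could be a lambda that calls `client.get_kx_changeset(...)` and returns the `status` field of the response, with `done_states={"COMPLETED"}` and `failed_states={"FAILED"}`. The `sleep` parameter exists so the loop can be tested without real delays.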


### Managed Database Changesets
List the changesets for the Managed database using [list_kx_changesets](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/finspace/client/list_kx_changesets.html) and get details of each changeset with [get_kx_changeset](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/finspace/client/get_kx_changeset.html).

In [12]:
note_str = ""

c_set_list = list_kx_changesets(client, environmentId=ENV_ID, databaseName=DB_NAME)

if len(c_set_list) == 0:
    note_str = "<<No changesets>>"
    
print(100*"=")
print(f"Database: {DB_NAME}, Changesets: {len(c_set_list)} {note_str}")
print(100*"=")

# sort by create time
c_set_list = sorted(c_set_list, key=lambda d: d['createdTimestamp']) 

for c in c_set_list:
    c_set_id = c['changesetId']
    print(f"  Changeset: {c_set_id}: Created: {c['createdTimestamp']} ({c['status']})")
    c_rqs = client.get_kx_changeset(environmentId=ENV_ID, databaseName=DB_NAME, changesetId=c_set_id)['changeRequests']

    chs_pdf = pd.DataFrame.from_dict(c_rqs).style.hide(axis='index')
    display(chs_pdf)

Database: DEMO_DB, Changesets: 1 
  Changeset: MMeu0Yw31LcK9SXhXVYOXQ: Created: 2024-05-09 13:55:20.816000+00:00 (COMPLETED)


changeType,s3Path,dbPath
PUT,s3://kdb-demo-829845998889-kms/data/demo/sym,/
PUT,s3://kdb-demo-829845998889-kms/data/demo/2024.01.01/,/2024.01.01/
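Sorting by `createdTimestamp` is also how the most recent changeset is found later when the dataview is created. A small sketch of that selection as a standalone function follows. The optional status filter is an addition for illustration and is not something the notebook does; the example records are made up.

```python
from datetime import datetime, timezone


def latest_changeset_id(changesets, required_status=None):
    """Return the id of the most recently created changeset, or None.

    When `required_status` is given, only changesets with that status
    are considered.
    """
    candidates = [
        c for c in changesets
        if required_status is None or c.get("status") == required_status
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c["createdTimestamp"])["changesetId"]


# Made-up records shaped like the entries returned when listing changesets.
example = [
    {"changesetId": "b", "status": "PENDING",
     "createdTimestamp": datetime(2024, 5, 9, 14, 0, tzinfo=timezone.utc)},
    {"changesetId": "a", "status": "COMPLETED",
     "createdTimestamp": datetime(2024, 5, 9, 13, 55, tzinfo=timezone.utc)},
]

print(latest_changeset_id(example))
print(latest_changeset_id(example, required_status="COMPLETED"))
```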


# Create Scaling Group
The scaling group represents the total compute available to the application. All clusters will be placed into the scaling group and share its compute and memory. Create the scaling group with the API [create_kx_scaling_group](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/finspace/client/create_kx_scaling_group.html).

In [13]:
# Check if scaling group exists, only create if it does not
resp = get_kx_scaling_group(client=client, environmentId=ENV_ID, scalingGroupName=SCALING_GROUP_NAME)

if resp is None:
    resp = client.create_kx_scaling_group(
        environmentId = ENV_ID, 
        scalingGroupName = SCALING_GROUP_NAME,
        hostType=NODE_TYPE,
        availabilityZoneId = AZ_ID
    )
else:
    print(f"Scaling Group {SCALING_GROUP_NAME} exists")

# Create Shared Volume
The shared volume is a common storage device for the application. Every cluster using the shared volume has a writable directory named after the cluster and can read the directories named after the other clusters that use the volume. Create the volume with the API [create_kx_volume](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/finspace/client/create_kx_volume.html).

In [14]:
# Check if volume already exists before trying to create one
resp = get_kx_volume(client=client, environmentId=ENV_ID, volumeName=VOLUME_NAME)

if resp is None:
    resp = client.create_kx_volume(
        environmentId = ENV_ID, 
        volumeType = 'NAS_1',
        volumeName = VOLUME_NAME,
        description = 'Shared volume for the GP and HDB clusters',
        nas1Configuration = NAS1_CONFIG,
        azMode='SINGLE',
        availabilityZoneIds=[ AZ_ID ]    
    )
else:
    print(f"Volume {VOLUME_NAME} exists")    

# Wait for Volume and Scaling Group
Before using the volume and scaling group, wait for their creation to complete.

The volume will be used by the dataview.    
The dataview and scaling group will be used by the clusters.


In [15]:
# wait for the scaling group to create
wait_for_scaling_group_status(client=client, environmentId=ENV_ID, scalingGroupName=SCALING_GROUP_NAME, show_wait=True)
print("** DONE **")

# wait for the volume to create
wait_for_volume_status(client=client, environmentId=ENV_ID, volumeName=VOLUME_NAME, show_wait=True)
print("** DONE **")

Scaling Group: DEMO_SCALING_GROUP status is CREATING, total wait 0:00:00, waiting 30 sec ...
Scaling Group: DEMO_SCALING_GROUP status is CREATING, total wait 0:00:30, waiting 30 sec ...
Scaling Group: DEMO_SCALING_GROUP status is CREATING, total wait 0:01:00, waiting 30 sec ...
Scaling Group: DEMO_SCALING_GROUP status is CREATING, total wait 0:01:30, waiting 30 sec ...
Scaling Group: DEMO_SCALING_GROUP status is CREATING, total wait 0:02:00, waiting 30 sec ...
Scaling Group: DEMO_SCALING_GROUP status is CREATING, total wait 0:02:30, waiting 30 sec ...
Scaling Group: DEMO_SCALING_GROUP status is CREATING, total wait 0:03:00, waiting 30 sec ...
Scaling Group: DEMO_SCALING_GROUP status is CREATING, total wait 0:03:30, waiting 30 sec ...
Scaling Group: DEMO_SCALING_GROUP status is CREATING, total wait 0:04:00, waiting 30 sec ...
Scaling Group: DEMO_SCALING_GROUP status is CREATING, total wait 0:04:30, waiting 30 sec ...
Scaling Group: DEMO_SCALING_GROUP status is now ACTIVE, total wait 0:0

# Create Dataview
Create a dataview for a specific (static) version of the database and have all of its data cached on the shared volume. Create the view with the API [create_kx_dataview](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/finspace/client/create_kx_dataview.html).

In [16]:
# do changesets exist?
c_set_list = list_kx_changesets(client, environmentId=ENV_ID, databaseName=DB_NAME)

if len(c_set_list) != 0:
    # sort by create time
    c_set_list = sorted(c_set_list, key=lambda d: d['createdTimestamp']) 
    latest_changeset = c_set_list[-1]['changesetId']

    # Check if dataview already exists and is set to the requested changeset_id
    resp = get_kx_dataview(client=client, environmentId=ENV_ID, databaseName=DB_NAME, dataviewName=DBVIEW_NAME)

    if resp is None:
        resp = client.create_kx_dataview(
            environmentId = ENV_ID, 
            databaseName=DB_NAME, 
            dataviewName=DBVIEW_NAME,
            azMode='SINGLE',
            availabilityZoneId=AZ_ID,
            changesetId=latest_changeset, # latest changeset_id
            segmentConfigurations=[
                { 
                    'volumeName': VOLUME_NAME,
                    'dbPaths': ['/*'],  # cache all of database
    #                "onDemand": True,   # cache data onDemand (on read) else will ensure all is cached
                }
            ],
    #        readWrite=True,
            autoUpdate=False,
            description = f'Dataview of database'
        )
    elif resp['changesetId'] != latest_changeset:
        print(f"Dataview {DBVIEW_NAME} exists but needs updating...")
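        resp = client.update_kx_dataview(
            environmentId=ENV_ID,
            databaseName=DB_NAME,
            dataviewName=DBVIEW_NAME,
            changesetId=latest_changeset, # latest changeset_id
            segmentConfigurations=[
                {
                    'volumeName': VOLUME_NAME,
                    'dbPaths': ['/*'],  # cache all of database
                }
            ]
        )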
    else:
        print(f"Dataview {DBVIEW_NAME} exists with current changeset: {latest_changeset}")
    
else:
    # no changesets, do NOT create view
    print(f"No changeset in database: {DB_NAME}, Dataview {DBVIEW_NAME} not created")        


In [17]:
# wait for the view to create
wait_for_dataview_status(client=client, environmentId=ENV_ID, databaseName=DB_NAME, dataviewName=DBVIEW_NAME, show_wait=True)
print("** DONE **")

Dataview: DEMO_DB_VIEW status is CREATING, total wait 0:00:00, waiting 30 sec ...
Dataview: DEMO_DB_VIEW status is CREATING, total wait 0:00:30, waiting 30 sec ...
Dataview: DEMO_DB_VIEW status is CREATING, total wait 0:01:00, waiting 30 sec ...
Dataview: DEMO_DB_VIEW status is CREATING, total wait 0:01:30, waiting 30 sec ...
Dataview: DEMO_DB_VIEW status is CREATING, total wait 0:02:00, waiting 30 sec ...
Dataview: DEMO_DB_VIEW status is CREATING, total wait 0:02:30, waiting 30 sec ...
Dataview: DEMO_DB_VIEW status is CREATING, total wait 0:03:00, waiting 30 sec ...
Dataview: DEMO_DB_VIEW status is CREATING, total wait 0:03:30, waiting 30 sec ...
Dataview: DEMO_DB_VIEW status is CREATING, total wait 0:04:00, waiting 30 sec ...
Dataview: DEMO_DB_VIEW status is CREATING, total wait 0:04:30, waiting 30 sec ...
Dataview: DEMO_DB_VIEW status is CREATING, total wait 0:05:00, waiting 30 sec ...
Dataview: DEMO_DB_VIEW status is CREATING, total wait 0:05:30, waiting 30 sec ...
Dataview: DEMO_D
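The dataview cell has four possible outcomes: skip when the database has no changesets, create when no view exists, update when the view points at an older changeset, and do nothing when it is already current. As a sketch, that decision can be separated from the API calls so it can be checked in isolation. The function name and the returned labels are hypothetical.

```python
def dataview_action(existing_view, latest_changeset):
    """Decide what to do with a dataview.

    existing_view:    the dataview description dict, or None if it does not exist
    latest_changeset: id of the newest changeset, or None if there are none
    Returns one of "skip", "create", "update", "noop".
    """
    if latest_changeset is None:
        # Nothing in the database yet, so there is nothing to view.
        return "skip"
    if existing_view is None:
        return "create"
    if existing_view.get("changesetId") != latest_changeset:
        return "update"
    return "noop"


print(dataview_action(None, "MMeu0Yw31LcK9SXhXVYOXQ"))
print(dataview_action({"changesetId": "older"}, "MMeu0Yw31LcK9SXhXVYOXQ"))
```

The caller would then map `"create"` and `"update"` to `create_kx_dataview` and `update_kx_dataview` respectively.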

# Create Clusters
With the foundation resources now in place, create the clusters needed for the application. The GP cluster will be used for processing CSV files; the HDB cluster will serve the contents of the database.

Clusters are created using the API [create_kx_cluster](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/finspace/client/create_kx_cluster.html). The existence of a cluster is determined with [get_kx_cluster](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/finspace/client/get_kx_cluster.html).

In [18]:
# does cluster already exist?
resp = get_kx_cluster(client, environmentId=ENV_ID, clusterName=CLUSTER_NAME)

if resp is not None:
    print(f"Cluster: {CLUSTER_NAME} already exists")
else:
    print(f"Creating: {CLUSTER_NAME}")

    resp = client.create_kx_cluster(
        environmentId=ENV_ID, 
        clusterName=CLUSTER_NAME,
        clusterType="GP",
        releaseLabel = '1.0',
        executionRole=EXECUTION_ROLE,
        databases=DATABASE_CONFIG,
        scalingGroupConfiguration={
            'memoryReservation': 6,
            'nodeCount': 1,
            'scalingGroupName': SCALING_GROUP_NAME,
        },
        savedownStorageConfiguration = { 'volumeName': VOLUME_NAME },
        clusterDescription="Created with create_all notebook",
    #    code=CODE_CONFIG,
    #    initializationScript=cluster_init,
        commandLineArguments=CMD_ARGS,
        azMode=AZ_MODE,
        availabilityZoneId=AZ_ID,
        vpcConfiguration={ 
            'vpcId': VPC_ID,
            'securityGroupIds': SECURITY_GROUPS,
            'subnetIds': SUBNET_IDS,
            'ipAddressType': 'IP_V4' }
    )

Creating: demo_csv_cluster


In [19]:
# does cluster already exist?
resp = get_kx_cluster(client, environmentId=ENV_ID, clusterName=HDB_CLUSTER_NAME)
if resp is not None:
    print(f"Cluster: {HDB_CLUSTER_NAME} already exists")
else:
    print(f"Creating: {HDB_CLUSTER_NAME}")
    
    resp = client.create_kx_cluster(
        environmentId=ENV_ID, 
        clusterName=HDB_CLUSTER_NAME,
        clusterType='HDB',
        releaseLabel = '1.0',
        executionRole=EXECUTION_ROLE,
        databases=DATABASE_CONFIG,
        scalingGroupConfiguration={
            'memoryLimit': 32*1024,
            'memoryReservation': 6,
            'nodeCount': 3,
            'scalingGroupName': SCALING_GROUP_NAME,
        },
        clusterDescription="Created with create_all notebook",
    #    code=CODE_CONFIG,
    #    initializationScript=HDB_INIT_SCRIPT,
        commandLineArguments=HDB_CMD_ARGS,
        azMode=AZ_MODE,
        availabilityZoneId=AZ_ID,
        vpcConfiguration={ 
            'vpcId': VPC_ID,
            'securityGroupIds': SECURITY_GROUPS,
            'subnetIds': SUBNET_IDS,
            'ipAddressType': 'IP_V4' },
    )

Creating: demo_hdb_cluster
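The two `create_kx_cluster` calls share most of their arguments and differ mainly in cluster type, scaling group settings, command line arguments, and savedown storage. One way to reduce the duplication, sketched below, is to build each request as a dictionary from a shared base and pass it with `**`. The helper name and the placeholder values in the demo are hypothetical; the keys are the ones used in the cells above.

```python
def cluster_request(name, cluster_type, *, common, scaling_group,
                    command_line_args, savedown_volume=None):
    """Build keyword arguments for create_kx_cluster from shared settings."""
    request = dict(common)  # copy, so the shared settings stay untouched
    request.update(
        clusterName=name,
        clusterType=cluster_type,
        scalingGroupConfiguration=dict(scaling_group),
        commandLineArguments=list(command_line_args),
    )
    # Only the GP cluster in this notebook sets savedown storage.
    if savedown_volume is not None:
        request["savedownStorageConfiguration"] = {"volumeName": savedown_volume}
    return request


# Placeholder values standing in for the notebook's constants.
common = {"environmentId": "env-id", "releaseLabel": "1.0"}

gp_request = cluster_request(
    "demo_csv_cluster", "GP", common=common,
    scaling_group={"memoryReservation": 6, "nodeCount": 1,
                   "scalingGroupName": "DEMO_SCALING_GROUP"},
    command_line_args=[{"key": "s", "value": "4"}],
    savedown_volume="DEMO_SHARED_VOLUME",
)
hdb_request = cluster_request(
    "demo_hdb_cluster", "HDB", common=common,
    scaling_group={"memoryLimit": 32 * 1024, "memoryReservation": 6,
                   "nodeCount": 3, "scalingGroupName": "DEMO_SCALING_GROUP"},
    command_line_args=[{"key": "s", "value": "4"}],
)
```

Each request would then be sent with `client.create_kx_cluster(**gp_request)`, after adding the remaining shared arguments (execution role, databases, VPC configuration, and so on) to `common`.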


## Wait for all clusters to finish creating

In [20]:
# Wait for clusters to start
wait_for_cluster_status(client, environmentId=ENV_ID, clusterName=CLUSTER_NAME, show_wait=True)
wait_for_cluster_status(client, environmentId=ENV_ID, clusterName=HDB_CLUSTER_NAME, show_wait=True)

print("** ALL DONE **")

Cluster: demo_csv_cluster status is PENDING, total wait 0:00:00, waiting 30 sec ...
Cluster: demo_csv_cluster status is CREATING, total wait 0:00:30, waiting 30 sec ...
Cluster: demo_csv_cluster status is CREATING, total wait 0:01:00, waiting 30 sec ...
Cluster: demo_csv_cluster status is CREATING, total wait 0:01:30, waiting 30 sec ...
Cluster: demo_csv_cluster status is CREATING, total wait 0:02:00, waiting 30 sec ...
Cluster: demo_csv_cluster status is CREATING, total wait 0:02:30, waiting 30 sec ...
Cluster: demo_csv_cluster status is CREATING, total wait 0:03:00, waiting 30 sec ...
Cluster: demo_csv_cluster status is CREATING, total wait 0:03:30, waiting 30 sec ...
Cluster: demo_csv_cluster status is CREATING, total wait 0:04:00, waiting 30 sec ...
Cluster: demo_csv_cluster status is CREATING, total wait 0:04:30, waiting 30 sec ...
Cluster: demo_csv_cluster status is CREATING, total wait 0:05:00, waiting 30 sec ...
Cluster: demo_csv_cluster status is CREATING, total wait 0:05:30, 

# All Processes Running
All resources are now created, the database has data, and the clusters are up and running.

In [21]:
print( f"Last Run: {datetime.datetime.now()}" )

Last Run: 2024-05-09 14:28:00.145754
