# CSV Processing: Create Resources

This notebook creates the infrastructure needed to demonstrate how to use FinSpace with Managed kdb Insights to process csv.gz files on S3 into a managed database using managed clusters.

## Architecture
![Architecture](images/csv_arch.png "Architecture")

A managed database is created and initially populated by an initial changeset that contains an empty table. The empty database files are staged to an S3 bucket that FinSpace can access, and [create_kx_changeset](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/finspace/client/create_kx_changeset.html) is used to populate the managed database with the changeset. Once the database has been populated, a shared volume and a scaling group (infrastructure components) are created using [create_kx_volume](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/finspace/client/create_kx_volume.html) and [create_kx_scaling_group](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/finspace/client/create_kx_scaling_group.html). With the volume created, a dataview of the (still empty) database is created with [create_kx_dataview](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/finspace/client/create_kx_dataview.html). Finally, two clusters are created with [create_kx_cluster](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/finspace/client/create_kx_cluster.html): a general purpose (GP) cluster for processing the csv.gz files, and a historical database (HDB) cluster to serve the data for queries.
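
The creation order described above can be read as a small dependency graph. The following is a minimal sketch, assuming the dependencies are as this notebook describes them (the dataview needs a changeset and the volume; the clusters need the dataview and the scaling group). The resource labels are illustrative names, not API identifiers.

```python
from graphlib import TopologicalSorter

# Each resource maps to the set of resources that must exist before it.
# Labels are illustrative only; they are not FinSpace API identifiers.
DEPENDENCIES = {
    "database": set(),
    "changeset": {"database"},
    "volume": set(),
    "scaling_group": set(),
    "dataview": {"changeset", "volume"},
    "gp_cluster": {"dataview", "scaling_group", "volume"},
    "hdb_cluster": {"dataview", "scaling_group"},
}


def creation_order(dependencies):
    """Return one valid creation order, prerequisites first."""
    return list(TopologicalSorter(dependencies).static_order())


order = creation_order(DEPENDENCIES)
print(" -> ".join(order))
```

Any order produced this way creates each resource only after the resources it depends on.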

## Processing Data
This notebook only sets up the infrastructure needed to process data. See the [process_algoseek](process_algoseek.ipynb) notebook for the processing itself.

## Clean Up
To delete the infrastructure, run the [delete_all](delete_all.ipynb) notebook.

## Algoseek LLC Data
Trade and quote data has been provided by [AlgoSeek LLC](https://www.algoseek.com/). You can learn more about their data offerings on their home page.


## Imports and Constants
Import the necessary Python libraries and define global variables.

In [1]:
import os
import subprocess
import boto3
import json
import datetime

import pandas as pd  # used later to display changeset requests
import pykx as kx

from managed_kx import *
from env import *

from config import *

# ----------------------------------------------------------------

NODE_TYPE="kx.sg.xlarge"

DATABASE_CONFIG=[{ 
    'databaseName': DB_NAME,
    'dataviewName': DBVIEW_NAME
    }]
CODE_CONFIG={ 's3Bucket': S3_BUCKET, 's3Key': f'{S3_CODE_PATH}/{CODEBASE}.zip' }

NAS1_CONFIG= {
        'type': 'SSD_250',
        'size': 1200
}

CMD_ARGS=[
    { 'key': 's', 'value': '2' }, 
    { 'key': 'dbname', 'value': DB_NAME}, 
    { 'key': 'AWS_ZIP_DEFAULT', 'value': '17,2,6' },
]

HDB_CMD_ARGS=[
    { 'key': 's', 'value': '2' }, 
    { 'key': 'dbname', 'value': DB_NAME}, 
]

VPC_CONFIG={ 
    'vpcId': VPC_ID,
    'securityGroupIds': SECURITY_GROUPS,
    'subnetIds': SUBNET_IDS,
    'ipAddressType': 'IP_V4' 
}

In [2]:
# Using credentials and create service client
session = boto3.Session()

# create finspace client
client = session.client(service_name='finspace')

# Create Managed Database
Create a managed database in Managed kdb Insights using the API [create_kx_database](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/finspace/client/create_kx_database.html).

In [3]:
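# assume database exists until the lookup says otherwise
create_db=False

try:
    resp = client.get_kx_database(environmentId=ENV_ID, databaseName=DB_NAME)
    resp.pop('ResponseMetadata', None)
except client.exceptions.ResourceNotFoundException:
    # does not exist, will create (other errors are not swallowed)
    create_db=True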

if create_db:
    print(f"CREATING Database: {DB_NAME}")
    resp = client.create_kx_database(environmentId=ENV_ID, databaseName=DB_NAME, description="Basictick kdb database")
    resp.pop('ResponseMetadata', None)

    print(f"CREATED Database: {DB_NAME}")

print(json.dumps(resp,sort_keys=True,indent=4,default=str))

{
    "createdTimestamp": "2024-11-26 19:16:09.825000+00:00",
    "databaseArn": "arn:aws:finspace:us-east-1:829845998889:kxEnvironment/jlcenjvtkgzrdek2qqv7ic/kxDatabase/DEMO_DB",
    "databaseName": "DEMO_DB",
    "description": "Basictick kdb database",
    "environmentId": "jlcenjvtkgzrdek2qqv7ic",
    "lastCompletedChangesetId": "jMm0844KLNXdmbXIReAjEQ",
    "lastModifiedTimestamp": "2024-11-26 19:16:47.496000+00:00",
    "numBytes": 24805,
    "numChangesets": 1,
    "numFiles": 11
}
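
The get-then-create pattern used for the database recurs for every resource in this notebook. Below is a minimal sketch of that pattern as a reusable helper. The `NotFound` exception and the in-memory `fake_get`/`fake_create` functions are hypothetical stand-ins used only to exercise the helper; they are not part of boto3 or of the `managed_kx` module.

```python
class NotFound(Exception):
    """Hypothetical stand-in for a service's 'resource not found' error."""


def get_or_create(get_fn, create_fn, not_found=NotFound):
    """Return (resource, created).

    Only the not_found exception triggers creation, so unrelated failures
    (for example permission or network errors) still propagate.
    """
    try:
        return get_fn(), False
    except not_found:
        return create_fn(), True


# In-memory stand-in for a service, used only for this demonstration.
_store = {}


def fake_get(name):
    if name not in _store:
        raise NotFound(name)
    return _store[name]


def fake_create(name):
    _store[name] = {"databaseName": name}
    return _store[name]


first, created_first = get_or_create(
    lambda: fake_get("DEMO_DB"), lambda: fake_create("DEMO_DB"))
second, created_second = get_or_create(
    lambda: fake_get("DEMO_DB"), lambda: fake_create("DEMO_DB"))
print(created_first, created_second)
```

Running the helper twice shows why the notebook is safe to re-run: the second call finds the existing resource and creates nothing.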


# Create and Stage Data on S3
## Create Empty Database
With a local q instance, create an empty in-memory table with the target schema, save it to a date partition on disk, and later add it to the managed database. Think of this as the first date of data in the database, except that the table is deliberately left empty.


In [4]:
!rm -rf $SOURCE_DATA_DIR

In [5]:
%%q
/ create empty table
taq:([]
    Timestamp:`timespan$();
    EventType:`symbol$();
    Ticker:`symbol$();
    Price:`float$();
    Quantity:`long$();
    Exchange:`symbol$();
    Conditions:`symbol$();
    FileName:`symbol$();
    FileExtension:`symbol$() )

/ set attribute on table
update `g#Ticker from `taq

/ Schema
meta taq

/ show the table contents (it's empty)
taq

/ Save the table locally to a date partition
d:2021.01.01;
path:"demo";

{.Q.dpft[hsym`$x;y;`Ticker;z]}[path;d] each tables`.;


taq
c            | t f a
-------------| -----
Timestamp    | n    
EventType    | s    
Ticker       | s   g
Price        | f    
Quantity     | j    
Exchange     | s    
Conditions   | s    
FileName     | s    
FileExtension| s    
Timestamp EventType Ticker Price Quantity Exchange Conditions FileName FileEx..
-----------------------------------------------------------------------------..


### Contents of Database on Disk
Show the saved tables of the database on the file system.

In [6]:
!ls -lR $SOURCE_DATA_DIR

demo:
total 8
drwxrwxr-x 3 ec2-user ec2-user 4096 Nov 26 19:18 2021.01.01
-rw-rw-r-- 1 ec2-user ec2-user    8 Nov 26 19:18 sym

demo/2021.01.01:
total 4
drwxrwxr-x 2 ec2-user ec2-user 4096 Nov 26 19:18 taq

demo/2021.01.01/taq:
total 40
-rw-rw-r-- 1 ec2-user ec2-user 4096 Nov 26 19:18 Conditions
-rw-rw-r-- 1 ec2-user ec2-user 4096 Nov 26 19:18 EventType
-rw-rw-r-- 1 ec2-user ec2-user 4096 Nov 26 19:18 Exchange
-rw-rw-r-- 1 ec2-user ec2-user 4096 Nov 26 19:18 FileExtension
-rw-rw-r-- 1 ec2-user ec2-user 4096 Nov 26 19:18 FileName
-rw-rw-r-- 1 ec2-user ec2-user   16 Nov 26 19:18 Price
-rw-rw-r-- 1 ec2-user ec2-user   16 Nov 26 19:18 Quantity
-rw-rw-r-- 1 ec2-user ec2-user 4176 Nov 26 19:18 Ticker
-rw-rw-r-- 1 ec2-user ec2-user   16 Nov 26 19:18 Timestamp


### Stage the Database on S3
Copy the database files to S3 using the AWS CLI [aws s3](https://docs.aws.amazon.com/cli/latest/reference/s3/) commands (`rm`, `sync`, and `ls`).

In [7]:
# Stage the local hdb database to S3
S3_DEST=f"s3://{S3_BUCKET}/{S3_DATA_PATH}/{SOURCE_DATA_DIR}/"

cp = ""

if AWS_ACCESS_KEY_ID is not None:
    cp = f"""
export AWS_ACCESS_KEY_ID={AWS_ACCESS_KEY_ID}
export AWS_SECRET_ACCESS_KEY={AWS_SECRET_ACCESS_KEY}
export AWS_SESSION_TOKEN={AWS_SESSION_TOKEN}
"""

cp += f"""
aws s3 rm --recursive {S3_DEST} --quiet
aws s3 sync --exclude .DS_Store {SOURCE_DATA_DIR} {S3_DEST} --quiet
aws s3 ls {S3_DEST}
"""

# execute the S3 copy
os.system(cp)

                           PRE 2021.01.01/
2024-11-26 19:18:09          8 sym


0
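
As an alternative to embedding credentials in a shell string, the same three AWS CLI calls can be expressed as argument lists with an explicit environment. This is a sketch only: the bucket name and credential values are placeholders, and the commands are printed rather than executed so the example runs without the AWS CLI installed.

```python
import os


def build_stage_commands(source_dir, s3_dest):
    """Return the AWS CLI invocations as argument lists (no shell involved)."""
    return [
        ["aws", "s3", "rm", "--recursive", s3_dest, "--quiet"],
        ["aws", "s3", "sync", "--exclude", ".DS_Store", source_dir, s3_dest, "--quiet"],
        ["aws", "s3", "ls", s3_dest],
    ]


def build_env(base_env, access_key=None, secret_key=None, session_token=None):
    """Copy base_env and add whichever credential values were supplied."""
    env = dict(base_env)
    credentials = {
        "AWS_ACCESS_KEY_ID": access_key,
        "AWS_SECRET_ACCESS_KEY": secret_key,
        "AWS_SESSION_TOKEN": session_token,
    }
    for key, value in credentials.items():
        if value is not None:
            env[key] = value
    return env


# Placeholder values, for illustration only.
commands = build_stage_commands("demo", "s3://example-bucket/data/demo/")
env = build_env(os.environ, access_key="EXAMPLE_KEY",
                secret_key="EXAMPLE_SECRET", session_token="EXAMPLE_TOKEN")

for argv in commands:
    print(" ".join(argv))
    # To execute for real: subprocess.run(argv, env=env, check=True)
```

Passing `check=True` to `subprocess.run` would raise on a non-zero exit status, which `os.system` only reports through its return value.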

## Add Data to Database
Add the data staged on S3 to the managed database using the API [create_kx_changeset](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/finspace/client/create_kx_changeset.html).

In [8]:
# add the changeset if database has no changeset
c_set_list = list_kx_changesets(client, environmentId=ENV_ID, databaseName=DB_NAME)

if len(c_set_list) == 0:
    print("Adding Changeset to Empty database")
    changes=[]

    dir_list = os.listdir(f"{SOURCE_DATA_DIR}")

    for f in dir_list:
        if os.path.isdir(f"{SOURCE_DATA_DIR}/{f}"):
            changes.append( { 'changeType': 'PUT', 's3Path': f"{S3_DEST}{f}/", 'dbPath': f"/{f}/" } )
        else:
            changes.append( { 'changeType': 'PUT', 's3Path': f"{S3_DEST}{f}", 'dbPath': f"/" } )

    if len(dir_list) == 0:
        changes.append( { 'changeType': 'PUT', 's3Path': f"{S3_DEST}", 'dbPath': f"/" } )

    resp = client.create_kx_changeset(environmentId=ENV_ID, databaseName=DB_NAME, 
        changeRequests=changes)

    resp.pop('ResponseMetadata', None)
    changeset_id = resp['changesetId']

    print("Changeset...")
    print(json.dumps(resp,sort_keys=True,indent=4,default=str))
else:
    changeset_id = c_set_list[0]['changesetId']

In [9]:
wait_for_changeset_status(client, environmentId=ENV_ID, databaseName=DB_NAME, changesetId=changeset_id, show_wait=True)
print("**Done**")

**Done**
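
The mapping from the local directory layout to change requests can be isolated in a pure function. The sketch below mirrors the logic of the cell above (directories become partition paths, top-level files go to the database root) and exercises it against a temporary directory. The bucket name is a placeholder, and `build_change_requests` is a name chosen for this illustration.

```python
import os
import tempfile


def build_change_requests(source_dir, s3_dest):
    """Build PUT change requests for every top-level entry in source_dir."""
    entries = sorted(os.listdir(source_dir))
    if not entries:
        # Nothing on disk: stage the destination prefix as a whole.
        return [{"changeType": "PUT", "s3Path": s3_dest, "dbPath": "/"}]

    changes = []
    for name in entries:
        if os.path.isdir(os.path.join(source_dir, name)):
            # A directory (for example a date partition) keeps its name in the database.
            changes.append({"changeType": "PUT",
                            "s3Path": f"{s3_dest}{name}/",
                            "dbPath": f"/{name}/"})
        else:
            # A top-level file (for example sym) is placed at the database root.
            changes.append({"changeType": "PUT",
                            "s3Path": f"{s3_dest}{name}",
                            "dbPath": "/"})
    return changes


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "2021.01.01", "taq"))
    with open(os.path.join(root, "sym"), "w") as fh:
        fh.write("")
    change_requests = build_change_requests(root, "s3://example-bucket/data/demo/")

for request in change_requests:
    print(request)
```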


# Managed Database Changesets
List the changesets for the Managed database using [list_kx_changesets](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/finspace/client/list_kx_changesets.html) and get details of each changeset with [get_kx_changeset](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/finspace/client/get_kx_changeset.html).

In [10]:
note_str = ""

c_set_list = list_kx_changesets(client, environmentId=ENV_ID, databaseName=DB_NAME)

if len(c_set_list) == 0:
    note_str = "<<No changesets>>"
    
print(100*"=")
print(f"Database: {DB_NAME}, Changesets: {len(c_set_list)} {note_str}")
print(100*"=")

# sort by create time
c_set_list = sorted(c_set_list, key=lambda d: d['createdTimestamp']) 

for c in c_set_list:
    c_set_id = c['changesetId']
    print(f"  Changeset: {c_set_id}: Created: {c['createdTimestamp']} ({c['status']})")
    c_rqs = client.get_kx_changeset(environmentId=ENV_ID, databaseName=DB_NAME, changesetId=c_set_id)['changeRequests']

    chs_pdf = pd.DataFrame.from_dict(c_rqs).style.hide(axis='index')
    display(chs_pdf)

Database: DEMO_DB, Changesets: 1 
  Changeset: jMm0844KLNXdmbXIReAjEQ: Created: 2024-11-26 19:16:13.975000+00:00 (COMPLETED)


changeType,s3Path,dbPath
PUT,s3://kdb-demo-829845998889-kms/data/demo/2021.01.01/,/2021.01.01/
PUT,s3://kdb-demo-829845998889-kms/data/demo/sym,/


# Create Scaling Group
The scaling group represents the total compute available to the application. All clusters will be placed into the scaling group and share its compute and memory. Create the scaling group with the API [create_kx_scaling_group](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/finspace/client/create_kx_scaling_group.html).

In [11]:
# Check if scaling group exists, only create if it does not
resp = get_kx_scaling_group(client=client, environmentId=ENV_ID, scalingGroupName=SCALING_GROUP_NAME)

if resp is None:
    resp = client.create_kx_scaling_group(
        environmentId = ENV_ID, 
        scalingGroupName = SCALING_GROUP_NAME,
        hostType=NODE_TYPE,
        availabilityZoneId = AZ_ID
    )
else:
    print(f"Scaling Group {SCALING_GROUP_NAME} exists")

Scaling Group DEMO_SCALING_GROUP exists


# Create Shared Volume
The shared volume is a common storage device for the application. Every cluster using the shared volume has a writable directory named after the cluster and can read the directories named after the other clusters that use the volume. Create the volume with the API [create_kx_volume](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/finspace/client/create_kx_volume.html).

In [12]:
# Check if volume already exists before trying to create one
resp = get_kx_volume(client=client, environmentId=ENV_ID, volumeName=VOLUME_NAME)

if resp is None:
    resp = client.create_kx_volume(
        environmentId = ENV_ID, 
        volumeType = 'NAS_1',
        volumeName = VOLUME_NAME,
        description = 'Shared volume for the GP and HDB clusters',
        nas1Configuration = NAS1_CONFIG,
        azMode='SINGLE',
        availabilityZoneIds=[ AZ_ID ]    
    )
else:
    print(f"Volume {VOLUME_NAME} exists")    

Volume DEMO_SHARED_VOLUME exists


# Wait for Volume and Scaling Group
Before using the volume and the scaling group, wait for their creation to complete.

The volume will be used by the dataview.  
The dataview and the scaling group will be used by the clusters.


In [13]:
# wait for the scaling group to create
wait_for_scaling_group_status(client=client, environmentId=ENV_ID, scalingGroupName=SCALING_GROUP_NAME, show_wait=True)
print("** Scaling Group DONE **")

# wait for the volume to create
wait_for_volume_status(client=client, environmentId=ENV_ID, volumeName=VOLUME_NAME, show_wait=True)
print("** Volume DONE **")

Scaling Group: DEMO_SCALING_GROUP status is now ACTIVE, total wait 0:00:00
** Scaling Group DONE **
Volume: DEMO_SHARED_VOLUME status is now ACTIVE, total wait 0:00:00
** Volume DONE **
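
The `wait_for_*` helpers come from the `managed_kx` module, whose source is not shown here. The sketch below is an assumption about how such a helper could work, not its actual implementation: it polls a status callback until a target status is reached. The failure status name and the timeout are illustrative parameters, and the status sequence is simulated so the example runs without AWS access.

```python
import time


def wait_for_status(get_status, target="ACTIVE", failed=("CREATE_FAILED",),
                    poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll get_status() until it returns target; return the seconds waited."""
    waited = 0
    while True:
        status = get_status()
        if status == target:
            return waited
        if status in failed:
            raise RuntimeError(f"resource entered failure state: {status}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Simulated status sequence; a real caller would query the service instead.
statuses = iter(["CREATING", "CREATING", "ACTIVE"])
waited = wait_for_status(lambda: next(statuses), sleep=lambda seconds: None)
print(f"status is now ACTIVE, total wait {waited} sec")
```

Injecting `sleep` keeps the function testable and makes the poll interval easy to change; failing fast on a failure status avoids waiting out the full timeout.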


# Create Dataview
Create a dataview for a specific (static) version of the database and have all of its data cached on the shared volume. Create the view with the API [create_kx_dataview](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/finspace/client/create_kx_dataview.html).

In [14]:
# do changesets exist?
c_set_list = list_kx_changesets(client, environmentId=ENV_ID, databaseName=DB_NAME)

if len(c_set_list) != 0:
    # sort by create time
    c_set_list = sorted(c_set_list, key=lambda d: d['createdTimestamp']) 
    latest_changeset = c_set_list[-1]['changesetId']

    # Check if dataview already exists and is set to the requested changeset_id
    resp = get_kx_dataview(client=client, environmentId=ENV_ID, databaseName=DB_NAME, dataviewName=DBVIEW_NAME)

    if resp is None:
        resp = client.create_kx_dataview(
            environmentId = ENV_ID, 
            databaseName=DB_NAME, 
            dataviewName=DBVIEW_NAME,
            azMode='SINGLE',
            availabilityZoneId=AZ_ID,
            changesetId=latest_changeset, # latest changeset_id
            segmentConfigurations=[
                { 
                    'volumeName': VOLUME_NAME,
                    'dbPaths': ['/*'],  # cache all of database
                }
            ],
            autoUpdate=False,
            description = f'Dataview of database'
        )
    elif resp['changesetId'] != latest_changeset:
        print(f"Dataview {DBVIEW_NAME} exists but needs updating...")
        resp = client.update_kx_dataview(environmentId=ENV_ID, 
            databaseName=DB_NAME, 
            dataviewName=DBVIEW_NAME, 
            changesetId=latest_changeset, 
            segmentConfigurations=[
                {'dbPaths': ['/*'], 'volumeName': VOLUME_NAME}
            ]
        )
    else:
        print(f"Dataview {DBVIEW_NAME} exists with current changeset: {latest_changeset}")

else:
    # no changesets, do NOT create view
    print(f"No changeset in database: {DB_NAME}, Dataview {DBVIEW_NAME} not created")        

Dataview DEMO_DB_VIEW exists with current changeset: jMm0844KLNXdmbXIReAjEQ
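
The cell above makes a four-way decision: skip, create, update, or do nothing. A minimal sketch of that decision as a pure function follows. `plan_dataview_action` and the action labels are names chosen for this illustration, and the changeset IDs and timestamps are made-up sample values.

```python
from datetime import datetime, timezone


def plan_dataview_action(changesets, dataview):
    """Return (action, changeset_id) for the dataview.

    changesets: list of dicts with 'changesetId' and 'createdTimestamp'.
    dataview:   dict with 'changesetId', or None if the dataview does not exist.
    """
    if not changesets:
        return ("skip", None)  # nothing to view yet
    latest = max(changesets, key=lambda c: c["createdTimestamp"])["changesetId"]
    if dataview is None:
        return ("create", latest)
    if dataview["changesetId"] != latest:
        return ("update", latest)
    return ("noop", latest)


# Made-up sample changesets, for illustration only.
changesets = [
    {"changesetId": "older",
     "createdTimestamp": datetime(2024, 11, 26, 19, 16, tzinfo=timezone.utc)},
    {"changesetId": "newer",
     "createdTimestamp": datetime(2024, 11, 27, 8, 0, tzinfo=timezone.utc)},
]

print(plan_dataview_action([], None))
print(plan_dataview_action(changesets, None))
print(plan_dataview_action(changesets, {"changesetId": "older"}))
print(plan_dataview_action(changesets, {"changesetId": "newer"}))
```

Separating the decision from the API calls makes each branch easy to check without touching the service.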


In [15]:
# wait for the view to create
wait_for_dataview_status(client=client, environmentId=ENV_ID, databaseName=DB_NAME, dataviewName=DBVIEW_NAME, show_wait=True)
print("** Dataview DONE **")

Dataview: DEMO_DB_VIEW status is CREATING, total wait 0:00:00, waiting 30 sec ...
Dataview: DEMO_DB_VIEW status is CREATING, total wait 0:00:30, waiting 30 sec ...
Dataview: DEMO_DB_VIEW status is CREATING, total wait 0:01:00, waiting 30 sec ...
Dataview: DEMO_DB_VIEW status is CREATING, total wait 0:01:30, waiting 30 sec ...
Dataview: DEMO_DB_VIEW status is CREATING, total wait 0:02:00, waiting 30 sec ...
Dataview: DEMO_DB_VIEW status is CREATING, total wait 0:02:30, waiting 30 sec ...
Dataview: DEMO_DB_VIEW status is CREATING, total wait 0:03:00, waiting 30 sec ...
Dataview: DEMO_DB_VIEW status is CREATING, total wait 0:03:30, waiting 30 sec ...
Dataview: DEMO_DB_VIEW status is CREATING, total wait 0:04:00, waiting 30 sec ...
Dataview: DEMO_DB_VIEW status is CREATING, total wait 0:04:30, waiting 30 sec ...
Dataview: DEMO_DB_VIEW status is now ACTIVE, total wait 0:05:00
** Dataview DONE **


# Create Clusters
With the foundation resources in place, create the clusters needed by the application. The GP cluster will be used for processing CSV files, and the HDB cluster will serve the contents of the database.

Clusters are created using the API [create_kx_cluster](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/finspace/client/create_kx_cluster.html). The existence of a cluster is determined with [get_kx_cluster](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/finspace/client/get_kx_cluster.html).

In [16]:
# does cluster already exist?
resp = get_kx_cluster(client, environmentId=ENV_ID, clusterName=CLUSTER_NAME)

if resp is not None:
    print(f"Cluster: {CLUSTER_NAME} already exists")
else:
    print(f"Creating: {CLUSTER_NAME}")

    resp = client.create_kx_cluster(
        environmentId=ENV_ID, 
        clusterName=CLUSTER_NAME,
        clusterType="GP",
        releaseLabel = '1.0',
        executionRole=EXECUTION_ROLE,
        databases=DATABASE_CONFIG,
        scalingGroupConfiguration={
            'memoryReservation': 6,
            'nodeCount': 1,
            'scalingGroupName': SCALING_GROUP_NAME,
        },
        savedownStorageConfiguration = { 'volumeName': VOLUME_NAME },
        clusterDescription="Created with create_all notebook",
        commandLineArguments=CMD_ARGS,
        azMode=AZ_MODE,
        availabilityZoneId=AZ_ID,
        vpcConfiguration=VPC_CONFIG
    )

Creating: demo_csv_cluster


In [17]:
# does cluster already exist?
resp = get_kx_cluster(client, environmentId=ENV_ID, clusterName=HDB_CLUSTER_NAME)

if resp is not None:
    print(f"Cluster: {HDB_CLUSTER_NAME} already exists")
else:
    print(f"Creating: {HDB_CLUSTER_NAME}")

    resp = client.create_kx_cluster(
        environmentId=ENV_ID, 
        clusterName=HDB_CLUSTER_NAME,
        clusterType='HDB',
        releaseLabel = '1.0',
        executionRole=EXECUTION_ROLE,
        databases=DATABASE_CONFIG,
        scalingGroupConfiguration={
            'memoryLimit': 32*1024,
            'memoryReservation': 6,
            'nodeCount': 3,
            'scalingGroupName': SCALING_GROUP_NAME,
        },
        clusterDescription="Created with create_all notebook",
        commandLineArguments=HDB_CMD_ARGS,
        azMode=AZ_MODE,
        availabilityZoneId=AZ_ID,
        vpcConfiguration=VPC_CONFIG
    )

Creating: demo_hdb_cluster
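
The two cluster requests above share most of their arguments and differ mainly in cluster type, scaling configuration, and the savedown storage that only the GP cluster sets. The sketch below shows one way to assemble such keyword arguments from a shared base. `build_cluster_request` is a hypothetical helper, the environment ID is a placeholder, and only a subset of the arguments is shown.

```python
def build_cluster_request(common, cluster_name, cluster_type, scaling, **extra):
    """Merge shared arguments with per-cluster settings into one kwargs dict."""
    request = dict(common)  # copy so the shared base is never mutated
    request["clusterName"] = cluster_name
    request["clusterType"] = cluster_type
    request["scalingGroupConfiguration"] = dict(scaling)
    request.update(extra)
    return request


# Shared arguments; the environment ID is a placeholder.
COMMON = {
    "environmentId": "example-env-id",
    "releaseLabel": "1.0",
    "clusterDescription": "Created with create_all notebook",
}

gp_request = build_cluster_request(
    COMMON, "demo_csv_cluster", "GP",
    {"memoryReservation": 6, "nodeCount": 1,
     "scalingGroupName": "DEMO_SCALING_GROUP"},
    savedownStorageConfiguration={"volumeName": "DEMO_SHARED_VOLUME"},
)

hdb_request = build_cluster_request(
    COMMON, "demo_hdb_cluster", "HDB",
    {"memoryLimit": 32 * 1024, "memoryReservation": 6, "nodeCount": 3,
     "scalingGroupName": "DEMO_SCALING_GROUP"},
)

print(sorted(gp_request))
print(sorted(hdb_request))
```

A dict built this way could be passed to the client as `client.create_kx_cluster(**request)` once the remaining required arguments are added.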


## Wait for all clusters to finish creating

In [18]:
# Wait for clusters to start
for c in clusters:
    wait_for_cluster_status(client, environmentId=ENV_ID, clusterName=c, show_wait=True)

print("** ALL DONE **")

Cluster: demo_csv_cluster status is PENDING, total wait 0:00:00, waiting 30 sec ...
Cluster: demo_csv_cluster status is CREATING, total wait 0:00:30, waiting 30 sec ...
Cluster: demo_csv_cluster status is CREATING, total wait 0:01:00, waiting 30 sec ...
Cluster: demo_csv_cluster status is CREATING, total wait 0:01:30, waiting 30 sec ...
Cluster: demo_csv_cluster status is CREATING, total wait 0:02:00, waiting 30 sec ...
Cluster: demo_csv_cluster status is CREATING, total wait 0:02:30, waiting 30 sec ...
Cluster: demo_csv_cluster status is CREATING, total wait 0:03:00, waiting 30 sec ...
Cluster: demo_csv_cluster status is CREATING, total wait 0:03:30, waiting 30 sec ...
Cluster: demo_csv_cluster status is CREATING, total wait 0:04:00, waiting 30 sec ...
Cluster: demo_csv_cluster status is CREATING, total wait 0:04:30, waiting 30 sec ...
Cluster: demo_csv_cluster status is CREATING, total wait 0:05:00, waiting 30 sec ...
Cluster: demo_csv_cluster status is CREATING, total wait 0:05:30, 

# All Processes Running
All resources are now created, the database holds its initial (empty) changeset, and the clusters are up and running.

Now move on to the [process_algoseek](process_algoseek.ipynb) notebook to use this infrastructure.

In [19]:
print(f"Last Run: {datetime.datetime.now()}")

Last Run: 2024-11-26 19:38:33.370397
