In [19]:
%run ../setup_db.ipynb
conf = SandboxConfig(EXTERNAL_HOST_NAME="127.0.0.1", HOST_PORT=8563, BUCKETFS_PORT=2580)
setup_schema(conf)

%run 00_setup.ipynb
setup_cloud_storage_extension(conf)

Schema created in 17.11ms
Jar for version 2.7.6 already exists in exasol-cloud-storage-extension-2.7.6.jar, skip downloading
Jar file is already present in the bucketfs


In [3]:
import pyexasol

# Importing a small amount of data from parquet files

To begin, we'll load a small volume of data from the publicly available [YouTube-8M dataset](https://registry.opendata.aws/yt8m/).

In this example, we'll work with the "dataset vocabulary", which describes the classes of videos. It has 3862 entries in total, all stored in a single parquet file:
s3://aws-roda-ml-datalake/yt8m_ods/vocabulary/run-1644252350398-part-block-0-r-00000-snappy.parquet


## Schema and target table

As a first step, we need to obtain the schema of the data (the set of columns stored in the parquet files, together with their types). You might already have this information (if it is your own dataset); if not, you need to analyze the parquet files to work out their schema.

One way of doing this is the parquet-tools library wrapped into a [docker container](https://hub.docker.com/r/nathanhowell/parquet-tools). To use it, download one of the parquet files locally, then run the container against that file. With the same container you can also peek into a parquet file and look at its actual data.

For the file above, I got the following schema information:

```
message glue_schema {
  optional binary Index (STRING);
  optional binary TrainVideoCount (STRING);
  optional binary KnowledgeGraphId (STRING);
  optional binary Name (STRING);
  optional binary WikiUrl (STRING);
  optional binary Vertical1 (STRING);
  optional binary Vertical2 (STRING);
  optional binary Vertical3 (STRING);
  optional binary WikiDescription (STRING);
}
```  

From this schema we see that all columns in the parquet file are of string type and optional (nullable).
Let's create a table in our database for this data. The column names are not important; only the order of the columns and their types have to match the parquet file schema.

In [38]:
TABLE_NAME = "Y8M_CLASSES"

sql = """
create or replace table {schema_name!i}.{table_name!i} 
(
    ClsIndex          VARCHAR2(1024),
    TrainVideoCount   VARCHAR2(1024),
    KnowledgeGraphId  VARCHAR2(1024),
    Name              VARCHAR2(1024),
    WikiUrl           VARCHAR2(1024),
    Vertical1         VARCHAR2(1024),
    Vertical2         VARCHAR2(1024),
    Vertical3         VARCHAR2(1024),
    WikiDescription   VARCHAR2(2048)
)
"""

with pyexasol.connect(**conf.connection_params) as conn:
    conn.execute(sql, query_params={
        "schema_name": conf.SCHEMA,
        "table_name": TABLE_NAME
    })
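
The column list above was written by hand. As a rough sketch of how it could be derived from the parquet-tools output instead, the snippet below parses the schema text with a regular expression and renders one column definition per field. The names `parse_fields`, `column_definitions` and `TYPE_MAP` are hypothetical, and the type mapping is an assumption for illustration only (this dataset needs just the `binary`/`STRING` entry); check it against the extension's data type mapping before relying on it.

```python
import re

# parquet-tools output as shown earlier (three of the nine fields, for brevity)
SCHEMA_TEXT = """
message glue_schema {
  optional binary Index (STRING);
  optional binary TrainVideoCount (STRING);
  optional binary WikiDescription (STRING);
}
"""

# Assumed mapping from (physical type, logical annotation) to an Exasol type.
# Only the first entry is exercised by this dataset.
TYPE_MAP = {
    ("binary", "STRING"): "VARCHAR({size})",
    ("int32", None): "INTEGER",
    ("int64", None): "BIGINT",
    ("double", None): "DOUBLE",
    ("boolean", None): "BOOLEAN",
}

# Matches lines such as "optional binary Name (STRING);"
FIELD_RE = re.compile(r"^\s*(optional|required)\s+(\w+)\s+(\w+)(?:\s+\((\w+)\))?;\s*$")


def parse_fields(schema_text):
    """Return (name, physical_type, logical_type) tuples in file order."""
    fields = []
    for line in schema_text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            _repetition, physical, name, logical = match.groups()
            fields.append((name, physical, logical))
    return fields


def column_definitions(fields, renames=None, varchar_size=1024):
    """Render 'name TYPE' strings; unknown types raise instead of guessing."""
    renames = renames or {}
    columns = []
    for name, physical, logical in fields:
        key = (physical, logical)
        if key not in TYPE_MAP:
            raise ValueError(f"no mapping for parquet type {key}")
        sql_type = TYPE_MAP[key].format(size=varchar_size)
        columns.append(f"{renames.get(name, name)} {sql_type}")
    return columns


fields = parse_fields(SCHEMA_TEXT)
# Column names need not match the file, so Index can be renamed as in the table above
print(",\n".join(column_definitions(fields, renames={"Index": "ClsIndex"})))
```

Because only the order and types of the columns matter for the import, the helper keeps the fields in the order in which they appear in the file.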

## S3 credentials

Even though our data resides in a publicly available S3 bucket (as is the case for the data in this example), we still need to provide S3 credentials to access the bucket.

TODO: valid credentials are currently required even for public buckets; see [this issue](https://github.com/exasol/cloud-storage-extension/issues/283).

In [32]:
sql = """
CREATE OR REPLACE CONNECTION S3_CONNECTION TO '' USER '' 
IDENTIFIED BY 'S3_ACCESS_KEY={access_key!r};S3_SECRET_KEY={secret_key!r}';
"""

S3_ACCESS_KEY = "<ACCESS_KEY>"
S3_SECRET_KEY = "<SECRET_KEY>"

with pyexasol.connect(**conf.connection_params) as conn:
    # !r inserts the values verbatim (no quoting), so they must not contain quotes
    conn.execute(sql, query_params={
        "access_key": S3_ACCESS_KEY,
        "secret_key": S3_SECRET_KEY,
    })
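
Keys typed directly into a notebook are easy to leak. A minimal sketch of an alternative, assuming the keys are exported under the standard AWS environment variable names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`. The helper names `credentials_from_env` and `s3_identity` are hypothetical and only illustrate how the `S3_ACCESS_KEY=...;S3_SECRET_KEY=...` string of the connection object is put together.

```python
import os


def credentials_from_env(env=None):
    """Read the key pair from the environment (or from a dict, for testing)."""
    env = os.environ if env is None else env
    access_key = env.get("AWS_ACCESS_KEY_ID", "")
    secret_key = env.get("AWS_SECRET_ACCESS_KEY", "")
    if not access_key or not secret_key:
        raise RuntimeError("AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are not set")
    return access_key, secret_key


def s3_identity(access_key, secret_key):
    """Build the 'S3_ACCESS_KEY=...;S3_SECRET_KEY=...' string used by the connection."""
    for value in (access_key, secret_key):
        # The string ends up inside a single-quoted SQL literal and uses ';'
        # as a separator, so reject characters that would break either.
        if "'" in value or ";" in value:
            raise ValueError("credentials must not contain quotes or semicolons")
    return f"S3_ACCESS_KEY={access_key};S3_SECRET_KEY={secret_key}"


# Example with placeholder values instead of real keys
demo_env = {"AWS_ACCESS_KEY_ID": "<ACCESS_KEY>", "AWS_SECRET_ACCESS_KEY": "<SECRET_KEY>"}
identity = s3_identity(*credentials_from_env(demo_env))
print(identity)
```

The pair returned by `credentials_from_env()` could then be passed as `access_key` and `secret_key` in `query_params`, in place of the two constants.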

## Importing data

Now it's time to import our data. We call the `IMPORT_PATH` script, providing the location of the parquet files, their format, the S3 endpoint (which has to match the bucket's configuration) and the name of our connection object.

In [39]:
params = {
    "schema": conf.SCHEMA,
    "table": TABLE_NAME,  
}

sql = """
IMPORT INTO {schema!i}.{table!i}
FROM SCRIPT {schema!i}.IMPORT_PATH WITH
    BUCKET_PATH = 's3a://aws-roda-ml-datalake/yt8m_ods/vocabulary/*'
    DATA_FORMAT = 'PARQUET'
    S3_ENDPOINT = 's3-us-west-2.amazonaws.com'
    CONNECTION_NAME = 'S3_CONNECTION';
"""

with pyexasol.connect(**conf.connection_params) as conn:
    conn.execute(sql, query_params=params)
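
The statement above has its options written out in place. As a sketch of how the same statement could be assembled for other locations or formats, the hypothetical helper `build_import_sql` below renders the `WITH` clause from a dictionary and leaves the `{schema!i}` and `{table!i}` placeholders for pyexasol to fill in. The option names are the ones used above; the helper itself is not part of the extension or of pyexasol.

```python
def build_import_sql(options):
    """Render an IMPORT statement; options maps parameter names to string values."""
    option_lines = []
    for key, value in options.items():
        if "'" in value:
            raise ValueError(f"{key} must not contain a single quote")
        option_lines.append(f"    {key} = '{value}'")
    # Plain (non f-string) literals keep the pyexasol placeholders intact
    header = (
        "IMPORT INTO {schema!i}.{table!i}\n"
        "FROM SCRIPT {schema!i}.IMPORT_PATH WITH\n"
    )
    return header + "\n".join(option_lines) + ";"


import_sql = build_import_sql({
    "BUCKET_PATH": "s3a://aws-roda-ml-datalake/yt8m_ods/vocabulary/*",
    "DATA_FORMAT": "PARQUET",
    "S3_ENDPOINT": "s3-us-west-2.amazonaws.com",
    "CONNECTION_NAME": "S3_CONNECTION",
})
print(import_sql)
```

The resulting string can be executed in the same way as before, with `conn.execute(import_sql, query_params=params)`.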

Let's check that the data was imported:

In [44]:
with pyexasol.connect(**conf.connection_params) as conn:
    data_rows = conn.execute("select count(*) from {schema!i}.{table!i}", query_params=params)
    count = next(data_rows)[0] 
    print(f"Loaded {count} rows")
    data = conn.execute("select * from {schema!i}.{table!i} limit 1", query_params=params)
    for row in data:
        print(row)
    

Loaded 3862 rows
('2054', '401', '/m/01hrv5', 'Popcorn', 'https://en.wikipedia.org/wiki/Popcorn', 'Food & Drink', None, None, 'Popcorn is a type of corn that expands from the kernel and puffs up when heated. Popcorn is able to pop like amaranth grain, sorghum, quinoa, and millet. When heated, pressure builds within the kernel, and a small explosion is the end result. Some strains of corn are now cultivated specifically as popping corns. There are various techniques for popping corn. Along with prepackaged popcorn, which is generally intended to be prepared in a microwave oven, there are small home appliances for popping corn. These methods require the use of minimally processed popping corn. A larger-scale, commercial popcorn machine, which resembled a modern movie theater popcorn machine on a cart with large bicycle style wheels, was invented by Charles Cretors in the late 19th century. Unpopped popcorn is considered nonperishable and will last indefinitely if stored in ideal conditio