# Importing data from cloud storage

In this notebook we'll use [cloud-storage-extension](https://github.com/exasol/cloud-storage-extension/) to import publicly available data from AWS S3 into the Exasol database. 

## Prerequisites

Before importing the data, we need to configure the database by setting up the cloud-storage-extension jar files and the UDF scripts used for the import. This needs to be done only once per database.

### Access configuration

In [None]:
%run ../utils/access_store_ui.ipynb
display(get_access_store_ui('../'))

### Setup of cloud-storage-extension

In [None]:
%run ../cloud_store_config.ipynb

# Importing a small amount of data from Parquet files

To begin, we'll load a small volume of data from the publicly available [YouTube-8M dataset](https://registry.opendata.aws/yt8m/).

In this example, we'll work with the "dataset vocabulary", which holds information about the classes of videos. In total there are 3862 entries, stored in a single Parquet file:
s3://aws-roda-ml-datalake/yt8m_ods/vocabulary/run-1644252350398-part-block-0-r-00000-snappy.parquet


In [None]:
import pyexasol
from exasol.connections import open_pyexasol_connection

## Schema and target table

As a first step, we need to obtain the schema of the data (the set of columns stored in the Parquet files, with their types). You might have this information in advance (if it is your own dataset), but if not, you need to inspect the Parquet files to work out their schema.

One way to do this is the parquet-tools library wrapped into a [docker container](https://hub.docker.com/r/nathanhowell/parquet-tools). To use it, download one of the Parquet files locally, then run the container against that file. Using the same container, you can also peek into a Parquet file and look at its actual data.
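
To download a file locally we first need an HTTPS URL for it. The sketch below converts an `s3://` URI into a virtual-hosted-style URL. The helper name `s3_uri_to_https` is ours, the region `us-west-2` is an assumption taken from the S3 endpoint used later in this notebook, and the download itself only works if the object allows anonymous reads.

```python
from urllib.parse import quote, urlparse


def s3_uri_to_https(uri: str, region: str) -> str:
    """Turn s3://bucket/key into a virtual-hosted-style HTTPS URL."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("s3", "s3a") or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # quote() leaves "/" untouched and escapes characters such as spaces
    key = quote(parsed.path.lstrip("/"))
    return f"https://{parsed.netloc}.s3.{region}.amazonaws.com/{key}"


# Assumption: the bucket lives in us-west-2
url = s3_uri_to_https(
    "s3://aws-roda-ml-datalake/yt8m_ods/vocabulary/"
    "run-1644252350398-part-block-0-r-00000-snappy.parquet",
    region="us-west-2",
)
print(url)

# To fetch the file (needs network access and a publicly readable object):
# import urllib.request
# urllib.request.urlretrieve(url, "vocabulary.parquet")
```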

For the file above, I got the following schema information:

```
message glue_schema {
  optional binary Index (STRING);
  optional binary TrainVideoCount (STRING);
  optional binary KnowledgeGraphId (STRING);
  optional binary Name (STRING);
  optional binary WikiUrl (STRING);
  optional binary Vertical1 (STRING);
  optional binary Vertical2 (STRING);
  optional binary Vertical3 (STRING);
  optional binary WikiDescription (STRING);
}
```  

From this schema we see that all the columns in the Parquet file are of string type and optional (nullable).
Let's create a table in our database for this data. The column names are not important; only the order of the columns and their types have to match the Parquet file schema.

In [38]:
TABLE_NAME = "Y8M_CLASSES"

sql = """
create or replace table {schema_name!i}.{table_name!i} 
(
    ClsIndex          VARCHAR2(1024),
    TrainVideoCount   VARCHAR2(1024),
    KnowledgeGraphId  VARCHAR2(1024),
    Name              VARCHAR2(1024),
    WikiUrl           VARCHAR2(1024),
    Vertical1         VARCHAR2(1024),
    Vertical2         VARCHAR2(1024),
    Vertical3         VARCHAR2(1024),
    WikiDescription   VARCHAR2(2048)
)
"""

with open_pyexasol_connection(sb_config) as conn:
    conn.execute(sql, query_params={
        "schema_name": sb_config.db_schema,
        "table_name": TABLE_NAME
    })
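
For files with many columns, writing the column list by hand gets tedious. As a rough sketch, the schema text printed by parquet-tools can be turned into column definitions. The helper `columns_from_schema` and the type mapping are our own assumptions; only the `binary (STRING)` case occurs in this dataset, and the other entries are illustrative.

```python
import re

# One field line as printed by parquet-tools,
# e.g. "  optional binary Index (STRING);"
FIELD_RE = re.compile(
    r"^\s*(optional|required)\s+(\w+)\s+(\w+)(?:\s+\((\w+)\))?;\s*$"
)

# Assumed mapping from (physical type, logical annotation) to a SQL type
TYPE_MAP = {
    ("binary", "STRING"): "VARCHAR(1024)",
    ("int32", None): "DECIMAL(10,0)",
    ("int64", None): "DECIMAL(19,0)",
    ("double", None): "DOUBLE PRECISION",
    ("boolean", None): "BOOLEAN",
}


def columns_from_schema(schema_text: str) -> list[str]:
    """Return 'name type' strings in the order the fields appear."""
    columns = []
    for line in schema_text.splitlines():
        match = FIELD_RE.match(line)
        if not match:
            continue  # skips the "message ... {" and "}" lines
        _repetition, physical, name, logical = match.groups()
        sql_type = TYPE_MAP.get((physical, logical))
        if sql_type is None:
            raise ValueError(f"no mapping for {physical} ({logical})")
        columns.append(f"{name} {sql_type}")
    return columns


schema_text = """
message glue_schema {
  optional binary Index (STRING);
  optional binary Name (STRING);
}
"""
print(",\n".join(columns_from_schema(schema_text)))
```

The generated names and lengths may still need adjusting by hand; the table above, for example, uses `ClsIndex` rather than `Index` and a longer `WikiDescription` column.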

## S3 credentials

If the S3 bucket is public, we can pass an empty access key and secret key. Otherwise, replace them with valid credentials.

In [32]:
# {...!r} inserts the value raw (unescaped) into the quoted literal,
# so the keys must not contain single quotes.
sql = """
CREATE OR REPLACE CONNECTION S3_CONNECTION TO '' USER ''
IDENTIFIED BY 'S3_ACCESS_KEY={access_key!r};S3_SECRET_KEY={secret_key!r}';
"""

S3_ACCESS_KEY = ""
S3_SECRET_KEY = ""

with open_pyexasol_connection(sb_config) as conn:
    conn.execute(sql, query_params={
        "access_key": S3_ACCESS_KEY,
        "secret_key": S3_SECRET_KEY,
    })
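
For a private bucket, it is usually better not to hard-code the keys in the notebook. A minimal sketch, assuming the keys are provided through environment variables (the variable names here are hypothetical) and that the values are spliced into the SQL literal without escaping, so we reject characters that would break it:

```python
import os


def s3_identity(access_key: str, secret_key: str) -> str:
    """Build the IDENTIFIED BY payload in the format used by the statement above."""
    for label, value in (("access key", access_key), ("secret key", secret_key)):
        # A quote would end the SQL literal; a semicolon would end the key=value pair
        if "'" in value or ";" in value:
            raise ValueError(f"{label} contains a quote or semicolon")
    return f"S3_ACCESS_KEY={access_key};S3_SECRET_KEY={secret_key}"


# Hypothetical variable names; empty strings keep the public-bucket behaviour
S3_ACCESS_KEY = os.environ.get("S3_ACCESS_KEY", "")
S3_SECRET_KEY = os.environ.get("S3_SECRET_KEY", "")

# Shown with empty keys only, so that no secret ends up in the notebook output
print(s3_identity("", ""))
```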

## Importing data

Now it's time to import our data. We call the `IMPORT_PATH` script, providing the location of the Parquet files, their format, the S3 endpoint (which has to match the bucket's configuration) and the name of our connection object.

In [39]:
params = {
    "schema": sb_config.db_schema,
    "table": TABLE_NAME,  
}

sql = """
IMPORT INTO {schema!i}.{table!i}
FROM SCRIPT {schema!i}.IMPORT_PATH WITH
    BUCKET_PATH = 's3a://aws-roda-ml-datalake/yt8m_ods/vocabulary/*'
    DATA_FORMAT = 'PARQUET'
    S3_ENDPOINT = 's3-us-west-2.amazonaws.com'
    CONNECTION_NAME = 'S3_CONNECTION';
"""

with open_pyexasol_connection(sb_config) as conn:
    conn.execute(sql, query_params=params)
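
To reuse the same statement for other paths or endpoints, the option list can be assembled from parameters. This is only a sketch: `build_import_sql` is a hypothetical helper of ours, not part of cloud-storage-extension, and it leaves the `{schema!i}` and `{table!i}` placeholders for pyexasol to fill in as before.

```python
def build_import_sql(bucket_path: str, data_format: str,
                     s3_endpoint: str, connection_name: str) -> str:
    """Assemble the IMPORT statement with the four options used above."""
    options = {
        "BUCKET_PATH": bucket_path,
        "DATA_FORMAT": data_format,
        "S3_ENDPOINT": s3_endpoint,
        "CONNECTION_NAME": connection_name,
    }
    lines = []
    for key, value in options.items():
        # Quotes would break the SQL literal; braces would be read as placeholders
        if any(ch in value for ch in "'{}"):
            raise ValueError(f"{key} contains a quote or brace")
        lines.append(f"    {key} = '{value}'")
    return (
        "IMPORT INTO {schema!i}.{table!i}\n"
        "FROM SCRIPT {schema!i}.IMPORT_PATH WITH\n"
        + "\n".join(lines) + ";"
    )


import_sql = build_import_sql(
    bucket_path="s3a://aws-roda-ml-datalake/yt8m_ods/vocabulary/*",
    data_format="PARQUET",
    s3_endpoint="s3-us-west-2.amazonaws.com",
    connection_name="S3_CONNECTION",
)
print(import_sql)
```

The result can be passed to `conn.execute(import_sql, query_params=params)` in the same way as the hand-written statement.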

Let's check that the data was imported:

In [None]:
with open_pyexasol_connection(sb_config) as conn:
    data_rows = conn.execute("select count(*) from {schema!i}.{table!i}", query_params=params)
    count = next(data_rows)[0] 
    print(f"Loaded {count} rows")
    data = conn.execute("select * from {schema!i}.{table!i} limit 1", query_params=params)
    for row in data:
        print(row)
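
Since we know the vocabulary has 3862 entries, the count can also be checked explicitly instead of being read off the output. A small sketch with a hypothetical helper; a mismatch could mean, for instance, that the import cell was run twice without recreating the table:

```python
EXPECTED_ROWS = 3862  # entry count quoted earlier in this notebook


def check_row_count(actual: int, expected: int = EXPECTED_ROWS) -> None:
    """Raise if the loaded row count differs from the expected one."""
    if actual != expected:
        raise RuntimeError(f"expected {expected} rows, found {actual}")


check_row_count(3862)  # in the notebook this would be check_row_count(count)
```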
    