# Importing data from cloud storage

In this notebook we'll use [cloud-storage-extension](https://github.com/exasol/cloud-storage-extension/) to import publicly available data from AWS S3 into the Exasol database. 

## Prerequisites

Before importing the data, we need to configure the database by setting up the cloud-storage-extension jar files and the UDF scripts used for the import. This only needs to be done once per database.

### Open Secure Configuration Storage

In [1]:
%run ../utils/access_store_ui.ipynb
display(get_access_store_ui('../'))


### Setup of cloud-storage-extension

Before running the setup, you need to complete the "Main configuration" steps in [main_config.ipynb](../main_config.ipynb).

In [5]:
%run ../cloud_store_config.ipynb

Could Storage Extension was initialized


# Importing data from parquet files

To begin, we'll load a small volume of data from the publicly available [Ookla Network Performance Maps](https://registry.opendata.aws/speedtest-global-performance/) dataset, which contains aggregated network performance measurements from the speedtest.net website.

In this example, we'll import only a subset of the dataset: mobile users for Q1 of 2019. In total there are about 3.2M rows, stored in a parquet file in a public S3 bucket: `s3://ookla-open-data/parquet/performance/type=mobile/year=2019/quarter=1/2019-01-01_performance_mobile_tiles.parquet`
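
The object path above follows a partitioned layout (`type=…/year=…/quarter=…`). As a minimal sketch, the helper below rebuilds that path from its parts. The function name `ookla_parquet_path` is hypothetical, and the naming rule for the file (first day of the quarter plus the type) is an assumption inferred from the single path shown here, so check the bucket listing before relying on it for other quarters.

```python
def ookla_parquet_path(kind, year, quarter, scheme="s3"):
    """Build the object path for one quarter of the Ookla dataset.

    Assumption: the file name is '<first day of quarter>_performance_<kind>_tiles.parquet',
    as inferred from the one path quoted in this notebook.
    """
    if quarter not in (1, 2, 3, 4):
        raise ValueError(f"quarter must be 1..4, got {quarter}")
    first_month = (quarter - 1) * 3 + 1  # Q1 -> 1, Q2 -> 4, Q3 -> 7, Q4 -> 10
    file_name = f"{year}-{first_month:02d}-01_performance_{kind}_tiles.parquet"
    return (
        f"{scheme}://ookla-open-data/parquet/performance/"
        f"type={kind}/year={year}/quarter={quarter}/{file_name}"
    )


print(ookla_parquet_path("mobile", 2019, 1))
```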


In [6]:
import pyexasol
from exasol.nb_connector.connections import open_pyexasol_connection

## Schema and target table

As a first step, we need to obtain the schema of the data, that is, the set of columns stored in the parquet files and their types. You might know this in advance (if it is your own dataset); if not, you need to inspect the parquet files to work out their schema.

One option is the parquet-tools library, available as a [docker container](https://hub.docker.com/r/nathanhowell/parquet-tools). To use it, download one of the parquet files locally and run the container against that file. The same container also lets you peek into a parquet file and look at its actual data.

For the file above, I got the following schema information:

```
message schema {
  optional binary quadkey (STRING);
  optional binary tile (STRING);
  optional int64 avg_d_kbps;
  optional int64 avg_u_kbps;
  optional int64 avg_lat_ms;
  optional int64 tests;
  optional int64 devices;
}
```  

From this schema we see that the parquet file has two types of columns: strings and integers.
Let's create a table in our database for this data. The column names are not important; only the order of the columns and their types have to match the parquet file schema.

In [7]:
TABLE_NAME = "OOKLA_MAP"

sql = """
create or replace table {schema_name!i}.{table_name!i} 
(
    quadkey     VARCHAR2(1024),
    tile        VARCHAR2(1024),
    avg_d_kbps  BIGINT,
    avg_u_kbps  BIGINT,
    avg_lat_ms  BIGINT,
    tests       BIGINT,
    devices     BIGINT
)
"""

with open_pyexasol_connection(ai_lab_config) as conn:
    conn.execute(sql, query_params={
        "schema_name": ai_lab_config.db_schema,
        "table_name": TABLE_NAME
    })
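
The column list above was written by hand from the schema dump. For wider files it may be easier to derive it mechanically. The sketch below parses the text printed by parquet-tools and maps each field to a column type. The helper name `columns_from_parquet_schema` and the mapping table are hypothetical and cover only the two types that appear in this file; any other Parquet type raises a `KeyError` and needs its own entry.

```python
import re

# Hypothetical mapping: only the two (physical, logical) types seen in this file.
PARQUET_TO_EXASOL = {
    ("binary", "STRING"): "VARCHAR2(1024)",
    ("int64", None): "BIGINT",
}

# Matches lines such as "optional binary quadkey (STRING);" or "optional int64 tests;"
FIELD_RE = re.compile(r"^\s*(optional|required)\s+(\w+)\s+(\w+)(?:\s+\((\w+)\))?;\s*$")

SCHEMA_TEXT = """
message schema {
  optional binary quadkey (STRING);
  optional binary tile (STRING);
  optional int64 avg_d_kbps;
  optional int64 avg_u_kbps;
  optional int64 avg_lat_ms;
  optional int64 tests;
  optional int64 devices;
}
"""


def columns_from_parquet_schema(schema_text):
    """Return (column_name, column_type) pairs in the order of the schema."""
    columns = []
    for line in schema_text.splitlines():
        match = FIELD_RE.match(line)
        if match is None:
            continue  # skips "message schema {", "}" and blank lines
        _repetition, physical, name, logical = match.groups()
        columns.append((name, PARQUET_TO_EXASOL[(physical, logical)]))
    return columns


for name, column_type in columns_from_parquet_schema(SCHEMA_TEXT):
    print(f"    {name:<12}{column_type},")
```

The printed lines can be pasted into the `create or replace table` statement (dropping the trailing comma on the last one).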

## S3 credentials

If the S3 bucket is public, we can pass an empty access key and secret key. Otherwise, replace them with valid credentials.

In [8]:
sql = """
CREATE OR REPLACE CONNECTION S3_CONNECTION TO '' USER '' 
IDENTIFIED BY 'S3_ACCESS_KEY={access_key!r};S3_SECRET_KEY={secret_key!r}';
"""

S3_ACCESS_KEY = ""
S3_SECRET_KEY = ""

with open_pyexasol_connection(ai_lab_config) as conn:
    conn.execute(sql, query_params={
        "schema": ai_lab_config.db_schema,
        "access_key": S3_ACCESS_KEY,
        "secret_key": S3_SECRET_KEY,
    })
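
For a private bucket, typing the keys into a notebook cell leaves them in the saved file. One possible alternative, sketched below, reads them from environment variables and falls back to empty strings for public buckets. The helper `s3_identified_by` is hypothetical, and the assumption that the credentials are exported as `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` may not match your setup.

```python
import os


def s3_identified_by(access_key="", secret_key=""):
    """Format the key pair the way the IDENTIFIED BY clause above expects it."""
    return f"S3_ACCESS_KEY={access_key};S3_SECRET_KEY={secret_key}"


# Assumption: credentials, if any, are exported under the usual AWS variable names.
S3_ACCESS_KEY = os.environ.get("AWS_ACCESS_KEY_ID", "")
S3_SECRET_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY", "")

# Print only whether keys were found, never the keys themselves.
print("credentials found" if S3_ACCESS_KEY and S3_SECRET_KEY else "using empty credentials")
```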

## Importing data

Now it's time to import our data. We call the `IMPORT_PATH` script, providing the location of the parquet files, their format, the S3 endpoint (which has to match the bucket's region) and the name of our connection object.

In [10]:
params = {
    "schema": ai_lab_config.db_schema,
    "table": TABLE_NAME,  
}

sql = """
IMPORT INTO {schema!i}.{table!i}
FROM SCRIPT {schema!i}.IMPORT_PATH WITH
    BUCKET_PATH = 's3a://ookla-open-data/parquet/performance/type=mobile/year=2019/quarter=1/*'
    DATA_FORMAT = 'PARQUET'
    S3_ENDPOINT = 's3-us-west-2.amazonaws.com'
    CONNECTION_NAME = 'S3_CONNECTION';
"""

with open_pyexasol_connection(ai_lab_config) as conn:
    conn.execute(sql, query_params=params)
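
If the same import has to be repeated with a different path or endpoint, the statement can be assembled from a dictionary instead of being edited by hand. This is only a sketch: `build_import_sql` is a hypothetical helper, not part of cloud-storage-extension or pyexasol. It keeps the `{schema!i}` and `{table!i}` placeholders intact so that the result can still be passed to `conn.execute(..., query_params=params)` as in the cell above.

```python
def build_import_sql(options):
    """Compose the IMPORT ... FROM SCRIPT statement from a dict of WITH parameters.

    Note: values must not contain curly braces, because the statement is
    formatted again when it is executed with query_params.
    """
    lines = []
    for key, value in options.items():
        escaped = str(value).replace("'", "''")  # double single quotes for SQL literals
        lines.append(f"    {key} = '{escaped}'")
    return (
        "IMPORT INTO {schema!i}.{table!i}\n"
        "FROM SCRIPT {schema!i}.IMPORT_PATH WITH\n"
        + "\n".join(lines)
        + ";"
    )


import_sql = build_import_sql({
    "BUCKET_PATH": "s3a://ookla-open-data/parquet/performance/type=mobile/year=2019/quarter=1/*",
    "DATA_FORMAT": "PARQUET",
    "S3_ENDPOINT": "s3-us-west-2.amazonaws.com",
    "CONNECTION_NAME": "S3_CONNECTION",
})
print(import_sql)
```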

Let's check that the data was imported by the process above:

In [11]:
with open_pyexasol_connection(ai_lab_config) as conn:
    data_rows = conn.execute("select count(*) from {schema!i}.{table!i}", query_params=params)
    count = next(data_rows)[0] 
    print(f"Loaded {count} rows")
    data = conn.execute("select * from {schema!i}.{table!i} limit 1", query_params=params)
    for row in data:
        print(row)
    

Loaded 3231245 rows
('0212113210312302', 'POLYGON((-114.32373046875 53.1829958600872, -114.318237304688 53.1829958600872, -114.318237304688 53.1797038936054, -114.32373046875 53.1797038936054, -114.32373046875 53.1829958600872))', '39647', '29770', '30', '1', '1')
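
Note that the numeric columns in the row above come back as strings (`'39647'` rather than `39647`). A likely reason, stated here as an assumption, is that `BIGINT` is a high-precision decimal type in Exasol and the driver returns such values as text by default. The sketch below converts a fetched row into a typed record; the `OoklaTile` class and `parse_row` function are hypothetical names introduced for this example.

```python
from typing import NamedTuple, Optional


class OoklaTile(NamedTuple):
    quadkey: str
    tile: str
    avg_d_kbps: Optional[int]
    avg_u_kbps: Optional[int]
    avg_lat_ms: Optional[int]
    tests: Optional[int]
    devices: Optional[int]


def parse_row(row):
    """Convert the five numeric columns from strings to int, keeping NULLs as None."""
    quadkey, tile, *numbers = row
    converted = [int(value) if value is not None else None for value in numbers]
    return OoklaTile(quadkey, tile, *converted)


# The row printed by the check above.
sample_row = (
    "0212113210312302",
    "POLYGON((-114.32373046875 53.1829958600872, -114.318237304688 53.1829958600872, "
    "-114.318237304688 53.1797038936054, -114.32373046875 53.1797038936054, "
    "-114.32373046875 53.1829958600872))",
    "39647", "29770", "30", "1", "1",
)

record = parse_row(sample_row)
print(record.avg_d_kbps, record.avg_u_kbps, record.avg_lat_ms)
```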
