<div style="text-align: right;">
  <img src="https://raw.githubusercontent.com/exasol/ai-lab/refs/heads/main/assets/Exasol_Logo_2025_Dark.svg" style="width:200px; margin: 10px;" />
</div>

In this notebook, we create a Virtual Schema that reads JSON files from a public S3 bucket.
The data we load is part of the [Reuters 21578](https://paperswithcode.com/dataset/reuters-21578) dataset, a collection of news articles.

In [None]:
%run ../utils/access_store_ui.ipynb
display(get_access_store_ui('../'))

As a first step, we set up the S3 Virtual Schema connector in the database.
You can find more information in the [s3-document-files-virtual-schema](https://github.com/exasol/s3-document-files-virtual-schema/) GitHub repository. 

In [None]:
%run ./s3_vs_setup.ipynb

# Create Virtual Schema

To configure the virtual schema, we need the following:

1. Create an S3 Connection object, which specifies the S3 bucket where our data resides. Our dataset is in a public S3 bucket, so no S3 credentials are needed. If your data is not publicly accessible, you need to store the credentials in the Connection object.
2. Upload the EDML file to BucketFS. EDML describes the mapping between the JSON structure and the desired Virtual Schema tables and columns. In our case, the mapping is simple, as every JSON file describes an individual news article with the same set of fields. However, EDML also supports more complex scenarios, like table references, foreign keys, and missing columns. You can find more details [in the EDML user guide](https://github.com/exasol/virtual-schema-common-document/blob/main/doc/user_guide/edml_user_guide.md).
3. Create a Virtual Schema that uses the S3 Connection object and the EDML file in BucketFS. After this step, new virtual tables are available in Exasol, and every query against them reads the data from S3.

In [None]:
import pyexasol
import pathlib
from exasol.nb_connector.connections import open_pyexasol_connection, open_bucketfs_connection
from exasol.nb_connector import bfs_utils

In [None]:
sql = """
CREATE CONNECTION S3_CONNECTION_VS
  TO ''
  USER ''
  IDENTIFIED BY '{
      "awsRegion": "eu-central-1", 
      "s3Bucket": "ai-lab-example-data-s3" 
  }';
"""

with open_pyexasol_connection(ai_lab_config) as conn:
    conn.execute(sql)
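
For a bucket that is not public, the same statement needs credentials in its JSON payload. The helper below is a minimal sketch of how such a statement could be assembled. The function name `build_s3_connection_sql`, the connection name `MY_PRIVATE_S3_CONNECTION`, and the bucket `my-private-bucket` are hypothetical, and the key names `awsAccessKeyId` and `awsSecretAccessKey` are an assumption that should be checked against the adapter's user guide.

```python
import json


def build_s3_connection_sql(name, region, bucket,
                            access_key_id=None, secret_access_key=None):
    """Return a CREATE CONNECTION statement for the S3 document adapter.

    The credential key names are an assumption, not taken from this notebook.
    """
    settings = {"awsRegion": region, "s3Bucket": bucket}
    if access_key_id and secret_access_key:
        settings["awsAccessKeyId"] = access_key_id
        settings["awsSecretAccessKey"] = secret_access_key
    # Double any single quotes so the JSON fits into an SQL string literal.
    payload = json.dumps(settings, indent=2).replace("'", "''")
    return (
        f"CREATE CONNECTION {name}\n"
        "  TO ''\n"
        "  USER ''\n"
        f"  IDENTIFIED BY '{payload}';"
    )


# Public bucket, as used in this notebook: no credentials in the payload.
public_sql = build_s3_connection_sql(
    "S3_CONNECTION_VS", "eu-central-1", "ai-lab-example-data-s3"
)

# Hypothetical private bucket: placeholders stand in for real credentials.
private_sql = build_s3_connection_sql(
    "MY_PRIVATE_S3_CONNECTION", "eu-central-1", "my-private-bucket",
    access_key_id="<AWS ACCESS KEY ID>",
    secret_access_key="<AWS SECRET KEY>",
)
print(private_sql)
```

The resulting string can be passed to `conn.execute(...)` in the same way as the statement above. In practice, the credentials would come from a secret store rather than from literals in the notebook.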

Our EDML definition extracts 3 fields from the JSON files:
* `title`: the article title as VARCHAR(255); longer texts are truncated.
* `id` (mapped to column `NEWS_ID`): the article identifier as VARCHAR(50).
* `body`: the text of the article as VARCHAR(1024); longer texts are truncated.

In addition to those 3 fields, we have the field `topics`, which is extracted as a reference table.
In the JSON data, topics are represented as a list of tags associated with the article, for example:

```
    "topics": [
      "grain",
      "wheat",
      "corn",
      "barley",
      "oat",
      "sorghum"
    ],
```

For convenient access to the topics, we use the `toTableMapping` EDML feature. It creates a second table, `NEWS_TOPICS`, which holds one row per topic together with a reference to the `NEWS_ID` of the article:

```
      "topics": {
        "toTableMapping": {
          "mapping": {
            "toVarcharMapping": {
              "destinationName": "NAME"
            }
          }
        }
      },
```
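
The effect of this mapping can be imitated in plain Python. The sketch below is not the adapter's implementation. It only mirrors the described result: one `NEWS` row with truncated text columns, plus one `NEWS_TOPICS` row per list entry. The function `flatten_article` and the sample document are hypothetical. The reference column name `NEWS_NEWS_ID` is taken from the join queries later in this notebook.

```python
def flatten_article(doc, title_size=255, body_size=1024):
    """Split one article document into a NEWS row and its NEWS_TOPICS rows."""
    news_row = {
        "NEWS_ID": doc["id"],
        "TITLE": doc["title"][:title_size],  # longer titles are truncated
        "BODY": doc["body"][:body_size],     # longer bodies are truncated
    }
    # One row per list entry; NEWS_NEWS_ID points back to the parent article.
    topic_rows = [
        {"NEWS_NEWS_ID": doc["id"], "NAME": topic}
        for topic in doc.get("topics", [])
    ]
    return news_row, topic_rows


# Hypothetical sample document, shaped like the JSON files described above.
sample = {
    "id": "sample-1",
    "title": "Sample title",
    "body": "x" * 2000,
    "topics": ["grain", "wheat", "corn"],
}

news_row, topic_rows = flatten_article(sample)
print(len(news_row["BODY"]))
for row in topic_rows:
    print(row)
```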

In [None]:
# Upload the EDML mapping file to BucketFS
bfs_bucket = open_bucketfs_connection(ai_lab_config)
bfs_path = bfs_utils.put_file(bfs_bucket, pathlib.Path("reuters-edml.json"))

In [None]:
bfs_path.as_udf_path()

In [None]:
sql = """
CREATE VIRTUAL SCHEMA {schema_name!i}_VS USING {schema_name!i}.S3_FILES_ADAPTER WITH
    CONNECTION_NAME = 'S3_CONNECTION_VS'
    MAPPING         = {map_file!s};
"""

with open_pyexasol_connection(ai_lab_config) as conn:
    conn.execute(sql, query_params={
        "schema_name": ai_lab_config.db_schema,
        "map_file": bfs_path.as_udf_path()
    })

# Query the data

Once the Virtual Schema is created, the virtual table `NEWS` is available for querying, for example as `AI_LAB_VS.NEWS` if the configured schema is `AI_LAB`.
*Note*: be aware of the performance overhead. Virtual Schemas don't do any intermediate caching, so every query against a virtual table performs several S3 reads, which might be slow.

If you are going to access the data several times, it can make sense to copy the data from the virtual table into a regular table. For simplicity, the example queries below read directly from the virtual tables; the data is materialized at the end of this notebook.

In [None]:
sql = """
select title, news_id from {schema_name!i}_VS.NEWS
limit 10
"""

with open_pyexasol_connection(ai_lab_config) as conn:
    res = conn.execute(sql, query_params={
        "schema_name": ai_lab_config.db_schema
    })
    for r in res:
        print(r)

Besides the `NEWS` table, where each row is a news article, we also have a reference table with the topics associated with every article.
The `toTableMapping` in our EDML file created this second table, `NEWS_TOPICS`, which contains a flattened list of topics, each linked to the `NEWS_ID` of its article.

In [None]:
sql = """
select *
from {schema_name!i}_VS.NEWS_TOPICS
limit 10
"""

with open_pyexasol_connection(ai_lab_config) as conn:
    res = conn.execute(sql, query_params={
        "schema_name": ai_lab_config.db_schema
    })
    for r in res:
        print(r)

The following query finds the ten most frequent topics in our dataset.

In [None]:
sql = """
select NAME, count(*)
from {schema_name!i}_VS.NEWS_TOPICS
group by 1
order by 2 desc
limit 10
"""

with open_pyexasol_connection(ai_lab_config) as conn:
    res = conn.execute(sql, query_params={
        "schema_name": ai_lab_config.db_schema
    })
    for r in res:
        print(r)

The tables `NEWS` and `NEWS_TOPICS` can also be joined to find articles about a specific topic. The join matches `NEWS.NEWS_ID` with the reference column `NEWS_TOPICS.NEWS_NEWS_ID`:

In [None]:
sql = """
select topics.name, news.title, news.news_id
from 
    {schema_name!i}_VS.NEWS as news,
    {schema_name!i}_VS.NEWS_TOPICS as topics
where news.NEWS_ID = topics.NEWS_NEWS_ID
and topics.NAME = 'grain'

limit 10
"""

with open_pyexasol_connection(ai_lab_config) as conn:
    res = conn.execute(sql, query_params={
        "schema_name": ai_lab_config.db_schema
    })
    for r in res:
        print(r)

Finally, we materialize the news data, joined with its topics, in a regular table. This speeds up processing of the data in other notebooks.

In [None]:
sql = """
CREATE TABLE {schema_name!i}.NEWS as select T.NAME as TOPIC, N.TITLE as TITLE, N.NEWS_ID as NEWS_ID, N.BODY as BODY
FROM
    {schema_name!i}_VS.NEWS as N,
    {schema_name!i}_VS.NEWS_TOPICS as T
WHERE N.NEWS_ID = T.NEWS_NEWS_ID;
"""
with open_pyexasol_connection(ai_lab_config) as conn:
    res = conn.execute(sql, query_params={
        "schema_name": ai_lab_config.db_schema
    })
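
To illustrate the shape of the table this join produces, the sketch below runs the same kind of join on an in-memory SQLite database instead of Exasol. The rows are hypothetical sample data, not taken from the Reuters dataset, and SQLite is used only so that the example runs without a database connection. It shows that an article with several topics appears once per topic, and that an article without topics is not part of the result.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE NEWS (NEWS_ID TEXT, TITLE TEXT, BODY TEXT);
CREATE TABLE NEWS_TOPICS (NEWS_NEWS_ID TEXT, NAME TEXT);

-- Hypothetical sample rows: a1 has two topics, a2 has one, a3 has none.
INSERT INTO NEWS VALUES
    ('a1', 'Two topics', '...'),
    ('a2', 'One topic', '...'),
    ('a3', 'No topics', '...');
INSERT INTO NEWS_TOPICS VALUES
    ('a1', 'grain'),
    ('a1', 'wheat'),
    ('a2', 'grain');
""")

# Same join condition as in the CREATE TABLE statement above.
rows = con.execute("""
SELECT T.NAME AS TOPIC, N.TITLE AS TITLE, N.NEWS_ID AS NEWS_ID
FROM NEWS AS N, NEWS_TOPICS AS T
WHERE N.NEWS_ID = T.NEWS_NEWS_ID
ORDER BY N.NEWS_ID, T.NAME
""").fetchall()
con.close()

for row in rows:
    print(row)
```

Downstream notebooks that need one row per article would therefore have to deduplicate on `NEWS_ID` or aggregate the topics.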