Unveiling the Full Dataset Structure: Leveraging platform_sdk.dataset_reader in AEPP #12

yoyo6022 · 2023-06-28T21:39:07Z

To make a dataset's structure easier to understand, I propose making platform_sdk.dataset_reader accessible from AEPP. This would let us unpack the entire dataset and view it in full, including the nested fields. Currently, AEPP supports data loading through the queryservice module: you specify a SQL query and the result is loaded into a pandas dataframe. However, each column in the dataframe only represents the first level of the nested object in the schema, unless we manually unpack a specific object in the query. For example, "select web.* from table_abc" returns the fields nested in the second level under the "web" object.
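To make the request concrete, here is a minimal sketch of what "unpacked" means for a single row. It is plain Python and does not call AEPP or the platform SDK; the `flatten_record` helper and the sample row (field names and values) are hypothetical and only illustrate the shape of the result. `pandas.json_normalize` offers similar behaviour for lists of nested dicts.

```python
from typing import Any, Dict


def flatten_record(record: Dict[str, Any], parent: str = "", sep: str = ".") -> Dict[str, Any]:
    """Flatten nested dicts into a single level keyed by dot-separated paths.

    Hypothetical helper for illustration only; not part of aepp or platform_sdk.
    """
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        path = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the path so far.
            flat.update(flatten_record(value, path, sep))
        else:
            flat[path] = value
    return flat


# Made-up row where the "web" object is still nested (assumed shape).
row = {
    "timestamp": "2023-06-28T21:39:07Z",
    "web": {"webPageDetails": {"name": "home", "URL": "https://example.com/"}},
}

flat_row = flatten_record(row)
for path, value in flat_row.items():
    print(path, "=", value)
```

Each leaf value ends up under its full path (for example `web.webPageDetails.name`), which is the view I would like to get without writing one query per nested object.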

By using platform_sdk.dataset_reader, we can load the data with its nested fields already unpacked, which gives a more complete view of the dataset. Having access to all the fields it contains makes the data's structure much clearer. It also makes querying, data processing, and data manipulation more efficient, since we no longer need to manually unpack individual objects and the values are no longer nested within each field.

Example of using the SDK dataset reader, which automatically unpacks all the nested fields under the "web" object.

pitchmuc · 2023-06-29T10:52:00Z

Thanks for bringing up the idea, @yoyo6022.
We will consider it for future development.
FYI: the SDK dataset reader and this library are two different projects that work in different environments and connect to different sources.
I do not mean that it is not doable, but it is not as easy as it may sound.

pitchmuc · 2023-12-04T09:59:44Z

Hello @yoyo6022,
I am coming back to this.
Have you checked the latest version of aepp, especially the SchemaManager part?
It is not as efficient as the SDK reader because it does not provide the values in the fields, but there is a way to flatten the schema structure and work with the field paths to use Query Service more efficiently.

Here is the simple documentation: https://github.com/adobe/aepp/blob/main/docs/schema.md#schemamanager

We will need to work on more documentation in the future, but if you are familiar with Python and notebooks, you may be able to learn by playing with it, as all of the docstrings are provided.
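As a rough illustration of the field-path idea, the sketch below turns a nested structure into dot-separated paths and builds a SELECT statement from them. It does not use SchemaManager or any aepp call: the `leaf_paths` and `build_select` helpers, the simplified schema dictionary, and the underscore alias convention are all assumptions made for the example, and the table name is the one from the original post.

```python
from typing import Any, Dict, List


def leaf_paths(schema: Dict[str, Any], parent: str = "") -> List[str]:
    """Return dot-separated paths to every leaf field of a nested dict.

    Hypothetical helper; the input is a simplified stand-in, not real XDM.
    """
    paths: List[str] = []
    for name, child in schema.items():
        path = f"{parent}.{name}" if parent else name
        if isinstance(child, dict) and child:
            paths.extend(leaf_paths(child, path))
        else:
            paths.append(path)
    return paths


def build_select(table: str, paths: List[str]) -> str:
    """Build a SELECT that exposes each nested field as its own column."""
    # Alias convention (dots replaced by underscores) is an assumption.
    columns = ",\n  ".join(f"{p} AS {p.replace('.', '_')}" for p in paths)
    return f"SELECT\n  {columns}\nFROM {table}"


# Simplified, made-up schema: leaf values are just type names.
schema = {
    "timestamp": "string",
    "web": {
        "webPageDetails": {"name": "string", "URL": "string"},
        "webReferrer": {"URL": "string"},
    },
}

paths = leaf_paths(schema)
query = build_select("table_abc", paths)
print(query)
```

The resulting string could then be passed to the query service in place of a hand-written "select web.*" query, so every nested field comes back as its own dataframe column.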

pitchmuc added the enhancement label (New feature or request) on Jun 29, 2023
