Add normalization #76

Merged
merged 17 commits into from
Feb 9, 2023

revit13
Contributor

@revit13 revit13 commented Feb 2, 2023

/Closes #75
This PR adds code that normalizes the database tables, transforming the data from the Airbyte format into the expected database format.

More on normalization:
https://docs.airbyte.com/understanding-airbyte/basic-normalization/

Signed-off-by: Revital Sur <eres@il.ibm.com>
Co-authored-by: Doron Chen <cdoron@il.ibm.com>

@revit13 revit13 marked this pull request as draft February 2, 2023 12:55
@revit13 revit13 force-pushed the normalization branch 3 times, most recently from dab7198 to 9351b09 Compare February 5, 2023 06:44
revit13 and others added 5 commits February 5, 2023 12:10
Signed-off-by: Revital Sur <eres@il.ibm.com>
Co-authored-by: Doron Chen <cdoron@il.ibm.com>
Signed-off-by: Revital Sur <eres@il.ibm.com>
Signed-off-by: Revital Sur <eres@il.ibm.com>
Signed-off-by: Revital Sur <eres@il.ibm.com>
Signed-off-by: Revital Sur <eres@il.ibm.com>
@revit13 revit13 force-pushed the normalization branch 2 times, most recently from c7085bf to e3286c1 Compare February 6, 2023 05:00
Signed-off-by: Revital Sur <eres@il.ibm.com>
Signed-off-by: Revital Sur <eres@il.ibm.com>
@revit13 revit13 marked this pull request as ready for review February 6, 2023 07:22
Signed-off-by: Revital Sur <eres@il.ibm.com>
Signed-off-by: Revital Sur <eres@il.ibm.com>
Signed-off-by: Revital Sur <eres@il.ibm.com>
abm/server.py Outdated
if write_mode:
    if write_mode == "overwrite":
        mode = DestinationSyncMode.overwrite
    if write_mode != "append":
Collaborator

Should be elif
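
A minimal sketch of the branch structure this comment asks for, under stated assumptions: `SyncMode` is a hypothetical stand-in for `DestinationSyncMode`, `resolve_mode` is an illustrative name, and both the default of `append` and the error raised for unknown modes are guesses, since the diff excerpt does not show what follows the second condition.

```python
from enum import Enum


class SyncMode(Enum):
    """Hypothetical stand-in for DestinationSyncMode."""
    append = "append"
    overwrite = "overwrite"


def resolve_mode(write_mode, default=SyncMode.append):
    """Map an optional write_mode string to a sync mode (assumed default: append)."""
    mode = default
    if write_mode:
        if write_mode == "overwrite":
            mode = SyncMode.overwrite
        elif write_mode != "append":
            # Assumed behavior: reject anything other than the two known modes.
            # With a plain `if` here, "overwrite" would also reach this branch.
            raise ValueError(f"unknown write mode: {write_mode!r}")
    return mode
```

With `elif`, the two checks are mutually exclusive, so a valid `"overwrite"` value is never treated as an unknown mode.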

# if the port field is a string, cast it to integer
if 'port' in self.config and type(self.config['port']) == str:
    self.config['port'] = int(self.config['port'])

self.catalog_dict = None
self.json_schema = None
Collaborator

add comment explaining what is kept in self.json_schema

abm/connector.py Outdated
Translate the name of the temporary file in the host to the name of the same file
in the container.
For instance, if the path is '/tmp/tmp12345', return '/local/tmp12345'.
Remove metadata columns, if such exists, from "CATALOG" lines returned by an Airbyte read operation.
Collaborator

In future versions, we should consider keeping the metadata columns and returning them to the user.

abm/connector.py Outdated
for stream in catalog_streams:
    # remove metadata columns for a specific stream (table) if such
    # is provided
    if stream_name != "" and stream['name'] != stream_name:
Collaborator

consider replacing this line with:

if stream_name and stream['name'] != stream_name:

Collaborator

Maybe there is no need to keep entries for all other streams (other than the stream you are really interested in)
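
The two suggestions in this thread can be sketched together. This is an illustration only, not the code that was merged: `strip_metadata` is a hypothetical name, and the catalog is assumed to be a dict with a `streams` list whose entries carry `name` and `json_schema` keys. The truthiness test `if stream_name:` covers both `None` and the empty string, and only the requested stream is kept.

```python
import copy

METADATA_PREFIX = "_airbyte_"


def strip_metadata(catalog, stream_name=""):
    """Return a copy of the catalog without _airbyte_* columns.

    If stream_name is given, only that stream is kept in the result.
    """
    result = copy.deepcopy(catalog)  # leave the caller's catalog untouched
    streams = result.get("streams", [])
    if stream_name:  # false for both None and ""
        streams = [s for s in streams if s["name"] == stream_name]
    for stream in streams:
        properties = stream["json_schema"]["properties"]
        # Iterate over a snapshot of the keys so deletion is safe.
        for key in list(properties.keys()):
            if key.startswith(METADATA_PREFIX):
                del properties[key]
    result["streams"] = streams
    return result
```

Dropping the other streams up front means later code no longer has to search the catalog for the stream of interest.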

abm/connector.py Outdated
properties = json_schema['properties']
for key in list(properties.keys()):
    if key.startswith('_airbyte_'):
        del properties[key]
Collaborator

add break (?)

abm/connector.py Outdated

# Given configuration, obtain the Airbyte Catalog, which includes list of datasets
def get_catalog(self):
ret = []
for lines in self.run_container('discover --config ' + self.name_in_container(self.conf_file.name)):
for lines in self.run_container('discover --config ' + self.name_in_container(self.conf_file.name, MOUNTDIR)):
Collaborator

Check whether there is a way to remove the MOUNTDIR parameter
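
One possible way to avoid passing `MOUNTDIR` at every call site is a default argument. The sketch below is a free-function version of the `name_in_container` helper quoted in this review. The `/local` mount point comes from the docstring example; the `/tmp` working directory and the extra `workdir` parameter are assumptions made to keep the example self-contained.

```python
MOUNTDIR = "/local"  # mount point inside the container, per the docstring example


def name_in_container(path, workdir, mountdir=MOUNTDIR):
    """Translate a host path under workdir to the matching container path."""
    # count=1 matters: "/tmp/tmp12345" contains "/tmp" twice, and only the
    # leading directory should be rewritten.
    return path.replace(workdir, mountdir, 1)


print(name_in_container("/tmp/tmp12345", "/tmp"))  # /local/tmp12345
```

Callers that need a different mount point can still pass `mountdir` explicitly.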

abm/connector.py Outdated
'''
def name_in_container(self, path):
return path.replace(self.workdir, MOUNTDIR, 1)
def remove_metadata_columns(self, line_dict):
Collaborator

Consider setting the json_schema field in this method

abm/connector.py Outdated
Given a catalog return the json schema of a specific stream (table) if the stream
is provided. Otherwise return the json schema of the first stream.
'''
def get_stream_schema(self, catalog):
Collaborator

Maybe this method would no longer be needed once self.json_schema contains the schema for a single stream

abm/connector.py Outdated
@@ -260,6 +255,8 @@ def get_catalog_dict(self):

try:
    self.catalog_dict = json.loads(airbyte_catalog[0])
    # save the json_schema part in the catalog for later use
    self.json_schema = self.get_stream_schema(self.catalog_dict['catalog'])
Collaborator

maybe no longer needed

@@ -97,3 +97,12 @@ connectors:
storage: HTTPS
# dataset_name is the Name of the final table to replicate this file into.
dataset_name: userdata

# Atttributes related to the normzliation process. If they are provided then
Collaborator

typo

host: host.docker.internal
port: 3306
database: test
database: test
table: table
Collaborator

please change table name

Signed-off-by: Revital Sur <eres@il.ibm.com>
Signed-off-by: Revital Sur <eres@il.ibm.com>
Signed-off-by: Revital Sur <eres@il.ibm.com>
Signed-off-by: Revital Sur <eres@il.ibm.com>
Signed-off-by: Revital Sur <eres@il.ibm.com>
@revit13
Contributor Author

revit13 commented Feb 8, 2023

@cdoron Thanks for the review. I fixed the code according to the comments. Thanks

Signed-off-by: Revital Sur <eres@il.ibm.com>
if stream['name'] == stream_name:
the_stream = stream
break
if the_stream == None:
Collaborator

please add a log message as well
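
A sketch of what the requested logging could look like, assuming the lookup runs over a catalog dict with a `streams` list. `find_stream` and the logger setup are illustrative names, and raising `ValueError` after logging is an assumption, since the excerpt does not show how the missing-stream case is handled.

```python
import logging

logger = logging.getLogger(__name__)


def find_stream(catalog, stream_name):
    """Return the stream named stream_name, logging an error if it is absent."""
    the_stream = None
    for stream in catalog.get("streams", []):
        if stream["name"] == stream_name:
            the_stream = stream
            break
    if the_stream is None:
        # Log before failing so the cause is visible in the server logs.
        logger.error("stream %r not found in catalog", stream_name)
        raise ValueError(f"stream {stream_name!r} not found")
    return the_stream
```

Comparing with `is None` rather than `== None` is the usual Python idiom for this check.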

1. To verify that the Airbyte module writes the dataset, run:
   ```bash
   kubectl exec -it mysql-client --namespace fybrik-airbyte-sample -- bash
   mysql -h mysql.fybrik-airbyte-sample.svc.cluster.local -uroot -p"$MYSQL_ROOT_PASSWORD"
   ```
Collaborator

Should the kubectl exec line be in the same "bash" segment as the mysql -h line?
I think that if someone copies and pastes the entire segment, it will not work.

Contributor Author

It works OK: after the mysql command we enter a mysql shell prompt...

Signed-off-by: Revital Sur <eres@il.ibm.com>
@cdoron cdoron merged commit 36f0155 into fybrik:main Feb 9, 2023