Not able to ingest non english utf-8 chars such as Japanese etc #5367

Open
fr-judson opened this issue Jul 11, 2022 · 4 comments
Labels
bug Bug report ingestion PR or Issue related to the ingestion of metadata on-deck PR or Issue that will be reviewed and/or addressed by the DataHub Maintainers in future cycles product PR or Issue related to the DataHub UI/UX

Comments

@fr-judson

Describe the bug
Unable to ingest data containing non-English UTF-8 characters such as Japanese (for example "Sample Data - 商品ブランドコード") into a dataset entity through aspects such as datasetProperties and dataset schemaMetadata (in the column description).

To Reproduce
Steps to reproduce the behavior:

  1. Ingest data that contains non-English UTF-8 characters, such as Japanese, into a Dataset entity through the datasetProperties and dataset schemaMetadata aspects.

Expected behavior
Metadata should be ingested into DataHub.

Observed behavior
Metadata cannot be ingested into DataHub.

Additional context

Person to contact: Chris Margach (in datahub slack)

@fr-judson fr-judson added the bug Bug report label Jul 11, 2022
@maggiehays maggiehays added the ingestion PR or Issue related to the ingestion of metadata label Jul 11, 2022
@jjoyce0510 jjoyce0510 added the product PR or Issue related to the DataHub UI/UX label Jul 11, 2022
@shirshanka shirshanka added the on-deck PR or Issue that will be reviewed and/or addressed by the DataHub Maintainers in future cycles label Jul 12, 2022
@github-actions

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

@hsheth2
Collaborator

hsheth2 commented Nov 15, 2022

Additional context here: non-English UTF-8 characters seem to work on the Python side and on the frontend. I only tested the description field though, so @fr-judson, let me know if there are other fields that were causing problems for you.

[Image attachment: image (2)]

As such, it looks like this error is specific to the Java emitter.

@humpfhumpf

Fix suggestion:

In RestEmitter.java, StringEntity uses ISO-8859-1 by default when no charset is given, but JSON is always UTF-8.

httpPost.setEntity(new StringEntity(payloadJson));

must be rewritten as:

httpPost.setEntity(new StringEntity(payloadJson, org.apache.http.entity.ContentType.APPLICATION_JSON));

and

httpPost.setEntity(new StringEntity(objectMapper.writeValueAsString(payload)));

must be rewritten as:

httpPost.setEntity(new StringEntity(objectMapper.writeValueAsString(payload), org.apache.http.entity.ContentType.APPLICATION_JSON));
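For anyone who wants to see the effect of the charset in isolation, here is a minimal sketch that uses only the JDK, not RestEmitter or Apache HttpClient. The class name CharsetDemo and both helper methods are hypothetical, made up for illustration. It encodes the sample string from the report with ISO-8859-1 and with UTF-8 and checks whether the text survives the round trip. The assumption is that a request body encoded with ISO-8859-1 loses the Japanese characters in the same way.

```java
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Hypothetical helper: encode with ISO-8859-1 and decode again.
    // String.getBytes substitutes '?' for characters the charset cannot represent.
    static String roundTripLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Hypothetical helper: encode with UTF-8 and decode again.
    static String roundTripUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Sample Data - " followed by the Japanese text from the report,
        // written as Unicode escapes so the source file encoding does not matter.
        String sample = "Sample Data - \u5546\u54c1\u30d6\u30e9\u30f3\u30c9\u30b3\u30fc\u30c9";

        // false: the Japanese characters were replaced with '?'
        System.out.println(roundTripLatin1(sample).equals(sample));
        // true: UTF-8 can represent every character in the string
        System.out.println(roundTripUtf8(sample).equals(sample));
    }
}
```

The ASCII prefix survives in both cases, which may be why English-only metadata never showed the problem.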

@fr-chrismargach

@hsheth2
Unfortunately, @fr-judson has left the organization. As far as I can see from our chat logs, we only tried "descriptions for both datasetproperties and dataset column descriptions"

@humpfhumpf
Thank you very much for the suggestion! We'll try it out.

@pedro93 pedro93 removed their assignment Oct 11, 2023
10 participants