sagemaker.model_monitor.DefaultModelMonitor suggest_baseline is not able to read Japanese text #4822

@johansew

Describe the bug
When DefaultModelMonitor.suggest_baseline generates statistics and constraints for a UTF-8 encoded CSV containing Japanese text, the column names and categorical values all appear as ????? in the output JSON, making it unusable.

To reproduce
Create a UTF-8 encoded CSV dataset with Japanese column names and Japanese categorical values, then run the following:
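For anyone trying to reproduce this, a dataset of the described shape could be generated with a sketch like the one below. The original file is not attached to the issue, so the column names, values, and the `write_baseline_csv` helper are hypothetical stand-ins. The file is written as UTF-8 without a BOM and the header is read back to confirm that the Japanese text is intact before it is handed to the baselining job.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical column names and values; the original dataset was not shared.
HEADER = ["年齢", "都道府県", "購入回数"]
ROWS = [
    [34, "東京都", 5],
    [27, "大阪府", 2],
    [45, "北海道", 9],
]


def write_baseline_csv(path):
    """Write the sample rows as a UTF-8 CSV (no BOM) and return the path."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(ROWS)
    return Path(path)


def read_header(path):
    """Read back only the header row, decoding the file as UTF-8."""
    with open(path, encoding="utf-8", newline="") as f:
        return next(csv.reader(f))


csv_path = write_baseline_csv(Path(tempfile.gettempdir()) / "baselining_data_set.csv")
print(read_header(csv_path))
```

If the header prints correctly here, the local file is sound and any `?` characters are introduced later in the pipeline.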

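from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

# `role` (execution role ARN) and `output_s3_uri` are assumed to be defined beforehand.
my_default_monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)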

my_default_monitor.suggest_baseline(
    baseline_dataset="baselining_data_set.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri=output_s3_uri,
)

Check the generated statistics.json and constraints.json. Both show ?????? in place of the Japanese text:

{
  "version" : 0.0,
  "features" : [ {
    "name" : "????",
    "inferred_type" : "Integral",
    "completeness" : 1.0,
    "num_constraints" : {
      "is_non_negative" : true
    }
  }, {
    "name" : "???????",
    "inferred_type" : "Integral",
    "completeness" : 1.0,
    "num_constraints" : {
      "is_non_negative" : true
    }
  }, {
    "name" : "???????",
    "inferred_type" : "Integral",
    "completeness" : 1.0,
    "num_constraints" : {
      "is_non_negative" : true
    }
  }, {
    "name" : "????",
    "inferred_type" : "Integral",
    "completeness" : 1.0,
    "num_constraints" : {
      "is_non_negative" : true
    }
  }
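To confirm the symptom programmatically in a downloaded constraints.json, one option is to look for feature names made up entirely of question marks. This is only a sketch: the `garbled_feature_names` helper is hypothetical, and the sample document is a trimmed stand-in for the output above, with one intact name added for contrast.

```python
import json
import re

# Matches names that consist of nothing but '?' characters.
ONLY_QUESTION_MARKS = re.compile(r"\?+")


def garbled_feature_names(constraints):
    """Return the feature names that consist solely of '?' characters."""
    return [
        feature["name"]
        for feature in constraints.get("features", [])
        if ONLY_QUESTION_MARKS.fullmatch(feature.get("name", ""))
    ]


# Trimmed stand-in for constraints.json; "年齢" is a hypothetical intact name.
sample = json.loads(
    '{"version": 0.0, "features": ['
    '{"name": "????", "inferred_type": "Integral"}, '
    '{"name": "年齢", "inferred_type": "Integral"}]}'
)
print(garbled_feature_names(sample))  # ['????']
```

Against the real file, `json.load` on the downloaded constraints.json would take the place of the inline sample.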

Expected behavior
Japanese column names and categorical values are shown correctly in statistics.json and constraints.json.

System information
  • SageMaker Python SDK version: 2.224.4
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans):
  • Framework version:
  • Python version:
  • CPU or GPU:
  • Custom Docker image (Y/N):
