Description
Describe the bug
When DefaultModelMonitor.suggest_baseline generates statistics and constraints for a UTF-8 encoded CSV containing Japanese text, all column names and categorical values appear as ????? in the output JSON, making it unusable.
To reproduce
A clear, step-by-step set of instructions to reproduce the bug.
The provided code needs to be complete and runnable; if additional data is needed, please include it in the issue.
Create a UTF-8 encoded CSV dataset whose column names and categorical values are in Japanese.
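A dataset of the kind described in this step could be produced with a short script like the one below. This is only a sketch: the column names and row values are hypothetical placeholders (the original dataset is not included in the report), and only the file name baselining_data_set.csv comes from the reproduction code.

```python
import csv

# Hypothetical column names and rows, for illustration only;
# the real dataset from the report is not available.
HEADER = ["年齢", "都道府県", "職業", "購入回数"]
ROWS = [
    [34, "東京都", "会社員", 3],
    [27, "大阪府", "学生", 1],
    [45, "北海道", "自営業", 7],
]


def write_sample_csv(path):
    """Write the sample rows to `path` as UTF-8 (no BOM) and return the header."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(ROWS)
    return HEADER


# Example call, using the file name from the reproduction code:
# write_sample_csv("baselining_data_set.csv")
```

Writing the file with an explicit encoding="utf-8" rules out the local platform's default encoding as a source of the problem.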
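Run a baselining job on that dataset:
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

# `role` and `output_s3_uri` are assumed to be defined beforehand.
my_default_monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)
my_default_monitor.suggest_baseline(
    baseline_dataset="baselining_data_set.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri=output_s3_uri,
)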
Check the generated statistics.json and constraints.json: all Japanese text appears as ??????.
{
  "version" : 0.0,
  "features" : [ {
    "name" : "????",
    "inferred_type" : "Integral",
    "completeness" : 1.0,
    "num_constraints" : {
      "is_non_negative" : true
    }
  }, {
    "name" : "???????",
    "inferred_type" : "Integral",
    "completeness" : 1.0,
    "num_constraints" : {
      "is_non_negative" : true
    }
  }, {
    "name" : "???????",
    "inferred_type" : "Integral",
    "completeness" : 1.0,
    "num_constraints" : {
      "is_non_negative" : true
    }
  }, {
    "name" : "????",
    "inferred_type" : "Integral",
    "completeness" : 1.0,
    "num_constraints" : {
      "is_non_negative" : true
    }
  }
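To check a downloaded constraints.json or statistics.json for this symptom without reading it by eye, something like the following sketch may help. The helper name find_garbled_names is hypothetical, and it assumes only the layout visible in the excerpt above: a top-level "features" list whose entries carry a "name" field.

```python
import json


def find_garbled_names(doc):
    """Return feature names that consist solely of '?' characters."""
    garbled = []
    for feature in doc.get("features", []):
        name = feature.get("name", "")
        # A non-empty name whose only distinct character is '?' is
        # treated as replaced text.
        if name and set(name) == {"?"}:
            garbled.append(name)
    return garbled


def check_file(path):
    """Load a baseline JSON file as UTF-8 and report garbled names."""
    with open(path, encoding="utf-8") as f:
        return find_garbled_names(json.load(f))
```

For the excerpt shown here, find_garbled_names would flag all four feature names.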
Expected behavior
Japanese column names and categorical values should appear correctly in the output JSON.
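As a point of comparison, the observed output looks like what a lossy re-encoding produces: each character that cannot be represented in the target charset is replaced with a single ?. The sketch below shows that effect in plain Python. It is an assumption about the kind of failure involved, not a confirmed diagnosis of what happens inside the baselining job, and the sample string is hypothetical.

```python
def lossy_roundtrip(text, charset="ascii"):
    """Encode to `charset`, replacing unencodable characters, then decode."""
    # errors="replace" substitutes one '?' for every character that the
    # target charset cannot represent.
    return text.encode(charset, errors="replace").decode(charset)


name = "都道府県"  # hypothetical column name, 4 characters
print(lossy_roundtrip(name))     # ????
print(lossy_roundtrip("count"))  # ASCII text passes through unchanged
```

The replacement is per character rather than per byte, so a four-character Japanese name becomes exactly four question marks.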
Screenshots or logs
If applicable, add screenshots or logs to help explain your problem.
System information
A description of your system. Please provide:
- SageMaker Python SDK version: 2.224.4
- Framework name (eg. PyTorch) or algorithm (eg. KMeans):
- Framework version:
- Python version:
- CPU or GPU:
- Custom Docker image (Y/N):
Additional context
Add any other context about the problem here.