
feat(ingest): add --log-file option and show CLI logs in UI report #7118

Merged: 8 commits merged into datahub-project:master on Jan 26, 2023

Conversation

@hsheth2 (Collaborator) commented on Jan 24, 2023:

  • Add a --log-file option that writes debug logs and stdout (e.g. the final ingestion report) to a file. This can even be used to dump debug logs to a file while printing info logs to the terminal (see the sketch after this list).
  • Include the CLI logs in the UI ingestion report. This currently matches the user's logging level for the CLI run, but we can change that if we want to.
  • Colorize logs based on severity in the CLI (unless stderr is redirected to a file).
  • Refactor the logging setup.
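For context, here is a minimal sketch of the debug-to-file / info-to-terminal split mentioned in the first bullet, using only the standard logging module. It illustrates the general pattern rather than this PR's actual implementation; the file name and format string are placeholder assumptions.

    import logging
    import sys

    # Sketch only: send INFO+ to the terminal while capturing DEBUG+ in a file,
    # which is the general pattern --log-file enables.
    root = logging.getLogger()
    root.setLevel(logging.DEBUG)

    terminal_handler = logging.StreamHandler(sys.stderr)
    terminal_handler.setLevel(logging.INFO)

    file_handler = logging.FileHandler("ingestion-debug.log", mode="w")
    file_handler.setLevel(logging.DEBUG)

    formatter = logging.Formatter(
        "[%(asctime)s] %(levelname)-8s {%(name)s:%(lineno)d} - %(message)s"
    )
    for handler in (terminal_handler, file_handler):
        handler.setFormatter(formatter)
        root.addHandler(handler)

    logging.getLogger("datahub.example").debug("only visible in the log file")
    logging.getLogger("datahub.example").info("visible in both places")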


Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added, a usage guide has been added for it.
  • For any breaking change, potential downtime, deprecation, or other big change, an entry has been made in Updating DataHub.

@github-actions bot added the ingestion label (PR or Issue related to the ingestion of metadata) on Jan 24, 2023
@treff7es (Contributor): LGTM

@@ -122,6 +122,8 @@ def get_long_description():
"greenlet",
}

sqllineage_lib = "sqllineage==1.3.6"
Collaborator: nice refactor!

@@ -393,5 +393,5 @@ def rollback(
writer.writerow([row.get("urn")])

except IOError as e:
- print(e)
+ logger.exception(f"Unable to save rollback failure report: {e}")
Collaborator: Thank you

@@ -262,7 +262,7 @@ def dataplatform2instance_func(
delete_cli._delete_one_urn(src_entity_urn, soft=not hard, run_id=run_id)
migration_report.on_entity_migrated(src_entity_urn, "status") # type: ignore

- print(f"{migration_report}")
+ click.echo(f"{migration_report}")
Collaborator: omg thank you

execution_result_aspect = ExecutionRequestResultClass(
    status=status,
    startTimeMs=self.start_time_ms,
    durationMs=self.get_cur_time_in_ms() - self.start_time_ms,
-   report=json.dumps(report, indent=2),
+   report=summary,
Collaborator: Nice! So in this case, we are now sending the structured report to the correct field. From here, we can continue to add to that report and get it into our result aspect.

Collaborator: (Sets us up for much richer reports.)

@hsheth2 (Author): Yup, exactly - this way the structured report shows up in the correct field for both managed ingestion and CLI ingestion.

Note that the structured report field is basically a string right now; we may need to rethink that in the future.

        return record.levelno >= logging.INFO
    else:
        if self.debug:
            return record.levelno >= logging.WARNING
Collaborator: nice. good idea to filter out non-datahub noise
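The surrounding branches are not visible in this hunk, so the following is only a hypothetical reconstruction of the filtering idea being praised here (class name and exact thresholds are assumptions, not the PR's code): DataHub's own loggers pass at a lower level, while third-party loggers are held to a stricter one.

    import logging

    class DatahubNoiseFilter(logging.Filter):
        # Hypothetical illustration: let datahub.* records through at INFO
        # (or DEBUG when --debug is set) and suppress everything else below
        # WARNING to cut third-party noise.
        def __init__(self, debug: bool) -> None:
            super().__init__()
            self.debug = debug

        def filter(self, record: logging.LogRecord) -> bool:
            is_datahub = record.name == "datahub" or record.name.startswith("datahub.")
            if is_datahub:
                return record.levelno >= (logging.DEBUG if self.debug else logging.INFO)
            # Third-party loggers: warnings and errors only.
            return record.levelno >= logging.WARNING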

structured_report_str = json.dumps(report, indent=2)
summary = f"~~~~ Ingestion Report ~~~~\n{structured_report_str}\n\n"
summary += "~~~~ Ingestion Logs ~~~~\n"
summary += get_log_buffer().format_lines()
Collaborator: ah so here we leverage the in-memory buffer. nice
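A rough sketch of what such an in-memory buffer could look like; the PR's actual get_log_buffer() helper may be implemented differently, and the class name and size limit here are assumptions.

    import logging
    from collections import deque

    class InMemoryLogBuffer(logging.Handler):
        # Keep the last N formatted log lines in memory so they can be
        # attached to the UI ingestion report at the end of the run.
        def __init__(self, max_lines: int = 2000) -> None:
            super().__init__()
            self._lines = deque(maxlen=max_lines)

        def emit(self, record: logging.LogRecord) -> None:
            try:
                self._lines.append(self.format(record))
            except Exception:
                self.handleError(record)

        def format_lines(self) -> str:
            return "\n".join(self._lines)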

# Configure the loggers.
root_logger = logging.getLogger()
_remove_all_handlers(root_logger)
root_logger.setLevel(logging.INFO)
Collaborator: why do we need this "normalization" step?

@hsheth2 (Author): The default root level is WARNING globally, and it can also be messed with by other libraries.
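In other words, a sketch of the rationale (mirroring the diff above rather than quoting it): Python's root logger starts at WARNING, and imported libraries can attach handlers or change levels, so the setup resets to a known state first.

    import logging

    def _reset_root_logger() -> None:
        # Start from a clean slate: drop any handlers that libraries may have
        # attached and pin the root level, roughly what the hunk above does
        # via _remove_all_handlers(root_logger) and setLevel(logging.INFO).
        root = logging.getLogger()
        for handler in list(root.handlers):
            root.removeHandler(handler)
        root.setLevel(logging.INFO)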

# or otherwise mess with the logging configuration.
lib_logger = logging.getLogger(lib)
_remove_all_handlers(lib_logger)
lib_logger.setLevel(logging.DEBUG)
Collaborator: why this step?

from typing import Any, TextIO


class TeeIO:
Collaborator: Fancy!

file_handler: logging.Handler
if log_file:
    file = stack.enter_context(open(log_file, "w"))
    tee = TeeIO(sys.stdout, file)
Collaborator: I'm curious - why is this Tee and context piece even required? I'd imagine we should be able to create a global handler that directs to a file without requiring this right?

@hsheth2 (Author): We're redirecting stdout here (e.g. click.echo and print statements). We want those messages to show up both in the terminal and in the log_file, which is why we need the tee.
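A hypothetical sketch of such a tee-style writer, loosely modeled on the TeeIO class referenced above (the PR's real implementation may differ): every write is forwarded to all underlying streams, so print and click.echo output reaches both the terminal and the --log-file.

    from typing import Any, TextIO

    class TeeIO:
        # Sketch only: fan writes out to several text streams at once.
        def __init__(self, *streams: TextIO) -> None:
            self._streams = streams

        def write(self, data: str) -> int:
            for stream in self._streams:
                stream.write(data)
            return len(data)

        def flush(self) -> None:
            for stream in self._streams:
                stream.flush()

        def __getattr__(self, name: str) -> Any:
            # Delegate everything else (encoding, isatty, ...) to the first stream.
            return getattr(self._streams[0], name)

    # Usage along the lines of the hunk above: build TeeIO(sys.stdout, file)
    # and point sys.stdout at it for the duration of the run.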

logger.warning("This is a warning message")
logger.error("this is an error with no stack trace")
try:
    1 / 0
Collaborator: i like this

r"""\
this is a print statement
this is a click.echo statement
[.+] DEBUG .datahub.my_cli_module:\d+. - Example debug line
Collaborator: Ha! This is pretty cool

[.+] INFO .datahub.my_cli_module:\d+. - This is an info message
[.+] WARNING .datahub.my_cli_module:\d+. - This is a warning message
[.+] ERROR .datahub.my_cli_module:\d+. - this is an error with no stack trace
[.+] ERROR .datahub.my_cli_module:\d+. - failed to divide by zero
Collaborator: Why does everything start with [.+]?

@hsheth2 (Author): To handle the timestamp part of the log.
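As a standalone illustration of the idea (not the PR's actual test fixtures or matching helper), a regex can leave the bracketed timestamp portion of a formatted line as a wildcard so the assertion does not depend on the exact time:

    import re

    # Example log line and pattern are made up for illustration.
    log_line = "[2023-01-24 12:34:56,789] INFO  {datahub.my_cli_module:42} - This is an info message"
    pattern = r"\[.+\] INFO\s+\{datahub\.my_cli_module:\d+\} - This is an info message"
    assert re.fullmatch(pattern, log_line)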

# The in-memory log buffer should contain the log messages.
# The first two lines are stdout, so we skip them.
expected_log_output = "\n".join(result.output.splitlines()[2:])
assert get_log_buffer().format_lines() == expected_log_output
Collaborator: Do we need to cleanup the file here?

@hsheth2 (Author): Nope, it's a pytest tmp file, so it gets cleaned up automatically.
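For readers unfamiliar with the fixture: pytest's tmp_path hands each test its own temporary directory and manages cleanup, so tests like the one above don't need to delete the file themselves. A self-contained sketch (the demo command below is a stand-in, not this PR's CLI or test):

    from pathlib import Path

    import click
    from click.testing import CliRunner

    @click.command()
    @click.option("--log-file", type=click.Path())
    def demo(log_file: str) -> None:
        # Stand-in command; the real test drives the datahub CLI instead.
        Path(log_file).write_text("hello from the demo command\n")
        click.echo("done")

    def test_log_file_is_written(tmp_path: Path) -> None:
        log_file = tmp_path / "out.log"  # cleaned up by pytest automatically
        result = CliRunner().invoke(demo, ["--log-file", str(log_file)])
        assert result.exit_code == 0
        assert log_file.read_text() == "hello from the demo command\n"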

Collaborator: nice!

@jjoyce0510 (Collaborator): LGTM

@hsheth2 merged commit 927d45d into datahub-project:master on Jan 26, 2023.
@hsheth2 deleted the log-file branch on Jan 26, 2023 at 17:25.
ericyomi pushed a commit to ericyomi/datahub that referenced this pull request Feb 8, 2023
oleg-ruban pushed a commit to RChygir/datahub that referenced this pull request Feb 28, 2023
Labels: ingestion (PR or Issue related to the ingestion of metadata)
Participants: 3