
feat: add ability to extract extra metadata with regex#763

Merged
MthwRobinson merged 17 commits into
mainfrom
feat/optional-metadata
Jun 16, 2023

Conversation

@MthwRobinson
Contributor

Summary

Adds the ability to extract additional metadata by specifying a regex. Also introduces a process_metadata decorator, which will allow us to introduce metadata changes that apply to all document types without having to edit each individual file in the partition directory.
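As a rough illustration of the decorator idea, here is a minimal, self-contained sketch. It is not the code from this PR: the `Element` class, `process_metadata_sketch`, `partition_lines`, and the `text`/`start`/`end` keys of each match are all hypothetical stand-ins, and the `regex_match_metadata` keyword is simply borrowed from the Testing snippet in this description.

```python
import functools
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Element:
    # Hypothetical stand-in for a partitioned element; not the library's class.
    text: str
    regex_metadata: Dict[str, List[dict]] = field(default_factory=dict)


def process_metadata_sketch(
    func: Callable[..., List[Element]],
) -> Callable[..., List[Element]]:
    """Wrap any partition function so regex metadata is applied to its output."""

    @functools.wraps(func)
    def wrapper(
        *args, regex_match_metadata: Optional[Dict[str, str]] = None, **kwargs
    ) -> List[Element]:
        # Run the wrapped partition function untouched.
        elements = func(*args, **kwargs)
        # Post-process every element, regardless of document type.
        for element in elements:
            for name, pattern in (regex_match_metadata or {}).items():
                matches = [
                    {"text": m.group(), "start": m.start(), "end": m.end()}
                    for m in re.finditer(pattern, element.text)
                ]
                if matches:
                    element.regex_metadata[name] = matches
        return elements

    return wrapper


@process_metadata_sketch
def partition_lines(text: str) -> List[Element]:
    # Toy partitioner (assumption): one element per non-empty line.
    return [Element(line) for line in text.splitlines() if line.strip()]


elements = partition_lines(
    "SPEAKER 1: It is my turn to speak now!",
    regex_match_metadata={"speaker": r"SPEAKER \d{1,3}:"},
)
print(elements[0].regex_metadata)
```

The point of the pattern is that the partition function itself never sees the metadata keyword; the wrapper consumes it, so each new metadata feature only needs to be written once.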

Testing

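  from unstructured.partition.text import partition_text

  text = "SPEAKER 1: It is my turn to speak now!"
  # Map a metadata key ("speaker") to the regex whose matches should be recorded.
  elements = partition_text(text=text, regex_match_metadata={"speaker": r"SPEAKER \d{1,3}:"})
  # Inspect the regex matches recorded on the first element.
  elements[0].metadata.regex_metadata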

@MthwRobinson MthwRobinson requested a review from qued June 15, 2023 19:04
Contributor

@qued qued left a comment

LGTM, a few nits and one question/suggestion. The schema we get back for regex_metadata feels cumbersome (list of dicts for every regex pattern) but I understand why it's that way and can't really think of something better right now.
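For callers who find the list-of-dicts shape heavier than they need, one possible consumer-side workaround is to flatten it after the fact. The sketch below assumes each match is a dict with a `text` key, which is an assumption and not something confirmed in this thread; `first_match_text` is a hypothetical helper, not part of the library.

```python
from typing import Dict, List, Optional


def first_match_text(
    regex_metadata: Dict[str, List[dict]],
) -> Dict[str, Optional[str]]:
    """Collapse {pattern_name: [match, ...]} to {pattern_name: first matched text}."""
    return {
        name: (matches[0]["text"] if matches else None)
        for name, matches in regex_metadata.items()
    }


# Sample data in the assumed shape: a list of match dicts per pattern name.
sample = {
    "speaker": [{"text": "SPEAKER 1:", "start": 0, "end": 10}],
    "date": [],
}
flat = first_match_text(sample)
print(flat)
```

This discards offsets and any matches after the first, so it only suits patterns expected to match at most once per element.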

Comment thread docs/source/index.rst Outdated
Comment thread docs/source/metadata.rst Outdated
Comment thread docs/source/metadata.rst Outdated
@cragwolfe
Contributor

Neat functionality, but isn't this something someone might want to do post partition_text? I.e., I'm not sure this would always be step zero in a preprocessing pipeline.
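To make the post-partition alternative concrete, here is a minimal sketch of the same extraction as a standalone step that runs on already-partitioned text. `extract_regex_metadata` is a hypothetical name, and the plain strings stand in for whatever element objects a pipeline actually carries; none of this is taken from the PR's code.

```python
import re
from typing import Dict, Iterable, List


def extract_regex_metadata(
    texts: Iterable[str], patterns: Dict[str, str]
) -> List[Dict[str, List[dict]]]:
    """Return one {pattern_name: [match, ...]} dict per input text."""
    # Compile once so the patterns are reused across all texts.
    compiled = {name: re.compile(pattern) for name, pattern in patterns.items()}
    results = []
    for text in texts:
        found: Dict[str, List[dict]] = {}
        for name, rx in compiled.items():
            matches = [
                {"text": m.group(), "start": m.start(), "end": m.end()}
                for m in rx.finditer(text)
            ]
            if matches:
                found[name] = matches
        results.append(found)
    return results


# Texts as they might look after partitioning and any cleaning steps.
texts = ["SPEAKER 1: It is my turn to speak now!", "No speaker label here."]
metadata = extract_regex_metadata(texts, {"speaker": r"SPEAKER \d{1,3}:"})
print(metadata)
```

Running it as a separate step lets the extraction happen at any point in the pipeline, for example after cleaning, at the cost of an extra call for every caller who wants it.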

@MthwRobinson MthwRobinson merged commit 4ea7168 into main Jun 16, 2023
@MthwRobinson MthwRobinson deleted the feat/optional-metadata branch June 16, 2023 14:10

3 participants