feat: Extraction of headlines in markdown files#3445
Conversation
| word_count_sen = len(sen.split()) | ||
|
|
||
| if word_count_sen > split_length: | ||
| long_sentence_message = f"One or more sentence found with word count higher than the split length." |
There was a problem hiding this comment.
Can we also add a sentence about how they can fix it or what happens now?
There was a problem hiding this comment.
Nor sure what we should add. Nothing really happening here, just informing the user that they have at least one exceptionally long sentence (longer than their specified split_length) in their Document. The consequence of this is that they will have at least one Document consisting of a single sentence.
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
| """ | ||
| # md -> html -> text since BeautifulSoup can extract text cleanly | ||
| html = markdown(markdown_string) | ||
| headline_tags = {"h1", "h2", "h3", "h4", "h5", "h6"} |
There was a problem hiding this comment.
Make the depth level configurable?
There was a problem hiding this comment.
Or even better - find the deepest tree level in one pass of regular expressions?
There was a problem hiding this comment.
These are pre-defined HTML tags as listed for example here.
Before converting a markdown file to text, we convert it to HTML. The headline_tags set here is just used to check if an HTML element is a headline.
There was a problem hiding this comment.
Overall, LGTM @bogdankostic, especially the effort you put into unit tests. Some total newcomer views: would it be possible/beneficial for users to return the headlines as a tree, as that's the natural order of the headlines? Then they can order it any way they want (post/pre/in order). Great work overall
My initial idea was actually to represent the extracted headlines as a tree. I decided against it because this can result in quite nested, complex structures - especially for large Documents. I thought it would be much more understandable like this. |
| id_hash_keys = self.id_hash_keys | ||
|
|
||
| id_hash_keys = id_hash_keys if id_hash_keys is not None else self.id_hash_keys | ||
| remove_code_snippets = remove_code_snippets if remove_code_snippets is not None else self.remove_code_snippets |
There was a problem hiding this comment.
@bogdankostic another option is id_hash_keys = id_hash_keys or self.id_hash_keys, but I am undecided about which one is better.
There was a problem hiding this comment.
id_hash_keys = id_hash_keys or self.id_hash_keys would work in this case because id_hash_keys is supposed to be a list, but this doesn't work with optional boolean parameters.
For example, if we explicitly set remove_code_snippets to False when calling convert, remove_code_snippets or self.remove_code_snippets would evaluate to whatever value self.remove_code_snippets has.
For consistency, I would like all these lines to have the same pattern.
Related Issues
Proposed Changes:
This PR adds the possibility to extract headlines out of a markdown file. For this, it adds the parameter
extract_headlinesto theMarkdownConverterand adapts thePreProcessorto be able to keep only the relevant headlines for each Document split when splitting the original Document.Headlines are stored as a list of dictionaries in the Document's meta field
"headlines"and are structured as followed:{ "headline": <THE HEADLINE STRING>, "start_idx": <IDX OF HEADLINE START IN document.content, int or None>, "level": <LEVEL OF THE HEADLINE, int> }start_idxis set to None by thePreProcessorwhen the headline is relevant for the current split but it appears in a previous split.Further changes:
remove_code_snippetstoMarkdownConverterto allow the users to choose whether to remove the code snippets (previously, code snippets were always removed)PreProcessor'ssplitmethod to make it (hopefully) more readable_split_sentencesto not remove whitespace characters when splitting a text into sentencesHow did you test it?
I added a couple of unit tests, let me know if I should add more.
Notes for the reviewer
I'm not quite sure about the dictionary format for the headlines, especially about setting
start_idxtoNoneif the headline cannot be found inDocument.content. Please let me know if you think this is okay or you see a better solution.Checklist