
Improve quote detection by finding start and end quotation marks #380

Open

wants to merge 11 commits into base: develop
Conversation

afriedman412

The current code for detecting quotes is pretty unsophisticated. It sequentially pairs any tokens that `token.is_quote` deems quotation marks and assumes their indexes are the quote boundaries. If there is an odd number of quotation marks, it throws an error. I've been doing quote detection on some unreliably formatted text lately, which has things like "»" used as bullet points and lots of unpredictable stray characters, so I came up with a workaround.

My version only matches specific types of quotation marks with their accepted counterparts. For example, if it finds an opening mark, it will pair it with the next accepted closing mark. If it never finds a match, it ignores the singleton quotation mark. I did this by making a list of 11 approved tuples of Unicode code points (which lives in `constants`).

For example:

Bill told me I "shouldn‘t wear those pants" but I will.

In the current version, running quote detection here would raise an error, because there are three quotation-mark-like tokens in the sentence. Even if it didn't, it would return "shouldn" as a quote, because textacy assumes sequential quotation marks are quote boundaries.

My version takes the first quotation mark (`q`) and iterates through all the later quotation marks until it finds one (`q_`) where `(ord(q.text), ord(q_.text))` is in the list of acceptable pairs. Here, that pair would be `(34, 34)`, because the double quotation mark is Unicode code point 34.
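For readers following along, the pairing logic described above can be sketched in a few stand-alone lines. This is a simplified illustration, not the PR's actual code: it uses plain `(index, char)` tuples instead of spaCy tokens, and only an illustrative subset of the 11 approved pairs.

```python
# Illustrative subset of the approved (start, end) code-point pairs;
# the full set of 11 tuples lives in textacy's constants module.
QUOTATION_MARK_PAIRS = {
    (34, 34),      # " ... "
    (8220, 8221),  # “ ... ”
    (8216, 8217),  # ‘ ... ’
}

def pair_quote_marks(marks):
    """marks: list of (token_index, char) tuples in document order.
    Pairs each mark with the next one whose code points form an approved
    pair; unmatched singletons are ignored."""
    pair_idxs = []
    for n, (i, q) in enumerate(marks):
        if i in [end for _, end in pair_idxs]:
            continue  # already consumed as a closing mark
        for j, q_ in marks[n + 1:]:
            if (ord(q), ord(q_)) in QUOTATION_MARK_PAIRS:
                pair_idxs.append((i, j))
                break
    return pair_idxs

# The example sentence has three quote-like marks: " (34), ‘ (8216), " (34).
print(pair_quote_marks([(4, '"'), (6, '‘'), (9, '"')]))  # [(4, 9)]
```

The stray `‘` never finds an accepted counterpart, so it is skipped rather than raising an error, and the two `"` marks pair up as the quote boundaries.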

@bdewilde
Collaborator

Hi @afriedman412 , thanks for submitting these changes! I'm pretty sure I follow the updated logic, but it would be particularly helpful to have new tests (or just new test cases for an existing test) that confirm what we should expect to be handled by it. Would you be able to add some in here?

@bdewilde (Collaborator) left a comment

i've suggested a few changes for performance and stylistic consistency reasons, but nothing major. my big ask continues to be additional testing to make sure that these changes do what we want them to, and to illustrate the improved extraction cases. i can certainly come up with some, but you surely know better than i what to test here.

Comment on lines 24 to 32
"""
Ordinal points of the token.is_quote characters, matched up by start and end.

source:
switch = "\"\'"
start = "“‘```“‘«‹「『„‚"
end = "”’’’’”’»›」』”’"

"""

nit: could you move this docstring to directly below (rather than above) the object's assignment?


"""
QUOTATION_MARK_PAIRS = {
(34, 34),

not critical, but helpful: could you add the corresponding marks in a comment on each line?

Comment on lines 231 to 236
```python
for n, q in enumerate(qtok):
    if q.i not in [q_[1] for q_ in qtok_pair_idxs]:
        for q_ in qtok[n+1:]:
            if (ord(q.text), ord(q_.text)) in constants.QUOTATION_MARK_PAIRS:
                qtok_pair_idxs.append((q.i, q_.i))
                break
```

i haven't profiled this code, but this series of nested loops feels like it could be a performance bottleneck. is there a still-clear but more performant alternative?
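One possible direction for the reviewer's question, sketched below; this is not the PR's code and has not been profiled or tested against textacy's suite. Instead of rescanning the full tail of marks for every opening mark, scan once while keeping a stack of still-open marks, and close the most recently opened mark that accepts the current character. `QUOTATION_MARK_PAIRS` here is an illustrative subset of the approved pairs.

```python
QUOTATION_MARK_PAIRS = {(34, 34), (8220, 8221), (8216, 8217)}

def pair_quote_marks_stack(marks):
    """marks: list of (token_index, char) tuples in document order.
    Returns (start_idx, end_idx) pairs, sorted by start index."""
    pairs = []
    open_stack = []  # (token_index, code_point) of unmatched opening marks
    for i, q in marks:
        code = ord(q)
        # try to close the most recently opened compatible mark
        for k in range(len(open_stack) - 1, -1, -1):
            if (open_stack[k][1], code) in QUOTATION_MARK_PAIRS:
                pairs.append((open_stack[k][0], i))
                del open_stack[k:]  # marks opened inside become orphaned singletons
                break
        else:
            open_stack.append((i, code))
    return sorted(pairs)
```

Note one behavioral difference from the nested-loop version: unmatched marks opened inside a closed pair are discarded rather than left available to pair with later marks, which may or may not be the desired semantics.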

Comment on lines 328 to 334
```python
# def get_cue_candidates(window_sents, verbs):
#     return [
#         tok
#         for sent in window_sents
#         for tok in sent
#         if filter_cue_candidates(tok, verbs)
#     ]
```

should this still be here?

Comment on lines 323 to 326
```python
return all([
    tok.pos == VERB,
    tok.lemma_ in verbs
])
```

Suggested change:

```diff
-return all([
-    tok.pos == VERB,
-    tok.lemma_ in verbs
-])
+return (tok.pos == VERB and tok.lemma_ in verbs)
```

Comment on lines 340 to 343
```python
return all([
    candidate.pos!=PUNCT,
    ((candidate.i >= i and candidate.i >= j) or (candidate.i <= i and candidate.i <= j)),
])
```

Suggested change:

```diff
-return all([
-    candidate.pos!=PUNCT,
-    ((candidate.i >= i and candidate.i >= j) or (candidate.i <= i and candidate.i <= j)),
-])
+return (
+    candidate.pos != PUNCT
+    and ((candidate.i >= i and candidate.i >= j) or (candidate.i <= i and candidate.i <= j))
+)
```

Comment on lines 349 to 352
```python
for i_, j_ in list(zip(
    [tok.i for tok in doc if tok.text=="\n"],
    [tok.i for tok in doc if tok.text=="\n"][1:])
):
```

trying to improve efficiency/performance here

Suggested change:

```diff
-for i_, j_ in list(zip(
-    [tok.i for tok in doc if tok.text=="\n"],
-    [tok.i for tok in doc if tok.text=="\n"][1:])
-):
+lb_tok_idxs = [tok.i for tok in doc if tok.text == "\n"]
+for i_, j_ in zip(lb_tok_idxs, lb_tok_idxs[1:]):
```

"""
if para:
i_, j_ = line_break_window(i, j, doc)
if i_:

Suggested change:

```diff
-    if i_:
+    if i_ is not None:
```
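The reason this matters: token index 0 is a legitimate window start, but `if i_:` treats it as false. A small illustration, where `line_break_window_stub` is a hypothetical stand-in assuming the PR's `line_break_window` helper returns `(None, None)` when no window is found:

```python
def line_break_window_stub(found):
    # Stand-in: returns (start_idx, end_idx) token indices, or (None, None).
    return (0, 7) if found else (None, None)

i_, j_ = line_break_window_stub(True)
assert i_ == 0          # a window starting at the first token in the doc
assert not i_           # ...but 0 is falsy, so `if i_:` would skip it
assert i_ is not None   # the suggested check handles it correctly
```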

```python
return (
    sent
    for sent in doc.sents
    # these boundary cases are a subtle bit of work...
```

🙇‍♂️

Comment on lines 264 to 265
```python
windower(qtok_start_idx, qtok_end_idx, doc, True),
windower(qtok_start_idx, qtok_end_idx, doc)
```
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible for these two approaches to ever produce identical windows? does that matter?

@bdewilde

hey @afriedman412 , does the newer PR (#382) supersede this one?

…x` package, min_quote_length is now a `direct_quotations` parameter (not a constant), added a better example for testing linebreaks that function as closing quotes