Bobcat parser crashes on English contractions #60

Closed
pmarcis opened this issue Feb 9, 2023 · 2 comments

pmarcis commented Feb 9, 2023

Hi!

The parser crashes when it is given tokenised data that contains English contractions. Passing non-tokenised data seems wrong too, because the parser does not tokenise internally: punctuation stays attached to words, and contractions stay attached to the verb.

E.g.:

from lambeq import BobcatParser
bobcat_parser = BobcatParser()
diagram = bobcat_parser.sentence2diagram("Baby didn 't like it")
diagram.draw()

results in:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
File .../site-packages/lambeq/text2diagram/bobcat_parser.py:382, in BobcatParser.sentences2trees(self, sentences, tokenised, suppress_exceptions, verbose)
    381     result = self.parser(sentence_input)
--> 382     trees.append(self._build_ccgtree(result[0]))
    383 except Exception:
File .../site-packages/lambeq/bobcat/parser.py:258, in ParseResult.__getitem__(self, index)
    256 def __getitem__(self, index: Union[int, slice]) -> Union[ParseTree,
    257                                                          list[ParseTree]]:
--> 258     return self.root[index]

IndexError: list index out of range

During handling of the above exception, another exception occurred:

BobcatParseError                          Traceback (most recent call last)
Cell In[2], line 1
----> 1 diagram = bobcat_parser.sentence2diagram("Baby didn 't like it")
      2 diagram.draw()

File .../site-packages/lambeq/text2diagram/ccg_parser.py:231, in CCGParser.sentence2diagram(self, sentence, tokenised, planar, suppress_exceptions)
    228 if not isinstance(sentence, str):
    229     raise ValueError('`tokenised` set to `False`, but variable '
    230                      '`sentence` does not have type `str`.')
--> 231 return self.sentences2diagrams(
    232                 [sentence],
    233                 planar=planar,
    234                 suppress_exceptions=suppress_exceptions,
    235                 tokenised=tokenised,
    236                 verbose=VerbosityLevel.SUPPRESS.value)[0]

File .../site-packages/lambeq/text2diagram/ccg_parser.py:161, in CCGParser.sentences2diagrams(self, sentences, tokenised, planar, suppress_exceptions, verbose)
    125 def sentences2diagrams(
    126         self,
    127         sentences: SentenceBatchType,
   (...)
    130         suppress_exceptions: bool = False,
    131         verbose: Optional[str] = None) -> list[Optional[Diagram]]:
    132     """Parse multiple sentences into a list of discopy diagrams.
    133 
    134     Parameters
   (...)
    159 
    160     """
--> 161     trees = self.sentences2trees(sentences,
    162                                  suppress_exceptions=suppress_exceptions,
    163                                  tokenised=tokenised,
    164                                  verbose=verbose)
    165     diagrams = []
    166     if verbose is None:

File .../site-packages/lambeq/text2diagram/bobcat_parser.py:387, in BobcatParser.sentences2trees(self, sentences, tokenised, suppress_exceptions, verbose)
    385                 trees.append(None)
    386             else:
--> 387                 raise BobcatParseError(' '.join(sent.words))
    389 for i in empty_indices:
    390     trees.insert(i, None)

BobcatParseError: Bobcat failed to parse "Baby didn 't like it".

dimkart commented Feb 9, 2023

Hi, you can use lambeq's SpacyTokeniser class to tokenise your sentences before feeding them to the parser. From the command-line interface, you can just use the -t option. If you want to provide the sentence already tokenised, be sure to separate the words correctly, i.e. "did" and "n't", as below; otherwise the model will not recognise "didn" as a proper word.

image

Hope that helps.
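
For data that is already tokenised in the "didn 't" style, one option is to repair the split before parsing. The sketch below is only an illustration: `fix_split_negations` and `parse_tokens` are hypothetical helper names, not part of lambeq, and the rule assumes every `'t` token that follows a word ending in "n" is a negation. For raw text, `SpacyTokeniser().tokenise_sentence(...)` should make the helper unnecessary.

```python
def fix_split_negations(tokens: list[str]) -> list[str]:
    """Turn ["didn", "'t"] into ["did", "n't"]; leave other tokens untouched.

    Hypothetical helper, not part of lambeq.
    """
    fixed: list[str] = []
    for token in tokens:
        previous = fixed[-1] if fixed else ""
        if (token.lower() == "'t"
                and len(previous) > 1
                and previous.lower().endswith("n")):
            # Move the trailing "n" from the verb onto the contraction.
            stem = fixed.pop()
            fixed.append(stem[:-1])
            fixed.append(stem[-1] + token)
        else:
            fixed.append(token)
    return fixed


def parse_tokens(tokens: list[str]):
    """Parse a pre-tokenised sentence with Bobcat.

    lambeq is imported lazily so the helper above works without it.
    """
    from lambeq import BobcatParser

    parser = BobcatParser()
    # tokenised=True tells the parser the input is a list of tokens.
    return parser.sentence2diagram(tokens, tokenised=True)


tokens = fix_split_negations("Baby didn 't like it".split())
print(tokens)
```

The repaired token list can then be passed to `parse_tokens(tokens)`, which needs lambeq installed and the Bobcat model available.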

pmarcis commented Feb 10, 2023

Thanks! That solves this problem!

pmarcis closed this as completed Feb 10, 2023