spaCy - BPE Alignment sometimes faulty, raises index errors #5

Closed
bhoov opened this issue Oct 22, 2019 · 2 comments
Labels
bug Something isn't working

Comments

@bhoov
Owner

bhoov commented Oct 22, 2019

Hi, thanks for the corpus creation scripts. They're very helpful.

I got an index error with BPE/spaCy tokenization, as shown below:

Extracting embeddings into /felicity/workspace/exbert/server/data/mt/deepak/embeddings/embeddings.hdf5
['deep', '##ak', 'i', 'don', '’', 't', 'feel', 'like', 'doing', 'my', 'meditation', 'today', '.']
['deepak', 'i', 'do', 'n’t', 'feel', 'like', 'doing', 'my', 'meditation', 'today', '.']
11
0
Deepak I don’t feel like doing my meditation today.
Traceback (most recent call last):
  File "/felicity/workspace/exbert/server/data/processing/create_corpus.py", line 19, in <module>
    create_hdf5.main(unique_sent_pckl, args.outdir, args.force)
  File "/felicity/workspace/exbert/server/data/processing/create_hdf5.py", line 221, in main
    sentences_to_hdf5(embedding_extractor, str(embedding_outpath), sentences, clear_file=force)
  File "/felicity/workspace/exbert/server/data/processing/create_hdf5.py", line 179, in sentences_to_hdf5
    b_pos = combine_tokens_meta(b_tokens, s_tokens, s_pos)
  File "/felicity/workspace/exbert/server/utils/token_processing.py", line 121, in combine_tokens_meta
    meta_list.append(spacy_meta[j])
IndexError: list index out of range

Is the IndexError designed to be raised under certain circumstances? If not, how can I solve it?

Thank you very much.

Originally posted by @felicitywang in #4 (comment)

@bhoov added the "bug" (Something isn't working) label on Oct 22, 2019
@bhoov
Owner Author

bhoov commented Oct 22, 2019

Looking at the two tokenized lists, it looks like the contraction "don’t" throws off the alignment: BPE splits it into "don", "’", "t", while spaCy splits it into "do", "n’t". spaCy has a number of built-in tokenizer exceptions like this one that make it very difficult to align its tokens with the way BPE tokenizes words.

I will be working on this, but in the meantime, consider writing a script that filters sentences containing these exceptions out of the corpus.
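
For illustration, here is a minimal sketch of one way to align the two token lists by character position instead of by token index. It is not the fix that went into this repository, and `align_meta_by_chars` is a hypothetical name. It assumes both tokenizers cover exactly the same characters once the `##` continuation prefix is stripped, and it does not handle special tokens such as `[CLS]` or `[UNK]`.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def strip_wordpiece(token: str) -> str:
    """Drop the '##' continuation prefix used by BERT-style wordpieces."""
    return token[2:] if token.startswith("##") else token


def align_meta_by_chars(
    wp_tokens: Sequence[str],
    sp_tokens: Sequence[str],
    sp_meta: Sequence[T],
) -> List[T]:
    """Give each wordpiece the metadata of the spaCy token holding its first character.

    Hypothetical helper: assumes both tokenizations concatenate to the same string.
    """
    if len(sp_tokens) != len(sp_meta):
        raise ValueError("need exactly one metadata entry per spaCy token")

    pieces = [strip_wordpiece(t) for t in wp_tokens]
    if any(not p for p in pieces):
        raise ValueError("empty wordpiece after stripping '##'")
    if "".join(pieces) != "".join(sp_tokens):
        raise ValueError("tokenizations do not cover the same characters")

    # owner[i] is the index of the spaCy token that contains character i.
    owner: List[int] = []
    for idx, tok in enumerate(sp_tokens):
        owner.extend([idx] * len(tok))

    aligned: List[T] = []
    pos = 0
    for piece in pieces:
        aligned.append(sp_meta[owner[pos]])
        pos += len(piece)
    return aligned


# The two token lists from the traceback above.
wp = ['deep', '##ak', 'i', 'don', '’', 't', 'feel', 'like', 'doing',
      'my', 'meditation', 'today', '.']
sp = ['deepak', 'i', 'do', 'n’t', 'feel', 'like', 'doing',
      'my', 'meditation', 'today', '.']

# Use the spaCy token index as stand-in metadata so the mapping is easy to read.
aligned = align_meta_by_chars(wp, sp, list(range(len(sp))))
```

With these inputs, "don" maps to the spaCy token "do", and both "’" and "t" map to "n’t", so the result has one entry per wordpiece and no index runs past the end of the spaCy list.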

@felicitywang

felicitywang commented Oct 23, 2019

Thanks for the quick response. Temporarily solved by using contractions.fix and then removing the remaining sentences that contain "'".
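
A rough sketch of what such a filtering step might look like follows. `filter_contractions` and its `expand` parameter are hypothetical names, not part of this repository. The sketch assumes any expansion function takes a string and returns a string, so something like `contractions.fix` could be passed as `expand`. It also checks for the typographic apostrophe (’), since that is the character in the failing sentence above.

```python
from typing import Callable, Iterable, List

# ASCII apostrophe and the typographic one (U+2019) seen in the failing sentence.
APOSTROPHES = ("'", "\u2019")


def filter_contractions(
    sentences: Iterable[str],
    expand: Callable[[str], str] = lambda s: s,
) -> List[str]:
    """Expand contractions where possible, then drop sentences that still contain an apostrophe.

    `expand` is a placeholder for any str -> str expansion function (assumption).
    """
    kept: List[str] = []
    for sentence in sentences:
        expanded = expand(sentence)
        if any(mark in expanded for mark in APOSTROPHES):
            continue  # still has an apostrophe, so alignment may fail
        kept.append(expanded)
    return kept


corpus = [
    "Deepak I don’t feel like doing my meditation today.",
    "I feel fine.",
]
filtered = filter_contractions(corpus)
```

Without an expansion function, the first sentence is dropped and only "I feel fine." is kept. Dropping every sentence with an apostrophe also removes possessives and quoted text, so this trades corpus coverage for safety.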

@bhoov closed this as completed on Jun 19, 2020
2 participants