Hi, thanks for the corpus creating scripts. They're very helpful.
I got an index error with BPE/spacy tokenization as shown below:
Extracting embeddings into /felicity/workspace/exbert/server/data/mt/deepak/embeddings/embeddings.hdf5
['deep', '##ak', 'i', 'don', '’', 't', 'feel', 'like', 'doing', 'my', 'meditation', 'today', '.']
['deepak', 'i', 'do', 'n’t', 'feel', 'like', 'doing', 'my', 'meditation', 'today', '.']
11
0
Deepak I don’t feel like doing my meditation today.
Traceback (most recent call last):
  File "/felicity/workspace/exbert/server/data/processing/create_corpus.py", line 19, in <module>
    create_hdf5.main(unique_sent_pckl, args.outdir, args.force)
  File "/felicity/workspace/exbert/server/data/processing/create_hdf5.py", line 221, in main
    sentences_to_hdf5(embedding_extractor, str(embedding_outpath), sentences, clear_file=force)
  File "/felicity/workspace/exbert/server/data/processing/create_hdf5.py", line 179, in sentences_to_hdf5
    b_pos = combine_tokens_meta(b_tokens, s_tokens, s_pos)
  File "/felicity/workspace/exbert/server/utils/token_processing.py", line 121, in combine_tokens_meta
    meta_list.append(spacy_meta[j])
IndexError: list index out of range
Is this IndexError designed to be raised under certain circumstances? If not, how can I solve it?
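Looking at the two tokenized lists, it looks like the contraction "don’t" throws off the alignment: the BPE tokenizer splits it into 'don', '’', 't', while spaCy splits it into 'do', 'n’t'. spaCy has a number of built-in tokenization exceptions for words like this, which makes it very difficult to align its output with the way BPE tokenizes words.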
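To make the mismatch concrete, here is a small sketch that uses only the two token lists printed in the log above. The helpers `merge_wordpieces` and `first_mismatch` are hypothetical names written for this illustration; they are not part of the repository, and this is not how `combine_tokens_meta` is implemented.

```python
# Token lists copied from the debug output in this issue.
bert_tokens = ['deep', '##ak', 'i', 'don', '’', 't', 'feel', 'like',
               'doing', 'my', 'meditation', 'today', '.']
spacy_tokens = ['deepak', 'i', 'do', 'n’t', 'feel', 'like',
                'doing', 'my', 'meditation', 'today', '.']


def merge_wordpieces(tokens):
    """Glue '##' continuation pieces back onto the preceding piece."""
    merged = []
    for tok in tokens:
        if tok.startswith("##") and merged:
            merged[-1] += tok[2:]
        else:
            merged.append(tok)
    return merged


def first_mismatch(a, b):
    """Return the first index where the two lists differ, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


merged = merge_wordpieces(bert_tokens)
print(len(merged), len(spacy_tokens))        # 12 11
print(first_mismatch(merged, spacy_tokens))  # 2  ('don' vs 'do')
```

Even after the subword pieces are merged, the two lists have different lengths and diverge at the contraction, so any code that walks both lists in lockstep can run past the end of the shorter one.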
I will be working on this, but in the meantime, consider writing a script that filters the sentences containing these exceptions out of the corpus.
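A minimal sketch of such a filter follows. It assumes you already have both tokenizations for each sentence, with the spaCy tokens lowercased to match an uncased vocabulary. The criterion used here is an assumption: a sentence is kept only if every spaCy token is exactly a run of whole subword tokens. The names `is_alignable` and `filter_corpus` are hypothetical and do not exist in the repository.

```python
def is_alignable(wordpieces, spacy_tokens):
    """True if each spaCy token equals a concatenation of whole wordpieces."""
    j = 0
    for word in spacy_tokens:
        built = ""
        # Consume wordpieces until we have at least as many characters
        # as the current spaCy token.
        while len(built) < len(word) and j < len(wordpieces):
            piece = wordpieces[j]
            built += piece[2:] if piece.startswith("##") else piece
            j += 1
        if built != word:
            return False  # overshoot or different text: boundaries disagree
    # Every wordpiece must have been consumed as well.
    return j == len(wordpieces)


def filter_corpus(examples):
    """Keep sentences whose two tokenizations can be aligned.

    `examples` is an iterable of (sentence, wordpieces, spacy_tokens).
    """
    return [sent for sent, wp, sp in examples if is_alignable(wp, sp)]


examples = [
    ("Deepak is here.",
     ['deep', '##ak', 'is', 'here', '.'],
     ['deepak', 'is', 'here', '.']),
    ("Deepak I don’t feel like doing my meditation today.",
     ['deep', '##ak', 'i', 'don', '’', 't', 'feel', 'like',
      'doing', 'my', 'meditation', 'today', '.'],
     ['deepak', 'i', 'do', 'n’t', 'feel', 'like',
      'doing', 'my', 'meditation', 'today', '.']),
]
print(filter_corpus(examples))  # ['Deepak is here.']
```

This drops the problematic sentences rather than repairing them, which loses some data but avoids the IndexError until the alignment itself is fixed.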
Thank you very much.
Originally posted by @felicitywang in #4 (comment)