Missing Data #28

@dovanquyet

I tried to load all documents for each data sample and got the following error:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
FileNotFoundError: [Errno 2] No such file or directory: 'benchmarks/officeqa/treasury_bulletins_parsed/transformed_page_level/treasury_bulletin_1982_3.txt'

For your reference, here is my Python code:

from pathlib import Path

import tiktoken
from datasets import load_dataset

dataset = load_dataset('csv', data_files='benchmarks/officeqa/officeqa.csv')['train']
tiktoken_encoder = tiktoken.get_encoding("o200k_base")
docs_dir = Path('benchmarks/officeqa/treasury_bulletins_parsed/transformed_page_level')

def count_tokens(text):
    # disallowed_special=() encodes special-token strings as ordinary text
    return len(tiktoken_encoder.encode(text, disallowed_special=()))

# Total token count of all source documents for each data sample
context_lens = []
for datapoint in dataset:
    # read_text() opens and closes each file, so no handles are left open
    context = [(docs_dir / f).read_text() for f in datapoint['source_files'].split('\r\n')]
    context_lens.append(sum(count_tokens(doc) for doc in context))

The same error occurs when I use the documents in benchmarks/officeqa/treasury_bulletins_parsed/transformed instead.

P.S. It is the only missing document.
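
For anyone who wants to reproduce the check, below is a minimal sketch of one way to list every referenced file that is absent, instead of stopping at the first FileNotFoundError. The helper name find_missing_files and its parameters are hypothetical and not part of the repository. It assumes each source_files value holds one file name per line.

```python
from pathlib import Path


def find_missing_files(source_file_fields, base_dir):
    """Return a sorted list of referenced file names not found under base_dir.

    Hypothetical helper: source_file_fields is an iterable of strings, each
    holding one file name per line (as in the 'source_files' CSV column).
    """
    base = Path(base_dir)
    missing = set()
    for field in source_file_fields:
        # splitlines() handles both '\r\n' and '\n' separators
        for name in field.splitlines():
            name = name.strip()
            if name and not (base / name).is_file():
                missing.add(name)
    return sorted(missing)
```

With the dataset from the code above, the call would look like find_missing_files((d['source_files'] for d in dataset), 'benchmarks/officeqa/treasury_bulletins_parsed/transformed_page_level'). An empty list would mean every referenced document is present.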
