Missing Data #28

I tried to load all documents for each data sample and got the following error:

```text
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
FileNotFoundError: [Errno 2] No such file or directory: 'benchmarks/officeqa/treasury_bulletins_parsed/transformed_page_level/treasury_bulletin_1982_3.txt'
```

For your reference, here is my Python code:
```python
from datasets import load_dataset
dataset = load_dataset('csv', data_files='benchmarks/officeqa/officeqa.csv')['train']

import tiktoken
tiktoken_encoder = tiktoken.get_encoding("o200k_base")
count_tokens = lambda text: len(tiktoken_encoder.encode(text, disallowed_special=()))

context_lens = []
for datapoint in dataset:
    context = [open(f"benchmarks/officeqa/treasury_bulletins_parsed/transformed_page_level/{f}", 'r').read()
                for f in datapoint['source_files'].split('\r\n')]
    context_len = 0
    for doc in context:
        context_len += count_tokens(doc)
    context_lens.append(context_len)
```

The same error occurs when I use the documents in `benchmarks/officeqa/treasury_bulletins_parsed/transformed` instead.

P.S. `treasury_bulletin_1982_3.txt` is the only missing document.
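
In case it helps with reproducing this, here is a minimal sketch of how the missing files could be listed in one pass instead of stopping at the first `FileNotFoundError`. `find_missing_files` is a hypothetical helper of my own, not something from the repository, and it assumes the `source_files` column holds one file name per line.

```python
from pathlib import Path


def find_missing_files(source_file_fields, doc_dir):
    """Return the sorted, de-duplicated file names that are absent from doc_dir.

    source_file_fields: iterable of strings, each holding one file name per line
    (assumed layout of the `source_files` column).
    """
    doc_dir = Path(doc_dir)
    missing = set()
    for field in source_file_fields:
        # splitlines() copes with both '\r\n' and '\n' separators
        for name in field.splitlines():
            name = name.strip()
            if name and not (doc_dir / name).is_file():
                missing.add(name)
    return sorted(missing)
```

With the dataset loaded as in my snippet above, the call would be `find_missing_files(dataset['source_files'], 'benchmarks/officeqa/treasury_bulletins_parsed/transformed_page_level')`.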
