
Disk Error When Attempting to Run Tests #286

Open

pgarz opened this issue Nov 14, 2023 · 34 comments

@pgarz

pgarz commented Nov 14, 2023

Describe the bug
I get a disk error when trying to run a test. It might be a bug in how a temp file is being cached by the DeepEval library.

To Reproduce

import pytest
from deepeval.metrics.factual_consistency import FactualConsistencyMetric
from deepeval.metrics.answer_relevancy import AnswerRelevancyMetric
from deepeval.metrics.conceptual_similarity import ConceptualSimilarityMetric

from deepeval.test_case import LLMTestCase
from deepeval.evaluator import run_test


def compute_deep_eval_metric(question, ground_truth_answer, pred_answer, context):

  factual_consistency_metric = FactualConsistencyMetric(minimum_score=0.7)
  answer_relevancy_metric = AnswerRelevancyMetric(minimum_score=0.7)
  conceptual_similarity_metric = ConceptualSimilarityMetric(minimum_score=0.7)

  test_case = LLMTestCase(input=question, actual_output=pred_answer, context=context, expected_output=ground_truth_answer)
  results = run_test(test_case, [factual_consistency_metric, answer_relevancy_metric, conceptual_similarity_metric])
  return results

My stack trace is as follows:

/usr/local/lib/python3.10/dist-packages/transformers/convert_slow_tokenizer.py:473: UserWarning: The sentencepiece 
tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the
fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas 
the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the 
original piece of text.
  warnings.warn(
Error loading test run from disk
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-10-82f04413d550> in <cell line: 44>()
     42 expected_output_x = answers[1]
     43 
---> 44 x = compute_deep_eval_metric(input_x, expected_output_x, actual_output_x, context_x)

1 frames
/usr/local/lib/python3.10/dist-packages/deepeval/evaluator.py in run_test(test_cases, metrics, max_retries, delay, min_success, raise_error)
     90             # metric.score = score
     91 
---> 92             test_run_manager.get_test_run().add_llm_test_case(
     93                 test_case=test_case,
     94                 metrics=[metric],

AttributeError: 'NoneType' object has no attribute 'add_llm_test_case'
@penguine-ip
Contributor

@pgarz Thanks for bringing this up, we're aware of this problem and will ship a new release with a fix in the next 12 hours. We'll ping you when it's done!

@penguine-ip
Contributor

Hey @pgarz, as promised, it's fixed in the latest release 0.20.19. run_test now works, and there's a new evaluate function that can evaluate multiple test cases at once (run_test now takes just one test case). You can see more info here: https://docs.confident-ai.com/docs/evaluation-datasets#evaluate-your-dataset-without-pytest.

By the way, come join us on Discord; we're talking there and today someone brought up this exact issue :) https://discord.com/invite/a3K9c8GRGt

@pgarz
Author

pgarz commented Nov 16, 2023

Wonderful, thanks!

@julfr

julfr commented Mar 11, 2024

Hey @pgarz, as promised, it's fixed in the latest release 0.20.19. run_test now works, and there's a new evaluate function that can evaluate multiple test cases at once (run_test now takes just one test case). You can see more info here: https://docs.confident-ai.com/docs/evaluation-datasets#evaluate-your-dataset-without-pytest.

By the way, come join us on Discord; we're talking there and today someone brought up this exact issue :) https://discord.com/invite/a3K9c8GRGt

I have the same problem when running either evaluate([test_case], [metric]) or dataset.evaluate([metric]), although I have the latest version as of today (0.20.87).

@penguine-ip
Contributor

@julfr AttributeError: 'NoneType' object has no attribute 'add_llm_test_case' ?

@julfr

julfr commented Mar 12, 2024

@julfr AttributeError: 'NoneType' object has no attribute 'add_llm_test_case' ?

I am sorry, the error is: AttributeError: 'NoneType' object has no attribute 'test_cases'

@penguine-ip
Contributor

penguine-ip commented Mar 12, 2024 via email

@julfr

julfr commented Mar 13, 2024

Can you provide some code for me to reproduce?


Sure, I am just trying to reproduce the example in my Jupyter notebook:

import openai
import os
os.environ["OPENAI_API_KEY"] = 'my_key'
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
import deepeval
from deepeval import evaluate

deepeval.login_with_confident_api_key("my_confident_api_key")
# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost."

# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = ["All customers are eligible for a 30 day full refund at no extra cost."]

metric = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-3.5-turbo-16k",
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output,
    retrieval_context=retrieval_context
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

Here there's no problem and I see the output:

Event loop is already running. Applying nest_asyncio patch to allow async execution...
1.0
The score is 1.00 because there are no irrelevant statements in the actual output. Great job!

But when I try evaluate([test_case], [metric]) I get
Error loading test run from disk: [Errno 9] Bad file descriptor

AttributeError                            Traceback (most recent call last)
Cell In[5], line 2
      1 # or evaluate test cases in bulk
----> 2 evaluate([test_case], [metric])

File ~/.local/lib/python3.11/site-packages/deepeval/evaluate.py:229, in evaluate(test_cases, metrics, run_async, show_indicator, print_results)
    227 if run_async:
    228     loop = get_or_create_event_loop()
--> 229     test_results = loop.run_until_complete(
    230         a_execute_test_cases(test_cases, metrics, True)
    231     )
    232 else:
    233     test_results = execute_test_cases(test_cases, metrics, True)

File ~/.local/lib/python3.11/site-packages/nest_asyncio.py:99, in _patch_loop.<locals>.run_until_complete(self, future)
     96 if not f.done():
     97     raise RuntimeError(
     98         'Event loop stopped before Future completed.')
---> 99 return f.result()

File /software/jupyter/conda/envs/jupyter/lib/python3.11/asyncio/futures.py:203, in Future.result(self)
    201 self.__log_traceback = False
    202 if self._exception is not None:
--> 203     raise self._exception.with_traceback(self._exception_tb)
    204 return self._result

File /software/jupyter/conda/envs/jupyter/lib/python3.11/asyncio/tasks.py:267, in Task.__step(***failed resolving arguments***)
    263 try:
    264     if exc is None:
    265         # We use the `send` method directly, because coroutines
    266         # don't have `__iter__` and `__next__` methods.
--> 267         result = coro.send(None)
    268     else:
    269         result = coro.throw(exc)

File ~/.local/lib/python3.11/site-packages/deepeval/evaluate.py:152, in a_execute_test_cases(test_cases, metrics, save_to_disk)
    149 api_test_case.success = success
    151 test_run = test_run_manager.get_test_run()
--> 152 test_run.test_cases.append(api_test_case)
    153 test_run.dataset_alias = test_case.dataset_alias
    154 test_run_manager.save_test_run()

AttributeError: 'NoneType' object has no attribute 'test_cases'

The same happens with a dataset:

from deepeval.dataset import EvaluationDataset
dataset = EvaluationDataset(test_cases=[test_case])
dataset.evaluate([metric])

Out:
Error loading test run from disk: [Errno 9] Bad file descriptor

AttributeError                            Traceback (most recent call last)
Cell In[8], line 1
----> 1 dataset.evaluate([metric])

File ~/.local/lib/python3.11/site-packages/deepeval/dataset/dataset.py:75, in EvaluationDataset.evaluate(self, metrics)
     70 if len(self.test_cases) == 0:
     71     raise ValueError(
     72         "No test cases found in evaluation dataset. Unable to evaluate empty dataset."
     73     )
---> 75 return evaluate(self.test_cases, metrics)

File ~/.local/lib/python3.11/site-packages/deepeval/evaluate.py:229, in evaluate(test_cases, metrics, run_async, show_indicator, print_results)
    227 if run_async:
    228     loop = get_or_create_event_loop()
--> 229     test_results = loop.run_until_complete(
    230         a_execute_test_cases(test_cases, metrics, True)
    231     )
    232 else:
    233     test_results = execute_test_cases(test_cases, metrics, True)

File ~/.local/lib/python3.11/site-packages/nest_asyncio.py:99, in _patch_loop.<locals>.run_until_complete(self, future)
     96 if not f.done():
     97     raise RuntimeError(
     98         'Event loop stopped before Future completed.')
---> 99 return f.result()

File /software/jupyter/conda/envs/jupyter/lib/python3.11/asyncio/futures.py:203, in Future.result(self)
    201 self.__log_traceback = False
    202 if self._exception is not None:
--> 203     raise self._exception.with_traceback(self._exception_tb)
    204 return self._result

File /software/jupyter/conda/envs/jupyter/lib/python3.11/asyncio/tasks.py:267, in Task.__step(***failed resolving arguments***)
    263 try:
    264     if exc is None:
    265         # We use the `send` method directly, because coroutines
    266         # don't have `__iter__` and `__next__` methods.
--> 267         result = coro.send(None)
    268     else:
    269         result = coro.throw(exc)

File ~/.local/lib/python3.11/site-packages/deepeval/evaluate.py:152, in a_execute_test_cases(test_cases, metrics, save_to_disk)
    149 api_test_case.success = success
    151 test_run = test_run_manager.get_test_run()
--> 152 test_run.test_cases.append(api_test_case)
    153 test_run.dataset_alias = test_case.dataset_alias
    154 test_run_manager.save_test_run()

AttributeError: 'NoneType' object has no attribute 'test_cases'

@Peilun-Li
Contributor

+1 I'm also facing the Error loading test run from disk: [Errno 9] Bad file descriptor error when trying to run deepeval.evaluate (version 0.21.00). Given that the logic is related to file locking, I'm not sure whether that might be an issue for certain OSes/compute environments.

As a workaround I've made this snippet https://gist.github.com/Peilun-Li/a0f26847812e177383a3dc7f17b3d84b, which runs bulk evaluation with full and customizable parallelism and does not rely on file sync, given we support async mode now. It seems to be working in my preliminary tests. Feel free to use and adapt.
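
A rough sketch of the idea behind the gist (not the gist itself), assuming the AnswerRelevancyMetric construction and metric.measure() call shown earlier in this thread: give each test case its own metric instance and run them in a thread pool, so nothing touches the shared temp file on disk.

from concurrent.futures import ThreadPoolExecutor

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def measure_one(test_case: LLMTestCase) -> float:
    # Fresh metric per test case, since metric objects are stateful.
    metric = AnswerRelevancyMetric(threshold=0.7)
    metric.measure(test_case)
    return metric.score

# Placeholder test cases; assumes OPENAI_API_KEY is already configured.
test_cases = [
    LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
    ),
]

with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(measure_one, test_cases))
print(scores)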

@penguine-ip
Contributor

Hey @julfr, I missed your message. May I ask, does this only happen when running evaluate() after running metric.measure()? I.e. if you just run evaluate(), there is no problem?

@penguine-ip penguine-ip reopened this Mar 22, 2024
@Peilun-Li
Contributor

Peilun-Li commented Mar 22, 2024

The error on my end can be surfaced using the following simple snippet:

from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

coherence_metric = GEval(
    name="Coherence",
    criteria='Coherence - the collective quality of all sentences. Given input as the summarized source and actual_output as the summary, the actual_output should be well-structured and well-organized. The actual_output should not just be a heap of related information, but should build from sentence to a coherent body of information about a topic.',
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    async_mode=False,
)

test_case = LLMTestCase(
    input="placeholder",
    actual_output="placeholder"
)

evaluate([test_case], [coherence_metric], run_async=False)

Running it ends with the error Error loading test run from disk: [Errno 9] Bad file descriptor.

Digging a bit deeper, it does feel like a file-system-related issue. When running the above code on mac/linux with a normal local disk, there's no issue running the snippet. The error only seems to happen when an NFS disk mount (in our case, AWS EFS) is used.

So, looking into the code that throws the error:

  1. We use portalocker.Lock on a read-mode file object: https://github.com/confident-ai/deepeval/blob/v0.21.00/deepeval/test_run/test_run.py#L187-L189
  2. Internally, by default, that portalocker.Lock call uses lock flags of EXCLUSIVE and NON_BLOCKING.
  3. The actual error [Errno 9] Bad file descriptor is thrown at this line when calling fcntl.flock, and can be reproduced through:
# This throws out OSError: [Errno 9] Bad file descriptor
import fcntl
file_io = open(some_file_name, "r")
lock_flags = fcntl.LOCK_EX | fcntl.LOCK_NB
fcntl.flock(file_io, lock_flags)
  4. Surveyed a bit, and it looks like for certain file systems we can't set EXCLUSIVE and NON_BLOCKING lock flags when only attempting to read the file; those lock flags only work in write mode. So if we switch to a SHARED lock flag, it seems to work:
# This does not throw out errors
import fcntl
file_io = open(some_file_name, "r")
lock_flags = fcntl.LOCK_SH
fcntl.flock(file_io, lock_flags)

So the potential fix (at least for the error on my end) could be to use a different lock flag when only reading the file at https://github.com/confident-ai/deepeval/blob/v0.21.00/deepeval/test_run/test_run.py#L187-L189, or to update the mode used there to read and write (w+). A deeper question, though, is whether this file sync pattern is sustainable and whether there are cleaner ways to keep everything in memory and only write to disk as a final step if needed.

@penguine-ip
Contributor

Thanks @Peilun-Li for the detailed debugging. I was wondering, does this work for the get_test_run() method:

    with open(self.temp_file_name, "r") as file:
        portalocker.lock(file, portalocker.LOCK_SH, timeout=5)
        self.test_run = self.test_run.load(file)

Here we are using portalocker but manually managing the opening of the file. Since I don't have the environment you're working in, would it be possible for you to try it out and see what happens? Thank you so much!

@penguine-ip
Contributor

penguine-ip commented Mar 24, 2024

(as for the file sync pattern, we're doing it because it is probably the easiest way we can achieve multiprocessing, which does not have shared memory.)
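
A rough illustration of that pattern (a hypothetical helper, not the actual deepeval internals): worker processes append their results to a shared JSON file, serializing access with a portalocker lock because they have no shared memory.

import json

import portalocker

def append_result(path: str, result: dict) -> None:
    # Assumes the file already exists and contains {"test_cases": [...]}.
    with portalocker.Lock(path, mode="r+", timeout=5) as f:
        data = json.load(f)
        data["test_cases"].append(result)
        f.seek(0)
        f.truncate()
        json.dump(data, f)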

@Peilun-Li
Contributor

    with open(self.temp_file_name, "r") as file:
        portalocker.lock(file, portalocker.LOCK_SH, timeout=5)
        self.test_run = self.test_run.load(file)

This gives an error of TypeError: lock() got an unexpected keyword argument 'timeout', since I think it doesn't match the signature of portalocker.lock.

The following seems to be working on my end (I had to add portalocker.LOCK_NB to stifle the warning timeout has no effect in blocking mode):

with portalocker.Lock(
    "temp_test_run_data.json", mode="r", timeout=5, flags=portalocker.LOCK_SH | portalocker.LOCK_NB
) as file:
    print(file)

<_io.TextIOWrapper name='temp_test_run_data.json' mode='r' encoding='UTF-8'>

Yeah, agreed, this file pattern makes sense when the model calls can only be made in a sync pattern (and may also be a side effect of the statefulness of metric objects; we can reserve that discussion for the other thread :) ). Also curious: given that async models & metrics are supported right now, could we make a shortcut (as in my bulk evaluation snippet above) when async mode is enabled?

@penguine-ip
Contributor

@Peilun-Li So async mode just runs metrics concurrently for each test case, not for all test cases at once. For all test cases at once we still rely on multiprocessing for now. By the way, I'm going to bring this to Discord, I'm losing track of the issues on GitHub...

@repetitioestmaterstudiorum
Contributor

Experiencing the same issue on an HPC with an NFS disk.

@penguine-ip
Contributor

Hey @repetitioestmaterstudiorum, just to confirm, are you calling evaluate() when this happens?

@repetitioestmaterstudiorum
Contributor

Hey @penguine-ip, no, it's happening only with the assert_test() function. With evaluate() it works fine.

@penguine-ip
Contributor

@repetitioestmaterstudiorum OK, I see. This is weird because they use the same methods under the hood (both writing to disk). Do you have a locally forked version of deepeval, or are you using the version from PyPI (the one from pip install/poetry install)?

@repetitioestmaterstudiorum
Contributor

I am using the unchanged version 0.21.21 on the HPC.
I think a crucial difference between the scripts with assert_test() and evaluate() is that in the first one pytest loops over the test cases, which might trigger this locking mechanism (maybe because things could theoretically happen in parallel?), whereas in the latter script with evaluate() I loop over the data manually.

assert_test() setup:

model_hierarchy = [mistral_7b, 'gpt-3.5-turbo', None]
@pytest.mark.parametrize("test_case", test_cases)
def test_all(test_case):
    for Metric in [ContextualRelevancyMetric, ContextualPrecisionMetric, ...]:
        for model in model_hierarchy:
            try:
                metric = Metric(model=model, threshold=0)
                assert_test(test_case, [metric])
                break  # Proceeds if assert_test passes without exception.
            except ValueError:
                continue  # Tries the next model on ValueError.
        else:
            pytest.fail("All models failed for this metric.")

evaluate() setup:

model_hierarchy = [mistral_7b, 'gpt-3.5-turbo', None]
results = {}
for doc in documents:
    test_case = create_test_case_from_doc(doc)
    for Metric in [ContextualRelevancyMetric, ContextualPrecisionMetric, ...]:
        for model in model_hierarchy:
            try:
                metric = Metric(model=model, threshold=0)
                result = evaluate(test_case, [metric])
                results.setdefault(Metric.__name__, []).append(result)
                break  # Proceeds if evaluate succeeds without exception.
            except ValueError:
                continue  # Tries the next model on ValueError.

@penguine-ip
Contributor

@repetitioestmaterstudiorum I see. Well, this is definitely an anti-pattern. A simple experiment you could run is to not use a for loop when calling assert_test; do you get the same disk error?
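
Here is a minimal sketch of what I mean (assuming the test_cases list and metrics from your setup above): parametrize over metrics as well, so each test calls assert_test exactly once with no inner loop.

import pytest

from deepeval import assert_test
from deepeval.metrics import ContextualPrecisionMetric, ContextualRelevancyMetric

@pytest.mark.parametrize("test_case", test_cases)
@pytest.mark.parametrize("MetricClass", [ContextualRelevancyMetric, ContextualPrecisionMetric])
def test_single_metric(test_case, MetricClass):
    # One metric and one assert_test call per test; no looping inside the test body.
    metric = MetricClass(threshold=0)
    assert_test(test_case, [metric])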

@repetitioestmaterstudiorum
Contributor

Probably (regarding the anti-pattern). There's no benefit for me in using pytest & assert_test(), mostly because assert_test doesn't return anything, so I am sticking to evaluate(). I log individual results per test to my DB and cache results that way, so that if something fails I can continue where I left off and handle exceptions correctly before escalating to the next model.

@repetitioestmaterstudiorum
Contributor

I take that back. Today I ran the same script using evaluate() and got a [Errno 9] Bad file descriptor error!

@penguine-ip
Contributor

Hey @Peilun-Li @repetitioestmaterstudiorum, can you please try the evaluate function again on the latest release (v0.21.25)? I added the shared lock to read operations as @Peilun-Li described. It's not causing any additional errors so far, but I'd like to see if it fixes the problems on your systems, thanks!

@repetitioestmaterstudiorum
Contributor

This worked for me, thank you for the update @penguine-ip !

@penguine-ip
Contributor

penguine-ip commented Apr 15, 2024

No problem @repetitioestmaterstudiorum , all thanks to @Peilun-Li for leading the solution to this 🙏🏻

@Peilun-Li
Contributor

Works from my end now as well! Thanks for fixing it!

@penguine-ip
Contributor

@Peilun-Li Not at all, all thanks to you! Stateless metric UX will be done early next month I think.

@penguine-ip
Contributor

Hey @Peilun-Li and @repetitioestmaterstudiorum, I fixed some bugs with some race conditions and abstracted away the logic here: https://github.com/confident-ai/deepeval/blob/main/deepeval/test_run/test_run.py#L318

I don't think it should cause any new issues with your file systems, but just in case, please let me know if you run into any problems with the new update. The release with this new piece of code is v0.21.30, thanks!

@repetitioestmaterstudiorum
Contributor

Hi @penguine-ip, it works for me!

@repetitioestmaterstudiorum
Contributor

However, I'm noticing now that I'm getting this occasional warning:

lib/python3.10/site-packages/portalocker/utils.py:218: UserWarning: timeout has no effect in blocking mode
  warnings.warn(

@penguine-ip
Contributor

@repetitioestmaterstudiorum I tried playing around with it, and the previous example @Peilun-Li put out (using an NB lock) hides the warning. However, since it is now in "r+" mode we can't use an NB lock. So we have to either 1) remove the timeout, or 2) ignore it. Does it print once, or does it spam your terminal console in a deepeval test run?

@Peilun-Li Do you have any suggestions on this fix? It is happening here: https://github.com/confident-ai/deepeval/blob/main/deepeval/test_run/test_run.py#L325

@repetitioestmaterstudiorum
Contributor

repetitioestmaterstudiorum commented Apr 21, 2024

It only prints once, so it's not an annoyance.

I was thinking: when executing scripts that read and write files in an HPC environment, it's good practice anyway to run them in directories on physical (not NFS) drives. E.g., when I train models that rely on data loaders (loading images, etc.), I always move such data to the local, physical drive and make sure my script reads and writes to this faster drive. What's special in this case is that I don't actually have a data loader setup or any setup where I intentionally read and write files locally; it's just a library that holds state.

Long story short, I wonder if there isn't a cleaner alternative: the ability to set a directory where temporary or caching files are stored, e.g. via an environment variable or a parameter. This would allow for a clean setup where all such files live in a specified place (or a default place, if not provided), and that place could be a local disk on a machine with NFS storage. The warning could then be caught and accompanied by something like "It appears deepeval is being executed on a machine with NFS. It's recommended to set the working directory to a local disk in such an environment" or similar.

Another idea would be a DB that can handle multiple threads or processes.
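
A hypothetical sketch of the environment-variable idea, just to illustrate what I mean (DEEPEVAL_TEMP_DIR is a made-up name, not an existing deepeval setting):

import os
from pathlib import Path

def resolve_temp_dir() -> Path:
    # Fall back to the current working directory if the variable is not set.
    base = Path(os.environ.get("DEEPEVAL_TEMP_DIR", "."))
    base.mkdir(parents=True, exist_ok=True)
    return base

# Temp/cache files would then live on whatever disk the user points this at,
# e.g. a local (non-NFS) scratch directory on an HPC node.
temp_file = resolve_temp_dir() / "temp_test_run_data.json"
print(temp_file)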

@Peilun-Li
Contributor

Would it be possible to set the lock to portalocker.LOCK_EX | portalocker.LOCK_NB -- that's actually the default flags parameter used in portalocker ( https://github.com/wolph/portalocker/blob/v2.8.2/portalocker/utils.py#L206 , https://github.com/wolph/portalocker/blob/v2.8.2/portalocker/utils.py#L20 ) -- and see if that works with r+ mode?

Alternatively, yeah, we can omit the timeout here, given it's warned to be ineffective anyway. Plus, based on the logic of portalocker ( https://github.com/wolph/portalocker/blob/v2.8.2/portalocker/utils.py#L215-L221 ), it will default to the same 5s timeout even if we do not provide it explicitly, and we would bypass the warning there.
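
A sketch of the first suggestion (assuming the temp file already exists, since "r+" requires it): keep the default exclusive, non-blocking flags and pass no explicit timeout, which also avoids the warning path.

import portalocker

with portalocker.Lock(
    "temp_test_run_data.json",
    mode="r+",
    flags=portalocker.LOCK_EX | portalocker.LOCK_NB,
) as file:
    data = file.read()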
