Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Schema validation does not allow for null values in case of Optional typed params #2589

Closed
tstadel opened this issue May 23, 2022 · 5 comments · Fixed by #2980
Closed

Schema validation does not allow for null values in case of Optional typed params #2589

tstadel opened this issue May 23, 2022 · 5 comments · Fixed by #2980
Assignees
Labels
Contributions wanted! Looking for external contributions topic:save/load type:bug Something isn't working

Comments

@tstadel
Copy link
Member

tstadel commented May 23, 2022

Describe the bug
Setting optional params to null in YAML (or to None when it's been converted to a python dict) will cause the schema validation to fail: see #2586

Error message

PipelineSchemaError: Node of type XYZ found, but it failed validation. Possible causes:
 - The node is missing some mandatory parameter
 - Wrong indentation of some parameter in YAML
See the stacktrace for more information.```

Expected behavior
null values are accepted for Optional typed params.

To Reproduce
Load following pipeline:

version: ignore

components:
  - name: ESRetriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      custom_query: null
  - name: DocumentStore
    type: ElasticsearchDocumentStore
    params:
      index: haystack_test
      label_index: haystack_test_label


pipelines:
  - name: query_pipeline
    nodes:
      - name: ESRetriever
        inputs: [Query]
@tstadel
Copy link
Member Author

tstadel commented May 23, 2022

Extended stacktrace:

>>> Pipeline.load_from_yaml('/home/tstad/git/haystack/test/samples/pipeline/bad-pipeline.yaml')
<haystack.pipelines.base.Pipeline object at 0x7fcfcd78d810>
>>> Pipeline.load_from_yaml('/home/tstad/git/haystack/test/samples/pipeline/bad-pipeline.yaml')
INFO - haystack.pipelines.config -  Missing definition for node of type BM25Retriever. Looking into local classes...
INFO - haystack.nodes._json_schema -  Creating schema for 'BM25Retriever'
INFO - haystack.nodes._json_schema -  Added definition for BM25Retriever
Traceback (most recent call last):
  File "/home/tstad/git/haystack/haystack/pipelines/config.py", line 259, in validate_schema
    Draft7Validator(schema).validate(instance=pipeline_config)
  File "/home/tstad/miniconda3/envs/haystack-dev/lib/python3.7/site-packages/jsonschema/validators.py", line 254, in validate
    raise error
jsonschema.exceptions.ValidationError: {'name': 'ESRetriever', 'type': 'BM25Retriever', 'params': {'document_store': 'DocumentStore', 'custom_query': None}} is not valid under any of the given schemas

Failed validating 'anyOf' in schema['properties']['components']['items']:
    {'anyOf': [{'$ref': '#/definitions/DeepsetCloudDocumentStoreComponent'},
               {'$ref': '#/definitions/ElasticsearchDocumentStoreComponent'},
               {'$ref': '#/definitions/FAISSDocumentStoreComponent'},
               {'$ref': '#/definitions/GraphDBKnowledgeGraphComponent'},
               {'$ref': '#/definitions/InMemoryDocumentStoreComponent'},
               {'$ref': '#/definitions/Milvus2DocumentStoreComponent'},
               {'$ref': '#/definitions/OpenDistroElasticsearchDocumentStoreComponent'},
               {'$ref': '#/definitions/OpenSearchDocumentStoreComponent'},
               {'$ref': '#/definitions/PineconeDocumentStoreComponent'},
               {'$ref': '#/definitions/SQLDocumentStoreComponent'},
               {'$ref': '#/definitions/WeaviateDocumentStoreComponent'},
               {'$ref': '#/definitions/AzureConverterComponent'},
               {'$ref': '#/definitions/BM25RetrieverComponent'},
               {'$ref': '#/definitions/CrawlerComponent'},
               {'$ref': '#/definitions/DensePassageRetrieverComponent'},
               {'$ref': '#/definitions/Docs2AnswersComponent'},
               {'$ref': '#/definitions/DocxToTextConverterComponent'},
               {'$ref': '#/definitions/ElasticsearchFilterOnlyRetrieverComponent'},
               {'$ref': '#/definitions/ElasticsearchRetrieverComponent'},
               {'$ref': '#/definitions/EmbeddingRetrieverComponent'},
               {'$ref': '#/definitions/EntityExtractorComponent'},
               {'$ref': '#/definitions/EvalAnswersComponent'},
               {'$ref': '#/definitions/EvalDocumentsComponent'},
               {'$ref': '#/definitions/FARMReaderComponent'},
               {'$ref': '#/definitions/FileTypeClassifierComponent'},
               {'$ref': '#/definitions/FilterRetrieverComponent'},
               {'$ref': '#/definitions/ImageToTextConverterComponent'},
               {'$ref': '#/definitions/JoinAnswersComponent'},
               {'$ref': '#/definitions/JoinDocumentsComponent'},
               {'$ref': '#/definitions/MarkdownConverterComponent'},
               {'$ref': '#/definitions/PDFToTextConverterComponent'},
               {'$ref': '#/definitions/PDFToTextOCRConverterComponent'},
               {'$ref': '#/definitions/ParsrConverterComponent'},
               {'$ref': '#/definitions/PreProcessorComponent'},
               {'$ref': '#/definitions/QuestionGeneratorComponent'},
               {'$ref': '#/definitions/RAGeneratorComponent'},
               {'$ref': '#/definitions/RCIReaderComponent'},
               {'$ref': '#/definitions/RouteDocumentsComponent'},
               {'$ref': '#/definitions/SentenceTransformersRankerComponent'},
               {'$ref': '#/definitions/Seq2SeqGeneratorComponent'},
               {'$ref': '#/definitions/SklearnQueryClassifierComponent'},
               {'$ref': '#/definitions/TableReaderComponent'},
               {'$ref': '#/definitions/TableTextRetrieverComponent'},
               {'$ref': '#/definitions/Text2SparqlRetrieverComponent'},
               {'$ref': '#/definitions/TextConverterComponent'},
               {'$ref': '#/definitions/TfidfRetrieverComponent'},
               {'$ref': '#/definitions/TikaConverterComponent'},
               {'$ref': '#/definitions/TransformersDocumentClassifierComponent'},
               {'$ref': '#/definitions/TransformersQueryClassifierComponent'},
               {'$ref': '#/definitions/TransformersReaderComponent'},
               {'$ref': '#/definitions/TransformersSummarizerComponent'},
               {'$ref': '#/definitions/TransformersTranslatorComponent'},
               {'$ref': '#/definitions/BM25RetrieverComponent'}]}

On instance['components'][0]:
    {'name': 'ESRetriever',
     'params': {'custom_query': None, 'document_store': 'DocumentStore'},
     'type': 'BM25Retriever'}

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/tstad/git/haystack/haystack/pipelines/base.py", line 1334, in load_from_yaml
    strict_version_check=strict_version_check,
  File "/home/tstad/git/haystack/haystack/pipelines/base.py", line 1391, in load_from_config
    validate_config(pipeline_config, strict_version_check=strict_version_check)
  File "/home/tstad/git/haystack/haystack/pipelines/config.py", line 203, in validate_config
    validate_schema(pipeline_config=pipeline_config, strict_version_check=strict_version_check)
  File "/home/tstad/git/haystack/haystack/pipelines/config.py", line 283, in validate_schema
    ) from validation
haystack.errors.PipelineSchemaError: Node of type BM25Retriever found, but it failed validation. Possible causes:
 - The node is missing some mandatory parameter
 - Wrong indentation of some parameter in YAML
See the stacktrace for more information.

@tstadel tstadel added topic:save/load type:bug Something isn't working labels May 23, 2022
@tstadel
Copy link
Member Author

tstadel commented May 23, 2022

Maybe extending the param types with an anyOf like in https://stackoverflow.com/questions/22565005/json-schema-validate-a-number-or-null-value could help.

@ZanSara
Copy link
Contributor

ZanSara commented May 23, 2022

Note: the suggestion above is valid, so in order to make optional parameter accept a null, we should write the schema as:

"anyOf": [
  {"type": "string"},
  {"type": "null"}
]

However, the node's parameter's schema is currently build entirely by Pydantic:

params_schema = model.schema()

So at a first glance it seems like we can't easily change the output by just staying within Haystack's code.

Next steps:

  1. Dig more into this method call to see if we can influence it in any way to return the above signature, or modify the output downstream somehow,
  2. See if this is worth a PR on Pydantic in case the issue can only be solved with monkeypatching.

@ZanSara ZanSara added the Contributions wanted! Looking for external contributions label Jul 19, 2022
@anakin87
Copy link
Member

I tried to investigate this problem, but it's not easy 😄.

  1. First, in the YAML example reported by @tstadel, custom_query is null but now, after Fix YAML validation for ElasticsearchDocumentStore.custom_query #2789, this parameter has a special validation, which doesn't raise the PipelineSchemaError.
    So, I propose another YAML configuration for experimenting (I hope that it is correct):
version: ignore

components:
  - name: MyReader
    type: FARMReader
    params:
      model_name_or_path: deepset/tinyroberta-squad2
      model_version: null         

pipelines:
  - name: example_pipeline
    nodes:
      - name: MyReader
        inputs: [Query]  
  1. In the pydantic community, this problem is old and difficult; they want to solve this issue in v2 (probably by the end of 2022). More information and some non-definitive monkey patches here: JSON Schema Nullable Required Syntax? pydantic/pydantic#1270

@ZanSara
Copy link
Contributor

ZanSara commented Jul 21, 2022

Hey @anakin87! Thanks for jumping it! 😊

About point 1 you're right: custom_query is no more a good experiment subject here. Any other Optional[Any] value should reproduce this bug though, so your example seems to match the issue.

About point 2: Wow the issue you link is really deep. I did not expect this to be so rooted into Pydantic... I think we could live with the monkeypatch while waiting for v2 to be out. The JSON schema generation in Haystack is not pretty anyway 😄 So I'd propose the following course of action, if you're up for it:

  1. Write a test on this issue that fails with the current code
  2. Add the monkeypatch in haystack/nodes/_json_schema.py (you'll notice is not the first one 😅)
  3. See if the test pass and adjust at need.
  4. Once it's all green again, add a big disclaimed over the monkeypatch linking to the issue, and one more comment over the test mentioning that it should be removed (or at least reviewed) during the migration to Pydantic 2
    • If you want to make it really fancy, you can even make the test fail if the Pydantic version is above 2.0, so we will be sure not to forget that 😁 But that's not necessary, a comment would suffice too

If you get stuck do not insist too much though. For as weird as it is, this issue is not mission critical. If we have to wait for Pydantic v2, so be it.

Have fun!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Contributions wanted! Looking for external contributions topic:save/load type:bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants