Skip to content

Roman/fix ingest async connectors#3210

Merged
rbiseck3 merged 17 commits intomainfrom
roman/fix-ingest-async-connectors
Jun 17, 2024
Merged

Roman/fix ingest async connectors#3210
rbiseck3 merged 17 commits intomainfrom
roman/fix-ingest-async-connectors

Conversation

@rbiseck3
Copy link
Contributor

@rbiseck3 rbiseck3 commented Jun 14, 2024

Description

Choosing to use async needs to be very careful because if a connector is set to use async, the pipeline will not fan out the inputs via multiprocessing but instead it will be limited to run in a single process under the assumption it has more benefit from async due to heavy network traffic. This means the exact same code that is not optimized for async and is blocking will force the pipeline to perform worse than simply never marking the connector to use async since the pipeline will fan that out using multiprocessing.

All connectors and processes in the pipeline we revisited to make sure this criteria was met and updated accordingly:

  • Currently the unstructured client does not support making requests async, so this was moved over to use multiprocessing
  • fsspec connector was updated to use the async client from the fsspec library. This also required that the client be a @property fetched on demand, otherwise the client would break the multiprocessing pool since it maintains a thread lock and that can't be pickled when the fsspec connector doesn't support async.
  • elasticsearch was also updated to use the async client
  • weaviate only recently came out with async support in their SDK at a version that is higher than we can use in the open source repo, so a TODO was left but otherwise moved to use multiprocessing
  • all underlying embedders don't use async to embedder step must be multiprocessing for now. TODO left to update underlying embedder classes to optionally support async.
  • Chunking parameters were not accurately being passed through from cli to chunker params, this was fixed

@rbiseck3 rbiseck3 force-pushed the roman/fix-ingest-async-connectors branch from 25e8082 to bba303b Compare June 14, 2024 12:34
@rbiseck3 rbiseck3 force-pushed the roman/fix-ingest-async-connectors branch from b20032a to 3a87282 Compare June 14, 2024 14:36
@rbiseck3 rbiseck3 enabled auto-merge June 14, 2024 17:30
raise SourceConnectionNetworkError(
f"failed to download file {file_data.identifier}"
)
async for result in async_scan(
Copy link
Contributor

@potter-potter potter-potter Jun 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now get

Unclosed connector
connections: ['[(<aiohttp.client_proto.ResponseHandler object at 0x13658ee60>, 271841.176414041)...

in the logging. Not sure if we can address this. You would think it would close out the connector when it is done automatically?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look into this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved in latest commits

@rbiseck3 rbiseck3 requested a review from potter-potter June 17, 2024 14:44
@rbiseck3 rbiseck3 force-pushed the roman/fix-ingest-async-connectors branch from 23e64de to c822b7c Compare June 17, 2024 14:45
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a good reason that these expected-structured-output files for chunking have changed so much?

Copy link
Contributor

@potter-potter potter-potter Jun 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind.! The parameters were not being passed through I assume:)

Copy link
Contributor

@potter-potter potter-potter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good.

Will be interesting to see if it passes this new (2 months old) expected-text-output test.

expected-text-output/**

@rbiseck3 rbiseck3 added this pull request to the merge queue Jun 17, 2024
Merged via the queue into main with commit d876a38 Jun 17, 2024
@rbiseck3 rbiseck3 deleted the roman/fix-ingest-async-connectors branch June 17, 2024 18:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants