[BUG] Sharepoint ingestion fails with remote end closed connection without response #70

Open
mawandm opened this issue May 4, 2024 · 0 comments
Labels
API (Backend API), bug (Something isn't working)

mawandm commented May 4, 2024

Nesis version

0.1.0

Describe the bug

During a long-running SharePoint ingestion process, the following error

[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] 2024-05-04 01:44:36.695 [WARNING ] nesis.api.core.document_loaders.sharepoint - Error when getting and ingesting file Stock Market Wizards (Jack D. Schwager) (z-lib.org).pdf - ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Generating embeddings:   0%|          | 0/14 [00:00<?, ?it/s]Killed
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] 2024-05-04 01:44:37.469 [ERROR   ] nesis.api.core.document_loaders.sharepoint - Error fetching and updating documents - Error: (None, None, "401 Client Error: Unauthorized for url: https://site.sharepoint.com/sites/nesis-test/_api/Web/GetFolderById('d5bc341a-8557-4c67-8c40-1cb0e085def9')?$select=Files&$expand=Files")
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] Traceback (most recent call last):
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]   File "/app/.venv/lib/python3.11/site-packages/office365/runtime/client_request.py", line 38, in execute_query
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]     response.raise_for_status()
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]   File "/app/.venv/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]     raise HTTPError(http_error_msg, response=self)
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://site.sharepoint.com/sites/nesis-test/_api/Web/GetFolderById('d5bc341a-8557-4c67-8c40-1cb0e085def9')?$select=Files&$expand=Files
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] During handling of the above exception, another exception occurred:
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] Traceback (most recent call last):
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]   File "/app/nesis/api/core/document_loaders/sharepoint.py", line 117, in _sync_sharepoint_documents
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]     _process_folder_files(
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]   File "/app/nesis/api/core/document_loaders/sharepoint.py", line 168, in _process_folder_files
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]     _files = folder.get_files(False).execute_query()
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]   File "/app/.venv/lib/python3.11/site-packages/office365/runtime/client_object.py", line 52, in execute_query
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]     self.context.execute_query()
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]   File "/app/.venv/lib/python3.11/site-packages/office365/runtime/client_runtime_context.py", line 183, in execute_query
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]     self.pending_request().execute_query(qry)
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]   File "/app/.venv/lib/python3.11/site-packages/office365/runtime/client_request.py", line 42, in execute_query
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]     raise ClientRequestException(*e.args, response=e.response)
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] office365.runtime.client_request_exception.ClientRequestException: (None, None, "401 Client Error: Unauthorized for url: https://site.sharepoint.com/sites/nesis-test/_api/Web/GetFolderById('d5bc341a-8557-4c67-8c40-1cb0e085def9')?$select=Files&$expand=Files")
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] 2024-05-04 01:44:37.503 [INFO    ] apscheduler.executors.default - Job "ingest_datasource (trigger: date[2024-05-04 00:19:28 UTC], next run at: 2024-05-04 00:19:28 UTC)" executed successfully

shows in the API service logs.

To reproduce

  1. Create a SharePoint datasource.
  2. Add multiple large documents to the SharePoint site.
  3. Run the ingestion. After a while, the API service logs show a 401 Client Error: Unauthorized for url...

Expected behavior

The ingestion should run to completion without interruption. It seems the SharePoint client authentication needs to be refreshed during long-running ingestions.
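
One possible shape for such a refresh is sketched below. This is not the Nesis implementation, only a minimal illustration under stated assumptions: `call_with_reauth`, `looks_unauthorized`, `operation` and `reauthenticate` are hypothetical names, and the sketch assumes the raised exception exposes the HTTP response as `.response` with a `status_code` (the traceback shows `ClientRequestException` being raised with `response=e.response`, but this has not been verified against the library).

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def looks_unauthorized(exc: Exception) -> bool:
    """Return True if the exception appears to carry an HTTP 401 response.

    Assumption: the exception has a `.response` attribute with `.status_code`.
    """
    response = getattr(exc, "response", None)
    return getattr(response, "status_code", None) == 401


def call_with_reauth(
    operation: Callable[[], T],
    reauthenticate: Callable[[], None],
    is_unauthorized: Callable[[Exception], bool] = looks_unauthorized,
    max_attempts: int = 2,
) -> T:
    """Run `operation`; on a 401-style failure, refresh credentials and retry.

    Both callables are hypothetical hooks supplied by the caller.
    Any other exception, or a 401 on the final attempt, is re-raised.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            # Give up on the last attempt or on errors unrelated to auth.
            if attempt == max_attempts or not is_unauthorized(exc):
                raise
            reauthenticate()
    raise AssertionError("unreachable")
```

In the loader, `operation` could wrap the call that fails in the traceback, `folder.get_files(False).execute_query()`, and `reauthenticate` would rebuild or re-authorize the SharePoint client. How the client is constructed in Nesis is not shown in this issue, so that part is left to the caller.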

Screenshots

No response

Additional context

No response

mawandm added the bug (Something isn't working) and API (Backend API) labels on May 4, 2024