Source File: Fix OOM; read Excel files in chunks #25575

Merged
artem1205 merged 14 commits into master from artem1205/source-file-OC-1871-OOM-Excel
May 1, 2023

Conversation

artem1205
Collaborator

What

Resolving https://github.com/airbytehq/oncall/issues/1871

How

Read Excel files in chunks using openpyxl; pandas is still used for non-xlsx formats.
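
For illustration, chunked reading could look roughly like the sketch below. It is a minimal example and not the connector's actual code: the names `iter_chunks` and `read_xlsx_in_chunks` and the default `chunk_size` are assumptions made up for this example. It relies on openpyxl's read-only mode, which streams rows rather than building the whole sheet in memory.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List, Tuple

Row = Tuple[Any, ...]


def iter_chunks(rows: Iterable[Any], chunk_size: int) -> Iterator[List[Any]]:
    """Yield consecutive lists of at most chunk_size items from rows."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return  # input exhausted
        yield chunk


def read_xlsx_in_chunks(path: str, chunk_size: int = 500) -> Iterator[List[Row]]:
    """Hypothetical helper: yield each sheet's rows in chunks."""
    # Imported lazily so iter_chunks stays usable without openpyxl installed.
    from openpyxl import load_workbook

    # read_only=True makes openpyxl stream rows lazily.
    work_book = load_workbook(path, read_only=True, data_only=True)
    try:
        for sheetname in work_book.sheetnames:
            work_sheet = work_book[sheetname]
            yield from iter_chunks(work_sheet.iter_rows(values_only=True), chunk_size)
    finally:
        # Read-only workbooks keep the file handle open until closed.
        work_book.close()
```

Only one chunk of rows is held at a time, so peak memory depends on `chunk_size` rather than on the size of the file.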

Recommended reading order

  1. y.python

🚨 User Impact 🚨

No breaking changes.

Pre-merge Checklist

Expand the relevant checklist and delete the others.

Updating a connector

Community member or Airbyter

  • Grant edit access to maintainers (instructions)

  • Secrets in the connector's spec are annotated with airbyte_secret

  • Unit & integration tests added and passing. Community members, please provide proof of success locally, e.g. a screenshot or copy-paste of unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For Java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.

  • Code reviews completed

  • Connector version has been incremented

  • Documentation updated

    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • Changelog updated in docs/integrations/<source or destination>/<name>.md with an entry for the new version. See changelog example
  • PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • If new credentials are required for use in CI, add them to GSM. Instructions.
  • /test connector=connectors/<name> command is passing
  • New Connector version released on Dockerhub and connector version bumped by running the /publish command described here

@octavia-squidington-iii octavia-squidington-iii added the area/documentation Improvements or additions to documentation label Apr 26, 2023
for sheetname in work_book.sheetnames:
    work_sheet = work_book[sheetname]
    data = work_sheet.values
    cols = next(data)
Contributor

An empty generator will raise StopIteration on next(data). Is that OK?

Collaborator Author

Updated.
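
As a sketch of one way to address the StopIteration concern raised above (the actual change made in this PR is not shown in the thread), the header read can pass a default to `next()` so an empty sheet yields `None` instead of raising. The helper name `split_header` and the sample sheets are hypothetical; plain lists stand in for `work_sheet.values`.

```python
from typing import Any, Iterable, Iterator, Optional, Tuple

Row = Tuple[Any, ...]


def split_header(data: Iterable[Row]) -> Tuple[Optional[Row], Iterator[Row]]:
    """Return (header, remaining_rows); header is None when there are no rows."""
    it = iter(data)
    # next(it) alone would raise StopIteration on an empty sheet;
    # the second argument is returned instead when the iterator is exhausted.
    cols = next(it, None)
    return cols, it


# Hypothetical stand-ins for work_sheet.values on two sheets.
sheets = {"filled": [("id", "name"), (1, "a")], "empty": []}

for sheetname, values in sheets.items():
    cols, rows = split_header(values)
    if cols is None:
        continue  # nothing to read from a sheet without a header row
    print(sheetname, cols, list(rows))
```

This matters more inside generator functions: since Python 3.7 (PEP 479), a StopIteration escaping a generator body is converted to RuntimeError rather than silently ending iteration.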

Collaborator Author

artem1205 commented Apr 27, 2023

/test connector=connectors/source-file

🕑 connectors/source-file https://github.com/airbytehq/airbyte/actions/runs/4817879964
❌ connectors/source-file https://github.com/airbytehq/airbyte/actions/runs/4817879964
🐛 https://gradle.com/s/ewcmcrwbzf26a

Build Failed

Test summary info:

=========================== short test summary info ============================
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/plugin.py:63: Skipping TestIncremental.test_two_sequential_reads: Incremental syncs are not supported on this connector.
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/tests/test_core.py:100: The previous and actual specifications are identical.
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/tests/test_core.py:578: The previous and actual discovered catalogs are identical.
======================== 35 passed, 3 skipped in 46.45s ========================
=========================== short test summary info ============================
FAILED integration_tests/file_formats_test.py::test_local_file_read[excel-xls-8-50-demo]
FAILED integration_tests/file_formats_test.py::test_raises_file_wrong_format[csv-csv-excel-demo0]
FAILED integration_tests/file_formats_test.py::test_raises_file_wrong_format[csv-csv-excel-demo1]
FAILED integration_tests/file_formats_test.py::test_raises_file_wrong_format[jsonl-jsonl-excel-jsonl_nested]
================== 4 failed, 36 passed, 2 warnings in 23.93s ===================

@artem1205 artem1205 requested a review from grubberr April 28, 2023 09:35
Collaborator Author

artem1205 commented Apr 28, 2023

/test connector=connectors/source-file

🕑 connectors/source-file https://github.com/airbytehq/airbyte/actions/runs/4829213026
❌ connectors/source-file https://github.com/airbytehq/airbyte/actions/runs/4829213026
🐛

Collaborator Author

artem1205 commented Apr 30, 2023

/test connector=connectors/source-file

🕑 connectors/source-file https://github.com/airbytehq/airbyte/actions/runs/4844572957
✅ connectors/source-file https://github.com/airbytehq/airbyte/actions/runs/4844572957
Python tests coverage:

Name                      Stmts   Miss  Cover
---------------------------------------------
source_file/__init__.py       2      0   100%
source_file/utils.py         13      1    92%
source_file/source.py        81      7    91%
source_file/client.py       317     58    82%
---------------------------------------------
TOTAL                       413     66    84%
Name                      Stmts   Miss  Cover
---------------------------------------------
source_file/__init__.py       2      0   100%
source_file/client.py       317     48    85%
source_file/utils.py         13      8    38%
source_file/source.py        81     60    26%
---------------------------------------------
TOTAL                       413    116    72%

Build Passed

Test summary info:

=========================== short test summary info ============================
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/plugin.py:63: Skipping TestIncremental.test_two_sequential_reads: Incremental syncs are not supported on this connector.
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/tests/test_core.py:100: The previous and actual specifications are identical.
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/tests/test_core.py:578: The previous and actual discovered catalogs are identical.
======================== 35 passed, 3 skipped in 51.30s ========================

Collaborator Author

artem1205 commented May 1, 2023

/publish connector=connectors/source-file

🕑 Publishing the following connectors:
connectors/source-file
https://github.com/airbytehq/airbyte/actions/runs/4850979954


Connector                Version   Did it publish?   Were definitions generated?
connectors/source-file   0.3.1

If you have connectors that published successfully but failed definition generation, follow step 4 here ▶️

@artem1205 artem1205 merged commit a3fa2b1 into master May 1, 2023
18 of 19 checks passed
@artem1205 artem1205 deleted the artem1205/source-file-OC-1871-OOM-Excel branch May 1, 2023 12:45
marcosmarxm pushed a commit to natalia-miinto/airbyte that referenced this pull request Jun 8, 2023
* Source File: Use openpyxl to read excel files in chunks

* Source File: bump version

* Source File: update docs

* Source File Secure: bump version

* Source File Secure: add docstring

* Source File: use column names from reader options

* Source File: refactor; use pandas for non xlsx formats

* Source File: reformat

* auto-bump connector version

---------

Co-authored-by: Octavia Squidington III <octavia-squidington-iii@users.noreply.github.com>
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation connectors/source/file connectors/source/file-secure
3 participants