Labels
area/connectors (Connector related issues), autoteam, connectors/source/gcs, team/extensibility, type/bug (Something isn't working)
Description
Connector Name
source-gcs
Connector Version
0.8.14
At what step did the error happen?
During the sync
Relevant information
Connector: airbyte/source-gcs
Versions tested: 0.8.x, 0.7.x, 0.6.0
Behavior: Consistently reproducible across all versions.
🧠 Issue Summary
When syncing files from a GCS bucket using signed URLs and the unstructured parser (e.g. for PDF files), the connector fails with a TypeError because file.name is None when the parser tries to detect the file type. This happens even though the file is present in the bucket and matched by the glob pattern (*.pdf).
🔒 GCS URL (Redacted)
This occurred while attempting to parse:
https://storage.googleapis.com/<redacted-path>/Company_profile.pdf?... (signed URL)
🧪 Reproduction Steps
- Use airbyte/source-gcs with format: unstructured.
- Provide a glob pattern like *.pdf.
- Files are fetched from GCS using signed URLs.
- Sync fails during the parsing phase due to file.name being None.
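The failure mode can be reproduced in isolation, without the connector or a GCS bucket. The sketch below is only an illustration: NamelessHandle and extension_of are hypothetical names, and the only line taken from the report is the os.path.splitext(file.name) call shown in the stack trace.

```python
import io
import os


class NamelessHandle(io.BytesIO):
    """Hypothetical stand-in for a remote file handle whose name is None."""

    name = None


def extension_of(file_handle):
    # Same call as the failing line in the traceback:
    # _, extension = os.path.splitext(file.name)
    _, extension = os.path.splitext(file_handle.name)
    return extension


def reproduce():
    """Return the TypeError message, or None if no error was raised."""
    handle = NamelessHandle(b"%PDF-1.4 dummy content")
    try:
        extension_of(handle)
    except TypeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(reproduce())
```

Any file-like object with name set to None fails the same way, which suggests the problem is in how the handle is constructed rather than in the PDF itself.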
Relevant log output
ERROR i.a.w.i.VersionedAirbyteStreamFactory(internalLog):308 - Error parsing record. This could be due to a mismatch between the config's file type and the actual file type, or because the file or record is not parseable. stream=document-stream file=https://storage.googleapis.com/<REDACTED_PATH>/documents/Company_profile.pdf?<PRESIGNED_URL> line_no=0 n_skipped=0
Stack Trace: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/airbyte_cdk/sources/file_based/stream/default_file_based_stream.py", line 98, in read_records_from_slice
    for record in parser.parse_records(self.config, file, self.stream_reader, self.logger, schema):
  File "/usr/local/lib/python3.10/site-packages/airbyte_cdk/sources/file_based/file_types/unstructured_parser.py", line 130, in parse_records
    markdown = self._read_file(file_handle, file, format, logger)
  File "/usr/local/lib/python3.10/site-packages/airbyte_cdk/sources/file_based/file_types/unstructured_parser.py", line 158, in _read_file
    filetype = self._get_filetype(file_handle, remote_file)
  File "/usr/local/lib/python3.10/site-packages/airbyte_cdk/sources/file_based/file_types/unstructured_parser.py", line 320, in _get_filetype
    type_based_on_content = detect_filetype(file=file)
  File "/usr/local/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 264, in detect_filetype
    _, extension = os.path.splitext(file.name)
  File "/usr/local/lib/python3.10/posixpath.py", line 118, in splitext
    p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType
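One possible guard, sketched here as an assumption and not as the CDK's or unstructured's actual fix, is to fall back to the file name embedded in the signed URL when the handle has no usable name. The helpers filename_from_signed_url and safe_extension are hypothetical, and the bucket path in the usage example is made up.

```python
import os
from urllib.parse import urlsplit


def filename_from_signed_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring the query string."""
    return os.path.basename(urlsplit(url).path)


def safe_extension(file_handle, fallback_url: str) -> str:
    """Return the file extension, falling back to the URL if name is unusable."""
    name = getattr(file_handle, "name", None)
    if not isinstance(name, str) or not name:
        # Hypothetical fallback: signed URLs carry the object name in the path.
        name = filename_from_signed_url(fallback_url)
    _, extension = os.path.splitext(name)
    return extension.lower()


if __name__ == "__main__":
    class Handle:
        name = None  # mimics the handle from the traceback

    # Made-up URL for illustration only.
    url = "https://storage.googleapis.com/bucket/documents/Company_profile.pdf?sig=abc"
    print(safe_extension(Handle(), url))
```

Stripping the query string first matters because the signature parameters of a signed URL would otherwise end up in the detected extension.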