[source-gcs] file.name is None causes crash when using unstructured parser (PDFs) #58103

@ariefkahfi

Description

Connector Name

source-gcs

Connector Version

0.8.14

At what step did the error happen?

During the sync

Relevant information

Connector: airbyte/source-gcs
Versions tested: 0.8.x, 0.7.x, 0.6.0
Behavior: Consistently reproducible across all versions.

🧠 Issue Summary

When syncing files from a GCS bucket using signed URLs and the unstructured parser (e.g. for PDF files), the connector fails with a TypeError because file.name is None when the unstructured library's detect_filetype calls os.path.splitext(file.name). This happens even though the file is present in the bucket and matched by the glob pattern (*.pdf).

🔒 GCS URL (Redacted)

This occurred while attempting to parse:

https://storage.googleapis.com/<redacted-path>/Company_profile.pdf?...  (signed URL)

🧪 Reproduction Steps

  1. Use airbyte/source-gcs with format: unstructured.
  2. Provide a glob pattern like *.pdf.
  3. Files are fetched from GCS using signed URLs.
  4. The sync fails during the parsing phase because file.name is None.
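
For reference, steps 1 and 2 could look roughly like the sketch below. This is a sketch under assumptions, not the exact config used: the key names (`name`, `globs`, `format`, `filetype`) assume the file-based CDK stream layout, the stream name is taken from the log output, and `fnmatch` is only a stand-in for the connector's own glob matcher, whose semantics may differ.

```python
from fnmatch import fnmatch

# Hypothetical stream configuration; key names are assumed, not copied
# from the actual connector config used in this report.
stream_config = {
    "name": "document-stream",
    "globs": ["*.pdf"],
    "format": {"filetype": "unstructured"},
}


def matches_any_glob(object_name, globs):
    """Return True if the object name matches at least one glob pattern."""
    return any(fnmatch(object_name, pattern) for pattern in globs)


# The file from this report is selected by the pattern, so the failure
# is not a matching problem.
print(matches_any_glob("Company_profile.pdf", stream_config["globs"]))
```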

Relevant log output

ERROR   i.a.w.i.VersionedAirbyteStreamFactory(internalLog):308 - Error parsing record. This could be due to a mismatch between the config's file type and the actual file type, or because the file or record is not parseable. stream=document-stream file=https://storage.googleapis.com/<REDACTED_PATH>/documents/Company_profile.pdf?<PRESIGNED_URL> line_no=0 n_skipped=0
Stack Trace: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/airbyte_cdk/sources/file_based/stream/default_file_based_stream.py", line 98, in read_records_from_slice
    for record in parser.parse_records(self.config, file, self.stream_reader, self.logger, schema):
  File "/usr/local/lib/python3.10/site-packages/airbyte_cdk/sources/file_based/file_types/unstructured_parser.py", line 130, in parse_records
    markdown = self._read_file(file_handle, file, format, logger)
  File "/usr/local/lib/python3.10/site-packages/airbyte_cdk/sources/file_based/file_types/unstructured_parser.py", line 158, in _read_file
    filetype = self._get_filetype(file_handle, remote_file)
  File "/usr/local/lib/python3.10/site-packages/airbyte_cdk/sources/file_based/file_types/unstructured_parser.py", line 320, in _get_filetype
    type_based_on_content = detect_filetype(file=file)
  File "/usr/local/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 264, in detect_filetype
    _, extension = os.path.splitext(file.name)
  File "/usr/local/lib/python3.10/posixpath.py", line 118, in splitext
    p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType
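
The last two frames of the traceback can be reproduced in isolation. The sketch below is not the connector's or the CDK's code: `NamelessHandle` is a hypothetical stand-in for whatever file handle the stream reader returns for a signed URL, and `extension_of` mimics only the `os.path.splitext(file.name)` call shown in the trace. `NamedHandle` and `name_from_uri` illustrate one possible workaround shape, assuming the caller knows the object's URI; the bucket path and query parameter in the example URL are made up.

```python
import io
import os
from urllib.parse import urlparse


class NamelessHandle(io.BytesIO):
    """Hypothetical stand-in for a remote file handle whose .name is None."""

    name = None


def extension_of(file):
    # Mirrors the failing call in the traceback: os.path.splitext(file.name)
    _, extension = os.path.splitext(file.name)
    return extension


def name_from_uri(uri):
    # Drop the query string (the signature parameters) and keep the last
    # path segment, so the extension survives.
    return os.path.basename(urlparse(uri).path)


class NamedHandle(io.BytesIO):
    """Hypothetical handle that always carries a usable .name."""

    def __init__(self, data, uri):
        super().__init__(data)
        self.name = name_from_uri(uri)


if __name__ == "__main__":
    try:
        extension_of(NamelessHandle(b"%PDF-1.7"))
    except TypeError as exc:
        print(f"TypeError: {exc}")

    # Made-up signed URL, for illustration only.
    uri = "https://storage.googleapis.com/example-bucket/documents/Company_profile.pdf?X-Goog-Signature=abc"
    print(extension_of(NamedHandle(b"%PDF-1.7", uri)))
```

If the handle carried a name derived from the URI path rather than None, the extension lookup in the trace would have a string to work with. Whether the fix belongs in the stream reader, the parser, or a guard around `detect_filetype` is left open here.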
