Skip to content

[Bug]: TextIO read record twice for large delimited files #29146

@vatanrathi

Description

@vatanrathi

What happened?

Beam TextIO reader reads same record twice for large (~45GB) delimited (.DAT) files . No issues encountered when same file was splitted into smaller files and processed together.
This behaviour is not intermittent or random. Same records appears twice in every run.

wc -l test_tx_20211101.dat
78132206 test_tx_20211101.dat 

record count of 78132206 was also verified loading file in spark which gives same count
Note: records are delimited with CRLF (^M$)

pipeline to read file:

input.apply("Read lines", TextIO.read().from(<s3 file path)
.apply("timestamp", WithTimestamp.of(e -> Instant.now()))
.setCoder(StringUtf8Coder.of())
.apply(Combine.globally(Count.<String>combineFn()).withoutDefaults())
.apply(ParDo.of(new LogOutput<>("Pcollection count: ")))

Same code was executed with beam 2.23.0 + aws sdk1 & beam 2.48.0 + aws sdk2

beam 2.23.0 + aws sdk1 --> Pcollection count: 78132206
beam 2.48.0 + aws sdk2 --> Pcollection count: 78132208

To eliminate the possibility of a record being corrupted , I deleted first (of 2) record which appeared twice.. Now the next record to it appeared twice and second duplicate record also shift down to next one .. this implies that something went wrong while read at certain byte offset

**Scenerio ** records at line # 33554433 & 67108865 appeared twice
Line# item
1
.
.
33554433 abc
.
.
.
67108865 xyz
.
.
78132206

In second test, when record at line # 33554433 was deleted, now the records at 33554433 (which was at line# 33554434) and 67108865 (which was at line# 67108866) were read twice

It seems records are read successfully until records# 33554432 and then record#33554433 read twice
Note: 67108865 - 33554433 = 33554432

Record#33554433 --> Line 33554433 of 78132206 ; Word 871941986 of 2028601968; Byte 21374174185 of 49770215222
Record#67108865 --> Line 67108865 of 78132206 ; Word 1743499355 of 2028601968; Byte 42748346369 of 49770215222

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions