Description
What happened?
The Beam TextIO reader reads the same record twice for a large (~45 GB) delimited (.DAT) file. No issues are encountered when the same file is split into smaller files and processed together.
This behaviour is not intermittent or random: the same records appear twice in every run.
wc -l test_tx_20211101.dat
78132206 test_tx_20211101.dat
The record count of 78132206 was also verified by loading the file in Spark, which gives the same count.
Note: records are delimited with CRLF (^M$)
Pipeline to read the file:

input.apply("Read lines", TextIO.read().from(<s3 file path>))
    .apply("timestamp", WithTimestamps.of(e -> Instant.now()))
    .setCoder(StringUtf8Coder.of())
    .apply(Combine.globally(Count.<String>combineFn()).withoutDefaults())
    .apply(ParDo.of(new LogOutput<>("Pcollection count: ")));
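For reference, here is a minimal self-contained sketch of the counting pipeline above. This is a reconstruction, not the original code: the S3 path is a placeholder, the LogOutput DoFn is a hypothetical stand-in for the one used in the real job, and the S3 filesystem registration (beam-sdks-java-io-amazon-web-services2 or the older aws module) is assumed to be on the classpath.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.WithTimestamps;
import org.joda.time.Instant;

public class CountLines {

  // Hypothetical logging DoFn standing in for the LogOutput used in the report.
  static class LogOutput<T> extends DoFn<T, T> {
    private final String prefix;
    LogOutput(String prefix) { this.prefix = prefix; }

    @ProcessElement
    public void processElement(ProcessContext c) {
      System.out.println(prefix + c.element());
      c.output(c.element());
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("Read lines", TextIO.read().from("s3://<bucket>/test_tx_20211101.dat"))
        .apply("timestamp", WithTimestamps.of((String e) -> Instant.now()))
        .setCoder(StringUtf8Coder.of())
        // Count all elements globally and log the single resulting total.
        .apply(Combine.globally(Count.<String>combineFn()).withoutDefaults())
        .apply(ParDo.of(new LogOutput<Long>("Pcollection count: ")));

    p.run().waitUntilFinish();
  }
}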
The same code was executed with Beam 2.23.0 + AWS SDK v1 and with Beam 2.48.0 + AWS SDK v2:
beam 2.23.0 + aws sdk1 --> Pcollection count: 78132206
beam 2.48.0 + aws sdk2 --> Pcollection count: 78132208
To eliminate the possibility of a corrupted record, I deleted the first of the two records that appeared twice. The record immediately after it then appeared twice instead, and the second duplicate also shifted down to the next record. This implies that the read goes wrong at a certain byte offset rather than for a specific record.
**Scenario**: records at lines 33554433 and 67108865 appeared twice
Line#       item
1           ...
...
33554433    abc    (read twice)
...
67108865    xyz    (read twice)
...
78132206    (last line)
In a second test, the record at line 33554433 was deleted. The records now at line 33554433 (previously line 33554434) and line 67108865 (previously line 67108866) were then read twice.
It seems records are read successfully up to record 33554432, and then record 33554433 is read twice.
Note: 67108865 - 33554433 = 33554432 (= 2^25), so the duplicates recur at a fixed record interval.
Record#33554433 --> Line 33554433 of 78132206 ; Word 871941986 of 2028601968; Byte 21374174185 of 49770215222
Record#67108865 --> Line 67108865 of 78132206 ; Word 1743499355 of 2028601968; Byte 42748346369 of 49770215222
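To narrow this down further, it can help to inspect the raw bytes around the reported offsets and see where the CRLF boundaries fall relative to the duplicated records. Below is a rough diagnostic sketch, assuming a local copy of the file; the offsets are the "Byte N of M" values reported above, and the file name and window size are placeholders.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Dumps a small window of bytes around each given offset so the CRLF
// boundaries near the duplicated records can be inspected by eye.
public class InspectOffset {
  public static void main(String[] args) throws IOException {
    String path = "test_tx_20211101.dat";          // local copy of the S3 object (assumption)
    long[] offsets = {21374174185L, 42748346369L}; // byte positions reported for the duplicated records
    int window = 128;                              // bytes to show on each side of the offset

    try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
      for (long offset : offsets) {
        long start = Math.max(0, offset - window);
        long end = Math.min(file.length(), offset + window);
        byte[] buf = new byte[(int) (end - start)];
        file.seek(start);
        file.readFully(buf);

        // Make CR and LF visible so the record boundary around the offset is obvious.
        String text = new String(buf, StandardCharsets.UTF_8)
            .replace("\r", "\\r")
            .replace("\n", "\\n\n");
        System.out.println("=== bytes around offset " + offset + " ===");
        System.out.println(text);
      }
    }
  }
}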
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components
- Component: Python SDK
- Component: Java SDK
- Component: Go SDK
- Component: Typescript SDK
- Component: IO connector
- Component: Beam YAML
- Component: Beam examples
- Component: Beam playground
- Component: Beam katas
- Component: Website
- Component: Spark Runner
- Component: Flink Runner
- Component: Samza Runner
- Component: Twister2 Runner
- Component: Hazelcast Jet Runner
- Component: Google Cloud Dataflow Runner