
Keeping keys in a state for a very long time (keys expiry unknown) #20310

@damccorm

Description

I have a use case that I think would be a good addition to the pipeline patterns:

 
Beam (Java SDK) reads two kinds of records from a data stream such as Kafka:

  1. Records of type A, containing a key and its corresponding metadata.
  2. Records of type B, containing the same key but no metadata. Beam then needs to fill in the metadata for records of type B by looking it up with the keys received in records of type A.
     
The idea is to save the metadata (that is, state) for keys received in records of type A, and then look it up when records of type B arrive.

Beam's stateful processing (`@StateId`) can be used here. The problem is that we don't know when the keys should expire. I don't think keeping the state in the global window is a good idea, because there could be many keys (possibly millions over time) to hold in state.

One possible solution, as suggested by Reza Ardeshir Rokni (rarokni@gmail.com):

Maintain the state in a large fixed window (1 day or so), so that garbage collection can happen at the window boundary. When the window expires, save the metadata values in an external store such as BigQuery. If a record with the same key arrives in a new window and needs this metadata, fetch the metadata for that key from the external store and save it in the new window's state again.
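
To make the suggested flow concrete, here is a minimal plain-Java sketch of the logic. It deliberately does not use the Beam API. The class name `WindowedMetadataCache`, its methods, and the `Map` standing in for the external store are all hypothetical, and the sketch assumes single-threaded, in-order timestamps. In a real pipeline the per-window state would live in a stateful `DoFn`, and the flush would more likely be driven by an event-time timer near the end of the window.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustration only: simulates "state scoped to a fixed window, with an
// external store as fallback". None of these names come from Beam.
final class WindowedMetadataCache {
    private final long windowMillis;
    // Hypothetical stand-in for an external DB such as BigQuery.
    private final Map<String, String> externalStore;
    // Stand-in for per-window state; cleared whenever the window rolls over.
    private final Map<String, String> windowState = new HashMap<>();
    private long currentWindowStart = Long.MIN_VALUE;

    WindowedMetadataCache(long windowMillis, Map<String, String> externalStore) {
        this.windowMillis = windowMillis;
        this.externalStore = externalStore;
    }

    // Type A record: remember the metadata for this key in the current window.
    void onRecordA(String key, String metadata, long timestampMillis) {
        rollWindowIfNeeded(timestampMillis);
        windowState.put(key, metadata);
    }

    // Type B record: look up metadata in window state first, then fall back
    // to the external store and re-populate the window state on a hit.
    Optional<String> onRecordB(String key, long timestampMillis) {
        rollWindowIfNeeded(timestampMillis);
        String metadata = windowState.get(key);
        if (metadata == null) {
            metadata = externalStore.get(key);
            if (metadata != null) {
                windowState.put(key, metadata);
            }
        }
        return Optional.ofNullable(metadata);
    }

    // Number of keys currently held in window state.
    int stateSize() {
        return windowState.size();
    }

    // When a timestamp falls in a different fixed window, persist the old
    // state to the external store and drop it (the "GC" step).
    private void rollWindowIfNeeded(long timestampMillis) {
        long windowStart = Math.floorDiv(timestampMillis, windowMillis) * windowMillis;
        if (windowStart != currentWindowStart) {
            externalStore.putAll(windowState);
            windowState.clear();
            currentWindowStart = windowStart;
        }
    }
}
```

With a one-day window, a key seen in a type A record on Monday is served from window state for the rest of Monday. The first type B record for that key on Tuesday triggers one external lookup, after which it is served from Tuesday's window state again. Only keys that are actually used in a given window occupy state, which is the point of the suggestion.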
     
     
     
     

 

Imported from Jira BEAM-10019. Original Jira may contain additional context.
Reported by: mohilkhare.
