Using HashSet based indexes in Apache Spark
An Apache Spark based solution to partition data and de-duplicate it while incrementally loading to a table.
The code is described further along with the problem statement in a lot more detail in this blog post here.
The problem statement
A table exists in Hive or any destination system and data is loaded every hour (or day) to this table. As per the source systems, it can generate duplicate data. However the final table should de-duplicate the data while loading. We assume that data is immutable, but can be delivered more than once, and we need a logic to filter these duplicates before appending the data to our master store.
Assumption: We have the data partitioned and data doesn’t get repeated across partition. This just makes the problem a simpler to optimize as it would help in reducing shuffles. For other use cases we can very well consider the whole data to be in one partition.
Sample use case: I used this on click stream data with an at least once delivery guarantee. However since the data is immutable, the source timestamp doesn’t change, and I partition the data based on this source timestamp.
More indepth explanation is here
An example of using the APIs which are exposed outside is here