Streaming Data Algorithm

This repo includes three algorithms: Bloom filtering, Flajolet-Martin algorithm, and reservoir sampling

1. Bloom Filter

Bloom Filtering algorithm is to estimate whether the city of a business in business_second.json has shown before in business_first.json. I used the following hash function to hash the city name:

where a, b is randomly picked large number, m is the length of bit array, which is just length of the bloom filter. Note: the representation of bloom filter is a not a list 1 or 0, but instead a set of indices of all 1's. This way we can save some spaces.

2. Flajolet-Martin algorithm

Uses generate_stream.jar to simulate the data stream to work with. The implementation listens to port number 9999 locally. FM algorithm estimate the number of unique cities within a window in the data stream by the following steps:

Multiple hash functions are used to improve the estimation accuracy. The standard way to do this is shown below

But for simplicity, instead of using kmeans to group multiple estimation result, I manually set a number of group to simulate the grouping process, turns out this solution works.

Results:

Time                  Ground Truth  Estimation
2021-04-23 14:45:40   24            32
2021-04-23 14:46:33   42            50
2021-04-23 14:48:46   62            59

3. Reservoir Sampling/Fixed Size Sampling

This implementation uses Twitter API of streaming to implement the fixed size sampling method (Reservoir Sampling Algorithm) and find popular tags with the top 3 frequencies on tweets based on the samples. Assuming that the memory can only save 100 tweets, we need to use the fixed size sampling method to only keep part of the tweets as a sample in the streaming.

The idea of Reservoir Sampling is to keep a fixed size sample to work with, and either keep/replace existing element in the sample list with newly arrived element or discard the new element:

I only take tweets with hashtags into account when doing Reservoir Sampling, that is, the hashtag will be counted. Newly arrived hashtag will be kept with the probability of 100/n, where n is the number of the newly arrived hashtag's index.

For example, the 101st element will be kept with a probability of 100/101, which is pretty high.

Sample data is obtained by Twitter's API. The program is set to listen to the tweets with following keywords:

TOPIC_LIST = ['COVID19', 'SocialDistancing', 'StayAtHome', 'Trump', 'Quarantine', 'CoronaVirus', 'China', 'Wuhan', 'Pandemic']

Results:

...

The number of tweets with tags from the beginning: 59
COVID19 : 16
Covid19 : 6
China : 4

The number of tweets with tags from the beginning: 60
COVID19 : 17
Covid19 : 6
China : 3
India : 3
SOSIYC : 3
Unite2FightCorona : 3

...

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
Bloom Filter.py		Bloom Filter.py
Bloom Filter.scala		Bloom Filter.scala
Flajolet-Martin algorithm.py		Flajolet-Martin algorithm.py
README.md		README.md
Reservoir Sampling.py		Reservoir Sampling.py
business.zip		business.zip
business_first.json		business_first.json
business_second.json		business_second.json
generate_stream.jar		generate_stream.jar

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Streaming Data Algorithm

1. Bloom Filter

2. Flajolet-Martin algorithm

3. Reservoir Sampling/Fixed Size Sampling

About

Uh oh!

Releases

Packages

Languages

artisan1218/Streaming-Data-Algo

Folders and files

Latest commit

History

Repository files navigation

Streaming Data Algorithm

1. Bloom Filter

2. Flajolet-Martin algorithm

3. Reservoir Sampling/Fixed Size Sampling

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages