Loads and processes data from a TSV file listing all queries performed on HN Search during a few days.
It then launches a web server that answers the following queries:
GET /1/queries/count/<DATE_PREFIX>
: returns a JSON object specifying the number of distinct queries that have been done during a specific time range

GET /1/queries/popular/<DATE_PREFIX>?size=<SIZE>
: returns a JSON object listing the most popular queries that have been done during a specific time range
$ go build
$ ./go_play_hn
# open in browser http://127.0.0.1:8080
# distinct http://127.0.0.1:8080/1/queries/count/2015
# top n http://127.0.0.1:8080/1/queries/popular/2015
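For reference, a minimal sketch of how the two endpoints could be wired with the standard net/http package; the handler bodies, the JSON field names and the default size below are placeholders, not the app's actual routing or response format.

```go
package main

import (
	"encoding/json"
	"net/http"
	"strconv"
	"strings"
)

func main() {
	// Hypothetical wiring; the real handlers answer from the in-memory trie.
	http.HandleFunc("/1/queries/count/", func(w http.ResponseWriter, r *http.Request) {
		prefix := strings.TrimPrefix(r.URL.Path, "/1/queries/count/") // e.g. "2015" or "2015-08-01"
		json.NewEncoder(w).Encode(map[string]interface{}{"prefix": prefix, "count": 0})
	})
	http.HandleFunc("/1/queries/popular/", func(w http.ResponseWriter, r *http.Request) {
		prefix := strings.TrimPrefix(r.URL.Path, "/1/queries/popular/")
		size, err := strconv.Atoi(r.URL.Query().Get("size"))
		if err != nil || size <= 0 {
			size = 5 // placeholder default
		}
		json.NewEncoder(w).Encode(map[string]interface{}{"prefix": prefix, "size": size, "queries": []string{}})
	})
	http.ListenAndServe("127.0.0.1:8080", nil)
}
```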
The app loads the logs one by one and adds them into a trie-like structure. The trie organizes the logs by date. For example, given the logs:
2015-08-01 00:03:42 term
2015-08-01 00:03:43 term1
2015-08-01 00:03:44 term1
they are stored in a six-level-deep trie as
2015 <- aggregate for 2015
|
-> 08 <- aggregate for month 08 of 2015
|
-> 01 <- aggregate for day 01 of 2015-08
|
-> 00 <- aggregate for hour 00 of 2015-08-01
|
-> 03 <- aggregate for minute 03 of 2015-08-01 00:
|
-> 42 <- aggregate for second 42 of 2015-08-01 00:03
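A minimal sketch of that layout, assuming a single node type per level with a children map and a per-level term-count map (node, addLog and the field names are illustrative, not the actual implementation):

```go
package main

// node is one level of the date trie (year, month, day, hour, minute, second).
type node struct {
	children map[string]*node // next date component -> child node
	counts   map[string]int   // aggregate term -> count at this level
}

func newNode() *node {
	return &node{children: map[string]*node{}, counts: map[string]int{}}
}

// addLog walks the date components ("2015", "08", "01", "00", "03", "42"),
// creating nodes as needed and bumping the term count at every level.
func (n *node) addLog(components []string, term string) {
	n.counts[term]++
	if len(components) == 0 {
		return
	}
	child, ok := n.children[components[0]]
	if !ok {
		child = newNode()
		n.children[components[0]] = child
	}
	child.addLog(components[1:], term)
}

func main() {
	root := newNode()
	root.addLog([]string{"2015", "08", "01", "00", "03", "42"}, "term")
	root.addLog([]string{"2015", "08", "01", "00", "03", "43"}, "term1")
	root.addLog([]string{"2015", "08", "01", "00", "03", "44"}, "term1")
}
```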
During loading, at each level an aggregate term count is kept in a hash table.
When all the data is loaded, a final processing step extracts the term counts from the hash table and keeps them sorted in an array of (term, count) pairs.
The trie uses a hash table to access its children nodes.
To answer how many distinct queries were made on 2015-08-01, just take the length of the sorted (term, count) array at the third level 2015->08->01.
To answer what the top 5 queries were on 2015-08-01, just take the first 5 elements of the sorted (term, count) array at the third level 2015->08->01.
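A sketch of the final processing step and of the two lookups, shown on a single node's data (pair and finalize are hypothetical names):

```go
package main

import (
	"fmt"
	"sort"
)

// pair is one (term, count) entry kept per trie node after the final step.
type pair struct {
	term  string
	count int
}

// finalize turns a node's hash table into a slice sorted by descending count.
func finalize(counts map[string]int) []pair {
	pairs := make([]pair, 0, len(counts))
	for t, c := range counts {
		pairs = append(pairs, pair{t, c})
	}
	sort.Slice(pairs, func(i, j int) bool { return pairs[i].count > pairs[j].count })
	return pairs
}

func main() {
	// counts as they would look at node 2015->08->01 after loading the example logs
	pairs := finalize(map[string]int{"term": 1, "term1": 2})

	fmt.Println("distinct:", len(pairs)) // count query: length of the array
	n := 5
	if n > len(pairs) {
		n = len(pairs) // clamp the requested size to what is available
	}
	fmt.Println("top:", pairs[:n]) // popular query: first n elements
}
```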
Performance (amortized), where k <= 6 is the number of date components (yyyy-mm-dd hh:mm:ss gives six) and N is the number of logs:
- Answer time is O(k) for both queries.
- The startup time is around 3-4 seconds.
- The load time is O(k * N log N).
- Memory usage is also proportional to k * N.
- The constant factors in the previous estimates are quite high. Improvements are possible while keeping more or less the same algorithm.
-
The deterministic algorithm cannot be easily scaled. In order to scale up, data processing can be distributed to worker nodes and the results merged centrally as follows:
- grouping by url: a long-tail problem, a few machines would process the most popular urls. I cannot see how to evenly distribute the popular urls.
- grouping by date, let's say by day. The problem is that the workers need to return the complete hash map for the day to the central server, which then does the merge. If this is not done, the aggregate computation of the top n can be (very) wrong. Imagine node 1 holds {m:5, x:4} and node 2 holds {b:6, x:4}. Merging only each node's top 1 gives {b:6, m:5}, while in reality the result is {x:8, b:6} (see the sketch below).
The solution seems to be non-deterministic algorithms such as HyperLogLog and its family.
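A small runnable version of the example above, merging the full maps to show why per-node top-1 lists are not enough:

```go
package main

import "fmt"

func main() {
	node1 := map[string]int{"m": 5, "x": 4}
	node2 := map[string]int{"b": 6, "x": 4}

	// Merging only each node's top 1 would give {b:6, m:5} as the global top 2.
	// Merging the full maps reveals the real counts: x wins with 8.
	merged := map[string]int{}
	for k, v := range node1 {
		merged[k] += v
	}
	for k, v := range node2 {
		merged[k] += v
	}
	fmt.Println(merged) // map[b:6 m:5 x:8] -> the real top 2 is {x:8, b:6}
}
```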
-
Given that AddLog is called on successive logs, it might be possible to optimize it further so that we don't look up the same node chain at each step.
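One possible shape of that optimization, assuming the logs arrive in timestamp order: remember the node chain resolved for the previous log and reuse the levels whose date components have not changed (a hypothetical sketch, not the current AddLog):

```go
package main

type node struct {
	children map[string]*node
	counts   map[string]int
}

func newNode() *node {
	return &node{children: map[string]*node{}, counts: map[string]int{}}
}

// cachedPath remembers the node chain resolved for the previous log so that a
// following log with the same date prefix skips the child lookups for those levels.
type cachedPath struct {
	root  *node
	comps [6]string
	nodes [6]*node
}

func (c *cachedPath) addLog(comps [6]string, term string) {
	// longest prefix of date components shared with the previous log
	shared := 0
	for shared < 6 && c.nodes[shared] != nil && c.comps[shared] == comps[shared] {
		shared++
	}
	c.root.counts[term]++
	cur := c.root
	for i := 0; i < shared; i++ {
		cur = c.nodes[i]
		cur.counts[term]++ // still aggregated, but without a children lookup
	}
	for i := shared; i < 6; i++ {
		child, ok := cur.children[comps[i]]
		if !ok {
			child = newNode()
			cur.children[comps[i]] = child
		}
		cur = child
		c.comps[i], c.nodes[i] = comps[i], cur
		cur.counts[term]++
	}
}

func main() {
	c := &cachedPath{root: newNode()}
	c.addLog([6]string{"2015", "08", "01", "00", "03", "42"}, "term")
	c.addLog([6]string{"2015", "08", "01", "00", "03", "43"}, "term1") // reuses 5 of the 6 levels
}
```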
-
Given that the logs are contiguous, the hash-of-hashes trie implementation could easily be replaced with a big array, further reducing startup and lookup time.
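A sketch of the index computation such a flat array could use, assuming the TSV covers a known contiguous window starting at a fixed timestamp (start and the window length below are made up):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical: the logs cover a known, contiguous few-day window.
	start := time.Date(2015, 8, 1, 0, 0, 0, 0, time.UTC)
	days := 4 // length of the window, in days

	// One slot per second; each slot would hold that second's term counts.
	slots := make([]map[string]int, days*24*60*60)

	t := time.Date(2015, 8, 1, 0, 3, 42, 0, time.UTC) // parsed log timestamp
	idx := int(t.Sub(start) / time.Second)            // direct index, no hashing
	if slots[idx] == nil {
		slots[idx] = map[string]int{}
	}
	slots[idx]["term"]++
	fmt.Println(idx, slots[idx])
}
```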
-
Date parsing and conversion to components can be much improved given the standard, fixed-width format.
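For example, since every timestamp has the fixed form "YYYY-MM-DD HH:MM:SS", the six components can be sliced out by byte offset instead of going through time.Parse (splitTimestamp is a hypothetical helper):

```go
package main

import "fmt"

// splitTimestamp slices the six date components out of a fixed-format
// "YYYY-MM-DD HH:MM:SS" timestamp by byte offset, avoiding time.Parse.
func splitTimestamp(ts string) [6]string {
	return [6]string{ts[0:4], ts[5:7], ts[8:10], ts[11:13], ts[14:16], ts[17:19]}
}

func main() {
	fmt.Println(splitTimestamp("2015-08-01 00:03:42")) // [2015 08 01 00 03 42]
}
```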
-
It might be possible to not work with the urls directly and instead replace them with a data structure that is basically a string hash plus a byte range in the file. Given that the file is memory mapped, this can work nicely, but eventually it will be very disk bound.
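Roughly, such a structure could keep a hash for grouping and counting plus the byte range of the query inside the mapped file, resolving back to the string only when needed (queryRef and its fields are hypothetical):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// queryRef stands in for the query string: a hash for grouping/counting plus
// the byte range of the string inside the memory-mapped TSV file.
type queryRef struct {
	hash   uint64
	off    int64 // byte offset of the query in the mapped file
	length int32 // length of the query in bytes
}

func newQueryRef(data []byte, off int64, length int32) queryRef {
	h := fnv.New64a()
	h.Write(data[off : off+int64(length)])
	return queryRef{hash: h.Sum64(), off: off, length: length}
}

// resolve reads the original string back from the mapped file when needed,
// e.g. when building the final (term, count) arrays.
func (q queryRef) resolve(data []byte) string {
	return string(data[q.off : q.off+int64(q.length)])
}

func main() {
	// data stands in for the memory-mapped file contents.
	data := []byte("2015-08-01 00:03:42\tterm1\n")
	ref := newQueryRef(data, 20, 5)
	fmt.Println(ref.hash, ref.resolve(data))
}
```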
-
Performance, and especially memory consumption, might be further improved by using a radix tree instead of the temporary hash map that keeps the url counts.
- added a unit test for ComputeTopNQueries and fixed an issue where the size param was not interpreted correctly
- the test for ComputeSortedURLs was non-deterministic because the urls are not sorted; it now uses a more appropriate test set
- the test for topNHandler