adriantofan/go_play_hn

go experiment - process large amounts of data and expose a JSON web service

What it does?

Loads and processes data from a TSV file listing all queries performed on HN Search over a few days.

It then launches a web server that answers the following queries:

  • GET /1/queries/count/<DATE_PREFIX>: returns a JSON object specifying the number of distinct queries that have been done during a specific time range
  • GET /1/queries/popular/<DATE_PREFIX>?size=<SIZE>: returns a JSON object listing the top popular queries that have been done during a specific time range
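
For illustration, using the three example logs from the Architecture section below, the calls could look like this (the JSON field names here are assumptions, not taken from the code):

    $ curl http://127.0.0.1:8080/1/queries/count/2015-08-01
    {"count": 2}

    $ curl "http://127.0.0.1:8080/1/queries/popular/2015-08-01?size=2"
    {"queries": [{"query": "term1", "count": 2}, {"query": "term", "count": 1}]}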

Install and run

$ go build
$ ./go_play_hn
# open in browser http://127.0.0.1:8080
# distinct http://127.0.0.1:8080/1/queries/count/2015
# top n http://127.0.0.1:8080/1/queries/popular/2015

Architecture

The app loads logs one by one and adds them into a trie-like structure. The trie organizes the logs by date. For example, given the logs:

    2015-08-01 00:03:42	term
    2015-08-01 00:03:43	term1    
    2015-08-01 00:03:44	term1    

they are stored in a six-level-deep trie as:

    2015                                <- aggregate for 2015
    |
    -> 08                               <- aggregate for month 08 of 2015
        |
        -> 01                           <- aggregate for day 01 of month 08 of 2015
            |
            -> 00                       <- aggregate for hour 00 of 2015-08-01
                |
                -> 03                   <- aggregate for minute 03 of 2015-08-01 00:
                    |
                    -> 42               <- aggregate for second 42 of 2015-08-01 00:03

During loading, an aggregate term count is kept in a hash table at each level.

When all the data is loaded, a final processing step extracts the term counts from each hash table and keeps them sorted in an array as (term, count) pairs.

The trie uses a hash table to access child nodes.
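
A minimal sketch of this structure in Go (the identifiers below are illustrative, not the repository's actual names; finalize assumes import "sort"):

    type termCount struct {
        term  string
        count int
    }

    type node struct {
        children map[string]*node // next date component, e.g. "08" under "2015"
        counts   map[string]int   // term -> count, filled while loading
        sorted   []termCount      // built once, after loading is done
    }

    // addLog bumps the term count at every level along the date chain,
    // so each node aggregates its whole subtree.
    func (n *node) addLog(components []string, term string) {
        n.counts[term]++
        if len(components) == 0 {
            return
        }
        child, ok := n.children[components[0]]
        if !ok {
            child = &node{children: map[string]*node{}, counts: map[string]int{}}
            n.children[components[0]] = child
        }
        child.addLog(components[1:], term)
    }

    // finalize turns each node's hash table into a (term, count) array
    // sorted by descending count.
    func (n *node) finalize() {
        for t, c := range n.counts {
            n.sorted = append(n.sorted, termCount{t, c})
        }
        sort.Slice(n.sorted, func(i, j int) bool {
            return n.sorted[i].count > n.sorted[j].count
        })
        for _, child := range n.children {
            child.finalize()
        }
    }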

To answer how many distinct queries there were on 2015-08-01, just take the length of the sorted (term, count) array at the third-level node 2015 -> 08 -> 01.

To answer what the top 5 queries on 2015-08-01 were, just take the first 5 elements of that same sorted (term, count) array.
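
Continuing the sketch above, answering either query is a short walk down the trie:

    // lookup follows a parsed date prefix, e.g. ["2015", "08", "01"].
    func (n *node) lookup(components []string) *node {
        for _, c := range components {
            next, ok := n.children[c]
            if !ok {
                return nil
            }
            n = next
        }
        return n
    }

    // distinct queries: len(day.sorted)
    // top 5 queries:    day.sorted[:5] (already sorted by descending count)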

Performance (amortized) for k <= 6 (yy-mm-dd hh:mm:ss, six components):

  • Answer time is O(k) for both queries.
  • Startup time is around 3-4 seconds.
  • Load time is O(k * N log N).
  • Memory usage is O(k * N).
  • The constant factors in the estimates above are quite high; improvements are possible while keeping more or less the same algorithm.

Notes for improvement:

  • The deterministic algorithm cannot easily be scaled. To scale up, data processing could be distributed to worker nodes and the results merged centrally, as follows:

    • Grouping by URL: a long-tail problem; a few machines would process the most popular URLs, and I cannot see how to distribute the popular URLs evenly.
    • Grouping by date, say by day. The problem is that the workers need to return the complete hash map for the day to the server, which then does a central merge. Otherwise the aggregate top-n computation can be (very) wrong: imagine node 1 holds {m:5, x:4} and node 2 holds {b:6, x:4}. Merging only their top-1 entries gives {b:6, m:5}, while the real result is {x:8, b:6}. (See the demo after this list.)

    The solution seems to be non-deterministic algorithms such as HyperLogLog and its family.
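
A tiny self-contained demo of the wrong merge, using the made-up counts from the example above:

    package main

    import "fmt"

    func main() {
        node1 := map[string]int{"m": 5, "x": 4}
        node2 := map[string]int{"b": 6, "x": 4}

        // Merging only each node's top-1 entry keeps m:5 and b:6,
        // so x never surfaces even though it is the global winner.

        // A full merge of the complete maps recovers the truth:
        merged := map[string]int{}
        for term, c := range node1 {
            merged[term] += c
        }
        for term, c := range node2 {
            merged[term] += c
        }
        fmt.Println(merged) // map[b:6 m:5 x:8] -> the true top 1 is x with 8
    }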

Speed optimizations

  • Given that AddLog is called on successive logs, it might be possible to optimize it further so that we don't look up the same chain of nodes at each step.

  • Given that the logs are contiguous, the hash-of-hashes trie implementation could easily be replaced with a big array, further reducing startup and lookup time.

  • Date parsing and conversion into components can be much improved given the fixed standard format; see the sketch after this list.
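
For instance, since every timestamp has the fixed layout YYYY-MM-DD hh:mm:ss, the six components can be sliced out directly instead of going through time.Parse (a sketch; splitTimestamp is a hypothetical helper):

    // splitTimestamp slices a fixed-layout "2015-08-01 00:03:42"-style
    // timestamp into its six components. String slicing in Go shares the
    // backing array, so this allocates nothing. Assumes well-formed input.
    func splitTimestamp(ts string) [6]string {
        return [6]string{
            ts[0:4],   // year
            ts[5:7],   // month
            ts[8:10],  // day
            ts[11:13], // hour
            ts[14:16], // minute
            ts[17:19], // second
        }
    }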

Memory usage optimizations

  • It might be possible to avoid storing the URLs themselves and replace them with a data structure that is basically a string hash plus a byte range in the file. Given that the file is memory-mapped, this could work nicely, but eventually it would be very disk-bound. (A possible shape is sketched after this list.)

  • Performance, and especially memory consumption, might be further improved by using a radix tree instead of the temporary hash map that keeps the URL counts.
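
A possible shape for the hash-plus-range idea (urlRef is hypothetical and does not exist in the repository):

    // urlRef stands in for a query string: a hash for fast equality
    // checks plus the byte range where the text lives in the
    // memory-mapped TSV file.
    type urlRef struct {
        hash uint64 // e.g. FNV-1a of the query text
        off  int64  // byte offset of the text in the mapped file
        size int32  // length of the text in bytes
    }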

Updates

V1.1

  • Added a unit test for ComputeTopNQueries and fixed an issue where the size param was not interpreted correctly.
  • The test for ComputeSortedURLs was non-deterministic because the URLs are not sorted; it now uses a more appropriate test set.
  • Added a test for topNHandler.
