Cosine Similarity in Go
Go Python
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.

README.md

GoSineSim

Cosine Similarity of two or more shallow JSON objects in Go.

Usage

gosignsim --source=$JSON_OBJ_LITERAL --pool=[$JSON_OBJ_LITERAL,...] [--threshold=float] [--output_file=/path/to/save/results] [--verbose=boolean]

Options

  • source -- This is the main object that all other objects will be compared against.
  • pool -- This is a list of objects that will be compared against the source object.
  • threshold (optional) -- This is the minimal similarity value that an object must meet in order to make the cut.
  • output_file (optional) -- If this option is selected the resulting JSON will be saved to this file.
  • verbose (optional) -- If this is set to true(thy) various data will be sent to stdout. ie, if output_file was set, you would still see the resulting JSON.

GoSineSim works by comparing one data structure against one or more. Each item should be made up like this:

{
    "id": "15",
    "data": {
        "cars": 30,
        "money": 99
    }
}

[{
    "id": "44",
    "data": {
        "cars": 87,
        "money": 40
    }
}]

Data Rules

  1. Each item set must have a string id field and an object literal data field. The data is a simple key: value pair where the keys are stings and the values are floats

    type Item struct {
    	Id   string
    	Data map[string]float64
    }
  2. The source argument is a single item

  3. The pool argument is a list of items

Example

./gosinesim -source='{"id": "15", "data": {"cars": 30, "money": 99}}' --pool='[{"id": "44", "data": {"cars": 87, "money": 40}}]'

Which would produce the result

[{"Similarity":66.32728204403626,"Id":"44","Data":{"cars":87,"money":40}}]

Benchmarks

I wrote a very simple benchmarking suite in python bench.py that simulates what it would be like to compare a random sampling of items from python's system call method.

The file creates 100,000 items which can contain between 5 and 15 entries. It then chunks the items into groups of 100, 1000, 5000... up to 100k. As you can see from the results below, a comparison of less than ~5k items can be done in realtime.

My specs are: macOS, 2.5 Core i5, 16GB 1600 DDR3, and an SSD

************************************************************

Building the pool of 100000 items

done building the pool: 3.93292379379


****************************************

Comparing 100 items

	run 1 took 0.0157909393311
	run 2 took 0.0160090923309
	run 3 took 0.0309271812439


****************************************

Comparing 1000 items

	run 1 took 0.0655910968781
	run 2 took 0.0748629570007
	run 3 took 0.0511500835419


****************************************

Comparing 5000 items

	run 1 took 0.22353386879
	run 2 took 0.204764127731
	run 3 took 0.197242021561


****************************************

Comparing 10000 items

	run 1 took 0.291663885117
	run 2 took 0.608170032501
	run 3 took 0.627039909363


****************************************

Comparing 25000 items

	run 1 took 0.51177406311
	run 2 took 0.534518003464
	run 3 took 0.545824050903


****************************************

Comparing 50000 items

	run 1 took 0.905314922333
	run 2 took 0.918902873993
	run 3 took 0.902179002762


****************************************

Comparing 75000 items

	run 1 took 1.37646508217
	run 2 took 1.3969039917
	run 3 took 1.66373586655


****************************************

Comparing 100000 items

	run 1 took 1.83165812492
	run 2 took 1.93080997467
	run 3 took 5.73800802231


============================================================

finished running: 26.3283598423