Speed up duplication detection process #1

MrOrz · 2017-01-11T05:33:56Z

Currently it's an O(n^2) scan over all rumor entries.

https://github.com/MrOrz/rumors-db/blob/master/scripts/csvToElasticSearch.js#L108

Locality sensitive hashing should help, but requires effective design for the hashing bins.


AnshulMalik · 2017-01-11T06:57:39Z

How come it is O(n^2)?

MrOrz · 2017-01-11T07:32:40Z

Given a document, findDuplication scans through each processed document to calculate similarity between them (O(n)).

If we process n documents, findDuplication is invoked once per document and performs up to 1+2+...+(n-1) similarity comparisons in total, which gives O(n^2).
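To make the counting argument concrete, here is a minimal sketch of that kind of pairwise scan. It is not the code in csvToElasticSearch.js: `similarity`, `dedupe` and the 0.8 threshold are hypothetical stand-ins chosen only to count comparisons.

```javascript
// Hypothetical stand-in for the real similarity check:
// Jaccard similarity over the sets of characters in each text.
function similarity(a, b) {
  const setA = new Set(a);
  const setB = new Set(b);
  let common = 0;
  for (const ch of setA) {
    if (setB.has(ch)) common += 1;
  }
  const union = setA.size + setB.size - common;
  return union === 0 ? 0 : common / union;
}

// Naive dedupe: every new text is compared against all texts kept so far.
function dedupe(texts, threshold = 0.8) {
  const processed = [];
  let comparisons = 0;
  for (const text of texts) {
    let duplicate = false;
    for (const seen of processed) {
      comparisons += 1;
      if (similarity(text, seen) >= threshold) {
        duplicate = true;
        break;
      }
    }
    if (!duplicate) processed.push(text);
  }
  return { processed, comparisons };
}

const result = dedupe(['aaa', 'bbb', 'ccc', 'ddd']);
console.log(result.comparisons); // 0 + 1 + 2 + 3 = 6 for four distinct texts
```

When no two texts match, the k-th text is compared against k-1 earlier ones, so n texts cost n(n-1)/2 comparisons in the worst case.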

I think hashing (more specifically, LSH) can help reduce the complexity of findDuplication to constant, but I don't know how to implement LSH yet.
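For reference, one common way to build LSH for text is MinHash signatures split into bands, so that near-duplicate texts tend to share at least one bucket. The sketch below is only an illustration under assumptions: `shingles`, `hash`, `minhash`, `LshIndex`, the 3-character shingles and the 10 bands of 2 rows are all hypothetical choices, and the seeded FNV-1a-style hash is a simple placeholder rather than a tuned hash family.

```javascript
// Split a text into overlapping character shingles of length k.
function shingles(text, k = 3) {
  const out = new Set();
  for (let i = 0; i + k <= text.length; i += 1) out.add(text.slice(i, i + k));
  if (out.size === 0) out.add(text); // text shorter than k: use it whole
  return out;
}

// Seeded 32-bit FNV-1a-style string hash (illustrative, not tuned).
function hash(str, seed) {
  let h = (2166136261 ^ seed) >>> 0;
  for (let i = 0; i < str.length; i += 1) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 16777619) >>> 0;
  }
  return h;
}

// MinHash signature: for each seed, the minimum hash over all shingles.
function minhash(text, numHashes = 20) {
  const sig = new Array(numHashes).fill(0xffffffff);
  for (const s of shingles(text)) {
    for (let j = 0; j < numHashes; j += 1) {
      const h = hash(s, j);
      if (h < sig[j]) sig[j] = h;
    }
  }
  return sig;
}

// Banded LSH index: texts sharing any full band land in the same bucket.
class LshIndex {
  constructor(bands = 10, rows = 2) {
    this.bands = bands;
    this.rows = rows;
    this.buckets = new Map();
  }

  keys(text) {
    const sig = minhash(text, this.bands * this.rows);
    const keys = [];
    for (let b = 0; b < this.bands; b += 1) {
      const band = sig.slice(b * this.rows, (b + 1) * this.rows);
      keys.push(`${b}:${band.join(',')}`);
    }
    return keys;
  }

  // Ids of previously added texts that share at least one bucket.
  candidates(text) {
    const found = new Set();
    for (const key of this.keys(text)) {
      for (const id of this.buckets.get(key) || []) found.add(id);
    }
    return found;
  }

  add(id, text) {
    for (const key of this.keys(text)) {
      if (!this.buckets.has(key)) this.buckets.set(key, []);
      this.buckets.get(key).push(id);
    }
  }
}
```

With an index like this, each new document would only be compared in full against `candidates(text)` instead of every processed document. The lookup cost then depends on the number of bands and the bucket sizes rather than directly on n, though how close that gets to constant depends on how the bands and rows are chosen for the data.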

AnshulMalik · 2017-01-11T07:54:38Z

Ok.

I'll try to find a solution. Also, can you tell me where this findDuplication function is called?

MrOrz · 2017-01-11T09:05:44Z

Thanks. Call path:

npm run seed in CLI
--> aggregateRowsToDocs() in csvToElasticSearch.js
--> https://github.com/MrOrz/rumors-db/blob/master/scripts/csvToElasticSearch.js#L43 and https://github.com/MrOrz/rumors-db/blob/master/scripts/csvToElasticSearch.js#L64.

MrOrz · 2017-01-11T13:18:40Z

It is said that Levenshtein automata are useful when the character distribution is poor.

I'm not sure whether it works, but I'm noting it here: https://www.npmjs.com/package/node-levenshtein-automata

MrOrz · 2017-01-14T03:59:54Z

Please note that we are moving away from Airtable very soon. This issue will be trivial after the shift.

After that, the seed script no longer needs to check for similarities. We can just provide a static set of seed data in which all rumors and answers are already unique.

See https://www.facebook.com/groups/1847232902175197/1896817880550032/ for the shift; the derivative issues are #2, #3 and #4.

MrOrz · 2017-03-18T14:23:52Z

We have shifted to Elasticsearch, so this is no longer required.

MrOrz added enhancement help wanted labels Jan 11, 2017

MrOrz mentioned this issue Jan 11, 2017

Automated script for updating elasticsearch from Airtable cofacts/rumors-api#16

Closed

MrOrz closed this as completed Mar 18, 2017

MrOrz removed enhancement help wanted labels Mar 18, 2017
