Per-document approach:
1. Read the file line by line (input: CSV file, output: string)
2. Tokenize the line (input: output of step 1, output: list of strings)
3. Simhash the tokens (input: output of step 2, output: list of longs)
4. Attach weights to the simhashed tokens (input: output of step 3, output: list of floats) [V matrix calculation]
5. Create the fingerprint matrix (input: output of step 4, output: list of booleans)
6. Compare fingerprints (input: 2-D matrix [y = noOfDocs, x = noOfTokens], output: list of objects with field1 = docName, field2 = list of identical docs (docName))
7. Output the result
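
The steps above could be sketched roughly as follows. This is only a minimal illustration, not the project's implementation. It makes several assumptions the issue does not state: each CSV row is `docName,text`, tokens are split on whitespace, the per-token hash is the first 64 bits of MD5, the weight is the token's frequency in the document, and two documents count as identical when their fingerprints are within `max_distance` bits of each other. The names `read_docs`, `tokenize`, `token_hash`, `fingerprint`, `hamming`, and `find_duplicates` are all hypothetical.

```python
import csv
import hashlib
import io
from collections import Counter

BITS = 64  # assumed fingerprint width (one "long" per token hash)


def read_docs(csv_file):
    """Step 1: read rows; assumes each row is (docName, text)."""
    return {row[0]: row[1] for row in csv.reader(csv_file) if len(row) >= 2}


def tokenize(line):
    """Step 2: split a line into lowercase whitespace-separated tokens."""
    return line.lower().split()


def token_hash(token):
    """Step 3: hash a token to a 64-bit integer (MD5 is an assumption)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fingerprint(tokens):
    """Steps 4-5: build the weighted V vector, then reduce it to booleans."""
    weights = Counter(tokens)  # assumption: weight = term frequency
    v = [0.0] * BITS
    for token, weight in weights.items():
        h = token_hash(token)
        for i in range(BITS):
            # Add the weight where bit i is set, subtract it otherwise.
            v[i] += weight if (h >> i) & 1 else -weight
    return [x > 0 for x in v]


def hamming(a, b):
    """Number of positions where two fingerprints differ."""
    return sum(x != y for x, y in zip(a, b))


def find_duplicates(docs, max_distance=0):
    """Step 6: compare every fingerprint against every other one."""
    prints = {name: fingerprint(tokenize(text)) for name, text in docs.items()}
    return [
        {
            "docName": name,
            "identicalDocs": [
                other
                for other, fp in prints.items()
                if other != name and hamming(prints[name], fp) <= max_distance
            ],
        }
        for name in prints
    ]


if __name__ == "__main__":
    # Step 7: output the result for a tiny in-memory CSV.
    sample = io.StringIO("a,the quick brown fox\nb,fox brown quick the\n")
    for entry in find_duplicates(read_docs(sample)):
        print(entry)
```

Because the sketch weights tokens by frequency and ignores their order, two documents containing the same words in a different order get the same fingerprint. The pairwise comparison is quadratic in the number of documents, which is fine for a sketch but may need bucketing for larger inputs.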