Large blocks of text do not get indexed properly #18

Closed
f-prime opened this issue Jun 25, 2019 · 7 comments · Fixed by #23
Labels
help wanted Extra attention is needed

Comments

@f-prime
Owner

f-prime commented Jun 25, 2019

When indexing large blocks of text (e.g. a movie script), not all phrases get indexed properly. Some phrases and words are indexed while others are not, seemingly at random.

@StefanSlehta
Contributor

Do you have a link to a text I can replicate the issue with?

@00-matt
Contributor

00-matt commented Jun 27, 2019

I was able to reproduce this with the lyrics from a song (after modifying READ_MAX in server.c so that the text would fit in one command):

INDEX fitter_happier fitter, happier more productive comfortable not drinking too much regular exercise at the gym, three days a week getting on better with your associate employee contemporaries at ease eating well, no more microwave dinners and saturated fats a patient, better driver a safer car, baby smiling in back seat sleeping well, no bad dreams no paranoia careful to all animals, never washing spiders down the plughole keep in contact with old friends, enjoy a drink now and then will frequently check credit at moral bank, hole in wall favours for favours, fond but not in love charity standing orders on sundays, ring-road supermarket no killing moths or putting boiling water on the ants car wash, also on sundays no longer afraid of the dark or midday shadows, nothing so ridiculously teenage and desperate nothing so childish at a better pace, slower and more calculated no chance of escape now self-employed concerned, but powerless an empowered and informed member of societ, pragmatism not idealism will not cry in public less chance of illness tires that grip in the wet, shot of baby strapped in backseat a good memory still cries at a good film still kisses with saliva no longer empty and frantic like a cat tied to a stick that's driven into frozen winter shit, the ability to laugh at weakness calm, fitter, healthier and more productive a pig in a cage on antibiotics
Text has been indexed
SEARCH a pig
["fitter_happier"]
SEARCH a pig in a cage
[]

Edit: here's the index produced for the above string: https://termbin.com/oad9

@f-prime
Owner Author

f-prime commented Jun 27, 2019

This problem has to do with the algorithm used in indexer.c (https://github.com/f-prime/fist/blob/master/fist/indexer.c). I haven't had a chance to look deeply yet, but I think there might be a problem with the lookahead logic.

Notice that the missing phrase is at the end of the text. This leads me to believe that something is wrong with line 20 (https://github.com/f-prime/fist/blob/master/fist/indexer.c#L20).

If anyone is investigating the problem, that is where I suggest looking.
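
To make the lookahead concern concrete, here is a minimal sketch of how a phrase lookahead can be clamped to the words that remain, so that phrases near the end of the text are still produced. This is not the code in indexer.c: the names generates_phrase, join_words, max_len and PHRASE_BUF are hypothetical, and the sketch assumes the text has already been split into words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define PHRASE_BUF 256 /* hypothetical upper bound on a joined phrase */

/* Join words[start .. start+len-1] with single spaces into buf.
 * Returns 0 on success, -1 if the phrase does not fit in buf. */
int join_words(const char *const words[], size_t start, size_t len,
               char *buf, size_t buf_size)
{
    size_t used = 0;
    if (buf_size == 0)
        return -1;
    buf[0] = '\0';
    for (size_t i = 0; i < len; i++) {
        int n = snprintf(buf + used, buf_size - used, "%s%s",
                         i ? " " : "", words[start + i]);
        if (n < 0 || (size_t)n >= buf_size - used)
            return -1;
        used += (size_t)n;
    }
    return 0;
}

/* Returns 1 if `phrase` is one of the phrases of 1..max_len consecutive
 * words generated from `words`, otherwise 0. */
int generates_phrase(const char *const words[], size_t n_words,
                     size_t max_len, const char *phrase)
{
    char buf[PHRASE_BUF];
    assert(words != NULL && phrase != NULL);

    for (size_t start = 0; start < n_words; start++) {
        /* Clamp the lookahead to the words that are left. Without this,
         * a fixed-size window either reads past the end or skips the
         * last few start positions entirely. */
        size_t remaining = n_words - start;
        size_t limit = remaining < max_len ? remaining : max_len;

        for (size_t len = 1; len <= limit; len++) {
            if (join_words(words, start, len, buf, sizeof buf) == 0 &&
                strcmp(buf, phrase) == 0)
                return 1;
        }
    }
    return 0;
}
```

With the last seven words of the lyrics as input and max_len set to 5, generates_phrase reports that "a pig in a cage" is produced. Whether the real indexer bounds its lookahead this way is exactly what needs checking around line 20.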

@StefanSlehta
Contributor

Also notice that in the example, fitter_happier is entered as the doc name, but ["a"] is returned by SEARCH.

@00-matt
Contributor

00-matt commented Jun 27, 2019

@StefanSlehta Sorry, that part was just a bad copy and paste.

@00-matt
Contributor

00-matt commented Jun 27, 2019

Do you think it is worth adding a longer string to the test cases?

@StefanSlehta
Contributor

It might be worth adding a case that the original tests did not catch.
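
One possible way to build such a case, sketched under assumptions: generate a SEARCH command for every contiguous phrase of a long text, then check that each one returns the document. The helper below is hypothetical (emit_search_commands, MAX_WORDS and max_len are not part of fist). It splits on spaces only, leaves punctuation attached to words, and assumes every phrase up to max_len words is expected to be searchable, which may not match what the indexer intends.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_WORDS 512 /* hypothetical cap on words in the test text */

/* Writes one "SEARCH <phrase>" line to `out` for every contiguous phrase
 * of 1..max_len words in `text`. Each phrase is a slice of the original
 * text, so the words are assumed to be separated by single spaces.
 * Returns the number of lines written, or 0 if `text` has more than
 * MAX_WORDS words. */
size_t emit_search_commands(const char *text, size_t max_len, FILE *out)
{
    const char *starts[MAX_WORDS];
    size_t lens[MAX_WORDS];
    size_t n = 0;

    assert(text != NULL && out != NULL);

    /* Record where each space-separated word starts and how long it is. */
    const char *p = text;
    while (*p) {
        p += strspn(p, " ");
        size_t len = strcspn(p, " ");
        if (len == 0)
            break;
        if (n == MAX_WORDS)
            return 0;
        starts[n] = p;
        lens[n] = len;
        n++;
        p += len;
    }

    size_t written = 0;
    for (size_t i = 0; i < n; i++) {
        /* i + k <= n keeps phrases that end on the very last word. */
        for (size_t k = 1; k <= max_len && i + k <= n; k++) {
            const char *end = starts[i + k - 1] + lens[i + k - 1];
            fprintf(out, "SEARCH %.*s\n", (int)(end - starts[i]), starts[i]);
            written++;
        }
    }
    return written;
}
```

Feeding the lyrics above through this helper would produce, among others, the line SEARCH a pig in a cage, which is the query that currently returns an empty result.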
