Future improvements for 'similar bills' #178

Closed
aih opened this issue Feb 16, 2021 · 1 comment

aih commented Feb 16, 2021

For use of shingles (multi-word phrases) with 'more-like-this', see, e.g. https://discuss.elastic.co/t/more-like-this-and-shingles-phrases/100775

I got it to work by combining multiple More Like This queries each with their own analyzer instead of trying to use per_field_analyzer. That worked out better anyway, allowing me to have separate settings (e.g. stop_words, min_word_length) for unigrams vs bigrams.
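As a rough illustration of the approach quoted above, the sketch below builds an Elasticsearch request body that combines two `more_like_this` clauses in a `bool`/`should`, one per analyzer, so unigrams and bigrams get their own settings. The field names (`text`, `text.bigrams`), the analyzer and filter names (`bigram_analyzer`, `bigram_filter`), the stop words, and all numeric thresholds are assumptions made for the example, not values taken from this project.

```python
def shingle_index_settings():
    """Index settings with a bigram (shingle) sub-field; all names are hypothetical."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "bigram_filter": {
                        "type": "shingle",
                        "min_shingle_size": 2,
                        "max_shingle_size": 2,
                        "output_unigrams": False,
                    }
                },
                "analyzer": {
                    "bigram_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "bigram_filter"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "text": {
                    "type": "text",
                    "fields": {
                        "bigrams": {"type": "text", "analyzer": "bigram_analyzer"}
                    },
                }
            }
        },
    }


def similar_bills_query(bill_text):
    """Combine one more_like_this clause per analyzer instead of per_field_analyzer."""
    unigram_clause = {
        "more_like_this": {
            "fields": ["text"],
            "like": bill_text,
            "analyzer": "standard",
            "min_term_freq": 2,        # assumed threshold
            "min_word_length": 4,      # assumed threshold
            "stop_words": ["act", "section"],  # assumed stop words
            "max_query_terms": 50,
        }
    }
    bigram_clause = {
        "more_like_this": {
            "fields": ["text.bigrams"],
            "like": bill_text,
            "analyzer": "bigram_analyzer",
            "min_term_freq": 1,        # bigrams repeat less often than single words
            "max_query_terms": 50,
        }
    }
    return {"query": {"bool": {"should": [unigram_clause, bigram_clause]}}}
```

The returned dictionaries are plain request bodies, so they can be passed to whichever Elasticsearch client the project uses for index creation and search.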

===
Flatgov discussion

From Daniel:

I think looking at multi-word phrases probably makes a lot more sense here than looking at single words.
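A small sketch of why multi-word phrases can discriminate better than single words: two made-up titles share common words but no two-word phrase. The helper names `shingles` and `jaccard` and the sample titles are invented for this example, and Jaccard overlap is used only as a simple stand-in for similarity scoring.

```python
import re


def shingles(text, n=2):
    """Return the set of n-word phrases (word n-grams) in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


title_a = "National Defense Authorization Act"
title_b = "National Park Service Act"

# Single words: "national" and "act" match, so the titles look related.
unigram_score = jaccard(shingles(title_a, 1), shingles(title_b, 1))  # 2/6

# Two-word phrases: nothing matches, so the false similarity disappears.
bigram_score = jaccard(shingles(title_a, 2), shingles(title_b, 2))   # 0.0
```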

We can also probably limit small-bill-to-bill comparisons based on how CRS/LOC categorizes them. There are only a handful of monster bills, and those are the ones where it's probably useful to identify when smaller bills are components. We can also potentially use the section headings as a clue to narrow that down.
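One possible reading of this idea, sketched under assumptions: compare a small bill against a large bill only when the two share at least one subject category. The `Bill` record, its field names, the sample bill numbers and subjects, and the `large_threshold` word count are all hypothetical; how CRS/LOC categories would actually be loaded is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bill:
    number: str            # hypothetical identifier, e.g. "hr1"
    subjects: frozenset    # assumed to hold CRS/LOC subject categories
    word_count: int


def candidate_pairs(bills, large_threshold=50_000):
    """Pair each small bill with each large bill that shares a subject."""
    large = [b for b in bills if b.word_count >= large_threshold]
    small = [b for b in bills if b.word_count < large_threshold]
    return [
        (s.number, l.number)
        for l in large
        for s in small
        if s.subjects & l.subjects  # non-empty intersection
    ]


bills = [
    Bill("hr1", frozenset({"Armed Forces", "Health"}), 120_000),
    Bill("hr2", frozenset({"Health"}), 3_000),
    Bill("hr3", frozenset({"Taxation"}), 2_000),
]
pairs = candidate_pairs(bills)  # only hr2 shares a subject with hr1
```

Only the surviving pairs would then go through the more expensive text comparison.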

We could also use the Library of Congress summaries. They are much shorter but have to identify the key concepts, so they could be a way to winnow it down.


aih commented May 22, 2021

n-grams are now implemented and working.

@aih aih closed this as completed May 22, 2021