Measure bill similarity using the ES sections index #58
The first goal is just to have some way of finding a list of other bills that are similar, using the section similarity. It does not have to be perfect at first. For example: get all sections of a bill, find all sections of other bills that have a score > 30, and list the bills those sections belong to as similar bills. The total similarity can be the sum of the individual section scores. That is very crude and not normalized, but it is OK for a first version.
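A minimal sketch of that crude first version, assuming the section hits have already been collected as (other bill id, score) pairs. The function name, the input shape, and the bill ids are hypothetical, not the project's actual format:

```python
from collections import defaultdict


def rank_similar_bills(section_hits, min_score=30.0):
    """Sum section-level scores per bill, keeping only hits above min_score.

    section_hits: iterable of (other_bill_id, score) pairs, one per similar
    section found (hypothetical input shape).
    """
    totals = defaultdict(float)
    for bill_id, score in section_hits:
        if score > min_score:
            totals[bill_id] += score
    # Highest cumulative score first; ties broken by bill id for stable output.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


hits = [("116hr200", 45.0), ("116hr200", 32.5), ("116s50", 31.0), ("116hr999", 12.0)]
ranked = rank_similar_bills(hits)
```

The total is not normalized, so bills with many sections will tend to rank higher than short bills with one very close match.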
I. Identifying bills that are similar
Some tentative ideas (in no particular order):
FWIW, I think this will largely need to be a brute force approach.
II. Sorting similarity in a way that's relevant to the user
b/ virtually identical bills
c/ Look at the legislative history from committee reports
d/ CRS ties the bills together in a committee report.
This requires more thinking. Adding some preliminary thoughts in case it sparks someone else's thinking.
Stage 1 appears to be working. It took a long time to process all bills, and in the future we will want to speed this up (maybe using a dedicated server to preprocess). The scores are a cumulative similarity score, summed across sections. It would be good to add other measures, like how many sections were found to be similar between two bills, and the highest-scoring single section match (which score and which section).
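One way the extra measures could be tracked alongside the cumulative score, as a rough sketch. The `BillMatch` class, the (bill id, section header, score) input shape, and the sample data are assumptions for illustration only:

```python
from dataclasses import dataclass


@dataclass
class BillMatch:
    total_score: float = 0.0    # cumulative score across matched sections
    matched_sections: int = 0   # how many section matches were found
    best_score: float = 0.0     # highest single section score
    best_section: str = ""      # header of that highest-scoring section


def summarize_matches(section_hits):
    """section_hits: iterable of (other_bill_id, section_header, score)."""
    summary = {}
    for bill_id, header, score in section_hits:
        match = summary.setdefault(bill_id, BillMatch())
        match.total_score += score
        match.matched_sections += 1
        if score > match.best_score:
            match.best_score = score
            match.best_section = header
    return summary


hits = [
    ("116hr200", "Sec. 2. Definitions", 45.0),
    ("116hr200", "Sec. 3. Grants", 32.5),
    ("116s50", "Sec. 1. Short title", 31.0),
]
summary = summarize_matches(hits)
```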
I wonder whether we can preprocess and save the results for all prior
congresses. That way we only need to process new bills as they are
introduced.
On Thu, Dec 3, 2020 at 11:45 AM Ari Hershowitz ***@***.***> wrote:
Stage 1 appears to be working. It took a *long* time to process all
bills, and in the future we will want to work on speeding up this process
(maybe using a dedicated server to preprocess).
The scores are a *cumulative* similarity score, between sections. It
would be good to add other measures, like how many sections were found to
be similar between two bills, and the highest-scoring single section match
(which score and which section).
--
- Daniel (Please excuse typos, send by phone )
That makes sense.
Edit: this calculation is off. Since we are indexing, the comparisons are not 90k*90k. For initial processing, we do 90k calculations. For new bills, we run bill similarity calculations for ~200/day, but that only adds previous bills to the current bill's display table. If we also want to add the current bill to the previous bills' pages, we will have to re-run similarity calculations over all bills, or devise a more clever incremental comparison algorithm.
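As one possible incremental approach (an assumption, not something tested here), the forward result for a new bill could also be written as a reverse link on each matched bill. The sketch below reuses the forward score for the reverse direction, which is only an approximation, since a query from the other bill's sections would not necessarily produce the same score. The function name and data shape are hypothetical:

```python
def add_with_reverse_links(similar, new_bill, matches):
    """Record a new bill's matches and mirror them onto the matched bills.

    similar: dict of bill_id -> {other_bill_id: score}, updated in place.
    matches: {other_bill_id: score} computed for new_bill only.
    """
    similar.setdefault(new_bill, {}).update(matches)
    for other_bill, score in matches.items():
        # Approximation: reuse the forward score instead of re-querying.
        similar.setdefault(other_bill, {})[new_bill] = score
    return similar


similar = {"116hr200": {"116s50": 31.0}}
add_with_reverse_links(similar, "116hr1500", {"116hr200": 77.5})
```

This avoids re-running all bills, at the cost of reverse scores that are borrowed rather than computed.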
We want to preprocess each bill to find 'similar' bills. The list of bills that are similar will be added to our existing 'related bills' JSON, with a new category, maybe 'es-similarity'.
The bill similarity algorithm should work something like this:
1. Create an ES section index (done).
2. For a given bill + version (e.g. data/116/dtd/116hr1500ih), break it down into individual sections and headers. We already have code for this in the ES indexing scripts, using lxml. To break up a bill into sections, we can reuse the code or method here: https://github.com/aih/FlatGov/blob/master/flatgovtools/elastic_load.py#L70
3. Get the text of each section ('section_text': etree.tostring(section, method="text", encoding="unicode")) and do a query against the index in 1, to find similar bills. The query can use the moreLikeThis search from elastic_load.py. The moreLikeThis search returns a list of (a) similar sections, (b) the bills that those sections are from, and (c) the similarity score for the current searched text. For each section in a bill we will save this information. We will need to be able to vary the number of similar bills returned (current ES default is 10) and the threshold for similar bills (default should be ~ score of 20).
We will do this in stages.
This list will be saved to the related bills JSON with the es-similarity category.
a) navigate the bill and show similar sections of other bills for each section of the bill
b) are there bills that are fully contained in other bills (i.e. all or most of the sections of bill A are very similar to sections in bill B).
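For question b), one rough way to test containment is to measure what fraction of bill A's sections have a sufficiently similar section in bill B. This is only a sketch: the input shape (one dict of best scores per section of bill A), the function name, and the threshold of 20 are assumptions:

```python
def containment(section_matches, other_bill, min_score=20.0):
    """Fraction of a bill's sections that have a similar section in other_bill.

    section_matches: list with one dict per section of bill A, mapping
    other_bill_id -> best score for that section (hypothetical shape).
    """
    if not section_matches:
        return 0.0
    covered = sum(
        1 for matches in section_matches
        if matches.get(other_bill, 0.0) >= min_score
    )
    return covered / len(section_matches)


bill_a = [
    {"116hr200": 40.0},
    {"116hr200": 25.0, "116s50": 22.0},
    {"116s50": 30.0},
    {"116hr200": 50.0},
]
in_hr200 = containment(bill_a, "116hr200")
in_s50 = containment(bill_a, "116s50")
```

A value near 1.0 would suggest that all or most of bill A appears in the other bill. What cutoff counts as "most" is left open.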
The goal of Stage 1 is that a user can just search a bill, like they do now, typing 116hr1500, and they get a table of related bills and, in addition to the current categories (title, CRS), there are bills that are listed because of their text similarity.
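The per-section query that feeds this table could be sketched roughly as below. To stay self-contained it uses the standard library's xml.etree.ElementTree in place of lxml, and it only builds the search request body rather than calling Elasticsearch. The `section` tag name, the top-level `section_text` field, the `min_term_freq` value, and both function names are assumptions; if the index stores sections as nested documents, the query would need a nested wrapper:

```python
import xml.etree.ElementTree as ET

SAMPLE_BILL = """<bill><legis-body>
  <section><enum>1.</enum> <header>Short title</header>
    <text>This Act may be cited as the Example Act.</text></section>
  <section><enum>2.</enum> <header>Definitions</header>
    <text>In this Act, the term State means each of the several States.</text></section>
</legis-body></bill>"""


def section_texts(bill_xml):
    """Yield the plain text of each <section> element in a bill XML string."""
    root = ET.fromstring(bill_xml)
    for section in root.iter("section"):
        text = ET.tostring(section, method="text", encoding="unicode")
        yield " ".join(text.split())  # collapse runs of whitespace


def more_like_this_body(text, size=10, min_score=20, field="section_text"):
    """Build a more_like_this search body with adjustable size and threshold."""
    return {
        "size": size,            # number of similar sections to return
        "min_score": min_score,  # drop hits scoring below the threshold
        "query": {
            "more_like_this": {
                "fields": [field],
                "like": text,
                "min_term_freq": 1,
            }
        },
    }


texts = list(section_texts(SAMPLE_BILL))
bodies = [more_like_this_body(t) for t in texts]
```

Each body could then be passed to the search call already used in elastic_load.py, with `size` and `min_score` exposed as the two knobs mentioned above.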