Measure bill similarity using the ES sections index #58

Closed
aih opened this issue Nov 24, 2020 · 6 comments

@aih
Collaborator

aih commented Nov 24, 2020

We want to preprocess each bill to find 'similar' bills. The list of bills that are similar will be added to our existing 'related bills' JSON, with a new category, maybe 'es-similarity'.

The bill similarity algorithm should work something like this:

  1. Create an ES section index (done)

  2. For a given bill + version (e.g. data/116/dtd/116hr1500ih), break it down into individual sections and headers. We already have code for this in the ES indexing scripts, using lxml. To break up a bill into sections, we can reuse the code or method here:

https://github.com/aih/FlatGov/blob/master/flatgovtools/elastic_load.py#L70

 sections = billTree.xpath('//section')
 headers = billTree.xpath('//header')
  3. For each section, serialize it to text ( 'section_text': etree.tostring(section, method="text", encoding="unicode")) and run a query against the index from step 1 to find similar bills. The query can use the moreLikeThis search from elastic_load.py.

The moreLikeThis search returns a list of (a) similar sections, (b) the bills that those sections are from, and (c) the similarity score for the searched text. For each section in a bill we will save this information. We will need to be able to vary the number of similar bills returned (the current ES default is 10) and the score threshold for similar bills (the default should be a score of about 20).

  4. Combine the information from step 3 to get a list of all similar bills, ranked by score. The score will be a normalized combination of the scores of all of the sections of the current bill from step 2 (e.g. data/116/dtd/116hr1500ih).
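
Steps 2 and 3 could be sketched roughly as follows. This is only an illustration under assumptions: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml, the sample bill XML is made up, and the field name sections.section_text is hypothetical (the real field name depends on the index mapping in elastic_load.py). The function only builds the search body; it does not call Elasticsearch.

```python
import xml.etree.ElementTree as ET

# Made-up miniature bill, only for illustration.
SAMPLE_BILL = """<bill>
  <legis-body>
    <section><enum>1.</enum><header>Short title</header><text>This Act may be cited as the Example Act.</text></section>
    <section><enum>2.</enum><header>Findings</header><text>Congress finds the following.</text></section>
  </legis-body>
</bill>"""


def split_sections(bill_xml):
    """Return one dict per <section>, with its header and flattened text."""
    root = ET.fromstring(bill_xml)
    sections = []
    for section in root.iter("section"):
        header = section.find("header")
        # Join all text nodes and collapse runs of whitespace.
        text = " ".join(" ".join(section.itertext()).split())
        sections.append({
            "section_header": header.text if header is not None else "",
            "section_text": text,
        })
    return sections


def more_like_this_body(section_text, size=10, min_score=20.0,
                        field="sections.section_text"):
    """Build a more_like_this search body; `field` is a hypothetical name."""
    return {
        "size": size,            # number of hits to return
        "min_score": min_score,  # drop hits scoring below the threshold
        "query": {
            "more_like_this": {
                "fields": [field],
                "like": section_text,
                "min_term_freq": 1,
            }
        },
    }


sections = split_sections(SAMPLE_BILL)
bodies = [more_like_this_body(s["section_text"]) for s in sections]
```

Both the number of hits (size) and the threshold (min_score) are plain parameters here, so they can be varied per run as described above.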

We will do this in stages.

  • Stage 1 will just get the top X similar bills and their normalized similarity scores.

This list will be saved to the related bills JSON with the es-similarity category.

  • Stage 2 will allow users to ask more detailed questions. For example:

a) navigate the bill and show similar sections of other bills for each section of the bill
b) are there bills that are fully contained in other bills (i.e. all or most of the sections of bill A are very similar to sections in bill B).

The goal of Stage 1 is that a user can just search a bill, like they do now, typing 116hr1500, and they get a table of related bills and, in addition to the current categories (title, CRS), there are bills that are listed because of their text similarity.

@aih
Collaborator Author

aih commented Nov 24, 2020

The first goal is just to have some way of finding a list of other bills that are similar, using the section similarity. It does not have to be perfect at first.

So, for example, get all sections of a bill, find all sections of other bills that have a score > 30 and list all of those as similar bills. The total similarity can be a total of the individual section scores. That is very crude and is not normalized, but is ok for a first version.
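
A minimal sketch of that crude first version, assuming the per-section search results have already been collected as lists of (bill id, score) pairs. The function name, the data shape, and the sample scores are made up for illustration.

```python
from collections import defaultdict


def rank_similar_bills(section_hits, current_bill, min_score=30.0):
    """Sum section scores above min_score per bill and rank bills by total.

    section_hits: one list of (bill_id, score) pairs per section of the
    current bill. Hits on the current bill itself are ignored.
    """
    totals = defaultdict(float)
    for hits in section_hits:
        for bill_id, score in hits:
            if bill_id != current_bill and score > min_score:
                totals[bill_id] += score
    # Highest cumulative score first; the totals are not normalized.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up scores for two sections of 116hr1500ih.
example_hits = [
    [("116hr1500ih", 95.0), ("116s200is", 42.5), ("115hr30ih", 12.0)],
    [("116s200is", 31.0), ("115hr30ih", 35.0)],
]
ranked = rank_similar_bills(example_hits, current_bill="116hr1500ih")
```

Because the totals are plain sums, longer bills with many sections will tend to score higher; that is the lack of normalization mentioned above.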

@DanielSchuman
Collaborator

I. Identifying bills that are similar

Some tentative ideas (in no particular order):
-) look at the chapter or section headings
-) look at the text, especially for the appearance of unusual words or phrases
-) narrow the search based on Library of Congress subject headings or the committees the bills are referred to (this won't always work)
-) narrow the search based on common citation patterns. Bills that amend the same laws may be related to each other
-) narrow the search based on co-sponsorship patterns

FWIW, I think this will largely need to be a brute force approach.

II. Sorting similarity in a way that's relevant to the user
a/ identical bills
a1/ bills CRS identifies as companion
a2/ bills that press releases or other entities identify as companions
a3/ bills you can connect through the associative property over multiple congresses (e.g. if 116HR50 is the companion to 116S92, and we know that 115HR30 is the predecessor to 116HR50, and we know that 115HR30 is the companion to 115S200, then we identify 116HR50 as related to 115S200)

b/ virtually identical bills
b1/ bills with virtually identical sections to other bills with virtually identical sections
b2/ bill title and co-sponsorship pattern

c/ Look at the legislative history from committee reports

d/ CRS ties the bills together in a committee report.

This requires more thinking. Adding some preliminary thoughts in case it sparks someone else's thinking.
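
One possible way to implement the chaining in item a3 is to treat each known companion or predecessor link as an edge and group bills into connected components. The sketch below uses a small union-find for that. The function name is made up, and the links are the ones from the example in a3.

```python
def related_groups(links):
    """Group bills that are connected by any chain of pairwise links."""
    parent = {}

    def find(bill):
        parent.setdefault(bill, bill)
        while parent[bill] != bill:
            parent[bill] = parent[parent[bill]]  # path halving
            bill = parent[bill]
        return bill

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for bill in list(parent):
        groups.setdefault(find(bill), set()).add(bill)
    return list(groups.values())


links = [
    ("116HR50", "116S92"),   # companion
    ("115HR30", "116HR50"),  # predecessor
    ("115HR30", "115S200"),  # companion
]
groups = related_groups(links)
```

With these three links, 116HR50 and 115S200 end up in the same group even though no single link connects them directly.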

@aih
Collaborator Author

aih commented Dec 3, 2020

Stage 1 appears to be working. It took a long time to process all bills, and in the future we will want to work on speeding up this process (maybe using a dedicated server to preprocess).

The scores are a cumulative similarity score, between sections. It would be good to add other measures, like how many sections were found to be similar between two bills, and the highest-scoring single section match (which score and which section).
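
A rough sketch of how those extra measures could be computed next to the cumulative score, assuming each section-level match is stored as a small record. The record keys (source_section, bill, score) and the sample values are made up for illustration.

```python
def summarize_matches(matches):
    """Per matched bill: cumulative score, number of distinct sections
    of the current bill that matched, and the best single match."""
    summary = {}
    for match in matches:
        entry = summary.setdefault(
            match["bill"],
            {"total_score": 0.0, "sections": set(), "best": match},
        )
        entry["total_score"] += match["score"]
        entry["sections"].add(match["source_section"])
        if match["score"] > entry["best"]["score"]:
            entry["best"] = match
    return {
        bill: {
            "total_score": entry["total_score"],
            "sections_matched": len(entry["sections"]),
            "best_score": entry["best"]["score"],
            "best_section": entry["best"]["source_section"],
        }
        for bill, entry in summary.items()
    }


# Made-up section matches for one bill.
example_matches = [
    {"source_section": "1", "bill": "116s200is", "score": 42.5},
    {"source_section": "2", "bill": "116s200is", "score": 31.0},
    {"source_section": "2", "bill": "115hr30ih", "score": 35.0},
]
measures = summarize_matches(example_matches)
```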

@DanielSchuman
Collaborator

DanielSchuman commented Dec 3, 2020 via email

@aih
Collaborator Author

aih commented Dec 3, 2020

That makes sense.

If we have 90k bills, the initial processing is 90k*90k = 8.1B comparisons. Then if each day brings ~ 200 new bills, we have 200 * 90k = 18M comparisons to do. A lot less, and maybe we don't need to optimize yet.

Edit: this calculation is off. Since we are indexing, the comparisons are not 90k*90k. For initial processing, we do 90k calculations. For new bills, we get bill similarity calculations for ~200/day, but that only adds the relation of previous bills to the current bill display table. If we also want to be able to add the current bill to the previous bill's page, we will have to re-run similarity calculations over all bills, or devise a more clever incremental comparison algorithm.
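
One simple incremental option, sketched here only as an assumption and not as what the project does: when a new bill is processed, also add it to the related list of every bill it matched, reusing the same score. Note that moreLikeThis scores are not symmetric, so the back-filled score is only an approximation of what a full re-run for the older bill would give. The function name, the data shape, and the sample values are made up.

```python
def add_new_bill(related, new_bill, similar):
    """Store the new bill's similar list and back-fill the reverse links.

    related: dict mapping bill id -> list of (bill_id, score), best first.
    similar: list of (bill_id, score) found for new_bill.
    """
    def by_score(pairs):
        return sorted(pairs, key=lambda pair: pair[1], reverse=True)

    related[new_bill] = by_score(similar)
    for other, score in similar:
        # Replace any earlier entry for new_bill, then re-sort.
        entries = [e for e in related.get(other, []) if e[0] != new_bill]
        entries.append((new_bill, score))
        related[other] = by_score(entries)
    return related


related = {"116s92is": [("116hr50ih", 80.0)]}
add_new_bill(related, "117hr10ih", [("116s92is", 55.0), ("116hr50ih", 60.0)])
```

This keeps the daily work proportional to the number of new bills, at the cost of reverse scores that are approximate until the next full run.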

@aih
Collaborator Author

aih commented Dec 10, 2020

This stage of the algorithm is working. Closing this issue.

aih closed this as completed Dec 10, 2020