Measure bill similarity using the ES sections index #58

aih · 2020-11-24T21:44:38Z

We want to preprocess each bill to find 'similar' bills. The list of bills that are similar will be added to our existing 'related bills' JSON, with a new category, maybe 'es-similarity'.

The bill similarity algorithm should work something like this:

1. Create an ES section index (done).

2. For a given bill + version (e.g. data/116/dtd/116hr1500ih), break it down into individual sections and headers. We already have code for this in the ES indexing scripts, using lxml. To break up a bill into sections, we can reuse the code or method here:

   https://github.com/aih/FlatGov/blob/master/flatgovtools/elastic_load.py#L70

    sections = billTree.xpath('//section')
    headers = billTree.xpath('//header')

3. For each section, serialize to text ('section_text': etree.tostring(section, method="text", encoding="unicode")) and query the index from step 1 to find similar bills. The query can use the moreLikeThis search from elastic_load.py.

   The moreLikeThis search returns a list of (a) similar sections, (b) the bills that those sections are from, and (c) the similarity score for the current searched text. For each section in a bill we will save this information. We will need to be able to vary the number of similar bills returned (current ES default is 10) and the threshold for similar bills (default should be a score of ~20).

4. Combine the information from step 3 to get a list of all similar bills, ranked by score. The score will be a normalized combination of the scores of all of the sections of the current bill from step 2 (e.g. data/116/dtd/116hr1500ih).
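To make the section-extraction and query steps above concrete, here is a minimal sketch. It is not the code in elastic_load.py: the stdlib xml.etree.ElementTree stands in for lxml so the snippet runs without dependencies, the sample bill XML is simplified, and the field name `section_text` in the query assumes the index stores section text under that name. `extract_sections` and `make_mlt_query` are hypothetical helper names. `size` and `min_score` are the standard Elasticsearch search-body parameters for the result count and score threshold.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a bill file such as data/116/dtd/116hr1500ih.
SAMPLE_BILL = """<bill>
  <legis-body>
    <section id="s1"><enum>1.</enum><header>Short title</header><text>This Act may be cited as the Example Act.</text></section>
    <section id="s2"><enum>2.</enum><header>Findings</header><text>Congress finds the following.</text></section>
  </legis-body>
</bill>"""


def extract_sections(bill_xml):
    """Return one dict per <section>, with its header and whitespace-normalized text."""
    root = ET.fromstring(bill_xml)
    sections = []
    for section in root.iter("section"):
        header = section.find("header")
        # Join text fragments with spaces, then collapse runs of whitespace.
        text = " ".join(" ".join(section.itertext()).split())
        sections.append({
            "section_header": (header.text or "") if header is not None else "",
            "section_text": text,
        })
    return sections


def make_mlt_query(section_text, size=10, min_score=20.0):
    """Build a more_like_this search body (hypothetical helper; field name is an assumption)."""
    return {
        "size": size,            # number of similar sections to return
        "min_score": min_score,  # drop hits scoring below the threshold
        "query": {
            "more_like_this": {
                "fields": ["section_text"],
                "like": section_text,
            }
        },
    }


sections = extract_sections(SAMPLE_BILL)
queries = [make_mlt_query(s["section_text"]) for s in sections]
```

Each query body could then be passed to the search call that elastic_load.py already uses; making `size` and `min_score` arguments keeps both knobs adjustable per run.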

We will do this in stages.

Stage 1 will just get the top X similar bills and their normalized similarity scores.

This list will be saved to the related bills JSON with the es-similarity category.
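A rough sketch of Stage 1, under stated assumptions: per-section hits arrive as (bill_id, score) pairs, each section contributes only its best hit per bill, and "normalized" means dividing the total by the number of sections in the current bill. The issue does not fix a normalization, so that choice is a guess. The layout of the related bills JSON (a dict keyed by bill id with a `reason` list) and the `es_similarity_score` key are also hypothetical.

```python
from collections import defaultdict


def rank_similar_bills(section_hits, top_x=10):
    """section_hits: one list of (bill_id, score) pairs per section of the current bill."""
    totals = defaultdict(float)
    for hits in section_hits:
        best = {}
        for bill_id, score in hits:
            # Count each other bill at most once per section (its best-matching section).
            best[bill_id] = max(score, best.get(bill_id, 0.0))
        for bill_id, score in best.items():
            totals[bill_id] += score
    n_sections = max(len(section_hits), 1)
    ranked = sorted(
        ((bill_id, total / n_sections) for bill_id, total in totals.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_x]


def add_es_similarity(related_bills, ranked):
    """Tag each similar bill with the 'es-similarity' category (hypothetical JSON layout)."""
    for bill_id, score in ranked:
        entry = related_bills.setdefault(bill_id, {})
        reasons = entry.setdefault("reason", [])
        if "es-similarity" not in reasons:
            reasons.append("es-similarity")
        entry["es_similarity_score"] = round(score, 2)
    return related_bills


hits = [
    [("116hr200ih", 40.0), ("116s50is", 25.0), ("116hr200ih", 30.0)],
    [("116hr200ih", 20.0)],
]
ranked = rank_similar_bills(hits)
related = add_es_similarity({"116s50is": {"reason": ["title"]}}, ranked)
```

Here a bill that already appears for another reason (title) keeps that reason and gains the new category alongside it.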

Stage 2 will allow users to ask more detailed questions. For example:

a) navigate the bill and show similar sections of other bills for each section of the bill
b) are there bills that are fully contained in other bills (i.e. all or most of the sections of bill A are very similar to sections in bill B).
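For question (b), one possible measure is the fraction of bill A's sections that have at least one match above the threshold in bill B. This is only a sketch: the function name, the input shape, and the threshold of 20 are assumptions, and "most of the sections" would still need a cutoff (say 0.8) chosen by experiment.

```python
def containment(section_hits, threshold=20.0):
    """section_hits: one list of (bill_id, score) pairs per section of bill A.

    Returns {other_bill_id: fraction of A's sections matched in that bill}.
    """
    n_sections = len(section_hits)
    if n_sections == 0:
        return {}
    matched = {}
    for hits in section_hits:
        # A set, so a bill is counted once per section even with several matching sections.
        bills = {bill_id for bill_id, score in hits if score >= threshold}
        for bill_id in bills:
            matched[bill_id] = matched.get(bill_id, 0) + 1
    return {bill_id: count / n_sections for bill_id, count in matched.items()}


hits_for_bill_a = [
    [("116s92is", 45.0)],
    [("116s92is", 38.0), ("116hr7ih", 22.0)],
    [("116s92is", 31.0)],
    [("116hr7ih", 12.0)],
]
fractions = containment(hits_for_bill_a)
```

A value near 1.0 suggests bill A is largely contained in the other bill; the reverse direction needs the same calculation run from the other bill's sections.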

The goal of Stage 1 is that a user can just search a bill, like they do now, typing 116hr1500, and they get a table of related bills and, in addition to the current categories (title, CRS), there are bills that are listed because of their text similarity.


aih · 2020-11-24T21:50:52Z

The first goal is just to have some way of finding a list of other bills that are similar, using the section similarity. It does not have to be perfect at first.

So, for example, get all sections of a bill, find all sections of other bills that have a score > 30 and list all of those as similar bills. The total similarity can be a total of the individual section scores. That is very crude and is not normalized, but is ok for a first version.
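The crude first version described here might look like the following sketch. The input shape, (bill_id, score) pairs per section, is an assumption about how the moreLikeThis results get flattened.

```python
def crude_similarity(section_hits, min_score=30.0):
    """Sum every section score above min_score per bill; no normalization."""
    totals = {}
    for hits in section_hits:
        for bill_id, score in hits:
            if score > min_score:
                totals[bill_id] = totals.get(bill_id, 0.0) + score
    # Highest cumulative score first.
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


similar = crude_similarity([
    [("116hr200ih", 45.0), ("116s50is", 31.0)],
    [("116hr200ih", 35.0), ("116s50is", 29.0)],
])
```

Because the totals are unnormalized, longer bills will tend to score higher simply by having more sections.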

DanielSchuman · 2020-11-25T22:03:19Z

I. Identifying bills that are similar

Some tentative ideas (in no particular order):
-) look at the chapter or section headings
-) look at the text, especially for the appearance of unusual words or phrases
-) narrow the search based on Library of Congress subject headings or the committees the bills are referred to (this won't always work)
-) narrow the search based on common citation patterns. Bills that amend the same laws may be related to each other
-) narrow the search based on co-sponsorship patterns

FWIW, I think this will largely need to be a brute force approach.

II. Sorting similarity in a way that's relevant to the user
a/ identical bills
a1/ bills CRS identifies as companion
a2/ bills that press releases or other entities identify as companions
a3/ bills you can connect transitively across multiple Congresses (e.g., if 116HR50 is the companion to 116S92, 115HR30 is the predecessor to 116HR50, and 115HR30 is the companion to 115S200, then we identify 116HR50 as related to 115S200)

b/ virtually identical bills
b1/ bills with virtually identical sections to other bills with virtually identical sections
b2/ bill title and co-sponsorship pattern

c/ Look at the legislative history from committee reports

d/ CRS ties the bills together in a committee report.
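The chaining in a3/ above amounts to finding everything reachable through known companion and predecessor links. A minimal sketch, assuming those links are available as pairs of bill ids (the function name and input format are hypothetical):

```python
from collections import defaultdict, deque


def connected_bills(links, start):
    """Return every bill reachable from `start` through companion/predecessor links."""
    graph = defaultdict(set)
    for a, b in links:
        # Treat each link as undirected: related in either direction.
        graph[a].add(b)
        graph[b].add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        bill = queue.popleft()
        for neighbor in graph[bill]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    seen.discard(start)
    return seen


links = [
    ("116hr50", "116s92"),   # companions
    ("115hr30", "116hr50"),  # predecessor
    ("115hr30", "115s200"),  # companions
]
related_to_hr50 = connected_bills(links, "116hr50")
```

Treating every link as equally strong is a simplification; a long chain of weak links could connect bills that have drifted apart, so it may be worth capping the number of hops.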

This requires more thinking. Adding some preliminary thoughts in case it sparks someone else's thinking.

aih · 2020-12-03T16:45:08Z

Stage 1 appears to be working. It took a long time to process all bills, and in the future we will want to work on speeding up this process (maybe using a dedicated server to preprocess).

The scores are a cumulative similarity score, between sections. It would be good to add other measures, like how many sections were found to be similar between two bills, and the highest-scoring single section match (which score and which section).
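The extra measures could be collected alongside the cumulative score in a single pass. This is a sketch only: the input shape and the key names in the summary are assumptions, and it counts each other bill once per section using its best-scoring hit, which may differ from how the current cumulative score is computed.

```python
def summarize_matches(section_hits):
    """section_hits: {section_label: [(bill_id, score), ...]} for the current bill."""
    summary = {}
    for section, hits in section_hits.items():
        best = {}
        for bill_id, score in hits:
            best[bill_id] = max(score, best.get(bill_id, 0.0))
        for bill_id, score in best.items():
            stats = summary.setdefault(bill_id, {
                "total": 0.0,           # cumulative score
                "sections_matched": 0,  # how many sections found a match
                "top_score": 0.0,       # highest single section match
                "top_section": None,    # which section produced it
            })
            stats["total"] += score
            stats["sections_matched"] += 1
            if score > stats["top_score"]:
                stats["top_score"] = score
                stats["top_section"] = section
    return summary


summary = summarize_matches({
    "Sec. 1": [("116hr200ih", 40.0)],
    "Sec. 2": [("116hr200ih", 55.0), ("116s50is", 22.0)],
})
```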

DanielSchuman · 2020-12-03T17:13:21Z

I wonder whether we can preprocess and save the results for all prior congresses. That way we only need to process new bills as they are introduced.


aih · 2020-12-03T17:19:47Z

That makes sense.

If we have 90k bills, the initial processing is 90k*90k = 8.1B comparisons. Then if each day brings ~ 200 new bills, we have 200 * 90k = 18M comparisons to do. A lot less, and maybe we don't need to optimize yet.

Edit: this calculation is off. Since we are indexing, the comparisons are not 90k*90k. For initial processing, we do 90k calculations. For new bills, we get bill similarity calculations for ~200/day, but that only adds the relation of previous bills to the current bill display table. If we also want to be able to add the current bill to the previous bill's page, we will have to re-run similarity calculations over all bills, or devise a more clever incremental comparison algorithm.
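For reference, the two estimates side by side. The 90k and 200/day figures are from the comments above; the average of 10 sections per bill is a made-up placeholder to show that each bill lookup is really one query per section.

```python
TOTAL_BILLS = 90_000
NEW_BILLS_PER_DAY = 200
AVG_SECTIONS_PER_BILL = 10  # hypothetical, for illustration only

# Naive pairwise comparison (the original estimate).
pairwise_initial = TOTAL_BILLS * TOTAL_BILLS      # 8.1 billion
pairwise_daily = NEW_BILLS_PER_DAY * TOTAL_BILLS  # 18 million

# With the ES index, the work scales with the number of bills queried.
indexed_initial = TOTAL_BILLS
indexed_daily = NEW_BILLS_PER_DAY

# Each bill lookup issues one moreLikeThis query per section.
queries_initial = indexed_initial * AVG_SECTIONS_PER_BILL
queries_daily = indexed_daily * AVG_SECTIONS_PER_BILL
```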

aih · 2020-12-10T16:26:39Z

This stage of the algorithm is working. Closing this issue.

aih assigned kapphire Nov 24, 2020

aih mentioned this issue Dec 3, 2020

Compare results from 'related bills' to 'similar bills' #70

Closed

aih closed this as completed Dec 10, 2020
