Columbia importer updated #2865

quevon24 · 2023-07-06T18:37:17Z

This PR contains the updated version of the Columbia importer. The main changes are:

Update the codebase to match Python 3.11 style
Replace deprecated functions
Add typing
Remove the court regex and use courts-db to find courts (we may need to update courts-db for the tests to pass, see PR 74)
Replace etree with Beautiful Soup to parse XML files
Store opinions in the correct order
Store opinion footnotes
Find duplicates using citation, docket number, case name, and opinion content
Add citations when a duplicate is found
Store the syllabus
Pass a CSV file path as an argument, with absolute paths to XML files
If there is a possible match, only log a message and abort the import of that file instead of adding data to the matched cluster, so that the logs can be reviewed manually
Default XML directory: /opt/courtlistener/_columbia
Default CSV location: /opt/courtlistener/_columbia/columbia_import.csv
Log all messages to a file so that they can be reviewed manually without needing to read the container logs
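
The "log and abort on a possible match" behaviour, combined with logging to a file, could look roughly like the sketch below. This is not the importer's actual code: `build_file_logger`, `import_file`, and the `matched_cluster_id` parameter are hypothetical names, and the lookup that produces a match is left out entirely.

```python
import logging
from pathlib import Path
from typing import Optional


def build_file_logger(log_path: Path) -> logging.Logger:
    """Create a logger that writes only to log_path (hypothetical helper)."""
    logger = logging.getLogger("columbia_import_sketch")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    logger.propagate = False  # keep these messages out of the root logger
    return logger


def import_file(
    filepath: str,
    matched_cluster_id: Optional[int],
    logger: logging.Logger,
) -> bool:
    """Return True if the file should be imported, False if it was skipped."""
    if matched_cluster_id is not None:
        # Possible duplicate: record it for manual review, leave the cluster alone.
        logger.info(
            "Match found with cluster id: %s for columbia file: %s",
            matched_cluster_id,
            filepath,
        )
        return False
    return True
```

Returning a boolean keeps the skip decision visible to the caller, so a real command could count skipped files without parsing the log.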

Based on the data in local_path in the Opinion model, ~1.2M files have to be imported. The number could be lower because some of the cases in this list of files have already been imported from a different source.

Usage:

Import using a CSV file whose entries point to XML files in the mounted directory:
docker-compose -f docker/courtlistener/docker-compose.yml exec cl-django python manage.py import_columbia --csv /opt/courtlistener/cl/assets/media/testfile.csv

Csv example:

filepath
michigan/supreme_court_opinions/documents/d5a484f1bad20ba0.xml
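
As a rough illustration of how a CSV in this shape could be turned into full XML paths, here is a minimal sketch. It assumes a single `filepath` column as in the example above; `resolve_xml_paths` is a hypothetical helper, not the command's real implementation, and the default directory is the one listed in the description.

```python
import csv
import io
from pathlib import Path
from typing import Iterable, List

# Default taken from the PR description; override it to mimic --xml-dir.
DEFAULT_XML_DIR = Path("/opt/courtlistener/_columbia")


def resolve_xml_paths(
    csv_lines: Iterable[str], xml_dir: Path = DEFAULT_XML_DIR
) -> List[Path]:
    """Read the `filepath` column and join each entry with xml_dir."""
    reader = csv.DictReader(csv_lines)
    paths = []
    for row in reader:
        value = (row.get("filepath") or "").strip()
        if not value:
            continue  # skip blank rows
        # pathlib keeps `value` unchanged when it is already absolute.
        paths.append(xml_dir / value)
    return paths


sample = io.StringIO(
    "filepath\n"
    "michigan/supreme_court_opinions/documents/d5a484f1bad20ba0.xml\n"
)
for path in resolve_xml_paths(sample):
    print(path)
```

Because joining with an absolute path returns that path as-is, the same helper would accept either relative or absolute entries.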

Import specifying the mounted directory where the XML files are located:
docker-compose -f docker/courtlistener/docker-compose.yml exec cl-django python /opt/courtlistener/manage.py import_columbia --csv /opt/courtlistener/cl/assets/media/files_to_import.csv --xml-dir /opt/courtlistener/columbia_files


quevon24 · 2024-05-22T17:35:38Z

@grossir, when you have time, could you take a look?

This is a sample file to test the command:

random_sample_1.zip

To run the command, copy the zip contents to cl/assets/media:

docker-compose -f docker/courtlistener/docker-compose.yml exec cl-django python /opt/courtlistener/manage.py import_columbia --csv /opt/courtlistener/cl/assets/media/random_sample_1.csv --xml-dir /opt/courtlistener/cl/assets/media/random_sample_1

If you have any questions, let me know.

grossir

It ingested ~766 dockets/opinion clusters out of 1000 documents.
I ran the script 4 times and got 6 triplicated ingestions; maybe you can check in your environment too. I used this query to detect them:

SELECT
    so.local_path, so.html_columbia, count(*)
FROM
    search_docket sd
INNER JOIN search_opinioncluster oc
    ON oc.docket_id = sd.id
INNER JOIN search_opinion so
    ON so.cluster_id = oc.id
WHERE
    sd.date_created::date = '2024-05-23'::date
GROUP BY so.local_path, so.html_columbia
HAVING count(*) > 1;

I left some comments, mostly ideas for improvements.

I haven't really tested the matching algorithms beyond the most basic duplication. If you could send another sample file, sampled from the most recent opinions (so that it is easier to search for them on the web pages or scrape them), it would help test those parts.

grossir · 2024-05-24T00:05:58Z

cl/corpus_importer/management/commands/import_columbia.py

+    }
+
+    # Add date data into columbia dict
+    columbia_data.update(find_dates_in_xml(soup))


I think there are some missing "FILED_TAGS" strings in columbia_utils.py.
For example, for texas/court_opinions/documents/5c8dba31985162bf.xml in the sample, the following date is parsed: [[('opinion issued', datetime.date(2006, 10, 23))]], but it is not assigned to the "date_filed" key.

Looking at the raw text I think it should be considered as "date_filed":

<opinion> <reporter_caption><center>IN RE LARREW, 05-06-01227-CV (Tex.App.-Dallas 10-23-2006)</center></reporter_caption> <caption><center>IN RE STEPHEN JAMES LARREW, Relator.</center></caption> <docket><center>No. 05-06-01227-CV</center></docket><court><center>Court of Appeals of Texas, Fifth District, Dallas.</center></court> <date><center>Opinion issued October 23, 2006.</center>

I see that "opinion issued" is included in ARGUED_TAGS rather than in FILED_TAGS; I am not sure about the logic for this.
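
To make the suggestion concrete, here is a minimal sketch of how a parsed label could be routed to "date_filed". The tag tuples below are hypothetical stand-ins: the real FILED_TAGS and ARGUED_TAGS live in columbia_utils.py and their contents are not shown in this thread. `assign_dates` is also a hypothetical helper, and the nested input shape only mirrors the parsed value quoted above.

```python
import datetime
from typing import Dict, List, Optional, Tuple

# Hypothetical contents, for illustration only.
FILED_TAGS = ("filed", "opinion issued", "opinion delivered")
ARGUED_TAGS = ("argued", "submitted")

ParsedDates = List[List[Tuple[str, datetime.date]]]


def assign_dates(parsed: ParsedDates) -> Dict[str, Optional[datetime.date]]:
    """Map (label, date) pairs onto date_filed / date_argued keys."""
    result: Dict[str, Optional[datetime.date]] = {
        "date_filed": None,
        "date_argued": None,
    }
    for group in parsed:
        for label, date in group:
            key = label.strip().lower()
            if key in FILED_TAGS:
                result["date_filed"] = date
            elif key in ARGUED_TAGS:
                result["date_argued"] = date
            # Any other label is ignored in this sketch.
    return result


parsed = [[("opinion issued", datetime.date(2006, 10, 23))]]
print(assign_dates(parsed))
```

With "opinion issued" in the filed list, the Texas example above would get a date_filed value instead of being left without one.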

I collected all the documents that have dates but no date filed. Some are obviously not OpinionCluster.date_filed, like "case announcements and administrative actions", but I am not so sure about some of the others:

{'texas/court_opinions/documents/5c8dba31985162bf.xml': 'opinion issued', 'arkansas/court_opinions/documents/ae218d6345f5d320.xml': 'opinion delivered', 'texas/court_opinions/documents/5f4fe3e1c4e72785.xml': 'opinion delivered', 'arkansas/court_opinions/documents/96742836d45c4996.xml': 'opinion delivered', 'arkansas/court_opinions/documents/d218fb45d4055bdd.xml': 'opinion delivered', 'maryland/court_of_appeals_opinions/documents/61198adb4b840f4d.xml': 'denied', 'arkansas/court_opinions/documents/5793152fb3e371a3.xml': 'opinion delivered', 'maryland/court_of_appeals_opinions/documents/cda2e7a6c083f661.xml': 'denied', 'texas/court_opinions/documents/f7f4eb4e0bb7e71a.xml': 'opinion delivered and filed', 'connecticut/appellate_court_opinions/documents/e3e9aa07cc97f60f.xml': 'officially released', 'arkansas/court_opinions/documents/0b027f05aa07c2af.xml': 'opinion delivered', 'texas/court_opinions/documents/248981bf18493e9d.xml': 'opinion delivered', 'arkansas/court_opinions/documents/a968b68353ffe980.xml': 'opinion delivered', 'arkansas/court_opinions/documents/ebdc8da5b2ec8fe9.xml': 'opinion delivered', 'michigan/supreme_court_opinions/documents/63efa26d555875ea.xml': 'leave to appeal denied', 'texas/court_opinions/documents/d4a6653c3a7c08fe.xml': 'delivered', 'maryland/court_of_appeals_opinions/documents/69cb6658d5b0324d.xml': 'granted', 'texas/court_opinions/documents/0904c0a3016f8421.xml': 'delivered', 'texas/court_opinions/documents/a43217e67bd08858.xml': 'opinion issued', 'texas/court_opinions/documents/c48edff93471911d.xml': 'opinion issued', 'ohio/court_opinions/documents/52a07db0c124634f.xml': 'case announcements and administrative actions', 'arkansas/court_opinions/documents/e628b04ac0dcd6f1.xml': 'opinion delivered', 'maryland/court_of_appeals_opinions/documents/9067dae3cd6d312e.xml': 'denied', 'arkansas/court_opinions/documents/300ebbd01ba38398.xml': 'opinion delivered', 'maryland/court_of_appeals_opinions/documents/217ae38fdf9869af.xml': 'denied', 
'arkansas/court_opinions/documents/9e3f71089f9d11dc.xml': 'opinion delivered', 'maryland/court_of_appeals_opinions/documents/6600fe895d37d853.xml': 'denied', 'arkansas/court_opinions/documents/2c71f85af35b9e0f.xml': 'opinion delivered', 'texas/court_opinions/documents/cecfdd58268e8f07.xml': 'opinion issued', 'texas/court_opinions/documents/60a231f3da6a421f.xml': 'memorandum opinion delivered and filed', 'arkansas/court_opinions/documents/d37e7ba255a67a6d.xml': 'opinion delivered', 'texas/court_opinions/documents/b78271984621969a.xml': 'opinion issued', 'maryland/court_of_appeals_opinions/documents/4b97a5803331bb29.xml': 'denied', 'maryland/court_of_appeals_opinions/documents/9eefe6f3e03131f7.xml': 'denied', 'massachusetts/superior_court_opinions/documents/161739ca6ca6348b.xml': 'memorandum dated', 'maryland/court_of_appeals_opinions/documents/6ac77c8a8002a723.xml': 'denied', 'texas/court_opinions/documents/3bcee6268dd18a72.xml': 'opinion issued', 'texas/court_opinions/documents/4d909c6b7d4de7e4.xml': 'opinion delivered', 'arkansas/court_opinions/documents/8ddc4fe19662d9fb.xml': 'opinion delivered', 'arkansas/court_opinions/documents/9d7c8e94e2c2b40f.xml': 'opinion delivered', 'texas/court_opinions/documents/2db17b19d30d85df.xml': 'opinion issued', 'arkansas/court_opinions/documents/3dd26fb70896c79b.xml': 'opinion delivered', 'arkansas/court_opinions/documents/ab59ead0feee789f.xml': 'opinion delivered', 'maryland/court_of_appeals_opinions/documents/9d65b83825eed85f.xml': 'denied', 'michigan/supreme_court_opinions/documents/1287940f26660dfa.xml': 'summary dispositions', 'arkansas/court_opinions/documents/6ae7cc75fc0cd311.xml': 'opinion delivered', 'connecticut/appellate_court_opinions/documents/96cb6396c50954b6.xml': 'officially released', 'connecticut/appellate_court_opinions/documents/4df558796ec8e60c.xml': 'decision released'}
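
One way to triage a mapping like this is to count how often each label occurs, so the most frequent candidates for FILED_TAGS stand out. The sketch below uses only five entries copied from the dictionary above; `label_counts` and `labels_by_state` are hypothetical helper names.

```python
from collections import Counter, defaultdict
from typing import Dict, Set

# A small subset of the mapping above, copied verbatim.
missing_date_filed = {
    "texas/court_opinions/documents/5c8dba31985162bf.xml": "opinion issued",
    "arkansas/court_opinions/documents/ae218d6345f5d320.xml": "opinion delivered",
    "texas/court_opinions/documents/5f4fe3e1c4e72785.xml": "opinion delivered",
    "maryland/court_of_appeals_opinions/documents/61198adb4b840f4d.xml": "denied",
    "ohio/court_opinions/documents/52a07db0c124634f.xml": "case announcements and administrative actions",
}


def label_counts(mapping: Dict[str, str]) -> Counter:
    """Count how many files carry each date label."""
    return Counter(mapping.values())


def labels_by_state(mapping: Dict[str, str]) -> Dict[str, Set[str]]:
    """Group labels by the first path segment (the state directory)."""
    grouped: Dict[str, Set[str]] = defaultdict(set)
    for path, label in mapping.items():
        grouped[path.split("/", 1)[0]].add(label)
    return dict(grouped)


for label, count in label_counts(missing_date_filed).most_common():
    print(count, label)
```

Grouping by state also shows whether a label is specific to one jurisdiction, which may matter when deciding if it should count as a filing date.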


quevon24 · 2024-06-04T17:07:42Z

I implemented three small tweaks to reduce the number of duplicates:

Use the SHA1 of the XML file to find cases already imported into the system from the same source and skip them.
When the opinion source is Harvard, the content sometimes ends with a <page_number> tag. Removing that tag increases the accuracy of the opinion content match only slightly, but enough to match Columbia's opinion content.
Sometimes CL has the same opinion content as Columbia, but with some extra data (for example, when the XML file was matched with a Lawbox opinion; Lawbox opinions include metadata such as citation, case name, court, and docket number in the opinion content). One of the algorithms checks whether a given opinion is a subset of the other, and adding an extra condition helps to overcome that issue.

Besides that, I found that in some cases we can have a match (same filed date, citation, docket number, and court) while the opinion content is largely different, for example with this file:
e32cc12d6481ddab.xml.zip
and the cluster: https://www.courtlistener.com/opinion/1599823/go/

In CL we have:

1 So.3d 181 (2009)
MORALES
v.
McNEIL.
No. 1D08-1497.
District Court of Appeal of Florida, First District.

January 21, 2009.
Decision without published opinion. Affirmed.

But in the XML file we have:

AFFIRMED.
BARFIELD, ALLEN, and THOMAS, JJ., CONCUR.
NOT FINAL UNTIL TIME EXPIRES TO FILE MOTION FOR REHEARING AND DISPOSITION THEREOF IF FILED.

Even if we analyze the Lawbox structure to remove metadata, the opinion content is so different that the algorithms that compare the opinions will fail.
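
To give a feel for how little the two texts share, here is a small word-overlap check on the exact strings quoted above. It uses a plain Jaccard score as an illustrative stand-in; it is not the match_based_text function or any other algorithm the importer uses, and `tokens` and `jaccard` are hypothetical helpers.

```python
import re
from typing import Set

CL_TEXT = (
    "1 So.3d 181 (2009) MORALES v. McNEIL. No. 1D08-1497. "
    "District Court of Appeal of Florida, First District. "
    "January 21, 2009. Decision without published opinion. Affirmed."
)
XML_TEXT = (
    "AFFIRMED. BARFIELD, ALLEN, and THOMAS, JJ., CONCUR. "
    "NOT FINAL UNTIL TIME EXPIRES TO FILE MOTION FOR REHEARING "
    "AND DISPOSITION THEREOF IF FILED."
)


def tokens(text: str) -> Set[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Shared tokens divided by all distinct tokens in both texts."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


print(sorted(tokens(CL_TEXT) & tokens(XML_TEXT)))
print(round(jaccard(CL_TEXT, XML_TEXT), 3))
```

The only token the two versions have in common is "affirmed", so any content-based comparison would struggle here regardless of how the Lawbox metadata is stripped.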

I already discussed this with @flooie, and he mentioned that creating a few duplicates is not a problem. There are already duplicates in the system, so later we will need to implement a specialized command to merge or eliminate them.

I used some random data from the columbia merger matches to try to refine the matching process as much as possible to reduce duplicates.

It can be tested by cloning these clusters, @grossir:

docker exec -it cl-django python /opt/courtlistener/manage.py clone_from_cl --type search.OpinionCluster --id 1053004 1067190 1164370 1170784 1275317 1296066 1397507 1549674 1580680 1584677 1599823 1642902 1709280 1731105 1737294 1755415 1759137 1769728 1820817 1919526 1920946 2064181 2066843 2076964 2081374 2101230 2103734 2122631 2132495 2133675 2153727 2161907 2174492 2183086 2200571 2213250 2345270 2381555 2402588 2623236 2634645 2879561 3128761 4911658 4970197 5058966 5229804 5241280 5277538 5332211 5346849 5408815 5468476 5471787 5492419 5509380 5509866 5527155 5546291 5553255 5558539 5569133 5682554 5748198 5816741 5852819 5899451 5956020 6014385 6054446 6071818 6096538 6200591 6390061 6500612 6568509 6671890 6867776 6894438 7161052 7260227 7263229 7384628 7495412 7512606 7513120 7575587 7590136 7611798 7612979 7623609 7635344 7648384 7650369 7668072 7760901 7777728 7847184 7912915 8885056

then putting these files in cl/assets/media/random_sample_2:
random_sample_2.zip

and then running the command with this file:
random_sample_2.csv

docker-compose -f docker/courtlistener/docker-compose.yml exec cl-django python /opt/courtlistener/manage.py import_columbia --csv /opt/courtlistener/cl/assets/media/random_sample_2.csv --xml-dir /opt/courtlistener/cl/assets/media/random_sample_2

grossir

The importer is working; the duplicates issue is gone.

As a general comment, I only got 5 new opinions ingested out of the 100 test cases. Most of the others, which do have matches, were left aside for manual review with a message like "Match found with cluster id: 2161907 for columbia file: california/court_of_appeal_opinions/documents/59ca503825f41539.xml". But will it be feasible to manually check 95% of the dataset, if this sample is somewhat representative?

grossir · 2024-07-15T18:07:09Z

cl/corpus_importer/management/commands/import_columbia.py

-            help="If set, will run through the directories and files in random "
-            "order.",
+        inner_opinion_tags = all_opinions_soup.find_all()
+        if inner_opinion_tags and inner_opinion_tags[-1].name == "page_number":


Some opinions have more than one <page_number> tag. Why not get rid of all of those tags instead of only the last one?
For example, arizona/court_opinions/documents/e574c2908e4b3f3d.xml has 3 tags.
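
Removing every <page_number> tag could look roughly like this. Note that this sketch swaps in a standard-library regular expression so it runs without extra dependencies; the PR itself works on a Beautiful Soup tree, where the equivalent would be to loop over the matching tags and remove each one. `strip_page_numbers` is a hypothetical helper.

```python
import re

# Matches an opening <page_number> tag, its content, and the closing tag.
PAGE_NUMBER_RE = re.compile(
    r"<page_number\b[^>]*>.*?</page_number>", re.DOTALL
)


def strip_page_numbers(xml_text: str) -> str:
    """Remove all <page_number> elements, not only a trailing one."""
    return PAGE_NUMBER_RE.sub("", xml_text)


sample = (
    "<p>First<page_number>Page 1</page_number> second</p>"
    "<page_number>Page 2</page_number>"
)
print(strip_page_numbers(sample))
```

A regex like this assumes the tags are never nested, which seems reasonable for page markers but is still an assumption about the data.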


quevon24 added 14 commits June 13, 2023 18:34

fix(import_columbia): update code to import xml files from columbia

49c986b

fix(import_columbia): change ElementTree to BeautifulSoup

2445701

fix(import_columbia): handle per curiam opinions

b2f9cb7

fix(import_columbia): handle per curiam opinions

c46a34e

fix(import_columbia): search for duplicates

9eea25a

remove unused code remove unused imports

fix(import_columbia): Use courts-db to find courts

41ab528

Reduce amount of files used Use match_based_text function from harvard to find duplicated content Typing added

fix(import_columbia): add footnotes to each opinion

932682c

fix(import_columbia): handle multiple court matches

eee80ab

log message when court doesn't exist in courtlistener

fix(import_columbia): handle tag mismatch

687f0c6

handle duplicate citations in xml

fix(import_columbia): Verify opinion author tag

836501d

Log message when case has no citations Handle single volume nominative reporters

fix(import_columbia): Store cleaned docket number

48311ce

fix(import_columbia): Store cleaned docket number

10052aa

fix(import_columbia): fix command options

4d3a0af

fix typing

fix(import_columbia): fix tests for columbia importer

510b07a

quevon24 added the enhancement label Jul 6, 2023

quevon24 requested a review from flooie July 6, 2023 18:37

quevon24 self-assigned this Jul 6, 2023

semgrep-app bot reviewed Jul 6, 2023


quevon24 and others added 2 commits July 6, 2023 12:38

Merge branch 'main' into update-columbia-importer

7afb6db

fix(import_columbia): fix semgrep warning

3f6c9c5

quevon24 marked this pull request as draft July 15, 2023 01:08

quevon24 added 9 commits July 14, 2023 19:09

fix(import_columbia): update code to import xml files from columbia

8d1cbab

fix(import_columbia): change ElementTree to BeautifulSoup

33391e0

fix(import_columbia): handle per curiam opinions

0be43d7

fix(import_columbia): handle per curiam opinions

6481b2e

fix(import_columbia): search for duplicates

3ebd385

remove unused code remove unused imports

fix(import_columbia): Use courts-db to find courts

852f244

Reduce amount of files used Use match_based_text function from harvard to find duplicated content Typing added

fix(import_columbia): add footnotes to each opinion

8810ed1

fix(import_columbia): handle multiple court matches

3eda0f1

log message when court doesn't exist in courtlistener

fix(import_columbia): handle tag mismatch

062c833

handle duplicate citations in xml

quevon24 added 7 commits May 7, 2024 19:02

Merge branch 'main' into update-columbia-importer

04a9916

Merge branch 'main' into update-columbia-importer

a9afbe1

Merge branch 'main' into update-columbia-importer

27359e7

Merge branch 'main' into update-columbia-importer

42e02c5

Merge branch 'main' into update-columbia-importer

e57cfe9

feat(columbia_importer): small tweaks

2dcf335

Merge branch 'main' into update-columbia-importer

1e731ff

quevon24 requested a review from grossir May 22, 2024 17:32

grossir reviewed May 24, 2024


quevon24 added 8 commits May 27, 2024 11:23

Merge branch 'main' into update-columbia-importer

ead2baa

feat(columbia_importer): fix creation of duplicates

7edfaf0

update comments and logging messages

Merge branch 'main' into update-columbia-importer

ec28b10

Merge branch 'main' into update-columbia-importer

22c3842

feat(columbia_importer): update failing test

45f2323

Merge remote-tracking branch 'origin/update-columbia-importer' into u…

bd88f24

…pdate-columbia-importer

feat(columbia_importer): small tweaks reduce duplicates

2a28bcc

Merge branch 'main' into update-columbia-importer

dec1000

quevon24 added 2 commits June 5, 2024 08:31

Merge branch 'main' into update-columbia-importer

49c507e

Merge branch 'main' into update-columbia-importer

bba40fc

grossir reviewed Jul 15, 2024


quevon24 added 8 commits August 22, 2024 11:15

Merge remote-tracking branch 'origin/main' into update-columbia-importer

020164d

# Conflicts: # cl/corpus_importer/import_columbia/html_test.py

fix(columbia_importer): remove html_test.py

0a67a53

Merge branch 'main' into update-columbia-importer

bf8408f

feat(columbia importer): add order field to opinions

7f2b311

Merge branch 'main' into update-columbia-importer

9fe1339

Merge branch 'main' into update-columbia-importer

2ea296b

Merge branch 'main' into update-columbia-importer

f49acb5

Merge branch 'main' into update-columbia-importer

cdae551
