Improve alt text by providing document text #230
So a couple quick ways we could do better than just using whatever text we get back from pdftotext:
Anyway, just some thoughts for trying to do this well. |
This would be useful for our blind users, I think, but I'm going to move it to our volunteer backlog. Having to click through to the PDF itself or to our website is a pretty normal thing to have to do whether you're blind or not, and this is a good fit for a volunteer in terms of complexity. |
I'd like to work on this. I'm a long-time software engineer (since the 80s!) but my Python tooling is rusty. A little background: My main languages have been OO, starting with Smalltalk-80, most recently Ruby; also Python, Scala, and many more. I'm a big proponent of SOLID & Patterns, and have been following Kent Beck since the 90s. I've done a huge amount of analysis & design, and project management, plus lots of documentation and presentations (C-level, developers, end users). I have this up and running locally in PyCharm and have done the simple, minimal change ("Thumbnail of page X of the PDF linked above."). I did that in order to get this running locally and to initially explore the code.
Thanks! |
Hi, thanks for taking this on! I'm really glad to have help with it. A few responses:
That's great. I really like small initial PR's. If you want to submit that alone, @ERosendo (the lead dev for this project) can give it a review.
We have a Slack group I can invite you to, but why don't we stick with async:
Great!
If you want to get those thoughts going here now, feel free, but your sequencing sounds great too! |
Just FYI, I just found your account in CL and gave you access to some more of the API. @ERosendo pointed out that you'll probably need the access to complete this issue. :) |
I'd like to change the alt text for an image. That also removes "full" so that the alt text is a wee bit shorter. |
Sure, why not! :) |
So now I want to understand more about the next steps for alt text for PDFs. (Obviously I'm still learning a bit about the domain (legal stuff) & terms, and the systems & codebase.) From above:
(just taking some of the easy ones)
Yes, that text, which is not always blue and not always at the top of the page. It is substantially duplicative of the metadata that the bot puts in the text of the tweet (document number; description or type of description (depending); and the date is nominally implied, except when it's not) …
It sounds like it, but handling the first page of document is going to be tricky, because there are multiple frames of text that may not come out quite right sequentially (although looking at, e.g. https://www.courtlistener.com/docket/67271062/52/walt-disney-parks-and-resorts-us-inc-v-desantis/ as my test case, it seems mostly ok), and there's a lot of case metadata that should probably not appear in the alt text. The case caption, for instance. If it's a West Coast style court (e.g. California), the names and addresses of the filing attorneys appear above the caption block on the front page. What really should be in the alt text is the first full paragraph of the filing, and perhaps subsequent paragraphs. I guess there are a lot of options as to how ambitious you want to be! |
I 100% agree that we (perhaps that's me) need to get some examples together. I agree that the text for the first page may be tricky, but I don't think that any text -- beyond that PACER header -- should be omitted. Sighted users will the filing attorneys, etc, and so should vision-impaired users. It's tricky, I know, because we have to limit the size of the alt-text so it's reasonable (alt-text should be succinct), but hopefully provide enough information so that the content and context are clear. Ultimately we can make some guesses, but having some visually impaired legal folks to give us feedback would be great. (For all I know, one or all of you fall into that group. :-) ) |
Yeah, it's not as helpful as I'd hoped, but I guess it's a start.
Yes, that should do it.
I'm not sure it's worth trying to figure out page breaks. That's a pretty difficult or at least annoying task. I'd say just load up the alt text with as much as it can handle. So, if page 1 actually has 500 chars, but Twitter allows 1000, just put the first 1000 chars with thumbnail one, and chars 1,001-2000 with thumbnail 2, etc, ignoring which page goes with which text. I actually think sighted people would even find this useful and it loads the tweet to the max with the small downside that the pagination isn't spot on (who cares?). John says:
@weedySeaDragon replies:
I'm pretty sure I'm with John on this. Most legal docs start out pretty much the same way, listing a bunch of junk nobody really reads. It's easy to skip when you're sighted, but it'd get pretty old to have to read it in the alt text, if we can help it. For example, some docs use a stack of
I certainly do, but if we load up the alt text as I described above, then the alt text on thumbnail one would be the place to start reading. I agree about finding some visually impaired folks. I'll post this thread and see if anybody replies. |
Oops, I typed a long reply an hour ago and failed to hit the Comment button. This is slightly redundant with Mike's comments, but not entirely.
Well, the rationale for omitting the case caption is the same as the header — we already summarize that in the tweet text. Also, sometimes, there will be a clerk's filestamp/timestamp/datestamp which is both hard to read (because it is stamped in ink and not laser printed) and mostly irrelevant. If you really care about the exact date the piece of paper was turned over across the counter, you're probably not going to be looking at the alt text.
I think you dropped a word here and I'm not sure what it was. Generally sighted users don't care about the filing attorneys, and certainly not their office addresses. To the extent they do, it's a lot less relevant than the text of the motion, and it is available in other places (e.g. the Parties tab on Courtlistener). Also it doesn't tend to change from filing to filing, so it's not what Big Cases is about, which is breaking news / market-moving information. (I exaggerate, but only a little). For instance, take https://www.courtlistener.com/docket/6639860/458/in-re-macbook-keyboard-litigation/ where the counsel name/addresses push the first paragraph onto page 2. Basically nobody cares about that info (and the courts probably regret their local practice on this, which predated electronic filing…). And that's not even a worst case. How about https://www.courtlistener.com/docket/7067512/1139/in-re-facebook-inc-consumer-privacy-user-profile-litigation/? There paragraph 1 doesn't even appear in the first four pages because the table of contents eats it. Perhaps that suggests if we are parsing the text we might change which pages we show? And the first page has … little of value, although it does have the Title. Or maybe https://www.courtlistener.com/docket/17084894/1/zepeda-rivas-v-jennings/ is another perverse example?
Well, the point of the bot is to let people know about breaking developments in cases. The metadata on the first page is generally always the same from document to document within a given case, with the exception of the date and the title of the filing. |
Excellent educational info for me. :-) Thanks. I now get that no one -- no matter their visual capabilities -- wants to read or hear the case metadata on the first page. (I do enjoy learning all of this. Really.) So I'll need to figure out how to skip the various forms of that. And thanks for the examples, @johnhawkinson. Those are great places for me to start. If either of you thinks of any other examples that are either typical or edge cases, that'd be helpful. |
Just FYI -- chime in if you have thoughts, but I'm not expecting replies. I've been thinking ("percolating" is actually how I describe this specific activity) about how to handle ignoring the "meta case info" that can span the first x.y pages (1.5 pages, 0.8 pages, etc. is what I mean). I was thinking about how I recognize where to start reading; how do I know what to skip over? I clue in on where I see the first text paragraph. (I think someone mentioned this already.) So now I'm playing around with that -- how can the system recognize a text paragraph? I'm starting with some really simple assumptions: (1) the first line is indented more than 5 spaces, and words are separated by either one or two spaces; and (2) the next line is either the start of another paragraph ( = another indented line), or an un-indented line of text. These assumptions ultimately may not work, but it's a starting place. If we can effectively recognize the first text paragraph, we can just skip over the "meta case info" no matter how long it is. Also, the PDF to text conversion gives us |
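To make the two assumptions above concrete, here is a minimal Python sketch of that heuristic. It is only an illustration, not code from the project: the function names and the `MIN_INDENT` constant are hypothetical, and it assumes the extracted text keeps its leading spaces (as layout-preserving PDF-to-text output typically does).

```python
import re
from typing import List, Optional

# Assumption (1) from the comment above: "indented more than 5 spaces".
MIN_INDENT = 5


def is_paragraph_start(line: str) -> bool:
    """True if the line is indented more than MIN_INDENT spaces and its
    words are separated by only one or two spaces."""
    stripped = line.lstrip(" ")
    indent = len(line) - len(stripped)
    if indent <= MIN_INDENT or len(stripped.split()) < 2:
        return False
    # A gap of three or more spaces between words suggests columns
    # (e.g. a caption block), not running prose.
    return re.search(r"\S {3,}\S", stripped) is None


def is_unindented_text(line: str) -> bool:
    """True if the line is non-empty and starts in the first column."""
    return bool(line) and not line[0].isspace()


def find_first_paragraph(lines: List[str]) -> Optional[int]:
    """Return the index of the first line that looks like the start of a
    text paragraph, or None if no line matches."""
    for i in range(len(lines) - 1):
        following = lines[i + 1]
        # Assumption (2): the next line is another indented paragraph
        # start or an un-indented line of text.
        if is_paragraph_start(lines[i]) and (
            is_paragraph_start(following) or is_unindented_text(following)
        ):
            return i
    return None
```

One known weakness of this sketch: two consecutive centered headings would also satisfy both rules, so it would need checking against a sample set of real filings.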
Yeah, this will be the challenge for sure. If you're game, I'd suggest downloading the last 250 docs from the bot and building them up into a sample set that you can test against until your heuristics are working. |
I reached out to a vision-impaired friend with this question. He replies (I bolded a few things):
I'm not so sure this really helps, but my takeaway is that doing it as best you can is what you hope for, and that you want the substantive content. |
Excellent feedback. Info from actual users -- or people in the same group -- is always great. And yup -- I'll create that data set. (Just the kind of geeky challenge I like.) |
I'm realizing we lost momentum here. Anything we can do on our end that'd help you pick it up again, @weedySeaDragon? |
It's me. I'm the problem. |
It'd be nice if our alt text were better. Currently it just says, "Thumbnail of page X of the PDF".
First, we could make that better just by saying, "Thumbnail of page X of the PDF linked above."
That makes it more clear that clicking the link will get you the text.
But second, we have the text in our database. Could we do better by including it in the post? I think the answer is yes, and I think the way to do it is to ask the recap-document API for the text and use it if it's available, skipping it if not (implying that OCR is still going, which we shouldn't wait for).
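A rough sketch of the "use it if it's available, skip it if not" rule, in Python. The request to the recap-document API is deliberately left out: `document_text` stands in for whatever text that API returns, and the function name is hypothetical.

```python
from typing import Optional

FALLBACK_TEMPLATE = "Thumbnail of page {page} of the PDF linked above."


def alt_text_for_thumbnail(page: int, document_text: Optional[str]) -> str:
    """Use the document text when there is some; otherwise fall back to the
    generic description instead of waiting for OCR to finish."""
    text = (document_text or "").strip()
    if not text:
        return FALLBACK_TEMPLATE.format(page=page)
    return text
```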
When we get the text from the API, we won't know what text came from which page, so we'll just have to dump as much of it as we can along with each thumbnail, possibly with some explanatory text:
Then on the second thumbnail:
Then on the third thumbnail:
(Of course, this should break at word boundaries, not in the middle of words.)
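As a sketch of the splitting described above, the Python function below packs the text into one chunk per thumbnail and only breaks between words. The 1000-character limit and the cap of four chunks are assumptions for illustration, not confirmed platform limits, and the function name is hypothetical.

```python
from typing import List


def split_alt_text(text: str, limit: int = 1000, max_chunks: int = 4) -> List[str]:
    """Split text into at most max_chunks pieces of at most limit characters,
    breaking only at whitespace. Text that does not fit is dropped."""
    chunks: List[str] = []
    current = ""
    for word in text.split():
        if len(chunks) == max_chunks:
            break
        # A single token longer than the limit is hard-truncated.
        word = word[:limit]
        candidate = word if not current else current + " " + word
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current and len(chunks) < max_chunks:
        chunks.append(current)
    return chunks
```

Each returned chunk would go with one thumbnail in order, regardless of which page the text actually came from.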
The alt-text bot goes further and even will do an image that just says "ALT TEXT" with additional text on it, but I think we can stop short of that (we're not providing all the pages, after all):
https://twitter.com/AltTextUtil/status/1653058214362238976
One open question is whether we'll want to pre-process the text to remove whitespace. I think we probably will, and I wonder if we'll want to go even further to remove dumb punctuation at the beginning too. Like, maybe all we want are the words? This will take some experimentation, I think.
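One possible starting point for that experimentation, sketched in Python: collapse every run of whitespace to a single space, then drop any punctuation at the very beginning. The function name is hypothetical, and whether to strip punctuation elsewhere is left open.

```python
import re


def clean_text(text: str) -> str:
    """Collapse whitespace runs (spaces, newlines, form feeds) to single
    spaces and remove leading non-alphanumeric characters."""
    collapsed = " ".join(text.split())
    # \W matches anything that is not a letter, digit, or underscore;
    # the underscore is added so leading rules like "____" also go.
    return re.sub(r"^[\W_]+", "", collapsed)
```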