# goal
Some of our partners use DocumentCloud to maintain document collections they've received through public records act requests.

DocumentCloud has an [API](https://next.www.documentcloud.org/help/api/) with a Python wrapper ([pypi](https://pypi.org/project/python-documentcloud/), [docs](https://documentcloud.readthedocs.io/en/latest/)) that makes it easy to sync these remote collections into a local project repo.

Let's see if we can get the API working for Invisible Institute's project on Human Trafficking. I think that means we need to:
- [ ] [Create a client](https://documentcloud.readthedocs.io/en/latest/gettingstarted.html#creating-a-client)
- [ ] [Search for documents](https://documentcloud.readthedocs.io/en/latest/gettingstarted.html#searching-for-documents)
- [ ] [Pull the annotations and metadata for each document](https://documentcloud.readthedocs.io/en/latest/gettingstarted.html#interacting-with-a-document)
    - Not sure exactly what we should be grabbing, maybe text data if it looks usable, or just the location of annotated sections to use in identifying the line data from another OCR method.
- [ ] Either download the PDF locally or extract the text from the document and save that instead

Note: The document collection used in this example is not necessarily public due to unredacted personally identifying information about victims appearing in some documents, as well as the (currently) unpublished nature of the findings. The documents themselves are PDFs of Chicago Police Department Incident Reports, separated by unique `rd`, or Records Division, number of the incident. Most PDFs are less than 10 pages, and some have annotations from the team about the fields/sections of the reports with particularly useful info.

Note note: You have to set up credentials with MuckRock and DocumentCloud to use the client. I put my username and password in a file as "USER|PASSWORD" and use a method below to set up the client with those credentials.
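A minimal sketch of parsing that one-line credentials file, assuming the `USER|PASSWORD` layout described above. `parse_creds` is a hypothetical helper, not part of `python-documentcloud`. It drops the trailing newline and splits on the first `|` only, so a password that itself contains `|` would survive.

```python
def parse_creds(line):
    """Split a 'USER|PASSWORD' line into (user, password)."""
    # Hypothetical helper for illustration; not part of python-documentcloud.
    # Remove only the line terminator so the password is otherwise untouched.
    cleaned = line.rstrip("\r\n")
    # partition splits on the first "|" and leaves later ones in the password.
    user, sep, pwd = cleaned.partition("|")
    if not sep or not user or not pwd:
        raise ValueError("expected a single line shaped like USER|PASSWORD")
    return user, pwd


# Made-up credentials, only to show the call.
print(parse_creds("someone@example.org|s3cret|with|pipes\n"))
```

Raising a `ValueError` with a message, rather than asserting, makes a malformed file a little easier to diagnose.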

# setup

In [1]:
# dependencies
import re
import pandas as pd
from documentcloud import DocumentCloud

In [2]:
# support methods
def getcreds():
    with open("/Users/home/git/dotfiles/creds/doccloud") as f:
        # strip the trailing newline so it doesn't end up in the password
        line = f.readline().strip()
    found = line.split("|")
    assert len(found) == 2
    return found[0], found[1]


def getclient():
    user, pwd = getcreds()
    client = DocumentCloud(username=user, password=pwd)
    return client

In [3]:
# main
link = "https://www.documentcloud.org/projects/219560-human-trafficking-cpd/"
rdpatt = "[A-Z]{1,2}[0-9]{5,7}"
client = getclient()

# start poking through the results

- [ ] [Create a client](https://documentcloud.readthedocs.io/en/latest/gettingstarted.html#creating-a-client)

`<obj>.<TAB>` is great for getting familiar with the different classes and methods available in a new library. I'm only going to keep the cells related to this exploration in this notebook, but I discovered what pieces were the right ones for my task by stepping through a lot of wrong answers that were available to auto-complete.
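When tab completion isn't available (in a plain script, say), roughly the same listing can be produced with the built-in `dir()`. This is only a sketch: `public_names` and the `Example` class are hypothetical stand-ins so the cell runs without credentials.

```python
def public_names(obj):
    """Return the attribute names of obj that don't start with an underscore."""
    return sorted(name for name in dir(obj) if not name.startswith("_"))


class Example:
    """Stand-in object for illustration; not a DocumentCloud class."""

    def __init__(self):
        self.title = "demo"
        self._hidden = True

    def get_pdf_url(self):
        return "https://example.org/demo.pdf"


print(public_names(Example()))
```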

In [4]:
DocumentCloud?

Init signature:
DocumentCloud(
    username=None,
    password=None,
    base_uri='https://api.www.documentcloud.org/api/',
    auth_uri='https://accounts.muckrock.com/api/',
    timeout=20,
    loglevel=None,
    rate_limit=True,
    rate_limit_sleep=True,
)
Docstring:      The public interface for the DocumentCloud API, now integrated with SquareletClient
File:           ~/opt/miniconda3/lib/python3.12/site-packages/documentcloud/client.py
Type:           type
Subclasses:     

In [5]:
client?

Type:        DocumentCloud
String form: <documentcloud.client.DocumentCloud object at 0x1780c2ff0>
File:        ~/opt/miniconda3/lib/python3.12/site-packages/documentcloud/client.py
Docstring:   The public interface for the DocumentCloud API, now integrated with SquareletClient

In [6]:
client.documents.list()

<APIResults: [<Document: 1 - A.I.G. Bailout: The Inspector General's Report>, <Document: 2 - President Obama's Health Care Proposal>, <Document: 3 - Shaping the Next Economic Expansion>, <Document: 4 - Inspector General's Report on Medicare Prescription Fraud>, <Document: 5 - Wiretapping Lawsuits Thrown Out>, <Document: 15 - Department of Defense IOB Reports, Part 4>, <Document: 30 - Bybee Response to OPR 2nd Draft>, <Document: 31 - A.I.G. Payments to Counterparties>, <Document: 33 - Letter from Mukasey and Filip to Jarrett>, <Document: 34 - Madoff's Accountant Charged with Fraud for Accounting Violations>, <Document: 43 - Madoff: Civil Suit Against Fairfield Greenwich Funds>, <Document: 52 - General Motors Bankruptcy Filing>, <Document: 55 - Bloomberg News vs. Federal Reserve Board>, <Document: 56 - Court Order Freezing Peter Madoff's Assets>, <Document: 61 - Pulling back the TARP: Oversight of the Financial Rescue Program>, <Document: 62 - A New Plan for the T.A.R.P.>, <Document: 63 

### checkpoint

- [x] [Create a client](https://documentcloud.readthedocs.io/en/latest/gettingstarted.html#creating-a-client)
- [ ] [Search for documents](https://documentcloud.readthedocs.io/en/latest/gettingstarted.html#searching-for-documents)

Hmmm. The results of calling `client.documents.list()` confirm the client is working as expected, but those are not the documents I'm looking for.

The project files I need are in a specific location that has been shared with me, so I need to point to that location.

In [7]:
client.projects.list()

<APIResults: [<Project: 17 - Pulitzer, Please>, <Project: 36 - Congressional Research Reports>, <Project: 38 - Demoing to Dan and Jen>, <Project: 40 - UNT dox>, <Project: 44 - Circle of Blue test project>, <Project: 49 - Stanford>, <Project: 51 - Menthol>, <Project: 52 - BBC>, <Project: 54 - Homestead>, <Project: 57 - Haiti Documents>, <Project: 58 - Immigration Enforcement>, <Project: 64 - March Madness>, <Project: 75 - John Doe vs. Catholic Bishop for the Diocese of Memphis>, <Project: 87 - Trouble on the Tray>, <Project: 99 - integraclick>, <Project: 100 - Pacific fisher>, <Project: 101 - San Joaquin water>, <Project: 102 - Deals>, <Project: 103 - telecom>, <Project: 104 - Telecomm docs>, <Project: 105 - FOIA Logs>, <Project: 123 - Chicago Documents>, <Project: 125 - DCFS Letters>, <Project: 127 - vitter-formaldehyde>, <Project: 130 - Justice Stevens' retires>]>

The project ID appears in the link, so rather than parsing the list of projects, we can use the ID in the link to specify the project the client should return.
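As a sketch, the slug and numeric ID can be pulled out of the project link with the standard library instead of being retyped. `project_slug` and `project_id` are hypothetical helpers, and the `/projects/<id>-<slug>/` layout is assumed from the one link used in this notebook.

```python
from urllib.parse import urlparse


def project_slug(link):
    """Return the '<id>-<slug>' path segment from a project link."""
    # Assumes the path looks like /projects/<id>-<slug>/ as in this notebook.
    parts = [p for p in urlparse(link).path.split("/") if p]
    if len(parts) < 2 or parts[0] != "projects":
        raise ValueError(f"not a project link: {link}")
    return parts[1]


def project_id(link):
    """Return the leading numeric ID from the project slug."""
    return int(project_slug(link).split("-", 1)[0])


link = "https://www.documentcloud.org/projects/219560-human-trafficking-cpd/"
print(project_slug(link), project_id(link))
```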

In [8]:
project = client.projects.get_by_id("219560-human-trafficking-cpd")

In [9]:
project

<Project: 219560 - Human Trafficking CPD>

In [10]:
project.document_list[:10]

[<Document: 25285109 - JG538004>,
 <Document: 25285096 - JC342850>,
 <Document: 25285089 - JB572727>,
 <Document: 25285088 - HY309430>,
 <Document: 25220623 - JA191403>,
 <Document: 25220622 - JC526637>,
 <Document: 25220621 - HZ512953>,
 <Document: 25220620 - HZ453182>,
 <Document: 25220619 - HZ445465>,
 <Document: 25220618 - HZ419320>]

Sweet, that's what we're looking for.

In [11]:
docs = project.document_list

The project page shows there are 164 documents available in the project directory, so we should have at least that many from now on.

In [12]:
assert len(docs) >= 164

### checkpoint

- [x] [Search for documents](https://documentcloud.readthedocs.io/en/latest/gettingstarted.html#searching-for-documents)

In [13]:
docs[0]

<Document: 25285109 - JG538004>

In [14]:
testdoc = docs[0]

In [15]:
testdoc?

Type:        Document
String form: JG538004
File:        ~/opt/miniconda3/lib/python3.12/site-packages/documentcloud/documents.py
Docstring:   A single DocumentCloud document

In [16]:
testdoc.contributor_organization

'Invisible Institute'

In [17]:
testdoc.canonical_url

'https://www.documentcloud.org/documents/25285109-jg538004/'

In [18]:
testdoc.get_pdf_url()

'https://api.www.documentcloud.org/files/documents/25285109/jg538004.pdf'

- [x] (Clicking the link takes you to the document)
- [ ] [Pull the annotations and metadata for each document](https://documentcloud.readthedocs.io/en/latest/gettingstarted.html#interacting-with-a-document)

In [19]:
testdoc.file_hash

'cde03b008de37bb2d927263f8acc074f419779e2'

In [20]:
testdoc.full_text[:450]

'CHICAGO POLICE DEPARTHENT ro#:\nORIGINAL CASE INCIDENT REPORT Event 5: Srateoissn\naes 2 INCIDENT\nios, Monomers Hoe ee,\nrh\nWR: 1060 Human Trafcking - Commrcil Sox Acs.\noccurence E—— Beat: 1522 | Unit Assigned: Gos\nfrm RO Ara ate: 12 Decemier 20230000\nG50. Apument\nOccurrence Dae: 08 November 223 00.0013 December 2023000\n=\n.\nRe: — Beat 162 | pce boo: —008\nery] Age: eos\nBest 5100\noT\nName: [I\nSt\n| I (e——\nSea: 5100\nF——\nName: I\nRes Beat: 1522 Female\n e— '

Hmmm. I was thinking about grabbing the OCR text from DocumentCloud instead of the PDFs, but that text is pretty rough, and we already have OCR results from MS Document Intelligence anyways. So, let's focus on grabbing the metadata for the PDF document and getting the annotations.

In [21]:
testdoc.title

'JG538004'

In [22]:
testdoc.id

25285109

In [23]:
testdoc.created_at

datetime.datetime(2024, 11, 11, 6, 6, 22, 824742, tzinfo=tzutc())

In [24]:
testdoc.updated_at

datetime.datetime(2024, 11, 11, 6, 6, 33, 788892, tzinfo=tzutc())

In [25]:
testdoc.page_count

2

Alright, that's perfectly good metadata, let's wrap it into a method we can apply to each document.

In [26]:
def getdoc(doc):
    info = {
        'pdfurl': doc.get_pdf_url(),
        'filehash': doc.file_hash,
        'filename': doc.title,
        'fileid': doc.id,
        'created_at': doc.created_at,
        'last_update': doc.updated_at,
        'n_pages': doc.page_count,
        'doc': doc,
    }
    return info

In [27]:
testinfo = getdoc(doc=testdoc)

In [28]:
testinfo

{'pdfurl': 'https://api.www.documentcloud.org/files/documents/25285109/jg538004.pdf',
 'filehash': 'cde03b008de37bb2d927263f8acc074f419779e2',
 'filename': 'JG538004',
 'fileid': 25285109,
 'created_at': datetime.datetime(2024, 11, 11, 6, 6, 22, 824742, tzinfo=tzutc()),
 'last_update': datetime.datetime(2024, 11, 11, 6, 6, 33, 788892, tzinfo=tzutc()),
 'n_pages': 2,
 'doc': <Document: 25285109 - JG538004>}

We should double-check that the filenames are all `rd` numbers, and if so, track that as an additional field (or instead of filename).
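One way to sketch that check is to split the filenames into those that are entirely an `rd` number and those that are not. This uses `re.fullmatch`, which is stricter than searching for the pattern anywhere in the name. `split_by_rd` is a hypothetical helper, and `scan_0001` is a made-up non-matching name.

```python
import re

# Same pattern as rdpatt in this notebook, compiled once.
RD_PATTERN = re.compile(r"[A-Z]{1,2}[0-9]{5,7}")


def split_by_rd(names):
    """Return (matched, unmatched) lists based on a full-string rd match."""
    matched, unmatched = [], []
    for name in names:
        if RD_PATTERN.fullmatch(name):
            matched.append(name)
        else:
            unmatched.append(name)
    return matched, unmatched


# The first three titles appear in the project listing; the last is invented.
names = ["JG538004", "JC342850", "HY309430", "scan_0001"]
matched, unmatched = split_by_rd(names)
print(matched, unmatched)
```

Looking at the unmatched list first shows which documents need a manual `rd` assignment.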

In [29]:
rdpatt

'[A-Z]{1,2}[0-9]{5,7}'

In [30]:
re.findall(pattern=rdpatt, string=testinfo['filename'])

['JG538004']

In [31]:
def build_docdf(docs):
    data = []
    for doc in docs:
        info = getdoc(doc=doc)
        data.append(info)
    assert any(data)
    out = pd.DataFrame(data)
    assert out.shape[0] == len(data) == len(docs)
    assert not out.pdfurl.isna().any()
    assert not out.fileid.isna().any()
    return out


def findrd(fname):
    assert not pd.isna(fname)
    found = re.findall(pattern=rdpatt, string=fname)
    if not len(found) == 1: return None
    return found[0]


def addrd(df):
    copy = df.copy()
    copy['rd'] = copy.filename.apply(findrd)
    assert copy.rd.notna().sum() == 163
    assert copy.rd.isna().sum() == 1
    assert (copy.loc[copy.rd.isna(), 'fileid'] == 25211366).all()
    copy.loc[(copy.rd.isna()) & (copy.fileid == 25211366), 'rd'] = "JG271294"
    return copy

In [32]:
docdf = build_docdf(docs=docs)
docdf = addrd(df=docdf)

In [33]:
docdf.sample().T

| | 35 |
|---|---|
| pdfurl | https://api.www.documentcloud.org/files/docume... |
| filehash | 3a85a23e5302eb3995bcee0d0258992813d0c47c |
| filename | JE117528 |
| fileid | 25220428 |
| created_at | 2024-10-16 17:46:01.258632+00:00 |
| last_update | 2024-10-16 17:46:20.156903+00:00 |
| n_pages | 2 |
| doc | JE117528 |
| rd | JE117528 |


### checkpoint

- [x] [Pull the metadata for each document](https://documentcloud.readthedocs.io/en/latest/gettingstarted.html#interacting-with-a-document)
- [ ] [Pull the annotations for each document](https://documentcloud.readthedocs.io/en/latest/gettingstarted.html#interacting-with-a-document)

In [34]:
annot_client = testdoc.annotations

In [35]:
annot_client?

Type:        AnnotationClient
String form: <documentcloud.annotations.AnnotationClient object at 0x178101850>
File:        ~/opt/miniconda3/lib/python3.12/site-packages/documentcloud/annotations.py
Docstring:   Client for interacting with Sections

In [36]:
annot_client.list()

<APIResults: [<Annotation: 2608131 - Scrape>, <Annotation: 2608132 - Scrape>, <Annotation: 2608133 - Scrape>, <Annotation: 2608134 - Scrape>, <Annotation: 2608135 - Scrape>, <Annotation: 2608136 - Scrape>]>

In [37]:
annots = annot_client.list()

In [38]:
annots[0]?

Type:        Annotation
String form: Scrape
File:        ~/opt/miniconda3/lib/python3.12/site-packages/documentcloud/annotations.py
Docstring:   A note on a document

In [39]:
testannot = annots[0]

In [40]:
testannot.created_at

'2024-11-12T03:44:18.871399Z'

In [41]:
testannot.id

2608131

In [42]:
testannot.title

'Scrape'

In [43]:
testannot.content

'Collect information here'

In [44]:
testannot.description

'Collect information here'

In [45]:
testannot.x1, testannot.x2, testannot.y1, testannot.y1

(0.03857142857142857, 0.93, 0.09030837004405286, 0.09030837004405286)

Careful: that cell lists `y1` twice, so the fourth value is the top edge again rather than `y2`. The same goes for the `loc_x12_y12` field in `getannot` below. The pixel `location` in the next cell does carry all four edges.

In [46]:
(testannot.location.top, testannot.location.bottom, testannot.location.left, testannot.location.right)

(81, 176, 27, 651)
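
These two outputs look like the same box in two unit systems. The x values are consistent with a page image 700 pixels wide (27/700 ≈ 0.03857 and 651/700 = 0.93). A small sketch of that conversion follows. The width of 700 is inferred from these numbers only and should be treated as an assumption, and `to_pixels` is a hypothetical helper.

```python
# Inferred from the outputs above; an assumption, not a documented constant.
ASSUMED_WIDTH_PX = 700


def to_pixels(fraction, size_px):
    """Convert a normalized coordinate (0..1) to a pixel offset."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError(f"expected a value between 0 and 1, got {fraction}")
    return round(fraction * size_px)


left = to_pixels(0.03857142857142857, ASSUMED_WIDTH_PX)
right = to_pixels(0.93, ASSUMED_WIDTH_PX)
print(left, right)
```

Normalized coordinates are the more portable of the two, since they can be scaled to whatever resolution another OCR tool used.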

In [47]:
testannot.page_number

0

Let's make sure this is a zero-based index system by reviewing the json text data.

In [48]:
testdoc.json_text

{'updated': 1731305193586,
 'pages': [{'page': 0,
   'contents': 'CHICAGO POLICE DEPARTHENT ro#:\nORIGINAL CASE INCIDENT REPORT Event 5: Srateoissn\naes 2 INCIDENT\nios, Monomers Hoe ee,\nrh\nWR: 1060 Human Trafcking - Commrcil Sox Acs.\noccurence E—— Beat: 1522 | Unit Assigned: Gos\nfrm RO Ara ate: 12 Decemier 20230000\nG50. Apument\nOccurrence Dae: 08 November 223 00.0013 December 2023000\n=\n.\nRe: — Beat 162 | pce boo: —008\nery] Age: eos\nBest 5100\noT\nName: [I\nSt\n| I (e——\nSea: 5100\nF——\nName: I\nRes Beat: 1522 Female\n e— Uno Reused\nChicago IL 60d\nI\nBest 5100\nSobre: Unknown\nName: UNKNOWN 1, Unknown 1\nprem\nnel\nFromm\nUnknown\nUnenommtiar Se\nrem\nfo (ote) aE\n— Fo UNKNOWN 1, Unknown 1\np— teri)\n— UNKNOWN 1, Unknown 1\na FR— |\noterse) a\nFim Goneraed vy P—— 7 wee EEE] RCE TE\nPage 19 of 53\n',
   'ocr': 'tess4',
   'lang': 'eng',
   'updated': 1731305193586},
  {'page': 1,
   'contents': 'Print Generated by: 2\n \n \nPage of 2 19-AUG-2024 11:51\nDOMESTIC INFO NARRATI

Yup, there are two page items in the json data, and the page numbers in the data reflect a zero-based index.
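
Since annotations and the json text share that zero-based numbering, the page text for an annotation can be looked up directly by `page_number`. A sketch, assuming only the `pages` / `page` / `contents` keys visible in the output above; `text_for_page` is a hypothetical helper and the sample contents are made up.

```python
def text_for_page(json_text, page_number):
    """Return the OCR text for a zero-based page number, or None if absent."""
    for page in json_text["pages"]:
        if page["page"] == page_number:
            return page["contents"]
    return None


# Toy structure mirroring the keys shown above; the contents are invented.
sample = {
    "pages": [
        {"page": 0, "contents": "first page text"},
        {"page": 1, "contents": "second page text"},
    ]
}
print(text_for_page(sample, 0))
```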

In [49]:
assert testdoc.json_text['pages'][0]['page'] == 0
assert len(testdoc.json_text['pages']) == 2

Right, so on to wrapping up the annotation metadata into a tabular format.

In [50]:
def getannot(annotation):
    info = {
        'created_at': annotation.created_at,
        'annotid': annotation.id,
        'pageno': annotation.page_number,
        'title': annotation.title,
        'content': annotation.content,
        'loc_x12_y12': (annotation.x1, annotation.x2, annotation.y1, annotation.y1),
        'loc_btlr': (annotation.location.top, annotation.location.bottom, annotation.location.left, annotation.location.right),
        'annotation': annotation,
    }
    return info

In [51]:
getannot(annotation=testannot)

{'created_at': '2024-11-12T03:44:18.871399Z',
 'annotid': 2608131,
 'pageno': 0,
 'title': 'Scrape',
 'content': 'Collect information here',
 'loc_x12_y12': (0.03857142857142857,
  0.93,
  0.09030837004405286,
  0.09030837004405286),
 'loc_btlr': (81, 176, 27, 651),
 'annotation': <Annotation: 2608131 - Scrape>}

In [52]:
def formatannots(doc):
    annots = doc.annotations.list()
    if not any(annots): return pd.DataFrame()
    data = []
    for ea in annots:
        info = getannot(annotation=ea)
        data.append(info)
    assert any(data)
    out = pd.DataFrame(data)
    assert out.shape[0] == len(data)
    assert not out.loc[:, out.columns != 'annotation'].duplicated().any()
    assert not out.annotid.isna().any()
    assert not out.loc_x12_y12.isna().any()
    return out


def build_annotdf(df):
    data = []
    for tup in df[['fileid', 'doc']].itertuples():
        annots = formatannots(doc=tup.doc)
        if annots.shape[0] == 0: continue
        annots['fileid'] = tup.fileid
        data.append(annots)
    if len(data) == 0: return None
    out = pd.concat(data)
    assert out.shape[0] >= len(data)
    return out

In [53]:
build_annotdf(df=docdf.sample(3))

| | created_at | annotid | pageno | title | content | loc_x12_y12 | loc_btlr | annotation | fileid |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-11-14T21:59:22.947063Z | 2611495 | 0 | Scrape | Collect information here | (0.6928571428571428, 0.9471316964285714, 0.013... | (11, 67, 485, 662) | Scrape | 25213429 |
| 1 | 2024-11-14T21:59:48.666555Z | 2611496 | 0 | Scrape | Collect information here | (0.06142857142857143, 0.9299888392857143, 0.10... | (97, 222, 43, 650) | Scrape | 25213429 |
| 2 | 2024-11-14T22:00:04.021413Z | 2611497 | 0 | Scrape | Collect information here | (0.054285714285714284, 0.9414174107142858, 0.2... | (232, 386, 38, 658) | Scrape | 25213429 |
| 3 | 2024-11-14T22:00:30.790710Z | 2611498 | 0 | Scrape | Collect information here | (0.05285714285714286, 0.9128459821428572, 0.64... | (582, 713, 37, 638) | Scrape | 25213429 |
| 4 | 2024-11-14T22:01:51.373748Z | 2611499 | 1 | Scrape | Collect information here | (0.05714285714285714, 0.9342745535714285, 0.03... | (35, 211, 40, 653) | Scrape | 25213429 |
| 5 | 2024-11-14T22:03:14.938769Z | 2611500 | 2 | Scrape | Collect information here | (0.05857142857142857, 0.9499888392857143, 0.36... | (327, 757, 41, 664) | Scrape | 25213429 |
| 6 | 2024-11-14T22:15:43.318298Z | 2611526 | 4 | Scrape | Collect information here | (0.025714285714285714, 0.9185602678571428, 0.4... | (382, 758, 18, 642) | Scrape | 25213429 |
| 7 | 2024-11-14T22:16:48.316031Z | 2611527 | 8 | Scrape | Collect information here | (0.05857142857142857, 0.9828459821428571, 0.79... | (684, 833, 41, 687) | Scrape | 25213429 |
| 8 | 2024-11-14T22:18:40.223942Z | 2611529 | 9 | Scrape | Collect information here | (0.06, 0.9542745535714285, 0.10841983852364476... | (93, 819, 42, 667) | Scrape | 25213429 |
| 9 | 2024-11-14T22:19:00.104614Z | 2611530 | 10 | Scrape | Collect information here | (0.05142857142857143, 0.9471316964285714, 0.09... | (85, 444, 36, 662) | Scrape | 25213429 |


### checkpoint

We didn't get the actual text that was annotated, but we can use the location info to snip the sections from the PDF for targeted OCR, or filter the existing OCR line data for lines that fall within these locations.

- [x] [Pull the annotations for each document](https://documentcloud.readthedocs.io/en/latest/gettingstarted.html#interacting-with-a-document)
- [ ] Either download the PDF locally or extract the text from the document and save that instead
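
The "filter the existing OCR line data" idea could look roughly like the sketch below. The OCR line layout here is an assumption (dicts with a zero-based `page`, a `text`, and a normalized `box` of `(x1, y1, x2, y2)`), and the helper names and all coordinate values are hypothetical. A line counts as a hit when its center point falls inside the annotated region on the same page.

```python
def center_in_region(line_box, region):
    """True if the center of line_box lies inside region; both are (x1, y1, x2, y2)."""
    lx1, ly1, lx2, ly2 = line_box
    rx1, ry1, rx2, ry2 = region
    cx, cy = (lx1 + lx2) / 2, (ly1 + ly2) / 2
    return rx1 <= cx <= rx2 and ry1 <= cy <= ry2


def lines_in_region(lines, page, region):
    """Keep OCR lines on the given page whose center falls inside region."""
    return [
        ln for ln in lines
        if ln["page"] == page and center_in_region(ln["box"], region)
    ]


# Invented OCR lines in normalized page coordinates.
ocr_lines = [
    {"page": 0, "text": "inside the annotated block", "box": (0.10, 0.12, 0.60, 0.14)},
    {"page": 0, "text": "below the annotated block", "box": (0.10, 0.50, 0.60, 0.52)},
    {"page": 1, "text": "same position, other page", "box": (0.10, 0.12, 0.60, 0.14)},
]
region = (0.04, 0.09, 0.93, 0.20)  # invented annotation box on page 0
hits = lines_in_region(ocr_lines, page=0, region=region)
print([ln["text"] for ln in hits])
```

Testing the center point instead of full containment tolerates OCR boxes that spill slightly over the edge of a hand-drawn annotation.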

In [54]:
testdoc.pdf[:10]

b'%PDF-1.7\n%'

Okay, we can work with byte data. Let's make sure it works as expected.
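
Two cheap checks on the raw bytes can be sketched before writing anything to disk: the `%PDF-` magic header, and a digest to compare against the stored hash. The `file_hash` shown earlier is 40 hex characters, which is the length of a SHA-1 digest; whether DocumentCloud computes it over these same PDF bytes is an assumption to verify on one document before asserting on it. Both helper names are hypothetical.

```python
import hashlib


def looks_like_pdf(data):
    """True if the bytes start with the PDF magic header."""
    return data.startswith(b"%PDF-")


def sha1_hex(data):
    """Hex SHA-1 digest of the bytes (40 characters)."""
    return hashlib.sha1(data).hexdigest()


# Made-up bytes standing in for testdoc.pdf.
fake = b"%PDF-1.7\n%fake body for illustration"
print(looks_like_pdf(fake), len(sha1_hex(fake)))
```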

In [55]:
with open("test.pdf", 'wb') as f:
    f.write(testdoc.pdf)

In [56]:
from pypdf import PdfReader

In [57]:
PdfReader("test.pdf")

<pypdf._reader.PdfReader at 0x17850a1e0>

Aannnndd we're in business.

In [58]:
def writepdf(fname, pdfbyts):
    # the with block closes the file, so no explicit close() is needed
    with open(fname, 'wb') as f:
        f.write(pdfbyts)
    return True

In [59]:
docdf.sample().T

| | 103 |
|---|---|
| pdfurl | https://api.www.documentcloud.org/files/docume... |
| filehash | 5df6768222210283735ba34cd3b0d757905bcf2b |
| filename | JE393104 |
| fileid | 25213269 |
| created_at | 2024-10-15 21:09:23.502732+00:00 |
| last_update | 2024-10-15 21:09:45.393784+00:00 |
| n_pages | 2 |
| doc | JE393104 |
| rd | JE393104 |


In [60]:
docdf.doc.values[0]

<Document: 25285109 - JG538004>

In [61]:
docdf['localcopy'] = docdf[['rd', 'doc']].head(5).apply(lambda row: writepdf(
    fname=f"output/{row.rd}.pdf", pdfbyts=row.doc.pdf), axis=1)

In [62]:
docdf.head(5)

| | pdfurl | filehash | filename | fileid | created_at | last_update | n_pages | doc | rd | localcopy |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://api.www.documentcloud.org/files/docume... | cde03b008de37bb2d927263f8acc074f419779e2 | JG538004 | 25285109 | 2024-11-11 06:06:22.824742+00:00 | 2024-11-11 06:06:33.788892+00:00 | 2 | JG538004 | JG538004 | True |
| 1 | https://api.www.documentcloud.org/files/docume... | e07b7a80d019be0c3ae16b182aa0a7cad7dd0669 | JC342850 | 25285096 | 2024-11-11 05:52:58.557121+00:00 | 2024-11-11 05:53:18.239539+00:00 | 3 | JC342850 | JC342850 | True |
| 2 | https://api.www.documentcloud.org/files/docume... | c09912353de9bd6e0ce8f3b83ac22d6609dc6d8a | JB572727 | 25285089 | 2024-11-11 05:35:57.876177+00:00 | 2024-11-11 05:36:34.734234+00:00 | 11 | JB572727 | JB572727 | True |
| 3 | https://api.www.documentcloud.org/files/docume... | 3ccfaa1ada6d00e8e2284e4a8ad064f5f0948a24 | HY309430 | 25285088 | 2024-11-11 05:17:26.478521+00:00 | 2024-11-11 05:17:48.141907+00:00 | 2 | HY309430 | HY309430 | True |
| 4 | https://api.www.documentcloud.org/files/docume... | d5c32918707a3770eefb5be4eaa06296db347476 | JA191403 | 25220623 | 2024-10-16 18:16:54.775996+00:00 | 2024-10-16 18:17:23.186277+00:00 | 2 | JA191403 | JA191403 | True |


In [63]:
!ls -al output/

total 11904
drwxr-xr-x   7 home  staff      224 May 14 17:47 .
drwxr-xr-x  15 home  staff      480 May 14 18:38 ..
-rw-r--r--   1 home  staff   716957 May 14 18:38 HY309430.pdf
-rw-r--r--   1 home  staff   878093 May 14 18:38 JA191403.pdf
-rw-r--r--   1 home  staff  2704621 May 14 18:38 JB572727.pdf
-rw-r--r--   1 home  staff  1290930 May 14 18:38 JC342850.pdf
-rw-r--r--   1 home  staff   488755 May 14 18:38 JG538004.pdf


### checkpoint

- [x] Either download the PDF locally or extract the text from the document and save that instead
- [ ] put it in a script

In [68]:
!cat /Users/home/git/US-II-HT/doccloud/src/sync.py

#!/usr/bin/env python3
# vim: set ts=4 sts=0 sw=4 si fenc=utf-8 et:
# vim: set fdm=marker fmr={{{,}}} fdl=0 foldcolumn=4:
# Authors:     BP
# Maintainers: BP
# Copyright:   2025, HRDAG, GPL v2 or later

# ---- dependencies {{{
from pathlib import Path
from sys import stdout
import argparse
from loguru import logger
import re
import pandas as pd
from documentcloud import DocumentCloud
#}}}

# --- support methods --- {{{
def getargs():
    parser = argparse.ArgumentParser()
    parser.add_argument("--projectid", default="219560-human-trafficking-cpd")
    parser.add_argument("--outdir", default=None)
    parser.add_argument("--outpdfs", default=None)
    parser.add_argument("--outannots", default=None)
    args = parser.parse_args()
    assert Path(args.outdir).exists()
    return args


def setuplogging(logfile):
    logger.add(logfile,
               colorize=True,
               format="<green>{time:YYYY-MM-DD⋅at⋅HH:mm:ss}</green>⋅<level>{message}</level>",
               level="INFO"

- [x] put it in a script
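
One thing to watch in the script's `getargs`: `--outdir` defaults to `None`, and `Path(None)` raises a `TypeError` before the assertion can say anything useful. A hedged sketch of one alternative is to let `argparse` validate the directory itself. `existing_dir` is a hypothetical helper, and only the two arguments needed for the illustration are reproduced here.

```python
import argparse
from pathlib import Path


def existing_dir(value):
    """argparse type: accept only paths that are existing directories."""
    path = Path(value)
    if not path.is_dir():
        raise argparse.ArgumentTypeError(f"{value!r} is not an existing directory")
    return path


def getargs(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--projectid", default="219560-human-trafficking-cpd")
    # required=True turns a missing --outdir into a usage error, not a TypeError
    parser.add_argument("--outdir", type=existing_dir, required=True)
    return parser.parse_args(argv)


args = getargs(["--outdir", "."])
print(args.projectid, args.outdir)
```

Passing `argv` explicitly keeps the function testable from a notebook, while `getargs()` with no arguments still reads the real command line.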