

# Getting Started with Alacorder

### Alacorder processes case detail PDFs retrieved from Alacourt.com into data tables suitable for research purposes. Alacorder also generates compressed text archives from the source PDFs to speed future data collection from the same set of cases.

## Installation

**Alacorder can run on most devices. If your device can run Python 3.7 or later, it can run alacorder.**
* To install on Windows, open Command Prompt and enter `pip install alacorder`. 
    * To start the interface, enter `python -m alacorder` or `python3 -m alacorder`.
* On Mac, open the Terminal and enter `pip3 install alacorder`.
    * To start the interface, enter `python3 -m alacorder` or `python -m alacorder`.
* If `pip`, `pip3`, `python`, and `python3` are not recognized on your computer, install [Anaconda Distribution](https://www.anaconda.com/products/distribution), create a virtual environment, open a terminal, and then repeat these instructions. If your copy of alacorder is corrupted, run `pip uninstall alacorder` or `pip3 uninstall alacorder` and then reinstall it. Reinstalling also picks up any newer version.
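
To see which of the cases above applies, a short script can report the interpreter version and whether `alacorder` is already importable. This is a minimal sketch using only the standard library; the helper name `check_environment` is made up for this example and is not part of `alacorder`.

```python
import importlib.util
import sys


def check_environment(module_name="alacorder", minimum=(3, 7)):
    """Report whether this interpreter meets the minimum version and can find the module."""
    version_ok = sys.version_info >= minimum
    # find_spec returns None when a top-level module is not installed
    installed = importlib.util.find_spec(module_name) is not None
    return {"python_ok": version_ok, "installed": installed}


status = check_environment()
print("Python", sys.version.split()[0], "is new enough:", status["python_ok"])
print("alacorder is installed:", status["installed"])
```

If `installed` is reported as `False`, run the `pip` command for your platform from the list above.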

#### Open this interactive tutorial on your desktop by installing [Anaconda Distribution](https://www.anaconda.com/) and opening "Jupyter Lab." Open this file in Jupyter then run the `pip` command below. Running the `pip` command below will also update your copy of `alacorder` to the latest version automatically.

[GitHub](https://github.com/sbrobson959/alacorder)  | [PyPI](https://pypi.org/project/alacorder/)     
[Report an issue](mailto:sbrobson@crimson.ua.edu)

<sup>© 2023 Sam Robson</sup>

In [None]:
%pip uninstall -y alacorder
%pip install alacorder

Note: you may need to restart the kernel to use updated packages.
Collecting alacorder
  Using cached alacorder-7.3.1-py3-none-any.whl (11 kB)
Collecting pandas
  Using cached pandas-1.5.3-cp311-cp311-macosx_11_0_arm64.whl (10.8 MB)
Collecting numpy
  Using cached numpy-1.24.2-cp311-cp311-macosx_11_0_arm64.whl (13.8 MB)
Collecting xlrd
  Using cached xlrd-2.0.1-py2.py3-none-any.whl (96 kB)
Collecting openpyxl
  Using cached openpyxl-3.1.0-py2.py3-none-any.whl (250 kB)
Collecting PyPDF2
  Using cached pypdf2-3.0.1-py3-none-any.whl (232 kB)
Collecting xlwt
  Using cached xlwt-1.3.0-py2.py3-none-any.whl (99 kB)
Collecting build
  Using cached build-0.10.0-py3-none-any.whl (17 kB)
Collecting pyproject_hooks
  Using cached pyproject_hooks-1.0.0-py3-none-any.whl (9.3 kB)
Collecting et-xmlfile
  Using cached et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Collecting pytz>=2020.1
  Using cached pytz-2022.7.1-py2.py3-none-any.whl (499 kB)
Installing collected packages: xlwt, pytz, xlrd, pyprojec

*Alacorder should automatically install dependencies upon setup, but you can also install the dependencies yourself (pandas, numpy, PyPDF2, openpyxl, xlrd, xlwt, build, setuptools, xarray)*

# Using the guided interface

#### Once you have a Python environment up and running, you can launch the guided interface by:

* Importing the module from your command line. Depending on your Python configuration, enter `python -m alacorder` or `python3 -m alacorder` to launch the command line interface. 

* Importing the alacorder module in Python. Use the import statement `from alacorder import __main__` to run the command line interface.


* **Alacorder can be used without writing any code, and exports to common formats like Excel (.xls), Stata (.dta), CSV, and JSON. Alacorder full text archives are compressed pickle archives (.pkl.xz), a format that can store the text of thousands of PDFs in a very small file. If you need to unpack a pickle archive without the `alac` module, decompress it with an .xz tool, then read the pickle into Python with the standard library module `pickle`.**
    * Once installed, enter `python -m alacorder` or `python3 -m alacorder` to start the interface. 
    * If you are using IPython, launch the IPython shell and enter `from alacorder import __main__` to launch the guided interface. 
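
The manual unpacking route mentioned above can be sketched with the standard library modules `lzma` and `pickle`. The helper name `read_archive` and the stand-in archive are made up for illustration, and the list-of-strings layout is an assumption rather than Alacorder's actual archive structure.

```python
import lzma
import os
import pickle
import tempfile


def read_archive(path):
    """Decompress an .xz file and unpickle its contents."""
    with lzma.open(path, "rb") as f:
        return pickle.load(f)


# Build a tiny stand-in archive so the example runs on its own.
# The list-of-strings layout is an assumption, not Alacorder's real format.
sample = ["full text of case one", "full text of case two"]
path = os.path.join(tempfile.mkdtemp(), "sample.pkl.xz")
with lzma.open(path, "wb") as f:
    pickle.dump(sample, f)

restored = read_archive(path)
print(len(restored), "records restored")
```

If an archive holds a pandas object, pandas must be installed for the unpickling step to succeed. Only unpickle files from sources you trust, because loading a pickle can execute arbitrary code.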




In [2]:
from alacorder import __main__



	    ___    __                          __         
	   /   |  / /___ __________  _________/ /__  _____
	  / /| | / / __ `/ ___/ __ \/ ___/ __  / _ \/ ___/
	 / ___ |/ / /_/ / /__/ /_/ / /  / /_/ /  __/ /    
	/_/  |_/_/\__,_/\___/\____/_/   \__,_/\___/_/     
																																														
		
		ALACORDER beta 7.3.1 (pure-python)
		by Sam Robson	


Welcome to Alacorder. Please select an operating mode:

	A.	MAKE A TABLE FROM DIRECTORY OR ARCHIVE

		Create detailed cases table with convictions, charges,
		fees, and voting rights restoration information. 

		Inputs:		Text Archive (.pkl.xz) or PDF directory
		Outputs:	Recommend .xls -> all tables in one file
					Also supports .csv, .dta, .json, .txt

	B.	ARCHIVE CASES

		Collect text from PDFs in directory and compress to archive.
		Archives can be processed into tables upon completion.

		Inputs:		PDF Directory (./path/to/pdfs)
		Outputs:	filename.pkl.xz

>> Enter A or B:



# Writing basic scripts with `alac` 

### For more advanced queries, the `alacorder` module `alac` can be used to extract fields and tables from Alacourt records with only a few lines of code.

#### The `run` module creates the full text archives and detailed case summary tables outputted by the guided interface. 

* Call `alac.config(in_path: str, out_path='', flags='', print_log=True, warn=False)` and assign it to a variable to hold your configuration object. This tells the imported alacorder modules where and how to input and output. If `out_path` is left blank, `alac.write` methods will print to the console instead of exporting. 

* Call `alac.writeArchive(config)` to export a full text archive. It's recommended that you create a full text archive and save it as a .pkl.xz file before making tables from your data. Full text archives can be scanned faster than PDF directories and require much less storage. Full text archives can be used just like PDF directories. 

* Call `alac.writeTables(config)` to export detailed case information tables. If the export type is .xls, the "cases", "fees", and "charges" tables will all be exported. Otherwise, you can select which table you would like to export. 

* Call `alac.writeCharges(config)` to export the charges table only.

* Call `alac.writeFees(config)` to export the fees table only.

In [None]:
import warnings
warnings.filterwarnings('ignore')

from alacorder import alac

pdf_directory = "/Users/crimson/Desktop/Tutwiler/"
archive = "/Users/crimson/Desktop/Tutwiler.pkl.xz"
tables = "/Users/crimson/Desktop/Tutwiler.xls"

# make full text archive from PDF directory 
c = alac.config(pdf_directory, archive)
alac.writeArchive(c)

print("Full text archive complete. Now processing case information into tables at " + tables)

# then scan full text archive for spreadsheet
d = alac.config(archive, tables)
alac.writeTables(d)

# Custom Parsing with *`alac`*
### If you need to conduct a custom search of Alacourt records, the `alac` module has the tools you need to extract custom fields from Alacourt PDFs without any fuss. Try out `alac.write()` and `alac.search()` to search thousands of cases in just a few minutes.

In [None]:
from alacorder import alac
import pandas as pd
import re

archive = "/Users/crimson/Desktop/Tutwiler.pkl.xz"
tables = "/Users/crimson/Desktop/Tutwiler.xls"

def findName(text):
    """Return the name parsed from the case text, or "" if none is found."""
    name = ""
    # First try the text that follows "VS." or "V." on the same line
    match = re.search(r'(?a)(VS\.|V\.{1})(.+)(Case)*', text, re.MULTILINE)
    if match:
        name = match.group(2).replace("Case Number:", "").strip()
    else:
        # Otherwise fall back to the text between "DOB" and "Name"
        match = re.search(r'(?:DOB)(.+)(?:Name)', text, re.MULTILINE)
        if match:
            # Remove the "Case Number:" label before stripping colons, so it can still match
            name = match.group(1).replace("Case Number:", "").replace(":", "").strip()
    return name

c = alac.config(archive, tables)

alac.write(c, findName)


| Method | Description |
| ------------- | ------ |
| `getPDFText(path) -> text` | Returns full text of case |
| `getCaseInfo(text) -> [case_number, name, alias, date_of_birth, race, sex, address, phone]` | Returns basic case details | 
| `getFeeSheet(text: str, cnum = '') -> [total_amtdue, total_balance, total_d999, feecodes_w_bal, all_fee_codes, table_string, feesheet: pd.DataFrame()]` | Returns fee sheet and summary as strings and pd.DataFrame() |
| `getCharges(text: str, cnum = '') -> [convictions_string, disposition_charges, filing_charges, cerv_eligible_convictions, pardon_to_vote_convictions, permanently_disqualifying_convictions, conviction_count, charge_count, cerv_charge_count, pardontovote_charge_count, permanent_dq_charge_count, cerv_convictions_count, pardontovote_convictions_count, charge_codes, conviction_codes, all_charges_string, charges: pd.DataFrame()]` | Returns charges table and summary as strings, int, and pd.DataFrame() |
| `getCaseNumber(text) -> case_number: str` | Returns case number |
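
Several of the methods above return plain lists, so each value is identified only by its position. One way to make results easier to read is to pair the positions with field names. The sketch below does this for the `getCaseInfo()` field order shown in the table; the helper `label_case_info` and the sample values are made up for illustration and are not part of `alacorder`.

```python
# Field order copied from the getCaseInfo() row in the table above
CASE_INFO_FIELDS = ["case_number", "name", "alias", "date_of_birth",
                    "race", "sex", "address", "phone"]


def label_case_info(values):
    """Pair each positional value with its field name."""
    if len(values) != len(CASE_INFO_FIELDS):
        raise ValueError(
            "expected %d values, got %d" % (len(CASE_INFO_FIELDS), len(values))
        )
    return dict(zip(CASE_INFO_FIELDS, values))


# Made-up values standing in for a real getCaseInfo(text) result
example = ["CC-0000-000000.00", "DOE JANE", "", "01/01/1900", "W", "F", "", ""]
record = label_case_info(example)
print(record["name"], record["race"], record["sex"])
```

The length check is there so that a change in the number of returned fields raises an error instead of silently mislabeling values.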



# Working with Python data types

### Out of the box, `alacorder` exports to .xls, .csv, .json, .dta, .pkl.xz, and .txt. You can also use `alac`, [pandas](https://pandas.pydata.org/docs/getting_started/index.html#getting-started), and other Python modules to create your own data collection workflows and design custom exports. 

***The snippet below reads each PDF in a directory and prints the convictions summary returned by `getCharges()` for each case that has one.***

In [None]:
from alacorder import alac

c = alac.config("/Users/crimson/Desktop/Tutwiler/","/Users/crimson/Desktop/Tutwiler.xls")

for path in c['contents']:
    text = alac.getPDFText(path)
    cnum = alac.getCaseNumber(text)
    charges_outputs = alac.getCharges(text, cnum)
    if len(charges_outputs[0]) > 1:
        print(charges_outputs[0])

In [None]:
## use alac.search() to make a Series from a custom method. Write a method with a full text string parameter, parse the text to fit your purposes, and return the parsed values. 

## alac.write() works the same way, but will export the returned values in batches - like alac.writeTables() - instead of merely returning the mapped Series. 

import re

from alacorder import alac

def getRaceAndSex(text):
  try:
    racesex = re.search(r'(B|W|H|A)\/(F|M)(?:Alias|XXX)', str(text))
    race = racesex.group(1).strip()
    sex = racesex.group(2).strip()
  except (IndexError, AttributeError):
    race = ""
    sex = ""
  return [race, sex]
    
c = alac.config("/Users/crimson/Desktop/Tutwiler/","/Users/crimson/Desktop/Tutwiler.xls")

d = alac.search(c, getRaceAndSex)

race = d[0]
sex = d[1]

print(race)
print(sex)
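
Before running a custom parser across thousands of cases, it can help to try it on a couple of short strings first. The sketch below repeats the `getRaceAndSex` parser from the cell above so it runs on its own, without `alacorder` installed. The sample strings are hypothetical and are not taken from real case records.

```python
import re


def getRaceAndSex(text):
    """Same parser as in the cell above, repeated so this example is self-contained."""
    try:
        racesex = re.search(r'(B|W|H|A)\/(F|M)(?:Alias|XXX)', str(text))
        race = racesex.group(1).strip()
        sex = racesex.group(2).strip()
    except (IndexError, AttributeError):
        # re.search returned None, so there was nothing to read groups from
        race = ""
        sex = ""
    return [race, sex]


# Hypothetical snippets, not taken from real case records
samples = {
    "match": "Race/Sex: W/FAlias 1: NONE",
    "no_match": "no race or sex field in this text",
}
results = {label: getRaceAndSex(text) for label, text in samples.items()}
print(results)
```

A parser that returns empty strings when there is no match, as this one does, keeps a single unusual record from stopping a long batch.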


## Extending alacorder with `pandas`, `jupyter`, and other tools

Alacorder runs on [pandas](https://pandas.pydata.org/docs/getting_started/index.html#getting-started), a Python module you can use to perform calculations, process text data, and make tables and charts. Pandas can read from and write to all major data storage formats, and it can connect to a wide variety of services to extend what you can do with `alacorder` data. When `alacorder` table data is exported to .pkl.xz, it is stored as a pandas DataFrame (like a spreadsheet) and can be loaded for use with other Python [modules](https://www.anaconda.com/open-source) and libraries using `pd.read_pickle()`, as shown below:
```
import pandas as pd
contents = pd.read_pickle("/path/to/pkl")
```
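
Once a table is loaded, ordinary pandas operations apply. The following is a minimal sketch that writes a small made-up DataFrame to a `.pkl.xz` file, reads it back, and filters it. The column names and values are placeholders, not Alacorder's actual table schema.

```python
import os
import tempfile

import pandas as pd

# Made-up rows; the column names are placeholders, not Alacorder's real schema
cases = pd.DataFrame({
    "case_number": ["CC-0000-000001.00", "CC-0000-000002.00", "CC-0000-000003.00"],
    "total_balance": [150.0, 0.0, 320.5],
})

# pandas infers xz compression from the .xz file extension
path = os.path.join(tempfile.mkdtemp(), "cases.pkl.xz")
cases.to_pickle(path)

loaded = pd.read_pickle(path)
owing = loaded[loaded["total_balance"] > 0]
print(len(owing), "of", len(loaded), "cases have a balance")
print("Total owed:", owing["total_balance"].sum())
```

From here the filtered table can be written out with any pandas exporter, such as `to_csv()` or `to_json()`.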

If you would like to visualize data without exporting to Excel or another format, create a Jupyter notebook and import a data visualization library like `matplotlib`. The [pandas](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html) tutorials and documentation can help you get started. Jupyter is a [notebook](https://docs.jupyter.org/en/latest/start/index.html) environment for Python that you can use to create interactive tools like this notebook. It can be installed using `pip install jupyter` or `pip3 install jupyter` and launched using `jupyter notebook`. Your computer may already be equipped to view Jupyter notebooks. 

### Resources to get started

* [pandas cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
* [regex cheat sheet](https://www.rexegg.com/regex-quickstart.html)
* [anaconda (tutorials on python data analysis)](https://www.anaconda.com/open-source)
* [The Python Tutorial](https://docs.python.org/3/tutorial/)
