# Loading CPDP Data Into the NPDC Index

This notebook processes the `unified_data` from [chicago-police-data](https://github.com/invinst/chicago-police-data/tree/master/data) as well as geocoding data and loads it into the NPDC index database.

Each table is only loaded once: if its records already exist, the insert is skipped, so it's safe to run this notebook multiple times.

## Setup

Install project dependencies and Jupyter:

```bash
pip3 install jupyter
pip3 install -r requirements/dev_unix.txt
```

`unified_data_path` should point to the `unified_data` directory of the `chicago-police-data` repo. By default, the repo is expected to be checked out to `<root>/chicago-police-data`:

```bash
git clone https://github.com/invinst/chicago-police-data.git
```

The database should be running and the tables should be up to date. You can use Docker to reset the application to a clean state:

```bash
# Stop services and remove volumes, rebuild images, start the database, create tables, run seeds, and follow logs
docker-compose down -v && docker-compose up --build -d db api && docker-compose logs -f
```

Then open the notebook with either [VSCode](https://code.visualstudio.com/) or `jupyter notebook`.

You can run the notebook from the command line as well:

```bash
jupyter nbconvert --to notebook --execute backend/scraper/cpdp.ipynb --output cpdp
```

In [2]:
import os
if os.getcwd().endswith("notebooks"):
    # Run this notebook from the repo root
    os.chdir("../../..")
import sys
import math
import numpy as np
import pandas as pd
import sqlalchemy
import psycopg2
from IPython.display import display, HTML
from collections import namedtuple
from backend.database import db, Incident, Officer, Accusation, Victim
from backend.database import *
print(os.getcwd())


backend.database.core
/app


In [4]:
from backend.api import create_app
app = create_app("development")

unified_data_path = "chicago-police-data/data/unified_data"
if not os.path.exists(unified_data_path):
    raise Exception(f"{unified_data_path} does not exist. It should point to the unified_data folder of the chicago-police-data repo.")

## Read Data Files

Data is organized into CSVs under `data/unified_data`. CSV file naming is described in `data/README.md`. CSV columns are described in `data/unified_data/data-dictionary/data-dictionary.yaml` as well as `data/complaints_general-summary.html`.

In [5]:
def data_path(*args):
    return os.path.join(unified_data_path, *args)


def float_to_int_str(x):
    """Converts a floating-point string into an integer string.

    Empty strings are converted to nan
    """
    return str(int(float(x))) if x else math.nan


def read_csv(*path):
    return pd.read_csv(
        data_path(*path),
        dtype={
            "cr_id": str,
        },
        converters={
            "link_UID": float_to_int_str,
            "birth_year": float_to_int_str,
            "investigator_ID": float_to_int_str,
            "UID": float_to_int_str,
        },
        low_memory=False,
    )


def read_complaint(name):
    return read_csv("complaints", f"complaints-{name}.csv.gz")


def check_defined(df, column):
    assert df[column].isnull().values.sum() == 0


# Each record is one complaint, primary key cr_id
complaints = read_complaint("complaints").set_index(
    "cr_id", drop=False, verify_integrity=True
)

# Each record is an officer, deduplicated across data sources, primary key UID
profiles = read_csv("profiles", "final-profiles.csv.gz").set_index(
    "UID", drop=False, verify_integrity=True
)

# Each record is an accusation against one officer in one complaint, composite
# key (cr_id, UID)
accused = read_complaint("accused").set_index(
    ["cr_id", "UID"], drop=False, verify_integrity=True
)

# Each record is a person that filed a complaint. Many-to-1 with cr_id
complainants = read_complaint("complainants")
check_defined(complainants, "cr_id")

# Each record is a person assigned to investigate a particular complaint.
# Many-to-1 with (cr_id, investigator_ID). investigator_ID includes officer
# UIDs and non-officer investigators. The same investigator may be assigned to
# the same complaint at different times, resulting in multiple records.
investigators = read_complaint("investigators")
check_defined(investigators, ["cr_id", "investigator_ID"])

# Each record is a victim in a complaint. Many-to-1 with cr_id. victims_v3
# contains injury information which is lost in the merged victims table.
victims_v2 = read_complaint("victims_2000-2016_2016-11")
victims_v3 = read_complaint("victims_2000-2018_2018-03")
victims_unified = read_complaint("victims")
check_defined(victims_v2, "cr_id")
check_defined(victims_v3, "cr_id")
check_defined(victims_unified, "cr_id")
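
To illustrate what the `dtype` and `converters` arguments in `read_csv` do, here is a minimal sketch on a hypothetical two-row CSV (the sample values are made up, not taken from the CPDP data). ID columns stored as floats, such as `100001.0`, become the string `"100001"`, empty cells become NaN, and `dtype=str` keeps `cr_id` as text so leading zeros survive.

```python
import io
import math

import pandas as pd


def float_to_int_str(x):
    """Same converter as in the cell above: "123.0" -> "123", "" -> nan."""
    return str(int(float(x))) if x else math.nan


# Hypothetical CSV content; the second row has no UID
sample_csv = io.StringIO("cr_id,UID\n0001234,100001.0\n0001235,\n")
sample = pd.read_csv(
    sample_csv,
    dtype={"cr_id": str},
    converters={"UID": float_to_int_str},
)
print(sample)
```

Keeping IDs as strings avoids float artifacts like a trailing `.0` when the values are later used as join keys.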


## Read Geocoding Data

Locations are recorded as semi-structured addresses. In order to display incidents on a map, we convert the human-readable addresses to latitude/longitude coordinates. This conversion is called geocoding.

Coordinates were not published to `chicago-police-data` but were computed by the CPDP team using the Google Maps Geocoding API. They shared the geocoding results with us, stored in `cpdp_geocoded_cr.csv.gz`.

In [7]:
csv = pd.read_csv(
    "backend/scraper/data_scrapers/CPDP/scraper_data/cpdp_geocoded_cr.csv.gz",
    dtype={
        "crid": str,
        "beat_id": str,
    },
    low_memory=False,
)
geocoding_results = pd.DataFrame()
geocoding_results[["longitude", "latitude"]] = (
    csv[["x", "y", "crid"]].copy().set_index("crid", verify_integrity=True)
)
display(geocoding_results.iloc[:2])


crid,longitude,latitude
1086435,,
1000494,-87.70716,41.902985


In [8]:
def isnan(x):
    return isinstance(x, float) and math.isnan(x)


def nan_to_none(x):
    return None if isnan(x) else x


def strip_nan(r):
    return r._make([nan_to_none(e) for e in r])


def to_orm(instances, OrmClass):
    return [
        OrmClass(**strip_nan(i)._asdict())
        for i in instances.itertuples(index=False)
    ]


def to_dicts(instances):
    """Converts dataframe rows into dicts, converting NaN to None"""
    return [strip_nan(i)._asdict() for i in instances.itertuples(index=False)]


def create_bulk(instances, chunk_size=1000):
    """Inserts ORM instances into the database"""
    for chunk in range(0, len(instances), chunk_size):
        db.session.add_all(instances[chunk : chunk + chunk_size])
        db.session.flush()
    db.session.commit()


def insert_bulk(dicts, OrmClass):
    """Inserts dicts into the database.

    This is 3x faster but does not implement ORM features.
    """
    with app.app_context():
        db.session.bulk_insert_mappings(OrmClass, dicts)
        db.session.commit()


def insert_bulk_if_missing(dicts, OrmClass):
    try:
        insert_bulk(dicts, OrmClass)
    except sqlalchemy.exc.IntegrityError as e:
        if isinstance(e.orig, psycopg2.errors.UniqueViolation):
            print(f"Already created {OrmClass.__name__} records")
        else:
            raise e


def add_date_of_birth(target, birth_year):
    # Default to 01-01 for birthday
    has_birth_year = ~birth_year.isna()
    target.loc[has_birth_year, "date_of_birth"] = (
        birth_year[has_birth_year] + "-01-01"
    )
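
pandas marks missing values as NaN, and the helpers above replace NaN with `None` before rows are turned into dicts, presumably so that missing values reach the database as NULL. The following is a minimal sketch of that conversion on a made-up two-row frame; `rows_to_dicts` and `toy_officers` are hypothetical names used only for this illustration.

```python
import math

import pandas as pd


def nan_to_none(x):
    """Return None for a float NaN, otherwise the value unchanged."""
    return None if isinstance(x, float) and math.isnan(x) else x


def rows_to_dicts(df):
    """Turn each row into a dict, replacing NaN with None."""
    return [
        {key: nan_to_none(value) for key, value in row._asdict().items()}
        for row in df.itertuples(index=False)
    ]


# Made-up data: the second officer has no star number
toy_officers = pd.DataFrame(
    {"first_name": ["ADA", "BO"], "star": [1234.0, math.nan]}
)
toy_dicts = rows_to_dicts(toy_officers)
print(toy_dicts)
```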


## Load Incidents

In [9]:
incidents = complaints[["complaint_date", "closed_date"]].copy()
incidents["source"] = "cpdp"
incidents[["source_id", "time_of_incident"]] = complaints[
    ["cr_id", "incident_date"]
]
has_full_address = ~complaints["full_address"].isna()

# full address only contains street information, so add the city
incidents.loc[has_full_address, "location"] = (
    complaints.loc[has_full_address, "full_address"] + " CHICAGO ILLINOIS"
)

# Join the address components, ignoring missing values and consolidating whitespace
address = complaints[~has_full_address]
incidents.loc[~has_full_address, "location"] = (
    address["add1"]
    .str.cat(address[["add2", "city"]], na_rep=" ")
    .str.strip()
    .str.replace(r"\s{2,}", " ", regex=True)
)

# Add Coordinates if available
incidents["longitude"] = incidents.source_id.map(geocoding_results['longitude'])
incidents["latitude"] = incidents.source_id.map(geocoding_results['latitude'])

incident_dicts = to_dicts(incidents)
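
As a small illustration of joining address components while tolerating missing values, here is a sketch on made-up rows. Note one assumption: this sketch passes `sep=" "` to `str.cat`, which the cell above does not. Without a separator, `str.cat` concatenates the components directly.

```python
import numpy as np
import pandas as pd

# Made-up address components; the second row has no add2
toy_address = pd.DataFrame(
    {
        "add1": ["4800", "12XX"],
        "add2": ["W IRVING PARK RD", np.nan],
        "city": ["CHICAGO IL", "CHICAGO IL"],
    }
)

toy_location = (
    toy_address["add1"]
    # Missing components are replaced by a single space
    .str.cat(toy_address[["add2", "city"]], sep=" ", na_rep=" ")
    .str.strip()
    # Collapse runs of whitespace left behind by missing components
    .str.replace(r"\s{2,}", " ", regex=True)
)
print(toy_location.tolist())
```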


In [10]:
insert_bulk_if_missing(incident_dicts, Incident)
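
The rerun-safe behavior of `insert_bulk_if_missing` comes from catching the database's uniqueness error. The sketch below shows the same pattern using the standard-library `sqlite3` module instead of PostgreSQL and SQLAlchemy, so it runs without a database server. The `victim` table and `insert_if_missing` helper are hypothetical stand-ins, not part of the NPDC code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE victim (id INTEGER PRIMARY KEY, race TEXT)")


def insert_if_missing(conn, rows):
    """Insert all rows in one transaction.

    Returns False if a primary-key conflict shows they were already loaded.
    """
    try:
        with conn:  # commits on success, rolls back on error
            conn.executemany(
                "INSERT INTO victim (id, race) VALUES (?, ?)", rows
            )
        return True
    except sqlite3.IntegrityError:
        return False


rows = [(1, "BLACK"), (2, "WHITE")]
first = insert_if_missing(conn, rows)   # rows are inserted
second = insert_if_missing(conn, rows)  # conflict, nothing is inserted
count = conn.execute("SELECT COUNT(*) FROM victim").fetchone()[0]
print(first, second, count)
```

In this sketch a single conflicting row aborts the whole batch, so the approach makes reruns safe but does not add new rows to a table that has already been loaded.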


## Load Officers

In [11]:
officers = profiles[
    ["first_name", "last_name", "race", "gender", "appointed_date"]
].copy()
officers["source"] = "cpdp"
officers[["source_id", "rank", "star", "unit"]] = profiles[
    [
        "link_UID",
        "cleaned_rank",
        "current_star",
        "current_unit",
    ]
]
add_date_of_birth(officers, profiles["birth_year"])

officer_dicts = to_dicts(officers)


In [12]:
insert_bulk_if_missing(officer_dicts, Officer)


## Generate Mapping from CPDP to NPDC IDs

CPDP entities are linked together using incident and officer IDs. There is a one-to-many correspondence between incidents/officers and victims, investigations, participants, and accusations. In order to insert these entities into the database, we need to convert the CPDP IDs in the source data to their corresponding NPDC IDs.

In [13]:
def get_source_npdc_id_map(OrmClass):
    """Returns a dict mapping non-null source IDs to NPDC IDs (primary keys)"""
    with app.app_context():
        ids = (
            db.session.query(OrmClass)
            .filter(OrmClass.source_id != None)
            .with_entities(OrmClass.source_id, OrmClass.id)
            .all()
        )
    return dict(ids)


officer_id_by_link_uid = get_source_npdc_id_map(Officer)
incident_id_by_cr_id = get_source_npdc_id_map(Incident)
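
Once such a dict exists, translating source IDs to NPDC IDs is a `Series.map` call. The sketch below uses a made-up mapping and made-up complaint IDs (all `toy_` names are hypothetical) to show that source IDs without a match become NaN, which is what `check_defined` later guards against.

```python
import pandas as pd

# Hypothetical mapping from CPDP complaint IDs to NPDC primary keys
toy_incident_id_by_cr_id = {"C100": 1, "C101": 2}

toy_accused = pd.DataFrame({"cr_id": ["C100", "C101", "C999"]})
toy_accused["incident_id"] = toy_accused["cr_id"].map(toy_incident_id_by_cr_id)

# Rows whose cr_id has no NPDC incident end up with NaN
toy_unmatched = toy_accused[toy_accused["incident_id"].isna()]
print(toy_accused)
print(toy_unmatched["cr_id"].tolist())
```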


## Load Accusations

Complaints may be made against multiple officers for a single incident. Each accusation links a single officer to a single incident.

In [14]:
accusations = pd.DataFrame()
accusations[["category", "category_code", "finding", "outcome"]] = accused[
    ["complaint_category", "complaint_code", "final_finding", "final_outcome"]
].copy()
accusations["incident_id"] = accused["cr_id"].map(incident_id_by_cr_id)
accusations["officer_id"] = accused["link_UID"].map(officer_id_by_link_uid)
check_defined(accusations, ["incident_id", "officer_id"])

insert_bulk_if_missing(to_dicts(accusations), Accusation)


## Load Victims

Victims may or may not be the person making the complaint.

In [16]:
# About 25% of victims reference cr_id which is missing in the complaints table.
# CPDP still shows information for these complaints, and I'm not sure where that
# data comes from. For our purposes, drop victims with no associated complaint.
# Ex https://cpdp.co/complaint/1086131/
victims_v2 = victims_v2[victims_v2.cr_id.isin(incident_id_by_cr_id)]
victims_v3 = victims_v3[victims_v3.cr_id.isin(incident_id_by_cr_id)]

v2 = victims_v2[["gender", "race"]].copy()
v2["incident_id"] = victims_v2.cr_id.map(incident_id_by_cr_id)

v3 = victims_v3[
    ["gender", "race", "injury_condition", "injury_description"]
].copy()
v3["incident_id"] = victims_v3.cr_id.map(incident_id_by_cr_id)
add_date_of_birth(v3, victims_v3["birth_year"])
v3["deceased"] = v3["injury_condition"].str.match("deceased", case=False)

# Collect all victims in v3 as well as any in v2 that don't appear in v3
victims = pd.concat(
    [v2[~v2.incident_id.isin(v3.incident_id)], v3], ignore_index=True
)
check_defined(victims, "incident_id")
# Specify primary keys for victims so insertion into the database is idempotent
victims["id"] = 1 + np.arange(victims.shape[0])

insert_bulk_if_missing(to_dicts(victims), Victim)


Already created Victim records
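
The merge rule for the two victim files (keep every v3 row, plus v2 rows for incidents that v3 does not cover) can be illustrated on made-up data. All `toy_` names and values below are hypothetical.

```python
import numpy as np
import pandas as pd

toy_v2 = pd.DataFrame({"incident_id": [1, 2], "race": ["BLACK", "WHITE"]})
toy_v3 = pd.DataFrame(
    {
        "incident_id": [2, 3],
        "race": ["WHITE", "HISPANIC"],
        "injury_condition": ["UNKNOWN", "INJURED, HOSPITALIZED"],
    }
)

# Keep v2 rows only for incidents that never appear in v3, then add all of v3
only_in_v2 = toy_v2[~toy_v2.incident_id.isin(toy_v3.incident_id)]
toy_victims = pd.concat([only_in_v2, toy_v3], ignore_index=True)

# Deterministic primary keys, starting at 1
toy_victims["id"] = 1 + np.arange(toy_victims.shape[0])
print(toy_victims)
```

Because the keys are derived from row position, the same input always produces the same `id` values, which is what lets a repeated insert be detected as a duplicate.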


# CPDP Data Summary

CPDP uses these entities in its data model:

- **Complaints**: Complaints are filed by individuals or groups against one or more officers. They correspond to our `Incident` model. Potentially, there could be multiple complaints surrounding the same event. Are incidents like that too?
- **Profiles**: CPDP generates officer profiles from all of its data collection, and deduplicates as best it can. Profiles correspond to the `Officer` model.
- **Accusations**: Each complaint accuses one or more officers, and each accusation is recorded in this table. It is the join table between officers and incidents.
- **Victims, Complainants, Investigators, Witnesses**: These are many-to-one with complaints. Investigators have names or officer IDs, while the others are anonymized. These have equivalent tables in NPDC.

CPDP stores allegations, findings, and outcomes on Accusations and the overall close date on Complaints, while we store them all on Investigations. Should we follow CPDP's model and drop the Investigation?

## Comparing the data used in the [UI Designs](https://www.figma.com/file/TM86P6ePUar5g24pk4h6Du/NPDC--2?node-id=1400%3A1723) with [Availability in CPDP](https://github.com/invinst/chicago-police-data/blob/master/data/unified_data/data-dictionary/data-dictionary.yaml)

#### Incident Search: Location, Incident Type, Date/Time

Dates, free-text location information, accusation descriptions, and injury information are available for searching. CPDP does not publish lat/lon coordinates for incidents. Accusation Description most closely matches Incident Type. We could also search all incident text fields at once.

#### Officer Search: Name, Location, Badge Number

Officer names, rank, star, and unit are available, but location and badge number are not.

#### Incident Results
  - Officer(s) involved
  - Date/Time
  - Incident type
  - Use of Force
  - Source

Use of force and source are not available as designed. Victim injury information could be listed instead, as well as the accusations and whether they were upheld.

CPDP does show use of force reports / Tactical Response Reports (TRRs) on its site. These are recorded by officers whenever they use force in the field. Unfortunately, they are not joined to complaints, so we can only associate them with officers, not incidents.

In [17]:
# Display CPDP tables

for t in ["incidents", "officers", "accusations", "victims"]:
    df = globals()[t]
    df["nna"] = df.isna().sum(axis=1)
    print(t, f"{df.shape[0]} records")
    display(df.sort_values("nna").drop("nna", axis=1).iloc[:10])


incidents 181412 records


cr_id,complaint_date,closed_date,source,source_id,time_of_incident,location,longitude,latitude
295329,2004-01-22,2004-12-15,cpdp,295329,2004-01-11,4800W IRVING PARK RD CHICAGO IL,-87.747929,41.953591
C184795,1991-06-07,1991-08-21,cpdp,C184795,1991-06-07,020TH DIST. STATION CHICAGO ILLINOIS,-87.693011,41.979973
C184796,1991-06-07,1993-02-26,cpdp,C184796,1991-06-07,102** S. HOXIE/ON STREET CHICAGO ILLINOIS,-87.561166,41.708441
C184798,1991-06-07,1994-03-29,cpdp,C184798,1991-06-07,102** S. HOXIE CHICAGO ILLINOIS,-87.561166,41.708441
C184799,1991-06-08,1991-08-14,cpdp,C184799,1991-06-07,37** N. CLARK/WRIGLEYVILLE TAP CHICAGO ILLINOIS,-87.658049,41.94883
C184800,1991-06-08,1991-09-30,cpdp,C184800,1991-06-07,51ST. & WENTWORTH/0** DIST. CHICAGO ILLINOIS,-87.619819,41.640297
C184801,1991-06-08,1991-10-02,cpdp,C184801,1991-06-07,55TH & LOWE IN THE ALLEY CHICAGO ILLINOIS,-87.60039,41.794834
C184802,1991-06-08,1991-08-30,cpdp,C184802,1991-06-08,12** W. PRATT CHICAGO ILLINOIS,-87.663051,42.005501
C184793,1991-06-07,1991-09-30,cpdp,C184793,1991-06-07,HARRISON AND FRANKLIN STS CHICAGO ILLINOIS,-87.635045,41.874456
C184803,1991-06-08,1994-04-09,cpdp,C184803,1991-06-08,62** S. KING DR. CHICAGO ILLINOIS,-87.615632,41.780717


officers 33693 records


UID,first_name,last_name,race,gender,appointed_date,source,source_id,rank,star,unit,date_of_birth
116100,JUDIE,FITTE BLASZ,WHITE,FEMALE,2001-08-27,cpdp,8609,POLICE OFFICER,3841.0,24.0,1968-01-01
112901,JEREMY,BALLING,WHITE,MALE,2013-05-01,cpdp,1201,POLICE OFFICER,15992.0,3.0,1980-01-01
112902,JEREMY,BARNES,WHITE,MALE,2013-03-05,cpdp,1352,POLICE OFFICER,13912.0,6.0,1986-01-01
124960,ROBERT,DISTASIO,WHITE,MALE,1994-01-18,cpdp,6968,DETECTIVE,20164.0,610.0,1964-01-01
112904,JEREMY,CARTER,WHITE,MALE,2013-05-01,cpdp,4072,FIELD TRAINING OFFICER,4007.0,11.0,1981-01-01
112905,JEREMY,DRZEWIECKI,WHITE,MALE,2000-10-10,cpdp,7420,POLICE OFFICER,2203.0,4.0,1970-01-01
112906,JEREMY,KELLER,WHITE,MALE,2015-08-31,cpdp,14329,POLICE OFFICER,17336.0,11.0,1988-01-01
112907,JEREMY,LORENZ,WHITE,MALE,2010-09-01,cpdp,16585,POLICE OFFICER,10256.0,10.0,1981-01-01
112908,JEREMY,LOWE,WHITE,MALE,2003-04-28,cpdp,16648,POLICE OFFICER,16615.0,19.0,1978-01-01
112909,JEREMY,MASON,BLACK,MALE,2015-10-26,cpdp,17674,POLICE OFFICER,17917.0,44.0,1989-01-01


accusations 244436 records


cr_id,UID,category,category_code,finding,outcome,incident_id,officer_id
C170981,130469,SUPERVISOR RESPONSIBILITY: FAIL TO OBTAIN COM...,12D,NS,NO ACTION TAKEN,113093,30477
293156,132382,SEARCH OF PREMISE/VEHICLE WITHOUT WARRANT,03C,EX,No Action Taken,88533,32390
293156,127190,SEARCH OF PREMISE/VEHICLE WITHOUT WARRANT,03C,EX,No Action Taken,88533,27198
293157,130197,INADEQUATE/FAILURE TO PROVIDE SERVICE,10U,NS,No Action Taken,88534,30205
293157,129579,INADEQUATE/FAILURE TO PROVIDE SERVICE,10U,NS,No Action Taken,88534,29587
293158,123144,MISCELLANEOUS,04J,NS,No Action Taken,88535,23152
293158,105674,MISCELLANEOUS,04J,NS,No Action Taken,88535,5682
293158,125456,MISCELLANEOUS,04J,NS,No Action Taken,88535,25464
293160,124948,USE OF PROFANITY,01A,UN,No Action Taken,88537,24956
293161,122749,ARRESTEE - DURING ARREST,05A,UN,No Action Taken,88538,22757


victims 89265 records


,gender,race,incident_id,injury_condition,injury_description,date_of_birth,deceased,id
75763,MALE,HISPANIC,40338,"INJURED, NOT HOSPITALIZED",LEFT SIDE OF THE NECK HAS A SCRATCH.,1982-01-01,False,75764
59804,MALE,HISPANIC,17115,"INJURED, NOT HOSPITALIZED",LACERATION TO RIGHT EYEBROW,1989-01-01,False,59805
72315,FEMALE,BLACK,38681,"NO VISIBLE INJURY, APPARENTLY NORMAL","HAS A HEART CONDITION, CHEST PAINS",1977-01-01,False,72316
5691,MALE,BLACK,2564,"INJURED, NOT HOSPITALIZED",STAB WOUNDS TO SCROTUM AREA AND BRUISE ON FORE...,1944-01-01,False,5692
44508,MALE,BLACK,28704,"INJURED, HOSPITALIZED",ABRASION TO LEFT SIDE OF FACE,1996-01-01,False,44509
44512,FEMALE,BLACK,29093,"INJURED, NOT HOSPITALIZED",BRUISES ON KNEES AND LEGS HAIR IS MISSING AND ...,1988-01-01,False,44513
10210,FEMALE,BLACK,6859,UNKNOWN,COMPLAINANT STATES THAT HER RIGHT ARM HURTS,1977-01-01,False,10211
10212,MALE,BLACK,6825,"INJURED, HOSPITALIZED",BROKEN WRIST,1977-01-01,False,10213
5685,MALE,BLACK,2822,"INJURED, NOT HOSPITALIZED",INJURY TO RIGHT EYE,1974-01-01,False,5686
72327,FEMALE,BLACK,41114,"INJURED, HOSPITALIZED",SWELLING IN CHEST AREA AND HEADACHE,1973-01-01,False,72328


## Listing Officers by Number of Accusations

This matches the results at [cpdp.co](https://cpdp.co).

In [18]:
def cpdp_officer_url(id):
    return f"https://cpdp.co/officer/{id}"


desc = db.desc
count = db.func.count
with app.app_context():
    q = (
        db.session.query(
            Officer.source_id, Officer.first_name, Officer.last_name, count()
        )
        .join(Accusation)
        .group_by(Officer.id)
        .order_by(desc(count()))
    )
    print("Query:\n", str(q), "\n")
    df = pd.DataFrame(
        data=[
            (
                id,
                f"{first.title()} {last.title()}",
                num_accusations,
                cpdp_officer_url(id),
            )
            for id, first, last, num_accusations in q
        ],
        columns=["id", "Officer Name", "Number of Accusations", "Source Link"],
    ).set_index("id")


display(HTML(df.iloc[:10].to_html(render_links=True)))


Query:
 SELECT officer.source_id AS officer_source_id, officer.first_name AS officer_first_name, officer.last_name AS officer_last_name, count(*) AS count_1 
FROM officer JOIN accusation ON officer.id = accusation.officer_id GROUP BY officer.id ORDER BY count(*) DESC 



id,Officer Name,Number of Accusations,Source Link
8562,Jerome Finnigan,175,https://cpdp.co/officer/8562
21837,Joe Parker,137,https://cpdp.co/officer/21837
17816,Edward May,136,https://cpdp.co/officer/17816
8138,Glenn Evans,132,https://cpdp.co/officer/8138
21468,Kevin Osborn,125,https://cpdp.co/officer/21468
28805,Charles Toussas,123,https://cpdp.co/officer/28805
31631,Adam Zelitzky,117,https://cpdp.co/officer/31631
29033,Jerome Turbyville,114,https://cpdp.co/officer/29033
4807,Maurice Clayton,109,https://cpdp.co/officer/4807
32166,Emmett Mc Clendon,109,https://cpdp.co/officer/32166
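
For comparison, the same kind of ranking can be computed in pandas without going through the database. This is a minimal sketch on made-up officer IDs, not a replacement for the SQL query above; `toy_accusations` and `toy_counts` are hypothetical names.

```python
import pandas as pd

# Made-up accusations: one row per (incident, officer) pair
toy_accusations = pd.DataFrame({"officer_id": [1, 1, 2, 1, 3, 2]})

# Count accusations per officer and list the most-accused first
toy_counts = (
    toy_accusations.groupby("officer_id")
    .size()
    .sort_values(ascending=False)
    .rename("num_accusations")
)
print(toy_counts)
```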
