# Pretrial Recidivism Label Construction (MDJ + PSP)

## Overview
This notebook constructs **pretrial recidivism outcome labels** by linking
MDJ–CPMC case-level records with Pennsylvania State Police (PSP) arrest data.

For each MDJ case, we identify whether the defendant was arrested **during the
pretrial window** and classify those arrests by offense severity.
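
The two ideas in this overview (an inclusive date-window check and a grade-based severity bucket) can be sketched on toy data. This is a minimal illustration, not the notebook's pipeline: the `offense_level` helper, the `toy_arrests` frame, and the window dates are hypothetical and chosen only for the example.

```python
import pandas as pd

def offense_level(grade):
    """Bucket a grade code: F* -> felony, M* -> misdemeanor, anything else -> other."""
    g = str(grade).strip().upper()
    if g.startswith("F"):
        return "felony"
    if g.startswith("M"):
        return "misdemeanor"
    return "other"  # includes missing grades, which stringify to "NAN"

# Hypothetical arrests for a single defendant
toy_arrests = pd.DataFrame({
    "arrest_date": pd.to_datetime(["2019-01-05", "2019-02-10", "2019-06-01"]),
    "grade": ["M2", "F3", "S"],
})

# Hypothetical pretrial window for one case
window_start = pd.Timestamp("2019-01-01")
window_end = pd.Timestamp("2019-03-01")

toy_arrests["offense_level"] = toy_arrests["grade"].map(offense_level)
# Both endpoints count as inside the window
toy_arrests["in_window"] = toy_arrests["arrest_date"].between(
    window_start, window_end, inclusive="both"
)
print(toy_arrests)
```

Note that any grade that does not start with `F` or `M`, including a missing grade, lands in the `other` bucket under this rule.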

## Key Outputs
For each MDJ case, we generate:
- Any pretrial recidivism (binary)
- Felony, misdemeanor, and other pretrial recidivism flags
- Number of in-window arrests
- Earliest in-window arrest date
- Pretrial window start and end dates (carried through)
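
As a rough sketch of how these per-case outputs can be derived from an arrest-level table, the example below aggregates a toy long table with pandas named aggregation. It is an illustrative alternative that assumes a table with `within_window`, `offense_level`, and `arrest_date` columns; the `toy_long` data and its values are hypothetical.

```python
import pandas as pd

# Hypothetical long table: one row per (case, candidate arrest)
toy_long = pd.DataFrame({
    "id": [1, 1, 1, 2],
    "docketnumber": ["MJ-1", "MJ-1", "MJ-1", "MJ-2"],
    "arrest_date": pd.to_datetime(["2021-01-10", "2021-02-01", "2021-07-01", None]),
    "offense_level": ["misdemeanor", "felony", "other", None],
    "within_window": [1, 1, 0, 0],
})

in_win = toy_long["within_window"] == 1
flags = toy_long.assign(
    felony=(in_win & (toy_long["offense_level"] == "felony")).astype(int),
    misdemeanor=(in_win & (toy_long["offense_level"] == "misdemeanor")).astype(int),
    other=(in_win & (toy_long["offense_level"] == "other")).astype(int),
    # Keep the arrest date only for in-window arrests so min() ignores the rest
    in_window_date=toy_long["arrest_date"].where(in_win),
)

toy_labels = flags.groupby(["id", "docketnumber"], as_index=False).agg(
    pretrial_recidivism=("within_window", "max"),
    felony_recidivism=("felony", "max"),
    misdemeanor_recidivism=("misdemeanor", "max"),
    other_recidivism=("other", "max"),
    n_in_window_arrests=("within_window", "sum"),
    earliest_in_window_arrest=("in_window_date", "min"),
)
print(toy_labels)
```

Built-in aggregations such as `max`, `sum`, and `min` are usually much faster than a Python-level `groupby(...).apply(...)` on large tables, which is the main reason to consider this form.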

## Inputs
- `mdj_cpmc.csv`: Case-level MDJ–CPMC dataset with pretrial windows (`pretrial_start`, `pretrial_end`)
- `csv data/CMU PSP Data.csv`: Raw PSP arrest-level dataset

## Output
- `csv data/labels_mdj_cpmc.csv`: Case-level recidivism label dataset (one row per `id` and `docketnumber`)

## Important Notes
- Same-case arrests are explicitly excluded
- Arrests are linked using defendant `id`
- An arrest is treated as same-case if it shares the case's OTN or falls within ±2 days of the case's filing date
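
The same-case rule can be sketched on a toy frame as follows. This is a minimal sketch under stated assumptions: the `toy` data is hypothetical, and the explicit `notna()` guards are an addition that keeps two missing OTNs from being treated as a match.

```python
import pandas as pd

# Hypothetical case/arrest pairs; otn is the case OTN, otn_psp the arrest OTN
toy = pd.DataFrame({
    "otn":         ["A1", "A1", "A1", None],
    "otn_psp":     ["A1", "B2", "C3", None],
    "filingdate":  pd.to_datetime(["2020-03-10"] * 4),
    "arrest_date": pd.to_datetime(
        ["2020-03-01", "2020-03-11", "2020-04-15", "2020-05-01"]
    ),
})

# Rule 1: both OTNs present and equal
same_otn = (
    toy["otn"].notna()
    & toy["otn_psp"].notna()
    & (toy["otn"] == toy["otn_psp"])
)

# Rule 2: arrest within 2 days of filing, in either direction
day_gap = (toy["arrest_date"] - toy["filingdate"]).abs().dt.days
same_date = day_gap <= 2

toy["same_case_arrest"] = same_otn | same_date
print(toy)
```

In this toy data the first row is excluded by the OTN match, the second by the date buffer, and the last two rows remain candidates for pretrial recidivism.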

In [1]:
import pandas as pd
import numpy as np

In [2]:
MDJ_CPMC_PATH = "mdj_cpmc.csv"         
PSP_PATH      = "csv data/CMU PSP Data.csv"      # raw PSP

In [3]:
mdj = pd.read_csv(MDJ_CPMC_PATH)


In [4]:
psp = pd.read_csv(PSP_PATH)


In [5]:
print("MDJ+CPMC shape:", mdj.shape)
print("PSP shape:", psp.shape)

MDJ+CPMC shape: (452265, 52)
PSP shape: (12753994, 60)


In [6]:
psp.columns.to_list()

['dobyear',
 'dobmonth',
 'dobday',
 'name',
 'sex',
 'race',
 'rapstatus',
 'max_year',
 'max_month',
 'max_day',
 'death_yeardeath_month',
 'death_day',
 'sor_status',
 'dv_flag',
 'dob',
 'otn',
 'dispo_status',
 'arrest_year',
 'arrest_month',
 'arrest_day',
 'county',
 'ori',
 'juvflag',
 'release_year',
 'release_month',
 'release_day',
 'laflag',
 'arrest_date',
 'charge',
 'grade',
 'inchoate_charge',
 'final_charge',
 'consec_charge',
 'disp_year',
 'disp_month',
 'disp_day',
 'disp1',
 'disp2',
 'disp3',
 'disp4',
 'disp5',
 'disp6',
 'disp7',
 'disp8',
 'disp9',
 'disp10',
 'susp_flag',
 'conv_flag',
 'min_sent_year',
 'max_sent_year',
 'offense_year',
 'offense_month',
 'offense_day',
 'offense_date',
 'disp_date',
 'ofn_title',
 'ofn_section',
 'ofn_subsection',
 'citation',
 'id']

In [7]:
mdj.columns.to_list()

['Unnamed: 0',
 'docketnumber',
 'filingdate',
 'offensedate',
 'complaintdate',
 'arrestdate',
 'offensedispositiondate',
 'startdate',
 'enddate',
 'casestatus',
 'casecategory',
 'casedisposition',
 'citytownboro',
 'countyofoffense',
 'defendantdisplayname',
 'dob',
 'gender',
 'race',
 'ethnicity',
 'defendantcounty',
 'zipcode',
 'defenseattorneyname',
 'defenseattorneyrepresentationtyp',
 'ori',
 'fine',
 'fineadjustment',
 'cost',
 'costadjustment',
 'restitution',
 'restitutionadjustment',
 'sentencingjudge',
 'penaltyeventtype',
 'programtype',
 'programperiod',
 'facilitylocation',
 'limitedaccessoffenseindicator',
 'title',
 'section',
 'subsection',
 'grade',
 'offensedisposition',
 'citationcomplaintnumber',
 'sequencenumber',
 'id',
 'otn',
 'originatingdocketnumber',
 'cpmc_offensedispositiondate',
 'pretrial_start',
 'pretrial_end',
 'transferred',
 'pretrial_days',
 'pretrial_days_capped']

In [8]:
# 2. Make sure date columns are datetime
date_cols_mdj = ['pretrial_start', 'pretrial_end', 'filingdate']
for col in date_cols_mdj:
    if col in mdj.columns:
        mdj[col] = pd.to_datetime(mdj[col], errors='coerce')

# PSP dates
psp['arrest_date'] = pd.to_datetime(psp['arrest_date'], errors='coerce')


In [9]:
# 3. Derive offense_level on the full PSP table from the grade prefix
#    F* -> felony, M* -> misdemeanor, anything else (including missing grades) -> other
g = psp.get('grade', pd.Series(index=psp.index, dtype='object')).astype(str).str.strip().str.upper()
psp['offense_level'] = np.where(
    g.str.startswith('F'), 'felony',
    np.where(g.str.startswith('M'), 'misdemeanor', 'other')
)

In [10]:
# 4. keep only identification + time + offense-level columns from PSP
psp_slim = psp[['id', 'otn', 'arrest_date', 'grade']].copy()

# derive offense_level on the slim table (the column added to `psp` above
# is not carried over by the column selection)
g = psp_slim['grade'].astype(str).str.upper().str.strip()
psp_slim['offense_level'] = np.where(
    g.str.startswith('F'), 'felony',
    np.where(g.str.startswith('M'), 'misdemeanor', 'other')
)

# optional: trim PSP to only arrests that could POSSIBLY fall in any pretrial window
# get overall min/max from MDJ
min_start = mdj['pretrial_start'].min()
max_end   = mdj['pretrial_end'].max()

# keep a small buffer of 2 days on each side
buffer = pd.Timedelta(days=2)
psp_slim = psp_slim[
    (psp_slim['arrest_date'] >= (min_start - buffer)) &
    (psp_slim['arrest_date'] <= (max_end + buffer))
].copy()

print("Trimmed PSP shape:", psp_slim.shape)


Trimmed PSP shape: (5658675, 5)


In [11]:
# 5. build the slim case table we will merge against
case_slim = mdj[['id', 'docketnumber', 'otn', 'filingdate',
                 'pretrial_start', 'pretrial_end']].copy()

print("Case-slim shape:", case_slim.shape)

Case-slim shape: (452265, 6)


In [12]:
# 6. long merge on id
long_df = case_slim.merge(
    psp_slim.rename(columns={'otn': 'otn_psp'}),
    on='id',
    how='left'
)

print("Long merged shape:", long_df.shape)

Long merged shape: (9168949, 10)


In [13]:
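# 7a. same OTN between case and PSP arrest
#     Require both OTNs to be present: astype(str) turns missing values into
#     the string "nan", so two missing OTNs would otherwise compare as equal.
long_df['same_case_otn'] = (
    long_df['otn'].notna() &
    long_df['otn_psp'].notna() &
    (long_df['otn'].astype(str) == long_df['otn_psp'].astype(str))
)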

# 7b. arrest within ±2 days of filing -> count as same case
long_df['same_case_date'] = (
    long_df['arrest_date'].notna() &
    long_df['filingdate'].notna() &
    ((long_df['arrest_date'] - long_df['filingdate']).abs().dt.days <= 2)
)

long_df['same_case_arrest'] = long_df['same_case_otn'] | long_df['same_case_date']

In [14]:
# Flag in-window arrests and exclude same-case arrests
long_df['within_window'] = (
    long_df['arrest_date'].between(long_df['pretrial_start'],
                                   long_df['pretrial_end'],
                                   inclusive='both')
    & (~long_df['same_case_arrest'])
).astype(int)

In [15]:
# Summarize each case's arrests into one row of labels;
# pretrial_start and pretrial_end are carried through to the output

def summarize_case(g):
    in_win = g["within_window"] == 1
    return pd.Series({
        "pretrial_recidivism": int(in_win.any()),
        "misdemeanor_recidivism": int((in_win & (g["offense_level"] == "misdemeanor")).any()),
        "felony_recidivism": int((in_win & (g["offense_level"] == "felony")).any()),
        "other_recidivism": int((in_win & (g["offense_level"] == "other")).any()),
        "n_in_window_arrests": int(in_win.sum()),
        "earliest_in_window_arrest": g.loc[in_win, "arrest_date"].min(),
        # Add pretrial period columns here
        "pretrial_start": g["pretrial_start"].iloc[0] if "pretrial_start" in g else pd.NaT,
        "pretrial_end": g["pretrial_end"].iloc[0] if "pretrial_end" in g else pd.NaT
    })

labels = (
    long_df
    .groupby(["id", "docketnumber"], as_index=False)
    .apply(summarize_case)
    .reset_index(drop=True)
)

print("Labels shape:", labels.shape)
labels.head()

# Save to CSV
labels.to_csv("csv data/labels_mdj_cpmc.csv", index=False)


Labels shape: (452265, 10)
