# Detention By Nationality Analysis

The full methodology for this analysis is available [here](../methodology.md).

## Load the data

In [2]:
import pandas as pd
import sys
sys.path.append("../utils")
import loaders

*Note: loaders is a custom module to handle basic data-loading. It is available [here](https://github.com/BuzzFeedNews/2015-08-immigration/blob/master/utils/loaders.py).*

In [7]:
first_scheduled_proceeding = pd.read_csv("../data/first-scheduled-proceeding.csv", 
     parse_dates=["ADJ_DATE"],
     dtype={
          "IDNCASE": str,
          "IDNPROCEEDING": str,
     },
     encoding='latin1'
)

*Note: first-scheduled-proceeding.csv is a pre-processed data file. The code to create that file from tbl_schedule.csv is available [here](../utils/generate-first-scheduled-proceeding.py).*

In [27]:
nationality_table = loaders.load_file("tblLookupNationality.csv")

In [4]:
case_date_list = [
    "E_28_DATE",
    "DATE_OF_ENTRY",
    "C_BIRTHDATE",
    "C_RELEASE_DATE",
    "DATE_DETAINED",
    "DATE_RELEASED"
]

In [5]:
_cases = loaders.load_file("A_tblCase.csv",
    parse_dates=case_date_list,
    dtype={
        "IDNCASE": str
    },
)

In [8]:
_cases["GENDER"] = _cases["GENDER"].fillna("UNK")

In [9]:
_charges = loaders.load_file("B_tblProceedCharges.csv",
    dtype={ "IDNCASE": str, "IDNPROCEEDING": str })

b'Skipping line 1165848: expected 5 fields, saw 6\n'
b'Skipping line 1433634: expected 5 fields, saw 6\n'
b'Skipping line 2646392: expected 5 fields, saw 6\n'
b'Skipping line 2847501: expected 5 fields, saw 6\n'
b'Skipping line 2947399: expected 5 fields, saw 6\n'
b'Skipping line 3131015: expected 5 fields, saw 6\n'


*Note: Six rows — of the more than 8 million total rows — in the charges table contain malformed data stemming from extra tab characters, triggering the warning messages above.*
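The skipping behavior described above can be reproduced on a small scale. This is a minimal sketch, not the `loaders` implementation: the sample text, the tab delimiter, and the `on_bad_lines` setting (available in pandas 1.3 and later) are assumptions chosen for illustration.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample. The third data row carries an extra
# field, so it has 4 fields where the header declares 3.
sample = (
    "IDNCASE\tIDNPROCEEDING\tCHARGE\n"
    "1\t10\t212a06Ai\n"
    "2\t20\t212a06Ci\n"
    "3\t30\t237a02\textra\n"
    "4\t40\t212a07\n"
)

charges = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    dtype=str,
    on_bad_lines="skip",  # "warn" would also report each skipped line
)

print(charges)
```

The malformed row is dropped and the other three rows load normally, which is why a handful of bad lines does not affect the rest of the table.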

## Process the data

Join the various tables and prepare them for analysis.

In [10]:
charges_group = _charges.groupby([ "IDNCASE", "IDNPROCEEDING" ])

In [11]:
charge_lists = pd.DataFrame({
    "charge_list": charges_group["CHARGE"].apply("|".join)
}).reset_index()

In [12]:
charge_lists.head()

| | IDNCASE | IDNPROCEEDING | charge_list |
|---|---|---|---|
| 0 | 2046920 | 3200048 | 212a06Ai |
| 1 | 2046921 | 3200049 | 212a06Ai |
| 2 | 2046922 | 3200050 | 212a06Ai |
| 3 | 2046923 | 3200051 | 212a06Ci |
| 4 | 2046923 | 3525150 | 212a06Ci |
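To make the grouping step concrete, here is a small sketch on invented data. The rows in `toy_charges` are hypothetical; only the column names come from the notebook. It uses `.agg` rather than `.apply`, which should produce the same result for this kind of reduction.

```python
import pandas as pd

# Hypothetical charges: proceeding 10 has two charges, proceeding 20 has one.
toy_charges = pd.DataFrame({
    "IDNCASE": ["1", "1", "2"],
    "IDNPROCEEDING": ["10", "10", "20"],
    "CHARGE": ["212a06Ai", "212a07AiI", "237a02Aiii"],
})

# Collapse each (case, proceeding) pair into one pipe-delimited string.
toy_charge_lists = (
    toy_charges
    .groupby(["IDNCASE", "IDNPROCEEDING"])["CHARGE"]
    .agg("|".join)
    .rename("charge_list")
    .reset_index()
)

print(toy_charge_lists)
```

Note that `"|".join` raises a `TypeError` if a group contains a missing `CHARGE` value, so this approach assumes the column is fully populated.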


In [13]:
assert(charge_lists["IDNCASE"].nunique() == 5033293)
assert(len(first_scheduled_proceeding) == 5045511)

From the numbers above: a small fraction of cases (approximately 0.2%) have a scheduled proceeding but no charges.
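The 0.2% figure can be checked directly from the two asserted counts. This sketch assumes `first_scheduled_proceeding` holds one row per case, so that its length is comparable to the number of distinct cases with charges.

```python
# Counts taken from the assertions above.
cases_with_charges = 5033293
first_proceedings = 5045511

# Cases that have a scheduled proceeding but no matching charge rows,
# under the one-row-per-case assumption stated above.
missing = first_proceedings - cases_with_charges
share_pct = 100.0 * missing / first_proceedings

print(missing, round(share_pct, 2))
```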

In [14]:
cases_with_first_proceeding = first_scheduled_proceeding\
    .merge(charge_lists, how="left", on=[ "IDNCASE", "IDNPROCEEDING" ])\
    .merge(_cases, how="left", on="IDNCASE", suffixes=["_schedule", "_case"])
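Because both merges use `how="left"`, every first scheduled proceeding is kept even when no charges match. A small sketch with invented rows shows the effect; the `indicator=True` option is added here only to make unmatched rows visible and is not part of the notebook's code.

```python
import pandas as pd

# Hypothetical data: case 3 has a proceeding but no charge list.
toy_proceedings = pd.DataFrame({
    "IDNCASE": ["1", "2", "3"],
    "IDNPROCEEDING": ["10", "20", "30"],
})
toy_charge_lists = pd.DataFrame({
    "IDNCASE": ["1", "2"],
    "IDNPROCEEDING": ["10", "20"],
    "charge_list": ["212a06Ai", "237a02Aiii"],
})

merged = toy_proceedings.merge(
    toy_charge_lists,
    how="left",
    on=["IDNCASE", "IDNPROCEEDING"],
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
)

# Rows present only on the left side have a missing charge_list.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```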

Legal representatives file the EOIR-28 form to notify the court of their representation for a given immigrant.

`ADJ_DATE` in this table indicates the date of the case's first proceeding.

In [15]:
cases_with_first_proceeding["legal_rep_at_first_proceeding"] = cases_with_first_proceeding\
    .apply(lambda x: x["E_28_DATE"] <= x["ADJ_DATE"], axis=1)
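The same flag can be computed with a column-wise comparison, which is usually much faster than a row-wise `apply` on a table of this size. This is a sketch on invented dates, assuming both columns are parsed as datetimes; a missing `E_28_DATE` compares as `False`.

```python
import pandas as pd

# Hypothetical rows: filed before the hearing, filed after it, never filed.
toy = pd.DataFrame({
    "E_28_DATE": pd.to_datetime(["2010-01-05", "2010-03-01", None]),
    "ADJ_DATE": pd.to_datetime(["2010-02-01", "2010-02-01", "2010-02-01"]),
})

# Vectorized comparison; NaT on either side yields False.
toy["legal_rep_at_first_proceeding"] = toy["E_28_DATE"] <= toy["ADJ_DATE"]

print(toy)
```

Under this definition, cases with no EOIR-28 on file are counted as unrepresented at the first proceeding.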

## Select non-criminal removal cases between Jan. 1, 2003 and Jan. 1, 2015

In [19]:
selected_cases = cases_with_first_proceeding[
    # Select cases with first-scheduled-hearing dates in 2003–2014
    (cases_with_first_proceeding["ADJ_DATE"] >= "2003-01-01") &
    (cases_with_first_proceeding["ADJ_DATE"] < "2015-01-01")
].copy()

In [20]:
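# Cases without any charges have a missing charge_list; na=False flags them as False
selected_cases["has_criminal_charge"] = (
    selected_cases["charge_list"].str.contains("237a02", na=False) |
    selected_cases["charge_list"].str.contains("212a02", na=False)
)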
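A short sketch of how the substring test behaves on pipe-delimited charge lists, including a missing value. The sample strings are invented, and the single regular expression with `|` alternation is an alternative spelling of the two `str.contains` calls, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical charge lists; the last case has no charges at all.
toy_charge_list = pd.Series([
    "212a06Ai",
    "237a02Aiii|212a06Ai",
    "212a02AiI",
    np.nan,
])

# One regex matching either charge prefix; na=False maps missing values to False.
has_criminal_charge = toy_charge_list.str.contains("237a02|212a02", na=False)

print(has_criminal_charge.tolist())
```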

In [21]:
selected_cases["detained"] = selected_cases["CUSTODY"].map({"N": 0, "D": 1, "R": 1})
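The mapping treats both `D` and `R` as detained. Any value outside the dictionary, including a missing one, becomes `NaN`, as this sketch shows. The `"X"` code is hypothetical and included only to illustrate an unmapped value.

```python
import pandas as pd

# Hypothetical custody codes, with one missing and one unmapped value.
custody = pd.Series(["N", "D", "R", None, "X"])

# Values without a dictionary entry map to NaN rather than raising an error.
detained = custody.map({"N": 0, "D": 1, "R": 1})

print(detained.tolist())
```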

## Calculate detention rates by nationality

In [23]:
custody_by_nationality = selected_cases.groupby(["NAT", "CUSTODY"])\
    .size()\
    .unstack()\
    .fillna(0)

In [24]:
custody_by_nationality["total"] = custody_by_nationality.sum(axis=1)

In [25]:
custody_by_nationality["percent_detained"] = custody_by_nationality\
    .apply(lambda x: round(100.0 * (x["D"] + x["R"]) / x["total"], 1), axis=1)
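The three cells above can be traced on a tiny invented table. This sketch uses hypothetical nationality codes `AA` and `BB`, and a vectorized expression in place of the row-wise `apply`, which should give the same rounded percentages.

```python
import pandas as pd

# Hypothetical cases: AA has 1 never-detained, 2 detained, 1 released; BB has 2 never-detained.
toy_cases = pd.DataFrame({
    "NAT": ["AA", "AA", "AA", "AA", "BB", "BB"],
    "CUSTODY": ["N", "D", "R", "D", "N", "N"],
})

# Count cases per (nationality, custody) pair, one column per custody code.
counts = toy_cases.groupby(["NAT", "CUSTODY"]).size().unstack().fillna(0)
counts["total"] = counts.sum(axis=1)

# Share of cases that were ever detained (D or R), as a percentage.
counts["percent_detained"] = (
    100.0 * (counts["D"] + counts["R"]) / counts["total"]
).round(1)

print(counts)
```

By default `groupby` drops rows whose key is missing, so cases without a `NAT` or `CUSTODY` value would not be counted in `total`.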

In [28]:
nationality_table.set_index("NAT_CODE")["NAT_NAME"].head()

NAT_CODE
??    UNKNOWN NATIONALITY
AB                  ARUBA
AC    ANTIGUA AND BARBUDA
AF            AFGHANISTAN
AG                ALGERIA
Name: NAT_NAME, dtype: object

In [29]:
# Add full country names
custody_by_nationality["NAT_NAME"] = custody_by_nationality\
    .join(nationality_table.set_index("NAT_CODE")[["NAT_NAME"]])["NAT_NAME"]
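The lookup join matches the `NAT` index against `NAT_CODE`. A small sketch with invented counts shows what happens when a code has no lookup entry; `"ZZ"` is a hypothetical code used only for that purpose.

```python
import pandas as pd

# Hypothetical per-nationality totals, indexed by nationality code.
toy_counts = pd.DataFrame(
    {"total": [4.0, 2.0]},
    index=pd.Index(["AF", "ZZ"], name="NAT"),
)
toy_lookup = pd.DataFrame({
    "NAT_CODE": ["AF", "AG"],
    "NAT_NAME": ["AFGHANISTAN", "ALGERIA"],
})

# Left join on the index: codes missing from the lookup get NaN names.
toy_counts = toy_counts.join(toy_lookup.set_index("NAT_CODE")[["NAT_NAME"]])

print(toy_counts)
```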

In [37]:
main_columns = ["NAT_NAME", "N", "D", "R", "total", "percent_detained"]
custody_by_nationality = custody_by_nationality.sort_values("total", ascending=False)[main_columns]

## Table: Per-Nationality Detention Rate

In [34]:
custody_by_nationality

| NAT | N | D | R | total | percent_detained | NAT_NAME |
|---|---|---|---|---|---|---|
| FW | 0.0 | 1.0 | 0.0 | 1.0 | 100.0 | FRENCH WEST INDIES |
| FO | 0.0 | 0.0 | 1.0 | 1.0 | 100.0 | FAEROE ISLAND |
| ?? | 316.0 | 3643.0 | 500.0 | 4459.0 | 92.9 | UNKNOWN NATIONALITY |
| XS | 3.0 | 35.0 | 3.0 | 41.0 | 92.7 | SOUTH SUDAN |
| NN | 40.0 | 328.0 | 51.0 | 419.0 | 90.5 | NO NATIONALITY |
| ... | ... | ... | ... | ... | ... | ... |
| MN | 11.0 | 0.0 | 0.0 | 11.0 | 0.0 | MONACO |
| BV | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | BOUVET ISLAND |
| UV | 3.0 | 0.0 | 0.0 | 3.0 | 0.0 | UPPER VOLTA |
| TM | 2.0 | 0.0 | 0.0 | 2.0 | 0.0 | EAST TIMOR |


In [38]:
# Save the output table to a CSV file
custody_by_nationality.to_csv("custody_by_nationality.csv", index=False)
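With `index=False`, the `NAT` code in the index is not written, so the file identifies each row by `NAT_NAME` alone. The sketch below, on an invented one-row table, compares the two header lines; keeping the index is shown only as an option, not as what the notebook does.

```python
import pandas as pd

# Hypothetical one-row table indexed by nationality code.
toy_table = pd.DataFrame(
    {"NAT_NAME": ["ARUBA"], "total": [3.0]},
    index=pd.Index(["AB"], name="NAT"),
)

# to_csv returns the CSV text when no path is given.
without_index = toy_table.to_csv(index=False)
with_index = toy_table.to_csv()

print(without_index.splitlines()[0])
print(with_index.splitlines()[0])
```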