# Task-1 Making sense of the data

One of the reporters asks if you can find Mahmoud Khalil in the dataset. If you do, explain how you identified the correct record in the ICE data, by describing what logic you used, which fields you filtered on, how you handled any ambiguities and what steps you took to be confident it was the right match. 

## 0. Set up

In [1]:
from pathlib import Path
from datetime import datetime

import numpy as np
import pandas as pd

import process_data

In [2]:
arrests_filename = 'arrests-0923-0625.xlsx'
cwd = Path.cwd()
root = cwd.parent
data = root / "data"

In [32]:
pd.set_option("display.max_rows", 300)

In [3]:
arrests_df = process_data.read_arrests_data(data/arrests_filename)

In [4]:
arrests_df.head()

Unnamed: 0,apprehension_date,apprehension_state,apprehension_aor,final_program,apprehension_method,apprehension_criminality,case_status,case_category,departed_date,departure_country,final_order_yes_no,birth_year,citizenship_country,gender,apprehension_site_landmark,unique_identifier
0,2024-08-07 09:43:00,VIRGINIA,WASHINGTON AREA OF RESPONSIBILITY,ERO CRIMINAL ALIEN PROGRAM,NON-CUSTODIAL ARREST,1 CONVICTED CRIMINAL,8-EXCLUDED/REMOVED - INADMISSIBILITY,[16] REINSTATED FINAL ORDER,2024-08-19,HONDURAS,YES,1981,HONDURAS,MALE,"HBG GENERAL AREA, NON-SPECIFIC",0000b34edd657d516c02b13a7c352d62d0effcb6
1,2024-10-19 20:33:00,TEXAS,HOUSTON AREA OF RESPONSIBILITY,ERO CRIMINAL ALIEN PROGRAM,CAP LOCAL INCARCERATION,1 CONVICTED CRIMINAL,6-DEPORTED/REMOVED - DEPORTABILITY,[16] REINSTATED FINAL ORDER,2024-10-22,MEXICO,YES,1984,MEXICO,MALE,"HARRIS COUNTY JAIL, HOUSTON, TX",0000ba6e459998a6046d185d82cf4349de1479d0
2,2025-04-15 10:08:21,NEW JERSEY,NEWARK AREA OF RESPONSIBILITY,ERO CRIMINAL ALIEN PROGRAM,CAP FEDERAL INCARCERATION,1 CONVICTED CRIMINAL,8-EXCLUDED/REMOVED - INADMISSIBILITY,[16] REINSTATED FINAL ORDER,2025-06-10,DOMINICAN REPUBLIC,YES,1988,DOMINICAN REPUBLIC,MALE,"FORT DIX EAST, NEW JERSEY",0000c3d23fb0e444864559575900d410c4e8490f
3,2025-06-03 09:20:00,MINNESOTA,ST. PAUL AREA OF RESPONSIBILITY,FUGITIVE OPERATIONS,NON-CUSTODIAL ARREST,3 OTHER IMMIGRATION VIOLATOR,ACTIVE,[8G] EXPEDITED REMOVAL - CREDIBLE FEAR REFERRAL,NaT,,YES,1985,COLOMBIA,FEMALE,"SPM GENERAL AREA, NON-SPECIFIC",0000d3dbf8033b5f209f6547ffee5b84feb4f599
4,2025-01-21 17:41:00,,MIAMI AREA OF RESPONSIBILITY,ERO CRIMINAL ALIEN PROGRAM,CAP LOCAL INCARCERATION,2 PENDING CRIMINAL CHARGES,3-VOLUNTARY DEPARTURE CONFIRMED,[8C] EXCLUDABLE / INADMISSIBLE - ADMINISTRATIV...,2025-02-01,MEXICO,YES,1983,MEXICO,MALE,MIAMI DADE COUNTY JAIL TURNER GUILFORD KNIGHT ...,000104d730bf021326c6dc0deb3dd575304136b5


# Approach:

**Note:** for initial data exploration please see `./0-data_exploration.ipynb`

## 1. Choosing what to filter on:

**What information do we know about Mahmoud Khalil that we can use to find him in the dataset?**

Things that have been reported about Mahmoud Khalil:
* Location of arrest: New York (Columbia University Residence)
* Year of birth: 1995
* Citizenship: Algeria (born in Syria to Palestinian parents, and moved to Lebanon)
* He had a green card (i.e. he was not on a temporary visa or undocumented)
* Time and date of arrest: 2025-03-08, around 8:30pm
* Where was he taken after his arrest? [His own understanding of this was shared in his letter](https://www.newsweek.com/mahmoud-khalil-columbia-hamas-gaza-israel-letter-2047002)
  * 26 Federal Plaza - on the 8th March (?)
  * Elizabeth Detention Center, New Jersey - taken in the early hours of the 9th (NB when his wife tried to find him there on the 9th she was told he was not there, so there are some question marks over this)
  * ICE detention center in Louisiana



### Finding the relevant variables:

In [7]:
birth_1995 = arrests_df['birth_year'] == 1995
citizenship_algeria = arrests_df['citizenship_country'] == 'ALGERIA' # might need to try out SYRIA and LEBANON here (although I don't think he has citizenship)

In [8]:
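# Unique values of `col` that contain `str_match` (NB: str.contains treats the pattern as a regex by default)
unique_vars_filter_str = lambda col, str_match: arrests_df[col][arrests_df[col].fillna('').str.contains(str_match)].unique()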
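
Because `str.contains` interprets its pattern as a regular expression by default, search strings containing characters such as `(` or `.` could match unexpectedly. Below is a minimal sketch of a literal, case-insensitive variant. The function name `unique_values_containing` and the toy rows are assumptions for illustration, not part of `process_data`.

```python
import pandas as pd

def unique_values_containing(df, col, text):
    """Unique non-null values of df[col] containing `text` as a literal substring."""
    values = df[col].dropna().astype(str)
    # regex=False: treat `text` literally; case=False: ignore upper/lower case
    mask = values.str.contains(text, case=False, regex=False)
    return values[mask].unique()

# Toy data (invented) standing in for arrests_df
toy = pd.DataFrame({"apprehension_site_landmark": [
    "NDD - 26 FEDERAL PLAZA NY, NY",
    "ELIZABETH PD",
    None,
    "ndd - 26 federal plaza ny, ny",
]})

matches = unique_values_containing(toy, "apprehension_site_landmark", "26 FEDERAL")
print(matches)
```

In the notebook this would be called with `arrests_df` in place of `toy`.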

#### 1. Locations

##### 1. Arrest location = New York, initially taken to 26 Federal Plaza

In [7]:
np.sort(arrests_df['apprehension_state'].fillna('').unique())

array(['', 'ALABAMA', 'ALASKA', 'ARIZONA', 'ARKANSAS',
       'ARMED FORCES - EUROPE', 'ARMED FORCES - THE AMERICAS',
       'ARMED SERVICES - PACIFIC', 'CALIFORNIA', 'COLORADO',
       'CONNECTICUT', 'DELAWARE', 'DISTRICT OF COLUMBIA',
       'FEDERATED STATES OF MICRONESIA', 'FLORIDA', 'GEORGIA', 'GUAM',
       'HAWAII', 'IDAHO', 'ILLINOIS', 'INDIANA', 'IOWA', 'KANSAS',
       'KENTUCKY', 'LOUISIANA', 'MAINE', 'MARYLAND', 'MASSACHUSETTS',
       'MEXICO', 'MICHIGAN', 'MINNESOTA', 'MISSISSIPPI', 'MISSOURI',
       'MONTANA', 'NEBRASKA', 'NEVADA', 'NEW HAMPSHIRE', 'NEW JERSEY',
       'NEW MEXICO', 'NEW YORK', 'NORTH CAROLINA', 'NORTH DAKOTA',
       'NORTHERN MARIANA ISLANDS', 'OHIO', 'OKLAHOMA', 'OREGON',
       'PENNSYLVANIA', 'PUERTO RICO', 'RHODE ISLAND', 'SOUTH CAROLINA',
       'SOUTH DAKOTA', 'TAMAULIPAS', 'TENNESSEE', 'TEXAS', 'UTAH',
       'VERMONT', 'VIRGIN ISLANDS', 'VIRGINIA', 'WASHINGTON',
       'WEST VIRGINIA', 'WISCONSIN', 'WYOMING'], dtype=object)

In [8]:
unique_vars_filter_str('apprehension_site_landmark','26 FEDERAL')

array(['NDD - 26 FEDERAL PLAZA NY, NY'], dtype=object)

In [9]:
arrests_df[arrests_df['apprehension_site_landmark']=='NDD - 26 FEDERAL PLAZA NY, NY']['apprehension_state'].value_counts(dropna=False)

apprehension_state
NEW YORK          5646
NaN                151
NORTH CAROLINA       4
NEW JERSEY           1
FLORIDA              1
Name: count, dtype: int64



**Observations:**
* Looks like that is the correct site for 26 Federal Plaza, NY
* But note - not all who have `'apprehension_site_landmark' == 'NDD - 26 FEDERAL PLAZA NY, NY'` have `'apprehension_state' == 'NEW YORK'`
   * They may have been taken to 26 Federal Plaza for processing after being arrested elsewhere, but Florida and North Carolina are a long way from New York
   * 151 with missing `apprehension_state`
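
One way to quantify how far the landmark and state fields agree is to split the records at a landmark into matching, missing and conflicting states. The sketch below runs on invented toy rows; in the notebook the same logic would apply to `arrests_df`.

```python
import pandas as pd

# Toy data (invented) standing in for arrests_df
toy = pd.DataFrame({
    "apprehension_site_landmark": ["NDD - 26 FEDERAL PLAZA NY, NY"] * 4 + ["ELIZABETH PD"],
    "apprehension_state": ["NEW YORK", None, "FLORIDA", "NEW YORK", "NEW JERSEY"],
})

at_plaza = toy["apprehension_site_landmark"] == "NDD - 26 FEDERAL PLAZA NY, NY"
state = toy.loc[at_plaza, "apprehension_state"]

summary = pd.Series({
    "state_matches": int((state == "NEW YORK").sum()),
    "state_missing": int(state.isna().sum()),
    "state_conflicts": int((state.notna() & (state != "NEW YORK")).sum()),
})
print(summary)
```

A high share of missing or conflicting states would be a reason to filter on the landmark and the state separately rather than requiring both.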

In [24]:
state_ny = arrests_df['apprehension_state'] == 'NEW YORK' 
state_missing = arrests_df['apprehension_state'].isna() # NB: initial data exploration showed this field has many missing values, so don't rely on it being populated
loc_federal_plaza = arrests_df['apprehension_site_landmark'] == 'NDD - 26 FEDERAL PLAZA NY, NY'

##### 2. Taken to: Elizabeth Detention Center in New Jersey and ICE Detention Center in Louisiana

These don't appear in this dataset, which makes sense because it covers arrests, not detentions:

In [10]:
unique_vars_filter_str('apprehension_site_landmark','ELIZABETH')

array(['ELIZABETH CITY', 'ELIZABETH PD'], dtype=object)

In [11]:
unique_vars_filter_str('apprehension_site_landmark','LOUISIANA')

array(['LOUISIANA STATE PENITENTIARY - ANGOLA'], dtype=object)

#### 2. Apprehension date

In [11]:
get_date = lambda date_str: datetime.strptime(date_str, '%Y-%m-%d').date()

In [12]:
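def filter_dates(df, start_date, eq=True, end_date=None, date_col='apprehension_date'):
    """Boolean mask on the date part of `date_col`: an inclusive range if `end_date` is given, else == (or >= if eq=False) `start_date`."""
    if end_date:
        # inclusive range: start_date <= date <= end_date
        return (df[date_col].dt.date >= get_date(start_date)) & (df[date_col].dt.date <= get_date(end_date))
    elif eq:
        return df[date_col].dt.date == get_date(start_date)
    else:
        # open-ended: everything on or after start_date
        return df[date_col].dt.date >= get_date(start_date)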
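
Since the arrest was reported at around 8:30pm, a window around a timestamp may be more selective than whole calendar days. Below is a minimal sketch, assuming the recorded `apprehension_date` reflects the time of arrest rather than later processing (which is not confirmed). The helper name `filter_time_window` and the toy rows are illustrative.

```python
import pandas as pd

def filter_time_window(df, center, hours, date_col="apprehension_date"):
    """Boolean mask: rows whose timestamp lies within +/- `hours` of `center`."""
    center_ts = pd.Timestamp(center)
    delta = pd.Timedelta(hours=hours)
    # between() is inclusive on both ends; missing timestamps evaluate to False
    return df[date_col].between(center_ts - delta, center_ts + delta)

# Toy data (invented) standing in for arrests_df
toy = pd.DataFrame({"apprehension_date": pd.to_datetime([
    "2025-03-08 08:53:00",
    "2025-03-08 22:05:00",
    "2025-03-09 11:15:00",
    None,
])})

mask = filter_time_window(toy, "2025-03-08 20:30", hours=4)
print(toy[mask])
```

The mask can be combined with the other filters using `&`, in the same way as the output of `filter_dates`.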

In [13]:
arrests_df[filter_dates(arrests_df, '2025-03-08')]

Unnamed: 0,apprehension_date,apprehension_state,apprehension_aor,final_program,apprehension_method,apprehension_criminality,case_status,case_category,departed_date,departure_country,final_order_yes_no,birth_year,citizenship_country,gender,apprehension_site_landmark,unique_identifier
429,2025-03-08 22:05:00,TEXAS,HOUSTON AREA OF RESPONSIBILITY,ERO CRIMINAL ALIEN PROGRAM,CAP LOCAL INCARCERATION,2 PENDING CRIMINAL CHARGES,8-EXCLUDED/REMOVED - INADMISSIBILITY,[8C] EXCLUDABLE / INADMISSIBLE - ADMINISTRATIV...,2025-05-03,GUATEMALA,YES,2002,GUATEMALA,MALE,"HARRIS COUNTY JAIL, HOUSTON, TX",0068cbef972a10a35f635b938da93b004b0e8363
1801,2025-03-08 08:53:00,FLORIDA,MIAMI AREA OF RESPONSIBILITY,ERO CRIMINAL ALIEN PROGRAM,LOCATED,3 OTHER IMMIGRATION VIOLATOR,8-EXCLUDED/REMOVED - INADMISSIBILITY,[16] REINSTATED FINAL ORDER,2025-03-18,GUATEMALA,YES,1988,GUATEMALA,MALE,ORL - OSCEOLA COUNTY JAIL FLORIDA STATE,01bb776414fe6dd08384c3607f89f2527587afbb
4209,2025-03-08 23:36:00,GEORGIA,ATLANTA AREA OF RESPONSIBILITY,287G PROGRAM,287(G) PROGRAM,2 PENDING CRIMINAL CHARGES,8-EXCLUDED/REMOVED - INADMISSIBILITY,[8C] EXCLUDABLE / INADMISSIBLE - ADMINISTRATIV...,2025-05-12,NICARAGUA,YES,1996,NICARAGUA,MALE,"DTN GENERAL AREA, NON-SPECIFIC",041cb3841e8124f1b2e9fb5a4014c48178bafa64
6619,2025-03-08 09:47:00,WASHINGTON,SEATTLE AREA OF RESPONSIBILITY,NON-DETAINED DOCKET CONTROL,LOCATED,1 CONVICTED CRIMINAL,6-DEPORTED/REMOVED - DEPORTABILITY,[16] REINSTATED FINAL ORDER,2025-06-03,MEXICO,YES,1984,MEXICO,MALE,SEATTLE FUG OPS,067e9f4a30f1d622088ba0ae32409d3bbd00a6a4
6848,2025-03-08 07:13:00,TEXAS,HARLINGEN AREA OF RESPONSIBILITY,ERO CRIMINAL ALIEN PROGRAM,CAP FEDERAL INCARCERATION,3 OTHER IMMIGRATION VIOLATOR,6-DEPORTED/REMOVED - DEPORTABILITY,[16] REINSTATED FINAL ORDER,2025-03-12,GUATEMALA,YES,1995,GUATEMALA,FEMALE,COASTAL BEND DETENTION CENTER,06b47e14d010e63a5e03fc9c5945486dbcebed87
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
264332,2025-03-08 04:39:00,TEXAS,DALLAS AREA OF RESPONSIBILITY,FUGITIVE OPERATIONS,NON-CUSTODIAL ARREST,3 OTHER IMMIGRATION VIOLATOR,,,NaT,,,1986,HONDURAS,MALE,N DIST TX LUBBOCK DIV LUBBOCK CO NON CRIM,
264333,2025-03-08 09:50:11,TEXAS,HARLINGEN AREA OF RESPONSIBILITY,NON-DETAINED DOCKET CONTROL,CAP LOCAL INCARCERATION,3 OTHER IMMIGRATION VIOLATOR,9-VR WITNESSED,[9] VR UNDER SAFEGUARDS,2025-03-09,MEXICO,NO,2003,MEXICO,MALE,"HLG GENERAL AREA, NON-SPECIFIC",
264334,2025-03-08 07:54:37,TEXAS,HARLINGEN AREA OF RESPONSIBILITY,ERO CRIMINAL ALIEN PROGRAM,CAP LOCAL INCARCERATION,2 PENDING CRIMINAL CHARGES,9-VR WITNESSED,[9] VR UNDER SAFEGUARDS,2025-03-08,MEXICO,NO,1994,MEXICO,MALE,"CAMERON COUNTY JAIL- OLMITO, TX",
264335,2025-03-08 09:47:51,TEXAS,HARLINGEN AREA OF RESPONSIBILITY,DETAINED DOCKET CONTROL,LOCATED,2 PENDING CRIMINAL CHARGES,9-VR WITNESSED,[9] VR UNDER SAFEGUARDS,2025-03-08,MEXICO,NO,2005,MEXICO,MALE,"LRD GENERAL AREA, NON-SPECIFIC",


#### 3. Immigration status

According to the documentation, Unique Identifier is based on the Alien Registration Number (A-number). My first thought was that, as a green card holder, Mahmoud Khalil might not have an A-number, and therefore wouldn't have a Unique Identifier. That assumption is shaky: green cards carry an A-number, and the ICE enforcement agents reportedly thought he had a student visa, so he may also have had an A-number from before he had a green card. A missing Unique Identifier is therefore a weak signal at best.

In [64]:
no_ui = arrests_df['unique_identifier'].isna()

Are there any other clues in other variables that indicate immigration status?

In [65]:
arrests_df['final_program'].value_counts()

final_program
ERO CRIMINAL ALIEN PROGRAM             162433
FUGITIVE OPERATIONS                     39776
NON-DETAINED DOCKET CONTROL             33165
ALTERNATIVES TO DETENTION               12761
DETAINED DOCKET CONTROL                  8972
287G PROGRAM                             4934
MOBILE CRIMINAL ALIEN TEAM               2214
JUVENILE                                  444
ERO CRIMINAL PROSECUTIONS                 304
DETENTION AND DEPORTATION                 159
LAW ENFORCEMENT AREA RESPONSE UNITS        45
287G TASK FORCE                            10
VIOLENT CRIMINAL ALIEN SECTION              5
JOINT CRIMINAL ALIEN RESPONSE TEAM          4
Name: count, dtype: int64

In [66]:
arrests_df['apprehension_method'].value_counts()

apprehension_method
CAP LOCAL INCARCERATION                        112103
NON-CUSTODIAL ARREST                            56910
LOCATED                                         31405
CAP FEDERAL INCARCERATION                       23408
CAP STATE INCARCERATION                         10419
OTHER EFFORTS                                    9018
ERO REPROCESSED ARREST                           8875
287(G) PROGRAM                                   6313
PROBATION AND PAROLE                             3833
LAW ENFORCEMENT AGENCY RESPONSE UNIT              845
OTHER TASK FORCE                                  505
PATROL BORDER                                     477
OTHER AGENCY (TURNED OVER TO INS)                 444
WORKSITE ENFORCEMENT                              252
INSPECTIONS                                       127
ANTI-SMUGGLING                                     84
TRAFFIC CHECK                                      62
ORGANIZED CRIME DRUG ENFORCEMENT TASK FORCE        52
PATROL I

In [15]:
mk_key_columns=['apprehension_date','apprehension_state','birth_year','citizenship_country','gender','apprehension_site_landmark']

`final_program` may hold useful information, but the meaning of several of its categories would need further research.

## 2. Filtering - trying to find Mahmoud Khalil

### 1. Location and date:

#### In 26 Federal Plaza around 8th March 2025:
Note - taking a few days before and after 8th March because I don't know how quickly records are entered and processed.

In [21]:
arrests_df[filter_dates(arrests_df,'2025-03-07', end_date='2025-03-11') & loc_federal_plaza][mk_key_columns].sort_values(by='apprehension_date')

Unnamed: 0,apprehension_date,apprehension_state,birth_year,citizenship_country,gender,apprehension_site_landmark
66791,2025-03-09 11:15:00,NEW YORK,1973,HONDURAS,FEMALE,"NDD - 26 FEDERAL PLAZA NY, NY"
181085,2025-03-10 08:49:00,NEW YORK,1985,BANGLADESH,MALE,"NDD - 26 FEDERAL PLAZA NY, NY"
136002,2025-03-10 08:58:00,NEW YORK,1987,ECUADOR,MALE,"NDD - 26 FEDERAL PLAZA NY, NY"
90940,2025-03-10 10:27:00,NEW YORK,1988,ECUADOR,FEMALE,"NDD - 26 FEDERAL PLAZA NY, NY"
208815,2025-03-11 09:15:13,NEW YORK,1990,MONTENEGRO,MALE,"NDD - 26 FEDERAL PLAZA NY, NY"
120693,2025-03-11 10:00:00,NEW YORK,1999,ECUADOR,FEMALE,"NDD - 26 FEDERAL PLAZA NY, NY"
121873,2025-03-11 11:15:00,,1990,"CHINA, PEOPLES REPUBLIC OF",MALE,"NDD - 26 FEDERAL PLAZA NY, NY"
49911,2025-03-11 11:23:00,NEW YORK,1999,INDIA,MALE,"NDD - 26 FEDERAL PLAZA NY, NY"
15260,2025-03-11 13:29:00,NEW YORK,1987,PERU,MALE,"NDD - 26 FEDERAL PLAZA NY, NY"


#### In New York around 8th March 2025:

In [22]:
arrests_df[filter_dates(arrests_df,'2025-03-07', end_date='2025-03-10') & state_ny][mk_key_columns].sort_values(by='apprehension_date')

Unnamed: 0,apprehension_date,apprehension_state,birth_year,citizenship_country,gender,apprehension_site_landmark
142348,2025-03-07 10:07:00,NEW YORK,1997,GUATEMALA,FEMALE,ERIE COUNTY
17660,2025-03-07 10:12:28,NEW YORK,1970,BURMA,MALE,MIDSTATE CORRECTIONAL FACILITY
172444,2025-03-07 10:54:00,NEW YORK,1977,HONDURAS,MALE,CAP - USM - SOUTHERN DISTRICT NY STATE
116429,2025-03-07 10:59:00,NEW YORK,1995,EL SALVADOR,MALE,"CIP GENERAL AREA, NON-SPECIFIC"
178822,2025-03-08 00:23:00,NEW YORK,1995,EL SALVADOR,MALE,"CIP GENERAL AREA, NON-SPECIFIC"
162513,2025-03-08 09:18:00,NEW YORK,1995,VENEZUELA,MALE,ALBANY COUNTY
73250,2025-03-09 10:52:00,NEW YORK,1995,ECUADOR,MALE,MONROE COUNTY
66791,2025-03-09 11:15:00,NEW YORK,1973,HONDURAS,FEMALE,"NDD - 26 FEDERAL PLAZA NY, NY"
169123,2025-03-09 11:15:00,NEW YORK,1996,HONDURAS,FEMALE,FUGITIVE OPERATIONS NY STATE
60359,2025-03-09 15:13:00,NEW YORK,1965,HONDURAS,MALE,FUGITIVE OPERATIONS NY STATE


#### State missing around 8th March 2025:

In [37]:
arrests_df[filter_dates(arrests_df,'2025-03-08', end_date='2025-03-10') & state_missing][
    mk_key_columns].sort_values(by='apprehension_date').iloc[50:100]

Unnamed: 0,apprehension_date,apprehension_state,birth_year,citizenship_country,gender,apprehension_site_landmark
12323,2025-03-08 12:37:00,,2000,GUATEMALA,MALE,"GNS GENERAL AREA, NON-SPECIFIC"
157165,2025-03-08 12:52:00,,2002,CAMEROON,FEMALE,"WAS GENERAL AREA, NON-SPECIFIC"
201250,2025-03-08 13:04:32,,1999,VENEZUELA,MALE,DALLAS COUNTY GENERAL AREA
157786,2025-03-08 13:09:00,,2004,HONDURAS,MALE,"MID GENERAL AREA, NON-SPECIFIC"
19623,2025-03-08 13:10:00,,2001,MEXICO,MALE,DALLAS COUNTY GENERAL AREA
153911,2025-03-08 13:25:00,,1984,GUATEMALA,MALE,"FMY GENERAL AREA, NON-SPECIFIC"
93328,2025-03-08 13:44:00,,1996,HONDURAS,MALE,DENVER COUNTY
141546,2025-03-08 14:34:00,,1980,MEXICO,MALE,"MTG GENERAL AREA, NON-SPECIFIC"
211740,2025-03-08 14:58:00,,1995,GUATEMALA,MALE,"FMY GENERAL AREA, NON-SPECIFIC"
180013,2025-03-08 15:11:00,,1999,COLOMBIA,MALE,"AGU GENERAL AREA, NON-SPECIFIC"


**Summary:** There does not seem to be a clear match for either location field around the timeframe. 
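
To make the narrowing-down explicit, the filters can be applied cumulatively while recording how many candidate rows survive each step. The sketch below uses invented toy rows (some values resemble rows shown in this notebook, but the frame is illustrative only); the filter labels are assumptions.

```python
import pandas as pd

# Toy data (invented) standing in for arrests_df
toy = pd.DataFrame({
    "apprehension_date": pd.to_datetime([
        "2025-03-08 00:23:00", "2025-03-11 12:56:29", "2025-03-08 14:58:00",
        "2025-03-09 11:15:00", "2025-04-01 10:00:00"]),
    "apprehension_state": ["NEW YORK", "NEW YORK", None, "NEW YORK", "TEXAS"],
    "birth_year": [1995, 1995, 1995, 1973, 1995],
    "citizenship_country": ["EL SALVADOR", "ALGERIA", "GUATEMALA", "HONDURAS", "ALGERIA"],
})

day = toy["apprehension_date"].dt.normalize()  # drop the time of day
filters = {
    "date 2025-03-07..2025-03-11": day.between(pd.Timestamp("2025-03-07"), pd.Timestamp("2025-03-11")),
    "state NEW YORK or missing": (toy["apprehension_state"] == "NEW YORK") | toy["apprehension_state"].isna(),
    "birth_year == 1995": toy["birth_year"] == 1995,
    "citizenship == ALGERIA": toy["citizenship_country"] == "ALGERIA",
}

# Apply each filter on top of the previous ones and count the survivors
mask = pd.Series(True, index=toy.index)
funnel = {}
for name, condition in filters.items():
    mask = mask & condition
    funnel[name] = int(mask.sum())

for name, remaining in funnel.items():
    print(f"{remaining:>3}  {name}")
```

Recording the counts makes it easier to explain to a reporter which field did most of the narrowing and where a relaxed filter might let other candidates back in.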

### 2. Demographics

#### Citizenship == Algeria, and Birth Year is 1995:

In [35]:
arrests_df[citizenship_algeria & birth_1995][mk_key_columns]

Unnamed: 0,apprehension_date,apprehension_state,birth_year,citizenship_country,gender,apprehension_site_landmark
107560,2025-03-11 12:56:29,NEW YORK,1995,ALGERIA,MALE,SING SING CORRECTIONAL FACILITY
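
As a robustness check, the demographic filter could be relaxed to allow for recording errors: a tolerance on birth year, and the other countries associated with him (Syria, Lebanon). This is a sketch only; the helper name `demographic_candidates`, the one-year tolerance and the toy rows are assumptions.

```python
import pandas as pd

def demographic_candidates(df, birth_year, countries, year_tolerance=1):
    """Rows whose birth year is within the tolerance and whose citizenship is in `countries`."""
    year_ok = df["birth_year"].between(birth_year - year_tolerance, birth_year + year_tolerance)
    country_ok = df["citizenship_country"].isin(countries)
    return df[year_ok & country_ok]

# Toy data (invented) standing in for arrests_df
toy = pd.DataFrame({
    "birth_year": [1995, 1994, 1995, 1990],
    "citizenship_country": ["ALGERIA", "SYRIA", "MEXICO", "LEBANON"],
})

candidates = demographic_candidates(toy, 1995, ["ALGERIA", "SYRIA", "LEBANON"])
print(candidates)
```

Any extra candidates surfaced this way would still need to be checked against the arrest date and location.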


### Could this be Mahmoud Khalil?

**Headline:** I think it is unlikely. While there are some clear similarities, there are important inconsistencies that don't fit what has been stated in the media, and by Mahmoud Khalil himself, about the case.

Evidence in support:
* Apprehension in New York close to the time Mahmoud Khalil was arrested
* Demographic match - Citizenship is Algerian, and Birth Year is 1995

Evidence against:
* Apprehension site landmark is Sing Sing Correctional Facility, which is a state prison in Ossining, New York. There is no reporting or other evidence to suggest Mahmoud Khalil was taken to Sing Sing, and the timeframe does not really allow for it (he was reported to be in Louisiana around the same time)
* Apprehension method is "CAP STATE INCARCERATION" - CAP is Criminal Alien Program ([source: ICE.gov](https://www.ice.gov/identify-and-arrest/criminal-alien-program)), and "CAP STATE INCARCERATION" suggests this is an example of someone already incarcerated being arrested by ICE 
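
The evidence for and against can also be written down as an explicit checklist, so each criterion is visible and repeatable. The sketch below copies the candidate's values from the record shown above (the apprehension method is the one quoted in the evidence list); the 24-hour tolerance and the check labels are assumptions.

```python
from datetime import datetime, timedelta

# Values of the single Algeria / 1995 candidate record
candidate = {
    "apprehension_date": datetime(2025, 3, 11, 12, 56, 29),
    "apprehension_state": "NEW YORK",
    "birth_year": 1995,
    "citizenship_country": "ALGERIA",
    "apprehension_site_landmark": "SING SING CORRECTIONAL FACILITY",
    "apprehension_method": "CAP STATE INCARCERATION",
}

# Reported arrest: 2025-03-08, around 8:30pm
reported_arrest = datetime(2025, 3, 8, 20, 30)

checks = {
    "born in 1995": candidate["birth_year"] == 1995,
    "Algerian citizenship": candidate["citizenship_country"] == "ALGERIA",
    "arrested in New York": candidate["apprehension_state"] == "NEW YORK",
    "within 24h of reported arrest":
        abs(candidate["apprehension_date"] - reported_arrest) <= timedelta(hours=24),
    "not a custodial (CAP) arrest":
        not candidate["apprehension_method"].startswith("CAP"),
}

for name, passed in checks.items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
```

In the notebook the dictionary could be replaced by the row itself, for example `arrests_df.loc[107560]`.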
