# 0 - Data Exploration

In [1]:
import numpy as np
import pandas as pd

In [24]:
arrests_df = pd.read_csv('../data/arrests-0923-0625.csv', parse_dates=['Apprehension Date','Departed Date'])

**NB** - see the note in the README: converting to CSV changed the times from a 24-hour clock to a 12-hour clock. The dates are unaffected, so this file is fine for any analysis that does not use the time of arrest (for that analysis we should use the .xlsx file).
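One way to avoid relying on the affected times is to work with a date-only column. The sketch below uses a tiny made-up frame in place of `arrests_df`; the `Apprehension Day` column name is an assumption for illustration and does not exist in the dataset.

```python
import pandas as pd

# Tiny stand-in for arrests_df; the timestamps are illustrative only.
sample = pd.DataFrame({
    "Apprehension Date": pd.to_datetime(
        ["2024-08-07 09:43:00", "2024-08-07 21:10:00", "2025-01-21 05:41:00"]
    )
})

# Hypothetical helper column: keep the calendar date and reset the time to midnight,
# so later grouping never depends on the time of day.
sample["Apprehension Day"] = sample["Apprehension Date"].dt.normalize()

# Example use: count arrests per calendar day.
arrests_per_day = sample.groupby("Apprehension Day").size()
print(arrests_per_day)
```

Grouping on the normalized column keeps daily counts independent of whatever happened to the clock format during conversion.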

In [25]:
arrests_df.head()

Unnamed: 0,Apprehension Date,Apprehension State,Apprehension County,Apprehension AOR,Final Program,Final Program Group,Apprehension Method,Apprehension Criminality,Case Status,Case Category,...,Final Order Date,Birth Date,Birth Year,Citizenship Country,Gender,Apprehension Site Landmark,Alien File Number,EID Case ID,EID Subject ID,Unique Identifier
0,2024-08-07 09:43:00,VIRGINIA,,Washington Area of Responsibility,ERO Criminal Alien Program,ICE,Non-Custodial Arrest,1 Convicted Criminal,8-Excluded/Removed - Inadmissibility,[16] Reinstated Final Order,...,10/18/1999,"(b)(6), (b)(7)(C)",1981,HONDURAS,Male,"HBG GENERAL AREA, NON-SPECIFIC","(b)(6), (b)(7)(C)","(b)(6), (b)(7)(C), (b)(7)(E)","(b)(6), (b)(7)(C), (b)(7)(E)",0000b34edd657d516c02b13a7c352d62d0effcb6
1,2024-10-19 08:33:00,TEXAS,,Houston Area of Responsibility,ERO Criminal Alien Program,ICE,CAP Local Incarceration,1 Convicted Criminal,6-Deported/Removed - Deportability,[16] Reinstated Final Order,...,10/10/2023,"(b)(6), (b)(7)(C)",1984,MEXICO,Male,"HARRIS COUNTY JAIL, HOUSTON, TX","(b)(6), (b)(7)(C)","(b)(6), (b)(7)(C), (b)(7)(E)","(b)(6), (b)(7)(C), (b)(7)(E)",0000ba6e459998a6046d185d82cf4349de1479d0
2,2025-04-15 10:08:00,NEW JERSEY,,Newark Area of Responsibility,ERO Criminal Alien Program,ICE,CAP Federal Incarceration,1 Convicted Criminal,8-Excluded/Removed - Inadmissibility,[16] Reinstated Final Order,...,04/15/2025,"(b)(6), (b)(7)(C)",1988,DOMINICAN REPUBLIC,Male,"FORT DIX EAST, NEW JERSEY","(b)(6), (b)(7)(C)","(b)(6), (b)(7)(C), (b)(7)(E)","(b)(6), (b)(7)(C), (b)(7)(E)",0000c3d23fb0e444864559575900d410c4e8490f
3,2025-06-03 09:20:00,MINNESOTA,,St. Paul Area of Responsibility,Fugitive Operations,ICE,Non-Custodial Arrest,3 Other Immigration Violator,ACTIVE,[8G] Expedited Removal - Credible Fear Referral,...,06/03/2025,"(b)(6), (b)(7)(C)",1985,COLOMBIA,Female,"SPM GENERAL AREA, NON-SPECIFIC","(b)(6), (b)(7)(C)","(b)(6), (b)(7)(C), (b)(7)(E)","(b)(6), (b)(7)(C), (b)(7)(E)",0000d3dbf8033b5f209f6547ffee5b84feb4f599
4,2025-01-21 05:41:00,,,Miami Area of Responsibility,ERO Criminal Alien Program,ICE,CAP Local Incarceration,2 Pending Criminal Charges,3-Voluntary Departure Confirmed,[8C] Excludable / Inadmissible - Administrativ...,...,02/01/2025,"(b)(6), (b)(7)(C)",1983,MEXICO,Male,MIAMI DADE COUNTY JAIL TURNER GUILFORD KNIGHT ...,"(b)(6), (b)(7)(C)","(b)(6), (b)(7)(C), (b)(7)(E)","(b)(6), (b)(7)(C), (b)(7)(E)",000104d730bf021326c6dc0deb3dd575304136b5


In [30]:
arrests_df.columns

Index(['Apprehension Date', 'Apprehension State', 'Apprehension County',
       'Apprehension AOR', 'Final Program', 'Final Program Group',
       'Apprehension Method', 'Apprehension Criminality', 'Case Status',
       'Case Category', 'Departed Date', 'Departure Country',
       'Final Order Yes No', 'Final Order Date', 'Birth Date', 'Birth Year',
       'Citizenship Country', 'Gender', 'Apprehension Site Landmark',
       'Alien File Number', 'EID Case ID', 'EID Subject ID',
       'Unique Identifier'],
      dtype='object')

In [22]:
arrests_df.shape

(265226, 23)

### How complete is the data?

In [26]:
arrests_df.isna().sum().sort_values()

Apprehension Date                  0
Gender                             0
Citizenship Country                0
Final Program                      0
Final Program Group                0
Apprehension Method                0
Apprehension Criminality           0
Birth Year                         0
Birth Date                         0
EID Subject ID                     0
Alien File Number               2023
Unique Identifier               2023
Final Order Yes No              3898
Case Category                   3898
Case Status                     3898
EID Case ID                     3898
Apprehension AOR                5903
Apprehension Site Landmark      6100
Apprehension State             56243
Final Order Date              104937
Departed Date                 119909
Departure Country             119947
Apprehension County           265226
dtype: int64
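Raw counts are easier to compare as a share of rows. A minimal sketch on a made-up frame (the values are illustrative, not taken from the dataset); the same expression should work on `arrests_df`:

```python
import numpy as np
import pandas as pd

# Small stand-in frame with one complete, one partly missing and one empty column.
sample = pd.DataFrame({
    "Apprehension Date": pd.to_datetime(
        ["2024-08-07", "2024-10-19", "2025-04-15", "2025-06-03"]
    ),
    "Apprehension State": ["VIRGINIA", "TEXAS", None, "MINNESOTA"],
    "Apprehension County": [np.nan] * 4,
})

# isna().mean() gives the fraction of missing values per column.
missing_pct = (sample.isna().mean() * 100).round(1).sort_values()
print(missing_pct)
```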

#### Observations:
* Location variables have missing values: `Apprehension State` is missing for 56,243 rows (about 21%), while `Apprehension Site Landmark` (6,100) and `Apprehension AOR` (5,903) are each missing for about 2% of rows. `Apprehension County` is missing for every row
* `Unique Identifier` is missing for 2,023 rows - docs say: *"based on Alien Registration Number (A-number). A-numbers are assigned to noncitizens by ICE or USCIS; undocumented noncitizens who have not interacted with the U.S. government, as well as people on nonimmigrant visas"* (this tracks: `Alien File Number` is missing for the same number of rows)  **Need to keep this in mind**
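Equal missing counts for `Unique Identifier` and `Alien File Number` do not by themselves prove the same rows are affected. A quick sketch of how that could be checked, using a made-up frame with placeholder identifiers in place of `arrests_df`:

```python
import numpy as np
import pandas as pd

# Stand-in data: identifiers are shortened placeholders, not real values.
sample = pd.DataFrame({
    "Alien File Number": ["(b)(6), (b)(7)(C)", np.nan, "(b)(6), (b)(7)(C)", np.nan],
    "Unique Identifier": ["0000b34e", np.nan, "0000ba6e", np.nan],
})

id_missing = sample["Unique Identifier"].isna()
a_number_missing = sample["Alien File Number"].isna()

# True only if the two columns are missing on exactly the same rows.
same_rows = bool((id_missing == a_number_missing).all())

# Number of rows where both are missing at once.
both_missing = int((id_missing & a_number_missing).sum())
print(same_rows, both_missing)
```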


#### Is every row unique?


In [27]:
arrests_df['Unique Identifier'].value_counts().head()

Unique Identifier
1e2ec1831c1fd33fdb93b412699a798db3544957    7
45c6d66afff974e4fea3a5daca554031dd2f61ff    5
26d8a297a31da2d298cc5af37176920d34cb088f    5
1007639d0d35049ada17e291e3c201542ff443b1    5
1604e1b821a91b42ac5543a4a59e143633309dc8    5
Name: count, dtype: int64

In [28]:
arrests_df[arrests_df['Unique Identifier']=='1604e1b821a91b42ac5543a4a59e143633309dc8']

Unnamed: 0,Apprehension Date,Apprehension State,Apprehension County,Apprehension AOR,Final Program,Final Program Group,Apprehension Method,Apprehension Criminality,Case Status,Case Category,...,Final Order Date,Birth Date,Birth Year,Citizenship Country,Gender,Apprehension Site Landmark,Alien File Number,EID Case ID,EID Subject ID,Unique Identifier
22612,2023-12-19 12:25:00,CALIFORNIA,,Los Angeles Area of Responsibility,ERO Criminal Alien Program,ICE,CAP Local Incarceration,1 Convicted Criminal,ACTIVE,[8C] Excludable / Inadmissible - Administrativ...,...,10/14/1998,"(b)(6), (b)(7)(C)",1977,VIETNAM,Male,"LOS ANGELES COUNTY GENERAL AREA, NON-SPECIFIC","(b)(6), (b)(7)(C)","(b)(6), (b)(7)(C), (b)(7)(E)","(b)(6), (b)(7)(C), (b)(7)(E)",1604e1b821a91b42ac5543a4a59e143633309dc8
22613,2024-04-01 10:35:00,CALIFORNIA,,Los Angeles Area of Responsibility,Alternatives to Detention,ICE,Non-Custodial Arrest,1 Convicted Criminal,ACTIVE,[8C] Excludable / Inadmissible - Administrativ...,...,10/14/1998,"(b)(6), (b)(7)(C)",1977,VIETNAM,Male,"SAA GENERAL AREA, NON-SPECIFIC","(b)(6), (b)(7)(C)","(b)(6), (b)(7)(C), (b)(7)(E)","(b)(6), (b)(7)(C), (b)(7)(E)",1604e1b821a91b42ac5543a4a59e143633309dc8
22614,2024-04-01 09:14:00,,,Los Angeles Area of Responsibility,Alternatives to Detention,ICE,Non-Custodial Arrest,1 Convicted Criminal,ACTIVE,[8C] Excludable / Inadmissible - Administrativ...,...,10/14/1998,"(b)(6), (b)(7)(C)",1977,VIETNAM,Male,"SAA GENERAL AREA, NON-SPECIFIC","(b)(6), (b)(7)(C)","(b)(6), (b)(7)(C), (b)(7)(E)","(b)(6), (b)(7)(C), (b)(7)(E)",1604e1b821a91b42ac5543a4a59e143633309dc8
22615,2024-08-05 10:35:00,CALIFORNIA,,Los Angeles Area of Responsibility,ERO Criminal Alien Program,ICE,Non-Custodial Arrest,1 Convicted Criminal,ACTIVE,[8C] Excludable / Inadmissible - Administrativ...,...,10/14/1998,"(b)(6), (b)(7)(C)",1977,VIETNAM,Male,"SAA GENERAL AREA, NON-SPECIFIC","(b)(6), (b)(7)(C)","(b)(6), (b)(7)(C), (b)(7)(E)","(b)(6), (b)(7)(C), (b)(7)(E)",1604e1b821a91b42ac5543a4a59e143633309dc8
22616,2025-06-17 11:12:00,CALIFORNIA,,Los Angeles Area of Responsibility,ERO Criminal Alien Program,ICE,Non-Custodial Arrest,1 Convicted Criminal,ACTIVE,[8C] Excludable / Inadmissible - Administrativ...,...,10/14/1998,"(b)(6), (b)(7)(C)",1977,VIETNAM,Male,"SAA GENERAL AREA, NON-SPECIFIC","(b)(6), (b)(7)(C)","(b)(6), (b)(7)(C), (b)(7)(E)","(b)(6), (b)(7)(C), (b)(7)(E)",1604e1b821a91b42ac5543a4a59e143633309dc8


Some of these look like duplicated rows, while others are genuinely separate arrests of the same person.
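To separate the two cases, one option is to count exact duplicates first and then look at what remains per person. The sketch below runs on a made-up frame; the `Apprehension Day` column and the "same person, same day" check are assumptions added for illustration, not part of the original analysis.

```python
import pandas as pd

# Stand-in data: "a1" has three distinct records (two on the same day),
# "b2" has one record that appears twice verbatim.
sample = pd.DataFrame({
    "Unique Identifier": ["a1", "a1", "a1", "b2", "b2"],
    "Apprehension Date": pd.to_datetime([
        "2023-12-19 12:25", "2024-04-01 10:35", "2024-04-01 09:14",
        "2024-10-19 08:33", "2024-10-19 08:33",
    ]),
    "Apprehension State": ["CALIFORNIA", "CALIFORNIA", None, "TEXAS", "TEXAS"],
})

# Exact duplicates: every column identical to an earlier row.
exact_dupes = sample.duplicated()
n_exact = int(exact_dupes.sum())

# After removing exact duplicates, flag same-person, same-day records,
# which may be near-duplicates rather than separate arrests.
deduped = sample[~exact_dupes].copy()
deduped["Apprehension Day"] = deduped["Apprehension Date"].dt.normalize()
same_day = deduped.duplicated(
    subset=["Unique Identifier", "Apprehension Day"], keep=False
)
n_same_day = int(same_day.sum())

# Remaining records per person (groupby skips rows with a missing identifier).
arrests_per_person = deduped.groupby("Unique Identifier").size()
print(n_exact, n_same_day)
print(arrests_per_person)
```

Whether same-day records count as one arrest or two is a judgment call that this sketch only surfaces; it does not settle it.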

In [29]:
arrests_df[arrests_df.duplicated()].shape

(41, 23)

In [21]:
arrests_df.drop_duplicates(inplace=True)

### Do all the columns hold useful information?

In [26]:
arrests_df.isna().all()

Apprehension Date             False
Apprehension State            False
Apprehension County            True
Apprehension AOR              False
Final Program                 False
Final Program Group           False
Apprehension Method           False
Apprehension Criminality      False
Case Status                   False
Case Category                 False
Departed Date                 False
Departure Country             False
Final Order Yes No            False
Final Order Date              False
Birth Date                    False
Birth Year                    False
Citizenship Country           False
Gender                        False
Apprehension Site Landmark    False
Alien File Number             False
EID Case ID                   False
EID Subject ID                False
Unique Identifier             False
dtype: bool

In [27]:
arrests_df.nunique()

Apprehension Date             166619
Apprehension State                61
Apprehension County                0
Apprehension AOR                  26
Final Program                     14
Final Program Group                1
Apprehension Method               25
Apprehension Criminality           3
Case Status                       13
Case Category                     29
Departed Date                    683
Departure Country                186
Final Order Yes No                 2
Final Order Date                9721
Birth Date                         1
Birth Year                        90
Citizenship Country              194
Gender                             3
Apprehension Site Landmark      4968
Alien File Number                  1
EID Case ID                        1
EID Subject ID                     1
Unique Identifier             249872
dtype: int64

#### Looking into the columns that only have one value to check if it's useful info:

In [36]:
nunique = arrests_df.nunique().reset_index().rename(columns={'index':'column', 0:'nunique'})
nunique_1_cols = nunique['column'][nunique['nunique']==1]
arrests_df[nunique_1_cols]

Unnamed: 0,Final Program Group,Birth Date,Alien File Number,EID Case ID,EID Subject ID
0,ICE,"(b)(6), (b)(7)(C)","(b)(6), (b)(7)(C)","(b)(6), (b)(7)(C), (b)(7)(E)","(b)(6), (b)(7)(C), (b)(7)(E)"
1,ICE,"(b)(6), (b)(7)(C)","(b)(6), (b)(7)(C)","(b)(6), (b)(7)(C), (b)(7)(E)","(b)(6), (b)(7)(C), (b)(7)(E)"
2,ICE,"(b)(6), (b)(7)(C)","(b)(6), (b)(7)(C)","(b)(6), (b)(7)(C), (b)(7)(E)","(b)(6), (b)(7)(C), (b)(7)(E)"
3,ICE,"(b)(6), (b)(7)(C)","(b)(6), (b)(7)(C)","(b)(6), (b)(7)(C), (b)(7)(E)","(b)(6), (b)(7)(C), (b)(7)(E)"
4,ICE,"(b)(6), (b)(7)(C)","(b)(6), (b)(7)(C)","(b)(6), (b)(7)(C), (b)(7)(E)","(b)(6), (b)(7)(C), (b)(7)(E)"
...,...,...,...,...,...
265221,ICE,"(b)(6), (b)(7)(C)",,"(b)(6), (b)(7)(C), (b)(7)(E)","(b)(6), (b)(7)(C), (b)(7)(E)"
265222,ICE,"(b)(6), (b)(7)(C)",,,"(b)(6), (b)(7)(C), (b)(7)(E)"
265223,ICE,"(b)(6), (b)(7)(C)",,,"(b)(6), (b)(7)(C), (b)(7)(E)"
265224,ICE,"(b)(6), (b)(7)(C)",,,"(b)(6), (b)(7)(C), (b)(7)(E)"


In [38]:
nunique_1_cols

5     Final Program Group
14             Birth Date
19      Alien File Number
20            EID Case ID
21         EID Subject ID
Name: column, dtype: object

These columns are mostly redacted information, and `Final Program Group` only tells us that all of these arrests were made by ICE (which we already know). So we can remove these columns from the dataset to focus on the variables that carry more information.

In [45]:
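# Keep columns that are neither single-valued nor entirely missing
single_value_cols = set(nunique_1_cols)
ARRESTS_KEY_COLS = [
    c for c in arrests_df.columns
    if c not in single_value_cols and not arrests_df[c].isna().all()
]
ARRESTS_KEY_COLS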

['Apprehension Date',
 'Apprehension State',
 'Apprehension AOR',
 'Final Program',
 'Apprehension Method',
 'Apprehension Criminality',
 'Case Status',
 'Case Category',
 'Departed Date',
 'Departure Country',
 'Final Order Yes No',
 'Final Order Date',
 'Birth Year',
 'Citizenship Country',
 'Gender',
 'Apprehension Site Landmark',
 'Unique Identifier']

#### Quick eyeballing of value counts for these cols to check the data

In [56]:
for c in ARRESTS_KEY_COLS:
    if arrests_df[c].nunique() < 300: # don't want to print out all the dates and IDs
        print(f"""
        {arrests_df[c].value_counts()}

        """)


        Apprehension State
TEXAS                             52778
FLORIDA                           20432
CALIFORNIA                        14433
NEW YORK                          10006
GEORGIA                            7053
                                  ...  
FEDERATED STATES OF MICRONESIA       15
ARMED SERVICES - PACIFIC              3
ARMED FORCES - EUROPE                 2
TAMAULIPAS                            1
MEXICO                                1
Name: count, Length: 61, dtype: int64

        

        Apprehension AOR
Miami Area of Responsibility             26925
New Orleans Area of Responsibility       23598
Dallas Area of Responsibility            23185
Houston Area of Responsibility           21785
Chicago Area of Responsibility           18370
Atlanta Area of Responsibility           16596
San Antonio Area of Responsibility       15161
Harlingen Area of Responsibility         11607
Los Angeles Area of Responsibility       11575
Newark Area of Responsibility      

#### Checking data quality:

##### 1. `Citizenship Country`

In [55]:
np.sort(arrests_df['Citizenship Country'].unique())

array(['AFGHANISTAN', 'ALBANIA', 'ALGERIA', 'ANDORRA', 'ANGOLA',
       'ANGUILLA', 'ANTIGUA-BARBUDA', 'ARGENTINA', 'ARMENIA', 'AUSTRALIA',
       'AUSTRIA', 'AZERBAIJAN', 'BAHAMAS', 'BAHRAIN', 'BANGLADESH',
       'BARBADOS', 'BELARUS', 'BELGIUM', 'BELIZE', 'BENIN', 'BHUTAN',
       'BOLIVIA', 'BOSNIA-HERZEGOVINA', 'BOTSWANA', 'BRAZIL',
       'BRITISH VIRGIN ISLANDS', 'BRUNEI', 'BULGARIA', 'BURKINA FASO',
       'BURMA', 'BURUNDI', 'CAMBODIA', 'CAMEROON', 'CANADA', 'CAPE VERDE',
       'CENTRAL AFRICAN REPUBLIC', 'CHAD', 'CHILE',
       'CHINA, PEOPLES REPUBLIC OF', 'COLOMBIA', 'CONGO', 'COSTA RICA',
       'CROATIA', 'CUBA', 'CURACAO', 'CYPRUS', 'CZECH REPUBLIC',
       'CZECHOSLOVAKIA', 'DEM REP OF THE CONGO', 'DENMARK', 'DJIBOUTI',
       'DOMINICA', 'DOMINICAN REPUBLIC', 'EAST TIMOR', 'ECUADOR', 'EGYPT',
       'EL SALVADOR', 'EQUATORIAL GUINEA', 'ERITREA', 'ESTONIA',
       'ESWATINI', 'ETHIOPIA', 'FIJI', 'FINLAND', 'FRANCE',
       'FRENCH GUIANA', 'FRENCH POLYNESIA', 'GABON', 

**Note** - some countries that no longer exist are listed, which creates potential duplication with their successor states.

Examples:
* `YUGOSLAVIA`
* `CZECHOSLOVAKIA` and `CZECH REPUBLIC` & `SLOVAKIA`
* `USSR`

Is this because these were the individuals' citizenships at the time they moved to the US?

In [59]:
arrests_df[arrests_df['Citizenship Country']=='CZECHOSLOVAKIA']['Birth Year']

167363    1966
Name: Birth Year, dtype: int64
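The same check can be run for all three historical countries at once. A sketch on a made-up frame: only the 1966 Czechoslovakia birth year comes from the output above, the other birth years are invented for illustration.

```python
import pandas as pd

# Country names as they appear in the notes above.
HISTORICAL = ["CZECHOSLOVAKIA", "USSR", "YUGOSLAVIA"]

# Stand-in data; every birth year except 1966 is made up.
sample = pd.DataFrame({
    "Citizenship Country": ["CZECHOSLOVAKIA", "USSR", "USSR", "YUGOSLAVIA", "MEXICO"],
    "Birth Year": [1966, 1958, 1971, 1980, 1995],
})

historical = sample[sample["Citizenship Country"].isin(HISTORICAL)]

# Row count and birth-year range per historical country.
summary = historical.groupby("Citizenship Country")["Birth Year"].agg(
    ["count", "min", "max"]
)
print(summary)
```

Birth years that all predate a country's dissolution would be consistent with the hypothesis, though they would not confirm when each person moved.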

##### 2. `Apprehension Site Landmark`

In [113]:
np.sort(arrests_df['Apprehension Site Landmark'].fillna('').unique())

array(['', ' ADELANTO DETENTION CENTER (SBSD) ADELANTO, CA',
       ' LICENSING UNIT/ STATE POLICE', ...,
       'Z09 - MASTERSON TO MANGANA HEIN TO MCKENDRICK/RIO BRAVO FENCELINE',
       'ZAPATA COUNTY REGIONAL JAIL', 'ZAVALA COUNTY JAIL'], dtype=object)

**Note** - some values have leading whitespace that needs to be cleaned
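A minimal cleaning sketch, assuming we want to trim the ends, collapse repeated internal spaces and standardize to upper case. The first two values are taken from the outputs in this notebook; the double-spaced one is invented to show the collapse step.

```python
import pandas as pd

landmarks = pd.Series([
    " ADELANTO DETENTION CENTER (SBSD) ADELANTO, CA",  # leading space, as seen above
    "Miramar ICE/ERO Sub-Office",                      # mixed case, as seen above
    "ZAVALA  COUNTY JAIL",                             # double space: invented example
    None,                                              # missing values stay missing
])

cleaned = (
    landmarks
    .str.strip()                            # trim leading/trailing whitespace
    .str.replace(r"\s+", " ", regex=True)   # collapse runs of whitespace
    .str.upper()                            # one consistent case
)
print(cleaned.tolist())
```

Comparing `nunique()` before and after would show how many landmark values differed only by whitespace or case.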

In [114]:
arrests_df['Apprehension Site Landmark'].value_counts().head(50)

Apprehension Site Landmark
DALLAS COUNTY GENERAL AREA                               11105
MTG GENERAL AREA, NON-SPECIFIC                            9009
NDD - 26 FEDERAL PLAZA NY, NY                             5805
HARRIS COUNTY JAIL, HOUSTON, TX                           4713
LOS ANGELES COUNTY GENERAL AREA, NON-SPECIFIC             4388
HLG GENERAL AREA, NON-SPECIFIC                            3535
ATLANTA, GA                                               3442
SNA GENERAL AREA, NON-SPECIFIC                            3357
AUS GENERAL AREA, NON-SPECIFIC                            2869
FUGITIVE OPERATIONS MA                                    2827
CAP - MARICOPA COUNTY SHERIFFS OFFICE JAIL                2732
Miramar ICE/ERO Sub-Office                                2465
WAS GENERAL AREA, NON-SPECIFIC                            2359
MIAMI DADE COUNTY JAIL TURNER GUILFORD KNIGHT (TGK)       2334
EDN GENERAL AREA, NON-SPECIFIC                            2227
ICE ERO NEWARK              

In [104]:
(arrests_df['Apprehension Site Landmark'].value_counts(dropna=False) == 1).sum()

1216

In [111]:
arrests_df[arrests_df['Apprehension Site Landmark'].fillna('').str.contains('COUNTY')]['Apprehension Site Landmark'].unique()

array(['HARRIS COUNTY JAIL, HOUSTON, TX',
       'MIAMI DADE COUNTY JAIL TURNER GUILFORD KNIGHT (TGK)',
       'TAM-PINELLAS COUNTY JAIL', ..., 'ARCHER COUNTY JAIL',
       "GOCGEN-GOLIAD COUNTY SHERIFF'S OFFICE", 'CRAWFORD COUNTY, GA'],
      dtype=object)

In [115]:
arrests_df[arrests_df['Apprehension Site Landmark'].fillna('').str.contains('287')]['Apprehension Site Landmark'].unique()

array(['Collier County 287g Program', 'CCN GENERAL AREA, CABARRUS 287G',
       'HALL COUNTY JAIL - 287(G)', 'Benton County Jail 287(g)',
       'GASTON COUNTY JAIL 287G', 'GCN GENERAL AREA, GASTON 287G',
       '287(G) - ADC JEO', '287(G) - MESA CITY JAIL JEO',
       'YORK COUNTY (SC) JAIL 287G', '287(G) - YCSO CAMP VERDE JAIL JEO',
       "287(G)-CLAY COUNTY SHERIFF'S OFFICE",
       'Henderson County Jail 287G', 'HEN GENERAL AREA, HENDERSON 287G',
       'Gaston County Jail 287G', 'GWINNETT COUNTY JAIL - 287(G)',
       '287(G) PINAL COUNTY JAIL FLORENCE JEO',
       '287G COLLIER FMY ERO PROGRAM',
       '287(G) PINAL COUNTY SHERIFF PATROL (TFO)',
       '287G OFFICER BARNSTABLE COUNTY', 'HENDERSON COUNTY JAIL 287G',
       'CABARRUS COUNTY JAIL 287G', "287(G)-KNOX COUNTY SHERIFF'S OFFICE",
       'BENTON COUNTY 287G JAIL ENFORCEMENT OFFICER',
       'Cabarrus County Jail 287G', 'DOC 287G OFFICER, MCI CONCORD',
       '287(G) - AZ DOC (JEO)', 'WHITFIELD COUNTY JAIL - 287(G)'],
   

### General variables observations:
* `Birth Year` and `Citizenship Country` are both complete - key variables that tell us about the individual arrested
   * Remember that `Citizenship Country` doesn't map only to current countries; this likely reflects individuals who moved to the US before the dissolution/division of these countries
* `Apprehension Method` might have some useful info for later - note `287(g) Program` is one of the values here
* `Final Program` has both `287G PROGRAM` and `287G TASK FORCE`
* `Apprehension Site Landmark`
    * needs a bit of cleaning: whitespace; inconsistent capitalization (most values are in caps, but some are not)
    * very varied in amount of detail and format
    * some specify the arrest was under the 287g program
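Since the landmark text spells the program several ways (`287g`, `287G`, `287(G)`, `287(g)`), a case-insensitive pattern could flag these rows in one pass. A sketch using a few landmark values from the output above; the pattern is an assumption about the variants present and may miss others.

```python
import pandas as pd

landmarks = pd.Series([
    "Collier County 287g Program",
    "HALL COUNTY JAIL - 287(G)",
    "287(G)-CLAY COUNTY SHERIFF'S OFFICE",
    "HARRIS COUNTY JAIL, HOUSTON, TX",
    None,
])

# "287", optional whitespace, optional "(", "g", optional ")"; case-insensitive.
pattern = r"287\s*\(?g\)?"
is_287g = landmarks.str.contains(pattern, case=False, regex=True, na=False)
print(is_287g.tolist())
```

A flag like this could then be compared against the 287(g) values in `Apprehension Method` and `Final Program` to see how well the three sources agree.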