```python, header, echo=False
# Author: University of Washington Center for Human Rights
# Title: GEO Group Internal Datasets on Use of Solitary Confinement at Northwest Detention Center
# Date: 2020-11-30
# License: GPL 3.0 or greater
```
```python, footnote_functions, echo=False
# Functions for HTML formatted footnotes
fn_count = 1
fn_buffer = []

def fn(ref_text):
    global fn_count, fn_buffer
    ftn_sup = f'<a href="#_ftn{fn_count}" name="_ftnref{fn_count}"><sup>[{fn_count}]</sup></a>'
    ftn_ref = f'<a href="#_ftnref{fn_count}" name="_ftn{fn_count}"><sup>[{fn_count}]</sup></a> {ref_text}'
    fn_buffer.append(ftn_ref)
    fn_count = fn_count + 1
    print(ftn_sup)

def print_fn_refs():
    global fn_buffer
    for ref in fn_buffer:
        print(ref)
        print()

# Functions for labeling figures and tables
fig_count = 1
tab_count = 1

def fig_label():
    global fig_count
    print(f'Figure {fig_count}')
    fig_count = fig_count + 1

def tab_label():
    global tab_count
    print(f'Table {tab_count}')
    tab_count = tab_count + 1
```
# Use of Solitary Confinement at the Northwest Detention Center: Data Appendix
# 1. GEO Group Internal Datasets ("SMU", "RHU")
## UW Center for Human Rights
[Back to Data Appendix Index](index.html)
**Data analyzed:**
1.1 - GEO Group Segregation Lieutenant's log of Restricted Housing Unit (**"RHU"**) placements at NWDC, released to UWCHR via FOIA litigation on August 12, 2020.
1.2 - GEOTrack report of Segregation Management Unit (**"SMU"**) housing assignments at NWDC, released to UWCHR via FOIA litigation on August 12, 2020.
```python, imports, echo=True
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import yaml

with open('input/cleanstats.yaml', 'r') as yamlfile:
    cur_yaml = yaml.load(yamlfile, Loader=yaml.SafeLoader)
smu_cleanstats = cur_yaml['output/smu.csv.gz']
rhu_cleanstats = cur_yaml['output/rhu.csv.gz']
```
# 1.1 - GEOtrack report ("SMU")
Original filename: `Sep_1_2013_to_March_31_2020_SMU_geotrack_report_Redacted.pdf`
Described by US DOJ attorneys for ICE as follows:
> "The GEOtrack report that was provided to Plaintiffs runs from September 1, 2013 to March 31, 2020. That report not only reports all placements into segregation, but it also tracks movement. This means that if an individual is placed into one particular unit then simply moves to a different unit, it is tracked in that report (if an individual is moved from H unit cell 101 to H unit cell 102, it would reflect the move as a new placement on the report)."
We refer to this dataset here by the shorthand "SMU" for "Special Management Unit".
The original file has been converted from PDF to CSV format using the [Xpdf pdftotext](https://www.xpdfreader.com/pdftotext-man.html) command line tool with `--table` option, and hand cleaned to correct OCR errors. The resulting CSV has been minimally cleaned in a private repository, dropping <%= smu_cleanstats['duplicates'] %> duplicated records and adding a unique identifier field, `hashid`; cleaning code available upon request.
The original file includes three redacted fields: `Alien #`, `Name`, and `Birthdate`. The file appears to be generated by a database report for the date range "9/1/2013 To 3/31/2020", presumably from the "GEOtrack" database referenced in the filename and by the DOJ attorneys for ICE. The original file has no un-redacted unique field identifiers or individual identifiers.
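The cleaning code is private, but a content-derived record identifier of this kind can be sketched as follows (an illustration only; we assume here that `hashid` is a hash of each record's field values, though the actual derivation may differ):

```python, hashid_example, echo=True
import hashlib
import pandas as pd

def make_hashid(row):
    # Join the record's field values and hash them to a stable identifier
    payload = '|'.join(str(v) for v in row)
    return hashlib.sha1(payload.encode('utf-8')).hexdigest()

# Toy records standing in for rows of the SMU extract
example = pd.DataFrame({'housing': ['H-101', 'H-102'],
                        'assigned_date': ['2014-01-02', '2014-01-03']})
example['hashid'] = example.apply(make_hashid, axis=1)

# Identifiers are unique so long as the underlying records are
assert example['hashid'].nunique() == len(example)
```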
```python, smu_import, echo=True
csv_opts = {'sep': '|',
            'quotechar': '"',
            'compression': 'gzip',
            'encoding': 'utf-8'}
smu = pd.read_csv('input/smu.csv.gz', **csv_opts)
assert len(set(smu['hashid'])) == len(smu)
assert sum(smu['hashid'].isnull()) == 0
data_cols = list(smu.columns)
data_cols.remove('hashid')
print(smu.info())
```
Here we display the first five records in the dataset (excluding `hashid` field):
<% print(smu[data_cols].head().to_html(border=0, index=False)) %>
```python, smu_date_convert, echo=True
# All date fields convert successfully
for col in ['assigned_dt', 'removed_dt', 'assigned_date', 'removed_date']:
    assert pd.to_datetime(smu[col]).isnull().sum() == 0
    smu[col] = pd.to_datetime(smu[col])
```
The GEOTrack database export time-frame conforms to `removed_dt` min/max values:
```python, smu_date_describe, echo=True
print(smu['assigned_dt'].describe())
print()
print(smu['removed_dt'].describe())
```
One record has a `removed_dt` value earlier than its `assigned_dt`, but the discrepancy lies only in the hour values:
<% print(smu[data_cols].loc[smu['assigned_dt'] > smu['removed_dt']].to_html(border=0, index=False)) %>
<%= sum(smu['assigned_dt'] == smu['removed_dt']) %> records have a `removed_dt` value equal to `assigned_dt`, as seen in this sample of five records:
<% print(smu[data_cols].loc[smu['assigned_dt'] == smu['removed_dt']].head().to_html(border=0, index=False)) %>
We retain these records despite the logical inconsistency of these datetime fields, under the assumption that they represent short placements of less than one full day.
Recalculating segregation placement lengths from the date fields alone yields the same values as the `days_in_seg` field.
Note that this calculation is not first day inclusive, as in the case of the original version of the RHU dataset (see below). We will disregard hourly data for comparison purposes, as no other dataset includes hourly placement or release times.
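The two conventions can be illustrated with a same-day placement (a minimal sketch using made-up dates):

```python, stay_length_example, echo=True
import numpy as np
import pandas as pd

date_in = pd.to_datetime('2018-03-01')
date_out = pd.to_datetime('2018-03-01')  # placed and released the same day

# First day exclusive (SMU convention): a same-day stay counts as 0 days
exclusive = (date_out - date_in) / np.timedelta64(1, 'D')
# First day inclusive (original RHU formula): the same stay counts as 1 day
inclusive = (date_out - date_in) / np.timedelta64(1, 'D') + 1

assert exclusive == 0
assert inclusive == 1
```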
```python, smu_date_calc, echo=True
smu['days_calc'] = (smu['removed_date'] - smu['assigned_date']) / np.timedelta64(1, 'D')
assert sum(smu['days_in_seg'] == smu['days_calc']) == len(smu)
```
The below descriptive statistics reflect first day exclusive stay lengths, including stays of 0 days. <%= sum(smu['days_calc'] < 1) %>, or <%= round((sum(smu['days_calc'] < 1) / len(smu) * 100), 2) %>% of records reflect stay lengths of less than one day, based on placement dates. Note that placements in the SMU dataset represent specific housing assignments within one of <%= len(smu['housing'].unique()) %> cells in the segregation management unit, and would therefore be expected to reflect more and shorter placements than other datasets:
```python, smu_days_calc_describe, echo=True
print(smu['days_calc'].describe())
```
All housing assignments are represented during each year covered by the dataset, but usage patterns vary, with housing units in the 200 block associated with longer average placements:
```python, smu_housing, echo=True
smu_annual = smu.set_index('assigned_date').groupby([pd.Grouper(freq='AS')])
housing_unit_count = smu_annual['housing'].nunique()
# All 20 housing units appear in every year covered by the dataset
assert (housing_unit_count == 20).all()
print(smu.groupby('housing')['days_calc'].mean())
```
Annual median and mean placement lengths show an increase during calendar years 2017-2018:
```python, smu_med_avg_length, echo=True
g = smu.set_index('assigned_date').groupby([pd.Grouper(freq='AS')])
smu_annual_med = g['days_calc'].median()
smu_annual_avg = g['days_calc'].mean()
print(smu_annual_med)
print()
print(smu_annual_avg)
```
Total placement counts per calendar year (note incomplete data for 2013, 2020):
```python, smu_total_placements, echo=True
smu_total_annual = smu.set_index('assigned_dt').groupby([pd.Grouper(freq='AS')])['hashid'].nunique()
print(smu_total_annual)
```
Stays over 14 days must be reported to ICE SRMS; here we flag long placements and calculate them as a percentage of total placements per year. Again, note that placements represent housing assignments in one of <%= len(smu['housing'].unique()) %> total housing locations, not cumulative stay lengths, so long stays may not be accurately represented here. The lack of unique identifiers makes it impossible to track cases of individuals held in segregation for a total of 14 non-consecutive days during any 21-day period, or individuals with special vulnerabilities.
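For illustration, were individual identifiers available, the 14-of-21-day criterion could be checked with a rolling window over a daily segregation indicator (a hypothetical sketch with made-up dates for a single individual; no such identifiers exist in the released data):

```python, rolling_window_example, echo=True
import pandas as pd

# Two hypothetical stays for one individual: Jan 1-8 (8 days) and
# Jan 12-20 (9 days), i.e. 17 segregation days within a 21-day period
stay1 = pd.date_range('2018-01-01', '2018-01-08')
stay2 = pd.date_range('2018-01-12', '2018-01-20')
seg_days = stay1.union(stay2)

# Daily 0/1 indicator over the full span, then the maximum number of
# segregation days within any trailing 21-day window
daily = pd.Series(1, index=seg_days).resample('D').asfreq().fillna(0)
max_in_21 = daily.rolling('21D').sum().max()

assert max_in_21 >= 14  # this pattern would be SRMS-reportable
```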
We find that long placements increase over time both absolutely and as proportion of total placements. However, this may simply reflect fewer transfers of individuals between housing assignments:
```python, smu_long_stays, echo=True
smu['long_stay'] = smu['days_calc'] > 14
long_stays_annual = smu.set_index('assigned_dt').groupby([pd.Grouper(freq='AS')])['long_stay'].sum()
print(long_stays_annual)
print()
print(long_stays_annual / smu_total_annual)
```
Top citizenship values:
```python, smu_citizenship_table, echo=False
citizenship = smu['citizenship'].value_counts()
top_5 = pd.DataFrame(citizenship.head(5))
all_others = smu[~smu['citizenship'].isin(list(top_5.index))]
top_5.loc['ALL OTHERS', 'citizenship'] = len(all_others)
top_5['citizenship'] = top_5['citizenship'].astype(int)
top_5 = top_5.rename({'citizenship': 'placements'}, axis=1)
top_5.index.name = 'citizenship'
```
**<%= tab_label() %>: SMU dataset top five countries of citizenship**
<% print(top_5.reset_index().to_html(border=0, index=False)) %>
### Comparison with segregation placements reported by DHS inspectors
A [June 24-26, 2014 DHS inspection report](https://drive.google.com/file/d/1YDX4fOOJ3DCftWiQv7O_5jwA2eZ0ftWR/view?usp=sharing) for NWDC states, "Documentation reflects there were 776 assignments to segregation in the past year". The DHS inspection report does not specify the source of the records cited.
The SMU dataset covers this period, albeit with only partial records for June-September 2013. The total count of placements recorded in the SMU dataset during this period, <%= len(smu.set_index('assigned_dt').loc['2013-06-01':'2014-06-30']) %>, is reasonably close to the figure cited by DHS inspectors, which suggests an average of about <%= round(776 / 12) %> placements per month:
```python, smu_dhs_compare, echo=True
### Monthly total placements during period of DHS inspection report:
dhs_period = smu.set_index('assigned_dt').loc[:'2014-06-30']
g = dhs_period.groupby(pd.Grouper(freq='M'))
print(g['hashid'].nunique())
dhs_period_complete = smu.set_index('assigned_dt').loc['2013-09-01':'2014-06-30']
g = dhs_period_complete.groupby(pd.Grouper(freq='M'))
dhs_period_complete_monthly_avg = g['hashid'].nunique().mean()
```
This is comparable to the average of <%= round(dhs_period_complete_monthly_avg, 1) %> placements per month reported in the SMU dataset during the period for which complete data exists (September 2013 - June 2014). If the GEOtrack database is the source of the data cited in the 2014 DHS inspection report, this is not noted in the inspection report itself.
# 1.2 - GEO Lieutenant's report ("RHU")
Original file: `15_16_17_18_19_20_RHU_admission_Redacted.xlsx`
Log created and maintained by hand by GEO employee to track Restricted Housing Unit placements. Described by US DOJ attorneys for ICE as follows:
> "The spreadsheet runs from January 2015 to May 28, 2020 and was created by and for a lieutenant within the facility once he took over the segregation lieutenant duties. The spreadsheet is updated once a detainee departs segregation. The subjects who are included on this list, therefore, are those who were placed into segregation and have already been released from segregation. It does not include those individuals who are currently in segregation."
We refer to this dataset here by the shorthand "RHU" for "Restricted Housing Unit".<%= fn('US DOJ attorneys for ICE specified that the terms "Special Management Unit" and "Restricted Housing Unit" are interchangeable and identify the same locations.') %>
The original file has been converted from XLSX to CSV format, with each annual tab saved as a separate CSV. The resulting CSVs have been concatenated and minimally cleaned in a private repository, dropping <%= rhu_cleanstats['duplicates'] %> duplicated records and adding a unique identifier field, `hashid`; cleaning code available upon request.
The original file includes two fully redacted fields: `Name` and `Alien #`; and one partially redacted field, `Placement reason`. The original file has no un-redacted unique field identifiers or individual identifiers.
```python, rhu_import, echo=True
csv_opts = {'sep': '|',
            'quotechar': '"',
            'compression': 'gzip',
            'encoding': 'utf-8'}
rhu = pd.read_csv('input/rhu.csv.gz', **csv_opts)
assert len(set(rhu['hashid'])) == len(rhu)
assert sum(rhu['hashid'].isnull()) == 0
data_cols = list(rhu.columns)
data_cols.remove('hashid')
print(rhu.info())
```
Here we display the first five records in the dataset (excluding `hashid` field):
<% print(rhu[data_cols].head().to_html(border=0, index=False)) %>
## Dates and total days calculation
Inspection of the original Excel file shows that the `Total days` column values are often incorrect, apparently due to a missing cell formula. For example, on the "2020" spreadsheet tab, the `Total days` values are integers which only occasionally align with placement lengths calculated from the `Date in` and `Date out` columns. However, additional rows at the bottom of the sheet, empty in all other fields, contain an Excel formula ("=(D138-C138)+1") which was evidently intended to calculate these values. Comparing calculated stay lengths with the reported `Total days` suggests that this formula was not updated consistently, causing the values to become misaligned. Additionally, the "2015" spreadsheet tab includes many `Total days` values equal to "1", suggesting that the formula was applied incorrectly or with missing data.
We can recalculate actual stay lengths based on the formula cited above (inclusive of start days, with stays of less than one day calculated as "1"); or with the formula used for the "SMU" records above (exclusive of start days, with stays of less than one day calculated as "0"), for more consistent comparison with other datasets.
The above issue raises the possibility that other fields in addition to `Total days` may be misaligned in the original dataset. One fact mitigating this possibility is that no `Date out` values predate associated `Date in` values. We can also look more closely at qualitative fields to make an educated guess as to the data quality: for example, do `initial_placement` values suggesting disciplinary placements align with `placement_reason` values also consistent with disciplinary placements? However, we do not intend to use this dataset for detailed qualitative analysis; of most interest are total segregation placements and segregation stay lengths.
```python, rhu_date_setup, echo=True
rhu['date_in'] = pd.to_datetime(rhu['date_in'])
rhu['date_out'] = pd.to_datetime(rhu['date_out'])
# As noted above, no `date_out` values predate associated `date_in` values:
assert sum(rhu['date_in'] > rhu['date_out']) == 0
print(rhu['date_in'].describe())
print()
print(rhu['date_out'].describe())
```
Here we recalculate the total days field based on the first day inclusive formula in the original Excel spreadsheet ("=(D138-C138)+1"):
```python, rhu_total_days_calc, echo=True
rhu['total_days_calc'] = (rhu['date_out'] - rhu['date_in']) / np.timedelta64(1, 'D') + 1
compare_pct = sum(rhu['total_days_calc'] == rhu['total_days']) / len(rhu) * 100
print(rhu['total_days'].describe())
print()
print(rhu['total_days_calc'].describe())
```
Only <%= round(compare_pct, 2) %>% of original `total_days` values match their respective recalculated stay lengths in `total_days_calc`.
However, note that the above summary statistics for the original field (`total_days`) are very similar to the recalculated field (`total_days_calc`), suggesting that most values are present in the dataset but misaligned.
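One way to test the misalignment hypothesis is to compare the two columns as distributions rather than row by row; a toy illustration (with made-up values, not drawn from the dataset):

```python, misalignment_example, echo=True
import pandas as pd

# The same multiset of stay lengths, offset by one row: no row-wise
# matches, but identical distributions
reported = pd.Series([3, 5, 2, 7, 4])
recalculated = pd.Series([4, 3, 5, 2, 7])

rowwise_match = (reported == recalculated).mean()
same_distribution = reported.value_counts().sort_index().equals(
    recalculated.value_counts().sort_index())

assert rowwise_match == 0.0
assert same_distribution
```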
Therefore, we conclude that it is correct to recalculate the `total_days` field. Instead of the first day inclusive formula suggested in the original dataset, here we will use a first day exclusive formula, where placements starting and ending on the same day have length 0. While this risks underestimating placement lengths represented in the dataset, it is more consistent with the calculation of placement lengths in the SMU and SRMS datasets:
```python, rhu_recalculate_total_days, echo=True
rhu['total_days'] = (rhu['date_out'] - rhu['date_in']) / np.timedelta64(1, 'D')
rhu = rhu.drop('total_days_calc', axis=1)
print(rhu['total_days'].describe())
```
Annual median and mean placement lengths are relatively consistent, showing an apparent decrease during the first few months of 2020, possibly explained by incomplete placements excluded from this dataset:
```python, rhu_med_avg_length, echo=True
g = rhu.set_index('date_in').groupby([pd.Grouper(freq='AS')])
rhu_annual_med = g['total_days'].median()
rhu_annual_avg = g['total_days'].mean()
print(rhu_annual_med)
print()
print(rhu_annual_avg)
```
Total placement counts per calendar year (note data for 2020 is incomplete):
```python, rhu_total_placements, echo=True
rhu_total_annual = rhu.set_index('date_in').groupby([pd.Grouper(freq='AS')])['hashid'].nunique()
print(rhu_total_annual)
```
Stays over 14 days must be reported to ICE SRMS; here we flag long placements and calculate as a percent of total placements per year. The lack of unique identifiers makes it impossible to track cases of individuals in segregation for a total of 14 non-consecutive days during any 21 day period. Inconsistencies and lack of information in `placement_reason` make it a poor candidate for flagging placements involving individuals with special vulnerabilities. We note an increasing proportion and absolute number of long placements during 2017-2019:
```python, rhu_long_stays, echo=True
rhu['long_stay'] = rhu['total_days'] > 14
long_stays_annual = rhu.set_index('date_in').groupby([pd.Grouper(freq='AS')])['long_stay'].sum()
print(long_stays_annual)
print()
print(long_stays_annual / rhu_total_annual)
```
There are <%= len(rhu['initial_placement'].str.strip().str.lower().unique()) %> `initial_placement` values. These closely correspond to the `placement_reason` values cited in the SRMS datasets (see [SRMS 2](nwdc-srms-2.html), [National SRMS Comparison](natl-srms.html) appendices). The most common `initial_placement` values (not correcting for some minor spelling variations) are:
```python, rhu_initial_placement, echo=True
print(rhu['initial_placement'].str.strip().str.lower().value_counts().head(5))
```
There are <%= len(rhu['placement_reason'].str.strip().str.lower().unique()) %> `placement_reason` values, including some redacted fields. Below we print the 10 most common values:
```python, rhu_placement_reason, echo=True
print(rhu['placement_reason'].str.strip().str.lower().value_counts().head(10))
```
There are <%= len(rhu['release_reason'].str.strip().str.lower().unique()) %> `release_reason` values (not correcting for spelling or other variations). Below we print the 10 most common values:
```python, rhu_release_reason, echo=True
print(rhu['release_reason'].str.strip().str.lower().value_counts().head(10))
```
The field `disc_seg` flags disciplinary segregation placements, which require a hearing process, as opposed to administrative segregation placements. The majority of placements are administrative. Average stay lengths for disciplinary and administrative placements are similar, though median values differ.
```python, rhu_disc_seg, echo=True
rhu['disc_seg'] = rhu['disc_seg'].str.strip().str.upper()
assert sum(rhu['disc_seg'].isnull()) == 0
print('Proportion:')
print(rhu['disc_seg'].value_counts(normalize=True, dropna=False))
print('\nCount per year:')
print(rhu.set_index('date_in').groupby(pd.Grouper(freq='AS'))['disc_seg'].value_counts())
print('\nStay length by category:')
print(rhu.set_index('date_in').groupby(['disc_seg'])['total_days'].describe())
print('\nAnnual median stay length by category:')
print(rhu.set_index('date_in').groupby([pd.Grouper(freq='AS'), 'disc_seg'])['total_days'].median())
```
Next section: [Data Appendix 2. Comparison of GEO Group and ICE SRMS Datasets](smu-rhu-srms-compare.html)
[Back to Data Appendix Index](index.html)
<!---
---
## Notes
<%= print_fn_refs() %>
-->