```python, header, echo=False
# Author: University of Washington Center for Human Rights
# Title: GEO Group Internal Datasets on Use of Solitary Confinement at Northwest Detention Center
# Date: 2020-11-30
# License: GPL 3.0 or greater
```
```python, footnote_functions, echo=False
# Functions for HTML formatted footnotes
fn_count = 1
fn_buffer = []

def fn(ref_text):
    global fn_count, fn_buffer
    ftn_sup = f'<a href="#_ftn{fn_count}" name="_ftnref{fn_count}"><sup>[{fn_count}]</sup></a>'
    ftn_ref = f'<a href="#_ftnref{fn_count}" name="_ftn{fn_count}"><sup>[{fn_count}]</sup></a> {ref_text}'
    fn_buffer.append(ftn_ref)
    fn_count = fn_count + 1
    print(ftn_sup)

def print_fn_refs():
    global fn_buffer
    for ref in fn_buffer:
        print(ref)
        print()

# Functions for labeling figures and tables
fig_count = 1
tab_count = 1

def fig_label():
    global fig_count
    print(f'Figure {fig_count}')
    fig_count = fig_count + 1

def tab_label():
    global tab_count
    print(f'Table {tab_count}')
    tab_count = tab_count + 1
```
# Use of Solitary Confinement at the Northwest Detention Center: Data Appendix
# 1. GEO Group Internal Datasets ("SMU", "RHU")
## UW Center for Human Rights
[Back to Data Appendix Index](index.html)
**Data analyzed:**
1.1 - GEO Group Segregation Lieutenant's log of Restricted Housing Unit (**"RHU"**) placements at NWDC, released to UWCHR via FOIA litigation on August 12, 2020.
1.2 - GEOTrack report of Segregation Management Unit (**"SMU"**) housing assignments at NWDC, released to UWCHR via FOIA litigation on August 12, 2020.
```python, imports, echo=True
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import yaml

with open('input/cleanstats.yaml', 'r') as yamlfile:
    cur_yaml = yaml.load(yamlfile, Loader=yaml.SafeLoader)
smu_cleanstats = cur_yaml['output/smu.csv.gz']
rhu_cleanstats = cur_yaml['output/rhu.csv.gz']
```
# 1.1 - GEOtrack report ("SMU")
Original filename: `Sep_1_2013_to_March_31_2020_SMU_geotrack_report_Redacted.pdf`
Described by US DOJ attorneys for ICE as follows:
> "The GEOtrack report that was provided to Plaintiffs runs from September 1, 2013 to March 31, 2020. That report not only reports all placements into segregation, but it also tracks movement. This means that if an individual is placed into one particular unit then simply moves to a different unit, it is tracked in that report (if an individual is moved from H unit cell 101 to H unit cell 102, it would reflect the move as a new placement on the report)."
We refer to this dataset here by the shorthand "SMU" for "Special Management Unit".
The original file has been converted from PDF to CSV format using the [Xpdf pdftotext](https://www.xpdfreader.com/pdftotext-man.html) command line tool with `--table` option, and hand cleaned to correct OCR errors. The resulting CSV has been minimally cleaned in a private repository, dropping <%= smu_cleanstats['duplicates'] %> duplicated records and adding a unique identifier field, `hashid`; cleaning code available upon request.
The original file includes three redacted fields: `Alien #`, `Name`, and `Birthdate`. The file appears to be generated by a database report for the date range "9/1/2013 To 3/31/2020", presumably from the "GEOtrack" database referenced in the filename and by the DOJ attorneys for ICE. The original file has no un-redacted unique field identifiers or individual identifiers.
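The cleaning code is private, but a content-derived record identifier of this kind can be sketched as follows (an illustration only; we assume here that `hashid` is a hash of each record's field values, though the actual derivation may differ):

```python, hashid_example, echo=True
import hashlib
import pandas as pd

def make_hashid(row):
    # Join the record's field values and hash them to a stable identifier
    payload = '|'.join(str(v) for v in row)
    return hashlib.sha1(payload.encode('utf-8')).hexdigest()

# Toy records standing in for rows of the SMU extract
example = pd.DataFrame({'housing': ['H-101', 'H-102'],
                        'assigned_date': ['2014-01-02', '2014-01-03']})
example['hashid'] = example.apply(make_hashid, axis=1)

# Identifiers are unique so long as the underlying records are
assert example['hashid'].nunique() == len(example)
```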
```python, smu_import, echo=True
csv_opts = {'sep': '|',
            'quotechar': '"',
            'compression': 'gzip',
            'encoding': 'utf-8'}
smu = pd.read_csv('input/smu.csv.gz', **csv_opts)
assert len(set(smu['hashid'])) == len(smu)
assert sum(smu['hashid'].isnull()) == 0
data_cols = list(smu.columns)
data_cols.remove('hashid')
print(smu.info())
```
Here we display the first five records in the dataset (excluding `hashid` field):
<% print(smu[data_cols].head().to_html(border=0, index=False)) %>
```python, smu_date_convert, echo=True
# All date fields convert successfully
for col in ['assigned_dt', 'removed_dt', 'assigned_date', 'removed_date']:
    assert pd.to_datetime(smu[col]).isnull().sum() == 0
    smu[col] = pd.to_datetime(smu[col])
```
The GEOTrack database export time-frame conforms to `removed_dt` min/max values:
```python, smu_date_describe, echo=True
print(smu['assigned_dt'].describe())
print()
print(smu['removed_dt'].describe())
```
One record has a `removed_dt` value earlier than its `assigned_dt`, but the discrepancy lies only in the hour values:
<% print(smu[data_cols].loc[smu['assigned_dt'] > smu['removed_dt']].to_html(border=0, index=False)) %>
<%= sum(smu['assigned_dt'] == smu['removed_dt']) %> records have a `removed_dt` value equal to `assigned_dt`, as seen in this sample of five records:
<% print(smu[data_cols].loc[smu['assigned_dt'] == smu['removed_dt']].head().to_html(border=0, index=False)) %>
We retain these records despite the logical inconsistency of these datetime fields, under the assumption that they represent short placements of less than one full day.
Recalculating segregation placement lengths from the date fields alone yields the same values as the `days_in_seg` field.
Note that this calculation is not first day inclusive, as in the case of the original version of the RHU dataset (see below). We will disregard hourly data for comparison purposes, as no other dataset includes hourly placement or release times.
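The two conventions can be illustrated with a same-day placement (a minimal sketch using made-up dates):

```python, stay_length_example, echo=True
import numpy as np
import pandas as pd

date_in = pd.to_datetime('2018-03-01')
date_out = pd.to_datetime('2018-03-01')  # placed and released the same day

# First day exclusive (SMU convention): a same-day stay counts as 0 days
exclusive = (date_out - date_in) / np.timedelta64(1, 'D')
# First day inclusive (original RHU formula): the same stay counts as 1 day
inclusive = (date_out - date_in) / np.timedelta64(1, 'D') + 1

assert exclusive == 0
assert inclusive == 1
```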
```python, smu_date_calc, echo=True
smu['days_calc'] = (smu['removed_date'] - smu['assigned_date']) / np.timedelta64(1, 'D')
assert sum(smu['days_in_seg'] == smu['days_calc']) == len(smu)
```
The below descriptive statistics reflect first day exclusive stay lengths, including stays of 0 days. <%= sum(smu['days_calc'] < 1) %>, or <%= round((sum(smu['days_calc'] < 1) / len(smu) * 100), 2) %>% of records reflect stay lengths of less than one day, based on placement dates. Note that placements in the SMU dataset represent specific housing assignments within one of <%= len(smu['housing'].unique()) %> cells in the segregation management unit, and would therefore be expected to reflect more and shorter placements than other datasets:
```python, smu_days_calc_describe, echo=True
print(smu['days_calc'].describe())
```
All housing assignments are represented during each year covered by the dataset, but usage patterns vary, with housing units in the 200 block associated with longer average placements:
```python, smu_housing, echo=True
smu_annual = smu.set_index('assigned_date').groupby([pd.Grouper(freq='AS')])
housing_unit_count = smu_annual['housing'].nunique()
# All 20 housing units appear in every year covered by the dataset
assert (housing_unit_count == 20).all()
print(smu.groupby('housing')['days_calc'].mean())
```
Annual median and mean placement lengths show an increase during calendar years 2017-2018:
```python, smu_med_avg_length, echo=True
g = smu.set_index('assigned_date').groupby([pd.Grouper(freq='AS')])
smu_annual_med = g['days_calc'].median()
smu_annual_avg = g['days_calc'].mean()
print(smu_annual_med)
print()
print(smu_annual_avg)
```
Total placement counts per calendar year (note incomplete data for 2013, 2020):
```python, smu_total_placements, echo=True
smu_total_annual = smu.set_index('assigned_dt').groupby([pd.Grouper(freq='AS')])['hashid'].nunique()
print(smu_total_annual)
```
Stays over 14 days must be reported to ICE SRMS; here we flag long placements and calculate them as a percentage of total placements per year. Again, note that placements represent housing assignments in one of <%= len(smu['housing'].unique()) %> total housing locations, not cumulative stay lengths, so long stays may not be accurately represented here. The lack of unique identifiers makes it impossible to track cases of individuals held in segregation for a total of 14 non-consecutive days during any 21-day period, or individuals with special vulnerabilities.
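For illustration, were individual identifiers available, the 14-of-21-day criterion could be checked with a rolling window over a daily segregation indicator (a hypothetical sketch with made-up dates for a single individual; no such identifiers exist in the released data):

```python, rolling_window_example, echo=True
import pandas as pd

# Two hypothetical stays for one individual: Jan 1-8 (8 days) and
# Jan 12-20 (9 days), i.e. 17 segregation days within a 21-day period
stay1 = pd.date_range('2018-01-01', '2018-01-08')
stay2 = pd.date_range('2018-01-12', '2018-01-20')
seg_days = stay1.union(stay2)

# Daily 0/1 indicator over the full span, then the maximum number of
# segregation days within any trailing 21-day window
daily = pd.Series(1, index=seg_days).resample('D').asfreq().fillna(0)
max_in_21 = daily.rolling('21D').sum().max()

assert max_in_21 >= 14  # this pattern would be SRMS-reportable
```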
We find that long placements increase over time both absolutely and as proportion of total placements. However, this may simply reflect fewer transfers of individuals between housing assignments:
```python, smu_long_stays, echo=True
smu['long_stay'] = smu['days_calc'] > 14
long_stays_annual = smu.set_index('assigned_dt').groupby([pd.Grouper(freq='AS')])['long_stay'].sum()
print(long_stays_annual)
print()
print(long_stays_annual / smu_total_annual)
```
Top citizenship values:
```python, smu_citizenship_table, echo=False
citizenship = smu['citizenship'].value_counts()
top_5 = pd.DataFrame(citizenship.head(5))
all_others = smu[~smu['citizenship'].isin(list(top_5.index))]
top_5.loc['ALL OTHERS', 'citizenship'] = len(all_others)
top_5['citizenship'] = top_5['citizenship'].astype(int)
top_5 = top_5.rename({'citizenship': 'placements'}, axis=1)
top_5.index.name = 'citizenship'
```
**<%= tab_label() %>: SMU dataset top five countries of citizenship**
<% print(top_5.reset_index().to_html(border=0, index=False)) %>
### Comparison with segregation placements reported by DHS inspectors
A [June 24-26, 2014 DHS inspection report](https://drive.google.com/file/d/1YDX4fOOJ3DCftWiQv7O_5jwA2eZ0ftWR/view?usp=sharing) for NWDC states, "Documentation reflects there were 776 assignments to segregation in the past year". The DHS inspection report does not specify the source of the records cited.
The SMU dataset covers this period, albeit with only partial records for June-September 2013. The total count of placements recorded in the SMU dataset during this period, <%= len(smu.set_index('assigned_dt').loc['2013-06-01':'2014-06-30']) %>, is reasonably close to the figure cited by DHS inspectors, which suggests an average of about <%= round(776 / 12) %> placements per month:
```python, smu_dhs_compare, echo=True
### Monthly total placements during period of DHS inspection report:
dhs_period = smu.set_index('assigned_dt').loc[:'2014-06-30']
g = dhs_period.groupby(pd.Grouper(freq='M'))
print(g['hashid'].nunique())
dhs_period_complete = smu.set_index('assigned_dt').loc['2013-09-01':'2014-06-30']
g = dhs_period_complete.groupby(pd.Grouper(freq='M'))
dhs_period_complete_monthly_avg = g['hashid'].nunique().mean()
```
This is comparable to the average of <%= round(dhs_period_complete_monthly_avg, 1) %> placements per month reported in the SMU dataset during the period for which complete data exists (September 2013 - June 2014). If the GEOtrack database is the source of the data cited in the 2014 DHS inspection report, this is not noted in the inspection report itself.
# 1.2 - GEO Lieutenant's report ("RHU")
Original file: `15_16_17_18_19_20_RHU_admission_Redacted.xlsx`
Log created and maintained by hand by GEO employee to track Restricted Housing Unit placements. Described by US DOJ attorneys for ICE as follows:
> "The spreadsheet runs from January 2015 to May 28, 2020 and was created by and for a lieutenant within the facility once he took over the segregation lieutenant duties. The spreadsheet is updated once a detainee departs segregation. The subjects who are included on this list, therefore, are those who were placed into segregation and have already been released from segregation. It does not include those individuals who are currently in segregation."
We refer to this dataset here by the shorthand "RHU" for "Restricted Housing Unit".<%= fn('US DOJ attorneys for ICE specified that the terms "Special Management Unit" and "Restricted Housing Unit" are interchangeable and identify the same locations.') %>
The original file has been converted from XLSX to CSV format, with each annual tab saved as a separate CSV. The resulting CSVs have been concatenated and minimally cleaned in a private repository, dropping <%= rhu_cleanstats['duplicates'] %> duplicated records and adding a unique identifier field, `hashid`; cleaning code available upon request.
The original file includes two fully redacted fields: `Name` and `Alien #`; and one partially redacted field, `Placement reason`. The original file has no un-redacted unique field identifiers or individual identifiers.
```python, rhu_import, echo=True
csv_opts = {'sep': '|',
            'quotechar': '"',
            'compression': 'gzip',
            'encoding': 'utf-8'}
rhu = pd.read_csv('input/rhu.csv.gz', **csv_opts)
assert len(set(rhu['hashid'])) == len(rhu)
assert sum(rhu['hashid'].isnull()) == 0
data_cols = list(rhu.columns)
data_cols.remove('hashid')
print(rhu.info())
```
Here we display the first five records in the dataset (excluding `hashid` field):
<% print(rhu[data_cols].head().to_html(border=0, index=False)) %>
## Dates and total days calculation
Inspection of the original Excel file shows that the `Total days` column values are often incorrect, apparently due to a missing cell formula. For example, on the "2020" spreadsheet tab, the `Total days` values are integers which only occasionally align with placement lengths calculated from the `Date in` and `Date out` columns. However, additional rows at the bottom of the sheet, empty in all other fields, contain an Excel formula ("=(D138-C138)+1") which was evidently intended to calculate these values. Comparing calculated stay lengths with the reported `Total days` suggests that this formula was not updated consistently, causing the values to become misaligned. Additionally, the "2015" spreadsheet tab includes many `Total days` values equal to "1", suggesting that the formula was applied incorrectly or with missing data.
We can recalculate actual stay lengths based on the formula cited above (inclusive of start days, with stays of less than one day calculated as "1"); or with the formula used for the "SMU" records above (exclusive of start days, with stays of less than one day calculated as "0"), for more consistent comparison with other datasets.
The above issue raises the possibility that other fields in addition to `Total days` may be misaligned in the original dataset. One fact mitigating this possibility is that no `Date out` values predate associated `Date in` values. We can also look more closely at qualitative fields to make an educated guess as to the data quality: for example, do `initial_placement` values suggesting disciplinary placements align with `placement_reason` values also consistent with disciplinary placements? However, we do not intend to use this dataset for detailed qualitative analysis; of most interest are total segregation placements and segregation stay lengths.
```python, rhu_date_setup, echo=True
rhu['date_in'] = pd.to_datetime(rhu['date_in'])
rhu['date_out'] = pd.to_datetime(rhu['date_out'])
# As noted above, no `date_out` values predate associated `date_in` values:
assert sum(rhu['date_in'] > rhu['date_out']) == 0
print(rhu['date_in'].describe())
print()
print(rhu['date_out'].describe())
```
Here we recalculate the total days field based on the first day inclusive formula in the original Excel spreadsheet ("=(D138-C138)+1"):
```python, rhu_total_days_calc, echo=True
rhu['total_days_calc'] = (rhu['date_out'] - rhu['date_in']) / np.timedelta64(1, 'D') + 1
compare_pct = sum(rhu['total_days_calc'] == rhu['total_days']) / len(rhu) * 100
print(rhu['total_days'].describe())
print()
print(rhu['total_days_calc'].describe())
```
Only <%= round(compare_pct, 2) %>% of original `total_days` values match their respective recalculated stay lengths in `total_days_calc`.
However, note that the above summary statistics for the original field (`total_days`) are very similar to the recalculated field (`total_days_calc`), suggesting that most values are present in the dataset but misaligned.
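One way to test the misalignment hypothesis is to compare the two columns as distributions rather than row by row; a toy illustration (with made-up values, not drawn from the dataset):

```python, misalignment_example, echo=True
import pandas as pd

# The same multiset of stay lengths, offset by one row: no row-wise
# matches, but identical distributions
reported = pd.Series([3, 5, 2, 7, 4])
recalculated = pd.Series([4, 3, 5, 2, 7])

rowwise_match = (reported == recalculated).mean()
same_distribution = reported.value_counts().sort_index().equals(
    recalculated.value_counts().sort_index())

assert rowwise_match == 0.0
assert same_distribution
```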
Therefore, we conclude that it is correct to recalculate the `total_days` field. Instead of the first day inclusive formula suggested in the original dataset, here we will use a first day exclusive formula, where placements starting and ending on the same day have length 0. While this risks underestimating placement lengths represented in the dataset, it is more consistent with the calculation of placement lengths in the SMU and SRMS datasets:
```python, rhu_recalculate_total_days, echo=True
rhu['total_days'] = (rhu['date_out'] - rhu['date_in']) / np.timedelta64(1, 'D')
rhu = rhu.drop('total_days_calc', axis=1)
print(rhu['total_days'].describe())
```
Annual median and mean placement lengths are relatively consistent, showing an apparent decrease during the first few months of 2020, possibly explained by incomplete placements excluded from this dataset:
```python, rhu_med_avg_length, echo=True
g = rhu.set_index('date_in').groupby([pd.Grouper(freq='AS')])
rhu_annual_med = g['total_days'].median()
rhu_annual_avg = g['total_days'].mean()
print(rhu_annual_med)
print()
print(rhu_annual_avg)
```
Total placement counts per calendar year (note data for 2020 is incomplete):
```python, rhu_total_placements, echo=True
rhu_total_annual = rhu.set_index('date_in').groupby([pd.Grouper(freq='AS')])['hashid'].nunique()
print(rhu_total_annual)
```
Stays over 14 days must be reported to ICE SRMS; here we flag long placements and calculate as a percent of total placements per year. The lack of unique identifiers makes it impossible to track cases of individuals in segregation for a total of 14 non-consecutive days during any 21 day period. Inconsistencies and lack of information in `placement_reason` make it a poor candidate for flagging placements involving individuals with special vulnerabilities. We note an increasing proportion and absolute number of long placements during 2017-2019:
```python, rhu_long_stays, echo=True
rhu['long_stay'] = rhu['total_days'] > 14
long_stays_annual = rhu.set_index('date_in').groupby([pd.Grouper(freq='AS')])['long_stay'].sum()
print(long_stays_annual)
print()
print(long_stays_annual / rhu_total_annual)
```
There are <%= len(rhu['initial_placement'].str.strip().str.lower().unique()) %> `initial_placement` values. These closely correspond to the `placement_reason` values cited in the SRMS datasets (see [SRMS 2](nwdc-srms-2.html), [National SRMS Comparison](natl-srms.html) appendices). The most common `initial_placement` values (not correcting for some minor spelling variations) are:
```python, rhu_initial_placement, echo=True
print(rhu['initial_placement'].str.strip().str.lower().value_counts().head(5))
```
There are <%= len(rhu['placement_reason'].str.strip().str.lower().unique()) %> `placement_reason` values, including some redacted fields. Below we print the 10 most common values:
```python, rhu_placement_reason, echo=True
print(rhu['placement_reason'].str.strip().str.lower().value_counts().head(10))
```
There are <%= len(rhu['release_reason'].str.strip().str.lower().unique()) %> `release_reason` values (not correcting for spelling or other variations). Below we print the 10 most common values:
```python, rhu_release_reason, echo=True
print(rhu['release_reason'].str.strip().str.lower().value_counts().head(10))
```
The field `disc_seg` flags disciplinary segregation placements, which require a hearing process, as opposed to administrative segregation placements. The majority of placements are administrative. Average stay lengths for disciplinary and administrative placements are similar, though median values differ.
```python, rhu_disc_seg, echo=True
rhu['disc_seg'] = rhu['disc_seg'].str.strip().str.upper()
assert sum(rhu['disc_seg'].isnull()) == 0
print('Proportion:')
print(rhu['disc_seg'].value_counts(normalize=True, dropna=False))
print('\nCount per year:')
print(rhu.set_index('date_in').groupby(pd.Grouper(freq='AS'))['disc_seg'].value_counts())
print('\nStay length by category:')
print(rhu.set_index('date_in').groupby(['disc_seg'])['total_days'].describe())
print('\nAnnual median stay length by category:')
print(rhu.set_index('date_in').groupby([pd.Grouper(freq='AS'), 'disc_seg'])['total_days'].median())
```
Next section: [Data Appendix 2. Comparison of GEO Group and ICE SRMS Datasets](smu-rhu-srms-compare.html)
[Back to Data Appendix Index](index.html)
<!---
---
## Notes
<%= print_fn_refs() %>
-->