In [1]:
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# 1. Read in datafile

In [2]:
df = pd.read_csv("rawJailDataIntern.csv")

# 2. Explore Missingness

### Missingness Table:

In [4]:
# Total Missing Data:
df.isna().sum()

fips                    0
facility_name           0
year                    0
rated_capacity          7
total_confined_pop      3
adp                    15
confined_women          0
confined_men           12
admissions_year       122
admissions_week        92
admissions_day        159
discharge_year        122
discharge_week         92
discharge_day         160
facility_year           0
dtype: int64

### 2 Takeaways from the missingness table:

1. **Impute yearly counts based on daily and weekly counts**: The missingness in the daily and weekly counts does not appear to fully overlap with the missingness in the yearly counts. I will therefore try to use the daily and weekly counts to infer the missing yearly counts for admissions and discharges.

2. **Impute confined_men using confined_women**: Because this dataset does not record non-binary gender identities (there is no confined_other column), we can fill in some missing confined_men values by subtracting confined_women from total_confined_pop, given that confined_women has no missing values.

# 3. Impute yearly counts based on daily and weekly counts

After some exploratory data analysis, it's clear that many of the missing yearly values for admissions and discharge have either daily or weekly data next to them.

As a **rough estimate**, we can impute a missing yearly value by multiplying the weekly count by 52 or, if the weekly count is also missing, the daily count by 365. If the imputed values turn out to be anomalous, they will be removed later in the code.

In [5]:
# impute missing admissions_year data based on admissions_week and admissions_day
df['admissions_year'] = np.where(df.admissions_year.isnull(), df.admissions_week * 52, df.admissions_year)
df['admissions_year'] = np.where(df.admissions_year.isnull(), df.admissions_day * 365, df.admissions_year)

# impute missing discharge_year data based on discharge_week and discharge_day
df['discharge_year'] = np.where(df.discharge_year.isnull(), df.discharge_week * 52, df.discharge_year)
df['discharge_year'] = np.where(df.discharge_year.isnull(), df.discharge_day * 365, df.discharge_year)
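
To make the fallback order concrete, here is a minimal sketch on made-up rows (the `toy` frame and its values are illustrative assumptions, not rows from the jail dataset). It uses chained `fillna` calls, which should behave like the two `np.where` steps above: keep the reported yearly count, otherwise use the weekly estimate, otherwise use the daily estimate.

```python
import numpy as np
import pandas as pd

# Toy rows, made up for illustration only.
toy = pd.DataFrame({
    "admissions_year": [1000.0, np.nan, np.nan, np.nan],
    "admissions_week": [20.0, 10.0, np.nan, np.nan],
    "admissions_day": [3.0, 2.0, 1.0, np.nan],
})

# Prefer the reported yearly count, then weekly * 52, then daily * 365.
toy["admissions_year"] = (
    toy["admissions_year"]
    .fillna(toy["admissions_week"] * 52)
    .fillna(toy["admissions_day"] * 365)
)

print(toy["admissions_year"].tolist())
```

A row with no yearly, weekly, or daily count stays missing, which is why some missingness remains after this step.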

Now, let's drop the daily and weekly counts because we wish to remove them in the final output. 

In [6]:
df = df.drop(columns=['admissions_day', 'admissions_week', 'discharge_week', 'discharge_day'])

# 4. Impute confined_men using confined_women

Subtract confined_women from total_confined_pop to fill in the missing confined_men values.

As the output below shows, this reduces the missingness of confined_men to 0.

In [7]:
df['confined_men'] = np.where(df.confined_men.isnull(), df.total_confined_pop - df.confined_women, df.confined_men)
df.confined_men.isna().sum() # how much is still missing from confined_men?

0
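
As a small illustration of this step, the sketch below applies the same subtraction to made-up rows (the `toy` frame is an assumption for illustration). It also shows that the fill only works where total_confined_pop is present, so a result of 0 missing values suggests that no row in the real data lacks both columns.

```python
import numpy as np
import pandas as pd

# Toy rows, made up for illustration only.
toy = pd.DataFrame({
    "total_confined_pop": [100.0, 80.0, np.nan],
    "confined_women": [10.0, 8.0, 5.0],
    "confined_men": [90.0, np.nan, np.nan],
})

# Fill missing confined_men with total_confined_pop - confined_women.
toy["confined_men"] = toy["confined_men"].fillna(
    toy["total_confined_pop"] - toy["confined_women"]
)

print(toy["confined_men"].tolist())
```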

# 5. Detect Anomalous Data

Create percent-change columns (suffixed `_pctchange`) for rated_capacity, total_confined_pop, adp, confined_women, confined_men, admissions_year, discharge_year. Each value is compared with the value in the previous row.

In [8]:
columns = ['rated_capacity', 'total_confined_pop', 'adp', 'confined_women', 'confined_men', 'admissions_year', 'discharge_year']

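# Percent change relative to the previous row (missing values are forward-filled first);
# the results are joined back to df as <column>_pctchange.
df = (df.join(df[columns]
              .pct_change(fill_method='ffill'), rsuffix='_pctchange'))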
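
One thing to keep in mind: `pct_change` on the whole frame compares each row with the row directly above it, so the first row of a facility is compared with the last row of the previous facility. The sketch below, on made-up rows (the `toy` frame and the column choice are assumptions for illustration), contrasts that with a per-facility percent change via `groupby`. This is an alternative to consider, not what the cell above does.

```python
import pandas as pd

# Toy rows, made up for illustration only: two facilities stacked in one frame.
toy = pd.DataFrame({
    "facility_name": ["A", "A", "A", "B", "B"],
    "year": [1990, 1991, 1992, 1990, 1991],
    "adp": [100.0, 110.0, 120.0, 1000.0, 1050.0],
})

# Whole-frame percent change: row 3 (first row of B) is compared with the last row of A.
global_change = toy["adp"].pct_change()

# Per-facility percent change: the first row of each facility has no previous value.
grouped_change = toy.groupby("facility_name")["adp"].pct_change()

print(pd.DataFrame({"global": global_change, "grouped": grouped_change}))
```

With the whole-frame version, the first row of facility B shows a change of over 700% and would be flagged, while the grouped version leaves it as NaN.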

### Pinpoint anomalies using `anomaly`:
`anomaly` is a copy of the dataset that helps pinpoint anomalies, defined here as a change of more than 40% in absolute value from the previous row. It keeps the original values for variables like adp, confined_men, etc., but each `_pctchange` column on the right is replaced with a string label: "Anomaly" if the change for that value exceeds 40% in absolute value, and "Good" otherwise.

In [16]:
anomaly = df.copy()

for column in columns:
    column_pctchange = column + '_pctchange'
    anomaly[column_pctchange] = np.where(abs(anomaly[column_pctchange]) > .40, 
                          "Anomaly", 
                          "Good")
anomaly


### Do research on a county using `show_anomaly_table()`
If you want to research a particular county, `show_anomaly_table` is a function that shows the anomaly labels for that county and one selected statistic.

In [10]:
def show_anomaly_table(county, statistic):
    '''
    Show the values and anomaly labels for one county and one statistic.
    
    county - county name as it appears in facility_name (all caps)
    statistic - column name of the statistic (e.g. admissions_year)
    '''
    statistic_pct_change = statistic + '_pctchange'
    return anomaly[anomaly['facility_name'].str.contains(county)][['facility_name',
                                                           'year',
                                                           statistic,
                                                           statistic_pct_change]]

show_anomaly_table('BRISTOL', 'admissions_year')

| | facility_name | year | admissions_year | admissions_year_pctchange |
|---|---|---|---|---|
| 105 | BRISTOL COUNTY SHERIFFS OFFICE | 1985 | 1605.0 | Anomaly |
| 106 | BRISTOL COUNTY SHERIFFS OFFICE | 1986 | 1804.0 | Good |
| 107 | BRISTOL COUNTY SHERIFFS OFFICE | 1987 | 1975.0 | Good |
| 108 | BRISTOL COUNTY SHERIFFS OFFICE | 1989 | 2085.0 | Good |
| 109 | BRISTOL COUNTY SHERIFFS OFFICE | 1990 | 2439.0 | Good |
| 110 | BRISTOL COUNTY SHERIFFS OFFICE | 1992 | 1825.0 | Good |
| 111 | BRISTOL COUNTY SHERIFFS OFFICE | 1994 | NaN | Good |
| 112 | BRISTOL COUNTY SHERIFFS OFFICE | 1995 | NaN | Good |
| 113 | BRISTOL COUNTY SHERIFFS OFFICE | 1996 | NaN | Good |
| 114 | BRISTOL COUNTY SHERIFFS OFFICE | 1997 | NaN | Good |


#### Research Bristol County House of Correction and Jail:

Here's an example of the type of anomaly research that can be aided by the dataset I've created:

Using the output from `show_anomaly_table('BRISTOL', 'admissions_year')`, we can see that Bristol has many anomalous years for its admissions per year variable. 

For instance, Bristol's admissions per year ranges from 1605.0 people/year in 1985 to 29900.0 people/year in 2004. 

From some online research, the Bristol County House of Correction and Jail has a capacity of 1,100 beds. [<sup>1</sup>](#fn1) A yearly count of 29900.0 admissions therefore seems improbable, because it would mean roughly 27 admissions per bed per year. When removing anomalies later, I will check that the values that get removed are the ones close to 29900.0.
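
The per-bed figure is a simple ratio of the two numbers quoted above. A quick check, with variable names chosen here for illustration:

```python
# Figures quoted in the text above.
admissions_2004 = 29900  # reported admissions per year in 2004
beds = 1100              # bed capacity from the facility page

# Admissions per bed per year implied by the 2004 value.
admissions_per_bed = admissions_2004 / beds
print(round(admissions_per_bed, 1))  # 27.2
```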

<span id="fn1"><sup>1</sup>Bristol County Facilities: https://www.bcso-ma.us/facilities.htm</span>

# 6. Remove Anomalous Data:

This code loops over the columns we are returning and replaces every value flagged as an anomaly (more than a 40% absolute change from the previous row) with NaN.

In [11]:
for column in columns:
    column_pctchange = column + '_pctchange'
    df[column] = np.where(abs(df[column_pctchange]) > .40, 
                          np.nan, 
                          df[column])
# Keep only the first 10 columns, which drops facility_year and the *_pctchange helper columns
df = df.iloc[:, 0:10]
df

# 7. Linear interpolation

Apply linear interpolation within each county (grouped by `fips`).

In [13]:
df = df.groupby('fips').apply(lambda group: group.interpolate(method='index'))
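
Note that `method='index'` interpolates along the DataFrame's row index, so with a default integer index a gap between 1991 and 1994 is treated like a single step. If gaps in `year` should be weighted by their length, one option is to interpolate with `year` as the index inside each group. The sketch below shows the difference on made-up rows (the `toy` frame and the new column names are assumptions for illustration); it is not what the cell above does.

```python
import numpy as np
import pandas as pd

# Toy rows, made up for illustration only. County 1 skips from 1991 to 1994.
toy = pd.DataFrame({
    "fips": [1, 1, 1, 2, 2, 2],
    "year": [1990, 1991, 1994, 1990, 1991, 1992],
    "adp": [100.0, np.nan, 180.0, 50.0, np.nan, 70.0],
})

# Row-position interpolation within each county (equal spacing between rows).
toy["adp_by_position"] = toy.groupby("fips")["adp"].transform(
    lambda s: s.interpolate(method="linear")
)

# Year-aware interpolation: index each county by year so gaps are weighted by length.
pieces = []
for _, group in toy.groupby("fips"):
    by_year = group.set_index("year")["adp"].interpolate(method="index")
    pieces.append(pd.Series(by_year.to_numpy(), index=group.index))
toy["adp_by_year"] = pd.concat(pieces)

print(toy)
```

For county 1, the row-position version fills 1991 with the midpoint of the two neighbors, while the year-aware version places it a quarter of the way from the 1990 value to the 1994 value.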

Linear Interpolation (after removing anomalies) has reduced missingness to:

In [14]:
df.isna().sum()

fips                   0
facility_name          0
year                   0
rated_capacity         8
total_confined_pop    11
adp                   21
confined_women        10
confined_men          12
admissions_year       18
discharge_year        27
dtype: int64

# 8. Write data to new file 

In [15]:
df.to_csv('cleaned.csv')
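
With the default settings, `to_csv` also writes the row index as an unnamed first column. Whether that is wanted depends on how the file will be read back. The sketch below, on a made-up one-row frame written to an in-memory buffer (both assumptions for illustration), shows the header with and without `index=False`.

```python
import io
import pandas as pd

# Toy frame, made up for illustration only.
toy = pd.DataFrame({"fips": [1], "year": [1985]})

# Default: the row index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)

header_with_index = with_index.getvalue().splitlines()[0]
header_without_index = without_index.getvalue().splitlines()[0]
print(header_with_index)     # ,fips,year
print(header_without_index)  # fips,year
```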