# Analyzing California State Prisons with Professor Victoria Robinson 
***
### Table of Contents

[CONTEXT](#sectioncontext)<br>

1. [PRISON DATA](#subsection1)<br> 
2. [DATA CLEANING](#subsection2)<br>

##### Exploratory Data Analysis
3. [DESIGNED CAPACITY](#subsection3)<br>
4. [TOTAL POPULATION](#subsection4)<br>

In [None]:
from datascience import * 
import numpy as np 

import matplotlib  
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight') 

## Context<a id='sectioncontext'></a>
---
Prisons and county jails are overcrowded, and we will explore how people are moved between the two types of institutions.

The key difference between state prisons and jails involves the process of sentencing. Prisons are designed for long-term sentences, while jails are for those who are unsentenced or have short-term sentences. It is important to note that short-term sentences are generally one year or less. Another difference is that prisons are larger and controlled at the state level. In contrast, jails are smaller and handled by a city or county.

## **Prisons Data** <a id='subsection1'> </a>
The CDCR (California Department of Corrections and Rehabilitation) reports the number of people in state prisons across California. If you are interested in learning more, please visit: [CDCR data and information](https://github.com/nrjones8/cdcr-population-data)

|Column Name   | Description |
|--------------|---------|
|year |Year that the data was collected  |
|month | Month that the data was collected |
|institution_name |  Abbreviated Name of the State Prison|
|population_felons | People imprisoned for committing a felony (a serious or violent crime) |
|civil_addict | People imprisoned for drug related offenses |
|total_population | Sum of civil addict and population felons columns|
|designed_capacity | Max number of people the prison was built to hold|
|percent_occupied | Percentage of people incarcerated out of designed capacity|
|staffed_capacity | Max number of people the prison can hold based on the number of people employed|

In [None]:
# Read the CSV and drop the first column (column index 0)
prisons = Table.read_table("data/monthly_cdcr.csv").drop(0)
prisons.sort('year').show(5)

## **Data Cleaning** <a id='subsection2'> </a>

The dataset is made up of prisons in California from the year 1996 to 2018. The counts in the dataset were done monthly, meaning that for each year we expect there to be at most 12 counts for each prison. Let's calculate the number of times we expect each prison to appear in the cell below.

In [None]:
months = 12
years = 2018 - 1996 + 1 # We add one to our calculation because we want to include 1996

months * years

We expect 276 rows for each prison. If we group by institution, we can see how many times each institution appears.

In [None]:
# We choose to have the counts in descending order in the following example:
prisons.group("institution_name").sort("count", descending = True)

<div class="alert alert-warning">
<b>Question 1:</b> What do you notice from this table?
   </div>

*INSERT ANSWER HERE*

### **Multiple Entries**

If a prison has more than 276 rows, then it recorded multiple entries per month. Why would a prison need to do this? This happens for:
- Valley State Prison (VSP)
- Sierra Conservation Center (SCC)
- Los Angeles County State Prison (LAC)
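
One way to find such prisons is to compare each institution's row count against the expected 276. The following is a minimal plain-Python sketch of that comparison; the helper `flag_row_counts` and the toy labels `A`, `B`, and `C` are made up for illustration and are not part of the dataset or the `datascience` library.

```python
from collections import Counter

# 12 monthly counts for each year from 1996 through 2018
EXPECTED_ROWS = 12 * (2018 - 1996 + 1)

def flag_row_counts(institution_names, expected=EXPECTED_ROWS):
    """Split institutions into those with fewer and those with more rows than expected."""
    counts = Counter(institution_names)
    fewer = {name: n for name, n in counts.items() if n < expected}
    more = {name: n for name, n in counts.items() if n > expected}
    return fewer, more

# Made-up institution labels, for illustration only
toy_names = ["A"] * 276 + ["B"] * 120 + ["C"] * 300
fewer, more = flag_row_counts(toy_names)
print("Fewer rows than expected:", fewer)
print("More rows than expected:", more)
```

In this toy example, `B` would be a prison with missing months (or one that opened late or closed early), while `C` would be a prison with multiple entries in some months.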

#### Valley State Prison

From outside research, we discovered that Valley State Prison changed from a female prison to a male prison in October 2012. However, all rows in the original data are labeled as a male institution. We relabeled the rows so that the female and male periods are separated.

In [None]:
prisons_1 = Table.read_table("data/prisons3.csv")
prisons_1.where('institution_name', are.containing("VALLEY SP (")).show(5)
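
The relabeling described above might look something like the sketch below. It is not the code that produced `prisons3.csv`. It assumes that months are stored as numbers, that the labels are `VALLEY SP (FEMALE)` and `VALLEY SP (MALE)`, and that the October 2012 row already counts as male. The helper `relabel_valley_sp` and the toy rows are hypothetical.

```python
def relabel_valley_sp(rows, switch_year=2012, switch_month=10):
    """Label Valley SP rows as female before the switch date and male from it onward."""
    relabeled = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        if row["institution_name"].startswith("VALLEY SP"):
            # Tuple comparison orders by year first, then by month
            before_switch = (row["year"], row["month"]) < (switch_year, switch_month)
            row["institution_name"] = (
                "VALLEY SP (FEMALE)" if before_switch else "VALLEY SP (MALE)"
            )
        relabeled.append(row)
    return relabeled

# Made-up rows, for illustration only
toy_rows = [
    {"year": 2012, "month": 9, "institution_name": "VALLEY SP (MALE)"},
    {"year": 2012, "month": 10, "institution_name": "VALLEY SP (MALE)"},
    {"year": 2012, "month": 10, "institution_name": "OTHER PRISON"},
]
relabeled_rows = relabel_valley_sp(toy_rows)
for row in relabeled_rows:
    print(row["year"], row["month"], row["institution_name"])
```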

<div class="alert alert-warning">
<b>Question 2:</b> What does this transition mean for people who were incarcerated at Valley State? What would happen to them?
   </div>

*INSERT ANSWER HERE*

### Designed Capacity of California's Prisons Over Time <a id='subsection3'> </a>
The designed capacity of an institution is based on the number of people each room is able to hold given the constraints of a building's size.

In [None]:
year_and_design_capacity = prisons_1.select(
    "year", "month","institution_name",'designed_capacity', "total_population")
year_and_design_capacity.show(3)

Each row represents an institution in a certain month. To show the changes across years, we need to have one number to represent each institution per year. Instead of 12 rows of data (one per month), there will be 1 row per year. We do this by grouping by both the year and the institution and taking the average of the remaining columns.

In [None]:
correct_table = year_and_design_capacity.group(["year","institution_name"], np.average)
correct_table = correct_table.relabeled("designed_capacity average", "designed_capacity")
correct_table = correct_table.relabeled("total_population average", "total_population")
correct_table
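
To make the two-column grouping concrete, here is a small plain-Python sketch of the same idea on made-up numbers. The helper `yearly_average` and the toy rows are hypothetical. The notebook itself relies on `Table.group` with `np.average`.

```python
from collections import defaultdict
from statistics import mean

def yearly_average(rows, value_key):
    """Average value_key over all rows that share the same (year, institution_name)."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["year"], row["institution_name"])].append(row[value_key])
    return {key: mean(values) for key, values in groups.items()}

# Made-up monthly counts, for illustration only
toy_monthly = [
    {"year": 2011, "month": 1, "institution_name": "A", "total_population": 100},
    {"year": 2011, "month": 2, "institution_name": "A", "total_population": 110},
    {"year": 2011, "month": 1, "institution_name": "B", "total_population": 50},
    {"year": 2012, "month": 1, "institution_name": "A", "total_population": 90},
]
averages = yearly_average(toy_monthly, "total_population")
for (year, name), value in sorted(averages.items()):
    print(year, name, value)
```

The two monthly rows for institution `A` in 2011 collapse into a single yearly value, which is what lets each institution count once per year in later steps.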

Now, if we group by year and sum the other columns, the sums will represent the total across all institutions in a given year.

In [None]:
design_capacity_ca = correct_table.select("designed_capacity", "year").group("year", sum)
design_capacity_ca.show(3)
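
The sketch below illustrates, on made-up numbers, why we average within each year before summing across institutions: summing the monthly rows directly would count each prison up to 12 times per year. The helper `sum_by_year` and the toy rows are hypothetical.

```python
from collections import defaultdict

def sum_by_year(yearly_rows, value_key):
    """Add up value_key across all institutions for each year."""
    totals = defaultdict(float)
    for row in yearly_rows:
        totals[row["year"]] += row[value_key]
    return dict(totals)

# Made-up yearly averages (one row per institution per year), for illustration only
toy_yearly = [
    {"year": 2011, "institution_name": "A", "designed_capacity": 1000},
    {"year": 2011, "institution_name": "B", "designed_capacity": 500},
    {"year": 2012, "institution_name": "A", "designed_capacity": 1000},
]
capacity_by_year = sum_by_year(toy_yearly, "designed_capacity")
print(capacity_by_year)
```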

Although it is possible to compare the values for designed capacity year by year and try to notice a pattern, it is often more useful to inspect the information visually, as a plot might reveal useful insights and provide context for the data we are looking at.

Below we drew a line plot to reflect the changes in design capacity of California's state prisons over time.
We will focus on some years that mark important shifts as a result of realignment.

<div class="alert alert-warning">
<b>Question 3:</b> Fill in the code to create the line plot using the `designed_capacity sum` and `year` columns.
   </div>

In [None]:
# Assign the year 2011 as the x-coordinate
x_coordinate_2011 = 2011

# Assign the designed capacity sum in the year 2011 as the y-coordinate
y_coordinate_2011 = design_capacity_ca.where("year", 2011).column("designed_capacity sum")




### ADD YOUR CODE BELOW ###
design_capacity_ca.plot(..., ...)




# Plot a single x,y coordinate
plt.title("Design Capacity over time")
plt.plot(x_coordinate_2011, y_coordinate_2011, 'ro'); 

<div class="alert alert-warning">
<b>Question 4:</b> In looking at the graph produced, how does it reflect the systematic changes in California's potential prison population?
   </div>

*INSERT ANSWER HERE*

### Total Population in California's Prisons Over Time <a id='subsection4'> </a>
Let's compare the number of people California's prisons are designed to hold with the total number of people actually held in them.

In [None]:
total_pop_and_design = correct_table.select("year", "total_population", "designed_capacity")
total_pop_and_design.show(5)

Similar to our last investigation, we will group by year and sum both `total_population` and `designed_capacity` across all prisons in a given year.

In [None]:
sum_total_and_design = total_pop_and_design.group("year", sum)
sum_total_and_design.show(5)

Given this table, we can find the percent full of these institutions. We use designed capacity as a measure of how many people may be incarcerated in these institutions.
$$\text{Percent full in a given year} = 100 \times \dfrac{\text{Number of people incarcerated that year}}{\text{Number of spaces available that year}}$$
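
As a quick worked example of the formula with made-up numbers (not values from the dataset), the sketch below computes the percent full for a single year. The helper `percent_full` is hypothetical.

```python
def percent_full(people_incarcerated, designed_capacity):
    """Return 100 * (people incarcerated) / (spaces available)."""
    if designed_capacity <= 0:
        raise ValueError("designed_capacity must be positive")
    return 100 * people_incarcerated / designed_capacity

# Made-up numbers: 150 people held in a system designed for 100
toy_percent = percent_full(150, 100)
print(toy_percent)
```

A value above 100 means more people are incarcerated than the institutions were designed to hold.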

<div class="alert alert-warning">
<b>Question 5:</b> Fill in the blanks so that the code below calculates the percent at which prisons are full.
   </div>

In [None]:
total_population = sum_total_and_design.column(...) 
designed_sum = sum_total_and_design.column("designed_capacity sum")

capacity_percentage = 100 * (total_population / ...)
capacity_percentage

To plot these percentages, we have to add a new column to the table. Let's call this column "Design Percent".

In [None]:
# Use .with_column() to add a new column with the percentages calculated above!
total_and_design_and_percentages_table = sum_total_and_design.with_column("Design Percent", capacity_percentage)
total_and_design_and_percentages_table.show(3)

<div class="alert alert-warning">
<b>Question 6:</b> Draw a line plot to reflect the trend in overcrowding in California state prisons over time. 
   </div>

In [None]:
total_and_design_and_percentages_table....("year", ...)

plt.ylabel("Design Percent Occupied (%)")
plt.title("Overcrowding According to Designed Capacity");

<div class="alert alert-warning">
<b>Question 7:</b> What do you notice about this trend over time?
   </div>

*INSERT ANSWER HERE*

In [None]:
total_and_design_and_percentages_table.plot("year", "Design Percent")

plt.xlabel("Year")
plt.ylabel("Design Percent Occupied (%)")
plt.title("Overcrowding According to Designed Capacity")
plt.ylim(0, 210);

<div class="alert alert-warning">
<b>Question 8:</b> How does the scale influence the trend?
   </div>

*INSERT ANSWER HERE*