# pandas exploration
In this assignment you will select a data set and do some munging and analysis of it using `pandas`, Jupyter Notebooks, and associated Python-centric data science tools.

## Set up environment

The following lines ensure that `numpy` and `pandas` are installed in the notebook environment.  Depending on your system, this step may be unnecessary, in which case you can remove it.

In [5]:
!pip3 install numpy
!pip3 install pandas



Import the core data science libraries:

In [6]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## Import the raw data
In this section, you will import the raw data into a `pandas` DataFrame.

In [9]:
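# place your code into this Code cell
# load the raw CSV file into a DataFrame
df = pd.read_csv('data/nys_incarceration.csv')
# show the first 5 rows to confirm the data loaded as expected
print(df.head())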

   Snapshot Year Latest Admission Type County of Indictment Gender  \
0           2023  NEW COURT COMMITMENT               ALBANY   MALE   
1           2023  NEW COURT COMMITMENT               ALBANY   MALE   
2           2023  NEW COURT COMMITMENT               ALBANY   MALE   
3           2023  NEW COURT COMMITMENT               ALBANY   MALE   
4           2023  NEW COURT COMMITMENT               ALBANY   MALE   

   Most Serious Crime  Current Age Housing Facility Facility Security Level  \
0  ATT C POS WEAP 2ND           19   WOODBOURNE SNU         MEDIUM SECURITY   
1    C POS WEAPON 2ND           19         FRANKLIN         MEDIUM SECURITY   
2  ATT C POS WEAP 2ND           19       WASHINGTON         MEDIUM SECURITY   
3  ATT C POS WEAP 2ND           19          WYOMING         MEDIUM SECURITY   
4      YO ATT ROBBERY           19           GREENE         MEDIUM SECURITY   

  Race/Ethnicity  
0          BLACK  
1          BLACK  
2          BLACK  
3          BLACK  
4        

## Data inspection
In this section, you will show enough of your data for a viewer to get a general sense of how the data is structured and any unique features of it.  Complete each of the indicated tasks in a Code cell, making sure to include a Markdown cell above each Code cell that explains what is being shown by the code.  
- Show 5 rows, selected at random, from the data set.
- Show each of the column names and their data types.
- Show any unique features of your chosen data set.

Feel free to add as many additional cells as you need to help explain the raw data.

In [10]:
df.sample(5)

        Snapshot Year Latest Admission Type County of Indictment Gender  \
398476           2014  NEW COURT COMMITMENT             RICHMOND   MALE
611053           2010  NEW COURT COMMITMENT          WESTCHESTER   MALE
573595           2011  NEW COURT COMMITMENT             NEW YORK   MALE
159871           2019  NEW COURT COMMITMENT             NEW YORK   MALE
 72763           2022  NEW COURT COMMITMENT                KINGS   MALE

        Most Serious Crime  Current Age Housing Facility Facility Security Level  \
398476          MURDER 1ST           46       SHAWANGUNK        MAXIMUM SECURITY
611053          MURDER 2ND           60      FIVE POINTS        MAXIMUM SECURITY
573595    MANSLAUGHTER 1ST           51      FIVE POINTS        MAXIMUM SECURITY
159871         ROBBERY 1ST           37  GREEN HAVEN ICP        MAXIMUM SECURITY
 72763            RAPE 1ST           61      MARCY OSOTP         MEDIUM SECURITY

       Race/Ethnicity
398476          WHITE
611053          BLACK
573595          BLACK
159871          BLACK
 72763       HISPANIC


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 784159 entries, 0 to 784158
Data columns (total 9 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   Snapshot Year            784159 non-null  int64 
 1   Latest Admission Type    784159 non-null  object
 2   County of Indictment     784159 non-null  object
 3   Gender                   784159 non-null  object
 4   Most Serious Crime       784159 non-null  object
 5   Current Age              784159 non-null  int64 
 6   Housing Facility         784159 non-null  object
 7   Facility Security Level  784159 non-null  object
 8   Race/Ethnicity           784159 non-null  object
dtypes: int64(2), object(7)
memory usage: 53.8+ MB


In [16]:
# Print the array of unique county names in the DataFrame
unique_counties = df['County of Indictment'].unique()
print(unique_counties)
unique_county_count = len(unique_counties)
# Print the number of unique counties
print("Number of unique counties:", unique_county_count)

['ALBANY' 'MONROE' 'ALLEGANY' 'BROOME' 'CATTARAUGUS' 'CAYUGA' 'BRONX'
 'CHAUTAUQUA' 'CHEMUNG' 'CHENANGO' 'CLINTON' 'COLUMBIA' 'CORTLAND'
 'DELAWARE' 'DUTCHESS' 'ERIE' 'ESSEX' 'FRANKLIN' 'FULTON' 'GENESEE'
 'GREENE' 'HAMILTON' 'HERKIMER' 'JEFFERSON' 'KINGS' 'LEWIS' 'LIVINGSTON'
 'MADISON' 'MONTGOMERY' 'NASSAU' 'NEW YORK' 'NIAGARA' 'ONEIDA' 'ONONDAGA'
 'ONTARIO' 'ORANGE' 'ORLEANS' 'OSWEGO' 'OTSEGO' 'PUTNAM' 'QUEENS'
 'RENSSELAER' 'RICHMOND' 'ROCKLAND' 'ST LAWRENCE' 'SARATOGA' 'SCHENECTADY'
 'SCHOHARIE' 'SCHUYLER' 'SENECA' 'STEUBEN' 'SUFFOLK' 'SULLIVAN' 'TIOGA'
 'TOMPKINS' 'ULSTER' 'WARREN' 'WASHINGTON' 'WAYNE' 'WESTCHESTER' 'WYOMING'
 'YATES' 'UNKNOWN']
Number of unique counties: 63


In [15]:
# Print the number of unique crimes
unique_crimes = df['Most Serious Crime'].unique()
unique_crime_count = len(unique_crimes)
print("Number of unique crimes:", unique_crime_count)

Number of unique crimes: 537
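
Beyond counting distinct values, a frequency table can show how rows are spread across a column's categories. The following is a minimal sketch on a few made-up rows that reuse column names from this data set; the toy values and the variable names (`toy`, `level_counts`, `county_count`, `unknown_rows`) are assumptions for illustration, not records or results from the real file.

```python
import pandas as pd

# Toy rows (made up for illustration) that reuse column names from the data set.
toy = pd.DataFrame({
    "Facility Security Level": [
        "MAXIMUM SECURITY", "MEDIUM SECURITY", "MAXIMUM SECURITY",
        "MEDIUM SECURITY", "MAXIMUM SECURITY",
    ],
    "County of Indictment": ["KINGS", "UNKNOWN", "ALBANY", "KINGS", "KINGS"],
})

# Count how many rows fall into each security level, most frequent first.
level_counts = toy["Facility Security Level"].value_counts()
print(level_counts)

# nunique() returns the number of distinct values directly.
county_count = toy["County of Indictment"].nunique()
print("Number of unique counties:", county_count)

# Count rows whose county is the placeholder value 'UNKNOWN'.
unknown_rows = (toy["County of Indictment"] == "UNKNOWN").sum()
print("Rows with UNKNOWN county:", unknown_rows)
```

On the real data, the same calls would be made on `df` instead of `toy`.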


## Data munging
Place your **data munging** code and documentation within this section.  
- Keep each of your Code cells short and focused on a single task.  
- Include a Markdown cell above each code cell that describes what task the code within the code cell is performing.
- Make as many code cells as you need to complete the munging - a few have been created for you to start with.
- Display 5 sample rows of the modified data after each transformation so a viewer can see how the data has changed.

**Note**: If you believe that your data set does not require any munging, please explain in detail.  Create Markdown cells that explain your thinking and create Code cells that show any specific structures of the data you refer to in your explanation.
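
As one possible starting point, the sketch below shows three small munging steps on made-up rows that reuse column names from this data set. The choice of steps (renaming columns, treating `UNKNOWN` as missing, converting a text column to the `category` dtype) is an assumption about what might be useful here, not a required approach, and the toy values are invented for illustration.

```python
import pandas as pd

# Toy rows (made up for illustration) that reuse column names from the data set.
toy = pd.DataFrame({
    "Snapshot Year": [2023, 2014, 2010, 2022],
    "County of Indictment": ["ALBANY", "RICHMOND", "UNKNOWN", "KINGS"],
    "Gender": ["MALE", "MALE", "MALE", "MALE"],
    "Current Age": [19, 46, 60, 61],
})

# Step 1: rename columns to snake_case so they are easier to type.
munged = toy.rename(columns=lambda name: name.lower().replace(" ", "_"))

# Step 2: treat the placeholder string 'UNKNOWN' as a missing value.
county = munged["county_of_indictment"]
munged["county_of_indictment"] = county.mask(county == "UNKNOWN")

# Step 3: store a repeated text label as the 'category' dtype to save memory.
munged["gender"] = munged["gender"].astype("category")

# Show the resulting dtypes and up to 5 sample rows after the transformations.
print(munged.dtypes)
print(munged.sample(min(5, len(munged)), random_state=0))
```

In the notebook itself, each step would go in its own Code cell with a Markdown explanation above it, as the instructions in this section describe.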

## Data analysis
Place your **data analysis** code and documentation within this section.
- Perform at least 5 different statistical or other analyses of different aspects of the data.
    - Your analyses must be specific and relevant to your chosen data set and show interesting aspects of it.
    - Include at least one analysis that groups rows by a shared attribute and computes a statistic for each group.
    - Sort the data in at least 1 of your analyses; note that sorting alone does not constitute an analysis.
- Keep each of your Code cells short and focused on a single task.
- Include a Markdown cell above each Code cell that describes what task the code within the Code cell is performing.
- Make as many code cells as you need to complete the analysis - a few have been created for you to start with.
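
The grouping and sorting requirements above can be sketched as follows. This is a minimal illustration on made-up rows that reuse two column names from this data set; the toy values and the name `age_by_level` are assumptions, and the statistics chosen (count, mean, median) are only one option.

```python
import pandas as pd

# Toy rows (made up for illustration) that reuse column names from the data set.
toy = pd.DataFrame({
    "Facility Security Level": [
        "MAXIMUM SECURITY", "MAXIMUM SECURITY", "MEDIUM SECURITY",
        "MEDIUM SECURITY", "MAXIMUM SECURITY",
    ],
    "Current Age": [46, 60, 19, 61, 37],
})

# Group rows by security level, summarize age within each group,
# then sort the groups by mean age from highest to lowest.
age_by_level = (
    toy.groupby("Facility Security Level")["Current Age"]
    .agg(["count", "mean", "median"])
    .sort_values("mean", ascending=False)
)
print(age_by_level)
```

The sort here only orders the grouped result; the per-group statistics are what make it an analysis.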

## Data visualization
In this section, you will create a few **visualizations** that show some of the insights you have gathered from this data.
- Create at least 5 different visualizations, where each visualization shows different insights into the data.
- Use at least 3 different visualization types (e.g. bar charts, line charts, stacked area charts, or pie charts).
- Create a Markdown cell and a Code cell for each, where you explain and show the visualizations, respectively.
- Create as many additional cells as you need to prepare the data for the visualizations.
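
As a rough illustration of using several visualization types, the sketch below draws a bar chart, a line chart, and a histogram side by side with the plotting methods built into `pandas`. The rows are made up and only reuse column names from this data set; the chart choices, titles, and variable names are assumptions, not a prescribed set.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy rows (made up for illustration) that reuse column names from the data set.
toy = pd.DataFrame({
    "Snapshot Year": [2010, 2010, 2011, 2014, 2014, 2014],
    "Facility Security Level": [
        "MAXIMUM SECURITY", "MEDIUM SECURITY", "MAXIMUM SECURITY",
        "MAXIMUM SECURITY", "MEDIUM SECURITY", "MAXIMUM SECURITY",
    ],
    "Current Age": [60, 34, 51, 46, 29, 41],
})

fig, (ax_bar, ax_line, ax_hist) = plt.subplots(1, 3, figsize=(15, 4))

# Bar chart: number of rows at each security level.
level_counts = toy["Facility Security Level"].value_counts()
level_counts.plot.bar(ax=ax_bar, title="Rows per security level")

# Line chart: number of rows in each snapshot year.
rows_per_year = toy.groupby("Snapshot Year").size()
rows_per_year.plot.line(ax=ax_line, marker="o", title="Rows per snapshot year")

# Histogram: distribution of ages across all rows.
toy["Current Age"].plot.hist(ax=ax_hist, bins=5, title="Age distribution")

# Adjust spacing so titles and tick labels do not overlap.
fig.tight_layout()
```

With `%matplotlib inline` active, the figure is rendered below the cell when the cell finishes running.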