# NY State Foster Care Data Analysis

What questions do I intend to answer with this data?
1. How many indicated CPS reports were there in NYS in the most recent year?
2. How does the frequency of each type of care vary from year to year?
3. Does the average number of children served in a county correlate with its number of indicated CPS reports?
4. Is the number of indicated CPS reports affected by the type of care chosen?
5. In which year did New York City have the most children in its care?
6. Is the number of indicated CPS reports affected by how many children are in a county's care?

In [1]:
#import packages
import pandas as pd
import csv
import seaborn as sns
import numpy as np

## Get the data!

For each county and year, this dataset shows the total number of admissions, discharges, and children in foster care, the days spent in each type of care, and the total number of indicated Child Protective Services (CPS) reports.

In [2]:
df = pd.read_csv('nys-foster.csv')
df.head(10)

Unnamed: 0,County,Year,Adoptive Home,Agency Operated Boarding Home,Approved Relative Home,Foster Boarding Home,Group Home,Group Residence,Institution,Supervised Independent Living,Other,Total Days In Care,Admissions,Discharges,Children In Care,Number of Children Served,Indicated CPS Reports
0,ALBANY,2011,0,911,6269,23187,5817,1133,24034,2149,1106,64606,125,178.0,167,335,750.0
1,ALBANY,2010,0,978,9197,32014,7384,1683,24343,1384,1377,78360,154,164.0,206,376,801.0
2,ALBANY,2009,0,914,11365,41222,6601,1761,27456,1799,1604,92722,167,219.0,219,428,793.0
3,ALBANY,2008,0,787,14728,43758,6918,3280,27883,1637,3101,102092,169,209.0,259,464,1037.0
4,ALBANY,2007,0,1722,10027,51877,5175,3702,31763,632,2199,107097,183,185.0,295,477,1177.0
5,ALBANY,2006,0,2125,6029,54178,5569,2353,38291,0,2699,111244,243,267.0,285,542,1235.0
6,ALBANY,2005,221,2777,3564,58995,9446,1914,40386,0,3938,121241,263,326.0,289,621,1339.0
7,ALBANY,2004,366,4350,2864,71173,12717,3023,41139,0,4429,140061,288,327.0,356,692,1254.0
8,ALBANY,2003,365,4773,1972,83089,11422,3136,50523,0,3377,158657,288,345.0,396,739,1286.0
9,ALBANY,2002,0,4726,2505,105425,10757,2661,55245,0,2906,184225,286,386.0,450,832,1233.0


In [3]:
#get column names
column_names = df.columns
print(column_names)

Index([u'County', u'Year', u'Adoptive Home', u'Agency Operated Boarding Home',
       u' Approved Relative Home', u' Foster Boarding Home', u' Group Home',
       u' Group Residence', u'Institution', u' Supervised Independent Living',
       u'Other', u'Total Days In Care', u'Admissions', u'Discharges',
       u'Children In Care', u'Number of Children Served',
       u'Indicated CPS Reports'],
      dtype='object')
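Several of the column names printed above start with a space (for example `' Group Home'`), so a lookup such as `df['Group Home']` would fail with a `KeyError`. Below is a minimal sketch of one way to normalize them. It uses a small stand-in frame named `sample` (an assumption for illustration, filled with two values from the ALBANY 2011 row) rather than the real file.

```python
import pandas as pd

# Hypothetical stand-in frame with the same kind of padded column names
sample = pd.DataFrame({
    'County': ['ALBANY'],
    ' Group Home': [5817],
    ' Group Residence': [1133],
})

# Strip leading/trailing whitespace from every column label
sample.columns = sample.columns.str.strip()

print(list(sample.columns))
```

The same one-line assignment could be applied to `df.columns` right after reading the CSV, so that later cells can use the clean names.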


In [4]:
#get column datatypes
df.dtypes

County                             object
Year                                int64
Adoptive Home                       int64
Agency Operated Boarding Home       int64
 Approved Relative Home             int64
 Foster Boarding Home               int64
 Group Home                         int64
 Group Residence                    int64
Institution                         int64
 Supervised Independent Living      int64
Other                               int64
Total Days In Care                  int64
Admissions                          int64
Discharges                        float64
Children In Care                    int64
Number of Children Served           int64
Indicated CPS Reports             float64
dtype: object

In [5]:
#check whether each column holds only unique values
for i in column_names:
    print('{} is unique: {}'.format(i, df[i].is_unique))

County is unique: False
Year is unique: False
Adoptive Home is unique: False
Agency Operated Boarding Home is unique: False
 Approved Relative Home is unique: False
 Foster Boarding Home is unique: False
 Group Home is unique: False
 Group Residence is unique: False
Institution is unique: False
 Supervised Independent Living is unique: False
Other is unique: False
Total Days In Care is unique: False
Admissions is unique: False
Discharges is unique: False
Children In Care is unique: False
Number of Children Served is unique: False
Indicated CPS Reports is unique: False


In [6]:
#check index values
df.index.values

array([   0,    1,    2, ..., 1412, 1413, 1414])

In [7]:
#looking for duplicate rows (there are none)
duplicatesdf = df[df.duplicated()]
print(duplicatesdf)

Empty DataFrame
Columns: [County, Year, Adoptive Home, Agency Operated Boarding Home,  Approved Relative Home,  Foster Boarding Home,  Group Home,  Group Residence, Institution,  Supervised Independent Living, Other, Total Days In Care, Admissions, Discharges, Children In Care, Number of Children Served, Indicated CPS Reports]
Index: []


## Cleaning Data for Analysis

In [8]:
#drop rows with nans
df = df.dropna()

In [9]:
#checking if any columns contain na values
df.isna().any()

County                            False
Year                              False
Adoptive Home                     False
Agency Operated Boarding Home     False
 Approved Relative Home           False
 Foster Boarding Home             False
 Group Home                       False
 Group Residence                  False
Institution                       False
 Supervised Independent Living    False
Other                             False
Total Days In Care                False
Admissions                        False
Discharges                        False
Children In Care                  False
Number of Children Served         False
Indicated CPS Reports             False
dtype: bool
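`dropna()` discards every row that has a missing value in any column, so it can be useful to record how many rows were lost and which columns caused it. Here is a sketch on a stand-in frame; the `sample` frame and its missing values are hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frame with two incomplete rows
sample = pd.DataFrame({
    'County': ['A', 'B', 'C'],
    'Discharges': [178.0, np.nan, 219.0],
    'Indicated CPS Reports': [750.0, 801.0, np.nan],
})

rows_before = len(sample)

# Number of missing values in each column, counted before dropping anything
missing_per_column = sample.isna().sum()

# Keep only the rows with no missing values
cleaned = sample.dropna()
rows_dropped = rows_before - len(cleaned)

print(missing_per_column)
print('rows dropped:', rows_dropped)
```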

In [10]:
#change floats to ints (safe now that rows with NaN have been dropped)
df["Discharges"] = df["Discharges"].astype(int)
df["Indicated CPS Reports"] = df["Indicated CPS Reports"].astype(int)

In [11]:
df.head(10)

Unnamed: 0,County,Year,Adoptive Home,Agency Operated Boarding Home,Approved Relative Home,Foster Boarding Home,Group Home,Group Residence,Institution,Supervised Independent Living,Other,Total Days In Care,Admissions,Discharges,Children In Care,Number of Children Served,Indicated CPS Reports
0,ALBANY,2011,0,911,6269,23187,5817,1133,24034,2149,1106,64606,125,178,167,335,750
1,ALBANY,2010,0,978,9197,32014,7384,1683,24343,1384,1377,78360,154,164,206,376,801
2,ALBANY,2009,0,914,11365,41222,6601,1761,27456,1799,1604,92722,167,219,219,428,793
3,ALBANY,2008,0,787,14728,43758,6918,3280,27883,1637,3101,102092,169,209,259,464,1037
4,ALBANY,2007,0,1722,10027,51877,5175,3702,31763,632,2199,107097,183,185,295,477,1177
5,ALBANY,2006,0,2125,6029,54178,5569,2353,38291,0,2699,111244,243,267,285,542,1235
6,ALBANY,2005,221,2777,3564,58995,9446,1914,40386,0,3938,121241,263,326,289,621,1339
7,ALBANY,2004,366,4350,2864,71173,12717,3023,41139,0,4429,140061,288,327,356,692,1254
8,ALBANY,2003,365,4773,1972,83089,11422,3136,50523,0,3377,158657,288,345,396,739,1286
9,ALBANY,2002,0,4726,2505,105425,10757,2661,55245,0,2906,184225,286,386,450,832,1233


In [12]:
#showing array of unique counties in NY
df.County.unique()

array(['ALBANY', 'ALLEGANY', 'BROOME', 'CATTARAUGUS', 'CAYUGA',
       'CHAUTAUQUA', 'CHEMUNG', 'CHENANGO', 'CLINTON', 'COLUMBIA',
       'CORTLAND', 'DELAWARE', 'DUTCHESS', 'ERIE', 'ESSEX', 'FRANKLIN',
       'FULTON', 'GENESEE', 'GREENE', 'HAMILTON', 'HERKIMER', 'JEFFERSON',
       'LEWIS', 'LIVINGSTON', 'MADISON', 'MONROE', 'MONTGOMERY', 'NASSAU',
       'NEW YORK CITY', 'NIAGARA', 'ONEIDA', 'ONONDAGA', 'ONTARIO',
       'ORANGE', 'ORLEANS', 'OSWEGO', 'OTSEGO', 'PUTNAM', 'RENSSELAER',
       'ROCKLAND', 'SARATOGA', 'SCHENECTADY', 'SCHOHARIE', 'SCHUYLER',
       'SENECA', 'ST. LAWRENCE', 'ST. REGIS MOHAWK', 'STEUBEN', 'SUFFOLK',
       'SULLIVAN', 'TIOGA', 'TOMPKINS', 'ULSTER', 'WARREN', 'WASHINGTON',
       'WAYNE', 'WESTCHESTER', 'WYOMING', 'YATES'], dtype=object)

In [13]:
#showing range of years displayed in this dataset
df.Year.unique()

array([2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001,
       2000, 1999, 1998, 1997, 1996, 1995, 1994, 2014, 2013, 2012, 2015,
       2017, 2016])
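The years above appear in the order they first occur in the file, not in sorted order. For question 1, the most recent year can be picked with `max()` and the statewide total obtained by summing over counties. The sketch below uses a stand-in frame: the two ALBANY rows come from the table shown earlier, while the `EXAMPLE` county and its counts are hypothetical.

```python
import pandas as pd

# Stand-in frame: ALBANY values are from df.head(10); EXAMPLE rows are made up
sample = pd.DataFrame({
    'County': ['ALBANY', 'ALBANY', 'EXAMPLE', 'EXAMPLE'],
    'Year': [2011, 2010, 2011, 2010],
    'Indicated CPS Reports': [750, 801, 100, 120],
})

# Most recent year present in the data
latest_year = sample['Year'].max()

# Sum the reports over all counties for that year only
is_latest = sample['Year'] == latest_year
latest_total = sample.loc[is_latest, 'Indicated CPS Reports'].sum()

print(latest_year, latest_total)
```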

## Descriptive Statistics of Indicated CPS Reports for NYS

What is the average number of indicated CPS reports per county per year in NYS?

In [41]:
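#collect the report counts as a NumPy array, one value per county-year row
cps_reports = np.array(df["Indicated CPS Reports"])
def get_mean(dataset):
    #floor division: returns the mean rounded down to a whole number
    return sum(dataset) // len(dataset)

get_mean(cps_reports)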

780
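The result above is a whole number because the division is carried out between integers, which drops any fractional part. The sketch below shows the difference between floor division and true division. It uses only the ten ALBANY values printed by `df.head(10)`, not the full dataset, so its numbers are not comparable to the statewide figure.

```python
import statistics

# Indicated CPS Reports for the ten ALBANY rows shown by df.head(10)
albany_reports = [750, 801, 793, 1037, 1177, 1235, 1339, 1254, 1286, 1233]

# Floor division discards the fractional part of the mean
floor_mean = sum(albany_reports) // len(albany_reports)

# statistics.mean keeps the fractional part
true_mean = statistics.mean(albany_reports)

print(floor_mean)  # 1090
print(true_mean)   # 1090.5
```

On the full column, `df["Indicated CPS Reports"].mean()` would give the unrounded value directly.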

What is the median number of indicated CPS reports per county per year in NYS?

In [40]:
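def compute_median(dataset):
    #expects dataset to be sorted already; returns its middle value
    count = len(dataset)

    if count < 1:
        return None
    mid = count // 2
    if count % 2 == 1:
        return dataset[mid]
    else:
        #even count: average the two middle values (floor division)
        return (dataset[mid - 1] + dataset[mid]) // 2

#note: cps_reports is in file order here, not sorted by value
compute_median(cps_reports)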

263
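A median is the middle of the values in sorted order, so the input has to be sorted before indexing into it; the middle of an unsorted array is generally a different number. The sketch below uses the ten ALBANY values from `df.head(10)`, not the full dataset. The helper `median_sorted` is a hypothetical name, and the result is cross-checked against the standard library's `statistics.median`.

```python
import statistics

# Indicated CPS Reports for the ten ALBANY rows, in file order
albany_reports = [750, 801, 793, 1037, 1177, 1235, 1339, 1254, 1286, 1233]

def median_sorted(values):
    """Return the median of values, sorting a copy first."""
    ordered = sorted(values)
    count = len(ordered)
    if count == 0:
        return None
    mid = count // 2
    if count % 2 == 1:
        return ordered[mid]
    # Even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

sorted_median = median_sorted(albany_reports)

# Middle of the unsorted list, for comparison
mid = len(albany_reports) // 2
unsorted_middle = (albany_reports[mid - 1] + albany_reports[mid]) / 2

print(sorted_median)                      # 1205.0
print(unsorted_middle)                    # 1206.0
print(statistics.median(albany_reports))  # 1205.0
```

On the full column, `np.median(cps_reports)` or `df["Indicated CPS Reports"].median()` would handle the sorting internally.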