### **Dataset Cleaning - Emily's Sets**

In [1]:
# Imports
import pandas as pd
import numpy as np
import os

In [2]:
os.getcwd()

'C:\\Users\\emily\\Git_Stuff\\General_Assembly\\04_Projects\\project-4\\dsb318-group4\\01_notebooks_eks'

In [3]:
os.chdir('./../02_data_eks')
os.listdir()

['00_ignore',
 '01_original_datasets',
 '02_cleaned_datasets',
 '03_output',
 '04_data-dictionaries',
 'ca_dropout_and_predictors_v6_eks.csv']

#### Abortion Costs

In [4]:
# Read it in
abortion_costs = pd.read_csv('./01_original_datasets/abortions_funded_costs.csv')

# Rename columns
col_names = [col.lower().replace(' ', '_') for col in abortion_costs.columns]
abortion_costs.columns = col_names

# Get the dimensions
abortion_costs.shape

(815, 5)

In [5]:
# 2015 only
abortion_costs = abortion_costs[abortion_costs['calendar_year']==2015]

# Real counties only
abortion_costs = abortion_costs[abortion_costs['county']!='Unknown']
abortion_costs = abortion_costs[abortion_costs['county']!='Total'] 

# Get the shape
abortion_costs.shape

(114, 5)

In [6]:
# Take a look at the column counts - this will help me identify unnecessary features
for i in list(abortion_costs.columns):
  print('='*20)
  print(i)
  print(abortion_costs[i].nunique())

calendar_year
1
delivery_system
2
county
58
total_expenditures
55
date_of_data
1


In [7]:
# Look for missing data
abortion_costs.isna().sum()

calendar_year          0
delivery_system        0
county                 0
total_expenditures    58
date_of_data           0
dtype: int64

Only `total_expenditures` has NAs, and the count (58) is exactly the number of counties.

In [8]:
# Take a look at the column counts again, this time with NA rows dropped
for i in list(abortion_costs.dropna().columns):
  print('='*20)
  print(i)
  print(abortion_costs.dropna()[i].nunique())

calendar_year
1
delivery_system
1
county
56
total_expenditures
55
date_of_data
1


With the NA rows dropped, `delivery_system` is down to a single level, so it looks like an entire category of `delivery_system` has no expenditure data.

In [9]:
# Check that hypothesis
abortion_costs['total_expenditures'].isna().groupby(abortion_costs['delivery_system']).sum()

delivery_system
Fee-for-Service     0
Managed Care       58
Name: total_expenditures, dtype: int64

In [10]:
# Same check, this time counting NA and non-NA rows within each delivery system
abortion_costs['total_expenditures'].isna().groupby(abortion_costs['delivery_system']).value_counts(dropna = False)

delivery_system  total_expenditures
Fee-for-Service  False                 56
Managed Care     True                  58
Name: count, dtype: int64

Two counties that appear under Managed Care (where every value is NA) have no rows at all under Fee-for-Service. They're not suppressed for small numbers or anything; they just aren't present in that part of the dataset.
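
One way to identify those two counties is a set difference between the county names listed under each delivery system. The following is a minimal sketch on a made-up frame: the county names and values are placeholders, not the real data, and only the column names match `abortion_costs`. On the real data, it would need to run before the `dropna` call in the next cell.

```python
import numpy as np
import pandas as pd

# Toy stand-in for abortion_costs (hypothetical values, not the real data)
toy = pd.DataFrame({
    'delivery_system': ['Fee-for-Service'] * 2 + ['Managed Care'] * 3,
    'county': ['Alpha', 'Beta', 'Alpha', 'Beta', 'Gamma'],
    'total_expenditures': [100.0, 250.0, np.nan, np.nan, np.nan],
})

# County names listed under each delivery system
ffs = set(toy.loc[toy['delivery_system'] == 'Fee-for-Service', 'county'])
mc = set(toy.loc[toy['delivery_system'] == 'Managed Care', 'county'])

# Counties with a Managed Care row but no Fee-for-Service row
missing_from_ffs = sorted(mc - ffs)
print(missing_from_ffs)
```

Swapping `toy` for `abortion_costs` should list the counties that will be absent from the cleaned file.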

In [11]:
# Remove the unnecessary category (rows)
abortion_costs.dropna(inplace = True)

# Drop unnecessary columns
drop_cols = ['calendar_year', 'delivery_system', 'date_of_data']
abortion_costs.drop(columns = drop_cols, inplace = True)

# Get the shape
abortion_costs.shape

(56, 2)

In [12]:
# Save to csv 
abortion_costs.to_csv('./03_output/abortion_costs_eks.csv', index = False)

The datasets from my initial cleaning that went on to be part of the final analysis can be found in `./02_cleaned_datasets`.  I have edited this code to write to `./03_output` instead, so that those files aren't overwritten every time the notebook is run.  The same applies throughout this notebook.  Curious readers, or at least those using Git Bash on a PC, can run the following command in their terminal to confirm that the files are the same.

```diff 02_data/03_output/<dataset name>_eks.csv 02_data/02_cleaned_datasets/<dataset name>_simplified_eks.csv```
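
For readers without `diff` available, a similar check can be sketched in Python with the standard library's `filecmp` module. The helper name `same_file` and the throwaway demo files are assumptions for illustration only; this compares bytes, so it is stricter than comparing the parsed tables.

```python
import filecmp
import os
import tempfile


def same_file(path_a, path_b):
    """Return True if the two files are byte-for-byte identical."""
    # shallow=False forces a content comparison instead of trusting os.stat()
    return filecmp.cmp(path_a, path_b, shallow=False)


# Demo on throwaway files (placeholder contents, not the real datasets)
with tempfile.TemporaryDirectory() as tmp:
    paths = {}
    for name, text in [('a.csv', 'county,total\nAlpha,1\n'),
                       ('b.csv', 'county,total\nAlpha,1\n'),
                       ('c.csv', 'county,total\nAlpha,2\n')]:
        paths[name] = os.path.join(tmp, name)
        with open(paths[name], 'w', newline='') as f:
            f.write(text)

    identical = same_file(paths['a.csv'], paths['b.csv'])
    different = same_file(paths['a.csv'], paths['c.csv'])

print(identical, different)
```

For the real files, `same_file` would be called with the two paths from the `diff` command above, with the dataset name filled in.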

#### Abortion Counts

#### Daycare Slots

#### E-cigarettes

#### Poverty Rate

#### Suicide Rate

#### Graduation Cohort & Dropout Rate

#### Unemployment Rate

#### Dimensionality Reduction