# COGS 108 - Data Checkpoint

# Names

- Aasem Fituri
- Casey Hild
- Carlos van der Ley
- Jeremy Quinto

<a id='research_question'></a>
# Research Question

Can we find a correlation between cancer rates and socioeconomic status? More specifically, how do income, education, and employment status affect cancer rates across the United States?

# Dataset(s)

<!-- *Fill in your dataset information here* -->

<!-- (Copy this information for each dataset) -->
- Dataset Name: cancer_reg
- Link to the dataset: https://www.kaggle.com/datasets/thedevastator/uncovering-trends-in-health-outcomes-and-socioec?select=cancer_reg.csv
- Number of observations: 3047

<!-- 1-2 sentences describing each dataset.  -->
This dataset, made accessible by data.world (link), contains a large amount of health-related and socioeconomic data collected from numerous sources, including the American Community Survey, clinicaltrials.gov, and cancer.gov, covering a range of US counties. It is in CSV format with 34 columns, and the data span 2010 to 2016.

<!-- If you plan to use multiple datasets, add 1-2 sentences about how you plan to combine these datasets. -->

# Setup

In [42]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Reading in the dataset
data_cancer = pd.read_csv('data/cancer_reg.csv')
data_cancer.shape

(3047, 34)

# Data Cleaning

In [13]:
# Adding County, State columns from info in the 'geography' column
data_cancer[['County', 'State']] = data_cancer['geography'].str.split(',', n=1, expand=True).astype('string')
# Remove the space left over after the comma (e.g. ' Washington' -> 'Washington')
data_cancer['State'] = data_cancer['State'].str.strip()
data_cancer.head()

,index,avganncount,avgdeathsperyear,target_deathrate,incidencerate,medincome,popest2015,povertypercent,studypercap,binnedinc,...,pctpubliccoverage,pctpubliccoveragealone,pctwhite,pctblack,pctasian,pctotherrace,pctmarriedhouseholds,birthrate,County,State
0,0,1397.0,469,164.9,489.8,61898,260131,11.2,499.748204,"(61494.5, 125635]",...,32.9,14.0,81.780529,2.594728,4.821857,1.843479,52.856076,6.118831,Kitsap County,Washington
1,1,173.0,70,161.3,411.6,48127,43269,18.6,23.111234,"(48021.6, 51046.4]",...,31.1,15.3,89.228509,0.969102,2.246233,3.741352,45.3725,4.333096,Kittitas County,Washington
2,2,102.0,50,174.7,349.7,49348,21026,14.6,47.560164,"(48021.6, 51046.4]",...,42.1,21.1,90.92219,0.739673,0.465898,2.747358,54.444868,3.729488,Klickitat County,Washington
3,3,427.0,202,194.8,430.4,44243,75882,17.1,342.637253,"(42724.4, 45201]",...,45.3,25.0,91.744686,0.782626,1.161359,1.362643,51.021514,4.603841,Lewis County,Washington
4,4,57.0,26,144.4,350.1,49955,10321,12.5,0.0,"(48021.6, 51046.4]",...,44.0,22.7,94.104024,0.270192,0.66583,0.492135,54.02746,6.796657,Lincoln County,Washington
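
The split can be checked on a few sample strings before trusting it on the full file. The sketch below assumes every 'geography' value has the form 'County, State', as in the rows shown above; `sample` and `parts` are hypothetical names, and the small frame stands in for the real dataset.

```python
import pandas as pd

# Small stand-in for the 'geography' column (values mirror rows shown in this notebook)
sample = pd.DataFrame({
    'geography': ['Kitsap County, Washington',
                  'Kittitas County, Washington',
                  'Elmore County, Alabama']
})

# Split once on the first comma, then strip surrounding whitespace from both parts
parts = sample['geography'].str.split(',', n=1, expand=True)
sample['County'] = parts[0].str.strip()
sample['State'] = parts[1].str.strip()

# A value without a comma would leave State missing, so count those rows
print(sample[['County', 'State']])
print('Rows without a state:', sample['State'].isna().sum())
```

Stripping whitespace, rather than dropping the first character, keeps the result correct even if a value has no space after the comma.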


In [54]:
# Assigning a Region to each County based on its state
West = ['California', 'Nevada', 'Hawaii']
NorthWest = ['Alaska', 'Washington', 'Oregon', 'Idaho', 'Montana', 'Wyoming']
SouthWest = ['Utah', 'Colorado', 'Arizona', 'New Mexico', 'Texas', 'Oklahoma']
MidWest = ['North Dakota', 'South Dakota', 'Nebraska', 'Kansas', 'Minnesota', 'Iowa', 'Missouri', 'Wisconsin', 'Illinois', 'Indiana', 'Michigan', 'Kentucky', 'Ohio']
SouthEast = ['Arkansas', 'Louisiana', 'Mississippi', 'Tennessee', 'North Carolina', 'South Carolina', 'Georgia', 'Alabama', 'Florida']
MidAtlantic = ['District of Columbia', 'Virginia', 'West Virginia', 'Delaware', 'Maryland', 'Pennsylvania', 'New Jersey', 'Connecticut', 'New York']
# Connecticut is listed only under MidAtlantic, the region it was already matched to first
NorthEast = ['Rhode Island', 'Massachusetts', 'Vermont', 'New Hampshire', 'Maine']
# Length of all lists combines to 51 (50 states plus the District of Columbia)
print(len(West) + len(NorthWest) + len(SouthWest) + len(MidWest) + len(SouthEast) + len(MidAtlantic) + len(NorthEast))

# Create the 'Region' column
data_cancer['Region'] = data_cancer['State'].apply(lambda x: 'West' if x in West else
                                                         'North West' if x in NorthWest else
                                                         'South West' if x in SouthWest else
                                                         'Mid West' if x in MidWest else
                                                         'South East' if x in SouthEast else
                                                         'Mid Atlantic' if x in MidAtlantic else
                                                         'North East' if x in NorthEast else '')

pd.set_option('display.max_rows', 10)
data_cancer.sort_values('State')

51


,index,avganncount,avgdeathsperyear,target_deathrate,incidencerate,medincome,popest2015,povertypercent,studypercap,binnedinc,...,pctpubliccoveragealone,pctwhite,pctblack,pctasian,pctotherrace,pctmarriedhouseholds,birthrate,County,State,Region
1063,1063,406.0,158,193.8,482.4,54298,81468,14.4,36.824275,"(51046.4, 54545.6]",...,14.4,75.516016,21.218875,0.516326,0.678529,54.326202,5.800147,Elmore County,Alabama,South East
2684,2684,291.0,120,175.4,429.9,45567,57673,17.5,0.000000,"(45201, 48021.6]",...,19.0,95.097903,1.531797,0.143823,0.961705,59.439854,6.894123,Blount County,Alabama,South East
2674,2674,198.0,78,193.3,487.4,41574,33115,16.6,0.000000,"(40362.7, 42724.4]",...,18.9,77.347704,11.457155,0.110165,0.059549,52.382359,4.830100,Lawrence County,Alabama,South East
2663,2663,529.0,238,194.4,449.5,41324,92596,18.7,21.599205,"(40362.7, 42724.4]",...,17.1,86.926469,9.962582,0.724630,0.219977,50.137263,5.109790,Lauderdale County,Alabama,South East
2653,2653,100.0,43,202.2,491.6,34553,13886,20.6,0.000000,"(34218.1, 37413.8]",...,23.1,87.384136,11.073374,0.021227,0.629732,55.539264,7.489134,Lamar County,Alabama,South East
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1186,1186,54.0,20,141.5,416.9,49713,8812,13.0,0.000000,"(48021.6, 51046.4]",...,16.4,94.251429,0.034286,0.434286,2.331429,48.430016,9.708738,Platte County,Wyoming,North West
1187,1187,173.0,65,168.7,458.1,54716,30009,9.9,1499.550135,"(54545.6, 61494.5]",...,12.7,93.671397,0.810411,0.763333,1.688076,53.283511,4.312373,Sheridan County,Wyoming,North West
1188,1188,33.0,11,126.9,352.0,77222,9899,6.8,0.000000,"(61494.5, 125635]",...,8.6,95.690422,0.000000,0.523871,0.118612,64.532156,5.148130,Sublette County,Wyoming,North West
1181,1181,435.0,159,156.1,438.5,57192,97121,10.8,175.039384,"(54545.6, 61494.5]",...,12.2,87.859291,2.909956,1.165240,3.984030,51.758996,5.053669,Laramie County,Wyoming,North West
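
Because the region lists are typed by hand, a length sum alone cannot show whether a state is missing or listed twice. A minimal sketch of a duplicate check follows; the `regions` dictionary and its shortened contents are illustrative only, with 'Indiana' repeated on purpose, and are not the full lists used in this notebook.

```python
from collections import Counter

# Hypothetical, shortened region lists; 'Indiana' is repeated on purpose
regions = {
    'West': ['California', 'Nevada', 'Hawaii'],
    'Mid West': ['Illinois', 'Indiana', 'Michigan', 'Indiana', 'Ohio'],
    'North East': ['Vermont', 'Maine'],
}

# Count how many times each state appears across all lists
counts = Counter(state for states in regions.values() for state in states)
duplicates = sorted(state for state, n in counts.items() if n > 1)

print('Total entries:', sum(counts.values()))
print('Unique states:', len(counts))
print('Listed more than once:', duplicates)
```

If the total number of entries and the number of unique states differ, at least one state appears more than once, and `duplicates` names it.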


In [53]:
data_cancer.groupby('Region').count()

Region,index,avganncount,avgdeathsperyear,target_deathrate,incidencerate,medincome,popest2015,povertypercent,studypercap,binnedinc,...,pctpubliccoverage,pctpubliccoveragealone,pctwhite,pctblack,pctasian,pctotherrace,pctmarriedhouseholds,birthrate,County,State
Mid Atlantic,366,366,366,366,366,366,366,366,366,366,...,366,366,366,366,366,366,366,366,366,366
Mid West,1149,1149,1149,1149,1149,1149,1149,1149,1149,1149,...,1149,1149,1149,1149,1149,1149,1149,1149,1149,1149
North East,59,59,59,59,59,59,59,59,59,59,...,59,59,59,59,59,59,59,59,59,59
North West,206,206,206,206,206,206,206,206,206,206,...,206,206,206,206,206,206,206,206,206,206
South East,745,745,745,745,745,745,745,745,745,745,...,745,745,745,745,745,745,745,745,745,745
South West,444,444,444,444,444,444,444,444,444,444,...,444,444,444,444,444,444,444,444,444,444
West,78,78,78,78,78,78,78,78,78,78,...,78,78,78,78,78,78,78,78,78,78
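
`count()` reports the number of non-null values in every column, which is why each row above repeats the same number across columns. When only the number of counties per region is needed, `groupby(...).size()` returns one value per group. A sketch on a small made-up frame (`toy` and `region_sizes` are hypothetical names; the region labels follow the ones used above):

```python
import pandas as pd

# Small stand-in frame with the County, State, and Region columns
toy = pd.DataFrame({
    'County': ['Kitsap County', 'Lewis County', 'Elmore County', 'Platte County'],
    'State': ['Washington', 'Washington', 'Alabama', 'Wyoming'],
    'Region': ['North West', 'North West', 'South East', 'North West'],
})

# One number per region: how many rows (counties) fall in it
region_sizes = toy.groupby('Region').size()
print(region_sizes)

# groupby drops rows whose Region is missing (NaN) by default,
# so a shortfall here would signal rows without a region
print('All rows accounted for:', region_sizes.sum() == len(toy))
```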


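We split the 'geography' column into separate 'County' and 'State' columns, then assigned each county a 'Region' (West, North West, South West, Mid West, South East, Mid Atlantic, or North East) based on its state. The per-region counts above sum to 3,047, so every observation received a region.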
