# COGS 108 - Data Checkpoint

# Names

- Aasem Fituri
- Casey Hild
- Carlos van der Ley
- Jeremy Quinto

<a id='research_question'></a>
# Research Question

Can we find a correlation between cancer rates and socioeconomic status? More specifically, how do income, education, and employment status affect cancer rates across the United States?

# Dataset(s)

- Dataset Name: cancer_reg
- Link to the dataset: https://www.kaggle.com/datasets/thedevastator/uncovering-trends-in-health-outcomes-and-socioec?select=cancer_reg.csv
- Number of observations: 3047

This dataset is made accessible by [data.world](https://data.world/exercises) and contains a large amount of health-related and socioeconomic data collected from numerous sources, including the American Community Survey, clinicaltrials.gov, and cancer.gov, covering a range of US counties. Its 34 variables include the average annual count of cancer cases, the average deaths per year, and the target death rate. The data span 2010 to 2016 and are in CSV format.


# Setup

In [124]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Reading in the dataset
data_cancer = pd.read_csv('data/cancer_reg.csv')
data_cancer.head()

,index,avganncount,avgdeathsperyear,target_deathrate,incidencerate,medincome,popest2015,povertypercent,studypercap,binnedinc,...,pctprivatecoveragealone,pctempprivcoverage,pctpubliccoverage,pctpubliccoveragealone,pctwhite,pctblack,pctasian,pctotherrace,pctmarriedhouseholds,birthrate
0,0,1397.0,469,164.9,489.8,61898,260131,11.2,499.748204,"(61494.5, 125635]",...,,41.6,32.9,14.0,81.780529,2.594728,4.821857,1.843479,52.856076,6.118831
1,1,173.0,70,161.3,411.6,48127,43269,18.6,23.111234,"(48021.6, 51046.4]",...,53.8,43.6,31.1,15.3,89.228509,0.969102,2.246233,3.741352,45.3725,4.333096
2,2,102.0,50,174.7,349.7,49348,21026,14.6,47.560164,"(48021.6, 51046.4]",...,43.5,34.9,42.1,21.1,90.92219,0.739673,0.465898,2.747358,54.444868,3.729488
3,3,427.0,202,194.8,430.4,44243,75882,17.1,342.637253,"(42724.4, 45201]",...,40.3,35.0,45.3,25.0,91.744686,0.782626,1.161359,1.362643,51.021514,4.603841
4,4,57.0,26,144.4,350.1,49955,10321,12.5,0.0,"(48021.6, 51046.4]",...,43.9,35.1,44.0,22.7,94.104024,0.270192,0.66583,0.492135,54.02746,6.796657


In [125]:
data_cancer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3047 entries, 0 to 3046
Data columns (total 34 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   index                    3047 non-null   int64  
 1   avganncount              3047 non-null   float64
 2   avgdeathsperyear         3047 non-null   int64  
 3   target_deathrate         3047 non-null   float64
 4   incidencerate            3047 non-null   float64
 5   medincome                3047 non-null   int64  
 6   popest2015               3047 non-null   int64  
 7   povertypercent           3047 non-null   float64
 8   studypercap              3047 non-null   float64
 9   binnedinc                3047 non-null   object 
 10  medianage                3047 non-null   float64
 11  medianagemale            3047 non-null   float64
 12  medianagefemale          3047 non-null   float64
 13  geography                3047 non-null   object 
 14  percentmarried          
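
The first row of the `head()` output has an empty `pctprivatecoveragealone` value, so at least one column contains missing data. A minimal sketch for counting missing values per column follows. The three-row frame is a stand-in built from values shown in the `head()` output (the real check would run on `data_cancer`), and `sample` and `missing_counts` are illustrative names.

```python
import numpy as np
import pandas as pd

# Stand-in for data_cancer: three rows copied from the head() output.
sample = pd.DataFrame({
    'avganncount': [1397.0, 173.0, 102.0],
    'pctprivatecoveragealone': [np.nan, 53.8, 43.5],
    'pctempprivcoverage': [41.6, 43.6, 34.9],
})

# Count missing values per column and keep only the columns that have any.
missing_counts = sample.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```

Running the same two lines on the full DataFrame would list every column that needs imputation or removal before analysis.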

# Data Cleaning

In [126]:
# Convert column names to lowercase
data_cancer.columns = data_cancer.columns.str.lower()

In [127]:
# Drop the 'index' column: pandas already provides a default index
data_cancer.drop('index', axis=1, inplace=True)

In [130]:
# Add 'county' and 'state' columns from the 'geography' column
data_cancer[['county', 'state']] = data_cancer['geography'].str.split(',', n=1, expand=True).astype('string')
# Trim the space left after the comma
data_cancer['state'] = data_cancer['state'].str.strip()

# moving 'county' and 'state' to the front of the list
cols = list(data_cancer)
cols.insert(0, cols.pop(cols.index('county')))
cols.insert(1, cols.pop(cols.index('state')))

# using loc to reorder
data_cancer = data_cancer.loc[:, cols]
data_cancer.head()

,county,state,avganncount,avgdeathsperyear,target_deathrate,incidencerate,medincome,popest2015,povertypercent,studypercap,...,pctprivatecoveragealone,pctempprivcoverage,pctpubliccoverage,pctpubliccoveragealone,pctwhite,pctblack,pctasian,pctotherrace,pctmarriedhouseholds,birthrate
0,Kitsap County,Washington,1397.0,469,164.9,489.8,61898,260131,11.2,499.748204,...,,41.6,32.9,14.0,81.780529,2.594728,4.821857,1.843479,52.856076,6.118831
1,Kittitas County,Washington,173.0,70,161.3,411.6,48127,43269,18.6,23.111234,...,53.8,43.6,31.1,15.3,89.228509,0.969102,2.246233,3.741352,45.3725,4.333096
2,Klickitat County,Washington,102.0,50,174.7,349.7,49348,21026,14.6,47.560164,...,43.5,34.9,42.1,21.1,90.92219,0.739673,0.465898,2.747358,54.444868,3.729488
3,Lewis County,Washington,427.0,202,194.8,430.4,44243,75882,17.1,342.637253,...,40.3,35.0,45.3,25.0,91.744686,0.782626,1.161359,1.362643,51.021514,4.603841
4,Lincoln County,Washington,57.0,26,144.4,350.1,49955,10321,12.5,0.0,...,43.9,35.1,44.0,22.7,94.104024,0.270192,0.66583,0.492135,54.02746,6.796657
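
The split of `geography` into county and state can be checked on a couple of hand-written values. This is a small sketch, assuming the column follows a `'<county>, <state>'` pattern as the output above suggests; the `geography` Series here is a stand-in, not the real column.

```python
import pandas as pd

# Assumed 'geography' format: '<county>, <state>', inferred from the output above.
geography = pd.Series(['Kitsap County, Washington', 'Lewis County, Washington'])

# Split once on the first comma, then trim whitespace from both parts.
parts = geography.str.split(',', n=1, expand=True)
county = parts[0].str.strip()
state = parts[1].str.strip()
print(county.tolist(), state.tolist())
```

Using `n=1` keeps any later commas inside the state part, so a county name without extra commas always lands in the first column.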


In [131]:
# Assigning a Region to each County based on its state
West = ['California', 'Nevada', 'Hawaii']
NorthWest = ['Alaska', 'Washington', 'Oregon', 'Idaho', 'Montana', 'Wyoming']
SouthWest = ['Utah', 'Colorado', 'Arizona', 'New Mexico', 'Texas', 'Oklahoma']
MidWest = ['North Dakota', 'South Dakota', 'Nebraska', 'Kansas', 'Minnesota', 'Iowa', 'Missouri', 'Wisconsin', 'Illinois', 'Indiana', 'Michigan', 'Kentucky', 'Ohio']
SouthEast = ['Arkansas', 'Louisiana', 'Mississippi', 'Tennessee', 'North Carolina', 'South Carolina', 'Georgia', 'Alabama', 'Florida']
# Connecticut is grouped with the Mid Atlantic states here, next to New York and New Jersey
MidAtlantic = ['District of Columbia', 'Virginia', 'West Virginia', 'Delaware', 'Maryland', 'Pennsylvania', 'New Jersey', 'Connecticut', 'New York']
NorthEast = ['Rhode Island', 'Massachusetts', 'Vermont', 'New Hampshire', 'Maine']
# Sanity check: the lists should total 51 (50 states plus the District of Columbia)
print(len(West) + len(NorthWest) + len(SouthWest) + len(MidWest) + len(SouthEast) + len(MidAtlantic) + len(NorthEast))

# Create the 'region' column
data_cancer['region'] = data_cancer['state'].apply(lambda x: 'West' if x in West else
                                                         'North West' if x in NorthWest else
                                                         'South West' if x in SouthWest else
                                                         'Mid West' if x in MidWest else
                                                         'South East' if x in SouthEast else
                                                         'Mid Atlantic' if x in MidAtlantic else
                                                         'North East' if x in NorthEast else '')

pd.set_option('display.max_rows', 10)
data_cancer.sort_values('state')

51


,county,state,avganncount,avgdeathsperyear,target_deathrate,incidencerate,medincome,popest2015,povertypercent,studypercap,...,pctempprivcoverage,pctpubliccoverage,pctpubliccoveragealone,pctwhite,pctblack,pctasian,pctotherrace,pctmarriedhouseholds,birthrate,region
1063,Elmore County,Alabama,406.0,158,193.8,482.4,54298,81468,14.4,36.824275,...,48.0,31.3,14.4,75.516016,21.218875,0.516326,0.678529,54.326202,5.800147,South East
2684,Blount County,Alabama,291.0,120,175.4,429.9,45567,57673,17.5,0.000000,...,44.7,35.9,19.0,95.097903,1.531797,0.143823,0.961705,59.439854,6.894123,South East
2674,Lawrence County,Alabama,198.0,78,193.3,487.4,41574,33115,16.6,0.000000,...,43.8,36.4,18.9,77.347704,11.457155,0.110165,0.059549,52.382359,4.830100,South East
2663,Lauderdale County,Alabama,529.0,238,194.4,449.5,41324,92596,18.7,21.599205,...,46.0,34.1,17.1,86.926469,9.962582,0.724630,0.219977,50.137263,5.109790,South East
2653,Lamar County,Alabama,100.0,43,202.2,491.6,34553,13886,20.6,0.000000,...,34.6,45.3,23.1,87.384136,11.073374,0.021227,0.629732,55.539264,7.489134,South East
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1186,Platte County,Wyoming,54.0,20,141.5,416.9,49713,8812,13.0,0.000000,...,37.2,37.0,16.4,94.251429,0.034286,0.434286,2.331429,48.430016,9.708738,North West
1187,Sheridan County,Wyoming,173.0,65,168.7,458.1,54716,30009,9.9,1499.550135,...,47.8,29.3,12.7,93.671397,0.810411,0.763333,1.688076,53.283511,4.312373,North West
1188,Sublette County,Wyoming,33.0,11,126.9,352.0,77222,9899,6.8,0.000000,...,54.3,18.2,8.6,95.690422,0.000000,0.523871,0.118612,64.532156,5.148130,North West
1181,Laramie County,Wyoming,435.0,159,156.1,438.5,57192,97121,10.8,175.039384,...,47.9,27.8,12.2,87.859291,2.909956,1.165240,3.984030,51.758996,5.053669,North West
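
The chained conditional used to build `region` can also be written as a dictionary lookup, which makes it easier to spot a state listed twice or missing from every list. The sketch below is an illustration under assumptions: `regions` holds deliberately shortened lists, `state_to_region` is an illustrative name, and `'Puerto Rico'` is a hypothetical value standing in for any state that matches no list.

```python
from collections import Counter

import pandas as pd

# Hypothetical, shortened region lists; the full lists are defined in the cell above.
regions = {
    'West': ['California', 'Nevada', 'Hawaii'],
    'North East': ['Rhode Island', 'Maine'],
}

# Flag any state that appears more than once across the lists.
counts = Counter(state for states in regions.values() for state in states)
duplicates = sorted(state for state, n in counts.items() if n > 1)

# Invert to a state -> region lookup and apply it with Series.map.
state_to_region = {state: region for region, states in regions.items() for state in states}
states = pd.Series(['Maine', 'Nevada', 'Puerto Rico'])
mapped = states.map(state_to_region)

# States missing from every list come back as NaN, which makes them easy to find.
unmapped = states[mapped.isna()].tolist()
print(duplicates, mapped.tolist(), unmapped)
```

One design difference worth noting: the lookup returns `NaN` for unmatched states instead of an empty string, so `isna()` can be used to find them.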


In [132]:
data_cancer.region.value_counts()

region
Mid West        1149
South East       745
South West       444
Mid Atlantic     366
North West       206
West              78
North East        59
Name: count, dtype: int64
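
Because a state that matches no list falls back to an empty string, it is worth confirming that the region counts add up to the 3047 rows reported by `info()`. A short arithmetic check, with the counts copied from the output above (`region_counts` is an illustrative name):

```python
# Region counts copied from the value_counts() output above.
region_counts = {
    'Mid West': 1149,
    'South East': 745,
    'South West': 444,
    'Mid Atlantic': 366,
    'North West': 206,
    'West': 78,
    'North East': 59,
}

# The total should equal the number of rows in the dataset.
total = sum(region_counts.values())
print(total)  # 3047
```

The total matches the row count and no empty label appears in the output, so every county received a region.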

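We converted all column names to lowercase and dropped the redundant `index` column, since pandas already provides a default index. We then split the `geography` column into separate `county` and `state` columns and moved them to the front of the DataFrame. Finally, we assigned each county to one of seven regions (West, North West, South West, Mid West, South East, Mid Atlantic, North East) based on its state.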
