# Final Project
## Amanda Epstein

1. Pick a dataset
2. Explore the dataset
3. Pose an exploratory research question

Data source: https://www.cdc.gov/cancer/uscs/dataviz/download_data.htm

## Data Set Information

"This tool provides incidence and death counts, rates, and trend data; survival and prevalence estimates; and state-, county-, and congressional district data in a user-driven format. Additional modules provide data for cancers associated with selected risk factors, and incidence data for American Indian and Alaska Native populations living in Indian Health Service purchased/referred care delivery areas. Users can display the output in tables, graphic files, and shareable formats designed for e-mail and social media. The data presented include cancer cases diagnosed and cancer deaths that occurred from 1999 to 2016, for the most recent 5 years combined (2012 to 2016), and for 2016 alone, which is the most recent year that incidence data are available."

"It includes incidence data on more than 1 million cases of invasive cancer (including more than 15,000 cases among children younger than 20 years) diagnosed in each of the individual years. The population coverage may vary by the suppression of state incidence data if 16 or fewer cases were reported, or if the state requested that the data be suppressed, or if a state did not meet publication criteria. For the most recent release, data from 100% of the U.S. population are displayed for cancer cases diagnosed in 2016 only and the most recent 5 years combined (2012 to 2016).

The tool also includes mortality data from malignant cancers as recorded in the National Vital Statistics System from all 50 states and the District of Columbia. Mortality data are available for 100% of the U.S. population.

Cancer incidence and mortality trend data are presented from 1999 through 2016. The 18-year incidence trend includes 98% of the U.S. population, and the mortality trend includes 100% of the U.S. population. The tool also presents survival and prevalence estimates, which are based on NPCR data covering 93% of the U.S. population."

### Data sources
"Information on newly diagnosed cancer cases is based on data collected by registries in CDC’s National Program of Cancer Registries (NPCR) and NCI’s Surveillance, Epidemiology, and End Results (SEER) Program. Together, the two federal programs collect cancer incidence data for the entire U.S. population. These data can be used to monitor cancer trends over time, determine cancer patterns in various populations, guide planning and evaluation of cancer control programs, help set priorities for allocating health resources, and provide information for a national database of cancer incidence. Information on cancer deaths is collected by CDC’s National Center for Health Statistics (NCHS) National Vital Statistics System (NVSS)."

## Import libraries

In [1]:
import os
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
!pwd

/c/Users/Eirika/Documents/PythonForDataScience/Week9


In [3]:
!ls

data
final_project_notes.ipynb


In [6]:
# The USCS ASCII files are pipe-delimited, so pass sep='|'
brainbysite = pd.read_csv('./data/USCS_1999_2015_ASCII/BRAINBYSITE.txt', sep='|')

In [7]:
brainbysite.head()

Unnamed: 0,AGE,AGE_ADJUSTED_CI_LOWER,AGE_ADJUSTED_CI_UPPER,AGE_ADJUSTED_RATE,BEHAVIOR,COUNT,POPULATION,SEX,SITE,YEAR,CRUDE_CI_LOWER,CRUDE_CI_UPPER,CRUDE_RATE
0,0-19,~,~,~,Benign/Borderline,~,39865121,Female,All other,2004,~,~,~
1,0-19,~,~,~,Malignant,~,39865121,Female,All other,2004,~,~,~
2,0-19,~,~,~,Benign/Borderline,~,39998876,Female,All other,2005,~,~,~
3,0-19,~,~,~,Malignant,~,39998876,Female,All other,2005,~,~,~
4,0-19,~,~,~,Benign/Borderline,~,40165421,Female,All other,2006,~,~,~
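
The `~` entries in this preview look like placeholders for suppressed values (the dataset description above mentions suppression when 16 or fewer cases were reported). Below is a minimal sketch, assuming `~` always means "suppressed", of reading a file in this layout so that those cells become `NaN` and the numeric columns get numeric dtypes. The inline sample is a stand-in for the real file, and the count and rate in its second row are invented for illustration.

```python
import io
import pandas as pd

# Small pipe-delimited sample in the same layout as BRAINBYSITE.txt.
# The COUNT and CRUDE_RATE values in the second row are made up.
sample = io.StringIO(
    "AGE|BEHAVIOR|COUNT|POPULATION|SEX|SITE|YEAR|CRUDE_RATE\n"
    "0-19|Malignant|~|39865121|Female|All other|2004|~\n"
    "0-19|Malignant|25|39998876|Female|All other|2005|0.1\n"
)

# Assumption: "~" marks a suppressed value, so treat it as missing (NaN).
df = pd.read_csv(sample, sep="|", na_values=["~"])

# Count how many cells were suppressed in each column.
suppressed_per_column = df.isna().sum()
print(df.dtypes)
print(suppressed_per_column)
```

With the placeholders parsed as missing, columns such as `COUNT` can be summed or plotted directly instead of being stored as strings.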


In [8]:
# County-level incidence and mortality records (also pipe-delimited)
byarea_county = pd.read_csv('./data/USCS_1999_2015_ASCII/BYAREA_COUNTY.txt', sep='|')

In [9]:
byarea_county.head()

Unnamed: 0,STATE,AREA,AGE_ADJUSTED_CI_LOWER,AGE_ADJUSTED_CI_UPPER,AGE_ADJUSTED_RATE,COUNT,EVENT_TYPE,POPULATION,RACE,SEX,SITE,YEAR,CRUDE_CI_LOWER,CRUDE_CI_UPPER,CRUDE_RATE
0,AK,AK: Aleutians East Borough (02013) - 1994+,~,~,~,~,Incidence,5248,All Races,Female,All Cancer Sites Combined,2011-2015,~,~,~
1,AK,AK: Aleutians East Borough (02013) - 1994+,~,~,~,~,Mortality,5248,All Races,Female,All Cancer Sites Combined,2011-2015,~,~,~
2,AK,AK: Aleutians East Borough (02013) - 1994+,~,~,~,~,Incidence,5248,All Races,Female,Brain and Other Nervous System,2011-2015,~,~,~
3,AK,AK: Aleutians East Borough (02013) - 1994+,~,~,~,~,Mortality,5248,All Races,Female,Brain and Other Nervous System,2011-2015,~,~,~
4,AK,AK: Aleutians East Borough (02013) - 1994+,~,~,~,~,Incidence,5248,All Races,Female,Cervix,2011-2015,~,~,~
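
The `AREA` column packs a state abbreviation, a county name, and what appears to be a five-digit FIPS code into one string. Below is a rough sketch of splitting those parts into separate columns with a regular expression. The pattern is an assumption based only on the rows shown in this notebook and may not match every value in the full file.

```python
import pandas as pd

# Two AREA values copied from the notebook output.
areas = pd.Series([
    "AK: Aleutians East Borough (02013) - 1994+",
    "AK: Juneau City and Borough (02110) - 1990+",
])

# Assumed layout: "<state>: <county name> (<5-digit code>) ..."
pattern = r"^(?P<state>[A-Z]{2}): (?P<county>.+?) \((?P<fips>\d{5})\)"

# Named groups become column names; rows that do not match yield NaN.
parts = areas.str.extract(pattern)
print(parts)
```

Keeping the code as a string preserves its leading zero, which matters if it is later used as a join key against other county-level tables.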


In [10]:
byarea_county.columns

Index(['STATE', 'AREA', 'AGE_ADJUSTED_CI_LOWER', 'AGE_ADJUSTED_CI_UPPER',
       'AGE_ADJUSTED_RATE', 'COUNT', 'EVENT_TYPE', 'POPULATION', 'RACE', 'SEX',
       'SITE', 'YEAR', 'CRUDE_CI_LOWER', 'CRUDE_CI_UPPER', 'CRUDE_RATE'],
      dtype='object')

In [12]:
byarea_county.index

RangeIndex(start=0, stop=2730126, step=1)

In [13]:
byarea_county.iloc[[6, 11, 500, 9608]]

Unnamed: 0,STATE,AREA,AGE_ADJUSTED_CI_LOWER,AGE_ADJUSTED_CI_UPPER,AGE_ADJUSTED_RATE,COUNT,EVENT_TYPE,POPULATION,RACE,SEX,SITE,YEAR,CRUDE_CI_LOWER,CRUDE_CI_UPPER,CRUDE_RATE
6,AK,AK: Aleutians East Borough (02013) - 1994+,~,~,~,~,Incidence,5248,All Races,Female,Colon and Rectum,2011-2015,~,~,~
11,AK,AK: Aleutians East Borough (02013) - 1994+,~,~,~,~,Mortality,5248,All Races,Female,Esophagus,2011-2015,~,~,~
500,AK,AK: Aleutians East Borough (02013) - 1994+,~,~,~,~,Mortality,1410,Black,Male,Esophagus,2011-2015,~,~,~
9608,AK,AK: Juneau City and Borough (02110) - 1990+,~,~,~,~,Incidence,9859,Hispanic,Male and Female,Pancreas,2011-2015,~,~,~
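
Selecting rows by position with `iloc` works for spot checks, but for analysis it is usually easier to select by value. Below is a small sketch using a boolean mask on a hand-built stand-in for `byarea_county`. The frame contains only a few columns, with values copied from the four rows above.

```python
import pandas as pd

# Tiny stand-in for byarea_county, built from the rows displayed above.
df = pd.DataFrame({
    "STATE": ["AK", "AK", "AK", "AK"],
    "EVENT_TYPE": ["Incidence", "Mortality", "Mortality", "Incidence"],
    "RACE": ["All Races", "All Races", "Black", "Hispanic"],
    "SEX": ["Female", "Female", "Male", "Male and Female"],
    "SITE": ["Colon and Rectum", "Esophagus", "Esophagus", "Pancreas"],
})

# Combine conditions with & and wrap each one in parentheses.
mask = (df["SITE"] == "Esophagus") & (df["EVENT_TYPE"] == "Mortality")
esophagus_deaths = df.loc[mask]
print(esophagus_deaths)
```

The same mask pattern should carry over to the full 2.7-million-row frame, for example to narrow it to one cancer site before plotting.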
