# Exploration of Socioeconomic Influences on Cancer Mortality

# Hypothesis Testing notebook

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats
import statsmodels
from numpy.random import seed

The focus of this sub-report is to test a series of null hypotheses about whether the differences in cancer mortality rates seen between different types of counties are due to chance. Each null hypothesis is evaluated with a two-sample t-test. The null hypotheses tested are:

- That the difference in cancer mortality rates between majority White counties and majority Black counties is due to chance.

- That the difference in cancer mortality rates between counties where the majority of the populace has private health insurance and counties where the majority of the populace has public health insurance is due to chance.

- That the difference in cancer mortality rates between counties with high unemployment and counties with low unemployment is due to chance. "High" and "low" are defined as above and below the median of the unemployment feature, 'PctUnemployed16_Over'.

- That the difference in cancer mortality rates between counties whose median income is below the median across all counties in the dataset and counties whose median income is above it is due to chance.

- That the difference in cancer mortality rates between counties where a high percentage of adults have a high school diploma as their highest level of education and counties where a high percentage of adults have a college degree as their highest level of education is due to chance. A "high percentage" is defined as being above the median of the relevant feature: the percentage of county residents ages 25 and over whose highest education attained is a high school diploma ('PctHS25_Over') or a Bachelor's degree ('PctBachDeg25_Over').
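Each test in this notebook follows the same pattern: split the counties into two groups, summarize the mortality column for each group, and compare the groups with a two-sample t-test. A minimal sketch of that pattern is shown here. The helper name `compare_groups` and the small synthetic DataFrame are assumptions made for illustration; only the column names 'TARGET_deathRate', 'PctWhite' and 'PctBlack' come from this notebook.

```python
import numpy as np
import pandas as pd
from scipy import stats

def compare_groups(group_a, group_b, column="TARGET_deathRate"):
    """Summarize two groups and run a two-sample t-test on one column.

    Hypothetical helper, not part of the original notebook.
    """
    a = group_a[column].dropna()
    b = group_b[column].dropna()
    result = stats.ttest_ind(a, b)  # Student's t-test (equal variances assumed)
    return {
        "n_a": len(a), "mean_a": a.mean(),
        "n_b": len(b), "mean_b": b.mean(),
        "t": result.statistic, "p": result.pvalue,
    }

# Synthetic stand-in for the real data, for demonstration only.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "PctWhite": rng.uniform(0, 100, 500),
    "TARGET_deathRate": rng.normal(180, 27, 500),
})
demo["PctBlack"] = 100 - demo["PctWhite"]

summary = compare_groups(demo[demo.PctBlack > 50], demo[demo.PctWhite > 50])
print(summary)
```

Working from the raw column rather than from hand-copied summary statistics leaves fewer places for the two groups to be mixed up.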

The DataFrame used in this hypothesis testing stage is the full version that includes the original data.world file on cancer mortality and the other data sources that were merged into this file, along with the logarithmic and exponential transformations of select features that increased the overall OLS linear regression model's accuracy.

In [2]:
cancer = pd.read_csv('cancer_ml6.csv', index_col=['Geography'])

In [3]:
cancer.shape

(3047, 329)

## First Null Hypothesis: Majority White and Black Counties

The first null hypothesis states that the difference in cancer mortality rates between majority White counties and majority Black counties is due to random chance. Separate DataFrames are created for each of these groups.

In [4]:
white = cancer[cancer.PctWhite > 50]
len(white)

2873

In [5]:
black = cancer[cancer.PctBlack > 50]
len(black)

96

For the majority White group, the necessary statistics for running hypothesis tests are assigned to objects below. First, the N, or sample size, is assigned to 'n_white'.

In [6]:
n_white = len(white)
n_white

2873

Second, the mean value of cancer mortality for majority White counties is computed.

In [7]:
white_cancer_mortality = np.mean(white['TARGET_deathRate'])
white_cancer_mortality

177.59474416985736

The standard deviation of the cancer mortality feature for majority White counties is assigned to 'white_std', for use in the t-test function that will be run below.

In [8]:
white_std = np.std(white['TARGET_deathRate'])
white_std

27.155587670795196

For the majority Black group, the necessary statistics for running hypothesis tests are assigned to objects below. First, the N, or sample size, is assigned to 'n_black'.

In [9]:
n_black = len(black)
n_black

96

Second, the mean value of cancer mortality for majority Black counties is computed. This is about 14% higher than for majority White counties.

In [10]:
black_cancer_mortality = np.mean(black['TARGET_deathRate'])
black_cancer_mortality

202.10416666666677
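The relative gap between the two group means can be checked directly from the values printed in this notebook. The variable names below are chosen for illustration.

```python
# Group means copied from the notebook outputs.
white_mean = 177.59474416985736
black_mean = 202.10416666666677

# Relative difference, expressed as a percentage of the majority White mean.
pct_higher = (black_mean - white_mean) / white_mean * 100
print(f"{pct_higher:.1f}% higher")  # prints "13.8% higher"
```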

The standard deviation of cancer mortality for majority Black counties is assigned to 'black_std', for use in the t-test function that will be run below.

In [11]:
black_std = np.std(black['TARGET_deathRate'])
black_std

28.190209280981847

The t-test is run using the mean and standard deviation of the cancer mortality feature, along with the sample size, of the majority White and majority Black counties. The t-score is approximately 8.69 and the p-value is 6e-18.

In [12]:
scipy.stats.ttest_ind_from_stats(black_cancer_mortality, black_std, n_black, 
                                 white_cancer_mortality, white_std, n_white)

Ttest_indResult(statistic=8.688263418553845, pvalue=5.97504172600573e-18)
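One detail worth checking: scipy documents the standard-deviation arguments of `ttest_ind_from_stats` as corrected sample standard deviations (ddof=1), whereas `np.std` defaults to ddof=0. With group sizes in the hundreds or thousands the effect on the statistic is small. The sketch below uses synthetic samples (an assumption, not the notebook's data) to show that passing `ddof=1` reproduces `ttest_ind` on the raw values, while the default gives a slightly larger statistic.

```python
import numpy as np
from scipy import stats

# Synthetic samples standing in for two county groups (illustration only).
rng = np.random.default_rng(42)
a = rng.normal(202, 28, size=96)
b = rng.normal(178, 27, size=2873)

# Reference: t-test computed directly from the raw values.
raw = stats.ttest_ind(a, b)

# Same test from summary statistics, using the sample standard deviation.
from_stats = stats.ttest_ind_from_stats(
    np.mean(a), np.std(a, ddof=1), len(a),
    np.mean(b), np.std(b, ddof=1), len(b),
)

# With the default ddof=0 the standard deviations are slightly smaller,
# so the statistic comes out slightly larger in magnitude.
from_stats_ddof0 = stats.ttest_ind_from_stats(
    np.mean(a), np.std(a), len(a),
    np.mean(b), np.std(b), len(b),
)

print(raw.statistic, from_stats.statistic, from_stats_ddof0.statistic)
```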

The first null hypothesis is rejected. Although establishing causality and identifying confounding variables are outside the scope of this analysis, the difference in cancer mortality rates between majority Black and majority White counties is statistically significant and is unlikely to be due to random chance alone.

## Second Null Hypothesis: Majority Private Health Insurance and Majority Public Health Insurance

The second null hypothesis states that the difference in cancer mortality rates between counties where the majority of the populace has private health insurance and counties where the majority of the populace has public health insurance is due to random chance. Separate DataFrames are created for each of these groups.

In [13]:
private = cancer[cancer.PctPrivateCoverage > 50]
len(private)

2752

In [14]:
public = cancer[cancer.PctPublicCoverage > 50]
len(public)

124

For the majority private health insurance group, the necessary statistics for running hypothesis tests are assigned to objects below. First, the N, or sample size, is assigned to 'n_private'.

In [15]:
n_private = len(private)
n_private

2752

Second, the mean value of cancer mortality for the majority private health insurance group is computed.

In [16]:
private_cancer_mortality = np.mean(private['TARGET_deathRate'])
private_cancer_mortality

177.03771802325568

The standard deviation of cancer mortality for majority private health insurance counties is assigned to 'private_std', for use in the t-test function that will be run below.

In [17]:
private_std = np.std(private['TARGET_deathRate'])
private_std

25.838121011576288

For the majority public health insurance group, the necessary statistics for running hypothesis tests are assigned to objects below. First, the N, or sample size, is assigned to 'n_public'.

In [18]:
n_public = len(public)
n_public

124

Second, the mean value of cancer mortality for the majority public health insurance group is computed.

In [19]:
public_cancer_mortality = np.mean(public['TARGET_deathRate'])
public_cancer_mortality

199.166129032258

The standard deviation of cancer mortality for majority public health insurance counties is assigned to 'public_std', for use in the t-test function that will be run below.

In [20]:
public_std = np.std(public['TARGET_deathRate'])
public_std

38.075372802480445

The t-test is run using the mean and standard deviation of the cancer mortality feature, along with the sample size, of the majority private health insurance group of counties and majority public health insurance group of counties. The t-score is approximately 9.1 and the p-value is 1.6e-19.

In [21]:
scipy.stats.ttest_ind_from_stats(public_cancer_mortality, public_std, n_public, 
                                 private_cancer_mortality, private_std, n_private)

Ttest_indResult(statistic=9.103462113639981, pvalue=1.5948013698145586e-19)

The second null hypothesis is rejected. Although establishing causality and identifying confounding variables are outside the scope of this analysis, the difference in cancer mortality rates between majority public health insurance counties and majority private health insurance counties is statistically significant and is unlikely to be due to random chance alone.
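The two insurance groups differ considerably in both size (124 versus 2,752 counties) and spread (standard deviations of roughly 38.1 versus 25.8), so it is reasonable to check the result with Welch's t-test, which does not assume equal variances. The sketch below reuses the summary statistics printed in this notebook and the `equal_var=False` option of `ttest_ind_from_stats`; the variable names are chosen for illustration.

```python
from scipy import stats

# Summary statistics copied from the notebook outputs.
public_mean, public_std, n_public = 199.166129032258, 38.075372802480445, 124
private_mean, private_std, n_private = 177.03771802325568, 25.838121011576288, 2752

# Welch's t-test: the two variances are not pooled.
welch = stats.ttest_ind_from_stats(
    public_mean, public_std, n_public,
    private_mean, private_std, n_private,
    equal_var=False,
)
print(welch.statistic, welch.pvalue)
```

If the Welch p-value is also far below the chosen significance level, the conclusion does not hinge on the equal-variance assumption.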

## Third Null Hypothesis: High Rates of Employment and Unemployment

The third null hypothesis states that the difference in cancer mortality rates between counties with high unemployment and counties with low unemployment is due to chance. "High" and "low" are defined as above and below the median of the unemployment feature, 'PctUnemployed16_Over'. Counties exactly at the median fall into neither group. Separate DataFrames are created for each of these groups.

In [22]:
unemployed_median = cancer['PctUnemployed16_Over'].median()
unemployed_median

7.6

In [23]:
# Counties with unemployment above the median form the high unemployment group.
unemployed = cancer[cancer.PctUnemployed16_Over > unemployed_median]
len(unemployed)

1488

In [24]:
# Counties with unemployment below the median form the low unemployment group.
employed = cancer[cancer.PctUnemployed16_Over < unemployed_median]
len(employed)

1517

For the high unemployment group of counties, the necessary statistics for running hypothesis tests are assigned to objects below. First, the N, or sample size, is assigned to 'n_unemployed'.

In [25]:
n_unemployed = len(unemployed)
n_unemployed

1488

Second, the mean value of cancer mortality for the high unemployment group is computed.

In [26]:
unemployed_cancer_mortality = np.mean(unemployed['TARGET_deathRate'])
unemployed_cancer_mortality

188.09240591397838

The standard deviation of cancer mortality for the high unemployment group is assigned to 'unemployed_std', for use in the t-test function that will be run below.

In [27]:
unemployed_std = np.std(unemployed['TARGET_deathRate'])
unemployed_std

27.74457837412228

For the low unemployment group, the necessary statistics for running hypothesis tests are assigned to objects below. First, the N, or sample size, is assigned to 'n_employed'.

In [28]:
n_employed = len(employed)
n_employed

1517

Second, the mean value of cancer mortality for the low unemployment group is computed.

In [29]:
employed_cancer_mortality = np.mean(employed['TARGET_deathRate'])
employed_cancer_mortality

169.37600527356634

The standard deviation of cancer mortality for the low unemployment group is assigned to 'employed_std', for use in the t-test function that will be run below.

In [30]:
employed_std = np.std(employed['TARGET_deathRate'])
employed_std

24.6406806342136

The t-test is run using the mean and standard deviation of the cancer mortality feature, along with the sample size, of the high and low unemployment counties. The t-score is approximately 19.6 and the p-value is 2.7e-80.

In [33]:
scipy.stats.ttest_ind_from_stats(unemployed_cancer_mortality, unemployed_std, n_unemployed,
                                 employed_cancer_mortality, employed_std, n_employed)

Ttest_indResult(statistic=19.561493672569732, pvalue=2.6718069354190553e-80)

The third null hypothesis is rejected. Although establishing causality and identifying confounding variables are outside the scope of this analysis, the difference in cancer mortality rates between low unemployment and high unemployment counties is statistically significant and is unlikely to be due to random chance alone.
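A median split is easier to audit when each group's name states which side of the median it holds. The following is a minimal sketch on synthetic data; the helper `median_split` is hypothetical, and only the column name 'PctUnemployed16_Over' comes from this notebook.

```python
import numpy as np
import pandas as pd

def median_split(df, column):
    """Return (above_median, below_median) subsets of df for one column.

    Rows exactly equal to the median are left out of both groups,
    mirroring the strict comparisons used in this notebook.
    """
    cutoff = df[column].median()
    above = df[df[column] > cutoff]
    below = df[df[column] < cutoff]
    return above, below

# Synthetic stand-in for the county data (illustration only).
rng = np.random.default_rng(1)
demo = pd.DataFrame({"PctUnemployed16_Over": rng.uniform(1, 20, 200).round(1)})

high_unemployment, low_unemployment = median_split(demo, "PctUnemployed16_Over")
print(len(high_unemployment), len(low_unemployment))
```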

## Fourth Null Hypothesis: Low Median Income and High Median Income Counties

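The fourth null hypothesis states that the difference in cancer mortality rates between counties with low median income and counties with high median income is due to random chance. "Low" and "high" are defined as below and above the median of the 'medIncome' feature across all counties in the dataset. Separate DataFrames are created for each of these groups.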

In [34]:
median_medIncome = np.median(cancer['medIncome'])
median_medIncome

45207.0

In [35]:
low_income = cancer[cancer.medIncome < median_medIncome]
len(low_income)

1523

In [36]:
high_income = cancer[cancer.medIncome > median_medIncome]
len(high_income)

1523

For the low income group of counties, the necessary statistics for running hypothesis tests are assigned to objects below. First, the N, or sample size, is assigned to 'n_low_income'.

In [37]:
n_low_income = len(low_income)
n_low_income

1523

Second, the mean value of cancer mortality for the low income group is computed.

In [38]:
low_income_cancer_mortality = np.mean(low_income['TARGET_deathRate'])
low_income_cancer_mortality

188.95778069599461

The standard deviation of cancer mortality for the low income group is assigned to 'low_income_std', for use in the t-test function that will be run below.

In [39]:
low_income_std = np.std(low_income['TARGET_deathRate'])
low_income_std

28.56731180116208

For the high income group, the necessary statistics for running hypothesis tests are assigned to objects below. First, the N, or sample size, is assigned to 'n_high_income'.

In [40]:
n_high_income = len(high_income)
n_high_income

1523

Second, the mean value of cancer mortality for the high income group is computed.

In [41]:
high_income_cancer_mortality = np.mean(high_income['TARGET_deathRate'])
high_income_cancer_mortality

168.37327642810246

The standard deviation of cancer mortality for the high income group is assigned to 'high_income_std', for use in the t-test function that will be run below.

In [42]:
high_income_std = np.std(high_income['TARGET_deathRate'])
high_income_std

22.63465176462829

The t-test is run using the mean and standard deviation of the cancer mortality feature, along with the sample size, of the low and high income counties. The t-score is approximately 22 and the p-value is 5.2e-100.

In [44]:
scipy.stats.ttest_ind_from_stats(low_income_cancer_mortality, low_income_std, n_low_income, 
                                 high_income_cancer_mortality, high_income_std, n_high_income)

Ttest_indResult(statistic=22.0405721116392, pvalue=5.24224060328677e-100)

The fourth null hypothesis is rejected. Although establishing causality and identifying confounding variables are outside the scope of this analysis, the difference in cancer mortality rates between low income and high income counties is statistically significant and is unlikely to be due to random chance alone.
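With equal variances assumed, the t-statistic is the difference in group means divided by a pooled standard error. The sketch below recomputes the statistic for the income comparison from the summary statistics printed in this notebook, using only the standard library; the variable names are chosen for illustration.

```python
import math

# Summary statistics copied from the notebook outputs.
m_low, s_low, n_low = 188.95778069599461, 28.56731180116208, 1523
m_high, s_high, n_high = 168.37327642810246, 22.63465176462829, 1523

# Pooled variance: the two variances weighted by their degrees of freedom.
pooled_var = ((n_low - 1) * s_low**2 + (n_high - 1) * s_high**2) / (n_low + n_high - 2)

# Standard error of the difference between the two means.
std_err = math.sqrt(pooled_var * (1 / n_low + 1 / n_high))

t_stat = (m_low - m_high) / std_err
print(round(t_stat, 2))  # 22.04
```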

## Fifth Null Hypothesis: High School Educated and College Educated Counties

The fifth null hypothesis states that the difference in cancer mortality rates between counties with a high percentage of adults whose highest level of education is a high school diploma and counties with a high percentage of adults whose highest level of education is a college degree is due to random chance. A "high percentage" is defined as being above the median of the relevant feature, 'PctHS25_Over' or 'PctBachDeg25_Over'. Separate DataFrames are created for each of these groups.

In [45]:
median_hs = cancer['PctHS25_Over'].median()
median_hs

35.3

In [47]:
high_school = cancer[cancer.PctHS25_Over > median_hs]
len(high_school)

1509

In [48]:
median_college = cancer['PctBachDeg25_Over'].median()
median_college

12.3

In [49]:
college = cancer[cancer.PctBachDeg25_Over > median_college]
len(college)

1516

For the high school counties, the necessary statistics for running hypothesis tests are assigned to objects below. First, the N, or sample size, is assigned to 'n_high_school'.

In [50]:
n_high_school = len(high_school)
n_high_school

1509

Second, the mean value of cancer mortality for the high school group is computed.

In [51]:
high_school_cancer_mortality = np.mean(high_school['TARGET_deathRate'])
high_school_cancer_mortality

187.95354539430116

The standard deviation of cancer mortality for the high school group is assigned to 'high_school_std', for use in the t-test function that will be run below.

In [52]:
high_school_std = np.std(high_school['TARGET_deathRate'])
high_school_std

27.02209504518429

For the college group, the necessary statistics for running hypothesis tests are assigned to objects below. First, the N, or sample size, is assigned to 'n_college'.

In [53]:
n_college = len(college)
n_college

1516

Second, the mean value of cancer mortality for the college group is computed.

In [54]:
college_cancer_mortality = np.mean(college['TARGET_deathRate'])
college_cancer_mortality

167.61583113456462

The standard deviation of cancer mortality for the college group is assigned to 'college_std', for use in the t-test function that will be run below.

In [55]:
college_std = np.std(college['TARGET_deathRate'])
college_std

23.55942355045072

The t-test is run using the mean and standard deviation of the cancer mortality feature, along with the sample size, of the counties with a high percentage of adults whose highest level of education is a high school diploma and the counties with a high percentage of adults whose highest level of education is a college degree. The t-score is approximately 22.1 and the p-value is 3.6e-100.

In [56]:
scipy.stats.ttest_ind_from_stats(high_school_cancer_mortality, high_school_std, n_high_school, 
                                 college_cancer_mortality, college_std, n_college)

Ttest_indResult(statistic=22.066074333060843, pvalue=3.604126593783376e-100)

The fifth null hypothesis is rejected. Although establishing causality and identifying confounding variables are outside the scope of this analysis, the difference in cancer mortality rates between "high school" and "college" counties is statistically significant and is unlikely to be due to random chance alone.
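Because the two education groups come from separate median splits on different columns, a single county can fall into both groups, while the t-test used here assumes the two samples are independent. A quick way to see how much the groups overlap is to intersect their indexes. The sketch below uses a synthetic DataFrame (an assumption); only the column names come from this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the county data (illustration only).
rng = np.random.default_rng(7)
demo = pd.DataFrame({
    "PctHS25_Over": rng.uniform(10, 55, 300),
    "PctBachDeg25_Over": rng.uniform(2, 40, 300),
})

high_school = demo[demo.PctHS25_Over > demo.PctHS25_Over.median()]
college = demo[demo.PctBachDeg25_Over > demo.PctBachDeg25_Over.median()]

# Rows that belong to both groups.
overlap = high_school.index.intersection(college.index)
print(f"{len(overlap)} of {len(demo)} rows are in both groups")

# One option: keep only rows that belong to exactly one group.
high_school_only = high_school.drop(overlap)
college_only = college.drop(overlap)
```

Dropping the shared rows is only one way to restore disjoint groups; whether it is appropriate depends on how many counties are affected.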

A clear picture emerges from these hypothesis tests: the differences in cancer mortality across socioeconomic groups are statistically significant and are unlikely to be due to random chance. Exploring the confounding variables behind these differences is rich ground for future research that could serve as a foundation for meaningful policy changes. The machine learning models built in the next section will shed light on the relative influence of the socioeconomic variables associated with cancer mortality in 2015.
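Because five tests are run on the same dataset, one simple safeguard is a Bonferroni correction, which divides the significance level by the number of tests. The significance level of 0.05 below is an assumption, since the notebook does not state one, and the dictionary keys are labels chosen for illustration; the p-values are copied from the t-test outputs in this notebook.

```python
# p-values copied from the five t-test outputs in this notebook.
p_values = {
    "race": 5.97504172600573e-18,
    "insurance": 1.5948013698145586e-19,
    "unemployment": 2.6718069354190553e-80,
    "income": 5.24224060328677e-100,
    "education": 3.604126593783376e-100,
}

alpha = 0.05  # assumed significance level; not stated in the notebook
bonferroni_alpha = alpha / len(p_values)

for name, p in p_values.items():
    decision = "reject" if p < bonferroni_alpha else "fail to reject"
    print(f"{name}: {decision} the null hypothesis")
```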