#### DATS 6101 - Project - Group 3

# INFLUENCE OF INCOME ON CANCER INCIDENCE AND DEATH AMONG PATIENTS IN USA

# I.   Introduction

Although the cancer mortality rate in the United States declined by 29% between 1991 and 2017, cancer is still among the leading causes of death in the nation. Prior to the onset of the COVID-19 pandemic, the American Cancer Society estimated that the US would see 1.8 million new cases and over 606,000 deaths in 2020. As the pandemic strains the nation, propagating economic uncertainty and exacerbating the nation’s income inequality gap, it raises the question:

**How does wealth affect our nation’s cancer patients, and in turn, how do we expect the economic strain of the pandemic to affect cancer patients?**

When we began developing our question, we considered investigating insurance data alongside cancer patient mortality rates; however, we quickly realized that insurance coverage depends on income level, so the two variables are highly correlated. Moreover, state-level income data was more readily accessible and, we felt, more appropriate for analysis, since the available insurance data largely reported only how many people were insured within a state. We also hoped to analyze data on the pandemic’s effects on cancer patients, but any relevant and useful data is still being collected and likely will not be released to the public for at least another twelve months, depending on when the pandemic largely subsides. Finally, we found recent studies suggesting that lower-income individuals are more affected by cancer than those with higher incomes. We therefore decided to research cancer incidence and mortality rates with respect to income. Our results will show that, as we expected, income level has no significant effect on incidence rate. On the other hand, our analysis will show that income level does have an effect on the mortality rate of cancer patients.
    
For our research, we used demographic data on median household income, incidence rate, and mortality rate per state from cancer.gov’s State Cancer Profiles website. A limitation we recognize in our income dataset is that it shows only the median income per state. Although that alone doesn’t give a complete picture of a given state’s wealth, we nonetheless believe it is sufficient for our analysis. Additionally, the income and mortality data are averages over the five-year period from 2014 to 2018, while the incidence data are averages over the five-year period from 2013 to 2017. Although not a perfect match, we believe a one-year offset between the two groups of datasets is negligible for our purposes.

# II.   Summary

We initially included Puerto Rico in our data because it was included in the original datasets, but we found that, as an outlier, it had the potential to skew our analysis. Per the table in Figure 1, the minimum values of all the variables we examined belonged to Puerto Rico. Upon removing Puerto Rico, the minimum median household income rose by about $23,000, while the minimum rates of cases and deaths rose by just 19 and 13 per 100,000, respectively.

![Screen%20Shot%202020-11-30%20at%201.39.25%20PM.png](attachment:Screen%20Shot%202020-11-30%20at%201.39.25%20PM.png)
*Figure 1. Summary including Puerto Rico*

![Screen%20Shot%202020-11-30%20at%201.37.10%20PM.png](attachment:Screen%20Shot%202020-11-30%20at%201.37.10%20PM.png)
*Figure 2. Summary without Puerto Rico*

From our remaining data, we found that most states had average incidence rates between 450 and 490 per 100,000, with one state, Kentucky, having an outlying rate of 519 cases per 100,000 (see Appendix A). The mortality rate had a shape that more closely resembled a normal distribution, with most states seeing between 150 and 160 deaths per 100,000. Our data also showed that the median household income distribution in the US is skewed right; most states were within the \\$48,000 to $63,000 range.
As shown on the map in Figure 3 below, low-income states are more or less clustered in the southeastern part of the US, while high-income states are generally on the coasts. From Figure 4, we can see a fairly clear difference in the rate of cases between the central and eastern sections of the country and the western section. Figure 5 shows a similar pattern. As we noted, Kentucky stands out among all other states in its cancer incidence and mortality rates.

![Income.png](attachment:Income.png)
*Figure 3. Income distribution across the United States*

![Cases.png](attachment:Cases.png)
*Figure 4. Incidence rate across the United States*

![Deaths.png](attachment:Deaths.png)
*Figure 5. Death rate due to cancer across the United States*

# III. Statistical Analysis

The risk of cancer incidence and mortality related to economic disparities was evaluated by conducting an analysis of variance (ANOVA) and the association between them was analyzed using linear regression. 

## A. ANOVA Testing 
In order to conduct this test, the states were grouped according to median household income and segregated into three income levels. The median household income levels were defined as low income (\\$40000 - \\$55000), middle income (\\$55000 - \\$63500) and high income (\\$63500 & above).

The groups were then tested to detect any significant differences in the means of cancer incidence and mortality rates compared to their income levels.
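The grouping step can be illustrated with a minimal sketch. It uses hypothetical income values rather than the project's data, and only the Python standard library; the helper name `income_group` is an assumption for illustration. The appendix code performs the equivalent split with `pd.qcut` and `q = 3`.

```python
from statistics import quantiles

# Hypothetical median household incomes in dollars (illustrative values only).
incomes = [43000, 48000, 52000, 56000, 58000, 61000, 64000, 70000, 82000]

# Two cut points split the values into three groups of roughly equal size.
# method="inclusive" interpolates linearly between the observed values.
low_cut, high_cut = quantiles(incomes, n=3, method="inclusive")

def income_group(value):
    """Label a value as Low, Middle, or High using right-closed bins."""
    if value <= low_cut:
        return "Low"
    if value <= high_cut:
        return "Middle"
    return "High"

groups = [income_group(v) for v in incomes]
print(low_cut, high_cut)
print(groups)
```

Because the cut points are quantiles of the data itself, the group boundaries depend on which states are included, which is one reason the grouping is done after removing outliers.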

1. Hypothesis testing of cancer incidence rate means for different income levels.

    Ho: There is no significant difference in the means, i.e. μ1 = μ2 = μ3

    Ha: At least one group mean differs from the others, i.e. μ1, μ2 and μ3 are not all equal

The result of the analysis is presented in the table below.

![Screen%20Shot%202020-11-30%20at%203.19.40%20PM.png](attachment:Screen%20Shot%202020-11-30%20at%203.19.40%20PM.png)

**Result:** The test statistic is the F value of 1.18. Using a significance level (α) of 0.05, the F critical value for F(2,48) is 3.191 and the p-value for 1.18 is 0.316. Since the test statistic is less than the critical value, we fail to reject the null hypothesis and conclude that there is no significant difference in the means of incidence rates between different income groups. 

2.  Hypothesis testing for cancer mortality rate means for different income levels.

    Ho: There is no significant difference in the means, i.e. μ1 = μ2 = μ3
    
    Ha: At least one group mean differs from the others, i.e. μ1, μ2 and μ3 are not all equal
    
The result of the analysis is presented in the table below.

![Screen%20Shot%202020-11-30%20at%203.23.33%20PM.png](attachment:Screen%20Shot%202020-11-30%20at%203.23.33%20PM.png)

**Result:** The test statistic is the F value of 16.68. Using a significance level (α) of 0.05, the F critical value for F(2,48) is 3.191 and the p-value for 16.68 is less than 0.00001. Since the test statistic is greater than the critical value, we reject the null hypothesis and conclude that there is a significant difference in the means of mortality rates between different income groups. 
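The critical value and p-values quoted in the two results above can be cross-checked without statistical tables. The following is a minimal sketch that relies on one assumption: the numerator has exactly 2 degrees of freedom, in which case the upper tail of the F distribution has the closed form P(F > x) = (1 + 2x/d2)^(-d2/2). The function names are illustrative; for other degrees of freedom, `scipy.stats.f.sf` and `scipy.stats.f.ppf` cover the general case.

```python
def f_sf_df1_2(x, d2):
    """Upper-tail probability P(F > x) for an F(2, d2) distribution."""
    return (1 + 2 * x / d2) ** (-d2 / 2)

def f_crit_df1_2(alpha, d2):
    """Value x such that P(F > x) = alpha for an F(2, d2) distribution."""
    return (d2 / 2) * (alpha ** (-2 / d2) - 1)

d2 = 48  # within-group degrees of freedom, as in F(2,48)

critical = f_crit_df1_2(0.05, d2)
p_incidence = f_sf_df1_2(1.18, d2)
p_mortality = f_sf_df1_2(16.68, d2)

print(round(critical, 3))     # 3.191
print(round(p_incidence, 3))  # 0.316
print(p_mortality < 0.00001)  # True
```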

The one-way ANOVA showed that cancer mortality rates differed significantly with respect to state median household income, whereas cancer incidence rates did not differ significantly across income groups.
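For reference, the F statistic in a one-way ANOVA is the ratio of the between-group mean square to the within-group mean square. The sketch below shows that calculation on hypothetical death rates (not the project's data); the function name `one_way_anova` and the sample values are assumptions for illustration.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)

    # Between-group sum of squares: group size times squared mean offset.
    ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

    # Within-group sum of squares: squared deviations from each group's mean.
    ssw = 0.0
    for g in groups:
        group_mean = mean(g)
        ssw += sum((v - group_mean) ** 2 for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ssb / df_between) / (ssw / df_within)
    return f_stat, df_between, df_within

# Hypothetical death rates per 100,000 for three income groups.
low = [170.0, 165.0, 172.0, 168.0]
middle = [158.0, 160.0, 155.0, 162.0]
high = [148.0, 150.0, 145.0, 152.0]

f_stat, df1, df2 = one_way_anova([low, middle, high])
print(round(f_stat, 2), df1, df2)
```

Note that the within-group sum of squares must include every group exactly once; omitting or double-counting a group changes the mean square error and therefore the F value.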


## B. Correlational Analysis

We first examined the strength of association between median household income and the cancer incidence and mortality rates using Pearson's correlation coefficient. We also evaluated the linear relationship by performing a linear regression. The resulting correlations and the relationships between the variables are presented in Figures 6 and 7 below.


![Screen%20Shot%202020-11-30%20at%203.48.07%20PM.png](attachment:Screen%20Shot%202020-11-30%20at%203.48.07%20PM.png)
*Figure 6. Correlation between income, death rate, and incidence rate*

![Screen%20Shot%202020-11-30%20at%203.49.49%20PM.png](attachment:Screen%20Shot%202020-11-30%20at%203.49.49%20PM.png)
*Figure 7. Scatterplot of income versus rate of cancer cases and rate of cancer related deaths*

We identified a strong negative correlation between cancer mortality and state median household incomes. However, the strength of association between cancer incidence and state median household income is rather weak and, therefore, there is no clear indication regarding the dependency of cancer incidence rates on state income level.
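As a reminder of what Pearson's correlation coefficient measures, here is a minimal sketch computed by hand on hypothetical income and death rate values (not the project's data). The function name `pearson_r` is an assumption for illustration; the appendix code obtains the same quantity through `DataFrame.corr()`.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values: death rate falls as income rises.
income = [45000, 52000, 58000, 64000, 75000]
deaths = [175.0, 168.0, 160.0, 155.0, 146.0]

r = pearson_r(income, deaths)
print(round(r, 3))
```

A value near -1 indicates a strong negative linear association, a value near 0 indicates little linear association, and the sign gives the direction.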

## C. Building Linear Regression Model 

Our results indicated that cancer mortality is significantly correlated with income. We followed this with a simple linear regression model to predict cancer death rate from household income. For this model, the independent variable X is median household income and the dependent variable Y is the corresponding death rate.

To prepare the regression model, the dataset was split so that 80% was used to train the model and the remaining 20% was later used to test it. Using the LinearRegression class from the sklearn.linear_model package, we obtained the regression line plots shown in Figure 8.

![Screen%20Shot%202020-11-30%20at%203.52.51%20PM.png](attachment:Screen%20Shot%202020-11-30%20at%203.52.51%20PM.png)
*Figure 8. Scatterplot of income versus rate of cancer related deaths for trained and tested data*
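The train/test procedure described above can be sketched without scikit-learn. The example below fits an ordinary least squares line on 80% of a synthetic dataset and predicts the held-out 20%. The data are generated around an assumed line with random noise, and the names `fit_line`, `incomes`, and `rates` are hypothetical; the project itself uses `train_test_split` and `LinearRegression`.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic (income, death rate) pairs scattered around an assumed line.
rng = random.Random(0)
incomes = [40000 + 1000 * i for i in range(40)]
rates = [200.0 - 0.0007 * x + rng.uniform(-3, 3) for x in incomes]

# Shuffle the indices, then hold out the last 20% as the test set.
indices = list(range(len(incomes)))
rng.shuffle(indices)
split = int(0.8 * len(indices))
train_idx, test_idx = indices[:split], indices[split:]

slope, intercept = fit_line([incomes[i] for i in train_idx],
                            [rates[i] for i in train_idx])

# Predictions for the held-out observations only.
predictions = [intercept + slope * incomes[i] for i in test_idx]
print(slope, intercept)
```

Holding out a test set means the model is evaluated on observations it did not see during fitting, which gives a more honest picture of predictive accuracy than scoring on the training data.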

To evaluate the accuracy of this model, the R² value was calculated using the score method of the LinearRegression class. The score achieved for this model is R² = 0.3444, which indicates that about 34% of the total variability in death rate is explained by income. Figure 9 below visualizes the differences between the actual data and the predicted outcome.

![Screen%20Shot%202020-11-30%20at%203.53.53%20PM.png](attachment:Screen%20Shot%202020-11-30%20at%203.53.53%20PM.png)
*Figure 9. Comparison of actual versus predicted data*
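The R² value reported above is the coefficient of determination, one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch on hypothetical actual and predicted death rates (not the project's results) is shown below; the function name `r_squared` is an assumption for illustration.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical death rates per 100,000 and model predictions.
actual = [170.0, 160.0, 150.0, 165.0]
predicted = [168.0, 158.0, 154.0, 161.0]

print(round(r_squared(actual, predicted), 3))  # 0.817
```

A value of 1 means the predictions match the data exactly, while a value near 0 means the model does no better than always predicting the mean.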

# IV. Conclusion

## A. Statement of Conclusion

The first ANOVA test conducted earlier in the paper, which compared mean cancer incidence rates across income levels, failed to reject the null hypothesis, indicating that there is no significant difference in the means of incidence rates between different income groups. The second ANOVA test, which compared mean cancer mortality rates across income levels, rejected the null hypothesis, indicating that there is a significant difference in the means of mortality rates between different income groups. The results of these tests lead us to conclude that median household income has an effect on cancer mortality rates. 
The correlation between mortality rate and income was negative, but the linear regression on income explained only about 34% of the variability in death rate (R² = 0.3444). This leads us to believe that mortality rates fall as income rises, but income alone is not enough to predict a state's mortality rate with confidence. 

## B. Reliability of Results

We believe that, based on the data used and the analysis conducted, our results are reliable. The data come from a reliable government source and span a five-year range rather than just one or two years. The data are also from very recent years and not from the last century. The results of our analysis also do not lean heavily in favor of our posed question, which leads us to believe they are reliable.

## C. Improvements for the Future

One way to improve on this project would be to include more data. Our data is from 2014-2018, but including more years dating back closer to the year 2000 may show trends that our smaller amount of data does not. We could also examine whether increases in median income over time have some effect on incidence or mortality rates in the US. This would involve the previously mentioned suggestion of including more years of data. One last way to improve on our results would be to analyze them further. We used one-way ANOVA, correlation and linear regression, but in a future analysis it may be possible to use two-way ANOVA. Using other statistical software, such as R, may also be of use in a future analysis, as R has benefits that Python does not, and vice versa.

# V. Appendices

## A. Dataset histograms

![hist1.png](attachment:hist1.png)
*Figure 10. Histogram with Puerto Rico*

![hist2.png](attachment:hist2.png)
*Figure 11. Histogram without Puerto Rico*

## B. Code

In [None]:
# importing packages

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model  import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

In [None]:
# importing data

cancer = pd.read_csv("data.csv")
cancer = cancer.drop(index = 0)
cancer.drop(["FIPS","Avg Annual Count: Cases", "Avg Annual Count: Deaths"], axis = 1, inplace = True)
cancer.rename({'Value (Dollars)': 'Median Household Income($)'}, axis = 1, inplace = True)
cancer

In [None]:
# Summary statistics

summary = round((cancer.describe()),2).T
summary

In [None]:
# Visualizing the data

hist = cancer.hist(figsize = (20,20))

In [None]:
# Creating a new dataframe excluding Puerto Rico

cancer = cancer[cancer.State != "Puerto Rico"]
cancer

In [None]:
# Summary statistics

new_summary=round((cancer.describe()),2).T
new_summary

In [None]:
# Visualizing the new data

new_hist = cancer.hist(figsize = (20,20))

In [None]:
# Segregating states based on low, middle and high income groups for the analysis

cancer = cancer.reset_index(drop = True)
cancer = cancer.sort_values(by = "Median Household Income($)")
cancer["Income Group"] = pd.qcut(cancer["Median Household Income($)"], q = 3, 
                                 labels = ['Low', 'Middle', 'High'])
cancer

In [None]:
# Testing for difference in means of incidence rates

incid = cancer[["Cases per 100,000","Income Group"]].copy()
Low = pd.DataFrame(incid.loc[(incid["Income Group"] == "Low")])
Middle = pd.DataFrame(incid.loc[(incid["Income Group"] == "Middle")])
High = pd.DataFrame(incid.loc[(incid["Income Group"] == "High")])

In [None]:
# Finding the mean of each group and the grand mean

x1 = Low["Cases per 100,000"].mean()

x2 = Middle["Cases per 100,000"].mean()

x3 = High["Cases per 100,000"].mean()

X = ((Low["Cases per 100,000"].sum() + Middle["Cases per 100,000"].sum() + High["Cases per 100,000"].sum())/
     (len(Low["Cases per 100,000"]) + len(Middle["Cases per 100,000"]) + len(High["Cases per 100,000"])))

print("Grand Mean:",round(X,2))

In [None]:
# Calculating the sum of squares within each group

SS1_i = []
for i in Low["Cases per 100,000"]:
    x = (i - x1)**2
    SS1_i.append(x)
    
SS2_i = []
for i in Middle["Cases per 100,000"]:
    x = (i - x2)**2
    SS2_i.append(x)
    
SS3_i = []
for i in High["Cases per 100,000"]:
    x = (i - x3)**2
    SS3_i.append(x)
    
SSW_i = sum(SS1_i) + sum(SS2_i) + sum(SS3_i)
print("Sum of Squares Within:",round(SSW_i,2))

In [None]:
# Calculating the sum of squares between groups

SSB1_i = Low["Cases per 100,000"].count() * ((x1 - X)**2)
SSB2_i = Middle["Cases per 100,000"].count() * ((x2 - X)**2)
SSB3_i = High["Cases per 100,000"].count() * ((x3 - X)**2)

SSB_i = SSB1_i + SSB2_i + SSB3_i
print("Sum of Squares Between:", round(SSB_i,2))

In [None]:
# SSTdf (N-1)

df_total_i = len(cancer.index)-1
print("Dftotal:",df_total_i)

# SSBdf (m-1) m = number of groups

df1_i = (3-1)
print("Df1:",df1_i)

# SSW  (SSTdf - SSBdf)

df2_i = (df_total_i - df1_i)
print("Df2:",df2_i)

In [None]:
# Calculating MSB = SSB/df1 & MSE = SSW/df2 

MSB_i = SSB_i/df1_i
MSE_i = SSW_i/df2_i
print("MSB:", round(MSB_i,2))
print("MSE:", round(MSE_i,2))

In [None]:
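# Calculating F Statistics

from scipy import stats

α = 0.05

F_i = MSB_i/MSE_i
print("F value:",round(F_i,2))

# Critical value of F(df1, df2) at significance level α
F_icv = stats.f.ppf(1 - α, df1_i, df2_i)
print("F critical value:",round(F_icv,3))

# p-value: probability of an F value at least this large under the null hypothesis
p_i = stats.f.sf(F_i, df1_i, df2_i)
print("p-value:",round(p_i,6))

# The result is significant when the p-value is below α.
print("Significant at α = 0.05:", p_i < α)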

In [None]:
# Testing for difference in means of mortality rates

mortality = cancer[["Deaths per 100,000","Income Group"]].copy()
Low = pd.DataFrame(mortality.loc[(mortality["Income Group"] == "Low")])
Middle = pd.DataFrame(mortality.loc[(mortality["Income Group"] == "Middle")])
High = pd.DataFrame(mortality.loc[(mortality["Income Group"] == "High")])

In [None]:
# Finding the mean of each group and the grand mean

m1 = Low["Deaths per 100,000"].mean()

m2 = Middle["Deaths per 100,000"].mean()

m3 = High["Deaths per 100,000"].mean()

X2 = ((Low["Deaths per 100,000"].sum() + Middle["Deaths per 100,000"].sum() + High["Deaths per 100,000"].sum())/
     (len(Low["Deaths per 100,000"]) + len(Middle["Deaths per 100,000"]) + len(High["Deaths per 100,000"])))

print("Grand Mean:",round(X2,2))

In [None]:
# Calculating the sum of squares within each group

SS1_m = []
for i in Low["Deaths per 100,000"]:
    x = (i - m1)**2
    SS1_m.append(x)
    
SS2_m = []
for i in Middle["Deaths per 100,000"]:
    x = (i - m2)**2
    SS2_m.append(x)
    
SS3_m = []
for i in High["Deaths per 100,000"]:
    x = (i - m3)**2
    SS3_m.append(x)
    
SSW_m = sum(SS1_m) + sum(SS2_m) + sum(SS3_m)
print("Sum of Squares Within:",round(SSW_m,2))

In [None]:
# Calculating the sum of squares between groups

SSB1_m = Low["Deaths per 100,000"].count() * ((m1 - X2)**2)
SSB2_m = Middle["Deaths per 100,000"].count() * ((m2 - X2)**2)
SSB3_m = High["Deaths per 100,000"].count() * ((m3 - X2)**2)

SSB_m = SSB1_m + SSB2_m + SSB3_m
print("Sum of Squares Between:", round(SSB_m,2))

In [None]:
# SSTdf (N-1)

df_total_m = len(cancer.index)-1
print("Dftotal:",df_total_m)

# SSBdf (m-1) m = number of groups

df1_m = (3-1)
print("Df1:",df1_m)

# SSW  (SSTdf - SSBdf)

df2_m = (df_total_m - df1_m)
print("Df2:",df2_m)

In [None]:
# Calculating MSB = SSB/df1 & MSE = SSW/df2 

MSB_m = SSB_m/df1_m
MSE_m = SSW_m/df2_m
print("MSB:", round(MSB_m,2))
print("MSE:", round(MSE_m,2))

In [None]:
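# Calculating F Statistics

from scipy import stats

α = 0.05

F_m = MSB_m/MSE_m
print("F value:",round(F_m,2))

# Critical value of F(df1, df2) at significance level α
F_mcv = stats.f.ppf(1 - α, df1_m, df2_m)
print("F critical value:",round(F_mcv,3))

# p-value: probability of an F value at least this large under the null hypothesis
p_m = stats.f.sf(F_m, df1_m, df2_m)
print("p-value:",p_m)

# The result is significant when the p-value is below α.
print("Significant at α = 0.05:", p_m < α)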

In [None]:
# Correlation Matrix

matrix = cancer[["Cases per 100,000","Deaths per 100,000","Median Household Income($)"]].copy()
corrmatrix = matrix.corr()
sns.heatmap(corrmatrix, annot=True)
plt.show()

In [None]:
# Plotting the linear regression line for incidence rate

cancer.plot(x = 'Median Household Income($)', y = 'Cases per 100,000', style = 'o')
plt.title('Income vs Cancer Cases')
sns.regplot(x = cancer['Median Household Income($)'], y = cancer['Cases per 100,000'], ci = None)
plt.show()

In [None]:
# Plotting the linear regression line for death rate

cancer.plot(x = 'Median Household Income($)', y = 'Deaths per 100,000', style = 'o')
plt.title('Income vs Deaths')
sns.regplot(x = cancer['Median Household Income($)'], y = cancer['Deaths per 100,000'], ci = None)
plt.show()

In [None]:
# Defining input variable

Y = cancer[["Deaths per 100,000"]]
X = cancer[["Median Household Income($)"]]

# Splitting data

X_train,X_test,y_train,y_test = train_test_split(X,Y, random_state = 0,test_size = 0.20)

# Creating linear regression model object

model = LinearRegression(fit_intercept = True)

# Fitting the model

model.fit(X_train,y_train)

# Printing the coefficient and intercept of the model

coefficients = model.coef_
intercept = model.intercept_
print("Slope:",coefficients)
print("Intercept:", intercept)

In [None]:
# Predicting trained data output

train_data = model.predict(X_train)

# Predicting testing data output

y_predict = model.predict(X_test)

In [None]:
# Single prediction

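# Pass a DataFrame with the training column name so the input matches the fitted features
prediction = model.predict(pd.DataFrame({"Median Household Income($)": [20166]}))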
print(prediction)

In [None]:
# Plotting regression line for training data

plt.scatter(X_train, y_train)
plt.plot(X_train, train_data,color ="red")

plt.title("Death Rate vs Income(Trained Data)")
plt.xlabel("Income")
plt.ylabel("Death Rate")
plt.show()

In [None]:
# Plotting regression line for testing data

plt.scatter(X_test, y_test)
plt.plot(X_test, y_predict,color = "red")

plt.title("Death Rate vs Income(Tested Data)")
plt.xlabel("Income")
plt.ylabel("Death Rate")
plt.show()

In [None]:
# Evaluating the model

RSS = mean_squared_error(Y,model.predict(X)) * len(Y)
R_squared = model.score(X,Y)

print("Residual Sum of Square:",RSS)
print("R Squared:",R_squared)

In [None]:
# Visualizing the difference between actual and predicted data

# Pair each actual test value with its rounded prediction, row by row
evaluate = pd.DataFrame({"Actual": y_test["Deaths per 100,000"].to_numpy(),
                         "Predicted": np.round(y_predict.ravel(), 2)})
evaluate.plot(kind = 'bar', figsize = (10, 8))
plt.title("Differences between actual and predicted outcomes")
plt.ylabel("Cancer Death Rate")
plt.show()