The following analysis uses the results of a cancer patient survey to draw meaningful, statistically grounded conclusions.

Developed by Ahmed Kayal

#### Package imports

In [1]:
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression


<br>

In [27]:
def p_value_generator(input_features, target_feature):
    """
    Fits an OLS regression and reports the relevance of each input variable through its p-value
    
    :param input_features: Input features to be used for training
    :type input_features: 'pandas:DataFrame'
    :param target_feature: Output column being predicted
    :type target_feature: 'pandas:Series'
    :return: Summary of regression results 
    """
    
    features = sm.add_constant(input_features)
    estimator = sm.OLS(target_feature, features)
    trained_estimator = estimator.fit()
    
    return trained_estimator.summary()

<br>

##### File read and overview

In [3]:
df = pd.read_csv("/Users/ahmed/Desktop/CancerScreenStudy2014.csv")
print(df.columns, '\n', df.shape)

Index(['id', 'studygroup', 'sexm', 'age', 'satisfaction', 'screened'], dtype='object') 
 (300, 6)


In [16]:
df[["age", "satisfaction"]].describe()

| | age | satisfaction |
|---|---|---|
| count | 300.0 | 300.0 |
| mean | 61.183333 | 55.04 |
| std | 6.179293 | 7.095978 |
| min | 50.0 | 33.0 |
| 25% | 57.0 | 51.0 |
| 50% | 61.0 | 55.0 |
| 75% | 65.0 | 60.0 |
| max | 73.0 | 76.0 |


In [21]:
df[["studygroup", "sexm", "screened"]].apply(pd.Series.value_counts)

| | studygroup | sexm | screened |
|---|---|---|---|
| 1 | 159 | 151 | 169 |
| 0 | 141 | 149 | 131 |


<br>

**Interest**

Given that this is a small dataset, I'm interested in looking at the relationship between patient satisfaction scores and the two available study groups. To do so, I'll rely on a linear regression model where satisfaction is the target variable and the study group is the input feature. 

In [12]:
input_features = df[["studygroup"]]
target = df["satisfaction"]

lm = LinearRegression().fit(X=input_features, y=target)

print(f"The linear regression's slope is {round(lm.coef_[0], 5)} and the y-intercept is {round(lm.intercept_, 5)}")


The linear regression's slope is -1.45002 and the y-intercept is 55.80851


The slope above indicates that participants in the intervention group have a mean patient satisfaction score 1.45 points lower than the control group.
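This reading of the slope relies on the predictor being binary: with a single 0/1 feature, the least squares slope equals the difference between the two group means and the intercept equals the mean of the group coded 0. The sketch below checks this on a small set of made-up scores (hypothetical values, not the survey data), using `np.linalg.lstsq` in place of `LinearRegression`.

```python
import numpy as np

# Hypothetical toy data, not the survey: 0 = control, 1 = intervention
group = np.array([0, 0, 0, 1, 1, 1], dtype=float)
score = np.array([56.0, 58.0, 54.0, 53.0, 55.0, 57.0])

# Fit score = intercept + slope * group by ordinary least squares
design = np.column_stack([np.ones_like(group), group])
(intercept, slope), *_ = np.linalg.lstsq(design, score, rcond=None)

mean_control = score[group == 0].mean()
mean_intervention = score[group == 1].mean()

# The slope should match the difference in group means,
# and the intercept should match the control-group mean
print(f"slope = {slope:.3f}, difference in means = {mean_intervention - mean_control:.3f}")
print(f"intercept = {intercept:.3f}, control mean = {mean_control:.3f}")
```

Applied to the survey, this means the intercept of 55.81 can be read as the control group's mean satisfaction score.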

<br>

In [11]:
# Checking the coefficient of determination (scaled to a percentage before rounding)
r_squared_val = round(lm.score(X=input_features, y=target) * 100, 3)

print(f"{r_squared_val}% of the variability in patient satisfaction is explained by the study group.")

1.044% of the variability in patient satisfaction is explained by the study group.
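For a regression with a single predictor, the coefficient of determination is the square of the Pearson correlation between the predictor and the target. The following sketch illustrates that identity on made-up numbers (the arrays are hypothetical, not the survey data), computing R-squared from the residual and total sums of squares.

```python
import numpy as np

# Hypothetical toy data, not the survey
x = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
y = np.array([55.0, 60.0, 52.0, 57.0, 54.0, 51.0, 58.0, 53.0])

# Simple linear regression: np.polyfit returns the highest-degree term first
slope, intercept = np.polyfit(x, y, 1)
predicted = intercept + slope * x

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = ((y - predicted) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

# Pearson correlation between predictor and target
r = np.corrcoef(x, y)[0, 1]

print(f"R-squared = {r_squared:.5f}, r squared = {r ** 2:.5f}")
```

By the same identity, an R-squared of about 0.0104 corresponds to a correlation of roughly -0.10 between study group and satisfaction, which is weak.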


In [6]:
# Looking for the p-values for the variables of interest 
p_value_generator(input_features, target)


| | | | |
|---|---|---|---|
| Dep. Variable: | satisfaction | R-squared: | 0.01 |
| Model: | OLS | Adj. R-squared: | 0.007 |
| Method: | Least Squares | F-statistic: | 3.143 |
| Date: | Sat, 25 Jan 2020 | Prob (F-statistic): | 0.0773 |
| Time: | 12:29:53 | Log-Likelihood: | -1011.5 |
| No. Observations: | 300 | AIC: | 2027.0 |
| Df Residuals: | 298 | BIC: | 2034.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 55.8085 | 0.595 | 93.723 | 0.000 | 54.637 | 56.980 |
| studygroup | -1.4500 | 0.818 | -1.773 | 0.077 | -3.060 | 0.160 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.67 | Durbin-Watson: | 2.035 |
| Prob(Omnibus): | 0.715 | Jarque-Bera (JB): | 0.656 |
| Skew: | -0.113 | Prob(JB): | 0.72 |
| Kurtosis: | 2.963 | Cond. No. | 2.69 |


<br>

**Conclusion**

Of the three tables above, the middle (coefficient) table is the most important here. The p-value for the study group coefficient tests whether mean satisfaction differs between the study's two groups. In this case p = 0.077, which exceeds the conventional 0.05 threshold, so the 1.45-point difference in mean patient satisfaction scores is not statistically significant at the 5% level. The 95% confidence interval for the coefficient, [-3.060, 0.160], includes zero, which is consistent with this.
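The columns of the coefficient table are tied together by simple arithmetic: the t statistic is the coefficient divided by its standard error, and the 95% interval is the coefficient plus or minus a critical t value times the standard error. The sketch below redoes that arithmetic for the `studygroup` row. The critical value 1.968 is an assumption, an approximation of the 97.5th percentile of a t distribution with 298 degrees of freedom rather than a number taken from the output.

```python
# Values copied from the studygroup row of the coefficient table
coef = -1.4500
std_err = 0.818

# Assumption: approximate 97.5th percentile of a t distribution
# with 298 residual degrees of freedom (Df Residuals in the summary)
t_crit = 1.968

# t statistic: how many standard errors the estimate lies from zero
t_stat = coef / std_err

# 95% confidence interval for the coefficient
ci_lower = coef - t_crit * std_err
ci_upper = coef + t_crit * std_err

print(f"t = {t_stat:.3f}, 95% CI = [{ci_lower:.3f}, {ci_upper:.3f}]")
```

Up to rounding, this reproduces the t value and interval bounds reported by statsmodels for the study group coefficient.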

<br>

<br>

**Interest**

Seeing that there isn't a significant difference in scores between the two study groups, I'll next model patient satisfaction as a function of both study group and age.

In [7]:
input_features_v2 = df[["studygroup", "age"]]
target_v2 = df["satisfaction"]

lm_v2 = LinearRegression().fit(X=input_features_v2, y=target_v2)

In [8]:
print(f"The study group coefficient is {round(lm_v2.coef_[0], 5)} and the y-intercept is {round(lm_v2.intercept_, 5)}")

The study group coefficient is 4.52975 and the y-intercept is 10.52127


The study group coefficient indicates that, after adjusting for age, participants in the intervention group had a mean patient satisfaction score 4.53 points higher than the control group.
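"After adjusting for age" means comparing two participants of the same age who differ only in study group. As a rough illustration, the sketch below plugs the fitted coefficients into the regression equation for a participant aged 61 (the median age from the overview). The age coefficient of 0.6884 is taken from the OLS summary further down; the helper name `predicted_satisfaction` is hypothetical.

```python
# Coefficients of the two-predictor model, copied from the OLS summary
CONST = 10.5213
COEF_STUDYGROUP = 4.5297
COEF_AGE = 0.6884

def predicted_satisfaction(studygroup, age):
    """Predicted satisfaction score from the fitted linear equation."""
    return CONST + COEF_STUDYGROUP * studygroup + COEF_AGE * age

age = 61  # median age in the survey
control = predicted_satisfaction(studygroup=0, age=age)
intervention = predicted_satisfaction(studygroup=1, age=age)

print(f"control: {control:.2f}, intervention: {intervention:.2f}")
print(f"difference at the same age: {intervention - control:.2f}")
```

Because the model has no interaction term, the difference between the two predictions is the study group coefficient at every age.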

<br>

In [24]:
# Checking the coefficient of determination (scaled to a percentage before rounding)
r_squared_val_v2 = round(lm_v2.score(X=input_features_v2, y=target_v2) * 100, 3)

print(f"{r_squared_val_v2}% of the variability in patient satisfaction is explained by the study group and age predictors.")

19.23% of the variability in patient satisfaction is explained by the study group and age predictors.


In [28]:
# Looking for the p-values for the variables of interest 
p_value_generator(input_features_v2, target_v2)

| | | | |
|---|---|---|---|
| Dep. Variable: | satisfaction | R-squared: | 0.192 |
| Model: | OLS | Adj. R-squared: | 0.187 |
| Method: | Least Squares | F-statistic: | 35.36 |
| Date: | Sun, 26 Jan 2020 | Prob (F-statistic): | 1.68e-14 |
| Time: | 15:02:13 | Log-Likelihood: | -981.0 |
| No. Observations: | 300 | AIC: | 1968.0 |
| Df Residuals: | 297 | BIC: | 1979.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 10.5213 | 5.564 | 1.891 | 0.060 | -0.429 | 21.471 |
| studygroup | 4.5297 | 1.040 | 4.354 | 0.000 | 2.482 | 6.577 |
| age | 0.6884 | 0.084 | 8.178 | 0.000 | 0.523 | 0.854 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.578 | Durbin-Watson: | 1.993 |
| Prob(Omnibus): | 0.749 | Jarque-Bera (JB): | 0.5 |
| Skew: | -0.1 | Prob(JB): | 0.779 |
| Kurtosis: | 3.011 | Cond. No. | 936.0 |


**Conclusion**

As with the previous OLS output, the relevant results are in the middle table. The low p-values for the study group and age coefficients (both reported as 0.000, i.e. below 0.0005) show that there is a significant difference in mean patient satisfaction scores between the two study groups after adjusting for age.

The low Prob (F-statistic) in the first table (1.68e-14, down from 0.0773 for the single-predictor model) indicates that study group and age together are significant predictors of satisfaction. The study group coefficient also changed from -1.45 (p = 0.077) to 4.53 (p < 0.001) once age was included. Because adjusting for age shifts both the direction and the statistical significance of the study group effect, it can be concluded that age is a possible confounding variable in this analysis.
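One way such a reversal can arise is when the groups differ in age and age itself is related to satisfaction. The simulation below is a sketch on synthetic data only: the group sizes, the age gap between groups, and every effect size are assumptions chosen for illustration, and nothing here is estimated from the survey. It fits the same two models as above with `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic data, NOT the survey: all effect sizes below are assumptions
group = rng.integers(0, 2, size=n).astype(float)   # 0 = control, 1 = intervention
age = 65 - 8 * group + rng.normal(0, 3, size=n)    # intervention group is younger
satisfaction = 10 + 4.5 * group + 0.7 * age + rng.normal(0, 6, size=n)

def ols_coefficients(*predictors):
    """Return [intercept, coef_1, ...] from an ordinary least squares fit."""
    design = np.column_stack([np.ones(n), *predictors])
    coefs, *_ = np.linalg.lstsq(design, satisfaction, rcond=None)
    return coefs

# Group effect without and with age in the model
unadjusted_effect = ols_coefficients(group)[1]
adjusted_effect = ols_coefficients(group, age)[1]

print(f"unadjusted: {unadjusted_effect:.2f}, adjusted for age: {adjusted_effect:.2f}")
```

In this simulated setting the unadjusted group effect is negative because the younger intervention group starts from lower age-related satisfaction, while the age-adjusted effect recovers the positive value built into the data. Whether the survey's groups actually differ in age would need to be checked directly, for example by comparing mean age per study group.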