# Module 1 Exercise 2 - Survey Analysis
## Overview
In this exercise, you will analyze more data from the REDCap survey used in the labs and practices.  You will draw upon methods from prior DSA statistical coursework and use some common techniques for survey analysis. (See [Lab 3](../labs/lab-01-03.ipynb) from this module for a refresher on the required statistical method.)


## File Formats
Files are located in the `resources/REDCap` subfolder of this module.  The files are in CSV format and are the same set used in the lab and practice exercises.

An additional file will be used in this exercise: 

`survey_invitations.csv` - contains an ID number and the department of every person invited to fill out the survey
  - `id` - an id number for the subject - joins to the `redcap_survey_identifier` in the REDCap data file
  - `department` - the department for the subject


## Required Output
You will respond to the questions located in the Quiz for this exercise in the Canvas site for this course.
        
## Grading
There are two parts to submission of this exercise. The first is submission of this notebook, and is worth 10 points. Not submitting code will result in a loss of 10 points. Submitting code that is not functional will result in a loss of 5 points.

The second part of the exercise is submission of the answers via the associated Canvas quiz. Each correct answer on the Canvas Quiz is worth 2 points.

Any numeric answer typed into Canvas will be considered correct if it is within $\pm$1% of the reference answer.  Answers in which you select a given choice will be graded against the identified correct choice(s).  For multi-select questions, partial credit is given if some of the correct answers are selected.
    

In [1]:
import pandas as pd
import numpy as np

## Load the REDCap export data, the survey invitations, and the data dictionary into pandas dataframes

Exclude from further analysis any surveys that were not complete.

In [19]:
# your code here

data = pd.read_csv('../resources/REDCap/REDCap_Sample_DATA.csv')
survey = pd.read_csv('../resources/REDCap/survey_invitations.csv')
data_dict = pd.read_csv('../resources/REDCap/REDCap_Sample_DataDictionary.csv')

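# REDCap stores form status in the `<form>_complete` field: 0 = Incomplete, 1 = Unverified, 2 = Complete
data_complete = data[data["prerollout_survey_complete"] == 2]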

In [47]:
data_complete

Unnamed: 0,pre_participant_id,redcap_survey_identifier,prerollout_survey_timestamp,pre_gender,pre_role,pre_yrs_experience,pre_calculator_use,pre_why_no_use___1,pre_why_no_use___2,pre_why_no_use___3,...,pre_likely_to_use_newer,pre_wait_time_to_use,pre_who_determines,prerollout_survey_complete,pre_barriers,barriers_coded_1,barriers_coded_2,pre_lacking_features,lack_features_coded_1,lack_features_coded_2
0,1,,11/19/14 02:21 PM,1.0,1.0,8.0,0.0,0,0,0,...,,,,2,Not having it at the point of care And not ge...,integration,integration,Prognosis,specific calculator feature,specific calculator feature
1,2,,11/19/14 02:31 PM,1.0,1.0,8.0,1.0,0,0,0,...,1.0,5.0,1.0,2,,,,,,
2,3,,11/19/14 02:24 PM,1.0,1.0,7.0,1.0,0,0,0,...,3.0,4.0,1.0,2,,,,,,
3,4,,11/19/14 02:25 PM,0.0,4.0,9.0,0.0,0,0,0,...,,,,2,I find that I do not need them in my practice,necessity,necessity,none,none,none
4,5,,11/19/14 02:36 PM,1.0,3.0,5.0,1.0,0,0,0,...,3.0,3.0,1.0,2,Phone battery,technical,technical,,none,none
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
116,117,656.0,02/17/15 07:31 PM,0.0,2.0,2.0,1.0,0,0,0,...,3.0,2.0,1.0,2,,,,,,
117,118,791.0,02/17/15 08:04 PM,1.0,2.0,3.0,1.0,0,0,0,...,3.0,3.0,20.0,2,,,,,,
118,119,820.0,02/17/15 08:19 PM,0.0,2.0,1.0,1.0,0,0,0,...,3.0,2.0,1.0,2,,,,,,
119,120,747.0,02/18/15 09:36 AM,1.0,3.0,5.0,1.0,0,0,0,...,3.0,1.0,1.0,2,,,,,,


In [11]:
survey

Unnamed: 0,id,department
0,1,Child Health
1,2,Physical Medicine and Rehabilitation
2,3,Orthopaedic Surgery
3,4,Orthopaedic Surgery
4,5,Medicine
...,...,...
814,919,Urology
815,920,Urology
816,921,Urology
817,922,Urology


## Identify potential bias in responses
This survey was sent to all physicians in each department at a hospital.  It is possible that some departments may have had a larger percentage of responses than others, which could introduce departmental bias in the analysis of responses.  

Using a two-way contingency test, determine whether the actual responses by department match the expected responses.  Some survey responses were not linked to their invitations (their `redcap_survey_identifier` is empty); exclude those from this analysis.



### Join the invitations to the responses for those responses that have a redcap_survey_identifier
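The File Formats section states that `id` in the invitations file joins to `redcap_survey_identifier` in the REDCap data. The following is a minimal sketch of a join on that key, using small made-up frames (`invites`, `responses`, and all of their values are hypothetical stand-ins for the real files). It assumes a left join from invitations, so every invitation is kept and unmatched ones are flagged as non-responses.

```python
import pandas as pd

# Toy stand-ins for the real files; every value here is invented for illustration.
invites = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "department": ["Medicine", "Medicine", "Urology", "Urology"],
})
responses = pd.DataFrame({
    "pre_participant_id": [1, 2, 3],
    "redcap_survey_identifier": [102.0, None, 104.0],
    "prerollout_survey_complete": [2, 2, 2],
})

# Keep only responses that can be linked to an invitation, then align the key dtype.
linked = responses.dropna(subset=["redcap_survey_identifier"])
linked = linked.astype({"redcap_survey_identifier": "int64"})

# Left join keeps every invitation; indicator=True records whether a response matched.
joined = invites.merge(
    linked,
    left_on="id",
    right_on="redcap_survey_identifier",
    how="left",
    indicator=True,
)
joined["no_response"] = (joined["_merge"] == "left_only").astype(int)

print(joined[["id", "department", "no_response"]])
```

Using `indicator=True` makes the response flag independent of which response columns happen to contain missing values.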


In [41]:
# your code here

data_valid = data[data['redcap_survey_identifier'].notna()]

data_valid_edit = data_valid.rename(columns={'pre_participant_id': 'id'})

join_df = survey.merge(data_valid_edit, on=['id'], how='left')

join_df['no_response'] = join_df['prerollout_survey_complete'].isnull()*1

join_df

Unnamed: 0,id,department,redcap_survey_identifier,prerollout_survey_timestamp,pre_gender,pre_role,pre_yrs_experience,pre_calculator_use,pre_why_no_use___1,pre_why_no_use___2,...,pre_wait_time_to_use,pre_who_determines,prerollout_survey_complete,pre_barriers,barriers_coded_1,barriers_coded_2,pre_lacking_features,lack_features_coded_1,lack_features_coded_2,no_response
0,1,Child Health,,,,,,,,,...,,,,,,,,,,1
1,2,Physical Medicine and Rehabilitation,,,,,,,,,...,,,,,,,,,,1
2,3,Orthopaedic Surgery,,,,,,,,,...,,,,,,,,,,1
3,4,Orthopaedic Surgery,,,,,,,,,...,,,,,,,,,,1
4,5,Medicine,,,,,,,,,...,,,,,,,,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
814,919,Urology,,,,,,,,,...,,,,,,,,,,1
815,920,Urology,,,,,,,,,...,,,,,,,,,,1
816,921,Urology,,,,,,,,,...,,,,,,,,,,1
817,922,Urology,,,,,,,,,...,,,,,,,,,,1


### Perform the two-way contingency test
For each survey invitation, determine if a response exists.  Compare the response/no response counts by department.  Use $\alpha$ = 0.05.
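As a rough illustration of the test and the decision rule, the sketch below runs a chi-square test of independence on a hypothetical table (the `toy` frame and its counts are invented, not taken from the survey). It assumes the contingency table is built without margins, since the `All` row and column must not be passed to the test.

```python
import pandas as pd
from scipy import stats

# Hypothetical invitation-level data: one row per invitation, 1 = no response.
toy = pd.DataFrame({
    "department": ["A"] * 40 + ["B"] * 40 + ["C"] * 40,
    "no_response": [0] * 10 + [1] * 30 + [0] * 12 + [1] * 28 + [0] * 8 + [1] * 32,
})

alpha = 0.05

# Observed counts: departments as rows, response status as columns (no margins).
observed = pd.crosstab(toy["department"], toy["no_response"])

# chi2_contingency returns the statistic, the p-value, the degrees of freedom,
# and the expected counts under independence.
chi_sq, p_value, dof, expected = stats.chi2_contingency(observed)

# Reject the null hypothesis only when the p-value falls below alpha.
reject_null = p_value < alpha
print(f"chi-square = {chi_sq:.3f}, dof = {dof}, p = {p_value:.4f}")
print("reject H0" if reject_null else "fail to reject H0")
```

Failing to reject the null hypothesis means the data do not show a departmental difference in response rates at the chosen $\alpha$; it does not prove that no difference exists.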

### Quiz Week 1 Exercise 2 Question 1
Do we reject the null hypothesis that the frequencies of response/no response are equal across departments, at $\alpha$ = 0.05?

In [45]:
# your code here

ct = pd.crosstab(join_df['department'], join_df['no_response'], margins=True)
ct

department,no_response=0,no_response=1,All
Child Health,5,86,91
Dermatology,0,24,24
Emergency Medicine,6,35,41
Family and Community Medicine,14,93,107
Medicine,13,253,266
Neurology,2,31,33
Orthopaedic Surgery,4,36,40
Orthopedics,0,28,28
Otolaryngology,2,24,26
Physical Medicine and Rehabilitation,2,26,28


In [46]:
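from scipy import stats

# Count invitations by department and response status, then pivot so that
# rows are no_response (0/1) and columns are departments
groupsizes = join_df.groupby(['department', 'no_response']).size()
ctsum = groupsizes.unstack('department')

# Departments with no responses show up as NaN after unstacking; treat them as zero counts
chi_sq, p_value, degrees_freedom, expected = stats.chi2_contingency(ctsum.fillna(0))

print('chi square statistic', chi_sq)
print('p-value', p_value)
print('\nexpected frequencies:')
print(expected)
print('\nactual frequencies:')
print(ctsum)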

chi square statistic 18.63919031009638
p-value 0.09761588202723988

expected frequencies:
[[  6.11111111   1.61172161   2.75335775   7.18559219  17.86324786
    2.21611722   2.68620269   1.88034188   1.74603175   1.88034188
    6.11111111   0.53724054   2.41758242]
 [ 84.88888889  22.38827839  38.24664225  99.81440781 248.13675214
   30.78388278  37.31379731  26.11965812  24.25396825  26.11965812
   84.88888889   7.46275946  33.58241758]]

actual frequencies:
department   Child Health  Dermatology  Emergency Medicine  \
no_response                                                  
0                     5.0          NaN                 6.0   
1                    86.0         24.0                35.0   

department   Family and Community Medicine  Medicine  Neurology  \
no_response                                                       
0                                     14.0      13.0        2.0   
1                                     93.0     253.0       31.0   

department   Ortho

### Quiz Week 1 Exercise 2 Question 2
Based on the result of the contingency test, can we conclude that there is likely no bias in the responses by department?

## Inter-rater agreement
For these questions, use the full dataset of survey responses (excluding incomplete surveys), not just the responses with an associated department.

Free text responses were collected in the survey for barriers to use, in the field `pre_barriers`.  Two coders were tasked with coding the free text into a category.  Those codes were stored in `barriers_coded_1` and `barriers_coded_2`.  Determine the inter-rater agreement using Cohen's kappa.
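Cohen's kappa compares the observed agreement $p_o$ with the agreement expected by chance $p_e$: $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below computes it on hypothetical codes (the `codes` frame and its labels are invented) both with `cohen_kappa_score` and by hand. It assumes that rows neither rater coded should be dropped first, which the exercise does not state, and it passes the string labels directly so both raters share one label set.

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# Hypothetical codes from two raters; None marks responses with no free text.
codes = pd.DataFrame({
    "rater_1": ["workflow", "integration", None, "UI", "none",
                "integration", None, "necessity"],
    "rater_2": ["workflow", "integration", None, "necessity", "none",
                "integration", None, "necessity"],
})

# Assumption: keep only rows that both raters coded, so shared blanks
# are not counted as agreement.
both = codes.dropna(subset=["rater_1", "rater_2"])

# String labels can be passed directly; no integer encoding is needed.
kappa = cohen_kappa_score(both["rater_1"], both["rater_2"])

# Manual check: kappa = (p_o - p_e) / (1 - p_e)
p_o = (both["rater_1"] == both["rater_2"]).mean()
m1 = both["rater_1"].value_counts(normalize=True)
m2 = both["rater_2"].value_counts(normalize=True)
p_e = m1.mul(m2, fill_value=0).sum()  # chance agreement from the marginals
kappa_manual = (p_o - p_e) / (1 - p_e)

print(f"sklearn: {kappa:.4f}, manual: {kappa_manual:.4f}")
```

Encoding each column separately with `pd.Categorical(...).codes` is only safe when both columns contain exactly the same set of categories; otherwise the same integer can stand for different labels.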

### Quiz Week 1 Exercise 2 Question 3
What is the Cohen's kappa for inter-rater reliability between `barriers_coded_1` and `barriers_coded_2`?

### Quiz Week 1 Exercise 2 Question 4
Does the Cohen's kappa for inter-rater reliability between `barriers_coded_1` and `barriers_coded_2` indicate near perfect agreement or slight agreement?
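One commonly cited scale for describing kappa values is the one proposed by Landis and Koch (1977). The helper below is a small sketch of that scale; the function name `interpret_kappa` is hypothetical, and the course may use a different set of cut-offs.

```python
def interpret_kappa(kappa: float) -> str:
    """Map a Cohen's kappa value to a Landis and Koch (1977) agreement label."""
    if kappa < 0:
        return "poor"
    # Upper bound of each band, checked in ascending order.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


# Example values chosen for illustration only.
for value in (0.15, 0.55, 0.85):
    print(value, interpret_kappa(value))
```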

In [49]:
# your code here

from IPython.display import display
from sklearn.metrics import cohen_kappa_score

def categorize(df, col_name, mapping=None):
    """Convert a column to a pandas Categorical in place, optionally remapping values first."""
    if mapping:
        df[col_name] = pd.Categorical(df[col_name].map(mapping))
    else:
        df[col_name] = pd.Categorical(df[col_name])

def add_pct(df, pct_col_name, cnt_col_name, decimal_places):
    """Add a percentage-of-total column computed from the count column."""
    total = df[cnt_col_name].sum()
    df[pct_col_name] = (df[cnt_col_name] / total * 100).round(decimal_places)
    return df

def group(df, group_col_name, cnt_col_name):
    """Return a one-column DataFrame of row counts per category."""
    final_group = pd.DataFrame(df.groupby(group_col_name).size())
    final_group.columns = [cnt_col_name]
    return final_group

def cohen_kappa(df, c1, c2):
    """Show the category distribution for each coder, then print Cohen's kappa."""
    cnt_col_name = 'cnt'
    c1group = group(df, c1, cnt_col_name)
    c2group = group(df, c2, cnt_col_name)

    print()
    print(c1, 'total: {}'.format(c1group.sum()))
    display(add_pct(c1group, 'pct', cnt_col_name, 1))
    print()
    print(c2, 'total: {}'.format(c2group.sum()))
    display(add_pct(c2group, 'pct', cnt_col_name, 1))
    print()
    # Kappa is computed on the integer category codes; missing values are encoded as -1
    print("Cohen's Kappa for {c1} and {c2} is {ck:.3f}".format(c1=c1, c2=c2, ck=cohen_kappa_score(df[c1].cat.codes, df[c2].cat.codes)))

# Work on an explicit copy so that assigning the categorical columns
# does not raise SettingWithCopyWarning on the filtered slice
data_complete = data_complete.copy()

categorize(data_complete, 'barriers_coded_1')
categorize(data_complete, 'barriers_coded_2')

cohen_kappa(data_complete, 'barriers_coded_1', 'barriers_coded_2')


barriers_coded_1 total: cnt    49
dtype: int64


barriers_coded_1,cnt,pct
UI,3,6.1
integration,17,34.7
necessity,6,12.2
none,8,16.3
technical,2,4.1
training,3,6.1
workflow,10,20.4



barriers_coded_2 total: cnt    49
dtype: int64


barriers_coded_2,cnt,pct
UI,2,4.1
integration,17,34.7
necessity,8,16.3
none,8,16.3
technical,2,4.1
training,3,6.1
workflow,9,18.4



Cohen's Kappa for barriers_coded_1 and barriers_coded_2 is 0.930
