# Module 1 Practice 2 - Analyzing data from REDCap
In this practice exercise, you will read data from a REDCap survey and analyze it, which also requires understanding the survey's data dictionary. The goal is to reinforce using the data dictionary together with the data to write reusable analysis code and to minimize manual work. A good analysis workflow can be rerun as new results come into the system and produce the results you are interested in without any manual intervention.

The data dictionary is in CSV form and can be downloaded and opened in a spreadsheet for easier viewing. We will use the same data as in Lab 02.
  * [REDCap_Sample_DataDictionary.csv](../resources/REDCap/REDCap_Sample_DataDictionary.csv)

In [1]:
import pandas as pd
import numpy as np
data = pd.read_csv('../resources/REDCap/REDCap_Sample_DATA.csv')
data_dict = pd.read_csv('../resources/REDCap/REDCap_Sample_DataDictionary.csv')
display(data.head())

Unnamed: 0,pre_participant_id,redcap_survey_identifier,prerollout_survey_timestamp,pre_gender,pre_role,pre_yrs_experience,pre_calculator_use,pre_why_no_use___1,pre_why_no_use___2,pre_why_no_use___3,...,pre_likely_to_use_newer,pre_wait_time_to_use,pre_who_determines,prerollout_survey_complete,pre_barriers,barriers_coded_1,barriers_coded_2,pre_lacking_features,lack_features_coded_1,lack_features_coded_2
0,1,,11/19/14 02:21 PM,1.0,1.0,8.0,0.0,0,0,0,...,,,,2,Not having it at the point of care And not ge...,integration,integration,Prognosis,specific calculator feature,specific calculator feature
1,2,,11/19/14 02:31 PM,1.0,1.0,8.0,1.0,0,0,0,...,1.0,5.0,1.0,2,,,,,,
2,3,,11/19/14 02:24 PM,1.0,1.0,7.0,1.0,0,0,0,...,3.0,4.0,1.0,2,,,,,,
3,4,,11/19/14 02:25 PM,0.0,4.0,9.0,0.0,0,0,0,...,,,,2,I find that I do not need them in my practice,necessity,necessity,none,none,none
4,5,,11/19/14 02:36 PM,1.0,3.0,5.0,1.0,0,0,0,...,3.0,3.0,1.0,2,Phone battery,technical,technical,,none,none


## Perform a chi-squared test of independence between years of experience and medical calculator use

We would like to know if there is a potential relationship between experience level of physicians and use of medical calculators (`pre_calculator_use`).


### Create categorical data for years of experience
Using information from the data dictionary, create a new column in the data set that categorizes years of experience (`pre_yrs_experience`) into the following buckets:

  * 0: 1-6 years of experience (residents/early career)
  * 1: 7-20 years of experience (mid career)
  * 2: > 20 years of experience (highly experienced)
  
The current years of experience measurement is already categorical, so you will need to translate those categories to these new categories.  
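
One vectorized way to do this recoding is `pd.cut`. The sketch below is only an illustration on made-up category codes. It assumes, as the solution code in the next cell does, that codes up to 6 belong to the first bucket, codes 7 and 8 to the second, and anything higher to the third; check these cut points against the data dictionary before relying on them.

```python
import numpy as np
import pandas as pd

# Hypothetical category codes standing in for data['pre_yrs_experience']
codes = pd.Series([1.0, 6.0, 7.0, 8.0, 9.0, np.nan])

# Right-closed bins: (-inf, 6] -> '1-6', (6, 8] -> '7-20', (8, inf) -> '>20'
exp_3_levels = pd.cut(
    codes,
    bins=[-np.inf, 6, 8, np.inf],
    labels=['1-6', '7-20', '>20'],
)
print(exp_3_levels.tolist())
```

With `pd.cut`, a missing code stays missing instead of being assigned to a bucket, and the result is already a categorical column.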

In [2]:
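def create_3_levels(row):
    """Collapse the raw `pre_yrs_experience` category code into 3 experience levels."""
    # The thresholds compare category codes from the data dictionary, not years.
    # Note: a missing value (NaN) fails both comparisons and falls into the last bucket.
    if row['pre_yrs_experience'] <= 6:
        return 0  # 1-6 years
    elif row['pre_yrs_experience'] <= 8:
        return 1  # 7-20 years
    else:
        return 2  # > 20 years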

def choice_to_dict(data_dict, variable_name):
    """Build a {code: label} mapping for a field from the REDCap data dictionary."""
    field = data_dict[data_dict['Variable / Field Name'] == variable_name]
    field_type = field['Field Type'].values[0]

    # yesno and truefalse fields use fixed choices rather than the choices column
    if field_type == 'yesno':
        choices = '1, yes | 0, no'
    elif field_type == 'truefalse':
        choices = '1, true | 0, false'
    else:
        choices = field['Choices, Calculations, OR Slider Labels'].values[0]
    mapping = {}
    for choice in choices.split('|'):
        # Split on the first comma only so labels that contain commas stay intact
        value_pair = choice.strip().split(',', 1)
        mapping[int(value_pair[0])] = value_pair[1].strip()
    return mapping
            
def categorize(df, col_name, mapping=None):
    if mapping:
        df[col_name] = pd.Categorical(df[col_name].map(mapping))
    else:
        df[col_name] = pd.Categorical(df[col_name])
        
data['exp_3_levels'] = data.apply(create_3_levels, axis=1)

categorize(data, 'exp_3_levels', {0:'1-6', 1:'7-20',2:'>20'})
categorize(data, 'pre_calculator_use', choice_to_dict(data_dict, 'pre_calculator_use'))
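
To see what the choice parsing does in isolation, here is a minimal sketch that works on a single choices string of the form `code, label | code, label`. The function name `parse_choices` and the sample string are hypothetical and not part of the survey data. Splitting on the first comma only is a design choice that keeps labels containing commas intact.

```python
def parse_choices(choices):
    """Turn 'code, label | code, label' into a {code: label} dict."""
    mapping = {}
    for choice in choices.split('|'):
        # maxsplit=1: everything after the first comma is the label
        code, label = choice.split(',', 1)
        mapping[int(code)] = label.strip()
    return mapping

# Hypothetical choices string; the second label contains a comma
sample = '1, Physician | 2, Nurse, RN | 3, Other'
print(parse_choices(sample))
```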

### Display a cross tab of the new 3-level experience column against calculator use
The cross tab will provide a visual aid to go along with our statistical test.  In the cross tab, show counts or percents (or both if you're feeling adventurous) for each combination of variables.  To ease the interpretation, display the cross tab headers using text from the metadata rather than the raw numeric values. Refer to the [Pandas crosstab documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html) if necessary.

In [9]:
def crosstab_with_pct(row, col):
    cnt = pd.crosstab(row, col, margins=True, margins_name='Total')
    pct = pd.crosstab(row, col, margins=True, margins_name='Total', normalize='index').round(3)*100
    joint = cnt.join(pct, lsuffix='_cnt', rsuffix='_pct')
    return joint

display(crosstab_with_pct(data['exp_3_levels'], data['pre_calculator_use']))

exp_3_levels,no_cnt,yes_cnt,Total,no_pct,yes_pct
1-6,3,37,40,7.5,92.5
7-20,9,31,40,22.5,77.5
>20,18,20,38,47.4,52.6
Total,30,88,118,25.4,74.6
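
The percentage columns are row percentages: each count divided by its row total. As a quick check, the sketch below rebuilds them from the counts shown in the table above. The counts are typed in by hand here, so this is an illustration rather than part of the workflow.

```python
import pandas as pd

# Counts copied from the cross tab above (without the Total margins)
counts = pd.DataFrame(
    {'no': [3, 9, 18], 'yes': [37, 31, 20]},
    index=pd.Index(['1-6', '7-20', '>20'], name='exp_3_levels'),
)

# Divide each row by its row total, then express as a percentage
row_pct = counts.div(counts.sum(axis=1), axis=0).mul(100).round(1)
print(row_pct)
```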


### Perform the statistical test
Perform the chi-squared test of independence and interpret the result. Use $\alpha = 0.05$ for the significance level. You can use scipy's [implementation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html) of the chi-squared test. Refer to [Lab 3](../labs/lab-01-03.ipynb) if necessary.

In [4]:
from scipy import stats

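# Count observations for each (experience level, calculator use) combination
groupsizes = data.groupby(['exp_3_levels', 'pre_calculator_use']).size()
# Pivot into a contingency table: rows = calculator use, columns = experience level
ctsum = groupsizes.unstack('exp_3_levels')
# chi2_contingency returns (statistic, p-value, degrees of freedom, expected frequencies)
chi_sq, p_value, degrees_freedom, expected = stats.chi2_contingency(ctsum.fillna(0))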

print('chi square statistic', chi_sq)
print('p-value', p_value)
print('\nexpected frequencies:')
print(expected)
print('\nactual frequencies:')
print(ctsum)


chi square statistic 16.609629186602874
p-value 0.0002473231953882331

expected frequencies:
[[10.16949153 10.16949153  9.66101695]
 [29.83050847 29.83050847 28.33898305]]

actual frequencies:
exp_3_levels        1-6  7-20  >20
pre_calculator_use                
no                    3     9   18
yes                  37    31   20


In [8]:
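from scipy import stats

# NOTE: `ct` is not defined in this section. The expected values below match the full
# crosstab_with_pct() table (Total row, Total and percentage columns), which is not a
# valid contingency table; prefer the 2-degrees-of-freedom result from the previous cell.
stats.chi2_contingency(ct)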

(122.82936160671345,
 1.6840147243265957e-20,
 12,
 array([[ 12.3853211 ,  36.33027523,  48.71559633,  21.22018349,
          61.34862385],
        [ 12.3853211 ,  36.33027523,  48.71559633,  21.22018349,
          61.34862385],
        [ 12.11009174,  35.52293578,  47.63302752,  20.74862385,
          59.9853211 ],
        [ 23.11926606,  67.81651376,  90.93577982,  39.61100917,
         114.51743119]]))

### Interpretation
Given the p-value, do we reject the null hypothesis that the two variables are independent?
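
The decision rule itself can be written down in a few lines. This is a minimal sketch: the helper name `decide` is hypothetical, and the p-value is copied from the test output on the 2×3 table above.

```python
def decide(p_value, alpha=0.05):
    """Return the test decision for a given p-value and significance level."""
    if p_value < alpha:
        return 'reject the null hypothesis of independence'
    return 'fail to reject the null hypothesis of independence'

# p-value copied from the chi-squared output above
p_value = 0.0002473231953882331
print(decide(p_value))
```

Rejecting the null hypothesis indicates an association between experience level and calculator use in this sample; it does not by itself say anything about causation.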