In [20]:
pwd

'/Users/amina/Downloads/data'

In [55]:
pwd

'/Users/amina'

Task 2
In this assignment, you will analyze medical data by running statistical tests on a given data set. You will check which variables are potentially a cause of a patient's death.
1. For categorical variables, perform the chi-squared test of independence between each categorical variable and the death variable. Treat a variable as categorical if its dtype is int64.
2. For numerical variables, perform two Shapiro-Wilk tests: one for each sample created by splitting the data by the death variable.
2.1. If the p-values from the Shapiro-Wilk tests indicate that both samples have a normal distribution (both p-values greater than 0.05), perform the unpaired t-test with the parameter equal_var = False.
2.2. Otherwise, perform the Mann-Whitney U test.
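The decision rule in steps 2.1 and 2.2 can be sketched independently of the data. The helper below is hypothetical (the name `choose_test` and the `alpha` parameter are not part of the task); it takes the two Shapiro-Wilk p-values and returns the result key under which the variable would be reported. The p-values in the calls are taken from the task's example output.

```python
def choose_test(p_alive, p_dead, alpha=0.05):
    """Pick the follow-up test for a numerical variable.

    p_alive, p_dead: Shapiro-Wilk p-values for the death == 0 and
    death == 1 samples. Hypothetical helper, not part of the task API.
    """
    # Both samples are treated as normal only if both p-values exceed alpha.
    if p_alive > alpha and p_dead > alpha:
        return "ttest"  # unpaired t-test with equal_var=False
    return "mann_whitney"  # at least one sample does not look normal


print(choose_test(0.5272, 0.37))   # DBP: both above 0.05 -> ttest
print(choose_test(0.0, 0.0071))    # Na+: both below 0.05 -> mann_whitney
```

Note that a single p-value at or below 0.05 is enough to fall back to the Mann-Whitney U test.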
Requirements
Implement a function perform_tests which accepts one argument:
• data: a pandas DataFrame consisting of the following columns: death (an indicator of whether a patient died less than a year after the operation) and 17 other variables (either categorical or numerical) describing the patient's health condition after the operation and the medications taken.
The function returns a dictionary with the following four keys:
• mann_whitney, ttest, chi_square: each of these is a list of tuples of the form (variable name, p-value from the corresponding test). For chi_square, these should be the categorical variables; for mann_whitney, the numerical variables that don't have a normal distribution; and for ttest, the numerical variables with a normal distribution.
• shapiro_wilk: a list of tuples of the form (variable name, (p-value for sample with death=0, p-value for sample with death=1)). These should be all numerical variables.
Round all p-values in the output to four decimal places.
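The nesting of the return value is easy to get wrong, so here is a minimal sketch of the expected shape. The helper names `plain_entry` and `shapiro_entry` are assumptions made for illustration only; the values come from the task's example output.

```python
def plain_entry(name, p_value):
    """Entry for 'mann_whitney', 'ttest' and 'chi_square': (name, p-value)."""
    return (name, round(p_value, 4))


def shapiro_entry(name, p_alive, p_dead):
    """Entry for 'shapiro_wilk': (name, (p for death=0, p for death=1))."""
    return (name, (round(p_alive, 4), round(p_dead, 4)))


result = {
    "mann_whitney": [],
    "ttest": [plain_entry("PLT", 0.4739)],
    "chi_square": [plain_entry("ivabradine", 0.0144)],
    "shapiro_wilk": [shapiro_entry("PLT", 0.2361, 0.6935)],
}
print(result["shapiro_wilk"])  # [('PLT', (0.2361, 0.6935))]
```

The two Shapiro-Wilk p-values sit in their own inner tuple, so each shapiro_wilk entry has two elements rather than three.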

Example
With data limited to the following columns:

example_data = data[["death", "Na+", "DBP", "PLT", "ivabradine", "MRA"]]

the function perform_tests(example_data) will return:
{'mann_whitney': [('Na+', 0.2143)],
'ttest': [('DBP', 0.0), ('PLT', 0.4739)],
'chi_square': [('ivabradine', 0.0144), ('MRA', 0.2884)],
'shapiro_wilk': [('Na+', (0.0, 0.0071)), ('PLT', (0.2361, 0.6935)), ('DBP', (0.5272, 0.37))]
}
Hints:
Use the scipy.stats package to perform all tests.
In addition to the Python 3.8 standard library you can use SciPy 1.5.2.


def perform_tests(data):

    return {
        'mann_whitney': None,
        'ttest': None,
        'chi_square': None,
        'shapiro_wilk': None
    }

In [47]:
import pandas as pd
df = pd.read_csv('/Users/amina/Downloads/data/medical_data.csv')
df.head()

Unnamed: 0,death,amiodarone,loop_diuretics,ivabradine,ARB,digoxin,MRA,heart_failure,AOS,SBP,DBP,PLT,LDL,HDL,LVEF,Na+,K+,MPV
0,0,0,0,1,1,1,0,2,0,62.0,126.0,196.0,50.0,42.0,28.0,136.0,4.3,9.9
1,0,1,1,0,1,0,1,0,1,72.0,108.0,245.0,59.0,85.0,25.0,147.0,4.58,13.3
2,0,0,0,0,0,1,1,2,2,73.0,109.0,219.0,79.0,61.0,14.0,133.0,4.05,1.5
3,0,0,0,1,1,1,0,2,1,55.0,114.0,294.0,97.0,55.0,8.0,150.0,5.34,7.8
4,0,0,1,0,1,1,1,2,2,70.0,95.0,293.0,96.0,30.0,28.0,151.0,5.25,6.8


In [50]:
from scipy.stats import chi2_contingency, shapiro, ttest_ind, mannwhitneyu
import pandas as pd

def perform_tests(data):
    result = {
        'mann_whitney': [],
        'ttest': [],
        'chi_square': [],
        'shapiro_wilk': []
    }

    for col in data.columns:
        # Skip the grouping variable itself; it is not tested against itself
        if col == 'death':
            continue

        if data[col].dtype == 'int64':
            # Categorical variable: chi-squared test of independence against death
            tmp = pd.crosstab(data['death'], data[col])
            chi2, p, dof, expected = chi2_contingency(tmp)
            result['chi_square'].append((col, round(p, 4)))
        else:
            # Numerical variable: Shapiro-Wilk test on each of the two samples
            death_false = data[data['death'] == 0][col]
            death_true = data[data['death'] == 1][col]
            p_0 = shapiro(death_false)[1]
            p_1 = shapiro(death_true)[1]
            # The two p-values are nested in their own tuple, as the task requires
            result['shapiro_wilk'].append((col, (round(p_0, 4), round(p_1, 4))))

            # Compare the unrounded p-values against the 0.05 threshold
            if p_0 > 0.05 and p_1 > 0.05:
                # Both samples look normal: unpaired t-test with unequal variances
                t, p = ttest_ind(death_false, death_true, equal_var=False)
                result['ttest'].append((col, round(p, 4)))
            else:
                # Otherwise: Mann-Whitney U test
                u, p = mannwhitneyu(death_false, death_true)
                result['mann_whitney'].append((col, round(p, 4)))
    return result

In [53]:
perform_tests(df[["death", "Na+", "DBP", "PLT", "ivabradine", "MRA"]])

{'mann_whitney': [('Na+', 0.4286)],
 'ttest': [('DBP', 0.0), ('PLT', 0.4739)],
 'chi_square': [('ivabradine', 0.0144), ('MRA', 0.2884)],
 'shapiro_wilk': [('Na+', (0.0, 0.0071)),
  ('DBP', (0.5272, 0.3715)),
  ('PLT', (0.2361, 0.6935))]}

In [54]:
perform_tests(df)

{'mann_whitney': [('HDL', 0.0),
  ('LVEF', 0.0),
  ('Na+', 0.4286),
  ('K+', 0.8569),
  ('MPV', 0.8662)],
 'ttest': [('SBP', 0.0), ('DBP', 0.0), ('PLT', 0.4739), ('LDL', 0.8392)],
 'chi_square': [('amiodarone', 0.0),
  ('loop_diuretics', 0.0),
  ('ivabradine', 0.0144),
  ('ARB', 0.2338),
  ('digoxin', 0.5185),
  ('MRA', 0.2884),
  ('heart_failure', 0.0),
  ('AOS', 0.974)],
 'shapiro_wilk': [('SBP', (0.2581, 0.5881)),
  ('DBP', (0.5272, 0.3715)),
  ('PLT', (0.2361, 0.6935)),
  ('LDL', (0.9367, 0.95)),
  ('HDL', (0.4388, 0.0006)),
  ('LVEF', (0.0, 0.0)),
  ('Na+', (0.0, 0.0071)),
  ('K+', (0.0, 0.1771)),
  ('MPV', (0.0, 0.0003))]}
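
The Mann-Whitney p-value reported for Na+ (0.4286) is exactly twice the value in the task example (0.2143). A likely explanation, stated here as an assumption rather than something verified against this data set, is the SciPy version: in SciPy 1.5.2, `mannwhitneyu` called without `alternative` is documented to return half of the two-sided p-value, while later releases default to `alternative='two-sided'`. The sketch below uses made-up samples to show the relationship between the one-sided and two-sided p-values.

```python
from scipy.stats import mannwhitneyu

# Made-up samples without ties, for illustration only (not from medical_data.csv).
group_a = [1.1, 2.3, 2.9, 3.4, 4.0, 4.8, 5.5, 6.1, 6.7, 7.2]
group_b = [3.0, 4.4, 5.1, 5.9, 6.5, 7.0, 7.9, 8.3, 9.1, 9.8]

# Passing `alternative` explicitly avoids depending on the version-specific default.
_, p_two_sided = mannwhitneyu(group_a, group_b, alternative="two-sided")
_, p_less = mannwhitneyu(group_a, group_b, alternative="less")
_, p_greater = mannwhitneyu(group_a, group_b, alternative="greater")

# The smaller one-sided p-value is half of the two-sided p-value.
p_one_sided = min(p_less, p_greater)
print(round(p_two_sided, 4), round(2 * p_one_sided, 4))
```

If the grader runs SciPy 1.5.2 and expects 0.2143, the installed SciPy version is worth checking before changing the test logic.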