# Module 1 Exercise 1 - Analyzing Adverse Events in a Clinical Trial

## Overview
In this exercise, you will analyze the adverse events reported in the clinical trial described by the SDTM-formatted datasets. You will tabulate adverse events by frequency and make statistical inferences about those frequencies while taking the study subjects' race into account.


## File Formats
Files are located in the `resources/SDTM_sample` subfolder of this module. The files are in SDTM format and are the same set used in the lab and practice exercise.
  * [SDTM](resources/SDTM_v1.8.pdf)
  * [SDTM Implementation Guide](resources/SDTMIG_v3.3_FINAL.pdf)


## Required Output
You will respond to the questions located in the Quiz for this exercise in the Canvas site for this course.
        
## Grading
There are two parts to submission of this exercise. The first is submission of this notebook, and is worth 10 points. Not submitting code will result in a loss of 10 points. Submitting code that is not functional will result in a loss of 5 points.

The second part of the exercise is submission of the answers via the associated Canvas quiz. Each correct answer on the Canvas Quiz is worth 2 points.

Any numeric answer typed into Canvas will be considered correct if it is within $\pm$1% of the reference answer. Answers in which you select a given choice will be graded based on the identified correct choice(s). For multi-select questions, partial credit is given if some of the correct answers are selected.
    

In [1]:
import sys
!{sys.executable} -m pip install --upgrade "pandas>=1.1"
!{sys.executable} -m pip install xmltodict

import pandas as pd
import numpy as np




## Find and load the dataset that contains adverse events

Consult one or both of the SDTM documents:
  * [SDTM](resources/SDTM_v1.8.pdf)
  * [SDTM Implementation Guide](resources/SDTMIG_v3.3_FINAL.pdf)
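
One way to locate the right file is to list the transport files in the sample folder and match each file name against an SDTM domain code. The sketch below assumes that every file is named after its two-letter domain code (as `ae.xpt` and `dm.xpt` are in this notebook). The helper `list_xpt_domains` and the small lookup table are illustrative names, not part of the course materials.

```python
from pathlib import Path

# Two SDTM domain codes used in this exercise; consult the SDTMIG for the rest
DOMAIN_NAMES = {"ae": "Adverse Events", "dm": "Demographics"}


def list_xpt_domains(folder):
    """Map each .xpt file stem in `folder` to a domain name, if known.

    Hypothetical helper: assumes file names match SDTM domain codes.
    Returns an empty dict when the folder does not exist.
    """
    folder = Path(folder)
    if not folder.is_dir():
        return {}
    return {
        path.stem.lower(): DOMAIN_NAMES.get(path.stem.lower(), "unknown (see the SDTMIG)")
        for path in sorted(folder.glob("*.xpt"))
    }


# Path as used by the code cells in this notebook
print(list_xpt_domains("../resources/SDTM_sample"))
```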

In [46]:
# your code here

# Adverse events are stored in the AE domain, a SAS XPORT (.xpt) transport file
with open('../resources/SDTM_sample/ae.xpt', 'rb') as f:
    ae = pd.read_sas(f, format='xport', encoding='utf-8')
    
display(ae.head())

,STUDYID,DOMAIN,USUBJID,AESEQ,AESPID,AETERM,AELLT,AELLTCD,AEDECOD,AEPTCD,...,AESHOSP,AESLIFE,AESOD,EPOCH,AEDTC,AESTDTC,AEENDTC,AEDY,AESTDY,AEENDY
0,CDISCPILOT01,AE,01-701-1015,1.0,E07,APPLICATION SITE ERYTHEMA,APPLICATION SITE REDNESS,,APPLICATION SITE ERYTHEMA,,...,N,N,N,TREATMENT,2014-01-16,2014-01-03,,15.0,2.0,
1,CDISCPILOT01,AE,01-701-1015,2.0,E08,APPLICATION SITE PRURITUS,APPLICATION SITE ITCHING,,APPLICATION SITE PRURITUS,,...,N,N,N,TREATMENT,2014-01-16,2014-01-03,,15.0,2.0,
2,CDISCPILOT01,AE,01-701-1015,3.0,E06,DIARRHOEA,DIARRHEA,,DIARRHOEA,,...,N,N,N,TREATMENT,2014-01-16,2014-01-09,2014-01-11,15.0,8.0,10.0
3,CDISCPILOT01,AE,01-701-1023,3.0,E10,ATRIOVENTRICULAR BLOCK SECOND DEGREE,AV BLOCK SECOND DEGREE,,ATRIOVENTRICULAR BLOCK SECOND DEGREE,,...,N,N,N,TREATMENT,2012-08-27,2012-08-26,,23.0,22.0,
4,CDISCPILOT01,AE,01-701-1023,2.0,E09,ERYTHEMA,LOCALIZED ERYTHEMA,,ERYTHEMA,,...,N,N,N,TREATMENT,2012-08-27,2012-08-07,,23.0,3.0,


## Find and load the dataset that contains subject demographics

In [47]:
# your code here

# Subject demographics (including RACE and ARM) are stored in the DM domain
with open('../resources/SDTM_sample/dm.xpt', 'rb') as f:
    dm = pd.read_sas(f, format='xport', encoding='utf-8')
    
display(dm.head())

,STUDYID,DOMAIN,USUBJID,SUBJID,RFSTDTC,RFENDTC,RFXSTDTC,RFXENDTC,RFICDTC,RFPENDTC,...,SEX,RACE,ETHNIC,ARMCD,ARM,ACTARMCD,ACTARM,COUNTRY,DMDTC,DMDY
0,CDISCPILOT01,DM,01-701-1015,1015,2014-01-02,2014-07-02,2014-01-02,2014-07-02,,2014-07-02T11:45,...,F,WHITE,HISPANIC OR LATINO,Pbo,Placebo,Pbo,Placebo,USA,2013-12-26,-7.0
1,CDISCPILOT01,DM,01-701-1023,1023,2012-08-05,2012-09-02,2012-08-05,2012-09-01,,2013-02-18,...,M,WHITE,HISPANIC OR LATINO,Pbo,Placebo,Pbo,Placebo,USA,2012-07-22,-14.0
2,CDISCPILOT01,DM,01-701-1028,1028,2013-07-19,2014-01-14,2013-07-19,2014-01-14,,2014-01-14T11:10,...,M,WHITE,NOT HISPANIC OR LATINO,Xan_Hi,Xanomeline High Dose,Xan_Hi,Xanomeline High Dose,USA,2013-07-11,-8.0
3,CDISCPILOT01,DM,01-701-1033,1033,2014-03-18,2014-04-14,2014-03-18,2014-03-31,,2014-09-15,...,M,WHITE,NOT HISPANIC OR LATINO,Xan_Lo,Xanomeline Low Dose,Xan_Lo,Xanomeline Low Dose,USA,2014-03-10,-8.0
4,CDISCPILOT01,DM,01-701-1034,1034,2014-07-01,2014-12-30,2014-07-01,2014-12-30,,2014-12-30T09:50,...,F,WHITE,NOT HISPANIC OR LATINO,Xan_Hi,Xanomeline High Dose,Xan_Hi,Xanomeline High Dose,USA,2014-06-24,-7.0


## Join adverse events and subjects

In [50]:
# your code here

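# Left join: keep every adverse event and attach the subject's demographics.
# DOMAIN exists in both frames, so pandas suffixes it as DOMAIN_x / DOMAIN_y.
join_df = ae.merge(dm, on=['STUDYID', 'USUBJID'], how='left')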

display(join_df.head())

,STUDYID,DOMAIN_x,USUBJID,AESEQ,AESPID,AETERM,AELLT,AELLTCD,AEDECOD,AEPTCD,...,SEX,RACE,ETHNIC,ARMCD,ARM,ACTARMCD,ACTARM,COUNTRY,DMDTC,DMDY
0,CDISCPILOT01,AE,01-701-1015,1.0,E07,APPLICATION SITE ERYTHEMA,APPLICATION SITE REDNESS,,APPLICATION SITE ERYTHEMA,,...,F,WHITE,HISPANIC OR LATINO,Pbo,Placebo,Pbo,Placebo,USA,2013-12-26,-7.0
1,CDISCPILOT01,AE,01-701-1015,2.0,E08,APPLICATION SITE PRURITUS,APPLICATION SITE ITCHING,,APPLICATION SITE PRURITUS,,...,F,WHITE,HISPANIC OR LATINO,Pbo,Placebo,Pbo,Placebo,USA,2013-12-26,-7.0
2,CDISCPILOT01,AE,01-701-1015,3.0,E06,DIARRHOEA,DIARRHEA,,DIARRHOEA,,...,F,WHITE,HISPANIC OR LATINO,Pbo,Placebo,Pbo,Placebo,USA,2013-12-26,-7.0
3,CDISCPILOT01,AE,01-701-1023,3.0,E10,ATRIOVENTRICULAR BLOCK SECOND DEGREE,AV BLOCK SECOND DEGREE,,ATRIOVENTRICULAR BLOCK SECOND DEGREE,,...,M,WHITE,HISPANIC OR LATINO,Pbo,Placebo,Pbo,Placebo,USA,2012-07-22,-14.0
4,CDISCPILOT01,AE,01-701-1023,2.0,E09,ERYTHEMA,LOCALIZED ERYTHEMA,,ERYTHEMA,,...,M,WHITE,HISPANIC OR LATINO,Pbo,Placebo,Pbo,Placebo,USA,2012-07-22,-14.0
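
A join like this is easy to get subtly wrong, so it can help to let pandas check the assumptions. The sketch below uses small made-up frames (`ae_toy` and `dm_toy` are stand-ins, not study data) and assumes each subject appears exactly once in the demographics table. `validate` and `indicator` are standard `DataFrame.merge` parameters.

```python
import pandas as pd

# Toy stand-ins for the AE and DM domains (made-up values, not study data)
ae_toy = pd.DataFrame({
    "STUDYID": ["S1"] * 3,
    "USUBJID": ["001", "001", "002"],
    "AETERM": ["HEADACHE", "NAUSEA", "HEADACHE"],
})
dm_toy = pd.DataFrame({
    "STUDYID": ["S1"] * 2,
    "USUBJID": ["001", "002"],
    "RACE": ["WHITE", "BLACK OR AFRICAN AMERICAN"],
})

merged = ae_toy.merge(
    dm_toy,
    on=["STUDYID", "USUBJID"],
    how="left",
    validate="many_to_one",  # raises MergeError if a subject is duplicated in dm_toy
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Events whose subject has no demographics row would show up as 'left_only'
unmatched = int((merged["_merge"] == "left_only").sum())
print(len(merged), unmatched)
```

With a left join and a many-to-one relationship, the merged frame should have exactly as many rows as the adverse events frame.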


## Find the frequency of adverse events by the adverse event term
### Quiz Week 1 Exercise 1 Question 1
What is the most common adverse event?

In [51]:
# your code here

count = join_df['AETERM'].value_counts()
print(count)

APPLICATION SITE PRURITUS    70
PRURITUS                     66
ERYTHEMA                     45
APPLICATION SITE ERYTHEMA    41
RASH                         35
                             ..
HEART RATE INCREASED          1
PAROSMIA                      1
ALCOHOL USE                   1
STUPOR                        1
AMNESIA                       1
Name: AETERM, Length: 242, dtype: int64


### Quiz Week 1 Exercise 1 Question 2
What is the most common adverse event for race = `WHITE`?

In [52]:
# your code here

white = join_df[join_df["RACE"] == "WHITE"]

count = white['AETERM'].value_counts()
print(count)

APPLICATION SITE PRURITUS                         63
PRURITUS                                          59
APPLICATION SITE ERYTHEMA                         41
ERYTHEMA                                          41
RASH                                              35
                                                  ..
PARKINSON'S DISEASE                                1
COMPLETED SUICIDE                                  1
HYPERBILIRUBINAEMIA                                1
WOUND HAEMORRHAGE                                  1
PARTIAL SEIZURES WITH SECONDARY GENERALISATION     1
Name: AETERM, Length: 234, dtype: int64
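
Instead of filtering one race at a time, the most common term for every race can be found in a single pass. This is a minimal sketch on made-up data (the frame `toy` is illustrative). Note that `idxmax` returns only the first term when several terms tie for the highest count.

```python
import pandas as pd

# Made-up events, for illustration only
toy = pd.DataFrame({
    "RACE": ["WHITE", "WHITE", "WHITE",
             "BLACK OR AFRICAN AMERICAN", "BLACK OR AFRICAN AMERICAN"],
    "AETERM": ["RASH", "RASH", "COUGH", "NAUSEA", "NAUSEA"],
})

# For each race, count the terms and keep the label with the largest count
top_per_race = toy.groupby("RACE")["AETERM"].agg(lambda s: s.value_counts().idxmax())
print(top_per_race)
```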


### Quiz Week 1 Exercise 1 Question 3
How many distinct adverse events were documented 3 or more times for race = `BLACK OR AFRICAN AMERICAN`?

In [59]:
# your code here

black = join_df[join_df["RACE"] == "BLACK OR AFRICAN AMERICAN"]

# Keep only the terms documented 3 or more times (count > 2)
count = black['AETERM'].value_counts().loc[lambda x: x>2]
print(count)

APPLICATION SITE PRURITUS    7
NAUSEA                       6
PRURITUS                     6
HEADACHE                     4
BACK PAIN                    3
DIZZINESS                    3
ERYTHEMA                     3
Name: AETERM, dtype: int64
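
The question asks for a number, so it may be convenient to compute it directly rather than counting printed rows. A minimal sketch on made-up terms (the series `terms` is illustrative):

```python
import pandas as pd

# Made-up event terms, for illustration only
terms = pd.Series(["NAUSEA"] * 4 + ["HEADACHE"] * 3 + ["COUGH"] * 2 + ["RASH"])

term_counts = terms.value_counts()
# A boolean mask sums to the number of terms that meet the threshold
n_frequent = int((term_counts >= 3).sum())
print(n_frequent)
```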


## Display a cross tab of adverse events by trial arm
Display percentages rather than total counts, and show the total percentage for each arm. Use this cross tab to answer the following question.

### Quiz Week 1 Exercise 1 Question 4
Which Trial Arm had the highest percentage of adverse events? 

In [110]:
# your code here

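# normalize='index' turns each row into proportions across arms;
# margins=True appends an 'All' row with each arm's share of all adverse events
ct = pd.crosstab(join_df['AETERM'], join_df['ARM'], normalize='index', margins=True)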
ct

ARM,Placebo,Xanomeline High Dose,Xanomeline Low Dose
AETERM,,,
ABDOMINAL DISCOMFORT,0.000000,1.000000,0.000000
ABDOMINAL PAIN,0.166667,0.333333,0.500000
ACROCHORDON EXCISION,0.000000,1.000000,0.000000
ACTINIC KERATOSIS,0.000000,1.000000,0.000000
AGITATION,0.400000,0.200000,0.400000
...,...,...,...
WHITE BLOOD CELL COUNT INCREASED,0.000000,0.000000,1.000000
WOLFF-PARKINSON-WHITE SYNDROME,0.000000,0.000000,1.000000
WOUND,0.000000,0.000000,1.000000
WOUND HAEMORRHAGE,0.000000,1.000000,0.000000
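
The per-arm totals can also be read from the margins of a raw-count cross tab. The sketch below uses made-up data with hypothetical arm labels (`Arm A`, `Arm B`, `Arm C`) and divides the `All` row by the grand total to get each arm's share of all events.

```python
import pandas as pd

# Made-up events with hypothetical arm labels, for illustration only
toy = pd.DataFrame({
    "ARM": ["Arm A", "Arm B", "Arm B", "Arm C", "Arm B", "Arm C", "Arm A", "Arm B"],
    "AETERM": ["RASH", "RASH", "COUGH", "RASH", "NAUSEA", "COUGH", "COUGH", "RASH"],
})

# Raw counts with an 'All' row and an 'All' column
ct_counts = pd.crosstab(toy["AETERM"], toy["ARM"], margins=True)

# Column totals (excluding the grand total) divided by the grand total
arm_share = ct_counts.loc["All"].drop("All") / ct_counts.loc["All", "All"]
print(arm_share)
print(arm_share.idxmax())
```

This counts events, not subjects, so an arm with more enrolled subjects can show a larger share for that reason alone.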


## Perform a two-way contingency table analysis
Determine whether the occurrence of adverse events by race is independent of the study arm. Use only the 15 most frequent adverse events across all arms in the study. Because other races are not well represented in the data, consider only subjects whose race is `WHITE` or `BLACK OR AFRICAN AMERICAN`, both when finding the top adverse events and when comparing occurrences by race and trial arm.
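
As a rough illustration of the mechanics, the sketch below runs a chi-square test of independence on a small made-up race-by-arm table (the counts and arm labels are hypothetical, not study results). It also counts the cells whose expected frequency is below 5, a common rule of thumb for when the chi-square approximation becomes unreliable.

```python
import pandas as pd
from scipy import stats

# Made-up counts: rows are races, columns are hypothetical trial arms
table = pd.DataFrame(
    [[30, 45, 40], [10, 12, 8]],
    index=["WHITE", "BLACK OR AFRICAN AMERICAN"],
    columns=["Arm A", "Arm B", "Arm C"],
)

chi2, p_value, dof, expected = stats.chi2_contingency(table)

# Reject the null hypothesis of independence when p is below alpha
alpha = 0.05
reject = p_value < alpha

# Rule-of-thumb check: how many expected counts fall below 5?
small_cells = int((expected < 5).sum())
print(round(p_value, 3), dof, reject, small_cells)
```

The degrees of freedom equal (rows - 1) x (columns - 1), which is a quick check that the table has the shape you intended.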

### Quiz Week 1 Exercise 1 Question 5
Do we reject the null hypothesis that the frequencies of occurrence of the top 15 adverse events by study arm are equal for the two races, at $\alpha$ = 0.05?

### Quiz Week 1 Exercise 1 Question 6
What is the p-value for the test, rounded to three decimal places?

In [98]:
# your code here

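# Filter to the two races first so the top 15 terms come from these subjects only
black_white = ['WHITE', 'BLACK OR AFRICAN AMERICAN']
race_df = join_df[join_df['RACE'].isin(black_white)]
top_15 = race_df['AETERM'].value_counts().head(15).index.tolist()
new_df = race_df[race_df['AETERM'].isin(top_15)]

display(new_df.head())
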
,STUDYID,DOMAIN_x,USUBJID,AESEQ,AESPID,AETERM,AELLT,AELLTCD,AEDECOD,AEPTCD,...,SEX,RACE,ETHNIC,ARMCD,ARM,ACTARMCD,ACTARM,COUNTRY,DMDTC,DMDY
0,CDISCPILOT01,AE,01-701-1015,1.0,E07,APPLICATION SITE ERYTHEMA,APPLICATION SITE REDNESS,,APPLICATION SITE ERYTHEMA,,...,F,WHITE,HISPANIC OR LATINO,Pbo,Placebo,Pbo,Placebo,USA,2013-12-26,-7.0
1,CDISCPILOT01,AE,01-701-1015,2.0,E08,APPLICATION SITE PRURITUS,APPLICATION SITE ITCHING,,APPLICATION SITE PRURITUS,,...,F,WHITE,HISPANIC OR LATINO,Pbo,Placebo,Pbo,Placebo,USA,2013-12-26,-7.0
2,CDISCPILOT01,AE,01-701-1015,3.0,E06,DIARRHOEA,DIARRHEA,,DIARRHOEA,,...,F,WHITE,HISPANIC OR LATINO,Pbo,Placebo,Pbo,Placebo,USA,2013-12-26,-7.0
4,CDISCPILOT01,AE,01-701-1023,2.0,E09,ERYTHEMA,LOCALIZED ERYTHEMA,,ERYTHEMA,,...,M,WHITE,HISPANIC OR LATINO,Pbo,Placebo,Pbo,Placebo,USA,2012-07-22,-14.0
5,CDISCPILOT01,AE,01-701-1023,4.0,E08,ERYTHEMA,ERYTHEMA,,ERYTHEMA,,...,M,WHITE,HISPANIC OR LATINO,Pbo,Placebo,Pbo,Placebo,USA,2012-07-22,-14.0


In [117]:
# Count events per term (rows) for each arm and race combination (columns)
race_arm = pd.crosstab(new_df['AETERM'], columns=[new_df['ARM'], new_df['RACE']])
race_arm

ARM,Placebo,Placebo,Xanomeline High Dose,Xanomeline High Dose,Xanomeline Low Dose,Xanomeline Low Dose
RACE,BLACK OR AFRICAN AMERICAN,WHITE,BLACK OR AFRICAN AMERICAN,WHITE,BLACK OR AFRICAN AMERICAN,WHITE
AETERM,,,,,,
APPLICATION SITE DERMATITIS,0,8,2,9,0,12
APPLICATION SITE ERYTHEMA,0,3,0,19,0,19
APPLICATION SITE IRRITATION,0,5,0,15,0,14
APPLICATION SITE PRURITUS,3,5,2,29,2,29
COUGH,0,3,1,3,0,7
DIARRHOEA,1,8,1,3,0,5
DIZZINESS,0,2,2,15,1,9
ERYTHEMA,0,11,2,14,1,16
HEADACHE,3,5,1,6,0,3
HYPERHIDROSIS,0,2,1,7,0,4


In [118]:
from scipy import stats

# Chi-square test of independence; returns (statistic, p-value, dof, expected counts)
stats.chi2_contingency(race_arm)

(109.7588445951383,
 0.001692244035433622,
 70,
 array([[ 0.6       ,  4.93333333,  1.53333333, 11.73333333,  0.33333333,
         11.86666667],
        [ 0.79354839,  6.52473118,  2.02795699, 15.51827957,  0.44086022,
         15.69462366],
        [ 0.65806452,  5.41075269,  1.68172043, 12.8688172 ,  0.3655914 ,
         13.01505376],
        [ 1.35483871, 11.13978495,  3.46236559, 26.49462366,  0.75268817,
         26.79569892],
        [ 0.27096774,  2.22795699,  0.69247312,  5.29892473,  0.15053763,
          5.35913978],
        [ 0.3483871 ,  2.86451613,  0.89032258,  6.81290323,  0.19354839,
          6.89032258],
        [ 0.56129032,  4.61505376,  1.4344086 , 10.97634409,  0.31182796,
         11.10107527],
        [ 0.8516129 ,  7.00215054,  2.17634409, 16.65376344,  0.47311828,
         16.84301075],
        [ 0.3483871 ,  2.86451613,  0.89032258,  6.81290323,  0.19354839,
          6.89032258],
        [ 0.27096774,  2.22795699,  0.69247312,  5.29892473,  0.15053763,
     