In [1]:
import pandas as pd
import numpy as np

In [2]:
# note: the folder name is spelled with one consistent casing so the paths also resolve on case-sensitive filesystems
impact_ad = pd.read_excel("Data/IMPACT_Study/ADSTART0_2023-05-25_06-46-11.xlsx")
impact_serum = pd.read_excel("Data/IMPACT_Study/Serum antibody_2023-05-25_06-32-15.xlsx")
impact_ige = pd.read_excel("Data/IMPACT_Study/IgE_IgG4_component_2023-05-25_06-30-40.xlsx")
impact_spt = pd.read_excel("Data/IMPACT_Study/Skin Prick Test_2023-05-25_06-31-04.xlsx")

# About Data Sets

#### ADSTART0_2023-05-25_06-46-11 
- Overview of participant data 

#### IgE_IgG4_component_2023-05-25_06-30-40
- IgE component levels in this dataset  

#### Serum antibody_2023-05-25_06-32-15
OFC results:
- "Passed Visit 24 OFC for ITT"
- "Passed Visit 24 OFC No Imputation"	
- "Passed Visit 26 OFC for ITT"
- "Passed Visit 26 OFC No Imputation"

#### Skin Prick Test_2023-05-25_06-31-04
- "Wheal (mm)"

#### BAT data_2023-05-25_06-31-20
- Stands for "basophil activation test (BAT)"
- No known use case for this model 

---

# Acronyms 
- Intent-to-treat (ITT)
- Oral Immunotherapy (OIT)
- Initial Dose Escalation (IDE)

# About Study 
- taken from 2022 Lancet_IMPACT.pdf (pdf page 3)  

Children aged 12 months or older and younger than 
48 months were screened for inclusion in the study. 

__Inclusion criteria__ included the following: a clinical 
history of peanut allergy or avoidance without ever 
having eaten peanut, peanut-specific IgE levels of 5 kUA/L 
or higher, a skin prick test (SPT) wheal size greater than 
that of saline control by 3 mm or more, and a positive 
reaction to a cumulative dose of 500 mg or less of peanut 
in a double-blind, placebo-controlled food challenge 
(DBPCFC).   

__Exclusion criteria__ included a history of 
severe anaphylaxis with hypotension to peanut, more 
than mild asthma or uncontrolled asthma, uncontrolled 
atopic dermatitis, and eosinophilic gastrointestinal 
disease (the full list of exclusion criteria is presented in 
the appendix p 2).

---
# Study Design

This is a randomized, double-blind, placebo-controlled, multi-center study comparing peanut oral immunotherapy to placebo. Eligible participants with peanut allergy will be randomly assigned to receive either peanut OIT or placebo for 134 weeks followed by peanut avoidance for 26 weeks.  

An initial blinded oral food challenge (OFC) to 1 g of peanut flour (500 mg peanut protein) will be conducted. Participants must have a clinical reaction during this blinded OFC to initiate study dosing. After the initial blinded OFC, the study design includes the following:  

__Initial Dose Escalation:__ This will occur on a single day in which multiple doses are given. Peanut or placebo dosing will be given incrementally and increase every 15-30 minutes until a dose of 12 mg peanut flour (6 mg peanut protein) or placebo flour is given. The first four doses will be administered as a peanut flour extract of 0.1 to 0.8 mg peanut protein, which is 10 to 80 microliters peanut flour extract, or placebo flour extract, and the last three doses will be given as peanut flour of 3 to 12 mg peanut flour (1.5 to 6 mg peanut protein) or placebo flour. Participants must tolerate a dose of at least 3 mg peanut flour (1.5 mg peanut protein) or placebo flour to remain in the study.  

__Build-up:__ After the initial dose escalation day, the participant will return to the research unit the next morning for an observed dose administration of the highest tolerated dose from the initial escalation day. The participant will then continue on the daily OIT dosing at home and return to the research unit every 2 weeks for a dose escalation. The dosing escalations will be consistent with previous similar OIT studies.
  
Participants who do not reach the 4000 mg peanut flour (2000 mg peanut protein) or placebo flour dose during the build-up phase may enter maintenance phase at their highest tolerated dose, which must be at least 500 mg peanut flour (250 mg peanut protein) or placebo flour.  

The build-up phase will comprise 30 weeks.  

__Maintenance:__ The participant will continue on daily OIT with return visits every 13 weeks. At the end of this phase the participant will undergo a blinded OFC to 10 g peanut flour (5 g peanut protein).  
This phase will comprise 104 weeks.  

__Avoidance:__ In this final phase participants stop OIT and will avoid peanut consumption. They will be seen 2 weeks and 26 weeks after initiating this phase. At the completion of this phase participants will have a final blinded OFC to 10 g peanut flour (5 g peanut protein). Participants who do not have a clinical reaction to the challenge will receive an Open Food Challenge (OpFC).  

Avoidance will comprise 26 weeks.  

__Post-challenge:__ If participants do not have a clinical reaction during the OpFC at the end of avoidance, they will be allowed to consume peanut and will have one visit which will include peripheral blood sampling for mechanistic assays assessments.  

Post-challenge will comprise 2 weeks.  

---
# Exploring where and how OFCs are captured
Looking for 3 OFC tests in total

### 1st OFC: According to Study Design in protocol: 
- Initial reaction: "An initial blinded oral food challenge (OFC) to 1 g of peanut flour (500 mg peanut protein) will be conducted. Participants must have a clinical reaction during this blinded OFC to initiate study dosing." (performed before the IDE)
- This challenge went up to 0.5 g peanut protein (1 g peanut flour)

### 2nd & 3rd OFC: According to Schedule of Assessments: Appendix 2:
- 5 g oral food challenges performed at the end of maintenance (visit 24/week 134) and at the end of avoidance (visit 26/week 160)
- Each challenge went up to 5 g peanut protein (10 g peanut flour)
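
The week numbers attached to visits 24 and 26 follow from the phase lengths given in the Study Design section. A minimal sketch of that arithmetic (the variable names are illustrative, not part of the study data):

```python
from itertools import accumulate

# Phase lengths in weeks, as stated in the Study Design section
phase_weeks = {"Build-up": 30, "Maintenance": 104, "Avoidance": 26}

# Cumulative study week at which each phase ends
phase_end_week = dict(zip(phase_weeks, accumulate(phase_weeks.values())))

for phase, week in phase_end_week.items():
    print(f"{phase} ends at week {week}")
```

The end of maintenance (week 134) lines up with the visit 24 OFC, and the end of avoidance (week 160) with the visit 26 OFC.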



# Filtering just the useful columns:

In [3]:
# impact_ad
# impact_serum 
# impact_ige 
# impact_spt

impact_ad_cols_to_keep = [
     'Participant ID',
     'Date of Screening Visit',
     'Randomized', # Use this to filter out participants that did not pass the study screening
     'Study Status', # Values: 'Discontinued Therapy', 'Completed Study', 'Enrolled but not Randomized', 'Early Termination', 'Screen Failure', 'Screened but not Enrolled'
     'Sex (character)',
     'Race',
     'Completed Study Assessments Numeric', #1 for yes/ 0 for no (means attended all visits up to 26, excluding 27)
     'Age at Screening (years) Not Rounded',
     'Study Termination Reason'
]

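impact_serum_cols_to_keep = [
    'Participant ID',
    'Collection Date',
    'Visit',
    'Test Name', # Peanut IgE, Peanut IgE/Total IgE ratio, Peanut IgG4*, Peanut IgG4/IgE ratio, Total IgE
    'Unit',
    'Value', # results from 'Test Name'
    'Passed Visit 24 OFC No Imputation', # OFC outcome at visit 24 (appears in the serum outputs below)
    'Passed Visit 26 OFC No Imputation', # OFC outcome at visit 26 (appears in the serum outputs below)
]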

impact_ige_cols_to_keep = [
    'Participant ID',
    'Collection Date',
    'Visit',
    'Antibody', #IgE, IgG4
    'Component', #rAra h 1, rAra h 2, rAra h 3, rAra h 6
    'Value',
    'Unit' 
]

impact_spt_cols_to_keep = [
    'Participant ID',
    'Date of Allergy Skin Test (Character)', # this is similar to 'Collection Date' in other datasets 
    'Visit',
    'Wheal (mm)',
    'OUT24NOI', # this will be used as OFC results for visit 24
    #'OUT26NOI' # this will be used as OFC results for visit 26
]

impact_ad = impact_ad[impact_ad_cols_to_keep]
impact_serum = impact_serum[impact_serum_cols_to_keep]
impact_ige = impact_ige[impact_ige_cols_to_keep]
impact_spt = impact_spt[impact_spt_cols_to_keep]


In [4]:
print(impact_ad.shape)
impact_ad.head()

(209, 9)


Unnamed: 0,Participant ID,Date of Screening Visit,Randomized,Study Status,Sex (character),Race,Completed Study Assessments Numeric,Age at Screening (years) Not Rounded,Study Termination Reason
0,IMPACT_101655,2012-10-11,Yes,Discontinued Therapy,Male,White/Caucasian,0,3.8,Adverse Event
1,IMPACT_102436,2013-03-14,Yes,Completed Study,Male,White/Caucasian,1,3.9,
2,IMPACT_105670,2013-04-04,Yes,Completed Study,Female,White/Caucasian,1,3.0,
3,IMPACT_112496,2013-05-05,No,Enrolled but not Randomized,Female,White/Caucasian,0,3.2,Withdrawn Consent
4,IMPACT_113135,2013-11-25,Yes,Early Termination,Female,White/Caucasian,0,2.6,Withdrawn Consent


In [5]:
print(impact_serum.shape)
impact_serum.head()

(3069, 8)


Unnamed: 0,Participant ID,Collection Date,Visit,Test Name,Unit,Value,Passed Visit 24 OFC No Imputation,Passed Visit 26 OFC No Imputation
0,IMPACT_101655,2012-10-11,-2,Peanut IgE,kU/L,25.7,,
1,IMPACT_101655,2012-10-11,-2,Peanut IgE/Total IgE ratio,Ratio,27.934783,,
2,IMPACT_101655,2012-10-11,-2,Peanut IgG4*,mcg/mL,0.3,,
3,IMPACT_101655,2012-10-11,-2,Peanut IgG4/IgE ratio,Ratio,0.004864,,
4,IMPACT_101655,2012-10-11,-2,Total IgE,IU/mL,92.0,,


In [6]:
print(impact_ige.shape)
impact_ige.head()

(4848, 7)


Unnamed: 0,Participant ID,Collection Date,Visit,Antibody,Component,Value,Unit
0,IMPACT_101655,2012-10-11,-2,IgE,rAra h 1,8.76,KU/L
1,IMPACT_101655,2012-10-11,-2,IgE,rAra h 2,27.5,KU/L
2,IMPACT_101655,2012-10-11,-2,IgE,rAra h 3,0.5,KU/L
3,IMPACT_101655,2012-10-11,-2,IgE,rAra h 6,17.3,kUA/L
4,IMPACT_101655,2012-10-11,-2,IgG4,rAra h 1,0.15,MG/L


In [7]:
print(impact_spt.shape)
impact_spt.head()

(621, 5)


Unnamed: 0,Participant ID,Date of Allergy Skin Test (Character),Visit,Wheal (mm),OUT24NOI
0,IMPACT_101655,2012-10-11,-2,17.5,
1,IMPACT_101655,2013-05-28,16,4.5,
2,IMPACT_102436,2013-03-14,-2,16.0,1.0
3,IMPACT_102436,2013-11-26,16,9.5,1.0
4,IMPACT_102436,2014-11-28,20,,1.0


---
# Filtering participants that were study eligible (n=144)
- Total participants in ADSTART0 = 209  
- Total participants eligible = 144
- Total excluded = 65 (63 that were not randomized + 2 that could not reach the minimum dose during the IDE)

#### filtering solution (this will help filter initial screening OFC results too)
- filter by ADSTART0 -> impact_ad['Randomized'] == "Yes" (keeps the 146 randomized participants)
- filter out ADSTART0 -> "Study Termination Reason" == "Inability to reach 3 mg peanut/placebo flour (1.5 mg peanut/placebo protein) during the initial dose escalation" (these are the two participants who were randomized but could not complete their IDE: 146-2 -> n=144)

In [8]:
# filter out ineligible participants 
impact_ad_eligible = impact_ad.loc[(impact_ad['Randomized'] == 'Yes') & ~(impact_ad['Study Termination Reason'] == 'Inability to reach 3 mg peanut/placebo flour (1.5 mg peanut/placebo protein) during the initial dose escalation')]


In [9]:
# drop column
impact_ad_eligible = impact_ad_eligible.drop(columns=['Randomized'])

In [10]:
print(impact_ad_eligible.shape)
impact_ad_eligible.head()

(144, 8)


Unnamed: 0,Participant ID,Date of Screening Visit,Study Status,Sex (character),Race,Completed Study Assessments Numeric,Age at Screening (years) Not Rounded,Study Termination Reason
0,IMPACT_101655,2012-10-11,Discontinued Therapy,Male,White/Caucasian,0,3.8,Adverse Event
1,IMPACT_102436,2013-03-14,Completed Study,Male,White/Caucasian,1,3.9,
2,IMPACT_105670,2013-04-04,Completed Study,Female,White/Caucasian,1,3.0,
4,IMPACT_113135,2013-11-25,Early Termination,Female,White/Caucasian,0,2.6,Withdrawn Consent
5,IMPACT_115876,2014-03-19,Completed Study,Male,Mixed Race,1,2.2,


# Obtain list of participants who were eligible 


In [11]:
eligible_participant_IDs = set(impact_ad_eligible['Participant ID'])

In [12]:
len(eligible_participant_IDs)

144

# Filtering each remaining dataset to include just the eligible participants 

In [13]:
impact_ige_eligible = impact_ige[impact_ige['Participant ID'].isin(eligible_participant_IDs)]
impact_serum_eligible = impact_serum[impact_serum['Participant ID'].isin(eligible_participant_IDs)]
impact_spt_eligible = impact_spt[impact_spt['Participant ID'].isin(eligible_participant_IDs)]

# all datasets so far with 144 eligible participants and relevant columns:
# impact_ad_eligible
# impact_ige_eligible 
# impact_serum_eligible
# impact_spt_eligible

In [14]:
print("impact_ad_eligible")
print(impact_ad_eligible.shape)

print("impact_ige_eligible")
print(impact_ige_eligible.shape)

print("impact_serum_eligible")
print(impact_serum_eligible.shape)

print("impact_spt_eligible")
print(impact_spt_eligible.shape)

print("impact_ad_eligible")
print(len(impact_ad_eligible['Participant ID'].unique().tolist()))

print("impact_ige_eligible")
print(len(impact_ige_eligible['Participant ID'].unique().tolist()))

print("impact_serum_eligible")
print(len(impact_serum_eligible['Participant ID'].unique().tolist()))

print("impact_spt_eligible")
print(len(impact_spt_eligible['Participant ID'].unique().tolist()))

impact_ad_eligible
(144, 8)
impact_ige_eligible
(4832, 7)
impact_serum_eligible
(3059, 8)
impact_spt_eligible
(619, 5)
impact_ad_eligible
144
impact_ige_eligible
144
impact_serum_eligible
144
impact_spt_eligible
144
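
The repeated print statements above could be folded into one helper that reports rows, columns, and unique participants per dataset. This is a sketch on toy data; `summarize_participants` and the toy frames are hypothetical names, not part of the notebook.

```python
import pandas as pd

def summarize_participants(frames, id_col="Participant ID"):
    """Return one row per DataFrame: row count, column count, unique participants."""
    rows = []
    for name, df in frames.items():
        rows.append({
            "dataset": name,
            "rows": df.shape[0],
            "cols": df.shape[1],
            "participants": df[id_col].nunique(),
        })
    return pd.DataFrame(rows).set_index("dataset")

# Toy frames standing in for the *_eligible DataFrames (hypothetical values)
toy_frames = {
    "toy_ad": pd.DataFrame({"Participant ID": ["P1", "P2"]}),
    "toy_spt": pd.DataFrame({
        "Participant ID": ["P1", "P1", "P2"],
        "Wheal (mm)": [17.5, 4.5, 16.0],
    }),
}

summary = summarize_participants(toy_frames)
print(summary)
```

In the notebook, the dictionary would hold the four `*_eligible` DataFrames keyed by their names.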


In [15]:
print(impact_ad_eligible.shape)
impact_ad_eligible.head()

(144, 8)


Unnamed: 0,Participant ID,Date of Screening Visit,Study Status,Sex (character),Race,Completed Study Assessments Numeric,Age at Screening (years) Not Rounded,Study Termination Reason
0,IMPACT_101655,2012-10-11,Discontinued Therapy,Male,White/Caucasian,0,3.8,Adverse Event
1,IMPACT_102436,2013-03-14,Completed Study,Male,White/Caucasian,1,3.9,
2,IMPACT_105670,2013-04-04,Completed Study,Female,White/Caucasian,1,3.0,
4,IMPACT_113135,2013-11-25,Early Termination,Female,White/Caucasian,0,2.6,Withdrawn Consent
5,IMPACT_115876,2014-03-19,Completed Study,Male,Mixed Race,1,2.2,


# Standardizing Column Names Across Datasets



In [16]:
# impact_ad_eligible
# impact_ige_eligible 
# impact_serum_eligible
# impact_spt_eligible

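# renaming 'Date of Allergy Skin Test (Character)' to 'Collection Date' to match the other datasets
# ('Date of Screening Visit' in impact_ad_eligible keeps its name; it is needed later to compute age)
# impact_ad_eligible = impact_ad_eligible.rename(columns={'Date of Screening Visit': 'Collection Date'})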
impact_spt_eligible = impact_spt_eligible.rename(columns={'Date of Allergy Skin Test (Character)': 'Collection Date'})


In [17]:
print(impact_spt_eligible.shape)
impact_spt_eligible.head()

(619, 5)


Unnamed: 0,Participant ID,Collection Date,Visit,Wheal (mm),OUT24NOI
0,IMPACT_101655,2012-10-11,-2,17.5,
1,IMPACT_101655,2013-05-28,16,4.5,
2,IMPACT_102436,2013-03-14,-2,16.0,1.0
3,IMPACT_102436,2013-11-26,16,9.5,1.0
4,IMPACT_102436,2014-11-28,20,,1.0


# Keeping only participants who reached visit 24

- per IMPACT chart, 116 participants made it to Visit 24

the filtering below yields 117 participants common to all three datasets, one more than the chart reports


In [18]:
# impact_ige_eligible 

In [19]:
impact_ige_eligible_24 = impact_ige_eligible[impact_ige_eligible['Visit'] == 24]

In [20]:
print(impact_ige_eligible_24.shape)
print(len(impact_ige_eligible_24['Participant ID'].unique().tolist()))
impact_ige_eligible_24.head()

(936, 7)
117


Unnamed: 0,Participant ID,Collection Date,Visit,Antibody,Component,Value,Unit
40,IMPACT_102436,2015-11-20,24,IgE,rAra h 1,1.14,kUA/L
41,IMPACT_102436,2015-11-20,24,IgE,rAra h 2,8.67,kUA/L
42,IMPACT_102436,2015-11-20,24,IgE,rAra h 3,0.05,kUA/L
43,IMPACT_102436,2015-11-20,24,IgE,rAra h 6,6.64,kUA/L
44,IMPACT_102436,2015-11-20,24,IgG4,rAra h 1,12.9,MG/L


In [21]:
# impact_serum_eligible

In [22]:
impact_serum_eligible_24 = impact_serum_eligible[impact_serum_eligible['Visit'] == 24]

In [23]:
print(impact_serum_eligible_24.shape)
print(len(impact_serum_eligible_24['Participant ID'].unique().tolist()))
impact_serum_eligible_24.head()

(590, 8)
118


Unnamed: 0,Participant ID,Collection Date,Visit,Test Name,Unit,Value,Passed Visit 24 OFC No Imputation,Passed Visit 26 OFC No Imputation
25,IMPACT_102436,2015-11-20,24,Peanut IgE,kU/L,8.41,1.0,0.0
26,IMPACT_102436,2015-11-20,24,Peanut IgE/Total IgE ratio,Ratio,8.670103,1.0,0.0
27,IMPACT_102436,2015-11-20,24,Peanut IgG4*,mcg/mL,90.2,1.0,0.0
28,IMPACT_102436,2015-11-20,24,Peanut IgG4/IgE ratio,Ratio,4.468886,1.0,0.0
29,IMPACT_102436,2015-11-20,24,Total IgE,IU/mL,97.0,1.0,0.0


In [24]:
# impact_spt_eligible

In [25]:
impact_spt_eligible_24 = impact_spt_eligible[impact_spt_eligible['Visit'] == 24]

In [26]:
print(impact_spt_eligible_24.shape)
print(len(impact_spt_eligible_24['Participant ID'].unique().tolist()))
impact_spt_eligible_24.head()

(118, 5)
118


Unnamed: 0,Participant ID,Collection Date,Visit,Wheal (mm),OUT24NOI
5,IMPACT_102436,2015-11-20,24,2.5,1.0
10,IMPACT_105670,2015-11-25,24,0.0,1.0
15,IMPACT_113135,2016-09-12,24,24.5,0.0
20,IMPACT_115876,2016-11-09,24,27.5,0.0
25,IMPACT_136775,2017-11-20,24,7.5,0.0


In [27]:
# finding common participants who reached visit 24 

# impact_ige_eligible_24
# impact_serum_eligible_24
# impact_spt_eligible_24

ige_list_24 = impact_ige_eligible_24['Participant ID'].tolist()
serum_list_24 = impact_serum_eligible_24['Participant ID'].tolist()
spt_list_24 = impact_spt_eligible_24['Participant ID'].tolist()

common_participants_24 = set(spt_list_24).intersection(ige_list_24, serum_list_24)
common_participants_24 = list(common_participants_24)



In [28]:
len(common_participants_24)

117
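
The per-dataset counts above (117, 118, 118) differ from the intersection (117), so at least one participant is missing from at least one dataset. A small sketch for listing who is missing from where, using hypothetical toy IDs rather than the real participant lists:

```python
# Toy ID sets standing in for the visit-24 participant lists (hypothetical values)
ids_by_dataset = {
    "ige": {"P1", "P2", "P3"},
    "serum": {"P1", "P2", "P3", "P4"},
    "spt": {"P1", "P2", "P3", "P4"},
}

# Participants present in every dataset
common_ids = set.intersection(*ids_by_dataset.values())

# Participants present in at least one dataset
all_ids = set.union(*ids_by_dataset.values())

# For each dataset, the participants it lacks
missing_from = {name: sorted(all_ids - ids) for name, ids in ids_by_dataset.items()}

print(sorted(common_ids))
print(missing_from)
```

With the real data, the sets would be built from the `'Participant ID'` column of each `*_eligible_24` DataFrame.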

In [29]:

# Updating each data frame so that they only contain the common participants 
impact_ad_eligible_24_common = impact_ad_eligible[impact_ad_eligible['Participant ID'].isin(common_participants_24)]
impact_ige_eligible_24_common = impact_ige_eligible_24[impact_ige_eligible_24['Participant ID'].isin(common_participants_24)]
impact_serum_eligible_24_common = impact_serum_eligible_24[impact_serum_eligible_24['Participant ID'].isin(common_participants_24)]
impact_spt_eligible_24_common = impact_spt_eligible_24[impact_spt_eligible_24['Participant ID'].isin(common_participants_24)]



In [30]:
print(impact_ad_eligible_24_common.shape)
print(len(impact_ad_eligible_24_common['Participant ID'].unique().tolist()))
impact_ad_eligible_24_common.head()

(117, 8)
117


Unnamed: 0,Participant ID,Date of Screening Visit,Study Status,Sex (character),Race,Completed Study Assessments Numeric,Age at Screening (years) Not Rounded,Study Termination Reason
1,IMPACT_102436,2013-03-14,Completed Study,Male,White/Caucasian,1,3.9,
2,IMPACT_105670,2013-04-04,Completed Study,Female,White/Caucasian,1,3.0,
4,IMPACT_113135,2013-11-25,Early Termination,Female,White/Caucasian,0,2.6,Withdrawn Consent
5,IMPACT_115876,2014-03-19,Completed Study,Male,Mixed Race,1,2.2,
9,IMPACT_136775,2015-03-28,Early Termination,Female,Mixed Race,0,3.5,Withdrawn Consent


In [31]:
print(impact_ige_eligible_24_common.shape)
print(len(impact_ige_eligible_24_common['Participant ID'].unique().tolist()))
impact_ige_eligible_24_common.head()

(936, 7)
117


Unnamed: 0,Participant ID,Collection Date,Visit,Antibody,Component,Value,Unit
40,IMPACT_102436,2015-11-20,24,IgE,rAra h 1,1.14,kUA/L
41,IMPACT_102436,2015-11-20,24,IgE,rAra h 2,8.67,kUA/L
42,IMPACT_102436,2015-11-20,24,IgE,rAra h 3,0.05,kUA/L
43,IMPACT_102436,2015-11-20,24,IgE,rAra h 6,6.64,kUA/L
44,IMPACT_102436,2015-11-20,24,IgG4,rAra h 1,12.9,MG/L


In [32]:
print(impact_serum_eligible_24_common.shape)
print(len(impact_serum_eligible_24_common['Participant ID'].unique().tolist()))
impact_serum_eligible_24_common.head()

(585, 8)
117


Unnamed: 0,Participant ID,Collection Date,Visit,Test Name,Unit,Value,Passed Visit 24 OFC No Imputation,Passed Visit 26 OFC No Imputation
25,IMPACT_102436,2015-11-20,24,Peanut IgE,kU/L,8.41,1.0,0.0
26,IMPACT_102436,2015-11-20,24,Peanut IgE/Total IgE ratio,Ratio,8.670103,1.0,0.0
27,IMPACT_102436,2015-11-20,24,Peanut IgG4*,mcg/mL,90.2,1.0,0.0
28,IMPACT_102436,2015-11-20,24,Peanut IgG4/IgE ratio,Ratio,4.468886,1.0,0.0
29,IMPACT_102436,2015-11-20,24,Total IgE,IU/mL,97.0,1.0,0.0


In [33]:
print(impact_spt_eligible_24_common.shape)
print(len(impact_spt_eligible_24_common['Participant ID'].unique().tolist()))
impact_spt_eligible_24_common.head()

(117, 5)
117


Unnamed: 0,Participant ID,Collection Date,Visit,Wheal (mm),OUT24NOI
5,IMPACT_102436,2015-11-20,24,2.5,1.0
10,IMPACT_105670,2015-11-25,24,0.0,1.0
15,IMPACT_113135,2016-09-12,24,24.5,0.0
20,IMPACT_115876,2016-11-09,24,27.5,0.0
25,IMPACT_136775,2017-11-20,24,7.5,0.0


# Calculating Age at visit 24



In [34]:
# transforming the "Age at Screening (years) Not Rounded" column in "impact_ad_eligible_24_common" to be in months
# (the column keeps its original name, but from here on it holds months)

# work on an explicit copy so the assignment below does not trigger SettingWithCopyWarning
impact_ad_eligible_24_common = impact_ad_eligible_24_common.copy()

# converting the 'Age at Screening (years) Not Rounded' to months 
impact_ad_eligible_24_common['Age at Screening (years) Not Rounded'] = impact_ad_eligible_24_common['Age at Screening (years) Not Rounded'] * 12

print(impact_ad_eligible_24_common.shape)
impact_ad_eligible_24_common.head()


(117, 8)


Unnamed: 0,Participant ID,Date of Screening Visit,Study Status,Sex (character),Race,Completed Study Assessments Numeric,Age at Screening (years) Not Rounded,Study Termination Reason
1,IMPACT_102436,2013-03-14,Completed Study,Male,White/Caucasian,1,46.8,
2,IMPACT_105670,2013-04-04,Completed Study,Female,White/Caucasian,1,36.0,
4,IMPACT_113135,2013-11-25,Early Termination,Female,White/Caucasian,0,31.2,Withdrawn Consent
5,IMPACT_115876,2014-03-19,Completed Study,Male,Mixed Race,1,26.4,
9,IMPACT_136775,2015-03-28,Early Termination,Female,Mixed Race,0,42.0,Withdrawn Consent


In [35]:
# Using the visit 24 'Collection Date' from SPT to calculate the time elapsed since screening

In [36]:
impact_spt_eligible_24_common.head()

Unnamed: 0,Participant ID,Collection Date,Visit,Wheal (mm),OUT24NOI
5,IMPACT_102436,2015-11-20,24,2.5,1.0
10,IMPACT_105670,2015-11-25,24,0.0,1.0
15,IMPACT_113135,2016-09-12,24,24.5,0.0
20,IMPACT_115876,2016-11-09,24,27.5,0.0
25,IMPACT_136775,2017-11-20,24,7.5,0.0


In [37]:
#Merging ad and spt  
merged_ad_spt = impact_ad_eligible_24_common.merge(impact_spt_eligible_24_common, 
                                                               how='outer', #preserves all rows and adds NaN for empty rows
                                                               on=['Participant ID'])


In [38]:
# calculate the time since screening in months (approximating one month as 30 days)
# and add it to the screening age, which was converted to months above
merged_ad_spt['Age at OFC 24'] = merged_ad_spt.apply(
    lambda row: row['Age at Screening (years) Not Rounded'] + pd.Timedelta(row['Collection Date'] - row['Date of Screening Visit']).days / 30, axis=1
)


In [39]:
# round the 'Age at OFC 24' column to one decimal place (tenths of a month)
merged_ad_spt['Age at OFC 24'] = merged_ad_spt['Age at OFC 24'].round(1)
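
The row-wise `apply` above can also be written as a vectorized date subtraction. The sketch below reproduces the calculation for one participant, using the screening age and dates shown for IMPACT_102436 in the outputs above. The column name `Age at Screening (months)` is hypothetical, and the 30-day month is the same approximation the notebook uses.

```python
import pandas as pd

# One-row toy frame with the values shown above for IMPACT_102436
toy = pd.DataFrame({
    "Age at Screening (months)": [46.8],  # hypothetical column name; 3.9 years * 12
    "Date of Screening Visit": pd.to_datetime(["2013-03-14"]),
    "Collection Date": pd.to_datetime(["2015-11-20"]),
})

# Days between screening and the visit 24 collection date
elapsed_days = (toy["Collection Date"] - toy["Date of Screening Visit"]).dt.days

# Age at visit 24 in months, approximating one month as 30 days
toy["Age at OFC 24"] = (toy["Age at Screening (months)"] + elapsed_days / 30).round(1)

print(toy[["Age at Screening (months)", "Age at OFC 24"]])
```

Dividing by an average month length (about 30.44 days) instead of 30 would give slightly lower ages; which convention to use is a judgment call.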



In [40]:
# dropping initial screening column (and other unnecessary cols)

merged_ad_spt = merged_ad_spt.drop(columns = ['Date of Screening Visit', 
                                              'Completed Study Assessments Numeric', 
                                              'Study Termination Reason', 
                                              'Study Status',
                                              'Age at Screening (years) Not Rounded'
                                             ])

In [41]:
print(merged_ad_spt.shape)
merged_ad_spt.head()

(117, 8)


Unnamed: 0,Participant ID,Sex (character),Race,Collection Date,Visit,Wheal (mm),OUT24NOI,Age at OFC 24
0,IMPACT_102436,Male,White/Caucasian,2015-11-20,24,2.5,1.0,79.5
1,IMPACT_105670,Female,White/Caucasian,2015-11-25,24,0.0,1.0,68.2
2,IMPACT_113135,Female,White/Caucasian,2016-09-12,24,24.5,0.0,65.3
3,IMPACT_115876,Male,Mixed Race,2016-11-09,24,27.5,0.0,58.6
4,IMPACT_136775,Female,Mixed Race,2017-11-20,24,7.5,0.0,74.3


# 3 datasets left to merge


In [42]:
# 3 datasets left to merge: 
# merged_ad_spt -> contains AGE at 24
# impact_ige_eligible_24_common -> contains components 
# impact_serum_eligible_24_common -> contains Peanut & Total IgE 



In [43]:
print(merged_ad_spt.shape)
merged_ad_spt.head()

(117, 8)


Unnamed: 0,Participant ID,Sex (character),Race,Collection Date,Visit,Wheal (mm),OUT24NOI,Age at OFC 24
0,IMPACT_102436,Male,White/Caucasian,2015-11-20,24,2.5,1.0,79.5
1,IMPACT_105670,Female,White/Caucasian,2015-11-25,24,0.0,1.0,68.2
2,IMPACT_113135,Female,White/Caucasian,2016-09-12,24,24.5,0.0,65.3
3,IMPACT_115876,Male,Mixed Race,2016-11-09,24,27.5,0.0,58.6
4,IMPACT_136775,Female,Mixed Race,2017-11-20,24,7.5,0.0,74.3


In [44]:
print(impact_ige_eligible_24_common.shape)
impact_ige_eligible_24_common.head()

(936, 7)


Unnamed: 0,Participant ID,Collection Date,Visit,Antibody,Component,Value,Unit
40,IMPACT_102436,2015-11-20,24,IgE,rAra h 1,1.14,kUA/L
41,IMPACT_102436,2015-11-20,24,IgE,rAra h 2,8.67,kUA/L
42,IMPACT_102436,2015-11-20,24,IgE,rAra h 3,0.05,kUA/L
43,IMPACT_102436,2015-11-20,24,IgE,rAra h 6,6.64,kUA/L
44,IMPACT_102436,2015-11-20,24,IgG4,rAra h 1,12.9,MG/L


In [45]:
print(impact_serum_eligible_24_common.shape)
impact_serum_eligible_24_common.head()

(585, 8)


Unnamed: 0,Participant ID,Collection Date,Visit,Test Name,Unit,Value,Passed Visit 24 OFC No Imputation,Passed Visit 26 OFC No Imputation
25,IMPACT_102436,2015-11-20,24,Peanut IgE,kU/L,8.41,1.0,0.0
26,IMPACT_102436,2015-11-20,24,Peanut IgE/Total IgE ratio,Ratio,8.670103,1.0,0.0
27,IMPACT_102436,2015-11-20,24,Peanut IgG4*,mcg/mL,90.2,1.0,0.0
28,IMPACT_102436,2015-11-20,24,Peanut IgG4/IgE ratio,Ratio,4.468886,1.0,0.0
29,IMPACT_102436,2015-11-20,24,Total IgE,IU/mL,97.0,1.0,0.0


# Breaking out components into columns 
in "impact_ige_eligible_24_common"

In [46]:
# dropping IgG4 from Antibody column, only want IgE
# .copy() so the new columns added in the next cell are written to an independent DataFrame
impact_ige_eligible_24_common = impact_ige_eligible_24_common[impact_ige_eligible_24_common['Antibody']=='IgE'].copy()


In [47]:
# Creating new columns for each component

# Create new columns with initial NaN values
impact_ige_eligible_24_common['Ara h1 (kU/L)'] = np.nan
impact_ige_eligible_24_common['Ara h2 (kU/L)'] = np.nan
impact_ige_eligible_24_common['Ara h3 (kU/L)'] = np.nan
impact_ige_eligible_24_common['Ara h6 (kU/L)'] = np.nan

# Populate the new columns based on conditions
impact_ige_eligible_24_common.loc[impact_ige_eligible_24_common['Component'] == 'rAra h 1', 'Ara h1 (kU/L)'] = impact_ige_eligible_24_common.loc[impact_ige_eligible_24_common['Component'] == 'rAra h 1', 'Value']
impact_ige_eligible_24_common.loc[impact_ige_eligible_24_common['Component'] == 'rAra h 2', 'Ara h2 (kU/L)'] = impact_ige_eligible_24_common.loc[impact_ige_eligible_24_common['Component'] == 'rAra h 2', 'Value']
impact_ige_eligible_24_common.loc[impact_ige_eligible_24_common['Component'] == 'rAra h 3', 'Ara h3 (kU/L)'] = impact_ige_eligible_24_common.loc[impact_ige_eligible_24_common['Component'] == 'rAra h 3', 'Value']
impact_ige_eligible_24_common.loc[impact_ige_eligible_24_common['Component'] == 'rAra h 6', 'Ara h6 (kU/L)'] = impact_ige_eligible_24_common.loc[impact_ige_eligible_24_common['Component'] == 'rAra h 6', 'Value']

# Drop the specified columns
impact_ige_eligible_24_common = impact_ige_eligible_24_common.drop(columns=['Antibody', 'Component', 'Value', 'Unit'])



In [48]:
print(impact_ige_eligible_24_common.shape)
impact_ige_eligible_24_common.head(10)


(468, 7)


Unnamed: 0,Participant ID,Collection Date,Visit,Ara h1 (kU/L),Ara h2 (kU/L),Ara h3 (kU/L),Ara h6 (kU/L)
40,IMPACT_102436,2015-11-20,24,1.14,,,
41,IMPACT_102436,2015-11-20,24,,8.67,,
42,IMPACT_102436,2015-11-20,24,,,0.05,
43,IMPACT_102436,2015-11-20,24,,,,6.64
72,IMPACT_105670,2015-11-25,24,1.95,,,
73,IMPACT_105670,2015-11-25,24,,10.4,,
74,IMPACT_105670,2015-11-25,24,,,0.39,
75,IMPACT_105670,2015-11-25,24,,,,4.97
112,IMPACT_113135,2016-09-12,24,16.3,,,
113,IMPACT_113135,2016-09-12,24,,181.0,,


In [50]:
# Merging duplicate rows and overwriting NaN values

# Group by participant, date, and visit; the mean skips NaN, so each group collapses to its single non-null value per component
impact_ige_eligible_24_common = impact_ige_eligible_24_common.groupby(['Participant ID', 'Collection Date', 'Visit'], as_index=False).agg({'Ara h1 (kU/L)': 'mean', 'Ara h2 (kU/L)': 'mean', 'Ara h3 (kU/L)': 'mean', 'Ara h6 (kU/L)': 'mean'})


# Reset the index
impact_ige_eligible_24_common.reset_index(drop=True, inplace=True)



In [53]:
print(impact_ige_eligible_24_common.shape)
impact_ige_eligible_24_common.head(20)


(117, 7)


Unnamed: 0,Participant ID,Collection Date,Visit,Ara h1 (kU/L),Ara h2 (kU/L),Ara h3 (kU/L),Ara h6 (kU/L)
0,IMPACT_102436,2015-11-20,24,1.14,8.67,0.05,6.64
1,IMPACT_105670,2015-11-25,24,1.95,10.4,0.39,4.97
2,IMPACT_113135,2016-09-12,24,16.3,181.0,1.4,190.0
3,IMPACT_115876,2016-11-09,24,8.75,9.54,0.05,11.0
4,IMPACT_136775,2017-11-20,24,0.88,7.8,0.05,4.93
5,IMPACT_139237,2017-08-23,24,19.2,102.0,2.28,62.2
6,IMPACT_149018,2017-04-17,24,13.9,15.3,1.06,12.3
7,IMPACT_196748,2017-01-29,24,0.05,0.52,0.05,0.66
8,IMPACT_205488,2015-08-08,24,0.2,1.26,0.05,3.0
9,IMPACT_207729,2017-06-13,24,13.1,6.72,0.05,11.3
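
The two steps above (adding NaN columns per component, then grouping to collapse them) can also be expressed as a single pivot from long to wide format. A sketch on toy data, assuming one value per participant and component; the participant IDs are hypothetical and the values are copied from the first rows shown above.

```python
import pandas as pd

# Long-format toy data: one row per participant and component
long_ige = pd.DataFrame({
    "Participant ID": ["P1"] * 4 + ["P2"] * 4,
    "Collection Date": pd.to_datetime(["2015-11-20"] * 4 + ["2015-11-25"] * 4),
    "Visit": [24] * 8,
    "Component": ["rAra h 1", "rAra h 2", "rAra h 3", "rAra h 6"] * 2,
    "Value": [1.14, 8.67, 0.05, 6.64, 1.95, 10.4, 0.39, 4.97],
})

# One row per participant/date/visit, one column per component
wide_ige = (
    long_ige.pivot_table(
        index=["Participant ID", "Collection Date", "Visit"],
        columns="Component",
        values="Value",
        aggfunc="mean",
    )
    .reset_index()
    .rename_axis(columns=None)  # drop the leftover 'Component' axis label
)

print(wide_ige)
```

The resulting columns keep the raw component labels (`rAra h 1`, ...); they could be renamed afterwards to match the `Ara h1 (kU/L)` style used in this notebook.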


In [55]:

print(len(impact_ige_eligible_24_common['Participant ID'].unique().tolist()))



117


# Merging "impact_ige_eligible_24_common" with "merged_ad_spt"
- both are 117 rows 

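## Result -> One participant needs to be manually merged:
(23)  -   IMPACT_320824  
(117) -    IMPACT_320824

The outer merge returns 118 rows instead of 117 because this participant's rows did not match on all three keys. Both frames were already filtered to Visit 24, so the mismatch is most likely in 'Collection Date'.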

In [56]:
merged_ad_spt.head()

Unnamed: 0,Participant ID,Sex (character),Race,Collection Date,Visit,Wheal (mm),OUT24NOI,Age at OFC 24
0,IMPACT_102436,Male,White/Caucasian,2015-11-20,24,2.5,1.0,79.5
1,IMPACT_105670,Female,White/Caucasian,2015-11-25,24,0.0,1.0,68.2
2,IMPACT_113135,Female,White/Caucasian,2016-09-12,24,24.5,0.0,65.3
3,IMPACT_115876,Male,Mixed Race,2016-11-09,24,27.5,0.0,58.6
4,IMPACT_136775,Female,Mixed Race,2017-11-20,24,7.5,0.0,74.3


In [57]:
# Merging merged_ad_spt with the IgE component columns
merged_ad_spt_ige = merged_ad_spt.merge(impact_ige_eligible_24_common, 
                                                               how='outer', #preserves all rows and adds NaN for empty rows
                                                               on=['Participant ID', 'Visit', 'Collection Date'])


In [61]:
print(merged_ad_spt_ige.shape)
merged_ad_spt_ige.head()

(118, 12)


Unnamed: 0,Participant ID,Sex (character),Race,Collection Date,Visit,Wheal (mm),OUT24NOI,Age at OFC 24,Ara h1 (kU/L),Ara h2 (kU/L),Ara h3 (kU/L),Ara h6 (kU/L)
0,IMPACT_102436,Male,White/Caucasian,2015-11-20,24,2.5,1.0,79.5,1.14,8.67,0.05,6.64
1,IMPACT_105670,Female,White/Caucasian,2015-11-25,24,0.0,1.0,68.2,1.95,10.4,0.39,4.97
2,IMPACT_113135,Female,White/Caucasian,2016-09-12,24,24.5,0.0,65.3,16.3,181.0,1.4,190.0
3,IMPACT_115876,Male,Mixed Race,2016-11-09,24,27.5,0.0,58.6,8.75,9.54,0.05,11.0
4,IMPACT_136775,Female,Mixed Race,2017-11-20,24,7.5,0.0,74.3,0.88,7.8,0.05,4.93


In [60]:
#seeing which row is duplicated

duplicated_ids = merged_ad_spt_ige['Participant ID'].duplicated(keep=False)
duplicated_values = merged_ad_spt_ige[duplicated_ids]

print(duplicated_values['Participant ID'])


23     IMPACT_320824
117    IMPACT_320824
Name: Participant ID, dtype: object
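
To see why a participant ends up on two rows after an outer merge, the `indicator=True` option of `DataFrame.merge` labels each row as `both`, `left_only`, or `right_only`. A sketch on toy data in which one hypothetical participant (P2) has collection dates that differ by two days between the two frames:

```python
import pandas as pd

keys = ["Participant ID", "Visit", "Collection Date"]

left = pd.DataFrame({
    "Participant ID": ["P1", "P2"],
    "Visit": [24, 24],
    "Collection Date": pd.to_datetime(["2015-11-20", "2016-01-10"]),
    "Wheal (mm)": [2.5, 7.5],
})
right = pd.DataFrame({
    "Participant ID": ["P1", "P2"],
    "Visit": [24, 24],
    "Collection Date": pd.to_datetime(["2015-11-20", "2016-01-12"]),
    "Ara h2 (kU/L)": [8.67, 7.8],
})

# The '_merge' column records where each output row came from
checked = left.merge(right, how="outer", on=keys, indicator=True)

# Rows that failed to match on all three keys
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[keys + ["_merge"]])
```

If the dates turn out to be the only difference, merging on `['Participant ID', 'Visit']` alone would avoid the split, at the cost of having two date columns to reconcile.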


# Breaking out Peanut IgE/Total IgE
- in "impact_serum_eligible_24_common"


In [62]:
# removing rows from 'Test Name' for everything except 'Peanut IgE' and 'Total IgE'

#defining list to keep
keep_IgEs = ['Peanut IgE', 'Total IgE']

impact_serum_eligible_24_common = impact_serum_eligible_24_common[impact_serum_eligible_24_common['Test Name'].isin(keep_IgEs)]



In [63]:

impact_serum_eligible_24_common['Test Name'].unique()


array(['Peanut IgE', 'Total IgE'], dtype=object)

In [64]:


print(impact_serum_eligible_24_common.shape)
impact_serum_eligible_24_common.head()

(234, 8)


Unnamed: 0,Participant ID,Collection Date,Visit,Test Name,Unit,Value,Passed Visit 24 OFC No Imputation,Passed Visit 26 OFC No Imputation
25,IMPACT_102436,2015-11-20,24,Peanut IgE,kU/L,8.41,1.0,0.0
29,IMPACT_102436,2015-11-20,24,Total IgE,IU/mL,97.0,1.0,0.0
50,IMPACT_105670,2015-11-25,24,Peanut IgE,kU/L,10.7,1.0,0.0
54,IMPACT_105670,2015-11-25,24,Total IgE,IU/mL,70.0,1.0,0.0
75,IMPACT_113135,2016-09-12,24,Peanut IgE,kU/L,243.0,0.0,


# Merging "merged_ad_spt_ige" with "impact_serum_eligible_24_common"
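
A minimal sketch of how this merge could be done, not necessarily the approach this notebook ends up using. It assumes the serum rows are first spread into one column per `Test Name` and then joined on `['Participant ID', 'Visit']`, leaving `Collection Date` out of the keys so that a date mismatch cannot split a participant. The toy frames and their IDs are hypothetical.

```python
import pandas as pd

# Long-format serum toy data: one row per participant and test
serum_long = pd.DataFrame({
    "Participant ID": ["P1", "P1", "P2", "P2"],
    "Visit": [24, 24, 24, 24],
    "Test Name": ["Peanut IgE", "Total IgE", "Peanut IgE", "Total IgE"],
    "Value": [8.41, 97.0, 10.7, 70.0],
})

# One row per participant/visit with 'Peanut IgE' and 'Total IgE' as columns
serum_wide = (
    serum_long.pivot_table(
        index=["Participant ID", "Visit"],
        columns="Test Name",
        values="Value",
        aggfunc="first",
    )
    .reset_index()
    .rename_axis(columns=None)
)

# Stand-in for merged_ad_spt_ige (hypothetical values)
features = pd.DataFrame({
    "Participant ID": ["P1", "P2"],
    "Visit": [24, 24],
    "Wheal (mm)": [2.5, 0.0],
})

# validate raises MergeError if either side has duplicate keys
combined = features.merge(
    serum_wide, how="left", on=["Participant ID", "Visit"], validate="one_to_one"
)
print(combined)
```

Because `validate="one_to_one"` fails on duplicate keys, the duplicated IMPACT_320824 row would need to be resolved before a merge like this could run on the real data.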


---
# Serum Dataset & IgE_IgG4_component Common Features

Both datasets carry the same OFC outcome data under different column labels.  
These columns in Serum:

- 'Passed Visit 24 OFC for ITT',
- 'Passed Visit 24 OFC No Imputation',
- 'Passed Visit 26 OFC for ITT',
- 'Passed Visit 26 OFC No Imputation',
 
are the same data as these columns in IgE_IgG4_component

- OUT24ITT	
- OUT24NOI	
- OUT26ITT	
- OUT26NOI

---

#### Will take 'Peanut IgE' and 'Total IgE' from column 'Test Name' in Serum data  
#### Will take 'Component' and values columns from IgE_IgG4_component 


# Serum Cleaning
- taking Peanut IgE and Total IgE from this data set

In [None]:
# checking total unique participants is the same
len(impact_serum['Participant ID'].unique().tolist())
# output 146

In [None]:
impact_serum.columns.tolist()

In [None]:
#filtering by just the participant IDs in the eligible list
impact_serum_eligible = impact_serum[impact_serum['Participant ID'].isin(participant_ids)]

impact_serum_cols_to_keep = [
    'Participant ID',
    'Collection Date',
    'Visit',
    'Test Name', # Peanut IgE, Peanut IgE/Total IgE ratio, Peanut IgG4*, Peanut IgG4/IgE ratio, Total IgE
    'Unit',
    'Value', # results from 'Test Name'
    'Baseline Value', # IgE values taken during initial screening
    'Passed Visit 24 OFC No Imputation',
    'Passed Visit 26 OFC No Imputation',
#     'Tolerance outcome', # has values of nan,  2.,  4.,  1.,  3.
#     'Tolerance outcome (character)' # has values of nan, 'Desen_no_Tol', 'No_Desen_no_Tol', 'Desen_Tol', 'No_Desen_Tol'
]

# question for Dr. Gryak on what these mean: 
# 'Tolerance outcome', has values of nan,  2.,  4.,  1.,  3.
# 'Tolerance outcome (character)', has values of nan, 'Desen_no_Tol', 'No_Desen_no_Tol', 'Desen_Tol', 'No_Desen_Tol'

# this selection has to run: later cells read impact_serum_eligible_filtered
# .copy() so that adding columns later does not write into a slice of impact_serum_eligible
impact_serum_eligible_filtered = impact_serum_eligible[impact_serum_cols_to_keep].copy()


In [None]:
#checking to see the above filtering kept the correct number of 144 participants 
len(impact_serum_eligible_filtered['Participant ID'].unique().tolist()) #output 144 - correct

In [None]:
print(impact_serum_eligible_filtered.shape)
impact_serum_eligible_filtered.head()

### Making a new DF with a new column capturing initial OFC fail 
calling it "Passed Visit -2 OFC" to match other OFC columns
- Doing this because, according to the protocol, all 144 participants entered the study by meeting every eligibility criterion AND __having a clinical reaction at the intake OFC, so all 144 participants have a failed OFC status for their -2 visit__

In [None]:
impact_serum_eligible_filtered['Passed Visit -2 OFC'] = 0
impact_serum_eligible_filtered['Passed Visit -2 OFC'] = impact_serum_eligible_filtered['Passed Visit -2 OFC'].astype('int64')
impact_serum_eligible_filtered.head()

### filtering out from 'Test Name' everything but 'Peanut IgE' and 'Total IgE'

In [None]:
# removing rows from 'Test Name' for everything except 'Peanut IgE' and 'Total IgE'

#defining list to keep
keep_IgEs = ['Peanut IgE', 'Total IgE']

impact_serum_eligible_filtered = impact_serum_eligible_filtered[impact_serum_eligible_filtered['Test Name'].isin(keep_IgEs)]


In [None]:
impact_serum_eligible_filtered['Test Name'].unique()

In [None]:
print(impact_serum_eligible_filtered.shape)
impact_serum_eligible_filtered.head()

--- 
# IgE_IgG4_component Cleaning
- taking the components from this data set

In [None]:
impact_ige.columns.tolist()

In [None]:
#filtering by just the participant IDs in the eligible list
impact_ige_eligible = impact_ige[impact_ige['Participant ID'].isin(participant_ids)]

impact_ige_cols_to_keep = [
    'Participant ID',
    'Visit',
    'Antibody', #IgE, IgG4
    'Component', #rAra h 1, rAra h 2, rAra h 3, rAra h 6
    'Value',
    'Unit',
    'Baseline Value',
    'Collection Date',
]

# this selection has to run: later cells read impact_ige_eligible_filtered
impact_ige_eligible_filtered = impact_ige_eligible[impact_ige_cols_to_keep].copy()


In [None]:
#dropping IgG4 from Antibody column, only want IgE
impact_ige_eligible_filtered = impact_ige_eligible_filtered[impact_ige_eligible_filtered['Antibody']=='IgE']

In [None]:
print(impact_ige_eligible_filtered.shape)
impact_ige_eligible_filtered.head(10)

# Skin Prick Test cleaning 
(Skin Prick Test_2023-05-25_06-31-04.xlsx)

- taking wheal from this data set

In [None]:
# checking total unique participants is the same
len(impact_spt['Participant ID'].unique().tolist())
# output 146

In [None]:
impact_spt.columns.tolist()

In [None]:
#filtering by just the participant IDs in the eligible list
impact_spt_eligible = impact_spt[impact_spt['Participant ID'].isin(participant_ids)]

impact_spt_cols_to_keep = [
    'Participant ID',
    'Date of Allergy Skin Test (Character)', # this is similar to 'Collection Date' in other datasets 
    'Visit',
    'Wheal (mm)',
    'Wheal (mm) baseline'
]

# this selection has to run: later cells read impact_spt_eligible_filtered
impact_spt_eligible_filtered = impact_spt_eligible[impact_spt_cols_to_keep].copy()


In [None]:
# Changing 'Date of Allergy Skin Test (Character)' to 'Collection Date' to match other data sets

impact_spt_eligible_filtered = impact_spt_eligible_filtered.rename(columns={'Date of Allergy Skin Test (Character)': 'Collection Date'})


In [None]:
#checking to see the above filtering kept the correct number of 144 participants 
len(impact_spt_eligible_filtered['Participant ID'].unique().tolist()) #output 144 - correct

In [None]:
print(impact_spt_eligible_filtered.shape)
impact_spt_eligible_filtered.head()

---
# Building Baseline Datasets
Separating baseline data from rest of data.  
Baseline data varies from rest of the test visits in a few ways:
- interpreting OFC pass results from the protocol and filtering
- component units differ from those of the follow-up visits (16, 21, 24, 26)

In [None]:
# impact_ad_eligible_filtered (144, 12)
# impact_serum_eligible_filtered
# impact_ige_eligible_filtered
# impact_spt_eligible_filtered

---
# Cleaning ad baseline

summary of steps:
- setting 'Visit' to -2 for every row, since the ad data is intake information
- dropping all non baseline columns
- creating a new 'Age' column in months

### Notes on 'impact_ad_eligible_baseline' dataframe: 

144 unique participant IDs  
144 rows


In [None]:
#updating the 'Visit' column to say -2 since this is intake information
# .copy() so that impact_ad_eligible_filtered itself is left unchanged
impact_ad_eligible_baseline = impact_ad_eligible_filtered.copy()
impact_ad_eligible_baseline['Visit'] = -2
impact_ad_eligible_baseline['Visit'] = impact_ad_eligible_baseline['Visit'].astype('int64')
print(impact_ad_eligible_baseline.shape) #144 total entries in the DF = 144 unique participant IDs
impact_ad_eligible_baseline.head()

In [None]:
impact_ad_eligible_baseline = impact_ad_eligible_baseline.drop(columns=['Age at screening (years)', 'Study Status', 'Completed Study Protocol', 'Completed Study Assessments Numeric'])
impact_ad_eligible_baseline.head()

In [None]:
#converting the 'Age at Screening (years) Not Rounded' to months 
impact_ad_eligible_baseline['Age'] = impact_ad_eligible_baseline['Age at Screening (years) Not Rounded'] * 12
impact_ad_eligible_baseline = impact_ad_eligible_baseline.drop(columns=['Age at Screening (years) Not Rounded'])


In [None]:
print(impact_ad_eligible_baseline.shape)
impact_ad_eligible_baseline.head()

In [None]:
# renaming 'Date of Screening Visit' to 'Collection Date' for merger later
impact_ad_eligible_baseline = impact_ad_eligible_baseline.rename(columns={'Date of Screening Visit': 'Collection Date'})


In [None]:
print(impact_ad_eligible_baseline.shape)
impact_ad_eligible_baseline.head()

---
# Cleaning serum baseline 
summary of steps:
- dropping all rows for visits that are not -2 the initial visit
- dropping all non baseline columns
- Create new columns 'Peanut IgE (kU/L)' and 'Total IgE (kU/L)' (Total IgE is reported in IU/mL, which is numerically equal to kU/L)
- Merging duplicate rows and overwriting NaN values
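
The "new columns, then merge duplicate rows" steps amount to a long-to-wide reshape. As a minimal sketch of the same idea in one call, `pivot_table` can be used instead; the helper name `widen_baseline` and the toy values are assumptions for illustration, not the notebook's own code.

```python
import pandas as pd

def widen_baseline(long_df, index_cols, name_col, value_col):
    """Turn one-row-per-test data into one column per test name."""
    wide = long_df.pivot_table(index=index_cols, columns=name_col,
                               values=value_col, aggfunc='mean')
    wide.columns.name = None  # drop the leftover 'Test Name' axis label
    return wide.reset_index()

# toy data (hypothetical): P2 was sampled on two dates, P3 has only Peanut IgE
toy_long = pd.DataFrame({
    'Participant ID': ['P1', 'P1', 'P2', 'P2', 'P3'],
    'Collection Date': ['2015-01-05', '2015-01-05', '2015-02-01', '2015-02-03', '2015-03-01'],
    'Visit': [-2, -2, -2, -2, -2],
    'Test Name': ['Peanut IgE', 'Total IgE', 'Peanut IgE', 'Total IgE', 'Peanut IgE'],
    'Baseline Value': [8.4, 97.0, 10.7, 70.0, 243.0],
})

toy_wide = widen_baseline(toy_long,
                          ['Participant ID', 'Collection Date', 'Visit'],
                          'Test Name', 'Baseline Value')
print(toy_wide)
```

As with the groupby approach, a participant sampled on two dates still ends up with two rows, because 'Collection Date' is part of the index.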


### Notes on 'impact_serum_eligible_baseline_merged' dataframe: 

144 unique participant IDs  
159 rows because not all of them could be merged; 19 participants had issues:
- 15 participants had their IgE values read on different dates
- 4 only had ONE of their IgE values read during intake (either Peanut or Total, not both)
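
The two causes listed above can be separated programmatically. The sketch below is one possible way, on toy data; `classify_unmerged` is a hypothetical helper, and it assumes a participant with more than one row was sampled on more than one date.

```python
import numpy as np
import pandas as pd

def classify_unmerged(df, value_cols, key='Participant ID'):
    """Split problem participants into 'several rows' vs 'single row with a missing value'."""
    rows_per_id = df.groupby(key).size()
    multi_row_ids = set(rows_per_id[rows_per_id > 1].index)
    has_nan = df[value_cols].isna().any(axis=1)
    single_row_nan_ids = set(df.loc[has_nan, key]) - multi_row_ids
    return sorted(multi_row_ids), sorted(single_row_nan_ids)

# toy data (hypothetical): P2 has two dates, P3 is missing Total IgE
toy_merged = pd.DataFrame({
    'Participant ID': ['P1', 'P2', 'P2', 'P3'],
    'Collection Date': ['2015-01-05', '2015-02-01', '2015-02-03', '2015-03-01'],
    'Peanut IgE (kU/L)': [8.4, 10.7, np.nan, 243.0],
    'Total IgE (kU/L)': [97.0, np.nan, 70.0, np.nan],
})

multi_date, one_value_only = classify_unmerged(
    toy_merged, ['Peanut IgE (kU/L)', 'Total IgE (kU/L)'])
print(multi_date, one_value_only)
```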

In [None]:
impact_serum_eligible_baseline = impact_serum_eligible_filtered

# dropping all rows for visits that are not -2 the initial visit
impact_serum_eligible_baseline = impact_serum_eligible_baseline[impact_serum_eligible_baseline['Visit'] == -2]

# dropping all non baseline columns
impact_serum_eligible_baseline = impact_serum_eligible_baseline.drop(columns=['Passed Visit 24 OFC No Imputation', 
                                                                        'Passed Visit 26 OFC No Imputation',
                                                                        'Value'
                                                                       ])

In [None]:
len(impact_serum_eligible_baseline['Participant ID'].unique().tolist())

In [None]:
impact_serum_eligible_baseline.head()

In [None]:
# Create new columns with initial NaN values
impact_serum_eligible_baseline['Peanut IgE (kU/L)'] = np.nan
impact_serum_eligible_baseline['Total IgE (kU/L)'] = np.nan

# Populate the new columns based on conditions
is_peanut = impact_serum_eligible_baseline['Test Name'] == 'Peanut IgE'
is_total = impact_serum_eligible_baseline['Test Name'] == 'Total IgE'
impact_serum_eligible_baseline.loc[is_peanut, 'Peanut IgE (kU/L)'] = impact_serum_eligible_baseline.loc[is_peanut, 'Baseline Value']
impact_serum_eligible_baseline.loc[is_total, 'Total IgE (kU/L)'] = impact_serum_eligible_baseline.loc[is_total, 'Baseline Value']

# Drop the specified columns
impact_serum_eligible_baseline = impact_serum_eligible_baseline.drop(columns=['Test Name', 'Unit', 'Baseline Value'])


In [None]:
len(impact_serum_eligible_baseline['Participant ID'].unique().tolist())

In [None]:
print(impact_serum_eligible_baseline.shape) # (284, 6)
impact_serum_eligible_baseline.head()
# note if all of the 144 participants had 2 rows for Peanut and Total, we'd have 288, but we have 284

In [None]:
# Merging duplicate rows and overwriting NaN values

# Group by the identifying columns and take the mean of each IgE column;
# mean skips NaN, so rows that share the same keys collapse into one row
impact_serum_eligible_baseline_merged = impact_serum_eligible_baseline.groupby(['Participant ID', 'Collection Date', 'Visit', 'Passed Visit -2 OFC'], as_index=False).agg({'Peanut IgE (kU/L)': 'mean', 'Total IgE (kU/L)': 'mean'})

# Reset the index
impact_serum_eligible_baseline_merged.reset_index(drop=True, inplace=True)


In [None]:
impact_serum_eligible_baseline_merged.head(10)

In [None]:
print(len(impact_serum_eligible_baseline_merged['Participant ID'].unique().tolist()))
print(impact_serum_eligible_baseline_merged.shape)

In [None]:
# Exploring where the extra rows came from

# making a df of just the rows with nan values
serum_nan_rows = impact_serum_eligible_baseline_merged[impact_serum_eligible_baseline_merged.isna().any(axis=1)]


In [None]:
print(serum_nan_rows.shape)
serum_nan_rows.head(34)

In [None]:
len(serum_nan_rows['Participant ID'].unique().tolist())


---
# Cleaning IgE baseline 
summary of steps:
- dropping all rows for visits that are not -2 the initial visit
- dropping all non baseline columns
- Create new columns for components
- Merging duplicate rows and overwriting NaN values


### Notes on 'impact_ige_eligible_baseline_merged' dataframe: 

141 unique participant IDs because
- 3 did not have baseline data for initial visit (IDs: 'IMPACT_149018', 'IMPACT_746400', 'IMPACT_920870')   

177 rows because not all of them could be merged:
- 36 participants had their component values read on different dates
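
Participants with no baseline visit can be found with a set difference. This is a minimal sketch on toy data; `ids_without_visit` is a hypothetical helper name.

```python
import pandas as pd

def ids_without_visit(df, visit=-2, key='Participant ID', visit_col='Visit'):
    """Return the sorted IDs that never appear with the given visit number."""
    all_ids = set(df[key])
    ids_with_visit = set(df.loc[df[visit_col] == visit, key])
    return sorted(all_ids - ids_with_visit)

# toy data (hypothetical): P2 has no -2 visit
toy_visits = pd.DataFrame({
    'Participant ID': ['P1', 'P1', 'P2', 'P3', 'P3'],
    'Visit': [-2, 16, 16, -2, 24],
})

missing = ids_without_visit(toy_visits)
print(missing)
```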

In [None]:
impact_ige_eligible_baseline = impact_ige_eligible_filtered

In [None]:
print(len(impact_ige['Participant ID'].unique().tolist()))

In [None]:
print(len(impact_ige_eligible_filtered['Participant ID'].unique().tolist()))


In [None]:
print(len(impact_ige_eligible_baseline['Participant ID'].unique().tolist()))

In [None]:
# just the participant IDs for those who have -2 in their history
impact_ige_eligible_baseline_2 = impact_ige_eligible_baseline[impact_ige_eligible_baseline['Visit'] == -2]

In [None]:
len(impact_ige_eligible_baseline_2['Participant ID'].unique().tolist())

In [None]:
# named missing_baseline_ids so the eligible participant_ids list used in earlier cells is not overwritten
missing_baseline_ids = impact_ige_eligible_baseline[~impact_ige_eligible_baseline['Participant ID'].isin(impact_ige_eligible_baseline_2['Participant ID'])]['Participant ID'].tolist()


In [None]:
#dropping duplicates
missing_baseline_ids = list(set(missing_baseline_ids))
missing_baseline_ids

# output: ['IMPACT_149018', 'IMPACT_746400', 'IMPACT_920870'] THESE ARE THE 3 

# manually checking these IDs in the IgE spreadsheet; 

# IMPACT_149018 
# in IgE dataset; missing -2 and 16  visit, missing component baseline values 
# in serum dataset; Confirming baseline obtained for Peanut IgE but missing Total IgE
# in ad dataset; Study Status = Early Termination

# IMPACT_746400
# in IgE dataset; missing -2 visit, missing component baseline values 
# in serum dataset; Confirming baseline obtained for Peanut IgE but missing Total IgE
# in ad dataset; Study Status = Completed Study

# IMPACT_920870
# in IgE dataset; missing -2 visit, missing component baseline values
# in serum dataset; Confirming baseline obtained for Peanut IgE but missing Total IgE
# in ad dataset; Study Status = Completed Study

In [None]:
# removing these from the baseline data 
impact_ige_eligible_baseline = impact_ige_eligible_filtered[~impact_ige_eligible_filtered['Participant ID'].isin(missing_baseline_ids)]


In [None]:
print(len(impact_ige_eligible_baseline['Participant ID'].unique().tolist()))
# output 141, correctly filtered out the participants with missing baseline data 

In [None]:
# dropping all rows for visits that are not -2 the initial visit
impact_ige_eligible_baseline = impact_ige_eligible_baseline[impact_ige_eligible_baseline['Visit'] == -2]

# dropping all non baseline columns
impact_ige_eligible_baseline = impact_ige_eligible_baseline.drop(columns=['Value'])

In [None]:
print(impact_ige_eligible_baseline.shape)
impact_ige_eligible_baseline.head()

In [None]:
# Creating new columns for each component

# Create new columns with initial NaN values
impact_ige_eligible_baseline['Ara h1 (kU/L)'] = np.nan
impact_ige_eligible_baseline['Ara h2 (kU/L)'] = np.nan
impact_ige_eligible_baseline['Ara h3 (kU/L)'] = np.nan
impact_ige_eligible_baseline['Ara h6 (kU/L)'] = np.nan

# Populate the new columns based on conditions
component_columns = {
    'rAra h 1': 'Ara h1 (kU/L)',
    'rAra h 2': 'Ara h2 (kU/L)',
    'rAra h 3': 'Ara h3 (kU/L)',
    'rAra h 6': 'Ara h6 (kU/L)',
}
for component, column in component_columns.items():
    is_component = impact_ige_eligible_baseline['Component'] == component
    impact_ige_eligible_baseline.loc[is_component, column] = impact_ige_eligible_baseline.loc[is_component, 'Baseline Value']

# Drop the specified columns
impact_ige_eligible_baseline = impact_ige_eligible_baseline.drop(columns=['Antibody', 'Component', 'Baseline Value', 'Unit'])


In [None]:
print(impact_ige_eligible_baseline.shape)
impact_ige_eligible_baseline.head()

In [None]:
# Merging duplicate rows and overwriting NaN values

# Group by the identifying columns and take the mean of each component column;
# mean skips NaN, so rows that share the same keys collapse into one row
impact_ige_eligible_baseline_merged = impact_ige_eligible_baseline.groupby(['Participant ID', 'Collection Date', 'Visit'], as_index=False).agg({'Ara h1 (kU/L)': 'mean', 'Ara h2 (kU/L)': 'mean', 'Ara h3 (kU/L)': 'mean', 'Ara h6 (kU/L)': 'mean'})

# Reset the index
impact_ige_eligible_baseline_merged.reset_index(drop=True, inplace=True)


In [None]:
print(impact_ige_eligible_baseline_merged.shape)
impact_ige_eligible_baseline_merged.head(50)

In [None]:
print(len(impact_ige_eligible_baseline_merged['Participant ID'].unique().tolist()))

In [None]:
# checking which rows have NaN Values 
ig_nan_rows = impact_ige_eligible_baseline_merged[impact_ige_eligible_baseline_merged.isna().any(axis=1)]


In [None]:
print(ig_nan_rows.shape)
ig_nan_rows.head(72)

In [None]:
# getting the participant IDs with the NaN values
nan_participant_ids = impact_ige_eligible_baseline_merged.loc[impact_ige_eligible_baseline_merged.isna().any(axis=1), 'Participant ID'].tolist()
nan_participant_ids

In [None]:
#dropping duplicates
nan_participant_ids = list(set(nan_participant_ids))
nan_participant_ids

len(nan_participant_ids)

---
# Cleaning Skin Prick Test (Wheal) baseline 
summary of steps:
- dropping all rows for visits that are not -2 the initial visit
- dropping all non baseline columns
- Merging duplicate rows and overwriting NaN values


### Notes on 'impact_spt_eligible_baseline' dataframe: 

- 144 unique participant IDs, with no extra rows, because each participant's baseline wheal data was recorded on a single date 

In [None]:
impact_spt_eligible_baseline = impact_spt_eligible_filtered

In [None]:
print(len(impact_spt['Participant ID'].unique().tolist()))

In [None]:
print(len(impact_spt_eligible_filtered['Participant ID'].unique().tolist()))


In [None]:
print(len(impact_spt_eligible_baseline['Participant ID'].unique().tolist()))

In [None]:
#checking if there are any participants that don't have baseline wheal data (-2 visit)
impact_spt_eligible_baseline_2 = impact_spt_eligible_baseline[impact_spt_eligible_baseline['Visit'] == -2]
len(impact_spt_eligible_baseline_2['Participant ID'].unique().tolist())
#output 144, all eligible participants have wheal baseline data

In [None]:
#dropping nonbaseline data
impact_spt_eligible_baseline = impact_spt_eligible_baseline.drop(columns=['Wheal (mm)'])


In [None]:
print(impact_spt_eligible_baseline.shape)
impact_spt_eligible_baseline.head()

In [None]:
# dropping all rows for visits that are not -2 the initial visit
impact_spt_eligible_baseline = impact_spt_eligible_baseline[impact_spt_eligible_baseline['Visit'] == -2]


In [None]:
print(impact_spt_eligible_baseline.shape)
impact_spt_eligible_baseline.head()

### viewing all baseline data so far 


In [None]:
#impact_ige_eligible_baseline_merged
#impact_serum_eligible_baseline_merged
#impact_ad_eligible_baseline
#impact_spt_eligible_baseline

In [None]:
# Finding the participant IDs that are common among all 4 baseline data frames
ige_list = impact_ige_eligible_baseline_merged['Participant ID'].tolist()
serum_list = impact_serum_eligible_baseline_merged['Participant ID'].tolist()
ad_list = impact_ad_eligible_baseline['Participant ID'].tolist()
spt_list = impact_spt_eligible_baseline['Participant ID'].tolist()

common_participants = set(ige_list).intersection(serum_list, ad_list, spt_list)
common_participants = list(common_participants)


In [None]:
len(common_participants)
#output 141

In [None]:
# Updating each data frame so that they only contain the common participants 
impact_ige_eligible_baseline_merged_common = impact_ige_eligible_baseline_merged[impact_ige_eligible_baseline_merged['Participant ID'].isin(common_participants)]
impact_serum_eligible_baseline_merged_common = impact_serum_eligible_baseline_merged[impact_serum_eligible_baseline_merged['Participant ID'].isin(common_participants)]
impact_ad_eligible_baseline_common = impact_ad_eligible_baseline[impact_ad_eligible_baseline['Participant ID'].isin(common_participants)]
impact_spt_eligible_baseline_common = impact_spt_eligible_baseline[impact_spt_eligible_baseline['Participant ID'].isin(common_participants)]

In [None]:
print(impact_ige_eligible_baseline_merged_common.shape) #multiple dates for obtaining baseline data for some IDs
print(impact_serum_eligible_baseline_merged_common.shape) #multiple dates for obtaining baseline data for some IDs
print(impact_ad_eligible_baseline_common.shape)
print(impact_spt_eligible_baseline_common.shape)

In [None]:
#Merging Serum and Ad baselines
# outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
merged_serum_ad = impact_serum_eligible_baseline_merged_common.merge(impact_ad_eligible_baseline_common, 
                                                               how='outer', #preserves all rows and adds NaN for empty rows
                                                               on=['Participant ID', 'Collection Date', 'Visit'])


In [None]:
print(merged_serum_ad.shape)
merged_serum_ad.head()

In [None]:
#Merging the merged_serum_ad data with impact_ige_eligible_baseline_merged_common baselines
merged_serum_ad_ige_baseline = impact_ige_eligible_baseline_merged_common.merge(merged_serum_ad, 
                                                               how='outer', 
                                                               on=['Participant ID', 'Collection Date', 'Visit'])

In [None]:
print(merged_serum_ad_ige_baseline.shape)
merged_serum_ad_ige_baseline.head()

In [None]:
# exporting this to .xlsx to finish merging the participants who had baseline 
# data on different dates manually in excel

#commenting out to prevent exporting again
#merged_serum_ad_ige_baseline.to_excel('merged_serum_ad_ige_baseline_raw_2.xlsx', index=False)
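
The manual Excel step could in principle be scripted. The sketch below is one possible rule, not the one applied in the spreadsheet: keep the earliest collection date per participant and the first non-missing value of every other column. `collapse_to_one_row` is a hypothetical helper, and whether ignoring the date differences is acceptable is an assumption that needs checking against the protocol.

```python
import numpy as np
import pandas as pd

def collapse_to_one_row(df, key='Participant ID', date_col='Collection Date'):
    """One row per participant: earliest date, first non-null value for each other column."""
    ordered = df.sort_values([key, date_col])
    # GroupBy.first() returns the first non-null entry of each column within a group
    return ordered.groupby(key, as_index=False).first()

# toy data (hypothetical): P2's two IgE values were read on different dates
toy_dates = pd.DataFrame({
    'Participant ID': ['P2', 'P1', 'P2'],
    'Collection Date': pd.to_datetime(['2015-02-03', '2015-01-05', '2015-02-01']),
    'Peanut IgE (kU/L)': [np.nan, 8.4, 10.7],
    'Total IgE (kU/L)': [70.0, 97.0, np.nan],
})

collapsed = collapse_to_one_row(toy_dates)
print(collapsed)
```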

In [None]:
# importing back in the cleaned file 

merged_serum_ad_ige_baseline_141 = pd.read_excel("Data/Impact_Study/merged_serum_ad_ige_baseline_141_2.xlsx")

In [None]:
print(merged_serum_ad_ige_baseline_141.shape)
merged_serum_ad_ige_baseline_141.head()

In [None]:
# merging impact_spt_eligible_baseline_common data with merged_serum_ad_ige_baseline_141
# both have 141 rows, should be 1:1
impact_all_baseline = merged_serum_ad_ige_baseline_141.merge(impact_spt_eligible_baseline_common, 
                                                               how='outer', 
                                                               on=['Participant ID', 'Collection Date', 'Visit'])

In [None]:
print(impact_all_baseline.shape)
impact_all_baseline.head()

In [None]:
# did not merge 1:1; there are 10 more rows than expected.
# Assuming again it's because wheal samples were taken on slightly different dates.
# exporting to manually parse in excel again 

#commenting out to prevent exporting again
#impact_all_baseline.to_excel('Data/IMPACT_Study/impact_all_baseline_raw_2.xlsx', index=False)
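
The assumption that the extra rows come from mismatched dates can be tested with the `indicator` option of `merge`, which labels each row as `both`, `left_only` or `right_only`. A minimal sketch on toy data follows; `unmatched_rows` is a hypothetical helper.

```python
import pandas as pd

def unmatched_rows(left, right, on):
    """Outer-merge two frames and return only the rows that did not pair up."""
    merged = left.merge(right, how='outer', on=on, indicator=True)
    return merged[merged['_merge'] != 'both']

# toy data (hypothetical): P2's wheal was measured three days after the IgE sample
toy_left = pd.DataFrame({
    'Participant ID': ['P1', 'P2'],
    'Collection Date': ['2015-01-05', '2015-02-01'],
    'Visit': [-2, -2],
    'Peanut IgE (kU/L)': [8.4, 10.7],
})
toy_right = pd.DataFrame({
    'Participant ID': ['P1', 'P2'],
    'Collection Date': ['2015-01-05', '2015-02-04'],
    'Visit': [-2, -2],
    'Wheal (mm) baseline': [9.5, 11.0],
})

problems = unmatched_rows(toy_left, toy_right,
                          ['Participant ID', 'Collection Date', 'Visit'])
print(problems[['Participant ID', 'Collection Date', '_merge']])
```

Every participant that appears in the result has keys that differ between the two frames, which gives the list of IDs to reconcile by hand.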


# Final transformation of dataset to match LEAP columns 

In [None]:
# importing back in the cleaned file 

impact_all_baseline_141 = pd.read_excel("Data/Impact_Study/impact_all_baseline_141_2.xlsx")


In [None]:
print(impact_all_baseline_141.shape)
impact_all_baseline_141.head()

In [None]:
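# drop all unnecessary columns 
# errors='ignore' because this column was already dropped from the ad baseline earlier and may not be in the file
impact_all_baseline_141 = impact_all_baseline_141.drop(columns=['Completed Study Protocol'], errors='ignore')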


In [None]:
# rename 'Passed Visit -2 OFC' to 'OFC Pass'
impact_all_baseline_141 = impact_all_baseline_141.rename(columns={'Passed Visit -2 OFC': 'OFC Pass'})

# rename Wheal (mm) baseline
impact_all_baseline_141 = impact_all_baseline_141.rename(columns={'Wheal (mm) baseline': 'Wheal (mm)'})


In [None]:
print(impact_all_baseline_141.shape)
impact_all_baseline_141.head()

#### Creating Binary column from Child's Sex -> Male

note original mapping in the raw IMPACT dataset was:  
'Male', 'Female'  

in this new encoding:  
1=male, 0=female
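
The same encoding can be written with `Series.map`, shown here as a small sketch on toy values (the series below is hypothetical). Values that are not in the dictionary become missing rather than raising an error.

```python
import pandas as pd

toy_sex = pd.Series(['Male', 'Female', 'Unknown'])

# unmapped values become NaN, then <NA> after the nullable-integer cast
toy_male = toy_sex.map({'Male': 1, 'Female': 0}).astype('Int64')
print(toy_male)
```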

In [None]:
# function to map the values in 'Sex (character)' to integers
def encode_sex(sex):
    if sex == "Male":
        return 1
    elif sex == "Female":
        return 0
    else:
        return None  # Return None for any other values

# create the new 'Male' column
impact_all_baseline_141['Male'] = impact_all_baseline_141['Sex (character)'].apply(encode_sex)

# drop original column 
impact_all_baseline_141 = impact_all_baseline_141.drop(columns=['Sex (character)'])


In [None]:
print(impact_all_baseline_141.shape)
impact_all_baseline_141.head()

In [None]:
impact_all_baseline_141['Race'].unique()

# One Hot Encoding for Race 
note original values for "Race" from the raw IMPACT dataset is as follows
- 'White/Caucasian', 
- 'Mixed Race', 
- 'Asian',
- 'Black or African American'
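
One detail worth keeping in mind when one-hot encoding: `pd.get_dummies` ignores missing values unless `dummy_na=True` is passed, so a missing race produces a row of zeros rather than its own column. A short sketch on a toy series (hypothetical values):

```python
import numpy as np
import pandas as pd

toy_race = pd.Series(['Asian', np.nan, 'Mixed Race'], name='Race')

default_dummies = pd.get_dummies(toy_race)               # no column for NaN
with_nan_column = pd.get_dummies(toy_race, dummy_na=True)  # extra column flags NaN

print(default_dummies.shape, with_nan_column.shape)
```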

In [None]:
# Create dummy variables for "race"
race_dummies = pd.get_dummies(impact_all_baseline_141['Race'])

# Define the mapping between values and column names
race_mapping = {
    'White/Caucasian': 'White',
    'Black or African American': 'Black',
    'Asian': 'Asian',
    'Mixed Race': 'Mixed',
}

# iterate over the mapping and update the dataframe columns
for value, column_name in race_mapping.items():
    if value in race_dummies.columns:
        impact_all_baseline_141[column_name] = race_dummies[value].astype(int)
    else:
        impact_all_baseline_141[column_name] = 0

# get_dummies skips NaN by default, so missing race is flagged explicitly
impact_all_baseline_141['Unknown'] = impact_all_baseline_141['Race'].isna().astype(int)

# drop original "race" column
impact_all_baseline_141 = impact_all_baseline_141.drop('Race', axis=1)


In [None]:
print(impact_all_baseline_141.shape)
impact_all_baseline_141.head()

In [None]:
# add the missing columns (Flare, h8, h9, Other) and fill with NaNs so the dataset matches LEAP
impact_all_baseline_141['Flare (mm)'] = np.nan
impact_all_baseline_141['Ara h8 (kU/L)'] = np.nan
impact_all_baseline_141['Ara h9 (kU/L)'] = np.nan
impact_all_baseline_141['Other'] = np.nan

In [None]:
print(impact_all_baseline_141.shape)
impact_all_baseline_141.head()

In [None]:
# reorganizing columns 

print(impact_all_baseline_141.columns)

In [None]:
print(impact_all_baseline_141.shape)
impact_all_baseline_141.head()

In [None]:
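# Define the desired column order
# note: reindex creates an all-NaN column for any name that is not already in the dataframe,
# so every name must match an existing column exactly ('Ara h6' is in kU/L like the other components)
desired_order = ['Participant ID',
                 'Collection Date',
                 'Visit',
                 "Age",
                 "Male",
                 "White",
                 "Black",
                 "Asian",
                 "Other",
                 "Mixed",
                 "Unknown",
                 "Wheal (mm)",
                 "Flare (mm)",
                 "Total IgE (kU/L)",
                 "Peanut IgE (kU/L)",
                 "Ara h1 (kU/L)",
                 "Ara h2 (kU/L)",
                 "Ara h3 (kU/L)",
                 "Ara h6 (kU/L)",
                 "Ara h8 (kU/L)",
                 "Ara h9 (kU/L)",
                 "OFC Pass"]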

# Reorder the columns using reindex
impact_all_baseline_141 = impact_all_baseline_141.reindex(columns=desired_order)
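
Because `reindex(columns=...)` fills unknown names with NaN instead of raising, a misspelled column name silently discards real data. A small guard, sketched here on toy data with a hypothetical helper `missing_columns`, lists the requested names that do not exist before reordering.

```python
import pandas as pd

def missing_columns(df, desired_order):
    """Names in desired_order that are not columns of df."""
    return [col for col in desired_order if col not in df.columns]

# toy data (hypothetical): the requested unit label does not match the column
toy_frame = pd.DataFrame({'Participant ID': ['P1'], 'Ara h6 (kU/L)': [0.05]})
wanted = ['Participant ID', 'Ara h6 (kUA/L)']

not_found = missing_columns(toy_frame, wanted)
reindexed = toy_frame.reindex(columns=wanted)

print(not_found)
print(reindexed)
```

Running the check on `impact_all_baseline_141` with `desired_order` before the reindex would show which columns, if any, are about to be created empty.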


In [None]:
print(impact_all_baseline_141.shape)
impact_all_baseline_141.head()

# Final export of baseline clean data for IMPACT study 

In [None]:
# final export; comment this line out after running to prevent exporting again
impact_all_baseline_141.to_excel('Data/IMPACT_Study/impact_all_baseline_clean_2.xlsx', index=False)
