In [1]:
import pandas as pd
import numpy as np

In [2]:
impact_ad = pd.read_excel("Data/IMPACT_Study/ADSTART0_2023-05-25_06-46-11.xlsx")
impact_serum = pd.read_excel("Data/IMPACT_Study/Serum antibody_2023-05-25_06-32-15.xlsx")
impact_ige = pd.read_excel("Data/IMPACT_Study/IgE_IgG4_component_2023-05-25_06-30-40.xlsx")
impact_spt = pd.read_excel("Data/IMPACT_Study/Skin Prick Test_2023-05-25_06-31-04.xlsx")

# About Data Sets

#### ADSTART0_2023-05-25_06-46-11 
- Overview of participant data 

#### IgE_IgG4_component_2023-05-25_06-30-40
- IgE component levels in this dataset  

#### Serum antibody_2023-05-25_06-32-15
OFC results:
- "Passed Visit 24 OFC for ITT"
- "Passed Visit 24 OFC No Imputation"	
- "Passed Visit 26 OFC for ITT"
- "Passed Visit 26 OFC No Imputation"

#### Skin Prick Test_2023-05-25_06-31-04
- "Wheal (mm)"

#### BAT data_2023-05-25_06-31-20
- Stands for "basophil activation test (BAT)"
- No known use case for this model 

---

# Acronyms 
- Intent-to-treat (ITT)
- Oral Immunotherapy (OIT)
- Initial Dose Escalation (IDE)

# About Study 
- taken from 2022 Lancet_IMPACT.pdf (pdf page 3)  

Children aged 12 months or older and younger than 
48 months were screened for inclusion in the study. 

__Inclusion criteria__ included the following: a clinical 
history of peanut allergy or avoidance without ever 
having eaten peanut, peanut-specific IgE levels of 5 kUA/L 
or higher, a skin prick test (SPT) wheal size greater than 
that of saline control by 3 mm or more, and a positive 
reaction to a cumulative dose of 500 mg or less of peanut 
in a double-blind, placebo-controlled food challenge 
(DBPCFC).   
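
As a rough sketch, the numeric inclusion criteria can be written as a boolean check. The function name, argument names, and example values are hypothetical and are not columns from the IMPACT data files; the clinical-history criterion is not modeled.

```python
def meets_inclusion(age_months, peanut_ige_kua_l, spt_wheal_mm, saline_wheal_mm,
                    dbpcfc_reactive_dose_mg):
    """Return True if hypothetical screening values satisfy the numeric inclusion criteria.

    dbpcfc_reactive_dose_mg is the cumulative dose at which a reaction occurred,
    or None if there was no reaction during the DBPCFC.
    """
    age_ok = 12 <= age_months < 48                       # 12 months or older, younger than 48 months
    ige_ok = peanut_ige_kua_l >= 5                       # peanut-specific IgE of 5 kUA/L or higher
    spt_ok = (spt_wheal_mm - saline_wheal_mm) >= 3       # wheal at least 3 mm larger than saline control
    dbpcfc_ok = (dbpcfc_reactive_dose_mg is not None
                 and dbpcfc_reactive_dose_mg <= 500)     # reaction at 500 mg or less
    return age_ok and ige_ok and spt_ok and dbpcfc_ok


print(meets_inclusion(30, 25.7, 17.5, 0.0, 100))   # all numeric criteria met
print(meets_inclusion(50, 25.7, 17.5, 0.0, 100))   # too old at screening
```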

__Exclusion criteria__ included a history of 
severe anaphylaxis with hypotension to peanut, more 
than mild asthma or uncontrolled asthma, uncontrolled 
atopic dermatitis, and eosinophilic gastrointestinal 
disease (the full list of exclusion criteria is presented in 
the appendix p 2).

---
# Study Design

This is a randomized, double-blind, placebo-controlled, multi-center study comparing peanut oral immunotherapy to placebo. Eligible participants with peanut allergy will be randomly assigned to receive either peanut OIT or placebo for 134 weeks followed by peanut avoidance for 26 weeks.  

An initial blinded oral food challenge (OFC) to 1 g of peanut flour (500 mg peanut protein) will be conducted. Participants must have a clinical reaction during this blinded OFC to initiate study dosing. After the initial blinded OFC, the study design includes the following:  

__Initial Dose Escalation:__ This will occur on a single day in which multiple doses are given. Peanut or placebo dosing will be given incrementally and increase every 15-30 minutes until a dose of 12 mg peanut flour (6 mg peanut protein) or placebo flour is given. The first four doses will be administered as a peanut flour extract of 0.1 to 0.8 mg peanut protein, which is 10 to 80 microliters peanut flour extract, or placebo flour extract, and the last three doses will be given as 3 to 12 mg peanut flour (1.5 to 6 mg peanut protein) or placebo flour. Participants must tolerate a dose of at least 3 mg peanut flour (1.5 mg peanut protein) or placebo flour to remain in the study.  

__Build-up:__ After the initial dose escalation day, the participant will return to the research unit the next morning for an observed dose administration of the highest tolerated dose from the initial escalation day. The participant will then continue on the daily OIT dosing at home and return to the research unit every 2 weeks for a dose escalation. The dosing escalations will be consistent with previous similar OIT studies.
  
Participants who do not reach the 4000 mg peanut flour (2000 mg peanut protein) or placebo flour dose during the build-up phase may enter maintenance phase at their highest tolerated dose, which must be at least 500 mg peanut flour (250 mg peanut protein) or placebo flour.  

The build-up phase will comprise 30 weeks.  

__Maintenance:__ The participant will continue on daily OIT with return visits every 13 weeks. At the end of this phase the participant will undergo a blinded OFC to 10 g peanut flour (5 g peanut protein).  
This phase will comprise 104 weeks.  

__Avoidance:__ In this final phase participants stop OIT and will avoid peanut consumption. They will be seen 2 weeks and 26 weeks after initiating this phase. At the completion of this phase participants will have a final blinded OFC to 10 g peanut flour (5 g peanut protein). Participants who do not have a clinical reaction to the challenge will receive an Open Food Challenge (OpFC).  

Avoidance will comprise 26 weeks.  

__Post-challenge:__ If participants do not have a clinical reaction during the OpFC at the end of avoidance, they will be allowed to consume peanut and will have one visit which will include peripheral blood sampling for mechanistic assays assessments.  

Post-challenge will comprise 2 weeks.  

---
# Exploring where and how OFCs are captured
Looking for 3 OFC tests in total

### 1st OFC: According to Study Design in protocol: 
-  Initial reaction: "An initial blinded oral food challenge (OFC) to 1 g of peanut flour (500 mg peanut protein) will be conducted. Participants must have a clinical reaction during this blinded OFC to initiate study dosing." (IDE)
- This challenge used 0.5 g (500 mg) peanut protein

### 2nd & 3rd OFC: According to Schedule of Assessments: Appendix 2:
- 5 g Oral Food Challenge performed at visit 24/week 134 (end of the Maintenance phase) and at visit 26/week 160 (end of the Avoidance phase)
- Both challenges went up to 5 g peanut protein (10 g peanut flour)
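
The week numbers can be cross-checked against the phase lengths given in the Study Design section. This is a minimal sketch; the variable names are illustrative, and the initial dose escalation day is assumed to be week 0.

```python
from itertools import accumulate

# Phase lengths in weeks, as stated in the Study Design section
phases = [("Build-up", 30), ("Maintenance", 104), ("Avoidance", 26)]

# The running total gives the study week at which each phase ends
end_weeks = dict(zip((name for name, _ in phases),
                     accumulate(weeks for _, weeks in phases)))

print(end_weeks)
```

The running totals put the end of Maintenance at week 134 and the end of Avoidance at week 160, which lines up with the visit 24 and visit 26 OFCs.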



---
# Filtering participants that were study eligible (n=144)
- Total participants enrolled = 209  
- Total participants eligible = 144
- Total excluded = 65 (63 that were not randomized + 2 that could not tolerate the IDE)

#### filtering solution (this will help filter initial screening OFC results too)
- filter by ADSTART0 -> impact_ad['Randomized'] == "Yes"
- filter by ADSTART0 -> "Study Termination Reason"-> "Inability to reach 3 mg peanut/placebo flour (1.5 mg peanut/placebo protein) during the initial dose escalation" (These are the two that terminated after being randomized but could not reach their IDE: 146-2 -> n=144)

Additional filter to get just those initial screening OFC rows:
- filter by visit -2 or -1 (this is the initial screening visit number)

Explanation: There doesn't seem to be an OFC column for the initial screening.   
However, the protocol criteria can be used to determine who failed their first OFC at study intake, and the OFC fail column can then be assigned manually.   
These would be the 144 participants who were eligible for the study and have a visit value of -2 or -1.
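
The filtering logic described here can be sketched on a toy frame. The participant IDs and the `eligible_mask` helper are made up for illustration; only the column names and the termination-reason string come from ADSTART0.

```python
import pandas as pd

IDE_FAIL = ("Inability to reach 3 mg peanut/placebo flour "
            "(1.5 mg peanut/placebo protein) during the initial dose escalation")

# Toy stand-in for ADSTART0; IDs and values are invented
toy_ad = pd.DataFrame({
    "Participant ID": ["P1", "P2", "P3", "P4"],
    "Randomized": ["Yes", "Yes", "No", "Yes"],
    "Study Termination Reason": [None, IDE_FAIL, None, "Other"],
})


def eligible_mask(df):
    """Randomized participants who did not terminate at the initial dose escalation."""
    randomized = df["Randomized"].eq("Yes")
    failed_ide = df["Study Termination Reason"].eq(IDE_FAIL)
    return randomized & ~failed_ide


toy_eligible = toy_ad[eligible_mask(toy_ad)].copy()
# Every eligible participant reacted at the screening OFC, so the flag is 0 (failed)
toy_eligible["Passed Visit -2 OFC"] = 0
print(toy_eligible["Participant ID"].tolist())
```

Taking `.copy()` before adding the column keeps the assignment from touching a slice of the original frame.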

In [3]:
# Using 'Randomized' to count the 146 that passed the criteria and were randomly assigned for the study 
(impact_ad['Randomized']=="Yes").sum() #output = 146

# filtering out two more participants that failed during the initial dose escalation (IDE)
# These are the two that terminated after being randomized 146-2 -> n=144

(impact_ad['Study Termination Reason']=="Inability to reach 3 mg peanut/placebo flour (1.5 mg peanut/placebo protein) during the initial dose escalation").sum()
#output = 2

2

### Creating a new DF for just the eligible 144 randomly assigned participants 
- and removing unnecessary columns 

In [4]:
impact_ad_eligible = impact_ad.loc[(impact_ad['Randomized'] == 'Yes') & ~(impact_ad['Study Termination Reason'] == 'Inability to reach 3 mg peanut/placebo flour (1.5 mg peanut/placebo protein) during the initial dose escalation')]

impact_ad_cols_to_keep = [
     'Participant ID',
     'Visit', # -2 is the initial participant screening 
     'Date of Screening Visit',
     'Randomized', # Use this to filter out participants that did not pass the study screening
     'Study Status', # Values: 'Discontinued Therapy', 'Completed Study', 'Enrolled but not Randomized', 'Early Termination', 'Screen Failure', 'Screened but not Enrolled'
     'Completed Study Protocol', # Values: 'No', 'Yes'
     'Sex (character)',
     'Race',
     'Completed Study Assessments Numeric', #1 for yes/ 0 for no (means attended all visits up to 26, excluding 27)
     'Age at screening (years)',
     'Age at Screening (years) Not Rounded'
]

# this DF only contains the 144 eligible participants 
impact_ad_eligible_filtered = impact_ad_eligible[impact_ad_cols_to_keep]

impact_ad_eligible_filtered = impact_ad_eligible_filtered.drop(columns=['Randomized'])

In [5]:
# renaming 'Date of Screening Visit' to 'Collection Date' to match the other datasets 
impact_ad_eligible_filtered = impact_ad_eligible_filtered.rename(columns={'Date of Screening Visit': 'Collection Date'})


In [6]:
print(impact_ad_eligible_filtered.shape)
# correct number of participants now (144)
impact_ad_eligible_filtered.head()

(144, 10)


Unnamed: 0,Participant ID,Visit,Collection Date,Study Status,Completed Study Protocol,Sex (character),Race,Completed Study Assessments Numeric,Age at screening (years),Age at Screening (years) Not Rounded
0,IMPACT_101655,,2012-10-11,Discontinued Therapy,No,Male,White/Caucasian,0,4,3.8
1,IMPACT_102436,,2013-03-14,Completed Study,Yes,Male,White/Caucasian,1,4,3.9
2,IMPACT_105670,,2013-04-04,Completed Study,Yes,Female,White/Caucasian,1,3,3.0
4,IMPACT_113135,,2013-11-25,Early Termination,No,Female,White/Caucasian,0,3,2.6
5,IMPACT_115876,,2014-03-19,Completed Study,Yes,Male,Mixed Race,1,2,2.2


#### Getting a list of the 144 participant IDs 

In [7]:
participant_ids = impact_ad_eligible_filtered['Participant ID'].unique()
print(len(participant_ids)) #output 144 unique participant IDs

144


---
# Serum Dataset & IgE_IgG4_component Common Features

Duplicate OFC data between both data sets with different labels.  
These columns in Serum:

- 'Passed Visit 24 OFC for ITT',
- 'Passed Visit 24 OFC No Imputation',
- 'Passed Visit 26 OFC for ITT',
- 'Passed Visit 26 OFC No Imputation',
 
are the same data as these columns in IgE_IgG4_component

- OUT24ITT	
- OUT24NOI	
- OUT26ITT	
- OUT26NOI
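
One way to confirm that the two sets of columns carry the same data is to compare them per participant. The sketch below uses toy frames with invented values; `ofc_columns_match` is a hypothetical helper, and only the column labels come from the two datasets.

```python
import pandas as pd

# Serum label -> IgE_IgG4_component label
ofc_map = {
    "Passed Visit 24 OFC for ITT": "OUT24ITT",
    "Passed Visit 24 OFC No Imputation": "OUT24NOI",
    "Passed Visit 26 OFC for ITT": "OUT26ITT",
    "Passed Visit 26 OFC No Imputation": "OUT26NOI",
}


def ofc_columns_match(serum, ige, mapping=ofc_map):
    """Compare per-participant OFC outcomes between the two frames."""
    left = (serum[["Participant ID", *mapping]]
            .drop_duplicates()
            .rename(columns=mapping)
            .set_index("Participant ID")
            .sort_index())
    right = (ige[["Participant ID", *mapping.values()]]
             .drop_duplicates()
             .set_index("Participant ID")
             .sort_index())
    # equals() treats NaN in the same position as equal, but is strict about dtypes
    return left.equals(right)


nan = float("nan")
toy_serum = pd.DataFrame({
    "Participant ID": ["P1", "P1", "P2"],
    "Passed Visit 24 OFC for ITT": [1.0, 1.0, 0.0],
    "Passed Visit 24 OFC No Imputation": [1.0, 1.0, nan],
    "Passed Visit 26 OFC for ITT": [0.0, 0.0, 0.0],
    "Passed Visit 26 OFC No Imputation": [0.0, 0.0, nan],
})
toy_ige = pd.DataFrame({
    "Participant ID": ["P2", "P1"],
    "OUT24ITT": [0.0, 1.0],
    "OUT24NOI": [nan, 1.0],
    "OUT26ITT": [0.0, 0.0],
    "OUT26NOI": [nan, 0.0],
})

print(ofc_columns_match(toy_serum, toy_ige))
```

On the real data the same call would take `impact_serum` and `impact_ige`; a `False` result could also come from a dtype difference rather than differing values, so it would need a closer look.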

---

#### Will take 'Peanut IgE' and 'Total IgE' from column 'Test Name' in Serum data  
#### Will take 'Component' and values columns from IgE_IgG4_component 


# Serum Cleaning
- taking Peanut IgE and Total IgE from this data set

In [8]:
# checking total unique participants is the same
len(impact_serum['Participant ID'].unique().tolist())
# output 146

146

In [9]:
impact_serum.columns.tolist()

['Participant ID',
 'Visit',
 'Barcode',
 'Collection Date',
 'Test Name',
 'Unit',
 'Test Result',
 'Visit Number',
 'Value',
 'Baseline Value',
 'log10 Value',
 'log10 Value_Baseline',
 'Fold change from baseline',
 'log10 fold change from baseline',
 'Planned Treatment',
 'Planned Treatment (N)',
 'Randomized',
 'Intent to Treat Sample',
 'Per-protocol Primary Endpoint',
 'Per-protocol Secondary Endpoint',
 'Age at screening (years)',
 'Age at screening (years) Not Rounded',
 'Sex (character)',
 'Passed Visit 24 OFC for ITT',
 'Passed Visit 24 OFC No Imputation',
 'Passed Visit 26 OFC for ITT',
 'Passed Visit 26 OFC No Imputation',
 'Tolerance outcome',
 'Tolerance outcome (character)']

In [10]:
#filtering by just the participant IDs in the eligible list
impact_serum_eligible = impact_serum[impact_serum['Participant ID'].isin(participant_ids)]

impact_serum_cols_to_keep = [
    'Participant ID',
    'Collection Date',
    'Visit',
    'Test Name', # Peanut IgE, Peanut IgE/Total IgE ratio, Peanut IgG4*, Peanut IgG4/IgE ratio, Total IgE
    'Unit',
    'Value', # results from 'Test Name'
    'Baseline Value', # IgE values taken during initial screening
    'Passed Visit 24 OFC No Imputation',
    'Passed Visit 26 OFC No Imputation',
]

impact_serum_eligible_filtered = impact_serum_eligible[impact_serum_cols_to_keep]


In [11]:
#checking to see the above filtering kept the correct number of 144 participants 
len(impact_serum_eligible_filtered['Participant ID'].unique().tolist()) #output 144 - correct

144

In [12]:
print(impact_serum_eligible_filtered.shape)
impact_serum_eligible_filtered.head()

(3059, 9)


Unnamed: 0,Participant ID,Collection Date,Visit,Test Name,Unit,Value,Baseline Value,Passed Visit 24 OFC No Imputation,Passed Visit 26 OFC No Imputation
0,IMPACT_101655,2012-10-11,-2,Peanut IgE,kU/L,25.7,25.7,,
1,IMPACT_101655,2012-10-11,-2,Peanut IgE/Total IgE ratio,Ratio,27.934783,27.934783,,
2,IMPACT_101655,2012-10-11,-2,Peanut IgG4*,mcg/mL,0.3,0.3,,
3,IMPACT_101655,2012-10-11,-2,Peanut IgG4/IgE ratio,Ratio,0.004864,0.004864,,
4,IMPACT_101655,2012-10-11,-2,Total IgE,IU/mL,92.0,92.0,,


### Making a new DF with a new column capturing initial OFC fail 
Calling it "Passed Visit -2 OFC" to match the other OFC columns.
- Doing this because, according to the protocol, the 144 participants entered the study because they passed all the eligibility criteria AND __had a clinical reaction to the intake OFC, thus all 144 participants had a failed OFC status for their -2 visit__

In [13]:
# taking an explicit copy first so the new column is not written to a slice (avoids SettingWithCopyWarning)
impact_serum_eligible_filtered = impact_serum_eligible_filtered.copy()
impact_serum_eligible_filtered['Passed Visit -2 OFC'] = 0  # 0 = failed the screening OFC
impact_serum_eligible_filtered['Passed Visit -2 OFC'] = impact_serum_eligible_filtered['Passed Visit -2 OFC'].astype('int64')
impact_serum_eligible_filtered.head()


Unnamed: 0,Participant ID,Collection Date,Visit,Test Name,Unit,Value,Baseline Value,Passed Visit 24 OFC No Imputation,Passed Visit 26 OFC No Imputation,Passed Visit -2 OFC
0,IMPACT_101655,2012-10-11,-2,Peanut IgE,kU/L,25.7,25.7,,,0
1,IMPACT_101655,2012-10-11,-2,Peanut IgE/Total IgE ratio,Ratio,27.934783,27.934783,,,0
2,IMPACT_101655,2012-10-11,-2,Peanut IgG4*,mcg/mL,0.3,0.3,,,0
3,IMPACT_101655,2012-10-11,-2,Peanut IgG4/IgE ratio,Ratio,0.004864,0.004864,,,0
4,IMPACT_101655,2012-10-11,-2,Total IgE,IU/mL,92.0,92.0,,,0


### filtering out from 'Test Name' everything but 'Peanut IgE' and 'Total IgE'

In [14]:
# removing rows from 'Test Name' for everything except 'Peanut IgE' and 'Total IgE'

#defining list to keep
keep_IgEs = ['Peanut IgE', 'Total IgE']

impact_serum_eligible_filtered = impact_serum_eligible_filtered[impact_serum_eligible_filtered['Test Name'].isin(keep_IgEs)]


In [15]:
impact_serum_eligible_filtered['Test Name'].unique()

array(['Peanut IgE', 'Total IgE'], dtype=object)

In [16]:
print(impact_serum_eligible_filtered.shape)
impact_serum_eligible_filtered.head()

(1221, 10)


Unnamed: 0,Participant ID,Collection Date,Visit,Test Name,Unit,Value,Baseline Value,Passed Visit 24 OFC No Imputation,Passed Visit 26 OFC No Imputation,Passed Visit -2 OFC
0,IMPACT_101655,2012-10-11,-2,Peanut IgE,kU/L,25.7,25.7,,,0
4,IMPACT_101655,2012-10-11,-2,Total IgE,IU/mL,92.0,92.0,,,0
5,IMPACT_101655,2013-05-28,16,Peanut IgE,kU/L,80.8,25.7,,,0
9,IMPACT_101655,2013-05-28,16,Total IgE,IU/mL,442.0,92.0,,,0
10,IMPACT_102436,2013-03-14,-2,Peanut IgE,kU/L,36.2,36.2,1.0,0.0,0


--- 
# IgE_IgG4_component Cleaning
- taking the components from this data set

In [17]:
impact_ige.columns.tolist()

['Participant ID',
 'Visit',
 'Visit Number',
 'Barcode',
 'Antibody',
 'Component',
 'Test Name',
 'Result',
 'Value',
 'Unit',
 'log10 Value',
 'Baseline Value',
 'log10 Value_Baseline',
 'Specimen type',
 'Collection Date',
 'Fold change from baseline',
 'log10 fold change from baseline',
 'Planned Treatment',
 'Planned Treatment (N)',
 'Randomized',
 'Intent to Treat Sample',
 'Per-protocol Primary Endpoint',
 'Per-protocol Secondary Endpoint',
 'Age at screening (years)',
 'Age at screening (years) Not Rounded',
 'Sex (character)',
 'OUT24ITT',
 'OUT24NOI',
 'OUT26ITT',
 'OUT26NOI',
 'Tolerance outcome',
 'Tolerance outcome (character)']

In [18]:
#filtering by just the participant IDs in the eligible list
impact_ige_eligible = impact_ige[impact_ige['Participant ID'].isin(participant_ids)]

impact_ige_cols_to_keep = [
    'Participant ID',
    'Visit',
    'Antibody', #IgE, IgG4
    'Component', #rAra h 1, rAra h 2, rAra h 3, rAra h 6
    'Value',
    'Unit',
    'Baseline Value',
    'Collection Date',   
]

impact_ige_eligible_filtered = impact_ige_eligible[impact_ige_cols_to_keep]


In [19]:
#dropping IgG4 from Antibody column, only want IgE
impact_ige_eligible_filtered = impact_ige_eligible_filtered[impact_ige_eligible_filtered['Antibody']=='IgE']

In [20]:
print(impact_ige_eligible_filtered.shape)
impact_ige_eligible_filtered.head(10)

(2416, 8)


Unnamed: 0,Participant ID,Visit,Antibody,Component,Value,Unit,Baseline Value,Collection Date
0,IMPACT_101655,-2,IgE,rAra h 1,8.76,KU/L,8.76,2012-10-11
1,IMPACT_101655,-2,IgE,rAra h 2,27.5,KU/L,27.5,2012-10-11
2,IMPACT_101655,-2,IgE,rAra h 3,0.5,KU/L,0.5,2012-10-11
3,IMPACT_101655,-2,IgE,rAra h 6,17.3,kUA/L,17.3,2012-10-11
8,IMPACT_101655,16,IgE,rAra h 1,58.0,kUA/L,8.76,2013-05-28
9,IMPACT_101655,16,IgE,rAra h 2,81.8,kUA/L,27.5,2013-05-28
10,IMPACT_101655,16,IgE,rAra h 3,1.02,kUA/L,0.5,2013-05-28
11,IMPACT_101655,16,IgE,rAra h 6,72.9,kUA/L,17.3,2013-05-28
16,IMPACT_102436,-2,IgE,rAra h 1,2.34,KU/L,2.34,2013-03-14
17,IMPACT_102436,-2,IgE,rAra h 2,48.1,KU/L,48.1,2013-03-14


# Skin Prick Test cleaning 
(Skin Prick Test_2023-05-25_06-31-04.xlsx)

- taking wheal from this data set

In [21]:
# checking total unique participants is the same
len(impact_spt['Participant ID'].unique().tolist())
# output 146

146

In [22]:
impact_spt.columns.tolist()

['Participant ID',
 'Visit',
 'DataStream code',
 'PHASE',
 'SEQNO',
 'VISITNUM',
 'Date of Allergy Skin Test (Character)',
 'Positive Control Wheal',
 'Positive Control Wheal (Character)',
 'Negative Control Wheal',
 'Negative Control Wheal (Character)',
 'Allergen',
 'Wheal (mm)',
 'Wheal (mm) (Character)',
 'Calculated Wheal',
 'Calculated Wheal (Character)',
 'Planned Treatment',
 'Planned Treatment (N)',
 'Randomized',
 'Intent to Treat Sample',
 'Age at screening (years)',
 'Age at screening (years) Not Rounded',
 'Sex (character)',
 'OUT24ITT',
 'OUT24NOI',
 'OUT26ITT',
 'OUT26NOI',
 'Week',
 'Tolerance outcome',
 'Tolerance outcome (character)',
 'Wheal (mm) baseline',
 'Calculated Wheal baseline',
 'Wheal fold change from baseline',
 'Calculated wheal fold change from baseline',
 'Per-protocol Primary Endpoint',
 'Per-protocol Secondary Endpoint']

In [23]:
#filtering by just the participant IDs in the eligible list
impact_spt_eligible = impact_spt[impact_spt['Participant ID'].isin(participant_ids)]

impact_spt_cols_to_keep = [
    'Participant ID',
    'Date of Allergy Skin Test (Character)', # this is similar to 'Collection Date' in other datasets 
    'Visit',
    'Wheal (mm)',
    'Wheal (mm) baseline'
]

impact_spt_eligible_filtered = impact_spt_eligible[impact_spt_cols_to_keep]


In [24]:
# Changing 'Date of Allergy Skin Test (Character)' to 'Collection Date' to match other data sets

impact_spt_eligible_filtered = impact_spt_eligible_filtered.rename(columns={'Date of Allergy Skin Test (Character)': 'Collection Date'})


In [25]:
#checking to see the above filtering kept the correct number of 144 participants 
len(impact_spt_eligible_filtered['Participant ID'].unique().tolist()) #output 144 - correct

144

In [26]:
print(impact_spt_eligible_filtered.shape)
impact_spt_eligible_filtered.head()

(619, 5)


Unnamed: 0,Participant ID,Collection Date,Visit,Wheal (mm),Wheal (mm) baseline
0,IMPACT_101655,2012-10-11,-2,17.5,17.5
1,IMPACT_101655,2013-05-28,16,4.5,17.5
2,IMPACT_102436,2013-03-14,-2,16.0,16.0
3,IMPACT_102436,2013-11-26,16,9.5,16.0
4,IMPACT_102436,2014-11-28,20,,16.0


---
# Building Baseline Datasets
Separating baseline data from rest of data.  
Baseline data varies from rest of the test visits in a few ways:
- interpreting OFC pass results from the protocol and filtering
- component units differ from those of the follow-up visits (16,21,24,26)
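
The unit difference can be checked by tabulating unit labels per visit. This is a minimal sketch using the visit, component, and unit values shown for participant IMPACT_101655 in the component preview; the frame name `toy_components` is illustrative.

```python
import pandas as pd

# Visit, component, and unit labels copied from the IMPACT_101655 preview rows
toy_components = pd.DataFrame({
    "Visit": [-2, -2, -2, -2, 16, 16, 16, 16],
    "Component": ["rAra h 1", "rAra h 2", "rAra h 3", "rAra h 6"] * 2,
    "Unit": ["KU/L", "KU/L", "KU/L", "kUA/L", "kUA/L", "kUA/L", "kUA/L", "kUA/L"],
})

# Count how often each unit label appears at each visit
unit_counts = pd.crosstab(toy_components["Visit"], toy_components["Unit"])
print(unit_counts)
```

Running the same `pd.crosstab` on `impact_ige_eligible_filtered` would show which visits mix unit labels. Whether the labels need a numeric conversion or only a relabel is not established here.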

In [27]:
# impact_ad_eligible_filtered (144, 10)
# impact_serum_eligible_filtered
# impact_ige_eligible_filtered
# impact_spt_eligible_filtered

---
# Cleaning ad baseline

summary of steps:
- setting 'Visit' to -2 for every row, since ADSTART0 holds intake information
- dropping all non baseline columns
- creating a new 'Age' column in months

### Notes on 'impact_ad_eligible_baseline' dataframe: 

144 unique participant IDs  
144 rows


In [28]:
# updating the 'Visit' column to say -2 since this is intake information
# .copy() keeps impact_ad_eligible_filtered unchanged
impact_ad_eligible_baseline = impact_ad_eligible_filtered.copy()
impact_ad_eligible_baseline['Visit'] = -2
impact_ad_eligible_baseline['Visit'] = impact_ad_eligible_baseline['Visit'].astype('int64')
print(impact_ad_eligible_baseline.shape) #144 total entries in the DF = 144 unique participant IDs
impact_ad_eligible_baseline.head()

(144, 10)


Unnamed: 0,Participant ID,Visit,Collection Date,Study Status,Completed Study Protocol,Sex (character),Race,Completed Study Assessments Numeric,Age at screening (years),Age at Screening (years) Not Rounded
0,IMPACT_101655,-2,2012-10-11,Discontinued Therapy,No,Male,White/Caucasian,0,4,3.8
1,IMPACT_102436,-2,2013-03-14,Completed Study,Yes,Male,White/Caucasian,1,4,3.9
2,IMPACT_105670,-2,2013-04-04,Completed Study,Yes,Female,White/Caucasian,1,3,3.0
4,IMPACT_113135,-2,2013-11-25,Early Termination,No,Female,White/Caucasian,0,3,2.6
5,IMPACT_115876,-2,2014-03-19,Completed Study,Yes,Male,Mixed Race,1,2,2.2


In [29]:
impact_ad_eligible_baseline = impact_ad_eligible_baseline.drop(columns=['Age at screening (years)', 'Study Status', 'Completed Study Protocol', 'Completed Study Assessments Numeric'])
impact_ad_eligible_baseline.head()

Unnamed: 0,Participant ID,Visit,Collection Date,Sex (character),Race,Age at Screening (years) Not Rounded
0,IMPACT_101655,-2,2012-10-11,Male,White/Caucasian,3.8
1,IMPACT_102436,-2,2013-03-14,Male,White/Caucasian,3.9
2,IMPACT_105670,-2,2013-04-04,Female,White/Caucasian,3.0
4,IMPACT_113135,-2,2013-11-25,Female,White/Caucasian,2.6
5,IMPACT_115876,-2,2014-03-19,Male,Mixed Race,2.2


In [30]:
# converting 'Age at Screening (years) Not Rounded' to months 
impact_ad_eligible_baseline['Age'] = impact_ad_eligible_baseline['Age at Screening (years) Not Rounded'] * 12
impact_ad_eligible_baseline = impact_ad_eligible_baseline.drop(columns=['Age at Screening (years) Not Rounded'])



In [31]:
print(impact_ad_eligible_baseline.shape)
impact_ad_eligible_baseline.head()

(144, 6)


Unnamed: 0,Participant ID,Visit,Collection Date,Sex (character),Race,Age
0,IMPACT_101655,-2,2012-10-11,Male,White/Caucasian,45.6
1,IMPACT_102436,-2,2013-03-14,Male,White/Caucasian,46.8
2,IMPACT_105670,-2,2013-04-04,Female,White/Caucasian,36.0
4,IMPACT_113135,-2,2013-11-25,Female,White/Caucasian,31.2
5,IMPACT_115876,-2,2014-03-19,Male,Mixed Race,26.4


In [32]:
# 'Date of Screening Visit' was already renamed to 'Collection Date' in In [5], so this rename is a no-op kept as a safeguard before the merge
impact_ad_eligible_baseline = impact_ad_eligible_baseline.rename(columns={'Date of Screening Visit': 'Collection Date'})


In [33]:
print(impact_ad_eligible_baseline.shape)
impact_ad_eligible_baseline.head()

(144, 6)


Unnamed: 0,Participant ID,Visit,Collection Date,Sex (character),Race,Age
0,IMPACT_101655,-2,2012-10-11,Male,White/Caucasian,45.6
1,IMPACT_102436,-2,2013-03-14,Male,White/Caucasian,46.8
2,IMPACT_105670,-2,2013-04-04,Female,White/Caucasian,36.0
4,IMPACT_113135,-2,2013-11-25,Female,White/Caucasian,31.2
5,IMPACT_115876,-2,2014-03-19,Male,Mixed Race,26.4


---
# Cleaning serum baseline 
summary of steps:
- dropping all rows for visits other than -2, the initial visit
- dropping all non baseline columns
- creating new columns 'Peanut IgE (kU/L)' and 'Total IgE (kU/L)' (the source reports Total IgE in IU/mL; for IgE, 1 IU/mL equals 1 kU/L)
- merging duplicate rows and overwriting NaN values


### Notes on 'impact_serum_eligible_baseline_merged' dataframe: 

144 unique participant IDs  
159 rows, because 19 participants could not be merged into a single row:
- 15 participants had their IgE values read on different dates
- 4 only had ONE of their IgE values read during intake (either Peanut or Total, not both)
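
If one row per participant is needed, one option is to pivot on participant and visit only and keep the earliest draw date. This is a sketch under that assumption, not what the notebook does; the toy IDs are invented, while the values and dates echo the serum previews.

```python
import pandas as pd

# Toy long-format baseline rows (one row per test result)
toy_long = pd.DataFrame({
    "Participant ID": ["P1", "P1", "P2", "P2", "P3"],
    "Collection Date": pd.to_datetime(
        ["2012-10-11", "2012-10-11", "2013-03-14", "2013-03-23", "2014-07-16"]),
    "Visit": [-2, -2, -2, -2, -2],
    "Test Name": ["Peanut IgE", "Total IgE", "Peanut IgE", "Total IgE", "Peanut IgE"],
    "Baseline Value": [25.7, 92.0, 36.2, 187.0, 195.0],
})

# One row per participant and visit, one column per test
wide = toy_long.pivot_table(index=["Participant ID", "Visit"],
                            columns="Test Name",
                            values="Baseline Value",
                            aggfunc="first")

# Assumption: the earliest draw date stands in as the baseline collection date
wide["Collection Date"] = (toy_long
                           .groupby(["Participant ID", "Visit"])["Collection Date"]
                           .min())
wide = wide.reset_index()
wide.columns.name = None
print(wide)
```

Here P2, whose two tests were drawn on different dates, collapses to a single row, and P3 keeps a NaN for the missing Total IgE. Dropping the later date loses information, so this only fits if draws a few days apart count as the same baseline.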

In [34]:
impact_serum_eligible_baseline = impact_serum_eligible_filtered

# dropping all rows for visits that are not -2 the initial visit
impact_serum_eligible_baseline = impact_serum_eligible_baseline[impact_serum_eligible_baseline['Visit'] == -2]

# dropping all non baseline columns
impact_serum_eligible_baseline = impact_serum_eligible_baseline.drop(columns=['Passed Visit 24 OFC No Imputation', 
                                                                        'Passed Visit 26 OFC No Imputation',
                                                                        'Value'
                                                                       ])

In [35]:
len(impact_serum_eligible_baseline['Participant ID'].unique().tolist())

144

In [36]:
impact_serum_eligible_baseline.head()

Unnamed: 0,Participant ID,Collection Date,Visit,Test Name,Unit,Baseline Value,Passed Visit -2 OFC
0,IMPACT_101655,2012-10-11,-2,Peanut IgE,kU/L,25.7,0
4,IMPACT_101655,2012-10-11,-2,Total IgE,IU/mL,92.0,0
10,IMPACT_102436,2013-03-14,-2,Peanut IgE,kU/L,36.2,0
14,IMPACT_102436,2013-03-23,-2,Total IgE,IU/mL,187.0,0
35,IMPACT_105670,2013-04-04,-2,Peanut IgE,kU/L,51.3,0


In [37]:
# Create new columns with initial NaN values
impact_serum_eligible_baseline['Peanut IgE (kU/L)'] = np.nan
impact_serum_eligible_baseline['Total IgE (kU/L)'] = np.nan

# Populate the new columns based on conditions
impact_serum_eligible_baseline.loc[impact_serum_eligible_baseline['Test Name'] == 'Peanut IgE', 'Peanut IgE (kU/L)'] = impact_serum_eligible_baseline.loc[impact_serum_eligible_baseline['Test Name'] == 'Peanut IgE', 'Baseline Value']
impact_serum_eligible_baseline.loc[impact_serum_eligible_baseline['Test Name'] == 'Total IgE', 'Total IgE (kU/L)'] = impact_serum_eligible_baseline.loc[impact_serum_eligible_baseline['Test Name'] == 'Total IgE', 'Baseline Value']

# Drop the specified columns
impact_serum_eligible_baseline = impact_serum_eligible_baseline.drop(columns=['Test Name', 'Unit', 'Baseline Value'])


In [38]:
len(impact_serum_eligible_baseline['Participant ID'].unique().tolist())

144

In [39]:
print(impact_serum_eligible_baseline.shape) # (284, 6)
impact_serum_eligible_baseline.head()
# note if all of the 144 participants had 2 rows for Peanut and Total, we'd have 288, but we have 284

(284, 6)


Unnamed: 0,Participant ID,Collection Date,Visit,Passed Visit -2 OFC,Peanut IgE (kU/L),Total IgE (kU/L)
0,IMPACT_101655,2012-10-11,-2,0,25.7,
4,IMPACT_101655,2012-10-11,-2,0,,92.0
10,IMPACT_102436,2013-03-14,-2,0,36.2,
14,IMPACT_102436,2013-03-23,-2,0,,187.0
35,IMPACT_105670,2013-04-04,-2,0,51.3,


In [40]:
# Merging duplicate rows and overwriting NaN values

# Group by the identifying columns and take the mean of each IgE column; mean ignores NaN, so rows sharing a participant and date collapse into one
impact_serum_eligible_baseline_merged = impact_serum_eligible_baseline.groupby(['Participant ID', 'Collection Date', 'Visit', 'Passed Visit -2 OFC'], as_index=False).agg({'Peanut IgE (kU/L)': 'mean', 'Total IgE (kU/L)': 'mean'})

# Reset the index
impact_serum_eligible_baseline_merged.reset_index(drop=True, inplace=True)


In [41]:
impact_serum_eligible_baseline_merged.head(10)

Unnamed: 0,Participant ID,Collection Date,Visit,Passed Visit -2 OFC,Peanut IgE (kU/L),Total IgE (kU/L)
0,IMPACT_101655,2012-10-11,-2,0,25.7,92.0
1,IMPACT_102436,2013-03-14,-2,0,36.2,
2,IMPACT_102436,2013-03-23,-2,0,,187.0
3,IMPACT_105670,2013-04-04,-2,0,51.3,118.0
4,IMPACT_113135,2013-11-25,-2,0,241.0,408.0
5,IMPACT_115876,2014-03-19,-2,0,19.8,344.0
6,IMPACT_136775,2015-03-28,-2,0,292.0,333.0
7,IMPACT_139237,2014-12-24,-2,0,343.0,529.0
8,IMPACT_149018,2014-07-16,-2,0,195.0,
9,IMPACT_155320,2014-08-22,-2,0,89.9,497.0


In [42]:
print(len(impact_serum_eligible_baseline_merged['Participant ID'].unique().tolist()))
print(impact_serum_eligible_baseline_merged.shape)

144
(159, 6)


In [43]:
# Exploring where the extra rows came from

# making a df of just the rows with nan values
serum_nan_rows = impact_serum_eligible_baseline_merged[impact_serum_eligible_baseline_merged.isna().any(axis=1)]


In [44]:
print(serum_nan_rows.shape)
serum_nan_rows.head(34)

(34, 6)


Unnamed: 0,Participant ID,Collection Date,Visit,Passed Visit -2 OFC,Peanut IgE (kU/L),Total IgE (kU/L)
1,IMPACT_102436,2013-03-14,-2,0,36.2,
2,IMPACT_102436,2013-03-23,-2,0,,187.0
8,IMPACT_149018,2014-07-16,-2,0,195.0,
16,IMPACT_228754,2014-09-05,-2,0,,303.0
17,IMPACT_228754,2014-09-06,-2,0,214.0,
18,IMPACT_251255,2013-06-17,-2,0,,2365.0
19,IMPACT_251255,2013-06-18,-2,0,499.0,
20,IMPACT_255737,2012-12-25,-2,0,,622.0
21,IMPACT_255737,2012-12-26,-2,0,394.0,
22,IMPACT_256280,2014-01-28,-2,0,,702.0


In [45]:
len(serum_nan_rows['Participant ID'].unique().tolist())


19

---
# Cleaning IgE baseline 
summary of steps:
- dropping all rows for visits that are not -2 the initial visit
- dropping all non baseline columns
- Create new columns for components
- Merging duplicate rows and overwriting NaN values


### Notes on 'impact_ige_eligible_baseline_merged' dataframe: 

141 unique participant IDs, because:
- 3 did not have baseline data for the initial visit (IDs: 'IMPACT_149018', 'IMPACT_746400', 'IMPACT_920870')   

177 rows, because not all of them could be merged:
- 36 participants had their component values read on different dates
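
The "one column per component, then merge" steps can also be done in a single pivot. The sketch below is an alternative under stated assumptions, not the notebook's method: the toy IDs are invented, and the output column names follow the naming used in this notebook.

```python
import pandas as pd

# Toy baseline rows; values echo the component preview
toy_ige = pd.DataFrame({
    "Participant ID": ["P1"] * 4 + ["P2"] * 2,
    "Visit": [-2] * 6,
    "Collection Date": ["2012-10-11"] * 4 + ["2013-03-14"] * 2,
    "Component": ["rAra h 1", "rAra h 2", "rAra h 3", "rAra h 6",
                  "rAra h 1", "rAra h 2"],
    "Baseline Value": [8.76, 27.5, 0.5, 17.3, 2.34, 48.1],
})

rename = {
    "rAra h 1": "Ara h1 (kU/L)",
    "rAra h 2": "Ara h2 (kU/L)",
    "rAra h 3": "Ara h3 (kU/L)",
    "rAra h 6": "Ara h6 (kU/L)",
}

# Rows sharing participant, visit and date become one row; missing components stay NaN
components_wide = (toy_ige
                   .pivot_table(index=["Participant ID", "Visit", "Collection Date"],
                                columns="Component",
                                values="Baseline Value",
                                aggfunc="mean")
                   .rename(columns=rename)
                   .reset_index())
components_wide.columns.name = None
print(components_wide)
```

Because 'Collection Date' stays in the index, participants whose components were read on different dates still produce more than one row, as in the grouped result described in the notes.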

In [46]:
impact_ige_eligible_baseline = impact_ige_eligible_filtered

In [47]:
print(len(impact_ige['Participant ID'].unique().tolist()))

146


In [48]:
print(len(impact_ige_eligible_filtered['Participant ID'].unique().tolist()))


144


In [49]:
print(len(impact_ige_eligible_baseline['Participant ID'].unique().tolist()))

144


In [50]:
# just the participant IDs for those who have -2 in their history
impact_ige_eligible_baseline_2 = impact_ige_eligible_baseline[impact_ige_eligible_baseline['Visit'] == -2]

In [51]:
len(impact_ige_eligible_baseline_2['Participant ID'].unique().tolist())

141

In [52]:
# NOTE: this reuses the name participant_ids, replacing the list of 144 eligible IDs from In [7]
participant_ids = impact_ige_eligible_baseline[~impact_ige_eligible_baseline['Participant ID'].isin(impact_ige_eligible_baseline_2['Participant ID'])]['Participant ID'].tolist()


In [53]:
#dropping duplicates
participant_ids = list(set(participant_ids))
participant_ids

# output: ['IMPACT_149018', 'IMPACT_746400', 'IMPACT_920870'] THESE ARE THE 3 

# manually checking these IDs in the IgE spreadsheet; 

# IMPACT_149018 
# in IgE dataset; missing -2 and 16 visit, missing component baseline values 
# in serum dataset; Confirming baseline obtained for Peanut IgE but missing Total IgE
# in ad dataset; Study Status = Early Termination

# IMPACT_746400
# in IgE dataset; missing -2 visit, missing component baseline values 
# in serum dataset; Confirming baseline obtained for Peanut IgE but missing Total IgE
# in ad dataset; Study Status = Completed Study

# IMPACT_920870
# in IgE dataset; missing -2 visit, missing component baseline values
# in serum dataset; Confirming baseline obtained for Peanut IgE but missing Total IgE
# in ad dataset; Study Status = Completed Study

['IMPACT_746400', 'IMPACT_920870', 'IMPACT_149018']

In [54]:
# removing these from the baseline data 
impact_ige_eligible_baseline = impact_ige_eligible_filtered[~impact_ige_eligible_filtered['Participant ID'].isin(participant_ids)]


In [55]:
print(len(impact_ige_eligible_baseline['Participant ID'].unique().tolist()))
# output 141, correctly filtered out the participants with missing baseline data 

141


In [56]:
# dropping all rows for visits that are not -2 the initial visit
impact_ige_eligible_baseline = impact_ige_eligible_baseline[impact_ige_eligible_baseline['Visit'] == -2]

# dropping all non baseline columns
impact_ige_eligible_baseline = impact_ige_eligible_baseline.drop(columns=['Value'])

In [57]:
print(impact_ige_eligible_baseline.shape)
impact_ige_eligible_baseline.head()

(564, 7)


Unnamed: 0,Participant ID,Visit,Antibody,Component,Unit,Baseline Value,Collection Date
0,IMPACT_101655,-2,IgE,rAra h 1,KU/L,8.76,2012-10-11
1,IMPACT_101655,-2,IgE,rAra h 2,KU/L,27.5,2012-10-11
2,IMPACT_101655,-2,IgE,rAra h 3,KU/L,0.5,2012-10-11
3,IMPACT_101655,-2,IgE,rAra h 6,kUA/L,17.3,2012-10-11
16,IMPACT_102436,-2,IgE,rAra h 1,KU/L,2.34,2013-03-14


In [58]:
# Creating new columns for each component

# Create new columns with initial NaN values
impact_ige_eligible_baseline['Ara h1 (kU/L)'] = np.nan
impact_ige_eligible_baseline['Ara h2 (kU/L)'] = np.nan
impact_ige_eligible_baseline['Ara h3 (kU/L)'] = np.nan
impact_ige_eligible_baseline['Ara h6 (kU/L)'] = np.nan

# Populate the new columns based on conditions
impact_ige_eligible_baseline.loc[impact_ige_eligible_baseline['Component'] == 'rAra h 1', 'Ara h1 (kU/L)'] = impact_ige_eligible_baseline.loc[impact_ige_eligible_baseline['Component'] == 'rAra h 1', 'Baseline Value']
impact_ige_eligible_baseline.loc[impact_ige_eligible_baseline['Component'] == 'rAra h 2', 'Ara h2 (kU/L)'] = impact_ige_eligible_baseline.loc[impact_ige_eligible_baseline['Component'] == 'rAra h 2', 'Baseline Value']
impact_ige_eligible_baseline.loc[impact_ige_eligible_baseline['Component'] == 'rAra h 3', 'Ara h3 (kU/L)'] = impact_ige_eligible_baseline.loc[impact_ige_eligible_baseline['Component'] == 'rAra h 3', 'Baseline Value']
impact_ige_eligible_baseline.loc[impact_ige_eligible_baseline['Component'] == 'rAra h 6', 'Ara h6 (kU/L)'] = impact_ige_eligible_baseline.loc[impact_ige_eligible_baseline['Component'] == 'rAra h 6', 'Baseline Value']

# Drop the specified columns
impact_ige_eligible_baseline = impact_ige_eligible_baseline.drop(columns=['Antibody', 'Component', 'Baseline Value', 'Unit'])


In [59]:
print(impact_ige_eligible_baseline.shape)
impact_ige_eligible_baseline.head()

(564, 7)


Unnamed: 0,Participant ID,Visit,Collection Date,Ara h1 (kU/L),Ara h2 (kU/L),Ara h3 (kU/L),Ara h6 (kU/L)
0,IMPACT_101655,-2,2012-10-11,8.76,,,
1,IMPACT_101655,-2,2012-10-11,,27.5,,
2,IMPACT_101655,-2,2012-10-11,,,0.5,
3,IMPACT_101655,-2,2012-10-11,,,,17.3
16,IMPACT_102436,-2,2013-03-14,2.34,,,


In [60]:
# Merging duplicate rows and overwriting NaN values

# Group by participant, collection date, and visit. 'mean' skips NaN, so each group keeps
# the single non-null value recorded for each component.
impact_ige_eligible_baseline_merged = impact_ige_eligible_baseline.groupby(['Participant ID', 'Collection Date', 'Visit'], as_index=False).agg({'Ara h1 (kU/L)': 'mean', 'Ara h2 (kU/L)': 'mean', 'Ara h3 (kU/L)': 'mean', 'Ara h6 (kU/L)': 'mean'})

# Reset the index (as_index=False already yields a default index, so this is a no-op safeguard)
impact_ige_eligible_baseline_merged.reset_index(drop=True, inplace=True)
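The two steps above (spreading each component into its own column, then collapsing the rows with a grouped mean) can also be sketched as a single `pivot_table` call. This is only an illustrative alternative on a toy frame; the participant IDs and the `component_names` mapping are assumptions made for the example, not part of the IMPACT data.

```python
import pandas as pd

# Toy long-format frame mimicking the baseline IgE layout (illustrative values only).
long_df = pd.DataFrame({
    "Participant ID": ["P1", "P1", "P1", "P1", "P2", "P2"],
    "Collection Date": ["2012-10-11"] * 4 + ["2013-03-14"] * 2,
    "Visit": [-2] * 6,
    "Component": ["rAra h 1", "rAra h 2", "rAra h 3", "rAra h 6",
                  "rAra h 1", "rAra h 2"],
    "Baseline Value": [8.0, 27.5, 0.5, 17.0, 2.5, 48.0],
})

# Hypothetical mapping from raw component labels to the wide column names.
component_names = {
    "rAra h 1": "Ara h1 (kU/L)",
    "rAra h 2": "Ara h2 (kU/L)",
    "rAra h 3": "Ara h3 (kU/L)",
    "rAra h 6": "Ara h6 (kU/L)",
}

# One row per (participant, date, visit); one column per component.
wide_df = (
    long_df.pivot_table(
        index=["Participant ID", "Collection Date", "Visit"],
        columns="Component",
        values="Baseline Value",
        aggfunc="mean",
    )
    .rename(columns=component_names)
    .reset_index()
)
wide_df.columns.name = None  # drop the leftover "Component" axis label

print(wide_df)
```

Components that were not measured for a given participant and date (here P2's Ara h3 and Ara h6) come out as NaN, which matches the behaviour of the groupby approach.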


In [61]:
print(impact_ige_eligible_baseline_merged.shape)
impact_ige_eligible_baseline_merged.head(50)

(177, 7)


Unnamed: 0,Participant ID,Collection Date,Visit,Ara h1 (kU/L),Ara h2 (kU/L),Ara h3 (kU/L),Ara h6 (kU/L)
0,IMPACT_101655,2012-10-11,-2,8.76,27.5,0.5,17.3
1,IMPACT_102436,2013-03-14,-2,2.34,48.1,0.24,24.9
2,IMPACT_105670,2013-04-04,-2,2.73,61.6,2.4,24.1
3,IMPACT_113135,2013-11-25,-2,6.45,116.0,0.79,172.0
4,IMPACT_115876,2014-03-19,-2,12.1,13.2,0.24,10.2
5,IMPACT_136775,2015-03-28,-2,13.8,144.0,0.27,92.6
6,IMPACT_139237,2014-12-24,-2,8.7,177.0,1.24,159.0
7,IMPACT_155320,2014-08-22,-2,4.51,88.3,1.87,21.7
8,IMPACT_155790,2014-10-21,-2,67.0,86.9,17.9,38.2
9,IMPACT_164041,2013-11-21,-2,0.33,15.0,0.63,26.4


In [62]:
print(len(impact_ige_eligible_baseline_merged['Participant ID'].unique().tolist()))

141


In [63]:
# checking which rows have NaN Values 
ig_nan_rows = impact_ige_eligible_baseline_merged[impact_ige_eligible_baseline_merged.isna().any(axis=1)]


In [64]:
print(ig_nan_rows.shape)
ig_nan_rows.head(72)

(72, 7)


Unnamed: 0,Participant ID,Collection Date,Visit,Ara h1 (kU/L),Ara h2 (kU/L),Ara h3 (kU/L),Ara h6 (kU/L)
14,IMPACT_228754,2014-09-05,-2,,,,68.7
15,IMPACT_228754,2014-09-06,-2,41.10,75.0,0.24,
16,IMPACT_251255,2013-06-17,-2,,,,162.0
17,IMPACT_251255,2013-06-18,-2,17.10,303.0,6.26,
18,IMPACT_255737,2012-12-25,-2,,,,60.5
...,...,...,...,...,...,...,...
167,IMPACT_960338,2013-11-29,-2,0.69,38.9,0.23,
169,IMPACT_967510,2015-03-23,-2,,,,18.3
170,IMPACT_967510,2015-05-30,-2,6.26,16.1,0.05,
171,IMPACT_969235,2012-12-21,-2,,,,22.7


In [65]:
# getting the participant IDs with the NaN values
nan_participant_ids = impact_ige_eligible_baseline_merged.loc[impact_ige_eligible_baseline_merged.isna().any(axis=1), 'Participant ID'].tolist()
nan_participant_ids

['IMPACT_228754',
 'IMPACT_228754',
 'IMPACT_251255',
 'IMPACT_251255',
 'IMPACT_255737',
 'IMPACT_255737',
 'IMPACT_256280',
 'IMPACT_256280',
 'IMPACT_289709',
 'IMPACT_289709',
 'IMPACT_291362',
 'IMPACT_291362',
 'IMPACT_298007',
 'IMPACT_298007',
 'IMPACT_322235',
 'IMPACT_322235',
 'IMPACT_376278',
 'IMPACT_376278',
 'IMPACT_410437',
 'IMPACT_410437',
 'IMPACT_423617',
 'IMPACT_423617',
 'IMPACT_462870',
 'IMPACT_462870',
 'IMPACT_488312',
 'IMPACT_488312',
 'IMPACT_519208',
 'IMPACT_519208',
 'IMPACT_533350',
 'IMPACT_533350',
 'IMPACT_622758',
 'IMPACT_622758',
 'IMPACT_641869',
 'IMPACT_641869',
 'IMPACT_651071',
 'IMPACT_651071',
 'IMPACT_653909',
 'IMPACT_653909',
 'IMPACT_679764',
 'IMPACT_679764',
 'IMPACT_725444',
 'IMPACT_725444',
 'IMPACT_735026',
 'IMPACT_735026',
 'IMPACT_778693',
 'IMPACT_778693',
 'IMPACT_807785',
 'IMPACT_807785',
 'IMPACT_832465',
 'IMPACT_832465',
 'IMPACT_834513',
 'IMPACT_834513',
 'IMPACT_841912',
 'IMPACT_841912',
 'IMPACT_848744',
 'IMPACT_8

In [66]:
#dropping duplicates
nan_participant_ids = list(set(nan_participant_ids))
nan_participant_ids

len(nan_participant_ids)

36
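
The NaN rows above come in pairs: for these 36 participants, Ara h6 was recorded on a different `Collection Date` than Ara h1, h2, and h3, so grouping by date leaves two half-filled rows per participant. One possible way to collapse them is sketched below on toy data. It assumes that values from the different visit -2 dates may be combined into a single baseline row, which is an analysis decision that would need to be confirmed for this study.

```python
import numpy as np
import pandas as pd

# Toy frame: P1 has its components split across two collection dates.
merged = pd.DataFrame({
    "Participant ID": ["P1", "P1", "P2"],
    "Collection Date": pd.to_datetime(["2014-09-05", "2014-09-06", "2013-03-14"]),
    "Visit": [-2, -2, -2],
    "Ara h1 (kU/L)": [np.nan, 41.1, 2.34],
    "Ara h6 (kU/L)": [68.7, np.nan, 24.9],
})

# GroupBy.first() returns the first non-null entry per column within each group,
# so the half-filled rows are combined; the earliest date is kept after sorting.
collapsed = (
    merged.sort_values(["Participant ID", "Collection Date"])
    .groupby("Participant ID", as_index=False)
    .first()
)

print(collapsed)
```

After this step each participant has exactly one row, at the cost of `Collection Date` reflecting only the earliest of the dates.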

---
# Cleaning Skin Prick Test (Wheal) baseline 
Summary of steps:
- Drop the non-baseline columns
- Drop all rows for visits other than -2 (the initial visit)


### Notes on 'impact_spt_eligible_baseline_merged' dataframe: 

- 144 rows for 144 unique participant IDs, because each participant's baseline wheal was measured on a single date

In [67]:
impact_spt_eligible_baseline = impact_spt_eligible_filtered

In [68]:
print(len(impact_spt['Participant ID'].unique().tolist()))

146


In [69]:
print(len(impact_spt_eligible_filtered['Participant ID'].unique().tolist()))


144


In [70]:
print(len(impact_spt_eligible_baseline['Participant ID'].unique().tolist()))

144


In [71]:
#checking if there are any participants that don't have baseline wheal data (-2 visit)
impact_spt_eligible_baseline_2 = impact_spt_eligible_baseline[impact_spt_eligible_baseline['Visit'] == -2]
len(impact_spt_eligible_baseline_2['Participant ID'].unique().tolist())
#output 144, all eligible participants have wheal baseline data

144

In [72]:
# dropping the non-baseline 'Wheal (mm)' column; 'Wheal (mm) baseline' is kept
impact_spt_eligible_baseline = impact_spt_eligible_baseline.drop(columns=['Wheal (mm)'])


In [73]:
print(impact_spt_eligible_baseline.shape)
impact_spt_eligible_baseline.head()

(619, 4)


Unnamed: 0,Participant ID,Collection Date,Visit,Wheal (mm) baseline
0,IMPACT_101655,2012-10-11,-2,17.5
1,IMPACT_101655,2013-05-28,16,17.5
2,IMPACT_102436,2013-03-14,-2,16.0
3,IMPACT_102436,2013-11-26,16,16.0
4,IMPACT_102436,2014-11-28,20,16.0


In [74]:
# keep only rows from visit -2 (the initial visit)
impact_spt_eligible_baseline = impact_spt_eligible_baseline[impact_spt_eligible_baseline['Visit'] == -2]


In [75]:
print(impact_spt_eligible_baseline.shape)
impact_spt_eligible_baseline.head()

(144, 4)


Unnamed: 0,Participant ID,Collection Date,Visit,Wheal (mm) baseline
0,IMPACT_101655,2012-10-11,-2,17.5
2,IMPACT_102436,2013-03-14,-2,16.0
7,IMPACT_105670,2013-04-04,-2,14.5
12,IMPACT_113135,2013-11-25,-2,15.5
17,IMPACT_115876,2014-03-19,-2,17.5


### viewing all baseline data so far 


In [76]:
#impact_ige_eligible_baseline_merged
#impact_serum_eligible_baseline_merged
#impact_ad_eligible_baseline
#impact_spt_eligible_baseline

In [77]:
# Finding the participant IDs that are common among all 4 baseline data frames
ige_list = impact_ige_eligible_baseline_merged['Participant ID'].tolist()
serum_list = impact_serum_eligible_baseline_merged['Participant ID'].tolist()
ad_list = impact_ad_eligible_baseline['Participant ID'].tolist()
spt_list = impact_spt_eligible_baseline['Participant ID'].tolist()

common_participants = set(ige_list).intersection(serum_list, ad_list, spt_list)
common_participants = list(common_participants)


In [78]:
len(common_participants)
# output: 141

141
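
To see which frame is responsible for the drop to 141, the same intersection can be paired with a per-frame difference. The sketch below uses made-up IDs and a hypothetical `id_sets` dictionary; in the notebook the sets would be built from the four `Participant ID` columns.

```python
# Hypothetical ID sets standing in for the four baseline frames.
id_sets = {
    "ige": {"P1", "P2", "P3"},
    "serum": {"P1", "P2", "P3", "P4"},
    "ad": {"P1", "P2", "P3", "P4"},
    "spt": {"P1", "P2", "P3", "P4"},
}

# IDs present in every frame.
common = set.intersection(*id_sets.values())

# For each frame, the IDs it contains that did not survive the intersection.
excluded = {name: sorted(ids - common) for name, ids in id_sets.items()}

print(len(common))
print(excluded)
```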

In [79]:
# Updating each data frame so that they only contain the common participants 
impact_ige_eligible_baseline_merged_common = impact_ige_eligible_baseline_merged[impact_ige_eligible_baseline_merged['Participant ID'].isin(common_participants)]
impact_serum_eligible_baseline_merged_common = impact_serum_eligible_baseline_merged[impact_serum_eligible_baseline_merged['Participant ID'].isin(common_participants)]
impact_ad_eligible_baseline_common = impact_ad_eligible_baseline[impact_ad_eligible_baseline['Participant ID'].isin(common_participants)]
impact_spt_eligible_baseline_common = impact_spt_eligible_baseline[impact_spt_eligible_baseline['Participant ID'].isin(common_participants)]

In [80]:
print(impact_ige_eligible_baseline_merged_common.shape) #multiple dates for obtaining baseline data for some IDs
print(impact_serum_eligible_baseline_merged_common.shape) #multiple dates for obtaining baseline data for some IDs
print(impact_ad_eligible_baseline_common.shape)
print(impact_spt_eligible_baseline_common.shape)

(177, 7)
(156, 6)
(141, 6)
(141, 4)


In [81]:
# Merging Serum and Ad baselines
# how='outer' uses the union of keys from both frames (like a SQL full outer join) and sorts keys lexicographically.
merged_serum_ad = impact_serum_eligible_baseline_merged_common.merge(impact_ad_eligible_baseline_common, 
                                                               how='outer', # keeps all rows from both frames; columns from the unmatched side are NaN
                                                               on=['Participant ID', 'Collection Date', 'Visit'])


In [82]:
print(merged_serum_ad.shape)
merged_serum_ad.head()

(162, 9)


Unnamed: 0,Participant ID,Collection Date,Visit,Passed Visit -2 OFC,Peanut IgE (kU/L),Total IgE (kU/L),Sex (character),Race,Age
0,IMPACT_101655,2012-10-11,-2,0.0,25.7,92.0,Male,White/Caucasian,45.6
1,IMPACT_102436,2013-03-14,-2,0.0,36.2,,Male,White/Caucasian,46.8
2,IMPACT_102436,2013-03-23,-2,0.0,,187.0,,,
3,IMPACT_105670,2013-04-04,-2,0.0,51.3,118.0,Female,White/Caucasian,36.0
4,IMPACT_113135,2013-11-25,-2,0.0,241.0,408.0,Female,White/Caucasian,31.2
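
Row 2 in the output above shows what the outer join does when a participant's serum values were collected on a date with no matching `ad` row: the demographic columns are NaN. A quick way to count such rows is the `indicator=True` option of `DataFrame.merge`, sketched here on toy frames (the frame names `serum` and `ad` and their values are illustrative assumptions).

```python
import pandas as pd

keys = ["Participant ID", "Collection Date", "Visit"]

# Toy frames: P2 has a second serum row on a date that does not exist in `ad`.
serum = pd.DataFrame({
    "Participant ID": ["P1", "P2", "P2"],
    "Collection Date": ["2012-10-11", "2013-03-14", "2013-03-23"],
    "Visit": [-2, -2, -2],
    "Total IgE (kU/L)": [92.0, None, 187.0],
})
ad = pd.DataFrame({
    "Participant ID": ["P1", "P2"],
    "Collection Date": ["2012-10-11", "2013-03-14"],
    "Visit": [-2, -2],
    "Age": [45.6, 46.8],
})

# indicator=True adds a `_merge` column: 'both', 'left_only', or 'right_only'.
merged = serum.merge(ad, how="outer", on=keys, indicator=True)

print(merged["_merge"].value_counts())
print(merged[merged["_merge"] != "both"])
```

Rows flagged `left_only` or `right_only` are the ones that inflate the row count beyond the number of participants.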


In [83]:
#Merging the merged_serum_ad data with impact_ige_eligible_baseline_merged_common baselines
merged_serum_ad_ige_baseline = impact_ige_eligible_baseline_merged_common.merge(merged_serum_ad, 
                                                               how='outer', 
                                                               on=['Participant ID', 'Collection Date', 'Visit'])

In [84]:
print(merged_serum_ad_ige_baseline.shape)
merged_serum_ad_ige_baseline.head()

(187, 13)


Unnamed: 0,Participant ID,Collection Date,Visit,Ara h1 (kU/L),Ara h2 (kU/L),Ara h3 (kU/L),Ara h6 (kU/L),Passed Visit -2 OFC,Peanut IgE (kU/L),Total IgE (kU/L),Sex (character),Race,Age
0,IMPACT_101655,2012-10-11,-2,8.76,27.5,0.5,17.3,0.0,25.7,92.0,Male,White/Caucasian,45.6
1,IMPACT_102436,2013-03-14,-2,2.34,48.1,0.24,24.9,0.0,36.2,,Male,White/Caucasian,46.8
2,IMPACT_105670,2013-04-04,-2,2.73,61.6,2.4,24.1,0.0,51.3,118.0,Female,White/Caucasian,36.0
3,IMPACT_113135,2013-11-25,-2,6.45,116.0,0.79,172.0,0.0,241.0,408.0,Female,White/Caucasian,31.2
4,IMPACT_115876,2014-03-19,-2,12.1,13.2,0.24,10.2,0.0,19.8,344.0,Male,Mixed Race,26.4


In [85]:
# Merging impact_spt_eligible_baseline_common with merged_serum_ad_ige_baseline
impact_all_baseline = merged_serum_ad_ige_baseline.merge(impact_spt_eligible_baseline_common, 
                                                               how='outer', 
                                                               on=['Participant ID', 'Collection Date', 'Visit'])

In [86]:
print(impact_all_baseline.shape)
impact_all_baseline.head()

(190, 14)


Unnamed: 0,Participant ID,Collection Date,Visit,Ara h1 (kU/L),Ara h2 (kU/L),Ara h3 (kU/L),Ara h6 (kU/L),Passed Visit -2 OFC,Peanut IgE (kU/L),Total IgE (kU/L),Sex (character),Race,Age,Wheal (mm) baseline
0,IMPACT_101655,2012-10-11,-2,8.76,27.5,0.5,17.3,0.0,25.7,92.0,Male,White/Caucasian,45.6,17.5
1,IMPACT_102436,2013-03-14,-2,2.34,48.1,0.24,24.9,0.0,36.2,,Male,White/Caucasian,46.8,16.0
2,IMPACT_105670,2013-04-04,-2,2.73,61.6,2.4,24.1,0.0,51.3,118.0,Female,White/Caucasian,36.0,14.5
3,IMPACT_113135,2013-11-25,-2,6.45,116.0,0.79,172.0,0.0,241.0,408.0,Female,White/Caucasian,31.2,15.5
4,IMPACT_115876,2014-03-19,-2,12.1,13.2,0.24,10.2,0.0,19.8,344.0,Male,Mixed Race,26.4,17.5


# Final transformation of dataset to match LEAP columns 

In [87]:
# rename 'Passed Visit -2 OFC' to 'OFC Pass'
impact_all_baseline = impact_all_baseline.rename(columns={'Passed Visit -2 OFC': 'OFC Pass'})

# rename Wheal (mm) baseline
impact_all_baseline = impact_all_baseline.rename(columns={'Wheal (mm) baseline': 'Wheal (mm)'})


In [88]:
print(impact_all_baseline.shape)
impact_all_baseline.head()

(190, 14)


Unnamed: 0,Participant ID,Collection Date,Visit,Ara h1 (kU/L),Ara h2 (kU/L),Ara h3 (kU/L),Ara h6 (kU/L),OFC Pass,Peanut IgE (kU/L),Total IgE (kU/L),Sex (character),Race,Age,Wheal (mm)
0,IMPACT_101655,2012-10-11,-2,8.76,27.5,0.5,17.3,0.0,25.7,92.0,Male,White/Caucasian,45.6,17.5
1,IMPACT_102436,2013-03-14,-2,2.34,48.1,0.24,24.9,0.0,36.2,,Male,White/Caucasian,46.8,16.0
2,IMPACT_105670,2013-04-04,-2,2.73,61.6,2.4,24.1,0.0,51.3,118.0,Female,White/Caucasian,36.0,14.5
3,IMPACT_113135,2013-11-25,-2,6.45,116.0,0.79,172.0,0.0,241.0,408.0,Female,White/Caucasian,31.2,15.5
4,IMPACT_115876,2014-03-19,-2,12.1,13.2,0.24,10.2,0.0,19.8,344.0,Male,Mixed Race,26.4,17.5


#### Creating Binary column from Child's Sex -> Male

Note: the original values in the raw IMPACT dataset were:  
'Male', 'Female'  

in this new encoding:  
1=male, 0=female

In [89]:
# function to map the values in 'Sex (character)' to integers
def encode_sex(sex):
    if sex == "Male":
        return 1
    elif sex == "Female":
        return 0
    else:
        return None  # Return None for any other values

# create the new 'Male' column
impact_all_baseline['Male'] = impact_all_baseline['Sex (character)'].apply(encode_sex)

# drop original column 
impact_all_baseline = impact_all_baseline.drop(columns=['Sex (character)'])
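
The same encoding can be written without a helper function by using `Series.map` with a dictionary. This is a minimal sketch on a toy series; the nullable `Int64` dtype is an optional choice (an assumption of this example) that keeps the column integer-typed even when some values are missing.

```python
import pandas as pd

# Toy column with one missing value.
sex = pd.Series(["Male", "Female", None, "Male"], name="Sex (character)")

# Values not found in the dictionary (including missing ones) become NaN.
male = sex.map({"Male": 1, "Female": 0}).astype("Int64")

print(male)
```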


In [90]:
print(impact_all_baseline.shape)
impact_all_baseline.head()

(190, 14)


Unnamed: 0,Participant ID,Collection Date,Visit,Ara h1 (kU/L),Ara h2 (kU/L),Ara h3 (kU/L),Ara h6 (kU/L),OFC Pass,Peanut IgE (kU/L),Total IgE (kU/L),Race,Age,Wheal (mm),Male
0,IMPACT_101655,2012-10-11,-2,8.76,27.5,0.5,17.3,0.0,25.7,92.0,White/Caucasian,45.6,17.5,1.0
1,IMPACT_102436,2013-03-14,-2,2.34,48.1,0.24,24.9,0.0,36.2,,White/Caucasian,46.8,16.0,1.0
2,IMPACT_105670,2013-04-04,-2,2.73,61.6,2.4,24.1,0.0,51.3,118.0,White/Caucasian,36.0,14.5,0.0
3,IMPACT_113135,2013-11-25,-2,6.45,116.0,0.79,172.0,0.0,241.0,408.0,White/Caucasian,31.2,15.5,0.0
4,IMPACT_115876,2014-03-19,-2,12.1,13.2,0.24,10.2,0.0,19.8,344.0,Mixed Race,26.4,17.5,1.0


In [91]:
impact_all_baseline['Race'].unique()

array(['White/Caucasian', 'Mixed Race', 'Asian',
       'Black or African American', nan], dtype=object)

# One Hot Encoding for Race 
Note: the original values for "Race" in the raw IMPACT dataset are as follows:
- 'White/Caucasian', 
- 'Mixed Race', 
- 'Asian',
- 'Black or African American'

In [92]:
# Create dummy variables for "race"
race_dummies = pd.get_dummies(impact_all_baseline['Race'])

# Define the mapping between values and column names
race_mapping = {
    'White/Caucasian': 'White',
    'Black or African American': 'Black',
    'Asian': 'Asian',
    'Mixed Race': 'Mixed',
     np.nan: 'Unknown'
}

# iterate over the mapping and update the dataframe columns
for value, column_name in race_mapping.items():
    if value in race_dummies.columns:
        impact_all_baseline[column_name] = race_dummies[value].fillna(0).astype(int)
    else:
        impact_all_baseline[column_name] = 0

# drop original "race" column
impact_all_baseline = impact_all_baseline.drop('Race', axis=1)
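
One caveat with the loop above: `pd.get_dummies` does not create a column for NaN by default (`dummy_na=False`), so `np.nan` is never found in `race_dummies.columns` and `Unknown` is set to 0 for every row, including rows whose `Race` is missing. If missing race should instead be flagged as `Unknown`, a possible variant is sketched below on toy data; the fixed column list is an assumption chosen to match the names used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value.
race = pd.Series(["White/Caucasian", "Mixed Race", np.nan, "Asian"], name="Race")

race_mapping = {
    "White/Caucasian": "White",
    "Black or African American": "Black",
    "Asian": "Asian",
    "Mixed Race": "Mixed",
}

# Map raw labels to short names; missing or unmapped values become "Unknown".
labels = race.map(race_mapping).fillna("Unknown")

# One-hot encode, then force a fixed column set so absent categories appear as 0.
dummies = (
    pd.get_dummies(labels)
    .astype(int)
    .reindex(columns=["White", "Black", "Asian", "Mixed", "Unknown"], fill_value=0)
)

print(dummies)
```

Each row then has exactly one indicator set to 1, and categories that do not occur in the data (here `Black`) are still present as all-zero columns.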


In [93]:
print(impact_all_baseline.shape)
impact_all_baseline.head()

(190, 18)


Unnamed: 0,Participant ID,Collection Date,Visit,Ara h1 (kU/L),Ara h2 (kU/L),Ara h3 (kU/L),Ara h6 (kU/L),OFC Pass,Peanut IgE (kU/L),Total IgE (kU/L),Age,Wheal (mm),Male,White,Black,Asian,Mixed,Unknown
0,IMPACT_101655,2012-10-11,-2,8.76,27.5,0.5,17.3,0.0,25.7,92.0,45.6,17.5,1.0,1,0,0,0,0
1,IMPACT_102436,2013-03-14,-2,2.34,48.1,0.24,24.9,0.0,36.2,,46.8,16.0,1.0,1,0,0,0,0
2,IMPACT_105670,2013-04-04,-2,2.73,61.6,2.4,24.1,0.0,51.3,118.0,36.0,14.5,0.0,1,0,0,0,0
3,IMPACT_113135,2013-11-25,-2,6.45,116.0,0.79,172.0,0.0,241.0,408.0,31.2,15.5,0.0,1,0,0,0,0
4,IMPACT_115876,2014-03-19,-2,12.1,13.2,0.24,10.2,0.0,19.8,344.0,26.4,17.5,1.0,0,0,0,1,0


In [94]:
# add the missing columns (Flare, h8, h9, Other) and fill with NaNs so the dataset matches LEAP
impact_all_baseline['Flare (mm)'] = np.nan
impact_all_baseline['Ara h8 (kU/L)'] = np.nan
impact_all_baseline['Ara h9 (kU/L)'] = np.nan
impact_all_baseline['Other'] = np.nan

In [95]:
print(impact_all_baseline.shape)
impact_all_baseline.head()

(190, 22)


Unnamed: 0,Participant ID,Collection Date,Visit,Ara h1 (kU/L),Ara h2 (kU/L),Ara h3 (kU/L),Ara h6 (kU/L),OFC Pass,Peanut IgE (kU/L),Total IgE (kU/L),...,Male,White,Black,Asian,Mixed,Unknown,Flare (mm),Ara h8 (kU/L),Ara h9 (kU/L),Other
0,IMPACT_101655,2012-10-11,-2,8.76,27.5,0.5,17.3,0.0,25.7,92.0,...,1.0,1,0,0,0,0,,,,
1,IMPACT_102436,2013-03-14,-2,2.34,48.1,0.24,24.9,0.0,36.2,,...,1.0,1,0,0,0,0,,,,
2,IMPACT_105670,2013-04-04,-2,2.73,61.6,2.4,24.1,0.0,51.3,118.0,...,0.0,1,0,0,0,0,,,,
3,IMPACT_113135,2013-11-25,-2,6.45,116.0,0.79,172.0,0.0,241.0,408.0,...,0.0,1,0,0,0,0,,,,
4,IMPACT_115876,2014-03-19,-2,12.1,13.2,0.24,10.2,0.0,19.8,344.0,...,1.0,0,0,0,1,0,,,,


In [96]:
# reorganizing columns 

print(impact_all_baseline.columns)

Index(['Participant ID', 'Collection Date', 'Visit', 'Ara h1 (kU/L)',
       'Ara h2 (kU/L)', 'Ara h3 (kU/L)', 'Ara h6 (kU/L)', 'OFC Pass',
       'Peanut IgE (kU/L)', 'Total IgE (kU/L)', 'Age', 'Wheal (mm)', 'Male',
       'White', 'Black', 'Asian', 'Mixed', 'Unknown', 'Flare (mm)',
       'Ara h8 (kU/L)', 'Ara h9 (kU/L)', 'Other'],
      dtype='object')


In [97]:
print(impact_all_baseline.shape)
impact_all_baseline.head()

(190, 22)


Unnamed: 0,Participant ID,Collection Date,Visit,Ara h1 (kU/L),Ara h2 (kU/L),Ara h3 (kU/L),Ara h6 (kU/L),OFC Pass,Peanut IgE (kU/L),Total IgE (kU/L),...,Male,White,Black,Asian,Mixed,Unknown,Flare (mm),Ara h8 (kU/L),Ara h9 (kU/L),Other
0,IMPACT_101655,2012-10-11,-2,8.76,27.5,0.5,17.3,0.0,25.7,92.0,...,1.0,1,0,0,0,0,,,,
1,IMPACT_102436,2013-03-14,-2,2.34,48.1,0.24,24.9,0.0,36.2,,...,1.0,1,0,0,0,0,,,,
2,IMPACT_105670,2013-04-04,-2,2.73,61.6,2.4,24.1,0.0,51.3,118.0,...,0.0,1,0,0,0,0,,,,
3,IMPACT_113135,2013-11-25,-2,6.45,116.0,0.79,172.0,0.0,241.0,408.0,...,0.0,1,0,0,0,0,,,,
4,IMPACT_115876,2014-03-19,-2,12.1,13.2,0.24,10.2,0.0,19.8,344.0,...,1.0,0,0,0,1,0,,,,


In [98]:
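# Define the desired column order
desired_order = ['Participant ID',
                 'Collection Date',
                 'Visit',
                 "Age",
                 "Male",
                 "White",
                 "Black",
                 "Asian",
                 "Other",
                 "Mixed",
                 "Unknown",
                 "Wheal (mm)",
                 "Flare (mm)",
                 "Total IgE (kU/L)",
                 "Peanut IgE (kU/L)",
                 "Ara h1 (kU/L)",
                 "Ara h2 (kU/L)",
                 "Ara h3 (kU/L)",
                 # NOTE: the column created earlier is named 'Ara h6 (kU/L)'. The label below has no
                 # match, so reindex creates an all-NaN column and the original Ara h6 values are dropped.
                 "Ara h6 (kUA/L)",
                 "Ara h8 (kU/L)",
                 "Ara h9 (kU/L)",
                 "OFC Pass"]

# Reorder the columns using reindex (labels not present in the frame become all-NaN columns)
impact_all_baseline = impact_all_baseline.reindex(columns=desired_order)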
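
Because `reindex` silently creates an all-NaN column for any label it cannot find, and silently drops any column that is not listed, a small check before reordering can catch naming mismatches such as `Ara h6 (kU/L)` versus `Ara h6 (kUA/L)`. The sketch below uses a two-column toy frame; the names `missing` and `dropped` are introduced only for this example.

```python
import pandas as pd

# Toy frame whose column name differs from the requested one by its unit label.
df = pd.DataFrame({"Participant ID": ["P1"], "Ara h6 (kU/L)": [17.3]})
desired_order = ["Participant ID", "Ara h6 (kUA/L)"]

# Labels requested but absent from the frame (would become all-NaN columns).
missing = [col for col in desired_order if col not in df.columns]

# Columns present in the frame but not requested (would be dropped).
dropped = [col for col in df.columns if col not in desired_order]

print("missing:", missing)
print("dropped:", dropped)
```

If either list is non-empty, the column names can be reconciled before calling `reindex`.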


In [99]:
print(impact_all_baseline.shape)
impact_all_baseline.head()

(190, 22)


Unnamed: 0,Participant ID,Collection Date,Visit,Age,Male,White,Black,Asian,Other,Mixed,...,Flare (mm),Total IgE (kU/L),Peanut IgE (kU/L),Ara h1 (kU/L),Ara h2 (kU/L),Ara h3 (kU/L),Ara h6 (kUA/L),Ara h8 (kU/L),Ara h9 (kU/L),OFC Pass
0,IMPACT_101655,2012-10-11,-2,45.6,1.0,1,0,0,,0,...,,92.0,25.7,8.76,27.5,0.5,,,,0.0
1,IMPACT_102436,2013-03-14,-2,46.8,1.0,1,0,0,,0,...,,,36.2,2.34,48.1,0.24,,,,0.0
2,IMPACT_105670,2013-04-04,-2,36.0,0.0,1,0,0,,0,...,,118.0,51.3,2.73,61.6,2.4,,,,0.0
3,IMPACT_113135,2013-11-25,-2,31.2,0.0,1,0,0,,0,...,,408.0,241.0,6.45,116.0,0.79,,,,0.0
4,IMPACT_115876,2014-03-19,-2,26.4,1.0,0,0,0,,1,...,,344.0,19.8,12.1,13.2,0.24,,,,0.0


# Final export of baseline clean data for IMPACT study 

In [100]:
#commenting out to prevent exporting again
#impact_all_baseline.to_excel('Data/IMPACT_Study/impact_all_baseline.xlsx', index=False)


# Final import of OFC Baseline clean data for IMPACT study 

In [101]:
impact_all_24_clean = pd.read_excel("Data/IMPACT_Study/impact_all_baseline_clean_final.xlsx")


In [102]:
impact_all_24_clean.head()

Unnamed: 0,Participant ID,Collection Date,Visit,Age,Male,White,Black,Asian,Other,Mixed,...,Flare (mm),Total IgE (kU/L),Peanut IgE (kU/L),Ara h1 (kU/L),Ara h2 (kU/L),Ara h3 (kU/L),Ara h6 (kUA/L),Ara h8 (kU/L),Ara h9 (kU/L),OFC Pass
0,IMPACT_101655,2012-10-11,-2,45.6,1,1,0,0,,0,...,,92.0,25.7,8.76,27.5,0.5,,,,0
1,IMPACT_102436,2013-03-23,-2,46.8,1,1,0,0,,0,...,,187.0,36.2,2.34,48.1,0.24,,,,0
2,IMPACT_105670,2013-04-04,-2,36.0,0,1,0,0,,0,...,,118.0,51.3,2.73,61.6,2.4,,,,0
3,IMPACT_113135,2013-11-25,-2,31.2,0,1,0,0,,0,...,,408.0,241.0,6.45,116.0,0.79,,,,0
4,IMPACT_115876,2014-03-19,-2,26.4,1,0,0,0,,1,...,,344.0,19.8,12.1,13.2,0.24,,,,0


In [103]:
impact_all_24_clean.shape

(141, 22)
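
The 141 rows match the 141 common participants found earlier, which suggests one row per participant. A check along these lines could confirm that directly; it is a sketch on toy data, and the helper name `duplicated_ids` is hypothetical.

```python
import pandas as pd

def duplicated_ids(df, id_col="Participant ID"):
    """Return the IDs that appear on more than one row."""
    mask = df[id_col].duplicated(keep=False)
    return df.loc[mask, id_col].unique().tolist()

# Toy frames: one clean, one with a repeated participant.
clean = pd.DataFrame({"Participant ID": ["P1", "P2", "P3"], "Visit": [-2, -2, -2]})
repeated = pd.DataFrame({"Participant ID": ["P1", "P1", "P2"], "Visit": [-2, -2, -2]})

print(duplicated_ids(clean))
print(duplicated_ids(repeated))
```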