In [1]:
import pandas as pd
import numpy as np

In [2]:
# the same folder casing is used throughout so the paths also resolve on case-sensitive file systems
impact_ad = pd.read_excel("Data/IMPACT_Study/ADSTART0_2023-05-25_06-46-11.xlsx")
impact_serum = pd.read_excel("Data/IMPACT_Study/Serum antibody_2023-05-25_06-32-15.xlsx")
impact_ige = pd.read_excel("Data/IMPACT_Study/IgE_IgG4_component_2023-05-25_06-30-40.xlsx")
impact_spt = pd.read_excel("Data/IMPACT_Study/Skin Prick Test_2023-05-25_06-31-04.xlsx")

# About Data Sets

#### ADSTART0_2023-05-25_06-46-11 
- Overview of participant data 

#### IgE_IgG4_component_2023-05-25_06-30-40
- IgE and IgG4 levels for the peanut components rAra h 1, rAra h 2, rAra h 3, and rAra h 6

#### Serum antibody_2023-05-25_06-32-15
OFC results:
- "Passed Visit 24 OFC for ITT"
- "Passed Visit 24 OFC No Imputation"	
- "Passed Visit 26 OFC for ITT"
- "Passed Visit 26 OFC No Imputation"

#### Skin Prick Test_2023-05-25_06-31-04
- "Wheal (mm)"

#### BAT data_2023-05-25_06-31-20
- Stands for "basophil activation test (BAT)"
- No known use case for this model 

---

# Acronyms 
- Intent-to-treat (ITT)
- Oral Immunotherapy (OIT)
- Initial Dose Escalation (IDE)

# About Study 
- taken from 2022 Lancet_IMPACT.pdf (pdf page 3)  

Children aged 12 months or older and younger than 
48 months were screened for inclusion in the study. 

__Inclusion criteria__ included the following: a clinical 
history of peanut allergy or avoidance without ever 
having eaten peanut, peanut-specific IgE levels of 5 kUA/L 
or higher, a skin prick test (SPT) wheal size greater than 
that of saline control by 3 mm or more, and a positive 
reaction to a cumulative dose of 500 mg or less of peanut 
in a double-blind, placebo-controlled food challenge 
(DBPCFC).   

__Exclusion criteria__ included a history of 
severe anaphylaxis with hypotension to peanut, more 
than mild asthma or uncontrolled asthma, uncontrolled 
atopic dermatitis, and eosinophilic gastrointestinal 
disease (the full list of exclusion criteria is presented in 
the appendix p 2).

---
# Study Design

This is a randomized, double-blind, placebo-controlled, multi-center study comparing peanut oral immunotherapy to placebo. Eligible participants with peanut allergy will be randomly assigned to receive either peanut OIT or placebo for 134 weeks followed by peanut avoidance for 26 weeks.  

An initial blinded oral food challenge (OFC) to 1 g of peanut flour (500 mg peanut protein) will be conducted. Participants must have a clinical reaction during this blinded OFC to initiate study dosing. After the initial blinded OFC, the study design includes the following:  

__Initial Dose Escalation:__ This will occur on a single day in which multiple doses are given. Peanut or placebo dosing will be given incrementally and increase every 15-30 minutes until a dose of 12 mg peanut flour (6 mg peanut protein) or placebo flour is given. The first four doses will be administered as a peanut flour extract of 0.1 to 0.8 mg peanut protein, which is 10 to 80 microliters peanut flour extract, or placebo flour extract, and the last three doses will be given as peanut flour of 3 to 12 mg peanut flour (1.5 to 6 mg peanut protein) or placebo flour. Participants must tolerate a dose of at least 3 mg peanut flour (1.5 mg peanut protein) or placebo flour to remain in the study.  

__Build-up:__ After the initial dose escalation day, the participant will return to the research unit the next morning for an observed dose administration of the highest tolerated dose from the initial escalation day. The participant will then continue on the daily OIT dosing at home and return to the research unit every 2 weeks for a dose escalation. The dosing escalations will be consistent with previous similar OIT studies.
  
Participants who do not reach the 4000 mg peanut flour (2000 mg peanut protein) or placebo flour dose during the build-up phase may enter maintenance phase at their highest tolerated dose, which must be at least 500 mg peanut flour (250 mg peanut protein) or placebo flour.  

The build-up phase will comprise 30 weeks.  

__Maintenance:__ The participant will continue on daily OIT with return visits every 13 weeks. At the end of this phase the participant will undergo a blinded OFC to 10 g peanut flour (5 g peanut protein).  
This phase will comprise 104 weeks.  

__Avoidance:__ In this final phase participants stop OIT and will avoid peanut consumption. They will be seen 2 weeks and 26 weeks after initiating this phase. At the completion of this phase participants will have a final blinded OFC to 10 g peanut flour (5 g peanut protein). Participants who do not have a clinical reaction to the challenge will receive an Open Food Challenge (OpFC).  

Avoidance will comprise 26 weeks.  

__Post-challenge:__ If participants do not have a clinical reaction during the OpFC at the end of avoidance, they will be allowed to consume peanut and will have one visit which will include peripheral blood sampling for mechanistic assays assessments.  

Post-challenge will comprise 2 weeks.  
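
The phase lengths above can be added up to check them against the week numbers used later in the notebook (visit 24/week 134 and visit 26/week 160). This is a minimal sketch that counts weeks from the start of build-up and ignores the single initial dose escalation day, which is an assumption made for illustration.

```python
# Phase lengths in weeks, as stated in the Study Design text.
phases = [
    ("Build-up", 30),
    ("Maintenance", 104),
    ("Avoidance", 26),
    ("Post-challenge", 2),
]

# Accumulate the week at which each phase ends.
end_week = {}
elapsed = 0
for name, weeks in phases:
    elapsed += weeks
    end_week[name] = elapsed

for name, week in end_week.items():
    print(f"{name} ends at week {week}")
```

Maintenance ends at week 134 (30 + 104), which matches the 134 weeks of OIT or placebo in the design summary, and avoidance ends at week 160.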

---
# Exploring where and how OFCs are captured
Looking for 3 OFC tests in total

### 1st OFC: According to Study Design in protocol: 
- Initial reaction: "An initial blinded oral food challenge (OFC) to 1 g of peanut flour (500 mg peanut protein) will be conducted. Participants must have a clinical reaction during this blinded OFC to initiate study dosing." (This challenge takes place before the IDE.)
- This challenge used 0.5 g peanut protein

### 2nd & 3rd OFC: According to Schedule of Assessments: Appendix 2:
- 5 g Oral Food Challenge performed at the end of the Maintenance phase (visit 24/week 134) and at the end of the Avoidance phase (visit 26/week 160)
- These challenges went up to 5 g peanut protein



# Filtering just the useful columns:

In [3]:
# impact_ad
# impact_serum 
# impact_ige 
# impact_spt

impact_ad_cols_to_keep = [
     'Participant ID',
     'Date of Screening Visit',
     'Randomized', # Use this to filter out participants that did not pass the study screening
     'Study Status', # Values: 'Discontinued Therapy', 'Completed Study', 'Enrolled but not Randomized', 'Early Termination', 'Screen Failure', 'Screened but not Enrolled'
     'Sex (character)',
     'Race',
     'Completed Study Assessments Numeric', #1 for yes/ 0 for no (means attended all visits up to 26, excluding 27)
     'Age at Screening (years) Not Rounded',
     'Study Termination Reason'
]

impact_serum_cols_to_keep = [
    'Participant ID',
    'Collection Date',
    'Visit',
    'Test Name', # Peanut IgE, Peanut IgE/Total IgE ratio, Peanut IgG4*, Peanut IgG4/IgE ratio, Total IgE
    'Unit',
    'Value', # results from 'Test Name'
]

impact_ige_cols_to_keep = [
    'Participant ID',
    'Collection Date',
    'Visit',
    'Antibody', #IgE, IgG4
    'Component', #rAra h 1, rAra h 2, rAra h 3, rAra h 6
    'Value',
    'Unit' 
]

impact_spt_cols_to_keep = [
    'Participant ID',
    'Date of Allergy Skin Test (Character)', # this is similar to 'Collection Date' in other datasets 
    'Visit',
    'Wheal (mm)',
    #'OUT24NOI' # would be the OFC result for visit 24 (currently commented out)
    'OUT26NOI' # this will be used as the OFC result for visit 26
]

impact_ad = impact_ad[impact_ad_cols_to_keep]
impact_serum = impact_serum[impact_serum_cols_to_keep]
impact_ige = impact_ige[impact_ige_cols_to_keep]
impact_spt = impact_spt[impact_spt_cols_to_keep]
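
Selecting columns by a list of names raises a `KeyError` if any name is misspelled or absent from the export. The sketch below shows one way to fail early with the missing names spelled out. The helper `select_columns` and the toy DataFrame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def select_columns(df, cols_to_keep):
    """Return df restricted to cols_to_keep, listing any missing names first."""
    missing = [c for c in cols_to_keep if c not in df.columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    # .copy() returns an independent DataFrame rather than a slice of df
    return df[cols_to_keep].copy()

# Toy data standing in for one of the study exports
toy = pd.DataFrame({"Participant ID": ["A", "B"], "Visit": [-2, 26], "Extra": [1, 2]})
subset = select_columns(toy, ["Participant ID", "Visit"])
print(subset.columns.tolist())
```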


In [4]:
print(impact_ad.shape)
impact_ad.head()

(209, 9)


Unnamed: 0,Participant ID,Date of Screening Visit,Randomized,Study Status,Sex (character),Race,Completed Study Assessments Numeric,Age at Screening (years) Not Rounded,Study Termination Reason
0,IMPACT_101655,2012-10-11,Yes,Discontinued Therapy,Male,White/Caucasian,0,3.8,Adverse Event
1,IMPACT_102436,2013-03-14,Yes,Completed Study,Male,White/Caucasian,1,3.9,
2,IMPACT_105670,2013-04-04,Yes,Completed Study,Female,White/Caucasian,1,3.0,
3,IMPACT_112496,2013-05-05,No,Enrolled but not Randomized,Female,White/Caucasian,0,3.2,Withdrawn Consent
4,IMPACT_113135,2013-11-25,Yes,Early Termination,Female,White/Caucasian,0,2.6,Withdrawn Consent


In [5]:
print(impact_serum.shape)
impact_serum.head()

(3069, 6)


Unnamed: 0,Participant ID,Collection Date,Visit,Test Name,Unit,Value
0,IMPACT_101655,2012-10-11,-2,Peanut IgE,kU/L,25.7
1,IMPACT_101655,2012-10-11,-2,Peanut IgE/Total IgE ratio,Ratio,27.934783
2,IMPACT_101655,2012-10-11,-2,Peanut IgG4*,mcg/mL,0.3
3,IMPACT_101655,2012-10-11,-2,Peanut IgG4/IgE ratio,Ratio,0.004864
4,IMPACT_101655,2012-10-11,-2,Total IgE,IU/mL,92.0


In [6]:
print(impact_ige.shape)
impact_ige.head()

(4848, 7)


Unnamed: 0,Participant ID,Collection Date,Visit,Antibody,Component,Value,Unit
0,IMPACT_101655,2012-10-11,-2,IgE,rAra h 1,8.76,KU/L
1,IMPACT_101655,2012-10-11,-2,IgE,rAra h 2,27.5,KU/L
2,IMPACT_101655,2012-10-11,-2,IgE,rAra h 3,0.5,KU/L
3,IMPACT_101655,2012-10-11,-2,IgE,rAra h 6,17.3,kUA/L
4,IMPACT_101655,2012-10-11,-2,IgG4,rAra h 1,0.15,MG/L


In [7]:
print(impact_spt.shape)
impact_spt.head()

(621, 5)


Unnamed: 0,Participant ID,Date of Allergy Skin Test (Character),Visit,Wheal (mm),OUT26NOI
0,IMPACT_101655,2012-10-11,-2,17.5,
1,IMPACT_101655,2013-05-28,16,4.5,
2,IMPACT_102436,2013-03-14,-2,16.0,0.0
3,IMPACT_102436,2013-11-26,16,9.5,0.0
4,IMPACT_102436,2014-11-28,20,,0.0


---
# Filtering participants that were study eligible (n=144)
- Total participants in ADSTART0 = 209  
- Total participants eligible = 144
- Total excluded = 65 (63 that were not randomized + 2 that could not complete the IDE)

#### filtering solution (this will help filter initial screening OFC results too)
- filter by ADSTART0 -> impact_ad['Randomized'] == "Yes"
- filter by ADSTART0 -> "Study Termination Reason" is not "Inability to reach 3 mg peanut/placebo flour (1.5 mg peanut/placebo protein) during the initial dose escalation" (These are the two participants who were randomized but could not reach the minimum IDE dose: 146 - 2 -> n = 144)

In [8]:
# filter out ineligible participants: keep randomized participants who did not terminate for IDE failure
ide_failure_reason = 'Inability to reach 3 mg peanut/placebo flour (1.5 mg peanut/placebo protein) during the initial dose escalation'
impact_ad_eligible = impact_ad.loc[
    (impact_ad['Randomized'] == 'Yes')
    & (impact_ad['Study Termination Reason'] != ide_failure_reason)
]


In [9]:
# drop column
impact_ad_eligible = impact_ad_eligible.drop(columns=['Randomized'])

In [10]:
print(impact_ad_eligible.shape)
impact_ad_eligible.head()

(144, 8)


Unnamed: 0,Participant ID,Date of Screening Visit,Study Status,Sex (character),Race,Completed Study Assessments Numeric,Age at Screening (years) Not Rounded,Study Termination Reason
0,IMPACT_101655,2012-10-11,Discontinued Therapy,Male,White/Caucasian,0,3.8,Adverse Event
1,IMPACT_102436,2013-03-14,Completed Study,Male,White/Caucasian,1,3.9,
2,IMPACT_105670,2013-04-04,Completed Study,Female,White/Caucasian,1,3.0,
4,IMPACT_113135,2013-11-25,Early Termination,Female,White/Caucasian,0,2.6,Withdrawn Consent
5,IMPACT_115876,2014-03-19,Completed Study,Male,Mixed Race,1,2.2,
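
The eligibility filter relies on how pandas compares missing values: participants who completed the study have no termination reason (NaN), and they must be kept. A minimal sketch on toy data shows this behaviour. The participant IDs and the placeholder label `IDE_FAILURE` are made up for illustration.

```python
import numpy as np
import pandas as pd

IDE_FAILURE = "could not reach minimum dose"  # placeholder label, not the real study text

toy_ad = pd.DataFrame({
    "Participant ID": ["P1", "P2", "P3", "P4"],
    "Randomized": ["Yes", "Yes", "No", "Yes"],
    "Study Termination Reason": [np.nan, IDE_FAILURE, "Withdrawn Consent", "Adverse Event"],
})

randomized = toy_ad["Randomized"] == "Yes"
# NaN != "text" evaluates to True, so participants with no termination reason are kept
not_ide_failure = toy_ad["Study Termination Reason"] != IDE_FAILURE

toy_eligible = toy_ad.loc[randomized & not_ide_failure]
print(toy_eligible["Participant ID"].tolist())  # ['P1', 'P4']
```

P2 is dropped for IDE failure and P3 for not being randomized. P1 stays even though its termination reason is missing.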


# Obtain list of participants who were eligible 


In [11]:
eligible_participant_IDs = set(impact_ad_eligible['Participant ID'])

In [12]:
len(eligible_participant_IDs)

144

# Filtering each remaining dataset to include just the eligible participants 

In [13]:
impact_ige_eligible = impact_ige[impact_ige['Participant ID'].isin(eligible_participant_IDs)]
impact_serum_eligible = impact_serum[impact_serum['Participant ID'].isin(eligible_participant_IDs)]
impact_spt_eligible = impact_spt[impact_spt['Participant ID'].isin(eligible_participant_IDs)]

# all datasets so far with 144 eligible participants and relevant columns:
# impact_ad_eligible
# impact_ige_eligible 
# impact_serum_eligible
# impact_spt_eligible

In [14]:
eligible_datasets = {
    "impact_ad_eligible": impact_ad_eligible,
    "impact_ige_eligible": impact_ige_eligible,
    "impact_serum_eligible": impact_serum_eligible,
    "impact_spt_eligible": impact_spt_eligible,
}

# shape of each filtered dataset
for name, df in eligible_datasets.items():
    print(name)
    print(df.shape)

# number of unique participants in each filtered dataset
for name, df in eligible_datasets.items():
    print(name)
    print(df['Participant ID'].nunique())

impact_ad_eligible
(144, 8)
impact_ige_eligible
(4832, 7)
impact_serum_eligible
(3059, 6)
impact_spt_eligible
(619, 5)
impact_ad_eligible
144
impact_ige_eligible
144
impact_serum_eligible
144
impact_spt_eligible
144


In [15]:
print(impact_ad_eligible.shape)
impact_ad_eligible.head()

(144, 8)


Unnamed: 0,Participant ID,Date of Screening Visit,Study Status,Sex (character),Race,Completed Study Assessments Numeric,Age at Screening (years) Not Rounded,Study Termination Reason
0,IMPACT_101655,2012-10-11,Discontinued Therapy,Male,White/Caucasian,0,3.8,Adverse Event
1,IMPACT_102436,2013-03-14,Completed Study,Male,White/Caucasian,1,3.9,
2,IMPACT_105670,2013-04-04,Completed Study,Female,White/Caucasian,1,3.0,
4,IMPACT_113135,2013-11-25,Early Termination,Female,White/Caucasian,0,2.6,Withdrawn Consent
5,IMPACT_115876,2014-03-19,Completed Study,Male,Mixed Race,1,2.2,


# Standardizing Column Names Across Datasets



In [16]:
# impact_ad_eligible
# impact_ige_eligible 
# impact_serum_eligible
# impact_spt_eligible

# renaming 'Date of Allergy Skin Test (Character)' to 'Collection Date' to match the other datasets 
impact_spt_eligible = impact_spt_eligible.rename(columns={'Date of Allergy Skin Test (Character)': 'Collection Date'})


In [17]:
print(impact_spt_eligible.shape)
impact_spt_eligible.head()

(619, 5)


Unnamed: 0,Participant ID,Collection Date,Visit,Wheal (mm),OUT26NOI
0,IMPACT_101655,2012-10-11,-2,17.5,
1,IMPACT_101655,2013-05-28,16,4.5,
2,IMPACT_102436,2013-03-14,-2,16.0,0.0
3,IMPACT_102436,2013-11-26,16,9.5,0.0
4,IMPACT_102436,2014-11-28,20,,0.0


# Filtering all rows except those from visit 26

The results below show 98 eligible participants with visit 26 data in at least one dataset (94 of them appear in all three). 


In [18]:
# impact_ige_eligible 

In [19]:
impact_ige_eligible_26 = impact_ige_eligible[impact_ige_eligible['Visit'] == 26]

In [20]:
print(impact_ige_eligible_26.shape)
print(len(impact_ige_eligible_26['Participant ID'].unique().tolist()))
impact_ige_eligible_26.head()

(760, 7)
95


Unnamed: 0,Participant ID,Collection Date,Visit,Antibody,Component,Value,Unit
80,IMPACT_105670,2016-05-27,26,IgE,rAra h 1,3.94,kUA/L
81,IMPACT_105670,2016-05-27,26,IgE,rAra h 2,8.24,kUA/L
82,IMPACT_105670,2016-05-27,26,IgE,rAra h 3,0.32,kUA/L
83,IMPACT_105670,2016-05-27,26,IgE,rAra h 6,3.29,kUA/L
84,IMPACT_105670,2016-05-27,26,IgG4,rAra h 1,0.52,MG/L


In [21]:
# impact_serum_eligible

In [22]:
impact_serum_eligible_26 = impact_serum_eligible[impact_serum_eligible['Visit'] == 26]

In [23]:
print(impact_serum_eligible_26.shape)
print(len(impact_serum_eligible_26['Participant ID'].unique().tolist()))
impact_serum_eligible_26.head()

(484, 6)
97


Unnamed: 0,Participant ID,Collection Date,Visit,Test Name,Unit,Value
30,IMPACT_102436,2016-05-17,26,Peanut IgE,kU/L,15.3
31,IMPACT_102436,2016-05-17,26,Peanut IgE/Total IgE ratio,Ratio,19.125
32,IMPACT_102436,2016-05-17,26,Peanut IgG4*,mcg/mL,12.2
33,IMPACT_102436,2016-05-17,26,Peanut IgG4/IgE ratio,Ratio,0.332244
34,IMPACT_102436,2016-05-17,26,Total IgE,IU/mL,80.0


In [24]:
# impact_spt_eligible

In [25]:
impact_spt_eligible_26 = impact_spt_eligible[impact_spt_eligible['Visit'] == 26]

In [26]:
print(impact_spt_eligible_26.shape)
print(len(impact_spt_eligible_26['Participant ID'].unique().tolist()))
impact_spt_eligible_26.head()

(97, 5)
97


Unnamed: 0,Participant ID,Collection Date,Visit,Wheal (mm),OUT26NOI
6,IMPACT_102436,2016-05-17,26,4.0,0.0
11,IMPACT_105670,2016-05-27,26,9.5,0.0
16,IMPACT_113135,2017-02-28,26,12.0,
21,IMPACT_115876,2017-05-10,26,20.5,0.0
30,IMPACT_139237,2018-02-22,26,10.0,0.0


In [27]:
# EARLIER VERSION, KEPT FOR REFERENCE: the intersection kept only participants present in all three datasets, which left out 4 participants
# # finding common participants who reached visit 26

# # impact_ige_eligible_26
# # impact_serum_eligible_26
# # impact_spt_eligible_26

# ige_list_26 = impact_ige_eligible_26['Participant ID'].tolist()
# serum_list_26 = impact_serum_eligible_26['Participant ID'].tolist()
# spt_list_26 = impact_spt_eligible_26['Participant ID'].tolist()

# common_participants_26 = set(spt_list_26).intersection(ige_list_26, serum_list_26)
# common_participants_26 = list(common_participants_26)



In [28]:
# collecting every participant with visit 26 data in at least one dataset
# (a union of the three ID lists, not an intersection)

# impact_ige_eligible_26
# impact_serum_eligible_26
# impact_spt_eligible_26

ige_list_26 = impact_ige_eligible_26['Participant ID'].tolist()
serum_list_26 = impact_serum_eligible_26['Participant ID'].tolist()
spt_list_26 = impact_spt_eligible_26['Participant ID'].tolist()

# the concatenated lists contain repeated IDs; the set removes them
common_participants_26 = list(set(spt_list_26 + ige_list_26 + serum_list_26))


In [29]:
len(set(common_participants_26))
# old output (intersection of the three lists): 94
# new output (union of the three lists): 98

98
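
The difference between the old count (94) and the new one (98) comes from switching from an intersection to a union of the participant lists. A small sketch with made-up IDs shows the two operations side by side.

```python
# Toy participant IDs per dataset (hypothetical values)
spt_ids = {"P1", "P2", "P3", "P4"}
ige_ids = {"P1", "P2", "P3"}
serum_ids = {"P1", "P2", "P4", "P5"}

in_all = spt_ids & ige_ids & serum_ids   # intersection: present in every dataset
in_any = spt_ids | ige_ids | serum_ids   # union: present in at least one dataset

print(sorted(in_all))  # ['P1', 'P2']
print(sorted(in_any))  # ['P1', 'P2', 'P3', 'P4', 'P5']
```

With a union, a participant can be on the list and still be missing from one of the datasets, so later merges should expect NaN values for those participants.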

In [30]:

# Updating each data frame so that it only contains participants with visit 26 data in at least one dataset
impact_ad_eligible_26_common = impact_ad_eligible[impact_ad_eligible['Participant ID'].isin(common_participants_26)]
impact_ige_eligible_26_common = impact_ige_eligible_26[impact_ige_eligible_26['Participant ID'].isin(common_participants_26)]
impact_serum_eligible_26_common = impact_serum_eligible_26[impact_serum_eligible_26['Participant ID'].isin(common_participants_26)]
impact_spt_eligible_26_common = impact_spt_eligible_26[impact_spt_eligible_26['Participant ID'].isin(common_participants_26)]



In [31]:
print(impact_ad_eligible_26_common.shape)
print(len(impact_ad_eligible_26_common['Participant ID'].unique().tolist()))
impact_ad_eligible_26_common.head()

(98, 8)
98


Unnamed: 0,Participant ID,Date of Screening Visit,Study Status,Sex (character),Race,Completed Study Assessments Numeric,Age at Screening (years) Not Rounded,Study Termination Reason
1,IMPACT_102436,2013-03-14,Completed Study,Male,White/Caucasian,1,3.9,
2,IMPACT_105670,2013-04-04,Completed Study,Female,White/Caucasian,1,3.0,
4,IMPACT_113135,2013-11-25,Early Termination,Female,White/Caucasian,0,2.6,Withdrawn Consent
5,IMPACT_115876,2014-03-19,Completed Study,Male,Mixed Race,1,2.2,
10,IMPACT_139237,2014-12-24,Completed Study,Male,Asian,1,3.1,


In [32]:
print(impact_ige_eligible_26_common.shape)
print(len(impact_ige_eligible_26_common['Participant ID'].unique().tolist()))
impact_ige_eligible_26_common.head()

(760, 7)
95


Unnamed: 0,Participant ID,Collection Date,Visit,Antibody,Component,Value,Unit
80,IMPACT_105670,2016-05-27,26,IgE,rAra h 1,3.94,kUA/L
81,IMPACT_105670,2016-05-27,26,IgE,rAra h 2,8.24,kUA/L
82,IMPACT_105670,2016-05-27,26,IgE,rAra h 3,0.32,kUA/L
83,IMPACT_105670,2016-05-27,26,IgE,rAra h 6,3.29,kUA/L
84,IMPACT_105670,2016-05-27,26,IgG4,rAra h 1,0.52,MG/L


In [33]:
print(impact_serum_eligible_26_common.shape)
print(len(impact_serum_eligible_26_common['Participant ID'].unique().tolist()))
impact_serum_eligible_26_common.head()

(484, 6)
97


Unnamed: 0,Participant ID,Collection Date,Visit,Test Name,Unit,Value
30,IMPACT_102436,2016-05-17,26,Peanut IgE,kU/L,15.3
31,IMPACT_102436,2016-05-17,26,Peanut IgE/Total IgE ratio,Ratio,19.125
32,IMPACT_102436,2016-05-17,26,Peanut IgG4*,mcg/mL,12.2
33,IMPACT_102436,2016-05-17,26,Peanut IgG4/IgE ratio,Ratio,0.332244
34,IMPACT_102436,2016-05-17,26,Total IgE,IU/mL,80.0


In [34]:
print(impact_spt_eligible_26_common.shape)
print(len(impact_spt_eligible_26_common['Participant ID'].unique().tolist()))
impact_spt_eligible_26_common.head()

(97, 5)
97


Unnamed: 0,Participant ID,Collection Date,Visit,Wheal (mm),OUT26NOI
6,IMPACT_102436,2016-05-17,26,4.0,0.0
11,IMPACT_105670,2016-05-27,26,9.5,0.0
16,IMPACT_113135,2017-02-28,26,12.0,
21,IMPACT_115876,2017-05-10,26,20.5,0.0
30,IMPACT_139237,2018-02-22,26,10.0,0.0


# Keeping only participants that completed visit 26 OFCs AND are in the common participants list
- Result: 
    - 93 participants completed visit 26 OFCs AND are in the common participants list

In [35]:
# keeping only the rows with a recorded visit 26 OFC result (non-NaN 'OUT26NOI')
spt_not_nan = impact_spt_eligible_26[impact_spt_eligible_26['OUT26NOI'].notna()]


In [36]:
print(spt_not_nan.shape)
spt_not_nan.head()

(93, 5)


Unnamed: 0,Participant ID,Collection Date,Visit,Wheal (mm),OUT26NOI
6,IMPACT_102436,2016-05-17,26,4.0,0.0
11,IMPACT_105670,2016-05-27,26,9.5,0.0
21,IMPACT_115876,2017-05-10,26,20.5,0.0
30,IMPACT_139237,2018-02-22,26,10.0,0.0
45,IMPACT_196748,2017-08-13,26,10.5,0.0


In [37]:
len(set(spt_not_nan['Participant ID']))

93

In [38]:
len(set(common_participants_26))

98

In [39]:
spt_not_nan_common = spt_not_nan[spt_not_nan['Participant ID'].isin(common_participants_26)]

In [40]:
print(spt_not_nan_common.shape)
spt_not_nan_common.head()

(93, 5)


Unnamed: 0,Participant ID,Collection Date,Visit,Wheal (mm),OUT26NOI
6,IMPACT_102436,2016-05-17,26,4.0,0.0
11,IMPACT_105670,2016-05-27,26,9.5,0.0
21,IMPACT_115876,2017-05-10,26,20.5,0.0
30,IMPACT_139237,2018-02-22,26,10.0,0.0
45,IMPACT_196748,2017-08-13,26,10.5,0.0


# getting the list of participants that completed visit 26 OFCs AND are in the common participants list and updating the dataframes 

In [41]:
len(spt_not_nan_common['Participant ID'].unique())

93

In [42]:
common_26_ofc_complete = list(set(spt_not_nan_common['Participant ID']))

In [43]:
len(common_26_ofc_complete)

93

In [44]:

# Updating each data frame so that it only contains participants with a visit 26 OFC result
impact_ad_eligible_26_common = impact_ad_eligible[impact_ad_eligible['Participant ID'].isin(common_26_ofc_complete)]
impact_ige_eligible_26_common = impact_ige_eligible_26[impact_ige_eligible_26['Participant ID'].isin(common_26_ofc_complete)]
impact_serum_eligible_26_common = impact_serum_eligible_26[impact_serum_eligible_26['Participant ID'].isin(common_26_ofc_complete)]
impact_spt_eligible_26_common = impact_spt_eligible_26[impact_spt_eligible_26['Participant ID'].isin(common_26_ofc_complete)]



# Calculating Age at visit 26



In [45]:
# transforming the "Age at Screening (years) Not Rounded" column in "impact_ad_eligible_26_common" to be in months

# take an explicit copy so the assignment below does not trigger pandas' SettingWithCopyWarning
impact_ad_eligible_26_common = impact_ad_eligible_26_common.copy()

# converting 'Age at Screening (years) Not Rounded' to months (the column keeps its original name)
impact_ad_eligible_26_common['Age at Screening (years) Not Rounded'] = impact_ad_eligible_26_common['Age at Screening (years) Not Rounded'] * 12

print(impact_ad_eligible_26_common.shape)
impact_ad_eligible_26_common.head()


(93, 8)


Unnamed: 0,Participant ID,Date of Screening Visit,Study Status,Sex (character),Race,Completed Study Assessments Numeric,Age at Screening (years) Not Rounded,Study Termination Reason
1,IMPACT_102436,2013-03-14,Completed Study,Male,White/Caucasian,1,46.8,
2,IMPACT_105670,2013-04-04,Completed Study,Female,White/Caucasian,1,36.0,
5,IMPACT_115876,2014-03-19,Completed Study,Male,Mixed Race,1,26.4,
10,IMPACT_139237,2014-12-24,Completed Study,Male,Asian,1,37.2,
19,IMPACT_196748,2014-06-15,Completed Study,Male,White/Caucasian,1,44.4,
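
Assigning to a column of a DataFrame that was produced by filtering another DataFrame can raise pandas' `SettingWithCopyWarning`, because pandas cannot tell whether the original was meant to change. A minimal sketch of the usual remedy, an explicit `.copy()` after filtering, on toy data with hypothetical column names:

```python
import pandas as pd

toy = pd.DataFrame({"Participant ID": ["P1", "P2", "P3"], "Age (years)": [3.9, 3.0, 2.2]})

# .copy() makes the filtered result an independent DataFrame,
# so the column assignment below cannot affect `toy`
subset = toy[toy["Participant ID"].isin(["P1", "P3"])].copy()
subset["Age (months)"] = subset["Age (years)"] * 12

print(subset["Age (months)"].round(1).tolist())  # [46.8, 26.4]
```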


In [46]:
# Using OFC Results from SPT to Calculate date difference 

In [47]:
impact_spt_eligible_26_common.head()

Unnamed: 0,Participant ID,Collection Date,Visit,Wheal (mm),OUT26NOI
6,IMPACT_102436,2016-05-17,26,4.0,0.0
11,IMPACT_105670,2016-05-27,26,9.5,0.0
21,IMPACT_115876,2017-05-10,26,20.5,0.0
30,IMPACT_139237,2018-02-22,26,10.0,0.0
45,IMPACT_196748,2017-08-13,26,10.5,0.0


In [48]:
# merging ad and spt
merged_ad_spt = impact_ad_eligible_26_common.merge(
    impact_spt_eligible_26_common,
    how='outer',  # preserves all rows and adds NaN where one side has no match
    on=['Participant ID'],
)


In [49]:
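# calculate the time from screening to the visit 26 skin test in months (approximating a month as 30 days)
# and add it to 'Age at Screening (years) Not Rounded', which holds months after the conversion above
months_since_screening = (
    pd.to_datetime(merged_ad_spt['Collection Date'])
    - pd.to_datetime(merged_ad_spt['Date of Screening Visit'])
).dt.days / 30
merged_ad_spt['Age at OFC 26'] = merged_ad_spt['Age at Screening (years) Not Rounded'] + months_since_screening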


In [50]:
# round the 'Age at OFC 26' column to one decimal place (tenths of a month)
merged_ad_spt['Age at OFC 26'] = merged_ad_spt['Age at OFC 26'].round(1)



In [51]:
# dropping initial screening column (and other unnecessary cols)

merged_ad_spt = merged_ad_spt.drop(columns = ['Date of Screening Visit', 
                                              'Completed Study Assessments Numeric', 
                                              'Study Termination Reason', 
                                              'Study Status',
                                              'Age at Screening (years) Not Rounded'
                                             ])

In [52]:
print(merged_ad_spt.shape)
merged_ad_spt.head()

(93, 8)


Unnamed: 0,Participant ID,Sex (character),Race,Collection Date,Visit,Wheal (mm),OUT26NOI,Age at OFC 26
0,IMPACT_102436,Male,White/Caucasian,2016-05-17,26,4.0,0.0,85.5
1,IMPACT_105670,Female,White/Caucasian,2016-05-27,26,9.5,0.0,74.3
2,IMPACT_115876,Male,Mixed Race,2017-05-10,26,20.5,0.0,64.7
3,IMPACT_139237,Male,Asian,2018-02-22,26,10.0,0.0,75.7
4,IMPACT_196748,Male,White/Caucasian,2017-08-13,26,10.5,0.0,82.9
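
The first row of the table above (IMPACT_102436, 85.5 months) can be reproduced by hand from values shown earlier in the notebook: an age at screening of 3.9 years, a screening date of 2013-03-14, and a visit 26 date of 2016-05-17. The sketch also prints the result for an average month length of 365.25 / 12 days, for comparison only; the notebook itself uses 30 days.

```python
import pandas as pd

age_at_screening_months = 3.9 * 12                 # 46.8 months
screening = pd.Timestamp("2013-03-14")
visit_26 = pd.Timestamp("2016-05-17")

days_elapsed = (visit_26 - screening).days          # 1160 days

# the notebook's approach: a month is approximated as 30 days
age_30_day_months = age_at_screening_months + days_elapsed / 30
# alternative for comparison: average calendar month of 365.25 / 12 days
age_avg_months = age_at_screening_months + days_elapsed / (365.25 / 12)

print(round(age_30_day_months, 1))  # 85.5
print(round(age_avg_months, 1))     # 84.9
```

The 30-day approximation overstates elapsed months slightly, here by roughly half a month over three years.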


# 3 datasets left to merge


In [53]:
# 3 datasets left to merge: 
# merged_ad_spt -> contains AGE at 26
# impact_ige_eligible_26_common -> contains components 
# impact_serum_eligible_26_common -> contains Peanut & Total IgE 



In [54]:
print(merged_ad_spt.shape)
merged_ad_spt.head()

(93, 8)


Unnamed: 0,Participant ID,Sex (character),Race,Collection Date,Visit,Wheal (mm),OUT26NOI,Age at OFC 26
0,IMPACT_102436,Male,White/Caucasian,2016-05-17,26,4.0,0.0,85.5
1,IMPACT_105670,Female,White/Caucasian,2016-05-27,26,9.5,0.0,74.3
2,IMPACT_115876,Male,Mixed Race,2017-05-10,26,20.5,0.0,64.7
3,IMPACT_139237,Male,Asian,2018-02-22,26,10.0,0.0,75.7
4,IMPACT_196748,Male,White/Caucasian,2017-08-13,26,10.5,0.0,82.9


In [55]:
print(impact_ige_eligible_26_common.shape)
impact_ige_eligible_26_common.head()

(720, 7)


Unnamed: 0,Participant ID,Collection Date,Visit,Antibody,Component,Value,Unit
80,IMPACT_105670,2016-05-27,26,IgE,rAra h 1,3.94,kUA/L
81,IMPACT_105670,2016-05-27,26,IgE,rAra h 2,8.24,kUA/L
82,IMPACT_105670,2016-05-27,26,IgE,rAra h 3,0.32,kUA/L
83,IMPACT_105670,2016-05-27,26,IgE,rAra h 6,3.29,kUA/L
84,IMPACT_105670,2016-05-27,26,IgG4,rAra h 1,0.52,MG/L


In [56]:
print(impact_serum_eligible_26_common.shape)
impact_serum_eligible_26_common.head()

(459, 6)


Unnamed: 0,Participant ID,Collection Date,Visit,Test Name,Unit,Value
30,IMPACT_102436,2016-05-17,26,Peanut IgE,kU/L,15.3
31,IMPACT_102436,2016-05-17,26,Peanut IgE/Total IgE ratio,Ratio,19.125
32,IMPACT_102436,2016-05-17,26,Peanut IgG4*,mcg/mL,12.2
33,IMPACT_102436,2016-05-17,26,Peanut IgG4/IgE ratio,Ratio,0.332244
34,IMPACT_102436,2016-05-17,26,Total IgE,IU/mL,80.0


# Breaking out components into columns 
in "impact_ige_eligible_26_common"

In [57]:
# dropping IgG4 from Antibody column, only want IgE
# (.copy() so the column assignments in the next cell act on an independent DataFrame)
impact_ige_eligible_26_common = impact_ige_eligible_26_common[impact_ige_eligible_26_common['Antibody']=='IgE'].copy()


In [58]:
# Creating new columns for each component

# Create new columns with initial NaN values
impact_ige_eligible_26_common['Ara h1 (kU/L)'] = np.nan
impact_ige_eligible_26_common['Ara h2 (kU/L)'] = np.nan
impact_ige_eligible_26_common['Ara h3 (kU/L)'] = np.nan
impact_ige_eligible_26_common['Ara h6 (kU/L)'] = np.nan

# Populate the new columns based on conditions
impact_ige_eligible_26_common.loc[impact_ige_eligible_26_common['Component'] == 'rAra h 1', 'Ara h1 (kU/L)'] = impact_ige_eligible_26_common.loc[impact_ige_eligible_26_common['Component'] == 'rAra h 1', 'Value']
impact_ige_eligible_26_common.loc[impact_ige_eligible_26_common['Component'] == 'rAra h 2', 'Ara h2 (kU/L)'] = impact_ige_eligible_26_common.loc[impact_ige_eligible_26_common['Component'] == 'rAra h 2', 'Value']
impact_ige_eligible_26_common.loc[impact_ige_eligible_26_common['Component'] == 'rAra h 3', 'Ara h3 (kU/L)'] = impact_ige_eligible_26_common.loc[impact_ige_eligible_26_common['Component'] == 'rAra h 3', 'Value']
impact_ige_eligible_26_common.loc[impact_ige_eligible_26_common['Component'] == 'rAra h 6', 'Ara h6 (kU/L)'] = impact_ige_eligible_26_common.loc[impact_ige_eligible_26_common['Component'] == 'rAra h 6', 'Value']

# Drop the specified columns
impact_ige_eligible_26_common = impact_ige_eligible_26_common.drop(columns=['Antibody', 'Component', 'Value', 'Unit'])



In [59]:
print(impact_ige_eligible_26_common.shape)
impact_ige_eligible_26_common.head(10)


(360, 7)


Unnamed: 0,Participant ID,Collection Date,Visit,Ara h1 (kU/L),Ara h2 (kU/L),Ara h3 (kU/L),Ara h6 (kU/L)
80,IMPACT_105670,2016-05-27,26,3.94,,,
81,IMPACT_105670,2016-05-27,26,,8.24,,
82,IMPACT_105670,2016-05-27,26,,,0.32,
83,IMPACT_105670,2016-05-27,26,,,,3.29
160,IMPACT_115876,2017-05-10,26,7.45,,,
161,IMPACT_115876,2017-05-10,26,,8.99,,
162,IMPACT_115876,2017-05-10,26,,,0.05,
163,IMPACT_115876,2017-05-10,26,,,,9.81
232,IMPACT_139237,2018-02-22,26,20.3,,,
233,IMPACT_139237,2018-02-22,26,,112.0,,


In [60]:
# Collapsing the four rows per participant into a single row

# Each component column normally holds one non-NaN value per group, and mean() skips NaN,
# so the group mean returns that value
impact_ige_eligible_26_common = impact_ige_eligible_26_common.groupby(['Participant ID', 'Collection Date', 'Visit'], as_index=False).agg({'Ara h1 (kU/L)': 'mean', 'Ara h2 (kU/L)': 'mean', 'Ara h3 (kU/L)': 'mean', 'Ara h6 (kU/L)': 'mean'})


# Reset the index
impact_ige_eligible_26_common.reset_index(drop=True, inplace=True)



In [61]:
print(impact_ige_eligible_26_common.shape)
impact_ige_eligible_26_common.head(20)


(90, 7)


Unnamed: 0,Participant ID,Collection Date,Visit,Ara h1 (kU/L),Ara h2 (kU/L),Ara h3 (kU/L),Ara h6 (kU/L)
0,IMPACT_105670,2016-05-27,26,3.94,8.24,0.32,3.29
1,IMPACT_115876,2017-05-10,26,7.45,8.99,0.05,9.81
2,IMPACT_139237,2018-02-22,26,20.3,112.0,2.16,80.0
3,IMPACT_196748,2017-08-13,26,0.05,0.37,0.05,0.48
4,IMPACT_205488,2016-02-06,26,0.13,1.67,0.05,3.17
5,IMPACT_207729,2017-12-19,26,16.7,12.2,0.16,17.8
6,IMPACT_226208,2016-04-11,26,1.04,22.5,0.23,14.9
7,IMPACT_228754,2017-11-17,26,13.9,15.1,0.22,10.8
8,IMPACT_255737,2016-02-10,26,255.0,215.0,25.5,141.0
9,IMPACT_256280,2017-03-12,26,3.83,17.2,1.69,20.2
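
The long-to-wide reshaping above (one NaN-filled column per component, then a group mean) can also be expressed in a single step with `pivot_table`. This is an alternative sketch on toy data, not the notebook's method. The toy frame reuses the column names from the component dataset, and the resulting columns keep the raw component labels (for example `rAra h 1`) rather than the renamed `Ara h1 (kU/L)` style.

```python
import pandas as pd

toy_long = pd.DataFrame({
    "Participant ID": ["P1"] * 4 + ["P2"] * 4,
    "Collection Date": pd.to_datetime(["2016-05-27"] * 4 + ["2017-05-10"] * 4),
    "Visit": [26] * 8,
    "Component": ["rAra h 1", "rAra h 2", "rAra h 3", "rAra h 6"] * 2,
    "Value": [3.94, 8.24, 0.32, 3.29, 7.45, 8.99, 0.05, 9.81],
})

toy_wide = (
    toy_long
    .pivot_table(index=["Participant ID", "Collection Date", "Visit"],
                 columns="Component", values="Value", aggfunc="mean")
    .reset_index()
)
toy_wide.columns.name = None  # drop the leftover "Component" axis label

print(toy_wide.shape)  # (2, 7)
```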


In [62]:

print(len(impact_ige_eligible_26_common['Participant ID'].unique().tolist()))



90


# Merging "impact_ige_eligible_26_common" with "merged_ad_spt"


In [63]:
merged_ad_spt.head()

Unnamed: 0,Participant ID,Sex (character),Race,Collection Date,Visit,Wheal (mm),OUT26NOI,Age at OFC 26
0,IMPACT_102436,Male,White/Caucasian,2016-05-17,26,4.0,0.0,85.5
1,IMPACT_105670,Female,White/Caucasian,2016-05-27,26,9.5,0.0,74.3
2,IMPACT_115876,Male,Mixed Race,2017-05-10,26,20.5,0.0,64.7
3,IMPACT_139237,Male,Asian,2018-02-22,26,10.0,0.0,75.7
4,IMPACT_196748,Male,White/Caucasian,2017-08-13,26,10.5,0.0,82.9


In [64]:
# merging the ad/spt table with the IgE components
merged_ad_spt_ige = merged_ad_spt.merge(
    impact_ige_eligible_26_common,
    how='outer',  # preserves all rows and adds NaN where one side has no match
    on=['Participant ID', 'Visit', 'Collection Date'],
)


In [65]:
print(merged_ad_spt_ige.shape)
merged_ad_spt_ige.head()

(94, 12)


Unnamed: 0,Participant ID,Sex (character),Race,Collection Date,Visit,Wheal (mm),OUT26NOI,Age at OFC 26,Ara h1 (kU/L),Ara h2 (kU/L),Ara h3 (kU/L),Ara h6 (kU/L)
0,IMPACT_102436,Male,White/Caucasian,2016-05-17,26,4.0,0.0,85.5,,,,
1,IMPACT_105670,Female,White/Caucasian,2016-05-27,26,9.5,0.0,74.3,3.94,8.24,0.32,3.29
2,IMPACT_115876,Male,Mixed Race,2017-05-10,26,20.5,0.0,64.7,7.45,8.99,0.05,9.81
3,IMPACT_139237,Male,Asian,2018-02-22,26,10.0,0.0,75.7,20.3,112.0,2.16,80.0
4,IMPACT_196748,Male,White/Caucasian,2017-08-13,26,10.5,0.0,82.9,0.05,0.37,0.05,0.48


In [66]:
# checking which participants appear in more than one row after the merge

duplicated_ids = merged_ad_spt_ige['Participant ID'].duplicated(keep=False)
duplicated_values = merged_ad_spt_ige[duplicated_ids]

print(duplicated_values['Participant ID'])


32    IMPACT_519208
93    IMPACT_519208
Name: Participant ID, dtype: object
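
Because the outer merge keys on 'Visit' and 'Collection Date' as well as 'Participant ID', a participant whose two sources disagree on one of those keys ends up with two rows. That is the likely cause of the duplicate above, though the notebook does not confirm it. A minimal sketch of how `indicator=True` exposes such rows, using made-up toy frames (the IDs, dates, and values are illustrative, not taken from the IMPACT files):

```python
import pandas as pd

# toy stand-ins for merged_ad_spt and impact_ige_eligible_26_common (values are made up)
left = pd.DataFrame({
    'Participant ID': ['P1', 'P2'],
    'Visit': [26, 26],
    'Collection Date': pd.to_datetime(['2017-01-10', '2017-02-01']),
    'Wheal (mm)': [4.0, 9.5],
})
right = pd.DataFrame({
    'Participant ID': ['P1', 'P2'],
    'Visit': [26, 26],
    'Collection Date': pd.to_datetime(['2017-01-10', '2017-02-03']),  # P2 date differs
    'Ara h2 (kU/L)': [8.24, 0.37],
})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
check = left.merge(right, how='outer',
                   on=['Participant ID', 'Visit', 'Collection Date'],
                   indicator=True)

# rows that matched on only one side are the ones that produce duplicated IDs
unmatched = check[check['_merge'] != 'both']
print(unmatched[['Participant ID', 'Collection Date', '_merge']])
```

Here P1 matches on all three keys and yields one row, while P2 yields one 'left_only' and one 'right_only' row.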


# Breaking out Peanut IgE/Total IgE
- in "impact_serum_eligible_26_common"


In [67]:
# keeping only the 'Peanut IgE' and 'Total IgE' rows of 'Test Name'

# defining list to keep
keep_IgEs = ['Peanut IgE', 'Total IgE']

# .copy() so the column assignments below act on an independent frame (avoids SettingWithCopyWarning)
impact_serum_eligible_26_common = impact_serum_eligible_26_common[impact_serum_eligible_26_common['Test Name'].isin(keep_IgEs)].copy()



In [68]:

impact_serum_eligible_26_common['Test Name'].unique()


array(['Peanut IgE', 'Total IgE'], dtype=object)

In [69]:


print(impact_serum_eligible_26_common.shape)
impact_serum_eligible_26_common.head()

(183, 6)


Unnamed: 0,Participant ID,Collection Date,Visit,Test Name,Unit,Value
30,IMPACT_102436,2016-05-17,26,Peanut IgE,kU/L,15.3
34,IMPACT_102436,2016-05-17,26,Total IgE,IU/mL,80.0
55,IMPACT_105670,2016-05-27,26,Peanut IgE,kU/L,8.92
59,IMPACT_105670,2016-05-27,26,Total IgE,IU/mL,68.0
105,IMPACT_115876,2017-05-10,26,Peanut IgE,kU/L,16.4


In [70]:
# Create new columns with initial NaN values
# note: the 'Unit' column lists Total IgE in IU/mL, while the new column is labelled (kU/L)
impact_serum_eligible_26_common['Peanut IgE (kU/L)'] = np.nan
impact_serum_eligible_26_common['Total IgE (kU/L)'] = np.nan

# Boolean masks for the two test types
is_peanut = impact_serum_eligible_26_common['Test Name'] == 'Peanut IgE'
is_total = impact_serum_eligible_26_common['Test Name'] == 'Total IgE'

# Copy 'Value' into the column that matches each row's test
impact_serum_eligible_26_common.loc[is_peanut, 'Peanut IgE (kU/L)'] = impact_serum_eligible_26_common.loc[is_peanut, 'Value']
impact_serum_eligible_26_common.loc[is_total, 'Total IgE (kU/L)'] = impact_serum_eligible_26_common.loc[is_total, 'Value']

# Drop the long-format columns that are no longer needed
impact_serum_eligible_26_common = impact_serum_eligible_26_common.drop(columns=['Test Name', 'Unit', 'Value'])


In [71]:

len(impact_serum_eligible_26_common['Participant ID'].unique().tolist())


92

In [72]:
print(impact_serum_eligible_26_common.shape) # (183, 5)
impact_serum_eligible_26_common.head()
# note: if all 92 participants had 2 rows (Peanut and Total), we'd have 184 rows, but we have 183


(183, 5)


Unnamed: 0,Participant ID,Collection Date,Visit,Peanut IgE (kU/L),Total IgE (kU/L)
30,IMPACT_102436,2016-05-17,26,15.3,
34,IMPACT_102436,2016-05-17,26,,80.0
55,IMPACT_105670,2016-05-27,26,8.92,
59,IMPACT_105670,2016-05-27,26,,68.0
105,IMPACT_115876,2017-05-10,26,16.4,


In [73]:
# Collapsing the paired Peanut/Total rows into one row per participant, date, and visit

# 'mean' skips NaN, so each group keeps its single non-missing value (true repeat measurements would be averaged)
impact_serum_eligible_26_common = impact_serum_eligible_26_common.groupby(['Participant ID', 'Collection Date', 'Visit'], as_index=False).agg({'Peanut IgE (kU/L)': 'mean', 'Total IgE (kU/L)': 'mean'})

# Reset the index (as_index=False already gives a default index; kept as a safeguard)
impact_serum_eligible_26_common.reset_index(drop=True, inplace=True)
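
The long-to-wide reshaping done across the cells above can also be written as a single pivot. This is only a sketch on a made-up toy frame shaped like the serum table, not a replacement for the notebook's own steps; `long_df` and `wide_df` are illustrative names:

```python
import pandas as pd

# toy long-format frame shaped like impact_serum_eligible_26_common (values are made up)
long_df = pd.DataFrame({
    'Participant ID': ['P1', 'P1', 'P2'],
    'Collection Date': pd.to_datetime(['2016-05-17', '2016-05-17', '2016-05-27']),
    'Visit': [26, 26, 26],
    'Test Name': ['Peanut IgE', 'Total IgE', 'Peanut IgE'],
    'Value': [15.3, 80.0, 8.92],
})

# one row per participant/date/visit, one column per test; repeated measurements are averaged
wide_df = (long_df
           .pivot_table(index=['Participant ID', 'Collection Date', 'Visit'],
                        columns='Test Name', values='Value', aggfunc='mean')
           .rename(columns={'Peanut IgE': 'Peanut IgE (kU/L)',
                            'Total IgE': 'Total IgE (kU/L)'})
           .reset_index())
wide_df.columns.name = None  # drop the leftover 'Test Name' axis label
print(wide_df)
```

A participant with only one of the two tests (P2 here) keeps a NaN in the other column, which matches the behaviour of the groupby approach.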


In [74]:
print(impact_serum_eligible_26_common.shape) # paired rows collapsed -> 92 rows remain, one per participant
print(len(impact_serum_eligible_26_common['Participant ID'].unique().tolist()))
impact_serum_eligible_26_common.head(10)


(92, 5)
92


Unnamed: 0,Participant ID,Collection Date,Visit,Peanut IgE (kU/L),Total IgE (kU/L)
0,IMPACT_102436,2016-05-17,26,15.3,80.0
1,IMPACT_105670,2016-05-27,26,8.92,68.0
2,IMPACT_115876,2017-05-10,26,16.4,1215.0
3,IMPACT_139237,2018-02-22,26,148.0,490.0
4,IMPACT_196748,2017-08-13,26,0.45,52.0
5,IMPACT_205488,2016-02-06,26,3.85,868.0
6,IMPACT_207729,2017-12-19,26,32.0,1338.0
7,IMPACT_226208,2016-04-11,26,20.2,302.0
8,IMPACT_228754,2017-11-17,26,29.7,214.0
9,IMPACT_255737,2016-02-10,26,450.0,1056.0


# Merging "merged_ad_spt_ige" with "impact_serum_eligible_26_common"

- "merged_ad_spt_ige" has 94 rows and "impact_serum_eligible_26_common" has 92 rows

In [75]:
merged_ad_spt_ige.head()

Unnamed: 0,Participant ID,Sex (character),Race,Collection Date,Visit,Wheal (mm),OUT26NOI,Age at OFC 26,Ara h1 (kU/L),Ara h2 (kU/L),Ara h3 (kU/L),Ara h6 (kU/L)
0,IMPACT_102436,Male,White/Caucasian,2016-05-17,26,4.0,0.0,85.5,,,,
1,IMPACT_105670,Female,White/Caucasian,2016-05-27,26,9.5,0.0,74.3,3.94,8.24,0.32,3.29
2,IMPACT_115876,Male,Mixed Race,2017-05-10,26,20.5,0.0,64.7,7.45,8.99,0.05,9.81
3,IMPACT_139237,Male,Asian,2018-02-22,26,10.0,0.0,75.7,20.3,112.0,2.16,80.0
4,IMPACT_196748,Male,White/Caucasian,2017-08-13,26,10.5,0.0,82.9,0.05,0.37,0.05,0.48


In [76]:
# merging the ad + spt + IgE component table with the serum (Peanut/Total IgE) table
merged_ad_spt_ige_serum = merged_ad_spt_ige.merge(impact_serum_eligible_26_common,
                                                  how='outer',  # keeps all rows from both tables; unmatched fields become NaN
                                                  on=['Participant ID', 'Visit', 'Collection Date'])


In [77]:
print(merged_ad_spt_ige_serum.shape)
print(len(merged_ad_spt_ige_serum['Participant ID'].unique().tolist()))
merged_ad_spt_ige_serum.head()

(98, 14)
93


Unnamed: 0,Participant ID,Sex (character),Race,Collection Date,Visit,Wheal (mm),OUT26NOI,Age at OFC 26,Ara h1 (kU/L),Ara h2 (kU/L),Ara h3 (kU/L),Ara h6 (kU/L),Peanut IgE (kU/L),Total IgE (kU/L)
0,IMPACT_102436,Male,White/Caucasian,2016-05-17,26,4.0,0.0,85.5,,,,,15.3,80.0
1,IMPACT_105670,Female,White/Caucasian,2016-05-27,26,9.5,0.0,74.3,3.94,8.24,0.32,3.29,8.92,68.0
2,IMPACT_115876,Male,Mixed Race,2017-05-10,26,20.5,0.0,64.7,7.45,8.99,0.05,9.81,16.4,1215.0
3,IMPACT_139237,Male,Asian,2018-02-22,26,10.0,0.0,75.7,20.3,112.0,2.16,80.0,148.0,490.0
4,IMPACT_196748,Male,White/Caucasian,2017-08-13,26,10.5,0.0,82.9,0.05,0.37,0.05,0.48,0.45,52.0


# Final transformation of 'merged_ad_spt_ige_serum' to match LEAP columns 

In [78]:
impact_all_26 = merged_ad_spt_ige_serum.copy()  # copy so later edits leave the merged frame untouched

In [79]:
print(impact_all_26.shape)
impact_all_26.head()

(98, 14)


Unnamed: 0,Participant ID,Sex (character),Race,Collection Date,Visit,Wheal (mm),OUT26NOI,Age at OFC 26,Ara h1 (kU/L),Ara h2 (kU/L),Ara h3 (kU/L),Ara h6 (kU/L),Peanut IgE (kU/L),Total IgE (kU/L)
0,IMPACT_102436,Male,White/Caucasian,2016-05-17,26,4.0,0.0,85.5,,,,,15.3,80.0
1,IMPACT_105670,Female,White/Caucasian,2016-05-27,26,9.5,0.0,74.3,3.94,8.24,0.32,3.29,8.92,68.0
2,IMPACT_115876,Male,Mixed Race,2017-05-10,26,20.5,0.0,64.7,7.45,8.99,0.05,9.81,16.4,1215.0
3,IMPACT_139237,Male,Asian,2018-02-22,26,10.0,0.0,75.7,20.3,112.0,2.16,80.0,148.0,490.0
4,IMPACT_196748,Male,White/Caucasian,2017-08-13,26,10.5,0.0,82.9,0.05,0.37,0.05,0.48,0.45,52.0


# Renaming columns 

In [80]:
# rename 'OUT26NOI' to 'OFC Pass'
impact_all_26 = impact_all_26.rename(columns={'OUT26NOI': 'OFC Pass'})

# rename Age
impact_all_26 = impact_all_26.rename(columns={'Age at OFC 26': 'Age'})


In [81]:
print(impact_all_26.shape)
impact_all_26.head()

(98, 14)


Unnamed: 0,Participant ID,Sex (character),Race,Collection Date,Visit,Wheal (mm),OFC Pass,Age,Ara h1 (kU/L),Ara h2 (kU/L),Ara h3 (kU/L),Ara h6 (kU/L),Peanut IgE (kU/L),Total IgE (kU/L)
0,IMPACT_102436,Male,White/Caucasian,2016-05-17,26,4.0,0.0,85.5,,,,,15.3,80.0
1,IMPACT_105670,Female,White/Caucasian,2016-05-27,26,9.5,0.0,74.3,3.94,8.24,0.32,3.29,8.92,68.0
2,IMPACT_115876,Male,Mixed Race,2017-05-10,26,20.5,0.0,64.7,7.45,8.99,0.05,9.81,16.4,1215.0
3,IMPACT_139237,Male,Asian,2018-02-22,26,10.0,0.0,75.7,20.3,112.0,2.16,80.0,148.0,490.0
4,IMPACT_196748,Male,White/Caucasian,2017-08-13,26,10.5,0.0,82.9,0.05,0.37,0.05,0.48,0.45,52.0


#### Creating Binary column from Child's Sex -> Male

note: the original values in the raw IMPACT dataset were:  
'Male', 'Female'  

in this new encoding:  
1=male, 0=female

In [82]:
# function to map the values in 'Sex (character)' to integers
def encode_sex(sex):
    if sex == "Male":
        return 1
    elif sex == "Female":
        return 0
    else:
        return None  # Return None for any other values

# create the new 'Male' column
impact_all_26['Male'] = impact_all_26['Sex (character)'].apply(encode_sex)

# drop original column 
impact_all_26 = impact_all_26.drop(columns=['Sex (character)'])
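
The same encoding can be expressed without a helper function by passing a dictionary to `Series.map`. A small sketch on a toy series (the values are made up):

```python
import pandas as pd

# toy column with the same labels as 'Sex (character)' plus a missing value
sex = pd.Series(['Male', 'Female', None, 'Male'], name='Sex (character)')

# labels absent from the dictionary (including missing values) become NaN
male = sex.map({'Male': 1, 'Female': 0})
print(male.tolist())
```

As with `encode_sex`, any value other than 'Male' or 'Female' ends up missing, which is why the 'Male' column is stored as float.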


In [83]:
print(impact_all_26.shape)
impact_all_26.head()

(98, 14)


Unnamed: 0,Participant ID,Race,Collection Date,Visit,Wheal (mm),OFC Pass,Age,Ara h1 (kU/L),Ara h2 (kU/L),Ara h3 (kU/L),Ara h6 (kU/L),Peanut IgE (kU/L),Total IgE (kU/L),Male
0,IMPACT_102436,White/Caucasian,2016-05-17,26,4.0,0.0,85.5,,,,,15.3,80.0,1.0
1,IMPACT_105670,White/Caucasian,2016-05-27,26,9.5,0.0,74.3,3.94,8.24,0.32,3.29,8.92,68.0,0.0
2,IMPACT_115876,Mixed Race,2017-05-10,26,20.5,0.0,64.7,7.45,8.99,0.05,9.81,16.4,1215.0,1.0
3,IMPACT_139237,Asian,2018-02-22,26,10.0,0.0,75.7,20.3,112.0,2.16,80.0,148.0,490.0,1.0
4,IMPACT_196748,White/Caucasian,2017-08-13,26,10.5,0.0,82.9,0.05,0.37,0.05,0.48,0.45,52.0,1.0


In [84]:
impact_all_26['Race'].unique()

array(['White/Caucasian', 'Mixed Race', 'Asian',
       'Black or African American', nan], dtype=object)

# One Hot Encoding for Race 
note: the original values for "Race" in the raw IMPACT dataset are as follows:
- 'White/Caucasian', 
- 'Mixed Race', 
- 'Asian',
- 'Black or African American'

In [85]:
# Create dummy variables for "Race" (rows with a missing race get 0 in every dummy column)
race_dummies = pd.get_dummies(impact_all_26['Race'])

# Define the mapping between raw values and output column names
race_mapping = {
    'White/Caucasian': 'White',
    'Black or African American': 'Black',
    'Asian': 'Asian',
    'Mixed Race': 'Mixed'
}

# iterate over the mapping and update the dataframe columns
for value, column_name in race_mapping.items():
    if value in race_dummies.columns:
        impact_all_26[column_name] = race_dummies[value].fillna(0).astype(int)
    else:
        impact_all_26[column_name] = 0

# get_dummies has no NaN column by default, so flag missing race directly
impact_all_26['Unknown'] = impact_all_26['Race'].isna().astype(int)

# drop original "Race" column
impact_all_26 = impact_all_26.drop('Race', axis=1)
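
For reference, `pd.get_dummies` leaves out missing values unless `dummy_na=True` is passed, so an 'Unknown' indicator has to come either from that option or from `isna()`. A small sketch on a toy series (the values are made up):

```python
import numpy as np
import pandas as pd

race = pd.Series(['Asian', np.nan, 'Mixed Race'], name='Race')

default_dummies = pd.get_dummies(race)         # no column for NaN
with_na = pd.get_dummies(race, dummy_na=True)  # adds a trailing NaN column

print(list(default_dummies.columns))
print(with_na.shape)

# a missing-race flag can also be derived directly from the original column
unknown = race.isna().astype(int)
print(unknown.tolist())
```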


In [86]:
print(impact_all_26.shape)
impact_all_26.head()

(98, 18)


Unnamed: 0,Participant ID,Collection Date,Visit,Wheal (mm),OFC Pass,Age,Ara h1 (kU/L),Ara h2 (kU/L),Ara h3 (kU/L),Ara h6 (kU/L),Peanut IgE (kU/L),Total IgE (kU/L),Male,White,Black,Asian,Mixed,Unknown
0,IMPACT_102436,2016-05-17,26,4.0,0.0,85.5,,,,,15.3,80.0,1.0,1,0,0,0,0
1,IMPACT_105670,2016-05-27,26,9.5,0.0,74.3,3.94,8.24,0.32,3.29,8.92,68.0,0.0,1,0,0,0,0
2,IMPACT_115876,2017-05-10,26,20.5,0.0,64.7,7.45,8.99,0.05,9.81,16.4,1215.0,1.0,0,0,0,1,0
3,IMPACT_139237,2018-02-22,26,10.0,0.0,75.7,20.3,112.0,2.16,80.0,148.0,490.0,1.0,0,0,1,0,0
4,IMPACT_196748,2017-08-13,26,10.5,0.0,82.9,0.05,0.37,0.05,0.48,0.45,52.0,1.0,1,0,0,0,0


In [87]:
# add the missing columns (Flare, h8, h9, Other) and fill with NaNs so the dataset matches LEAP
impact_all_26['Flare (mm)'] = np.nan
impact_all_26['Ara h8 (kU/L)'] = np.nan
impact_all_26['Ara h9 (kU/L)'] = np.nan
impact_all_26['Other'] = np.nan

In [88]:
print(impact_all_26.shape)
impact_all_26.head()

(98, 22)


Unnamed: 0,Participant ID,Collection Date,Visit,Wheal (mm),OFC Pass,Age,Ara h1 (kU/L),Ara h2 (kU/L),Ara h3 (kU/L),Ara h6 (kU/L),...,Male,White,Black,Asian,Mixed,Unknown,Flare (mm),Ara h8 (kU/L),Ara h9 (kU/L),Other
0,IMPACT_102436,2016-05-17,26,4.0,0.0,85.5,,,,,...,1.0,1,0,0,0,0,,,,
1,IMPACT_105670,2016-05-27,26,9.5,0.0,74.3,3.94,8.24,0.32,3.29,...,0.0,1,0,0,0,0,,,,
2,IMPACT_115876,2017-05-10,26,20.5,0.0,64.7,7.45,8.99,0.05,9.81,...,1.0,0,0,0,1,0,,,,
3,IMPACT_139237,2018-02-22,26,10.0,0.0,75.7,20.3,112.0,2.16,80.0,...,1.0,0,0,1,0,0,,,,
4,IMPACT_196748,2017-08-13,26,10.5,0.0,82.9,0.05,0.37,0.05,0.48,...,1.0,1,0,0,0,0,,,,


In [89]:
# reorganizing columns 

print(impact_all_26.columns)

Index(['Participant ID', 'Collection Date', 'Visit', 'Wheal (mm)', 'OFC Pass',
       'Age', 'Ara h1 (kU/L)', 'Ara h2 (kU/L)', 'Ara h3 (kU/L)',
       'Ara h6 (kU/L)', 'Peanut IgE (kU/L)', 'Total IgE (kU/L)', 'Male',
       'White', 'Black', 'Asian', 'Mixed', 'Unknown', 'Flare (mm)',
       'Ara h8 (kU/L)', 'Ara h9 (kU/L)', 'Other'],
      dtype='object')


In [90]:
print(impact_all_26.shape)
impact_all_26.head()

(98, 22)


Unnamed: 0,Participant ID,Collection Date,Visit,Wheal (mm),OFC Pass,Age,Ara h1 (kU/L),Ara h2 (kU/L),Ara h3 (kU/L),Ara h6 (kU/L),...,Male,White,Black,Asian,Mixed,Unknown,Flare (mm),Ara h8 (kU/L),Ara h9 (kU/L),Other
0,IMPACT_102436,2016-05-17,26,4.0,0.0,85.5,,,,,...,1.0,1,0,0,0,0,,,,
1,IMPACT_105670,2016-05-27,26,9.5,0.0,74.3,3.94,8.24,0.32,3.29,...,0.0,1,0,0,0,0,,,,
2,IMPACT_115876,2017-05-10,26,20.5,0.0,64.7,7.45,8.99,0.05,9.81,...,1.0,0,0,0,1,0,,,,
3,IMPACT_139237,2018-02-22,26,10.0,0.0,75.7,20.3,112.0,2.16,80.0,...,1.0,0,0,1,0,0,,,,
4,IMPACT_196748,2017-08-13,26,10.5,0.0,82.9,0.05,0.37,0.05,0.48,...,1.0,1,0,0,0,0,,,,


In [91]:
# Define the desired column order
desired_order = ['Participant ID',
                 'Collection Date',
                 'Visit',
                 "Age",
                 "Male",
                 "White",
                 "Black",
                 "Asian",
                 "Other",
                 "Mixed",
                 "Unknown",
                 "Wheal (mm)",
                 "Flare (mm)",
                 "Total IgE (kU/L)",
                 "Peanut IgE (kU/L)",
                 "Ara h1 (kU/L)",
                 "Ara h2 (kU/L)",
                 "Ara h3 (kU/L)",
                 "Ara h6 (kUA/L)",
                 "Ara h8 (kU/L)",
                 "Ara h9 (kU/L)",
                 "OFC Pass"]

# 'Ara h6' is still labelled '(kU/L)' in the frame; rename it to the '(kUA/L)' label used in desired_order,
# otherwise reindex creates an empty 'Ara h6 (kUA/L)' column and the measured values are dropped
impact_all_26 = impact_all_26.rename(columns={'Ara h6 (kU/L)': 'Ara h6 (kUA/L)'})

# Reorder the columns using reindex
impact_all_26 = impact_all_26.reindex(columns=desired_order)


In [92]:
print(impact_all_26.shape)
impact_all_26.head()

(98, 22)


Unnamed: 0,Participant ID,Collection Date,Visit,Age,Male,White,Black,Asian,Other,Mixed,...,Flare (mm),Total IgE (kU/L),Peanut IgE (kU/L),Ara h1 (kU/L),Ara h2 (kU/L),Ara h3 (kU/L),Ara h6 (kUA/L),Ara h8 (kU/L),Ara h9 (kU/L),OFC Pass
0,IMPACT_102436,2016-05-17,26,85.5,1.0,1,0,0,,0,...,,80.0,15.3,,,,,,,0.0
1,IMPACT_105670,2016-05-27,26,74.3,0.0,1,0,0,,0,...,,68.0,8.92,3.94,8.24,0.32,3.29,,,0.0
2,IMPACT_115876,2017-05-10,26,64.7,1.0,0,0,0,,1,...,,1215.0,16.4,7.45,8.99,0.05,9.81,,,0.0
3,IMPACT_139237,2018-02-22,26,75.7,1.0,0,0,1,,0,...,,490.0,148.0,20.3,112.0,2.16,80.0,,,0.0
4,IMPACT_196748,2017-08-13,26,82.9,1.0,1,0,0,,0,...,,52.0,0.45,0.05,0.37,0.05,0.48,,,0.0
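
`reindex` does not raise when a requested label is absent from the frame; it adds that column filled with NaN. A hedged sketch of a pre-check that lists such labels before reordering, on a toy frame (`df`, `desired`, and `missing` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'Ara h3 (kU/L)': [0.32], 'Ara h6 (kU/L)': [3.29]})
desired = ['Ara h3 (kU/L)', 'Ara h6 (kUA/L)']

# labels requested but not present would come back as all-NaN columns
missing = [col for col in desired if col not in df.columns]
print(missing)

reordered = df.reindex(columns=desired)
print(reordered.isna().all())
```

Running a check like this before the reorder makes a label mismatch visible instead of silently producing an empty column.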


# Final export for OFC 26
- manually merged rows for data collected on different dates (keeping the later date)
- removed the participants who did not have OFC results
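
The manual steps listed above could also be scripted. A minimal sketch, assuming that "keeping the later date" means retaining the latest 'Collection Date' while filling gaps from the earlier row, and that a missing 'OFC Pass' marks a participant without OFC results (both are assumptions; the toy values are made up):

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    'Participant ID': ['P1', 'P1', 'P2'],
    'Collection Date': pd.to_datetime(['2017-01-10', '2017-01-12', '2017-02-01']),
    'Wheal (mm)': [4.0, np.nan, 9.5],
    'Peanut IgE (kU/L)': [np.nan, 15.3, 8.9],
    'OFC Pass': [0.0, np.nan, np.nan],
})

# sort so the later date comes last, then take the last non-null value per column
collapsed = (raw.sort_values('Collection Date')
                .groupby('Participant ID', as_index=False)
                .last())

# drop participants without an OFC result
collapsed = collapsed.dropna(subset=['OFC Pass'])
print(collapsed)
```

`GroupBy.last()` skips missing values, so P1's two partial rows combine into one complete row dated 2017-01-12, and P2 is removed because it has no OFC result.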
        

In [93]:
# commenting out to prevent export again 

# exporting 
# impact_all_26.to_excel('Data/IMPACT_Study/impact_all_26_raw.xlsx', index=False)


# Final import of OFC 26 clean data for IMPACT study 

In [95]:
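# note: in the rows shown below, 'Ara h6 (kUA/L)' is empty in this file even though the merged tables above contain Ara h6 values
impact_all_26_clean = pd.read_excel("Data/IMPACT_Study/impact_all_26_clean_final.xlsx")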


In [96]:
impact_all_26_clean.head()

Unnamed: 0,Participant ID,Collection Date,Visit,Age,Male,White,Black,Asian,Other,Mixed,...,Flare (mm),Total IgE (kU/L),Peanut IgE (kU/L),Ara h1 (kU/L),Ara h2 (kU/L),Ara h3 (kU/L),Ara h6 (kUA/L),Ara h8 (kU/L),Ara h9 (kU/L),OFC Pass
0,IMPACT_102436,2016-05-17,26,85.5,1,1,0,0,,0,...,,80.0,15.3,,,,,,,0
1,IMPACT_105670,2016-05-27,26,74.3,0,1,0,0,,0,...,,68.0,8.92,3.94,8.24,0.32,,,,0
2,IMPACT_115876,2017-05-10,26,64.7,1,0,0,0,,1,...,,1215.0,16.4,7.45,8.99,0.05,,,,0
3,IMPACT_139237,2018-02-22,26,75.7,1,0,0,1,,0,...,,490.0,148.0,20.3,112.0,2.16,,,,0
4,IMPACT_196748,2017-08-13,26,82.9,1,1,0,0,,0,...,,52.0,0.45,0.05,0.37,0.05,,,,0


In [97]:
impact_all_26_clean.shape

(93, 22)

In [99]:
impact_all_26_clean['Participant ID'].nunique()

93