In [1]:
import pandas as pd
import numpy as np

In [2]:
impact_ad = pd.read_excel("Data/IMPACT_Study/ADSTART0_2023-05-25_06-46-11.xlsx")
impact_serum = pd.read_excel("Data/IMPACT_Study/Serum antibody_2023-05-25_06-32-15.xlsx")


# About Data Sets

#### ADSTART0_2023-05-25_06-46-11 
- Overview of participant data 

#### IgE_IgG4_component_2023-05-25_06-30-40
- IgE levels in this dataset  

Open question: what are these result columns for?  

- OUT24ITT	
- OUT24NOI	
- OUT26ITT	
- OUT26NOI

#### Serum antibody_2023-05-25_06-32-15
Candidate columns for OFC results:
- "Passed Visit 24 OFC for ITT"
- "Passed Visit 24 OFC No Imputation"	
- "Passed Visit 26 OFC for ITT"
- "Passed Visit 26 OFC No Imputation"

#### Skin Prick Test_2023-05-25_06-31-04
- "Wheal (mm)"

#### BAT data_2023-05-25_06-31-20
- Stands for "basophil activation test (BAT)"
- No known use case for this model 

---

# Acronyms 
- Intent-to-treat (ITT)
- Oral Immunotherapy (OIT)
- Initial Dose Escalation (IDE)

# About Study 
- taken from 2022 Lancet_IMPACT.pdf (pdf page 3)  

Children aged 12 months or older and younger than 
48 months were screened for inclusion in the study. 

__Inclusion criteria__ included the following: a clinical 
history of peanut allergy or avoidance without ever 
having eaten peanut, peanut-specific IgE levels of 5 kUA/L 
or higher, a skin prick test (SPT) wheal size greater than 
that of saline control by 3 mm or more, and a positive 
reaction to a cumulative dose of 500 mg or less of peanut 
in a double-blind, placebo-controlled food challenge 
(DBPCFC).   

__Exclusion criteria__ included a history of 
severe anaphylaxis with hypotension to peanut, more 
than mild asthma or uncontrolled asthma, uncontrolled 
atopic dermatitis, and eosinophilic gastrointestinal 
disease (the full list of exclusion criteria is presented in 
the appendix p 2).

# Study Design

This is a randomized, double-blind, placebo-controlled, multi-center study comparing peanut oral immunotherapy to placebo. Eligible participants with peanut allergy will be randomly assigned to receive either peanut OIT or placebo for 134 weeks followed by peanut avoidance for 26 weeks.  

An initial blinded oral food challenge (OFC) to 1 g of peanut flour (500 mg peanut protein) will be conducted. Participants must have a clinical reaction during this blinded OFC to initiate study dosing. After the initial blinded OFC, the study design includes the following:  

__Initial Dose Escalation:__ This will occur on a single day in which multiple doses are given. Peanut or placebo dosing will be given incrementally and increase every 15-30 minutes until a dose of 12 mg peanut flour (6 mg peanut protein) or placebo flour is given. The first four doses will be administered as a peanut flour extract of 0.1 to 0.8 mg peanut protein, which is 10 to 80 microliters peanut flour extract, or placebo flour extract, and the last three doses will be given as 3 to 12 mg peanut flour (1.5 to 6 mg peanut protein) or placebo flour. Participants must tolerate a dose of at least 3 mg peanut flour (1.5 mg peanut protein) or placebo flour to remain in the study.  

__Build-up:__ After the initial dose escalation day, the participant will return to the research unit the next morning for an observed dose administration of the highest tolerated dose from the initial escalation day. The participant will then continue on the daily OIT dosing at home and return to the research unit every 2 weeks for a dose escalation. The dosing escalations will be consistent with previous similar OIT studies.
  
Participants who do not reach the 4000 mg peanut flour (2000 mg peanut protein) or placebo flour dose during the build-up phase may enter maintenance phase at their highest tolerated dose, which must be at least 500 mg peanut flour (250 mg peanut protein) or placebo flour.  

The build-up phase will comprise 30 weeks.  

__Maintenance:__ The participant will continue on daily OIT with return visits every 13 weeks. At the end of this phase the participant will undergo a blinded OFC to 10 g peanut flour (5 g peanut protein).  
This phase will comprise 104 weeks.  

__Avoidance:__ In this final phase participants stop OIT and will avoid peanut consumption. They will be seen 2 weeks and 26 weeks after initiating this phase. At the completion of this phase participants will have a final blinded OFC to 10 g peanut flour (5 g peanut protein). Participants who do not have a clinical reaction to the challenge will receive an Open Food Challenge (OpFC).  

Avoidance will comprise 26 weeks.  

__Post-challenge:__ If participants do not have a clinical reaction during the OpFC at the end of avoidance, they will be allowed to consume peanut and will have one visit which will include peripheral blood sampling for mechanistic assays assessments.  

Post-challenge will comprise 2 weeks.  

# Exploring where and how OFCs are captured
Looking for 3 OFC tests in total

### 1st OFC: According to Study Design in protocol: 
-  Initial reaction: "An initial blinded oral food challenge (OFC) to 1 g of peanut flour (500 mg peanut protein) will be conducted. Participants must have a clinical reaction during this blinded OFC to initiate study dosing." (IDE)
- This was for 0.5 g peanut protein (1 g peanut flour)

### 2nd & 3rd OFC: According to Schedule of Assessments: Appendix 2:
- 5 g Oral Food Challenge performed at visit 24/week 134 (end of the Maintenance phase) and at visit 26/week 160 (end of the Avoidance phase)
- this was at or below 5 g peanut protein (10 g peanut flour)
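
The week numbers quoted for the two later OFCs follow from the phase lengths in the Study Design section. A minimal sketch of that arithmetic (the dictionary and function names are made up for illustration):

```python
# Phase lengths in weeks, as stated in the Study Design section.
PHASE_WEEKS = {
    "build_up": 30,
    "maintenance": 104,
    "avoidance": 26,
    "post_challenge": 2,
}


def phase_end_weeks(phase_weeks):
    """Return the cumulative study week at which each phase ends."""
    ends = {}
    total = 0
    for phase, weeks in phase_weeks.items():
        total += weeks
        ends[phase] = total
    return ends


ends = phase_end_weeks(PHASE_WEEKS)
print(ends["maintenance"])  # week of the visit 24 OFC
print(ends["avoidance"])    # week of the visit 26 OFC
```

Build-up (30) plus maintenance (104) gives week 134, and adding avoidance (26) gives week 160, which matches the visit 24 and visit 26 labels in the Schedule of Assessments.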



In [3]:
impact_ad.head()

Unnamed: 0,Participant ID,Visit,Study Identifier,Date of Screening Visit,Consent for Genetic research?,Consent for Non-genetic research?,Date of Informed Consent,Protocol Version,Met ALL Inclusion Criteria?,Met ANY Exclusion Criteria?,...,Completed Initial Dose Escalation,Completed Build-Up Treatment,Completed Maint Trt and Attended V24,Attended V25,Attended V26,Attended V27,Completed Study Assessments Numeric,Age at Screening (years) Not Rounded,New Discontinuation,New Termination
0,IMPACT_101655,,ITN050AD,2012-10-11,Yes,Yes,2012-10-11,v2.0,Yes,No,...,Yes,Yes,No,No,No,No,0,3.8,,
1,IMPACT_102436,,ITN050AD,2013-03-14,Yes,No,2013-03-14,v2.0,Yes,No,...,Yes,Yes,Yes,Yes,Yes,No,1,3.9,,
2,IMPACT_105670,,ITN050AD,2013-04-04,Yes,Yes,2013-04-04,v2.0,Yes,No,...,Yes,Yes,Yes,Yes,Yes,No,1,3.0,,
3,IMPACT_112496,,ITN050AD,2013-05-05,Yes,Yes,2013-05-05,v2.0,No,No,...,No,No,No,No,No,No,0,3.2,,
4,IMPACT_113135,,ITN050AD,2013-11-25,Yes,Yes,2013-11-25,v2.0,Yes,No,...,Yes,Yes,Yes,Yes,Yes,No,0,2.6,,


In [4]:
impact_ad.columns.tolist()

['Participant ID',
 'Visit',
 'Study Identifier',
 'Date of Screening Visit',
 'Consent for Genetic research?',
 'Consent for Non-genetic research?',
 'Date of Informed Consent',
 'Protocol Version',
 'Met ALL Inclusion Criteria?',
 'Met ANY Exclusion Criteria?',
 'Screen Failure',
 'Date of Randomization from RhoRAND',
 'Initial Dose Escalation Date',
 'Randomized',
 'Date of Enrollment',
 'Last Phase from Scheduled Visits',
 'Post-Randomization Assessment',
 'Last Dose',
 'Last Dose Date',
 'Intent to Treat Sample',
 'Safety Sample',
 'Completed Assigned Study Assessments',
 'Treatment Discontinuation Date',
 'Last Study Visit Date from TERM Form',
 'Date of First Study Therapy',
 'Date of Last Study Therapy',
 'Completed Study Protocol',
 'Study Termination Date',
 'Study Termination Reason',
 'Study Status',
 'Age at screening (years)',
 'Sex (character)',
 'Race',
 'Race Listing',
 'Ethnicity',
 'Data Snapshot Date',
 'Planned Treatment',
 'Planned Treatment.1',
 'Last Visit Numbe

# Filtering participants that were study eligible (n=144)
- Total participants enrolled = 209  
- Total participants eligible = 144
- Total excluded = 65 (63 who were not randomized, e.g. no failed screening OFC: 209 - 146 = 63, + 2 who could not complete the IDE)

#### filtering solution (this will help filter initial screening OFC results too)
- filter ADSTART0 by impact_ad['Randomized'] == "Yes" (keeps the 146 randomized participants)
- filter ADSTART0 by "Study Termination Reason" != "Inability to reach 3 mg peanut/placebo flour (1.5 mg peanut/placebo protein) during the initial dose escalation" (removes the two who terminated after being randomized because they could not reach the minimum IDE dose: 146 - 2 -> n=144)

Additional filter to get just those initial screening OFC rows:
- filter by visit -2 or -1 (this is the initial screening visit number)

Explanation: There doesn't seem to be an OFC column for the initial screening.   
However, the protocol criteria can be used to determine who failed their first OFC at study intake, and the OFC fail column can then be assigned manually.   
These would be the 144 participants who were eligible for the study, on rows with a visit value of -2 or -1.
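
One possible reading of this manual assignment, sketched on toy data: mark only the screening rows (visit -2 or -1) of eligible participants with 0 (did not pass), and leave every other row as NaN. The toy frame, the `eligible_ids` list, and the helper name `flag_screening_ofc` are hypothetical; the column name 'Passed Visit -2 OFC' is the one introduced later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a long-format table with one row per participant visit.
toy_visits = pd.DataFrame({
    "Participant ID": ["P1", "P1", "P2", "P2", "P3"],
    "Visit": [-2, 24, -1, 26, -2],
})
eligible_ids = ["P1", "P2"]  # hypothetical: IDs that survived both filters


def flag_screening_ofc(df, ids, screening_visits=(-2, -1)):
    """Add 'Passed Visit -2 OFC': 0 on screening rows of eligible IDs, NaN elsewhere."""
    out = df.copy()
    is_screening = (
        out["Visit"].isin(screening_visits) & out["Participant ID"].isin(ids)
    )
    out["Passed Visit -2 OFC"] = np.where(is_screening, 0.0, np.nan)
    return out


flagged = flag_screening_ofc(toy_visits, eligible_ids)
print(flagged)
```

Working on a copy keeps the original frame untouched, and using NaN for non-screening rows keeps "not applicable" distinct from "did not pass".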

In [5]:
# Using 'Randomized' to count the 146 that passed the criteria and were randomly assigned for the study
(impact_ad['Randomized']=="Yes").sum() #output = 146

146

In [6]:
# filtering out two more participants that failed during the initial dose escalation (IDE)
# These are the two that terminated after being randomized 146-2 -> n=144

(impact_ad['Study Termination Reason']=="Inability to reach 3 mg peanut/placebo flour (1.5 mg peanut/placebo protein) during the initial dose escalation").sum()
#output = 2

2

# Creating a new DF for just the eligible 144 randomly assigned participants 
- filtering out unnecessary columns 

In [7]:
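# Termination reason recorded for the two participants who could not reach the minimum IDE dose
ide_failure_reason = 'Inability to reach 3 mg peanut/placebo flour (1.5 mg peanut/placebo protein) during the initial dose escalation'

# Keep randomized participants whose termination reason is not the IDE failure
impact_ad_eligible = impact_ad.loc[
    (impact_ad['Randomized'] == 'Yes')
    & (impact_ad['Study Termination Reason'] != ide_failure_reason)
]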

impact_ad_cols_to_keep = [
     'Participant ID',
     'Visit', # -2 to -1 is the participant screening 
     'Date of Screening Visit',
     'Randomized', # Use this to filter out participants that did not pass the study screening
     'Study Status', # Values: 'Discontinued Therapy', 'Completed Study', 'Enrolled but not Randomized', 'Early Termination', 'Screen Failure', 'Screened but not Enrolled'
     'Completed Study Protocol', # Values: 'No', 'Yes'
     'Sex (character)',
     'Race',
     'Race Listing',
     'Ethnicity',
     'Completed Study Assessments Numeric', #1 for yes/ 0 for no (means attended all visits up to 26, excluding 27)
     'Age at screening (years)',
     'Age at Screening (years) Not Rounded'
]

impact_ad_eligible_filtered = impact_ad_eligible[impact_ad_cols_to_keep]



In [8]:
impact_ad_eligible_filtered.shape
# correct number of participants now

(144, 13)

In [10]:
impact_ad_eligible_filtered.head()

Unnamed: 0,Participant ID,Visit,Date of Screening Visit,Randomized,Study Status,Completed Study Protocol,Sex (character),Race,Race Listing,Ethnicity,Completed Study Assessments Numeric,Age at screening (years),Age at Screening (years) Not Rounded
0,IMPACT_101655,,2012-10-11,Yes,Discontinued Therapy,No,Male,White/Caucasian,White/Caucasian,Not Hispanic or Latino,0,4,3.8
1,IMPACT_102436,,2013-03-14,Yes,Completed Study,Yes,Male,White/Caucasian,White/Caucasian,Not Hispanic or Latino,1,4,3.9
2,IMPACT_105670,,2013-04-04,Yes,Completed Study,Yes,Female,White/Caucasian,White/Caucasian,Not Hispanic or Latino,1,3,3.0
4,IMPACT_113135,,2013-11-25,Yes,Early Termination,No,Female,White/Caucasian,White/Caucasian,Not Hispanic or Latino,0,3,2.6
5,IMPACT_115876,,2014-03-19,Yes,Completed Study,Yes,Male,Mixed Race,Mixed Race,Not Hispanic or Latino,1,2,2.2


# Getting a list of the 144 participant IDs 

In [11]:
participant_ids = impact_ad_eligible_filtered['Participant ID'].unique()
print(len(participant_ids)) #output 144 unique participant IDs

144


---
# Serum Dataset filtering & IgE_IgG4_component Filtering

Both data sets carry the same OFC data under different labels.  
These columns in Serum:

- 'Passed Visit 24 OFC for ITT',
- 'Passed Visit 24 OFC No Imputation',
- 'Passed Visit 26 OFC for ITT',
- 'Passed Visit 26 OFC No Imputation',
 
are the same data as these columns in IgE_IgG4_component

- OUT24ITT	
- OUT24NOI	
- OUT26ITT	
- OUT26NOI
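
A hedged sketch of how this equivalence could be checked once both files are loaded. The toy frames below are invented, and the real files may encode outcomes differently (for example Yes/No versus 1/0), in which case the values would need to be mapped to a common coding first. The helper names are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames standing in for the Serum and IgE_IgG4_component data.
serum_toy = pd.DataFrame({
    "Participant ID": ["P1", "P1", "P2", "P3"],
    "Passed Visit 24 OFC No Imputation": ["Yes", "Yes", "No", None],
})
component_toy = pd.DataFrame({
    "Participant ID": ["P1", "P2", "P2", "P3"],
    "OUT24NOI": ["Yes", "No", "No", None],
})


def per_participant(df, col):
    """Collapse a long table to one value per participant (first non-null)."""
    return df.groupby("Participant ID")[col].first()


def columns_agree(left, left_col, right, right_col):
    """True when both tables hold the same per-participant values (NaN matches NaN)."""
    a = per_participant(left, left_col)
    b = per_participant(right, right_col)
    a, b = a.align(b, join="outer")
    return bool(((a == b) | (a.isna() & b.isna())).all())


same_24 = columns_agree(
    serum_toy, "Passed Visit 24 OFC No Imputation", component_toy, "OUT24NOI"
)
print(same_24)
```

The same call could be repeated for the other three column pairs; the outer alignment means a participant present in only one table counts as a mismatch unless both sides are missing.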

---

Will keep only the 'Peanut IgE' and 'Total IgE' rows (from column 'Test Name') in the Serum data  
Will take 'Component' and values columns from IgE_IgG4_component 
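
A minimal sketch of the planned 'Test Name' filter on toy data. The five rows reuse the values shown for IMPACT_101655 at visit -2 in the serum table; the variable names and the wide reshaping step are assumptions added for illustration, not something this notebook already does.

```python
import pandas as pd

# Toy rows mirroring one participant's screening visit in the serum data.
toy_serum = pd.DataFrame({
    "Participant ID": ["IMPACT_101655"] * 5,
    "Visit": [-2] * 5,
    "Test Name": [
        "Peanut IgE",
        "Peanut IgE/Total IgE ratio",
        "Peanut IgG4*",
        "Peanut IgG4/IgE ratio",
        "Total IgE",
    ],
    "Value": [25.7, 27.934783, 0.3, 0.004864, 92.0],
})

IGE_TESTS = ["Peanut IgE", "Total IgE"]

# Keep only the two IgE measurements; .copy() gives an independent frame.
ige_only = toy_serum[toy_serum["Test Name"].isin(IGE_TESTS)].copy()

# Optional: one row per participant and visit, one column per test.
ige_wide = (
    ige_only
    .set_index(["Participant ID", "Visit", "Test Name"])["Value"]
    .unstack("Test Name")
)
print(ige_wide)
```

The wide layout only works if each participant, visit, and test combination appears once; duplicate rows would need to be resolved before unstacking.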


In [22]:
# checking total unique participants is the same
len(impact_serum['Participant ID'].unique().tolist())

146

In [23]:
impact_serum.columns.tolist()

['Participant ID',
 'Visit',
 'Barcode',
 'Collection Date',
 'Test Name',
 'Unit',
 'Test Result',
 'Visit Number',
 'Value',
 'Baseline Value',
 'log10 Value',
 'log10 Value_Baseline',
 'Fold change from baseline',
 'log10 fold change from baseline',
 'Planned Treatment',
 'Planned Treatment (N)',
 'Randomized',
 'Intent to Treat Sample',
 'Per-protocol Primary Endpoint',
 'Per-protocol Secondary Endpoint',
 'Age at screening (years)',
 'Age at screening (years) Not Rounded',
 'Sex (character)',
 'Passed Visit 24 OFC for ITT',
 'Passed Visit 24 OFC No Imputation',
 'Passed Visit 26 OFC for ITT',
 'Passed Visit 26 OFC No Imputation',
 'Tolerance outcome',
 'Tolerance outcome (character)']

In [None]:
#keeping the relevant IgE rows




In [14]:
#filtering by just the participant IDs in the eligible list
impact_serum_eligible = impact_serum[impact_serum['Participant ID'].isin(participant_ids)]

impact_serum_cols_to_keep = [
    'Participant ID',
    'Collection Date',
    'Visit',
    'Test Name', # Peanut IgE, Peanut IgE/Total IgE ratio, Peanut IgG4*, Peanut IgG4/IgE ratio, Total IgE
    'Unit',
    'Test Result',
    'Value', # same as test result unless capturing ratio value (for this test result blank)
    'Baseline Value', # IgE values taken during initial screening
    'Planned Treatment', # 'Peanut OIT', 'Placebo'
    #'Passed Visit 24 OFC for ITT',
    'Passed Visit 24 OFC No Imputation',
    #'Passed Visit 26 OFC for ITT',
    'Passed Visit 26 OFC No Imputation',
    'Tolerance outcome', # has values of nan,  2.,  4.,  1.,  3.
    'Tolerance outcome (character)' # has values of nan, 'Desen_no_Tol', 'No_Desen_no_Tol', 'Desen_Tol', 'No_Desen_Tol'
]

# Question for Dr. Gryak: what do these values mean? Including both columns in the DF for now:
# 'Tolerance outcome' has values of nan, 2., 4., 1., 3.
# 'Tolerance outcome (character)' has values of nan, 'Desen_no_Tol', 'No_Desen_no_Tol', 'Desen_Tol', 'No_Desen_Tol'

impact_serum_eligible_filtered = impact_serum_eligible[impact_serum_cols_to_keep]


In [15]:
#checking to see the above filtering kept the correct number of 144 participants 
len(impact_serum_eligible_filtered['Participant ID'].unique().tolist()) #output 144 - correct

144

In [16]:
print(impact_serum_eligible_filtered.shape)
impact_serum_eligible_filtered.head()

(3059, 13)


Unnamed: 0,Participant ID,Collection Date,Visit,Test Name,Unit,Test Result,Value,Baseline Value,Planned Treatment,Passed Visit 24 OFC No Imputation,Passed Visit 26 OFC No Imputation,Tolerance outcome,Tolerance outcome (character)
0,IMPACT_101655,2012-10-11,-2,Peanut IgE,kU/L,25.7,25.7,25.7,Peanut OIT,,,,
1,IMPACT_101655,2012-10-11,-2,Peanut IgE/Total IgE ratio,Ratio,,27.934783,27.934783,Peanut OIT,,,,
2,IMPACT_101655,2012-10-11,-2,Peanut IgG4*,mcg/mL,0.3,0.3,0.3,Peanut OIT,,,,
3,IMPACT_101655,2012-10-11,-2,Peanut IgG4/IgE ratio,Ratio,,0.004864,0.004864,Peanut OIT,,,,
4,IMPACT_101655,2012-10-11,-2,Total IgE,IU/mL,92.0,92.0,92.0,Peanut OIT,,,,


### OFC Results column notes

#### Will use 'Passed Visit 24 OFC No Imputation' and 'Passed Visit 26 OFC No Imputation' columns for OFC results.  

This is because the ITT columns use imputation to fill in missing values. I checked this by counting the unique participant IDs with NaN values in the No Imputation column. I also recall reading that a model was used to impute these values, but the source still needs to be tracked down. 

In reality: 
Of the n=144 participants who completed the initial dose escalation, (4+6)+(9+9) = 28 across the OIT and placebo arms dropped out before visit 24 (week 134): __28 dropped before visit 24__

#### Rule: If the column contains NaN, then the participant did not take the OFC during that visit
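
Because the serum table has many rows per participant, the rule is only safe if the OFC column is missing on all of a participant's rows rather than on some of them. A small sketch of that consistency check on toy data (the frame and variable names are hypothetical):

```python
import numpy as np
import pandas as pd

col = "Passed Visit 24 OFC No Imputation"

# Toy long-format table: several rows per participant.
toy_ofc = pd.DataFrame({
    "Participant ID": ["P1", "P1", "P2", "P2", "P3"],
    col: ["Yes", "Yes", np.nan, np.nan, "No"],
})

# Share of each participant's rows where the outcome is missing.
missing_share = toy_ofc[col].isna().groupby(toy_ofc["Participant ID"]).mean()

# Missing on every row: treated as "did not take the OFC at this visit".
no_ofc_ids = missing_share[missing_share == 1.0].index.tolist()

# Missing on some rows only: would contradict the rule and need a closer look.
mixed_ids = missing_share[(missing_share > 0) & (missing_share < 1)].index.tolist()

print(no_ofc_ids, mixed_ids)
```

If `mixed_ids` comes back empty on the real data, counting unique IDs with any NaN row and counting IDs with all-NaN rows give the same number.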

In [17]:
# counting the number of unique participant IDs who have NaN values for 'Passed Visit 24 OFC No Imputation'. 
# Should be 28

ofc24 = impact_serum_eligible_filtered['Passed Visit 24 OFC No Imputation']
# isnull() and isna() are aliases, so one missing-value check is enough; empty strings also count as missing
filtered_df = impact_serum_eligible_filtered[ofc24.isna() | (ofc24 == "")]





In [18]:
len(filtered_df['Participant ID'].unique().tolist())
# output is 28 as expected

28

# Making a new DF with a new column capturing initial OFC fail 
calling it "Passed Visit -2 OFC"

In [20]:
# .assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on a sliced frame
impact_serum_eligible_filtered = impact_serum_eligible_filtered.assign(**{'Passed Visit -2 OFC': 0})
impact_serum_eligible_filtered.head()


Unnamed: 0,Participant ID,Collection Date,Visit,Test Name,Unit,Test Result,Value,Baseline Value,Planned Treatment,Passed Visit 24 OFC No Imputation,Passed Visit 26 OFC No Imputation,Tolerance outcome,Tolerance outcome (character),Passed Visit -2 OFC
0,IMPACT_101655,2012-10-11,-2,Peanut IgE,kU/L,25.7,25.7,25.7,Peanut OIT,,,,,0
1,IMPACT_101655,2012-10-11,-2,Peanut IgE/Total IgE ratio,Ratio,,27.934783,27.934783,Peanut OIT,,,,,0
2,IMPACT_101655,2012-10-11,-2,Peanut IgG4*,mcg/mL,0.3,0.3,0.3,Peanut OIT,,,,,0
3,IMPACT_101655,2012-10-11,-2,Peanut IgG4/IgE ratio,Ratio,,0.004864,0.004864,Peanut OIT,,,,,0
4,IMPACT_101655,2012-10-11,-2,Total IgE,IU/mL,92.0,92.0,92.0,Peanut OIT,,,,,0



## To Do - keep filtering to determine who finished the study
Target: 70 completed (peanut) + 23 completed (placebo). Note that these sum to 93, not 103, so the per-arm counts need re-checking against the paper.
- Filtering on 'Last Visit Number from Scheduled Visits' == 26 only gave 84. Pick up exploring from here.
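
One avenue to try next, sketched on toy data: count completers per arm from 'Study Status' and 'Planned Treatment', both of which appear in the ADSTART0 column list. Whether 'Completed Study' matches the paper's definition of completion is an assumption that still needs checking, and the toy rows are invented.

```python
import pandas as pd

# Toy stand-in for the eligible ADSTART0 rows (one row per participant).
toy_ad = pd.DataFrame({
    "Participant ID": ["P1", "P2", "P3", "P4"],
    "Planned Treatment": ["Peanut OIT", "Peanut OIT", "Placebo", "Placebo"],
    "Study Status": [
        "Completed Study",
        "Early Termination",
        "Completed Study",
        "Discontinued Therapy",
    ],
})

# Keep completers, then count unique participants in each treatment arm.
completed = toy_ad[toy_ad["Study Status"] == "Completed Study"]
completers_by_arm = completed.groupby("Planned Treatment")["Participant ID"].nunique()

print(completers_by_arm.to_dict())
```

Comparing these per-arm counts with the visit-number filter may show which definition of "completed" the reported totals use.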