# **Allergy Predictor: Data pre-processing**

In this notebook we analyse and pre-process the childhood allergy dataset published on Kaggle: https://www.kaggle.com/datasets/thedevastator/childhood-allergies-prevalence-diagnosis-and-tre

We then select the relevant features and transform them into the format required for training the ML model.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
import numpy as np

# **1 - Data import and analysis**

In [3]:
dataset = pd.read_csv('/content/drive/MyDrive/Projects/allergy_prediction/data/food-allergy-analysis-Zenodo.csv')
dataset.head()

Unnamed: 0,SUBJECT_ID,BIRTH_YEAR,GENDER_FACTOR,RACE_FACTOR,ETHNICITY_FACTOR,PAYER_FACTOR,ATOPIC_MARCH_COHORT,AGE_START_YEARS,AGE_END_YEARS,SHELLFISH_ALG_START,...,CASHEW_ALG_END,ATOPIC_DERM_START,ATOPIC_DERM_END,ALLERGIC_RHINITIS_START,ALLERGIC_RHINITIS_END,ASTHMA_START,ASTHMA_END,FIRST_ASTHMARX,LAST_ASTHMARX,NUM_ASTHMARX
0,1,2006,S1 - Female,R1 - Black,E0 - Non-Hispanic,P1 - Medicaid,False,0.093087,3.164956,,...,,,,,,,,,,
1,2,1994,S1 - Female,R0 - White,E0 - Non-Hispanic,P0 - Non-Medicaid,False,12.232717,18.880219,,...,,,,,,,,12.262834,18.880219,2.0
2,3,2006,S0 - Male,R0 - White,E1 - Hispanic,P0 - Non-Medicaid,True,0.010951,6.726899,,...,,4.884326,,3.917864,6.157426,5.127995,,1.404517,6.157426,4.0
3,4,2004,S0 - Male,R4 - Unknown,E1 - Hispanic,P0 - Non-Medicaid,False,2.398357,9.111567,,...,,,,,,,,,,
4,5,2006,S1 - Female,R1 - Black,E0 - Non-Hispanic,P0 - Non-Medicaid,False,0.013689,6.193018,,...,,,,,,,,,,


In [4]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333200 entries, 0 to 333199
Data columns (total 50 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   SUBJECT_ID               333200 non-null  int64  
 1   BIRTH_YEAR               333200 non-null  int64  
 2   GENDER_FACTOR            333200 non-null  object 
 3   RACE_FACTOR              333200 non-null  object 
 4   ETHNICITY_FACTOR         333200 non-null  object 
 5   PAYER_FACTOR             333200 non-null  object 
 6   ATOPIC_MARCH_COHORT      333200 non-null  bool   
 7   AGE_START_YEARS          333200 non-null  float64
 8   AGE_END_YEARS            333200 non-null  float64
 9   SHELLFISH_ALG_START      5246 non-null    float64
 10  SHELLFISH_ALG_END        1051 non-null    float64
 11  FISH_ALG_START           1796 non-null    float64
 12  FISH_ALG_END             527 non-null     float64
 13  MILK_ALG_START           7289 non-null    float64
 14  MILK

Count the number of subjects

In [5]:
# Count the number of subjects
print(f'Total number of patients: {dataset["SUBJECT_ID"].nunique()}')

Total number of patients: 333200


Count the number of patients with at least one recorded allergy.

A patient is counted when the start or end age in any of the selected columns is not null. Note that columns 9 to 49 cover the food allergens as well as the atopic dermatitis, allergic rhinitis and asthma columns, so this figure is broader than food allergies alone.

In [6]:
# Flag subjects with at least one non-null value in columns 9-49
# (food allergens plus the atopic dermatitis, allergic rhinitis and asthma columns)
allergen_columns = dataset.columns[9:50]
has_record = dataset[allergen_columns].notnull().any(axis=1)
print(f'Number of patients with at least one recorded allergy: {has_record.sum()}')

# Represent this as a percentage of the total
print(f'Percentage of patients with at least one recorded allergy: {has_record.mean() * 100:.2f}%')

Number of patients with at least one recorded allergy: 167867
Percentage of patients with at least one recorded allergy: 50.38%
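
The cell above selects columns by position (`dataset.columns[9:50]`), which, judging by the `head()` output, also spans the atopic dermatitis, allergic rhinitis and asthma columns. A minimal sketch of counting food allergies only, assuming every food-allergen column contains `_ALG_` in its name (as the columns listed so far do). The `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (hypothetical values)
toy = pd.DataFrame({
    "SUBJECT_ID": [1, 2, 3, 4],
    "PEANUT_ALG_START": [np.nan, 2.5, np.nan, np.nan],
    "PEANUT_ALG_END": [np.nan, np.nan, np.nan, np.nan],
    "MILK_ALG_START": [np.nan, np.nan, np.nan, 0.8],
    "ASTHMA_START": [5.1, np.nan, np.nan, np.nan],
})

# Select food-allergen columns by name rather than by position
food_allergen_columns = [col for col in toy.columns if "_ALG_" in col]

# Subjects with at least one non-null food-allergen value
has_food_allergy = toy[food_allergen_columns].notnull().any(axis=1)
n_food_allergy = int(has_food_allergy.sum())

# For comparison: subjects with any non-null condition column at all
n_any_condition = int(toy.drop(columns="SUBJECT_ID").notnull().any(axis=1).sum())

print(f"Food allergy only: {n_food_allergy}, any condition: {n_any_condition}")
```

Selecting by name keeps the count stable if the column order changes, at the cost of relying on the naming pattern.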


Identify which allergy is the most prevalent

In [7]:
# Define the allergen start columns
allergen_start_columns = [
    'SHELLFISH_ALG_START', 'FISH_ALG_START', 'MILK_ALG_START', 'SOY_ALG_START', 'EGG_ALG_START',
    'WHEAT_ALG_START', 'PEANUT_ALG_START', 'SESAME_ALG_START', 'TREENUT_ALG_START', 'WALNUT_ALG_START',
    'PECAN_ALG_START', 'PISTACH_ALG_START', 'ALMOND_ALG_START', 'BRAZIL_ALG_START', 'HAZELNUT_ALG_START',
    'CASHEW_ALG_START'
]

# Count the non-null values for each allergen start column
allergen_counts = dataset[allergen_start_columns].notnull().sum()

# Find the allergen with the highest count
most_prevalent_allergen = allergen_counts.idxmax()
most_prevalent_count = allergen_counts.max()

# Print the result
print(f"The most prevalent allergen is {most_prevalent_allergen} with {most_prevalent_count} occurrences.")

The most prevalent allergen is PEANUT_ALG_START with 8653 occurrences.


**We will focus this study on predicting peanut allergies based on demographic information.**

# **2 - Pre-process data**

To pre-process the data we need to handle the categorical features appropriately.

Gender, race and ethnicity are one-hot encoded.

Age is truncated to whole years, placed into categories (0-1, 1-5, 5-10, 10-15, 15-20) and then one-hot encoded.



Extract relevant columns first

In [8]:
required_columns = [
    'SUBJECT_ID', 'BIRTH_YEAR', 'GENDER_FACTOR', 'RACE_FACTOR', 'ETHNICITY_FACTOR',
    'PEANUT_ALG_START', 'PEANUT_ALG_END'
]

In [9]:
peanut_dataset = dataset[required_columns]
peanut_dataset.head()

Unnamed: 0,SUBJECT_ID,BIRTH_YEAR,GENDER_FACTOR,RACE_FACTOR,ETHNICITY_FACTOR,PEANUT_ALG_START,PEANUT_ALG_END
0,1,2006,S1 - Female,R1 - Black,E0 - Non-Hispanic,,
1,2,1994,S1 - Female,R0 - White,E0 - Non-Hispanic,,
2,3,2006,S0 - Male,R0 - White,E1 - Hispanic,,
3,4,2004,S0 - Male,R4 - Unknown,E1 - Hispanic,,
4,5,2006,S1 - Female,R1 - Black,E0 - Non-Hispanic,,


In [10]:
peanut_dataset['SUBJECT_ID'].nunique()

333200

Create a binary label for each patient indicating whether a peanut allergy was recorded

In [11]:
# Create a binary label column: 1 if a peanut allergy start age is recorded, else 0.
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning that
# direct column assignment on a slice of `dataset` would raise.
peanut_dataset = peanut_dataset.assign(
    peanut_allergy=peanut_dataset['PEANUT_ALG_START'].notnull().astype(int)
)
peanut_dataset.head()


Unnamed: 0,SUBJECT_ID,BIRTH_YEAR,GENDER_FACTOR,RACE_FACTOR,ETHNICITY_FACTOR,PEANUT_ALG_START,PEANUT_ALG_END,peanut_allergy
0,1,2006,S1 - Female,R1 - Black,E0 - Non-Hispanic,,,0
1,2,1994,S1 - Female,R0 - White,E0 - Non-Hispanic,,,0
2,3,2006,S0 - Male,R0 - White,E1 - Hispanic,,,0
3,4,2004,S0 - Male,R4 - Unknown,E1 - Hispanic,,,0
4,5,2006,S1 - Female,R1 - Black,E0 - Non-Hispanic,,,0


In [12]:
# Count the number of 1s in the label column
print(f'Number of patients with peanut allergies: {peanut_dataset["peanut_allergy"].sum()}')

Number of patients with peanut allergies: 8653


In [13]:
# Remove peanut-alg columns
peanut_dataset = peanut_dataset.drop(['PEANUT_ALG_START', 'PEANUT_ALG_END'], axis=1)

In [14]:
peanut_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333200 entries, 0 to 333199
Data columns (total 6 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   SUBJECT_ID        333200 non-null  int64 
 1   BIRTH_YEAR        333200 non-null  int64 
 2   GENDER_FACTOR     333200 non-null  object
 3   RACE_FACTOR       333200 non-null  object
 4   ETHNICITY_FACTOR  333200 non-null  object
 5   peanut_allergy    333200 non-null  int64 
dtypes: int64(3), object(3)
memory usage: 15.3+ MB


In [15]:
# Merge the age at the start of observation (AGE_START_YEARS) from the original data
peanut_dataset = pd.merge(peanut_dataset, dataset[['SUBJECT_ID', 'AGE_START_YEARS']], on='SUBJECT_ID')
peanut_dataset.head()


Unnamed: 0,SUBJECT_ID,BIRTH_YEAR,GENDER_FACTOR,RACE_FACTOR,ETHNICITY_FACTOR,peanut_allergy,AGE_START_YEARS
0,1,2006,S1 - Female,R1 - Black,E0 - Non-Hispanic,0,0.093087
1,2,1994,S1 - Female,R0 - White,E0 - Non-Hispanic,0,12.232717
2,3,2006,S0 - Male,R0 - White,E1 - Hispanic,0,0.010951
3,4,2004,S0 - Male,R4 - Unknown,E1 - Hispanic,0,2.398357
4,5,2006,S1 - Female,R1 - Black,E0 - Non-Hispanic,0,0.013689


In [16]:
# Drop birth_year
peanut_dataset = peanut_dataset.drop('BIRTH_YEAR', axis=1)

In [17]:
# Truncate the age at the start of observation to whole years
peanut_dataset['age'] = peanut_dataset['AGE_START_YEARS'].astype(int)
peanut_dataset.head()

Unnamed: 0,SUBJECT_ID,GENDER_FACTOR,RACE_FACTOR,ETHNICITY_FACTOR,peanut_allergy,AGE_START_YEARS,age
0,1,S1 - Female,R1 - Black,E0 - Non-Hispanic,0,0.093087,0
1,2,S1 - Female,R0 - White,E0 - Non-Hispanic,0,12.232717,12
2,3,S0 - Male,R0 - White,E1 - Hispanic,0,0.010951,0
3,4,S0 - Male,R4 - Unknown,E1 - Hispanic,0,2.398357,2
4,5,S1 - Female,R1 - Black,E0 - Non-Hispanic,0,0.013689,0


In [18]:
peanut_dataset = peanut_dataset.drop('AGE_START_YEARS', axis=1)

Check for any missing values

In [19]:
# Check for any null data
peanut_dataset.isnull().sum()

Unnamed: 0,0
SUBJECT_ID,0
GENDER_FACTOR,0
RACE_FACTOR,0
ETHNICITY_FACTOR,0
peanut_allergy,0
age,0


In [20]:
peanut_dataset.head()

Unnamed: 0,SUBJECT_ID,GENDER_FACTOR,RACE_FACTOR,ETHNICITY_FACTOR,peanut_allergy,age
0,1,S1 - Female,R1 - Black,E0 - Non-Hispanic,0,0
1,2,S1 - Female,R0 - White,E0 - Non-Hispanic,0,12
2,3,S0 - Male,R0 - White,E1 - Hispanic,0,0
3,4,S0 - Male,R4 - Unknown,E1 - Hispanic,0,2
4,5,S1 - Female,R1 - Black,E0 - Non-Hispanic,0,0


In [21]:
# Move peanut_allergy to the end
peanut_dataset = peanut_dataset[['SUBJECT_ID', 'GENDER_FACTOR', 'RACE_FACTOR', 'ETHNICITY_FACTOR', 'age', 'peanut_allergy']]
peanut_dataset.head()

Unnamed: 0,SUBJECT_ID,GENDER_FACTOR,RACE_FACTOR,ETHNICITY_FACTOR,age,peanut_allergy
0,1,S1 - Female,R1 - Black,E0 - Non-Hispanic,0,0
1,2,S1 - Female,R0 - White,E0 - Non-Hispanic,12,0
2,3,S0 - Male,R0 - White,E1 - Hispanic,0,0
3,4,S0 - Male,R4 - Unknown,E1 - Hispanic,2,0
4,5,S1 - Female,R1 - Black,E0 - Non-Hispanic,0,0


Split age into buckets

In [22]:
# Split age into category buckets: 0-1, 1-5, 5-10, 10-15, 15-20 and >20.
# Ages are whole years at this point, so '0-1' covers ages 0 and 1, '1-5' covers 2 to 5, and so on.
def age_to_category(age):
    if 0 <= age <= 1:
        return '0-1'
    elif 1 < age <= 5:
        return '1-5'
    elif 5 < age <= 10:
        return '5-10'
    elif 10 < age <= 15:
        return '10-15'
    elif 15 < age <= 20:
        return '15-20'
    else:
        return '>20'  # Catch-all for ages above 20; these rows are removed before encoding


peanut_dataset['age_category'] = peanut_dataset['age'].apply(age_to_category)
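
The same bucketing can also be expressed without a row-wise `apply`. A minimal sketch using `pd.cut`, assuming whole-number, non-negative ages as produced above. The `ages` series is made up for illustration, and the infinite upper bin edge is an assumption added here to reproduce the `'>20'` catch-all.

```python
import numpy as np
import pandas as pd

# Illustrative whole-number ages (hypothetical values)
ages = pd.Series([0, 1, 2, 5, 6, 12, 20, 25])

# Bin edges are right-inclusive: (1, 5] means ages 2 to 5 for whole numbers
bins = [0, 1, 5, 10, 15, 20, np.inf]
labels = ['0-1', '1-5', '5-10', '10-15', '15-20', '>20']

# include_lowest=True makes the first bin [0, 1] so that age 0 is not dropped
age_category = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

result = age_category.astype(str).tolist()
print(result)
# ['0-1', '0-1', '1-5', '1-5', '5-10', '10-15', '15-20', '>20']
```

Unlike `age_to_category`, `pd.cut` returns NaN for values outside the bin edges (for example negative ages), so a null check afterwards is advisable.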

In [23]:
peanut_dataset.head()

Unnamed: 0,SUBJECT_ID,GENDER_FACTOR,RACE_FACTOR,ETHNICITY_FACTOR,age,peanut_allergy,age_category
0,1,S1 - Female,R1 - Black,E0 - Non-Hispanic,0,0,0-1
1,2,S1 - Female,R0 - White,E0 - Non-Hispanic,12,0,10-15
2,3,S0 - Male,R0 - White,E1 - Hispanic,0,0,0-1
3,4,S0 - Male,R4 - Unknown,E1 - Hispanic,2,0,1-5
4,5,S1 - Female,R1 - Black,E0 - Non-Hispanic,0,0,0-1


In [24]:
# Drop the raw age column and keep peanut_allergy as the last column
peanut_dataset = peanut_dataset[['SUBJECT_ID', 'GENDER_FACTOR', 'RACE_FACTOR', 'ETHNICITY_FACTOR', 'age_category', 'peanut_allergy']]

Split train and test data

In [25]:
# Split the data into train and test set ensuring similar proportion of peanut_allergy = 1 in each
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(peanut_dataset, test_size=0.2, stratify=peanut_dataset['peanut_allergy'], random_state=42)

train_data.head()

Unnamed: 0,SUBJECT_ID,GENDER_FACTOR,RACE_FACTOR,ETHNICITY_FACTOR,age_category,peanut_allergy
149830,149831,S1 - Female,R1 - Black,E0 - Non-Hispanic,0-1,0
73785,73786,S1 - Female,R0 - White,E0 - Non-Hispanic,0-1,0
313746,313747,S1 - Female,R1 - Black,E0 - Non-Hispanic,0-1,0
254431,254432,S0 - Male,R0 - White,E0 - Non-Hispanic,1-5,0
180804,180805,S1 - Female,R0 - White,E0 - Non-Hispanic,1-5,0


In [26]:
# Proportion of allergy = 1 in train and test data
print(f'Proportion of peanut allergies in train data: {train_data["peanut_allergy"].sum() / len(train_data)}')
print(f'Proportion of peanut allergies in test data: {test_data["peanut_allergy"].sum() / len(test_data)}')

Proportion of peanut allergies in train data: 0.025967887154861945
Proportion of peanut allergies in test data: 0.025975390156062424
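
The near-identical proportions above are the effect of `stratify`. A small sketch on made-up data (the `toy` frame, with 3% positives, is hypothetical) showing that a stratified split keeps the positive rate almost equal in both parts:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 30 positives out of 1000 rows
toy = pd.DataFrame({
    "feature": range(1000),
    "label": [1] * 30 + [0] * 970,
})

# stratify keeps the share of each label value the same in both splits
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["label"], random_state=42
)

train_rate = train["label"].mean()
test_rate = test["label"].mean()
print(f"train positive rate: {train_rate:.3f}, test positive rate: {test_rate:.3f}")
```

Without `stratify`, a rare class can end up noticeably over- or under-represented in the test set purely by chance.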


In [27]:
# Split X_train, y_train, X_test and y_test
X_train = train_data.drop('peanut_allergy', axis=1)
y_train = train_data['peanut_allergy']

X_test = test_data.drop('peanut_allergy', axis=1)
y_test = test_data['peanut_allergy']

**Convert into One Hot Encoded features**

In [28]:
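# Remove any rows with age_category >20, applying the same mask to the labels
# so that features and labels stay aligned
train_mask = X_train['age_category'] != '>20'
test_mask = X_test['age_category'] != '>20'
X_train, y_train = X_train[train_mask], y_train[train_mask]
X_test, y_test = X_test[test_mask], y_test[test_mask]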

In [29]:
# One hot encode categorical features
from sklearn.preprocessing import OneHotEncoder

In [30]:
categorical_values = ['GENDER_FACTOR', 'RACE_FACTOR', 'ETHNICITY_FACTOR', 'age_category']

In [31]:
encoder = OneHotEncoder()

In [32]:
# Fit and transform the training data
X_train_encoded = encoder.fit_transform(X_train[categorical_values])

# Transform the test data
X_test_encoded = encoder.transform(X_test[categorical_values])

In [33]:
# Convert back to dataframes
X_train_encoded = pd.DataFrame.sparse.from_spmatrix(X_train_encoded)
X_test_encoded = pd.DataFrame.sparse.from_spmatrix(X_test_encoded)

In [34]:
X_train_encoded.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,0.0,1.0,0.0,1.0,0,0,0,1.0,0,1.0,0.0,0,0,0
1,0.0,1.0,1.0,0.0,0,0,0,1.0,0,1.0,0.0,0,0,0
2,0.0,1.0,0.0,1.0,0,0,0,1.0,0,1.0,0.0,0,0,0
3,1.0,0.0,1.0,0.0,0,0,0,1.0,0,0.0,1.0,0,0,0
4,0.0,1.0,1.0,0.0,0,0,0,1.0,0,0.0,1.0,0,0,0


In [35]:
# Get the column headings back
X_train_encoded.columns = encoder.get_feature_names_out(categorical_values)
X_test_encoded.columns = encoder.get_feature_names_out(categorical_values)

In [36]:
X_train_encoded.head()


Unnamed: 0,GENDER_FACTOR_S0 - Male,GENDER_FACTOR_S1 - Female,RACE_FACTOR_R0 - White,RACE_FACTOR_R1 - Black,RACE_FACTOR_R2 - Asian or Pacific Islander,RACE_FACTOR_R3 - Other,RACE_FACTOR_R4 - Unknown,ETHNICITY_FACTOR_E0 - Non-Hispanic,ETHNICITY_FACTOR_E1 - Hispanic,age_category_0-1,age_category_1-5,age_category_10-15,age_category_15-20,age_category_5-10
0,0.0,1.0,0.0,1.0,0,0,0,1.0,0,1.0,0.0,0,0,0
1,0.0,1.0,1.0,0.0,0,0,0,1.0,0,1.0,0.0,0,0,0
2,0.0,1.0,0.0,1.0,0,0,0,1.0,0,1.0,0.0,0,0,0
3,1.0,0.0,1.0,0.0,0,0,0,1.0,0,0.0,1.0,0,0,0
4,0.0,1.0,1.0,0.0,0,0,0,1.0,0,0.0,1.0,0,0,0


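Because only about 2.6% of patients have a peanut allergy, the classes are heavily imbalanced, so we apply a resampling technique to the training data.

In this case we use SMOTE to oversample the minority class. SMOTE interpolates between minority samples, so on one-hot features the synthetic rows can contain fractional values.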

In [37]:
from imblearn.over_sampling import SMOTE

In [42]:
# Apply smote to data
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_encoded, y_train)
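
SMOTE creates each synthetic minority sample by interpolating between a minority sample and one of its nearest minority neighbours. The NumPy sketch below illustrates that single interpolation step only; it is not imblearn's implementation, and the two one-hot vectors are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two hypothetical minority samples, each with two one-hot groups:
# columns 0-1 encode gender, columns 2-3 encode an age category
x_i = np.array([0.0, 1.0, 1.0, 0.0])
x_neighbor = np.array([1.0, 0.0, 0.0, 1.0])

# Random interpolation factor in [0, 1)
lam = rng.uniform(0.0, 1.0)

# Synthetic sample lies on the line segment between the two samples
x_synthetic = x_i + lam * (x_neighbor - x_i)
print(x_synthetic)
```

Each one-hot group still sums to 1, but the individual entries are generally no longer 0 or 1, which is worth keeping in mind when interpreting the resampled training data.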

In [45]:
# Save train and test data

X_train_resampled.to_csv('/content/drive/MyDrive/Projects/allergy_prediction/data/X_train_resampled.csv', index=False)
y_train_resampled.to_csv('/content/drive/MyDrive/Projects/allergy_prediction/data/y_train_resampled.csv', index=False)
X_test_encoded.to_csv('/content/drive/MyDrive/Projects/allergy_prediction/data/X_test_encoded.csv', index=False)
y_test.to_csv('/content/drive/MyDrive/Projects/allergy_prediction/data/y_test.csv', index=False)