### A Fine-Tuned DistilBERT LLM to Classify User VAERS Adverse Event Symptom Text Descriptions
#### Predicting more than the first symptom (SYMPTOM1) requires a multi-label classification setup. Each symptom is treated as a separate binary label, and the model learns to predict the presence or absence of each symptom independently. The labels are encoded into this binary format with scikit-learn's MultiLabelBinarizer.
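As a minimal sketch of what this encoding produces, the toy example below fits a `MultiLabelBinarizer` on three hypothetical reports. The symptom names and the `reports` and `mlb_demo` variables are illustrative assumptions, not taken from the VAERS files.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy reports; each inner list holds the symptoms of one report
reports = [
    ["Headache", "Pyrexia"],
    ["Pyrexia"],
    ["Fatigue", "Headache"],
]

mlb_demo = MultiLabelBinarizer()
encoded = mlb_demo.fit_transform(reports)

# Columns follow the sorted label names stored in classes_
print(list(mlb_demo.classes_))  # ['Fatigue', 'Headache', 'Pyrexia']
print(encoded.tolist())         # [[0, 1, 1], [0, 0, 1], [1, 1, 0]]
```

Each row is one report and each column is one symptom, so a report can carry any number of positive labels at once.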

In [1]:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Load and preprocess the VAERS data and symptoms
vaers_data_path = 'data/2023VAERSDATA.csv'
vaers_symptoms_path = 'data/2023VAERSSYMPTOMS.csv'
vaers_data = pd.read_csv(vaers_data_path, encoding='ISO-8859-1')
vaers_symptoms = pd.read_csv(vaers_symptoms_path, encoding='ISO-8859-1')

# Merge datasets on VAERS_ID
merged_data = vaers_data.merge(vaers_symptoms, on='VAERS_ID')
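# Replace missing narratives with an empty string so they do not become the literal text 'nan'
merged_data['SYMPTOM_TEXT'] = merged_data['SYMPTOM_TEXT'].fillna('').astype(str)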

# Concatenate symptoms into a single string for each row
merged_data['ALL_SYMPTOMS'] = merged_data[['SYMPTOM1', 'SYMPTOM2', 'SYMPTOM3', 'SYMPTOM4', 'SYMPTOM5']].apply(lambda x: ', '.join(x.dropna().astype(str)), axis=1)

# Group by VAERS_ID and aggregate data
grouped = merged_data.groupby('VAERS_ID').agg({
    # Include all necessary columns here
    'SYMPTOM_TEXT': 'first',
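    # Join with ', ' (the delimiter used when splitting) so symptoms from different rows are not fused together
    'ALL_SYMPTOMS': ', '.join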
}).reset_index()

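# Split 'ALL_SYMPTOMS' into a list of symptoms, dropping empty entries left by rows without symptoms
grouped['ALL_SYMPTOMS'] = grouped['ALL_SYMPTOMS'].str.split(', ').apply(lambda symptoms: [s for s in symptoms if s])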

# Initialize MultiLabelBinarizer
mlb = MultiLabelBinarizer()

# Transform the symptoms into a multi-label format
grouped['Encoded_Symptoms'] = list(mlb.fit_transform(grouped['ALL_SYMPTOMS']))

# Each row of 'Encoded_Symptoms' now holds a binary vector (one entry per symptom in mlb.classes_),
# which is the target format for multi-label classification
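Once a model outputs one score per symptom, the same binarizer can map predictions back to symptom names. The sketch below is only an illustration under stated assumptions: the `logits` values are made up, the 0.5 threshold is a common default rather than a tuned choice, and `demo_mlb` is a stand-in fitted on three hypothetical labels rather than the `mlb` fitted above.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Stand-in binarizer fitted on three hypothetical symptom labels
demo_mlb = MultiLabelBinarizer()
demo_mlb.fit([["Fatigue", "Headache", "Pyrexia"]])

# Hypothetical raw model outputs: one row per report, one column per symptom
logits = np.array([
    [2.0, -1.5, 0.3],
    [-3.0, -0.2, -1.0],
])

# A sigmoid per label gives independent probabilities (no softmax across labels)
probabilities = 1.0 / (1.0 + np.exp(-logits))

# Assumed threshold of 0.5: a label counts as present when its probability reaches it
predictions = (probabilities >= 0.5).astype(int)

# Map the binary rows back to symptom names
predicted_symptoms = demo_mlb.inverse_transform(predictions)
print(predicted_symptoms)  # [('Fatigue', 'Pyrexia'), ()]
```

Because each label is thresholded on its own, a report can end up with several predicted symptoms or with none, as the second row shows.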
