# vaxxAId

**VaxxAId** clusters patients who reported adverse reactions or side effects after COVID-19 vaccination, using reports from the Vaccine Adverse Event Reporting System (VAERS), and looks for similarities among them. The goal is to help identify and classify patients at risk of such effects. Reports are clustered with the **K-modes clustering algorithm** on features such as patient outcomes, demographics, and type of vaccine received.

## Loading the datasets

In [90]:
# Load the datasets

import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import ipynb.fs.defs.utils as utils


patients = pd.read_csv('../data/20212022DATA.csv', encoding='latin1')

patients


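| | VAERS_ID | STATE | AGE_YRS | SEX | DIED | L_THREAT | ER_VISIT | HOSPITAL | DISABLE | BIRTH_DEFECT | RECOVD | V_ADMINBY | VAX_MANU | VAX_SITE | VAX_ROUTE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 916600 | TX | 33.0 | F | | | | | | | Y | PVT | MODERNA | LA | IM |
| 1 | 916601 | CA | 73.0 | F | | | | | | | Y | SEN | MODERNA | RA | IM |
| 2 | 916602 | WA | 23.0 | F | | | | | | | U | SEN | PFIZER\BIONTECH | LA | IM |
| 3 | 916603 | WA | 58.0 | F | | | | | | | Y | WRK | MODERNA | | |
| 4 | 916604 | TX | 47.0 | F | | | | | | | N | PUB | MODERNA | LA | IM |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 149699 | 2265774 | NJ | 27.0 | F | | | | | | | | UNK | PFIZER\BIONTECH | RA | IM |
| 149700 | 2265775 | FL | 73.0 | M | | | | Y | | | N | PHM | MODERNA | RA | IM |
| 149701 | 2265776 | NJ | 23.0 | F | | | | | | | Y | PHM | PFIZER\BIONTECH | LA | IM |
| 149702 | 2265777 | HI | 50.0 | M | | | | | | | Y | PHM | PFIZER\BIONTECH | LA | IM |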


The following table describes the attributes used for clustering, all of which are categorical. `AGE_GRP` is not in the raw file; it is derived from `AGE_YRS` in the next section.

| Attributes | Data Type | Description |
| ---------- | --------- | ----------- |
| `STATE`        | char(2)   | Home state of the vaccinee                                  |
| `AGE_GRP`      | range     | Age group                                                   |
| `SEX`          | char(1)   | Sex                                                         |
| `DIED`         | char(1)   | Adverse effect caused death of the patient                  |
| `L_THREAT`     | char(1)   | Life-threatening event associated with the vaccination      |
| `ER_VISIT`     | char(1)   | Patient required ER visit after experiencing adverse effect |
| `HOSPITAL`     | char(1)   | Patient was hospitalized after experiencing adverse effect  |
| `DISABLE`      | char(1)   | Patient was disabled after experiencing adverse effect      |
| `BIRTH_DEFECT` | char(1)   | Patient has birth defect                                    |
| `RECOVD`       | char(1)   | Patient recovered from adverse effect                       |
| `V_ADMINBY`    | char(3)   | Type of facility where the vaccine was administered         |
| `VAX_MANU`     | char(40)  | Vaccine manufacturer                                        |
| `VAX_SITE`     | char(6)   | Vaccination anatomic site                                   |
| `VAX_ROUTE`    | char(6)   | Vaccine route of administration                             |

## Replacing/removing null values

In [91]:
def replace_values(df):
    df['STATE'] = df['STATE'].str.upper()

    # Bin AGE_YRS into 10-year groups such as "(30, 40]", then drop AGE_YRS.
    # Missing ages end up as the string 'nan', which a later cell filters out.
    age_max = df['AGE_YRS'].max()
    agegrp_max = int(math.ceil(age_max / 10) * 10)  # round up to the next multiple of 10
    age_grp = list(range(0, agegrp_max + 1, 10))
    df['AGE_GRP'] = pd.cut(x=df['AGE_YRS'], bins=age_grp).astype(str)
    df = df.drop(['AGE_YRS'], axis=1)

    # Fill missing outcome flags with 'N' (a blank is treated as "did not occur")
    values = {'DIED': 'N', 'L_THREAT': 'N', 'ER_VISIT': 'N', 'HOSPITAL':'N', 'DISABLE':'N', 'BIRTH_DEFECT':'N'}
    df = df.fillna(value=values)
    
    return df

patients = replace_values(patients)
patients.head()

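| | VAERS_ID | STATE | SEX | DIED | L_THREAT | ER_VISIT | HOSPITAL | DISABLE | BIRTH_DEFECT | RECOVD | V_ADMINBY | VAX_MANU | VAX_SITE | VAX_ROUTE | AGE_GRP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 916600 | TX | F | N | N | N | N | N | N | Y | PVT | MODERNA | LA | IM | (30, 40] |
| 1 | 916601 | CA | F | N | N | N | N | N | N | Y | SEN | MODERNA | RA | IM | (70, 80] |
| 2 | 916602 | WA | F | N | N | N | N | N | N | U | SEN | PFIZER\BIONTECH | LA | IM | (20, 30] |
| 3 | 916603 | WA | F | N | N | N | N | N | N | Y | WRK | MODERNA | | | (50, 60] |
| 4 | 916604 | TX | F | N | N | N | N | N | N | N | PUB | MODERNA | LA | IM | (40, 50] |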
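
To make the age binning in `replace_values` easier to follow, here is a minimal sketch on made-up ages (the values are illustrative, not taken from VAERS). It shows how `pd.cut` assigns 10-year intervals and what happens at the edges.

```python
import math

import pandas as pd

# Toy ages (made up), including the edge cases 0.0 and a missing value
ages = pd.Series([0.0, 0.5, 33.0, 40.0, 73.0, None])

# Round the maximum age up to the next multiple of 10 to get the last bin edge
upper = int(math.ceil(ages.max() / 10) * 10)
bins = list(range(0, upper + 1, 10))  # [0, 10, ..., 80]

# By default each interval is open on the left and closed on the right
groups = pd.cut(ages, bins=bins)
print(groups)
```

Because the first interval is `(0, 10]`, an age of exactly 0.0 falls outside every bin and comes back as missing, just like a missing age. If such rows should be kept, passing `include_lowest=True` to `pd.cut` is one option.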


In [92]:
patients.isna().sum()

# patients.STATE.value_counts()

VAERS_ID            0
STATE           27392
SEX                 0
DIED                0
L_THREAT            0
ER_VISIT            0
HOSPITAL            0
DISABLE             0
BIRTH_DEFECT        0
RECOVD          24705
V_ADMINBY           0
VAX_MANU            0
VAX_SITE        51220
VAX_ROUTE       43533
AGE_GRP             0
dtype: int64

In [93]:
patients.dropna(inplace=True)
patients.isna().sum()

VAERS_ID        0
STATE           0
SEX             0
DIED            0
L_THREAT        0
ER_VISIT        0
HOSPITAL        0
DISABLE         0
BIRTH_DEFECT    0
RECOVD          0
V_ADMINBY       0
VAX_MANU        0
VAX_SITE        0
VAX_ROUTE       0
AGE_GRP         0
dtype: int64

In [94]:
# Remove rows where AGE_GRP is unknown

patients = patients[patients['AGE_GRP']!='nan']
patients.AGE_GRP.value_counts()

(50, 60]      11538
(40, 50]      11010
(30, 40]      10802
(60, 70]      10646
(20, 30]       6960
(70, 80]       6672
(10, 20]       4700
(80, 90]       2751
(0, 10]        2209
(90, 100]       662
(100, 110]       31
Name: AGE_GRP, dtype: int64

In [95]:
# Remove rows where SEX is unknown

patients = patients[patients['SEX']!='U']
patients.SEX.value_counts()

F    46115
M    21600
Name: SEX, dtype: int64

In [96]:
# Remove rows where RECOVD is unknown

patients = patients[patients['RECOVD']!='U']
patients.RECOVD.value_counts()

N    29765
Y    24985
Name: RECOVD, dtype: int64

In [97]:
# Remove rows where V_ADMINBY is unknown

patients = patients[patients['V_ADMINBY']!='UNK']
patients.V_ADMINBY.value_counts()

PVT    18861
PHM    14624
PUB     5565
OTH     4838
WRK     2719
SEN     1382
MIL      722
SCH      663
Name: V_ADMINBY, dtype: int64

In [98]:
# Remove rows where VAX_MANU is unknown

patients = patients[patients['VAX_MANU']!='UNKNOWN MANUFACTURER']
patients.VAX_MANU.value_counts()

PFIZER\BIONTECH    24206
MODERNA            22899
JANSSEN             2246
Name: VAX_MANU, dtype: int64
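
The five "unknown" filters above follow the same pattern, so they could also be written as a single pass. The sketch below is one possible consolidation; the `UNKNOWN` mapping and the `drop_unknown` helper are names introduced here for illustration and are not part of the notebook.

```python
import pandas as pd

# Sentinel value that marks "unknown" in each column (taken from the cells above)
UNKNOWN = {
    "AGE_GRP": "nan",
    "SEX": "U",
    "RECOVD": "U",
    "V_ADMINBY": "UNK",
    "VAX_MANU": "UNKNOWN MANUFACTURER",
}


def drop_unknown(df, sentinels=UNKNOWN):
    """Return only the rows where no listed column holds its sentinel value."""
    keep = pd.Series(True, index=df.index)
    for column, sentinel in sentinels.items():
        keep &= (df[column] != sentinel)
    return df[keep]
```

Calling `patients = drop_unknown(patients)` should select the same rows as the step-by-step filters, at the cost of losing the intermediate `value_counts()` checks.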

In [99]:
# We select the first 30000 rows only.

sample = patients.head(30000)
sample.to_csv('../data/20212022DATASAMPLE.csv', index=False)
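
Note that `head(30000)` keeps the first rows in file order. If the file is ordered by `VAERS_ID`, as the preview suggests, the sample would lean toward the earliest reports. A seeded random sample is one alternative; the sketch below assumes that is acceptable for the analysis, and `random_sample` and its default seed are hypothetical names and values chosen here.

```python
import pandas as pd


def random_sample(df, n=30000, seed=42):
    """Draw up to n rows at random, reproducibly for a given seed."""
    n = min(n, len(df))  # avoid an error when fewer than n rows remain
    return df.sample(n=n, random_state=seed)
```

With this helper, `sample = random_sample(patients)` would replace the `head` call, and the `to_csv` line stays the same.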


## Visualizing the data
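
One possible starting point for this section is a bar chart of report counts per category. The sketch below uses made-up rows, and `plot_category_counts` is a hypothetical helper, not something the notebook defines.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_category_counts(df, column, ax=None):
    """Draw one bar per category of `column`, sized by number of rows."""
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 3))
    counts = df[column].value_counts()
    counts.plot(kind="bar", ax=ax)
    ax.set_xlabel(column)
    ax.set_ylabel("Number of reports")
    return ax


# Toy rows (made up) standing in for the cleaned `patients` frame
toy = pd.DataFrame({"VAX_MANU": ["MODERNA", "MODERNA", "JANSSEN", "PFIZER\\BIONTECH"]})
ax = plot_category_counts(toy, "VAX_MANU")
```

The same call could be repeated for `AGE_GRP`, `SEX`, or `V_ADMINBY` to compare distributions before clustering.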

## K-modes Clustering Model Implementation

In [None]:
# TODO: K-modes clustering (scikit-learn has no K-modes estimator; the third-party `kmodes` package is one option)
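
Since this cell is still a placeholder, here is a minimal, self-contained sketch of the K-modes idea: assign each row to the nearest mode by counting mismatched attributes, then recompute each cluster's mode column by column. It is an illustration under simplifying assumptions (random initialisation, a fixed iteration cap), not the project's implementation. The function names are introduced here.

```python
import numpy as np
import pandas as pd


def matching_dissimilarity(X, modes):
    """Number of mismatched attributes between every row and every mode."""
    # X has shape (n, m), modes has shape (k, m); the result has shape (n, k)
    return (X[:, None, :] != modes[None, :, :]).sum(axis=2)


def column_modes(rows):
    """Most frequent value in each column (ties broken by sort order)."""
    return np.array([pd.Series(col).mode().iloc[0] for col in rows.T], dtype=object)


def k_modes(df, k, n_iter=20, seed=0):
    X = df.to_numpy(dtype=object)
    rng = np.random.default_rng(seed)
    # Initialise the modes with k rows drawn at random without replacement
    modes = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(n_iter):
        new_labels = matching_dissimilarity(X, modes).argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its previous mode
                modes[j] = column_modes(members)
    return labels, modes
```

A call such as `labels, modes = k_modes(sample[feature_columns], k=5)` would return one cluster label per report, where `feature_columns` and `k=5` are placeholders to be chosen for the analysis. For real use, the `KModes` estimator from the `kmodes` package is likely a better fit, since it adds smarter initialisation and multiple restarts.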