**Purpose**

The goal of this code is to impute values for the PCIAT variables and, based on these, compute sii scores.

This code would not be part of what we submit to the Kaggle competition, since we wouldn't have access to the outcome variables there.

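**Note**: It appears that the original PCIAT-PCIAT_Total values were computed by replacing missing values with scores of 5. If so, the code here might not be needed, and it might even be problematic, because the recomputed totals would no longer match the way the original totals were built.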
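One way to test the hypothesis in the note is to fill the missing items with each candidate value, sum the items, and see how often the result equals the stored total. The sketch below does this on invented toy data; the column names, the `share_matching` helper, and the toy totals (made up so that a fill of 5 reproduces them) are assumptions for illustration and say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data: three hypothetical item columns and a stored total.
# The totals are invented so that filling NaN with 5 reproduces them.
toy = pd.DataFrame({
    "item_01": [1, 2, np.nan],
    "item_02": [3, np.nan, 4],
    "item_03": [0, 5, 1],
    "total":   [4, 12, 10],
})
items = ["item_01", "item_02", "item_03"]

def share_matching(df, item_cols, total_col, fill_value):
    """Fraction of rows whose stored total equals the item sum after filling NaN with fill_value."""
    recomputed = df[item_cols].fillna(fill_value).sum(axis=1)
    return float((recomputed == df[total_col]).mean())

# Try every possible item score (0 through 5) as the fill value.
match_by_fill = {fill: share_matching(toy, items, "total", fill) for fill in range(6)}
print(match_by_fill)
```

On the real data, the same check could be run with `pciats` as the item columns and `'PCIAT-PCIAT_Total'` as the total; a fill value whose share is close to 1 would support the corresponding hypothesis.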

In [15]:
import pandas as pd
from sklearn.impute import KNNImputer

In [16]:
#This is the starting data.
train_cleaned=pd.read_csv('train_cleaned.csv')

**Removing Participants with No Data**

Many participants don't have data for any of the 20 PCIAT variables. We'll remove these from the data file.

In [17]:
#First we'll create a list of columns that hold the PCIAT values
pciats = [col for col in train_cleaned.columns if 'PCIAT' in col]
pciats.remove('PCIAT-Season')
pciats.remove('PCIAT-PCIAT_Total')

#Remove rows where all values in pciats are NaN, then renumber the rows
train_imp_KNN = train_cleaned.dropna(subset=pciats, how='all').reset_index(drop=True)

**Imputing Missing Values**

Next we'll use KNN to impute the missing values.

In [18]:
#Define the imputer
Number_Neighbors=5
imputer = KNNImputer(n_neighbors=Number_Neighbors, weights='uniform', metric='nan_euclidean')

#imputer.fit_transform returns a numpy array, so convert the output back to a pandas dataframe.
#Passing the original index keeps the rows aligned with train_imp_KNN.
imputations=imputer.fit_transform(train_imp_KNN[pciats])
df2 = pd.DataFrame(imputations, columns=pciats, index=train_imp_KNN.index)

#Fill only the missing entries; observed values are left unchanged.
#Note that imputed values are neighbor averages, so they may be non-integer.
train_imp_KNN[pciats]=train_imp_KNN[pciats].fillna(df2)
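To make the imputation step concrete, here is a minimal sketch on a made-up two-column frame (the `q1` and `q2` columns and their values are hypothetical). It uses the same `KNNImputer` settings as above except for a smaller `n_neighbors`, and shows that a missing entry is replaced by the average of that column over the nearest rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: row 2 is missing its q2 value.
toy = pd.DataFrame({
    "q1": [1.0, 2.0, 3.0, 8.0],
    "q2": [1.0, 2.0, np.nan, 8.0],
})

# Distances are computed on the columns both rows have observed (here q1),
# so the two nearest neighbors of row 2 are rows 1 and 0.
imputer = KNNImputer(n_neighbors=2, weights="uniform", metric="nan_euclidean")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

# The imputed q2 for row 2 is the mean of the neighbors' q2 values: (2.0 + 1.0) / 2.
print(filled.loc[2, "q2"])
```

Because the fill is an average, it can be fractional (1.5 here) even when every observed score is a whole number. Whether to round imputed item scores before summing them is a modeling choice this notebook leaves open.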

**Computing PCIAT_Total**

We can now recompute PCIAT_Total as the sum of the 20 PCIAT items, using the imputed values where the originals were missing.

In [19]:
#Recalculate the PCIAT total score.
train_imp_KNN['PCIAT-PCIAT_Total'] = train_imp_KNN[pciats].sum(axis=1)

**Computing sii Values**

The sii values are based on cutpoints, so we can recompute them from the new PCIAT_Total values.

In [20]:
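#Now we can calculate a new sii score with the imputed values.
#The bin edges are the upper bounds of each band, so the intervals must be closed on the right:
#[0, 30] -> 0, (30, 49] -> 1, (49, 79] -> 2, (79, 100] -> 3.
#With right=False, totals of exactly 30, 49, or 79 land one category too high and a total of 100 becomes NaN.
bins = [0, 30, 49, 79, 100]
labels = [0, 1, 2, 3]
train_imp_KNN['sii'] = pd.cut(train_imp_KNN['PCIAT-PCIAT_Total'], bins=bins, labels=labels, include_lowest=True)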
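Which side of each bin is closed decides where totals that sit exactly on a cutpoint end up. The sketch below compares left-closed and right-closed bins on made-up boundary totals, assuming the intended bands are 0 to 30, 31 to 49, 50 to 79, and 80 to 100 (the text above does not list them, so treat these as an assumption).

```python
import pandas as pd

# Made-up totals sitting on and next to each assumed cutpoint.
totals = pd.Series([0, 30, 31, 49, 50, 79, 80, 100])
bins = [0, 30, 49, 79, 100]
labels = [0, 1, 2, 3]

# Left-closed intervals: [0, 30), [30, 49), [49, 79), [79, 100)
left_closed = pd.cut(totals, bins=bins, labels=labels, right=False)

# Right-closed intervals with the lowest edge included: [0, 30], (30, 49], (49, 79], (79, 100]
right_closed = pd.cut(totals, bins=bins, labels=labels, include_lowest=True)

comparison = pd.DataFrame({"total": totals, "left_closed": left_closed, "right_closed": right_closed})
print(comparison)
```

With left-closed bins, a total of 30 is labeled 1 and a total of 100 falls outside every bin and becomes NaN; with right-closed bins, every total from 0 to 100 receives a label. Fractional totals, which KNN imputation can produce, fall into whichever interval contains them, so a total of 30.4 is labeled 1 under the right-closed scheme.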

**Output to CSV**

Finally, we'll write the result to a CSV file for future experimentation.

In [21]:
train_imp_KNN.to_csv('train_cleaned_outcome_imputed.csv', index=False)