### Implementing a Combined Linear Interpolation and Weighted KNN Imputation

In [2]:
import pandas as pd
import numpy as np
import matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
import time


How can the linear interpolation be evaluated against the Weighted KNN?

1. Create a masked dataset.
2. Create an interpolated dataset from the masked dataset.
3. Iterate through the original dataset; wherever a cell has a value in the original dataset but is NaN in the masked dataset, use its indices to look up the value in the interpolated dataset.
4. Add that interpolated value to a separate dataset that contains only the interpolated masked values.
5. Find the column-wise sum of the lab values in the original dataset and in the masked dataset.
6. Subtract the masked dataset's sums from the original dataset's sums. The difference is the column-wise sum of the ground-truth values that were masked.
7. Find the column-wise sum of the interpolated masked values.
8. Compare the sums of the ground-truth masked values with the sums of the interpolated masked values using a t-test.
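
The steps above can be sketched on a small toy table. Everything in this block is an assumption made for illustration: the column names, the values, and the choice of masked cells are hypothetical, and because the notes only say "a t-test", the paired form of the statistic is a guess rather than the method actually used. Only the linear interpolation side is shown.

```python
import numpy as np
import pandas as pd

# Hypothetical ground truth: 6 time points (rows) x 3 lab columns, no missing values
original = pd.DataFrame({
    'lab_a': [1.0, 2.0, 4.0, 7.0, 11.0, 16.0],
    'lab_b': [5.0, 5.5, 7.0, 6.5, 6.0, 5.0],
    'lab_c': [3.0, 3.0, 2.0, 4.0, 8.0, 9.0],
})

# Step 1: mask a few interior cells (chosen by hand here so the example is deterministic)
mask = pd.DataFrame(False, index=original.index, columns=original.columns)
mask.loc[[1, 3], 'lab_a'] = True
mask.loc[2, 'lab_b'] = True
mask.loc[4, 'lab_c'] = True
masked = original.mask(mask)

# Step 2: linear interpolation down each column
interpolated = masked.interpolate(method='linear', axis=0)

# Steps 3-4: keep only the interpolated values at cells that were masked
was_masked = original.notna() & masked.isna()
interp_masked = interpolated.where(was_masked)

# Steps 5-6: original sums minus masked sums = sums of the ground-truth masked values
truth_sums = original.sum() - masked.sum()

# Step 7: column-wise sums of the interpolated masked values (NaN cells are skipped)
interp_sums = interp_masked.sum()

# Step 8: paired t statistic over the columns (assumed form of the t-test)
diff = (truth_sums - interp_sums).to_numpy()
t_stat = diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))

print(pd.DataFrame({'truth': truth_sums, 'interpolated': interp_sums}))
print('paired t statistic:', round(t_stat, 3))
```

If SciPy is available, `scipy.stats.ttest_rel(truth_sums, interp_sums)` would return the same statistic together with a p-value. The same masked cells could be scored for the Weighted KNN imputation by swapping out step 2.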

If a dataframe where each row represents one time point is needed, it could be built by:
- Making one dataframe per time point that keeps only that visit's columns (i.e. df1, df2, df3, df4, and df5 hold the values for visits 1, 2, 3, 4, and 5, respectively)
- Combining/stacking these dataframes
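
A minimal sketch of this two-step idea on a toy table follows. The lab names (`crp`, `hgb`), the patient ids, and the convention that the visit number is the last character of each column name are assumptions for illustration, not taken from the real files.

```python
import pandas as pd

# Hypothetical wide table: two labs measured at visits 1 and 2
wide = pd.DataFrame({
    'patient_id': [101, 102],
    'crp1': [4.0, 9.5], 'hgb1': [13.1, 11.8],
    'crp2': [3.2, None], 'hgb2': [13.4, 12.0],
})

visit_frames = []
for visit in (1, 2):
    suffix = str(visit)
    visit_cols = [c for c in wide.columns if c.endswith(suffix)]
    # Keep the id plus this visit's columns; copy so `wide` is left untouched
    df_visit = wide[['patient_id'] + visit_cols].copy()
    # Strip the visit suffix so every per-visit frame shares the same headers
    df_visit.columns = ['patient_id'] + [c[:-len(suffix)] for c in visit_cols]
    df_visit['visit'] = visit
    visit_frames.append(df_visit)

# Stack the per-visit frames: one row per patient per visit
stacked = pd.concat(visit_frames, ignore_index=True)
print(stacked)
```

Missing measurements stay as NaN in the stacked frame, so they remain available for imputation later.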

### Expand the Dataset: Each Time Point Has a Row

First, the dataset needs to be transformed so that there is one time point per row for each patient, meaning each patient will have five rows.
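
As an aside, pandas also ships `pd.wide_to_long`, which can do this kind of reshape in a single call. This is a swapped-in alternative, not the approach used in this notebook, and the sketch below assumes hypothetical column names of the form `<lab><year>` (for example `crp1`, `crp2`); the real column names may not fit that pattern.

```python
import pandas as pd

# Hypothetical wide table with two labs over two years
wide = pd.DataFrame({
    'patient_id': [101, 102],
    'ibd_disease_code': [1, 2],
    'crp1': [4.0, 9.5], 'crp2': [3.2, 8.1],
    'hgb1': [13.1, 11.8], 'hgb2': [13.4, 12.0],
})

long_df = pd.wide_to_long(
    wide,
    stubnames=['crp', 'hgb'],              # column prefixes shared across years
    i=['patient_id', 'ibd_disease_code'],  # id columns; together they must be unique per row
    j='year',                              # name of the new column holding the suffix
    sep='',                                # no separator between prefix and year
    suffix=r'\d+',                         # the year suffix is numeric
).reset_index()

# Sort so each patient's years appear together
long_df = long_df.sort_values(['patient_id', 'year']).reset_index(drop=True)
print(long_df)
```

The manual loop is more forgiving when column names are irregular, whereas `wide_to_long` requires every time-varying column to share a clean prefix and suffix pattern.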

In [4]:
# Load datasets
ibd_labs = pd.read_csv('/Users/emmadyer/Desktop/long_ibd_data/data/ibd_reduced.csv')
healthy_labs = pd.read_csv('/Users/emmadyer/Desktop/long_ibd_data/data/healthy_reduced.csv')

all_labs = pd.concat([ibd_labs, healthy_labs], axis=0)

In [5]:
labs = [ibd_labs, healthy_labs, all_labs]
expand_labs = []
years = ['1', '2', '3', '4', '5']
ids = ['patient_id', 'ibd_disease_code']
col_lst = list(ibd_labs.columns.values)

for lab_df in labs:
    annual_data = []
    for y in years:
        # Columns for this year. The year is assumed to be the final character
        # of the column name, which is what the [:-1] rename below relies on.
        year_cols = [c for c in col_lst if c not in ids and c.endswith(y)]
        # .copy() makes annual_df independent of lab_df, which avoids the
        # SettingWithCopyWarning when the 'year' column is added
        annual_df = lab_df[ids + year_cols].copy()
        # Rename column headers to remove the year
        annual_df.columns = ids + [c[:-1] for c in year_cols]
        # Add a column keeping track of the year
        annual_df['year'] = int(y)
        annual_data.append(annual_df)

    # Stack the five yearly frames: one row per patient per year
    expand_df = pd.concat(annual_data)
    expand_labs.append(expand_df)

In [11]:
ibd_expanded = expand_labs[0]
healthy_expanded = expand_labs[1]
all_expanded = expand_labs[2]

all_expanded.to_csv('/Users/emmadyer/Desktop/long_ibd_data/data/all_expanded.csv')

# Separate UC and CD
cd_expanded = ibd_expanded.loc[ibd_expanded['ibd_disease_code'] == 1]
uc_expanded = ibd_expanded.loc[ibd_expanded['ibd_disease_code'] == 2]

# Export All IBD data (keeping the disease code)
#ibd_expanded.to_csv('/Users/emmadyer/Desktop/ibd_long_project/data/ibd_expanded.csv', index=False)

# Drop the disease code
cd_expanded = cd_expanded.drop('ibd_disease_code', axis=1)
uc_expanded = uc_expanded.drop('ibd_disease_code', axis=1)
healthy_expanded = healthy_expanded.drop('ibd_disease_code', axis=1)

In [12]:
# Export to CSV without disease code
cd_expanded.to_csv('/Users/emmadyer/Desktop/ibd_long_project/data/cd_expanded.csv', index=False)
uc_expanded.to_csv('/Users/emmadyer/Desktop/ibd_long_project/data/uc_expanded.csv', index=False)
healthy_expanded.to_csv('/Users/emmadyer/Desktop/ibd_long_project/data/healthy_expanded.csv', index=False)
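
Since the expansion is meant to leave every patient with five rows (one per year), a quick check along the following lines could be run on any of the expanded frames. The helper name `check_rows_per_patient` and the toy data are hypothetical; only the `patient_id` column name comes from the notebook.

```python
import pandas as pd

def check_rows_per_patient(expanded, expected_rows=5):
    """Return the row counts of patients that do not have `expected_rows` rows."""
    counts = expanded.groupby('patient_id').size()
    return counts[counts != expected_rows]

# Hypothetical toy frame: patient 1 has five yearly rows, patient 2 only four
toy = pd.DataFrame({
    'patient_id': [1] * 5 + [2] * 4,
    'year': [1, 2, 3, 4, 5, 1, 2, 3, 4],
})

bad = check_rows_per_patient(toy)
print(bad)  # an empty result would mean every patient has the expected count
```

For example, `check_rows_per_patient(cd_expanded)` should come back empty if the reshape worked as intended.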