# Data Validation and Sanity Checks

In this notebook, we compare the outputs of our `GeneExpressionData` class with the original expression matrices and labels.

In [1]:
import os
import sys
import linecache

import numpy as np
import pandas as pd
import torch
from torch.utils.data import *
from tqdm import tqdm
from pytorch_tabnet.tab_model import TabNetClassifier

# Make the project root and the src/ directory importable from the notebook
sys.path.append('../src/')
sys.path.append('..')

from src.models.lib.neural import GeneClassifier

In [2]:
from src.models.lib.data import *
from src.helper import *

## Intersection between reference columns and current dataset columns

Here, we write test code for the method that maps a sample from an arbitrary dataset onto a given list of reference columns. Every downstream comparison depends on this mapping, so it is extremely important that it is correct.

In [3]:
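def clean_sample(sample, refgenes, currgenes):
    # np.intersect1d returns (sorted common genes, indices into currgenes, indices into refgenes)
    intersection = np.intersect1d(currgenes, refgenes, return_indices=True)
    indices = intersection[1] # Indices in currgenes of the genes that also appear in refgenes

    # Genes lie along axis 1 for a batch of samples and along axis 0 for a single sample
    axis = (1 if sample.ndim == 2 else 0)
    # Do not sort the expression values here: sorting would break the
    # correspondence between each value and its gene in currgenes
    sample = np.take(sample, indices, axis=axis)

    return torch.from_numpy(sample)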
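
To make the index bookkeeping concrete, here is a minimal sketch of what `np.intersect1d` returns when `return_indices=True`. The gene names are made up for illustration and are not taken from the datasets.

```python
import numpy as np

# Hypothetical gene lists, for illustration only
refgenes = np.array(["a", "b", "c"])
currgenes = np.array(["c", "d", "b", "a"])

# Returns the sorted common values plus the index of each one in both inputs
common, curr_idx, ref_idx = np.intersect1d(currgenes, refgenes, return_indices=True)

print(common)    # ['a' 'b' 'c']
print(curr_idx)  # [3 2 0]
print(ref_idx)   # [0 1 2]

# Selecting with curr_idx lists the current dataset's genes in sorted name order
print(currgenes[curr_idx])  # ['a' 'b' 'c']
```

The selected columns come out in sorted gene-name order rather than in the order of `refgenes`, so every dataset mapped this way ends up with the same column order as long as it contains all of the reference genes.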

### Unit Test Example 01

In [4]:
def test1():
    ref = ['a', 'b', 'c']
    curr = ['b', 'a', 'c', 'd'] 
    sample = np.array([1,2,3,4]) # Want --> [2,1,3]

    result = clean_sample(sample, ref, curr)
    desired = torch.from_numpy(np.array([2,1,3]))
    
    assert torch.equal(result, desired)
    
def test2():
    ref = ['a', 'b', 'c']
    curr = ['c', 'd', 'b', 'a']

    sample = np.array(
        [[1,2,3,4],
         [5,6,7,8]]
    ) 
    # --> want [[4, 3, 1],
    #           [8, 7, 5]]

    res = clean_sample(sample, ref, curr)
    desired = torch.from_numpy(np.array([
        [4,3,1],
        [8,7,5]
    ]))
    
    assert torch.equal(res, desired)
    
test1()
test2()

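Both tests pass, so `clean_sample` returns the expected columns for these inputs. Note that both samples are already in ascending order, so these tests would not catch a bug that reorders the expression values.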
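
A complementary check, sketched here under the assumption that each expression value must stay attached to its gene, uses a sample whose values are not in ascending order. `select_reference_columns` is a hypothetical NumPy-only stand-in for the selection step, not the notebook's own function, so the block runs without the project code.

```python
import numpy as np

def select_reference_columns(sample, refgenes, currgenes):
    """NumPy-only stand-in for the column selection step (hypothetical helper)."""
    _, curr_idx, _ = np.intersect1d(currgenes, refgenes, return_indices=True)
    # Genes lie along the last axis for both single samples and batches
    return np.take(sample, curr_idx, axis=-1)

def test_unsorted_values():
    ref = ["a", "b", "c"]
    curr = ["c", "d", "b", "a"]
    # Values are deliberately not in ascending order
    sample = np.array([[9, 0, 5, 7],
                       [2, 8, 4, 6]])
    # Genes a, b, c sit at positions 3, 2, 0 of curr
    desired = np.array([[7, 5, 9],
                        [6, 4, 2]])
    result = select_reference_columns(sample, ref, curr)
    assert np.array_equal(result, desired)

test_unsorted_values()
```

Because the values are shuffled, this test fails if the selection step reorders values independently of their gene names.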

## Validation of GeneExpressionData class with the original expression matrices and label files 

In this section, we confirm that the `GeneExpressionData` class returns the correct (sample, label) pairs relative to the original expression matrices and raw label files.

In [3]:
datafiles, labelfiles = list(INTERIM_DATA_AND_LABEL_FILES_LIST.keys()), list(INTERIM_DATA_AND_LABEL_FILES_LIST.values())

datafiles = [os.path.join('..', 'data', 'interim', f) for f in datafiles]
labelfiles = [os.path.join('..', 'data', 'processed/labels', f) for f in labelfiles]


Next, we define a function that takes the first `N` samples from the GeneExpressionData object and from the raw expression matrix and compares the samples to make sure they are equal.

In [9]:
N = 5

def test_first_n(n, datafile, labelfile):
    data = GeneExpressionData(datafile, labelfile, 'Type', skip=3)
    cols = data.columns
    
    # Read every column as single precision (float32) so the matrix fits in 16 GB of memory
    dtype_cols = dict(zip(cols, [np.float32]*len(cols)))
    
    # Read 2*n rows: the `cell` indices in the label file skip some rows,
    # so the first n labels can point past row n of the matrix
    data_df = pd.read_csv(datafile, nrows=2*n, header=1, dtype=dtype_cols)
    label_df = pd.read_csv(labelfile, nrows=n)

    similar = []
    for i in range(n):
        idx = label_df.loc[i, 'cell']
        
        datasample = data[i][0]
        dfsample = torch.from_numpy(data_df.loc[idx, :].values).float()
        
        # Compare element-wise with a floating-point tolerance, then reduce to a single bool
        isclose = torch.isclose(datasample, dfsample).all().item()
        
        similar.append(isclose)
    
    print(f"First {n=} columns of expression matrix is equal to GeneExpressionData: {all(similar)}")

for datafile, labelfile in zip(datafiles, labelfiles):
    print(f'{datafile=}')
    test_first_n(N, datafile, labelfile)
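
The row lookup in `test_first_n` relies on the `cell` column of the label file naming the row of the matching sample in the expression matrix. The sketch below illustrates that lookup on small in-memory CSVs. The file contents and gene names are invented, and the real files may be laid out differently.

```python
import io

import numpy as np
import pandas as pd

# Invented expression matrix: one row per cell, one column per gene
matrix_csv = io.StringIO(
    "g1,g2\n"
    "0.0,1.0\n"
    "0.5,1.5\n"
    "2.0,3.0\n"
    "4.0,5.0\n"
)
# Invented labels: `cell` is the row position in the matrix; row 2 has no label
labels_csv = io.StringIO(
    "cell,Type\n"
    "0,7\n"
    "1,7\n"
    "3,15\n"
)

data_df = pd.read_csv(matrix_csv, dtype=np.float32)
label_df = pd.read_csv(labels_csv)

# The i-th labelled sample is the matrix row named by label_df.loc[i, 'cell'],
# so the third sample (i = 2) comes from matrix row 3, not row 2
rows = [data_df.loc[label_df.loc[i, "cell"], :].values for i in range(len(label_df))]
third = rows[2]
print(third)  # [4. 5.]
```

This is also why the matrix has to be read with more than `n` rows: once some cells are missing from the labels, the `n`-th label can point beyond row `n`.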
    

In [10]:
test_first_n(5, datafiles[1], labelfiles[1])

1 False
2 False
3 False
4 False
5 False
First n=5 columns of expression matrix is equal to GeneExpressionData: False


In [22]:
data = GeneExpressionData(datafiles[1], labelfiles[1], 'Type', skip=3)

data_df = pd.read_csv(datafiles[1], header=1, nrows=10)
label_df = pd.read_csv(labelfiles[1], nrows=10)

test_idx = label_df.loc[0, 'cell']
test_idx

1

In [29]:
s1 = data_df.loc[test_idx, :].values
s2 = data[0][0]

all(np.isclose(s1, s2))

True

In [30]:
label_df

Unnamed: 0,cell,Type
0,1,7
1,2,7
2,3,7
3,4,7
4,5,7
5,6,7
6,7,7
7,8,7
8,9,7
9,11,15


## TabNet Classifier validation

Since the TabNet package is designed to be used with an `sklearn`-style API, we write a custom `pl.LightningModule` with the TabNet classifier as the base class and check that its `forward` method behaves as expected. In essence, we are validating that our wrapper does not change any of TabNet's internals.

In [4]:
from pytorch_tabnet.tab_model import TabNetClassifier
from models.lib.neural import TabNetGeneClassifier

train, val, test = generate_dataloaders(datafiles=datafiles, labelfiles=labelfiles, class_label='Type', skip=3, batch_size=4, num_workers=0)
sample = next(iter(train))

model = TabNetGeneClassifier(input_dim=19765, output_dim=17)

Model initialized. input_dim = 19765, output_dim = 17. Metrics are dict_keys(['accuracy', 'precision', 'recall']) and weighted_metrics = False


In [5]:
model(sample)

AttributeError: 'tuple' object has no attribute 'dim'