# Test for data leaks in split()

I noticed a couple of weeks ago that

```
sum(RTrain == RValidation) != 0
```

More recently, our autoencoder models have had lower validation error than training error. Could we have a data leak, i.e. examples that appear in more than one set?
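
One caveat about the expression above: `RTrain == RValidation` is also true at every position where both filters are 0, so its sum is nonzero even for perfectly disjoint sets. A minimal sketch on made-up 2x2 masks (`train_mask` and `val_mask` are illustrative, not the project's data) compares it with an element-wise AND, which counts only positions selected by both:

```python
import numpy as np

# Toy filters, invented for illustration: 1 marks an entry assigned to the set.
train_mask = np.array([[1, 0],
                       [0, 0]])
val_mask = np.array([[0, 1],
                     [0, 0]])

# Equality also matches the positions where both masks are 0.
equal_count = int(np.sum(train_mask == val_mask))

# Logical AND counts only entries that are selected by both masks.
overlap_count = int(np.sum(np.logical_and(train_mask, val_mask)))

print("equal_count:", equal_count)
print("overlap_count:", overlap_count)
```

Here the two masks share no selected entry, yet the equality count is still nonzero because of the two positions where both are 0.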

In [1]:
import logging
from   setupLogging import setupLogging
configFilePath = setupLogging( default_path='src/test/logging.test.ini.json')
logger = logging.getLogger("notebook")
logger.info("using logging configuration file:{}".format(configFilePath))

from   DEMETER2.lowRankMatrixFactorizationEasyOfUse \
            import LowRankMatrixFactorizationEasyOfUse as LrmfEoU

import numpy as np

[INFO <ipython-input-1-a21b3a7d137a>:5 - <module>()] using logging configuration file:src/test/logging.test.ini.json


In [2]:
dataDir = "data/"
dataFileName = "D2_Achilles_gene_dep_scores.tsv"
numFeatures = 19
geneFilterPercent = 0.25 
holdOutPercent = 0.40 
easyOfUse = LrmfEoU(dataDir, dataFileName, numFeatures, geneFilterPercent, holdOutPercent)

In [3]:
resultsDict = easyOfUse.loadAll()

In [4]:
# clean, tidy version of the DEMETER2 data
Y, R, cellLinesNames, geneNames = resultsDict["DEMETER2"]
geneDependencies = Y
print("geneDependencies.shape:{}".format(geneDependencies.shape))

# trained model
# optimizeResult is a scipy.optimize.OptimizeResult
X, Theta, optimizeResult = resultsDict["LowRankMatrixFactorizationModel"]
genes = X
print("genes.shape:{}".format(genes.shape))
cellLines = Theta
print("cellLines.shape:{}".format(cellLines.shape))

# knockout logical filters, used to select the Y train, validation, and test values
RTrain, RValidation, RTest = resultsDict["filters"]

print("number of observed values in training set: {}".format(np.sum(RTrain)))
print("size of training set: {}".format(np.size(RTrain)))

geneDependencies.shape:(11193, 501)
genes.shape:(11193, 19)
cellLines.shape:(501, 19)
number of observed values in training set: 3364160
size of training set: 5607693


# Test whether we have data leaks in the knockout matrices

In [5]:
totalR = np.sum(R)
print("totalR:{}".format(totalR))

totalRTrain = np.sum(RTrain)
print("totalRTrain:{}".format(totalRTrain))

totalRValidation = np.sum(RValidation)
print("totalRValidate:{}".format(totalRValidation))

totalRTest = np.sum(RTest)
print("totalRTest:{}".format(totalRTest))

print()
print("totalRTrain + totalRValidation + totalRTest =  {}".\
     format(totalRTrain + totalRValidation + totalRTest))

# weak evidence that we do not have a data leak: if an entry were counted in two
# sets, the lhs and rhs would not be equal (unless another entry was dropped)
assert totalR == totalRTrain + totalRValidation + totalRTest

totalR:5606895
totalRTrain:3364160
totalRValidate:1121380
totalRTest:1121355

totalRTrain + totalRValidation + totalRTest =  5606895
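
Matching totals can still hide a leak: if one observed entry is assigned to two sets and another is assigned to none, the counts cancel. A small sketch with invented masks (all names here are hypothetical) in which the totals agree even though train and validation share an entry:

```python
import numpy as np

# Invented example: four observed entries in a 2x2 matrix.
R_toy = np.array([[1, 1],
                  [1, 1]])

# Entry (0, 1) is in both train and validation; entry (1, 1) is in no set.
train_toy = np.array([[1, 1],
                      [0, 0]])
val_toy = np.array([[0, 1],
                    [1, 0]])
test_toy = np.zeros((2, 2), dtype=int)

totals_match = np.sum(R_toy) == np.sum(train_toy) + np.sum(val_toy) + np.sum(test_toy)
shared = int(np.sum(np.logical_and(train_toy, val_toy)))

print("totals match:", bool(totals_match))
print("entries shared by train and validation:", shared)
```

This is why the sum check is only weak evidence and a direct overlap test is still needed.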


In [6]:
# if no data leak sum should be zero
print( np.sum( np.multiply(RTrain, RValidation) ) )
print( np.sum( np.multiply(RTrain, RTest) ) )
print( np.sum( np.multiply(RValidation, RTest) ) )

0
0
0
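
The three pairwise checks, together with a check that the filters cover every observed entry, could be wrapped in one helper. This is a sketch, not part of `LrmfEoU`; the function name `check_partition` and its return format are assumptions made for illustration.

```python
import itertools
import numpy as np

def check_partition(R, filters):
    """Return (overlaps, uncovered) for a dict of named 0/1 filters.

    overlaps maps each pair of names to the number of shared entries;
    uncovered counts observed entries in R that no filter selects.
    """
    overlaps = {}
    for (name_a, a), (name_b, b) in itertools.combinations(filters.items(), 2):
        overlaps[(name_a, name_b)] = int(np.sum(np.logical_and(a, b)))

    # Union of all filters, compared against the observed-entry mask R.
    union = np.zeros(np.shape(R), dtype=bool)
    for mask in filters.values():
        union |= np.asarray(mask, dtype=bool)
    uncovered = int(np.sum(np.logical_and(R, ~union)))
    return overlaps, uncovered

# Toy data: three disjoint filters that together cover R.
R_demo = np.ones((2, 2), dtype=int)
filters_demo = {
    "train": np.array([[1, 1], [0, 0]]),
    "validation": np.array([[0, 0], [1, 0]]),
    "test": np.array([[0, 0], [0, 1]]),
}
overlaps, uncovered = check_partition(R_demo, filters_demo)
print(overlaps)
print("uncovered:", uncovered)
```

With the notebook's data this would be called as `check_partition(R, {"train": RTrain, "validation": RValidation, "test": RTest})`.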


In [7]:
# select the indices where R == 1
# create two sets of indices, A and B
# the intersection of A and B should be empty

In [8]:
# create some test data
aR = np.eye(3, dtype=bool)
print(aR)
bR = np.flip(aR, 0)
print()
print(bR)

[[ True False False]
 [False  True False]
 [False False  True]]

[[False False  True]
 [False  True False]
 [ True False False]]


In [9]:
# figure out how to get the idx for values = 1
#aR[ aR == 1]
aRIdx = np.argwhere(aR == 1)
print("aRIdx:\n{}".format(aRIdx))

bRIdx = np.argwhere(bR == 1)
print("\nbRIdx:\n{}".format(bRIdx))


aRIdx:
[[0 0]
 [1 1]
 [2 2]]

bRIdx:
[[0 2]
 [1 1]
 [2 0]]


In [10]:
aRIdx == bRIdx

array([[ True, False],
       [ True,  True],
       [ True, False]])

In [11]:
# print( (aRIdx == bRIdx).all(axis=0) )
print()

# find the rows of aRIdx == bRIdx where both the row and the column index match
print( (aRIdx == bRIdx).all(axis=1) )


[False  True False]
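
The row-wise comparison above only matches index pairs that sit at the same position in both lists, and it needs `aRIdx` and `bRIdx` to have the same shape. A sketch of an order-independent alternative turns each `argwhere` result into a set of `(row, col)` tuples and intersects them (the helper name `shared_indices` is made up for this example):

```python
import numpy as np

def shared_indices(mask_a, mask_b):
    """Return the set of (row, col) positions that are nonzero in both masks."""
    idx_a = set(map(tuple, np.argwhere(mask_a).tolist()))
    idx_b = set(map(tuple, np.argwhere(mask_b).tolist()))
    return idx_a & idx_b

aR = np.eye(3, dtype=bool)
bR = np.flip(aR, 0)

# The two diagonals cross only at the centre of the 3x3 matrix.
common = shared_indices(aR, bR)
print(common)
```

An empty result would mean the two masks select no entry in common. For matrices as large as the real filters, the element-wise product used earlier is likely cheaper than building Python sets.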


# Are missing values the reason validation error is lower than training error?
git commit 5fdb1c9

We fill missing values in Y with tokenStr = '66,666,666.66666'.

We also removed a lot of genes. There are only 798 missing values.

I thought that maybe the validation error is lower than test because the validation set does not contain any of these 798 extreme values. However, when we run the knockout to select the data sets, the extreme values get replaced with zero, so this does not account for our problem.
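
A toy illustration of why the sentinel cannot reach any of the sets, under the assumption that the knockout selects values by multiplying `Y` element-wise with a 0/1 filter and that missing entries are 0 in every filter. The actual `LrmfEoU` implementation is not shown here, so treat this as a sketch of the idea only.

```python
import numpy as np

token = 66666666.66666

# Invented data: one missing value, marked with the sentinel.
Y_toy = np.array([[0.5, token],
                  [-1.2, 0.3]])

# Assumed filter: the missing entry is not selected (0).
train_toy = np.array([[1, 0],
                      [0, 1]])

# Element-wise selection replaces every unselected entry,
# including the sentinel, with 0.
Y_train_toy = np.multiply(Y_toy, train_toy)
print(Y_train_toy)
print("sentinel survives:", bool(np.any(Y_train_toy == token)))
```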

In [21]:
token = 66666666.66666
missingY = Y == token # np.where(Y == token)
print(missingY.shape)
print(np.sum(missingY))

(11193, 501)
798


In [40]:
# fancy indexing example
# (uses its own array so the model's X from In [4] is not overwritten)
demo = np.arange(12).reshape((3, 4))

row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
demo[row, col]

array([ 2,  5, 11])

In [28]:
missingYIdx = np.argwhere(Y == token)
print(missingYIdx.shape)
print(missingYIdx[0:3])

(798, 2)
[[11 35]
 [11 64]
 [11 96]]
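
The two columns of an `argwhere` result can be used directly as row and column arrays for fancy indexing, which gives a way to read each filter at the missing positions. A sketch on invented arrays (`Y_small` and `mask_small` are placeholders; on the real data one would presumably index `RTrain`, `RValidation`, and `RTest` with `missingYIdx` in the same way):

```python
import numpy as np

token = 66666666.66666

Y_small = np.array([[0.1, token, 0.2],
                    [token, 0.4, 0.5]])
mask_small = np.array([[1, 0, 1],
                       [0, 1, 0]])

missing_idx = np.argwhere(Y_small == token)   # shape (n_missing, 2)
rows, cols = missing_idx[:, 0], missing_idx[:, 1]

# Filter values at the missing positions; all zeros means none was selected.
selected_at_missing = mask_small[rows, cols]
print(selected_at_missing)
print("missing entries selected:", int(np.sum(selected_at_missing)))
```

If the sum is 0 for every filter, none of the sentinel values can end up in the training, validation, or test data.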
