## Summary
This notebook evaluates an ML model design for its capacity to learn an embedding that distinguishes between different mechanisms of action (MOA) in the BBBC021 dataset. It considers N trained models, where N is the number of chemical compounds with known MOA. Each of the N models was trained with one particular compound left out of its training set. Each model can then be tested against its left-out compound to evaluate how accurately it classifies that compound's MOA using knowledge learned from other compounds sharing the same MOA. The BBBC021 dataset has 12 MOAs and 38 compounds with known MOA (there are several representative compounds per MOA, and up to 8 different concentrations per compound). There are a total of 103 "treatments" with known MOA in the dataset, where a treatment is the application of a particular compound at a particular concentration.
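
As a rough illustration of the leave-one-compound-out protocol described above, the sketch below enumerates the held-out compound and the remaining training compounds for each of the N models. The helper `leave_one_compound_out` is hypothetical (it is not part of `bioims`), and the dictionary is a small subset of the compound-to-MOA map printed later in this notebook.

```python
def leave_one_compound_out(compound_moa):
    """Yield (held_out, training_compounds) pairs, one pair per compound."""
    compounds = sorted(compound_moa)
    for held_out in compounds:
        training = [c for c in compounds if c != held_out]
        yield held_out, training

# Illustrative subset of the compound -> MOA mapping (not the full dataset).
toy_compound_moa = {
    "taxol": "Microtubule stabilizers",
    "docetaxel": "Microtubule stabilizers",
    "AZ-C": "Eg5 inhibitors",
    "AZ138": "Eg5 inhibitors",
}

splits = list(leave_one_compound_out(toy_compound_moa))
for held_out, training in splits:
    # The held-out MOA must still be represented in training by another compound.
    has_sibling = any(
        toy_compound_moa[c] == toy_compound_moa[held_out] for c in training
    )
    print(held_out, "->", training, "| same-MOA compound in training:", has_sibling)
```

Each split corresponds to one trained model; the evaluation only makes sense when at least one other compound with the same MOA remains in the training set.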

During training, the network learns an embedding (a vector space) that places compounds with the same MOA close together and keeps compounds with differing MOAs farther apart. Once trained, a model can be used to predict the MOA of an unknown (or untrained) compound by finding its nearest labeled neighbors in the embedding space.
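
A minimal sketch of nearest-labeled-neighbor prediction in an embedding space follows. It assumes cosine similarity (the metric requested from the search service later in this notebook) and uses made-up two-dimensional vectors; `nearest_label` is a hypothetical helper, not the project's implementation.

```python
import numpy as np

def nearest_label(query, embeddings, labels):
    """Return the label of the embedding most cosine-similar to the query."""
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # For unit vectors, the dot product is the cosine similarity.
    return labels[int(np.argmax(e @ q))]

# Made-up labeled embeddings: two "DNA damage" points and one "Actin disruptors" point.
toy_embeddings = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]])
toy_labels = ["DNA damage", "DNA damage", "Actin disruptors"]

predicted = nearest_label(np.array([0.2, 0.8]), toy_embeddings, toy_labels)
print(predicted)
```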

This notebook assumes that each of the N models has been trained and is available for evaluation, and that each image in the dataset has a computed embedding for each model.

For each of the N "one compound left out" models, the mean embedding of each of the M treatments is computed. An MOA is then assigned to each treatment of the left-out compound (i.e., to each concentration separately) based on its nearest neighbor. This is called NSC, or "Not Same Compound", analysis.
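
The NSC procedure can be sketched as below, under stated assumptions: treatments are keyed by `(compound, concentration)` tuples, similarity is cosine, and all names and vectors are made up for illustration. This is not the code used by the search service.

```python
import numpy as np
from collections import defaultdict

def treatment_means(embeddings, treatments):
    """Average image embeddings per (compound, concentration) treatment."""
    groups = defaultdict(list)
    for vec, treatment in zip(embeddings, treatments):
        groups[treatment].append(vec)
    return {t: np.mean(vs, axis=0) for t, vs in groups.items()}

def nsc_predict(query_treatment, means, compound_moa):
    """Predict MOA from the nearest treatment mean of a different compound."""
    q = means[query_treatment]
    best, best_sim = None, -np.inf
    for (compound, _concentration), vec in means.items():
        if compound == query_treatment[0]:
            continue  # NSC: skip every concentration of the query compound
        sim = float(q @ vec / (np.linalg.norm(q) * np.linalg.norm(vec)))
        if sim > best_sim:
            best, best_sim = compound, sim
    return compound_moa[best]

# Made-up data: compounds A and B share an MOA, C has a different one.
toy_moa = {"A": "moa1", "B": "moa1", "C": "moa2"}
toy_embeddings = np.array(
    [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1], [0.7, 0.3], [0.0, 1.0]]
)
toy_treatments = [("A", 1.0), ("A", 1.0), ("A", 3.0), ("B", 1.0), ("C", 1.0)]

means = treatment_means(toy_embeddings, toy_treatments)
nsc_prediction = nsc_predict(("A", 1.0), means, toy_moa)
print(nsc_prediction)
```

Without the exclusion, the query would simply match another concentration of its own compound, which says nothing about whether the MOA generalizes across compounds.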

A second analysis, NSCB ("Not Same Compound or Batch"), additionally excludes from nearest-neighbor consideration all compounds prepared in the same batch as the left-out compound, so that batch-related characteristics do not bias the results. This is only possible for 10 of the 12 MOAs, because the other 2 have representatives in only a single batch.
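
The NSCB exclusion, and the reason some MOAs cannot be evaluated under it, can be illustrated with a small sketch. The compound names, batch numbers, and helper functions here are invented for the example and do not reflect the actual batch layout of the dataset.

```python
def nscb_candidates(query, treatments):
    """Keep treatments that share neither compound nor batch with the query."""
    return [
        t for t in treatments
        if t["compound"] != query["compound"] and t["batch"] != query["batch"]
    ]

def moa_evaluable(query, treatments):
    """True if some candidate with the query's MOA survives the NSCB filter."""
    return any(t["moa"] == query["moa"] for t in nscb_candidates(query, treatments))

# Invented layout: moa1 spans two batches, moa2 exists only in batch 2.
toy = [
    {"compound": "A", "batch": 1, "moa": "moa1"},
    {"compound": "B", "batch": 1, "moa": "moa1"},
    {"compound": "C", "batch": 2, "moa": "moa1"},
    {"compound": "D", "batch": 2, "moa": "moa2"},
    {"compound": "E", "batch": 2, "moa": "moa2"},
]

print([t["compound"] for t in nscb_candidates(toy[0], toy)])
print(moa_evaluable(toy[0], toy))  # moa1 has a representative in another batch
print(moa_evaluable(toy[3], toy))  # moa2 lives entirely in one batch
```

When every compound of an MOA sits in one batch, the filter removes all same-MOA candidates, so a correct prediction is impossible by construction.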

In [1]:
!pip install shortuuid

Collecting shortuuid
  Using cached shortuuid-1.0.1-py3-none-any.whl (7.5 kB)
Installing collected packages: shortuuid
Successfully installed shortuuid-1.0.1


In [2]:
import sys
import os
import math
import base64
import boto3
import sagemaker
import matplotlib.pyplot as plt
import numpy as np
import collections
from collections import defaultdict
from PIL import Image
import sklearn
from sklearn.metrics import ConfusionMatrixDisplay
from matplotlib.ticker import NullFormatter
from sklearn import manifold, datasets
from time import time
from time import sleep

In [3]:
EMBEDDING_NAME = 'bbbc021-1'
BASELINE_TRAIN_ID = 'gEWDUe21eyQ19FmoBmbp3g'

In [4]:
s3c = boto3.client('s3')

In [5]:
%pwd

'/root/bioimage-search/datasets/bbbc-021/notebooks'

In [19]:
bioimsArtifactBucket='bioims-data-1'
bbbc021Bucket='bioimagesearchbbbc021stack-bbbc021bucket544c3e64-ugln15rb234b'

In [20]:
# assumes cwd=/root/bioimage-search/datasets/bbbc-021/notebooks
sys.path.insert(0, "../../../cli/bioims/src")
import bioims as bi

In [21]:
sys.path.insert(0, "../scripts")
import bbbc021common as bb

In [22]:
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

In [23]:
bucket

'sagemaker-us-east-1-147147579088'

Get ImageID->(compound, concentration) maps

In [24]:
image_df, moa_df = bb.Bbbc021PlateInfoByDF.getDataFrames(bbbc021Bucket)
compound_moa_map = bb.Bbbc021PlateInfoByDF.getCompoundMoaMapFromDf(moa_df)

# Per-image lookups keyed by image source ID (the DAPI file name minus its 4-character extension).
sourceCompoundMap={}
sourceConcentrationMap={}
# Image counts per compound and per MOA (MOA counts cover only compounds with a known MOA).
compoundCountMap={}
moaCountMap={}
for _, r in image_df.iterrows():
    imageSourceId = r['Image_FileName_DAPI'][:-4]
    imageCompound = r['Image_Metadata_Compound']
    sourceCompoundMap[imageSourceId] = imageCompound
    sourceConcentrationMap[imageSourceId] = r['Image_Metadata_Concentration']
    compoundCountMap[imageCompound] = compoundCountMap.get(imageCompound, 0) + 1
    if imageCompound in compound_moa_map:
        imageMoa = compound_moa_map[imageCompound]
        moaCountMap[imageMoa] = moaCountMap.get(imageMoa, 0) + 1

In [25]:
compoundCountMap

{'5-fluorouracil': 96,
 'acyclovir': 96,
 'AG-1478': 192,
 'ALLN': 96,
 'aloisine A': 96,
 'alsterpaullone': 64,
 'anisomycin': 96,
 'aphidicolin': 96,
 'arabinofuranosylcytosine': 96,
 'atropine': 96,
 'bleomycin': 96,
 'bohemine': 64,
 'brefeldin A': 96,
 'bryostatin': 64,
 'calpain inhibitor 2 (ALLM)': 96,
 'calpeptin': 64,
 'camptothecin': 96,
 'carboplatin': 96,
 'caspase inhibitor 1 (ZVAD)': 96,
 'cathepsin inhibitor I': 96,
 'Cdk1 inhibitor III': 96,
 'Cdk1/2 inhibitor (NU6102)': 96,
 'chlorambucil': 96,
 'chloramphenicol': 64,
 'cisplatin': 96,
 'colchicine': 96,
 'cyclohexamide': 96,
 'cyclophosphamide': 64,
 'cytochalasin B': 96,
 'cytochalasin D': 96,
 'demecolcine': 96,
 'deoxymannojirimycin': 64,
 'deoxynojirimycin': 96,
 "3,3'-diaminobenzidine": 96,
 'docetaxel': 96,
 'doxorubicin': 96,
 'emetine': 96,
 'epothilone B': 96,
 'etoposide': 96,
 'filipin': 64,
 'floxuridine': 96,
 'forskolin': 96,
 'genistein': 96,
 'H-7': 96,
 'herbimycin A': 96,
 'hydroxyurea': 96,
 'ICI-18

In [26]:
moaCountMap

{'Protein degradation': 384,
 'Kinase inhibitors': 192,
 'Protein synthesis': 288,
 'DNA replication': 384,
 'DNA damage': 384,
 'Microtubule destabilizers': 384,
 'Actin disruptors': 288,
 'Microtubule stabilizers': 1608,
 'Cholesterol-lowering': 192,
 'Epithelial': 256,
 'Eg5 inhibitors': 192,
 'Aurora kinase inhibitors': 288,
 'DMSO': 1320}

In [27]:
embeddingClient = bi.client('embedding')

In [28]:
imageClient = bi.client('image-management')

In [29]:
trainingConfigurationClient = bi.client('training-configuration')

In [30]:
embeddingInfo = trainingConfigurationClient.getEmbeddingInfo(EMBEDDING_NAME)

In [31]:
plateList = imageClient.listCompatiblePlates(embeddingInfo['inputWidth'], embeddingInfo['inputHeight'], embeddingInfo['inputDepth'], embeddingInfo['inputChannels'])

In [32]:
trainList = trainingConfigurationClient.getEmbeddingTrainings(EMBEDDING_NAME)

In [33]:
trainList

[{'filterBucket': 'bioims-resource-1',
  'sagemakerJobName': 'bioims-2hP5wLvua7kAf9Rb9gwwJ2-YLr9k4CyNfRtiWkNrKP3NV',
  'messageId': '361f04f2-09f7-4142-9858-b406b45bdbbf',
  'filterKey': 'train-filter/bbbc021-1/PD-169316-filter.txt',
  'trainId': '2hP5wLvua7kAf9Rb9gwwJ2',
  'embeddingName': 'bbbc021-1',
  'executeProcessPlate': 'false'},
 {'filterBucket': 'bioims-resource-1',
  'sagemakerJobName': 'bioims-3fyp1t55HMNfkx2Y7kBP41-XXc3HqJZpE22cfbYvj2FqN',
  'messageId': '589f22c6-358c-4926-9cfb-02d4e0a45db9',
  'filterKey': 'train-filter/bbbc021-1/docetaxel-filter.txt',
  'trainId': '3fyp1t55HMNfkx2Y7kBP41',
  'embeddingName': 'bbbc021-1',
  'executeProcessPlate': 'false'},
 {'filterBucket': 'bioims-resource-1',
  'sagemakerJobName': 'bioims-59v3zXoNWqhkHECzydTbGf-Mp35B6grsie9RqYtagusto',
  'messageId': '15d7387c-6717-42f4-a9a6-771eaec31bdd',
  'filterKey': 'train-filter/bbbc021-1/AZ-C-filter.txt',
  'trainId': '59v3zXoNWqhkHECzydTbGf',
  'embeddingName': 'bbbc021-1',
  'executeProcessPla

In [34]:
compound_moa_map

{'PP-2': 'Epithelial',
 'emetine': 'Protein synthesis',
 'AZ258': 'Aurora kinase inhibitors',
 'cytochalasin B': 'Actin disruptors',
 'ALLN': 'Protein degradation',
 'mitoxantrone': 'DNA replication',
 'AZ-C': 'Eg5 inhibitors',
 'MG-132': 'Protein degradation',
 'AZ841': 'Aurora kinase inhibitors',
 'docetaxel': 'Microtubule stabilizers',
 'mitomycin C': 'DNA damage',
 'PD-169316': 'Kinase inhibitors',
 'proteasome inhibitor I': 'Protein degradation',
 'vincristine': 'Microtubule destabilizers',
 'AZ138': 'Eg5 inhibitors',
 'demecolcine': 'Microtubule destabilizers',
 'mevinolin/lovastatin': 'Cholesterol-lowering',
 'AZ-A': 'Aurora kinase inhibitors',
 'alsterpaullone': 'Kinase inhibitors',
 'etoposide': 'DNA damage',
 'floxuridine': 'DNA replication',
 'AZ-U': 'Epithelial',
 'simvastatin': 'Cholesterol-lowering',
 'anisomycin': 'Protein synthesis',
 'nocodazole': 'Microtubule destabilizers',
 'AZ-J': 'Epithelial',
 'taxol': 'Microtubule stabilizers',
 'camptothecin': 'DNA replication'

In [35]:
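def getCompoundLabel(compound):
    # Strip all whitespace and replace '/' with '-',
    # e.g. 'mevinolin/lovastatin' -> 'mevinolin-lovastatin'.
    cnws = "".join(compound.split())
    return cnws.replace('/','-')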
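
For illustration, here is a self-contained copy of the function applied to a few compound names from the map above; the results agree with the labels printed in the next cell.

```python
def getCompoundLabel(compound):
    # Remove all whitespace, then replace '/' with '-'.
    cnws = "".join(compound.split())
    return cnws.replace('/', '-')

examples = ['cytochalasin B', 'mevinolin/lovastatin', 'proteasome inhibitor I']
labels = [getCompoundLabel(name) for name in examples]
for name, label in zip(examples, labels):
    print("{!r} -> {!r}".format(name, label))
```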

In [36]:
label_moa_map = {}
labelCountMap = {}
for c, m in compound_moa_map.items():
    label = getCompoundLabel(c)
    label_moa_map[label] = m
    labelCountMap[label]=compoundCountMap[c]

In [37]:
label_moa_map

{'PP-2': 'Epithelial',
 'emetine': 'Protein synthesis',
 'AZ258': 'Aurora kinase inhibitors',
 'cytochalasinB': 'Actin disruptors',
 'ALLN': 'Protein degradation',
 'mitoxantrone': 'DNA replication',
 'AZ-C': 'Eg5 inhibitors',
 'MG-132': 'Protein degradation',
 'AZ841': 'Aurora kinase inhibitors',
 'docetaxel': 'Microtubule stabilizers',
 'mitomycinC': 'DNA damage',
 'PD-169316': 'Kinase inhibitors',
 'proteasomeinhibitorI': 'Protein degradation',
 'vincristine': 'Microtubule destabilizers',
 'AZ138': 'Eg5 inhibitors',
 'demecolcine': 'Microtubule destabilizers',
 'mevinolin-lovastatin': 'Cholesterol-lowering',
 'AZ-A': 'Aurora kinase inhibitors',
 'alsterpaullone': 'Kinase inhibitors',
 'etoposide': 'DNA damage',
 'floxuridine': 'DNA replication',
 'AZ-U': 'Epithelial',
 'simvastatin': 'Cholesterol-lowering',
 'anisomycin': 'Protein synthesis',
 'nocodazole': 'Microtubule destabilizers',
 'AZ-J': 'Epithelial',
 'taxol': 'Microtubule stabilizers',
 'camptothecin': 'DNA replication',
 '

In [38]:
train_compoundLabel_map = {}

In [39]:
for trainInfo in trainList:
    if 'filterKey' in trainInfo and len(trainInfo['filterKey'])>0:
        filterKey = trainInfo['filterKey']
        print(filterKey)
        a1=filterKey.split('/')
        print(a1)
        a2=a1[2].split("-filter")
        print(a2)
        trainId = trainInfo['trainId']
        print(trainId)
        train_compoundLabel_map[trainId]=a2[0]

train-filter/bbbc021-1/PD-169316-filter.txt
['train-filter', 'bbbc021-1', 'PD-169316-filter.txt']
['PD-169316', '.txt']
2hP5wLvua7kAf9Rb9gwwJ2
train-filter/bbbc021-1/docetaxel-filter.txt
['train-filter', 'bbbc021-1', 'docetaxel-filter.txt']
['docetaxel', '.txt']
3fyp1t55HMNfkx2Y7kBP41
train-filter/bbbc021-1/AZ-C-filter.txt
['train-filter', 'bbbc021-1', 'AZ-C-filter.txt']
['AZ-C', '.txt']
59v3zXoNWqhkHECzydTbGf
train-filter/bbbc021-1/emetine-filter.txt
['train-filter', 'bbbc021-1', 'emetine-filter.txt']
['emetine', '.txt']
5RM6PFqrEU38za91jEGaw2
train-filter/bbbc021-1/epothiloneB-filter.txt
['train-filter', 'bbbc021-1', 'epothiloneB-filter.txt']
['epothiloneB', '.txt']
5ahR5kfxyiXGq4JYmNU1iA
train-filter/bbbc021-1/methotrexate-filter.txt
['train-filter', 'bbbc021-1', 'methotrexate-filter.txt']
['methotrexate', '.txt']
5xnjqwaoU5DDbmAukDbaTR
train-filter/bbbc021-1/AZ258-filter.txt
['train-filter', 'bbbc021-1', 'AZ258-filter.txt']
['AZ258', '.txt']
6M6yWEzTK6TuRbfhSTupHk
train-filter/bbbc
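
The parsing in the cell above relies on the key layout `train-filter/<embedding>/<label>-filter.txt`. A slightly more defensive variant is sketched below; `compound_label_from_filter_key` is a hypothetical helper, and it assumes every filter file name ends in `-filter.txt`.

```python
def compound_label_from_filter_key(filter_key):
    """Extract the compound label from '<prefix>/<label>-filter.txt'."""
    file_name = filter_key.rsplit("/", 1)[-1]
    suffix = "-filter.txt"
    if not file_name.endswith(suffix):
        raise ValueError("unexpected filter key: {}".format(filter_key))
    return file_name[: -len(suffix)]

keys = [
    "train-filter/bbbc021-1/PD-169316-filter.txt",
    "train-filter/bbbc021-1/AZ-C-filter.txt",
]
parsed = [compound_label_from_filter_key(k) for k in keys]
print(parsed)
```

Anchoring on the suffix, rather than splitting on the first occurrence of `-filter`, makes the function fail loudly on keys that do not follow the expected pattern.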

In [40]:
train_compoundLabel_map

{'2hP5wLvua7kAf9Rb9gwwJ2': 'PD-169316',
 '3fyp1t55HMNfkx2Y7kBP41': 'docetaxel',
 '59v3zXoNWqhkHECzydTbGf': 'AZ-C',
 '5RM6PFqrEU38za91jEGaw2': 'emetine',
 '5ahR5kfxyiXGq4JYmNU1iA': 'epothiloneB',
 '5xnjqwaoU5DDbmAukDbaTR': 'methotrexate',
 '6M6yWEzTK6TuRbfhSTupHk': 'AZ258',
 '6MqYixg9TJzi6H9GPMvbzX': 'mevinolin-lovastatin',
 '6vDKLDzef5tg3x9BhwY1CX': 'taxol',
 '7JpraYic58Y41UoNfBEitF': 'AZ-A',
 '7finP77wGa5ESQGHv2CufV': 'bryostatin',
 '8HKNipjmvtveqTCxnNV9fq': 'anisomycin',
 '8UsYsgFEFzc1gLmziy8i5i': 'demecolcine',
 '9sWPv2HAwc7DymRc2FoVrW': 'MG-132',
 'bNa6PZmJ474jpJ5pWJskaL': 'floxuridine',
 'deTme9bcmwLXvNs7ccgSRL': 'latrunculinB',
 'e4TucXBmqu2QhG7yv9Dyk7': 'nocodazole',
 'eNdkW7UYq15WAxt5hbhvet': 'cytochalasinD',
 'eednrUjUhWNGSEEtqvXrtd': 'ALLN',
 'enVvuCwdZcVpGpnE5SuJ2S': 'mitoxantrone',
 'fayY8cotAAa28HS2zfKf72': 'vincristine',
 'fo7WWAuRCuYHM3T4BEZhzG': 'lactacystin',
 'fztGEeznZXpsVQNEHugkxr': 'camptothecin',
 'gfYHZzavYCpe2s26bxLFrT': 'cyclohexamide',
 'hqTvRAmUVR5amUiAABqv85

Check that the counts match. The DMSO control is left out, hence the minus one:

In [41]:
len(train_compoundLabel_map)==len(compound_moa_map)-1

True

In [42]:
tagClient = bi.client("tag")

In [43]:
tagList = tagClient.getAllTags()

In [44]:
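compoundLabel_tag_map = {}
for tag in tagList:
    # Keep only tags of the form 'compound:<label>' and map label -> tag id.
    # (Local names avoid shadowing the built-ins id and type.)
    tagId = tag['id']
    value = tag['tagValue']
    if value.startswith('compound:'):
        a1 = value.split(":")
        compoundLabel_tag_map[a1[1]] = tagId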

In [45]:
compoundLabel_tag_map

{'AZ-U': 18,
 'taxol': 51,
 'alsterpaullone': 26,
 'cyclohexamide': 33,
 'PP-2': 25,
 'camptothecin': 29,
 'floxuridine': 41,
 'PD-169316': 24,
 'demecolcine': 36,
 'anisomycin': 27,
 'mitoxantrone': 47,
 'cytochalasinB': 34,
 'simvastatin': 50,
 'AZ138': 19,
 'AZ258': 20,
 'bryostatin': 28,
 'latrunculinB': 43,
 'proteasomeinhibitorI': 49,
 'methotrexate': 44,
 'AZ-C': 16,
 'nocodazole': 48,
 'vincristine': 52,
 'docetaxel': 37,
 'colchicine': 32,
 'AZ841': 21,
 'MG-132': 23,
 'etoposide': 40,
 'lactacystin': 42,
 'AZ-A': 15,
 'DMSO': 22,
 'cytochalasinD': 35,
 'chlorambucil': 30,
 'epothiloneB': 39,
 'ALLN': 14,
 'emetine': 38,
 'mevinolin-lovastatin': 45,
 'mitomycinC': 46,
 'cisplatin': 31,
 'AZ-J': 17}

In [46]:
searchClient = bi.client("search")

We use the search service to build a histogram of MOA matches, pooling the results for all images of a "left out" compound. The number of top hits counted per query (testCount in the code below, fixed at 10) can be surveyed over a range of values; in practice the result is remarkably insensitive to it.
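
The pooling step can be sketched independently of the search service. The example below assumes each query returns a ranked list of hit image IDs and that a dictionary maps hit IDs to MOAs; the function name and the toy data are invented for illustration.

```python
from collections import Counter

def pooled_moa_histogram(result_lists, hit_moa, top_k=10):
    """Count MOAs over the top_k hits of every query, pooled into one histogram."""
    histogram = Counter()
    for hits in result_lists:
        for hit_id in hits[:top_k]:
            histogram[hit_moa.get(hit_id, "unknown")] += 1
    return histogram

# Two made-up queries with ranked hits; img4 has no known MOA.
toy_results = [["img1", "img2", "img3"], ["img2", "img4", "img1"]]
toy_hit_moa = {"img1": "DNA damage", "img2": "DNA damage", "img3": "Eg5 inhibitors"}

histogram = pooled_moa_histogram(toy_results, toy_hit_moa, top_k=2)
print(histogram)
```

Slicing with `hits[:top_k]` also tolerates queries that return fewer than `top_k` hits.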

In [55]:
def getMoaHistogram(trainId, leftOutCompoundLabel=''):
    # Returns 1 if the best-scoring MOA for the left-out compound matches its true MOA, else 0.
    testSequence = []
# Uncomment to observe the invariance of this parameter
#     for j in range(1,31):
#         testSequence.append(j)
    testSequence.append(10)
    print("***")
    print(trainId)
    if leftOutCompoundLabel == '':
        leftOutCompoundLabel=train_compoundLabel_map[trainId]
    print(leftOutCompoundLabel)
    leftOutMoa = label_moa_map[leftOutCompoundLabel]
    print(leftOutMoa)
    print("===")
    imageInfoMap={}
    dmsoTag = compoundLabel_tag_map['DMSO']
    searchPlateMap = {}
    searchCount=0
    imageListPlateMap={}
    for plate in plateList:
        plateId = plate['plateId']
        #print("plate {}".format(plateId))
        images = imageClient.getImagesByPlateId(plateId)
        imageListPlateMap[plateId] = images
    print("Start search")
    for plate in plateList:
        plateId = plate['plateId']
        images = imageListPlateMap[plateId]
        searchResponses = []
        for image in images:
            imageSourceId = image['Item']['imageSourceId']
            imageId = image['Item']['imageId']
            compound = sourceCompoundMap[imageSourceId]
            compoundLabel = getCompoundLabel(compound)
            concentration = sourceConcentrationMap[imageSourceId]
            if compoundLabel==leftOutCompoundLabel:
                #print("{} {} {} {}".format(imageId, compound, compoundLabel, concentration))
                exclusionTags = []
                tag = compoundLabel_tag_map[compoundLabel]
                exclusionTags.append(tag)
                exclusionTags.append(dmsoTag)
                search = {
                    "trainId" : trainId,
                    "queryImageId" : imageId,
                    "exclusionTags" : exclusionTags,
                    "requireMoa" : "true",
                    "metric" : "Cosine"
                }
                #print(search)
                searchResponse = searchClient.submitSearch(search)
                searchCount += 1
                searchResponses.append(searchResponse)
        searchPlateMap[plateId] = searchResponses
    searchResultsMap={}
    resultCount=0
    for plate in plateList:
        plateId = plate['plateId']
        searchResponses = searchPlateMap[plateId]
        for searchResponse in searchResponses:
            searchId = searchResponse['searchId']
            statusValue = 'submitted'
            while statusValue != 'completed' and statusValue != 'error':
                sleep(1)
                searchStatus = searchClient.getSearchStatus(searchId)
                statusValue = searchStatus['Item']['status']
            if statusValue == 'completed':
                searchResults = searchClient.getSearchResults(searchId)
                if plateId not in searchResultsMap:
                    searchResultsMap[plateId] = []
                searchResultsMap[plateId].append(searchResults)
                resultCount += 1
    print("searchCount={} resultCount={}".format(searchCount, resultCount))
    for testCount in testSequence:
        moaBinCounts = {}
        hitCount=0
        binCount=0
        for plate in plateList:
                plateId = plate['plateId']
                if plateId in searchResultsMap:
                    searchResultsList = searchResultsMap[plateId]
                    for searchResults in searchResultsList:
                        # Guard against searches that return fewer than testCount hits.
                        for i in range(min(testCount, len(searchResults))):
                            hitCount += 1
                            searchResult = searchResults[i]
                            hitImageId = searchResult['imageId']
                            if hitImageId not in imageInfoMap:
                                imageInfo = imageClient.getImageInfo(hitImageId, 'origin')
                                imageInfoMap[hitImageId]=imageInfo
                            imageInfo=imageInfoMap[hitImageId]
                            imageSourceId = imageInfo['Item']['imageSourceId']
                            hitCompound = sourceCompoundMap[imageSourceId]
                            if hitCompound in compound_moa_map:
                                moa = compound_moa_map[hitCompound]
                            else:
                                moa = "unknown"
                            # Tally one hit for this MOA.
                            moaBinCounts[moa] = moaBinCounts.get(moa, 0) + 1
                            binCount += 1
        print("hitCount={} binCount={}".format(hitCount, binCount))
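        # Normalize each bin by the number of images with that MOA. For the left-out MOA,
        # subtract the left-out compound's own images, since they are excluded from the search.
        labelCount = labelCountMap[leftOutCompoundLabel]
        labelMoaCount = moaCountMap[leftOutMoa]
        adjustedLabelMoaCount = labelMoaCount - labelCount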
        bestMoa=''
        bestScore=0.0
        for moa in moaBinCounts:
            c = moaBinCounts[moa]
            m = moaCountMap[moa]
            if moa == leftOutMoa:
                n = c / adjustedLabelMoaCount
            else:
                n = c / m
            if n > bestScore:
                bestMoa=moa
                bestScore=n
            elif n == bestScore and moa==leftOutMoa:
                bestMoa=moa
                bestScore=n
        for moa in moaBinCounts:
            c = moaBinCounts[moa]
            m = moaCountMap[moa]
            if moa == leftOutMoa:
                n = c / adjustedLabelMoaCount
            else:
                n = c / m
            if moa==bestMoa:
                print("{}> {} {} {}".format(testCount, moa, c, n))
            else:
                print("{} {} {} {}".format(testCount, moa, c, n))
        # Comment out below if observing multiple parameter values
        if bestMoa==leftOutMoa:
            return 1
        else:
            return 0
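
The normalization at the end of the function can be checked by hand. The sketch below reproduces three of the scores printed for the first model (PD-169316) in the run output further down. The MOA image counts come from `moaCountMap`; the figure of 64 images for PD-169316 is an inference from `searchCount=64` and from the printed score, since the compound's entry is not visible in the truncated `compoundCountMap` output. `normalized_scores` is a hypothetical helper.

```python
def normalized_scores(bin_counts, moa_image_counts, left_out_moa, left_out_image_count):
    """Divide each MOA's hit count by the number of searchable images with that MOA."""
    scores = {}
    for moa, hits in bin_counts.items():
        denominator = moa_image_counts[moa]
        if moa == left_out_moa:
            # The left-out compound's own images are excluded from the search.
            denominator -= left_out_image_count
        scores[moa] = hits / denominator
    return scores

bin_counts = {"Kinase inhibitors": 573, "Protein synthesis": 15, "Epithelial": 40}
moa_image_counts = {"Kinase inhibitors": 192, "Protein synthesis": 288, "Epithelial": 256}

scores = normalized_scores(bin_counts, moa_image_counts, "Kinase inhibitors", 64)
best_moa = max(scores, key=scores.get)
print(scores)
print(best_moa)
```

Because hits are pooled over many queries, a score is not a fraction bounded by 1; it is only meaningful relative to the scores of the other MOAs.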

In [56]:
trainIdList = []
for trainInfo in trainList:
    trainId = trainInfo['trainId']
    if trainId!='origin' and trainId!=BASELINE_TRAIN_ID:
        trainIdList.append(trainInfo['trainId'])
trainIdList.sort()

In [57]:
trainIdList

['2hP5wLvua7kAf9Rb9gwwJ2',
 '3fyp1t55HMNfkx2Y7kBP41',
 '59v3zXoNWqhkHECzydTbGf',
 '5RM6PFqrEU38za91jEGaw2',
 '5ahR5kfxyiXGq4JYmNU1iA',
 '5xnjqwaoU5DDbmAukDbaTR',
 '6M6yWEzTK6TuRbfhSTupHk',
 '6MqYixg9TJzi6H9GPMvbzX',
 '6vDKLDzef5tg3x9BhwY1CX',
 '7JpraYic58Y41UoNfBEitF',
 '7finP77wGa5ESQGHv2CufV',
 '8HKNipjmvtveqTCxnNV9fq',
 '8UsYsgFEFzc1gLmziy8i5i',
 '9sWPv2HAwc7DymRc2FoVrW',
 'bNa6PZmJ474jpJ5pWJskaL',
 'deTme9bcmwLXvNs7ccgSRL',
 'e4TucXBmqu2QhG7yv9Dyk7',
 'eNdkW7UYq15WAxt5hbhvet',
 'eednrUjUhWNGSEEtqvXrtd',
 'enVvuCwdZcVpGpnE5SuJ2S',
 'fayY8cotAAa28HS2zfKf72',
 'fo7WWAuRCuYHM3T4BEZhzG',
 'fztGEeznZXpsVQNEHugkxr',
 'gfYHZzavYCpe2s26bxLFrT',
 'hqTvRAmUVR5amUiAABqv85',
 'jDu4hfhxSzBcaDmRQkV5TC',
 'mhuc4GduMQGEHmCGLUBpSy',
 'msCoNbq2VVYRG8ukXP9vuj',
 'mxFYgwmUGT8V62q73dhNV4',
 'phkg7iq8ipxD52NNfZZ8Nj',
 'qDKHJFf7Mb9UBSnRRVicZH',
 'qS3Hw1FVdMZnMZHthUTYZZ',
 'ssFHb7Jg3seeB3ikMLR33s',
 't6Txn8r9xCD2grWBX7dgHc',
 'tsdpUuLdnEtSAHYaNvRjgm',
 'uR4NaBNEzhiBGhewMBDbUU',
 'wM3um6VLA37hSHfR9wnDbz',
 

In [59]:
j=0
correct=0
for trainId in trainIdList:
    print(j+1)
    correct += getMoaHistogram(trainId)
    j += 1
pc = correct/j
print("==")
print("Percentage of compounds with correct predicted MOA={}".format(pc))

1
***
2hP5wLvua7kAf9Rb9gwwJ2
PD-169316
Kinase inhibitors
===
Start search
searchCount=64 resultCount=64
hitCount=640 binCount=640
10> Kinase inhibitors 573 4.4765625
10 Protein synthesis 15 0.052083333333333336
10 Epithelial 40 0.15625
10 DNA damage 7 0.018229166666666668
10 Aurora kinase inhibitors 5 0.017361111111111112
2
***
3fyp1t55HMNfkx2Y7kBP41
docetaxel
Microtubule stabilizers
===
Start search
searchCount=96 resultCount=95
hitCount=950 binCount=950
10> Microtubule stabilizers 769 0.5085978835978836
10 Microtubule destabilizers 29 0.07552083333333333
10 Protein degradation 66 0.171875
10 DNA damage 33 0.0859375
10 Kinase inhibitors 41 0.21354166666666666
10 Protein synthesis 2 0.006944444444444444
10 Aurora kinase inhibitors 10 0.034722222222222224
3
***
59v3zXoNWqhkHECzydTbGf
AZ-C
Eg5 inhibitors
===
Start search
searchCount=96 resultCount=96
hitCount=960 binCount=960
10 Protein degradation 120 0.3125
10 Microtubule stabilizers 314 0.19527363184079602
10> Eg5 inhibitors 399 4.156