## Summary
This notebook evaluates an ML model design for its capacity to learn an embedding that distinguishes between different mechanisms of action (MOA) in the bbbc021 dataset. It does this by considering N trained models, where N is the number of chemical compounds with known MOA. Each of the N models differs from the others in that one particular compound was left out of its training set. Each model can then be tested against its "left out" compound to evaluate how accurately it classifies the MOA of that compound using knowledge learned from other compounds sharing the same MOA. The bbbc021 dataset has 12 MOA and 38 compounds with known MOA (there are several representative compounds per MOA, and up to 8 different concentrations per compound). There are a total of 103 "treatments" in the bbbc021 dataset with known MOA, where a treatment is the application of a particular compound at a particular concentration.

During training, the network model learns to compute an embedding (vector space) that tries to position compounds with the same MOA close together while keeping compounds with differing MOA farther apart. Once trained, a model can be used to predict the MOA of an unknown (or untrained) compound by finding its nearest labeled neighbors in the embedding space.
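
The nearest-neighbor prediction described above can be sketched as follows. This is a minimal illustration rather than the notebook's implementation: `predict_moa` and the toy 2-D embeddings are assumptions made for the example, and cosine similarity is chosen only because the search requests later in this notebook use a `Cosine` metric.

```python
import numpy as np

def predict_moa(query, labeled_embeddings, labeled_moas):
    """Return the MOA of the labeled embedding closest to `query` (cosine similarity)."""
    labeled = np.asarray(labeled_embeddings, dtype=float)
    q = np.asarray(query, dtype=float)
    # Normalize to unit length so the dot product equals cosine similarity.
    labeled_unit = labeled / np.linalg.norm(labeled, axis=1, keepdims=True)
    q_unit = q / np.linalg.norm(q)
    similarity = labeled_unit @ q_unit
    return labeled_moas[int(np.argmax(similarity))]

# Toy embeddings (hypothetical values, for illustration only).
labeled_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]
labeled_moas = ["Actin disruptors", "Actin disruptors", "Aurora kinase inhibitors"]
predicted = predict_moa([0.8, 0.3], labeled_embeddings, labeled_moas)
print(predicted)
```

In the real evaluation the labeled set would exclude the left-out compound, so the prediction has to come from other compounds that share its MOA.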

This notebook assumes each of the N models is trained and available for evaluation, and that each image in the dataset has a computed embedding corresponding to each model.

For each of the N "one compound left out" models, the mean embedding of each of the M treatments is computed. An MOA is then assigned to each treatment of the left-out compound (i.e., to each concentration separately) based on its nearest neighbor among the treatments of other compounds. This is called NSC, or "Not Same Compound", analysis.
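
A minimal sketch of the NSC procedure, under stated assumptions: treatments are keyed as `(compound, concentration)` tuples, the compound names, MOA names, and embeddings are toy placeholders, and `treatment_means` and `nsc_predict` are hypothetical helper names that do not appear elsewhere in this notebook.

```python
import numpy as np
from collections import defaultdict

def treatment_means(image_embeddings, image_treatments):
    """Average per-image embeddings into one mean vector per (compound, concentration)."""
    groups = defaultdict(list)
    for emb, treatment in zip(image_embeddings, image_treatments):
        groups[treatment].append(np.asarray(emb, dtype=float))
    return {t: np.mean(v, axis=0) for t, v in groups.items()}

def cosine_distance(a, b):
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def nsc_predict(means, compound_moa, left_out_compound):
    """For each treatment of the left-out compound, take the MOA of the nearest
    treatment that belongs to a different compound."""
    predictions = {}
    for query, q_mean in means.items():
        if query[0] != left_out_compound:
            continue
        candidates = [
            (cosine_distance(q_mean, m), t)
            for t, m in means.items()
            if t[0] != left_out_compound  # "Not Same Compound"
        ]
        _, nearest = min(candidates, key=lambda c: c[0])
        predictions[query] = compound_moa[nearest[0]]
    return predictions

# Toy data: two images per treatment (hypothetical values).
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [1.0, 0.2], [0.0, 1.0], [0.2, 1.0]]
treatments = [("A", 1.0), ("A", 1.0), ("B", 1.0), ("B", 1.0), ("C", 1.0), ("C", 1.0)]
compound_moa = {"A": "moa1", "B": "moa1", "C": "moa2"}

means = treatment_means(embeddings, treatments)
nsc_result = nsc_predict(means, compound_moa, left_out_compound="A")
print(nsc_result)
```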

A second analysis, NSCB ("Not Same Compound or Batch"), additionally excludes from nearest-neighbor consideration all compounds prepared in the same batch as the left-out compound, so that batch-related characteristics do not bias the results. This is only possible for 10 of the 12 MOAs, because the other 2 have representatives in only a single batch.
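
The NSCB exclusion rule can be sketched as a filter over candidate neighbors. This is an illustration only: the dictionary layout, the compound and MOA names, and the helper names `nscb_candidates` and `moas_spanning_multiple_batches` are assumptions. The second helper checks a necessary condition for an MOA to be testable under NSCB, namely that its treatments span more than one batch.

```python
from collections import defaultdict

def nscb_candidates(treatments, query):
    """Treatments eligible as neighbors of `query`: different compound AND different batch."""
    return [
        t for t in treatments
        if t["compound"] != query["compound"] and t["batch"] != query["batch"]
    ]

def moas_spanning_multiple_batches(treatments):
    """MOAs whose treatments appear in more than one batch."""
    batches_by_moa = defaultdict(set)
    for t in treatments:
        batches_by_moa[t["moa"]].add(t["batch"])
    return sorted(moa for moa, batches in batches_by_moa.items() if len(batches) > 1)

# Toy treatments (hypothetical compounds, batches, and MOAs).
toy_treatments = [
    {"compound": "A", "batch": "Week1", "moa": "moa1"},
    {"compound": "B", "batch": "Week1", "moa": "moa1"},
    {"compound": "C", "batch": "Week2", "moa": "moa1"},
    {"compound": "D", "batch": "Week2", "moa": "moa2"},
    {"compound": "E", "batch": "Week2", "moa": "moa2"},
]

eligible = nscb_candidates(toy_treatments, toy_treatments[0])
multi_batch_moas = moas_spanning_multiple_batches(toy_treatments)
print([t["compound"] for t in eligible], multi_batch_moas)
```

In this toy set, compound B is excluded as a neighbor of A because it shares A's batch, and `moa2` cannot be evaluated because all of its treatments come from a single batch.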

In [None]:
!pip install shortuuid

In [None]:
import sys
import os
import math
import base64
import boto3
import sagemaker
import matplotlib.pyplot as plt
import numpy as np
import collections
from collections import defaultdict
from PIL import Image
import sklearn
from sklearn.metrics import ConfusionMatrixDisplay
from matplotlib.ticker import NullFormatter
from sklearn import manifold, datasets
from time import time
from time import sleep

In [None]:
EMBEDDING_NAME = 'bbbc021-3'
BASELINE_TRAIN_ID = '5TkVcLc6EM2pgkQwAujW2d'

In [None]:
s3c = boto3.client('s3')

In [None]:
%pwd

In [None]:
bioimsArtifactBucket='bioimagesearchbasestack-bioimagesearchdatabucketa-16h77xh6oyxmm'
bbbc021Bucket='bioimagesearchbbbc021stack-bbbc021bucket544c3e64-10ecnwo51127'

In [None]:
# assumes cwd=/root/bioimage-search/datasets/bbbc-021/notebooks
sys.path.insert(0, "../../../cli/bioims/src")
import bioims as bi

In [None]:
sys.path.insert(0, "../scripts")
import bbbc021common as bb

In [None]:
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
role = sagemaker.get_execution_role()

In [None]:
bucket

In [None]:
print(role)

## Prerequisites

### Permissions
This notebook requires the "BioimageSearch" managed policy to be attached to the SageMaker execution role printed above. Use the IAM console to attach the policy to that role. The policy ARN will be something like: arn:aws:iam::580829821648:policy/BioimageSearchResourcePermissionsStack-biomageSearchManagedPolicy9CB9C1D7-SXEV4WNUCZ7V

### Steps
* Use bbbc021 metadata to construct a map of ( imageSourceId -> { compound, concentration } )
* Begin with the name of the embedding (`EMBEDDING_NAME`, set to 'bbbc021-3' above)
* Get the metadata for the embedding
* Using the metadata for the embedding, get a list of compatible plates
* Get the list of all trainIds for the embedding
* Using the filter key of each trainId, create a map of (trainId->'left out compound')
* For each 'left out compound' and its corresponding model, find the nearest neighbor MOA for each of its treatments:
  * We need to group all embeddings (all images) per treatment, and compute the mean
  * Iterate through each plate:
      * Get all images
      * For each image, get its embedding for the trainId and collect in its treatment group
  * Compute the mean embedding for each treatment
  * Find the nearest neighbor treatment to each 'left out compound/treatment'
  * Assign the MOA of the nearest neighbor to the test treatment
* Construct a confusion matrix from the results, which summarizes all models together
* Compute % of treatments with correctly assigned MOAs
* This concludes the 'Not Same Compound' (NSC) compute
* Repeat the whole process but exclude imagery from the same batch (e.g., 'Week#') - this is NSCB
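
The last steps of the list above (confusion matrix and percentage of correctly assigned MOAs) can be sketched with the standard library alone. The true and predicted labels below are toy placeholders, and `confusion_matrix_rows` and `accuracy` are hypothetical helper names; a plotting utility such as the `ConfusionMatrixDisplay` imported earlier could be used to render the resulting matrix.

```python
from collections import Counter

def confusion_matrix_rows(true_moas, predicted_moas, labels):
    """Rows are true MOAs, columns are predicted MOAs, both in `labels` order."""
    counts = Counter(zip(true_moas, predicted_moas))
    return [[counts[(t, p)] for p in labels] for t in labels]

def accuracy(true_moas, predicted_moas):
    """Fraction of treatments whose predicted MOA equals the true MOA."""
    correct = sum(t == p for t, p in zip(true_moas, predicted_moas))
    return correct / len(true_moas)

# One entry per left-out treatment, pooled across all models (toy values).
true_moas = ["moa1", "moa1", "moa2", "moa2"]
predicted_moas = ["moa1", "moa2", "moa2", "moa2"]
labels = ["moa1", "moa2"]

matrix = confusion_matrix_rows(true_moas, predicted_moas, labels)
score = accuracy(true_moas, predicted_moas)
print(matrix, "{:.0%}".format(score))
```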
      

Get ImageID->(compound, concentration) maps

In [None]:
image_df, moa_df = bb.Bbbc021PlateInfoByDF.getDataFrames(bbbc021Bucket)
compound_moa_map = bb.Bbbc021PlateInfoByDF.getCompoundMoaMapFromDf(moa_df)

sourceCompoundMap={}
sourceConcentrationMap={}
compoundCountMap={}
moaCountMap={}
for _, r in image_df.iterrows():
    # Drop the last 4 characters of the file name (presumably the file extension).
    imageSourceId = r['Image_FileName_DAPI'][:-4]
    imageCompound = r['Image_Metadata_Compound']
    sourceCompoundMap[imageSourceId] = imageCompound
    sourceConcentrationMap[imageSourceId] = r['Image_Metadata_Concentration']
    # Count images per compound.
    compoundCountMap[imageCompound] = compoundCountMap.get(imageCompound, 0) + 1
    # Count images per MOA, for compounds with a known MOA only.
    if imageCompound in compound_moa_map:
        imageMoa = compound_moa_map[imageCompound]
        moaCountMap[imageMoa] = moaCountMap.get(imageMoa, 0) + 1

In [None]:
compoundCountMap

In [None]:
moaCountMap

In [None]:
embeddingClient = bi.client('embedding')

In [None]:
imageClient = bi.client('image-management')

In [None]:
trainingConfigurationClient = bi.client('training-configuration')

In [None]:
embeddingInfo = trainingConfigurationClient.getEmbeddingInfo(EMBEDDING_NAME)

In [None]:
plateList = imageClient.listCompatiblePlates(embeddingInfo['inputWidth'], embeddingInfo['inputHeight'], embeddingInfo['inputDepth'], embeddingInfo['inputChannels'])

In [None]:
trainList = trainingConfigurationClient.getEmbeddingTrainings(EMBEDDING_NAME)

In [None]:
trainList

In [None]:
compound_moa_map

In [None]:
def getCompoundLabel(compound):    
    cnws ="".join(compound.split())
    return cnws.replace('/','-')

In [None]:
label_moa_map = {}
labelCountMap = {}
for c, m in compound_moa_map.items():
    label = getCompoundLabel(c)
    label_moa_map[label] = m
    labelCountMap[label]=compoundCountMap[c]

In [None]:
label_moa_map

In [None]:
train_compoundLabel_map = {}

In [None]:
for trainInfo in trainList:
    if 'filterKey' in trainInfo and len(trainInfo['filterKey'])>0:
        filterKey = trainInfo['filterKey']
        print(filterKey)
        a1=filterKey.split('/')
        print(a1)
        a2=a1[2].split("-filter")
        print(a2)
        trainId = trainInfo['trainId']
        print(trainId)
        train_compoundLabel_map[trainId]=a2[0]

In [None]:
train_compoundLabel_map

Check that the counts match. The control (DMSO) has no "left out" model, so we expect one fewer model than compounds:

In [None]:
len(train_compoundLabel_map)==len(compound_moa_map)-1

In [None]:
tagClient = bi.client("tag")

In [None]:
tagList = tagClient.getAllTags()

In [None]:
compoundLabel_tag_map = {}
for tag in tagList:
    tagId = tag['id']  # avoid shadowing the built-in id()
    value = tag['tagValue']
    # Compound tags have the form 'compound:<label>'.
    if value.startswith('compound:'):
        a1 = value.split(":")
        compoundLabel_tag_map[a1[1]] = tagId

In [None]:
compoundLabel_tag_map

In [None]:
searchClient = bi.client("search")

We use the search service to build, for each "left out" compound, a histogram of the MOAs of its search hits, pooling the results over all images of that compound. We survey a range of cutoffs for the number of top hits taken per query image (1 to 30); in practice the results are remarkably insensitive to this value.
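
The scoring applied to the pooled hits can be sketched as follows. This is a simplified illustration of the normalization used in the function below, not a replacement for it: `best_moa_from_hits` and the toy inputs are hypothetical, and each hit is assumed to have already been mapped to its MOA. Each MOA's hit count is divided by the number of images available for that MOA, and for the left-out compound's own MOA the left-out compound's images are subtracted first because they are excluded from the search.

```python
from collections import Counter

def best_moa_from_hits(hit_moas_per_query, top_k, moa_image_counts,
                       left_out_moa, left_out_image_count):
    """Pool the top_k hits of every query image and score each MOA by
    hits / available images; return (best MOA, all scores)."""
    pooled = Counter()
    for hits in hit_moas_per_query:
        pooled.update(hits[:top_k])
    scores = {}
    for moa, count in pooled.items():
        available = moa_image_counts[moa]
        if moa == left_out_moa:
            # The left-out compound's own images cannot be returned as hits.
            available -= left_out_image_count
        if available > 0:
            scores[moa] = count / available
    return max(scores, key=scores.get), scores

# Toy input: ranked hit MOAs for two query images (hypothetical values).
hit_moas_per_query = [["moa1", "moa2", "moa1"], ["moa2", "moa2", "moa1"]]
moa_image_counts = {"moa1": 20, "moa2": 60}

best, scores = best_moa_from_hits(
    hit_moas_per_query, top_k=2, moa_image_counts=moa_image_counts,
    left_out_moa="moa1", left_out_image_count=10,
)
print(best, scores)
```

In this toy case `moa2` receives more raw hits (3 versus 1), but `moa1` wins after normalization because far fewer `moa1` images are available to be hit.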

In [None]:
def getMoaHistogram(trainId, leftOutCompoundLabel=''):
    # Numbers of top hits per query image to survey (1 through 30).
    testSequence = list(range(1, 31))
    print("***")
    print(trainId)
    if leftOutCompoundLabel == '':
        leftOutCompoundLabel=train_compoundLabel_map[trainId]
    print(leftOutCompoundLabel)
    leftOutMoa = label_moa_map[leftOutCompoundLabel]
    print(leftOutMoa)
    print("===")
    imageInfoMap={}
    dmsoTag = compoundLabel_tag_map['DMSO']
    searchPlateMap = {}
    searchCount=0
    imageListPlateMap={}
    for plate in plateList:
        plateId = plate['plateId']
        #print("plate {}".format(plateId))
        images = imageClient.getImagesByPlateId(plateId)
        imageListPlateMap[plateId] = images
    print("Start search")
    for plate in plateList:
        plateId = plate['plateId']
        images = imageListPlateMap[plateId]
        searchResponses = []
        for image in images:
            imageSourceId = image['Item']['imageSourceId']
            imageId = image['Item']['imageId']
            compound = sourceCompoundMap[imageSourceId]
            compoundLabel = getCompoundLabel(compound)
            concentration = sourceConcentrationMap[imageSourceId]
            if compoundLabel==leftOutCompoundLabel:
                #print("{} {} {} {}".format(imageId, compound, compoundLabel, concentration))
                exclusionTags = []
                tag = compoundLabel_tag_map[compoundLabel]
                exclusionTags.append(tag)
                exclusionTags.append(dmsoTag)
                search = {
                    "trainId" : trainId,
                    "queryImageId" : imageId,
                    "exclusionTags" : exclusionTags,
                    "requireMoa" : "true",
                    "metric" : "Cosine"
                }
                #print(search)
                searchResponse = searchClient.submitSearch(search)
                searchCount += 1
                searchResponses.append(searchResponse)
        searchPlateMap[plateId] = searchResponses
    searchResultsMap={}
    resultCount=0
    for plate in plateList:
        plateId = plate['plateId']
        searchResponses = searchPlateMap[plateId]
        for searchResponse in searchResponses:
            searchId = searchResponse['searchId']
            statusValue = 'submitted'
            while statusValue != 'completed' and statusValue != 'error':
                sleep(1)
                searchStatus = searchClient.getSearchStatus(searchId)
                statusValue = searchStatus['Item']['status']
            if statusValue == 'completed':
                searchResults = searchClient.getSearchResults(searchId)
                if plateId not in searchResultsMap:
                    searchResultsMap[plateId] = []
                searchResultsMap[plateId].append(searchResults)
                resultCount += 1
    print("searchCount={} resultCount={}".format(searchCount, resultCount))
    for testCount in testSequence:
        moaBinCounts = {}
        hitCount=0
        binCount=0
        for plate in plateList:
                plateId = plate['plateId']
                if plateId in searchResultsMap:
                    searchResultsList = searchResultsMap[plateId]
                    for searchResults in searchResultsList:
                        # Guard against searches that returned fewer than testCount hits.
                        for i in range(min(testCount, len(searchResults))):
                            hitCount += 1
                            searchResult = searchResults[i]
                            hitImageId = searchResult['imageId']
                            if hitImageId not in imageInfoMap:
                                imageInfo = imageClient.getImageInfo(hitImageId, 'origin')
                                imageInfoMap[hitImageId]=imageInfo
                            imageInfo=imageInfoMap[hitImageId]
                            imageSourceId = imageInfo['Item']['imageSourceId']
                            hitCompound = sourceCompoundMap[imageSourceId]
                            if hitCompound in compound_moa_map:
                                moa = compound_moa_map[hitCompound]
                            else:
                                moa = "unknown"
                            # Tally the hit under its MOA.
                            binCount += 1
                            moaBinCounts[moa] = moaBinCounts.get(moa, 0) + 1
        print("hitCount={} binCount={}".format(hitCount, binCount))
        labelCount = labelCountMap[leftOutCompoundLabel]
        labelMoaCount = moaCountMap[leftOutMoa]
        adjustedLabelMoaCount = labelMoaCount - labelCount
        bestMoa=''
        bestScore=0.0
        for moa in moaBinCounts:
            c = moaBinCounts[moa]
            m = moaCountMap[moa]
            if moa == leftOutMoa:
                n = c / adjustedLabelMoaCount
            else:
                n = c / m
            if n > bestScore:
                bestMoa=moa
                bestScore=n
        for moa in moaBinCounts:
            c = moaBinCounts[moa]
            m = moaCountMap[moa]
            if moa == leftOutMoa:
                n = c / adjustedLabelMoaCount
            else:
                n = c / m
            if moa==bestMoa:
                print("{}> {} {} {}".format(testCount, moa, c, n))
            else:
                print("{} {} {} {}".format(testCount, moa, c, n))            

In [None]:
trainIdList = []
for trainInfo in trainList:
    trainId = trainInfo['trainId']
    if trainId!='origin' and trainId!=BASELINE_TRAIN_ID:
        trainIdList.append(trainId)
trainIdList.sort()

In [None]:
trainIdList

In [None]:
for j, trainId in enumerate(trainIdList, start=1):
    print(j)
    getMoaHistogram(trainId)