# Create Count Matrices
```
Andrew E. Davidson
aedavids@ucsc.edu
```

**prerequisites:**
1. You have already run Salmon in one or more workspaces.
2. You have created "localization scripts".
    - these scripts copy the Salmon quant.sf files to the local VM
    - the quant.sf files can be gzip compressed
    - a simple script would use `gsutil -m cp`
    ```
    gsutil -m cp gs://fc-secure-8a69fc00-b6c9-4179-aee5-f1e47a4475dd/34b2bbfb-4f9a-41d4-bfd8-b55a8e1987de/quantify/00a77c33-e6d7-44a0-8638-61a6c3e2d1fd/call-salmon_paired_reads/attempt-2/READ-G5-6641-TP.quant.sf.gz ./TCGA_READ_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab/
    ```
3. the quant.sf file names are of the form 'sampleName.quant.sf' or 'sampleName.quant.sf.gz'
    - for example: READ-AF-2689-NT.quant.sf 
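
The naming convention in prerequisite 3 can be checked with a few lines of Python. This is a minimal sketch, not part of the original notebook; the function name `sample_name_from_quant_file` is hypothetical.

```python
from pathlib import PurePosixPath


def sample_name_from_quant_file(path):
    """Return the sample name for 'sampleName.quant.sf' or 'sampleName.quant.sf.gz'."""
    file_name = PurePosixPath(path).name
    # test the longer suffix first so '.gz' files are handled correctly
    for suffix in (".quant.sf.gz", ".quant.sf"):
        if file_name.endswith(suffix):
            return file_name[: -len(suffix)]
    raise ValueError("not a salmon quant file: {}".format(path))


print(sample_name_from_quant_file("READ-AF-2689-NT.quant.sf"))
print(sample_name_from_quant_file("data/READ-G5-6641-TP.quant.sf.gz"))
```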
    
**for each localization script**
- if matrix.tsv does not exist in the workspace bucket
     * copy the quant.sf files to the local VM
     * use cut and paste to create a matrix in tsv format from the NumReads column in the quant.sf files
     * copy the matrix tsv file to the current workspace bucket

Note: the first column name will be 'name'; the remaining column names will be the sample names. The first column holds the Name column from the quant.sf files; the remaining elements of the matrix are the count values. The columns are sorted by sample name.
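
The matrix layout described in the note can be illustrated in pure Python. This is only a sketch of the layout, not the notebook's cut-and-paste implementation: `build_count_matrix` is a hypothetical helper, the sample names `S1` and `S2` and the counts are made up, and the only thing taken from Salmon is the standard quant.sf header (`Name`, `Length`, `EffectiveLength`, `TPM`, `NumReads`).

```python
import csv
import io


def build_count_matrix(quant_tables):
    """quant_tables maps sample name -> text of that sample's quant.sf file."""
    sample_names = sorted(quant_tables)  # columns are sorted by sample name
    counts = {}       # sample name -> list of NumReads values in row order
    row_names = None  # values of the Name column, taken from the first sample
    for sample in sample_names:
        reader = csv.DictReader(io.StringIO(quant_tables[sample]), delimiter="\t")
        rows = list(reader)
        names = [row["Name"] for row in rows]
        if row_names is None:
            row_names = names
        elif names != row_names:
            # pasting columns side by side is only valid if every file has the same rows
            raise ValueError("row order differs in sample {}".format(sample))
        counts[sample] = [row["NumReads"] for row in rows]

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["name"] + sample_names)
    for i, name in enumerate(row_names or []):
        writer.writerow([name] + [counts[s][i] for s in sample_names])
    return out.getvalue()


header = "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
quant_s1 = header + "tx1\t100\t50.0\t1.5\t10\ntx2\t200\t150.0\t2.5\t0\n"
quant_s2 = header + "tx1\t100\t50.0\t1.5\t3\ntx2\t200\t150.0\t2.5\t7\n"
print(build_count_matrix({"S2": quant_s2, "S1": quant_s1}))
```

The sketch raises an error when two files list their rows in a different order, because a column-wise paste silently produces a wrong matrix in that case.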

**useful environment variables**
```
WORKSPACE_BUCKET=gs://fc-e15b796f-1abe-4206-ab91-bd58374cc275
WORKSPACE_NAME=uber
WORKSPACE_NAMESPACE=test-aedavids-proj
```



## Request for enhancement
Like Jupyter notebooks, Terra/WDL commands should by default have access to gsutil, without having to create a special
docker image that contains gsutil and all the magic required to authenticate. No architecture, design, or framework is perfect, that is to say, able to handle all use cases. It is important to have an "escape hatch to lower levels of framework and system". It should be possible to mount gsutil and the auth tokens in the 'docker run' cmd.

Another possible implementation would be to create another mechanism, like WDL's Array[File], that takes a file where each row is a gs:// URL and localizes it.
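
A minimal sketch of what such a "file of URLs" mechanism could look like from the user side, assuming a plain text manifest with one gs:// URL per line. The function names and the example bucket are hypothetical. The only gsutil feature relied on is `cp -I`, which reads the list of objects to copy from stdin.

```python
def read_url_manifest(text):
    """Return the gs:// URLs listed one per line, skipping blank lines and # comments."""
    urls = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        url = line.strip()
        if not url or url.startswith("#"):
            continue
        if not url.startswith("gs://"):
            raise ValueError("line {}: not a gs:// URL: {}".format(line_number, url))
        urls.append(url)
    return urls


def build_copy_command(dest_dir):
    """Build a gsutil command that reads the URLs to copy from stdin (cp -I)."""
    return ["gsutil", "-m", "cp", "-I", dest_dir]


manifest = """
# hypothetical manifest, one object per line
gs://example-bucket/run1/sampleA.quant.sf.gz
gs://example-bucket/run1/sampleB.quant.sf.gz
"""
urls = read_url_manifest(manifest)
print(build_copy_command("./data/"), len(urls))
```

On a machine where gsutil is installed and authenticated, the command could be run with `subprocess.run(build_copy_command("./data/"), input="\n".join(urls), text=True, check=True)`.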

## Challenges with Terra wdl
### approach A: use sample.tsv file
- sample.tsv has an Array[File] column
- over 10,000 very long gs:// URLs. Terra support says SQL will probably choke
- not manageable
### approach B: use input json
* would require us to generate the json (we have to do this anyway; there are too many files)
* disadvantage: would not be repeatable
    + you would have to look at the job results to see what the input was
    + we could store the input.json in the workflow bucket
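
Generating the input json for this approach is straightforward. The sketch below assumes the usual WDL inputs convention of `workflowName.inputName` keys; the workflow name `createCountMatrix`, the input name `quantFiles`, and the example URLs are hypothetical.

```python
import json


def make_inputs_json(workflow_name, input_name, urls):
    """Return a WDL inputs JSON string with one Array[File] input."""
    key = "{}.{}".format(workflow_name, input_name)
    # sort the URLs so the same set of files always produces the same json
    return json.dumps({key: sorted(urls)}, indent=2)


example_urls = [
    "gs://example-bucket/run1/sampleB.quant.sf.gz",
    "gs://example-bucket/run1/sampleA.quant.sf.gz",
]
print(make_inputs_json("createCountMatrix", "quantFiles", example_urls))
```

Storing the generated file in the workflow bucket, as suggested above, would address the repeatability concern.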
### approach C: create a file, each line is the gs:// URL of a sample to be processed
**This is the best approach.** It is repeatable, easy to manage, efficient, and allows parallel creation of the 33 required matrices.

* create a 'sample.tsv' with an id for each matrix to be created
* add a column to contain the URL of a file with the list of URLs to be combined into the final matrix
* difficult to implement
* unlike Terra Jupyter notebooks, it would be hard if not impossible to use gsutil in a WDL command
* it appears that Terra/Cromwell does localization on the VM but does not make gsutil and authentication available to the container
* <span style="color:red"> This notebook is a clumsy POC of this approach</span>
    + basically I rewrote stuff Cromwell does well. :-(
    + this notebook runs the 29 individual workspaces serially. If this were implemented using
    a sample.tsv and WDL they would run in parallel
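
The 'sample.tsv' for approach C could be generated along these lines. This is a sketch under assumptions: the `entity:sample_id` header follows what I understand to be Terra's data table upload convention, and the column name `quantFileListUrl`, the matrix ids, and the URLs are hypothetical.

```python
import csv
import io


def make_sample_table(url_list_by_matrix_id, id_header="entity:sample_id"):
    """Return TSV text with one row per matrix: its id and the URL of its list file."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([id_header, "quantFileListUrl"])
    for matrix_id in sorted(url_list_by_matrix_id):
        writer.writerow([matrix_id, url_list_by_matrix_id[matrix_id]])
    return out.getvalue()


table = make_sample_table({
    "TCGA_BLCA": "gs://example-bucket/lists/TCGA_BLCA.urls.txt",
    "TCGA_ACC": "gs://example-bucket/lists/TCGA_ACC.urls.txt",
})
print(table)
```

Each row then refers to a single small file instead of thousands of long URLs, which is what keeps the table manageable.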


In [1]:
!echo $WORKSPACE_BUCKET

gs://fc-e15b796f-1abe-4206-ab91-bd58374cc275


## Copy scripts from workspace bucket to vm

In [2]:
%%time
! mkdir -p ~/${WORKSPACE_NAME}/bin/localization
! gsutil cp ${WORKSPACE_BUCKET}/bin/cutAndPaste.sh ~/${WORKSPACE_NAME}/bin/
! gsutil -m cp -r ${WORKSPACE_BUCKET}/bin/localization/ ~/${WORKSPACE_NAME}/bin/

Copying gs://fc-e15b796f-1abe-4206-ab91-bd58374cc275/bin/cutAndPaste.sh...
/ [1 files][  2.4 KiB/  2.4 KiB]                                                
Operation completed over 1 objects/2.4 KiB.                                      
Copying gs://fc-e15b796f-1abe-4206-ab91-bd58374cc275/bin/localization/TCGA_ACC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab_79_copyFromTerraToLocal.sh...
Copying gs://fc-e15b796f-1abe-4206-ab91-bd58374cc275/bin/localization/TCGA_BLCA_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab_427_copyFromTerraToLocal.sh...
Copying gs://fc-e15b796f-1abe-4206-ab91-bd58374cc275/bin/localization/TCGA_BRCA_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab_1181_copyFromTerraToLocal.sh...
Copying gs://fc-e15b796f-1abe-4206-ab91-bd58374cc275/bin/localization/TCGA_CESC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab_309_copyFromTerraToLocal.sh...
Copying gs://fc-e15b796f-1abe-4206-ab91-bd58374cc275/bin/localization/TCGA_CHOL_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab_45_copyFromTerraToL

In [3]:
! cat ~/${WORKSPACE_NAME}/bin/cutAndPaste.sh

#!/bin/sh

#
# return a tsv file. first col header name is 'Name' remain headers
# are the sample names
#

if [ "$#" -lt 1 ]; then
    echo "ERROR arguments are not correct"
    echo "Usage: $0 outputFile"
    echo "create a tsv file from all the salmon quant.sf files in the current directory"
    echo "will uncompress if in gz format"
    echo " example $0 myMatrix; will produce a file myMatrix.tsv"
    exit 1
fi

outputFile=$1

# ref: https://gist.github.com/vncsna/64825d5609c146e80de8b1fd623011ca 
#set -euxo pipefail
#set -x


# check if file is compressed or not
printf "uncompressing quant.sf.gz files\n"
for f in `ls *quant.sf*`;
do

    #printf "\n****** $f"
    gzip -t $f 2>/dev/null
    if [ $? -eq 0 ];
    then
        gzip -d $f &
    # else
    #     printf not a compressed file
    fi

    # wait for all background processes to complete
    # to run paste we need to be a big machine. we want to do as much
    # concurrent processing as 

## localization script
copies the quant.sf.gz files from Terra workspace buckets to the local VM

In [4]:
%%time
def localization(copyScript):
    #dir = "~/${WORKSPACE_NAME}/bin/localization/"
#     copyScript = "TCGA_READ_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab_105_copyFromTerraToLocal.sh"

    # { } is the IPython shell-magic syntax for expanding Python expressions, e.g. concatenation
    #! chmod u+x {dir + copyScript}
    ! chmod u+x $copyScript
    #! ls -l {dir + copyScript}
    ! ls -l $copyScript
    # run the script. clean up any left over files from previous runs.
    # '&&' stops the chain if 'cd' fails, so 'rm -rf' never runs in the wrong directory
    ! ( \
       mkdir -p ~/data && \
       cd ~/data && \
       rm -rf ./* && \
       $copyScript ; \
      )
    
def testLocalization():
    #copyScript = "TCGA_READ_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab_105_copyFromTerraToLocal.sh"
    copyScript = "/home/jupyter/uber/bin/localization/TCGA_ACC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab_79_copyFromTerraToLocal.sh"
    localization(copyScript)

# this takes a long time
#testLocalization()

CPU times: user 4 µs, sys: 3 µs, total: 7 µs
Wall time: 9.78 µs


## find the Terra workspace name the samples originated from

In [5]:
def getSampleWorkspaceName(copyScript):
    baseName = copyScript.split("/")[-1]
    tmp = baseName.split("_copyFromTerraToLocal.sh")[0]
    tokens = tmp.split("_")[:-1]
    workSpace = "_".join(tokens)
    return workSpace
    
def testGetSampleWorkspaceName():
    #copyScript="TCGA_READ_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab_105_copyFromTerraToLocal.sh"; 
    copyScript="/home/jupyter/uber/bin/localization/TCGA_ACC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab_79_copyFromTerraToLocal.sh"
    wsName = getSampleWorkspaceName(copyScript)
    print(wsName)
    assert wsName == "TCGA_ACC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab", "ERROR name did not parse"

testGetSampleWorkspaceName()

TCGA_ACC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab


## Create the matrix

In [6]:
%%time
def createMatrix(sampleWorkspace):
    # the notebook cannot expand both local Python variables and environment variables in the same command
    # workaround: copy the environment variable to a local variable
    tmp = ! echo $WORKSPACE_NAME
    currentWorkspace = tmp[0]
    
    print("\n****** " + sampleWorkspace)
    # '&&' keeps cutAndPaste.sh from running in the wrong directory if 'cd' fails
    ! ( \
       chmod +x ~/$currentWorkspace/bin/cutAndPaste.sh; \
       cd ~/data/$sampleWorkspace && \
       ~/$currentWorkspace/bin/cutAndPaste.sh $sampleWorkspace ; \
      )
    
def testCreateMatrix():
    #copyScript = "TCGA_READ_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab_105_copyFromTerraToLocal.sh"
    copyScript = "/home/jupyter/uber/bin/localization/TCGA_ACC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab_79_copyFromTerraToLocal.sh"
    sampleWorkSpace = getSampleWorkspaceName(copyScript)
    createMatrix(sampleWorkSpace)
    ! ls -l ~/data/$sampleWorkSpace/*.tsv

    
# this takes a long time
#testCreateMatrix()

CPU times: user 2 µs, sys: 2 µs, total: 4 µs
Wall time: 6.44 µs


In [7]:
def getCountMatrixFilePath(copyScript):
    sampleWorkspaceName = getSampleWorkspaceName(copyScript)
    countMatrix = "~/data/" + sampleWorkspaceName + "/" + sampleWorkspaceName + ".tsv"
    return countMatrix

def testGetCountMatrixFilePath():
    #copyScript="TCGA_READ_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab_105_copyFromTerraToLocal.sh"; 
    copyScript = "/home/jupyter/uber/bin/localization/TCGA_ACC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab_79_copyFromTerraToLocal.sh"

    file = getCountMatrixFilePath(copyScript)
    print(file)
    assert file == "~/data/TCGA_ACC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab/TCGA_ACC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab.tsv", "ERROR file not found"
    ! ls -l $file
    
testGetCountMatrixFilePath()

~/data/TCGA_ACC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab/TCGA_ACC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab.tsv
ls: cannot access '/home/jupyter/data/TCGA_ACC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab/TCGA_ACC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab.tsv': No such file or directory


In [8]:
def getSaveAsFileName(copyScript):
    tsvFile = getCountMatrixFilePath(copyScript)
    basename = tsvFile.split("/")[-1]
    root = basename.split(".")[0]
    saveAs = root + ".NumReads.tsv"
    return saveAs
    
def testGetSaveAsFileName():
    #copyScript="TCGA_READ_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab_105_copyFromTerraToLocal.sh"; 
    copyScript = "/home/jupyter/uber/bin/localization/TCGA_ACC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab_79_copyFromTerraToLocal.sh"

    saveAs = getSaveAsFileName(copyScript)
    print(saveAs)
    assert saveAs == "TCGA_ACC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab.NumReads.tsv", "ERROR"

testGetSaveAsFileName()

TCGA_ACC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab.NumReads.tsv


## define pipeline

In [9]:
%%time
def pipeline(copyScript):
    sampleWorkspaceName = getSampleWorkspaceName(copyScript)
    print("\n******** {}".format(sampleWorkspaceName))
    
    print("BEGIN localization")
    localization(copyScript)
    print("END localization")
    
    print("\nBEGIN createMatrix")
    createMatrix(sampleWorkspaceName)
    print("END createMatrix")
    
    tsvFile = getCountMatrixFilePath(copyScript)
    
    
    # Jupyter / IPython cannot expand local Python variables and environment variables in the same command.
    # The workaround is to copy the environment variable to a local Python variable.
    tmp = ! echo ${WORKSPACE_BUCKET}
    bucketURL = tmp[0]
    saveAs = getSaveAsFileName(copyScript)
    ! gsutil -m cp $tsvFile $bucketURL/data/matrices/NumReads/$saveAs
    ! gsutil ls -l $bucketURL/data/matrices/NumReads/$saveAs
    
    
def testPipeline():
    #copyScript="TCGA_READ_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab_105_copyFromTerraToLocal.sh"; 
    copyScript = "/home/jupyter/uber/bin/localization/TCGA_ACC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab_79_copyFromTerraToLocal.sh"

    pipeline( copyScript )
    
# this takes a long time.
#testPipeline()

CPU times: user 3 µs, sys: 2 µs, total: 5 µs
Wall time: 8.82 µs


In [10]:
%%time
def runAll():
    # Jupyter / IPython cannot expand local Python variables and environment variables in the same command.
    # The workaround is to copy the environment variable to a local Python variable.
    tmp = ! echo ${WORKSPACE_BUCKET}
    bucketURL = tmp[0]
    
    # shell glob matching every localization script
    scriptGlob = "~/${WORKSPACE_NAME}/bin/localization/*.sh"
    listOfLocalizationScripts = ! ls $scriptGlob
    for copyScript in listOfLocalizationScripts:
        print("\n***** " + copyScript)
        saveAs = getSaveAsFileName(copyScript)
        print(saveAs)
        # the exit code is echoed last, so read the final line of the captured output
        exitCodeList = ! (gsutil -q stat $bucketURL/data/matrices/NumReads/$saveAs; echo $?)
        exitCode = int(exitCodeList[-1])
        if exitCode == 0:
            print("skipping {} matrix already exists".format(copyScript))
            continue
            
        try:
            pipeline(copyScript)
        except Exception as e:
            print("ERROR exception: {}".format(e))

        
runAll()


***** /home/jupyter/uber/bin/localization/TCGA_ACC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab_79_copyFromTerraToLocal.sh
TCGA_ACC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab.NumReads.tsv
skipping /home/jupyter/uber/bin/localization/TCGA_ACC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab_79_copyFromTerraToLocal.sh matrix already exists

***** /home/jupyter/uber/bin/localization/TCGA_BLCA_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab_427_copyFromTerraToLocal.sh
TCGA_BLCA_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab.NumReads.tsv
skipping /home/jupyter/uber/bin/localization/TCGA_BLCA_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab_427_copyFromTerraToLocal.sh matrix already exists

***** /home/jupyter/uber/bin/localization/TCGA_BRCA_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab_1181_copyFromTerraToLocal.sh
TCGA_BRCA_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab.NumReads.tsv
skipping /home/jupyter/uber/bin/localization/TCGA_BRCA_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab_1181_copyFromTerraToLocal.sh ma

## Find workspaces that we were not able to create NumReads.tsv for

In [11]:
def getListOfSucesses():
    listOfTSVFiles = ! gsutil ls "${WORKSPACE_BUCKET}/data/matrices/NumReads/*.tsv"
    # keep only the file name, the last component of each gs:// URL
    return [url.split("/")[-1] for url in listOfTSVFiles]

#getListOfSucesses()

In [12]:
def getExpectedResults():
    listOfCopyScripts = ! ls ~/${WORKSPACE_NAME}/bin/localization/*.sh
    retList = []
    
    for copyScript  in listOfCopyScripts:
        tsv = getSaveAsFileName(copyScript)
        retList.append(tsv)
        
    return retList

#getExpectedResults()

In [13]:
def checkForFailures():
    expectedResultSet = set(getExpectedResults())
    resultSet         = set(getListOfSucesses())
    # elements in expected that are not in results
    return expectedResultSet.difference(resultSet)
        
checkForFailures()

set()
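
An empty set means every expected matrix is in the bucket. If the set were not empty, the missing matrix names could be mapped back to their localization scripts so only those are rerun. The sketch below is an illustration, not part of the original run: `expected_matrix_name` and `scripts_to_rerun` are hypothetical helpers that mirror the naming logic of `getSampleWorkspaceName()` and `getSaveAsFileName()`, and the "missing" BLCA matrix is made up.

```python
def expected_matrix_name(copy_script):
    """Return the NumReads matrix file name a localization script should produce."""
    base_name = copy_script.split("/")[-1].split("_copyFromTerraToLocal.sh")[0]
    # drop the trailing sample count, e.g. '_79', to recover the workspace name
    workspace = "_".join(base_name.split("_")[:-1])
    return workspace + ".NumReads.tsv"


def scripts_to_rerun(copy_scripts, missing_matrices):
    """Return the copy scripts whose expected matrix name is in missing_matrices."""
    missing = set(missing_matrices)
    return [script for script in copy_scripts if expected_matrix_name(script) in missing]


copy_scripts = [
    "/home/jupyter/uber/bin/localization/TCGA_ACC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab_79_copyFromTerraToLocal.sh",
    "/home/jupyter/uber/bin/localization/TCGA_BLCA_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab_427_copyFromTerraToLocal.sh",
]
# made-up failure, for illustration only
missing = {"TCGA_BLCA_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab.NumReads.tsv"}
for script in scripts_to_rerun(copy_scripts, missing):
    print(script)
```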