This notebook summarizes the status of samples on Ada, in the BSMN scratch space, and in GENEWIZ project 30-317737003 (denoted here as g003).

## Preparation

In [1]:
%load_ext autoreload
%autoreload 2
import pandas as pd
import synapseclient
import synapseutils
import re
import os.path
syn = synapseclient.login()

Welcome, Attila Jones!



In [2]:
wdir = '/home/attila/projects/bsm/results/2020-05-18-processed-samples/'
# CMC_Human_clinical_metadata.csv
cmc_clinical_syn = syn.get('syn2279441', downloadLocation=wdir, ifcollision='overwrite.local')
cmc_clinical = pd.read_csv(cmc_clinical_syn.path, index_col='Individual ID')

Sequenced samples from Ada

In [3]:
ada_path = '/home/attila/projects/bsm/results/2018-09-12-sequenced-individuals/sequenced-samples.csv'
ada = pd.read_csv(ada_path)
# index with sample IDs like MSSM_033_NeuN_pl
ada.index = [y[0].replace('CMC_', '') + '_' + y[1] for y in zip(ada['Individual ID'], ada['Tissue'])]
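
The sample ID convention used for the index above can be illustrated in isolation. The following is a minimal sketch, not the notebook's own code: `make_sample_id` is a hypothetical helper, and unlike `str.replace` it strips `CMC_` only when it appears as a prefix.

```python
def make_sample_id(individual_id, tissue):
    """Combine an individual ID and a tissue label into a sample ID.

    Hypothetical helper: removes the 'CMC_' prefix only at the start of the ID.
    """
    prefix = 'CMC_'
    if individual_id.startswith(prefix):
        simple_id = individual_id[len(prefix):]
    else:
        simple_id = individual_id
    return simple_id + '_' + tissue


print(make_sample_id('CMC_MSSM_033', 'NeuN_pl'))
```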

Processed samples from *Scratch Space »  BSMN_grant_data »  WGS.SCZ.Chess » alignment* Synapse folder [syn20735395](https://www.synapse.org/#!Synapse:syn20735395)

In [4]:
scratch = list(synapseutils.walk(syn, 'syn20735395'))[0][2] # get file list
scratch = [y for y in scratch if re.match(r'^.*\.cram$', y[0])] # retain only .cram files; drop everything else, such as .crai index files
entities = [syn.get(y[1], downloadFile=False) for y in scratch]
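
The filtering step can be tried without a Synapse login. This sketch assumes the file list consists of `(file name, Synapse ID)` pairs, as in the cell above; the names and IDs below are made up for illustration.

```python
# Hypothetical (file name, Synapse ID) pairs shaped like the file list above
files = [
    ('MSSM_027_NeuN_pl.cram', 'syn0000001'),
    ('MSSM_027_NeuN_pl.cram.crai', 'syn0000002'),
    ('MSSM_056_muscle.cram', 'syn0000003'),
]

# Keep only alignment files; index files end in .crai and are dropped
crams = [(name, syn_id) for name, syn_id in files if name.endswith('.cram')]
cram_names = [name for name, _ in crams]
print(cram_names)
```

`str.endswith` is used here as a simpler alternative to the regular expression; both select the same names.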

In [5]:
def entity2df(e, cmc_clinical=cmc_clinical):
    '''Given Synapse entity e, create a one-row data frame.'''
    simple_id = re.sub('_(muscle|NeuN_pl|NeuN_mn).*$', '', e.name)
    indiv_id = 'CMC_' + simple_id
    d = dict()
    d['Individual ID'] = indiv_id
    d['Dx'] = cmc_clinical.loc[indiv_id, 'Dx']
    d['Tissue'] = re.sub('^.*(muscle|NeuN_pl|NeuN_mn).*$', '\\1', e.name)
    d['File name'] = e.name
    d['Modification time'] = e.modifiedOn
    users = {'3388274': '@cmolitor', '3340241': '@taejeong', '3338602': '@attilajones'}
    d['Created by'] = users[e.createdBy]
    d['Modified by'] = users[e.modifiedBy]
    d['Size GiB'] = e._file_handle.contentSize / 2 ** 30
    df = pd.DataFrame(d, index=[e.name.replace('.cram', '')])
    return df

scratch = pd.concat([entity2df(e) for e in entities])
scratch.head()

Sample ID,Individual ID,Dx,Tissue,File name,Modification time,Created by,Modified by,Size GiB
MSSM_027_NeuN_pl,CMC_MSSM_027,SCZ,NeuN_pl,MSSM_027_NeuN_pl.cram,2020-05-02T15:56:53.425Z,@attilajones,@attilajones,59.431365
MSSM_056_muscle,CMC_MSSM_056,Control,muscle,MSSM_056_muscle.cram,2020-03-13T20:18:16.206Z,@taejeong,@cmolitor,52.598372
MSSM_106_NeuN_mn,CMC_MSSM_106,Control,NeuN_mn,MSSM_106_NeuN_mn.cram,2020-03-13T20:18:16.812Z,@taejeong,@cmolitor,55.699787
MSSM_106_NeuN_pl,CMC_MSSM_106,Control,NeuN_pl,MSSM_106_NeuN_pl.cram,2020-03-13T20:20:05.209Z,@taejeong,@cmolitor,226.807344
MSSM_106_muscle,CMC_MSSM_106,Control,muscle,MSSM_106_muscle.cram,2020-03-13T20:19:32.261Z,@taejeong,@cmolitor,61.918579
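
The file name parsing inside `entity2df` can be checked on its own. Below is a minimal sketch that assumes file names follow the `<simple id>_<tissue>.cram` pattern seen in the table; `parse_cram_name` and `CRAM_NAME_RE` are hypothetical names, and the single anchored pattern is a variation on the two `re.sub` calls above rather than the notebook's method.

```python
import re

# One anchored pattern with named groups; the lazy .+? stops at the tissue suffix
CRAM_NAME_RE = re.compile(r'^(?P<simple_id>.+?)_(?P<tissue>muscle|NeuN_pl|NeuN_mn)\.cram$')


def parse_cram_name(file_name):
    """Split a CRAM file name into individual ID, tissue and sample ID."""
    m = CRAM_NAME_RE.match(file_name)
    if m is None:
        # Fail loudly instead of silently returning the unmodified name
        raise ValueError('unexpected file name: ' + file_name)
    return {
        'Individual ID': 'CMC_' + m.group('simple_id'),
        'Tissue': m.group('tissue'),
        'Sample ID': m.group('simple_id') + '_' + m.group('tissue'),
    }


print(parse_cram_name('MSSM_106_NeuN_mn.cram'))
```

Raising on a non-matching name is a design choice of this sketch: with `re.sub`, a name that does not match passes through unchanged and only fails later at the metadata lookup.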


In [6]:
samples_ada = set(ada.index)
samples_scratch = set(scratch.index)

Samples in Chaggai's table for GENEWIZ project 30-317737003.

In [7]:
g003 = pd.read_csv('/home/attila/projects/bsm/tables/samples-from-Chaggai.csv')
g003.index = [y + '_NeuN_pl' for y in g003['CMC_simple_id']]
g003['Individual ID'] = ['CMC_' + y for y in g003['CMC_simple_id']]
g003['Tissue'] = 'NeuN_pl'
# correction: from CONTROL to Control
g003['Dx'] = [re.sub('ONTROL', 'ontrol', y) for y in g003['Dx']]
g003.head()

Sample ID,CMC_simple_id,Dissection,PFC #,Dx,GENEWIZ_serialn,Index,Individual ID,Tissue
PITT_117_NeuN_pl,PITT_117,PITT,904,SCZ,1,A2,CMC_PITT_117,NeuN_pl
MSSM_364_NeuN_pl,MSSM_364,MSSM,1141,SCZ,2,A3,CMC_MSSM_364,NeuN_pl
MSSM_363_NeuN_pl,MSSM_363,MSSM,1144,SCZ,3,A4,CMC_MSSM_363,NeuN_pl
MSSM_308_NeuN_pl,MSSM_308,MSSM,1153,SCZ,5,A5,CMC_MSSM_308,NeuN_pl
MSSM_339_NeuN_pl,MSSM_339,MSSM,1155,SCZ,6,A6,CMC_MSSM_339,NeuN_pl
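
The `Dx` correction above relies on a substring substitution. A more explicit alternative, sketched here under the assumption that `CONTROL` is the only spelling that needs fixing, maps known variants to the canonical label; `normalize_dx` and `DX_ALIASES` are hypothetical names.

```python
# Hypothetical alias table: variant spelling -> canonical label
DX_ALIASES = {'CONTROL': 'Control'}


def normalize_dx(dx):
    """Return the canonical diagnosis label, leaving unknown values untouched."""
    return DX_ALIASES.get(dx, dx)


print([normalize_dx(dx) for dx in ['CONTROL', 'SCZ', 'Control']])
```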


## Results
### Processed samples from Ada
Samples on Ada that have been processed by bsmn-pipeline. Note that Taejeong did the processing.

In [8]:
scratch.loc[list(samples_ada.intersection(samples_scratch)), :]  # recent pandas versions reject a set as an indexer

Sample ID,Individual ID,Dx,Tissue,File name,Modification time,Created by,Modified by,Size GiB
PITT_118_NeuN_mn,CMC_PITT_118,SCZ,NeuN_mn,PITT_118_NeuN_mn.cram,2020-03-13T20:18:17.841Z,@taejeong,@cmolitor,52.733003
MSSM_183_muscle,CMC_MSSM_183,Control,muscle,MSSM_183_muscle.cram,2020-03-13T20:18:17.380Z,@taejeong,@cmolitor,47.4775
MSSM_183_NeuN_mn,CMC_MSSM_183,Control,NeuN_mn,MSSM_183_NeuN_mn.cram,2020-03-13T20:18:19.052Z,@taejeong,@cmolitor,62.790291
MSSM_295_muscle,CMC_MSSM_295,SCZ,muscle,MSSM_295_muscle.cram,2020-03-13T20:18:19.455Z,@taejeong,@cmolitor,56.074964
MSSM_118_muscle,CMC_MSSM_118,SCZ,muscle,MSSM_118_muscle.cram,2020-03-13T20:18:16.688Z,@taejeong,@cmolitor,48.005579
MSSM_106_NeuN_pl,CMC_MSSM_106,Control,NeuN_pl,MSSM_106_NeuN_pl.cram,2020-03-13T20:20:05.209Z,@taejeong,@cmolitor,226.807344
MSSM_373_NeuN_pl,CMC_MSSM_373,SCZ,NeuN_pl,MSSM_373_NeuN_pl.cram,2020-03-13T20:20:17.553Z,@taejeong,@cmolitor,304.313954
MSSM_175_NeuN_pl,CMC_MSSM_175,Control,NeuN_pl,MSSM_175_NeuN_pl.cram,2020-03-13T20:19:55.225Z,@taejeong,@cmolitor,176.807351
MSSM_118_NeuN_pl,CMC_MSSM_118,SCZ,NeuN_pl,MSSM_118_NeuN_pl.cram,2020-03-13T20:18:19.573Z,@taejeong,@cmolitor,99.49285
PITT_091_NeuN_pl,CMC_PITT_091,SCZ,NeuN_pl,PITT_091_NeuN_pl.cram,2020-03-13T20:18:24.555Z,@taejeong,@cmolitor,169.931067


### Unprocessed samples from Ada
Samples on Ada that have **not** yet been processed by bsmn-pipeline.

In [9]:
samples_unprocessed = samples_ada - samples_scratch
samples_unprocessed_path = os.path.join(wdir, 'unprocessed_samples')
with open(samples_unprocessed_path, 'w') as f:
    for s in samples_unprocessed:
        f.write(s + '\n')
print('written to', samples_unprocessed_path)

written to /home/attila/projects/bsm/results/2020-05-18-processed-samples/unprocessed_samples


In [10]:
%%bash
cat /home/attila/projects/bsm/results/2020-05-18-processed-samples/unprocessed_samples

MSSM_331_NeuN_pl
PITT_060_NeuN_pl
PITT_101_NeuN_pl
MSSM_065_NeuN_pl
MSSM_310_NeuN_pl
MSSM_033_NeuN_pl
MSSM_338_NeuN_pl
MSSM_295_NeuN_pl
MSSM_056_NeuN_pl
PITT_036_NeuN_pl
MSSM_304_NeuN_pl
MSSM_193_NeuN_pl
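
Because a Python set has no defined iteration order, the listing above comes out in arbitrary order. A minimal sketch of writing the list in sorted order, so that repeated runs produce identical files, is shown below; `write_sample_list` is a hypothetical helper and a temporary directory stands in for `wdir`.

```python
import os
import tempfile


def write_sample_list(samples, path):
    """Write one sample ID per line, sorted for reproducible output."""
    with open(path, 'w') as f:
        f.writelines(s + '\n' for s in sorted(samples))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'unprocessed_samples')
    write_sample_list({'PITT_060_NeuN_pl', 'MSSM_331_NeuN_pl', 'MSSM_033_NeuN_pl'}, path)
    with open(path) as f:
        lines = f.read().splitlines()

print(lines)
```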


These samples were uploaded to [s3://chesslab-bsmn/Ada/](https://s3.console.aws.amazon.com/s3/buckets/chesslab-bsmn/Ada/?region=us-east-2&tab=overview).

### (Partially) processed samples from GENEWIZ 30-317737003
Newest samples: GENEWIZ project 30-317737003

In [11]:
scratch.loc[list(samples_scratch - samples_ada), :]  # recent pandas versions reject a set as an indexer

Sample ID,Individual ID,Dx,Tissue,File name,Modification time,Created by,Modified by,Size GiB
MSSM_366_NeuN_pl,CMC_MSSM_366,SCZ,NeuN_pl,MSSM_366_NeuN_pl.cram,2020-05-01T11:10:17.453Z,@attilajones,@attilajones,44.041661
MSSM_224_NeuN_pl,CMC_MSSM_224,SCZ,NeuN_pl,MSSM_224_NeuN_pl.cram,2020-05-01T00:58:44.085Z,@attilajones,@attilajones,38.057367
PITT_050_NeuN_pl,CMC_PITT_050,Control,NeuN_pl,PITT_050_NeuN_pl.cram,2020-05-01T12:38:57.697Z,@attilajones,@attilajones,38.51765
MSSM_327_NeuN_pl,CMC_MSSM_327,SCZ,NeuN_pl,MSSM_327_NeuN_pl.cram,2020-05-01T12:47:36.932Z,@attilajones,@attilajones,50.819513
PITT_072_NeuN_pl,CMC_PITT_072,SCZ,NeuN_pl,PITT_072_NeuN_pl.cram,2020-05-01T07:18:57.829Z,@attilajones,@attilajones,32.506979
MSSM_321_NeuN_pl,CMC_MSSM_321,SCZ,NeuN_pl,MSSM_321_NeuN_pl.cram,2020-05-02T06:43:10.689Z,@attilajones,@attilajones,62.606365
MSSM_346_NeuN_pl,CMC_MSSM_346,SCZ,NeuN_pl,MSSM_346_NeuN_pl.cram,2020-05-01T08:40:51.936Z,@attilajones,@attilajones,40.556149
MSSM_192_NeuN_pl,CMC_MSSM_192,SCZ,NeuN_pl,MSSM_192_NeuN_pl.cram,2020-05-01T05:42:08.793Z,@attilajones,@attilajones,35.282288
MSSM_269_NeuN_pl,CMC_MSSM_269,SCZ,NeuN_pl,MSSM_269_NeuN_pl.cram,2020-05-02T01:03:59.082Z,@attilajones,@attilajones,50.057645
MSSM_213_NeuN_pl,CMC_MSSM_213,SCZ,NeuN_pl,MSSM_213_NeuN_pl.cram,2020-05-02T05:55:22.334Z,@attilajones,@attilajones,60.143387


### Unprocessed samples from GENEWIZ 30-317737003


Take the samples in g003 (`samples_g003`) and the samples that are on Ada, in the BSMN scratch space, or both (`samples_ada_scratch`).

In [12]:
samples_g003 = set(g003.index)
samples_ada_scratch = samples_unprocessed.union(samples_scratch)
print(len(samples_g003), len(samples_ada_scratch))

69 69


In [13]:
samples_missing_from_scratch = samples_g003 - samples_ada_scratch
len(samples_missing_from_scratch)

47
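
The set operations used throughout this section can be summarized on toy data. The sample IDs below are made up; only the operators mirror the cells above.

```python
# Toy sample ID sets (hypothetical values)
ada = {'A_NeuN_pl', 'B_NeuN_pl', 'C_muscle'}
scratch = {'B_NeuN_pl', 'C_muscle', 'D_NeuN_pl'}
g003 = {'D_NeuN_pl', 'E_NeuN_pl'}

processed_from_ada = ada & scratch      # on Ada and already in scratch space
unprocessed_from_ada = ada - scratch    # on Ada but not yet in scratch space
processed_new = scratch - ada           # in scratch space but never on Ada

# Unprocessed Ada samples plus scratch samples cover everything on Ada or scratch
ada_or_scratch = unprocessed_from_ada | scratch
missing_from_scratch = g003 - ada_or_scratch

print(sorted(missing_from_scratch))
```

Note that `unprocessed_from_ada | scratch` equals `ada | scratch`, which is why `samples_ada_scratch` above represents the samples on Ada, in scratch space, or both.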

## List of all samples

In [14]:
sel_col = ['Individual ID', 'Dx', 'Tissue']
# .loc needs a list-like indexer; recent pandas versions reject a set
all_samples = pd.concat([ada.loc[list(samples_unprocessed), sel_col],
                         scratch.loc[:, sel_col],
                         g003.loc[list(samples_missing_from_scratch), sel_col]])
all_samples = all_samples.sort_index()
all_samples.to_csv(os.path.join(wdir, 'all_samples.csv'), index_label='Sample ID')
all_samples

Sample ID,Individual ID,Dx,Tissue
MSSM_027_NeuN_pl,CMC_MSSM_027,SCZ,NeuN_pl
MSSM_033_NeuN_pl,CMC_MSSM_033,Control,NeuN_pl
MSSM_055_NeuN_pl,CMC_MSSM_055,Control,NeuN_pl
MSSM_056_NeuN_pl,CMC_MSSM_056,Control,NeuN_pl
MSSM_056_muscle,CMC_MSSM_056,Control,muscle
...,...,...,...
PITT_101_NeuN_pl,CMC_PITT_101,Control,NeuN_pl
PITT_113_NeuN_pl,CMC_PITT_113,Control,NeuN_pl
PITT_117_NeuN_pl,CMC_PITT_117,SCZ,NeuN_pl
PITT_118_NeuN_mn,CMC_PITT_118,SCZ,NeuN_mn
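
Combining several sample tables and ordering them by sample ID can be sketched on toy data as follows. The frames below are made up; `verify_integrity=True` is an optional safeguard added in this sketch, and it makes `pd.concat` raise if the same sample ID occurs in more than one source.

```python
import pandas as pd

# Toy tables indexed by sample ID (hypothetical values)
from_ada = pd.DataFrame({'Dx': ['SCZ']}, index=['PITT_117_NeuN_pl'])
from_scratch = pd.DataFrame({'Dx': ['Control', 'SCZ']},
                            index=['MSSM_033_NeuN_pl', 'MSSM_027_NeuN_pl'])

# Concatenate, refuse duplicated sample IDs, then sort by sample ID
combined = pd.concat([from_ada, from_scratch], verify_integrity=True).sort_index()
print(combined)
```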


In [15]:
print(len(set(all_samples['Individual ID'])),'individuals')

95 individuals


In [16]:
all_samples.groupby(['Dx', 'Tissue']).count()

Dx,Tissue,Individual ID
Control,NeuN_mn,8
Control,NeuN_pl,28
Control,muscle,7
SCZ,NeuN_mn,2
SCZ,NeuN_pl,67
SCZ,muscle,4
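
The same kind of breakdown can also be laid out as a two-way table with `pd.crosstab`, which places diagnoses in rows and tissues in columns. The sketch below uses toy data, not the counts from this notebook.

```python
import pandas as pd

# Toy sample table (hypothetical values)
toy = pd.DataFrame({
    'Dx': ['Control', 'Control', 'SCZ', 'SCZ', 'SCZ'],
    'Tissue': ['NeuN_pl', 'muscle', 'NeuN_pl', 'NeuN_pl', 'muscle'],
})

# Rows: diagnosis, columns: tissue, cells: number of samples
table = pd.crosstab(toy['Dx'], toy['Tissue'])
print(table)
```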


List Control individuals with samples from multiple tissues

In [17]:
samplecounts = all_samples.groupby(['Individual ID']).count()
multisample_indivs = samplecounts.loc[samplecounts['Tissue'] > 1, :].index
multisamples = all_samples.loc[all_samples['Individual ID'].isin(multisample_indivs), :]
def count_multisamples(Dx='Control'):
    res = multisamples.loc[multisamples['Dx'] == Dx, :].groupby(['Individual ID', 'Tissue']).count()
    return res
count_multisamples('Control')

Individual ID,Tissue,Dx
CMC_MSSM_056,NeuN_pl,1
CMC_MSSM_056,muscle,1
CMC_MSSM_106,NeuN_mn,1
CMC_MSSM_106,NeuN_pl,1
CMC_MSSM_106,muscle,1
CMC_MSSM_109,NeuN_mn,1
CMC_MSSM_109,NeuN_pl,1
CMC_MSSM_109,muscle,1
CMC_MSSM_179,NeuN_mn,1
CMC_MSSM_179,NeuN_pl,1
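
Selecting individuals with samples from more than one tissue can be sketched on toy data as below. This is an alternative formulation using `nunique`, not the notebook's code, and the individual IDs are made up.

```python
import pandas as pd

# Toy sample table (hypothetical values)
toy = pd.DataFrame({
    'Individual ID': ['CMC_A', 'CMC_A', 'CMC_B', 'CMC_C', 'CMC_C', 'CMC_C'],
    'Tissue': ['NeuN_pl', 'muscle', 'NeuN_pl', 'NeuN_mn', 'NeuN_pl', 'muscle'],
})

# Number of distinct tissues per individual, keeping only those with more than one
n_tissues = toy.groupby('Individual ID')['Tissue'].nunique()
multi = n_tissues[n_tissues > 1]
print(multi)
```

Counting distinct tissues rather than rows means an individual with two samples of the same tissue would not be reported as multi-tissue.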


The same for SCZ individuals

In [18]:
count_multisamples('SCZ')

Individual ID,Tissue,Dx
CMC_MSSM_118,NeuN_pl,1
CMC_MSSM_118,muscle,1
CMC_MSSM_295,NeuN_pl,1
CMC_MSSM_295,muscle,1
CMC_MSSM_304,NeuN_pl,1
CMC_MSSM_304,muscle,1
CMC_MSSM_331,NeuN_pl,1
CMC_MSSM_331,muscle,1
CMC_PITT_091,NeuN_mn,1
CMC_PITT_091,NeuN_pl,1


