Several CommonMind samples were submitted to GENEWIZ for sequencing.  This notebook determines which of those samples are still missing from `s3://chesslab-bsmn`.  From the list of FASTQ files in the S3 bucket we also create a `sample_list.tsv` for the bsmn-pipeline.

In [1]:
%load_ext autoreload
%autoreload 2
import pandas as pd
import subprocess
import re
import io
import numpy as np

## Samples submitted to GENEWIZ
Import the CSV that represents the first sheet of `Pooling strategy summary for Attila 07Apr2020.xls`, which was attached to Chaggai's email of 4/7/2020.

In [2]:
csvpath = '/home/attila/projects/bsm/tables/pooling-strategy-summary-from-chaggai-1.csv'
to_gw = pd.read_csv(csvpath, sep='\t', header='infer', index_col=0)
to_gw = to_gw.iloc[:-1, :].copy() # the last row is a summary: total number of lanes
to_gw

Sample Name*,Conc (ng/µL)*,Pool,uL,# Reads,Yield (Mb),Mean Quality Score,% Bases >= 30,Total Reads/Pool,% of Pool,Coverage,...,uL/150uL,uL final,U,Number lanes,nM in pool,Vol/Pool,Actual/Proposed,Predicted lanes,Total lanes,Total Coverage
5,0.5,A,11.6,9.753870e+08,292616.0,35.94,94.08,5.132388e+09,0.190045,68.3,...,44.87,18.0,8.94,4.0,1.15,72.5,0.92,0.31,0.50,100.0
65,0.5,A,11.5,4.814455e+08,144434.0,36.06,94.65,,0.093805,33.7,...,90.12,18.0,9.03,,DDW,47.5,0.46,0.15,0.25,50.0
B MSSM 1172,7.0,A,0.8,1.198945e+09,359684.0,35.96,94.09,,0.233604,83.9,...,2.52,2.9,20.30,,,,2.65,0.89,1.12,224.0
C MSSM 1199,2.1,A,2.8,5.148946e+08,154468.0,36.01,94.36,,0.100323,36.0,...,20.52,23.4,48.64,,,,2.63,0.88,0.98,196.0
D MSSM 1238,5.3,A,1.1,9.924667e+08,297740.0,36.01,94.31,,0.193373,69.5,...,4.18,4.8,25.39,,,,2.64,0.88,1.08,215.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13,1.0,L,5.8,4.647687e+08,139431.0,35.82,93.32,,0.168924,32.5,...,24.17,23.0,22.80,,DDW,-1.9,2.29,0.83,1.00,200.0
26,2.3,L,2.5,6.725015e+08,201750.0,35.86,93.42,,0.244427,47.1,...,7.20,7.4,16.79,,,,2.47,0.90,1.14,228.0
36,2.9,L,2.0,5.307355e+08,159221.0,35.87,93.54,,0.192901,37.2,...,7.30,7.5,21.46,,,,2.47,0.90,1.09,218.0
59,0.7,L,8.3,3.175936e+08,95278.0,35.93,93.87,,0.115432,22.2,...,50.62,20.5,14.22,,,,0.97,0.35,0.47,94.0


## Samples received from GENEWIZ
List contents of the `GENEWIZ/30-317737003/` folder in `s3://chesslab-bsmn`

In [3]:
p1 = subprocess.run(['aws', 's3', 'ls', 's3://chesslab-bsmn/GENEWIZ/30-317737003/'], capture_output=True)
# No shell is involved here, so the space is passed as-is without extra quotes
p2 = subprocess.run(['tr', '--squeeze-repeats', ' ', ','], input=p1.stdout, capture_output=True)
from_gw = pd.read_csv(io.StringIO(p2.stdout.decode('utf-8')), names=['date', 'time', 'size', 'filename'])
from_gw.tail()

Unnamed: 0,date,time,size,filename
44,2020-04-03,11:40:05,28313528644,32_R1_001.fastq.gz
45,2020-04-03,11:50:30,29166821498,32_R2_001.fastq.gz
46,2020-04-03,13:46:03,42141068816,GMSSM1357_R1_001.fastq.gz
47,2020-04-03,14:11:18,36041341017,K_R1_001.fastq.gz
48,2020-04-03,14:27:21,7680,md5sum_list.txt
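The listing can also be parsed without the external `tr` step. The following is a minimal sketch, assuming that each object line of `aws s3 ls` has the four whitespace-separated fields seen in the output above (date, time, size, name). The helper name `parse_s3_ls` and the `PRE subdir/` line in the sample text are illustrative assumptions, not part of the original analysis.

```python
def parse_s3_ls(text):
    """Parse object lines of an `aws s3 ls` listing (hypothetical helper)."""
    records = []
    for line in text.splitlines():
        # Split on runs of whitespace, keeping the file name as one field
        fields = line.split(None, 3)
        if len(fields) != 4:
            continue  # skip blank lines and prefix lines such as "PRE subdir/"
        date, time, size, filename = fields
        records.append({'date': date, 'time': time,
                        'size': int(size), 'filename': filename})
    return records

# Sample text modelled on the listing above; the PRE line is made up
listing = """\
                           PRE subdir/
2020-04-03 11:40:05 28313528644 32_R1_001.fastq.gz
2020-04-03 11:50:30 29166821498 32_R2_001.fastq.gz
2020-04-03 14:27:21        7680 md5sum_list.txt
"""
records = parse_s3_ls(listing)
```

In the notebook the input would be `p1.stdout.decode('utf-8')`, and `pd.DataFrame(records)` would give a data frame with the same four columns as `from_gw`.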


Remove non-fastq files from list

In [4]:
fqpattern = r'[A-Z0-9]+_R[12].*fastq\.gz' # raw string, so that \. reaches the regex engine unchanged
keeprow = [re.search(fqpattern, y) is not None for y in from_gw['filename']]
from_gw = from_gw.iloc[keeprow, :].copy()
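To see what the pattern keeps and drops, here is a small sketch on file names taken from the listing above. The variable names `names` and `kept` are introduced only for this illustration.

```python
import re

fqpattern = r'[A-Z0-9]+_R[12].*fastq\.gz'

# File names copied from the bucket listing shown earlier
names = ['32_R1_001.fastq.gz',
         'GMSSM1357_R1_001.fastq.gz',
         'K_R1_001.fastq.gz',
         'md5sum_list.txt']

# Keep only names in which the FASTQ pattern is found
kept = [n for n in names if re.search(fqpattern, n) is not None]
```

Here `kept` holds the three FASTQ names and `md5sum_list.txt` is dropped. On the data frame itself, `from_gw[from_gw['filename'].str.contains(fqpattern)]` should select the same rows.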

Add a column called *Uploaded to S3* to Chaggai's sheet and save the sheet as a CSV

In [5]:
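# Strip spaces from the sheet's sample names so that e.g. 'B MSSM 1172' matches the file prefix 'BMSSM1172'
to_gw_sm = set([re.sub(' ', '', y) for y in to_gw.index])
from_gw_sm = set([re.sub(r'_R[12]_001\.fastq\.gz', '', y) for y in from_gw['filename']])
# Compare the space-stripped name, not the raw index value, against the uploaded set
to_gw['Uploaded to S3'] = [re.sub(' ', '', y) in from_gw_sm for y in to_gw.index]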
csvpath = '/home/attila/projects/bsm/results/2020-04-07-GENEWIZ-samples/pooling-strategy-summary-from-chaggai-1-uploaded.csv'
to_gw.to_csv(csvpath)
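The name matching in this step depends on comparing like with like: sheet names may contain spaces (e.g. `B MSSM 1172`), while file prefixes do not. A minimal sketch on toy data follows. The helpers `normalize` and `sample_from_fastq` are hypothetical, and the sheet name `G MSSM 1357` is an assumption modelled on the naming pattern of the other rows.

```python
import re

def normalize(name):
    """Remove all whitespace from a sample name (hypothetical helper)."""
    return re.sub(r'\s+', '', str(name))

def sample_from_fastq(filename):
    """Strip the read and lane suffix from a FASTQ file name (hypothetical helper)."""
    return re.sub(r'_R[12]_001\.fastq\.gz$', '', filename)

# Toy inputs: sheet names as submitted, file names as found in the bucket
submitted = ['5', 'B MSSM 1172', 'G MSSM 1357', 'K']
uploaded_files = ['32_R1_001.fastq.gz', 'GMSSM1357_R1_001.fastq.gz', 'K_R1_001.fastq.gz']

uploaded = {sample_from_fastq(f) for f in uploaded_files}
# A sample counts as uploaded if its normalized name is among the file prefixes
flags = {s: normalize(s) in uploaded for s in submitted}
```

With these toy inputs, `5` and `B MSSM 1172` are flagged as not uploaded, while `G MSSM 1357` and `K` are flagged as uploaded.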

Create a symlink to the CSV with a name similar to that of Chaggai's original file

In [6]:
%%bash
newcsv='/home/attila/projects/bsm/results/2020-04-07-GENEWIZ-samples/pooling-strategy-summary-from-chaggai-1-uploaded.csv'
symlink='/home/attila/projects/bsm/tables/Pooling strategy summary for Attila 07Apr2020-uploaded.csv'
ln -fs "$newcsv" "$symlink"

Missing samples (defined as those not yet uploaded to `s3://chesslab-bsmn`)

In [7]:
missing_sm = list(to_gw_sm - from_gw_sm)
missing_sm.sort()
print(missing_sm)

['3', '33', '34', '36', '38', '39', '43', '44', '46', '48', '49', '5', '50', '54', '57', '58', '59', '6', '60', '61', '63', '65', '67', '7', '72', '73', '74', '75', '76', '77', '8', '80', '84', '86', '9', 'AMSSM1160', 'BMSSM1172', 'CMSSM1199', 'DMSSM1238', 'EMSSM1247', 'FMSSM1346', 'HPITT1454', 'I', 'J']
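The list above is in lexicographic order, so `5` follows `49` and `9` follows `86`. If numeric order is easier to read, a natural sort key is one option. The sketch below is an optional variation, not what the notebook ran, and `natural_key` is a hypothetical helper.

```python
import re

def natural_key(name):
    """Sort key placing digit runs in numeric order, ahead of letter runs."""
    parts = re.findall(r'\d+|\D+', name)
    # Tag each part so that numbers and text are never compared with each other
    return [(0, int(p), '') if p.isdigit() else (1, 0, p) for p in parts]

# A few names taken from the missing list above
some_missing = ['3', '33', '5', '65', '9', 'AMSSM1160', 'I']
ordered = sorted(some_missing, key=natural_key)
```

Applied to `missing_sm`, the same key would list the numbered samples in ascending numeric order, followed by the lettered ones.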


## Creating sample_list for bsmn-pipeline

We use the `filename` column of the `from_gw` data frame as the basis for the sample list data frame `slist`.  We add a `#sample_id` and a `location` column using `samples-from-Chaggai.csv`, also known as `genewiz_serialn.csv` ([syn21982509](https://www.synapse.org/#!Synapse:syn21982509)).

Note that `samples-from-Chaggai.csv` was manually edited to account for the file name `GMSSM1357_R1_001.fastq.gz` listed in the bucket, which should have been named `G_R1_001.fastq.gz`.

In [8]:
sn = pd.read_csv('/home/attila/projects/bsm/tables/samples-from-Chaggai.csv', index_col='GENEWIZ_serialn')
s3prefix = 's3://chesslab-bsmn/GENEWIZ/30-317737003/'
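# The GENEWIZ serial number is the part of the file name before the first underscore;
# it is looked up in `sn` to obtain the CMC ID for the `#sample_id` column
slistarray = np.array([
    [sn.loc[re.sub(r'(^[0-9A-Za-z]+)_.*', r'\1', filename), 'CMC_simple_id'] + '_NeuN_pl',
     filename,
     s3prefix + filename]
    for filename in from_gw['filename']])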
slist = pd.DataFrame(slistarray, columns=['#sample_id', 'file_name', 'location'])
slistpath = '/home/attila/projects/bsm/results/2020-04-07-GENEWIZ-samples/sample_list.tsv'
slist.to_csv(slistpath, sep='\t', header=True, index=False)
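The row construction can be sketched on toy data as follows. The helper names and the IDs `EXAMPLE_A` and `EXAMPLE_B` are made up for illustration. The real mapping comes from the `CMC_simple_id` column of `samples-from-Chaggai.csv`.

```python
import re

def serial_from_filename(filename):
    """Return the part of the file name before the first underscore."""
    return re.sub(r'(^[0-9A-Za-z]+)_.*', r'\1', filename)

def sample_list_rows(filenames, serial_to_cmc, s3prefix):
    """Build (#sample_id, file_name, location) rows (hypothetical helper)."""
    rows = []
    for filename in filenames:
        serial = serial_from_filename(filename)
        sample_id = serial_to_cmc[serial] + '_NeuN_pl'
        rows.append((sample_id, filename, s3prefix + filename))
    return rows

# Made-up serial number to ID mapping; not real CMC identifiers
serial_to_cmc = {'32': 'EXAMPLE_A', 'GMSSM1357': 'EXAMPLE_B'}
s3prefix = 's3://chesslab-bsmn/GENEWIZ/30-317737003/'
rows = sample_list_rows(['32_R1_001.fastq.gz', 'GMSSM1357_R1_001.fastq.gz'],
                        serial_to_cmc, s3prefix)
```

A file whose serial number is absent from the mapping raises a `KeyError`, which is why the mapping file needed the manual `GMSSM1357` entry mentioned above.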

