Coverage is estimated for each of the new GENEWIZ samples (project 30-317737003) and compared with the coverage of previously sequenced samples.

In [1]:
%load_ext autoreload
%autoreload 2
import pandas as pd
import re
from sklearn import linear_model, neighbors

## Preparation

Running `notebook/2020-04-29-sample-fastq-sizes/runme.sh` on Ada produced the following two tables:

In [2]:
ls ~/projects/bsm/results/2020-04-29-sample-fastq-sizes/*.tsv

[0m[00;38;5;244m/home/attila/projects/bsm/results/2020-04-29-sample-fastq-sizes/fastq-nblocks.tsv[0m[K
[00;38;5;244m/home/attila/projects/bsm/results/2020-04-29-sample-fastq-sizes/nreads.tsv[0m


In [3]:
fai = pd.read_csv('/big/data/refgenome/GRCh37/dna/hs37d5.fa.fai', header=None, sep='\t')
hg_len_bp = fai[1].sum()  # column 1 of the .fai index holds the contig lengths in bp
print(hg_len_bp / 10**9, 'Gb')

3.137454505 Gb


The number of reads for the whole genome was obtained by summing the mapped and unmapped read counts over all contigs; the per-contig counts came from `samtools idxstats` (see `get_nreads.py`).  Coverage was then estimated as the number of reads times the length of each read (151 bp), divided by the length of the human genome.
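The calculation described above can be sketched as follows. This is a minimal illustration, not the contents of `get_nreads.py`: the function names `total_reads_from_idxstats` and `estimate_coverage` and the toy `idxstats` text are hypothetical. It assumes only the standard four-column, tab-separated `samtools idxstats` output (reference name, sequence length, mapped reads, unmapped reads).

```python
def total_reads_from_idxstats(idxstats_text):
    """Sum mapped and unmapped read counts over all lines of `samtools idxstats` output."""
    total = 0
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        # Columns: reference name, sequence length, mapped reads, unmapped reads
        _name, _length, mapped, unmapped = line.split('\t')
        total += int(mapped) + int(unmapped)
    return total


def estimate_coverage(nreads, read_len_bp, genome_len_bp):
    """Coverage = total sequenced bases / genome length."""
    return nreads * read_len_bp / genome_len_bp


# Made-up idxstats text for illustration; the '*' line holds reads without a contig
toy_idxstats = "1\t249250621\t1000\t10\n2\t243199373\t2000\t20\n*\t0\t0\t300\n"
toy_nreads = total_reads_from_idxstats(toy_idxstats)

# Read count of PITT_091_NeuN_mn and genome length as reported in this notebook
coverage = estimate_coverage(1040763710, 151, 3137454505)
```

With the read count of `PITT_091_NeuN_mn` and the genome length computed above, `coverage` comes out at roughly 50, in line with the table in the next cell.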

In [4]:
nreads = pd.read_csv('/home/attila/projects/bsm/results/2020-04-29-sample-fastq-sizes/nreads.tsv', sep='\t', index_col='sample')
nreads['coverage'] = nreads['nreads'] * 151 / hg_len_bp
nreads.tail()

sample,nreads,path,coverage
PITT_091_NeuN_mn,1040763710,/projects/bsm/alignments/PITT_091/PITT_091_Neu...,50.090071
PITT_091_NeuN_pl,3091314178,/projects/bsm/alignments/PITT_091/PITT_091_Neu...,148.77935
PITT_101_NeuN_pl,3503572864,/projects/bsm/alignments/PITT_101/PITT_101_Neu...,168.620613
PITT_118_NeuN_mn,929306289,/projects/bsm/alignments/PITT_118/PITT_118_Neu...,44.725828
PITT_118_NeuN_pl,4221676035,/projects/bsm/alignments/PITT_118/PITT_118_Neu...,203.181618


In [5]:
nblocks = pd.read_csv('/home/attila/projects/bsm/results/2020-04-29-sample-fastq-sizes/fastq-nblocks.tsv', sep='\t')
nblocks['fq.gz size, GiB'] = nblocks['nblocks, kB'] * 2 ** -20
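def fix_samplen(sname):
    # Insert an underscore after the site prefix, e.g. PITT091 -> PITT_091
    newname = re.sub(r'(MSSM|PITT)', r'\1_', sname)
    # Replace the digit 1 that follows pl/mn with an underscore, e.g. pl1 -> pl_
    newname = re.sub(r'(pl|mn)1', r'\1_', newname)
    return newname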
nblocks['sample'] = [fix_samplen(s) for s in nblocks['sample']]
#nblocks['sample'] = [re.sub('(MSSM|PITT)', '\\1_', s) for s in nblocks['sample']]
nblocks

,sample,"nblocks, kB",file path,"fq.gz size, GiB"
0,PITT_091_NeuN_pl,9021632,/projects/bsm/reads/PITT091_NeuN_pl/PITT091_Ne...,8.603699
1,PITT_091_NeuN_pl,8035760,/projects/bsm/reads/PITT091_NeuN_pl/PITT091_Ne...,7.663498
2,PITT_091_NeuN_pl,7769508,/projects/bsm/reads/PITT091_NeuN_pl/PITT091_Ne...,7.409580
3,PITT_091_NeuN_pl,8768292,/projects/bsm/reads/PITT091_NeuN_pl/PITT091_Ne...,8.362095
4,PITT_091_NeuN_pl,7779932,/projects/bsm/reads/PITT091_NeuN_pl/PITT091_Ne...,7.419521
...,...,...,...,...
2159,MSSM_109_NeuN_pl_,4391232,/projects/bsm/reads/MSSM109_NeuN_pl1/MSSM109_N...,4.187805
2160,MSSM_109_NeuN_pl_,4934676,/projects/bsm/reads/MSSM109_NeuN_pl1/MSSM109_N...,4.706074
2161,MSSM_109_NeuN_pl_,4306656,/projects/bsm/reads/MSSM109_NeuN_pl1/MSSM109_N...,4.107147
2162,MSSM_109_NeuN_pl_,4960984,/projects/bsm/reads/MSSM109_NeuN_pl1/MSSM109_N...,4.731163


Now get the samples present in both data frames and restrict each data frame to those samples.  Then sort by `nreads` in descending order.

In [6]:
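fsize = nblocks.groupby('sample').sum(numeric_only=True)  # total size per sample; skips the text column
samples = sorted(set(fsize.index).intersection(nreads.index))  # a list, since recent pandas rejects a set in .loc
nreads = nreads.loc[samples, :].sort_values('nreads', ascending=False)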
fsize = fsize.loc[nreads.index, :]
df = pd.concat([nreads, fsize], axis=1)[['nreads', 'coverage', 'fq.gz size, GiB']]
print(df.head(), '\n' * 2, df.tail())

                      nreads    coverage  fq.gz size, GiB
sample                                                   
MSSM_373_NeuN_pl  5126657796  246.736750       418.182316
PITT_118_NeuN_pl  4221676035  203.181618       348.390968
MSSM_215_NeuN_pl  3992458107  192.149774       323.420944
MSSM_183_NeuN_pl  3733596080  179.691214       307.038284
MSSM_369_NeuN_pl  3348853213  161.174237       272.427040 

                      nreads   coverage  fq.gz size, GiB
sample                                                 
MSSM_118_muscle   802078851  38.602602        67.163513
MSSM_183_muscle   794585616  38.241966        66.080299
MSSM_179_NeuN_mn  767418276  36.934451        63.949398
PITT_010_NeuN_mn  735747926  35.410214        61.114750
MSSM_179_muscle   656607857  31.601346        66.167599
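
The filter-and-sort step above could alternatively be written as an inner join on the sample index, which keeps only the samples present in both tables. The sketch below uses small made-up data frames (`nreads_toy`, `fsize_toy`, with hypothetical sample names) rather than the real tables, so it only illustrates the pattern.

```python
import pandas as pd

# Hypothetical toy tables indexed by sample name
nreads_toy = pd.DataFrame(
    {'nreads': [100, 300, 200]},
    index=pd.Index(['A', 'B', 'C'], name='sample'),
)
fsize_toy = pd.DataFrame(
    {'fq.gz size, GiB': [30.0, 20.0, 40.0]},
    index=pd.Index(['B', 'C', 'D'], name='sample'),
)

# The inner join keeps samples found in both tables; then sort by read count
df_toy = nreads_toy.join(fsize_toy, how='inner').sort_values('nreads', ascending=False)
print(df_toy)
```

Here only samples `B` and `C` survive the join, with `B` first because it has more reads.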


## Analysis

