<b>Author:</b> ...

<b>Contributors:</b> ...


<div class="alert alert-block alert-info">
Before you run this notebook, make sure you are using the Hail Genomics Analysis environment. To do so:
<br/>
    
<ul>
    <li>Click on the <b>cloud analysis environment</b> icon on the right-hand side of the screen.</li>
    <li>Inside <b>Recommended environments</b>, select <b>Hail Genomics Analysis</b> which creates a cloud environment for your analyses.</li>
    <li>This analysis can be run with <b>low compute</b> (e.g. 2 workers with 4 CPUs, 15 GB of RAM).</li>
    <li>Click on <b>Next</b>.</li>
</ul>
    
</div>

<h1>Notebook Objectives</h1>

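This notebook summarizes contig counts and N50 values per haplotype from the hifiasm assembly reports (<code>report_map.txt</code>) across samples, then works toward aligning one sample's long reads to its assembled haplotype contigs with minimap2 and samtools.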

<b>How to Use this Notebook...</b>

<b>As a tutorial:</b>

...

<b>As a resource:</b>

...

<h2>Relevant Information:</h2>

...

In [6]:
import numpy as np
import pandas as pd

In [2]:
!gsutil -u $GOOGLE_PROJECT cat gs://fc-aou-datasets-controlled/pooled/longreads/v7_base/hifiasm/1000151/report_map.txt | column -t

Assembly                    1000151.p_ctg  1000151.bp.hap1.p_ctg  1000151.bp.hap2.p_ctg
#_contigs_(gt_0_bp)         27151          32969                  28514
#_contigs_(gt_1000_bp)      27147          32965                  28513
#_contigs_(gt_5000_bp)      27141          32959                  28511
#_contigs_(gt_10000_bp)     27141          32959                  28511
#_contigs_(gt_25000_bp)     26992          32452                  28380
#_contigs_(gt_50000_bp)     21608          22856                  22436
Total_length_(gt_0_bp)      3742602783     3129680717             2909720507
Total_length_(gt_1000_bp)   3742599984     3129677918             2909719789
Total_length_(gt_5000_bp)   3742591626     3129669560             2909716419
Total_length_(gt_10000_bp)  3742591626     3129669560             2909716419
Total_length_(gt_25000_bp)  3739140140     3118623741             2906717820
Total_length_(gt_50000_bp)  3526738227     2745322344             2668983429
#_con

In [3]:
!gsutil -u $GOOGLE_PROJECT cat gs://fc-aou-datasets-controlled/pooled/longreads/v7_base/hifiasm/*/report_map.txt > full_report_map.txt

In [4]:
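# Total contig count per sample: `grep -v 'gt_'` drops the length-thresholded rows.
# Column 3 is the hap1 assembly and column 4 is the hap2 assembly.
hap1_num = !grep '#_contigs' full_report_map.txt | grep -v 'gt_' | awk '{ print $3 }'
hap2_num = !grep '#_contigs' full_report_map.txt | grep -v 'gt_' | awk '{ print $4 }'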

In [7]:
# Mean +/- standard deviation of hap1 contig counts (np.std defaults to ddof=0).
hap1_counts = [int(a) for a in hap1_num]
hap1_num_mean = f'{np.mean(hap1_counts):,.0f} +/- {np.std(hap1_counts):,.0f}'
hap1_num_mean

'28,174 +/- 4,680'

In [8]:
# Mean +/- standard deviation of hap2 contig counts (np.std defaults to ddof=0).
hap2_counts = [int(a) for a in hap2_num]
hap2_num_mean = f'{np.mean(hap2_counts):,.0f} +/- {np.std(hap2_counts):,.0f}'
hap2_num_mean

'24,926 +/- 3,727'
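
As an alternative to the grep/awk extraction above, the same values can be read with pandas, which this notebook imports. This is a minimal sketch and not the notebook's own method: `parse_report` and `mean_std_label` are hypothetical helpers, and the sample text is two rows copied from the single-sample report printed earlier. It assumes the report is whitespace-delimited with one header row.

```python
import io

import numpy as np
import pandas as pd

# Two rows copied from the single-sample report shown earlier in this notebook.
REPORT_TEXT = """\
Assembly                1000151.p_ctg  1000151.bp.hap1.p_ctg  1000151.bp.hap2.p_ctg
#_contigs_(gt_0_bp)     27151          32969                  28514
Total_length_(gt_0_bp)  3742602783     3129680717             2909720507
"""


def parse_report(text: str) -> pd.DataFrame:
    """Parse one whitespace-delimited report into a metric-by-assembly table."""
    return pd.read_csv(io.StringIO(text), sep=r"\s+", index_col=0)


def mean_std_label(values) -> str:
    """Format mean +/- population standard deviation, like the cells above."""
    arr = np.asarray(values, dtype=float)
    return f"{arr.mean():,.0f} +/- {arr.std():,.0f}"


report = parse_report(REPORT_TEXT)
# Look up a single metric for the hap1 assembly by row and column label.
hap1_contigs = int(report.loc["#_contigs_(gt_0_bp)", "1000151.bp.hap1.p_ctg"])
print(hap1_contigs)
```

Selecting by row and column label avoids relying on positional fields such as `$3` and `$4`, at the cost of parsing each sample's report separately.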

In [9]:
# N50 per sample: column 3 is the hap1 assembly, column 4 is the hap2 assembly.
hap1_n50 = !grep '^N50' full_report_map.txt | awk '{ print $3 }'
hap2_n50 = !grep '^N50' full_report_map.txt | awk '{ print $4 }'

In [10]:
hap1_n50_median = f'{np.median([int(a) for a in hap1_n50]):,.0f}'
hap1_n50_median

'124,757'

In [11]:
hap2_n50_median = f'{np.median([int(a) for a in hap2_n50]):,.0f}'
hap2_n50_median

'128,424'
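
For readers unfamiliar with the metric summarized above: N50 is the contig length L such that contigs of length L or longer together cover at least half of the total assembly length. The sketch below illustrates that definition on toy lengths. The function `n50` is a hypothetical helper written for illustration; the values in the reports come from the reporting tool, not from this code.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest contig down until half the total length is covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy example: total length is 54, and 10 + 9 + 8 = 27 reaches half of it.
toy_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
print(toy_n50)
```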

In [12]:
hap1_n50

['121705',
 '96345',
 '128468',
 '102209',
 '140572',
 '154206',
 '132194',
 '115991',
 '111710',
 '128370',
 '139928',
 '123424',
 '108181',
 '161210',
 '94935',
 '120513',
 '128794',
 '138522',
 '106282',
 '123816',
 '208254',
 '101454',
 '97040',
 '135043',
 '102430',
 '273123',
 '107153',
 '231932',
 '257746',
 '127552',
 '118965',
 '149646',
 '128975',
 '130412',
 '149081',
 '99255',
 '249584',
 '155875',
 '89957',
 '159792',
 '206929',
 '132914',
 '183496',
 '95336',
 '106871',
 '111956',
 '136340',
 '205220',
 '126278',
 '125961',
 '138080',
 '156384',
 '110823',
 '109437',
 '132820',
 '131101',
 '85932',
 '107058',
 '102428',
 '181991',
 '118561',
 '100915',
 '130935',
 '127450',
 '104709',
 '123602',
 '104122',
 '134990',
 '141722',
 '127728',
 '155992',
 '114546',
 '114961',
 '128575',
 '122633',
 '228006',
 '115109',
 '164459',
 '203981',
 '114693',
 '155542',
 '228503',
 '139728',
 '127481',
 '115244',
 '130268',
 '144142',
 '97878',
 '97690',
 '183480',
 '335800',
 '112856

In [13]:
!minimap2

Usage: minimap2 [options] <target.fa>|<target.idx> [query.fa] [...]
Options:
  Indexing:
    -H           use homopolymer-compressed k-mer (preferrable for PacBio)
    -k INT       k-mer size (no larger than 28) [15]
    -w INT       minimizer window size [10]
    -I NUM       split index for every ~NUM input bases [4G]
    -d FILE      dump index to FILE []
  Mapping:
    -f FLOAT     filter out top FLOAT fraction of repetitive minimizers [0.0002]
    -g NUM       stop chain enlongation if there are no minimizers in INT-bp [5000]
    -G NUM       max intron length (effective with -xsplice; changing -r) [200k]
    -F NUM       max fragment length (effective with -xsr or in the fragment mode) [800]
    -r NUM[,NUM] chaining/alignment bandwidth and long-join bandwidth [500,20000]
    -n INT       minimal number of minimizers on a chain [3]
    -m INT       minimal chaining score (matching bases minus log gap penalty) [40]
    -X           skip self and dual mappings (for 

In [18]:
!gsutil -u $GOOGLE_PROJECT cp gs://fc-aou-datasets-controlled/pooled/longreads/v7_base/hifiasm/1000151/1000151.haploTigs/* .

Copying gs://fc-aou-datasets-controlled/pooled/longreads/v7_base/hifiasm/1000151/1000151.haploTigs/1000151.bp.hap1.p_ctg.fa.gz...
Copying gs://fc-aou-datasets-controlled/pooled/longreads/v7_base/hifiasm/1000151/1000151.haploTigs/1000151.bp.hap2.p_ctg.fa.gz...
| [2 files][  1.5 GiB/  1.5 GiB]   25.6 MiB/s                                   
Operation completed over 2 objects/1.5 GiB.                                      


In [23]:
!zcat 1000151.bp.hap*.p_ctg.fa.gz > 1000151.bp.p_ctg.fa

In [32]:
!gsutil -u $GOOGLE_PROJECT cp gs://fc-aou-datasets-controlled/pooled/longreads/v7_base/bam/1000151/GRCh38/1000151.bam .

Copying gs://fc-aou-datasets-controlled/pooled/longreads/v7_base/bam/1000151/GRCh38/1000151.bam...
- [1 files][ 18.3 GiB/ 18.3 GiB]   26.9 MiB/s                                   
Operation completed over 1 objects/18.3 GiB.                                     


In [34]:
!samtools fastq 1000151.bam | head

@m64160e_220919_144140/64292609/ccs
CTAACCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC

CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA

In [35]:
!samtools faidx 1000151.bp.p_ctg.fa
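
`samtools faidx` writes a tab-separated `.fai` index with five columns per sequence: name, length, byte offset, bases per line, and bytes per line. The sketch below shows one way to load such an index into pandas to inspect contig lengths. `read_fai` is a hypothetical helper, and the two index rows are made-up illustrative values, not taken from this sample.

```python
import io

import pandas as pd

FAI_COLUMNS = ["name", "length", "offset", "linebases", "linewidth"]


def read_fai(handle) -> pd.DataFrame:
    """Load a samtools .fai index (path or file-like object) into a DataFrame."""
    return pd.read_csv(handle, sep="\t", header=None, names=FAI_COLUMNS)


# Illustrative index content only; contig names and sizes are invented.
example_fai = io.StringIO(
    "contig_a\t1000\t10\t60\t61\n"
    "contig_b\t250\t1037\t60\t61\n"
)

fai = read_fai(example_fai)
total_length = int(fai["length"].sum())
longest_contig = fai.loc[fai["length"].idxmax(), "name"]
print(total_length, longest_contig)
```

In this notebook the same helper could be pointed at the index created above, for example `read_fai("1000151.bp.p_ctg.fa.fai")`.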

In [59]:
!samtools fasta 1000151.bam > 1000151.fa

[bam2fq_mainloop] Error writing to FASTx files.: Broken pipe
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 7 reads
samtools bam2fq: Error closing STDOUT: Broken pipe


In [60]:
!cat 1000151.fa

>m64160e_220919_144140/64292609/ccs
CTAACCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC

In [62]:
!minimap2 -ayYL --MD --eqx -x map-pb 1000151.bp.p_ctg.fa 1000151.fa

[ERROR] missing input: please specify a query file to map or option -d to keep the index
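
One way to make the alignment step easier to debug is to build the commands explicitly and check the inputs before running anything. The sketch below is an assumption-laden illustration, not a diagnosis of the error above: `build_commands` and `align_reads` are hypothetical helpers, `minimap2` and `samtools` are assumed to be on the PATH, the thread count is arbitrary, and only the `-a` and `-x` options from the original command are carried over.

```python
import os
import subprocess


def build_commands(reference, reads, out_bam, preset="map-pb", threads=4):
    """Return the minimap2 and samtools sort argument lists for one alignment."""
    minimap2_cmd = ["minimap2", "-a", "-x", preset, "-t", str(threads), reference, reads]
    # "-" tells samtools sort to read the SAM stream from standard input.
    sort_cmd = ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"]
    return minimap2_cmd, sort_cmd


def align_reads(reference, reads, out_bam, **kwargs):
    """Pipe minimap2 SAM output into samtools sort after checking the inputs."""
    for path in (reference, reads):
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            raise FileNotFoundError(f"missing or empty input: {path}")
    minimap2_cmd, sort_cmd = build_commands(reference, reads, out_bam, **kwargs)
    aligner = subprocess.Popen(minimap2_cmd, stdout=subprocess.PIPE)
    sorter = subprocess.run(sort_cmd, stdin=aligner.stdout)
    aligner.stdout.close()
    if aligner.wait() != 0 or sorter.returncode != 0:
        raise RuntimeError("alignment pipeline failed")
    return out_bam


# Building the commands does not require the tools to be installed.
minimap2_cmd, sort_cmd = build_commands("1000151.bp.p_ctg.fa", "1000151.fa", "1000151.aligned.bam")
print(" ".join(minimap2_cmd))
```

Writing a coordinate-sorted BAM this way also produces a file that `samtools index` can accept, provided the alignment itself succeeds.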


In [51]:
!samtools index 1000151.aligned.bam

samtools index: "1000151.aligned.bam" is in a format that cannot be usefully indexed
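
The directory listing at the end of this notebook shows `1000151.aligned.bam` with a size of 0 bytes, which is consistent with this indexing failure. A cheap pre-check like the sketch below can catch that case before calling `samtools index`. `is_nonempty_file` is a hypothetical helper; it only tests existence and size, and does not verify that the file is a valid, coordinate-sorted BAM.

```python
import os
import tempfile


def is_nonempty_file(path):
    """Return True if the path is an existing file with at least one byte."""
    return os.path.isfile(path) and os.path.getsize(path) > 0


# Demonstrate on temporary files rather than on real alignment output.
with tempfile.TemporaryDirectory() as tmp:
    empty_path = os.path.join(tmp, "empty.bam")
    open(empty_path, "wb").close()
    empty_ok = is_nonempty_file(empty_path)

    filled_path = os.path.join(tmp, "filled.bam")
    with open(filled_path, "wb") as handle:
        handle.write(b"placeholder bytes, not a real BAM")
    filled_ok = is_nonempty_file(filled_path)

print(empty_ok, filled_ok)
```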


In [63]:
!ls -lh

total 29G
-rw-rw-r-- 1 jupyter     users    0 Mar  4 05:39 1000151.aligned.bam
-rw-rw-r-- 1 jupyter     users  19G Mar  4 05:07 1000151.bam
-rw-rw-r-- 1 jupyter     users 810M Mar  4 04:49 1000151.bp.hap1.p_ctg.fa.gz
-rw-rw-r-- 1 jupyter     users 753M Mar  4 04:49 1000151.bp.hap2.p_ctg.fa.gz
-rw-rw-r-- 1 jupyter     users 5.7G Mar  4 04:55 1000151.bp.p_ctg.fa
-rw-rw-r-- 1 jupyter     users 2.5M Mar  4 05:22 1000151.bp.p_ctg.fa.fai
-rw-rw-r-- 1 jupyter     users  87K Mar  4 05:46 1000151.fa
-rw-rw-r-- 1 jupyter     users  13K Mar  4 05:33 1000151.fa.gz
-rw-rw-r-- 1 jupyter     users  36K Mar  4 05:29 1000151.fq.gz
-rw-rw-r-- 1 jupyter     users 109K Mar  4 04:15 auxiliary_metrics.GRCh38.tsv
-rw-rw-r-- 1 jupyter     users 960K Mar  4 04:37 full_report_map.txt
-rw-rw-r-- 1 jupyter     users 3.0G Feb 27 08:42 GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
-rw-rw-r-- 1 jupyter     users  21M Mar  4 04:15 genomic_metrics.tsv
-rw-rw-r-- 1 jupyter     users  12M Feb 27 05:51 hail-20240227-02