<b>Author:</b> ...

<b>Contributors:</b> ...


<div class="alert alert-block alert-info">
Before running this notebook, make sure you are using the Hail Genomics Analysis environment. To set it up:
<br/>
    
<ul>
    <li>Click on the <b>cloud analysis environment</b> icon on the right-hand side of the screen.</li>
    <li>Under <b>Recommended environments</b>, select <b>Hail Genomics Analysis</b>, which creates a cloud environment for your analyses.</li>
    <li>This analysis can be run with <b>low compute</b> (e.g. 2 workers with 4 CPUs, 15 GB of RAM).</li>
    <li>Click on <b>Next</b>.</li>
</ul>
    
</div>

<h1>Notebook Objectives</h1>

This notebook shows where the BAM files and the manifest are stored, how to localize the manifest or a known BAM to your workspace bucket and active environment, and how to use the paths listed in the manifest to localize BAMs.

<b>How to Use this Notebook...</b>

<b>As a tutorial:</b>

...

<b>As a resource:</b>

...

<h2>Relevant Information:</h2>

...

In [2]:
import pandas
import os

# This query represents dataset "Long reads with short read SVs, basic info" for domain "person" and was generated for All of Us Controlled Tier Dataset v7
dataset_51778023_person_sql = """
    SELECT
        person.person_id,
        person.gender_concept_id,
        p_gender_concept.concept_name as gender,
        person.birth_datetime as date_of_birth,
        person.race_concept_id,
        p_race_concept.concept_name as race,
        person.ethnicity_concept_id,
        p_ethnicity_concept.concept_name as ethnicity,
        person.sex_at_birth_concept_id,
        p_sex_at_birth_concept.concept_name as sex_at_birth 
    FROM
        `""" + os.environ["WORKSPACE_CDR"] + """.person` person 
    LEFT JOIN
        `""" + os.environ["WORKSPACE_CDR"] + """.concept` p_gender_concept 
            ON person.gender_concept_id = p_gender_concept.concept_id 
    LEFT JOIN
        `""" + os.environ["WORKSPACE_CDR"] + """.concept` p_race_concept 
            ON person.race_concept_id = p_race_concept.concept_id 
    LEFT JOIN
        `""" + os.environ["WORKSPACE_CDR"] + """.concept` p_ethnicity_concept 
            ON person.ethnicity_concept_id = p_ethnicity_concept.concept_id 
    LEFT JOIN
        `""" + os.environ["WORKSPACE_CDR"] + """.concept` p_sex_at_birth_concept 
            ON person.sex_at_birth_concept_id = p_sex_at_birth_concept.concept_id  
    WHERE
        person.PERSON_ID IN (SELECT
            distinct person_id  
        FROM
            `""" + os.environ["WORKSPACE_CDR"] + """.cb_search_person` cb_search_person  
        WHERE
            cb_search_person.person_id IN (SELECT
                person_id 
            FROM
                `""" + os.environ["WORKSPACE_CDR"] + """.cb_search_person` p 
            WHERE
                has_whole_genome_variant = 1 ) 
            AND cb_search_person.person_id IN (SELECT
                person_id 
            FROM
                `""" + os.environ["WORKSPACE_CDR"] + """.cb_search_person` p 
            WHERE
                has_lr_whole_genome_variant = 1 ) 
            AND cb_search_person.person_id IN (SELECT
                person_id 
            FROM
                `""" + os.environ["WORKSPACE_CDR"] + """.cb_search_person` p 
            WHERE
                has_structural_variant_data = 1 ) )"""

dataset_51778023_person_df = pandas.read_gbq(
    dataset_51778023_person_sql,
    dialect="standard",
    use_bqstorage_api=("BIGQUERY_STORAGE_API_ENABLED" in os.environ),
    progress_bar_type="tqdm_notebook")

dataset_51778023_person_df.head(5)

Downloading:   0%|          | 0/989 [00:00<?, ?rows/s]

Unnamed: 0,person_id,gender_concept_id,gender,date_of_birth,race_concept_id,race,ethnicity_concept_id,ethnicity,sex_at_birth_concept_id,sex_at_birth
0,1904084,45878463,Female,1969-06-15 00:00:00+00:00,8516,Black or African American,38003564,Not Hispanic or Latino,46273637,Intersex
1,2100229,903096,PMI: Skip,1962-06-15 00:00:00+00:00,903096,PMI: Skip,903096,PMI: Skip,0,No matching concept
2,1835685,903096,PMI: Skip,1954-06-15 00:00:00+00:00,903096,PMI: Skip,903096,PMI: Skip,0,No matching concept
3,1938812,903096,PMI: Skip,2000-06-15 00:00:00+00:00,903096,PMI: Skip,903096,PMI: Skip,0,No matching concept
4,1203311,903096,PMI: Skip,1966-06-15 00:00:00+00:00,903096,PMI: Skip,903096,PMI: Skip,0,No matching concept


In [3]:
# Write one person_id per line; bcftools reads this file through -S in the cells below.
with open('sample_names.txt', 'w') as wf:
    for sample_name in dataset_51778023_person_df.person_id:
        wf.write(f'{sample_name}\n')
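
The cell above writes one ID per line, which is the layout that `bcftools view -S` reads. As a minimal sketch, the same step can be wrapped in a small function that also converts IDs to strings and drops repeated IDs in first-seen order. The name `write_sample_list` is hypothetical and is not part of the notebook or of any library.

```python
from pathlib import Path
from typing import Iterable


def write_sample_list(sample_ids: Iterable, path: str) -> int:
    """Write one sample ID per line and return how many IDs were written.

    Hypothetical helper for illustration: IDs are converted to strings and
    de-duplicated while keeping their first-seen order.
    """
    unique_ids = list(dict.fromkeys(str(sample_id) for sample_id in sample_ids))
    Path(path).write_text("".join(f"{sample_id}\n" for sample_id in unique_ids))
    return len(unique_ids)
```

In this notebook it could be called as `write_sample_list(dataset_51778023_person_df.person_id, "sample_names.txt")`; the returned count can then be compared with the number of rows in the data frame.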

In [4]:
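# Copy the short-read SV VCF and its .tbi index (the trailing * matches both) unless the VCF is already present.
# -u names the project that is billed for access to the requester-pays bucket.
if not os.path.exists("AoU_srWGS_SV.vcf.gz"):
    !gsutil -u $GOOGLE_PROJECT cp gs://fc-aou-datasets-controlled/v7/wgs/short_read/structural_variants/vcf/AoU_srWGS_SV.vcf.gz* .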

Copying gs://fc-aou-datasets-controlled/v7/wgs/short_read/structural_variants/vcf/AoU_srWGS_SV.vcf.gz...
Copying gs://fc-aou-datasets-controlled/v7/wgs/short_read/structural_variants/vcf/AoU_srWGS_SV.vcf.gz.tbi...
/ [2 files][  3.1 GiB/  3.1 GiB]   25.7 MiB/s                                   
Operation completed over 2 objects/3.1 GiB.                                      
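
The guard in the cell above tests only for the VCF, so an interrupted copy that left the VCF without its `.tbi` index would not be retried. A minimal sketch of a stricter check follows; `missing_local_files` is a hypothetical helper name, and the two file names are the ones the `gsutil` wildcard copied in the output above.

```python
import os
from typing import List


def missing_local_files(paths: List[str]) -> List[str]:
    """Return the paths from the input list that do not exist locally."""
    return [path for path in paths if not os.path.exists(path)]


# The VCF and the index that the wildcard copy is expected to produce.
required = ["AoU_srWGS_SV.vcf.gz", "AoU_srWGS_SV.vcf.gz.tbi"]
to_fetch = missing_local_files(required)
print(to_fetch if to_fetch else "VCF and index are already present")
```

If `to_fetch` is non-empty, the copy command can be rerun; `bcftools` and `tabix` generally need the index for region queries, so checking both files is the safer condition.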


In [5]:
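# Confirm how many samples -S keeps: print only the header (-h), then count the
# sample columns that follow the nine leading columns of the #CHROM line.
!bcftools view -h -S sample_names.txt AoU_srWGS_SV.vcf.gz | grep -m1 '^#CHROM' | sed 's/\t/\n/g' | tail -n +10 | wc -l

989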
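
The shell pipeline above counts samples by skipping the first nine tab-separated fields of the `#CHROM` line (the eight fixed VCF columns plus `FORMAT`). The same logic can be sketched in Python, which is convenient when the sample names themselves are needed rather than only their count. The function name and the sample IDs in the usage line are illustrative assumptions.

```python
from typing import List

# #CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO and FORMAT precede the samples.
LEADING_VCF_COLUMNS = 9


def samples_from_chrom_line(chrom_line: str) -> List[str]:
    """Return the sample names from a VCF '#CHROM' header line."""
    fields = chrom_line.rstrip("\n").split("\t")
    if fields[0] != "#CHROM":
        raise ValueError("expected a line starting with '#CHROM'")
    return fields[LEADING_VCF_COLUMNS:]


# Illustrative header line with two made-up sample IDs.
example = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
print(len(samples_from_chrom_line(example)))
```

Comparing the returned names with the contents of `sample_names.txt` would show exactly which requested IDs are present in the VCF, not just how many.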


In [6]:
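# Subset to the listed samples, keep only records whose FILTER is PASS (-f),
# drop the genotype columns (-G), and write bgzip-compressed VCF (-O z).
!bcftools view -S sample_names.txt -f "PASS" -G -O z -o AoU_srWGS_SV.subset.vcf.gz AoU_srWGS_SV.vcf.gz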

In [7]:
!tabix -p vcf AoU_srWGS_SV.subset.vcf.gz

In [8]:
# Count the records in the subset; -H suppresses the header lines.
!bcftools view -H AoU_srWGS_SV.subset.vcf.gz | wc -l

415509


In [9]:
# Tally records by ALT allele (column 5).
!bcftools view -H AoU_srWGS_SV.subset.vcf.gz | cut -f5 | sort | uniq -c

   5260 <CPX>
     21 <CTX>
 208527 <DEL>
     57 <DEL:ME:HERVK>
    751 <DEL:ME:LINE1>
  73822 <DUP>
  33847 <INS>
  73087 <INS:ME:ALU>
  11458 <INS:ME:LINE1>
   8064 <INS:ME:SVA>
    615 <INV>


In [10]:
# Tally records by base SV type: drop everything after the first colon, then the angle brackets.
!bcftools view -H AoU_srWGS_SV.subset.vcf.gz | cut -f5 | sed 's/:.*//; s/[<>]//g' | sort | uniq -c

   5260 CPX
     21 CTX
 209335 DEL
  73822 DUP
 126456 INS
    615 INV
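
The `sed` step above reduces each symbolic ALT allele to its base type, so that, for example, `<INS:ME:ALU>` is counted as `INS`. A minimal Python sketch of the same collapse is shown below, using the per-ALT counts printed by cell `In [9]`; the function names are hypothetical.

```python
from collections import Counter
from typing import Dict


def base_sv_type(alt: str) -> str:
    """Reduce a symbolic ALT such as '<INS:ME:ALU>' to its base type 'INS'."""
    return alt.strip("<>").split(":", 1)[0]


def count_base_types(alt_counts: Dict[str, int]) -> Counter:
    """Sum per-ALT record counts into per-base-type totals."""
    totals: Counter = Counter()
    for alt, n_records in alt_counts.items():
        totals[base_sv_type(alt)] += n_records
    return totals


# Per-ALT counts copied from the output of cell In [9].
alt_counts = {
    "<CPX>": 5260, "<CTX>": 21, "<DEL>": 208527, "<DEL:ME:HERVK>": 57,
    "<DEL:ME:LINE1>": 751, "<DUP>": 73822, "<INS>": 33847,
    "<INS:ME:ALU>": 73087, "<INS:ME:LINE1>": 11458, "<INS:ME:SVA>": 8064,
    "<INV>": 615,
}
for sv_type, n_records in sorted(count_base_types(alt_counts).items()):
    print(sv_type, n_records)
```

The totals agree with the shell output: the three deletion classes sum to 209335 and the four insertion classes to 126456.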


<div class="alert alert-block alert-danger">
The remainder of this notebook is copied from our internal app.terra.bio production workspace. The data it uses is queued for migration from internal storage to this public Researcher Workbench workspace, so the code below may not function properly until the migration is complete.
<br/>