# Code to get tracks set up for CoVid-relevant regulatory architecture
# Lung (epithelial-like) regulatory architecture

# Minimal tracks to prepare 
Track : format - source

    * CTCF : bigwig - Imputed CD8 T-cell ENCODE 
    * H3K27ac : bigwig - ENCODE NHLF
    * H3K4me3 : bigwig - ENCODE SAEC
    * H3K9me3 : bigwig - ENCODE NHLF
    * DNase-seq : bigwig - ENCODE SAEC
    * Methylation : bigwig - None
    * Loops : links - hESC
    * Hi-C : cool - IMR90 GItar
    * Genes : genes_bed -  Gencode
    * Repeats : bed - L1Base2
    * Chromatin state : bed - ENCODE/Segway
    * eQTL list : arcs - none
    * GWAS : bed - pvals and bigwig Ellinghaus and Covid19Hg
    * RNA-Seq : bigwig and txt - HBEC, BALF, BALF-Covid
    

In [None]:
import pyensembl, os, sys, re, numpy as np
from helper_funcs import *


### Transcription factor and histones

In [None]:
%%bash
#Get Histone marks:

# Get bigwig H3k4me3 fold change over control 
wget -nc -O /input_dir/corona_analysis/tracks/ENCFF782BOR_H3k4Me3_SAEC.bw --quiet \
    "https://encode-public.s3.amazonaws.com/2016/08/23/89ae3a85-af02-49bd-b378-e565595106b4/ENCFF782BOR.bigWig"
# Narrowpeaks
wget -nc -O /input_dir/corona_analysis/tracks/ENCFF916HZY_H3k4Me3_SAEC_hg38.bed.gz --quiet \
    "https://encode-public.s3.amazonaws.com/2020/08/14/45ae9a4a-78ce-4407-a9a2-b566955f17cf/ENCFF916HZY.bed.gz"
gunzip /input_dir/corona_analysis/tracks/ENCFF916HZY_H3k4Me3_SAEC_hg38.bed.gz

# Get bigwig H3K27Ac fold change over control
wget -nc -O /input_dir/corona_analysis/tracks/ENCFF637KNN_H3k27Ac_NHLF.bw --quiet \
    "https://encode-public.s3.amazonaws.com/2017/02/07/b8c9384f-af28-48bd-9474-2293237bba68/ENCFF637KNN.bigWig"
# Narrowpeaks
wget -nc -O /input_dir/corona_analysis/tracks/ENCFF731CHG_H3k27Ac_NHLF_hg38.bed.gz --quiet \
    "https://encode-public.s3.amazonaws.com/2017/02/07/95c81178-c7fc-4129-a0c8-05ecf4957ad7/ENCFF731CHG.bed.gz"    
gunzip /input_dir/corona_analysis/tracks/ENCFF731CHG_H3k27Ac_NHLF_hg38.bed.gz

# Get bigwig H3K9Me3 fold change over control
wget -nc -O /input_dir/corona_analysis/tracks/ENCFF143LCY_H3k9Me3_NHLF.bw --quiet \
    "https://encode-public.s3.amazonaws.com/2016/11/28/77fa3756-baeb-4a1b-967b-d2d02836db92/ENCFF143LCY.bigWig"
# Narrowpeaks
wget -nc -O /input_dir/corona_analysis/tracks/ENCFF959LWN_H3k9Me3_NHLF_hg38.bed.gz --quiet \
    "https://encode-public.s3.amazonaws.com/2016/11/28/432ec1c3-87ff-403c-9fa8-07fe166cb5c3/ENCFF959LWN.bed.gz"
gunzip /input_dir/corona_analysis/tracks/ENCFF959LWN_H3k9Me3_NHLF_hg38.bed.gz

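The cells in this section all follow the same pattern: download a file unless it is already there, then decompress it. A minimal Python sketch of that pattern, for cases where it is handier to drive the downloads from Python. `fetch_track` is a hypothetical helper (it is not part of `helper_funcs`), and the `.part` temporary suffix is an arbitrary choice.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path


def fetch_track(url, dest, decompress=False):
    """Download url to dest unless dest already exists (hypothetical helper).

    With decompress=True the download is treated as gzip-compressed and
    dest is the path of the decompressed file.
    """
    dest = Path(dest)
    if dest.exists():
        # Same intent as `wget -nc`: never overwrite an existing track
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response:
        # GzipFile can decompress the response stream on the fly
        source = gzip.GzipFile(fileobj=response) if decompress else response
        with open(tmp, "wb") as out:
            shutil.copyfileobj(source, out)
    # Rename only after a complete download so partial files are never reused
    tmp.replace(dest)
    return dest
```

Because the existence check is made on the decompressed file, rerunning the cell skips both the download and the decompression step.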

In [None]:
%%bash
#Get Bigwigs:

# Get CTCF chip-seq fold change over control
wget -nc -O /input_dir/corona_analysis/tracks/ENCFF117BRM_CTCF_NHLF.bw --quiet \
    "https://encode-public.s3.amazonaws.com/2016/12/16/6f8d8132-8da6-467e-a4e3-5f720ae2370c/ENCFF117BRM.bigWig"

# Get called CTCF peaks
wget -nc -O /input_dir/corona_analysis/tracks/ENCFF777ODE_CTCF_NHLF.bed.gz --quiet \
    "https://encode-public.s3.amazonaws.com/2016/12/16/b9ad65ca-eb0d-4c6e-8d38-64346569f026/ENCFF777ODE.bed.gz"
gunzip /input_dir/corona_analysis/tracks/ENCFF777ODE_CTCF_NHLF.bed.gz
    

### ATRX

Get ATRX binding from [ATRX binds to atypical chromatin domains at the 3′ exons of zinc finger genes to preserve H3K9me3 enrichment](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4939920/)

ChIP-seq data against ATRX come from GEO series GSE70920.


In [None]:
%%bash
cd /input_dir/corona_analysis/tracks
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE70nnn/GSE70920/suppl/GSE70920_RAW.tar
tar -xvf  GSE70920_RAW.tar 

#Visualize relevant bigwigs
#GSM1821904_hg19-K562-ATRX_SC-enrichment_over_input.bigWig
#GSM1821905_hg19-K562-ATRX_Abcam-enrichment_over_input.bigWig


In [None]:
%%bash

#Flip bigwigs to hg38
CrossMap.py bigwig /input_dir/corona_analysis/annotations/hg19ToHg38.over.chain.gz /input_dir/corona_analysis/tracks/GSM1821904_hg19-K562-ATRX_SC-enrichment_over_input.bigWig \
    /input_dir/corona_analysis/tracks/GSM1821904_hg38-K562-ATRX_SC-enrichment_over_input

CrossMap.py bigwig /input_dir/corona_analysis/annotations/hg19ToHg38.over.chain.gz /input_dir/corona_analysis/tracks/GSM1821905_hg19-K562-ATRX_Abcam-enrichment_over_input.bigWig \
    /input_dir/corona_analysis/tracks/GSM1821905_hg38-K562-ATRX_Abcam-enrichment_over_input



### G-Quadruplex


Data come from a preprint (not yet published at the time of writing): [Promoter G-quadruplexes and transcription factors cooperate to shape the cell type-specific transcriptome](https://www.biorxiv.org/content/10.1101/2020.08.27.236778v1.full.pdf)

The files are currently downloaded manually with a reviewer key from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE145535

A second source is [G-quadruplex structures mark human regulatory chromatin](https://www.nature.com/articles/ng.3662?proof=t)

Called BED peaks come from ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE99nnn/GSE99205/suppl/GSE99205_common_HaCaT_G4_ChIP_peaks_RNase_treated.bed.gz


In [None]:
%%bash

#Get called peaks from HaCaT lines

cd /input_dir/corona_analysis/tracks
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE99nnn/GSE99205/suppl/GSE99205_common_HaCaT_G4_ChIP_peaks_RNase_treated.bed.gz
gunzip GSE99205_common_HaCaT_G4_ChIP_peaks_RNase_treated.bed.gz
sed 's/\s/\t/g' GSE99205_common_HaCaT_G4_ChIP_peaks_RNase_treated.bed | sort-bed - > t.bed
mv t.bed GSE99205_common_HaCaT_G4_ChIP_peaks_RNase_treated_hg19.bed

CrossMap.py bed /input_dir/corona_analysis/annotations/hg19ToHg38.over.chain.gz /input_dir/corona_analysis/tracks/GSE99205_common_HaCaT_G4_ChIP_peaks_RNase_treated_hg19.bed \
    /input_dir/corona_analysis/tracks/GSE99205_common_HaCaT_G4_ChIP_peaks_RNase_treated_hg38.bed

sort-bed GSE99205_common_HaCaT_G4_ChIP_peaks_RNase_treated_hg38.bed > t.bed
mv t.bed GSE99205_common_HaCaT_G4_ChIP_peaks_RNase_treated_hg38.bed


In [None]:
%%bash

cd /input_dir/corona_analysis/tracks
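# The original cell requested GSE70920_RAW.tar (the ATRX series) under the GSE99205 path;
# the archive name below assumes the usual GEO <series>_RAW.tar naming
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE99nnn/GSE99205/suppl/GSE99205_RAW.tar
tar -xvf  GSE99205_RAW.tar 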


In [None]:
%%bash
#Flip to hg19
CrossMap.py bigwig /input_dir/annotations/hg38ToHg19.over.chain.gz /input_dir/corona_analysis/tracks/GSE145535_RAW/GSM4320548_ChIPseq_IP_BG4_rep2_93T449_m1.ucsc.bigWig \
    /input_dir/corona_analysis/tracks/GSE145535_RAW/GSM4320548_ChIPseq_IP_BG4_rep2_93T449_m1


### Chromatin accessibility

In [None]:
%%bash

# Get bigwig DNAse normalized read count 
wget -nc -O /input_dir/corona_analysis/tracks/ENCFF827VFY_DNAse_SAEC.bw --quiet \
"https://encode-public.s3.amazonaws.com/2017/08/24/65b36dd7-fa4e-47ba-91a8-536ebcff4d93/ENCFF827VFY.bigWig"


In [None]:
%%bash

#Get bigwig scATAC-seq collapsed by tissue from "A human cell atlas of fetal chromatin accessibility" all in hg19
wget -nc -O /input_dir/corona_analysis/tracks/lung_bronchiolar_and_alveolar_epithelial_cells_hg19.bw --quiet \
"https://atlas.fredhutch.org/data/bbi/descartes/human_atac/bigwig/lung_bronchiolar_and_alveolar_epithelial_cells.bw"
wget -nc -O /input_dir/corona_analysis/tracks/lung_vascular_endothelial_cells_hg19.bw --quiet \
"https://atlas.fredhutch.org/data/bbi/descartes/human_atac/bigwig/lung_vascular_endothelial_cells.bw"


In [None]:
%%bash

#Flip scATAC-seq bw to hg38
#Lung / alveolar epithelial
CrossMap.py bigwig /input_dir/annotations/hg19ToHg38.over.chain.gz /input_dir/corona_analysis/tracks/lung_bronchiolar_and_alveolar_epithelial_cells_hg19.bw \
    /input_dir/corona_analysis/tracks/lung_bronchiolar_and_alveolar_epithelial_cells_hg38
#Lung vascular endothelial
CrossMap.py bigwig /input_dir/annotations/hg19ToHg38.over.chain.gz /input_dir/corona_analysis/tracks/lung_vascular_endothelial_cells_hg19.bw \
    /input_dir/corona_analysis/tracks/lung_vascular_endothelial_cells_hg38


In [None]:
%%bash

# Get bed narrowpeak DNAse (ENCODE file ENCFF546QUZ; the output name reuses the ENCFF827VFY prefix of the bigwig above)
wget -nc -O /input_dir/corona_analysis/tracks/ENCFF827VFY_DNAse_SAEC.bed.gz --quiet \
"https://encode-public.s3.amazonaws.com/2018/05/14/0b8ae8d1-bac1-41b2-b8fd-7795544fcc19/ENCFF546QUZ.bed.gz"
gunzip /input_dir/corona_analysis/tracks/ENCFF827VFY_DNAse_SAEC.bed.gz


### Called TADs

Get TAD calls from IMR90 in ENCODE ENCSR852KQC


In [None]:
%%bash

# Get IMR90 TAD calls (bedpe)
wget -nc -O /input_dir/corona_analysis/tracks/ENCFF307RGV_IMR90_TAD.bedpe.gz --quiet \
"https://encode-public.s3.amazonaws.com/2019/02/15/61c15e87-5f91-4eef-adeb-0fc6cc1303e1/ENCFF307RGV.bedpe.gz"
gunzip /input_dir/corona_analysis/tracks/ENCFF307RGV_IMR90_TAD.bedpe.gz
#cut -f 1-3 

In [None]:
%%bash

cat /input_dir/corona_analysis/tracks/ENCFF307RGV_IMR90_TAD.bedpe | \
    tail -n +2 | cut -f 1-3 | sort-bed - > /input_dir/corona_analysis/tracks/ENCFF307RGV_IMR90_TAD_hg19.bed

#Flip bed to hg38
CrossMap.py bed /input_dir/corona_analysis/annotations/hg19ToHg38.over.chain.gz /input_dir/corona_analysis/tracks/ENCFF307RGV_IMR90_TAD_hg19.bed \
    /input_dir/corona_analysis/tracks/ENCFF307RGV_IMR90_TAD_hg38.bed

sort-bed /input_dir/corona_analysis/tracks/ENCFF307RGV_IMR90_TAD_hg38.bed > t.bed; mv t.bed /input_dir/corona_analysis/tracks/ENCFF307RGV_IMR90_TAD_hg38.bed


### Rough XYZ coordinates recovered from Hi-C

Using GSDB, a large database of 3D genome structure reconstructions: 
[GSDB](http://sysbio.rnet.missouri.edu/3dgenome/GSDB/browse.php)

Paper: [here](https://bmcmolcellbiol.biomedcentral.com/articles/10.1186/s12860-020-00304-y)


In [None]:
%%bash

#Get IMR90 and use MiniMDS run
wget -nc -O /input_dir/corona_analysis/tracks/IMR90_GDSB_all.tar.gz --quiet \
    "http://calla.rnet.missouri.edu/genome3d/GSDB/Database/BB8015WF/hIMR90_3DStructures.tar.gz"
cd /input_dir/corona_analysis/tracks/
tar -xzf /input_dir/corona_analysis/tracks/IMR90_GDSB_all.tar.gz


In [None]:
import os
out_xyz = "/input_dir/corona_analysis/tracks/hIMR90_miniMDS_hg18.bed"
mds_dir = "/input_dir/corona_analysis/tracks/Yaffe_Tanay/miniMDS"

with open(out_xyz, "w") as out_mini:
    # Files chr1..chr23; chr23 is relabelled chrX below
    for i in range(1,24):
        with open(os.path.join(mds_dir, "chr"+str(i)+"_miniMDS.tsv"),"r") as miniMDS_in:
            # Three header lines: chromosome name, bin resolution, start offset
            cur_chr = miniMDS_in.readline().strip()
            if (cur_chr == "chr23"):
                cur_chr = "chrX"
            cur_res = int(miniMDS_in.readline().strip())
            cur_start = int(miniMDS_in.readline().strip())
            # Remaining lines: bin index followed by x, y, z coordinates
            for line in miniMDS_in:
                #Toss lines that didn't work
                if ("nan" not in line):
                    cur_arr = line.strip().split("\t")
                    cur_bin = int(cur_arr[0])
                    cur_line = (cur_chr + "\t" + 
                        str(1+cur_start+(cur_bin*cur_res)) + "\t" +
                        str(cur_start+((cur_bin+1)*cur_res)) + "\t" +
                        cur_arr[1] + "," +
                        cur_arr[2] + "," +
                        cur_arr[3] + "\n")
                    out_mini.write(cur_line)

# Sort the combined BED in place
os.system("sort-bed " + out_xyz + " > t.bed; mv t.bed " + out_xyz)

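The coordinate arithmetic in the cell above is easier to check when it is pulled out into a small function. A minimal sketch; `bin_to_bed_line` is a hypothetical name, not something from `helper_funcs`, and it reproduces the interval convention of the cell as written.

```python
def bin_to_bed_line(chrom, start, res, bin_ix, xyz):
    """Format one miniMDS bin as a BED line with "x,y,z" in the name column.

    Same arithmetic as the cell above: the interval begins at
    1 + start + bin_ix * res and ends at start + (bin_ix + 1) * res.
    """
    begin = 1 + start + bin_ix * res
    end = start + (bin_ix + 1) * res
    return "\t".join([chrom, str(begin), str(end), ",".join(xyz)]) + "\n"


# Third bin (index 2) of a 100 kb resolution structure starting at offset 0
line = bin_to_bed_line("chr1", 0, 100000, 2, ["0.1", "0.2", "0.3"])
```

With these inputs the bin covers 200001 to 300000, so consecutive bins never share a coordinate.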

In [None]:
%%bash

#Flip xyz from hg18 to hg38

CrossMap.py bed /input_dir/corona_analysis/annotations/hg18ToHg38.over.chain.gz /input_dir/corona_analysis/tracks/hIMR90_miniMDS_hg18.bed \
    /input_dir/corona_analysis/tracks/hIMR90_miniMDS_hg38.bed
sort-bed /input_dir/corona_analysis/tracks/hIMR90_miniMDS_hg38.bed > t.bed
mv t.bed /input_dir/corona_analysis/tracks/hIMR90_miniMDS_hg38.bed


### Chromatin looping

In [None]:
#Got loops from previous hESC paper in:
# /input_dir/corona_analysis/tracks/primed_.7_origami_intra.arcs

In [None]:
%%bash

#Get Hi-C contact matrix for IMR90 from GenomeGitar
wget -O /input_dir/corona_analysis/tracks/GSM1551599_contact_matrices.zip --quiet -nc \
    https://data.genomegitar.org/GSM1551599/GSM1551599_contact_matrices.zip
unzip /input_dir/corona_analysis/tracks/GSM1551599_contact_matrices.zip

#Get Hi-C data in h5 for IMR90 from GenomeGitar
wget -O /input_dir/corona_analysis/tracks/GSM1551599_hdf5.zip --quiet -nc \
    https://data.genomegitar.org/GSM1551599/GSM1551599_hdf5.zip
unzip /input_dir/corona_analysis/tracks/GSM1551599_hdf5.zip


In [None]:
%%bash

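#Get Hi-C data from 4DN -- IMR90 in situ Hi-C, same as above
# (URLs kept as comments for reference; nothing is downloaded in this cell)
# Experiment set: https://data.4dnucleome.org/experiment-set-replicates/4DNES1ZEJNRU/
# .hic file: https://data.4dnucleome.org/files-processed/4DNFIH7TH4MF/@@download/4DNFIH7TH4MF.hic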
    

### Chromatin state

In [None]:
%%bash

#Get lung chromatin state (BC_LUNG_01-11002) as predicted by Segway
wget --quiet -nc https://noble.gs.washington.edu/proj/encyclopedia/interpreted/BC_LUNG_01-11002.bed.gz 
gunzip BC_LUNG_01-11002.bed.gz 
sort-bed BC_LUNG_01-11002.bed | bgzip > /input_dir/corona_analysis/tracks/Segway_bc_lung.bed.gz
rm BC_LUNG_01-11002.bed
tabix -p bed /input_dir/corona_analysis/tracks/Segway_bc_lung.bed.gz


In [None]:
%%bash
#Get NHLF ChromHMM 15-state core-marks model (Roadmap E128)
wget --quiet -nc -O E128_15_coreMarks_dense.bed.gz \
    https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/E128_15_coreMarks_dense.bed.gz
gunzip E128_15_coreMarks_dense.bed.gz
sort-bed E128_15_coreMarks_dense.bed | bgzip > /input_dir/corona_analysis/tracks/E128_15_coreMarks_dense.bed.gz
rm E128_15_coreMarks_dense.bed
tabix -p bed /input_dir/corona_analysis/tracks/E128_15_coreMarks_dense.bed.gz


### RNA-seq

In [None]:
%%bash

#Get Human Bronchial Epithelial Cells (HBEC)

#Get bigwig of alignments
# Minus strand
wget -nc -O /input_dir/corona_analysis/tracks/ENCFF730XSB_HBEC_rnaseq_minus.bw --quiet \
    "https://encode-public.s3.amazonaws.com/2016/02/24/e2d0908f-7b88-40d5-ae12-526837091e9d/ENCFF730XSB.bigWig"

# Plus strand
wget -nc -O /input_dir/corona_analysis/tracks/ENCFF322WTN_HBEC_rnaseq_plus.bw --quiet \
    "https://encode-public.s3.amazonaws.com/2016/02/24/a7520a56-fc0a-450c-80c1-12d201fbd405/ENCFF322WTN.bigWig"

#Transcript quantifications
wget -nc -O /input_dir/corona_analysis/tracks/ENCFF679UBR_HBEC_rnaseq_transcript.tsv --quiet \
    "https://encode-public.s3.amazonaws.com/2016/02/24/d293a404-1fb5-4aee-a300-b95ba3ca682c/ENCFF679UBR.tsv"
    
#Gene quantifications
wget -nc -O /input_dir/corona_analysis/tracks/ENCFF207VFL_HBEC_rnaseq_gene.tsv --quiet \
    "https://encode-public.s3.amazonaws.com/2016/02/24/720e0bc2-c342-4542-946d-6c5728049c32/ENCFF207VFL.tsv"
    

In [None]:
%%bash

#Get NHLF Human lung fibroblast RNA-seq

#Get bigwig of alignments
# Minus strand
wget -nc -O /input_dir/corona_analysis/tracks/ENCFF121EOD_NHLF_rnaseq_minus_hg38.bw --quiet \
    "https://encode-public.s3.amazonaws.com/2016/02/29/378b154b-29e3-45d7-8eb5-74f2a335da9a/ENCFF121EOD.bigWig"

# Plus strand
wget -nc -O /input_dir/corona_analysis/tracks/ENCFF996XPQ_NHLF_rnaseq_plus_hg38.bw --quiet \
    "https://encode-public.s3.amazonaws.com/2016/02/29/f65439f6-4b3c-495b-91a5-a571b5937d33/ENCFF996XPQ.bigWig"


# Prepare wigs from BALF of control and COVID19 patients

In [None]:
%%bash

#Get chromosomes sizes and filter to canonical ones
wget -nc --quiet http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
grep -v "_alt" hg38.chrom.sizes | grep -v "_hap" | grep -v "Un_" | grep -v "random" | sed 's/^chr//g' | sed 's/^M/MT/g' > hg38.chrom.filtered.sizes 
rm hg38.chrom.sizes

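The `grep`/`sed` chain above keeps the canonical chromosomes and strips the `chr` prefix, presumably so that the names match the ones in the wig files. A minimal Python sketch of the same filter, useful for checking which names survive; `filter_chrom_sizes` is a hypothetical name.

```python
def filter_chrom_sizes(lines):
    """Keep canonical chromosomes and rename chrN -> N, chrM -> MT."""
    skip = ("_alt", "_hap", "Un_", "random")
    kept = []
    for line in lines:
        # Same substring tests as the chained `grep -v` calls
        if not line.strip() or any(tag in line for tag in skip):
            continue
        name, size = line.split()[:2]
        if name.startswith("chr"):
            name = name[3:]          # sed 's/^chr//'
        if name.startswith("M"):
            name = "MT" + name[1:]   # sed 's/^M/MT/'
        kept.append((name, int(size)))
    return kept


example = [
    "chr1\t248956422",
    "chrM\t16569",
    "chr1_KI270706v1_random\t175055",
    "chrUn_KI270302v1\t2274",
    "chr19_KI270938v1_alt\t1066800",
]
canonical = filter_chrom_sizes(example)
```

Only `1` and `MT` survive in this example; the random, unplaced and alt contigs are dropped.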

### BALF RNA-seq control

In [None]:
%%bash

#BALF Control wigs to bigwig 
#s1
#Plus strand
wigToBigWig /input_dir/corona_analysis/alignment_out/BALF_control_quant/BALF_control_s1_Signal.UniqueMultiple.str1.out.wig \
    hg38.chrom.filtered.sizes \
    /input_dir/corona_analysis/tracks/BALF_control_s1_rnaseq_plus_hg38.bw
#Minus strand
wigToBigWig /input_dir/corona_analysis/alignment_out/BALF_control_quant/BALF_control_s1_Signal.UniqueMultiple.str2.out.wig \
    hg38.chrom.filtered.sizes \
    /input_dir/corona_analysis/tracks/BALF_control_s1_rnaseq_minus_hg38.bw

#s2
#Plus strand
wigToBigWig /input_dir/corona_analysis/alignment_out/BALF_control_quant/BALF_control_s2_Signal.UniqueMultiple.str1.out.wig \
    hg38.chrom.filtered.sizes \
    /input_dir/corona_analysis/tracks/BALF_control_s2_rnaseq_plus_hg38.bw
#Minus strand
wigToBigWig /input_dir/corona_analysis/alignment_out/BALF_control_quant/BALF_control_s2_Signal.UniqueMultiple.str2.out.wig \
    hg38.chrom.filtered.sizes \
    /input_dir/corona_analysis/tracks/BALF_control_s2_rnaseq_minus_hg38.bw

#s3
#Plus strand
wigToBigWig /input_dir/corona_analysis/alignment_out/BALF_control_quant/BALF_control_s3_Signal.UniqueMultiple.str1.out.wig \
    hg38.chrom.filtered.sizes \
    /input_dir/corona_analysis/tracks/BALF_control_s3_rnaseq_plus_hg38.bw
#Minus strand
wigToBigWig /input_dir/corona_analysis/alignment_out/BALF_control_quant/BALF_control_s3_Signal.UniqueMultiple.str2.out.wig \
    hg38.chrom.filtered.sizes \
    /input_dir/corona_analysis/tracks/BALF_control_s3_rnaseq_minus_hg38.bw


In [None]:
%%bash

#BALF Infected wigs to bigwig 
#s1
#Plus strand
wigToBigWig /input_dir/corona_analysis/alignment_out/BALF_infected_quant/BALF_corona_s1_Signal.UniqueMultiple.str1.out.wig \
    hg38.chrom.filtered.sizes \
    /input_dir/corona_analysis/tracks/BALF_corona_s1_rnaseq_plus_hg38.bw
#Minus strand
wigToBigWig /input_dir/corona_analysis/alignment_out/BALF_infected_quant/BALF_corona_s1_Signal.UniqueMultiple.str2.out.wig \
    hg38.chrom.filtered.sizes \
    /input_dir/corona_analysis/tracks/BALF_corona_s1_rnaseq_minus_hg38.bw
   
#s2
#Plus strand
wigToBigWig /input_dir/corona_analysis/alignment_out/BALF_infected_quant/BALF_corona_s2_Signal.UniqueMultiple.str1.out.wig \
    hg38.chrom.filtered.sizes \
    /input_dir/corona_analysis/tracks/BALF_corona_s2_rnaseq_plus_hg38.bw
#Minus strand
wigToBigWig /input_dir/corona_analysis/alignment_out/BALF_infected_quant/BALF_corona_s2_Signal.UniqueMultiple.str2.out.wig \
    hg38.chrom.filtered.sizes \
    /input_dir/corona_analysis/tracks/BALF_corona_s2_rnaseq_minus_hg38.bw

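The two cells above repeat one `wigToBigWig` call per sample and strand. The same command lines can be generated from the sample list, which makes it harder to mistype a path. A minimal sketch; `wig_to_bigwig_commands` is a hypothetical helper, and the mapping `str1` to plus and `str2` to minus is taken from the comments in the cells above.

```python
def wig_to_bigwig_commands(condition, quant_dir, samples,
                           root="/input_dir/corona_analysis",
                           sizes="hg38.chrom.filtered.sizes"):
    """Build one wigToBigWig argument list per sample and strand."""
    # Strand labels as used in the cells above (str1 = plus, str2 = minus)
    strands = {"str1": "plus", "str2": "minus"}
    commands = []
    for sample in samples:
        for star_strand, label in strands.items():
            wig = (f"{root}/alignment_out/{quant_dir}/BALF_{condition}_{sample}"
                   f"_Signal.UniqueMultiple.{star_strand}.out.wig")
            bw = f"{root}/tracks/BALF_{condition}_{sample}_rnaseq_{label}_hg38.bw"
            commands.append(["wigToBigWig", wig, sizes, bw])
    return commands


infected = wig_to_bigwig_commands("corona", "BALF_infected_quant", ["s1", "s2"])
control = wig_to_bigwig_commands("control", "BALF_control_quant", ["s1", "s2", "s3"])
```

Each entry can be passed to `subprocess.run(cmd, check=True)` to run the conversion.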

### Flip BALF bw to hg19

In [None]:
%%bash

#Control BALF S1
CrossMap.py bigwig /input_dir/corona_analysis/annotations/hg38ToHg19.over.chain.gz /input_dir/corona_analysis/tracks/BALF_control_s1_rnaseq_plus_hg38.bw \
    /input_dir/corona_analysis/tracks/BALF_control_s1_rnaseq_plus_hg19
CrossMap.py bigwig /input_dir/corona_analysis/annotations/hg38ToHg19.over.chain.gz /input_dir/corona_analysis/tracks/BALF_control_s1_rnaseq_minus_hg38.bw \
    /input_dir/corona_analysis/tracks/BALF_control_s1_rnaseq_minus_hg19

#Infected BALF S1
CrossMap.py bigwig /input_dir/corona_analysis/annotations/hg38ToHg19.over.chain.gz /input_dir/corona_analysis/tracks/BALF_corona_s1_rnaseq_plus_hg38.bw \
    /input_dir/corona_analysis/tracks/BALF_corona_s1_rnaseq_plus_hg19
CrossMap.py bigwig /input_dir/corona_analysis/annotations/hg38ToHg19.over.chain.gz /input_dir/corona_analysis/tracks/BALF_corona_s1_rnaseq_minus_hg38.bw \
    /input_dir/corona_analysis/tracks/BALF_corona_s1_rnaseq_minus_hg19


### eQTLs

### Mendelian diseases

In [None]:
%%bash

#Get mendelian variation tied to disease using clinvar hg38
wget --quiet -nc -O disease_names_clinvar.txt ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/disease_names


In [None]:
%%bash

wget --quiet -nc -O clinvar_curr_hg38.vcf.gz ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20200615.vcf.gz
gunzip clinvar_curr_hg38.vcf.gz


In [None]:
%%bash

wget --quiet -nc -O clinvar_curr_hg19.vcf.gz ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar_20200615.vcf.gz
gunzip clinvar_curr_hg19.vcf.gz


In [None]:
%%bash

#get only lung related disease ConceptIDs
grep -i "lung\|COPD" disease_names_clinvar.txt | cut -f 3 | sort -u | sed '/^$/d' > lung_names.txt


In [None]:
import vcf

# ConceptIDs of lung-related diseases, written to lung_names.txt by the grep cell above
with open('lung_names.txt', 'r') as diseases:
    disease_ids = {line.strip() for line in diseases}

# Output was previously named clinvar_kidney_variants_hg19.bed, although it holds lung variants
out_path = "/input_dir/corona_analysis/annotations/clinvar_lung_variants_hg19.bed"

with open('clinvar_curr_hg19.vcf', 'r', encoding='utf-8') as vcf_in, open(out_path, "w") as bed_lung_var_out:
    vcf_reader = vcf.Reader(vcf_in)
    for record in vcf_reader:
        record_keys = record.INFO.keys()
        # Only the first CLNDISDB cross-reference is inspected
        if ("CLNDISDB" in record_keys and record.INFO["CLNDISDB"][0] is not None and "MedGen" in record.INFO["CLNDISDB"][0]):
            cur_id = (record.INFO["CLNDISDB"][0]).split(":")[1]
            cur_disease = (record.INFO["CLNDN"][0])
            cur_rs = "NA"
            if "RS" in record_keys:
                cur_rs = "rs"+str(record.INFO["RS"][0])

            if cur_id in disease_ids:
                # BED6 line: name is "<disease>:<rsID>", score 0, strand "."
                out_record = ("chr" + str(record.CHROM) +
                    "\t" + str(record.start) + 
                    "\t" + str(record.end) + 
                    "\t" + cur_disease + 
                    ":" + cur_rs + 
                    "\t" + "0" + 
                    "\t" + "." + "\n")
                bed_lung_var_out.write(out_record)

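The cell above only looks at the first `CLNDISDB` cross-reference of each record, so a variant whose MedGen ID is not listed first is skipped. A minimal sketch of pulling every MedGen ID out of a raw INFO string instead. It assumes the layout I understand ClinVar VCFs to use (diseases separated by `|`, cross-references by `,`); `medgen_ids` is a hypothetical name and the example record is made up.

```python
def medgen_ids(info_field):
    """Return all MedGen concept IDs found in a ClinVar INFO string."""
    ids = []
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        if key != "CLNDISDB":
            continue
        # ASSUMPTION: "|" separates diseases, "," separates cross-references
        for xref in value.replace("|", ",").split(","):
            db, _, ident = xref.partition(":")
            if db == "MedGen":
                ids.append(ident)
    return ids


# Made-up INFO string for illustration only
info = ("ALLELEID=1;"
        "CLNDISDB=MONDO:MONDO:0013342,MedGen:C3150901,OMIM:613647|MedGen:CN517202;"
        "CLNDN=example")
found = medgen_ids(info)
```

A record would then be kept if any of the returned IDs is in `disease_ids`.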
### Methylation

## GWAS CoVid

In [None]:
%%bash

#Get GWAS from https://www.medrxiv.org/content/10.1101/2020.05.31.20114991v1
#Got regions from supplementary table 2
# saved in /input_dir/corona_analysis/tracks/Ellinghaus_covid_gwas_hg38.bed


In [None]:
%%bash

CrossMap.py bed /input_dir/corona_analysis/annotations/hg38ToHg19.over.chain.gz /input_dir/corona_analysis/tracks/Ellinghaus_covid_gwas_hg38.bed \
    Ellinghaus_covid_gwas_hg19.bed


In [None]:
#Flip output to hg19
liftover_bed(from_genome="hg38",
            to_genome="hg19",
            in_bed="/input_dir/corona_analysis/tracks/Ellinghaus_covid_gwas_hg38.bed",
            out_bed="/input_dir/corona_analysis/tracks/Ellinghaus_covid_gwas_hg19.bed")

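`liftover_bed` comes from `helper_funcs`, which is not shown in this notebook. One plausible shape for such a wrapper, assuming it shells out to `CrossMap.py bed` with the chain files used in the bash cells; the name `crossmap_bed_command` and the `chain_dir` default are illustrative, not the actual helper.

```python
def crossmap_bed_command(from_genome, to_genome, in_bed, out_bed,
                         chain_dir="/input_dir/corona_analysis/annotations"):
    """Build the CrossMap argument list for a BED liftover (illustrative)."""
    # Chain files follow the UCSC pattern <from>To<To>.over.chain.gz,
    # e.g. hg38ToHg19.over.chain.gz
    target = to_genome[0].upper() + to_genome[1:]
    chain = f"{chain_dir}/{from_genome}To{target}.over.chain.gz"
    return ["CrossMap.py", "bed", chain, in_bed, out_bed]


cmd = crossmap_bed_command(
    "hg38", "hg19",
    "/input_dir/corona_analysis/tracks/Ellinghaus_covid_gwas_hg38.bed",
    "/input_dir/corona_analysis/tracks/Ellinghaus_covid_gwas_hg19.bed",
)
```

Running it would be `subprocess.run(cmd, check=True)`, followed by a `sort-bed` pass as in the bash cells.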

In [None]:
%%bash

#Get GWAS from https://www.covid19hg.org ANA5 susceptibility 
wget -nc --quiet -O /input_dir/corona_analysis/gwas/covid19hg/COVID19_HGI_ANA_C2_V2_20200701.txt.gz --no-check-certificate \
    https://storage.googleapis.com/covid19-hg-public/20200619/results/COVID19_HGI_ANA_C2_V2_20200701.txt.gz
    
gunzip -f /input_dir/corona_analysis/gwas/covid19hg/COVID19_HGI_ANA_C2_V2_20200701.txt.gz


In [None]:

#Generate filtered, sorted bed file of variants above threshold
thresh = 1e-6
input_bed = "/input_dir/corona_analysis/gwas/covid19hg/COVID19_HGI_ANA_C2_V2_20200701.txt"
out_bed = "/input_dir/corona_analysis/tracks/covid_hg38_1e6_susceptibility.bed"
filter_gwas_2_bed(input_bed=input_bed, 
                  out_bed=out_bed, 
                  col_ix=[0,1,4,62], 
                  header=True, 
                  thresh=thresh, 
                  delim='\t')

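`filter_gwas_2_bed` also lives in `helper_funcs` and is not shown here. A rough sketch of what a filter with these arguments might do, under two loudly stated assumptions: `col_ix` lists the chromosome, position, variant name and p-value columns in that order, and rows are kept when the p-value is at or below `thresh`. `filter_gwas_rows` is a hypothetical stand-in, not the real helper.

```python
def filter_gwas_rows(lines, col_ix, thresh, header=True, delim="\t"):
    """Yield (chrom, start, end, name, pval) for rows with pval <= thresh.

    ASSUMPTION: col_ix = [chromosome, 1-based position, name, p-value].
    """
    chrom_i, pos_i, name_i, p_i = col_ix
    rows = iter(lines)
    if header:
        next(rows, None)  # skip the header line
    for line in rows:
        fields = line.rstrip("\n").split(delim)
        try:
            pval = float(fields[p_i])
            pos = int(fields[pos_i])
        except (ValueError, IndexError):
            continue  # skip malformed rows and missing values such as "NA"
        if pval <= thresh:
            chrom = fields[chrom_i]
            if not chrom.startswith("chr"):
                chrom = "chr" + chrom
            # A 1-based position becomes a 0-based, half-open BED interval
            yield (chrom, pos - 1, pos, fields[name_i], pval)


# Made-up rows for illustration only
rows = [
    "#CHR\tPOS\tSNP\tP\n",
    "3\t45834967\trs1\t1e-9\n",
    "1\t100\trs2\t0.5\n",
    "2\t5\trs3\tNA\n",
]
hits = list(filter_gwas_rows(rows, col_ix=[0, 1, 2, 3], thresh=1e-6))
```

The resulting tuples would still need to be written out tab-separated and sorted with `sort-bed`.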

In [None]:
#Flip output to hg19
liftover_bed(from_genome="hg38",
            to_genome="hg19",
            in_bed="/input_dir/corona_analysis/tracks/covid_hg38_1e6_susceptibility.bed",
            out_bed="/input_dir/corona_analysis/tracks/covid_hg19_1e6_susceptibility.bed")


In [None]:
%%bash

#Generate target regions by merging thresholded p-values - 10/19
#hg19
bedops --range 50000 -m "/input_dir/corona_analysis/tracks/covid_hg19_1e6_susceptibility.bed" \
    | sort-bed - > "/input_dir/corona_analysis/tracks/covid_hg19_1e6_susceptibility_50kbMerged.bed"


In [None]:
%%bash

#Generate target regions by merging thresholded p-values - 10/19
#hg38
bedops --range 50000 -m "/input_dir/corona_analysis/tracks/covid_hg38_1e6_susceptibility.bed" \
    | sort-bed - > "/input_dir/corona_analysis/tracks/covid_hg38_1e6_susceptibility_50kbMerged.bed"

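`bedops --range 50000 -m` pads every variant by 50 kb on each side and then merges whatever overlaps or touches, so nearby hits collapse into one target region. A minimal Python sketch of that pad-then-merge logic; clamping negative starts to zero is my understanding of how `bedops` treats padding and is labelled as an assumption. `pad_and_merge` is a hypothetical name and the coordinates are made up.

```python
def pad_and_merge(intervals, pad):
    """Pad (chrom, start, end) intervals by pad bp, then merge overlaps."""
    # ASSUMPTION: starts are clamped at 0 after padding
    padded = sorted((c, max(0, s - pad), e + pad) for c, s, e in intervals)
    merged = []
    for chrom, start, end in padded:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or adjoins the previous region: extend it
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(region) for region in merged]


variants = [
    ("chr3", 100000, 100001),
    ("chr3", 180000, 180001),
    ("chr3", 400000, 400001),
    ("chr1", 10, 11),
]
regions = pad_and_merge(variants, 50000)
```

The first two chr3 variants are 80 kb apart, so their padded intervals overlap and merge; the third stays separate.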

In [None]:
%%bash

#Pad merged regions by 500 kb on each side for a wide view of each locus

#hg38
bedops --range 500000 --everything /input_dir/corona_analysis/tracks/covid_hg38_1e6_susceptibility_50kbMerged.bed \
    | sort-bed - > /input_dir/corona_analysis/tracks/covid_hg38_1e6_susceptibility_500kbMerged.bed



In [None]:
%%bash

#Get GWAS from https://www.covid19hg.org ANA2 hospitalization 
wget -nc --quiet -O /input_dir/corona_analysis/gwas/covid19hg/COVID19_HGI_ANA2_20200513.txt.gz --no-check-certificate \
    https://storage.googleapis.com/covid19-hg-public/20200508/results/COVID19_HGI_ANA2_20200513.txt.gz
    
gunzip -f /input_dir/corona_analysis/gwas/covid19hg/COVID19_HGI_ANA2_20200513.txt.gz


In [None]:

#Generate filtered, sorted bed file of variants above threshold
thresh = 1e-6
input_bed = "/input_dir/corona_analysis/gwas/covid19hg/COVID19_HGI_ANA2_20200513.txt"
out_bed = "/input_dir/corona_analysis/tracks/covid_hg38_1e6_hospitalization.bed"
filter_gwas_2_bed(input_bed=input_bed, 
                  out_bed=out_bed, 
                  col_ix=[0,1,4,13], 
                  header=True, 
                  thresh=thresh, 
                  delim='\t')


In [None]:
#Flip output to hg19
liftover_bed(from_genome="hg38",
            to_genome="hg19",
            in_bed="/input_dir/corona_analysis/tracks/covid_hg38_1e6_hospitalization.bed",
            out_bed="/input_dir/corona_analysis/tracks/covid_hg19_1e6_hospitalization.bed")


In [None]:
%%bash

#Generate target regions by merging thresholded p-values - 10/19
#hg19
bedops --range 50000 -m "/input_dir/corona_analysis/tracks/covid_hg19_1e6_hospitalization.bed" \
    | sort-bed - > "/input_dir/corona_analysis/tracks/covid_hg19_1e6_hospitalization_50kbMerged.bed"


In [None]:
%%bash

#Generate target regions by merging thresholded p-values - 10/19
#hg38
bedops --range 50000 -m "/input_dir/corona_analysis/tracks/covid_hg38_1e6_hospitalization.bed" \
    | sort-bed - > "/input_dir/corona_analysis/tracks/covid_hg38_1e6_hospitalization_50kbMerged.bed"


## GWAS SCLC

Pulled McKay GWAS from [Large-scale association analysis identifies new lung cancer susceptibility loci and heterogeneity in genetic susceptibility across histological subtypes](https://www.nature.com/articles/ng.3892)

Pulled GWAS from [Genome-Wide Interrogation Identifies YAP1 Variants Associated with Survival of Small-Cell Lung Cancer Patients](https://cancerres.aacrjournals.org/content/70/23/9721)


In [None]:
%%bash

#Flip to hg38 from hg19
CrossMap.py bed /input_dir/corona_analysis/annotations/hg19ToHg38.over.chain.gz /input_dir/corona_analysis/tracks/mckay_2017_sclc_overall_hg19.bed \
    /input_dir/corona_analysis/tracks/mckay_2017_sclc_overall_hg38.bed

CrossMap.py bed /input_dir/corona_analysis/annotations/hg19ToHg38.over.chain.gz /input_dir/corona_analysis/tracks/gwas_catalog_sclc_hg19.bed \
    /input_dir/corona_analysis/tracks/gwas_catalog_sclc_hg38.bed


In [None]:
%%bash

#Make merged overlapped regions as targets
bedops --range 50000 -m /input_dir/corona_analysis/tracks/mckay_2017_sclc_overall_hg38.bed \
    | sort-bed - | bedops --range 20000 --everything - > mckay_2017_sclc_overall_hg38_merge.bed

bedops --range 50000 -m /input_dir/corona_analysis/tracks/gwas_catalog_sclc_hg38.bed \
    | sort-bed - | bedops --range 20000 --everything - > gwas_catalog_sclc_hg38_merge.bed

bedops -e 1 mckay_2017_sclc_overall_hg38_merge.bed gwas_catalog_sclc_hg38_merge.bed \
    | sort-bed - > /input_dir/corona_analysis/tracks/mckay_merge_hg38_sclc.bed

bedops --everything --range 700000 /input_dir/corona_analysis/tracks/mckay_merge_hg38_sclc.bed \
    | sort-bed - > /input_dir/corona_analysis/tracks/mckay_merge_wide_hg38_sclc.bed

rm mckay_2017_sclc_overall_hg38_merge.bed gwas_catalog_sclc_hg38_merge.bed

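`bedops -e 1 A B` keeps the regions of the first file that overlap a region of the second file by at least 1 bp, which is how the McKay targets are restricted to those supported by the GWAS catalog set. A minimal sketch of that test on half-open intervals; `element_of` is a hypothetical name, the coordinates are made up, and the nested loop is quadratic, which is fine for a handful of loci but not for genome-wide sets.

```python
def element_of(reference, query, min_overlap=1):
    """Return reference intervals overlapping any query interval.

    Intervals are (chrom, start, end) with half-open coordinates.
    """
    hits = []
    for chrom, start, end in reference:
        for q_chrom, q_start, q_end in query:
            if chrom != q_chrom:
                continue
            # Overlap length of two half-open intervals (negative if disjoint)
            if min(end, q_end) - max(start, q_start) >= min_overlap:
                hits.append((chrom, start, end))
                break  # one supporting query region is enough
    return hits


# Made-up regions for illustration only
mckay = [("chr15", 100, 500), ("chr5", 100, 200), ("chr6", 0, 50)]
catalog = [("chr15", 450, 900), ("chr5", 200, 300)]
shared = element_of(mckay, catalog)
```

Only the chr15 region is kept: the chr5 regions merely touch at position 200, which is zero bases of overlap.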

## GWAS IPF


Emailed author of [Genome-Wide Association Study of Susceptibility to Idiopathic Pulmonary Fibrosis](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7047454/)

Got permission from Louise Wain to download the summary statistics and use them in figures; the data themselves must not be published (yay).


In [None]:
## downloaded sum stats and stored in: 
from helper_funcs import *

In [None]:


#Generate filtered, sorted bed file of variants above threshold
thresh = 5e-8
input_bed = "/input_dir/corona_analysis/gwas/ipf/allen_et_al_2019_ipf_meta_gwas_summary_statistics_public_download.txt"
out_bed = "/input_dir/corona_analysis/tracks/ipf_gwas_hg19_allen_5e-8.bed"
filter_gwas_2_bed(input_bed=input_bed, 
                  out_bed=out_bed, 
                  col_ix=[1,2,0,11], 
                  header=True, 
                  thresh=thresh, 
                  delim=' ')


In [None]:
#Generate filtered, sorted bed file of variants above threshold
thresh = 5e-6
input_bed = "/input_dir/corona_analysis/gwas/ipf/allen_et_al_2019_ipf_meta_gwas_summary_statistics_public_download.txt"
out_bed = "/input_dir/corona_analysis/tracks/ipf_gwas_hg19_allen_5e-6.bed"
filter_gwas_2_bed(input_bed=input_bed, 
                  out_bed=out_bed, 
                  col_ix=[1,2,0,11], 
                  header=True, 
                  thresh=thresh, 
                  delim=' ')


In [None]:
%%bash

CrossMap.py bed /input_dir/corona_analysis/annotations/hg19ToHg38.over.chain.gz /input_dir/corona_analysis/tracks/ipf_gwas_hg19_allen_5e-8.bed \
    /input_dir/corona_analysis/tracks/ipf_gwas_hg38_allen_5e-8.bed


# DTI plasmid catalog


In [2]:
%%bash

#Took all plasmid sgRNA locations from excel spreadsheet and PAINFULLY converted to bed files
cd /input_dir/corona_analysis/tracks/DTI_plasmids
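# Skip hidden files and earlier outputs (DTI_all_plasmids_*) so a rerun does not fold them back into the input
find . -maxdepth 1 -type f ! -name '.*' ! -name 'DTI_all_plasmids_*' -exec cat {} + | sort -u | sort-bed - > DTI_all_plasmids_hg38.bed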
CrossMap.py bed /input_dir/corona_analysis/annotations/hg38ToHg19.over.chain.gz /input_dir/corona_analysis/tracks/DTI_plasmids/DTI_all_plasmids_hg38.bed \
    /input_dir/corona_analysis/tracks/DTI_plasmids/DTI_all_plasmids_hg19.bed



@ 2020-12-09 17:51:33: Read the chain file:  /input_dir/corona_analysis/annotations/hg38ToHg19.over.chain.gz
