***Total: 42 points***

Complete this homework by writing R code to complete the following tasks. Keep in mind:

i. Empty chunks have been included where code is required
ii. This homework requires use of data files:

  - `BRCA.genome_wide_snp_6_broad_Level_3_scna.seg` (Problems 1, 2)
  - `GIAB_highconf_v.3.3.2.vcf.gz` (Problem 3)
  
iii. You will be graded on your code and output results. The assignment is worth 42 points total; partial credit can be awarded.

For additional resources, please refer to these links:  
Problems 1 & 2:  
  - https://www.bioconductor.org/packages/devel/bioc/vignettes/plyranges/inst/doc/an-introduction.html
  - https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.html  
Problem 3:  
  - https://bioconductor.org/packages/release/bioc/vignettes/Rsamtools/inst/doc/Rsamtools-Overview.pdf  
  - https://bioconductor.org/packages/release/bioc/vignettes/VariantAnnotation/inst/doc/VariantAnnotation.pdf  

# Problem 1: Overlaps between genomic regions and copy number alterations. (14 points total)

### Preparation
Load copy number segment results as shown in *2.1 BED format* of *Lecture16_GenomicData.Rmd*. You will use the same file as in the lecture notes, `BRCA.genome_wide_snp_6_broad_Level_3_scna.seg`. Here is code to get you started.

In [116]:
#load packages
suppressPackageStartupMessages({
    library(tidyverse)
    library(GenomicRanges)
    library(plyranges)
    library(VariantAnnotation)
})

In [4]:
getwd()

In [7]:
segs <- read.delim("/workspaces/tfcb_2024/lectures/lecture15/TFCB_data/BRCA.genome_wide_snp_6_broad_Level_3_scna.seg", as.is = TRUE)
mode(segs$Chromosome) <- "character" 
segs[segs$Chromosome == 23, "Chromosome"] <- "X"
segs.gr <- as(segs, "GRanges")

### a. Find the segments in `segs.gr` that have *any* overlap with the region `chr8:128,746,347-128,755,810` (4 points)
Print out the first five unique TCGA IDs.
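
To make "*any* overlap" concrete, here is a minimal sketch on made-up ranges. The coordinates and sample names are hypothetical, and `subsetByOverlaps()` from GenomicRanges is used as a stand-in for the plyranges approach in the answer cell: a segment counts as overlapping if it shares at least one base with the query, whether it lies partly or fully inside it.

```r
suppressPackageStartupMessages(library(GenomicRanges))

# Hypothetical toy segments on chromosome "8"; not taken from the homework data
toy_segs <- GRanges(
  seqnames = "8",
  ranges = IRanges(start = c(100, 450, 900), end = c(500, 600, 1200)),
  Sample = c("sampleA", "sampleB", "sampleC")
)

# Hypothetical query region 400-700
toy_query <- GRanges(seqnames = "8", ranges = IRanges(start = 400, end = 700))

# type = "any" keeps every segment sharing at least one base with the query
toy_hits <- subsetByOverlaps(toy_segs, toy_query, type = "any")

# Unique sample identifiers among the overlapping segments
unique(toy_hits$Sample)
```

Here `sampleA` overlaps only partially and `sampleB` lies fully inside the query, so both are kept, while `sampleC` is dropped.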

In [34]:
mygr8 <- GRanges(seqnames = "8",
                    ranges = IRanges(start = 128746347, end = 128755810))

overlap8 <- find_overlaps(segs.gr, mygr8) %>%
  as_tibble() %>%
  dplyr::select(Sample) %>%
  dplyr::distinct() %>%
  head(n = 5) %>%
  print()

# A tibble: 5 × 1
  Sample                      
  <chr>                       
1 TCGA-3C-AAAU-10A-01D-A41E-01
2 TCGA-3C-AAAU-01A-11D-A41E-01
3 TCGA-3C-AALI-10A-01D-A41E-01
4 TCGA-3C-AALI-01A-11D-A41E-01
5 TCGA-3C-AALJ-10A-01D-A41E-01


### b. Find the mean of the `Segment_Mean` values for copy number segments that have *any* overlap with the region chr17:37,842,337-37,886,915. (4 points)

In [52]:
mygr17 <- GRanges(seqnames = "17",
                    ranges = IRanges(start = 37842337, end = 37886915))

overlap17 <- find_overlaps(segs.gr, mygr17)

mean(overlap17$Segment_Mean, na.rm = TRUE)

### c. Find the patient sample distribution of copy number for `PIK3CA` (hg19). (6 points)
Find the counts of samples with deletion (D; `Segment_Mean < -0.3`), neutral (N; `Segment_Mean >= -0.3 & Segment_Mean <= 0.3`), and gain (G; `Segment_Mean > 0.3`) segments that have `any` overlap with the `PIK3CA` gene coordinates.  
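
The three thresholds can be checked on a few made-up values before applying them to real segments. This is only a sketch in base R; `toy_means` and `classify_cn()` are hypothetical names, and the values are chosen so that the boundary cases (exactly `-0.3` and `0.3`) fall into the neutral class.

```r
# Hypothetical Segment_Mean values, including both boundary cases
toy_means <- c(-0.8, -0.3, 0, 0.3, 0.31, 1.2)

# Hypothetical helper: D below -0.3, G above 0.3, N otherwise (bounds inclusive)
classify_cn <- function(x) {
  ifelse(x < -0.3, "D", ifelse(x > 0.3, "G", "N"))
}

# Fix the level order so the table always lists D, N, G
toy_calls <- factor(classify_cn(toy_means), levels = c("D", "N", "G"))
table(toy_calls)
```

Using a factor with fixed levels keeps a class in the table even when its count is zero.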


In [81]:
seqinfo <- Seqinfo(genome = "hg19")
seqinfo <- keepStandardChromosomes(seqinfo) 
seqlevelsStyle(seqinfo) <- "NCBI"

PIK3CA_range <- GRanges(seqinfo = seqinfo,
                         seqnames = "3", ranges = IRanges(start = 179148114, end = 179240093))

overlap_pik3ca <- find_overlaps(segs.gr, PIK3CA_range)

overlap_D <- overlap_pik3ca %>%
  filter(Segment_Mean < -0.3) %>%
  length()

overlap_N <- overlap_pik3ca %>%
  filter(Segment_Mean >= -0.3 & Segment_Mean <= 0.3) %>%
  length()

overlap_G <- overlap_pik3ca %>%
  filter(Segment_Mean > 0.3) %>%
  length()

# Counts are per overlapping segment: one row per category, one count column
sample_dist <- tibble(
  Category = c("Deletion", "Neutral", "Gain"),
  Count = c(overlap_D, overlap_N, overlap_G)
) %>%
  print()

“cannot switch some of hg19's seqlevels from UCSC to NCBI style”


# A tibble: 3 × 2
  Category Count
  <chr>    <int>
1 Deletion    14
2 Neutral   2024
3 Gain       165


# Problem 2: Frequency of copy number alteration events within genomic regions. (12 points total) 

This problem will continue to use the copy number data stored in `segs.gr`.

### a. Create a genome-wide tile of 1Mb windows for the human genome (`hg19`). (6 points)
See *3.1 Tiling the genome* of *Lecture16_GenomicData.Rmd* for hints.
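
As a small-scale illustration of tiling, the sketch below runs `tileGenome()` on two made-up chromosomes. The names `chrA` and `chrB` and their lengths are hypothetical; the point is to show how `cut.last.tile.in.chrom = TRUE` truncates the final window of each chromosome instead of letting it run past the chromosome end.

```r
suppressPackageStartupMessages(library(GenomicRanges))

# Hypothetical chromosome lengths, far smaller than hg19
toy_lengths <- c(chrA = 2500L, chrB = 1000L)

# Fixed-width windows; the last window on each chromosome is cut at its end
toy_tiles <- tileGenome(
  seqlengths = toy_lengths,
  tilewidth = 1000,
  cut.last.tile.in.chrom = TRUE
)

toy_tiles
width(toy_tiles)
```

With these lengths, `chrA` yields two full windows plus one 500 bp remainder, and `chrB` yields a single full window.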


In [83]:
seqinfo <- Seqinfo(genome = "hg19")
seqinfo <- keepStandardChromosomes(seqinfo) 
seqlevelsStyle(seqinfo) <- "NCBI"

slen <- seqlengths(seqinfo)
tileWidth <- 1000000
hg19_tile <- tileGenome(seqlengths = slen, tilewidth = tileWidth,
                    cut.last.tile.in.chrom = TRUE)

hg19_tile

“cannot switch some of hg19's seqlevels from UCSC to NCBI style”


GRanges object with 3114 ranges and 0 metadata columns:
         seqnames            ranges strand
            <Rle>         <IRanges>  <Rle>
     [1]        1         1-1000000      *
     [2]        1   1000001-2000000      *
     [3]        1   2000001-3000000      *
     [4]        1   3000001-4000000      *
     [5]        1   4000001-5000000      *
     ...      ...               ...    ...
  [3110]        Y 56000001-57000000      *
  [3111]        Y 57000001-58000000      *
  [3112]        Y 58000001-59000000      *
  [3113]        Y 59000001-59373566      *
  [3114]     chrM           1-16571      *
  -------
  seqinfo: 25 sequences from an unspecified genome

### b. Find the 1Mb window with the most frequent overlapping deletions. (6 points)
Find the 1Mb windows with `any` overlap with deletion copy number segments. Assume a deletion segment is defined as a segment in `segs.gr` having `Segment_Mean < -0.3`. 

Return one of the 1Mb window `GRanges` entries with the highest frequency (count) of overlapping deletion segments.

Hint: Subset the `segs.gr` to only rows with `Segment_Mean < -0.3`. 
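
One way to think about "most frequent overlapping deletions" is to count, for every window, how many deletion segments touch it and then take the window with the largest count. The sketch below does this on made-up ranges with `countOverlaps()` from GenomicRanges; all coordinates are hypothetical and the approach is one possible route, not necessarily the one intended by the lecture.

```r
suppressPackageStartupMessages(library(GenomicRanges))

# Hypothetical windows of width 1000 on chromosome "1"
toy_windows <- GRanges(
  seqnames = "1",
  ranges = IRanges(start = c(1, 1001, 2001), end = c(1000, 2000, 3000))
)

# Hypothetical deletion segments
toy_deletions <- GRanges(
  seqnames = "1",
  ranges = IRanges(start = c(500, 1500, 1800, 2500),
                   end = c(1200, 1600, 1900, 2600))
)

# Number of deletion segments overlapping each window
toy_counts <- countOverlaps(toy_windows, toy_deletions)
toy_windows$n_deletions <- toy_counts

# which.max() returns the first window holding the maximum count
toy_windows[which.max(toy_counts)]
```

A segment spanning a window boundary (here `500-1200`) is counted in both windows it touches.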

In [117]:
# Keep only deletion segments (Segment_Mean < -0.3)
deletions <- subset(segs.gr, Segment_Mean < -0.3)

# Count the deletion segments overlapping each 1Mb window
tile_counts <- hg19_tile
tile_counts$n_deletions <- countOverlaps(hg19_tile, deletions)

# Return one window with the highest deletion count
max_window <- tile_counts[which.max(tile_counts$n_deletions)]
max_window

# Problem 3: Reading and annotating genomic variants (16 points total)

### Preparation

In [127]:
vcfFile <- "/workspaces/tfcb_2024/lectures/lecture15/TFCB_data/GIAB_highconf_v.3.3.2.vcf.gz"

In [124]:
getwd()

### a. Load variant data from VCF file `GIAB_highconf_v.3.3.2.vcf.gz` for `chr8:128,700,000-129,000,000`. (4 points)
Note: use genome build `hg19`.
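
For practice with region-restricted reading, the sketch below applies the same pattern to the small example VCF bundled with the VariantAnnotation package. It assumes that `chr22.vcf.gz` and its tabix index are present in the package's `extdata` directory; the chromosome 22 region is illustrative and unrelated to the homework region.

```r
suppressPackageStartupMessages(library(VariantAnnotation))

# Assumption: example VCF (with its .tbi index) shipped with VariantAnnotation
toy_file <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")

# Illustrative region on chromosome 22; not the homework region
toy_region <- GRanges(seqnames = "22",
                      ranges = IRanges(start = 50301422, end = 50312106))

# `which` restricts reading to records overlapping the region (needs a tabix index)
toy_param <- ScanVcfParam(which = toy_region)
toy_vcf <- readVcf(TabixFile(toy_file), genome = "hg19", param = toy_param)

dim(toy_vcf)                 # number of variants x number of samples
head(rowRanges(toy_vcf), 3)  # positions, REF, ALT, QUAL, FILTER
```

Reading only the region of interest avoids loading the whole file, which matters for genome-wide call sets.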

In [129]:
myGRange8 <- GRanges(seqnames = "8", ranges = IRanges(start = 128700000, end = 129000000))
vcf.param <- ScanVcfParam(which = myGRange8)
vcf <- readVcf(vcfFile, genome = "hg19", param = vcf.param)

vcf

class: CollapsedVCF 
dim: 308 1 
rowRanges(vcf):
  GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
  DataFrame with 16 columns: DPSum, platforms, platformnames, platformbias, ...
info(header(vcf)):
                                   Number Type    Description                  
   DPSum                           1      Integer Total read depth summed ac...
   platforms                       1      Integer Number of different platfo...
   platformnames                   .      String  Names of platforms for whi...
   platformbias                    .      String  Names of platforms that ha...
   datasets                        1      Integer Number of different datase...
   datasetnames                    .      String  Names of datasets for whic...
   datasetsmissingcall             .      String  Names of datasets that are...
   callsets                        1      Integer Number of different callse...
   callsetnames                    .      String 

### b. Combine the fields of the VCF genotype information into a table. (4 points)
You may use your choice of data objects (e.g. `data.frame`).

In [137]:
info(vcf) %>%
  as.data.frame() %>%
  rownames_to_column("ID") %>%
  as_tibble()

ID,DPSum,platforms,platformnames,platformbias,datasets,datasetnames,datasetsmissingcall,callsets,callsetnames,varType,filt,callable,difficultregion,arbitrated,callsetwiththisuniqgenopassing,callsetwithotheruniqgenopassing
<chr>,<int>,<int>,<I<list>>,<I<list>>,<int>,<I<list>>,<I<list>>,<int>,<I<list>>,<chr>,<I<list>>,<I<list>>,<I<list>>,<chr>,<I<list>>,<I<list>>
rs6984323,,4,Illumina....,,4,HiSeqPE3....,IonExome....,5,HiSeqPE3....,,CS_CGnor....,CS_HiSeq....,,,,
rs4478537,,3,Illumina....,,3,HiSeqPE3....,IonExome....,4,HiSeqPE3....,,,CS_HiSeq....,,,,
rs34141920,,3,Illumina....,,3,HiSeqPE3....,IonExome....,4,HiSeqPE3....,,CS_CGnor....,CS_HiSeq....,AllRepea....,,,
rs17772814,,4,Illumina....,,5,HiSeqPE3....,IonExome,6,HiSeqPE3....,,CS_Solid....,CS_HiSeq....,,,,
rs77977256,,4,Illumina....,,4,HiSeqPE3....,IonExome....,5,HiSeqPE3....,,,CS_HiSeq....,,,,
8:128715845_AT/A,,1,Illumina,,1,HiSeqPE300x,CGnormal....,2,HiSeqPE3....,,CS_CGnor....,CS_HiSeq....,AllRepea....,,,
rs143209301,,3,Illumina....,,3,HiSeqPE3....,IonExome....,4,HiSeqPE3....,,,CS_HiSeq....,,,,
rs202231913,,1,Illumina,,1,HiSeqPE300x,CGnormal....,2,HiSeqPE3....,,,CS_HiSeq....,AllRepea....,,,
rs16902340,,4,Illumina....,,4,HiSeqPE3....,IonExome....,5,HiSeqPE3....,,,CS_HiSeq....,,,,
rs7841229,,4,Illumina....,,4,HiSeqPE3....,IonExome....,5,HiSeqPE3....,,,CS_HiSeq....,AllRepea....,,,


### c. Retrieve the following information at chr8:128747953. (8 points)
Print out the SNP ID (i.e. "rs ID"), reference base (`REF`), alternate base (`ALT`), genotype (`GT`), depth (`DP`), allele depth (`ADALL`), and phase set (`PS`).

Hints: 

  i. `REF` and `ALT` are in the output of `rowRanges(vcf)`. See Section `3a` in `Lecture16_VariantCalls.ipynb` 
  ii. To get the sequence of `DNAString`, use `as.character(x)`.  
  iii. To get the sequence of `DNAStringSet`, use `as.character(unlist(x))`. 
  iv. To expand a list of information for `geno`, use `unlist(x)`.  
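
The conversions in the hints can be tried on small made-up sequences first. In this sketch `toy_ref` and `toy_alt` are hypothetical stand-ins for the `REF` column (a `DNAStringSet`) and the `ALT` column (a `DNAStringSetList`, which can hold several alternate alleles per variant) of `rowRanges(vcf)`.

```r
suppressPackageStartupMessages(library(Biostrings))

# Hypothetical reference alleles for two variants
toy_ref <- DNAStringSet(c("A", "G"))

# Hypothetical alternate alleles: one for the first variant, two for the second
toy_alt <- DNAStringSetList("T", c("C", "GA"))

# A DNAStringSet converts directly to a character vector
as.character(toy_ref)

# A DNAStringSetList is flattened with unlist() before conversion
as.character(unlist(toy_alt))
```

Note that flattening loses the grouping by variant, which is harmless when a single position is queried.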

  

In [206]:
myGR8 <- GRanges(seqnames = "8", ranges = IRanges(start = 128747953, end = 128747953))
vcf.param8 <- ScanVcfParam(which = myGR8)

vcf8 <- readVcf(vcfFile, genome = "hg19", param = vcf.param8)

# SNP ID (rs ID): stored as the name of the variant record
names(rowRanges(vcf8))

# Reference and alternate bases
as.character(rowRanges(vcf8)$REF)
as.character(unlist(rowRanges(vcf8)$ALT))
# Genotype, depth, allele depth, and phase set
as.character(geno(vcf8)$GT)
as.character(geno(vcf8)$DP)
as.character(unlist(geno(vcf8)$ADALL))
as.character(geno(vcf8)$PS)