### Please take care to regularly create snapshots (File, Save and Checkpoint) and to download your notebook to your own computer. This way we can go back to this backup in case something goes wrong.

version: May 2018

Since we are using pre-installed and ready-to-go (pre-compiled) software, there is no need to work in a scripting environment like Python. We will work in a basic Linux command-line environment called 'bash'. The Jupyter notebook provides a shell around bash which allows a mix of instructions and commands and the ability to keep a clear log of your results.

# Contents
1. [NGS data processing: mapping and optimizing the WES data.](#NGS data processing: mapping and optimizing the WES data.)
    1. [Mapping with BWA](#Mapping with BWA)
    + [Marking duplicates](#Marking duplicates)
    + [Adding ReadGroups](#Adding ReadGroups)
+ [NGS data processing: mapping and optimizing the WGS data](#NGS data processing: mapping and optimizing the WGS data)
+ [Variant Calling & filtering](#Variant Calling & filtering)
    1. [Variant Calling](#Variant Calling)
    + [Variant Filtering](#Variant Filtering)
    + [Variant Annotation](#Variant Annotation)
+ [Navigate & explore sequencing data](#Navigate & explore sequencing data)

Exercise: processing, visualization and interpretation of NGS data
==================================================================

The raw output of an NGS sequencer is commonly a FASTQ file: a flat text file in which each sequence read is stored as a record of four lines (the read name, the sequence, a '+' separator line and the quality scores).

![The Fastq Format](https://kscbioinformatics.files.wordpress.com/2017/02/rawilluminadatafastqfiles.png?w=840)
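Because every read occupies four lines, a quick sanity check is to count the lines of a FASTQ file and divide by four. The sketch below builds a tiny FASTQ file with made-up read names and sequences so that it runs anywhere; the variable names are only illustrative. Point `fastq` at one of the files in datafiles/ to check the real data.

```bash
# Build a tiny two-read FASTQ file; read names and sequences are made up.
workdir=$(mktemp -d)
fastq="$workdir/toy.fastq"
printf '%s\n' \
  '@read1' 'ACGTACGT' '+' 'IIIIIIII' \
  '@read2' 'TTGCAAGC' '+' 'IIIIHHHH' > "$fastq"

# Each record is exactly four lines, so reads = lines / 4.
n_lines=$(wc -l < "$fastq")
n_reads=$(( n_lines / 4 ))
echo "lines: $n_lines, reads: $n_reads"
```

For paired-end data, both FASTQ files of a pair should report the same number of reads.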

A first step in the analysis of next-generation sequencing reads concerns the mapping of the sequence reads to a reference genome. We have generated three types of sequencing data:
1.	A subset of paired-end whole exome sequencing (WES): datafiles/na12878_wes_brcagenes-1.fastq **and** datafiles/na12878_wes_brcagenes-2.fastq
2.	A subset of paired-end RNA-sequencing (RNA)
3.	A subset of paired-end Whole genome sequencing (WGS): datafiles/na12878_wgs_brcagenes-1.fastq **and** datafiles/na12878_wgs_brcagenes-2.fastq

Each of these datasets was derived from a public human sample and the sequence reads must be mapped to the human reference genome. The output of the mapping is a so-called Sequence Alignment Map (SAM) file, which is most often converted into an indexed Binary Alignment Map (BAM) file. A BAM file is not human-readable, but can be visualized using the Integrative Genomics Viewer (IGV).

The human reference genome we use is: Homo_sapiens.GRCh37.GATK.illumina/Homo_sapiens.GRCh37.GATK.illumina.fasta

<span style="color:red">***IMPORTANT: $filename markings indicate that you should change this into YOUR filename of either an existing file or a new filename.***</span>



1 NGS data processing: mapping and optimizing the WES data.<a name="NGS data processing: mapping and optimizing the WES data."></a>
==============================
1A Mapping with BWA<a name="Mapping with BWA"></a>
------------------
We will use the tool 'bwa' for mapping our sequence data. First see if it is installed and working, and check the version: just type 'bwa' and see what happens.


<span style="color:purple">[Q]: What version of bwa is installed on your system?</span>

[A]:

<span style="color:purple">[Q]: What files are required to start the mapping on our data? [remember the lectures]</span>

[A]:

<span style="color:purple">[Q]: What is the format of the resulting alignmentfile? [remember the lectures]</span>

[A]:

<span style="color:green">[DO]: Start an alignment of the WES (!) sequence reads in paired-end mode, and store the output file. Remember to name it properly (including extension) so you can see from which experiment the alignment was.

We use the bwa subcommand and alignment method 'mem', which processes the files we provide and sends the output to the screen. This is not very useful, so we need to 'capture' the screen output by saving it to a file. In Linux this is done with the instruction '> filename', so the command is built up as follows:

bwa mem [the-human-reference-file] [fastq-file1] [fastq-file2] > my-filename

Use <b>na12878_wes_brcagenes.sam</b> as output filename.

It will take ~5 min to complete.

</span>

<span style="color:green">[DO]: show the first 30 lines of the SAM file you’ve obtained by executing:
```bash 
head -n 30 $file
```
Where \$file is your samfile (should be na12878_wes_brcagenes.sam). You’ll clearly see a header section and the start of the reads section. </span>

<span style="color:purple">[Q]: What does this entry in the header mean?</span>

@SQ     SN:21   LN:48129895

[A]:

<span style="color:purple"> 
[Q]: Describe some read details from the sam file for the first read. If you need more explanation, see the format specifications [here.](https://samtools.github.io/hts-specs/SAMv1.pdf)
-	<span style="color:purple"> What is its orientation? Use the samfile FLAG (column 2) and [bitcode decoder](https://broadinstitute.github.io/picard/explain-flags.html).
-	<span style="color:purple"> Where does it map to in the genome?
-	<span style="color:purple"> Does it have insertions or deletions compared to the reference?
-	<span style="color:purple"> What is the length of the library molecule?
</span>

[A]:
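The FLAG in column 2 is a sum of powers of two, so it can also be decoded by hand with shell arithmetic. The sketch below does this for a subset of the bits defined in the SAM specification; the value 99 is only an example and the short bit names are our own labels, not official ones.

```bash
# Decode a SAM FLAG into (a subset of) its component bits.
# 99 is only an example value; replace it with the FLAG of your own read.
flag=99
decoded=""
for pair in 1:paired 2:proper_pair 4:unmapped 16:reverse_strand \
            32:mate_reverse_strand 64:first_in_pair 128:second_in_pair \
            1024:duplicate; do
    bit=${pair%%:*}     # numeric value of the bit
    name=${pair#*:}     # our own short label for that bit
    if [ $(( flag & bit )) -ne 0 ]; then
        decoded="$decoded$name "
    fi
done
echo "FLAG $flag: $decoded"
```

A read maps to the reverse strand when bit 16 is set; if it is absent (as in this example), the read maps to the forward strand.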

SAM files are not a very efficient way to handle this kind of data: the information can be compressed and indexed by converting the SAM (Sequence Alignment Map) file into a BAM (Binary Alignment Map) file.

<span style="color:green">[DO]: Convert your SAM file into a BAM file with ‘sambamba view -S -f bam \$samfile > \$bamfile’, where \$samfile is your samfile and \$bamfile is the bamfile to be generated. Give it a proper name: <b>na12878_wes_brcagenes.bam</b>.</span>

<span style="color:purple"> [Q]: Compare the file sizes of the SAM and BAM file. What is the compression ratio? With the Linux command 'll -h' (or 'ls -lh' if the 'll' alias is not defined) you'll get a file listing in human-readable format.</span>

[A]:
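If you prefer to let the shell do the division, the ratio can be computed from the exact byte counts. This is a minimal sketch that uses two stand-in files of known size (1000 and 250 bytes) so it runs without the course data; set `samfile` and `bamfile` to your own files instead.

```bash
# Stand-in files of known size; replace with your own SAM and BAM files.
workdir=$(mktemp -d)
samfile="$workdir/toy.sam"
bamfile="$workdir/toy.bam"
head -c 1000 /dev/zero > "$samfile"
head -c 250  /dev/zero > "$bamfile"

# wc -c reports the size in bytes; awk does the floating-point division.
sam_bytes=$(wc -c < "$samfile")
bam_bytes=$(wc -c < "$bamfile")
ratio=$(awk -v s="$sam_bytes" -v b="$bam_bytes" 'BEGIN { printf "%.1f", s / b }')
echo "SAM: $sam_bytes bytes, BAM: $bam_bytes bytes, ratio: $ratio"
```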

To be able to make use of the indexed searching options, we need to sort the BAM file by coordinates (default). 

<span style="color:green">[DO]: Sort your BAM file by 
```bash
sambamba sort $bamfile
```
<span style="color:green">This automatically writes the sorted result to a new file with '.sorted' inserted before the extension (so na12878_wes_brcagenes.sorted.bam).</span>

<span style="color:purple">[Q]: Review the headers of the sorted and unsorted $bamfile by executing 
```bash
sambamba view -H $bamfile
``` 
<span style="color:purple">for both and compare. Notice the flag stating the bam file is sorted (and how).</span>

[A]:

### 1B Marking duplicates <a name="Marking duplicates"></a>

We need to mark (but not remove) the duplicate reads in our (sorted) bamfile so these won’t be used by analysis tools later on.

<span style="color:green">[DO]: Mark the duplicates by executing 
```bash
sambamba markdup $sorted_bamfile $dedupped_sorted_bamfile
```
</span>

<span style="color:purple">[Q]: What is the percentage of duplicate reads in this dataset?</span>

[A]:
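Marked duplicates carry FLAG bit 1024, so one way to obtain the percentage is to count the alignment lines with that bit set. The sketch below does this with awk on a made-up four-read SAM file; for real data you could pipe the output of `sambamba view` on your dedupped BAM file into the same awk command. Note that this counts alignment lines, which is an approximation if a read has several alignments.

```bash
# Toy SAM file: one header line and four made-up alignment lines.
workdir=$(mktemp -d)
samfile="$workdir/toy_markdup.sam"
{
  printf '@HD\tVN:1.6\tSO:coordinate\n'
  printf 'r1\t99\t17\t100\n'
  printf 'r2\t147\t17\t250\n'
  printf 'r3\t1123\t17\t100\n'    # 1123 = 99 + 1024, so marked as duplicate
  printf 'r4\t163\t17\t300\n'
} > "$samfile"

# Skip header lines (@...), then test bit 1024 of the FLAG in column 2.
dup_pct=$(awk -F'\t' '
  !/^@/ { total++; if (int($2 / 1024) % 2 == 1) dup++ }
  END   { printf "%.1f", 100 * dup / total }' "$samfile")
echo "duplicate reads: $dup_pct %"
```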

### 1C Adding ReadGroups <a name="Adding ReadGroups"></a>
Previously you’ve examined the header of the SAM file and probably noticed that it contains very useful information. Remember, it is very important to know from what sequence data and genome the alignment was generated. Additionally, you would want to know what tools and settings were used to be able to replicate results (see the F.A.I.R. principles later in the course). In the ReadGroups section of the header you can store information on the sample and experiment (see details [SAM specifications](http://samtools.github.io/hts-specs/SAMv1.pdf)). This is considered so important that GATK for instance requires additional information in these readgroups. In advanced analyses you can actively use this information from the readgroups.

In this part we are going to fill in the minimal required readgroups with data using PicardTools ($PICARD): AddOrReplaceReadGroups. In real life we advise to be much more elaborate and specific.
- LB (library): WES
- PL (platform): illumina
- PU (platform unit): slide_barcode
- SM (sample name): na12878


<span style="color:green">[DO]: Execute Picard AddOrReplaceReadGroups with the following parameters and the data above. Adjust them to your own BAM file: use the dedupped, sorted BAM as input file (this should be dedupped_na12878_wes_brcagenes.sorted.bam) and name the output file RG_dedupped_na12878_wes_brcagenes.sorted.bam.</span>
```
java -jar $PICARD AddOrReplaceReadGroups \
I=$inputfile \
O=$outputfile \
CREATE_INDEX=true \
RGLB=$LB \
RGPL=$PL \
RGSM=$SM \
RGPU=$PU
```



<span style="color:green">[DO]: Check if the addition of the readgroup was successful by viewing the header of your new file.</span>
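When checking the header, the line to look for starts with @RG. As a rough sketch of what to expect, the example below searches a made-up header for that line and pulls out the sample name; the header content (including the `ID:1` value) is an assumption for illustration. On your own data you would feed the output of `sambamba view -H` into the same grep.

```bash
# Made-up SAM header with one read group line.
workdir=$(mktemp -d)
header="$workdir/toy_header.sam"
{
  printf '@HD\tVN:1.6\tSO:coordinate\n'
  printf '@SQ\tSN:17\tLN:81195210\n'
  printf '@RG\tID:1\tLB:WES\tPL:illumina\tPU:slide_barcode\tSM:na12878\n'
} > "$header"

# Select the read group line, then split it on tabs and keep the SM tag.
rg_line=$(grep '^@RG' "$header")
sample=$(printf '%s\n' "$rg_line" | tr '\t' '\n' | sed -n 's/^SM://p')
echo "read group line: $rg_line"
echo "sample name: $sample"
```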

### 3. NGS data processing: mapping and optimizing the WGS data <a name="NGS data processing: mapping and optimizing the WGS data"></a>
Now that we have successfully performed the steps to align and optimize the results of the exome sequencing, we want to analyse the whole genome sequencing (WGS) data in the same way.

<span style="color:green">[DO]: Repeat the steps to get to a sorted, deduplicated BAM file with readgroups for the WGS fastqs. Take care not to overwrite your WES results (so replace WES in the filenames by WGS), and clearly mark WGS in the appropriate readgroup.</span>


## 4. Variant Calling & filtering <a name="Variant Calling & filtering"></a>

### 4A. Variant Calling <a name="Variant Calling"></a>

We will use the [GenomeAnalysisToolkit](https://software.broadinstitute.org/gatk/) component HaplotypeCaller to call variants from our BAM files. Since we are using default settings, only four parameters are required: **R**: the genome reference (the same file you used in the mapping procedure), **I**: the input BAM file (with the readgroups; should be RG_dedupped_na12878_wes_brcagenes.sorted.bam), **L**: the regions of the genome to look in (in this case we only look at the BRCA chromosomal regions) and **o**: the output filename. Remember, the result from the variant caller is a VCF file, so the output filename would be RG_dedupped_na12878_wes_brcagenes.sorted.vcf

<span style="color:green">[DO]: Complete the code below to start the variantcaller on your WES dataset. It will take ~2 min to complete, but will show detailed information.</span>

In [None]:
java -jar $GATK \
-T HaplotypeCaller \
-R  \
-L datafiles/brca_genes.bed \
-I  \
-o 

<span style="color:green">[DO]: Repeat this step to call variants on the WGS dataset; the input BAM file should be RG_dedupped_na12878_wgs_brcagenes.sorted.bam. Runtime will be similar.</span>

If everything went well you should now have two vcf files with variants from the WES and WGS datasets. Very good.

<span style="color:purple">[Q]: Did you expect the runtime to be similar? Can you come up with a potential explanation? Remember from the lectures the characteristic differences between exome and genome sequencing. </span>

[A]:

Let's now find out how many variants were called. If you remember, the vcf file is built up of a header (lines starting with ##), a column title row starting with # and the variant lines. So, we can find out the number of variants by counting the lines NOT starting with #. Linux has a very convenient tool for that called **grep**: it can find text but also count lines containing (or in our case not containing) a string. The parameters are **-c** for counting and **-v** for inverse (so counting all lines that did not contain the string). After the options, provide the string to match (**'#'**) and of course you need to specify the vcf file you want to search through. You can also use **\*.vcf** to look through all vcf files.
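A minimal sketch of this counting approach on a made-up VCF file with three variants is shown below. It uses the pattern `'^#'` rather than `'#'`: the `^` anchors the match to the start of the line, so a variant line that happens to contain a # somewhere in its annotation would still be counted.

```bash
# Made-up VCF: two header lines, one column title row, three variants.
workdir=$(mktemp -d)
vcf="$workdir/toy.vcf"
{
  printf '##fileformat=VCFv4.2\n'
  printf '##source=toy_example\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '17\t100\t.\tA\tG\t50\t.\tDP=10\n'
  printf '17\t200\t.\tC\tT\t60\t.\tDP=12\n'
  printf '17\t300\t.\tG\tGA\t70\t.\tDP=8\n'
} > "$vcf"

# -v inverts the match, -c counts: lines that do NOT start with '#'.
n_variants=$(grep -vc '^#' "$vcf")
echo "variants: $n_variants"
```

When several files are given (for example with *.vcf), grep prints one `filename:count` line per file.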

<span style="color:green">[DO]: Count the variants in both WES and WGS vcf files.</span>

<span style="color:purple">[Q]: Explain the difference in variant numbers between the WES and WGS datasets.</span>

[A]:

### 4B. Variant Filtering <a name="Variant Filtering"></a>

Even though the GATK has a pretty sophisticated variant caller, you still need to determine if your variants are of acceptable quality and try to filter out noise from the sequencing. We can do this by applying filtering logic on characteristics of the variants, for instance coverage or quality. Within the VCF format there is a special column reserved that shows the results of filtering rules: either PASS if everything is well, or, if a variant fails a specific rule, the name of that filter (or of multiple filters).

Depending on your application, the need and 'weight' of filters can differ: if you want to be supersensitive, don't filter too strictly. If you want to be very precise, filter rigorously. However, we do want to keep all the information in the vcf file, so we annotate the filter status in the appropriate way (filter settings and names go into the header of the VCF, and the filter results go in the filter column). This way, results are very traceable since there is documentation on what analyses are being done. (We need to be FAIR, remember :)).

In subsequent analyses you do need to make sure that the filter status is used correctly. Many tools allow for using 'PASS' variants only, or 'ALL' variants if you need to ignore the filter status.
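The filter status lives in column 7 (FILTER) of each variant line, so it can be inspected with standard tools. The sketch below tallies the FILTER values of a made-up VCF file and counts the PASS variants; the filter name `my_snp_filter` is hypothetical. Testing column 7 explicitly is a little stricter than a plain text search for PASS, which could also match that word elsewhere on a line.

```bash
# Made-up filtered VCF: two PASS variants and one that failed a filter.
workdir=$(mktemp -d)
vcf="$workdir/toy_filtered.vcf"
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '17\t100\t.\tA\tG\t50\tPASS\tDP=10\n'
  printf '17\t200\t.\tC\tT\t12\tmy_snp_filter\tDP=3\n'
  printf '17\t300\t.\tG\tA\t70\tPASS\tDP=8\n'
} > "$vcf"

# Tally every value that occurs in the FILTER column (column 7).
awk -F'\t' '!/^#/ { count[$7]++ } END { for (f in count) print f, count[f] }' "$vcf"

# Count only the variants that passed all filters.
n_pass=$(awk -F'\t' '!/^#/ && $7 == "PASS" { n++ } END { print n + 0 }' "$vcf")
echo "PASS variants: $n_pass"
```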

Moreover, SNPs and Indels require different filters, since they have different characteristics. To apply this correctly, we first need to split the vcf file into a vcf containing only SNPs and one containing only INDELs by using **SelectVariants** (**-T**). For reference see [here](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php). We can then define filter rules and apply them on the appropriate vcf file. When this is done, we simply merge the filtered VCF files together.

```bash
java -jar $GATK \
-T SelectVariants \
-R $the_genome_reference_file \
-selectType [SNP|INDEL|MIXED] \
-V $input_file.vcf \
-o $outputfile.vcf
```


<span style="color:green">[DO]: Generate two separate files with SNPs (SNPS_RG_dedupped_na12878_wes_brcagenes.sorted.vcf) and INDELs (INDELS_RG_dedupped_na12878_wes_brcagenes.sorted.vcf)  from the **WES** vcf files using the instructions above. </span>

<span style="color:green">[DO]: Generate two separate files with SNPs (SNPS_RG_dedupped_na12878_wgs_brcagenes.sorted.vcf) and INDELs (INDELS_RG_dedupped_na12878_wgs_brcagenes.sorted.vcf) from the **WGS** vcf files using the instructions above. </span>

Now that we have separated the SNPs and the INDELs we can apply the filter rules. Since there are many possibilities it is not easy to formulate proper filter rules and it can take quite a bit of tuning to define filters that are neither too strict nor too loose. We will take the advice from the [GATK's best practice guidelines](https://software.broadinstitute.org/gatk/best-practices/bp_3step.php?case=GermShortWGS) for this exercise.

The GATK filtering tool is called **VariantFiltration** (**-T**), and next to the standard genome reference file (**-R**), vcf input file (**-V**) and an output filename (**-o**) it needs paired **--filterExpression** and **--filterName** parameters. You can specify multiple pairs if needed, and each expression is referred to by its name.

For SNPs the best practices advise: "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"

For INDELs this is: "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0"
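In these expressions `||` means OR: a variant is flagged as soon as any one of the conditions is true. To make that logic concrete, the sketch below applies a reduced version of the SNP rule (only the QD and FS parts) to a made-up VCF file with awk. This is purely an illustration of the OR logic under our own simplifying assumptions (for instance, a missing annotation is simply skipped); it is not a replacement for VariantFiltration, which also records the filter name in the FILTER column and the rule in the header.

```bash
# Made-up SNP VCF with QD and FS annotations in the INFO column (column 8).
workdir=$(mktemp -d)
vcf="$workdir/toy_snps.vcf"
{
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '17\t100\t.\tA\tG\t50\t.\tQD=12.5;FS=3.1\n'
  printf '17\t200\t.\tC\tT\t60\t.\tQD=1.4;FS=2.0\n'
  printf '17\t300\t.\tG\tA\t70\t.\tQD=20.0;FS=75.2\n'
} > "$vcf"

# Print the positions where QD < 2.0 OR FS > 60.0.
failing=$(awk -F'\t' '
  !/^#/ {
    qd = ""; fisher = ""
    n = split($8, kv, ";")
    for (i = 1; i <= n; i++) {
      if (kv[i] ~ /^QD=/) qd = substr(kv[i], 4)
      if (kv[i] ~ /^FS=/) fisher = substr(kv[i], 4)
    }
    if ((qd != "" && qd + 0 < 2.0) || (fisher != "" && fisher + 0 > 60.0)) print $2
  }' "$vcf" | tr '\n' ' ')
echo "positions failing the toy filter: $failing"
```

Position 200 fails on QD alone and position 300 on FS alone, which shows that one failing condition is enough.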

<span style="color:green">[DO]: Apply the appropriate filters to the WES and WGS SNP and INDEL vcf files; you can define your own names for the filters. Make sure the output filenames clearly mention that variants are "filtered", i.e. filtered_SNPS_RG_dedupped_na12878_wes_brcagenes.sorted.vcf etc.</span>


In [None]:
WES:

In [None]:
WGS:

<span style="color:purple">[Q]: How many SNPs and INDELs do we have for each dataset? Remember you can use 'grep' to count variant lines in a vcf file...</span>

[A]:

<span style="color:purple">[Q]: How many **PASS** SNPs and INDELs do we have for each dataset? Remember you can use 'grep' to count variant lines in a vcf file...</span>

[A]:

At this stage we have the variants filtered, but continuing to work with separate SNP and INDEL vcf files is not very convenient. We can simply merge the two vcf files by using GATK's tool **CombineVariants** (**-T**). As usual you need to specify the reference (**-R**), and now multiple input vcf files (**-V**). Add the following parameter to 'blend' the genotypes together: **--genotypemergeoption UNSORTED**

Take care not to mix the WES and WGS files! Name the output file filtered_VARS_RG_dedupped_na12878_wes_brcagenes.sorted.vcf (and wgs for the other set of course).

<span style="color:green">[DO]: Merge the SNP and INDEL vcf files together in a new vcf for both the WES and WGS datasets. Use the grep instruction to count the variants again in each resulting vcf file and compare with your earlier counts to check that you still have the same number of variants.</span>

In [None]:
WES:

In [None]:
WGS:

### 4C. Variant annotation <a name="Variant Annotation"></a>

The VCF files we now have contain mostly technical information on the sequencing and variant quality. You could say the only biological information is the genotype. Run the code below, replacing YOUR_IP_ADRES with the IP address of your server (shown in the address bar of the browser between https:// and /notebooks). Then click on the link for a vcf file and open it to view its contents. If you need help 'decoding' the vcf, you can review the [VCF format specifications](https://samtools.github.io/hts-specs/VCFv4.2.pdf).

In [None]:
for i in *.vcf; 
    do echo https://YOUR_IP_ADRES:80/files/$i;
done

For most people, the information from the vcf file is not immediately useful and additional biological information needs to be added. We call this annotation: we retrieve information about a variant from other sources and transfer it into the vcf file. This annotation can be diverse: from occurrence of a variant in public databases and predicted impact on transcription or translation to very specific details. The format of the vcf file allows storing of annotation in the INFO section and many tools have the capabilities to add this annotation from any other structured file (vcf, text table etc.) to a vcf.

In the next exercises we will annotate our vcf files with:
- The occurrence of the variant in the most well-known SNP database [dbSNP](https://www.ncbi.nlm.nih.gov/projects/SNP)
- Annotation and prediction of the effects of variants on genes (such as amino acid changes etc)

We will use the [SNPeff/SNPsift package](http://snpeff.sourceforge.net/) (in this case SNPSIFT), and as annotation source v149 from [dbSNP](https://www.ncbi.nlm.nih.gov/projects/SNP/snp_summary.cgi?build_id=149) which can be found here: ~/external/dbSNP/common_all_20161121.vcf.gz. This is a large database with variants detected in other datasets and provides information on frequency and/or population characteristics. If a variant is present here, its reference ID will be stored in the VCF file, in the 'ID' column.

You can start the annotation as follows. The standard output is the screen, so it needs redirecting to a file (give it a traceable name):

```bash
java -Xmx4g -jar $SNPSIFT annotate ~/external/dbSNP/common_all_20161121.vcf.gz infile.vcf > outfile.vcf
```

<span style="color:green">[DO]: Annotate both your WES and WGS vcf file (containing the filtered SNPs and INDELs together, so the filtered_VARS files) with the dbSNP ID if this is a known variant. Name the output files dbsnp_filtered_VARS... </span>

In [None]:
WES:

In [None]:
WGS:

To add more biological annotation, we again use the [SNPeff/SNPsift package](http://snpeff.sourceforge.net/), but now with SNPEFF and as annotation source v75 from [ensembl](http://feb2014.archive.ensembl.org/index.html): GRCh37.75. You can start the annotation as follows. Again the standard output is the screen, so it needs redirecting to a file:

```bash
java -Xmx4g -jar $SNPEFF GRCh37.75 infile.vcf > outfile.vcf
```

This annotation step is more extensive than only an ID and will be stored in the INFO section of each variant.

<span style="color:green">[DO]: Annotate both your WES and WGS vcf file (containing the dbSNP annotated filtered_VARS) with ensembl information. This takes a couple of minutes. Name the output file ANN_dbsnp_filtered_VARS...</span>

In [None]:
WES:

In [None]:
WGS:

You can use the code from a few steps back to create links to vcf files and download/view them, but with the new annotation the files become more and more difficult to read. Let's try to focus on a single variant to illustrate this.

Use 'grep 41223094' on your WES vcf file before the annotation **and** on the annotated one.

In [None]:
grep 41223094 filtered_VARS_RG*wes*.vcf

In [None]:
grep 41223094 ANN_dbsnp_filtered_VARS_RG*wes*.vcf

You should notice two things:
- The third column now has the dbSNP ID filled in: apparently, this variant has been found before.
- In the INFO section, a very large block starting with ANN= has been added. This contains the information from ensembl on this variant in particular.
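The ANN block is a list of fields separated by '|', in which the second, third and fourth fields hold the predicted effect, its impact and the gene name. The sketch below picks those fields out of a made-up annotated VCF file (position, rs number, gene and transcript IDs are all invented). It also illustrates a limitation of a plain grep on a number: grep matches the digits anywhere on a line, whereas awk can test the POS column exactly.

```bash
# Made-up annotated VCF; position, rs number, gene and IDs are invented.
workdir=$(mktemp -d)
vcf="$workdir/toy_annotated.vcf"
{
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '17\t1234\trs0000001\tA\tG\t50\tPASS\tDP=10;ANN=G|missense_variant|MODERATE|GENE1|ENSG0|transcript|ENST0|protein_coding|2/5|c.10A>G|p.Thr4Ala\n'
  printf '17\t51234\t.\tC\tT\t60\tPASS\tDP=12\n'
} > "$vcf"

# grep matches '1234' anywhere, so position 51234 is reported as well.
n_grep=$(grep -c 1234 "$vcf")
# awk compares column 2 (POS) exactly.
n_exact=$(awk -F'\t' '$2 == 1234 { n++ } END { print n + 0 }' "$vcf")
echo "grep hits: $n_grep, exact POS hits: $n_exact"

# Print ID plus effect, impact and gene name from the first ANN entry.
summary=$(awk -F'\t' '$2 == 1234 {
    n = split($8, info, ";")
    for (i = 1; i <= n; i++)
      if (info[i] ~ /^ANN=/) {
        split(substr(info[i], 5), ann, "|")
        print $3, ann[2], ann[3], ann[4]
      }
  }' "$vcf")
echo "$summary"
```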

Imagine how complex vcf files become if you have this information on >3 million variants. Moreover, this is considered 'basic' annotation; we can easily come up with >100 fields of additional information, like population frequencies, several predictions on pathogenicity, occurrence in regulatory regions etc.

Since the information stored in the vcf format is structured, and the annotation tools can *add* information, there are also tools that can *extract* this information again. 

You can follow a number of approaches to do this:
- define a search question and 'annotate' variants that match this query, then in the next step simply match on this annotation flag. This is great if the query takes very long and it is likely to be used more.
- define a search question and create a new vcf with only variants matching this query. Pro: easily done, results stay fixed. Cons: how to name all these partial vcfs and keep overview? Also, information is still complex per variant.
- define a search question and extract a table with required information in text format. Pro: easy to read and use, Con: not standardized in format.

A simple example, following your 'grep 41223094' query above would be:
```bash
java -Xmx4g -jar $SNPSIFT filter "POS=41223094" $VCFFILE
```
If you run this you will see that the result is similar: you'll see the variant at this position. However, with grep you'll miss out on the complete header, while the SNPSIFT result is a traceable, proper vcf (with even the search query added to the header).

Moreover, with grep it is not easy to combine statements and to use logic (AND, OR, etc) while the dedicated tools do support this. Additionally, you can often provide 'biological' terms, for instance HOM will be interpreted as homozygous so 0/0 and 1/1 genotypes, and HET as heterozygous (0/1).

For example, consider looking for heterozygous variants with a 'moderate' impact:
```bash
java -Xmx4g -jar $SNPSIFT filter "(ANN[*].IMPACT has 'MODERATE') & isHet(GEN[0])" $vcffile
```

<span style="color:green">[DO]: filter your annotated WES vcf file for variants with MODERATE impact and a homozygous genotype.</span>

Hopefully, you see the power of this concept and with help of the [documentation](http://snpeff.sourceforge.net/SnpSift.html#filter) on the syntax, you can execute powerful filterqueries.

However, this does not solve the legibility issue: the vcf format remains intact, so difficult to read for humans. Let's see if we can generate a nice table with filter results and only the information we need.

If we continue with the previous example, but want to retrieve a table with specific information, we can use SNPSIFT's **ExtractFields**.

We can combine the filter step with the ExtractFields step as follows, using the Linux pipe (|) operator:
```bash
java -Xmx4g -jar $SNPSIFT filter "(ANN[*].IMPACT has 'MODERATE') & isHom(GEN[0])" $vcffile | \
java -Xmx4g -jar $SNPSIFT ExtractFields - CHROM POS ID REF ALT "ANN[0].GENE" "ANN[0].IMPACT" "ANN[0].HGVS_P" "GEN[0].GT"
```

<span style="color:green">[DO]: Execute the command on your WES vcf file, and explain the results.</span>


<span style="color:green">[DO]: Repeat the command on your WGS vcf file, and compare to your WES results. What is the difference and can you explain this?</span>

## 5. Navigate & explore sequencing data<a name="Navigate & explore sequencing data"></a>

You've been working with NGS data on the command line now and that works fine. However, at some stages you miss the visual aspect and that is what we want to show you next. We will use the [Integrative Genomics Viewer (IGV)](http://software.broadinstitute.org/software/igv/) to show your results from mapping and variant calling on this dataset.

To do this, we first have to download our files from both the WES and WGS datasets. What we need is our (dedupped, sorted, readgroup) BAM files (and their BAI indexes) and our annotated VCF files: so 6 files in total. Store them in a folder on your computer.

Also add the RNAseq BAM file (and its index) from the datafiles folder to your download.
You can create download links like this:
```bash
for i in *.vcf *.ba* datafiles/*rnaseq*.ba*;
    do echo https://YOUR_IP_ADRES:80/files/$i;
done
```

If you've downloaded the files, you can [**launch IGV**](http://data.broadinstitute.org/igv/projects/current/igv_mm.jnlp) (Or please go to the [IGV page](http://software.broadinstitute.org/software/igv/download) and pick the option to **launch IGV with 1.2 GB**.). 

Skip the option to update Java if prompted.


<span style="color:green">[DO]: In the IGV interface, type BRCA1 in the text field at the top and press enter. You have now navigated to the position in the human reference genome where the BRCA1 gene is located. You should see a track called ‘Refseq genes’.</span>

<span style="color:purple">[Q]: On which strand is the BRCA1 gene located? The gene is visualized with three different line thicknesses: what do they indicate?</span>

[A]

<span style="color:green">[DO]: Go to ‘File’ -> ‘Load from file’ and load the BAM file with your WGS results.
Navigate to the positions in the genome with coordinates: chr17:41,245,922-41,247,825.</span>

<span style="color:purple">[Q]: What do the purple and pink bars represent, and what does their color indicate?</span>

[A]
Note: if you see alternative display, you can control the coloring in the right-click menu on the filename in the left panel.

<span style="color:green">[DO]: Put your mouse pointer on top of one of the purple or pink bars. A pop-up window will now appear. </span>

<span style="color:purple">[Q]: What is the mapping location of the read? Why do many reads have unique start and end positions? </span>

[A]

<span style="color:purple">[Q]: The grey track above the sequence reads represents the sequence coverage. What does sequence coverage mean? What is the approximate coverage in this region of the genome?</span>

[A]

<span style="color:green">[DO]: Go to the region with coordinates: chr17:41,245,041-41,260,275.</span>

<span style="color:purple">[Q]:Are exons and introns equally well covered with sequence reads? In the middle of this region you see a dip in the coverage. Zoom into this region and check the underlying genomic sequence. Could you come up with an explanation of the poor coverage in this region?</span>

<span style="color:green">[DO]: Click the right mouse button on the wgs_brca1.bam name in the leftmost panel and choose the ‘collapsed’ display mode.
Now also load the BAM file for your WES data. </span>

<span style="color:purple">[Q]:What is the difference with the sequence data in wgs_brca1.bam?</span>

[A]

<span style="color:green">[DO]: Navigate to region chr17:41,245,184-41,245,302. In the WES Coverage track you see a bar with two colors (green and brown). </span>

<span style="color:purple">[Q]:What base is present in the reference genome at this position? What base do you find in your sequence reads at this position? Why do 50% of the reads have a reference base and 50% a non-reference base?</span>

[A]

<span style="color:green">[DO]: Now load the RNAseq BAM file. </span>

<span style="color:purple">[Q]:What is the difference in the sequence read distribution when compared to the previous two datasets? This data set is derived from an RNA-sequencing experiment. Do you think that the BRCA1 gene is expressed in this sample?  </span>

[A]

<span style="color:green">[DO]: If you zoom out a bit (by pressing the minus box in the right corner), you see more variants: </span>

<span style="color:green">[DO]: Navigate to position chr17:41,195,955-41,198,056. </span>

<span style="color:purple">[Q]:Can you explain the clear differences in sequence coverage in this window?</span>

[A]

<span style="color:green">[DO]: Navigate to position chr17:41,243,013-41,243,488. You should now see several sequence reads which are connected with a thin line. In fact, these reads appear to be split and both ends map to different locations. </span>

<span style="color:purple">[Q]: Could you explain the different mapping locations of the two ends?</span>

[A]

### Variants
<span style="color:green">[DO]: Load from the folder your vcf file from variant calling on the exome and the WGS sequencing set.</span>

<span style="color:purple">[Q]: If you hover over the colored bars in the vcf track you get more information. What is the database id (RSxxxxx) of the T>C variant at pos chr17:41,244,000?</span>

[A]

<span style="color:green">[DO]: Look at position chr17:41,262,605.  </span>

<span style="color:purple">[Q]: What kind of variant is this? If you look at the WGS sequencing track you see more extreme variation. If you look at the sequence, can you find an explanation for this kind of variation? Is it likely that the protein is changed by this variation?</span>

[A]