# Identifying STR variants
In this notebook you will load the variant calling results into a DataFrame, do some data wrangling, and identify which samples have STR variants in them.

This notebook will not be graded separately, but may be considered when determining your participation grade.

Start by loading some common data science libraries that we'll use to work with the data we generated:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_context("poster")

We'll mostly be using the Pandas library today. For an overview of the library and some tutorials, see [here](https://pandas.pydata.org/docs/getting_started/overview.html). There is also a handy [cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) containing most of the functionality you'll need.

Read the 'merged_summary_results.tsv' file into a pandas DataFrame. The most convenient way to do this is with the [read_csv()](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function that pandas provides (make sure to set the column separator appropriately!).

In [2]:
# Your code here
df_tsv = pd.read_csv("C:/Users/bburr/FASTA/project-day-07/project-day-07/results/merged_results_summary.tsv", sep="\t")
df_tsv.head()





Unnamed: 0,#[1]CHROM,[2]POS,[3]REF,[4]ALT,[5]patient_1:GT,[6]patient_2:GT,[7]patient_3:GT
0,chr5,7241,a,.,0/0,0/0,0/0
1,chr5,9390,a,.,0/0,0/0,0/0
2,chr5,10062,t,.,0/0,0/0,0/0
3,chr5,10673,a,.,0/0,0/0,0/0
4,chr5,15411,t,.,0/0,0/0,0/0
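To see why the separator matters, here is a minimal sketch that parses a small in-memory excerpt twice. The two data rows are a hypothetical excerpt in the same layout as the summary file, and `tsv_text`, `wrong`, and `right` are illustrative names, not part of the exercise.

```python
import io

import pandas as pd

# Hypothetical two-row excerpt in the same layout as the bcftools summary file.
tsv_text = (
    "#[1]CHROM\t[2]POS\t[3]REF\t[4]ALT\t"
    "[5]patient_1:GT\t[6]patient_2:GT\t[7]patient_3:GT\n"
    "chr5\t7241\ta\t.\t0/0\t0/0\t0/0\n"
    "chr5\t9390\ta\t.\t0/0\t0/0\t0/0\n"
)

# The default separator is a comma, so each line lands in a single column.
wrong = pd.read_csv(io.StringIO(tsv_text))

# With sep="\t" the seven tab-separated fields are split as intended.
right = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(wrong.shape)
print(right.shape)
```

If the DataFrame you loaded has only one column, the separator is the first thing to check.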


The default column names that bcftools generated are not so nice. Rename them to something better.

In [3]:
# Your code here
df_tsv.columns = ['Chromosome', 'Position', 'Reference', 'Alternative', 'patient_1', 'patient_2', 'patient_3']
df_tsv.head()



Unnamed: 0,Chromosome,Position,Reference,Alternative,patient_1,patient_2,patient_3
0,chr5,7241,a,.,0/0,0/0,0/0
1,chr5,9390,a,.,0/0,0/0,0/0
2,chr5,10062,t,.,0/0,0/0,0/0
3,chr5,10673,a,.,0/0,0/0,0/0
4,chr5,15411,t,.,0/0,0/0,0/0
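Assigning a list of names works when you know the column order. As an alternative sketch, the bcftools decorations can be stripped with regular expressions, which avoids retyping the names if more patients are added. The patterns below assume every header follows the `[n]NAME` or `[n]NAME:GT` layout seen in the output above.

```python
import pandas as pd

# Column names as bcftools wrote them (copied from the output above).
raw_columns = pd.Index([
    "#[1]CHROM", "[2]POS", "[3]REF", "[4]ALT",
    "[5]patient_1:GT", "[6]patient_2:GT", "[7]patient_3:GT",
])

clean_columns = (
    raw_columns
    .str.replace(r"^#?\[\d+\]", "", regex=True)  # drop the leading "#[n]" or "[n]"
    .str.replace(r":GT$", "", regex=True)        # drop the trailing ":GT"
)

print(list(clean_columns))
```

The result keeps the short VCF names (`CHROM`, `POS`, ...), so a further rename would be needed to get the longer names used in this notebook.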


Now it's time to do some data wrangling. We want to transform the dataframe from its current (wide) format to a long format<sup>*</sup>. After the wrangling, every row in the dataframe should contain an observation of one STR locus in one patient, like so:


|    | chromosome         | position     | reference |     alternative       |   patient  |       genotype     |
|--------------|--------------|-----------|------------|------------|------------|------------|
|  **0**  | chr5 | 298      | a        |       .     |      patient_1      |      0/0      |
|  **1**  | chr5      |  298 | a       |       .     |       patient_2     |       0/0     |
|  **2**  | chr5      |  298 | a       |       .     |       patient_3     |       0/0     |

Pandas dataframes offer a handy function to accomplish this: [pd.DataFrame.melt()](https://pandas.pydata.org/pandas-docs/version/1.0.0/reference/api/pandas.DataFrame.melt.html).

<sup>*</sup>*Note: wide data formats are usually preferred when presenting data to humans, e.g., in a presentation. Long data formats tend to be more convenient for data analysis and plotting.*

In [4]:
# Your code here
df_melt = df_tsv.melt(
    id_vars=["Chromosome", "Position", "Reference", "Alternative"],
    value_vars=["patient_1", "patient_2", "patient_3"],
    var_name="Patient",
    value_name="Genotype",
)
df_melt = df_melt.sort_values(by=["Chromosome", "Position", "Patient"]).reset_index(drop=True)
df_melt.head()



Unnamed: 0,Chromosome,Position,Reference,Alternative,Patient,Genotype
0,chr5,7241,a,.,patient_1,0/0
1,chr5,7241,a,.,patient_2,0/0
2,chr5,7241,a,.,patient_3,0/0
3,chr5,9390,a,.,patient_1,0/0
4,chr5,9390,a,.,patient_2,0/0
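The effect of melting is easiest to see on a tiny table. The sketch below uses a two-locus toy frame (illustrative values, not the real results) and shows that the long format has one row per locus per patient.

```python
import pandas as pd

# Toy wide table: two loci, three patients (illustrative values only).
wide = pd.DataFrame({
    "Chromosome": ["chr5", "chr5"],
    "Position": [100, 200],
    "Reference": ["a", "a"],
    "Alternative": [".", "aa"],
    "patient_1": ["0/0", "1/1"],
    "patient_2": ["0/0", "0/0"],
    "patient_3": ["0/0", "0/0"],
})

# Every column not listed in id_vars is melted into (Patient, Genotype) pairs.
long = wide.melt(
    id_vars=["Chromosome", "Position", "Reference", "Alternative"],
    var_name="Patient",
    value_name="Genotype",
)

print(len(wide), "wide rows ->", len(long), "long rows")
print(long.sort_values(["Position", "Patient"]).reset_index(drop=True))
```

Leaving out `value_vars` melts all remaining columns, which is convenient here because everything that is not an identifier is a patient column.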


Finally, we want to select only the rows that are of interest to us: those where there is an STR variant. Using what you know about the VCF format, you should be able to come up with a criterion that selects the variant rows.

*In case you're not sure how to filter pandas dataframes: here is a [tutorial](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html)*.

In [5]:
# Your code here
df_filtered = df_melt[df_melt["Genotype"] != "0/0"]
df_filtered.head()


Unnamed: 0,Chromosome,Position,Reference,Alternative,Patient,Genotype
183,chr5,106700,a,aa,patient_1,1/1
257,chr5,137481,a,agagagaga,patient_3,1/1
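Comparing against `"0/0"` is enough for the data shown here. In other VCF files a genotype can also be missing (`./.`), and such rows would pass a `!= "0/0"` filter. The sketch below, on hypothetical toy data, excludes both; treating missing calls as non-variant is an assumption you may or may not want in your own analysis.

```python
import pandas as pd

# Toy long-format table; the "./." row is hypothetical and does not occur
# in the output shown above.
long = pd.DataFrame({
    "Position": [100, 200, 200, 300],
    "Patient": ["patient_1", "patient_1", "patient_2", "patient_3"],
    "Genotype": ["0/0", "1/1", "./.", "1/1"],
})

# Assumption: both homozygous-reference and missing calls count as "no variant".
non_variant = ["0/0", "./."]
variants = long[~long["Genotype"].isin(non_variant)]

print(variants)
```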


There should be two STR variants in total. If you managed to identify them, good job! You can now return to the main `README.md` file and continue with the rest of the analysis there.

## Bonus exercise: cyvcf2
In case you were already familiar with pandas and were able to complete the data wrangling easily, you can give this bonus exercise a try.

Instead of using bcftools to generate a summary file, it is much more common to work with the VCF file directly. Python libraries exist to facilitate this. See if you are able to identify the same STR variants as above by using the [cyvcf2](https://brentp.github.io/cyvcf2/) library to parse the `merged_results.vcf` file directly!

In [None]:
# Your code here
from cyvcf2 import VCF

vcf_path = "C:/Users/bburr/FASTA/project-day-07/project-day-07/results/merged_results.vcf"

# Pass the path variable itself, not the string 'df_vcf'.
vcf = VCF(vcf_path)

for variant in vcf:
    # gt_types codes (cyvcf2 default): 0 = HOM_REF, 1 = HET, 2 = UNKNOWN, 3 = HOM_ALT
    for sample, gt_type, bases in zip(vcf.samples, variant.gt_types, variant.gt_bases):
        # Report only samples that carry at least one alternative allele.
        if gt_type in (1, 3):
            print(variant.CHROM, variant.POS, variant.REF, variant.ALT, sample, bases)