<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Analysis" data-toc-modified-id="Analysis-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Analysis</a></span><ul class="toc-item"><li><span><a href="#vcf.Reader-[docs]" data-toc-modified-id="vcf.Reader-[docs]-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span><code>vcf.Reader</code> <a href="https://pyvcf.readthedocs.io/en/latest/API.html#vcf-reader" target="_blank">[docs]</a></a></span></li></ul></li><li><span><a href="#vcf.model._Record-[docs]" data-toc-modified-id="vcf.model._Record-[docs]-2"><span class="toc-item-num">2&nbsp;&nbsp;</span><code>vcf.model._Record</code> <a href="https://pyvcf.readthedocs.io/en/latest/API.html#vcf-model-record" target="_blank">[docs]</a></a></span></li><li><span><a href="#vcf.model_Call-[docs]" data-toc-modified-id="vcf.model_Call-[docs]-3"><span class="toc-item-num">3&nbsp;&nbsp;</span><code>vcf.model_Call</code> <a href="https://pyvcf.readthedocs.io/en/latest/API.html#vcf-model-call" target="_blank">[docs]</a></a></span><ul class="toc-item"><li><span><a href="#What-exactly-is-genotype?-And-why-is-there-a-heterozygous-property-in-this-class?" data-toc-modified-id="What-exactly-is-genotype?-And-why-is-there-a-heterozygous-property-in-this-class?-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>What exactly is genotype? And why is there a heterozygous property in this class?</a></span></li><li><span><a href="#Genotype-calling?" data-toc-modified-id="Genotype-calling?-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Genotype calling?</a></span></li></ul></li><li><span><a href="#The-Script" data-toc-modified-id="The-Script-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>The Script</a></span></li></ul></div>

In [2]:
import vcf

import os
import gzip

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

plt.style.use("ggplot")

In [7]:
# VCF data loading "hello world"
system_path = r"C:\Users\uniqu\Adaptation\github repos" \
              + "\Bioinformatics-Neural Networks for Genomic Risk"
system_path = system_path + r"\DawleyRats"
vcf_file_name_options = [r"allChr.allSamps.90DR2.maf01.hweE7.noIBD.CharlesRiverOnly.vcf.gz",
                         r"allChr.allSamps.90DR2.maf01.hweE7.noIBD.HarlanOnly.vcf.gz"]
vcf_file_name = vcf_file_name_options[0] # "...vcf.gz"

vcf_file_path = os.path.join(system_path, vcf_file_name)

vcf_reader = vcf.Reader(filename=vcf_file_path, compressed=True)

try:
    print("VCF data loaded successfully...\n")
    print(f"Metadata:\n{vcf_reader.metadata}")
except: 
    print("Failed to load VCF data.")

VCF data loaded successfully...

Metadata:
OrderedDict([('fileformat', 'VCFv4.2'), ('fileDate', '20180201'), ('source', ['PLINKv1.90'])])


--- 

## Analysis

Resources Used:
- [PyVCF Tutorial: Michal Linial (Jan, 2020). *Quantitative Biological Research with Python*.](https://youtu.be/jWu_nxlS5Vc) (ends @ 12 minutes)
- 

###  `vcf.Reader` [[docs]](https://pyvcf.readthedocs.io/en/latest/API.html#vcf-reader)

In [9]:
# Read the first n characters of the .gz vcf file.
with gzip.open(vcf_file_path, 'rt') as f:
    print(f.read(n := int(1e3)))

##fileformat=VCFv4.2
##fileDate=20180201
##source=PLINKv1.90
##contig=<ID=1,length=282745832>
##contig=<ID=2,length=266367381>
##contig=<ID=3,length=177678048>
##contig=<ID=4,length=184213463>
##contig=<ID=5,length=173704786>
##contig=<ID=6,length=147965078>
##contig=<ID=7,length=145599967>
##contig=<ID=8,length=133288266>
##contig=<ID=9,length=122022420>
##contig=<ID=10,length=112580048>
##contig=<ID=11,length=90453650>
##contig=<ID=12,length=52683120>
##contig=<ID=13,length=114010850>
##contig=<ID=14,length=115436306>
##contig=<ID=15,length=111173208>
##contig=<ID=16,length=90610649>
##contig=<ID=17,length=90840848>
##contig=<ID=18,length=87963297>
##contig=<ID=19,length=62264106>
##contig=<ID=20,length=56089150>
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	1052	1053	1054	1055	1059	1060	1061	1062	1065	106

Q: What's the gzip module for? [gzip docs](https://docs.python.org/3/library/gzip.html)

- This module provides a simple interface to compress and decompress files like the GNU program gzip
- `gzip`: a module that provides the `GzipFile` class as well as the `open`, `compress`, and `decompress` convenience functions.

Q: The "*.gz" file extension?

- .gz files: compressed files created using the gzip compression utility, which was created to replace and improve on compress in UNIX. This utility is commonly used on UNIX-like systems.
- gzip file compression is often used to compress some elements of web pages to speed up page loading. 

Q: Why is the tool called `gzip`? 

- A .gz file is an archive file compressed by the standard GNU zip (gzip) compression algorithm.

Q: Why use the `GzipFile` class? [class docs](https://docs.python.org/3/library/gzip.html#gzip.GzipFile)

- It simulates most of the methods of a "file object"

Q: "file object"? [Python docs. file object.](https://docs.python.org/3/glossary.html#term-file-object)

file object: 
- An object exposing a file-oriented API (with methods such as `read()` or `write()`) to an underlying resource
- Tutorial [file objects](https://youtu.be/Uh2ebFW8OYM)
- Tutorial [OS Module](https://www.youtube.com/watch?v=tJxcKyFMTGo)

Q: `gzip.open` method?



In [10]:
vcf_reader.metadata

OrderedDict([('fileformat', 'VCFv4.2'),
             ('fileDate', '20180201'),
             ('source', ['PLINKv1.90'])])

Q: How does the `OrderDict` type differ from the regular dictionary? [OrderedDict docs](https://docs.python.org/3.4/library/collections.html?highlight=ordereddict)
- It retains the order in which the entries were added.

[Python Dictionary Methods](https://www.w3schools.com/python/python_ref_dictionary.asp)

In [11]:
vcf_reader.metadata.items()

odict_items([('fileformat', 'VCFv4.2'), ('fileDate', '20180201'), ('source', ['PLINKv1.90'])])

In [12]:
for pair in vcf_reader.metadata.items():
    print(pair)

('fileformat', 'VCFv4.2')
('fileDate', '20180201')
('source', ['PLINKv1.90'])


In [13]:
vcf_reader.infos

OrderedDict([('PR',
              Info(id='PR', num=0, type='Flag', desc='Provisional reference allele, may not be based on real reference genome', source=None, version=None))])

In [14]:
vcf_reader.infos.items()

odict_items([('PR', Info(id='PR', num=0, type='Flag', desc='Provisional reference allele, may not be based on real reference genome', source=None, version=None))])

In [15]:
# name: name of an info object
# info: a vcf.Reader info object
for name, info in vcf_reader.infos.items():
    print(f"{name} ({info.type}): {info.desc}")

PR (Flag): Provisional reference allele, may not be based on real reference genome


`vcf_reader.infos` | info object:
- `info.type`: type of the info object
- `info.desc`: desription of the info object

## `vcf.model._Record` [[docs]][record docs]

**class `vcf.model._Record`**: A set of calls at a site. The standard VCF fields CHROM, POS, IS, REF, ALT, INFO, QUAL, FILTER, and FORMAT are available as properties (details on [Wikipedia][vcf_wikipedia]). 

The list of calls is in the `samples` attribute. 

[record docs]: https://pyvcf.readthedocs.io/en/latest/API.html#vcf-model-record
[vcf_wikipedia]: https://en.wikipedia.org/wiki/Variant_Call_Format

`vcf_reader` is an iterable object.

This means `it = iter(vcf_reader)` would be redundant and we can already use `next()`. 

In [29]:
record = next(vcf_reader)

print(f"Chromsome: {record.CHROM}")
print(f"position: {record.POS}")
print(f"alternative alleles: {record.ALT}")
print(f"reference base: {record.REF}")
print(f"Variation info: {record.INFO}")
print(f"Identifier of variation: {record.ID}")

Chromsome: 1
position: 669588
alternative alleles: [T]
reference base: A
Variation info: {'PR': True}
Identifier of variation: chr1.669588


## `vcf.model_Call` [[docs]][call docs]

**class `vcf.model._Call`**: A genotype call, a cell entry in a VCF file.

[call docs]: https://pyvcf.readthedocs.io/en/latest/API.html#vcf-model-call

### What exactly is genotype? And why is there a heterozygous property in this class?

A **gene** is a section of DNA that encodes a trait. The precise arrangement of **nucleotides** in a gene can differ between copies of the same gene. Therefore, a gene can exist in different forms across organisms. These different forms are known as **alleles**. The exact fixed position on the chromosome that contains a particular gene is known as a **locus**.

A **diploid** organism either inherits two copies of the same allele or one copy of two different alleles from their parents. If an individual inherits two identical alleles, their **genotype** is said to be **homozygous** at that locus. However, if they possess two different alleles, their genotype is classed as **heterozygous** for that locus.

Alleles of the same gene are either autosomal dominant or recessive. An **autosomal dominant allele** will always be preferentially expressed over a recessive allele.

The subsequent combination of alleles that an individual possesses for a specific gene is their **genotype**.  

Nucleotides are each composed of a phosphate group, sugar and a base.

[source: technologynetworks.com/genomics](https://www.technologynetworks.com/genomics/articles/genotype-vs-phenotype-examples-and-definitions-318446)

### Genotype calling?

Genotype calling is the process of determining the genotype for each individual and is typically only done for positions in which a SNP or a 'variant' has already been called. We use the word 'calling' here to signify the estimation of one unique SNP or genotype.

**Source**: Nielsen, R., Paul, J. S., Albrechtsen, A., & Song, Y. S. (2011). Genotype and SNP calling from next-generation sequencing data. *Nature reviews. Genetics, 12(6)*, 443–451. https://doi.org/10.1038/nrg2986 

In [43]:
for idx, sample in enumerate(record.samples):
    print(type(sample))
    break

<class 'vcf.model._Call'>


In [48]:
sample.gt_bases

sample.

False

In [20]:
# vcf_reader.samples (list): the genotype calls

type(vcf_reader.samples), len(vcf_reader.samples), vcf_reader.samples[:3]

(list, 1780, ['1052', '1053', '1054'])

In [35]:
np.array(record.samples)

array([Call(sample=1052, CallData(GT=0/0)),
       Call(sample=1053, CallData(GT=0/0)),
       Call(sample=1054, CallData(GT=0/0)), ...,
       Call(sample=4182, CallData(GT=0/0)),
       Call(sample=4659, CallData(GT=0/1)),
       Call(sample=920, CallData(GT=0/0))], dtype=object)

In [82]:
print(f"Variant type: {record.var_type}\n\
      \t{record.is_snp}, {record.is_indel}\n\
      \t{record.alleles}")

Variant type: snp
      	True, False
      	['C', T]


In [None]:
record_batches = []
batch_size = 1000 # number of records per batch

In [18]:
record_count = 0 

for batch

for idx, record in enumerate(vcf_reader):
    record_count += 1
    if idx % 500 == 0:
        print(f"{idx} records looped")

1780
0 records looped
1780
500 records looped
1780
1000 records looped
1780
1500 records looped
1780
2000 records looped
1780
2500 records looped
1780
3000 records looped
1780
3500 records looped
1780
4000 records looped
1780
4500 records looped
1780
5000 records looped
1780
5500 records looped
1780
6000 records looped
1780
6500 records looped
1780
7000 records looped
1780
7500 records looped
1780
8000 records looped
1780
8500 records looped
1780
9000 records looped
1780
9500 records looped
1780
10000 records looped
1780
10500 records looped
1780
11000 records looped
1780
11500 records looped
1780
12000 records looped
1780
12500 records looped
1780
13000 records looped
1780
13500 records looped
1780
14000 records looped
1780
14500 records looped
1780
15000 records looped
1780
15500 records looped
1780
16000 records looped
1780
16500 records looped
1780
17000 records looped
1780
17500 records looped
1780
18000 records looped


KeyboardInterrupt: 

---

## The Script 

In [None]:
import vcf

import os
import gzip

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

plt.style.use("ggplot")

def read_chars_gz(n):
    """ Read the first n characters of the .gz vcf file.
    Args:
        n (int)
    """
    with gzip.open(vcf_file_path, 'rt') as f:
        print(f.read(n))


In [None]:
def main():
    system_path = r"C:\Users\uniqu\Adaptation\github repos" \
              + "\Bioinformatics-Neural Networks for Genomic Risk"
    system_path = system_path + r"\DawleyRats"
    vcf_file_name_options = [r"allChr.allSamps.90DR2.maf01.hweE7.noIBD.CharlesRiverOnly.vcf.gz",
                             r"allChr.allSamps.90DR2.maf01.hweE7.noIBD.HarlanOnly.vcf.gz"]
    vcf_file_name = vcf_file_name_options[0] # "...vcf.gz"

    vcf_file_path = os.path.join(system_path, vcf_file_name)

    vcf_reader = vcf.Reader(filename=vcf_file_path, compressed=True)

    try:
        print("VCF data loaded successfully...\n")
        print(f"Metadata:\n{vcf_reader.metadata}")
    except: 
        print("Failed to load VCF data.")
    
    n = int(1e4)
    read_chars_gz(n)
        
    
if __name__ == '__main__':
    main()