<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Analysis" data-toc-modified-id="Analysis-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Analysis</a></span></li><li><span><a href="#The-Script" data-toc-modified-id="The-Script-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>The Script</a></span></li></ul></div>

In [1]:
import vcf

import os
import gzip

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

plt.style.use("ggplot")

In [2]:
# VCF data loading "hello world"
# Build the path with os.path.join so no string relies on an unescaped backslash.
system_path = os.path.join(r"C:\Users\uniqu\Adaptation\github repos",
                           "Bioinformatics-Neural Networks for Genomic Risk",
                           "DawleyRats")
vcf_file_name_options = [r"allChr.allSamps.90DR2.maf01.hweE7.noIBD.CharlesRiverOnly.vcf.gz",
                         r"allChr.allSamps.90DR2.maf01.hweE7.noIBD.HarlanOnly.vcf.gz"]
vcf_file_name = vcf_file_name_options[0] # "...vcf.gz"

vcf_file_path = os.path.join(system_path, vcf_file_name)

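# Open the reader inside the try block so a missing or unreadable file is actually caught.
try:
    vcf_reader = vcf.Reader(filename=vcf_file_path, compressed=True)
    print("VCF data loaded successfully...\n")
    print(f"Metadata:\n{vcf_reader.metadata}")
except OSError as err:
    print(f"Failed to load VCF data: {err}")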

VCF data loaded successfully...

Metadata:
OrderedDict([('fileformat', 'VCFv4.2'), ('fileDate', '20180201'), ('source', ['PLINKv1.90'])])


--- 

## Analysis

Resources Used:
- [PyVCF Tutorial: Michal Linial (Jan, 2020). *Quantitative Biological Research with Python*.](https://youtu.be/jWu_nxlS5Vc) (ends @ 12 minutes)

In [None]:
# Read the first n characters of the .gz vcf file.
with gzip.open(vcf_file_path, 'rt') as f:
    print(f.read(n := int(1e5)))

Q: What's the gzip module for? [gzip docs](https://docs.python.org/3/library/gzip.html)

- This module provides a simple interface for compressing and decompressing files, much like the GNU program gzip.
- `gzip`: a module that provides the `GzipFile` class as well as the `open`, `compress`, and `decompress` convenience functions.
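
The `compress` and `decompress` convenience functions can be tried on an in-memory byte string without touching any file. The snippet below is a minimal sketch; the sample text is made up for illustration and is not taken from the rat data set.

```python
import gzip

# Hypothetical sample text standing in for the start of a VCF file.
text = "##fileformat=VCFv4.2\n" * 100
raw = text.encode("utf-8")

compressed = gzip.compress(raw)          # bytes in, gzip-framed bytes out
restored = gzip.decompress(compressed)   # back to the original bytes

print(len(raw), len(compressed))         # repetitive text compresses well
print(restored == raw)
```

The round trip is lossless, which is why a `.vcf.gz` file can be read as if it were the plain `.vcf` text.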

Q: What is the "*.gz" file extension?

- .gz files are compressed files created with the gzip compression utility, which was written to replace and improve on the older UNIX `compress` program. The utility is common on UNIX-like systems.
- gzip compression is also often applied to parts of web pages to speed up page loading.

Q: Why is the tool called `gzip`? 

- The name is short for GNU zip: gzip is the GNU project's compression program, and a .gz file is a single file compressed with it (using the DEFLATE algorithm).

Q: Why use the `GzipFile` class? [class docs](https://docs.python.org/3/library/gzip.html#gzip.GzipFile)

- It simulates most of the methods of a "file object", so a compressed file can be read or written much like a regular one.

Q: "file object"? [Python docs. file object.](https://docs.python.org/3/glossary.html#term-file-object)

file object: 
- An object exposing a file-oriented API (with methods such as `read()` or `write()`) to an underlying resource
- Tutorial [file objects](https://youtu.be/Uh2ebFW8OYM)
- Tutorial [OS Module](https://www.youtube.com/watch?v=tJxcKyFMTGo)

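Q: What does the `gzip.open` function do?

- It opens a gzip-compressed file in binary or text mode and returns a file object. The `'rt'` mode used in the cell above reads the decompressed contents as text.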
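
To make the text-mode behaviour of `gzip.open` concrete, here is a small self-contained sketch. It writes a tiny gzipped VCF to a temporary directory and reads back only the `##` meta-information lines. The sample content, the file name `demo.vcf.gz`, and the helper `read_header_lines` are all hypothetical and exist only for this illustration.

```python
import gzip
import os
import tempfile

# Hypothetical miniature VCF used only for this illustration.
sample_vcf = (
    "##fileformat=VCFv4.2\n"
    "##source=PLINKv1.90\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t1482686\tchr1.1482686\tT\tC\t.\t.\tPR\n"
)


def read_header_lines(path):
    """Return the '##' meta-information lines at the top of a gzipped VCF."""
    header = []
    with gzip.open(path, "rt") as f:      # "rt" = read, text mode
        for line in f:
            if not line.startswith("##"):
                break                      # stop at the column header line
            header.append(line.rstrip("\n"))
    return header


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.vcf.gz")
    with gzip.open(demo_path, "wt") as f:  # "wt" = write, text mode
        f.write(sample_vcf)
    header_lines = read_header_lines(demo_path)

print(header_lines)
```

Iterating line by line avoids decompressing the whole file, which matters for a genome-wide VCF.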



In [4]:
vcf_reader.metadata

OrderedDict([('fileformat', 'VCFv4.2'),
             ('fileDate', '20180201'),
             ('source', ['PLINKv1.90'])])

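Q: How does the `OrderedDict` type differ from the regular dictionary? [OrderedDict docs](https://docs.python.org/3.4/library/collections.html?highlight=ordereddict)
- It retains the order in which the entries were added. Regular dictionaries also do this since Python 3.7, so the remaining differences are mainly that `OrderedDict` equality is order-sensitive and that it offers reordering methods such as `move_to_end()`.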
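
A short sketch of two behaviours specific to `OrderedDict`: order-sensitive equality and `move_to_end`. The keys mirror the metadata shown above, but the two dictionaries are constructed by hand for illustration.

```python
from collections import OrderedDict

a = OrderedDict([("fileformat", "VCFv4.2"), ("fileDate", "20180201")])
b = OrderedDict([("fileDate", "20180201"), ("fileformat", "VCFv4.2")])

ordered_equal = (a == b)            # OrderedDict comparison is order-sensitive
plain_equal = (dict(a) == dict(b))  # plain dict comparison ignores order
print(ordered_equal, plain_equal)

a.move_to_end("fileformat")         # reorder in place: "fileformat" goes last
print(list(a))
```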

[Python Dictionary Methods](https://www.w3schools.com/python/python_ref_dictionary.asp)

In [5]:
vcf_reader.metadata.items()

odict_items([('fileformat', 'VCFv4.2'), ('fileDate', '20180201'), ('source', ['PLINKv1.90'])])

In [6]:
for pair in vcf_reader.metadata.items():
    print(pair)

('fileformat', 'VCFv4.2')
('fileDate', '20180201')
('source', ['PLINKv1.90'])


In [7]:
vcf_reader.infos

OrderedDict([('PR',
              Info(id='PR', num=0, type='Flag', desc='Provisional reference allele, may not be based on real reference genome', source=None, version=None))])

In [8]:
vcf_reader.infos.items()

odict_items([('PR', Info(id='PR', num=0, type='Flag', desc='Provisional reference allele, may not be based on real reference genome', source=None, version=None))])

In [9]:
# name: name of an info object
# info: a vcf.Reader info object
for name, info in vcf_reader.infos.items():
    print(f"{name} ({info.type}): {info.desc}")

PR (Flag): Provisional reference allele, may not be based on real reference genome


`vcf_reader.infos` | info object:
- `info.type`: type of the info object
- `info.desc`: description of the info object
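
The attribute access above can be tried without the VCF file. The sketch below uses a hand-built `namedtuple` as a stand-in with the same field names that appear in the `vcf_reader.infos` output; it is an assumption made for illustration, not PyVCF's own class.

```python
from collections import OrderedDict, namedtuple

# Hypothetical stand-in for the Info objects shown above (same field names).
Info = namedtuple("Info", ["id", "num", "type", "desc", "source", "version"])

infos = OrderedDict([
    ("PR", Info(id="PR", num=0, type="Flag",
                desc="Provisional reference allele, may not be based on real reference genome",
                source=None, version=None)),
])

# Collect one (name, type, num, description) row per INFO field.
rows = [(name, info.type, info.num, info.desc) for name, info in infos.items()]

for name, info_type, num, desc in rows:
    print(f"{name} ({info_type}, num={num}): {desc}")
```

Rows in this shape could be passed to `pd.DataFrame` if a tabular summary of the header is wanted.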

In [13]:
# vcf_reader.samples (list): sample names

type(vcf_reader.samples)  # not displayed: only a cell's last expression is echoed
len(vcf_reader.samples)

1780

In [19]:
vcf_reader

<vcf.parser.Reader at 0x1159fa65a60>

In [18]:
record_count = 0 
for idx, record in enumerate(vcf_reader):
    record_count += 1
    if idx % 500 == 0:
        print(len(record.samples))
        print(f"{idx} records looped")
record_count  # an int, so len() would raise a TypeError

1780
0 records looped
1780
500 records looped
1780
1000 records looped
1780
1500 records looped
1780
2000 records looped
1780
2500 records looped
1780
3000 records looped
1780
3500 records looped
1780
4000 records looped
1780
4500 records looped
1780
5000 records looped
1780
5500 records looped
1780
6000 records looped
1780
6500 records looped
1780
7000 records looped
1780
7500 records looped
1780
8000 records looped
1780
8500 records looped
1780
9000 records looped
1780
9500 records looped
1780
10000 records looped
1780
10500 records looped
1780
11000 records looped
1780
11500 records looped
1780
12000 records looped
1780
12500 records looped
1780
13000 records looped
1780
13500 records looped
1780
14000 records looped
1780
14500 records looped
1780
15000 records looped
1780
15500 records looped
1780
16000 records looped
1780
16500 records looped
1780
17000 records looped
1780
17500 records looped
1780
18000 records looped


KeyboardInterrupt: 

`vcf_reader` is an iterator, not merely an iterable: it yields records one at a time and keeps track of its position.

This means `it = iter(vcf_reader)` would be redundant, and we can call `next()` on the reader directly.
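
The difference between a plain iterable and an iterator can be sketched with standard Python only. `CountingReader` below is a hypothetical minimal class that loosely mimics a record reader; it is not PyVCF code.

```python
numbers = [1, 2, 3]            # a list is iterable, but it is not an iterator

list_error = None
try:
    next(numbers)              # fails: lists do not implement __next__
except TypeError as err:
    list_error = str(err)

it = iter(numbers)             # iter() is required to get an iterator
first = next(it)


class CountingReader:
    """Hypothetical minimal iterator that hands out records one at a time."""

    def __init__(self, records):
        self._records = list(records)
        self._pos = 0

    def __iter__(self):
        return self            # an iterator returns itself from __iter__

    def __next__(self):
        if self._pos >= len(self._records):
            raise StopIteration
        record = self._records[self._pos]
        self._pos += 1
        return record


reader = CountingReader(["rec0", "rec1", "rec2"])
first_record = next(reader)    # works directly, no iter() call needed
remaining = list(reader)       # iteration resumes where next() stopped

print(first, first_record, remaining)
```

Because an iterator remembers its position, records consumed by an earlier loop are not returned again by a later `next()` call.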

In [110]:
record = next(vcf_reader)

print(f"Chromosome: {record.CHROM} \
    {record.POS}\
    {record.ALT}\
    {record.REF}\
    {record.INFO}\
    {record.ID}")

Chromosome: 1     1482686    [C]    T    {'PR': True}    chr1.1482686


In [112]:
np.array(record.samples)

array([Call(sample=1052, CallData(GT=0/1)),
       Call(sample=1053, CallData(GT=0/0)),
       Call(sample=1054, CallData(GT=0/0)), ...,
       Call(sample=4182, CallData(GT=0/0)),
       Call(sample=4659, CallData(GT=0/1)),
       Call(sample=920, CallData(GT=0/0))], dtype=object)

In [82]:
print(f"Variant type: {record.var_type}\n\
      \t{record.is_snp}, {record.is_indel}\n\
      \t{record.alleles}")

Variant type: snp
      	True, False
      	['C', T]


---

## The Script 

In [None]:
import vcf

import os
import gzip

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

plt.style.use("ggplot")

def read_chars_gz(vcf_file_path, n):
    """ Print the first n characters of the .gz vcf file.
    Args:
        vcf_file_path (str): path to the gzip-compressed VCF file
        n (int): number of characters to read
    """
    with gzip.open(vcf_file_path, 'rt') as f:
        print(f.read(n))


In [None]:
def main():
    # Build the path with os.path.join so no string relies on an unescaped backslash.
    system_path = os.path.join(r"C:\Users\uniqu\Adaptation\github repos",
                               "Bioinformatics-Neural Networks for Genomic Risk",
                               "DawleyRats")
    vcf_file_name_options = [r"allChr.allSamps.90DR2.maf01.hweE7.noIBD.CharlesRiverOnly.vcf.gz",
                             r"allChr.allSamps.90DR2.maf01.hweE7.noIBD.HarlanOnly.vcf.gz"]
    vcf_file_name = vcf_file_name_options[0] # "...vcf.gz"

    vcf_file_path = os.path.join(system_path, vcf_file_name)

    # Open the reader inside the try block so a missing or unreadable file is caught.
    try:
        vcf_reader = vcf.Reader(filename=vcf_file_path, compressed=True)
        print("VCF data loaded successfully...\n")
        print(f"Metadata:\n{vcf_reader.metadata}")
    except OSError as err:
        print(f"Failed to load VCF data: {err}")
        return
    
    # Pass the path explicitly: it is local to main(), not a global.
    n = int(1e4)
    read_chars_gz(vcf_file_path, n)
        
    
if __name__ == '__main__':
    main()