# Gen559 NumPy practice notebook
### 2020.11.25

### Practice problem 1a 

In the cell below, write out in words the major operational steps required to solve **Practice problem 1b**.
>The file *gm12878.hg38.vcf* contains information about genetic variation in a cell line that is part of the [HapMap Project](https://www.coriell.org/1/NIGMS/Collections/HapMap-project?gclid=CjwKCAiA2O39BRBjEiwApB2IklG_xOJwN12uISiqhihv2xu2U8GJc1dPvd1pvLLcsZurn2iksDxkUxoCieAQAvD_BwE). 

>The format of the file is: CHROM \t POS \t ID \t REF \t ALT \t QUAL \t FILTER \t INFO \t FORMAT \t NA12878

>*CHROM = chromosome; POS = position; REF = allele in reference genome; ALT = alternate allele*

>*On macOS, you can unzip by running the command 'gunzip gm12878.hg38.vcf.gz'. On Windows, you can use [7-Zip](https://www.7-zip.org/).*

>* In the cell below, write a function that calculates the mean, median, and standard deviation for distances between single nucleotide variants (SNVs) on a specified chromosome.
>* Use NumPy to calculate the requested summary statistics.
>* Be sure to skip header lines beginning with '#'.
>* The file also contains information about indels; skip these lines too. Only consider SNVs in your calculations.
>* Run your function on chrX and print the result.

* Create function to open and parse .vcf file
* Extract file contents, add coordinates to list
* Build in logic to exclude header lines and non-SNV lines
* Calculate distances between SNVs
* Return summary stat info about distances

### Practice problem 1b

The file *gm12878.hg38.vcf* contains information about genetic variation in a cell line that is part of the [HapMap Project](https://www.coriell.org/1/NIGMS/Collections/HapMap-project?gclid=CjwKCAiA2O39BRBjEiwApB2IklG_xOJwN12uISiqhihv2xu2U8GJc1dPvd1pvLLcsZurn2iksDxkUxoCieAQAvD_BwE). 

The format of the file is: CHROM \t POS \t ID \t REF \t ALT \t QUAL \t FILTER \t INFO \t FORMAT \t NA12878

*CHROM = chromosome; POS = position; REF = allele in reference genome; ALT = alternate allele*

*On macOS, you can unzip by running the command 'gunzip gm12878.hg38.vcf.gz'. On Windows, you can use [7-Zip](https://www.7-zip.org/).*

* In the cell below, write a function that calculates the mean, median, and standard deviation for distances between single nucleotide variants (SNVs) on a specified chromosome.
* Use NumPy to calculate the requested summary statistics.
* Be sure to skip header lines beginning with '#'.
* The file also contains information about indels; skip these lines too. Only consider SNVs in your calculations.
* Run your function on chrX and print the result.
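
The header and indel filters described above can be tried out on a few made-up records before touching the real file. The sketch below is only an illustration: the helper name `is_snv_line` and the example records are hypothetical and are not taken from *gm12878.hg38.vcf*. It assumes the tab-separated column order given above (CHROM first, REF fourth, ALT fifth).

```python
def is_snv_line(line, chrom):
    """Return True if a VCF line is a SNV record on the given chromosome."""
    # Header lines start with '#'; blank lines carry no record.
    if line.startswith('#') or not line.strip():
        return False
    fields = line.rstrip('\n').split('\t')
    # Column 0 is CHROM, column 3 is REF, column 4 is ALT.
    return fields[0] == chrom and len(fields[3]) == 1 and len(fields[4]) == 1


# Hypothetical records, not taken from gm12878.hg38.vcf.
example_lines = [
    '##fileformat=VCFv4.2\n',                        # header
    'chrX\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\n',    # SNV
    'chrX\t250\t.\tAT\tA\t50\tPASS\t.\tGT\t0/1\n',   # deletion
    'chrX\t400\t.\tC\tCGG\t50\tPASS\t.\tGT\t1/1\n',  # insertion
    'chr1\t500\t.\tT\tC\t50\tPASS\t.\tGT\t0/1\n',    # SNV on another chromosome
]

snv_flags = [is_snv_line(line, 'chrX') for line in example_lines]
print(snv_flags)
```

Only the second record passes. Note that a length check on ALT also drops multi-allelic sites written as, for example, `G,T`, since that string has length 3.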

In [2]:
# Import NumPy.
import numpy as np


def get_distances(file, chrom):
    '''Takes in name of vcf file, returns a list of distances between SNVs
    from a specified chromosome'''
    
    # Collect SNV coordinates as integers.
    coords = []

    # Open specified file.
    with open(file, 'r') as f:
        for line in f:
            # Skip header lines and blank lines.
            if line.startswith('#') or not line.strip():
                continue

            # Split each line once instead of once per condition.
            fields = line.rstrip('\n').split('\t')

            # Only keep lines on the requested chromosome where 'REF' and
            # 'ALT' both have length 1, i.e. the line describes a SNV.
            if fields[0] == chrom and len(fields[3]) == 1 and len(fields[4]) == 1:
                coords.append(int(fields[1]))

    # Return the distances between consecutive SNVs.
    return [coords[i + 1] - coords[i] for i in range(len(coords) - 1)]


# Call the function on chrX.
dist_list = get_distances('gm12878.hg38.vcf', 'chrX')

# Create numpy array.
dist_array = np.asarray(dist_list)

# Calculate required metrics.
print('Mean distance between SNVs is %0.2f bp' % (np.average(dist_array)))
print('Median distance between SNVs is %0.2f bp' % (np.median(dist_array)))
print('STD of distance between SNVs is %0.2f bp' % (np.std(dist_array)))

Mean distance between SNVs is 1397.79 bp
Median distance between SNVs is 503.00 bp
STD of distance between SNVs is 5729.88 bp
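
As a possible alternative to the explicit loop, the consecutive distances can be computed with `np.diff`. The sketch below uses a short, made-up coordinate array rather than the real chrX data. It also shows that `np.std` returns the population standard deviation by default (`ddof=0`); passing `ddof=1` gives the sample standard deviation instead.

```python
import numpy as np

# Hypothetical SNV positions, sorted as they would appear in a VCF.
coords = np.array([100, 350, 400, 1000])

# np.diff returns coords[i+1] - coords[i] for each consecutive pair.
distances = np.diff(coords)
print(distances)

print(np.mean(distances))
print(np.median(distances))
print(np.std(distances))          # population standard deviation (ddof=0)
print(np.std(distances, ddof=1))  # sample standard deviation
```

For an array with as many entries as the chrX distances, the two standard deviations differ very little, so the choice mostly matters for small samples.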


### Practice problem 2

* Use NumPy to write and print a 3 x 4 matrix of zeros.
* Draw 12 random numbers and round each one to make a list of 0/1 values. Print your list.
* Update your original matrix using the following coordinate transformations: matrix[0][0] = list[0], matrix[1][0] = list[4], etc.
* Print the updated matrix.

In [4]:
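# Start from a 3 x 4 matrix of zeros.
a = np.zeros((3, 4))

# Round 12 random floats from [0, 1) to get a list of 0/1 values.
b = [round(np.random.random()) for _ in range(12)]

print(a)
print(b)

# Fill the matrix row by row: list index 4*i + j maps to a[i, j].
for i in range(3):
    for j in range(4):
        a[i, j] = b[4 * i + j]

print(a)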

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
[1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
[[1. 0. 0. 1.]
 [0. 0. 1. 1.]
 [0. 1. 0. 1.]]
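
The index mapping `4*i + j` is the same row-major order that `reshape` uses, so the nested loop can also be written as a single reshape. This is a sketch of that alternative, reusing the list printed in the run above. The seeded generator at the end is an assumption added for reproducibility and is not part of the original solution.

```python
import numpy as np

# The list printed in the run above.
b = [1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1]

# Row-major reshape: element 4*i + j of the list lands at m[i, j].
m = np.array(b, dtype=float).reshape(3, 4)
print(m)

# Optional: draw the 0/1 values directly as integers with a seeded generator.
rng = np.random.default_rng(0)
b_random = rng.integers(0, 2, size=12)  # upper bound 2 is exclusive
print(b_random.reshape(3, 4))
```

The reshaped array matches the updated matrix printed above, with `m[1, 0]` equal to `b[4]`.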
