# Gen559 argparse practice notebook
### 2020.12.02

### Practice problem 1 

Create a command line script, *process-vcf.py*, that contains an argument parser. Add the following features to your argument parser:

1. Specify the .vcf file to operate on. **This argument should be required.**      

   
   
2. Specify the chromosome to extract information about. **This should be required but have a default value of 'chr1'.**  


3. Specify which one of the variant types listed below should be extracted. **The user should only be allowed to specify one type, i.e. these should be mutually exclusive options:**
   * Single nucleotide change from Reference
   * Insertion relative to Reference
   * Deletion relative to Reference     
   
   
4. Specify the stem of the name of the output file that will be produced, e.g. "my_variants" for the output file *my_variants.vcf*.

Run your script using the Terminal (OS X, Linux, UNIX) or Command Prompt (Windows) on *gm12878.hg38.vcf* and demonstrate its functionality.

You may use the cell below to write out your code if you do not feel comfortable using an IDE, but be advised that notebooks cannot properly execute an argparse script. *Hint: if needed, you can write all of the necessary logic to build your parser as functions in the notebook to test that they work.*
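One way to act on the hint, sketched here under the assumption that the parser is built inside a helper function (the name `build_parser` is hypothetical and the parser is deliberately reduced to two arguments): `parse_args()` accepts an explicit list of strings, so the parser can be exercised in a notebook cell without a command line.

```python
import argparse


def build_parser():
    """Hypothetical helper: builds a small parser so it can be tested in a cell."""
    parser = argparse.ArgumentParser(description='Toy parser for notebook testing')
    parser.add_argument('-f', '--file', required=True,
                        help='The .vcf file to process')
    parser.add_argument('-c', '--chrom', default='chr1',
                        help='The chromosome to extract')
    return parser


# Pass the arguments as a list instead of reading them from the command line.
args = build_parser().parse_args(['-f', 'gm12878.hg38.vcf', '-c', 'chr17'])
print(args.file, args.chrom)

# Omitting -c falls back to the default value.
args_default = build_parser().parse_args(['-f', 'gm12878.hg38.vcf'])
print(args_default.chrom)
```

Calling `parse_args()` with no arguments reads the options of the running process instead, which is why the finished script is best run from the Terminal or Command Prompt.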


### Solution

* The *process-vcf.py* code (shown below) and the outputs from running it on *gm12878.hg38.vcf* with `-c chr17` in `-s`, `-i`, and `-d` modes are available on [GitHub](https://github.com/beliveau-lab/gen559/tree/main/notebooks) and [Canvas](https://canvas.uw.edu/courses/1430304/files/folder/jupyter_notebooks).


**process-vcf.py:**  

```python
import argparse


def extract_variation(file, chrom):
    '''Takes in name of vcf file and returns a list of variants of the
    user-selected type from a specified chromosome'''

    # Start with an empty list so a value is still returned if none of
    # -s, -i, or -d was given.
    variants = []

    # Open specified file.
    with open(file, 'r') as f:

        # Extract SNVs if args.snv is True.
        if args.snv is True:

            # Create and populate list of extracted variants. Skip header
            # lines. Only consider line if length of 'REF' and 'ALT' both
            # = 1, i.e. the line describes a SNV.
            variants = [line.strip() for line in f
                        if not line.startswith('#')
                        and line.strip().split("\t")[0] == chrom
                        and len(line.strip().split("\t")[3]) == 1
                        and len(line.strip().split("\t")[4]) == 1]

        # Extract insertions if args.insertion is True.
        if args.insertion is True:

            # Create and populate list of extracted variants. Skip header
            # lines. Only consider line if length of 'REF' = 1 and length
            # of 'ALT' > 1, i.e. the line describes an insertion.
            variants = [line.strip() for line in f
                        if not line.startswith('#')
                        and line.strip().split("\t")[0] == chrom
                        and len(line.strip().split("\t")[3]) == 1
                        and len(line.strip().split("\t")[4]) > 1]

        # Extract deletions if args.deletion is True.
        if args.deletion is True:

            # Create and populate list of extracted variants. Skip header
            # lines. Only consider line if length of 'REF' > 1 and length
            # of 'ALT' = 1, i.e. the line describes a deletion.
            variants = [line.strip() for line in f
                        if not line.startswith('#')
                        and line.strip().split("\t")[0] == chrom
                        and len(line.strip().split("\t")[3]) > 1
                        and len(line.strip().split("\t")[4]) == 1]

    # Return list of extracted variants.
    return variants


# Create argument parser.
userInput = argparse.ArgumentParser(description= \
                                    'Takes an input vcf file and extracts ' \
                                    'chromosome-specific variant ' \
                                    'information to return in a new .vcf file')

## Add arguments to parser.

# Add argument to import file.
userInput.add_argument('-f', '--file', action='store', required=True, \
                        help='The .vcf file to process (required)')

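# Add argument to specify chromosome. Note that because required=True,
# argparse exits with an error when -c is omitted, so the default value
# is never actually applied.
userInput.add_argument('-c', '--chrom', action='store', required=True, \
                        default='chr1', help='The chromosome from which ' \
                        'to extract variant information (required), ' \
                        'default="chr1"')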

# Create mutually exclusive argument group.
mutEx = userInput.add_mutually_exclusive_group()

# Add SNV argument to mutEx group.
mutEx.add_argument('-s', '--snv', action='store_true', default=False,
                    help='Return SNV variants, default=False')

# Add insertion argument to mutEx group.
mutEx.add_argument('-i', '--insertion', action='store_true', default=False, \
                    help='Return insertion variants, default=False')

# Add deletion argument to mutEx group.
mutEx.add_argument('-d', '--deletion', action='store_true', default=False, \
                    help='Return deletion variants, default=False')

# Add argument to specify output filename stem. With required=True,
# argparse errors out if -o is omitted, so the default is never applied.
userInput.add_argument('-o', '--output', action='store', required=True, \
                        default='my_variants', help='The stem of the ' \
                        'output file name (required), ' \
                        'default="my_variants"')

# Import user-specified command line values.
args = userInput.parse_args()

# Call extract_variation function with user input.
outputList = extract_variation(args.file, args.chrom)

# Create and write output file.
with open(args.output + '.vcf', 'w') as f:
    f.write('\n'.join(outputList))
```
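The three filters in `extract_variation` differ only in the lengths of the REF and ALT fields they accept. As a minimal sketch, assuming a hypothetical helper named `classify_variant` that is not part of the original script, that rule can be isolated and checked on a few made-up records:

```python
def classify_variant(ref, alt):
    """Hypothetical helper: label a variant from its REF and ALT strings."""
    if len(ref) == 1 and len(alt) == 1:
        return 'snv'
    if len(ref) == 1 and len(alt) > 1:
        return 'insertion'
    if len(ref) > 1 and len(alt) == 1:
        return 'deletion'
    # Anything else (e.g. REF and ALT both longer than one base) is not
    # covered by the three categories above.
    return 'other'


# Made-up REF/ALT pairs for illustration only.
examples = [('A', 'G'), ('A', 'ATT'), ('CAG', 'C'), ('AT', 'GC')]
labels = [classify_variant(ref, alt) for ref, alt in examples]
print(labels)
```

Because the rule looks only at string length, a multi-allelic ALT field such as `G,T` is longer than one character and would be counted as an insertion; handling that case would need the ALT field split on commas first.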

### Practice problem 2

Use *process-vcf.py* on *gm12878.hg38.vcf* to extract both insertion and deletion variants from chr17. In the cell below, write code to parse the output files and determine the number and average length of both the insertion and deletion variants. Print your results.


```python
# Open insertions file and extract insertion lengths.
with open("chr17_ins.vcf", "r") as f:
    insertion_lengths = [len(line.strip().split("\t")[4]) for line in f]

# Open deletions file and extract deletion lengths.
with open("chr17_del.vcf", "r") as f:
    deletion_lengths = [len(line.strip().split("\t")[3]) for line in f]


# Print requested info about variants.
print('There are %d insertion variants with an average length of %0.2f bp' \
      % (len(insertion_lengths), sum(insertion_lengths)/len(insertion_lengths)))

print('There are %d deletion variants with an average length of %0.2f bp' \
      % (len(deletion_lengths), sum(deletion_lengths)/len(deletion_lengths)))
```

```
There are 8830 insertion variants with an average length of 4.21 bp
There are 8531 deletion variants with an average length of 4.06 bp
```
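If one of the output files turned out to be empty, dividing by the number of variants would raise a `ZeroDivisionError`. A minimal sketch of a guard, assuming a hypothetical helper named `summarize_lengths` and made-up lengths rather than the real data:

```python
def summarize_lengths(lengths):
    """Hypothetical helper: return (count, mean length) for a list of lengths."""
    count = len(lengths)
    # Avoid dividing by zero when no variants were found.
    mean = sum(lengths) / count if count else 0.0
    return count, mean


# Made-up allele lengths for illustration only.
count, mean = summarize_lengths([2, 3, 7])
print('There are %d variants with an average length of %0.2f bp' % (count, mean))

# An empty list is reported as zero variants instead of raising an error.
empty_count, empty_mean = summarize_lengths([])
```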
