# Reference

Reference guide for tasks that I am going to forget.

In [23]:
%%html
<style>
   table {float: left}
</style>

## Read file line by line

Use the [open](https://docs.python.org/3/library/functions.html#open) function with the file path and the mode as arguments. The mode defaults to `r` (reading text) when omitted; the common modes are:

| Mode |    Description   |
| ---- | ---------------- |
|  r   |   reading        |
|  w   |   writing        |
|  a   |   appending      |
|  rb  |   reading binary |
|  wb  |   writing binary |

In [28]:
num_line = 5

with open('../data/iris.csv') as f:
    # enumerate() yields a 0-based index alongside each line
    for index, line in enumerate(f):
        # Stop once num_line - 1 lines have been printed
        if index == num_line - 1:
            break
        print("Line {}: {}".format(index, line.strip()))

Line 0: Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
Line 1: 5.1,3.5,1.4,0.2,setosa
Line 2: 4.9,3,1.4,0.2,setosa
Line 3: 4.7,3.2,1.3,0.2,setosa

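The modes in the table above can be tried out with a minimal sketch like the one below. The scratch file `modes_demo.txt` is hypothetical and is created in a temporary directory only for this illustration.

```python
import os
import tempfile

# Hypothetical scratch file, created in a temporary directory for this demo only
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

with open(path, "w") as f:   # 'w' creates the file, or truncates an existing one
    f.write("first line\n")

with open(path, "a") as f:   # 'a' adds to the end without truncating
    f.write("second line\n")

with open(path, "r") as f:   # 'r' reads text, so each line is a str
    text_lines = [line.strip() for line in f]

with open(path, "rb") as f:  # 'rb' reads the raw bytes instead
    raw = f.read()

print(text_lines)
print(type(raw).__name__)
```

Note that `w` silently overwrites an existing file, so `a` is the safer choice when the previous contents should be kept.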

## Read gz file

[Read lines from compressed text files](https://stackoverflow.com/questions/10566558/python-read-lines-from-compressed-text-files) using the `gzip` module from the standard library. Pass the mode `rt` to read text; the default mode, `rb`, returns bytes.

In [10]:
import gzip

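num_line = 5
line_count = 1
# 'rt' opens the compressed file in text mode, so each line is a str
with gzip.open('../data/Pfeiffer.vcf.gz', 'rt') as f:
    for line in f:
        # Stop once num_line - 1 lines have been printed
        if line_count == num_line:
            break
        to_print = 'Reading line number {}: {}'
        # line keeps its trailing newline, so print() leaves a blank line after it
        print(to_print.format(line_count, line))
        line_count += 1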

Reading line number 1: ##fileformat=VCFv4.1

Reading line number 2: ##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">

Reading line number 3: ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">

Reading line number 4: ##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">


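As a self-contained variation on the cell above, the sketch below first writes a small gzip file and then reads it back. The file `demo.txt.gz` and its contents are made up for the illustration, and `itertools.islice` is swapped in for the manual line counter; it is one possible way to take the first few lines, not the only one.

```python
import gzip
import itertools
import os
import tempfile

# Hypothetical compressed file, written to a temporary directory for this demo
path = os.path.join(tempfile.mkdtemp(), "demo.txt.gz")

with gzip.open(path, "wt") as f:  # 'wt' writes text and compresses it
    for i in range(10):
        f.write("record {}\n".format(i))

with gzip.open(path, "rt") as f:  # 'rt' decompresses and decodes to str
    # islice stops after 4 lines without reading the rest of the file
    first_lines = [line.rstrip("\n") for line in itertools.islice(f, 4)]

with gzip.open(path, "rb") as f:  # 'rb' yields bytes instead of str
    first_raw = f.readline()

print(first_lines)
print(type(first_raw).__name__)
```

Stripping the trailing newline with `rstrip("\n")` avoids the blank lines that appear when each line is passed to `print()` unchanged.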

## Regular expressions

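Use the `re` module for regular expressions. In the cell below, `re.search('^##', line)` matches lines that start with `##` (the VCF meta-information lines), which are skipped.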

In [13]:
import gzip
import re

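num_line = 5
line_count = 1
with gzip.open('../data/Pfeiffer.vcf.gz', 'rt') as f:
    for line in f:
        # Skip meta-information lines, which start with '##'
        if re.search('^##', line):
            continue
        # Stop once num_line - 1 of the remaining lines have been printed
        if line_count == num_line:
            break
        print(line)
        line_count += 1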

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	manuel

1	866511	rs60722469	C	CCCCT	258.62	PASS	AC=2;AF=1.00;AN=2;DB;DP=11;FS=0.000;HRun=0;HaplotypeScore=41.3338;MQ0=0;MQ=61.94;QD=23.51;set=variant	GT:AD:DP:GQ:PL	1/1:6,5:11:14.79:300,15,0

1	879317	rs7523549	C	T	150.77	PASS	AC=1;AF=0.50;AN=2;BaseQRankSum=1.455;DB;DP=21;Dels=0.00;FS=1.984;HRun=0;HaplotypeScore=0.0000;MQ0=0;MQ=60.00;MQRankSum=-0.037;QD=7.18;ReadPosRankSum=0.112;set=variant2	GT:AD:DP:GQ:PL	0/1:14,7:21:99:181,0,367

1	879482	.	G	C	484.52	PASS	AC=1;AF=0.50;AN=2;BaseQRankSum=1.934;DP=48;Dels=0.00;FS=4.452;HRun=0;HaplotypeScore=0.5784;MQ0=0;MQ=59.13;MQRankSum=-0.240;QD=10.09;ReadPosRankSum=1.537;set=variant2	GT:AD:DP:GQ:PL	0/1:28,20:48:99:515,0,794

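The filtering step above can be sketched without the data file, using a few sample lines shortened from the outputs in this notebook. The sketch compares a compiled pattern with `str.startswith`, which gives the same result for a fixed prefix, and adds a capture group to pull the `DP` value out of an INFO string. The variable names are illustrative only.

```python
import re

# Sample lines shortened from the VCF outputs shown in this notebook
lines = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID",
    "1\t866511\trs60722469",
]

# Compile once when the same pattern is applied to many lines
meta_pattern = re.compile(r"^##")
kept_regex = [line for line in lines if not meta_pattern.search(line)]

# For a fixed prefix, str.startswith gives the same result without a regex
kept_prefix = [line for line in lines if not line.startswith("##")]

# A capture group extracts part of a match, here the DP value of an INFO field
info = "AC=2;AF=1.00;AN=2;DB;DP=11;FS=0.000"
match = re.search(r"(?:^|;)DP=(\d+)", info)
depth = int(match.group(1)) if match else None

print(kept_regex)
print(depth)
```

`re.search` returns `None` when nothing matches, which is why the result is checked before calling `group`.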


## String splitting

Use `str.split(sep, maxsplit)`, where `sep` is the delimiter and `maxsplit` caps the number of splits to perform (the default, `-1`, splits at every occurrence).

In [21]:
import gzip
import re

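num_line = 5
line_count = 1
with gzip.open('../data/Pfeiffer.vcf.gz', 'rt') as f:
    for line in f:
        # Skip every header line, including the '#CHROM' column header
        if re.search('^#', line):
            continue
        # Stop once num_line - 1 records have been printed
        if line_count == num_line:
            break
        # Split the tab-delimited record into its fields
        x = line.split("\t")
        # Show the first five columns: CHROM, POS, ID, REF, ALT
        print(x[:5])
        line_count += 1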

['1', '866511', 'rs60722469', 'C', 'CCCCT']
['1', '879317', 'rs7523549', 'C', 'T']
['1', '879482', '.', 'G', 'C']
['1', '880390', 'rs3748593', 'C', 'A']

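The effect of `maxsplit` is easier to see on a single line. The sketch below uses a record shortened from the output above; the variable names are illustrative only.

```python
# A record shortened from the VCF output above, with its trailing newline
line = "1\t866511\trs60722469\tC\tCCCCT\t258.62\tPASS\n"

# Strip the newline first, otherwise it stays attached to the last field
record = line.rstrip("\n")

# Default: split at every tab
all_fields = record.split("\t")

# maxsplit=2 performs at most two splits, leaving the remainder in one piece
head = record.split("\t", 2)

# With no arguments, split() breaks on any run of whitespace
words = "a  b\tc\n".split()

print(all_fields)
print(head)
print(words)
```

With `maxsplit=2` the result has three elements: the first two fields and the unsplit rest of the line.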