# Read

A simple example of how **Variant** reads a file and how the result can be handled.

In [1]:
from os import getcwd
from os.path import dirname
from openvariant import Annotation, Variant

dataset_file = f'{dirname(getcwd())}/datasets/sample1/22f5b2f.wxs.maf.gz'
annotation_file = f'{dirname(getcwd())}/datasets/sample1/annotation_maf.yaml'

`Annotation` object generated from _annotation_ file. Parameters:

- `annotation_path` - Path of _annotation_ file.

`Variant` object to iterate through the parsed file. Parameters:

- `path` - Path of _input_ file.
- `annotation` - `Annotation` object used to parse the _input_ file.

One of the main functions of _Variant_ is `read`. It generates an iterator over the rows of the parsed file.

`read` function parameters:

- `where` - Filter expression.
- `group_key` - Key to group rows.


In this example, we get the first 10 lines of the input file, parsed according to an _annotation_ file.

In [2]:
annotation = Annotation(annotation_path=annotation_file)
result = Variant(path=dataset_file, annotation=annotation)

for n_line, line in enumerate(result.read()):
    print(f'Line {n_line}: {line}')
    if n_line == 9:
        break

Line 0: {'POSITION': '16963', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 1: {'POSITION': '17691', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 2: {'POSITION': '98933', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 3: {'POSITION': '139058', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 4: {'POSITION': '186112', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 5: {'POSITION': '187146', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 6: {'POSITION': '187153', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 7: {'POSITION': '187264', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 8: {'POSITION': '187323', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'PO

As the output shows, each line is a `dict` whose keys are the fields of the parsed result and whose values are the contents of the corresponding cells.
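
Because each row is a plain `dict` with string values, ordinary Python tools apply to it. The following is a minimal sketch, assuming a hypothetical `rows` list that stands in for the iterator returned by `read` (its values are copied from the output above) and a hypothetical `first_rows` helper. It uses the standard-library `itertools.islice` as an alternative to the `enumerate`/`break` pattern:

```python
from itertools import islice

# Hypothetical stand-in for the rows yielded by `read`, copied from the output above.
rows = [
    {'POSITION': '16963', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'},
    {'POSITION': '17691', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'},
    {'POSITION': '98933', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'},
]


def first_rows(row_iter, limit):
    """Return at most `limit` rows from any iterable of row dicts."""
    return list(islice(row_iter, limit))


head = first_rows(iter(rows), 2)

# The values are strings, so convert them explicitly before doing arithmetic.
positions = [int(row['POSITION']) for row in head]
print(positions)  # [16963, 17691]
```

With a real `Variant` object, the same helper could presumably be fed the iterator returned by `read` in place of `iter(rows)`.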

**Variant** has several attributes that we can explore:

In [3]:
print('Headers: ', result.header)
print('Input file: ', result.path)

Headers:  ['POSITION', 'DATASET', 'SAMPLE', 'STRAND_REF', 'PLATFORM']
Input file:  /home/dmartinez/openvariant/examples/datasets/sample1/22f5b2f.wxs.maf.gz


We can also inspect the _Annotation_ with which the input file was parsed.

+ _Annotation_ file path - `path`
+ Format - `format`
+ Annotations - `annotations`
+ Columns - `columns`
+ Delimiter - `delimiter`
+ Excludes - `excludes`
+ Patterns - `patterns`
+ Structure - `structure`

In [4]:
print(result.annotation.annotations)

{'PLATFORM': ('STATIC', 'WGS'), 'POSITION': ('INTERNAL', ['Position', 'Start', 'Start_Position', 'Pos', 'Chromosome_Start', 'POS'], <openvariant.annotation.builder.Builder object at 0x7fa8bc0b3b20>, nan), 'DATASET': ('FILENAME', <openvariant.annotation.builder.Builder object at 0x7fa8bc0b3940>, re.compile('(.*)')), 'SAMPLE': ('DIRNAME', <openvariant.annotation.builder.Builder object at 0x7fa8bc0b3520>, re.compile('(.*)')), 'STRAND': ('INTERNAL', ['Strand', 'Chromosome_Strand', ''], <openvariant.annotation.builder.Builder object at 0x7fa8bc0b3310>, nan), 'STRAND_REF': ('MAPPING', ['STRAND'], {'+': 'POS', '-': 'NEG'})}
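
Each entry of the printed dictionary is a tuple whose first element names the annotation type (`STATIC`, `INTERNAL`, `FILENAME`, `DIRNAME` or `MAPPING`). As a rough illustration, the sketch below uses a hand-copied subset of that output, with the `Builder` objects and compiled patterns omitted, to list the types and to show how the `STRAND_REF` lookup table turns `+` into `POS`. The `annotations_subset` name is hypothetical, and this is not a description of how the library applies mappings internally:

```python
# Hand-copied subset of the dictionary printed above; not produced by the library here.
annotations_subset = {
    'PLATFORM': ('STATIC', 'WGS'),
    'STRAND_REF': ('MAPPING', ['STRAND'], {'+': 'POS', '-': 'NEG'}),
}

# The first element of each tuple is the annotation type.
types_by_field = {field: spec[0] for field, spec in annotations_subset.items()}
print(types_by_field)  # {'PLATFORM': 'STATIC', 'STRAND_REF': 'MAPPING'}

# A MAPPING entry carries the source fields and a lookup table.
_, source_fields, lookup = annotations_subset['STRAND_REF']
print(source_fields, lookup['+'])  # ['STRAND'] POS
```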


One of the parameters of the `read` function is `where`, which applies a conditional filter to the rows. The supported operations are:

+ `==` - Equal.
+ `!=` - Not equal.
+ `<=` - Less than or equal to.
+ `<` - Less than.
+ `>=` - Greater than or equal to.
+ `>` - Greater than.

The following example uses this parameter:

In [5]:
annotation = Annotation(annotation_path=annotation_file)
result = Variant(path=dataset_file, annotation=annotation)

for n_line, line in enumerate(result.read(where="POSITION == 186112")):
    print(f'{line}')

{'POSITION': '186112', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
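
Conceptually, the filter keeps only the rows whose field satisfies the comparison. The sketch below reproduces that idea in plain Python over a hypothetical `rows` list with a hypothetical `filter_rows` helper. It is not the library's implementation, and comparing the values as integers is an assumption made for illustration, since the parsed values are strings:

```python
import operator

# The comparison operators listed above, mapped to standard-library functions.
OPERATORS = {
    '==': operator.eq, '!=': operator.ne,
    '<=': operator.le, '<': operator.lt,
    '>=': operator.ge, '>': operator.gt,
}


def filter_rows(rows, field, op, value):
    """Yield the rows whose integer-converted `field` satisfies `op value`."""
    compare = OPERATORS[op]
    for row in rows:
        if compare(int(row[field]), value):
            yield row


# Hypothetical rows copied from the earlier output (other fields omitted).
rows = [{'POSITION': '139058'}, {'POSITION': '186112'}, {'POSITION': '187146'}]

print(list(filter_rows(rows, 'POSITION', '==', 186112)))
# [{'POSITION': '186112'}]
print(len(list(filter_rows(rows, 'POSITION', '>=', 186112))))
# 2
```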


`read` also accepts a `group_key` parameter, which groups rows by the value of the given key.
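
Grouping by a key means collecting together all rows that share the same value for that field. The following is a plain-Python sketch of the idea, using a hypothetical `rows` list and a hypothetical `group_rows` helper. The exact shape of what `read` yields when `group_key` is set is not shown in this example, so treat it only as an illustration of the concept:

```python
from collections import defaultdict


def group_rows(rows, key):
    """Collect rows into lists keyed by the value of `key`, in first-seen order."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


# Hypothetical rows; the values are taken from outputs shown in this example.
rows = [
    {'POSITION': '65872', 'DATASET': '5a3a743'},
    {'POSITION': '16963', 'DATASET': '22f5b2f'},
    {'POSITION': '131628', 'DATASET': '5a3a743'},
]

grouped = group_rows(rows, 'DATASET')
for dataset, members in grouped.items():
    print(dataset, [member['POSITION'] for member in members])
# 5a3a743 ['65872', '131628']
# 22f5b2f ['16963']
```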

**Variant** can be combined with `findfiles`, as the following example shows. It prints the first 3 lines of each input file.

In [6]:
from os.path import basename
from openvariant import findfiles

dataset_folder = f'{dirname(getcwd())}/datasets/sample1'

for file_path, annotation in findfiles(base_path=dataset_folder):
    result = Variant(path=file_path, annotation=annotation)

    print('File: ', basename(file_path), '\n')
    for n_line, line in enumerate(result.read()):
        print(f'Line {n_line}: {line}')
        if n_line == 2:
            print("\n")
            break

File:  5a3a743.wxs.maf.gz 

Line 0: {'POSITION': '65872', 'DATASET': '5a3a743', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 1: {'POSITION': '131628', 'DATASET': '5a3a743', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 2: {'POSITION': '183697', 'DATASET': '5a3a743', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}


File:  22f5b2f.wxs.maf.gz 

Line 0: {'POSITION': '16963', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 1: {'POSITION': '17691', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 2: {'POSITION': '98933', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}


File:  345c90e.raw_somatic_mutation.vcf.gz 

Line 0: {'POSITION': '10267', 'DATASET': '345c90e', 'PLATFORM': 'WGS', 'INFO': 'WGS:T_C'}
Line 1: {'POSITION': '10273', 'DATASET': '345c90e', 'PLATFORM': 'WGS', 'INFO': 'WGS:T_C'}
Line 2: {'POSITION': '10321', 'DATA