# Read

A simple example of how **Variant** reads a file and how the result can be handled.

In [1]:
from os import getcwd
from os.path import dirname
from openvariant import Annotation, Variant

dataset_file = f'{dirname(getcwd())}/datasets/sample1/22f5b2f.wxs.maf.gz'
annotation_file = f'{dirname(getcwd())}/datasets/sample1/annotation_maf.yaml'

We can see the first 10 lines of a file parsed through an _annotation_ file.

In [2]:
annotation = Annotation(annotation_file)
result = Variant(dataset_file, annotation)

n_lines = 10  # number of parsed lines to print
for n_line, line in enumerate(result.read()):
    if n_line == n_lines:
        break
    print(f'Line {n_line}: {line}')

Line 0: {'POSITION': '16963', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 1: {'POSITION': '17691', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 2: {'POSITION': '98933', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 3: {'POSITION': '139058', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 4: {'POSITION': '186112', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 5: {'POSITION': '187146', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 6: {'POSITION': '187153', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 7: {'POSITION': '187264', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 8: {'POSITION': '187323', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'PO

As the output shows, each line is a `dict` in which each key is a field of the parsed result and each value is the content of that cell.
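
Because every row is a plain `dict` with string values, ordinary Python tools are enough to slice and convert the parsed data. The sketch below is only an illustration: `read_rows` is a hypothetical stand-in for `result.read()` that yields rows copied from the output above, and `first_rows` is a helper name introduced here, not part of the library.

```python
from itertools import islice


def read_rows():
    """Hypothetical stand-in for result.read(): yields rows copied from the output above."""
    yield {'POSITION': '16963', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
    yield {'POSITION': '17691', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
    yield {'POSITION': '98933', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}


def first_rows(rows, n):
    """Return at most the first n rows of any iterable of rows (illustrative helper)."""
    return list(islice(rows, n))


head = first_rows(read_rows(), 2)

# The values are strings in the parsed output, so convert them before doing arithmetic.
positions = [int(row['POSITION']) for row in head]
print(positions)
```

Using `islice` avoids the manual counter and `break`, and it stops consuming the generator as soon as enough rows have been collected.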

**Variant** has several attributes that we can explore:

In [3]:
print('Headers: ', result.header)
print('Input file: ', result.path)

Headers:  ['POSITION', 'DATASET', 'SAMPLE', 'STRAND_REF', 'PLATFORM']
Input file:  /home/dmartinez/openvariant/examples/datasets/sample1/22f5b2f.wxs.maf.gz
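
One possible use of the header is to fix the column order when the parsed rows are written out as a table. The following is a minimal sketch using only the standard library; `header` and `rows` are hard-coded copies of the values printed above rather than live objects from **Variant**, and the tab-separated layout is an assumption chosen for illustration.

```python
import csv
import io

# Copied from the 'Headers' output above; with the library this would be result.header.
header = ['POSITION', 'DATASET', 'SAMPLE', 'STRAND_REF', 'PLATFORM']

# Two rows copied from the earlier output, standing in for result.read().
rows = [
    {'POSITION': '16963', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'},
    {'POSITION': '17691', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'},
]

# Write to an in-memory buffer; a real file opened with newline='' works the same way.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=header, delimiter='\t', lineterminator='\n')
writer.writeheader()
writer.writerows(rows)

table = buffer.getvalue()
print(table)
```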


We can also inspect the _Annotation_ with which the input file was parsed, through the following attributes:

+ _Annotation_ file path - `path`
+ Format - `format`
+ Annotations - `annotations`
+ Columns - `columns`
+ Delimiter - `delimiter`
+ Excludes - `excludes`
+ Patterns - `patterns`
+ Structure - `structure`

In [4]:
print(result.annotation.annotations)

{'PLATFORM': ('STATIC', 'WGS'), 'POSITION': ('INTERNAL', ['Position', 'Start', 'Start_Position', 'Pos', 'Chromosome_Start', 'POS'], <openvariant.annotation.builder.Builder object at 0x7f7eac15cca0>, nan), 'DATASET': ('FILENAME', <openvariant.annotation.builder.Builder object at 0x7f7eac15cdc0>, re.compile('(.*)')), 'SAMPLE': ('DIRNAME', <openvariant.annotation.builder.Builder object at 0x7f7eac15cc10>, re.compile('(.*)')), 'STRAND': ('INTERNAL', ['Strand', 'Chromosome_Strand', ''], <openvariant.annotation.builder.Builder object at 0x7f7eac15c280>, nan), 'STRAND_REF': ('MAPPING', ['STRAND'], {'+': 'POS', '-': 'NEG'})}
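
Judging from the printed dictionary, each field maps to a tuple whose first element looks like the annotation type (`STATIC`, `INTERNAL`, `FILENAME`, `DIRNAME`, `MAPPING`). Assuming that reading is right, the fields can be grouped by type as sketched below. The `annotations` dictionary here is a simplified hand copy of the output: the `Builder` objects, compiled patterns and `nan` values are replaced by `None`, since only the first tuple element is used.

```python
from collections import defaultdict

# Simplified copy of the printed dictionary (placeholders are None, see above).
annotations = {
    'PLATFORM': ('STATIC', 'WGS'),
    'POSITION': ('INTERNAL', ['Position', 'Start', 'Start_Position', 'Pos', 'Chromosome_Start', 'POS'], None, None),
    'DATASET': ('FILENAME', None, None),
    'SAMPLE': ('DIRNAME', None, None),
    'STRAND': ('INTERNAL', ['Strand', 'Chromosome_Strand', ''], None, None),
    'STRAND_REF': ('MAPPING', ['STRAND'], {'+': 'POS', '-': 'NEG'}),
}

# Group field names by the first element of their tuple (assumed to be the type).
fields_by_type = defaultdict(list)
for field, spec in annotations.items():
    fields_by_type[spec[0]].append(field)

for kind, fields in fields_by_type.items():
    print(f'{kind}: {fields}')
```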


**Variant** can be combined with `find_files`, as the following example shows. It prints the first 3 lines of each input file.

In [5]:
from os.path import basename
from openvariant import find_files

dataset_folder = f'{dirname(getcwd())}/datasets/sample1'

for file_path, annotation in find_files(dataset_folder):
    result = Variant(file_path, annotation)

    print('File: ', basename(file_path), '\n')
    for n_line, line in enumerate(result.read()):
        print(f'Line {n_line}: {line}')
        if n_line == 2:
            print("\n")
            break

File:  5a3a743.wxs.maf.gz 

Line 0: {'POSITION': '65872', 'DATASET': '5a3a743', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 1: {'POSITION': '131628', 'DATASET': '5a3a743', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 2: {'POSITION': '183697', 'DATASET': '5a3a743', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}


File:  22f5b2f.wxs.maf.gz 

Line 0: {'POSITION': '16963', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 1: {'POSITION': '17691', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 2: {'POSITION': '98933', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}


File:  345c90e.raw_somatic_mutation.vcf.gz 

Line 0: {'POSITION': '10267', 'DATASET': '.raw_somatic_mutation.vcf.gz', 'PLATFORM': 'WGS', 'INFO': 'WGS:T_C'}
Line 1: {'POSITION': '10273', 'DATASET': '.raw_somatic_mutation.vcf.gz', 'PLATFORM': 'WGS', 'INFO': 'WGS