# _Group by_

A simple example that shows how the **group by** task works. This task is also available from the command line.


In [1]:
from os.path import dirname
from os import getcwd
from openvariant import group_by

dataset_folder = f'{dirname(getcwd())}/datasets/sample2'
annotation_path = f'{dirname(getcwd())}/datasets/sample2/annotation.yaml'

The `group_by` task groups the parsed rows by the value of an output field. It accepts the following parameters:

- `base_path` - Input path to explore and parse.
- `annotation_path` - Path to the annotation file.
- `script` - Shell command to execute on the parsed result of each group.
- `key_by` - Output field used as the key to group rows.
- `where` - Filter expression.
- `cores` - Maximum processes to run in parallel.
- `quite` - Do not show progress while the parsing is running.
- `header` - Show header on the result.
- `skip_files` - Skip unreadable files and directories.

The following example shows a general case of the `group by` task:

In [2]:
for group, values, script_used in group_by(base_path=dataset_folder, annotation_path=annotation_path, script=None, key_by="CANCER", quite=True):
    print(f'Group: {group}')
    for row in values:
        print(row)
    print("\n")

Group: MESO
ACAP3	1p36.33	MESO
ACTRT2	1p36.32	MESO
AGRN	1p36.33	MESO
ANKRD65	1p36.33	MESO
ATAD3A	1p36.33	MESO
ATAD3B	1p36.33	MESO
ATAD3C	1p36.33	MESO
AURKAIP1	1p36.33	MESO
B3GALT6	1p36.33	MESO


Group: ACC
ACAP3	1p36.33	ACC
ACTRT2	1p36.32	ACC
AGRN	1p36.33	ACC
ANKRD65	1p36.33	ACC
ATAD3A	1p36.33	ACC
ATAD3B	1p36.33	ACC
ATAD3C	1p36.33	ACC
AURKAIP1	1p36.33	ACC
B3GALT6	1p36.33	ACC
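
To make the grouping step concrete, here is a minimal pure-Python sketch of what grouping by a key field means. It does not use OpenVariant and is not its implementation: the `group_rows` helper, the in-memory rows, and the `CYTOBAND` column name are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical header and rows; the middle column name (CYTOBAND) is assumed.
header = ("SYMBOL", "CYTOBAND", "CANCER")
rows = [
    ("ACAP3", "1p36.33", "MESO"),
    ("ACTRT2", "1p36.32", "MESO"),
    ("ACAP3", "1p36.33", "ACC"),
    ("ACTRT2", "1p36.32", "ACC"),
]


def group_rows(rows, header, key_by):
    """Group rows by the value found in the `key_by` column."""
    key_index = header.index(key_by)
    groups = defaultdict(list)
    for row in rows:
        # Each row is stored as a tab-separated line, as in the output above.
        groups[row[key_index]].append("\t".join(row))
    return dict(groups)


grouped = group_rows(rows, header, "CANCER")
for group, values in grouped.items():
    print(f"Group: {group}")
    for row in values:
        print(row)
    print()
```

Each distinct value of the key field becomes one group, and every row is assigned to exactly one group.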




One of the parameters of the `group by` task is `where`, which applies a conditional filter to the rows. The supported operators are:

+ `==` - Equal to.
+ `!=` - Not equal to.
+ `<=` - Less than or equal to.
+ `<` - Less than.
+ `>=` - Greater than or equal to.
+ `>` - Greater than.

The following example uses this parameter:

In [3]:
for group, values, script_used in group_by(base_path=dataset_folder, annotation_path=annotation_path, script=None, where="SYMBOL == 'ATAD3C'", key_by="CANCER", quite=True):
    print(f'Group: {group}')
    for row in values:
        print(row)
    print("\n")

Group: MESO
ATAD3C	1p36.33	MESO


Group: ACC
ATAD3C	1p36.33	ACC
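
The semantics of a filter such as `SYMBOL == 'ATAD3C'` can be sketched in plain Python with the standard `operator` module. This is only an illustration of a `FIELD OPERATOR VALUE` comparison, not how OpenVariant parses `where`; the `parse_where` and `matches` helpers, the dictionary rows, and the `POSITION` field are hypothetical.

```python
import operator
import re

# The six comparison operators, mapped to Python's comparison functions.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<=": operator.le,
    "<": operator.lt,
    ">=": operator.ge,
    ">": operator.gt,
}

# Two-character operators are listed first so "<=" is not read as "<".
WHERE_PATTERN = re.compile(r"^\s*(\w+)\s*(==|!=|<=|>=|<|>)\s*(.+?)\s*$")


def parse_where(expression):
    """Split 'FIELD OP VALUE' into (field, comparison function, value)."""
    match = WHERE_PATTERN.match(expression)
    if match is None:
        raise ValueError(f"Unsupported where expression: {expression!r}")
    field, op, raw = match.groups()
    if len(raw) >= 2 and raw[0] == raw[-1] and raw[0] in "'\"":
        value = raw[1:-1]  # quoted values are compared as strings
    else:
        try:
            value = float(raw)  # unquoted values are tried as numbers
        except ValueError:
            value = raw
    return field, OPERATORS[op], value


def matches(row, expression):
    """Return True if the row (a dict of field -> text) passes the filter."""
    field, compare, value = parse_where(expression)
    cell = float(row[field]) if isinstance(value, float) else row[field]
    return compare(cell, value)


row = {"SYMBOL": "ATAD3C", "CANCER": "MESO", "POSITION": "120"}
print(matches(row, "SYMBOL == 'ATAD3C'"))
print(matches(row, "POSITION >= 100"))
```

Rows for which the comparison is false are dropped before grouping, which is why each group above contains a single row.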




The `group by` task also has a `script` parameter, which lets the user run a shell command on the parsed result of each group. The following example counts the characters in each group of the parsed output:

In [4]:
for group, values, script_used in group_by(base_path=dataset_folder, annotation_path=annotation_path, script="wc -m", key_by="CANCER", quite=True):
    print(f'Group: {group}')
    for row in values:
        print(row)
    print("\n")

Group: MESO
181


Group: ACC
172
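
The idea behind `script` can be sketched with the standard `subprocess` module: the rows of one group are written to the standard input of a shell command, and the command's output replaces the rows. This is a sketch under stated assumptions, not OpenVariant's implementation: the `run_script_on_group` helper and the sample groups are hypothetical, rows are assumed to be newline-terminated, and a POSIX shell with `wc` is assumed to be available.

```python
import subprocess


def run_script_on_group(rows, script):
    """Pipe the rows of one group to a shell command and return its output."""
    # Assumption: each row is sent as one newline-terminated line.
    text = "".join(f"{row}\n" for row in rows)
    result = subprocess.run(
        script,
        shell=True,           # the script is a shell command line, e.g. "wc -m"
        input=text,           # rows are fed through standard input
        capture_output=True,
        text=True,
        check=True,           # raise if the command exits with an error
    )
    return result.stdout.strip()


# Hypothetical groups, using rows in the same format as the output above.
groups = {
    "MESO": ["ATAD3C\t1p36.33\tMESO"],
    "ACC": ["ATAD3C\t1p36.33\tACC"],
}
for group, rows in groups.items():
    print(f"Group: {group}")
    print(run_script_on_group(rows, "wc -m"))
```

Because `wc -m` counts every character, tabs and line breaks included, groups with the same number of rows can still report different totals when their key values differ in length.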


