# Getting percentile groups from a project

This notebook shows how we split the cells of a project into subgroups, which are later used to create the percentiles. The subgroups are produced by an algorithm that depends on the characteristics and conditions of each project.

A second objective of this notebook is to generate a table with one row per project and one column per combination of characteristics, so that we can see how many subgroups each project yields.

An important point about subgroup generation is that, although every combination of characteristics is possible, we only use a subset of these combinations. Since we want the subgroups to be as specific as possible, we add the characteristics one at a time, in a fixed order, instead of computing every possible combination.



Here we can see an example of the table structure:

| Project_ID | X0 | X1 | X2 | X0-X1 | X0-X2 | X1-X2 | X0-X1-X2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 000 | 1 | 0 | 0 | 2 | 0 | 0 | 4 |
| 001 | 0 | 5 | 0 | 0 | 0 | 0 | 0 |
| 002 | 1 | 0 | 0 | 0 | 10 | 0 | 0 |

In this example, project *000* has all the characteristics, so the number of subgroups is calculated for the combinations *X0*, *X0-X1* and *X0-X1-X2*. Project *001* only has *X1*, and project *002* has *X0* and *X2*, so its number of subgroups is calculated for the combinations *X0* and *X0-X2*.
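The chain of combinations that gets filled for each project can be sketched as follows. This is only an illustration: `visited_combinations` is a hypothetical helper that is not part of the notebook, and it assumes that characteristics are tried in list order and that labels are joined with `-`, as in the table header.

```python
def visited_combinations(characteristics, project_characteristics):
    """Return the combination labels that get a subgroup count for one project.

    Hypothetical helper: each characteristic present in the project is
    appended to the ones already used, so only cumulative combinations appear.
    """
    used = []
    labels = []
    for name in characteristics:
        if name not in project_characteristics:
            continue  # the project lacks this characteristic, so skip it
        used.append(name)
        labels.append("-".join(used))
    return labels


characteristics = ["X0", "X1", "X2"]
# Characteristics available in each example project of the table.
projects = {"000": {"X0", "X1", "X2"}, "001": {"X1"}, "002": {"X0", "X2"}}

for project_id, present in projects.items():
    print(project_id, visited_combinations(characteristics, present))
# 000 ['X0', 'X0-X1', 'X0-X1-X2']
# 001 ['X1']
# 002 ['X0', 'X0-X2']

# The table has one column per non-empty combination: 2**3 - 1 = 7.
print(2 ** len(characteristics) - 1)
```

A project therefore fills at most one column per characteristic, while the remaining columns of its row stay at 0.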

## Algorithm explanation

We start with the main algorithm and its key points. The main function is called `get_groups_from_project` and takes the ID of a project as a parameter (the implementation at the end of the notebook also takes the list of characteristics). To understand what this algorithm does, we first look at the pseudocode of the function and then explain the variables and auxiliary functions it uses.

### Main function

Here is the pseudocode of the function:

```
function get_groups_from_project(project_ID):
    characteristics <- list of characteristics we will use to divide in groups
    metadata <- read_metadata(project_ID)
    
    subgroups <- init_subgroups(metadata, 'specie')
    used_characteristics <- ['specie']
    
    row <- init_row(project_ID, used_characteristics + characteristics)
    update_row(row, used_characteristics, subgroups)
    
    project_characteristics <- metadata.columns
    
    for characteristic in characteristics:
        if characteristic not in project_characteristics:
            skip this characteristic
        
        subgroups_aux <- []
        for subgroup in subgroups:
            subgroup_aux <- get_subgroups(subgroup, characteristic)
            
            subgroups_aux <- subgroups_aux + subgroup_aux
        
        used_characteristics <- used_characteristics + [characteristic]
        subgroups <- subgroups_aux
        update_row(row, used_characteristics, subgroups)

    return row, subgroups
```

The variables are used as follows:

- **characteristics**: A list of the characteristics we use to divide the cells into groups.
- **metadata**: A dataframe with the project information. It has one row per cell and one column per characteristic.
- **subgroups**: A list of the groups we have at the moment, obtained by dividing the dataframe with *used_characteristics*. Each subgroup is a Python dictionary with the dataframe of the group and the characteristics used in the division. An example of a subgroup is:

```
    {
        'dataframe': <pandas dataframe object>,
        'specie': 'homo sapiens',
        'cell_type': 'neuron',
        'organ': 'brain'
    }
```
- **used_characteristics**: A list of the characteristics (the initial `'specie'` plus those taken from *characteristics*) that have been used so far to divide the cells into subgroups.
- **row**: The row of the table with the number of subgroups for each combination. In our case it is a dictionary.
- **project_characteristics**: The list of characteristics that the project uses.

As we can see, this function relies on several auxiliary functions, which are explained in the next sections. These functions are:

- read_metadata
- init_subgroups
- get_subgroups
- init_row
- update_row

### Reading metadata

### Getting subgroups 
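As a minimal sketch of how `init_subgroups` and `get_subgroups` could behave, the block below splits a dataframe with `pandas` `groupby` and returns dictionaries in the subgroup format described earlier. The function bodies and the toy metadata are assumptions made for illustration, not the project's actual code.

```python
import pandas as pd


def get_subgroups(subgroup, characteristic):
    """Split one subgroup into one subgroup per observed value (assumed behavior)."""
    # Labels inherited from earlier divisions, e.g. {'specie': 'homo sapiens'}.
    labels = {key: value for key, value in subgroup.items() if key != "dataframe"}
    result = []
    # By default, groupby sorts the values and skips rows where the value is missing.
    for value, frame in subgroup["dataframe"].groupby(characteristic):
        result.append({"dataframe": frame, **labels, characteristic: value})
    return result


def init_subgroups(metadata, characteristic):
    """Create the first subgroups by splitting the whole metadata table (assumed behavior)."""
    return get_subgroups({"dataframe": metadata}, characteristic)


# Toy metadata with made-up values: one row per cell, one column per characteristic.
metadata = pd.DataFrame({
    "specie": ["homo sapiens", "homo sapiens", "homo sapiens", "mus musculus"],
    "cell_type": ["neuron", "neuron", "astrocyte", "neuron"],
})

by_specie = init_subgroups(metadata, "specie")
by_cell_type = [new for group in by_specie for new in get_subgroups(group, "cell_type")]

print(len(by_specie), len(by_cell_type))  # 2 3
print([(g["specie"], g["cell_type"]) for g in by_cell_type])
```

Copying the inherited labels into every new dictionary keeps each subgroup self-describing, which matches the example subgroup shown in the variable list.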

### Managing rows
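One possible shape for `init_row` and `update_row` is sketched below. It assumes that the row is a dictionary with one key per combination, that keys are the characteristic names joined with `-` as in the example table, and that `update_row` modifies the row in place (the main function ignores its return value). These bodies are hypothetical and only illustrate the idea.

```python
from itertools import combinations


def init_row(project_ID, characteristics):
    """Create a row with a zero count for every non-empty combination (assumed behavior)."""
    row = {"Project_ID": project_ID}
    for size in range(1, len(characteristics) + 1):
        # combinations() keeps the input order, so labels match the table header.
        for combo in combinations(characteristics, size):
            row["-".join(combo)] = 0
    return row


def update_row(row, used_characteristics, subgroups):
    """Store the current number of subgroups under the used combination (assumed behavior)."""
    row["-".join(used_characteristics)] = len(subgroups)


# Reproduce the row of project 000 from the example table with placeholder subgroups.
row = init_row("000", ["X0", "X1", "X2"])
update_row(row, ["X0"], ["g1"])
update_row(row, ["X0", "X1"], ["g1", "g2"])
update_row(row, ["X0", "X1", "X2"], ["g1", "g2", "g3", "g4"])

print(row)
# {'Project_ID': '000', 'X0': 1, 'X1': 0, 'X2': 0, 'X0-X1': 2, 'X0-X2': 0, 'X1-X2': 0, 'X0-X1-X2': 4}
```

Initializing every combination to 0 means that all rows share the same keys, so the rows of different projects can be stacked directly into one table.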

## Algorithm implementation

```python
def get_groups_from_project(project_ID, characteristics):
    """Generate the groups for percentile creation using characteristics to divide.

    Parameters
    ----------
    project_ID : str
        The ID of a project
    characteristics : list
        List of str with the characteristics used to divide the project

    Returns
    -------
    row : dict
        The row (dictionary) with the number of subgroups created with each combination
    subgroups : list
        A list with dictionaries containing the groups, the characteristics and the values used for the division.
    """

    # Read the metadata file using the API
    metadata = read_metadata(project_ID)

    # Initialization of parameters
    subgroups = init_subgroups(metadata, 'specie')
    used_characteristics = ['specie']

    row = init_row(project_ID, used_characteristics + characteristics)
    update_row(row, used_characteristics, subgroups)

    project_characteristics = metadata.columns

    # Start the subgroup generation using the characteristics
    for characteristic in characteristics:
        # If the characteristic is not in the project, we skip it
        if characteristic not in project_characteristics:
            continue

        # For each subgroup created, divide it using the current characteristic
        subgroups_aux = []
        for subgroup in subgroups:
            subgroup_aux = get_subgroups(subgroup, characteristic)

            subgroups_aux = subgroups_aux + subgroup_aux

        # Update parameters
        used_characteristics = used_characteristics + [characteristic]
        subgroups = subgroups_aux
        update_row(row, used_characteristics, subgroups)

    return row, subgroups
```
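A quick way to sanity-check the counts stored in the row is to compare them with the number of distinct value combinations that `pandas` reports through `groupby(...).ngroups`. The sketch below does this on made-up metadata. It assumes that no characteristic has missing values and that row keys are the used characteristics joined with `-`; the `disease` column name is invented to show a skipped characteristic.

```python
import pandas as pd

# Toy metadata with made-up values: one row per cell, one column per characteristic.
metadata = pd.DataFrame({
    "specie": ["homo sapiens", "homo sapiens", "homo sapiens", "mus musculus"],
    "cell_type": ["neuron", "neuron", "astrocyte", "neuron"],
    "organ": ["brain", "brain", "brain", "brain"],
})

# "disease" is not a column of the toy metadata, so it is skipped.
characteristics = ["cell_type", "organ", "disease"]

used = ["specie"]
expected_row = {"specie": metadata.groupby(used).ngroups}
for characteristic in characteristics:
    if characteristic not in metadata.columns:
        continue
    used = used + [characteristic]
    # Number of distinct value combinations observed for the used characteristics.
    expected_row["-".join(used)] = metadata.groupby(used).ngroups

print(expected_row)
# {'specie': 2, 'specie-cell_type': 3, 'specie-cell_type-organ': 3}
```

Dividing by `organ` does not increase the count here because every cell has the same value, so each existing subgroup is kept whole.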