# Getting percentile groups from a project

In this notebook, we show the process we follow to generate the different subgroups of cells of a project, from which the percentiles are later computed. These subgroups are obtained with an algorithm that takes into account the characteristics and conditions of each project.

Another objective of this notebook is to generate a table with one row per project and one column per combination of characteristics, in order to record the number of subgroups obtained from each project.

An important fact about the subgroup generation is that, even though all combinations of characteristics are possible, we only use a subset of them. Since we want the subgroups to be as specific as possible, we apply the characteristics one after another, in a fixed order, instead of exploring every possible combination.
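To make this restriction concrete, the short sketch below lists the combinations that a sequential strategy can reach. It is only an illustration: `reachable_combinations` is a hypothetical helper name, not part of the notebook's code, and it assumes that the characteristics are applied in list order and that those missing from a project are skipped. It ignores the early stop that happens when no valid subgroup remains.

```python
def reachable_combinations(characteristics, project_characteristics):
    """List the combination names reached when dividing sequentially.

    Hypothetical helper, for illustration only.
    """
    used = []
    names = []
    for characteristic in characteristics:
        # Characteristics that the project does not have are skipped
        if characteristic not in project_characteristics:
            continue
        used.append(characteristic)
        names.append('-'.join(used))
    return names


# A project with every characteristic, and one that lacks X1
print(reachable_combinations(['X0', 'X1', 'X2'], {'X0', 'X1', 'X2'}))
print(reachable_combinations(['X0', 'X1', 'X2'], {'X0', 'X2'}))
```

With three characteristics there are seven possible combinations, but under these assumptions a project can fill at most three of them.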



Here we can see an example of the table structure:

| Project_ID | X0 | X1 | X2 | X0-X1 | X0-X2 | X1-X2 | X0-X1-X2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 000 | 1 | 0 | 0 | 2 | 0 | 0 | 4 |
| 001 | 0 | 5 | 0 | 0 | 0 | 0 | 0 |
| 002 | 1 | 0 | 0 | 0 | 10 | 0 | 0 |

In this example, the project with the ID *000* has all the characteristics, so the number of subgroups is calculated for the combinations *X0*, *X0-X1* and *X0-X1-X2*; the second project only has *X1*; and the last one has *X0* and *X2*, so the number of subgroups is calculated for the combinations *X0* and *X0-X2*.
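As a rough sketch of how such a table could be assembled, the rows can be kept as dictionaries and turned into a pandas dataframe. The values below are copied from the example table; the variable names `columns`, `rows` and `table` are assumptions made for this illustration.

```python
import pandas as pd

# Column order of the example table
columns = ['X0', 'X1', 'X2', 'X0-X1', 'X0-X2', 'X1-X2', 'X0-X1-X2']

# One dictionary per project, with the values of the example table
rows = [
    {'project_ID': '000', 'X0': 1, 'X1': 0, 'X2': 0,
     'X0-X1': 2, 'X0-X2': 0, 'X1-X2': 0, 'X0-X1-X2': 4},
    {'project_ID': '001', 'X0': 0, 'X1': 5, 'X2': 0,
     'X0-X1': 0, 'X0-X2': 0, 'X1-X2': 0, 'X0-X1-X2': 0},
    {'project_ID': '002', 'X0': 1, 'X1': 0, 'X2': 0,
     'X0-X1': 0, 'X0-X2': 10, 'X1-X2': 0, 'X0-X1-X2': 0},
]

# Build the table, index it by project and fix the column order
table = pd.DataFrame(rows).set_index('project_ID')[columns]
print(table)
```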

## Algorithm explanation

First, we explain the main algorithm and some of its key points. We call the main function `get_groups_from_project`; it receives the ID of a project as a parameter. To understand what this algorithm does, we will first look at the pseudocode of the function, and then explain the variables used and the auxiliary functions.

### Main function

The first part of the function consists of reading the metadata and initializing the variables we are going to use (these variables are explained below). Then, for every characteristic that is present in the data, we try to divide the data by it, building the row with the number of groups per combination. Finally, we return the row and the subgroups created.

An important point is the stop condition: the loop continues only while there are characteristics left to apply and there are still subgroups left to divide (a subgroup has to meet some conditions to be considered valid).

Here is the pseudocode of the function:

---
```
function get_groups_from_project(project_ID):
    characteristics <- list of characteristics we will use to divide in groups
    metadata <- read_metadata(project_ID)
    
    subgroups <- init_subgroups(metadata)
    used_characteristics <- []
    
    row <- init_row(project_ID, characteristics)
    
    project_characteristics <- metadata.columns
    
    for characteristic in characteristics:
        if characteristic not in project_characteristics:
            skip this characteristic
        
        subgroups_aux <- []
        for subgroup in subgroups:
            subgroup_aux <- get_subgroups(subgroup, characteristic)
            
            subgroups_aux <- subgroups_aux + subgroup_aux
        
        used_characteristics <- used_characteristics + [characteristic]
        subgroups <- subgroups_aux
        
        if subgroups is empty:
            break
        
        update_row(row, used_characteristics, subgroups)

    return row, subgroups
```
---

The variables are used as follows:

- **characteristics**: A list of the characteristics we will use to divide the cells into groups.
- **metadata**: A dataframe with the project information. It has one row per cell and one column per characteristic.
- **subgroups**: A list of the groups we have at the moment, obtained by dividing the dataframe with *used_characteristics*. Each subgroup is a Python dictionary with the dataframe of the group and the characteristics used in the division. An example of a subgroup is:

```python
    {
        'dataframe': <pandas dataframe object>,
        'specie': 'homo sapiens',
        'cell_type': 'neuron',
        'organ': 'brain'
    }
```
- **used_characteristics**: A list of the characteristics from *characteristics* that have been used to divide the cells into subgroups.
- **row**: The row of the table with the number of subgroups for each combination. In our case it is a dictionary.
- **project_characteristics**: The list of characteristics that the project has.

As we can see, this function relies on several auxiliary functions, which we explain in the next sections. These functions are:

- read_metadata
- init_subgroups
- get_subgroups
- init_row
- update_row

### Reading metadata

The first step is to read the metadata, which contains, for each cell studied in the project, the characteristics we want to use to divide the project into groups. To achieve that, we have designed a function that uses the REST API to get the download links of a project and then reads the metadata.

We can see the pseudocode of the function here:

---
```
function read_metadata(project_ID):
    links <- obtain links of the projects from the API
    
    if metadata_link in links:
        return metadata
        
    return error - no metadata for this project
```
---

It is a very simple function that first queries the API for the links and then returns the metadata dataframe if its link exists.

### Getting subgroups 

In this part, we explain how we divide a group into the corresponding subgroups using a characteristic. Given a group (a Python dictionary) and a characteristic, we can do a `groupby` on this characteristic and return a subgroup for each value.

Not all subgroups are considered valid: to be valid, a subgroup must contain 25 cells or more. This ensures that the percentile calculation is meaningful, since percentiles computed from fewer cells would be unreliable.

Now, we can take a look at the pseudocode of the function called `get_subgroups`:

---
```
function get_subgroups(group, characteristic):
    dataframe <- group['dataframe']
    groupby <- dataframe.groupby(characteristic)
    
    subgroups <- []
    for value, subgroup in groupby:
        if n_cells(subgroup) < 25:
            skip this subgroup
            
        new_subgroup <- group.copy()
        new_subgroup['dataframe'] <- subgroup
        new_subgroup[characteristic] <- value
        subgroups <- subgroups + [new_subgroup]
    
    return subgroups
```
---

To clarify how the function works, consider the following example. The parameters given to the function are *disease* as the characteristic and this group:

```python
{
    'dataframe': <pandas dataframe object>,
    'specie': 'homo sapiens',
    'organ': 'brain'
}
```

Now, suppose we have three diseases in this group: *parkinson*, *alzheimer* and *brain cancer*. Since the subgroup with the disease *alzheimer* has only 10 cells, it won't be considered a valid subgroup. Knowing that, the list of subgroups we get is:

```python
[
    {
        'dataframe': <pandas dataframe object>, # df of the subgroup
        'specie': 'homo sapiens',
        'organ': 'brain',
        'disease': 'parkinson'
    },
    {
        'dataframe': <pandas dataframe object>, # df of the subgroup
        'specie': 'homo sapiens',
        'organ': 'brain',
        'disease': 'brain cancer'
    }
]
```
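The worked example can be reproduced with synthetic data. The sketch below restates the pseudocode of `get_subgroups` under a different, hypothetical name (`split_group`) so that it can run on its own; the cell counts of 30 and 40 are invented for the illustration, and only the subgroup with 10 cells falls below the threshold of 25.

```python
import pandas as pd

MIN_CELLS = 25  # validity threshold described in this section


def split_group(group, characteristic, min_cells=MIN_CELLS):
    """Illustrative restatement of the get_subgroups pseudocode."""
    subgroups = []
    for value, frame in group['dataframe'].groupby(characteristic):
        # Skip subgroups that do not have enough cells
        if len(frame) < min_cells:
            continue
        new_subgroup = group.copy()
        new_subgroup['dataframe'] = frame
        new_subgroup[characteristic] = value
        subgroups.append(new_subgroup)
    return subgroups


# Synthetic metadata: one row per cell, counts invented for the example
diseases = ['parkinson'] * 30 + ['alzheimer'] * 10 + ['brain cancer'] * 40
group = {
    'dataframe': pd.DataFrame({'disease': diseases}),
    'specie': 'homo sapiens',
    'organ': 'brain',
}

for subgroup in split_group(group, 'disease'):
    print(subgroup['disease'], len(subgroup['dataframe']))
```

Note that `groupby` sorts the group keys by default, so the subgroups come out in alphabetical order of the disease, which may differ from the order shown in the list above. The original `group` dictionary is left untouched because each subgroup starts from a shallow copy.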

However, in the main algorithm we don't start with a 'subgroup'; we start with a dataframe containing all the metadata. To initialize the subgroups, we create the function `init_subgroups`. This function creates a Python dictionary containing only the dataframe and returns it inside a list. It is a pretty simple function, as we can see in the pseudocode:

---
```
function init_subgroups(metadata):
    dictionary <- dictionary with metadata as 'dataframe'
    subgroups <- [dictionary]
    
    return subgroups
```
---

### Managing rows

Finally, we have to deal with creating and updating the row that holds the number of subgroups for each combination. As we said before, this row is a Python dictionary with one key per combination. To initialize the row, we have created the function `init_row`, which gets all the combinations of the characteristics and creates a dictionary with each combination initialized to 0.

Here we can see the pseudocode of this function:

---
```
function init_row(project_ID, characteristics):
    combination_names <- get all combinations of the characteristics
    row <- init dict with combination_names as keys, with a value of 0 for each key
    row['project_ID'] <- project_ID
    
    return row
```
---

An example of an empty row for the characteristics *X0*, *X1* and *X2* is:

```python
{
    'project_ID': '000',
    'X0': 0,
    'X1': 0,
    'X2': 0,
    'X0-X1': 0,
    'X0-X2': 0,
    'X1-X2': 0,
    'X0-X1-X2': 0
}
```

Once the row is created, we can update the number of groups for a combination using the function `update_row`. This function receives as parameters the row, the characteristics used to create the subgroups and the subgroups created. From the characteristics used, it generates the name of the combination, which matches the corresponding key in the row. For example, if the subgroups have been created with the characteristics *X1* and *X2*, the name generated is *X1-X2*. Finally, it sets that combination in the row to the number of subgroups.

The pseudocode of this function is:

---
```
function update_row(row, characteristics_used, subgroups):
    n_subgroups <- length(subgroups)
    combination_name <- generate name from characteristics_used
    row[combination_name] <- n_subgroups
```
---
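A minimal sketch of the name generation and the update, assuming the name is built by joining the characteristics with a hyphen. `combination_name` is a hypothetical helper used only here, and the three strings in `subgroups` are placeholders standing in for real subgroups.

```python
def combination_name(characteristics_used):
    """Join the characteristics with '-' to form the row key (illustrative)."""
    return '-'.join(characteristics_used)


row = {'project_ID': '000', 'X1': 0, 'X2': 0, 'X1-X2': 0}
subgroups = ['group A', 'group B', 'group C']  # placeholders, not real subgroups

# Store the number of subgroups under the key of the combination used
row[combination_name(['X1', 'X2'])] = len(subgroups)
print(row)

# The key depends on the order of the characteristics
print(combination_name(['X2', 'X1']))
```

Because the key depends on the order, the characteristics have to be passed in the same order that was used to build the row; otherwise the assignment would silently add a new key instead of updating an existing one.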

## Algorithm implementation

Now that we understand the algorithm, we can continue with the implementation of the functions explained.

First, we implement `read_metadata`; then the functions related to the subgroups, `init_subgroups` and `get_subgroups`; next the functions for managing rows, `init_row` and `update_row`; and finally the main function that uses all of them, `get_groups_from_project`.

As we implement each function, we test it with a sample project from the *Single Cell Expression Atlas* (SCEA) repository. This project can be found at https://www.ebi.ac.uk/gxa/sc/experiments/E-MTAB-7678/experiment-design.

In [1]:
project_ID = 'E-MTAB-7678'

### Reading metadata implementation

In [2]:
import requests    
import pandas as pd

In [3]:
def read_metadata(project_ID):
    """Return the metadata of the project with the project_ID
    
    Parameters
    ----------
    project_ID : str
        The ID of a project

    Returns
    -------
    
    metadata: pandas dataframe
        A dataframe of the project with its metadata
    """
    
    # Define the link and the metadata key name
    API_downloads_link = 'http://localhost:5000/project/downloads/'
    metadata_key_name = 'experimentDesignLink'
    
    # Get the download links of the project
    links = requests.get(API_downloads_link + project_ID).json()
    if not links:  # If the project doesn't exist
        raise Exception(f'Project with ID {project_ID} not found')
    links = links[0]
    
    # Return the metadata if it exists
    if metadata_key_name in links:
        metadata_link = links[metadata_key_name]
        metadata = pd.read_csv(metadata_link, sep='\t')
        
        return metadata

    raise Exception('Metadata link not found')
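The link-selection step can be checked without a running API if it is separated from the HTTP request. The sketch below is one possible way to do that, not the notebook's implementation: `MetadataNotFound` and `find_metadata_link` are hypothetical names, the URL is a placeholder, and the response is assumed to have the shape used in `read_metadata` (a list whose first element is a dictionary of download links).

```python
class MetadataNotFound(Exception):
    """Hypothetical error type used only in this sketch."""


def find_metadata_link(links, key='experimentDesignLink'):
    """Return the metadata link from an API response, or raise an error.

    `links` is assumed to be a list whose first element is a dictionary
    of download links, as in read_metadata.
    """
    if not links:
        raise MetadataNotFound('Project not found')
    first = links[0]
    if key not in first:
        raise MetadataNotFound('Metadata link not found')
    return first[key]


# Placeholder response, not real API output
response = [{'experimentDesignLink': 'https://example.org/design.tsv'}]
print(find_metadata_link(response))
```

A dedicated exception type lets callers tell a missing project or link apart from other failures, such as a network error.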

In [42]:
metadata = read_metadata(project_ID)
metadata.head()

| Unnamed: 0 | Assay | Sample Characteristic[organism] | Sample Characteristic Ontology Term[organism] | Sample Characteristic[strain] | Sample Characteristic Ontology Term[strain] | Sample Characteristic[age] | Sample Characteristic Ontology Term[age] | Sample Characteristic[developmental stage] | Sample Characteristic Ontology Term[developmental stage] | Sample Characteristic[sex] | ... | Sample Characteristic[organism part] | Sample Characteristic Ontology Term[organism part] | Sample Characteristic[cell type] | Sample Characteristic Ontology Term[cell type] | Sample Characteristic[immunophenotype] | Sample Characteristic Ontology Term[immunophenotype] | Factor Value[cell type] | Factor Value Ontology Term[cell type] | Factor Value[immunophenotype] | Factor Value Ontology Term[immunophenotype] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | SAMEA5367303-AAACCTGAGGATGTAT | Mus musculus | http://purl.obolibrary.org/obo/NCBITaxon_10090 | C57BL/6J | http://www.ebi.ac.uk/efo/EFO_0000606 | 10 week |  | adult | http://www.ebi.ac.uk/efo/EFO_0001272 | female | ... | lung | http://purl.obolibrary.org/obo/UBERON_0002048 | lung macrophage | http://purl.obolibrary.org/obo/CL_1001603 | CD45+ F4/80+ CD11c- Ly6C lo CD64+ |  | lung macrophage | http://purl.obolibrary.org/obo/CL_1001603 | CD45+ F4/80+ CD11c- Ly6C lo CD64+ |  |
| 1 | SAMEA5367303-AAACCTGCACCGTTGG | Mus musculus | http://purl.obolibrary.org/obo/NCBITaxon_10090 | C57BL/6J | http://www.ebi.ac.uk/efo/EFO_0000606 | 10 week |  | adult | http://www.ebi.ac.uk/efo/EFO_0001272 | female | ... | lung | http://purl.obolibrary.org/obo/UBERON_0002048 | lung macrophage | http://purl.obolibrary.org/obo/CL_1001603 | CD45+ F4/80+ CD11c- Ly6C lo CD64+ |  | lung macrophage | http://purl.obolibrary.org/obo/CL_1001603 | CD45+ F4/80+ CD11c- Ly6C lo CD64+ |  |
| 2 | SAMEA5367303-AAACCTGGTCCAGTGC | Mus musculus | http://purl.obolibrary.org/obo/NCBITaxon_10090 | C57BL/6J | http://www.ebi.ac.uk/efo/EFO_0000606 | 10 week |  | adult | http://www.ebi.ac.uk/efo/EFO_0001272 | female | ... | lung | http://purl.obolibrary.org/obo/UBERON_0002048 | lung macrophage | http://purl.obolibrary.org/obo/CL_1001603 | CD45+ F4/80+ CD11c- Ly6C lo CD64+ |  | lung macrophage | http://purl.obolibrary.org/obo/CL_1001603 | CD45+ F4/80+ CD11c- Ly6C lo CD64+ |  |
| 3 | SAMEA5367303-AAACCTGTCAGCTGGC | Mus musculus | http://purl.obolibrary.org/obo/NCBITaxon_10090 | C57BL/6J | http://www.ebi.ac.uk/efo/EFO_0000606 | 10 week |  | adult | http://www.ebi.ac.uk/efo/EFO_0001272 | female | ... | lung | http://purl.obolibrary.org/obo/UBERON_0002048 | lung macrophage | http://purl.obolibrary.org/obo/CL_1001603 | CD45+ F4/80+ CD11c- Ly6C lo CD64+ |  | lung macrophage | http://purl.obolibrary.org/obo/CL_1001603 | CD45+ F4/80+ CD11c- Ly6C lo CD64+ |  |
| 4 | SAMEA5367303-AAACGGGAGACTGTAA | Mus musculus | http://purl.obolibrary.org/obo/NCBITaxon_10090 | C57BL/6J | http://www.ebi.ac.uk/efo/EFO_0000606 | 10 week |  | adult | http://www.ebi.ac.uk/efo/EFO_0001272 | female | ... | lung | http://purl.obolibrary.org/obo/UBERON_0002048 | lung macrophage | http://purl.obolibrary.org/obo/CL_1001603 | CD45+ F4/80+ CD11c- Ly6C lo CD64+ |  | lung macrophage | http://purl.obolibrary.org/obo/CL_1001603 | CD45+ F4/80+ CD11c- Ly6C lo CD64+ |  |


As we can see, the column names are too long and there are also some columns we don't need (the Ontology Term columns), so we apply a function to process the metadata before using it.

In [41]:
import re

def process_metadata(metadata):
    """Return the processed metadata
    
    Parameters
    ----------
    metadata : pandas dataframe
        metadata of a project

    Returns
    -------
    
    metadata: pandas dataframe
        metadata processed
    """
    cols = [c for c in metadata.columns if 'ontology term' not in c.lower()]
    metadata = metadata[cols] # Drop columns with ontology terms

    metadata = metadata.rename(columns=lambda x: re.sub(r'.+\[(.+)\]',r'\1',x)) # Rename columns
    
    metadata = metadata.loc[:,~metadata.columns.duplicated()] # Remove duplicated columns
    
    return metadata
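To see what the three steps of `process_metadata` do to the column names, the following sketch applies the same filter and the same regular expression to a plain list of names taken from the output above. It works on strings only, so the deduplication with `dict.fromkeys` is a stand-in for the `duplicated()` mask used on the dataframe.

```python
import re

columns = [
    'Assay',
    'Sample Characteristic[organism]',
    'Sample Characteristic Ontology Term[organism]',
    'Sample Characteristic[cell type]',
    'Factor Value[cell type]',
]

# 1. Drop the ontology term columns
kept = [c for c in columns if 'ontology term' not in c.lower()]

# 2. Keep only the text inside the brackets; names without brackets are unchanged
renamed = [re.sub(r'.+\[(.+)\]', r'\1', c) for c in kept]

# 3. Remove duplicated names, keeping the first occurrence
unique = list(dict.fromkeys(renamed))

print(unique)  # ['Assay', 'organism', 'cell type']
```

After renaming, `Sample Characteristic[cell type]` and `Factor Value[cell type]` share the name `cell type`; keeping the first occurrence means the `Sample Characteristic` column is the one that survives.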

In [43]:
metadata = process_metadata(metadata)
metadata.head()

| Unnamed: 0 | Assay | organism | strain | age | developmental stage | sex | genotype | organism part | cell type | immunophenotype |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | SAMEA5367303-AAACCTGAGGATGTAT | Mus musculus | C57BL/6J | 10 week | adult | female | wild type genotype | lung | lung macrophage | CD45+ F4/80+ CD11c- Ly6C lo CD64+ |
| 1 | SAMEA5367303-AAACCTGCACCGTTGG | Mus musculus | C57BL/6J | 10 week | adult | female | wild type genotype | lung | lung macrophage | CD45+ F4/80+ CD11c- Ly6C lo CD64+ |
| 2 | SAMEA5367303-AAACCTGGTCCAGTGC | Mus musculus | C57BL/6J | 10 week | adult | female | wild type genotype | lung | lung macrophage | CD45+ F4/80+ CD11c- Ly6C lo CD64+ |
| 3 | SAMEA5367303-AAACCTGTCAGCTGGC | Mus musculus | C57BL/6J | 10 week | adult | female | wild type genotype | lung | lung macrophage | CD45+ F4/80+ CD11c- Ly6C lo CD64+ |
| 4 | SAMEA5367303-AAACGGGAGACTGTAA | Mus musculus | C57BL/6J | 10 week | adult | female | wild type genotype | lung | lung macrophage | CD45+ F4/80+ CD11c- Ly6C lo CD64+ |


### Getting subgroups implementation

In [5]:
def init_subgroups(metadata):
    """Returns a list of one dictionary, containing the metadata
    
    Parameters
    ----------
    
    metadata : pandas dataframe
        The metadata of a project in a dataframe

    Returns
    -------
    
    subgroups: list
        List with the initial group of the metadata
    """
    dictionary = {'dataframe': metadata}
    subgroups = [dictionary]

    return subgroups

In [6]:
subgroups = init_subgroups(metadata)

In [7]:
def get_subgroups(group, characteristic):
    """Divide the group in subgroups using the characteristic
    
    Parameters
    ----------
    
    group : dict
        The group with the dataframe and the characteristics used

    characteristic: str
        The characteristic used for the division
    
    Returns
    -------
    
    subgroups: list
        List with the subgroups created
    """
    # Get the dataframe and group by the characteristic
    dataframe = group['dataframe']
    groupby = dataframe.groupby(by=characteristic)

    # Create the new subgroups
    subgroups = []
    for value, subgroup in groupby:
        # If the group does not have enough cells skip it
        if len(subgroup) < 25:
            continue

        # Create the subgroup from the group
        new_subgroup = group.copy()
        new_subgroup['dataframe'] = subgroup
        new_subgroup[characteristic] = value
        subgroups.append(new_subgroup)

    return subgroups

In [8]:
subgroups = get_subgroups(subgroups[0], 'cell type')

In [9]:
print(f'Dividing with cell type, we get {len(subgroups)} subgroups')
print()
print(f"The first subgroup has the celltype \"{subgroups[0]['cell type']}\" and has {len(subgroups[0]['dataframe'])} cells")
print(f"The second subgroup has the celltype \"{subgroups[1]['cell type']}\" and has {len(subgroups[1]['dataframe'])} cells")

Dividing with cell type, we get 2 subgroups

The first subgroup has the celltype "classical monocyte" and has 2718 cells
The second subgroup has the celltype "lung macrophage" and has 4334 cells


### Managing rows implementation

We will select *X0*, *X1* and *X2* as example characteristics.

In [53]:
characteristics = ['X0','X1','X2']

Before implementing `init_row`, we have to create a function that returns the names of all the combinations of a list of characteristics.

In this case, we have created three simple functions:
- `get_combinations`: Given a list, returns all its combinations.
- `combiation_to_name`: Given a combination, returns its name.
- `get_combinations_names`: Given a list, returns the names of all its combinations.

Here we can see the three functions with examples.

In [54]:
import itertools

def get_combinations(stuff):
    # Collect every non-empty combination, from size 1 up to len(stuff)
    combinations = []
    for size in range(1, len(stuff) + 1):
        combinations.extend(itertools.combinations(stuff, size))

    return combinations

In [55]:
get_combinations(characteristics)

[('X0',),
 ('X1',),
 ('X2',),
 ('X0', 'X1'),
 ('X0', 'X2'),
 ('X1', 'X2'),
 ('X0', 'X1', 'X2')]

In [56]:
def combiation_to_name(combination):
    # Join the items with '-' to build the name of the combination
    return '-'.join(str(item) for item in combination)

In [57]:
combiation_to_name(characteristics)

'X0-X1-X2'

In [58]:
def get_combinations_names(stuff):
    combinations = get_combinations(stuff)
    
    combinations_names = [combiation_to_name(combination) for combination in combinations]
    return combinations_names

In [59]:
combination_names = get_combinations_names(characteristics)
combination_names

['X0', 'X1', 'X2', 'X0-X1', 'X0-X2', 'X1-X2', 'X0-X1-X2']

Now, we can make use of `get_combinations_names` in the function `init_row` to build the row:

In [60]:
def init_row(project_ID, characteristics):
    """Creates a new row with the combinations of the characteristics.

    Parameters
    ----------
    project_ID : str
        The ID of a project
    characteristics : list
        List of str with the characteristics used to divide the project

    Returns
    -------
    
    row: dict
        An empty row with the combinations
    """
    
    combinations_names = get_combinations_names(characteristics)
    row = dict.fromkeys(combinations_names, 0)
    row['project_ID'] = project_ID

    return row

In [61]:
row = init_row(project_ID, characteristics)
row

{'X0': 0,
 'X1': 0,
 'X2': 0,
 'X0-X1': 0,
 'X0-X2': 0,
 'X1-X2': 0,
 'X0-X1-X2': 0,
 'project_ID': 'E-MTAB-7678'}

The next step is to create the function `update_row` we explained before.

In [62]:
def update_row(row, characteristics_used, subgroups):
    """Update the row adding the number of subgroups to the combination.

    Parameters
    ----------
    row : dict
        The row (dictionary) to update
    characteristics_used : list
        List of str with the characteristics used to divide the groups
    subgroups: list
        List of subgroups
    """
    
    n_subgroups = len(subgroups)
    combination_name = combiation_to_name(characteristics_used)
    row[combination_name] = n_subgroups

In [63]:
update_row(row, ['X1', 'X2'], subgroups)
row

{'X0': 0,
 'X1': 0,
 'X2': 0,
 'X0-X1': 0,
 'X0-X2': 0,
 'X1-X2': 2,
 'X0-X1-X2': 0,
 'project_ID': 'E-MTAB-7678'}

### Main function implementation

Once we have defined all the previous functions, we can declare the main function.

In [64]:
def get_groups_from_project(project_ID, characteristics):
    """Generate the groups for percentile creation using characteristics to divide.

    Parameters
    ----------
    project_ID : str
        The ID of a project
    characteristics : list
        List of str with the characteristics used to divide the project

    Returns
    -------
    
    row: dict
        The row (dictionary) with the number of subgroups created with each combination
    subgroups: list
        A list with dictionaries containing the groups, the characteristics and the values used for the division.
    """    
    # Read the metadata file using the API
    metadata = read_metadata(project_ID)
    metadata = process_metadata(metadata)
    
    # Initialization of parameters
    subgroups = init_subgroups(metadata)
    row = init_row(project_ID, characteristics)
    project_characteristics = metadata.columns
    used_characteristics = []
        
    # Start the subgroup generation using the characteristics
    for characteristic in characteristics:
        # If the characteristic is not in the project, we skip it
        if characteristic not in project_characteristics:
            continue
        
        # For each subgroup created, divide it using the current characteristic
        subgroups_aux = []
        for subgroup in subgroups:
            subgroup_aux = get_subgroups(subgroup, characteristic)
            
            subgroups_aux = subgroups_aux + subgroup_aux
        
        # Update parameters
        used_characteristics = used_characteristics + [characteristic]
        subgroups = subgroups_aux
        
        # If there are no subgroups left, stop
        if not subgroups:
            break
        
        update_row(row, used_characteristics, subgroups)

    return row, subgroups

For this example, we are going to use three characteristics:

- Organism (species)
- Cell type
- Organism part

In [65]:
characteristics = [
    'organism',
    'cell type',
    'organism part'
]

In [66]:
row, subgroups = get_groups_from_project(project_ID, characteristics)

In [67]:
row

{'organism': 1,
 'cell type': 0,
 'organism part': 0,
 'organism-cell type': 2,
 'organism-organism part': 0,
 'cell type-organism part': 0,
 'organism-cell type-organism part': 2,
 'project_ID': 'E-MTAB-7678'}
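The returned `subgroups` list holds whole dataframes, which makes it awkward to inspect directly. One possible way to get an overview is a small summary that drops the dataframe and keeps its size. `summarize_subgroups` is a hypothetical helper, and the example input below is a stand-in: plain lists replace the dataframes, with lengths taken from the cell counts printed earlier for the two cell types.

```python
def summarize_subgroups(subgroups):
    """Return one dictionary per subgroup with its values and cell count."""
    summary = []
    for subgroup in subgroups:
        # Keep every characteristic value, but not the dataframe itself
        entry = {k: v for k, v in subgroup.items() if k != 'dataframe'}
        entry['n_cells'] = len(subgroup['dataframe'])
        summary.append(entry)
    return summary


# Stand-in subgroups: lists of the right length replace the dataframes
example_subgroups = [
    {'dataframe': [None] * 2718, 'organism': 'Mus musculus',
     'cell type': 'classical monocyte'},
    {'dataframe': [None] * 4334, 'organism': 'Mus musculus',
     'cell type': 'lung macrophage'},
]

for entry in summarize_subgroups(example_subgroups):
    print(entry)
```

Since the helper only relies on `len()`, it should work equally on the real subgroups, where `'dataframe'` is a pandas dataframe with one row per cell.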