# Getting percentile groups from a project

In this notebook, we show the process we follow to generate the different subgroups of cells of a project, which are later used to create the percentiles. The subgroups are obtained with an algorithm that takes into account the characteristics and conditions of each project. We also generate the percentiles for each subgroup.

Another objective of this notebook is to generate a table with one row per project and four columns:
- project ID
- number of subgroups
- characteristics used
- number of percentiles

An important point in the subgroup generation is that, even though all the combinations of characteristics are possible, we only use a subset of them. Since we want the subgroups to be as specific as possible, we apply the characteristics one after another instead of computing every possible combination. Also, since several combinations can be valid for the same project, as we will see later on, we will have to select the best one.


Here we can see an example of the table structure:

| project ID | number of subgroups | characteristics used | number of percentiles |
| --- | --- | --- | --- |
| 000 | 4 | 'X0-X1-X2' | 567 |
| 001 | 5 | 'X1' | 99563 |
| 002 | 10 | 'X0-X2' | 5851 |

In this example, the project with the ID *000* has all the characteristics, so its subgroups are calculated by dividing first by *X0*, then by *X0-X1* and finally by *X0-X1-X2*; the second project only has *X1*; and the last one has *X0* and *X2*, so its subgroups are calculated for the combinations *X0* and *X0-X2*.

## Algorithm explanation

First, we explain the main algorithm and some of its key points. We will call the main function `get_groups_from_project`; it takes the ID of a project as a parameter. To understand what this algorithm does, we will first look at the pseudocode of the function, and then explain the variables used and the auxiliary functions.

### Main function

The first part of the function consists of reading the metadata and initializing the variables we are going to use (these variables are explained below). Then, we try to divide the data by every characteristic that is present in the data, keeping track of the characteristics used. Finally, we build the row with the number of subgroups and return the row and the subgroups created.

An important detail is the stop condition: we keep dividing while there are remaining characteristics and we still have some subgroups to divide (a subgroup has to meet some conditions to be considered valid).

Here is the pseudocode of the function:

---
```python
function get_groups_from_project(project_ID):
    characteristics <- list of characteristics we will use to divide in groups
    metadata <- read_metadata(project_ID)
    
    subgroups <- init_subgroups(metadata)
    used_characteristics <- []
        
    project_characteristics <- metadata.columns
    
    for characteristic in characteristics:
        if characteristic not in project_characteristics:
            skip this characteristic
        
        subgroups_aux <- []
        for subgroup in subgroups:
            subgroup_aux <- get_subgroups(subgroup, characteristic)
            
            subgroups_aux <- subgroups_aux + subgroup_aux
        
        used_characteristics <- used_characteristics + [characteristic]
        subgroups <- subgroups_aux
        
        if subgroups is empty:
            break
    
    num_percentiles <- create_percentiles(subgroups)
    row <- create_row(project_ID, subgroups, used_characteristics)

    return row, subgroups
```
---

The variables are used as follows:

- **characteristics**: A list of the characteristics we will use to divide the cells into groups.
- **metadata**: A dataframe with the project information. It has a row per cell and a column for each characteristic.
- **subgroups**: A list of the groups we have at the moment, obtained by dividing the dataframe with *used_characteristics*. Each subgroup is a Python dictionary with the dataframe of the group and the values of the characteristics used in the division. An example of a subgroup is:

```python
    {
        'dataframe': <pandas dataframe object>,
        'specie': 'homo sapiens',
        'cell_type': 'neuron',
        'organ': 'brain'
    }
```
- **used_characteristics**: A list of the characteristics from *characteristics* that have been used to divide the cells into subgroups.
- **row**: The row of the table with the number of subgroups and the characteristics used. In our case it is a dictionary.
- **project_characteristics**: The list of characteristics available in the project.

As we can see, this function makes use of some other functions, which we explain in the next sections. These functions are:

- read_metadata
- init_subgroups
- get_subgroups
- create_row
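
To make the loop concrete, here is a minimal, dependency-free sketch that runs the same steps on a toy project. It is only an illustration: cells are plain dictionaries instead of rows of a pandas dataframe, and `split_group`, `split_project` and the `min_cells` parameter are hypothetical names that do not appear in the notebook's implementation.

```python
from collections import defaultdict

def split_group(group, characteristic, min_cells=25):
    """Split one group by a characteristic, keeping subgroups with enough cells."""
    by_value = defaultdict(list)
    for cell in group['cells']:
        by_value[cell[characteristic]].append(cell)

    subgroups = []
    for value, cells in by_value.items():
        if len(cells) < min_cells:
            continue  # too few cells for meaningful percentiles
        new_group = dict(group)  # shallow copy keeps the values used so far
        new_group['cells'] = cells
        new_group[characteristic] = value
        subgroups.append(new_group)
    return subgroups

def split_project(cells, characteristics, min_cells=25):
    """Apply the characteristics one after another, as in the pseudocode."""
    subgroups = [{'cells': cells}]
    used = []
    available = set(cells[0]) if cells else set()
    for characteristic in characteristics:
        if characteristic not in available:
            continue  # the project does not have this characteristic
        subgroups = [
            sub
            for group in subgroups
            for sub in split_group(group, characteristic, min_cells)
        ]
        used.append(characteristic)
        if not subgroups:
            break
    return subgroups, used

# Toy project (made-up values): 'cell type' is missing on purpose.
cells = (
    [{'organ': 'brain', 'disease': 'parkinson'}] * 30
    + [{'organ': 'brain', 'disease': 'alzheimer'}] * 10
    + [{'organ': 'lung', 'disease': 'normal'}] * 40
)
subgroups, used = split_project(cells, ['organ', 'cell type', 'disease'])
for subgroup in subgroups:
    print(subgroup['organ'], subgroup['disease'], len(subgroup['cells']))
```

In this toy run the missing characteristic is skipped, and the 10 alzheimer cells are dropped because they do not reach the minimum size.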

### Reading metadata

The first step is to read the metadata, which contains, for each cell studied, the values of the characteristics we want to use to divide the project into groups. To achieve that, we have designed a method that uses the REST API to get the download links of a project and read the metadata.

We can see the pseudocode of the function here:

---
```python
function read_metadata(project_ID):
    links <- obtain links of the projects from the API
    
    if metadata_link in links:
        return metadata
        
    return null # no metadata for this project
```
---

It is a very simple function that first queries the API for the links and then returns the dataframe if it exists.
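
The control flow of this lookup can be sketched without network access. In the snippet below, the callables `fetch_links` and `load_table` are hypothetical stand-ins for the API request and for the file reader, and the fake data exists only to exercise both branches; the key `experimentDesignLink` is the one used in the implementation section.

```python
def read_metadata(project_id, fetch_links, load_table,
                  metadata_key='experimentDesignLink'):
    """Return the metadata table of a project, or None if it has no metadata link."""
    links = fetch_links(project_id)  # hypothetical: dict with the download links
    if metadata_key not in links:
        return None  # no metadata for this project
    return load_table(links[metadata_key])

# Fake collaborators, used only to exercise the control flow.
fake_api = {
    'P1': {'experimentDesignLink': 'design.tsv'},
    'P2': {'normalisedCountsLink': 'counts.zip'},
}
fake_files = {'design.tsv': [{'Assay': 'cell-1', 'organism': 'Mus musculus'}]}

with_metadata = read_metadata('P1', fake_api.get, fake_files.get)
without_metadata = read_metadata('P2', fake_api.get, fake_files.get)
print(with_metadata)
print(without_metadata)
```

Passing the collaborators as arguments is a design choice of this sketch; it keeps the branch logic testable without calling the real API.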

### Getting subgroups 

In this part, we explain how we divide a group into the corresponding subgroups using a characteristic. Given a group (a Python dictionary) and a characteristic, we can do a `groupby` on this characteristic and return a subgroup for each value.

Not all subgroups are considered valid: to be valid, a subgroup has to contain 25 cells or more. This ensures that the calculation of the percentiles is meaningful, since percentiles computed from fewer cells would not be reliable.

Now, we can take a look at the pseudocode of the function called `get_subgroups`:

---
```python
function get_subgroups(group, characteristic):
    dataframe <- group['dataframe']
    groupby <- dataframe.groupby(characteristic)
    
    subgroups <- []
    for value, subgroup in groupby:
        if n_cells(subgroup) < 25:
            skip this subgroup
            
        new_subgroup <- group.copy()
        new_subgroup['dataframe'] <- subgroup
        new_subgroup[characteristic] <- value
        subgroups <- subgroups + [new_subgroup]
    
    return subgroups
```
---

To clarify the use of the function, let's look at an example. The parameters given to the function are *disease* as the characteristic and this group:

```python
{
    'dataframe': <pandas dataframe object>,
    'specie': 'homo sapiens',
    'organ': 'brain'
}
```

Now, suppose we have three diseases in this group: *parkinson*, *alzheimer* and *brain cancer*. Since the subgroup with the disease *alzheimer* only has 10 cells, it won't be considered a valid subgroup. Knowing that, the list of subgroups we get is:

```python
[
    {
        'dataframe': <pandas dataframe object>, # df of the subgroup
        'specie': 'homo sapiens',
        'organ': 'brain',
        'disease': 'parkinson'
    },
    {
        'dataframe': <pandas dataframe object>, # df of the subgroup
        'specie': 'homo sapiens',
        'organ': 'brain',
        'disease': 'brain cancer'
    }
]
```
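
The example above can be reproduced with a small synthetic dataframe. This is only a sketch: the cell counts (30, 10 and 40) are made up for the illustration, and `MIN_CELLS` is a hypothetical name for the threshold of 25 cells.

```python
import pandas as pd

MIN_CELLS = 25  # validity threshold described above

# Synthetic metadata: 30 parkinson, 10 alzheimer and 40 brain cancer cells.
dataframe = pd.DataFrame({
    'disease': ['parkinson'] * 30 + ['alzheimer'] * 10 + ['brain cancer'] * 40
})
group = {'dataframe': dataframe, 'specie': 'homo sapiens', 'organ': 'brain'}

subgroups = []
for value, subgroup in group['dataframe'].groupby('disease'):
    if len(subgroup) < MIN_CELLS:
        continue  # the 10 alzheimer cells are dropped here
    new_subgroup = group.copy()  # shallow copy: keeps 'specie' and 'organ'
    new_subgroup['dataframe'] = subgroup
    new_subgroup['disease'] = value
    subgroups.append(new_subgroup)

print([(s['disease'], len(s['dataframe'])) for s in subgroups])
```

Note that `groupby` sorts the values by default, so the subgroups come out in alphabetical order of the disease rather than in the order listed above.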

However, in the main algorithm we don't start with a subgroup; we have a dataframe with all the metadata. To initialize the subgroups, we create a function called `init_subgroups`. This function creates a Python dictionary containing only the dataframe and returns it inside a list. It is a pretty simple function, as we can see in the pseudocode:

---
```python
function init_subgroups(metadata):
    dictionary <- dictionary with metadata as 'dataframe'
    subgroups <- [dictionary]
    
    return subgroups
```
---

### Creating row

Finally, we have to deal with the creation of the row with the number of subgroups and the combination. As we said before, this row will be a Python dictionary.

Here we can see the pseudocode of this function:

---
```python
function create_row(project_ID, subgroups, used_characteristics):
    n_subgroups <- length(subgroups)
    combination_name <- generate name from used_characteristics
    row <- create_dict(project_ID, n_subgroups, combination_name)
    
    return row
```
---

## Algorithm implementation

Now that we understand the algorithm, we can continue with the implementation of the methods explained.

First, we will implement the reading of the metadata (`read_files`). Then we will implement the functions related to the subgroups, `init_subgroups` and `get_subgroups`, continue with the functions that manage rows, `combiation_to_name` and `create_row`, and end with the main function that uses all of them, `get_groups_from_project`.

As we implement each method, we will test it with a sample project from the *Single Cell Expression Atlas* (SCEA) repository. This project can be found at https://www.ebi.ac.uk/gxa/sc/experiments/E-MTAB-7678/experiment-design.

In [1]:
project_ID = 'E-MTAB-7678'

### Reading metadata implementation

Instead of reading just the metadata, we have designed a function that also reads the expression matrix, since we will need this information later on.

In [2]:
import requests    
import pandas as pd

In [16]:
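def read_files(project_ID):
    """Return the metadata and the expression matrix of a project.
    
    Parameters
    ----------
    project_ID : str
        The ID of a project

    Returns
    -------
    metadata : pandas dataframe or None
        A dataframe of the project with its metadata
    matrix : matrix or None
        The expression matrix (filtered TPM if available, else normalised counts)
    gene_names : pandas dataframe or None
        The gene names of the matrix
    cell_names : pandas dataframe or None
        The cell names of the matrix
    """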
    
    # Define the link and the metadata key name
    API_downloads_link = 'http://194.4.103.57:5000/project/downloads/'
    metadata_key_name = 'experimentDesignLink'
    filtered_key_name = 'filteredTPMLink'
    normalised_key_name = 'normalisedCountsLink'
    
    # Define variables
    metadata = None
    matrix = None
    gene_names = None
    cell_names = None
    
    # Get the download links of the project
    links = requests.get(API_downloads_link + project_ID).json()
    if not links: # If the project does not exist
        raise Exception(f'Project with ID {project_ID} not found')
    links = links[0]
    
    # Read the metadata if it exists
    if metadata_key_name in links:
        metadata_link = links[metadata_key_name]
        metadata = pd.read_csv(metadata_link, sep='\t', low_memory=False)
    
    if filtered_key_name in links:
        matrix_link = links[filtered_key_name]
        matrix, cell_names, gene_names = download_matrix(matrix_link, matrix_type='filtered')
    elif normalised_key_name in links:
        matrix_link = links[normalised_key_name]
        matrix, cell_names, gene_names = download_matrix(matrix_link, matrix_type='normalised')
    
    # Values that could not be read are returned as None
    return metadata, matrix, gene_names, cell_names

In [3]:
from scipy.io import mmread
import zipfile
import os
import re

def download_matrix(matrix_link, matrix_type='normalised'):
    """Download the zipped matrix of a project and return it with its cell and gene names."""
    # Download the file contents in binary format
    response = requests.get(matrix_link)
    
    project_ID = re.sub(r'.*/experiment/(.+)/download/.*', r'\1', matrix_link)
    
    zip_name = project_ID + ".zip"
    if matrix_type == 'normalised':
        matrix_path = project_ID + '.aggregated_filtered_normalised_counts.mtx'
        gene_path = project_ID + '.aggregated_filtered_normalised_counts.mtx_rows'
        cell_path = project_ID + '.aggregated_filtered_normalised_counts.mtx_cols'
    elif matrix_type == 'filtered':
        matrix_path = project_ID + '.expression_tpm.mtx'
        gene_path = project_ID + '.expression_tpm.mtx_rows'
        cell_path = project_ID + '.expression_tpm.mtx_cols'
        
    # Write the downloaded contents to a local zip file
    with open(zip_name, "wb") as code:
        code.write(response.content)
        
    with zipfile.ZipFile(zip_name, 'r') as zip_ref:
        zip_ref.extract(matrix_path)
        zip_ref.extract(gene_path)
        zip_ref.extract(cell_path)
    
    matrix = mmread(matrix_path).transpose()
    cell_names = pd.read_csv(cell_path, header=None, names=['Assay'])
    gen_names = pd.read_csv(gene_path, header=None, names=['Gen_Name'])
 
    os.remove(zip_name)
    os.remove(matrix_path)
    os.remove(cell_path)
    os.remove(gene_path)
    
    return matrix, cell_names, gen_names
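
`download_matrix` writes the archive and the extracted files into the working directory and removes them by hand, so an exception raised in between leaves files behind. One possible variant, sketched below with only the standard library, extracts into a `tempfile.TemporaryDirectory` that is cleaned up automatically. The function `extract_members` and the `demo.*` file names are hypothetical; the in-memory archive only stands in for the downloaded file.

```python
import io
import tempfile
import zipfile

def extract_members(zip_bytes, member_names):
    """Extract the named members into a temporary directory and return their text."""
    contents = {}
    with tempfile.TemporaryDirectory() as tmp_dir:
        with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
            for name in member_names:
                path = archive.extract(name, path=tmp_dir)
                with open(path, encoding='utf-8') as handle:
                    contents[name] = handle.read()
    # The temporary directory and its files are removed at this point.
    return contents

# Build a small in-memory archive standing in for the downloaded file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('demo.mtx_rows', 'gene-1\ngene-2\n')
    archive.writestr('demo.mtx_cols', 'cell-1\n')

files = extract_members(buffer.getvalue(), ['demo.mtx_rows', 'demo.mtx_cols'])
print(files['demo.mtx_rows'].split())
```

In the real function the files would be parsed (for example with `mmread` and `pd.read_csv`) inside the `with` block, before the directory disappears.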

In [34]:
metadata, _, gene_names, cell_names = read_files(project_ID)
metadata

Unnamed: 0,Assay,Sample Characteristic[organism],Sample Characteristic Ontology Term[organism],Sample Characteristic[strain],Sample Characteristic Ontology Term[strain],Sample Characteristic[age],Sample Characteristic Ontology Term[age],Sample Characteristic[developmental stage],Sample Characteristic Ontology Term[developmental stage],Sample Characteristic[sex],...,Sample Characteristic[organism part],Sample Characteristic Ontology Term[organism part],Sample Characteristic[cell type],Sample Characteristic Ontology Term[cell type],Sample Characteristic[immunophenotype],Sample Characteristic Ontology Term[immunophenotype],Factor Value[cell type],Factor Value Ontology Term[cell type],Factor Value[immunophenotype],Factor Value Ontology Term[immunophenotype]
0,SAMEA5367303-AAACCTGAGGATGTAT,Mus musculus,http://purl.obolibrary.org/obo/NCBITaxon_10090,C57BL/6J,http://www.ebi.ac.uk/efo/EFO_0000606,10 week,,adult,http://www.ebi.ac.uk/efo/EFO_0001272,female,...,lung,http://purl.obolibrary.org/obo/UBERON_0002048,lung macrophage,http://purl.obolibrary.org/obo/CL_1001603,CD45+ F4/80+ CD11c- Ly6C lo CD64+,,lung macrophage,http://purl.obolibrary.org/obo/CL_1001603,CD45+ F4/80+ CD11c- Ly6C lo CD64+,
1,SAMEA5367303-AAACCTGCACCGTTGG,Mus musculus,http://purl.obolibrary.org/obo/NCBITaxon_10090,C57BL/6J,http://www.ebi.ac.uk/efo/EFO_0000606,10 week,,adult,http://www.ebi.ac.uk/efo/EFO_0001272,female,...,lung,http://purl.obolibrary.org/obo/UBERON_0002048,lung macrophage,http://purl.obolibrary.org/obo/CL_1001603,CD45+ F4/80+ CD11c- Ly6C lo CD64+,,lung macrophage,http://purl.obolibrary.org/obo/CL_1001603,CD45+ F4/80+ CD11c- Ly6C lo CD64+,
2,SAMEA5367303-AAACCTGGTCCAGTGC,Mus musculus,http://purl.obolibrary.org/obo/NCBITaxon_10090,C57BL/6J,http://www.ebi.ac.uk/efo/EFO_0000606,10 week,,adult,http://www.ebi.ac.uk/efo/EFO_0001272,female,...,lung,http://purl.obolibrary.org/obo/UBERON_0002048,lung macrophage,http://purl.obolibrary.org/obo/CL_1001603,CD45+ F4/80+ CD11c- Ly6C lo CD64+,,lung macrophage,http://purl.obolibrary.org/obo/CL_1001603,CD45+ F4/80+ CD11c- Ly6C lo CD64+,
3,SAMEA5367303-AAACCTGTCAGCTGGC,Mus musculus,http://purl.obolibrary.org/obo/NCBITaxon_10090,C57BL/6J,http://www.ebi.ac.uk/efo/EFO_0000606,10 week,,adult,http://www.ebi.ac.uk/efo/EFO_0001272,female,...,lung,http://purl.obolibrary.org/obo/UBERON_0002048,lung macrophage,http://purl.obolibrary.org/obo/CL_1001603,CD45+ F4/80+ CD11c- Ly6C lo CD64+,,lung macrophage,http://purl.obolibrary.org/obo/CL_1001603,CD45+ F4/80+ CD11c- Ly6C lo CD64+,
4,SAMEA5367303-AAACGGGAGACTGTAA,Mus musculus,http://purl.obolibrary.org/obo/NCBITaxon_10090,C57BL/6J,http://www.ebi.ac.uk/efo/EFO_0000606,10 week,,adult,http://www.ebi.ac.uk/efo/EFO_0001272,female,...,lung,http://purl.obolibrary.org/obo/UBERON_0002048,lung macrophage,http://purl.obolibrary.org/obo/CL_1001603,CD45+ F4/80+ CD11c- Ly6C lo CD64+,,lung macrophage,http://purl.obolibrary.org/obo/CL_1001603,CD45+ F4/80+ CD11c- Ly6C lo CD64+,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7047,SAMEA5367306-TTTGCGCTCCTAGGGC,Mus musculus,http://purl.obolibrary.org/obo/NCBITaxon_10090,C57BL/6J,http://www.ebi.ac.uk/efo/EFO_0000606,10 week,,adult,http://www.ebi.ac.uk/efo/EFO_0001272,female,...,lung,http://purl.obolibrary.org/obo/UBERON_0002048,classical monocyte,http://purl.obolibrary.org/obo/CL_0000860,CD45+ F4/80+ CD11c- Ly6C lo CD64-,,classical monocyte,http://purl.obolibrary.org/obo/CL_0000860,CD45+ F4/80+ CD11c- Ly6C lo CD64-,
7048,SAMEA5367306-TTTGCGCTCCTCAATT,Mus musculus,http://purl.obolibrary.org/obo/NCBITaxon_10090,C57BL/6J,http://www.ebi.ac.uk/efo/EFO_0000606,10 week,,adult,http://www.ebi.ac.uk/efo/EFO_0001272,female,...,lung,http://purl.obolibrary.org/obo/UBERON_0002048,classical monocyte,http://purl.obolibrary.org/obo/CL_0000860,CD45+ F4/80+ CD11c- Ly6C lo CD64-,,classical monocyte,http://purl.obolibrary.org/obo/CL_0000860,CD45+ F4/80+ CD11c- Ly6C lo CD64-,
7049,SAMEA5367306-TTTGGTTTCGGATGGA,Mus musculus,http://purl.obolibrary.org/obo/NCBITaxon_10090,C57BL/6J,http://www.ebi.ac.uk/efo/EFO_0000606,10 week,,adult,http://www.ebi.ac.uk/efo/EFO_0001272,female,...,lung,http://purl.obolibrary.org/obo/UBERON_0002048,classical monocyte,http://purl.obolibrary.org/obo/CL_0000860,CD45+ F4/80+ CD11c- Ly6C lo CD64-,,classical monocyte,http://purl.obolibrary.org/obo/CL_0000860,CD45+ F4/80+ CD11c- Ly6C lo CD64-,
7050,SAMEA5367306-TTTGTCACAATGAATG,Mus musculus,http://purl.obolibrary.org/obo/NCBITaxon_10090,C57BL/6J,http://www.ebi.ac.uk/efo/EFO_0000606,10 week,,adult,http://www.ebi.ac.uk/efo/EFO_0001272,female,...,lung,http://purl.obolibrary.org/obo/UBERON_0002048,classical monocyte,http://purl.obolibrary.org/obo/CL_0000860,CD45+ F4/80+ CD11c- Ly6C lo CD64-,,classical monocyte,http://purl.obolibrary.org/obo/CL_0000860,CD45+ F4/80+ CD11c- Ly6C lo CD64-,


As we can see, the column names are too long and there are also some columns we don't want (the Ontology Term columns), so we will apply a function to process the metadata before we use it. We also need the cell names used in the matrix, so that we can drop the cells of the metadata that are not in the matrix.

In [4]:
def process_metadata(metadata, cell_names):
    """Return the processed metadata
    
    Parameters
    ----------
    metadata : pandas dataframe
        metadata of a project
    cell_names : pandas dataframe
        cell names of the expression matrix, in a column called 'Assay'

    Returns
    -------
    
    metadata: pandas dataframe
        metadata processed
    """
    cols = [c for c in metadata.columns if 'ontology term' not in c.lower()]
    metadata = metadata[cols] # Drop columns with ontology terms
    
    metadata = metadata.rename(columns=lambda x: re.sub(r'.+\[(.+)\]',r'\1',x)) # Rename columns
        
    metadata = metadata.loc[:,~metadata.columns.duplicated()] # Drop duplicated columns

    # Delete cells that are not in the matrix
    metadata = pd.merge(
        cell_names,
        metadata,
        how="inner",
        on='Assay'
    )
    
    return metadata
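
The column clean-up can be followed step by step on a handful of column names taken from the table above. This sketch works on a plain list of names instead of a dataframe, so `dict.fromkeys` plays the role of the duplicate filter; it is an illustration, not a replacement for `process_metadata`.

```python
import re

columns = [
    'Assay',
    'Sample Characteristic[organism]',
    'Sample Characteristic Ontology Term[organism]',
    'Sample Characteristic[cell type]',
    'Factor Value[cell type]',
]

# 1. Drop the ontology term columns.
kept = [c for c in columns if 'ontology term' not in c.lower()]

# 2. Keep only the text between square brackets, when there is one.
renamed = [re.sub(r'.+\[(.+)\]', r'\1', c) for c in kept]

# 3. Remove repeated names, keeping the first occurrence.
unique = list(dict.fromkeys(renamed))

print(renamed)
print(unique)
```

The second step is what produces duplicates: `Sample Characteristic[cell type]` and `Factor Value[cell type]` both become `cell type`, which is why the duplicated columns have to be dropped afterwards.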

In [35]:
metadata = process_metadata(metadata, cell_names)
metadata

|  | Assay | organism | strain | age | developmental stage | sex | genotype | organism part | cell type | immunophenotype |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | SAMEA5367303-AAACCTGAGGATGTAT | Mus musculus | C57BL/6J | 10 week | adult | female | wild type genotype | lung | lung macrophage | CD45+ F4/80+ CD11c- Ly6C lo CD64+ |
| 1 | SAMEA5367303-AAACCTGCACCGTTGG | Mus musculus | C57BL/6J | 10 week | adult | female | wild type genotype | lung | lung macrophage | CD45+ F4/80+ CD11c- Ly6C lo CD64+ |
| 2 | SAMEA5367303-AAACCTGGTCCAGTGC | Mus musculus | C57BL/6J | 10 week | adult | female | wild type genotype | lung | lung macrophage | CD45+ F4/80+ CD11c- Ly6C lo CD64+ |
| 3 | SAMEA5367303-AAACCTGTCAGCTGGC | Mus musculus | C57BL/6J | 10 week | adult | female | wild type genotype | lung | lung macrophage | CD45+ F4/80+ CD11c- Ly6C lo CD64+ |
| 4 | SAMEA5367303-AAACGGGAGACTGTAA | Mus musculus | C57BL/6J | 10 week | adult | female | wild type genotype | lung | lung macrophage | CD45+ F4/80+ CD11c- Ly6C lo CD64+ |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7047 | SAMEA5367306-TTTGCGCTCCTAGGGC | Mus musculus | C57BL/6J | 10 week | adult | female | wild type genotype | lung | classical monocyte | CD45+ F4/80+ CD11c- Ly6C lo CD64- |
| 7048 | SAMEA5367306-TTTGCGCTCCTCAATT | Mus musculus | C57BL/6J | 10 week | adult | female | wild type genotype | lung | classical monocyte | CD45+ F4/80+ CD11c- Ly6C lo CD64- |
| 7049 | SAMEA5367306-TTTGGTTTCGGATGGA | Mus musculus | C57BL/6J | 10 week | adult | female | wild type genotype | lung | classical monocyte | CD45+ F4/80+ CD11c- Ly6C lo CD64- |
| 7050 | SAMEA5367306-TTTGTCACAATGAATG | Mus musculus | C57BL/6J | 10 week | adult | female | wild type genotype | lung | classical monocyte | CD45+ F4/80+ CD11c- Ly6C lo CD64- |


### Getting subgroups implementation

In [5]:
def init_subgroups(metadata):
    """Returns a list of one dictionary, containing the metadata
    
    Parameters
    ----------
    
    metadata : pandas dataframe
        The metadata of a project in a dataframe

    Returns
    -------
    
    subgroups: list
        List with the initial group of the metadata
    """
    dictionary = {'dataframe': metadata}
    subgroups = [dictionary]

    return subgroups

In [36]:
subgroups = init_subgroups(metadata)

In [2]:
def parse_concrete(word):
    """Capitalise each word, keep the existing capitals and remove the spaces."""
    aux = list(word.title())

    for i in range(len(word)):
        if word[i].isupper():
            aux[i] = word[i]

    aux = ''.join(aux).replace(' ', '')

    return aux
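
To see what this helper produces, the snippet below repeats its logic in a self-contained form and applies it to a few values. `lung macrophage` and `Mus musculus` come from the sample project; `CD45+ cells` is a made-up value chosen to show that existing capitals are preserved.

```python
def parse_concrete(word):
    """Capitalise each word, keep existing capitals and remove spaces."""
    chars = list(word.title())
    for i, original in enumerate(word):
        if original.isupper():
            chars[i] = original  # undo the lowering done by str.title()
    return ''.join(chars).replace(' ', '')

for value in ['lung macrophage', 'CD45+ cells', 'Mus musculus']:
    print(value, '->', parse_concrete(value))
```

Without the loop that restores capitals, `str.title()` alone would turn `CD45+` into `Cd45+`.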

In [1]:
from OntologyConversorHCA import OntologyConversorHCA
from OntologyConversorSCAE import OntologyConversorSCAE

In [3]:
def get_subgroups(group, characteristic):
    """Divide the group in subgroups using the characteristic
    
    Parameters
    ----------
    
    group : dict
        The group with the dataframe and the characteristics used

    characteristic: str
        The characteristic used for the division
    
    Returns
    -------
    
    subgroups: list
        List with the subgroups created
    """
    # Get the dataframe and group by the characteristic
    dataframe = group['dataframe']
    groupby = dataframe.groupby(by=characteristic)
    # TODO: skip placeholder values such as 'not applicable', 'not available' and ''
    # Create the new subgroups
    subgroups = []
    hca = OntologyConversorHCA()
    scea = OntologyConversorSCAE()
    for value, subgroup in groupby:
        # If the group does not have enough cells skip it
        if len(subgroup) < 25:
            continue
        value_0 = parse_concrete(value)
        value_hca = hca.parse_word(value)
        value_scea = scea.parse_word(value)
        
        if value_hca == value_scea:
            value = value_hca
        elif value_hca != value_0:
            value = value_hca
        else:
            value = value_scea
        
        # Create the subgroup from the group
        new_subgroup = group.copy()
        new_subgroup['dataframe'] = subgroup
        new_subgroup[characteristic] = value
        subgroups = subgroups + [new_subgroup]

    return subgroups
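
The choice between the three candidate values inside `get_subgroups` can be isolated as a small pure function, which makes its precedence easier to read. The function name `choose_value` and all the example strings are hypothetical. The reading that a converter returns the `parse_concrete` result when it finds no match is an assumption, since the code of the converters is not shown in this notebook.

```python
def choose_value(fallback, value_hca, value_scea):
    """Pick the label stored in the subgroup, mirroring the if/elif/else above."""
    if value_hca == value_scea:
        return value_hca
    if value_hca != fallback:
        return value_hca
    return value_scea

# Both converters agree: that value is used.
print(choose_value('LungMacrophage', 'LungMacrophage', 'LungMacrophage'))
# value_hca differs from the fallback: value_hca is used.
print(choose_value('TCell', 'T cell', 'TCell'))
# value_hca equals the fallback and value_scea differs: value_scea is used.
print(choose_value('TCell', 'TCell', 'T cell'))
# Both differ from the fallback and from each other: value_hca is used.
print(choose_value('TCell', 'T cell', 'T-cell'))
```

In short, the HCA value takes precedence, and the SCEA value is used only when the HCA value is identical to the plain `parse_concrete` result.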

In [37]:
subgroups = get_subgroups(subgroups[0], 'cell type')

In [13]:
print(f'Dividing with cell type, we get {len(subgroups)} subgroups')
print()
print(f"The first subgroup has the celltype \"{subgroups[0]['cell type']}\" and has {len(subgroups[0]['dataframe'])} cells")
print(f"The second subgroup has the celltype \"{subgroups[1]['cell type']}\" and has {len(subgroups[1]['dataframe'])} cells")

Dividing with cell type, we get 2 subgroups

The first subgroup has the celltype "classical monocyte" and has 2718 cells
The second subgroup has the celltype "lung macrophage" and has 4334 cells


### Managing rows implementation

We will select *X0*, *X1* and *X2* as examples of characteristics.

In [38]:
characteristics = ['X0','X1','X2']

Before implementing `create_row`, we have to create a function, called `combiation_to_name`, that builds a name from a list of characteristics.

In [7]:
def combiation_to_name(combination):
    """Join the characteristics of a combination with '/'."""
    return '/'.join(str(item) for item in combination)

In [40]:
combiation_to_name(characteristics)

'X0/X1/X2'

Now, we can make use of `combiation_to_name` in the function `create_row` to build the row:

In [8]:
def create_row(project_ID, subgroups, characteristics_used, metadata_cells, number_genes):
    """Creates the row that summarises the subgroups of a project.

    Parameters
    ----------
    project_ID : str
        The ID of a project
    subgroups: list
        List of subgroups created
    characteristics_used : list
        List of str with the characteristics used to divide the project
    metadata_cells : int
        Number of cells in the processed metadata
    number_genes : int
        Number of genes in the expression matrix

    Returns
    -------
    
    row: dict
        The row with the number of subgroups, the cells used and the combination name
    """
    
    cells = 0
    for subgroup in subgroups:
        cells += len(subgroup['dataframe'])
    
    n_subgroups = len(subgroups)
    combination_name = combiation_to_name(characteristics_used)
    row = {
        'project_ID': project_ID,
        'num_subgroups': n_subgroups,
        'num_cells': cells,
        'cells_used': (cells / metadata_cells) * 100,
        'characteristics_used': combination_name,
        'number_genes': number_genes
    }

    return row

In [43]:
row = create_row(project_ID, subgroups, characteristics, len(metadata), len(gene_names))
row

{'project_ID': 'E-MTAB-7678',
 'num_subgroups': 2,
 'num_cells': 7052,
 'cells_used': 100.0,
 'characteristics_used': 'X0/X1/X2',
 'number_genes': 19975}

### Printing subgroups

In [9]:
def print_subgroups(subgroups):
    for n, subgroup in enumerate(subgroups):
        print(f'Subgroup {n}:')
        for key, value in subgroup.items():
            print('\t', end='')
            if key == 'dataframe':
                print(f'Number of cells: {len(value)}')
            else:
                print(f'{key}: {value}')

In [45]:
print_subgroups(subgroups)

Subgroup 0:
	Number of cells: 2718
	cell type: classical monocyte
Subgroup 1:
	Number of cells: 4334
	cell type: lung macrophage


### Main function implementation

Once we have defined all the previous functions, we can declare the main method.

In [10]:
def get_groups_from_project(project_ID, characteristics):
    """Generate the groups for percentile creation using characteristics to divide.

    Parameters
    ----------
    project_ID : str
        The ID of a project
    characteristics : list
        List of str with the characteristics used to divide the project

    Returns
    -------
    
    row: dict
        The row (dictionary) with the number of subgroups created with each combination
    subgroups: list
        A list with dictionaries containing the groups, the characteristics and the values used for the division.
    """ 
    needed_characteristic = [
                                'cell type',
                                'developmental stage',
                                'inferred cell type - ontology labels'
                            ]
    
    # Read the metadata file using the API
    metadata, _, gene_names, cell_names = read_files(project_ID)
    
    # If there is no metadata for this project, return empty lists
    if metadata is None:
        return [], []
    
    metadata = process_metadata(metadata, cell_names)
    metadata_cells = len(metadata)
    number_genes = len(gene_names)

    # Initialization of parameters
    subgroups = init_subgroups(metadata)
    project_characteristics = metadata.columns
    used_characteristics = []
        
    # Start the subgroup generation using the characteristics
    for characteristic in characteristics:
        # If the characteristic is not in the project, we skip it
        if characteristic not in project_characteristics:
            continue
        
        # For each subgroup created, divide it using the current characteristic
        subgroups_aux = []
        for subgroup in subgroups:
            subgroup_aux = get_subgroups(subgroup, characteristic)
            
            subgroups_aux = subgroups_aux + subgroup_aux
        
        # Check if we have lost cells
        cells_aux = sum([len(x['dataframe']) for x in subgroups_aux])
        if cells_aux < metadata_cells and characteristic not in needed_characteristic:
            continue
        
        # Update parameters
        used_characteristics = used_characteristics + [characteristic]
        subgroups = subgroups_aux
        
        # If there are no subgroups left, stop
        if not subgroups:
            break
        
    row = create_row(project_ID, subgroups, used_characteristics, metadata_cells, number_genes)
        
    return row, subgroups
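
The "lost cells" check in the loop decides whether a division is kept: a characteristic that discards cells is only applied when it belongs to the list of needed characteristics. A minimal sketch of that rule as a stand-alone predicate follows; `keep_division` is a hypothetical name and the count of 6000 cells is made up, while 7052 is the size of the sample project.

```python
NEEDED_CHARACTERISTICS = {
    'cell type',
    'developmental stage',
    'inferred cell type - ontology labels',
}

def keep_division(cells_after, total_cells, characteristic,
                  needed=NEEDED_CHARACTERISTICS):
    """Return True when the division by `characteristic` should be applied."""
    loses_cells = cells_after < total_cells
    return (not loses_cells) or characteristic in needed

print(keep_division(7052, 7052, 'organism'))   # no cells lost
print(keep_division(6000, 7052, 'disease'))    # cells lost, optional characteristic
print(keep_division(6000, 7052, 'cell type'))  # cells lost, but needed
```

This is the negation of the condition that triggers `continue` in the function above, so a `False` here corresponds to skipping the characteristic.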

For this example, we are going to use three characteristics:

- Organism (species)
- Cell type
- Organism part

In [47]:
characteristics = [
    'organism',
    'cell type',
    'organism part'
]

project_ID = 'E-MTAB-7678'

In [48]:
row, subgroups = get_groups_from_project(project_ID, characteristics)

In [49]:
pd.DataFrame([row])

|  | project_ID | num_subgroups | num_cells | cells_used | characteristics_used | number_genes |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | E-MTAB-7678 | 2 | 7052 | 100.0 | organism/cell type/organism part | 19975 |


In [50]:
print_subgroups(subgroups)

Subgroup 0:
	Number of cells: 2718
	organism: Mus musculus
	cell type: classical monocyte
	organism part: lung
Subgroup 1:
	Number of cells: 4334
	organism: Mus musculus
	cell type: lung macrophage
	organism part: lung


## Comparing characteristics groups

Since more than one combination is possible (for example, a project can have both cell type and inferred cell types), we will give the algorithm multiple groups of characteristics so we can compare them.

In [11]:
def get_groups_from_project_multiple_test(project_ID, characteristics_groups):
    """Generate the groups for percentile creation for several groups of characteristics.

    Parameters
    ----------
    project_ID : str
        The ID of a project
    characteristics_groups : list
        List of lists of str with the characteristics used to divide the project

    Returns
    -------
    
    rows: list
        A list with one row (dictionary) per distinct result obtained
    combinations_subgroups: list
        A list with the subgroups (list of dictionaries) obtained for each row
    """
    needed_characteristic = [
                                'cell type',
                                'developmental stage',
                                'inferred cell type - ontology labels'
                            ]
    
    # Read the metadata file using the API
    metadata, _, gene_names, cell_names = read_files(project_ID)
    
    # If there is no metadata for this project, return empty lists
    if metadata is None:
        return [], []
    
    metadata = process_metadata(metadata, cell_names)
    metadata_cells = len(metadata)
    number_genes = len(gene_names)

    rows = []
    combinations_subgroups = []
    
    for characteristics in characteristics_groups:
        # Initialization of parameters
        subgroups = init_subgroups(metadata)
        project_characteristics = metadata.columns
        used_characteristics = []

        # Start the subgroup generation using the characteristics
        for characteristic in characteristics:
            # If the characteristic is not in the project, we skip it
            if characteristic not in project_characteristics:
                continue

            # For each subgroup created, divide it using the current characteristic
            subgroups_aux = []
            for subgroup in subgroups:
                subgroup_aux = get_subgroups(subgroup, characteristic)

                subgroups_aux = subgroups_aux + subgroup_aux

            # Check if we have lost cells
            cells_aux = sum([len(x['dataframe']) for x in subgroups_aux])
            if cells_aux < metadata_cells and characteristic not in needed_characteristic:
                continue  
            
            # Update parameters
            used_characteristics = used_characteristics + [characteristic]
            subgroups = subgroups_aux
            
            # If there are no subgroups left, stop
            if not subgroups:
                break

        row = create_row(project_ID, subgroups, used_characteristics, metadata_cells, number_genes)
        
        if row not in rows:
            rows.append(row)
            combinations_subgroups.append(subgroups)
    
    return rows, combinations_subgroups

In [52]:
project_ID = 'E-CURD-46'

characterictics_groups = [
    [
        'organism',
        'cell type',
        'organism part',
        'disease'
    ],
    [
        'organism',
        'inferred cell type - ontology labels',
        'organism part',
        'disease'
    ],
    [
        'organism',
        'inferred cell type - authors labels',
        'organism part',
        'disease'
    ]
]

In [53]:
rows, combinations_subgroups = get_groups_from_project_multiple_test(project_ID, characterictics_groups)

pd.DataFrame(rows)

| | project_ID | num_subgroups | num_cells | cells_used | characteristics_used | number_genes |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | E-CURD-46 | 1 | 101844 | 100.0 | organism/organism part/disease | 25194 |
| 1 | E-CURD-46 | 33 | 62521 | 61.388987 | organism/inferred cell type - ontology labels | 25194 |


As we can see, the resulting combinations are all valid, but we have to select the best one.
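
The `cells_used` value in the output above is the percentage of cells that survive the division. As a rough illustration, the sketch below assumes that a cell with no value for a characteristic is dropped when the data is divided by that characteristic. It is not the notebook's `get_subgroups` (defined elsewhere); the `split_by` helper and the toy records are hypothetical, and plain dictionaries stand in for the metadata DataFrame.

```python
from collections import defaultdict


def split_by(cells, characteristic):
    """Group cells by a characteristic, dropping cells without a value (assumed behaviour)."""
    groups = defaultdict(list)
    for cell in cells:
        value = cell.get(characteristic)
        if value is not None:
            groups[value].append(cell)
    return dict(groups)


# Hypothetical toy metadata: the last cell has no inferred cell type
cells = [
    {"organism": "Homo sapiens", "inferred cell type": "T cell"},
    {"organism": "Homo sapiens", "inferred cell type": "B cell"},
    {"organism": "Homo sapiens", "inferred cell type": "T cell"},
    {"organism": "Homo sapiens", "inferred cell type": None},
]

groups = split_by(cells, "inferred cell type")
cells_kept = sum(len(group) for group in groups.values())
cells_used = 100 * cells_kept / len(cells)
print(len(groups), cells_kept, cells_used)

# The same ratio for the second combination of E-CURD-46 (62521 of 101844 cells)
print(100 * 62521 / 101844)
```

In the toy data, three of the four cells are kept. For E-CURD-46, dividing by `inferred cell type - ontology labels` keeps 62521 of the 101844 cells, which gives the reported 61.39 %.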

### Selecting the best subgroups combination

Given a list of subgroup combinations, we have to select the best one.

In [12]:
def best_subgroup_combination(subgroups_combinations):
    best_combination = None
    best_combination_index = 0
    
    for n, combination in enumerate(subgroups_combinations):
        if compare_combination(best_combination, combination) == 1:
            best_combination = combination
            best_combination_index = n
    
    return best_combination, best_combination_index

With the function `compare_combination`, we compare two combinations and decide which one is better. Here, the best combination is the most specific one (the one that used the most characteristics for the division).

In [13]:
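def compare_combination(combination0, combination1):
    """Compare two combinations by the number of characteristics used.

    Returns 1 if combination1 is more specific, -1 if combination0 is,
    and 0 if both use the same number of characteristics.
    """
    # A missing combination always loses against an existing one
    if combination0 is None:
        return 1
    if combination1 is None:
        return -1

    combination0_characteristics = len(combination0['characteristics_used'].split('/'))
    combination1_characteristics = len(combination1['characteristics_used'].split('/'))

    # Compare characteristics
    if combination0_characteristics > combination1_characteristics:
        return -1
    if combination0_characteristics < combination1_characteristics:
        return 1

    return 0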

In [56]:
best_subgroup_combination(rows)

({'project_ID': 'E-CURD-46',
  'num_subgroups': 1,
  'num_cells': 101844,
  'cells_used': 100.0,
  'characteristics_used': 'organism/organism part/disease',
  'number_genes': 25194},
 0)

In [57]:
rows

[{'project_ID': 'E-CURD-46',
  'num_subgroups': 1,
  'num_cells': 101844,
  'cells_used': 100.0,
  'characteristics_used': 'organism/organism part/disease',
  'number_genes': 25194},
 {'project_ID': 'E-CURD-46',
  'num_subgroups': 33,
  'num_cells': 62521,
  'cells_used': 61.38898707827658,
  'characteristics_used': 'organism/inferred cell type - ontology labels',
  'number_genes': 25194}]
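
Since the comparison only counts characteristics, the same selection could also be written with Python's built-in `max` and a key function. This is an alternative sketch, not the notebook's implementation: `count_characteristics` and `best_row_index` are hypothetical names, and the rows are trimmed copies of the output above. `max` returns the first maximal element, so ties resolve to the earliest combination, as in `best_subgroup_combination`.

```python
def count_characteristics(row):
    """Number of characteristics in a '/'-separated string."""
    return len(row["characteristics_used"].split("/"))


def best_row_index(rows):
    """Index of the row with the most characteristics (the first one on ties).

    Unlike best_subgroup_combination, this raises ValueError on an empty list.
    """
    return max(range(len(rows)), key=lambda i: count_characteristics(rows[i]))


# Trimmed copies of the two rows obtained for E-CURD-46
example_rows = [
    {"characteristics_used": "organism/organism part/disease"},
    {"characteristics_used": "organism/inferred cell type - ontology labels"},
]

index = best_row_index(example_rows)
print(index, example_rows[index]["characteristics_used"])
```

The first row uses three characteristics and the second only two, so the first row is selected, in agreement with the result of `best_subgroup_combination(rows)`.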

We now add this functionality to the main function, so that it compares the combinations and returns only the best one.

In [14]:
def get_groups_from_project_multiple(project_ID, characteristics_groups):
    """Generate the groups for percentile creation using characteristics to divide.

    Parameters
    ----------
    project_ID : str
        The ID of a project
    characteristics_groups : list
        List of lists of str with the characteristics used to divide the project

    Returns
    -------
    row : dict
        The row (dictionary) describing the best combination of characteristics
    subgroups : list
        A list with dictionaries containing the groups, the characteristics and the values used for the division.
    """    
    needed_characteristic = [
        'cell type',
        'developmental stage',
        'inferred cell type - ontology labels'
    ]
    
    # Read the metadata file using the API
    metadata, _, gene_names, cell_names = read_files(project_ID)
    
    # If there is no metadata for this project, return empty lists
    if metadata is None:
        return [], []
    
    metadata = process_metadata(metadata, cell_names)
    subgroups_cells = len(metadata)
    number_genes = len(gene_names)

    rows = []
    subgroups_list = []
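    for characteristics in characteristics_groups:
        # Initialization of parameters (the cell count is reset for every group,
        # so a loss of cells in one group does not affect the next one)
        subgroups = init_subgroups(metadata)
        subgroups_cells = len(metadata)
        project_characteristics = metadata.columns
        used_characteristics = []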

        # Start the subgroup generation using the characteristics
        for characteristic in characteristics:
            # If the characteristic is not in the project, we skip it
            if characteristic not in project_characteristics:
                continue

            # For each subgroup created, divide it using the current characteristic
            subgroups_aux = []
            for subgroup in subgroups:
                subgroup_aux = get_subgroups(subgroup, characteristic)

                subgroups_aux = subgroups_aux + subgroup_aux

            # Check if we have lost cells
            cells_aux = sum([len(x['dataframe']) for x in subgroups_aux])
            if cells_aux < subgroups_cells and characteristic not in needed_characteristic:
                continue
                
            # Update parameters
            used_characteristics = used_characteristics + [characteristic]
            subgroups = subgroups_aux
            subgroups_cells = cells_aux
            
            # If there are no subgroups left, stop
            if not subgroups:
                break

        row = create_row(project_ID, subgroups, used_characteristics, subgroups_cells, number_genes)
        
        rows.append(row)
        subgroups_list.append(subgroups)
        
    row, index = best_subgroup_combination(rows)
    subgroups = subgroups_list[index]
    
    return row, subgroups

In [59]:
row, subgroups = get_groups_from_project_multiple(project_ID, characterictics_groups)

pd.DataFrame([row])


In [71]:
print_subgroups(subgroups)

Subgroup 0:
	Number of cells: 101844
	organism: Homo sapiens
	organism part: ileal mucosa
	disease: Crohn's disease


## Get groups from all projects

As we said, the main objective is to get a table with the groups we would obtain by grouping the data by some characteristics. In this part, we are going to create this table.

### Get all the project IDs from the API

First of all, we have to get the IDs of all the projects. We can achieve that using the API as follows:

In [17]:
project_IDs = requests.get('http://194.4.103.57:5000/project/metadata/project_ID').json()
project_IDs[:6]

['005d611a-14d5-4fbf-846e-571a1f874f70',
 '027c51c6-0719-469f-a7f5-640fe57cbece',
 '091cf39b-01bc-42e5-9437-f419a66c8a45',
 '116965f3-f094-4769-9d28-ae675c1b569c',
 '1defdada-a365-44ad-9b29-443b06bd11d6',
 '2043c65a-1cf8-4828-a656-9e247d4e64f1']

### Defining the characteristics we want to use

In [18]:
characterictics_groups = [
    [
        'organism',
        'cell type',
        'developmental stage',
        'disease',
        'organism part',
        'sampling site',
        'biopsy site',
        'metastatic site'
    ],
    [
        'organism',
        'developmental stage',
        'inferred cell type - ontology labels',
        'disease',
        'organism part',
        'sampling site',
        'biopsy site',
        'metastatic site'
    ],
    [
        'organism',
        'developmental stage',
        'inferred cell type - authors labels',
        'disease',
        'organism part',
        'sampling site',
        'biopsy site',
        'metastatic site'
    ]
]

In [19]:
from IPython.display import clear_output

rows = []
n_projects = len(project_IDs)

In [22]:
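# Index of the first project to process in this run
start = 185

for n, project_ID in enumerate(project_IDs[start:], start=start + 1):
    clear_output(wait=True)
    print(f"{n}/{n_projects}")
    
    # The test version returns one row per valid combination of the project
    project_rows, subgroups = get_groups_from_project_multiple_test(project_ID, characterictics_groups)
    rows.extend(project_rows)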

187/187


In [23]:
df = pd.DataFrame(rows)
df

| | project_ID | num_subgroups | num_cells | cells_used | characteristics_used | number_genes |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 091cf39b-01bc-42e5-9437-f419a66c8a45 | 1 | 32445 | 100.000000 | organism/cell type/developmental stage/disease... | 23656 |
| 1 | 091cf39b-01bc-42e5-9437-f419a66c8a45 | 8 | 24319 | 74.954538 | organism/developmental stage/inferred cell typ... | 23656 |
| 2 | 091cf39b-01bc-42e5-9437-f419a66c8a45 | 1 | 32445 | 100.000000 | organism/developmental stage/disease/organism ... | 23656 |
| 3 | 116965f3-f094-4769-9d28-ae675c1b569c | 2 | 6209 | 100.000000 | organism/cell type/developmental stage/disease... | 18715 |
| 4 | 116965f3-f094-4769-9d28-ae675c1b569c | 5 | 2099 | 33.805766 | organism/developmental stage/inferred cell typ... | 18715 |
| ... | ... | ... | ... | ... | ... | ... |
| 307 | cc95ff89-2e68-4a08-a234-480eca21ce79 | 32 | 545346 | 89.901188 | organism/developmental stage/inferred cell typ... | 25052 |
| 308 | cc95ff89-2e68-4a08-a234-480eca21ce79 | 2 | 606606 | 100.000000 | organism/developmental stage/organism part | 25052 |
| 309 | f8aa201c-4ff1-45a4-890e-840d63459ca2 | 2 | 23517 | 100.000000 | organism/cell type/developmental stage/disease... | 20638 |
| 310 | f8aa201c-4ff1-45a4-890e-840d63459ca2 | 7 | 9258 | 39.367266 | organism/developmental stage/inferred cell typ... | 20638 |


In [25]:
genes_per_combination = df['num_subgroups'] * df['number_genes']
genes_per_combination

0       23656
1      189248
2       23656
3       37430
4       93575
        ...  
307    801664
308     50104
309     41276
310    144466
311     41276
Length: 312, dtype: int64
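
The notebook counts one percentile entry per gene and subgroup, so the value for each combination is `num_subgroups * number_genes`. As a quick sanity check, the following sketch recomputes the first five values by hand from the rows of `df` shown above; the `first_rows` list is just a copy of those values, not a new data source.

```python
# (num_subgroups, number_genes) for rows 0 to 4 of df, copied from the output above
first_rows = [(1, 23656), (8, 23656), (1, 23656), (2, 18715), (5, 18715)]

# One percentile entry per gene and subgroup
genes_per_row = [num_subgroups * number_genes for num_subgroups, number_genes in first_rows]
print(genes_per_row)
```

The resulting values (23656, 189248, 23656, 37430 and 93575) match the first five entries of `genes_per_combination`.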

In [29]:
total_percentiles = genes_per_combination.sum()

In [30]:
total_subgroups = df['num_subgroups'].sum()

In [32]:
print(f"With the subgroups we have made, we have:\n\t- {total_percentiles} percentiles\n\t- {total_subgroups} samplings")

With the subgroups we have made, we have:
	- 19806755 percentiles
	- 898 samplings


In [33]:
df.to_csv("percentile_groups.csv")