# Getting percentile groups from a project

In this notebook, we show the process we follow to generate the subgroups of cells of a project, which are later used to create the percentiles. The subgroups are obtained with an algorithm that depends on the characteristics and conditions of each project.

Another objective of this notebook is to generate a table with a row per project and three columns:
- project ID
- number of subgroups
- characteristics used

An important point in the subgroup generation is that, even though all combinations of characteristics are possible, we only use a subset of them. Since we want the subgroups to be as specific as possible, we apply the characteristics one after another instead of computing every possible combination.



Here we can see an example of the table structure:

| project ID | number of subgroups | characteristics used |
| --- | --- | --- |
| 000 | 4 | 'X0-X1-X2' |
| 001 | 5 | 'X1' |
| 002 | 10 | 'X0-X2' |

In this example, the project with ID *000* has all the characteristics, so the number of subgroups is calculated for the combinations *X0*, *X0-X1* and *X0-X1-X2*; the second project only has *X1*; and the last one has *X0* and *X2*, so the number of subgroups is calculated for the combinations *X0* and *X0-X2*.
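
To make the difference concrete, the sketch below contrasts the nested combinations followed by the sequential approach with the full set of combinations. The helper name `nested_combinations` is hypothetical and is not part of the notebook.

```python
from itertools import combinations

def nested_combinations(characteristics, project_characteristics):
    """Return the nested (prefix) combinations of the characteristics present."""
    present = [c for c in characteristics if c in project_characteristics]
    return [present[:i] for i in range(1, len(present) + 1)]

characteristics = ['X0', 'X1', 'X2']

# Project with every characteristic: X0, X0-X1 and X0-X1-X2
full_project = nested_combinations(characteristics, {'X0', 'X1', 'X2'})
print(full_project)

# Project with only X0 and X2: X1 is skipped, giving X0 and X0-X2
partial_project = nested_combinations(characteristics, {'X0', 'X2'})
print(partial_project)

# Every non-empty combination, for comparison
all_combinations = [
    list(combo)
    for size in range(1, len(characteristics) + 1)
    for combo in combinations(characteristics, size)
]
print(len(all_combinations))
```

With three characteristics the sequential approach evaluates at most three combinations, while the exhaustive one would evaluate seven.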

## Algorithm explanation

First, we explain the main algorithm and some of its key points. We call the main function `get_groups_from_project`; it takes the ID of a project as a parameter. To understand what the algorithm does, we first look at the pseudocode of the function and then explain the variables and the auxiliary functions it uses.

### Main function

The first part of the function consists of reading the metadata and initializing the variables we are going to use (these variables are explained below). Then, for every characteristic that is present in the data, we divide the current subgroups by it. Finally, we build the row with the number of subgroups for the combination used and return the row and the subgroups created.

An important step is the stop condition: the loop continues only while there are remaining characteristics to divide by and we still have some subgroups to divide (a subgroup has to meet some conditions to be considered valid). As soon as no valid subgroups are left, it stops.

Here is the pseudocode of the function:

---
```
function get_groups_from_project(project_ID):
    characteristics <- list of characteristics we will use to divide in groups
    metadata <- read_metadata(project_ID)
    
    subgroups <- init_subgroups(metadata)
    used_characteristics <- []
    
    project_characteristics <- metadata.columns
    
    for characteristic in characteristics:
        if characteristic not in project_characteristics:
            skip this characteristic
        
        subgroups_aux <- []
        for subgroup in subgroups:
            subgroup_aux <- get_subgroups(subgroup, characteristic)
            
            subgroups_aux <- subgroups_aux + subgroup_aux
        
        used_characteristics <- used_characteristics + [characteristic]
        subgroups <- subgroups_aux
        
        if subgroups is empty:
            break

    row <- create_row(project_ID, subgroups, used_characteristics)

    return row, subgroups
```
---

The variables are used as follows:

- **characteristics**: A list of the characteristics we will use to divide the cells into groups.
- **metadata**: A dataframe with the project information. It has a row per cell and a column for each characteristic.
- **subgroups**: A list of the groups we have at the moment, obtained by dividing the dataframe with *used_characteristics*. Each subgroup is a Python dictionary with the dataframe of the group and the values of the characteristics used in the division. An example of a subgroup:

```python
    {
        'dataframe': <pandas dataframe object>,
        'specie': 'homo sapiens',
        'cell_type': 'neuron',
        'organ': 'brain'
    }
```
- **used_characteristics**: A list of the characteristics from *characteristics* that have been used to divide the cells in subgroups.
- **row**: The row of the table with the number of subgroups and the characteristics used. In our case it is a dictionary.
- **project_characteristics**: The list of characteristics that the project has.

As we can see, this function relies on some other functions, which we explain in the next sections. These functions are:

- read_metadata
- init_subgroups
- get_subgroups
- create_row

### Reading metadata

The first thing to do is to read the metadata of the project, which contains, for each cell studied, the characteristics we want to use to divide the project into groups. To achieve that, we have designed a method that uses the REST API to get the download links of a project and read the metadata.

We can see the pseudocode of the function here:

---
```
function read_metadata(project_ID):
    links <- obtain links of the project from the API
    
    if metadata_link in links:
        metadata <- read the file at links[metadata_link]
        return metadata
        
    return null # no metadata for this project
```
---

It is a very simple function that first queries the API for the links and then returns the dataframe if it exists.

### Getting subgroups 

In this part, we explain how we divide a group into its subgroups using a characteristic. Given a group (a Python dictionary) and a characteristic, we can do a `groupby` on this characteristic and return a subgroup for each value.

Not all subgroups are considered valid, however: a subgroup is valid only if it has 25 cells or more. This keeps the percentile calculation meaningful, since percentiles computed on fewer cells would not be reliable.
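
As a rough illustration of the threshold, the sketch below counts the cells per value of a characteristic and keeps only the values that reach 25 cells. The toy data and the `MIN_CELLS` name are assumptions made for the example.

```python
import pandas as pd

MIN_CELLS = 25  # minimum number of cells stated in the text

# Made-up metadata: value 'a' has 40 cells, 'b' has 24 and 'c' has 25
toy_metadata = pd.DataFrame({'X0': ['a'] * 40 + ['b'] * 24 + ['c'] * 25})

# Number of cells for each value of the characteristic
cells_per_value = toy_metadata.groupby('X0').size()

# Keep only the values with enough cells
valid_values = cells_per_value[cells_per_value >= MIN_CELLS].index.tolist()

print(valid_values)  # 'b' is dropped because it has fewer than 25 cells
```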

Now, we can take a look at the pseudocode of the function called `get_subgroups`:

---
```
function get_subgroups(group, characteristic):
    dataframe <- group['dataframe']
    groupby <- dataframe.groupby(characteristic)
    
    subgroups <- []
    for value, subgroup in groupby:
        if n_cells(subgroup) < 25:
            skip this subgroup
            
        new_subgroup <- group.copy()
        new_subgroup['dataframe'] <- subgroup
        new_subgroup[characteristic] <- value
        subgroups <- subgroups + [new_subgroup]
    
    return subgroups
```
---

To clarify how the function is used, let's look at an example. The parameters given to the function are *disease* as the characteristic and this group:

```python
{
    'dataframe': <pandas dataframe object>,
    'specie': 'homo sapiens',
    'organ': 'brain'
}
```

In this group we have three diseases: *parkinson*, *alzheimer* and *brain cancer*. Since the subgroup with the disease *alzheimer* has only 10 cells, it won't be considered a valid subgroup. Knowing that, the list of subgroups we get is:

```python
[
    {
        'dataframe': <pandas dataframe object>, # df of the subgroup
        'specie': 'homo sapiens',
        'organ': 'brain',
        'disease': 'parkinson'
    },
    {
        'dataframe': <pandas dataframe object>, # df of the subgroup
        'specie': 'homo sapiens',
        'organ': 'brain',
        'disease': 'brain cancer'
    }
]
```
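
One detail worth noting in the pseudocode is `group.copy()`: each new subgroup starts as a shallow copy of its parent, so it inherits the labels accumulated so far while the parent itself is left unchanged. A minimal sketch, using plain strings as stand-ins for the dataframes:

```python
parent = {
    'dataframe': 'all brain cells',  # stand-in for a pandas dataframe
    'specie': 'homo sapiens',
    'organ': 'brain',
}

child = parent.copy()                   # shallow copy of the parent group
child['dataframe'] = 'parkinson cells'  # stand-in for the filtered dataframe
child['disease'] = 'parkinson'

print(sorted(child))        # the child keeps 'specie' and 'organ' and adds 'disease'
print('disease' in parent)  # the parent group is not modified
```

Because the copy is shallow, replacing the `'dataframe'` entry of the child does not touch the dataframe stored in the parent.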

However, in the main algorithm we don't start with a subgroup; we have a dataframe with all the metadata. To initialize the subgroups, we create the function `init_subgroups`. This function creates a Python dictionary containing only the dataframe and returns it inside a list. It is a pretty simple function, as we can see in the pseudocode:

---
```
function init_subgroups(metadata):
    dictionary <- dictionary with metadata as 'dataframe'
    subgroups <- [dictionary]
    
    return subgroups
```
---

### Creating row

Finally, we have to deal with the creation of the row with the number of subgroups and the combination used. As we said before, this row will be a Python dictionary.

Here we can see the pseudocode of this function:

---
```
function create_row(project_ID, subgroups, used_characteristics):
    n_subgroups <- length(subgroups)
    combination_name <- generate name from used_characteristics
    row <- create_dict(project_ID, n_subgroups, combination_name)
    
    return row
```
---

## Algorithm implementation

Now that we understand the algorithm, we can continue with the implementation of the methods explained.

First, we will implement `read_metadata`; then the functions related to the subgroups, `init_subgroups` and `get_subgroups`; then the functions that build the row, `combiation_to_name` and `create_row`; and finally the main function that uses all of them, `get_groups_from_project`.

As we implement each method, we will test it with a sample project from the *Single Cell Expression Atlas* (SCEA) repository. This project can be found at https://www.ebi.ac.uk/gxa/sc/experiments/E-MTAB-7678/experiment-design.

In [1]:
project_ID = 'E-MTAB-7678'

### Reading metadata implementation

Instead of reading just the metadata, we have designed a function, `read_files`, that also reads the expression matrix, since we will need this information later on.

In [2]:
import requests    
import pandas as pd

In [3]:
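def read_files(project_ID):
    """Return the metadata and the expression matrix of a project
    
    Parameters
    ----------
    project_ID : str
        The ID of a project

    Returns
    -------
    
    metadata: pandas dataframe
        A dataframe of the project with its metadata (None if not available)
    matrix: sparse matrix
        The expression matrix, transposed so that each row is a cell (None if not available)
    gene_names: pandas dataframe
        The gene names of the matrix (None if not available)
    cell_names: pandas dataframe
        The cell names of the matrix (None if not available)
    """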
    
    # Define the link and the metadata key name
    API_downloads_link = 'http://localhost:5000/project/downloads/'
    metadata_key_name = 'experimentDesignLink'
    filtered_key_name = 'filteredTPMLink'
    normalised_key_name = 'normalisedCountsLink'
    
    # Define variables
    metadata = None
    matrix = None
    gene_names = None
    cell_names = None
    
    # Get the download links of the project
    links = requests.get(API_downloads_link + project_ID).json()
    if not links: # If project doesn't exists
        raise Exception(f'Project with ID {project_ID} not found')
    links = links[0]
    
    # Read the metadata if it exists
    if metadata_key_name in links:
        metadata_link = links[metadata_key_name]
        metadata = pd.read_csv(metadata_link, sep='\t', low_memory=False)
    
    # Read the expression matrix, preferring the filtered TPM matrix
    if filtered_key_name in links:
        matrix_link = links[filtered_key_name]
        matrix, cell_names, gene_names = download_matrix(matrix_link, matrix_type='filtered')
    elif normalised_key_name in links:
        matrix_link = links[normalised_key_name]
        matrix, cell_names, gene_names = download_matrix(matrix_link, matrix_type='normalised')
    
    # Any file that is not available is returned as None
    return metadata, matrix, gene_names, cell_names

In [4]:
from scipy.io import mmread
import zipfile
import os
import re

def download_matrix(matrix_link, matrix_type='normalised'): 
    """Download the zipped matrix of a project and return it with its cell and gene names"""
    # Download the file contents in binary format
    response = requests.get(matrix_link)
    
    # Get the project ID from the download link
    project_ID = re.sub(r'.*/experiment/(.+)/download/.*', r'\1', matrix_link)
    
    zip_name = project_ID + ".zip"
    if matrix_type == 'normalised':
        matrix_path = project_ID + '.aggregated_filtered_normalised_counts.mtx'
        gene_path = project_ID + '.aggregated_filtered_normalised_counts.mtx_rows'
        cell_path = project_ID + '.aggregated_filtered_normalised_counts.mtx_cols'
    elif matrix_type == 'filtered':
        matrix_path = project_ID + '.expression_tpm.mtx'
        gene_path = project_ID + '.expression_tpm.mtx_rows'
        cell_path = project_ID + '.expression_tpm.mtx_cols'
    else:
        raise ValueError(f'Unknown matrix type: {matrix_type}')
        
    # Save the downloaded zip file to disk
    with open(zip_name, "wb") as code:
        code.write(response.content)
        
    with zipfile.ZipFile(zip_name, 'r') as zip_ref:
        zip_ref.extract(matrix_path)
        zip_ref.extract(gene_path)
        zip_ref.extract(cell_path)
    
    matrix = mmread(matrix_path).transpose()
    cell_names = pd.read_csv(cell_path, header=None, names=['Assay'])
    gen_names = pd.read_csv(gene_path, header=None, names=['Gen_Name'])
 
    os.remove(zip_name)
    os.remove(matrix_path)
    os.remove(cell_path)
    os.remove(gene_path)
    
    return matrix, cell_names, gen_names
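
The `re.sub` call at the top of `download_matrix` recovers the project ID from the download link. The sketch below applies the same pattern to a made-up link; the URL is only an assumption about the shape of the link, not a real endpoint.

```python
import re

pattern = r'.*/experiment/(.+)/download/.*'

# Hypothetical link, used only to illustrate the pattern
example_link = 'https://example.org/experiment/E-MTAB-7678/download/zip?fileType=normalised'

extracted_ID = re.sub(pattern, r'\1', example_link)
print(extracted_ID)

# If the link does not match the pattern, re.sub returns it unchanged
unmatched = re.sub(pattern, r'\1', 'not-a-download-link')
print(unmatched)
```

Since an unmatched link is returned as is, a link with a different shape would silently produce wrong file names, so checking the extracted ID may be worthwhile.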

In [5]:
metadata, _, _, cell_names = read_files("091cf39b-01bc-42e5-9437-f419a66c8a45")
metadata

Unnamed: 0,Assay,Sample Characteristic[organism],Sample Characteristic Ontology Term[organism],Sample Characteristic[individual],Sample Characteristic Ontology Term[individual],Sample Characteristic[ethnic group],Sample Characteristic Ontology Term[ethnic group],Sample Characteristic[age],Sample Characteristic Ontology Term[age],Sample Characteristic[developmental stage],...,Sample Characteristic[disease],Sample Characteristic Ontology Term[disease],Sample Characteristic[organism status],Sample Characteristic Ontology Term[organism status],Factor Value[ethnic group],Factor Value Ontology Term[ethnic group],Factor Value[inferred cell type - ontology labels],Factor Value Ontology Term[inferred cell type - ontology labels],Factor Value[inferred cell type - authors labels],Factor Value Ontology Term[inferred cell type - authors labels]
0,group1-AAACCTGAGACCACGA,Homo sapiens,http://purl.obolibrary.org/obo/NCBITaxon_9606,HS_BM_2,,European,http://purl.obolibrary.org/obo/HANCESTRO_0005,28 year,,adult,...,normal,http://purl.obolibrary.org/obo/PATO_0000461,alive,http://purl.obolibrary.org/obo/PATO_0001421,European,http://purl.obolibrary.org/obo/HANCESTRO_0005,hematopoietic stem cell,http://purl.obolibrary.org/obo/CL_0000037,hematopoietic stem cell,http://purl.obolibrary.org/obo/CL_0000037
1,group1-AAACCTGAGTTGAGTA,Homo sapiens,http://purl.obolibrary.org/obo/NCBITaxon_9606,HS_BM_2,,European,http://purl.obolibrary.org/obo/HANCESTRO_0005,28 year,,adult,...,normal,http://purl.obolibrary.org/obo/PATO_0000461,alive,http://purl.obolibrary.org/obo/PATO_0001421,European,http://purl.obolibrary.org/obo/HANCESTRO_0005,common myeloid progenitor,http://purl.obolibrary.org/obo/CL_0000049,myeloid progenitor,http://purl.obolibrary.org/obo/CL_0000049
2,group1-AAACCTGCAAATCCGT,Homo sapiens,http://purl.obolibrary.org/obo/NCBITaxon_9606,HS_BM_2,,European,http://purl.obolibrary.org/obo/HANCESTRO_0005,28 year,,adult,...,normal,http://purl.obolibrary.org/obo/PATO_0000461,alive,http://purl.obolibrary.org/obo/PATO_0001421,European,http://purl.obolibrary.org/obo/HANCESTRO_0005,erythroid progenitor cell,http://purl.obolibrary.org/obo/CL_0000038,erythroid progenitor,http://purl.obolibrary.org/obo/CL_0002361
3,group1-AAACCTGGTGACTACT,Homo sapiens,http://purl.obolibrary.org/obo/NCBITaxon_9606,HS_BM_2,,European,http://purl.obolibrary.org/obo/HANCESTRO_0005,28 year,,adult,...,normal,http://purl.obolibrary.org/obo/PATO_0000461,alive,http://purl.obolibrary.org/obo/PATO_0001421,European,http://purl.obolibrary.org/obo/HANCESTRO_0005,erythroid progenitor cell,http://purl.obolibrary.org/obo/CL_0000038,erythroid progenitor,http://purl.obolibrary.org/obo/CL_0002361
4,group1-AAACCTGGTTCAGTAC,Homo sapiens,http://purl.obolibrary.org/obo/NCBITaxon_9606,HS_BM_2,,European,http://purl.obolibrary.org/obo/HANCESTRO_0005,28 year,,adult,...,normal,http://purl.obolibrary.org/obo/PATO_0000461,alive,http://purl.obolibrary.org/obo/PATO_0001421,European,http://purl.obolibrary.org/obo/HANCESTRO_0005,hematopoietic multipotent progenitor cell,http://purl.obolibrary.org/obo/CL_0000837,hematopoietic multipotent progenitor,http://purl.obolibrary.org/obo/CL_0000837
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32440,group9-TTTGTCAGTGTAAGTA,Homo sapiens,http://purl.obolibrary.org/obo/NCBITaxon_9606,HS_BM_3,,European,http://purl.obolibrary.org/obo/HANCESTRO_0005,19 year,,adult,...,normal,http://purl.obolibrary.org/obo/PATO_0000461,alive,http://purl.obolibrary.org/obo/PATO_0001421,European,http://purl.obolibrary.org/obo/HANCESTRO_0005,granulocyte monocyte progenitor cell,http://purl.obolibrary.org/obo/CL_0000557,monocyte progenitor,
32441,group9-TTTGTCAGTTCTCATT,Homo sapiens,http://purl.obolibrary.org/obo/NCBITaxon_9606,HS_BM_3,,European,http://purl.obolibrary.org/obo/HANCESTRO_0005,19 year,,adult,...,normal,http://purl.obolibrary.org/obo/PATO_0000461,alive,http://purl.obolibrary.org/obo/PATO_0001421,European,http://purl.obolibrary.org/obo/HANCESTRO_0005,,,,
32442,group9-TTTGTCATCCTCAACC,Homo sapiens,http://purl.obolibrary.org/obo/NCBITaxon_9606,HS_BM_3,,European,http://purl.obolibrary.org/obo/HANCESTRO_0005,19 year,,adult,...,normal,http://purl.obolibrary.org/obo/PATO_0000461,alive,http://purl.obolibrary.org/obo/PATO_0001421,European,http://purl.obolibrary.org/obo/HANCESTRO_0005,erythroid progenitor cell,http://purl.obolibrary.org/obo/CL_0000038,erythroid progenitor,http://purl.obolibrary.org/obo/CL_0002361
32443,group9-TTTGTCATCGAATGCT,Homo sapiens,http://purl.obolibrary.org/obo/NCBITaxon_9606,HS_BM_3,,European,http://purl.obolibrary.org/obo/HANCESTRO_0005,19 year,,adult,...,normal,http://purl.obolibrary.org/obo/PATO_0000461,alive,http://purl.obolibrary.org/obo/PATO_0001421,European,http://purl.obolibrary.org/obo/HANCESTRO_0005,,,,


As we can see, the column names are too long and there are also some columns we don't want (the Ontology Term columns), so we will apply a function to process the metadata before we use it. The function also needs the cell names used in the matrix, so that it can drop the cells of the metadata that are not in the matrix.
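
The clean-up described here can be previewed on a few column names taken from the output above. This is only a sketch of two of the steps (dropping the ontology columns and keeping the text inside the brackets), not the full processing function.

```python
import re

columns = [
    'Assay',
    'Sample Characteristic[organism]',
    'Sample Characteristic Ontology Term[organism]',
    'Factor Value[inferred cell type - ontology labels]',
]

# Drop the ontology term columns
kept = [c for c in columns if 'ontology term' not in c.lower()]

# Keep only the text inside the square brackets; names without brackets are unchanged
renamed = [re.sub(r'.+\[(.+)\]', r'\1', c) for c in kept]
print(renamed)
```

Note that `Sample Characteristic[ethnic group]` and `Factor Value[ethnic group]` would both be renamed to `ethnic group`, which appears to be why the function also drops duplicated columns.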

In [5]:
def process_metadata(metadata, cell_names):
    """Return the processed metadata
    
    Parameters
    ----------
    metadata : pandas dataframe
        metadata of a project
    cell_names : pandas dataframe
        cell names of the expression matrix, in a column called 'Assay'

    Returns
    -------
    
    metadata: pandas dataframe
        metadata processed
    """
    cols = [c for c in metadata.columns if 'ontology term' not in c.lower()]
    metadata = metadata[cols] # Drop columns with ontology terms
    
    metadata = metadata.rename(columns=lambda x: re.sub(r'.+\[(.+)\]',r'\1',x)) # Rename columns
        
    metadata = metadata.T.drop_duplicates().T # Drop duplicated columns

    # Delete cells that are not in the matrix
    metadata = pd.merge(
        cell_names,
        metadata,
        how="inner",
        on='Assay'
    )
    
    return metadata

In [7]:
metadata = process_metadata(metadata, cell_names)
metadata

Unnamed: 0,Assay,organism,individual,ethnic group,age,developmental stage,sex,organism part,cell type,immunophenotype,clinical information,disease,organism status,inferred cell type - ontology labels,inferred cell type - authors labels
0,group1-AAACCTGAGACCACGA,Homo sapiens,HS_BM_2,European,28 year,adult,female,bone marrow,hematopoietic stem cell,"CD34-positive, CD38-negative","HIV, HBV and HCV negative",normal,alive,hematopoietic stem cell,hematopoietic stem cell
1,group1-AAACCTGAGTTGAGTA,Homo sapiens,HS_BM_2,European,28 year,adult,female,bone marrow,hematopoietic stem cell,"CD34-positive, CD38-negative","HIV, HBV and HCV negative",normal,alive,common myeloid progenitor,myeloid progenitor
2,group1-AAACCTGCAAATCCGT,Homo sapiens,HS_BM_2,European,28 year,adult,female,bone marrow,hematopoietic stem cell,"CD34-positive, CD38-negative","HIV, HBV and HCV negative",normal,alive,erythroid progenitor cell,erythroid progenitor
3,group1-AAACCTGGTGACTACT,Homo sapiens,HS_BM_2,European,28 year,adult,female,bone marrow,hematopoietic stem cell,"CD34-positive, CD38-negative","HIV, HBV and HCV negative",normal,alive,erythroid progenitor cell,erythroid progenitor
4,group1-AAACCTGGTTCAGTAC,Homo sapiens,HS_BM_2,European,28 year,adult,female,bone marrow,hematopoietic stem cell,"CD34-positive, CD38-negative","HIV, HBV and HCV negative",normal,alive,hematopoietic multipotent progenitor cell,hematopoietic multipotent progenitor
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32440,group9-TTTGTCAGTGTAAGTA,Homo sapiens,HS_BM_3,European,19 year,adult,female,bone marrow,hematopoietic stem cell,"CD34-positive, CD38-negative","HIV, HBV and HCV negative",normal,alive,granulocyte monocyte progenitor cell,monocyte progenitor
32441,group9-TTTGTCAGTTCTCATT,Homo sapiens,HS_BM_3,European,19 year,adult,female,bone marrow,hematopoietic stem cell,"CD34-positive, CD38-negative","HIV, HBV and HCV negative",normal,alive,,
32442,group9-TTTGTCATCCTCAACC,Homo sapiens,HS_BM_3,European,19 year,adult,female,bone marrow,hematopoietic stem cell,"CD34-positive, CD38-negative","HIV, HBV and HCV negative",normal,alive,erythroid progenitor cell,erythroid progenitor
32443,group9-TTTGTCATCGAATGCT,Homo sapiens,HS_BM_3,European,19 year,adult,female,bone marrow,hematopoietic stem cell,"CD34-positive, CD38-negative","HIV, HBV and HCV negative",normal,alive,,
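
The last step of `process_metadata`, the inner merge on `Assay`, is what removes the cells that are not in the matrix. A minimal sketch on made-up data (the cell names below are invented for the example):

```python
import pandas as pd

# Made-up data: the matrix only contains two of the three cells
toy_cell_names = pd.DataFrame({'Assay': ['cell_1', 'cell_3']})
toy_metadata = pd.DataFrame({
    'Assay': ['cell_1', 'cell_2', 'cell_3'],
    'organism': ['Homo sapiens'] * 3,
})

# An inner merge keeps only the cells present in both dataframes
filtered = pd.merge(toy_cell_names, toy_metadata, how='inner', on='Assay')
print(filtered['Assay'].tolist())
```

Putting the cell names on the left side also means the resulting rows follow the order of the cells in the matrix.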


### Getting subgroups implementation

In [6]:
def init_subgroups(metadata):
    """Returns a list of one dictionary, containing the metadata
    
    Parameters
    ----------
    
    metadata : pandas dataframe
        The metadata of a project in a dataframe

    Returns
    -------
    
    subgroups: list
        List with the initial group of the metadata
    """
    dictionary = {'dataframe': metadata}
    subgroups = [dictionary]

    return subgroups

In [9]:
subgroups = init_subgroups(metadata)

In [7]:
def get_subgroups(group, characteristic):
    """Divide the group in subgroups using the characteristic
    
    Parameters
    ----------
    
    group : dict
        The group with the dataframe and the characteristics used

    characteristic: str
        The characteristic used for the division
    
    Returns
    -------
    
    subgroups: list
        List with the subgroups created
    """
    # Get the dataframe and group by the characteristic
    dataframe = group['dataframe']
    groupby = dataframe.groupby(by=characteristic)

    # Create the new subgroups
    subgroups = []
    for value, subgroup in groupby:
        # If the group does not have enough cells skip it
        if len(subgroup) < 25:
            continue

        # Create the subgroup from the group
        new_subgroup = group.copy()
        new_subgroup['dataframe'] = subgroup
        new_subgroup[characteristic] = value
        subgroups = subgroups + [new_subgroup]

    return subgroups

In [11]:
subgroups = get_subgroups(subgroups[0], 'cell type')

In [13]:
print(f'Dividing with cell type, we get {len(subgroups)} subgroups')
print()
print(f"The first subgroup has the celltype \"{subgroups[0]['cell type']}\" and has {len(subgroups[0]['dataframe'])} cells")
#print(f"The second subgroup has the celltype \"{subgroups[1]['cell type']}\" and has {len(subgroups[1]['dataframe'])} cells")

Dividing with cell type, we get 1 subgroups

The first subgroup has the celltype "hematopoietic stem cell" and has 32445 cells


### Managing rows implementation

We will select *X0*, *X1* and *X2* as example characteristics.

In [14]:
characteristics = ['X0','X1','X2']

Before implementing `create_row`, we have to create a function that builds the name of a list of characteristics, called `combiation_to_name`.

In [8]:
def combiation_to_name(combination):
    # Join the characteristics using '/' as separator
    return '/'.join(str(item) for item in combination)

In [16]:
combiation_to_name(characteristics)

'X0/X1/X2'

Now, we can make use of `combiation_to_name` in the function `create_row` to build the row:

In [9]:
def create_row(project_ID, subgroups, characteristics_used):
    """Creates a row with the number of subgroups and cells of a combination.

    Parameters
    ----------
    project_ID : str
        The ID of a project
    subgroups: list
        List of subgroups created
    characteristics_used : list
        List of str with the characteristics used to divide the project

    Returns
    -------
    
    row: dict
        A row with the project ID, the number of subgroups, the number of
        cells and the name of the combination used
    """
    
    cells = 0
    for subgroup in subgroups:
        cells += len(subgroup['dataframe'])
    
    n_subgroups = len(subgroups)
    combination_name = combiation_to_name(characteristics_used)
    row = {
        'project_ID': project_ID,
        'num_subgroups': n_subgroups,
        'num_cells': cells,
        'characteristics_used': combination_name
    }

    return row

In [19]:
row = create_row(project_ID, subgroups, characteristics)
row

NameError: name 'characteristics' is not defined

### Printing subgroups

In [10]:
def print_subgroups(subgroups):
    for n, subgroup in enumerate(subgroups):
        print(f'Subgroup {n}:')
        for key, value in subgroup.items():
            print('\t', end='')
            if key == 'dataframe':
                print(f'Number of cells: {len(value)}')
            else:
                print(f'{key}: {value}')

In [20]:
print_subgroups(subgroups)

Subgroup 0:
	Number of cells: 32445
	cell type: hematopoietic stem cell


### Main function implementation

Once we have defined all the previous functions, we can declare the main method.

In [11]:
def get_groups_from_project(project_ID, characteristics):
    """Generate the groups for percentile creation using characteristics to divide.

    Parameters
    ----------
    project_ID : str
        The ID of a project
    characteristics : list
        List of str with the characteristics used to divide the project

    Returns
    -------
    
    row: dict
        The row (dictionary) with the number of subgroups created with each combination
    subgroups: list
        A list with dictionaries containing the groups, the characteristics and the values used for the division.
    """    
    # Read the metadata file using the API
    metadata, _, _, cell_names = read_files(project_ID)
    
    # If there is no metadata for this project, return empty lists
    if metadata is None:
        return [], []
    
    metadata = process_metadata(metadata, cell_names)
    
    # Initialization of parameters
    subgroups = init_subgroups(metadata)
    project_characteristics = metadata.columns
    used_characteristics = []
        
    # Start the subgroup generation using the characteristics
    for characteristic in characteristics:
        # If the characteristic is not in the project, we skip it
        if characteristic not in project_characteristics:
            continue
        
        # For each subgroup created, divide it using the current characteristic
        subgroups_aux = []
        for subgroup in subgroups:
            subgroup_aux = get_subgroups(subgroup, characteristic)
            
            subgroups_aux = subgroups_aux + subgroup_aux
        
        # Update parameters
        used_characteristics = used_characteristics + [characteristic]
        subgroups = subgroups_aux
        
        # If there are no subgroups left, stop
        if not subgroups:
            break
        
    row = create_row(project_ID, subgroups, used_characteristics)
        
    return row, subgroups

For this example, we are going to use three characteristics:

- Organism (species)
- Cell type
- Organism part

In [81]:
characteristics = [
    'organism',
    'cell type',
    'organism part'
]

project_ID = 'E-MTAB-7678'

In [82]:
row, subgroups = get_groups_from_project(project_ID, characteristics)

In [83]:
row

{'project_ID': 'E-MTAB-7678',
 'num_subgroups': 2,
 'characteristics_used': 'organism-cell type-organism part'}

In [84]:
pd.DataFrame([row])

| | project_ID | num_subgroups | characteristics_used |
| --- | --- | --- | --- |
| 0 | E-MTAB-7678 | 2 | organism-cell type-organism part |
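
To see the sequential splitting at work without calling the API, the sketch below runs a condensed version of the same loop on a made-up dataframe. The function `split_sequentially` and the toy data are assumptions made for the example; it is not the notebook's implementation.

```python
import pandas as pd

def split_sequentially(metadata, characteristics, min_cells=25):
    """Split the metadata by each characteristic in turn (toy version)."""
    subgroups = [{'dataframe': metadata}]
    used = []
    for characteristic in characteristics:
        if characteristic not in metadata.columns:
            continue  # characteristic missing in this project
        new_subgroups = []
        for group in subgroups:
            for value, frame in group['dataframe'].groupby(characteristic):
                if len(frame) >= min_cells:  # keep only valid subgroups
                    new_subgroups.append({**group, 'dataframe': frame, characteristic: value})
        subgroups = new_subgroups
        used.append(characteristic)
        if not subgroups:
            break  # nothing left to divide
    return subgroups, used

# Made-up metadata: 60 cells of one organism and two cell types
toy = pd.DataFrame({
    'organism': ['Homo sapiens'] * 60,
    'cell type': ['neuron'] * 40 + ['astrocyte'] * 20,
})

toy_subgroups, toy_used = split_sequentially(toy, ['organism', 'cell type', 'organism part'])
print(toy_used)
print([(g['cell type'], len(g['dataframe'])) for g in toy_subgroups])
```

Here `organism part` is skipped because the toy project does not have it, and the astrocyte subgroup is discarded because it has fewer than 25 cells.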


## Comparing characteristics groups

Since more than one combination is possible (for example, a project can have both cell type and inferred cell types), we will give the algorithm multiple lists of characteristics so we can compare them.

In [12]:
def get_groups_from_project_multiple_test(project_ID, characteristics_groups):
    """Generate the groups for percentile creation using characteristics to divide.

    Parameters
    ----------
    project_ID : str
        The ID of a project
    characteristics_groups : list
        Lists of lists of str with the characteristics used to divide the project

    Returns
    -------
    
    rows: list
        A list with one row (dictionary) per group of characteristics
    """    
    # Read the metadata file using the API
    metadata, _, _, cell_names = read_files(project_ID)
    
    # If there is no metadata for this project, return an empty list
    if metadata is None:
        return []
    
    metadata = process_metadata(metadata, cell_names)
    
    rows = []
    
    for characteristics in characteristics_groups:
        # Initialization of parameters
        subgroups = init_subgroups(metadata)
        project_characteristics = metadata.columns
        used_characteristics = []

        # Start the subgroup generation using the characteristics
        for characteristic in characteristics:
            # If the characteristic is not in the project, we skip it
            if characteristic not in project_characteristics:
                continue

            # For each subgroup created, divide it using the current characteristic
            subgroups_aux = []
            for subgroup in subgroups:
                subgroup_aux = get_subgroups(subgroup, characteristic)

                subgroups_aux = subgroups_aux + subgroup_aux

            # Update parameters
            used_characteristics = used_characteristics + [characteristic]
            subgroups = subgroups_aux

            # If there are no subgroups left, stop
            if not subgroups:
                break

        row = create_row(project_ID, subgroups, used_characteristics)
        rows.append(row)
        
    return rows

In [17]:
project_ID = 'E-CURD-46'

characterictics_groups = [
    [
        'organism',
        'cell type',
        'organism part',
        'disease'
    ],
    [
        'organism',
        'inferred cell type - ontology labels',
        'organism part',
        'disease'
    ],
    [
        'organism',
        'inferred cell type - authors labels',
        'organism part',
        'disease'
    ]
]

In [41]:
rows = get_groups_from_project_multiple_test(project_ID, characterictics_groups)

pd.DataFrame(rows)

| | project_ID | num_subgroups | num_cells | characteristics_used |
| --- | --- | --- | --- | --- |
| 0 | E-CURD-46 | 1 | 101844 | organism-disease |
| 1 | E-CURD-46 | 33 | 62521 | organism-inferred cell type - ontology labels-... |
| 2 | E-CURD-46 | 35 | 62521 | organism-inferred cell type - authors labels-d... |


### Selecting the best subgroups combination

Given a list of subgroup combinations, we have to select the best one, that is, the combination that uses the most characteristics.

In [13]:
def best_subgroup_combination(subgroups_combinations):
    best_combination = None
    best_combination_index = 0
    
    for n, combination in enumerate(subgroups_combinations):
        if compare_combination(best_combination, combination) == 1:
            best_combination = combination
            best_combination_index = n
    
    return best_combination, best_combination_index

With the function `compare_combination` we compare two combinations: it returns 1 if the second one uses more characteristics, -1 if the first one does, and 0 if both use the same number.

In [20]:
def compare_combination(combination0, combination1):
        
    if combination0 is None:
        return 1
    if combination1 is None:
        return -1
    
    combination0_characteristics = len(combination0['characteristics_used'].split('/'))
    combination1_characteristics = len(combination1['characteristics_used'].split('/'))
    # Compare characteristics
    if combination0_characteristics > combination1_characteristics:
        return -1
    if combination0_characteristics < combination1_characteristics:
        return 1
    
    return 0

In [85]:
best_subgroup_combination(rows)

({'project_ID': 'E-CURD-46',
  'num_subgroups': 1,
  'num_cells': 101844,
  'characteristics_used': 'organism-disease'},
 0)

In [74]:
rows

[{'project_ID': 'E-CURD-46',
  'num_subgroups': 1,
  'num_cells': 101844,
  'characteristics_used': 'organism-disease'},
 {'project_ID': 'E-CURD-46',
  'num_subgroups': 33,
  'num_cells': 62521,
  'characteristics_used': 'organism-inferred cell type - ontology labels-disease'},
 {'project_ID': 'E-CURD-46',
  'num_subgroups': 35,
  'num_cells': 62521,
  'characteristics_used': 'organism-inferred cell type - authors labels-disease'}]

We now add this functionality to the main function, so that it compares the combinations of characteristics and returns the best one.

In [15]:
def get_groups_from_project_multiple(project_ID, characteristics_groups):
    """Generate the groups for percentile creation using characteristics to divide.

    Parameters
    ----------
    project_ID : str
        The ID of a project
    characteristics_groups : list
        List of lists of str with the characteristics used to divide the project

    Returns
    -------
    row : dict
        The row (dictionary) describing the best combination of characteristics
    subgroups : list
        A list of dictionaries containing the groups, the characteristics and the values used for the division.
    """    
    # Read the metadata file using the API
    metadata, _, _, cell_names = read_files(project_ID)
    
    # If there is no metadata for this project, return empty lists
    if metadata is None:
        return [], []
    
    metadata = process_metadata(metadata, cell_names)
    
    rows = []
    subgroups_list = []
    for characteristics in characteristics_groups:
        # Initialization of parameters
        subgroups = init_subgroups(metadata)
        project_characteristics = metadata.columns
        used_characteristics = []

        # Start the subgroup generation using the characteristics
        for characteristic in characteristics:
            # If the characteristic is not in the project, we skip it
            if characteristic not in project_characteristics:
                continue

            # For each subgroup created, divide it using the current characteristic
            subgroups_aux = []
            for subgroup in subgroups:
                subgroup_aux = get_subgroups(subgroup, characteristic)

                subgroups_aux = subgroups_aux + subgroup_aux

            # Update parameters
            used_characteristics = used_characteristics + [characteristic]
            subgroups = subgroups_aux

            # If there are no subgroups left, stop
            if not subgroups:
                break

        row = create_row(project_ID, subgroups, used_characteristics)
        
        rows.append(row)
        subgroups_list.append(subgroups)
        
    row, index = best_subgroup_combination(rows)
    subgroups = subgroups_list[index]
    
    return row, subgroups

In [21]:
row, subgroups = get_groups_from_project_multiple(project_ID, characterictics_groups)

pd.DataFrame([row])

| | project_ID | num_subgroups | num_cells | characteristics_used |
| --- | --- | --- | --- | --- |
| 0 | E-CURD-46 | 33 | 62521 | organism/inferred cell type - ontology labels/... |


## Get groups from all projects

As we said, the main objective is to build a table with the groups we obtain when grouping the data by some characteristics. In this part, we are going to create this table.

### Get all the project IDs from the API

First of all, we have to get the IDs of all the projects. We can achieve that using the API as follows:

In [22]:
project_IDs = requests.get('http://localhost:5000/project/metadata/project_ID').json()
project_IDs[:5]

['005d611a-14d5-4fbf-846e-571a1f874f70',
 '027c51c6-0719-469f-a7f5-640fe57cbece',
 '091cf39b-01bc-42e5-9437-f419a66c8a45',
 '116965f3-f094-4769-9d28-ae675c1b569c',
 '1defdada-a365-44ad-9b29-443b06bd11d6']

### Defining the characteristics we want to use

In [23]:
common_characteristics = [
    'organism',
    'organism part',
    'sampling site',
    'biopsy site',
    'metastatic site',
    'developmental stage',
    'cell line',
    'disease'
]

characterictics_groups = [
    common_characteristics + ['cell type'],
    common_characteristics + ['inferred cell type - ontology labels'],
    common_characteristics + ['inferred cell type - authors labels'],
]

In [24]:
from IPython.display import clear_output

rows = []
n_projects = len(project_IDs)

for n, project_ID in enumerate(project_IDs):
    clear_output(wait=True)
    print(f"{n+1}/{n_projects}")
    
    row, subgroups = get_groups_from_project_multiple(project_ID, characterictics_groups)
    rows.append(row)

3/187


KeyboardInterrupt: 

In [None]:
# rows is already a list of dictionaries, one per project
pd.DataFrame(rows)
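
Because `get_groups_from_project_multiple` returns an empty list in place of the row when a project has no metadata, those entries may need to be filtered out before building the table. A minimal sketch of that filtering follows; `fake_get_groups` and the project IDs are hypothetical stand-ins for the real function and the API results.

```python
def fake_get_groups(project_ID, characteristics_groups):
    """Hypothetical stand-in: returns ([], []) when a project has no metadata."""
    if project_ID == 'no-metadata':
        return [], []
    row = {
        'project_ID': project_ID,
        'num_subgroups': 1,
        'num_cells': 10,
        'characteristics_used': 'organism',
    }
    return row, []


project_IDs_demo = ['A', 'no-metadata', 'B']

rows_demo = []
for project_ID in project_IDs_demo:
    row, _ = fake_get_groups(project_ID, [['organism']])
    # Keep only real rows; an empty list marks a project without metadata.
    if row:
        rows_demo.append(row)

print([row['project_ID'] for row in rows_demo])  # ['A', 'B']
```

The filtered list of dictionaries can then be passed directly to `pd.DataFrame`.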