# Evaluation of Patterns Obtained with BiGGEsTS

### 1. Introduction

This section covers the following steps:
- Importing the necessary libraries;
- Defining the required functions for this code file;
- Loading the necessary files and datasets.

In [4]:
import pandas as pd
from DI2 import *
from auxiliary import *

In [5]:
def print_scores_for_each_property_classic_approach(results_biclustering_paths, csv_dataset, outcome, property, techniques,
                                                    filter_biclusters_bool, support_value):
    
    """
    This function evaluates the bicluster groups obtained for a given bio-variable (cell density or aggregate size, for example), using the discriminative scores from the DISA tool listed below.

    Inputs:
        - results_biclustering_paths: a list with the file paths of the txt files (one per bicluster group) for a given bio-variable;
        - csv_dataset: the file path of the bio-variable dataset (the raw dataset);
        - outcome: the outcome column (for example, a pandas Series), with one value per row of the dataset;
        - property: a string with the name of the bio-variable, printed as a header;
        - techniques: a list of strings: each string describes the discretization, filter and sort techniques used in the BiGGEsTS application, so that the user can understand
        the differences between the groups of biclusters of the same bio-variable;
        - filter_biclusters_bool: a Boolean; if True, this function only evaluates the patterns whose support is equal to or higher than support_value. As written, the function
        expects True, because biclusters_idxs is only defined by the filtering step;
        - support_value: the minimum support, in percent (for example, 6 for 6%), used when filter_biclusters_bool is True.

    For each of the three outcome values (0, 1 and 2), this function prints the following discriminative scores:
        - Lift
        - Standardized lift
        - Confidence
    """

    data = pd.read_csv(csv_dataset, sep='\t')
    data = data.drop('Experiment', axis=1)
    data.columns = range(len(data.columns))
    data['values'] = outcome

    output_configurations = {
            "print_numeric_intervals": False
            }
            
    class_cat = discritizer_CM_content(data, 'values')
    class_information = {
        "values": class_cat["values"],
        "outcome_value": 2,
        "type": "Categorical"
        }
    
    support = support_value

    print("####################################### " + property + " ###############################################")
    print()
    
    for i in range(len(results_biclustering_paths)):
        print("#################################### Group " + str(i+1) + " ############################################")
        print('Discretization, sorting and filtering techniques:')
        print(techniques[i])
        print()
    
        biclusters = retrive_patterns(results_biclustering_paths[i], True)
        if filter_biclusters_bool:
            biclusters, biclusters_idxs = filter_patterns_support(biclusters, len(outcome), support)

        # Score the patterns against each outcome class in turn (2, then 1, then 0).
        for outcome_value in (2, 1, 0):
            class_information["outcome_value"] = outcome_value
            print("####################################### outcome_value = " + str(outcome_value) + " ###############################################")
            print()
            if len(biclusters) != 0:
                stats(data, biclusters, class_information, output_configurations, biclusters_idxs)

        print("####################################################################################################")
        print()

In [6]:
def get_y_from_a_bicluster(biclusters_results_file_path, bicluster_number, 
                           dataset_csv_file_path, outcome, discritize_CM_content):
    """
    This function returns the outcome (y) of the rows of a bicluster chosen by the user.
    Inputs:
        - biclusters_results_file_path: a string with the file path of the txt file that contains the biclusters information;
        - bicluster_number: the ID of the bicluster within its bicluster group, counted from 1;
        - dataset_csv_file_path: a string with the file path of the txt file that contains the bio-variable dataset;
        - outcome: the outcome column (for example, a pandas Series), with one value per row of the dataset;
        - discritize_CM_content: a Boolean; if True, the outcome is returned categorized (0 if y<=50, 1 if 50<y<80, 2 if y>=80); if False, it is returned as a percentage
        (a number between 0 and 100).
    """
    
    bicluster_dict = retrive_patterns(biclusters_results_file_path, False)
    data = pd.read_csv(dataset_csv_file_path, sep='\t')
    data = data.drop('Experiment', axis=1)
    data.columns = range(len(data.columns))
    data['values'] = outcome

    if discritize_CM_content:
        class_cat = discritizer_CM_content(data, 'values')
    else:
        class_cat = data

    y_values_from_a_bicluster = class_cat.loc[bicluster_dict[bicluster_number-1]['lines'], 'values']

    return y_values_from_a_bicluster

In [7]:
# File paths of the bio-variable datasets
csv_dataset_cell_density = 'Datasets csvs/Cell_density.txt'
csv_dataset_aggregate_size = 'Datasets csvs/Aggregate_size.txt'
csv_dataset_average_DO = 'Datasets csvs/Average_DO.txt'
csv_dataset_glucose_concentration = 'Datasets csvs/Glucose_concentration.txt'
csv_dataset_lactate_concentration = 'Datasets csvs/Lactate_concentration.txt'
csv_dataset_DO_cell_count = 'Datasets csvs/DO_concentration_cell_count.txt'
csv_dataset_DO_gradient_cell_count = 'Datasets csvs/DO_gradient_cell_count.txt'
csv_dataset_DO_2nd_derivative = 'Datasets csvs/DO_concentration_2ndDerivative.txt'
csv_dataset_average_pH = 'Datasets csvs/Average_pH.txt'
csv_dataset_average_pH_gradient = 'Datasets csvs/pH_Gradient.txt'


# File paths of the txt files that contain the bicluster information
biclusters_cell_density_file_paths = ['BiGGEstTS results/Cell Density/Group 1/Biclusters Information.txt', 
                                     'BiGGEstTS results/Cell Density/Group 2/Biclusters Information.txt',
                                     'BiGGEstTS results/Cell Density/Group 3/Biclusters Information.txt']

biclusters_agg_size_file_paths = ['BiGGEstTS results/Aggregate Size/Group 1/Biclusters information.txt',
                                  'BiGGEstTS results/Aggregate Size/Group 2/Biclusters information.txt']

biclusters_average_DO_file_paths = ['BiGGEstTS results/Average DO/Group 1/Biclusters information.txt',
                                    'BiGGEstTS results/Average DO/Group 2/Biclusters information.txt',
                                    'BiGGEstTS results/Average DO/Group 3/Biclusters information.txt',
                                    'BiGGEstTS results/Average DO/Group 4/Biclusters information.txt']

biclusters_glucose_concentration_file_paths = ['BiGGEstTS results/Glucose Concentration/Group 1/Biclusters information.txt']

biclusters_lactate_concentration_file_paths = ['BiGGEstTS results/Lactate Concentration/Group 1/Biclusters information.txt',
                                               'BiGGEstTS results/Lactate Concentration/Group 2/Biclusters information.txt']

biclusters_DO_cell_count_file_paths = ['BiGGEstTS results/DO Concentration: Cell Count/Group 1/Biclusters information.txt',
                                       'BiGGEstTS results/DO Concentration: Cell Count/Group 2/Biclusters information.txt']

biclusters_DO_gradient_cell_count_paths = ['BiGGEstTS results/DO Concentration Gradient: Cell Count/Group 1/Biclusters information.txt',
                                           'BiGGEstTS results/DO Concentration Gradient: Cell Count/Group 2/Biclusters information.txt']

biclusters_DO_2nd_derivative_paths = ['BiGGEstTS results/DO Concentration 2nd Derivative/Group 1/Biclusters information.txt']

biclusters_average_pH_paths = ['BiGGEstTS results/Average pH/Group 1/Biclusters information.txt']

biclusters_average_pH_gradient_paths = ['BiGGEstTS results/Average pH Gradient/Group 1/Biclusters information.txt']


# Description of the discretization, filtering and sorting techniques used in biclustering                 
biclustering_cell_density_techniques = ['Discretization- equal frequency 3 symbols; Filter- biclusters with row and column similarity > 25%; Sort- p-value (ascending).',
                                        'Discretization- equal frequency 3 symbols; Filter- statistical significance, p-value 0.01; Sort- p-value (ascending).',
                                        'Discretization- variation between time points (3 symbols); Filter- statistical significance, p-value 0.01; Sort- p-value (ascending).']

biclusters_agg_size_techniques = ['Discretization- equal frequency 3 symbols; Filter- statistical significance, p-value 0.01; Sort- p-value (ascending).',
                                  'Discretization- variation between time points (3 symbols); Filter- statistical significance, p-value 0.01; Sort- p-value (ascending).']

biclusters_average_DO_techniques = ['Discretization- equal frequency 3 symbols; Filter- statistical significance, p-value 0.01; Sort- p-value (ascending).',
                                    'Discretization- equal frequency 5 symbols; Filter- statistical significance, p-value 0.01; Sort- p-value (ascending).',
                                    'Discretization- equal frequency 7 symbols; Filter- statistical significance, p-value 0.01; Sort- p-value (ascending).',
                                    'Discretization- variation between time points (3 symbols); Filter- statistical significance, p-value 0.01; Sort- p-value (ascending).']

biclusters_glucose_concentration_techniques = ['Discretization- equal frequency 3 symbols; Filter- statistical significance, p-value 0.01; Sort- p-value (ascending).']

biclusters_lactate_concentration_techniques = ['Discretization- equal frequency 3 symbols; Filter- statistical significance, p-value 0.01; Sort- p-value (ascending).',
                                               'Discretization- equal frequency 5 symbols; Filter- statistical significance, p-value 0.01; Sort- p-value (ascending).']

biclusters_DO_cell_count_techniques = ['Discretization- equal frequency 3 symbols; Filter- statistical significance, p-value 0.01; Sort- p-value (ascending).',
                                       'Discretization- equal frequency 5 symbols; Filter- statistical significance, p-value 0.01; Sort- p-value (ascending).']

biclusters_DO_gradient_cell_count_techniques = ['Discretization- equal frequency 3 symbols; Filter- statistical significance, p-value 0.01; Sort- p-value (ascending).',
                                                'Discretization- equal frequency 5 symbols; Filter- statistical significance, p-value 0.01; Sort- p-value (ascending).']

biclusters_DO_2nd_derivative_techniques = ['Discretization- equal frequency 3 symbols; Filter- statistical significance, p-value 0.05; Sort- p-value (ascending).']

biclusters_average_pH_techniques = ['Discretization- equal frequency (3 symbols); Filter- statistical significance, p-value 0.01; Sort- p-value (ascending).']

biclusters_average_pH_gradient_techniques = ['Discretization- equal frequency (3 symbols); Filter- statistical significance, p-value 0.01; Sort- p-value (ascending).']


# Outcome
y_file_path = 'Datasets csvs/CM_content_dd10.txt'
y = pd.read_csv(y_file_path, sep='\t')
y = y.drop('Experiment', axis=1)
y.reset_index(drop=True, inplace=True)
y.columns = ['values']

### 2. Apply DISA Tool Using the Classic Distribution (3 bins)

In this section, the Classic approach from DISA is applied to all bio-variables: cell density, aggregate size, and so on.
In this approach, the biclusters found earlier are evaluated against the categorized outcome y (percentage of CM content). The outcome is categorized into 3 bins:

- if y<=50, it is considered a bad observation, class 0;
- if 50<y<80, it is considered a good observation, class 1;
- if y>=80, it is considered a very good observation, class 2.

The results are then discussed, and the most important groups of patterns are selected.
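
The three bins can be written out as a small function. This is only a minimal sketch of the thresholds listed above: `categorize_cm_content` is a hypothetical helper (it is not the `discritizer_CM_content` function used in this notebook), and the example percentages are made up for illustration.

```python
def categorize_cm_content(y_percent):
    """Map a CM content percentage to class 0, 1 or 2 using the three bins."""
    if y_percent <= 50:
        return 0  # bad observation
    if y_percent < 80:
        return 1  # good observation
    return 2      # very good observation


# Hypothetical CM content percentages, for illustration only.
example_outcomes = [35.0, 50.0, 62.5, 79.9, 80.0, 93.1]
example_classes = [categorize_cm_content(v) for v in example_outcomes]
print(example_classes)  # [0, 0, 1, 1, 2, 2]
```

Note that the boundary values fall into the lower class at 50 (class 0) and into the upper class at 80 (class 2).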

In [8]:
results_file_paths = [biclusters_cell_density_file_paths, biclusters_agg_size_file_paths, biclusters_average_DO_file_paths, biclusters_glucose_concentration_file_paths,
                      biclusters_lactate_concentration_file_paths, biclusters_DO_cell_count_file_paths, biclusters_DO_gradient_cell_count_paths, biclusters_DO_2nd_derivative_paths,
                      biclusters_average_pH_paths, biclusters_average_pH_gradient_paths]
csv_dataset_paths = [csv_dataset_cell_density, csv_dataset_aggregate_size, csv_dataset_average_DO, csv_dataset_glucose_concentration, csv_dataset_lactate_concentration,
                     csv_dataset_DO_cell_count, csv_dataset_DO_gradient_cell_count, csv_dataset_DO_2nd_derivative, csv_dataset_average_pH, csv_dataset_average_pH_gradient]
properties = ['CELL DENSITY', 'AGGREGATE SIZE', 'AVERAGE DO', 'GLUCOSE CONCENTRATION', 'LACTATE CONCENTRATION', 'DO CONCENTRATION/CELL COUNT',
            'DO CONCENTRATION GRADIENT/ CELL COUNT', 'DO CONCENTRATION 2ND DERIVATIVE', 'AVERAGE PH', 'AVERAGE PH GRADIENT']
techniques_lists = [biclustering_cell_density_techniques, biclusters_agg_size_techniques, biclusters_average_DO_techniques, biclusters_glucose_concentration_techniques,
                    biclusters_lactate_concentration_techniques, biclusters_DO_cell_count_techniques, biclusters_DO_gradient_cell_count_techniques,
                    biclusters_DO_2nd_derivative_techniques, biclusters_average_pH_techniques, biclusters_average_pH_gradient_techniques]

# Run the function for each bio-variable in a for loop
for results_file_path, csv_dataset_path, property, techniques_list in zip(results_file_paths, csv_dataset_paths, properties, techniques_lists):
    print_scores_for_each_property_classic_approach(results_file_path, csv_dataset_path, y['values'], property, techniques_list, filter_biclusters_bool= True, support_value=6)

####################################### CELL DENSITY ###############################################

#################################### Group 1 ############################################
Discretization, sorting and filtering techniques:
Discretization- equal frequency 3 symbols; Filter- biclusters with row and column similarity > 25%; Sort- p-value (ascending).

Total number of bics
31
Average number of columns
3.064516129032258
Standard deviation of columns
1.4576933584122587
Average number of rows
9.419354838709678
Standard deviation of rows
9.472453078569584

After filtering the biclusters with support >=  6 %:

Total number of bics
21
Average number of columns
2.3333333333333335
Standard deviation of columns
0.7126966450997985
Average number of rows
12.761904761904763
Standard deviation of rows
9.884594853584092
####################################### outcome_value = 2 ###############################################

[Remaining output truncated]

### 3. Patterns that Discriminate a Single Outcome

Only one pattern discriminates outcome value 0 alone:
- Cell Density Group 1: Pattern 1

No pattern discriminates outcome value 1 alone.

Three patterns discriminate outcome value 2 alone:

- Cell Density Group 1: Pattern 22
- DO concentration gradient/ cell count Group 2: Pattern 3
- Average pH gradient Group 1: Pattern 5

### 4. Results 1 and Results 1.1

Results 1 comprises the patterns that meet all of the following criteria:
- Has a support equal to or higher than 6% (at least 4 rows);
- Discriminates either outcome value 2 alone, or both outcome values 2 and 1;
- Has a lift (for outcome value 2) equal to or higher than 1.2.

Results 1 excludes the patterns from the pH variables. Results 1.1 contains the patterns from Results 1 plus the patterns from the pH variables.
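
The selection criteria above can be sketched as a simple filter. This is an illustrative sketch only: the dictionary layout (`n_rows`, `outcomes`, `lift_2`), the pattern names, the scores and the dataset size `N_ROWS` are all hypothetical and are not taken from the DISA output.

```python
def meets_results_1_criteria(pattern, n_rows, min_support_pct=6, min_lift=1.2):
    """Return True if a pattern satisfies the three Results 1 criteria."""
    # Support: share of dataset rows covered by the pattern, compared in
    # integer arithmetic to avoid floating-point rounding at the threshold.
    has_support = pattern["n_rows"] * 100 >= min_support_pct * n_rows
    # Outcomes discriminated: only class 2, or classes 1 and 2.
    has_right_outcomes = set(pattern["outcomes"]) in ({2}, {1, 2})
    # Lift for outcome value 2.
    has_lift = pattern["lift_2"] >= min_lift
    return has_support and has_right_outcomes and has_lift


# Hypothetical patterns and dataset size, for illustration only.
N_ROWS = 60
patterns = [
    {"name": "A", "n_rows": 4, "outcomes": [2], "lift_2": 1.5},
    {"name": "B", "n_rows": 12, "outcomes": [0, 1, 2], "lift_2": 1.3},
    {"name": "C", "n_rows": 3, "outcomes": [1, 2], "lift_2": 1.4},
    {"name": "D", "n_rows": 7, "outcomes": [1, 2], "lift_2": 1.1},
]
selected = [p["name"] for p in patterns if meets_results_1_criteria(p, N_ROWS)]
print(selected)  # ['A']
```

In this made-up example, B is rejected because it also discriminates outcome value 0, C because its support is too low, and D because its lift is below 1.2.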

The results are shown in this format: 
***Pattern number (outcome(s) discriminated) (number of rows)***

**Cell Density**
- Group 1:
    - P22 (2) (4 rows)
    - P30 (1 & 2) (4 rows)

**Average DO**
- Group 2:
    - P5 (1 & 2) (7 rows)

- Group 3:
    - P5 (1 & 2) (7 rows)

- Group 4:
    - P1 (1 & 2) (5 rows)

**DO concentration gradient/ cell count**
- Group 2:
    - P3 (2) (4 rows)


Results 1.1 contains the patterns above plus the pH patterns below:

**Average pH**
- Group 1:
    - P1 (1 & 2) (5 rows)

**Average pH gradient**
- Group 1:
    - P1 (1 & 2) (4 rows)
    - P4 (1 & 2) (5 rows)
    - P5 (2) (4 rows)
    - P8 (1 & 2) (5 rows)


### 5. Results 2 and Results 2.1

Results 2 comprises the patterns that meet all of the following criteria:
- Has a support equal to or higher than 6% (at least 4 rows);
- Discriminates outcome value 2 alone, both outcome values 2 and 1, or outcome values 2, 1 and 0;
- Has a lift (for outcome value 2) equal to or higher than 1.2;
- If it discriminates outcome value 0, its confidence for outcome value 0 is equal to or lower than 10%.

Results 2 excludes the patterns from the pH variables. Results 2.1 contains the patterns from Results 2 plus the patterns from the pH variables.
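
To make the lift and confidence thresholds concrete, the sketch below computes both scores from class labels using the standard association-rule definitions (confidence is the share of the pattern's rows in the target class; lift is that confidence divided by the share of the target class in the whole dataset). DISA's own implementation may differ in detail, the standardized lift is not covered here, and the labels are hypothetical.

```python
def confidence(pattern_classes, target):
    """Fraction of the pattern's rows that belong to the target class."""
    return pattern_classes.count(target) / len(pattern_classes)


def lift(pattern_classes, all_classes, target):
    """Confidence divided by the target class frequency in the whole dataset."""
    prior = all_classes.count(target) / len(all_classes)
    return confidence(pattern_classes, target) / prior


# Hypothetical class labels, for illustration only.
all_classes = [0] * 8 + [1] * 6 + [2] * 6      # 20 rows in the dataset
pattern_classes = [2] * 5 + [1] * 4 + [0] * 1  # 10 rows covered by the pattern

lift_2 = lift(pattern_classes, all_classes, target=2)  # 0.5 / 0.3
confidence_0 = confidence(pattern_classes, target=0)   # 1 / 10

# Lift and confidence criteria of Results 2 for a pattern that covers class 0.
passes = lift_2 >= 1.2 and confidence_0 <= 0.10
print(round(lift_2, 3), confidence_0, passes)
```

A pattern like this one covers a few class 0 rows, but it is still kept because class 2 is strongly over-represented and class 0 accounts for no more than 10% of its rows.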

The results are shown in this format: ***Pattern number (outcome(s) discriminated) (number of rows)***

**Cell Density**
- Group 1:
    - P22 (2) (4 rows)
    - P30 (1 & 2) (4 rows)

- Group 2: 
    - P3 (2 & 1 & 0) (12 rows)
    - P9 (2 & 1 & 0) (17 rows)

**Average DO**
- Group 2: 
    - P4 (2 & 1 & 0) (11 rows)
    - P5 (1 & 2) (7 rows)
    - P8 (2 & 1 & 0) (19 rows)
    - P10 (2 & 1 & 0) (16 rows)

- Group 3: 
    - P4 (2 & 1 & 0) (14 rows)
    - P5 (1 & 2) (7 rows)

- Group 4:
    - P1 (1 & 2) (5 rows)

**DO concentration gradient/ cell count**
- Group 2:
    - P3 (2) (4 rows)

**Glucose concentration**
- Group 1:
    - P9 (2 & 1 & 0) (18 rows)


Results 2.1 contains the patterns above plus the pH patterns below:

**Average pH**
- Group 1: 
    - P1 (1 & 2) (5 rows)

**Average pH gradient**
- Group 1:
    - P1 (1 & 2) (4 rows)
    - P4 (1 & 2) (5 rows)
    - P5 (2) (4 rows)
    - P7 (1 & 2 & 0) (15 rows)
    - P8 (1 & 2) (5 rows)
    - P13 (1 & 2 & 0) (13 rows)
    - P15 (1 & 2 & 0) (14 rows)