# Generating bulk signatures

Gregory Way, 2020

We fit an analysis of variance (ANOVA) model to each feature.

We use the ANOVA results to determine whether technical metadata factors or real biology contribute to observed differences in Cell Painting features.

## Datasets

We generate a bulk signature for three datasets: CloneAE (bortezomib), ixazomib, and CB-5083.
We apply the following procedure for all datasets.

## Data processing

Before building signatures, we perform the following operations to prepare profiles for analysis.

1. Load profiles
2. Filter profile data
    * Remove features as determined by our feature selection procedure in `0.compile-bulk-datasets`
    * Remove wildtype parental lines (to avoid isolating a clonal selection signature)
      * Note: in the cloneAE dataset the only wildtype samples are WT_parental lines, so we do not remove them there
    * Select only samples that were treated with 0.1% DMSO
3. Perform signature building operation

## Signature building

We use the following approach to identify features that are most different between wildtype and resistant clones.
We apply this approach for each CellProfiler feature independently.

1. Perform an ANOVA for the following factors: Plate, batch, clone ID, and resistance status
2. Apply a Tukey's HSD post hoc test to determine which pairwise comparisons in the ANOVA were driving the significance.
3. Remove features:
    * If the feature was significantly contributing to plate effects
    * If the feature was significantly contributing to batch effects
    * If the feature was significantly different within clones of the same resistance type (e.g. if a feature was significant for Resistant Clone A vs. Resistant Clone B). These are non-specific features.
4. Form the signature
    * The signature is all features that are significantly different for the resistance status factor _and not removed via the procedure above_
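
As a minimal sketch of steps 1 and 2 for a single feature, the following uses base R's `aov()` and `TukeyHSD()` on simulated data. The data frame `toy_df`, the feature `feature_x`, and the simulated effect size are assumptions made for illustration; the notebook's own implementation lives in `scripts/signature_utils.R` and may differ.

```r
# Simulated data (hypothetical): 24 profiles, 2 batches, 2 resistance states
set.seed(1)
toy_df <- data.frame(
    Metadata_batch = factor(rep(c("batch1", "batch2"), each = 12)),
    Metadata_clone_type_indicator = factor(rep(c("0", "1"), times = 12))
)

# Simulate a feature with a strong resistance effect and no batch effect
toy_df$feature_x <- rnorm(24) + 3 * (toy_df$Metadata_clone_type_indicator == "1")

# Step 1: fit the ANOVA for this single feature
fit <- aov(feature_x ~ Metadata_clone_type_indicator + Metadata_batch, data = toy_df)

# Step 2: Tukey's HSD returns one table of pairwise comparisons per factor
tukey <- TukeyHSD(fit)
status_p <- tukey$Metadata_clone_type_indicator[, "p adj"]
batch_p <- tukey$Metadata_batch[, "p adj"]
print(c(status = unname(status_p), batch = unname(batch_p)))
```

In the real analysis this is repeated for every CellProfiler feature, and the adjusted p values for each factor decide whether a feature is kept or removed in steps 3 and 4.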

In [1]:
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(broom))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(ggrepel))

source(file.path("scripts", "signature_utils.R"))

In [2]:
set.seed(123)

datasets <- c(
    "cloneAE",
    "ixazomib",
    "cb5083"
)

input_data_dir <- "data"
data_file <- file.path(input_data_dir, "bulk_profiles_analytical_set.csv.gz")

output_fig_dir <- file.path("figures", "anova")
output_results_dir <- file.path("results", "signatures")

In [3]:
# Load the features retained by feature selection for each dataset
feat_file <- file.path(input_data_dir, "dataset_features_selected.tsv")
all_selected_features_df <- readr::read_tsv(feat_file, col_types = readr::cols())
head(all_selected_features_df, 3)

features,dataset
<chr>,<chr>
Cells_AreaShape_Compactness,cloneAE
Cells_AreaShape_Extent,cloneAE
Cells_AreaShape_Orientation,cloneAE


In [4]:
# Load profiles
bulk_col_types <- readr::cols(
    .default = readr::col_double(),
    Metadata_Plate = readr::col_character(),
    Metadata_Well = readr::col_character(),
    Metadata_batch = readr::col_character(),
    Metadata_clone_number = readr::col_character(),
    Metadata_plate_map_name = readr::col_character(),
    Metadata_treatment = readr::col_character(),
    Metadata_dataset = readr::col_character(),
    Metadata_clone_type = readr::col_character(),
    Metadata_clone_type_indicator = readr::col_character(),
    Metadata_model_split = readr::col_character(),
    Metadata_cell_density = readr::col_character(),
    Metadata_plate_filename = readr::col_character(),
    Metadata_treatment_time = readr::col_character(),
    Metadata_unique_sample_name = readr::col_character(),
    Metadata_time_to_adhere = readr::col_character()
)

bulk_df <- readr::read_csv(data_file, col_types = bulk_col_types)

print(dim(bulk_df))
head(bulk_df, 4)

[1]  612 3547


Metadata_Plate,Metadata_Well,Metadata_batch,Metadata_clone_number,Metadata_plate_map_name,Metadata_treatment,Metadata_dataset,Metadata_clone_type,Metadata_clone_type_indicator,Metadata_model_split,⋯,Nuclei_Texture_Variance_RNA_10_02,Nuclei_Texture_Variance_RNA_10_03,Nuclei_Texture_Variance_RNA_20_00,Nuclei_Texture_Variance_RNA_20_01,Nuclei_Texture_Variance_RNA_20_02,Nuclei_Texture_Variance_RNA_20_03,Nuclei_Texture_Variance_RNA_5_00,Nuclei_Texture_Variance_RNA_5_01,Nuclei_Texture_Variance_RNA_5_02,Nuclei_Texture_Variance_RNA_5_03
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
HCT116bortezomib,B03,2019_02_15_Batch1_20X,WT_parental,PlateMap_HCT116bortezomib,0.1% DMSO,cloneAE,sensitive,0,training,⋯,-0.793648,-0.7971384,-0.8497648,-0.8422521,-0.8371976,-0.8360676,-0.7911125,-0.7921908,-0.7929302,-0.7948077
HCT116bortezomib,B04,2019_02_15_Batch1_20X,WT_parental,PlateMap_HCT116bortezomib,0.1% DMSO,cloneAE,sensitive,0,training,⋯,-0.6146661,-0.6211401,-0.6681232,-0.6482634,-0.6645153,-0.6630872,-0.6084365,-0.6124887,-0.6099403,-0.6126431
HCT116bortezomib,B05,2019_02_15_Batch1_20X,WT_parental,PlateMap_HCT116bortezomib,0.1% DMSO,cloneAE,sensitive,0,training,⋯,-0.7211521,-0.726177,-0.7334183,-0.7047708,-0.7273677,-0.7210842,-0.7165588,-0.7199086,-0.7220983,-0.721685
HCT116bortezomib,B06,2019_02_15_Batch1_20X,CloneA,PlateMap_HCT116bortezomib,0.1% DMSO,cloneAE,resistant,1,training,⋯,-0.9428608,-0.9424811,-0.9696387,-0.9835661,-0.9495513,-0.9579973,-0.9454605,-0.9447918,-0.9505914,-0.9474764


## Subset data for signature building effort

We are building our signature using only a subset of the data in `bulk_df`.
We isolate profiles based on the following criteria:

* Profiles in the training set (as defined in `0.compile-bulk-datasets.ipynb`)
* Profiles treated with 0.1% DMSO (we are only interested in baseline differences between resistant/sensitive clones)
* Profiles that are clones (not wildtype parental lines)
  * We do not want to isolate a signature of clonal selection
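
The three criteria can be expressed as a single row filter. This is a sketch on a hypothetical five-row data frame (`toy_bulk_df`), written in base R so that it runs without extra packages. The column names mirror those in `bulk_df`; the non-DMSO treatment label and the `"test"` split value are invented. As noted in the data processing section, a real filter would need to keep `WT_parental` rows for the cloneAE dataset.

```r
# Hypothetical profiles: only rows 2 and 3 satisfy all three criteria
toy_bulk_df <- data.frame(
    Metadata_clone_number = c(
        "WT_parental", "WT clone 04", "Ixazomib clone 01",
        "Ixazomib clone 01", "Ixazomib clone 02"
    ),
    Metadata_treatment = c(
        "0.1% DMSO", "0.1% DMSO", "0.1% DMSO",
        "other treatment", "0.1% DMSO"
    ),
    Metadata_model_split = c("training", "training", "training", "training", "test"),
    stringsAsFactors = FALSE
)

# Training set, DMSO-treated, and not a wildtype parental line
keep <- with(
    toy_bulk_df,
    Metadata_model_split == "training" &
        Metadata_treatment == "0.1% DMSO" &
        Metadata_clone_number != "WT_parental"
)
subset_df <- toy_bulk_df[keep, ]
print(subset_df)
```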

In [5]:
training_data <- list()
for (dataset in datasets) {
    # Subset dataset
    bulk_subset_df <- bulk_df %>%
        dplyr::filter(
            Metadata_dataset == !!dataset,
            Metadata_model_split == "training"
        )
    
    # Apply feature selection performed in 0.compile-bulk-datasets
    selected_features <- all_selected_features_df %>%
        dplyr::filter(dataset == !!dataset) %>%
        dplyr::pull(features)
    
    bulk_subset_df <- bulk_subset_df %>%
        dplyr::select(starts_with("Metadata"), all_of(selected_features))
    
    # Populate the list for signature building
    bulk_subset_df$Metadata_clone_type_indicator <- factor(
        bulk_subset_df$Metadata_clone_type_indicator, levels = c("0", "1")
    )
    training_data[[dataset]] <- bulk_subset_df
    
    # Print dataset description
    print(paste("Training dataset:", dataset))
    print(table(
        bulk_subset_df$Metadata_clone_number,
        bulk_subset_df$Metadata_batch
    ))
}

[1] "Training dataset: cloneAE"
             
              2019_02_15_Batch1_20X 2019_03_20_Batch2 2020_07_02_Batch8
  CloneA                          3                 2                 3
  CloneE                          2                 2                 4
  WT_parental                     3                 3                 2
[1] "Training dataset: ixazomib"
                   
                    2020_08_24_Batch9 2020_09_08_Batch10
  Ixazomib clone 01                 5                  4
  Ixazomib clone 02                 4                  5
  Ixazomib clone 03                 6                  3
  Ixazomib clone 04                 4                  5
  Ixazomib clone 05                 5                  4
  WT clone 04                       5                  4
  WT clone 05                       4                  5
  WT clone 06                       5                  4
  WT clone 07                       4                  5
[1] "Training dataset: cb5083"
            

## Perform ANOVA and TukeyHSD

### Ignore features

Now, we build the signatures.
Our aim is to determine the CellProfiler features that most differentiate sensitive vs. resistant clones.
However, we do not want to include features that only appear different because of technical artifacts.
Therefore, we include the following terms in the ANOVA model:

* Metadata_batch
* Metadata_Plate

We want to ignore features that are different when comparing profiles across batch and plate.
We also want to remove _some_ of the features that are different for:

* Metadata_clone_number

Because we are interested in building a generic signature of resistance, we do not want to include features that are different _within clones of the same resistance status_.
In other words, some features will be different between two clones that are both resistant (or two clones that are both wildtype).
These features are not specific to the core resistance signature, but instead belong to some other factor of population difference.
We ignore these features.
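
One way to tell within-type comparisons from between-type comparisons is to count how often a wildtype label appears in each Tukey comparison name. The sketch below assumes, as in the ixazomib clone names shown earlier, that wildtype clone names contain the string `WT` and resistant clone names do not. The comparison strings are made up for illustration, and base R string functions are used here in place of `stringr`.

```r
# Hypothetical Tukey HSD comparison names for the clone number factor
comparisons <- c(
    "Ixazomib clone 02-Ixazomib clone 01",
    "WT clone 04-Ixazomib clone 01",
    "WT clone 05-WT clone 04"
)

# Count occurrences of "WT" in each comparison name
wt_clone_count <- lengths(
    regmatches(comparisons, gregexpr("WT", comparisons, fixed = TRUE))
)

# Exactly one "WT" means wildtype vs. resistant; zero or two means like vs. like
comparison_type <- ifelse(wt_clone_count == 1, "between types", "within type")
print(data.frame(comparisons, wt_clone_count, comparison_type))
```

Only the "within type" comparisons are candidates for the non-specific exclusion; a significant "between types" comparison is the kind of difference the signature is meant to capture.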

### Isolate features

We also include the biological factor we care about and want to isolate:

* Metadata_clone_type_indicator

This variable is a factor that indicates if the clone is sensitive (0) or resistant (1).
After removing the features that vary by plate, by batch, or between clones of the same type, we select the features that differ significantly by resistance status.
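
Because the model is fit once per feature, the right-hand side of the formula stays fixed while the response changes. The sketch below shows one plausible way to combine a feature name with the model terms. The helper name `build_feature_formula` is hypothetical, and `perform_anova()` in `scripts/signature_utils.R` may assemble its formula differently.

```r
# Right-hand side shared by every per-feature model
rhs_terms <- paste(
    "~",
    "Metadata_clone_type_indicator", "+",
    "Metadata_batch", "+",
    "Metadata_Plate", "+",
    "Metadata_clone_number"
)

# Hypothetical helper: prepend the feature name as the response variable
build_feature_formula <- function(feature, rhs) {
    as.formula(paste(feature, rhs))
}

feature_formula <- build_feature_formula("Cells_AreaShape_Extent", rhs_terms)
print(feature_formula)
print(all.vars(feature_formula))
```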

In [6]:
formula_terms <- paste(
    "~",
    "Metadata_clone_type_indicator", "+",
    "Metadata_batch", "+",
    "Metadata_Plate", "+",
    "Metadata_clone_number"
)

In [7]:
# Fit the ANOVA model and perform Tukey HSD
lm_results <- list()
tukey_results <- list()
for (dataset in datasets) {
    print(paste("Now processing...", dataset))
    
    # Extract the dataset used to train
    analytical_df <- training_data[[dataset]]
    
    # Fit linear model to determine sources of variation and process results
    lm_results[[dataset]] <- perform_anova(analytical_df, formula_terms)
    
    # Order the full results data frame by significance and extract feature names
    full_results_df <- lm_results[[dataset]][["full_results_df"]] %>%
        dplyr::arrange(desc(neg_log_p))
    
    features <- unique(full_results_df$feature)
    
    # Perform TukeyHSD posthoc test
    tukey_results[[dataset]] <- process_tukey(
        aov_list = lm_results[[dataset]][["aovs"]],
        features = features
    )
}

[1] "Now processing... cloneAE"
[1] "Now processing... ixazomib"
[1] "Now processing... cb5083"


## Build signatures and save intermediate and signature files

We perform the signature building activity here and save three files to disk:

* ANOVA results for all three datasets
* Tukey HSD results for all three datasets
* Final bulk signatures for all three datasets

In [8]:
all_anova_results <- list()
all_tukey_results <- list()
all_signature_results <- list()
for (dataset in datasets) {
    # Process ANOVA results
    anova_results_df <- lm_results[[dataset]][["full_results_df"]] %>%
        dplyr::mutate(dataset = dataset)

    all_anova_results[[dataset]] <- anova_results_df
    
    # Process tukey results
    tukey_results_df <- tukey_results[[dataset]] %>%
        dplyr::mutate(dataset = dataset)
    
    all_tukey_results[[dataset]] <- tukey_results_df
    
    # Build signature
    features <- unique(anova_results_df$feature)

    # Note that the TukeyHSD() p value is already adjusted for multiple comparisons within
    # each term, but not across features, so we apply a Bonferroni cutoff across features
    num_cp_features <- length(features)
    signif_line <- -log10(0.05 / num_cp_features)

    # Derive signature by systematically removing features influenced by technical artifacts
    signature_features <- tukey_results_df %>%
        dplyr::filter(term == "Metadata_clone_type_indicator", neg_log_adj_p > !!signif_line) %>%
        dplyr::pull(feature)

    feature_exclude_plate <- tukey_results_df %>%
        dplyr::filter(term == "Metadata_Plate", neg_log_adj_p > !!signif_line) %>%
        dplyr::pull(feature)

    feature_exclude_batch <- tukey_results_df %>%
        dplyr::filter(term == "Metadata_batch", neg_log_adj_p > !!signif_line) %>%
        dplyr::pull(feature)

    # Determine if the clone number comparison is between like-clones by counting "WT"
    # in the comparison name: 0 or 2 means same type, 1 means wildtype vs. resistant
    wt_clone_count <- stringr::str_count(
        tukey_results_df %>%
        dplyr::filter(term == "Metadata_clone_number") %>%
        dplyr::pull("comparison"), "WT"
    )

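    # Exclude features that differ strongly between clones of the same sensitivity type
    # (stricter cutoff of 15x the significance line; wt_clone_count != 1 keeps like-clone pairs)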
    feature_exclude_nonspecific_variation <- tukey_results_df %>%
        dplyr::filter(term == "Metadata_clone_number") %>%
        dplyr::mutate(wt_clone_count = wt_clone_count) %>%
        dplyr::filter(neg_log_adj_p > !!signif_line * 15, wt_clone_count != 1) %>%
        dplyr::pull(feature)

    final_signature_features <- setdiff(
        signature_features, unique(feature_exclude_plate)
    )
    final_signature_features <- setdiff(
        final_signature_features, unique(feature_exclude_batch)
    )
    final_signature_features <- setdiff(
        final_signature_features, unique(feature_exclude_nonspecific_variation)
    )
    
    # Create a summary of the signatures
    signature_summary_df <- tibble(signature_features)

    signature_summary_df <- signature_summary_df %>%
        dplyr::mutate(
            plate_exclude = signature_summary_df$signature_features %in% feature_exclude_plate,
            batch_exclude = signature_summary_df$signature_features %in% feature_exclude_batch,
            non_specific_exclude = signature_summary_df$signature_features %in% feature_exclude_nonspecific_variation,
            final_signature = signature_summary_df$signature_features %in% final_signature_features,
            dataset = dataset
        )
    
    all_signature_results[[dataset]] <- signature_summary_df
}
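
To make the thresholding and exclusion logic concrete, the sketch below reproduces it on a small hand-made table. The feature names, the `neg_log_adj_p` values, and the assumption of 100 tested features are invented for illustration; only the Bonferroni-style cutoff and the `setdiff()` pattern come from the loop above.

```r
# Bonferroni cutoff on the -log10 scale, assuming 100 tested features
num_cp_features <- 100
signif_line <- -log10(0.05 / num_cp_features)

# Hypothetical Tukey HSD results in the same long format as tukey_results_df
toy_tukey_df <- data.frame(
    term = c(
        "Metadata_clone_type_indicator", "Metadata_clone_type_indicator",
        "Metadata_clone_type_indicator", "Metadata_batch"
    ),
    feature = c("feature_a", "feature_b", "feature_c", "feature_b"),
    neg_log_adj_p = c(6.0, 5.0, 1.0, 4.5),
    stringsAsFactors = FALSE
)

# Keep only comparisons that pass the cutoff
signif_df <- toy_tukey_df[toy_tukey_df$neg_log_adj_p > signif_line, ]

# Candidate signature features, and features flagged for a batch effect
signature_features <- signif_df$feature[signif_df$term == "Metadata_clone_type_indicator"]
feature_exclude_batch <- signif_df$feature[signif_df$term == "Metadata_batch"]

# Remove flagged features from the candidate signature
final_signature_features <- setdiff(signature_features, unique(feature_exclude_batch))
print(final_signature_features)
```

Here `feature_c` never enters the candidate list because it misses the cutoff, and `feature_b` is dropped because it is also significant for batch.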

In [9]:
# Output files
anova_output_file <- file.path(output_results_dir, "anova_results_full_bulk_signature.tsv.gz")
tukey_output_file <- file.path(output_results_dir, "tukey_results_full_bulk_signature.tsv.gz")
signature_output_file <- file.path(output_results_dir, "signature_summary_full_bulk_signature.tsv")

dplyr::bind_rows(all_anova_results) %>% readr::write_tsv(anova_output_file)
dplyr::bind_rows(all_tukey_results) %>% readr::write_tsv(tukey_output_file)
dplyr::bind_rows(all_signature_results) %>% readr::write_tsv(signature_output_file)
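
Downstream analyses typically need only the final signature for one dataset. The following sketch shows how the summary table could be filtered after reading it back. It writes a tiny stand-in file to a temporary path so that it runs on its own; the rows are invented, and only the column names `signature_features`, `final_signature`, and `dataset` are taken from the summary built above.

```r
# Stand-in for the signature summary file (hypothetical rows)
toy_summary_df <- data.frame(
    signature_features = c("feature_a", "feature_b", "feature_c"),
    final_signature = c(TRUE, FALSE, TRUE),
    dataset = c("cloneAE", "cloneAE", "ixazomib"),
    stringsAsFactors = FALSE
)
toy_file <- tempfile(fileext = ".tsv")
write.table(toy_summary_df, toy_file, sep = "\t", row.names = FALSE, quote = FALSE)

# Read the table back and keep the final signature for a single dataset
summary_df <- read.delim(toy_file, stringsAsFactors = FALSE)
clone_ae_signature <- summary_df$signature_features[
    summary_df$final_signature & summary_df$dataset == "cloneAE"
]
print(clone_ae_signature)
```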