frisbyts/cure

CURE code

This code was written as part of the CURE project to predict breast cancer survivorship in a subset of the TCGA BRCA data set. This document describes the inputs and outputs of each script in this repository. Some scripts assume access to the pghbio data hosted on BRIDGES.

########################################################################################

format_dataframes.py

This script properly formats clinical, imaging, and RNA expression data for later use. It identifies samples for which the endpoint labels are known.

Usage:

python format_dataframes.py clinical_spreadsheet image_spreadsheet expression_dataframe

Inputs:

  1. clinical_spreadsheet- The clinical data provided by the CURE project.

  2. image_spreadsheet- The imaging data provided by the Murphy lab.

  3. expression_dataframe- The dataframe generated by get_salmon_data.py

Outputs:

Separate Pandas dataframes and accompanying labels will be saved to the working directory in pickle format for each data type.
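
As a minimal sketch of how the pickled outputs might be read back later, the example below round-trips a small dataframe and its labels through pickle files. The file names, column names, and sample IDs are hypothetical stand-ins; this README does not state the names that format_dataframes.py actually uses.

```python
import os
import tempfile

import pandas as pd

# Stand-in for one saved data type; the real columns and sample IDs will differ.
clinical = pd.DataFrame(
    {"age": [52, 61], "stage": [2, 3]},
    index=["sample_a", "sample_b"],
)
labels = pd.Series([0, 1], index=clinical.index, name="label")

with tempfile.TemporaryDirectory() as workdir:
    # Hypothetical file names, chosen only for this illustration.
    data_path = os.path.join(workdir, "clinical_example.pkl")
    label_path = os.path.join(workdir, "clinical_labels_example.pkl")

    # Save the dataframe and its labels in pickle format.
    clinical.to_pickle(data_path)
    labels.to_pickle(label_path)

    # Load them back, as a downstream script would.
    loaded_data = pd.read_pickle(data_path)
    loaded_labels = pd.read_pickle(label_path)

# Check that data and labels still refer to the same samples.
same_samples = loaded_data.index.equals(loaded_labels.index)
```

Checking that the data and label indices match after loading is a cheap guard against pairing a dataframe with the wrong label file.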

########################################################################################

get_salmon_data.py

This script finds the expression data of genes related to breast cancer for each sample. It assumes access to the pghbio data on BRIDGES.

Usage:

python get_salmon_data.py

Inputs:

None :)

Outputs:

Writes a Pandas dataframe containing expression information to the current working directory.

########################################################################################

cost_sensitive_learning.py

This script performs cost-sensitive learning with leave-one-out cross-validation. We assume the data is formatted into an array of shape (n,m), where n is the number of samples and m is the number of features. The data must be numerical; we suggest using a one-hot encoding for categorical data. The labels should be binary (0,1). The 20 features with the highest mutual information with the labels are selected and used for learning.
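
To illustrate the two ideas above, here is a dependency-free sketch of ranking features by mutual information with the labels and of generating leave-one-out splits. It is not the code in cost_sensitive_learning.py: the function names are made up for this example, and the mutual information estimate assumes discrete features (such as one-hot columns), which may differ from the estimator the script uses.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    feature_counts = Counter(feature)
    label_counts = Counter(labels)
    mi = 0.0
    for (f, l), count in joint.items():
        # p(f, l) * log(p(f, l) / (p(f) * p(l)))
        mi += (count / n) * math.log(count * n / (feature_counts[f] * label_counts[l]))
    return mi


def top_k_features(X, y, k=20):
    """Return the column indices of the k features with the highest MI with y."""
    n_features = len(X[0])
    scores = [mutual_information([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


def leave_one_out(n_samples):
    """Yield (train_indices, test_index) pairs, holding out one sample at a time."""
    for i in range(n_samples):
        yield [j for j in range(n_samples) if j != i], i


# Toy data: column 0 matches the labels exactly, columns 1 and 2 are independent of them.
X = [[0, 1, 0], [0, 0, 1], [1, 1, 0], [1, 0, 1]]
y = [0, 0, 1, 1]
best = top_k_features(X, y, k=1)
splits = list(leave_one_out(3))
```

In a real leave-one-out run, feature selection would normally be repeated on each training split so that the held-out sample does not influence which features are chosen.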

Usage:

import cost_sensitive_learning as csl

l = csl.OptimizedCost(classifier, primary_constraint, primary_constraint_threshold, secondary_constraint)

statistics, predictions, probabilities = l.find_best_weights(X,y)

Inputs:

  1. classifier- A scikit-learn classifier (e.g., RandomForestClassifier)

  2. primary_constraint- The metric for which you want your model to meet a specific threshold. Must be one of 'sensitivity', 'specificity', 'auc', 'f1', or 'matthews'.

  3. primary_constraint_threshold- The numeric threshold you want your model to attain for the primary_constraint.

  4. secondary_constraint- A second metric that you would like the model to maximize, given that it meets the primary constraint threshold. Must be one of 'sensitivity', 'specificity', 'auc', 'f1', or 'matthews'.

  5. X- A numerical data matrix of shape (n,m)

  6. y- Binary labels (0,1), one per sample
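
The relationship between the primary and secondary constraints can be sketched as a two-step selection: keep only the candidates that meet the primary threshold, then pick the one with the best secondary metric. The function name, the candidate weights, and the metric values below are hypothetical and serve only to illustrate the idea; the actual search in find_best_weights may work differently.

```python
def pick_best(candidates, primary, threshold, secondary):
    """Return the candidate that meets the primary threshold and maximizes the secondary metric.

    candidates is a list of dicts, each holding a 'weight' and its metric values.
    Returns None when no candidate meets the primary threshold.
    """
    feasible = [c for c in candidates if c[primary] >= threshold]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c[secondary])


# Made-up results for four candidate positive-class weights.
candidates = [
    {"weight": 1, "sensitivity": 0.60, "specificity": 0.90},
    {"weight": 2, "sensitivity": 0.82, "specificity": 0.75},
    {"weight": 5, "sensitivity": 0.91, "specificity": 0.62},
    {"weight": 10, "sensitivity": 0.97, "specificity": 0.40},
]

# Require sensitivity of at least 0.8, then maximize specificity.
best = pick_best(candidates, "sensitivity", 0.8, "specificity")
```

Here three candidates meet the sensitivity threshold, and the one with weight 2 is chosen because it has the highest specificity among them.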

Outputs:

  1. statistics- A 4-tuple containing the secondary constraint, sensitivity, specificity, and accuracy of the model

  2. predictions- The predicted classifications.

  3. probabilities- The predicted class probabilities.
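
For reference, sensitivity, specificity, and accuracy can be computed from binary labels and predicted classifications as in the sketch below. The helper name is an assumption made for this example and is not part of the csl module.

```python
def binary_metrics(y_true, y_pred):
    """Return (sensitivity, specificity, accuracy) for binary labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Guard against empty classes to avoid division by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return sensitivity, specificity, accuracy


y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1]
sensitivity, specificity, accuracy = binary_metrics(y_true, y_pred)
```

With leave-one-out cross-validation, each sample contributes exactly one held-out prediction, so these metrics would be computed once over all n predictions.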

####################################################################################
