CURE code

This code was written as part of the CURE project to predict breast cancer survivorship in a subset of the TCGA BRCA data set. This document describes the inputs and outputs of each script in this repository. Some of the scripts assume access to the pghbio data hosted on BRIDGES.

########################################################################################

format_dataframes.py

This script formats clinical, imaging, and RNA expression data for later use and identifies the samples for which the endpoint labels are known.

Usage:

python format_dataframes.py clinical_spreadsheet image_spreadsheet expression_dataframe

Inputs:

clinical_spreadsheet- The clinical data provided by the CURE project.
image_spreadsheet- The imaging data provided by the Murphy lab.
expression_dataframe- The dataframe generated by get_salmon_data.py.

Outputs:

For each data type, a separate Pandas dataframe and its accompanying labels are saved to the working directory in pickle format.

########################################################################################

get_salmon_data.py

This script finds the expression data of genes related to breast cancer for each sample. Assumes access to pghbio on BRIDGES.

Usage:

python get_salmon_data.py

Inputs:

None :)

Outputs:

Writes a Pandas dataframe containing the expression information to the current working directory.

########################################################################################

cost_sensitive_learning.py

This script performs cost-sensitive learning with leave-one-out cross-validation. We assume the data is formatted into an array of shape (n,m), where n is the number of samples and m is the number of features. The data must be numerical; we suggest using a one-hot encoding for categorical data. The labels should be binary (0,1). The 20 features with the highest mutual information with the labels are selected and used for learning.
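
The preprocessing described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the script's actual implementation: the helper names (`one_hot`, `mutual_information`, `top_k_features`, `leave_one_out`) and the toy data are hypothetical, and the mutual information here is a simple plug-in estimate for discrete features, which may differ from the estimator the script uses.

```python
import math
from collections import Counter


def one_hot(values):
    """One-hot encode a list of categorical values; returns (categories, rows)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def mutual_information(feature, labels):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    f_counts = Counter(feature)
    y_counts = Counter(labels)
    mi = 0.0
    for (f, y), count in joint.items():
        # p(f, y) * log(p(f, y) / (p(f) * p(y)))
        mi += (count / n) * math.log(count * n / (f_counts[f] * y_counts[y]))
    return mi


def top_k_features(X, y, k=20):
    """Return the indices of the k columns of X with the highest MI with y."""
    n_features = len(X[0])
    scores = [mutual_information([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


def leave_one_out(n):
    """Yield (train_indices, test_index) pairs, holding out one sample at a time."""
    for i in range(n):
        yield [j for j in range(n) if j != i], i


# Hypothetical toy data: one categorical column and one binary column.
stage = ["I", "II", "II", "III", "I", "III"]
er_positive = [1, 1, 0, 0, 1, 0]
y = [0, 0, 1, 1, 0, 1]

categories, stage_rows = one_hot(stage)
# Numerical data matrix of shape (n, m) = (6, 4).
X = [row + [flag] for row, flag in zip(stage_rows, er_positive)]
selected = top_k_features(X, y, k=2)
splits = list(leave_one_out(len(X)))
```

With n samples, leave-one-out cross-validation fits the model n times, so it is best suited to small data sets.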

Usage:

import cost_sensitive_learning as csl

l = csl.OptimizedCost(classifier, primary_constraint, primary_constraint_threshold, secondary_constraint)

statistics, predictions, probabilities = l.find_best_weights(X,y)

Inputs:

classifier- A scikit-learn classifier (e.g., RandomForestClassifier).
primary_constraint- The metric on which you want your model to meet a specific threshold. Must be one of 'sensitivity', 'specificity', 'auc', 'f1', or 'matthews'.
primary_constraint_threshold- The numeric threshold you want your model to attain for the primary_constraint.
secondary_constraint- A second metric you would like the model to maximize, given that it meets the primary constraint threshold. Must be one of 'sensitivity', 'specificity', 'auc', 'f1', or 'matthews'.
X- A data matrix of shape (n,m).
y- The binary labels (0,1).
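
How the primary and secondary constraints interact can be illustrated with a minimal sketch. It assumes the search compares a set of candidate settings, keeps those whose primary metric reaches the threshold, and returns the one with the highest secondary metric. The function `pick_best`, the candidate structure, the use of `>=` for the threshold, and the metric values are all hypothetical and made up for illustration; they are not taken from cost_sensitive_learning.py.

```python
def pick_best(candidates, primary, threshold, secondary):
    """Return the candidate that maximizes the secondary metric among those
    meeting the primary threshold, or None if no candidate qualifies."""
    feasible = [c for c in candidates if c["metrics"][primary] >= threshold]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c["metrics"][secondary])


# Hypothetical results for three class-weight settings (illustrative values only).
candidates = [
    {"weight": 1, "metrics": {"sensitivity": 0.60, "specificity": 0.90}},
    {"weight": 5, "metrics": {"sensitivity": 0.82, "specificity": 0.75}},
    {"weight": 10, "metrics": {"sensitivity": 0.91, "specificity": 0.55}},
]

# Require sensitivity of at least 0.80, then maximize specificity.
best = pick_best(candidates, "sensitivity", 0.80, "specificity")
```

In this toy example the setting with weight 5 is chosen: weight 1 fails the sensitivity threshold, and weight 10 meets it but has lower specificity.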

Outputs:

statistics- A 4-tuple containing the value of the secondary constraint metric, the sensitivity, the specificity, and the accuracy of the model.
predictions- The predicted classifications.
probabilities- The predicted class probabilities.
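
For reference, the sensitivity, specificity, and accuracy reported in the statistics tuple can be recomputed from the binary labels and the predictions. The sketch below uses the standard definitions; the function name `summarize` and the example vectors are hypothetical, and the script itself may compute these values differently.

```python
def summarize(y_true, y_pred):
    """Return (sensitivity, specificity, accuracy) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return sensitivity, specificity, accuracy


# Hypothetical labels and predictions for eight samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
sensitivity, specificity, accuracy = summarize(y_true, y_pred)
```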

####################################################################################
