module1-biomarker

This is the template repository for the module 1 group assignment conducting a proteomic analysis of serum data to identify possible early detection biomarkers of autism spectrum disorder (ASD).

Data for this assignment come from Hewitson et al. (2021). Blood biomarker discovery for autism spectrum disorder: A proteomic analysis. PLoS One. 2021 Feb 24;16(2):e0246581. doi: 10.1371/journal.pone.0246581. Data are publicly available and were accessed on August 19, 2022.

Repository contents

data contains
- biomarker-raw.csv raw data as retrieved from PLoS One
- biomarker-clean.RData processed data
scripts contains
- preprocessing.R script used to generate the processed data
- preliminary-analysis.R concise version of in-class codes used to replicate published analysis
results contains a report template report.qmd

Assignment tasks

Most of the lifting for this analysis was already done in class. As noted, the published analysis excluded 192 unidentified variables which are included in our analysis by default. In addition, we made some simplifications for ease of presentation -- specifically, repeated RF fitting, forward selection in the logistic regression model, and repeated estimation of AUROC for the final classifier(s) were omitted. As such, our results will not match those in the paper exactly.

Your group's task is to explore the sensitivity of the results to certain design choices in the methodology. Use the provided scripts preprocessing.R and preliminary-analysis.R as starting points to address the following questions:

What do you imagine is the reason for log-transforming the protein levels in biomarker-raw.csv? (Hint: look at the distribution of raw values for a sample of proteins.)
Temporarily remove the outlier trimming from preprocessing and do some exploratory analysis of outlying values. Are there specific subjects (not values) that seem to be outliers? If so, are outliers more frequent in one group or the other? (Hint: consider tabluating the number of outlying values per subject.)
Experiment with the following modifications:
- repeat the analysis but carry out the entire selection procedure on a training partition -- in other words, set aside some testing data at the very beginning and don't use it until you are evaluating accuracy at the very end
- choose a larger number (more than ten) of top predictive proteins using each selection method
- use a fuzzy intersection instead of a hard intersection to combine the sets of top predictive proteins across selection methods
How are results affected by each modification?
Use any method to find either:
- a simpler panel that achieves comparable classification accuracy
- an alternative panel that achieves improved classification accuracy
Benchmark your results against the in-class analysis.

Assignment instructions

Coordinate with your assigned group, divide tasks, and make a plan for how to proceed.
Similar to last time, develop your scratch work via scripts in the repository. Store these in a subdirectory, e.g., scripts/drafts/FILENAME.R .
Pool your work and develop a primary script that contains the codes that produce your findings. Store this as scripts/analysis-main.R .
Prepare a write-up using the template results/report.qmd .
To submit, commit all changes to the main branch of your group repository by the assignment deadline.

Remark: ensure your contributions are recorded in the commit history.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commits
.github		.github
data		data
results		results
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
module1-biomarker.Rproj		module1-biomarker.Rproj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.github

.github

data

data

results

results

scripts

scripts

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md