1. In this project, I use MLSeq, a machine learning package from Bioconductor, to find the model that best predicts breast cancer subtype.
2. Training uses 5-fold cross-validation with 10 repeats. I train each model on 28 datasets from TCGA and test it on 12 datasets.
3. The main goal of this project is to get a taste of machine learning in R, so I may copy and paste a lot of code, but I give my own understanding and opinions in the R notebook.
4. The original package link: https://www.bioconductor.org/packages/devel/bioc/vignettes/MLSeq/inst/doc/MLSeq.pdf
5. The original Vignettes/sample link: https://www.bioconductor.org/packages/devel/bioc/vignettes/MLInterfaces/inst/doc/MLprac2_2.pdf
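
To make the resampling scheme in item 2 concrete, here is a minimal Python sketch of 5-fold cross-validation repeated 10 times. It is an illustration only, not the R code used in the notebook: the function name `repeated_kfold`, the seed, and the assumption that resampling runs over the 28 training samples are all mine.

```python
import random

def repeated_kfold(n_samples, n_folds=5, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, val_idx) for repeated k-fold CV.

    Hypothetical helper for illustration; not part of MLSeq.
    """
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh shuffle for every repeat
        # Deal the shuffled indices round-robin so fold sizes differ by at most 1
        folds = [order[i::n_folds] for i in range(n_folds)]
        for fold, val_idx in enumerate(folds):
            held_out = set(val_idx)
            train_idx = [i for i in order if i not in held_out]
            yield repeat, fold, sorted(train_idx), sorted(val_idx)

# Assumption: cross-validation is run on the 28 training samples
splits = list(repeated_kfold(n_samples=28))
print(len(splits))  # 50 resamples: 5 folds x 10 repeats
fold_sizes = sorted(len(val_idx) for _, _, _, val_idx in splits[:5])
print(fold_sizes)  # [5, 5, 6, 6, 6]
```

Each model is then fitted 50 times, once per resample, and the validation accuracies are averaged. With only 28 samples, every validation fold holds just 5 or 6 samples, which is one reason to repeat the procedure.
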
All data are available from TCGA.
1. The training and testing data are from TCGA, 40 datasets in total, split 7:3 between training and testing (28 training, 12 testing).
2. When downloading datasets from TCGA, we can add the HTSeq files to the cart and download them as a single compressed file.
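
The 7:3 split described above can be sketched in a few lines of Python. This is only an assumed illustration: the sample IDs, the seed, and the helper name `split_samples` are hypothetical, and the notebook may have assigned samples to the two groups differently.

```python
import random

def split_samples(sample_ids, train_fraction=0.7, seed=42):
    """Shuffle the sample IDs and split them into training and testing lists."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]

# Hypothetical placeholder IDs standing in for the 40 TCGA HTSeq files
sample_ids = [f"sample_{i:02d}" for i in range(1, 41)]
train_ids, test_ids = split_samples(sample_ids)
print(len(train_ids), len(test_ids))  # 28 12
```

Fixing the seed keeps the split reproducible, so the same 12 samples stay held out every time the notebook is rerun.
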
Following the vignette, I imported the data and converted it into data frames that are ready for MLSeq. The next step is to choose a model, normalize and transform the data, and train the model on the normalized data.
1. I use 28 datasets to train all the models offered by MLSeq, and 12 datasets to test them.
Predicted \ Actual | iia | i |
---|---|---|
i | 3 | 5 |
iia | 2 | 2 |
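
As a worked example, accuracy can be read off the confusion matrix above as the share of samples on the diagonal. The Python sketch below assumes the rows are predicted stages and the columns are actual stages, in the order printed (iia, then i). That reading of the table is my assumption.

```python
# Counts copied from the confusion matrix: (predicted, actual) -> count
confusion = {
    ("i", "iia"): 3,
    ("i", "i"): 5,
    ("iia", "iia"): 2,
    ("iia", "i"): 2,
}

total = sum(confusion.values())
# A prediction is correct when the predicted and actual labels agree
correct = sum(n for (pred, actual), n in confusion.items() if pred == actual)
accuracy = correct / total
print(f"{correct}/{total} correct, accuracy = {accuracy:.2%}")
# prints: 7/12 correct, accuracy = 58.33%
```

The total of 12 matches the size of the test set.
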
Model | Accuracy |
---|---|
voomNSC | 75% |
plda | 50% |
plda2 | 46.43% |
svm | 53.57% |
NSC | 51.85% |
nblda | 46.43% |
ensembl_gene_id_version | entrezgene_id | description | hgnc_symbol |
---|---|---|---|
ENSG00000124107.5 | 6590 | secretory leukocyte peptidase inhibitor [Source:HGNC Symbol;Acc:HGNC:11092] | SLPI |
ENSG00000198888.2 | 4535 | mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 1 [Source:HGNC Symbol;Acc:HGNC:7455] | MT-ND1 |