To identify ECMGs, the Seurat package in R was used to generate the object and filter cells, retaining only high-quality cells. Genes detected in fewer than 3 cells, cells with fewer than 50 detected genes, and cells with a mitochondrial gene content above 5% were removed, and the data were then normalized. Principal component analysis (PCA) was performed on the top 1500 highly variable genes, and JackStraw analysis was used to assess the significance of the principal components. Cells were clustered with the FindClusters function at a resolution of 0.5 and visualized with the t-distributed stochastic neighbor embedding (t-SNE) algorithm. Marker genes for each cluster (adjusted P-value < 0.05 and |log fold change (FC)| > 1) were identified using the FindAllMarkers function with the Wilcoxon-Mann-Whitney test, which compares gene expression between a given cluster and all other clusters. Additionally, the SingleR package was used to annotate and visualize the cell types.
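The filtering thresholds can be illustrated with a minimal pure-Python sketch. It is not Seurat's implementation: the function name `filter_counts`, the gene-by-cell list layout, the "MT-" prefix used to flag mitochondrial genes, and the order of the steps (genes first, then cells) are all assumptions made for illustration.

```python
def filter_counts(counts, gene_names, min_cells=3, min_genes=50, max_mito_pct=5.0):
    """Return (kept_gene_indices, kept_cell_indices).

    counts[g][c] is the raw count of gene g in cell c (hypothetical layout).
    Mitochondrial genes are assumed to carry the "MT-" prefix.
    """
    n_cells = len(counts[0]) if counts else 0

    # Keep genes detected (count > 0) in at least `min_cells` cells.
    kept_genes = [
        g for g, row in enumerate(counts)
        if sum(1 for v in row if v > 0) >= min_cells
    ]

    kept_cells = []
    for c in range(n_cells):
        detected = sum(1 for g in kept_genes if counts[g][c] > 0)
        total = sum(counts[g][c] for g in kept_genes)
        mito = sum(
            counts[g][c] for g in kept_genes
            if gene_names[g].startswith("MT-")
        )
        mito_pct = 100.0 * mito / total if total else 0.0
        # Keep cells with enough detected genes and low mitochondrial content.
        if detected >= min_genes and mito_pct <= max_mito_pct:
            kept_cells.append(c)

    return kept_genes, kept_cells


# Toy data with relaxed thresholds so the effect is visible on four cells.
toy_genes = ["MT-CO1", "ACTB", "GAPDH", "RARE"]
toy_counts = [
    [0, 1, 50, 0],    # MT-CO1
    [10, 20, 10, 5],  # ACTB
    [10, 19, 10, 0],  # GAPDH
    [0, 0, 0, 1],     # RARE: detected in a single cell only
]
toy_result = filter_counts(toy_counts, toy_genes, min_cells=2, min_genes=2)
```

With the defaults, the thresholds match those stated above (3 cells, 50 genes, 5%); the toy call lowers them only so that the example fits in a few lines.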
Agglomerative PAM clustering with 1 − Pearson correlation distance, resampling 80% of the samples over 1000 repetitions, was performed to divide the patients in the TCGA cohort into clusters based on the ECMGs. The optimal number of clusters was determined from the cumulative distribution function (CDF), the consensus matrix, and the relative change in the area under the CDF curve.
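The quantities named in this step can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the clustering software used in the study: the function names are hypothetical, the clustering itself is omitted (each resampling run is supplied as a ready-made mapping from sample index to cluster label), and the CDF area is computed exactly from the empirical distribution of pairwise consensus values, using the identity that the area under the CDF of a variable on [0, 1] equals one minus its mean.

```python
from itertools import combinations
from math import sqrt


def pearson_distance(x, y):
    """Return 1 - Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def consensus_matrix(n_samples, runs):
    """Pairwise consensus: fraction of runs in which two samples co-cluster,
    among the runs in which both were drawn.

    runs: list of dicts {sample_index: cluster_label}, one per resampling.
    Returns a dict keyed by (i, j) with i < j.
    """
    together = [[0] * n_samples for _ in range(n_samples)]
    sampled = [[0] * n_samples for _ in range(n_samples)]
    for labels in runs:
        for i, j in combinations(sorted(labels), 2):
            sampled[i][j] += 1
            if labels[i] == labels[j]:
                together[i][j] += 1
    return {
        (i, j): together[i][j] / sampled[i][j]
        for i, j in combinations(range(n_samples), 2)
        if sampled[i][j]
    }


def cdf_area(consensus):
    """Area under the empirical CDF of consensus values on [0, 1]."""
    values = list(consensus.values())
    return 1.0 - sum(values) / len(values)


def relative_area_change(areas):
    """(A_k - A_{k-1}) / A_{k-1} for CDF areas at consecutive k."""
    return [(b - a) / a for a, b in zip(areas, areas[1:])]


# Three hypothetical resampling runs over four samples.
toy_runs = [{0: 0, 1: 0, 2: 1}, {0: 0, 1: 1, 3: 1}, {1: 0, 2: 0, 3: 1}]
toy_consensus = consensus_matrix(4, toy_runs)
```

In practice each run would hold the labels produced by clustering an 80% subsample, and the area would be computed once per candidate number of clusters before comparing the relative changes.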
To construct a prognostic signature with high accuracy and stability, 10 machine learning algorithms were integrated into 101 algorithm combinations. The 10 algorithms were CoxBoost, elastic net (Enet), generalized boosted regression modeling (GBM), Lasso, partial least squares regression for Cox (plsRcox), Ridge, random survival forest (RSF), stepwise Cox, supervised principal components (SuperPC), and survival support vector machine (survival-SVM). Notably, some of these algorithms, including CoxBoost, Lasso, RSF, and stepwise Cox, possess feature selection capabilities.
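One way to picture an algorithm combination is as a pair: an optional feature-selection step followed by a model-fitting step. The Python sketch below enumerates such pairs. The pairing rule is an assumption made for illustration; the text does not enumerate how the 101 combinations were formed, so this rule is not expected to reproduce that count. The short labels and the function name `enumerate_pipelines` are also hypothetical.

```python
# Short labels for the 10 algorithms listed above.
ALGORITHMS = [
    "CoxBoost", "Enet", "GBM", "Lasso", "plsRcox",
    "Ridge", "RSF", "StepCox", "SuperPC", "survival-SVM",
]

# The four algorithms the text names as capable of feature selection.
SELECTORS = ["CoxBoost", "Lasso", "RSF", "StepCox"]


def enumerate_pipelines(algorithms, selectors):
    """Return (selector, model) pairs; selector is None for a single algorithm.

    Assumed rule: every algorithm alone, plus every selector followed by
    every other algorithm.
    """
    pipelines = [(None, model) for model in algorithms]
    pipelines += [
        (selector, model)
        for selector in selectors
        for model in algorithms
        if model != selector
    ]
    return pipelines


pipelines = enumerate_pipelines(ALGORITHMS, SELECTORS)
```

Under this assumed rule the list contains 46 entries (10 single algorithms plus 4 × 9 selector-model pairs); a larger total would arise once hyperparameter or procedural variants of individual algorithms are counted separately.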
Through a comprehensive literature search on PubMed (https://pubmed.ncbi.nlm.nih.gov/), we gathered published signatures for performance comparison with ECMGPS (miRNA signatures were excluded because of limited miRNA information in the validation cohorts). The collected signatures had been fitted with various algorithms, such as Lasso and RSF, and reflected diverse biological features. Risk scores were then calculated for the five cohorts using the genes or RNAs and the coefficients reported in the respective articles. Performance in predicting BCR of PCa was compared using the concordance index (C-index).
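The two calculations in this comparison can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the code used in the study: the risk score is taken to be a linear combination of expression values and published coefficients, the C-index follows Harrell's pairwise definition with tied follow-up times ignored, and the gene names, coefficients, and function names are hypothetical.

```python
def risk_score(expression, coefficients):
    """Linear risk score: sum of coefficient * expression over signature genes.

    expression: dict {gene: expression value} for one patient.
    coefficients: dict {gene: coefficient} taken from a published signature.
    """
    return sum(coef * expression[gene] for gene, coef in coefficients.items())


def harrell_c_index(times, events, scores):
    """Harrell's C-index for right-censored data.

    A pair (i, j) is comparable when patient i had the event (events[i] is
    truthy) strictly before patient j's follow-up time. It is concordant when
    the patient with the earlier event also has the higher risk score; tied
    scores count as half. Tied times are ignored in this sketch.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    return concordant / comparable


# Hypothetical two-gene signature applied to one patient.
toy_score = risk_score(
    {"GENE_A": 2.0, "GENE_B": 1.0, "GENE_C": 5.0},
    {"GENE_A": 0.5, "GENE_B": -1.0},
)

# Four hypothetical patients: follow-up time, event indicator, risk score.
toy_c = harrell_c_index(
    times=[2, 4, 6, 8],
    events=[1, 1, 0, 1],
    scores=[3.0, 1.0, 2.0, 0.5],
)
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so computing it per signature and per cohort yields directly comparable values.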