This analysis uses the Quantitative Structure-Activity Relationship (QSAR) biodegradation data set from the UCI Machine Learning Repository.
The data set has 1,055 instances (molecules), each described by 41 features (chemical and physical properties). Each molecule is labeled either readily biodegradable (RB) or not readily biodegradable (NRB). Five methods are used here to classify each molecule as RB or NRB from its 41 features: logistic regression, logistic regression with lasso, logistic regression with ridge, radial SVM, and random forests.
The following procedure was repeated 100 times to capture the spread of the train error, test error, and minimum CV error rates for all 5 classification methods.
Through random sampling, the data set was split into a train set (90%) and a test set (the remaining 10%). Because the classes are imbalanced (66% NRB and 34% RB), the data was reweighted through resampling.
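The split and rebalancing step can be sketched as follows. The text does not say whether the resampling was over- or undersampling, or whether it was applied before or after the split, so this sketch assumes the minority class is oversampled with replacement inside the train set only. The function name `split_and_balance` and the toy labels are hypothetical and only illustrate the idea.

```python
import random

def split_and_balance(labels, train_frac=0.9, seed=0):
    """Return (train_idx, test_idx); train classes are oversampled to equal size."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    n_train = int(round(train_frac * len(idx)))
    train, test = idx[:n_train], idx[n_train:]

    # Group the train indices by class label.
    by_class = {}
    for i in train:
        by_class.setdefault(labels[i], []).append(i)

    # Assumption: resample each class with replacement up to the largest class size.
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    rng.shuffle(balanced)
    return balanced, test

# Toy labels (not the real data) with the 66% / 34% imbalance described above.
labels = ["NRB"] * 660 + ["RB"] * 340
train_idx, test_idx = split_and_balance(labels)
```

The test set is left untouched, so the test error is measured on data with the original class proportions.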
Using the train set, the hyperparameters of logistic lasso, logistic ridge, and radial SVM were tuned with 10-fold CV, and the minimum CV error rates were recorded. The fitted models were then used to compute the train error, test error, false positive (fp) train error, false negative (fn) train error, fp test error, and fn test error rates.
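The three error rates can be computed from a vector of predictions as in the minimal sketch below. The text does not define the rates, so two assumptions are made here: RB is treated as the positive class, and the fp and fn rates are normalized by the number of actual negatives and actual positives respectively. The helper name `error_rates` is hypothetical.

```python
def error_rates(y_true, y_pred, positive="RB"):
    """Overall, false positive, and false negative error rates."""
    pairs = list(zip(y_true, y_pred))
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    n_pos = sum(1 for t in y_true if t == positive)
    n_neg = len(y_true) - n_pos
    return {
        "error": (fp + fn) / len(y_true),
        # Assumption: fp is a share of actual negatives, fn a share of actual positives.
        "fp": fp / n_neg if n_neg else 0.0,
        "fn": fn / n_pos if n_pos else 0.0,
    }

# Toy example: one RB molecule missed, one NRB molecule flagged as RB.
y_true = ["RB", "RB", "NRB", "NRB", "NRB"]
y_pred = ["RB", "NRB", "NRB", "RB", "NRB"]
rates = error_rates(y_true, y_pred)
```

Applying the same function to train and test predictions gives the six error rates listed above.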
As shown in the figure below, the rf and svm models have the lowest train error rates of the 5 classification methods. However, these two models perform much worse on the test set. The remaining models have more spread-out error rates, but on average their test error rates are similar to their train error rates. The box plot of test fn errors stands out: svm and rf perform worst there and should not be used to identify RB molecules. Logistic ridge appears to have the best overall performance for identifying RB molecules.
The following figure shows the spread of minimum CV error rates for logistic lasso, logistic ridge, and svm.
The figure below shows that the smallest CV errors for logistic lasso and logistic ridge are similar to the CV error of unregularized logistic regression.
The heatmap is used to find the SVM parameters (gamma and cost) that give the smallest cross-validation error rate. For the QSAR biodegradation data set, the best cost appears to be 100 and the best gamma 0.1.
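Reading the best cell off such a heatmap amounts to a grid search: evaluate the CV error for every (gamma, cost) pair and keep the pair with the smallest value. The sketch below shows only that selection logic. The `toy_cv_error` function is a made-up bowl-shaped surface, not the CV errors from this analysis, and it was deliberately constructed to bottom out at the values reported above. In practice it would be replaced by a real 10-fold CV run of the SVM.

```python
import math

def grid_search(cv_error, gammas, costs):
    """Evaluate cv_error(gamma, cost) on every grid cell; return (grid, best_pair)."""
    grid = {(g, c): cv_error(g, c) for g in gammas for c in costs}
    best = min(grid, key=grid.get)
    return grid, best

def toy_cv_error(gamma, cost):
    # Hypothetical stand-in for a cross-validated error rate (synthetic values only).
    return (math.log10(gamma) + 1) ** 2 + (math.log10(cost) - 2) ** 2

gammas = [0.001, 0.01, 0.1, 1.0]
costs = [1, 10, 100, 1000]
grid, (best_gamma, best_cost) = grid_search(toy_cv_error, gammas, costs)
```

The `grid` dictionary holds one value per cell, which is the same information a heatmap displays.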
V19 was too sparse and was not used with any of the classification methods. Feature definitions can be found here. Logistic lasso and logistic ridge show similar patterns of coefficient importance. However, logistic lasso tends to emphasize some coefficients and reduce others to zero, while logistic ridge tends to reduce all coefficients in proportion to their importance. Hence, lasso coefficients are either large or small, while ridge coefficients fall somewhere in between. The variable importance pattern of random forests is completely different from those of logistic lasso and logistic ridge: random forests is a non-linear method and logistic regression is a linear one, so the two approaches produce different variable importance patterns.
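The contrast between the two penalties can be illustrated with a simplified case. For least-squares regression with orthonormal predictors (an assumption made only for this sketch, not the logistic setting used above), the penalized coefficients have closed forms, up to how the penalty is scaled: lasso soft-thresholds each coefficient, and ridge divides each coefficient by a constant. The coefficient values and the penalty `lam` below are arbitrary illustrative numbers.

```python
import math

def lasso_shrink(beta, lam):
    """Soft-thresholding: shrink the magnitude by lam and clip at zero."""
    magnitude = max(abs(beta) - lam, 0.0)
    return math.copysign(magnitude, beta) if magnitude else 0.0

def ridge_shrink(beta, lam):
    """Proportional shrinkage: every coefficient is scaled by 1 / (1 + lam)."""
    return beta / (1.0 + lam)

coefs = [2.0, -0.5, 0.25, -1.5]  # arbitrary unpenalized coefficients
lam = 0.5                        # arbitrary penalty strength
lasso = [lasso_shrink(b, lam) for b in coefs]
ridge = [ridge_shrink(b, lam) for b in coefs]
```

In this toy case lasso sets the two smallest coefficients to exactly zero, while ridge keeps all four non-zero and scales them by the same factor.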