akula01/Supervised-Machine-Learning-Ensemble-model-for-Type-2-Diabetes-Prediction

Supervised-Machine-Learning-Ensemble-model-for-Type-2-Diabetes-Prediction

According to the American Diabetes Association (ADA), 30.3 million people in the United States have diabetes, of whom 7.2 million may be undiagnosed and unaware of their condition. Type 2 diabetes is usually diagnosed later in life, whereas the less common Type 1 diabetes is diagnosed early in life. People can live healthy and happy lives with diabetes, but early detection leads to better overall health outcomes for most patients. Thus, to test how accurately Type 2 diabetes can be predicted, we use patient information from an electronic health records company called Practice Fusion, comprising about 10,000 patient records from 2009 to 2012. The data contains key biometrics for each individual, including age, diastolic and systolic blood pressure, gender, height, and weight.

We apply popular machine learning algorithms to this data: k-Nearest Neighbors, Support Vector Machines, Decision Trees, Random Forest, Gradient Boosting, MLP Neural Network, and Naive Bayes. For each algorithm, we tune hyperparameters to maximize accuracy, and evaluate every model on its classification accuracy, precision, sensitivity (recall), specificity, negative predictive value, and F1 score. Among the individual algorithms, the highest classification accuracy achieved is 82.54%, by the MLP Neural Network.
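To make the evaluation metrics concrete, here is a minimal sketch of how they follow from the four confusion-matrix counts of a binary classifier. The function name `binary_metrics` and the example counts are hypothetical and chosen only for illustration; they are not taken from this repository or its results.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion-matrix counts.

    tp/fp/tn/fn are the numbers of true positives, false positives,
    true negatives, and false negatives.
    """
    def ratio(num, den):
        # Return 0.0 instead of dividing by zero when a class is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)      # positive predictive value
    sensitivity = ratio(tp, tp + fn)    # also called recall
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "npv": ratio(tn, tn + fn),      # negative predictive value
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }


# Made-up counts for illustration only (not results from this study).
metrics = binary_metrics(tp=30, fp=20, tn=130, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts, accuracy is high while precision is noticeably lower, which is the kind of gap that accuracy alone would hide.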

In our study, we find that all algorithms other than Naive Bayes suffer from very low precision. Hence, we go a step further and combine all the algorithms into a weighted-average (soft voting) ensemble model, in which each algorithm's prediction counts, in proportion to its weight, towards the final decision of whether a patient has diabetes or not.
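The idea of weighted-average (soft) voting can be sketched as follows. This is a simplified illustration under stated assumptions, not the repository's implementation: the function name `soft_vote`, the probabilities, the weights, and the 0.5 decision threshold are all hypothetical.

```python
def soft_vote(probabilities, weights, threshold=0.5):
    """Combine per-model positive-class probabilities by weighted average.

    probabilities: each model's predicted probability that the patient
                   has diabetes.
    weights:       one non-negative weight per model.
    Returns the weighted-average probability and the resulting 0/1 label.
    """
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    average = sum(p * w for p, w in zip(probabilities, weights)) / total
    return average, int(average >= threshold)


# Hypothetical outputs of three models for one patient.
probs = [0.9, 0.4, 0.3]
weights = [2.0, 1.0, 1.0]   # assumed weights, e.g. favoring a stronger model

average, label = soft_vote(probs, weights)
print(f"weighted probability = {average:.3f}, predicted label = {label}")
```

In this example two of the three models are individually below the threshold, yet the confident, more heavily weighted model pulls the average above it. This is how soft voting differs from a simple count of votes.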

Unlike previous works, which focused either on a particular set of classifiers or on the Pima Indians dataset, which is heavily biased towards a limited female population, we use a new approach and a new dataset, while still using the Pima Indians dataset for baseline comparison. While the accuracy of previous works on the Pima Indians dataset was less than 80%, the accuracy of our ensemble model reached 89% on the same dataset. The accuracy of the ensemble model on the Practice Fusion data is 85%. To our knowledge, our ensemble approach is new in this space.

We firmly believe that the weighted average ensemble model not only performs well across the overall metrics but also helps to recover from wrong predictions of individual algorithms, aiding accurate prediction of Type 2 diabetes. The model can be used as an alert for patients to seek medical evaluation in time.
