
medical cost prediction & insurance fraud detection

The optimal way to predict the medical cost of patients is based on their historical data, such as age, gender, number of children, smoking habits and the region they live in. This is a regression problem, so linear, non-linear and ensemble methods are compared to choose the model that predicts with the least variance and the highest accuracy possible. Fraud detection is split into two parts based on whether the patient was admitted to hospital or treated as an outpatient (OPD); it is a binary classification task.

Data quality

before applying any modelling algorithm or making any changes to the data, the basic assumptions are examined with 4-plots (since almost every feature in the fraud detection data is categorical, 4-plots are used only on numerical data). The four assumptions are:

  • Fixed Location: If the fixed location assumption holds, then the run sequence plot will be flat and non-drifting.

  • Fixed Variation: If the fixed variation assumption holds, then the vertical spread in the run sequence plot will be approximately the same over the entire horizontal axis.

  • Randomness: If the randomness assumption holds, then the lag plot will be structureless and random.

  • Fixed Distribution: If the fixed distribution assumption holds, in particular, if the fixed normal distribution holds, then the histogram will be bell-shaped, and the normal probability plot will be linear.

4-plot example
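A minimal sketch of how such a 4-plot can be produced with matplotlib and scipy; the four_plot helper and its numeric-array input are assumptions for illustration, not code from this repository:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def four_plot(x, title="4-plot"):
    """Draw the four diagnostic plots for a 1-D numeric array x."""
    x = np.asarray(x)
    fig, axes = plt.subplots(2, 2, figsize=(10, 8))
    fig.suptitle(title)

    # run sequence plot: checks fixed location and fixed variation
    axes[0, 0].plot(x, marker=".", linestyle="none")
    axes[0, 0].set_title("run sequence plot")

    # lag plot (x[i] vs x[i-1]): checks randomness
    axes[0, 1].plot(x[:-1], x[1:], marker=".", linestyle="none")
    axes[0, 1].set_title("lag plot")

    # histogram: checks the shape of the distribution
    axes[1, 0].hist(x, bins=30)
    axes[1, 0].set_title("histogram")

    # normal probability plot: checks the fixed (normal) distribution
    stats.probplot(x, dist="norm", plot=axes[1, 1])
    axes[1, 1].set_title("normal probability plot")

    plt.tight_layout()
    plt.show()
```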

central tendency

after verifying that the data fulfil the underlying assumptions, exploratory data analysis can begin. The first summary covers the location statistics and variability of the data; distribution and correlation statistics are calculated afterwards. Four estimators of central tendency were used: the mean (not robust), the trimmed mean, the winsorized mean and the median. The winsorized mean has the lowest standard error among the three mean estimators. The median is lower than the mean for every feature, which hints at right-skewed distributions.

central tendency estimators and their standard errors
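As a sketch, the four estimators can be computed with numpy and scipy; the central_tendency helper and the 10% trim/winsorize fraction are illustrative assumptions:

```python
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

def central_tendency(x, cut=0.1):
    """Return the four location estimators for a numeric array x."""
    x = np.asarray(x)
    return {
        "mean": x.mean(),                                    # not robust to outliers
        "trimmed mean": stats.trim_mean(x, cut),             # drops the extreme 10% on each side
        "winsorized mean": winsorize(x, (cut, cut)).mean(),  # clips extremes instead of dropping them
        "median": np.median(x),
    }
```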

distribution analysis

a chi-squared goodness-of-fit test was done on every discrete and continuous feature to estimate which theoretical distribution best matches the data.

distribution histograms

theoretical distributions overlaid on the underlying distribution of the data:

theoretical distribution plots

estimated distributions for each feature, based on the chi-squared test:

estimated distribution plots
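A hedged sketch of such a fit for continuous features with scipy; the candidate list, bin count and best_fit_distribution helper are assumptions (discrete features would need discrete candidates and exact bin probabilities instead):

```python
import numpy as np
from scipy import stats

def best_fit_distribution(x, candidates=("norm", "lognorm", "gamma", "expon"), bins=30):
    """Fit each candidate distribution to x and rank by the chi-squared statistic."""
    x = np.asarray(x)
    observed, edges = np.histogram(x, bins=bins)
    results = {}
    for name in candidates:
        dist = getattr(stats, name)
        params = dist.fit(x)                                   # maximum-likelihood parameter estimates
        expected = len(x) * np.diff(dist.cdf(edges, *params))  # expected counts per bin
        expected = np.where(expected > 0, expected, 1e-9)      # avoid division by zero
        results[name] = np.sum((observed - expected) ** 2 / expected)
    # smaller chi-squared means a better match between data and distribution
    return sorted(results.items(), key=lambda kv: kv[1])
```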

normality test

three testing methods are used to test the normality of each distribution: Q-Q plots, the Anderson-Darling test and the Shapiro-Wilk test.

  • H0 = The null hypothesis assumes no difference between the observed and theoretical distribution
  • Ha = The alternative hypothesis assumes there is a difference between the observed and theoretical distribution

Q-Q plots

Anderson-Darling test results

| feature  | statistic |
| -------- | --------- |
| age      | 18.7887   |
| bmi      | 1.2355    |
| children | 87.6711   |
| charges  | 85.1285   |

the critical values are the same for every feature: 0.574 at the 15% significance level, 0.654 at 10%, 0.785 at 5%, 0.915 at 2.5% and 1.089 at 1%. Every statistic exceeds even the 1% critical value, so normality is rejected for all four features.

Shapiro-Wilk test results

| feature  | statistic | p value    | null hypothesis |
| -------- | --------- | ---------- | --------------- |
| age      | 0.9446    | 5.6874e-22 | reject          |
| bmi      | 0.9938    | 2.6098e-05 | reject          |
| children | 0.8231    | 5.0663e-36 | reject          |
| charges  | 0.8146    | 1.1504e-36 | reject          |
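Both tests are available in scipy; a minimal sketch, assuming x is one numeric feature column:

```python
from scipy import stats

def normality_tests(x, name):
    """Run the Anderson-Darling and Shapiro-Wilk tests on one feature."""
    ad = stats.anderson(x, dist="norm")
    print(f"{name}: Anderson-Darling statistic = {ad.statistic:.4f}")
    for cv, sl in zip(ad.critical_values, ad.significance_level):
        verdict = "reject" if ad.statistic > cv else "fail to reject"
        print(f"  {verdict} normality at the {sl}% level (critical value {cv})")

    w, p = stats.shapiro(x)
    print(f"{name}: Shapiro-Wilk W = {w:.4f}, p = {p:.4e}")
```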

outliers

box plots are used to identify outliers in the data. In fraud detection, outliers help to identify unusual behaviour, so for that purpose the outliers are kept in the dataset.

outliers

feature standardization, transformation and selection

three methods were used to transform the features:

  • quantile transformation
  • Box-Cox transformation
  • Yeo-Johnson transformation

scaling is done with the robust scaling method because there are outliers in the data.
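A minimal scikit-learn sketch of the three transformations and the robust scaling, assuming a numeric feature matrix X (note that Box-Cox only accepts strictly positive values):

```python
from sklearn.preprocessing import QuantileTransformer, PowerTransformer, RobustScaler

# the three transformations named above
quantile    = QuantileTransformer(output_distribution="normal").fit_transform(X)
box_cox     = PowerTransformer(method="box-cox").fit_transform(X)      # strictly positive data only
yeo_johnson = PowerTransformer(method="yeo-johnson").fit_transform(X)  # handles zeros and negatives

# robust scaling centres on the median and scales by the IQR,
# so outliers barely influence the result
scaled = RobustScaler().fit_transform(X)
```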

scaling in the fraud detection dataset:

dimensionality reduction

the fraud detection data consists of two separate tables about patients: medical information and beneficiary information for certain insurance claims. The two are combined using the claim id, which gives 53 features and 40474 records. A random forest classifier with recursive feature elimination and cross-validation (RFECV) was used to identify the optimal feature count, and the select-k-best method was used to get the best k features for the models.
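A sketch of both selection steps with scikit-learn, assuming the merged feature matrix X and fraud labels y; the estimator settings are illustrative:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV, SelectKBest, mutual_info_classif

# recursive feature elimination with cross-validation
# to find the optimal number of features
rfecv = RFECV(RandomForestClassifier(n_estimators=100, random_state=0),
              cv=5, scoring="f1")
rfecv.fit(X, y)
print("optimal number of features:", rfecv.n_features_)

# keep only the k best features, scored by mutual information
X_best = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)
```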

models

  • medical cost prediction

the coefficient of determination, or R², measures the proportion of the variance in the target that is explained by the model. R² values close to 1 mean an almost-perfect regression, while values close to 0 (or negative) imply a bad model.
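For reference, a minimal computation of R² (equivalent to what sklearn.metrics.r2_score returns):

```python
import numpy as np

def r2_score(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1 - ss_res / ss_tot
```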

  • linear models - ordinary least squares, lasso (L1), ridge (L2), elastic net, RANSAC, Huber (a comparison sketch follows the table)

| model      | R² on train data | R² on test data |
| ---------- | ---------------- | --------------- |
| ols        | 0.761332         | 0.791844        |
| lasso      | 0.362672         | 0.362733        |
| ridge      | 0.761330         | 0.791807        |
| elasticnet | 0.559900         | 0.569312        |
| ransac     | 0.595668         | 0.623618        |
| huber      | 0.754969         | 0.788107        |
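A sketch of how such a comparison can be run with scikit-learn, assuming X_train, y_train, X_test, y_test splits; the default hyper-parameters are an assumption:

```python
from sklearn.linear_model import (LinearRegression, Lasso, Ridge, ElasticNet,
                                  RANSACRegressor, HuberRegressor)

models = {
    "ols": LinearRegression(),
    "lasso": Lasso(),
    "ridge": Ridge(),
    "elasticnet": ElasticNet(),
    "ransac": RANSACRegressor(),
    "huber": HuberRegressor(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    # .score returns R² for scikit-learn regressors
    print(name, model.score(X_train, y_train), model.score(X_test, y_test))
```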
  • non-linear models - polynomial ridge, k nearest neighbours, decision tree

| model      | R² on train data | R² on test data |
| ---------- | ---------------- | --------------- |
| ridge poly | 0.829277         | 0.856282        |
| knn        | 0.822168         | 0.840297        |
| tree       | 0.842030         | 0.851783        |
  • ensemble models - bagging, random forest, adaptive boost, stacking (a stacking sketch follows the table)

| model         | R² on train data | R² on test data |
| ------------- | ---------------- | --------------- |
| bagging       | 0.843557         | 0.852594        |
| adaboost      | 0.759680         | 0.791620        |
| random forest | 0.838272         | 0.854912        |
| stacking      | 0.831071         | 0.851779        |
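A minimal stacking sketch with scikit-learn; the choice of base estimators and the ridge meta-learner are assumptions for illustration:

```python
from sklearn.ensemble import (StackingRegressor, RandomForestRegressor,
                              BaggingRegressor, AdaBoostRegressor)
from sklearn.linear_model import Ridge

# base learners feed their predictions into a final (meta) estimator
stack = StackingRegressor(
    estimators=[
        ("bagging", BaggingRegressor(random_state=0)),
        ("forest", RandomForestRegressor(random_state=0)),
        ("adaboost", AdaBoostRegressor(random_state=0)),
    ],
    final_estimator=Ridge(),
)
stack.fit(X_train, y_train)
print("train R²:", stack.score(X_train, y_train))
print("test R²:", stack.score(X_test, y_test))
```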
  • insurance fraud detection

The first stage of the classifier pipeline is feature selection with select-k-best, scored by mutual_info_classif, to choose 10 out of 48 features. The next stage is a grid search to tune the hyper-parameters of the underlying model, which can be linear, tree-based, polynomial or an ensemble. The scoring methods used in the grid search were precision, recall, f1 and AUC; f1 and AUC were used to refit the model.
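A sketch of this pipeline with scikit-learn, using a decision tree as the interchangeable underlying model; the parameter grid is an illustrative assumption, and refitting on AUC would just swap refit="f1" for refit="auc":

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([
    ("select", SelectKBest(mutual_info_classif, k=10)),  # 10 of the 48 features
    ("model", DecisionTreeClassifier(random_state=0)),   # swappable underlying model
])

grid = GridSearchCV(
    pipe,
    param_grid={"model__max_depth": [3, 5, 10, None]},   # hypothetical grid
    scoring={"precision": "precision", "recall": "recall",
             "f1": "f1", "auc": "roc_auc"},
    refit="f1",  # the metric used to refit the best model
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_)
```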

tested models

  • LogisticRegression

  • SGDClassifier

  • LogisticRegression with Polynomial function

  • SVC

  • KNeighborsClassifier

  • DecisionTreeClassifier

  • ExtraTreeClassifier

  • RandomForestClassifier

  • AdaBoostClassifier

  • BaggingClassifier

  • VotingClassifier

| model                                       | score on train data | score on test data |
| ------------------------------------------- | ------------------- | ------------------ |
| LogisticRegression                          | 0.73                | 0.73               |
| SGDClassifier                               | 0.68                | 0.68               |
| LogisticRegression with Polynomial function | 0.73                | 0.73               |
| SVC                                         | 0.73                | 0.73               |
| KNeighborsClassifier                        | 0.73                | 0.82               |
| DecisionTreeClassifier                      | 0.91                | 0.92               |
| ExtraTreeClassifier                         | 0.73                | 0.73               |
| RandomForestClassifier                      | 0.87                | 0.93               |
| AdaBoostClassifier                          | 0.97                | 1.00               |
| BaggingClassifier                           | 0.91                | 0.92               |
| VotingClassifier                            | 0.95                | 0.97               |

Dataset

the dataset used here is the medical cost prediction dataset from Kaggle; you can find it here.

how to contribute

  • improve the models
  • make the readme file clearer
  • feature-engineer the data
