The goal is to predict patients' medical costs from their historical data, such as age, gender, number of children, smoking habits and the region they live in. This is a regression problem, so linear, non-linear and ensemble methods are compared to choose the model that makes predictions with the lowest variance and the highest possible accuracy. Fraud detection is divided into two parts depending on whether the patient was admitted to hospital or treated in the OPD; in both cases it is a binary classification task.
Before applying any modelling algorithm or changing the data, the basic assumptions are examined with a 4-plot. (Almost every feature of the fraud detection data is categorical, so the 4-plot is used only on numerical data.) The four assumptions are:
- Fixed Location: If the fixed location assumption holds, then the run sequence plot will be flat and non-drifting.
- Fixed Variation: If the fixed variation assumption holds, then the vertical spread in the run sequence plot will be approximately the same over the entire horizontal axis.
- Randomness: If the randomness assumption holds, then the lag plot will be structureless and random.
- Fixed Distribution: If the fixed distribution assumption holds, in particular, if the fixed normal distribution holds, then the histogram will be bell-shaped, and the normal probability plot will be linear.
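As a minimal sketch, the 4-plot can be assembled with matplotlib and scipy; the series below is synthetic stand-in data, not the project's actual feature:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=10, scale=2, size=300)  # stand-in for a numeric feature

fig, ax = plt.subplots(2, 2, figsize=(8, 6))
ax[0, 0].plot(x)                           # run sequence plot: checks location/variation drift
ax[0, 0].set_title("run sequence")
ax[0, 1].scatter(x[:-1], x[1:], s=8)       # lag plot: visible structure implies non-randomness
ax[0, 1].set_title("lag plot")
ax[1, 0].hist(x, bins=30)                  # histogram: overall shape of the distribution
ax[1, 0].set_title("histogram")
stats.probplot(x, dist="norm", plot=ax[1, 1])  # normal probability plot: linear if normal
ax[1, 1].set_title("normal probability plot")
fig.tight_layout()
fig.savefig("four_plot.png")
```

If all four assumptions hold, the run sequence plot is flat, the lag plot is structureless, the histogram is bell-shaped and the probability plot is close to a straight line.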
central tendency
After verifying that the data fulfil the underlying assumptions, exploratory data analysis can begin. The first summary covers location and variability statistics; distribution and correlation statistics are calculated afterwards. Four estimators of location were used: the mean (not robust), the trimmed mean, the winsorized mean and the median. The winsorized mean has the lowest standard error among the three mean estimators. The median is lower than the mean for every feature, which is a hint of right-skewed distributions.
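The four location estimators can be computed with numpy and scipy; the data below are a synthetic right-skewed stand-in for a feature like `charges`:

```python
import numpy as np
from scipy import stats
from scipy.stats import mstats

rng = np.random.default_rng(1)
charges = rng.lognormal(mean=9, sigma=0.9, size=1000)  # right-skewed stand-in data

mean = charges.mean()                                        # not robust to outliers
tmean = stats.trim_mean(charges, proportiontocut=0.1)        # drops the top/bottom 10%
wmean = mstats.winsorize(charges, limits=(0.1, 0.1)).mean()  # clips the top/bottom 10%
median = np.median(charges)                                  # fully robust

# for a right-skewed distribution the median sits below the mean
assert median < mean
```

The trimmed mean discards extreme observations outright, while the winsorized mean replaces them with the nearest retained value, which is why it keeps more of the sample and tends to have a lower standard error.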
distribution analysis
A chi-squared goodness-of-fit test was run on every discrete and continuous feature to estimate the distribution that best matches the data.
Theoretical candidate distributions compared with the underlying distribution of the data:
Best-fit distribution for each feature, estimated by the chi-squared test:
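A chi-squared goodness-of-fit test for a continuous feature can be sketched as follows: fit a candidate distribution, bin the data, and compare observed with expected bin counts. The gamma distribution and the data here are illustrative stand-ins, not the project's fitted results:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.gamma(shape=2.0, scale=3.0, size=2000)  # stand-in continuous feature

# fit a candidate distribution to the data
params = stats.gamma.fit(data)

# 10 equal-count bins, then expected counts under the fitted distribution
edges = np.quantile(data, np.linspace(0, 1, 11))
observed, _ = np.histogram(data, bins=edges)
cdf = stats.gamma.cdf(edges, *params)
expected = len(data) * np.diff(cdf)
expected *= observed.sum() / expected.sum()  # totals must match for the test

# ddof accounts for the parameters estimated from the data
stat, p = stats.chisquare(observed, expected, ddof=len(params))
# a large p-value means the candidate distribution is not rejected
```

Repeating this over several candidate distributions and keeping the one with the best test result gives the "estimated distribution" per feature.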
normality test
Three methods are used to test the normality of a distribution: Q-Q plots, the Anderson-Darling test and the Shapiro-Wilk test.
- H0: The null hypothesis assumes no difference between the observed and theoretical distribution.
- Ha: The alternative hypothesis assumes there is a difference between the observed and theoretical distribution.
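Both tests are available in scipy; the sketch below uses synthetic skewed data as a stand-in for a feature like `charges`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
skewed = rng.lognormal(size=500)  # clearly non-normal stand-in data

# Shapiro-Wilk: a small p-value rejects H0 (normality)
w, p = stats.shapiro(skewed)
assert p < 0.05  # reject

# Anderson-Darling: compare the statistic against tabulated critical values
result = stats.anderson(skewed, dist="norm")
idx5 = list(result.significance_level).index(5.0)
# reject at the 5% level when the statistic exceeds the 5% critical value
assert result.statistic > result.critical_values[idx5]
```

Note that `scipy.stats.anderson` returns critical values at the 15%, 10%, 5%, 2.5% and 1% significance levels rather than a single p-value, which matches the layout of the table below.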
Anderson-Darling test results (critical values at the 15, 10, 5, 2.5 and 1 % significance levels):

| feature | statistic | critical values (15%, 10%, 5%, 2.5%, 1%) |
| --- | --- | --- |
| age | 18.7887 | 0.574, 0.654, 0.785, 0.915, 1.089 |
| bmi | 1.2355 | 0.574, 0.654, 0.785, 0.915, 1.089 |
| children | 87.6711 | 0.574, 0.654, 0.785, 0.915, 1.089 |
| charges | 85.1285 | 0.574, 0.654, 0.785, 0.915, 1.089 |
Shapiro-Wilk test results:

| feature | statistic | p-value | decision |
| --- | --- | --- | --- |
| age | 0.9446 | 5.6874e-22 | reject |
| bmi | 0.9938 | 2.6098e-05 | reject |
| children | 0.8231 | 5.0663e-36 | reject |
| charges | 0.8146 | 1.1504e-36 | reject |
outliers
Box plots are used to identify outliers in the data. In fraud detection, outliers flag unusual behaviour, so for that purpose they should be kept in the dataset.
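The box-plot rule can also be applied directly with numpy: points beyond 1.5 times the interquartile range from the quartiles are flagged. The values below are synthetic stand-ins for claim amounts:

```python
import numpy as np

rng = np.random.default_rng(4)
charges = np.concatenate([rng.normal(9000, 2000, 500),
                          np.array([45000.0, 52000.0])])  # a few extreme claims

# box-plot fences: 1.5 * IQR beyond the first and third quartiles
q1, q3 = np.percentile(charges, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = charges[(charges < lower) | (charges > upper)]

# for fraud detection the outliers are kept, only flagged:
# unusual claims are exactly the signal we want the model to see
```

Dropping these rows would remove the most informative examples for the classifier, which is why they stay in the dataset.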
Three methods are used to transform the features towards normality:
- quantile transformation
- Box-Cox transformation
- Yeo-Johnson transformation
Scaling is done with the robust scaling method because the data contain outliers.
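All three transformations and the robust scaler are in scikit-learn; the sketch below applies them to a synthetic skewed, strictly positive feature (Box-Cox requires positive data):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer, RobustScaler

rng = np.random.default_rng(5)
X = rng.lognormal(size=(300, 1))  # skewed, strictly positive stand-in feature

boxcox = PowerTransformer(method="box-cox")      # positive data only
yeo = PowerTransformer(method="yeo-johnson")     # handles any sign
quantile = QuantileTransformer(output_distribution="normal", n_quantiles=100)

Xb = boxcox.fit_transform(X)
Xy = yeo.fit_transform(X)
Xq = quantile.fit_transform(X)

# robust scaling centres on the median and scales by the IQR,
# so outliers do not dominate the scale as they would with z-scoring
Xs = RobustScaler().fit_transform(X)
```

After robust scaling the median of each feature is zero and its interquartile range is one, regardless of how extreme the outliers are.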
scaling in fraud detection dataset:
The fraud detection data consist of two separate tables on patients: medical information and beneficiary information for a given insurance claim. The two are combined using the claim id, giving 53 features and 40474 records. A random forest classifier with recursive feature elimination and cross-validation (RFECV) is used to find the optimal feature count, and the select-k-best method is used to pick the best k features for the models.
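The join and the two selection steps can be sketched as follows. The table names, column names and the synthetic classification data are illustrative stand-ins, not the project's actual schema:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV, SelectKBest, mutual_info_classif

# hypothetical stand-ins for the two tables, joined on the claim id
medical = pd.DataFrame({"claim_id": [1, 2, 3], "diag": ["a", "b", "a"]})
beneficiary = pd.DataFrame({"claim_id": [1, 2, 3], "age": [64, 71, 58]})
claims = medical.merge(beneficiary, on="claim_id")

# RFECV finds the optimal number of features with a random forest ...
X, y = make_classification(n_samples=300, n_features=12,
                           n_informative=4, random_state=0)
rfecv = RFECV(RandomForestClassifier(n_estimators=50, random_state=0), cv=3)
rfecv.fit(X, y)

# ... and SelectKBest keeps the k highest-scoring features for the models
Xk = SelectKBest(mutual_info_classif, k=5).fit_transform(X, y)
```

RFECV answers "how many features", while SelectKBest answers "which k features", so the two steps complement each other.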
- medical cost prediction
The coefficient of determination, R², measures the share of the variance in the target that is explained by the model's predictions. R² values close to 1 mean an almost-perfect regression, while values close to 0 (or negative) imply a bad model.
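Concretely, R² = 1 - SS_res / SS_tot, which can be checked against scikit-learn's implementation on a tiny made-up example:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.9])

# residual sum of squares vs total sum of squares around the mean
ss_res = np.sum((y_true - y_pred) ** 2)   # 0.1
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 20.0
r2 = 1 - ss_res / ss_tot  # 0.995

assert abs(r2 - r2_score(y_true, y_pred)) < 1e-12
```

A model that always predicts the mean gets R² = 0, and a model worse than that gets a negative score, which is why "close to 0 or negative" signals a bad model.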
- linear models - ordinary least squares (OLS), lasso (L1), ridge (L2), elastic net, RANSAC, Huber
| model | R² score 1 | R² score 2 |
| --- | --- | --- |
| ols | 0.761332 | 0.791844 |
| lasso | 0.362672 | 0.362733 |
| ridge | 0.761330 | 0.791807 |
| elasticnet | 0.559900 | 0.569312 |
| ransac | 0.595668 | 0.623618 |
| huber | 0.754969 | 0.788107 |
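The six linear models above can be fitted and scored with a common scikit-learn loop; the regression data here are synthetic, so the scores will not match the table:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import (LinearRegression, Lasso, Ridge, ElasticNet,
                                  RANSACRegressor, HuberRegressor)
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=6, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "ols": LinearRegression(),
    "lasso": Lasso(alpha=1.0),            # L1 penalty: sparse coefficients
    "ridge": Ridge(alpha=1.0),            # L2 penalty: shrunk coefficients
    "elasticnet": ElasticNet(alpha=1.0),  # mix of L1 and L2
    "ransac": RANSACRegressor(random_state=0),  # fits on inlier subsets
    "huber": HuberRegressor(),            # loss robust to outliers
}
# .score() on a regressor returns R² on the held-out split
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
```

RANSAC and Huber are included because `charges` has heavy outliers; both down-weight or exclude extreme points rather than letting them dominate the fit.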
- non-linear models - polynomial ridge, k nearest neighbours, decision tree
| model | R² score 1 | R² score 2 |
| --- | --- | --- |
| ridge poly | 0.829277 | 0.856282 |
| knn | 0.822168 | 0.840297 |
| tree | 0.842030 | 0.851783 |
- ensemble model - bagging, random forest, adaptive boost, stacking
| model | R² score 1 | R² score 2 |
| --- | --- | --- |
| bagging | 0.843557 | 0.852594 |
| adaboost | 0.759680 | 0.791620 |
| random forest | 0.838272 | 0.854912 |
| stacking | 0.831071 | 0.851779 |
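The four ensembles can be built the same way; again the data are synthetic stand-ins, so the scores are illustrative only:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (AdaBoostRegressor, BaggingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=6, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    # bagging: many trees on bootstrap samples, predictions averaged
    "bagging": BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                                random_state=0),
    "adaboost": AdaBoostRegressor(n_estimators=50, random_state=0),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    # stacking: a meta-learner combines the base models' predictions
    "stacking": StackingRegressor(
        estimators=[("tree", DecisionTreeRegressor(max_depth=5)),
                    ("ridge", Ridge())],
        final_estimator=Ridge()),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
```

Bagging and random forests mainly reduce variance, boosting mainly reduces bias, and stacking lets a second-level model learn how to weight the base learners.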
- insurance fraud detection
The first section of the classifier pipeline is feature selection with select-k-best, scoring by mutual_info_classif, to choose 10 out of 48 features. The next section is a grid search to tune the hyper-parameters of the underlying model, which can be linear, tree-based, polynomial or an ensemble. The scoring methods used in the grid search were precision, recall, f1 and auc; f1 and auc were used to re-fit the model.
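The pipeline described above can be sketched with scikit-learn as follows. The data are synthetic, logistic regression stands in for whichever model is being tuned, and note that with multi-metric scoring `GridSearchCV` re-fits on a single named metric (f1 shown here):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=400, n_features=48,
                           n_informative=8, random_state=0)

pipe = Pipeline([
    # section 1: keep the 10 best of 48 features by mutual information
    ("select", SelectKBest(mutual_info_classif, k=10)),
    # section 2: the underlying model whose hyper-parameters are tuned
    ("model", LogisticRegression(max_iter=1000)),
])

grid = GridSearchCV(
    pipe,
    param_grid={"model__C": [0.1, 1.0, 10.0]},
    scoring={"precision": "precision", "recall": "recall",
             "f1": "f1", "auc": "roc_auc"},
    refit="f1",  # the metric used to pick and re-fit the final model
    cv=3,
)
grid.fit(X, y)
```

Swapping the `"model"` step (and its `param_grid`) reuses the same pipeline for the tree, polynomial and ensemble classifiers listed below.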
Classifiers evaluated:
- LogisticRegression
- SGDClassifier
- LogisticRegression with polynomial features
- SVC
- KNeighborsClassifier
- DecisionTreeClassifier
- ExtraTreeClassifier
- RandomForestClassifier
- AdaBoostClassifier
- BaggingClassifier
- VotingClassifier
| classifier | f1 | auc |
| --- | --- | --- |
| LogisticRegression | 0.73 | 0.73 |
| SGDClassifier | 0.68 | 0.68 |
| LogisticRegression with polynomial features | 0.73 | 0.73 |
| SVC | 0.73 | 0.73 |
| KNeighborsClassifier | 0.73 | 0.82 |
| DecisionTreeClassifier | 0.91 | 0.92 |
| ExtraTreeClassifier | 0.73 | 0.73 |
| RandomForestClassifier | 0.87 | 0.93 |
| AdaBoostClassifier | 0.97 | 1 |
| BaggingClassifier | 0.91 | 0.92 |
| VotingClassifier | 0.95 | 0.97 |
The dataset used here for medical cost prediction is from Kaggle; you can find it here.
- improve the models
- make the readme file clearer
- feature-engineer the data