The goal is to predict patients' medical costs from their historical data, such as age, gender, number of children, smoking habits and the region they live in. This is a regression problem, so linear, non-linear and ensemble methods are compared to choose the model that makes predictions with the lowest variance and the highest possible accuracy. Fraud detection is divided into two parts depending on whether the patient was admitted to hospital or treated in the OPD; in both cases it is a binary classification task.
Before applying any modelling algorithm or changing the data, the basic assumptions are examined with a 4-plot. (Almost every feature of the fraud detection data is categorical, so the 4-plot is used only on numerical data.) The four assumptions are:
- Fixed Location: If the fixed location assumption holds, then the run sequence plot will be flat and non-drifting.
- Fixed Variation: If the fixed variation assumption holds, then the vertical spread in the run sequence plot will be approximately the same over the entire horizontal axis.
- Randomness: If the randomness assumption holds, then the lag plot will be structureless and random.
- Fixed Distribution: If the fixed distribution assumption holds, in particular, if the fixed normal distribution holds, then the histogram will be bell-shaped, and the normal probability plot will be linear.
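As a minimal sketch, the 4-plot can be assembled with matplotlib and scipy; the series below is synthetic stand-in data, not the project's actual feature:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=10, scale=2, size=300)  # stand-in for a numeric feature

fig, ax = plt.subplots(2, 2, figsize=(8, 6))
ax[0, 0].plot(x)                           # run sequence plot: checks location/variation drift
ax[0, 0].set_title("run sequence")
ax[0, 1].scatter(x[:-1], x[1:], s=8)       # lag plot: visible structure implies non-randomness
ax[0, 1].set_title("lag plot")
ax[1, 0].hist(x, bins=30)                  # histogram: overall shape of the distribution
ax[1, 0].set_title("histogram")
stats.probplot(x, dist="norm", plot=ax[1, 1])  # normal probability plot: linear if normal
ax[1, 1].set_title("normal probability plot")
fig.tight_layout()
fig.savefig("four_plot.png")
```

If all four assumptions hold, the run sequence plot is flat, the lag plot is structureless, the histogram is bell-shaped and the probability plot is close to a straight line.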
central tendency
After verifying that the data fulfil the underlying assumptions, exploratory data analysis can begin. The first summary covers location and variability statistics; distribution and correlation statistics are calculated afterwards. Four estimators of location were used: the mean (not robust), the trimmed mean, the winsorized mean and the median. The winsorized mean has the lowest standard error among the three mean estimators. The median is lower than the mean for every feature, which is a hint of right-skewed distributions.
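The four location estimators can be computed with numpy and scipy; the data below are a synthetic right-skewed stand-in for a feature like `charges`:

```python
import numpy as np
from scipy import stats
from scipy.stats import mstats

rng = np.random.default_rng(1)
charges = rng.lognormal(mean=9, sigma=0.9, size=1000)  # right-skewed stand-in data

mean = charges.mean()                                        # not robust to outliers
tmean = stats.trim_mean(charges, proportiontocut=0.1)        # drops the top/bottom 10%
wmean = mstats.winsorize(charges, limits=(0.1, 0.1)).mean()  # clips the top/bottom 10%
median = np.median(charges)                                  # fully robust

# for a right-skewed distribution the median sits below the mean
assert median < mean
```

The trimmed mean discards extreme observations outright, while the winsorized mean replaces them with the nearest retained value, which is why it keeps more of the sample and tends to have a lower standard error.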
distribution analysis
A chi-squared goodness-of-fit test was run on every discrete and continuous feature to estimate the distribution that best matches the data.
Theoretical candidate distributions compared with the underlying distribution of the data:
Best-fit distribution for each feature, estimated by the chi-squared test:
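A chi-squared goodness-of-fit test for a continuous feature can be sketched as follows: fit a candidate distribution, bin the data, and compare observed with expected bin counts. The gamma distribution and the data here are illustrative stand-ins, not the project's fitted results:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.gamma(shape=2.0, scale=3.0, size=2000)  # stand-in continuous feature

# fit a candidate distribution to the data
params = stats.gamma.fit(data)

# 10 equal-count bins, then expected counts under the fitted distribution
edges = np.quantile(data, np.linspace(0, 1, 11))
observed, _ = np.histogram(data, bins=edges)
cdf = stats.gamma.cdf(edges, *params)
expected = len(data) * np.diff(cdf)
expected *= observed.sum() / expected.sum()  # totals must match for the test

# ddof accounts for the parameters estimated from the data
stat, p = stats.chisquare(observed, expected, ddof=len(params))
# a large p-value means the candidate distribution is not rejected
```

Repeating this over several candidate distributions and keeping the one with the best test result gives the "estimated distribution" per feature.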
normality test
Three methods are used to test the normality of a distribution: Q-Q plots, the Anderson-Darling test and the Shapiro-Wilk test.
- H0: The null hypothesis assumes no difference between the observed and theoretical distribution.
- Ha: The alternative hypothesis assumes there is a difference between the observed and theoretical distribution.
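Both tests are available in scipy; the sketch below uses synthetic skewed data as a stand-in for a feature like `charges`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
skewed = rng.lognormal(size=500)  # clearly non-normal stand-in data

# Shapiro-Wilk: a small p-value rejects H0 (normality)
w, p = stats.shapiro(skewed)
assert p < 0.05  # reject

# Anderson-Darling: compare the statistic against tabulated critical values
result = stats.anderson(skewed, dist="norm")
idx5 = list(result.significance_level).index(5.0)
# reject at the 5% level when the statistic exceeds the 5% critical value
assert result.statistic > result.critical_values[idx5]
```

Note that `scipy.stats.anderson` returns critical values at the 15%, 10%, 5%, 2.5% and 1% significance levels rather than a single p-value, which matches the layout of the table below.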
Anderson-Darling test results (critical values at the 15, 10, 5, 2.5 and 1 % significance levels):

| feature | statistic | critical values (15%, 10%, 5%, 2.5%, 1%) |
| --- | --- | --- |
| age | 18.7887 | 0.574, 0.654, 0.785, 0.915, 1.089 |
| bmi | 1.2355 | 0.574, 0.654, 0.785, 0.915, 1.089 |
| children | 87.6711 | 0.574, 0.654, 0.785, 0.915, 1.089 |
| charges | 85.1285 | 0.574, 0.654, 0.785, 0.915, 1.089 |
Shapiro-Wilk test results:

| feature | statistic | p-value | decision |
| --- | --- | --- | --- |
| age | 0.9446 | 5.6874e-22 | reject |
| bmi | 0.9938 | 2.6098e-05 | reject |
| children | 0.8231 | 5.0663e-36 | reject |
| charges | 0.8146 | 1.1504e-36 | reject |
outliers
Box plots are used to identify outliers in the data. In fraud detection, outliers flag unusual behaviour, so for that purpose they should be kept in the dataset.
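The box-plot rule can also be applied directly with numpy: points beyond 1.5 times the interquartile range from the quartiles are flagged. The values below are synthetic stand-ins for claim amounts:

```python
import numpy as np

rng = np.random.default_rng(4)
charges = np.concatenate([rng.normal(9000, 2000, 500),
                          np.array([45000.0, 52000.0])])  # a few extreme claims

# box-plot fences: 1.5 * IQR beyond the first and third quartiles
q1, q3 = np.percentile(charges, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = charges[(charges < lower) | (charges > upper)]

# for fraud detection the outliers are kept, only flagged:
# unusual claims are exactly the signal we want the model to see
```

Dropping these rows would remove the most informative examples for the classifier, which is why they stay in the dataset.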
Three methods are used to transform the features towards normality:
- quantile transformation
- Box-Cox transformation
- Yeo-Johnson transformation
Scaling is done with the robust scaling method because the data contain outliers.
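All three transformations and the robust scaler are in scikit-learn; the sketch below applies them to a synthetic skewed, strictly positive feature (Box-Cox requires positive data):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer, RobustScaler

rng = np.random.default_rng(5)
X = rng.lognormal(size=(300, 1))  # skewed, strictly positive stand-in feature

boxcox = PowerTransformer(method="box-cox")      # positive data only
yeo = PowerTransformer(method="yeo-johnson")     # handles any sign
quantile = QuantileTransformer(output_distribution="normal", n_quantiles=100)

Xb = boxcox.fit_transform(X)
Xy = yeo.fit_transform(X)
Xq = quantile.fit_transform(X)

# robust scaling centres on the median and scales by the IQR,
# so outliers do not dominate the scale as they would with z-scoring
Xs = RobustScaler().fit_transform(X)
```

After robust scaling the median of each feature is zero and its interquartile range is one, regardless of how extreme the outliers are.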
scaling in fraud detection dataset:
The fraud detection data consist of two separate tables on patients: medical information and beneficiary information for a given insurance claim. The two are combined using the claim id, giving 53 features and 40474 records. A random forest classifier with recursive feature elimination and cross-validation (RFECV) is used to find the optimal feature count, and the select-k-best method is used to pick the best k features for the models.
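The join and the two selection steps can be sketched as follows. The table names, column names and the synthetic classification data are illustrative stand-ins, not the project's actual schema:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV, SelectKBest, mutual_info_classif

# hypothetical stand-ins for the two tables, joined on the claim id
medical = pd.DataFrame({"claim_id": [1, 2, 3], "diag": ["a", "b", "a"]})
beneficiary = pd.DataFrame({"claim_id": [1, 2, 3], "age": [64, 71, 58]})
claims = medical.merge(beneficiary, on="claim_id")

# RFECV finds the optimal number of features with a random forest ...
X, y = make_classification(n_samples=300, n_features=12,
                           n_informative=4, random_state=0)
rfecv = RFECV(RandomForestClassifier(n_estimators=50, random_state=0), cv=3)
rfecv.fit(X, y)

# ... and SelectKBest keeps the k highest-scoring features for the models
Xk = SelectKBest(mutual_info_classif, k=5).fit_transform(X, y)
```

RFECV answers "how many features", while SelectKBest answers "which k features", so the two steps complement each other.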
- medical cost prediction
The coefficient of determination, R², measures the share of the variance in the target that is explained by the model's predictions. R² values close to 1 mean an almost-perfect regression, while values close to 0 (or negative) imply a bad model.
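Concretely, R² = 1 - SS_res / SS_tot, which can be checked against scikit-learn's implementation on a tiny made-up example:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.9])

# residual sum of squares vs total sum of squares around the mean
ss_res = np.sum((y_true - y_pred) ** 2)   # 0.1
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 20.0
r2 = 1 - ss_res / ss_tot  # 0.995

assert abs(r2 - r2_score(y_true, y_pred)) < 1e-12
```

A model that always predicts the mean gets R² = 0, and a model worse than that gets a negative score, which is why "close to 0 or negative" signals a bad model.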
- linear models - ordinary least squares (OLS), lasso (L1), ridge (L2), elastic net, RANSAC, Huber
| model | R² score 1 | R² score 2 |
| --- | --- | --- |
| ols | 0.761332 | 0.791844 |
| lasso | 0.362672 | 0.362733 |
| ridge | 0.761330 | 0.791807 |
| elasticnet | 0.559900 | 0.569312 |
| ransac | 0.595668 | 0.623618 |
| huber | 0.754969 | 0.788107 |
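The six linear models above can be fitted and scored with a common scikit-learn loop; the regression data here are synthetic, so the scores will not match the table:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import (LinearRegression, Lasso, Ridge, ElasticNet,
                                  RANSACRegressor, HuberRegressor)
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=6, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "ols": LinearRegression(),
    "lasso": Lasso(alpha=1.0),            # L1 penalty: sparse coefficients
    "ridge": Ridge(alpha=1.0),            # L2 penalty: shrunk coefficients
    "elasticnet": ElasticNet(alpha=1.0),  # mix of L1 and L2
    "ransac": RANSACRegressor(random_state=0),  # fits on inlier subsets
    "huber": HuberRegressor(),            # loss robust to outliers
}
# .score() on a regressor returns R² on the held-out split
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
```

RANSAC and Huber are included because `charges` has heavy outliers; both down-weight or exclude extreme points rather than letting them dominate the fit.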
- non-linear models - polynomial ridge, k nearest neighbours, decision tree
| model | R² score 1 | R² score 2 |
| --- | --- | --- |
| ridge poly | 0.829277 | 0.856282 |
| knn | 0.822168 | 0.840297 |
| tree | 0.842030 | 0.851783 |
- ensemble model - bagging, random forest, adaptive boost, stacking
| model | R² score 1 | R² score 2 |
| --- | --- | --- |
| bagging | 0.843557 | 0.852594 |
| adaboost | 0.759680 | 0.791620 |
| random forest | 0.838272 | 0.854912 |
| stacking | 0.831071 | 0.851779 |
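The four ensembles can be built the same way; again the data are synthetic stand-ins, so the scores are illustrative only:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (AdaBoostRegressor, BaggingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=6, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    # bagging: many trees on bootstrap samples, predictions averaged
    "bagging": BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                                random_state=0),
    "adaboost": AdaBoostRegressor(n_estimators=50, random_state=0),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    # stacking: a meta-learner combines the base models' predictions
    "stacking": StackingRegressor(
        estimators=[("tree", DecisionTreeRegressor(max_depth=5)),
                    ("ridge", Ridge())],
        final_estimator=Ridge()),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
```

Bagging and random forests mainly reduce variance, boosting mainly reduces bias, and stacking lets a second-level model learn how to weight the base learners.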
- insurance fraud detection
The first section of the classifier pipeline is feature selection with select-k-best, scoring by mutual_info_classif, to choose 10 out of 48 features. The next section is a grid search to tune the hyper-parameters of the underlying model, which can be linear, tree-based, polynomial or an ensemble. The scoring methods used in the grid search were precision, recall, f1 and auc; f1 and auc were used to re-fit the model.
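The pipeline described above can be sketched with scikit-learn as follows. The data are synthetic, logistic regression stands in for whichever model is being tuned, and note that with multi-metric scoring `GridSearchCV` re-fits on a single named metric (f1 shown here):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=400, n_features=48,
                           n_informative=8, random_state=0)

pipe = Pipeline([
    # section 1: keep the 10 best of 48 features by mutual information
    ("select", SelectKBest(mutual_info_classif, k=10)),
    # section 2: the underlying model whose hyper-parameters are tuned
    ("model", LogisticRegression(max_iter=1000)),
])

grid = GridSearchCV(
    pipe,
    param_grid={"model__C": [0.1, 1.0, 10.0]},
    scoring={"precision": "precision", "recall": "recall",
             "f1": "f1", "auc": "roc_auc"},
    refit="f1",  # the metric used to pick and re-fit the final model
    cv=3,
)
grid.fit(X, y)
```

Swapping the `"model"` step (and its `param_grid`) reuses the same pipeline for the tree, polynomial and ensemble classifiers listed below.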
Classifiers evaluated:
- LogisticRegression
- SGDClassifier
- LogisticRegression with polynomial features
- SVC
- KNeighborsClassifier
- DecisionTreeClassifier
- ExtraTreeClassifier
- RandomForestClassifier
- AdaBoostClassifier
- BaggingClassifier
- VotingClassifier
| classifier | f1 | auc |
| --- | --- | --- |
| LogisticRegression | 0.73 | 0.73 |
| SGDClassifier | 0.68 | 0.68 |
| LogisticRegression with polynomial features | 0.73 | 0.73 |
| SVC | 0.73 | 0.73 |
| KNeighborsClassifier | 0.73 | 0.82 |
| DecisionTreeClassifier | 0.91 | 0.92 |
| ExtraTreeClassifier | 0.73 | 0.73 |
| RandomForestClassifier | 0.87 | 0.93 |
| AdaBoostClassifier | 0.97 | 1 |
| BaggingClassifier | 0.91 | 0.92 |
| VotingClassifier | 0.95 | 0.97 |
The dataset used here for medical cost prediction is from Kaggle; you can find it here.
- improve the models
- make the readme file clearer
- feature-engineer the data