## Import Packages

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn import metrics 

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings("ignore")

## Preprocess Dataset

In [3]:
mlcc = pd.read_csv("MLCC.csv")
mlcc.head()

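| | ClaimNumber | EvalResult | 6054 | 6055 | 6056 | 6057 | 6058 | 6060 | 6061 | 6062 | ... | 7928 | 7929 | 7930 | 7931 | 8445 | 8446 | 8447 | 8448 | 8449 | 8450 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 |
| 2 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 2 | 2 | 2 | 1 | 0 |
| 3 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 3 | 2 | 3 | 1 | 0 |
| 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 4 | 2 | 4 | 1 | 0 |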


After loading the dataset into **mlcc**, I re-index the dataframe using the values in **ClaimNumber**. Then, I strip leading and trailing whitespace from the column names of **mlcc**.

In [4]:
mlcc = mlcc.set_index('ClaimNumber')
mlcc.columns = mlcc.columns.str.strip()
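
To illustrate what these two steps do, here is a minimal sketch on a made-up three-row frame. The **toy** frame and its values are assumptions for illustration, not rows from MLCC.csv. Note that `str.strip()` only removes whitespace at the start and end of each name.

```python
import pandas as pd

# Hypothetical miniature of the raw file; two column names carry stray spaces.
toy = pd.DataFrame({
    "ClaimNumber": [101, 102, 103],
    " EvalResult": [0, 1, 0],
    "6054 ": [0, 2, 0],
})

toy = toy.set_index("ClaimNumber")      # rows are now addressed by claim number
toy.columns = toy.columns.str.strip()   # drop leading/trailing whitespace only

print(list(toy.columns))                # ['EvalResult', '6054']
print(toy.loc[102, "EvalResult"])       # 1
```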

In [5]:
mlcc.EvalResult.value_counts()

0    238322
1     44742
2         1
Name: EvalResult, dtype: int64

Since only 0 and 1 are valid values in **EvalResult** and only one row has a different value (2), I keep only the rows in which **EvalResult** is either 0 or 1.

In [6]:
mlcc = mlcc.loc[(mlcc['EvalResult'] == 1) | (mlcc['EvalResult'] == 0)]
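
As a side note, the same filter can be written with `isin`, which scales more gracefully if the list of valid labels grows. The sketch below uses a hypothetical five-row frame (not the real data) to check that both forms select the same rows.

```python
import pandas as pd

# Hypothetical labels, including one invalid value (2).
toy = pd.DataFrame({"EvalResult": [0, 1, 2, 0, 1]})

# Form used in the cell above
by_or = toy.loc[(toy["EvalResult"] == 1) | (toy["EvalResult"] == 0)]
# Equivalent form using isin
by_isin = toy.loc[toy["EvalResult"].isin([0, 1])]

print(by_or.equals(by_isin))   # True
print(len(by_isin))            # 4
```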

## Feature Selection/Dimensionality Reduction

Here, for each ID column, I count the number of unique values and the number of rows whose value is 0. Then, I save both counts into a dictionary **unique_dict**, using the column name as the key.

In [7]:
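unique_dict = {}
for col in mlcc.columns[1:]:  # skip the first column, EvalResult (the label)
    unique_dict[col] = [len(mlcc[col].unique())]  # number of unique values
    val_counts = mlcc[col].value_counts()
    # number of rows equal to 0 (0 if the value never occurs)
    if 0 in val_counts.index:
        unique_dict[col].append(val_counts.loc[0])
    else:
        unique_dict[col].append(0)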
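
The same two statistics can also be computed without an explicit loop. The following is a sketch on a hypothetical four-row frame (column names borrowed from the dataset, values invented); it builds the table both ways and checks that they agree.

```python
import pandas as pd

# Hypothetical data: one label column followed by three ID columns.
toy = pd.DataFrame({
    "EvalResult": [0, 1, 0, 1],
    "6054": [0, 0, 0, 0],
    "6065": [1, 2, 0, 3],
    "7786": [1, 1, 2, 2],
})
id_cols = toy.columns[1:]  # skip the label column, as in the loop above

# Loop version, mirroring the cell above
unique_dict = {}
for col in id_cols:
    counts = toy[col].value_counts()
    zeros = counts.loc[0] if 0 in counts.index else 0
    unique_dict[col] = [len(toy[col].unique()), zeros]

# Vectorized version; dropna=False matches len(unique()), which counts NaN
summary = pd.DataFrame({
    "UniqueValNum": toy[id_cols].nunique(dropna=False),
    "Counts0": (toy[id_cols] == 0).sum(),
})
print(summary)
```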

If most values in an ID column are 0, samples from both classes very likely share the same answer to the corresponding question, so that column is unlikely to help with classification. Following this logic, I keep only the ID columns in which at least 20000 rows differ from 0, that is, columns whose count of 0 is at most the number of rows minus 20000. I choose this threshold because approximately 44000 samples are classified as 1: if about half of them answer a question differently from the samples classified as 0, that ID column could be helpful for classification.

In [8]:
vals = pd.DataFrame(unique_dict, index = ['UniqueValNum', 'Counts0']).T
remain_vals = vals.loc[vals['Counts0'] <= mlcc.shape[0] - 20000]
remain_vals

| | UniqueValNum | Counts0 |
|---|---|---|
| 6062 | 9 | 248148 |
| 6064 | 6 | 259433 |
| 6065 | 4 | 8378 |
| 6085 | 6 | 243844 |
| 7778 | 7 | 75600 |
| 7779 | 5 | 164544 |
| 7781 | 6 | 106861 |
| 7782 | 7 | 55131 |
| 7783 | 4 | 255274 |
| 7786 | 3 | 8715 |
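
To make the cutoff concrete, the arithmetic can be sketched as follows. The row count comes from the two class counts reported earlier; the column **9999** and its zero count are hypothetical, added only to show a column that the filter would reject.

```python
import pandas as pd

n_rows = 238322 + 44742     # rows left after dropping the EvalResult == 2 row
cutoff = n_rows - 20000     # a column is kept if its zero count is <= cutoff
print(n_rows, cutoff)       # 283064 263064

# Two zero counts taken from the table above, plus one hypothetical
# nearly-constant column ("9999") that should be filtered out.
counts0 = pd.Series({"6064": 259433, "6065": 8378, "9999": 280000})
kept = counts0[counts0 <= cutoff].index.tolist()
print(kept)                 # ['6064', '6065']
```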


Some ID columns have a large number of unique values, which indicates that the corresponding questions have many different answers. Such diverse answers might capture distinguishing characteristics of the respondents and so help classify them. Therefore, I take the median number of unique values across the selected ID columns and split those columns into two subsets: columns with at most the median number of unique values (**norm_cols**) and columns with more than the median (**cat_cols**).

In [9]:
norm_cols = remain_vals.loc[remain_vals.UniqueValNum <= remain_vals.UniqueValNum.median()].index

In [10]:
cat_cols = remain_vals.loc[remain_vals.UniqueValNum > remain_vals.UniqueValNum.median()].index
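
As an illustration of the split, the sketch below applies the same two comparisons to the **UniqueValNum** values displayed in the output above. It assumes those ten rows are the whole of **remain_vals**; a different run or a longer table would give a different median and a different split. Columns exactly at the median fall into **norm_cols**.

```python
import pandas as pd

# UniqueValNum values copied from the remain_vals output shown earlier.
unique_val_num = pd.Series(
    [9, 6, 4, 6, 7, 5, 6, 7, 4, 3],
    index=["6062", "6064", "6065", "6085", "7778",
           "7779", "7781", "7782", "7783", "7786"],
)

median = unique_val_num.median()
norm_cols = unique_val_num[unique_val_num <= median].index  # at or below median
cat_cols = unique_val_num[unique_val_num > median].index    # strictly above

print(median)            # 6.0
print(list(cat_cols))    # ['6062', '7778', '7782']
```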

## Classifier

Here, I create a pipeline that transforms ID columns that I select and fits a Logistic Regression model on these transformed columns.

For ID columns that have more unique values than the threshold, I one-hot encode their values because diverse answers might be helpful for classification. Since one-hot encoding generates correlated features, I apply Principal Component Analysis (PCA) to project them onto uncorrelated components; because no number of components is specified, PCA keeps every component rather than discarding any. For ID columns whose number of unique values is at or below the threshold, I keep the values unchanged. All other columns in **mlcc** are dropped.

Since logistic regression is designed for binary classification, I choose it as my model. I set solver = "saga" because the dataset is large (283064 rows) and "saga" is a good choice for large datasets. Since there are a total of 96 unique values in the two ID columns whose number of unique values exceeds the threshold, I set the maximum number of iterations to 100 for "saga" to converge.

In [11]:
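# Note: on scikit-learn >= 1.2 the argument is sparse_output=False
# (the old name, sparse, was removed in 1.4)
cats = Pipeline([
    ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore")),  # dense one-hot columns
    ('pca', PCA(svd_solver='full'))  # n_components not set: keeps all components
])

# Pass norm_cols through unchanged, encode cat_cols, drop everything else
ct = ColumnTransformer([('Normal', FunctionTransformer(lambda x: x), norm_cols),
                        ('Categ', cats, cat_cols)], remainder="drop")

pl = Pipeline([('feats', ct), ('lr', LogisticRegression(solver='saga', max_iter=100))])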
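
To show how the pieces fit together, here is a self-contained sketch of the same pipeline shape on synthetic data. Everything about the data is an assumption: the columns **few**, **many**, and **unused**, and the label rule, are invented for the demo. Two small substitutions are made for portability: the helper `dense_one_hot` (hypothetical) picks whichever dense-output argument the installed scikit-learn accepts, and the string `"passthrough"` stands in for the identity `FunctionTransformer`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def dense_one_hot():
    """Return a dense-output encoder on both old and new scikit-learn."""
    try:
        return OneHotEncoder(sparse_output=False, handle_unknown="ignore")
    except TypeError:  # scikit-learn < 1.2 only knows `sparse`
        return OneHotEncoder(sparse=False, handle_unknown="ignore")


rng = np.random.RandomState(0)
toy = pd.DataFrame({
    "few": rng.randint(0, 3, size=200),     # stands in for norm_cols
    "many": rng.randint(0, 8, size=200),    # stands in for cat_cols
    "unused": rng.randint(0, 2, size=200),  # dropped by remainder="drop"
})
y = (toy["many"] >= 4).astype(int)          # synthetic label for the demo

cats = Pipeline([("ohe", dense_one_hot()), ("pca", PCA(svd_solver="full"))])
ct = ColumnTransformer(
    [("Normal", "passthrough", ["few"]), ("Categ", cats, ["many"])],
    remainder="drop",
)
pl = Pipeline([("feats", ct),
               ("lr", LogisticRegression(solver="saga", max_iter=100))])

pl.fit(toy, y)
features = pl.named_steps["feats"].transform(toy)
print(features.shape)  # one passthrough column plus one PCA component per category
```

Because PCA is left at its default number of components, the transformed matrix has as many PCA columns as there are one-hot columns; the step rotates the features but does not shrink them.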

## Evaluation of Classifier's Performance

The classes in **mlcc** are imbalanced: 238322 samples are classified as 0 and 44742 as 1. The accuracy score alone therefore cannot show how well my model classifies samples whose true label is 1. Hence, I use recall and specificity as evaluation metrics to check the model's performance on each label.

Fitted on 70% of the data and evaluated on the remaining 30%, my model obtains a recall of 0.84 and a specificity of 0.98. In other words, 84% of samples labeled 1 and 98% of samples labeled 0 are correctly classified. Overall, the accuracy score of 0.959 indicates that my model correctly predicts the labels of 95.9% of the test samples.
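
The relationship between these three numbers can be checked with a little arithmetic. The sketch below assumes the test split has the same class proportions as the full dataset, which a random split only guarantees approximately. It also shows why accuracy alone is a weak signal here: a model that always predicts 0 would already score about 0.84.

```python
n_neg, n_pos = 238322, 44742           # class counts from value_counts()
recall = 0.8402860548271752            # reported recall
specificity = 0.9815514154638022       # reported specificity

prevalence = n_pos / (n_pos + n_neg)   # share of samples labeled 1

# Accuracy of a trivial model that always predicts 0
baseline_accuracy = 1 - prevalence

# Accuracy implied by recall and specificity, ASSUMING the test split
# has the same class proportions as the full dataset
implied_accuracy = recall * prevalence + specificity * (1 - prevalence)

print(round(baseline_accuracy, 3))     # 0.842
print(round(implied_accuracy, 3))      # 0.959
```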

In [12]:
mlcc['EvalResult'].value_counts()

0    238322
1     44742
Name: EvalResult, dtype: int64

In [20]:
X = mlcc.loc[:, mlcc.columns != 'EvalResult']
Y = mlcc['EvalResult']

In [21]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3)
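
The split above is purely random, so the class ratio in the test set can drift slightly from run to run. One optional variation, not used in this notebook, is to pass `stratify` and `random_state` to `train_test_split`. The sketch below uses synthetic labels with a 16% positive rate, chosen only to resemble **EvalResult**.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 16% positive rate (an assumption for the demo).
y = np.array([1] * 160 + [0] * 840)
X = np.arange(len(y)).reshape(-1, 1)   # placeholder features

x_train, x_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    stratify=y,        # keep the class ratio the same in both splits
    random_state=0,    # make the split reproducible
)

print(y_train.mean(), y_test.mean())
```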

In [22]:
pl.fit(x_train, y_train)

Pipeline(memory=None,
     steps=[('feats', ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('Normal', FunctionTransformer(accept_sparse=False, check_inverse=True,
          func=<function <lambda> at 0x000001C681FB08C8>, inv_kw_args=None,
      ...penalty='l2', random_state=None, solver='saga',
          tol=0.0001, verbose=0, warm_start=False))])

In [23]:
preds = pl.predict(x_test)

In [24]:
# proportion of predictions that are right
metrics.accuracy_score(y_test, preds)

0.9592204427696656

In [25]:
# Recall: proportion of results labeled as 1 that are correctly classified
# TP/P
metrics.recall_score(y_test, preds)

0.8402860548271752

In [26]:
# Specificity: proportion of results labeled as 0 that are correctly classified
# TN/N
metrics.recall_score(y_test, preds, pos_label=0)

0.9815514154638022
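
All three metrics can also be read off a single confusion matrix. The following sketch uses ten hypothetical labels and predictions (not the MLCC results) and confirms that the hand-computed values match `recall_score` with the appropriate `pos_label`.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels and predictions, not the MLCC results.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# With labels=[0, 1], ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

recall = float(tp / (tp + fn))            # TP / P
specificity = float(tn / (tn + fp))       # TN / N
accuracy = float((tp + tn) / len(y_true))

print(round(recall, 3), round(specificity, 3), round(accuracy, 3))  # 0.75 0.833 0.8
```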