In [141]:
import pandas as pd

# Classification

## Load the Dataset

Load the Dataset and Split the Dataset into Training and Test Sets

In [142]:
df = pd.read_csv('customer_credit_score.csv')

X = df.drop(columns=['Credit_Score'])
y = df['Credit_Score']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

## KNN Model


KNN with n_neighbors = 3

In [143]:
from sklearn.neighbors import KNeighborsClassifier

KNN_model = KNeighborsClassifier(n_neighbors=3)
KNN_model.fit(X_test, y_test)

## Decision Tree Model

Decision Tree Model

In [144]:
from sklearn.tree import DecisionTreeClassifier

DT_model = DecisionTreeClassifier(random_state=42)
DT_model.fit(X_train, y_train)

## Evaluation

Predictions with the First and Second Model using following metrics
- Accuracy
- Precision
- Recall

In [145]:
from sklearn.metrics import accuracy_score

KNN_y_pred = KNN_model.predict(X_test)
KNN_score = accuracy_score(y_test, KNN_y_pred)

print("KNN Model score: ", KNN_score)
print()

from sklearn.metrics import confusion_matrix

KNN_cf = confusion_matrix(y_test, KNN_y_pred)

TN, FP, FN, TP = KNN_cf.ravel()
print('TN:', TN)
print('FP: ', FP)
print('FN: ', FN)
print('TP: ', TP)
print()

accuracy = (TN + TP) / (TN + FP + FN + TP)
precision = TP / (TP + FP)
recall = TP / (TP + FN)

print('Accuracy: ', accuracy)
print('Precision: ', precision)
print('Recall: ', recall)
# ROUND UP for reference in case recommendations

KNN Model score:  0.86975

TN: 13113
FP:  1002
FN:  1603
TP:  4282

Accuracy:  0.86975
Precision:  0.8103709311127933
Recall:  0.7276125743415464


In [146]:
DT_y_pred = DT_model.predict(X_test)
DT_score = accuracy_score(y_test, DT_y_pred)

print("DT Model score: ", DT_score)
print()

DT_cf = confusion_matrix(y_test, DT_y_pred)

TN, FP, FN, TP = DT_cf.ravel()

accuracy = (TN + TP) / (TN + FP + FN + TP)
precision = TP / (TP + FP)
recall = TP / (TP + FN)

print('Accuracy: ', accuracy)
print('Precision: ', precision)
print('Recall: ', recall)
# ROUND UP for reference in case recommendations

DT Model score:  0.8252

Accuracy:  0.8252
Precision:  0.7056291960750559
Recall:  0.6965165675446049


 ## Use Case Recommendation

The KNN and Decision Tree models was both able to fit the requirement of being able to detect at least 70% of poor credit scores within the entire population since the recall of KNN and Decision Tree models are 73% and 70%, respectively. Both models also fit the requirement about the certainty that around 70% of poor credit score predictions are truly poor credit scores because the precision of KNN and Decision Tree models are 82% and 71%, respectively.

However, I would recommend giving the **KNN model** to the company for their Credit Scoring model because it has a higher recall, meaning that it can predict more percentage of poor credit scores within the dataset. The KNN model also has a higher precision so it has a higher certainty that its predictions of poor credit scores are accurate.

# Case Study

What are my observations when you changed the KNN n_neighbors to 9? What could be the possible cause for this observation?

In [147]:
# Change the KNN n_neighbors to 9
KNN_model = KNeighborsClassifier(n_neighbors=9)
KNN_model.fit(X_test, y_test)

KNN_y_pred = KNN_model.predict(X_test)
KNN_score = accuracy_score(y_test, KNN_y_pred)

print("KNN Model score: ", KNN_score)
print()

KNN_cf = confusion_matrix(y_test, KNN_y_pred)

TN, FP, FN, TP = KNN_cf.ravel()
print('TN:', TN)
print('FP: ', FP)
print('FN: ', FN)
print('TP: ', TP)
print()

accuracy = (TN + TP) / (TN + FP + FN + TP)
precision = TP / (TP + FP)
recall = TP / (TP + FN)

print('Accuracy: ', accuracy)
print('Precision: ', precision)
print('Recall: ', recall)

KNN Model score:  0.8101

TN: 12817
FP:  1298
FN:  2500
TP:  3385

Accuracy:  0.8101
Precision:  0.7228272474909246
Recall:  0.5751911639762107


After changing the KNN n_neighbors to 9, the accuracy, precision, and recall decreased. The possible reason for this is using the 9th nearest neighbors increased the number and distance of neighbors needed to predict the label. This leads to an increase in bias(Zhu, n.d.) or systematic errors in the model, causing inaccurate predictions (Barcelos, 2022).

# Regression

## Load the Dataset

Load the Dataset and Split the Dataset into Training and Test Sets

In [148]:
df = pd.read_csv('insurance_standardized.csv')

X = df.drop(columns=['charges'])
y = df['charges']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,-1.438764,-1.010519,-0.45332,-0.908614,1.970587,-1.343905,16884.924
1,-1.509965,0.989591,0.509621,-0.078767,-0.507463,-0.438495,1725.5523
2,-0.797954,0.989591,0.383307,1.580926,-0.507463,-0.438495,4449.462
3,-0.441948,0.989591,-1.305531,-0.908614,-0.507463,0.466915,21984.47061
4,-0.513149,0.989591,-0.292556,-0.908614,-0.507463,0.466915,3866.8552


## Linear Regression Model

Linear Regression Model

In [149]:
from sklearn.linear_model import LinearRegression

LR_model = LinearRegression()
LR_model.fit(X_train, y_train)

## Create the Second Model

KNN Regression Model with n_neighbors = 13

In [150]:
from sklearn.neighbors import KNeighborsRegressor

KNNR_model = KNeighborsRegressor(n_neighbors=13)
KNNR_model.fit(X_train, y_train)

## Evaluation

Predictions with the First and Second Model using the following metrics
- Mean Absolute Error
- R2

In [151]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

LR_y_pred = LR_model.predict(X_test)

LR_mae = mean_absolute_error(y_test, LR_y_pred)
LR_r2 = r2_score(y_test, LR_y_pred)

print('LR_MAE: ', LR_mae)
print('LR_R2: ', LR_r2)

LR_MAE:  3930.333273901141
LR_R2:  0.7998747145449959


In [152]:
KNNR_y_pred = KNNR_model.predict(X_test)

KNNR_mae = mean_absolute_error(y_test, KNNR_y_pred)
KNNR_r2 = r2_score(y_test, KNNR_y_pred)

print('KNNR_MAE: ', KNNR_mae)
print('KNNR_R2: ', KNNR_r2)

KNNR_MAE:  3054.9664831765217
KNNR_R2:  0.8624333834293745


## Use Case Recommendation

With these models, which models would I recommend giving to the company for their Medical Insurance Pricing Estimate Model?

Only the KNN Regression model fit the requirement of the client that an average error is not greater than 3100 USD because its Mean Absolute Error(MAE) is approximately 3054.97 while the Linear Regression's model is 3930.33, which is greater than 3100 USD.

Therefore, I would recomment giving the KNN Regression model to the company for their Medical Insurance Pricing Estimate Model because it has a lower MAE of 3054.97 so its prodicted value can only differ by 3054.97 USD. Moreover, the R2 of the KNN Regression Model is approximately 87%, meaning that it is able to capture around 87% of the true charges in the dataset.

# Case Study

How does removing 1 random column affect the performance of my chosen model?

In [153]:
# Remove 'sex' column in KNN Regression model
X = df.drop(columns=['charges','sex'])
y = df['charges']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

KNNR_model = KNeighborsRegressor(n_neighbors=13)
KNNR_model.fit(X_train, y_train)

KNNR_y_pred = KNNR_model.predict(X_test)

KNNR_mae = mean_absolute_error(y_test, KNNR_y_pred)
KNNR_r2 = r2_score(y_test, KNNR_y_pred)

print('KNNR_MAE: ', KNNR_mae)
print('KNNR_R2: ', KNNR_r2)

KNNR_MAE:  2925.4000158458666
KNNR_R2:  0.8740983466329277


When I removed the sex feature, the performance of the KNN Regression model improved. This is because from MAE of 3054.97, the MAE became to 2925.40 which is lower meaning that the average error is lower and increasing the accuracy of the predictions. Also, the R2 of the new KNN Regression model increased from approximately 87% to approximately 88%, indicating that it can now capture an additional of around 1% of the true charges in the dataset.