## Exercise 9: Choosing the best performing model on a dataset

Instructions:

- Use the Dataset File to train your models
- Use the Test File to generate your predictions
- Use the Sample Submission File as the template for your submission format
- Train and evaluate every classification model listed below

Submit your results to:
https://www.kaggle.com/competitions/playground-series-s4e10/overview



In [32]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score
from imblearn.under_sampling import RandomUnderSampler

## Dataset File

In [33]:
dataset_url = 'https://github.com/robitussin/CCMACLRL_EXERCISES/blob/main/datasets/loan_approval/train.csv?raw=true'
df = pd.read_csv(dataset_url)

## Test File

In [34]:
test_url = 'https://github.com/robitussin/CCMACLRL_EXERCISES/blob/main/datasets/loan_approval/test.csv?raw=true'
dt = pd.read_csv(test_url)

## Sample Submission File

In [35]:
sample_submission_url = 'https://github.com/robitussin/CCMACLRL_EXERCISES/blob/main/datasets/loan_approval/sample_submission.csv?raw=true'
sf = pd.read_csv(sample_submission_url)

In [36]:
sf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39098 entries, 0 to 39097
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           39098 non-null  int64  
 1   loan_status  39098 non-null  float64
dtypes: float64(1), int64(1)
memory usage: 611.0 KB


In [37]:
print(df.info())
print(dt.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58645 entries, 0 to 58644
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   id                          58645 non-null  int64  
 1   person_age                  58645 non-null  int64  
 2   person_income               58645 non-null  int64  
 3   person_home_ownership       58645 non-null  object 
 4   person_emp_length           58645 non-null  float64
 5   loan_intent                 58645 non-null  object 
 6   loan_grade                  58645 non-null  object 
 7   loan_amnt                   58645 non-null  int64  
 8   loan_int_rate               58645 non-null  float64
 9   loan_percent_income         58645 non-null  float64
 10  cb_person_default_on_file   58645 non-null  object 
 11  cb_person_cred_hist_length  58645 non-null  int64  
 12  loan_status                 58645 non-null  int64  
dtypes: float64(3), int64(6), object

In [38]:
# 'id' is a row identifier, not a feature, so drop it along with the target
X = df.drop(columns=['loan_status', 'id'])
y = df['loan_status']

# One-hot encode the categorical columns
X = pd.get_dummies(X, drop_first=True)

# Train-test split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Under-sampling
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_resampled)
X_val_scaled = scaler.transform(X_val)

# Prepare the test data
if 'id' in dt.columns:
    test_ids = dt['id'].copy()
    dt = dt.drop(columns=['id'])

dt = pd.get_dummies(dt, drop_first=True)
dt = dt.reindex(columns=X.columns, fill_value=0)

# Feature scaling for test data
X_test_scaled = scaler.transform(dt)
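
The `reindex` step above matters because one-hot encoding the training and test frames separately can produce different column sets. A minimal sketch on toy data (the two small frames and their values are made up for illustration; only the column names echo the dataset) shows how the alignment behaves:

```python
import pandas as pd

# Toy frames: the test frame contains a grade ("D") never seen in training
train_toy = pd.DataFrame({"loan_grade": ["A", "B", "C"], "loan_amnt": [1000, 2000, 3000]})
test_toy = pd.DataFrame({"loan_grade": ["A", "B", "D"], "loan_amnt": [1500, 2500, 3500]})

# Encode each frame separately, as in the cell above
train_encoded = pd.get_dummies(train_toy, drop_first=True)
test_encoded = pd.get_dummies(test_toy, drop_first=True)

# Force the test frame onto the training columns:
# unseen dummy columns are dropped, missing ones are filled with 0
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(train_encoded.columns))
print(list(test_encoded.columns))
print(list(test_aligned.columns))
```

Note that with `drop_first=True`, a test row whose category never appeared in training ends up with all-zero dummies, which makes it indistinguishable from the dropped baseline category.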

## 1. Train a KNN Classifier

In [39]:
# put your answer here
knn_classifier = KNeighborsClassifier(n_neighbors=22)
knn_classifier.fit(X_train_scaled, y_resampled)

y_pred_knn = knn_classifier.predict(X_val_scaled)

accuracy_knn = accuracy_score(y_val, y_pred_knn)
print(f"Accuracy (KNN): {accuracy_knn}")

Accuracy (KNN): 0.8754369511467304
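
Because the validation split keeps the original class imbalance, plain accuracy can look strong even when the minority class is missed. The sketch below uses made-up labels (17 negatives, 3 positives, and a hypothetical model that always predicts 0) to show how `balanced_accuracy_score` and a confusion matrix expose that:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Made-up labels: 17 negatives and 3 positives
y_true_toy = [0] * 17 + [1] * 3
# Hypothetical model that always predicts the majority class
y_pred_toy = [0] * 20

acc_toy = accuracy_score(y_true_toy, y_pred_toy)
# Balanced accuracy averages the recall of each class
bal_acc_toy = balanced_accuracy_score(y_true_toy, y_pred_toy)
# Rows are true classes, columns are predicted classes
cm_toy = confusion_matrix(y_true_toy, y_pred_toy)

print(f"Accuracy: {acc_toy:.2f}")
print(f"Balanced accuracy: {bal_acc_toy:.2f}")
print(cm_toy)
```

The same calls can be applied to `y_val` and `y_pred_knn` from the cell above.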


- Perform cross-validation

In [40]:
# 5-fold cross-validation on the full, unscaled X and y
cv_scores_knn = cross_val_score(knn_classifier, X, y, cv=5)
print("Cross-validation scores (KNN):", cv_scores_knn)
print("Mean cross-validation score (KNN):", cv_scores_knn.mean())

Cross-validation scores (KNN): [0.89342655 0.8924887  0.89359707 0.89214767 0.89206241]
Mean cross-validation score (KNN): 0.8927444794952681
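
The cross-validation call above receives the raw `X` and `y`, so each fold is neither scaled nor under-sampled, unlike the hold-out evaluation. One way to keep preprocessing inside every fold is to wrap it in a pipeline. This is a sketch on synthetic data from `make_classification` (the data and the names `X_demo`, `y_demo`, and `knn_pipeline` are illustrative, not part of the exercise):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the loan data
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.85, 0.15], random_state=42
)

# The scaler is re-fitted on the training part of each fold,
# so no statistics leak from the held-out part
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=22))

cv_scores_demo = cross_val_score(knn_pipeline, X_demo, y_demo, cv=5)
print("Fold scores:", cv_scores_demo)
print("Mean score:", cv_scores_demo.mean())
```

To include `RandomUnderSampler` as a step as well, imbalanced-learn provides its own `imblearn.pipeline.Pipeline`, since scikit-learn's pipeline does not accept samplers.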


## 2. Train a Logistic Regression Classifier

In [41]:
# put your answer here
logreg = LogisticRegression(max_iter=500, solver='liblinear')
logreg.fit(X_train_scaled, y_resampled)
y_pred_logreg = logreg.predict(X_val_scaled)

accuracy_logreg = accuracy_score(y_val, y_pred_logreg)
print(f"Accuracy (Logistic Regression): {accuracy_logreg}")

Accuracy (Logistic Regression): 0.8404808594083042


- Perform cross-validation

In [42]:
# put your answer here
cv_scores_logreg = cross_val_score(logreg, X, y, cv=5)
print("Cross-validation scores (Logistic Regression):", cv_scores_logreg)
print("Mean cross-validation score (Logistic Regression):", cv_scores_logreg.mean())

Cross-validation scores (Logistic Regression): [0.87066246 0.87296445 0.8705772  0.87211186 0.87032143]
Mean cross-validation score (Logistic Regression): 0.8713274788984566


## 3. Train a Naive Bayes Classifier

In [43]:
# put your answer here
nb_classifier = GaussianNB()
nb_classifier.fit(X_train_scaled, y_resampled)
y_pred_nb = nb_classifier.predict(X_val_scaled)

accuracy_nb = accuracy_score(y_val, y_pred_nb)
print(f"Accuracy (Naive Bayes): {accuracy_nb}")

Accuracy (Naive Bayes): 0.8860942961889334


- Perform cross-validation

In [44]:
# put your answer here
cv_scores_nb = cross_val_score(nb_classifier, X, y, cv=5)
print("Cross-validation scores (Naive Bayes):", cv_scores_nb)
print("Mean cross-validation score (Naive Bayes):", cv_scores_nb.mean())

Cross-validation scores (Naive Bayes): [0.87705687 0.88208713 0.87901782 0.88515645 0.88191662]
Mean cross-validation score (Naive Bayes): 0.881046977576946


## 4. Train an SVM Classifier

In [45]:
# put your answer here
svm_classifier = SVC()
svm_classifier.fit(X_train_scaled, y_resampled)
y_pred_svm = svm_classifier.predict(X_val_scaled)

accuracy_svm = accuracy_score(y_val, y_pred_svm)
print(f"Accuracy (SVM): {accuracy_svm}")

Accuracy (SVM): 0.890271975445477


- Perform cross-validation

In [46]:
# put your answer here
cv_scores_svm = cross_val_score(svm_classifier, X, y, cv=5)
print("Cross-validation scores (SVM):", cv_scores_svm)
print("Mean cross-validation score (SVM):", cv_scores_svm.mean())

Cross-validation scores (SVM): [0.85761787 0.85761787 0.85761787 0.85753261 0.85761787]
Mean cross-validation score (SVM): 0.8576008184840992
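
Four of the five SVM folds report exactly the same score, which is what one would see if the model predicted a single class for every row. A quick check, sketched here on made-up labels, is to compare against a majority-class baseline using scikit-learn's `DummyClassifier` (`X_toy` and `y_toy` are placeholders, not the exercise data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Made-up labels with an 85/15 class split; the features are irrelevant
# to a majority-class baseline, so a column of zeros is enough
y_toy = np.array([0] * 85 + [1] * 15)
X_toy = np.zeros((len(y_toy), 1))

# Always predicts the most frequent class seen during fitting
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_toy, y_toy, cv=5)

print("Baseline fold scores:", baseline_scores)
print("Baseline mean score:", baseline_scores.mean())
```

If the baseline computed on the real `X` and `y` matches the SVM fold scores, the unscaled SVM is likely adding nothing over always predicting the majority class.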


## 5. Train a Decision Tree Classifier

In [47]:
# Note: `df_classifier` is the decision tree model, not the DataFrame `df`
df_classifier = DecisionTreeClassifier(max_depth=5, random_state=42)
df_classifier.fit(X_train_scaled, y_resampled)
y_pred_df = df_classifier.predict(X_val_scaled)

accuracy_df = accuracy_score(y_val, y_pred_df)
print(f"Accuracy (Decision Tree): {accuracy_df}")

Accuracy (Decision Tree): 0.8958137948674226


- Perform cross-validation

In [48]:
# put your answer here
cv_scores_df = cross_val_score(df_classifier, X, y, cv=5)
print("Cross-validation scores (Decision Tree):", cv_scores_df)
print("Mean cross-validation score (Decision Tree):", cv_scores_df.mean())

Cross-validation scores (Decision Tree): [0.93435075 0.9381874  0.93460653 0.94100094 0.93827266]
Mean cross-validation score (Decision Tree): 0.9372836558956432


## 6. Train a Random Forest Classifier

In [49]:
# put your answer here
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train_scaled, y_resampled)
y_pred_rf = rf_classifier.predict(X_val_scaled)

accuracy_rf = accuracy_score(y_val, y_pred_rf)
print(f"Accuracy (Random Forest): {accuracy_rf}")

Accuracy (Random Forest): 0.904595447182198


In [50]:
# Cross-validation
cv_scores_rf = cross_val_score(rf_classifier, X, y, cv=5)
print("Cross-validation scores (Random Forest):", cv_scores_rf)
print("Mean cross-validation score (Random Forest):", cv_scores_rf.mean())

Cross-validation scores (Random Forest): [0.9479069  0.94986785 0.94918578 0.95344872 0.95097621]
Mean cross-validation score (Random Forest): 0.9502770909710974


## 7. Compare the performance of all classification models

In [52]:
# put your answer here
results = {
    'KNN': {'Accuracy': accuracy_knn, 'CV Score': cv_scores_knn.mean()},
    'Logistic Regression': {'Accuracy': accuracy_logreg, 'CV Score': cv_scores_logreg.mean()},
    'Naive Bayes': {'Accuracy': accuracy_nb, 'CV Score': cv_scores_nb.mean()},
    'SVM': {'Accuracy': accuracy_svm, 'CV Score': cv_scores_svm.mean()},
    'Decision Tree': {'Accuracy': accuracy_df, 'CV Score': cv_scores_df.mean()},
    'Random Forest': {'Accuracy': accuracy_rf, 'CV Score': cv_scores_rf.mean()}
}

# Print the results
for model_name, metrics in results.items():
    print(f"{model_name}: Accuracy = {metrics['Accuracy']:.4f}, CV Score = {metrics['CV Score']:.4f}")

# Choose the best-performing model based on hold-out accuracy
best_model = max(results.items(), key=lambda x: x[1]['Accuracy'])
print(f"Best Model: {best_model[0]} with Accuracy: {best_model[1]['Accuracy']:.4f}")

KNN: Accuracy = 0.8754, CV Score = 0.8927
Logistic Regression: Accuracy = 0.8405, CV Score = 0.8713
Naive Bayes: Accuracy = 0.8861, CV Score = 0.8810
SVM: Accuracy = 0.8903, CV Score = 0.8576
Decision Tree: Accuracy = 0.8958, CV Score = 0.9373
Random Forest: Accuracy = 0.9046, CV Score = 0.9503
Best Model: Random Forest with Accuracy: 0.9046
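
The same comparison can be laid out as a table and ranked by either metric. The sketch below copies the rounded numbers from the printed output above into a pandas DataFrame; ranking by `CV Score` is shown only as an alternative, not as the exercise's required criterion:

```python
import pandas as pd

# Rounded values copied from the printed comparison above
results_rounded = {
    "KNN": {"Accuracy": 0.8754, "CV Score": 0.8927},
    "Logistic Regression": {"Accuracy": 0.8405, "CV Score": 0.8713},
    "Naive Bayes": {"Accuracy": 0.8861, "CV Score": 0.8810},
    "SVM": {"Accuracy": 0.8903, "CV Score": 0.8576},
    "Decision Tree": {"Accuracy": 0.8958, "CV Score": 0.9373},
    "Random Forest": {"Accuracy": 0.9046, "CV Score": 0.9503},
}

# Transpose so that each model is a row and each metric a column
comparison = pd.DataFrame(results_rounded).T

# Rank by mean cross-validation score, highest first
comparison_by_cv = comparison.sort_values("CV Score", ascending=False)
print(comparison_by_cv)

best_by_cv = comparison_by_cv.index[0]
best_by_accuracy = comparison["Accuracy"].idxmax()
print("Best by CV score:", best_by_cv)
print("Best by hold-out accuracy:", best_by_accuracy)
```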


## 9. Generate Submission File

Choose the model that has the best performance to generate a submission file.

In [58]:
submission_df = pd.DataFrame({
    'id': test_ids,
    'loan_status': rf_classifier.predict(X_test_scaled)
})

submission_df = submission_df.sort_values(by='id').reset_index(drop=True)
submission_df.to_csv('submission_file.csv', index=False)
print("Submission file created: submission_file.csv")

Submission file created: submission_file.csv
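
The sample submission stores `loan_status` as `float64`, which suggests that predicted probabilities, rather than hard 0/1 labels, may be what the competition expects; the competition's evaluation page should confirm this. Assuming that is the case, a sketch on synthetic data (all names ending in `_demo` are illustrative) would use `predict_proba` and take the column for the positive class:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training and test features
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)
X_train_demo, X_test_demo = X_demo[:200], X_demo[200:]
y_train_demo = y_demo[:200]

rf_demo = RandomForestClassifier(n_estimators=50, random_state=42)
rf_demo.fit(X_train_demo, y_train_demo)

# Look up which predict_proba column corresponds to class 1
positive_index = list(rf_demo.classes_).index(1)
proba_demo = rf_demo.predict_proba(X_test_demo)[:, positive_index]

# Hypothetical ids; in the notebook these would be `test_ids`
submission_demo = pd.DataFrame({
    "id": np.arange(len(X_test_demo)),
    "loan_status": proba_demo,
})
print(submission_demo.head())
```

In the notebook itself, the equivalent change would be to replace `rf_classifier.predict(X_test_scaled)` with the positive-class column of `rf_classifier.predict_proba(X_test_scaled)`.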
