### Objective  
- To understand and practice building effective end-to-end machine learning models using this notebook as a companion to the lecture.
- Some pointers have been provided after various code snippets. These are not specific to this dataset.

### Instructions
- Clearly explain how each method or function used in the notebook works.  
- For every code snippet, document the key insights and takeaways.  

# Imports

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

import xgboost as xgb

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report, roc_curve, auc, precision_recall_curve, average_precision_score

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Reading the Dataset

In [None]:
sample_submission = pd.read_csv("/kaggle/input/playground-series-s4e10/sample_submission.csv")
train = pd.read_csv("/kaggle/input/playground-series-s4e10/train.csv")
test = pd.read_csv("/kaggle/input/playground-series-s4e10/test.csv")

### Pointers
- We are using `read_csv` from **pandas** to read the file into a DataFrame.  
- Data can also be stored in other formats, such as `.tsv` or `.xlsx` files, or in SQL databases. Look up how to read these file types and try loading files as an exercise.  
- Sometimes data might have to be converted to a tabular format. As an exercise, practice converting and handling data in **JSON** format as well.  
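
As a starting point for the exercises above, here is a minimal sketch that reads tab-separated text and flattens nested JSON. The sample strings and column names are made up for illustration, and `io.StringIO` stands in for a real file path. For Excel files and SQL databases, pandas also provides `pd.read_excel` and `pd.read_sql`.

```python
import io
import json

import pandas as pd

# Hypothetical in-memory samples standing in for real files.
tsv_text = "id\tloan_amnt\n1\t5000\n2\t12000\n"

# A .tsv file is a CSV with a tab separator.
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

json_text = (
    '[{"id": 1, "person": {"age": 30, "income": 50000}},'
    ' {"id": 2, "person": {"age": 45, "income": 82000}}]'
)

# json_normalize flattens nested records into columns such as "person.age".
json_df = pd.json_normalize(json.loads(json_text))

print(tsv_df)
print(json_df)
```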


In [None]:
print(f'The shape of the training data is {train.shape}')
print(f'The shape of the test data is {test.shape}')

In [None]:
# View the first few rows of the dataset
train.head()

## Description of Features
1. person_age: The age of the loan applicant in years
2. person_income: Income of the applicant 
3. person_home_ownership: Status of home ownership among Rent, Own, Mortgage and others   
4. person_emp_length: Length of employment in years
5. loan_intent: Purpose of the loan
6. loan_grade: Some metric assigning a quality score to the loan 
7. loan_amnt: Loan amount requested by the candidate
8. loan_int_rate: Interest rate associated with the loan
9. loan_percent_income: Percentage of income to be used for loan payments? 
10. cb_person_default_on_file: Indication of whether the applicant has defaulted earlier
11. cb_person_cred_hist_length: Length of applicant's credit history in years
12. loan_status: Approval / Rejection of the loan (Target Variable)

### Pointers
- Use domain knowledge and prior work for guidance, but avoid letting assumptions overly influence decisions.  
- Allow the data to reveal its own story instead of creating a story first and forcing the data to fit it.  


In [None]:
# Information about datatypes and null values in columns
train.info()

### Initial Observations:
- There are no Null values present in any of the columns
- The categorical columns are **person_home_ownership**, **loan_intent**, **loan_grade**, **cb_person_default_on_file**
- The numerical columns are **person_age**, **person_income**, **person_emp_length**, **loan_amnt**, **loan_int_rate**, **loan_percent_income**, **cb_person_cred_hist_length**

### Pointers
- A function may report 0 NULL values, but it typically only checks for `NaN`. Missing values could still be represented in other ways (e.g., blanks, special characters, or placeholders).  
- Columns may sometimes contain textual representations of numbers (e.g., *one, two, three*) instead of numeric values.  
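
The sketch below illustrates both pointers on a made-up column; the placeholder list is an assumption and would need to be adapted to the data at hand. Note that a textual number such as *two* is silently turned into a missing value by `errors="coerce"`, so such entries need an explicit mapping if they should be kept.

```python
import numpy as np
import pandas as pd

# Toy column (made up) where missing values hide behind placeholders.
raw = pd.DataFrame({"person_emp_length": ["3", "?", " ", "7", "N/A", "two"]})

# isna() finds nothing, because every entry is a non-null string.
print(raw.isna().sum())

# Treat assumed placeholders as missing, then coerce the rest to numbers.
placeholders = ["?", "N/A", ""]
cleaned = raw["person_emp_length"].str.strip().replace(placeholders, np.nan)
numeric = pd.to_numeric(cleaned, errors="coerce")

# Number of values that are missing after cleaning.
print(numeric.isna().sum())
```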


In [None]:
# Some statistics about different numerical columns
train.describe()

### Initial Observations:
- The columns **person_age** and **person_emp_length** have 123 as the maximum value. These data points are erroneous.
- The majority of the values in the column **loan_status** appear to be 0. This can indicate class imbalance.
- The columns are in different scales. Note **person_age**, **person_income**, **loan_amnt**, **loan_percent_income**.

### Pointers
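- `describe()` can be used to spot potential outliers, for example through a large gap between the 75th percentile and the maximum.
- It provides an understanding of the **scale** and **distribution** of the data.  
- It highlights features that take a single constant value (zero standard deviation). These carry no information and can be removed.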

In [None]:
#To check for NULL values
train.isna().sum()

In [None]:
#To check for duplicates
train.duplicated().sum()

In [None]:
#Number of unique values in each column
train.nunique()

### Pointers
- Columns with a number of unique values close to the total number of datapoints (high cardinality) may contribute little value to the model.  
- The **cardinality** of categorical features helps in deciding the appropriate **encoding technique** to apply.  
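
One way to act on these pointers is to compare the number of unique values with the number of rows. This is a rough sketch on a made-up frame; treating a ratio of exactly 1.0 as "ID-like" is an assumption, and a real dataset may call for a softer threshold.

```python
import pandas as pd

# Toy frame (made up): "id" is unique per row, "flag" is constant.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "loan_grade": ["A", "B", "A", "C", "B", "A"],
    "flag": ["Y"] * 6,
})

# Share of distinct values per column: 1.0 means every row is unique.
unique_ratio = df.nunique() / len(df)

# Columns that behave like identifiers, and columns that never vary.
id_like = unique_ratio[unique_ratio == 1.0].index.tolist()
constant = df.columns[df.nunique() == 1].tolist()

print(unique_ratio)
print("ID-like:", id_like, "Constant:", constant)
```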

In [None]:
#Count of occurrences of each value of the feature
train['person_home_ownership'].value_counts()

### Pointers
- Can highlight the **dominant categories** within categorical features.  
- Can reveal **inconsistencies or typos** in category values (*rent* vs *Rent* vs *RENT*, or *MORTGAGE* vs *MORTGAUGE*).  
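
A small sketch of how such inconsistencies could be cleaned up. The values are made up, and the misspelling map is a hypothetical one that would normally be built after inspecting `value_counts()`.

```python
import pandas as pd

# Made-up category values with casing, whitespace, and spelling issues.
s = pd.Series(["rent", "Rent", "RENT ", "MORTGAGE", "MORTGAUGE", "OWN"])

# Casing and stray whitespace can be fixed generically.
normalized = s.str.strip().str.upper()

# Misspellings need an explicit mapping (hypothetical here).
normalized = normalized.replace({"MORTGAUGE": "MORTGAGE"})

counts = normalized.value_counts()
print(counts)
```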


# Exploratory Data Analysis

EDA is primarily about exploring the dataset. By examining  individual features or groups of features, we can uncover patterns and insights that guide the decisions made during modeling.

## The target variable

In [None]:
train['loan_status'].value_counts()

In [None]:
sns.countplot(data=train, x='loan_status')
plt.title('Distribution of the target variable')
plt.show()

### Initial Observations
- The dataset appears to be **imbalanced**
- Roughly 14% of the values belong to `class 1`

### Pointers
- Class imbalance in datasets is often domain-specific and should be carefully evaluated.  
- May require the use of **imbalance handling techniques** (e.g., resampling, synthetic data generation, class weights).  
- **Accuracy alone is not sufficient** for imbalanced datasets. As an exercise, consider which alternative metrics  would be more appropriate, and why.  
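
To see why accuracy can mislead, consider a "model" that always predicts the majority class. The labels below are synthetic and only mimic the roughly 14% positive rate observed above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Synthetic labels: 14 positives out of 100, similar to the imbalance above.
y_true = np.array([1] * 14 + [0] * 86)

# A trivial predictor that always outputs class 0.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

# High accuracy, yet not a single positive case is found.
print(f"Accuracy: {acc:.2f}, Recall: {rec:.2f}, F1: {f1:.2f}")
```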


## Univariate Feature Analysis

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

sns.countplot(data=train, x="person_home_ownership", ax=axes[0])
axes[0].set_title("Distribution of Home Ownership")

sns.countplot(data=train, x="loan_grade", ax=axes[1])
axes[1].set_title("Distribution of Loan Grade")

sns.countplot(data=train, x="loan_intent", ax=axes[2])
axes[2].set_title("Distribution of Loan Intent")
# Rotate only the long loan_intent labels on the third subplot
axes[2].tick_params(axis="x", rotation=45)

plt.tight_layout()
plt.show()

### Initial Observations
- Some categories are far more frequent than others. We can check how each category relates to the target
- Rare or uninformative categories can be grouped into a new category, "Other"
- Do the letters in the `loan_grade` column have a natural ordering?

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

sns.boxplot(data=train['person_age'], ax=axes[0])
axes[0].set_title("Boxplot of Age")

sns.boxplot(data=train['person_income'], ax=axes[1])
axes[1].set_title("Boxplot of Income")

plt.tight_layout()
plt.show()

### Initial Observations
- The value > 120 in the column `person_age` could be an erroneous entry. Removal would be a good option
- In the case of `person_income`, there are some large values, but these could be naturally occurring in the dataset. What are some possible approaches to handle this?

### Pointers
- When claiming that points are outliers (for example, using a box plot), be clear about which method is used to label a point as an outlier
- Different methods may select different points as outliers
- It is important to distinguish between outlier values that are naturally occurring and those that are erroneous entries
- In some cases, the outliers are the most valuable points. For example, in a *"Money Transaction"* dataset, transactions with a very large amount of money could be indicative of *Fraud*
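
The sketch below shows two common rules disagreeing on the same made-up ages: the 1.5 × IQR rule (the one a box plot typically uses for its whiskers) and a z-score rule. The thresholds 1.5 and 3 are conventional choices, not requirements.

```python
import numpy as np

# Made-up ages with one moderately large and one implausible value.
values = np.array([20, 22, 23, 24, 25, 26, 27, 28, 30, 45, 123], dtype=float)

# Rule 1: points beyond 1.5 * IQR from the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Rule 2: points more than 3 standard deviations from the mean.
z_scores = (values - values.mean()) / values.std()
z_mask = np.abs(z_scores) > 3

print("IQR rule flags:    ", values[iqr_mask])
print("Z-score rule flags:", values[z_mask])
```

On this sample the IQR rule flags more points than the z-score rule, partly because the extreme value inflates the mean and standard deviation that the z-score depends on.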

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

sns.boxplot(data=train['loan_amnt'], ax=axes[0])
axes[0].set_title("Boxplot of Loan Amount(Train)")

sns.boxplot(data=test['loan_amnt'], ax=axes[1])
axes[1].set_title("Boxplot of Loan Amount(Test)")

plt.tight_layout()
plt.show()

### Initial Observations:
- Even though there appear to be outliers in the boxplot of `loan_amnt` for training data, we can retain them as a similar distribution is observed in the case of test data 

## Bivariate Feature Analysis

In [None]:
plt.figure(figsize=(15, 6))
sns.heatmap(train.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

### Initial Observations:
- The features `cb_person_cred_hist_length` and `person_age` are highly correlated. This is expected as older people would have a longer credit history
- The features `loan_int_rate` and `loan_percent_income` have a higher correlation with the target. These have to be looked at more closely

### Pointers
- Be clear about what correlation is. What is the range of values? What do those values mean?
- What happens if two features are highly correlated? How does it impact various models?
- Are features with strong correlation with the target variable also identified as important features after training a model?
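
As a reference for the first pointer, assuming the default Pearson method of `DataFrame.corr`, the coefficient is the covariance of two features divided by the product of their standard deviations, so it always lies between -1 and 1. The helper `pearson` and the numbers below are made up for illustration.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


# Made-up values: credit history grows with age.
age = [22, 25, 30, 35, 40]
hist_len = [2, 4, 9, 14, 20]

r_pos = pearson(age, hist_len)           # close to +1: strong positive relation
r_neg = pearson(age, [-a for a in age])  # -1: perfect negative linear relation

print(r_pos, r_neg)
```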

In [None]:
sns.countplot(data=train, x='loan_grade', hue='loan_status')
plt.title('Loan Status Counts by Loan Grade')
plt.show()

### Initial Observations:
- The categories D, E, F, G are less frequent than A, B, C. However, in those categories, there are more loans approved than rejected

In [None]:
sns.boxplot(x=train['loan_status'],y=train['person_income'])

### Initial Observations
- The incomes of individuals whose loans were approved appear to be under 500,000

In [None]:
# subset_features = ['loan_amnt', 'loan_int_rate', 'person_income', 'person_age', 'loan_status']
# sns.pairplot(train[subset_features], hue='loan_status')
# plt.title('Pair Plot of Selected Features')
# plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.kdeplot(train[train['loan_status'] == 1]['loan_int_rate'], label='Approved', fill=True)
sns.kdeplot(train[train['loan_status'] == 0]['loan_int_rate'], label='Non-Approved', fill=True)
plt.title('Density of Loan Int Rate by Loan Status')
plt.xlabel('Loan Int Rate')
plt.ylabel('Density')
plt.legend()
plt.show()

### Initial Observations
- Loans are more likely to be approved if the interest rates are higher

# Creating a Validation Set

## Why do we need a Validation dataset?
- Estimate how well our models will do on unseen data.
- Identify good values for the HyperParameters through Hyperparameter Tuning

In [None]:
train = train.drop(columns=["id"])
test = test.drop(columns=["id"])

X = train.iloc[:,:-1]
y = train.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

### Pointers
- The use of stratify can help when there are rare / imbalanced classes
- Repeatedly using the same validation dataset can indirectly lead to overfitting
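
A quick way to check what `stratify` does is to compare class proportions before and after the split. The labels here are synthetic, and `test_size=0.5` is chosen only so the expected counts are whole numbers.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 14 positives out of 100.
y_all = np.array([1] * 14 + [0] * 86)
X_all = np.arange(len(y_all)).reshape(-1, 1)

# With stratify, each split keeps the original class proportions.
_, _, y_tr_strat, y_val_strat = train_test_split(
    X_all, y_all, test_size=0.5, random_state=0, stratify=y_all
)

# Without stratify, the positive share can drift from split to split.
_, _, _, y_val_plain = train_test_split(
    X_all, y_all, test_size=0.5, random_state=0
)

print("Overall positive rate:   ", y_all.mean())
print("Stratified validation:   ", y_val_strat.mean())
print("Unstratified validation: ", y_val_plain.mean())
```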

# Feature Engineering

# Preprocessing

In [None]:
ordinal_col = ["loan_grade"]
onehot_cols = ["person_home_ownership", "loan_intent", "cb_person_default_on_file"]

ordinal_transformer = OrdinalEncoder(categories=[['G', 'F', 'E', 'D', 'C', 'B', 'A']])  
onehot_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(
    transformers=[
        ("ord", ordinal_transformer, ordinal_col),
        ("ohe", onehot_transformer, onehot_cols)
    ],
    remainder="passthrough"  
)

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("scaler", MinMaxScaler())
])

### Pointers
- Encoding Categorical variables using One Hot Encoding vs Ordinal vs Target vs Frequency
- Scaling methods. Is it mandatory to use scaling? Which scaler does well even if there are outliers?
- How many features are present after preprocessing?
- Should feature selection methods be used? 
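
For the question on scaling with outliers, one option worth comparing against `MinMaxScaler` is scikit-learn's `RobustScaler`, which centers on the median and scales by the interquartile range. The incomes below are made up, and this is a comparison sketch rather than a recommendation for this dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Made-up incomes with a single very large value.
income = np.array([[30_000], [40_000], [50_000], [60_000], [1_000_000]], dtype=float)

# MinMaxScaler maps min -> 0 and max -> 1, so the outlier squeezes the rest near 0.
minmax = MinMaxScaler().fit_transform(income).ravel()

# RobustScaler uses the median and IQR, so typical values keep their spread.
robust = RobustScaler().fit_transform(income).ravel()

print("MinMax:", np.round(minmax, 3))
print("Robust:", np.round(robust, 3))
```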

In [None]:
X_train_processed = pipeline.fit_transform(X_train)
X_test_processed = pipeline.transform(X_test)

test_processed = pipeline.transform(test)

In [None]:
print(f'The shape of the processed training data is {X_train_processed.shape}')
print(f'The shape of the processed validation data is {X_test_processed.shape}')
print(f'The shape of the processed test data is {test_processed.shape}')

# Baseline Model

In [None]:
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_processed, y_train)

y_pred_proba = clf.predict_proba(X_test_processed)[:, 1]

roc_auc = roc_auc_score(y_test, y_pred_proba)
print("ROC AUC Score:", roc_auc)

# Exploring some basic Models

In [None]:
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(probability=True, random_state=42),
    "Bagging": BaggingClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42)
}

def evaluate_model(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    
    results = {}
    for split, (X, y) in {"Train": (X_train, y_train), "Test": (X_test, y_test)}.items():
        y_pred = model.predict(X)
        y_proba = model.predict_proba(X)[:, 1] if hasattr(model, "predict_proba") else model.decision_function(X)

        results[split] = {
            "Accuracy": accuracy_score(y, y_pred),
            "Precision": precision_score(y, y_pred, zero_division=0),
            "Recall": recall_score(y, y_pred, zero_division=0),
            "F1-Score": f1_score(y, y_pred, zero_division=0),
            "ROC-AUC": roc_auc_score(y, y_proba)
        }
    return results

final_results = {}
for name, model in models.items():
    final_results[name] = evaluate_model(model, X_train_processed, y_train, X_test_processed, y_test)

results_df = pd.concat({outer: pd.DataFrame(inner).T for outer, inner in final_results.items()})

print(results_df)

In [None]:
metrics = ["Accuracy", "Precision", "Recall", "F1-Score", "ROC-AUC"]

for metric in metrics:
    plt.figure(figsize=(10,6))
    train_scores = [final_results[m]["Train"][metric] for m in models.keys()]
    test_scores = [final_results[m]["Test"][metric] for m in models.keys()]
    
    x = np.arange(len(models))
    width = 0.35
    
    plt.bar(x - width/2, train_scores, width, label='Train')
    plt.bar(x + width/2, test_scores, width, label='Test')
    
    plt.xticks(x, models.keys(), rotation=45, ha='right')
    plt.ylabel(metric)
    plt.title(f"Model Comparison - {metric}")
    plt.legend()
    plt.tight_layout()
    plt.show()

# Hyperparameter Tuning

In [None]:
rf_params = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10]
}

rf_grid = GridSearchCV(RandomForestClassifier(random_state=42), rf_params, cv=5, scoring='f1', n_jobs=-1)
rf_grid.fit(X_train_processed, y_train)

best_rf = rf_grid.best_estimator_

In [None]:
ada_params = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 1.0]
}

ada_grid = GridSearchCV(AdaBoostClassifier(random_state=42), ada_params, cv=5, scoring='f1', n_jobs=-1)
ada_grid.fit(X_train_processed, y_train)

best_ada = ada_grid.best_estimator_

In [None]:

def evaluate_best_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1] if hasattr(model, "predict_proba") else model.decision_function(X_test)

    return {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred, zero_division=0),
        "Recall": recall_score(y_test, y_pred, zero_division=0),
        "F1-Score": f1_score(y_test, y_pred, zero_division=0),
        "ROC-AUC": roc_auc_score(y_test, y_proba)
    }

print("Best Random Forest Params:", rf_grid.best_params_)
print("Random Forest Test Performance:", evaluate_best_model(best_rf, X_test_processed, y_test))

print("\nBest AdaBoost Params:", ada_grid.best_params_)
print("AdaBoost Test Performance:", evaluate_best_model(best_ada, X_test_processed, y_test))

In [None]:
# use_label_encoder is deprecated in recent XGBoost releases, so it is omitted
xgb_clf = xgb.XGBClassifier(eval_metric='auc', random_state=42)

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.7, 0.8, 1.0],
    'colsample_bytree': [0.7, 0.8, 1.0]
}

grid_search = GridSearchCV(
    estimator=xgb_clf,
    param_grid=param_grid,
    scoring='accuracy',
    cv=3,
    verbose=1,
    n_jobs=-1
)

grid_search.fit(X_train_processed, y_train)

print("Best Parameters:", grid_search.best_params_)
best_model = grid_search.best_estimator_

y_pred = best_model.predict(X_test_processed)
print("Test Accuracy:", accuracy_score(y_test, y_pred))

# Model Evaluation

In [None]:
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(6,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=[0,1], yticklabels=[0,1])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

In [None]:
print("Classification Report:\n", classification_report(y_test, y_pred))

In [None]:
xgb.plot_importance(best_model, importance_type="gain", height=0.5, max_num_features=15)
plt.title("Top 15 Feature Importances (Gain)")
plt.show()

In [None]:
y_proba = best_model.predict_proba(X_test_processed)[:,1]
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(6,4))
plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.2f}")
plt.plot([0,1],[0,1],'--',color="gray")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()

# Submission