This repository contains a Jupyter Notebook (`svc1.ipynb`) that demonstrates a complete machine learning workflow for loan approval classification using a Support Vector Machine (SVM) model. The notebook covers data loading, cleaning, exploratory data analysis (EDA), feature engineering, model training with different hyperparameters, and evaluation with visualizations.
The dataset is a synthetic version inspired by the original Credit Risk dataset on Kaggle, enriched with additional variables related to financial risk for loan approval. SMOTENC was used to generate new data points and enlarge the dataset.
The dataset contains 45,000 records and 14 variables. Below is the description of each column:
| Column | Description | Type |
|---|---|---|
| `person_age` | Age of the person | Float |
| `person_gender` | Gender of the person | Categorical |
| `person_education` | Highest education level | Categorical |
| `person_income` | Annual income | Float |
| `person_emp_exp` | Years of employment experience | Integer |
| `person_home_ownership` | Home ownership status (e.g., rent, own, mortgage) | Categorical |
| `loan_amnt` | Loan amount requested | Float |
| `loan_intent` | Purpose of the loan | Categorical |
| `loan_int_rate` | Loan interest rate | Float |
| `loan_percent_income` | Loan amount as a percentage of annual income | Float |
| `cb_person_cred_hist_length` | Length of credit history in years | Float |
| `credit_score` | Credit score of the person | Integer |
| `previous_loan_defaults_on_file` | Indicator of previous loan defaults | Categorical |
| `loan_status` | Loan approval status (target variable): 1 = approved, 0 = rejected | Integer |
The notebook is organized into several key sections:
- Loading Data: Reads the dataset from a CSV file.
- Initial Exploration: Displays the data, checks for null values, and prints data information and summary statistics.
- Handling Missing Values: Missing data in critical columns (e.g., `loan_percent_income`, `cb_person_cred_hist_length`, `credit_score`, `previous_loan_defaults_on_file`, and `loan_status`) is filled using forward fill.
- Categorical Data Encoding:
  - Conversion of `person_gender` values (e.g., 'male' to 1, other values to 0).
  - Replacement of categorical education levels with numeric codes.
  - Mapping of `person_home_ownership` and `previous_loan_defaults_on_file` values into numeric formats.
  - Uniform replacement of the `loan_intent` categories.
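A minimal sketch of the cleaning and encoding steps, using the column names from the table above on a toy frame (the exact numeric mappings in the notebook may differ):

```python
import pandas as pd

# Toy stand-in for the loan dataset.
df = pd.DataFrame({
    "person_gender": ["male", "female", "male"],
    "person_home_ownership": ["RENT", "OWN", "MORTGAGE"],
    "loan_status": [1.0, None, 0.0],
})

# Forward fill missing values in critical columns.
df["loan_status"] = df["loan_status"].ffill()

# Binary encoding: 'male' -> 1, all other values -> 0.
df["person_gender"] = (df["person_gender"] == "male").astype(int)

# Map home-ownership categories to numeric codes (assumed mapping).
ownership_map = {"RENT": 0, "OWN": 1, "MORTGAGE": 2, "OTHER": 3}
df["person_home_ownership"] = df["person_home_ownership"].map(ownership_map)

print(df)
```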
- Visualizations:
  - Plotly is used for interactive pie charts (e.g., distributions of `person_gender`, `loan_status`, `person_education`, and `person_home_ownership`).
  - Seaborn and Matplotlib are used for histograms and a heatmap of the correlation matrix.
- Correlation Analysis:
  - A correlation heatmap visualizes relationships among variables.
  - Features highly correlated with the target variable (`loan_status`) are selected for model training.
- Feature Selection:
  - Based on correlation analysis, a subset of features is selected (excluding the target variable).
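Correlation-based feature selection can look roughly like this (the 0.1 cutoff and the tiny frame are illustrative assumptions, not the notebook's exact values):

```python
import pandas as pd

# Small numeric frame standing in for the encoded dataset.
df = pd.DataFrame({
    "loan_int_rate": [10.0, 12.5, 7.0, 15.0],
    "credit_score": [700, 620, 750, 580],
    "loan_status": [0, 1, 0, 1],
})

# Correlation of each feature with the target, target itself excluded.
corr = df.corr()["loan_status"].drop("loan_status")

# Keep features whose absolute correlation exceeds an assumed threshold.
selected = corr[corr.abs() > 0.1].index.tolist()
print(selected)
```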
- Data Scaling:
  - StandardScaler is used to standardize feature values (zero mean, unit variance).
- Train-Test Split:
  - The dataset is split into training and testing sets (80% train, 20% test).
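A sketch of the scaling and splitting steps on random stand-in data (fitting the scaler on the training split only is shown here as the leakage-safe variant; the notebook's exact order may differ):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-in features and binary labels.
X = np.random.default_rng(0).normal(size=(100, 4))
y = np.random.default_rng(1).integers(0, 2, size=100)

# 80/20 split, matching the notebook's proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training set only, then apply it to both splits.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(X_train.shape, X_test.shape)  # (80, 4) (20, 4)
```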
- Model Training:
  - Multiple SVM models are trained:
    - Default SVC
    - SVC with a different penalty parameter (e.g., `C=10`)
    - SVC with a specified gamma value (e.g., `gamma=0.0122`)
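The three model variants can be sketched like this, using `make_classification` as a stand-in for the scaled loan data (the hyperparameter values come from the list above; everything else is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed loan features.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The three variants mentioned above: default, larger C, explicit gamma.
models = {
    "default": SVC(),
    "C=10": SVC(C=10),
    "gamma=0.0122": SVC(gamma=0.0122),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```

A larger `C` penalizes misclassified training points more heavily, while `gamma` controls the reach of the RBF kernel; comparing the three scores shows how sensitive the model is to these choices.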
- Prediction and Evaluation:
  - The accuracy score of each model is computed.
  - A confusion matrix is generated and visualized with a heatmap to assess model performance.
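The evaluation step boils down to two scikit-learn calls, shown here on hand-picked toy labels:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy ground truth and predictions.
y_test = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print(accuracy_score(y_test, y_pred))  # fraction of correct predictions: 0.8

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

The resulting matrix can then be drawn as a heatmap, e.g. with `seaborn.heatmap(cm, annot=True)`.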
To run the notebook, ensure you have the following Python packages installed:

```
pip install matplotlib plotly seaborn pandas numpy scikit-learn
```