
## Assignment: Approve or Decline Loans

In this assignment, you will use a dataset of loan applications to build a model that predicts, from the given features, whether a loan should be approved or declined.

### Dataset
- **Source:** https://raw.githubusercontent.com/schauppi/Intro_ML/main/datasets/loan_data.csv

### Steps:

1. **Explore the Data with Pandas**
    - Load the data
    - Display the first few rows of the data
    - Check for missing values
    - Get basic information about the data
    - Convert categorical variables to numerical values (the SVM models used later require numerical input)
    - Get basic descriptive statistics
2. **Data Visualization**
    - Use Matplotlib to create visualizations to understand the data and spot potential patterns
    - Plot the following:
        - Distribution of person age
        - Gender distribution
        - Loan status proportion
        - Loan amount vs. income
3. **Prepare the Data for Modeling**
    - Get the target and features
    - Scale the features using `MinMaxScaler`
    - Split the data into training and testing sets
4. **Modeling**
    - Train and evaluate an `SVM` with a `linear` kernel
    - Train and evaluate an `SVM` with an `RBF` kernel
    - Train and evaluate an `SVM` with a `polynomial` kernel
5. **Grid Search**
    - Use `GridSearchCV` to find the best hyperparameters for the `SVM` with an `RBF` kernel
    - Set the parameters to search over:
        - `C`: 0.1, 0.5, 1, 5, 10
        - `kernel`: 'rbf'
    - Set the number of cross-validation folds to 2
    - Set the scoring metric to `f1`
6. **Train and Evaluate the Model with the Best Hyperparameters**
    - Train the `SVM` with the best hyperparameters found by `GridSearchCV`
    - Evaluate the model on the testing set
    - Print the accuracy, precision, recall, and F1 score of the model

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV

### 1. Explore the Data with Pandas

In [None]:
# Load the data
data = pd.read_csv('https://raw.githubusercontent.com/schauppi/Intro_ML/main/datasets/loan_data.csv')

In [None]:
# Display the first few rows of the data
data.head()

In [None]:
# Check for missing values
data.isnull().sum()

In [None]:
# Get basic information about the data
data.info()

#### Data Conversion

We convert the **categorical** (text-based) variables into **numerical values** using **label encoding**, that is, by mapping each category to an integer code. This is necessary because the SVM models used below accept only numerical input.

**Conversions**:
 - **Gender**: female (0), male (1)
 - **Education**: High School (0), Bachelor (1), Master (2), Associate (3), Doctorate (4)
 - **Home Ownership**: MORTGAGE (0), RENT (1), OWN (2), OTHER (3)
 - **Loan Intent**: PERSONAL (0), EDUCATION (1), MEDICAL (2), VENTURE (3), HOMEIMPROVEMENT (4), DEBTCONSOLIDATION (5)
 - **Previous Defaults**: No (0), Yes (1)

Using pandas' `Series.map()` with a dictionary, we assign an integer to each category. The categories stay distinct, but the codes are arbitrary: the model treats them as ordered numbers (for example, `Associate` = 3 sits above `Master` = 2), which does not reflect a real ranking.

In [None]:
# Convert categorical variables to numerical
data['person_gender']=data['person_gender'].map({"female":0 ,"male":1})
data['person_education']=data['person_education'].map({"High School":0 ,"Bachelor":1 , 'Master':2 , 'Associate':3 , 'Doctorate':4})
data['person_home_ownership']=data['person_home_ownership'].map({"MORTGAGE":0 ,"RENT":1 , 'OWN':2 , 'OTHER':3})
data['loan_intent']=data['loan_intent'].map({"PERSONAL":0 ,"EDUCATION":1 , 'MEDICAL':2 , 'VENTURE':3 , 'HOMEIMPROVEMENT':4 , 'DEBTCONSOLIDATION':5})
data['previous_loan_defaults_on_file']=data['previous_loan_defaults_on_file'].map({"No":0 ,"Yes":1})
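
For comparison, nominal columns such as `loan_intent` can also be one-hot encoded, which avoids implying any order between categories. The following is a minimal sketch using `pd.get_dummies` on a small made-up frame (the rows are illustrative, not taken from the dataset). The assignment itself asks for the integer mapping above, so this is an alternative, not a replacement.

```python
import pandas as pd

# Made-up rows standing in for the loan data
toy = pd.DataFrame({
    "loan_intent": ["PERSONAL", "MEDICAL", "VENTURE", "MEDICAL"],
    "loan_amnt": [1000, 2500, 4000, 1200],
})

# One indicator column per category; no ordering is implied between them
one_hot = pd.get_dummies(toy, columns=["loan_intent"], dtype=int)
print(one_hot)
```

The trade-off is a wider feature matrix: a column with six categories becomes six indicator columns.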

In [None]:
# Display the first few rows of the data and check if the conversion was successful
data.head()
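
`Series.map()` with a dictionary returns `NaN` for any value that is not one of the dictionary's keys, so a misspelled or unexpected category silently becomes a missing value. A minimal sketch of a check, using a made-up column in which `"Other"` is deliberately left out of the mapping:

```python
import pandas as pd

# Made-up column; "Other" is intentionally not covered by the mapping
toy = pd.DataFrame({"person_gender": ["female", "male", "Other", "male"]})

gender_map = {"female": 0, "male": 1}
encoded = toy["person_gender"].map(gender_map)

# Rows whose category was not in the mapping come back as NaN
unmapped = toy.loc[encoded.isnull(), "person_gender"].unique()
print("Unmapped categories:", list(unmapped))
```

In the notebook, re-running `data.isnull().sum()` after the conversion cell gives the same kind of check on the real columns.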

In [None]:
# Get basic descriptive statistics
data.describe()

### 2. Data Visualization

We visualize the distribution of **person age**, the **gender distribution**, the **loan status proportion**, and the relationship between **loan amount and income** to better understand the dataset.


In [None]:
# Plot distribution for 'person_age'
data['person_age'].plot(kind='hist')
plt.title("Distribution of Person Age")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()

Create a bar plot showing the gender distribution in the dataset using `value_counts()`. Sorting the counts by index keeps the bars in category order, so blue maps to female (0) and pink to male (1).


In [None]:
# Plot gender distribution
# sort_index() fixes the bar order to 0, 1 so the colors match the categories
data['person_gender'].value_counts().sort_index().plot(kind='bar', color=['blue', 'pink'])
plt.title("Gender Distribution")
plt.xlabel("Gender (0=Female, 1=Male)")
plt.ylabel("Count")
plt.xticks(rotation=0)
plt.show()

Create a pie chart visualizing loan status proportion using `value_counts()` with autopct to display percentages, helping you understand the balance between approved and declined loans.


In [None]:
# Plot loan status proportion
data['loan_status'].value_counts().plot(kind='pie', autopct='%1.1f%%')
plt.title("Loan Status Proportion")
plt.ylabel("")
plt.show()

Create a scatter plot visualizing the relationship between loan amount and person income, which helps identify patterns and potential correlations in loan applications.


In [None]:
# Plot loan amount vs. income
plt.scatter(data['person_income'], data['loan_amnt'], alpha=0.5)
plt.title("Loan Amount vs. Income")
plt.xlabel("Income")
plt.ylabel("Loan Amount")
plt.show()

### 3. Prepare the Data for Modeling

Split the data into features (`X`) and target (`y`), scale the features using `MinMaxScaler`, then divide into training and testing sets with an 80-20 split and `random_state=42` for reproducibility.


In [None]:
# Get the target and features
X = data.drop('loan_status', axis=1)
y = data['loan_status']

In [None]:
# Scale the features using MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
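
Fitting `MinMaxScaler` on the full dataset before splitting lets the test rows influence the scaling range. The following is a minimal sketch of a variant that splits first and scales inside a scikit-learn pipeline. It assumes synthetic data from `make_classification` in place of the loan data, and the `*_demo` / `Xd_*` names are illustrative so they do not overwrite the notebook's variables.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the loan data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

# Split first, so the test rows stay unseen while the scaler is fitted
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# fit() learns the min/max from the training rows only;
# predict()/score() reuse that fitted scaler on new rows
model = make_pipeline(MinMaxScaler(), SVC(kernel="rbf", C=1))
model.fit(Xd_train, yd_train)
print("Test accuracy:", model.score(Xd_test, yd_test))
```

The same pipeline object can be passed to `GridSearchCV`, in which case the scaler is refitted inside every cross-validation fold.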

### 4. Modeling

Train three SVM models with different kernels (linear, RBF, polynomial) and evaluate each on the testing set using accuracy, precision, recall, and F1 score.


In [None]:
# Train and evaluate an SVM with a linear kernel
svm_linear = SVC(kernel='linear', C=1)
svm_linear.fit(X_train, y_train)

In [None]:
# get predictions
y_pred = svm_linear.predict(X_test)

# evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))

In [None]:
# Train and evaluate an SVM with an RBF kernel
svm_rbf = SVC(kernel='rbf', C=1)
svm_rbf.fit(X_train, y_train)

In [None]:
# get predictions
y_pred = svm_rbf.predict(X_test)

# evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))

In [None]:
# Train and evaluate an SVM with a polynomial kernel
svm_poly = SVC(kernel='poly', C=1)
svm_poly.fit(X_train, y_train)

In [None]:
# get predictions
y_pred = svm_poly.predict(X_test)

# evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))
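
The evaluation cells above repeat the same four metric calls. A minimal sketch of a helper that bundles them follows; `evaluate` is a hypothetical name introduced here, not part of scikit-learn.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    """Return the four binary-classification metrics used in this notebook."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }

# Hand-made labels: one true positive, one false positive,
# one false negative, one true negative
scores = evaluate([1, 0, 1, 0], [1, 1, 0, 0])
print(scores)
```

In the notebook it could be called as `evaluate(y_test, svm_linear.predict(X_test))`, and likewise for the other two models.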

### 5. Grid Search

Use `GridSearchCV` to find the best value of `C` for the `SVM` with an `RBF` kernel, scoring each candidate by F1 with 2-fold cross-validation.


In [None]:
# Define the parameter grid
param_grid = {
    'C': [0.1, 0.5, 1, 5, 10],
    'kernel': ['rbf']
}

In [None]:
# Search over C for an SVM with an RBF kernel using GridSearchCV
svm_rbf_gs = SVC()
grid_search = GridSearchCV(svm_rbf_gs, param_grid, cv=2, scoring='f1')
grid_search.fit(X_train, y_train)

In [None]:
# Print the best parameters and the best score (mean cross-validated F1)
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)
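
`best_score_` reports only the winning mean F1. To compare all candidate values of `C`, the fitted search also exposes `cv_results_`. A minimal sketch on synthetic data (`make_classification` stands in for the loan data, and `demo_search` is an illustrative name):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the loan data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

demo_search = GridSearchCV(
    SVC(),
    {"C": [0.1, 0.5, 1, 5, 10], "kernel": ["rbf"]},
    cv=2,
    scoring="f1",
)
demo_search.fit(X_demo, y_demo)

# One row per candidate, with its mean cross-validated F1 and its rank
results = pd.DataFrame(demo_search.cv_results_)[
    ["param_C", "mean_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
```

In the notebook, the same table is available from `grid_search.cv_results_`.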

### 6. Train and Evaluate the Model with the Best Hyperparameters

Here you will train the `SVM` with the best hyperparameters found by `GridSearchCV` and evaluate it on the testing set.

In [None]:
# Train the SVM with the best hyperparameters found by GridSearchCV
svm_rbf_best = SVC(kernel='rbf', C=grid_search.best_params_['C'])
svm_rbf_best.fit(X_train, y_train)

In [None]:
# get predictions
y_pred = svm_rbf_best.predict(X_test)

# evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))
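
With the default `refit=True`, `GridSearchCV` retrains the best configuration on the full training set and exposes it as `best_estimator_`, so the manual retraining step is optional. A minimal sketch comparing the two on synthetic data (`make_classification` stands in for the loan data, and all names here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the loan data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_search = GridSearchCV(
    SVC(), {"C": [0.1, 1, 10], "kernel": ["rbf"]}, cv=2, scoring="f1"
)
demo_search.fit(Xd_train, yd_train)

# Already refitted on all of Xd_train with the best parameters
pred_refit = demo_search.best_estimator_.predict(Xd_test)

# Manual retraining with the best C, as in the cell above
manual = SVC(kernel="rbf", C=demo_search.best_params_["C"]).fit(Xd_train, yd_train)
pred_manual = manual.predict(Xd_test)

print("Same predictions:", (pred_refit == pred_manual).all())
```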