Credit Scoring Model
Develop a credit scoring model to predict the creditworthiness of individuals based on historical financial data. Utilize classification algorithms and assess the model's accuracy.

Data Preprocessing: Prepare the data by handling missing values, encoding categorical variables, and splitting the dataset into training and testing sets.

1. Data Loading and Initial Exploration

In [None]:
import pandas as pd
data = pd.read_csv('credit_data.csv')
print(data.head())
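
Beyond the first few rows, it can help to check how many values are missing per column and how balanced the target is, since both affect the preprocessing and evaluation steps that follow. The sketch below is a minimal illustration on a small made-up DataFrame; the `toy` frame, its column names other than `creditworthiness`, and the `summarize` helper are assumptions, not part of `credit_data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for credit_data.csv; column names other than
# 'creditworthiness' are assumptions made for illustration.
toy = pd.DataFrame({
    "income": [52000.0, np.nan, 61000.0, 38000.0, 45000.0, np.nan],
    "employment_type": pd.Series(
        ["salaried", "self-employed", np.nan, "salaried", "salaried", "self-employed"],
        dtype="object",
    ),
    "creditworthiness": [1, 0, 1, 0, 1, 1],
})


def summarize(df, target):
    """Return per-column missing counts and the target's class proportions."""
    missing = df.isna().sum()
    class_share = df[target].value_counts(normalize=True)
    return missing, class_share


missing, class_share = summarize(toy, "creditworthiness")
print(toy.dtypes)      # column types decide which transformer each column gets
print(missing)         # number of missing values per column
print(class_share)     # share of each class in the target
```

On the real data the same call would be `summarize(data, 'creditworthiness')`. A strongly skewed class share is a hint that accuracy alone may be a misleading metric.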

2. Data Preprocessing

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline  # required by the transformers defined below

# Separate features and target variable
X = data.drop('creditworthiness', axis=1)
y = data['creditworthiness']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

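# Define preprocessing steps for numerical and categorical features.
# Columns that are neither numeric nor object-typed are dropped by the
# ColumnTransformer below (its default is remainder='drop').
numeric_features = X.select_dtypes(include='number').columns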
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = X.select_dtypes(include=['object']).columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])


3. Model Building

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Define the model
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))])

# Fit the model
model.fit(X_train, y_train)
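
A fitted classifier pipeline can also return class probabilities, which is one common way to turn a classifier into a score. The sketch below is illustrative only: it uses synthetic, numeric-only data from `make_classification` instead of the credit dataset, and the names `demo_model`, `X_fit`, `X_new`, and `scores` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit data (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)
X_fit, X_new, y_fit, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])
demo_model.fit(X_fit, y_fit)

# predict_proba returns one column per class, ordered as in classes_.
proba = demo_model.predict_proba(X_new)
positive_index = list(demo_model.classes_).index(1)
scores = proba[:, positive_index]  # estimated probability of class 1

print(demo_model.classes_)
print(scores[:5])
```

With the pipeline from this notebook, the equivalent call would be `model.predict_proba(X_test)`, with the column order given by `model.classes_`.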


4. Model Evaluation

In [None]:
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import cross_val_score
import numpy as np

# Predict on the testing set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Classification report
print(classification_report(y_test, y_pred))

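# Cross-validation: the pipeline is cloned and refit on each fold, so the
# imputers, scaler, and encoder are learned from that fold's training part only
cv_scores = cross_val_score(model, X, y, cv=5)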
print(f"Cross-validation Scores: {cv_scores}")
print(f"Average Cross-validation Score: {np.mean(cv_scores):.2f}")
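
Accuracy can look good on imbalanced data even when the minority class is predicted poorly, so it may be worth adding a threshold-independent metric and a confusion matrix. The sketch below shows one possible setup, not the notebook's method: stratified 5-fold cross-validation restricted to the training split, scored with ROC AUC. It runs on synthetic imbalanced data, and all variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic data with roughly an 80/20 class split (an assumption for illustration).
X_syn, y_syn = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=42
)

clf = RandomForestClassifier(random_state=42)

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = cross_val_score(clf, X_tr, y_tr, cv=cv, scoring="roc_auc")

# Fit once on the training split, then evaluate on the held-out test split.
clf.fit(X_tr, y_tr)
test_auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
cm = confusion_matrix(y_te, clf.predict(X_te))

print(f"CV ROC AUC per fold: {cv_auc}")
print(f"Test ROC AUC: {test_auc:.3f}")
print(cm)  # rows are true classes, columns are predicted classes
```

Keeping cross-validation on the training split leaves the test set untouched until the final check, which gives a cleaner estimate of performance on unseen data.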


Explanation:

1. Data Loading and Initial Exploration: Loaded the dataset and displayed the first few rows to understand its structure.
2. Data Preprocessing: Separated the features from the target, split the dataset into training and testing sets, and defined preprocessing steps (imputation, scaling, and one-hot encoding) for numerical and categorical features using pipelines.
3. Model Building: Created a pipeline that combines preprocessing with a Random Forest classifier and fitted it on the training data.
4. Model Evaluation: Predicted on the testing set, calculated accuracy, displayed a classification report, and used 5-fold cross-validation to assess the model's performance.