
# German credit score


## Dataset Overview

The original dataset contains 1000 entries with 20 categorical/symbolic attributes prepared by Prof. Hofmann. Each entry represents a person who takes out a loan from a bank, and each person is classified as either a good or a bad credit risk according to these attributes.

The link to the original dataset can be found below.

## Content

It is almost impossible to understand the original dataset due to its complicated system of categories and symbols. Therefore, I wrote a small Python script to convert it into a readable CSV file. Several columns were ignored because, in my opinion, either they are not important or their descriptions are unclear.

The selected attributes are:

- **Age** (numeric)
- **Sex** (text: male, female)
- **Job** (numeric):
  - 0 - unskilled and non-resident
  - 1 - unskilled and resident
  - 2 - skilled
  - 3 - highly skilled
- **Housing** (text: own, rent, or free)
- **Saving accounts** (text: little, moderate, quite rich, rich)
- **Checking account** (numeric, in DM - Deutsche Mark)
- **Credit amount** (numeric, in DM)
- **Duration** (numeric, in months)
- **Purpose** (text: car, furniture/equipment, radio/TV, domestic appliances, repairs, education, business, vacation/others)
- **Risk** (target value: Good or Bad Risk)


In [24]:
! rm -rf credit_scoring_7904_Q4_2024
!git clone https://github.com/cesarlegendre/credit_scoring_7904_Q4_2024


Cloning into 'credit_scoring_7904_Q4_2024'...
remote: Enumerating objects: 202, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 202 (delta 2), reused 1 (delta 1), pack-reused 199 (from 1)
Receiving objects: 100% (202/202), 2.43 MiB | 6.80 MiB/s, done.
Resolving deltas: 100% (82/82), done.


In [25]:
# Import libraries
# data loading and manipulation
import pandas as pd
import numpy as np

#visualisation
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# data splitting
from sklearn.model_selection import train_test_split

# data modeling
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Model performance
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix, classification_report
from sklearn import metrics


#warnings
import warnings
warnings.simplefilter(action='ignore')


file = 'credit_scoring_7904_Q4_2024/data_sets/german_credit_score/data.csv'



### Strategy for Data Analysis and Model Development

#### 1. Introduction
   - **Objective**: Provide a clear overview of the dataset and its context.
   - **Action**:
     - Describe the source of the dataset.
     - Provide key details about the dataset (size, attributes, and target variable).

#### 2. Library Setup
   - **Objective**: Ensure that the necessary tools are available for data processing and analysis.
   - **Action**:
     - Import essential libraries (e.g., pandas, numpy, matplotlib, seaborn, scikit-learn).
     - Load the dataset from its source (e.g., a CSV file or a database) into a pandas DataFrame.
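As a minimal sketch of the loading step, the snippet below reads CSV text into a DataFrame. The two rows in `csv_text` are invented for illustration and the column names are assumed from the attribute list above; in the notebook itself the equivalent call would presumably be `pd.read_csv(file)` with the path defined earlier.

```python
import io

import pandas as pd

# Hypothetical two-row sample (invented values, not real records).
# Column names are assumed from the attribute list in this notebook.
csv_text = """Age,Sex,Job,Housing,Saving accounts,Credit amount,Duration,Purpose,Risk
30,male,2,own,little,2000,12,car,good
45,female,1,rent,,3500,24,education,bad
"""

# read_csv accepts a path or any file-like object; empty fields become NaN.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

The empty `Saving accounts` field in the second row is read as a missing value, which is why a missing-data check is worth doing right after loading.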

#### 3. Data Familiarization
   - **Objective**: Understand the structure and nature of the dataset.
   - **Action**:
     - **3.1 Data Types**: Examine the types of each attribute.
     - **3.2 Data Shape**: Understand the dimensions (rows, columns) of the dataset.
     - **3.3 Missing Data**: Identify any null values and their distribution across attributes.
     - **3.4 Unique Values**: Check for unique or categorical values in each column.
     - **3.5 Initial Inspection**: Display the first few rows to get a sense of the data.
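The five familiarization checks above can be sketched with standard pandas calls. The small DataFrame here is a made-up stand-in (invented values, column names assumed from the attribute list), used only so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical miniature stand-in for the credit data (invented values).
df = pd.DataFrame({
    "Age": [25, 41, 33, 58, 47],
    "Sex": ["male", "female", "female", "male", "male"],
    "Job": [2, 1, 3, 0, 2],
    "Housing": ["own", "rent", "own", "free", "own"],
    "Saving accounts": ["little", None, "moderate", "rich", "little"],
    "Credit amount": [1500, 4200, 2300, 9800, 3100],
    "Duration": [12, 36, 18, 48, 24],
    "Risk": ["good", "bad", "good", "bad", "good"],
})

print(df.dtypes)                # 3.1 data type of each column
print(df.shape)                 # 3.2 (rows, columns)

missing = df.isna().sum()       # 3.3 number of missing values per column
print(missing)

unique_counts = df.nunique()    # 3.4 number of distinct non-null values per column
print(unique_counts)

print(df.head())                # 3.5 first five rows
```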

#### 4. Variable Exploration
   - **Objective**: Dive deeper into key variables to gain insights.
   - **Action**:
     - **4.1 Descriptive Statistics and Visualizations**:
       - Use histograms, bar charts, and boxplots to explore distributions.
       - Summarize key descriptive statistics (mean, median, mode, variance, etc.).
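A rough sketch of the numeric side of this step is shown below, again on invented values. The age bin edges and labels are arbitrary choices for illustration, not something the dataset prescribes.

```python
import pandas as pd

# Hypothetical sample (invented values) with two numeric columns and the target.
df = pd.DataFrame({
    "Age": [23, 35, 41, 58, 29, 64],
    "Credit amount": [1200, 3400, 2500, 8000, 1500, 4200],
    "Risk": ["bad", "good", "good", "bad", "good", "good"],
})

# Count, mean, standard deviation, quartiles, min and max for numeric columns.
summary = df[["Age", "Credit amount"]].describe()
print(summary)

# Class balance of the target variable.
risk_counts = df["Risk"].value_counts()
print(risk_counts)

# Median credit amount per risk class.
median_by_risk = df.groupby("Risk")["Credit amount"].median()
print(median_by_risk)

# Bin ages into groups; the edges and labels are assumptions for illustration.
df["Age group"] = pd.cut(
    df["Age"],
    bins=[18, 30, 45, 60, 120],
    labels=["young", "adult", "middle-aged", "senior"],
)
print(df[["Age", "Age group"]])
```

The same grouped columns can then be passed to a plotting library, for example a box plot of `Credit amount` by `Age group` and `Risk`.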

#### 5. Data Correlation
   - **Objective**: Identify relationships between variables.
   - **Action**:
     - **5.1 Correlation Analysis**:
       - Compute correlation matrices.
       - Visualize correlations with heatmaps to detect highly correlated variables.
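The correlation step might look like the sketch below. The data is randomly generated (with `Credit amount` deliberately tied to `Duration`), and the 0.7 threshold for "highly correlated" is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: Credit amount is constructed to depend on Duration.
rng = np.random.default_rng(0)
n = 200
duration = rng.integers(6, 60, size=n)
df = pd.DataFrame({
    "Age": rng.integers(19, 75, size=n),
    "Duration": duration,
    "Credit amount": duration * 100 + rng.normal(0, 500, size=n),
    "Housing": rng.choice(["own", "rent", "free"], size=n),
})

# Pearson correlation is only defined for numeric columns, so drop the rest.
corr = df.select_dtypes(include="number").corr()
print(corr.round(2))

# Keep the upper triangle only, so each pair of variables appears once.
threshold = 0.7  # assumed cut-off for a "strong" correlation
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
strong_pairs = pairs[pairs.abs() > threshold]
print(strong_pairs)
```

With seaborn available, `sns.heatmap(corr, annot=True)` draws the same matrix as a heatmap.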

#### 6. Data Preprocessing
   - **Objective**: Prepare data for model training.
   - **Action**:
     - **6.1 Preprocessing Libraries**: Import necessary libraries for preprocessing (e.g., StandardScaler, LabelEncoder).
     - **6.2 Defining Features and Target**: Separate the dataset into features (X) and target (Y) variables.
     - **6.3 Train-Test Split**: Split the data into training and testing sets for model validation.
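One possible shape for the preprocessing step is sketched below on synthetic data. Encoding "bad" as 1, the 75/25 split, and the use of one-hot encoding via `get_dummies` are assumptions for illustration; the notebook may make different choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with a fixed 70/30 good/bad ratio.
rng = np.random.default_rng(42)
n = 100
df = pd.DataFrame({
    "Age": rng.integers(19, 75, size=n),
    "Credit amount": rng.integers(250, 15000, size=n),
    "Housing": rng.choice(["own", "rent", "free"], size=n),
    "Risk": np.where(np.arange(n) % 10 < 7, "good", "bad"),
})

# 6.2 Features and target: one-hot encode the categorical column,
# and encode the target as 1 for "bad" and 0 for "good" (an assumed convention).
X = pd.get_dummies(df.drop(columns="Risk"), columns=["Housing"], drop_first=True)
y = (df["Risk"] == "bad").astype(int)

# 6.3 Stratified split keeps the class ratio similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit the scaler on the training set only, then reuse it on the test set,
# so that no information from the test set leaks into training.
num_cols = ["Age", "Credit amount"]
scaler = StandardScaler()
X_train_num = scaler.fit_transform(X_train[num_cols])
X_test_num = scaler.transform(X_test[num_cols])

print(X_train.shape, X_test.shape)
```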

#### 7. Model Development and Evaluation

   **7.1 Model 1: Random Forest**
   - **Objective**: Build and evaluate the first model using Random Forest.
   - **Action**:
     - **7.1.1 Random Forest Implementation**: Train a Random Forest model on the training set.
     - **7.1.2 Model Scoring**: Evaluate the model's accuracy, precision, recall, and other metrics.
     - **7.1.3 Cross Validation**: Apply cross-validation to ensure the model's generalizability.
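The Random Forest steps can be sketched as follows. The data comes from scikit-learn's `make_classification` rather than the credit file, and the hyperparameters, the 5-fold setup, and the label names "good"/"bad" are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic, imbalanced binary problem standing in for the credit data.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.7, 0.3], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 7.1.1 Train the model on the training set.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)

# 7.1.2 Score on the held-out test set.
y_pred = rf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {test_accuracy:.3f}")
print(classification_report(y_test, y_pred, target_names=["good", "bad"]))

# 7.1.3 Cross-validate on the training set; each fold refits a fresh clone.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(rf, X_train, y_train, cv=cv, scoring="f1")
print(f"CV F1: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

Because the classes are imbalanced, accuracy alone can look good for a model that mostly predicts the majority class, which is why the sketch also reports per-class precision and recall and uses F1 for cross-validation.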

   **7.2 Model 2: Logistic Regression**
   - **Objective**: Build and evaluate the second model using Logistic Regression.
   - **Action**:
     - **7.2.1 Logistic Regression Implementation**: Train a Logistic Regression model on the training set.
     - **7.2.2 Model Scoring**: Evaluate the model's performance (accuracy, precision, recall, F1-score).
     - **7.2.3 Cross Validation**: Validate the model using cross-validation techniques.
     - **7.2.4 ROC Curve**: Plot the ROC curve to assess the model's ability to distinguish between classes.
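A comparable sketch for Logistic Regression, including the ROC computation, is given below. It again uses synthetic data from `make_classification`; wrapping the model in a pipeline with `StandardScaler` is a design choice assumed here, not something the outline specifies.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem standing in for the credit data.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.7, 0.3], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# Scaling inside the pipeline means the scaler is fitted on training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# The ROC curve needs scores, not hard labels: use the positive-class probability.
proba_positive = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, proba_positive)
auc = roc_auc_score(y_test, proba_positive)
print(f"ROC AUC: {auc:.3f}")
```

Plotting `fpr` against `tpr` with `plt.plot(fpr, tpr)`, together with the diagonal from (0, 0) to (1, 1) as a chance-level reference, gives the ROC curve described in 7.2.4.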




# Prompting step by step, DIY


1. **Instruction**: Load required libraries and import dataset.
   **Prompt**: "Write Python code to load libraries like pandas, numpy, seaborn, matplotlib, and load a CSV dataset into a DataFrame using pandas."

2. **Instruction**: Check for missing values, data types, and the shape of the dataset.
   **Prompt**: "Write Python code to display the data types, check for missing values, and show the shape of a dataset in a pandas DataFrame."

3. **Instruction**: Get unique values of each column and show the first few rows of the dataset.
   **Prompt**: "Write Python code to print the unique values of each column and display the first 5 rows of a pandas DataFrame."

4. **Instruction**: Create and plot a bar chart showing the distribution of the target variable "Risk".
   **Prompt**: "Write Python code using Plotly to create a bar chart showing the distribution of the 'Risk' column in a dataset."

5. **Instruction**: Create and plot histograms showing the distribution of the 'Age' column for each risk category.
   **Prompt**: "Write Python code using Plotly to plot histograms for the 'Age' column grouped by 'good' and 'bad' risk categories."

6. **Instruction**: Use Seaborn to create distribution and count plots of the 'Age' column, split by 'Risk'.
   **Prompt**: "Write Python code using Seaborn to create distribution and count plots of the 'Age' column, colored by 'Risk'."

7. **Instruction**: Categorize the 'Age' column into groups and create box plots to compare 'Credit Amount' by age group and risk.
   **Prompt**: "Write Python code to categorize 'Age' into bins and create box plots comparing 'Credit Amount' across age groups and risk categories using Plotly."

8. **Instruction**: Create a bar plot showing the distribution of the 'Housing' column, grouped by 'Risk'.
   **Prompt**: "Write Python code using Plotly to create bar plots showing the distribution of the 'Housing' column, grouped by 'Risk'."

9. **Instruction**: Create a violin plot showing the distribution of 'Credit Amount' by 'Housing' and risk.
   **Prompt**: "Write Python code using Plotly to create a violin plot for 'Credit Amount' by 'Housing' and risk categories."

10. **Instruction**: Create a bar plot and box plots comparing the distribution of 'Sex' and 'Credit Amount' across risk categories.
    **Prompt**: "Write Python code using Plotly to create bar and box plots comparing the distribution of 'Sex' and 'Credit Amount' across risk categories."

11. **Instruction**: Create plots comparing the distribution of 'Job' and 'Credit Amount' across risk categories.
    **Prompt**: "Write Python code using Plotly to create bar and box plots comparing 'Job' distribution and 'Credit Amount' across risk categories."

12. **Instruction**: Visualize the correlation between all the features in the dataset.
    **Prompt**: "Write Python code to plot a heatmap of the correlation matrix of a dataset using Seaborn."

13. **Instruction**: Perform feature engineering by converting categorical columns into dummy variables.
    **Prompt**: "Write Python code to convert categorical columns into dummy variables using pandas' get_dummies() function."

14. **Instruction**: Split the data into training and testing sets.
    **Prompt**: "Write Python code to split the data into training and testing sets using scikit-learn's train_test_split."

15. **Instruction**: Implement several machine learning models (Logistic Regression, Random Forest, etc.) and evaluate their performance using cross-validation.
    **Prompt**: "Write Python code to implement multiple machine learning models, such as Logistic Regression and Random Forest, and evaluate them using cross-validation."

16. **Instruction**: Fine-tune the hyperparameters of a Random Forest model using GridSearchCV.
    **Prompt**: "Write Python code to perform hyperparameter tuning for a Random Forest model using scikit-learn's GridSearchCV."

17. **Instruction**: Plot the ROC curve for the model's predictions.
    **Prompt**: "Write Python code to plot the ROC curve for the predictions of a model using Matplotlib."
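As an illustration of what a response to prompt 16 might look like, here is a small hedged sketch of hyperparameter tuning with `GridSearchCV`. The data is synthetic, and the parameter grid, the 3-fold setting, and the F1 scoring are arbitrary assumptions kept small so the search runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary problem standing in for the credit data.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# A deliberately tiny grid: 2 x 2 = 4 candidate settings.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

# Each candidate is evaluated with 3-fold cross-validation on F1.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="f1",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best mean CV F1: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` is a model refitted on all the data passed to `fit` with the best parameters, so it can be used directly for prediction.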


