# 1.0 An end-to-end classification problem (ETL)



## 1.1 Dataset description

These notebooks address a **credit modeling problem**. The dataset was obtained through a Dataquest project and is available at the link below. It comes from **Lending Club** and covers loans issued from **2007 to 2011**. Lending Club is a marketplace for personal loans that matches borrowers seeking a loan with investors looking to lend money and earn a return. The **target variable**, the quantity we want to predict, is whether a borrower will repay the loan, given their history.

You can download the data from [Kaggle](https://www.kaggle.com/datasets/samaxtech/lending-club-20072011-data).

Let's take the following steps:

1. Load Libraries
2. Fetch Data, including EDA
3. Pre-processing
4. Data Segregation

<center><img width="600" src="https://drive.google.com/uc?export=view&id=1fKGuR5U5ECf7On6Zo1UWzAIWZrMmZnGc"></center>


## 1.2 Install and load libraries

In [3]:
!pip install wandb


In [1]:
import wandb
import pandas as pd

## 1.3 Fetch Data

### 1.3.1 Create the raw_data artifact

In [2]:
# importing the dataset
loans = pd.read_csv("../loans_2007.csv")
loans.head()



| | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | ... | last_pymnt_amnt | last_credit_pull_d | collections_12_mths_ex_med | policy_code | application_type | acc_now_delinq | chargeoff_within_12_mths | delinq_amnt | pub_rec_bankruptcies | tax_liens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1077501 | 1296599.0 | 5000.0 | 5000.0 | 4975.0 | 36 months | 10.65% | 162.87 | B | B2 | ... | 171.62 | Jun-2016 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1077430 | 1314167.0 | 2500.0 | 2500.0 | 2500.0 | 60 months | 15.27% | 59.83 | C | C4 | ... | 119.66 | Sep-2013 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1077175 | 1313524.0 | 2400.0 | 2400.0 | 2400.0 | 36 months | 15.96% | 84.33 | C | C5 | ... | 649.91 | Jun-2016 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1076863 | 1277178.0 | 10000.0 | 10000.0 | 10000.0 | 36 months | 13.49% | 339.31 | C | C1 | ... | 357.48 | Apr-2016 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1075358 | 1311748.0 | 3000.0 | 3000.0 | 3000.0 | 60 months | 12.69% | 67.79 | B | B5 | ... | 67.79 | Jun-2016 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
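
The preview shows that `int_rate` and `term` are stored as strings (`10.65%`, `36 months`), so they will likely need converting to numbers during pre-processing. The following is a minimal sketch of one way to do that in plain Python. The helper names `parse_percent` and `parse_term_months` are hypothetical, and the sample values are copied from the rows above; the real columns may contain missing or differently formatted values that this sketch does not handle.

```python
def parse_percent(value: str) -> float:
    """Convert a string such as '10.65%' to the float 10.65."""
    return float(value.strip().rstrip("%"))


def parse_term_months(value: str) -> int:
    """Convert a string such as '36 months' to the integer 36."""
    return int(value.strip().split()[0])


# Sample values copied from the first rows of the preview (not the full column).
int_rates = ["10.65%", "15.27%", "15.96%"]
terms = ["36 months", "60 months", "36 months"]

parsed_rates = [parse_percent(r) for r in int_rates]
parsed_terms = [parse_term_months(t) for t in terms]
print(parsed_rates)
print(parsed_terms)
```

On the DataFrame itself, the same helpers could be applied with `Series.map`, provided missing values are dealt with first.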


In [3]:
loans.to_csv("raw_data.csv", index=False)

In [12]:
import os
from dotenv import load_dotenv
load_dotenv()

WANDB_API_KEY = os.environ.get('WANDB_API_KEY')
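
`os.environ.get` returns `None` when the variable is missing, so a forgotten `.env` entry only surfaces later, at login time. As a hedged sketch, a small guard can fail early with a clearer message. The helper name `require_env` and the placeholder variable `DEMO_API_KEY` are assumptions made for illustration and are not part of the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in the shell."
        )
    return value


# Demonstration with a placeholder variable so a real key is never printed.
os.environ["DEMO_API_KEY"] = "dummy-value"
print(len(require_env("DEMO_API_KEY")))
```

In the notebook, the call would be `require_env("WANDB_API_KEY")` after `load_dotenv()`.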

In [13]:
# Login to Weights & Biases
!wandb login --relogin $WANDB_API_KEY

wandb: Appending key for api.wandb.ai to your netrc file: /Users/phamdinhkhanh/.netrc
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin


Weights & Biases (W&B, also written wandb) is an MLOps platform that helps machine learning practitioners track experiments, visualize metrics, and collaborate. Its key features include:

- **Experiment Tracking**: Log and visualize training metrics like loss, accuracy, and hyperparameters.
- **Model Management**: Compare different runs, save model checkpoints, and reproduce results.
- **Dataset Versioning**: Track and manage datasets used in experiments.
- **Hyperparameter Tuning**: Optimize model performance using W&B’s Sweeps.
- **Collaboration & Reproducibility**: Share experiment results easily with your team.

For a quick start, see the [W&B tutorials](https://docs.wandb.ai/tutorials/).
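
Dataset versioning, one of the features listed above, can also be driven from the Python SDK rather than the command line. The sketch below assumes the `wandb` package is installed and that you are already logged in. The function names `artifact_spec` and `log_raw_data`, the `job_type` value, and the description string are illustrative choices, not taken from the notebook; `risk_credit` is the project name this notebook uses.

```python
from pathlib import Path


def artifact_spec(csv_path: str) -> dict:
    """Collect the artifact fields in one place (illustrative values)."""
    return {
        "name": Path(csv_path).name,  # artifact is named after the file
        "type": "raw_data",
        "description": "Raw Lending Club loan data",  # assumed wording
    }


def log_raw_data(csv_path: str, project: str = "risk_credit") -> None:
    """Upload csv_path to W&B as a versioned artifact."""
    import wandb  # imported lazily so artifact_spec works without wandb

    spec = artifact_spec(csv_path)
    # job_type is a free-form label; "upload_raw_data" is an assumption.
    with wandb.init(project=project, job_type="upload_raw_data") as run:
        artifact = wandb.Artifact(**spec)
        artifact.add_file(csv_path)
        run.log_artifact(artifact)


print(artifact_spec("raw_data.csv"))
```

Calling `log_raw_data("raw_data.csv")` would start a run, attach the file, and log it as a new artifact version. Only the pure helper runs here, so the example does not need network access.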

In [None]:
# Upload raw_data.csv to W&B as an artifact using the wandb CLI (https://docs.wandb.ai/ref/cli/)
!wandb artifact put \
      --name risk_credit/raw_data.csv \
      --type raw_data \
      --description "The raw credit risk data from 2007 Lending Club" raw_data.csv

wandb: Uploading file raw_data.csv to: "aikhanhblog-datascienceworld-kan/risk_credit/raw_data.csv:latest" (raw_data)
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: aikhanhblog (aikhanhblog-datascienceworld-kan) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.19.8
wandb: Run data is saved locally in /Users/phamdinhkhanh/Documents/Courses/DataScienceWorld.Kan/MLOps/credit_risk_mlops/notebooks/wandb/run-20250321_224430-lhsrmz9z
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run different-grass-1
wandb: ⭐️ View project at https://wandb.ai/aikhanhblog-datascienceworld-kan/risk_credit
wandb: 🚀 View run at https://wandb.a