## Generating the Dataset

This script generates synthetic loan data covering the three years up to April 22, 2024. Each randomly generated record contains a loan ID, borrower ID, loan amount, start date, maturity date, loan type, status (active or completed), investor count, profit percentage, repayment amount with interest, and risk rating.

The script steps through the date range in 30-day increments, which approximates one iteration per month. On each step it draws a random number of loans (between 1,000 and 2,000) and populates their details. Finally, it saves the generated records to a JSON file named `loans_data.json` using the `json.dump()` function.
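
In the generation script, the repayment amount is the loan amount plus a flat profit percentage, truncated to a whole number. The sketch below illustrates that arithmetic on its own. The helper name `repayment_with_interest` is hypothetical, and the integer-only formulation is an assumption chosen for illustration rather than the script's own expression, which multiplies by a float and truncates with `int()`.

```python
def repayment_with_interest(loan_amount: int, profit_percentage: int) -> int:
    """Return the principal plus a flat percentage markup, rounded down.

    Hypothetical helper: not part of the original script.
    """
    # Integer arithmetic: multiply first, then floor-divide by 100,
    # so the result never depends on floating-point rounding.
    return loan_amount * (100 + profit_percentage) // 100


print(repayment_with_interest(5000, 5))    # 5250
print(repayment_with_interest(50000, 15))  # 57500
print(repayment_with_interest(12345, 7))   # 13209
```

Because a float product can in principle land just below a whole number before truncation, the two formulations may occasionally differ by one unit.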

In [1]:
import json
import random
import pandas as pd
from datetime import datetime, timedelta

end_date = datetime(2024, 4, 22)
start_date = end_date - timedelta(days=3*365)  # Three years ago

loans = []
loan_id_counter = 1  # Starting counter for unique loan IDs

current_date = start_date
while current_date <= end_date:
    num_loans = random.randint(1000, 2000)

    for _ in range(num_loans):
        loan_id = f"L{loan_id_counter}"
        borrower_id = random.randint(1, 100)
        loan_amount = random.randint(5000, 50000)
        # Use a distinct name so the range-level start_date is not overwritten
        loan_start_date = current_date.replace(day=random.randint(1, 22))
        maturity_date = loan_start_date + timedelta(days=random.randint(180, 1095))  # Loan duration between 6 months and 3 years
        loan_type = random.choice(["Personal", "Business", "Mortgage"])
        status = random.choice(["Active", "Completed"])
        investors_count = random.randint(1, 20)
        profit_percentage = random.randint(5, 15)
        repayment_amount_with_interest = int(loan_amount * (1 + profit_percentage / 100))
        risk_rating = random.choice(["Low", "Medium", "High"])

        loan_data = {
            "borrower_id": borrower_id,
            "loan_id": loan_id,
            "loan_amount": loan_amount,
            "start_date": loan_start_date.strftime("%Y-%m-%d"),
            "maturity_date": maturity_date.strftime("%Y-%m-%d"),
            "loan_type": loan_type,
            "status": status,
            "investors_count": investors_count,
            "profit_percentage": profit_percentage,
            "repayment_amount_with_interest": repayment_amount_with_interest,
            "risk_rating": risk_rating
        }
        loans.append(loan_data)

        loan_id_counter += 1

    current_date += timedelta(days=30)  # Advance by 30 days (roughly one month)

data = {"loans": loans}

with open("loans_data.json", "w") as f:
    json.dump(data, f, indent=4)

print("Generated loan data saved to 'loans_data.json'")

Generated loan data saved to 'loans_data.json'
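
The loop above advances in fixed 30-day steps, which is slightly shorter than most months, so over longer ranges two steps could fall inside the same calendar month. As a hedged alternative, the sketch below walks true calendar months using only the standard library. The generator name `month_starts` is hypothetical and is not part of the original script.

```python
from datetime import date


def month_starts(start: date, end: date):
    """Yield the first day of each calendar month from start's month to end's month."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield date(year, month, 1)
        month += 1
        if month > 12:  # Roll over into January of the next year
            year, month = year + 1, 1


# Same range as the script: three years (3 * 365 days) back from April 22, 2024
months = list(month_starts(date(2021, 4, 23), date(2024, 4, 22)))
print(len(months))             # 37
print(months[0], months[-1])   # 2021-04-01 2024-04-01
```

For this particular range the result is also 37 iterations, the same number the 30-day loop performs.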


*********************

## Information and Statistics About the Dataset

In [2]:
with open('loans_data.json', 'r') as json_file:
    loan_data = json.load(json_file)

loans_df = pd.DataFrame(loan_data['loans'])

loan_stats = loans_df.describe(include='all')

total_loans = len(loans_df)
total_loan_amount = loans_df['loan_amount'].sum()
average_loan_amount = loans_df['loan_amount'].mean()
max_loan_amount = loans_df['loan_amount'].max()
min_loan_amount = loans_df['loan_amount'].min()
active_loans = (loans_df['status'] == 'Active').sum()
completed_loans = (loans_df['status'] == 'Completed').sum()

risk_rating_counts = loans_df['risk_rating'].value_counts()

loan_type_counts = loans_df['loan_type'].value_counts()

print(f"Total number of loans: {total_loans}")
print(f"Total loan amount across all loans: ${total_loan_amount}")
print(f"Average loan amount: ${average_loan_amount}")
print(f"Largest loan amount: ${max_loan_amount}")
print(f"Smallest loan amount: ${min_loan_amount}")
print(f"Number of active loans: {active_loans}")
print(f"Number of completed loans: {completed_loans}")


print("\nRisk Rating Distribution:")
print(risk_rating_counts)

print("\nLoan Type Distribution:")
print(loan_type_counts)

Total number of loans: 54077
Total loan amount across all loans: $1487621688
Average loan amount: $27509.32352016569
Largest loan amount: $49999
Smallest loan amount: $5000
Number of active loans: 27040
Number of completed loans: 27037

Risk Rating Distribution:
High      18114
Medium    18104
Low       17859
Name: risk_rating, dtype: int64

Loan Type Distribution:
Mortgage    18139
Personal    17970
Business    17968
Name: loan_type, dtype: int64
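
Since every category is drawn uniformly at random, each one should hold roughly a third of the records. Raw counts make that hard to see at a glance, so a possible follow-up is to look at shares instead. The sketch below uses a tiny stand-in frame (`sample_df` is hypothetical) rather than the full dataset.

```python
import pandas as pd

# Tiny stand-in frame; the column name mirrors the dataset above.
sample_df = pd.DataFrame(
    {"risk_rating": ["Low", "High", "Medium", "High", "Low", "High"]}
)

# normalize=True returns fractions of the total instead of raw counts.
risk_share = sample_df["risk_rating"].value_counts(normalize=True)
print(risk_share.round(3))
```

The same call on `loans_df['risk_rating']` or `loans_df['loan_type']` would give the corresponding shares for the generated data.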


In [3]:
loans_df['start_date'] = pd.to_datetime(loans_df['start_date'])

loans_df['loan_year'] = loans_df['start_date'].dt.year

loan_counts_by_year = loans_df['loan_year'].value_counts().sort_index()

print("Number of Loans for Each Year:")
print(loan_counts_by_year)

Number of Loans for Each Year:
2021    12235
2022    18330
2023    17288
2024     6224
Name: loan_year, dtype: int64
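
The yearly totals differ mainly because 2021 and 2024 are only partly covered by the date range. A finer, month-level breakdown makes this easier to verify. The sketch below shows one way to do it on a small stand-in frame; `sample_df` and the `loan_month` column are hypothetical names.

```python
import pandas as pd

# Small stand-in frame with an already-parsed start_date column.
sample_df = pd.DataFrame(
    {"start_date": pd.to_datetime(
        ["2021-04-03", "2021-04-19", "2021-05-07", "2022-01-11"]
    )}
)

# Convert each timestamp to its calendar month, then count per month.
sample_df["loan_month"] = sample_df["start_date"].dt.to_period("M")
loans_per_month = sample_df["loan_month"].value_counts().sort_index()
print(loans_per_month)
```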


In [4]:
print(loans_df['start_date'].min())
print(loans_df['start_date'].max())
# maturity_date is still a string column; ISO-formatted dates sort correctly as text
print(loans_df['maturity_date'].min())
print(loans_df['maturity_date'].max())

2021-04-01 00:00:00
2024-04-22 00:00:00
2021-10-01
2027-04-21
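
These bounds can be cross-checked against the generator, which assigns every loan a duration of 180 to 1,095 days. The sketch below parses both date columns and verifies that range on a two-row stand-in frame; `sample_df` and `duration_days` are hypothetical names.

```python
import pandas as pd

# Two stand-in loans sitting exactly on the shortest and longest durations.
sample_df = pd.DataFrame({
    "start_date": ["2021-04-01", "2023-06-15"],
    "maturity_date": ["2021-09-28", "2026-06-14"],
})

# Parse both columns so date arithmetic is possible.
for column in ("start_date", "maturity_date"):
    sample_df[column] = pd.to_datetime(sample_df[column])

# Subtracting datetime columns yields timedeltas; .dt.days extracts whole days.
sample_df["duration_days"] = (
    sample_df["maturity_date"] - sample_df["start_date"]
).dt.days

# between() is inclusive on both ends by default.
print(sample_df["duration_days"].between(180, 1095).all())
```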


In [5]:
print("\nDetailed Loan Statistics:")
loan_stats


Detailed Loan Statistics:


| | borrower_id | loan_id | loan_amount | start_date | maturity_date | loan_type | status | investors_count | profit_percentage | repayment_amount_with_interest | risk_rating |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 54077.0 | 54077 | 54077.0 | 54077 | 54077 | 54077 | 54077 | 54077.0 | 54077.0 | 54077.0 | 54077 |
| unique | | 54077 | | 814 | 2011 | 3 | 2 | | | | 3 |
| top | | L1 | | 2023-02-20 | 2024-12-25 | Mortgage | Active | | | | High |
| freq | | 1 | | 116 | 72 | 18139 | 27040 | | | | 18114 |
| mean | 50.573571 | | 27509.32352 | | | | | 10.546591 | 9.984799 | 30256.619894 | |
| std | 28.854541 | | 13006.518267 | | | | | 5.760193 | 3.16227 | 14339.105593 | |
| min | 1.0 | | 5000.0 | | | | | 1.0 | 5.0 | 5254.0 | |
| 25% | 26.0 | | 16271.0 | | | | | 6.0 | 7.0 | 17885.0 | |
| 50% | 51.0 | | 27475.0 | | | | | 11.0 | 10.0 | 30200.0 | |
| 75% | 76.0 | | 38812.0 | | | | | 16.0 | 13.0 | 42677.0 | |
