### Create a dummy dataset for gender bias audit assessment

To illustrate how an auditor can apply gender bias audit techniques in practice, our package uses a dummy dataset. This notebook shows the source code that we developed to build this dummy dataset.

- Q. Do I need the dummy dataset to run the package? 
- A. No! The dummy dataset just helps you to better understand the techniques we have used to audit bias.


- Q. Do I need to run this code to use the dummy dataset? 
- A. No. The final output of this code (the dummy data) is stored on the project GitHub page and you can download it there.


- Q. What is the benefit of reading this code? Where in the project might I need it?
- A. You do not need this Jupyter notebook code to run a bias audit. However, we do recommend that you read this notebook. It helps you to better understand how we have created the dataset, the structure of the dataset, and how each variable within it was constructed.  


- Q. What was the logic behind creating these variables and constructing the data in this way?
- A. We tried to make this dataset look similar to the datasets that we received from our financial service provider (FSP) partners. This dummy data consists of three main components:

> Input data: A dataset that an FSP uses as the main input for its credit decisions. This dataset usually contains basic demographic and socio-economic information, credit bureau data, and customer alternative data (if available).

> Output data: This dataset shows whether a credit application has been approved.
If the application is approved, the output dataset shares information on the credit terms of each approved application.
If the application is rejected, the output dataset usually contains information on why the application was rejected.

> Repayment performance data: This dataset shows the loan performance of those customers who have received loans.

### <span style = 'color:purple'> Constructing the input data   </span>

###### Importing packages and setting the working directory

In [1]:
import pandas as pd
import random 
import numpy as np
import uuid
import os
from random import randrange
from datetime import timedelta, datetime

os.chdir(r'C:\Users\mm\Downloads')           # raw string so the backslashes are not read as escape sequences; adjust the path for your machine

Create an empty DataFrame object


In [2]:
df = pd.DataFrame()
n = 400                                      # number of records/applications in the dataset

Create unique applicant and application ID variables

In [3]:
# uuid is a module that provides immutable UUID objects (the UUID class) and the functions uuid1(), uuid3(), uuid4(), uuid5()
# for generating version 1, 3, 4, and 5 UUIDs as specified in RFC 4122. If all you want is a unique ID (our goal here!),
# you should probably call uuid1() or uuid4(). 
# Note that uuid1() may compromise privacy since it creates a UUID containing the computer's network address.
# uuid4() creates a random UUID. We have used uuid4() in this project.

# create unique applicant ID
Applicant_IDs = []
for i in range(n):
    Applicant_IDs.append(str(uuid.uuid4()))  # create a random UUID and then convert it to a string of hex digits in standard 
                                             # form using str function

df['Applicant_ID'] = Applicant_IDs           # add the Applicant_ID variable to our dataset

# create unique application ID
Application_IDs = []
for i in range(n):
    Application_IDs.append(uuid.uuid4().hex) # create a random UUID and then convert it to a 32-character hexadecimal string

df['Application_ID'] = Application_IDs       # add the Application_IDs variable to our dataset
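
The two ID columns use different string forms of a version 4 UUID. The following is a minimal standalone sketch, not part of the original notebook, that shows the difference and adds a cheap uniqueness check; the lowercase variable names are ours.

```python
import uuid

n = 400  # same number of records as in the notebook

# str(UUID) gives the hyphenated 36-character form; .hex gives 32 hex digits.
applicant_ids = [str(uuid.uuid4()) for _ in range(n)]
application_ids = [uuid.uuid4().hex for _ in range(n)]

# Collisions between random UUIDs are extremely unlikely, but the check documents the intent.
assert len(set(applicant_ids)) == n
assert len(set(application_ids)) == n

print(len(applicant_ids[0]), len(application_ids[0]))  # 36 32
```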

Create the credit application date variable 

In [4]:
def random_date(start, end):
    """
    This function will return a random datetime between two datetime objects.
    """
    delta = end - start
    int_delta = (delta.days * 24 * 60 * 60) + delta.seconds
    random_second = randrange(int_delta)
    return start + timedelta(seconds=random_second)

Application_Dates = []

# the date range is the same for every record, so we parse it once, outside the loop
d1 = datetime.strptime('1/1/2019 8:00 PM', '%m/%d/%Y %I:%M %p')  
d2 = datetime.strptime('1/1/2020 5:00 PM', '%m/%d/%Y %I:%M %p')

for i in range(n):
    Application_Dates.append(random_date(d1, d2))

df['Application_Dates'] = Application_Dates    # add the Application_Dates variable to our dataset
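
Because `randrange(k)` returns an integer from 0 to k - 1, the generated dates fall in the half-open interval from the start (inclusive) to the end (exclusive). The sketch below is a standalone check of that property, assuming the same start and end times as the cell above; it uses `total_seconds()` as a compact equivalent of the days-and-seconds arithmetic.

```python
from datetime import datetime, timedelta
from random import randrange

def random_date(start, end):
    """Return a random datetime in [start, end) at one-second resolution."""
    total_seconds = int((end - start).total_seconds())
    return start + timedelta(seconds=randrange(total_seconds))

start = datetime(2019, 1, 1, 20, 0)  # '1/1/2019 8:00 PM'
end = datetime(2020, 1, 1, 17, 0)    # '1/1/2020 5:00 PM'

dates = [random_date(start, end) for _ in range(400)]

# Every generated date is on or after the start and strictly before the end.
assert all(start <= d < end for d in dates)
print(min(dates), max(dates))
```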

Create gender variable

In [5]:
Gender = list(np.random.choice(a = ['Female', 'Male'], size = n, p = [0.37, 0.63]))        
df['Gender'] = Gender                                         # add the Gender variable to our dataset
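
Since each label is drawn independently, the realized shares in a 400-row sample will usually differ slightly from the target probabilities. A minimal sketch for inspecting them is shown below; the seed value is our assumption (the notebook does not set one), and the same check applies to the other categorical variables created with `np.random.choice`.

```python
import numpy as np

np.random.seed(0)  # hypothetical seed, only so that this sketch is repeatable

n = 400
gender = np.random.choice(a=['Female', 'Male'], size=n, p=[0.37, 0.63])

# Count each label and convert the counts to shares of the sample.
labels, counts = np.unique(gender, return_counts=True)
shares = dict(zip(labels.tolist(), (counts / n).tolist()))
print(shares)  # close to, but generally not exactly, 0.37 and 0.63
```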

Create the age variable

In [6]:
# First, we create a list for the applicant's age using a normal distribution.
# We assume that credit applicants must be at least 18 years old and no older than 65.
# Therefore, if the random generator produces an age below 18 or above 65, we replace it with another
# randomly generated age, drawn uniformly between 18 and 65 (inclusive).

Age = map(int, np.random.normal(40, 30, n))  # create a random normal variable with mean 40 and standard deviation of 30  
Age = list(map(lambda x: random.randint(18, 65) if x < 18 or x > 65 else x, Age))
df['Age'] = Age  # add the age variable to our dataset
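
With a standard deviation of 30, a sizeable share of the normal draws falls outside the 18 to 65 range, so the final age variable is a mix of normal and uniform draws. The sketch below reproduces the step as a standalone example and reports the share that was replaced. It also seeds both generators, because this cell uses NumPy and Python's `random` module; the seed values are our assumption, as the notebook is unseeded.

```python
import random
import numpy as np

random.seed(0)     # seeds Python's `random` module (used for the replacement draws)
np.random.seed(0)  # seeds NumPy's global generator (used for the normal draws)

n = 400
raw = [int(x) for x in np.random.normal(40, 30, n)]  # truncate to whole years
out_of_range = [x < 18 or x > 65 for x in raw]

# Replace out-of-range ages with a uniform integer between 18 and 65 (inclusive).
age = [random.randint(18, 65) if bad else x for x, bad in zip(raw, out_of_range)]

print(f"share of ages replaced: {sum(out_of_range) / n:.2f}")
```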

Create education level variable

In [7]:
Education_Level = list(np.random.choice(a = ['Elementary school', 'High school', 'College', 'Others'],
size = n, p = [0.3, 0.4, 0.25, 0.05]))
df['Education_Level'] = Education_Level                         # add the Education_Level variable to our dataset

Create marital status variable

In [8]:
Marital_Status = list(np.random.choice(a = ['Married_Partnered', 'Single', 'Others'], size = n, p = [0.6, 0.26, 0.14])) 
df['Marital_Status'] = Marital_Status                           # add the Marital_Status variable to our dataset

Create employment status variable

In [9]:
Employment_Status = list(np.random.choice(a = ['Employed', 'Unemployed', 'Others'], size = n, p = [0.71, 0.12, 0.17]))
df['Employment_Status'] = Employment_Status                     # add the Employment_Status variable to our dataset

Create credit purpose variable

In [10]:
Credit_Purpose = list(np.random.choice(a = ['Business', 'Family', 'Personal', 'Others'], size = n,p = [0.42, 0.31, 0.10, 0.17]))
df['Credit_Purpose'] = Credit_Purpose                           # add the Credit_Purpose variable to our dataset

Create new or renewal variable

In [11]:
New_Renewal = list(np.random.choice(a = ['Yes', 'No'], size = n, p = [0.77, 0.23]))
df['New_Renewal'] = New_Renewal                                  # add the New_Renewal variable to our dataset

Create application channel variable

In [12]:
Application_Channel = list(np.random.choice(a = ['Digital', 'Physical_Branch'], size = n, p = [0.55, 0.45]))
df['Application_Channel'] = Application_Channel                  # add the Application_Channel variable to our dataset

Create credit bureau variable

In [13]:
Has_Any_Credit_Bureau_Data = list(np.random.choice(a = ['Yes', 'No'], size = n, p = [0.67, 0.33]))
df['Has_Any_Credit_Bureau_Data'] = Has_Any_Credit_Bureau_Data    # add the Has_Any_Credit_Bureau_Data variable to our dataset

Create verification variable

In [14]:
Verification_Any = list(np.random.choice(a = ['Yes', 'No'], size = n, p = [0.13, 0.87]))
df['Verification_Any'] = Verification_Any             # add the Verification_Any variable to our dataset.
                                                      # This variable shows whether an applicant had to pass extra
                                                      # verification checks during their credit assessment.       

Create total income variable

In [15]:
Total_Income = map(int, np.random.normal(6000, 3500, n))  
Total_Income = list(map(lambda x: 0 if x < 0 else x, Total_Income))
df['Total_Income'] = Total_Income                                # add the Total_Income variable to our dataset

Create credit score variable

In [16]:
Credit_Score = map(int, np.random.normal(400, 200, n))  
Credit_Score = list(map(lambda x: random.randint(250, 700) if x < 250 or x > 700 else x, Credit_Score)) 
# we assume the credit score can take a value between 250 and 700 (inclusive).

df['Credit_Score'] = Credit_Score                                # add the Credit_Score variable to our dataset

### <span style = 'color:purple'> Constructing the output data   </span>

Create the credit decision variable

In [17]:
percentile_40 = np.percentile(df.Credit_Score, 40)  
percentile_70 = np.percentile(df.Credit_Score, 70)

def my_func(dt):
    
    # The approval probabilities and the 40th/70th percentile cut-offs below are made-up assumptions,
    # chosen so that the dummy dataset looks similar to real-world datasets.
    
    if (dt['Credit_Score'] <= percentile_40):
        val = list(np.random.choice(a = ('Yes', 'No'), size = 1, p = [0.09, 0.91]))   
    
    elif ((dt['Credit_Score'] > percentile_40) and (dt['Credit_Score'] <= percentile_70)):
        val = list(np.random.choice(a = ('Yes', 'No'), size = 1, p = [0.55, 0.45]))

    else:
        val = list(np.random.choice(a = ('Yes', 'No'), size = 1, p = [0.75, 0.25]))       
    return val[0]

df['Credit_Decision'] = df.apply(my_func, axis = 1)              # add the Credit_Decision variable to our dataset
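
The same banding logic can also be written without a row-wise `apply`. The following is a vectorized sketch under our own assumptions, not the notebook's method: it uses stand-in uniform scores and NumPy's `default_rng` with a hypothetical seed, and it prints the realized approval rate per score band so the effect of the three probabilities is visible.

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical seed; the notebook itself is unseeded

n = 400
scores = rng.integers(250, 701, size=n)  # stand-in credit scores between 250 and 700
p40, p70 = np.percentile(scores, [40, 70])

# np.select uses the first condition that matches, so the second condition
# only applies to scores above the 40th percentile.
approve_prob = np.select([scores <= p40, scores <= p70], [0.09, 0.55], default=0.75)
approved = rng.random(n) < approve_prob  # True means the application is approved

for name, mask in [('low', scores <= p40),
                   ('mid', (scores > p40) & (scores <= p70)),
                   ('high', scores > p70)]:
    print(name, round(approved[mask].mean(), 2))
```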

Create the loan tenure variable

In [18]:
def my_func(dt):
    
    if ((dt['Credit_Score'] <= percentile_40) & (dt['Credit_Decision'] == 'Yes')):
        val = list(map(int, np.random.normal(30, 5, 1)))  
        val = list(map(lambda x: random.randint(6, 36) if x < 6 else x, val)) # we assume loan tenure cannot be less than 6 months
    
    
    elif ((dt['Credit_Score'] > percentile_40) & (dt['Credit_Score'] <= percentile_70) & (dt['Credit_Decision'] == 'Yes')):
        val = list(map(int, np.random.normal(22, 4, 1)))  
        val = list(map(lambda x: random.randint(6, 30) if x < 6 else x, val)) # lower bound is 6 so that replacements respect the 6-month minimum
    
    elif ((dt['Credit_Score'] > percentile_70) & (dt['Credit_Decision'] == 'Yes')):
        val = list(map(int, np.random.normal(15, 3, 1)))  
        val = list(map(lambda x: random.randint(6, 20) if x < 6 else x, val)) 
              
    else:
        val = [np.nan]                                   
    
    return val[0]

df['Loan_Tenure'] = df.apply(my_func, axis = 1)          # add the Loan_Tenure variable to our dataset

Create the interest rate variable

In [19]:
def my_func(dt):
    
    if ((dt['Credit_Score'] <= percentile_40) & (dt['Credit_Decision'] == 'Yes')):
        val = list(map(int, np.random.normal(27, 2, 1)))  
        val = list(map(lambda x: random.randint(22, 33) if x < 0 else x, val)) # interest rate cannot be negative. 
    
    
    elif ((dt['Credit_Score'] > percentile_40) & (dt['Credit_Score'] <= percentile_70) & (dt['Credit_Decision'] == 'Yes')):
        val = list(map(int, np.random.normal(18, 3, 1)))  
        val = list(map(lambda x: random.randint(15, 25) if x < 0 else x, val)) 
    
    elif ((dt['Credit_Score'] > percentile_70) & (dt['Credit_Decision'] == 'Yes')):
        val = list(map(int, np.random.normal(12, 2.5, 1)))  
        val = list(map(lambda x: random.randint(8, 12) if x < 0 else x, val)) 
              
    else:
        val = [np.nan]     
    
    return val[0]

df['Interest_Rate'] = df.apply(my_func, axis = 1)   # add the Interest_Rate variable to our dataset

Create the disbursed credit amount variable

In [20]:
def my_func(dt):
    
    if ((dt['Credit_Score'] <= percentile_40) & (dt['Credit_Decision'] == 'Yes')):
        val = list(map(int, np.random.normal(1000, 100, 1)))  
        val = list(map(lambda x: random.randint(800, 1200) if x < 0 else x, val)) # disbursed loan amount cannot be negative    
    
    elif ((dt['Credit_Score'] > percentile_40) & (dt['Credit_Score'] <= percentile_70) & (dt['Credit_Decision'] == 'Yes')):
        val = list(map(int, np.random.normal(3000, 300, 1)))  
        val = list(map(lambda x: random.randint(2000, 3500) if x < 0 else x, val)) 
    
    elif ((dt['Credit_Score'] > percentile_70) & (dt['Credit_Decision'] == 'Yes')):
        val = list(map(int, np.random.normal(7000, 100, 1)))  
        val = list(map(lambda x: random.randint(5000, 9000) if x < 0 else x, val)) 
              
    else:
        val = [np.nan]       
    
    return val[0]

df['Disbursed_Credit_Amount'] = df.apply(my_func, axis = 1)     # add the Disbursed_Credit_Amount variable to our dataset
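
The loan tenure, interest rate, and disbursed amount cells share one pattern: pick a score band, draw from a band-specific normal distribution, replace draws below a floor with a uniform integer, and return NaN for rejected applications. One possible way to express that pattern once is sketched below. The helper `sample_by_band`, its parameter layout, and the example cut-offs 350 and 500 are hypothetical; the interest rate parameters are copied from the cell above.

```python
import random
import numpy as np

def sample_by_band(score, approved, p40, p70, params):
    """Draw one integer whose distribution depends on the credit score band.

    params maps 'low'/'mid'/'high' to (mean, sd, floor, (lo, hi)). Draws below
    `floor` are replaced with a uniform integer in [lo, hi]. Rejected
    applications get NaN.
    """
    if not approved:
        return np.nan
    band = 'low' if score <= p40 else 'mid' if score <= p70 else 'high'
    mean, sd, floor, (lo, hi) = params[band]
    x = int(np.random.normal(mean, sd))
    return random.randint(lo, hi) if x < floor else x

# Interest rate parameters from the notebook: (mean, sd, floor, replacement range).
INTEREST_RATE = {
    'low': (27, 2, 0, (22, 33)),
    'mid': (18, 3, 0, (15, 25)),
    'high': (12, 2.5, 0, (8, 12)),
}

print(sample_by_band(300, True, 350, 500, INTEREST_RATE))   # low band, approved
print(sample_by_band(300, False, 350, 500, INTEREST_RATE))  # rejected, prints nan
```

With a DataFrame, such a helper could be called inside `df.apply(..., axis = 1)` in the same way as `my_func`.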

### <span style = 'color:purple'> Constructing the repayment data   </span>

Create the number of days delayed variable

In [21]:
def my_func(dt):
    
    if ((dt['Credit_Score'] <= percentile_40) & (dt['Credit_Decision'] == 'Yes')):
        val = list(map(int, np.random.normal(12, 27, 1)))  
        val = list(map(lambda x: 0 if x < 0 else x, val)) # days delay cannot be negative. 
       
    elif ((dt['Credit_Score'] > percentile_40) & (dt['Credit_Score'] <= percentile_70) & (dt['Credit_Decision'] == 'Yes')):
        val = list(map(int, np.random.normal(8, 19, 1)))  
        val = list(map(lambda x: 0 if x < 0 else x, val)) 
    
    elif ((dt['Credit_Score'] > percentile_70) & (dt['Credit_Decision'] == 'Yes')):
        val = list(map(int, np.random.normal(4, 10, 1)))  
        val = list(map(lambda x: 0 if x < 0 else x, val)) 
              
    else:
        val = [np.nan]                 
    
    return val[0]

df['Number_of_days_delayed'] = df.apply(my_func, axis = 1)    # add the Number_of_days_delayed variable to our dataset
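
All variables above live in a single DataFrame. If the three components described at the top of this notebook are needed as separate tables, they can be selected by column name. The sketch below uses two made-up rows in place of the generated data and only a subset of the input columns; the grouping lists and the choice of `Application_ID` as the shared key are our assumptions.

```python
import numpy as np
import pandas as pd

# Column groups, using variable names created in this notebook.
KEY = ['Application_ID']
INPUT_COLS = ['Gender', 'Age', 'Total_Income', 'Credit_Score']  # subset for brevity
OUTPUT_COLS = ['Credit_Decision', 'Loan_Tenure', 'Interest_Rate', 'Disbursed_Credit_Amount']
REPAYMENT_COLS = ['Number_of_days_delayed']

# Two made-up rows standing in for the generated DataFrame.
df = pd.DataFrame({
    'Application_ID': ['a1', 'a2'],
    'Gender': ['Female', 'Male'],
    'Age': [34, 51],
    'Total_Income': [5200, 7100],
    'Credit_Score': [610, 300],
    'Credit_Decision': ['Yes', 'No'],
    'Loan_Tenure': [14, np.nan],
    'Interest_Rate': [11, np.nan],
    'Disbursed_Credit_Amount': [7050, np.nan],
    'Number_of_days_delayed': [3, np.nan],
})

input_data = df[KEY + INPUT_COLS]
output_data = df[KEY + OUTPUT_COLS]
# Repayment data only exists for approved applications.
repayment_data = df.loc[df['Credit_Decision'] == 'Yes', KEY + REPAYMENT_COLS]

print(repayment_data)
```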

***

End.