This notebook builds a usable data frame for training different classification algorithms and showing how they work. The resulting data is stored in the data/ folder so that it can be reused to train each algorithm.

# 1. Set up

# 2. Import necessary libraries

In [1]:
import pandas as pd

In [2]:
import warnings

warnings.simplefilter("ignore")

# 3. Define global variables

In [3]:
INPUT_DATA = "../../data/credit_card_data/Credit_card.csv"
INPUT_TARGET = "../../data/credit_card_data/Credit_card_label.csv"

OUTPUT_PATH = "../../data/credit_card_data/data_modified_binary_classification.csv"

# 4. Functions

# 5. Code

We use a credit card details dataset taken from https://www.kaggle.com/datasets/rohitudageri/credit-card-details.

Variables:

- Ind_ID: Client ID
- Gender: Gender information
- Car_owner: Having car or not
- Propert_owner: Having property or not
- Children: Count of children
- Annual_income: Annual income
- Type_Income: Income type
- Education: Education level
- Marital_status: Marital status
- Housing_type: Living style
- Birthday_count: Birth date as a backward count of days from the current day (0); -1 means yesterday.
- Employed_days: Start date of employment as a backward count of days from the current day (0). A positive value means the individual is currently unemployed.
- Mobile_phone: Any mobile phone
- Work_phone: Any work phone
- Phone: Any phone number
- EMAIL_ID: Any email ID
- Type_Occupation: Occupation
- Family_Members: Family size
- Label: 0 means the application was approved and 1 means it was rejected.
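
As a minimal sketch of how the two day-count variables can be read, the snippet below turns them into an age in years and an unemployment flag. The toy values are taken from the dataset preview in this notebook; the column names `age_years` and `is_unemployed` and the 365.25 divisor are assumptions made for illustration and are not part of the dataset or of this notebook's pipeline.

```python
import pandas as pd

# Toy rows using values that appear in the dataset preview.
sample = pd.DataFrame(
    {
        "Birthday_count": [-18772.0, -13557.0],
        "Employed_days": [365243, -586],
    }
)

# Birthday_count counts backwards from today, so negate it to get days lived.
# Dividing by 365.25 is an approximation (an assumption, not a dataset rule).
sample["age_years"] = (-sample["Birthday_count"] / 365.25).round(1)

# A positive Employed_days value marks a currently unemployed individual.
sample["is_unemployed"] = sample["Employed_days"] > 0

print(sample)
```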

## 5.1. Load and transform data

First, we load both the data and the target variable using the pandas library:

In [4]:
data = pd.read_csv(INPUT_DATA)
data.head()

Unnamed: 0,Ind_ID,GENDER,Car_Owner,Propert_Owner,CHILDREN,Annual_income,Type_Income,EDUCATION,Marital_status,Housing_type,Birthday_count,Employed_days,Mobile_phone,Work_Phone,Phone,EMAIL_ID,Type_Occupation,Family_Members
0,5008827,M,Y,Y,0,180000.0,Pensioner,Higher education,Married,House / apartment,-18772.0,365243,1,0,0,0,,2
1,5009744,F,Y,N,0,315000.0,Commercial associate,Higher education,Married,House / apartment,-13557.0,-586,1,1,1,0,,2
2,5009746,F,Y,N,0,315000.0,Commercial associate,Higher education,Married,House / apartment,,-586,1,1,1,0,,2
3,5009749,F,Y,N,0,,Commercial associate,Higher education,Married,House / apartment,-13557.0,-586,1,1,1,0,,2
4,5009752,F,Y,N,0,315000.0,Commercial associate,Higher education,Married,House / apartment,-13557.0,-586,1,1,1,0,,2


In [5]:
target = pd.read_csv(INPUT_TARGET).rename(columns={"label": "target"})
target.head()

Unnamed: 0,Ind_ID,target
0,5008827,1
1,5009744,1
2,5009746,1
3,5009749,1
4,5009752,1


Let's join both data frames in order to have the complete dataset:

In [6]:
input_df = data.merge(target, on=["Ind_ID"], how="inner")

input_df.head()

Unnamed: 0,Ind_ID,GENDER,Car_Owner,Propert_Owner,CHILDREN,Annual_income,Type_Income,EDUCATION,Marital_status,Housing_type,Birthday_count,Employed_days,Mobile_phone,Work_Phone,Phone,EMAIL_ID,Type_Occupation,Family_Members,target
0,5008827,M,Y,Y,0,180000.0,Pensioner,Higher education,Married,House / apartment,-18772.0,365243,1,0,0,0,,2,1
1,5009744,F,Y,N,0,315000.0,Commercial associate,Higher education,Married,House / apartment,-13557.0,-586,1,1,1,0,,2,1
2,5009746,F,Y,N,0,315000.0,Commercial associate,Higher education,Married,House / apartment,,-586,1,1,1,0,,2,1
3,5009749,F,Y,N,0,,Commercial associate,Higher education,Married,House / apartment,-13557.0,-586,1,1,1,0,,2,1
4,5009752,F,Y,N,0,315000.0,Commercial associate,Higher education,Married,House / apartment,-13557.0,-586,1,1,1,0,,2,1
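
An inner join silently drops any client that appears in only one of the two files, so it can be worth checking the merge. The sketch below uses toy frames (not the real data) together with the `validate` and `indicator` parameters of `DataFrame.merge`; the frame names `features` and `labels` and their values are hypothetical.

```python
import pandas as pd

# Toy frames standing in for `data` and `target` (values are illustrative).
features = pd.DataFrame({"Ind_ID": [1, 2, 3], "CHILDREN": [0, 1, 2]})
labels = pd.DataFrame({"Ind_ID": [1, 2, 4], "target": [0, 1, 0]})

# validate="one_to_one" raises MergeError if Ind_ID repeats on either side.
# An outer join with indicator=True shows which rows lack a partner.
check = features.merge(
    labels, on="Ind_ID", how="outer", validate="one_to_one", indicator=True
)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["Ind_ID", "_merge"]])

# The inner join keeps only the clients present in both frames.
joined = features.merge(labels, on="Ind_ID", how="inner", validate="one_to_one")
print(len(joined))
```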


First, let's convert all column names to lowercase:

In [7]:
input_df.rename(columns=str.lower, inplace=True)

Next, we drop the ID column, since we are not going to use it:

In [8]:
input_df.drop("ind_id", axis=1, inplace=True)

Let's do a little EDA to understand the data and determine whether we can use all the information.

### Duplicated data?

In [9]:
input_df.duplicated().sum()

162

In [10]:
input_df.drop_duplicates(inplace=True)
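
Note that the duplicate count is taken after `ind_id` was dropped, so rows are compared on the remaining columns only. The toy example below (hypothetical values) sketches that effect and also shows why the row index has gaps after `drop_duplicates`.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "ind_id": [10, 11, 12],
        "car_owner": ["Y", "Y", "N"],
        "children": [0, 0, 1],
    }
)

# With the identifier present, every row is unique.
print(toy.duplicated().sum())

# Without it, the second row repeats the first.
no_id = toy.drop("ind_id", axis=1)
print(no_id.duplicated().sum())

# drop_duplicates keeps the first occurrence by default (keep="first")
# and preserves the original index labels of the surviving rows.
deduped = no_id.drop_duplicates()
print(deduped.index.tolist())
```

If a contiguous index is preferred, `reset_index(drop=True)` can be chained after `drop_duplicates()`.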

### Null values?

In [11]:
input_df.isnull().mean() * 100

gender              0.505051
car_owner           0.000000
propert_owner       0.000000
children            0.000000
annual_income       1.659452
type_income         0.000000
education           0.000000
marital_status      0.000000
housing_type        0.000000
birthday_count      1.587302
employed_days       0.000000
mobile_phone        0.000000
work_phone          0.000000
phone               0.000000
email_id            0.000000
type_occupation    31.601732
family_members      0.000000
target              0.000000
dtype: float64

Given these percentages, and since our aim is to show how logistic regression works, we are going to drop the variables that contain null values. This should not normally be done, but here we are focusing only on how logistic regression works.

In [12]:
input_df.drop(["type_occupation", "gender", "birthday_count", "annual_income"], axis=1, inplace=True)
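
Dropping columns is a shortcut taken here for simplicity. One common alternative, sketched below on toy data, is simple imputation: the median for a numeric column and the most frequent value for a categorical one. This is not what this notebook does, and the toy values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "annual_income": [180000.0, 315000.0, None, 90000.0],
        "gender": ["M", "F", None, "F"],
    }
)

# Numeric column: fill gaps with the median, which is robust to outliers.
toy["annual_income"] = toy["annual_income"].fillna(toy["annual_income"].median())

# Categorical column: fill gaps with the most frequent value.
toy["gender"] = toy["gender"].fillna(toy["gender"].mode().iloc[0])

print(toy)
```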

In [13]:
input_df.head(5)

Unnamed: 0,car_owner,propert_owner,children,type_income,education,marital_status,housing_type,employed_days,mobile_phone,work_phone,phone,email_id,family_members,target
0,Y,Y,0,Pensioner,Higher education,Married,House / apartment,365243,1,0,0,0,2,1
1,Y,N,0,Commercial associate,Higher education,Married,House / apartment,-586,1,1,1,0,2,1
2,Y,N,0,Commercial associate,Higher education,Married,House / apartment,-586,1,1,1,0,2,1
3,Y,N,0,Commercial associate,Higher education,Married,House / apartment,-586,1,1,1,0,2,1
5,Y,N,0,Pensioner,Higher education,Married,House / apartment,-586,1,1,1,0,2,1


### Imbalanced dataset?

In [14]:
input_df["target"].value_counts(normalize=True)

0    0.901154
1    0.098846
Name: target, dtype: float64

Yes: about 90% of the rows belong to class 0 and about 10% to class 1, although the imbalance is not extreme.
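
One way a downstream model could account for this kind of class ratio is through class weights. The sketch below computes the "balanced" heuristic, n_samples / (n_classes * class count), on a toy target with a 9:1 ratio; it only illustrates the arithmetic and is not a step of this notebook.

```python
import pandas as pd

# Toy target with a 9:1 ratio as a rough stand-in for the real column.
target = pd.Series([0] * 9 + [1], name="target")

counts = target.value_counts()
n_samples = len(target)
n_classes = counts.size

# "Balanced" weighting: n_samples / (n_classes * count of each class),
# so the rarer class receives the larger weight.
class_weight = (n_samples / (n_classes * counts)).to_dict()
print(class_weight)
```

A dictionary of this shape can typically be passed to estimators that accept per-class weights.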

# 6. Write the results

In [15]:
input_df.to_csv(OUTPUT_PATH, sep=";", header=True, index=False)
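
Because the file is written with `sep=";"`, the same separator has to be passed when it is read back. A minimal round-trip sketch, using an in-memory buffer and a hypothetical toy frame instead of `OUTPUT_PATH`:

```python
import io

import pandas as pd

toy = pd.DataFrame({"car_owner": ["Y", "N"], "target": [1, 0]})

# Write to an in-memory buffer to keep the sketch self-contained.
buffer = io.StringIO()
toy.to_csv(buffer, sep=";", header=True, index=False)
buffer.seek(0)

# Reading with the default comma separator would yield a single column.
restored = pd.read_csv(buffer, sep=";")
print(restored.columns.tolist())
```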