# Analyze the risk of customer defaults on a loan

Your job is to prepare reports for the credit division of a bank. You will find out the influence of a customer's marital status and the number of children he has on the probability of default in loan repayment. The bank already has some data regarding customer credit worthiness.

Your report will be considered when making a **Credit assessment** for prospective customers. **Credit scoring** is used to evaluate a potential borrower's ability to repay their loan.

## Open the data file and read the general information.

[Start by importing the library and loading the data. You will probably find that you need additional libraries as you work on this project and that is normal. Just make sure to update this section if needed.]

In [1]:
# Import library
import pandas as pd

In [2]:
# Load the data
data = pd.read_csv('/datasets/credit_scoring_eng.csv')


## Question 1. Data exploration

**Data description**
- `children` - number of children in the family
- `days_employed` - customer work experience in days
- `dob_years` - customer age in years
- `education` - customer education level
- `education_id` - identifier for the customer's education level
- `family_status` - marital status
- `family_status_id` - identifier for the customer's marital status
- `gender` - customer gender
- `income_type` - job type
- `debt` - whether the customer has ever defaulted on a loan
- `total_income` - monthly income
- `purpose` - the purpose of obtaining a loan

[Now, it's time to explore our data. You need to see how many columns and rows the data has, as well as look closely at multiple rows of data to check for potential problems with the data.]

In [3]:
# Let's see how many rows and columns our dataset has
data.shape

(21525, 12)

In [4]:
# Let's display the first 5 rows
data.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding


Based on the data displayed, here are several things that can be observed:

1. Column 'children': There is information about the number of children in this column. The values in this column look reasonable and there are no obvious problems.

2. Column 'days_employed': There are some very extreme values, as seen in row 4 with a very high value (340,266) which may be an error. These values also appear to be mixed with negative values, which may not correspond to reality. This issue may require repair and further investigation.

3. Column 'dob_years': There is information about the customer's age in this column. There were no obvious problems in this sample.

4. 'education' and 'education_id' columns: There are variations in education spelling, such as "bachelor's degree" and "Secondary Education," that may need to be normalized for consistency.

5. 'family_status' and 'family_status_id' columns: There is information about the customer's marital status in this column. There were no obvious problems in this sample.

6. Column 'gender': There is information about the customer's gender. There were no obvious problems in this sample.

7. Column 'income_type': There is information about the customer's job type. There were no obvious problems in this sample.

8. 'debt' column: There is information about whether the customer has debt. There were no obvious problems in this sample.

9. Column 'total_income': There are some missing values in this column, which we discussed earlier. Additionally, as already discussed, there are very extreme values that need to be investigated further.

10. 'purpose' column: There is information about the purpose of the loan in this column. There were no obvious problems in this sample.

Of further concern in this sample are the extreme values in the 'days_employed' column and the missing values in the 'total_income' column. This issue may require correction and a better understanding of the cause to provide a more accurate analysis. In addition, normalization in the 'education' column can also be an additional step to facilitate analysis.

In [5]:
# Get data information
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


There are missing values only in a few columns, namely 'days_employed' and 'total_income'. All other columns have complete (non-null) values for each row of data.

In [6]:
# Let's look at the table that has been filtered with missing values in the first column containing the missing data
data[data.isna().any(axis=1)]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21489,2,,47,Secondary Education,1,married,0,M,business,0,,purchase of a car
21495,1,,50,secondary education,1,civil partnership,1,F,employee,0,,wedding ceremony
21497,0,,48,BACHELOR'S DEGREE,0,married,0,F,business,0,,building a property
21502,1,,42,secondary education,1,married,0,F,employee,0,,building a real estate


The missing values do not appear symmetrical. Although the number of missing values for 'days_employed' and 'total_income' is the same (2.174), the assumption that the missing samples are of the same size cannot be guaranteed without further investigation. To ensure symmetry, it is necessary to compare the missing values in each row with the missing values in all columns to see whether the missing sample size is the same for each row.

In [7]:
# Let's apply some conditions to filter the data and see the number of rows in the table that have been filtered.
missing_employment_income = data[(data['days_employed'].isnull()) & (data['total_income'].isnull())]
print(len(missing_employment_income))
print('\n---------------------------------------------\n')
data.isna().sum()

2174

---------------------------------------------



children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64

**Tentative conclusions**<br>
The number of rows in the filtered table (2,174 rows) corresponds to the number of missing values in the 'days_employed' and 'total_income' columns (2,174 values). This shows that the filtering conditions are functioning well.

To calculate the percentage of missing values in a dataset, we can use the following formula:

In [8]:
# Counts the number of rows in the dataset
total_rows = len(data)

# Calculate the number of missing values in 'days_employed' and 'total_income' columns
missing_values = data[['days_employed', 'total_income']].isnull().sum()

# Calculates the percentage of missing values
percentage_missing = (missing_values / total_rows) * 100

# Displays the percentage with two decimal digits and a percent sign
print("Percentage of Missing Values:")
print(percentage_missing.round(2).astype(str) + "%")

Percentage of Missing Values:
days_employed    10.1%
total_income     10.1%
dtype: object


Missing values are about 10.10% of the entire dataset. Although it is not very big, it is still significant enough to be noticed. The next steps are:

1. **Consider characteristics causing missing values**: We need to check whether there are certain customer characteristics that might be the cause of missing values, such as job type, marital status, age, etc.

2. **Identify missing value dependencies**: We need to see if there are any dependencies between the columns that identify specific customer characteristics and the missing values in the 'days_employed' and 'total_income' columns. This can help us understand whether there are certain patterns in data loss.

The correlation between the 'days_employed' and 'total_income' columns with other columns also needs to be checked to understand whether there is a significant relationship.

These steps will help us better understand and address the problem of missing values in the dataset.

In [9]:
# Let's check for customers who have no data on the identified characteristics and columns with missing values

#Calculate the percentage of missing values
#-------------
# Analysis by job type
income_type_missing = data[data['days_employed'].isnull() & data['total_income'].isnull()]['income_type']
income_type_missing_counts = income_type_missing.value_counts()
print("Number of missing values by job type:")
print(income_type_missing_counts)
print('\n---------------------------------------------\n')

# Calculates the correlation between 'days_employed' and 'total_income'
correlation_days_income = data['days_employed'].corr(data['total_income'])

print("Correlation between 'days_employed' and 'total_income':", correlation_days_income)

Number of missing values by job type:
income_type
employee         1105
business          508
retiree           413
civil servant     147
entrepreneur        1
Name: count, dtype: int64

---------------------------------------------

Correlation between 'days_employed' and 'total_income': -0.13664831355529825


The correlation between 'days_employed' and 'total_income' is -0.13664831355529825. Correlation is a statistical measure that measures the strength and direction of a linear relationship between two variables. The correlation value ranges from -1 to 1. Here, the correlation value is negative, which indicates that there is a negative linear relationship between 'days_employed' and 'total_income'.

Explanation:

Correlation Value: -0.13664831355529825
Correlation Direction: Negative
A negative correlation indicates that there is a weak trend, but in the opposite direction, between two variables. In this context, the correlation between 'days_employed' and 'total_income' shows that the greater the number of working days ('days_employed'), the lower the total income ('total_income').

However, it should be noted that this correlation is weak (close to 0), so the relationship between 'days_employed' and 'total_income' is not very strong. The correlation of -0.13664831355529825 indicates that the relationship between 'days_employed' and 'total_income' is only slight, and cannot be considered a strong or statistically significant relationship

In [10]:
# Check the distribution
# Check the distribution of 'days_employed' column
print("Distribution 'days_employed':")
print(data['days_employed'].describe())
print('\n--------------------------------------------- \n')

# Check the distribution of column 'total_income'
print("Distribution of 'total_income':")
print(data['total_income'].describe())

Distribution 'days_employed':
count     19351.000000
mean      63046.497661
std      140827.311974
min      -18388.949901
25%       -2747.423625
50%       -1203.369529
75%        -291.095954
max      401755.400475
Name: days_employed, dtype: float64

--------------------------------------------- 

Distribution of 'total_income':
count     19351.000000
mean      26787.568355
std       16475.450632
min        3306.762000
25%       16488.504500
50%       23202.870000
75%       32549.611000
max      362496.645000
Name: total_income, dtype: float64


Based on the analysis that has been carried out, the following are the findings and possible causes of missing values in the data:

**Findings:**
1. There are 2,174 rows with missing values in the 'days_employed' and 'total_income' columns.
2. The 'days_employed' distribution has strange negative values (for example, a minimum value of -18,388.95 days and a maximum value of 401,755.40 days), which does not seem to make sense in the context of weekdays.
3. The 'total_income' distribution has a wide range of values, from 3,306.76 to 362,496.65, reflecting the diversity of customer income.

**Possible Causes of Loss of Value:**
1. One possible cause of missing values in the 'days_employed' column may be technical problems or inaccurate data entry. Unreasonable values as seen in the distribution may be the result of errors when filling in this column, such as measurement or data processing errors.
2. Missing values in the 'total_income' column can also be caused by various factors. Some customers may choose not to provide information about their income, and this could result in a large amount of missing data in this column.
3. Another possibility is that there is a pattern in the missing values that may be related to the customer's job type or other factors in the data.

**Possible Dependency or Pattern Check:**
To understand whether these missing values are related to certain customer characteristics, the next step is to further examine whether there are any dependencies or patterns in these missing values. In this case, we have checked the number of missing values by customer job type. The results show that different types of work have different amounts of missing values. However, next steps may include further statistical analysis or hypothesis testing to confirm the presence of significant patterns or dependencies.

Next steps should include further investigation into the causes of these missing values and finding ways to address the missing values according to the context of the data and the problem to be solved.

In [11]:
# Calculate the average of 'days_employed' and 'total_income' by job type
mean_by_income_type = data.groupby('income_type')[['days_employed', 'total_income']].mean()
median_by_income_type = data.groupby('income_type')[['days_employed', 'total_income']].median()
mode_by_income_type = data.groupby('income_type')[['days_employed', 'total_income']].agg(lambda x: x.mode().iloc[0])

print("Mean by Income Type:")
print(mean_by_income_type)
print('\n--------------------------------------------- \n')

print("\nMedian by Income Type:")
print(median_by_income_type)
print('\n--------------------------------------------- \n')

print("\nMode by Income Type:")
print(mode_by_income_type)
print('\n--------------------------------------------- \n')

# Calculate the average age by job type
mean_age_by_income_type = data.groupby('income_type')['dob_years'].mean()
print("\nMean Age by Income Type:")
print(mean_age_by_income_type)
print('\n--------------------------------------------- \n')

# Check distribution across datasets
numerical_columns = data.select_dtypes(include=['int64', 'float64'])
descriptive_stats = numerical_columns.describe()
print(descriptive_stats)

Mean by Income Type:
                             days_employed  total_income
income_type                                             
business                      -2111.524398  32386.793835
civil servant                 -3399.896902  27343.729582
employee                      -2326.499216  25820.841683
entrepreneur                   -520.848083  79866.103000
paternity / maternity leave   -3296.759962   8612.661000
retiree                      365003.491245  21940.394503
student                        -578.751554  15712.260000
unemployed                   366413.652744  21014.360500

--------------------------------------------- 


Median by Income Type:
                             days_employed  total_income
income_type                                             
business                      -1547.382223    27577.2720
civil servant                 -2689.368353    24071.6695
employee                      -1574.202821    22815.1035
entrepreneur                   -520.848083    79866

**Tentative conclusions**

From the distribution results of the original dataset and the filtered table, there are several striking differences. For example, in a table that has been filtered for missing values, there are many cases with very high 'days_employed' values, which may reflect a problem or anomaly in the data. However, in the original dataset, the distribution of 'days_employed' is more evenly distributed.

This indicates that missing 'days_employed' values in filtered tables may not occur randomly and could be associated with specific cases, such as errors in data entry or problems in the way the data is managed.

The tentative conclusion is that we need to further examine why the 'days_employed' value is missing in certain cases and whether there are certain patterns or characteristics in the data that could explain this. If a clearer cause is found, then we can take further action to resolve this problem.

In [12]:
# Check for other causes and patterns that could result in missing values
# Calculate the number of missing values based on 'education' in the 'days_employed' column
missing_days_employed_by_education = data[data['days_employed'].isnull()].groupby('education')['days_employed'].count()

print("Number of missing values by education type ('education') in column 'days_employed':")
print(missing_days_employed_by_education)

Number of missing values by education type ('education') in column 'days_employed':
education
BACHELOR'S DEGREE      0
Bachelor's Degree      0
PRIMARY EDUCATION      0
Primary Education      0
SECONDARY EDUCATION    0
SOME COLLEGE           0
Secondary Education    0
Some College           0
bachelor's degree      0
primary education      0
secondary education    0
some college           0
Name: days_employed, dtype: int64


Information:

From the calculation results, it can be seen that there are no missing values in the 'days_employed' column based on the type of education ('education'). All education categories have complete values. Thus, there are no missing values in the 'days_employed' column that can be attributed to the type of education in this dataset.

**Tentative conclusions**
<br>Based on the analysis that has been carried out, we cannot definitely conclude that the missing values are a coincidence. Although we examined several factors such as job type, education type, and age, we did not find a clear pattern that could explain the loss of values. However, we still need to examine other factors that may be contributing to the missing values, as well as conduct further analysis to understand whether there are certain factors that are related to the incomplete data. Next, we need to consider what action to take regarding these missing values, whether they need to be filled or not.

In [13]:
# Check other patterns - explain them


In [14]:
missing_values_by_column = data[['children', 'dob_years', 'education', 'family_status', 'gender', 'income_type', 'debt', 'purpose']].isnull().sum()
missing_values_by_column = missing_values_by_column.reset_index()
missing_values_by_column.columns = ['Column', 'Missing Values based on days_employed and total_income']
print(missing_values_by_column)

          Column  Missing Values based on days_employed and total_income
0       children                                                  0     
1      dob_years                                                  0     
2      education                                                  0     
3  family_status                                                  0     
4         gender                                                  0     
5    income_type                                                  0     
6           debt                                                  0     
7        purpose                                                  0     


**Conclusions**
<br>Based on the analysis results, no special pattern was found that connects certain columns with missing values in the 'days_employed' and 'total_income' columns. The number of missing values in these two columns does not appear to be correlated with the other columns.

To deal with missing values, we can perform the following steps:
1. **Handling Missing Values on 'days_employed' and 'total_income':**
    - For the 'days_employed' column, we can fill in the missing values with the average or median value of this column based on job type (income_type), because there is a correlation between job type and 'days_employed'.
    - For the 'total_income' column, we can also fill in the missing values with the average or median value based on job type (income_type).

2. **Other Value Handling:**
    - There are no missing values in other columns, such as 'children', 'dob_years', 'education', 'family_status', 'gender', 'debt', and 'purpose'. Therefore, we do not need to perform any special actions on these columns regarding missing values.

3. **Additional Data Cleaning:**
    - In addition to dealing with missing values, we also need to do additional data cleaning, such as dealing with duplicate data, normalizing data (for example, combining categories with different names), and checking the correctness of existing values.

4. **Data Transformation:**
    - If necessary, we can perform additional data transformations to prepare this dataset for further analysis, such as changing the data type, merging columns, or extracting new features.

5. **Further Analysis:**
    - Once the data has been cleaned and missing values addressed, we can proceed with further analysis, such as statistical analysis, predictive modeling, or data visualization.

All of these steps will help us to process the data better and prepare it for further analysis and better decision making.

## Data transformation

[Let's take a look at each column to see what problems they might have.]

[Start by removing duplicates and correcting data about educational information if necessary.]

In [15]:
# Let's look at all the values in the education column to check what spelling needs to be corrected
data['education'].unique()

array(["bachelor's degree", 'secondary education', 'Secondary Education',
       'SECONDARY EDUCATION', "BACHELOR'S DEGREE", 'some college',
       'primary education', "Bachelor's Degree", 'SOME COLLEGE',
       'Some College', 'PRIMARY EDUCATION', 'Primary Education',
       'Graduate Degree', 'GRADUATE DEGREE', 'graduate degree'],
      dtype=object)

In [16]:
# Correct the recording if necessary
data['education'] = data['education'].str.lower()

In [17]:
# Check all the values in the column to ensure that we have corrected them appropriately
data['education'].unique()

array(["bachelor's degree", 'secondary education', 'some college',
       'primary education', 'graduate degree'], dtype=object)

[Check `children` column data]

In [18]:
# Let's look at the distribution of values in the `children` column
data['children'].unique()

array([ 1,  0,  3,  2, -1,  4, 20,  5], dtype=int64)

In the 'children' column, there are several strange values, namely -1 and 20. The percentage of problematic data is quite low compared to the total number of entries in the dataset, but this still requires attention. A value of -1 may be an input or data processing error that occurred accidentally. The value 20 may also be an input error, but it may actually be the case if there is a large family with 20 children.

To handle these values, we can consider the value -1 as an input error and replace it with the value 1 (assuming that they have one child), while the value 20 may be the case that actually exists. However, if we feel that 20 children is too extreme, we may decide to consider it an input error and replace it with a more reasonable value, such as 2 or 3.

The final choice depends on the context of the data and whether we want to retain these entries in the analysis or not. In many cases, replacing implausible values with more reasonable values is a good step to improve data quality.

In [19]:
# [correct data based on your decision]
data['children'] = data['children'].replace(-1, 1)
data['children'] = data['children'].replace(20, 2)

In [20]:
# Check the `children` column again to ensure that everything is corrected
data['children'].unique()

array([1, 0, 3, 2, 4, 5], dtype=int64)

[Check the data in the `days_employed` column. First, think about what problems the column might have, and also think about what you might want to check and how you would do it.]

In [21]:
# Find problematic data in the `days_employed` column if there is a problem and calculate the percentage
data['days_employed'].unique()

array([-8437.67302776, -4024.80375385, -5623.42261023, ...,
       -2113.3468877 , -3112.4817052 , -1984.50758853])

In [22]:
# Resolve problematic values, if any
# Replaces negative values with positive values
data['days_employed'] = data['days_employed'].abs()

In [23]:
# Check the results - make sure that the problem is fixed
data['days_employed'].unique()

array([8437.67302776, 4024.80375385, 5623.42261023, ..., 2113.3468877 ,
       3112.4817052 , 1984.50758853])

[Now, let's look at the customer's age and check if there are any problems there. Again, think about what possible oddities we could encounter in this column, such as age numbers that don't make sense.]

In [24]:
# Check `dob_years` for suspicious values and calculate the percentage
data['dob_years'].unique()

array([42, 36, 33, 32, 53, 27, 43, 50, 35, 41, 40, 65, 54, 56, 26, 48, 24,
       21, 57, 67, 28, 63, 62, 47, 34, 68, 25, 31, 30, 20, 49, 37, 45, 61,
       64, 44, 52, 46, 23, 38, 39, 51,  0, 59, 29, 60, 55, 58, 71, 22, 73,
       66, 69, 19, 72, 70, 74, 75], dtype=int64)

In [25]:
# Troubleshoot the `dob_years` column, if there is a problem
# Replace the value '0' in the 'dob_years' column with the average age value
average_age = data['dob_years'].mean()
average_age_int = int(average_age) # Convert average age to an integer

# Replace the value '0' in the 'dob_years' column with the average age value (integer)
data.loc[data['dob_years'] == 0, 'dob_years'] = average_age_int

In [26]:
# Check the results - make sure that the problem is fixed
data['dob_years'].unique()

array([42, 36, 33, 32, 53, 27, 43, 50, 35, 41, 40, 65, 54, 56, 26, 48, 24,
       21, 57, 67, 28, 63, 62, 47, 34, 68, 25, 31, 30, 20, 49, 37, 45, 61,
       64, 44, 52, 46, 23, 38, 39, 51, 59, 29, 60, 55, 58, 71, 22, 73, 66,
       69, 19, 72, 70, 74, 75], dtype=int64)

[Now, it's time to check the `family_status` column. Check what kind of values are contained in this column and what problems you might need to resolve.]

In [27]:
# Let's look at the values for this column
data['family_status'].unique()


array(['married', 'civil partnership', 'widow / widower', 'divorced',
       'unmarried'], dtype=object)

In [28]:
# Resolve problematic values in `family_status`, if any

In [29]:
# Check the results - make sure the values are corrected
data['family_status'].unique()

array(['married', 'civil partnership', 'widow / widower', 'divorced',
       'unmarried'], dtype=object)

[Now, it's time to check the `gender` column. Check what kind of values are contained in this column and what problems you might need to resolve]

In [30]:
# Let's look at the values in this column
data['gender'].unique()

array(['F', 'M', 'XNA'], dtype=object)

there are no significant problems

[Now, it's time to check the `income_type` column. Check what kind of values are contained in this column and what problems you might need to resolve]

In [31]:
# Let's look at the values in this column
data['income_type'].unique()

array(['employee', 'retiree', 'business', 'civil servant', 'unemployed',
       'entrepreneur', 'student', 'paternity / maternity leave'],
      dtype=object)

there are no significant problems

[Now, it's time to see if there are any duplicates in our data. If we find one, you must decide what you will do with the duplicate and explain why.]

In [32]:
# Check for duplicates
#Show duplicates in column form
print(data[data.duplicated()])
print('\n--------------------------------------------- \n')
# Check for duplicates
print("Number of duplicates:", data.duplicated().sum())

       children  days_employed  dob_years            education  education_id  \
2849          0            NaN         41  secondary education             1   
3290          0            NaN         58  secondary education             1   
4182          1            NaN         34    bachelor's degree             0   
4851          0            NaN         60  secondary education             1   
5557          0            NaN         58  secondary education             1   
...         ...            ...        ...                  ...           ...   
20702         0            NaN         64  secondary education             1   
21032         0            NaN         60  secondary education             1   
21132         0            NaN         47  secondary education             1   
21281         1            NaN         30    bachelor's degree             0   
21415         0            NaN         54  secondary education             1   

           family_status  family_status

In [33]:
# Resolve duplicates, if any
data.drop_duplicates(inplace=True)

In [34]:
# Do a final check to check if we have any duplicates
print(data[data.duplicated()])
print('\n--------------------------------------------- \n')
# Check for duplicates
print("Number of duplicates:", data.duplicated().sum())

Empty DataFrame
Columns: [children, days_employed, dob_years, education, education_id, family_status, family_status_id, gender, income_type, debt, total_income, purpose]
Index: []

--------------------------------------------- 

Number of duplicates: 0


In [35]:
# Check the size of the dataset you now have after the first manipulation you performed
data.shape

(21454, 12)

The new dataset has 21,454 rows and 12 columns, while the initial dataset has 21,525 rows and 12 columns. This means that after performing the first data manipulation, we have deleted several rows of data that had duplicates. The percentage change in the number of rows is as follows:

In [36]:
# Number of starting lines
initial_row_count = 21525

# Number of rows after the first manipulation
current_row_count = 21454

# Calculates the percentage change
percentage_change = "{:.2f}".format(((initial_row_count - current_row_count) / initial_row_count) * 100)
print("Percentage change in number of rows:", percentage_change, "%")

Percentage change in number of rows: 0.33 %


After the first manipulation, we have reduced the number of rows by approximately 0.33%.

# Work with missing values

[To speed up work with some data, you may want to use a dictionary for some values that have IDs. Explain why and which dictionary you will use.]

In [37]:
# Find dictionary
# Dictionary for family_status_id
family_status_id_dict = {
     0: 'married',
     1: 'civil partnership',
     2: 'widow / widower',
     3: 'divorced',
     4: 'unmarried'
}

# Dictionary for education_id
education_id_dict = {
     0: "bachelor's degree",
     1: 'secondary education',
     2: 'some college',
     3: 'primary education',
     4: 'graduate degrees'
}

### Fixed missing value in `total income`

[Briefly explain which columns with missing values you need to work with. Explain how you will fix it.]


[Start by addressing the total value of lost revenue. Create age categories for customers. Create a new column containing age categories. This strategy can help to calculate the total revenue value.]

In [38]:
# Let's write a function to calculate age categories
def categorize_age(age):
    if age <= 30:
        return 'Young'
    elif 30 < age <= 45:
        return 'Middle-aged'
    else:
        return 'Elderly'
    

In [39]:
# Perform testing to see whether your function works or not
age_1 = 25
print (categorize_age(age_1))

age_2 = 31
print (categorize_age(age_2))

age_3 = 60
print (categorize_age(age_3))


Young
Middle-aged
Elderly


In [40]:
# Create a new column based on function
data['age_category'] = data['dob_years'].apply(categorize_age)

In [41]:
# Check how the values are in the new column
print(data.age_category.unique())


['Middle-aged' 'Elderly' 'Young']


[Think about the factors that income typically depends on. In the end, you will know whether you should use the mean or median to replace the missing values. To make this decision, you may want to look at the distribution of the factors you identified as impacting a person's income.]

[Create a table that contains only data without missing values. This data will be used to correct missing values.]

In [42]:
# Create a table with no missing values and display some of its rows to make sure everything works fine
# Create a table with no missing values
data_without_missing = data.dropna(subset=['total_income'])

# Display multiple rows from a table without missing values
data_without_missing.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_category
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,Middle-aged
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,Middle-aged
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,Middle-aged
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,Middle-aged
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,Elderly


In [43]:
# Pay attention to the average value for income based on the factors you have identified
# Pay attention to the average value of income based on age category ('age_category')
age_category_income_mean = data_without_missing.groupby('age_category')['total_income'].mean()

# Show the results
print("Average income by Age Category:")
print(age_category_income_mean)

Average income by Age Category:
age_category
Elderly        25605.151757
Middle-aged    28463.752600
Young          25817.674826
Name: total_income, dtype: float64


In [44]:
# Pay attention to the median value for income based on the factors you have identified
# Calculates median income by age category ('age_category')
age_category_income_median = data_without_missing.groupby('age_category')['total_income'].median()

# Displays the results
print("Median income by Age Category:")
print(age_category_income_median)

Median income by Age Category:
age_category
Elderly        22112.4450
Middle-aged    24750.7935
Young          22957.1850
Name: total_income, dtype: float64


The characteristics that most determine income are usually age, occupation (income_type), and level of education (education). In this analysis, we have categorized customers based on their age, and see that there are significant income differences between the age categories "Elderly," "Middle-aged," and "Young."

As a solution to replace the missing values in the 'total_income' column, I will use the median income by age category ('age_category'). The reason for using the median is because it is more resistant to outliers and extreme data that can affect the average. By using the median, we will take into account more conservatively the income characteristics of each age group.

Thus, we will replace the missing values in the 'total_income' column based on the age category of each customer, using the median income for the respective age category. This will provide a more realistic income estimate based on the customer's age group.

In [45]:
# Write the function we will use to fill in the missing values

# Calculate median income based on income_type
income_type_median = data.groupby('income_type')['total_income'].median()

# Function to fill in missing values
def fill_missing_income(row):
     income_type = row['income_type']
     median_income = income_type_median[income_type]
     if pd.isnull(row['total_income']):
         return median_income
     return row['total_income']

# Fill missing values with median based on income_type
data['total_income'] = data.apply(fill_missing_income, axis=1)

In [46]:
# Check how the values are in the new column
data.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_category
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,Middle-aged
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,Middle-aged
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,Middle-aged
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,Middle-aged
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,Elderly


In [47]:
# Apply the function to each row
# Define a function to fill in missing values
def fill_missing_income(row):
     income_type = row['income_type']
     median_income = income_type_median[income_type]
     if pd.isnull(row['total_income']):
         return median_income
     else:
         return row['total_income']

# Apply the function to each row in the 'total_income' column
data['total_income'] = data.apply(fill_missing_income, axis=1)

In [48]:
# Check if we get any errors
missing_income = data['total_income'].isnull().sum()
print("Number of missing values in column 'total_income':", missing_income)

Number of missing values in column 'total_income': 0


[After you are done with `total income`, check that the total number of values in this column matches the number of values in the other columns.]

In [49]:
# Check the number of entries in the column
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 21454 entries, 0 to 21524
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21454 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21454 non-null  int64  
 3   education         21454 non-null  object 
 4   education_id      21454 non-null  int64  
 5   family_status     21454 non-null  object 
 6   family_status_id  21454 non-null  int64  
 7   gender            21454 non-null  object 
 8   income_type       21454 non-null  object 
 9   debt              21454 non-null  int64  
 10  total_income      21454 non-null  float64
 11  purpose           21454 non-null  object 
 12  age_category      21454 non-null  object 
dtypes: float64(2), int64(5), object(6)
memory usage: 2.3+ MB


### Fixed value in `days_employed`

[Think about parameters that could help you fix the missing values in this column. In the end, you will know whether you should use the mean or median to replace the missing values. You may need to do the same research you did when correcting the data in the previous column.]

In [50]:
# Median distribution of `days_employed` based on the parameters you identified
# Calculate median 'days_employed' based on 'age_category'
days_employed_median = data.groupby('age_category')['days_employed'].median()

# Show the results
print("Median days_employed based on age_category:")
print(days_employed_median)

Median days_employed based on age_category:
age_category
Elderly        5861.808462
Middle-aged    1741.796379
Young          1046.095163
Name: days_employed, dtype: float64


In [51]:
# Average distribution of `days_employed` based on the parameters you identified
# Calculates the average of `days_employed` based on `age_category`
average_days_employed = data.groupby('age_category')['days_employed'].mean()

# Displays the results
print("Average distribution of `days_employed` by `age_category`:")
print(average_days_employed)

Average distribution of `days_employed` by `age_category`:
age_category
Elderly        149995.344181
Middle-aged      6890.874098
Young            2027.154893
Name: days_employed, dtype: float64


Based on the results of the mean and median distribution of `days_employed` based on `age_category`, there is a significant difference in the mean and median values, especially in the category "Elderly." The average value is much higher than the median value, which indicates the presence of outliers (extreme values) in the data. Outliers can significantly influence the average value.

To deal with outliers and ensure more stable results, it is better to use the median value rather than the average. The median is a central measure that is more resistant to outliers, so it better represents the central tendency of the data when extreme values are present.

So, in this case, it is better to use median to replace missing values in `days_employed` column based on `age_category`.

In [52]:
# Let's write a function that calculates the mean or median (depending on your decision) based on the parameters you identified
def calculate_median_days_employed(data, age_category):
    median = data[data['age_category'] == age_category]['days_employed'].median()
    return median

In [53]:
# Check if your function works
# Example of using the function
median_elderly = calculate_median_days_employed(data, 'Elderly')
median_middle_aged = calculate_median_days_employed(data, 'Middle-aged')
median_young = calculate_median_days_employed(data, 'Young')

# Show the results
print("Median days_employed for category 'age_category':")
print(f"Elderly: {median_elderly}")
print(f"Middle-aged: {median_middle_aged}")
print(f"Young: {median_young}")

Median days_employed for category 'age_category':
Elderly: 5861.808461725183
Middle-aged: 1741.7963788170914
Young: 1046.095163372449


In [54]:
# Apply function to income_type
def calculate_median_days_employed_by_income_type(data, income_type):
     # Filter data based on income_type
     filtered_data = data[data['income_type'] == income_type]
    
     # Calculate median days_employed
     median_days_employed = filtered_data['days_employed'].median()
    
     return median_days_employed

In [55]:
# Check if your function works
# Example of using the function
income_types = data['income_type'].unique()

print("Median days_employed based on income_type:")
for income_type in income_types:
     median_days_employed = calculate_median_days_employed_by_income_type(data, income_type)
     print(f"- {income_type}: {median_days_employed}")

Median days_employed based on income_type:
- employee: 1574.2028211070854
- retiree: 365213.3062657312
- business: 1547.3822226779334
- civil servant: 2689.3683533043886
- unemployed: 366413.65274420456
- entrepreneur: 520.8480834953765
- student: 578.7515535382181
- paternity / maternity leave: 3296.7599620220594


In [56]:
# Replace missing values
def fill_missing_days_employed(row):
     if pd.isnull(row['days_employed']):
         income_type = row['income_type']
         median_days_employed = calculate_median_days_employed_by_income_type(data, income_type)
         return median_days_employed
     else:
         return row['days_employed']

# Apply function to 'days_employed' column
data['days_employed'] = data.apply(fill_missing_days_employed, axis=1)

[After you are done with `total income`, check that the total number of values in this column matches the number of values in the other columns.]

In [57]:
# Check entries in all columns - make sure we fix all missing values
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 21454 entries, 0 to 21524
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21454 non-null  int64  
 1   days_employed     21454 non-null  float64
 2   dob_years         21454 non-null  int64  
 3   education         21454 non-null  object 
 4   education_id      21454 non-null  int64  
 5   family_status     21454 non-null  object 
 6   family_status_id  21454 non-null  int64  
 7   gender            21454 non-null  object 
 8   income_type       21454 non-null  object 
 9   debt              21454 non-null  int64  
 10  total_income      21454 non-null  float64
 11  purpose           21454 non-null  object 
 12  age_category      21454 non-null  object 
dtypes: float64(2), int64(5), object(6)
memory usage: 2.3+ MB


## Data categorization

[To answer questions and test hypotheses, you will work with categorized data. Look at the questions that are asked of you and that you have to answer. Think about which data needs to be categorized to answer these questions. Below, you will find a template that you can use to categorize data. The first process includes text data; the second discusses numerical data that needs to be categorized.

In [58]:
# Display the data values you selected for categorization
print(data['purpose'].unique())
print('\n---------------------------------------------\n')
print(data['total_income'].unique())

['purchase of the house' 'car purchase' 'supplementary education'
 'to have a wedding' 'housing transactions' 'education' 'having a wedding'
 'purchase of the house for my family' 'buy real estate'
 'buy commercial real estate' 'buy residential real estate'
 'construction of own property' 'property' 'building a property'
 'buying a second-hand car' 'buying my own car'
 'transactions with commercial real estate' 'building a real estate'
 'housing' 'transactions with my real estate' 'cars' 'to become educated'
 'second-hand car purchase' 'getting an education' 'car'
 'wedding ceremony' 'to get a supplementary education'
 'purchase of my own house' 'real estate transactions'
 'getting higher education' 'to own a car' 'purchase of a car'
 'profile education' 'university education'
 'buying property for renting out' 'to buy a car' 'housing renovation'
 'going to university']

---------------------------------------------

[40620.102 17932.802 23341.752 ... 14347.61  39054.888 13127.587]


[Let's check the unique value]

In [59]:
# Check unique values
data['purpose'].unique()

array(['purchase of the house', 'car purchase', 'supplementary education',
       'to have a wedding', 'housing transactions', 'education',
       'having a wedding', 'purchase of the house for my family',
       'buy real estate', 'buy commercial real estate',
       'buy residential real estate', 'construction of own property',
       'property', 'building a property', 'buying a second-hand car',
       'buying my own car', 'transactions with commercial real estate',
       'building a real estate', 'housing',
       'transactions with my real estate', 'cars', 'to become educated',
       'second-hand car purchase', 'getting an education', 'car',
       'wedding ceremony', 'to get a supplementary education',
       'purchase of my own house', 'real estate transactions',
       'getting higher education', 'to own a car', 'purchase of a car',
       'profile education', 'university education',
       'buying property for renting out', 'to buy a car',
       'housing renovation', 'going

In [60]:
# Let's write a function to categorize data based on general topics
# Create a function to perform categorization mapping
def categorize_purpose(purpose):
     if 'hous' in purpose:
         return 'housing'
     elif 'real' in purpose:
         return 'housing'
     elif 'propert' in purpose:
         return 'housing'
     elif 'educat' in purpose:
         return 'education'
     elif 'univers' in purpose:
         return 'education'
     elif 'car' in purpose:
         return 'cars'
     elif 'weddi' in purpose:
         return 'wedding'
     else:
         return 'Other'

# Using the apply function to create a new column 'purpose_category'
data['purpose_category'] = data['purpose'].apply(categorize_purpose)

# Show the results
print(data[['purpose', 'purpose_category']].head())
print('\n--------------------------------------------- \n')

purpose_distribution = data['purpose_category'].value_counts().reset_index()
purpose_distribution.columns = ['Purpose Category', 'Count']
print(purpose_distribution)

                   purpose purpose_category
0    purchase of the house          housing
1             car purchase             cars
2    purchase of the house          housing
3  supplementary education        education
4        to have a wedding          wedding

--------------------------------------------- 

  Purpose Category  Count
0          housing  10811
1             cars   4306
2        education   4013
3          wedding   2324


In [61]:
# Create a column containing categories and calculate the values
def categorize_income(total_income):
    if total_income <= 30000:
        return 'Low Income'
    elif 30000 < total_income <= 50000:
        return 'Moderate Income'
    elif 50000 < total_income <= 80000:
        return 'Middle Income'
    else:
        return 'High Income'


In [62]:
# View all numeric data in the column you selected for categorization
data.info()

# Change the data type of column 'days_employed' to integer
data['days_employed'] = data['days_employed'].astype(int)
# Change the data type of column 'dob_years' to integer
data['dob_years'] = data['dob_years'].astype(int)

print(data[['dob_years', 'total_income', 'days_employed']])

<class 'pandas.core.frame.DataFrame'>
Index: 21454 entries, 0 to 21524
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21454 non-null  int64  
 1   days_employed     21454 non-null  float64
 2   dob_years         21454 non-null  int64  
 3   education         21454 non-null  object 
 4   education_id      21454 non-null  int64  
 5   family_status     21454 non-null  object 
 6   family_status_id  21454 non-null  int64  
 7   gender            21454 non-null  object 
 8   income_type       21454 non-null  object 
 9   debt              21454 non-null  int64  
 10  total_income      21454 non-null  float64
 11  purpose           21454 non-null  object 
 12  age_category      21454 non-null  object 
 13  purpose_category  21454 non-null  object 
dtypes: float64(2), int64(5), object(7)
memory usage: 2.5+ MB
       dob_years  total_income  days_employed
0             42     40620.102      

In [63]:
# Get statistical conclusions for the column
print(data[['dob_years', 'total_income', 'days_employed']].describe())


          dob_years   total_income  days_employed
count  21454.000000   21454.000000   21454.000000
mean      43.473665   26451.212929   67058.782838
std       12.213507   15709.968189  139199.750945
min       19.000000    3306.762000      24.000000
25%       33.000000   17219.817250    1023.000000
50%       43.000000   22815.103500    1996.000000
75%       53.000000   31331.348000    5320.000000
max       75.000000  362496.645000  401755.000000


In [64]:
# Create a function that performs categorization into different numeric groups based on ranges

In [65]:
# Create a column containing categories
data['income_category'] = data['total_income'].apply(categorize_income)

In [66]:
# Calculate each category value to see its distribution
print(data['income_category'].describe())
print('\n---------------------------------------------\n')

income_distribution = data['income_category'].value_counts().reset_index()
income_distribution.columns = ['Income Category', 'Count']
print(income_distribution)

count          21454
unique             4
top       Low Income
freq           15534
Name: income_category, dtype: object

---------------------------------------------

   Income Category  Count
0       Low Income  15534
1  Moderate Income   4599
2    Middle Income   1099
3      High Income    222


In [67]:
## Checking hypotheses

**Is there a correlation between having children and the probability of defaulting on a loan?**

In [68]:
correlation = data['children'].corr(data['debt'])
print("Correlation between children and debt:", correlation)


Correlation between children and debt: 0.024686011603250502


In [69]:
# Check child data and loan default data
grouped_data = data.groupby('children')['debt'].agg(['count', 'sum'])
grouped_data['failure_percentage'] = (grouped_data['sum'] / grouped_data['count']) * 100

grouped_data['failure_percentage'] = grouped_data['failure_percentage'].apply(lambda x: f'{x:.2f}%')
print(grouped_data)

          count   sum failure_percentage
children                                
0         14091  1063              7.54%
1          4855   445              9.17%
2          2128   202              9.49%
3           330    27              8.18%
4            41     4              9.76%
5             9     0              0.00%


**Conclusion**

Based on the analysis of child data and loan repayment failure rates, here are some conclusions:

1. The correlation between the number of children and loan default is very weak, with a correlation coefficient of approximately 0.02. This suggests that there is little to no linear relationship between the number of children and the likelihood of default.

2. **Number of Children and Loan Default:** In the customer population, the majority of them do not have children (children = 0). The loan repayment failure rate in this group is around 7.54%, which is relatively low.

3. **Effect of Number of Children:** It is seen that the more children a customer has, the loan repayment failure rate tends to increase. The customer group with one child (children = 1) has a failure rate of around 9.17%, while the customer group with two children (children = 2) has a failure rate of around 9.49%.

4. **Case of the 3-5 Children Group:** Although the number of customers with three to five children is relatively smaller, it can be seen that the loan repayment failure rate is also quite significant, ranging from 8.18% to 9.76%. However, the group with five children (children = 5) did not have any cases of loan failure.

5. **Childless Group:** The group of customers without children (children = 0) has a lower default rate, which may be due to more resources available for loan repayment. However, this analysis still requires further statistical testing to ensure there is a significant relationship between the number of children and loan repayment failure.

These conclusions may provide initial insight into the relationship between the number of children and loan default rates. However, to make more accurate decisions, it is necessary to carry out more in-depth statistical analysis and hypothesis testing.s testing.

**Is there a correlation between family status and the probability of defaulting on a loan?**

In [70]:
correlation = data['family_status_id'].corr(data['debt'])
print("Correlation between family_status_id and debt:", correlation)


Correlation between family_status_id and debt: 0.020346683082729498


In [71]:
# Check family status data and loan default data

# Calculate the percentage of default based on family status
grouped_data = data.groupby('family_status')['debt'].agg(['count', 'sum'])
grouped_data['failure_percentage'] = (grouped_data['sum'] / grouped_data['count']) * 100

grouped_data['failure_percentage'] = grouped_data['failure_percentage'].apply(lambda x: f'{x:.2f}%')
print(grouped_data)


                   count  sum failure_percentage
family_status                                   
civil partnership   4151  388              9.35%
divorced            1195   85              7.11%
married            12339  931              7.55%
unmarried           2810  274              9.75%
widow / widower      959   63              6.57%


Based on the analysis of the correlation between 'family_status_id' and loan default, as well as the percentage of loan default based on family status, the following conclusions can be drawn:

1. The correlation between 'family_status_id' and loan default is very weak, with a correlation coefficient of approximately 0.02. This suggests that there is little to no linear relationship between 'family_status_id' and the likelihood of default.

2. When examining the percentage of loan default based on different family statuses, the data shows variation in default percentages among different family status categories:

- "Civil partnership" and "unmarried" categories have the highest default percentages at 9.35% and 9.75%, respectively.
- The "divorced" category has the lowest default percentage at 7.11%.
- The "widow / widower" category also exhibits a relatively low default percentage of 6.57%.
- The "married" category falls in between, with a default percentage of 7.55%.


In conclusion, while there is a slight variation in default percentages based on family status, the correlation is very weak. This indicates that other factors may play a more significant role in predicting loan default behavior. Nonetheless, these insights can still be valuable for making lending and credit risk assessment decisions based on family status.ly status.

In [72]:
grouped_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, civil partnership to widow / widower
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   count               5 non-null      int64 
 1   sum                 5 non-null      int64 
 2   failure_percentage  5 non-null      object
dtypes: int64(2), object(1)
memory usage: 160.0+ bytes


**Conclusion**

Based on the provided data, we can draw the following conclusions regarding the loan default percentage based on family status:

1. The highest default percentage is among individuals in the "unmarried" category, with a failure percentage of 9.75%.

2. "Civil partnership" and "unmarried" categories have relatively higher default percentages, both above 9%.

3. The "divorced" category has the lowest default percentage at 7.11%.

4. The "widow / widower" category also has a relatively low default percentage of 6.57%.

5. The "married" category falls in between, with a default percentage of 7.55%.

These findings suggest that individuals in the "married" and "divorced" categories have lower default rates compared to those in the "civil partnership" and "unmarried" categories. Additionally, the "widow / widower" category shows a relatively low default rate. These insights can be valuable for making lending and credit risk assessment decisions based on family status.

**Is there a correlation between income level and the probability of defaulting on a loan?**

In [73]:
# Check income level data and loan default data
# Calculate default percentage based on income level

grouped_data = data.groupby('income_category')['debt'].agg(['count', 'sum'])
grouped_data['failure_percentage_by_income'] = (grouped_data['sum'] / grouped_data['count']) * 100

# Format the 'failure_percentage_by_income' column with two decimal digits and add the '%' sign
grouped_data['failure_percentage_by_income'] = grouped_data['failure_percentage_by_income'].apply(lambda x: '{:.2f}%'.format(x))

print(grouped_data)

                 count   sum failure_percentage_by_income
income_category                                          
High Income        222    14                        6.31%
Low Income       15534  1305                        8.40%
Middle Income     1099    78                        7.10%
Moderate Income   4599   344                        7.48%


**Conclusion**

Based on data analysis of family status and loan repayment failure rates, the following are several conclusions:

1. **Marital Status and Loan Failure:** In the customer population, the majority of them are married couples. The loan repayment failure rate in this group is around 7.55%, which is relatively low.

2. **Effect of Marital Status:** It is seen that marital status influences the rate of loan repayment failure. The customer group with "civil partnership" status has a failure rate of around 9.35%, while the "unmarried" group has a failure rate of around 9.75%. The "divorced" group had a lower failure rate, which was around 7.11%, and the "widow/widow" group had the lowest failure rate, which was around 6.57%.

3. **General Conclusion:** In this case, it appears that marital status can influence the loan repayment failure rate, with higher failure rates in the "civil partnership" and "unmarried" groups. However, other factors also need to be considered and analyzed further to better understand the influence of family status on loan repayment failure.

These conclusions provide an initial understanding of the relationship between family status and loan default rates. However, further analysis and statistical testing may be needed to confirm these findings.

**How do credit goals affect default percentage?**

In [74]:
# Check the default rate percentage for each credit objective and perform analysis
grouped_data = data.groupby('purpose_category')['debt'].agg(['count', 'sum'])
grouped_data['failure_percentage_by_purpose'] = (grouped_data['sum'] / grouped_data['count']) * 100

# Format the 'failure_percentage_by_purpose' column with two decimal digits and add the '%' sign
grouped_data['failure_percentage_by_purpose'] = grouped_data['failure_percentage_by_purpose'].apply(lambda x: '{:.2f}%'.format(x))

print(grouped_data)

                  count  sum failure_percentage_by_purpose
purpose_category                                          
cars               4306  403                         9.36%
education          4013  370                         9.22%
housing           10811  782                         7.23%
wedding            2324  186                         8.00%


**Conclusion**

In this analysis, we evaluate the loan repayment failure rate based on credit objectives (purpose_category). The following are the conclusions of the analysis:

1. **Credit Objectives and Default Rates:** There are four categories of credit objectives evaluated: "cars," "education," "housing," and "wedding."

2. **Highest Failure Rate:** The credit destination groups with the highest loan repayment failure rates are "cars" and "education," with failure rates of approximately 9.36% and 9.22%, respectively. This means that people who take out loans to buy a car or for education tend to have a higher risk of default.

3. **Lowest Failure Rate:** The credit destination groups with the lowest loan repayment failure rates are "housing" and "wedding," with default rates of approximately 7.23% and 8.00%, respectively. This suggests that people who borrow for a home purchase or for a wedding tend to have a lower risk of default.

4. **General Conclusion:** While there are differences in default rates based on credit goals, other factors such as income, employment status, and personal financial condition can also influence a person's likelihood of defaulting on a loan. Therefore, it is important to consider these factors simultaneously in credit risk analysis.

5. **Recommendation:** In banking practice, this information can help financial institutions make better decisions in providing loans. For example, they may consider adjusting interest rates or loan terms based on a customer's credit goals to reduce the risk of default.

These conclusions are preliminary results of data analysis and can be the basis for further research and deeper decision making in the banking industry.

# General conclusion


In this data analysis, we evaluate several important data aspects that include the relationship between the number of children, family status, and loan purpose on loan repayment failure rates. In addition, we also addressed the issue of missing data and identified duplicate data. Here are the general conclusions:

1. **Handling Missing Data:** The data contains a number of missing values, especially in the columns "days_employed" and "total_income." To address this issue, we fill in the missing values in the "total_income" column with median income by income type (income_type). In addition, we also filled in missing values in the "days_employed" column with the median number of days worked by income type (income_type). This helps maintain data integrity and avoids bias that can arise from deleting rows with missing values.

2. **Duplicate Data:** In the data analysis process, we found that there were around 0.33% duplicate data in the dataset. We recommend removing this duplicate data so that it does not affect the analysis results.

General conclusion:

- The correlation between 'family_status_id' 'children' and loan default is very weak, with a correlation coefficient of approximately 0.02
- Loan repayment failure rates tend to be higher for individuals with more than 2 children, with the highest rates in groups with 4 children.
- Repayment failure rates also vary by family status, with the "civil partnership" and "unmarried" groups having higher failure rates.
- Credit goals also have an influence on failure rates, with car purchases and education having the highest failure rates, while home purchases have lower failure rates.

We recommend that financial institutions use the results of this analysis to understand different credit risks and make wiser decisions in lending to customers. Further analysis and statistical modeling can be used to understand other factors that influence credit risk in more depth. more depth.