### Exploring Credit Risks

This activity is another open exploration of a dataset using both cleaning methods and visualizations.  The data describes customers as good or bad credit risks based on a small set of features specified below.  Your task is to create a Jupyter notebook with an exploration of the data using both your `pandas` cleaning and analysis skills and your visualization skills using `matplotlib`, `seaborn`, and `plotly`.  Your final notebook should be formatted with appropriate headers and markdown cells with written explanations for the code that follows.

Post your notebook file in Canvas, as well as a brief (3-4 sentence) description of what you found through your analysis. Respond to your peers with reflections on their analysis.

-----


##### Data Description

```
1. Status of existing checking account, in Deutsche Mark.
2. Duration in months
3. Credit history (credits taken, paid back duly, delays, critical accounts)
4. Purpose of the credit (car, television,...)
5. Credit amount
6. Status of savings account/bonds, in Deutsche Mark.
7. Present employment, in number of years.
8. Installment rate in percentage of disposable income
9. Personal status (married, single,...) and sex
10. Other debtors / guarantors
11. Present residence since X years
12. Property (e.g. real estate)
13. Age in years
14. Other installment plans (banks, stores)
15. Housing (rent, own,...)
16. Number of existing credits at this bank
17. Job
18. Number of people being liable to provide maintenance for
19. Telephone (yes,no)
20. Foreign worker (yes,no)
```

In [24]:
import pandas as pd
import plotly.express as px
import seaborn as sns

In [3]:
df = pd.read_csv('/content/dataset_31_credit-g.csv')


In [None]:
df.head(3)

Unnamed: 0,checking_status,duration,credit_history,purpose,credit_amount,savings_status,employment,installment_commitment,personal_status,other_parties,...,property_magnitude,age,other_payment_plans,housing,existing_credits,job,num_dependents,own_telephone,foreign_worker,class
0,'<0',6,'critical/other existing credit',radio/tv,1169,'no known savings','>=7',4,'male single',none,...,'real estate',67,none,own,2,skilled,1,yes,yes,good
1,'0<=X<200',48,'existing paid',radio/tv,5951,'<100','1<=X<4',2,'female div/dep/mar',none,...,'real estate',22,none,own,1,skilled,1,none,yes,bad
2,'no checking',12,'critical/other existing credit',education,2096,'<100','4<=X<7',2,'male single',none,...,'real estate',49,none,own,1,'unskilled resident',2,none,yes,good
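Several text values in the preview above appear wrapped in single quotes (for example `'<0'` and `'male single'`). A minimal cleaning sketch, assuming those quotes are literal characters in the CSV; the `toy` frame and the `strip_quotes` helper are hypothetical stand-ins, not part of the original notebook:

```python
import pandas as pd

# Toy frame mimicking the quoted values seen in df.head(); not the real data.
toy = pd.DataFrame({
    "checking_status": ["'<0'", "'0<=X<200'", "'no checking'"],
    "purpose": ["radio/tv", "radio/tv", "education"],
    "duration": [6, 48, 12],
})

def strip_quotes(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with surrounding single quotes removed from text columns."""
    out = frame.copy()
    for col in out.columns:
        # Only touch text columns; numeric columns are left as they are.
        if pd.api.types.is_string_dtype(out[col]):
            out[col] = out[col].str.strip("'")
    return out

cleaned = strip_quotes(toy)
print(cleaned["checking_status"].tolist())
```

On the real data the same helper could be applied as `df = strip_quotes(df)` before plotting, so category labels render without the extra quotes.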


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   checking_status         1000 non-null   object
 1   duration                1000 non-null   int64 
 2   credit_history          1000 non-null   object
 3   purpose                 1000 non-null   object
 4   credit_amount           1000 non-null   int64 
 5   savings_status          1000 non-null   object
 6   employment              1000 non-null   object
 7   installment_commitment  1000 non-null   int64 
 8   personal_status         1000 non-null   object
 9   other_parties           1000 non-null   object
 10  residence_since         1000 non-null   int64 
 11  property_magnitude      1000 non-null   object
 12  age                     1000 non-null   int64 
 13  other_payment_plans     1000 non-null   object
 14  housing                 1000 non-null   object
 15  exist

In [8]:
df.describe()

Unnamed: 0,duration,credit_amount,installment_commitment,residence_since,age,existing_credits,num_dependents
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,20.903,3271.258,2.973,2.845,35.546,1.407,1.155
std,12.058814,2822.736876,1.118715,1.103718,11.375469,0.577654,0.362086
min,4.0,250.0,1.0,1.0,19.0,1.0,1.0
25%,12.0,1365.5,2.0,2.0,27.0,1.0,1.0
50%,18.0,2319.5,3.0,3.0,33.0,1.0,1.0
75%,24.0,3972.25,4.0,4.0,42.0,2.0,1.0
max,72.0,18424.0,4.0,4.0,75.0,4.0,2.0


- The average credit duration is about 21 months with a standard deviation of about 12 months, so loan terms vary widely (from 4 to 72 months).
- The average age is about 35.5 years with a moderate spread (standard deviation of about 11 years), so the customers are relatively young to middle-aged.
- The average credit amount is about 3,271 with a standard deviation of about 2,823, so the amount lent differs widely from customer to customer.
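One way to compare the spread of columns measured on different scales is the coefficient of variation (standard deviation divided by mean). The sketch below uses only the means and standard deviations copied from the `df.describe()` output; the `summary` frame is an illustrative construct, not something the notebook built:

```python
import pandas as pd

# Means and standard deviations copied from the df.describe() output above.
summary = pd.DataFrame(
    {
        "mean": [20.903, 3271.258, 35.546],
        "std": [12.058814, 2822.736876, 11.375469],
    },
    index=["duration", "credit_amount", "age"],
)

# Coefficient of variation: spread relative to the size of the mean.
summary["cv"] = summary["std"] / summary["mean"]
print(summary.round(2))
```

On the real DataFrame the same ratio could be computed directly with `df.std(numeric_only=True) / df.mean(numeric_only=True)`.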


Let's check the column names.

In [7]:
df.columns

Index(['checking_status', 'duration', 'credit_history', 'purpose',
       'credit_amount', 'savings_status', 'employment',
       'installment_commitment', 'personal_status', 'other_parties',
       'residence_since', 'property_magnitude', 'age', 'other_payment_plans',
       'housing', 'existing_credits', 'job', 'num_dependents', 'own_telephone',
       'foreign_worker', 'class'],
      dtype='object')

Let's check whether there are any missing values.

In [14]:
df.isnull().sum()

checking_status           0
duration                  0
credit_history            0
purpose                   0
credit_amount             0
savings_status            0
employment                0
installment_commitment    0
personal_status           0
other_parties             0
residence_since           0
property_magnitude        0
age                       0
other_payment_plans       0
housing                   0
existing_credits          0
job                       0
num_dependents            0
own_telephone             0
foreign_worker            0
class                     0
dtype: int64
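`isnull()` only counts true `NaN` values. The preview earlier shows categories such as `no checking`, `no known savings`, and `none`, which may act as "unknown" placeholders. A hedged sketch for counting them follows; whether they should be treated as missing is a judgement call, and the `toy` frame and `placeholders` list are assumptions made for illustration:

```python
import pandas as pd

# Toy rows using category labels visible in the df.head() preview; not the real data.
toy = pd.DataFrame({
    "checking_status": ["<0", "no checking", "0<=X<200", "no checking"],
    "savings_status": ["no known savings", "<100", "<100", "no known savings"],
    "other_parties": ["none", "none", "none", "none"],
})

# Labels that might stand in for "unknown" or "not applicable" (an assumption).
placeholders = ["no checking", "no known savings", "none"]

# Count placeholder-style entries per column.
hidden_missing = toy.isin(placeholders).sum()
print(hidden_missing)
```

If the real values still carry literal single quotes, the placeholder strings would need the same quotes (or the quotes would need to be stripped first) for the match to work.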

Let's check whether the DataFrame has duplicated rows.

In [16]:
df.duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
995    False
996    False
997    False
998    False
999    False
Length: 1000, dtype: bool
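The truncated boolean output above shows only the first and last five rows, so it cannot confirm that every row is unique. Summing the mask gives a single count instead. A small sketch on a toy frame (the values are illustrative, not the real data):

```python
import pandas as pd

# Toy frame in which the third row repeats the first.
toy = pd.DataFrame({
    "duration": [6, 48, 6],
    "credit_amount": [1169, 5951, 1169],
})

# True marks a row that repeats an earlier row; the sum is the duplicate count.
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)

# Drop the repeats, keeping the first occurrence of each row.
deduped = toy.drop_duplicates()
```

On the notebook's data, `df.duplicated().sum()` would return 0 if there are no duplicated rows.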

In [65]:
px.histogram(df, x='purpose', color='class', title='Bar Graph of Purpose of Lending across Class')

Overall, regardless of the purpose of the loan, customers with good credit outnumber customers with bad credit, in line with the dataset's overall split of 700 good to 300 bad.
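The balance of good and bad credit within each purpose is easier to read from a cross-tabulation than from stacked bars. A minimal sketch with `pd.crosstab`, using a made-up `toy` frame in place of the real data:

```python
import pandas as pd

# Toy rows; the purposes are real category names but the counts are invented.
toy = pd.DataFrame({
    "purpose": ["radio/tv", "radio/tv", "education",
                "radio/tv", "education", "education"],
    "class": ["good", "bad", "good", "good", "bad", "bad"],
})

# Raw counts of each class per purpose.
counts = pd.crosstab(toy["purpose"], toy["class"])

# Row-normalized shares, so each purpose sums to 1.
shares = pd.crosstab(toy["purpose"], toy["class"], normalize="index")
print(counts)
print(shares.round(2))
```

On the notebook's data the call would be `pd.crosstab(df['purpose'], df['class'], normalize='index')`.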

Let's observe credit amount and credit class

In [39]:
# Color by class so the two credit-amount distributions can be compared
px.histogram(df, x='credit_amount', color='class', title='Credit Amount by Credit Class')

This intuitively makes sense: the bank places more trust in customers with good credit and is therefore willing to lend them more.
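A claim about which class borrows more can be checked numerically with a grouped summary. The sketch below is a minimal example on invented rows; the `toy` frame is an assumption and its numbers say nothing about the real result:

```python
import pandas as pd

# Toy rows with invented amounts; not the real data.
toy = pd.DataFrame({
    "class": ["good", "good", "bad", "bad", "good"],
    "credit_amount": [1169, 2096, 5951, 4870, 7882],
})

# Count, median, and mean credit amount for each credit class.
by_class = toy.groupby("class")["credit_amount"].agg(["count", "median", "mean"])
print(by_class)
```

Running `df.groupby('class')['credit_amount'].agg(['count', 'median', 'mean'])` on the notebook's data would show whether the chart's impression holds up.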

Let's observe the relationship between duration and credit amount across credit classes.

In [40]:
fig = px.scatter(df, x='duration', y='credit_amount', color='class',
                 title='Relationship between Duration and Credit Amount by Credit Class')
fig.show()


1. First, there seems to be a moderate positive correlation between duration and credit amount.
  - When the bank lends for a longer duration, the amount it lends tends to increase as well.
2. Second, across durations, bad-credit customers seem to cover a wider range than good-credit customers.
  - Even though there is a positive correlation, the spread of the bad-credit class also tends to increase with credit amount.
    - This may indicate that, over a longer term, these customers encounter financial difficulties that affect their ability to repay.

Let's now observe the correlations between the numeric features.

In [53]:
# Correlations are only defined for numeric columns, so skip the text columns
corr_matrix = df.corr(numeric_only=True)

# Create a heatmap
px.imshow(corr_matrix, text_auto=True, aspect='auto',
          title='Heatmap of Feature Correlation')





Credit amount has its highest correlation with credit duration. Consistent with the analysis above, the longer the loan term, the larger the amount lent.
Another surprising insight is that age and credit amount have nearly zero correlation. I expected that as age increases, people tend to save or earn more, so the bank would lend them more, but the data shows no such relationship.
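The strongest pairs can also be listed directly instead of being read off the heatmap. A minimal sketch, assuming pandas 1.5 or later for `numeric_only`; the `toy` frame is invented and stands in for `df`:

```python
import numpy as np
import pandas as pd

# Toy rows with invented values; not the real data.
toy = pd.DataFrame({
    "duration": [6, 48, 12, 42, 24],
    "credit_amount": [1169, 5951, 2096, 7882, 4870],
    "age": [67, 22, 49, 45, 53],
    "class": ["good", "bad", "good", "good", "bad"],
})

# Text columns such as 'class' are excluded from the correlation matrix.
corr = toy.corr(numeric_only=True)

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank pairs by absolute correlation, strongest first.
pairs = pairs.sort_values(key=abs, ascending=False)
print(pairs)
```

Ranking by absolute value keeps strong negative correlations near the top as well, which a quick glance at a heatmap can miss.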

In [64]:
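# Bars stack every row, so each bar shows the total credit amount per group
px.bar(df, x='personal_status', y='credit_amount', color='class',
       title='Total Credit Amount by Personal Status and Credit Class')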

Let's look more closely at whether the bank lends more to women than to men when both have a divorced status.

In [63]:
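# Filter the data for divorced men and women.
# r'\bmale div' avoids also matching 'female div...', which contains 'male div'.
divorced_men = df[df['personal_status'].str.contains(r'\bmale div')]
# Note: the female category in the data is 'female div/dep/mar', which is broader than divorced only.
divorced_women = df[df['personal_status'].str.contains('female div')]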

# Calculate summary statistics or use the entire subset for visualization
# For example, to calculate the median credit amount for divorced men and women
median_credit_men = divorced_men['credit_amount'].median()
median_credit_women = divorced_women['credit_amount'].median()

# Create a bar plot to compare the median credit amounts
px.bar(
    x=['Divorced Men', 'Divorced Women'],
    y=[median_credit_men, median_credit_women],
    title='Comparison of Median Credit Amount Between Divorced Men and Women',
    labels={'x': 'Group', 'y': 'Median Credit Amount'}
)

This intuitively makes sense: when customers go through a divorce, their finances may become unstable, so lending to divorced customers is risky whether they are female or male.
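One thing to watch with substring filters on `personal_status`: the label `female div/dep/mar` contains the text `male div`, so a plain substring match for divorced men also picks up women. The sketch below shows the pitfall and one possible fix using a word boundary. The labels `male div/sep` and `male mar/wid` are assumed category names, since only `male single` and `female div/dep/mar` appear in the preview:

```python
import pandas as pd

# Category labels; 'male div/sep' and 'male mar/wid' are assumed names.
status = pd.Series([
    "male div/sep",
    "female div/dep/mar",
    "male single",
    "male mar/wid",
])

# Plain substring match: also flags the female category.
naive = status.str.contains("male div", regex=False)

# Word-boundary match: 'male' must start a word, so 'female' is excluded.
anchored = status.str.contains(r"\bmale div")

print(naive.tolist())
print(anchored.tolist())
```

It is also worth checking `len(divorced_men)` and `len(divorced_women)` before comparing medians, since a median over a small group is unstable.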

# Conclusion:
1. When you have good credit, the bank tends to lend you more.
2. If a customer has a risky status, the bank is unlikely to lend them more.
3. When loans have a longer duration, the amount lent tends to increase as well.
4. When you are a single male, your credit amount tends to be higher. Intuitively, this is probably because a single customer has no marital obligations and is therefore less of a financial risk, works independently, and has more potential for career growth. The bank sees this as an investment and therefore lends more to customers who are single.
5. Lastly, the most common purpose for borrowing is to buy a radio/TV. This surprised me: the second most common purpose is buying a car, which makes sense since young males love cars, but borrowing to buy a radio or TV is interesting.