The subject of this analysis was a Credit Scoring dataset describing more than 20,000 customers who took out a loan from a bank. The dataset contains information about their age, gender, education, work experience, family and financial situation, and loan purposes. The project's main goal was to find connections between specific customer features and on-time loan repayment.
In the first step of the project, I obtained general information about the dataset and its variables and identified the main issues that needed to be fixed before the analysis. Then I preprocessed the dataset and prepared it for exploratory analysis, addressing all the flaws discovered at the previous stage. Finally, I identified the riskiest category of customers by analyzing the connections between repaying a loan on time and the customer's number of children, marital status, income level, and loan purpose.
The dataset is a single CSV file, credit_scoring_eng.csv.
Data Dictionary:
- children - the number of children in the family
- days_employed - how long the customer has been working
- dob_years - the customer’s age
- education - the customer’s education level
- education_id - an identifier for the customer’s education
- family_status - the customer’s marital status
- family_status_id - an identifier for the customer’s marital status
- gender - the customer’s gender
- income_type - the customer’s income type
- debt - whether the customer has ever defaulted on a loan
- total_income - the customer’s income
- purpose - a reason for taking out a loan
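As a rough illustration, the column names in the data dictionary can be used to sanity-check a loaded table before any analysis. The sketch below is not taken from the project notebook: the helper `missing_columns` and the empty `sample` frame are hypothetical, and in practice the frame would come from `pd.read_csv("credit_scoring_eng.csv")`.

```python
import pandas as pd

# Column names copied from the data dictionary above.
EXPECTED_COLUMNS = [
    "children", "days_employed", "dob_years", "education", "education_id",
    "family_status", "family_status_id", "gender", "income_type", "debt",
    "total_income", "purpose",
]


def missing_columns(df):
    """Return the expected columns that are absent from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


# Empty made-up frame standing in for the real CSV, so the sketch runs without the file.
sample = pd.DataFrame({col: [] for col in EXPECTED_COLUMNS})

print(missing_columns(sample))                         # no columns missing
print(missing_columns(sample.drop(columns=["debt"])))  # reports the dropped column
```

A check like this catches renamed or dropped columns early, before they surface as confusing errors in later steps.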
The following questions were investigated:
- Is there a connection between having kids and repaying a loan on time?
- Is there a connection between marital status and repaying a loan on time?
- Is there a connection between income level and repaying a loan on time?
- How do different loan purposes affect on-time loan repayment?
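One simple way to approach questions like these is to compare the share of customers with a past default across categories. The sketch below is an illustration only, not the notebook's actual code: it assumes `debt` is encoded as 0 (no default) and 1 (default), the helper `default_rate_by` is hypothetical, and the `toy` rows are made up.

```python
import pandas as pd


def default_rate_by(df, column):
    """Share of customers with debt == 1 in each category of `column`.

    Assumes `debt` is a 0/1 indicator, so its mean is the default rate.
    """
    return df.groupby(column)["debt"].mean().sort_values(ascending=False)


# Made-up rows for illustration; these are not values from the real dataset.
toy = pd.DataFrame({
    "children": [0, 0, 0, 0, 1, 1, 2, 2],
    "debt":     [0, 0, 0, 1, 0, 1, 1, 1],
})

rates = default_rate_by(toy, "children")
print(rates)
```

The same helper could be called with `family_status` or `purpose` as the grouping column. A continuous column such as `total_income` would first need to be binned into income levels.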
Tools used:
- Jupyter Notebook
- Python with the following libraries and modules:
  - NumPy
  - pandas
  - Matplotlib
  - Seaborn
  - SciPy
  - NLTK