Chukwuemeka Okoli
Practicum by Yandex Project 1
April 2, 2021
Project description
Your project is to prepare a report for a bank’s loan division. You’ll need to find out if a customer’s marital status and number of children have an impact on whether they will default on a loan. The bank already has some data on customers’ credit worthiness.
Your report will be considered when building a credit score for a potential customer. A credit score is used to evaluate the ability of a potential borrower to repay their loan.
Guiding Question
Why do borrowers default on timely loan repayment?
The objective of this project is to:
- Prepare a report for a bank's loan division by analyzing a borrower's risk of defaulting.
- Apply Data Preprocessing to a real-life analytical case study.
The customers' credit worthiness data is a real-life analytical case study provided by Practicum by Yandex. As data scientists, we are to prepare a report for a client by analyzing the client's customers and their risk of defaulting on a loan. Various data preprocessing steps were applied before analyzing the borrowers' risk of defaulting on a loan. The insight generated from this report is to be used when building a credit score for a potential customer.
Description of the data
- children: the number of children in the family
- days_employed: how long the customer has been working
- dob_years: the customer’s age
- education: the customer’s education level
- education_id: identifier for the customer’s education
- family_status: the customer’s marital status
- family_status_id: identifier for the customer’s marital status
- gender: the customer’s gender
- income_type: the customer’s income type
- debt: whether the customer has ever defaulted on a loan
- total_income: monthly income
- purpose: reason for taking out a loan
- Python
- Jupyter Notebook
- Pandas
- Numpy
- Matplotlib
- Seaborn
- NLTK
- WordNetLemmatizer
- SnowballStemmer
- Open the data file and have a look at the general information
- Data preprocessing
- Processing missing values
- Data type replacement
- Processing duplicates
- Categorizing Data
- Answer the business question
- Is there a connection between having kids and repaying a loan on time?
- Is there a connection between marital status and repaying a loan on time?
- Is there a connection between income level and repaying a loan on time?
- How do different loan purposes affect on-time loan repayment?
- Conclusion
Introduction
In every business, understanding your customers' credit worthiness is important for assessing a customer's value to the business. It later forms a basis for measuring essential business metrics such as sales revenue, customer acquisition costs, estimated customer lifetime value, and customer churn. In this project, the bank’s loan division is trying to find out if a customer’s marital status and number of children have an impact on whether they will default on a loan. The goal is to apply data preprocessing and analytics in order to determine customers’ credit worthiness. The insight obtained from this project will enable the bank to determine the estimated customer lifetime value, and will be useful when building a credit score for a potential customer.
Methods
To accomplish this, I first inspected the data using the pandas library to obtain general information about it. I processed the missing values, changed data types, and processed duplicates. Next, I categorized the data and prepared it for further analysis. To carry out lemmatization on the purpose column, I used the WordNetLemmatizer and SnowballStemmer to extract the frequency of words in that column. I then encoded the categorical variables. In analyzing the data, I prepared several pivot tables and plotted visualizations using the Matplotlib and Seaborn libraries. The pivot table analysis was important in answering several of the business questions.
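As a rough illustration of the stemming step on the purpose column, the sketch below maps free-text loan purposes to a few broad categories with NLTK's SnowballStemmer. The stem-to-category mapping, the category names, the categorize_purpose helper, and the example strings are assumptions made for this sketch, not the notebook's actual code.

```python
from nltk.stem import SnowballStemmer

# Hypothetical mapping from word stems to loan-purpose categories.
# The categories used in the original notebook may differ.
STEM_TO_CATEGORY = {
    "car": "car",
    "educ": "education",
    "univers": "education",
    "wed": "wedding",
    "hous": "real estate",
    "estat": "real estate",
    "properti": "real estate",
}

stemmer = SnowballStemmer("english")


def categorize_purpose(purpose: str) -> str:
    """Return the category of the first word whose stem is in the mapping."""
    for word in purpose.lower().split():
        category = STEM_TO_CATEGORY.get(stemmer.stem(word))
        if category is not None:
            return category
    return "other"


# Illustrative purpose strings, not taken from the bank's data.
for text in ["wedding ceremony", "purchase of the house", "buying a second-hand car"]:
    print(text, "->", categorize_purpose(text))
```

Splitting on whitespace keeps the sketch free of tokenizer downloads; a real pipeline might tokenize more carefully and count stem frequencies before choosing the categories.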
Key Findings
I created a visualization of my findings using the Seaborn library.
The following are the key findings from this analysis:
- People with more than 5 and up to 20 children are ~37% more likely to be in debt than people with no children, so there is a relationship between having children and repaying a loan on time. People with children are more likely to default on loan repayment.
- Unmarried people with up to 4 children and divorced people with up to 20 children are ~75% more likely to be in debt than people of any other family status, and about 80% more likely to be in debt than people with 3 or fewer children.
- Unmarried people are more than 2% more likely to be in debt than married people. Widows and widowers are less likely to be in debt than any other group. This means unmarried people are more likely to default on loan repayment.
- There is no correlation between income level and defaulting on loan payment.
- People requesting a loan for a car purchase or for education are the most likely to default on loan repayment. People requesting a loan for a house purchase repay on time more often than any other category.
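Findings like those above come from comparing default rates across groups. A minimal sketch of that comparison with a pandas pivot table follows; the rows are made up for illustration and do not reproduce the reported percentages, and the children_group column is a name introduced only for this example.

```python
import pandas as pd

# Made-up rows for illustration only; these are not the bank's data.
df = pd.DataFrame({
    "family_status": ["married", "married", "unmarried",
                      "unmarried", "widow / widower", "married"],
    "children": [0, 2, 0, 1, 0, 1],
    "debt": [0, 1, 1, 0, 0, 0],  # 1 = has defaulted, 0 = has not
})

# Hypothetical grouping column: customers with and without children.
df["children_group"] = (df["children"] > 0).map(
    {True: "has children", False: "no children"}
)

# The mean of a 0/1 column is the share of customers who defaulted.
rate_by_status = df.pivot_table(index="family_status", values="debt", aggfunc="mean")
rate_by_children = df.pivot_table(index="children_group", values="debt", aggfunc="mean")

print(rate_by_status)
print(rate_by_children)
```

Because debt is coded as 0 or 1, taking the mean gives the default rate directly, which makes groups of different sizes comparable.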
Deployment and Application
I plan on future deployment using Amazon Web Services. The goal is to extend the application of the project to multiple customers via web services.
Future Development
For future development, I will be working on better visualization and statistical analysis to improve how the client determines customer acquisition costs. I will also be working on predicting customer churn for the client using machine learning. Future machine learning models will be put into production and deployed via a web app.
- Applied strategies for dealing with missing values.
- Converted data from one type to another.
- Identified duplicate data and processed it in several different ways.
- Categorized data.
- Exported final data into pivot tables.
- Queried and used pivot table for data manipulation and interpretation.
- Created visualizations using insights from pivot table.
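The first three steps in the list above can be sketched in pandas as follows. The rows are made up, and the fill strategy (median income within the same income_type, falling back to the overall median) is an assumption chosen for this sketch rather than the approach confirmed in the notebook.

```python
import pandas as pd

# Made-up rows for illustration only; these are not the bank's data.
raw = pd.DataFrame({
    "children": [1, 0, 1, 2],
    "education": ["Bachelor's Degree", "secondary education",
                  "bachelor's degree", "SECONDARY EDUCATION"],
    "income_type": ["employee", "retiree", "employee", "employee"],
    "total_income": [40000.7, None, 40000.7, 25000.2],
})

clean = raw.copy()

# Normalize letter case so that equal education levels compare as equal.
clean["education"] = clean["education"].str.lower()

# Missing values (assumed strategy): fill with the median income of the same
# income_type, then with the overall median if the whole group is missing.
group_median = clean.groupby("income_type")["total_income"].transform("median")
overall_median = clean["total_income"].median()
clean["total_income"] = clean["total_income"].fillna(group_median).fillna(overall_median)

# Data type replacement: convert income from float to integer (truncates decimals).
clean["total_income"] = clean["total_income"].astype(int)

# Duplicates: drop rows that are identical after normalization.
clean = clean.drop_duplicates().reset_index(drop=True)

print(clean)
```

Lowercasing before dropping duplicates matters here: the first and third rows differ only in letter case, so they are recognized as duplicates only after normalization.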