Navi-P4

Home Credit Default Risk

Many people struggle to get loans due to insufficient or non-existent credit histories. Unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. To make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data, including telco and transactional information, to predict its clients' repayment abilities.

This project is based on a Kaggle challenge; the link to the challenge is below.

  https://www.kaggle.com/c/home-credit-default-risk/

Goal: Predict if an applicant is capable of repaying a loan.

Prerequisites

Use setup.py to install all the prerequisites for the project:

  python setup.py install

Data

The data set consists of eight CSV files (rows × columns):

  • Application_train (307,511 × 122)
  • Application_test (48,744 × 121)
  • Bureau (1,716,428 × 17)
  • Bureau_balance (27,299,925 × 3)
  • Previous_application (1,670,214 × 37)
  • Installments_payments (13,605,401 × 8)
  • Credit_card_balance (3,840,312 × 23)
  • POS_CASH_balance (10,001,358 × 8)

The data are all available for download from Kaggle, using the command provided on the competition page:

  kaggle competitions download -c home-credit-default-risk
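Once the files are downloaded into a tier1 folder (see the directory specifications below), they can be loaded in one pass. A minimal sketch, not taken from the repository — the helper name load_tier1 is invented for illustration, and the Kaggle download may use lowercase file names:

```python
from pathlib import Path

import pandas as pd

# File names as listed in the Data section above.
TIER1_FILES = [
    "Application_train.csv",
    "Application_test.csv",
    "Bureau.csv",
    "Bureau_balance.csv",
    "Previous_application.csv",
    "Installments_payments.csv",
    "Credit_card_balance.csv",
    "POS_CASH_balance.csv",
]

def load_tier1(data_path):
    """Read every tier1 CSV into a dict keyed by file stem."""
    tier1 = Path(data_path) / "tier1"
    return {name[:-4]: pd.read_csv(tier1 / name) for name in TIER1_FILES}
```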

Approach

This project features two end-to-end approaches to the Kaggle challenge 'Home Credit Default Risk': the first creates features manually, while the second uses an automated feature-engineering tool (featuretools). Comparing the results of both methods, we found that automated feature engineering can create superior features in a shorter amount of time.
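For the manual route, the usual pattern is to aggregate each child table up to the applicant level before joining it back to the application data. A hedged sketch of that pattern with pandas — the toy bureau table below is invented for illustration, although SK_ID_CURR and AMT_CREDIT_SUM mirror real Home Credit column names:

```python
import pandas as pd

# Toy stand-in for the Bureau table (the real one has 1.7M rows, 17 columns).
bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],                    # applicant id
    "AMT_CREDIT_SUM": [1000.0, 500.0, 2000.0],  # credit amount per record
})

# Manual feature engineering: collapse each applicant's bureau records
# into per-applicant statistics, then give the columns descriptive names.
agg = bureau.groupby("SK_ID_CURR")["AMT_CREDIT_SUM"].agg(["mean", "max", "count"])
agg.columns = ["bureau_credit_" + c for c in agg.columns]

# agg can now be left-joined onto Application_train on SK_ID_CURR.
```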

Training models:

  • Gradient Boosting Machines (GBM)
  • Random Forest
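The repository's training code is not reproduced here; as a rough sketch of the two model families, using scikit-learn's implementations on synthetic data (the actual project may use a different GBM library and tuned hyperparameters):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the tier3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=400) > 0).astype(int)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

scores = {}
for name, model in [
    ("gbm", GradientBoostingClassifier(random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
]:
    model.fit(X_tr, y_tr)
    # The competition is scored on ROC AUC of the predicted probabilities.
    scores[name] = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
```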

Directory Specifications

  • Input data should follow the structure below:
    The project directory should contain a folder named tier1 with the raw CSV files.

parent folder(project directory)
     |- tier1
         |--Application_test.csv
         |--Bureau.csv
         |--Bureau_balance.csv
         |--Previous_application.csv
         |--Installments_payments.csv
         |--Credit_card_balance.csv
         |--POS_CASH_balance.csv

  • Once the entire run completes, the output will follow the structure below:
    The submission file is written inside the project directory.

parent folder(project directory)
     |- tier1
         |--Application_test.csv
         |--Bureau.csv
         |--Bureau_balance.csv
         |--Previous_application.csv
         |--Installments_payments.csv
         |--Credit_card_balance.csv
         |--POS_CASH_balance.csv
     |- tier2
         |--Application_test.csv
         |--Bureau.csv
         |--Bureau_balance.csv
         |--Previous_application.csv
         |--Installments_payments.csv
         |--Credit_card_balance.csv
         |--POS_CASH_balance.csv
     |- tier3
         |--Application_train.csv
         |--Application_test.csv
     |- p4sub_gbm.csv

Running the code

python main.py -d <data_path> -m <mode> -ft <feature_type> -p <primitive_set> -i <imp_thresh>

Parameters:

<data_path> Path to the parent folder created as per the directory specifications.
Default : Current working directory.

<mode> Mode to run the code in.
Default : 'all'
Choices : 'all' - To generate features and also train the model and generate predictions,
            'features' - To only generate the features,
            'model' - To only run the model

<feature_type> Type of feature engineering to implement
Default : 'auto'
Choices : 'auto' - To generate features using featuretools,
            'manual' - To generate features based on manual feature engineering

<primitive_set> Set of primitives to consider while using featuretools
Default : 'some'
Choices : 'some' - To use a subset of the primitives,
            'all' - To use all the primitives

<imp_thresh> Importance threshold to consider while doing feature selection
Default : '0'
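The actual main.py is authoritative; purely as an illustration, the flags above could be wired up with argparse like this (defaults and choices mirror the descriptions in this section):

```python
import argparse
import os

def build_parser():
    """Hypothetical CLI wiring for the parameters documented above."""
    p = argparse.ArgumentParser(description="Home Credit Default Risk pipeline")
    p.add_argument("-d", dest="data_path", default=os.getcwd(),
                   help="path to the parent folder (see directory specifications)")
    p.add_argument("-m", dest="mode", default="all",
                   choices=["all", "features", "model"])
    p.add_argument("-ft", dest="feature_type", default="auto",
                   choices=["auto", "manual"])
    p.add_argument("-p", dest="primitive_set", default="some",
                   choices=["some", "all"])
    p.add_argument("-i", dest="imp_thresh", type=float, default=0.0)
    return p
```

With this wiring, running with `-m features -ft manual` sets mode and feature_type while the remaining flags keep their defaults.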

Output:

The program will output p4sub_gbm.csv in the given parent directory.

Sample Run Commands and Code Flow:

  • To run the entire code using auto feature selection and gbm for model predictions:

The below command will use the data from tier1, generate features using featuretools, and save the feature matrix into tier3. Then the tier3 data is used to train the GBM and generate predictions.
python main.py -d /Users/hemanth/Desktop/MSAI/DataSciencePracticum/Projects/p4 -m all -ft auto

  • To run the entire code using manual feature selection and gbm for model predictions:

The below command will use the data from tier1, generate features using manual feature engineering techniques, and save the transformed data into tier2. Tier2 data is then joined according to the data model and saved into tier3. Then, tier3 data is used to train gbm and generate predictions.
python main.py -d /Users/hemanth/Desktop/MSAI/DataSciencePracticum/Projects/p4 -m all -ft manual

  • To generate features alone using manual/auto feature selection:

The below command will use the data from tier1, generate features using manual feature engineering techniques, and save the transformed data into tier2. Tier2 data is then joined according to the data model and saved into tier3.
python main.py -d /Users/hemanth/Desktop/MSAI/DataSciencePracticum/Projects/p4 -m features -ft manual

The below command will use the data from tier1, generate features using featuretools, and save the feature matrix into tier3.
python main.py -d /Users/hemanth/Desktop/MSAI/DataSciencePracticum/Projects/p4 -m features -ft auto

  • To train the model for features that are already generated:

The below command uses features generated by auto feature selection (stored in tier3) to train the GBM and generate predictions.
python main.py -d /Users/hemanth/Desktop/MSAI/DataSciencePracticum/Projects/p4 -m model -ft auto

The below command uses features generated by manual feature selection (stored in tier3) to train the GBM and generate predictions.
python main.py -d /Users/hemanth/Desktop/MSAI/DataSciencePracticum/Projects/p4 -m model -ft manual

References

See the references Wiki page for details.

References Wiki

Ethics Considerations

This project could be used as part of a study of an individual's loan repayment ability. With this context in mind, we have undertaken certain ethics considerations to ensure that this project cannot be misused for purposes other than those intended.

See the ETHICS.md file for details. Also see the Wiki Ethics page for explanations about the ethics considerations.

Contributors

See the contributors file for details.

Contributors

License

This project is licensed under the MIT License; see the LICENSE.md file for details.

About

Repository for team Natus Vincere
