- Task: Refactor `churn_notebook.ipynb` into production-level code. The notebook identifies the credit card customers who are most likely to churn (target column: `Attrition_Flag`). The data comes from Kaggle.
CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category |
---|---|---|---|---|---|---|---|
768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K |
818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K |
713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K |
769911858 | Existing Customer | 40 | F | 4 | High School | Unknown | Less than $40K |
709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K |
- At a high level, the `churn_library.py` file produces distributions of age, churn, marital status, and total transactions, as well as a correlation matrix. The categorical features are target encoded. After this, two models are trained: a LogisticRegression (baseline) and a RandomForestClassifier (tuned with grid search). The models are serialized to pickle files for later use. Finally, feature importance on the test set is computed using SHAP values.
CATEGORICAL_FEATURES = [
    'Gender',
    'Education_Level',
    'Marital_Status',
    'Income_Category',
    'Card_Category'
]
NUMERICAL_FEATURES = [
    'Customer_Age',
    'Dependent_count',
    'Months_on_book',
    'Total_Relationship_Count',
    'Months_Inactive_12_mon',
    'Contacts_Count_12_mon',
    'Credit_Limit',
    'Total_Revolving_Bal',
    'Avg_Open_To_Buy',
    'Total_Amt_Chng_Q4_Q1',
    'Total_Trans_Amt',
    'Total_Trans_Ct',
    'Total_Ct_Chng_Q4_Q1',
    'Avg_Utilization_Ratio'
]
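
The target encoding step mentioned above replaces each category with the mean of the target for that category. The following is a minimal, dependency-free sketch of the idea and not the code in `churn_library.py`: the binary `Churn` column (1 for a churned customer, 0 for an existing one), the `_Churn` suffix, and both helper names are assumptions made for illustration.

```python
from collections import defaultdict


def target_encode(rows, column, target="Churn"):
    """Return {category: mean of target} for one categorical column."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[column]] += row[target]
        counts[row[column]] += 1
    return {category: totals[category] / counts[category] for category in counts}


def add_encoded_column(rows, column, target="Churn"):
    """Return new rows with a '<column>_<target>' field added (hypothetical naming)."""
    means = target_encode(rows, column, target)
    return [{**row, f"{column}_{target}": means[row[column]]} for row in rows]


# Toy rows for illustration only; not taken from the dataset.
rows = [
    {"Gender": "M", "Churn": 0},
    {"Gender": "F", "Churn": 1},
    {"Gender": "M", "Churn": 1},
    {"Gender": "F", "Churn": 1},
]
encoded = add_encoded_column(rows, "Gender")
print(encoded[0]["Gender_Churn"])  # 0.5
```

With pandas, the same means can be obtained with a `groupby` on the categorical column followed by `mean()` on the target. To avoid leaking the target, the means are usually computed on the training split only and then applied to the test split.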
- Clean and modular code, DRY principle.
- Optimizing code for efficiency.
- Writing documentation.
- Follow PEP8 standards.
- Readable, simple, concise.
- The folder snapshot below was taken after one execution.
C:.
│ .gitignore
│ churn_library.py
│ churn_notebook.ipynb
│ churn_script_logging_and_tests.py
│ constants.py
│ LICENSE
│ README.md
│ requirements_py3.8.txt
│
├───data
│ bank_data.csv
│
├───images
│ │
│ ├───eda (outputs)
│ │ age_distribution.png
│ │ churn_distribution.png
│ │ correlation_matrix.png
│ │ marital_status_distribution.png
│ │ total_trans_ct.png
│ │
│ ├───feature_importance (outputs)
│ │ feature_importance.png
│ │
│ └───results (outputs)
│ classification_report_logistic_regression.png
│ classification_report_random_forest.png
│
├───logs (outputs)
│ churn_library.log
│
├───models (outputs)
│ LogisticRegression.pkl
│ RandomForest.pkl
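
The `.pkl` files under `models` hold the serialized models. As a rough sketch of the save and load round trip, the example below pickles a plain dictionary that stands in for a fitted estimator; `save_model`, `load_model`, the dictionary contents, and the file name are hypothetical and not taken from the project.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Serialize a model to disk, creating the parent folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as file:
        pickle.dump(model, file)


def load_model(path):
    """Load a model back from disk; only unpickle files you trust."""
    with Path(path).open("rb") as file:
        return pickle.load(file)


# Stand-in for a fitted estimator (assumption: the real objects are
# the trained LogisticRegression and RandomForestClassifier models).
fitted_params = {"name": "baseline", "coefficients": [0.1, -0.4], "intercept": 0.2}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "models" / "Baseline.pkl"
    save_model(fitted_params, model_path)
    restored = load_model(model_path)

print(restored == fitted_params)  # True
```

Pickle files are tied to the library versions that produced them, so loading works most reliably in the same environment that wrote them.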
- `churn_library.py` contains the refactored code.
- `churn_script_logging_and_tests.py` contains a test execution with logging (`logs\churn_library.log`). No testing framework was used.
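
A framework-free test with logging can be as simple as a function that runs one step, checks that the expected file exists, and logs the outcome. The sketch below illustrates that pattern under assumptions: `write_report`, `check_write_report`, the logger name, and the log format are all hypothetical and do not come from `churn_script_logging_and_tests.py`.

```python
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger("churn_tests")  # hypothetical logger name
logger.setLevel(logging.INFO)


def write_report(output_dir):
    """Hypothetical stand-in for a library function that writes an artifact."""
    output_path = Path(output_dir) / "report.txt"
    output_path.write_text("ok", encoding="utf-8")
    return output_path


def check_write_report(output_dir):
    """Run the function, log the outcome, and return True on success."""
    try:
        write_report(output_dir)
        assert (Path(output_dir) / "report.txt").is_file()
        logger.info("Testing write_report: SUCCESS")
        return True
    except (OSError, AssertionError) as error:
        logger.error("Testing write_report: FAILED (%s)", error)
        return False


with tempfile.TemporaryDirectory() as tmp_dir:
    log_path = Path(tmp_dir) / "test_run.log"
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(name)s - %(levelname)s - %(message)s"))
    logger.addHandler(handler)

    passed = check_write_report(tmp_dir)

    # Detach and close the handler so the log is flushed before reading it.
    logger.removeHandler(handler)
    handler.close()
    log_text = log_path.read_text(encoding="utf-8")

print(passed)    # True
print(log_text)  # churn_tests - INFO - Testing write_report: SUCCESS
```

Returning a boolean instead of re-raising lets a runner script execute every check and report all failures in a single pass.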
pip install -r .\requirements_py3.8.txt
- You can start a test run from the terminal with `python churn_script_logging_and_tests.py`. This cleans the project's artifacts from the images, models, and logs folders, then performs a test run similar to the sequence image above, with the addition of exception handling and logging. The test consists mainly of checking that the expected files, e.g. images and models, are written to the file system. The test reads its execution parameters from `constants.py`.
- Once you have made sure everything is working, you can run `python churn_library.py path_to_any_csv` from the terminal to run an execution with main.
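
For reference, a command-line entry point that takes the CSV path as its only argument could be structured as in the sketch below. It is an illustration under assumptions, not the actual `main` of `churn_library.py`: the `parse_args` helper, the `csv_path` argument name, and the placeholder body are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; the CSV path is the only positional argument."""
    parser = argparse.ArgumentParser(
        description="Run the churn pipeline on a CSV file."
    )
    parser.add_argument("csv_path", help="path to the input CSV file")
    return parser.parse_args(argv)


def main(argv=None):
    """Entry point; argv=None makes argparse read sys.argv[1:]."""
    args = parse_args(argv)
    # Placeholder: a real script would import the data, run the EDA,
    # encode the features, train the models, and save the artifacts here.
    print(f"Running pipeline on {args.csv_path}")
    return args.csv_path


# In a script, main() would be called without arguments under an
# `if __name__ == "__main__":` guard. An explicit list is passed here
# so the example runs without command-line input.
main(["./data/bank_data.csv"])
```

Accepting an optional `argv` list keeps the entry point callable from a test script without touching `sys.argv`.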