After submitting the proposed work plan, the work we had left to do was as follows:

4. Make necessary adjustments to data for the given model (LGBM and XGB handle categorical variables differently than CatBoost)
5. Check which combination of features that does not induce data leakage the model performs best with, via GridSearch cross-validation
6. Adjust features based on SHAP output
7. Tune hyperparameters for models with best features
8. Retrain model on full training set and test model
9. Generate script files for the hyperparameter tuning of each model and for the test run of the best model with its tuned hyperparameters
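
As an illustration of step 4, here is a minimal sketch of how the categorical columns might be prepared differently for the two families of models. It is an assumption about the approach, not the exact code in the scripts: the helper names, the `missing` label, and the toy values are hypothetical. LightGBM and XGBoost can consume pandas `category` columns directly, while CatBoost expects categorical values as strings or integers, so missing values need an explicit label.

```python
import pandas as pd

# Subset of the categorical columns, used here for illustration only.
CAT_COLS = ["type", "internet_service"]


def prepare_for_lgbm_xgb(df, cat_cols):
    """Cast categorical columns to pandas 'category' dtype; NaN stays missing."""
    out = df.copy()
    for col in cat_cols:
        out[col] = out[col].astype("category")
    return out


def prepare_for_catboost(df, cat_cols, missing_label="missing"):
    """Cast categorical columns to plain strings and label missing values."""
    out = df.copy()
    for col in cat_cols:
        out[col] = [missing_label if pd.isna(v) else str(v) for v in out[col]]
    return out


# Toy rows standing in for the real training data (values are hypothetical).
toy = pd.DataFrame({
    "type": ["Month-to-month", "One year", "Two year"],
    "internet_service": ["DSL", None, "Fiber optic"],
    "monthly_charges": [29.85, 56.95, 104.80],
})

lgbm_xgb_df = prepare_for_lgbm_xgb(toy, CAT_COLS)
catboost_df = prepare_for_catboost(toy, CAT_COLS)

# The prepared frames would then be passed along these lines:
#   LGBMClassifier().fit(lgbm_xgb_df, y)
#   XGBClassifier(enable_categorical=True, tree_method="hist").fit(lgbm_xgb_df, y)
#   CatBoostClassifier(cat_features=CAT_COLS).fit(catboost_df, y)
```

Keeping two separate frames means the numeric columns stay identical across models, so any difference in scores comes from the models rather than from the preprocessing.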

And the questions we had left to answer included the following:

1. Can we utilize the 'customer_duration' feature without introducing data leakage? 
2. Can we utilize 'customer_duration' with start_dayofweek_sin, start_dayofweek_cos, and start_dayofweek_cat without introducing data leakage?
3. What combination of features that does not induce data leakage will lead to the best model performance?

After running and re-running the LGBM, CatBoost, and XGBoost classifiers with different combinations of features, I can confirm the following:

1. Yes, we can utilize the 'customer_duration' feature without introducing any obvious data leakage. The SHAP values from our chosen model reveal that 'customer_duration' and 'type' are the most influential features in the set, with values between   for 'customer_duration' and    for 'type'. At first glance, this might raise concerns about data leakage. For example, if churn dates did not always align with the end of a contract period, any customer duration that did not fall on a 30-day or yearly interval would be an obvious case of churn. We introduced some noise into 'customer_duration' by subtracting random float values between 12 and 15 from each value. That is only a range of 3, so any 'customer_duration' value more than 3 days off a monthly or yearly interval would still be easily flagged by the model. This is not a problem in our data, however, because we know that all end_dates fall on the first of their month. The noise is still a useful bulwark against data leakage: the more distinct values 'customer_duration' takes, the harder it is for the model to home in on specific values that reliably indicate churn. The goal, of course, was to force the model to learn the durations at which churn risk is highest, for the sake of future prediction, rather than to use the feature as a surefire delimiter between customers who have churned and those who have not.

2. There was no way to incorporate any dayofweek features into the data without introducing obvious signs of data leakage. I tried expanding the range of values I subtracted from 'customer_duration' from 3 to 70. The thinking behind this was that start_dayofweek would allow the model to triangulate a start month and start year, because only certain months start on certain days in certain years. Moreover, there are only four distinct end_dates in our data set, each a month apart from the next, which would make the triangulation much easier. Since the end_dates are only a month apart, randomly varying 'customer_duration' over a range of more than two months seemed like a promising way to prevent the model from homing in on a date that obviously fell within the range of those four end_dates. This seemed to work during the cross-validation and hyperparameter tuning phase. However, once the models had the chance to train on the full data set, roc_auc scores for all of them skyrocketed from the high 0.80s during cross-validation to 0.99 on the test set. Therefore, I had to abandon any inclusion of a dayofweek feature.

3. The combination of features we ultimately settled on was as follows:

 0   type               5634 non-null   category  
 1   paperless_billing  5634 non-null   category  
 2   payment_method     5634 non-null   category  
 3   monthly_charges    5634 non-null   float64   
 4   total_charges      5625 non-null   float64   
 5   gender             5634 non-null   category  
 6   senior_citizen     5634 non-null   int64     
 7   partner            5634 non-null   category  
 8   dependents         5634 non-null   category  
 9   p_i_or_b           5634 non-null   category  
 10  internet_service   4403 non-null   category  
 11  online_security    4403 non-null   category  
 12  online_backup      4403 non-null   category  
 13  device_protection  4403 non-null   category  
 14  tech_support       4403 non-null   category  
 15  streaming_tv       4403 non-null   category  
 16  streaming_movies   4403 non-null   category  
 17  multiple_lines     5089 non-null   category  
 18  customer_duration  5634 non-null   float64   
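
The noise applied to 'customer_duration' in points 1 and 2 can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the function name, the seed, and the toy durations are hypothetical, and the bounds of the 70-wide range (12 to 82) are an assumption, since only its width is stated above.

```python
import numpy as np


def add_duration_noise(durations, low=12.0, high=15.0, seed=0):
    """Subtract a uniform random offset in [low, high) from each duration."""
    rng = np.random.default_rng(seed)
    durations = np.asarray(durations, dtype=float)
    noise = rng.uniform(low, high, size=durations.shape)
    return durations - noise


# Toy durations in days (hypothetical values).
true_durations = np.array([30.0, 365.0, 730.0])

# Range of 3, as used for the final feature set.
narrow = add_duration_noise(true_durations)

# Range of 70, as tried alongside the dayofweek features (bounds assumed).
wide = add_duration_noise(true_durations, low=12.0, high=82.0)

print(true_durations - narrow)
print(true_durations - wide)
```

Because the offset is drawn independently for every row, two customers with the same true duration will usually end up with different values, which is what spreads the feature over more distinct values.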

The model which demonstrated the best performance was XGBoost, with a roc_auc score of 0.90 on the test set. During cross-validation this same model produced a roc_auc score of 0.85, but an accuracy score of only 0.75. The disparity between the roc_auc score and the accuracy score suggested that a prediction threshold adjustment for the model would be appropriate. We determined the optimal threshold by finding the value which maximized Youden's J statistic (which is the difference between the true positive rate and the false positive rate). After making this adjustment to the model we selected during the hyperparameter tuning phase, the roc_auc and accuracy scores showed some convergence, although perhaps not as much as we would have hoped for, as the accuracy score reached 0.79. While accuracy is important because we don't want to be giving away promotional offers to customers who are not going to churn, recall is arguably a higher priority, as the cost of losing a customer is higher than the cost of giving them a promotion. Therefore, we calculated a recall_score as well, and this was much more encouraging. With the adjusted threshold, recall on the test set was 0.88. Given an accuracy score of 0.79, this tells us that we are on the right side of the precision-recall tradeoff for this task.
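
The threshold selection described above can be sketched with scikit-learn's `roc_curve`. This is a minimal illustration on synthetic scores, not the output of our model: the helper name and the simulated data are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, roc_curve


def youden_threshold(y_true, y_score):
    """Return the score threshold that maximizes Youden's J = TPR - FPR."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    j = tpr - fpr
    return thresholds[np.argmax(j)]


# Synthetic labels and predicted probabilities (stand-ins for model output).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
y_score = np.clip(0.35 * y_true + rng.normal(0.3, 0.15, size=500), 0.0, 1.0)

threshold = youden_threshold(y_true, y_score)

# Predict churn when the score is at or above the chosen threshold,
# which matches the convention roc_curve uses for its thresholds.
y_pred = (y_score >= threshold).astype(int)

print("threshold:", threshold)
print("accuracy:", accuracy_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
```

In practice the threshold would be chosen on cross-validation or validation predictions and then applied unchanged to the test set, so that the test metrics stay an honest estimate.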

What follows below is the code that followed the project plan and EDA that was submitted previously. The first part is laid out in notebook format, just as the project plan and EDA were. It should be copied and pasted into the same notebook and run all the way through. This code removes the features that were shown to lead to data leakage, which included all the date features we engineered. It will also download the necessary data sets to your local device as parquet files to maintain dtype integrity. Please also note that I changed the range I subtracted from 'customer_duration' from 13-15 to 12-15. In practice this had little effect, but the original idea was to use a range of 3, so it required correcting to stay consistent.

The rest of the code below consists of generated script files to be run on a GPU platform. There are four generated script files in total. The first three are for the hyperparameter tuning of the respective models and appear in the following order: LGBMClassifier, CatBoostClassifier, and XGBClassifier.

The final generated script file fits the XGBClassifier with tuned hyperparameters to the full training set and calculates its performance metrics on the test set. It also includes a SHAP analysis, but the results were nearly identical to those of the cross-validation SHAP analysis.
