<div style="border: 5px solid purple; padding: 15px; margin: 5px">
<b> Reviewer's comment</b>
    
Hi Amrit, my name is Svetlana (https://hub.tripleten.com/u/6dee602c).  Congratulations on submitting the Final project! 🎉 

    
<div style="border: 5px solid green; padding: 15px; margin: 5px">

- You did a great job on building a good model for churn prediction.


- The notebook demonstrates strong fundamentals: clean data merging, thoughtful preprocessing, train-test splitting, and meaningful metric evaluation.

 
- It's great that you split the data into 3 subsets.


- The conclusions clearly describe the results, well done! 
 

</div>
    
<div style="border: 5px solid gold; padding: 15px; margin: 5px">
<b> Reviewer's comment </b>

What can be improved:


- Consider introducing EDA. Distributions and feature correlations may provide helpful context before modeling. 
 

- It is acceptable to use `get_dummies` in this project. It has to be applied before the data is split: if it is applied to each subset separately, the subsets may end up with different numbers of dummy columns whenever a category is missing from one of them. If the columns to convert are not specified explicitly, `get_dummies` converts only the columns that hold categorical strings, which may lead to unexpected results when a categorical feature is stored in numerical form (for example `[1, 2, 3, 2, ...]`) and is therefore left unencoded. Neither issue appears here, so well done! However, there are preferable tools.

<details><summary><font color="purple">click here to read more</font></summary>
<br>
    
`OneHotEncoder(handle_unknown='ignore')` or `OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)` are generally more robust than `get_dummies` because they can handle a test subset that contains categories not seen during training. [Difference between OneHotEncoder and get_dummies](https://pythonsimplified.com/difference-between-onehotencoder-and-get_dummies/). 
    
    
    
For tree-based models, `OrdinalEncoder` is a better choice because it is computationally cheaper. For boosting algorithms, we can rely on internal encoders that usually perform even better than external ones. For `CatBoost`, this is controlled by the `cat_features` parameter. For `LightGBM`, you can convert categorical features to the `category` dtype, allowing the model to handle them automatically.
    

    
`OrdinalEncoder()` or `LabelEncoder()` should not be used with linear models if there is no ordinal relationship between the categories. [How and When to Use Ordinal Encoder](https://leochoi146.medium.com/how-and-when-to-use-ordinal-encoder-d8b0ef90c28c). For linear regression, I recommend using `OneHotEncoder(handle_unknown='ignore')`. 


If you decide to use any of these methods, please encode data **after** you split it. 

</details>



- To further improve the model's performance, I recommend applying hyperparameter tuning.


- Before training real models, it's useful to evaluate a constant (dummy) classifier, for example one that always predicts the majority class. This sets a minimum performance baseline and confirms that the data pipeline, target encoding, and evaluation metrics are functioning correctly. If a real model performs worse than this dummy, it signals a serious issue in preprocessing, feature engineering, or model configuration.

  
- You can also plot the ROC curve to present the results more clearly. 



</div>
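
As a rough illustration of encoding after the split, here is a minimal sketch on made-up data. The column name `PaymentMethod` is borrowed from the dataset, but the rows and the `Crypto` category are invented for the example. The encoder is fitted on the training subset only, and a category that appears only in the test subset is encoded as a row of zeros instead of raising an error:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data, invented for illustration only
train = pd.DataFrame({'PaymentMethod': ['Mailed check', 'Electronic check', 'Mailed check']})
test = pd.DataFrame({'PaymentMethod': ['Electronic check', 'Crypto']})  # 'Crypto' never seen in train

# Fit on the training subset only, then reuse the fitted encoder everywhere
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)

# transform() returns a sparse matrix by default; toarray() makes it dense
train_ohe = encoder.transform(train).toarray()
test_ohe = encoder.transform(test).toarray()

print(encoder.categories_)  # categories learned from the training subset
print(test_ohe)             # the unseen category becomes an all-zero row
```

With `get_dummies` applied to each subset separately, the two subsets in this toy case would produce different column sets, which is the mismatch described above.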


<hr>
    
<font color='dodgerblue'>**To sum up:**</font> you demonstrated strong analytical and coding skills by preparing the data and training the models. I do not have any questions, so the project can be accepted. Thank you for your diligence on this and other sprints! I am very glad to see your progress 😊 Good luck! 😉
    


</div>

Importing the necessary libraries and reading the CSV files into DataFrames.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df_contract = pd.read_csv('/datasets/final_provider/contract.csv')
df_personal = pd.read_csv('/datasets/final_provider/personal.csv')
df_internet = pd.read_csv('/datasets/final_provider/internet.csv')
df_phone = pd.read_csv('/datasets/final_provider/phone.csv')

In [3]:
df_contract.head()

Unnamed: 0,customerID,BeginDate,EndDate,Type,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
0,7590-VHVEG,2020-01-01,No,Month-to-month,Yes,Electronic check,29.85,29.85
1,5575-GNVDE,2017-04-01,No,One year,No,Mailed check,56.95,1889.5
2,3668-QPYBK,2019-10-01,2019-12-01 00:00:00,Month-to-month,Yes,Mailed check,53.85,108.15
3,7795-CFOCW,2016-05-01,No,One year,No,Bank transfer (automatic),42.3,1840.75
4,9237-HQITU,2019-09-01,2019-11-01 00:00:00,Month-to-month,Yes,Electronic check,70.7,151.65


In [4]:
df_personal.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents
0,7590-VHVEG,Female,0,Yes,No
1,5575-GNVDE,Male,0,No,No
2,3668-QPYBK,Male,0,No,No
3,7795-CFOCW,Male,0,No,No
4,9237-HQITU,Female,0,No,No


In [5]:
df_internet.head()

Unnamed: 0,customerID,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies
0,7590-VHVEG,DSL,No,Yes,No,No,No,No
1,5575-GNVDE,DSL,Yes,No,Yes,No,No,No
2,3668-QPYBK,DSL,Yes,Yes,No,No,No,No
3,7795-CFOCW,DSL,Yes,No,Yes,Yes,No,No
4,9237-HQITU,Fiber optic,No,No,No,No,No,No


In [6]:
df_phone.head()

Unnamed: 0,customerID,MultipleLines
0,5575-GNVDE,No
1,3668-QPYBK,No
2,9237-HQITU,No
3,9305-CDSKC,Yes
4,1452-KIOVK,Yes


Checking null values and shapes

In [7]:
print(df_contract.isna().sum())
print()
print(df_personal.isna().sum())
print()
print(df_internet.isna().sum())
print()
print(df_phone.isna().sum())
print()

customerID          0
BeginDate           0
EndDate             0
Type                0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
dtype: int64

customerID       0
gender           0
SeniorCitizen    0
Partner          0
Dependents       0
dtype: int64

customerID          0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
dtype: int64

customerID       0
MultipleLines    0
dtype: int64



In [8]:
print(df_contract.shape)
print(df_personal.shape)
print(df_internet.shape)
print(df_phone.shape)

(7043, 8)
(7043, 5)
(5517, 8)
(6361, 2)


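Merging all DataFrames on 'customerID' into df_merged. The outer join keeps customers who have no internet or phone record, so their service columns are filled with NaN.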

In [9]:
merged = pd.merge(df_contract, df_personal, on='customerID', how='outer')
merged2 = pd.merge(df_internet, df_phone, on='customerID', how='outer')
df_merged = pd.merge(merged, merged2, on='customerID', how='outer')
df_merged.isna().sum()

customerID             0
BeginDate              0
EndDate                0
Type                   0
PaperlessBilling       0
PaymentMethod          0
MonthlyCharges         0
TotalCharges           0
gender                 0
SeniorCitizen          0
Partner                0
Dependents             0
InternetService     1526
OnlineSecurity      1526
OnlineBackup        1526
DeviceProtection    1526
TechSupport         1526
StreamingTV         1526
StreamingMovies     1526
MultipleLines        682
dtype: int64

In [10]:
df_merged.shape

(7043, 20)
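
To see where the missing values come from, here is a small sketch of the same kind of outer join on invented toy tables (only the column names mirror the notebook). The `indicator=True` option of `pd.merge` adds a `_merge` column that labels rows found in only one of the inputs:

```python
import pandas as pd

# Toy tables, invented for illustration; only the column names mirror the notebook
contract = pd.DataFrame({'customerID': ['A', 'B', 'C'],
                         'MonthlyCharges': [29.85, 56.95, 53.85]})
internet = pd.DataFrame({'customerID': ['A', 'C'],
                         'InternetService': ['DSL', 'Fiber optic']})

# Outer join keeps every customer; 'B' has no internet record
merged = pd.merge(contract, internet, on='customerID', how='outer', indicator=True)
print(merged)

# Rows present only in the left table are the ones that get NaN service columns
left_only = int((merged['_merge'] == 'left_only').sum())
n_missing = int(merged['InternetService'].isna().sum())
print(left_only, n_missing)
```

The same logic explains the counts above: 7043 - 5517 = 1526 customers without an internet record and 7043 - 6361 = 682 without a phone record.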

In [11]:
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   BeginDate         7043 non-null   object 
 2   EndDate           7043 non-null   object 
 3   Type              7043 non-null   object 
 4   PaperlessBilling  7043 non-null   object 
 5   PaymentMethod     7043 non-null   object 
 6   MonthlyCharges    7043 non-null   float64
 7   TotalCharges      7043 non-null   object 
 8   gender            7043 non-null   object 
 9   SeniorCitizen     7043 non-null   int64  
 10  Partner           7043 non-null   object 
 11  Dependents        7043 non-null   object 
 12  InternetService   5517 non-null   object 
 13  OnlineSecurity    5517 non-null   object 
 14  OnlineBackup      5517 non-null   object 
 15  DeviceProtection  5517 non-null   object 
 16  TechSupport       5517 non-null   object 


Filling the missing values left by the merge with 'unknown', checking for duplicates, and then dropping 'customerID'.

In [12]:
cols_to_fill = ['InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'MultipleLines']
df_merged[cols_to_fill] = df_merged[cols_to_fill].fillna('unknown')

In [13]:
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   BeginDate         7043 non-null   object 
 2   EndDate           7043 non-null   object 
 3   Type              7043 non-null   object 
 4   PaperlessBilling  7043 non-null   object 
 5   PaymentMethod     7043 non-null   object 
 6   MonthlyCharges    7043 non-null   float64
 7   TotalCharges      7043 non-null   object 
 8   gender            7043 non-null   object 
 9   SeniorCitizen     7043 non-null   int64  
 10  Partner           7043 non-null   object 
 11  Dependents        7043 non-null   object 
 12  InternetService   7043 non-null   object 
 13  OnlineSecurity    7043 non-null   object 
 14  OnlineBackup      7043 non-null   object 
 15  DeviceProtection  7043 non-null   object 
 16  TechSupport       7043 non-null   object 


In [14]:
df_merged.duplicated().sum()

0

In [15]:
# drop() with inplace=True returns None, so nothing can be chained onto it
df_merged.drop('customerID', axis = 1, inplace = True)

Finding the latest BeginDate and the earliest EndDate so I can choose a snapshot date for creating a Tenure feature.

In [16]:
df_merged['BeginDate'].max()

'2020-02-01'

In [17]:
df_merged['EndDate'].min()

'2019-10-01 00:00:00'

Creating the Tenure feature with a snapshot date of 2019-09-15. Because the earliest EndDate is 2019-10-01, measuring tenure at this date avoids data leakage. Customers whose BeginDate falls after 2019-09-15 get a negative tenure and will be dropped: there are 938 of them, which is around 13% of the dataset. Dropping them is not trivial, but it is not catastrophic either. I think Tenure will be a very valuable feature for predicting churn. 

In [18]:
df_merged['BeginDate'] = pd.to_datetime(df_merged['BeginDate'].str.strip(), errors='coerce')
df_merged['Tenure'] = pd.to_datetime('2019-09-15 00:00:00') - df_merged['BeginDate']
df_merged['Tenure'] = df_merged['Tenure'].dt.days
df_merged['Tenure'] = pd.to_numeric(df_merged['Tenure'])
neg_tenure = df_merged[df_merged['Tenure']<0].count()
neg_tenure

BeginDate           938
EndDate             938
Type                938
PaperlessBilling    938
PaymentMethod       938
MonthlyCharges      938
TotalCharges        938
gender              938
SeniorCitizen       938
Partner             938
Dependents          938
InternetService     938
OnlineSecurity      938
OnlineBackup        938
DeviceProtection    938
TechSupport         938
StreamingTV         938
StreamingMovies     938
MultipleLines       938
Tenure              938
dtype: int64
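
A minimal sketch of the same tenure calculation on three sample start dates. The first two appear in the contract rows shown earlier; the third (2019-10-01) is invented to show a start date after the snapshot. Summing a boolean mask gives a single count rather than one count per column:

```python
import pandas as pd

snapshot = pd.Timestamp('2019-09-15')

# Two start dates from the rows shown earlier, plus one invented date after the snapshot
begin = pd.to_datetime(pd.Series(['2017-04-01', '2019-09-01', '2019-10-01']))

# Subtracting datetimes gives timedeltas; .dt.days converts them to whole days
tenure_days = (snapshot - begin).dt.days
print(tenure_days.tolist())

# A boolean mask summed gives a single number instead of a per-column count
n_after_snapshot = int((tenure_days < 0).sum())
print(n_after_snapshot)
```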

In [19]:
# .copy() makes the filtered frame independent, so later column assignments are safe
df_merged = df_merged[df_merged['Tenure'] > 0].copy()
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6105 entries, 1 to 7042
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   BeginDate         6105 non-null   datetime64[ns]
 1   EndDate           6105 non-null   object        
 2   Type              6105 non-null   object        
 3   PaperlessBilling  6105 non-null   object        
 4   PaymentMethod     6105 non-null   object        
 5   MonthlyCharges    6105 non-null   float64       
 6   TotalCharges      6105 non-null   object        
 7   gender            6105 non-null   object        
 8   SeniorCitizen     6105 non-null   int64         
 9   Partner           6105 non-null   object        
 10  Dependents        6105 non-null   object        
 11  InternetService   6105 non-null   object        
 12  OnlineSecurity    6105 non-null   object        
 13  OnlineBackup      6105 non-null   object        
 14  DeviceProtection  6105 n

Converting TotalCharges to a numeric float and then subtracting 4.5 months of MonthlyCharges so the totals approximately match the 2019-09-15 snapshot. The latest BeginDate is 2020-02-01, roughly 4.5 months after the snapshot.

In [20]:
df_merged['TotalCharges'] = pd.to_numeric(df_merged['TotalCharges'])
df_merged['TotalCharges'] = df_merged['TotalCharges'] - df_merged['MonthlyCharges'] * 4.5
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6105 entries, 1 to 7042
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   BeginDate         6105 non-null   datetime64[ns]
 1   EndDate           6105 non-null   object        
 2   Type              6105 non-null   object        
 3   PaperlessBilling  6105 non-null   object        
 4   PaymentMethod     6105 non-null   object        
 5   MonthlyCharges    6105 non-null   float64       
 6   TotalCharges      6105 non-null   float64       
 7   gender            6105 non-null   object        
 8   SeniorCitizen     6105 non-null   int64         
 9   Partner           6105 non-null   object        
 10  Dependents        6105 non-null   object        
 11  InternetService   6105 non-null   object        
 12  OnlineSecurity    6105 non-null   object        
 13  OnlineBackup      6105 non-null   object        
 14  DeviceProtection  6105 n

In [21]:
df_merged.head(20)

Unnamed: 0,BeginDate,EndDate,Type,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,gender,SeniorCitizen,Partner,Dependents,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,MultipleLines,Tenure
1,2017-04-01,No,One year,No,Mailed check,56.95,1633.225,Male,0,No,No,DSL,Yes,No,Yes,No,No,No,No,897
3,2016-05-01,No,One year,No,Bank transfer (automatic),42.3,1650.4,Male,0,No,No,DSL,Yes,No,Yes,Yes,No,No,unknown,1232
4,2019-09-01,2019-11-01 00:00:00,Month-to-month,Yes,Electronic check,70.7,-166.5,Female,0,No,No,Fiber optic,No,No,No,No,No,No,No,14
5,2019-03-01,2019-11-01 00:00:00,Month-to-month,Yes,Electronic check,99.65,372.075,Female,0,No,No,Fiber optic,No,No,Yes,No,Yes,Yes,Yes,198
6,2018-04-01,No,Month-to-month,Yes,Credit card (automatic),89.1,1548.45,Male,0,No,Yes,Fiber optic,No,Yes,No,No,Yes,No,Yes,532
7,2019-04-01,No,Month-to-month,No,Mailed check,29.75,168.025,Female,0,No,No,DSL,Yes,No,No,No,No,No,unknown,167
8,2017-07-01,2019-11-01 00:00:00,Month-to-month,Yes,Electronic check,104.8,2574.45,Female,0,Yes,No,Fiber optic,No,No,Yes,Yes,Yes,Yes,Yes,806
9,2014-12-01,No,One year,No,Bank transfer (automatic),56.15,3235.275,Male,0,No,Yes,DSL,Yes,Yes,No,No,No,No,No,1749
10,2019-01-01,No,Month-to-month,Yes,Mailed check,49.95,362.675,Male,0,Yes,Yes,DSL,Yes,No,No,No,No,No,No,257
11,2018-10-01,No,Two year,No,Credit card (automatic),18.95,241.525,Male,0,No,No,unknown,unknown,unknown,unknown,unknown,unknown,unknown,No,349
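
The adjustment can be checked by hand. The sketch below repeats the arithmetic for the two customers at index 1 and index 4, using the MonthlyCharges and original TotalCharges values printed by `df_contract.head()`. It reproduces the adjusted values in the table above and shows why a customer with a short tenure ends up with a negative total: 4.5 months of charges is more than that customer had paid.

```python
# Values copied from df_contract.head() (index 1 and index 4)
rows = [
    {'MonthlyCharges': 56.95, 'TotalCharges': 1889.5},
    {'MonthlyCharges': 70.7, 'TotalCharges': 151.65},
]

MONTHS_BACK = 4.5  # same factor as in the notebook cell

# Subtract 4.5 months of charges from each reported total
adjusted = [r['TotalCharges'] - r['MonthlyCharges'] * MONTHS_BACK for r in rows]
print([round(x, 3) for x in adjusted])
```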


Mapping EndDate to a binary target column: 1 if the customer has an end date (churned), 0 otherwise.

In [22]:
df_merged['EndDate'] = pd.to_datetime(df_merged['EndDate'], errors = 'coerce')

In [23]:
df_merged['EndDate'] = df_merged['EndDate'].notna().map({True:1, False:0})
df_merged.head()

Unnamed: 0,BeginDate,EndDate,Type,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,gender,SeniorCitizen,Partner,Dependents,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,MultipleLines,Tenure
1,2017-04-01,0,One year,No,Mailed check,56.95,1633.225,Male,0,No,No,DSL,Yes,No,Yes,No,No,No,No,897
3,2016-05-01,0,One year,No,Bank transfer (automatic),42.3,1650.4,Male,0,No,No,DSL,Yes,No,Yes,Yes,No,No,unknown,1232
4,2019-09-01,1,Month-to-month,Yes,Electronic check,70.7,-166.5,Female,0,No,No,Fiber optic,No,No,No,No,No,No,No,14
5,2019-03-01,1,Month-to-month,Yes,Electronic check,99.65,372.075,Female,0,No,No,Fiber optic,No,No,Yes,No,Yes,Yes,Yes,198
6,2018-04-01,0,Month-to-month,Yes,Credit card (automatic),89.1,1548.45,Male,0,No,Yes,Fiber optic,No,Yes,No,No,Yes,No,Yes,532
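
A minimal sketch of the two-step mapping on a few sample values in the same two shapes that EndDate takes in the raw data (the series itself is made up for the example):

```python
import pandas as pd

# Sample values: either a timestamp string or the literal 'No'
end_date = pd.Series(['2019-12-01 00:00:00', 'No', '2019-11-01 00:00:00', 'No'])

# 'No' cannot be parsed as a date, so errors='coerce' turns it into NaT
parsed = pd.to_datetime(end_date, errors='coerce')

# notna() is True where a real end date exists; casting to int gives 1/0
churned = parsed.notna().astype(int)
print(churned.tolist())
```

Here `astype(int)` is used as a shorter equivalent of `.map({True: 1, False: 0})`.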


Feature engineering: extracting the year, month, and day from BeginDate.

In [24]:
df_merged['begin_year'] = df_merged['BeginDate'].dt.year
df_merged['begin_month'] = df_merged['BeginDate'].dt.month
df_merged['begin_day'] = df_merged['BeginDate'].dt.day
df_merged.drop('BeginDate', axis = 1, inplace = True)
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6105 entries, 1 to 7042
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   EndDate           6105 non-null   int64  
 1   Type              6105 non-null   object 
 2   PaperlessBilling  6105 non-null   object 
 3   PaymentMethod     6105 non-null   object 
 4   MonthlyCharges    6105 non-null   float64
 5   TotalCharges      6105 non-null   float64
 6   gender            6105 non-null   object 
 7   SeniorCitizen     6105 non-null   int64  
 8   Partner           6105 non-null   object 
 9   Dependents        6105 non-null   object 
 10  InternetService   6105 non-null   object 
 11  OnlineSecurity    6105 non-null   object 
 12  OnlineBackup      6105 non-null   object 
 13  DeviceProtection  6105 non-null   object 
 14  TechSupport       6105 non-null   object 
 15  StreamingTV       6105 non-null   object 
 16  StreamingMovies   6105 non-null   object 


Using a for loop to count the unique values of each object-dtype feature, which will help with one-hot encoding (OHE).

In [25]:
for col in df_merged.columns:
    if df_merged[col].dtype == 'object':
        print(f'{col}: {df_merged[col].nunique()}')

Type: 3
PaperlessBilling: 2
PaymentMethod: 4
gender: 2
Partner: 2
Dependents: 2
InternetService: 3
OnlineSecurity: 3
OnlineBackup: 3
DeviceProtection: 3
TechSupport: 3
StreamingTV: 3
StreamingMovies: 3
MultipleLines: 3


In [26]:
cols = ['PaperlessBilling','Partner','Dependents']
df_merged[cols] = df_merged[cols].replace({'Yes': 1, 'No': 0})
df_merged['gender'] = df_merged['gender'].map({'Male':1, 'Female':0})

In [27]:
for col in df_merged.columns:
    if df_merged[col].dtype == 'object':
        print(f'{col}: {df_merged[col].nunique()}')

Type: 3
PaymentMethod: 4
InternetService: 3
OnlineSecurity: 3
OnlineBackup: 3
DeviceProtection: 3
TechSupport: 3
StreamingTV: 3
StreamingMovies: 3
MultipleLines: 3


Creating df_merged_ohe

In [28]:
df_merged_ohe = pd.get_dummies(df_merged, drop_first = True)

In [29]:
df_merged_ohe.info()
df_merged_ohe.head(20)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6105 entries, 1 to 7042
Data columns (total 33 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   EndDate                                6105 non-null   int64  
 1   PaperlessBilling                       6105 non-null   int64  
 2   MonthlyCharges                         6105 non-null   float64
 3   TotalCharges                           6105 non-null   float64
 4   gender                                 6105 non-null   int64  
 5   SeniorCitizen                          6105 non-null   int64  
 6   Partner                                6105 non-null   int64  
 7   Dependents                             6105 non-null   int64  
 8   Tenure                                 6105 non-null   int64  
 9   begin_year                             6105 non-null   int64  
 10  begin_month                            6105 non-null   int64  
 11  begi

Unnamed: 0,EndDate,PaperlessBilling,MonthlyCharges,TotalCharges,gender,SeniorCitizen,Partner,Dependents,Tenure,begin_year,...,DeviceProtection_Yes,DeviceProtection_unknown,TechSupport_Yes,TechSupport_unknown,StreamingTV_Yes,StreamingTV_unknown,StreamingMovies_Yes,StreamingMovies_unknown,MultipleLines_Yes,MultipleLines_unknown
1,0,0,56.95,1633.225,1,0,0,0,897,2017,...,1,0,0,0,0,0,0,0,0,0
3,0,0,42.3,1650.4,1,0,0,0,1232,2016,...,1,0,1,0,0,0,0,0,0,1
4,1,1,70.7,-166.5,0,0,0,0,14,2019,...,0,0,0,0,0,0,0,0,0,0
5,1,1,99.65,372.075,0,0,0,0,198,2019,...,1,0,0,0,1,0,1,0,1,0
6,0,1,89.1,1548.45,1,0,0,1,532,2018,...,0,0,0,0,1,0,0,0,1,0
7,0,0,29.75,168.025,0,0,0,0,167,2019,...,0,0,0,0,0,0,0,0,0,1
8,1,1,104.8,2574.45,0,0,1,0,806,2017,...,1,0,1,0,1,0,1,0,1,0
9,0,0,56.15,3235.275,1,0,0,1,1749,2014,...,0,0,0,0,0,0,0,0,0,0
10,0,1,49.95,362.675,1,0,1,1,257,2019,...,0,0,0,0,0,0,0,0,0,0
11,0,0,18.95,241.525,1,0,0,0,349,2018,...,0,1,0,1,0,1,0,1,0,0
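
To make the effect of `drop_first = True` concrete, here is a small sketch on an invented three-row frame whose category labels mirror the Type column. Numeric columns pass through unchanged, and a feature with three categories becomes two dummy columns because the first category (in sorted order) is dropped:

```python
import pandas as pd

# Toy frame, invented for illustration; the category labels mirror the Type column
toy = pd.DataFrame({
    'MonthlyCharges': [29.85, 56.95, 42.30],
    'Type': ['Month-to-month', 'One year', 'Two year'],
})

# Only the string column is encoded; 'Month-to-month' becomes the dropped reference level
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```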


In [30]:
df_merged['EndDate'].value_counts(normalize = True)

0    0.756102
1    0.243898
Name: EndDate, dtype: float64

df_merged_ohe is created and ready for model training. It is important to note the class imbalance (about 76% of customers stayed and 24% left), which I will address with class weights. 
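
With this imbalance, a constant model already looks good on accuracy. The sketch below uses a synthetic target with roughly the same class balance (not the notebook's data) and scikit-learn's `DummyClassifier` to show the floor that any real model should beat:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic target with roughly the class balance reported above (76% / 24%)
y = np.array([0] * 76 + [1] * 24)
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

# Always predicts the majority class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)

acc = accuracy_score(y, baseline.predict(X))
auc = roc_auc_score(y, baseline.predict_proba(X)[:, 1])
print(acc, auc)
```

The baseline reaches about 0.76 accuracy while its ROC AUC stays at 0.5, which is why ROC AUC is the more informative metric here.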

## Model Training

In [31]:
from sklearn.model_selection import train_test_split

Data split 60-20-20 for model training: training-validation-testing

In [32]:
features = df_merged_ohe.drop('EndDate', axis = 1)
target = df_merged_ohe['EndDate']

features_temp, features_test, target_temp, target_test = train_test_split(
    features, target, test_size=0.20, random_state=12345)

features_train, features_valid, target_train, target_valid = train_test_split(
    features_temp, target_temp, test_size=0.25, random_state=12345)
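
The resulting proportions can be verified with placeholder data. This sketch applies the same two-step scheme to 1000 dummy rows: 20% is held out for testing, and 25% of the remaining 80% (that is, 20% of the total) goes to validation, leaving 60% for training.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)  # placeholder features
y = np.arange(1000) % 2             # placeholder target

# Same two-step scheme as in the notebook cell
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=12345)

print(len(X_train), len(X_valid), len(X_test))
```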

First model is a Logistic Regression Model.

In [33]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

In [34]:
model_lr = LogisticRegression(class_weight = 'balanced')
model_lr.fit(features_train, target_train)
predict_lr = model_lr.predict(features_valid)
accuracy_score_lr = accuracy_score(target_valid, predict_lr)
f1_score_lr = f1_score(target_valid, predict_lr)
roc_auc_score_lr = roc_auc_score(target_valid, predict_lr)
print(f'Accuracy Score: {accuracy_score_lr}')
print(f'F1 Score: {f1_score_lr}')
print(f'Roc Auc Score: {roc_auc_score_lr}')

Accuracy Score: 0.7608517608517609
F1 Score: 0.6085790884718499
Roc Auc Score: 0.7629054054054054


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Achieved a ROC AUC score of 0.76 on the validation set with LogisticRegression (computed from predicted class labels rather than probabilities).
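
ROC AUC is a ranking metric, so passing hard 0/1 predictions to `roc_auc_score` generally gives a different value than passing scores such as `predict_proba(...)[:, 1]`. A minimal sketch with hand-made numbers (not the notebook's data), where thresholding at 0.5 stands in for `predict`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hand-made labels and scores, for illustration only
y_true = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.3, 0.6, 0.4, 0.7, 0.9])

# Hard labels at a 0.5 threshold, which is what predict() would return
labels = (scores >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, labels)
print(round(auc_from_scores, 3), round(auc_from_labels, 3))
```

In this toy case the score-based AUC is 8/9 while the label-based AUC is 2/3, because thresholding discards the ordering within each predicted class. This matters when comparing against models whose AUC is computed from probabilities.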

In [35]:
from sklearn.ensemble import RandomForestClassifier

Second Model is a RandomForestClassifier

In [36]:
model_rf = RandomForestClassifier(n_estimators = 100, max_depth = 50, class_weight = 'balanced', random_state = 12345)
model_rf.fit(features_train, target_train)
predict_rf = model_rf.predict(features_valid)
accuracy_score_rf = accuracy_score(target_valid, predict_rf)
f1_score_rf = f1_score(target_valid, predict_rf)
roc_auc_score_rf = roc_auc_score(target_valid, predict_rf)
print(f'Accuracy Score: {accuracy_score_rf}')
print(f'F1 Score: {f1_score_rf}')
print(f'Roc Auc Score: {roc_auc_score_rf}')

Accuracy Score: 0.8402948402948403
F1 Score: 0.5962732919254659
Roc Auc Score: 0.72


Achieved a ROC AUC score of 0.72 on the validation set with RandomForestClassifier (computed from predicted class labels rather than probabilities).

In [37]:
import lightgbm as lgb
from lightgbm import LGBMClassifier

Third Model is LightGBM Gradient Boosting Tree

In [38]:
model_lgb = LGBMClassifier(class_weight = 'balanced',  
                           random_state = 12345,
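                           # 'n_estimator' is not a recognized name (the sklearn API uses n_estimators), so the default of 100 trees applies
                           n_estimator = 500,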
                           learning_rate = 0.1,
                           max_depth = 20)
model_lgb.fit(features_train, target_train,
              eval_set=[(features_valid, target_valid)],
              eval_metric = 'auc',
              early_stopping_rounds = 50,
              verbose = 10)





[10]	valid_0's auc: 0.876342	valid_0's binary_logloss: 0.472121
[20]	valid_0's auc: 0.882504	valid_0's binary_logloss: 0.410484
[30]	valid_0's auc: 0.890856	valid_0's binary_logloss: 0.382523
[40]	valid_0's auc: 0.894423	valid_0's binary_logloss: 0.366237
[50]	valid_0's auc: 0.898397	valid_0's binary_logloss: 0.353372
[60]	valid_0's auc: 0.898868	valid_0's binary_logloss: 0.346105
[70]	valid_0's auc: 0.901819	valid_0's binary_logloss: 0.337751
[80]	valid_0's auc: 0.903236	valid_0's binary_logloss: 0.332033
[90]	valid_0's auc: 0.90477	valid_0's binary_logloss: 0.3265
[100]	valid_0's auc: 0.906611	valid_0's binary_logloss: 0.321533


LGBMClassifier(class_weight='balanced', max_depth=20, n_estimator=500,
               random_state=12345)

In [39]:
print("Best AUC score:", model_lgb.best_score_['valid_0']['auc'])

Best AUC score: 0.9066106647187728


Achieved a ROC AUC score of 0.907 on its best iteration with LightGBM.

In [40]:
from catboost import CatBoostClassifier

The final model is a CatBoost gradient boosting tree.

In [41]:
model_cb = CatBoostClassifier(random_seed = 12345,
                              #auto_class_weights='Balanced',
                              class_weights = [1,3],
                              iterations = 5000, 
                              learning_rate = 0.01,
                              loss_function='Logloss',
                              eval_metric='AUC',
                              early_stopping_rounds=500,
                              use_best_model=True,
                              verbose = 100)
model_cb.fit(features_train, target_train,
             eval_set=(features_valid, target_valid))

0:	test: 0.8303269	best: 0.8303269 (0)	total: 47.5ms	remaining: 3m 57s
100:	test: 0.8628598	best: 0.8628671 (99)	total: 190ms	remaining: 9.22s
200:	test: 0.8683017	best: 0.8683455 (199)	total: 327ms	remaining: 7.81s
300:	test: 0.8706830	best: 0.8706830 (300)	total: 472ms	remaining: 7.36s
400:	test: 0.8725420	best: 0.8726808 (383)	total: 614ms	remaining: 7.04s
500:	test: 0.8754711	best: 0.8755150 (499)	total: 751ms	remaining: 6.75s
600:	test: 0.8767604	best: 0.8768335 (598)	total: 889ms	remaining: 6.51s
700:	test: 0.8798356	best: 0.8798356 (700)	total: 1.02s	remaining: 6.28s
800:	test: 0.8828378	best: 0.8828415 (799)	total: 1.16s	remaining: 6.1s
900:	test: 0.8856245	best: 0.8856245 (900)	total: 1.3s	remaining: 5.93s
1000:	test: 0.8882652	best: 0.8882652 (998)	total: 1.44s	remaining: 5.77s
1100:	test: 0.8907122	best: 0.8907122 (1100)	total: 1.59s	remaining: 5.63s
1200:	test: 0.8934660	best: 0.8934660 (1200)	total: 1.75s	remaining: 5.52s
1300:	test: 0.8960920	best: 0.8961103 (1296)	total:

<catboost.core.CatBoostClassifier at 0x7f8b167d8c40>

Achieved a ROC AUC score of 0.909 on its best iteration with CatBoost.

For model testing I will use the best iteration of CatBoost, since it achieved the highest ROC AUC score on the validation set.

## Model Testing 

In [42]:
pred_proba_cb = model_cb.predict_proba(features_test)[:, 1]
roc_auc_cb = roc_auc_score(target_test, pred_proba_cb)
print("Test AUC Catboost:", roc_auc_cb)

Test AUC Catboost: 0.9000324402435967


CatBoost achieved a ROC AUC score of 0.90 on the test set, which is above the required minimum of 0.88.
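
The AUC value summarizes a ROC curve. As a sketch of how the curve points could be obtained, the example below uses scikit-learn's `roc_curve` on hand-made labels and scores (not the notebook's data); in the notebook, `target_test` and `pred_proba_cb` would take their place.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hand-made labels and scores, for illustration only
y_true = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.3, 0.6, 0.4, 0.7, 0.9])

# One (fpr, tpr) point per decision threshold
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Area under the curve by the trapezoidal rule; matches roc_auc_score
area = auc(fpr, tpr)
print(list(zip(fpr.round(2), tpr.round(2))), round(area, 3))
```

Plotting `fpr` against `tpr` with any plotting library, together with the diagonal from (0, 0) to (1, 1), gives the usual ROC chart.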