# MCLabs Churn Analyzer - Model Creation

This Jupyter Notebook creates two ML models (logistic regression and XGBoost), trains them on our training data, then runs a simple analysis on held-out test data.

Note that this notebook converts the previous target encoding to a new encoding:
- Not Active (Previously 0) -> Dropped
- Recovered (Previously 1) -> 0
- Churned (Previously 2) -> 1
- Active (Previously 3) -> 2
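
The relabeling of the three remaining classes can be sketched with `LabelEncoder`, which sorts the distinct labels and assigns each one its index. This is a minimal sketch on toy labels, assuming rows with the previous value 0 (Not Active) have already been removed before encoding; `previous_labels` and `mapping` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels in the previous encoding (1 = Recovered, 2 = Churned, 3 = Active).
# Assumption: rows with the previous label 0 (Not Active) are already removed.
previous_labels = np.array([3, 1, 2, 2, 1, 3])

encoder = LabelEncoder()
new_labels = encoder.fit_transform(previous_labels)

# classes_ holds the sorted original labels; each label's index is its new value.
mapping = {int(old): int(new) for new, old in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the previous encoding from the new one.
restored = encoder.inverse_transform(new_labels)
```

`inverse_transform` is useful when predictions need to be reported in the previous encoding.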

## Module and Package Imports

This section imports the modules required by this notebook.

In [None]:
# System Modules
import os

# Data-Related Modules
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Pipelining, Model, and Persistence Modules
import joblib
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

## Directory Setup

This section lets you specify the locations of the `data` and `model-internals` directories. Override these paths when running in Google Colab or any other environment where the folders are not at the standard locations used in the repository.

In [None]:
MCA_DATADIRECTORYPATH = "../data"

In [None]:
MCA_MODELINTERNALSDIRECTORYPATH = "../model-internals"
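
If editing the paths by hand is inconvenient (for example in Colab), one option is to read them from environment variables with a fallback. This is only a sketch: the helper `resolve_directory`, the variable names `MCA_DATA_DIR` and `MCA_MODEL_INTERNALS_DIR`, and the fallback paths are all hypothetical and not part of the repository.

```python
import os


def resolve_directory(env_name: str, default: str) -> str:
    """Return the directory named by env_name, or default if it is unset or empty."""
    return os.environ.get(env_name) or default


# Hypothetical variable names and fallback paths; adjust to your own setup.
data_directory = resolve_directory("MCA_DATA_DIR", "../data")
model_internals_directory = resolve_directory("MCA_MODEL_INTERNALS_DIR", "../model-internals")
print(data_directory, model_internals_directory)
```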

## Master Dataframe Loading and Size Indication

This section loads the master dataframe and prints its shape, which shows how many samples and columns it contains.

In [3]:
# Make master dataframe path
masterDfFilePath = os.path.join(MCA_DATADIRECTORYPATH, "master", "master_dataframe.csv")

# Load master dataframe from file
masterDf = pd.read_csv(masterDfFilePath)
print(f"Loaded master dataframe from `{masterDfFilePath}`!")

# Print master dataframe shape after loading
print(f"Master DataFrame Shape: {masterDf.shape}")

Loaded master dataframe from `MCA_MASTERDFPATH`!
Master DataFrame Shape: (11071, 55)


## Model Pipeline Creation

This section will:
- Split the data into feature and target sets, then into stratified training and test sets (80/20)
- One-hot encode the categorical feature and standardize the numerical features
- Build separate pipelines for logistic regression and XGBoost

In [None]:
# Separate features from target
MCA_Features = masterDf.drop(columns=["churn"])
MCA_Target = masterDf["churn"]

# Re-encode the target as 0, 1, 2 instead of 1, 2, 3
MCA_LabelEncoder = LabelEncoder()
MCA_Target = MCA_LabelEncoder.fit_transform(MCA_Target)

# Split the data
MCA_Features_Train, MCA_Features_Test, MCA_Target_Train, MCA_Target_Test = train_test_split(MCA_Features, MCA_Target, test_size=0.2, stratify=MCA_Target, random_state=2002)

# Identify which features are categorical
categoricalFeatures = ["plan_player_favorite_server"]

# Identify which features are numerical (note we do not include the last seen time here)
numericalFeatures = ["balance","lw_rev_total","lw_rev_phase","leaderboard_position_chems_all","leaderboard_position_chems_week","leaderboard_position_police_all","leaderboard_position_police_week","mcmmo_power_level","mcmmo_skill_ACROBATICS","mcmmo_skill_ALCHEMY","mcmmo_skill_ARCHERY","mcmmo_skill_AXES","mcmmo_skill_CROSSBOWS","mcmmo_skill_EXCAVATION","mcmmo_skill_FISHING","mcmmo_skill_HERBALISM","mcmmo_skill_MACES","mcmmo_skill_MINING","mcmmo_skill_REPAIR","mcmmo_skill_SALVAGE","mcmmo_skill_SMELTING","mcmmo_skill_SWORDS","mcmmo_skill_TAMING","mcmmo_skill_TRIDENTS","mcmmo_skill_UNARMED","mcmmo_skill_WOODCUTTING","chemrank","policerank","donorrank","goldrank","current_month_votes","plan_player_time_total_raw","plan_player_time_month_raw","plan_player_time_week_raw","plan_player_time_day_raw","plan_player_time_afk_raw","plan_player_latest_session_length_raw","plan_player_sessions_count","plan_player_relativePlaytime_totalmonth","plan_player_relativePlaytime_weekmonth","plan_player_relativePlaytime_dayweek","balance_change","lw_rev_total_change","lw_rev_phase_change","leaderboard_position_chems_all_change","leaderboard_position_chems_week_change","leaderboard_position_police_all_change","leaderboard_position_police_week_change","chemrank_change","policerank_change","donorrank_change","goldrank_change"]

# Create preprocessing transformers for encoding and scaling features
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categoricalFeatures),
        ("num", StandardScaler(), numericalFeatures)
    ]
)

# Define LogReg pipeline
MCA_Pipeline_LogReg = Pipeline([
    ("preprocessor", preprocessor),
    ("model", LogisticRegression(max_iter=1000, solver="lbfgs"))
])

# Define XGBoost pipeline
MCA_Pipeline_XGB = Pipeline([
    ("preprocessor", preprocessor),
    ("model", XGBClassifier(use_label_encoder=False, eval_metric="logloss", num_class=3, verbosity=0))
])
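
As a quick sanity check on the split above, the training and test sizes follow from the row count and `test_size=0.2`. The sketch below reproduces the arithmetic scikit-learn applies for a fractional `test_size` when `train_size` is not given (test share rounded up, remainder assigned to training). The row count comes from the master dataframe shape printed earlier; the variable names are illustrative.

```python
import math

n_samples = 11071  # rows in the master dataframe, from the printed shape
test_size = 0.2    # same fraction passed to train_test_split

# The test share is rounded up; the remaining rows go to the training set.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"train rows: {n_train}, test rows: {n_test}")
```

Because `stratify=MCA_Target` is passed, each split keeps roughly the same class proportions as the full target.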

## External Data Inspection

This section is a final check on the dataset before training. It fits the preprocessor on the training features and transforms both splits, then prints the resulting shapes and counts any NaN or infinite values. The transformed arrays and the results gathered in this section are not used in the rest of this notebook.

In [5]:
# Transform outside the pipeline for inspection
X_train_pre = preprocessor.fit_transform(MCA_Features_Train)
X_test_pre = preprocessor.transform(MCA_Features_Test)

print("X_train_pre shape:", X_train_pre.shape)
print("NaN count (train):", np.isnan(X_train_pre).sum())
print("Inf count (train):", np.isinf(X_train_pre).sum())

print("X_test_pre shape:", X_test_pre.shape)
print("NaN count (test):", np.isnan(X_test_pre).sum())
print("Inf count (test):", np.isinf(X_test_pre).sum())

X_train_pre shape: (8856, 57)
NaN count (train): 0
Inf count (train): 0
X_test_pre shape: (2215, 57)
NaN count (test): 0
Inf count (test): 0
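
The 57 output columns are consistent with the 52 names in `numericalFeatures` plus, by subtraction, five one-hot columns for `plan_player_favorite_server`. Columns not listed in either transformer are dropped because `ColumnTransformer` defaults to `remainder="drop"`. A toy sketch of the same behavior, using made-up column names and values:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: one categorical column, two numerical columns, one unlisted column.
toy = pd.DataFrame({
    "server": ["a", "b", "a", "c"],        # 3 distinct values -> 3 one-hot columns
    "balance": [10.0, 20.0, 30.0, 40.0],   # scaled, stays 1 column
    "votes": [1.0, 0.0, 3.0, 2.0],         # scaled, stays 1 column
    "last_seen": [100, 200, 300, 400],     # not listed below, so it is dropped
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["server"]),
        ("num", StandardScaler(), ["balance", "votes"]),
    ]
)

toy_out = toy_preprocessor.fit_transform(toy)
print(toy_out.shape)  # 3 one-hot columns + 2 scaled columns
```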


## Logistic Regression Model Training and Testing

This section trains the logistic regression model on the training split and evaluates it on the test split, using the pipeline defined above.

In [6]:
MCA_Pipeline_LogReg.fit(MCA_Features_Train, MCA_Target_Train)
MCA_Target_Pred = MCA_Pipeline_LogReg.predict(MCA_Features_Test)
print(f"Accuracy: {accuracy_score(MCA_Target_Test, MCA_Target_Pred)}")
print(f"Confusion Matrix:\n{confusion_matrix(MCA_Target_Test, MCA_Target_Pred)}")
print(f"Classification Report:\n{classification_report(MCA_Target_Test, MCA_Target_Pred)}")

Accuracy: 0.6767494356659142
Confusion Matrix:
[[780  13  10]
 [405 520  47]
 [ 99 142 199]]
Classification Report:
              precision    recall  f1-score   support

           0       0.61      0.97      0.75       803
           1       0.77      0.53      0.63       972
           2       0.78      0.45      0.57       440

    accuracy                           0.68      2215
   macro avg       0.72      0.65      0.65      2215
weighted avg       0.71      0.68      0.66      2215
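
The figures in the classification report can be recomputed from the confusion matrix alone, which makes the weak spots easier to see. The sketch below copies the matrix printed above (rows are true classes, columns are predicted classes) and derives accuracy, precision, and recall with plain Python; the variable names are illustrative.

```python
# Confusion matrix printed above: rows are true classes, columns are predicted classes.
confusion = [
    [780, 13, 10],
    [405, 520, 47],
    [99, 142, 199],
]

n_classes = len(confusion)
total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(n_classes)) / total

# Recall divides the diagonal by the row sum; precision divides it by the column sum.
recall = [confusion[i][i] / sum(confusion[i]) for i in range(n_classes)]
precision = [
    confusion[i][i] / sum(confusion[r][i] for r in range(n_classes))
    for i in range(n_classes)
]

for i in range(n_classes):
    print(f"class {i}: precision={precision[i]:.2f} recall={recall[i]:.2f}")
print(f"accuracy={accuracy:.4f}")
```

Class 0 has high recall but low precision because 405 true class 1 rows and 99 true class 2 rows are predicted as class 0.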



## XGBoost Model Training and Testing

This section tunes the XGBoost model with a cross-validated grid search on the training split, then evaluates the best configuration on the test split, using the pipeline defined above.

In [7]:
# Map of hyperparameters and possible values to try tuning XGBoost with
hyperParameterMap = {
    "model__n_estimators": [100, 200, 400],      # boosting rounds
    "model__max_depth": [3, 5, 7],               # tree depth
    "model__learning_rate": [0.01, 0.1, 0.3],    # step size shrinkage
    "model__subsample": [0.8, 1.0],              # row sampling
    "model__colsample_bytree": [0.8, 1.0],       # feature sampling
    "model__scale_pos_weight": [1, 2, 5]         # class-imbalance weight (defined for binary targets; may have no effect with 3 classes)
}

# Grid search for best hyperparameters (5-fold CV)
MCA_Pipeline_GridSearch_XGB = GridSearchCV(
    estimator=MCA_Pipeline_XGB,
    param_grid=hyperParameterMap,
    scoring="accuracy",
    cv=5,
    n_jobs=-1,
)

# Run the grid search (the best pipeline is refit on the full training set), then test it
MCA_Pipeline_GridSearch_XGB.fit(MCA_Features_Train, MCA_Target_Train)
MCA_Target_Pred = MCA_Pipeline_GridSearch_XGB.predict(MCA_Features_Test)
print(f"Best Parameters: {MCA_Pipeline_GridSearch_XGB.best_params_}")
print(f"Best Cross-Validation Score: {MCA_Pipeline_GridSearch_XGB.best_score_}")
print(f"Accuracy: {accuracy_score(MCA_Target_Test, MCA_Target_Pred)}")
print(f"Confusion Matrix:\n{confusion_matrix(MCA_Target_Test, MCA_Target_Pred)}")
print(f"Classification Report:\n{classification_report(MCA_Target_Test, MCA_Target_Pred)}")

Best Parameters: {'model__colsample_bytree': 0.8, 'model__learning_rate': 0.01, 'model__max_depth': 5, 'model__n_estimators': 400, 'model__scale_pos_weight': 1, 'model__subsample': 1.0}
Best Cross-Validation Score: 0.8054426533325346
Accuracy: 0.8018058690744921
Confusion Matrix:
[[653 142   8]
 [ 25 883  64]
 [ 26 174 240]]
Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.81      0.87       803
           1       0.74      0.91      0.81       972
           2       0.77      0.55      0.64       440

    accuracy                           0.80      2215
   macro avg       0.81      0.76      0.77      2215
weighted avg       0.81      0.80      0.80      2215
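
To gauge the cost of the grid search above, the number of model fits can be counted directly from the grid. The sketch below copies the candidate values from `hyperParameterMap` and multiplies the grid size by the number of folds; `grid`, `n_candidates`, and `n_fits` are illustrative names.

```python
import math

# Candidate values copied from hyperParameterMap above.
grid = {
    "model__n_estimators": [100, 200, 400],
    "model__max_depth": [3, 5, 7],
    "model__learning_rate": [0.01, 0.1, 0.3],
    "model__subsample": [0.8, 1.0],
    "model__colsample_bytree": [0.8, 1.0],
    "model__scale_pos_weight": [1, 2, 5],
}
cv_folds = 5

# Every combination of values is one candidate; each candidate is fit once per fold.
n_candidates = math.prod(len(values) for values in grid.values())
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

With the default `refit=True`, `GridSearchCV` adds one final fit of the best candidate on the full training set.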



## Model Pipeline Saving

This section saves the label encoder and both fitted pipelines created in this notebook to files for later use.

In [None]:
# Save the label encoder
labelEncoderFilePath = os.path.join(MCA_MODELINTERNALSDIRECTORYPATH, "MCA_LabelEncoder.pkl")
joblib.dump(MCA_LabelEncoder, labelEncoderFilePath)

# Save the entire LogReg pipeline
logRegFilePath = os.path.join(MCA_MODELINTERNALSDIRECTORYPATH, "MCA_Pipeline_LogReg.pkl")
joblib.dump(MCA_Pipeline_LogReg, logRegFilePath)

# Save the fitted XGBoost grid search object (the tuned pipeline is available as best_estimator_)
xgBoostFilePath = os.path.join(MCA_MODELINTERNALSDIRECTORYPATH, "MCA_Pipeline_GridSearch_XGB.pkl")
joblib.dump(MCA_Pipeline_GridSearch_XGB, xgBoostFilePath)

['model-internals/MCA_Pipeline_GridSearch_XGB.pkl']
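
For later use, the saved files can be read back with `joblib.load`. The sketch below shows the round trip on a small stand-in pipeline with synthetic data, written to a temporary directory so it does not touch `model-internals`; the toy pipeline, data, and file name are assumptions for illustration only.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 60 rows, 3 features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] > 0).astype(int)

toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
]).fit(X, y)

# Dump the fitted pipeline to disk, then load it back as a new object.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.pkl")
    joblib.dump(toy_pipeline, path)
    reloaded = joblib.load(path)

# The reloaded pipeline should reproduce the original predictions.
same_predictions = bool(np.array_equal(toy_pipeline.predict(X), reloaded.predict(X)))
print(same_predictions)
```

Loading a pickled pipeline generally requires library versions compatible with those used when it was saved, so recording the scikit-learn and XGBoost versions alongside the files is a reasonable precaution.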