# Build the Model(s)

## Ensemble Averaging Model

### Load Data / Imports

In [23]:
import pandas as pd
from scipy import stats

In [24]:
df_churn_training_eam = pd.read_csv('training_set.csv')
df_churn_testing_eam = pd.read_csv('testing_set.csv')

In [25]:
df_churn_training_eam

Unnamed: 0,AccountAge,MonthlyCharges,TotalCharges,SubscriptionType,PaymentMethod,PaperlessBilling,ContentType,MultiDeviceAccess,DeviceRegistered,ViewingHoursPerWeek,...,ContentDownloadsPerMonth,GenrePreference,UserRating,SupportTicketsPerMonth,Gender,WatchlistSize,ParentalControl,SubtitlesEnabled,CustomerID,Churn
0,20,11.055215,221.104302,Premium,Mailed check,No,Both,No,Mobile,36.758104,...,10,Sci-Fi,2.176498,4,Male,3,No,No,CB6SXPNVZA,0
1,57,5.175208,294.986882,Basic,Credit card,Yes,Movies,No,Tablet,32.450568,...,18,Action,3.478632,8,Male,23,No,Yes,S7R2G87O09,0
2,73,12.106657,883.785952,Basic,Mailed check,Yes,Movies,No,Computer,7.395160,...,23,Fantasy,4.238824,6,Male,1,Yes,Yes,EASDC20BDT,0
3,32,7.263743,232.439774,Basic,Electronic check,No,TV Shows,No,Tablet,27.960389,...,30,Drama,4.276013,2,Male,24,Yes,Yes,NPF69NT69N,0
4,57,16.953078,966.325422,Premium,Electronic check,Yes,TV Shows,No,TV,20.083397,...,20,Comedy,3.616170,4,Female,0,No,No,4LGYPK7VOL,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
243782,77,9.639902,742.272460,Basic,Mailed check,No,Movies,No,Computer,13.502729,...,47,Sci-Fi,3.697451,1,Male,8,Yes,No,FBZ38J108Z,0
243783,117,13.049257,1526.763053,Premium,Credit card,No,TV Shows,Yes,TV,24.963291,...,35,Comedy,1.449742,4,Male,20,No,No,W4AO1Y6NAI,0
243784,113,14.514569,1640.146267,Premium,Credit card,Yes,TV Shows,No,TV,10.628728,...,44,Action,4.012217,6,Male,13,Yes,Yes,0H3SWWI7IU,0
243785,7,18.140555,126.983887,Premium,Bank transfer,Yes,TV Shows,No,TV,30.466782,...,36,Fantasy,2.135789,7,Female,5,No,Yes,63SJ44RT4A,0


### Start Linear Regressions

In [26]:
# Evaluate the fitted line y = m*x + c at any value x.
# Given the slope (m) and intercept (c) that define the line, return the predicted y for x.
# x is the feature passed in; m and c are obtained from the linear regression function.
def best_fit(x, slope, intercept):
    return slope * x + intercept
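
As a rough illustration of how `best_fit` pairs with `stats.linregress`, the sketch below fits a line to a small made-up binary target. The toy arrays are assumptions for demonstration only and are not drawn from the churn data.

```python
import numpy as np
from scipy import stats

def best_fit(x, slope, intercept):
    """Return the fitted value y = slope * x + intercept."""
    return slope * x + intercept

# Toy data (made up): a 0/1 target that becomes rarer as x grows
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([1, 1, 1, 0, 1, 0, 0, 0], dtype=float)

# linregress returns the slope and intercept of the least-squares line
result = stats.linregress(x, y)
fitted = best_fit(x, result.slope, result.intercept)

print(result.slope, result.intercept)
print(fitted.mean(), y.mean())  # the two means agree for a least-squares fit
```

Because a least-squares line with an intercept always reproduces the mean of the target, the fitted values behave like a score centred on the base rate rather than a calibrated probability.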

In [27]:
# Account Age vs Churn 

a = df_churn_training_eam['AccountAge']
b = df_churn_training_eam['Churn']

# Perform linear regression
slope1, intercept1, r, p, std_err = stats.linregress(a,b)

In [28]:
# Average Viewing Duration vs Churn

a = df_churn_training_eam['AverageViewingDuration']
b = df_churn_training_eam['Churn']

# Perform linear regression
slope2, intercept2, r, p, std_err = stats.linregress(a,b)

In [29]:
# Viewing Hours per Week vs Churn

a = df_churn_training_eam['ViewingHoursPerWeek']
b = df_churn_training_eam['Churn']

# Perform linear regression
slope3, intercept3, r, p, std_err = stats.linregress(a,b)

In [30]:
# Content Downloads per Month vs Churn

a = df_churn_training_eam['ContentDownloadsPerMonth']
b = df_churn_training_eam['Churn']

# Perform linear regression
slope4, intercept4, r, p, std_err = stats.linregress(a,b)

Create function for Ensemble Averaging

Pass the slope, intercept, and x variable for each linear regression into the `best_fit` function to obtain four series of predicted y values, then average them.

In [31]:
def cal_churn(account_age, avg_viewing_duration, viewing_hours_per_week, content_downloads_per_month):
    
    reg_1 = best_fit(account_age, slope1, intercept1)
    reg_2 = best_fit(avg_viewing_duration, slope2, intercept2)
    reg_3 = best_fit(viewing_hours_per_week, slope3, intercept3)
    reg_4 = best_fit(content_downloads_per_month, slope4, intercept4)
    
    return (reg_1 + reg_2 + reg_3 + reg_4) / 4
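
The same averaging idea can be sketched for any number of single-feature regressions. The helper name `ensemble_average` and the coefficients below are hypothetical; they are not the values fitted in this notebook.

```python
import numpy as np

def best_fit(x, slope, intercept):
    """Return the fitted value y = slope * x + intercept."""
    return slope * x + intercept

def ensemble_average(features, params):
    """Average the fitted lines of several single-feature regressions.

    features: list of array-likes, one per regression
    params:   list of (slope, intercept) pairs in the same order
    """
    preds = [
        best_fit(np.asarray(x, dtype=float), slope, intercept)
        for x, (slope, intercept) in zip(features, params)
    ]
    # Element-wise mean across the individual regressions
    return np.mean(preds, axis=0)

# Hypothetical coefficients for two features and two customers
params = [(-0.001, 0.25), (0.002, 0.10)]
features = [[10, 100], [20, 40]]

scores = ensemble_average(features, params)
print(scores)
```

Each regression contributes equally here, which matches the simple mean used in `cal_churn`; a weighted mean would be a different design choice.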

Create a new column and populate it with the ensemble scores for each customer using the `cal_churn` function.

This is mainly a check that the function works on the training data.

In [32]:
a = df_churn_training_eam['AccountAge']
b = df_churn_training_eam['AverageViewingDuration']
c = df_churn_training_eam['ViewingHoursPerWeek']
d = df_churn_training_eam['ContentDownloadsPerMonth']

df_churn_training_eam['Predicted Churn'] = cal_churn(a, b, c, d)

In [33]:
df_churn_training_eam

Unnamed: 0,AccountAge,MonthlyCharges,TotalCharges,SubscriptionType,PaymentMethod,PaperlessBilling,ContentType,MultiDeviceAccess,DeviceRegistered,ViewingHoursPerWeek,...,GenrePreference,UserRating,SupportTicketsPerMonth,Gender,WatchlistSize,ParentalControl,SubtitlesEnabled,CustomerID,Churn,Predicted Churn
0,20,11.055215,221.104302,Premium,Mailed check,No,Both,No,Mobile,36.758104,...,Sci-Fi,2.176498,4,Male,3,No,No,CB6SXPNVZA,0,0.206198
1,57,5.175208,294.986882,Basic,Credit card,Yes,Movies,No,Tablet,32.450568,...,Action,3.478632,8,Male,23,No,Yes,S7R2G87O09,0,0.194052
2,73,12.106657,883.785952,Basic,Mailed check,Yes,Movies,No,Computer,7.395160,...,Fantasy,4.238824,6,Male,1,Yes,Yes,EASDC20BDT,0,0.199578
3,32,7.263743,232.439774,Basic,Electronic check,No,TV Shows,No,Tablet,27.960389,...,Drama,4.276013,2,Male,24,Yes,Yes,NPF69NT69N,0,0.172849
4,57,16.953078,966.325422,Premium,Electronic check,Yes,TV Shows,No,TV,20.083397,...,Comedy,3.616170,4,Female,0,No,No,4LGYPK7VOL,0,0.200447
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
243782,77,9.639902,742.272460,Basic,Mailed check,No,Movies,No,Computer,13.502729,...,Sci-Fi,3.697451,1,Male,8,Yes,No,FBZ38J108Z,0,0.163389
243783,117,13.049257,1526.763053,Premium,Credit card,No,TV Shows,Yes,TV,24.963291,...,Comedy,1.449742,4,Male,20,No,No,W4AO1Y6NAI,0,0.144698
243784,113,14.514569,1640.146267,Premium,Credit card,Yes,TV Shows,No,TV,10.628728,...,Action,4.012217,6,Male,13,Yes,Yes,0H3SWWI7IU,0,0.122321
243785,7,18.140555,126.983887,Premium,Bank transfer,Yes,TV Shows,No,TV,30.466782,...,Fantasy,2.135789,7,Female,5,No,Yes,63SJ44RT4A,0,0.172654


### Finding a Base Threshold

What would be a reasonable starting point for the threshold?

In [34]:
# What does the predicted churn data look like? What is the range, how tightly clustered around the mean is it?
print(df_churn_training_eam['Predicted Churn'].describe())

print('\n')

# What is the base churn rate?
print(df_churn_training_eam['Churn'].mean())

count    243787.000000
mean          0.181232
std           0.029537
min           0.085330
25%           0.160493
50%           0.181262
75%           0.202027
max           0.277364
Name: Predicted Churn, dtype: float64


0.18123197709475894


Inspection of the ensemble score distribution showed that predicted churn values were tightly clustered between roughly 0.09 and 0.28. Commonly used probability thresholds such as 0.5 (also used in the first version of this project) were therefore inappropriate: with a maximum score of 0.277, no customer would ever be classified as a churner. As a baseline, the classification threshold was set equal to the average churn rate in the training data (≈18%), which is also the mean of the predicted scores, so the model can meaningfully predict both churn and non-churn outcomes.

Threshold = 0.18123
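
A minimal sketch of deriving the threshold from the data instead of hardcoding it is shown below. The small frame is made up for illustration; in the notebook the same expressions would be applied to `df_churn_training_eam`.

```python
import pandas as pd

# Toy frame standing in for the training set (values are made up)
df = pd.DataFrame({
    'Churn': [0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
    'Predicted Churn': [0.12, 0.15, 0.17, 0.19, 0.24,
                        0.16, 0.22, 0.21, 0.14, 0.205],
})

# Base churn rate used as the classification threshold
threshold = df['Churn'].mean()

# Number of customers flagged as churners under each threshold
flagged_at_half = (df['Predicted Churn'] >= 0.5).sum()
flagged_at_base = (df['Predicted Churn'] >= threshold).sum()

print(threshold, flagged_at_half, flagged_at_base)
```

With scores this tightly clustered, a 0.5 cut-off flags nobody, while the base-rate cut-off splits the customers into two groups.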

### Testing the Model

In [35]:
df_churn_testing_eam

Unnamed: 0,AccountAge,MonthlyCharges,TotalCharges,SubscriptionType,PaymentMethod,PaperlessBilling,ContentType,MultiDeviceAccess,DeviceRegistered,ViewingHoursPerWeek,...,ContentDownloadsPerMonth,GenrePreference,UserRating,SupportTicketsPerMonth,Gender,WatchlistSize,ParentalControl,SubtitlesEnabled,CustomerID,Churn
0,83,16.904894,1403.106197,Basic,Bank transfer,Yes,TV Shows,No,Mobile,10.147601,...,33,Sci-Fi,1.591564,1,Female,22,No,No,A92JC7VFI2,0
1,85,12.145583,1032.374522,Standard,Electronic check,Yes,Both,No,Tablet,14.745020,...,22,Fantasy,1.010706,7,Male,19,No,Yes,UH8QJF821S,0
2,119,6.241227,742.706007,Premium,Mailed check,Yes,Both,No,Computer,29.246975,...,37,Sci-Fi,4.247768,0,Female,3,Yes,No,J8KZIWHFST,0
3,102,10.354508,1056.159857,Premium,Electronic check,No,Both,Yes,Tablet,18.916349,...,7,Sci-Fi,2.103056,7,Female,3,No,No,FAVRCDAQMH,0
4,49,10.085545,494.191689,Premium,Bank transfer,No,Both,No,Computer,16.915389,...,17,Sci-Fi,2.369549,3,Male,15,No,No,DI0XSRMN3X,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
123917,77,9.639902,742.272460,Basic,Mailed check,No,Movies,No,Computer,13.502729,...,47,Sci-Fi,3.697451,1,Male,8,Yes,No,FBZ38J108Z,0
123918,117,13.049257,1526.763053,Premium,Credit card,No,TV Shows,Yes,TV,24.963291,...,35,Comedy,1.449742,4,Male,20,No,No,W4AO1Y6NAI,0
123919,113,14.514569,1640.146267,Premium,Credit card,Yes,TV Shows,No,TV,10.628728,...,44,Action,4.012217,6,Male,13,Yes,Yes,0H3SWWI7IU,0
123920,7,18.140555,126.983887,Premium,Bank transfer,Yes,TV Shows,No,TV,30.466782,...,36,Fantasy,2.135789,7,Female,5,No,Yes,63SJ44RT4A,0


Create a new column and populate it with the ensemble scores for each customer using the cal_churn function (for the actual test set this time).

In [36]:
a = df_churn_testing_eam['AccountAge']
b = df_churn_testing_eam['AverageViewingDuration']
c = df_churn_testing_eam['ViewingHoursPerWeek']
d = df_churn_testing_eam['ContentDownloadsPerMonth']

df_churn_testing_eam['Predicted Churn'] = cal_churn(a, b, c, d)

In [37]:
df_churn_testing_eam

Unnamed: 0,AccountAge,MonthlyCharges,TotalCharges,SubscriptionType,PaymentMethod,PaperlessBilling,ContentType,MultiDeviceAccess,DeviceRegistered,ViewingHoursPerWeek,...,GenrePreference,UserRating,SupportTicketsPerMonth,Gender,WatchlistSize,ParentalControl,SubtitlesEnabled,CustomerID,Churn,Predicted Churn
0,83,16.904894,1403.106197,Basic,Bank transfer,Yes,TV Shows,No,Mobile,10.147601,...,Sci-Fi,1.591564,1,Female,22,No,No,A92JC7VFI2,0,0.190707
1,85,12.145583,1032.374522,Standard,Electronic check,Yes,Both,No,Tablet,14.745020,...,Fantasy,1.010706,7,Male,19,No,Yes,UH8QJF821S,0,0.179443
2,119,6.241227,742.706007,Premium,Mailed check,Yes,Both,No,Computer,29.246975,...,Sci-Fi,4.247768,0,Female,3,Yes,No,J8KZIWHFST,0,0.140766
3,102,10.354508,1056.159857,Premium,Electronic check,No,Both,Yes,Tablet,18.916349,...,Sci-Fi,2.103056,7,Female,3,No,No,FAVRCDAQMH,0,0.189426
4,49,10.085545,494.191689,Premium,Bank transfer,No,Both,No,Computer,16.915389,...,Sci-Fi,2.369549,3,Male,15,No,No,DI0XSRMN3X,0,0.178670
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
123917,77,9.639902,742.272460,Basic,Mailed check,No,Movies,No,Computer,13.502729,...,Sci-Fi,3.697451,1,Male,8,Yes,No,FBZ38J108Z,0,0.163389
123918,117,13.049257,1526.763053,Premium,Credit card,No,TV Shows,Yes,TV,24.963291,...,Comedy,1.449742,4,Male,20,No,No,W4AO1Y6NAI,0,0.144698
123919,113,14.514569,1640.146267,Premium,Credit card,Yes,TV Shows,No,TV,10.628728,...,Action,4.012217,6,Male,13,Yes,Yes,0H3SWWI7IU,0,0.122321
123920,7,18.140555,126.983887,Premium,Bank transfer,Yes,TV Shows,No,TV,30.466782,...,Fantasy,2.135789,7,Female,5,No,Yes,63SJ44RT4A,0,0.172654


Turn the continuous score (`Predicted Churn`) into a binary prediction by comparing it to the threshold.

In [38]:
Threshold = 0.18123

df_churn_testing_eam['Predicted Churn Binary'] = (df_churn_testing_eam['Predicted Churn'] >= Threshold).astype(int)

df_churn_testing_eam['Correct Prediction'] = df_churn_testing_eam['Predicted Churn Binary'] == df_churn_testing_eam['Churn']

In [39]:
df_churn_testing_eam

Unnamed: 0,AccountAge,MonthlyCharges,TotalCharges,SubscriptionType,PaymentMethod,PaperlessBilling,ContentType,MultiDeviceAccess,DeviceRegistered,ViewingHoursPerWeek,...,SupportTicketsPerMonth,Gender,WatchlistSize,ParentalControl,SubtitlesEnabled,CustomerID,Churn,Predicted Churn,Predicted Churn Binary,Correct Prediction
0,83,16.904894,1403.106197,Basic,Bank transfer,Yes,TV Shows,No,Mobile,10.147601,...,1,Female,22,No,No,A92JC7VFI2,0,0.190707,1,False
1,85,12.145583,1032.374522,Standard,Electronic check,Yes,Both,No,Tablet,14.745020,...,7,Male,19,No,Yes,UH8QJF821S,0,0.179443,0,True
2,119,6.241227,742.706007,Premium,Mailed check,Yes,Both,No,Computer,29.246975,...,0,Female,3,Yes,No,J8KZIWHFST,0,0.140766,0,True
3,102,10.354508,1056.159857,Premium,Electronic check,No,Both,Yes,Tablet,18.916349,...,7,Female,3,No,No,FAVRCDAQMH,0,0.189426,1,False
4,49,10.085545,494.191689,Premium,Bank transfer,No,Both,No,Computer,16.915389,...,3,Male,15,No,No,DI0XSRMN3X,0,0.178670,0,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
123917,77,9.639902,742.272460,Basic,Mailed check,No,Movies,No,Computer,13.502729,...,1,Male,8,Yes,No,FBZ38J108Z,0,0.163389,0,True
123918,117,13.049257,1526.763053,Premium,Credit card,No,TV Shows,Yes,TV,24.963291,...,4,Male,20,No,No,W4AO1Y6NAI,0,0.144698,0,True
123919,113,14.514569,1640.146267,Premium,Credit card,Yes,TV Shows,No,TV,10.628728,...,6,Male,13,Yes,Yes,0H3SWWI7IU,0,0.122321,0,True
123920,7,18.140555,126.983887,Premium,Bank transfer,Yes,TV Shows,No,TV,30.466782,...,7,Female,5,No,Yes,63SJ44RT4A,0,0.172654,0,True


In [40]:
from sklearn.metrics import classification_report

print(classification_report(
    df_churn_testing_eam['Churn'],
    df_churn_testing_eam['Predicted Churn Binary'],
    zero_division=0
))

              precision    recall  f1-score   support

           0       0.92      0.56      0.69    101524
           1       0.28      0.77      0.41     22398

    accuracy                           0.59    123922
   macro avg       0.60      0.66      0.55    123922
weighted avg       0.80      0.59      0.64    123922



### Report Breakdown
- Precision: When the model predicts a class, how often is it correct?
    - e.g. when the model predicts a non-churn case, how often is it correct? How many of the churn warnings were correct?
    - How trustworthy are the predictions for that class?
    - Important when false positives are costly.
- Recall: Of all the real cases of a class, how many did the model find?
    - e.g. of all the non-churn cases, how many did the model identify?
- F1-score: The harmonic mean of precision and recall, a single number that balances the two.

- Support: How many true samples of that class exist.

- Accuracy: Proportion of predictions the model got correct. NOTE: high accuracy does not necessarily indicate a good churn model. This data is 82% non-churn (class imbalance), so a model could predict every case as non-churn and be 82% accurate, yet fail to identify any churners.

- Non-Churn
    - When the model predicts a non-churn case, it is correct 92% of the time.
    - The model identified 56% of the non-churn cases.
- Churn
    - When the model predicts a churn case, it is correct 28% of the time.
    - The model identified 77% of the churn cases.
    - The model catches most churn cases, but also produces a lot of false positives (72% of the churn predictions are false positives).

- Accuracy 
    - The model was correct 59% of the time.
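
The metrics in the breakdown above can be derived by hand from the four confusion-matrix counts. The sketch below uses ten made-up labels, not the notebook's test set, to show the arithmetic for the churn class.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = churn, 0 = non-churn
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the churn warnings, how many were right
recall = tp / (tp + fn)      # of the real churners, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(precision, recall, f1, accuracy)
```

The share of false positives among churn predictions is simply `1 - precision`, which is how the 72% figure follows from a precision of 28%.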

## Decision Tree Model

In [41]:
df_churn_training_dtm = pd.read_csv('training_set.csv')

Although decision trees can conceptually handle categorical variables, scikit-learn implementations require all features to be numerically encoded. 

Categorical variables (subscription type and gender) were therefore encoded prior to model training.

It may have been possible to use one-hot encoding for the other categorical variables; this is something I will explore in the future.
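
For reference, one way one-hot encoding could look is sketched below with `pd.get_dummies`. The three-row frame is made up, and the choice of which columns to encode is an assumption rather than something this notebook does.

```python
import pandas as pd

# Made-up rows using column names from the churn data
df = pd.DataFrame({
    'AccountAge': [20, 57, 73],
    'SubscriptionType': ['Premium', 'Basic', 'Basic'],
    'PaymentMethod': ['Mailed check', 'Credit card', 'Mailed check'],
})

# Each category becomes its own 0/1 column; numeric columns pass through
encoded = pd.get_dummies(
    df, columns=['SubscriptionType', 'PaymentMethod'], dtype=int
)

print(encoded.columns.tolist())
```

One-hot encoding avoids implying an order between categories, at the cost of adding one column per category value.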

In [42]:
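# Keep only the float and int columns; text columns are dropped at this step.
# Note: the index-like 'Unnamed: 0' column is numeric, so it is retained as a feature.
df_churn_training_dtm_numeric = df_churn_training_dtm.select_dtypes([float, int])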

In [43]:
from sklearn.tree import DecisionTreeClassifier # Decision Tree Algo for classification problems
from sklearn.model_selection import train_test_split # Function to split the dataset into test and training
from sklearn.metrics import classification_report # Function for calculating metrics - This is also imported earlier

# Split the data set into x (feature matrix)(all the info about the customer), and y (target vector)(whether they churned or not)
X = df_churn_training_dtm_numeric.drop('Churn', axis=1)
y = df_churn_training_dtm_numeric['Churn']

# Split the data set up into training and testing, 80% of set for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 
# X_train → features used to train the model
# y_train → churn labels used during training
# X_test → features the model has never seen
# y_test → true churn labels for evaluation


# Create the decision tree classifier object (not yet fitted), limit the depth to 6 levels to reduce overfitting
tree = DecisionTreeClassifier(max_depth=6, random_state=42, class_weight='balanced') 

# Train the model
# What it actually does: finds the splits that best separate churners from non-churners, e.g. a split such as AccountAge < 10 if that reduces churn uncertainty the most
tree.fit(X_train, y_train)

# Make predictions on unseen data - applies the learned decision rules to the test set
y_pred = tree.predict(X_test)

# Print report that evaluates the model
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.90      0.66      0.76     39968
           1       0.30      0.67      0.41      8790

    accuracy                           0.66     48758
   macro avg       0.60      0.66      0.59     48758
weighted avg       0.79      0.66      0.70     48758



### Report Breakdown

- Non-Churn
    - When the model predicts a non-churn case, it is correct 90% of the time.
    - The model identified 66% of non-churn cases.

- Churn
    - When the model predicts a churn case, it is correct 30% of the time.
    - The model identified 67% of churn cases.
    - The model catches the majority of churners, but produces a high number of false positives (70% of churn predictions are false positives).

- Accuracy
    - The model was correct 66% of the time.

### Model Explanation

A decision tree classifier was trained using an 80/20 train-test split. The model was initially constrained to a maximum depth of four to reduce overfitting (the depth was later raised to six, as described below). Model performance was evaluated on the test set using precision, recall, and F1-score.

The initial decision tree classifier predicted only the majority class (non-churn), achieving an accuracy of 82%. However, recall and precision for the churn class were both zero, indicating that the model failed to identify any churners. This behaviour reflects a class imbalance in the dataset — approximately 82% of records are non-churners — causing the model to default to always predicting the majority class, as doing so minimises overall error without needing to learn patterns in the minority class.
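
The majority-class behaviour described above can be reproduced with a trivial baseline. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels with an 82/18 split; it is an illustration, not the notebook's initial tree.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels with the same imbalance as the churn data: 82% non-churn
y = np.array([0] * 82 + [1] * 18)
X = np.zeros((100, 1))  # features are ignored by this baseline

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
y_pred = baseline.predict(X)

accuracy = accuracy_score(y, y_pred)
churn_recall = recall_score(y, y_pred)  # recall for class 1

print(accuracy, churn_recall)
```

Any churn model should be judged against this baseline: matching its accuracy while finding no churners adds no value.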

To address this, the `class_weight='balanced'` parameter was introduced. This adjusts the penalty applied during training so that misclassifying a churner is weighted more heavily than misclassifying a non-churner, proportional to each class's frequency. This forces the model to actively learn patterns associated with churn rather than ignoring the minority class entirely.
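
To make the weighting concrete, the sketch below computes the weights that the `'balanced'` setting corresponds to, using scikit-learn's `compute_class_weight` on made-up labels with an 82/18 split. The formula is `n_samples / (n_classes * class_count)`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 82 non-churners and 18 churners
y = np.array([0] * 82 + [1] * 18)

weights = compute_class_weight(
    class_weight='balanced', classes=np.array([0, 1]), y=y
)

# Same values by hand: n_samples / (n_classes * count of each class)
manual = len(y) / (2 * np.bincount(y))

print(weights, manual)
```

With this split, an error on a churner counts roughly 4.6 times as much as an error on a non-churner (82 / 18).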

Following this change, overall accuracy decreased from 82% to 68%, but this is expected and desirable — the drop reflects the model now attempting to identify churners rather than defaulting to the majority class. Churn recall improved to 60%, meaning the model correctly identified 6 in 10 actual churners. Precision for the churn class remained low at 30%, indicating a higher rate of false positives, though this trade-off is acceptable in a churn context where missing a churner typically carries a greater business cost than an unnecessary retention intervention.

The maximum depth was subsequently increased from four to six to test whether additional splits would improve performance. This produced a modest improvement in churn recall (60% to 67%) and a marginal one in F1-score (0.40 to 0.41), while overall accuracy fell slightly (68% to 66%) and precision remained unchanged at 30%. This suggests that deepening a single decision tree yields diminishing returns. Further improvement is likely to require an ensemble approach such as a random forest.
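
As a possible next step, a random forest could be tried with the same class weighting. The sketch below runs on synthetic data from `make_classification` with a similar class imbalance; the hyperparameters are assumptions, and no claim is made about how it would perform on the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data: roughly 82% class 0, 18% class 1
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=4,
    weights=[0.82, 0.18], random_state=42,
)

# stratify=y keeps the class ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Many shallow trees, each weighted towards the minority class
forest = RandomForestClassifier(
    n_estimators=100, max_depth=6,
    class_weight='balanced', random_state=42,
)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)

print(classification_report(y_test, y_pred, zero_division=0))
```

Swapping `X` and `y` for the numeric churn features and labels would allow a like-for-like comparison with the single tree's report.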