# **Business Objective**
The goal is to reduce subscriber churn and increase revenue. Using subscriber data, we predict which users are likely to cancel and group users into behavior-based segments, with the aim of making marketing **27%** more efficient.

# **Approach & Methodology**
1. **Churn Prediction**  
   - Three classification models implemented:
     - `Logistic Regression`
     - `Decision Tree (max_depth=3)`
     - `Random Forest (n_estimators=10, max_depth=3)`
   - 80/20 train-test split
   - Accuracy as primary metric

2. **Subscriber Segmentation**  
   - `K-means` clustering on engagement metrics
   - Elbow method for optimal cluster count (k=3)



# **Data Description**


* **subscriber_id**: A unique identifier for each user. Acts as the primary key for the dataset.
* **age_group**: The demographic bracket the user belongs to (e.g., 18-24, 25-34, 45+).
* **engagement_time**: The total duration (usually in minutes or hours) a user has spent on the platform.
* **engagement_frequency**: The number of times a user interacts with the service over a specific period (e.g., daily logins or weekly sessions).
* **subscription_status**: The current state of the user's account (e.g., Active, Canceled, Expired, or Trial).

In [None]:
# Import the necessary modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.cluster import KMeans
import seaborn as sns
from matplotlib import pyplot as plt

# Specify the file path of your CSV file
file_path = "/kaggle/input/combating-subscriber-churn-with-targeted-marketing/AZWatch_subscribers.csv"

# Read the CSV file into a DataFrame
df = pd.read_csv(file_path)
df.head()

# **Data Preprocessing**

1. **Separated features from the target**  
   - Dropped `subscriber_id`, an identifier with no predictive value
   - Used `subscription_status` as the class label

2. **Handled categorical variables**  
   - One-hot encoded `age_group`  

3. **Standardized numerical features** (for the segmentation step)  
   - `engagement_time`  
   - `engagement_frequency`  
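
The encoding and scaling steps listed here can also be bundled into a single transformer that is fitted on the training split and reused on the test split. The sketch below is an alternative to the `pd.get_dummies` calls in this notebook, not the code that produced its results. It uses scikit-learn's `ColumnTransformer`, and the toy rows and `age_group` values are hypothetical stand-ins for the real data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows that reuse the dataset's column names
toy_train = pd.DataFrame({
    "age_group": ["18-24", "25-34", "45+", "18-24"],
    "engagement_time": [5.5, 8.0, 3.2, 6.1],
    "engagement_frequency": [10, 4, 7, 15],
})
toy_test = pd.DataFrame({
    "age_group": ["45+", "25-34"],
    "engagement_time": [7.0, 4.4],
    "engagement_frequency": [6, 12],
})

# One-hot encode the categorical column and standardize the numeric ones.
# sparse_threshold=0 forces a dense NumPy array as output.
preprocessor = ColumnTransformer(
    [
        ("age", OneHotEncoder(handle_unknown="ignore"), ["age_group"]),
        ("num", StandardScaler(), ["engagement_time", "engagement_frequency"]),
    ],
    sparse_threshold=0,
)

# Fit on the training rows only, then apply the same mapping to the test rows
toy_train_prepared = preprocessor.fit_transform(toy_train)
toy_test_prepared = preprocessor.transform(toy_test)

print(toy_train_prepared.shape, toy_test_prepared.shape)
```

Fitting on the training split only keeps the test split's statistics out of the scaler, and `handle_unknown="ignore"` keeps the column layout stable if a category appears in only one split.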

In [None]:
# Separate predictor variables from class label
X = df.drop(['subscriber_id','subscription_status'], axis=1)
y = df.subscription_status


In [None]:
# Split into training and test sets (20% test)
X_train, X_test, y_train, y_test = train_test_split(
                        X, y, test_size=.2, random_state=42)

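# Data processing: apply one-hot encoding to the categorical attribute age_group
X_train_prepared = pd.get_dummies(X_train, columns=['age_group'])

# Data processing: apply the same encoding to the test data, then align its
# columns with the training set in case a category is missing from one split
X_test_prepared = pd.get_dummies(X_test, columns=['age_group'])
X_test_prepared = X_test_prepared.reindex(columns=X_train_prepared.columns, fill_value=0)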


In [None]:
# LOGISTIC REGRESSION CLASSIFIER
# Train a logistic regression classifier for subscriber churn prediction
model1 = LogisticRegression()
model1.fit(X_train_prepared, y_train)

# Calculate accuracy score of predictions on test set
score = model1.score(X_test_prepared, y_test)
print("\nLogistic regression accuracy score: ", score)


In [None]:
# DECISION TREE CLASSIFIER
# Train a decision tree classifier for subscriber churn prediction
model2 = DecisionTreeClassifier(max_depth=3, criterion="gini")
model2.fit(X_train_prepared, y_train)

# Calculate decision tree's accuracy score of predictions on test set
score = model2.score(X_test_prepared, y_test)
print("\nDecision tree accuracy score: ", score)


In [None]:
# RANDOM FOREST ENSEMBLE
# Train a random forest ensemble classifier for subscriber churn prediction
# Fix random_state so repeated runs produce the same forest
model3 = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=42)
model3.fit(X_train_prepared, y_train)

# Calculate ensemble's accuracy score of predictions on test set
score = model3.score(X_test_prepared, y_test)
print("\nRandom Forest accuracy score: ", score)


# **Model Performance**
| Model               | Accuracy |
|---------------------|----------|
| Logistic Regression | 92.5%    |
| Decision Tree       | 92.0%    |
| Random Forest       | 91.5%    |
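
Accuracy alone can hide how a model treats the minority class, which matters when canceled subscribers are the group of interest. The sketch below shows how `confusion_matrix` and `classification_report` (both already imported in this notebook) break accuracy down per class. It runs on synthetic data from `make_classification`, which is an assumption for illustration, so its numbers say nothing about the models in the table.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the subscriber data (not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred)
print(cm)

# Per-class precision, recall, and F1
print(classification_report(y_te, y_pred))

# Accuracy is the share of predictions on the diagonal
accuracy_from_cm = cm.trace() / cm.sum()
print(accuracy_from_cm)
```

The same two calls could be applied to `model1`, `model2`, and `model3` with `X_test_prepared` and `y_test` to compare recall on the canceled class.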

In [None]:
# SUBSCRIBER SEGMENTATION
# You can optionally use a method like the elbow criterion and silhouette calculation to choose the number of clusters.
segmentation = X.drop(['age_group'], axis=1)

# Scale the two numerical data attributes
scaler = StandardScaler()
scaler.fit(segmentation)
segmentation_normalized = scaler.transform(segmentation)

sse = {} # sum of squared errors (distances) to each cluster
for k in range(1,20):
    kmeans = KMeans(n_clusters=k, random_state=1)
    kmeans.fit(segmentation_normalized)
    sse[k] = kmeans.inertia_

plt.title('Elbow method to choose k')
plt.xlabel('k')
plt.ylabel('SSE')
sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
plt.show()


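The elbow plot can be ambiguous, so the silhouette score (which ranges from -1 to 1) offers a second check: the $k$ value with the highest score is generally considered the "sweet spot" for segmentation.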
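
As a minimal sketch of the "highest score wins" rule, the snippet below scores several values of $k$ on synthetic blobs and picks the best one programmatically. The blob centers, `demo_scores`, and `best_k` are hypothetical names and data chosen for illustration. They are not part of the subscriber dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs (illustrative data only)
X_blobs, _ = make_blobs(
    n_samples=300,
    centers=[(0, 0), (10, 0), (0, 10)],
    cluster_std=0.5,
    random_state=0,
)

# Silhouette score for each candidate number of clusters
demo_scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X_blobs)
    demo_scores[k] = silhouette_score(X_blobs, labels)

# Pick the k whose silhouette score is highest
best_k = max(demo_scores, key=demo_scores.get)
print(best_k)
```

On real data the scores for neighboring values of $k$ are often close, so it is worth reading the full table rather than relying on the maximum alone.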

In [None]:
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
# 1. Initialize a dictionary to store the quality score for each k
scores = {}
# 2. Test different cluster counts (from 2 groups up to 5)
for k in range(2, 6):
    # Initialize the KMeans algorithm
    kmeans = KMeans(n_clusters=k, random_state=1)

    # Fit the model and assign each subscriber to a cluster
    labels = kmeans.fit_predict(segmentation_normalized)

    # Calculate the Silhouette Score (measures how well-defined the groups are)
    score = silhouette_score(segmentation_normalized, labels)

    # Save the score for this specific number of clusters
    scores[k] = score
# 3. Print the results in a clean table format for analysis
for k, value in scores.items():
    print(f"| {k} | {value:.4f} |")

In [None]:
# Apply k-means clustering with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=1)
kmeans.fit(segmentation_normalized)

# Add cluster labels as a new attribute of the original (unscaled) dataset
segmentation["cluster_id"] = kmeans.labels_

# Analyze average feature values per cluster
analysis = segmentation.groupby(['cluster_id']).agg({
    'engagement_time': ['mean'],
    'engagement_frequency':['mean']
}).round(0)
analysis

# **Insights**
### Subscriber Segmentation Results

**Three distinct clusters identified:**

| Cluster | Avg. Engagement Time | Avg. Engagement Frequency | Profile |
| :--- | :--- | :--- | :--- |
| 0 | 4 minutes | 5 | Light Users |
| 1 | 7 minutes | 18 | High-Frequency Users |
| 2 | 9 minutes | 9 | Moderate Users |
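
To use these profiles downstream, each subscriber's `cluster_id` can be mapped to its profile name. The sketch below assumes the id-to-profile pairing shown in the table. The `demo` frame and `profiles` dictionary are hypothetical. K-means cluster ids are arbitrary, so the pairing should be rechecked against the cluster averages whenever the model is refitted.

```python
import pandas as pd

# Hypothetical per-subscriber cluster assignments (not real data)
demo = pd.DataFrame({"cluster_id": [0, 1, 2, 1, 0]})

# Profile names taken from the table above
profiles = {0: "Light Users", 1: "High-Frequency Users", 2: "Moderate Users"}

# Attach a readable profile label to every subscriber
demo["profile"] = demo["cluster_id"].map(profiles)
print(demo["profile"].value_counts())
```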


# **Recommendations**

- **Light Users (Cluster 0):**  
  1. Free premium content trials  
  2. Personalized recommendations
- **High-Frequency Users (Cluster 1):**  
  1. Premium subscription tiers  
  2. Ambassador programs  
  3. Exclusive content access
- **Moderate Users (Cluster 2):**  
  1. Gamified learning features  
  2. Curated course bundles



# **Future Work**
1. Collect additional user behavior data
2. Implement A/B testing for cluster-specific strategies
3. Develop real-time churn prediction API
4. Explore deep learning approaches for pattern detection
