---
title: Main - Reporting
subject: Churn Analysis
subtitle: Reporting - Churn Analysis
short_title: Reporting
date: 2025-12-17

affiliations:
  - id: "ucb"
    name: "University of California, Berkeley"

authors:
  - name: Jocelyn Perez
    affiliations: ["ucb"]
    email: jocelyneperez@berkeley.edu
    orcid: 0009-0009-0231-9254

  - name: Claire Kaoru Shimazaki
    affiliations: ["ucb"]
    email: ckshimazaki@berkeley.edu
    orcid: 0009-0001-0828-3370

  - name: Colby Zhang
    affiliations: ["ucb"]
    email: colbyzhang@berkeley.edu
    orcid: 0009-0005-4786-6922

  - name: Olorundamilola Kazeem
    affiliations: ["ucb"]
    email: dami@berkeley.edu
    orcid: 0000-0003-2118-2221

# https://mystmd.org/guide/frontmatter#frontmatter-downloads
# https://mystmd.org/guide/website-downloads
# downloads:
#   -  ...

# https://mystmd.org/guide/website-downloads#include-exported-pdf
# exports:
#   - format: pdf
#     template: lapreprint-typst
#     output: exports/my-document.pdf
#     id: my-document-export
# downloads:
#   - id: my-document-export
#     title: A PDF of this document

exports:
  - format: pdf
    template: lapreprint-typst
    output: ../pdf_builds/main/main_ipynb_to.pdf
    line_numbers: true

license: CC-BY-4.0

keywords: main, churn, reporting, spotify

abstract: Reporting...
---


# Main: Reporting

In [1]:
import src.step00_utils as step00_utils


# Predicting User Churn from Listening Behavior

## Introduction
User churn is a critical challenge for subscription-based platforms such as Spotify, where long-term success depends on sustained user engagement. Understanding which behavioral patterns and user characteristics are associated with churn can help platforms identify at-risk users and design more effective retention strategies.

In this project, we analyze a synthetic Spotify churn dataset to answer the following questions:

1. What behavioral and demographic factors are associated with user churn?
2. How well can churn be predicted using machine learning models?
3. Which features most strongly drive churn predictions, and how do they influence model decisions?

Our analysis follows a structured, reproducible workflow, progressing from data understanding and exploratory analysis to feature engineering, modeling, and interpretability.

## Data Description and Assumptions
The dataset consists of 8,000 synthetic Spotify users generated using GPT to simulate realistic listening behavior, subscription types, device usage, and churn outcomes. Features include demographic attributes such as age, gender, and country, as well as engagement metrics such as listening time, skip rate, songs played per day, ad exposure, and offline listening usage. The target variable is a binary indicator of whether a user churned.
Because the dataset is synthetically generated rather than collected from real users, this analysis should be interpreted as exploratory rather than causal. We assume that the simulated relationships reasonably approximate real-world churn behavior, but we cannot verify the underlying data-generating process. As a result, all findings describe associations and predictive patterns, not causal effects.

The dataset was sourced from Kaggle ([linked here](https://www.kaggle.com/datasets/nabihazahid/spotify-dataset-for-churn-analysis)). Below is a preview of its first few rows to illustrate its structure.

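In [6]:
from pathlib import Path

import pandas as pd

# Raw data lives one level above the working directory, under data/00_raw.
raw_dir = Path.cwd().parent / "data" / "00_raw"

# Take the first CSV in name order so the choice is deterministic.
csv_files = sorted(raw_dir.glob("*.csv"))
if not csv_files:
    raise FileNotFoundError(f"No CSV files found in {raw_dir}")
csv_path = csv_files[0]

df = pd.read_csv(csv_path)
print("Loaded:", csv_path.name)

# Preview the first five rows.
df.head()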



Loaded: spotify_churn_dataset.csv


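| Unnamed: 0 | user_id | gender | age | country | subscription_type | listening_time | songs_played_per_day | skip_rate | device_type | ads_listened_per_week | offline_listening | is_churned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Female | 54 | CA | Free | 26 | 23 | 0.2 | Desktop | 31 | 0 | 1 |
| 1 | 2 | Other | 33 | DE | Family | 141 | 62 | 0.34 | Web | 0 | 1 | 0 |
| 2 | 3 | Male | 38 | AU | Premium | 199 | 38 | 0.04 | Mobile | 0 | 1 | 1 |
| 3 | 4 | Female | 22 | CA | Student | 36 | 2 | 0.31 | Mobile | 0 | 1 | 0 |
| 4 | 5 | Other | 29 | US | Family | 250 | 57 | 0.36 | Mobile | 0 | 1 | 1 |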


## Exploratory Data Analysis
Exploratory data analysis (EDA) shows that about one third of users churn, indicating a moderate class imbalance typical of churn prediction problems. This motivates the use of evaluation metrics beyond accuracy in later modeling stages.
Demographic features such as gender and age show limited standalone predictive power. Churn rates across gender categories are similar, and age distributions for churned and retained users overlap substantially. While some differences exist, demographics alone do not clearly separate churners from non-churners.
In contrast, engagement-related features show stronger associations with churn. Users with higher skip rates, greater ad exposure, and lower overall listening activity tend to churn more frequently. Subscription type also appears relevant, with Family plans exhibiting higher churn rates than other plans. These findings suggest that behavioral signals are more informative than static user characteristics.
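
The group comparisons described above reduce to a few pandas group-bys. The sketch below is a minimal illustration, not the project's actual EDA code: the rows are invented for illustration, the column names follow the dataset preview, and `churn_rate_by` is a hypothetical helper.

```python
import pandas as pd

# Toy rows, invented for illustration; column names follow the dataset preview.
toy = pd.DataFrame({
    "subscription_type": ["Free", "Free", "Family", "Family", "Premium", "Student"],
    "skip_rate": [0.20, 0.55, 0.34, 0.36, 0.04, 0.31],
    "is_churned": [1, 1, 0, 1, 0, 0],
})


def churn_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Mean of the binary target per category, i.e. the churn rate per group."""
    return df.groupby(column)["is_churned"].mean().sort_values(ascending=False)


# Share of churned users overall (the class balance).
overall_rate = toy["is_churned"].mean()

# Churn rate per subscription plan.
by_plan = churn_rate_by(toy, "subscription_type")

# Mean skip rate for retained (0) versus churned (1) users.
skip_by_outcome = toy.groupby("is_churned")["skip_rate"].mean()

print(f"Overall churn rate: {overall_rate:.2f}")
print(by_plan)
print(skip_by_outcome)
```

Because `is_churned` is coded 0/1, its mean within a group is that group's churn rate, so the same helper works for any categorical column.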

## Feature Engineering and Preprocessing
Raw usage logs capture what users did, but not how they experienced the platform. To better represent user engagement and frustration, we engineer interpretable behavioral features such as average song length and ads per song, which proxy engagement intensity and ad tolerance, respectively.
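
As a rough sketch of how such ratio features could be derived, the example below uses rows copied from the dataset preview. The exact formulas are assumptions, since the text does not state them: `avg_song_length` is taken as listening time divided by songs played per day, and `ads_per_song` as weekly ads divided by an estimated weekly song count. The function name `add_engagement_features` is hypothetical.

```python
import numpy as np
import pandas as pd

# Rows copied from the dataset preview (subset of columns).
users = pd.DataFrame({
    "listening_time": [26, 141, 199, 36, 250],
    "songs_played_per_day": [23, 62, 38, 2, 57],
    "ads_listened_per_week": [31, 0, 0, 0, 0],
})


def add_engagement_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add two ratio features; both definitions are assumptions, not the authors' formulas."""
    out = df.copy()
    # Treat zero songs as missing so the ratios become NaN instead of infinity.
    songs = out["songs_played_per_day"].replace(0, np.nan)
    # Assumed definition: listening time per song played.
    out["avg_song_length"] = out["listening_time"] / songs
    # Assumed definition: weekly ads divided by estimated weekly songs (daily count * 7).
    out["ads_per_song"] = out["ads_listened_per_week"] / (songs * 7)
    return out


features = add_engagement_features(users)
print(features[["avg_song_length", "ads_per_song"]].round(3))
```
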
To prevent data leakage and ensure reproducibility, we construct a preprocessing pipeline using scikit-learn. Numerical features are imputed and standardized, categorical features are one-hot encoded, and the dataset is split using stratified sampling to preserve the churn rate in both training and test sets. This pipeline produces consistent, vectorized inputs for all downstream models.
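
A pipeline of this shape might look like the following sketch. It runs on randomly generated stand-in data rather than the project's CSV, and the specific choices (median imputation, a 20% test split, the column subsets, `random_state` values) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Random stand-in data; column names follow the dataset preview.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "age": rng.integers(16, 60, n),
    "listening_time": rng.integers(10, 300, n),
    "skip_rate": rng.uniform(0, 0.6, n).round(2),
    "subscription_type": rng.choice(["Free", "Premium", "Family", "Student"], n),
    "device_type": rng.choice(["Mobile", "Desktop", "Web"], n),
    "is_churned": rng.choice([0, 1], n, p=[0.7, 0.3]),
})

numeric_cols = ["age", "listening_time", "skip_rate"]
categorical_cols = ["subscription_type", "device_type"]

# Numeric columns: impute, then standardize. Categorical columns: one-hot encode.
preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # strategy is an assumption
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

X = data.drop(columns="is_churned")
y = data["is_churned"]

# Stratify on the target so both splits keep roughly the same churn rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit on the training split only, so test-set statistics never leak into the transform.
X_train_vec = preprocessor.fit_transform(X_train)
X_test_vec = preprocessor.transform(X_test)

print(X_train_vec.shape, X_test_vec.shape)
print(f"Churn rate train={y_train.mean():.3f} test={y_test.mean():.3f}")
```

The leakage protection comes from calling `fit_transform` on the training split and only `transform` on the test split.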

## Model Building and Evaluation
We first establish a baseline using Logistic Regression with class weighting to account for imbalance. While interpretable, this linear model achieves limited performance, reflecting the complexity of churn behavior.
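
A class-weighted baseline of this kind can be sketched as follows. The data comes from scikit-learn's `make_classification` as a stand-in for the preprocessed churn features, so the scores it prints say nothing about the real dataset; `max_iter` and the split settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split

# Stand-in data with roughly a 70/30 class split.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.7, 0.3], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# class_weight="balanced" reweights samples inversely to class frequency,
# so errors on the minority (churn) class cost more during fitting.
baseline = LogisticRegression(class_weight="balanced", max_iter=1000)
baseline.fit(X_train, y_train)

pred = baseline.predict(X_test)
baseline_f1 = f1_score(y_test, pred)
baseline_recall = recall_score(y_test, pred)
print(f"F1={baseline_f1:.3f} recall={baseline_recall:.3f}")
```
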
We then train a Random Forest classifier to capture non-linear relationships and feature interactions. Hyperparameters are tuned using cross-validation with F1-score as the primary metric. The Random Forest outperforms the baseline model, suggesting that churn depends on interacting behavioral factors rather than simple linear effects.
However, despite improved performance, the model’s recall for churned users remains modest. Many churners are misclassified as retained users, highlighting the inherent difficulty of predicting churn and the presence of unobserved factors beyond behavioral data.
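
The tuning and recall check described above could be sketched as below. The parameter grid, fold count, and stand-in data are hypothetical and chosen to keep the example fast; they are not the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in data with roughly a 70/30 class split.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.7, 0.3], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical, deliberately small grid.
param_grid = {"n_estimators": [50, 100], "max_depth": [4, None]}

# Cross-validated search that ranks candidates by F1 on the positive (churn) class.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1",
    cv=3,
)
search.fit(X_train, y_train)

# Evaluate the refit best model on the held-out split.
pred = search.best_estimator_.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()

# Recall for churners: the share of actual churners the model flagged.
churn_recall = tp / (tp + fn)
print("Best params:", search.best_params_)
print(f"Missed churners (false negatives): {fn}, recall: {churn_recall:.3f}")
```

The false-negative count `fn` corresponds to churners misclassified as retained, which is the quantity the paragraph above is concerned with.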

## Results and Interpretation
To understand why the Random Forest makes its predictions, we apply model interpretability techniques.
Global SHAP analysis shows that engagement-related features dominate churn predictions. Average song length, listening time, skip rate, and songs played per day are the most influential variables, while age plays a secondary role. Device type, country, and gender contribute relatively little to overall predictions.
Local SHAP explanations reveal that individual churn predictions arise from the combined effect of many small signals rather than a single decisive feature. For example, a churned user may be predicted to churn due to slightly reduced listening time, higher skip rate, and lower engagement simultaneously.
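
The "many small signals" reading follows from the additive form of SHAP explanations. In the standard formulation (the notation here is ours, not taken from the notebook), the model output for a user $x$ with $M$ features decomposes as a base value plus one contribution per feature, and a common global summary averages the absolute contributions over $N$ users:

$$
f(x) = \phi_0 + \sum_{j=1}^{M} \phi_j(x),
\qquad
I_j = \frac{1}{N} \sum_{i=1}^{N} \left| \phi_j\left(x^{(i)}\right) \right|
$$

Here $\phi_0$ is the expected model output over the background data, $\phi_j(x)$ is the contribution of feature $j$ for user $x$, and $I_j$ is the mean absolute SHAP value often used to rank features globally. We assume, without confirmation from the source, that the global ranking above was computed in this way. A prediction can therefore move well away from $\phi_0$ even when every individual $\phi_j(x)$ is small, provided the contributions share the same sign.
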
Partial dependence plots further illustrate non-linear relationships. Churn risk decreases sharply as engagement increases at low levels, then plateaus, indicating diminishing returns to very high engagement. This suggests that early drops in engagement may be especially informative warning signs.
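
To show how such a curve is obtained, the sketch below computes a one-feature partial dependence with scikit-learn on toy data. The `engagement` score, the generating formula, and the model settings are all hypothetical; the toy data is built so that churn probability falls quickly at low engagement and then flattens.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence

rng = np.random.default_rng(0)
n = 500

# Hypothetical engagement score plus one pure-noise feature.
engagement = rng.uniform(0, 10, n)
noise_feature = rng.normal(size=n)

# Toy ground truth: churn probability decays with engagement, then levels off near 0.1.
p_churn = 0.1 + 0.7 * np.exp(-engagement)
y = rng.binomial(1, p_churn)
X = np.column_stack([engagement, noise_feature])

model = RandomForestClassifier(
    n_estimators=100, min_samples_leaf=20, random_state=0
).fit(X, y)

# Average predicted churn probability as feature 0 is swept over a grid,
# with the other feature kept at its observed values.
pd_result = partial_dependence(
    model, X, features=[0], kind="average", grid_resolution=20
)
curve = pd_result["average"][0]

print(np.round(curve, 3))
```
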
Finally, error analysis shows that many missed churners exhibit engagement patterns similar to retained users. This helps explain the model’s modest recall and suggests that churn decisions are often driven by external or unobserved factors such as pricing changes, competitor offers, or personal circumstances.
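
One simple way to run this kind of comparison is to label each test-set user by prediction outcome and compare feature means per group. The sketch below uses invented values and hypothetical outcome labels. It illustrates the technique only and does not reproduce the project's results.

```python
import numpy as np
import pandas as pd

# Invented test-set rows: two features, the true label, and the model's prediction.
results = pd.DataFrame({
    "listening_time": [30, 40, 180, 200, 190, 210, 35, 220],
    "skip_rate": [0.50, 0.45, 0.10, 0.12, 0.15, 0.08, 0.55, 0.11],
    "is_churned": [1, 1, 1, 1, 0, 0, 0, 0],
    "predicted": [1, 1, 0, 0, 0, 0, 1, 0],
})

actual = results["is_churned"] == 1
flagged = results["predicted"] == 1

# Assign each row to one of the four confusion-matrix cells (labels are hypothetical).
results["outcome"] = np.select(
    [actual & flagged, actual & ~flagged, ~actual & flagged],
    ["caught_churner", "missed_churner", "false_alarm"],
    default="retained_correct",
)

# Average feature values per outcome group.
profile = results.groupby("outcome")[["listening_time", "skip_rate"]].mean()
print(profile.round(3))
```

If the `missed_churner` row of `profile` sits close to `retained_correct` and far from `caught_churner`, the missed churners are behaviorally indistinguishable from retained users on these features.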

## Limitations and Conclusion
This analysis is subject to several limitations. First, the dataset is synthetic and observational, limiting external validity and precluding causal interpretation. Second, behavioral data alone cannot capture all drivers of churn, placing an upper bound on predictive performance. Finally, class imbalance and overlapping feature distributions make churn inherently difficult to predict.



## Author Contributions