# Assignment Questions

## 1. Dataset Sourcing

This dataset was sourced from [Kaggle](https://www.kaggle.com/datasets/nabihazahid/spotify-dataset-for-churn-analysis/data), specifically the Spotify Dataset for Churn Analysis by user nabihazahid. It is publicly available, containing features related to user demographics, engagement behavior, and account information, along with a binary churn label.

## 2. Application Description

The application predicts Spotify user churn using demographics, subscription type, and usage patterns to identify users likely to leave, helping target retention and improve engagement.

## 3. Exploratory Data Analysis

The Spotify churn dataset contains both categorical and numerical features. The pie charts in the supplemental material show that the levels of each categorical feature are roughly evenly represented: gender (Male 33.6%, Female 33.2%, Other 33.1%), subscription type (Premium 26.4%, Free 25.2%, Student 24.5%, Family 23.8%), and device type (Desktop 34.7%, Web 32.8%, Mobile 32.5%).

Among the numerical features, offline_listening and ads listened are strongly negatively correlated (r = -0.88). Most other numerical features are only weakly correlated with churn, suggesting that churn depends on combinations of behaviors rather than on any single variable. The dataset has no missing values.
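
As a reminder of what the correlation values above measure, the following is a minimal sketch of the Pearson coefficient in plain Python. The helper name `pearson` and the toy values are assumptions made for illustration; they are not taken from the dataset or from the assignment code.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Made-up toy values, NOT taken from the dataset
offline_listening = [1, 1, 0, 0, 1, 0]
ads_listened = [0, 2, 15, 22, 1, 18]

r = pearson(offline_listening, ads_listened)
print(round(r, 2))  # strongly negative for these toy values
```

A value near -1 means that high values of one variable go with low values of the other, which matches the intuition that users who listen offline hear few ads.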

## 4. Data Splitting

A 60%-20%-20% split was chosen so that each subset contains enough samples per class and per feature level. With roughly 8,000 users, the 60% training set (~4,800) provides sufficient examples across all genders, subscription types, and device types for the models to learn patterns. The 20% validation set (~1,600) is large enough for reliable hyperparameter tuning (e.g., k in KNN, C in logistic regression) while preserving the class proportions. The remaining 20% test set (~1,600) is held out until the end, giving an unbiased estimate of how well the models generalize to unseen users across different demographics and usage behaviors.
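
The split described above can be sketched with the standard library alone. This is a minimal illustration, not the code used in the assignment: the helper `stratified_split`, the seed, and the 25% churn rate in the toy labels are all assumptions.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=42):
    """Return (train, val, test) index lists that keep class proportions.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # randomize order within each class
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Toy labels: 8,000 users with an ASSUMED 25% churn rate (illustrative only)
labels = [1] * 2000 + [0] * 6000
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 4800 1600 1600
```

In practice the same result is usually obtained by calling scikit-learn's `train_test_split` twice with the `stratify` argument; the version above only makes the arithmetic explicit.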

## 5. Logistic Regression and k-Nearest Neighbor 

I preprocessed the dataset with One-Hot Encoding for the categorical variables and StandardScaler for the numerical features, so that all predictors were numerical and on the same scale. For model tuning, I adjusted the number of neighbors k in KNN to balance bias and variance, and the regularization parameter C in Logistic Regression to control overfitting. The most important predictors in Logistic Regression, ranked by the magnitude of the model coefficients, were country_CA (-0.165776), subscription_type_Free (-0.144746), and country_FR (0.135067), meaning these had the largest influence on the predicted churn probability among the predictors. For KNN, feature importance could not be determined in the same way because it is a non-parametric method without learned coefficients.
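
The preprocessing step can be illustrated with a small standard-library sketch. It is not the assignment code (which, per the text, used scikit-learn's `StandardScaler` and One-Hot Encoding); the helper names and toy values are hypothetical. The point it shows is that scaling statistics and category levels are learned from the training set only and then reused on validation and test data.

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Learn mean and population standard deviation from training data only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    if sigma == 0:
        sigma = 1.0  # constant column: avoid division by zero
    return mu, sigma

def standardize(values, mu, sigma):
    """Apply z-score scaling with previously learned statistics."""
    return [(v - mu) / sigma for v in values]

def fit_categories(train_values):
    """Record the category levels seen in the training data."""
    return sorted(set(train_values))

def one_hot(values, categories):
    """Encode each value as a 0/1 vector; unseen levels become all zeros."""
    return [[1 if v == c else 0 for c in categories] for v in values]

# Toy rows, NOT taken from the dataset
train_age = [20, 30, 40]
val_age = [30, 50]
train_sub = ["Free", "Premium", "Student"]
val_sub = ["Premium", "Family"]

mu, sigma = fit_scaler(train_age)
val_age_scaled = standardize(val_age, mu, sigma)

categories = fit_categories(train_sub)
val_sub_encoded = one_hot(val_sub, categories)
print(categories, val_sub_encoded)
```

Fitting on the training data alone keeps information from the validation and test sets out of the model, which is what makes the later evaluation trustworthy.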

## 6. Model Evaluation

On the test set, KNN achieved 49.9% accuracy (misclassification 50.1%) and logistic regression 45.9% accuracy (misclassification 54.1%). KNN showed slightly better specificity (0.476 vs. 0.414), while logistic regression had slightly higher sensitivity (0.589 vs. 0.565). Overall, KNN offered a more balanced trade-off between detecting churned and retained users, whereas logistic regression was more sensitive to churn at the cost of lower specificity. Both models, however, demonstrated low discriminative power, indicating limited effectiveness in predicting churn.
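
For reference, the metrics reported above all follow from the four cells of a confusion matrix. The sketch below assumes churned users are the positive class; the function name `binary_metrics` and the counts passed to it are illustrative and are not the assignment's actual confusion matrix.

```python
def binary_metrics(tp, fn, fp, tn):
    """Compute common classification metrics from confusion-matrix counts.

    tp: churned users predicted as churned
    fn: churned users predicted as retained
    fp: retained users predicted as churned
    tn: retained users predicted as retained
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # recall on the churned class
    specificity = tn / (tn + fp)  # recall on the retained class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "misclassification": 1 - accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }

# Illustrative counts only, NOT the results reported in this section
m = binary_metrics(tp=60, fn=40, fp=150, tn=150)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Because sensitivity and specificity are each computed within one class, they stay informative when the classes are imbalanced, whereas accuracy alone can be misleading.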

## 7. Alternative kNN Strategy

I applied Large Margin Nearest Neighbor (LMNN) to learn a distance metric that pulls same-class points closer and pushes different-class points apart, setting k=5 for target neighbors and tuning KNN’s k on the validation set. This reshapes the feature space so that distances better reflect class similarity. On the test set, KNN with LMNN achieved higher accuracy (0.566 vs 0.499), better specificity (0.610 vs 0.476), and lower misclassification error (0.434 vs 0.501). Standard KNN showed higher sensitivity (0.565 vs 0.440) and a slightly higher F1-score (0.369 vs 0.344), making it better at detecting churned users. Overall, LMNN improved accuracy and specificity and reduced errors, while standard KNN favored minority-class recall. This suggests that metric learning can improve KNN performance, and it may be particularly useful in datasets with class imbalance or complex feature relationships.
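
For context, the commonly stated LMNN objective is shown below. This is the textbook formulation rather than a description of the specific implementation used here, which may differ in details such as the solver or the weighting.

$$
\varepsilon(\mathbf{L}) = (1-\mu)\sum_{i,\; j \rightsquigarrow i} \lVert \mathbf{L}(\mathbf{x}_i - \mathbf{x}_j) \rVert^2
+ \mu \sum_{i,\; j \rightsquigarrow i} \sum_{l} (1 - y_{il}) \bigl[\, 1 + \lVert \mathbf{L}(\mathbf{x}_i - \mathbf{x}_j) \rVert^2 - \lVert \mathbf{L}(\mathbf{x}_i - \mathbf{x}_l) \rVert^2 \,\bigr]_+
$$

Here $\mathbf{L}$ is the learned linear transformation, $j \rightsquigarrow i$ means that $\mathbf{x}_j$ is one of the k target neighbors of $\mathbf{x}_i$ (same class), $y_{il}$ equals 1 when $\mathbf{x}_i$ and $\mathbf{x}_l$ share a label and 0 otherwise, $[z]_+ = \max(z, 0)$ is the hinge function, and $\mu \in [0, 1]$ weights the two terms. The first term pulls target neighbors together; the second penalizes differently labeled points that come within a unit margin of them.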

## 8. Conclusions

Churn prediction is challenging for this dataset. The logistic regression and KNN classifiers achieved test accuracies between about 46% and 57%, which is close to random guessing. This indicates that individual features, such as age, listening time, or skip rate, have weak direct correlations with churn, and that the behavior patterns that lead to churn are subtle and complex. The low accuracy also reflects the class imbalance, with far more users retained than churned, which makes it difficult for standard classifiers to correctly identify churned users without specialized techniques.

Metric learning improves performance modestly. Applying LMNN to KNN increased overall accuracy, improved specificity, and reduced misclassification error compared to standard KNN. By learning a distance metric that brings same-class users closer and separates different-class users, LMNN helps the classifier better capture patterns in user behavior. However, even with LMNN, the accuracy remains around 57%, showing that while metric learning helps, predicting churn in this dataset is inherently difficult due to subtle behavioral differences and imbalanced classes.

## 9. Generative AI

Yes, generative AI can assist in answering some parts of the assignment, particularly in writing code, suggesting common methods, or providing general explanations. For example, one could prompt a tool with:

Prompt: "Split the data into a training, validation, and test sets and describe the rationale behind your choice of data splitting."

The response to the prompt was too general and not specific to the dataset.

The AI could generate the code and explain standard procedures. However, generative AI cannot fully replace manual analysis, as it does not automatically interpret dataset-specific patterns. While it can describe or plot individual variables, it cannot synthesize insights from multiple univariate and multivariate analyses or provide nuanced conclusions tailored to the actual dataset. Therefore, AI is best used as a coding and guidance aid, while interpretation, feature-specific insights, and dataset-driven conclusions must be performed manually.

\newpage

# References


::: {#refs}
:::

[@10.1016/j.eswa.2010.08.023]
[@zhang2023factors]  
[@spotify_churn_dataset]  