---
title: Step 03 - Featurization, Vectorization, and Pre-Modeling
subject: Churn Analysis
subtitle: Step 03 - Featurization, Vectorization, and Pre-Modeling - Churn Analysis
short_title: Featurization, Vectorization, and Pre-Modeling
date: 2025-12-17


authors:
  - name: Jocelyn Perez
    affiliations:
      - name: University of California, Berkeley
    email: jocelyneperez@berkeley.edu
    orcid: 0000-0000-0000-0000

  - name: Claire Kaoru Shimazaki
    affiliations:
      - name: University of California, Berkeley
    email: ckshimazaki@berkeley.edu
    orcid: 0000-0000-0000-0000

  - name: Colby Zhang
    affiliations:
      - name: University of California, Berkeley
    email: colbyzhang@berkeley.edu
    orcid: 0009-0005-4786-6922

  - name: Olorundamilola Kazeem
    affiliations:
      - name: University of California, Berkeley
    email: dami@berkeley.edu
    orcid: 0000-0003-2118-2221

exports:
  - format: pdf
    # template: arxiv_two_column
    output: ../pdf_builds/step03_features_ipynb_to.pdf
    line_numbers: true

license: BSD-3-Clause

keywords: featurization, vectorization, pre-modeling, churn, spotify

abstract: How are the features pre-processed, and how is the data vectorized?
---

# Step 03: Featurization, Vectorization, and Pre-Modeling

In [1]:
import src.step00_utils as step00_utils
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split

from src.step03_features import (
    RAW_DATA_PATH,
    PROCESSED_DATA_DIR,
    VECTORIZED_DATA_DIR,
    engineer_features,
    make_X_y,
    build_preprocessor
)

In [2]:
step00_utils.DIR_PROJECT_CURRENT

PosixPath('/home/jovyan/final-group08/notebooks')

In [3]:
step00_utils.DIR_PROJECT_HOME

PosixPath('/home/jovyan/final-group08')

In [4]:
step00_utils.DIR_DATA

PosixPath('/home/jovyan/final-group08/data')

In [5]:
df = pd.read_csv(RAW_DATA_PATH)
print(f"Data Loaded. Shape: {df.shape}")
df.head()

Data Loaded. Shape: (8000, 12)


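|   | user_id | gender | age | country | subscription_type | listening_time | songs_played_per_day | skip_rate | device_type | ads_listened_per_week | offline_listening | is_churned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Female | 54 | CA | Free | 26 | 23 | 0.2 | Desktop | 31 | 0 | 1 |
| 1 | 2 | Other | 33 | DE | Family | 141 | 62 | 0.34 | Web | 0 | 1 | 0 |
| 2 | 3 | Male | 38 | AU | Premium | 199 | 38 | 0.04 | Mobile | 0 | 1 | 1 |
| 3 | 4 | Female | 22 | CA | Student | 36 | 2 | 0.31 | Mobile | 0 | 1 | 0 |
| 4 | 5 | Other | 29 | US | Family | 250 | 57 | 0.36 | Mobile | 0 | 1 | 1 |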


In [6]:
# 1. Engineer Features
df_engineered = engineer_features(df)

# 2. Sanity Check: Did the new columns appear?
print("New columns:", [c for c in df_engineered.columns if c in ['ads_per_song', 'avg_song_length']])
df_engineered.head()

New columns: ['ads_per_song', 'avg_song_length']


|   | user_id | gender | age | country | subscription_type | listening_time | songs_played_per_day | skip_rate | device_type | ads_listened_per_week | offline_listening | is_churned | ads_per_song | avg_song_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Female | 54 | CA | Free | 26 | 23 | 0.2 | Desktop | 31 | 0 | 1 | 0.191358 | 1.083333 |
| 1 | 2 | Other | 33 | DE | Family | 141 | 62 | 0.34 | Web | 0 | 1 | 0 | 0.0 | 2.238095 |
| 2 | 3 | Male | 38 | AU | Premium | 199 | 38 | 0.04 | Mobile | 0 | 1 | 1 | 0.0 | 5.102564 |
| 3 | 4 | Female | 22 | CA | Student | 36 | 2 | 0.31 | Mobile | 0 | 1 | 0 | 0.0 | 12.0 |
| 4 | 5 | Other | 29 | US | Family | 250 | 57 | 0.36 | Mobile | 0 | 1 | 1 | 0.0 | 4.310345 |
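
The notebook does not show how `engineer_features` computes the two new columns. The sketch below is a reconstruction, not the project's code: the formulas are inferred from the five displayed rows, which they reproduce to six decimals. The function name `sketch_engineer_features` is hypothetical, and the real function may do more (for example, clipping) that these rows do not reveal.

```python
import numpy as np
import pandas as pd

# First five rows as displayed above (only the columns the two ratios need).
sample = pd.DataFrame({
    "listening_time": [26, 141, 199, 36, 250],
    "songs_played_per_day": [23, 62, 38, 2, 57],
    "ads_listened_per_week": [31, 0, 0, 0, 0],
})

def sketch_engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for engineer_features; formulas inferred from the output."""
    out = df.copy()
    # Weekly ads divided by weekly songs; the +1 keeps the denominator nonzero.
    out["ads_per_song"] = out["ads_listened_per_week"] / (out["songs_played_per_day"] * 7 + 1)
    # Listening time per song played; again +1 guards against zero songs.
    out["avg_song_length"] = out["listening_time"] / (out["songs_played_per_day"] + 1)
    return out

engineered = sketch_engineer_features(sample)
print(engineered[["ads_per_song", "avg_song_length"]].round(6))
```

For the first row this gives 31 / (23 × 7 + 1) = 31 / 162 ≈ 0.191358 and 26 / (23 + 1) ≈ 1.083333, matching the displayed values.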


In [7]:
# X = features, y = target
X, y = make_X_y(df_engineered)

# train/test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# build and fit preprocessor
preprocessor = build_preprocessor()
preprocessor.fit(X_train)

# transform both sets with the preprocessor fitted on the training data only
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)

print("Training Data Shape:", X_train_processed.shape)
print("Test Data Shape:", X_test_processed.shape)

Training Data Shape: (6400, 26)
Test Data Shape: (1600, 26)
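
One way to confirm that `stratify=y` preserved the class balance is to compare the positive rate in each split. The following is a minimal sketch on synthetic data: the 25% positive rate and the names `X_demo` and `y_demo` are made up for illustration and are not the real churn figures.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 8000 rows with an invented 25% positive rate.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(8000, 3))
y_demo = np.zeros(8000, dtype=int)
y_demo[:2000] = 1

# Same split settings as the cell above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, both rates should sit very close to the overall rate.
print("train rows:", X_tr.shape[0], "positive rate:", y_tr.mean())
print("test rows:", X_te.shape[0], "positive rate:", y_te.mean())
```

Running the same two `mean()` calls on the real `y_train` and `y_test` would show whether the churn rate is matched across the splits.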


In [8]:
# save preprocessor and processed data for step04_modeling
joblib.dump(preprocessor, VECTORIZED_DATA_DIR / "preprocessor.joblib")
joblib.dump({"X": X_train_processed, "y": y_train}, VECTORIZED_DATA_DIR / "train.joblib")
joblib.dump({"X": X_test_processed, "y": y_test}, VECTORIZED_DATA_DIR / "test.joblib")

print("All files saved to:", VECTORIZED_DATA_DIR)

All files saved to: /home/jovyan/final-group08/data/02_vectorized
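
A later step can read these artifacts back with `joblib.load`. The sketch below shows the round trip under stated assumptions: a temporary directory stands in for `VECTORIZED_DATA_DIR`, and the payload is a small fake array with the same `{"X": ..., "y": ...}` layout used above, not the real data.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np

# Temporary directory standing in for VECTORIZED_DATA_DIR (an assumption for this demo).
demo_dir = Path(tempfile.mkdtemp())

# Small fake payload with the same dictionary layout as train.joblib above.
joblib.dump(
    {"X": np.zeros((4, 26)), "y": np.array([0, 1, 0, 1])},
    demo_dir / "train.joblib",
)

# What a downstream step could do to recover features and labels.
train = joblib.load(demo_dir / "train.joblib")
X_loaded, y_loaded = train["X"], train["y"]
print(X_loaded.shape, y_loaded.shape)
```

Saving the fitted preprocessor alongside the data means that new rows can later be transformed with the same scaling and encoding that the model was trained on.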


In this notebook, we built the feature engineering pipeline that prepares the data for modeling, directly applying insights from the EDA. We addressed data quality by handling missing values and clipping outliers to reduce noise, then engineered ratio features, `ads_per_song` and `avg_song_length`, to better capture user dissatisfaction and engagement depth. Finally, we made a stratified 80/20 train-test split (6,400 training rows and 1,600 test rows) and applied a preprocessing pipeline that uses `StandardScaler` for numeric features and `OneHotEncoder` for categorical features, fitted on the training set only. We saved the fitted preprocessor and the vectorized datasets (26 columns each) so that the subsequent modeling phase receives consistent inputs.
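
The body of `build_preprocessor` is not shown in this notebook. As a rough illustration of the scaler-plus-encoder arrangement described above, here is a minimal sketch using scikit-learn's `ColumnTransformer`. The column groupings, the `handle_unknown="ignore"` setting, and the name `sketch_build_preprocessor` are assumptions for illustration; the actual configuration lives in `src.step03_features`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed column grouping (a subset chosen for brevity).
NUMERIC_COLS = ["age", "listening_time", "skip_rate"]
CATEGORICAL_COLS = ["gender", "device_type"]

def sketch_build_preprocessor() -> ColumnTransformer:
    """Hypothetical stand-in for build_preprocessor."""
    return ColumnTransformer(
        transformers=[
            # Standardize numeric columns to zero mean and unit variance.
            ("num", StandardScaler(), NUMERIC_COLS),
            # One column per category; unseen categories encode as all zeros.
            ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL_COLS),
        ]
    )

# Toy rows borrowed from the head of the dataset shown earlier.
toy = pd.DataFrame({
    "age": [54, 33, 38, 22, 29],
    "listening_time": [26, 141, 199, 36, 250],
    "skip_rate": [0.20, 0.34, 0.04, 0.31, 0.36],
    "gender": ["Female", "Other", "Male", "Female", "Other"],
    "device_type": ["Desktop", "Web", "Mobile", "Mobile", "Mobile"],
})

pre = sketch_build_preprocessor()
pre.fit(toy.iloc[:4])          # fit on the "training" rows only
toy_out = pre.transform(toy)   # then transform every row with those statistics
print(toy_out.shape)
```

The output width is the number of numeric columns plus the number of distinct categories seen during fitting (here 3 + 3 + 3 = 9). The same arithmetic applied to the full column set would account for the 26 columns reported above.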