# Data Preparation for Predicting Customer Churn
---

In this demo, you will learn how to use Amazon SageMaker Data Wrangler to prepare data for training a classification model that predicts whether a customer is likely to churn from a music streaming service.

## Overview

### What is Customer Churn and why is it important for businesses?
Customer churn, also called customer attrition, occurs when a customer leaves a business and stops paying for it. It is one of the primary metrics companies track to gauge customer satisfaction, especially under a subscription-based business model. A company can track its churn rate (the percentage of customers who churned during a period) as a health indicator for the business, but it is more useful to identify at-risk customers before they churn and offer appropriate treatment to keep them. This is where machine learning comes into play.
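
As a quick illustration of the churn-rate definition above, the sketch below computes the percentage of customers lost over a period. The function name `churn_rate`, the choice of "customers at the start of the period" as the denominator, and the numbers are assumptions made for this example.

```python
def churn_rate(customers_at_start: int, customers_churned: int) -> float:
    """Return the churn rate for a period as a percentage."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return 100.0 * customers_churned / customers_at_start


# Hypothetical numbers: 1,000 subscribers at the start of the month, 50 cancelled.
monthly_churn = churn_rate(1000, 50)
print(f"Monthly churn rate: {monthly_churn:.1f}%")
```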

### Use Cases for Customer Churn

Any subscription-based business tracks customer churn as one of its most critical Key Performance Indicators (KPIs). Such companies and industries include telecom companies (cable, cell phone, internet, etc.), digital media subscriptions (news, forums, blogging platforms, etc.), music and video streaming services, and other Software as a Service (SaaS) providers (e-commerce, CRM, Mar-Tech, cloud computing, video conferencing, visualization and data science tools, etc.).

### Define the Business Problem

To start with, here are some common business problems to consider depending on your specific use cases and your focus:

 * Will this customer churn (cancel the plan, cancel the subscription)?
 * Will this customer downgrade a pricing plan?
 * For a subscription business model, will a customer renew their subscription?

### Machine learning problem formulation

#### Classification: will this customer churn?

The goal of classification is to identify at-risk customers and, in some cases, their unusual behavior. Typical questions are: will this customer churn or downgrade their plan? Is this customer behaving unusually? The latter question can be formulated as an anomaly detection problem.

#### Time Series: will this customer churn in the next X months? When will this customer churn?

You can explore your users further by formulating the problem as a time series problem and predicting when a customer will churn.

### Data Requirements

#### Data collection Sources

Some of the most common data sources used to construct a data set for churn analysis are:

* Customer Relationship Management (CRM) platforms
* Engagement and usage data (analytics services)
* Passive feedback (ratings based on your request) and active feedback (customer support requests, feedback on social media and review platforms)

#### Construct a Data Set for Churn Analysis

Most raw data collected from the sources mentioned above is large and often needs a lot of cleaning and pre-processing. For example, usage data is usually event-based log data and can exceed a few gigabytes per day; you can aggregate it to the user level on a daily basis for further analysis. Feedback and review data is mostly text, so you need to clean and pre-process the natural language into normalized, machine-readable data. If you are joining multiple data sources (especially from different platforms), make sure all data points are consistent and that user identities can be matched across platforms.
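
The aggregation of event-level logs to one row per user per day can be sketched with pandas as follows. The tiny event log, the `is_song` helper column, and the output column names are made up for illustration; only `userId`, `ts`, and `page` follow the field names used in this notebook.

```python
import pandas as pd

# Hypothetical event log: one row per event, timestamps in milliseconds.
events = pd.DataFrame(
    {
        "userId": [1, 1, 1, 2, 2],
        "ts": [
            1592146267731,  # 2020-06-14
            1592146534731,  # 2020-06-14
            1592232667731,  # 2020-06-15
            1592146700731,  # 2020-06-14
            1592146800731,  # 2020-06-14
        ],
        "page": ["NextSong", "Home", "NextSong", "NextSong", "Thumbs Up"],
    }
)
events["date"] = pd.to_datetime(events["ts"], unit="ms").dt.date

# One row per user per day, with the number of events and of songs played.
daily = (
    events.assign(is_song=events["page"].eq("NextSong"))
    .groupby(["userId", "date"])
    .agg(num_events=("page", "count"), num_songs=("is_song", "sum"))
    .reset_index()
)
print(daily)
```

Aggregating early like this keeps later steps small: the daily table grows with the number of active user-days rather than with the number of raw events.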
           
#### Challenges with Customer Churn

* Business related
    * Importance of domain knowledge: this is critical when you start building features for the machine learning model. It is important to understand the business enough to decide which features would trigger retention.
* Data issues
    * Little churn data available (imbalanced classes): data for churn analysis is often heavily imbalanced, because most customers of a business are (usually) happy customers.
    * User identity mapping problem: if you are joining data from different platforms (CRM, email, feedback, mobile app, and website usage data), make sure user A is recognized as the same user across all of them. There are third-party solutions that help you tackle this problem.
    * Not collecting the right data for the use case, or lacking enough data.
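
To make the class-imbalance issue concrete, the following sketch measures the share of each class and derives a simple weight for the minority class. The labels are invented, and the negatives-to-positives ratio is only one common heuristic (for example, XGBoost exposes a `scale_pos_weight` parameter that is often set this way); it is not part of this notebook's workflow.

```python
import pandas as pd

# Hypothetical labels: 1 = churned, 0 = retained.
labels = pd.Series([0] * 95 + [1] * 5, name="user_churned")

# Share of each class in the data set.
class_share = labels.value_counts(normalize=True)
print(class_share)

# A common heuristic for weighting the positive class: negatives / positives.
n_pos = int((labels == 1).sum())
n_neg = int((labels == 0).sum())
pos_weight = n_neg / n_pos
print(f"positive-class weight: {pos_weight:.1f}")
```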

## Data Selection

You will use synthetic data that imitates the behavior of music streaming users. The simulated data contains 1,100 users and their activity over one year (2019/10/28 - 2020/10/28). It was generated with [EventSim](https://github.com/Interana/eventsim) and does not contain any real user data.

* Observation window: you will use 1 year of data to generate predictions.
* Explanation of fields:
    * `ts`: event UNIX timestamp
    * `userId`: a randomly assigned unique user id
    * `sessionId`: a randomly assigned session id unique to each user
    * `page`: event taken by the user, e.g. "next song", "upgrade", "cancel"
    * `auth`: whether the user is a logged-in user
    * `method`: request method, GET or PUT
    * `status`: request status
    * `level`: if the user is a free or paid user
    * `itemInSession`: index of the event within the session
    * `location`: location of the user's IP address
    * `userAgent`: agent of the user's device
    * `lastName`: user's last name
    * `firstName`: user's first name
    * `registration`: user's time of registration
    * `gender`: gender of the user
    * `artist`: artist of the song the user is playing at the event
    * `song`: song title the user is playing at the event
    * `length`: length of the song in seconds (empty for events that are not song plays)
* The data is downloaded from GitHub and stored in an [Amazon Simple Storage Service](https://aws.amazon.com/s3/) (Amazon S3) bucket.

**References**:
- [Predicting Customer Churn with Amazon Machine Learning](https://aws.amazon.com/blogs/machine-learning/predicting-customer-churn-with-amazon-machine-learning/)
- [Customer Churn Prediction with XGBoost](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_applying_machine_learning/xgboost_customer_churn/xgboost_customer_churn.html)

## Prepare the notebook

In [29]:
import json
import shutil
from pathlib import Path

import pandas as pd
import s3fs
import sagemaker as sm

Define the environment to use SageMaker and to organize the project data and artifacts in S3.

In [4]:
sagemaker_session = sm.Session()
bucket = sagemaker_session.default_bucket()
region = sagemaker_session.boto_region_name
# role = sm.get_execution_role()
smclient = sagemaker_session.sagemaker_client
role = "arn:aws:iam::570358149193:role/service-role/AmazonSageMaker-ExecutionRole-20210506T162928"

s3 = s3fs.S3FileSystem()

prefix = "music-streaming"

<a id='4'></a>

### Ingest Data

We ingest the simulated data from the public SageMaker sample files S3 bucket (`sagemaker-sample-files`).

In [5]:
data_source_url = "s3://sagemaker-sample-files/datasets/tabular/customer-churn/customer-churn-data.zip"

In [6]:
sm.s3.S3Downloader().download(data_source_url, "./data")
shutil.unpack_archive("data/customer-churn-data.zip", "data/", format="zip")

[
    shutil.unpack_archive(k, k.parent / "raw", format="zip")
    for k in Path("data").glob("simu*")
]
shutil.unpack_archive("data/sample.zip", "data/raw", format="zip")
[k.unlink() for k in Path("data").glob("*.zip")];

[None, None, None, None, None, None]

We upload the raw data to S3.

In [7]:
sm.s3.S3Uploader.upload("data/raw/", f"s3://{bucket}/{prefix}/data/json")

's3://sagemaker-ap-southeast-1-570358149193/music-streaming/data/json'

## Preprocessing `json` to `csv`
Here we use a Processing Job to convert the raw streaming data files (the contents of the `simu-*.zip` archives) from JSON to CSV and combine them into a single file for ingestion into Data Wrangler.

In [8]:
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

Here is the script used for the Processing Job. We save it to a separate file.

In [9]:
%%writefile preprocessing_predw.py

import argparse
import glob
import json
import os
import time
import warnings

import pandas as pd
from sklearn.exceptions import DataConversionWarning

warnings.filterwarnings(action="ignore", category=DataConversionWarning)
start_time = time.time()

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--processing-output-filename")

    args, _ = parser.parse_known_args()
    print("Received arguments {}".format(args))

    input_jsons = glob.glob("/opt/ml/processing/input/data/**/*.json", recursive=True)

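    # Read every JSON Lines file, then concatenate once at the end.
    # (DataFrame.append was removed in pandas 2.0.)
    frames = []
    for name in input_jsons:
        print("\nStarting file: {}".format(name))
        frames.append(pd.read_json(name, lines=True))
    df_all = pd.concat(frames) if frames else pd.DataFrame()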

    output_filename = args.processing_output_filename
    final_features_output_path = os.path.join(
        "/opt/ml/processing/output", output_filename
    )
    print("Saving processed data to {}".format(final_features_output_path))
    df_all.to_csv(final_features_output_path, header=True, index=False)

Writing preprocessing_predw.py
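
Before launching the job, the read-and-combine logic of the script can be tried locally on a couple of small files. The sketch below is an illustration only: the file names (`part-0.json`, `part-1.json`) and their contents are invented, and a temporary directory stands in for the `/opt/ml/processing` paths used inside the container.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write two small, made-up JSON Lines files (one JSON object per line).
    for i, rows in enumerate(
        [
            ['{"userId": 1, "page": "NextSong"}', '{"userId": 1, "page": "Home"}'],
            ['{"userId": 2, "page": "Upgrade"}'],
        ]
    ):
        with open(os.path.join(tmp_dir, f"part-{i}.json"), "w") as f:
            f.write("\n".join(rows) + "\n")

    # Read every file and stack the frames into one DataFrame.
    paths = sorted(glob.glob(os.path.join(tmp_dir, "*.json")))
    combined = pd.concat(
        [pd.read_json(p, lines=True) for p in paths], ignore_index=True
    )

    # Write a single CSV with a header row and no index column.
    csv_path = os.path.join(tmp_dir, "full_data.csv")
    combined.to_csv(csv_path, header=True, index=False)
    round_trip = pd.read_csv(csv_path)

print(round_trip)
```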


In [10]:
sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

Select the files to process and add them to the `ProcessingInput` structure.

In [11]:
s3_input_uris = [
    f"s3://{k}" for k in s3.ls(f"s3://{bucket}/{prefix}/data/json/") if "simu" in k
]

In [12]:
processing_inputs = []
for uri in s3_input_uris:
    name = uri.split("/")[-1].split(".")[0]
    processing_input = ProcessingInput(
        source=uri, input_name=name, destination=f"/opt/ml/processing/input/data/{name}"
    )
    processing_inputs.append(processing_input)
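
For reference, this is how the input name and container path are derived from a single URI in the loop above. The bucket name `example-bucket` is a placeholder, not the bucket used in this notebook.

```python
# Hypothetical URI; the bucket name is a placeholder.
uri = "s3://example-bucket/music-streaming/data/json/simu-1.json"

file_name = uri.split("/")[-1]  # last path component: "simu-1.json"
name = file_name.split(".")[0]  # text before the first dot: "simu-1"
destination = f"/opt/ml/processing/input/data/{name}"
print(name, destination)
```

Each file therefore lands in its own sub-directory of `/opt/ml/processing/input/data/`, which is why the processing script searches that directory recursively.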

In [13]:
processing_output_path = f"s3://{bucket}/{prefix}/data/processing"
final_features_filename = "full_data.csv"

sklearn_processor.run(
    code="preprocessing_predw.py",
    inputs=processing_inputs,
    outputs=[
        ProcessingOutput(
            output_name="processed_data",
            source="/opt/ml/processing/output",
            destination=processing_output_path,
        )
    ],
    arguments=["--processing-output-filename", final_features_filename],
    wait=False,
)

preprocessing_job_description = sklearn_processor.jobs[-1].describe()


Job Name:  sagemaker-scikit-learn-2021-06-10-05-08-21-661
Inputs:  [{'InputName': 'simu-1', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-ap-southeast-1-570358149193/music-streaming/data/json/simu-1.json', 'LocalPath': '/opt/ml/processing/input/data/simu-1', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'simu-2', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-ap-southeast-1-570358149193/music-streaming/data/json/simu-2.json', 'LocalPath': '/opt/ml/processing/input/data/simu-2', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'simu-3', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-ap-southeast-1-570358149193/music-streaming/data/json/simu-3.json', 'LocalPath': '/opt/ml/processing/input/data/simu-3', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType

## EDA on sample data

While the Processing Job is running, we can look into the data.  
Because the full data set is large (~2GB), you will start by exploring a smaller sample, decide which pre-processing steps are necessary, and then apply them to the whole data set.

In [17]:
sample_file_name = "./data/raw/sample.json"
df = pd.read_json(sample_file_name, lines=True)
with pd.option_context("display.max_columns", 100):
    display(df)

Unnamed: 0,ts,userId,sessionId,page,auth,method,status,level,itemInSession,location,userAgent,lastName,firstName,registration,gender,artist,song,length
0,1592146267731,12065,118,NextSong,Logged In,PUT,200,paid,0,"Richmond, VA","""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",Davis,Bristol,1.591971e+12,M,Peter Tosh,Wanted Dread And Alive (2002 Digital Remaster),267.85914
1,1592146268731,12065,118,Thumbs Down,Logged In,PUT,307,paid,1,"Richmond, VA","""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",Davis,Bristol,1.591971e+12,M,,,
2,1592146534731,12065,118,NextSong,Logged In,PUT,200,paid,2,"Richmond, VA","""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",Davis,Bristol,1.591971e+12,M,Jimmy Eat World,The Middle,166.00771
3,1592146555731,12065,118,Home,Logged In,GET,200,paid,3,"Richmond, VA","""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",Davis,Bristol,1.591971e+12,M,,,
4,1592146700731,12065,118,NextSong,Logged In,PUT,200,paid,4,"Richmond, VA","""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",Davis,Bristol,1.591971e+12,M,Memphis La Blusera,Blues en Fa,307.82649
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
228113,1603797390731,12073,2795,NextSong,Logged In,PUT,200,paid,66,"Marietta, OH","""Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5...",White,Rebekah,1.591795e+12,F,Razed in Black,Sin,289.30567
228114,1603797391731,12073,2795,Thumbs Up,Logged In,PUT,307,paid,67,"Marietta, OH","""Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5...",White,Rebekah,1.591795e+12,F,,,
228115,1603797399731,12052,2910,NextSong,Logged In,PUT,200,paid,152,"Syracuse, NY",Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; r...,Chang,Jasper,1.591910e+12,M,Devendra Banhart,Foolin' (Album Version),163.34322
228116,1603797469731,12073,2795,Add to Playlist,Logged In,PUT,200,paid,68,"Marietta, OH","""Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5...",White,Rebekah,1.591795e+12,F,,,


In [32]:
# TBA

## Data Wrangler for Data Preprocessing

Now that we have a good understanding of the data and have decided which steps are needed to pre-process it, we can use Data Wrangler to design and run the process without writing all the code for a SageMaker Processing Job.

We can create a new `.flow` file, or use the template included in this repository.

In [31]:
processing_output_filename = f"{processing_output_path}/{final_features_filename}"
flow_file = "dw_example.template"

# read flow file and change the s3 location to our `processing_output_filename`
with open(flow_file, "r") as f:
    flow = json.loads(f.read())
    flow["nodes"][0]["parameters"]["dataset_definition"]["s3ExecutionContext"][
        "s3Uri"
    ] = processing_output_filename

with open("dw_example.flow", "w") as f:
    json.dump(flow, f)
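
The same patching step can be tried without the template file. In the sketch below, the dictionary is a minimal stand-in that contains only the keys touched by the cell above (a real `.flow` file has many more fields), and both S3 URIs are placeholders.

```python
import json

# Minimal stand-in for a flow file; only the keys used above are included.
template_text = json.dumps(
    {
        "nodes": [
            {
                "parameters": {
                    "dataset_definition": {
                        "s3ExecutionContext": {"s3Uri": "s3://placeholder/input.csv"}
                    }
                }
            }
        ]
    }
)

# Parse, point the first node at the new data set, and serialize again.
flow = json.loads(template_text)
new_uri = "s3://example-bucket/music-streaming/data/processing/full_data.csv"
flow["nodes"][0]["parameters"]["dataset_definition"]["s3ExecutionContext"][
    "s3Uri"
] = new_uri

patched_text = json.dumps(flow, indent=2)
print(patched_text)
```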

## Feature Engineering - Local Version
Here we outline one possible set of feature engineering steps. These steps mirror the Data Wrangler template flow file.

In [17]:
sample_file_name = "./data/raw/sample.json"
df = pd.read_json(sample_file_name, lines=True)
with pd.option_context("display.max_columns", 100):
    display(df)



#### Remove irrelevant columns

At first glance, the columns `lastName`, `firstName`, `method`, and `status` are not relevant features. They will be dropped from the data.

In [18]:
columns_to_remove = ["method", "status", "lastName", "firstName"]
df = df.drop(columns=columns_to_remove)

#### Check for null values

You are going to remove all events without a `userId` assigned, since you are predicting which recognized users will churn from the service. In this sample, all rows (events) have a `userId` and `sessionId` assigned, but you will still run this step for the full data set. Among the other columns, ~2.5% of events are missing the user's demographic information, and ~21% are missing the song attributes, because events include not only song plays but also other actions such as login, logout, downgrade, and cancellation. The ~2.5% of events without a registration time belong to anonymous users, so you will remove them from the record.

In [19]:
print("percentage of the value missing in each column is: ")
df.isnull().sum() / len(df)

percentage of the value missing in each column is: 


ts               0.000000
userId           0.000000
sessionId        0.000000
page             0.000000
auth             0.000000
level            0.000000
itemInSession    0.000000
location         0.025447
userAgent        0.025447
registration     0.025447
gender           0.025447
artist           0.210330
song             0.210330
length           0.210330
dtype: float64

In [20]:
df = df[~df["userId"].isnull()]
df = df[~df["registration"].isnull()]
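
The null check and the filtering step can be seen on a toy frame. The data below is invented; `dropna(subset=...)` is used here as an equivalent shorthand for the two boolean filters above.

```python
import pandas as pd

# Made-up events: the last row has no registration time (an anonymous user).
toy = pd.DataFrame(
    {
        "userId": [1, 1, 2, 3],
        "registration": [1.59e12, 1.59e12, 1.58e12, float("nan")],
        "song": ["A", None, "B", None],
    }
)

# Fraction of missing values per column.
missing_fraction = toy.isnull().mean()
print(missing_fraction)

# Keep only events that have both a user id and a registration time.
cleaned = toy.dropna(subset=["userId", "registration"])
print(len(toy), "->", len(cleaned))
```

Note that rows with a missing `song` are kept: a missing song only means the event was not a song play.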

In [21]:
df["ts"] = pd.to_datetime(df.ts, unit="ms")
df["registration"] = pd.to_datetime(df["registration"], unit="ms").dt.date

In [22]:
df["date"] = df["ts"].dt.date
df["ts_dow"] = df["ts"].dt.weekday
df["ts_is_weekday"] = (df["ts_dow"].isin([0, 1, 2, 3, 4])).astype(int)
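
As a worked example of the two cells above, the sketch below converts the first `ts` value from the sample output (milliseconds since the Unix epoch) and derives the weekday flag. The variable names are chosen for this example only.

```python
import pandas as pd

# First timestamp from the sample output, in milliseconds since the epoch.
ts = pd.Series([1592146267731])
parsed = pd.to_datetime(ts, unit="ms")

dow = parsed.dt.weekday  # Monday=0 ... Sunday=6
is_weekday = dow.isin([0, 1, 2, 3, 4]).astype(int)
print(parsed.iloc[0], dow.iloc[0], is_weekday.iloc[0])
```

The event falls on 2020-06-14, a Sunday, so `ts_dow` is 6 and `ts_is_weekday` is 0.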

In [23]:
df["churned_event"] = (df["page"] == "Cancellation Confirmation").astype(int)
df["user_churned"] = df.groupby("userId")["churned_event"].transform("max")
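
The labeling logic above marks each cancellation event and then broadcasts the flag to every row of the same user. A minimal sketch on invented data:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "userId": [1, 1, 2, 2, 2],
        "page": [
            "NextSong",
            "Cancellation Confirmation",
            "NextSong",
            "Home",
            "NextSong",
        ],
    }
)

# 1 only on the row where the cancellation is confirmed.
toy["churned_event"] = (toy["page"] == "Cancellation Confirmation").astype(int)

# transform("max") returns one value per original row, so every event of a
# user who cancelled at any point is labeled 1.
toy["user_churned"] = toy.groupby("userId")["churned_event"].transform("max")
print(toy)
```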

In [24]:
events_list = [
    "NextSong",
    "Thumbs Down",
    "Thumbs Up",
    "Add to Playlist",
    "Roll Advert",
    "Add Friend",
    "Downgrade",
    "Upgrade",
    "Error",
]
df["events"] = (
    df.page.str.lower().str.replace(" ", "_").where(df.page.isin(events_list))
)

In [25]:
df = pd.get_dummies(df, columns=["events"], prefix="events")
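
The two cells above first normalize the tracked page names and then one-hot encode them. The following sketch shows the effect on a few invented rows, with a shortened `tracked` list standing in for `events_list`.

```python
import pandas as pd

toy = pd.DataFrame({"page": ["NextSong", "Home", "Thumbs Up", "NextSong"]})
tracked = ["NextSong", "Thumbs Up"]

# Normalize names, then blank out (NaN) any page that is not tracked.
toy["events"] = (
    toy["page"].str.lower().str.replace(" ", "_").where(toy["page"].isin(tracked))
)

# One indicator column per tracked event; rows with NaN get all zeros.
encoded = pd.get_dummies(toy, columns=["events"], prefix="events")
print(encoded)
```

Depending on the pandas version, the indicator columns are integers or booleans; both can be summed in the aggregations that follow.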

In [26]:
base_df = (
    df.groupby(["userId", "date", "ts_is_weekday"])
    .agg({"page": "count"})
    .groupby(["userId", "ts_is_weekday"])["page"]
    .mean()
    .unstack(fill_value=0)
    .reset_index()
    .rename(columns={0: "average_events_weekend", 1: "average_events_weekday"})
)


base_df_daily = (
    df.groupby(["userId", "date"])
    .agg(
        {
            "page": "count",
            "events_nextsong": "sum",
            "events_roll_advert": "sum",
            "events_error": "sum",
        }
    )
    .reset_index()
)

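# NOTE: grouping by both "userId" and "date" puts each daily row in its own
# group, so .tail(7) keeps every row and the "7d" totals equal the all-time
# totals (as in the output below). The same holds for the 30d and 90d windows.
# To keep only each user's last 7 active days, group by "userId" alone:
# base_df_daily.groupby("userId").tail(7).
feature34 = (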
    base_df_daily.groupby(["userId", "date"])
    .tail(7)
    .groupby(["userId"])
    .agg({"events_nextsong": "sum", "events_roll_advert": "sum", "events_error": "sum"})
    .reset_index()
    .rename(
        columns={
            "events_nextsong": "num_songs_played_7d",
            "events_roll_advert": "num_ads_7d",
            "events_error": "num_error_7d",
        }
    )
)
feature5 = (
    base_df_daily.groupby(["userId", "date"])
    .tail(30)
    .groupby(["userId"])
    .agg({"events_nextsong": "sum"})
    .reset_index()
    .rename(columns={"events_nextsong": "num_songs_played_30d"})
)
feature6 = (
    base_df_daily.groupby(["userId", "date"])
    .tail(90)
    .groupby(["userId"])
    .agg({"events_nextsong": "sum"})
    .reset_index()
    .rename(columns={"events_nextsong": "num_songs_played_90d"})
)
# num_artists, num_songs, num_ads, num_thumbsup, num_thumbsdown, num_playlist, num_addfriend, num_error, user_downgrade,
# user_upgrade, percentage_ad, days_since_active
base_df_user = (
    df.groupby(["userId"])
    .agg(
        {
            "page": "count",
            "events_nextsong": "sum",
            "artist": "nunique",
            "song": "nunique",
            "events_thumbs_down": "sum",
            "events_thumbs_up": "sum",
            "events_add_to_playlist": "sum",
            "events_roll_advert": "sum",
            "events_add_friend": "sum",
            "events_downgrade": "max",
            "events_upgrade": "max",
            "events_error": "sum",
            "date": "max",
            "registration": "min",
            "user_churned": "max",
        }
    )
    .reset_index()
)

base_df_user["percentage_ad"] = (
    base_df_user["events_roll_advert"] / base_df_user["page"]
)
base_df_user["days_since_active"] = (
    base_df_user["date"] - base_df_user["registration"]
).dt.days
# repeats ratio
base_df_user["repeats_ratio"] = (
    1 - base_df_user["song"] / base_df_user["events_nextsong"]
)

# num_sessions, avg_time_per_session, avg_events_per_session,
base_df_session = (
    df.groupby(["userId", "sessionId"])
    .agg({"length": "sum", "page": "count", "date": "min"})
    .reset_index()
)
base_df_session["prev_session_ts"] = base_df_session.groupby(["userId"])["date"].shift(
    1
)
base_df_session["gap_session"] = (
    base_df_session["date"] - base_df_session["prev_session_ts"]
).dt.days
user_sessions = (
    base_df_session.groupby("userId")
    .agg(
        {"sessionId": "count", "length": "mean", "page": "mean", "gap_session": "mean"}
    )
    .reset_index()
    .rename(
        columns={
            "sessionId": "num_sessions",
            "length": "avg_time_per_session",
            "page": "avg_events_per_session",
            "gap_session": "avg_gap_between_session",
        }
    )
)

# merge features together
base_df["userId"] = base_df["userId"]  # .astype("int")
final_feature_df = base_df.merge(feature34, how="left", on="userId")
final_feature_df = final_feature_df.merge(feature5, how="left", on="userId")
final_feature_df = final_feature_df.merge(feature6, how="left", on="userId")
final_feature_df = final_feature_df.merge(user_sessions, how="left", on="userId")
final_feature_df = final_feature_df.merge(base_df_user, how="left", on="userId")

final_feature_df = final_feature_df.fillna(0)
# renaming columns
final_feature_df.columns = [
    "userId",
    "average_events_weekend",
    "average_events_weekday",
    "num_songs_played_7d",
    "num_ads_7d",
    "num_error_7d",
    "num_songs_played_30d",
    "num_songs_played_90d",
    "num_sessions",
    "avg_time_per_session",
    "avg_events_per_session",
    "avg_gap_between_session",
    "num_events",
    "num_songs",
    "num_artists",
    "num_unique_songs",
    "num_thumbs_down",
    "num_thumbs_up",
    "num_add_to_playlist",
    "num_ads",
    "num_add_friend",
    "num_downgrade",
    "num_upgrade",
    "num_error",
    "ts_date_day",
    "registration",
    "user_churned",
    "percentage_ad",
    "days_since_active",
    "repeats_ratio",
]
# only keep created feature columns
final_feature_df = final_feature_df[
    [
        "userId",
        "user_churned",
        "average_events_weekend",
        "average_events_weekday",
        "num_songs_played_7d",
        "num_ads_7d",
        "num_error_7d",
        "num_songs_played_30d",
        "num_songs_played_90d",
        "num_sessions",
        "avg_time_per_session",
        "avg_events_per_session",
        "avg_gap_between_session",
        "num_events",
        "num_songs",
        "num_artists",
        "num_thumbs_down",
        "num_thumbs_up",
        "num_add_to_playlist",
        "num_ads",
        "num_add_friend",
        "num_downgrade",
        "num_upgrade",
        "num_error",
        "percentage_ad",
        "days_since_active",
        "repeats_ratio",
    ]
]
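
The `num_songs_played_*` features are meant to summarize each user's most recent daily rows. As a sketch of one way to select the last N active days per user (an assumption about the intent; it counts days with activity, not calendar days), `groupby("userId").tail(n)` on a frame sorted by date keeps the final n rows of each user. The data and the 2-day window below are invented for illustration.

```python
import pandas as pd

daily = pd.DataFrame(
    {
        "userId": [1, 1, 1, 2, 2],
        "date": pd.to_datetime(
            ["2020-06-01", "2020-06-02", "2020-06-03", "2020-06-01", "2020-06-05"]
        ),
        "events_nextsong": [10, 20, 30, 5, 7],
    }
).sort_values(["userId", "date"])

# Last 2 active days per user, then the total songs played in that window.
last_2d = (
    daily.groupby("userId")
    .tail(2)
    .groupby("userId")["events_nextsong"]
    .sum()
    .rename("num_songs_played_2d")
    .reset_index()
)
print(last_2d)

# Grouping by user AND date makes every row its own group, so tail() drops nothing.
kept_all = daily.groupby(["userId", "date"]).tail(2)
print(len(kept_all), "of", len(daily), "rows kept")
```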

In [27]:
final_feature_df

Unnamed: 0,userId,user_churned,average_events_weekend,average_events_weekday,num_songs_played_7d,num_ads_7d,num_error_7d,num_songs_played_30d,num_songs_played_90d,num_sessions,...,num_thumbs_up,num_add_to_playlist,num_ads,num_add_friend,num_downgrade,num_upgrade,num_error,percentage_ad,days_since_active,repeats_ratio
0,12001,1,41.333333,139.300000,1223.0,2,3,1223.0,1223.0,14,...,118.0,45,2,20,1,1,3,0.001318,38,0.092396
1,12002,0,71.142857,85.121212,3050.0,37,0,3050.0,3050.0,49,...,278.0,99,37,57,1,1,0,0.009724,136,0.159672
2,12003,0,68.833333,81.270270,2766.0,11,4,2766.0,2766.0,36,...,248.0,94,11,44,1,1,4,0.003216,136,0.142082
3,12004,0,139.000000,82.750000,1454.0,4,0,1454.0,1454.0,18,...,133.0,46,4,31,1,1,0,0.002230,135,0.103851
4,12005,0,53.142857,59.608696,1417.0,1,1,1417.0,1417.0,29,...,111.0,36,1,25,1,1,1,0.000574,135,0.106563
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,12097,0,79.333333,57.382353,1956.0,7,3,1956.0,1956.0,41,...,164.0,65,7,43,1,1,3,0.002884,138,0.117076
97,12098,0,64.750000,64.764706,1315.0,2,3,1315.0,1315.0,22,...,129.0,47,2,21,1,0,3,0.001235,127,0.095057
98,12099,0,94.000000,43.666667,335.0,13,0,335.0,335.0,7,...,23.0,12,13,1,0,1,0,0.031477,97,0.023881
99,12100,0,42.500000,117.500000,467.0,0,2,467.0,467.0,5,...,41.0,11,0,2,1,0,2,0.000000,131,0.032120
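
To clarify how the two ratio features in the table are defined, the sketch below recomputes them on invented counts using the final column names. `percentage_ad` is a fraction of all events (not a value between 0 and 100), and `repeats_ratio` is the share of song plays that were not the first play of a distinct title.

```python
import pandas as pd

# Hypothetical per-user counts.
users = pd.DataFrame(
    {
        "userId": [1, 2],
        "num_events": [200, 50],  # all events
        "num_songs": [100, 40],  # NextSong events
        "num_unique_songs": [80, 40],  # distinct song titles
        "num_ads": [4, 5],  # Roll Advert events
    }
)

users["percentage_ad"] = users["num_ads"] / users["num_events"]
users["repeats_ratio"] = 1 - users["num_unique_songs"] / users["num_songs"]
print(users[["userId", "percentage_ad", "repeats_ratio"]])
```

A user who never replays a song has a `repeats_ratio` of 0, as the second user here does.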
