# Data Preparation for Predicting Customer Churn
---



In this demo, you will learn how to use SageMaker Data Wrangler to prepare the data for training a classification model that predicts whether a customer is likely to churn from a music streaming service.

## Overview

### What is Customer Churn and why is it important for businesses?
Customer churn, also called customer attrition, occurs when a customer leaves and stops paying for a business's product or service. It is the counterpart of customer retention and one of the primary metrics companies track to gauge customer satisfaction, especially under a subscription-based business model. A company can track its churn rate (the percentage of customers who churned during a period) as a health indicator for the business, but it is more valuable to identify at-risk customers before they churn and offer appropriate treatment to keep them with the business. This is where machine learning comes into play.
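
As a small worked example of the churn rate definition above, here is a minimal sketch. The function name `churn_rate` and the subscriber numbers are made up for illustration, and the rate is assumed to be measured against the number of customers at the start of the period.

```python
def churn_rate(customers_at_start: int, customers_churned: int) -> float:
    """Return the churn rate for a period as a percentage."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return 100.0 * customers_churned / customers_at_start


# Hypothetical month: 2,000 subscribers at the start, 130 of them cancelled.
monthly_churn = churn_rate(2000, 130)
print(f"Monthly churn rate: {monthly_churn:.1f}%")
```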

### Use Cases for Customer Churn

Any subscription-based business would track customer churn as one of its most critical Key Performance Indicators (KPIs). Such businesses include telecom companies (cable, cell phone, internet, etc.), digital media subscriptions (news, forums, blogging platforms, etc.), music and video streaming services, and other Software as a Service (SaaS) providers (e-commerce, CRM, Mar-Tech, cloud computing, video conferencing, visualization and data science tools, etc.).

### Define Business problem

To start with, here are some common business problems to consider depending on your specific use cases and your focus:

 * Will this customer churn (cancel the plan, cancel the subscription)?
 * Will this customer downgrade a pricing plan?
 * For a subscription business model, will a customer renew their subscription?

### Machine learning problem formulation

#### Classification: will this customer churn?

The goal of classification is to identify at-risk customers and, sometimes, their unusual behavior. Typical questions are: will this customer churn or downgrade their plan? Is this customer behaving unusually? The latter question can be formulated as an anomaly detection problem.

#### Time Series: will this customer churn in the next X months? When will this customer churn?

You can explore your users further by formulating the problem as a time series problem and predicting when a customer will churn.

### Data Requirements

#### Data collection Sources

Some of the most common data sources used to construct a data set for churn analysis are:

* Customer Relationship Management (CRM) platforms
* engagement and usage data (analytics services)
* passive feedback (ratings given in response to your request)
* active feedback (customer support requests, feedback on social media and review platforms)

#### Construct a Data Set for Churn Analysis

Most raw data collected from the sources mentioned above is large and often needs a lot of cleaning and pre-processing. For example, usage data is usually event-based log data and can amount to more than a few gigabytes every day; you can aggregate it to a daily, user-level summary for further analysis. Feedback and review data are mostly text, so you need to clean and pre-process the natural language into normalized, machine-readable data. If you are joining multiple data sources (especially from different platforms), make sure that all data points are consistent and that user identities can be matched across platforms.
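
The daily, user-level aggregation mentioned above can be sketched with pandas as follows. The toy event log and the output column names (`num_events`, `num_songs`) are assumptions for illustration, not part of the dataset used later.

```python
import pandas as pd

# Hypothetical event log: one row per event, timestamps in milliseconds.
events = pd.DataFrame(
    {
        "userId": [1, 1, 1, 2, 2],
        "ts": [
            1_600_000_000_000,  # 2020-09-13
            1_600_000_600_000,  # 2020-09-13, ten minutes later
            1_600_090_000_000,  # 2020-09-14
            1_600_000_000_000,  # 2020-09-13
            1_600_180_000_000,  # 2020-09-15
        ],
        "page": ["NextSong", "Thumbs Up", "NextSong", "NextSong", "Error"],
    }
)

# Derive the calendar date and a 0/1 flag for song plays.
events["date"] = pd.to_datetime(events["ts"], unit="ms").dt.date
events["is_song"] = (events["page"] == "NextSong").astype(int)

# One row per user per day instead of one row per event.
daily = (
    events.groupby(["userId", "date"])
    .agg(num_events=("page", "count"), num_songs=("is_song", "sum"))
    .reset_index()
)
print(daily)
```

The aggregated table has one row per user per active day, which is usually far smaller than the raw log.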
           
#### Challenges with Customer Churn

* Business related
    * Importance of domain knowledge: this is critical when you start building features for the machine learning model. It is important to understand the business enough to decide which features would trigger retention.
* Data issues
    * Little churn data available (imbalanced classes): data for churn analysis is often very imbalanced because most customers of a business are (usually) happy customers.
    * User identity mapping problem: if you are joining data from different platforms (CRM, email, feedback, mobile app, and website usage data), you would want to make sure user A is recognized as the same user across multiple platforms. There are third-party solutions that help you tackle this problem.
    * Not collecting the right data for the use case, or not having enough data.

## Data Selection

You will use generated music streaming data that imitates the behavior of music streaming users. The simulated data contains 1100 users and their behavior over one year (2019/10/28 - 2020/10/28). The data is simulated using [EventSim](https://github.com/Interana/eventsim) and does not contain any real user data.

* Observation window: you will use 1 year of data to generate predictions.
* Explanation of fields:
    * `ts`: event UNIX timestamp
    * `userId`: a randomly assigned unique user id
    * `sessionId`: a randomly assigned session id unique to each user
    * `page`: event taken by the user, e.g. "next song", "upgrade", "cancel"
    * `auth`: whether the user is a logged-in user
    * `method`: request method, GET or PUT
    * `status`: request status
    * `level`: if the user is a free or paid user
    * `itemInSession`: sequence number of the event within the session
    * `location`: location of the user's IP address
    * `userAgent`: agent of the user's device
    * `lastName`: user's last name
    * `firstName`: user's first name
    * `registration`: user's time of registration
    * `gender`: gender of the user
    * `artist`: artist of the song the user is playing at the event
    * `song`: song title the user is playing at the event
    * `length`: length of the session
 
 
* The data will be downloaded from GitHub and stored in an [Amazon Simple Storage Service](https://aws.amazon.com/s3/) (Amazon S3) bucket.

**References**:
- [Predicting Customer Churn with Amazon Machine Learning](https://aws.amazon.com/blogs/machine-learning/predicting-customer-churn-with-amazon-machine-learning/)
- [Customer Churn Prediction with XGBoost](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_applying_machine_learning/xgboost_customer_churn/xgboost_customer_churn.html)

## Prepare the notebook

In [None]:
import json
import shutil
from pathlib import Path

import pandas as pd
import s3fs
import sagemaker as sm

Define the environment to use SageMaker and to organize the project data and artifacts in S3.

In [None]:
sagemaker_session = sm.Session()
bucket = sagemaker_session.default_bucket()
region = sagemaker_session.boto_region_name
role = sm.get_execution_role()
smclient = sagemaker_session.sagemaker_client

s3 = s3fs.S3FileSystem()

prefix = "music-streaming"

<a id='4'></a>

### Ingest Data

We ingest the simulated data from the public SageMaker sample-files S3 bucket.

In [None]:
data_source_url = "s3://sagemaker-sample-files/datasets/tabular/customer-churn/customer-churn-data.zip"

In [None]:
sm.s3.S3Downloader().download(data_source_url, "./data")
shutil.unpack_archive("data/customer-churn-data.zip", "data/", format="zip")

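# Unpack each simulated archive, then the sample archive, into data/raw.
for k in Path("data").glob("simu*"):
    shutil.unpack_archive(k, k.parent / "raw", format="zip")
shutil.unpack_archive("data/sample.zip", "data/raw", format="zip")

# Remove the archives once they are unpacked.
for k in Path("data").glob("*.zip"):
    k.unlink()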

We upload the raw data to S3.

In [None]:
sm.s3.S3Uploader.upload("data/raw/", f"s3://{bucket}/{prefix}/data/json")

## Preprocessing `json` to `csv`
Here we use a Processing Job to convert the raw streaming data files downloaded from the GitHub repo (`simu-*.zip` files) from JSON to CSV and combine them into one full CSV file for Data Wrangler ingestion.

In [None]:
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

Here's the script used for the Processing Job. We save it into a separate file.

In [None]:
%%writefile preprocessing_predw.py

import argparse
import glob
import json
import os
import time
import warnings

import pandas as pd
from sklearn.exceptions import DataConversionWarning

warnings.filterwarnings(action="ignore", category=DataConversionWarning)
start_time = time.time()

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--processing-output-filename")

    args, _ = parser.parse_known_args()
    print("Received arguments {}".format(args))

    input_jsons = glob.glob("/opt/ml/processing/input/data/**/*.json", recursive=True)

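    # Read every JSON Lines file, then concatenate once. DataFrame.append
    # was removed in pandas 2.0 and copied the data on every call.
    frames = []
    for name in input_jsons:
        print("\nStarting file: {}".format(name))
        frames.append(pd.read_json(name, lines=True))
    df_all = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()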

    output_filename = args.processing_output_filename
    final_features_output_path = os.path.join(
        "/opt/ml/processing/output", output_filename
    )
    print("Saving processed data to {}".format(final_features_output_path))
    df_all.to_csv(final_features_output_path, header=True, index=False)

In [None]:
sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

Select the files to process and add them to the `ProcessingInput` structure.

In [None]:
s3_input_uris = [
    f"s3://{k}" for k in s3.ls(f"s3://{bucket}/{prefix}/data/json/") if "simu" in k
]

In [None]:
processing_inputs = []
for uri in s3_input_uris:
    name = uri.split("/")[-1].split(".")[0]
    processing_input = ProcessingInput(
        source=uri, input_name=name, destination=f"/opt/ml/processing/input/data/{name}"
    )
    processing_inputs.append(processing_input)

In [None]:
processing_output_path = f"s3://{bucket}/{prefix}/data/processing"
final_features_filename = "full_data.csv"

sklearn_processor.run(
    code="preprocessing_predw.py",
    inputs=processing_inputs,
    outputs=[
        ProcessingOutput(
            output_name="processed_data",
            source="/opt/ml/processing/output",
            destination=processing_output_path,
        )
    ],
    arguments=["--processing-output-filename", final_features_filename],
    wait=False,
)

preprocessing_job_description = sklearn_processor.jobs[-1].describe()

## Exploration on Sample Data

While the Processing Job is underway, we can look into the data.  
Because the full dataset is large (~2GB), you will start by exploring a smaller sample, decide which pre-processing steps are necessary, and then apply them to the whole dataset.

In [None]:
sample_file_name = "./data/raw/sample.json"

sample = pd.read_json(sample_file_name, lines=True)
with pd.option_context("display.max_columns", 100):
    display(sample)

### Data Exploration

Let's take a look at our categorical columns first: `page`, `auth`, `level`, `location`, `userAgent`, `gender`, `artist`, and `song`. Start by looking at the unique values of `page`, `auth`, `level`, and `gender`; the other four have many unique values, so you will take a different approach for them.

In [None]:
cat_columns = ["page", "auth", "level", "gender"]
cat_columns_long = ["location", "userAgent", "artist", "song", "userId"]
for col in cat_columns:
    print("The unique values in column {} are: {}".format(col, sample[col].unique()))
for col in cat_columns_long:
    print("There are {} unique values  in column {}".format(sample[col].nunique(), col))

#### Key observations from the above information

* There are 101 unique users and 72 unique locations, so `location` may not be useful as a categorical feature. You can parse this field and keep only the state information, but even that can yield up to 50 unique values, so you can either remove this column or bucket it to a higher level (NY --> Northeast).
* Artist and song details might not be helpful as categorical features because there are too many categories; you can quantify these at user level, i.e. how many artists this user has listened to in total, or how many songs this user has played in the last week, the last month, the last 180 days, or the last 365 days. You can also bring in external data such as song genres and other artist attributes to enrich this feature.
* In the column `page`, 'Thumbs Down', 'Thumbs Up', 'Add to Playlist', 'Roll Advert', 'Help', 'Add Friend', 'Downgrade', 'Upgrade', and 'Error' can all be useful features for churn analysis. You will aggregate them to user level later. The value 'Cancellation Confirmation' can be used as the churn indicator.

Let's take a look at the column `userAgent`.

`userAgent` contains little useful information, but if you care about the browser type and the Mac/Windows difference, you can parse the text and extract that information. Sometimes businesses want to analyze user behavior by App version and device type (iOS vs. Android), so this could be useful information. In this use case, we will remove this column for modeling purposes, but you can keep it as a filter for data visualization.
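
If you would rather keep coarse versions of `location` and `userAgent` than drop them, the parsing options described above could look roughly like the sketch below. It assumes that `location` ends with a comma followed by one or more state codes and that `userAgent` is a standard browser string. The helper names, the toy rows, and the deliberately partial `STATE_TO_REGION` lookup are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the real columns may be formatted differently.
toy = pd.DataFrame(
    {
        "location": ["New York-Newark-Jersey City, NY-NJ-PA", "Bakersfield, CA"],
        "userAgent": [
            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36",
        ],
    }
)

# Deliberately partial state-to-region lookup, for illustration only.
STATE_TO_REGION = {"NY": "Northeast", "NJ": "Northeast", "PA": "Northeast", "CA": "West"}


def first_state(location: str) -> str:
    """Return the first state code after the last comma, e.g. 'NY'."""
    return location.rsplit(",", 1)[-1].strip().split("-")[0]


def os_family(user_agent: str) -> str:
    """Very rough operating-system bucket based on substrings."""
    for needle, label in (("Windows", "windows"), ("Macintosh", "mac"), ("Linux", "linux")):
        if needle in user_agent:
            return label
    return "other"


toy["state"] = toy["location"].map(first_state)
toy["region"] = toy["state"].map(STATE_TO_REGION).fillna("Unknown")
toy["os"] = toy["userAgent"].map(os_family)
print(toy[["state", "region", "os"]])
```

Bucketing states into a handful of regions keeps the cardinality low enough for one-hot encoding.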

In [None]:
columns_to_remove = ["location", "userAgent"]
sample = sample.drop(columns=columns_to_remove)

Let's take a closer look at the timestamp columns `ts` and `registration`. We can convert the event timestamp `ts` to year, month, week, day, day of the week, and hour of the day. The registration time should be the same for the same user, so we can aggregate this value to user-level and create a time delta column to calculate the time between registration and the newest event.

In [None]:
sample["date"] = pd.to_datetime(sample["ts"], unit="ms")
sample["ts_year"] = sample["date"].dt.year
sample["ts_month"] = sample["date"].dt.month
# Series.dt.week was removed in pandas 2.0; isocalendar() provides the ISO week.
sample["ts_week"] = sample["date"].dt.isocalendar().week
sample["ts_day"] = sample["date"].dt.day
sample["ts_dow"] = sample["date"].dt.weekday
sample["ts_hour"] = sample["date"].dt.hour
sample["ts_date_day"] = sample["date"].dt.date
sample["ts_is_weekday"] = [1 if x in [0, 1, 2, 3, 4] else 0 for x in sample["ts_dow"]]
sample["registration_ts"] = pd.to_datetime(sample["registration"], unit="ms").dt.date
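
The time delta between registration and the newest event, mentioned above, could be computed along these lines. This is a minimal sketch on toy data; the column name `days_since_registration` and the timestamps are assumptions for illustration.

```python
import pandas as pd

# Hypothetical events: millisecond timestamps for the event and the registration.
toy = pd.DataFrame(
    {
        "userId": [1, 1, 2],
        "ts": [1_600_000_000_000, 1_600_864_000_000, 1_600_000_000_000],
        "registration": [1_599_136_000_000, 1_599_136_000_000, 1_599_913_600_000],
    }
)

toy["event_time"] = pd.to_datetime(toy["ts"], unit="ms")
toy["registration_time"] = pd.to_datetime(toy["registration"], unit="ms")

# Registration is constant per user, so "min" simply picks that value.
per_user = toy.groupby("userId").agg(
    registration_time=("registration_time", "min"),
    last_event_time=("event_time", "max"),
)
per_user["days_since_registration"] = (
    per_user["last_event_time"] - per_user["registration_time"]
).dt.days
print(per_user)
```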

#### Define Churn

In this use case, you will use `page == "Cancellation Confirmation"` as the indicator of user churn. You can also use `page == "Downgrade"` if you are interested in users downgrading their payment plan. About 13% of the users churned, so you will need to up-sample or down-sample the full dataset to deal with the imbalanced classes, or choose your algorithms carefully.

In [None]:
print(
    "There are {:.2f}% of users churned in this dataset".format(
        (
            (sample[sample["page"] == "Cancellation Confirmation"]["userId"].nunique())
            / sample["userId"].nunique()
        )
        * 100
    )
)

You can label a user by adding a churn label at the event level and then aggregating this value to the user level.

In [None]:
sample["churned_event"] = [
    1 if x == "Cancellation Confirmation" else 0 for x in sample["page"]
]
sample["user_churned"] = sample.groupby("userId")["churned_event"].transform("max")

#### Imbalanced Class

Imbalanced classes (far fewer positive cases than negative cases, since churned users are the minority) are very common in churn analysis. This can be misleading for some machine learning models because accuracy is biased towards the majority class. Some useful tactics for dealing with imbalanced classes are over-sampling with [SMOTE](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html), using algorithms that are less sensitive to class imbalance, such as tree-based algorithms, or using a cost-sensitive algorithm that penalizes misclassification of the minority class.
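
To make the cost-sensitive option concrete, here is a minimal sketch of "balanced" class weights, where each class is weighted by `n_samples / (n_classes * class_count)`. This is the same heuristic scikit-learn applies for `class_weight="balanced"`. The helper name `balanced_class_weights` and the 13/87 split are illustrative assumptions.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label vector: 13 churned users (1) and 87 active users (0).
labels = [1] * 13 + [0] * 87
weights = balanced_class_weights(labels)
print(weights)
```

The minority (churned) class receives the larger weight, so a model that accepts per-class weights is penalized more for misclassifying a churned user.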

To summarize the pre-processing steps covered so far:
* remove null values
* drop irrelevant columns
* convert event timestamps to features used for analysis and modeling: year, month, week, day, day of week, hour, date, and whether the day is a weekday or a weekend; also convert the registration timestamp to a UTC date
* create labels (whether the user eventually churned): if at least one churn event happened in the user's history, label the user as churned (1)

#### Exploring Data

Based on the available data, look at every column and decide whether you can create a feature from it. Here are some directions to explore:

* `ts`: distribution of activity time: time of the day, day of the week
* `sessionId`: average number of sessions per user
* `page`: number of thumbs up/thumbs down, songs added to a playlist, ads, friends added, whether the user has downgraded or upgraded the plan, how many errors the user has encountered
* `level`: whether the user is a free or paid user
* `registration`: number of days the user has been active, time the user joined the service
* `gender`: gender of the user
* `artist`: average number of artists the user listened to
* `song`: average number of songs listened to per user
* `length`: average time spent per day per user
   
**Activity Time**

1. Weekday vs. weekend trends for churned users and active users. It seems that churned users are more active on weekdays than on weekends, whereas active users do not show a strong difference between weekdays and weekends. You can create some features from this: for each user, average events per day on weekends and average events per day on weekdays. You can also create features for average events per day of the week, but that becomes 7 features after one-hot encoding, which may be less informative than the previous method.
2. In terms of hours active during a day, our simulated data did not show a significant difference between day and night for either set of users. You can keep it on the checklist for your own analysis, and similarly for the day of the month and the month of the year when you have more than a year of data.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

events_per_day_per_user = (
    sample.groupby(["userId", "ts_date_day", "ts_is_weekday", "user_churned"])
    .agg({"page": "count"})
    .reset_index()
)
events_dist = (
    events_per_day_per_user.groupby(["userId", "ts_is_weekday", "user_churned"])
    .agg({"page": "mean"})
    .reset_index()
)


def trend_plot(
    df,
    plot_type,
    x,
    y,
    hue=None,
    title=None,
    x_axis=None,
    y_axis=None,
    xticks=None,
    yticks=None,
):
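    # Apply the style before drawing so that it affects this figure.
    sns.set(rc={"figure.figsize": (12, 3)})
    sns.set_palette("Set2")
    sns.set_style("darkgrid")

    if plot_type == "box":
        # Use the x argument instead of a hard-coded column name.
        sns.boxplot(x=x, y=y, data=df, hue=hue, orient="h")
    elif plot_type == "bar":
        sns.barplot(x=x, y=y, data=df, hue=hue)
    else:
        raise ValueError("Unsupported plot_type: {}".format(plot_type))

    plt.title(title)
    plt.xlabel(x_axis)
    plt.ylabel(y_axis)
    # Relabel the two category ticks only when labels are provided.
    if yticks is not None:
        plt.yticks([0, 1], yticks)
    plt.show()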


trend_plot(
    events_dist,
    "box",
    "page",
    "user_churned",
    "ts_is_weekday",
    "Weekday V.S. Weekends - Average events per day per user",
    "average events per user per day",
    yticks=["active users", "churned users"],
)

In [None]:
events_per_hour_per_user = (
    sample.groupby(["userId", "ts_date_day", "ts_hour", "user_churned"])
    .agg({"page": "count"})
    .reset_index()
)
events_dist = (
    events_per_hour_per_user.groupby(["userId", "ts_hour", "user_churned"])
    .agg({"page": "mean"})
    .reset_index()
    .groupby(["ts_hour", "user_churned"])
    .agg({"page": "mean"})
    .reset_index()
)
trend_plot(
    events_dist,
    "bar",
    "ts_hour",
    "page",
    "user_churned",
    "Hourly activity - Average events per hour of day per user",
    "hour of the day",
    "average events per user per hour",
)

**Listening Behavior**

You can look at some basic statistics of a user's listening habits. Churned users generally listen to a wider variety of songs and artists, spend more time in the App, and stay with it longer.
* Average total: number of sessions, App usage length, number of songs listened to, number of artists listened to per user, number of days active
* Average daily: number of sessions, App usage length, number of songs listened to, number of artists listened to per user


In [None]:
stats_per_user = (
    sample.groupby(["userId", "user_churned"])
    .agg(
        {
            # nunique counts distinct sessions and days; count would count event rows
            "sessionId": "nunique",
            "song": "nunique",
            "artist": "nunique",
            "length": "sum",
            "ts_date_day": "nunique",
        }
    )
    .reset_index()
)
avg_stats_group = (
    stats_per_user.groupby(["user_churned"])
    .agg(
        {
            "sessionId": "mean",
            "song": "mean",
            "artist": "mean",
            "length": "mean",
            "ts_date_day": "mean",
        }
    )
    .reset_index()
)

print(
    "Average total per user: number of sessions, number of songs listened to, number of artists listened to, App usage length, days active: "
)
avg_stats_group

In [None]:
stats_per_user = (
    sample.groupby(["userId", "ts_date_day", "user_churned"])
    .agg(
        # nunique counts distinct sessions per day; count would count event rows
        {"sessionId": "nunique", "song": "nunique", "artist": "nunique", "length": "sum"}
    )
    .reset_index()
)
avg_stats_group = (
    stats_per_user.groupby(["user_churned"])
    .agg({"sessionId": "mean", "song": "mean", "artist": "mean", "length": "mean"})
    .reset_index()
)
print(
    "Average daily per user: number of sessions, number of songs listened to, number of artists listened to, App usage length: "
)
avg_stats_group

**App Usage Behavior**

You can further explore how users use the App besides just listening: number of thumbs up/thumbs down, songs added to a playlist, ads, friends added, whether the user has downgraded or upgraded the plan, and how many errors the user has encountered. Churned users are slightly more active than other users; they also encounter more errors, listen to more ads, and downgrade and upgrade more often. These can be numerical features (total number of events per type per user) or more advanced time series numerical features (errors in the last 7 days, errors in the last month, etc.).

In [None]:
events_list = [
    "NextSong",
    "Thumbs Down",
    "Thumbs Up",
    "Add to Playlist",
    "Roll Advert",
    "Add Friend",
    "Downgrade",
    "Upgrade",
    "Error",
]
usage_column_name = []
for event in events_list:
    event_name = "_".join(event.split()).lower()
    usage_column_name.append(event_name)
    sample[event_name] = [1 if x == event else 0 for x in sample["page"]]

In [None]:
app_use_per_user = (
    sample.groupby(["userId", "user_churned"])[usage_column_name].sum().reset_index()
)

In [None]:
app_use_group = (
    app_use_per_user.groupby(["user_churned"])[usage_column_name].mean().reset_index()
)
app_use_group
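
The time-windowed features mentioned above (for example, errors in the last 7 days) can be sketched as follows. The sketch assumes the window ends at each user's own most recent event and uses toy data; the feature name `num_error_7d` is chosen for illustration.

```python
import pandas as pd

# Hypothetical events: one row per event with its calendar date.
toy = pd.DataFrame(
    {
        "userId": [1, 1, 1, 2, 2, 3],
        "date": pd.to_datetime(
            [
                "2020-10-01",
                "2020-10-20",
                "2020-10-25",
                "2020-09-01",
                "2020-09-03",
                "2020-10-10",
            ]
        ),
        "page": ["Error", "Error", "NextSong", "Error", "Error", "NextSong"],
    }
)

# Each user's window ends at their own most recent event.
last_seen = toy.groupby("userId")["date"].transform("max")
in_window = toy["date"] > last_seen - pd.Timedelta(days=7)

# Count errors inside the window; users without errors get an explicit 0.
errors_7d = (
    toy[in_window & (toy["page"] == "Error")]
    .groupby("userId")
    .size()
    .reindex(toy["userId"].unique(), fill_value=0)
    .rename("num_error_7d")
)
print(errors_7d)
```

Anchoring the window at a fixed cutoff date instead of each user's last event is an equally valid choice; it depends on when predictions are made.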


## DataWrangler for Data Preprocessing

Now that we have a good understanding of the data and have decided which steps are needed to pre-process it, we can use Data Wrangler to design and run the process without writing all the code for a SageMaker Processing Job.

We can create a new `.flow` file, or use the template included in this repository.

In [None]:
processing_output_filename = f"{processing_output_path}/{final_features_filename}"
flow_file = "dw_example.template"

# read flow file and change the s3 location to our `processing_output_filename`
with open(flow_file, "r") as f:
    flow = json.loads(f.read())
    flow["nodes"][0]["parameters"]["dataset_definition"]["s3ExecutionContext"][
        "s3Uri"
    ] = processing_output_filename

with open("dw_example.flow", "w") as f:
    json.dump(flow, f)

## Feature Engineering - Local Version
Here we outline one possible feature engineering approach. These steps mirror the Data Wrangler template flow file.

In [None]:
df = pd.read_json(sample_file_name, lines=True)

#### Remove irrelevant columns

At first glance, the columns `lastName`, `firstName`, `method` and `status` are not relevant features, so they are dropped from the data.

In [None]:
columns_to_remove = ["method", "status", "lastName", "firstName"]
df = df.drop(columns=columns_to_remove)

#### Check for null values

You are going to remove all events without a `userId` assigned, since you are predicting which recognized users will churn from the service. In this case, all the rows (events) have a `userId` and `sessionId` assigned, but you will still run this step for the full dataset. For the other columns, ~3% of the rows are missing some demographic information about the user, and ~20% are missing the song attributes, because the events include not only playing a song but also other actions such as login and logout, downgrade, and cancellation. About 3% of the users do not have a registration time, so you will remove these anonymous users from the record.

In [None]:
print("percentage of the value missing in each column is: ")
df.isnull().sum() / len(df)

In [None]:
df = df[~df["userId"].isnull()]
df = df[~df["registration"].isnull()]

In [None]:
df["ts"] = pd.to_datetime(df.ts, unit="ms")
df["registration"] = pd.to_datetime(df["registration"], unit="ms").dt.date

In [None]:
df["date"] = df["ts"].dt.date
df["ts_dow"] = df["ts"].dt.weekday
df["ts_is_weekday"] = (df["ts_dow"].isin([0, 1, 2, 3, 4])).astype(int)

In [None]:
df["churned_event"] = (df["page"] == "Cancellation Confirmation").astype(int)
df["user_churned"] = df.groupby("userId")["churned_event"].transform("max")

In [None]:
events_list = [
    "NextSong",
    "Thumbs Down",
    "Thumbs Up",
    "Add to Playlist",
    "Roll Advert",
    "Add Friend",
    "Downgrade",
    "Upgrade",
    "Error",
]
df["events"] = (
    df.page.str.lower().str.replace(" ", "_").where(df.page.isin(events_list))
)

In [None]:
# dtype=int keeps the indicator columns numeric (recent pandas defaults to bool).
df = pd.get_dummies(df, columns=["events"], prefix="events", dtype=int)

In [None]:
base_df = (
    df.groupby(["userId", "date", "ts_is_weekday"])
    .agg({"page": "count"})
    .groupby(["userId", "ts_is_weekday"])["page"]
    .mean()
    .unstack(fill_value=0)
    .reset_index()
    .rename(columns={0: "average_events_weekend", 1: "average_events_weekday"})
)


base_df_daily = (
    df.groupby(["userId", "date"])
    .agg(
        {
            "page": "count",
            "events_nextsong": "sum",
            "events_roll_advert": "sum",
            "events_error": "sum",
        }
    )
    .reset_index()
)

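# Group by user only: grouping by (userId, date) puts each daily row in its own
# group, so tail(7) would keep every row. This keeps each user's last 7 active days.
feature34 = (
    base_df_daily.groupby("userId")
    .tail(7)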
    .groupby(["userId"])
    .agg({"events_nextsong": "sum", "events_roll_advert": "sum", "events_error": "sum"})
    .reset_index()
    .rename(
        columns={
            "events_nextsong": "num_songs_played_7d",
            "events_roll_advert": "num_ads_7d",
            "events_error": "num_error_7d",
        }
    )
)
# Last 30 active days per user (group by user only, see above).
feature5 = (
    base_df_daily.groupby("userId")
    .tail(30)
    .groupby(["userId"])
    .agg({"events_nextsong": "sum"})
    .reset_index()
    .rename(columns={"events_nextsong": "num_songs_played_30d"})
)
# Last 90 active days per user (group by user only, see above).
feature6 = (
    base_df_daily.groupby("userId")
    .tail(90)
    .groupby(["userId"])
    .agg({"events_nextsong": "sum"})
    .reset_index()
    .rename(columns={"events_nextsong": "num_songs_played_90d"})
)
# num_artists, num_songs, num_ads, num_thumbsup, num_thumbsdown, num_playlist, num_addfriend, num_error, user_downgrade,
# user_upgrade, percentage_ad, days_since_active
base_df_user = (
    df.groupby(["userId"])
    .agg(
        {
            "page": "count",
            "events_nextsong": "sum",
            "artist": "nunique",
            "song": "nunique",
            "events_thumbs_down": "sum",
            "events_thumbs_up": "sum",
            "events_add_to_playlist": "sum",
            "events_roll_advert": "sum",
            "events_add_friend": "sum",
            "events_downgrade": "max",
            "events_upgrade": "max",
            "events_error": "sum",
            "date": "max",
            "registration": "min",
            "user_churned": "max",
        }
    )
    .reset_index()
)

base_df_user["percentage_ad"] = (
    base_df_user["events_roll_advert"] / base_df_user["page"]
)
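# Convert the date objects to datetime64 first so that the difference is a
# timedelta64 Series, which supports the `.dt` accessor in every pandas version.
base_df_user["days_since_active"] = (
    pd.to_datetime(base_df_user["date"]) - pd.to_datetime(base_df_user["registration"])
).dt.days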
# repeats ratio
base_df_user["repeats_ratio"] = (
    1 - base_df_user["song"] / base_df_user["events_nextsong"]
)

# num_sessions, avg_time_per_session, avg_events_per_session,
base_df_session = (
    df.groupby(["userId", "sessionId"])
    .agg({"length": "sum", "page": "count", "date": "min"})
    .reset_index()
)
base_df_session["prev_session_ts"] = base_df_session.groupby(["userId"])["date"].shift(
    1
)
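# Same conversion as above; a user's first session has no previous date (NaT).
base_df_session["gap_session"] = (
    pd.to_datetime(base_df_session["date"])
    - pd.to_datetime(base_df_session["prev_session_ts"])
).dt.days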
user_sessions = (
    base_df_session.groupby("userId")
    .agg(
        {"sessionId": "count", "length": "mean", "page": "mean", "gap_session": "mean"}
    )
    .reset_index()
    .rename(
        columns={
            "sessionId": "num_sessions",
            "length": "avg_time_per_session",
            "page": "avg_events_per_session",
            "gap_session": "avg_gap_between_session",
        }
    )
)

# merge features together
final_feature_df = base_df.merge(feature34, how="left", on="userId")
final_feature_df = final_feature_df.merge(feature5, how="left", on="userId")
final_feature_df = final_feature_df.merge(feature6, how="left", on="userId")
final_feature_df = final_feature_df.merge(user_sessions, how="left", on="userId")
final_feature_df = final_feature_df.merge(base_df_user, how="left", on="userId")

final_feature_df = final_feature_df.fillna(0)
# renaming columns
final_feature_df.columns = [
    "userId",
    "average_events_weekend",
    "average_events_weekday",
    "num_songs_played_7d",
    "num_ads_7d",
    "num_error_7d",
    "num_songs_played_30d",
    "num_songs_played_90d",
    "num_sessions",
    "avg_time_per_session",
    "avg_events_per_session",
    "avg_gap_between_session",
    "num_events",
    "num_songs",
    "num_artists",
    "num_unique_songs",
    "num_thumbs_down",
    "num_thumbs_up",
    "num_add_to_playlist",
    "num_ads",
    "num_add_friend",
    "num_downgrade",
    "num_upgrade",
    "num_error",
    "ts_date_day",
    "registration",
    "user_churned",
    "percentage_ad",
    "days_since_active",
    "repeats_ratio",
]
# only keep created feature columns
final_feature_df = final_feature_df[
    [
        "userId",
        "user_churned",
        "average_events_weekend",
        "average_events_weekday",
        "num_songs_played_7d",
        "num_ads_7d",
        "num_error_7d",
        "num_songs_played_30d",
        "num_songs_played_90d",
        "num_sessions",
        "avg_time_per_session",
        "avg_events_per_session",
        "avg_gap_between_session",
        "num_events",
        "num_songs",
        "num_artists",
        "num_thumbs_down",
        "num_thumbs_up",
        "num_add_to_playlist",
        "num_ads",
        "num_add_friend",
        "num_downgrade",
        "num_upgrade",
        "num_error",
        "percentage_ad",
        "days_since_active",
        "repeats_ratio",
    ]
]

In [None]:
final_feature_df