**Prediction task**

We are interested in predicting the future income from a user.
1. Please create a prediction model, aiming to predict the target variable (org_price_usd_following_30_days). Use the train set for training a model, aiming to minimize RMSE of predictions over the test set.
2. What are the three most important features that contributed to the prediction?

Note: the following columns are related to the next task and should not be used in the current task: "treatment", "org_price_usd_following_30_days_after_impact".

In [None]:
# run the following chunk to install dependencies (pay attention to lightgbm installation)
# ! pip install pandas




# or run this in your terminal: pip install -r /path/to/requirements.txt

libomp installation for lightgbm:

1. run *brew install libomp* in your terminal
2. copy variables into your shell configuration (you should get them at the end of the libomp brew installation):

    echo 'export LDFLAGS="-L/usr/local/opt/libomp/lib"' >> ~/.zshrc

    echo 'export CPPFLAGS="-I/usr/local/opt/libomp/include"' >> ~/.zshrc

    source ~/.zshrc

3. run *pip install lightgbm --no-binary lightgbm* in your terminal

In [1]:
import pandas as pd
import sweetviz as sv
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
import numpy as np
from lightgbm import LGBMRegressor ### need brew install libomp, then copy variables to shell configuration then run pip install lightgbm --no-binary lightgbm
from sklearn.metrics import root_mean_squared_error
from sklearn.model_selection import RandomizedSearchCV
print("LightGBM successfully imported!")


LightGBM successfully imported!


In [2]:
df_train = pd.read_csv('train_home_assignment_.csv', index_col=0)
df_test = pd.read_csv('test_home_assignment.csv')

In [3]:
# drop columns related to next task
df_train.drop(columns=['treatment', 'org_price_usd_following_30_days_after_impact'], inplace=True)

I'll carry out EDA and feature treatment (preprocessing, engineering, and selection based on feature importance) on the train set. I assume the train and test sets share the same prediction point: the explanatory features available in the train set are the ones available at the moment in the user funnel of the gaming app where the model has to return a prediction for new data (the test set). For example, if a user enters the app and plays, and we want a prediction before the next level unlocks, I assume every feature in this dataset is available before the next level unlocks. In any case, the train and test sets have the same columns, so I assume the features used in training match the prediction point of the test set.

I'll further split the train set into train and validation (dev) sets, first to assess model performance on the validation set and then to use it for hyperparameter tuning. I'll leave the test set untouched and treat it as new data: it will go through the same transformations and preprocessing, derived from the train set, before being fed to the tuned model for final predictions.

Finally, I'll evaluate the tuned model with the chosen features on the test set by comparing the RMSE and other relevant metrics between different learners.

In [4]:
#check if we have the same columns in train and test
# Get the column sets
train_columns = set(df_train.columns)
test_columns = set(df_test.columns)

# Check for equality
if train_columns == test_columns:
    print("The columns in df_train and df_test are the same.")
else:
    print("The columns in df_train and df_test are different.")

    # Find columns in df_train but not in df_test
    missing_in_test = train_columns - test_columns
    if missing_in_test:
        print(f"Columns in df_train but not in df_test: {missing_in_test}")

    # Find columns in df_test but not in df_train
    missing_in_train = test_columns - train_columns
    if missing_in_train:
        print(f"Columns in df_test but not in df_train: {missing_in_train}")

The columns in df_train and df_test are the same.


EDA

In this section I use sweetviz, a module that generates an HTML report with a comprehensive exploratory data analysis, including:
- marginal and joint distributions of Y, continuous features and categorical features
- measures of association

The goal here is to:
- check for missing values (for imputation)
- check for statistical outliers, for capping or removal
- look at the correlations:
    - with Y, to get a first glimpse of the predictive power of each feature (ideally the features in X are highly correlated with Y in absolute value)
    - between features, to see whether we need to introduce interaction variables for highly correlated explanatory variables (ideally the explanatory features are independent of each other, so we want to avoid multicollinearity)
- check for constant features, which have zero variance and therefore explain nothing
- decide whether we are in a linear or non-linear setting, and therefore which learners to choose:
    - check the skewness of the label to assess whether a transformation is needed
    - check the scale of the features and the label
    - check the relationship between Y and the features
- check the cardinality of categorical features, for binning or one-hot encoding
- check for duplicate rows
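Most of the checks listed above can also be run directly in pandas, which is handy if the sweetviz report does not render. The following is a minimal sketch on a toy frame; the helper name `quick_eda` and the toy columns are made up for illustration and are not part of the assignment data.

```python
import pandas as pd


def quick_eda(df: pd.DataFrame, target: str) -> dict:
    """Collect a few basic data-quality checks (hypothetical helper)."""
    numeric = df.select_dtypes(include="number")
    return {
        # number of missing values, only for columns that have any
        "missing": df.isna().sum().loc[lambda s: s > 0].to_dict(),
        # columns with a single distinct value carry no information
        "constant": [c for c in df.columns if df[c].nunique(dropna=False) <= 1],
        # number of fully duplicated rows
        "n_duplicates": int(df.duplicated().sum()),
        # skewness of the label and share of zeros in it
        "target_skew": float(numeric[target].skew()),
        "target_zero_share": float((df[target] == 0).mean()),
        # cardinality of the non-numeric columns
        "cardinality": {
            c: int(df[c].nunique()) for c in df.columns if c not in numeric.columns
        },
    }


# toy data, invented for this example only
toy = pd.DataFrame({
    "spend_prev": [0.0, 5.0, 5.0, 120.0, None],
    "const_col": [1, 1, 1, 1, 1],
    "weekday": ["mon", "tue", "tue", "sun", "mon"],
    "y": [0.0, 3.0, 3.0, 250.0, 0.0],
})
checks = quick_eda(toy, target="y")
print(checks)
```

On the real data the same call would be `quick_eda(df_train, target="org_price_usd_following_30_days")`, assuming `df_train` is loaded as in the cells above.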

In [5]:
# Specify data types: anything that is a count, total, occurrence, price, or number of days is treated as continuous
for col in df_train.columns:
    if col in ['weekday', 'village_id']:
        df_train[col] = df_train[col].astype('category')
    else:
        df_train[col] = df_train[col].astype(float)


report = sv.analyze([df_train, "Train"], target_feat="org_price_usd_following_30_days")
report.show_html("regression_report.html")

Done! Use 'show' commands to display/save.   |██████████| [100%]   00:06 -> (00:00 left)                       


Report regression_report.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


In [6]:
# further check correlations
correlation_matrix = df_train.corr()
print(correlation_matrix['org_price_usd_following_30_days'].sort_values(ascending=False))

org_price_usd_following_30_days                                1.000000
org_price_usd_preceding_30_days                                0.728005
spins_reward_preceding_30_days                                 0.716380
pet_xp_reward_preceding_30_days                                0.689277
org_price_usd_preceding_7_to_30_days                           0.675280
org_price_usd_triple_preceding_30_days                         0.633510
payment_occurrences_preceding_30_days                          0.619148
tournament_spins_reward_7_preceding                            0.612171
org_price_usd_preceding_3_days                                 0.583488
org_price_usd_preceding_3_to_7_days                            0.577685
payment_occurrences_preceding_7_to_30_days                     0.568057
chests_reward_preceding_30_days                                0.508032
payment_occurrences_preceding_3_days                           0.458481
payment_occurrences_preceding_3_to_7_days                      0

Observations from EDA (in case the HTML report does not open, I added a pictures_EDA folder with the relevant screenshots):

1. The label (shown in black in the report) is heavily skewed: 43.3% of its values are zeros and its skewness is 22.5. With skewness this high, a log or sqrt transformation will reduce the skew but won't make the distribution symmetric, which suggests that a linear setting may not be ideal.
2. For each feature, the report gives the following measures of association:

    a. Theil's U uncertainty coefficient: between categorical variables

        - values are very low, which means that the categorical features are close to independent of each other, but also that they are only weakly associated with Y.

    b. Correlation ratio (η): between categorical and numerical variables

        - the 6 categorical variables show a very weak association with Y. I am going to drop them when predicting.

    c. Pearson correlation: between numerical variables

        - 2 features are highly correlated with the label: org_price_usd_preceding_30_days (0.73) and spins_reward_preceding_30_days (0.72).
        - below them, 9 more features show a moderate correlation (between 0.5 and 0.7).
        - we do have multicollinearity: for example, org_price_usd_preceding_30_days is highly correlated with org_price_usd_preceding_7_to_30_days (0.98), spins_reward_preceding_30_days (0.97), pet_xp_reward_preceding_30_days (0.92), org_price_usd_triple_preceding_30_days (0.87), payment_occurrences_preceding_30_days (0.84) and payment_occurrences_preceding_7_to_30_days (0.82).

3. A short explanation of the plots: each feature is shown with both a histogram (feature values on the x axis, frequency (%) of each value on the left vertical axis) and a line plot of aggregated data (feature values on the x axis, average Y value on the right vertical axis). Many of the features show a non-linear relationship with Y.
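The multicollinearity noted above can also be listed programmatically rather than read off the report. This is a small sketch on synthetic data; `high_corr_pairs` and the 0.8 threshold are illustrative choices, not something taken from the sweetviz output.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr(numeric_only=True).abs()
    # keep only the strict upper triangle: each pair once, diagonal excluded
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # the comparison also discards any NaN left over from the masking step
    return pairs[pairs > threshold].sort_values(ascending=False)


# synthetic data, invented for this example only
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "a_noisy": a + rng.normal(scale=0.1, size=500),  # nearly a copy of "a"
    "b": rng.normal(size=500),                        # independent of "a"
})
pairs = high_corr_pairs(toy)
print(pairs)
```

Applied to the numeric columns of `df_train`, this should surface pairs such as the ones listed in point 2c.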


ACTION ITEMS
- drop duplicate rows
- drop the constant feature: spins_rewards_lo_preceding_7_days
- 'weekday' shows a low association with Y, so I'll drop it before prediction. Sweetviz also treated some features as categorical: ['active_days_preceding_7_days', 'active_days_preceding_7_to_14_days', 'total_set_completed', 'total_friend_link_invites']. I am not sure they are, so I also computed their Pearson correlation with Y, which was still very low. I'll therefore drop the following variables: ['active_days_preceding_7_days', 'active_days_preceding_7_to_14_days', 'weekday', 'total_set_completed', 'total_friend_link_invites'].
- village_id shows a moderate correlation with Y. Given its high cardinality, I'll bin it into fewer categories following the sweetviz suggestion and apply one-hot encoding (OHE), leaving out the "Other" category to avoid redundancy.
- we don't have missing values, so no imputation is needed.

The relationship between the features and Y is non-linear (as shown in the plots), features and label are on different scales, the high average Y values in the line plots suggest outliers, and several features are highly correlated. Tree-based learners therefore seem a better fit for this problem: they capture non-linear relationships, need no scaling, are relatively robust to outliers and to highly correlated features, and cope better with the skewed label.

In [7]:
# preprocessing
df_train.drop(columns=['spins_rewards_lo_preceding_7_days', 'active_days_preceding_7_days', 'active_days_preceding_7_to_14_days', 'weekday', 'total_set_completed', 'total_friend_link_invites'], inplace=True)
df_train.drop_duplicates(inplace=True)

In [8]:
df_train['village_id'] = df_train['village_id'].astype(int).astype(str)
# Step 1: Extract the village IDs that appear on the axis from the picture
village_ids_to_keep = ['170', '165', '168', '169', '166', '163', '164', '171', '172', '161', '173', '162', '176']

# Step 2: Bin the village_id column
df_train['village_id_binned'] = df_train['village_id'].apply(
    lambda x: x if x in village_ids_to_keep else 'Other'
)
df_train = pd.get_dummies(df_train, columns=['village_id_binned']).drop(columns=['village_id_binned_Other','village_id'])
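Because the test set has to go through the same binning and encoding, it can help to wrap the steps above in one function that always returns the same dummy columns, whichever IDs happen to be present. The sketch below is one possible way to do this; `encode_village` is a hypothetical helper and the toy frames are invented.

```python
import pandas as pd


def encode_village(df: pd.DataFrame, keep_ids: list[str]) -> pd.DataFrame:
    """Bin village_id into keep_ids + 'Other' and one-hot encode with fixed columns."""
    out = df.copy()
    ids = out["village_id"].astype(int).astype(str)
    binned = ids.where(ids.isin(keep_ids), "Other")
    # A categorical with a fixed category list yields the same dummy columns
    # for every input frame, even when some IDs are absent from it.
    binned = pd.Series(
        pd.Categorical(binned, categories=keep_ids + ["Other"]), index=out.index
    )
    dummies = pd.get_dummies(binned, prefix="village_id_binned")
    # "Other" is the reference category, as in the cell above
    dummies = dummies.drop(columns=["village_id_binned_Other"])
    return pd.concat([out.drop(columns=["village_id"]), dummies], axis=1)


# toy frames, invented for this example only
keep = ["170", "165"]
train = pd.DataFrame({"village_id": [170.0, 165.0, 12.0], "x": [1, 2, 3]})
test = pd.DataFrame({"village_id": [170.0, 99.0], "x": [4, 5]})
train_enc = encode_village(train, keep)
test_enc = encode_village(test, keep)
print(list(test_enc.columns))
```

With the real data the call would be `encode_village(df_train, village_ids_to_keep)`, and the same call on `df_test` would produce matching columns without a manual alignment step.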

Further split train into train-dev

In [9]:
#split into train-dev set
X = df_train.drop(columns=['org_price_usd_following_30_days'])
y = df_train['org_price_usd_following_30_days']
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.2, random_state=42)

Tree based feature importance

In [10]:
gb = GradientBoostingRegressor(random_state=42)
gb.fit(X_train, y_train)

importances = gb.feature_importances_
sorted_indices = np.argsort(importances)[::-1]
cumulative_importance = np.cumsum(importances[sorted_indices])
threshold_index = np.where(cumulative_importance >= 0.95)[0][0]
selected_features = X.columns[sorted_indices[:threshold_index + 1]]

In [11]:
print(f"Selected Features (95% cumulative importance): {list(selected_features)}")

Selected Features (95% cumulative importance): ['org_price_usd_preceding_30_days', 'org_price_usd_preceding_3_days', 'spins_reward_preceding_30_days', 'tournament_spins_reward_7_preceding', 'org_price_usd_preceding_7_to_30_days', 'pet_xp_reward_preceding_30_days', 'org_price_usd_triple_preceding_30_days', 'org_price_usd_preceding_3_to_7_days', 'tournament_coins_reward_7_preceding', 'payment_occurrences_preceding_3_days', 'hours_since_installed_ma', 'ltv_gross_up_to_preceding_30_days', 'payment_occurrences_preceding_30_days', 'chests_reward_preceding_30_days', 'total_attacks_preceding_7_days', 'total_villages_completed_preceding_7_days']


In [12]:
X_train_selected = X_train[selected_features]
X_dev_selected = X_dev[selected_features]

Reasons why I used Gradient Boosting for feature selection:

Random Forest is less suitable for feature selection here because it builds its trees independently: when several features are highly correlated, different trees can pick different members of the group, so importance is credited to redundant features. Gradient Boosting works sequentially, with each tree fitted to the residuals left by the previous ones. If two features are highly correlated, the one that reduces the residual error most tends to be picked first, and the others then offer diminishing returns because most of their shared signal has already been explained. The selected set therefore tends to contain less redundancy, although correlated features can still share some importance.
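Impurity-based importances are computed on the training data, so a cross-check on held-out data can be useful when features are correlated. The sketch below compares them with scikit-learn's `permutation_importance` on synthetic data with one informative feature, one near-duplicate of it, and one noise feature. It illustrates the check only and is not a step this notebook actually ran.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# synthetic data, invented for this example only
rng = np.random.default_rng(42)
n = 2000
signal = rng.normal(size=n)
X_toy = np.column_stack([
    signal,                                   # informative feature
    signal + rng.normal(scale=0.05, size=n),  # near-duplicate of the first
    rng.normal(size=n),                       # pure noise
])
y_toy = 3.0 * signal + rng.normal(scale=0.5, size=n)

X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)
gb_toy = GradientBoostingRegressor(random_state=42).fit(X_tr, y_tr)

# impurity-based importances (they sum to 1 across features)
print("impurity:   ", gb_toy.feature_importances_.round(3))

# permutation importance: drop in score when one column is shuffled on held-out data
perm = permutation_importance(gb_toy, X_va, y_va, n_repeats=5, random_state=42)
print("permutation:", perm.importances_mean.round(3))
```

How the importance is divided between the two correlated columns depends on the data and the seed, which is why no expected output is shown.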

In [13]:
# prepare assessment metrics
def evaluate_metrics(y_true, y_pred):
    rmse = root_mean_squared_error(y_true, y_pred)
    return {"RMSE": rmse}

In [14]:
# Algorithms to evaluate (adaboost does not handle well multicolinearity)
models = {
    "RandomForest": RandomForestRegressor(random_state=42),
    "GradientBoosting": GradientBoostingRegressor(random_state=42),
    "XGBoost": XGBRegressor(random_state=42),
    "LightGBM": LGBMRegressor(random_state=42),
    "CatBoost": CatBoostRegressor(verbose=0, random_state=42),
}

# Evaluate each model on dev set
results = {}
for name, model in models.items():
    model.fit(X_train_selected, y_train)
    y_pred = model.predict(X_dev_selected)
    metrics = evaluate_metrics(y_dev, y_pred)
    results[name] = metrics

# Print results
for model_name, metrics in results.items():
    print(f"Model: {model_name}")
    for metric_name, value in metrics.items():
        print(f"  {metric_name}: {value:.4f}")
    print()

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.005497 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3454
[LightGBM] [Info] Number of data points in the train set: 159996, number of used features: 16
[LightGBM] [Info] Start training from score 43.173697
Model: RandomForest
  RMSE: 109.6293

Model: GradientBoosting
  RMSE: 108.9748

Model: XGBoost
  RMSE: 109.0955

Model: LightGBM
  RMSE: 107.8366

Model: CatBoost
  RMSE: 106.0509



In [15]:
# Evaluate each model on train set to check overfitting
results = {}
for name, model in models.items():
    model.fit(X_train_selected, y_train)
    y_pred = model.predict(X_train_selected)
    metrics = evaluate_metrics(y_train, y_pred)
    results[name] = metrics

# Print results
for model_name, metrics in results.items():
    print(f"Model: {model_name}")
    for metric_name, value in metrics.items():
        print(f"  {metric_name}: {value:.4f}")
    print()

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.005366 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3454
[LightGBM] [Info] Number of data points in the train set: 159996, number of used features: 16
[LightGBM] [Info] Start training from score 43.173697
Model: RandomForest
  RMSE: 40.2751

Model: GradientBoosting
  RMSE: 89.1578

Model: XGBoost
  RMSE: 64.5069

Model: LightGBM
  RMSE: 86.1012

Model: CatBoost
  RMSE: 69.3761



So far GB seems the least prone to overfitting, since it has the smallest gap between train and dev RMSE.
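The comparison can be made explicit by putting the two sets of RMSE values side by side. The numbers below are copied from the two result listings above; only the table layout is new.

```python
import pandas as pd

# RMSE values copied from the dev-set and train-set listings above
dev_rmse = {
    "RandomForest": 109.6293,
    "GradientBoosting": 108.9748,
    "XGBoost": 109.0955,
    "LightGBM": 107.8366,
    "CatBoost": 106.0509,
}
train_rmse = {
    "RandomForest": 40.2751,
    "GradientBoosting": 89.1578,
    "XGBoost": 64.5069,
    "LightGBM": 86.1012,
    "CatBoost": 69.3761,
}

summary = pd.DataFrame({"train_rmse": train_rmse, "dev_rmse": dev_rmse})
# a large positive gap means the model fits the train set much better than the dev set
summary["gap"] = summary["dev_rmse"] - summary["train_rmse"]
print(summary.sort_values("gap"))
```

The gap is only one criterion: GradientBoosting has the smallest gap, while CatBoost has the lowest dev RMSE of the five models.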

Hyperparameter tuning on the dev set through RandomizedSearchCV (for speed, as GridSearchCV takes longer), done on GB.

In [16]:
gb_params = {
    "learning_rate": [0.01, 0.1, 0.2],
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 7],
    "subsample": [0.7, 0.8, 1.0]
}


- learning_rate: Smaller values reduce overfitting but require more trees.
- n_estimators: Number of boosting rounds.
- max_depth: Maximum depth of trees to control overfitting.
- subsample: Fraction of samples used for training each tree. Lower values reduce overfitting.
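To get a feel for how much of this grid a randomized search covers, the grid size can be compared with the number of sampled candidates. The sketch below uses scikit-learn's `ParameterGrid` and `ParameterSampler` with the same dictionary; the sample size of 50 and the seed are assumptions chosen for the illustration.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

gb_params = {
    "learning_rate": [0.01, 0.1, 0.2],
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 7],
    "subsample": [0.7, 0.8, 1.0],
}

# full grid: every combination of the four lists
n_total = len(ParameterGrid(gb_params))

# random sample of 50 candidates; when all values are given as lists,
# the sampler draws without replacement
sampled = list(ParameterSampler(gb_params, n_iter=50, random_state=42))
n_unique = len({tuple(sorted(p.items())) for p in sampled})

print(f"grid size: {n_total}, sampled: {len(sampled)}, unique: {n_unique}")
```

With 50 of 81 combinations evaluated, a randomized search over this grid skips a bit more than a third of it, which is the speed trade-off against GridSearchCV.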

In [17]:
gb = GradientBoostingRegressor(random_state=42)
gb_random_search = RandomizedSearchCV(
    estimator=gb,
    param_distributions=gb_params,
    n_iter=50,  # Number of random combinations to try
    cv=5,
    scoring="neg_root_mean_squared_error",
    verbose=1,
    n_jobs=-1,
    random_state=42
)

gb_random_search.fit(X_dev_selected, y_dev)
print("Best Parameters for Gradient Boosting:", gb_random_search.best_params_)
print("Best CV RMSE:", -gb_random_search.best_score_)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best Parameters for Gradient Boosting: {'subsample': 0.7, 'n_estimators': 100, 'max_depth': 3, 'learning_rate': 0.1}
Best CV RMSE: 108.87252309616323


Predict

In [18]:
X_test = df_test.drop(columns=['org_price_usd_following_30_days'])
y_test = df_test['org_price_usd_following_30_days']

# apply feature selection (the village_id indicators did not appear among the selected features, so the test set needs no binning or encoding here)
X_test_selected = X_test[selected_features]

In [19]:
model = GradientBoostingRegressor(subsample=0.7, n_estimators=100, max_depth=3, learning_rate=0.1, random_state=42)
model.fit(X_train_selected, y_train)
y_pred = model.predict(X_test_selected)
metrics = evaluate_metrics(y_test, y_pred)
print(metrics)

{'RMSE': 108.80508623337772}


- What are the three most important features that contributed to the prediction?

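The final model is fitted on X_train_selected, so its importances must be mapped onto X_train_selected.columns, not onto X.columns. The names obtained by indexing X.columns ('payment_occurrences_preceding_30_days', 'total_raids_preceding_7_days', 'hours_since_installed_ma') are therefore not the model's top features: 'total_raids_preceding_7_days' is not even among the 16 selected features. Assuming those three names sit at positions 0, 2 and 1 of X.columns, the same positions in X_train_selected.columns are 'org_price_usd_preceding_30_days', 'spins_reward_preceding_30_days' and 'org_price_usd_preceding_3_days'; re-running the cell below confirms the exact ranking.

In [20]:
# top 3 most important features of the final model
importances = model.feature_importances_
sorted_indices = np.argsort(importances)[::-1]
# map the indices onto the columns the model was actually trained on
top3_features = X_train_selected.columns[sorted_indices[:3]]
top3_features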

TODO

REFACTORING: save the best model in a pickle file in a folder that is automatically created in the current working directory. reorganize code into class. use pipeline when you can (& select from model for feat selection). prepare requirements/pip install chunk and explanation on how to install lightGBM with libomp. within the class you need to have also a preprocessing for the test, matching the train, and output the top 3 features. create a PDF in case installations on their side do not work.
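For the first TODO item, saving the model into a folder that is created on demand could look like the sketch below. The function names, the `models` folder and the file name are placeholders; the demo writes to a temporary directory so that it leaves no files behind.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


def save_model(model, base_dir=None, folder="models", filename="best_model.pkl") -> Path:
    """Pickle a fitted model into <base_dir>/<folder>, creating the folder if needed."""
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    out_dir = base / folder
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / filename
    with path.open("wb") as f:
        pickle.dump(model, f)
    return path


def load_model(path):
    """Load a model previously written by save_model."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


# demo on synthetic data, invented for this example only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0]
demo_model = GradientBoostingRegressor(n_estimators=20, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    saved = save_model(demo_model, base_dir=tmp)
    restored = load_model(saved)
    same = bool(np.allclose(demo_model.predict(X_demo), restored.predict(X_demo)))

print(saved.name, same)
```

Calling `save_model(model)` without `base_dir` writes to a `models` folder under the current working directory. Pickle files should only be loaded from trusted sources and with a matching scikit-learn version.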

**Recommendation task**

We are interested in increasing the income from users. For that, we ran a randomized experiment where the population was given either a 10 usd offer or 2usd offer (see the "treatment" column), aiming to learn what offer should be given to a user. The experiment yielded a target variable named “org_price_usd_following_30_days_after_impact”, reflecting the result of the experiment in terms of income.
1. For each user in the test data, set the treatment (either 10 or 2) that you believe would maximize the target variable (add a new column for that)
2. What are the three most important features that contributed to the decision to give users a specific treatment?

In [34]:
df_train = pd.read_csv('train_home_assignment_.csv', index_col=0)
df_test = pd.read_csv('test_home_assignment.csv')

In [35]:
df_train[['org_price_usd_following_30_days','org_price_usd_following_30_days_after_impact']].corr()

Unnamed: 0,org_price_usd_following_30_days,org_price_usd_following_30_days_after_impact
org_price_usd_following_30_days,1.0,0.999656
org_price_usd_following_30_days_after_impact,0.999656,1.0


In [36]:
# dropping old label
df_train.drop(columns=['org_price_usd_following_30_days'], inplace=True)
df_test.drop(columns=['org_price_usd_following_30_days'], inplace=True)

From this task I understand that I need to recommend a treatment to each new user of the test set under the constraint of maximizing org_price_usd_following_30_days_after_impact.

In [37]:
df_train.groupby(['treatment'])['org_price_usd_following_30_days_after_impact'].sum()

treatment
2     4.313935e+06
10    4.399805e+06
Name: org_price_usd_following_30_days_after_impact, dtype: float64

The total of org_price_usd_following_30_days_after_impact is very similar for the two treatments (about 4.31M for treatment 2 and 4.40M for treatment 10), so on aggregate the treatment seems to have little discriminatory power, and including it as one more feature in a single predictive model is unlikely to help much. Totals also depend on group sizes, so the per-treatment mean would be the fairer comparison. I'll therefore approach this problem in the following way:
- I'll train 2 models, one per treatment, where the label is org_price_usd_following_30_days_after_impact.
- I'll output 2 predictions for each user in the test set, one per model.
- For each user in the test set, I'll assign 10 if the prediction for 10 is higher, otherwise 2.
- I'll compute the total predicted outcome under this policy and compare it to a baseline in which every user gets the same treatment.

I'll reuse the previous preprocessing and tuned learner, as the correlation between the previous and the current label is above 0.99 (0.9997).
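The plan above is often called a two-model (or T-learner) approach. The sketch below runs it end to end on synthetic data in which the 10 usd offer only helps users with high past spend. All column names and the data-generating rule are invented for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# synthetic experiment, invented for this example only
rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame({
    "past_spend": rng.gamma(shape=1.0, scale=20.0, size=n),
    "activity": rng.integers(0, 8, size=n),
    "treatment": rng.choice([2, 10], size=n),
})
# the 10 usd offer helps users with past_spend above 20 and hurts the others
effect = np.where(toy["treatment"] == 10, 0.5 * (toy["past_spend"] - 20.0), 0.0)
toy["outcome"] = toy["past_spend"] + effect + rng.normal(scale=2.0, size=n)

features = ["past_spend", "activity"]

# one model per treatment arm, each trained only on users who received that arm
arm_models = {}
for arm in (2, 10):
    part = toy[toy["treatment"] == arm]
    arm_models[arm] = GradientBoostingRegressor(random_state=42).fit(
        part[features], part["outcome"]
    )

# score every user under both arms (in-sample here, for brevity) and pick the better one
pred_2 = arm_models[2].predict(toy[features])
pred_10 = arm_models[10].predict(toy[features])
toy["recommended_treatment"] = np.where(pred_10 > pred_2, 10, 2)

print(toy.groupby("recommended_treatment")["past_spend"].mean())
```

In this toy setup the users recommended the 10 usd offer should have a clearly higher average past spend, mirroring the rule used to generate the data.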

In [38]:

df_train.drop(
    columns=['spins_rewards_lo_preceding_7_days', 'active_days_preceding_7_days', 'active_days_preceding_7_to_14_days',
             'weekday', 'total_set_completed', 'total_friend_link_invites'], inplace=True)
df_train.drop_duplicates(inplace=True)
df_train['village_id'] = df_train['village_id'].astype(int).astype(str)
# Step 1: Extract the village IDs that appear on the axis from the picture
village_ids_to_keep = ['170', '165', '168', '169', '166', '163', '164', '171', '172', '161', '173', '162', '176']

# Step 2: Bin the village_id column
df_train['village_id_binned'] = df_train['village_id'].apply(
    lambda x: x if x in village_ids_to_keep else 'Other'
)
df_train = pd.get_dummies(df_train, columns=['village_id_binned']).drop(
    columns=['village_id_binned_Other', 'village_id'])

In [41]:
df_test.drop(
    columns=['spins_rewards_lo_preceding_7_days', 'active_days_preceding_7_days', 'active_days_preceding_7_to_14_days',
             'weekday', 'total_set_completed', 'total_friend_link_invites'], inplace=True)
df_test.drop_duplicates(inplace=True)

# Convert village_id in the test set to string, as was done for the train set
df_test['village_id'] = df_test['village_id'].astype(int).astype(str)

# Bin the village_id column in the test set
df_test['village_id_binned'] = df_test['village_id'].apply(
    lambda x: x if x in village_ids_to_keep else 'Other'
)

# One-hot encode the binned village_id column
village_id_dummies_test = pd.get_dummies(df_test['village_id_binned'], prefix='village_id_binned')

# Ensure test set columns align with train set columns
# Add missing columns from the training step
for col in [f'village_id_binned_{v}' for v in village_ids_to_keep]:
    if col not in village_id_dummies_test.columns:
        village_id_dummies_test[col] = 0

# Drop extra columns not in the training set
village_id_dummies_test = village_id_dummies_test[
    [f'village_id_binned_{v}' for v in village_ids_to_keep]
]

# Add the one-hot encoded columns to the test set
df_test = pd.concat([df_test, village_id_dummies_test], axis=1)

# Drop the original village_id and binned column
df_test = df_test.drop(columns=['village_id', 'village_id_binned'], errors='ignore')


   payment_occurrences_preceding_30_days  hours_since_installed_ma  \
0                                    1.0                   33544.0   
1                                   35.0                   33542.0   
2                                    4.0                   33495.0   
3                                    1.0                   33349.0   
4                                   11.0                   33241.0   

   total_raids_preceding_7_days  org_price_usd_preceding_3_days  \
0                         100.0                            0.00   
1                          26.0                           62.94   
2                           4.0                            0.00   
3                          19.0                            3.99   
4                          32.0                            0.00   

   hours_in_village  total_card_xp  total_villages_completed_preceding_7_days  \
0              59.0            0.0                                        2.0   
1             

In [60]:

# Split the data into 10 and 2 treatment.
df_train_10 = df_train[df_train.treatment == 10].drop(columns=['treatment'])
df_train_2 = df_train[df_train.treatment == 2].drop(columns=['treatment'])

# Separate data by treatment
X_treatment_10 = df_train_10.drop(columns=['org_price_usd_following_30_days_after_impact'])
y_treatment_10 = df_train_10['org_price_usd_following_30_days_after_impact']

X_treatment_2 = df_train_2.drop(columns=['org_price_usd_following_30_days_after_impact'])
y_treatment_2 = df_train_2['org_price_usd_following_30_days_after_impact']

# Train models for each treatment
model_10 = GradientBoostingRegressor(subsample=0.7, n_estimators=100, max_depth=3, learning_rate=0.1, random_state=42)
model_10.fit(X_treatment_10, y_treatment_10)

model_2 = GradientBoostingRegressor(subsample=0.7, n_estimators=100, max_depth=3, learning_rate=0.1, random_state=42)
model_2.fit(X_treatment_2, y_treatment_2)



In [61]:
#check if we have the same columns in train and test
# Get the column sets
train_columns = set(X_treatment_10.columns)
test_columns = set(df_test.columns)

# Check for equality
if train_columns == test_columns:
    print("The columns in df_train and df_test are the same.")
else:
    print("The columns in df_train and df_test are different.")

    # Find columns in df_train but not in df_test
    missing_in_test = train_columns - test_columns
    if missing_in_test:
        print(f"Columns in df_train but not in df_test: {missing_in_test}")

    # Find columns in df_test but not in df_train
    missing_in_train = test_columns - train_columns
    if missing_in_train:
        print(f"Columns in df_test but not in df_train: {missing_in_train}")

The columns in df_train and df_test are different.
Columns in df_test but not in df_train: {'predicted_outcome', 'recommended_treatment'}


In [62]:
# Predict for the test set
df_test = df_test[X_treatment_10.columns].copy()  # copy, so new columns are not assigned to a slice
y_pred_10 = model_10.predict(df_test)
y_pred_2 = model_2.predict(df_test)

# Assign the treatment with the higher predicted outcome (ties go to 2)
df_test["recommended_treatment"] = np.where(y_pred_10 > y_pred_2, 10, 2)

In [63]:
# Calculate total predicted outcome for recommended treatments
df_test["predicted_outcome"] = [
    pred_10 if treatment == 10 else pred_2
    for treatment, pred_10, pred_2 in zip(df_test["recommended_treatment"], y_pred_10, y_pred_2)
]
total_predicted_outcome = df_test["predicted_outcome"].sum()
print(f"Total Predicted Outcome: {total_predicted_outcome}")

# Calculate uplift compared to giving every user the single treatment with the higher mean prediction
baseline_outcome = max(y_pred_10.mean(), y_pred_2.mean()) * len(df_test)
uplift = total_predicted_outcome - baseline_outcome
print(f"Uplift: {uplift}")

Total Predicted Outcome: 10235817.600143086
Uplift: 655191.747320734
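The choice of baseline changes the reported uplift, so it helps to compute several of them from the same two prediction vectors. The sketch below uses three made-up users; `policy_totals` is a hypothetical helper.

```python
import numpy as np


def policy_totals(pred_10, pred_2) -> dict:
    """Total predicted outcome under a few assignment policies."""
    pred_10 = np.asarray(pred_10, dtype=float)
    pred_2 = np.asarray(pred_2, dtype=float)
    return {
        # each user gets the arm with the higher prediction
        "personalized": float(np.maximum(pred_10, pred_2).sum()),
        # everyone gets the same arm
        "all_10": float(pred_10.sum()),
        "all_2": float(pred_2.sum()),
        # expected total under a 50/50 random assignment
        "random_50_50": float(0.5 * (pred_10.sum() + pred_2.sum())),
    }


# three made-up users: (pred_10, pred_2) = (10, 8), (4, 6), (7, 7)
totals = policy_totals([10.0, 4.0, 7.0], [8.0, 6.0, 7.0])
print(totals)
# {'personalized': 23.0, 'all_10': 21.0, 'all_2': 21.0, 'random_50_50': 21.0}
```

The personalized total can never be lower than either uniform total, because a sum of per-user maxima is at least the larger of the two sums. A positive uplift computed from the models' own predictions is therefore guaranteed by construction and should be treated as an optimistic estimate until it is checked against observed outcomes.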


In [64]:
df_test.groupby(["recommended_treatment"])['predicted_outcome'].sum()

recommended_treatment
2     4.750004e+06
10    5.485813e+06
Name: predicted_outcome, dtype: float64

In [65]:
# Feature importance for $10 treatment model
importances_10 = model_10.feature_importances_
features_10 = pd.DataFrame({
    "feature": X_treatment_10.columns,
    "importance": importances_10
}).sort_values(by="importance", ascending=False)

print("Top features for $10 treatment:")
print(features_10.head(3))

# Feature importance for $2 treatment model
importances_2 = model_2.feature_importances_
features_2 = pd.DataFrame({
    "feature": X_treatment_2.columns,
    "importance": importances_2
}).sort_values(by="importance", ascending=False)

print("Top features for $2 treatment:")
print(features_2.head(3))


Top features for $10 treatment:
                                 feature  importance
36       org_price_usd_preceding_30_days    0.373253
31  org_price_usd_preceding_7_to_30_days    0.202460
3         org_price_usd_preceding_3_days    0.084992
Top features for $2 treatment:
                            feature  importance
36  org_price_usd_preceding_30_days    0.495140
3    org_price_usd_preceding_3_days    0.078902
12  pet_xp_reward_preceding_30_days    0.048531


For the model trained on users who received the 10 usd treatment, the most important features were:
- org_price_usd_preceding_30_days
- org_price_usd_preceding_7_to_30_days
- org_price_usd_preceding_3_days

For the model trained on users who received the 2 usd treatment, the most important features were:
- org_price_usd_preceding_30_days
- org_price_usd_preceding_3_days
- pet_xp_reward_preceding_30_days

Since I used 2 separate models, it makes sense that some features reflect treatment-specific patterns, but overall recent spend (the org_price_usd_preceding_* features) plays the biggest role in determining org_price_usd_following_30_days_after_impact.