In [32]:
import lightgbm as lgb
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd
import numpy as np
SEED = 2020

# 1. Load data
df = pd.read_csv('dataset.csv')
df['has_promo'] = df['has_promo'].astype(int)

# 2. Compute revenue: units sold times the discounted price (discount_percent is a fraction, e.g. 0.1 = 10%)
df['revenue'] = df['sales'] * (df['shelf_price'] - df['shelf_price'] * df['discount_percent'])
# df['log_revenue'] = df['revenue'].apply(lambda x: np.log(x) if x > 0 else 0)


# 3. Determine the weight for each class based on the inverse proportion of its occurrence
class_proportions = df['has_promo'].value_counts(normalize=True)
weights_map = 1 / class_proportions
weights = df['has_promo'].map(weights_map)

# 4. Features and Target
features = ['store_cluster_id', 'shelf_price', 'has_promo', 'category_1', 'category_2', 'category_3', 'category_3_promo_coverage', 'discount_percent']
target = 'revenue'

# 5. Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.2, random_state=SEED)

# 6. LGBM parameters
params = {
    'objective': 'regression',
    'metric': 'l2',
    'boosting_type': 'gbdt',
    'learning_rate': 0.6,
    'num_leaves': 31,
    'max_depth': -1,
    'min_child_samples': 20,
    'max_bin': 255,
    'subsample': 0.9,
    'subsample_freq': 1,
    'colsample_bytree': 0.9,
    'min_child_weight': 0,
    'min_split_gain': 0,
    'subsample_for_bin': 200000,
}

# 7. Weights for train data
train_weights = weights.loc[X_train.index]

# 8. Create datasets for LGBM
train_data = lgb.Dataset(X_train, label=y_train, weight=train_weights)
test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)

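# 9. Model Training
# Note: verbose_eval was removed from lgb.train in LightGBM 4.0; on newer versions,
# pass callbacks=[lgb.log_evaluation(period=0)] instead to silence per-iteration logs.
bst = lgb.train(params, train_data, num_boost_round=500, valid_sets=[test_data], verbose_eval=-1)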

# 10. Predictions
y_pred = bst.predict(X_test)

# 11. Model Evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse}, R2: {r2}")




You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1067
[LightGBM] [Info] Number of data points in the train set: 653077, number of used features: 8
[LightGBM] [Info] Start training from score 1437.538380
MSE: 5126453.201285902, R2: 0.6118902534368351


In [34]:
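# Balance the dataset on has_promo by downsampling the no-promo rows to the number of promo rows
df_promo = df[df['has_promo'] == 1]
df_no_promo = df[df['has_promo'] == 0].sample(len(df_promo), random_state=SEED)  # seeded for reproducibility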
df_balanced = pd.concat([df_promo, df_no_promo])


# Model parameters
params = {
    'objective': 'regression',
    'metric': 'l2',
    'boosting_type': 'gbdt',
    'learning_rate': 0.33,
    'num_leaves': 31,
    'max_depth': -1,
    'min_child_samples': 20,
    'max_bin': 255,
    'subsample': 0.9,
    'subsample_freq': 1,
    'colsample_bytree': 0.9,
    'min_child_weight': 0,
    'min_split_gain': 0,
    'subsample_for_bin': 200000,
}

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df_balanced[features], df_balanced[target], test_size=0.2, random_state=42)

# Create datasets for LGBM
train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)

# Model training
bst = lgb.train(params, train_data, 500, valid_sets=[test_data], verbose_eval=-1)

# Predictions
y_pred = bst.predict(X_test, num_iteration=bst.best_iteration)

# Model evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse}, R2: {r2}")



You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1063
[LightGBM] [Info] Number of data points in the train set: 115769, number of used features: 8
[LightGBM] [Info] Start training from score 1429.264099
MSE: 5815862.066026874, R2: 0.703393098914938


Description:
This code predicts the revenue from product sales using a LightGBM gradient-boosting regression model. The dataset comprises historical data on retail sales, prices, and discounts. The main steps are data loading, revenue computation, feature selection, model training, and evaluation. The target variable is the computed 'revenue' (sales multiplied by the discounted shelf price), while the features include product details, store details, and promotional information.

The model's performance is evaluated on a 20% hold-out set using the Mean Squared Error (MSE) and R^2 metrics. Additionally, to address the class imbalance in the has_promo feature, inverse-proportion weights are calculated and applied to the training rows. Each row of the underrepresented class then counts for more in the loss, so that errors on that class are not dominated by the more common class during training.

Approaches to Balancing the Dataset for Model Training

In our efforts to train a machine learning model to predict revenue based on various product attributes, we encountered an imbalance in our dataset with respect to the has_promo attribute. This attribute indicates whether a product had a promotion or not. Addressing this imbalance is crucial to ensure our model provides reliable predictions.

1. Weighted Loss Function Approach:
The first method we employed was to assign different weights to the classes in the has_promo attribute. This allows the model to give more importance to under-represented classes during training. The weights were inversely proportional to the class frequencies, meaning the less frequent class got a higher weight.
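
As a small illustration of this weighting scheme, the sketch below applies the same inverse-frequency formula to a toy has_promo column. The ten values are made up for the example and are not taken from the dataset. With 8 non-promo and 2 promo rows the per-row weights come out as 1.25 and 5, so both classes end up with the same total weight.

```python
import pandas as pd

# Toy data (hypothetical): 8 non-promo rows and 2 promo rows
has_promo = pd.Series([0] * 8 + [1] * 2, name="has_promo")

# Share of each class, then its inverse as the per-row weight
class_proportions = has_promo.value_counts(normalize=True)
weights_map = 1 / class_proportions
weights = has_promo.map(weights_map)

# Total weight per class: the rarer class has fewer rows, but each one is heavier
total_weight_per_class = weights.groupby(has_promo).sum()
print(weights_map.to_dict())
print(total_weight_per_class.to_dict())
```

Note that the weights only rebalance the loss with respect to has_promo; they do not change how many distinct promo examples the model sees.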

2. Resampling Approach:
The second method involved balancing the dataset by resampling. We created two subsets of the data: one with has_promo equal to 1 and another equal to 0. From the larger subset, we randomly sampled instances to match the size of the smaller subset, ensuring an equal number of instances for both promo and non-promo products. The model was then trained on this balanced dataset.
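
The downsampling step can be sketched on toy data as follows. The helper name downsample_majority and the toy frame are assumptions made for illustration; the fixed seed is there only so the sample is repeatable.

```python
import pandas as pd

def downsample_majority(df, column, seed=0):
    """Randomly downsample every class of `column` to the size of the smallest class."""
    smallest = df[column].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=seed)
        for _, group in df.groupby(column)
    ]
    return pd.concat(parts)

# Toy data (hypothetical): 8 non-promo rows and 2 promo rows
toy = pd.DataFrame({"has_promo": [0] * 8 + [1] * 2, "revenue": range(10)})
balanced = downsample_majority(toy, "has_promo", seed=2020)
print(balanced["has_promo"].value_counts().to_dict())
```

The trade-off is that the rows of the larger class that are not sampled are discarded, so the model trains on less data than with the weighting approach.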

Comparison and Conclusion:
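Comparing the two runs, the resampling approach (the second method) scored higher on the R^2 metric (0.703 versus 0.612), although its MSE was also higher (about 5.82 million versus 5.13 million). The two models were evaluated on different test sets (a split of the balanced subset versus a split of the full data) and trained with different learning rates (0.33 versus 0.6), so the scores are not directly comparable. The result suggests, but does not establish, that for our specific dataset and problem, balancing through resampling was more effective than using a weighted loss function.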
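
A quick arithmetic check helps when reading the two sets of scores printed above. Since R^2 = 1 - MSE / Var(y_test), the reported MSE and R^2 of each run determine the target variance of its test set. The sketch below only rearranges that identity; the function name is an assumption made for illustration.

```python
def implied_target_variance(mse, r2):
    """Invert R^2 = 1 - MSE / Var(y) to recover the test-set target variance."""
    return mse / (1.0 - r2)

# Scores printed by the two runs above
var_weighted = implied_target_variance(5126453.201285902, 0.6118902534368351)
var_resampled = implied_target_variance(5815862.066026874, 0.703393098914938)

print(f"weighted run:  Var(y_test) ~ {var_weighted:,.0f}")
print(f"resampled run: Var(y_test) ~ {var_resampled:,.0f}")
```

The implied variances are roughly 13.2 million for the weighted run and 19.6 million for the resampled run. The balanced test set has a more spread-out revenue, which is how the second run can show a higher R^2 together with a higher MSE. Scoring both models on one shared hold-out set would give a like-for-like comparison.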

In [36]:
# Potential discount levels
discount_levels = [0, 0.03, 0.05, 0.07, 0.1, 0.2, 0.5]

# Prepare an empty list to store results
results = []

# Iterate through each product in the dataset
for product in df['product_id'].unique():
    
    product_data = df[df['product_id'] == product].copy()
    
    max_revenue = -np.inf
    best_discount = 0
    
    # Iterate through each potential discount level
    for discount in discount_levels:
        
        # Set the discount level and adjust the 'has_promo' attribute
        product_data['discount_percent'] = discount
        product_data['has_promo'] = 1 if discount > 0 else 0
        
        # Predict the revenue at this discount level
        predicted_revenue = np.sum(bst.predict(product_data[features]))
        
        # Check if this discount level gives a higher revenue than previous ones
        if predicted_revenue > max_revenue:
            max_revenue = predicted_revenue
            best_discount = discount
    
    # Append the results
    results.append({'product_id': product, 'discount_best': best_discount * 100})

# Convert the results into a DataFrame
discount_df = pd.DataFrame(results)

discount_df.head()

   product_id  discount_best
0           0           10.0
1           1            0.0
2           2           10.0
3           3            0.0
4           4            0.0


In [37]:
discount_df.sample(10)

       product_id  discount_best
10110       10110           50.0
6880         6880           50.0
39             39            0.0
6485         6485            7.0
10789       10789           20.0
1348         1348           50.0
1286         1286           50.0
9739         9739           50.0
7738         7738           50.0
4460         4460            3.0


Optimizing Discount Levels for Products

In the second phase of our approach, we sought the discount level that maximizes predicted revenue for each product. Here's what we did:

1. Potential Discount Levels:
We predefined a set of candidate discount levels: 0%, 3%, 5%, 7%, 10%, 20%, and 50%.

2. Iterative Testing for Each Product:
For each unique product in our dataset, we applied each candidate discount level to all of that product's rows, setting discount_percent to the candidate value and has_promo to 1 for any non-zero discount (0 otherwise).

3. Revenue Prediction:
For each candidate level, we used the trained model (bst, the model fitted most recently, i.e. on the balanced dataset) to predict revenue for every row of the product and summed the predictions.

4. Selecting the Optimal Discount:
We compared the summed predictions across all discount levels for each product and chose the level with the highest predicted revenue. Because the levels are tried in ascending order with a strict comparison, ties go to the smaller discount.

5. Final Dataset:
We compiled a DataFrame, discount_df, containing each product's ID and its optimal discount level in percent.

Through this process, we now have a recommended discount level for each product, chosen to maximize revenue according to our model's predictions. This methodology supports data-driven decisions on discounting strategies for each product in the inventory.
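
The per-product search described above can also be written with one prediction call per discount level instead of one per product and level. The sketch below is an alternative formulation, not the code that produced discount_df: the helper best_discount_per_product, the stand-in model toy_predict, and the uplift column are all hypothetical and exist only to make the example self-contained.

```python
import pandas as pd

def best_discount_per_product(df, predict, features, discount_levels):
    """For each product_id, pick the discount level with the highest summed prediction."""
    totals = {}
    for discount in discount_levels:
        scenario = df.copy()
        scenario["discount_percent"] = discount
        scenario["has_promo"] = int(discount > 0)
        predicted = pd.Series(predict(scenario[features]), index=scenario.index)
        totals[discount] = predicted.groupby(scenario["product_id"]).sum()
    grid = pd.DataFrame(totals)   # rows: products, columns: discount levels
    best = grid.idxmax(axis=1)    # first maximum wins, i.e. the smallest discount on ties
    return pd.DataFrame({
        "product_id": best.index,
        "discount_best": best.to_numpy() * 100,
    })

def toy_predict(X):
    """Hypothetical stand-in for bst.predict: demand rises linearly with the discount."""
    units = 10 * (1 + X["uplift"] * X["discount_percent"])
    return (X["shelf_price"] * (1 - X["discount_percent"]) * units).to_numpy()

# Toy data (hypothetical): three products with different sensitivity to discounts
toy = pd.DataFrame({
    "product_id": [1, 1, 2, 3],
    "shelf_price": [5.0, 5.0, 8.0, 3.0],
    "uplift": [0.5, 0.5, 2.0, 4.0],
    "discount_percent": [0.0, 0.1, 0.0, 0.0],
    "has_promo": [0, 1, 0, 0],
})
toy_features = ["shelf_price", "uplift", "discount_percent", "has_promo"]
result = best_discount_per_product(toy, toy_predict, toy_features, [0, 0.1, 0.2, 0.5])
print(result)
```

With the real model, predict would be bst.predict and features the list defined earlier. The result is sorted by product_id rather than by order of first appearance.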