# Unraveling price elasticity of demand with atoti
\[_In case you’re unable to see the atoti visualizations in GitHub, try viewing the notebook in [nbviewer](https://nbviewer.org/github/atoti/notebooks/blob/master/notebooks/price-elasticity/main.ipynb)._]

## Overview

In this notebook, we will apply the concept of [price elasticity of demand](https://en.wikipedia.org/wiki/Price_elasticity_of_demand) to insurance pricing.  
Insurance companies give customers a quote for the premium they would charge to insure the customer's vehicle. As in most applications of price elasticity, the customer may choose one insurance provider over another based on the quoted price, among other factors.  
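
For reference, price elasticity of demand is usually written as the ratio of the relative change in quantity demanded to the relative change in price. The symbols $Q$ (quantity demanded) and $P$ (price) are notation introduced here for illustration, not names used in the notebook:

```latex
\[
E_d \;=\; \frac{\%\,\Delta Q}{\%\,\Delta P} \;=\; \frac{\Delta Q / Q}{\Delta P / P}
\]
```

Demand is commonly called elastic when $|E_d| > 1$ and inelastic when $|E_d| < 1$. In an insurance setting, one plausible reading of $Q$ is the share of quotes that convert into sales.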

Here, we are using this [dataset from Kaggle](https://www.kaggle.com/ranja7/vehicle-insurance-customer-data) which has been further augmented by adding synthetic data to it.  

The dataset contains quote data for vehicle insurance, with the policy sale as the target variable. We will build a predictive model for quote sales and then use atoti and the model to identify buckets of customers across different segments according to their price sensitivity.  

The notebook has three main sections:
1. **Predictive Modelling** - We start by building a predictive model and inspecting feature importance to see which factors drive its predictions.
2. **Identifying and visualising the KPIs** - We use the model to derive different KPIs from the data.
3. **Price Elasticity with atoti** - We create different what-if scenarios based on price using atoti and measure their impact on the above-mentioned KPIs.

**So, let us get started with demystifying the price elasticity modeling, thanks to atoti!**



<div style="text-align: center;" ><a href="https://www.atoti.io/?utm_source=gallery&utm_content=creditcard-fraud-detection" target="_blank" rel="noopener noreferrer"><img src="https://data.atoti.io/notebooks/banners/discover.png" alt="Try atoti"></a></div>

In [1]:
# Importing the necessary packages

import pandas as pd
import numpy as np

from xgboost import XGBClassifier

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import f1_score, classification_report, recall_score, roc_auc_score

import statsmodels.api as sm

In [2]:
# Importing atoti
import atoti as tt

Welcome to atoti 0.6.4!

By using this community edition, you agree with the license available at https://docs.atoti.io/latest/eula.html.
Browse the official documentation at https://docs.atoti.io.
Join the community at https://www.atoti.io/register.

atoti collects telemetry data, which is used to help understand how to improve the product.
If you don't wish to send usage data, set the ATOTI_DISABLE_TELEMETRY environment variable to True.

You can hide this message by setting the ATOTI_HIDE_EULA_MESSAGE environment variable to True.


# Part 1: Predictive Modelling

We begin by loading the training and test datasets, and then try different models to find the best results.  
Once we have found the best model, we shall see which features are the main drivers of its predictive ability. 

## 1.1 Loading and preparing the data  
### Loading and viewing the training and the test dataset

In [3]:
df = pd.read_csv("s3://data.atoti.io/notebooks/price-elasticity/data.csv")
test_df = pd.read_csv("s3://data.atoti.io/notebooks/price-elasticity/test_df.csv")

In [4]:
print(df.shape)
df.head()

(45707, 19)


Unnamed: 0,cust_id,Sale,Driver_Age,Vehicle_Value,Price,Vehicle_Mileage,Credit_Score,Licence_Length_Years,Marital_Status,Tax,State,CLTV,Coverage_Type,Education,Employment_Status,Location_Code,Sales_Channel,Months_Policy_Inception,Policy_Type
0,402933470644,0,67.0,2000.0,445.82,9000.0,386.17,0.79,D,,Washington,2962.7,Extended,Doctor,Disabled,Rural,Agent,90,Corporate Auto
1,1727686457919,1,35.0,9000.0,,9000.0,406.26,10.58,M,52.37,Nevada,2615.1,Basic,High School or Below,Employed,Rural,Web,92,Corporate Auto
2,54216290243,1,34.0,10000.0,737.83,7000.0,437.45,9.09,M,73.78,California,10629.2,Extended,Master,Medical Leave,Rural,Web,93,Special Auto
3,844267021428,1,34.0,10000.0,582.42,6000.0,506.71,2.78,M,58.24,Arizona,9624.9,Premium,Master,Employed,Rural,Branch,39,Special Auto
4,4695160115211,0,25.0,9000.0,622.56,10000.0,501.31,1.73,M,62.26,California,7781.0,Basic,Bachelor,Unemployed,Urban,Branch,52,Personal Auto


In [5]:
print(test_df.shape)
test_df.head()

(11427, 19)


Unnamed: 0,cust_id,Sale,Driver_Age,Vehicle_Value,Price,Vehicle_Mileage,Credit_Score,Licence_Length_Years,Marital_Status,Tax,State,CLTV,Coverage_Type,Education,Employment_Status,Location_Code,Sales_Channel,Months_Policy_Inception,Policy_Type
0,9447656335642,1,,8000.0,447.64,4000.0,409.44,4.0,M,44.76,Washington,7319.6,Premium,High School or Below,Unemployed,Urban,Agent,87,Special Auto
1,1123349069299,0,37.0,6000.0,252.39,5000.0,359.11,4.42,M,25.24,California,9890.2,Extended,Bachelor,Employed,Rural,Agent,12,Personal Auto
2,4467098784674,1,27.0,10000.0,477.8,6000.0,401.06,0.68,S,47.78,Arizona,6297.0,Premium,Master,Disabled,Urban,Branch,65,Corporate Auto
3,869569892903,0,27.0,10000.0,340.29,6000.0,304.35,7.19,M,34.03,Washington,8709.8,Premium,Bachelor,Employed,Rural,Call Center,47,Corporate Auto
4,2237607703548,1,18.0,8000.0,645.61,7000.0,345.25,0.43,M,64.56,Oregon,16675.9,Extended,Bachelor,Medical Leave,Urban,Web,89,Personal Auto


### Splitting the training data into train and validation dataset  
Now for building the machine learning model, we need to split the training data into training and validation datasets.  


In [6]:
train_df, val_df = train_test_split(df, test_size=0.1, random_state=36)

In [7]:
# dropping rows with NA
train_df = train_df.dropna()
val_df = val_df.dropna()
test_df = test_df.dropna()

## 1.2 Feature Engineering
We split the labels from the features for all the datasets.  
Then we perform one-hot encoding for all the categorical variables.

In [8]:
# split data into the features (X) and labels (y)

# Training data
X_train = train_df.iloc[:, 2:]
Y_train = train_df.iloc[:, 1]

# Validation data
X_val = val_df.iloc[:, 2:]
Y_val = val_df.iloc[:, 1]

# Test data
X_test = test_df.iloc[:, 2:]
Y_test = test_df.iloc[:, 1]

In [9]:
# One hot encoding for categorical variables
CATEGORICAL_COLUMNS = [
    "Marital_Status",
    "State",
    "Coverage_Type",
    "Education",
    "Employment_Status",
    "Location_Code",
    "Sales_Channel",
    "Policy_Type",
]


def OH_df(X):
    # Replace each categorical column, in order, with its one-hot encoded columns
    for column in CATEGORICAL_COLUMNS:
        one_hot = pd.get_dummies(X[column])
        X = X.drop(column, axis=1)
        X = X.join(one_hot)
    return X

In [10]:
X_train = OH_df(X_train)
X_val = OH_df(X_val)
X_test = OH_df(X_test)
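
Because each split is encoded independently, a category that is absent from one split would produce a frame with fewer columns than the training frame. A minimal sketch of one way to guard against this, assuming toy data and a hypothetical `encode` helper (neither is part of the notebook), is to reindex every encoded split to the training columns:

```python
import pandas as pd

# Toy frames with hypothetical values: the validation split happens to lack "Oregon".
train = pd.DataFrame(
    {"Price": [100.0, 200.0, 300.0], "State": ["Oregon", "Nevada", "Arizona"]}
)
val = pd.DataFrame({"Price": [150.0, 250.0], "State": ["Nevada", "Arizona"]})


def encode(frame):
    # Hypothetical helper: same drop-and-join pattern as the notebook, for one column
    dummies = pd.get_dummies(frame["State"])
    return frame.drop(columns="State").join(dummies)


train_enc = encode(train)
val_enc = encode(val)

# Reindex to the training columns: categories missing from the split become
# all-zero columns, and the column order matches the frame the model was fitted on.
val_aligned = val_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(val_enc.columns))
print(list(val_aligned.columns))
```

With the full dataset every category is likely present in every split, so this is a safeguard rather than a required step.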

## 1.3 Time to do some predictive modelling!
Here we will be testing the below models for the predictions:  
1. Random forest classifier
2. KNeighbors Classifier
3. Gaussian Naive Bayes Classifier
4. Support Vector Machine based Classifier
5. XG Boost Classifier

In [11]:
rf = RandomForestClassifier(n_jobs=-1)
knn = KNeighborsClassifier(n_jobs=-1)
nb = GaussianNB()
svm = SVC()
xgbc = XGBClassifier(use_label_encoder=False)

In [12]:
model_list = [rf, knn, nb, svm, xgbc]

In [13]:
# Training the models on the training dataset
models = []
f1 = []
roc_auc = []
recall = []

for estimator in model_list:

    estimator.fit(X_train, Y_train.values.ravel())
    result = estimator.predict(X_val)

    models.append(estimator)
    f1.append(f1_score(Y_val, result))
    recall.append(recall_score(Y_val, result))
    roc_auc.append(roc_auc_score(Y_val, result))



In [14]:
# Compiling the results from all the models on the validation dataset

df_results = pd.DataFrame(
    {"model_name": models, "f1_score": f1, "roc_auc_score": roc_auc, "recall": recall}
)
df_results.head(10)

Unnamed: 0,model_name,f1_score,roc_auc_score,recall
0,"(DecisionTreeClassifier(max_features='auto', r...",0.927819,0.863753,0.946484
1,KNeighborsClassifier(n_jobs=-1),0.860897,0.731611,0.895703
2,GaussianNB(),0.907729,0.853806,0.899219
3,SVC(),0.893333,0.751059,0.968359
4,"XGBClassifier(base_score=0.5, booster='gbtree'...",0.936097,0.879652,0.952734


**Hence, for this dataset, the best model is XGBClassifier: it has the highest F1 (0.936) and ROC AUC (0.880) on the validation dataset, with a recall of 0.953.**
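
One caveat about the `roc_auc_score` column: it is computed from the output of `predict`, i.e. from hard 0/1 labels. With hard labels the AUC reduces to the average of the true-positive and true-negative rates, whereas scores such as `predict_proba` output give the usual ranking-based AUC. The sketch below illustrates the difference on made-up numbers, using a hypothetical pure-Python helper `pairwise_auc` rather than the notebook's models:

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities (not taken from the dataset)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
proba = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
hard = [1 if p >= 0.5 else 0 for p in proba]  # what predict() would return

auc_from_proba = pairwise_auc(y_true, proba)
auc_from_labels = pairwise_auc(y_true, hard)

# Balanced accuracy of the hard labels, for comparison
tpr = sum(h for y, h in zip(y_true, hard) if y == 1) / 4
tnr = sum(1 - h for y, h in zip(y_true, hard) if y == 0) / 4

print(auc_from_proba, auc_from_labels, (tpr + tnr) / 2)
```

The ranking of the models above may or may not change when probabilities are used; the point is only that the two numbers measure different things.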

## 1.4 Finding the importance of the various features in the model
Here the hypothesis is that, among other factors, price should be a critical factor in deciding whether a customer buys a quote or not.  


In [15]:
# Finding feature importance from the model

sorted_idx = np.argsort(xgbc.feature_importances_)[::-1]

# Finding the top 10 features contributing to the results
for index in sorted_idx[:10]:
    print([X_train.columns[index], xgbc.feature_importances_[index]])

['Price', 0.20924321]
['Driver_Age', 0.06532796]
['Employed', 0.061190005]
['Personal Auto', 0.056500446]
['Suburban', 0.04785541]
['Special Auto', 0.0455167]
['Premium', 0.036693245]
['Unemployed', 0.034293536]
['Basic', 0.033415023]
['Master', 0.032274764]


### Hence, Price is the most important feature - which makes business sense and is consistent with our hypothesis.

It is important to note that the price of the quote has roughly **3 times** the importance of the second feature (Driver_Age) and roughly **6.5 times** that of the 10th feature in the list.
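
The ratios can be read directly off the printed importances. A quick check, with the three values copied from the output above:

```python
# Importances copied from the printed output of the feature-importance cell
importances = {
    "Price": 0.20924321,       # 1st feature
    "Driver_Age": 0.06532796,  # 2nd feature
    "Master": 0.032274764,     # 10th feature
}

ratio_to_second = importances["Price"] / importances["Driver_Age"]
ratio_to_tenth = importances["Price"] / importances["Master"]

print(round(ratio_to_second, 1), round(ratio_to_tenth, 1))
```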

Now that we have established that price is the most important feature for predicting a sale, let us examine the relationship between price and sales using OLS.

In [16]:
# Verifying these findings by using an OLS model
model = sm.OLS(Y_train, X_train)
results = model.fit()
results.summary()

0,1,2,3
Dep. Variable:,Sale,R-squared:,0.556
Model:,OLS,Adj. R-squared:,0.555
Method:,Least Squares,F-statistic:,1299.0
Date:,"Thu, 27 Jan 2022",Prob (F-statistic):,0.0
Time:,15:08:07,Log-Likelihood:,-7876.8
No. Observations:,33264,AIC:,15820.0
Df Residuals:,33231,BIC:,16100.0
Df Model:,32,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Driver_Age,-0.0139,0.000,-45.113,0.000,-0.015,-0.013
Vehicle_Value,2.274e-06,1e-06,2.263,0.024,3.05e-07,4.24e-06
Price,0.0022,2.06e-05,108.697,0.000,0.002,0.002
Vehicle_Mileage,-3.099e-05,1.26e-06,-24.579,0.000,-3.35e-05,-2.85e-05
Credit_Score,-2.296e-06,1.76e-06,-1.307,0.191,-5.74e-06,1.15e-06
Licence_Length_Years,0.0089,0.001,12.759,0.000,0.008,0.010
Tax,-0.0025,0.000,-15.158,0.000,-0.003,-0.002
CLTV,-1.203e-07,2.46e-07,-0.490,0.624,-6.02e-07,3.61e-07
Months_Policy_Inception,2.933e-05,5.9e-05,0.497,0.619,-8.63e-05,0.000

0,1,2,3
Omnibus:,474.556,Durbin-Watson:,2.0
Prob(Omnibus):,0.0,Jarque-Bera (JB):,494.973
Skew:,-0.299,Prob(JB):,3.2999999999999998e-108
Kurtosis:,2.99,Cond. No.,2.23e+19


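Here, we can see that the **coefficient on Price is positive (0.0022)**: with the other features held fixed, quotes with a higher price were more likely to be bought.  
One possible explanation is that these were specialized policies, while for policies on ordinary vehicles we lose to the competitors. Note that the very large condition number (2.23e+19) may indicate strong multicollinearity among the regressors, so individual coefficients should be read with caution. 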

### Making predictions on the test dataset  

We shall use the best model from the list, i.e. the XGBoost classifier, to make predictions on the test dataset. 

In [17]:
test_result = xgbc.predict(X_test)

print("F1 score for test data is:", f1_score(Y_test, test_result))
print("Recall for test data is:", recall_score(Y_test, test_result))
print("ROC AUC Score for test data is:", roc_auc_score(Y_test, test_result))

F1 score for test data is: 0.9372835706040786
Recall for test data is: 0.9468283582089553
ROC AUC Score for test data is: 0.8882503332276322


# Part 2: Identifying and visualising the KPIs  

In order to understand the business implications of the model predictions, we will use atoti to create widgets.  

### 2.1 We will start by creating the dataframe which can be loaded into atoti

In [18]:
# Adding the sales prediction column to the test dataset
prediction_df = test_df.copy()
prediction_df.drop("Sale", axis=1, inplace=True)
prediction_df["Sales_prediction"] = test_result.tolist()

In [19]:
# finding the prediction probability from the model

# predict_proba returns one column per class; column 1 is the probability of a sale
xgb_predictions = xgbc.predict_proba(X_test)
probability = xgb_predictions[:, 1]

In [20]:
# Adding the column for the sales probability

prediction_df["sales_prediction_probability"] = probability

### 2.2 Creating the session in atoti

Creating a session spins up an in-memory database, similar to Apache Spark, ready to slice’n’dice your big data set.  
In addition, it launches a Tableau-like dashboarding web app.


In [21]:
config = {"user_content_storage": "content"}
session = tt.create_session(config=config)

In [22]:
# loading the data:  load the prediction dataframe into an atoti data store
predictions = session.read_pandas(
    prediction_df, table_name="predictions", keys=["cust_id"]
)
predictions.head()

cust_id,Driver_Age,Vehicle_Value,Price,Vehicle_Mileage,Credit_Score,Licence_Length_Years,Marital_Status,Tax,State,CLTV,Coverage_Type,Education,Employment_Status,Location_Code,Sales_Channel,Months_Policy_Inception,Policy_Type,Sales_prediction,sales_prediction_probability
1123349069299,37.0,6000.0,252.39,5000.0,359.11,4.42,M,25.24,California,9890.2,Extended,Bachelor,Employed,Rural,Agent,12,Personal Auto,0,0.00159
4467098784674,27.0,10000.0,477.8,6000.0,401.06,0.68,S,47.78,Arizona,6297.0,Premium,Master,Disabled,Urban,Branch,65,Corporate Auto,1,0.766472
869569892903,27.0,10000.0,340.29,6000.0,304.35,7.19,M,34.03,Washington,8709.8,Premium,Bachelor,Employed,Rural,Call Center,47,Corporate Auto,0,0.115257
2237607703548,18.0,8000.0,645.61,7000.0,345.25,0.43,M,64.56,Oregon,16675.9,Extended,Bachelor,Medical Leave,Urban,Web,89,Personal Auto,1,0.998519
2354445725997,25.0,5000.0,384.88,8000.0,353.73,5.17,M,19.24,Washington,7163.8,Extended,Doctor,Disabled,Urban,Web,51,Corporate Auto,0,0.260724


In [23]:
# create a data cube in atoti
cube = session.create_cube(predictions, name="predictions")

In [24]:
# aliases for the hierarchies, levels and measures of the cube

h, l, m = cube.hierarchies, cube.levels, cube.measures

In [25]:
# Let us create some new measures

# this is the proportion of quotes predicted to convert into sales
m["sale_propotion"] = m["Sales_prediction.SUM"] / m["contributors.COUNT"]

# this is the expected revenue: sale probability times price, summed over customers
m["revenue_realised"] = tt.agg.sum(
    m["sales_prediction_probability.SUM"] * m["Price.SUM"],
    scope=tt.scope.origin(l["cust_id"]),
)
# This is the proportion of the quoted revenue expected to be realised
m["revenue_realised_propotion"] = m["revenue_realised"] / m["Price.SUM"]
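
To make the measure definitions concrete, here is a rough pandas equivalent on a hypothetical four-row frame. The column names follow the notebook; the values and the `toy` and `kpis` names are made up for illustration. The product of probability and price is taken per customer before summing, which is what the `tt.scope.origin(l["cust_id"])` scope expresses; multiplying the two sums at an aggregated level would give a different number.

```python
import pandas as pd

# Hypothetical toy rows using the notebook's column names
toy = pd.DataFrame(
    {
        "Sales_Channel": ["Web", "Web", "Agent", "Agent"],
        "Price": [100.0, 300.0, 200.0, 200.0],
        "Sales_prediction": [1, 0, 1, 1],
        "sales_prediction_probability": [0.9, 0.1, 0.8, 0.6],
    }
)

# Row-level expected revenue, mirroring the per-customer scope of the measure
toy["expected_revenue"] = toy["sales_prediction_probability"] * toy["Price"]

kpis = toy.groupby("Sales_Channel").agg(
    sale_proportion=("Sales_prediction", "mean"),
    revenue_realised=("expected_revenue", "sum"),
    price_sum=("Price", "sum"),
)
kpis["revenue_realised_proportion"] = kpis["revenue_realised"] / kpis["price_sum"]

print(kpis)
```

In the cube the same ratios are recomputed automatically for whatever hierarchy the widget slices by, so no explicit `groupby` is needed there.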

### 2.3 Visualising KPIs with atoti widgets

In [26]:
session.visualize("Sales propotion by channel and coverage type")

In [27]:
session.visualize("Revenue realized by State and location code")

In [28]:
session.visualize("Sales by policy types")

In [29]:
session.visualize("Revenue and sales by policy and coverage type")

In [30]:
session.visualize("Revenue realised gauge")

In [31]:
# Let us compile all of this information in a dashboard

In [32]:
session.link(path="#/dashboard/39f")

Open the notebook in JupyterLab with the atoti extension enabled to see this link.

# Part 3: Price Elasticity with atoti 

We will use atoti to do simulations and hence understand the price elasticity across different customer segments.

## Scenario 1: The Pareto principle

Increase the price by 20% for all policies whose sales probability is above 80%.  
Drop the price by 20% for policies whose probability of being bought is below 20%.

The price drop targets the most price-sensitive customers among those who have not bought the policy.  
The price increase, on the other hand, helps identify the most price-insensitive customers among those who have actually bought the policy.

In [33]:
# updating the pricing based on above pareto simulation

prediction_df_pareto = prediction_df.copy()

prediction_df_pareto.loc[
    (prediction_df_pareto.sales_prediction_probability > 0.8), "Price"
] = prediction_df_pareto["Price"].apply(lambda x: x * 1.2)

prediction_df_pareto.loc[
    (prediction_df_pareto.sales_prediction_probability < 0.2), "Price"
] = prediction_df_pareto["Price"].apply(lambda x: x * 0.8)
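
An equivalent and somewhat more compact way to express this kind of conditional price change is to build the boolean masks once and multiply in place. The sketch below uses a hypothetical three-row frame with the notebook's column names; it is an illustration of the pattern, not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical toy rows using the notebook's column names
toy = pd.DataFrame(
    {
        "Price": [100.0, 200.0, 300.0],
        "sales_prediction_probability": [0.95, 0.50, 0.05],
    }
)

scenario = toy.copy()

# Build both masks from the baseline probabilities before touching any price
likely = scenario["sales_prediction_probability"] > 0.8
unlikely = scenario["sales_prediction_probability"] < 0.2

scenario.loc[likely, "Price"] *= 1.2    # raise price for likely buyers
scenario.loc[unlikely, "Price"] *= 0.8  # lower price for unlikely buyers

print(scenario["Price"].tolist())
```

Computing the masks up front makes it explicit that both rules are evaluated against the baseline, and rows matching neither rule keep their original price.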

### Using the model to make predictions on the scenario

In [34]:
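X_test_pareto = X_test.copy()

X_test_pareto["Price"] = prediction_df_pareto["Price"]

test_result_pareto = xgbc.predict(X_test_pareto)

prediction_df_pareto["Sales_prediction"] = test_result_pareto.tolist()
# refresh the sale probability as well, since revenue_realised is computed from it
prediction_df_pareto["sales_prediction_probability"] = xgbc.predict_proba(
    X_test_pareto
)[:, 1]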

### Creating a scenario in atoti using the model predictions

In [35]:
predictions.scenarios["Pareto price change"].load_pandas(prediction_df_pareto)

## Scenario 2: Boost Personal Auto and basic coverage
Of all the policy and coverage types, Personal Auto and Basic coverage are the weakest segments.  
Let us see what happens if we try to boost the Personal Auto line by increasing the policy price by 25%, and the Basic coverage by 15%.  

In [36]:
# updating the pricing
prediction_df_personal_auto = prediction_df.copy()

prediction_df_personal_auto.loc[
    (prediction_df_personal_auto.Policy_Type == "Personal Auto"), "Price"
] = prediction_df_personal_auto["Price"].apply(lambda x: x * 1.25)

prediction_df_personal_auto.loc[
    (prediction_df_personal_auto.Coverage_Type == "Basic"), "Price"
] = prediction_df_personal_auto["Price"].apply(lambda x: x * 1.15)

### Using the model to make predictions on the scenario

In [37]:
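X_test_personal_auto = X_test.copy()
X_test_personal_auto["Price"] = prediction_df_personal_auto["Price"]

test_result_personal_auto = xgbc.predict(X_test_personal_auto)
prediction_df_personal_auto["Sales_prediction"] = test_result_personal_auto.tolist()
# refresh the sale probability as well, since revenue_realised is computed from it
prediction_df_personal_auto["sales_prediction_probability"] = xgbc.predict_proba(
    X_test_personal_auto
)[:, 1]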

### Creating a scenario in atoti using the model predictions

In [38]:
predictions.scenarios["Personal Auto Boost"].load_pandas(prediction_df_personal_auto)

## Scenario 3: Geographic Improvements
In the states of Oregon and Washington, suburban areas are not performing well.  
So we can try dropping the prices by 10% for customers in rural areas of Oregon and by 15% in urban areas of Washington to see how price-sensitive the customers in the respective segments are. 

In [39]:
# updating the pricing based on above Scenario simulation

prediction_df_geo_improvement = prediction_df.copy()

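# restrict each change to the location code named in the scenario description
prediction_df_geo_improvement.loc[
    (prediction_df_geo_improvement.State == "Oregon")
    & (prediction_df_geo_improvement.Location_Code == "Rural"),
    "Price",
] = prediction_df_geo_improvement["Price"].apply(lambda x: x * 0.9)

prediction_df_geo_improvement.loc[
    (prediction_df_geo_improvement.State == "Washington")
    & (prediction_df_geo_improvement.Location_Code == "Urban"),
    "Price",
] = prediction_df_geo_improvement["Price"].apply(lambda x: x * 0.85)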

### Using the model to make predictions on the scenario

In [40]:
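X_test_geo_improvement = X_test.copy()

X_test_geo_improvement["Price"] = prediction_df_geo_improvement["Price"]

test_result_geo_improvement = xgbc.predict(X_test_geo_improvement)
prediction_df_geo_improvement["Sales_prediction"] = test_result_geo_improvement.tolist()
# refresh the sale probability as well, since revenue_realised is computed from it
prediction_df_geo_improvement["sales_prediction_probability"] = xgbc.predict_proba(
    X_test_geo_improvement
)[:, 1]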

### Creating a scenario in atoti using the model predictions

In [41]:
predictions.scenarios["Geographical Improvements"].load_pandas(
    prediction_df_geo_improvement
)

## Visualising Scenarios in atoti

In atoti, the data model is made of measures chained together. A simulation can be seen as changing one part of the model, either its source data or one of its measure definitions, and then evaluating how it impacts the following measures.  

The session now has different scenarios and the only differences between them are the lines corresponding to the price and prediction probability, **everything else is shared between the scenarios and has not been duplicated: source scenarios in atoti are memory-efficient.**
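
Once a scenario and the base case sit side by side, a segment's price sensitivity can be summarised as an elasticity. The sketch below uses the midpoint (arc) formula, which divides the relative change in quantity by the relative change in price, each measured against the average of the two values. The function name `arc_elasticity` and the numbers are hypothetical; in practice the inputs would be a segment's average price and predicted sale proportion read from the base and scenario columns of a widget.

```python
def arc_elasticity(q_base, q_new, p_base, p_new):
    """Midpoint (arc) elasticity: relative change in quantity over relative change in price."""
    pct_q = (q_new - q_base) / ((q_new + q_base) / 2)
    pct_p = (p_new - p_base) / ((p_new + p_base) / 2)
    return pct_q / pct_p


# Hypothetical segment: average price raised from 400 to 480,
# predicted sale proportion falling from 0.60 to 0.54
e = arc_elasticity(q_base=0.60, q_new=0.54, p_base=400.0, p_new=480.0)

print(round(e, 2))
```

An absolute value below 1, as in this made-up example, would mark the segment as relatively price-insensitive; a value above 1 would mark it as price-sensitive.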

In [42]:
session.visualize("Scenarios: Sales propotion by channel and coverage type")

In [43]:
session.visualize("Scenarios: Revenue realised by State and location code")

In [44]:
session.visualize("Scenarios: Revenue realized by policy types")

In [45]:
session.visualize("Scenario: Revenue and sales by policy and coverage type")

In [46]:
session.visualize("Scenarios: Revenue realised gauge")

In [47]:
## Let us summarize this information in a new tab of the dashboard we created above

In [48]:
session.link(path="#/dashboard/39f")

Open the notebook in JupyterLab with the atoti extension enabled to see this link.

# Conclusion:

In this notebook, we used atoti and a predictive model based on XGBoost classification.  

We saw above how atoti can seamlessly integrate with a predictive model to create different what-if scenarios and hence identify the price-sensitive and price-insensitive customers across different segments.

**Now we would like to invite you to try for yourself how atoti can help you simplify what-if scenarios and the price elasticity of demand!!**

<div style="text-align: center;" ><a href="https://www.atoti.io/?utm_source=gallery&utm_content=creditcard-fraud-detection" target="_blank" rel="noopener noreferrer"><img src="https://data.atoti.io/notebooks/banners/discover-try.png" alt="Try atoti"></a></div>