## Data Pricing

Data is valued differently depending on its use case. A hedge fund may want to use data to generate alpha, so it will pay a higher sum for exclusive use. By contrast, a medical researcher likely doesn't care whether the data they purchase access to is also used by others, but their willingness to pay is likely lower than the hedge fund's. This notebook explores pricing options and their consequences.

In [127]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from ortools.sat.python import cp_model
from ortools.linear_solver import pywraplp
from ast import literal_eval as make_tuple

#### Maximizing Profit

Our goal is simple: we as a firm want to maximize our profit $z$. We can sell each dataset $d \in D$ to one or multiple customers. The price customer $i$ will pay is their willingness to pay $w_{di}$, for each $0 \leq i \leq \bar{N}$, where $N$ is the set of existing customers. But some customers, whom we'll call data hoarders (DHs), won't buy if the dataset is also sold to others. For the purposes of this project we assume we have a good estimate of the $i$-th customer's willingness to pay for dataset $d$, $w_{di}$, and of whether they're OK with multiple parties having access to the dataset $(0)$ or not $(1)$, represented by $x_{di}$. Our decision variable is then $m_{d}$: whether we offer dataset $d$ to a single party $(0)$ or to multiple parties $(1)$. The truth table below shows whether customer $i$ remains a willing buyer (outcome $1$) or walks away (outcome $0$).

| $m_{d}$ | $x_{di}$ | outcome |
|----------|----------|---------|
| 0        | 1        | 1       |
| 1        | 1        | 0       |
| 0        | 0        | 1       |
| 1        | 0        | 1       |
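
The table can be captured in one line of arithmetic. The snippet below is a minimal sketch, reading the outcome column as "the customer remains a willing buyer"; the function name `buys` is introduced here only for illustration.

```python
def buys(m_d: int, x_di: int) -> int:
    """Return 1 if customer i remains a willing buyer of dataset d, else 0.

    m_d  : 1 if the dataset is offered to multiple parties, 0 if exclusive.
    x_di : 1 if the customer insists on exclusivity, 0 if sharing is fine.
    """
    # The only losing combination is a shared sale to a data hoarder.
    return 1 - m_d * x_di

# Reproduce the four rows of the truth table in the same order.
rows = [(m, x, buys(m, x)) for (m, x) in [(0, 1), (1, 1), (0, 0), (1, 0)]]
for row in rows:
    print(row)
```

This is the same factor that appears in the objective as $w_{di} - w_{di} m_{d} x_{di}$.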

We also know that selling to a single party means we can give access to only one customer. Thus if $m_{d} = 0$, the profit for dataset $d \in D$ is the maximum willingness to pay for that dataset: $z_d = \max_{0 \leq i \leq \bar{N}} w_{di}$.

$z = \displaystyle\sum_{d \in D}\left[\displaystyle\sum_{i = 0}^{\bar{N}}\left(w_{di} - w_{di}\,m_{d}\,x_{di} - w_{di}\operatorname{not}(m_{d})\right) + \max_{i}(w_{di})\operatorname{not}(m_{d})\right]$ <br />

Every dataset must have 0 or more purchasers: <br />
$\displaystyle\sum_{i = 0}^{\bar{N}}x_{di} \geq 0$ <br />
$m_{d}$ is a binary decision variable: <br />
$\forall d \in D;\ m_{d} \in \{0,1\}$ <br />
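
To see what the objective rewards, it helps to substitute each value of $m_{d}$ into the per-dataset profit $z_d$. This is a short derivation that reads $\operatorname{not}(m_{d})$ as $1 - m_{d}$ and counts the $\max$ term once per dataset, as the code further down does:

$m_{d} = 1:\quad z_d = \displaystyle\sum_{i = 0}^{\bar{N}} w_{di}\,(1 - x_{di})$ <br />
$m_{d} = 0:\quad z_d = \displaystyle\sum_{i = 0}^{\bar{N}} (w_{di} - w_{di}) + \max_{i} w_{di} = \max_{i} w_{di}$ <br />

So a shared sale collects from every customer who tolerates sharing, an exclusive sale collects only the top bid, and the optimizer picks whichever is larger for each dataset.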

Let's run through an example:

In [238]:
# load the CSV holding one (willingness to pay, x_di) tuple per dataset and user
df = pd.read_csv('data/user_profit_matrix.csv')
df

|   | dataset | u1     | u2     | u3     | u4     | u5     |
|---|---------|--------|--------|--------|--------|--------|
| 0 | d1      | (50,1) | (20,0) | (15,0) | (30,1) | (70,1) |
| 1 | d2      | (40,0) | (40,0) | (50,1) | (30,0) | (50,1) |


In [239]:
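def find_max(users):
    """Return the highest willingness to pay in a one-row DataFrame for a single dataset."""
    maxi = 0
    print(users)
    # iterate over the row's own columns rather than the global df
    for column in users.columns:
        if column != 'dataset':
            # each cell is a string such as "(50,1)"; element 0 is the willingness to pay
            wtp = make_tuple(users[column][0])[0]
            if wtp > maxi:
                maxi = wtp
    return maxi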

In [249]:
y = 0
solution = {}
obj = 0
#for every dataset available
for dataset in df['dataset']:
    label_solve = []

    #find the users interested in said dataset
    users = df.loc[df['dataset'] == dataset]
    users = users.reset_index(drop=True)
    
    #declare google OR model
    model = cp_model.CpModel()
    
    #instantiate the binary decision variable m_d for this dataset, keep track of it to retrieve the solution
    maxi = find_max(users)
    label = 'm'+str(y)
    print(label)
    label = model.NewBoolVar(label)
    label_solve.append(label)
    eq = 0

    #for every user interested in the current dataset d
    for column in df.columns:
        #ignore the first column of df, grab willingness to pay, and x_di
        if column != 'dataset':
            weight = make_tuple(users[column][0])
            wtp = weight[0]
            x = weight[1]
            # add to objective fx
            eq = eq + wtp  - wtp * label * x - wtp * label.Not()
    # add to objective fx for every dataset
    eq = eq + maxi * label.Not()
    y = y + 1
    print(eq)
    #use google OR to maximize, store solutions in solution dictionary
    model.Maximize(eq)
    solver = cp_model.CpSolver()
    status = solver.Solve(model)
    if status == cp_model.OPTIMAL:
        for label in label_solve:
            print(label)
            solution[label] = solver.Value(label)
        obj = obj + solver.ObjectiveValue()
        solution['Z' + str(y)] = solver.ObjectiveValue()
solution['Z'] = obj

  dataset      u1      u2      u3      u4      u5
0      d1  (50,1)  (20,0)  (15,0)  (30,1)  (70,1)
m0
((((((((((((((((-50 * m0) + 50) + (-50 * not(m0))) + 20)) + (-20 * not(m0))) + 15)) + (-15 * not(m0))) + 30) + (-30 * m0)) + (-30 * not(m0))) + 70) + (-70 * m0)) + (-70 * not(m0))) + (70 * not(m0)))
m0
  dataset      u1      u2      u3      u4      u5
0      d2  (40,0)  (40,0)  (50,1)  (30,0)  (50,1)
m1
(((((((((((((((-40 * not(m1)) + 40) + 40)) + (-40 * not(m1))) + 50) + (-50 * m1)) + (-50 * not(m1))) + 30)) + (-30 * not(m1))) + 50) + (-50 * m1)) + (-50 * not(m1))) + (50 * not(m1)))
m1


In [250]:
solution

{m0(0..1): 0, 'Z1': 70.0, m1(0..1): 1, 'Z2': 110.0, 'Z': 180.0}
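
Because each dataset has only two options, the solver's answer can be cross-checked by hand. The sketch below is not the notebook's CP-SAT model; it is a plain-Python comparison of the two cases, with the example data typed in directly and the helper name `best_mode` introduced only for illustration.

```python
# Example data copied from the table above: (willingness to pay, x_di) per user.
bids = {
    "d1": [(50, 1), (20, 0), (15, 0), (30, 1), (70, 1)],
    "d2": [(40, 0), (40, 0), (50, 1), (30, 0), (50, 1)],
}

def best_mode(dataset_bids):
    """Return (m_d, profit) for one dataset by comparing both options."""
    exclusive = max(w for w, _ in dataset_bids)         # m_d = 0: top bidder only
    shared = sum(w for w, x in dataset_bids if x == 0)  # m_d = 1: hoarders drop out
    # Ties go to the exclusive sale here; that tie-break is an arbitrary choice.
    return (0, exclusive) if exclusive >= shared else (1, shared)

check = {d: best_mode(b) for d, b in bids.items()}
total = sum(profit for _, profit in check.values())
print(check, total)
```

For `d1` the exclusive sale wins (70 versus 20 + 15 = 35), and for `d2` the shared sale wins (40 + 40 + 30 = 110 versus 50), giving the same total of 180 as the solver.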

#### Real Life Implementation

One possibility is to implement this via a variation of the Dutch auction. We start the auction at some arbitrarily high price and lower it at a set time interval or after each bid. At every bid, we ask the bidder whether they're willing to share the dataset with other users or require exclusive access. At the end of the auction, we use the above optimizer to decide whether the dataset should be sold to a single buyer or to multiple users.
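
The auction flow might be simulated roughly as follows. This is a sketch under several assumptions that the text does not specify: the price drops by a fixed `step`, every bidder bids truthfully at the first clock price at or below their valuation, and the auction keeps running after the first bid. The functions `descending_auction` and `choose_sale` are hypothetical helpers, and `choose_sale` stands in for the optimizer with a direct comparison of the two options.

```python
def descending_auction(bidders, start_price, step):
    """Collect (price, x) bids as the clock price falls.

    bidders     : dict name -> (willingness to pay, x), x = 1 for exclusivity
    start_price : opening clock price (assumed above every willingness to pay)
    step        : amount the price drops per tick (hypothetical parameter)
    """
    bids = {}
    price = start_price
    while price > 0 and len(bids) < len(bidders):
        for name, (wtp, x) in bidders.items():
            # A bidder accepts the first clock price at or below their valuation.
            if name not in bids and wtp >= price:
                bids[name] = (price, x)
        price -= step
    return bids

def choose_sale(bids):
    """Pick the more profitable option from the collected bids."""
    exclusive = max((p for p, _ in bids.values()), default=0)
    shared = sum(p for p, x in bids.values() if x == 0)
    return ("single", exclusive) if exclusive >= shared else ("multiple", shared)

# Dataset d2 from the example above.
d2 = {"u1": (40, 0), "u2": (40, 0), "u3": (50, 1), "u4": (30, 0), "u5": (50, 1)}
bids = descending_auction(d2, start_price=100, step=10)
print(bids)
print(choose_sale(bids))
```

In this run the recorded bids equal the true valuations because every valuation falls on a price tick. With a coarser step, bidders would pay the tick price below their valuation, which is one consequence of the mechanism worth exploring.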