# "Fun with Loot Boxes" Lab

> Author: Caroline Schmitt, Matt Brems

### Scenario:

You're an analyst for [Zynga](https://en.wikipedia.org/wiki/Zynga), a gaming studio working on an event for an MMO (massively multiplayer online) game. This event is going to include **loot boxes**.

<img src="https://vignette.wikia.nocookie.net/2007scape/images/0/06/Culinaromancer%27s_chest.png/revision/latest?cb=20180403231423" alt="drawing" width="150"/> 

A loot box is basically a treasure chest in a game. This loot box can be opened to reveal a variety of items: some items are very rare and valuable, other items are common and less valuable. (You may consult [the esteemed Wikipedia](https://en.wikipedia.org/wiki/Loot_box) for a more extensive definition.)

In our specific game, suppose that loot boxes can be obtained in one of two ways: 
- After every three hours of playing the game, a user will earn one loot box.
- If the user wishes to purchase a loot box, they may pay $1 (in real money!) for a loot box.

These loot boxes are very good for our business!
- If a player earns a loot box, it means they are spending lots of time on the game. This often leads to advertisement revenue, and they may tell their friends to join the game, etc.
- If the player purchases a loot box, it means we've earned $1 from our customer.

Suppose each loot box is opened to reveal either:
- magical elixir (super rare, very valuable), or
- nothing.

Whether each loot box contains the elixir or nothing is **random**. Our boss wants some guidance on what sort of randomness to use on these loot boxes! 
- If the magical elixir is too rare, then users may not be motivated to try to get them, because they believe they'll never find the magical elixir.
- If the magical elixir is too common, then users may not be motivated to try to get them, because the game has so much of the magical elixir that it isn't worthwhile to try to get it.

However, our boss isn't a math-y type person! When explaining things to our boss, we need to explain the impact of our choices on the game as concretely as possible.

### Version 1
In our first version of the game, we'll say that loot boxes contain magical elixir 15% of the time and nothing 85% of the time.

#### 1. Our boss asks, "If a user buys 100 loot boxes, how many elixirs will they get?" How would you respond?

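About 15 on average (100 × 0.15 = 15). This is the expected value, not a guarantee: an individual user may end up with a few more or a few fewer.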
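To give the boss a feel for how far an individual user might stray from the average, the sketch below computes the expected count together with its spread. The variable names are illustrative, and the 95% range is just one reasonable way to express "typical" outcomes.

```python
from scipy import stats

n_boxes = 100    # loot boxes purchased
p_elixir = 0.15  # chance that a box contains elixir

expected = stats.binom.mean(n_boxes, p_elixir)  # n * p
spread = stats.binom.std(n_boxes, p_elixir)     # sqrt(n * p * (1 - p))
low, high = stats.binom.interval(0.95, n_boxes, p_elixir)

print(f"Expected elixirs: {expected:.1f} (standard deviation {spread:.2f})")
print(f"Roughly 95% of users would land between {low:.0f} and {high:.0f} elixirs")
```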

#### 2. Our boss asks, "How many loot boxes does someone have to purchase in order to definitely get elixir?" How would you respond?

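The word <i>definitely</i> makes the question hard to answer.<br>Theoretically, no finite number of loot boxes guarantees an elixir: after n boxes there is still a 0.85<sup>n</sup> chance of having found none.<br>In practice, we can look for the number of boxes at which 99.9% of users have found at least one elixir and use that as a proxy for "definitely."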

In [1]:
import scipy.stats as stats

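n=0
result=0
#increase n until the 0.1th percentile of the elixir count is at least 1,
#i.e. until at least 99.9% of users get one or more elixirs from n boxes
while result == 0:
    n+=1
    result = stats.binom.ppf(q=1-0.999,n=n,p=0.15)
print(f"Minimum loot boxes required at 99.9th percentile: {n}")

Minimum loot boxes required at 99.9th percentile: 43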
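A closed-form cross-check for the same threshold: the chance of finding no elixir in n boxes is (1 - p)^n, so the smallest n with (1 - p)^n ≤ 0.001 can be computed directly. This is a sketch with illustrative variable names, not part of the original lab.

```python
import math

p_elixir = 0.15  # chance that a box contains elixir
target = 0.999   # required chance of at least one elixir

# P(no elixir in n boxes) = (1 - p)^n, so we need (1 - p)^n <= 1 - target
n_boxes = math.ceil(math.log(1 - target) / math.log(1 - p_elixir))

print(f"Boxes needed for a {target:.1%} chance of at least one elixir: {n_boxes}")
print(f"Chance of no elixir in {n_boxes} boxes: {(1 - p_elixir) ** n_boxes:.4%}")
```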


#### 3. Our boss asks, "If a user earns 100 loot boxes, what is the chance that a user gets more than 20 elixirs?" This is a bit more complicated, so let's break it down before answering.

#### 3a. Let's suppose my random variable $X$ counts up how many elixirs I observe out of my 100 loot boxes. Why is $X$ a discrete random variable?

X can only take the whole-number values 0, 1, 2, ..., 100. Because its possible values can be counted (you cannot get 2.5 elixirs), X is a discrete random variable.

#### 3b. Recall our discrete distributions: discrete uniform, Bernoulli, binomial, Poisson. Let's suppose my random variable $X$ counts up how many elixirs I observe out of my 100 loot boxes. What distribution is best suited for $X$? Why?
- Hint: It may help to consider getting the magical elixir a "success" and getting nothing a "failure." 

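Binomial distribution. X is the number of successes in a fixed number of trials (n = 100 loot boxes), where each trial is an independent Bernoulli trial with the same success probability (p = 0.15).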

#### 3c. Our boss asks, "If a user earns 100 loot boxes, what is the chance that a user gets more than 20 elixirs?" Use the probability mass function to answer the boss' question.

In [2]:
k=20
p=0.15
n=100

a=sum([stats.binom.pmf(x,n,p) for x in range(k+1,n+1)])

print(f"Probability is {a:.2%}")

Probability is 6.63%


#### 3d. Our boss asks, "If a user earns 100 loot boxes, what is the chance that a user gets more than 20 elixirs?" Use the cumulative distribution function to answer the boss' question.

In [3]:
k=20
p=0.15
n=100

a=1-stats.binom.cdf(k,n,p)

print(f"Probability is {a:.2%}")

Probability is 6.63%
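As a side note, SciPy also exposes the survival function `sf`, which returns P(X > k) directly and avoids the explicit `1 - cdf` subtraction. The short sketch below shows that both routes agree.

```python
from scipy import stats

k, n, p = 20, 100, 0.15

via_cdf = 1 - stats.binom.cdf(k, n, p)  # complement of P(X <= 20)
via_sf = stats.binom.sf(k, n, p)        # survival function: P(X > 20) directly

print(f"1 - cdf: {via_cdf:.4%}")
print(f"sf:      {via_sf:.4%}")
```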


#### 3e. Our boss asks, "If a user earns 100 loot boxes, what is the chance that a user gets more than 20 elixirs?" Answer your boss' question. *Remember that your boss is not a math-y person!*

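Roughly 7 out of every 100 users (about 1 in 15) who open 100 loot boxes will end up with more than 20 elixirs. Most users will land close to 15.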

#### 4. Your boss wants to know how many people purchased how many loot boxes last month. 
> For example, last month, 70% of users did not purchase any loot boxes. 10% of people purchased one loot box. 5% of people purchased two loot boxes... and so on.

#### 4a. Recall our discrete distributions: discrete uniform, Bernoulli, binomial, Poisson. Let's suppose my random variable $Y$ counts up how many loot boxes each person purchased through the game last month. What distribution is best suited for $Y$? Why?

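Poisson distribution. Y counts how many times an event (a purchase) occurs in a fixed interval of time (one month), the count has no fixed upper limit, and purchases are assumed to occur independently at a constant average rate.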

#### 4b. Suppose that, on average, your customers purchased 2.7 loot boxes last month. In order for your revenue to be at least $500,000, at least how many users would you need on your platform? (Round your answer up to the nearest thousand.) 

In [4]:
import math

avg_nof_boxes=2.7
revenue_per_user=1*avg_nof_boxes
revenue_target=500_000
nof_user=revenue_target/revenue_per_user

#round up to the nearest 1000
nof_user=math.ceil(nof_user/1000)*1000

print(f"Number of users: {int(nof_user)}")

Number of users: 186000


#### 4c. Assume that your platform has the number of users you mentioned in your last answer. Suppose that your platform calls anyone who purchases 5 or more loot boxes in a month a "high value user." How much money do you expect to have earned from "high value users?" How about "low value users?"

In [5]:
nof_high_value_users=sum([stats.poisson.pmf(no_purchased,avg_nof_boxes)*nof_user for no_purchased in range(5,100+1)])
nof_low_value_users=sum([stats.poisson.pmf(no_purchased,avg_nof_boxes)*nof_user for no_purchased in range(0,5)])
revenue_high_value_users=sum([stats.poisson.pmf(no_purchased,avg_nof_boxes)*nof_user*no_purchased for no_purchased in range(5,100+1)])
revenue_low_value_users=sum([stats.poisson.pmf(no_purchased,avg_nof_boxes)*nof_user*no_purchased for no_purchased in range(0,5)])
print(f"High value users: (users={nof_high_value_users:.0f}) contribute ${revenue_high_value_users:.2f}")
print(f"Low value users: (users={nof_low_value_users:.0f}) contribute ${revenue_low_value_users:.2f}")

High value users: (users=25499) contribute $143582.91
Low value users: (users=160501) contribute $358617.09
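The high-value revenue can be cross-checked without truncating the sum at 100 purchases. For a Poisson variable with mean λ, the sum of k·P(Y = k) over k ≥ c equals λ·P(Y ≥ c − 1). The sketch below compares that identity with the brute-force sum; the variable names are illustrative and the user count is the one from 4b.

```python
from scipy import stats

lam = 2.7          # average loot boxes purchased per user
n_users = 186_000  # user count from 4b
threshold = 5      # purchases needed to count as "high value"

# brute force: sum k * P(Y = k) over k >= threshold (truncated at 100)
brute = n_users * sum(k * stats.poisson.pmf(k, lam) for k in range(threshold, 101))

# identity: sum over k >= c of k * P(Y = k) equals lam * P(Y >= c - 1)
closed = n_users * lam * stats.poisson.sf(threshold - 2, lam)

print(f"High value revenue (sum):      ${brute:,.2f}")
print(f"High value revenue (identity): ${closed:,.2f}")
```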


#### 4d. Suppose that you want to summarize how many people purchased how many loot boxes last month for your boss. Since your boss isn't math-y, what are 2-4 summary numbers you might use to summarize this for your boss? (Your answers will vary here - use your judgment!)

Out of 186k users:<br>
High value users: 25,499 users (13.7% of the user base) contribute \\$143,583 (28.6% of total revenue)<br>
Low value users: 160,501 users (86.3% of the user base) contribute \\$358,617 (71.4% of total revenue)
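A few other boss-friendly summary numbers can be read straight off the Poisson model, assuming it describes purchases well. The sketch below is one possible selection, with illustrative variable names.

```python
from scipy import stats

lam = 2.7  # average loot boxes purchased per user last month

share_zero = stats.poisson.pmf(0, lam)    # users who bought nothing
median_boxes = stats.poisson.median(lam)  # the "typical" user
share_high = stats.poisson.sf(4, lam)     # users who bought 5 or more

print(f"Bought no loot boxes: {share_zero:.1%}")
print(f"Typical (median) user bought: {median_boxes:.0f}")
print(f"Bought 5 or more: {share_high:.1%}")
```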


#### 5. Your boss asks "How many loot boxes does it take before someone gets their first elixir?" Using `np.random.choice`, simulate how many loot boxes it takes someone to get their first elixir. 
- Start an empty list.
- Use control flow to have someone open loot boxes repeatedly.
- Once they open a loot box containing an elixir, record the number of loot boxes it took in the empty list.
- Repeat this process 100,000 times. 

This simulates how long it takes for someone to open a loot box containing elixir. Share the 5th, 25th, 50th, 75th, and 95th percentiles.

> You may find [this documentation](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html)  and [this documentation](https://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html) helpful.

In [6]:
#This code uses joblib's Parallel() to run the trials in multiple processes
#Requires joblib and tqdm; if they are not installed, call trial_loot_box() in a plain for loop instead.
import numpy as np
from tqdm.notebook import tqdm
from joblib import Parallel,delayed

n_trials=100_000
options=[1,0]
options_proba=[0.15,0.85]

first_loot_box=[]

def trial_loot_box():
    counter=0
    #loop until elixir is obtained and loop is broken.
    while True:
        counter+=1
        
        #make random pick
        pick=np.random.choice(a=options,p=options_proba)
        
        #if the elixir is chosen, return the number of boxes opened.
        #(appending to a global list here would be lost, since each worker process has its own copy)
        if pick==1:
            return counter

#run code in parallel, using all cores. Combined with tqdm() for nice progress bar.
first_loot_box=Parallel(n_jobs=-1)(delayed(trial_loot_box)() for _ in tqdm(range(n_trials)))
    

percentiles=[5,25,50,75,95]
for q in percentiles:
    print(f"{q}th percentile: {np.percentile(first_loot_box,q)}")

5th percentile: 1.0
25th percentile: 2.0
50th percentile: 5.0
75th percentile: 9.0
95th percentile: 19.0


In plain terms: at least a quarter of users find their first elixir within 2 loot boxes, half within 5, three quarters within 9, and 95% within 19.
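The number of boxes opened up to and including the first elixir follows a geometric distribution, so the simulated percentiles can be cross-checked against exact values. The sketch below swaps `np.random.choice` for NumPy's geometric sampler and SciPy's `geom.ppf`; it is a sanity check, not the method the exercise asks for, and the seed is arbitrary.

```python
import numpy as np
from scipy import stats

p_elixir = 0.15
percentiles = [5, 25, 50, 75, 95]

# simulated: each draw is the number of boxes opened up to and including the first elixir
rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
simulated = rng.geometric(p_elixir, size=100_000)

# exact: percent-point function of the geometric distribution
exact = stats.geom.ppf(np.array(percentiles) / 100, p_elixir)

for q, sim, ex in zip(percentiles, np.percentile(simulated, percentiles), exact):
    print(f"{q}th percentile: simulated {sim:.0f}, exact {ex:.0f}")
```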

### Version 2

After a substantial update to the game, suppose every loot box can be opened to reveal *one of four different* items:
- magical elixir (occurs 1% of the time, most valuable)
- golden pendant (occurs 9% of the time, valuable)
- steel armor (occurs 30% of the time, semi-valuable)
- bronze coin (occurs 60% of the time, least valuable)

#### 6. Suppose you want to repeat problem 5 above, but do that for the version 2 loot boxes so you can track how many loot boxes are needed to get each item. (e.g. You'd like to be able to say that on average it takes 10 trials to get a golden pendant, 3 trials to get steel armor, and so on.) What Python datatype is the best way to store this data? Why?

There are two ways to store this data, depending on how the simulation is executed.

<u>Multi Processing</u><br>
A list of lists fits.<br>
The outer list has length m, where m = number of trials.<br>
Each trial is recorded as an inner list of length n, where n = number of items to track.<br>
<br>
This approach is faster because it uses multiple CPU cores, but it produces a nested list that has to be aggregated afterwards.<br><br>
<u>Single/Multi Threading</u><br>
A single list of counters, one per item.<br>
The list has length n, where n = number of items to track.<br>
<br>
This approach gives a simpler output, but for this workload <i>Threading</i> is slower than <i>Multi Processing</i>.<br>
The code is CPU-bound rather than IO-bound, and the Global Interpreter Lock lets only one thread run Python bytecode at a time, so adding threads brings no benefit.<br>
Multi Threading is even slower than Single Threading here because of thread-management overhead.<br>
Multi Threading also introduces possible race conditions on the shared counters, and acquiring and releasing the lock takes additional CPU time.
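As an alternative worth considering, a dictionary keyed by item name can hold one list of waiting times per item, which mirrors problem 5 more closely and keeps the labels attached to the data. The sketch below is a minimal single-process illustration under assumed names (`boxes_until_first`, `first_seen`) and a deliberately small, arbitrary trial count.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility
items = ["magical elixir", "golden pendant", "steel armor", "bronze coin"]
probs = [0.01, 0.09, 0.30, 0.60]
n_trials = 500  # hypothetical, kept small so the sketch runs quickly

# one list of waiting times per item, keyed by item name
boxes_until_first = {item: [] for item in items}

for _ in range(n_trials):
    first_seen = {}  # item -> box number at which it first appeared in this trial
    opened = 0
    # keep opening boxes until every item has appeared at least once
    while len(first_seen) < len(items):
        opened += 1
        item = items[rng.choice(len(items), p=probs)]
        first_seen.setdefault(item, opened)
    for item, count in first_seen.items():
        boxes_until_first[item].append(count)

for item, counts in boxes_until_first.items():
    print(f"{item}: mean {np.mean(counts):.1f}, median {np.median(counts):.0f}")
```

With the waiting times stored per item, percentiles can be reported for each item in the same way as in problem 5.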

<b>All three approaches are implemented below for comparison.</b>

<u>Multi Processing</u>

In [62]:
#This code is based on Multi Processing
import numpy as np
from tqdm.notebook import tqdm
from joblib import Parallel,delayed

#0=elixir, 1=golden, 2=steel, 3=bronze
n_trials=100_000
options=[0,1,2,3]
options_name=["magical elixir","golden pendant","steel armor","bronze coin"]
options_proba=[0.01,0.09,0.3,0.6]

def trial_loot_box():
    pick=np.random.choice(a=options,p=options_proba)
    #create a list of 1 or 0, depending if that item is obtained or not.
    result=[1 if i==pick else 0 for i in options]
    return result

In [63]:
%%timeit
#code will run 7 times just to get average results
#run code in parallel, using all cores. Combined with tqdm() for nice progress bar.
loot_box_distribution=Parallel(n_jobs=-1)(delayed(trial_loot_box)() for _ in tqdm(range(n_trials)))

1.09 s ± 10.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [64]:
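#%%timeit does not keep variables assigned inside the timed cell, so run the simulation once more to keep the results
loot_box_distribution=Parallel(n_jobs=-1)(delayed(trial_loot_box)() for _ in range(n_trials))

#convert to numpy array for easier manipulation below
loot_box_distribution=np.array(loot_box_distribution)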

#avg number of boxes required
# = 1/p 
# = 1/(number of that item obtained/total number of trials)
# = total number of trials/number of that item obtained
avg_number_req=list(n_trials/np.sum(loot_box_distribution,axis=0))

#pair the results across 2 lists and print the results
for (item,count) in zip(options_name,avg_number_req):
    print(f"Average boxes required to obtain {item}: {count:.1f}")

Average boxes required to obtain magical elixir: 100.4
Average boxes required to obtain golden pendant: 11.1
Average boxes required to obtain steel armor: 3.3
Average boxes required to obtain bronze coin: 1.7


<u>Multi Threading</u>

In [56]:
#This code is based on Multi Threading
import numpy as np
from tqdm.notebook import tqdm
from joblib import Parallel,delayed
from threading import Lock

#0=elixir, 1=golden, 2=steel, 3=bronze
n_trials=100_000
options=[0,1,2,3]
options_name=["magical elixir","golden pendant","steel armor","bronze coin"]
options_proba=[0.01,0.09,0.3,0.6]
results=[0,0,0,0]

#Lock object to prevent a race condition: results[pick]+=1 is a read-modify-write that is not atomic, even under the Global Interpreter Lock
lock=Lock()

def trial_loot_box():    
    pick=np.random.choice(a=options,p=options_proba)
    #hold the lock only while updating the shared counter
    with lock:
        results[pick]+=1

In [57]:
%%timeit
#code will run 7 times just to get average results
#run code in parallel, using all cores. Combined with tqdm() for nice progress bar.
Parallel(n_jobs=-1,prefer="threads")(delayed(trial_loot_box)() for _ in tqdm(range(n_trials)))

12.9 s ± 452 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [58]:
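# pair the results across 2 lists and print the results
# NOTE: %%timeit executed the cell above 8 times (1 calibration run + 7 timed runs), so results holds
# 8*n_trials draws and n_trials/count below is 8x too small; sum(results)/count gives the correct average.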
for (item,count) in zip(options_name,results):
    print(f"Average boxes required to obtain {item}: {n_trials/count:.1f}")

Average boxes required to obtain magical elixir: 12.5
Average boxes required to obtain golden pendant: 1.4
Average boxes required to obtain steel armor: 0.4
Average boxes required to obtain bronze coin: 0.2


<u>Single Threading</u>

In [59]:
#This code is based on single threading
import numpy as np
from tqdm.notebook import tqdm
from joblib import Parallel,delayed
# from threading import Lock

#0=elixir, 1=golden, 2=steel, 3=bronze
n_trials=100_000
options=[0,1,2,3]
options_name=["magical elixir","golden pendant","steel armor","bronze coin"]
options_proba=[0.01,0.09,0.3,0.6]
results=[0,0,0,0]

def trial_loot_box():
    pick=np.random.choice(a=options,p=options_proba)
    results[pick]+=1

In [60]:
%%timeit
#code will run 7 times just to get average results
for _ in tqdm(range(n_trials)):
    trial_loot_box()

3.52 s ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [61]:
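# pair the results across 2 lists and print the results
# NOTE: %%timeit executed the cell above 8 times (1 calibration run + 7 timed runs), so results holds
# 8*n_trials draws and n_trials/count below is 8x too small; sum(results)/count gives the correct average.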
for (item,count) in zip(options_name,results):
    print(f"Average boxes required to obtain {item}: {n_trials/count:.1f}")

Average boxes required to obtain magical elixir: 12.6
Average boxes required to obtain golden pendant: 1.4
Average boxes required to obtain steel armor: 0.4
Average boxes required to obtain bronze coin: 0.2


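<b>Note that the Multi Processing results from the Monte Carlo trials follow 1/p from the problem description above (about 100, 11.1, 3.3 and 1.7 boxes).</b><br>
<b>The Multi Threading and Single Threading printouts are 8x too small because %%timeit ran the simulation 8 times while the average still divides by n_trials; multiplied by 8, they also agree with 1/p up to rounding.</b><br>
<b>Also note that Multi Processing is about 3x faster than Single Threading, which is in turn about 3.5x faster than Multi Threading.</b>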
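Much of the run time in all three versions comes from calling `np.random.choice` once per loot box. A vectorized draw, sketched below with NumPy's `Generator.choice` and `np.bincount`, produces all boxes in a single call and is likely to be faster than any of the three approaches; no timing is claimed here, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
n_trials = 100_000
options_name = ["magical elixir", "golden pendant", "steel armor", "bronze coin"]
options_proba = [0.01, 0.09, 0.3, 0.6]

# draw all loot boxes in one call instead of one np.random.choice call per box
picks = rng.choice(len(options_name), size=n_trials, p=options_proba)

# count how often each item index (0..3) was drawn
counts = np.bincount(picks, minlength=len(options_name))

for item, count in zip(options_name, counts):
    print(f"Average boxes required to obtain {item}: {n_trials / count:.1f}")
```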

#### 7. Suppose you and your boss want to measure whether "Version 2" is better than "Version 1." What metrics do you think are important to measure? (Your answers will vary here - use your judgment!)

Two primary metrics matter for games in general.<br>
\#1 Revenue generated directly from loot boxes<br>
\#2 Monthly active users<br><br>
These metrics affect the company's bottom line both directly and indirectly:<br>
directly through sales of in-game items, and indirectly through play time, which drives ad revenue.