# Probabilistic Analysis and Randomized Algorithms

Src: Chapter 5.1 of Cormen et al., *Introduction to Algorithms* (CLRS), 4th edition, MIT Press, discusses probabilistic analysis and randomized algorithms.

Probabilistic analysis is the use of probability theory to analyze algorithms whose cost depends on randomness or uncertainty, for example on the order in which inputs arrive. Instead of analyzing a single run, we assume a distribution over the inputs and calculate the expected behavior of the algorithm, averaged over that distribution.

Randomized algorithms, on the other hand, are algorithms that themselves make random choices, typically by calling a random-number generator. They can be simpler than deterministic alternatives, and their expected cost holds for every input, because the randomness comes from the algorithm rather than from an assumed input distribution.

Probabilistic analysis is particularly useful for analyzing randomized algorithms, as it allows us to reason about the expected behavior of an algorithm over many runs. For example, if we run a randomized algorithm 100 times and observe that it gives the correct answer 95 times, we only have an empirical estimate of its success rate; probabilistic analysis lets us calculate the probability of a correct answer on any given run directly from the algorithm's random choices.



## Hire Assistant problem

Suppose you need to hire a new office assistant but your previous attempts have been unsuccessful. To solve this problem, you decide to use an employment agency that will send you one candidate each day for an interview. You have to pay a small fee to the agency for each interview, and hiring an applicant is even more costly as it requires firing your current office assistant and paying a substantial hiring fee to the agency. Since you are committed to having the best possible person for the job, you have decided that if a candidate is more qualified than your current assistant, you will hire the new candidate and fire the current assistant. Although you are willing to pay the resulting cost, you want to estimate the price of this strategy.

![Hire Assistant Problem](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*lXgGoeOwoKEFjXmXRjXZZQ.png)

Src: https://www.cantorsparadise.com/math-based-decision-making-the-secretary-problem-a30e301d8489

## Trivial Case - Instant interviews - no costs to hire and fire

In the trivial case, we could simply go through all the applicants and choose the best one: taking the maximum of all scores is a simple O(n) scan.

The situation changes if we add costs ("friction").

### Hire assistant problem pseudocode

The following pseudocode describes the algorithm for hiring an office assistant using the employment agency. The algorithm takes as input a list of candidates from the employment agency and returns the cost of hiring and firing office assistants.

* Set the current best candidate to None and the current best candidate's score to 0.
* While there are still candidates from the employment agency:
    * Interview the next candidate.
    * If the candidate's score is higher than the current best candidate's score:
        1. Fire the current office assistant.
        2. Hire the new candidate.
        3. Set the current best candidate to the new candidate.
    * Otherwise, do not hire the candidate.
* Return the cost of hiring and firing office assistants.

Note: The exact scoring system used to evaluate candidates is not specified in the problem statement, so it would need to be defined or assumed in the implementation of the procedure. Additionally, the cost of hiring and firing office assistants is not specified, so that would need to be estimated based on the specific circumstances of the problem.

In [20]:
def score(candidate):
    # you could imagine a whole process of interview here which considers 10-50 parameters, 
    # 7 rounds of interviews etc etc
    return candidate # for now we assume the candidate is the score

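def hire_assistant(candidates, hire_cost=0, fire_cost=0, interview_cost=0, debug=True):
    '''
    Run the hiring strategy over candidates and return the total cost.
    Default costs represent ideal world where time and money are meaningless...
    '''
    best_candidate = None
    best_score = 0  # assumes every real score is greater than 0
    total_cost = 0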
    
    for candidate in candidates:
        # Score the candidate (replace this with your own scoring function)
        candidate_score = score(candidate)
        
        if candidate_score > best_score:
            # Fire the current office assistant and hire the new candidate
            # (fire_cost is charged on the very first hire too, when there is nobody to fire yet)
            total_cost += fire_cost + hire_cost
            if debug:
                print(f"Hiring candidate with score {candidate_score}, paying {hire_cost}")
                print(f"paying {fire_cost} to fire {best_candidate}")
                print(f"Total cost now is {total_cost}")
            
            best_candidate = candidate
            best_score = candidate_score
        
        # in any case we have to pay the cost of interview (you and your co-workers spending time with prospect, hotel_rooms, food, etc)
        total_cost += interview_cost
            
    return total_cost

In [16]:
import random # we will need a lot of random in this notebook
random.seed(42)  # so everyone should get same pseudo-randoms
# it is like buying that book of 1 million random numbers on Amazon from 1950s
# here we will make some random scores
random.random() # we could also use randint or even some of the premade distribution

0.6394267984578837

In [17]:
candidates = [random.random() for _ in range(20)]
candidates

[0.025010755222666936,
 0.27502931836911926,
 0.22321073814882275,
 0.7364712141640124,
 0.6766994874229113,
 0.8921795677048454,
 0.08693883262941615,
 0.4219218196852704,
 0.029797219438070344,
 0.21863797480360336,
 0.5053552881033624,
 0.026535969683863625,
 0.1988376506866485,
 0.6498844377795232,
 0.5449414806032167,
 0.2204406220406967,
 0.5892656838759087,
 0.8094304566778266,
 0.006498759678061017,
 0.8058192518328079]

## Knowing the random distribution

In the above example we know the maximum possible score (1.0), so we could set up a heuristic rule in advance: say, any candidate scoring over 0.99 is amazing, so hire them and stop interviewing early.

However, in a real-life situation the range of scores will most likely not be known in advance.
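
A minimal sketch of such an early-stopping rule, assuming scores are known to lie between 0 and 1. The function name `hire_with_threshold` and the 0.99 cutoff are illustrative assumptions, not part of the original code, and a private `random.Random` instance is used so the notebook's seeded global generator is left untouched.

```python
import random

def hire_with_threshold(scores, threshold=0.99):
    """Track the running best, but stop interviewing once a score reaches threshold.

    Returns (best_score, interviews_done). The threshold only makes sense
    when the range of possible scores is known in advance.
    """
    best_score = float("-inf")
    interviews_done = 0
    for s in scores:
        interviews_done += 1
        if s > best_score:
            best_score = s
        if best_score >= threshold:
            break  # good enough: skip the remaining interviews
    return best_score, interviews_done

rng = random.Random(0)  # private generator with its own seed
scores = [rng.random() for _ in range(1000)]
print(hire_with_threshold(scores, threshold=0.99))
```

If no candidate ever reaches the threshold, the function simply interviews everyone and returns the overall maximum.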

In [21]:
hire_assistant(candidates, 2_000, 5_000, 200)

Hiring candidate with score 0.025010755222666936, paying 2000
paying 5000 to fire None
Total cost now is 7000
Hiring candidate with score 0.27502931836911926, paying 2000
paying 5000 to fire 0.025010755222666936
Total cost now is 14200
Hiring candidate with score 0.7364712141640124, paying 2000
paying 5000 to fire 0.27502931836911926
Total cost now is 21600
Hiring candidate with score 0.8921795677048454, paying 2000
paying 5000 to fire 0.7364712141640124
Total cost now is 29000


32000

In [23]:
# employment agency gives you a list of candidates in already sorted order, 
# sadly for you it is ascending and you do not realize that
sorted_candidates = sorted(candidates)
hire_assistant(sorted_candidates, 2_000, 5_000, 200)

Hiring candidate with score 0.006498759678061017, paying 2000
paying 5000 to fire None
Total cost now is 7000
Hiring candidate with score 0.025010755222666936, paying 2000
paying 5000 to fire 0.006498759678061017
Total cost now is 14200
Hiring candidate with score 0.026535969683863625, paying 2000
paying 5000 to fire 0.025010755222666936
Total cost now is 21400
Hiring candidate with score 0.029797219438070344, paying 2000
paying 5000 to fire 0.026535969683863625
Total cost now is 28600
Hiring candidate with score 0.08693883262941615, paying 2000
paying 5000 to fire 0.029797219438070344
Total cost now is 35800
Hiring candidate with score 0.1988376506866485, paying 2000
paying 5000 to fire 0.08693883262941615
Total cost now is 43000
Hiring candidate with score 0.21863797480360336, paying 2000
paying 5000 to fire 0.1988376506866485
Total cost now is 50200
Hiring candidate with score 0.2204406220406967, paying 2000
paying 5000 to fire 0.21863797480360336
Total cost now is 57400
Hiring cand

144000

## Solution when you suspect a sorted list

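You would shuffle it before interviewing, and avoid the pain and cost of hiring and firing so many people (or so many marriages...). With a random order, the expected number of hires drops from n to roughly ln n.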
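
As a small illustration of the effect of shuffling, the sketch below counts hires on an ascending list and on a shuffled copy of it. `count_hires` is a hypothetical helper that only counts new running maxima; it stands in for the hiring loop and ignores the cost parameters.

```python
import random

def count_hires(scores):
    """Count how often a new running maximum appears (each one is a hire)."""
    hires = 0
    best = float("-inf")
    for s in scores:
        if s > best:
            best = s
            hires += 1
    return hires

rng = random.Random(1)  # private generator, does not disturb the global seed
ascending = sorted(rng.random() for _ in range(20))  # worst case: everyone gets hired
shuffled = ascending[:]                               # copy, keep the original intact
rng.shuffle(shuffled)

print(count_hires(ascending))  # 20 hires, one per candidate
print(count_hires(shuffled))   # typically far fewer
```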


### Hire assistant problem implementation explanation

The code above takes in a list of candidates, the cost of hiring a new office assistant, the cost of firing the current office assistant, and the cost of a single interview. It then iterates through each candidate and evaluates them using the score() function (which you would need to define or replace with your own scoring function). If the candidate has a higher score than the current best candidate, the code fires the current office assistant, hires the new candidate, adds fire_cost and hire_cost to total_cost, and updates the best_candidate and best_score variables. If the candidate has a lower or equal score, the code does not hire them. In either case, interview_cost is added to total_cost for every candidate. Finally, the code returns the total cost of interviewing, hiring, and firing office assistants.

## Online Decision Problem

An online decision problem is a problem where the input is revealed over time and decisions must be made without complete knowledge of the future input. In other words, the algorithm must make decisions without seeing the entire input in advance.

In contrast, an offline decision problem is one where the entire input is known in advance and the algorithm can take as much time as it needs to make a decision.

Online decision problems are common in many areas of computer science, including optimization, game theory, machine learning, and networking. In these problems, the algorithm must make decisions based on incomplete information, and the goal is usually to minimize some measure of cost or maximize some measure of performance.

The Hire Assistant problem is an example of an online decision problem because the candidates are revealed over time, and the algorithm must make a decision after each candidate is evaluated, without knowledge of future candidates. Similarly, other examples of online decision problems include routing packets in a computer network, scheduling tasks on a processor, or bidding in an auction.

## Monte Carlo Simulation

The Monte Carlo method is a family of computational algorithms that use random sampling to estimate the solutions to problems in various fields such as physics, engineering, finance, and computer science. It is named after the famous Monte Carlo Casino in Monaco, where games of chance use random numbers to determine the outcome.

The Monte Carlo method typically involves simulating a large number of random samples or scenarios to generate estimates of complex systems or problems that are difficult to solve analytically. These random samples are used to estimate probabilities or expected values of the system or problem under investigation.

For example, in physics, the Monte Carlo method is used to simulate the behavior of particles in a system by generating random positions and velocities for each particle and then computing the resulting behavior of the system. In finance, Monte Carlo simulations are used to estimate the value of financial instruments such as options or bonds, by simulating a large number of possible future scenarios and calculating the expected value of the instrument under each scenario.

The Monte Carlo method can be particularly useful in situations where the problem is too complex to be solved analytically, and there are many sources of randomness or uncertainty involved. However, the accuracy of Monte Carlo simulations depends on the number of samples or scenarios simulated, and in some cases, the method can be computationally expensive.

We can use the Monte Carlo method to simulate the Hire Assistant problem and estimate its expected cost:

1. Generate a large number of candidate pools, each containing a random permutation of the same set of candidates.
2. For each candidate pool, run the Hire Assistant algorithm on the candidates and record the total cost of hiring and firing assistants.
3. Compute the average cost over all the candidate pools to obtain an estimate of the expected cost.

## Law of Large Numbers

* https://en.wikipedia.org/wiki/Law_of_large_numbers

![Fair Dice](https://upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Lawoflargenumbers.svg/450px-Lawoflargenumbers.svg.png)

### Wisdom of the crowds

### Regression to the mean

In [24]:
throws = 1
for _ in range(7):
    throws *= 10
    # print(f"Throwing  dice{throws} times")
    dice_throws = [random.randint(1,6) for _ in range(throws)]
    avg = sum(dice_throws) / throws
    print(f"Average dice from {throws} is {avg}")

Average dice from 10 is 2.9
Average dice from 100 is 3.52
Average dice from 1000 is 3.524
Average dice from 10000 is 3.5448
Average dice from 100000 is 3.49881
Average dice from 1000000 is 3.500682
Average dice from 10000000 is 3.4990848


In [31]:
random.sample([1,2,3,4], 2) # returns a random sample without replacement - meaning no element is picked twice

[3, 1]

In [32]:
## Monte Carlo Simulation

# import random

def hire_assistant_simulate(candidates, hire_cost, fire_cost, interview_cost, num_simulations):
    total_cost = 0
    n = len(candidates)
    
    for i in range(num_simulations):
        # Generate a random permutation of the candidates
        candidate_pool = random.sample(candidates, n)
        
        # Run the Hire Assistant algorithm on the candidate pool and record the cost
        cost = hire_assistant(candidate_pool, hire_cost, fire_cost, interview_cost, debug=False)
        total_cost += cost
        
    # Compute the average cost over all the simulations
    average_cost = total_cost / num_simulations
    return average_cost

In [43]:
hire_assistant_simulate(candidates, hire_cost=2_000, fire_cost=5_000, interview_cost=200, num_simulations=100_000)

29212.32

In [None]:
# so using some simulations we could estimate how much we would end up paying 
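
The simulated average can be cross-checked against a closed-form expectation. Assuming all scores are distinct and every arrival order is equally likely, candidate i is the best of the first i with probability 1/i, so the expected number of hires is the harmonic number H_n = 1 + 1/2 + ... + 1/n. The sketch below applies the same cost model as the code above (every hire pays hire_cost + fire_cost, every interview pays interview_cost); `expected_cost` is a hypothetical helper name.

```python
def expected_cost(n, hire_cost, fire_cost, interview_cost):
    """Expected total cost for n distinct candidates in uniformly random order."""
    # Candidate i is the best of the first i with probability 1/i,
    # so the expected number of hires is the harmonic number H_n.
    expected_hires = sum(1 / i for i in range(1, n + 1))
    return n * interview_cost + expected_hires * (hire_cost + fire_cost)

print(expected_cost(20, hire_cost=2_000, fire_cost=5_000, interview_cost=200))
```

For 20 candidates this comes to roughly 29,184, close to the simulated value of 29212.32.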

## When distribution of scores is not known

In the previous case we knew that scores range from 0 to 1.

If we do not know the distribution of scores, we need to learn it from the applicants themselves.

https://en.wikipedia.org/wiki/Secretary_problem

The difficulty is that we get to hire only once and the decision is final: a rejected candidate cannot be called back.

It can be proven that the first n/e candidates (e is Euler's number, about 2.718) should be interviewed but skipped; they are used only for learning the distribution.

After that, you simply hire the first candidate whose score is higher than the highest score among the first n/e candidates.

Even so, this rule picks the very best candidate only about 37% of the time (probability 1/e).

In the worst case you end up hiring the last person, who could be really bad.

For n = 100, this worst case happens when none of the remaining 63 candidates beats the best person from the first 37 (100/e is about 36.8).

In other words, the best person overall was among the skipped candidates.

In [44]:
import math
math.e

2.718281828459045

In [45]:
100/math.e

36.787944117144235

In [None]:
## TODO Do simulation to empirically show that n/e is optimal
## TODO show mathematical proof of n/e

In [None]:
## TODO write simulation to prove how n/e approach works
# def hire(candidates, threshold):
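
A possible sketch of such a simulation is shown below. It assumes uniformly random scores and a fallback of hiring the last candidate when nobody beats the skipped group; `secretary_pick` and `success_rate` are hypothetical names. It illustrates the n/e rule empirically and does not prove optimality.

```python
import math
import random

def secretary_pick(scores, skip):
    """Skip the first `skip` candidates, then hire the first one who beats them all.

    Returns the index of the hired candidate; falls back to the last candidate
    if nobody beats the best of the skipped group.
    """
    best_seen = max(scores[:skip], default=float("-inf"))
    for i in range(skip, len(scores)):
        if scores[i] > best_seen:
            return i
    return len(scores) - 1

def success_rate(n, skip, trials, rng):
    """Fraction of random orderings in which the overall best candidate is hired."""
    wins = 0
    for _ in range(trials):
        scores = [rng.random() for _ in range(n)]
        picked = secretary_pick(scores, skip)
        if scores[picked] == max(scores):
            wins += 1
    return wins / trials

rng = random.Random(2024)  # private generator with its own seed
n = 100
for skip in (5, 20, round(n / math.e), 60, 90):
    print(f"skip={skip:2d}  success rate ~ {success_rate(n, skip, 5_000, rng):.3f}")
```

With enough trials, the success rate should peak near skip = n/e and fall off on both sides.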


## Side Story - Calculating Pi via Monte Carlo method

To calculate the value of pi using the Monte Carlo method, we can simulate a large number of random points in a square and calculate the proportion of those points that lie inside the circle inscribed in the square. A circle of radius 1 has area pi and the enclosing square with sides of length 2 has area 4, so that proportion approaches pi/4.

Here are the steps to calculate the value of pi using Monte Carlo method:

* Generate a large number of random points within a square with sides of length 2 centered at the origin.
* Count the number of points that lie inside the circle of radius 1 centered at the origin.
* Use the proportion of points inside the circle as an estimate of the ratio of the circle's area to the square's area, which is pi/4.
* Estimate the value of pi as four times that proportion.

![Circle](https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Pi_30K.gif/440px-Pi_30K.gif)

In [46]:
# import random

def estimate_pi(num_points):
    num_points_in_circle = 0
    for _ in range(num_points):
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)
        if x**2 + y**2 <= 1:
            num_points_in_circle += 1
    pi_estimate = 4 * num_points_in_circle / num_points
    return pi_estimate

# This code takes in the number of points to generate,
# generates random points within a square with sides of length 2 centered at the origin,
# counts the number of points that lie inside the circle of radius 1 centered at the origin,
# uses the proportion of points inside the circle
# as an estimate of the area ratio circle / square = pi / 4,
# and estimates the value of pi as four times that proportion.
# The more points generated, the more accurate the estimate of pi tends to be.

In [47]:
throws = 1
for _ in range(7):
    throws *= 10
    # print(f"Throwing  dice{throws} times")
    print(f"Average PI from {throws} pins is {estimate_pi(throws)}")


Average PI from 10 pins is 3.6
Average PI from 100 pins is 3.32
Average PI from 1000 pins is 3.172
Average PI from 10000 pins is 3.126
Average PI from 100000 pins is 3.13924
Average PI from 1000000 pins is 3.142788
Average PI from 10000000 pins is 3.141324


## Side story: The Monty Hall Problem

![Goat](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/Monty_open_door.svg/440px-Monty_open_door.svg.png)

## TODO time allowing or next time

The Monty Hall problem is a famous probability puzzle that is named after the host of the game show "Let's Make a Deal," Monty Hall. The problem is based on a hypothetical game show where a contestant is presented with three doors. Behind one of the doors is a valuable prize, while the other two doors hide goats.

The contestant chooses one of the three doors, but before the chosen door is opened, the host (Monty Hall) opens one of the other two doors to reveal a goat. The contestant is then given the option to stick with their original choice or switch to the other unopened door.

The question is whether the contestant should stick with their original choice or switch to the other door in order to increase their chances of winning the prize. The answer may seem counterintuitive, but switching actually increases the contestant's chances of winning the prize from 1/3 to 2/3. This is because when the contestant first made their choice, they had a 1/3 chance of being correct. When the host opened one of the other doors to reveal a goat, the remaining unopened door had a 2/3 chance of hiding the prize.

### Correct strategy for Monty Hall problem

The correct strategy for the Monty Hall problem is to always switch to the other unopened door. The contestant's initial choice has a 1/3 chance of being correct and a 2/3 chance of hiding a goat. If the initial choice hides a goat, the host has to open the only other goat door, so the remaining unopened door hides the prize. Therefore, switching wins exactly when the initial choice was wrong, which happens with probability 2/3, while staying wins with probability 1/3.
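
The 1/3 and 2/3 figures can also be checked exactly, without any randomness, by enumerating every equally likely combination of prize door and first pick. This is a minimal sketch; `win_probability` is a hypothetical helper and, unlike a shortcut simulation, it models the host's door explicitly.

```python
from fractions import Fraction

def win_probability(switch):
    """Exact win probability, enumerating every (prize door, first pick) pair."""
    doors = (0, 1, 2)
    wins = 0
    cases = 0
    for prize in doors:
        for pick in doors:
            # The host opens a door that is neither the pick nor the prize.
            # If two such doors exist, either choice leads to the same outcome.
            opened = next(d for d in doors if d != pick and d != prize)
            if switch:
                final = next(d for d in doors if d != pick and d != opened)
            else:
                final = pick
            wins += (final == prize)
            cases += 1
    return Fraction(wins, cases)

print(win_probability(switch=False))  # 1/3
print(win_probability(switch=True))   # 2/3
```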

Famously, the problem became widely known in 1990, when columnist Marilyn vos Savant answered it in her "Ask Marilyn" column in Parade magazine.

Wiki: https://en.wikipedia.org/wiki/Monty_Hall_problem

In [49]:

def monty_hall_simulation(switch):
    doors = ["goat", "goat", "car"]
    random.shuffle(doors)
    chosen_door = random.choice(doors)
    if chosen_door == "car":
        if switch: # so we chose the switch strategy and were unlucky to have chosen the car already - so we get goat
            return 0
        else: # no switch strategy - stay put
            return 1
    else: # when we have chosen a goat 
        if switch: # we apply switch strategy
            return 1  # we win the car
        else:  # stay put strategy fails here - we end up with the goat
            return 0

num_simulations = 1_000_000
switch = True
wins = 0

for i in range(num_simulations):
    wins += monty_hall_simulation(switch)

print(f"Probability of winning with switch: {wins / num_simulations:.4f}")
print(f"Probability of winning without switch: {(num_simulations - wins) / num_simulations:.4f}")

Probability of winning with switch: 0.6663
Probability of winning without switch: 0.3337


In [52]:
num_simulations = 1_000_000
switch = False
wins = 0

for i in range(num_simulations):
    wins += monty_hall_simulation(switch)

print(f"Probability of winning without switch: {wins / num_simulations:.4f}")
print(f"Probability of winning WITH switch: {(num_simulations - wins) / num_simulations:.4f}")

Probability of winning without switch: 0.3328
Probability of winning WITH switch: 0.6672


## Using random simulation to obtain answers

So if you have trouble coming up with an exact answer for some algorithm, you can use simulation to come up with a good approximation.

The key idea is to have some knowledge of the distribution of inputs.

## Jupyter %%timeit also works on this principle

In [56]:
%%timeit
sorted(list(range(1_000_000)))

62.2 ms ± 5.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
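
Outside a notebook, a similar measurement can be sketched with the standard-library `timeit` module: run the statement several times, then report the mean and standard deviation per loop. The list size, `repeat`, and `number` values below are arbitrary choices for illustration, and the details of how `%%timeit` picks its own loop counts may differ.

```python
import statistics
import timeit

# Time the same statement in several independent runs and summarize the samples.
runs = timeit.repeat(
    stmt="sorted(data)",
    setup="data = list(range(100_000))",
    repeat=5,    # number of independent runs
    number=10,   # loops per run
)
per_loop = [t / 10 for t in runs]  # seconds per single execution
print(f"{statistics.mean(per_loop) * 1e3:.2f} ms ± "
      f"{statistics.stdev(per_loop) * 1e3:.2f} ms per loop")
```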


## Optimal Stopping Problem

The Hire Assistant problem is a classic example of an online decision problem in which we need to make a sequence of decisions without having full information about future events. In its optimal stopping variant (the secretary problem), we may hire only once, the decision is final, and the goal is to maximize the probability of hiring the best candidate.

The optimal approach for this variant is a "stopping rule": interview and evaluate the first k candidates without hiring any of them, and then hire the first candidate that is better than all the previous candidates. The value of k is chosen to maximize the probability that the hired candidate is the best one overall.

That probability can be calculated as follows:

* Let n be the total number of candidates provided by the employment agency, arriving in random order.
* The best candidate is at position i with probability 1/n.
* If i > k, the rule hires that candidate exactly when the best of the first i - 1 candidates is among the first k, which happens with probability k/(i - 1).
* Summing over i = k + 1, ..., n gives a success probability of (k/n) * (1/k + 1/(k+1) + ... + 1/(n-1)), which is approximately (k/n) * ln(n/k).
* This expression is maximized at k = n/e, where e is the mathematical constant equal to approximately 2.71828. Therefore, the optimal approach is to interview and evaluate the first k = n/e candidates and then hire the first candidate that is better than all the previous candidates.

This approach hires the best candidate with a probability of approximately 1/e (about 37%), and for large n no rule based only on comparing candidates does better. Because it hires at most once, it also avoids the repeated hiring and firing costs of the original strategy.

Note: The optimal stopping rule is based on mathematical analysis of the problem, and is derived using techniques such as probability theory and calculus. The rule is not guaranteed to always produce the best result, but in the long run, it is the most effective strategy for hiring the best candidate when only one hire is allowed.
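
As a numerical check of the n/e cutoff, the sketch below evaluates the exact success probability of the rule "skip the first k, then hire the first candidate better than all before" for every k and picks the best one. It assumes distinct candidates in uniformly random order; `success_probability` is a hypothetical helper name.

```python
import math

def success_probability(n, k):
    """Exact chance of hiring the best of n candidates when the first k are skipped."""
    if k == 0:
        return 1 / n  # the first candidate is hired; they are the best with probability 1/n
    # The best candidate sits at position i (k < i <= n) with probability 1/n and is
    # hired exactly when the best of the first i-1 lies among the first k: k/(i-1).
    return (k / n) * sum(1 / (i - 1) for i in range(k + 1, n + 1))

n = 100
best_k = max(range(n), key=lambda k: success_probability(n, k))
print(best_k, round(n / math.e, 2), round(success_probability(n, best_k), 4))
```

For n = 100 the best cutoff is 37, next to 100/e which is about 36.79, with a success probability of about 0.371.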