# Lab 1 - Hypothesis Testing

## Required dependencies

First, let us install the required packages: jmetalpy and scikit-posthocs.

In [None]:
%pip install jmetalpy
%pip install scikit-posthocs

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import scikit_posthocs as sp
from scipy import stats
from jmetal.core.observer import Observer
import logging
from jmetal.problem.singleobjective.unconstrained import Sphere
from jmetal.core.solution import FloatSolution
from jmetal.algorithm.singleobjective.local_search import LocalSearch
from jmetal.operator import SimpleRandomMutation
from jmetal.util.termination_criterion import StoppingByEvaluations

## Hypothesis Testing

Basic example - tossing a coin.

In [None]:
HEADS = 0
TAILS = 1

COIN_VALUES = [HEADS, TAILS]


def get_coin_tosses(number_of_tosses: int, probabilities: list[float] = [0.5, 0.5]) -> np.ndarray:
    return np.random.choice(COIN_VALUES, size=number_of_tosses, replace=True, p=probabilities)

Implement a function to `run_experiment`:

In [None]:
def run_experiment(number_of_trials: int, number_of_tosses: int, probabilities: list[float] = [0.5, 0.5]) -> list[float]:
    """Simulate a coin tossing experiment with `number_of_trials` trials and `number_of_tosses` coin tosses per trial.
    `probabilities` sets the probabilities of getting heads and tails, in that order; the default of 0.5 for both sides represents a fair coin.
    For each trial, compute the mean of the coin tosses (the observed proportion of TAILS) and return the list of means."""
    # TODO: Implement me!
    raise NotImplementedError()


def plot_experiment(tails_probabilities: list[float]) -> None:
    plt.hist(tails_probabilities, density=True)
    plt.xlabel("Mean probability")
    plt.ylabel("Density")
    plt.show()

Plot results of the experiment:

In [None]:
tails_probabilities = run_experiment(number_of_trials=10000, number_of_tosses=1000, probabilities=[0.5, 0.5])
plot_experiment(tails_probabilities)

In [None]:
tails_probabilities = run_experiment(number_of_trials=10000, number_of_tosses=1000, probabilities=[0.2, 0.8])
plot_experiment(tails_probabilities)

### Kruskal-Wallis test
We will implement a simplified version for 2 samples. <br>
#### Method:
1. Rank all data from all groups together; i.e., rank the data from 1 to N ignoring group membership. Assign any tied values the average of the ranks they would have received had they not been tied.
2. Calculate the test statistic:
$$ H = \frac{12}{N (N+1)}(n_1 \cdot \overline{r_1}^2 + n_2 \cdot \overline{r_2}^2) - 3(N+1) $$
Where:
- $n_1$ is the number of observations in $sample_1$ and $n_2$ is the number of observations in $sample_2$,
- $N = n_1 + n_2$ is the total number of observations across all samples,
- $\overline{r_1}$ is the average rank of all observations in $sample_1$ and $\overline{r_2}$ is the average rank of all observations in $sample_2$.
3. Finally, decide whether to reject the null hypothesis by comparing $H$ to a critical value $H_{c}$ obtained from a table or from software for a given significance (alpha) level. If $H$ is greater than $H_{c}$, the null hypothesis is rejected. If possible (no ties, sample not too big), one should compare $H$ to the critical value obtained from the exact distribution of $H$. Otherwise, the distribution of $H$ can be approximated by a chi-squared distribution with 1 degree of freedom (in general $g-1$ degrees of freedom, where $g$ is the number of samples).

Source: https://en.wikipedia.org/wiki/Kruskal%E2%80%93Wallis_one-way_analysis_of_variance
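As a worked check of the formula, consider two small samples without ties, for example $sample_1 = (1, 2, 3)$ and $sample_2 = (4, 6, 5)$ (illustrative values). The pooled ranks coincide with the values themselves, so $n_1 = n_2 = 3$, $N = 6$, $\overline{r_1} = 2$ and $\overline{r_2} = 5$:

$$ H = \frac{12}{6 \cdot 7}(3 \cdot 2^2 + 3 \cdot 5^2) - 3 \cdot 7 = \frac{12 \cdot 87}{42} - 21 = \frac{27}{7} \approx 3.857 $$

With the chi-squared approximation with 1 degree of freedom, this corresponds to a p-value of roughly $0.0495$.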

In [None]:
def kruskal_wallis(sample1: np.ndarray, sample2: np.ndarray) -> tuple[float, float]:
    # TODO: Implement me!
    # Use rankdata function from scipy for convenience
    all_samples = np.concatenate([sample1, sample2])
    # Tied values receive the average of the ranks they would otherwise occupy (step 1 above)
    ranks = stats.rankdata(all_samples, method="average")
    # H = ...
    # p_value = stats.chi2.sf(H, 1)
    # return H, p_value
    raise NotImplementedError()

### Null and alternative hypotheses
The null hypothesis ($H_0$) often represents either a skeptical perspective or a claim of “no difference” to be tested.
The alternative hypothesis ($H_A$) represents an alternative claim under consideration and is often represented by a range of possible values for the value of interest.

### P-value
The p-value is the probability of observing data at least as favorable to the alternative hypothesis as our current dataset, if the null hypothesis were true. We typically use a summary statistic of the data, such as a difference in proportions, to help compute the p-value and evaluate the hypotheses. This summary value that is used to compute the p-value is often called the test statistic.

Source: https://openintro-ims.netlify.app/foundations-randomization
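The definition of the p-value can be illustrated with a small simulation. The sketch below assumes a hypothetical observation of 60 tails in 100 tosses and the null hypothesis of a fair coin; the test statistic is the absolute deviation of the number of tails from its expected value. The observation, the number of simulated experiments and the variable names are all illustrative choices, not part of the lab.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

number_of_tosses = 100
observed_tails = 60  # hypothetical observation, chosen for illustration

# Simulate many experiments under H0 (fair coin): number of tails in each experiment.
simulated_tails = rng.binomial(n=number_of_tosses, p=0.5, size=100_000)

# Test statistic: absolute deviation from the expected number of tails under H0.
expected_tails = number_of_tosses * 0.5
observed_deviation = abs(observed_tails - expected_tails)

# p-value: fraction of simulated experiments at least as extreme as the observation.
simulated_p_value = float(np.mean(np.abs(simulated_tails - expected_tails) >= observed_deviation))

# Exact two-sided binomial test from scipy for comparison.
exact_p_value = float(stats.binomtest(observed_tails, n=number_of_tosses, p=0.5).pvalue)

print(f"simulated p-value: {simulated_p_value:.4f}")
print(f"exact p-value:     {exact_p_value:.4f}")
```

The two values should agree closely, which shows that the p-value is simply a tail probability of the test statistic computed under the null hypothesis.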

Test if your implementation is correct:

In [None]:
sample1 = np.array([1, 2, 3])
sample2 = np.array([4, 6, 5])
scipy_result = stats.kruskal(sample1, sample2)
custom_result = kruskal_wallis(sample1, sample2)

assert np.isclose(scipy_result.statistic, custom_result[0])
assert np.isclose(scipy_result.pvalue, custom_result[1])

Generate a sample for a fair coin and for a biased coin. Then check if we can detect the biased coin using the Kruskal-Wallis test.

In [None]:
number_of_trials = 1000
number_of_tosses = 100
fair_coin_probabilities = [0.5, 0.5]
biased_coin_probabilities = [0.49, 0.51] 
fair_coin_means = run_experiment(
    number_of_trials=number_of_trials, number_of_tosses=number_of_tosses, probabilities=fair_coin_probabilities
)
biased_coin_means = run_experiment(
    number_of_trials=number_of_trials, number_of_tosses=number_of_tosses, probabilities=biased_coin_probabilities
)
kruskal_wallis(fair_coin_means, biased_coin_means)

Q: How can we decide whether a coin is biased using `pvalue`?

### Fun fact

Fair coins tend to land on the same side they started: Evidence from 350,757 flips.

Source: https://arxiv.org/abs/2310.04153

#### Key takeaways:
- statistical testing is just another tool in your toolkit,
- we use statistical tests to check if we can reject a hypothesis - e.g. to check if samples originate from the same distribution,
- there are many statistical tests already implemented in scipy,
- many ML/statistical libraries provide statistical tests out of the box (e.g. `statsmodels.regression.linear_model.OLS`) - it's crucial to understand what p-values are,
- try to use statistical tests in your master's thesis (e.g. to check if the results of one ML algorithm are better than the baseline).

## jMetalPy

During this lab we will take a look at the jMetalPy platform, run the first simple heuristic algorithm (local search) and play with different methods of visualization and testing statistical significance.


jMetalPy provides implementations of many optimization problems - right now we will focus on `Sphere`.

![Alt text](https://www.sfu.ca/~ssurjano/spheref.png)

Source: https://www.sfu.ca/~ssurjano/spheref.html

`Sphere` is defined by a formula:
$$ f(x) = \sum_{i=1}^{d} x_i^2 $$

Global Minimum:
$ f(x^*) = 0 $, at $x^* = (0, ..., 0)$

We can check it for $d=2$ and $x = [1, 2]$. The result should be equal to:
$$ f(x) = 1^2 + 2^2 = 5$$
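Independently of jMetalPy, the formula can be checked with a few lines of NumPy. This is only a sketch: `sphere_value` is a hypothetical helper name introduced here, not a jMetalPy function.

```python
import numpy as np


def sphere_value(x) -> float:
    """Evaluate the Sphere function: the sum of squared coordinates."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(x ** 2))


print(sphere_value([1.0, 2.0]))  # value at x = [1, 2]
print(sphere_value([0.0, 0.0]))  # value at the global minimum
```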

In [None]:
logger = logging.getLogger('jmetal.core.algorithm')
logger.setLevel(logging.INFO)

In [None]:
problem = Sphere(2)
solution = FloatSolution(problem.lower_bound, problem.upper_bound, problem.number_of_objectives())
solution.variables = [1.0, 2.0]
assert problem.evaluate(solution).objectives[0] == 5.0

`LocalSearch` works like this:
- start with one random point from the domain (sampled with uniform distribution), called `solution`,
- in each step, apply mutation to the `solution`,
- if the fitness value after mutation is better than before, replace `solution` with the mutated `solution`.
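The steps above can be sketched in plain NumPy. This is a simplified illustration and not jMetalPy's actual implementation: `local_search_sketch` and its parameters are hypothetical names, the domain bounds of -5.12 and 5.12 are an assumption, and the mutation simply resamples each coordinate uniformly with a given probability.

```python
import numpy as np


def sphere(x: np.ndarray) -> float:
    """Sphere objective: the sum of squared coordinates."""
    return float(np.sum(x ** 2))


def local_search_sketch(
    dimension: int = 10,
    max_evaluations: int = 100,
    mutation_probability: float = 0.1,
    lower: float = -5.12,  # assumed domain bounds
    upper: float = 5.12,
    seed: int = 0,
) -> tuple[np.ndarray, float, list[float]]:
    rng = np.random.default_rng(seed)

    # Step 1: start from one point sampled uniformly from the domain.
    solution = rng.uniform(lower, upper, size=dimension)
    fitness = sphere(solution)
    history = [fitness]

    for _ in range(max_evaluations - 1):
        # Step 2: mutate by resampling each coordinate with the given probability.
        mask = rng.random(dimension) < mutation_probability
        candidate = np.where(mask, rng.uniform(lower, upper, size=dimension), solution)
        candidate_fitness = sphere(candidate)

        # Step 3: keep the mutated point only if it improves the fitness (minimization).
        if candidate_fitness < fitness:
            solution, fitness = candidate, candidate_fitness
        history.append(fitness)

    return solution, fitness, history


best_solution, best_fitness, history = local_search_sketch()
print("Best fitness:", best_fitness)
```

Because a candidate is accepted only when it improves the fitness, the recorded `history` never increases from one step to the next.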

In [None]:
# Problem definition and Local Search parameters:
PROBLEM = Sphere(10)
MAX_EVALUATIONS = 100
MUTATION = SimpleRandomMutation(1.0 / PROBLEM.number_of_variables())
TERMINATION_CRITERION = StoppingByEvaluations(max_evaluations=MAX_EVALUATIONS)


def get_local_search(max_evaluations: int = MAX_EVALUATIONS) -> LocalSearch:
    return LocalSearch(
        problem=PROBLEM,
        mutation=MUTATION,
        termination_criterion=StoppingByEvaluations(max_evaluations=max_evaluations),
    )


algorithm = get_local_search()
algorithm.run()
result = algorithm.get_result()

print("Algorithm: " + algorithm.get_name())
print("Problem: " + PROBLEM.name())
print("Solution: " + str(result.variables))
print("Fitness:  " + str(result.objectives[0]))
print("Computing time: " + str(algorithm.total_computing_time))

This is, of course, a heuristic algorithm, so let us run it many times and show the dispersion of the outcome.


In [None]:
STEPS = 30
results = []
for _ in range(STEPS):
    algorithm = get_local_search()
    algorithm.run()
    result = algorithm.get_result().objectives[0]
    results.append(result)

plt.boxplot(results)
plt.ylabel("Fitness")
plt.show()

It would be nice to see the progress of the algorithm. First, let us take a look at a toy: a progress bar.

In [None]:
from jmetal.util.observer import ProgressBarObserver

max_evaluations = 100000
algorithm = get_local_search(max_evaluations=max_evaluations)
basic = ProgressBarObserver(max=max_evaluations)
algorithm.observable.register(observer=basic)

algorithm.run()

However, it would be better to take a closer look at what is happening in the algorithm: what the value of the best fitness is in every step, or at certain moments in time.

In [None]:
class DataObserver(Observer):

    def __init__(self, frequency: float = 1.0) -> None:
        """ Record the best fitness every `frequency` evaluations.
        :param frequency: Recording frequency (in evaluations). """
        self.display_frequency = frequency
        self.data = []

    def update(self, *args, **kwargs):
        evaluations = kwargs['EVALUATIONS']
        solutions = kwargs['SOLUTIONS']

        if (evaluations % self.display_frequency) == 0 and solutions:
            if isinstance(solutions, list):
                fitness = solutions[0].objectives
            else:
                fitness = solutions.objectives
            self.data.append(fitness[0])

algorithm = get_local_search()
dataobserver = DataObserver(frequency=1.0)
algorithm.observable.register(observer=dataobserver)

algorithm.run()

print(len(dataobserver.data))

Plot results:

In [None]:
plt.plot(dataobserver.data)
plt.show()

Now let us take a proper look at the progress of the algorithm. Let us repeat it many times, and draw an appropriate box and whiskers plot for every moment in time, observing the best fitness.

In [None]:
all_data = []
for _ in range(STEPS):
    algorithm = get_local_search()
    dataobserver = DataObserver(frequency=1.0)
    algorithm.observable.register(observer=dataobserver)
    algorithm.run()
    all_data.append(dataobserver.data)

numpy_array = np.array(all_data)
transpose = numpy_array.T
transpose_list = transpose.tolist()

Plot results:

In [None]:
fig = plt.figure(figsize=(15, 10))
ax = fig.add_subplot(111)
bp = ax.boxplot(transpose_list)
x_ticks = np.arange(0, MAX_EVALUATIONS, 10)
plt.xticks(x_ticks, x_ticks)
plt.show()

### Statistical tests
A Kruskal-Wallis test is used for testing whether (two or more) samples originate from the same distribution.
If the results of a Kruskal-Wallis test are statistically significant, then it’s appropriate to conduct Dunn’s Test to determine exactly which groups are different. <br>
Example: https://www.statology.org/dunns-test/

In [None]:
first_epoch_population = transpose_list[0]
second_epoch_population = transpose_list[1]
last_epoch_population = transpose_list[-1]  # last recorded step, regardless of the number of steps

# Perform Kruskal-Wallis Test
print(stats.kruskal(first_epoch_population, second_epoch_population, last_epoch_population))


# Perform Dunn test
sp.posthoc_dunn([first_epoch_population, second_epoch_population, last_epoch_population], p_adjust="holm")

Q: Which samples originate from different distributions?

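The Wilcoxon test is another tool to check if there are significant differences between samples. It can be used only for two groups. Note that `scipy.stats.wilcoxon` implements the Wilcoxon signed-rank test, which expects two paired samples of equal length (here, the values from the same runs at two different steps).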

In [None]:
# Perform Wilcoxon Test
print(stats.wilcoxon(first_epoch_population, second_epoch_population))

### TODO:
#### Exercise 1.
Simulate tossing a coin for:
- fair coin (probability 50/50)
- unfair coin (e.g. 20/80)
- unfair coin (e.g. 90/10)
- two more unfair coins, each quite close to one of the previous ones (e.g. 51/49 and 24/76).


Please run the experiments for the coin tossers (`number_of_trials=30`, `number_of_tosses=10`) and draw box and whisker plots to compare samples from different coins.
Apply the relevant statistical hypothesis testing tools (Kruskal-Wallis + Dunn) to check whether the cumulative distribution functions of the samples coming from the experiments differ. Choose the significance level first, then comment on how the result can be interpreted.


#### Exercise 2.
Investigate the source code of `scipy.stats.kruskal`. What are the differences between the implementation from `scipy` and our custom implementation?

#### Exercise 3.
What are the differences between Wilcoxon and Kruskal-Wallis tests?

#### Exercise 4. 
Run `LocalSearch` with another mutation operator (see https://jmetal.github.io/jMetalPy/api/operator/mutation.html). Compare the results.