## Transformation: From $[0,1]$ space to $\mathbb{R}$ space 

We want to assess whether the guests arriving before and after the *break-point* are representative of one another, and thereby whether there is a **structural-change** in our hotel bookings system. To do this we need to model proportions, `proportion_guests`, rather than counts. We expect the absolute number of guests to change after a potential **structural-change**, but if the proportions are the same, or at least similar, then we can still make inferences from our data on both sides of the change.

This is because similar proportions suggest that, whilst the absolute number of guests making hotel bookings has changed, the proportion of guests from each group, `region`, has not. Our data after the *break-point* is then still representative of the data before it, and hence any inferences still hold for the same population.

Put another way, the data before the *break-point* captures our **target population**. If the proportions/compositions of people from each `region` are similar after the *break-point*, then the later data is a **representative sample** of that **target population**.

However, we cannot model proportion/compositional data directly, because each value is bounded in the interval $[0,1]$. A model fitted to such data can produce values outside this interval, and these are meaningless because they cannot be interpreted as proportions.
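
As a minimal illustration of this risk, the sketch below fits a straight line to some made-up proportions for a single region and extrapolates two periods ahead. The numbers are hypothetical and are not taken from the hotel bookings data.

```python
import numpy as np

# hypothetical proportions for one region, falling steadily over five periods
t = np.arange(5)
proportion = np.array([0.30, 0.22, 0.15, 0.08, 0.02])

# fit a straight line directly on the bounded proportions
slope, intercept = np.polyfit(t, proportion, deg=1)

# extrapolate two periods ahead
forecast = slope * np.array([5, 6]) + intercept
print(forecast)  # both forecasts fall below 0, so they are not valid proportions
```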

Instead, we can transform our compositional data by mapping it into the real number space, $\mathbb{R}$. There are three well-characterised isomorphisms that do this:

- **Additive logratio (alr)**
- **Centre logratio (clr)**
- **Isometric logratio (ilr)**

*Source: [Wikipedia](https://en.wikipedia.org/wiki/Compositional_data)*
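
To make the first two transforms concrete, here is a rough NumPy sketch on a small made-up array, assuming every row sums to $1$ and contains no zeroes. The names `comp`, `alr_manual` and `clr_manual` are illustrative only. The ilr additionally needs an orthonormal basis, so it is not reproduced by hand here.

```python
import numpy as np

# hypothetical compositions: each row sums to 1 and has no zeroes
comp = np.array([[0.5, 0.3, 0.2],
                 [0.1, 0.6, 0.3]])

log_comp = np.log(comp)

# alr: log-ratio of each part to a chosen reference part (here, the last column)
alr_manual = log_comp[:, :-1] - log_comp[:, [-1]]

# clr: log-ratio of each part to the geometric mean of its row
clr_manual = log_comp - log_comp.mean(axis=1, keepdims=True)

print(alr_manual.shape)        # one fewer column than the input
print(clr_manual.sum(axis=1))  # each row sums to (approximately) zero
```

Note that the alr drops the reference column, whereas the clr keeps one column per part but constrains each row to sum to zero.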

Alternatively, we can apply the following transformation, after adding a very small value to any proportions that are equal to $0$. It is also commonly used to make data approximately **normally-distributed**.

*Source: [Feng et al., "Log-transformation and its implications for data analysis"](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4120293/)*

- **Log transform** 

In all these transformations, it is essential that our data does not contain any zeroes, because the logarithm of $0$ is undefined. We will therefore use a multiplicative replacement strategy to replace zeroes with a small, positive $\delta$, and do so in a way that ensures the compositions still add up to $1$.

*Source: [J.A. Martín-Fernández et al., "Dealing with Zeros and Missing Values in Compositional Data Sets Using Nonparametric Imputation"](https://link.springer.com/article/10.1023/A:1023866030544)*
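
The idea can be sketched in a few lines of NumPy. This is a simplified illustration under the assumption that each row already sums to $1$; `replace_zeros` is a hypothetical helper and is not the `skbio` implementation used below.

```python
import numpy as np

def replace_zeros(comp, delta):
    # hypothetical helper for illustration; assumes each row of `comp` sums to 1
    comp = np.asarray(comp, dtype=float)
    is_zero = comp == 0
    # total mass handed to the zero parts of each row
    added = delta * is_zero.sum(axis=1, keepdims=True)
    # zeroes become delta; non-zero parts are scaled down by (1 - added)
    return np.where(is_zero, delta, comp * (1 - added))

comp = np.array([[0.6, 0.4, 0.0],
                 [0.7, 0.0, 0.3]])
replaced = replace_zeros(comp, delta=0.01)
print(replaced)
print(replaced.sum(axis=1))  # rows still sum to 1
```

Because the non-zero parts are all scaled by the same factor, the ratios between them are unchanged, which is what the log-ratio transforms depend on.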

In [None]:
import pandas as pd
import numpy as np
from skbio.stats.composition import multiplicative_replacement
from skbio.stats.composition import clr
from skbio.stats.composition import ilr

# display multiple outputs in same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# put in a python script
# create custom additive log-ratio function
def func_alr(mat, div, col_names):
    """Additive log-ratio: log of every column of `mat` divided by column `div`."""
    mat = np.asarray(mat, dtype = float)

    # logs are only defined for strictly positive values, so zeroes must be replaced first
    if (mat <= 0).any():
        raise ValueError("`mat` must be strictly positive; replace zeroes before calling")

    # take vectors from input array `mat`, excluding column index, `div`
    numerator = np.delete(arr = mat, obj = div, axis = 1)
    # take vector for `div`
    denominator = mat[:, div]

    # take logs
    lnum = np.log(numerator)
    lden = np.log(denominator)

    # subtract log of `div` vector from the log of every other column in `mat`
    # https://stackoverflow.com/questions/26333005/numpy-subtract-every-row-of-matrix-by-vector
    output = lnum - lden[:, np.newaxis]

    # convert array output to dataframe
    output = pd.DataFrame(data = output, columns = col_names)

    return output

In [None]:
# pass in variable from other notebook
%store -r data_join

# pivot to wide format (one column per region) so we can apply transformations
data_pivot = data_join.pivot(index = 'arrival_date', columns = 'region', values = 'proportion_guests')

# replace NaNs with 0s so can transform
data_pivot = data_pivot.loc[:, 'Africa':'Oceania'].fillna(value = 0, axis = 1)

## un-groupby so we get previous grouped index as columns
data_pivot = data_pivot.reset_index()
data_pivot

In [None]:
x = data_pivot.loc[:, 'Africa':'Oceania']
# store column names for later when re-creating dataframe
col_names = list(x)
x
# apply multiplicative replacement strategy to replace 0s
x = multiplicative_replacement(x)
# note, returns an array
x

In [None]:
# get index of 'Oceania' column
index_denominator = col_names.index('Oceania')
# remove this index from list of column names
col_names.pop(index_denominator)

# apply ALR transformation
data_alr = func_alr(mat = x, div = index_denominator, col_names = col_names)

# add `arrival_date` back in (ASSUMES ROW ORDERING IS PRESERVED)
data_alr['arrival_date'] = data_pivot['arrival_date']

# unpivot
data_alr = data_alr.melt(id_vars = ['arrival_date'], var_name = 'region', value_name = 'alr_guests')

data_alr

In [None]:
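# clr returns one column per region, which cannot be assigned to a single column of the
# long-format `data_join`; label the columns and unpivot instead
data_clr = pd.DataFrame(data = clr(x), columns = list(data_pivot.loc[:, 'Africa':'Oceania']))
data_clr['arrival_date'] = data_pivot['arrival_date']
data_clr = data_clr.melt(id_vars = ['arrival_date'], var_name = 'region', value_name = 'clr_guests')

# ilr returns one fewer column than there are regions and its coordinates are not per-region, so keep it wide
data_ilr = pd.DataFrame(data = ilr(x)).assign(arrival_date = data_pivot['arrival_date'])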