## Hellinger transformation 

The Hellinger transformation is recommended for count data as a means of standardization<sup>1,2</sup> and scaling. It allows count data, where observations within a row are not independent, to be passed to PCA and other methods that are not otherwise suitable for raw counts. It is not perfect, but it is preferred to column-based scaling techniques because the structure of the row is important for count data<sup>3</sup>. Essentially, each count is divided by its row total and the square root of the result is taken, which downweights abundant species and gives better representation to the tail of less dominant species. Results are identical to the `decostand` implementation (`method = "hellinger"`) in vegan (R package).
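
Written out, the transformation described above can be stated as follows. The symbols are notation chosen here for illustration: $y_{ij}$ is the raw count of species $j$ in observation (row) $i$, $y_{i+}$ is the total count of row $i$, and $p$ is the number of species.

```latex
y'_{ij} = \sqrt{\frac{y_{ij}}{y_{i+}}},
\qquad
y_{i+} = \sum_{j=1}^{p} y_{ij}
```

For example, a row with counts (10, 5, 1) has a total of 16 and becomes approximately (0.791, 0.559, 0.250). The squared transformed values of any row with a non-zero total sum to 1.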

**Requirements**

A dataframe of raw count data, with one observation per row and one species per column (species labels as the column headers). The dataframe is assumed to have been processed beforehand to remove N/A and other non-numeric flags from its body.
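
The prior processing is not shown in this notebook. One possible way to do it is sketched below. The helper name `clean_counts`, the example values, and the choice to drop (rather than fill) incomplete or all-zero rows are assumptions made for illustration, not part of the original workflow.

```python
import pandas as pd

def clean_counts(raw):
    """Hypothetical helper: coerce count columns to numbers and drop unusable rows."""
    # Turn any non-numeric flag (e.g. 'N/A') into NaN.
    numeric = raw.apply(pd.to_numeric, errors="coerce")
    # Drop observations (rows) that contain a missing value.
    numeric = numeric.dropna(axis=0, how="any")
    # Drop observations whose row total is zero, since 0/0 is undefined.
    return numeric.loc[numeric.sum(axis=1) > 0]

# Made-up counts: row 1 holds a text flag, row 2 is all zeros.
raw = pd.DataFrame({"taxa_1": [10, "N/A", 0], "taxa_2": [5, 3, 0]})
cleaned = clean_counts(raw)
print(cleaned)
```

Whether dropping rows is appropriate depends on the study design; filling or flagging them may be preferable in some cases.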


<sup>1</sup>Legendre, P. & Legendre, L. (2012) *Numerical Ecology* (3rd edition). Developments in Environmental Modelling, vol. **24**, pp. 1-990. Elsevier, Amsterdam.  
<sup>2</sup>Legendre, P. & Gallagher, E. D. (2001) Ecologically meaningful transformations for ordination of species data. *Oecologia*, **129**:271-280.  
<sup>3</sup>Borcard, D., Gillet, F., & Legendre, P. (2011) *Numerical Ecology with R*. pp. 1-306, Springer, Dordrecht.

In [1]:
import numpy as np
import pandas as pd

In [84]:
df = pd.read_csv('C:/../taxa_matrix.csv')
df.head()

| | Depth | taxa_1 | Taxa_2 | Taxa_3 |
|---|---|---|---|---|
| 0 | Species_1 | 10 | 5 | 1 |
| 1 | Species_2 | 18 | 3 | 3 |
| 2 | Species_3 | 10 | 5 | 0 |


In [65]:
## Split the taxa labels off from the df
taxa = df.iloc[:,0:1] #these are id_rows
data = df.iloc[:,1:] #these are the data

In [85]:
data.head()

| | taxa_1 | Taxa_2 | Taxa_3 |
|---|---|---|---|
| 0 | 10 | 5 | 1 |
| 1 | 18 | 3 | 3 |
| 2 | 10 | 5 | 0 |


In [79]:
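def hellinger_transformation(id_rows, df):
    ''' Return the Hellinger-transformed counts with the label columns re-attached.
        id_rows --> this is the first column or columns and is a separate df.
        df --> these columns are the counts (with the column headers) of the data matrix.'''
    # Row totals: one sum per observation.
    margin_total = df.sum(axis=1)
    # Divide each count by its row total, then take the square root.
    # A row whose total is 0 yields NaN (0/0).
    transform = np.sqrt(df.div(margin_total, axis=0))
    new_df = pd.concat([id_rows, transform], axis=1)
    return new_df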
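
A quick way to sanity-check the function is sketched below. It calls the function on the three rows shown in the output above and compares the result with a direct NumPy computation. The function body is copied from the cell above so that the block runs on its own; the variable names `result`, `row_check`, and `expected` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

def hellinger_transformation(id_rows, df):
    # Copied from the cell above so this block is self-contained.
    margin_total = df.sum(axis=1)
    transform = df.apply(lambda x: np.sqrt(x / margin_total))
    return pd.concat([id_rows, transform], axis=1)

# The three rows displayed by df.head() earlier in the notebook.
taxa = pd.DataFrame({"Depth": ["Species_1", "Species_2", "Species_3"]})
data = pd.DataFrame({"taxa_1": [10, 18, 10],
                     "Taxa_2": [5, 3, 5],
                     "Taxa_3": [1, 3, 0]})

result = hellinger_transformation(taxa, data)

# Squared transformed values should sum to 1 in every row.
row_check = (result[data.columns] ** 2).sum(axis=1)

# Independent check: the same formula written directly in NumPy.
counts = data.to_numpy(dtype=float)
expected = np.sqrt(counts / counts.sum(axis=1, keepdims=True))

print(np.allclose(row_check, 1.0))
print(np.allclose(result[data.columns].to_numpy(), expected))
```

Both checks should print `True` for rows with a non-zero total. A row that sums to zero would produce NaN and fail the first check, which is one reason to filter such rows beforehand.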
