# Compositional statistics development

This notebook is for prototyping the implementation of compositional statistics for the Composition class.


In [3]:
import numpy as np
from AtomicEmbeddings.composition import CompositionalEmbedding

CsPbI3_magpie = CompositionalEmbedding(formula="CsPbI3", embedding="magpie")

The statistics we want to implement are:

* Weighted mean: $\bar{x} = \sum_{i=1}^n w_i^{*} x_i$, where $w_i^{*} = w_i / \sum_{j=1}^n w_j$ are the normalized weights
* Weighted sum: $\sum_{i=1}^n w_i x_i$
* Weighted variance: $s^2 = \sum_{i=1}^n w_i^{*} (x_i - \bar{x})^2$
* Min-pooling: $\min_{i=1}^n x_i$
* Max-pooling: $\max_{i=1}^n x_i$

If we consider a ternary compound, $A_aB_bC_c$, we can represent each element by a feature vector of dimension $N$, with components $f_{A,i}$, $f_{B,i}$ and $f_{C,i}$ ($i=1,\dots,N$).

The statistics can be represented as:

* Weighted mean: $f_{mean,i} = a^{*} f_{A,i} + b^{*} f_{B,i} + c^{*} f_{C,i}$
* Weighted sum: $f_{sum,i} = a f_{A,i} + b f_{B,i} + c f_{C,i}$
* Weighted variance: $f_{var,i} = a^{*} (f_{A,i} - f_{mean,i})^2 + b^{*} (f_{B,i} - f_{mean,i})^2 + c^{*} (f_{C,i} - f_{mean,i})^2$
* Min-pooling: $f_{min,i} = \min(f_{A,i}, f_{B,i}, f_{C,i})$
* Max-pooling: $f_{max,i} = \max(f_{A,i}, f_{B,i}, f_{C,i})$

where $a^{*} = \frac{a}{a+b+c}$, $b^{*} = \frac{b}{a+b+c}$ and $c^{*} = \frac{c}{a+b+c}$, denoting the normalized stoichiometry of the compound.
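As a quick check of these definitions, the following minimal sketch evaluates the five statistics for a single feature of a hypothetical ternary compound. The feature values and stoichiometric coefficients are made-up numbers chosen for easy arithmetic; they are not taken from any embedding.

```python
# Hypothetical single-feature values for elements A, B and C (illustrative only)
f_A, f_B, f_C = 1.0, 2.0, 4.0
# Hypothetical stoichiometric coefficients a, b, c
a, b, c = 1, 1, 3

# Normalized stoichiometry a*, b*, c*
total = a + b + c
a_n, b_n, c_n = a / total, b / total, c / total

# The five statistics, written exactly as in the formulas above
f_sum = a * f_A + b * f_B + c * f_C
f_mean = a_n * f_A + b_n * f_B + c_n * f_C
f_var = (
    a_n * (f_A - f_mean) ** 2
    + b_n * (f_B - f_mean) ** 2
    + c_n * (f_C - f_mean) ** 2
)
f_min = min(f_A, f_B, f_C)
f_max = max(f_A, f_B, f_C)

print(f_sum, f_mean, f_var, f_min, f_max)
```

Up to floating-point rounding, the sum is 15, the mean 3 and the variance 1.6 for these inputs.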

## A matrix representation of the element features

We can represent the element features as a matrix, $F$, of dimension $3 \times N$, where $N$ is the number of features and 3 is the number of elements. The matrix is defined as:

$F = \begin{bmatrix} f_{A,1} & f_{A,2} & \cdots & f_{A,N} \\ f_{B,1} & f_{B,2} & \cdots & f_{B,N} \\ f_{C,1} & f_{C,2} & \cdots & f_{C,N} \end{bmatrix}$

## A matrix representation of the stoichiometry

We can represent the stoichiometry as a matrix, $S$, of dimension $1 \times 3$, where 3 is the number of elements. The matrix is defined as:

$S = \begin{bmatrix} a & b & c \end{bmatrix}$

We can represent the normalized stoichiometry as a matrix, $S^{*}$, of dimension $1 \times 3$, where 3 is the number of elements. The matrix is defined as:

$S^{*} = \begin{bmatrix} a^{*} & b^{*} & c^{*} \end{bmatrix}$
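A minimal sketch of how $S^{*}$ follows from $S$, assuming the stoichiometry is held in a plain NumPy array (the array names `S` and `S_norm` are illustrative, not part of the package):

```python
import numpy as np

# Stoichiometry S for a compound with coefficients a=1, b=1, c=3
S = np.array([1.0, 1.0, 3.0])

# Normalized stoichiometry S*: divide each coefficient by the total
S_norm = S / S.sum()

print(S_norm)
```

The entries of `S_norm` sum to one, which is what makes $S^{*} \cdot F$ a weighted mean rather than a weighted sum.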

## Implementing the statistics

We can implement the statistics using `numpy`.

## A matrix representation of the weighted mean

We can represent the weighted mean as a matrix, $F_{mean}$, of dimension $1 \times N$, where $N$ is the number of features. The matrix is defined as:

$F_{mean} = \begin{bmatrix} f_{mean,1} & f_{mean,2} & \cdots & f_{mean,N} \end{bmatrix}$

This matrix can be calculated as:

$F_{mean} = S^{*} \cdot F$

In [8]:
# Create the matrix of element embeddings

n = len(CsPbI3_magpie.fractional_composition)
m = len(CsPbI3_magpie.embedding.embeddings["H"])
el_matrix = np.zeros(shape=(n, m))
for i, k in enumerate(CsPbI3_magpie.fractional_composition.keys()):
    el_matrix[i] = CsPbI3_magpie.embedding.embeddings[k]

print(f" We have {n} elements in the formula and {m} features per element.")
print(f" The shape of the element matrix is {el_matrix.shape}")
print(el_matrix)

 We have 3 elements in the formula and 21 features per element.
 The shape of the element matrix is (3, 21)
[[  5.        132.9054519 301.59        1.          6.        244.
    0.79        1.          0.          0.          0.          1.
    1.          0.          0.          0.          1.        115.765
    0.          0.        229.       ]
 [ 81.        207.2       600.61       14.          6.        146.
    2.33        2.          2.         10.         14.         28.
    0.          4.          0.          0.          4.         28.11
    0.          0.        225.       ]
 [ 96.        126.90447   386.85       17.          5.        139.
    2.66        2.          5.         10.          0.         17.
    0.          1.          0.          0.          1.         43.015
    1.062       0.         64.       ]]


In [11]:
# We can calculate the weighted mean feature vector by taking the dot product of the fractional composition and the element matrix

# Get the stoichiometric vector
stoich_vector = np.array(list(CsPbI3_magpie.fractional_composition.values()))
print(f" The stoichiometric vector is {stoich_vector}")
mean_vector = np.dot(stoich_vector, el_matrix)
print(f" The mean vector is \n {mean_vector}")

 The stoichiometric vector is [0.2 0.2 0.6]
 The mean vector is 
 [7.48000000e+01 1.44163772e+02 4.12550000e+02 1.32000000e+01
 5.40000000e+00 1.61400000e+02 2.22000000e+00 1.80000000e+00
 3.40000000e+00 8.00000000e+00 2.80000000e+00 1.60000000e+01
 2.00000000e-01 1.40000000e+00 0.00000000e+00 0.00000000e+00
 1.60000000e+00 5.45840000e+01 6.37200000e-01 0.00000000e+00
 1.29200000e+02]


In [16]:
# We can also use numpy.average to calculate the weighted mean
mean_vector_2 = np.average(el_matrix, axis=0, weights=stoich_vector)
print(f" The mean vector is \n {mean_vector_2}")

print(mean_vector == mean_vector_2)

 The mean vector is 
 [7.48000000e+01 1.44163772e+02 4.12550000e+02 1.32000000e+01
 5.40000000e+00 1.61400000e+02 2.22000000e+00 1.80000000e+00
 3.40000000e+00 8.00000000e+00 2.80000000e+00 1.60000000e+01
 2.00000000e-01 1.40000000e+00 0.00000000e+00 0.00000000e+00
 1.60000000e+00 5.45840000e+01 6.37200000e-01 0.00000000e+00
 1.29200000e+02]
[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True]


In [18]:
# Time the two methods
%timeit np.dot(stoich_vector, el_matrix)
%timeit np.average(el_matrix, axis=0, weights=stoich_vector)

1.24 µs ± 16.6 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
32.2 µs ± 399 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


For this 3 × 21 matrix, `np.dot()` is about 26 times faster than `np.average()` (1.24 µs versus 32.2 µs per call). As such, we will use `np.dot()` to calculate the weighted mean.
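The element-wise `==` check used earlier happens to return `True` everywhere, but exact equality between floating-point results from two different code paths is not guaranteed in general. A tolerance-based comparison is a safer sketch of the same check. The small matrix below reuses two columns of the element matrix printed earlier; the variable names are illustrative.

```python
import numpy as np

# Two feature columns taken from the element matrix printed earlier
F = np.array([[5.0, 0.79], [81.0, 2.33], [96.0, 2.66]])
w = np.array([0.2, 0.2, 0.6])  # fractional composition

mean_dot = np.dot(w, F)
mean_avg = np.average(F, axis=0, weights=w)

# Compare within a floating-point tolerance instead of exact equality
same = np.allclose(mean_dot, mean_avg)
print(same)
```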

## A matrix representation of the weighted sum

We can represent the weighted sum as a matrix, $F_{sum}$, of dimension $1 \times N$, where $N$ is the number of features. The matrix is defined as:

$F_{sum} = \begin{bmatrix} f_{sum,1} & f_{sum,2} & \cdots & f_{sum,N} \end{bmatrix}$

This matrix can be calculated as:

$F_{sum} = S \cdot F$


In [20]:
# We can calculate the weighted sum feature vector by taking the dot product of the stoichiometric vector and the element matrix

stoich_vector_unweighted = np.array(list(CsPbI3_magpie.composition.values()))
print(f" The stoichiometric vector is {stoich_vector_unweighted}")

sum_vector = np.dot(stoich_vector_unweighted, el_matrix)
print(f" The sum vector is \n {sum_vector}")

 The stoichiometric vector is [1. 1. 3.]
 The sum vector is 
 [3.74000000e+02 7.20818862e+02 2.06275000e+03 6.60000000e+01
 2.70000000e+01 8.07000000e+02 1.11000000e+01 9.00000000e+00
 1.70000000e+01 4.00000000e+01 1.40000000e+01 8.00000000e+01
 1.00000000e+00 7.00000000e+00 0.00000000e+00 0.00000000e+00
 8.00000000e+00 2.72920000e+02 3.18600000e+00 0.00000000e+00
 6.46000000e+02]
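Because $S = (a+b+c)\,S^{*}$, the weighted sum is simply the weighted mean scaled by the total number of atoms in the formula unit. The sketch below checks this on two columns of the element matrix printed earlier; the variable names are illustrative.

```python
import numpy as np

# Two feature columns taken from the element matrix printed earlier
F = np.array([[5.0, 0.79], [81.0, 2.33], [96.0, 2.66]])
S = np.array([1.0, 1.0, 3.0])  # stoichiometry of CsPbI3
S_norm = S / S.sum()           # normalized stoichiometry

sum_vector = S @ F
mean_vector = S_norm @ F

# The sum equals the mean multiplied by the total stoichiometry (5 here)
consistent = np.allclose(sum_vector, S.sum() * mean_vector)
print(sum_vector, consistent)
```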


## A matrix representation of the weighted variance

We can represent the weighted variance as a matrix, $F_{var}$, of dimension $1 \times N$, where $N$ is the number of features. The matrix is defined as:

$F_{var} = \begin{bmatrix} f_{var,1} & f_{var,2} & \cdots & f_{var,N} \end{bmatrix}$

This matrix can be calculated as:

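$F_{var} = S^{*} \cdot (F - F_{mean})^2$

Here $F_{mean}$ is subtracted from each row of $F$, and the square is applied element-wise rather than as a matrix product.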


In [21]:
# We can calculate the weighted variance feature vector by
# 1. Subtracting the mean vector from each element embedding
# 2. Squaring the result
# 3. Taking the dot product of the squared difference and the stoichiometric vector

# 1. Subtract the mean vector from each element embedding
el_matrix_mean_subtracted = el_matrix - mean_vector

# 2. Square the result
el_matrix_mean_subtracted_squared = el_matrix_mean_subtracted**2

# 3. Take the dot product of the squared difference and the stoichiometric vector
var_vector = np.dot(stoich_vector, el_matrix_mean_subtracted_squared)
print(f" The variance vector is \n {var_vector}")

 The variance vector is 
 [1.25176000e+03 9.98793266e+02 9.93203104e+03 3.85600000e+01
 2.40000000e-01 1.71304000e+03 5.27560000e-01 1.60000000e-01
 4.24000000e+00 1.60000000e+01 3.13600000e+01 7.44000000e+01
 1.60000000e-01 1.84000000e+00 0.00000000e+00 0.00000000e+00
 1.44000000e+00 9.69102544e+02 2.70682560e-01 0.00000000e+00
 6.37816000e+03]
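As a cross-check, the same weighted variance can be obtained from the identity $s^2 = \overline{x^2} - \bar{x}^2$, which avoids building the mean-subtracted matrix. This is only a sketch on two columns of the element matrix printed earlier, with illustrative variable names; the direct form is usually preferable numerically when the mean is large compared with the spread.

```python
import numpy as np

# Two feature columns taken from the element matrix printed earlier
F = np.array([[5.0, 0.79], [81.0, 2.33], [96.0, 2.66]])
w = np.array([0.2, 0.2, 0.6])  # fractional composition

mean_vector = w @ F

# Direct form: weighted mean of squared deviations (broadcast over rows)
var_direct = w @ (F - mean_vector) ** 2

# Moment form: weighted mean of squares minus the squared weighted mean
var_moments = w @ F**2 - mean_vector**2

# The weighted standard deviation follows as the square root
std_vector = np.sqrt(var_direct)

print(var_direct, var_moments)
```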


## A matrix representation of the min-pooling

We can represent the min-pooling as a matrix, $F_{min}$, of dimension $1 \times N$, where $N$ is the number of features. The matrix is defined as:

$F_{min} = \begin{bmatrix} f_{min,1} & f_{min,2} & \cdots & f_{min,N} \end{bmatrix}$

This matrix can be calculated as:

$F_{min} = \min(F)$, where the minimum is taken over the elements (the rows of $F$) separately for each feature.


In [22]:
# We can calculate the min-pooled feature vector by taking the minimum of each column of the element matrix (no weighting is involved)

min_vector = np.min(el_matrix, axis=0)
print(f" The min vector is \n {min_vector}")

 The min vector is 
 [  5.      126.90447 301.59      1.        5.      139.        0.79
   1.        0.        0.        0.        1.        0.        0.
   0.        0.        1.       28.11      0.        0.       64.     ]


## A matrix representation of the max-pooling

We can represent the max-pooling as a matrix, $F_{max}$, of dimension $1 \times N$, where $N$ is the number of features. The matrix is defined as:

$F_{max} = \begin{bmatrix} f_{max,1} & f_{max,2} & \cdots & f_{max,N} \end{bmatrix}$

This matrix can be calculated as:

$F_{max} = \max(F)$, where the maximum is taken over the elements (the rows of $F$) separately for each feature.


In [23]:
# We can calculate the max-pooled feature vector by taking the maximum of each column of the element matrix (no weighting is involved)

max_vector = np.max(el_matrix, axis=0)
print(f" The max vector is \n {max_vector}")

 The max vector is 
 [ 96.    207.2   600.61   17.      6.    244.      2.66    2.      5.
  10.     14.     28.      1.      4.      0.      0.      4.    115.765
   1.062   0.    229.   ]
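The five statistics prototyped so far could be gathered into one function. The sketch below is one possible arrangement, not the package's actual API: the function name `composition_stats`, its arguments and the dictionary keys are all hypothetical.

```python
import numpy as np


def composition_stats(el_matrix, stoich):
    """Hypothetical helper: return the five statistics for one compound.

    el_matrix: array of shape (n_elements, n_features)
    stoich: unnormalized stoichiometric coefficients, length n_elements
    """
    el_matrix = np.asarray(el_matrix, dtype=float)
    stoich = np.asarray(stoich, dtype=float)
    frac = stoich / stoich.sum()  # normalized stoichiometry
    mean = frac @ el_matrix
    return {
        "mean": mean,
        "sum": stoich @ el_matrix,
        "variance": frac @ (el_matrix - mean) ** 2,
        "minpool": el_matrix.min(axis=0),
        "maxpool": el_matrix.max(axis=0),
    }


# Two feature columns taken from the element matrix printed earlier
F = np.array([[5.0, 0.79], [81.0, 2.33], [96.0, 2.66]])
stats = composition_stats(F, [1, 1, 3])

# Concatenate the statistics into a single compound-level feature vector
feature_vector = np.concatenate(list(stats.values()))
print(feature_vector.shape)
```

Returning a dictionary keeps each statistic addressable by name, while the concatenation gives a flat vector of length 5 × N for downstream models.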


## Other statistics

We can also implement other statistics, such as the median, mode, standard deviation, etc. However, these statistics are not as useful as the ones listed above. As such, we will not implement them.

Two further statistics, the geometric mean and the harmonic mean, can be represented as:

* Geometric mean: $\sqrt[n]{\prod_{i=1}^n x_i}$
* Harmonic mean: $\frac{n}{\sum_{i=1}^n \frac{1}{x_i}}$

For our ternary compound, the stoichiometry-weighted geometric and harmonic means can be represented as:

* Geometric mean: $f_{gmean,i}=\sqrt[a+b+c]{f_{A,i}^{a} \cdot f_{B,i}^{b} \cdot f_{C,i}^{c}}$
* Harmonic mean: $f_{hmean,i}=\frac{a+b+c}{\frac{a}{f_{A,i}} + \frac{b}{f_{B,i}} + \frac{c}{f_{C,i}}}$
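A minimal sketch of these two means in NumPy, assuming strictly positive feature values. That assumption matters: the element matrix printed earlier contains zeros, for which the geometric mean collapses to zero and the harmonic mean involves division by zero. The example therefore uses only two strictly positive columns, and the variable names are illustrative.

```python
import numpy as np

# Two strictly positive feature columns from the element matrix printed earlier
F = np.array([[5.0, 0.79], [81.0, 2.33], [96.0, 2.66]])
S = np.array([1.0, 1.0, 3.0])  # stoichiometry a, b, c
S_norm = S / S.sum()           # normalized stoichiometry

# Weighted geometric mean: exponential of the weighted mean of the logs,
# equivalent to the (a+b+c)-th root of the product of f^a, f^b, f^c
gmean_vector = np.exp(S_norm @ np.log(F))

# Weighted harmonic mean: reciprocal of the weighted mean of the reciprocals
hmean_vector = 1.0 / (S_norm @ (1.0 / F))

# Weighted arithmetic mean, for comparison
mean_vector = S_norm @ F

print(gmean_vector, hmean_vector)
```

For positive inputs the three means are ordered as harmonic ≤ geometric ≤ arithmetic, which gives a simple sanity check on any implementation.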