# Fitting OSN-odorant EC50 distribution

Data: OSN dose-response curves for a panel of about $30$ odorants, from Si et al., *Neuron*, 2019: https://doi.org/10.1016/j.neuron.2018.12.030

Also analyzed by Kadakia and Emonet, *eLife*, 2019. 

## Description of the data and distribution


EC50s are expressed in units of relative concentration (dilution), so they are small numbers, typically between $10^{-7}$ and $10^{-3}$. Si et al. look at the distribution of

$$ x = 1/\mathrm{EC50} $$

which are, roughly speaking, the affinities $K^*$ in the model of Kadakia and Emonet, 2019. Since the affinities themselves follow a power law, it was not necessary to sample the dissociation constants $K_D = \mathrm{EC50}$, as Kadakia and Emonet do; they could have sampled the affinities directly.

Si et al., 2019, Figure 3D plots the complementary CDF (CCDF) of $x$, $G_X(x) \equiv 1 - F_X(x)$. Si et al. claim this is a power law with exponent $-0.42$, but this is based on the tail only, which contains few points. The full distribution, plotted in log-log scale, looks much closer to a repressive Hill function, that is

$$ G_X(x) = 1 - F_X(x) = \frac{1}{1 + A x^{\alpha}} $$

with the exponent $\alpha = 0.42$ accounting for the tail. Note that this is already properly normalized: it is the CCDF (one minus the integral of the PDF), so it remains between $0$ and $1$. Note also that the log-log plot in Fig. 3D does not suffer from bin-width normalization ambiguities: the quantity plotted is a probability, not a density, so the y scale is unambiguous and does not transform with a Jacobian when the x axis is rescaled.

The corresponding CDF is then an (activating) Hill function,

$$ F_X(x) = \frac{A x^{\alpha}}{1 + A x^{\alpha}}  \,\, .$$
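One way to see why a single tail exponent can be misleading: for the Hill form, the local log-log slope of the CCDF is $\mathrm{d} \ln G_X / \mathrm{d} \ln x = -\alpha F_X(x)$, so it is close to $0$ where $F_X$ is small and only approaches $-\alpha$ in the far tail. Below is a minimal numerical sketch of this; the values of `A_demo` and `alpha_demo` are illustrative assumptions, not fitted parameters.

```python
import numpy as np

def hill_ccdf_sketch(x, A, alpha):
    """Repressive Hill CCDF, G(x) = 1 / (1 + A x^alpha)."""
    return 1.0 / (1.0 + A * x**alpha)

# Illustrative values only (assumed, not the fitted parameters)
A_demo, alpha_demo = 1e-2, 0.42

x_demo = np.geomspace(1e-2, 1e14, 400)
log_g = np.log(hill_ccdf_sketch(x_demo, A_demo, alpha_demo))

# Numerical local slope of log G versus log x
slope_num = np.gradient(log_g, np.log(x_demo))
# Analytical slope: -alpha * F(x), with F = 1 - G
slope_ana = -alpha_demo * (1.0 - hill_ccdf_sketch(x_demo, A_demo, alpha_demo))

print("slope at smallest x:", slope_num[0])
print("slope at largest x:", slope_num[-1])
```

The slope interpolates smoothly between the two limits, so a straight line fitted to only the last few points of a log-log plot recovers $-\alpha$ but says nothing about the bulk of the distribution.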



To show that this Hill CDF is a better fit than a pure power law, I downloaded the table of EC50s from Si et al., 2019, and reanalyzed it. Below, I recompute the empirical CDF of $x = 1/\mathrm{EC50}$ and fit it with such a Hill function. The data (table of log10 EC50s) are available on GitHub at https://github.com/samuellab/Larval-ORN/blob/master/Figure3/results/log_10_EC50.csv
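As a minimal sketch of how an empirical CCDF of $x = 1/\mathrm{EC50}$ can be built from a table of $\log_{10}$ EC50s, here is the construction on a small made-up array. The numbers are hypothetical and the helper name `empirical_ccdf_from_log10ec50` is an assumption for illustration. NaN entries, which mark odorant-receptor pairs without a measured EC50, are dropped.

```python
import numpy as np

def empirical_ccdf_from_log10ec50(log10_ec50):
    """Return sorted x = 1/EC50 and the empirical CCDF P(X >= x)."""
    vals = np.asarray(log10_ec50, dtype=float).ravel()
    vals = vals[~np.isnan(vals)]          # drop missing EC50s
    x = np.sort(10.0 ** (-vals))          # x = 1/EC50, ascending
    n = x.size
    # The k-th smallest point (k = 0..n-1) has n - k points at or above it
    ccdf = (n - np.arange(n)) / n
    return x, ccdf

# Hypothetical log10(EC50) table with two missing entries
toy_log10_ec50 = np.array([[-4.0, np.nan, -6.0],
                           [np.nan, -5.0, -3.0]])
x_toy, ccdf_toy = empirical_ccdf_from_log10ec50(toy_log10_ec50)
print(x_toy)     # increasing values of 1/EC50
print(ccdf_toy)  # decreases from 1 to 1/n
```

Working with integer counts divided by `n` keeps the CCDF values exact multiples of `1/n`, which is convenient when taking logs afterwards since the last value is never zero.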

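## Useful properties

This type of CDF is easy to sample by inverse transform sampling. It is in fact the log-logistic (Fisk) distribution, with shape $\alpha$ and scale $A^{-1/\alpha}$. Specifically, if $r \sim U(0, 1)$ and we set $r = G_X(x)$, then

$$ x = G_X^{-1}(r) = \left( \frac{1/r - 1}{A} \right)^{1/\alpha} \,\, .$$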

Note that the CDF is valid for $x$ between $10^1$ and $10^9$; beyond that range, there is no data, but we do have $G_X(x) \approx 1$ for smaller $x$, $G_X(x) < 10^{-3}$ for larger $x$, so we can assume this distribution spans the full range of $x$. 
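A minimal sketch of inverse transform sampling for this Hill CCDF, checked against the analytical $G_X(x)$ at a few points. The parameter values and the names `sample_hill_ccdf`, `A_demo`, `alpha_demo` are illustrative assumptions, not the fitted values.

```python
import numpy as np

def sample_hill_ccdf(rng, size, A, alpha):
    """Draw samples with CCDF G(x) = 1 / (1 + A x^alpha) by inverse transform."""
    # 1 - random() lies in (0, 1], which avoids dividing by zero
    r = 1.0 - rng.random(size=size)
    return ((1.0 / r - 1.0) / A) ** (1.0 / alpha)

# Illustrative parameters (assumed, not the fitted values)
A_demo, alpha_demo = 1e-2, 0.42
rng_demo = np.random.default_rng(1234)
samples = sample_hill_ccdf(rng_demo, 200_000, A_demo, alpha_demo)

# Compare the empirical CCDF to the analytical one at a few test points
x_test = np.array([1e2, 1e4, 1e6, 1e8])
ccdf_emp = np.array([(samples > xt).mean() for xt in x_test])
ccdf_ana = 1.0 / (1.0 + A_demo * x_test**alpha_demo)
print(np.c_[x_test, ccdf_emp, ccdf_ana])
```

Because the sampler only needs a uniform random number and two powers, it extends the distribution beyond the measured range without any extra assumption other than the functional form itself.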



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from os.path import join as pj
import json

In [None]:
data_fold = pj("..", "data")
fig_fold = pj("..", "figures", "")
do_save_outputs = False   

## Load and preprocess dataset

In [None]:
df = pd.read_csv(pj(data_fold, "si2019_log10ec50s.csv"))
df["Odorant"] = df["Unnamed: 0"]
df = df.drop("Unnamed: 0", axis=1).set_index("Odorant")
df.columns.name = "OR"
df

In [None]:
# Extract non-NaN EC50s and compute complementary CDF of x = 1/EC50
ec50s_sorted = np.sort(10.0**(df.stack("OR").dropna().values))  # df contains log10ec50
x_sorted = 1.0 / ec50s_sorted[::-1]
print("number values: {:d}, min: {:.2e}, max: {:.2e}".format(
    x_sorted.size, x_sorted.min(), x_sorted.max()))

In [None]:
# Each data point contributes a frequency 1 / npoints
# remove this quantity at each encountered data point.
# ccdf is the y values associated to the x values in x_sorted
prob_per_pt = 1.0 / x_sorted.size
# Integer counts avoid floating-point drift in the number of arange steps
ccdf_exp = np.arange(x_sorted.size, 0, -1) * prob_per_pt
assert ccdf_exp.size == x_sorted.size
assert abs(ccdf_exp[-1] - prob_per_pt) < 0.01 * prob_per_pt

In [None]:
# Plot the ccdf in log-log scale, as in Fig. 3D from Si et al., 2019
fig, ax = plt.subplots()
ax.plot(x_sorted, ccdf_exp, marker="o", mec="k", mfc="none", ms=6.0, ls="none", mew=0.5)
ax.set(xscale="log", yscale="log", xlabel="x = 1/EC50 (inv. dilution)", 
       ylabel=r"Complementary CDF $G_X(x)$")
fig.tight_layout()
plt.show()
plt.close()

## Fit Hill function on the complementary CDF

$$ G_X(x) = \frac{1}{1 + b x^\alpha} $$

Fit $\log_{10} b$ rather than $b$, since $b$ is likely very small but positive. Also fit $\log_{10} G_X$ rather than $G_X$ itself, so that the tail, which has a much lower amplitude, still carries weight in the least-squares residuals.
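To illustrate how the choice of scale changes the weight of the tail, the sketch below compares two Hill curves that differ only in $\alpha$ and measures which share of the summed squared residuals comes from the tail ($x \geq 10^7$). The parameter values and the log-spaced grid are illustrative assumptions; the real data points are not evenly spaced in $\log x$.

```python
import numpy as np

def hill_ccdf_demo(x, logb, alpha):
    """Hill CCDF with b = 10**logb."""
    return 1.0 / (1.0 + 10.0**logb * x**alpha)

x_grid = np.geomspace(1e1, 1e9, 200)
g_ref = hill_ccdf_demo(x_grid, -2.0, 0.42)   # reference curve
g_alt = hill_ccdf_demo(x_grid, -2.0, 0.50)   # same b, steeper tail

# Squared residuals between the two curves, in each scale
sq_res_lin = (g_ref - g_alt) ** 2
sq_res_log = (np.log10(g_ref) - np.log10(g_alt)) ** 2

tail = x_grid >= 1e7
frac_tail_lin = sq_res_lin[tail].sum() / sq_res_lin.sum()
frac_tail_log = sq_res_log[tail].sum() / sq_res_log.sum()
print("tail share of squared residuals, linear scale:", frac_tail_lin)
print("tail share of squared residuals, log scale:", frac_tail_log)
```

In linear scale a mismatch in the exponent is mostly felt near the midpoint of the curve, whereas in log scale it is mostly felt in the tail, which is why the two fits can return different $\alpha$.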

In [None]:
def hill_ccdf(x, b, alpha):
    return 1.0 / (1.0 + b * x**alpha)

def hill_ccdf_logb(x, logb, alpha):
    return 1.0 / (1.0 + (10.0**logb) * (x**alpha))

def loghill_ccdf_logb(x, logb, alpha):
    return -np.log10(1.0 + (10.0**logb) * (x**alpha))

In [None]:
lowerbounds = (-3.0, 0.1)  # scale log10(b), power alpha
upperbounds = (0.0, 1.5)
p_estim = (-2.0, 0.42)

# Fit in linear scale
popt_lin, pcov_lin = curve_fit(hill_ccdf_logb, x_sorted, ccdf_exp,
                       p0=p_estim, bounds=(lowerbounds, upperbounds))

# Fit in log scale
popt_log, pcov_log = curve_fit(loghill_ccdf_logb, x_sorted, np.log10(ccdf_exp),
                       p0=p_estim, bounds=(lowerbounds, upperbounds))
                       
print("Best fit in log-scale:", popt_log)
print("With standard dev. on parameters:", np.sqrt(pcov_log[[0, 1], [0, 1]]))

print("Best fit in linear-scale:", popt_lin)
print("With standard dev. on parameters:", np.sqrt(pcov_lin[[0, 1], [0, 1]]))

In [None]:
xrange = np.geomspace(x_sorted.min(), x_sorted.max(), 200)
ccdf_fit_log = 10.0**(loghill_ccdf_logb(xrange, *popt_log))
ccdf_fit_linear = hill_ccdf_logb(xrange, *popt_lin)

fig, axes = plt.subplots(2, 1, sharex=True)
fig.set_size_inches(plt.rcParams["figure.figsize"][0], 
                   plt.rcParams["figure.figsize"][1]*1.5)
for ax in axes:
    ax.plot(x_sorted, ccdf_exp, marker="o", mec="k", mfc="none", ms=6.0, ls="none", mew=0.5, label='Data')
    ax.plot(xrange, ccdf_fit_linear, ls="-", label="Linear-scale fit", color="tab:blue")
    ax.plot(xrange, ccdf_fit_log, ls="-", label="Log-scale fit", color="tab:orange")
    ax.set_ylabel(r"Complementary CDF $G_X(x)$")
    ax.set_xscale("log")
    ax.legend(frameon=False)
axes[1].set_xlabel("x = 1/EC50 (inverse dilution units)")
axes[0].set_yscale("log")
axes[1].set_yscale("linear")
fig.tight_layout()
plt.show()
plt.close()


## Sample from this Hill distribution
Check what kind of odor vectors this produces, in particular how similar they are to each other.
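One possible way to summarize similarity between odor vectors is the cosine similarity over all distinct pairs. The sketch below uses only the upper triangle of the similarity matrix, so each pair is counted once and self-similarities are excluded. The helper name `pairwise_cosine_stats` and the toy vectors are assumptions for illustration.

```python
import numpy as np

def pairwise_cosine_stats(vectors):
    """Mean and sample std of cosine similarity over distinct pairs of rows."""
    vecs = np.asarray(vectors, dtype=float)
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Keep each unordered pair once, excluding self-similarities
    iu = np.triu_indices(vecs.shape[0], k=1)
    pair_sims = sims[iu]
    return pair_sims.mean(), pair_sims.std(ddof=1)

toy_vecs = np.array([[1.0, 0.0],
                     [0.0, 2.0],
                     [3.0, 3.0]])
mean_sim, std_sim = pairwise_cosine_stats(toy_vecs)
print(mean_sim, std_sim)
```

The mean is the same as when averaging the full symmetric matrix with its diagonal masked out; the sample standard deviation differs slightly because each pair is then counted twice, a difference that should become negligible for large numbers of vectors.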

In [None]:
def inverse_transform_hillcdf(r, logb, alpha):
    # Solve r = G_X(x) = 1 / (1 + b x^alpha) for x, with b = 10**logb
    return ((1.0/r - 1.0)/10.0**logb)**(1.0 / alpha)

In [None]:
rgen = np.random.default_rng(0xb183e7d8079f5bde91fdff60a4b31df2)
n_dims = 50
unif = rgen.random(size=(100, n_dims))
odor_vecs_hill = inverse_transform_hillcdf(unif, *popt_log)

fig, ax = plt.subplots()
img = ax.imshow(np.log10(odor_vecs_hill))
#img = ax.imshow(odor_vecs_hill)
ax.set(xlabel="OSN", ylabel="Odor")
fig.colorbar(img, label="log10 OSN affinity (1/EC50)")
fig.tight_layout()
plt.show()
plt.close()

In [None]:
# Compute cosine similarity now
n_samples = int(1e4)
unif = rgen.random(size=(n_samples, n_dims))
odor_vecs_hill_samp = inverse_transform_hillcdf(unif, *popt_log)
odor_norms = np.sqrt(np.sum(odor_vecs_hill_samp**2, axis=1))
unit_vecs = odor_vecs_hill_samp / odor_norms[:, None]

cosine_sims = unit_vecs.dot(unit_vecs.T)
cosine_sims[np.diag_indices(n_samples)] = np.nan

mean_cosine = np.nanmean(cosine_sims)
std_cosine = np.nanstd(cosine_sims, ddof=1)

print("Average cosine similarity:", mean_cosine)
print("Standard deviation:", std_cosine)

## Alternate distribution with a tanh function

$$ G_X(x) = \mathrm{tanh}\left( x^{-\alpha} / b \right) $$ 

This form has power-law behavior for large $x$ and a sharper cutoff for small $x$ than the Hill function.
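For reference, both limits and the inverse needed for sampling follow from $\tanh(u) \approx u$ for $u \ll 1$ and $\tanh(u) \approx 1 - 2 e^{-2u}$ for $u \gg 1$, with $u = x^{-\alpha}/b$:

$$
\begin{aligned}
x \to \infty: \quad & G_X(x) \approx \frac{x^{-\alpha}}{b} \, , \\
x \to 0: \quad & G_X(x) \approx 1 - 2\exp\left(-\frac{2 x^{-\alpha}}{b}\right) \, , \\
r = G_X(x) \;\Rightarrow\; & x = G_X^{-1}(r) = \left( b \, \mathrm{artanh}(r) \right)^{-1/\alpha} \, .
\end{aligned}
$$

By comparison, the Hill form $1/(1 + b x^{\alpha})$ approaches $1$ only algebraically at small $x$, as $1 - b x^{\alpha}$, which is what gives it a longer tail towards small affinities.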

In [None]:
def tanh_ccdf(x, b, alpha):
    return np.tanh(1.0 / (b * x**alpha))

def tanh_ccdf_logb(x, logb, alpha):
    return np.tanh(1.0 / ((10.0**logb) * (x**alpha)))

def logtanh_ccdf_logb(x, logb, alpha):
    return np.log10(np.tanh(1.0 / ((10.0**logb) * (x**alpha))))

In [None]:
lowerbounds = (-3.0, 0.1)  # scale log10(b), power alpha
upperbounds = (0.0, 1.5)
p_estim = (-2.0, 0.42)

# Fit in linear scale
popt_lintanh, pcov_lintanh = curve_fit(tanh_ccdf_logb, x_sorted, ccdf_exp,
                       p0=p_estim, bounds=(lowerbounds, upperbounds))

# Fit in log scale
popt_logtanh, pcov_logtanh = curve_fit(logtanh_ccdf_logb, x_sorted, np.log10(ccdf_exp),
                       p0=p_estim, bounds=(lowerbounds, upperbounds))
                       
print("Best fit in log-scale:", popt_logtanh)
print("With standard dev. on parameters:", np.sqrt(pcov_logtanh[[0, 1], [0, 1]]))

print("Best fit in linear scale:", popt_lintanh)
print("With standard dev. on parameters:", np.sqrt(pcov_lintanh[[0, 1], [0, 1]]))

In [None]:
xrange = np.geomspace(x_sorted.min(), x_sorted.max(), 200)
ccdf_fit_log = 10.0**(logtanh_ccdf_logb(xrange, *popt_logtanh))
ccdf_fit_linear = tanh_ccdf_logb(xrange, *popt_lintanh)

fig, axes = plt.subplots(2, 1, sharex=True)
fig.set_size_inches(plt.rcParams["figure.figsize"][0], 
                   plt.rcParams["figure.figsize"][1]*1.5)
for ax in axes:
    ax.plot(x_sorted, ccdf_exp, marker="o", mec="k", mfc="none", ms=6.0, ls="none", mew=0.5, label='Data')
    ax.plot(xrange, ccdf_fit_linear, ls="-", label="Linear-scale fit", color="tab:blue")
    ax.plot(xrange, ccdf_fit_log, ls="-", label="Log-scale fit", color="tab:orange")
    ax.set_ylabel(r"Complementary CDF $G_X(x)$")
    ax.set_xscale("log")
    ax.legend(frameon=False)
axes[1].set_xlabel("x = 1/EC50 (inverse dilution units)")
axes[0].set_yscale("log")
axes[1].set_yscale("linear")
fig.tight_layout()
plt.show()
plt.close()


In [None]:
def inverse_transform_tanhcdf(r, logb, alpha):
    # Solve r = G_X(x) = tanh(x^(-alpha) / b) for x, with b = 10**logb
    return (10.0**logb * np.arctanh(r))**(-1.0 / alpha)

In [None]:
unif = rgen.random(size=(100, n_dims))
odor_vecs_tanh = inverse_transform_tanhcdf(unif, *popt_logtanh)

fig, ax = plt.subplots()
img = ax.imshow(np.log10(odor_vecs_tanh))
#img = ax.imshow(odor_vecs_tanh)
ax.set(xlabel="OSN", ylabel="Odor")
fig.colorbar(img, label="log10 OSN affinity (1/EC50)")
fig.tight_layout()
plt.show()
plt.close()

In [None]:
# Compute cosine similarity now
n_samples = int(1e4)
unif = rgen.random(size=(n_samples, n_dims))
odor_vecs_tanh_samp = inverse_transform_tanhcdf(unif, *popt_logtanh)
odor_norms = np.sqrt(np.sum(odor_vecs_tanh_samp**2, axis=1))
unit_vecs = odor_vecs_tanh_samp / odor_norms[:, None]

cosine_sims = unit_vecs.dot(unit_vecs.T)
cosine_sims[np.diag_indices(n_samples)] = np.nan

mean_cosine = np.nanmean(cosine_sims)
std_cosine = np.nanstd(cosine_sims, ddof=1)

print("Average cosine similarity:", mean_cosine)
print("Standard deviation:", std_cosine)

## Cosine similarity in vectors from the original data
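Treat inverse EC50s that are NaNs as effectively zero: if $K_{i \mu} = 0$, OR type $i$ is unresponsive to odor $\mu$. In practice, the code below replaces NaNs by $10^{-4}$ times the smallest measured $1/\mathrm{EC50}$ rather than by exactly zero, so that the $\log_{10}$ plot stays finite.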
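A quick sanity check one could run on this choice: filling NaNs with a tiny floor instead of exactly zero should barely change the cosine similarities, while keeping $\log_{10}$ finite for plotting. The sketch below uses a hypothetical $3 \times 3$ affinity table; `toy_affinity` and `cosine_matrix` are illustrative names, not part of the analysis.

```python
import numpy as np

# Hypothetical affinity table (1/EC50); NaN marks unmeasured pairs
toy_affinity = np.array([[1e4, np.nan, 1e6],
                         [np.nan, 1e5, 1e3],
                         [1e5, 1e4, np.nan]])

# Floor at 1e-4 times the smallest measured value
floor = 1e-4 * np.nanmin(toy_affinity)
filled_zero = np.nan_to_num(toy_affinity, nan=0.0)
filled_floor = np.nan_to_num(toy_affinity, nan=floor)

def cosine_matrix(vectors):
    """Cosine similarity between all pairs of rows."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T

max_diff = np.abs(cosine_matrix(filled_zero) - cosine_matrix(filled_floor)).max()
print("largest change in cosine similarity:", max_diff)
# log10 stays finite only with the floor
print(np.isfinite(np.log10(filled_floor)).all())
```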

In [None]:
empirical_inverseec_50 = 1.0 / 10.0**df.values  # each row is an odorant
empirical_vectors = np.nan_to_num(empirical_inverseec_50, copy=True, nan=0.0001*np.nanmin(empirical_inverseec_50))
n_dim_emp = empirical_vectors.shape[1]
empirical_vec_norms = np.sqrt(np.sum(empirical_vectors**2, axis=1))
emp_unit_vecs = empirical_vectors / empirical_vec_norms[:, None]

# Plot of unnormalized vectors
fig, ax = plt.subplots()
img = ax.imshow(np.log10(empirical_vectors))
#img = ax.imshow(empirical_vectors)
ax.set(xlabel="OSN", ylabel="Odor")
fig.colorbar(img, label="OSN affinity log10(1/EC50)")
fig.tight_layout()
plt.show()
plt.close()

In [None]:
# Compute cosine similarity
cosine_sims = emp_unit_vecs.dot(emp_unit_vecs.T)
cosine_sims[np.diag_indices(emp_unit_vecs.shape[0])] = np.nan

mean_cosine = np.nanmean(cosine_sims)
std_cosine = np.nanstd(cosine_sims, ddof=1)

print("Average cosine similarity:", mean_cosine)
print("Standard deviation:", std_cosine)

## Compare also to the raw power-law fitted in Si et al., 2019
The goal is to show that it does not capture the full range of values. The pdf they use is

$$ f_X(x) = (a -1 )x_{\mathrm{min}}^{a-1} x^{-a} $$

with a lower cutoff at $x_{\mathrm{min}}$. Note that their $a$ in that definition is $\alpha + 1$: they have $a = 1.42$, so a power law with $\alpha=0.42$ for the tail of the complementary CDF. This corresponds to a CDF of

$$ F_X(x) = \int_{x_\mathrm{min}}^x \mathrm{d}x' (a -1 )x_{\mathrm{min}}^{a-1} x'^{-a} = 1 - \left(\frac{x_\mathrm{min}}{x} \right)^{a-1}$$

or a CCDF of 

$$ G_X(x) = \left(\frac{x_\mathrm{min}}{x} \right)^{a-1} = \left(\frac{x}{x_\mathrm{min}} \right)^{-\alpha}$$

which we can easily invert to sample from it. If we sample $r \sim U(0, 1)$, then we set

$$ r = G_X(x) \Rightarrow x = G_X^{-1}(r) = x_\mathrm{min} r^{-1/\alpha} $$


Let's try to 1) plot their fitted pure power-law $G_X(x)$ against their empirical CCDF, and 2) sample odor vectors from their fitted power-law

They have $a = 1.42$, so $\alpha = 0.42$, and $x_\mathrm{min} = 4.2 \times 10^4$. To give the power law a fair chance, we also redo this fit ourselves, working with $\log_{10}(x_\mathrm{min})$.
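A minimal sketch of inverse transform sampling from the pure power law, using the values reported by Si et al. ($x_\mathrm{min} = 4.2 \times 10^4$, $\alpha = 0.42$). The function name `sample_power_law` and the seed are assumptions for illustration. Setting $G_X(x) = 1/2$ gives a median of $x_\mathrm{min} 2^{1/\alpha}$, which serves as a check.

```python
import numpy as np

def sample_power_law(rng, size, xmin, alpha):
    """Inverse-transform samples with CCDF (x / xmin)^(-alpha) for x >= xmin."""
    # 1 - random() lies in (0, 1], so the power is always finite
    r = 1.0 - rng.random(size=size)
    return xmin * r ** (-1.0 / alpha)

xmin_si, alpha_si = 4.2e4, 0.42   # values reported by Si et al., 2019
rng_demo = np.random.default_rng(2019)
samples_pl = sample_power_law(rng_demo, 200_000, xmin_si, alpha_si)

# No sample falls below the cutoff, and the median is xmin * 2^(1/alpha)
median_ana = xmin_si * 2.0 ** (1.0 / alpha_si)
print("min sample / xmin:", samples_pl.min() / xmin_si)
print("median (empirical):", np.median(samples_pl))
print("median (analytic):", median_ana)
```

The hard cutoff is the key limitation here: no sampled affinity can fall below $x_\mathrm{min}$, whereas the measured values of $x$ extend several decades lower.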

In [None]:
def power_ccdf(x, xmin, alpha):
    return (x / xmin)**(-alpha)

def power_ccdf_logxmin(x, logxmin, alpha):
    return (x / 10.0**logxmin)**(-alpha)

def logpower_ccdf_logxmin(x, logxmin, alpha):
    return -alpha * np.log10(x / 10.0**logxmin)

In [None]:
def inverse_transform_powerlaw(r, logxmin, alpha):
    # Uses 1 - r instead of r: both are U(0, 1), and 1 - r > 0 for r in [0, 1)
    return 10.0**logxmin * (1.0 - r)**(-1.0/alpha)

In [None]:
lowerbounds = (-3.0, 0.1)  # scale log10(xmin), power alpha
upperbounds = (6.0, 1.5)
p_estim_si2019 = (np.log10(4.2e4), 0.42)

# Fit in linear scale
popt_linpow, pcov_linpow = curve_fit(power_ccdf_logxmin, x_sorted, ccdf_exp,
                       p0=p_estim_si2019, bounds=(lowerbounds, upperbounds))

# Fit in log scale
popt_logpow, pcov_logpow = curve_fit(logpower_ccdf_logxmin, x_sorted, np.log10(ccdf_exp),
                       p0=p_estim_si2019, bounds=(lowerbounds, upperbounds))
                       
print("Best fit in log-scale:", popt_logpow)
print("With standard dev. on parameters:", np.sqrt(pcov_logpow[[0, 1], [0, 1]]))

print("Best fit in linear scale:", popt_linpow)
print("With standard dev. on parameters:", np.sqrt(pcov_linpow[[0, 1], [0, 1]]))

In [None]:
xrange = np.geomspace(x_sorted.min(), x_sorted.max(), 200)
ccdf_fit_log = 10.0**(logpower_ccdf_logxmin(xrange, *popt_logpow))
ccdf_fit_linear = power_ccdf_logxmin(xrange, *popt_linpow)

# Also plot their reported fit
ccdf_fit_si2019 = power_ccdf_logxmin(xrange, *p_estim_si2019)

fig, axes = plt.subplots(2, 1, sharex=True)
fig.set_size_inches(plt.rcParams["figure.figsize"][0], 
                   plt.rcParams["figure.figsize"][1]*1.5)
for ax in axes:
    ax.plot(x_sorted, ccdf_exp, marker="o", mec="k", mfc="none", ms=6.0, ls="none", mew=0.5, label='Data')
    ax.plot(xrange, ccdf_fit_linear, ls="-", label="Linear-scale fit", color="tab:blue")
    ax.plot(xrange, ccdf_fit_log, ls="-", label="Log-scale fit", color="tab:orange")
    ax.plot(xrange, ccdf_fit_si2019, ls="-", 
            label="Si 2019 reported fit\n" + r"($x_{min}=4.2 \times 10^4, \alpha=0.42$)", color="tab:green")
    ax.set_ylabel(r"Complementary CDF $G_X(x)$")
    ax.set_xscale("log")
    ax.legend(frameon=False)
axes[1].set_xlabel("x = 1/EC50 (inverse dilution units)")
axes[0].set_yscale("log")
axes[1].set_yscale("linear")
fig.tight_layout()
plt.show()
plt.close()


### Conclusion on the reported fit

Clearly, they have only fitted the very tail of their distribution, which rests on a few EC50 values. The pure power law is a poor fit of the full OSN affinity distribution, which is what we need here.

In [None]:
# Sample odor vectors from this power-law anyways
n_samples = int(1e4)
unif = rgen.random(size=(n_samples, n_dims))
odor_vecs_si2019_samp = inverse_transform_powerlaw(unif, *p_estim_si2019)

fig, ax = plt.subplots()
img = ax.imshow(np.log10(odor_vecs_si2019_samp[:100]))
#img = ax.imshow(odor_vecs_si2019_samp[:100])
ax.set(xlabel="OSN", ylabel="Odor")
fig.colorbar(img, label="log10 OSN affinity (1/EC50)")
fig.tight_layout()
plt.show()
plt.close()

## Histogram of vector elements in data vs fit

Just to be sure the fitted distribution is roughly OK and better than a pure power-law (which we also compare). 
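As a complement to histograms of sampled values, the density of $y = \log_{10} x$ implied by the Hill CCDF has a closed form, $p_Y(y) = \ln(10) \, \alpha \, u / (1 + u)^2$ with $u = b \, 10^{\alpha y}$. This is a logistic density, symmetric around $y = -\log_{10}(b)/\alpha$. The sketch below evaluates it and checks its normalization numerically; the parameter values are illustrative assumptions, not the fitted ones.

```python
import numpy as np

def hill_log10_density(y, logb, alpha):
    """Density of y = log10(x) when x has CCDF 1 / (1 + b x^alpha)."""
    u = 10.0 ** (logb + alpha * y)        # u = b * x^alpha with x = 10^y
    return np.log(10.0) * alpha * u / (1.0 + u) ** 2

# Illustrative parameters (assumed, not the fitted values)
logb_demo, alpha_demo = -2.0, 0.42

y_grid = np.linspace(-30.0, 40.0, 7001)
dens = hill_log10_density(y_grid, logb_demo, alpha_demo)

# Trapezoidal rule; the result should be close to 1
dy = np.diff(y_grid)
total = np.sum(0.5 * (dens[1:] + dens[:-1]) * dy)
y_peak = y_grid[np.argmax(dens)]
print("integral of the log10-density:", total)
print("peak location:", y_peak, "expected:", -logb_demo / alpha_demo)
```

The symmetry in $\log x$ makes explicit that the Hill form puts as much probability mass in its small-$x$ tail as in its large-$x$ tail.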

In [None]:
hist1, bins1 = np.histogram(np.log10(x_sorted), bins="doane", density=True)
hist2, bins2 = np.histogram(np.log10(odor_vecs_hill_samp), bins="doane", density=True)
hist3, bins3 = np.histogram(np.log10(odor_vecs_tanh_samp), bins="doane", density=True)
hist4, bins4 = np.histogram(np.log10(odor_vecs_si2019_samp), bins="doane", density=True)
fig, ax = plt.subplots()
ax.bar(bins2[:-1], hist2, width=np.diff(bins2), align="edge", label="Generated Hill", alpha=0.5)
ax.bar(bins3[:-1], hist3, width=np.diff(bins3), align="edge", label="Generated tanh", alpha=0.5)
ax.bar(bins4[:-1], hist4, width=np.diff(bins4), align="edge", label="Si et al., 2019", alpha=0.5)
ax.bar(bins1[:-1], hist1, width=np.diff(bins1), align="edge", label="Empirical", alpha=0.5, color="grey")
ax.legend()
ax.set(xlabel="log10 OSN sensitivity", ylabel="Probability density")
plt.show()
plt.close()

## Conclusion
The clear winner is the tanh distribution fitted in log scale. The Hill distribution has a tail at small $x$ that is absent from the data.


However, the full dataset had a bunch of NaN EC50s, which means many sensitivities are in fact non-detectable, so there should be a tail for small $x$; perhaps the Hill distribution is the best one to also capture some of this non-detectable responsiveness of each OR to some odors?

I will use the tanh, since for the available (non-NaN) data, it is the best fit, and I do not know the actual value of the other EC50s. 

In [None]:
# Export fit parameters and ccdf data for final plotting
# ("logb_cov" and "alpha_cov" are the diagonal entries of pcov, i.e. variances)
results_folder = pj("..", "results", "for_plots", "nonlin_adapt")
results_dicts = {
    "x_sorted": list(x_sorted), 
    "ccdf_exp": list(ccdf_exp),
    "best_fit": {"logb": float(popt_logtanh[0]), "alpha": float(popt_logtanh[1]), 
                 "logb_cov": float(pcov_logtanh[0, 0]), "alpha_cov": float(pcov_logtanh[1, 1])}
}
if do_save_outputs:
    with open(pj(results_folder, "si2019_cdf_and_fits.json"), "w") as h:
        json.dump(results_dicts, h, indent=4)