# Mixed Signals

Suppose we have a dataset which is known to be synthesized from two distinct signals.  One signal is of interest, while the other is not.

We refer to these as "signal," $S$, and "background," $B$, respectively.

More specifically, let's identify three distributions: $S$, $B$, and $X$, where:

- $S$ is the signal of interest,
- $B$ is the background (a signal which is not interesting), and
- $X$ is a mixture of $S$ and $B$.


## Assumptions

<img src="images/whiteboard/overview.jpg" width="800" alt="Overview Diagram"/>

Using the manifold assumption, we claim that data sampled from a pure signal will lie on a certain low-dimensional manifold.  Similarly, we assume that the background also lies on a low-dimensional manifold, and that this manifold is distinct from that of the signal.

Assuming that our data are collected in some high-dimensional ambient space, we expect that the manifolds for pure signal and pure background will most likely occupy distinct (though not necessarily orthogonal) subspaces of the ambient space.

## Mixing Model

Suppose the distribution of the pure signal is represented by a multi-dimensional random variable $\mathbf{S}$.  Similarly, the pure background is represented by the random variable $\mathbf{B}$.

We employ a simple mixing model:

$$ \mathbf{X} = \alpha \mathbf{S} + (1 - \alpha) \mathbf{B} $$

where the parameter $\alpha \in [0, 1]$ determines the strength of the signal relative to the background.

In this model, signal and background are assumed to be drawn from independent distributions, so there is no correlation between them.
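
As a minimal sketch of this mixing model, the snippet below draws independent signal and background samples and combines them row by row. It makes a simplifying assumption that is not part of the model itself: each "manifold" is taken to be a random linear subspace with Gaussian latent coordinates. The function names (`sample_on_subspace`, `mix`) and all dimensions are hypothetical choices for illustration.

```python
import numpy as np

def sample_on_subspace(n, ambient_dim, latent_dim, rng):
    """Draw n points from a random latent_dim-dimensional linear subspace.

    Assumption for illustration only: a linear subspace stands in for a
    low-dimensional manifold.
    """
    # Columns of `basis` are an orthonormal basis of a random subspace.
    basis, _ = np.linalg.qr(rng.standard_normal((ambient_dim, latent_dim)))
    latent = rng.standard_normal((n, latent_dim))
    return latent @ basis.T

def mix(signal, background, alpha):
    """Apply X = alpha * S + (1 - alpha) * B to paired samples."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * signal + (1.0 - alpha) * background

rng = np.random.default_rng(0)
# Signal and background are sampled independently of one another.
S = sample_on_subspace(1000, ambient_dim=20, latent_dim=2, rng=rng)
B = sample_on_subspace(1000, ambient_dim=20, latent_dim=3, rng=rng)
X = mix(S, B, alpha=0.3)
```

With these settings the mixed data span at most a 5-dimensional subspace of the 20-dimensional ambient space, since the signal contributes 2 dimensions and the background 3.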

Using this model, we can derive the covariance structure of the mixed distribution in terms of the covariance structures of the underlying distributions:

<img src="images/whiteboard/mixing-model-cov.jpg" width="600" alt="Covariance Structure"/>
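
The contents of the figure are not reproduced in text here. As a sketch, the standard covariance identity for a linear combination of independent random vectors gives the following, where $\Sigma_S = \mathrm{Cov}(\mathbf{S})$ and $\Sigma_B = \mathrm{Cov}(\mathbf{B})$ are notation assumed for this note:

$$
\begin{aligned}
\mathrm{Cov}(\mathbf{X})
  &= \mathrm{Cov}\big(\alpha \mathbf{S} + (1 - \alpha) \mathbf{B}\big) \\
  &= \alpha^2 \, \mathrm{Cov}(\mathbf{S}) + (1 - \alpha)^2 \, \mathrm{Cov}(\mathbf{B})
     + \alpha (1 - \alpha) \big[ \mathrm{Cov}(\mathbf{S}, \mathbf{B}) + \mathrm{Cov}(\mathbf{B}, \mathbf{S}) \big] \\
  &= \alpha^2 \, \Sigma_S + (1 - \alpha)^2 \, \Sigma_B
\end{aligned}
$$

The cross terms vanish because signal and background are assumed independent.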


## Objective

Given:

* a sample dataset $X \sim \alpha \mathbf{S} + (1 - \alpha) \mathbf{B}$ drawn from the mixture distribution, and
* (where available) a sample dataset $B \sim \mathbf{B}$ of pure background,

we want to identify the structure of the signal manifold by removing the influence of the background.

Without further information, this problem may be intractable. (How, for example, could we even generally identify what constitutes signal versus background?)  Therefore, we will also consider cases where we have a separate dataset $B$ which is a sample of pure background.

## Proposed Techniques

1. Project into null space of $\mathrm{PCA}(X, k)$.

   Assuming the background is known to be significantly *stronger* than the signal, we might assume that the background is responsible for most of the observed variance in $X$.  Therefore, we can perform a PCA of $X$ and project the data into the null space of the first $k$ singular vectors.  If these vectors mostly explain the background, then the remaining dimensions may better explain the signal.

2. Project into null space of $\mathrm{PCA}(B, k)$.

   Instead of relying on the mixed signal to identify the background subspace, we could instead compute this from a dataset of background-only values.  By identifying the strongest background PCs, we can project the mixed data $X$ into a subspace orthogonal to these PCs.  Presumably the result would contain far less of the background influence, making it easier to learn the signal manifold.

3. Using contrastive PCA, project into the subspace given by $\mathrm{cPCA}(X, B, k)$.
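
The three techniques above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not a definitive implementation: the helper names (`top_pcs`, `project_out`, `cpca_directions`) are hypothetical, the synthetic data use linear subspaces in place of manifolds, and the cPCA step uses the common formulation that takes the top eigenvectors of $\mathrm{Cov}(X) - c \, \mathrm{Cov}(B)$, where the contrast weight $c$ (`contrast` below) is a tuning parameter not specified in this document.

```python
import numpy as np

def top_pcs(data, k):
    """Return the top-k principal directions of `data` as columns (d x k)."""
    centered = data - data.mean(axis=0)
    # Rows of vt are right singular vectors, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

def project_out(data, directions):
    """Center `data`, then remove its component in span(directions)."""
    centered = data - data.mean(axis=0)
    return centered - centered @ directions @ directions.T

def cpca_directions(x, b, k, contrast=1.0):
    """Top-k eigenvectors of Cov(x) - contrast * Cov(b), as columns (d x k)."""
    diff = np.cov(x, rowvar=False) - contrast * np.cov(b, rowvar=False)
    _, eigvecs = np.linalg.eigh(diff)  # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :k]

# Synthetic example (assumed setup): 2-D signal, 3-D background, 20-D ambient space.
rng = np.random.default_rng(0)
d, n, k, alpha = 20, 2000, 3, 0.3
sig_basis, _ = np.linalg.qr(rng.standard_normal((d, 2)))
bg_basis, _ = np.linalg.qr(rng.standard_normal((d, k)))
X = (alpha * rng.standard_normal((n, 2)) @ sig_basis.T
     + (1 - alpha) * rng.standard_normal((n, k)) @ bg_basis.T)
B = rng.standard_normal((n, k)) @ bg_basis.T  # separate pure-background sample

# Technique 1: remove the top-k PCs estimated from the mixture itself.
X1 = project_out(X, top_pcs(X, k))

# Technique 2: remove the top-k PCs estimated from the background sample.
bg_pcs = top_pcs(B, k)
X2 = project_out(X, bg_pcs)

# Technique 3: project onto the top contrastive directions.
cpca_dirs = cpca_directions(X, B, 2)
X3 = (X - X.mean(axis=0)) @ cpca_dirs
```

Note that `X1` and `X2` remain in the ambient space with the chosen directions zeroed out, while `X3` is expressed in the coordinates of the contrastive directions.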


## Data Generation



## Visualizing the Manifold

