# April Recap

Let's review everything I have, to remind myself of every gap I still need to fill and every resource still at my disposal.

## CMR and ICMR Are Different

January_22_2021 has same themes.

I'm going to want to draw important distinctions between CMR and ICMR in my initial model comparison section. 

### We Can Conceptualize a Latent $M^{FC}$ and $M^{CF}$ in ICMR
For exploring and demonstrating model equivalence, we can calculate for any state of ICMR's dual-store memory array $M$
a corresponding $M^{FC}$ (or $M^{CF}$) by computing for each orthogonal $f_i$ (or $c_i$) the model's corresponding echo
representation. 

Because echoes taken this way are how we retrieve $F\rightarrow C$ and $C \rightarrow F$ associations in the model, the
resulting matrices characterize model behavior identically to how they would in classic CMR. This convention lets us
compare model performance internally as well as externally.
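
A minimal sketch of that probing procedure, under assumptions not fixed by these notes: each trace is stored as a concatenated `[features | context]` vector, similarity is a plain dot product over non-negative entries, and activation is similarity raised to a hypothetical exponent `tau`. The names `echo` and `latent_mfc_columns` are illustrative, not ICMR's actual code.

```python
def echo(memory, probe, tau=1.0):
    """Sum every trace, weighted by its dot-product match to the probe raised to tau."""
    result = [0.0] * len(probe)
    for trace in memory:
        similarity = sum(p * t for p, t in zip(probe, trace))
        activation = similarity ** tau  # assumes non-negative similarities
        result = [r + activation * t for r, t in zip(result, trace)]
    return result


def latent_mfc_columns(memory, n_features, tau=1.0):
    """Return one context vector per orthogonal f_i: column i of the latent M^FC."""
    n_context = len(memory[0]) - n_features
    columns = []
    for i in range(n_features):
        f_i = [1.0 if j == i else 0.0 for j in range(n_features)]
        probe = f_i + [0.0] * n_context  # cue with features only, context left blank
        columns.append(echo(memory, probe, tau)[n_features:])
    return columns


# Toy memory: each trace is [features | context]; item 0 was presented twice.
memory = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.6, 0.8],
    [1.0, 0.0, 0.0, 1.0],
]
columns = latent_mfc_columns(memory, n_features=2)
```

Here `columns[0]` sums the contexts of both presentations of item 0, which is what $M^{FC}f_0$ would hold in CMR. Swapping the roles of the two halves of the probe would give the analogous $M^{CF}$.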

### Discarding Trace Context, or How We Don't Treat $M$ Like a Single Memory
Even though traces in ICMR are unique, we discard their unique contextual content when configuring $c_{in}$ to cue
successive recalls, instead using the equivalent of $M^{FC}f_i$ as performed by CMR. 

This information could instead be used as a direct cue for the next recall, with or without its accompanying feature
content. Since that may be too insular, it could also be included with the retrieved item's feature content as part of the
memory cue to our equivalent of $M^{FC}f_i$. 

If we did either at retrieval, successive recalls would be biased toward a _particular trace context_ rather than a
balanced sum of all contexts with relevant feature information, creating some interesting consequences for even single
presentation experiments.

<!--
More generally, we decide against ever treating $M$ as a unitary memory system when it comes to free recall. Even though
we use a single array to represent $M$, we only ever cue it with either feature information or contextual information.
-->

Probably only relevant if item feature representations are non-orthogonal. A potential extension worth a spot in my discussion section.

## ICMR as an Integrative Project

Reviews what instance-based models have covered (as reviewed by Jamieson et al, 2018). Outlines a framework for the paper that emphasizes ICMR's integrative components. I want to go further.

## Various Notes on Context

- Jamieson_2018_1 and 2
- Instance Context Thoughts

These help remind me how ICMR and CMR differ substantively. Both the linear associator and the instance model can adaptively select relevant information based on a cue. But the instance model has nonlinear activation of stored instances based on similarity to the probe, which improves retrieval of rarer co-occurrences. That leads to the more novel analyses I have planned.

Previous work has emphasized the _timing_ of abstraction when distinguishing between prototype-based and instance-based models. Prototype-based models are characterized as performing abstraction-at-encoding, constructing and updating a single summary contextual representation for an item each time it is encoded, while instance-based models perform abstraction across instances on the fly at retrieval. When it comes at least to the _content_ of a retrieved contextual representation however, the timing of its construction is unimportant. PrototypeCMR's predictions about the temporal organization of free recall would be the same if the model stored records of item presentations until retrieval but applied the same linear mechanisms to perform abstraction. The main factor distinguishing the content of predictions made by prototype and corresponding instance-based models is the latter architecture's option to perform abstraction through nonlinear activation of traces based on relevance to a probe. Models that perform abstraction at encoding and do not preserve instances for probe-based comparison close off this option.

What does nonlinear activation of stored traces get you?

***

Students seem confused about why LSA (along with BEAGLE) fails in this experiment while ITS does not. Indeed, Jamieson et al only really gesture at an explanation: they reiterate that an instance-based model can produce nonlinear activation of stored instances, while DSMs cannot.

It may be worth working through what this means and why it matters, since it's the core of the critique. In ITS, the representation associated with a cue combining a word and a sense-disambiguating context (e.g. break/car) is generated by taking the product of activations associated with each word. The result is that representations in stored traces only figure substantially in the final resulting echo if they're similar to all words in the probe.

In either LSA or BEAGLE, retrieval is linear: the centroid of the vectors associated with each word in a cue is used for comparisons w/ other words. The vector for "break" will always reflect a composition of all the contexts it was presented in, weighted by the statistical distribution of those contexts, and the same goes for "car". 

Technically, ITS could do this, too - the centroid of the echoes associated with each word in a context could be taken. (We actually do this in ICMR to achieve performance that looks like CMR's). But while ITS also has the option of building semantic representations through nonlinear coactivation of traces, linear sense disambiguation is distributional models' only option. 

This is a huge problem. At encoding, distributional models and ITS get the same scarce information about relatively uncommon use contexts for polysemous words, but distributional models discard (or constrain access to, depending on how you look at it) much of this information by the time retrieval happens. ITS's retrieval mechanism is more flexible: it can selectively retrieve ONLY the traces where both of the words in a joint probe occur while inhibiting activation of traces where they don't (with its cubic function). **DSMs constrain retrieval via the integrative commitments they make at encoding.**
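
To make the contrast concrete, here is a toy sketch rather than ITS itself: traces are binary bag-of-words vectors over a made-up six-word vocabulary, similarity is a dot product with a one-hot word vector, and `tau=3` stands in for the cubic function. All names (`joint_echo`, `centroid_echo`, the vocabulary) are assumptions for illustration.

```python
VOCAB = ["break", "car", "glass", "news", "repair", "road"]


def one_hot(word):
    return [1.0 if w == word else 0.0 for w in VOCAB]


def bag(*words):
    return [1.0 if w in words else 0.0 for w in VOCAB]


def similarity(word, trace):
    return sum(p * t for p, t in zip(one_hot(word), trace))


def weighted_sum(traces, weights):
    return [sum(w * t[i] for w, t in zip(weights, traces)) for i in range(len(VOCAB))]


def joint_echo(traces, words, tau=3):
    """Nonlinear route: a trace's activation is the product of its per-word activations."""
    weights = []
    for trace in traces:
        activation = 1.0
        for word in words:
            activation *= similarity(word, trace) ** tau
        weights.append(activation)
    return weighted_sum(traces, weights)


def centroid_echo(traces, words):
    """Linear route: average the echoes retrieved by each word on its own."""
    echoes = [weighted_sum(traces, [similarity(w, t) for t in traces]) for w in words]
    return [sum(column) / len(words) for column in zip(*echoes)]


traces = [
    bag("break", "glass"), bag("break", "glass"), bag("break", "news"),
    bag("break", "car", "repair"),  # the rare sense-disambiguating context
    bag("car", "road"), bag("car", "road"),
]
joint = joint_echo(traces, ["break", "car"])
centroid = centroid_echo(traces, ["break", "car"])
```

In this toy case `joint` contains only the break/car/repair trace, while `centroid` gives "repair" no more weight than "glass" or "road". With binary similarities the product does all the work; the exponent only matters once similarities are graded.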

***

Maybe a good clarifying thought wrt instance vs prototype models, though maybe it mostly reiterates what we've already discussed:

Previous work has emphasized the _timing_ of abstraction when distinguishing between prototype-based and instance-based models. Prototype-based models are characterized as performing abstraction-at-encoding, constructing and updating a single summary contextual representation for an item each time it is encoded, while instance-based models perform abstraction across instances on the fly at retrieval. When it comes at least to the _content_ of a retrieved contextual representation however, the timing of its construction is unimportant.

PrototypeCMR's predictions about the temporal organization of free recall would be the same if the model stored records of item presentations until retrieval but applied the same linear mechanisms to perform abstraction. Similarly, an abstraction-at-retrieval model that keeps instances in memory until retrieval could (given the same probes) produce the same representations as a model like LSA, and just do everything that latent semantic analysis entails at the moment that a probe is presented instead of earlier. 

The more substantive difference between prototype-based and instance-based models, at least when it comes to the content of abstractive representations, is not the timing of abstraction, but whether abstraction can involve nonlinear activation of traces based on relevance to a probe. Models technically don't have to perform abstraction in this way in order to count as instance-based or as performing abstraction at retrieval. However, only models that preserve instance representations in memory until retrieval have the option to have this feature in the first place.

So in big ways, the main distinction of interest with our project and even with Jamieson et al's project isn't so much abstraction at encoding vs at retrieval or even instance-based or prototype-based per se. It's linear vs nonlinear activation of stored traces. Every other distinction between the models seems largely implementation-specific.

I think my analysis of the Lohnas dataset helps reinforce this point. When I move the choice-sensitivity parameter we've been discussing outside the activation function in InstanceCMR so that there is no nonlinear activation of stored traces in the model, its predictions about how item repetitions impact recall become largely identical with prototype-based CMR's. The results will help reinforce that this specific mechanism is where there be dragons.

## Nonlinear Activation of Stored Instances
> Notes from March_3_2021

I think I've figured out the big reason CMR needs item feature representations to be orthogonal. It seems to be closely related to why ITS handles polysemous words better than DSMs.

As Jamieson et al (2018) reiterate a lot, an integration-at-retrieval model can produce nonlinear activation of stored instances, while integration-at-encoding models (like CMR) cannot. CMR seems to sidestep this problem, though, with two commitments:

1. **It makes feature unit activations corresponding to each item orthogonal**. The result is that each index of the activation vector generated during retrieval corresponds directly to support for a particular item/experience/memory trace. 
2. **When transforming activations into probabilities, it applies a sensitivity parameter that nonlinearly scales the contrast between well-supported and poorly supported items.** The result is that CMR effectively also produces nonlinear activation of stored instances. 

While item feature representations are orthogonal, this sensitivity scaling step has the same consequence whether you perform it before or after doing your sum of trace vectors. So you can do abstraction-at-encoding while still enjoying the dividends of activating traces nonlinearly based on cue similarity.

As you break down the one-to-one correspondence between each feature unit and some specific item/experience, the correspondence between outcomes of nonlinear scaling before and after abstraction also breaks down. Taking the cube of the first entry of your activation vector in CMR is no longer equivalent to scaling the activation of a particular trace based on cue similarity. 

Instead, the scaling step can distort learning: feature information from experiences only weakly associated with the current contextual cue gets amplified nonlinearly, while the original weightings encoded based on trace similarity apply only linearly.

So, summing up. I said this was closely related to the ITS/DSM distinction, but it's not quite the same critique. In the Jamieson 2018 paper, nonlinear activation of traces in the context of DSMs is considered impossible. In CMR, nonlinear activation is achieved via softmax, but does not weight support effectively if feature units don't identify items (that is, if item representations aren't orthogonal). By performing nonlinear activation over traces instead of over the collapsed activation vector, ICMR can have non-orthogonal item representations _without_ any breakdown in learning. If my reasoning is correct.
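
A quick numerical check of that reasoning, as a sketch under simplifying assumptions: one trace per item, made-up cue similarities, and a cubic exponent standing in for the sensitivity scaling. The function names and the `overlapping` feature matrix are hypothetical.

```python
def scale_before_sum(activations, features, tau=3):
    """Instance-style: scale each trace's activation, then sum the feature vectors."""
    n_units = len(features[0])
    return [sum(a ** tau * f[i] for a, f in zip(activations, features))
            for i in range(n_units)]


def scale_after_sum(activations, features, tau=3):
    """Prototype-style: sum the feature vectors linearly, then scale each unit."""
    n_units = len(features[0])
    return [sum(a * f[i] for a, f in zip(activations, features)) ** tau
            for i in range(n_units)]


activations = [0.9, 0.5, 0.1]  # made-up cue similarities, one trace per item

orthogonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
overlapping = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.5], [0.0, 0.5, 1.0]]

same = scale_before_sum(activations, orthogonal), scale_after_sum(activations, orthogonal)
diff = scale_before_sum(activations, overlapping), scale_after_sum(activations, overlapping)
```

With `orthogonal` features the two orders of operation agree unit by unit; with `overlapping` features they diverge, because each unit's summed activation now mixes several traces before the exponent is applied.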

### Handling Repetitions
The difference between applying a nonlinear transformation after versus before integration is the difference between predicting a superlinear (here cubic) scaling of retrieval support when an item gets repeated and predicting a linear increase: between $(a_1 + a_2 + a_3 + \dots)^3$ and $a_1^3 + a_2^3 + a_3^3 + \dots$, where $a_k$ is the activation contributed by the $k$-th presentation. Because of the way context evolves, we can't test that prediction directly, but it's something that will color less pure cases of item repetition as well.
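
As an idealized worked case (assuming, unrealistically given context drift, that each of $n$ presentations of an item contributes the same activation $a$ to a given cue):

$$
\Big(\sum_{k=1}^{n} a\Big)^{3} = n^{3}a^{3}
\qquad \text{versus} \qquad
\sum_{k=1}^{n} a^{3} = n\,a^{3}
$$

Scaling after integration therefore gives a repeated item $n^2$ times the support that scaling before integration does: a factor of 4 for two presentations and 9 for three.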

<!--
Here, "meaning" corresponds to the company a word keeps, and a model of semantic memory is thus evaluable in terms of how capably it can retrieve contextual information about any cue. In that sense, the tasks of episodic and semantic memory are equally concerned with context. Successful semantic memory depends on tracking contextual features that occur consistently across experiences of an arbitrary item. Episodic memory is concerned rather with tracking item features associated with an arbitrary configuration of contextual features. In the context of ITS, this distinction isn't real except to the extent that episodic memory generally entails retrieval of item information specific enough to characterize a single memory trace.
-->

## Stray To-Dos

- Plot how parameter modification affects shape of graph. Overlap of result of modifying parameter.
- Use a more recent dataset than Murdock, maybe one from the lab? Pierce2016, experiment 3 (exp 1 is good too?). Not the Sederberg one, which had some quirks, though it could show the model handles distraction okay.
- Look into semantics. Can't get a sharp recency effect (in any of SPC, PFR, CRP) when feature vectors are nonorthogonal. Can use GloVe or whatever to demonstrate. Capture the sharp effect w/o overpredicting semantic information. Predicting both temporal and semantic organization simultaneously is the problem.
- **Quantitative fit comparison**. AIC can do the work; MortPolyn2016 models how to do this.
- Repetitions analyses
- Etc
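
For the quantitative fit comparison item above, a minimal sketch of the standard AIC formula and Akaike weights. The model names and log-likelihoods are placeholders, not results from any fit, and whether a small-sample correction (AICc) is appropriate is left open here.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L); lower is better."""
    return 2 * n_params - 2 * log_likelihood


def akaike_weights(aic_scores):
    """Relative support for each model, normalized to sum to 1."""
    best = min(aic_scores)
    raw = [math.exp(-0.5 * (score - best)) for score in aic_scores]
    total = sum(raw)
    return [r / total for r in raw]


# Placeholder numbers only: (maximized log-likelihood, number of free parameters).
fits = {"model_a": (-1510.0, 11), "model_b": (-1505.0, 12)}
scores = {name: aic(ll, k) for name, (ll, k) in fits.items()}
weights = dict(zip(scores, akaike_weights(list(scores.values()))))
```

The extra parameter in `model_b` costs 2 AIC points, so in this made-up case it is preferred only because its log-likelihood improves by more than 1.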