In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from llamabot import QueryBot, PromptRecorder
from pyprojroot import here

pr = PromptRecorder()

In [None]:
bot = QueryBot(
    "You are a bot that summarizes the contents of a paper.",
    doc_paths=[here() / "data/JMLR-23-0380-1.pdf"],
)

In [None]:
with pr:
    bot(
        "Summarize this paper the way Richard Feynman would summarize it to a college student."
    )

In [None]:
with pr:
    bot(
        "From the authors' perspective, why would breaking the co-adaptation between feature extractor and classifier can lead to better generalization?"
    )

In [None]:
with pr:
    bot(
        "There is a term I don't get: 'point-like distributions of features for each class'. What does this mean?"
    )

In [None]:
with pr:
    bot(
        "What computational experiments did the authors do to show that breaking the co-adaptation leads to better generalization?"
    )

In [None]:
with pr:
    bot(
        "I don't get this term: 'approximate geodesic distance metric'. What does this mean?"
    )

In [None]:
with pr:
    bot(
        "I don't get these terms: 'large-dataset solution' and 'small-dataset solutions'. What do these mean?"
    )

In [None]:
with pr:
    bot("Can you describe in detail what FOCA is? and what is PoF in detail?")

In [None]:
with pr:
    bot(
        "With FOCA, what is the loss function for optimizing the feature-extractor part of a deep network while keeping the classifier part undetermined?"
    )

In [None]:
with pr:
    bot(
        "Can you explain this question (ϕ⋆ = arg min ϕ (1/nD) ∑(x,t)∈D Eθ∼Θϕ L(Cθ(Fϕ(x)), t)) in plain English? What is the intuition here?"
    )

In [None]:
with pr:
    bot(
        "In this equation (ϕ⋆ = arg min ϕ (1/nD) ∑(x,t)∈D Eθ∼Θϕ L(Cθ(Fϕ(x)), t)), where do we get a distribution of classifier parameters from? Are they pre-trained top layers? Or are they randomly initialized neural networks?"
    )
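
For reference, the objective quoted in the two prompts above can be typeset as follows. This is only a transcription of the inline expression, assuming the trailing ϕ on Θ is a subscript. Here F_ϕ is the feature extractor with parameters ϕ, C_θ is the classifier with parameters θ, L is the sample-wise loss, D is the dataset of n_D samples, and Θ_ϕ is the distribution of classifier parameters.

```latex
\phi^{\star} = \arg\min_{\phi} \frac{1}{n_D} \sum_{(x,t) \in D}
  \mathbb{E}_{\theta \sim \Theta_{\phi}}
  \left[ L\bigl(C_{\theta}(F_{\phi}(x)),\, t\bigr) \right]
```

Read this way, the feature extractor is chosen to minimize the average loss over the dataset, where each sample's loss is itself averaged over classifiers drawn from Θ_ϕ rather than evaluated with one fixed classifier.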

In [None]:
with pr:
    bot(
        "Can you describe FOCA from a procedural perspective? How does it work? What is the algorithm, written in algorithm form, with an emphasis on translating symbols into plain English?"
    )

In [None]:
with pr:
    bot(
        "In comparing FOCA to other methods, what details in other methods did they change to make the comparison fair? For example, did they use other techniques to maximize the generalization performance of other methods, and if so, what were they?"
    )

In [None]:
with pr:
    bot(
        "Are there logical flaws in the paper that you can identify? If so, what are they?"
    )

In [None]:
with pr:
    bot(
        "What are the limitations of the theoretical proofs in the paper? What are the assumptions that the proofs make, explained in plain English? Are these assumptions realistic?"
    )

In [None]:
with pr:
    bot(
        "A specific distribution of classifier parameters is assumed. What is this distribution in both mathematical form and its intuition in English?"
    )

In [None]:
with pr:
    bot(
        "With this particular assumption, that 'the activation function satisfies a specific condition', what is that condition in mathematical form, and what is the intuition behind that condition, in plain English?"
    )

In [None]:
with pr:
    bot(
        "What are potential caveats of this work that are not mentioned by the authors?"
    )

In [None]:
with pr:
    bot(
        "What neural network architectures did the authors use in their experiments? Did they claim that their results generalize to other architectures? If so, what is the justification for that claim?"
    )

In [None]:
with pr:
    bot(
        "An architecture introduced by Lee et al. (2016) for CIFAR-10 experiments, with some modifications - what modifications were introduced?"
    )

In [None]:
with pr:
    bot(
        "The proposed methods, especially FOCA, involve the use of multiple weak classifiers during the feature-extractor training - do the authors suggest how many weak classifiers are needed?"
    )

In [None]:
with pr:
    bot(
        "Do the authors provide practical recommendations on the minibatch size for training FOCA?"
    )

In [None]:
with pr:
    bot(
        "Do the authors provide a link to a GitHub repository where we can examine the training code?"
    )

In [None]:
with pr:
    bot(
        "CIFAR-10 and CIFAR-100 are image datasets. Do the authors provide evidence of FOCA outperforming other methods on non-image datasets?",
    )

In [None]:
talking_points = """
0. I focused on FOCA, not on PoF, as I didn't have sufficient time to cover both.
   Also, I am a practitioner, not a theoretician, so I focused on the practical aspects of the paper.
1. The FOCA method is logically well-motivated.
2. I had to use GPT to translate the FOCA approximate minimization algorithm into plain English.
   I think the authors could have done a better job of explaining the intuition behind the algorithm,
   which would have improved the readability of the paper.
   (To be clear, I am only saying that the algorithm wasn't presented clearly;
   I do think the algorithm is correct.)
   In case it is useful to the authors, I am providing the GPT translation of the symbols in Algorithm 1 below.

<ALGORITHM BEGINS>
1. Initialize the feature extractor parameters (ϕ) with random values.
2. For each iteration (t) from 1 to the total number of iterations (T):
   a. Create a set of indices (Ic) for each class by randomly selecting 'k' samples per class.
   b. Update the classifier parameters (θ) by minimizing the sum of sample-wise losses (L) and a regularization term for the selected samples in Ic.
   c. Randomly select a mini-batch of size 'm' from the total number of samples (nD) for ϕ-update.
   d. Update the feature extractor parameters (ϕ) by minimizing the average sample-wise loss (L) for the selected mini-batch, using the learning rate (η).
3. The final feature-extractor parameters (ϕ⋆) are obtained after all iterations.

In this algorithm:
- T: total number of iterations
- C: number of classes
- nc(c=1,···,C): number of class-c samples
- k: number of samples per class for θ-update
- nD: total number of samples
- m: mini-batch size for ϕ-update
- η: learning rate
<ALGORITHM ENDS>

3. The following question should be addressed in the discussion.
   Because FOCA uses a distribution over classifiers, is it possible to use a Bayesian neural network (BNN) as the classifier instead?
   The FOCA algorithm appears to train a feature extractor to work with the equivalent of a BNN,
   so it seems logical to me that we should be able to use a BNN in lieu of a collection of classifiers.
   If we were to train a feature extractor with a BNN, would there be speedups because we could replace minibatch-wise training
   with training the parameters of a single BNN?
4. FOCA has been shown to produce slightly superior test set performance for image data and convolutional neural networks.
   However, I wonder how well the method generalizes to other kinds of neural networks,
   such as Transformers and simpler MLPs, as well as to non-image data.
5. Related to the previous point, as a practitioner, I always consider how easily I can use a method.
   If a method requires writing a lot of custom code, I usually steer away from it.
   Given the claims of superior performance, and given that modern machine learning papers
   that seek maximal impact often come with code, I think it would be a good idea to include a section in the paper
   that shows how to implement FOCA in code, or at least a code example of FOCA.
"""

with pr:
    bot(
        "Thank you for helping me with the paper Q&A. "
        "I am now going to prepare a suite of bullet points to write up as my review. "
        "Please help me rewrite it into a coherent critique of the paper. "
        "Please also try to expand on some of the points in there. "
        "Here are the bullet points:\n\n"
        f"{talking_points}"
    )
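
The plain-English steps of Algorithm 1 in the talking points above can be sketched in NumPy as follows. This is a minimal illustration under assumptions that are not taken from the paper: the feature extractor is a single tanh layer, each weak classifier is a linear softmax fitted by a few gradient steps with an L2 penalty, and the ϕ-update treats the freshly fitted θ as a constant. The names `foca_sketch` and `fit_weak_classifier` and all default hyperparameters are hypothetical.

```python
import numpy as np


def softmax(logits):
    # Subtract the row-wise max for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def fit_weak_classifier(Z, T, n_classes, reg=0.1, steps=50, lr=0.5):
    """theta-update (assumed form): linear softmax classifier with an L2 penalty."""
    V = np.zeros((Z.shape[1], n_classes))
    onehot = np.eye(n_classes)[T]
    for _ in range(steps):
        P = softmax(Z @ V)
        V -= lr * (Z.T @ (P - onehot) / len(T) + reg * V)
    return V


def foca_sketch(X, T, n_classes, n_features=8, n_iters=200, k=5, m=32, eta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    onehot = np.eye(n_classes)[T]
    # Step 1: initialize the feature-extractor parameters (phi) randomly.
    W = rng.normal(scale=0.1, size=(X.shape[1], n_features))
    for _ in range(n_iters):  # Step 2: iterate t = 1..T
        # 2a: pick k random sample indices per class.
        idx = np.concatenate([
            rng.choice(np.flatnonzero(T == c), size=k, replace=False)
            for c in range(n_classes)
        ])
        # 2b: fit classifier parameters (theta) on those samples, phi held fixed.
        V = fit_weak_classifier(np.tanh(X[idx] @ W), T[idx], n_classes)
        # 2c: draw a mini-batch of size m for the phi-update.
        batch = rng.choice(len(X), size=m, replace=False)
        # 2d: one gradient step on phi for the mean cross-entropy, theta held fixed.
        Z = np.tanh(X[batch] @ W)
        P = softmax(Z @ V)
        dZ = (P - onehot[batch]) @ V.T / m
        W -= eta * (X[batch].T @ (dZ * (1.0 - Z**2)))
    # Step 3: return the final feature-extractor parameters.
    return W


# Toy usage on synthetic data: three Gaussian blobs in four dimensions.
toy_rng = np.random.default_rng(42)
X_toy = np.concatenate([toy_rng.normal(loc=c, size=(40, 4)) for c in range(3)])
T_toy = np.repeat(np.arange(3), 40)
W_star = foca_sketch(X_toy, T_toy, n_classes=3)
```

Because a new weak classifier is fitted from scratch on a different small subset at every iteration, the feature extractor never gets to co-adapt to one particular classifier, which is the intuition the Q&A above was probing. Every class needs at least `k` samples for step 2a to work.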