# Place for experimenting with the progressive design

In [1]:
import os,sys
import yaml
import inspect
import importlib

sys.path.append('..')

import model_discovery.utils as U
from model_discovery.configs.gam_config import GAMConfig, GAMConfig_14M
from model_discovery.model.composer import GABTree,ROOT_UNIT_TEMPLATE,GAUBase
# from model_discovery.evolution import  BuildEvolution

ckpt_dir = os.environ['CKPT_DIR']
db_dir = U.pjoin(ckpt_dir, 'test_composer', 'db')
test_tree = GABTree('TestTree', db_dir)

prompts_dir='../model_discovery/agents/prompts/'
gab_py = U.read_file(U.pjoin(prompts_dir,'gab_template.py'))
gam_py = U.read_file(U.pjoin(prompts_dir,'gam_prompt.py'))
GAU_TEMPLATE = U.read_file(U.pjoin(prompts_dir,'gau_template.py'))
GAU_BASE=inspect.getsource(GAUBase)



In [4]:
test_tree.path
test_tree.get_source('TestTree')

'# TestTree.py\n\nimport torch\nimport torch.nn as nn\n\nfrom model_discovery.model.utils.modules import GABUnit # DO NOT CHANGE THIS IMPORT STATEMENT #\n\n\n# YOU CAN IMPORT MORE MODULES HERE #\n\n# YOU CAN DEFINE MORE CLASSES OR FUNCTIONS HERE #\n\n\nclass TestTree(GABUnit): \n    """Generalized Autoregressive Block\n        Input:        X: (batch, seqlen, embed_dim), Z: {dict of all current intermediate variables}\n        Output:       Y: (batch, seqlen, embed_dim), Z_: Optional, {dict of *new* intermediate variables to update the current Z}\n        Constraints:  Causal, differentiable, parameter number, complexity, parallelizable\n    """\n    def __init__(self, embed_dim: int, device=None, dtype=None,**kwargs): # YOU CAN ADD MORE ARGUMENTS, BUT YOU HAVE TO HAVE embed_dim, device, dtype AS THE ARGUTMENTS #\n        # argv: list of hyperparameters\n        factory_kwargs = {"device": device, "dtype": dtype} # remember to pass it to all nn layers\n        super().__init__(embed_di

In [3]:
import model_discovery.agents.prompts.prompts as P
importlib.reload(P)

gu_system_prompt=P.GU_DESIGNER_SYSTEM_prompt.format(GAB_BASE=P.GAB_BASE,GAM_PY=gam_py,GAU_BASE=GAU_BASE,GAU_TEMPLATE=GAU_TEMPLATE)

print(gu_system_prompt)



You are a professional AI researcher focusing on discovering the best
autoregressive language model block. Your goal is to design a novel block
following the Generalized Autoregressive Block (GAB) structure defined in the
following base class:

```python 
class GABBase(nn.Module):
    """ Base class for Generalized Autoregressive Block """
    def __init__(self,embed_dim: int, block_loc: tuple): 
        super().__init__()
        self.embed_dim = embed_dim
        self.block_loc = block_loc # location of a block within the network, (layer_idx, n_block)

    def _forward(self,X,**kwargs): 
        raise NotImplementedError
     
    # YOU ARE NOT ALLOWED TO OVERRIDE THIS METHOD #
    def forward(self,X,**Z):
        """Forward pass of the model"""
        assert X.shape[-1] == self.embed_dim
        Y_=self._forward(X,**Z)
        if isinstance(Y_,tuple):
            Y, Z = Y_
        else:
            Y, Z = Y_, {}
        assert Y.shape == X.shape
        return Y, Z
 ```


The GAB will be us
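
The system prompt above is assembled with `str.format`, which treats every `{name}` in the template as a field to fill. A minimal sketch of that behaviour, assuming a made-up template (`TEMPLATE` and `gab_base_src` are hypothetical stand-ins, not the contents of `prompts.py`):

```python
# Hypothetical template: the real GU_DESIGNER_SYSTEM_prompt is not shown here.
TEMPLATE = (
    "Design a block following this base class:\n"
    "{GAB_BASE}\n"
    "Report hyperparameters as: config {{ ... }}\n"
)

# Stand-in source code that itself contains braces.
gab_base_src = "class GABBase:\n    Z = {}\n"

# Named fields are replaced; doubled braces in the template become literal braces.
prompt = TEMPLATE.format(GAB_BASE=gab_base_src)
print(prompt)
```

Braces inside the substituted values are inserted verbatim, so source code passed as an argument needs no escaping. Only literal braces written in the template itself have to be doubled.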

## Parsers

In [11]:
import re

test_output={'text': '### Intuitions and Analysis\n\nIn designing a novel autoregressive block, we aim to create a structure that is both innovative and powerful, capable of outperforming existing state-of-the-art models. The core idea is to leverage a combination of attention mechanisms, feedforward networks, and gating mechanisms to enhance the model\'s ability to capture complex dependencies in the data. \n\n1. **Attention Mechanism**: While attention mechanisms are not new, we can innovate by introducing a dynamic attention mechanism that adapts based on the input sequence characteristics. This can help the model focus on the most relevant parts of the sequence, improving efficiency and accuracy.\n\n2. **Feedforward Networks**: We can enhance the traditional feedforward networks by incorporating non-linear transformations and residual connections, allowing the model to learn more complex patterns.\n\n3. **Gating Mechanisms**: Introducing gating mechanisms can help control the flow of information through the network, allowing the model to dynamically adjust its behavior based on the input. This can improve robustness and scalability.\n\n4. **Scalability and Efficiency**: The design should ensure that the model scales well with increasing data sizes and remains efficient in terms of computational resources.\n\n### Rough Plan for Children GABUnits\n\n1. **DynamicAttentionUnit**: A unit that implements a dynamic attention mechanism, adapting based on input characteristics.\n2. **EnhancedFeedforwardUnit**: A unit that incorporates non-linear transformations and residual connections in the feedforward network.\n3. **GatingMechanismUnit**: A unit that introduces gating mechanisms to control information flow.\n\n### Pseudo Code\n\n```python\nclass DynamicAttentionUnit(GABUnit):\n    def __init__(self, embed_dim, **kwargs):\n        # Initialize attention mechanism\n        pass\n\n    def _forward(self, X, **Z):\n        # Compute dynamic attention\n        return Y, Z_\n\nclass EnhancedFeedforwardUnit(GABUnit):\n    def __init__(self, embed_dim, **kwargs):\n        # Initialize feedforward network with non-linear transformations\n        pass\n\n    def _forward(self, X, **Z):\n        # Apply feedforward transformations\n        return Y, Z_\n\nclass GatingMechanismUnit(GABUnit):\n    def __init__(self, embed_dim, **kwargs):\n        # Initialize gating mechanisms\n        pass\n\n    def _forward(self, X, **Z):\n        # Apply gating mechanisms\n        return Y, Z_\n```\n\n### Name of the GABUnit\n\n```unit_name {AdaptiveGAB}```\n\n### Full Implementation\n\n```python\n# GAB_UNIT_IMPLEMENTATION\n\nimport torch\nimport torch.nn as nn\n\nfrom model_discovery.model.utils.modules import GABUnit\n\n# YOU CAN IMPORT MORE MODULES HERE #\n\n# YOU CAN DEFINE MORE CLASSES OR FUNCTIONS HERE #\n\nclass AdaptiveGAB(GABUnit): \n    """Generalized Autoregressive Block\n        Input:        X: (batch, seqlen, embed_dim), Z: {dict of all current intermediate variables}\n        Output:       Y: (batch, seqlen, embed_dim), Z_: Optional, {dict of *new* intermediate variables to update the current Z}\n        Constraints:  Causal, differentiable, parameter number, complexity, parallelizable\n    """\n    def __init__(self, embed_dim: int, device=None, dtype=None, **kwargs): \n        factory_kwargs = {"device": device, "dtype": dtype}\n        super().__init__(embed_dim)\n        \n        # Define the sub-units\n        self.dynamic_attention: GABUnit = DynamicAttentionUnit(embed_dim, **factory_kwargs)\n        self.enhanced_feedforward: GABUnit = EnhancedFeedforwardUnit(embed_dim, **factory_kwargs)\n        self.gating_mechanism: GABUnit = GatingMechanismUnit(embed_dim, **factory_kwargs)\n\n    def _forward(self, X, **Z): \n        # Apply dynamic attention\n        X, Z = self.dynamic_attention(X, **Z)\n        \n        # Apply enhanced feedforward network\n        X, Z = self.enhanced_feedforward(X, **Z)\n        \n        # Apply gating mechanisms\n        X, Z = self.gating_mechanism(X, **Z)\n        \n        return X, Z\n```\n\n### Config\n\n```config {\n    # ADD HYPERPARAMETERS HERE #\n    "attention_heads": 8,\n    "feedforward_dim": 2048,\n    "gating_type": "sigmoid"\n} ``` \n\nThis design introduces a novel combination of dynamic attention, enhanced feedforward networks, and gating mechanisms, aiming to improve the model\'s ability to capture complex dependencies while maintaining efficiency and scalability. The next steps will involve implementing the placeholder units and refining the design based on experimental results.', '_details': {'cost': 0.0, 'running_cost': 0}}
raw_text=test_output['text']
print(raw_text)

### Intuitions and Analysis

In designing a novel autoregressive block, we aim to create a structure that is both innovative and powerful, capable of outperforming existing state-of-the-art models. The core idea is to leverage a combination of attention mechanisms, feedforward networks, and gating mechanisms to enhance the model's ability to capture complex dependencies in the data. 

1. **Attention Mechanism**: While attention mechanisms are not new, we can innovate by introducing a dynamic attention mechanism that adapts based on the input sequence characteristics. This can help the model focus on the most relevant parts of the sequence, improving efficiency and accuracy.

2. **Feedforward Networks**: We can enhance the traditional feedforward networks by incorporating non-linear transformations and residual connections, allowing the model to learn more complex patterns.

3. **Gating Mechanisms**: Introducing gating mechanisms can help control the flow of information through the netw

In [9]:
codes = re.findall(r"```python(.*?)```", raw_text, re.DOTALL)
unit_name = re.findall(r"```unit_name(.*?)```", raw_text, re.DOTALL)
config = re.findall(r"```config(.*?)```", raw_text, re.DOTALL)
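
The three `re.findall` calls above return raw strings that still carry surrounding whitespace, and the unit name still carries its braces. A minimal sketch of one way to turn them into usable values, assuming the reply follows the layout of `test_output` (`extract_blocks` and `parse_design_output` are hypothetical helpers, not part of `model_discovery`):

```python
import ast
import re

FENCE = "`" * 3  # built programmatically so the example nests cleanly in Markdown


def extract_blocks(raw_text: str, tag: str) -> list:
    """Return the stripped body of every fenced block opened with `tag`."""
    pattern = re.escape(FENCE + tag) + r"(.*?)" + re.escape(FENCE)
    return [m.strip() for m in re.findall(pattern, raw_text, re.DOTALL)]


def parse_design_output(raw_text: str) -> dict:
    """Hypothetical helper: collect code blocks, unit name, and config."""
    codes = extract_blocks(raw_text, "python")
    names = extract_blocks(raw_text, "unit_name")
    configs = extract_blocks(raw_text, "config")

    # The name arrives as "{AdaptiveGAB}", so drop the braces and whitespace.
    unit_name = names[0].strip("{} \n") if names else None
    # The config is a Python-style dict literal that may contain '#' comments,
    # which json.loads would reject; ast.literal_eval accepts them.
    config = ast.literal_eval(configs[0]) if configs else {}
    return {"codes": codes, "unit_name": unit_name, "config": config}


# Small stand-in reply with the same three block types as test_output.
sample = (
    f"{FENCE}python\nclass Foo: pass\n{FENCE}\n"
    f"{FENCE}unit_name {{AdaptiveGAB}}{FENCE}\n"
    f'{FENCE}config {{\n    # ADD HYPERPARAMETERS HERE #\n    "attention_heads": 8\n}} {FENCE}\n'
)
parsed = parse_design_output(sample)
print(parsed["unit_name"], parsed["config"])
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for text produced by a model. It raises on malformed input, which a real parser would need to catch.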


In [7]:
testtext={'text': '# Proposal: Scalable and Efficient Generalized Autoregressive Block (SE-GAB)\n\n## 1. Title: Scalable and Efficient Generalized Autoregressive Block (SE-GAB)\n\n## 2. Motivation\n\nThe current landscape of autoregressive models, particularly those based on the Transformer architecture, faces challenges in terms of scalability, efficiency, and handling long sequences. The quadratic complexity of self-attention mechanisms limits their applicability to long sequences, and while alternative architectures like linear attention and state space models exist, they often underperform in terms of pretraining efficiency and downstream task accuracy. Inspired by recent advancements such as the Megalodon architecture, which introduces efficient sequence modeling with unlimited context length, this proposal aims to design a novel Generalized Autoregressive Block (GAB) that addresses these challenges.\n\n## 3. Problem Analysis\n\nThe primary issues with existing autoregressive models include:\n\n- **Quadratic Complexity**: The self-attention mechanism in Transformers has a quadratic complexity with respect to the sequence length, making it computationally expensive for long sequences.\n- **Weak Length Extrapolation**: Transformers struggle with extrapolating to sequences longer than those seen during training.\n- **Inefficiency in Pretraining**: Sub-quadratic solutions often underperform in terms of pretraining efficiency and accuracy.\n- **Scalability**: As models grow in size, maintaining efficiency and performance becomes increasingly challenging.\n\n## 4. Core Idea and Philosophy\n\nThe core idea behind SE-GAB is to create a scalable and efficient autoregressive block that can handle long sequences without sacrificing performance. The design philosophy is centered around:\n\n- **Efficiency**: Incorporating mechanisms that reduce computational complexity while maintaining or improving performance.\n- **Scalability**: Designing the architecture to scale effectively with model size and sequence length.\n- **Robustness**: Ensuring the model is robust to variations in input sequence length and can generalize well to unseen data.\n- **Modularity**: Building the GAB as a composition of smaller, efficient units (GAUs) that can be easily modified or extended.\n\n## 5. Plan of the Design\n\n### 5.1. Efficient Attention Mechanism\n\n- **Chunked Attention**: Implement a chunked attention mechanism to reduce the computational complexity of self-attention. This involves dividing the sequence into smaller chunks and applying attention within each chunk.\n- **Rotary Positional Embeddings**: Use rotary positional embeddings to enhance the model\'s ability to handle long sequences and improve extrapolation capabilities.\n\n### 5.2. Moving Average Gated Attention\n\n- **Exponential Moving Average (EMA)**: Integrate an EMA mechanism to capture long-term dependencies efficiently.\n- **Gated Attention**: Use gated attention to control the flow of information and improve model stability.\n\n### 5.3. Normalized Feedforward Network\n\n- **Layer Normalization**: Apply layer normalization to stabilize training and improve convergence.\n- **Feedforward Network with SWIGLU**: Implement a feedforward network with the SWIGLU activation function to enhance non-linear transformations.\n\n### 5.4. Modular Design with GAUs\n\n- **Nested GAUs**: Design the GAB as a composition of nested GAUs, each responsible for a specific function (e.g., attention, normalization, feedforward).\n- **Intermediate Variable Management**: Efficiently manage intermediate variables to facilitate information flow between GAUs.\n\n## 6. Conclusion\n\nThe proposed SE-GAB aims to address the limitations of current autoregressive models by introducing a scalable and efficient architecture. By leveraging techniques such as chunked attention, EMA, and modular design, SE-GAB is expected to achieve lower perplexity, higher accuracy, and better scalability. This design will serve as a foundation for future advancements in autoregressive models, enabling them to handle increasingly complex tasks and longer sequences.\n\n## 7. References\n\n- Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, Chunting Zhou. "MEGALODON: Efficient LLM Pretraining and Inference with Unlimited Context Length." arXiv.org, 2024.', '_details': {'cost': 0.0, 'running_cost': 0}}


print(testtext['text'])

# Proposal: Scalable and Efficient Generalized Autoregressive Block (SE-GAB)

## 1. Title: Scalable and Efficient Generalized Autoregressive Block (SE-GAB)

## 2. Motivation

The current landscape of autoregressive models, particularly those based on the Transformer architecture, faces challenges in terms of scalability, efficiency, and handling long sequences. The quadratic complexity of self-attention mechanisms limits their applicability to long sequences, and while alternative architectures like linear attention and state space models exist, they often underperform in terms of pretraining efficiency and downstream task accuracy. Inspired by recent advancements such as the Megalodon architecture, which introduces efficient sequence modeling with unlimited context length, this proposal aims to design a novel Generalized Autoregressive Block (GAB) that addresses these challenges.

## 3. Problem Analysis

The primary issues with existing autoregressive models include:

- **Quadratic Co