In [2]:
%%capture
pip install -q -U google-generativeai

In [12]:
from google.generativeai import caching
import google.generativeai as genai
import os
import time
import datetime

from dotenv import load_dotenv

load_dotenv()

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

In [7]:
file_name = "weekly-ai-papers.txt"
file = genai.upload_file(path=file_name)

In [9]:
# Wait for the file to finish processing
while file.state.name == "PROCESSING":
    print('Waiting for file to be processed.')
    time.sleep(2)
    # Re-fetch the file so the loop condition sees the updated state
    file = genai.get_file(file.name)
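
A polling loop like the one above can spin forever if processing stalls. The sketch below is one possible bounded variant, not part of the SDK: `wait_until_processed`, `timeout_s`, `interval_s`, and the injected `sleep` are hypothetical names introduced here, and it assumes the file object exposes `state.name` and `name` as used in this notebook, with `"FAILED"` as the failure state.

```python
import time
from typing import Any, Callable


def wait_until_processed(
    file: Any,
    get_file: Callable[[str], Any],
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Poll until the file leaves PROCESSING, or raise once the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while file.state.name == "PROCESSING":
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{file.name} still processing after {timeout_s}s")
        sleep(interval_s)
        # Re-fetch so the loop condition sees the updated state
        file = get_file(file.name)
    if file.state.name == "FAILED":  # assumed failure state name
        raise ValueError(f"processing failed for {file.name}")
    return file
```

In this notebook it would be called as `file = wait_until_processed(file, genai.get_file)`. Passing the fetch function in as an argument keeps the helper easy to exercise without network access.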

In [11]:
print(f'File processing complete: {file.uri}')

File processing complete: https://generativelanguage.googleapis.com/v1beta/files/n146hu3zpxvv


In [21]:
# Create a cache with a 15 minute TTL
cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",
    display_name="ml papers of the week", # used to identify the cache
    system_instruction="You are an expert AI researcher, and your job is to answer user's query based on the file you have access to.",
    contents=[file],
    ttl=datetime.timedelta(minutes=15),
)

# create the model
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
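
Because the cache is created with a TTL, it is only usable for a limited window after creation. The sketch below does the local arithmetic for that window so a long-running session can decide when to recreate the cache. `cache_expiry` and `remaining_ttl` are hypothetical helpers, not SDK calls, and the creation timestamp is assumed to be recorded by the caller; the service remains the source of truth for the actual expiry.

```python
import datetime
from typing import Optional


def cache_expiry(created_at: datetime.datetime, ttl: datetime.timedelta) -> datetime.datetime:
    """Return the moment a cache created at `created_at` with `ttl` expires."""
    return created_at + ttl


def remaining_ttl(
    created_at: datetime.datetime,
    ttl: datetime.timedelta,
    now: Optional[datetime.datetime] = None,
) -> datetime.timedelta:
    """Return the time left before expiry, clamped at zero once expired."""
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc)
    return max(cache_expiry(created_at, ttl) - now, datetime.timedelta(0))


# Example: record the creation time next to the create() call above
created_at = datetime.datetime.now(datetime.timezone.utc)
ttl = datetime.timedelta(minutes=15)
if remaining_ttl(created_at, ttl) == datetime.timedelta(0):
    print("Cache has expired; recreate it before querying.")
```

Timezone-aware UTC datetimes are used throughout so the subtraction never mixes naive and aware values.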

In [22]:
# query the model
response = model.generate_content(["Can you please tell me the latest AI papers of the week?"])

print(response.text)

The latest AI papers of the week, according to the file provided, are from **June 3 - June 9, 2024**. 

Here is a summary:

1. **NLLB**: This paper proposes a massive multilingual model that leverages transfer learning across 200 languages. It achieves a significant improvement in translation quality. 
2. **Extracting Concepts from GPT-4**: This paper presents a new method to extract interpretable patterns from GPT-4, making the model more understandable and predictable.
3. **Mamba-2**: This paper introduces an enhanced architecture combining state space models (SSMs) and structured attention, leading to improved performance on tasks requiring large state capacity.
4. **MatMul-free LLMs**: This paper proposes an implementation that eliminates matrix multiplication operations from LLMs, achieving significant memory reduction while maintaining performance.
5. **Buffer of Thoughts**: This paper presents a new prompting technique to enhance LLM-based reasoning, improving accuracy and effic

In [18]:
response = model.generate_content(["Can you list the papers that mention Mamba? List the title of the paper and summary."])
print(response.text)

Here are the papers mentioned in the document that discuss Mamba:

* **Mamba-2** - a new architecture that combines state space models (SSMs) and structured attention; it uses 8x larger states and trains 50% faster; the new state space duality layer is more efficient and scalable compared to the approach used in Mamba; it also improves results on tasks that require large state capacity. 

* **MoE-Mamba** -  an approach to efficiently scale LLMs by combining state space models (SSMs) with Mixture of Experts (MoE); MoE-Mamba, outperforms both Mamba and Transformer-MoE; it reaches the same performance as Mamba in 2.2x less training steps while preserving the inference performance gains of Mamba against the Transformer. 

* **MambaByte** - adapts Mamba SSM to learn directly from raw bytes; bytes lead to longer sequences which autoregressive Transformers will scale poorly on; this work reports huge benefits related to faster inference and even outperforms subword Transformers. 



In [23]:
response = model.generate_content(["What are some of the innovations around long context LLMs? List the title of the paper and summary."])
print(response.text)

Here are some of the innovations around long-context LLMs from the papers listed:

**1. Leave No Context Behind** 
* **Paper:** Leave No Context Behind
* **Summary:**  This paper proposes Infini-attention, an attention mechanism that incorporates a compressive memory module into a vanilla attention mechanism, enabling Transformers to effectively process infinitely long inputs with bounded memory footprint and computation.

**2. DeepSeek-V2**
* **Paper:** DeepSeek-V2
* **Summary:** A 236B parameter Mixture-of-Experts (MoE) model that supports a context length of 128K tokens. It uses Multi-head Latent Attention (MLA) for efficient inference by compressing the Key-Value (KV) cache into a latent vector.

**3. Make Your LLM Fully Utilize the Context**
* **Paper:** Make Your LLM Fully Utilize the Context
* **Summary:** This paper presents an approach to overcome the "lost-in-the-middle" challenge in LLMs. It applies an "information-intensive" training procedure to enable the LLM to fully uti