## 3. `ConversationSummaryMemory`
- **Purpose**: Maintains a **running summary of the conversation**, updated after each turn.
- **Use‑case**: Long conversations where full verbatim memory is impractical.
- **Main difference**: Compresses past turns into a concise summary instead of storing them verbatim, which keeps token usage roughly bounded as the conversation grows.
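
The idea behind the bullets above can be sketched without any LLM at all. This is a toy illustration, not LangChain's implementation: `RunningSummaryMemory` and `truncating_summarizer` are hypothetical names, and the "summarizer" simply keeps the last 80 characters where a real one would call a model.

```python
from typing import Callable


class RunningSummaryMemory:
    """Toy summary memory: stores one summary string, never the full transcript."""

    def __init__(self, summarize: Callable[[str, str], str]) -> None:
        # Hypothetical callback: (old_summary, new_lines) -> new_summary
        self.summarize = summarize
        self.summary = ""

    def save_turn(self, user: str, assistant: str) -> None:
        # Fold the newest turn into the existing summary.
        new_lines = f"Human: {user}\nAI: {assistant}"
        self.summary = self.summarize(self.summary, new_lines)

    def load(self) -> str:
        return self.summary


def truncating_summarizer(old_summary: str, new_lines: str, max_chars: int = 80) -> str:
    """Placeholder for the LLM call: keeps only the last max_chars characters."""
    combined = (old_summary + " " + new_lines.replace("\n", " ")).strip()
    return combined[-max_chars:]


memory_sketch = RunningSummaryMemory(truncating_summarizer)
for i in range(5):
    memory_sketch.save_turn(f"question {i}", f"a fairly long answer number {i}")

# The stored text never exceeds 80 characters, however many turns are saved.
print(len(memory_sketch.load()))
```

A verbatim buffer would grow with every turn; here the stored state is capped by whatever the summarizer returns, which is the trade-off the notebook below makes with a real model.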

In [16]:
# ================================
# Step 1: Install dependencies
# ================================
!pip install -q langchain langchain-groq langchain-community langchain-openai

In [25]:
# ================================
# Step 2: Import libraries
# ================================
from langchain_groq import ChatGroq
from langchain_openai import ChatOpenAI
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
    MessagesPlaceholder,
)
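# Note: `langchain.memory` is a legacy module that newer LangChain releases
# deprecate; pinning an older `langchain` version may be needed if this import fails.
from langchain.memory import ConversationSummaryMemory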
from IPython.display import Markdown, display
from google.colab import userdata

In [26]:
# ================================
# Step 3: Setup LLMs (main + summarizer)
# ================================
api_key = userdata.get("GROQ_API_KEY")
openai_key = userdata.get("OPENAI_API_KEY3")

llm = ChatGroq(
    model="llama-3.3-70b-versatile",
    api_key=api_key,
    temperature=0.3,
)

summarizer_llm = ChatOpenAI(
    model="gpt-4o-mini",  # You can also use "gpt-3.5-turbo"
    api_key=openai_key,
    temperature=0.3,
)

In [27]:
# ================================
# Step 4: Define Prompt Template
# ================================
system_msg = SystemMessagePromptTemplate.from_template(
    "You are an intelligent assistant. Use the provided conversation summary to inform your responses."
)
human_msg = HumanMessagePromptTemplate.from_template("{input}")

chat_prompt = ChatPromptTemplate.from_messages([
    system_msg,
    MessagesPlaceholder(variable_name="history"),
    human_msg
])

In [28]:
# ================================
# Step 5: Setup ConversationSummaryMemory
# ================================
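# This memory asks summarizer_llm to fold each new turn into a running summary,
# so the prompt stays short. With return_messages=True the summary comes back
# as a one-element message list, which is what MessagesPlaceholder expects.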
memory = ConversationSummaryMemory(
    llm=summarizer_llm,
    return_messages=True
)

In [23]:
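# # ================================
# # Step 6: Combine LLM, Prompt, and Memory (disabled)
# # ================================
# # Note: this cell would not run as written. RunnableWithMessageHistory is not
# # imported above, and its history factory must return a chat message history
# # object rather than a ConversationSummaryMemory. Step 7 wires the memory in
# # by hand instead.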
# chat_chain = chat_prompt | llm

# chatbot = RunnableWithMessageHistory(
#     chat_chain,
#     lambda session_id: memory,
#     input_messages_key="input",
#     history_messages_key="history"
# )

In [29]:
# ================================
# Step 7: Start the Chat Loop
# ================================
session_id = "chat-summary-session"  # not used by the manual loop below
user_inputs = [
    "Hey! Can you explain what gradient descent is?",
    "Cool. And how is it different from stochastic gradient descent?",
    "Which one is better for large datasets?"
]

for input_text in user_inputs:
    # Step 1: Load past memory (summary history)
    memory_variables = memory.load_memory_variables({})

    # Step 2: Format prompt with memory + new input
    messages = chat_prompt.format_messages(
        input=input_text,
        history=memory_variables["history"]
    )

    # Step 3: Invoke LLM
    response = llm.invoke(messages)

    # Step 4: Save this turn to memory
    memory.save_context(
        {"input": input_text},
        {"output": response.content}
    )

    # Step 5: Display result
    display(Markdown(f"### ❓ User: {input_text}"))
    display(Markdown(f"**🤖 Assistant:** {response.content}"))

### ❓ User: Hey! Can you explain what gradient descent is?

**🤖 Assistant:** Gradient descent is a fundamental concept in machine learning and optimization. It's an iterative algorithm used to minimize the loss or cost function of a model by adjusting its parameters.

**How it works:**

1. **Initialization**: The algorithm starts with an initial set of model parameters (e.g., weights and biases).
2. **Forward pass**: The model makes predictions using the current parameters.
3. **Loss calculation**: The difference between the predicted output and the actual output (target) is calculated, resulting in a loss or cost value.
4. **Backward pass**: The algorithm computes the gradient of the loss with respect to each model parameter. The gradient represents the direction of the steepest ascent.
5. **Parameter update**: The model parameters are updated by subtracting a fraction of the gradient (scaled by a learning rate) from the current parameters. This moves the parameters in the direction of the negative gradient, which is the direction of the steepest descent.
6. **Repeat**: Steps 2-5 are repeated until convergence or a stopping criterion is reached.

**Key aspects:**

* **Learning rate**: A hyperparameter that controls the step size of each update. A high learning rate can lead to fast convergence but may also cause oscillations.
* **Gradient**: The direction of the steepest ascent, which is used to update the parameters.
* **Convergence**: The algorithm stops when the loss stops improving or reaches a minimum value.

Gradient descent is a simple yet powerful algorithm for optimizing model parameters. However, it can be sensitive to the choice of learning rate, initialization, and other hyperparameters.

Do you have any specific questions about gradient descent or its applications?

### ❓ User: Cool. And how is it different from stochastic gradient descent?

**🤖 Assistant:** Stochastic Gradient Descent (SGD) is a variant of the traditional Gradient Descent (GD) algorithm. The key difference between the two lies in how they calculate the gradient of the loss function.

In traditional Gradient Descent, the gradient is calculated using the entire training dataset at once. This is known as batch gradient descent. The algorithm computes the gradient of the loss function with respect to the model's parameters using all the training examples, and then updates the parameters accordingly.

On the other hand, Stochastic Gradient Descent uses only one training example at a time to calculate the gradient. The algorithm iterates through the training dataset, one example at a time, and updates the parameters after each example. This process is repeated multiple times, with the algorithm cycling through the entire dataset.

There's also a third variant, known as Mini-Batch Gradient Descent, which falls somewhere in between. In this approach, the algorithm uses a small batch of training examples (typically between 10 and 1000) to calculate the gradient, rather than using the entire dataset or a single example.

The advantages of Stochastic Gradient Descent include:

* Faster computation, since the algorithm only needs to process one example at a time
* Ability to handle large datasets, since the algorithm doesn't need to load the entire dataset into memory
* Improved robustness to outliers, since the algorithm is less affected by individual noisy examples

However, Stochastic Gradient Descent can also have some disadvantages, such as:

* Noisier gradient estimates, since the algorithm is only using a single example to calculate the gradient
* Potential for slower convergence, since the algorithm may need to iterate through the dataset multiple times to achieve the same level of accuracy as traditional Gradient Descent

In general, the choice between Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent depends on the specific problem, dataset, and computational resources available. Do you have any further questions about these algorithms or their applications?

### ❓ User: Which one is better for large datasets?

**🤖 Assistant:** For large datasets, Stochastic Gradient Descent (SGD) or Mini-Batch Gradient Descent is generally better than traditional Gradient Descent. Here's why:

1. **Computational Efficiency**: Traditional Gradient Descent requires calculating the gradient of the loss function over the entire dataset, which can be computationally expensive and time-consuming for large datasets. SGD, on the other hand, calculates the gradient using one example at a time, making it much faster.
2. **Memory Usage**: Traditional Gradient Descent requires loading the entire dataset into memory, which can be a challenge for large datasets. SGD, by contrast, only needs to load one example at a time, making it more memory-efficient.
3. **Robustness to Outliers**: SGD is more robust to outliers in the data, as the gradient is calculated using one example at a time. This means that outliers will have less impact on the overall gradient calculation.

That being said, Mini-Batch Gradient Descent is often a good compromise between traditional Gradient Descent and SGD. It calculates the gradient using a small batch of examples (typically 32-128) at a time, which can provide a good balance between computational efficiency and accuracy.

In general, the choice between SGD and Mini-Batch Gradient Descent depends on the specific problem and dataset. If you have a very large dataset and limited computational resources, SGD might be a good choice. However, if you have a moderately sized dataset and want to balance computational efficiency with accuracy, Mini-Batch Gradient Descent might be a better option.

Do you have any further questions about these algorithms or their applications?
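
The three variants compared in the replies above differ only in how many examples feed each parameter update. A minimal sketch, assuming a one-parameter model `y = w * x` with squared-error loss; the synthetic data, learning rate, epoch count, and the helper names `gradient` and `train` are all illustrative choices, not part of the notebook.

```python
import random


def gradient(w, batch):
    """Mean gradient of (w*x - y)^2 with respect to w over the batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def train(data, batch_size, lr=0.05, epochs=50, seed=0):
    """Gradient descent where batch_size selects the variant."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        # One update per batch: the whole dataset, one example, or a small slice.
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            w -= lr * gradient(w, batch)
    return w


# Synthetic, noise-free data generated from y = 3x.
data = [(x / 10, 3 * x / 10) for x in range(1, 21)]

w_batch = train(data, batch_size=len(data))  # batch gradient descent
w_sgd = train(data, batch_size=1)            # stochastic gradient descent
w_mini = train(data, batch_size=4)           # mini-batch gradient descent

print(w_batch, w_sgd, w_mini)  # each should land close to the true slope 3
```

With the same number of epochs, the smaller batch sizes perform many more updates per pass over the data, which is the efficiency argument made in the last reply.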