<a href="https://github.com/amjadraza/ai-agents-collection/blob/main/Langchain/csv_agents_langchain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LangChain CSV Analysis Agent with Groq LLM

## Introduction
This Jupyter notebook implements a CSV analysis agent using LangChain and a Groq-hosted LLM (the `llama-3.3-70b-versatile` model). The agent answers natural-language questions about a salary dataset, so users can query the data conversationally without writing explicit code.

## Goals
1. **Data Analysis Automation**: Create an AI-powered agent capable of analyzing a salary dataset (`salaries_2023.csv`) using natural language queries.

2. **Enhanced Query Accuracy**: Implement a robust verification system through custom prompts that:
   - Requires multiple calculation methods for verification
   - Enforces data formatting standards (comma-separated numbers)
   - Demands explanation of methodologies used
   - Discourages hallucination by requiring answers to be based only on calculations over the available data

3. **Interactive Interface**: Provide two ways to interact with the data:
   - Direct Python interface for programmatic access
   - Streamlit web interface for user-friendly interaction (currently commented out)

## Key Features
- Integration with Groq's LLM through LangChain
- Pandas DataFrame agent for data manipulation
- Custom prompt engineering for accurate responses
- Built-in error handling and result verification
- Markdown formatting for clear result presentation
- Support for complex analytical queries about salary data, including:
  - Departmental salary analysis
  - Gender pay comparison
  - Grade-based salary analysis
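
To make the listed analyses concrete, here is a rough sketch of the kind of pandas computation the agent is expected to generate for them. The column names (`Department`, `Gender`, `Base_Salary`, `Grade`) match the dataset preview printed in the agent output; the rows are made-up illustrative values, not real data. The sketch also mirrors the prompt's "try another method" rule by computing the department average two ways.

```python
import pandas as pd

# Made-up sample rows for illustration only; not taken from salaries_2023.csv.
df = pd.DataFrame(
    {
        "Department": ["ABS", "ABS", "POL", "POL"],
        "Gender": ["M", "F", "M", "F"],
        "Base_Salary": [100000.0, 80000.0, 120000.0, 110000.0],
        "Grade": ["M2", "N21", "M2", "N21"],
    }
)

# Method 1: group by department and average the base salary.
dept_avg = df.groupby("Department")["Base_Salary"].mean()

# Method 2: the same figure via a pivot table, as an independent cross-check.
dept_avg_check = df.pivot_table(
    values="Base_Salary", index="Department", aggfunc="mean"
)["Base_Salary"]

top_dept = dept_avg.idxmax()
assert top_dept == dept_avg_check.idxmax()  # both methods must agree
print(f"{top_dept}: {dept_avg.max():,.2f}")

# Gender pay comparison and grade-based averages follow the same pattern.
gender_avg = df.groupby("Gender")["Base_Salary"].mean()
grade_avg = df.groupby("Grade")["Base_Salary"].mean()
print(gender_avg)
print(grade_avg.idxmax())
```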

In [13]:
%pip install -qU pyodbc tabulate langchain langchain-community langchain-core langchain-experimental langchain-groq groq


In [11]:
from langchain.schema import HumanMessage, SystemMessage
from langchain_groq import ChatGroq

import os
from google.colab import userdata
os.environ["GROQ_API_KEY"] = userdata.get('GROQ_API_KEY')
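
`google.colab.userdata` is only available inside Colab. As a hedged sketch for running the same cell elsewhere, a hypothetical helper `load_groq_api_key` (not part of the original notebook) could try Colab secrets first and fall back to an existing `GROQ_API_KEY` environment variable:

```python
import os


def load_groq_api_key() -> str:
    """Return the Groq API key from Colab secrets or the environment."""
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata

        key = userdata.get("GROQ_API_KEY")
    except Exception:
        # Outside Colab (or when the secret is unavailable), use the environment.
        key = os.environ.get("GROQ_API_KEY")
    if not key:
        raise RuntimeError("GROQ_API_KEY is not set")
    return key
```

With this helper, the assignment above would become `os.environ["GROQ_API_KEY"] = load_groq_api_key()`.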

In [12]:
llm_name = "llama-3.3-70b-versatile"
model = ChatGroq(model=llm_name)

In [18]:
import pandas as pd
from langchain.agents import AgentExecutor
# read csv file
df = pd.read_csv("/content/salaries_2023.csv").fillna(value=0)

# print(df.head())

from langchain_experimental.agents.agent_toolkits import (
    create_pandas_dataframe_agent,
    create_csv_agent,
)

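agent = create_pandas_dataframe_agent(
    llm=model,
    df=df,
    verbose=True,
    # Required opt-in: the agent executes LLM-generated Python via a REPL tool.
    allow_dangerous_code=True,
    # Forwarded to the underlying AgentExecutor so output-parsing errors are
    # sent back to the LLM instead of raising.
    agent_executor_kwargs={"handle_parsing_errors": True},
)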
# res = agent.invoke("how many rows are there in the dataframe?")

# print(res)

# Then add a prefix and a suffix prompt around the question
CSV_PROMPT_PREFIX = """
First set the pandas display options to show all the columns,
get the column names, then answer the question.
"""

CSV_PROMPT_SUFFIX = """
- **ALWAYS** before giving the Final Answer, try another method.
Then reflect on the answers of the two methods you did and ask yourself
if it answers correctly the original question.
If you are not sure, try another method.
FORMAT 4 FIGURES OR MORE WITH COMMAS.
- If the methods tried do not give the same result, reflect and
try again until you have two methods that have the same result.
- If you still cannot arrive at a consistent result, say that
you are not sure of the answer.
- If you are sure of the correct answer, create a beautiful
and thorough response using Markdown.
- **DO NOT MAKE UP AN ANSWER OR USE PRIOR KNOWLEDGE,
ONLY USE THE RESULTS OF THE CALCULATIONS YOU HAVE DONE**.
- **ALWAYS**, as part of your "Final Answer", explain how you got
to the answer in a section that starts with: "\n\nExplanation:\n".
In the explanation, mention the column names that you used to get
to the final answer.
"""
QUESTION = "Which department makes the most on average and give the actual amount?"

# Alternative question to try:
# Which grade has the highest average base salary, and compare the average female pay vs male pay?

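# create_pandas_dataframe_agent already returns an AgentExecutor, so `agent`
# can be invoked directly without wrapping it in a second AgentExecutor.
# Executor options such as handle_parsing_errors=True belong in the
# agent_executor_kwargs argument of create_pandas_dataframe_agent.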

res = agent.invoke(CSV_PROMPT_PREFIX + QUESTION + CSV_PROMPT_SUFFIX)

# print(f"Final result: {res['output']}")

# import streamlit as st

# st.title("Database AI Agent with LangChain")

# st.write("### Dataset Preview")
# st.write(df.head())

# # User input for the question
# st.write("### Ask a Question")
# question = st.text_input(
#     "Enter your question about the dataset:",
#     "Which grade has the highest average base salary, and compare the average female pay vs male pay?",
# )

# # Run the agent and display the result
# if st.button("Run Query"):
#     QUERY = CSV_PROMPT_PREFIX + question + CSV_PROMPT_SUFFIX
#     res = agent.invoke(QUERY)
#     st.write("### Final Answer")
#     st.markdown(res["output"])



> Entering new AgentExecutor chain...
Question: Which department makes the most on average and give the actual amount?
Thought: First, we need to set the pandas display options to show all the columns, then get the column names.
Action: python_repl_ast
Action Input:
```python
import pandas as pd
pd.set_option('display.max_columns', None)
print(df.head())
print(df.columns)
```
  Department            Department_Name                        Division  \
0        ABS  Alcohol Beverage Services           ABS 85 Administration   
1        ABS  Alcohol Beverage Services           ABS 85 Administration   
2        ABS  Alcohol Beverage Services           ABS 85 Administration   
3        ABS  Alcohol Beverage Services  ABS 85 Administrative Services   
4        ABS  Alcohol Beverage Services  ABS 85 Administrative Services   

  Gender  Base_Salary  Overtime_Pay  Longevity_Pay Grade  
0      M   175873.000          0.00            0.0    M2  
1      M   145613.360     