# CSV Analyzer


A generative question-and-answer (QnA) analyzer for CSV files. The focus is on small CSV files (hundreds or thousands of rows) that users would traditionally analyze in a spreadsheet but want to explore more interactively.

Today you can drop a CSV into your LLM of choice and get answers to some complex questions that rely on language tasks such as topic categorization, sentiment analysis, and data sampling. Where LLMs usually fail is in computing over your data, i.e. filtering, aggregation, counts, etc.
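To make the computation side concrete, here is a minimal sketch of the filtering, counting, and aggregation that pandas handles deterministically. The five rows are illustrative only, copied from the sample output shown later in this notebook; the variable names are assumptions.

```python
import pandas as pd

# Illustrative rows only, taken from the sample output further down in this notebook.
df = pd.DataFrame(
    {
        "comment": [
            "The product arrived damaged.",
            "I'm very happy with the service!",
            "I have a question about my order.",
            "The delivery was late.",
            "The customer support agent was very helpful.",
        ],
        "country": ["US", "Mexico", "Canada", "US", "Canada"],
    }
)

# Filtering: keep only the rows where country is "US".
us_rows = df[df["country"] == "US"]

# Counting: number of rows that matched the filter.
us_count = len(us_rows)

# Aggregation: number of comments per country.
per_country = df.groupby("country").size()

# Text match: rows whose comment contains "product" (case-insensitive).
product_count = int(df["comment"].str.contains("product", case=False).sum())

print(us_count, per_country.to_dict(), product_count)
```

These are the kinds of exact answers that are easy to get wrong when a model only reads the rows as text.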

To overcome this drawback, we take an agentic approach and give the agent access to a pandas runner, which lets it run computations via pandas commands. The agent uses the ReAct framework to think through the problem and access the pandas DataFrame only when necessary.
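The Thought / Action / Observation loop behind ReAct can be sketched without any LLM at all. The toy below is not LangChain's implementation: `scripted_model` and `row_count_tool` are hypothetical stand-ins, and the tool returns a fixed string instead of reading a file.

```python
import json


def row_count_tool(action_input: str) -> str:
    """Hypothetical stand-in for a pandas-backed tool; returns a fixed answer."""
    args = json.loads(action_input)
    return f"{args['csv_file_path']} has 3 rows"


TOOLS = {"row_count": row_count_tool}


def scripted_model(scratchpad: str) -> str:
    """Hypothetical stand-in for the LLM: requests a tool once, then answers."""
    if "Observation:" not in scratchpad:
        return (
            "Thought: Do I need to use a tool? Yes\n"
            "Action: row_count\n"
            'Action Input: {"csv_file_path": "example.csv"}'
        )
    observation = scratchpad.rsplit("Observation: ", 1)[1].strip()
    return f"Thought: Do I need to use a tool? No\nFinal Answer: {observation}"


def run_react(question: str, max_iterations: int = 5) -> str:
    """Alternate model replies and tool calls until a final answer appears."""
    scratchpad = f"Question: {question}\n"
    for _ in range(max_iterations):
        reply = scripted_model(scratchpad)
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        # Parse which tool was requested and with what input.
        action = reply.split("Action:", 1)[1].split("\n", 1)[0].strip()
        action_input = reply.split("Action Input:", 1)[1].strip()
        observation = TOOLS[action](action_input)
        # Feed the tool result back so the next reply can use it.
        scratchpad += f"{reply}\nObservation: {observation}\n"
    return "Stopped: iteration limit reached"


answer = run_react("How many rows are in example.csv?")
print(answer)
```

The point of the loop is that a tool is only called when the model asks for one; a greeting like "Hi" would go straight to a final answer.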

# Setup

Install the LangChain packages by running the cell below. In a terminal, use the same commands without the leading `!`.

In [20]:
#!pip install --upgrade google-cloud-aiplatform
!pip install vertexai
!pip install langchain-google-genai
!pip install langchain
!pip install langchain-experimental
!pip install --no-deps langchain-google-vertexai 

Collecting vertexai
  Downloading vertexai-1.71.0-py3-none-any.whl (7.3 kB)
Collecting google-cloud-aiplatform[all]==1.71.0
  Using cached google_cloud_aiplatform-1.71.0-py2.py3-none-any.whl (6.2 MB)
Installing collected packages: google-cloud-aiplatform, vertexai
  Attempting uninstall: google-cloud-aiplatform
    Found existing installation: google-cloud-aiplatform 1.60.0
    Uninstalling google-cloud-aiplatform-1.60.0:
      Successfully uninstalled google-cloud-aiplatform-1.60.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-google-vertexai 2.0.6 requires google-cloud-storage<3.0.0,>=2.18.0, but you have google-cloud-storage 2.14.0 which is incompatible.
Successfully installed google-cloud-aiplatform-1.71.0 vertexai-1.71.0


Initialize variables

In [28]:
PROJECT = 'felipe-sandbox-354619' # replace with project ID
BUCKET = PROJECT + "-bucket" 
CSV_FILE = "files/customer-comments.csv"

Colab setup if needed

In [29]:
try:
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT}
    print('Colab authorized to GCP')
except Exception:
    print('Not a Colab Environment')
    pass

Not a Colab Environment


## Langchain Setup

### Tools

In [50]:
import json
import pandas as pd
from langchain.agents import Tool
from langchain.tools import tool
from io import StringIO
from pydantic import BaseModel, Field

@tool("pandas_calculator", return_direct=False)
def pandas_calculator(dictionary:str):
    """
    Evaluates an expression on a Pandas DataFrame and returns the result.
    Inputs include an expression expr and a file path csv_file_path
    Inputs should be in format {\"expr\": str, \"csv_file_path\": str}
    """
    dictionary = json.loads(dictionary)
    csv_file_path = dictionary["csv_file_path"]
    expr = dictionary["expr"]
    
    try:
        df = pd.read_csv(csv_file_path)
        print(df)
        # evaluate expr on the pandas dataframe
        result = eval(expr)
        # return the result
        return str(result)
    except Exception as e:
        return f"Error: Could not load data or evaluate expression. {e}"
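Because `pandas_calculator` passes a model-written string to `eval`, it may be worth narrowing what that expression can see. The sketch below is one possible variation, not the notebook's method: `evaluate_on_dataframe` is a hypothetical helper, and emptying `__builtins__` only reduces the available names; it is not a security sandbox.

```python
import json
from io import StringIO

import pandas as pd


def evaluate_on_dataframe(df: pd.DataFrame, expr: str) -> str:
    """Hypothetical helper: evaluate expr with only `df` and `pd` visible."""
    allowed_names = {"df": df, "pd": pd}
    try:
        # An empty __builtins__ hides open(), __import__(), and similar names.
        # This narrows the namespace; it does not make eval safe.
        result = eval(expr, {"__builtins__": {}}, allowed_names)
        return str(result)
    except Exception as e:
        return f"Error: Could not evaluate expression. {e}"


# Small in-memory CSV so the example does not depend on a file on disk.
csv_text = (
    "index,comment,country\n"
    "1,The product arrived damaged.,US\n"
    "2,The delivery was late.,US\n"
    "3,I have a question about my order.,Canada\n"
)
sample_df = pd.read_csv(StringIO(csv_text))

# The tool receives its arguments as a JSON string, as in the docstring above.
tool_input = json.dumps({"expr": "df[df['country'] == 'US'].shape[0]"})
expr = json.loads(tool_input)["expr"]

us_count = evaluate_on_dataframe(sample_df, expr)
blocked = evaluate_on_dataframe(sample_df, "open('some_file.txt').read()")
print(us_count)
print(blocked)
```

The second call comes back as an error string because `open` is no longer a visible name inside the expression.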

In [115]:
@tool("csv_summary_statistics", return_direct=False)
def csv_summary_statistics(dictionary:str):
    """
    Gets summary statistics on the csv to understand the data
    Before running pandas calculator please run the statistics first
    Inputs should be in format {\"csv_input_file\": str}
    """
    
    print(dictionary)
    dictionary = json.loads(dictionary)
    input_path = dictionary["csv_input_file"]
    
    try:
        df = pd.read_csv(input_path)
        # build a short preview of the data from the first five rows
        result = "sample data: \n" + df.head().to_string() + "\n\n"
        #result += "information on the data: " + df.info().to_string()
        #print(df.info())
        # return the result
        return str(result)
    except Exception as e:
        return f"Error: Could not load data or evaluate expression. {e}"
    
test_path = "{\"csv_input_file\":\"" + CSV_FILE + "\"}"
print(test_path)
csv_summary_statistics(test_path)

{"csv_input_file":"files/customer-comments.csv"}
{"csv_input_file":"files/customer-comments.csv"}


"sample data: \n   index                                       comment country\n0      1                  The product arrived damaged.      US\n1      2              I'm very happy with the service!  Mexico\n2      3             I have a question about my order.  Canada\n3      4                        The delivery was late.      US\n4      5  The customer support agent was very helpful.  Canada\n\n"

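The commented-out `df.info().to_string()` line in the cell above would fail, because `DataFrame.info()` prints its report and returns `None`. A minimal sketch of one way to capture that report as text, using the `StringIO` import already present in the notebook; `describe_dataframe` is a hypothetical helper name, and the two CSV rows are illustrative.

```python
from io import StringIO

import pandas as pd


def describe_dataframe(df: pd.DataFrame) -> str:
    """Hypothetical helper: column names, a preview, and df.info() as one string."""
    buffer = StringIO()
    # info() writes its report to `buf` and returns None, so read it from the buffer.
    df.info(buf=buffer)
    return (
        "columns: " + ", ".join(str(c) for c in df.columns) + "\n\n"
        + "sample data: \n" + df.head().to_string() + "\n\n"
        + "information on the data: \n" + buffer.getvalue()
    )


# Small in-memory CSV so the example does not depend on a file on disk.
csv_text = (
    "index,comment,country\n"
    "1,The product arrived damaged.,US\n"
    "2,I'm very happy with the service!,Mexico\n"
)
summary = describe_dataframe(pd.read_csv(StringIO(csv_text)))
print(summary)
```

Listing the exact column names may also help the agent write expressions against `country` and `comment` rather than guessing names such as `Country` or `Customer Comment`, which appear in the traces further down.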
In [116]:
from langchain import hub
from langchain.agents import AgentExecutor, create_react_agent
from langchain_google_vertexai import ChatVertexAI
from langchain.memory import ConversationBufferWindowMemory


llm = ChatVertexAI(model="gemini-1.5-pro-002")
prompt = hub.pull("hwchase17/react-chat")
chat_history = ConversationBufferWindowMemory(k=10)

agent = create_react_agent(llm, [pandas_calculator,csv_summary_statistics], prompt)

agent_executor = AgentExecutor(agent=agent,
                               tools=[pandas_calculator,csv_summary_statistics],
                               verbose=True, handle_parsing_errors=True,
                               max_iterations=5)

def append_chat_history(user_input, response):
    # store one Human/AI exchange in the windowed memory
    chat_history.save_context({"input": user_input}, {"output": response})

def invoke(user_input):
    # send the new message together with the stored conversation history
    msg = {
        "input": user_input,
        "chat_history": chat_history.load_memory_variables({}),
    }
    print(f"Input: {msg}")

    response = agent_executor.invoke(msg)
    print(f"Response: {response}")

    # remember this exchange so follow-up questions have context
    append_chat_history(response["input"], response["output"])
    print(f"History: {chat_history.load_memory_variables({})}")

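The `k=10` window means only the most recent exchanges are replayed to the agent. As a rough illustration of that idea, and not LangChain's actual code, a bounded deque behaves similarly; `WindowMemory` is a hypothetical class and the messages are made up for the example.

```python
from collections import deque


class WindowMemory:
    """Hypothetical stand-in that keeps only the last k Human/AI exchanges."""

    def __init__(self, k: int):
        # A deque with maxlen drops the oldest exchange once k is exceeded.
        self.turns = deque(maxlen=k)

    def save_context(self, user_input: str, response: str) -> None:
        self.turns.append((user_input, response))

    def history(self) -> str:
        # Render the stored exchanges as alternating Human/AI lines.
        return "\n".join(f"Human: {u}\nAI: {a}" for u, a in self.turns)


memory = WindowMemory(k=2)
memory.save_context("Hi", "Hello!")
memory.save_context("How many rows?", "There are 100 rows.")
memory.save_context("How many from the US?", "There are 42 rows.")

history = memory.history()
print(history)
```

With `k=2` the first greeting has already been dropped by the time the third exchange is saved, which keeps the prompt from growing without bound.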


In [117]:
invoke("Hi")
invoke(CSV_FILE + "\n\n how many rows are in this file?")

Input: {'input': 'Hi', 'chat_history': {'history': ''}}


> Entering new AgentExecutor chain...
Thought: Do I need to use a tool? No
Final Answer: Hello! How can I help you today?
 How can I help you today?

> Finished chain.
Response: {'input': 'Hi', 'chat_history': {'history': ''}, 'output': 'Hello! How can I help you today?\n How can I help you today?'}
History: {'history': 'Human: Hi\nAI: Hello! How can I help you today?\n How can I help you today?'}
Input: {'input': 'files/customer-comments.csv\n\n how many rows are in this file?', 'chat_history': {'history': 'Human: Hi\nAI: Hello! How can I help you today?\n How can I help you today?'}}


> Entering new AgentExecutor chain...
Thought: Do I need to use a tool? Yes
Action: csv_summary_statistics
Action Input: {"csv_input_file": "files/customer-comments.csv"}
{"csv_input_file": "files/customer-comments.csv"}
sample data: 
   index                                 

In [120]:
invoke("can you count how many rows are from the US?")

Input: {'input': 'can you count how many rows are from the US?', 'chat_history': {'history': 'Human: Hi\nAI: Hello! How can I help you today?\n How can I help you today?\nHuman: files/customer-comments.csv\n\n how many rows are in this file?\nAI: There are 100 rows in the file.\n00 rows in the file.\nHuman: can you count how many rows are from the US?\nAI: There are 42 rows from the US.\n2 rows from the US.\nHuman: how many rows include the word package?\nAI: There are 0 rows that include the word "package".\n rows that include the word "package".'}}


> Entering new AgentExecutor chain...
Thought: Do I need to use a tool? Yes
Action: pandas_calculator
Action Input: {"expr": "df[df['Country'] == 'US'].shape[0]", "csv_file_path": "files/customer-comments.csv"}
    index                                       comment country
0       1                  The product arrived damaged.      US
1       2              I'm very happy with the service!  Mexico
2       3     

In [122]:
invoke("how many rows include the word product?")

Input: {'input': 'how many rows include the word product?', 'chat_history': {'history': 'Human: Hi\nAI: Hello! How can I help you today?\n How can I help you today?\nHuman: files/customer-comments.csv\n\n how many rows are in this file?\nAI: There are 100 rows in the file.\n00 rows in the file.\nHuman: can you count how many rows are from the US?\nAI: There are 42 rows from the US.\n2 rows from the US.\nHuman: how many rows include the word package?\nAI: There are 0 rows that include the word "package".\n rows that include the word "package".\nHuman: can you count how many rows are from the US?\nAI: There are 42 rows from the US.\n2 rows from the US.\nHuman: how many rows include the word package?\nAI: There are 0 rows that include the word "package".\n rows that include the word "package".'}}


> Entering new AgentExecutor chain...
Thought: Do I need to use a tool? Yes
Action: pandas_calculator
Action Input: {"expr": "df['Customer Comment'].str.contains('product')

In [123]:
invoke("how many rows are positive and how many are negative sentiment?")

Input: {'input': 'how many rows are positive and how many are negative sentiment?', 'chat_history': {'history': 'Human: Hi\nAI: Hello! How can I help you today?\n How can I help you today?\nHuman: files/customer-comments.csv\n\n how many rows are in this file?\nAI: There are 100 rows in the file.\n00 rows in the file.\nHuman: can you count how many rows are from the US?\nAI: There are 42 rows from the US.\n2 rows from the US.\nHuman: how many rows include the word package?\nAI: There are 0 rows that include the word "package".\n rows that include the word "package".\nHuman: can you count how many rows are from the US?\nAI: There are 42 rows from the US.\n2 rows from the US.\nHuman: how many rows include the word package?\nAI: There are 0 rows that include the word "package".\n rows that include the word "package".\nHuman: how many rows include the word product?\nAI: There are 26 rows that include the word "product".\n6 rows that include the word "product".'}}


> Entering new Agent