# CSV Analyzer


A generative question-and-answer (QnA) analyzer for CSV files. The focus is on small CSV files (hundreds or thousands of rows) that users traditionally analyze in a spreadsheet but would like to explore more interactively.

Today you can drop a CSV into your LLM of choice and get answers to fairly complex questions that rely on language tasks such as topic categorization, sentiment analysis, and data sampling. Where LLMs usually fail is in computing over your data: filtering, aggregation, counts, and so on.

To overcome this drawback, we take an agentic approach and give the agent access to a pandas runner, which lets it run computations via pandas commands. The agent uses the ReAct framework to reason through the problem and access the pandas DataFrame only when necessary.
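
To make the distinction concrete, here is a minimal sketch of the kind of computation the agent would delegate to pandas rather than attempt inside the model. The small DataFrame is made-up sample data, shaped like the `comment` and `country` columns of the file used later in this notebook.

```python
import pandas as pd

# Hypothetical sample data shaped like the customer-comments file used below
df = pd.DataFrame({
    "comment": [
        "The product arrived damaged.",
        "I'm very happy with the service!",
        "The delivery was late.",
        "The product works great.",
    ],
    "country": ["US", "Mexico", "US", "Canada"],
})

# Filtering and counting: rows from the US
us_rows = int((df["country"] == "US").sum())

# Aggregation: number of comments per country
per_country = df.groupby("country").size().to_dict()

# Text filter: comments that mention the word "product"
product_rows = int(df["comment"].str.contains("product").sum())

print(us_rows, per_country, product_rows)
```

Each of these is a one-line pandas expression, which is the form the agent is expected to produce for the pandas runner.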

# Setup

Install the LangChain packages by running the cell below (or run the same `pip install` commands, without the leading `!`, in a terminal window).

In [3]:
#!pip install --upgrade google-cloud-aiplatform
!pip install --quiet vertexai
!pip install --quiet langchain-google-genai
!pip install --quiet langchain
!pip install --quiet langchain-experimental
!pip install --no-deps --quiet langchain-google-vertexai 
!pip install --quiet panel

Initialize variables

In [4]:
PROJECT = 'felipe-sandbox-354619' # replace with project ID
BUCKET = PROJECT + "-bucket" # make sure this bucket exists
CSV_FILE = "files/customer-comments.csv" 

Colab setup if needed

In [5]:
try:
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT}
    print('Colab authorized to GCP')
except Exception:
    print('Not a Colab Environment')
    pass

Not a Colab Environment


## Langchain Setup

### Tools

Create a tool that loads a CSV file into a pandas DataFrame and runs an analysis expression on it.

In [6]:
import pandas as pd
import json
from langchain.agents import Tool
from langchain.tools import tool
from io import StringIO
from pydantic import BaseModel, Field


@tool("pandas_calculator", return_direct=False)
def pandas_calculator(dictionary:str):
    """
    Evaluates an expression on a Pandas DataFrame and returns the result.
    Inputs include an expression expr and a file path csv_file_path
    Inputs should be in format {\"expr\": str, \"csv_file_path\": str}
    """
    dictionary = json.loads(dictionary)
    csv_file_path = dictionary["csv_file_path"]
    expr = dictionary["expr"]
    
    try:
        df = pd.read_csv(csv_file_path)
        print(df)
        # evaluate expr on the pandas dataframe
        result = eval(expr)
        # return the result
        return str(result)
    except Exception as e:
        return f"Error: Could not load data or evaluate expression. {e}"
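
Because the tool passes a model-generated string to `eval`, it may be worth narrowing what that expression can reach. The sketch below is one possible variant, not the tool used in this notebook: the function name `evaluate_expression` and the inline CSV text are assumptions for illustration. It exposes only `df` and `pd` to the expression and hides Python builtins, which reduces the attack surface but is not a full sandbox.

```python
import json
from io import StringIO

import pandas as pd


def evaluate_expression(payload: str, csv_source) -> str:
    """Evaluate a pandas expression from a JSON payload against a CSV source."""
    args = json.loads(payload)
    df = pd.read_csv(csv_source)
    # Expose only `df` and `pd` to the expression and hide Python builtins.
    # This narrows what an expression can reach; it is not a full sandbox.
    namespace = {"__builtins__": {}, "df": df, "pd": pd}
    try:
        return str(eval(args["expr"], namespace))
    except Exception as e:
        return f"Error: Could not evaluate expression. {e}"


# Made-up CSV content standing in for a real file path
csv_text = (
    "index,comment,country\n"
    "1,Late delivery,US\n"
    "2,Great service,Mexico\n"
    "3,Broken item,US\n"
)

payload = json.dumps({"expr": "(df['country'] == 'US').sum()"})
us_count = evaluate_expression(payload, StringIO(csv_text))

# Builtins such as open() are not visible inside the expression
blocked = evaluate_expression(json.dumps({"expr": "open('x')"}), StringIO(csv_text))

print(us_count)
print(blocked)
```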

Tool to summarize the file so that the LLM can generate correct pandas expressions

In [33]:
@tool("csv_summary_statistics", return_direct=False)
def csv_summary_statistics(dictionary:str):
    """
    Gets summary statistics on the csv to understand the data
    Before running pandas calculator please run the statistics first
    Inputs should be in format {\"csv_input_file\": str}
    """
    
    dictionary = json.loads(dictionary)
    input_path = dictionary["csv_input_file"]
    
    try:
        df = pd.read_csv(input_path)
        # build a summary from the first few rows and the descriptive statistics
        result = "first few rows of data: \n" + df.head().to_string() + "\n\n"
        result += "information on the data: " + df.describe(include='all').to_string()
        #print(df.info())
        # return the result
        return str(result)
    except Exception as e:
        return f"Error: Could not load data or evaluate expression. {e}"
    
test_path = "{\"csv_input_file\":\"" + CSV_FILE + "\"}"
print(csv_summary_statistics(test_path))

first few rows of data: 
   index                                       comment country
0      1                  The product arrived damaged.      US
1      2              I'm very happy with the service!  Mexico
2      3             I have a question about my order.  Canada
3      4                        The delivery was late.      US
4      5  The customer support agent was very helpful.  Canada

information on the data:              index                       comment country
count   100.000000                           100     100
unique         NaN                            15       3
top            NaN  The product arrived damaged.      US
freq           NaN                             7      42
mean     50.500000                           NaN     NaN
std      29.011492                           NaN     NaN
min       1.000000                           NaN     NaN
25%      25.750000                           NaN     NaN
50%      50.500000                           NaN     NaN
7
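
The test call above builds the tool's JSON input by string concatenation with escaped quotes. A small sketch of an equivalent way to build it with `json.dumps`, which takes care of quoting and escaping; the variable names here are illustrative only.

```python
import json

# Same path as CSV_FILE earlier in this notebook
csv_file = "files/customer-comments.csv"

# Hand-built string, as in the test call above
manual = "{\"csv_input_file\":\"" + csv_file + "\"}"

# Equivalent payload built with json.dumps
payload = json.dumps({"csv_input_file": csv_file})

# Both strings decode to the same dictionary
same = json.loads(manual) == json.loads(payload)
print(payload)
print(same)
```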

Tool to generate a column that does not exist in the file (still being tested)

In [96]:
from langchain_google_vertexai import ChatVertexAI

@tool("generate_column", return_direct=False)
def generate_column(dictionary:str):
    """
    For some tasks the column that is needed does not exist and 
    might need to be generated via LLM
    This function will fill in the column via some prompt instructions and generate it
    Inputs should be in format {\"csv_input_file\": str, \"prompt_instructions\": str, \"column_name\": str}
    
    csv_input_file: is the file to be loaded
    column_name: is the name of the new column that will be generated
    prompt_instructions: is the prompt used by the LLM to generate the column. It is applied to 
    each row: the instructions are followed by the row contents and a result marker
    """
    
    llm = ChatVertexAI(model="gemini-1.5-flash-002")
    
    dictionary = json.loads(dictionary)
    input_path = dictionary["csv_input_file"]
    prompt_instructions = dictionary["prompt_instructions"]
    column_name = dictionary["column_name"]
    
    def generate_value(row, prompt_instructions):
        # build the per-row prompt from the caller's instructions and the row contents
        prompt = f"{prompt_instructions}\n{row.to_string()}\nresult:\n"
        return llm.invoke(prompt).content  # text of the model's reply
    
    #try:
        
    df = pd.read_csv(input_path)
    df[column_name] = df.apply(lambda x: generate_value(x, prompt_instructions), axis=1)

    return df[column_name]
    #except Exception as e:
    #    return f"Error: Could not load data or evaluate expression. {e}"

data_test = '{\"csv_input_file\":\"'+CSV_FILE+'\", \"prompt_instructions\":\"classify this row as positive, negative or neutral sentiment\", \"column_name\":\"sentiment\"}'
generate_column(data_test)
    

0                                            Negative\n
1                                            Positive\n
2     Neutral.  The comment is a question, not expre...
3                                            Negative\n
4                                            Positive\n
                            ...                        
95                                           Negative\n
96    Neutral.  The comment expresses a need, not a ...
97                                           Positive\n
98                                           Negative\n
99                                           Negative\n
Name: sentiment, Length: 100, dtype: object
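
The replies above are free-form: some end in a newline and some add an explanation after the label. Before counting sentiments, the labels would likely need normalizing. The following is a minimal sketch under that assumption; `normalize_label` is a hypothetical helper, not part of the notebook, and the sample replies are shortened versions of the output above.

```python
import pandas as pd

LABELS = ("positive", "negative", "neutral")


def normalize_label(raw: str) -> str:
    """Map a free-form model reply to one of LABELS, or 'unknown'."""
    text = raw.strip().lower()
    for label in LABELS:
        if text.startswith(label):
            return label
    return "unknown"


# Replies in the style of the output above (the neutral reply is shortened)
raw_replies = pd.Series([
    "Negative\n",
    "Positive\n",
    "Neutral.  The comment is a question",
    "Negative\n",
])

sentiment = raw_replies.map(normalize_label)
counts = sentiment.value_counts().to_dict()
print(counts)
```

With clean labels in place, a question such as "how many rows are positive vs negative" reduces to a `value_counts()` call.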

In [86]:
from langchain_google_vertexai import ChatVertexAI
from langchain.schema import SystemMessage

llm = ChatVertexAI(model="gemini-1.5-flash-002")
df = pd.read_csv(CSV_FILE)


def generate_sentiment(row):
    # build the prompt from the row contents
    prompt = f"classify this row as postive, negative or neutral sentimet\n{row.to_string()}\nresult:\n"
    print(prompt)
    # pass the prompt string directly and return the text of the reply
    return llm.invoke(prompt).content

row_test = df.iloc[1]
print(row_test.to_string())

prompt = f"classify this row as postive, negative or neutral sentimet\n{row_test.to_string()}\nresult:\n"
print(prompt)
message = SystemMessage(content=prompt)  # wrap the prompt in a message object to inspect it
print(message)
print(llm.invoke(prompt).content)
#generate_sentiment(row_test)
#df[column_name] = df.apply(generate_sentiment, axis=1)

index                                     2
comment    I'm very happy with the service!
country                              Mexico
classify this row as postive, negative or neutral sentimet
index                                     2
comment    I'm very happy with the service!
country                              Mexico
result:

content="classify this row as postive, negative or neutral sentimet\nindex                                     2\ncomment    I'm very happy with the service!\ncountry                              Mexico\nresult:\n" additional_kwargs={} response_metadata={}
Positive



In [46]:
llm = ChatVertexAI(model="gemini-1.5-pro-002")
llm.invoke("hello").content

'Hello there! How can I help you today?\n'

In [47]:
from langchain import hub
from langchain.agents import AgentExecutor, create_react_agent
from langchain_google_vertexai import ChatVertexAI
from langchain.memory import ConversationBufferWindowMemory


llm = ChatVertexAI(model="gemini-1.5-pro-002")
prompt = hub.pull("hwchase17/react-chat")
chat_history = ConversationBufferWindowMemory(k=10)

agent = create_react_agent(llm, [pandas_calculator,csv_summary_statistics], prompt)

agent_executor = AgentExecutor(agent=agent,
                               tools=[pandas_calculator,csv_summary_statistics],
                               verbose=True, handle_parsing_errors=True,
                               max_iterations=5)

def append_chat_history(input, response):
    chat_history.save_context({"input": input}, {"output": response})

def invoke(input):
    if (input):
        print("input: " + str(input))
        msg = {
            "input": input,
            "chat_history": chat_history.load_memory_variables({}),
        }
        print(f"Input: {msg}")

        response = agent_executor.invoke(msg)
        print(f"Response: {response}")

        append_chat_history(response["input"], response["output"])
        print(f"History: {chat_history.load_memory_variables({})}")

        return response["output"]
    else:
        return ""
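
To illustrate what the ReAct loop does with each model reply, here is a simplified sketch of parsing one "Action / Action Input" step and dispatching it to a tool. It is not LangChain's parser: the regular expression, `run_step`, and the stand-in tools in `TOOLS` are all assumptions made for illustration.

```python
import re

# Hypothetical stand-ins for the real tools; each takes the raw Action Input string
TOOLS = {
    "csv_summary_statistics": lambda arg: f"summary for {arg}",
    "pandas_calculator": lambda arg: f"result for {arg}",
}

STEP_PATTERN = re.compile(
    r"Action:\s*(?P<tool>\w+)\s*\nAction Input:\s*(?P<arg>.*)", re.DOTALL
)


def run_step(model_text: str) -> str:
    """Return the final answer, or the observation from the requested tool."""
    if "Final Answer:" in model_text:
        return model_text.split("Final Answer:", 1)[1].strip()
    match = STEP_PATTERN.search(model_text)
    if match is None:
        return "Could not parse model output"
    tool = TOOLS.get(match.group("tool"))
    if tool is None:
        return f"Unknown tool: {match.group('tool')}"
    return tool(match.group("arg").strip())


step = (
    "Thought: Do I need to use a tool? Yes\n"
    "Action: csv_summary_statistics\n"
    'Action Input: {"csv_input_file": "files/customer-comments.csv"}'
)
observation = run_step(step)
answer = run_step("Thought: Do I need to use a tool? No\nFinal Answer: Hi there!")
print(observation)
print(answer)
```

In the real executor the observation is fed back to the model for another step, up to `max_iterations`.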



In [48]:
invoke("Hi")
invoke(CSV_FILE + "\n\n how many rows have positive sentiment vs negative sentiment?")

input: Hi
Input: {'input': 'Hi', 'chat_history': {'history': ''}}


> Entering new AgentExecutor chain...
Thought: Do I need to use a tool? No
Final Answer: Hi there! How can I help you today?
! How can I help you today?

> Finished chain.
Response: {'input': 'Hi', 'chat_history': {'history': ''}, 'output': 'Hi there! How can I help you today?\n! How can I help you today?'}
History: {'history': 'Human: Hi\nAI: Hi there! How can I help you today?\n! How can I help you today?'}
input: files/customer-comments.csv

 how many rows have positive sentiment vs negative sentiment?
Input: {'input': 'files/customer-comments.csv\n\n how many rows have positive sentiment vs negative sentiment?', 'chat_history': {'history': 'Human: Hi\nAI: Hi there! How can I help you today?\n! How can I help you today?'}}


> Entering new AgentExecutor chain...
Thought: Do I need to use a tool? Yes
Action: csv_summary_statistics
Action Input: {"csv_input_file":

'Agent stopped due to iteration limit or time limit.'

In [10]:
invoke("can you count how many rows are from the US?")

input: can you count how many rows are from the US?
Input: {'input': 'can you count how many rows are from the US?', 'chat_history': {'history': 'Human: Hi\nAI: Hello! How can I help you today?\n How can I help you today?\nHuman: files/customer-comments.csv\n\n how many rows are in this file?\nAI: There are 100 rows in the file.\n00 rows in the file.'}}


> Entering new AgentExecutor chain...
Thought: Do I need to use a tool? Yes
Action: csv_summary_statistics
Action Input: {"csv_input_file": "files/customer-comments.csv"}
{"csv_input_file": "files/customer-comments.csv"}
sample data: 
   index                                       comment country
0      1                  The product arrived damaged.      US
1      2              I'm very happy with the service!  Mexico
2      3             I have a question about my order.  Canada
3      4                        The delivery was late.      US
4      5  The customer support agent was very helpful. 

'There are 42 rows from the US.\n2 rows from the US.'

In [11]:
invoke("how many rows include the word product?")

input: how many rows include the word product?
Input: {'input': 'how many rows include the word product?', 'chat_history': {'history': 'Human: Hi\nAI: Hello! How can I help you today?\n How can I help you today?\nHuman: files/customer-comments.csv\n\n how many rows are in this file?\nAI: There are 100 rows in the file.\n00 rows in the file.\nHuman: can you count how many rows are from the US?\nAI: There are 42 rows from the US.\n2 rows from the US.'}}


> Entering new AgentExecutor chain...
Thought: Do I need to use a tool? Yes
Action: pandas_calculator
Action Input: {"expr": "df['Comments'].str.contains('product').sum()", "csv_file_path": "files/customer-comments.csv"}
    index                                       comment country
0       1                  The product arrived damaged.      US
1       2              I'm very happy with the service!  Mexico
2       3             I have a question about my order.  Canada
3       4                        The deli

'There are 26 rows that include the word "product".\n6 rows that include the word "product".'
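
The trace above shows the agent writing `df['Comments']`, while the summary output lists the column as `comment`. One way to give the agent a more useful error is to check the referenced column names before evaluating the expression. This is a rough sketch under that assumption; `missing_columns` is a hypothetical helper and only recognizes the `df['name']` form.

```python
import re

import pandas as pd


def missing_columns(expr: str, df: pd.DataFrame) -> list:
    """List column names referenced as df['name'] that are not in df."""
    referenced = re.findall(r"df\[['\"]([^'\"]+)['\"]\]", expr)
    return [name for name in referenced if name not in df.columns]


# Made-up rows with the same column names as the file in this notebook
df = pd.DataFrame({
    "index": [1, 2],
    "comment": ["The product arrived damaged.", "The delivery was late."],
    "country": ["US", "US"],
})

expr = "df['Comments'].str.contains('product').sum()"
missing = missing_columns(expr, df)
if missing:
    # Returning the valid names lets the agent correct itself on the next step
    message = f"Unknown columns {missing}; available columns: {list(df.columns)}"
else:
    message = "ok"
print(message)
```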

In [12]:
invoke("how many rows are positive and how many are negative sentiment?")

input: how many rows are positive and how many are negative sentiment?
Input: {'input': 'how many rows are positive and how many are negative sentiment?', 'chat_history': {'history': 'Human: Hi\nAI: Hello! How can I help you today?\n How can I help you today?\nHuman: files/customer-comments.csv\n\n how many rows are in this file?\nAI: There are 100 rows in the file.\n00 rows in the file.\nHuman: can you count how many rows are from the US?\nAI: There are 42 rows from the US.\n2 rows from the US.\nHuman: how many rows include the word product?\nAI: There are 26 rows that include the word "product".\n6 rows that include the word "product".'}}


> Entering new AgentExecutor chain...
Thought: Do I need to use a tool? Yes
Action: pandas_calculator
Action Input: {"expr": "df['Sentiment'].value_counts()", "csv_file_path": "files/customer-comments.csv"}
    index                                       comment country
0       1                  The product arrived damaged

'There are no rows explicitly marked with "positive" or "negative".  The data doesn\'t seem to directly contain sentiment labels. To properly analyze sentiment, I would need additional information or a model trained to detect sentiment in text.\n seem to directly contain sentiment labels. To properly analyze sentiment, I would need additional information or a model trained to detect sentiment in text.'

## Create UI for the Chatbot

In [22]:
import ipywidgets as widgets
from IPython.display import display, clear_output, HTML


# Create text box for user input
user_input = widgets.Textarea(
    value='',
    placeholder='Enter text here...',
    layout=widgets.Layout(width='100%')
)

# Create output area for conversation display
conversation_history = widgets.Output()
display(conversation_history)

# Create button for user to click to send their message
send_button = widgets.Button(description='Send')

def send_button_clicked(b):
    with conversation_history:
        # User message with gray bubble
        user_message = f"""
        <div style="display: flex; align-items: flex-start;">
            <div style="background-color: lightgray; padding: 10px; border-radius: 10px; margin-right: 10px;">
                {user_input.value}
            </div>
        </div>
        """
        display(HTML(user_message))

        # Get chatbot response
        response = invoke(user_input.value)

        # AI response with blue bubble
        ai_message = f"""
        <div style="display: flex; align-items: flex-start; justify-content: flex-end;">
            <div style="background-color: lightblue; padding: 10px; border-radius: 10px; margin-left: 10px;">
                {response}
            </div>
        </div>
        """
        display(HTML(ai_message))

        # Clear user input box
        user_input.value = ''

send_button.on_click(send_button_clicked)

# Display the input box and send button
display(widgets.HBox([user_input, send_button]))
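
The handler above interpolates the user's text and the agent's response directly into HTML. If either contains characters such as `<` or `&`, the bubble may render incorrectly. A minimal sketch of escaping the text first with the standard library's `html.escape`; `render_bubble` is a hypothetical helper, not part of the notebook.

```python
import html


def render_bubble(text: str, color: str, align: str = "flex-start") -> str:
    """Build a chat bubble, escaping the text so it is shown literally."""
    safe_text = html.escape(text)
    return (
        f'<div style="display: flex; align-items: flex-start; justify-content: {align};">'
        f'<div style="background-color: {color}; padding: 10px; border-radius: 10px;">'
        f"{safe_text}</div></div>"
    )


bubble = render_bubble("rows where index < 5 & country == 'US'", "lightgray")
print(bubble)
```

The returned string could then be passed to `display(HTML(...))` in place of the inline f-strings.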
