# [Pandas Dataframe](https://python.langchain.com/v0.2/docs/integrations/toolkits/pandas/)
This notebook shows how to use agents to interact with a Pandas DataFrame. It is mostly optimized for question answering.

In [26]:
from langchain.agents.agent_types import AgentType
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent
from langchain_experimental.agents.agent_toolkits.pandas.prompt import PREFIX
from langchain_openai import ChatOpenAI

Titanic data

In [3]:
import pandas as pd
from langchain_openai import OpenAI

df = pd.read_csv(
    "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
)

In [7]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [19]:
test = OpenAI(temperature=0)
print(test)

OpenAI
Params: {'model_name': 'gpt-3.5-turbo-instruct', 'temperature': 0.0, 'top_p': 1, 'frequency_penalty': 0, 'presence_penalty': 0, 'n': 1, 'logit_bias': {}, 'max_tokens': 256}


## Using ZERO_SHOT_REACT_DESCRIPTION
This shows how to initialize the agent using the ZERO_SHOT_REACT_DESCRIPTION agent type.

Simply switching the model to gpt-4 raised an error, so the cell below uses `ChatOpenAI` instead of `OpenAI`. See:  
https://stackoverflow.com/questions/77089403/issue-with-custom-prompt-with-an-agent-using-langchain-and-gpt-4

In [39]:
#agent = create_pandas_dataframe_agent(OpenAI(temperature=0), df, verbose=True)
#agent = create_pandas_dataframe_agent(OpenAI(model='gpt-3.5-turbo-instruct', temperature=0), df, verbose=True)
llm = ChatOpenAI(temperature=0, model="gpt-4")
agent = create_pandas_dataframe_agent(
    llm,
    df,  # multiple DataFrames can apparently also be passed as a list
    verbose=True,
    max_iterations=5,
)

In [40]:
agent.invoke("how many rows are there?")



> Entering new AgentExecutor chain...
Thought: To find out how many rows there are in a dataframe, we can use the `shape` attribute of the dataframe. The first element of the tuple returned by `shape` is the number of rows.
Action: python_repl_ast
Action Input: df.shape[0]
891
I now know the final answer
Final Answer: There are 891 rows in the dataframe.

> Finished chain.


{'input': 'how many rows are there?',
 'output': 'There are 891 rows in the dataframe.'}

## Using OPENAI_FUNCTIONS
This shows how to initialize the agent using the OPENAI_FUNCTIONS agent type, as an alternative to the above.

In [41]:
agent = create_pandas_dataframe_agent(
    ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613"),
    df,
    verbose=True,
    agent_type=AgentType.OPENAI_FUNCTIONS,
)
agent.invoke("how many rows are there?")



> Entering new AgentExecutor chain...

Invoking: `python_repl_ast` with `{'query': 'df.shape[0]'}`

891
There are 891 rows in the dataframe.

> Finished chain.


{'input': 'how many rows are there?',
 'output': 'There are 891 rows in the dataframe.'}

In [42]:
agent.invoke("how many people have more than 3 siblings")



> Entering new AgentExecutor chain...

Invoking: `python_repl_ast` with `{'query': "df[df['SibSp'] > 3]['PassengerId'].count()"}`

30
There are 30 people in the dataframe who have more than 3 siblings.

> Finished chain.


{'input': 'how many people have more than 3 siblings',
 'output': 'There are 30 people in the dataframe who have more than 3 siblings.'}
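
The query the agent generated can be checked by hand in pandas. Below is a minimal sketch of the same boolean-mask count on a small hypothetical frame (`toy` and its values are invented for illustration, not taken from the Titanic data). Note that in the Titanic dataset `SibSp` counts siblings and spouses aboard together, so "more than 3 siblings" is only an approximate reading of the column.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic frame; values are invented.
toy = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3, 4, 5],
        "SibSp": [1, 4, 0, 5, 3],
    }
)

# Same pattern as the agent's query: build a boolean mask, then count matches.
mask = toy["SibSp"] > 3
count_via_filter = toy[mask]["PassengerId"].count()

# Equivalent and shorter: True counts as 1 when summing a boolean Series.
count_via_sum = int(mask.sum())

print(count_via_filter, count_via_sum)
```

Both forms give the same count here; `mask.sum()` avoids depending on a second column that might itself contain missing values.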

In [33]:
agent.invoke("whats the square root of the average age?")



> Entering new AgentExecutor chain...
Thought: To find the square root of the average age, I first need to calculate the average age. I can do this by using the mean() function on the 'Age' column of the dataframe. Then, I can use the math.sqrt() function to find the square root of the average age.
Action: python_repl_ast
Action Input: df['Age'].mean()
29.69911764705882
The average age is approximately 29.7. Now I need to find the square root of this number.
Action: python_repl_ast
Action Input: import math
math.sqrt(df['Age'].mean())
5.449689683556195
I now know the final answer
Final Answer: The square root of the average age is approximately 5.45.

> Finished chain.


{'input': 'whats the square root of the average age?',
 'output': 'The square root of the average age is approximately 5.45.'}
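
The second action in the trace above sends two lines (`import math`, then an expression) and gets the value of the last line back. As a rough sketch of how a REPL-style tool can do that, the hypothetical helper `run_snippet` below executes every statement except the last and then evaluates the last expression. This is an illustration under that assumption, not LangChain's actual code, and the `toy` ages are invented.

```python
import ast

import pandas as pd


def run_snippet(source: str, namespace: dict):
    """Run all statements but the last, then return the last expression's value.

    Hypothetical helper for illustration only; not LangChain's implementation.
    """
    tree = ast.parse(source)
    if not tree.body:
        return None
    *setup, last = tree.body
    # Execute the leading statements (for example `import math`) for their side effects.
    exec(compile(ast.Module(body=setup, type_ignores=[]), "<snippet>", "exec"), namespace)
    if isinstance(last, ast.Expr):
        # The final line is an expression, so evaluate it and return the value.
        return eval(compile(ast.Expression(body=last.value), "<snippet>", "eval"), namespace)
    # The final line is a statement (an assignment, say), so just run it.
    exec(compile(ast.Module(body=[last], type_ignores=[]), "<snippet>", "exec"), namespace)
    return None


# Invented ages with one missing value; Series.mean() skips NaN by default.
toy = pd.DataFrame({"Age": [16.0, None, 25.0, 34.0]})
result = run_snippet("import math\nmath.sqrt(df['Age'].mean())", {"df": toy})
print(result)
```

Because `mean()` ignores missing ages, the agent's answer describes only passengers whose age is recorded.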

## Multi DataFrame Example
This next part shows how the agent can interact with multiple dataframes passed in as a list.

In [34]:
df1 = df.copy()
df1["Age"] = df1["Age"].fillna(df1["Age"].mean())
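
To make the effect of the mean-fill step concrete, here is a minimal sketch on a small hypothetical frame (`toy` and its values are invented): missing ages are replaced by the mean of the recorded ages, and the source frame is left unchanged because the fill is applied to a copy.

```python
import pandas as pd

# Hypothetical toy data; two of the four ages are missing.
toy = pd.DataFrame({"Age": [20.0, None, 40.0, None]})

filled = toy.copy()
mean_age = filled["Age"].mean()  # NaN is skipped, so this is (20 + 40) / 2
filled["Age"] = filled["Age"].fillna(mean_age)

# Number of values that were filled in, counted on the untouched original.
n_filled = int(toy["Age"].isna().sum())

print(mean_age, n_filled, filled["Age"].tolist())
```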

With GPT-3.5-Turbo, the answer was wrong.

In [38]:
llm = ChatOpenAI(temperature=0, model="gpt-4")
#agent = create_pandas_dataframe_agent(OpenAI(temperature=0, model="gpt-4"), [df, df1], verbose=True)
agent = create_pandas_dataframe_agent(llm, [df, df1], verbose=True)
agent.invoke("how many rows in the age column are different?")



> Entering new AgentExecutor chain...
Thought: To find out how many rows in the age column are different, I need to compare the 'Age' column of both dataframes. I can use the 'compare' function in pandas to do this.

Action: python_repl_ast
Action Input: df1.compare(df2)['Age']
     self      other
5     NaN  29.699118
17    NaN  29.699118
19    NaN  29.699118
26    NaN  29.699118
28    NaN  29.699118
..    ...        ...
859   NaN  29.699118
863   NaN  29.699118
868   NaN  29.699118
878   NaN  29.699118
888   NaN  29.699118

[177 rows x 2 columns]
The 'compare' function returns a dataframe that only includes the rows where there are differences. From the observation, it seems there are 177 rows in the 'Age' column that are different between the two dataframes.
Final Answer: There are 177 rows in the 'Age' column that are different.

> Finished chain.


{'input': 'how many rows in the age column are different?',
 'output': "There are 177 rows in the 'Age' column that are different."}
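
The agent's count can be cross-checked without an LLM. The sketch below uses a small hypothetical pair of frames (`original` and `filled`, with invented values) and counts the differing `Age` rows in two ways: with `DataFrame.compare`, as the agent did, and by counting the originally missing values, which are the only rows the mean-fill changes.

```python
import pandas as pd

# Hypothetical toy data; two ages are missing in the original frame.
original = pd.DataFrame(
    {"Age": [20.0, None, 40.0, None], "Fare": [7.25, 8.05, 13.0, 30.0]}
)
filled = original.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())

# compare() keeps only the rows and columns that differ between the two frames.
diff = original.compare(filled)
n_diff_compare = len(diff["Age"])

# Cross-check: the mean-fill only touches rows where Age was missing.
n_diff_isna = int(original["Age"].isna().sum())

print(n_diff_compare, n_diff_isna)
```

When the two numbers agree, as they do here, the `compare` result is accounted for entirely by the filled-in missing values.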