### Reading Pandas data
We can use our QA system to ask natural-language questions about a pandas DataFrame with LlamaIndex.


In [7]:
from llama_index.llms.openai import OpenAI
from llama_index.readers.file.base import SimpleDirectoryReader
from llama_index.query_engine.pandas import PandasQueryEngine
from llama_index.service_context import ServiceContext
import pandas as pd

In [8]:
# Test on some sample data
df = pd.DataFrame(
    {
        "city": ["Toronto", "Tokyo", "Berlin"],
        "population": [2930000, 13960000, 3645000],
    }
)


In [10]:
# Assumption: an OpenAI-compatible endpoint that serves this model is already configured
llm = OpenAI(model="mistralai/Mixtral-8x7B-Instruct-v0.1")
service_context = ServiceContext.from_defaults(llm=llm)

In [21]:
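# synthesize_response=True turns the raw pandas output into a natural-language answer;
# verbose=True prints the generated pandas expression and its output.
query_engine = PandasQueryEngine(
    df=df,
    service_context=service_context,
    synthesize_response=True,
    verbose=True,
)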

In [22]:
from pprint import pprint

In [23]:
response = query_engine.query("What is the name of the city with the highest population?")
pprint(response.response)

> Pandas Instructions:
```
df['city'][df['population'].idxmax()]
```
> Pandas Output: Tokyo
'The city with the highest population is Tokyo.'
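
To sanity-check what the engine generated, the same lookup can be run by hand in plain pandas, with no LLM involved. This is a minimal sketch on the sample data from above; the names `top_index` and `top_city` are introduced here for illustration only.

```python
import pandas as pd

# Same sample data as in the cell above
df = pd.DataFrame(
    {
        "city": ["Toronto", "Tokyo", "Berlin"],
        "population": [2930000, 13960000, 3645000],
    }
)

# idxmax() returns the index label of the row with the largest population
top_index = df["population"].idxmax()

# Look that label up in the "city" column
top_city = df.loc[top_index, "city"]
print(top_city)
```

If the hand-written result and the engine's "Pandas Output" agree, the generated expression did what the question asked.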


### That was fun, now let's analyze the Titanic dataset
We will download the Titanic dataset and use the `PandasQueryEngine` provided by LlamaIndex to extract information from it.

In [24]:
!wget 'https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/examples/data/csv/titanic_train.csv' -O 'data/titanic_train.csv'

--2024-02-11 23:19:04--  https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/examples/data/csv/titanic_train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 57726 (56K) [text/plain]
Saving to: ‘data/titanic_train.csv’


2024-02-11 23:19:05 (1.84 MB/s) - ‘data/titanic_train.csv’ saved [57726/57726]



In [26]:
df = pd.read_csv("./data/titanic_train.csv")

In [34]:
# Rebuild the query engine so that it points at the Titanic DataFrame
query_engine = PandasQueryEngine(
    df=df,
    service_context=service_context,
    synthesize_response=True,
    verbose=True,
)

In [39]:
response = query_engine.query(
    "What is the age of the youngest person in the dataset and their other information?",
)

> Pandas Instructions:
```
df[df['age'] == df['age'].min()][['survived', 'pclass', 'age', 'name', 'sex', 'embarked']]
```
> Pandas Output:
     survived  pclass   age                             name   sex embarked
803         1       3  0.42  Thomas, Master. Assad Alexander  male        C


In [40]:
pprint(response.response)

('The youngest person in the dataset is 0.42 years old at the time of the '
 'survey. This individual is listed as "Thomas, Master. Assad Alexander," and '
 'is identified as male. They were traveling in third class, and embarked from '
 'the port labeled "C." Based on the \'survived\' column, it is unclear '
 'whether they survived or not, as the value is listed as 1, which could '
 'indicate a survived passenger, but could also be an ID number or some other '
 'unrelated value. More information about the context of the dataset would be '
 'needed to accurately interpret this value.')
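
The synthesized answer above hedges on what `survived` means, because the model only sees column names and values. One possible way to reduce that ambiguity is to prepend a short column glossary to the question. The sketch below is an assumption, not a feature of `PandasQueryEngine`: the helper `with_column_notes` and the `column_notes` dict are hypothetical, and the column meanings are the ones commonly documented for the Titanic dataset.

```python
# Hypothetical glossary: these meanings are assumptions about this CSV,
# so check them against the dataset's documentation before relying on them.
column_notes = {
    "survived": "1 = survived, 0 = did not survive",
    "pclass": "ticket class (1, 2 or 3)",
    "embarked": "port of embarkation code",
}


def with_column_notes(question: str, notes: dict) -> str:
    """Prefix a question with one '- column: meaning' line per documented column."""
    lines = [f"- {name}: {meaning}" for name, meaning in notes.items()]
    return "Column notes:\n" + "\n".join(lines) + "\n\nQuestion: " + question


query = with_column_notes(
    "What is the age of the youngest person in the dataset and their other information?",
    column_notes,
)
print(query)
```

The resulting string can be passed to `query_engine.query(query)` in place of the bare question. Whether it actually improves the answer depends on the model.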


In [42]:
prompts = query_engine.get_prompts()

In [43]:
type(prompts)

dict

In [44]:
prompts.keys()

dict_keys(['pandas_prompt', 'response_synthesis_prompt'])

In [47]:
pprint(prompts["pandas_prompt"].template)

('You are working with a pandas dataframe in Python.\n'
 'The name of the dataframe is `df`.\n'
 'This is the result of `print(df.head())`:\n'
 '{df_str}\n'
 '\n'
 'Follow these instructions:\n'
 '{instruction_str}\n'
 'Query: {query_str}\n'
 '\n'
 'Expression:')
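
To see which variables this template expects, the placeholder names can be pulled out with the standard library. This is a small sketch that works on the printed template text only; `pandas_template` is a copy of the string shown above, not an object returned by LlamaIndex.

```python
from string import Formatter

# Template text copied from the printed `pandas_prompt` above
pandas_template = (
    "You are working with a pandas dataframe in Python.\n"
    "The name of the dataframe is `df`.\n"
    "This is the result of `print(df.head())`:\n"
    "{df_str}\n"
    "\n"
    "Follow these instructions:\n"
    "{instruction_str}\n"
    "Query: {query_str}\n"
    "\n"
    "Expression:"
)

# Formatter().parse yields (literal_text, field_name, format_spec, conversion);
# field_name is None for trailing literal text, so filter those out.
placeholders = [field for _, field, _, _ in Formatter().parse(pandas_template) if field]
print(placeholders)  # ['df_str', 'instruction_str', 'query_str']
```

So the model sees the first rows of the DataFrame, a block of instructions, and the user's question, and is asked to complete a single pandas expression.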


In [48]:
pprint(prompts["response_synthesis_prompt"].template)

('Given an input question, synthesize a response from the query results.\n'
 'Query: {query_str}\n'
 '\n'
 'Pandas Instructions (optional):\n'
 '{pandas_instructions}\n'
 '\n'
 'Pandas Output: {pandas_output}\n'
 '\n'
 'Response: ')
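
As an illustration of what the second LLM call receives, the template can be filled in by hand with values from the sample-data query earlier. Plain `str.format` is used here as a stand-in; how LlamaIndex formats the prompt internally is not shown in this notebook, so treat this as a sketch.

```python
# Template text copied from the printed `response_synthesis_prompt` above
response_template = (
    "Given an input question, synthesize a response from the query results.\n"
    "Query: {query_str}\n"
    "\n"
    "Pandas Instructions (optional):\n"
    "{pandas_instructions}\n"
    "\n"
    "Pandas Output: {pandas_output}\n"
    "\n"
    "Response: "
)

# Values taken from the city/population example
filled = response_template.format(
    query_str="What is the name of the city with the highest population?",
    pandas_instructions="df['city'][df['population'].idxmax()]",
    pandas_output="Tokyo",
)
print(filled)
```

The model's completion after `Response: ` is what ends up in `response.response`.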
