This notebook analyses the [Yelp Dataset](https://www.yelp.com/dataset) using an LLM agent as a copilot. First, we reorganize the data and load it into a pandas dataframe. This gives the data a format that is easier to describe to the LLM when instructing it how to run the analysis. Alternatively, we could try to get the LLM agent to restructure the data for us. Feel free to give this a shot!

In [1]:
import json
from typing import Dict, List

import pandas as pd
import progressbar


def read_json_data(filepath: str) -> List[Dict]:
    """Read a file with one JSON object per line into a list of dicts."""
    data = []
    print('loading json')
    with open(filepath, 'r') as jsonfile:
        for line in progressbar.progressbar(jsonfile.readlines()):
            data.append(json.loads(line))
    return data


def extract_non_nested_fields(data: List[Dict]) -> List[Dict]:
    """Keep only the flat fields, chosen by inspecting the first record."""
    first = data[0]
    # Drop dict-valued fields and the 'hours' field.
    columns = {c for c in first if not isinstance(first[c], dict) and c != 'hours'}
    print('reformatting json')
    return [
        {key: record[key] for key in record if key in columns}
        for record in progressbar.progressbar(data)
    ]


def load_json_to_df(filepath: str) -> pd.DataFrame:
    """Load the flat fields of a JSON-lines file into a dataframe."""
    data = read_json_data(filepath)
    data = extract_non_nested_fields(data)
    print('loading data into dataframe')
    return pd.DataFrame(data)
        
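As an aside, `read_json_data` above holds every line of the file in memory before parsing. Below is a minimal sketch of a streaming alternative, assuming each line is one JSON object. The names `iter_flat_records` and `skip_keys` are hypothetical and not part of this notebook, and the sample rows are made up rather than taken from the Yelp data. Unlike `extract_non_nested_fields`, this sketch decides per record which values are nested instead of inspecting only the first record.

```python
import io
import json
from typing import Dict, Iterable, Iterator, Tuple


def iter_flat_records(
    lines: Iterable[str], skip_keys: Tuple[str, ...] = ("hours",)
) -> Iterator[Dict]:
    """Yield one flat dict per non-empty JSON line.

    Hypothetical helper: drops dict-valued fields and any key in skip_keys.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        yield {
            key: value
            for key, value in record.items()
            if key not in skip_keys and not isinstance(value, dict)
        }


# In-memory stand-in for a real file (illustrative rows, not Yelp data).
sample = io.StringIO(
    '{"business_id": "a1", "state": "PA", "hours": {"Monday": "9:0-17:0"}}\n'
    '\n'
    '{"business_id": "b2", "state": "CA", "hours": null}\n'
)
flat_records = list(iter_flat_records(sample))
print(flat_records)
```

With a real file, passing the open file handle instead of `sample` and wrapping the result in `pd.DataFrame(list(...))` should give a comparable dataframe, though the exact columns may differ from those produced by `extract_non_nested_fields`.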

In [2]:
filepath = 'data/yelp_dataset/yelp_academic_dataset_business.json'

In [3]:
business_df = load_json_to_df(filepath)

loading json


100% (150346 of 150346) |################| Elapsed Time: 0:00:02 Time:  0:00:02

reformatting json


100% (150346 of 150346) |################| Elapsed Time: 0:00:00 Time:  0:00:00


loading data into dataframe


Let's take a look at the dataframe's contents.

In [4]:
business_df.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,categories
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,CA,93101,34.426679,-119.711197,5.0,7,0,"Doctors, Traditional Chinese Medicine, Naturop..."
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,1,"Shipping Centers, Local Services, Notaries, Ma..."
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,0,"Department Stores, Shopping, Fashion, Home & G..."
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,1,"Restaurants, Food, Bubble Tea, Coffee & Tea, B..."
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,PA,18054,40.338183,-75.471659,4.5,13,1,"Brewpubs, Breweries, Food"


We will use a custom `pythonCodeExecution` tool to enable our LangChain agent to perform operations on the pandas dataframes in our Python environment and store the results. This tool requires every input dataframe used in the analysis to be registered with `chat_agent` by calling `chat_agent.add_agent_input`. This function takes the variable, a variable name, and a variable description as input arguments.
For example:
```
import chat_agent

chat_agent.add_agent_input(
    business_df,
    'business_df',
    'this is a dataframe containing information about businesses'
)
```

In [5]:
import chat_agent

In [6]:
chat_agent.add_agent_input(
    business_df,
    'business_df',
    'this is a dataframe containing information about businesses',
    include_df_head=False
)

In [7]:
%%chat_agent -n
Give me all of the businesses that are in PA.

Prompt:
Give me all of the businesses that are in PA.

When using the pythonCodeExecution tool you may assume that you have access to the following variables when writing the code:

business_df - this is a dataframe containing information about businesses


> Entering new AgentExecutor chain...
on action
I need to use the pythonCodeExecution tool to filter the business_df dataframe for businesses in PA.
Action: pythonCodeExecution
Action Input: result = business_df[business_df['state'] == 'PA']
saving result to agent input  result

Observation: Answer has been successfully derived. Key: result
Thought: I now know the final answer
Final Answer: All of the businesses that are in PA are stored in the result variable.

> Finished chain.


'All of the businesses that are in PA are stored in the result variable.'

In [8]:
chat_agent.get_result('result')

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,categories
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,1,"Restaurants, Food, Bubble Tea, Coffee & Tea, B..."
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,PA,18054,40.338183,-75.471659,4.5,13,1,"Brewpubs, Breweries, Food"
15,MUTTqe8uqyMdBl186RmNeA,Tuna Bar,205 Race St,Philadelphia,PA,19106,39.953949,-75.143226,4.0,245,1,"Sushi Bars, Restaurants, Japanese"
19,ROeacJQwBeh05Rqg7F6TCg,BAP,1224 South St,Philadelphia,PA,19147,39.943223,-75.162568,4.5,205,1,"Korean, Restaurants"
25,PSo_C1Sfa13JHjzVNW6ziQ,Indian Walk Veterinary Center,662 Durham Rd,Newtown,PA,18940,40.273493,-74.987496,5.0,15,1,"Pet Services, Pet Groomers, Pets, Veterinarians"
...,...,...,...,...,...,...,...,...,...,...,...,...
150325,l9eLGG9ZKpLJzboZq-9LRQ,Wawa,19 N Bishop Ave,Clifton Heights,PA,19018,39.925656,-75.310344,3.0,11,1,"Restaurants, Sandwiches, Convenience Stores, C..."
150326,JDM7kiCEmeBc_5W-z6eZZw,2601 Parkway Apartments,2601 Pennsylvania Ave,Philadelphia,PA,19130,39.968806,-75.180608,3.0,6,1,"Real Estate, Apartments, Home Services"
150329,9U1Igcpe954LoWZRmNc-zg,Hand & Stone Massage And Facial Spa,"1100 S Columbus Blvd, Ste 24",Philadelphia,PA,19147,39.932756,-75.144504,3.0,32,1,"Day Spas, Beauty & Spas, Skin Care, Massage"
150334,LJ4GjQ1HL6kqvIPpNUNNaQ,Shanti Yoga and Ayurveda,"1638 Pine St, Fl 1",Philadelphia,PA,19103,39.945966,-75.169666,4.5,39,1,"Health & Medical, Yoga, Shopping, Naturopathic..."


In [9]:
%%chat_agent -n
what 10 business in the dataframe result have the greatest number of reviews

Prompt:
what 10 business in the dataframe result have the greatest number of reviews

When using the pythonCodeExecution tool you may assume that you have access to the following variables when writing the code:

business_df - this is a dataframe containing information about businesses

result - object of type <class 'pandas.core.frame.DataFrame'> related to the thought "I need to use the pythonCodeExecution tool to filter the business_df dataframe for businesses in PA.". The dataframe has the columns ['business_id' 'name' 'address' 'city' 'state' 'postal_code' 'latitude'
 'longitude' 'stars' 'review_count' 'is_open' 'categories']


> Entering new AgentExecutor chain...
on action
I need to use the pythonCodeExecution tool to filter the business_df dataframe for businesses in PA.
Action: pythonCodeExecution
Action Input: 
result = business_df[business_df['state'] == 'PA'].sort_values(by='review_count', ascending=False).head(10)
saving result to agent input  res

'The 10 businesses in PA with the greatest number of reviews are: result2.'
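
The code the agent generated combines a boolean mask with `sort_values(...).head(10)`. The same pattern can be tried on a small, made-up dataframe; the rows below are illustrative and not taken from the Yelp data. `DataFrame.nlargest` is shown as a shortcut that should select the same rows when there are no ties in `review_count`.

```python
import pandas as pd

# Toy data with the two columns the query relies on (made-up values).
toy_df = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D"],
        "state": ["PA", "CA", "PA", "PA"],
        "review_count": [80, 500, 245, 13],
    }
)

# Boolean mask: keep only the rows whose state is PA.
pa_df = toy_df[toy_df["state"] == "PA"]

# Top 2 by review count, written two ways.
top_sorted = pa_df.sort_values(by="review_count", ascending=False).head(2)
top_nlargest = pa_df.nlargest(2, "review_count")

print(top_sorted)
```

Replacing `2` with `10` and `toy_df` with `business_df` gives the query the agent ran.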

In [None]:
chat_agent.get_result('result2')

In [None]:
%%chat_agent
plot result2 on a map

Now we'll load the reviews.

In [None]:
review_df = load_json_to_df('data/yelp_dataset/yelp_academic_dataset_review.json')