In [1]:
from openai import OpenAI
import pandas as pd

In [2]:
client = OpenAI()

In [5]:
with open('tables/employees.csv', 'r') as f:
    employees = f.read()
with open('tables/volunteers.csv', 'r') as f:
    volunteers = f.read()

In [37]:
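def table_augmented_completion(model: str, scenario: str, prompt: str, tables: list[str]):
    """Ask `model` the question in `prompt` about `tables`, framed by `scenario`."""
    # The scenario goes first as a system message, followed by one user message per table.
    messages = [{'role': 'system', 'content': scenario + ' Here are the tables:'}]
    messages += [{'role': 'user', 'content': table} for table in tables]
    # The question is sent last, also as a system message.
    messages.append({'role': 'system', 'content': prompt})
    return client.chat.completions.create(
        model=model,
        messages=messages
    )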
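
To see exactly what is sent to the model, the message list can be built without calling the API. The sketch below is an illustration only: `build_table_messages` is a hypothetical helper that mirrors the layout used in `table_augmented_completion`, and the table strings are placeholders rather than the real CSV contents.

```python
def build_table_messages(scenario: str, prompt: str, tables: list[str]) -> list[dict]:
    """Hypothetical helper: same message layout as above, without the API call."""
    messages = [{'role': 'system', 'content': scenario + ' Here are the tables:'}]
    messages += [{'role': 'user', 'content': table} for table in tables]
    messages.append({'role': 'system', 'content': prompt})
    return messages


# Placeholder table strings, not the contents of the real CSV files.
preview = build_table_messages('Scenario.', 'Question?', ['a,b\n1,2', 'c,d\n3,4'])
for message in preview:
    # Show the role and a flattened snippet of each message.
    print(message['role'], '|', message['content'][:40].replace('\n', ' / '))
```

Each table travels as its own user message, so the model only knows which table is "first" or "second" from the order of the messages.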

In [39]:
completion = table_augmented_completion(
    'gpt-3.5-turbo',
    'You are a volunteer coordinator at a non-profit organization. You have two tables of data. The first table contains information about your employees, and the second table contains information about your volunteers. You need to find the employees who are also volunteers.',
    'Find the employees who are also volunteers.',
    [employees, volunteers])

In [40]:
print(completion.choices[0].message.content)

To find the employees who are also volunteers, you need to look for matches in the "Workplace" column in the employees table and the "Representing" column in the volunteers table.

Based on the given tables, the employees who are also volunteers are:

- J. Doe, representing Apple
- J. Murphy, representing DFA Records
- L. Smith, representing Google


This answer is incorrect. Although "J. Doe" and "J. Murphy" do correspond to entries in the volunteers table, "L. Smith" and "F. Bernard" do not. Next, I check whether GPT-4 can do a better job.

In [None]:
completion4 = table_augmented_completion(
    'gpt-4-0125-preview',
    'You are a volunteer coordinator at a non-profit organization. You have two tables of data. The first table contains information about your employees, and the second table contains information about your volunteers. You need to find the employees who are also volunteers.',
    'Find the employees who are also volunteers.',
    [employees, volunteers])

In [28]:
print(completion4.choices[0].message.content)

To find the employees who are also volunteers, we should match the names in the "Employee name" column with the names in the "Volunteer Name" column and consider their workplaces and representing organizations if necessary for verification. Here are the matches based on the provided data:

1. **J. Doe** from Apple is likely **Jane Doe** representing Apple. 
2. **J. Murphy** from DFA Records may correspond to **James Murphy** representing DFA Records.

Thus, the employees who are also volunteers are:

- J. Doe (Jane Doe) from Apple
- J. Murphy (James Murphy) from DFA Records

Please note that the initials and the workplaces align for these matches, but without more details (e.g., full names for all entries), some assumptions are unavoidable.
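
The matching rule described in this reply (first initial plus last name, verified by organization) can also be written deterministically as a rough cross-check. The sketch below is not part of the original experiment. The column names are taken from the model's replies and may not match the real CSV headers, and every row is made up for illustration.

```python
import csv
import io

# Made-up stand-ins for the two CSV files; headers and rows are assumptions.
EMPLOYEES_CSV = """Employee name,Workplace
J. Doe,Apple
J. Murphy,DFA Records
L. Smith,Google
"""
VOLUNTEERS_CSV = """Volunteer Name,Representing
Jane Doe,Apple
James Murphy,DFA Records
Lena Smith,Mozilla
"""


def name_key(name: str, organization: str) -> tuple[str, str, str]:
    """Reduce a name to (first initial, last name, organization), lowercased."""
    parts = name.replace('.', '').split()
    return parts[0][0].lower(), parts[-1].lower(), organization.strip().lower()


def read_rows(text: str) -> list[dict]:
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


volunteer_keys = {
    name_key(row['Volunteer Name'], row['Representing'])
    for row in read_rows(VOLUNTEERS_CSV)
}
matches = [
    row['Employee name']
    for row in read_rows(EMPLOYEES_CSV)
    if name_key(row['Employee name'], row['Workplace']) in volunteer_keys
]
print(matches)
```

A rule like this is brittle (it assumes "First Last" or "F. Last" formatting), which is part of why a language model is attractive for the task, but it gives a reference answer to compare against.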


GPT-4 Turbo gives the correct answer. I now test whether this generalizes to other tasks. In particular, I am interested in *column-wise* entity resolution: can the model find non-obvious join paths between tables?

In [30]:
response = client.chat.completions.create(
    model = 'gpt-4-0125-preview',
    messages = [
        {'role': 'system', 'content': 'You would like to join two tables after a company wide fundraiser. One table contains employee information. The other contains fundraising results. Here are the column names for each table:'},
        {'role': 'user', 'content': 'name,id,role,department,location,email'},
        {'role': 'user', 'content': 'first,last,amount_raised,cause'},
        {'role':'system', 'content': 'Return all plausible join-columns with a confidence rating between 0 and 100. Your response should only be a set of columns-pairs and confidence ratings with the format, "(column1, column2): confidence", and no additional information.'}
    ]
)

print(response.choices[0].message.content)

Given the lack of directly matching column names between the two tables, the join between them would likely require contextual understanding or assumptions based on common data linking practices. However, without explicit data to analyze or more information, the plausible join columns are theorized as follows:

- Since there is no direct column that appears to match between the two sets, a common practice might involve using the "name" from the employee information table and potentially splitting or combining the "first" and "last" from the fundraising results table to establish a link based on employee names. However, this approach assumes that the names are consistently formatted and uniquely identify individuals across both tables, which might not always be the case.

Given this, the confidence in any direct match is speculative:

- ("name", "first"): Confidence 50
- ("name", "last"): Confidence 50

The confidence ratings reflect the indirect and potentially error-prone nature of at
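
The reply above ignores the requested "(column1, column2): confidence" format and wraps the pairs in prose. If the ratings are to be used downstream, a tolerant parser helps. The sketch below is one possible approach and an assumption on my part: `parse_join_candidates` is a hypothetical helper whose pattern accepts both the requested form and the quoted "Confidence 50" form seen in the output.

```python
import re

# Accepts '(a, b): 50' as requested, and '("a", "b"): Confidence 50' as observed.
PAIR_PATTERN = re.compile(
    r'\(\s*"?(?P<left>[\w ]+?)"?\s*,\s*"?(?P<right>[\w ]+?)"?\s*\)'
    r'\s*:\s*(?:confidence\s*)?(?P<score>\d{1,3})',
    re.IGNORECASE,
)


def parse_join_candidates(reply: str) -> list[tuple[str, str, int]]:
    """Extract (column1, column2, confidence) triples, skipping surrounding prose."""
    return [
        (m['left'].strip(), m['right'].strip(), int(m['score']))
        for m in PAIR_PATTERN.finditer(reply)
    ]


# The two rated pairs copied from the reply above.
reply = '''Given this, the confidence in any direct match is speculative:

- ("name", "first"): Confidence 50
- ("name", "last"): Confidence 50
'''
print(parse_join_candidates(reply))
```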

Next, I attempt true table-joining.

In [44]:
with open('tables/employees2.csv', 'r') as f:
    employees2 = f.read()
with open('tables/fundraise.csv', 'r') as f:
    fundraise = f.read()

In [48]:
resp = table_augmented_completion(
    'gpt-4-0125-preview', 
    'You are analyzing data after a company fundraiser featuring employees from different departments. Your first table lists employees and their departments and your second table contains the amount each participant raised.',
     'From which department was the employee that raised 328? Give the full record from the first table that corresponds to that employee, and a confidence rating from 0 to 1. Do not detail the process by which you came to this conclusion.',
     [employees2, fundraise])

print(resp.choices[0].message.content)

Employee Name: Yang,Kevin
Role: Accounting Manager
Confidence Rating: 1


In [49]:
def fundraiser_problem(question):
    return table_augmented_completion(
        'gpt-4-0125-preview',
        'You are analyzing data after a company fundraiser featuring employees from different departments. Your first table lists employees and their departments and your second table contains the amount each participant raised.',
        question, [employees2, fundraise])

In [52]:
fundraiser_problem('How much money did Michael raise? Give only the number.').choices[0].message.content

'3100'

In [56]:
employees2 += '\n"Scott,Martin",Accountant'
print(employees2)

employee_name, Role
"Brown,Sarah",HR Representative
"Scott,Michael",Branch Manager
"Tao,Alex",Business Development Strategy Lead
"Yang,Kevin",Accounting Manager
"Patel,Mahesh",Sales Representative
"Davis,Christopher",Sales Representative

"Scott,Martin",Accountant


In [57]:
fundraiser_problem('What was the role of the employee that raised 3100? Do not give any reasoning, simply give your answer (or answers) and a confidence level for each.').choices[0].message.content

'Role: Branch Manager\nConfidence Level: High'

In [58]:
fundraiser_problem('Was the role of the employee that raised 3100 an accountant?').choices[0].message.content

'No, the role of the employee who raised $3100 was not an Accountant. The employee who raised $3100 is M. Scott, whose full name is Scott, Michael, and his role is listed as Branch Manager.'

In [59]:
fundraiser_problem('How much money did Martin raise?').choices[0].message.content

'Martin Scott is not listed in the table of Fundraised Amount. Therefore, we cannot determine how much money Martin raised based on the provided information.'

In [60]:
fundraiser_problem('Could M. Scott be Martin Scott?').choices[0].message.content

'Based on the information provided in the two tables, M. Scott appears to be Michael Scott, not Martin Scott. This is deduced from the role listed next to "Scott,Michael" as a Branch Manager, and there is a significant amount of funds raised under the name M. Scott (3100). Martin Scott is listed as an Accountant without a corresponding amount raised in the second table; thus, it\'s reasonable to conclude M. Scott refers to Michael Scott.'
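
Once "Scott,Martin" is added, the abbreviated name "M. Scott" is consistent with two employees, yet the replies above settle on Michael without flagging the ambiguity. A small deterministic check makes this visible. The sketch below reuses the employee table as printed earlier. `candidates` is a hypothetical helper, and the "F. Last" form of names in the fundraising table is an assumption based on the model's replies, since that table is not shown.

```python
import csv
import io

# Employee table as printed above, including the appended "Scott,Martin" row.
EMPLOYEES2 = '''employee_name, Role
"Brown,Sarah",HR Representative
"Scott,Michael",Branch Manager
"Tao,Alex",Business Development Strategy Lead
"Yang,Kevin",Accounting Manager
"Patel,Mahesh",Sales Representative
"Davis,Christopher",Sales Representative

"Scott,Martin",Accountant
'''


def candidates(abbreviated: str, table: str) -> list[str]:
    """Return every 'Last,First' employee consistent with an 'F. Last' style name."""
    initial, last = abbreviated.replace('.', ' ').split()
    rows = csv.reader(io.StringIO(table))
    next(rows)  # skip the header row
    hits = []
    for row in rows:
        if not row:  # the blank line left by the append
            continue
        emp_last, emp_first = row[0].split(',')
        if emp_last.lower() == last.lower() and emp_first.lower().startswith(initial.lower()):
            hits.append(row[0])
    return hits


print(candidates('M. Scott', EMPLOYEES2))
```

More than one candidate means the join is ambiguous, so a "High" confidence answer for that row deserves scrutiny.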