## Enriching Missing Tabular Data With the You.com Research API

In this tutorial, we will fill in the missing values of a tabular dataset, primarily using the `You.com` Research API.

First, set the environment variable `YDC_API_KEY` to your `You.com` API key.

In [None]:
import os
os.environ["YDC_API_KEY"] = "<YOUR YOU.COM API KEY>"
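
Assigning the key inline works for a quick demo, but it is easy to run the cell with the placeholder still in place. A minimal sketch of a guard, assuming the key may instead have been exported in the shell; `require_env` is a hypothetical helper, not part of any You.com SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    # Reject both an unset variable and a "<...>" placeholder left unedited.
    if not value or value.startswith("<"):
        raise RuntimeError(f"Set the {name} environment variable before running this notebook.")
    return value
```

Calling `require_env("YDC_API_KEY")` before the first request surfaces a missing key immediately rather than as an authentication error later on.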

### Creating a synthetic dataset for enrichment

We will create a dataframe with a few company names and columns for number of employees, NAICS code, headquarters location, and founding year, all initially empty. Our aim is to fill these missing values using the `You.com` Research API.

In [208]:
import pandas as pd

data = {
    'company': ["Apple", 'Canadian Tire', 'Home Depot', 'LinkedIn Corporation', 'General Dynamics'],
    'number of employees': ['', '', '', '', ''],
    'NAICS code(s)': ['', '', '', '', ''],
    'headquarter location': ['', '', '', '', ''],
    'founded year': ['', '', '', '', ''],
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
df.head()

Unnamed: 0,company,number of employees,NAICS code(s),headquarter location,founded year
0,Apple,,,,
1,Canadian Tire,,,,
2,Home Depot,,,,
3,LinkedIn Corporation,,,,
4,General Dynamics,,,,
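
In this synthetic table every cell is an empty string, but real data often mixes empty strings, whitespace, and `NaN`. The following is a sketch of how the missing columns could be detected per row under that assumption; `find_missing_columns` is a hypothetical helper name.

```python
import pandas as pd


def find_missing_columns(df: pd.DataFrame) -> dict:
    """Map each row index to the list of columns whose value is blank or NaN."""
    # Flag empty or whitespace-only strings, cell by cell.
    blank = df.apply(lambda col: col.map(lambda v: isinstance(v, str) and v.strip() == ""))
    # A cell counts as missing if it is NaN/None or a blank string.
    mask = df.isna() | blank
    return {idx: [c for c in df.columns if mask.at[idx, c]] for idx in df.index}


example = pd.DataFrame({
    "company": ["Apple", "Home Depot"],
    "founded year": ["", "1978"],
    "NAICS code(s)": [None, "  "],
})
print(find_missing_columns(example))
```

Only the columns reported for a row need to be mentioned in the query for that company.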


### Searching and getting results with the `You.com` API

We will obtain the missing values with a single call to the `You.com` API.

In [209]:
import requests
import os

# obtain the API key from the environment
headers = {'x-api-key': os.environ['YDC_API_KEY']}

def get_research_data(company_name, missing_cols, mode="research"):
    """
    Query the You.com API for the given company name and list of missing columns,
    and return the answer text from the JSON response.
    """
    endpoint = f"https://chat-api.you.com/{mode}"
    params = {"query": f"""
              I am trying to find some missing information corresponding to {company_name}.
              For this company, get me more information on each of the following: {missing_cols}.
              """
        }
    response = requests.get(endpoint, params=params, headers=headers)
    # fail with a clear HTTP error instead of a KeyError on a missing "answer" field
    response.raise_for_status()
    return response.json()["answer"]

In [210]:
# Let's look at the results returned by the API for one of the companies
you_response = get_research_data("General Dynamics",
                           ['number_of_employees', 'NAICS_code(s)', 'headquarter_location', 'founded_year'])
print(you_response)

# General Dynamics: Comprehensive Company Information

## Number of Employees

As of December 31, 2023, General Dynamics had 111,600 employees, marking an increase of 5,100 employees or 4.79% compared to the previous year [[1]](https://stockanalysis.com/stocks/gd/employees/#:~:text=General%20Dynamics%20had%20111%2C600%20employees%20on%20December%2031%2C%202023). This number is consistent across multiple sources, confirming the company's substantial workforce [[2]](https://www.forbes.com/companies/general-dynamics/#:~:text=Employees%20111%2C600)[[3]](https://www.gd.com/about-gd/faqs#:~:text=As%20of%20January%202021%2C%20General%20Dynamics%20had%20more%20than%20100%2C000%20full%2Dtime%20employees.%20About%2084%2C000%20of%20these%20are%20based%20in%20the%20United%20States%2C%20and%2016%2C000%20are%20based%20in%20more%20than%2070%20countries%20outside%20the%20United%20States).

## NAICS Code(s)

General Dynamics operates under the NAICS code 336414, which pertains to Guided Missile and Spa
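
The answer embeds its sources as Markdown links of the form `[[n]](url)`. If you want to keep those sources next to the enriched values, they can be pulled out with a regular expression. This is a sketch that assumes the citation format seen in the output above; it is inferred from that sample rather than from a documented guarantee, and it would miss URLs that contain a closing parenthesis.

```python
import re

# Matches [[1]](https://example.com/page) and captures the number and the URL.
CITATION_RE = re.compile(r"\[\[(\d+)\]\]\((\S+?)\)")


def extract_citations(answer: str) -> dict:
    """Return {citation number: page URL} for every [[n]](url) link in the answer."""
    citations = {}
    for number, url in CITATION_RE.findall(answer):
        # Drop the "#:~:text=..." text fragment so only the page URL remains.
        page_url = url.split("#:~:text=")[0]
        # Keep the first URL seen for each citation number.
        citations.setdefault(int(number), page_url)
    return citations


sample = (
    "General Dynamics had 111,600 employees "
    "[[1]](https://stockanalysis.com/stocks/gd/employees/#:~:text=General%20Dynamics). "
    "Other sources agree [[2]](https://www.forbes.com/companies/general-dynamics/#:~:text=Employees)"
    "[[3]](https://www.gd.com/about-gd/faqs#:~:text=As%20of%20January%202021)."
)
print(extract_citations(sample))
```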

### Value extraction with GPT function calling

The response from the You.com API is long-form text with links to the relevant sources. Because we want to fill our table with exact values, we will use GPT function calling to extract the value of each missing field.

In [207]:
from openai import OpenAI
import json

os.environ["OPENAI_API_KEY"] = "<YOUR OPENAI API KEY>"

client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

In [221]:
# Let's create a function to extract the missing data from the You.com response using GPT function calling
def get_missing_data(you_com_response):
    """
    Given the response from the You.com API, extract the missing data fields using GPT function calling
    and return the extracted data as a dict.
    """
    prompt = f"Given the {you_com_response}, extract number of employees, NAICS code, headquarter location, and founded year."

    # function description to specify the format of the extracted data
    function = [
        {
            "name": "get_company_data",
            "description": "Extract the relevant data corresponding to each field",
            "parameters": {
                "type": "object",
                "properties": {
                    "number of employees": {
                        "type": "string",
                        "description": "The number of employees working in the company",
                    },
                    "NAICS code(s)": {
                        "type": "string",
                        "description": "The NAICS code of the company",
                    },
                    "headquarter location": {
                        "type": "string",
                        "description": "The headquarter location of the company",
                    },
                    "founded year": {
                        "type": "string",
                        "description": "The year in which the company was founded",
                    },
                },
                "required": ["number of employees", "NAICS code(s)", "headquarter location", "founded year"],
            },
        }
    ]

    # call the model and force it to answer through get_company_data,
    # so that message.function_call is always populated
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": prompt}
        ],
        functions=function,
        function_call={"name": "get_company_data"}
    )

    # get the message from the completion which contains the extracted data
    output = completion.choices[0].message
    # parse the output to extract the json object
    json_output = json.loads(output.function_call.arguments)
    
    return json_output

In [222]:
# Let's look at the extracted missing data for the company from earlier
get_missing_data(you_response)

{'number of employees': '111600',
 'NAICS code(s)': '336414',
 'headquarter location': 'Reston, Virginia, USA',
 'founded year': '1952'}

### Fill the missing data

Finally, let's run our functions: first search for the information with the `You.com` Research API, then extract the values of the missing columns in `json` format with GPT function calling.

In [224]:
# call the API to get the missing data and update the DataFrame
for index, row in df.iterrows():
    missing_cols = [col for col in df.columns if row[col] == '']

    if missing_cols:
        # get the missing data from the You.com API
        you_response = get_research_data(row['company'], missing_cols)

        # extract the missing data using chatGPT function calling
        json_response = get_missing_data(you_response)

        # fill the data into original DataFrame
        for col in missing_cols:
            df.at[index, col] = json_response.get(col, '')
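
The loop above stops at the first network or parsing error, which discards the work done for the remaining companies. One possible variant, sketched here under the assumption that skipping a failed row is acceptable, takes the two steps as arguments and records which rows failed. `enrich_dataframe`, `fetch`, and `extract` are hypothetical names introduced for illustration.

```python
import pandas as pd


def enrich_dataframe(df, fetch, extract, key_col="company"):
    """Fill empty-string cells row by row and return the index labels of failed rows."""
    failed = []
    for index, row in df.iterrows():
        missing_cols = [col for col in df.columns if row[col] == ""]
        if not missing_cols:
            continue
        try:
            # fetch returns free text, extract turns it into a dict of column values
            record = extract(fetch(row[key_col], missing_cols))
        except Exception as exc:  # broad on purpose: network, HTTP, and JSON errors
            print(f"Skipping {row[key_col]!r}: {exc}")
            failed.append(index)
            continue
        for col in missing_cols:
            df.at[index, col] = record.get(col, "")
    return failed
```

With the functions defined earlier, the call would be `failed = enrich_dataframe(df, get_research_data, get_missing_data)`, and the rows listed in `failed` can be retried later.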

In [225]:
# Finally, let's display the updated DataFrame with the missing data filled in
df.head()

Unnamed: 0,company,number of employees,NAICS code(s),headquarter location,founded year
0,Apple,161000,"511210, 334111","Cupertino, California, 95014, United States","April 1, 1976"
1,Canadian Tire,68000,"441310, 441320, 441110, 441120, 441210, 441222...","Toronto, Ontario",1922
2,Home Depot,463100,"444130, 444110, 23","2455 Paces Ferry Road, Atlanta, Georgia 30339,...",1978
3,LinkedIn Corporation,19400,541511,"Sunnyvale, California",2003
4,General Dynamics,111600,336414,"Reston, Virginia, USA",1952
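
The extracted values are free-form strings, so formats vary between rows: Apple's founding year came back as "April 1, 1976" while the others are plain years. A minimal post-processing sketch, assuming a four-digit year and a plain integer head count are the desired formats; `parse_year` and `parse_employee_count` are hypothetical helpers.

```python
import re


def parse_year(value):
    """Return the first four-digit year (1000-2999) found in the text, or None."""
    match = re.search(r"\b([12]\d{3})\b", str(value))
    return int(match.group(1)) if match else None


def parse_employee_count(value):
    """Return the count as an int after removing thousands separators, or None."""
    digits = re.sub(r"[,\s]", "", str(value))
    # Only accept plain ASCII digits; anything else (e.g. "about 68000") yields None.
    return int(digits) if re.fullmatch(r"[0-9]+", digits) else None


print(parse_year("April 1, 1976"))
print(parse_employee_count("111,600"))
```

These can be applied column-wise, for example `df["founded year"].map(parse_year)`. Values that do not parse come back as `None`, so they remain visible as gaps instead of being silently coerced.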
