## Enriching Missing Tabular Data With the You.com Research API

In this tutorial, we will fill in the missing values of a tabular dataset, primarily using the `You.com` Research API.

First, set the environment variable `YDC_API_KEY` to your `You.com` API key.

In [None]:
import os
os.environ["YDC_API_KEY"] = "<YOUR YOU.COM API KEY>"

#### Creating a synthetic dataset for enrichment

We will create a DataFrame of company domains with columns for number of employees, NAICS code, headquarters location, and founded year, most of which are left empty. Our aim is to fill these missing values using the `You.com` Research API.

In [62]:
import pandas as pd

data = {
    'company': ['https://www.enbridge.com/', 'https://www.canadiantire.ca/en.html', 'https://www.homedepot.ca/en/home.html', 'https://atsautomation.com/', 
                'https://www.gd.com/'],
    'number_of_employees': ['', '', '', '', '106500'],
    'NAICS_code': ['', '', '', '', ''],
    'headquarter_location': ['', '', '', '', ''],
    'founded_year': ['', '', '', '', '1952'],
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
df.head()

Unnamed: 0,company,number_of_employees,NAICS_code,headquarter_location,founded_year
0,https://www.enbridge.com/,,,,
1,https://www.canadiantire.ca/en.html,,,,
2,https://www.homedepot.ca/en/home.html,,,,
3,https://atsautomation.com/,,,,
4,https://www.gd.com/,106500.0,,,1952.0
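
Before calling the API, it can help to see which fields each row is actually missing. The following is a minimal sketch in plain Python, assuming the same column-oriented layout as `data` above. The helper name `missing_fields` is hypothetical, and only two of the rows are reproduced to keep the example short.

```python
# Two of the rows from the dataset above, in the same column-oriented layout.
data = {
    'company': ['https://atsautomation.com/', 'https://www.gd.com/'],
    'number_of_employees': ['', '106500'],
    'NAICS_code': ['', ''],
    'headquarter_location': ['', ''],
    'founded_year': ['', '1952'],
}

def missing_fields(data, key='company'):
    """Map each company to the list of columns whose value is an empty string."""
    result = {}
    for i, company in enumerate(data[key]):
        result[company] = [col for col, values in data.items()
                           if col != key and values[i] == '']
    return result

for company, cols in missing_fields(data).items():
    print(company, '->', cols)
```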


#### Obtaining results with `You.com` API

We will obtain the missing values with a simple call to the `You.com` API.

In [87]:
import requests
import os

# obtain the API key from the environment
headers = {'x-api-key': os.environ['YDC_API_KEY']}

def get_research_data(company_domain, missing_cols, mode="research"):
    """
    Run the query with the given company domain and missing columns to obtain the missing data
    in the form of a JSON object.
    """
    endpoint = f"https://chat-api.you.com/{mode}"
    params = {"query": f"""
              I'm trying to enrich data corresponding to {company_domain}. I have a few questions, and 
              the answer to them is a single word answer. Get me more information on {missing_cols}. 
              Return response as a JSON object.
              """
        }
    response = requests.get(endpoint, params=params, headers=headers)
    return response.json()["answer"]
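
Because the update step looks up each returned value by column name, it may help to ask for those exact names as JSON keys. The sketch below shows one possible way to build such a query. `build_query` is a hypothetical helper, not part of the `You.com` API, and there is no guarantee that the model follows the instruction.

```python
def build_query(company_domain, missing_cols):
    """Build a prompt that asks for a JSON object keyed by the given column names."""
    keys = ", ".join(f'"{col}"' for col in missing_cols)
    return (
        f"I'm trying to enrich data corresponding to {company_domain}. "
        f"Return a single JSON object with exactly these keys: {keys}. "
        "Each value should be a short string."
    )

print(build_query("https://www.gd.com/", ["NAICS_code", "headquarter_location"]))
```

The returned string could be passed as the `query` parameter in place of the inline prompt.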

In [None]:
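# call the API to get the missing data and update the DataFrame
import json

for index, row in df.iterrows():
    missing_cols = [col for col in df.columns if row[col] == '']

    if missing_cols:
        answer = get_research_data(row['company'], missing_cols)

        # get_research_data returns text, and the JSON may be wrapped in a
        # Markdown code fence, so parse only the outermost {...} span
        values = json.loads(answer[answer.find('{'):answer.rfind('}') + 1])

        # TODO: add post-processing to get the correct column values
        for col in missing_cols:
            # leave the cell empty if the API did not return this field
            df.at[index, col] = values.get(col, '')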

In [88]:
empty_fields = df.apply(lambda row: [row['company']] + [col for col in df.columns if row[col] == ''], axis=1)[1]

result = get_research_data(empty_fields[0], empty_fields[1:])
# get_research_data already returns the "answer" text, so print it directly
print(result)

```json
{
  "number_of_employees": "10,000+",
  "NAICS_code": "441310",
  "headquarter_location": "Toronto, Ontario",
  "founded_year": "1922"
}
```
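
The values come back as strings, and some of them (such as `"10,000+"`) are not directly numeric. The following is a minimal post-processing sketch, assuming the employee count and founded year should become integers while the NAICS code stays a string. `to_int` and `clean_record` are hypothetical helpers, and treating `"10,000+"` as 10000 discards the fact that it is a lower bound.

```python
import re

def to_int(text):
    """Return the first integer in text, ignoring thousands separators, or None."""
    match = re.search(r"\d[\d,]*", str(text))
    if match is None:
        return None
    return int(match.group().replace(",", ""))

def clean_record(record, int_cols=("number_of_employees", "founded_year")):
    """Copy record, converting the columns in int_cols to integers where possible."""
    cleaned = dict(record)
    for col in int_cols:
        if col in cleaned:
            cleaned[col] = to_int(cleaned[col])
    return cleaned

# The answer shown above, already parsed into a dictionary.
record = {
    "number_of_employees": "10,000+",
    "NAICS_code": "441310",
    "headquarter_location": "Toronto, Ontario",
    "founded_year": "1922",
}
print(clean_record(record))
```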
