In [34]:
import requests
import json
import time
import pandas as pd

In [35]:
fields = ["Firm_Name", "Registered_Address", "CEO", "Establishment_Year", "Number_Of_Employees",
          "Revenue_Size", "Website", "NAICS_Code", "SIC_Code", "Status"]

Define the Llama helper function
- Ollama must be installed and running locally first; see the setup instructions here: https://github.com/meta-llama/llama-recipes/blob/main/recipes/quickstart/Running_Llama3_Anywhere/Running_Llama_on_Mac_Windows_Linux.ipynb

In [36]:
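url = "http://localhost:11434/api/chat"  # chat endpoint of the local Ollama server

def llama3(prompt):
    """Send one user message to the local Ollama server and return the reply text."""
    data = {
        "model": "llama3.2",
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ],
        "stream": False,  # ask for one complete JSON reply instead of a stream
    }

    headers = {
        "Content-Type": "application/json"
    }

    response = requests.post(url, headers=headers, json=data)
    # Raise on HTTP errors so a failed request is not mistaken for a malformed reply
    response.raise_for_status()
    return response.json()["message"]["content"]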
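Before running the rest of the notebook, it can help to confirm that the Ollama server is reachable and that the model has been pulled. The sketch below is one way to do that, assuming the server exposes its model list at `/api/tags` on the same host and port as the chat endpoint above. The helper names `list_ollama_models` and `model_available` are hypothetical and not part of the original notebook.

```python
import http.client
import json
import urllib.error
import urllib.request


def list_ollama_models(base_url="http://localhost:11434", timeout=5):
    """Return the names of locally available models, or None if the server is unreachable.

    Assumes the Ollama server lists its local models at /api/tags.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, or a reply that is not valid JSON
        return None
    return [m.get("name") for m in payload.get("models", [])]


def model_available(names, wanted):
    """True if `wanted` matches a model name exactly or ignoring its ':tag' suffix."""
    return any(n == wanted or n.startswith(wanted + ":") for n in (names or []))
```

Calling `model_available(list_ollama_models(), "llama3.2")` before the main loop gives a quick yes/no answer instead of a long series of failed requests.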

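Test the function. The model's answer about its own architecture is not reliable (the request above targets llama3.2), so treat this only as a connectivity check.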

In [37]:
response = llama3("which llama model are you and how many parameters do you have?")
print(response)

I'm an instance of the Transformers model, specifically a large language model based on the BERT (Bidirectional Encoder Representations from Transformers) architecture. My primary model is based on the "big" version of the transformer, called BIGBERT.

My specific configuration has 110 million parameters, which are divided among multiple layers and sub-modules within the model.

Here's a brief breakdown:

1. **Model Architecture**: I'm based on a multi-layer bidirectional transformer encoder with a self-attention mechanism.
2. **Pre-training Data**: My weights were trained on a massive corpus of text data, including books, articles, and websites.
3. **Training Objective**: The goal was to predict the next word in a sequence given the context of previous words.

Keep in mind that my architecture is quite complex, and while I can process human language with remarkable accuracy, my performance may not be perfect for every specific task or domain.

If you have any more questions about my c

In [38]:
def form_prompt(context, query, data):
    prompt = f"""
    Context:
    {context}

    Query:
    {query}

    Relevant Data:
    {data}
    """

    return prompt
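The `data` argument is inserted into the prompt verbatim. Several responses in the final table of this notebook describe a SerpApi JSON payload instead of answering, which suggests the raw search output can overwhelm the model. A minimal sketch of condensing the results first is shown below. It assumes each entry is a SerpApi-style dictionary with an `organic_results` list whose items carry `title`, `snippet`, and `link` keys; the actual structure of `firm_google_search_results.json` is not shown here, and `condense_results` is a hypothetical helper.

```python
def condense_results(raw, max_items=5):
    """Reduce a SerpApi-style result dict to a few 'title: snippet (link)' lines.

    Assumption: `raw` has an "organic_results" list of dicts. Anything that is
    not a dict is passed through as plain text.
    """
    if not isinstance(raw, dict):
        return str(raw)
    lines = []
    for item in raw.get("organic_results", [])[:max_items]:
        title = item.get("title", "")
        snippet = item.get("snippet", "")
        link = item.get("link", "")
        lines.append(f"- {title}: {snippet} ({link})")
    return "\n".join(lines)
```

The condensed text could then be passed as `data=condense_results(...)` when calling `form_prompt`, so that metadata and pagination blocks never reach the model.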

### Main Loop

Load the Google search results, the list of firm names, and the data fields

In [47]:
# Load the saved search results, closing the file afterwards
with open("firm_google_search_results.json") as f:
    search_results = json.load(f)
# Get firm names and data fields
df_firms = pd.read_csv('FirmData.csv')
data_fields = df_firms.columns.tolist()
data_fields.remove('Firm_Name')
firm_names = df_firms.Firm_Name.tolist()
firm_names

['01K Capital LLC.',
 '1 Act Services, LLC',
 "TIN DRUM ASIACAFE', LLC",
 'Dancing Goats Coffee ARR, LLC',
 'Clickety Clack Vape Gifts LLC',
 'Amin petrol electric llc',
 'CAB CHINA, LLC',
 'E R Enterprise for Freedom LLC',
 'Georgia Tech Savannah, LLC',
 'ANDREW THOMAS LEE PHOTOGRAPHY, LLC']

Initialize the dictionary of LLM responses (or reload a previously saved one to resume an earlier run)

In [48]:
# llm_firm_data = json.load(open("llm_firm_data.json"))
llm_firm_data = {}

In [49]:
general_context = "You will be assisting me with filling in data fields for a firm database I am building. I will tell you the name of the firm I am interested in, and the field I want you to fill. I will give you relevant information from websites or Google search results that I gathered by searching for the firm name and field. You will give your answer by simply stating the value of the field I am interested in. Do not form sentences, just give the value of the field. If you have absolutely no idea about the answer, then answer with 'null'."

Ask the LLM to extract the relevant field from the search results for each firm
- Requests go to the local Llama model defined above; a failed request is retried after a one-second pause

In [50]:
for firm_name in firm_names:
    firm_entry = llm_firm_data.setdefault(firm_name, {})
    for field in data_fields:
        field_entry = firm_entry.setdefault(field, {})

        prompt = form_prompt(
            context=general_context,
            query=f"Fill in the field {field} for the firm {firm_name}",
            data=search_results[firm_name][field])

        field_entry['prompt'] = prompt

        # Skip fields that already have a response (allows resuming a run)
        if 'response' in field_entry:
            continue

        success = False

        while not success:
            try:
                response = llama3(prompt)
                field_entry['response'] = response
                print("Success for ", firm_name, field)
                success = True
            except Exception as e:
                # Report the error instead of retrying silently forever
                print("Retrying", firm_name, field, "after error:", e)
                time.sleep(1)
        

Success for  01K Capital LLC. Registered_Address
Success for  01K Capital LLC. CEO
Success for  01K Capital LLC. Establishment_Year
Success for  01K Capital LLC. Number_Of_Employees
Success for  01K Capital LLC. Revenue_Size
Success for  01K Capital LLC. Website
Success for  01K Capital LLC. NAICS_Code
Success for  01K Capital LLC. SIC_Code
Success for  01K Capital LLC. Status
Success for  1 Act Services, LLC Registered_Address
Success for  1 Act Services, LLC CEO
Success for  1 Act Services, LLC Establishment_Year
Success for  1 Act Services, LLC Number_Of_Employees
Success for  1 Act Services, LLC Revenue_Size
Success for  1 Act Services, LLC Website
Success for  1 Act Services, LLC NAICS_Code
Success for  1 Act Services, LLC SIC_Code
Success for  1 Act Services, LLC Status
Success for  TIN DRUM ASIACAFE', LLC Registered_Address
Success for  TIN DRUM ASIACAFE', LLC CEO
Success for  TIN DRUM ASIACAFE', LLC Establishment_Year
Success for  TIN DRUM ASIACAFE', LLC Number_Of_Employees
Suc
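The loop above retries a failed request indefinitely, which can hang the notebook if the server is down. One alternative is a bounded retry with an increasing pause, sketched below. The helper `call_with_retries` and its parameters (`max_attempts`, `base_delay`, `sleep`) are hypothetical and not part of the original notebook.

```python
import time


def call_with_retries(fn, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args), retrying on any exception up to max_attempts times.

    The pause doubles after each failure (base_delay, 2*base_delay, ...).
    `sleep` is injectable so the helper can be exercised without waiting.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fn(*args)
        except Exception as exc:
            last_error = exc
            # No pause after the final failed attempt
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

In the loop, `response = call_with_retries(llama3, prompt)` would replace the `while not success` block, and a `RuntimeError` would signal that a field could not be filled.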

In [43]:
with open("llm_firm_data_llama.json", "w") as f:
    json.dump(llm_firm_data, f)

Fill in the dataframe with the LLM responses

In [44]:
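for firm_name in firm_names:
    for field in data_fields:
        value = llm_firm_data[firm_name][field]['response']
        # Remove surrounding whitespace and newlines before comparing
        value = value.strip()
        # The prompt asks the model to answer 'null' when it does not know
        if value == "null":
            value = None
        df_firms.loc[df_firms.Firm_Name == firm_name, field] = value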

  df_firms.loc[df_firms.Firm_Name == firm_name, field] = value


In [45]:
df_firms

Unnamed: 0,Firm_Name,Registered_Address,CEO,Establishment_Year,Number_Of_Employees,Revenue_Size,Website,NAICS_Code,SIC_Code,Status
0,01K Capital LLC.,,,,,,,,,
1,"1 Act Services, LLC",,,,10-19,6,https://www.actservicesllc.com/,541990,8748,
2,"TIN DRUM ASIACAFE', LLC","1117 Perimeter Center West, Suite W200, Atlant...",Steven Chan,2003,30-31,<$5 Million,tindrumasiankitchen.com,722513,5812,Private
3,"Dancing Goats Coffee ARR, LLC",,David Wasson,1988,88,$22 Million,dancinggoats.com,422120,5812,Active
4,Clickety Clack Vape Gifts LLC,"1396 Gray Hwy, Macon, GA 31211",,,<25,<$5 Million,,453999,5999,Active/Compliance
5,Amin petrol electric llc,,,,11-50,,,,4911,
6,"CAB CHINA, LLC",,Terri Jondahl,1982,51-200,,cabww.com,339999,3499,
7,E R Enterprise for Freedom LLC,"1235 woodington cir, atlanta, GA, 30044, USA",,,11-50,$1M to $10M,,236210,1542,Active
8,"Georgia Tech Savannah, LLC","210 Technology Circle Savannah, GA 31407",,The provided JSON response appears to be from ...,The provided output is a JSON response from th...,The provided output appears to be a JSON respo...,"The provided output is from the Serpapi API, w...",The `serpapi_pagination` object contains metad...,The provided code appears to be a response fro...,The information provided is not in a format th...
9,"ANDREW THOMAS LEE PHOTOGRAPHY, LLC",The provided output appears to be in JSON form...,The provided output is a JSON object that cont...,The provided output is in JSON format and appe...,The API request is using the Serpapi search en...,The response from the Serpapi API is a JSON ob...,"This is a JSON response from the Serpapi API, ...",The provided output appears to be a JSON respo...,The provided output is a JSON object that cont...,The provided JSON response is from the Serpapi...
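In the table above, the rows for the last two firms contain long explanations of the search payload rather than field values. A simple post-processing filter can turn such answers into missing values before they reach the dataframe. The sketch below is one possible heuristic, not the notebook's method: `clean_response` is a hypothetical helper, and the 120-character limit is an arbitrary assumption that should be tuned against real answers.

```python
def clean_response(value, max_len=120):
    """Return a tidy field value, or None when the answer looks unusable.

    Heuristic assumptions: valid values are short, single-line strings;
    'null' (with or without quotes) means the model had no answer.
    """
    if value is None:
        return None
    text = value.strip().strip("'\"")
    if not text or text.lower() == "null":
        return None
    # Long or multi-line answers are treated as explanations, not values
    if len(text) > max_len or "\n" in text:
        return None
    return text
```

Applying `clean_response` to each stored response before assigning it to `df_firms` would leave those cells empty instead of filling them with prose.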


In [None]:
df_firms.to_csv("FirmDataLLLLMAugmented.csv", index=False)