## **Description and usecases**

This notebook generates a brief description of each company and potential AI usecases for it.

The input for this notebook is the Excel file **"Missions"**, which contains the full set of companies in the Fortune Global 500.

The output is a CSV file called **1_companies_df**, which contains the companies with valid mission statements (441 companies) plus new columns holding the description summary and the AI usecases.






In [None]:
! pip install langchain_community tiktoken langchain-openai langchainhub faiss-cpu langchain pypdf cryptography langchain-huggingface


**Dataset**

In [None]:
import pandas as pd
missions = pd.read_excel('Missions.xlsx', sheet_name="Fortune500")  # the full Fortune Global 500 dataset
# Validated == 1: the mission was found, validated and collected, either with ChatGPT or manually; 0: not found by either method.
# .copy() avoids pandas' SettingWithCopyWarning when new columns are added later.
missions = missions[missions["Validated"] == 1].copy()
missions

Unnamed: 0,ID,Company,Revenues ($M),Revenue Percent Change,Profits ($M),Profits Percent Change,Assets ($M),Employees,ID BvD,ISO,...,Mission_manual,URL_manual,Original,Sector,Industries,Validated,Missions_validated,Sector BvD,Description,l
0,1,Walmart,"$648,125",0.06,"$15,511",0.328,"$252,399",2100000,1,US,...,,,,Retailing,General Merchandisers,1.0,We aim to build a better world — helping peopl...,Retail,"Walmart Inc., incorporated on October 31, 1969...",3124
1,2,Amazon,"$574,785",0.118,"$30,425",-,"$527,854",1525000,2,US,...,,,,Retailing,Internet Services and Retailing,1.0,"As part of Amazon, we strive to be Earth’s mos...",Retail,"Amazon.com, Inc. provides a range of products ...",3933
2,3,State Grid,"$545,947.5",0.03,"$9,204.3",0.124,"$781,126.2",1361423,3953,CN,...,,,,Energy,Utilities,1.0,"Power Your Beautiful Life, Empower Our Beautif...",,Engaged in the operation and management of ele...,122
3,4,Saudi Aramco,"$494,890.1",-0.18,"$120,699.3",-0.241,"$660,819.2",73311,3,SA,...,,,,Energy,"Mining, Crude-Oil Production",1.0,"Aramco strives to provide reliable, affordable...",Mining & Extraction,"The company is engaged in the exploration, pro...",3687
4,5,Sinopec Group,"$429,699.7",-0.088,"$9,393.4",-0.027,"$382,688",513434,4,CN,...,Powering a better life,http://www.sinopecgroup.com/group/en/000/000/0...,,Energy,Petroleum Refining,1.0,Powering a better life,Mining & Extraction,"China Petroleum & Chemical Corporation (the ""C...",2499
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
494,495,Subaru,"$32,540.1",0.167,"$2,664.4",0.799,"$31,835.4",37693,848,JP,...,,,,Motor Vehicles and Parts,Motor Vehicles and Parts,1.0,We aim to be a compelling company with a stron...,Transport Manufacturing,The Companyis a Japan-based company engaged in...,716
495,496,Air France-KLM Group,"$32,452.3",0.169,"$1,009.7",0.319,"$38,093.7",76271,772,FR,...,,,,Transportation,Airlines,1.0,to be at the forefront of a more responsible E...,"Transport, Freight & Storage",The Company is one of the world's leading airl...,521
496,497,Enbridge,"$32,349.5",-0.21,"$4,588.3",0.988,"$136,769.6",12450,792,CA,...,to be the first choice for energydelivery in N...,https://www.enbridge.com/~/media/Enb/Documents...,,Energy,Pipelines,1.0,to be the first choice for energydelivery in N...,Utilities,Enbrige Inc (formerly IPL Energy Inc) is engag...,3147
497,498,ABB,"$32,235",0.095,"$3,745",0.513,"$40,940",107900,800,CH,...,,,,Industrials,Industrial Machinery,1.0,to enable a more sustainable and resource-effi...,"Industrial, Electric & Electronic Machinery",The history of ABB Ltd was started through the...,3084


**Libraries**

In [None]:
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_community.chat_models import ChatOllama
from langchain_core.prompts import ChatPromptTemplate

**Prompts**

The first prompt generates a one-sentence description of the company for context. Its input is the "Description" column of the dataset, which contains the company's description and history. These descriptions are very long, sometimes more than two paragraphs, so I applied an LLM to summarize each one in a single sentence.

In [None]:
prompt_description= [
    {
        "role": "system",
        "content": """You are an expert analyst specializing in creating concise and accurate corporate summaries for companies from the Fortune Global 500.

**Your Task**
Synthesize the provided company `{description}` into a single, clear sentence. This sentence must summarize the company's primary business activity or main value proposition.
Use the {sector} as context.

---
**Inputs**
- **Company:** {company}
- **Description:** {description}
- **Sector:** {sector}

---
**Rules for the Summary**
1.  **Maximum Length:** Must be **25 words or less**.
2.  **Core Focus:** Must describe what the company primarily does and, if relevant, something characteristic or famous about the company.
3.  **Clarity:** Must be easily understandable by a general audience, avoiding technical jargon.

---
**Mandatory Output Format**
You must return a **single JSON object only**. Do not include any other text, notes, preambles, or markdown. The JSON object must conform to this exact structure:
You must ensure that the name of the key is "summary".

{{
  "summary": "The concise summary of the company, 25 words or less."
}}
"""
    },
    {
        "role": "user",
        "content": "Company: {company}\nDescription: {description}"
    }
]
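The doubled braces around the JSON example are intentional. Assuming `ChatPromptTemplate` is used with its default f-string style formatting, single braces mark input variables and doubled braces are emitted as literal braces. The sketch below uses plain `str.format`, which follows the same escaping rule, to show the effect; `template` is a shortened, hypothetical stand-in for the prompt above, not the prompt itself.

```python
# Shortened stand-in for the system prompt above (hypothetical, for illustration only).
template = (
    "Company: {company}\n"
    "Return JSON like:\n"
    '{{\n  "summary": "..."\n}}'
)

# Single braces are filled in; doubled braces collapse to literal braces.
rendered = template.format(company="Walmart")
print(rendered)
```

Forgetting to double the braces would make the formatter treat the JSON body as a variable name and fail at formatting time.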

The second prompt is to generate three potential AI usecases in the company given the description, name and sector of the company.

In [None]:
prompt_usecases = [
    {"role": "system",
        "content": """You are an expert in designing impactful and practical AI
        applications for companies from the Fortune Global 500.

Your task is to generate **three distinct, realistic AI use cases** for the company
described below.
These use cases should be directly related to the {sector} sector, for {company}
described as {description}.

Each use case must:
- Be relevant to the company’s actual operations.
- Be plausible with today’s AI capabilities (not speculative).
- Be clear and understandable by a general audience (no jargon).
- Be concise: **each use case must be 15 words or fewer**.
- Be unique from each other (no thematic or functional overlap).
- Be separated by commas, and returned in a single string.
- It has to be three usecases.

---
**You must return a single JSON object only**, following this exact format:
```json
{{
  "usecases": "usecase_1, usecase_2, usecase_3"
}}
```
Do not include any other output or explanation.
"""
}]
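Because the prompt asks for three usecases in one comma-separated string, a small check can confirm that a returned string really contains three items. This is a minimal sketch, not part of the original pipeline; `split_usecases`, `has_three_usecases` and the `example` string are hypothetical names and values introduced here.

```python
def split_usecases(raw: str) -> list[str]:
    """Split a comma-separated usecase string into trimmed, non-empty items."""
    return [item.strip() for item in raw.split(",") if item.strip()]


def has_three_usecases(raw: str) -> bool:
    """Return True when the string holds exactly three comma-separated items."""
    return len(split_usecases(raw)) == 3


# Hypothetical model output, for illustration only.
example = "AI-driven dynamic pricing, Predictive inventory planning, Chatbot for customer support"
print(split_usecases(example))
print(has_three_usecases(example))
```

One caveat of the comma-separated format: a usecase that itself contains a comma would be split into two items, so a failed check does not always mean the model returned the wrong number of usecases.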

**Generation**

Based on the prompts above, I generated the description and usecases for each company. They are stored in the new columns *Description_summary* and *Usecases_1*, *Usecases_2* and *Usecases_3* (one per LLM).
The usecases are generated by three LLMs (Orca Mini, Nemotron and Qwen 2.5); Llama 3.3 then selects the most consistent answer, following the universal self-consistency approach by [DeepMind](https://deepmind.google/research/publications/50879/). The selected answer is stored in *Usecases_USC*.

The cells below generate the description summary.



In [None]:
# A class that describes the structure the language model output should conform to.
from typing import Optional, List
from typing_extensions import Annotated, TypedDict
class company_description(TypedDict):
    summary: str

In [None]:
#definition of the llm to use
from langchain_core.output_parsers import JsonOutputParser
prompt = ChatPromptTemplate.from_messages(prompt_description)
llm_llama = ChatOllama(
    model="llama3.3:70b",
    format="json")
# Define a parser to handle the JSON string from the LLM.
# The TypedDict is passed to describe the expected structure.
parser = JsonOutputParser(pydantic_object=company_description)

# Use the model defined above (the original referenced an undefined name `llm`).
chain = prompt | llm_llama | parser

In [None]:
#To invoke the chain by batches
from tqdm import tqdm

summaries = []
batch_size = 50

for start_index in tqdm(range(0, len(missions), batch_size), desc="Processing batches"):
    end_index = min(start_index + batch_size, len(missions))
    batch_missions = missions.iloc[start_index:end_index]
    #The description prompt uses as input the company name, description from Orbis database and Sector
    for i, row in batch_missions.iterrows():
        company = row.get("Company", "")
        description = row.get("Description", "")
        sector = row.get("Sector", "")


        if pd.isna(company) or pd.isna(description) or not isinstance(company, str) or not isinstance(description, str):
            summaries.append("Error: Missing or invalid input")
            continue

        try:
            result = chain.invoke({
                "company": company.strip(),
                "description": description.strip(),
                "sector": sector.strip()

            })

            if isinstance(result, dict) and "summary" in result:
                summaries.append(result["summary"])
            else:
                summaries.append("Error: Output missing 'summary' key")

        except Exception as e:
            summaries.append(f"Error: {str(e)}")

Processing batches: 100%|██████████| 9/9 [27:20<00:00, 182.24s/it]


In [None]:
missions['Description_summary']= summaries

In [None]:
missions.loc[0].Description_summary

'Walmart is a technology-powered omnichannel retailer operating retail and wholesale stores globally.'
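The prompt limits the summary to 25 words, but nothing in the loop verifies that the model respected the limit. A minimal sketch of such a check is shown below; `word_count` and `within_limit` are hypothetical helpers, and splitting on whitespace is only one possible way to count words.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated tokens in the text."""
    return len(text.split())


def within_limit(text: str, max_words: int = 25) -> bool:
    """Return True when the text has at most max_words tokens."""
    return word_count(text) <= max_words


# The summary shown in the output above.
summary = "Walmart is a technology-powered omnichannel retailer operating retail and wholesale stores globally."
print(word_count(summary), within_limit(summary))
```

Applied to the whole column, the same check would flag any summaries that exceed the limit and might need to be regenerated.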

The lines below are for generating the usecases.

In [None]:
#class with the structure
class ai_usecases(TypedDict):
    usecases: str

In [None]:
prompt = ChatPromptTemplate.from_messages(prompt_usecases)

In [None]:
# First usecases with Orca Mini
llm_1 = ChatOllama(
    model="orca-mini:70b",
    format="json")
parser_1 = JsonOutputParser(pydantic_object=ai_usecases)
chain_1 = prompt | llm_1 | parser_1
usecases_1 = []
batch_size = 50

for start_index in tqdm(range(0, len(missions), batch_size), desc="Processing batches"):
    end_index = min(start_index + batch_size, len(missions))
    batch_missions = missions.iloc[start_index:end_index]

    # Inputs: the columns Company, the newly generated Description_summary, and Sector
    for i, row in batch_missions.iterrows():
        company = row.get("Company", "")
        description = row.get("Description_summary", "")
        sector = row.get("Sector", "")

        if pd.isna(company) or pd.isna(description) or not isinstance(company, str) or not isinstance(description, str):
            usecases_1.append("Error: Missing or invalid input")
            continue

        try:
            result = chain_1.invoke({
                "company": company.strip(),
                "description": description.strip(),
                "sector": sector.strip()

            })

            if isinstance(result, dict) and "usecases" in result:
                usecases_1.append(result["usecases"])
            else:
                usecases_1.append("Error: Output missing 'usecases' key")

        except Exception as e:
            usecases_1.append(f"Error: {str(e)}")

Processing batches: 100%|██████████| 9/9 [11:45<00:00, 78.44s/it]


In [None]:
#second usecases with nemotron
llm_2 = ChatOllama(
    model="nemotron:70b",
    format="json")
parser_2 = JsonOutputParser(pydantic_object=ai_usecases)
chain_2 = prompt | llm_2 | parser_2

usecases_2 = []
batch_size = 50

for start_index in tqdm(range(0, len(missions), batch_size), desc="Processing batches"):
    end_index = min(start_index + batch_size, len(missions))
    batch_missions = missions.iloc[start_index:end_index]

    for i, row in batch_missions.iterrows():
        company = row.get("Company", "")
        description = row.get("Description_summary", "")
        sector = row.get("Sector", "")

        if pd.isna(company) or pd.isna(description) or not isinstance(company, str) or not isinstance(description, str):
            usecases_2.append("Error: Missing or invalid input")
            continue

        try:
            result = chain_2.invoke({
                "company": company.strip(),
                "description": description.strip(),
                "sector": sector.strip()

            })

            if isinstance(result, dict) and "usecases" in result:
                usecases_2.append(result["usecases"])
            else:
                usecases_2.append("Error: Output missing 'usecases' key")

        except Exception as e:
            usecases_2.append(f"Error: {str(e)}")

Processing batches: 100%|██████████| 9/9 [40:00<00:00, 266.74s/it]


In [None]:
# Third usecases with qwen 2.5
llm_3 = ChatOllama(
    model="qwen2.5:72b",
    format="json")
parser_3 = JsonOutputParser(pydantic_object=ai_usecases)
chain_3 = prompt | llm_3 | parser_3

usecases_3 = []
batch_size = 50

for start_index in tqdm(range(0, len(missions), batch_size), desc="Processing batches"):
    end_index = min(start_index + batch_size, len(missions))
    batch_missions = missions.iloc[start_index:end_index]


    for i, row in batch_missions.iterrows():
        company = row.get("Company", "")
        description = row.get("Description_summary", "")
        sector = row.get("Sector", "")

        if pd.isna(company) or pd.isna(description) or not isinstance(company, str) or not isinstance(description, str):
            usecases_3.append("Error: Missing or invalid input")
            continue

        try:
            result = chain_3.invoke({
                "company": company.strip(),
                "description": description.strip(),
                "sector": sector.strip()

            })

            if isinstance(result, dict) and "usecases" in result:
                usecases_3.append(result["usecases"])
            else:
                usecases_3.append("Error: Output missing 'usecases' key")

        except Exception as e:
            usecases_3.append(f"Error: {str(e)}")

Processing batches: 100%|██████████| 9/9 [1:12:21<00:00, 482.37s/it]
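The three generation loops above differ only in the chain they call and the list they fill, so they could be folded into one helper. The sketch below is one possible refactoring under stated assumptions, not the code that produced the results: `generate_column` is a hypothetical name, rows are assumed to be plain dictionaries, and a non-string sector is replaced by an empty string instead of raising an error as in the original loops.

```python
from typing import Callable, Iterable, Mapping


def generate_column(
    rows: Iterable[Mapping[str, object]],
    invoke: Callable[[dict], object],
    output_key: str,
    description_field: str = "Description_summary",
) -> list[str]:
    """Call `invoke` once per row and collect either the value or an error string."""
    results = []
    for row in rows:
        company = row.get("Company")
        description = row.get(description_field)
        sector = row.get("Sector")
        # NaN values are floats, so the isinstance checks also reject missing cells.
        if not isinstance(company, str) or not isinstance(description, str):
            results.append("Error: Missing or invalid input")
            continue
        try:
            output = invoke({
                "company": company.strip(),
                "description": description.strip(),
                # Assumption: fall back to an empty sector rather than failing.
                "sector": sector.strip() if isinstance(sector, str) else "",
            })
            if isinstance(output, dict) and output_key in output:
                results.append(output[output_key])
            else:
                results.append(f"Error: Output missing '{output_key}' key")
        except Exception as e:
            results.append(f"Error: {e}")
    return results
```

With the objects defined in this notebook, a call might look like `generate_column(missions.to_dict("records"), chain_1.invoke, "usecases")`, repeated with `chain_2` and `chain_3`.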


In [None]:
# to append the new columns to the dataset
missions['Usecases_1']= usecases_1
missions['Usecases_2']= usecases_2
missions['Usecases_3']= usecases_3
missions

Unnamed: 0,ID,Company,Revenues ($M),Revenue Percent Change,Profits ($M),Profits Percent Change,Assets ($M),Employees,ID BvD,ISO,...,Sector,Industries,Validated,Missions_validated,Sector BvD,Description,l,Usecases_1,Usecases_2,Usecases_3
0,1,Walmart,"$648,125",0.06,"$15,511",0.328,"$252,399",2100000,1,US,...,Retailing,General Merchandisers,1.0,We aim to build a better world — helping peopl...,Retail,"Walmart Inc., incorporated on October 31, 1969...",3124,"AI for demand forecasting, AI-enhanced supply ...",AI-driven demand forecasting for inventory man...,"AI-driven dynamic pricing, Predictive inventor..."
1,2,Amazon,"$574,785",0.118,"$30,425",-,"$527,854",1525000,2,US,...,Retailing,Internet Services and Retailing,1.0,"As part of Amazon, we strive to be Earth’s mos...",Retail,"Amazon.com, Inc. provides a range of products ...",3933,"Recommendation engine improvement, Logistics o...","AI-powered product recommendation engine, Pred...",AI-driven personalized shopping recommendation...
2,3,State Grid,"$545,947.5",0.03,"$9,204.3",0.124,"$781,126.2",1361423,3953,CN,...,Energy,Utilities,1.0,"Power Your Beautiful Life, Empower Our Beautif...",,Engaged in the operation and management of ele...,122,"AI for energy grid optimization, AI-powered ou...","AI predicts power grid energy demand, Detects ...","AI for predictive maintenance of grid assets, ..."
3,4,Saudi Aramco,"$494,890.1",-0.18,"$120,699.3",-0.241,"$660,819.2",73311,3,SA,...,Energy,"Mining, Crude-Oil Production",1.0,"Aramco strives to provide reliable, affordable...",Mining & Extraction,"The company is engaged in the exploration, pro...",3687,"AI enhances oil reservoir management, AI optim...","AI predicts oil reservoir depletion rates, Det...","AI for predictive maintenance of oil rigs, Opt..."
4,5,Sinopec Group,"$429,699.7",-0.088,"$9,393.4",-0.027,"$382,688",513434,4,CN,...,Energy,Petroleum Refining,1.0,Powering a better life,Mining & Extraction,"China Petroleum & Chemical Corporation (the ""C...",2499,"AI for catalyst optimization, AI-powered predi...","Predictive Maintenance for Refinery Equipment,...","AI optimizes refinery operations, Predicts equ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
494,495,Subaru,"$32,540.1",0.167,"$2,664.4",0.799,"$31,835.4",37693,848,JP,...,Motor Vehicles and Parts,Motor Vehicles and Parts,1.0,We aim to be a compelling company with a stron...,Transport Manufacturing,The Companyis a Japan-based company engaged in...,716,"Predictive maintenance for vehicles, AI-powere...","AI predicts vehicle maintenance needs, Enhance...","AI for predictive maintenance of vehicles, AI-..."
495,496,Air France-KLM Group,"$32,452.3",0.169,"$1,009.7",0.319,"$38,093.7",76271,772,FR,...,Transportation,Airlines,1.0,to be at the forefront of a more responsible E...,"Transport, Freight & Storage",The Company is one of the world's leading airl...,521,"AI for route optimization, AI-powered customer...","Predictive Aircraft Maintenance Scheduling, AI...","AI for flight schedule optimization, AI-powere..."
496,497,Enbridge,"$32,349.5",-0.21,"$4,588.3",0.988,"$136,769.6",12450,792,CA,...,Energy,Pipelines,1.0,to be the first choice for energydelivery in N...,Utilities,Enbrige Inc (formerly IPL Energy Inc) is engag...,3147,"AI-enhanced pipeline inspections, AI-powered l...",AI monitors pipeline integrity for leak detect...,"AI for predictive maintenance of pipelines, AI..."
497,498,ABB,"$32,235",0.095,"$3,745",0.513,"$40,940",107900,800,CH,...,Industrials,Industrial Machinery,1.0,to enable a more sustainable and resource-effi...,"Industrial, Electric & Electronic Machinery",The history of ABB Ltd was started through the...,3084,"Autonomous grid management, Predictive mainten...","Predictive Maintenance for Industrial Robots, ...","AI-driven predictive maintenance, Optimized en..."


In [None]:
# Universal self-consistency: select the usecases that are most consistent with the answers of the three LLMs

import re
from tqdm import tqdm

usc_llm = ChatOllama(
    model="llama3.3:70b")  # Llama 3.3 as the judge (same tag as the description chain)

usecases_best = []
batch_size = 50

for start_index in tqdm(range(0, len(missions), batch_size), desc="Processing batches"):
    end_index = min(start_index + batch_size, len(missions))
    batch_missions = missions.iloc[start_index:end_index]

    for i, row in batch_missions.iterrows():
        company = row["Company"]
        uc1 = row["Usecases_1"]
        uc2 = row["Usecases_2"]
        uc3 = row["Usecases_3"]

        try:  # I used almost the same prompt as described in the DeepMind paper
            usc_prompt = [
                {
                    "role": "system",
                    "content": """You are an expert in evaluating AI-generated outputs.

Given three different AI-generated responses to the same prompt, your task is to select the one that is most **consistent** with the others.

"Consistent" means it shares common ideas or themes with the other responses. Reply exactly like this:

The most consistent response is Response X
"""
                },
                {
                    "role": "user",
                    "content": f"""Evaluate these responses for the company "{company}":

Response 0: {uc1}
Response 1: {uc2}
Response 2: {uc3}

Which one is most consistent with the others?"""
                }
            ]

            usc_result = usc_llm.invoke(usc_prompt)
            usc_text = usc_result.content
            match = re.search(r"Response\s+(\d)", usc_text)
            if match:
                best_index = int(match.group(1))
                best_usecase = [uc1, uc2, uc3][best_index]
                usecases_best.append(best_usecase)
            else:
                usecases_best.append("Error: No valid response index found")
        except Exception as e:
            usecases_best.append(f"Error: {str(e)}")

Processing batches: 100%|██████████| 9/9 [09:54<00:00, 66.05s/it]
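The judge's reply is parsed with a regular expression, and an index outside 0 to 2 currently surfaces only as a generic exception message. A slightly stricter parser could reject out-of-range indices explicitly. The following is a minimal sketch under that assumption; `parse_choice` is a hypothetical helper, not part of the original cell.

```python
import re
from typing import Optional


def parse_choice(reply: str, n_candidates: int = 3) -> Optional[int]:
    """Extract the index from 'Response X'; return None if absent or out of range."""
    match = re.search(r"Response\s+(\d+)", reply)
    if match is None:
        return None
    index = int(match.group(1))
    return index if 0 <= index < n_candidates else None


print(parse_choice("The most consistent response is Response 1"))
print(parse_choice("The most consistent response is Response 7"))
```

Returning `None` for both failure modes keeps the calling loop simple: any `None` can be recorded as an error row and inspected later.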


In [None]:
#To append the new column to the dataset
missions['Usecases_USC']= usecases_best
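Since failures are stored in the columns as strings beginning with `Error:`, it may be worth counting them before saving. The sketch below works on a plain list and uses made-up sample values; `count_errors` is a hypothetical helper.

```python
def count_errors(values: list) -> int:
    """Count entries that are error strings written by the generation loops."""
    return sum(1 for v in values if isinstance(v, str) and v.startswith("Error:"))


# Made-up sample values, for illustration only.
sample = [
    "Usecase A, Usecase B, Usecase C",
    "Error: Missing or invalid input",
    "Error: No valid response index found",
]
print(count_errors(sample))
```

In the notebook, the same function could be applied to a column converted to a list, for example `count_errors(missions["Usecases_USC"].tolist())`.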

This is the main dataset used in the other notebooks. It contains the 441 companies with valid mission statements.

In [None]:
missions.to_csv("1_companies_df.csv", index=False)