# Advanced Product Mapping 

Author: Yihao Fang, Ph.D.


## Introduction
The stakeholder operates convenience-store-like markets and receives weekly shipments from suppliers, including an assortment of newly introduced items. Two CSV files are provided: one containing the stakeholder’s internal product list, and another containing the suppliers’ external product list.

### Objective
The existing method of mapping the two product lists is slow and relies on manual effort. The objective is to analyze the datasets and design an intelligent, automated system to match external products with internal items. The solution should incorporate prompt engineering as part of the technology stack. It is important to ensure that matches are exact, requiring identical product manufacturer, name, and size.

### Examples
Here are a few examples of correct and wrong matches:

#### Correct Matches:
|External_Product_Name|Internal_Product_Name|
|---|---|
|DIET LIPTON GREEN TEA W/ CITRUS 20 OZ|Lipton Diet Green Tea with Citrus (20oz)|
|CH-CHERRY CHS CLAW DANISH 4.25 OZ|Cloverhill Cherry Cheese Bearclaw Danish (4.25oz)|

#### Wrong Matches:
|External_Product_Name|Internal_Product_Name|
|---|---|
|Hersheys Almond Milk Choco 1.6 oz|Hersheys Milk Chocolate with Almonds (1.85oz)|
|COOKIE PEANUT BUTTER 2OZ|Famous Amos Peanut Butter Cookie (2oz)|


## Methodology
This notebook presents two approaches that differ in methodology, cost, and complexity. The first, a BM25 retriever-augmented Llama 3.2 pipeline, combines probabilistic retrieval with the Llama 3.2 language model. It uses retrieval-augmented generation, chain-of-thought reasoning, and self-consistency to produce accurate and interpretable product mappings. Because it is built on open-source components, it is more cost-efficient than the proprietary ChatGPT alternative, which makes it attractive for budget-conscious applications.

In contrast, the ChatGPT o1-mini approach simplifies intricate tasks by focusing on an iterative and modular workflow that leverages the reasoning capabilities of ChatGPT. By eschewing complex pipelines in favor of streamlined logic and iterative refinement, it achieves robust and scalable results with reduced implementation complexity. While the open-source approach offers advanced customization and interpretability, the proprietary ChatGPT method emphasizes ease of use and simplicity, making it more accessible for use cases requiring rapid deployment and minimal setup.

### Approach One: BM25 Retriever Augmented Llama 3.2 Large Language Model (Open Source)
This approach implements a pipeline that matches external product names to internal product names by combining NLP techniques with probabilistic retrieval. It begins by tokenizing internal product names with <b>SentencePiece subword encoding</b> (Kudo, T., & Richardson, J., 2018), which gives robust tokenization for downstream steps. For each external product, the <b>BM25 retrieval model</b> (Robertson, S., & Zaragoza, H., 2009) first retrieves the top K most relevant internal candidates, ranked by similarity over the subword tokens. The match decision is made by the <b>Llama 3.2 large language model (LLM)</b> (Touvron, H. et al., 2023), guided by <b>few-shot prompting</b> with structured examples. Additionally, <b>chain-of-thought reasoning</b> (Wei, J. et al., 2022) is employed: the LLM must give a rationale for each match or mismatch decision, which improves interpretability and reliability.

To handle the inherent probabilistic nature of LLMs, the system employs <b>self-consistency through majority voting</b> (Wang, X. et al., 2023), where multiple generations of model outputs are compared, and the most frequent result is selected for higher accuracy. This process is computationally intensive but enhances reliability. Finally, the results are aggregated in a <b>map-reduce-style framework</b>: individual product matches are mapped and recorded in a detailed matrix, then reduced into a summary DataFrame showing aggregated matches for each external product. This structured pipeline effectively combines retrieval, reasoning, and aggregation techniques, ensuring accurate and explainable product mapping while demonstrating advanced NLP and machine learning practices.


#### Importing Python libraries
This script imports Python modules and libraries for the product mapping application. 

- <b>os:</b> Provides functionality to interact with the operating system, such as file and directory operations.

- <b>subprocess:</b> Used to spawn new processes, interact with the operating system, and run shell commands.

- <b>time:</b> Provides time-related functions. Commonly used for measuring execution time, delays, or timestamps.

- <b>requests:</b> A popular HTTP library for making API calls or downloading data from web services.

- <b>json.loads:</b> Parses JSON strings into Python objects.

- <b>re:</b> Provides support for regular expressions.

- <b>pandas as pd:</b> A powerful library for data manipulation and analysis.

- <b>numpy as np:</b> A library for numerical computing.

- <b>rank_bm25.BM25Okapi:</b> Part of the rank_bm25 library, which implements the BM25 ranking algorithm for text retrieval.

- <b>joblib.Memory:</b> A caching library to save computation time by storing results of expensive functions.

- <b>sentencepiece as spm:</b> A library for unsupervised text tokenization and subword modeling.

In [23]:
import os
import subprocess
import time
import requests
from json import loads
import re
import pandas as pd
import numpy as np
from rank_bm25 import BM25Okapi
from joblib import Memory
import sentencepiece as spm

#### Initializing constants and the job identifier (ID)
This Python code snippet initializes constants, generates a unique job identifier (ID), and then prints it. 

In [13]:
TOP_K = 4
N_VOTES = 3
np.random.seed(int(time.time()))
JOB_ID = np.random.randint(0, 100000)
JOB_ID = f'llama_3_2_{JOB_ID}'
print('JOB_ID:', JOB_ID)

JOB_ID: llama_3_2_17313


#### Setting pandas display options
This Python code snippet modifies Pandas display options to ensure that when data is printed (e.g., in a Jupyter Notebook or console), all of its contents are fully visible without truncation. 

* pd.set_option('display.max_rows', None):

  Displays every row of a DataFrame. By default pandas truncates long outputs; setting the option to None removes the row limit.

* pd.set_option('display.max_columns', None):

  Displays every column of a DataFrame. Setting the option to None removes the column limit.

* pd.set_option('display.width', None):

  Removes restrictions on the total width of the display, allowing the output to adapt to the actual size of the DataFrame. Without this, the output might wrap or truncate rows when the DataFrame is too wide.

* pd.set_option('display.max_colwidth', None):

  Ensures that the full content of each column is displayed, even if the values are very long (e.g., long strings). By default, long strings might be truncated with ellipses (...).

In [14]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

#### Loading data
This Python code snippet reads product information from two CSV files and extracts specific columns into lists for further processing.

In [15]:
external_product_list = pd.read_csv('Data_External.csv')["PRODUCT_NAME"].tolist()
internal_product_list = pd.read_csv('Data_Internal.csv')["LONG_NAME"].tolist()

#### Initializing the SentencePiece tokenizer
This Python code checks if a SentencePiece model file (Data_Internal.model) exists; if not, it creates a model from the internal_product_list. First, it preprocesses the product names by converting them to lowercase and saving them as a text corpus (Data_Internal.txt). Then, it trains a SentencePiece model using the command-line tool spm_train with specific parameters, such as byte-pair encoding (BPE), a vocabulary size of 22,675, and full character coverage. Once the model is created, it initializes a SentencePieceProcessor for tokenization. The function tokenize(text) leverages this processor to tokenize input text into subword units after converting the text to lowercase. 

In [16]:
if not os.path.exists("Data_Internal.model"):
    corpus = [internal_product.lower() for internal_product in internal_product_list]
    with open('Data_Internal.txt', 'w') as f:
        f.write('\n'.join(corpus)+'\n')
    command = 'spm_train --input=Data_Internal.txt --model_prefix=Data_Internal --vocab_size=22675 --character_coverage=1.0 --model_type=bpe'
    subprocess.run(command.split(' '), check=True)  # raise immediately if spm_train fails
sp = spm.SentencePieceProcessor(model_file='Data_Internal.model')

def tokenize(text):
    return sp.encode(text.lower(), out_type=str)

#### Initiating the BM25 retriever
This code builds a tokenized corpus from internal_product_list and initializes a BM25 retrieval model with the BM25Okapi class. Each product name is passed through the previously defined tokenize function, which uses the trained SentencePiece model to break text into subword units. The resulting list, tokenized_corpus, is then used to construct the BM25Okapi object, bm25. BM25 is a ranking function widely used in text retrieval; here it ranks internal product names by their relevance to a query.

In [17]:
tokenized_corpus = [tokenize(internal_product) for internal_product in internal_product_list]
bm25 = BM25Okapi(tokenized_corpus)
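
To make the ranking step more concrete, here is a minimal sketch of BM25 scoring in plain Python. It is not the rank_bm25 implementation: it uses a common textbook variant with a smoothed IDF, the parameter values k1=1.5 and b=0.75 are assumptions, and whitespace tokens stand in for SentencePiece subwords. The helper names bm25_scores and top_k are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for token in query_tokens:
            if token not in term_freq:
                continue
            # Smoothed IDF that stays positive even for very common tokens.
            idf = math.log(1 + (n_docs - doc_freq[token] + 0.5) / (doc_freq[token] + 0.5))
            tf = term_freq[token]
            # Length normalization: long documents need more hits for the same score.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

def top_k(query_tokens, docs, tokenized_docs, k):
    """Return the k documents with the highest BM25 score."""
    scores = bm25_scores(query_tokens, tokenized_docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in ranked[:k]]

# Toy corpus built from the example product names in this notebook.
docs = [
    "Lipton Diet Green Tea with Citrus (20oz)",
    "Cloverhill Cherry Cheese Bearclaw Danish (4.25oz)",
    "Hersheys Milk Chocolate with Almonds (1.85oz)",
    "Famous Amos Peanut Butter Cookie (2oz)",
]
tokenized_docs = [doc.lower().split() for doc in docs]
query = "DIET LIPTON GREEN TEA W/ CITRUS 20 OZ".lower().split()

best = top_k(query, docs, tokenized_docs, k=2)
print(best[0])
```

With whitespace tokens, "20" and "oz" in the query never match "(20oz)" in the corpus, which is one reason the notebook uses subword tokenization before retrieval.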

#### Initializing the Llama 3.2 client
This Python code sets up a caching mechanism for an API request function to improve efficiency and reduce redundant network calls. It defines a request function that makes POST requests to a specified URL multiple times (n) using the requests library. The function takes parameters for the url, headers, and a json payload, sends the POST request, and appends the parsed JSON responses to a list. The joblib.Memory utility is used to cache the results of the request function, storing them in a directory (./cachedir_joblib). When the same function is called with identical parameters, the cached result is returned instead of making a new API call. Finally, the url and headers are pre-defined for interacting with the Llama 3.2 REST API endpoint (https://www.lexisophy.com/api/rest/v1/chat) using JSON payloads. 

In [18]:
cachedir = './cachedir_joblib'
memory = Memory(cachedir, verbose=0)

def request(url, headers, json, n):
    results = []
    for _ in range(n):
        response = requests.post(url, headers=headers, json=json)
        result = loads(response.content)
        results.append(result)
    return results
request = memory.cache(request)

url = 'https://www.lexisophy.com/api/rest/v1/chat'
headers = {'Content-Type': 'application/json'}
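
One consequence of caching the whole request function is worth spelling out: the N responses are stored as a batch, keyed on the arguments, so repeating a call with the same prompt returns the earlier votes instead of sampling new ones. The sketch below illustrates this with an in-memory dictionary as a stand-in for joblib.Memory (which persists to disk) and a fake backend in place of requests.post. The names fake_post and cached_request are hypothetical.

```python
import json

call_count = 0

def fake_post(payload):
    """Stand-in for requests.post: returns a different answer on every call."""
    global call_count
    call_count += 1
    return f"answer-{call_count}"

_cache = {}

def cached_request(payload, n):
    """Memoize the whole batch of n responses, keyed on the arguments."""
    key = (json.dumps(payload, sort_keys=True), n)
    if key not in _cache:
        # Cache miss: call the backend n times and store the batch.
        _cache[key] = [fake_post(payload) for _ in range(n)]
    return _cache[key]

first = cached_request({"role": "user", "content": "prompt A"}, n=3)
second = cached_request({"role": "user", "content": "prompt A"}, n=3)  # served from cache
third = cached_request({"role": "user", "content": "prompt B"}, n=3)   # new prompt, new calls

print(first == second, call_count)
```

Within a single call the votes still differ from one another, because the loop over n runs inside the cached function.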

#### Loading the prompt template for bulk matching
This Python code reads the contents of a text file named bulk_match_prompt_template_llama_3_2.txt and stores it in the variable bulk_match_prompt_template.

- bulk_match_prompt_template_llama_3_2.txt:
```
You are asked to judge if the external product maps to an internal product. If it is a match, return TRUE, otherwise return FALSE. You are asked to provide a rationale, regardless of whether it is a match.
Note: The match has to be exact, meaning the product manufacturer, name, and size must be identical.
You may first normalize the long product name (e.g., DIET LIPTON GREEN TEA W/ CITRUS 20 OZ) into the manufacturer (e.g., LIPTON), name (e.g., DIET GREEN TEA WITH CITRUS), and size (e.g., 20 OZ). 
Please do not show the source code and provide the output as a table.

Here are some examples:
External_Product	Internal_Product	Rationale	Result
DIET LIPTON GREEN TEA W/ CITRUS 20 OZ	Lipton Diet Green Tea with Citrus (20oz)	Manufacturer Lipton matches, name Diet Green Tea with Citrus matches, size 20oz matches.	TRUE
CH-CHERRY CHS CLAW DANISH 4.25 OZ	Cloverhill Cherry Cheese Bearclaw Danish (4.25oz)	Manufacturer Cloverhill (CH) matches, name Cherry Cheese Bearclaw Danish (CHERRY CHS CLAW DANISH) matches, size 4.25oz matches.	TRUE
Hersheys Almond Milk Choco 1.6 oz	Hersheys Milk Chocolate with Almonds (1.85oz)	Size 1.6 oz and 1.85oz does not match.	FALSE
COOKIE PEANUT BUTTER 2OZ	Famous Amos Peanut Butter Cookie (2oz)	Name does not match, one is butter, the other is cookie.	FALSE

You are now asked to judge if the external product maps to an internal product or not:
External_Product	Internal_Product	Rationale	Result
```

This prompt template is designed to guide the language model in judging whether an external product maps to an internal product according to specific criteria. 

1. <b>Explicit Instructions</b>

   The instructions in the template are explicit: a match requires the manufacturer, product name, and size to be identical. The template also suggests that the language model first normalize each product name into its components (manufacturer, name, and size) before comparing them, and it asks for the output in table format.

2. <b>Prompting using Few-Shot Learning</b>

   The template includes four worked examples, two matches and two mismatches, in the exact output format. Few-shot prompting does not retrain the model; it conditions the model on these examples so that it follows the expected format and reasoning on new inputs. Mixing positive and negative examples, such as a size mismatch and a name mismatch, helps the model handle the nuances of product matching.

3. <b>Chain of Thought Technique</b>

   The template asks the language model to provide a rationale for every decision. Writing out the reasoning before the final TRUE or FALSE tends to produce more coherent and better-reasoned answers.

In [19]:
with open('bulk_match_prompt_template_llama_3_2.txt','r') as f:
    bulk_match_prompt_template = f.read()
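
The template asks the model to normalize each name into manufacturer, name, and size. The size component is regular enough to check with a regular expression, which can be illustrated as follows. This is a hedged sketch and not part of the pipeline above: the helpers extract_size_oz and sizes_compatible are hypothetical, and the pattern only covers sizes given in ounces.

```python
import re

# Matches "20 OZ", "20oz", "4.25 OZ", "(1.85oz)" and similar; ounces only.
SIZE_PATTERN = re.compile(r'(\d+(?:\.\d+)?)\s*oz\b', re.IGNORECASE)

def extract_size_oz(product_name):
    """Return the size in ounces as a float, or None if no 'oz' size is present."""
    match = SIZE_PATTERN.search(product_name)
    return float(match.group(1)) if match else None

def sizes_compatible(external_name, internal_name):
    """Return False only when both names state a size and the sizes differ."""
    external_size = extract_size_oz(external_name)
    internal_size = extract_size_oz(internal_name)
    if external_size is None or internal_size is None:
        return True  # size unknown: leave the decision to the LLM
    return external_size == internal_size

print(sizes_compatible("DIET LIPTON GREEN TEA W/ CITRUS 20 OZ",
                       "Lipton Diet Green Tea with Citrus (20oz)"))
print(sizes_compatible("Hersheys Almond Milk Choco 1.6 oz",
                       "Hersheys Milk Chocolate with Almonds (1.85oz)"))
```

A check like this could serve as a cheap filter or as a sanity check on the model's TRUE decisions, but it cannot judge manufacturer or name, so it does not replace the LLM step.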

#### Product mapping
This Python function, bulk_match, performs a task to map external products to internal ones by generating prompts for a language model (LLM), collecting multiple results (via majority voting for self-consistency), and determining the most accurate match between products. The function takes a task identifier and slices of external and internal product lists as input. It constructs prompts for each external-internal product pair based on the predefined template and appends them to an output file for tracking. These prompts are sent to the REST API endpoint (using the request function) multiple times (N_VOTES), simulating the generation of several possible outputs from the LLM. The responses are saved to a results file, each annotated with the corresponding choice identifier for traceability.

<b>Majority Voting</b> plays a central role in improving the accuracy of results. Since LLMs are probabilistic and may produce inconsistent outputs for the same input, the function aggregates results across multiple model responses for each external-internal product pair. Each individual result is parsed into a DataFrame (res_df), and columns corresponding to each "vote" are added to the main DataFrame (vote_df). The final consensus is determined by selecting the most frequently occurring answer (mode) across all responses for each product pair. This process, known as self-consistency through majority voting, enhances reliability by favoring the most probable output. The final consolidated results are appended to an output file, ensuring transparency and enabling validation. This approach trades increased computational cost for improved accuracy and robustness in decision-making.

In [20]:
# Mapping external products to internal ones
def bulk_match(item):
    task_id = item[0]
    sliced_external_product_list = item[1]
    sliced_internal_product_list = item[2]
    
    rows = []
    for external_product in sliced_external_product_list:
        for internal_product in sliced_internal_product_list:
            row = f'{external_product}\t{internal_product}\t\t\n'
            rows.append(row)
    prompt = bulk_match_prompt_template + (''.join(rows))
    with open(f'output/prompts_{JOB_ID}.txt', 'a') as f:
        print(f"Task: {task_id}\n\nPROMPT: {prompt}", file=f)
        print("\n\n----------------------------\n\n", file=f)
        
    data = {'role': 'user', 'content': prompt}

    results = request(url, headers, json=data, n=N_VOTES)

    with open(f'output/results_{JOB_ID}.txt', 'a') as f:
        for choice_id, result in enumerate(results):
            print(f"Task: {task_id}\n\nChoice: {choice_id}\n\nRESULT:\n{result}", file=f)
            print("\n\n----------------------------\n\n", file=f)
    
    # Using majority voting
    SEP_PATTERN = r'\s*\|\s*'
    vote_df_data = []
    for external_product in sliced_external_product_list:
        for internal_product in sliced_internal_product_list:
            vote_df_data.append([external_product, internal_product])
    vote_df = pd.DataFrame(vote_df_data, columns=['External_Product', 'Internal_Product'])
    for choice_id, result in enumerate(results):
        result = result.strip('\n')
        rows = result.split('\n')
        res_df_data = []
        for row in rows:
            row = re.sub(r'^'+SEP_PATTERN, '', row)
            row = re.sub(SEP_PATTERN+r'$', '', row)
            mo = re.match(r'\-+'+SEP_PATTERN+r'\-+'+SEP_PATTERN+r'\-+'+SEP_PATTERN+r'\-+', row)
            if mo is None:
                cells = re.split(SEP_PATTERN, row)
                res_df_data.append(cells)
        res_df = pd.DataFrame(res_df_data[1:], columns=res_df_data[0])

        vote_df[f'Result_{choice_id}'] = res_df['Result']
    
    vote_df['Result'] = vote_df.loc[:, vote_df.columns.difference(['External_Product', 'Internal_Product'])].mode(axis=1)[0]
    
    with open(f'output/results_{JOB_ID}.txt', 'a') as f:
        print(f"Task: {task_id}\n\nRESULT:\n", file=f)
        print(vote_df, file=f)
        print("\n\n----------------------------\n\n", file=f)
    
    return vote_df
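
The parsing and voting logic inside bulk_match can be isolated in a few lines. The sketch below uses only the standard library instead of pandas, assumes the model replies with a Markdown pipe table, and uses made-up model responses. The helpers parse_pipe_table, majority_vote, and make_response are hypothetical names.

```python
import re
from collections import Counter

def parse_pipe_table(text):
    """Parse a Markdown pipe table into a list of dicts, skipping the divider row."""
    rows = []
    for line in text.strip().splitlines():
        cells = [cell.strip() for cell in line.strip().strip('|').split('|')]
        if all(re.fullmatch(r':?-+:?', cell) for cell in cells):
            continue  # divider row such as |---|---|---|---|
        rows.append(cells)
    header, body = rows[0], rows[1:]
    return [dict(zip(header, cells)) for cells in body]

def majority_vote(votes):
    """Return the most frequent vote; ties go to the value seen first."""
    return Counter(votes).most_common(1)[0][0]

def make_response(result):
    """Build a made-up model response for one product pair."""
    return (
        "| External_Product | Internal_Product | Rationale | Result |\n"
        "|---|---|---|---|\n"
        "| COOKIE PEANUT BUTTER 2OZ | Famous Amos Peanut Butter Cookie (2oz) "
        f"| Illustrative rationale. | {result} |"
    )

# Three sampled responses for the same prompt (N_VOTES = 3); one disagrees.
responses = [make_response("FALSE"), make_response("TRUE"), make_response("FALSE")]
votes = [parse_pipe_table(response)[0]["Result"] for response in responses]
final = majority_vote(votes)
print(votes, "->", final)
```

An odd number of votes avoids ties for a binary TRUE/FALSE decision, which is consistent with N_VOTES = 3 above.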

#### Performing map-reduce
This Python code applies the bulk_match function to map external products to the most relevant internal products using BM25 for candidate retrieval and stores the results in vote_df_list. It constructs a list of tasks, where each task consists of a unique task_id, a single external product from external_product_list, and the top TOP_K internal products ranked by BM25 based on their similarity to the tokenized external product. The bulk_match function is then applied to each task using the map function, which iterates over the tasks and returns a list of DataFrames (vote_df_list). Each DataFrame contains the matching results for one external product, including the final decision derived from majority voting. This approach combines efficient retrieval (BM25) with robust matching logic (LLM) to handle a potentially large dataset of products.

In [21]:
vote_df_list = list(map(bulk_match, [(task_id, 
                                      [external_product], 
                                      bm25.get_top_n(tokenize(external_product), internal_product_list, n=TOP_K)) 
                                      for task_id, external_product in enumerate(external_product_list)]))

#### Coalescing map-reduce results
This code organizes the matching results into a structured format:

- mapping_df: A detailed matrix for analyzing the raw pairwise mapping results.
- aggr_df: A high-level summary for each external product, listing its corresponding internal product matches or "NULL" if no matches are found.

In [22]:
mapping_df = pd.DataFrame(columns=external_product_list)
mapping_df['Internal_Product'] = internal_product_list
mapping_df.set_index('Internal_Product', inplace=True)

for vote_df in vote_df_list:
    for index, row in vote_df.iterrows():
        mapping_df.loc[row['Internal_Product'], row['External_Product']] = row['Result']

mapping_df.to_csv(f'output/mapping_raw_{JOB_ID}.csv')

aggr_df = pd.DataFrame(columns=['External_Product', 'Internal_Product'])
aggr_df['External_Product'] = external_product_list
aggr_internal_product_list = []
for external_product in external_product_list:
    internal_product = mapping_df.index[mapping_df[external_product] == 'TRUE'].tolist()
    internal_product = 'NULL' if len(internal_product) == 0 else ','.join(internal_product)
    aggr_internal_product_list.append(internal_product)
aggr_df['Internal_Product'] = aggr_internal_product_list

with open(f'output/results_{JOB_ID}.txt', 'a') as f:   
    print(aggr_df, file=f)
    print("\n\n----------------------------\n\n", file=f)
    
aggr_df.to_csv(f'output/mapping_aggr_{JOB_ID}.csv')
print(aggr_df)

                                            External_Product  \
0                                  5 HOUR XTRA GRAPE 1.93 OZ   
1                                     B - PB & HONEY SAMMICH   
2                  B - RUDY FARMS - SAUSAGE AND BISCUIT TWIN   
3                                            BANANAS - FRESH   
4                                    BOBOS PB&J GRAPE 2.1 OZ   
5                            BODY ARMOR STRWBRY BANANA 16 OZ   
6                                 BR ESPRESSO W/ CREAM 11 OZ   
7                                Bumble Bee Tuna Salad 3.5oz   
8                                CELSIUS ORANGE ENERGY 12 OZ   
9                                   CELSIUS PEACH VIBE 12 OZ   
10                                   DIET MOUNTAIN DEW 20 OZ   
11                               DOLE STRWBRY LEMONADE 20 OZ   
12                                DOVE BAR DARK CHOC 1.44 OZ   
13                                        Dr Pepper 12oz Can   
14                                F - TU
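
The reduce step can also be expressed without the intermediate matrix. The following sketch collapses pairwise votes directly into one entry per external product. It assumes the votes are available as (external, internal, result) tuples, and the function name aggregate_matches is hypothetical.

```python
def aggregate_matches(vote_rows, external_products):
    """Collapse (external, internal, result) votes into one entry per external product."""
    matches = {external: [] for external in external_products}
    for external, internal, result in vote_rows:
        if result == 'TRUE':
            matches[external].append(internal)
    # Join multiple matches with commas; use 'NULL' when nothing matched.
    return {external: ','.join(internals) if internals else 'NULL'
            for external, internals in matches.items()}

# Toy votes based on the example pairs in this notebook.
vote_rows = [
    ("DIET LIPTON GREEN TEA W/ CITRUS 20 OZ", "Lipton Diet Green Tea with Citrus (20oz)", "TRUE"),
    ("DIET LIPTON GREEN TEA W/ CITRUS 20 OZ", "Famous Amos Peanut Butter Cookie (2oz)", "FALSE"),
    ("COOKIE PEANUT BUTTER 2OZ", "Famous Amos Peanut Butter Cookie (2oz)", "FALSE"),
]
externals = ["DIET LIPTON GREEN TEA W/ CITRUS 20 OZ", "COOKIE PEANUT BUTTER 2OZ"]

summary = aggregate_matches(vote_rows, externals)
for external, internal in summary.items():
    print(external, "->", internal)
```

Joining with a comma mirrors the notebook code; if internal product names can themselves contain commas, a different separator or a list-valued column would be safer.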

### Approach Two: ChatGPT o1-mini
This approach uses ChatGPT o1-mini to match and coalesce product data through a streamlined, iterative process. The task is divided into smaller components: product lists are processed in parallel chunks, and the intermediate outputs are refined through majority voting and then merged by iterative coalescing. This shows how tasks that would traditionally require extensive logic or custom algorithms can be handled with simpler procedures by relying on the reasoning and language understanding capabilities of the ChatGPT o1 family. Through prompt engineering and iterative reduction, the system aims for robust, scalable, and accurate results with reduced implementation complexity.

#### Importing Python libraries
This snippet imports concurrent.futures for running tasks in a thread or process pool. The OpenAI class from the openai module is the client for OpenAI's API. The pandas library (pd) offers data manipulation and analysis capabilities, while numpy (np) is used for numerical computations. The time module is imported for handling time-based operations.

In [None]:
import concurrent.futures
from openai import OpenAI
import pandas as pd
import numpy as np
import time

#### Initializing constants and the job identifier (ID)
This code initializes configuration parameters and a unique identifier for the job. MAX_MATCHING_COUNT and MAX_COALESCING_COUNT cap the number of items handled per request in the matching and coalescing steps, respectively. N_PROC is set to 24, the maximum number of worker threads used in parallel.

In [None]:
MAX_MATCHING_COUNT = 30
MAX_COALESCING_COUNT = 20
N_PROC = 24
np.random.seed(int(time.time()))
JOB_ID = np.random.randint(0, 100000)
JOB_ID = f'o1_mini_{JOB_ID}'
print('JOB_ID:', JOB_ID)

#### Initializing the OpenAI client
This code snippet initializes an instance of the OpenAI client by passing an API key as a parameter. 

In [None]:
client = OpenAI(api_key='')

#### Loading data
This code reads product data from two CSV files, Data_External.csv and Data_Internal.csv, into Python lists using the pandas library. The first line loads the "PRODUCT_NAME" column from Data_External.csv and converts its values into a list, storing them in external_product_list. Similarly, the second line reads the "LONG_NAME" column from Data_Internal.csv and converts its values into a list, storing them in internal_product_list.

In [None]:
external_product_list = pd.read_csv('Data_External.csv')["PRODUCT_NAME"].tolist()
internal_product_list = pd.read_csv('Data_Internal.csv')["LONG_NAME"].tolist()

#### Loading the prompt template
This code snippet reads a text file, bulk_match_prompt_template_o1_mini.txt, which contains the prompt template, and customizes it by inserting the external product list into the prompt.

- bulk_match_prompt_template_o1_mini.txt
```
You are provided with two product list: one is the external product list, and the other is the internal product list.
You are asked to create a table including all external products with the corresponding mapped internal product. If no match is found, the table should indicate NULL for the internal product. (Note: The match has to be exact, meaning the product manufacturer, name, and size must be identical.) 
Only output the table and trim any excess whitespace from the table’s output.

---
Examples:  
To help you understand our requirements, here are a few examples of correct and wrong matches:
Correct Matches: 
External_Product_Name 	Internal_Product_Name 
DIET LIPTON GREEN TEA W/ CITRUS 20 OZ 	Lipton Diet Green Tea with Citrus (20oz) 
CH-CHERRY CHS CLAW DANISH 4.25 OZ 	Cloverhill Cherry Cheese Bearclaw Danish (4.25oz) 

Wrong Matches:
External_Product_Name 	Internal_Product_Name 
Hersheys Almond Milk Choco 1.6 oz 	Hersheys Milk Chocolate with Almonds (1.85oz) 
COOKIE PEANUT BUTTER 2OZ 	Famous Amos Peanut Butter Cookie (2oz) 

---
External Product List:
<external_product_list>

---
Internal Product List:
<internal_product_list>
```

The bulk_match_prompt_template_o1_mini.txt prompt demonstrates the effective use of <b>Explicit Instructions</b> and <b>Few-Shot Prompting</b> to guide a model in generating precise and accurate outputs for matching products between two lists.

- Explicit Instructions

  The model is explicitly instructed to "create a table including all external products with the corresponding mapped internal product" and, if no match is found, to "indicate NULL for the internal product." The prompt specifies that the match must be exact, including product manufacturer, name, and size.

- Few-Shot Prompting

  The prompt uses examples of both correct and incorrect matches to demonstrate the expected behavior, which helps guide the model in producing high-quality output.

In [None]:
with open('bulk_match_prompt_template_o1_mini.txt','r') as f:
    bulk_match_prompt_template = f.read()
bulk_match_prompt_template = bulk_match_prompt_template.replace('<external_product_list>', '\n'.join(external_product_list))
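
The template is filled in two stages: the external product list is inserted once here, and the internal product chunk is inserted later for each request. A minimal sketch of that pattern follows, using a shortened stand-in for the template file. The helper names fill_once and fill_per_task are hypothetical.

```python
# Shortened stand-in for bulk_match_prompt_template_o1_mini.txt.
TEMPLATE = (
    "External Product List:\n<external_product_list>\n\n"
    "---\nInternal Product List:\n<internal_product_list>\n"
)

def fill_once(template, external_products):
    """Stage 1: insert the full external list into the template a single time."""
    return template.replace('<external_product_list>', '\n'.join(external_products))

def fill_per_task(partial_prompt, internal_chunk):
    """Stage 2: insert one chunk of internal products for a single request."""
    return partial_prompt.replace('<internal_product_list>', '\n'.join(internal_chunk))

partial = fill_once(TEMPLATE, ['COOKIE PEANUT BUTTER 2OZ'])
prompt = fill_per_task(partial, ['Famous Amos Peanut Butter Cookie (2oz)'])
print(prompt)
```

Because every request carries the full external list, the prompt size is driven mainly by the external list plus one chunk of internal products.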

#### Product mapping
This Python code defines a function bulk_match for mapping external products to internal ones by interacting with the language model o1-mini via the OpenAI API. The function processes a task identified by task_id and customizes the matching prompt by injecting a specific internal_product_list into a predefined template. The prompt is logged into a file for traceability. The language model generates five different outputs (n=5) for each task, and these outputs are then saved to an output file.

The function employs <b>self-consistency through majority voting</b> to refine the outputs generated by the model. All results from the initial generation step are combined into a single prompt explaining the concept of majority voting, where the final decision is based on the most frequently occurring outcome among the alternatives. This technique improves the reliability of the model's outputs by emphasizing consensus among multiple responses, thereby reducing errors and inconsistencies. 

In [None]:
def bulk_match(item):
    task_id = item[0]
    prompt = bulk_match_prompt_template.replace('<internal_product_list>', '\n'.join(item[1]))
    with open(f'output/prompts_{JOB_ID}.txt', 'a') as f:
        print(f"Task: {task_id}\n\nPROMPT: {prompt}", file=f)
        print("\n\n----------------------------\n\n", file=f)
    completion = client.chat.completions.create(
      model="o1-mini-2024-09-12",
      messages=[
        {"role": "user", "content": prompt}
      ],
      n=5
    )
    results = []
    with open(f'output/results_{JOB_ID}.txt', 'a') as f:
        for choice_id in range(len(completion.choices)):
            result = completion.choices[choice_id].message.content
            print(f"Task: {task_id}\n\nChoice: {choice_id}\n\nRESULT: {result}", file=f)
            print("\n\n----------------------------\n\n", file=f)
            results.append(result)
    
    # Using majority voting
    prompt = "Please combine the following results using majority voting. Majority voting is a decision-making rule in which the final outcome is determined by choosing the option that receives more votes than any other alternative. Only output the table and trim any excess whitespace from the table’s output: \n\n---"+("\n\n----------------------------\n\n".join(results))
    with open(f'output/prompts_{JOB_ID}.txt', 'a') as f:
        print(f"Task: {task_id}\n\nPROMPT: {prompt}", file=f)
        print("\n\n----------------------------\n\n", file=f)
    
    completion = client.chat.completions.create(
      model="o1-mini-2024-09-12",
      messages=[
        {"role": "user", "content": prompt}
      ],
      n=1
    )
    result = completion.choices[0].message.content
    with open(f'output/results_{JOB_ID}.txt', 'a') as f:
        print(f"Task: {task_id}\n\nRESULT: {result}", file=f)
        print("\n\n----------------------------\n\n", file=f)
    return result

#### Performing map-reduce
This code snippet uses multithreading with the concurrent.futures.ThreadPoolExecutor to efficiently process product matching tasks in parallel. The max_workers=N_PROC parameter specifies that up to N_PROC threads (24 in this case) can run concurrently, which maximizes resource utilization for the workload. The start_indexes list is generated to partition the internal_product_list into smaller chunks, each of size MAX_MATCHING_COUNT (30 items). These chunks are then processed in parallel using the executor.map function, where the bulk_match function is applied to each partition. The input to bulk_match includes a unique task identifier (e.g., 'm-0', 'm-1') and a corresponding subset of internal_product_list. The results from all tasks are collected as a list and stored in the results variable. This approach ensures efficient processing of large datasets by breaking them into manageable pieces and distributing the work across multiple threads, significantly improving performance in API-intensive tasks.

In [None]:
with concurrent.futures.ThreadPoolExecutor(max_workers=N_PROC) as executor:
    start_indexes = list(range(0, len(internal_product_list), MAX_MATCHING_COUNT))
    results = list(executor.map(bulk_match, [(f'm-{index//MAX_MATCHING_COUNT}', internal_product_list[index:index+MAX_MATCHING_COUNT]) for index in start_indexes]))
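
The chunk-and-map pattern above can be tried without any API calls. In this sketch a stub replaces the LLM request, the data is synthetic, and the names chunked and fake_bulk_match are hypothetical.

```python
import concurrent.futures

def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[start:start + size] for start in range(0, len(items), size)]

def fake_bulk_match(task):
    """Stand-in for the LLM call: report the task ID and how many items it received."""
    task_id, chunk = task
    return f'{task_id}:{len(chunk)}'

internal_products = [f'product-{i}' for i in range(70)]  # synthetic data
tasks = [(f'm-{i}', chunk) for i, chunk in enumerate(chunked(internal_products, 30))]

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # executor.map returns results in task order, not completion order.
    results = list(executor.map(fake_bulk_match, tasks))

print(results)
```

Threads suit this workload because each task spends most of its time waiting on a network response rather than computing.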

#### Iteratively coalescing results
This code implements a coalescing step to aggregate and refine the results from a map-reduce process, leveraging an iterative approach with a multithreaded executor for efficiency. The coalesce function handles individual coalescing tasks. For each task, it constructs a prompt instructing the model to merge results by selecting non-NULL values if available and defaulting to NULL when all values are NULL. The constructed prompt is logged into a file for traceability. The function then uses the OpenAI client to query the language model with the prompt and retrieves the coalesced result. This result is logged into an output file for review and returned.

The iterative loop processes the list of intermediate results (results) in batches, reducing their size at each iteration. It uses a ThreadPoolExecutor with a concurrency limit of N_PROC to coalesce the results in parallel. The results are divided into chunks of size MAX_COALESCING_COUNT (20 items) using indices generated by start_indexes. For each chunk, the coalesce function is called with a unique task identifier (e.g., 'c-0-0', 'c-0-1') and a subset of results. After each iteration, the number of results decreases, and the loop continues until only one result remains, representing the fully coalesced output. This iterative reduction strategy ensures scalability and efficiency in merging large datasets while adhering to memory and performance constraints.

In [None]:
def coalesce(item):
    task_id = item[0]
    with open(f'output/prompts_{JOB_ID}.txt', 'a') as f:
        prompt = "Please coalesce the following results. If a non-NULL value is found, use it; if all values are NULL, return NULL. Only output the table and trim any excess whitespace from the table’s output: \n\n---"+("\n\n----------------------------\n\n".join(item[1]))
        print(f"Task: {task_id}\n\nPROMPT: {prompt}", file=f)
        print("\n\n----------------------------\n\n", file=f)
        
    completion = client.chat.completions.create(
          model="o1-mini-2024-09-12",
          messages=[
            {"role": "user", "content": prompt}
          ],
          n=1
        )
    result = completion.choices[0].message.content
    with open(f'output/results_{JOB_ID}.txt', 'a') as f:
        print(f"Task: {task_id}\n\nRESULT: {result}", file=f)
        print("\n\n----------------------------\n\n", file=f)
        
    return result

j = 0
while len(results) > 1:
    with concurrent.futures.ThreadPoolExecutor(max_workers=N_PROC) as executor:
        start_indexes = list(range(0, len(results), MAX_COALESCING_COUNT))
        results = list(executor.map(coalesce, [(f'c-{j}-{index//MAX_COALESCING_COUNT}', results[index:index+MAX_COALESCING_COUNT]) for index in start_indexes]))
    j += 1
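
The shape of this iterative reduction can be checked locally. The sketch below replaces the LLM-based coalesce with a deterministic merge over dictionaries, which is an assumption made for illustration and not what the notebook does. The names coalesce_tables and reduce_in_rounds are hypothetical.

```python
def coalesce_tables(tables):
    """Merge mapping dicts: keep the first non-NULL value seen for each external product."""
    merged = {}
    for table in tables:
        for external, internal in table.items():
            if merged.get(external, 'NULL') == 'NULL':
                merged[external] = internal
    return merged

def reduce_in_rounds(tables, batch_size):
    """Coalesce in batches until one table remains; return it with the round count."""
    if batch_size < 2:
        raise ValueError('batch_size must be at least 2, or the loop never shrinks the list')
    rounds = 0
    while len(tables) > 1:
        tables = [coalesce_tables(tables[start:start + batch_size])
                  for start in range(0, len(tables), batch_size)]
        rounds += 1
    return tables[0], rounds

# Five partial tables, one per chunk of internal products (toy data).
LIPTON = 'DIET LIPTON GREEN TEA W/ CITRUS 20 OZ'
COOKIE = 'COOKIE PEANUT BUTTER 2OZ'
partials = [
    {LIPTON: 'NULL', COOKIE: 'NULL'},
    {LIPTON: 'Lipton Diet Green Tea with Citrus (20oz)', COOKIE: 'NULL'},
    {LIPTON: 'NULL', COOKIE: 'NULL'},
    {LIPTON: 'NULL', COOKIE: 'NULL'},
    {LIPTON: 'NULL', COOKIE: 'NULL'},
]

final, rounds = reduce_in_rounds(partials, batch_size=2)
print(final, rounds)
```

With a batch size of 2 the five tables shrink to 3, then 2, then 1. The same loop with MAX_COALESCING_COUNT = 20 needs far fewer rounds, at the cost of larger prompts per request.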

## References

- Kudo, T. and Richardson, J., 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In <i>Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</i>, pp.66-71. Brussels, Belgium: Association for Computational Linguistics.
- Robertson, S. and Zaragoza, H., 2009. The probabilistic relevance framework: BM25 and beyond. <i>Foundations and Trends® in Information Retrieval</i>, 3(4), pp.333-389.
- Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F. and Rodriguez, A., 2023. Llama: Open and efficient foundation language models. <i>arXiv preprint arXiv:2302.13971</i>.
- Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E.H., Le, Q.V. and Zhou, D., 2022. Chain-of-thought prompting elicits reasoning in large language models. In <i>Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS '22)</i>, Article 1800, pp.24824-24837. Red Hook, NY, USA: Curran Associates Inc.
- Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A. and Zhou, D., 2023. Self-consistency improves chain of thought reasoning in language models. <i>The Eleventh International Conference on Learning Representations</i>.