# Exploratory Data Analysis - Grouping Metrics
This notebook focuses on building a metric that checks whether products are categorized correctly.

Naming convention is a number (for ordering), the creator's initials, and a short `-` delimited description. [source](https://cookiecutter-data-science.drivendata.org/)

In [1]:
import os
import json
import numpy as np

In [2]:
!ls ../examples/model_outputs/

2.json	4.json	5.json	7.json


In [3]:
!ls ../examples/ground_truth/

example_01.json  example_05.json  example_09.json  example_13.json
example_02.json  example_06.json  example_10.json
example_03.json  example_07.json  example_11.json
example_04.json  example_08.json  example_12.json


In [4]:
# Load every model output, keyed as "model_<file stem>" (e.g. "model_4").
model_dir = "../examples/model_outputs/"
data = {}
for model in os.listdir(model_dir):
    with open(os.path.join(model_dir, model), 'r') as f:
        data[f'model_{model.split(".json")[0]}'] = json.load(f)

In [5]:
data

{'model_4': {'estimate_preds': [{'valid_file_name': 'example_01',
    'rows': [{'label': 'Protect Work Area and Surroundings',
      'qty': 74.0,
      'uom': 'SF',
      'rateUsd': 0.25,
      'rowTotalCostUsd': 18.5,
      'category': 'labor',
      'sectionName': 'Demolition',
      'metadata': None},
     {'label': 'Demo Floor Tile',
      'qty': 8.0,
      'uom': 'HRS',
      'rateUsd': 65.0,
      'rowTotalCostUsd': 520.0,
      'category': 'labor',
      'sectionName': 'Demolition',
      'metadata': None},
     {'label': 'Demo Tub/Shower Combo',
      'qty': 4.0,
      'uom': 'HRS',
      'rateUsd': 65.0,
      'rowTotalCostUsd': 260.0,
      'category': 'labor',
      'sectionName': 'Demolition',
      'metadata': None},
     {'label': 'Demo Vanity',
      'qty': 2.0,
      'uom': 'HRS',
      'rateUsd': 65.0,
      'rowTotalCostUsd': 130.0,
      'category': 'labor',
      'sectionName': 'Demolition',
      'metadata': None},
     {'label': 'Demo Toilet',
      'qty': 1.0,
  

In [6]:
data.keys()

dict_keys(['model_4', 'model_7', 'model_5', 'model_2'])
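
For the metric cells further down, it can help to flatten the nested output into one record per row. The following is a minimal sketch, assuming every model output follows the `estimate_preds` / `rows` layout printed above; `iter_rows` is a hypothetical helper, not part of the original notebook, and the sample dictionary is hand-made to mirror that layout.

```python
def iter_rows(model_output):
    """Yield (file_name, label, category, section_name) for each predicted row.

    Assumes the layout seen in the printed `data` output:
    {"estimate_preds": [{"valid_file_name": ..., "rows": [{...}, ...]}, ...]}
    """
    for estimate in model_output.get("estimate_preds", []):
        file_name = estimate.get("valid_file_name")
        for row in estimate.get("rows", []):
            yield (
                file_name,
                row.get("label"),
                row.get("category"),
                row.get("sectionName"),
            )


# Hand-made sample that mirrors the structure of one model output.
sample = {
    "estimate_preds": [
        {
            "valid_file_name": "example_01",
            "rows": [
                {"label": "Demo Floor Tile", "category": "labor", "sectionName": "Demolition"},
                {"label": "Demo Vanity", "category": "labor", "sectionName": "Demolition"},
            ],
        }
    ]
}

rows = list(iter_rows(sample))
```

With the real data, `list(iter_rows(data['model_4']))` would give the same kind of tuples for every row of that model.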

In [7]:
# Load every ground-truth file, keyed by its file stem (e.g. "example_01").
gt_dir = "../examples/ground_truth/"
gt_data = {}
for fn in os.listdir(gt_dir):
    with open(os.path.join(gt_dir, fn), 'r') as f:
        gt_data[fn.split('.json')[0]] = json.load(f)

In [8]:
gt_data.keys()

dict_keys(['example_08', 'example_11', 'example_09', 'example_03', 'example_05', 'example_06', 'example_13', 'example_12', 'example_07', 'example_10', 'example_02', 'example_01', 'example_04'])

### Grouping Metrics

Using Llama 3.1 was getting very slow, so I've decided to create a new metric that is easier to deploy.

In [34]:
product_grouping_template = """
Category of product {}
"""

labor_grouping_template = """
Category of labor {}
"""

In [35]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
model = SentenceTransformer('all-mpnet-base-v2')
# Embed the templated row label and the section name, then compare them.
s1 = labor_grouping_template.format("Protect Work Area and Surroundings")
s2 = "Demolition"
cos_sim(model.encode(s1), model.encode(s2))

tensor([[0.1949]])

In [36]:
s1 = labor_grouping_template.format("Lay down all day")
s2 = "Demolition"
cos_sim(model.encode(s1), model.encode(s2))

tensor([[0.1654]])

In [37]:
s1 = labor_grouping_template.format('Demo Floor Tile')
s2 = "Demolition"
cos_sim(model.encode(s1), model.encode(s2))

tensor([[0.3404]])

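The separation is weak here: the correct label "Protect Work Area and Surroundings" (0.1949) scores barely above the nonsense label "Lay down all day" (0.1654). Some improvements could be made using better prompting and a better model.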
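
One possible way to make such comparisons less dependent on the absolute score is to rank several candidate section names for a label and check whether the assigned one comes first. This is only a sketch of that idea, not something the notebook does: `rank_sections` is a hypothetical helper, and the 2-D vectors are made up for illustration (real ones would come from `model.encode`).

```python
import numpy as np


def rank_sections(label_vec, section_vecs):
    """Return section names sorted by cosine similarity to label_vec, best first."""
    label_vec = np.asarray(label_vec, dtype=float)
    scored = []
    for name, vec in section_vecs.items():
        vec = np.asarray(vec, dtype=float)
        denom = np.linalg.norm(label_vec) * np.linalg.norm(vec)
        # Treat a zero vector as having no similarity instead of dividing by zero.
        sim = float(label_vec @ vec / denom) if denom else 0.0
        scored.append((sim, name))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]


# Made-up embeddings for illustration only.
label_vec = [0.9, 0.1]
section_vecs = {"Demolition": [1.0, 0.0], "Tile": [0.0, 1.0]}
ranking = rank_sections(label_vec, section_vecs)
```

A row would then count as correctly grouped when its own `sectionName` is `ranking[0]`.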

In [38]:
s1 = product_grouping_template.format("Floor Tile - 12x24 Porcelain")
s2 = "Tile"
cos_sim(model.encode(s1), model.encode(s2))

tensor([[0.6231]])

In [39]:
s1 = product_grouping_template.format("Protect Work Area and Surroundings")
s2 = "Tile"
cos_sim(model.encode(s1), model.encode(s2))

tensor([[0.0998]])

For products the separation is much clearer (0.6231 vs. 0.0998), which works fine for the purpose of this challenge.
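
The pairwise checks above can be folded into a single score per estimate. Below is a minimal sketch, assuming the metric is the mean cosine similarity between each templated label and its `sectionName`. The names `grouping_score`, `cosine`, and the `encode` parameter are hypothetical, the averaging is an assumption rather than the notebook's final metric, and any row whose `category` is not `'labor'` is assumed to use the product template. `toy_encode` is a stand-in so the sketch runs without downloading a model.

```python
import numpy as np

# Same wording as the templates defined earlier in the notebook.
LABOR_TEMPLATE = "\nCategory of labor {}\n"
PRODUCT_TEMPLATE = "\nCategory of product {}\n"


def cosine(a, b):
    """Cosine similarity of two 1-D vectors; 0.0 if either has zero norm."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def grouping_score(rows, encode):
    """Mean similarity between each templated row label and its section name."""
    scores = []
    for row in rows:
        # Assumption: every non-labor row is treated as a product.
        template = LABOR_TEMPLATE if row.get("category") == "labor" else PRODUCT_TEMPLATE
        text = template.format(row["label"])
        scores.append(cosine(encode(text), encode(row["sectionName"])))
    return float(np.mean(scores)) if scores else 0.0


def toy_encode(text):
    """Stand-in encoder for illustration only: letter counts for a-z."""
    vec = np.zeros(26)
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1
    return vec


rows = [{"label": "Demo Floor Tile", "category": "labor", "sectionName": "Demolition"}]
score = grouping_score(rows, toy_encode)
```

Passing `model.encode` instead of `toy_encode` would score the rows with the sentence-transformers model loaded earlier.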