## Demo Notebook for CaML paper
This notebook shows how to use text similarity matching to estimate the manufacturing (cradle-to-gate) emissions of a product using economic input-output life cycle assessment (EIO-LCA). We use a language model to match the product to an industry sector defined by the North American Industry Classification System (NAICS). The emission factors come from the USEEIO database published by the United States Environmental Protection Agency (US EPA).

We recommend running this notebook on a GPU machine for fast inference. 
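
Before running the cells, it can be useful to confirm whether a GPU is actually visible. The following is a minimal sketch, assuming the model runs on PyTorch (the later calls to `.sort(dim=1)` and `.cpu()` suggest torch tensors, but the notebook does not state this). The helper name `pick_device` is hypothetical and is not part of the `caml` package.

```python
import importlib.util


def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    # Fall back to CPU if PyTorch is not installed at all.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Inference device: {device}")
```

If this reports `cpu`, the notebook should still run, only more slowly.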

In [2]:
# dataframe tools
import pandas as pd
from tqdm import tqdm

# metrics functions
from sklearn.metrics import mean_absolute_percentage_error as mape
from sklearn.metrics import r2_score

# custom package
from caml import config
from caml.similarity import MLModel

# interactive input tools
import ipywidgets as widgets
from ipywidgets import VBox

In [3]:
naics_df = pd.read_pickle('../data/naics_codes.pkl')

In [4]:
product_list = [
    "chocolate chip cookie",
    "mint tea",
    "bottled water",
    "wet canned cat food",
    "apple juice",
]

In [5]:
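# Load the language model named in the package config
model = MLModel(config.model_name)
# Cleaned NAICS text to match against (one entry per row of naics_df)
naics_list = naics_df.text_clean.values
# Similarity scores: one row per product, one column per NAICS entry
cosine_scores = model.compute_similarity_scores(product_list, naics_list)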
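
To illustrate what a cosine score matrix of this shape contains, here is a minimal sketch in plain Python. It is not the implementation of `MLModel.compute_similarity_scores`, whose internals are not shown in this notebook. The three-dimensional vectors are made-up stand-ins for text embeddings, and `cosine_similarity` and `similarity_matrix` are hypothetical helper names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(queries, candidates):
    """One row per query vector, one column per candidate vector."""
    return [[cosine_similarity(q, c) for c in candidates] for q in queries]


# Toy vectors standing in for embeddings (invented for illustration).
product_vecs = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
sector_vecs = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 0.0, 1.0]]

scores = similarity_matrix(product_vecs, sector_vecs)

# Index of the highest-scoring candidate for each query.
best = [max(range(len(row)), key=row.__getitem__) for row in scores]
print(best)
```

Cosine similarity ignores vector length, so the second candidate matches the second query perfectly even though it is twice as long.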

In [6]:
# Rank the NAICS entries for each product from most to least similar
sorted_cs, indices = cosine_scores.sort(dim=1, descending=True)
result_df = pd.DataFrame()

for ix, product in enumerate(product_list):
    sorted_product_cs = sorted_cs[ix].cpu().numpy()
    naics_ix = indices[ix].cpu().numpy()
    # Keep only the top-ranked match (position 0) for this product.
    # .loc is label-based, so this assumes naics_df has a default 0..n-1 index.
    result_df.loc[ix, 'product'] = product
    result_df.loc[ix, 'naics_code'] = naics_df.loc[naics_ix[0], 'naics_code']
    result_df.loc[ix, 'naics_title'] = naics_df.loc[naics_ix[0], 'Title']
    result_df.loc[ix, 'co2e_per_dollar'] = naics_df.loc[naics_ix[0], 'eio_co2']
    # Round the best score to three decimals
    result_df.loc[ix, 'cosine_score'] = float("{:.3f}".format(sorted_product_cs[0]))

result_df

| | product | naics_code | naics_title | co2e_per_dollar | cosine_score |
|---|---|---|---|---|---|
| 0 | chocolate chip cookie | 311821.0 | Cookie and Cracker Manufacturing | 0.875952 | 0.596 |
| 1 | mint tea | 311920.0 | Coffee and Tea Manufacturing | 0.604888 | 0.736 |
| 2 | bottled water | 312112.0 | Bottled Water Manufacturing | 0.689474 | 0.643 |
| 3 | wet canned cat food | 311111.0 | Dog and Cat Food Manufacturing | 1.03369 | 0.743 |
| 4 | apple juice | 312111.0 | Soft Drink Manufacturing | 0.689474 | 0.573 |
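
In EIO-LCA, a per-dollar emission factor is multiplied by the amount spent to obtain an emissions estimate. The sketch below applies that arithmetic to two factors copied from the results table above. The prices are hypothetical, `estimate_emissions` is an illustrative helper rather than part of the `caml` package, and the result is in whatever unit the `eio_co2` column uses, which this notebook does not state.

```python
# Emission factors copied from the co2e_per_dollar column of the results table.
co2e_per_dollar = {
    "chocolate chip cookie": 0.875952,
    "bottled water": 0.689474,
}

# Hypothetical purchase prices in US dollars (assumed for illustration only).
prices_usd = {
    "chocolate chip cookie": 4.00,
    "bottled water": 1.50,
}


def estimate_emissions(price_usd, factor_per_dollar):
    """EIO-LCA estimate: amount spent times the per-dollar emission factor."""
    return price_usd * factor_per_dollar


estimates = {
    name: estimate_emissions(prices_usd[name], co2e_per_dollar[name])
    for name in prices_usd
}

for name, value in estimates.items():
    print(f"{name}: {value:.3f}")
```

Because the factor is per dollar, the estimate scales linearly with price, so two products mapped to the same sector differ only by what they cost.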


#### Try your own product

In [28]:
# Custom product name
product = input()
cosine_scores = model.compute_similarity_scores([product], naics_list)
sorted_cs, indices = cosine_scores.sort(dim=1, descending=True)

In [29]:
sorted_product_cs = sorted_cs[0].cpu().numpy()
naics_ix = indices[0].cpu().numpy()
naics_code = naics_df.loc[naics_ix[0], 'naics_code']
naics_title = naics_df.loc[naics_ix[0], 'Title']
co2e_per_dollar = naics_df.loc[naics_ix[0], 'eio_co2']
cosine_score = float("{:.3f}".format(sorted_product_cs[0]))

# The message reports the matched NAICS title. The "confidence" is the cosine
# score scaled to a percentage, not a calibrated probability.
print("The best-matching NAICS industry for the product '{}' is '{}'. Prediction confidence is: {:.1f}%.".format(product, naics_title.strip(), cosine_score*100))
print("The cradle-to-gate carbon emission per dollar for this product is {:.3f}.".format(co2e_per_dollar))

The best-matching NAICS industry for the product 'Organic Carrots' is 'Other Vegetable (except Potato) and Melon Farming'. Prediction confidence is: 58.5%.
The cradle-to-gate carbon emission per dollar for this product is 0.997.
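
When the top score is modest, it may help to look at several of the highest-ranked candidates rather than only the first. The sketch below shows one way to do this with plain Python lists. The scores and sector labels are invented for illustration, and `top_k` is a hypothetical helper, not a function of the `caml` package.

```python
def top_k(scores, labels, k=3):
    """Return the k (label, score) pairs with the highest scores, best first."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return [(label, round(score, 3)) for score, label in ranked[:k]]


# Invented scores and placeholder labels, for illustration only.
toy_scores = [0.41, 0.585, 0.52, 0.12]
toy_labels = ["sector A", "sector B", "sector C", "sector D"]

for label, score in top_k(toy_scores, toy_labels, k=3):
    print(f"{label}: {score}")
```

In the notebook itself, the same information is already available in `sorted_cs[0]` and `indices[0]`, whose leading entries correspond to the best candidates.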
