In [1]:
import pandas

import spacy
from webcolors import CSS3_NAMES_TO_HEX

from datasets import Dataset
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments, Trainer
import torch

from Helpers.helpers import preprocess_description, preprocess_specification_column, reg_extract, reg_extract_color, spacy_extract,preprocess_training_examples,preprocess_validation_examples
nlp = spacy.load('en_core_web_sm')

# Data Loading and Analysis

In [2]:
df=pandas.read_csv("marketing_sample_for_amazon_com-ecommerce_data.csv")

In [3]:
df.head(5)

Unnamed: 0,Uniq Id,Product Name,Brand Name,Asin,Category,Upc Ean Code,List Price,Selling Price,Quantity,Model Number,...,Product Url,Stock,Product Details,Dimensions,Color,Ingredients,Direction To Use,Is Amazon Seller,Size Quantity Variant,Product Description
0,4c69b61db1fc16e7013b43fc926e502d,"DB Longboards CoreFlex Crossbow 41"" Bamboo Fib...",,,Sports & Outdoors | Outdoor Recreation | Skate...,,,$237.68,,,...,https://www.amazon.com/DB-Longboards-CoreFlex-...,,,,,,,Y,,
1,66d49bbed043f5be260fa9f7fbff5957,"Electronic Snap Circuits Mini Kits Classpack, ...",,,Toys & Games | Learning & Education | Science ...,,,$99.95,,55324.0,...,https://www.amazon.com/Electronic-Circuits-Cla...,,,,,,,Y,,
2,2c55cae269aebf53838484b0d7dd931a,3Doodler Create Flexy 3D Printing Filament Ref...,,,Toys & Games | Arts & Crafts | Craft Kits,,,$34.99,,,...,https://www.amazon.com/3Doodler-Plastic-Innova...,,,,,,,Y,,
3,18018b6bc416dab347b1b7db79994afa,Guillow Airplane Design Studio with Travel Cas...,,,Toys & Games | Hobbies | Models & Model Kits |...,,,$28.91,,142.0,...,https://www.amazon.com/Guillow-Airplane-Design...,,,,,,,Y,,
4,e04b990e95bf73bbe6a3fa09785d7cd0,Woodstock- Collage 500 pc Puzzle,,,Toys & Games | Puzzles | Jigsaw Puzzles,,,$17.49,,62151.0,...,https://www.amazon.com/Woodstock-Collage-500-p...,,,,,,,Y,,


In [4]:
df.shape

(10002, 28)

Unstructured data, empty columns and missing values:

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10002 entries, 0 to 10001
Data columns (total 28 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Uniq Id                10002 non-null  object 
 1   Product Name           10002 non-null  object 
 2   Brand Name             0 non-null      float64
 3   Asin                   0 non-null      float64
 4   Category               9172 non-null   object 
 5   Upc Ean Code           34 non-null     object 
 6   List Price             0 non-null      float64
 7   Selling Price          9895 non-null   object 
 8   Quantity               0 non-null      float64
 9   Model Number           8232 non-null   object 
 10  About Product          9729 non-null   object 
 11  Product Specification  8370 non-null   object 
 12  Technical Details      9212 non-null   object 
 13  Shipping Weight        8864 non-null   object 
 14  Product Dimensions     479 non-null    object 
 15  Im

Remove irrelevant columns (empty, redundant, or not useful for this task):

In [6]:
cols = [0,2,3,4,5,6,7,8,9,13,15,16,17,18,19,20,21,22,23,24,25,26,27]
df.drop(df.columns[cols], axis =1, inplace=True)
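
Dropping columns by position breaks silently if the CSV layout changes. As a sketch of an alternative, shown on a toy frame with invented values rather than the real dataset, columns can be kept by name and fully empty columns can be detected programmatically.

```python
import pandas

# Toy frame standing in for the real dataset (all values are invented).
toy = pandas.DataFrame({
    "Uniq Id": ["a1", "b2"],
    "Product Name": ["Puzzle", "Kite"],
    "Brand Name": [None, None],  # entirely empty, like several real columns
    "Product Dimensions": ["1 x 2 x 3 inches", None],
})

# Columns that contain no non-null value at all.
empty_columns = [name for name in toy.columns if toy[name].isna().all()]

# Keep columns by name instead of dropping them by position.
keep = ["Product Name", "Product Dimensions"]
trimmed = toy[keep]

print(empty_columns)
print(list(trimmed.columns))
```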

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10002 entries, 0 to 10001
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Product Name           10002 non-null  object
 1   About Product          9729 non-null   object
 2   Product Specification  8370 non-null   object
 3   Technical Details      9212 non-null   object
 4   Product Dimensions     479 non-null    object
dtypes: object(5)
memory usage: 390.8+ KB


# Preprocessing

- The `Product Specification` column mainly contains dimension and weight information and needs dedicated preprocessing because it is unstructured.
- The `preprocess_specification_column` function adds spaces around digits and keywords to make the text clearer for the POS tagger. Example (before and after applying the function):

In [15]:
print(df['Product Specification'][2],'\n'*2,preprocess_specification_column(df['Product Specification'][2]))

ProductDimensions:10.3x3.4x0.8inches|ItemWeight:12.8ounces|ShippingWeight:12.8ounces(Viewshippingratesandpolicies)|ASIN:B07D36747F|Manufacturerrecommendedage:14yearsandup 

 ProductDimensions : 10.3 x 3.4 x 0.8 inches | item weight : 12.8 ounces |ShippingWeight : 12.8 ounces (Viewshippingratesandpolicies)|ASIN : B07D36747F|Manufacturerrecommendedage : 14yearsandup
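
The real helper lives in `Helpers.helpers` and is not shown in this notebook. The following is only a simplified sketch of the idea, covering the dimension separator, a few units, and the `ItemWeight` keyword; `space_out_specification` and `UNITS` are hypothetical names.

```python
import re

# Hypothetical, simplified stand-in for preprocess_specification_column.
UNITS = r"(inches|ounces|pounds|lbs)"

def space_out_specification(text: str) -> str:
    # "10.3x3.4x0.8" -> "10.3 x 3.4 x 0.8"
    text = re.sub(r"(\d)x(\d)", r"\1 x \2", text)
    # "0.8inches" -> "0.8 inches"
    text = re.sub(r"(\d)" + UNITS, r"\1 \2", text)
    # Normalize the keyword that the weight pattern looks for.
    text = text.replace("ItemWeight", "item weight")
    # Separate keys from values.
    return text.replace(":", " : ")

raw = "ProductDimensions:10.3x3.4x0.8inches|ItemWeight:12.8ounces"
print(space_out_specification(raw))
```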


- The `Context` column is the concatenation of all relevant columns.
- The `preprocess_description` function removes stop words, punctuation, and tokens with irrelevant POS tags (verbs, determiners, ...).
- The `Product Dimensions` column, renamed to `Answer`, will be used later as the answer in a question answering task.

In [16]:
dataset=(
     df
     .fillna({'Product Name':'','About Product':'','Product Specification':'','Technical Details':''})
     .assign(**{'Product Specification': lambda df: df['Product Specification'].apply(preprocess_specification_column),
               'Context':lambda df: df['Product Name']+' '+df['About Product']+' '+df['Product Specification']+' '+df['Technical Details']})
     .assign(Context=lambda df: df['Context'].apply(preprocess_description))
     .drop(columns=['Product Name','About Product','Product Specification','Technical Details'])
     .rename(columns={'Product Dimensions':'Answer'})
     
 )

In [17]:
dataset.head(5)

Unnamed: 0,Answer,Context
0,,41 bamboo fiberglass complete sure model numbe...
1,14.7 x 11.1 x 10.2 inches 4.06 pounds,electronic snap circuits mini kits radio motio...
2,,3doodler flexy 3d printing filament refill bun...
3,,guillow airplane design studio travel case bui...
4,,collage 500 pc puzzle sure model number | puzz...


# Use Regular Expressions

The three regex patterns below extract the three target features; they are written to match the text produced by the preprocessing above.<br>
`CSS3_NAMES_TO_HEX` is a dictionary of CSS3 color names imported from the `webcolors` library.

In [18]:
reg_dimension_pattern = r'(\d+(\.\d+)?\s*x\s*\d+(\.\d+)?\s*x\s*\d+(\.\d+)?\s*inches)'
reg_weight_pattern=r'(item weight\s*(\d+(\.\d+)?)\s*(pounds|ounces|lbs))'
reg_color_pattern=r'\b(' + '|'.join(list(CSS3_NAMES_TO_HEX.keys())) + r')\b'

Extract color from the context using the `reg_extract_color` function, and weight and dimensions using the `reg_extract` function.

In [19]:
features_extracted_regex=( 
    dataset['Context'].to_frame()
    .assign(**{'Color': lambda df: df['Context'].apply(lambda x: reg_extract_color(x,reg_color_pattern)),
               'weight': lambda df: df['Context'].apply(lambda x: reg_extract(x,reg_weight_pattern)),
               'Dimension': lambda df: df['Context'].apply(lambda x: reg_extract(x,reg_dimension_pattern))})
)
features_extracted_regex.to_excel('Extracted Features.xlsx',sheet_name='Regex')
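
`reg_extract` and `reg_extract_color` come from `Helpers.helpers` and are not shown here. A minimal sketch of what such a helper might do is to return the first match, or `None` when nothing matches, which is consistent with counting hits through `.notnull()`. `first_match` is a hypothetical name, and the three-color list is a small stand-in for `CSS3_NAMES_TO_HEX.keys()`.

```python
import re
from typing import Optional

# Dimension and weight patterns copied from the notebook cell;
# the color list is a tiny stand-in for CSS3_NAMES_TO_HEX.keys().
DIMENSION_PATTERN = r'(\d+(\.\d+)?\s*x\s*\d+(\.\d+)?\s*x\s*\d+(\.\d+)?\s*inches)'
WEIGHT_PATTERN = r'(item weight\s*(\d+(\.\d+)?)\s*(pounds|ounces|lbs))'
COLOR_PATTERN = r'\b(' + '|'.join(['black', 'red', 'blue']) + r')\b'

def first_match(text: str, pattern: str) -> Optional[str]:
    """Return the first full match of `pattern` in `text`, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

sample = "red toy car item weight 12.8 ounces 10.3 x 3.4 x 0.8 inches"
print(first_match(sample, DIMENSION_PATTERN))
print(first_match(sample, WEIGHT_PATTERN))
print(first_match(sample, COLOR_PATTERN))
```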

### Results

In [20]:
nbr_dimension=len(features_extracted_regex[features_extracted_regex['Dimension'].notnull()])
nbr_weight=len(features_extracted_regex[features_extracted_regex['weight'].notnull()])
nbr_color=len(features_extracted_regex[features_extracted_regex['Color'].notnull()])
print('The dimension feature is found for: ',nbr_dimension,' rows')
print('The weight feature is found for: ',nbr_weight,' rows')
print('The Color feature is found for: ',nbr_color, ' rows')

The dimension feature is found for:  7991  rows
The weight feature is found for:  7483  rows
The Color feature is found for:  3503  rows


# Use the spaCy Library: POS-based

Define rule-based patterns using POS tags and keywords:
- Dimension: xx x xx x xx inches
- Weight: item weight xx (pounds | ounces | lbs)
- Color: any name in the list of colors

In [21]:
dimension_pattern = [{'POS': 'NUM'},{'LOWER': 'x'},{'POS': 'NUM'},{'LOWER': 'x'}, {'POS':'NUM'},
                     {'LOWER': 'inches'}] 
weight_pattern = [{'LOWER': 'item'},{'LOWER': 'weight'},{'POS': 'NUM'},
                  {'LOWER': {'IN': ['ounces', 'pounds','lbs']}}]
color_pattern=[{'LOWER': {'IN': list(CSS3_NAMES_TO_HEX.keys())}}]

`spacy_extract` is a helper that finds token patterns in a text using spaCy.

In [24]:
features_extracted_spacy=( 
    dataset['Context'].to_frame()
    .assign(**{'Color': lambda df: df['Context'].apply(lambda x: spacy_extract(x,color_pattern)),
               'weight': lambda df: df['Context'].apply(lambda x: spacy_extract(x,weight_pattern)),
               'Dimension': lambda df: df['Context'].apply(lambda x: spacy_extract(x,dimension_pattern))})
)
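
`spacy_extract` is defined in `Helpers.helpers` and is not shown here. The sketch below illustrates rule-based matching with spaCy's `Matcher` under two assumptions: it uses a blank English pipeline so that no model download is needed, and it therefore replaces the `{'POS': 'NUM'}` condition with `{'LIKE_NUM': True}`, since POS tags require a trained tagger. `match_first` is a hypothetical name.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline only tokenizes; LIKE_NUM stands in for POS == NUM here.
nlp_blank = spacy.blank("en")

dimension_pattern_sketch = [
    {"LIKE_NUM": True}, {"LOWER": "x"},
    {"LIKE_NUM": True}, {"LOWER": "x"},
    {"LIKE_NUM": True}, {"LOWER": "inches"},
]

def match_first(text, pattern):
    """Return the text of the first span matching `pattern`, or None."""
    matcher = Matcher(nlp_blank.vocab)
    matcher.add("FEATURE", [pattern])
    doc = nlp_blank(text)
    matches = matcher(doc)
    if not matches:
        return None
    _, start, end = matches[0]
    return doc[start:end].text

print(match_first("kit 10.3 x 3.4 x 0.8 inches item weight 12.8 ounces",
                  dimension_pattern_sketch))
```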

### Results

Among 10002 samples we could extract features for:

In [26]:
nbr_dimension=len(features_extracted_spacy[features_extracted_spacy['Dimension'].notnull()])
nbr_weight=len(features_extracted_spacy[features_extracted_spacy['weight'].notnull()])
nbr_color=len(features_extracted_spacy[features_extracted_spacy['Color'].notnull()])
print('The dimension feature is found for: ',nbr_dimension,' rows')
print('The weight feature is found for: ',nbr_weight,' rows')
print('The Color feature is found for: ',nbr_color, ' rows')

The dimension feature is found for:  7991  rows
The weight feature is found for:  7484  rows
The Color feature is found for:  3495  rows


Store results:

In [27]:
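# Write both result sets to one workbook; the context manager saves and closes the file
# (`ExcelWriter.save()` was removed in pandas 2.0).
with pandas.ExcelWriter('Extracted Features.xlsx', engine='xlsxwriter') as writer:
    features_extracted_regex.to_excel(writer, sheet_name='Regex')
    features_extracted_spacy.to_excel(writer, sheet_name='SpaCy')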

#### Observation:

The regular expressions extracted about as many features as the spaCy rules (the same number of dimensions, one fewer weight, slightly more colors) and did so in less time. This is due to the regular structure of the information we are looking for. It does not generalize to more complicated cases where regular expressions are limited (entity recognition, particular linguistic constructions).

# Use a Question Answering Pretrained Model

Intuitively, we can treat this as a question answering task where the 'context' is the product description and the 'answer' is the explicit feature value taken from the column with the same feature name. <br> In our dataset this labelling is only available for the Product Dimensions feature (479 non-null values). That is why we will use a pretrained model to answer the question "What are the product dimensions?" and evaluate it.

In [28]:
model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

One sample to test this pretrained model:

In [29]:
sample=dataset['Context'][1]

In [30]:
question_weight = [ "What is the product weight?"]
question_color=["What is the product color?"]
question_dimension=[ "What is the product dimensions?"]
context = [sample]

Answer generator:

In [31]:
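def answer_question(context,question):
    # Tokenize the (question, context) pair; inputs longer than the model limit are truncated
    inputs = tokenizer(question, context, padding=True, truncation=True, return_tensors="pt")
    # Inference only, so skip gradient tracking to save memory
    with torch.no_grad():
        output = model(**inputs)
    start_logit, end_logit = output.start_logits, output.end_logits
    # Most likely start token and (exclusive) end token of the answer span
    answer_start = torch.argmax(start_logit, dim=1)
    answer_end = torch.argmax(end_logit, dim=1) + 1
    # Decode the selected token ids of the first (only) example back to text
    answer = [tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][answer_start[0]:answer_end[0]]))]
    return answer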

In [32]:
answer_question(context,question_dimension)

['14. 7 x 11. 1 x 10. 2 inches']
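
`answer_question` picks the start and end tokens independently, so the end can fall before the start and produce an empty answer. A common alternative is to score every valid (start, end) pair jointly. The sketch below does this on plain Python lists of logits; `best_span` and `max_answer_len` are illustrative names, not part of the notebook, and the logits are invented.

```python
from typing import List, Tuple

def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_len: int = 30) -> Tuple[int, int]:
    """Return (start, end), end inclusive, maximizing start + end logit with start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        last = min(len(end_logits), start + max_answer_len)
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best

# Invented logits where the independent argmaxes disagree (start=2, end=1).
start_logits = [0.1, 0.2, 5.0, 0.3]
end_logits = [0.0, 4.0, 0.5, 3.0]
print(best_span(start_logits, end_logits))
```

With independent argmaxes this example would give an end index before the start index; the joint search instead falls back to the best end that follows the chosen start.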

In [33]:
results=( 
    dataset['Context'].to_frame()
    .assign(**{'Color': lambda df: df['Context'].apply(lambda x: answer_question([x],question_color)),
               'weight': lambda df: df['Context'].apply(lambda x: answer_question([x],question_weight)),
               'Dimension': lambda df: df['Context'].apply(lambda x: answer_question([x],question_dimension))})
)
results.to_excel('results.xlsx',sheet_name='QuAn')

We get numerous examples where a regular expression or a POS-tag-based system would fail to extract the information. This is especially visible for the Dimensions feature, which has multiple possible patterns. Examples:

In [13]:
print(results['Dimension'][9296],'\n',results['Dimension'][9171],'\n',results['Dimension'][8666])

['59. 8 ( l ) * 48 ( w ) * 36. 8 ( h ) / 173. 5 / nimh cells'] 
 ['10 length x 0. 5 depth x 15 height inches'] 
 ['49 x 32 total length 20 tail 10 x 6. 5 x | 300 x 80']


However, we also get many flawed predictions, which are particularly frequent for the Color feature, so we need a way to filter them out. <br> Since fine-tuning requires significant computational resources, we could prioritize the features for which the pretrained model performs badly. <br>
But in this situation, we only have labeled data for the Dimensions feature (assuming that the values in the Product Dimensions column are correct).
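
One inexpensive way to discard flawed Color predictions, offered only as a sketch, is to keep a predicted answer only when it names a known color. `KNOWN_COLORS` below is a small stand-in for the keys of `CSS3_NAMES_TO_HEX`, `keep_color_answer` is a hypothetical helper, and the sample predictions are invented.

```python
import re
from typing import Optional

# Small stand-in for CSS3_NAMES_TO_HEX.keys().
KNOWN_COLORS = {"black", "red", "blue", "white"}

def keep_color_answer(answer: str) -> Optional[str]:
    """Return the first known color named in `answer`, or None if there is none."""
    for word in re.findall(r"[a-z]+", answer.lower()):
        if word in KNOWN_COLORS:
            return word
    return None

predictions = ["matte black finish", "59. 8 ( l ) * 48 ( w )", "Red"]
filtered = [keep_color_answer(p) for p in predictions]
print(filtered)
```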

# Fine Tune on our Dataset

The dataset we have is tiny (479 labeled samples), but the pattern is fairly regular and repetitive, so we expect fine-tuning to improve the results.

In [34]:
question=["What are the product dimensions?"]

In [35]:
dataset=dataset.dropna(subset=['Answer'])

To use Hugging Face Transformers, we need to convert our data to the format expected by its predefined functions.

### Setup

In [38]:
data_dict = {'id': list(range(len(dataset))),
             'context': dataset['Context'].tolist(),
             'answer': dataset['Answer'].tolist(),
             'question':question*len(dataset)}
data = Dataset.from_dict(data_dict)
data

Dataset({
    features: ['id', 'context', 'answer', 'question'],
    num_rows: 479
})

- Split the data into training and validation sets
- Convert the sets to the compatible dataset type
- Apply the necessary preprocessing to the training and validation sets

In [39]:
train_data, val_data = train_test_split(data, test_size=0.2, random_state=42)
train_data, val_data = Dataset.from_dict(train_data),Dataset.from_dict(val_data)
train_data, val_data = train_data.map(preprocess_training_examples, batched=True),val_data.map(preprocess_validation_examples, batched=True)
train_data, val_data

Map:   0%|          | 0/383 [00:00<?, ? examples/s]

Map:   0%|          | 0/96 [00:00<?, ? examples/s]

(Dataset({
     features: ['id', 'context', 'answer', 'question', 'input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'],
     num_rows: 383
 }),
 Dataset({
     features: ['id', 'context', 'answer', 'question', 'input_ids', 'token_type_ids', 'attention_mask'],
     num_rows: 96
 }))
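
`preprocess_training_examples` comes from `Helpers.helpers` and is not shown here. Extractive QA training generally needs the position of the answer inside the context, from which token-level labels such as the `start_positions` and `end_positions` columns are derived. A minimal sketch of the character-level step, assuming the answer occurs verbatim in the context; `locate_answer` is a hypothetical name and the context string is invented.

```python
from typing import Optional, Tuple

def locate_answer(context: str, answer: str) -> Optional[Tuple[int, int]]:
    """Return (start_char, end_char) of `answer` in `context`, end exclusive, or None."""
    start = context.find(answer)
    if start == -1:
        # The label does not appear verbatim, so it cannot serve as an extractive target.
        return None
    return start, start + len(answer)

context = "snap circuits kit 14.7 x 11.1 x 10.2 inches 4.06 pounds"
answer = "14.7 x 11.1 x 10.2 inches"
span = locate_answer(context, answer)
print(span, context[span[0]:span[1]])
```

Labels that do not occur verbatim in the preprocessed context return `None` here and would need separate handling before training.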

In [40]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

In [41]:
device

device(type='cuda')

In [42]:
torch.cuda.empty_cache()

We use the Trainer API from Hugging Face with hyperparameters tuned for the available computational resources, but we still could not get rid of the 'CUDA out of memory' error:

In [43]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,              
    per_device_train_batch_size=6,  
    per_device_eval_batch_size=6,   
    warmup_steps=2,                
    weight_decay=0.01,               
    #logging_dir='./logs',
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
)
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: answer, id, question, context. If answer, id, question, context are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 383
  Num Epochs = 2
  Instantaneous batch size per device = 6
  Total train batch size (w. parallel, distributed & accumulation) = 6
  Gradient Accumulation steps = 1
  Total optimization steps = 128
  Number of trainable parameters = 334094338


RuntimeError: CUDA out of memory. Tried to allocate 54.00 MiB (GPU 0; 4.00 GiB total capacity; 3.42 GiB already allocated; 0 bytes free; 3.48 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
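
A rough estimate suggests why tuning the batch size alone does not help. Assuming full-precision (4-byte) weights and gradients plus two 4-byte optimizer moment estimates per parameter, as with an Adam-style optimizer, the model state alone exceeds the 4 GiB reported in the error before any activations are stored. The 16 bytes per parameter is that assumption, not a measurement.

```python
# Figures taken from the training log and the error message.
trainable_parameters = 334_094_338
gpu_capacity_gib = 4.00

# Assumption: fp32 weights (4 B) + fp32 gradients (4 B) + two fp32 Adam moments (8 B).
bytes_per_parameter = 4 + 4 + 8

required_gib = trainable_parameters * bytes_per_parameter / 2**30
print(f"Model state: about {required_gib:.2f} GiB; GPU capacity: {gpu_capacity_gib:.2f} GiB")
```

Under this assumption, options that shrink the model state itself, such as a smaller checkpoint or freezing part of the network, look more promising than a smaller batch size.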

### Conclusion

Regular expressions and rule-based pipelines can be effective for information retrieval, particularly when the data is well structured. However, they reach their limits on highly unstructured data, where patterns and rules are difficult to define, and when the information we seek does not follow a particular pattern. This is where Large Language Models (LLMs) come into play, as they can help overcome these limitations. LLMs have their own limitations, though: they require labeled data for fine-tuning and evaluation before their predictions can be trusted with high confidence. <br> Ensembling can be a useful approach: several methods are combined to balance accuracy and performance, and the most effective method is used for each feature. It would also be worthwhile to explore unsupervised fine-tuning methods.