# **Women's E-Commerce Clothing Reviews**

23,000 Customer Reviews and Ratings.

> [**Kaggle Dataset**](https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews)

In [None]:
# Install Kaggle.
!pip install --upgrade --force-reinstall --no-deps kaggle

In [None]:
# Upload the kaggle.json API credentials file.
from google.colab import files

files.upload()

In [None]:
# Create the Kaggle configuration folder (no error if it already exists).
!mkdir -p ~/.kaggle

# Copy the uploaded kaggle.json into that folder.
!cp kaggle.json ~/.kaggle/

# Restrict the credentials file so that only the current user can read it.
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
# Dataset Download.
!kaggle datasets download -d nicapotato/womens-ecommerce-clothing-reviews

In [None]:
# Unzip Dataset.
!unzip womens-ecommerce-clothing-reviews.zip

# **Multimodal Toolkit - Incorporate Tabular Data with HuggingFace Transformers**

*Real-world data often combines text with tabular features. Feeding both to a transformer-based model, rather than the text alone, can improve model performance.*

*   [**Multimodal Transformers | GitHub**](https://github.com/georgian-io/Multimodal-Toolkit)

*   [**Multimodal Transformers | Documentation**](https://multimodal-toolkit.readthedocs.io/en/latest/index.html)

*   [**Georgian Blog Post**](https://medium.com/georgian-impact-blog/how-to-incorporate-tabular-data-with-huggingface-transformers-b70ac45fcfb4)

In [None]:
!pip install multimodal-transformers

In [None]:
# Import Library.
import pandas as pd
import numpy as np
from dataclasses import dataclass, field
from typing import Optional
import json

from transformers import AutoTokenizer, AutoConfig, Trainer, EvalPrediction, set_seed
from transformers.training_args import TrainingArguments
from multimodal_transformers.data import load_data_from_folder
from multimodal_transformers.model import TabularConfig, AutoModelWithTabular

In [None]:
# Load Dataset.
data = pd.read_csv("Womens Clothing E-Commerce Reviews.csv")
data.head()

| Unnamed: 0.1 | Unnamed: 0 | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 767 | 33 |  | Absolutely wonderful - silky and sexy and comf... | 4 | 1 | 0 | Initmates | Intimate | Intimates |
| 1 | 1 | 1080 | 34 |  | Love this dress! it's sooo pretty. i happene... | 5 | 1 | 4 | General | Dresses | Dresses |
| 2 | 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses |
| 3 | 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants |
| 4 | 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses |


In [None]:
# Shuffle the dataset and split it into training, validation, and test sets in an 8:1:1 ratio.
X_train, X_valid, X_test = np.split(
    data.sample(frac=1), [int(0.8 * len(data)), int(0.9 * len(data))]
)

print(
    "Number of data points in Train-Valid-Test are",
    len(X_train),
    len(X_valid),
    len(X_test),
)

# Save the training, validation, and test sets as .csv files.
# Note: keep these file names unchanged; `load_data_from_folder()` looks for `train.csv`, `val.csv`, and `test.csv`.
X_train.to_csv("train.csv")
X_valid.to_csv("val.csv")
X_test.to_csv("test.csv")

Number of data points in Train-Valid-Test are 18788 2349 2349
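
As a rough illustration of how the two cut points passed to `np.split` produce the 8:1:1 partition, the sketch below reproduces the index arithmetic with the standard library only. The helper name `split_sizes` is hypothetical and not part of the notebook, and the row count 23486 is simply the sum of the three sizes printed above.

```python
def split_sizes(n_rows: int, train_end: float = 0.8, valid_end: float = 0.9):
    """Return (train, valid, test) sizes for cut points at int(0.8 * n) and int(0.9 * n)."""
    first_cut = int(train_end * n_rows)    # end of the training slice
    second_cut = int(valid_end * n_rows)   # end of the validation slice
    return first_cut, second_cut - first_cut, n_rows - second_cut


sizes = split_sizes(23486)  # 18788 + 2349 + 2349 rows, per the output above
print(sizes)  # (18788, 2349, 2349)
```

Because both cut points are truncated with `int()`, any leftover rows end up in the test split.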


In [None]:
@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
    """

    model_name_or_path: str = field(
        metadata={
            "help": "Path to pre-trained model or model identifier from huggingface.co/models"
        }
    )

    config_name: Optional[str] = field(
        default=None,
        metadata={
            "help": "Pre-trained config name or path if not the same as model_name"
        },
    )

    tokenizer_name: Optional[str] = field(
        default=None,
        metadata={
            "help": "Pre-trained tokenizer name or path if not the same as model_name"
        },
    )

    cache_dir: Optional[str] = field(
        default=None,
        metadata={
            "help": "Where do you want to store the pre-trained models downloaded from S3?"
        },
    )


@dataclass
class MultimodalDataTrainingArguments:
    """
    Arguments pertaining to how we combine tabular features.
    Using `HfArgumentParser`, this class can be turned into argparse arguments so that they can be specified on the command line.
    """

    data_path: str = field(
        metadata={"help": "Path to the .csv file containing the dataset."}
    )

    column_info_path: Optional[str] = field(
        default=None,
        metadata={
            "help": "Path to the .json file detailing which columns are textual, categorical, numerical, and the label."
        },
    )

    column_info: Optional[dict] = field(
        default=None,
        metadata={
            "help": "A dictionary referencing the textual, categorical, numerical, and label columns."
        },
    )

    categorical_encode_type: str = field(
        default="ohe",
        metadata={
            "help": "Scikit-learn encoder to use for encoding the categorical data.",
            "choices": ["ohe", "binary", "label", "none"],
        },
    )

    numerical_transformer_method: str = field(
        default="yeo_johnson",
        metadata={
            "help": "Scikit-learn numerical transformer to scale the numerical data.",
            "choices": ["yeo_johnson", "box_cox", "quantile_normal", "none"],
        },
    )

    task: str = field(
        default="classification",
        metadata={"help": "Training Task", "choices": ["classification", "regression"]},
    )

    mlp_division: int = field(
        default=4,
        metadata={
            "help": "The ratio of the number of hidden dims in a current layer to the next MLP layer."
        },
    )

    combine_feat_method: str = field(
        default="individual_mlps_on_cat_and_numerical_feats_then_concat",
        metadata={
            "help": "Method to combine categorical and numerical features (see README for all methods)."
        },
    )

    mlp_dropout: float = field(
        default=0.3, metadata={"help": "Dropout ratio to be used for MLP layers."}
    )

    numerical_batch_norm: bool = field(
        default=True,
        metadata={
            "help": "Whether or not to use Batch Normalization on numerical features?"
        },
    )

    use_simple_classifier: bool = field(
        default=True,
        metadata={
            "help": "Whether to use a single-layer (True) or a multi-layer (False) final classifier."
        },
    )

    mlp_activation: str = field(
        default="relu",
        metadata={
            "help": "The activation function to use for the fine-tuning layers.",
            "choices": ["relu", "prelu", "sigmoid", "tanh", "linear"],
        },
    )

    gating_beta: float = field(
        default=0.2,
        metadata={
            "help": "The beta hyperparameter to use for gating tabular data."
        },
    )

    def __post_init__(self):
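        # At least one of `column_info` or `column_info_path` must be provided.
        assert (
            self.column_info is not None or self.column_info_path is not None
        ), "Provide either `column_info` or `column_info_path`."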
        if self.column_info is None and self.column_info_path:
            with open(self.column_info_path, "r") as f:
                self.column_info = json.load(f)
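
When `column_info_path` is used instead of `column_info`, `__post_init__` reads the column description from a JSON file. A minimal sketch of writing and reading such a file follows, assuming the same keys that the `column_info` dictionary uses later in this notebook. The file name `column_info.json` and the temporary directory are made-up examples.

```python
import json
import os
import tempfile

# Same structure as the dictionary passed via `column_info` in this notebook.
column_info = {
    "textual_cols": ["Title", "Review Text"],
    "numerical_cols": ["Rating", "Age", "Positive Feedback Count"],
    "categorical_cols": ["Clothing ID", "Division Name", "Department Name", "Class Name"],
    "label_col": "Recommended IND",
    "label_list": ["Not Recommended", "Recommended"],
}

# Hypothetical location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "column_info.json")

with open(path, "w") as f:
    json.dump(column_info, f, indent=2)

# This mirrors what `__post_init__` does when only the path is given.
with open(path, "r") as f:
    loaded = json.load(f)

print(loaded == column_info)  # True
```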

In [None]:
# Separate Textual, Categorical, and Numerical Features.
textual_cols = ["Title", "Review Text"]
categorical_cols = ["Clothing ID", "Division Name", "Department Name", "Class Name"]
numerical_cols = ["Rating", "Age", "Positive Feedback Count"]

column_info_dict = {
    "textual_cols": textual_cols,
    "numerical_cols": numerical_cols,
    "categorical_cols": categorical_cols,
    "label_col": "Recommended IND",
    "label_list": ["Not Recommended", "Recommended"],
}

model_args = ModelArguments(model_name_or_path="bert-base-uncased")

# Multimodal Training Arguments (Model Hyperparameter).
data_args = MultimodalDataTrainingArguments(
    data_path=".",
    combine_feat_method="gating_on_cat_and_num_feats_then_sum",
    column_info=column_info_dict,
    task="classification",
)

training_args = TrainingArguments(
    output_dir="./logs/model_name",
    logging_dir="./logs/runs",
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    per_device_train_batch_size=32,
    num_train_epochs=2,
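    # Note: `evaluate_during_training` is only accepted by older `transformers` releases;
    # later releases replaced it with `evaluation_strategy="steps"`.
    evaluate_during_training=True,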
    logging_steps=25,
    eval_steps=250,
)

set_seed(training_args.seed)
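
To get a feel for what these settings imply, the sketch below estimates the number of optimizer steps, evaluations, and logging events. It assumes a single device, no gradient accumulation, a final partial batch that is kept, and the 18788-row training split printed earlier. `training_schedule` is a hypothetical helper, not part of `transformers`.

```python
import math


def training_schedule(n_train, batch_size, epochs, eval_steps, logging_steps):
    """Estimate step counts for a run on one device without gradient accumulation."""
    steps_per_epoch = math.ceil(n_train / batch_size)  # last partial batch is kept
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "evaluations": total_steps // eval_steps,
        "log_events": total_steps // logging_steps,
    }


schedule = training_schedule(
    n_train=18788, batch_size=32, epochs=2, eval_steps=250, logging_steps=25
)
print(schedule)
# {'steps_per_epoch': 588, 'total_steps': 1176, 'evaluations': 4, 'log_events': 47}
```

Under these assumptions the validation set is scored four times during training, which is worth keeping in mind when changing `eval_steps` or the batch size.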

In [None]:
# Instantiate the HuggingFace Tokenizer.
tokenizer_path_or_name = (
    model_args.tokenizer_name
    if model_args.tokenizer_name
    else model_args.model_name_or_path
)
print("Specified Tokenizer: ", tokenizer_path_or_name)

tokenizer = AutoTokenizer.from_pretrained(
    tokenizer_path_or_name, cache_dir=model_args.cache_dir
)

Specified Tokenizer:  bert-base-uncased


In [None]:
"""
Load the .csv files as torch datasets. The function `load_data_from_folder()` expects a path to a folder
that contains `train.csv`, `test.csv`, and/or `val.csv` with the respective splits.
"""

# Get Torch Datasets.
train_dataset, validation_dataset, test_dataset = load_data_from_folder(
    folder_path=data_args.data_path,
    text_cols=data_args.column_info["textual_cols"],
    tokenizer=tokenizer,
    label_col=data_args.column_info["label_col"],
    label_list=data_args.column_info["label_list"],
    categorical_cols=data_args.column_info["categorical_cols"],
    numerical_cols=data_args.column_info["numerical_cols"],
    sep_text_token_str=tokenizer.sep_token,
)

num_labels = len(np.unique(train_dataset.labels))  # Two labels: Not Recommended and Recommended.

config = AutoConfig.from_pretrained(
    model_args.config_name if model_args.config_name else model_args.model_name_or_path,
    cache_dir=model_args.cache_dir,
)

tabular_config = TabularConfig(
    num_labels=num_labels,
    cat_feat_dim=train_dataset.cat_feats.shape[1],
    numerical_feat_dim=train_dataset.numerical_feats.shape[1],
    **vars(data_args)
)

config.tabular_config = tabular_config

model = AutoModelWithTabular.from_pretrained(
    model_args.config_name if model_args.config_name else model_args.model_name_or_path,
    config=config,
    cache_dir=model_args.cache_dir,
)
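
The `sep_text_token_str` argument suggests that the text columns are merged into a single string per row before tokenization. The sketch below shows one plausible way to do that. `join_text_columns` is a hypothetical helper and not the toolkit's actual implementation; skipping empty fields is an assumption, and the example rows are made up.

```python
def join_text_columns(row, text_cols, sep_token="[SEP]"):
    """Join the non-empty text fields of one row with the separator token."""
    parts = [str(row[col]).strip() for col in text_cols if row.get(col)]
    return f" {sep_token} ".join(parts)


text_cols = ["Title", "Review Text"]

# Made-up rows: one with a title, one where the title is missing.
with_title = {"Title": "Flattering shirt", "Review Text": "Fits well and looks great."}
without_title = {"Title": None, "Review Text": "Fits well and looks great."}

print(join_text_columns(with_title, text_cols))
# Flattering shirt [SEP] Fits well and looks great.
print(join_text_columns(without_title, text_cols))
# Fits well and looks great.
```

`[SEP]` is the separator token of `bert-base-uncased`; other tokenizers may define a different one, which is why the notebook passes `tokenizer.sep_token` rather than a literal string.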

In [None]:
""" Define a task-specific way of computing relevant metrics. """

from scipy.special import softmax
from sklearn.metrics import (
    auc,
    precision_recall_curve,
    roc_auc_score,
    f1_score,
    confusion_matrix,
)


def classification_metrics(p: EvalPrediction):
    pred_labels = np.argmax(p.predictions, axis=1)
    pred_scores = softmax(p.predictions, axis=1)[:, 1]
    labels = p.label_ids
    if len(np.unique(labels)) == 2:
        # Binary Classification.
        roc_auc_pred_score = roc_auc_score(labels, pred_scores)
        precisions, recalls, thresholds = precision_recall_curve(labels, pred_scores)
        fscore = (2 * precisions * recalls) / (precisions + recalls)
        fscore[np.isnan(fscore)] = 0
        ix = np.argmax(fscore)
        threshold = thresholds[ix].item()
        pr_auc = auc(recalls, precisions)
        tn, fp, fn, tp = confusion_matrix(labels, pred_labels, labels=[0, 1]).ravel()
        result = {
            "roc_auc": roc_auc_pred_score,
            "threshold": threshold,
            "pr_auc": pr_auc,
            "recall": recalls[ix].item(),
            "precision": precisions[ix].item(),
            "f1": fscore[ix].item(),
            "tn": tn.item(),
            "fp": fp.item(),
            "fn": fn.item(),
            "tp": tp.item(),
        }
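    else:
        # Multiclass Classification: `f1_score` needs an explicit `average` when there are more than two classes.
        acc = (pred_labels == labels).mean()
        f1 = f1_score(y_true=labels, y_pred=pred_labels, average="macro")
        result = {"accuracy": acc, "f1": f1, "acc_and_f1": (acc + f1) / 2}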

    return result
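
The binary branch reports precision, recall, and F1 at the threshold that maximizes F1 along the precision-recall curve. The following standard-library sketch illustrates the same idea on made-up scores. `best_f1_threshold` is a hypothetical helper, and it sweeps the observed scores directly instead of calling `precision_recall_curve`.

```python
def best_f1_threshold(labels, scores):
    """Return (best_f1, threshold), predicting positive when score >= threshold."""
    best_f1, best_threshold = 0.0, None
    for threshold in sorted(set(scores)):
        preds = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
        fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
        # F1 written in terms of counts; defined as 0 when there are no true positives.
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_f1, best_threshold = f1, threshold
    return best_f1, best_threshold


# Made-up example: one negative (0.40) scores higher than one positive (0.35).
labels = [0, 0, 1, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.90]

f1, threshold = best_f1_threshold(labels, scores)
print(round(f1, 3), threshold)  # 0.857 0.35
```

In this toy case the best threshold (0.35) lies below 0.5, so the F1-optimal operating point accepts one false positive in exchange for recovering every positive.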

In [None]:
""" The HuggingFace Trainer class provides an API for feature-complete training in PyTorch for most standard use cases. """

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    compute_metrics=classification_metrics,
)

# Launching the training is as simple as doing `trainer.train()` 🤗.
trainer.train()

In [None]:
# Test Prediction Evaluation.
predictions = trainer.predict(test_dataset)
predictions

PredictionOutput(predictions=array([[ 2.49439   , -1.1896064 ],
       [-3.700037  ,  3.7272983 ],
       [-1.1800157 ,  2.4081929 ],
       ...,
       [-3.7289512 ,  3.542851  ],
       [ 1.5014398 ,  0.10571562],
       [-3.9226604 ,  4.731914  ]], dtype=float32), label_ids=array([0, 1, 1, ..., 1, 0, 1]), metrics={'eval_loss': 0.1645574186541114, 'eval_roc_auc': 0.9741790288140021, 'eval_threshold': 0.3687329590320587, 'eval_pr_auc': 0.9945012838624114, 'eval_recall': 0.9691991786447639, 'eval_precision': 0.9540171803941384, 'eval_f1': 0.9615482556659027, 'eval_tn': 332, 'eval_fp': 69, 'eval_fn': 82, 'eval_tp': 1866})
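
The `predictions` array holds raw logits, one column per class. A minimal sketch of turning a logit pair into class probabilities follows, using the first row printed above. The `softmax` function here is a small standard-library stand-in for `scipy.special.softmax`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# First row of `predictions.predictions` shown above.
probs = softmax([2.49439, -1.1896064])
predicted_class = probs.index(max(probs))

print(f"P(Not Recommended) = {probs[0]:.3f}, P(Recommended) = {probs[1]:.3f}")
print("Predicted class:", predicted_class)  # 0, which matches the first entry of label_ids
```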

In [None]:
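# Compare the true labels with the predicted classes (argmax over the logits).
# Note: `predictions[1]` holds the label IDs, not the model's predictions.
y_test = X_test["Recommended IND"]
y_pred = np.argmax(predictions.predictions, axis=1)
print("Confusion Matrix is \n", confusion_matrix(y_test, y_pred))

Confusion Matrix is 
 [[ 332   69]
 [  82 1866]]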
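
The `eval_tn`, `eval_fp`, `eval_fn`, and `eval_tp` counts reported by `trainer.predict` are taken at the default argmax decision, so accuracy, precision, recall, and F1 at that operating point can be recomputed by hand. The sketch below does so with a hypothetical helper, `metrics_from_counts`. The results differ slightly from `eval_precision` and `eval_recall`, which `classification_metrics` reports at the F1-optimal threshold instead.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Derive standard binary-classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts copied from the test-set metrics printed by `trainer.predict`.
test_metrics = metrics_from_counts(tn=332, fp=69, fn=82, tp=1866)
for name, value in test_metrics.items():
    print(f"{name}: {value:.4f}")
```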


In [None]:
# Load the TensorBoard Notebook Extension.
%load_ext tensorboard

%tensorboard --logdir ./logs/runs --port=6006

## **References:**

> [**Research Journal - BERT with Categorical Features**](https://docs.google.com/document/d/1NGBtUurxT4COhbq_2g50YZxqmTNJPtsoYxC-FJcUUKs/edit#)

> [**Combining Categorical and Numerical Features with Text in BERT - Chris McCormick**](https://mccormickml.com/2021/06/29/combining-categorical-numerical-features-with-bert/)