# 03-wine_reviews_transformers_huggingface

# 1) Fine-tuning DistilBERT

Fine-tuning the [DistilBERT](https://huggingface.co/distilbert-base-uncased) model from Hugging Face.

In [31]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Preprocessing

In [32]:
df = pd.read_csv('data/wine-reviews.csv')
df = df[['description', 'points']]
df = df.drop_duplicates().reset_index(drop=True)

In [33]:
def points_binning(points):
    # Map a review score (80-100) to one of four ordinal bins (1-4).
    if 80 <= points <= 84:
        return 1
    elif 85 <= points <= 89:
        return 2
    elif 90 <= points <= 94:
        return 3
    elif 95 <= points <= 100:
        return 4
    # Scores outside 80-100 fall through and return None.
    
df['points'] = df['points'].apply(points_binning)
df

| | description | points |
|---|---|---|
| 0 | This tremendous 100% varietal wine hails from ... | 4 |
| 1 | Ripe aromas of fig, blackberry and cassis are ... | 4 |
| 2 | Mac Watson honors the memory of a wine once ma... | 4 |
| 3 | This spent 20 months in 30% new French oak, an... | 4 |
| 4 | This is the top wine from La Bégude, named aft... | 4 |
| ... | ... | ... |
| 97826 | A Syrah-Grenache blend that's dry and rustical... | 1 |
| 97827 | Oreo eaters will enjoy the aromas of this wine... | 1 |
| 97828 | Outside of the vineyard, wines like this are w... | 1 |
| 97829 | Heavy and basic, with melon and pineapple arom... | 1 |
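The same binning can be written as a single vectorized call. This is a minimal sketch, assuming every score is an integer in the 80-100 range, as in this dataset; `bin_points` is a hypothetical helper name that does not appear in the notebook.

```python
import pandas as pd


def bin_points(points: pd.Series) -> pd.Series:
    """Map scores in 80-100 to bins 1-4, mirroring points_binning."""
    # Right-closed intervals: (79, 84], (84, 89], (89, 94], (94, 100].
    edges = [79, 84, 89, 94, 100]
    binned = pd.cut(points, bins=edges, labels=[1, 2, 3, 4])
    # pd.cut returns a categorical; cast back to plain integers.
    return binned.astype(int)


scores = pd.Series([80, 84, 85, 89, 90, 94, 95, 100])
print(bin_points(scores).tolist())
```

Scores outside 80-100 would become missing values and make the integer cast fail, so this version is only a drop-in replacement when the range assumption holds.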


In [34]:
df.to_csv('data/wine-reviews-preprocessed.csv', index=False)

In [35]:
df = df.rename(columns={'description': 'text', 'points': 'label'})
X = df['text']
y = df['label'] - 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

In [36]:
train = pd.concat([X_train, y_train], axis=1)
test = pd.concat([X_test, y_test], axis=1)
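Because the labels are shifted to start at zero (`df['label'] - 1`), the class ids in later reports no longer match the bin numbers. A small lookup can make them readable again. The names `ID_TO_RANGE` and `describe_class` are hypothetical and only serve this illustration.

```python
# Zero-based class ids (bin - 1) mapped back to the original score ranges.
ID_TO_RANGE = {0: "80-84", 1: "85-89", 2: "90-94", 3: "95-100"}


def describe_class(class_id: int) -> str:
    """Return the score range for a zero-based class id."""
    return ID_TO_RANGE[class_id]


print([describe_class(i) for i in range(4)])
```

The same strings could be passed to `classification_report` through its `target_names` argument to label the rows of the report.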

## Tokenizing

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [8]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

In [9]:
from datasets import Dataset

train = Dataset.from_pandas(train)
test = Dataset.from_pandas(test)

tokenized_train = train.map(preprocess_function, batched=True)
tokenized_test = test.map(preprocess_function, batched=True)

  0%|          | 0/79 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

## Training

In [10]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
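Since the tokenizer call above truncates but does not pad, padding is left to the collator, which by default pads each batch to its longest sequence. The idea can be sketched in plain Python. This is an illustration of dynamic padding, not the library's implementation; `pad_batch` is a hypothetical name, the token ids are made up, and `pad_id=0` is an assumption.

```python
from typing import Dict, List


def pad_batch(sequences: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad token-id lists to the longest sequence in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[11, 22, 33], [11, 33]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a global maximum keeps batches of short reviews small, which is the usual motivation for this design.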

In [42]:
from transformers import create_optimizer
import tensorflow as tf

batch_size = 16
num_epochs = 5
batches_per_epoch = len(tokenized_train) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
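The step arithmetic above can be checked by hand. The sketch below recomputes the total number of training steps and the learning rate at a given step, assuming that with zero warmup steps the schedule decays linearly from `init_lr` to zero. The function names and the example size of 1000 rows are hypothetical.

```python
def total_train_steps(n_examples: int, batch_size: int, num_epochs: int) -> int:
    """Number of optimizer steps, dropping the last partial batch per epoch."""
    return (n_examples // batch_size) * num_epochs


def linear_decay_lr(step: int, init_lr: float, total_steps: int) -> float:
    """Learning rate after `step` updates under linear decay to zero."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return init_lr * remaining


steps = total_train_steps(n_examples=1000, batch_size=16, num_epochs=5)
print(steps)  # 62 batches per epoch * 5 epochs = 310
print(linear_decay_lr(step=155, init_lr=2e-5, total_steps=steps))
```

Under this assumption, training for fewer epochs than `num_epochs` stops before the learning rate has decayed to zero.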

In [94]:
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=4)

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_layer_norm', 'activation_13', 'vocab_transform', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['dropout_39', 'classifier', 'pre_classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

In [12]:
tf_train_set = model.prepare_tf_dataset(
    tokenized_train,
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_test_set = model.prepare_tf_dataset(
    tokenized_test,
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [96]:
model.compile(optimizer=optimizer,
              metrics=['acc'])

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


In [98]:
# model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)
# Trained on a virtual machine

## Evaluation

In [11]:
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("directtt/wine-reviews-distilbert")

All model checkpoint layers were used when initializing TFDistilBertForSequenceClassification.

All the layers of TFDistilBertForSequenceClassification were initialized from the model checkpoint at directtt/wine-reviews-distilbert.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [32]:
pred = model.predict(tf_test_set)

In [135]:
print(classification_report(y_test, np.argmax(pred.logits, axis=1)))

              precision    recall  f1-score   support

           0       0.74      0.71      0.72      2924
           1       0.78      0.82      0.80     10187
           2       0.77      0.76      0.76      6074
           3       0.76      0.33      0.46       382

    accuracy                           0.77     19567
   macro avg       0.76      0.65      0.69     19567
weighted avg       0.77      0.77      0.77     19567



+0.03 boost over the LSTM models.
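The gap between the macro and weighted averages in the report comes from class imbalance: class 3 has only 382 test examples and a recall of 0.33. The sketch below recomputes both averages for recall from the rounded per-class values in the report, so the last digit can differ slightly from what scikit-learn prints from unrounded values.

```python
# Per-class recall and support copied from the report above.
recall = [0.71, 0.82, 0.76, 0.33]
support = [2924, 10187, 6074, 382]

# Macro average: every class counts equally.
macro_recall = sum(recall) / len(recall)

# Weighted average: each class counts in proportion to its support.
weighted_recall = sum(r * s for r, s in zip(recall, support)) / sum(support)

print(round(macro_recall, 3), round(weighted_recall, 3))
```

The weighted recall is dominated by classes 1 and 2, which is why it stays near the overall accuracy while the macro recall is pulled down by class 3.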

## Evaluation from pipeline (this takes a bit more time...)

In [7]:
from transformers import pipeline
from tqdm import tqdm

model = pipeline("text-classification",
                 model="directtt/wine-reviews-distilbert")

All model checkpoint layers were used when initializing TFDistilBertForSequenceClassification.

All the layers of TFDistilBertForSequenceClassification were initialized from the model checkpoint at directtt/wine-reviews-distilbert.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [9]:
y_pred = []

for text in tqdm(X_test):
    y_pred.append(int(model(text)[0]['label'][-1])) # parsing from 'label_X' to X

100%|████████████████████████████████████████████████████████████████████████████| 19567/19567 [47:54<00:00,  6.81it/s]
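The loop above calls the pipeline once per review and reads only the last character of the label. A hedged alternative is sketched below, assuming labels of the form `LABEL_<id>` and a classifier that, like a text-classification pipeline, accepts a list of strings together with `batch_size` and `truncation` arguments. `parse_label` and `predict_labels` are hypothetical helper names.

```python
from typing import Callable, List


def parse_label(label: str) -> int:
    """Extract the class id from a label such as 'LABEL_3'."""
    # Splitting on the last underscore also handles ids with several digits.
    return int(label.rsplit("_", 1)[-1])


def predict_labels(classifier: Callable, texts: List[str], batch_size: int = 32) -> List[int]:
    """Classify all texts in one call and return integer class ids."""
    # Assumption: `classifier` behaves like a text-classification pipeline,
    # i.e. it accepts a list of strings and returns one dict per input.
    outputs = classifier(texts, batch_size=batch_size, truncation=True)
    return [parse_label(out["label"]) for out in outputs]
```

With the pipeline from this section, usage would look like `y_pred = predict_labels(model, list(X_test))`. Whether batching shortens the run depends on the hardware.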


In [11]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.74      0.71      0.72      2924
           1       0.78      0.82      0.80     10187
           2       0.77      0.76      0.76      6074
           3       0.76      0.33      0.46       382

    accuracy                           0.77     19567
   macro avg       0.76      0.65      0.69     19567
weighted avg       0.77      0.77      0.77     19567



# 2) Fine-tuning RoBERTa

Fine-tuning the [RoBERTa](https://huggingface.co/roberta-base) model from Hugging Face.

The preprocessing, tokenizing, and training steps are the same as before; the model was trained on a virtual machine. <br>
The conversion from a raw pandas DataFrame to a transformer-ready TensorFlow dataset is defined in an external function.
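The external function itself is not shown in this notebook. The sketch below is a hypothetical stand-in that chains the same steps used in the DistilBERT section; the name `df_to_tf_dataset_sketch` and the `batch_size` and `shuffle` parameters are assumptions, not the author's actual implementation.

```python
def df_to_tf_dataset_sketch(df, tokenizer, model, batch_size=16, shuffle=False):
    """Hypothetical stand-in for the external df_to_tf_dataset helper."""
    # Imported lazily so the function can be defined without the libraries.
    from datasets import Dataset
    from transformers import DataCollatorWithPadding

    def tokenize(examples):
        # Truncate long reviews to the model's maximum input length.
        return tokenizer(examples["text"], truncation=True)

    dataset = Dataset.from_pandas(df).map(tokenize, batched=True)
    collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
    return model.prepare_tf_dataset(
        dataset, shuffle=shuffle, batch_size=batch_size, collate_fn=collator
    )
```

It expects a DataFrame with `text` and `label` columns, matching the frames built in the preprocessing step.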

In [14]:
from transformers import TFAutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("directtt/wine-reviews-roberta")
model = TFAutoModelForSequenceClassification.from_pretrained("directtt/wine-reviews-roberta")

All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

All the layers of TFRobertaForSequenceClassification were initialized from the model checkpoint at directtt/wine-reviews-roberta.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.


In [15]:
from scripts.transformers.df_to_tf_dataset import df_to_tf_dataset

tf_test_set = df_to_tf_dataset(df=test, tokenizer=tokenizer, model=model)

  0%|          | 0/20 [00:00<?, ?ba/s]

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


## Evaluation

In [21]:
y_pred = model.predict(tf_test_set)



In [22]:
print(classification_report(y_test, np.argmax(y_pred.logits, axis=1)))

              precision    recall  f1-score   support

           0       0.74      0.72      0.73      2924
           1       0.79      0.81      0.80     10187
           2       0.77      0.74      0.75      6074
           3       0.54      0.54      0.54       382

    accuracy                           0.77     19567
   macro avg       0.71      0.70      0.71     19567
weighted avg       0.77      0.77      0.77     19567



# 3) Fine-tuning GPT-2

Fine-tuning the [GPT-2](https://huggingface.co/gpt2) model from Hugging Face.

The preprocessing, tokenizing, and training steps are the same as before; the model was trained on a virtual machine. <br>
The conversion from a raw pandas DataFrame to a transformer-ready TensorFlow dataset is defined in an external function.
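One step usually does differ for GPT-2: its tokenizer ships without a padding token, so one has to be assigned before batches can be padded. The training script is not shown here, so the sketch below is an assumption about what such a step might look like; `use_eos_as_pad` is a hypothetical helper.

```python
def use_eos_as_pad(tokenizer, model):
    """Reuse the end-of-sequence token as the padding token."""
    # GPT-2's tokenizer ships without a pad token, so batches cannot be
    # padded until one is assigned.
    tokenizer.pad_token = tokenizer.eos_token
    # The classification head uses pad_token_id to locate the last real token.
    model.config.pad_token_id = tokenizer.eos_token_id
    return tokenizer, model
```

The helper would be called once, right after loading the tokenizer and model and before building the TensorFlow datasets.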

## Evaluation

In [21]:
from transformers import pipeline
from tqdm import tqdm

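# NOTE: this loads the DistilBERT checkpoint rather than a GPT-2 one,
# so the report below reproduces the DistilBERT results.
model = pipeline("text-classification",
                 model="directtt/wine-reviews-distilbert")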

Some layers from the model checkpoint at directtt/wine-reviews-distilbert were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at directtt/wine-reviews-distilbert and are newly initialized: ['dropout_151']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
y_pred = []

for text in tqdm(X_test):
    y_pred.append(int(model(text)[0]['label'][-1])) # parsing from 'label_X' to X

100%|████████████████████████████████████████████████████████████████████████████| 19567/19567 [43:46<00:00,  7.45it/s]


In [10]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.74      0.71      0.72      2924
           1       0.78      0.82      0.80     10187
           2       0.77      0.76      0.76      6074
           3       0.76      0.33      0.46       382

    accuracy                           0.77     19567
   macro avg       0.76      0.65      0.69     19567
weighted avg       0.77      0.77      0.77     19567

