## Imports, Installations and Constants

In [None]:
!pip install transformers
!pip install datasets

Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.11.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Fou

In [None]:
import pandas as pd
import tensorflow as tf
import transformers
from transformers import AutoTokenizer, AutoModel, AutoConfig, AutoModelForSequenceClassification, EvalPrediction, GlueDataset
from transformers import ConvBertTokenizer
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from tensorflow.keras.utils import to_categorical

pd.set_option('display.max_colwidth', None)
BATCH_SIZE = 16
N_EPOCHS = 3  # can be increased: evaluation still shows a large difference in loss even at accuracy 1.0

## A common data set (with source text, preprocessed text, new features, and labels) before text-to-sequence transformation

We take the column with the raw (not preprocessed) text for a pure experiment with a Hugging Face DistilBERT model.

In [None]:
# test = pd.read_csv('drugsComTest_raw.tsv', sep='\t')
# train = pd.read_csv('drugsComTrain_raw.tsv', sep='\t')
# df = pd.concat([train,test])

loaded_df = pd.read_csv('drugsComTrain_raw.tsv', sep='\t')
# df = loaded_df[['review', 'rating']]
# .copy() makes df an independent frame, so adding a column below
# does not trigger pandas' SettingWithCopyWarning.
df = loaded_df[:10000].copy()


def get_sentiment(rating):
  """Map a numeric rating to 'neg' (< 4), 'neutral' (4 to 7) or 'pos' (> 7)."""
  if rating < 4.0:
    return 'neg'
  elif rating <= 7.0:
    return 'neutral'
  else:
    return 'pos'

df['sentiment'] = df['rating'].map(get_sentiment)
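To make the bucket boundaries concrete, here is a small standalone sketch of the same thresholds applied to a few ratings. The sample ratings are hypothetical, and the 1 to 10 scale is an assumption about the data set rather than something this notebook checks.

```python
def get_sentiment(rating):
    """Map a rating to a sentiment label using the notebook's thresholds."""
    if rating < 4.0:
        return 'neg'
    elif rating <= 7.0:
        return 'neutral'
    return 'pos'


# Hypothetical ratings chosen to sit on and around the two boundaries (4 and 7).
sample_ratings = [1.0, 3.0, 4.0, 7.0, 8.0, 10.0]
sample_labels = [get_sentiment(r) for r in sample_ratings]
print(list(zip(sample_ratings, sample_labels)))
```

Both boundary values, 4 and 7, fall into the `neutral` bucket, so `pos` starts strictly above 7.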


The label encoding below follows https://www.sunnyville.ai/fine-tuning-distilbert-multi-class-text-classification-using-transformers-and-tensorflow/

In [None]:
encode_method = 'onehot'


if encode_method == 'encode':
    print('encode_method: ', encode_method)
    df['encoded_sent'] = df['sentiment'].astype('category').cat.codes

    data_texts = df["review"].to_list() # Features (not tokenized yet)
    data_labels = df["encoded_sent"].to_list() # Labels
    X_train, X_test, y_train, y_test = train_test_split(data_texts, data_labels, test_size=0.3, random_state=1)


    print(X_train[:10])
elif encode_method == 'onehot':
    print('encode_method: ', encode_method)
    # encode class names to integers
    labelencoder = preprocessing.LabelEncoder()
    labels = labelencoder.fit_transform(df['sentiment'])

    cat_labels = to_categorical(labels)

    X_train, X_test, y_train, y_test = train_test_split(df['review'], cat_labels, random_state=1)

    X_train = X_train.to_list()
    X_test = X_test.to_list()

encode_method:  onehot


In [None]:
# print(labels)
# cat_labels.shape
# df.head()
labelencoder.classes_

array(['neg', 'neutral', 'pos'], dtype=object)
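The `onehot` branch relies on two steps: integer codes assigned in sorted class order, then one one-hot row per code. The following pure-Python sketch mimics both steps on a hypothetical list of sentiments. It illustrates the idea only and is not the scikit-learn or Keras implementation.

```python
sentiments = ['pos', 'neg', 'neutral', 'pos']  # hypothetical sentiment column

# Integer codes follow the sorted class order, as in labelencoder.classes_ above.
classes = sorted(set(sentiments))
class_to_index = {name: i for i, name in enumerate(classes)}
labels = [class_to_index[s] for s in sentiments]

# One-hot encoding: a row of zeros with a single 1.0 at the class index.
one_hot = [[1.0 if i == label else 0.0 for i in range(len(classes))]
           for label in labels]

# Decoding goes the other way: index of the largest entry, then the class name.
decoded = [classes[row.index(max(row))] for row in one_hot]

print(classes)
print(labels)
print(one_hot)
```

Decoding with the index of the largest entry is the same operation the evaluation cells later perform with `np.argmax`.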

# ConvBERT

## Training

In [None]:
# # EXAMPLE
# tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')   # switch to MODEL_NAME
# train_encodings = tokenizer(train_texts, truncation=True, padding=True)
# val_encodings = tokenizer(val_texts, truncation=True, padding=True)

tokenizer_convbert = ConvBertTokenizer.from_pretrained("YituTech/conv-bert-base")

train_encodings_convbert = tokenizer_convbert(X_train, truncation=True, padding=True)
test_encodings_convbert = tokenizer_convbert(X_test, truncation=True, padding=True)
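Conceptually, `truncation=True` caps each sequence at a maximum length and `padding=True` pads the rest of the batch up to the longest remaining sequence, with an attention mask marking the real tokens. The sketch below imitates that on hypothetical token ids. `pad_and_truncate` is an illustrative helper, not part of transformers, and it ignores details such as keeping the final special token when truncating.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids))
                      for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for three reviews of different lengths.
batch = [[101, 7, 8, 102], [101, 5, 102], [101, 1, 2, 3, 4, 5, 6, 102]]
encoded = pad_and_truncate(batch, max_length=6)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

In this notebook the longest reviews exceed the model limit, which is why every encoded example ends up with length 512 in the dataset spec shown further down.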

In [None]:
train_dataset_convbert = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings_convbert),
    y_train
))
val_dataset_convbert = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings_convbert),
    y_test
)) 

In [None]:
train_dataset_convbert

<TensorSliceDataset element_spec=({'input_ids': TensorSpec(shape=(512,), dtype=tf.int32, name=None), 'token_type_ids': TensorSpec(shape=(512,), dtype=tf.int32, name=None), 'attention_mask': TensorSpec(shape=(512,), dtype=tf.int32, name=None)}, TensorSpec(shape=(3,), dtype=tf.float32, name=None))>

In [None]:
from transformers import TFConvBertForSequenceClassification

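# The task is single-label with three classes; problem_type="multi_label_classification"
# does not match it and is left as used in this run.
model_convbert = TFConvBertForSequenceClassification.from_pretrained("YituTech/conv-bert-base", problem_type="multi_label_classification", num_labels=3)

# 5e-8 is the value used for this run. It is far smaller than the 5e-5 (0.00005)
# commonly used for fine-tuning, so the weights change very little in 3 epochs.
learning_rate = 5e-8
optimizer_convbert = tf.keras.optimizers.Adam(learning_rate=learning_rate)
# Cross-entropy between the one-hot labels and the predictions. The model outputs raw logits,
# while CategoricalCrossentropy() expects probabilities unless from_logits=True is passed.
loss_convbert = tf.keras.losses.CategoricalCrossentropy()
model_convbert.compile(optimizer=optimizer_convbert, loss=loss_convbert, metrics=['accuracy'])
# model_convbert.compile(optimizer=optimizer_convbert, loss=loss_convbert, metrics=['categorical_accuracy'])

# model.compile(optimizer=optimizer, loss=model.compute_loss, metrics=['accuracy'])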

All model checkpoint layers were used when initializing TFConvBertForSequenceClassification.

Some layers of TFConvBertForSequenceClassification were not initialized from the model checkpoint at YituTech/conv-bert-base and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
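The classification head returns raw logits, so a cross-entropy loss has to apply a softmax first; in Keras that corresponds to `tf.keras.losses.CategoricalCrossentropy(from_logits=True)`. The sketch below works through that computation for one example in plain Python. The logits and the label are hypothetical values, and the helper functions are illustrative only.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_label, probabilities):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probabilities))


logits = [2.0, 0.5, -1.0]  # hypothetical raw model outputs for one review
label = [1.0, 0.0, 0.0]    # hypothetical one-hot label (class 0, 'neg')

probs = softmax(logits)
loss = categorical_crossentropy(label, probs)
print(probs, loss)
```

Feeding the logits straight into the cross-entropy, without the softmax step, would treat negative scores and values above 1 as if they were probabilities.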


In [None]:
len(X_train)

7500

In [None]:
BATCH_SIZE = 8

# model_convbert.fit(train_dataset_convbert.shuffle(len(X_train)).batch(BATCH_SIZE), 
#           epochs=N_EPOCHS,
#           batch_size=BATCH_SIZE)

# The dataset is already batched, so fit() does not need a batch_size argument.
model_convbert.fit(train_dataset_convbert.batch(BATCH_SIZE),
          epochs=N_EPOCHS)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fcc8daeb210>
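As a quick sanity check on the size of this run, the number of optimizer steps follows directly from the values used above (7500 training examples, `BATCH_SIZE = 8`, `N_EPOCHS = 3`). This is plain arithmetic, not output from Keras.

```python
import math

n_train = 7500   # len(X_train) reported above
batch_size = 8   # BATCH_SIZE used for this run
n_epochs = 3     # N_EPOCHS

# The last batch of an epoch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * n_epochs
print(steps_per_epoch, total_steps)
```

Note also that the commented-out call shuffles the dataset before batching, while the call that was actually run feeds the examples in their original order every epoch.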

## Save the zipped model to my Google Drive

In [None]:
model_save_name = "04-08-model_convbert_v3"

In [None]:
# SAVE MODEL  ( https://huggingface.co/docs/transformers/main_classes/model )
model_convbert.save_pretrained(model_save_name)
!zip -r 04-08-model_convbert_v3.zip 04-08-model_convbert_v3

  adding: 04-08-model_convbert_v3/ (stored 0%)
  adding: 04-08-model_convbert_v3/config.json (deflated 52%)
  adding: 04-08-model_convbert_v3/tf_model.h5 (deflated 7%)


In [None]:
from google.colab import drive
drive.mount('/content/drive')

# with open('/content/drive/My Drive/NLP Models/foo.txt', 'w') as f:
#   f.write('Hello Google Drive!')
# !cat /content/drive/My\ Drive/NLP\ Models/foo.txt

!cp 04-08-model_convbert_v3.zip "/content/drive/My Drive/04-08-model_convbert_v3.zip"

Mounted at /content/drive


In [None]:
# Download to my local computer

from google.colab import files
files.download('04-08-model_convbert_v3.zip')


## Evaluate loaded model

In [None]:
loaded_model = TFConvBertForSequenceClassification.from_pretrained("04-08-model_convbert_v3", problem_type="multi_label_classification", num_labels=3)   # switch to MODEL_NAME

All model checkpoint layers were used when initializing TFConvBertForSequenceClassification.

All the layers of TFConvBertForSequenceClassification were initialized from the model checkpoint at 04-08-model_convbert_v3.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFConvBertForSequenceClassification for predictions without further training.


### with predict_proba function

In [None]:
# MAX_LEN = X_train.apply(lambda s: len([x for x in s.split()])).max()

MAX_LEN = max([len(x) for x in X_train])
MAX_LEN

2436
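Note that `len(x)` on a string counts characters, whereas the tokenizer's `max_length` is measured in tokens, so 2436 is not a token length. The short sketch below contrasts the two counts on an excerpt from one of the reviews; whitespace splitting is only a rough stand-in for real subword tokenization.

```python
# Excerpt from a review in the test file used later in this notebook.
review = "My doctor suggested and changed me onto 45mg mirtazapine"

n_chars = len(review)          # what MAX_LEN above measures
n_words = len(review.split())  # rough proxy for the number of tokens

print(n_chars, n_words)
```

A subword tokenizer usually produces somewhat more tokens than there are whitespace-separated words, but still far fewer than the character count.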

In [None]:
def predict_proba(text_list, model, tokenizer):
  """
  Return the predicted class probabilities for each review in the list.
  Columns follow labelencoder.classes_: 0 - neg, 1 - neutral, 2 - pos.
  :param text_list: list[str]
  :param model: fine-tuned TFConvBertForSequenceClassification
  :param tokenizer: ConvBertTokenizer
  :return res: numpy.ndarray of shape (len(text_list), 3)
  """

  # MAX_LEN above counts characters, not tokens, so it is not passed as max_length;
  # truncation=True alone cuts inputs at the tokenizer's model maximum.
  encodings = tokenizer(text_list, truncation=True, padding=True)
  dataset = tf.data.Dataset.from_tensor_slices((dict(encodings)))
  preds = model.predict(dataset.batch(1)).logits
  res = tf.nn.softmax(preds, axis=1).numpy()

  return res

In [None]:
string1 = ['This helped a lot.']

predict_proba(string1, model_convbert, tokenizer_convbert)

array([[0.32198203, 0.31275803, 0.36525995]], dtype=float32)

In [None]:
predict_proba(string1, loaded_model, tokenizer_convbert)

array([[0.32198203, 0.31275803, 0.36525995]], dtype=float32)
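To read such an output, pair each probability with its class name in the order given by `labelencoder.classes_`. The sketch below does this for the probabilities printed above; the interpretation in the comments is a reading of those numbers, not an additional result.

```python
classes = ['neg', 'neutral', 'pos']            # order of labelencoder.classes_
probs = [0.32198203, 0.31275803, 0.36525995]   # output above for 'This helped a lot.'

best_index = max(range(len(probs)), key=probs.__getitem__)
predicted = classes[best_index]

# Distance of the winning probability from a uniform guess (1/3 per class).
margin_over_uniform = probs[best_index] - 1 / len(classes)

print(predicted, round(margin_over_uniform, 3))
```

All three probabilities sit close to 1/3, which suggests the classifier head has barely moved away from its initial state.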

In [None]:
# string2 = ['i felt sick after 2 days']
string2 = ['this was bad']


predict_proba(string2, model_convbert, tokenizer_convbert)



array([[0.3211965 , 0.31004646, 0.36875707]], dtype=float32)

In [None]:
predict_proba(string2, loaded_model, tokenizer_convbert)



array([[0.3211965 , 0.31004646, 0.36875707]], dtype=float32)

### with encoded data

In [None]:
# Load test data
loaded_test_df = pd.read_csv('drugsComTest_raw.tsv', sep='\t')
# .copy() avoids pandas' SettingWithCopyWarning when the new column is added
test_df = loaded_test_df[:2000].copy()

test_df['sentiment'] = test_df['rating'].map(get_sentiment)


In [None]:
# Reuse the encoder fitted on the training data so the class indices stay consistent
labels = labelencoder.transform(test_df['sentiment'])

cat_labels = to_categorical(labels, num_classes=len(labelencoder.classes_))

X_test_final = test_df['review']
X_test_final = X_test_final.to_list()

y_test_final = cat_labels

In [None]:
# loaded_test_df.head()
X_test_final[0]

'"I&#039;ve tried a few antidepressants over the years (citalopram, fluoxetine, amitriptyline), but none of those helped with my depression, insomnia &amp; anxiety. My doctor suggested and changed me onto 45mg mirtazapine and this medicine has saved my life. Thankfully I have had no side effects especially the most common - weight gain, I&#039;ve actually lost alot of weight. I still have suicidal thoughts but mirtazapine has saved me."'
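The review text contains HTML entities such as `&#039;` and `&amp;`. This notebook deliberately feeds the raw text to the model, but if you wanted to decode them first, the standard library is enough. The string below is a hypothetical example written in the style of the review above.

```python
import html

# Hypothetical string in the style of the raw reviews.
raw = "I&#039;ve tried a few antidepressants &amp; none helped"

clean = html.unescape(raw)
print(clean)
```

Applied to a pandas column, the same function could be passed to `Series.map`.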

In [None]:
# tokenizer_convbert = ConvBertTokenizer.from_pretrained("YituTech/conv-bert-base")

final_test_encodings_convbert = tokenizer_convbert(X_test_final, truncation=True, padding=True)
final_test_dataset_convbert = tf.data.Dataset.from_tensor_slices((
    dict(final_test_encodings_convbert),
    y_test_final
))

final_test_dataset_convbert

<TensorSliceDataset element_spec=({'input_ids': TensorSpec(shape=(512,), dtype=tf.int32, name=None), 'token_type_ids': TensorSpec(shape=(512,), dtype=tf.int32, name=None), 'attention_mask': TensorSpec(shape=(512,), dtype=tf.int32, name=None)}, TensorSpec(shape=(3,), dtype=tf.float32, name=None))>

In [None]:
val_dataset_convbert

<TensorSliceDataset element_spec=({'input_ids': TensorSpec(shape=(512,), dtype=tf.int32, name=None), 'token_type_ids': TensorSpec(shape=(512,), dtype=tf.int32, name=None), 'attention_mask': TensorSpec(shape=(512,), dtype=tf.int32, name=None)}, TensorSpec(shape=(3,), dtype=tf.float32, name=None))>

In [None]:
loaded_model.compile(optimizer_convbert, loss_convbert)

In [None]:
loaded_model.metrics_names

[]

More information on evaluating the model:
https://swatimeena989.medium.com/bert-text-classification-using-keras-903671e0207d#a1bb

#### evaluate with val data

In [None]:
# Get predictions on the validation set
y_pred = loaded_model.predict(val_dataset_convbert.batch(16))
# Per-class probabilities, shape (n_samples, 3)
y_pred_proba = tf.nn.softmax(y_pred.logits, axis=1).numpy()
# Predicted class index per review (0 - neg, 1 - neutral, 2 - pos)
y_pred_label = y_pred_proba.argmax(axis=1)



In [None]:
import numpy as np

In [None]:
np.shape(y_pred.logits)
pred_labels = np.argmax(y_pred.logits, axis=1)
# pred_labels = np.argmax(y_pred, axis=1)
pred_labels

array([2, 2, 2, ..., 2, 2, 2])

In [None]:
y_test_labels = np.argmax(y_test, axis=1)
y_test_labels

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(y_test_labels,pred_labels,target_names=labelencoder.classes_))


              precision    recall  f1-score   support

         neg       0.00      0.00      0.00       550
     neutral       0.00      0.00      0.00       443
         pos       0.60      1.00      0.75      1507

    accuracy                           0.60      2500
   macro avg       0.20      0.33      0.25      2500
weighted avg       0.36      0.60      0.45      2500
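The report can be reproduced by hand from the support column, because every review was predicted as `pos`. The sketch below uses only the counts printed above and should be read as a check of the arithmetic, not as a new evaluation.

```python
support = {'neg': 550, 'neutral': 443, 'pos': 1507}  # support column of the report above
total = sum(support.values())

# Every review was predicted as 'pos', so only that class has true positives.
precision_pos = support['pos'] / total   # correct 'pos' predictions / all predictions
recall_pos = 1.0                         # every true 'pos' review was found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

accuracy = support['pos'] / total
macro_f1 = (0.0 + 0.0 + f1_pos) / 3              # unweighted mean over the classes
weighted_f1 = f1_pos * support['pos'] / total    # mean weighted by support

print(round(accuracy, 2), round(f1_pos, 2), round(macro_f1, 2), round(weighted_f1, 2))
```

The 0.60 accuracy is therefore exactly the share of the majority class, which is the score a constant `pos` prediction would reach.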





#### evaluate with test data

In [None]:
# Get predictions on the test set
y_final_pred = loaded_model.predict(final_test_dataset_convbert.batch(16))
# Per-class probabilities, shape (n_samples, 3)
y_final_pred_proba = tf.nn.softmax(y_final_pred.logits, axis=1).numpy()
# Predicted class index per review (0 - neg, 1 - neutral, 2 - pos)
y_final_pred_label = y_final_pred_proba.argmax(axis=1)

In [None]:
np.shape(y_final_pred.logits)
final_pred_labels = np.argmax(y_final_pred.logits, axis=1)
# pred_labels = np.argmax(y_pred, axis=1)
final_pred_labels

array([2, 2, 2, ..., 2, 2, 2])

In [None]:
y_test_final_labels = np.argmax(y_test_final, axis=1)
y_test_final_labels

array([2, 2, 2, ..., 2, 0, 2])

In [None]:
print(classification_report(y_test_final_labels,final_pred_labels,target_names=labelencoder.classes_))

              precision    recall  f1-score   support

         neg       0.00      0.00      0.00       446
     neutral       0.00      0.00      0.00       348
         pos       0.60      1.00      0.75      1206

    accuracy                           0.60      2000
   macro avg       0.20      0.33      0.25      2000
weighted avg       0.36      0.60      0.45      2000





In [None]:

# # Evaluate the model
# from sklearn.metrics import (
#     confusion_matrix,
#     roc_auc_score,
#     average_precision_score,
# )

# # print("Confusion Matrix : ")
# # print(confusion_matrix(y_test, y_pred_label))

# print("ROC AUC score : ", round(roc_auc_score(y_test, y_pred_proba), 3))

# print("Average Precision score : ", round(average_precision_score(y_test, y_pred_proba), 3))
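The commented-out metrics above are written for a binary problem with a single probability column. For three classes, a confusion matrix is a simpler first look. The sketch below builds one in plain Python from hypothetical class indices; `sklearn.metrics.confusion_matrix` provides the same table for real label arrays once the one-hot labels have been converted with `argmax`.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


# Hypothetical class indices: 0 - neg, 1 - neutral, 2 - pos.
y_true = [0, 1, 2, 2, 0, 2]
y_pred = [2, 2, 2, 2, 2, 2]  # a model that always answers 'pos'

matrix = confusion_matrix(y_true, y_pred, n_classes=3)
for row in matrix:
    print(row)
```

A model that always predicts one class shows up as a single non-zero column.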


In [None]:
test_text = X_test_final[0]
test_text

'"I&#039;ve tried a few antidepressants over the years (citalopram, fluoxetine, amitriptyline), but none of those helped with my depression, insomnia &amp; anxiety. My doctor suggested and changed me onto 45mg mirtazapine and this medicine has saved my life. Thankfully I have had no side effects especially the most common - weight gain, I&#039;ve actually lost alot of weight. I still have suicidal thoughts but mirtazapine has saved me."'

In [None]:
predict_input = tokenizer_convbert.encode(test_text,
                                 truncation=True,
                                 padding=True,
                                 return_tensors="tf")

output = loaded_model(predict_input)[0]

prediction_value = tf.argmax(output, axis=1).numpy()[0]
prediction_value

2

# Not run after this point

In [None]:
# https://huggingface.co/docs/transformers/training

from datasets import load_metric
from transformers import TrainingArguments, Trainer

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)


training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

# Note: Trainer is the PyTorch training loop. It expects a PyTorch model and
# tokenized dataset objects, whereas model_convbert is a TensorFlow model and
# X_train / X_test are plain lists of strings.
trainer = Trainer(
    model=model_convbert,
    args=training_args,
    train_dataset=X_train,
    eval_dataset=X_test,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

***** Running training *****
  Num examples = 1400
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 525


TypeError: ignored
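Independently of the failing `Trainer` call, the accuracy that `compute_metrics` is meant to report is easy to compute directly. The sketch below does so in plain Python and assumes one-hot labels, as used elsewhere in this notebook. The logits and labels are hypothetical, and `compute_accuracy` is an illustrative helper rather than part of transformers or datasets.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


def compute_accuracy(logits, one_hot_labels):
    """Share of rows where the top-scoring class matches the one-hot label."""
    predictions = [argmax(row) for row in logits]
    references = [argmax(row) for row in one_hot_labels]
    correct = sum(p == r for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}


# Hypothetical logits and one-hot labels for three reviews.
logits = [[0.1, 0.2, 0.9], [1.5, 0.3, 0.2], [0.2, 0.1, 0.4]]
labels = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]

result = compute_accuracy(logits, labels)
print(result)
```

With one-hot labels the references need the same `argmax` step as the logits; comparing class indices against one-hot rows would not give a meaningful score.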

In [None]:
# model.fit(train_dataset.shuffle(1000).batch(16), epochs=3, batch_size=16,
#           validation_data=val_dataset.shuffle(1000).batch(16))
model.fit(train_dataset.shuffle(len(X_train)).batch(BATCH_SIZE), 
          epochs=N_EPOCHS,
          batch_size=BATCH_SIZE)

In [None]:
y_train[:10]

In [None]:
# from sklearn import preprocessing
# # from sklearn.preprocessing import OneHotEncoder
# from tensorflow.keras.utils import to_categorical
 
# # encode class names to integers
# labelencoder = preprocessing.LabelEncoder()
# y_train_encode_df = pd.DataFrame(labelencoder.fit_transform(y_train))
# y_test_encode_df = pd.DataFrame(labelencoder.fit_transform(y_test))
 
# y_train = to_categorical(y_train_encode_df)
# # ydev = to_categorical(labels_val.values)
# y_test = to_categorical(y_test_encode_df)


# # encoder = OneHotEncoder(handle_unknown='ignore')
# # y_train_encoder_df = pd.DataFrame(encoder.fit_transform(y_train).toarray())
# # y_train = df.join(y_train_encoder_df)


In [None]:
y_train

In [None]:
y_test

## check the shapes and split proportion 

In [None]:
X_train.shape, X_test.shape, y_train.shape

In [None]:
print('The proportion in y_train\n',y_train.value_counts(normalize=True).mul(100))
print('The proportion in y_test\n',y_test.value_counts(normalize=True).mul(100))
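`value_counts` works on a pandas Series of labels. When the labels are one-hot arrays, as in the ConvBERT part of this notebook, the class proportions are simply the column means. The sketch below shows this on a hypothetical one-hot list in plain Python.

```python
classes = ['neg', 'neutral', 'pos']

# Hypothetical one-hot labels for four reviews.
y_onehot = [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]

n = len(y_onehot)
# zip(*y_onehot) iterates over columns; each column sum is a class count.
proportions = {name: 100 * sum(column) / n
               for name, column in zip(classes, zip(*y_onehot))}
print(proportions)
```

With a NumPy array the same numbers come from `y.mean(axis=0) * 100`.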

## Preprocess

### Decode byte arrays into string representation. 

In [None]:
# X_train = X_train.apply(lambda x: str(x[0], 'utf-8'))
# X_test = X_test.apply(lambda x:  str(x[0], 'utf-8'))
# X_train[:3]

### Max sentence length

In [None]:
# Note: len(sent) counts characters, not words or tokens
sent_lens = [len(sent) for sent in X_train]
MAX_LEN = max(sent_lens)

In [None]:
# rename variables
X_train = train_texts
X_test = val_texts
y_train = train_labels
y_test = val_labels

test_dataset = val_dataset

In [None]:
# MAX_LEN = X_train.apply(lambda s: len([x for x in s.split()])).max()
# MAX_LEN

## Encode with DistilBertTokenizer

In [None]:
from transformers import DistilBertTokenizer

# Assumption: MODEL_NAME is never defined in this notebook; the earlier example
# cell uses 'distilbert-base-uncased', so that checkpoint is assumed here.
MODEL_NAME = 'distilbert-base-uncased'

# define a tokenizer object
tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)

# tokenize the text (padding to the longest sequence in the batch)
train_encodings = tokenizer(list(X_train.values), truncation=True, padding=True)
test_encodings = tokenizer(list(X_test.values), truncation=True, padding=True)

# print the first paragraph and its transformation
print(f'First paragraph: \'{X_train[:1]}\'')
print(f'Input ids: {train_encodings["input_ids"][0]}')
print(f'Attention mask: {train_encodings["attention_mask"][0]}')


## Length check

In [None]:
# pd.DataFrame(train_encodings["input_ids"]).hist();

In [None]:
len(train_encodings["attention_mask"][0]) #max len tokenized sentence - 362

In [None]:
len(train_encodings["input_ids"][0])

### Turn our labels and encodings into a tf.data.Dataset object

In [None]:
# train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings),
#                                                     list(y_train.values)))

# test_dataset = tf.data.Dataset.from_tensor_slices((dict(test_encodings),
#                                                     list(y_test.values)))

# y_train and y_test are one-hot encoded arrays
train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings),
                                                    list(y_train)))

test_dataset = tf.data.Dataset.from_tensor_slices((dict(test_encodings),
                                                    list(y_test)))

In [None]:
train_dataset

In [None]:
tf.data.experimental.cardinality(train_dataset)


## Fine-tuning with native TensorFlow


In [None]:
from transformers import TFDistilBertForSequenceClassification

model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME)

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
# loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)  # for integer labels
# The labels are one-hot encoded and the model returns raw logits,
# so CategoricalCrossentropy needs from_logits=True.
loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer,
              loss=loss,
              metrics=['accuracy'])

# Batching is done on the dataset itself, so fit() needs no batch_size argument.
model.fit(train_dataset.shuffle(len(X_train)).batch(BATCH_SIZE),
          epochs=N_EPOCHS)

In [None]:
model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME, problem_type="multi_label_classification")

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
# loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)  # for integer labels
# The labels are one-hot encoded and the model returns raw logits,
# so CategoricalCrossentropy needs from_logits=True.
loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer,
              loss=loss,
              metrics=['accuracy'])

In [None]:
model.fit(train_dataset.shuffle(len(X_train)).batch(BATCH_SIZE), 
          epochs=N_EPOCHS,
          batch_size=BATCH_SIZE)

## Model Evaluation

In [None]:
model.evaluate(test_dataset.shuffle(len(X_test)).batch(BATCH_SIZE), return_dict=True, batch_size=BATCH_SIZE)

## Predict on the different text examples

In [None]:
def predict_proba(text_list, model, tokenizer):
  """
  Return the predicted probabilities of the two classes
  (0 - instructions, 1 - ingredients) for each paragraph in the list of strings.
  :param text_list: list[str]
  :param model: transformers.models.distilbert.modeling_tf_distilbert.TFDistilBertForSequenceClassification
  :param tokenizer: transformers.models.distilbert.tokenization_distilbert.DistilBertTokenizer
  :return res: numpy.ndarray
  """

  # MAX_LEN is measured in characters, so it is not passed as max_length;
  # truncation=True alone cuts inputs at the tokenizer's model maximum (in tokens).
  encodings = tokenizer(text_list, truncation=True, padding=True)
  dataset = tf.data.Dataset.from_tensor_slices((dict(encodings)))
  preds = model.predict(dataset.batch(1)).logits
  res = tf.nn.softmax(preds, axis=1).numpy()

  return res

We take a text file [here](https://github.com/Galina-Blokh/ai_assignment_aidock/blob/refator/data/test_links.txt). It contains links to recipe pages that our model has not seen yet. Suppose you scraped the data from the first [url](https://www.loveandlemons.com/green-bean-salad-recipe/). The data you feed into the model for prediction then looks like the cell below (a list whose first string holds the ingredients, followed by three strings with instructions).

In [None]:
strings_list =["""
                  1 pound green beans, trimmed
                  ½ head radicchio, sliced into strips
                  Scant ¼ cup thinly sliced red onion
                  Honey Mustard Dressing, for drizzling
                  2 ounces goat cheese
                  2 tablespoons chopped walnuts
                  2 tablespoons sliced almonds
                  ¼ cup tarragon
                  Flaky sea salt
                  """,
                  """
                  Bring a large pot of salted water to a boil and set a bowl of ice water nearby.
                  Drop the green beans into the boiling water and blanch for 2 minutes.
                    Remove the beans and immediately immerse in the ice water long enough 
                    to cool completely, about 15 seconds. Drain and place on paper towels to dry.
                  """,
                  """
                  Transfer the beans to a bowl and toss with the radicchio, onion, 
                  and a few spoonfuls of the dressing.
                  """,
                  """
                  Arrange on a platter and top with small dollops of goat cheese, the walnuts, 
                  almonds, and tarragon. Drizzle with more dressing, season to taste with flaky 
                  salt, and serve.
                  """]
predict_proba(strings_list, model, tokenizer)

The predictive function returns an array of arrays. Each inner array contains the probabilities for classes 0 and 1 (i.e. for the instructions and ingredients labels). We got a pretty accurate model!

Even with a single paragraph as input, the model gives a very accurate answer (data from the [second line in the .txt document](https://github.com/Galina-Blokh/ai_assignment_aidock/blob/refator/data/test_links.txt), recipe page [url](https://www.loveandlemons.com/any-vegetable-vinegar-pickles/))

In [None]:
string1 = ["""
            any vegetables you like (I used cucumbers, broccoli, cauliflower, onions and radishes)
            fresh or dried spices (I used peppercorns, cumin, coriander, mustard seeds, & caraway)
            1 cup any kind of vinegar (I used white wine vinegar)
            1 cup filtered water
            1 tablespoon kosher or any non-iodized salt
            optional: 1 teaspoon sugar
            """]
predict_proba(string1, model, tokenizer)

In [None]:
string2 = ['Wash and cut up your vegetables and pack them into a clean jar.']

predict_proba(string2, model, tokenizer)

In [None]:
string3 = ['Add between ¼ - ½ teaspoon of whole dried spices.']

predict_proba(string3, model, tokenizer)

In [None]:
string4 = ['Combine vinegar, filtered water and salt in a medium saucepan and bring to a boil.']

predict_proba(string4, model, tokenizer)

In [None]:
string5 = ['Put your just boiled brine over the vegetables in the jar.']

predict_proba(string5, model, tokenizer)

In [None]:
string6 = ['Wipe any vinegar spills from the rim with a clean towel and put on the lid.']

predict_proba(string6, model, tokenizer)

In [None]:
string7 = ['Hide the jar in the back of the fridge for at least a week. Two weeks is better, three is best.']

predict_proba(string7, model, tokenizer)

In [None]:
string8 = ['Keep them in the fridge for up to 6 months.']

predict_proba(string8, model, tokenizer)

## Well, now you know all the steps needed to fine-tune the Hugging Face DistilBERT model with the TensorFlow API

## The end