<a href="https://colab.research.google.com/github/aakhterov/ML_projects/blob/master/news_sentiment_analysis/classification_news_fine_tunning_and_predict.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tune an LLM for sentiment analysis of news agency coverage: pro- vs. anti-Israel stance

We're going to use the previously collected "Israel-HAMAS war news" [dataset](https://huggingface.co/datasets/aav-ds/Israel-HAMAS_war_news). You can explore the data collection process [here](https://github.com/aakhterov/ML_projects/blob/master/news_sentiment_analysis/collect_news.ipynb). The dataset isn't annotated, so we need to annotate it ourselves. We'll assume that all news from the Palestinian "news" agency "WAFA" and the Lebanese "Al Mayadeen" is anti-Israel. Much of the news from Qatar's "Al Jazeera" also takes an anti-Israel position, but not all of it. Conversely, most of the news from "The Times of Israel" is pro-Israel. We also annotated a small portion of the BBC's news by hand. We'll then fine-tune one of the LLMs from the Hugging Face Hub on this data.
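The labeling assumptions above amount to a provider-level weak-labeling rule. As a minimal sketch (the `PROVIDER_LABELS` mapping and `weak_label` helper are hypothetical names, not part of the notebook), the rule can be written down like this, with 0 standing for anti-Israel and 1 for pro-Israel as in the annotation cells further down:

```python
# Hypothetical mapping that mirrors the labeling assumptions stated above.
# 0 = anti-Israel, 1 = pro-Israel, None = no label assumed from the provider alone.
PROVIDER_LABELS = {
    "WAFA News Agency": 0,
    "Al Mayadeen": 0,
    "Al Jazeera": 0,           # only after keyword filtering
    "The Times of Israel": 1,  # only after keyword filtering
    "BBC": None,               # a small subset is labeled by hand instead
    "CNN": None,               # left unlabeled, used only for prediction
}

def weak_label(provider):
    """Return the provider-level label, or None when no label is assumed."""
    return PROVIDER_LABELS.get(provider)

print(weak_label("Al Mayadeen"))  # 0
print(weak_label("CNN"))          # None
```

Labels obtained this way are noisy by construction, which is why the notebook later filters the "Al Jazeera" and "The Times of Israel" subsets by keywords.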

In [None]:
# TODO
# Error analysis

In [2]:
!pip install datasets transformers nltk huggingface_hub

Collecting datasets
  Downloading datasets-2.15.0-py3-none-any.whl (521 kB)
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: pyarrow-hotfix, dill, multiprocess, datasets
Successfully installed datasets-2.15.0 dill-0.3.7 multiprocess-0.70.15 pyarrow-hotfix-0.6


In [3]:
import pandas as pd
import numpy as np
import tensorflow as tf
from datetime import datetime
# from collections import Counter
# from nltk.util import bigrams, trigrams
from huggingface_hub import notebook_login
from datasets import Dataset, load_dataset, load_from_disk, concatenate_datasets, ClassLabel
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification, DataCollatorWithPadding, TFAutoModel
from tensorflow.keras.losses import SparseCategoricalCrossentropy, BinaryCrossentropy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import PolynomialDecay
from tensorflow.keras.metrics import Accuracy

In [4]:
from google.colab import drive
drive.mount('/content/drive')
base_path = '/content/drive/MyDrive/Colab Notebooks/'

Mounted at /content/drive


## 1. Load dataset from the Hugging Face Hub

In [5]:
ds = load_dataset("aav-ds/Israel-HAMAS_war_news")['train']

Downloading readme:   0%|          | 0.00/6.62k [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/6.08M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/13103 [00:00<?, ? examples/s]

## 2. Dataset preprocessing and annotation

In [6]:
ds.unique('provider') # Take a look at the news producers

['BBC',
 'The Times of Israel',
 'Al Jazeera',
 'Al Mayadeen',
 'WAFA News Agency',
 'CNN']

### 2.1. Preprocessing

In [7]:
def get_length(example):
  '''
  Create a new field "len" which equals the number of words in the news.
  '''
  example['len'] = len(example['text'].split())
  return example

In [8]:
ds_w_len = ds.map(get_length)

Map:   0%|          | 0/13103 [00:00<?, ? examples/s]

In [9]:
# Take a look at the average length of the news. We can see that posts by "Al Mayadeen" are much longer than other posts.
ds_w_len.set_format("pandas")
ds_w_len[:].groupby(["provider", "source"]).agg(['mean', 'count'])



| provider | source | len (mean) | len (count) |
| --- | --- | --- | --- |
| Al Jazeera | site-live-news | 104.884615 | 3198 |
| Al Mayadeen | site-articles | 1196.256757 | 74 |
| BBC | site-live-news | 129.625935 | 802 |
| CNN | site-live-news | 197.663165 | 1428 |
| The Times of Israel | site-live-news | 127.901383 | 6581 |
| WAFA News Agency | site-occupation | 151.498039 | 1020 |


In [10]:
# As mentioned before, articles by "Al Mayadeen" are much longer than other news.
# Hence, we can split each article into several parts. If you look at these articles, you can see
# so much antisemitism throughout that, from the sentiment analysis perspective, we lose nothing
# by splitting them into parts.
def split_text(example, parts_number=6):
  '''
  Split the field "text" into parts_number parts with an equal number of words.
  All other fields are repeated for each part.
  Note: if the word count isn't divisible by parts_number, the trailing remainder is dropped.
  :param example - Python dictionary corresponding to the dataset sample
  :param parts_number - number of parts (6 by default)

  Ex.
  example = {
    "field1": 1,
    "field2": 2,
    "text": "abc def ghk qwe dfh uij"
    }
  parts_number = 6

  Result is
  examples = {
    "field1": [1, 1, 1, 1, 1, 1],
    "field2": [2, 2, 2, 2, 2, 2],
    "text": ["abc", "def", "ghk", "qwe", "dfh", "uij"]
    }
  '''
  words = example["text"].split()
  words_per_part = len(words) // parts_number
  examples = {}
  for i in range(parts_number):
    for k, v in example.items():
      examples[k] = examples.get(k, [])
      if k == 'text':
        examples[k].append(" ".join(words[i*words_per_part:(i+1)*words_per_part]))
      else:
        examples[k].append(v)

  return examples

In [11]:
example = {"field1": 1, "field2": 2, "text": "abc def ghk qwe dfh uij"}
split_text(example)

{'field1': [1, 1, 1, 1, 1, 1],
 'field2': [2, 2, 2, 2, 2, 2],
 'text': ['abc', 'def', 'ghk', 'qwe', 'dfh', 'uij']}
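Because `split_text` computes the part size with integer division, the words left over when the length isn't divisible by the number of parts are not included in any part. If that matters, one possible alternative is sketched below. `split_words_evenly` is a hypothetical helper, not what the notebook uses: it spreads the remainder over the first chunks so no word is lost.

```python
def split_words_evenly(text, parts_number=6):
    """Split text into parts_number word chunks without dropping any words."""
    words = text.split()
    base, extra = divmod(len(words), parts_number)
    chunks, start = [], 0
    for i in range(parts_number):
        # The first `extra` chunks receive one additional word each
        size = base + (1 if i < extra else 0)
        chunks.append(" ".join(words[start:start + size]))
        start += size
    return chunks

print(split_words_evenly("abc def ghk qwe dfh uij xyz", parts_number=3))
# ['abc def ghk', 'qwe dfh', 'uij xyz']
```

Switching to this variant would slightly change the average part length reported for "Al Mayadeen" below, since those figures come from the truncating version.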

In [12]:
ds_am = ds.filter(lambda x: x["provider"] == 'Al Mayadeen').map(split_text) # Apply the split_text function to each "Al Mayadeen" article
ds_am.set_format("pandas")
df = ds_am[:].explode(ds_am.column_names, ignore_index=True) # Explode list-like fields into separate rows (this is a pandas function)
ds_am = Dataset.from_pandas(df)
ds_am

Filter:   0%|          | 0/13103 [00:00<?, ? examples/s]

Map:   0%|          | 0/74 [00:00<?, ? examples/s]

Dataset({
    features: ['url', 'datetime', 'title', 'text', 'provider', 'source'],
    num_rows: 444
})

In [13]:
# Check the new average length of the "Al Mayadeen" news
ds_am_new_w_len = ds_am.map(get_length)
ds_am_new_w_len.set_format("pandas")
ds_am_new_w_len[:].groupby(["provider", "source"]).agg(['mean', 'count'])

Map:   0%|          | 0/444 [00:00<?, ? examples/s]



| provider | source | len (mean) | len (count) |
| --- | --- | --- | --- |
| Al Mayadeen | site-articles | 199.013514 | 444 |


### 2.2. Annotation

In [14]:
# As mentioned above, part of the BBC news was annotated by hand. Load these annotated subsets.
ds_bbs_pos = load_dataset("csv", data_files=base_path + 'Data/bbc_news_ds/bbc_news_ds_pos.csv')['train']
ds_bbs_neg = load_dataset("csv", data_files=base_path + 'Data/bbc_news_ds/bbc_news_ds_neg.csv')['train']

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

In [15]:
print(f"Total positive BBC news: {ds_bbs_pos.num_rows}")
print(f"Example of the positive BBC news:\n{ds_bbs_pos[0]['text']}\n")
print(f"Total negative BBC news: {ds_bbs_neg.num_rows}")
print(f"Example of the negative BBC news:\n{ds_bbs_neg[1]['text']}")

Total positive BBC news: 116
Example of the positive BBC news:
A little earlier we reported on claims that a UN school-turned-shelter in Jabalia, northern Gaza, had been hit. (Our BBC Verify colleagues analysed some footage of the incident here.)
Since then the Israeli military has shared an operational update, describing how it's "continuing and expanding its operational activities" in parts of the Palestinian enclave.
The Israel Defence Forces (IDF) say that in the last 24 hours troops have "conducted activities in the Zaytun and Jabalia areas, during which they encountered terrorists who intentionally operated from civilian areas and attacked the troops using anti-tank missiles and explosives".
It adds that its forces "eliminated" a number of Hamas operatives in the process, "and struck a large number of terrorist infrastructure".
Earlier, IDF spokesperson Lt Col Peter Lerner said the force was investigating the reported school blast.

Total negative BBC news: 95
Example of the nega

In [16]:
# As we said at the beginning, not all "Al Jazeera" posts are anti-Israel. Some of them are quite pro-Israel,
# especially when they quote official Israeli speakers. The reverse holds for "The Times of Israel":
# some of its news looks anti-Israel because, for example, it cites a pro-Palestinian agency.
# So we need to remove the anti-Israel news from the "The Times of Israel" subset and the pro-Israel news
# from the "Al Jazeera" subset. We'll do this based on keywords.

def keywords_filter(example, keywords):
  '''
  Return True if the lowercased news text contains at least one of the keywords (substring match)
  '''
  text = example['text'].lower()
  return any(keyword in text for keyword in keywords)

In [17]:
# Keywords that are characteristic of an anti-Israel context
keywords_neg = ["occup", "crimes", "aggression",  "genocid", "sanction israel", "fighters", "resistance", "israeli regime", "ethnic cleansing"]

In [18]:
# Keywords that are characteristic of a pro-Israel context. We assume that official Israeli speakers make pro-Israel statements.
keywords_pos = ["terror group", "terror", "israel defense force says", "idf says", "daniel hagari", "israeli army spokesperson says"]
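Since the filter relies on plain substring matching, a keyword also fires inside longer, unrelated words. The sketch below illustrates this with "fighters" inside "firefighters" and shows a stricter whole-word variant for comparison. Both helpers (`contains_keyword`, `contains_whole_word`) are hypothetical and only mirror the idea of `keywords_filter`; the whole-word variant is an assumption and would not work with stem-like keywords such as "occup" or "genocid".

```python
import re

def contains_keyword(text, keywords):
    """Case-insensitive substring match, the same idea as keywords_filter."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)

def contains_whole_word(text, keywords):
    """Stricter variant (not used in the notebook): match whole words or phrases only."""
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.search(pattern, text.lower()) is not None

sample = "Firefighters rescued residents."
print(contains_keyword(sample, ["fighters"]))     # True  (substring of "firefighters")
print(contains_whole_word(sample, ["fighters"]))  # False (not a standalone word)
```

Substring matching keeps recall high at the cost of some false positives, which may be acceptable when the goal is only to discard doubtful samples from a weakly labeled subset.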

In [19]:
# Remove the anti-Israel news from the "The Times of Israel" subset
ds_toi = ds.filter(lambda x: x["provider"] == 'The Times of Israel' and not keywords_filter(x, keywords_neg))
ds_toi

Filter:   0%|          | 0/13103 [00:00<?, ? examples/s]

Dataset({
    features: ['url', 'datetime', 'title', 'text', 'provider', 'source'],
    num_rows: 6054
})

In [20]:
# Remove the pro-Israel news from the "Al Jazeera" subset
ds_aj = ds.filter(lambda x: x["provider"] == 'Al Jazeera' and not keywords_filter(x, keywords_pos))
ds_aj

Filter:   0%|          | 0/13103 [00:00<?, ? examples/s]

Dataset({
    features: ['url', 'datetime', 'title', 'text', 'provider', 'source'],
    num_rows: 3054
})

In [21]:
# Save the news from the WAFA "News" agency into a separate dataset
ds_wafa = ds.filter(lambda x: x["provider"] == "WAFA News Agency")
ds_wafa

Filter:   0%|          | 0/13103 [00:00<?, ? examples/s]

Dataset({
    features: ['url', 'datetime', 'title', 'text', 'provider', 'source'],
    num_rows: 1020
})

In [22]:
def make_neg_labels(example):
  '''
  Annotate example as a negative sample
  '''
  example['label'] = 0
  return example

def make_pos_labels(example):
  '''
  Annotate example as a positive sample
  '''
  example['label'] = 1
  return example

In [23]:
# Merge all the negative datasets
ds_neg = concatenate_datasets([ds_am, ds_aj, ds_wafa, ds_bbs_neg]).map(make_neg_labels)
ds_neg

Map:   0%|          | 0/4613 [00:00<?, ? examples/s]

Dataset({
    features: ['url', 'datetime', 'title', 'text', 'provider', 'source', 'author', 'label'],
    num_rows: 4613
})

In [24]:
# Merge all the positive datasets
ds_pos = concatenate_datasets([ds_toi, ds_bbs_pos]).map(make_pos_labels)
ds_pos

Map:   0%|          | 0/6170 [00:00<?, ? examples/s]

Dataset({
    features: ['url', 'datetime', 'title', 'text', 'provider', 'source', 'author', 'label'],
    num_rows: 6170
})

In [25]:
# Merge the negative and positive datasets
ds_classification = concatenate_datasets([ds_neg, ds_pos])
ds_classification

Dataset({
    features: ['url', 'datetime', 'title', 'text', 'provider', 'source', 'author', 'label'],
    num_rows: 10783
})

In [26]:
# To avoid losing useful information from the news title, let's merge the title and the text into a single "body" field
def merge_title_with_text(example):
  '''
  Merge the news title and the news text into the "body" field
  '''
  example['body'] = (example['title'] + '\n' + example['text']) if example['title'] else example['text']
  return example

In [27]:
ds_classification = ds_classification.map(merge_title_with_text)
ds_classification

Map:   0%|          | 0/10783 [00:00<?, ? examples/s]

Dataset({
    features: ['url', 'datetime', 'title', 'text', 'provider', 'source', 'author', 'label', 'body'],
    num_rows: 10783
})

In [28]:
# For classification we need only the "body" and "label" fields
ds_classification = ds_classification.select_columns(['body', 'label'])
ds_classification

Dataset({
    features: ['body', 'label'],
    num_rows: 10783
})

In [29]:
# Balance the dataset by downsampling the positive class to the size of the negative class
ds_negative = ds_classification.filter(lambda x: x["label"] == 0)
negative_count = len(ds_negative)

ds_pos_balanced = ds_classification.filter(lambda x: x["label"] == 1).shuffle().select(range(negative_count))

ds_balanced = concatenate_datasets([ds_pos_balanced, ds_negative])

print(f"Positive samples count: {len(ds_balanced.filter(lambda x: x['label'] == 1))}")
print(f"Negative samples count: {len(ds_balanced.filter(lambda x: x['label'] == 0))}")

Filter:   0%|          | 0/10783 [00:00<?, ? examples/s]

Filter:   0%|          | 0/10783 [00:00<?, ? examples/s]

Filter:   0%|          | 0/9226 [00:00<?, ? examples/s]

Positive samples count: 4613


Filter:   0%|          | 0/9226 [00:00<?, ? examples/s]

Negative samples count: 4613
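The shuffle in the balancing cell is unseeded, so the selected positive subset differs between runs (passing a `seed` to `Dataset.shuffle` is one way to fix it). The same downsampling idea can be sketched on a plain label array with NumPy. `balanced_indices` is a hypothetical helper, and the seed value is an arbitrary assumption:

```python
import numpy as np

def balanced_indices(labels, seed=0):
    """Pick an equal number of row indices per class by downsampling larger classes."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)  # fixed seed makes the selection reproducible
    classes, counts = np.unique(labels, return_counts=True)
    n_per_class = counts.min()
    picked = [
        rng.choice(np.flatnonzero(labels == c), size=n_per_class, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(picked))

toy_labels = np.array([0, 0, 1, 1, 1, 1, 1])
idx = balanced_indices(toy_labels)
print(np.bincount(toy_labels[idx]))  # [2 2]
```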


## 3. Load model and train it

In [30]:
# Split the dataset into train (80%), test (15%) and validation (5%) sets
ds_splitted = ds_balanced.train_test_split(train_size=.8)
test_and_val_ds = ds_splitted['test'].train_test_split(train_size=0.75)
ds_splitted['test'] = test_and_val_ds['train']
ds_splitted['validation'] = test_and_val_ds['test']
ds_splitted

DatasetDict({
    train: Dataset({
        features: ['body', 'label'],
        num_rows: 7380
    })
    test: Dataset({
        features: ['body', 'label'],
        num_rows: 1384
    })
    validation: Dataset({
        features: ['body', 'label'],
        num_rows: 462
    })
})
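The two nested splits give roughly 80% / 15% / 5% of the 9226 balanced samples. A short sketch of the arithmetic follows. `split_sizes` is a hypothetical helper, and rounding the train share down is an assumption that happens to reproduce the row counts shown above:

```python
import math

def split_sizes(n_samples, train_size):
    """Return (train, rest) sizes, assuming the train share is rounded down."""
    n_train = math.floor(train_size * n_samples)
    return n_train, n_samples - n_train

n_train, n_rest = split_sizes(9226, 0.8)   # first split: train vs. the rest
n_test, n_val = split_sizes(n_rest, 0.75)  # second split: test vs. validation
print(n_train, n_test, n_val)  # 7380 1384 462
```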

In [31]:
ds_splitted['train'].features

{'body': Value(dtype='string', id=None),
 'label': Value(dtype='int64', id=None)}

In [32]:
# Cast "label" field to the ClassLabel type
new_features = ds_splitted['train'].features.copy()
new_features['label'] = ClassLabel(names=['anti-Israel', 'pro-Israel'])
ds_splitted = ds_splitted.cast(new_features)

Casting the dataset:   0%|          | 0/7380 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/1384 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/462 [00:00<?, ? examples/s]

In [33]:
# We will use DistilBERT base model (uncased) (https://huggingface.co/distilbert-base-uncased)

# checkpoint = "roberta-base"
# checkpoint = "bert-base-uncased"
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint) # Load the tokenizer that matches the DistilBERT checkpoint from the Hugging Face Hub

tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [None]:
tokenizer(ds_splitted['train']['body'][0]) # Let's look at the tokenizer output for a single sample

{'input_ids': [101, 9100, 3554, 2012, 3956, 1011, 8341, 3675, 2045, 2038, 2042, 9100, 3554, 2006, 1996, 3675, 2090, 3956, 1998, 8341, 1012, 25713, 2056, 2009, 2018, 5045, 2012, 2195, 7889, 1999, 1996, 3675, 2555, 1998, 1523, 3495, 2718, 1524, 11382, 10322, 20267, 7658, 6590, 2006, 1996, 5611, 2217, 1012, 1996, 2177, 2036, 4457, 1037, 3295, 2007, 12496, 1998, 4893, 10986, 1998, 1037, 1523, 7215, 1997, 10420, 2111, 1998, 4683, 1524, 1012, 3956, 1521, 1055, 2390, 2623, 2008, 2009, 2018, 4457, 2195, 1523, 10027, 2250, 7889, 1524, 2008, 2018, 10583, 2013, 8341, 2875, 3956, 1012, 2028, 1997, 2122, 2018, 2042, 5147, 16618, 1012, 2045, 2020, 2036, 2582, 4491, 2013, 8341, 1010, 2000, 2029, 3956, 5838, 1010, 1996, 5611, 2390, 2056, 1012, 2045, 2020, 2053, 4311, 1997, 8664, 1012, 2144, 1996, 2927, 1997, 1996, 14474, 2162, 2206, 1996, 2255, 1021, 2886, 1010, 2045, 2031, 2042, 5567, 13111, 2015, 2090, 1996, 5611, 2390, 1998, 25713, 2247, 1996, 3675, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 

In [34]:
def make_tokenize(example):
  return tokenizer(example['body'], truncation=True)

In [35]:
ds_tokenized = ds_splitted.map(make_tokenize, batched=True) # Apply tokenizer to the whole dataset

Map:   0%|          | 0/7380 [00:00<?, ? examples/s]

Map:   0%|          | 0/1384 [00:00<?, ? examples/s]

Map:   0%|          | 0/462 [00:00<?, ? examples/s]

In [36]:
ds_tokenized

DatasetDict({
    train: Dataset({
        features: ['body', 'label', 'input_ids', 'attention_mask'],
        num_rows: 7380
    })
    test: Dataset({
        features: ['body', 'label', 'input_ids', 'attention_mask'],
        num_rows: 1384
    })
    validation: Dataset({
        features: ['body', 'label', 'input_ids', 'attention_mask'],
        num_rows: 462
    })
})

In [37]:
# We will train in batches, so all samples within a batch must be padded to the same length.
# We use DataCollatorWithPadding for this.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors='tf')
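To make the idea of per-batch (dynamic) padding concrete, here is a minimal NumPy sketch of what such a collator does conceptually: each batch is padded only to the length of its own longest sequence. `pad_batch` is a hypothetical helper, and `pad_id=0` and right-side padding are assumptions for illustration, not a description of the library's internals:

```python
import numpy as np

def pad_batch(sequences, pad_id=0):
    """Pad token id lists to the longest sequence in the batch and build the mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(sequences), max_len), dtype=np.int64)
    for row, seq in enumerate(sequences):
        input_ids[row, :len(seq)] = seq
        attention_mask[row, :len(seq)] = 1  # 1 marks real tokens, 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 9100, 102], [101, 102]])
print(batch["input_ids"].tolist())       # [[101, 9100, 102], [101, 102, 0]]
print(batch["attention_mask"].tolist())  # [[1, 1, 1], [1, 1, 0]]
```

Padding per batch rather than to a global maximum keeps batches of short news items small, which saves computation during training.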

In [38]:
batch_size = 8
num_epochs = 5

In [None]:
# Convert the train, validation and test datasets to TensorFlow datasets
tf_train_dataset = ds_tokenized["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids"],
    label_cols=["label"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=batch_size,
)

tf_validation_dataset = ds_tokenized["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids"],
    label_cols=["label"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=batch_size,
)

tf_test_dataset = ds_tokenized["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids"],
    label_cols=["label"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=batch_size,
)

Old behaviour: columns=['a'], labels=['labels'] -> (tf.Tensor, tf.Tensor)  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor)  
New behaviour: columns=['a'],labels=['labels'] -> ({'a': tf.Tensor}, {'labels': tf.Tensor})  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor) 
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [39]:
# Load the DistilBERT model from the Hugging Face Hub. Note that DistilBERT was pretrained on a different task (predicting masked words),
# but we want to use it for classification. Hence the pretraining head is replaced with a new, randomly initialized classification head.
# During fine-tuning all weights are trained (both the DistilBERT body and the new head), as the model summary confirms.
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

In [None]:
# We will decay the learning rate linearly (PolynomialDecay with its default power of 1) from 5e-5 to 0 over num_train_steps steps

num_train_steps = len(tf_train_dataset) * num_epochs

lr_scheduler = PolynomialDecay(initial_learning_rate=5e-5,
                               end_learning_rate=0.0,
                               decay_steps=num_train_steps)
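The schedule can be reproduced by hand with the polynomial decay formula. The sketch below is a plain-Python illustration, not the Keras implementation. The default `decay_steps=4615` is an assumption: 7380 training samples in batches of 8 give 923 batches per epoch if the last partial batch is kept, times 5 epochs.

```python
def polynomial_decay(step, initial_lr=5e-5, end_lr=0.0, decay_steps=4615, power=1.0):
    """Learning rate at a given step; with power=1 the decay is linear."""
    step = min(step, decay_steps)  # the rate stays at end_lr after decay_steps
    return (initial_lr - end_lr) * (1 - step / decay_steps) ** power + end_lr

for step in (0, 923, 4615):
    print(step, polynomial_decay(step))
```

With these assumed values the rate starts at 5e-5, has dropped by one fifth after the first epoch, and reaches 0 at the last training step.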

In [None]:
model.compile(
    optimizer=Adam(learning_rate=lr_scheduler),
    loss=SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

In [None]:
model.summary() # Take a look at the model layers

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
Total params: 66955010 (255.41 MB)
Trainable params: 66955010 (255.41 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [None]:
model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
    epochs=num_epochs
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x7cc6ce56c520>

In [None]:
model.save_weights(base_path + f'checkpoints/{checkpoint}/{checkpoint}') # Save weights

In [40]:
model.load_weights(base_path + f'checkpoints/{checkpoint}/{checkpoint}')

<tensorflow.python.checkpoint.checkpoint.CheckpointLoadStatus at 0x790488d797e0>

## 4. Prediction and accuracy calculation

In [None]:
def get_accuracy(y_true, logits_pred):
  '''
  Calculate accuracy metric
  :param y_true - ground truth vector (n, )
  :param logits_pred - logits (n, 2) (model output)
  :return accuracy metric
  '''
  class_preds = np.argmax(logits_pred, axis=1)
  acc = Accuracy()
  acc.update_state(y_true, class_preds)
  return acc.result().numpy()

In [None]:
# Make prediction and evaluate accuracy on validation dataset
preds = model.predict(tf_validation_dataset)["logits"]
get_accuracy(ds_tokenized["validation"]['label'], preds)



0.9957447

In [None]:
# Make prediction and evaluate accuracy on test dataset
test_preds = model.predict(tf_test_dataset)["logits"]
get_accuracy(ds_tokenized["test"]['label'], test_preds)



0.9879433
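Accuracy alone doesn't show which class the errors fall into. A confusion matrix with per-class precision and recall is one common complement. The sketch below uses NumPy only, with made-up toy labels rather than the notebook's predictions. `confusion_matrix` and `precision_recall` are hypothetical helpers:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes=2):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        matrix[t, p] += 1
    return matrix

def precision_recall(matrix, positive=1):
    """Precision and recall for the class treated as positive."""
    tp = matrix[positive, positive]
    fp = matrix[:, positive].sum() - tp
    fn = matrix[positive, :].sum() - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

toy_true = [0, 0, 1, 1, 1]  # toy data for illustration only
toy_pred = [0, 1, 1, 1, 0]
cm = confusion_matrix(toy_true, toy_pred)
print(cm.tolist())           # [[1, 1], [1, 2]]
print(precision_recall(cm))
```

In the notebook's setting, `y_pred` would be the `np.argmax` of the logits, the same quantity `get_accuracy` computes internally.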

### 4.1. Error analysis

In [None]:
# Let's take a look at the prediction errors
class_preds = np.argmax(test_preds, axis=1)
# Read the columns once instead of re-reading them on every iteration
test_labels = ds_tokenized["test"]["label"]
test_bodies = ds_tokenized["test"]["body"]
counts = 1
for i, (label, body) in enumerate(zip(test_labels, test_bodies)):
  if label != class_preds[i]:
    print(f"== {counts} {10*'='}")
    print(f"True label: {label}")
    print(f"Predicted label: {class_preds[i]}")
    print(f"Predicted vector: {test_preds[i]}, {tf.math.softmax(test_preds[i])}")
    print(f"Body: {body}")
    counts += 1

True label: 1
Predicted label: 0
Predicted vector: [ 0.4558785  -0.26043764], [0.67179525 0.32820472]
Body: Biden praises Netanyahu, Sissi on Gaza humanitarian aid deal: They ‘stepped up’
US President Joe Biden praised Prime Minister Benjamin Netanyahu and Egyptian President Abdel Fattah al-Sissi for an agreement to let humanitarian aid in to the Gaza Strip through the Rafah crossing from Egypt.
Speaking to reporters at the Ramstein Air Base in Germany following a brief solidarity visit to Israel on Wednesday, Biden said Sissi agreed to open the crossing and to let in an initial group of 20 trucks with humanitarian aid, and possibly more at a later time.
“Sissi deserves some real credit because he was very accommodating,” Biden said, adding that the Egyptian president was “fair” and “very cooperative.”
Israel stopped all entry of food, water, medicine and fuel to Gaza following Hamas’s brutal onslaught on October 7 when some 2,500 terrorists burst across the border into Israel from the

### 4.2. BBC news prediction

Now let's classify all of the collected BBC news.

In [10]:
# Make a tokenized dataset
ds_bbc_tokenized = ds.filter(lambda x: x['provider'] == 'BBC')\
                     .map(merge_title_with_text)\
                     .select_columns(['body'])\
                     .map(make_tokenize, batched=True)
ds_bbc_tokenized

Filter:   0%|          | 0/13103 [00:00<?, ? examples/s]

Map:   0%|          | 0/802 [00:00<?, ? examples/s]

Map:   0%|          | 0/802 [00:00<?, ? examples/s]

Dataset({
    features: ['body', 'input_ids', 'attention_mask'],
    num_rows: 802
})

In [15]:
# Convert the tokenized dataset to a TensorFlow dataset
tf_bbc_dataset = ds_bbc_tokenized.to_tf_dataset(
    columns=["attention_mask", "input_ids"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=batch_size,
)

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [16]:
preds_bbc = model.predict(tf_bbc_dataset)["logits"] # Make prediction



In [1]:
# If the model isn't confident in its prediction (the difference between the class probabilities is below the threshold),
# we assign the sample to a third class: neutral.
# So the positive class is 1, the neutral class is 0, and the negative class is -1.
def get_three_classes(preds, thr=0.5):
  '''
  Get class values
  :param preds - prediction logits (n, 2)
  :param thr - minimum difference between the class probabilities for a confident prediction
  :return array with the classes (-1, 0 or 1)
  '''
  softmax = tf.math.softmax(preds, axis=1)
  class_preds = []
  for v in softmax.numpy():
    if v[1] - v[0] >= thr:
      class_preds.append(1)
    elif v[0] - v[1] >= thr:
      class_preds.append(-1)
    else:
      class_preds.append(0)
  return np.array(class_preds)
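The same thresholding rule can be written in vectorized form. The sketch below is NumPy-only and uses hypothetical helper names (`softmax`, `three_classes`). Its first input row is the logit pair from the misclassified test sample shown in the error analysis. Because the two probabilities sum to 1, a margin of at least 0.5 is equivalent to a winning probability of at least 0.75.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def three_classes(logits, thr=0.5):
    """Map (n, 2) logits to 1 (positive), -1 (negative) or 0 (neutral)."""
    probs = softmax(np.asarray(logits, dtype=np.float64))
    margin = probs[:, 1] - probs[:, 0]
    return np.where(margin >= thr, 1, np.where(margin <= -thr, -1, 0))

logits = [[0.4558785, -0.26043764], [-2.0, 2.0], [2.0, -2.0]]
print(three_classes(logits))  # [ 0  1 -1]
```

The first sample has probabilities of about 0.67 and 0.33, so its margin of about 0.34 falls below the threshold and it lands in the neutral class.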

In [None]:
class_preds_bbc = get_three_classes(preds_bbc, thr=0.5)

In [40]:
total_pos = np.sum(class_preds_bbc == 1)
total_neg = np.sum(class_preds_bbc == -1)
total_nrl = np.sum(class_preds_bbc == 0)

print(f'Total number of samples: {len(class_preds_bbc)}')
print(f'Predicted as positive: {total_pos}/{total_pos/len(class_preds_bbc):.0%}')
print(f'Predicted as neutral: {total_nrl}/{total_nrl/len(class_preds_bbc):.0%}')
print(f'Predicted as negative: {total_neg}/{total_neg/len(class_preds_bbc):.0%}')

Total number of samples: 802
Predicted as positive: 155/19%
Predicted as neutral: 48/6%
Predicted as negative: 599/75%


In [42]:
def get_samples_by_class(ds, class_predictions, current_class, samples_count):
  '''
  Get random news items predicted as the specified class
  :param ds - dataset
  :param class_predictions - array of predicted classes
  :param current_class - class to select; can be -1, 0 or 1
  :param samples_count - maximum number of news items to return
  :return dataset with up to samples_count random news items of the class
  '''
  indexes = set(np.flatnonzero(class_predictions == current_class).tolist())
  filtered_ds = ds.filter(lambda x, idx: idx in indexes, with_indices=True)
  # Guard against requesting more samples than the class contains
  return filtered_ds.shuffle().select(range(min(samples_count, len(filtered_ds))))

In [43]:
get_samples_by_class(ds_bbc_tokenized, class_preds_bbc, -1, 3)['body'] # Get 3 random negative (anti-Israel) BBC news

Filter:   0%|          | 0/802 [00:00<?, ? examples/s]

['The Gaza hospital struggling on with battery power\nA senior doctor at the Al-Awda hospital in northern Gaza has told the anti-poverty campaign group, ActionAid, that the hospital is using battery power to help with the delivering of babies and the treatment of the injured.\nAl-Awda Hospital has closed its main generators and has been without electricity\nor fuel for three days, the doctor says.\nDespite this, using power and light from batteries, the doctor says it has delivered 16 babies by caesarean\nsections on Sunday.\n"We\nare now receiving about 18-20 newborn deliveries every 24 hours. I\nthink this number will increase in the next days because people will come\nto Al-Awda Hospital from Gaza City.\n"We\nare providing our services to injured patients from the northern\narea because Al-Awda Hospital is the only\nhospital in the northern area that is active and working."\nEarlier, the Hamas-run health ministry told AFP news that all hospitals in the north of the enclave were "out

In [44]:
get_samples_by_class(ds_bbc_tokenized, class_preds_bbc, 0, 3)['body']  # Get 3 random neutral BBC news

Filter:   0%|          | 0/802 [00:00<?, ? examples/s]

['G7 expected to call for temporary pauses\nG7 foreign ministers including US Secretary of State Antony Blinken have been meeting in Tokyo today - where they\'re hammering out a consensus line on Gaza.\nThe group will release a communique shortly - it is expected that it will call for temporary pauses in fighting to allow aid into the Strip but stop short of urging a ceasefire.\nIsrael\'s leader Netanyahu has also rejected calls for a ceasefire which he says would allow Hamas to regroup - but has said he will consider "tactical little pauses".\nThe joint statement from the group of wealthy nations - the US, the UK,\nCanada, France, Germany, Italy, Japan and the European Union-  will be only the second from the group since the fighting began last month.',
 "Israel says it took over Hamas stronghold in northern Gaza Strip\nThe Israel Defense Forces has just given its morning update. It says:\nAlthough the IDF didn't mention it, we're getting pictures of the aftermath of an apparent air s

In [45]:
get_samples_by_class(ds_bbc_tokenized, class_preds_bbc, 1, 3)['body'] # Get 3 random positive (pro-Israel) BBC news

Filter:   0%|          | 0/802 [00:00<?, ? examples/s]

['Israel reports strikes on \'senior\' Hamas figures in Gaza\nHere\'s what we\'ve heard so far from the Israeli military on its strikes on "underground sites".\nThey said the strikes were aimed at "senior" Hamas figures in two different sites in Gaza in "the past few days".\n"A number of senior Hamas commanders were hiding in one of them, including Ahmed Randor, the head of Hamas’ northern Gaza brigade and Hyman Sian, the head of the Hamas rocket brigade," said military spokesman Daniel Hagari.\nAnother underground site which was attacked, he said, contained "senior members of Hamas’ political wing, including Raukhi Mushta, who is a very close associate of Yahya Sinwar, Asam Dalyis, head of the Hamas government in Gaza who is close to Ismael Haniyah, and Samech El Sarg, who is also a close associate to Sinwar and other senior Hamas figures in Gaza".\nBoth sites were "significantly damaged", the Israeli spokesman said without giving further details.\nThe Israeli report could not be veri

### 4.3. CNN news prediction

In [23]:
ds_cnn_tokenized = ds.filter(lambda x: x['provider'] == 'CNN')\
                     .map(merge_title_with_text)\
                     .select_columns(['body'])\
                     .map(make_tokenize, batched=True)
ds_cnn_tokenized

Filter:   0%|          | 0/13103 [00:00<?, ? examples/s]

Map:   0%|          | 0/1428 [00:00<?, ? examples/s]

Map:   0%|          | 0/1428 [00:00<?, ? examples/s]

Dataset({
    features: ['body', 'input_ids', 'attention_mask'],
    num_rows: 1428
})

In [24]:
tf_cnn_dataset = ds_cnn_tokenized.to_tf_dataset(
    columns=["attention_mask", "input_ids"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=batch_size,
)

In [25]:
preds_cnn = model.predict(tf_cnn_dataset)["logits"]



In [46]:
# Map the logits to three classes: -1 (negative), 0 (neutral), 1 (positive)
class_preds_cnn = get_three_classes(preds_cnn, thr=0.5)

In [47]:
total_pos = np.sum(class_preds_cnn == 1)
total_neg = np.sum(class_preds_cnn == -1)
total_nrl = np.sum(class_preds_cnn == 0)

print(f'Total number of samples: {len(class_preds_cnn)}')
print(f'Predicted as positive: {total_pos}/{total_pos/len(class_preds_cnn):.0%}')
print(f'Predicted as neutral: {total_nrl}/{total_nrl/len(class_preds_cnn):.0%}')
print(f'Predicted as negative: {total_neg}/{total_neg/len(class_preds_cnn):.0%}')

Total number of samples: 1428
Predicted as positive: 368/26%
Predicted as neutral: 165/12%
Predicted as negative: 895/63%
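
The same tallying is useful for every provider, so it could be wrapped in a small helper. The sketch below is an illustration only: `class_distribution` and `CLASS_NAMES` are hypothetical names that do not exist in the notebook, and the demo array stands in for `class_preds_cnn`.

```python
import numpy as np

# Class encoding taken from the cells above: -1 negative, 0 neutral, 1 positive.
CLASS_NAMES = {1: "positive", 0: "neutral", -1: "negative"}

def class_distribution(class_preds):
    """Return {class name: (count, share)} for an array of -1/0/1 predictions."""
    class_preds = np.asarray(class_preds)
    total = len(class_preds)
    result = {}
    for value, name in CLASS_NAMES.items():
        count = int(np.sum(class_preds == value))
        result[name] = (count, count / total)
    return result

# Small stand-in for class_preds_cnn
demo_preds = np.array([1, -1, -1, 0, -1, 1])
for name, (count, share) in class_distribution(demo_preds).items():
    print(f"Predicted as {name}: {count} ({share:.0%})")
```

Iterating over the known class values instead of `np.unique` keeps a class in the report even when no sample falls into it.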


In [48]:
get_samples_by_class(ds_cnn_tokenized, class_preds_cnn, -1, 3)['body'] # Get 3 random negative (anti-Israel) CNN news


["Central Gaza hospital says it saw over 100 fatalities in one day\nOn a day of heavy Israeli bombardment, workers at central Gaza's Al-Aqsa Martyrs hospital saw more than 100 fatalities in a single day, according to the institution's media office.\nSeparately, journalist Hassan Eslayeh saw dozens of casualties brought into the hospital Monday. Most came in private cars since people have not been able to call for emergency service due to the disruption in communications in Gaza.\nAmbulances were seen following cars back to bombing sites and returning with more casualties.\nCNN video shows more than 20 body bags lined up in front of the hospital during the day for funeral prayer. People then carried the bags into trucks and ambulances for burial.",
 'Lebanese prime minister fears escalation of Israel-Hamas war could plunge Middle East "into chaos"\nAn escalation of the war in Gaza could plunge the whole region into chaos, Lebanon\'s caretaker prime minister said Monday.\n“I see that the

In [49]:
get_samples_by_class(ds_cnn_tokenized, class_preds_cnn, 0, 3)['body'] # Get 3 random neutral CNN news


['Group of hostages seen leaving Gaza in Red Cross convoy\n\n      A group of hostages was seen leaving Gaza in a Red Cross convoy late Saturday night local time.\n  \n\n      The Red Cross had earlier confirmed that it was headed to the Rafah crossing between Gaza and Egypt with the hostages, according to the Israel Defense Forces.\n  \n\n      Hamas said it had handed over 13 Israelis and seven foreign nationals to the organization, though Qatar and the IDF each said the number of released foreign nationals was only four.\n  \n\nRemember: Saturday’s planned exchange of hostages and Palestinian prisoners between Israel and Hamas was delayed after a dispute over terms between the two sides, which Qatar later said was resolved through mediation.\n  \n\nThis post has been updated to reflect developments on the ground and an update on hostages from the IDF and Qatar.\n',
 'Qatar has yet to receive identifying information of Gaza hostages to be released, diplomatic source says\nHamas had n

In [50]:
get_samples_by_class(ds_cnn_tokenized, class_preds_cnn, 1, 3)['body'] # Get 3 random positive (pro-Israel) CNN news


['US officials warn of increased domestic threats, particularly against Jewish, Muslim and Arab Americans\nSecretary of Homeland Security Alejandro Mayorkas and FBI Director Christopher Wray warned Tuesday of the spike in domestic threats following the breakout of the Israel-Hamas war, particularly the increased threats against Jewish, Muslim and Arab-American communities. \nIn their opening remarks at a US Senate panel hearing in front of the Senate Committee on Homeland Security and Governmental Affairs, Mayorkas and Wray outlined the threat landscape that the United States faces and emphasized how the ongoing war poses challenges to American security.\n“As the last few weeks have shown, the threat environment our Department is charged with confronting has evolved and expanded constantly in the 20 years since our founding after 9/11,” Mayorkas said in his opening remarks.\nMayorkas said that since “Hamas terrorists horrifically attacked thousands of innocent men, women, and children 

### 4.4. Predict a single text

In [31]:
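def predict(tokenizer, model, text):
  '''
  Predict class probabilities for a single text
  :param tokenizer: tokenizer that matches the fine-tuned model
  :param model: fine-tuned TensorFlow classification model
  :param text: raw text to classify
  :return: array with one row of softmax probabilities, one column per class
  '''
  # Long texts are truncated to the model's maximum input length
  text_tokenized = tokenizer(text, truncation=True, return_tensors="tf")
  output = model(**text_tokenized)
  # Convert the logits to probabilities
  result = tf.math.softmax(output["logits"]).numpy()
  return result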

In [None]:
text = '''
Earlier this morning, an official for UNRWA - the UN's agency for Palestinian refugees - spoke to the BBC from Rafah in south Gaza. He described a "desperate situation" for locals.
Tom White told BBC Radio 4's Today programme that a ceasefire was crucial. "Here in Rafah we have hundreds of thousands of people who are living in the open," he explained. "There is a lack of water. Everyone in the street is asking for flour to feed their children.
"Our shelters have well over 7,000 people. There are hundreds using the one toilet for example.
"If the bombs aren't going to kill them, it is the disease, or for those living out on the streets, it'll be the exposure."
Describing his trip to a UN distribution centre in Rafah on Friday, White said: "All you could hear was air strikes going into the city." He also told the BBC a guesthouse he shared with colleagues was hit last night.
'''

In [None]:
text = '''
KHAN YUNIS, Sunday, December 10, 2023 (WAFA) – At least 10 civilians were killed, mostly children, and dozens more were wounded early this morning as Israeli warplanes bombed a residential house in Khan Yunis, south of the Gaza Strip, as the Israeli aggression on the enclave enters its 65th day in a row.
Medical sources confirmed the death toll resulting from the Israeli airstrike and reported further injuries, along with several individuals missing under the desbris, following the Israeli bombardment which targeted a house belonging to the Abdulwahab family west of Khan Yunis.
Israeli artillery also shelled the vicinity of the European Hospital in the city, which has been under a tight Israeli military siege and ground invasion for over a week now.
In the central Gaza Strip, medical sources at the Al-Aqsa Martyrs Hospital in the city of Deir al-Balah confirmed the arrival of several fatalities and wounded following an Israeli airstrike which targeted a residential building in the city.
Intense Israeli air raids were reported also in the nearby refugee camps of Nuseirat, Maghazi, and the town of Al-Zawaida in the central region of the strip.
Additionally, Israeli airstrikes targeted areas in the neighborhoods of Tuffah and Shujaeya east of Gaza City, as well as various locations in northern Gaza.
Furthermore, Israeli occupation forces renewed their artillery bombardment on the Jabalia refugee camp in northern Gaza, which also has been under a ground invasion for weeks.
The ongoing Israeli aggression has brought immense suffering to the civilian Palestinian population in the Gaza Strip, with casualties, particularly among innocent children and healthcare workers, mounting around the clock.
'''

In [None]:
text = '''
Several thousand people demonstrate against antisemitism in Berlin as Germany grapples with a large increase in anti-Jewish incidents following Hamas’s assault on Israel two months ago.
Police estimate that around 3,200 people gathered in the rain in the German capital, while organizers put the figure at 10,000, German news agency dpa reports. Participants in the protest, titled “Never again is now,” march to the Brandenburg Gate.
Germany’s labor minister, Hubertus Heil, says that many decent people are too quiet on the growing antisemitic sentiment in the country: “We don’t need a decent, silent majority — we need a clear and loud majority that stands up now, and not later,” he says.
The event had wide support, with the speaker of the German parliament and Berlin’s mayor among its backers.
'''

In [None]:
predict(tokenizer, model, text)

array([[6.569553e-05, 9.999343e-01]], dtype=float32)
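
The raw probability array can be turned into a readable label. The sketch below assumes that column 0 is the negative (anti-Israel) class and column 1 is the positive (pro-Israel) class; this ordering is an assumption, and `LABELS` and `describe_prediction` are hypothetical names introduced only for illustration.

```python
import numpy as np

# Assumed column order: index 0 = negative, index 1 = positive.
LABELS = {0: "negative (anti-Israel)", 1: "positive (pro-Israel)"}

def describe_prediction(probs):
    """Return (label, probability) for the most likely class of a single text."""
    row = np.asarray(probs)[0]      # predict() returns one row per input text
    idx = int(np.argmax(row))       # index of the most probable class
    return LABELS[idx], float(row[idx])

# The probabilities printed above for the last text
probs = np.array([[6.569553e-05, 9.999343e-01]], dtype=np.float32)
label, confidence = describe_prediction(probs)
print(f"{label}: {confidence:.2%}")
```

Under that assumption, the last text (the report on the demonstration against antisemitism in Berlin) falls into the positive class with a probability close to 1.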

## 5. Uploading the model to the Hugging Face Hub

In [41]:
from huggingface_hub import notebook_login
notebook_login()


In [43]:
tokenizer.push_to_hub("news_sentiment_model")
model.push_to_hub("news_sentiment_model")

tf_model.h5:   0%|          | 0.00/268M [00:00<?, ?B/s]
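
Once the upload finishes, the tokenizer and the model can be loaded back from the Hub in a new session. This is a minimal sketch: the repository id is a placeholder (replace `your-username` with the account used in `notebook_login()`), `load_pushed_model` is a hypothetical helper, and it assumes the pushed checkpoint is a TensorFlow sequence classification model loadable with `TFAutoModelForSequenceClassification`.

```python
# Placeholder repository id: the account name is an assumption, not taken from the notebook.
REPO_ID = "your-username/news_sentiment_model"

def load_pushed_model(repo_id=REPO_ID):
    """Download the tokenizer and the TF model previously pushed to the Hub."""
    # Imported lazily so the function can be defined even before transformers is installed.
    from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = TFAutoModelForSequenceClassification.from_pretrained(repo_id)
    return tokenizer, model

# Usage (needs network access and an existing repository):
# tokenizer, model = load_pushed_model()
# predict(tokenizer, model, text)
```

The loaded pair can be passed directly to `predict` from section 4.4.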