In this stage, the goal is to clean the dataset and assign an initial sentiment label to the target column.

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('Labeled_Dataset.csv')

In [4]:
df.head()

Unnamed: 0.1,Unnamed: 0,Source,Link,Headline,Description,Timestamp,Date,Topic,Author,Region,Article_Content,Processed_Content,Sentiment_Bias
0,0,Al Jazeera,https://www.aljazeera.com/tag/israel-palestine...,Israel-Palestine conflict | Today's latest fro...,How Israel destroyed Gaza · 'The birds are wit...,3 Oct 2024,2024-10-14,Israel War,AlJazeera,Middle East,‘Self-defence’ has vastly different meanings f...,"['selfdefence', 'vastly', 'different', 'meanin...",Highly Negative
1,1,Al Jazeera,https://www.aljazeera.com/tag/gaza/,Gaza | Today's latest from Al Jazeera,... Israeli. Nicaragua breaks diplomatic ties ...,12 Oct 2024,2024-10-14,Israel War,AlJazeera,Middle East,‘Self-defence’ has vastly different meanings f...,"['selfdefence', 'vastly', 'different', 'meanin...",Highly Negative
2,2,Al Jazeera,https://www.aljazeera.com/news/2023/9/28/turki...,Turkish neutrality: How Erdogan manages ties w...,"Sep 28, 2023 ... But Erdogan's stance does hel...",Last update 28 Sep 2023,2024-10-14,Ukraine War,AlJazeera,Ukraine,"‘The West is reliable, Russia is equally relia...","['west', 'reliable', 'russia', 'equally', 'rel...",Neutral
3,3,Al Jazeera,https://www.aljazeera.com/features/2016/11/8/u...,US elections in Nigeria: 'The best reality TV ...,"Nov 8, 2016 ... Efeoghene Ori-Jesu, 34, is wat...",Last update 8 Nov 2016,2024-10-15,US Presidential Elections,AlJazeera,USA,“I’m excited at the possibility of a first fem...,"['im', 'excited', 'possibility', 'first', 'fem...",Highly Positive
4,4,Al Jazeera,https://www.aljazeera.com/news/liveblog/2024/9...,Israel's war on Gaza updates: New blasts in Le...,"Sep 18, 2024 ... A day after simultaneous blas...",Last update 19 Sep 2024,2024-10-14,Israel War,AlJazeera,Middle East,A day after simultaneous blasts across Lebanon...,"['day', 'simultaneous', 'blast', 'across', 'le...",Highly Negative


Data Processing: In this step, we process the target feature, Article_Content. The main steps are cleaning, tokenization, stop-word removal, and lemmatization.

In [3]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# nltk.download('punkt')         # For tokenization
# nltk.download('stopwords')     # For stop word removal
# nltk.download('wordnet')       # For lemmatization
# nltk.download('omw-1.4')       # Additional lemmatization data

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

In [4]:
def preprocess_text(text, max_length=50):
    # Note: max_length is currently unused; the full token list is returned.

    # Clean text: keep only letters and whitespace, then lowercase
    text = re.sub(r'[^A-Za-z\s]', '', text).lower().strip()

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stop words
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatize
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    return tokens

In [5]:
df['Processed_Content'] = df['Article_Content'].apply(preprocess_text)

In [6]:
df[['Article_Content', 'Processed_Content']].head(10)

Unnamed: 0,Article_Content,Processed_Content
0,‘Self-defence’ has vastly different meanings f...,"[selfdefence, vastly, different, meaning, colo..."
1,‘Self-defence’ has vastly different meanings f...,"[selfdefence, vastly, different, meaning, colo..."
2,"‘The West is reliable, Russia is equally relia...","[west, reliable, russia, equally, reliable, tu..."
3,“I’m excited at the possibility of a first fem...,"[im, excited, possibility, first, female, pres..."
4,A day after simultaneous blasts across Lebanon...,"[day, simultaneous, blast, across, lebanon, le..."
5,A dozen Palestinians killed in Israeli militar...,"[dozen, palestinian, killed, israeli, military..."
6,A German research institute is tracking the fu...,"[german, research, institute, tracking, fundin..."
7,A look at the devastating toll Israel’s war on...,"[look, devastating, toll, israel, war, gaza, t..."
8,A Reuters investigation found that the Biden a...,"[reuters, investigation, found, biden, adminis..."
9,A school sheltering displaced Palestinians in ...,"[school, sheltering, displaced, palestinian, g..."


In [7]:
df[['Article_Content', 'Processed_Content']].tail(10)

Unnamed: 0,Article_Content,Processed_Content
5003,Austria has shut down a mosque and an Islamic ...,"[austria, shut, mosque, islamic, association, ..."
5004,Austria’s right-wing government has agreed to ...,"[austria, rightwing, government, agreed, make,..."
5005,Authorities in Germany say there has been an i...,"[authority, germany, say, increasing, number, ..."
5006,Baroness Sayeeda Warsi has joined public calls...,"[baroness, sayeeda, warsi, joined, public, cal..."
5007,Blogger Amani Al-Khatahtbeh says she got into ...,"[blogger, amani, alkhatahtbeh, say, got, alter..."
5008,"Bodies of two women, aged 20 and 22, were foun...","[body, two, woman, aged, found, rented, apartm..."
5009,Bosnian genocide survivor and researcher Arnes...,"[bosnian, genocide, survivor, researcher, arne..."
5010,Both Israeli Prime Minister Benjamin Netanyahu...,"[israeli, prime, minister, benjamin, netanyahu..."
5011,But Benny Gantz was chief of the Israeli army ...,"[benny, gantz, chief, israeli, army, raid, gaz..."
5012,But the US president has assured his administr...,"[u, president, assured, administration, suppor..."
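
Tokens such as `selfdefence`, `rightwing`, and `im` in the output above come from the cleaning regex, which deletes punctuation outright instead of replacing it with a space. The sketch below isolates that step using only the standard library. `clean_text` and `clean_text_spaced` are names introduced here for illustration; the second is a hypothetical alternative, not part of the notebook, that substitutes a space so hyphenated words split into separate tokens.

```python
import re


def clean_text(text):
    # Same regex as preprocess_text: drop every character that is not
    # an ASCII letter or whitespace, then lowercase and trim.
    return re.sub(r'[^A-Za-z\s]', '', text).lower().strip()


def clean_text_spaced(text):
    # Hypothetical variant: replace non-letters with a space,
    # then collapse runs of whitespace into single spaces.
    spaced = re.sub(r'[^A-Za-z\s]', ' ', text).lower()
    return ' '.join(spaced.split())


sample = "‘Self-defence’ has vastly different meanings"
print(clean_text(sample))         # selfdefence has vastly different meanings
print(clean_text_spaced(sample))  # self defence has vastly different meanings
```

Whether the spaced variant is preferable depends on the analysis: it keeps `self` and `defence` as separate tokens, but it also turns contractions such as "I’m" into `i` and `m`.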


Classify the bias of the articles with the model that yields the best results.
The options considered are TextBlob, vaderSentiment, and Transformers.

In [59]:
from transformers import pipeline

sentiment_pipeline = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment")

def classify_sentiment(text):
    # Only the first 512 characters are scored (a character slice, not a token limit)
    result = sentiment_pipeline(text[:512])
    label = result[0]['label']
    score = result[0]['score']
    
    if label == "LABEL_0":  # Negative sentiment
        if score > 0.75:
            sentiment_label = "Highly Negative"
        elif score > 0.5:
            sentiment_label = "Negative"
        else:
            sentiment_label = "Slightly Negative"
    elif label == "LABEL_2":  # Positive sentiment
        if score > 0.75:
            sentiment_label = "Highly Positive"
        elif score > 0.5:
            sentiment_label = "Positive"
        else:
            sentiment_label = "Slightly Positive"
    else:  # Neutral sentiment
        sentiment_label = "Neutral"
    
    return sentiment_label



In [62]:
result = df['Article_Content'].apply(classify_sentiment)

In [64]:
df["Sentiment_Bias"] = result
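
The threshold logic in `classify_sentiment` can be separated from the model call so that it can be checked without downloading the checkpoint. The following is a minimal sketch under that assumption. `bucket_sentiment` and `LABEL_POLARITY` are names introduced here for illustration, and the label ids follow the comments in the cell above (`LABEL_0` negative, `LABEL_2` positive, anything else neutral).

```python
# Hypothetical helper: maps a (label, score) pair to the notebook's
# seven-level sentiment scheme using the same 0.75 / 0.5 thresholds.
LABEL_POLARITY = {"LABEL_0": "Negative", "LABEL_2": "Positive"}


def bucket_sentiment(label, score):
    polarity = LABEL_POLARITY.get(label)
    if polarity is None:
        # Any other label (e.g. LABEL_1) is treated as neutral.
        return "Neutral"
    if score > 0.75:
        return f"Highly {polarity}"
    if score > 0.5:
        return polarity
    return f"Slightly {polarity}"


print(bucket_sentiment("LABEL_0", 0.91))  # Highly Negative
print(bucket_sentiment("LABEL_2", 0.62))  # Positive
print(bucket_sentiment("LABEL_2", 0.41))  # Slightly Positive
print(bucket_sentiment("LABEL_1", 0.88))  # Neutral
```

Assuming the pipeline reports the softmax probability of the top class, the score of a three-class model cannot fall below 1/3, so the "Slightly" bucket would cover scores between roughly 0.33 and 0.5.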


In [68]:
df.head()

Unnamed: 0,Source,Link,Headline,Description,Timestamp,Date,Topic,Author,Region,Article_Content,Processed_Content,Sentiment_Bias
0,Al Jazeera,https://www.aljazeera.com/tag/israel-palestine...,Israel-Palestine conflict | Today's latest fro...,How Israel destroyed Gaza · 'The birds are wit...,3 Oct 2024,2024-10-14,Israel War,AlJazeera,Middle East,‘Self-defence’ has vastly different meanings f...,"[selfdefence, vastly, different, meaning, colo...",Highly Negative
1,Al Jazeera,https://www.aljazeera.com/tag/gaza/,Gaza | Today's latest from Al Jazeera,... Israeli. Nicaragua breaks diplomatic ties ...,12 Oct 2024,2024-10-14,Israel War,AlJazeera,Middle East,‘Self-defence’ has vastly different meanings f...,"[selfdefence, vastly, different, meaning, colo...",Highly Negative
2,Al Jazeera,https://www.aljazeera.com/news/2023/9/28/turki...,Turkish neutrality: How Erdogan manages ties w...,"Sep 28, 2023 ... But Erdogan's stance does hel...",Last update 28 Sep 2023,2024-10-14,Ukraine War,AlJazeera,Ukraine,"‘The West is reliable, Russia is equally relia...","[west, reliable, russia, equally, reliable, tu...",Neutral
3,Al Jazeera,https://www.aljazeera.com/features/2016/11/8/u...,US elections in Nigeria: 'The best reality TV ...,"Nov 8, 2016 ... Efeoghene Ori-Jesu, 34, is wat...",Last update 8 Nov 2016,2024-10-15,US Presidential Elections,AlJazeera,USA,“I’m excited at the possibility of a first fem...,"[im, excited, possibility, first, female, pres...",Highly Positive
4,Al Jazeera,https://www.aljazeera.com/news/liveblog/2024/9...,Israel's war on Gaza updates: New blasts in Le...,"Sep 18, 2024 ... A day after simultaneous blas...",Last update 19 Sep 2024,2024-10-14,Israel War,AlJazeera,Middle East,A day after simultaneous blasts across Lebanon...,"[day, simultaneous, blast, across, lebanon, le...",Highly Negative


We also need to extract the keywords that drive each article's sentiment classification, as this will help us perform a detailed linguistic analysis of the articles.

In [None]:
import numpy as np
import shap
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()


def predict_proba(texts):
    # SHAP's text masker passes an array of raw strings, so tokenize here
    # instead of unpacking the input directly into the model.
    encoded = tokenizer([str(t) for t in texts], return_tensors="pt",
                        padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**encoded).logits
    return torch.softmax(logits, dim=-1).numpy()


explainer = shap.Explainer(predict_proba, tokenizer)

def get_bias_keywords(text, top_n=10):
    shap_values = explainer([text])  # SHAP values for each token of one article
    tokens = shap_values.data[0]

    # Mean absolute SHAP value per token across the three classes
    shap_scores = np.abs(shap_values.values[0]).mean(axis=1)
    top_indices = np.argsort(shap_scores)[::-1][:top_n]
    return [tokens[i].strip() for i in top_indices]

In [5]:
keywords = df['Article_Content'][:5].apply(get_bias_keywords)

In [None]:
df['Bias_Keywords'] = keywords
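
The ranking step behind the keyword extraction, averaging the absolute attribution of each token across the three classes and keeping the largest, can be illustrated without loading SHAP or the model. This is a sketch only: `top_tokens` is a hypothetical helper, and the tokens and attribution values below are made up for illustration, not taken from the dataset.

```python
def top_tokens(tokens, attributions, top_n=3):
    # attributions[i] holds one score per class for tokens[i].
    mean_abs = [sum(abs(v) for v in row) / len(row) for row in attributions]
    # Rank token positions by mean absolute attribution, largest first.
    ranked = sorted(range(len(tokens)), key=lambda i: mean_abs[i], reverse=True)
    return [tokens[i] for i in ranked[:top_n]]


# Invented example values (negative, neutral, positive) per token.
tokens = ["the", "devastating", "toll", "of", "war"]
attributions = [
    [0.01, -0.02, 0.01],
    [0.40, -0.10, -0.30],
    [0.15, -0.05, -0.10],
    [0.00, 0.01, -0.01],
    [0.30, -0.05, -0.25],
]
print(top_tokens(tokens, attributions))  # ['devastating', 'war', 'toll']
```

Taking the absolute value before averaging means a token counts as a keyword whether it pushes the prediction toward or away from a class.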

In [None]:
# index=False avoids writing the index as an extra 'Unnamed: 0' column
df.to_csv("Labeled_Dataset_with_Keywords.csv", index=False)

In [25]:
from textblob import TextBlob

def sentiment_score(article_content):
    blob = TextBlob(article_content)
    polarity = blob.sentiment.polarity

    if polarity > 0.1:
        return 'positive'
    elif polarity < -0.1:
        return 'negative'
    else:
        return 'neutral'


In [None]:
# TextBlob expects a raw string, so use the original article text rather than the token lists
data = df['Article_Content'][1000:1010]


textblob_results = data.apply(sentiment_score)

print("TextBlob results: \n", textblob_results)

TextBlob results: 
 1000     neutral
1001     neutral
1002    positive
1003    positive
1004     neutral
1005    positive
1006     neutral
1007     neutral
1008    positive
1009     neutral
Name: Article_Content, dtype: object
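
TextBlob produces three labels while the transformer-based `Sentiment_Bias` column uses seven, so comparing the two requires collapsing the finer scheme first. The sketch below shows one way to do that and to compute a simple agreement rate. `collapse_label` and `agreement_rate` are hypothetical helpers, and the sample labels are invented, not taken from the dataset.

```python
def collapse_label(label):
    # Map any of the seven-level labels (or a three-level label)
    # onto 'negative', 'positive', or 'neutral'.
    lowered = label.lower()
    if "negative" in lowered:
        return "negative"
    if "positive" in lowered:
        return "positive"
    return "neutral"


def agreement_rate(labels_a, labels_b):
    # Fraction of positions where both labelers agree after collapsing.
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    if not labels_a:
        return 0.0
    matches = sum(
        collapse_label(a) == collapse_label(b)
        for a, b in zip(labels_a, labels_b)
    )
    return matches / len(labels_a)


# Invented labels for illustration only.
transformer_labels = ["Highly Negative", "Neutral", "Slightly Positive", "Negative"]
textblob_labels = ["neutral", "neutral", "positive", "negative"]
print(agreement_rate(transformer_labels, textblob_labels))  # 0.75
```

With the notebook's data, the two inputs would be `df['Sentiment_Bias']` and the TextBlob results over the same rows.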
