In this stage, the goal is to clean the dataset and assign an initial sentiment label to each article in the target column, Sentiment_Bias.

In [9]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('Labeled_Dataset.csv')

In [3]:
df.head()

Unnamed: 0,Source,Link,Headline,Description,Timestamp,Date,Topic,Author,Region,Article_Content,Processed_Content,Sentiment_Bias
0,Al Jazeera,https://www.aljazeera.com/tag/israel-palestine...,Israel-Palestine conflict | Today's latest fro...,How Israel destroyed Gaza · 'The birds are wit...,3 Oct 2024,2024-10-14,Israel War,AlJazeera,Middle East,‘Self-defence’ has vastly different meanings f...,"['selfdefence', 'vastly', 'different', 'meanin...",Negative
1,Al Jazeera,https://www.aljazeera.com/tag/gaza/,Gaza | Today's latest from Al Jazeera,... Israeli. Nicaragua breaks diplomatic ties ...,12 Oct 2024,2024-10-14,Israel War,AlJazeera,Middle East,‘Self-defence’ has vastly different meanings f...,"['selfdefence', 'vastly', 'different', 'meanin...",Negative
2,Al Jazeera,https://www.aljazeera.com/news/2023/9/28/turki...,Turkish neutrality: How Erdogan manages ties w...,"Sep 28, 2023 ... But Erdogan's stance does hel...",Last update 28 Sep 2023,2024-10-14,Ukraine War,AlJazeera,Ukraine,"‘The West is reliable, Russia is equally relia...","['west', 'reliable', 'russia', 'equally', 'rel...",Neutral
3,Al Jazeera,https://www.aljazeera.com/features/2016/11/8/u...,US elections in Nigeria: 'The best reality TV ...,"Nov 8, 2016 ... Efeoghene Ori-Jesu, 34, is wat...",Last update 8 Nov 2016,2024-10-15,US Presidential Elections,AlJazeera,USA,“I’m excited at the possibility of a first fem...,"['im', 'excited', 'possibility', 'first', 'fem...",Positive
4,Al Jazeera,https://www.aljazeera.com/news/liveblog/2024/9...,Israel's war on Gaza updates: New blasts in Le...,"Sep 18, 2024 ... A day after simultaneous blas...",Last update 19 Sep 2024,2024-10-14,Israel War,AlJazeera,Middle East,A day after simultaneous blasts across Lebanon...,"['day', 'simultaneous', 'blast', 'across', 'le...",Negative


Data Processing: In this step, we process the text feature, Article_Content. The main steps are cleaning, tokenization, stop-word removal, and lemmatization.

In [3]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# nltk.download('punkt')         # For tokenization
# nltk.download('stopwords')     # For stop word removal
# nltk.download('wordnet')       # For lemmatization
# nltk.download('omw-1.4')       # Additional lemmatization data

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

In [4]:
def preprocess_text(text, max_length=50):
    # NOTE: max_length is currently unused; it is kept so existing calls still work.
    # Clean text: keep ASCII letters and whitespace only, then lowercase
    text = re.sub(r'[^A-Za-z\s]', '', text).lower().strip()

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stop words
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatize
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    return tokens
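
To see why tokens such as selfdefence and im show up in the processed output, the cleaning step can be run on its own. The following is a minimal sketch using only the standard library; clean_text is a hypothetical helper name that isolates the regex from preprocess_text, and the sample strings are the visible beginnings of two articles from the table above.

```python
import re

def clean_text(text):
    # Same regex as in preprocess_text: keep ASCII letters and whitespace only,
    # then lowercase and trim.
    return re.sub(r'[^A-Za-z\s]', '', text).lower().strip()

samples = [
    "‘Self-defence’ has vastly different meanings",
    "“I’m excited at the possibility",
]
for s in samples:
    print(clean_text(s))
# selfdefence has vastly different meanings
# im excited at the possibility
```

Because hyphens and apostrophes are deleted rather than replaced by a space, hyphenated words and contractions are fused into a single token.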

In [5]:
df['Processed_Content'] = df['Article_Content'].apply(preprocess_text)

In [6]:
df[['Article_Content', 'Processed_Content']].head(10)

Unnamed: 0,Article_Content,Processed_Content
0,‘Self-defence’ has vastly different meanings f...,"[selfdefence, vastly, different, meaning, colo..."
1,‘Self-defence’ has vastly different meanings f...,"[selfdefence, vastly, different, meaning, colo..."
2,"‘The West is reliable, Russia is equally relia...","[west, reliable, russia, equally, reliable, tu..."
3,“I’m excited at the possibility of a first fem...,"[im, excited, possibility, first, female, pres..."
4,A day after simultaneous blasts across Lebanon...,"[day, simultaneous, blast, across, lebanon, le..."
5,A dozen Palestinians killed in Israeli militar...,"[dozen, palestinian, killed, israeli, military..."
6,A German research institute is tracking the fu...,"[german, research, institute, tracking, fundin..."
7,A look at the devastating toll Israel’s war on...,"[look, devastating, toll, israel, war, gaza, t..."
8,A Reuters investigation found that the Biden a...,"[reuters, investigation, found, biden, adminis..."
9,A school sheltering displaced Palestinians in ...,"[school, sheltering, displaced, palestinian, g..."


Classify the bias of articles with a model that yields the best results.
Options considered are TextBlob, vaderSentiment, and Transformers.

In [2]:
from transformers import pipeline

sentiment_pipeline = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment")

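def classify_sentiment(text):
    # Only the first 512 characters of the article are classified
    # (a character slice, not a token limit).
    result = sentiment_pipeline(text[:512])
    label = result[0]['label']

    # Model labels: LABEL_0 = negative, LABEL_1 = neutral, LABEL_2 = positive
    if label == "LABEL_0":
        sentiment_label = "Negative"
    elif label == "LABEL_2":
        sentiment_label = "Positive"
    else:
        sentiment_label = "Neutral"

    return sentiment_label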
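
Since classify_sentiment only looks at the first 512 characters, the label reflects the opening of each article. One possible alternative, shown here only as a sketch and not as what this notebook does, is to label every 512-character window and take a majority vote. The names chunk_text, classify_long_text, and toy_classifier are hypothetical; toy_classifier is a stand-in so the example runs without downloading a model, and in practice classify_sentiment could be passed in its place. Character windows may cut words at the boundaries.

```python
from collections import Counter

def chunk_text(text, size=512):
    """Split text into consecutive character windows of at most `size`."""
    return [text[i:i + size] for i in range(0, len(text), size)] or [""]

def classify_long_text(text, classify_fn, size=512):
    """Label every window with classify_fn and return the most common label."""
    labels = [classify_fn(chunk) for chunk in chunk_text(text, size)]
    # On a tie, Counter.most_common keeps the label that was seen first.
    return Counter(labels).most_common(1)[0][0]

# Stand-in classifier (an assumption for illustration, not a real model).
def toy_classifier(chunk):
    return "Negative" if "war" in chunk.lower() else "Neutral"

print(classify_long_text("war " * 300, toy_classifier))  # Negative
```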




In [5]:
result = df['Article_Content'].apply(classify_sentiment)

In [6]:
df["Sentiment_Bias"] = result


In [27]:
df.head()

Unnamed: 0,Source,Link,Headline,Description,Timestamp,Date,Topic,Author,Region,Article_Content,Processed_Content,Sentiment_Bias,Keywords
0,Al Jazeera,https://www.aljazeera.com/tag/israel-palestine...,Israel-Palestine conflict | Today's latest fro...,How Israel destroyed Gaza · 'The birds are wit...,3 Oct 2024,2024-10-14,Israel War,AlJazeera,Middle East,‘Self-defence’ has vastly different meanings f...,"['selfdefence', 'vastly', 'different', 'meanin...",Negative,"[genocide, killing, canadians, killed, war, he..."
1,Al Jazeera,https://www.aljazeera.com/tag/gaza/,Gaza | Today's latest from Al Jazeera,... Israeli. Nicaragua breaks diplomatic ties ...,12 Oct 2024,2024-10-14,Israel War,AlJazeera,Middle East,‘Self-defence’ has vastly different meanings f...,"['selfdefence', 'vastly', 'different', 'meanin...",Negative,"[killed, siege, bomb, attack, injured, palesti..."
2,Al Jazeera,https://www.aljazeera.com/news/2023/9/28/turki...,Turkish neutrality: How Erdogan manages ties w...,"Sep 28, 2023 ... But Erdogan's stance does hel...",Last update 28 Sep 2023,2024-10-14,Ukraine War,AlJazeera,Ukraine,"‘The West is reliable, Russia is equally relia...","['west', 'reliable', 'russia', 'equally', 'rel...",Neutral,"[putin, 1850s, russian, pbs, russia, president..."
3,Al Jazeera,https://www.aljazeera.com/features/2016/11/8/u...,US elections in Nigeria: 'The best reality TV ...,"Nov 8, 2016 ... Efeoghene Ori-Jesu, 34, is wat...",Last update 8 Nov 2016,2024-10-15,US Presidential Elections,AlJazeera,USA,“I’m excited at the possibility of a first fem...,"['im', 'excited', 'possibility', 'first', 'fem...",Positive,"[president, trump, americans, clinton, tonight..."
4,Al Jazeera,https://www.aljazeera.com/news/liveblog/2024/9...,Israel's war on Gaza updates: New blasts in Le...,"Sep 18, 2024 ... A day after simultaneous blas...",Last update 19 Sep 2024,2024-10-14,Israel War,AlJazeera,Middle East,A day after simultaneous blasts across Lebanon...,"['day', 'simultaneous', 'blast', 'across', 'le...",Negative,"[killed, explosions, wounded, blasts, hundreds..."


In [8]:
df.to_csv("Labeled_Dataset.csv", index=False)

We also need to extract the keywords from the articles that lead to the sentiment classification, as this will help us perform a detailed linguistic analysis of the articles.

In [4]:
import shap
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)





In [None]:
from keybert import KeyBERT

# Named kw_model so it does not overwrite the RoBERTa `model` loaded above
kw_model = KeyBERT('distilbert-base-nli-mean-tokens')


In [None]:

def extract_keywords(text, num_keywords=20):
    keywords = kw_model.extract_keywords(text, keyphrase_ngram_range=(1, 1), stop_words='english', top_n=num_keywords)
    return [kw[0] for kw in keywords]




In [21]:
keywords_column = []
for i, text in enumerate(df['Article_Content'], start=1):
    # Number of keywords scales with article length; the modulo wraps the
    # count back to 0 every 1000 characters.
    n_keywords = (len(text)//10)%100
    print(f"Extracting {n_keywords} keywords from article {i}")
    keywords = extract_keywords(text, n_keywords)
    keywords_column.append(keywords or [])


Extracting 43 keywords from article 1
Extracting 41 keywords from article 2
Extracting 92 keywords from article 3
Extracting 90 keywords from article 4
Extracting 20 keywords from article 5
Extracting 25 keywords from article 6
Extracting 33 keywords from article 7
Extracting 0 keywords from article 8
Extracting 96 keywords from article 9
Extracting 19 keywords from article 10
Extracting 61 keywords from article 11
Extracting 53 keywords from article 12
Extracting 52 keywords from article 13
Extracting 33 keywords from article 14
Extracting 3 keywords from article 15
Extracting 71 keywords from article 16
Extracting 88 keywords from article 17
Extracting 38 keywords from article 18
Extracting 28 keywords from article 19
Extracting 79 keywords from article 20
Extracting 69 keywords from article 21
Extracting 99 keywords from article 22
Extracting 77 keywords from article 23
Extracting 43 keywords from article 24
Extracting 0 keywords from article 25
Extracting 81 keywords from article 2
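
The keyword count in the loop is (len(text)//10)%100, so it wraps around every 1000 characters; this is one way the 0 counts in the log above can arise. The sketch below compares that formula with a clamped alternative. clamped_budget and its low/high bounds are assumptions for illustration only, not values used in this notebook.

```python
def original_budget(n_chars):
    # Formula used in the loop above: wraps around every 1000 characters.
    return (n_chars // 10) % 100

def clamped_budget(n_chars, low=5, high=50):
    # Hypothetical alternative: one keyword per 10 characters,
    # kept within the range [low, high].
    return max(low, min(high, n_chars // 10))

for n in (430, 1000, 2035):
    print(n, original_budget(n), clamped_budget(n))
# 430 43 43
# 1000 0 50
# 2035 3 50
```

With the clamped version, a longer article never receives fewer keywords than a shorter one.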

In [26]:
df['Keywords'] = keywords_column

In [32]:
df.to_csv("Labeled_Dataset_with_Keywords.csv", index=False)

TextBlob can also be used to classify the sentiment of articles. We used Transformers instead, which are more powerful and can handle more complex tasks; this is just an alternate approach.

In [25]:
from textblob import TextBlob

def sentiment_score(article_content):
    blob = TextBlob(article_content)
    polarity = blob.sentiment.polarity

    if polarity > 0.1:
        return 'positive'
    elif polarity < -0.1:
        return 'negative'
    else:
        return 'neutral'
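
To judge how far the two approaches agree, their labels can be compared position by position. Note that classify_sentiment returns capitalized labels ("Negative") while sentiment_score returns lowercase ones ('negative'), so the comparison should ignore case. A minimal sketch follows; agreement_rate is a hypothetical helper and the two label lists are made-up toy values, not results from the dataset.

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of positions where two label lists agree, ignoring case."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    if not labels_a:
        return 0.0
    matches = sum(a.lower() == b.lower() for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Made-up toy labels for illustration only.
transformer_labels = ["Negative", "Negative", "Neutral", "Positive"]
textblob_labels = ["negative", "neutral", "neutral", "positive"]
print(agreement_rate(transformer_labels, textblob_labels))  # 0.75
```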
