# 🧠 Media Bias Classifier

This project proposes a machine learning-based News Bias Detection System that analyses news articles to identify potential biases in reporting. By leveraging natural language processing (NLP) and explainable AI techniques, the system will assess textual content and detect linguistic patterns that indicate bias.

Bias in news media can shape public opinion and influence decision-making, making it essential to recognize and mitigate biased reporting. Traditional methods of bias detection rely on human judgment, which is subjective, time-consuming, and inconsistent. This project aims to build an automated and transparent system for detecting bias, promoting more critical and informed media consumption.

In [2]:
import json
import os
import re
import shutil
import string

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
import re
import nltk
import unicodedata
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download necessary resources
nltk.download('stopwords')
nltk.download('punkt')

# Load stopwords once (outside function)
stop_words = set(stopwords.words("english"))

def remove_accents(text):
    return ''.join(
        c for c in unicodedata.normalize('NFD', text)
        if unicodedata.category(c) != 'Mn'
    )

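def preprocess_text(text):
    """Lowercase, strip non-letters, and drop English stopwords."""
    if not isinstance(text, str) or text.strip() == "":
        return ""

    text = text.lower()

    # Non-ASCII characters are stripped first, so accented letters are dropped
    # rather than transliterated ("tèst" becomes "tst"); remove_accents then
    # has nothing left to change.
    text = re.sub(r'[^\x00-\x7F]+', '', text)
    text = remove_accents(text)

    text = re.sub(r'[^a-z\s]', '', text)  # Keeps only letters and whitespace
    text = re.sub(r'\s+', ' ', text).strip()  # Collapses repeated whitespace

    # Drop stopwords; the text is now plain words separated by single spaces
    return ' '.join(word for word in text.split() if word not in stop_words)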

# Example usage
sample_text = "𝔗𝔥𝔦𝔰 is a Tèst! The government is **CORRUPT**!!! 1234"
cleaned_text = preprocess_text(sample_text)
print("Cleaned Text: ", cleaned_text)


Cleaned Text:  tst government corrupt


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\wtert\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\wtert\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
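
The sample output above shows "Tèst" reduced to "tst", because the accented letter is deleted instead of being mapped to "e". If keeping such words intact matters, one option is to decompose the text before filtering to ASCII. The following is a minimal sketch of that alternative, not the preprocessing used for the results in this notebook; the helper names `fold_to_ascii` and `clean_keep_accented_words` are hypothetical.

```python
import re
import unicodedata


def fold_to_ascii(text):
    """Transliterate accented letters to ASCII before dropping what remains."""
    # NFKD splits 'è' into 'e' plus a combining accent and maps compatibility
    # characters (for example the ligature 'ﬁ') to plain letters.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (category 'Mn') left behind by decomposition.
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    # Anything still outside ASCII has no plain equivalent and is removed.
    return stripped.encode("ascii", "ignore").decode("ascii")


def clean_keep_accented_words(text):
    """Hypothetical variant of the cleaning step that folds accents first."""
    text = fold_to_ascii(text).lower()
    text = re.sub(r"[^a-z\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


print(clean_keep_accented_words("Tèst café"))
```

Switching to this ordering would change the cleaned text, so any reported scores would need to be recomputed.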


# 🔎 Analytic Approach

Detecting bias in articles is an abstract objective, so it helps to pin it down before modelling. The obvious first question is how to detect bias, but a more useful one comes earlier: what is bias, and how should it be defined for this task?

**Multi-Class Classification** is presumably the better option. Instead of deciding whether an article is biased or unbiased, the model predicts whether it leans left, center, or right. This eliminates the need for an "unbiased" class, as the left, center, and right all contain some form of bias.
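
As a small illustration of the three-class framing, the sketch below encodes the labels `left`, `center`, and `right` as integers with scikit-learn's `LabelEncoder`. The short `labels` list is made up for the example; in the notebook the labels come from the `bias_text` column.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels standing in for the bias_text column.
labels = ["left", "right", "center", "left"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# LabelEncoder sorts the class names, so center=0, left=1, right=2.
print(list(encoder.classes_))
print(encoded.tolist())

# inverse_transform maps integer predictions back to readable labels.
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

Because the mapping is alphabetical, `encoder.classes_` can be passed to reporting functions to show class names instead of integers.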

In [4]:
# json_folder = r"C:\Fontys\Semester4\MediaBias_Predictor\Article-Bias-Prediction-main\data\jsons"

# data = []

# # Loop through JSON files
# for filename in os.listdir(json_folder):
#     if filename.endswith(".json"):
#         file_path = os.path.join(json_folder, filename)
#         with open(file_path, "r", encoding="utf-8") as f:
#             article = json.load(f)
            
#             # Extract only the fields present in the JSON
#             data.append({
#                 "ID": article.get("ID", ""),  
#                 "topic": article.get("topic", ""),  
#                 "source": article.get("source", ""),  
#                 "title": article.get("title", ""),  
#                 "date": article.get("date", ""),  
#                 "authors": article.get("authors", ""),  
#                 "content": article.get("content", ""),  
#                 "bias_text": article.get("bias_text", ""),  # Bias category (e.g., left, center, right)
#                 "url": article.get("url", ""),  
#                 "source_url": article.get("source_url", "")  
#             })

df = pd.read_csv("news_bias_data.csv")

# df.to_csv("news_bias_data.csv", index=False)


In [5]:
df.head()

|   | ID | topic | source | title | date | authors | content | bias_text | url | source_url |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 004Gt3gcsotuiYmz | terrorism | New York Times - News | Bomb Suspect Changed After Trip Abroad, Friend... | 2016-09-20 | N. R. Kleinfield | Besides his most recent trip to Quetta , Mr. R... | left | http://www.nytimes.com/2016/09/20/nyregion/ahm... | www.nytimes.com |
| 1 | 00eP4XD3VdMmHITE | supreme_court | Vox | Why Susan Collins claims she’s being bribed ov... | 2018-09-12 | Emily Stewart, Terry Nguyen, Rebecca Jennings,... | Is Maine Republican Sen. Susan Collins being b... | left | https://www.vox.com/policy-and-politics/2018/9... | www.vox.com |
| 2 | 00FTGIZEd6B8zQ4U | education | Ezra Klein | Poll: Prestigious Colleges Won't Make You Happ... | 2014-05-06 | Anya Kamenetz | Poll : Prestigious Colleges Wo n't Make You Ha... | left | http://www.npr.org/blogs/thetwo-way/2014/05/06... | www.npr.org |
| 3 | 00HGGqBRf1kzPRlg | us_house | Breitbart News | Paul Ryan Reportedly Says No Chance for Border... | 2017-09-12 | Ian Mason | House Speaker Paul Ryan , at a private dinner ... | right | http://www.breitbart.com/big-government/2017/0... | www.breitbart.com |
| 4 | 00IzI5ynahBVtC9l | white_house | Guest Writer - Left | OPINION: Trump seeking change of legal fortune... | 2019-07-11 | Analysis Stephen Collinson | ( CNN ) President Donald Trump has reason to h... | left | https://www.cnn.com/2019/07/11/politics/donald... | www.cnn.com |


In [6]:
df["text"] = df["title"] + " " + df["content"]
features = ["source", "authors", "text", "bias_text"]

df[features].head()

|   | source | authors | text | bias_text |
| --- | --- | --- | --- | --- |
| 0 | New York Times - News | N. R. Kleinfield | Bomb Suspect Changed After Trip Abroad, Friend... | left |
| 1 | Vox | Emily Stewart, Terry Nguyen, Rebecca Jennings,... | Why Susan Collins claims she’s being bribed ov... | left |
| 2 | Ezra Klein | Anya Kamenetz | Poll: Prestigious Colleges Won't Make You Happ... | left |
| 3 | Breitbart News | Ian Mason | Paul Ryan Reportedly Says No Chance for Border... | right |
| 4 | Guest Writer - Left | Analysis Stephen Collinson | OPINION: Trump seeking change of legal fortune... | left |


In [7]:
# Check for missing values
print(df.isnull().sum())

# Check the distribution of bias_text labels
print(df['bias_text'].value_counts())

# No rows are dropped here; missing values only occur in date and authors,
# which are not used as model inputs
print("\nDataset shape:")
print(df.shape)

ID               0
topic            0
source           0
title            0
date          4407
authors       9668
content          0
bias_text        0
url              0
source_url       0
text             0
dtype: int64
bias_text
right     13734
left      13005
center    10815
Name: count, dtype: int64

Dataset shape:
(37554, 11)
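
The label counts above give a useful reference point: a classifier that always predicts the most frequent class sets the floor that any real model should beat. The sketch below computes that majority-class baseline directly from the printed counts; the variable names are illustrative.

```python
# Label counts copied from the value_counts() output above.
counts = {"right": 13734, "left": 13005, "center": 10815}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a model that always predicts the most frequent label.
baseline = counts[majority_label] / total

print(f"Total articles: {total}")
print(f"Majority class: {majority_label}")
print(f"Baseline accuracy: {baseline:.3f}")
```

The three classes are fairly balanced, so the baseline sits only slightly above one third.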


In [8]:
print(nltk.data.find("corpora/wordnet.zip"))
df["cleaned_text"] = df["text"].apply(preprocess_text)

df[["text", "cleaned_text"]].head()

C:\Users\wtert\AppData\Roaming\nltk_data\corpora\wordnet.zip


|   | text | cleaned_text |
| --- | --- | --- |
| 0 | Bomb Suspect Changed After Trip Abroad, Friend... | bomb suspect changed trip abroad friends say b... |
| 1 | Why Susan Collins claims she’s being bribed ov... | susan collins claims shes bribed kavanaugh vot... |
| 2 | Poll: Prestigious Colleges Won't Make You Happ... | poll prestigious colleges wont make happier li... |
| 3 | Paul Ryan Reportedly Says No Chance for Border... | paul ryan reportedly says chance border wall p... |
| 4 | OPINION: Trump seeking change of legal fortune... | opinion trump seeking change legal fortune lon... |


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

vectorizer = TfidfVectorizer(
    stop_words="english",  # must be the string "english", not a bare name
    lowercase=True,
)

label_encoder = LabelEncoder() 
X = vectorizer.fit_transform(df["cleaned_text"])
y = label_encoder.fit_transform(df["bias_text"])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
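
The cell above fits the TF-IDF vocabulary on the whole dataset before splitting, so the test articles influence the features. A common alternative is to split the raw text first and wrap the vectorizer and classifier in a pipeline, so that the vocabulary is learned from the training split only. The sketch below shows the idea on a made-up toy corpus; it is not the setup that produced the scores in this notebook, and `stratify` is an added assumption to keep class proportions equal in both splits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up toy corpus standing in for df["cleaned_text"] and df["bias_text"].
texts = [
    "workers union strike wages", "workers demand healthcare reform",
    "workers rally climate justice", "workers protest inequality housing",
    "taxes cut business growth", "taxes burden small business",
    "taxes freedom market border", "taxes spending deficit security",
    "report committee hearing testimony", "report officials release figures",
    "report senate vote schedule", "report agency publishes data",
]
labels = ["left"] * 4 + ["right"] * 4 + ["center"] * 4

# Split the raw text first, keeping class proportions equal in both parts.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then the classifier.
pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(train_texts, train_labels)

predictions = pipeline.predict(test_texts)
print(list(zip(test_labels, predictions)))
```

A pipeline also makes later cross-validation simpler, because each fold refits the vectorizer on its own training portion.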

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)

model.fit(X_train, y_train)

In [None]:
from sklearn.metrics import accuracy_score, classification_report

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Detailed evaluation (precision, recall, f1-score, etc.)
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=label_encoder.classes_))


Accuracy: 70.52%
Classification Report:
              precision    recall  f1-score   support

      center       0.71      0.66      0.69      2195
        left       0.72      0.70      0.71      2603
       right       0.69      0.74      0.72      2713

    accuracy                           0.71      7511
   macro avg       0.71      0.70      0.70      7511
weighted avg       0.71      0.71      0.70      7511



In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=50, random_state=42)

rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)

from sklearn.metrics import classification_report, accuracy_score

print("Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf, target_names=label_encoder.classes_))


Accuracy: 0.5164425509253096
              precision    recall  f1-score   support

      center       0.98      0.10      0.18      2195
        left       0.62      0.48      0.54      2603
       right       0.46      0.89      0.61      2713

    accuracy                           0.52      7511
   macro avg       0.69      0.49      0.44      7511
weighted avg       0.67      0.52      0.46      7511
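
The random forest report shows very low recall for `center` (0.10) alongside high recall for `right` (0.89), which suggests many articles are being pushed into the `right` class. A confusion matrix makes that kind of pattern visible. The sketch below uses small made-up label lists purely to show the mechanics; on the real data, `y_test` and `y_pred_rf` would be used instead.

```python
from sklearn.metrics import confusion_matrix

class_names = ["center", "left", "right"]

# Made-up true and predicted labels, only to demonstrate the layout.
y_true = ["center", "center", "center", "left", "left", "right", "right", "right"]
y_hat = ["right", "right", "center", "left", "right", "right", "right", "left"]

# Rows are true classes, columns are predicted classes, in class_names order.
matrix = confusion_matrix(y_true, y_hat, labels=class_names)
print(matrix)

# Per-class recall is the diagonal divided by each row total.
recall = matrix.diagonal() / matrix.sum(axis=1)
for name, value in zip(class_names, recall):
    print(f"{name}: recall {value:.2f}")
```

Reading along a row shows where the articles of one true class end up, which is more informative than a single accuracy figure.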



In [None]:
from sklearn.svm import SVC  # Import Support Vector Classifier (SVM)

svm_model = SVC(kernel='linear', random_state=42)  # Linear kernel is commonly used for text classification

svm_model.fit(X_train, y_train)

y_pred_svm = svm_model.predict(X_test)

from sklearn.metrics import classification_report, accuracy_score

print("Accuracy:", accuracy_score(y_test, y_pred_svm))
print(classification_report(y_test, y_pred_svm, target_names=label_encoder.classes_))


Accuracy: 0.7023032885101851
              precision    recall  f1-score   support

      center       0.70      0.68      0.69      2195
        left       0.71      0.70      0.70      2603
       right       0.70      0.73      0.71      2713

    accuracy                           0.70      7511
   macro avg       0.70      0.70      0.70      7511
weighted avg       0.70      0.70      0.70      7511
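
Training `SVC(kernel='linear')` on tens of thousands of sparse TF-IDF vectors can be slow. scikit-learn also provides `LinearSVC`, which is typically much faster on this kind of data. It is a different estimator, not a drop-in equivalent: by default it uses a squared hinge loss and a one-vs-rest scheme, so its scores may differ from those above. The sketch below only demonstrates the API on a made-up toy corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up toy corpus standing in for the cleaned articles.
texts = [
    "workers union strike", "workers wages rise", "workers rights march",
    "taxes cut plan", "taxes burden business", "taxes freedom market",
    "report figures released", "report committee hearing", "report officials statement",
]
labels = ["left"] * 3 + ["right"] * 3 + ["center"] * 3

features = TfidfVectorizer().fit_transform(texts)

# LinearSVC works directly on the sparse TF-IDF matrix.
linear_svm = LinearSVC(random_state=42)
linear_svm.fit(features, labels)

# Predictions on the training texts, shown only to confirm the model runs.
train_predictions = linear_svm.predict(features)
print(list(train_predictions))
```

Whether it matches the accuracy of the linear-kernel `SVC` on this dataset would need to be checked by running it on the real train/test split.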

