## 🧠 Fake News Detection using Artificial Immune Systems

This notebook compares fake-news classification performance across three approaches:

- AIS-only (negative selection)

- Supervised Logistic Regression

- A Hybrid Ensemble using a weighted combination of AIS and ML scores

All three are evaluated with 5-fold cross-validation so that the comparison does not depend on a single train/test split.



### 📚 [References](https://github.com/KaiDMML/FakeNewsNet)

- Shu, K., Mahudeswaran, D., Wang, S., Lee, D., & Liu, H. (2018). **FakeNewsNet: A Data Repository with News Content, Social Context and Dynamic Information for Studying Fake News on Social Media.** *arXiv preprint arXiv:1809.01286*. [arXiv link](https://arxiv.org/abs/1809.01286)

- Shu, K., Sliva, A., Wang, S., Tang, J., & Liu, H. (2017). **Fake News Detection on Social Media: A Data Mining Perspective.** *ACM SIGKDD Explorations Newsletter*, 19(1), 22–36. [DOI](https://doi.org/10.1145/3137597.3137600)

- Shu, K., Wang, S., & Liu, H. (2017). **Exploiting Tri-Relationship for Fake News Detection.** *arXiv preprint arXiv:1712.07709*. [arXiv link](https://arxiv.org/abs/1712.07709)


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [2]:
# Load real and fake from politifact
basepath = "/Users/ayeshamendoza/repos/fake-news-immune-system"
datapath = os.path.join(basepath, "data/raw")
real = pd.read_csv(os.path.join(datapath, 'politifact_real.csv'))
fake = pd.read_csv(os.path.join(datapath, 'politifact_fake.csv'))

print("Real news shape:", real.shape)
print("Fake news shape:", fake.shape)

print("\nSample real news article:")
print(real.iloc[0])

print("\nSample fake news article:")
print(fake.iloc[0])

Real news shape: (624, 4)
Fake news shape: (432, 4)

Sample real news article:
id                                             politifact14984
news_url                             http://www.nfib-sbet.org/
title              National Federation of Independent Business
tweet_ids    967132259869487105\t967164368768196609\t967215...
Name: 0, dtype: object

Sample fake news article:
id                                             politifact15014
news_url             speedtalk.com/forum/viewtopic.php?t=51650
title        BREAKING: First NFL Team Declares Bankruptcy O...
tweet_ids    937349434668498944\t937379378006282240\t937380...
Name: 0, dtype: object


### Data Preprocessing

To use the text in our models, we first need to convert it to numbers. The following preprocessing steps are applied to each article title:

- Lowercasing and whitespace tokenization
- Stemming (Porter stemmer)
- Removing stopwords
- Removing punctuation
- Vectorization (TF-IDF is imported below, but the models in this notebook are trained on spaCy document embeddings)
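
TF-IDF is listed above but is not computed in the cells that follow. As a rough illustration of what that step would produce, here is a minimal sketch of the textbook weighting (term frequency times log inverse document frequency) on three toy stemmed titles. The helper name `tfidf` and the titles are assumptions for illustration; scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its numbers would differ.

```python
import math
from collections import Counter

def tfidf(docs):
    """Textbook TF-IDF: tf = count / doc length, idf = log(N / document frequency)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy stemmed titles in the style of clean_text output (illustrative only)
docs = [
    "nation feder independ busi",
    "feder budget unit state",
    "budget unit state govern fy 2008",
]
weights = tfidf(docs)
for doc, w in zip(docs, weights):
    print(doc, "->", {t: round(v, 3) for t, v in w.items()})
```

Terms that occur in a single title (`nation`) receive a higher weight than terms shared across titles (`feder`).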

In [3]:
import re
import string

import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Download NLTK resources if not done (only needed for nltk's stopwords / word_tokenize;
# clean_text below uses scikit-learn's ENGLISH_STOP_WORDS and str.split instead)
# import nltk
# nltk.download('stopwords')
# nltk.download('punkt')



In [5]:
# Build these once rather than on every call
STOP_WORDS = ENGLISH_STOP_WORDS
STEMMER = PorterStemmer()

def clean_text(text):
    """Lowercase, strip URLs/symbols/punctuation, drop stopwords, and stem."""
    text = text.lower()
    text = re.sub(r'\.{2,}', ' ', text)              # remove ellipses
    text = re.sub(r'https?://\S+', ' ', text)        # remove URLs (only the URL, not the rest of the line)
    text = re.sub(r'\$\w*', '', text)                # remove $-prefixed tokens
    text = re.sub(r'#', '', text)                    # drop the '#' sign but keep the hashtag word
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)  # remove punctuation

    tokens = text.split()  # whitespace split is safe once punctuation is gone

    cleaned_tokens = [
        STEMMER.stem(token)
        for token in tokens
        if token not in STOP_WORDS
    ]

    return ' '.join(cleaned_tokens)



# Load dataset
basepath = "/Users/ayeshamendoza/repos/fake-news-immune-system"
datapath = os.path.join(basepath, "data/raw")
real = pd.read_csv(os.path.join(datapath, 'politifact_real.csv'))
fake = pd.read_csv(os.path.join(datapath, 'politifact_fake.csv'))

# Add label columns
real['label'] = 'REAL'
fake['label'] = 'FAKE'

# Combine datasets
df = pd.concat([real, fake], ignore_index=True)



In [None]:
# Apply cleaning
df['clean_text'] = df['title'].fillna('')
df['clean_text'] = df['clean_text'].apply(clean_text)


# Save cleaned dataset
df.to_csv('../data/processed/cleaned_articles.csv', index=False)

# Preview cleaned text
print(df[['label', 'clean_text']].head())

article_texts = df['clean_text'].tolist()

  label                                         clean_text
0  REAL                         nation feder independ busi
1  REAL                              comment fayettevil nc
2  REAL  romney make pitch hope close deal elect rocki ...
3  REAL  democrat leader say hous democrat unit gop def...
4  REAL                   budget unit state govern fy 2008


In [7]:
import spacy
import numpy as np

nlp = spacy.load("en_core_web_md")

article_vectors = []
for doc in nlp.pipe(article_texts, disable=["ner", "parser"]):
    article_vectors.append(doc.vector)

article_vectors = np.array(article_vectors)
print("Embeddings shape:", article_vectors.shape)

label_map = {'REAL': 0, 'FAKE': 1}
true_labels = df['label'].map(label_map).tolist()

Embeddings shape: (1056, 300)
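
spaCy's `doc.vector` defaults to the average of the token vectors, which is how each title becomes a single 300-dimensional row. The sketch below reproduces that averaging in plain Python with a made-up 3-dimensional vocabulary; the vectors and the `mean_pool` helper are illustrative assumptions, not values from `en_core_web_md`. It also assumes that unknown tokens contribute a zero vector, which is worth keeping in mind because stemmed tokens such as `feder` may not be in the model's vocabulary.

```python
# Made-up 3-dimensional word vectors (illustrative only)
toy_vectors = {
    "budget": [0.2, 0.4, 0.0],
    "state":  [0.6, 0.0, 0.2],
}
DIM = 3

def mean_pool(text, vectors, dim=DIM):
    """Average the token vectors of a whitespace-tokenized text."""
    tokens = text.split()
    if not tokens:
        return [0.0] * dim
    # Assumption: unknown tokens map to a zero vector
    rows = [vectors.get(tok, [0.0] * dim) for tok in tokens]
    return [sum(col) / len(rows) for col in zip(*rows)]

known = mean_pool("budget state", toy_vectors)
with_unknown = mean_pool("budget state feder", toy_vectors)
print(known)
print(with_unknown)
```

Under these assumptions, every unknown token shrinks the averaged vector toward zero without changing its direction.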


In [None]:
X = np.array(article_vectors)
y = np.array(true_labels)

In [10]:
import sys
sys.path.append('../') 

import src.negative_selection
import importlib
importlib.reload(src.negative_selection)

from src.negative_selection import detect_anomaly, generate_detectors
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
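
`generate_detectors` and `detect_anomaly` live in `src/negative_selection.py`, which is not shown in this notebook. For readers unfamiliar with the idea, the following is a minimal sketch of a generic real-valued negative selection scheme, not the project's implementation: random candidates are kept as detectors only if they stay at least `radius` away from every "self" (real-news) sample, and a new sample is flagged when it falls within `radius` of any detector. The function names, the uniform sampling, the Euclidean distance, and the 2-D toy data are all assumptions made for illustration.

```python
import math
import random

def sketch_generate_detectors(self_samples, num_detectors, radius, rng, max_attempts=100_000):
    """Keep random candidates that do not match any self sample (hypothetical helper)."""
    dim = len(self_samples[0])
    detectors = []
    attempts = 0
    while len(detectors) < num_detectors and attempts < max_attempts:
        attempts += 1
        candidate = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        # Negative selection: discard candidates that fall too close to "self"
        if all(math.dist(candidate, s) >= radius for s in self_samples):
            detectors.append(candidate)
    return detectors, attempts

def sketch_is_anomalous(sample, detectors, radius):
    """Flag a sample when any detector lies within `radius` of it (hypothetical helper)."""
    return any(math.dist(sample, d) < radius for d in detectors)

rng = random.Random(0)
# Toy "self" set: 50 points clustered around the origin
self_samples = [[rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)] for _ in range(50)]
detectors, attempts = sketch_generate_detectors(
    self_samples, num_detectors=200, radius=0.3, rng=rng
)

print(f"Generated {len(detectors)} detectors in {attempts} attempts")
print("self sample flagged:", sketch_is_anomalous(self_samples[0], detectors, 0.3))
print("far sample flagged: ", sketch_is_anomalous([0.7, -0.7], detectors, 0.3))
```

By construction, no training "self" sample is ever flagged in this sketch; whether unseen samples are flagged depends on how well the detectors cover the space outside the self region.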

In [13]:
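# === PARAMETERS ===
threshold = 0.45            # shared cutoff: AIS detector matching, ML probability, and hybrid score
ml_weight = 0.7             # weight of the Logistic Regression probability in the hybrid score
ais_weight = 1 - ml_weight  # weight of the binary AIS prediction in the hybrid score
n_splits = 5                # number of cross-validation folds
detector_params = {
    "num_detectors": 500,   # detectors generated per fold
    "noise_std": 0.07       # noise level passed to generate_detectors
}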

In [14]:
# Perform Cross-validation
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
results_cv = {"AIS": [], "ML": [], "Hybrid": []}

for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # AIS
    X_train_real = X_train[y_train == 0]
    detectors = generate_detectors(
        num_detectors=detector_params["num_detectors"],
        vector_dim=X.shape[1],
        self_matrix=X_train_real,
        threshold=threshold,
        noise_std=detector_params["noise_std"]
    )
    ais_preds = [1 if detect_anomaly(vec, detectors, threshold, debug=False) else 0 for vec in X_test]

    # ML
    clf = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)
    clf.fit(X_train, y_train)
    ml_probs = clf.predict_proba(X_test)[:, 1]
    ml_preds = (ml_probs >= threshold).astype(int)

    # Hybrid
    hybrid_scores = ais_weight * np.array(ais_preds) + ml_weight * ml_probs
    hybrid_preds = (hybrid_scores >= threshold).astype(int)

    # Store results
    results_cv["AIS"].append(classification_report(y_test, ais_preds, output_dict=True))
    results_cv["ML"].append(classification_report(y_test, ml_preds, output_dict=True))
    results_cv["Hybrid"].append(classification_report(y_test, hybrid_preds, output_dict=True))

Generated 500 detectors in 6477 attempts (threshold=0.45, noise_std=0.07)
Generated 500 detectors in 6540 attempts (threshold=0.45, noise_std=0.07)
Generated 500 detectors in 7268 attempts (threshold=0.45, noise_std=0.07)
Generated 500 detectors in 5939 attempts (threshold=0.45, noise_std=0.07)
Generated 500 detectors in 7215 attempts (threshold=0.45, noise_std=0.07)

In [15]:
# Summarize and Compare results
def summarize(results, label="1"):
    return {
        "Fake Precision": np.mean([fold[label]["precision"] for fold in results]),
        "Fake Recall":    np.mean([fold[label]["recall"] for fold in results]),
        "Fake F1":        np.mean([fold[label]["f1-score"] for fold in results]),
        "Accuracy":       np.mean([fold["accuracy"] for fold in results])
    }

summary_data = []
for model in ["AIS", "ML", "Hybrid"]:
    metrics = summarize(results_cv[model])
    metrics["Model"] = model
    summary_data.append(metrics)

cv_df = pd.DataFrame(summary_data)
print(cv_df)

   Fake Precision  Fake Recall   Fake F1  Accuracy   Model
0        0.000000     0.000000  0.000000  0.590910     AIS
1        0.721166     0.844935  0.777726  0.802110      ML
2        0.761255     0.664207  0.708928  0.777479  Hybrid
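
The AIS row has zero recall on FAKE and an accuracy roughly equal to the share of REAL articles (624 / 1056 ≈ 0.591), which is what one would see if the AIS component flagged no test article at all. Under that reading, the hybrid score reduces to `ml_weight * ml_probs`, so comparing it with `threshold` amounts to applying a stricter cutoff to the ML probability alone. The sketch below works through that arithmetic with the notebook's parameter values; the probabilities are made up for illustration and `effective_cutoff` is a hypothetical helper.

```python
threshold = 0.45
ml_weight = 0.7
ais_weight = 1 - ml_weight

def effective_cutoff(ais_pred):
    """ML probability needed for hybrid_score >= threshold, given a binary AIS vote."""
    return (threshold - ais_weight * ais_pred) / ml_weight

print("AIS votes REAL (0):", round(effective_cutoff(0), 3))
print("AIS votes FAKE (1):", round(effective_cutoff(1), 3))

# Made-up ML probabilities, with AIS voting REAL for every sample
ml_probs = [0.30, 0.50, 0.60, 0.70, 0.90]
ais_preds = [0] * len(ml_probs)

ml_preds = [int(p >= threshold) for p in ml_probs]
hybrid_preds = [
    int(ais_weight * a + ml_weight * p >= threshold)
    for a, p in zip(ais_preds, ml_probs)
]
print("ML preds:    ", ml_preds)
print("Hybrid preds:", hybrid_preds)
```

Samples with an ML probability between 0.45 and about 0.643 are labelled FAKE by the ML model but REAL by the hybrid, which is consistent with the hybrid's higher precision (0.761 vs 0.721) and lower recall (0.664 vs 0.845) in the table above.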
