## Fake News Detection using Artificial Immune Systems

### Data

✅ politifact_real.csv → real articles from Politifact 

✅ politifact_fake.csv → fake articles from Politifact

✅ gossipcop_real.csv → real articles from GossipCop

✅ gossipcop_fake.csv → fake articles from GossipCop

### 📚 [References](https://github.com/KaiDMML/FakeNewsNet)

- Shu, K., Mahudeswaran, D., Wang, S., Lee, D., & Liu, H. (2018). **FakeNewsNet: A Data Repository with News Content, Social Context and Dynamic Information for Studying Fake News on Social Media.** *arXiv preprint arXiv:1809.01286*. [arXiv link](https://arxiv.org/abs/1809.01286)

- Shu, K., Sliva, A., Wang, S., Tang, J., & Liu, H. (2017). **Fake News Detection on Social Media: A Data Mining Perspective.** *ACM SIGKDD Explorations Newsletter*, 19(1), 22–36. [DOI](https://doi.org/10.1145/3137597.3137600)

- Shu, K., Wang, S., & Liu, H. (2017). **Exploiting Tri-Relationship for Fake News Detection.** *arXiv preprint arXiv:1712.07709*. [arXiv link](https://arxiv.org/abs/1712.07709)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [2]:
# Load real and fake from politifact
basepath = "/Users/ayeshamendoza/repos/fake-news-immune-system"
datapath = os.path.join(basepath, "data/raw")
real = pd.read_csv(os.path.join(datapath, 'politifact_real.csv'))
fake = pd.read_csv(os.path.join(datapath, 'politifact_fake.csv'))

print("Real news shape:", real.shape)
print("Fake news shape:", fake.shape)

print("\nSample real news article:")
print(real.iloc[0])

print("\nSample fake news article:")
print(fake.iloc[0])

Real news shape: (624, 4)
Fake news shape: (432, 4)

Sample real news article:
id                                             politifact14984
news_url                             http://www.nfib-sbet.org/
title              National Federation of Independent Business
tweet_ids    967132259869487105\t967164368768196609\t967215...
Name: 0, dtype: object

Sample fake news article:
id                                             politifact15014
news_url             speedtalk.com/forum/viewtopic.php?t=51650
title        BREAKING: First NFL Team Declares Bankruptcy O...
tweet_ids    937349434668498944\t937379378006282240\t937380...
Name: 0, dtype: object


### Data Preprocessing

To use the article text in our models, we first need to convert it to numbers. The following preprocessing steps were applied:

- Tokenization
- Stemming
- Stopword removal
- Punctuation removal
- TF-IDF vectorization
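
As a rough illustration of the cleaning steps listed above, here is a minimal standard-library sketch applied to one title from the dataset. The tiny `STOP_WORDS` set and the `simple_clean` helper are assumptions made for this example only; stemming is left out because it needs NLTK.

```python
import string

# Tiny illustrative stopword set (an assumption; a full English list is much longer)
STOP_WORDS = {"the", "of", "a", "an", "to", "in", "and"}

# Translation table that maps every punctuation character to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def simple_clean(text):
    """Lowercase, replace punctuation with spaces, split on whitespace, drop stopwords."""
    text = text.lower().translate(PUNCT_TO_SPACE)
    return [tok for tok in text.split() if tok not in STOP_WORDS]

tokens = simple_clean("Budget of the United States Government, FY 2008")
print(tokens)
# ['budget', 'united', 'states', 'government', 'fy', '2008']
```

A stemmer then shortens the surviving tokens further, which is why the cleaned output later in the notebook reads `budget unit state govern fy 2008`.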

In [3]:
import re
import string

import pandas as pd
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# NLTK's stopword list and word_tokenize are not needed here: clean_text below
# uses scikit-learn's ENGLISH_STOP_WORDS and a plain whitespace split, so no
# nltk.download(...) call is required.



In [4]:
def clean_text(text):
    stop_words = ENGLISH_STOP_WORDS
    stemmer = PorterStemmer()

    text = text.lower()
    text = re.sub(r'\.{2,}', ' ', text)              # replace ellipses with a space
    text = re.sub(r'https?://\S+', '', text)         # remove URLs (only the URL, not the rest of the line)
    text = re.sub(r'\$\w*', '', text)                # remove $-prefixed tokens (cashtags, dollar amounts)
    text = re.sub(r'#', '', text)                    # drop the '#' sign but keep the hashtag word
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)  # replace punctuation with spaces

    tokens = text.split()  # now safe to split on whitespace

    cleaned_tokens = [
        stemmer.stem(token)
        for token in tokens
        if token not in stop_words
    ]

    return ' '.join(cleaned_tokens)



# Load dataset
basepath = "/Users/ayeshamendoza/repos/fake-news-immune-system"
datapath = os.path.join(basepath, "data/raw")
real = pd.read_csv(os.path.join(datapath, 'politifact_real.csv'))
fake = pd.read_csv(os.path.join(datapath, 'politifact_fake.csv'))

# Add label columns
real['label'] = 'REAL'
fake['label'] = 'FAKE'

# Combine datasets
df = pd.concat([real, fake], ignore_index=True)



In [7]:
# Apply cleaning
df['clean_text'] = df['title'].fillna('')
df['clean_text'] = df['clean_text'].apply(clean_text)
# df['clean_text'] = df['title'].fillna('').apply(clean_text)

# OPTIONAL: Save cleaned dataset
df.to_csv('../data/processed/cleaned_articles.csv', index=False)

# Preview cleaned text
print(df[['label', 'clean_text']].head())

# ✅ TF-IDF Vectorizer
vectorizer = TfidfVectorizer(max_features=5000)
tfidf_matrix = vectorizer.fit_transform(df['clean_text'])

print("TF-IDF matrix shape:", tfidf_matrix.shape)


  label                                         clean_text
0  REAL                         nation feder independ busi
1  REAL                              comment fayettevil nc
2  REAL  romney make pitch hope close deal elect rocki ...
3  REAL  democrat leader say hous democrat unit gop def...
4  REAL                   budget unit state govern fy 2008
TF-IDF matrix shape: (1056, 2740)


In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer


# Fit TF-IDF
# vectorizer = TfidfVectorizer(max_features=5000)  # limit vocab to top 5000 tokens
vectorizer = TfidfVectorizer(
    max_features=5000,
    token_pattern=r'(?u)\b[a-zA-Z]{2,}\b'
)
tfidf_matrix = vectorizer.fit_transform(df['clean_text'])

# Vocab size
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")

# Preview first 20 tokens in vocab
print("\nSample vocab tokens:")
sample_tokens = list(vectorizer.vocabulary_.keys())[:20]
print(sample_tokens)

# Show shape
print(f"\nTF-IDF matrix shape: {tfidf_matrix.shape}")

# Show top tokens by IDF (most unique)
idf_scores = vectorizer.idf_
tokens_idf = sorted(zip(vectorizer.get_feature_names_out(), idf_scores), key=lambda x: -x[1])
print("\nTop 10 tokens by IDF (most unique):")
for token, idf in tokens_idf[:10]:
    print(f"{token}: {idf:.2f}")


Vocabulary size: 2610

Sample vocab tokens:
['nation', 'feder', 'independ', 'busi', 'comment', 'fayettevil', 'nc', 'romney', 'make', 'pitch', 'hope', 'close', 'deal', 'elect', 'rocki', 'mountain', 'news', 'democrat', 'leader', 'say']

TF-IDF matrix shape: (1056, 2610)

Top 10 tokens by IDF (most unique):
abandon: 7.27
abedin: 7.27
abid: 7.27
abortion: 7.27
absorb: 7.27
abus: 7.27
achiev: 7.27
acquir: 7.27
activ: 7.27
actuari: 7.27
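
The ten tokens above tie at 7.27 because that is the largest IDF this corpus can produce. With scikit-learn's default `smooth_idf=True`, the IDF of a token is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the token. A small sketch of the arithmetic (the helper name `smoothed_idf` is ours):

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """IDF as defined by scikit-learn when smooth_idf=True (the default)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 1056  # number of rows in the TF-IDF matrix above

rare = smoothed_idf(n_docs, 1)          # token found in exactly one title
everywhere = smoothed_idf(n_docs, n_docs)  # token found in every title

print(f"df=1    -> idf={rare:.2f}")
print(f"df=1056 -> idf={everywhere:.2f}")
```

So every token at 7.27 occurs in a single title, and the list is simply the alphabetical start of that large tie, not a ranking of especially informative words.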


In [9]:
basepath

'/Users/ayeshamendoza/repos/fake-news-immune-system'

In [10]:
from scipy import sparse

savepath = os.path.join(basepath, "data/processed")
df.to_csv(os.path.join(savepath, "clean_articles.csv"), index=False)

sparse.save_npz(os.path.join(savepath, "tfidf_matrix.npz"), tfidf_matrix)

# Save vocab
import pickle
with open(os.path.join(savepath,'tfidf_vocab.pkl'), 'wb') as f:
    pickle.dump(vectorizer.vocabulary_, f)

In [31]:
import sys
sys.path.append('../') 
import src.negative_selection
import importlib
importlib.reload(src.negative_selection)

from src.negative_selection import generate_detectors
from scipy import sparse

# Load the saved TF-IDF matrix (same matrix that was written to disk above)
tfidf_matrix = sparse.load_npz(os.path.join(savepath, 'tfidf_matrix.npz'))

threshold = 0.5      # self-matching threshold; more realistic than the earlier 0.8
num_detectors = 100

# df was built with pd.concat([real, fake]), so the first len(real) rows are real news
num_real = len(real)
self_matrix = tfidf_matrix[:num_real]  # ONLY real news
vector_dim = self_matrix.shape[1]

detectors = generate_detectors(
    num_detectors=num_detectors,
    vector_dim=vector_dim,
    self_matrix=self_matrix,
    threshold=threshold,
    noise_std=0.05         # Gives variation
)


print(f"Generated {len(detectors)} detectors (requested {num_detectors})")

✅ Generated 100 detectors in 123 attempts (threshold=0.5, noise_std=0.05)
Generated 100 detectors (requested 100)


In [32]:
print(detectors.shape)  # should be (100, vector_dim)
print(detectors[0][:10])  # first 10 values of first detector


(100, 2610)
[0.         0.         0.         0.03721172 0.05465079 0.
 0.03827826 0.00081696 0.0386581  0.        ]


In [33]:
# Debug
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

all_min_dists = []

for article_vec in tfidf_matrix.toarray():  # or .A
    distances = cosine_distances(detectors, article_vec.reshape(1, -1)).flatten()
    all_min_dists.append(np.min(distances))

# Basic stats
print("Min distance:", np.min(all_min_dists))
print("Max distance:", np.max(all_min_dists))
print("Mean distance:", np.mean(all_min_dists))


Min distance: 0.5020335620804497
Max distance: 1.0
Mean distance: 0.8460125867323083


In [35]:
true_labels[:5]

array(['REAL', 'REAL', 'REAL', 'REAL', 'REAL'], dtype=object)

In [37]:
label_map = {'REAL': 0, 'FAKE': 1}
df['label_int'] = df['label'].map(label_map)
df.head()

| | id | news_url | title | tweet_ids | label | clean_text | predicted_label | label_int |
|---|---|---|---|---|---|---|---|---|
| 0 | politifact14984 | http://www.nfib-sbet.org/ | National Federation of Independent Business | 967132259869487105\t967164368768196609\t967215... | REAL | nation feder independ busi | REAL | 0 |
| 1 | politifact12944 | http://www.cq.com/doc/newsmakertranscripts-494... | comments in Fayetteville NC | 942953459\t8980098198\t16253717352\t1668513250... | REAL | comment fayettevil nc | REAL | 0 |
| 2 | politifact333 | https://web.archive.org/web/20080204072132/htt... | Romney makes pitch, hoping to close deal : Ele... | | REAL | romney make pitch hope close deal elect rocki ... | REAL | 0 |
| 3 | politifact4358 | https://web.archive.org/web/20110811143753/htt... | Democratic Leaders Say House Democrats Are Uni... | | REAL | democrat leader say hous democrat unit gop def... | REAL | 0 |
| 4 | politifact779 | https://web.archive.org/web/20070820164107/htt... | Budget of the United States Government, FY 2008 | 89804710374154240\t91270460595109888\t96039619... | REAL | budget unit state govern fy 2008 | REAL | 0 |


In [38]:
from sklearn.metrics import classification_report, confusion_matrix

predictions = []
for article_vec in tfidf_matrix.toarray():
    distances = cosine_distances(detectors, article_vec.reshape(1, -1)).flatten()
    is_fake = np.any(distances < 0.55)  # ← threshold here
    predictions.append(int(is_fake))

# Evaluate
true_labels = df['label_int'].values
print(confusion_matrix(true_labels, predictions))
print(classification_report(true_labels, predictions, target_names=["Real", "Fake"]))


[[509 115]
 [432   0]]
              precision    recall  f1-score   support

        Real       0.54      0.82      0.65       624
        Fake       0.00      0.00      0.00       432

    accuracy                           0.48      1056
   macro avg       0.27      0.41      0.33      1056
weighted avg       0.32      0.48      0.38      1056



### 🔍 Let’s Break Down What’s Going On

**Confusion matrix:**


[[509 115]   ← Real: 509 correct, 115 false positives (flagged as fake)
 [432   0]]   ← Fake: 432 fake articles, all missed ❌


- We correctly identify most real articles (recall = 82%).
- We catch no fake news at all: the detectors never fired on a fake article.
- For the "Fake" class, precision = 0 and recall = 0, so F1 = 0.
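
The figures in the report can be re-derived by hand from the four confusion-matrix counts. A short sketch using the counts printed above (the variable names are ours, with "fake" treated as the positive class):

```python
# Confusion matrix from the run above: rows = true class, columns = predicted class
tn, fp = 509, 115   # real articles: predicted real, predicted fake
fn, tp = 432, 0     # fake articles: predicted real, predicted fake

recall_real = tn / (tn + fp)        # share of real articles kept as real
precision_real = tn / (tn + fn)     # share of "real" predictions that are right
recall_fake = tp / (tp + fn) if (tp + fn) else 0.0
precision_fake = tp / (tp + fp) if (tp + fp) else 0.0
accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"Real: precision={precision_real:.2f} recall={recall_real:.2f}")
print(f"Fake: precision={precision_fake:.2f} recall={recall_fake:.2f}")
print(f"Accuracy: {accuracy:.2f}")
```

Note that the 115 articles the detectors did flag are all real, which is why precision for "Fake" is zero even though some detections occurred.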

### 💡 Diagnosis

**❓ Possibility 1: Detectors are too similar to real news and not close to fake news**

We built the detectors from noise based on real news. If fake news vectors look too similar to real ones (in TF-IDF space), they slip through undetected.
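
For context, the source of `generate_detectors` is not shown in this notebook. The sketch below is a generic negative-selection loop in plain Python, not the project's implementation: it draws random candidate vectors and keeps only those that are at least `threshold` away (cosine distance) from every "self" vector. The function name `sketch_generate_detectors`, the toy vectors, and the uniform random candidates are all assumptions for illustration.

```python
import math
import random

def cosine_distance(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def sketch_generate_detectors(self_vectors, num_detectors, threshold, rng,
                              max_attempts=10_000):
    """Hypothetical negative selection: discard any candidate that matches self."""
    dim = len(self_vectors[0])
    detectors = []
    attempts = 0
    while len(detectors) < num_detectors and attempts < max_attempts:
        attempts += 1
        candidate = [rng.random() for _ in range(dim)]
        # Keep the candidate only if it is far enough from every self sample
        if all(cosine_distance(candidate, s) >= threshold for s in self_vectors):
            detectors.append(candidate)
    return detectors

rng = random.Random(0)  # fixed seed so the sketch is reproducible
self_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]  # toy stand-ins for real-news vectors
detectors = sketch_generate_detectors(self_vectors, num_detectors=5,
                                      threshold=0.3, rng=rng)
print(len(detectors), "detectors kept")
```

Under this scheme a detector is only guaranteed to be far from self. Nothing pushes it toward non-self samples, so whether it ends up near fake news depends on how the candidates are drawn.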

**❓ Possibility 2: The threshold is too strict**

We used a detection threshold of 0.55, but the debug cell above showed that nearest-detector distances start at about 0.50 and average about 0.85, so very few articles fall below it.

Try raising the threshold to about 0.7 or even 0.75 so the detectors can fire on fake news.
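
Because `np.any(distances < t)` is true exactly when the smallest distance is below `t`, the nearest-detector distance of each article is all that is needed to try different thresholds. A minimal sketch with made-up distances (the `toy_min_dists` values are illustrative and not taken from the dataset; `flagged_fraction` is a hypothetical helper):

```python
def flagged_fraction(min_distances, threshold):
    """Share of articles whose nearest detector is closer than `threshold`."""
    flagged = sum(1 for d in min_distances if d < threshold)
    return flagged / len(min_distances)

# Toy nearest-detector distances, one per article (illustrative values only)
toy_min_dists = [0.50, 0.60, 0.72, 0.85, 0.90, 1.00]

for t in (0.55, 0.70, 0.75):
    share = flagged_fraction(toy_min_dists, t)
    print(f"threshold={t:.2f} -> flagged {share:.0%}")
```

A larger threshold can only flag more articles, never fewer. It cannot tell us which class the newly flagged articles belong to, so each setting still needs to be checked against the true labels.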

In [40]:
for t in [0.7, 0.75]:
    predictions = []
    for article_vec in tfidf_matrix.toarray():
        distances = cosine_distances(detectors, article_vec.reshape(1, -1)).flatten()
        is_fake = np.any(distances < t)
        predictions.append(int(is_fake))

    print(f"\n🔎 Threshold = {t}")
    print(confusion_matrix(true_labels, predictions))
    print(classification_report(true_labels, predictions, target_names=["Real", "Fake"]))



🔎 Threshold = 0.7
[[491 133]
 [431   1]]
              precision    recall  f1-score   support

        Real       0.53      0.79      0.64       624
        Fake       0.01      0.00      0.00       432

    accuracy                           0.47      1056
   macro avg       0.27      0.39      0.32      1056
weighted avg       0.32      0.47      0.38      1056


🔎 Threshold = 0.75
[[477 147]
 [430   2]]
              precision    recall  f1-score   support

        Real       0.53      0.76      0.62       624
        Fake       0.01      0.00      0.01       432

    accuracy                           0.45      1056
   macro avg       0.27      0.38      0.32      1056
weighted avg       0.32      0.45      0.37      1056



In [27]:
# import matplotlib.pyplot as plt
# import numpy as np

# sample_detector = detectors[0]
# distances = np.linalg.norm(self_matrix.toarray() - sample_detector, axis=1)

# plt.hist(distances, bins=30)
# plt.xlabel('Distance to self')
# plt.ylabel('Count')
# plt.title('Distances from sample detector to self samples')
# plt.show()


0.8

In [28]:
import sys
sys.path.append('../') 

import src.negative_selection
import importlib
importlib.reload(src.negative_selection)

from src.negative_selection import detect_anomaly

# # Pick sample article (convert sparse to dense row)
# sample_article_vector = tfidf_matrix[0].toarray()[0]

# result = detect_anomaly(sample_article_vector, detectors, threshold)

# print("Article detected as FAKE" if result else "Article detected as REAL")

predictions = []
threshold = 0.05

for i in range(tfidf_matrix.shape[0]):
    # Get article vector → convert sparse row to dense array
    article_vector = tfidf_matrix[i].toarray()[0]
    
    # Run detection
    detected = detect_anomaly(article_vector, detectors, threshold)
    
    # Map True/False → FAKE/REAL
    predictions.append('FAKE' if detected else 'REAL')

# Assign predictions to dataframe
df['predicted_label'] = predictions


In [29]:
from sklearn.metrics import classification_report, confusion_matrix

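# The labels here are strings, so scikit-learn orders the classes alphabetically:
# 'FAKE' first, then 'REAL'. target_names must follow that order.
print(classification_report(df['label'], predictions, target_names=["Fake", "Real"]))
print(confusion_matrix(df['label'], predictions))


              precision    recall  f1-score   support

        Fake       0.00      0.00      0.00       432
        Real       0.59      1.00      0.74       624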

    accuracy                           0.59      1056
   macro avg       0.30      0.50      0.37      1056
weighted avg       0.35      0.59      0.44      1056

[[  0 432]
 [  0 624]]




In [None]:
accuracy = (df['label'] == df['predicted_label']).mean()
print(f"Accuracy: {accuracy:.2%}")

In [30]:
for t in [0.02, 0.04, 0.08]:
    preds = []
    for i in range(tfidf_matrix.shape[0]):
        article_vector = tfidf_matrix[i].toarray()[0]
        detected = detect_anomaly(article_vector, detectors, t)
        preds.append('FAKE' if detected else 'REAL')
    acc = (df['label'] == preds).mean()
    print(f"Threshold {t}: Accuracy {acc:.2%}")


Threshold 0.02: Accuracy 59.09%
Threshold 0.04: Accuracy 59.09%
Threshold 0.08: Accuracy 59.09%


Min distance: 0.8795570857692714
Max distance: 1.0
Mean distance: 0.9279772336929644
