## AI Tools Final Project Baseline Model: kNN Fake News Classifier

**A basic kNN model that predicts whether an article is real or fake news from its headline. Headlines are vectorized with a bag-of-words approach: each vector holds raw word counts, with no frequency weighting and no word-order information.**

In [1]:
import pandas as pd
import numpy as np
import re

**Import Dataset: "Fake News Classification". We only want the headline and the label.**

In [2]:
data = pd.read_csv("WELFake_Dataset.csv")
data = data.dropna()
data = data.drop(columns = ["text"])
data = data.drop(columns = ["Unnamed: 0"])
data.head(10)

Unnamed: 0,title,label
0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,1
2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,1
3,"Bobby Jindal, raised Hindu, uses story of Chri...",0
4,SATAN 2: Russia unvelis an image of its terrif...,1
5,About Time! Christian Group Sues Amazon and SP...,1
6,DR BEN CARSON TARGETED BY THE IRS: “I never ha...,1
7,HOUSE INTEL CHAIR On Trump-Russia Fake Story: ...,1
8,Sports Bar Owner Bans NFL Games…Will Show Only...,1
9,Latest Pipeline Leak Underscores Dangers Of Da...,1
10,GOP Senator Just Smacked Down The Most Puncha...,1


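**Subset of data being used. The assignment below is a no-op, so the full dataset is kept; slice `data` here to work on fewer rows.**

In [3]:
data = data  # no-op placeholder: replace with a slice to use a subset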

**Function that takes one headline, preprocesses it into a list of words, and updates a global frequency dictionary for the entire dataset: a new word is added with a count of 1, and a word seen before has its count incremented.**

In [4]:
def make_frequency_dict(headline):
    words_list = process_sentence(headline)

    for word in words_list:
        if word in frequency_dict:
            frequency_dict[word] += 1
        else:
            frequency_dict[word] = 1
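
The same tally can be kept with `collections.Counter` from the standard library. The sketch below is only an illustration: the two headlines are made up, and `frequency_counter` is a hypothetical name that does not appear elsewhere in the notebook.

```python
import re
from collections import Counter

def process_sentence(text):
    # same preprocessing as the notebook: drop digits and punctuation, lowercase, split
    text = re.sub(r'\d', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    return text.lower().split()

# made-up headlines, used only for illustration
headlines = ["Fake news, fake story", "Real story"]

frequency_counter = Counter()
for headline in headlines:
    # update() adds one to the count of every word in the list
    frequency_counter.update(process_sentence(headline))

print(frequency_counter)
```

A `Counter` is a `dict` subclass, so `list(frequency_counter.keys())` still yields the unique words in first-seen order.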

**Preprocessing function that converts text into a list of individual lowercase words, with digits and punctuation removed.**

In [5]:
def process_sentence(text):
    text = re.sub(r'\d', '', text)
    text = re.sub(r'[^\w\s]','',text)
    text = text.lower()
    words_list = text.split()
    return words_list
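
A quick check of what the preprocessing does to one string. The sample headline is made up for illustration; the function body is the same as above.

```python
import re

def process_sentence(text):
    text = re.sub(r'\d', '', text)        # drop digits
    text = re.sub(r'[^\w\s]', '', text)   # drop punctuation (underscores survive, since \w matches them)
    text = text.lower()
    words_list = text.split()             # split on any run of whitespace
    return words_list

sample = "SATAN 2: Russia unveils an image!"  # made-up example headline
tokens = process_sentence(sample)
print(tokens)  # ['satan', 'russia', 'unveils', 'an', 'image']
```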

**Function that takes all the headlines in the dataset and the list of unique words, and returns a 2D NumPy array with one row vector per headline. Each vector has length `len(total_unique_words)`, and each entry holds the count of the corresponding word in that headline.**

In [6]:
def headline_to_vector_fast(headlines, total_unique_words):
    # processes all headlines at once
    word_lists = headlines.apply(process_sentence)

    # create a dictionary of words as keys, indices as values
    word_to_index = {word: idx for idx, word in enumerate(total_unique_words)}

    # create 2D numpy array of zeros with an integer dtype
    vectors = np.zeros((len(headlines), len(total_unique_words)), dtype=int)

    # fill in the word counts, one row vector per headline
    for i, words in enumerate(word_lists):
        for word in words:
            if word in word_to_index:
                vectors[i, word_to_index[word]] += 1

    return vectors
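
A tiny worked example of the same counting logic, small enough to check by hand. It is a simplified sketch: it takes a plain Python list instead of a pandas Series, and the two headlines and the name `vocab` are made up for illustration.

```python
import re
import numpy as np

def process_sentence(text):
    text = re.sub(r'\d', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    return text.lower().split()

# made-up headlines, used only for illustration
headlines = ["Fake news, fake story", "Real story"]

# vocabulary in first-seen order, mirroring the keys of the frequency dictionary
vocab = []
for headline in headlines:
    for word in process_sentence(headline):
        if word not in vocab:
            vocab.append(word)

# map each word to its column, then count occurrences per headline
word_to_index = {word: idx for idx, word in enumerate(vocab)}
vectors = np.zeros((len(headlines), len(vocab)), dtype=int)
for i, headline in enumerate(headlines):
    for word in process_sentence(headline):
        vectors[i, word_to_index[word]] += 1

print(vocab)    # ['fake', 'news', 'story', 'real']
print(vectors)  # rows: [2 1 1 0] and [0 0 1 1]
```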

**Runs the above methods to fully preprocess the dataset into vectors that can be put into the kNN model.**

In [7]:
# create frequency dictionary 
frequency_dict = {}
titles_list = data["title"].tolist()
for title in titles_list:
    make_frequency_dict(title)

# extract list of unique words
total_unique_words = list(frequency_dict.keys())
size = len(total_unique_words)
print("Number of unique words:", size)

# create input array of shape (number of headlines, number of unique words)
X = headline_to_vector_fast(data["title"], total_unique_words)

print(X.shape)
print("X is", type(X))

Number of unique words: 36634
(71537, 36634)
X is <class 'numpy.ndarray'>
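
A dense array of shape (71537, 36634) has about 2.6 billion entries, roughly 21 GB assuming 8-byte integers, and nearly all of them are zero because a headline contains only a handful of words. One possible alternative, not used in this notebook, is a sparse matrix built with scikit-learn's `CountVectorizer`. The sketch below passes `process_sentence` as the analyzer so the tokens match the preprocessing above; the headlines are made up.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def process_sentence(text):
    text = re.sub(r'\d', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    return text.lower().split()

# made-up headlines, used only for illustration
headlines = ["Fake news, fake story", "Real story"]

# a callable analyzer replaces CountVectorizer's own tokenization
vectorizer = CountVectorizer(analyzer=process_sentence)
X_sparse = vectorizer.fit_transform(headlines)  # SciPy sparse matrix of counts

# columns are ordered alphabetically, not in first-seen order
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(columns)
print(X_sparse.shape)
print(X_sparse.toarray())  # densify only because this example is tiny
```

`KNeighborsClassifier` accepts sparse input, so a matrix like this could be passed to `fit` and `predict` without converting it to a dense array.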


## TF-IDF

In [8]:
data = data.dropna()
title_and_label = data
#title_and_label.head(10)

title_and_label_2 = data
title_and_label_2.head(10)

Unnamed: 0,title,label
0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,1
2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,1
3,"Bobby Jindal, raised Hindu, uses story of Chri...",0
4,SATAN 2: Russia unvelis an image of its terrif...,1
5,About Time! Christian Group Sues Amazon and SP...,1
6,DR BEN CARSON TARGETED BY THE IRS: “I never ha...,1
7,HOUSE INTEL CHAIR On Trump-Russia Fake Story: ...,1
8,Sports Bar Owner Bans NFL Games…Will Show Only...,1
9,Latest Pipeline Leak Underscores Dangers Of Da...,1
10,GOP Senator Just Smacked Down The Most Puncha...,1


In [9]:
#title_and_label_2 = title_and_label_2[:10000]


def make_frequency_dict(text):
    words = process_sentence(text)
    for word in words:
        if word in frequency_dict:
            frequency_dict[word] += 1
        else:
            frequency_dict[word] = 1


def process_sentence(text):
    text = re.sub(r'\d', '', text)
    text = re.sub(r'[^\w\s]','',text)
    text = text.lower()
    words = text.split()
    return words


def headline_to_vector(headline):
    words_list = process_sentence(headline) # sentence broken up into individual words
    headline_dict = dict.fromkeys(total_unique_words, 0) # the entire csv dictionary with blank values
    for word in words_list:
        if(word in headline_dict):
            headline_dict[word] += 1
    vector = list(headline_dict.values())
    return np.array(vector)


frequency_dict = {}

# this creates the frequency dictionary for the whole csv file

titles_list = title_and_label_2["title"].tolist()

for title in titles_list:
    make_frequency_dict(title)

total_unique_words = frequency_dict.keys()
size = len(total_unique_words)
print(size)

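# vectorize every headline, then stack the per-headline vectors into one 2D array
# (no pre-allocation is needed: np.stack builds X from scratch)
vectors = title_and_label_2["title"].apply(headline_to_vector).to_numpy()

X = np.stack(vectors)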

print(X.shape)

36634
(71537, 36634)
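
The array produced above still holds raw word counts. If TF-IDF weighting is what this section is meant to add (an assumption based on the heading), one way to apply it to an existing count matrix is scikit-learn's `TfidfTransformer`. The sketch uses a small hand-written count matrix rather than the real `X`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# toy count matrix: 2 headlines x 4 words (made up for illustration)
# column 2 is a word that appears in both headlines
counts = np.array([[2, 1, 1, 0],
                   [0, 0, 1, 1]])

# default settings: smoothed IDF, each row scaled to unit L2 norm
tfidf = TfidfTransformer()
X_tfidf = tfidf.fit_transform(counts)  # returns a sparse matrix

print(tfidf.idf_)                      # words found in more headlines get a lower IDF
print(X_tfidf.toarray().round(3))
```

Words that occur in many headlines are down-weighted, so they contribute less to the distances that kNN computes.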


In [10]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

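# Alternative (disabled): keep enough components to retain 90% of the variance
# pca = PCA(n_components=0.90, random_state=42)
# X_pca = pca.fit_transform(X_scaled)

# Project the standardized data onto the first 2 principal components
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)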
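
With `n_components=2` it is worth checking how much variance the projection keeps. The sketch below shows the check on synthetic count-like data, since the real `X` is too large to reproduce here; all `_demo` names and the random data are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# synthetic stand-in for a word-count matrix: 200 rows, 30 features
rng = np.random.default_rng(0)
X_demo = rng.poisson(0.5, size=(200, 30)).astype(float)

X_scaled_demo = StandardScaler().fit_transform(X_demo)

pca_demo = PCA(n_components=2, random_state=42)
X_pca_demo = pca_demo.fit_transform(X_scaled_demo)

# fraction of the total variance captured by the two kept components
retained = pca_demo.explained_variance_ratio_.sum()
print(X_pca_demo.shape)
print(f"variance retained by 2 components: {retained:.3f}")
```

On the notebook's own data the equivalent check would be `pca.explained_variance_ratio_.sum()` after fitting.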

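**Basic kNN model used in class. The `KNN` class itself is not defined in this notebook, so it must be defined or imported before `runKNN` is called.**

**`runKNN` creates the model, fits it on the training data, predicts the test set, and returns the test-set accuracy.**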

In [11]:
from sklearn.metrics import accuracy_score

def runKNN(kVal, train_x, train_y, test_x, test_y):
    knn = KNN(kVal)
    knn.fit(train_x, train_y)
    predictions = knn.predict_loop(test_x)
    return accuracy_score(test_y, predictions)
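
The `KNN` class that `runKNN` relies on is not shown in this notebook. The sketch below is a hypothetical stand-in, not the class-provided implementation: it assumes Euclidean distance and a simple majority vote, and only mirrors the `fit` and `predict_loop` method names used above.

```python
import numpy as np
from collections import Counter

class KNN:
    """Hypothetical stand-in for the class-provided KNN (assumed behavior)."""

    def __init__(self, k):
        self.k = k

    def fit(self, X, y):
        # kNN has no training step: it only stores the data
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)

    def predict_loop(self, X_test):
        predictions = []
        for x in np.asarray(X_test, dtype=float):
            # Euclidean distance from x to every stored training point
            dists = np.linalg.norm(self.X - x, axis=1)
            nearest = np.argsort(dists)[:self.k]
            # majority vote among the k nearest labels
            label = Counter(self.y[nearest].tolist()).most_common(1)[0][0]
            predictions.append(label)
        return np.array(predictions)

# tiny made-up dataset: two points near the origin, two near (5, 5)
train_x = np.array([[0, 0], [0, 1], [5, 5], [6, 5]])
train_y = np.array([0, 0, 1, 1])

knn = KNN(3)
knn.fit(train_x, train_y)
preds = knn.predict_loop(np.array([[0.2, 0.5], [5.5, 5.0]]))
print(preds)  # [0 1]
```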

**Train/test split using the PCA-reduced 2D NumPy array created earlier and the labels from the dataset. An 80/20 train/test split is a common default.**

In [12]:
from sklearn.model_selection import train_test_split
#labels_vector = data["label"].to_numpy()
labels_vector = data["label"].to_numpy()[:X.shape[0]]  # Trim labels to match X


# X_train, X_test, y_train, y_test = train_test_split(X, labels_vector, test_size=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_pca, labels_vector, test_size=0.2, random_state=0)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(57229, 2)
(14308, 2)
(57229,)
(14308,)


In [13]:
# runKNN(1, X_train, y_train, X_test, y_test)

In [14]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors = 7000)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

0.7321778026279004
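
The value `n_neighbors = 7000` is one point in a wide range, and accuracy usually varies with k. A minimal sketch of comparing several values follows; it uses synthetic two-feature data from `make_classification` rather than the headline vectors, and the candidate values of k are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# synthetic 2-feature binary classification data (stand-in for X_pca and the labels)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# fit one model per candidate k and record its held-out accuracy
results = {}
for k in (1, 5, 25, 125):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_tr, y_tr)
    results[k] = accuracy_score(y_te, model.predict(X_te))

for k, acc in results.items():
    print(k, round(acc, 3))
```

Choosing k by test-set accuracy makes the final score optimistic, so a separate validation split or cross-validation is the safer way to pick it.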
