# Fake news detection

In today's digital age, the rise of fake news poses a significant challenge to the reliability of information. Misleading stories, distorted facts, and fabricated content can easily spread across online platforms, affecting public perception and decision-making.

The Fake News Detection Project addresses this issue with machine learning and natural language processing. Its goal is a model that distinguishes fake from genuine news articles. This notebook takes a first step: it trains a binary classifier on article headlines, using pretrained GloVe word embeddings and a convolutional-recurrent network, to give users a tool for assessing the information they encounter.

By promoting media literacy and critical thinking, the project seeks to empower individuals to make more informed judgments about the news they consume. Ultimately, the project aims to contribute to a more trustworthy and transparent media landscape by combating the spread of misinformation.

In [2]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# TF1-compatibility API; the model below is built and trained in graph mode
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

In [3]:
df = pd.read_csv("/Users/marysia/Downloads/news.csv")

In [4]:
pd.set_option("display.max_rows", None)

In [5]:
df.head()

|   | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


## Preprocessing

In [6]:
# Drop the leftover index column from the CSV export
df = df.drop(columns=["Unnamed: 0"])

In [7]:
df.head()

|   | title | text | label |
|---|---|---|---|
| 0 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6335 entries, 0 to 6334
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   6335 non-null   object
 1   text    6335 non-null   object
 2   label   6335 non-null   object
dtypes: object(3)
memory usage: 148.6+ KB


In [9]:
df.isna().sum()

title    0
text     0
label    0
dtype: int64

In [10]:
# Convert categorical labels to numerical
label_encoder = LabelEncoder()

df["label"] = label_encoder.fit_transform(df["label"])
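
To see which integer each class receives, the fitted encoder can be inspected. The sketch below uses a toy label list rather than the dataset, and assumes only the two classes shown in the preview occur. `LabelEncoder` assigns codes in sorted order of the class names, so `FAKE` should map to 0 and `REAL` to 1.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for df["label"] (assumption: only these two classes occur)
toy_labels = ["REAL", "FAKE", "FAKE", "REAL"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_labels)

# classes_ lists the original labels; position i is the label encoded as i
classes = [str(c) for c in encoder.classes_]
mapping = {label: code for code, label in enumerate(classes)}

print(mapping)
print(encoded.tolist())
```

Knowing this mapping matters later: a sigmoid output near 1 then means "real", not "fake".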

In [11]:
embedding_dim = 50
max_length = 54
trunc_type = 'post'
padding_type = 'post'
oov_tok = "<OOV>"
training_size = 3000
test_portion = .1

In [12]:
# Use the first `training_size` rows for training and validation
title = df["title"][:training_size].tolist()
text = df["text"][:training_size].tolist()
labels = df["label"][:training_size].tolist()

In [13]:
# Tokenize the headlines; words unseen during fitting map to the OOV token
tokenizer1 = Tokenizer(oov_token=oov_tok)

tokenizer1.fit_on_texts(title)
word_index1 = tokenizer1.word_index

In [14]:
vocab_size1 = len(word_index1)
sequences1 = tokenizer1.texts_to_sequences(title)  # map each headline to a list of word indices
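
The idea behind the word index can be illustrated without Keras. The following is a simplified pure-Python sketch, not the Keras implementation: `tokenize`, `build_word_index`, and `to_sequence` are hypothetical helpers, and the regular expression is a rough stand-in for the tokenizer's punctuation filter. Words are ranked by frequency, indices start at 1 (0 is left free for padding), and an optional out-of-vocabulary token takes index 1 when given.

```python
import re
from collections import Counter

def tokenize(text):
    # Simplified tokenization: lowercase, keep runs of letters, digits, apostrophes
    return re.findall(r"[a-z0-9']+", text.lower())

def build_word_index(texts, oov_token=None):
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    # most_common() sorts by descending count; ties keep first-seen order
    vocab = [word for word, _ in counts.most_common()]
    if oov_token is not None:
        vocab = [oov_token] + vocab
    return {word: i for i, word in enumerate(vocab, start=1)}

def to_sequence(text, word_index, oov_token=None):
    oov_id = word_index.get(oov_token)  # None when no OOV token is configured
    ids = []
    for word in tokenize(text):
        idx = word_index.get(word, oov_id)
        if idx is not None:             # unknown words are skipped without an OOV token
            ids.append(idx)
    return ids

headlines = ["Kerry to go to Paris", "Paris talks to resume"]

plain_index = build_word_index(headlines)
print(to_sequence("Kerry to go to France", plain_index))

oov_index = build_word_index(headlines, oov_token="<OOV>")
print(to_sequence("Kerry to go to France", oov_index, oov_token="<OOV>"))
```

Without an OOV token the unknown word "France" disappears from the sequence; with one, it is kept as index 1.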

In [15]:
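# Pad or truncate every headline to max_length so the shape matches the model input
padded1 = pad_sequences(sequences1, maxlen=max_length,
                        padding=padding_type, truncating=trunc_type)

# Hold out the first test_portion of the samples for validation
split = int(test_portion * training_size)
training_sequences1 = padded1[split:training_size]
test_sequences1 = padded1[0:split]
test_labels = labels[0:split]
training_labels = labels[split:training_size]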
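
What `'post'` padding and truncation do to a sequence is easiest to see on a toy input. The sketch below is a minimal NumPy illustration; `pad_post` is a hypothetical helper written for this example, not the Keras `pad_sequences` function.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Hypothetical helper: right-pad or right-truncate each sequence to maxlen."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[:maxlen]           # 'post' truncation drops tokens from the end
        out[row, :len(kept)] = kept   # 'post' padding leaves the fill value at the end
    return out

toy = [[3, 1, 4], [2, 7, 1, 8, 2, 8]]
padded = pad_post(toy, maxlen=4)
print(padded)
```

The short sequence gains a trailing zero and the long one loses its last two tokens. With `test_portion = .1` and `training_size = 3000`, `split` evaluates to 300, so the first 300 rows are held out and the remaining 2700 are used for training.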

In [16]:
# Embedding: load pretrained GloVe vectors into a word -> vector lookup
embeddings_index = {}

with open("/Users/marysia/Downloads/glove.6B.50d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

In [None]:
embeddings_matrix = np.zeros((vocab_size1+1, embedding_dim))
for word, i in word_index1.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embeddings_matrix[i] = embedding_vector
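
The matrix construction can be checked on a toy vocabulary. In the sketch below, `toy_embeddings` and `toy_word_index` are hypothetical stand-ins for the GloVe lookup and the tokenizer's word index, and the dimension is reduced to 3 for readability. Words without a pretrained vector keep an all-zero row, and row 0 stays zero because index 0 is reserved for padding.

```python
import numpy as np

embedding_dim = 3  # toy dimension; the notebook uses 50

# Hypothetical stand-ins for the GloVe lookup and the tokenizer's word index
toy_embeddings = {
    "paris": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "kerry": np.array([0.4, 0.5, 0.6], dtype="float32"),
}
toy_word_index = {"to": 1, "paris": 2, "kerry": 3}

# One extra row because word indices start at 1 and row 0 is the padding row
matrix = np.zeros((len(toy_word_index) + 1, embedding_dim))
for word, i in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is not None:
        matrix[i] = vector

# Count how many vocabulary words actually received a pretrained vector
covered = sum(word in toy_embeddings for word in toy_word_index)
print(f"{covered}/{len(toy_word_index)} words have a pretrained vector")
print(matrix)
```

Reporting the coverage this way is a cheap sanity check: a low ratio would suggest that the tokenization and the embedding vocabulary do not match.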

## Create the model

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size1+1, embedding_dim,
                              input_length=max_length, weights=[
                                  embeddings_matrix],
                              trainable=False),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Conv1D(64, 5, activation='relu'),
    tf.keras.layers.MaxPooling1D(pool_size=4),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
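
The sequence length shrinks as it passes through the convolution and pooling layers. The arithmetic below is a small sketch with hypothetical helper functions; it assumes the Keras defaults used above, namely no padding (`padding='valid'`), a convolution stride of 1, and a pooling stride equal to the pool size.

```python
def conv1d_valid_len(length, kernel_size, stride=1):
    # Output length of a 1D convolution without padding
    return (length - kernel_size) // stride + 1

def maxpool1d_valid_len(length, pool_size, stride=None):
    # The pooling stride defaults to the pool size
    stride = pool_size if stride is None else stride
    return (length - pool_size) // stride + 1

max_length = 54
after_conv = conv1d_valid_len(max_length, kernel_size=5)
after_pool = maxpool1d_valid_len(after_conv, pool_size=4)
print(after_conv, after_pool)
```

Under these assumptions a 54-token headline becomes 50 steps after `Conv1D(64, 5)` and 12 steps after `MaxPooling1D(pool_size=4)`. The LSTM then reduces those 12 steps to a single 64-dimensional vector, which the final dense layer turns into one probability.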

In [None]:
model.compile(loss='binary_crossentropy',
              optimizer='adam', metrics=['accuracy'])
model.summary()

In [None]:
num_epochs = 50

training_padded = np.array(training_sequences1)
training_labels = np.array(training_labels)
testing_padded = np.array(test_sequences1)
testing_labels = np.array(test_labels)

history = model.fit(training_padded, training_labels,
                    epochs=num_epochs,
                    validation_data=(testing_padded,
                                     testing_labels),
                    verbose=2)

In [None]:
# sample text to check if fake or not
X = "Karry to go to France in gesture of sympathy"

# detection: LabelEncoder sorts classes alphabetically, so FAKE -> 0 and REAL -> 1
sequences = tokenizer1.texts_to_sequences([X])[0]
sequences = pad_sequences([sequences], maxlen=max_length,
                          padding=padding_type,
                          truncating=trunc_type)
if model.predict(sequences, verbose=0)[0][0] >= 0.5:
    print("This news is real")
else:
    print("This news is fake")
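
The thresholding step can be wrapped in a small function that also reports how far the score is from the decision boundary. This is a sketch only: `decode_prediction` is a hypothetical helper, and it assumes the label encoding `FAKE` = 0, `REAL` = 1 produced by sorting the class names.

```python
def decode_prediction(probability, threshold=0.5, classes=("FAKE", "REAL")):
    """Hypothetical helper: map a sigmoid output to a class name and a confidence."""
    index = int(probability >= threshold)
    # Confidence is the probability mass on the chosen side of the threshold
    confidence = probability if index == 1 else 1.0 - probability
    return classes[index], confidence

for p in (0.91, 0.12):
    label, confidence = decode_prediction(p)
    print(f"p={p:.2f} -> {label} (confidence {confidence:.2f})")
```

A score close to 0.5 signals an uncertain prediction, which a bare real-or-fake message would hide.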
