<a href="https://colab.research.google.com/github/anshika3112/Fake_news_Detection/blob/main/Fake_News_Detection_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Summary
Fake news is deliberately fabricated or misleading information presented as factual news. It aims to deceive readers or viewers for purposes such as influencing opinions, spreading propaganda, or generating revenue through clicks. Detecting it is important for combating misinformation and maintaining the integrity of journalism and public discourse. Natural language processing combined with deep learning models, such as recurrent neural networks, can identify patterns and features that indicate fake news and so help automate verification and fact-checking.

##Workflow

* An LSTM (Long Short-Term Memory) recurrent neural network is used for fake news detection.

* An embedding layer placed before the LSTM turns token indices into dense feature vectors.

* Preprocessing includes regex-based cleaning, stop-word removal, and optional stemming with NLTK's SnowballStemmer.

* Keras one_hot encoding converts each title into a sequence of integer indices for the LSTM input.

* The model is compiled with the Adam optimizer and the binary_crossentropy loss function.

* Model performance is evaluated with a classification report and a confusion matrix.

* Several threshold values are compared to refine the final prediction.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Import libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras
from sklearn.preprocessing import LabelEncoder  # converts categorical labels to integers
from keras import Sequential  # linear stack of layers
from keras.layers import Embedding, Dense, LSTM  # Embedding maps token indices to dense vectors; Dense is a fully connected layer
from keras.preprocessing.text import one_hot  # hashes each word to an integer index
from keras.utils import pad_sequences
import nltk  # natural language toolkit
from nltk.stem.snowball import SnowballStemmer  # stemmer used in preprocessing
import regex as re
from nltk.tokenize import sent_tokenize  # splits a paragraph into sentences
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report  # evaluation metrics
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')
from nltk.corpus import stopwords

In [None]:
import tensorflow as tf

print(tf.__version__)


2.15.0


In [None]:
# Download the required NLTK resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')  # WordNet lexical database, used by NLTK's lemmatizer

stop_words = stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [None]:
# datasets
df_fake = pd.read_csv("/content/drive/MyDrive/Fake_news_detection/News _dataset/Fake.csv")
df_true = pd.read_csv("/content/drive/MyDrive/Fake_news_detection/News _dataset/True.csv")

In [None]:
df_true.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [None]:
df_fake.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [None]:
df_true['status'] = 0
df_fake['status'] = 1

### Since the model is built only on the title feature, the text, date, and subject columns are dropped

In [None]:
# merge and remove unnecessary columns
df = pd.concat([df_true,df_fake])
df.drop(['subject','text','date'],axis=1,inplace=True)

In [None]:
# Shuffle the combined dataset
# sample(frac=1) returns every row exactly once in random order; drawing indices with
# np.random.randint samples with replacement, which duplicates some rows and drops others
df = df.sample(frac=1).reset_index(drop=True)   # resetting index
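Shuffling a dataset needs a permutation of the row indices, not independent random draws. The following sketch uses only the standard library to contrast the two on a hypothetical index range of 1000 rows; the variable names are illustrative and not part of the notebook.

```python
import random

rng = random.Random(0)   # fixed seed so the sketch is repeatable
n = 1000                 # hypothetical number of rows

# Independent draws: the same index can appear several times, others never.
with_replacement = [rng.randrange(n) for _ in range(n)]

# Permutation: every index appears exactly once, in random order.
permutation = list(range(n))
rng.shuffle(permutation)

print("unique rows, independent draws:", len(set(with_replacement)))
print("unique rows, permutation:      ", len(set(permutation)))
```

With independent draws, repeated rows can land on both sides of a later train/test split, so a permutation is the safer choice before splitting.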

In [None]:
pd.set_option('display.max_colwidth', 500)
random = np.random.randint(0,len(df),20)
df.iloc[random]

Unnamed: 0,title,status
10312,Factbox: What to watch in negotiations over details of U.S. tax bill,0
12350,Canada says has no plans to remove embassy staff from Cuba,0
37377,Donald Trump Tells Security To Throw Bernie Sanders Supporter Out Into The Cold Without His Coats (VIDEO),1
43733,Austria's conservatives reach coalition deal with far right: Kurz,0
18632,"Russian, Iranian diplomats to discuss Iran nuclear deal this week: Ifax",0
30768,Trump travel curbs pose revenue challenges for U.S. colleges,0
44242,BOMBSHELL: Clinton WikiLeak Exposes Entire ‘Shadow Government’ – Jay Dyer (Vid),1
16267,BREAKING: IRAN Tests Cruise Missile…Trump WARNS…They’re “Playing with fire…They don’t appreciate how ‘kind’ President Obama was to them. Not Me!” [VIDEO],1
21505,OBAMA Made CHRISTIAN Pastor Pay For His Own Ticket Home After Iran Got Secret $1.7 Billion Ransom For His Release,1
41209,Soda taxes spread after votes in four U.S. cities,0


##Work required on the data before feeding it to the neural network

1. Remove punctuation, e.g. quotation marks.

2. Convert uppercase to lowercase.

3. Skip stemming, since it can shorten words unnecessarily.

4. Apply lemmatization.

5. Remove all stop words.

6. Finally, build the vocabulary once the five steps above are complete.


In [None]:
# Null values
df.isnull().sum()

title     0
status    0
dtype: int64

In [None]:
# Word count of the longest title
def longest_sentence_length(text):
  return len(text.split())

df['maximum_length'] = df['title'].apply(longest_sentence_length)
print('longest sentence having length -')
max_length = df['maximum_length'].max()
print(max_length)

longest sentence having length -
42


In [None]:
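# Text cleaning
# Raw string so that \b reaches the regex engine as a word boundary
# (in a normal string Python turns \b into a backspace character)
text_cleaning = r"\b0\S*|\b[^A-Za-z0-9]+"
stemmer = SnowballStemmer(language='english')  # built once and reused for every token

def preprocess_filter(text, stem=False):
  # lowercase, replace pattern matches with a space, then drop stop words
  text = re.sub(text_cleaning, " ", str(text).lower().strip())
  tokens = []
  for token in text.split():
    if token not in stop_words:
      if stem:
        token = stemmer.stem(token)
      tokens.append(token)
  return " ".join(tokens)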

* The regular expression is intended to replace runs of non-alphanumeric characters, and tokens that begin with 0, with a space. It does not remove digits in general.

* The preprocessing function lowercases the text, removes stop words, and optionally applies stemming.

* Stemming reduces words to their root forms so that inflected variants share one token. It is off by default here (stem=False).

* Removing stop words cuts noise from the text and keeps the more informative words.
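To make the cleaning step concrete, here is a minimal, self-contained sketch of the same idea. It assumes a tiny hand-written stop-word set (DEMO_STOP_WORDS is hypothetical; the notebook uses NLTK's English list) and a helper named clean_title that is not part of the notebook. It also shows why the pattern is best written as a raw string.

```python
import re

# In a normal string literal, "\b" is the backspace character (0x08);
# in a raw string it stays a two-character regex word boundary.
print("\b" == "\x08")   # True
print(len(r"\b"))       # 2

TEXT_CLEANING = r"\b0\S*|\b[^A-Za-z0-9]+"

# Illustrative stop words only (assumption, not NLTK's list).
DEMO_STOP_WORDS = {"the", "a", "to", "of", "is", "in"}

def clean_title(text):
    """Lowercase, replace pattern matches with a space, drop stop words."""
    text = re.sub(TEXT_CLEANING, " ", str(text).lower().strip())
    return " ".join(t for t in text.split() if t not in DEMO_STOP_WORDS)

print(clean_title("BREAKING: Trump WARNS Iran... (VIDEO)"))
print(clean_title("The end of the road"))
```

Because the pattern anchors on a word boundary, punctuation at the very start of a string is left in place; a plain `[^A-Za-z0-9]+` would be one way to cover that case too.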

In [None]:
# Encode each word as an integer index with the Keras hashing trick (padding is applied later)
def one_hot_encoded(text,vocab_size=5000):
    hot_encoded = one_hot(text,vocab_size)
    return hot_encoded

In [None]:
# word embedding pipeline
def word_embedding(text):
    preprocessed_text=preprocess_filter(text)
    return one_hot_encoded(preprocessed_text)
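Despite its name, Keras one_hot returns one integer index per word, produced by a hashing trick. The Keras documentation notes that the default hash function is Python's built-in hash, which is not stable across interpreter runs, so a model reloaded in a new session may see different indices for the same words. The sketch below is not the Keras implementation; it is a hypothetical, standard-library analogue (stable_hash_encode is an assumed name) that uses MD5 so the word-to-index mapping stays fixed.

```python
import hashlib

def stable_hash_encode(text, vocab_size=5000):
    """Map each whitespace-separated word to an index in [1, vocab_size - 1].

    Index 0 is left free so it can serve as the padding value.
    """
    indices = []
    for word in text.split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        indices.append(int(digest, 16) % (vocab_size - 1) + 1)
    return indices

encoded = stable_hash_encode("soda taxes spread after votes")
print(encoded)
```

Different words can still collide on the same index; a larger vocab_size lowers the chance but does not remove it.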

In [None]:
# Creating Model
embedded_features = 40
model = Sequential()
model.add(Embedding(5000,embedded_features,input_length = max_length))
model.add(LSTM(100))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss = 'binary_crossentropy',optimizer= 'adam',metrics = ['accuracy'])
print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 42, 40)            200000    
                                                                 
 lstm (LSTM)                 (None, 100)               56400     
                                                                 
 dense (Dense)               (None, 1)                 101       
                                                                 
Total params: 256501 (1001.96 KB)
Trainable params: 256501 (1001.96 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None


* The vocabulary size is 5000: every word is hashed to an index below 5000, so distinct words can collide on the same index.

* The embedding layer maps each index to a 40-dimensional vector, which accounts for 5000 × 40 = 200,000 parameters.

* The LSTM layer has 100 units and reads the 42-step sequence, producing a single 100-dimensional summary of each title.

* The final dense layer uses a sigmoid activation and outputs one value between 0 and 1, interpreted as the probability that the title is fake (status 1).
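The parameter counts in the model summary can be checked by hand. This short sketch reproduces them from the layer sizes, using the standard formula for an LSTM layer (four gates, each with input weights, recurrent weights, and a bias); the variable names are illustrative.

```python
vocab_size = 5000   # Embedding input dimension
embed_dim = 40      # Embedding output dimension
lstm_units = 100    # LSTM hidden size

# Embedding: one 40-dimensional vector per vocabulary index.
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with (input + recurrent) weights plus a bias per unit.
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)

# Dense: one weight per LSTM unit plus one bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```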

In [None]:
# One hot encoded title
one_hot_encoded_title =df['title'].apply(lambda x : word_embedding(x)).values

In [None]:
# Pad the sequences at the front so they all have the same length
padded_encoded_title = pad_sequences(one_hot_encoded_title,maxlen=max_length,padding = 'pre')
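As an illustration of what pre-padding does, here is a pure-Python sketch for a single sequence. The helper pre_pad is hypothetical; it assumes a positive maxlen and mirrors the Keras defaults of padding with 0 and, for sequences that are too long, dropping values from the front.

```python
def pre_pad(sequence, maxlen, value=0):
    """Left-pad a sequence to maxlen; keep only the last maxlen items if longer."""
    kept = list(sequence)[-maxlen:]
    return [value] * (maxlen - len(kept)) + kept

print(pre_pad([3, 7], 5))
print(pre_pad([1, 2, 3, 4], 3))
```

Padding at the front keeps the real tokens at the end of the sequence, closest to the LSTM's final state.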


In [None]:
# Splitting
X = padded_encoded_title
y = df['status'].values
y = np.array(y)

# shapes
print(X.shape)
print(y.shape)

(44898, 42)
(44898,)


In [None]:
# shape and size
print('X shape {}'.format(X.shape))
print('y shape {}'.format(y.shape))

X shape (44898, 42)
y shape (44898,)


In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y, random_state = 42)

# Shape and size of train and test dataset
print('X train shape {}'.format(X_train.shape))
print('X test shape {}'.format(X_test.shape))
print('y train shape {}'.format(y_train.shape))
print('y test shape {}'.format(y_test.shape))

X train shape (33673, 42)
X test shape (11225, 42)
y train shape (33673,)
y test shape (11225,)


In [None]:
# Model training (the test split also serves as the validation data)
model.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=15,batch_size=64)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.src.callbacks.History at 0x7c329ffdc8e0>

##Evaluation

In [None]:
# Compare accuracy at several threshold values
def best_threshold_value(thresholds:list,X_test):
    probabilities = model.predict(X_test)  # predict once and reuse for every threshold
    accuracies = []
    for thresh in thresholds:
        ypred = np.where(probabilities > thresh,1,0)
        accuracies.append(accuracy_score(y_test,ypred))
    return pd.DataFrame({
        'Threshold': thresholds,
        'Accuracy' : accuracies
    })

In [None]:
best_threshold_value([0.4,0.5,0.6,0.7,0.8,0.9], X_test)



Unnamed: 0,Threshold,Accuracy
0,0.4,0.965969
1,0.5,0.966325
2,0.6,0.966771
3,0.7,0.967216
4,0.8,0.96686
5,0.9,0.967216


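* Accuracy barely changes with the threshold: it ranges from 0.965969 to 0.967216.
* The highest accuracy in the table is at 0.7 and 0.9 (0.967216). The following cells use 0.4, which has the lowest accuracy of the values tested (0.965969).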
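The comparison can be made explicit with a few lines of plain Python. The sketch below copies the threshold and accuracy pairs from the table above and picks the row with the highest accuracy; the variable names are illustrative.

```python
# (threshold, accuracy) pairs copied from the table above
results = [
    (0.4, 0.965969),
    (0.5, 0.966325),
    (0.6, 0.966771),
    (0.7, 0.967216),
    (0.8, 0.96686),
    (0.9, 0.967216),
]

# max() returns the first row with the highest accuracy when there is a tie.
best_threshold, best_accuracy = max(results, key=lambda row: row[1])

accuracies = [acc for _, acc in results]
spread = max(accuracies) - min(accuracies)

print(best_threshold, best_accuracy, round(spread, 6))
```

Since the spread is only about 0.001, accuracy alone gives little reason to prefer one threshold; precision and recall on the fake class may be a more useful tie-breaker.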

In [None]:
# Predicted values at threshold 0.4
y_pred = model.predict(X_test)
y_pred = np.where(y_pred >0.4, 1, 0)



In [None]:
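# Confusion matrix
# scikit-learn expects (y_true, y_pred); swapping them transposes the matrix
# and exchanges precision with recall
print('Confusion matrix')
print(confusion_matrix(y_test,y_pred))
print('----------------')
print('Classification report')
print(classification_report(y_test,y_pred))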

Confusion matrix
[[5189  175]
 [ 207 5654]]
----------------
Classification report
              precision    recall  f1-score   support

           0       0.96      0.97      0.96      5364
           1       0.97      0.96      0.97      5861

    accuracy                           0.97     11225
   macro avg       0.97      0.97      0.97     11225
weighted avg       0.97      0.97      0.97     11225



* Precision, recall, and F1-score are between 0.96 and 0.97 for both classes.

* Performance is balanced across the two classes, as the near-identical metrics show.

* Overall accuracy on the test split is 0.97 (10,843 of 11,225 titles classified correctly).
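The reported metrics follow directly from the four counts in the confusion matrix. The sketch below recomputes them in plain Python. The helper per_class_metrics is hypothetical and assumes the scikit-learn convention that rows are true labels and columns are predictions; it also shows that transposing the matrix, which is what happens when the two arguments are passed in the opposite order, swaps precision and recall while leaving accuracy unchanged.

```python
def per_class_metrics(matrix):
    """Return {class: (precision, recall)} for matrix[true][predicted]."""
    metrics = {}
    for k in range(len(matrix)):
        true_positive = matrix[k][k]
        predicted_as_k = sum(row[k] for row in matrix)  # column sum
        actually_k = sum(matrix[k])                     # row sum
        metrics[k] = (true_positive / predicted_as_k, true_positive / actually_k)
    return metrics

printed = [[5189, 175], [207, 5654]]           # counts as shown above
transposed = [list(col) for col in zip(*printed)]

accuracy = (printed[0][0] + printed[1][1]) / sum(map(sum, printed))
print(round(accuracy, 6))
print(per_class_metrics(printed))
print(per_class_metrics(transposed))
```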

##Predictions

In [None]:
# Predict on a custom input string
def prediction_on_custom_input(text):
    encoded = word_embedding(text)
    padded_encoded_title = pad_sequences([encoded],maxlen=max_length,padding = 'pre')
    output = model.predict(padded_encoded_title)
    # status 1 means fake, so a probability above the 0.4 threshold is labelled fake
    output = np.where(output > 0.4,1,0)
    if output[0][0] == 1:
        return 'Yes this News is fake'
    return 'No, It is not fake'
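The direction of the threshold comparison matters, because status 1 was assigned to the fake articles when the datasets were labelled. As a small sketch of that mapping in isolation, the hypothetical helper below turns a sigmoid output into the same two messages, using the 0.4 threshold from the evaluation cells.

```python
def label_from_probability(p_fake, threshold=0.4):
    """Map the model's sigmoid output (probability of status 1 = fake) to a message."""
    if p_fake > threshold:
        return 'Yes this News is fake'
    return 'No, It is not fake'

print(label_from_probability(0.93))
print(label_from_probability(0.08))
```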

In [None]:
# predictions
prediction_on_custom_input('Americans are more concerned over Indians fake open source contribution')



'No, It is not fake'

In [None]:
news = 'Trump Just Sent Michelle Obama a Bill She will Never Be able to pay in her lifetime'
prediction_on_custom_input(news)



'No, It is not fake'

##Saving The Model


In [None]:
def save_model(model, suffix):
  """
  Saves a given model in the models directory, using suffix (a string) in the file name.
  """
  # Build the model file path inside the Drive folder
  modeldir = "/content/drive/MyDrive/Fake_news_detection/Model/"
  model_path = modeldir + "-" + suffix + ".h5"  # .h5 selects the HDF5 save format
  print(f"Saving model to: {model_path}....")
  model.save(model_path)
  return model_path
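The path built above ends in a file named "-Fake_news_predictor.h5" because the directory already ends with a slash and a hyphen is added before the suffix. One possible alternative, sketched here with a hypothetical helper build_model_path, is to join the directory and file name explicitly. This produces a different file name from the one saved in this notebook, so any later load call would need to use the same path.

```python
import posixpath

def build_model_path(model_dir, name, extension=".h5"):
    """Join a directory and a model name into a file path (POSIX separators)."""
    return posixpath.join(model_dir, name + extension)

path = build_model_path(
    "/content/drive/MyDrive/Fake_news_detection/Model/",
    "Fake_news_predictor",
)
print(path)
```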

In [None]:
save_model(model, suffix="Fake_news_predictor")

Saving model to: /content/drive/MyDrive/Fake_news_detection/Model/-Fake_news_predictor.h5....


'/content/drive/MyDrive/Fake_news_detection/Model/-Fake_news_predictor.h5'

In [None]:
def load_model(model_path):
  """
  Loads a saved model from a specified path.
  """
  print(f"Loading the saved model from: {model_path}...")

  loaded_model = keras.models.load_model(model_path)

  return loaded_model

##Integrating OCR (Optical Character Recognition)

In [None]:
!pip install pytesseract


Collecting pytesseract
  Downloading pytesseract-0.3.10-py3-none-any.whl (14 kB)
Installing collected packages: pytesseract
Successfully installed pytesseract-0.3.10


In [None]:
!apt-get install tesseract-ocr


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  tesseract-ocr-eng tesseract-ocr-osd
The following NEW packages will be installed:
  tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd
0 upgraded, 3 newly installed, 0 to remove and 45 not upgraded.
Need to get 4,816 kB of archives.
After this operation, 15.6 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-eng all 1:4.00~git30-7274cfa-1.1 [1,591 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-osd all 1:4.00~git30-7274cfa-1.1 [2,990 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr amd64 4.1.1-2.1build1 [236 kB]
Fetched 4,816 kB in 2s (2,175 kB/s)
Selecting previously unselected package tesseract-ocr-eng.
(Reading database ... 121925 files and directories currently installed.)
Preparing to unpack .../tesseract-ocr-

In [None]:
from PIL import Image
import pytesseract

# Set the path to the Tesseract executable (change it to match your installation)
pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract'

# Open an image file
image_path = '/content/drive/MyDrive/Fake_news_detection/Input_Image/test-img1.png'
img = Image.open(image_path)

# Use Tesseract to do OCR on the image
text = pytesseract.image_to_string(img)


# Print the extracted text
print("Extracted Text:")
print(text)


Extracted Text:
AMERICANS ARE MORE CONCERNED OVER
INDIANS FAKE OPEN SOURCE CONTRIBUTION.



###Prediction On The Text Retrieved Through OCR

* The text extracted from the image above can be passed to the model for prediction.
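OCR output usually contains line breaks and trailing blank lines, as the extracted text above does. A minimal sketch of tidying it into a single headline before prediction follows; normalize_ocr_text is a hypothetical helper, and the sample string is the text extracted above.

```python
def normalize_ocr_text(raw_text):
    """Collapse line breaks and repeated whitespace into single spaces."""
    return " ".join(raw_text.split())

raw = "AMERICANS ARE MORE CONCERNED OVER\nINDIANS FAKE OPEN SOURCE CONTRIBUTION.\n\n"
print(normalize_ocr_text(raw))
```

The notebook's preprocessing already splits on whitespace, so this step mainly helps when the extracted text is displayed or logged.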

In [None]:
# Loading the saved trained model
model = load_model("/content/drive/MyDrive/Fake_news_detection/Model/-Fake_news_predictor.h5")

Loading the saved model from: /content/drive/MyDrive/Fake_news_detection/Model/-Fake_news_predictor.h5...


In [None]:
prediction_on_custom_input(text)



'No, It is not fake'