<a href="https://colab.research.google.com/github/gbiamgaurav/NLP/blob/main/Fake_news_classifier_using_LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fake News Classifier Using LSTM
### Dataset: https://www.kaggle.com/c/fake-news/data#

## Import the data directly from Kaggle

In [14]:
# Mount Google Drive and upload the kaggle.json API token

from google.colab import drive
drive.mount('/content/gdrive')
from google.colab import files
files.upload()

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


Saving kaggle.json to kaggle (1).json


{'kaggle.json': b'{"username":"gbiamgaurav","key":"<redacted>"}'}

In [15]:
# Check the kaggle.json

!ls -lha kaggle.json

-rw-r--r-- 1 root root 67 Apr 12 05:15 kaggle.json


In [16]:
# Make a directory

!mkdir -p ~/.kaggle

!cp kaggle.json ~/.kaggle/

In [17]:
# Set permission

!chmod 600 /root/.kaggle/kaggle.json
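
The three shell steps above (create the directory, copy the token, restrict its permissions) can also be sketched in Python. The helper name `install_kaggle_token`, the temporary home directory, and the placeholder credentials are assumptions made for illustration; a real token should not be printed or committed with the notebook.

```python
import json
import stat
import tempfile
from pathlib import Path


def install_kaggle_token(token: dict, home: Path) -> Path:
    """Write the token to <home>/.kaggle/kaggle.json, readable by the owner only."""
    kaggle_dir = home / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)   # same as: mkdir -p ~/.kaggle
    target = kaggle_dir / "kaggle.json"
    target.write_text(json.dumps(token))            # same as: cp kaggle.json ~/.kaggle/
    target.chmod(0o600)                             # same as: chmod 600 ...
    return target


# Hypothetical credentials and a throwaway home directory, for demonstration only.
demo_home = Path(tempfile.mkdtemp())
path = install_kaggle_token({"username": "example-user", "key": "example-key"}, demo_home)
mode = stat.S_IMODE(path.stat().st_mode)
print(path, oct(mode))
```

The `0o600` mode matches the `chmod 600` call: the Kaggle CLI warns when the token file is readable by other users.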

In [19]:
# Download the file directly from Kaggle

!kaggle competitions download -c fake-news

fake-news.zip: Skipping, found more recently modified local copy (use --force to force download)


In [20]:
# Unzip the zip file

!unzip fake-news.zip

Archive:  fake-news.zip
replace submit.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: submit.csv              
replace test.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: test.csv                
replace train.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: train.csv               


In [21]:
import pandas as pd

In [23]:
df = pd.read_csv('train.csv')
df.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [26]:
## Check for null values

df.isnull().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

## Since this is text data, the null values cannot be meaningfully imputed, so we drop those rows instead.

In [27]:
## Drop null / nan values

df = df.dropna()

In [28]:
## Check for the null values

df.isnull().sum()

id        0
title     0
author    0
text      0
label     0
dtype: int64
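
A plain `dropna()` removes a row if any column is null. Since the model further down only uses `title`, a narrower alternative is to drop on that column alone, which by the counts above would remove the 558 rows with a missing title rather than every row with a missing author or text. The sketch below uses a small made-up frame (`toy` is not part of the dataset) to show the difference.

```python
import pandas as pd

# Toy frame with the same column names as train.csv; the values are invented.
toy = pd.DataFrame({
    "title": ["A headline", None, "Another headline"],
    "author": ["Someone", "Someone else", None],
    "text": ["body", "body", "body"],
    "label": [1, 0, 1],
})

dropped_any = toy.dropna()                    # drops rows with a null in ANY column
dropped_title = toy.dropna(subset=["title"])  # drops rows only when title is null

print(len(dropped_any), len(dropped_title))
```

Whether the narrower drop is appropriate depends on which columns later cells rely on; the notebook as written drops on all columns.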

In [29]:
df.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [30]:
## Get the independent features

X = df.drop('label', axis=1)
X

Unnamed: 0,id,title,author,text
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ..."
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...
...,...,...,...,...
20795,20795,Rapper T.I.: Trump a ’Poster Child For White S...,Jerome Hudson,Rapper T. I. unloaded on black celebrities who...
20796,20796,"N.F.L. Playoffs: Schedule, Matchups and Odds -...",Benjamin Hoffman,When the Green Bay Packers lost to the Washing...
20797,20797,Macy’s Is Said to Receive Takeover Approach by...,Michael J. de la Merced and Rachel Abrams,The Macy’s of today grew from the union of sev...
20798,20798,"NATO, Russia To Hold Parallel Exercises In Bal...",Alex Ansary,"NATO, Russia To Hold Parallel Exercises In Bal..."


In [31]:
## Get the dependent feature

y = df['label']
y

0        1
1        0
2        1
3        1
4        1
        ..
20795    0
20796    0
20797    0
20798    1
20799    1
Name: label, Length: 18285, dtype: int64

In [33]:
print("Shape of independent features: ", X.shape)
print('Shape of dependent features: ', y.shape)

Shape of independent features:  (18285, 4)
Shape of dependent features:  (18285,)


In [38]:
## import tensorflow

import tensorflow as tf

print(tf.__version__)

2.12.0


## Import Tensorflow libraries

In [39]:
## Import libraries

from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense

In [41]:
## Vocabulary size 

voc_size = 5000

In [42]:
messages = X.copy()

In [44]:
messages

Unnamed: 0,id,title,author,text
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ..."
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...
...,...,...,...,...
20795,20795,Rapper T.I.: Trump a ’Poster Child For White S...,Jerome Hudson,Rapper T. I. unloaded on black celebrities who...
20796,20796,"N.F.L. Playoffs: Schedule, Matchups and Odds -...",Benjamin Hoffman,When the Green Bay Packers lost to the Washing...
20797,20797,Macy’s Is Said to Receive Takeover Approach by...,Michael J. de la Merced and Rachel Abrams,The Macy’s of today grew from the union of sev...
20798,20798,"NATO, Russia To Hold Parallel Exercises In Bal...",Alex Ansary,"NATO, Russia To Hold Parallel Exercises In Bal..."


In [45]:
messages['title'][1]

'FLYNN: Hillary Clinton, Big Woman on Campus - Breitbart'

In [47]:
messages.reset_index(inplace=True)

In [48]:
messages

Unnamed: 0,index,id,title,author,text
0,0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...
1,1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...
2,2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ..."
3,3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...
4,4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...
...,...,...,...,...,...
18280,20795,20795,Rapper T.I.: Trump a ’Poster Child For White S...,Jerome Hudson,Rapper T. I. unloaded on black celebrities who...
18281,20796,20796,"N.F.L. Playoffs: Schedule, Matchups and Odds -...",Benjamin Hoffman,When the Green Bay Packers lost to the Washing...
18282,20797,20797,Macy’s Is Said to Receive Takeover Approach by...,Michael J. de la Merced and Rachel Abrams,The Macy’s of today grew from the union of sev...
18283,20798,20798,"NATO, Russia To Hold Parallel Exercises In Bal...",Alex Ansary,"NATO, Russia To Hold Parallel Exercises In Bal..."
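
The reset matters because `dropna()` leaves gaps in the index, and the preprocessing loop later looks titles up by position `0 .. len(messages) - 1`. A minimal sketch with an invented three-element series (not the real data) shows the failure and the fix; `drop=True` is an optional variation that avoids the extra `index` column seen in the output above.

```python
import pandas as pd

# After dropna the labels are [0, 2]; label 1 no longer exists.
titles = pd.Series(["first title", None, "third title"]).dropna()

try:
    titles[1]
    label_1_exists = True
except KeyError:
    label_1_exists = False

# Renumber the rows 0..n-1 so that positional loops work again.
titles_reset = titles.reset_index(drop=True)

print(list(titles.index), label_1_exists, titles_reset[1])
```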


## Import NLTK libraries

In [49]:
import nltk
import re
from nltk.corpus import stopwords

In [50]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [53]:
## Data Preprocessing

from nltk.stem.porter import PorterStemmer  ## Stemming purpose
ps = PorterStemmer()

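# Build the stop-word set once; calling stopwords.words('english') inside the
# loop rebuilds the list for every word of every title.
stop_words = set(stopwords.words('english'))

corpus = []

for i in range(len(messages)):

  # Keep letters only, lowercase, and split into tokens
  review = re.sub('[^a-zA-Z]', ' ', messages['title'][i])
  review = review.lower()
  review = review.split()

  # Drop stop words and stem what remains
  review = [ps.stem(word) for word in review if word not in stop_words]
  review = ' '.join(review)

  corpus.append(review)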
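
The cleaning steps in the loop above can be tried on a single title without NLTK. This is only a sketch: `STOPWORDS` is a tiny hand-picked subset (an assumption, not NLTK's English list) and the default `stem` function is the identity, so the output keeps full words where the notebook's Porter stemmer would shorten them (for example `hillari`, `campu`).

```python
import re

# Tiny illustrative stop-word subset; NLTK's English list is much longer.
STOPWORDS = frozenset({"the", "on", "a", "of", "in", "to", "and", "is"})


def clean_title(title, stem=lambda word: word):
    """Keep letters only, lowercase, drop stop words, then apply `stem` to each word."""
    words = re.sub("[^a-zA-Z]", " ", title).lower().split()
    return " ".join(stem(word) for word in words if word not in STOPWORDS)


cleaned = clean_title("FLYNN: Hillary Clinton, Big Woman on Campus - Breitbart")
print(cleaned)
```

Passing `PorterStemmer().stem` as the `stem` argument would bring the sketch closer to the notebook's behavior.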

In [55]:
corpus

['hous dem aid even see comey letter jason chaffetz tweet',
 'flynn hillari clinton big woman campu breitbart',
 'truth might get fire',
 'civilian kill singl us airstrik identifi',
 'iranian woman jail fiction unpublish stori woman stone death adulteri',
 'jacki mason hollywood would love trump bomb north korea lack tran bathroom exclus video breitbart',
 'beno hamon win french socialist parti presidenti nomin new york time',
 'back channel plan ukrain russia courtesi trump associ new york time',
 'obama organ action partner soro link indivis disrupt trump agenda',
 'bbc comedi sketch real housew isi caus outrag',
 'russian research discov secret nazi militari base treasur hunter arctic photo',
 'us offici see link trump russia',
 'ye paid govern troll social media blog forum websit',
 'major leagu soccer argentin find home success new york time',
 'well fargo chief abruptli step new york time',
 'anonym donor pay million releas everyon arrest dakota access pipelin',
 'fbi close hilla

In [56]:
len(corpus)

18285

In [59]:
corpus[1]

'flynn hillari clinton big woman campu breitbart'

## One-Hot Representation

In [61]:
onehot_repr = [one_hot(words, voc_size) for words in corpus]

onehot_repr

[[3459, 1049, 4826, 1570, 3452, 2499, 1344, 2139, 2871, 1698],
 [466, 2464, 2019, 3327, 3634, 1629, 243],
 [2079, 4674, 1647, 2338],
 [464, 4908, 383, 2466, 4537, 3869],
 [1713, 3634, 2542, 2787, 3645, 4094, 3634, 2501, 1559, 3260],
 [1138,
  3271,
  4794,
  3365,
  4291,
  4719,
  1602,
  2516,
  4365,
  3143,
  4792,
  4916,
  791,
  4163,
  243],
 [3063, 4666, 4165, 3618, 1973, 3056, 3941, 100, 4825, 4278, 618],
 [1198, 67, 4756, 1160, 3872, 53, 4719, 2399, 4825, 4278, 618],
 [4358, 4004, 4210, 553, 3347, 129, 187, 2032, 4719, 69],
 [703, 2186, 334, 3366, 1958, 3240, 2368, 990],
 [1286, 1773, 148, 525, 26, 1991, 4420, 2958, 4625, 4852, 1420],
 [2466, 3893, 3452, 129, 4719, 3872],
 [1726, 392, 3639, 2630, 1109, 1060, 3864, 3921, 3924],
 [4938, 2510, 3642, 4318, 3012, 3312, 3758, 4825, 4278, 618],
 [947, 1709, 3820, 3424, 1152, 4825, 4278, 618],
 [2867, 3025, 2565, 1752, 146, 540, 3220, 195, 341, 4349],
 [934, 3119, 2464],
 [1559, 2824, 4970, 1240, 4719, 1727, 871, 243],
 [4488, 1960,

In [63]:
corpus[1]

'flynn hillari clinton big woman campu breitbart'

In [62]:
onehot_repr[1]

[466, 2464, 2019, 3327, 3634, 1629, 243]
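
Despite its name, Keras `one_hot` does not build one-hot vectors: to my understanding it hashes each word to an integer between 1 and `voc_size - 1`, which is why a repeated word such as `woman` maps to the same index (3634) wherever it occurs above. The sketch below imitates that idea with `hashlib.md5` as a stable stand-in hash; it is an assumption for illustration and will not reproduce the exact integers shown in the notebook, which may also differ between sessions when Python's built-in string hash is used.

```python
import hashlib


def hashed_indices(text, n):
    """Map each whitespace-separated word to an integer in [1, n - 1] by hashing."""
    def stable_hash(word):
        return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)

    return [stable_hash(word) % (n - 1) + 1 for word in text.split()]


ids = hashed_indices(
    "iranian woman jail fiction unpublish stori woman stone death adulteri", 5000
)
print(ids)
```

Because the mapping is a hash, two different words can collide on the same index; a larger `voc_size` makes that less likely.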

## Embedding Representation

In [64]:
sent_length = 20  # fixed length that every title is padded (or truncated) to

embedded_docs = pad_sequences(onehot_repr, padding='pre', maxlen=sent_length)

print(embedded_docs)

[[   0    0    0 ... 2139 2871 1698]
 [   0    0    0 ... 3634 1629  243]
 [   0    0    0 ... 4674 1647 2338]
 ...
 [   0    0    0 ... 4825 4278  618]
 [   0    0    0 ... 1932 3102   88]
 [   0    0    0 ... 2549 3508 3658]]


In [66]:
embedded_docs[1]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,  466, 2464, 2019, 3327, 3634, 1629,  243], dtype=int32)
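
What `padding='pre'` does can be sketched in plain NumPy: zeros are placed in front of each sequence until it reaches `maxlen`, and sequences longer than `maxlen` keep their last `maxlen` entries (the default truncation side in `pad_sequences`, as far as I know). `pad_pre` is a hypothetical helper written for this illustration, not part of Keras.

```python
import numpy as np


def pad_pre(sequences, maxlen):
    """Left-pad each integer sequence with zeros, keeping at most the last `maxlen` items."""
    out = np.zeros((len(sequences), maxlen), dtype=np.int32)
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]                      # truncate from the front if too long
        if kept:
            out[row, maxlen - len(kept):] = kept  # right-align, zeros stay on the left
    return out


# Indices of the second title, taken from the output above.
padded = pad_pre([[466, 2464, 2019, 3327, 3634, 1629, 243]], 20)
print(padded[0])
```

Index 0 is reserved for padding, which is consistent with the hashed word indices starting at 1.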

## Creating the model

In [68]:
## Creating the model

embedding_vector_features = 40

model = Sequential()

model.add(Embedding(voc_size, embedding_vector_features, input_length=sent_length))

model.add(LSTM(100))

model.add(Dense(1, activation = "sigmoid"))  # binary output, so a single sigmoid unit

model.compile(loss='binary_crossentropy', optimizer = 'adam', metrics = ['accuracy']) # Binary classification

print(model.summary())

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 20, 40)            200000    
                                                                 
 lstm_1 (LSTM)               (None, 100)               56400     
                                                                 
 dense_1 (Dense)             (None, 1)                 101       
                                                                 
Total params: 256,501
Trainable params: 256,501
Non-trainable params: 0
_________________________________________________________________
None
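
The parameter counts in the summary can be checked by hand. The short sketch below recomputes them from the notebook's own settings, using the standard LSTM count of four gates, each with input weights, recurrent weights, and a bias.

```python
voc_size = 5000        # vocabulary size used for hashing
embedding_dim = 40     # embedding_vector_features
lstm_units = 100       # LSTM(100)

# Embedding: one 40-dimensional vector per vocabulary index
embedding_params = voc_size * embedding_dim

# LSTM: 4 gates x (input weights + recurrent weights + bias)
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense(1): one weight per LSTM unit plus a bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```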


In [69]:
len(embedded_docs)

18285

In [70]:
import numpy as np

In [72]:
X_final = np.array(embedded_docs)
y_final = np.array(y)

In [73]:
print("Shape of X_final: ", X_final.shape)
print("Shape of y_final: ", y_final.shape)

Shape of X_final:  (18285, 20)
Shape of y_final:  (18285,)


In [74]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size= 0.30, random_state=42)

## Model Training

In [75]:
## Train the model

model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=25, batch_size=64)

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


<keras.callbacks.History at 0x7ff27c8b7b20>

## Apply Early Stopping

In [76]:
from tensorflow.keras.callbacks import EarlyStopping
early_stopping = EarlyStopping()

In [80]:
## Train the model while applying earlystopping

model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=25, batch_size=64, callbacks=[early_stopping])

Epoch 1/25
Epoch 2/25
Epoch 3/25


<keras.callbacks.History at 0x7ff27c07b280>
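
`EarlyStopping()` with no arguments monitors `val_loss` with `patience=0`, so training halts at the first epoch whose validation loss fails to improve; that is consistent with the run above ending after 3 of 25 epochs. The function below is a simplified model of that rule, not the Keras source, and the loss values are hypothetical because the notebook output does not include them. Exact bookkeeping may differ between Keras versions.

```python
def stopping_epoch(val_losses, patience=0):
    """Return the 1-based epoch at which a simplified early-stopping rule halts."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        wait += 1
        if loss < best:          # improvement: remember it and reset the counter
            best = loss
            wait = 0
            continue
        if wait >= patience and epoch > 0:
            return epoch + 1
    return len(val_losses)       # never triggered: all epochs run


# Hypothetical validation losses, for illustration only.
losses = [0.30, 0.28, 0.31, 0.27]
stop_default = stopping_epoch(losses, patience=0)
stop_patient = stopping_epoch(losses, patience=2)
print(stop_default, stop_patient)
```

With a larger `patience` the run survives the temporary rise at epoch 3 and reaches the better loss at epoch 4. Note also that the cell above continues training the model already fitted for 25 epochs, so the callback is reacting to a model that may already be overfitting.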

## Adding Dropout to reduce overfitting

In [96]:
from tensorflow.keras.layers import Dropout

## Creating Model

embedding_vector_features = 40

model = Sequential()
model.add(Embedding(voc_size, embedding_vector_features, input_length=sent_length))
model.add(Dropout(0.3))
model.add(LSTM(100))
model.add(Dropout(0.3))
model.add(Dense(1, activation="sigmoid"))
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

In [None]:
print(model.summary())

## Performance Metrics & Accuracy

In [81]:
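# The execution counters show this cell (81) ran before the Dropout cell (96)
# rebound `model`, so these predictions come from the trained model without Dropout.
y_pred = model.predict(X_test)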

y_pred



array([[1.0000000e+00],
       [9.4791092e-11],
       [2.3650061e-04],
       ...,
       [1.0000000e+00],
       [2.5425705e-12],
       [1.0000000e+00]], dtype=float32)

In [91]:
y_pred = np.where(y_pred > 0.68, 1, 0) # binarize probabilities at a 0.68 threshold (a value one would tune from the ROC curve)

In [92]:
y_pred

array([[1],
       [0],
       [0],
       ...,
       [1],
       [0],
       [1]])
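
The thresholding step can be checked on the three probabilities printed earlier. This sketch also flattens the result to one dimension so that it has the same shape as `y_test`; the variable names are chosen for the example.

```python
import numpy as np

# First three sigmoid outputs from the prediction cell above.
probs = np.array([[1.0000000e+00], [9.4791092e-11], [2.3650061e-04]], dtype=np.float32)

threshold = 0.68
labels = (probs > threshold).astype(int).ravel()  # shape (n,) instead of (n, 1)
print(labels)
```

Raising the threshold above 0.5 trades recall on class 1 for precision; 0.68 is the value the notebook uses.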

In [89]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

In [93]:
confusion_matrix(y_test, y_pred)

array([[2828,  279],
       [ 202, 2177]])

In [94]:
print('Accuracy of the model: ', accuracy_score(y_test, y_pred))

Accuracy of the model:  0.9123222748815166


In [95]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.93      0.91      0.92      3107
           1       0.89      0.92      0.90      2379

    accuracy                           0.91      5486
   macro avg       0.91      0.91      0.91      5486
weighted avg       0.91      0.91      0.91      5486
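
The accuracy and the class-1 row of the report follow directly from the confusion matrix, where rows are true labels and columns are predicted labels. A short sketch recomputing them from the four counts printed above:

```python
import numpy as np

# Confusion matrix from the cell above: rows = true class, columns = predicted class.
cm = np.array([[2828, 279],
               [202, 2177]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_1 = tp / (tp + fp)   # of the titles predicted as class 1, how many are class 1
recall_1 = tp / (tp + fn)      # of the true class-1 titles, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(accuracy, round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
```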



In [97]:
from tensorflow.keras.layers import Dropout

## Creating Model

embedding_vector_features = 40

model_1 = Sequential()
model_1.add(Embedding(voc_size, embedding_vector_features, input_length=sent_length))
model_1.add(Dropout(0.3))
model_1.add(LSTM(100))
model_1.add(Dropout(0.3))
model_1.add(Dense(1, activation="sigmoid"))
model_1.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
print(model_1.summary())

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 20, 40)            200000    
                                                                 
 dropout_2 (Dropout)         (None, 20, 40)            0         
                                                                 
 lstm_3 (LSTM)               (None, 100)               56400     
                                                                 
 dropout_3 (Dropout)         (None, 100)               0         
                                                                 
 dense_3 (Dense)             (None, 1)                 101       
                                                                 
Total params: 256,501
Trainable params: 256,501
Non-trainable params: 0
_________________________________________________________________
None


In [98]:
y_pred_1 = model_1.predict(X_test)

y_pred_1 



array([[0.50145984],
       [0.4997586 ],
       [0.50034726],
       ...,
       [0.5017005 ],
       [0.49951613],
       [0.50079316]], dtype=float32)
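
The probabilities printed above all lie within about 0.002 of 0.5, which is what a sigmoid output looks like before any training; no `fit` call on `model_1` appears in the notebook. A small sketch using only the six printed values suggests why such outputs cannot yield meaningful labels at either threshold.

```python
import numpy as np

# The six probabilities printed by model_1.predict above.
untrained_probs = np.array(
    [0.50145984, 0.4997586, 0.50034726, 0.5017005, 0.49951613, 0.50079316]
)

at_068 = (untrained_probs > 0.68).astype(int)  # every value falls below 0.68
at_050 = (untrained_probs > 0.50).astype(int)  # labels decided by noise around 0.5

print(at_068, at_050)
```

Fitting `model_1` on `X_train` and `y_train` first, as was done for the first model, would be needed before its metrics can be compared.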

In [99]:
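# Note: this thresholds y_pred (the first model's labels), not y_pred_1, and model_1
# was never fitted, so the metrics below repeat the first model's results.
y_pred_1 = np.where(y_pred > 0.68, 1, 0)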

In [100]:
y_pred_1

array([[1],
       [0],
       [0],
       ...,
       [1],
       [0],
       [1]])

In [101]:
print(confusion_matrix(y_test, y_pred_1))

[[2828  279]
 [ 202 2177]]


In [102]:
print("Accuracy of the new model: ", accuracy_score(y_test, y_pred_1))

Accuracy of the new model:  0.9123222748815166


In [103]:
print(classification_report(y_test, y_pred_1))

              precision    recall  f1-score   support

           0       0.93      0.91      0.92      3107
           1       0.89      0.92      0.90      2379

    accuracy                           0.91      5486
   macro avg       0.91      0.91      0.91      5486
weighted avg       0.91      0.91      0.91      5486

