
# TASK #1: UNDERSTAND THE PROBLEM STATEMENT AND BUSINESS CASE

Welcome to "NLP : Fake News Detector". This is a project-based course which should take approximately 1.5 hours to finish. Before diving into the project, please take a look at the course objectives and structure:
Course Objectives

In this course, we are going to focus on the following learning objectives:

1. Apply Python libraries to import and visualize datasets
2. Perform exploratory data analysis and plot a word cloud
3. Perform text data cleaning such as removing punctuation and stop words
4. Understand the concept of a tokenizer
5. Perform tokenizing and padding on a text corpus to feed the deep learning model
6. Understand the theory and intuition behind Recurrent Neural Networks and LSTM
7. Build and train the deep learning model
8. Assess the performance of the trained model

# TASK #2: IMPORT LIBRARIES AND DATASETS

In [None]:
!pip install plotly
!pip install --upgrade nbformat
!pip install nltk
!pip install spacy # spaCy is an open-source software library for advanced natural language processing
!pip install WordCloud
!pip install gensim # Gensim is an open-source library for unsupervised topic modeling and natural language processing
import nltk
nltk.download('punkt')

import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud  # STOPWORDS is imported from gensim below, so it is not imported here
import re
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
# import keras
from tensorflow.keras.preprocessing.text import one_hot, Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Embedding, Input, LSTM, Conv1D, MaxPool1D, Bidirectional
from tensorflow.keras.models import Model
!pip install jupyterthemes
from jupyterthemes import jtplot
# Set the notebook plot style to the monokai theme so the x and y axes stay readable.
# Without this line, xlabel and ylabel render black on black and are hard to see.
jtplot.style(theme='monokai', context='notebook', ticks=True, grid=False)


In [None]:
# load the data
!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=19uJBw5Ond5m8MrrqLnHsELiRTzg1nSXQ' -O Fake.csv
!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1vIaay2z09JnBBSIKHmscctzraBvHC-iU' -O True.csv


df_true = pd.read_csv("True.csv")
df_fake = pd.read_csv("Fake.csv")

MINI CHALLENGE #1:
- Indicate how many data samples we have per class (i.e., Fake and True)
- List how many null elements are present and the memory usage of each dataframe

In [None]:
df_fake.head()

In [None]:
df_fake.iloc[0,1]

In [None]:
df_true.head()

In [None]:
df_true.shape

In [None]:
df_fake.shape

In [None]:
df_true.info()

In [None]:
df_fake.info()

In [None]:
df_true.isnull().sum()

In [None]:
df_fake.isnull().sum()

# TASK #3: PERFORM EXPLORATORY DATA ANALYSIS

In [None]:
# add a target class column to indicate whether the news is real or fake
df_true['isfake'] = 0
df_true.head()

In [None]:
df_fake['isfake'] = 1
df_fake.head()

In [None]:
# Concatenate Real and Fake News
df = pd.concat([df_true, df_fake]).reset_index(drop = True)
df

In [None]:
df.drop(columns = ['date'], inplace = True)

In [None]:
# combine title and text together
df['original'] = df['title'] + ' ' + df['text']
df.head()

In [None]:
df['original'][0]

# TASK #4: PERFORM DATA CLEANING

In [None]:
# download stopwords
nltk.download("stopwords")

In [None]:
# Obtain additional stopwords from nltk
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

In [None]:
# Remove stopwords and words with 3 or fewer characters
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3 and token not in stop_words:
            result.append(token)
            
    return result
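
To make the filtering rule concrete, here is a minimal, dependency-free sketch of the same idea. It assumes a simple regex tokenizer as a stand-in for `gensim.utils.simple_preprocess` and a tiny hand-written stopword set (`TOY_STOPWORDS`, a hypothetical name); the function above uses the much larger gensim and NLTK lists instead.

```python
import re

# Hypothetical, tiny stopword set used for illustration only.
TOY_STOPWORDS = {"that", "this", "with", "from", "have"}

def toy_preprocess(text, min_len=4):
    """Lowercase, split on non-letters, then drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_len and t not in TOY_STOPWORDS]

sample = "The Senate said that this bill passed with a big majority"
cleaned = toy_preprocess(sample)
print(cleaned)
```

Here `min_len=4` mirrors the `len(token) > 3` condition, so three-letter words such as "the" and "big" are removed along with the stopwords.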

In [None]:
# Apply the function to the dataframe
df['clean'] = df['original'].apply(preprocess)

In [None]:
# Show original news
df['original'][0]

In [None]:
# Show cleaned up news after removing stopwords
print(df['clean'][0])

In [None]:
df

In [None]:
# Obtain the total words present in the dataset
list_of_words = []
for i in df.clean:
    for j in i:
        list_of_words.append(j)


In [None]:
list_of_words

In [None]:
len(list_of_words)

In [None]:
# Obtain the total number of unique words
total_words = len(set(list_of_words))
total_words
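
As a rough alternative view of the same counts, the sketch below uses `collections.Counter` on a toy stand-in for `df['clean']` (the `toy_docs` list is invented for illustration). It yields the vocabulary size, the total token count, and the most frequent words in one pass.

```python
from collections import Counter

# Toy stand-in for df['clean']: one token list per document.
toy_docs = [["trump", "says", "trump"], ["senate", "votes"], ["trump", "senate"]]

word_counts = Counter(token for doc in toy_docs for token in doc)
vocab_size = len(word_counts)               # number of unique words
total_tokens = sum(word_counts.values())    # number of words overall
top_two = word_counts.most_common(2)        # most frequent words first
print(vocab_size, total_tokens, top_two)
```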

In [None]:
# join the words into a string
df['clean_joined'] = df['clean'].apply(lambda x: " ".join(x))

In [None]:
df

In [None]:
df['clean_joined'][0]

MINI CHALLENGE #2:
- Perform a sanity check on the preprocessing stage by visualizing at least 3 news samples




In [None]:
df[['clean_joined','original']]

# TASK #5: VISUALIZE CLEANED UP DATASET

In [None]:
df

In [None]:
# plot the number of samples in 'subject'
plt.figure(figsize = (8, 8))
sns.countplot(y = "subject", data = df)

MINI CHALLENGE #3: 
- Plot the count plot for fake vs. true news

In [None]:
plt.figure(figsize = (8, 8))
sns.countplot(y = "isfake", data = df)

In [None]:
# plot the word cloud for text that is Fake
plt.figure(figsize = (20,20)) 
wc = WordCloud(max_words = 2000 , width = 1600 , height = 800 , stopwords = stop_words).generate(" ".join(df[df.isfake == 1].clean_joined))
plt.imshow(wc, interpolation = 'bilinear')

In [None]:
# plot the word cloud for text that is True
plt.figure(figsize = (20,20)) 
wc = WordCloud(max_words = 2000 , width = 1600 , height = 800 , stopwords = stop_words).generate(" ".join(df[df.isfake == 0].clean_joined))
plt.imshow(wc, interpolation = 'bilinear')

In [None]:
# find the length of the longest document; it is an upper bound for the padding length chosen later
maxlen = -1
for doc in df.clean_joined:
    tokens = nltk.word_tokenize(doc)
    if(maxlen<len(tokens)):
        maxlen = len(tokens)
print("The maximum number of words in any document is =", maxlen)

In [None]:
# visualize the distribution of number of words in a text
import plotly.express as px
fig = px.histogram(x = [len(nltk.word_tokenize(x)) for x in df.clean_joined], nbins = 100)
fig.show()

# TASK #6: PREPARE THE DATA BY PERFORMING TOKENIZATION AND PADDING

In [None]:
# split data into test and train 
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(df.clean_joined, df.isfake, test_size = 0.2)

In [None]:
from nltk import word_tokenize

In [None]:
# Create a tokenizer to tokenize the words and create sequences of tokenized words
# fit_on_texts builds the vocabulary (word -> index) from the training texts
# texts_to_sequences encodes each text as a sequence of those word indices

tokenizer = Tokenizer(num_words = total_words)
tokenizer.fit_on_texts(x_train)
train_sequences = tokenizer.texts_to_sequences(x_train)
test_sequences = tokenizer.texts_to_sequences(x_test)
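
The following is a simplified, pure-Python sketch of what fitting a vocabulary and encoding texts amounts to. The helper names `fit_word_index` and `to_sequences` are hypothetical, and the sketch leaves out things the Keras `Tokenizer` also does by default (lowercasing, punctuation filtering, the `num_words` cap), so treat it as an illustration of the idea rather than a replacement.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Replace each known word with its index; unknown words are dropped."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

train_texts = ["senate votes budget", "senate rejects budget plan", "senate adjourns"]
word_index = fit_word_index(train_texts)
train_seqs = to_sequences(train_texts, word_index)
test_seqs = to_sequences(["senate votes tomorrow"], word_index)
print(word_index)
print(train_seqs, test_seqs)
```

Note that the vocabulary is built from the training texts only, so a word that appears only in the test text ("tomorrow" here) has no index and is skipped.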


In [None]:
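# train_sequences[0] encodes the first training document, which is x_train.iloc[0] (not df.clean_joined[0], because the split shuffles rows)
print("The encoding for document\n", x_train.iloc[0], "\n is : ", train_sequences[0])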

In [None]:
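# Pad or truncate every sequence to a fixed length. maxlen can be the longest document (4406)
# or a smaller number; maxlen = 40 seems to work well based on results.
# Use the same padding and truncating settings for train and test so both are encoded identically.
padded_train = pad_sequences(train_sequences, maxlen = 40, padding = 'post', truncating = 'post')
padded_test = pad_sequences(test_sequences, maxlen = 40, padding = 'post', truncating = 'post')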
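
For intuition, post-padding with post-truncation can be sketched in a few lines of plain Python. The helper `pad_post` below is a hypothetical stand-in, not the Keras implementation; it only covers the `padding='post'`, `truncating='post'` case with integer lists.

```python
def pad_post(sequences, maxlen, value=0):
    """Keep the first `maxlen` items of each sequence, then right-pad with `value`."""
    padded = []
    for seq in sequences:
        kept = seq[:maxlen]  # post-truncation drops items from the end
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded

toy_sequences = [[5, 2, 9], [7, 1, 4, 8, 3, 6]]
toy_padded = pad_post(toy_sequences, maxlen=4)
print(toy_padded)
```

The short sequence gains a trailing zero and the long one loses its last two items, so every row ends up with exactly `maxlen` entries.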

In [None]:
for i,doc in enumerate(padded_train[:2]):
     print("The padded encoding for document",i+1," is : ",doc)

# TASK #7: UNDERSTAND THE THEORY AND INTUITION BEHIND RECURRENT NEURAL NETWORKS AND LSTM


A plain RNN is suitable for shorter sentences; for longer sequences, use an LSTM, which retains information across more time steps.

# TASK #8: UNDERSTAND THE INTUITION BEHIND LONG SHORT TERM MEMORY (LSTM) NETWORKS

LSTM overcomes the vanishing-gradient problem that limits plain RNNs on long sequences.
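
The vanishing-gradient effect can be seen numerically in a scalar toy RNN. The sketch below is an illustration under assumed values (the weights `w_h`, `w_x` and the constant input `x` are arbitrary choices): by the chain rule, the gradient of the last hidden state with respect to the first is a product of per-step factors, each smaller than 1 in magnitude here, so it shrinks quickly as the sequence grows.

```python
import math

def gradient_through_time(w_h, steps, x=0.5, w_x=1.0):
    """Return |d h_T / d h_0| for a scalar RNN h_t = tanh(w_x * x + w_h * h_{t-1})."""
    h, grad = 0.0, 1.0
    for _ in range(steps):
        h = math.tanh(w_x * x + w_h * h)
        grad *= w_h * (1.0 - h * h)  # chain rule factor: d h_t / d h_{t-1}
    return abs(grad)

for steps in (1, 10, 50):
    print(steps, gradient_through_time(w_h=0.9, steps=steps))
```

An LSTM adds a gated cell state whose update is largely additive, which is the usual explanation for why its gradients survive over more time steps.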

# TASK #9: BUILD AND TRAIN THE MODEL 

The Embedding layer learns a low-dimensional continuous representation of discrete values (here, word indices).
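
Conceptually, an embedding is a lookup table with one row of numbers per word index. The sketch below is a plain-Python illustration with made-up sizes (`vocab_size = 6`, `embedding_dim = 3`) and random initial values; in the real layer these values are trained along with the rest of the model.

```python
import random

vocab_size, embedding_dim = 6, 3
rng = random.Random(0)

# One row of `embedding_dim` floats per word index; training would adjust these values.
embedding_table = [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                   for _ in range(vocab_size)]

def embed(sequence):
    """Look up the vector for every word index in the sequence."""
    return [embedding_table[idx] for idx in sequence]

vectors = embed([1, 4, 1])
print(len(vectors), len(vectors[0]))  # sequence length, embedding dimension
```

The same index always maps to the same vector, which is why the first and third entries of `vectors` are identical.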

TimeDistributed(Dense) applies the same dense layer to every time step while the GRU/LSTM cell is unrolled. That is why the error function is computed between the predicted label sequence and the actual label sequence.

With return_sequences=False, the Dense layer is applied only once, to the output of the last cell. This is normally the case when RNNs are used for classification problems.
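
The difference between the two output modes can be sketched with a scalar toy recurrence. The function `run_rnn` and its weights are hypothetical and far simpler than an LSTM (no gates, a single hidden unit); it only illustrates that one mode returns a hidden state per time step while the other returns just the final one.

```python
import math

def run_rnn(inputs, w_x=0.8, w_h=0.5, return_sequences=False):
    """Scalar RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1}), starting from h_0 = 0."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states if return_sequences else states[-1]

inputs = [1.0, -0.5, 0.25, 0.0]
all_states = run_rnn(inputs, return_sequences=True)   # one hidden state per time step
last_state = run_rnn(inputs, return_sequences=False)  # only the final hidden state
print(len(all_states), last_state == all_states[-1])
```

For a classifier that outputs one label per document, only the final state is needed, which matches the many-to-one setup described above.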



In [None]:
# Sequential Model
model = Sequential()

# embedding layer
model.add(Embedding(total_words, output_dim = 128))
# model.add(Embedding(total_words, output_dim = 240))


# Bidirectional LSTM
model.add(Bidirectional(LSTM(128)))

# Dense layers
model.add(Dense(128, activation = 'relu'))
model.add(Dense(1,activation= 'sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.summary()

In [None]:
total_words

In [None]:
y_train = np.asarray(y_train)

In [None]:
# train the model
model.fit(padded_train, y_train, batch_size = 64, validation_split = 0.1, epochs = 2)

# TASK #10: ASSESS TRAINED MODEL PERFORMANCE


In [None]:
# make prediction
pred = model.predict(padded_test)

In [None]:
# isfake = 1 marks fake news, so a predicted value > 0.5 means fake; otherwise the article is classified as real
prediction = []
for i in range(len(pred)):
    if pred[i].item() > 0.5:
        prediction.append(1)
    else:
        prediction.append(0)

In [None]:
# getting the accuracy
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(list(y_test), prediction)

print("Model Accuracy : ", accuracy)

In [None]:
# get the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(list(y_test), prediction)
plt.figure(figsize = (25, 25))
sns.heatmap(cm, annot = True)
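
To read the heatmap, it helps to know what each cell counts. The sketch below computes the four cells by hand on invented toy labels (not results from this model), treating label 1 (fake) as the positive class, and derives accuracy, precision, and recall from them. The helper `binary_confusion` is a hypothetical name; for labels 0 and 1, scikit-learn's `confusion_matrix` arranges the same counts as `[[tn, fp], [fn, tp]]`.

```python
def binary_confusion(y_true, y_pred):
    """Return (tn, fp, fn, tp), treating label 1 (fake) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp

# Toy labels for illustration only.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = binary_confusion(toy_true, toy_pred)
toy_accuracy = (tp + tn) / len(toy_true)
toy_precision = tp / (tp + fp)   # of the articles flagged as fake, how many were fake
toy_recall = tp / (tp + fn)      # of the fake articles, how many were flagged
print(tn, fp, fn, tp, toy_accuracy, toy_precision, toy_recall)
```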