# PHASE 2 : INNOVATION

***************************************************************************************************************

PROJECT TITLE 9238 - FAKE NEWS DETECTION USING NLP

NAME - DHIVYADHARSHINI A

TEAM ID - 5275

TEAM NAME - Proj_204221_Team_2

COLLEGE CODE - NAME : 9238 - MANGAYARKARASI COLLEGE OF ENGINEERING PARAVAI, MADURAI.

GROUP : 5

GITHUB REPOSITORY LINK : https://github.com/dhivyadhar/IBM_AI.git

***************************************************************************************************************


# ABSTRACT :

In the era of digital information, the proliferation of fake news has become a critical issue. This project presents a web application that uses Natural Language Processing (NLP) techniques to detect fake news articles. The application employs machine learning models to analyze and classify news content, giving users a tool to assess the credibility of online information sources. By combining NLP, a user-friendly interface, and real-time analysis, the web application aims to contribute to the fight against misinformation in the digital age.

# PROPOSAL :
Phase 2 of our Fake News Detection project builds on the foundation of Phase 1 and aims to advance NLP-based fake news detection. By improving accuracy, enabling real-time monitoring, supporting multiple languages, and enhancing user engagement, we hope to make a substantial contribution to combating the spread of fake news in the digital age. We seek support and funding to carry out this initiative.

# DESCRIPTION FOR THE PACKAGES:
The cell below imports the Python libraries used for data analysis and visualization: pandas, numpy, matplotlib, and seaborn. It also imports the NLTK stopwords corpus and PorterStemmer for text preprocessing, along with the regular expression module (re) used to clean the raw text.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.corpus import stopwords  # English stop-word list used in preprocessing
from nltk.stem import PorterStemmer
import nltk
import re

# NLTK.DOWNLOAD :
NLTK (Natural Language Toolkit) distributes its corpora, pre-trained models, and lexical resources separately from the library itself, so they must be fetched with nltk.download() before first use. This project needs the stopwords corpus, which provides the English stop-word list used during preprocessing.

In [5]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ELCOT\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [6]:
nltk.download('stopwords', download_dir='/home')


[nltk_data] Downloading package stopwords to /home...
[nltk_data]   Package stopwords is already up-to-date!


True

# LOADING THE DATASET:
The fake and true news articles are stored in two CSV files, fake.csv and true.csv. Each file is loaded into its own pandas DataFrame.

In [7]:
# Load the fake news dataset
fake_data = pd.read_csv('fake.csv')
# Load the true news dataset
true_data = pd.read_csv('true.csv')

In [8]:
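print("\n TRUE DATA \n")
true_data[0:10]  # not displayed: a notebook cell only shows its last expression
print("\n FAKE DATA \n")

fake_data[0:4]  # last expression in the cell, so only this table is displayed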


 TRUE DATA 


 FAKE DATA 



| | title | text | subject | date |
|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 |


In [9]:
print(true_data['text'][0])
true_data.columns

WASHINGTON (Reuters) - The head of a conservative Republican faction in the U.S. Congress, who voted this month for a huge expansion of the national debt to pay for tax cuts, called himself a “fiscal conservative” on Sunday and urged budget restraint in 2018. In keeping with a sharp pivot under way among Republicans, U.S. Representative Mark Meadows, speaking on CBS’ “Face the Nation,” drew a hard line on federal spending, which lawmakers are bracing to do battle over in January. When they return from the holidays on Wednesday, lawmakers will begin trying to pass a federal budget in a fight likely to be linked to other issues, such as immigration policy, even as the November congressional election campaigns approach in which Republicans will seek to keep control of Congress. President Donald Trump and his Republicans want a big budget increase in military spending, while Democrats also want proportional increases for non-defense “discretionary” spending on programs that support educati

Index(['title', 'text', 'subject', 'date'], dtype='object')

# FUNCTION USED:
The built-in len() function gives the number of rows (articles) in each dataset, true.csv and fake.csv.

In [10]:
len(true_data)

21417

In [11]:
len(fake_data)

23481
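The two lengths reported above determine the class balance of the combined dataset. A short sketch of that arithmetic follows, using only the counts printed by the notebook; the variable names are illustrative and do not appear in the original code.

```python
n_true = 21417  # len(true_data) reported above
n_fake = 23481  # len(fake_data) reported above

total = n_true + n_fake
fake_share = n_fake / total

print(f"total articles: {total}")
print(f"fake share: {fake_share:.1%}")
```

The classes are close to balanced (roughly 52% fake), so plain accuracy is a reasonable headline metric for this dataset.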

# LABELING & PREPROCESSING:
Fake articles are labeled 1 and true articles 0, and the two datasets are then combined into a single DataFrame.

In [12]:
# Data preprocessing
# Combine the two datasets and add labels
fake_data["label"] = 1
true_data["label"] = 0
data = pd.concat([fake_data, true_data], ignore_index=True)
data.head()

| | title | text | subject | date | label |
|---|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 | 1 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 | 1 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 | 1 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 | 1 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 | 1 |


# REMOVING UNNECESSARY COLUMNS:
After concatenation, the data still contains the date and subject columns, which are not used for fake news detection. The next cell removes them.

In [13]:
# Remove unnecessary columns (e.g., date, subject)
data.drop(columns=["date", "subject"], inplace=True)
data.head()

| | title | text | label |
|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | 1 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | 1 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | 1 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | 1 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | 1 |


In [14]:

# Shuffle the rows so that fake and true articles are interleaved
random_permutation = np.random.permutation(len(data))
data = data.iloc[random_permutation]
print(data.columns)
data.head()

Index(['title', 'text', 'label'], dtype='object')


| | title | text | label |
|---|---|---|---|
| 11093 | THE VIEW Brings On Bill O’Reilly’s Sexual Hara... | The last accuser who could have been the nail ... | 1 |
| 32211 | A festival air, and unease, hang over pre-conv... | CLEVELAND (Reuters) - East 4th Street, emblema... | 0 |
| 36442 | U.N. chief says no alternative to two state so... | UNITED NATIONS (Reuters) - United Nations Secr... | 0 |
| 19926 | GRAB THE POPCORN! Queen Of Corruption DENIED S... | The Drudge Report has gained access to the rul... | 1 |
| 35243 | Venezuela may ban main opposition parties from... | CARACAS (Reuters) - Venezuela s pro-government... | 0 |
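The shuffle in cell 14 is not seeded, so the row order changes on every run, and with it the contents of any later train/test split. A minimal sketch of a reproducible shuffle follows, written with the standard library rather than NumPy; the seed value 42, the helper name `seeded_permutation`, and the stand-in rows are illustrative assumptions, not part of the notebook.

```python
import random

def seeded_permutation(n, seed=42):
    """Return a reproducible random ordering of the indices 0..n-1."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # private generator, global state untouched
    return indices

rows = ["fake_0", "fake_1", "true_0", "true_1", "true_2"]  # stand-in rows
order = seeded_permutation(len(rows))
shuffled = [rows[i] for i in order]

# Same seed, same order; every row still appears exactly once.
assert order == seeded_permutation(len(rows))
assert sorted(shuffled) == sorted(rows)
```

In the notebook itself, a comparable effect could be obtained by drawing the permutation from a seeded NumPy generator, for example `np.random.default_rng(42).permutation(len(data))`.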


In [15]:
filterd_data = data.loc[:, ['title', 'text', 'label']].copy()  # explicit copy, so later column assignments do not touch data
filterd_data.head()

| | title | text | label |
|---|---|---|---|
| 11093 | THE VIEW Brings On Bill O’Reilly’s Sexual Hara... | The last accuser who could have been the nail ... | 1 |
| 32211 | A festival air, and unease, hang over pre-conv... | CLEVELAND (Reuters) - East 4th Street, emblema... | 0 |
| 36442 | U.N. chief says no alternative to two state so... | UNITED NATIONS (Reuters) - United Nations Secr... | 0 |
| 19926 | GRAB THE POPCORN! Queen Of Corruption DENIED S... | The Drudge Report has gained access to the rul... | 1 |
| 35243 | Venezuela may ban main opposition parties from... | CARACAS (Reuters) - Venezuela s pro-government... | 0 |


In [16]:
filterd_data.isnull().sum()

title    0
text     0
label    0
dtype: int64

In [17]:
# Text preprocessing (e.g., lowercase, remove special characters, stop words, stemming)
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

In [19]:
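def preprocess_text(text):
    text = text.lower()                      # normalize case
    text = re.sub(r'[^a-zA-Z]', ' ', text)   # replace digits and punctuation with spaces
    words = text.split()                     # split on whitespace
    words = [word for word in words if word not in stop_words]  # drop stop words
    words = [stemmer.stem(word) for word in words]              # reduce each word to its stem
    return ' '.join(words)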

In [20]:
filterd_data["title"] = filterd_data["title"].apply(preprocess_text)
filterd_data["text"] = filterd_data["text"].apply(preprocess_text)


In [21]:
print(filterd_data.head())

                                                   title  \
11093  view bring bill reilli sexual harass accus rea...   
32211        festiv air uneas hang pre convent cleveland   
36442    u n chief say altern two state solut middl east   
19926  grab popcorn queen corrupt deni special treatm...   
35243  venezuela may ban main opposit parti president...   

                                                    text  label  
11093  last accus could nail coffin bill reilli spoke...      1  
32211  cleveland reuter east th street emblemat new c...      0  
36442  unit nation reuter unit nation secretari gener...      0  
19926  drudg report gain access rule upcom mega debat...      1  
35243  caraca reuter venezuela pro govern legisl supe...      0  
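The processed output shows two side effects of the cleaning step: "U.N." becomes "u n" and "4th" becomes "th", because every non-letter character is replaced by a space before the text is split. The sketch below reproduces that behavior with the standard library only. Stemming is left out, and the small stop-word set and the function name `clean_without_stemming` are illustrative stand-ins for NLTK's English list and the notebook's `preprocess_text`.

```python
import re

# Tiny illustrative stop-word set (an assumption); the notebook uses NLTK's full English list.
DEMO_STOP_WORDS = {"the", "a", "an", "and", "of", "to", "on", "over", "no", "in"}

def clean_without_stemming(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, replace non-letters with spaces, split, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z]", " ", text)
    return [word for word in text.split() if word not in stop_words]

print(clean_without_stemming("U.N. chief says no alternative"))
# ['u', 'n', 'chief', 'says', 'alternative']
print(clean_without_stemming("East 4th Street"))
# ['east', 'th', 'street']
```

Fragments such as "u", "n", and "th" carry little meaning, so a minimum token length filter might be worth considering in a later phase.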


# CREATING A DEEP LEARNING MODEL USING LSTM :
We now build a deep learning model for text classification using an LSTM (Long Short-Term Memory) architecture, implemented with TensorFlow and Keras. The model is trained on the preprocessed "filterd_data" dataset as follows:

In [25]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# DEFINING HYPERPARAMETERS :
Defining hyperparameters is a crucial step in developing the deep learning model. Hyperparameters are settings that control the behavior and performance of the model, and choosing them well can significantly affect its accuracy and generalization.

In [26]:
# Define hyperparameters
max_words = 10000  # Maximum number of words to consider in the tokenizer
max_sequence_length = 100  # Maximum length of input sequences
embedding_dim = 100  # Dimension of the word embeddings
batch_size = 64
epochs = 10

In [27]:
# Tokenize the text
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(filterd_data['text'])
sequences = tokenizer.texts_to_sequences(filterd_data['text'])
X = pad_sequences(sequences, maxlen=max_sequence_length)
y = filterd_data['label']
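To make the tokenizing and padding step concrete, here is a plain-Python sketch of the idea: words are ranked by frequency, each text becomes a list of integer ranks, and every list is forced to a fixed length. It assumes the Keras defaults as we understand them (padding and truncation both happen at the front, and index 0 is reserved for padding). The helper names `build_word_index` and `pad_pre` are hypothetical, and the real Tokenizer also filters punctuation and applies the `num_words` limit, which this sketch omits.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def pad_pre(sequence, maxlen, value=0):
    """Keep the last maxlen items and left-pad shorter sequences with value."""
    sequence = sequence[-maxlen:]
    return [value] * (maxlen - len(sequence)) + sequence

texts = ["trump said trump said trump", "reuter"]
word_index = build_word_index(texts)
sequences = [[word_index[word] for word in text.split()] for text in texts]
padded = [pad_pre(sequence, maxlen=4) for sequence in sequences]

print(word_index)  # {'trump': 1, 'said': 2, 'reuter': 3}
print(padded)      # [[2, 1, 2, 1], [0, 0, 0, 3]]
```

With `max_sequence_length = 100`, only the last 100 tokens of a long article would reach the model under this front-truncation assumption.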

In [28]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
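The sizes of the resulting subsets follow from the dataset lengths reported earlier. The sketch below works through the arithmetic, assuming the test share is rounded up and that training later holds out the last 10% of the training rows for validation (`validation_split=0.1` in the training cell). The variable names are illustrative.

```python
import math

n_samples = 21417 + 23481   # rows in the combined dataset
test_size = 0.2
validation_split = 0.1

n_test = math.ceil(test_size * n_samples)   # assumed: test share rounded up
n_train = n_samples - n_test
n_fit = math.floor(n_train * (1 - validation_split))  # assumed: first 90% used for fitting
n_val = n_train - n_fit

print(n_test, n_train, n_fit, n_val)
```

The computed test size of 8980 agrees with the support total in the classification report further down.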

In [29]:
# Build the LSTM model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=max_words, output_dim=embedding_dim, input_length=max_sequence_length),
    tf.keras.layers.LSTM(128),  # You can adjust the number of LSTM units as needed
    tf.keras.layers.Dense(1, activation='sigmoid')
])
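A quick way to get a feel for the size of this model is to count its trainable parameters by hand. The sketch below applies the standard formulas for an embedding layer, a single LSTM layer (four gates, each with input weights, recurrent weights, and a bias), and a dense output unit. The notebook does not show `model.summary()`, so these figures are a calculation under those assumptions rather than reported output.

```python
max_words = 10000      # vocabulary size (input_dim of the Embedding layer)
embedding_dim = 100    # output_dim of the Embedding layer
lstm_units = 128

embedding_params = max_words * embedding_dim
# Four gates, each with input weights, recurrent weights, and a bias vector.
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)
dense_params = lstm_units * 1 + 1   # one weight per LSTM unit plus a bias

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Most of the parameters sit in the embedding layer, so `max_words` and `embedding_dim` are the settings that most affect model size.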

In [30]:
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
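Binary cross-entropy penalizes confident wrong predictions much more heavily than confident correct ones. The sketch below evaluates the loss for a single article by hand. The function name and argument names are illustrative, and the probability is assumed to lie strictly between 0 and 1.

```python
import math

def binary_crossentropy(y_true, p_fake):
    """Loss for one article: y_true is 0 or 1, p_fake is the predicted probability of label 1."""
    return -(y_true * math.log(p_fake) + (1 - y_true) * math.log(1 - p_fake))

print(round(binary_crossentropy(1, 0.9), 4))  # fake article, predicted fake with 0.9
print(round(binary_crossentropy(1, 0.1), 4))  # fake article, predicted fake with only 0.1
```

The second loss is more than twenty times the first, which is what pushes the sigmoid output toward the correct label during training.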

In [31]:
# Train the model
model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.1)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x16d67dc90c0>

In [32]:
# Evaluate the model
y_pred = model.predict(X_test)
y_pred_binary = (y_pred > 0.5).astype(int)
accuracy = accuracy_score(y_test, y_pred_binary)
conf_matrix = confusion_matrix(y_test, y_pred_binary)
class_report = classification_report(y_test, y_pred_binary)




In [33]:
print(f'Accuracy: {accuracy}')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Classification Report:\n{class_report}')

Accuracy: 0.9838530066815144
Confusion Matrix:
[[4165   92]
 [  53 4670]]
Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.98      0.98      4257
           1       0.98      0.99      0.98      4723

    accuracy                           0.98      8980
   macro avg       0.98      0.98      0.98      8980
weighted avg       0.98      0.98      0.98      8980
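The figures in the report can be checked directly from the confusion matrix. The sketch below recomputes accuracy, precision, recall, and F1 for the fake class, assuming the scikit-learn layout in which rows are true labels and columns are predictions.

```python
# Confusion matrix reported above: rows are true labels, columns are predictions,
# with 0 = true news and 1 = fake news.
tn, fp = 4165, 92
fn, tp = 53, 4670

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_fake = tp / (tp + fp)
recall_fake = tp / (tp + fn)
f1_fake = 2 * precision_fake * recall_fake / (precision_fake + recall_fake)

print(f"accuracy={accuracy:.4f} precision={precision_fake:.4f} "
      f"recall={recall_fake:.4f} f1={f1_fake:.4f}")
```

Read this way, 92 true articles were flagged as fake and 53 fake articles were missed, out of 8980 test articles.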



The code above preprocesses the text data, builds an LSTM-based neural network, trains it for 10 epochs, and evaluates it on the held-out test set, where it classifies about 98.4% of the 8,980 articles correctly.

Running the notebook requires TensorFlow, scikit-learn, NLTK, pandas, NumPy, matplotlib, and seaborn to be installed.

# CONCLUSION:
In conclusion, Phase 2 of our fake news detection project using Natural Language Processing (NLP) is a significant step forward in our efforts to combat the spread of misinformation and disinformation. The LSTM model trained in this phase reached about 98% accuracy on the test set, which brings us closer to an effective and robust fake news detection system.

                                                         ************