
NEWS CLASSIFICATION 🗞️

This repository hosts a notebook featuring an in-depth analysis of several neural network models (RNN, CNN, feed-forward) and Multinomial Naive Bayes, along with an app deployment using Streamlit. The following models were evaluated:

  • Basic Multinomial Naive Bayes
  • Basic Keras Model
  • LSTM Model
  • LSTM GRU Model
  • LSTM Bidirectional Model
    • TextVectorization + Keras Embedding
    • Text_to_word_sequence + Word2Vec Embedding
  • Basic CNN Model

The dataset was downloaded from Kaggle and contains a set of fake and real news articles.

The app can be tested by following this link. Feel free to ⭐ and clone this repo 😉

👨‍💻 Tech Stack

Visual Studio Code · Jupyter Notebook · Python · Pandas · NumPy · Plotly · Matplotlib · scikit-learn · TensorFlow · Linux · Git · Streamlit

📐 Set Up

In the initial project phase, a set of essential helper functions was created to streamline data analysis and model evaluation. These functions include:

  • Plot Word Cloud: Generates a word cloud for a specific label value and displays it in a subplot.
  • Plot Confusion Matrix: Visualizes classification results using a confusion matrix.
  • Plot Precision/Recall Results: Computes model accuracy, precision, recall, and F1-score for binary classification models, returning the results in a DataFrame.
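As an illustration, here is a minimal sketch of what the precision/recall helper might look like (the function name and body are assumptions, not the repo's exact code):

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_results(y_true, y_pred, model_name="model"):
    """Compute accuracy, precision, recall and F1 for a binary classifier
    and return them as a one-row DataFrame (hypothetical helper)."""
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted"
    )
    return pd.DataFrame(
        {"accuracy": [accuracy], "precision": [precision],
         "recall": [recall], "f1": [f1]},
        index=[model_name],
    )
```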

👨‍🔬 Data Analysis

The first step of the project involved a comprehensive analysis of the dataset, including its columns and distribution. The dataset consists of two files (fake and true), each with the following columns:

  • Title
  • Text
  • Subject
  • Date

Labels Distribution

Upon merging the datasets, it became apparent that the labels are well-balanced, with both fake and true labels at approximately 50%, negating the need for oversampling or undersampling. The dataset initially contained 23,481 fake and 21,417 true news articles, with 209 duplicate rows removed.
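A minimal sketch of this check, assuming the two Kaggle files are named Fake.csv and True.csv (the file names are an assumption):

```python
import pandas as pd

# Load both files and tag each with a label column (file names assumed)
fake = pd.read_csv("Fake.csv").assign(label=0)
true = pd.read_csv("True.csv").assign(label=1)
df = pd.concat([fake, true], ignore_index=True)

# Verify the class balance, then drop exact duplicates
print(df["label"].value_counts(normalize=True))
df = df.drop_duplicates()
```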

Subjects Distribution

The subject column revealed eight different topics, with true and fake news allocated to disjoint subjects. In other words, the labels are cleanly separated by subject.

WordCloud

A word cloud visualization showed that the terms "Trump" and "US" were among the most common words in both label categories.
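A sketch of how such a per-label word cloud can be produced with the wordcloud package, assuming the merged DataFrame df from the earlier sketch:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def plot_word_cloud(df, label, ax):
    """Render a word cloud for one label value in the given subplot."""
    text = " ".join(df.loc[df["label"] == label, "text"])
    cloud = WordCloud(width=800, height=400,
                      background_color="white").generate(text)
    ax.imshow(cloud, interpolation="bilinear")
    ax.axis("off")

fig, axes = plt.subplots(1, 2, figsize=(16, 5))
plot_word_cloud(df, label=0, ax=axes[0])  # fake news
plot_word_cloud(df, label=1, ax=axes[1])  # true news
plt.show()
```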

📶 Data Preprocessing

In parallel with data analysis, several preprocessing steps were undertaken to create a clean dataset for further modeling:

  • Removal of duplicate rows
  • Elimination of rows with empty cells
  • Merging of the text and title columns into a single column
  • Dataframe cleaning, including punctuation removal, elimination of numbers, special character removal, stopword removal, and lemmatization

Cleaning exposed approximately 6,000 additional duplicate rows, which were removed, resulting in a final dataset of 38,835 rows while maintaining a balanced label distribution.

Final Labels Distribution
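A minimal sketch of the cleaning step, using NLTK for stopword removal and lemmatization (a plausible implementation; the notebook's exact code may differ):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text: str) -> str:
    """Lowercase, strip punctuation/numbers/special characters,
    drop stopwords and lemmatize the remaining tokens."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # punctuation, digits, special chars
    tokens = [lemmatizer.lemmatize(t) for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

# Merge title and text into a single column, clean it, and re-deduplicate
df["text"] = (df["title"] + " " + df["text"]).map(clean_text)
df = df.drop_duplicates()
```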

👨‍🔬 Modeling

The project involved training several models with varying configurations, primarily consisting of five neural network models and one Multinomial Naive Bayes model.

Model Results

Model Performance Evaluation

All models demonstrated strong performance, consistently achieving accuracies above 90%. The model evaluation process involved several steps:

  1. Baseline Model with GridSearch:

    • A Multinomial Naive Bayes model was established using the TfidfVectorizer (see the first sketch after this list).
    • Despite being a basic model, it set the initial benchmark for performance.
  2. Advanced Models with TextVectorization and Keras Embedding:

    • A series of models was tested with advanced text vectorization and embedding techniques (see the second sketch after this list).
    • These models consistently reached accuracies exceeding 99%.
    • The enhanced vectorization and embedding significantly improved model performance.
  3. Best-Performing Model: LSTM Bidirectional with Tokenization and Word Embedding:

    • The LSTM Bidirectional model, known for its sequence modeling capabilities, was identified as the best performer.
    • It was further evaluated with a different tokenizer and embedding, specifically text_to_word_sequence and a Word2Vec embedding (see the third sketch after this list).
    • While the performance remained impressive, it exhibited a slightly lower accuracy compared to the other models.
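First, a minimal sketch of the baseline, assuming a scikit-learn pipeline of TfidfVectorizer and MultinomialNB tuned with GridSearchCV (the parameter grid below is an assumption):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# df holds the cleaned "text" and binary "label" columns from the steps above
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

grid = GridSearchCV(
    pipeline,
    param_grid={
        "tfidf__ngram_range": [(1, 1), (1, 2)],  # hypothetical grid
        "nb__alpha": [0.1, 0.5, 1.0],
    },
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```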
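Second, a sketch of the TextVectorization + Keras Embedding approach, shown here for the Bidirectional LSTM variant (layer sizes, epochs, and vocabulary size are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

max_tokens, seq_len, embed_dim = 20000, 300, 128  # assumed hyperparameters

# Learn the vocabulary from the training texts, then vectorize them
vectorizer = layers.TextVectorization(
    max_tokens=max_tokens, output_sequence_length=seq_len
)
vectorizer.adapt(X_train.to_numpy())
X_train_ids = vectorizer(X_train.to_numpy())

model = tf.keras.Sequential([
    layers.Embedding(max_tokens, embed_dim),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X_train_ids, y_train.to_numpy(),
          validation_split=0.1, epochs=3, batch_size=64)
```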
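Third, a sketch of the alternative tokenization path, combining text_to_word_sequence with a gensim Word2Vec embedding fed into the same Bidirectional LSTM architecture (vector size and training hyperparameters are assumptions):

```python
import numpy as np
import tensorflow as tf
from gensim.models import Word2Vec
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import text_to_word_sequence

# Tokenize each article, then train Word2Vec on the token sequences
tokenized = [text_to_word_sequence(t) for t in X_train]
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=2)

# Map tokens to Word2Vec indices (shifted by 1 so index 0 stays as padding)
vocab = w2v.wv.key_to_index
sequences = [[vocab[t] + 1 for t in doc if t in vocab] for doc in tokenized]
X_seq = pad_sequences(sequences, maxlen=300)

# Row 0 is the padding vector; rows 1.. are the trained Word2Vec vectors
embedding_matrix = np.vstack([np.zeros(100), w2v.wv.vectors])

model = tf.keras.Sequential([
    layers.Embedding(
        input_dim=embedding_matrix.shape[0],
        output_dim=100,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False,  # keep the pre-trained vectors frozen
    ),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X_seq, y_train.to_numpy(),
          validation_split=0.1, epochs=3, batch_size=64)
```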

👏 App Deployment

The last step was to deploy an app using Streamlit. The app can be tested by following this link.