
NEWS CLASSIFICATION 🗞️

This repository hosts a notebook featuring an in-depth analysis of several neural network models (RNN, CNN, feed-forward) and Multinomial Naive Bayes, along with an app deployment using Streamlit. The following models were evaluated:

  • Basic Multinomial Naive Bayes
  • Basic Keras Model
  • LSTM Model
  • LSTM GRU Model
  • LSTM Bidirectional Model
    • TextVectorization + Keras Embedding
    • Text_to_word_sequence + Word2Vec Embedding
  • Basic CNN Model

The dataset was downloaded from Kaggle and contains a set of fake and real news articles.

The app can be tested by following this link. Feel free to ⭐ and clone this repo 😉

👨‍💻 Tech Stack

Visual Studio Code · Jupyter Notebook · Python · Pandas · NumPy · Plotly · Matplotlib · scikit-learn · TensorFlow · Linux · Git · Streamlit

📐 Set Up

In the initial project phase, a set of essential helper functions was created to streamline data analysis and model evaluation. These functions include:

  • Plot Word Cloud: Generates a word cloud for a specific label value and displays it in a subplot.
  • Plot Confusion Matrix: Visualizes classification results using a confusion matrix.
  • Plot Precision/Recall Results: Computes model accuracy, precision, recall, and F1-score for binary classification models, returning the results in a DataFrame.
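As an illustration, here is a minimal sketch of what the precision/recall helper might look like (the function name and body are assumptions, not the repo's exact code):

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_results(y_true, y_pred, model_name="model"):
    """Compute accuracy, precision, recall and F1 for a binary classifier
    and return them as a one-row DataFrame (hypothetical helper)."""
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted"
    )
    return pd.DataFrame(
        {"accuracy": [accuracy], "precision": [precision],
         "recall": [recall], "f1": [f1]},
        index=[model_name],
    )
```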

👨‍🔬 Data Analysis

The first step of the project involved a comprehensive analysis of the dataset, including its columns and distribution. The dataset consists of two files (fake and true), each with the following columns:

  • Title
  • Text
  • Subject
  • Date

Labels Distribution

Upon merging the datasets, it became apparent that the labels are well-balanced, with both fake and true labels at approximately 50%, negating the need for oversampling or undersampling. The dataset initially contained 23,481 fake and 21,417 true news articles, with 209 duplicate rows removed.
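A minimal sketch of this check, assuming the two Kaggle files are named Fake.csv and True.csv (the file names are an assumption):

```python
import pandas as pd

# Load both files and tag each with a label column (file names assumed)
fake = pd.read_csv("Fake.csv").assign(label=0)
true = pd.read_csv("True.csv").assign(label=1)
df = pd.concat([fake, true], ignore_index=True)

# Verify the class balance, then drop exact duplicates
print(df["label"].value_counts(normalize=True))
df = df.drop_duplicates()
```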

Subjects Distribution

The subject column revealed eight different topics, with true and fake news allocated to disjoint subjects. In other words, the labels are cleanly separated by subject.

WordCloud

A word cloud visualization showed that the terms "Trump" and "US" were among the most common words in both label categories.
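A sketch of how such a per-label word cloud can be produced with the wordcloud package, assuming the merged DataFrame df from the earlier sketch:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def plot_word_cloud(df, label, ax):
    """Render a word cloud for one label value in the given subplot."""
    text = " ".join(df.loc[df["label"] == label, "text"])
    cloud = WordCloud(width=800, height=400,
                      background_color="white").generate(text)
    ax.imshow(cloud, interpolation="bilinear")
    ax.axis("off")

fig, axes = plt.subplots(1, 2, figsize=(16, 5))
plot_word_cloud(df, label=0, ax=axes[0])  # fake news
plot_word_cloud(df, label=1, ax=axes[1])  # true news
plt.show()
```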

📶 Data Preprocessing

In parallel with data analysis, several preprocessing steps were undertaken to create a clean dataset for further modeling:

  • Removal of duplicate rows
  • Elimination of rows with empty cells
  • Merging of the text and title columns into a single column
  • Dataframe cleaning, including punctuation removal, elimination of numbers, special character removal, stopword removal, and lemmatization

Cleaning exposed approximately 6,000 additional duplicate rows, which were removed, resulting in a final dataset of 38,835 rows while maintaining a balanced label distribution.

Final Labels Distribution
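A minimal sketch of the cleaning step, using NLTK for stopword removal and lemmatization (a plausible implementation; the notebook's exact code may differ):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text: str) -> str:
    """Lowercase, strip punctuation/numbers/special characters,
    drop stopwords and lemmatize the remaining tokens."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # punctuation, digits, special chars
    tokens = [lemmatizer.lemmatize(t) for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

# Merge title and text into a single column, clean it, and re-deduplicate
df["text"] = (df["title"] + " " + df["text"]).map(clean_text)
df = df.drop_duplicates()
```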

👨‍🔬 Modeling

The project involved training several models with varying configurations, primarily consisting of five neural network models and one Multinomial Naive Bayes model.

Model Results

Model Performance Evaluation

All models demonstrated strong performance, consistently achieving accuracies above 90%. The model evaluation process involved several steps:

  1. Baseline Model with GridSearch:

    • A Multinomial Naive Bayes model was established using the TfidfVectorizer (see the first sketch after this list).
    • Despite being a basic model, it set the initial benchmark for performance.
  2. Advanced Models with TextVectorization and Keras Embedding:

    • A series of models was tested with advanced text vectorization and embedding techniques (see the second sketch after this list).
    • These models consistently reached accuracies exceeding 99%.
    • The enhanced vectorization and embedding significantly improved model performance.
  3. Best-Performing Model: LSTM Bidirectional with Tokenization and Word Embedding:

    • The LSTM Bidirectional model, known for its sequence modeling capabilities, was identified as the best performer.
    • It was further evaluated with a different tokenizer and embedding, specifically text_to_word_sequence and a Word2Vec embedding (see the third sketch after this list).
    • While the performance remained impressive, it exhibited a slightly lower accuracy compared to the other models.
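First, a minimal sketch of the baseline, assuming a scikit-learn pipeline of TfidfVectorizer and MultinomialNB tuned with GridSearchCV (the parameter grid below is an assumption):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# df holds the cleaned "text" and binary "label" columns from the steps above
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

grid = GridSearchCV(
    pipeline,
    param_grid={
        "tfidf__ngram_range": [(1, 1), (1, 2)],  # hypothetical grid
        "nb__alpha": [0.1, 0.5, 1.0],
    },
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```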
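Second, a sketch of the TextVectorization + Keras Embedding approach, shown here for the Bidirectional LSTM variant (layer sizes, epochs, and vocabulary size are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

max_tokens, seq_len, embed_dim = 20000, 300, 128  # assumed hyperparameters

# Learn the vocabulary from the training texts, then vectorize them
vectorizer = layers.TextVectorization(
    max_tokens=max_tokens, output_sequence_length=seq_len
)
vectorizer.adapt(X_train.to_numpy())
X_train_ids = vectorizer(X_train.to_numpy())

model = tf.keras.Sequential([
    layers.Embedding(max_tokens, embed_dim),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X_train_ids, y_train.to_numpy(),
          validation_split=0.1, epochs=3, batch_size=64)
```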
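Third, a sketch of the alternative tokenization path, combining text_to_word_sequence with a gensim Word2Vec embedding fed into the same Bidirectional LSTM architecture (vector size and training hyperparameters are assumptions):

```python
import numpy as np
import tensorflow as tf
from gensim.models import Word2Vec
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import text_to_word_sequence

# Tokenize each article, then train Word2Vec on the token sequences
tokenized = [text_to_word_sequence(t) for t in X_train]
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=2)

# Map tokens to Word2Vec indices (shifted by 1 so index 0 stays as padding)
vocab = w2v.wv.key_to_index
sequences = [[vocab[t] + 1 for t in doc if t in vocab] for doc in tokenized]
X_seq = pad_sequences(sequences, maxlen=300)

# Row 0 is the padding vector; rows 1.. are the trained Word2Vec vectors
embedding_matrix = np.vstack([np.zeros(100), w2v.wv.vectors])

model = tf.keras.Sequential([
    layers.Embedding(
        input_dim=embedding_matrix.shape[0],
        output_dim=100,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False,  # keep the pre-trained vectors frozen
    ),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X_seq, y_train.to_numpy(),
          validation_split=0.1, epochs=3, batch_size=64)
```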

👏 App Deployment

The last step was to deploy an app using Streamlit. The app can be tested by following this link.