# Fake News Article Detection using Natural Language Processing

In this Colab notebook, we will build a machine learning model to classify news articles as either "Real" or "Fake" based on their content. To achieve this, we will perform a series of text preprocessing and feature engineering steps, followed by training a Logistic Regression model.

## Dataset

The data comes from the Kaggle Fake News competition: https://www.kaggle.com/c/fake-news/data?select=train.csv

We will use the competition's `train.csv` file, which includes the author, title, and content of each article, along with a `label` column (0 for real, 1 for fake, as used in the prediction cell at the end of this notebook). The goal is to create a model that can effectively distinguish between genuine news articles and fake or misleading ones.

## Data Preprocessing

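1. **Combining Author and Title**: We will concatenate the "author" and "title" columns into a single "content" column. This gives the classifier both the source and the headline to work with. Note that the model is trained on this combined string only; the full article body is not used as a feature.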

2. **Text Preprocessing**: We'll clean the text data by converting it to lowercase, removing non-alphabetical characters, and applying word stemming using the Porter Stemmer. Additionally, common English stopwords will be removed to focus on meaningful words.
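The two steps above can be sketched on a single row as follows. This is a simplified illustration, not the notebook's final function: `clean_text` and the tiny `STOPWORDS` set are hypothetical stand-ins (the notebook uses NLTK's full English stopword list), and stemming is left out so the sketch needs only the standard library.

```python
import re

# Hypothetical mini stopword list; the notebook uses NLTK's full English list
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "to"}

def clean_text(author: str, title: str) -> str:
    """Combine author and title, keep letters only, lowercase, drop stopwords."""
    content = f"{author} {title}"
    # Replace every non-alphabetical character with a space
    letters_only = re.sub(r"[^a-zA-Z]", " ", content)
    words = letters_only.lower().split()
    return " ".join(word for word in words if word not in STOPWORDS)

cleaned = clean_text("Jane Doe", "The 5 Biggest Myths of 2016, Debunked!")
print(cleaned)
```

Digits and punctuation disappear because they are replaced by spaces before the text is split into words.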

## Feature Engineering

To convert the text data into a format suitable for machine learning, we will use the TF-IDF (Term Frequency-Inverse Document Frequency) vectorization technique. It turns each document into a numerical vector in which a word's weight grows with how often it appears in that document and shrinks with how many documents in the corpus contain it, so words that appear everywhere contribute little.
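To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which to our understanding matches the defaults of scikit-learn's `TfidfVectorizer`. The `tfidf` helper and the toy documents are illustrative only, and the sketch splits on whitespace rather than using scikit-learn's tokenizer.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Raw term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # Scale to unit length; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

docs = ["trump elect result", "elect fraud claim", "elect day weather"]
vectors = tfidf(docs)
print(vectors[0])
```

In this toy corpus "elect" occurs in every document, so it receives a lower weight than "trump", which occurs in only one.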

## Model Building

We will use a Logistic Regression classifier to train our model. It is a common baseline for text classification because it trains quickly on sparse, high-dimensional TF-IDF features and its learned coefficients are easy to inspect.
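As a rough illustration of what the classifier computes at prediction time, the sketch below scores one TF-IDF vector with a weighted sum passed through the sigmoid function. The weights, bias, and feature values are made up for the example; a trained model learns them from the data.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of class 1 (fake) for one sparse {term: value} vector."""
    z = bias + sum(weights.get(term, 0.0) * value for term, value in features.items())
    return sigmoid(z)

# Hypothetical learned parameters: positive weights push towards "fake"
weights = {"fraud": 2.0, "weather": -1.5}
bias = 0.0

p_fake = predict_proba({"elect": 0.4, "fraud": 0.8}, weights, bias)
label = 1 if p_fake >= 0.5 else 0
print(round(p_fake, 3), label)
```

Terms without a learned weight (here "elect") contribute nothing to the score, and a probability of 0.5 or more is mapped to label 1.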

## Model Evaluation

We will assess the performance of our model by calculating accuracy scores on both the training and testing datasets. Additionally, we will select a sample from the test data and make predictions on it to demonstrate how the model classifies news articles.
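Accuracy is simply the fraction of predictions that match the true labels. A minimal sketch with made-up labels (the `accuracy` helper is illustrative; the notebook itself uses scikit-learn's `accuracy_score`):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
score = accuracy(y_true, y_pred)  # 4 of 5 predictions match
print(score)
```

Comparing this number on the training and test sets is a quick check for overfitting: a training accuracy far above the test accuracy suggests the model has memorized the training data.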

Let's proceed with the code step by step to build and evaluate our news article classification model.


# Importing Dependencies

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import nltk

# Reading, Importing and Modifying Data

In [None]:
# Load the news dataset from a CSV file
news_dataset = pd.read_csv('train.csv')

In [None]:
# Download NLTK stopwords data
nltk.download('stopwords')

In [None]:
# printing the stopwords in English
print(stopwords.words('english'))

In [None]:
news_dataset.shape

In [None]:
# print the first 5 rows of the dataframe
news_dataset.head()

In [None]:
# counting the number of missing values in the dataset
news_dataset.isnull().sum()

In [None]:
# Fill any missing values in the dataset with an empty string
news_dataset = news_dataset.fillna('')

In [None]:
# Combine 'author' and 'title' columns into a single 'content' column
news_dataset['content'] = news_dataset['author'] + ' ' + news_dataset['title']

In [None]:
print(news_dataset['content'])

# Preparing the Data for the Model

Separating Data and Labels

In [None]:
# separating the data & label
X = news_dataset.drop(columns='label')
Y = news_dataset['label']

In [None]:
print(X)
print(Y)

Stemming

In [None]:
# Initialize a Porter Stemmer for word stemming
port_stem = PorterStemmer()

In [None]:
# Build the stopword set once; calling stopwords.words('english') for every word is very slow
english_stopwords = set(stopwords.words('english'))

# Define a function for text preprocessing (stemming, lowercasing, removing stopwords)
def stemming(content):
    # Replace non-alphabetical characters with spaces
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    # Convert the text to lowercase
    stemmed_content = stemmed_content.lower()
    # Split the text into individual words
    stemmed_content = stemmed_content.split()
    # Remove stopwords, then stem the remaining words with the Porter Stemmer
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in english_stopwords]
    # Join the processed words back into a single string
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content

In [None]:
# Apply the preprocessing function to the 'content' column of the dataset
news_dataset['content'] = news_dataset['content'].apply(stemming)

In [None]:
print(news_dataset['content'])

# Preprocessed Data

In [None]:
# Extract the preprocessed text data and labels
X = news_dataset['content'].values
Y = news_dataset['label'].values

In [None]:
print(X)

In [None]:
print(Y)

In [None]:
Y.shape

Vectorization

In [None]:
# Initialize a TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit the vectorizer on the preprocessed text and transform it into TF-IDF vectors
X = vectorizer.fit_transform(X)

In [None]:
print(X)

Data Split

In [None]:
# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)
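One caveat: the notebook fits the TF-IDF vectorizer on every row before splitting, so the vocabulary and IDF statistics are computed partly from the test articles. The sketch below shows one way to avoid this by splitting the raw strings first and wrapping the vectorizer and classifier in a scikit-learn pipeline, so the vectorizer is fitted on the training rows only. The toy `texts` and `labels` are hypothetical stand-ins for the preprocessed "content" strings and their labels, and this is an alternative arrangement rather than the code used in the rest of this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical preprocessed strings and labels (0 = real, 1 = fake)
texts = [
    "report senat vote budget", "offici confirm storm damag",
    "court rule tax case", "studi find drug safe",
    "shock secret cure doctor hate", "alien land white hous",
    "celebr clone govern plot", "miracl pill melt fat",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Split the raw strings first, so the test rows stay unseen
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=2
)

# The pipeline fits TF-IDF on the training strings only,
# then reuses that fitted vocabulary when predicting
leak_free_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
leak_free_model.fit(texts_train, y_train)

test_predictions = leak_free_model.predict(texts_test)
print(test_predictions)
```

With a dataset of this size the effect on accuracy is likely small, but keeping the test rows out of every fitting step makes the reported test score a fairer estimate.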

# Model Initialization and Training

In [None]:
# Initialize a Logistic Regression model
model = LogisticRegression()

In [None]:
# Train the Logistic Regression model on the training data
model.fit(X_train, Y_train)

In [None]:
# Predict labels for the training data and calculate training accuracy
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)  # signature is (y_true, y_pred)

In [None]:
# Predict labels for the test data and calculate test accuracy
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)  # signature is (y_true, y_pred)

In [None]:
# Print the training and test accuracy scores
print('Accuracy score of the training data:', training_data_accuracy)
print('Accuracy score of the test data:', test_data_accuracy)

# Model Test

In [None]:
# Select a new sample from the test data for prediction
X_new = X_test[3]

# Use the trained model to predict the label of the new sample
prediction = model.predict(X_new)

# Print whether the news is classified as "Real" or "Fake" based on the prediction
if prediction[0] == 0:
    print('The news is Real')
else:
    print('The news is Fake')

In [None]:
# Print the actual label of the selected sample
print('Actual label:', Y_test[3])