<a href="https://colab.research.google.com/github/enavar25/Machine_learning_projects/blob/main/News_Classification_with_a_CNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project:** News classification with a CNN

**Here I build a simple CNN with Keras to classify a news text as either fake or real. The data come from the ISOT Fake News Dataset, which contains two types of articles, fake and real news, collected from real-world sources.**

---


Below, I import all the necessary functions and libraries.



In [None]:
!pip install --quiet gdown==4.5.4 --no-cache-dir # gdown is needed to download large files from Google Drive
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf    # needed for tf.string in the model's Input layer

import string
from sklearn.feature_extraction import text
stopwords = text.ENGLISH_STOP_WORDS    # stop-word list used when cleaning the text before tokenization

from keras.models import Sequential
from keras.layers import *
from keras.optimizers import Adam
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')
##################################
!gdown 1lEK4q5rbE2WWrwATUgnsFG5LI8TvTE9Y
!unzip -qq real_or_fake_news.zip;

Downloading...
From: https://drive.google.com/uc?id=1lEK4q5rbE2WWrwATUgnsFG5LI8TvTE9Y
To: /content/real_or_fake_news.zip
100% 43.0M/43.0M [00:00<00:00, 67.5MB/s]


Here, I create two pandas DataFrames and assign the label 1 to real news and 0 to fake news. I also check for null and duplicate entries; there are no null entries, so I only drop the duplicates.

In [None]:
true = pd.read_csv("real_news.csv")
fake = pd.read_csv("fake_news.csv")

true['category'] = 1
fake['category'] = 0
#####################################
fake.isnull().sum()
fake.duplicated().sum()
fake = fake.drop_duplicates()
true.isnull().sum()
true.duplicated().sum()
true = true.drop_duplicates()
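
To make the null and duplicate checks concrete, here is a minimal sketch on a toy DataFrame. The toy data and its column names are assumptions for illustration only, not the contents of the ISOT files.

```python
import pandas as pd

# Toy frame standing in for one of the CSV files; the columns are hypothetical.
toy = pd.DataFrame({
    "title": ["A", "A", "B"],
    "text": ["same body", "same body", "other body"],
})

n_nulls = int(toy.isnull().sum().sum())      # total missing values over all columns
n_duplicates = int(toy.duplicated().sum())   # rows identical to an earlier row
deduped = toy.drop_duplicates()              # keeps the first copy of each row

print(n_nulls)        # 0
print(n_duplicates)   # 1
print(len(deduped))   # 2
```

Note that in a notebook cell only the last expression is displayed, so each check needs its own `print` (or its own cell) to be visible.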

I combine the true and fake DataFrames into one. I also preprocess the text by removing punctuation, converting it to lowercase, and removing stopwords (unwanted tokens).



In [None]:
# Combine the true and fake news datasets into one dataframe.
df = pd.concat([true,fake]) 

# Define a function to remove punctuation.
def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

# Use pandas to apply the remove_punctuations function to the 'text' column
df['processed_text'] = df['text'].apply(remove_punctuations)
# Convert the processed text to lowercase
df['processed_text'] = df['processed_text'].apply(str.lower)
# Remove all the stopwords from the processed text
df['processed_text'] = df['processed_text'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in (stopwords)])
    )
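
The same three steps (strip punctuation, lowercase, drop stopwords) can be tried on a single string. This is a minimal sketch: the `STOP` set is a tiny hypothetical stand-in for sklearn's `ENGLISH_STOP_WORDS`, and `str.translate` is used as a one-pass alternative to the replace loop above.

```python
import string

# Hypothetical mini stop-word set; the notebook uses sklearn's ENGLISH_STOP_WORDS.
STOP = {"the", "is", "a", "of", "and", "to", "in"}

def preprocess(raw: str) -> str:
    # Remove every punctuation character in a single pass.
    no_punct = raw.translate(str.maketrans("", "", string.punctuation))
    lowered = no_punct.lower()
    # Keep only the words that are not in the stop-word set.
    return " ".join(w for w in lowered.split() if w not in STOP)

sample = "The President's speech, in short, is a summary of the plan."
cleaned = preprocess(sample)
print(cleaned)  # presidents speech short summary plan
```

Because punctuation is removed first, contractions and possessives lose their apostrophes ("President's" becomes "presidents") before the stop-word filter runs.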

I define the inputs and outputs from the preprocessed text and split the data into training and test sets.

In [None]:

inputs = df['processed_text']
output = df['category']

x_train, x_test, y_train, y_test = train_test_split(
    inputs, 
    output, 
    shuffle=True,
    test_size=0.2, 
    random_state=42
 )

Creating a vectorization layer with a maximum vocabulary of 5000 tokens and adapting it to the training data. TextVectorization splits each string in a batch into tokens and, with `output_mode='int'`, maps each token to an integer index; every text is padded or truncated to 100 indices.

In [None]:
vectorize_layer = TextVectorization(max_tokens=5000, output_mode='int', output_sequence_length=100) # tokenize and vectorize
vectorize_layer.adapt(x_train)
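
To show roughly what the layer does in `int` mode, here is a plain-Python sketch, not the Keras implementation. It assumes the usual convention that index 0 is padding and index 1 is the out-of-vocabulary token; `build_vocab` and `vectorize` are hypothetical helper names.

```python
from collections import Counter

def build_vocab(texts, max_tokens):
    # Indices 0 and 1 are reserved (padding, unknown word), so only
    # max_tokens - 2 real words fit into the vocabulary.
    counts = Counter(word for t in texts for word in t.split())
    most_common = [w for w, _ in counts.most_common(max_tokens - 2)]
    return {w: i + 2 for i, w in enumerate(most_common)}

def vectorize(text, vocab, seq_len):
    # Map words to indices (1 for unknown words) and truncate to seq_len.
    ids = [vocab.get(w, 1) for w in text.split()][:seq_len]
    # Pad with zeros up to seq_len.
    return ids + [0] * (seq_len - len(ids))

corpus = ["markets rally markets", "rally vote", "markets fall"]
vocab = build_vocab(corpus, max_tokens=4)   # room for the 2 most frequent words
encoded = vectorize("markets rally again", vocab, seq_len=5)
print(vocab)    # {'markets': 2, 'rally': 3}
print(encoded)  # [2, 3, 1, 0, 0]
```

Building the vocabulary from the training texts only, as `adapt(x_train)` does, keeps test-set words from leaking into the model's vocabulary.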

Initializing the neural network and adding the embedding, convolutional, max pooling, and dense layers.

In [None]:
model = Sequential()
model.add(Input(shape=(1,), dtype=tf.string))
model.add(vectorize_layer)
model.add(Embedding(input_dim = 5000,output_dim = 100,input_length = 100)) # the embedding layer maps each token index to a learned 100-dimensional vector

model.add(Conv1D(filters = 128, kernel_size= 5, activation ='relu'))
model.add(MaxPool1D(pool_size=2))
model.add(Dense(256, activation = 'relu'))
model.add(Dense(1,activation = 'sigmoid'))
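
One thing worth noting about this stack: a Dense layer applied to a 3-D tensor acts on the last axis only, so without a flattening or global pooling step the network emits one sigmoid value per pooled position (the summary below reports `(None, 48, 1)`) rather than one value per article. The NumPy sketch below only illustrates the shape difference; the random array stands in for the MaxPool1D output and the weights are arbitrary. In Keras, a `GlobalMaxPooling1D` or `Flatten` layer placed before the Dense layers would play the collapsing role.

```python
import numpy as np

rng = np.random.default_rng(0)
# (batch, positions, channels): stand-in for the pooled feature map.
features = rng.normal(size=(4, 48, 128))
# Arbitrary weights of a dense layer with a single output unit.
w = rng.normal(size=(128, 1))

# A dense layer on a 3-D tensor transforms the last axis only.
per_position = features @ w
print(per_position.shape)   # (4, 48, 1)

# Global max pooling over the position axis gives one vector per article.
pooled = features.max(axis=1)
per_article = pooled @ w
print(per_article.shape)    # (4, 1)
```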

Creating a summary of the model to see the total number of parameters.

In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization (TextVec  (None, 100)              0         
 torization)                                                     
                                                                 
 embedding (Embedding)       (None, 100, 100)          500000    
                                                                 
 conv1d (Conv1D)             (None, 96, 128)           64128     
                                                                 
 max_pooling1d (MaxPooling1D  (None, 48, 128)          0         
 )                                                               
                                                                 
 dense (Dense)               (None, 48, 256)           33024     
                                                                 
 dense_1 (Dense)             (None, 48, 1)             2
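
The shapes and parameter counts for the embedding, convolution, pooling, and first dense layer can be checked by hand. This short sketch redoes the arithmetic from the layer settings used above, assuming the Conv1D defaults of `'valid'` padding and stride 1.

```python
vocab_size, embed_dim, seq_len = 5000, 100, 100
filters, kernel_size, pool_size, dense_units = 128, 5, 2, 256

# Embedding: one embed_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embed_dim
# Conv1D without padding shortens the sequence by kernel_size - 1.
conv_len = seq_len - kernel_size + 1
# Each filter has kernel_size * embed_dim weights plus one bias.
conv_params = kernel_size * embed_dim * filters + filters
# Max pooling with pool_size 2 halves the sequence length.
pooled_len = conv_len // pool_size
# Dense: one weight per (input channel, unit) pair plus one bias per unit.
dense_params = filters * dense_units + dense_units

print(embedding_params)  # 500000
print(conv_len)          # 96
print(conv_params)       # 64128
print(pooled_len)        # 48
print(dense_params)      # 33024
```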

Compiling and training the model 




In [None]:

opt = Adam(learning_rate= 0.01)
model.compile(optimizer = opt, loss= 'binary_crossentropy', metrics = ['accuracy'])

model.fit(x_train, y_train, epochs = 5, batch_size = 256);

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Evaluate the model's accuracy for both the train and test sets.

In [None]:
model.evaluate(x_train,y_train)
model.evaluate(x_test,y_test)



[0.41151779890060425, 0.7992444634437561]
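
The two numbers returned by `model.evaluate` are the loss and the accuracy, in the order given to `compile`. As a minimal sketch of what they measure, the plain-Python functions below compute binary cross-entropy and accuracy for a few made-up labels and probabilities; the 0.5 threshold is assumed to match the Keras default for binary accuracy.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)          # mean over samples

def accuracy(y_true, y_prob, threshold=0.5):
    # A prediction counts as class 1 when its probability exceeds the threshold.
    hits = sum((p > threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

labels = [1, 0, 1, 0]            # made-up labels
probs = [0.9, 0.2, 0.4, 0.6]     # made-up predicted probabilities

loss = binary_crossentropy(labels, probs)
acc = accuracy(labels, probs)
print(round(loss, 3), acc)
```

A confident wrong prediction raises the loss much more than a hesitant one, which is why loss and accuracy do not always move together.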

Overall, the model is around 80 percent accurate (with a loss of about 0.41), which is not that bad for such a simple network. In the next project, I will see whether an attention-based model (a Transformer) can reach better accuracy on a similar text classification task.