<a href="https://colab.research.google.com/github/enavar25/Machine_learning_projects/blob/main/News_classification_with_bert_transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project:** News classification with a transformer network (BERT)

**Here I use a BERT model to classify the text of a news article as either fake or real news. The dataset is the ISOT Fake News Dataset, which contains two types of articles, fake and real, collected from real-world sources.**


---

 Below, I will import all necessary functions and libraries.



In [None]:
!pip install --quiet gdown==4.5.4 --no-cache-dir # I need this to import large files from google drive

!pip install -q -U "tensorflow-text==2.11.*"

import numpy as np
import pandas as pd


import tensorflow as tf
import tensorflow_hub as hub


import string
from sklearn.feature_extraction import text as sk_text   # aliased so it cannot clash with other modules named `text`
stopwords = sk_text.ENGLISH_STOP_WORDS    # stop words to strip during text cleanup



from sklearn.model_selection import train_test_split
from keras import layers, Model, metrics

import warnings
warnings.filterwarnings('ignore')
##################################
!gdown 1lEK4q5rbE2WWrwATUgnsFG5LI8TvTE9Y
!unzip -qq real_or_fake_news.zip;

Downloading...
From: https://drive.google.com/uc?id=1lEK4q5rbE2WWrwATUgnsFG5LI8TvTE9Y
To: /content/real_or_fake_news.zip
100% 43.0M/43.0M [00:00<00:00, 144MB/s]


Here, I create two pandas DataFrames and assign the label 1 to real news and 0 to fake news. I check for null and duplicate entries; there are no null entries, so I only drop the duplicates. I then combine the real and fake DataFrames into one and preprocess the text by removing punctuation, lowercasing, and removing stopwords (unwanted tokens).

In [None]:
true = pd.read_csv("real_news.csv")
fake = pd.read_csv("fake_news.csv")

true['category'] = 1
fake['category'] = 0
#####################################
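# Print null and duplicate counts (bare expressions in the middle of a cell are not displayed)
print(fake.isnull().sum(), fake.duplicated().sum())
fake = fake.drop_duplicates()
print(true.isnull().sum(), true.duplicated().sum())
true = true.drop_duplicates()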
##########################################
true.reset_index(drop=True, inplace=True)
fake.reset_index(drop=True, inplace=True)

df = pd.concat([true,fake])
df = df.reset_index(drop=True)


def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

# Use pandas to apply the remove_punctuations function to the 'text' column
df['processed_text'] = df['text'].apply(remove_punctuations)
# Convert the processed text to lowercase
df['processed_text'] = df['processed_text'].apply(str.lower)
# Remove all the stopwords from the processed text
df['processed_text'] = df['processed_text'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in (stopwords)])
    )
#####
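The three cleaning steps above can also be folded into a single helper. The sketch below is a minimal standalone version, not the notebook's own code: `clean_text` and `DEMO_STOPWORDS` are hypothetical names, and the tiny stop word set is an assumption standing in for sklearn's much larger `ENGLISH_STOP_WORDS`.

```python
import string

# Small illustrative stop word set (assumption); the notebook uses
# sklearn's much larger ENGLISH_STOP_WORDS instead.
DEMO_STOPWORDS = frozenset({"the", "is", "a", "an", "of", "to", "and", "in"})

# Translation table that deletes every character in string.punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(raw, stopwords=DEMO_STOPWORDS):
    """Strip punctuation, lowercase, and drop stop words."""
    no_punct = raw.translate(_PUNCT_TABLE)
    tokens = no_punct.lower().split()
    return " ".join(tok for tok in tokens if tok not in stopwords)


cleaned = clean_text("The President said: 'This is FAKE news!'")
print(cleaned)
```

With pandas, a helper like this could be applied once via `df['text'].apply(clean_text)` instead of three separate passes over the column.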


I build the inputs and outputs from the preprocessed text and split the data into train and test sets below.

In [None]:
inputs = df['processed_text']
output = df['category']
X_train, X_test, y_train, y_test = train_test_split(inputs,output)
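Called with its defaults, `train_test_split` holds out 25% of the rows and shuffles differently on every run. It also accepts `random_state` and `stratify` arguments, which make the split repeatable and keep the real/fake ratio equal in both sets. The pure-Python sketch below only illustrates what a seeded, stratified split does; `stratified_split` and the toy data are hypothetical and not part of the notebook.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_fraction=0.25, seed=42):
    """Split so each label keeps roughly the same share in train and test."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    def pick(idxs, seq):
        return [seq[i] for i in idxs]

    return (pick(train_idx, samples), pick(test_idx, samples),
            pick(train_idx, labels), pick(test_idx, labels))


texts = [f"article {i}" for i in range(12)]
cats = [1] * 8 + [0] * 4  # imbalanced toy labels (hypothetical data)
X_tr, X_te, y_tr, y_te = stratified_split(texts, cats)
print(len(X_tr), len(X_te), sum(y_te))
```

In the notebook itself, the equivalent would likely be passing `stratify=output` and a fixed `random_state` to `train_test_split`.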

Here we download the BERT model: its preprocessing (tokenization) layer and its encoder, which produces the embeddings. Note that we have already cleaned the text in the DataFrame, but the preprocessing layer performs further steps, such as tokenization, that have not been done yet.

In [None]:
import tensorflow_text  # imported for its side effect: registers the ops used by the preprocessing layer
bertPreprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")

bertEncode = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")


We start building the BERT model by initializing the tokenization layer and the encoding/embedding layer.

In [None]:
inputText = layers.Input(shape=(), dtype=tf.string, name='text')
preprocessedText = bertPreprocess(inputText)
outputs = bertEncode(preprocessedText)
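To make the preprocessing step less opaque, here is a toy sketch of the kind of dictionary it hands to the encoder: three integer sequences padded to a length of 128. This is not the real tokenizer. The actual layer uses a WordPiece vocabulary, while `TOY_VOCAB` and `toy_bert_inputs` are hypothetical, and the special-token ids are the ones I assume for the uncased BERT vocabulary.

```python
SEQ_LEN = 128  # default sequence length of the preprocessing layer
CLS_ID, SEP_ID, PAD_ID, UNK_ID = 101, 102, 0, 100  # assumed uncased BERT special ids

# Hypothetical toy vocabulary; the real layer uses a large WordPiece vocab.
TOY_VOCAB = {"breaking": 1000, "news": 1001, "today": 1002}


def toy_bert_inputs(sentence, seq_len=SEQ_LEN):
    """Whitespace-tokenize, add [CLS]/[SEP], then pad to seq_len."""
    ids = [TOY_VOCAB.get(tok, UNK_ID) for tok in sentence.lower().split()]
    ids = [CLS_ID] + ids[: seq_len - 2] + [SEP_ID]  # truncate, then add specials
    n_real = len(ids)
    n_pad = seq_len - n_real
    return {
        "input_word_ids": ids + [PAD_ID] * n_pad,
        "input_mask": [1] * n_real + [0] * n_pad,  # 1 = real token, 0 = padding
        "input_type_ids": [0] * seq_len,           # single segment: all zeros
    }


encoded = toy_bert_inputs("Breaking news today")
print(encoded["input_word_ids"][:6])
print(sum(encoded["input_mask"]))
```

The mask is what lets the encoder ignore the padding positions, so short and long articles can share one fixed-size batch.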

Next, we add the classification head: a dropout layer to reduce overfitting, followed by a dense layer that serves as the output layer. The dense output layer uses a sigmoid activation to output a number between 0 and 1, since this is a binary classification task.

In [None]:
b = layers.Dropout(0.1, name="dropout")(outputs['pooled_output'])
b = layers.Dense(1, activation='sigmoid', name="output")(b)
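As a rough illustration of what this head computes, the sketch below reimplements dropout and a one-unit sigmoid layer in plain Python. It assumes a 3-dimensional pooled vector with made-up weights (the real pooled output has 768 dimensions and learned weights), and `sigmoid`, `dropout`, and `head` are hypothetical helper names.

```python
import math
import random


def sigmoid(z):
    """Squash any real-valued logit into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dropout(values, rate=0.1, training=True, rng=None):
    """Zero each value with probability `rate` and rescale the survivors."""
    if not training:
        return list(values)  # dropout is a no-op at inference time
    rng = rng or random.Random(0)
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


def head(pooled, weights, bias, training=False):
    """Dropout followed by a single sigmoid unit."""
    dropped = dropout(pooled, training=training)
    logit = sum(w * x for w, x in zip(weights, dropped)) + bias
    return sigmoid(logit)


# Toy numbers (assumptions for illustration only).
prob = head([0.2, -0.5, 0.1], weights=[1.5, -2.0, 0.5], bias=-0.3)
print(prob)
```

Because dropout is only active during training, calling the model for prediction gives deterministic outputs.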

We construct the model from the input and output layers and print a summary to inspect its layers.


In [None]:
model = Model(inputs=[inputText], outputs = [b])
model.summary()

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 text (InputLayer)              [(None,)]            0           []                               
                                                                                                  
 keras_layer_2 (KerasLayer)     {'input_word_ids':   0           ['text[0][0]']                   
                                (None, 128),                                                      
                                 'input_mask': (Non                                               
                                e, 128),                                                          
                                 'input_type_ids':                                                
                                (None, 128)}                                                

We now set the parameters to compile the model. We use the Adam optimizer and binary_crossentropy as the loss function, since this is a binary classification task. The only metrics we track are accuracy and precision.

In [None]:
Metrics = [
      metrics.BinaryAccuracy(name='accuracy'),
      metrics.Precision(name='precision')
]

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=Metrics)
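To show what the loss and the two metrics measure, here is a small pure-Python sketch computed on made-up labels and probabilities. The function names are hypothetical, and the 0.5 threshold is my assumption of the Keras default. Since real news is labeled 1, precision here is the share of articles predicted real that are actually real.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of thresholded predictions that match the labels."""
    preds = [1 if p > threshold else 0 for p in y_prob]
    return sum(int(p == y) for p, y in zip(preds, y_true)) / len(y_true)


def precision(y_true, y_prob, threshold=0.5):
    """True positives divided by all predicted positives."""
    preds = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for p, y in zip(preds, y_true) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, y_true) if p == 1 and y == 0)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Made-up labels and predicted probabilities, for illustration only.
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.7]

loss = binary_cross_entropy(y_true, y_prob)
acc = binary_accuracy(y_true, y_prob)
prec = precision(y_true, y_prob)
print(loss, acc, prec)
```

Neither metric says how many real articles were missed, so adding recall might be worth considering.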

Finally, we train our model.

In [None]:
model.fit(X_train, y_train, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f45f5b71880>

After 5 epochs, the model reaches an accuracy of 92.08 percent and a precision of 91.48 percent on the training data.

We can now use our model to classify new text. In the small test list below, we can put any string in the elements and use the model to predict whether the text is fake or real. Since real news was labeled 1 and fake news 0, a prediction greater than or equal to 0.50 means the text is classified as real; otherwise it is classified as fake.

In [None]:
small_test = ['put your text element here', 'you can type more than one element']
model.predict(small_test)



array([[0.00814331],
       [0.14144714]], dtype=float32)
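The raw sigmoid scores can be turned into readable labels with a simple threshold. The sketch below assumes the labeling used earlier (1 = real, 0 = fake) and a 0.5 cutoff; `label_predictions` is a hypothetical helper, and the scores are copied from the output above.

```python
def label_predictions(scores, threshold=0.5):
    """Map each one-element row of sigmoid scores to 'real' or 'fake'."""
    return ["real" if row[0] >= threshold else "fake" for row in scores]


# Scores copied from the model.predict output above (one row per input text).
scores = [[0.00814331], [0.14144714]]
labels = label_predictions(scores)
print(labels)
```

Both placeholder strings score well below 0.5, so under this convention both are classified as fake.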