<a href="https://colab.research.google.com/github/ds-riselabs/run-am-with-tflite/blob/main/Run_Am_with_Tflite.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## News Source Verification using TensorFlow Lite

#### **Install TF Lite Model Maker**
Install the **TensorFlow Lite Model Maker** library. Model Maker simplifies training models on a custom dataset and shortens training time by applying transfer learning to pre-trained models.

In [1]:
!pip install -q tflite-model-maker


#### **Import necessary libraries**

In [2]:
import numpy as np
from numpy.random import RandomState
import pandas as pd
import os
import glob
import warnings
from tflite_model_maker import model_spec
from tflite_model_maker import text_classifier
from tflite_model_maker.config import ExportFormat
from tflite_model_maker.config import QuantizationConfig
from tflite_model_maker.text_classifier import AverageWordVecSpec
from tflite_model_maker.text_classifier import DataLoader

import tensorflow as tf
assert tf.__version__.startswith('2')
tf.get_logger().setLevel('ERROR')
# Kernel settings
warnings.filterwarnings(action='ignore')
pd.set_option('display.max_rows', 25000)

#### **Import dataset**
Import the true and fake news datasets, which are stored as Excel (`.xlsx`) files, and read them with the pandas library.

In [3]:
#import and mount google drive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [4]:
# Change directory to the Drive folder that holds the downloaded data
%cd /content/gdrive/My Drive/run-am data

/content/gdrive/My Drive/run-am data


In [5]:
# Collect the .xlsx files in the data folder
path = '/content/gdrive/My Drive/run-am data'
excel_files = glob.glob(path + "/*.xlsx")
# Read each .xlsx file into a DataFrame.
# A list (rather than a generator) lets the concatenation cell be re-run safely.
df_list = [pd.read_excel(file) for file in excel_files]

#### **View dataset**
Check that the dataset was imported correctly.

In [6]:
# Concatenate all DataFrames in the data folder
big_df = pd.concat(df_list, ignore_index=True)

In [7]:
big_df.shape

(26409, 4)

In [8]:
big_df.head()

Unnamed: 0,News-Headline,News-Source,Date,Publisher
0,Appeal court sets aside judgement that voided ...,Unverified,2022-11-05 00:00:00,Linda Ikeji
1,Nnamdi Kanu to appear in court May 18 —Defence...,Unverified,2022-11-05 00:00:00,Linda Ikeji
2,The outrageous cost of party nomination form w...,Unverified,2022-11-05 00:00:00,Linda Ikeji
3,2023 Presidency should go to South East - Obas...,Unverified,2022-11-05 00:00:00,LindaIkeji
4,President Buhari rejects call for tenure exten...,Unverified,2022-10-05 00:00:00,Linda-Ikeji


In [9]:
big_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26409 entries, 0 to 26408
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   News-Headline  26409 non-null  object
 1   News-Source    26408 non-null  object
 2   Date           26409 non-null  object
 3   Publisher      26409 non-null  object
dtypes: object(4)
memory usage: 825.4+ KB


In [10]:
big_df["News-Source"].value_counts(dropna = False)

Verified      13424
Unverified    12980
unverified        4
NaN               1
Name: News-Source, dtype: int64

In [11]:
# Inspect rows with a missing News-Source label
nan_value = big_df[big_df['News-Source'].isna()]
nan_value

Unnamed: 0,News-Headline,News-Source,Date,Publisher
493,"2023: Buhari’s Minister, Pauline Tallen Declar...",,2022-08-05 00:00:00,franktalknow


In [12]:
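# Merge the lowercase "unverified" variant into "Unverified"
big_df["News-Source"] = big_df["News-Source"].replace("unverified", "Unverified")
# Label the single row with a missing News-Source as "Unverified"
big_df["News-Source"] = big_df["News-Source"].replace(np.nan, "Unverified")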
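
The two replacements above handle the specific variants seen in `value_counts`. A more general cleanup can be sketched as below. This is a minimal sketch on a made-up toy Series, not the notebook's data; the helper name `normalize_labels` and the extra `" verified "` variant are hypothetical, and filling missing labels with `"Unverified"` simply mirrors the choice made in the cell above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the "News-Source" column (illustrative values only).
labels = pd.Series(["Verified", "Unverified", "unverified", " verified ", np.nan])

def normalize_labels(series, default="Unverified"):
    """Trim whitespace, unify capitalization, and fill missing labels."""
    cleaned = series.str.strip().str.capitalize()
    return cleaned.fillna(default)

normalized = normalize_labels(labels)
print(normalized.value_counts())
```

Normalizing by rule rather than by exact value would also catch variants that do not appear in this particular dataset.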

In [13]:
# Combine headline and publisher into a single text field for the classifier
big_df["Text"] = big_df["News-Headline"] + " -- " + big_df["Publisher"]

In [14]:
# Find rows whose Text repeats an earlier row
duplicateRows = big_df[big_df.duplicated(['Text'])]

In [15]:
duplicateRows

Unnamed: 0,News-Headline,News-Source,Date,Publisher,Text
359,2023 Presidency: Why I Will Appoint Stomach In...,Unverified,30/4/2022,akelicious,2023 Presidency: Why I Will Appoint Stomach In...
395,2023: North Yet To Pick Consensus Candidate — ...,Unverified,24/4/2022,akelicious,2023: North Yet To Pick Consensus Candidate — ...
721,Boroffice plans big for presidential declarati...,Unverified,2022-12-05 00:00:00,franktalknow,Boroffice plans big for presidential declarati...
724,2023: Orji Kalu attacks Edwin Clark for “betra...,Unverified,2022-12-05 00:00:00,franktalknow,2023: Orji Kalu attacks Edwin Clark for “betra...
725,Presidential primary: APC will embrass you at ...,Unverified,2022-12-05 00:00:00,franktalknow,Presidential primary: APC will embrass you at ...
726,Why I want to become president – Saraki offici...,Unverified,2022-12-05 00:00:00,franktalknow,Why I want to become president – Saraki offici...
727,JUST IN: Jonathan and APC: Nigerians don’t kno...,Unverified,2022-12-05 00:00:00,franktalknow,JUST IN: Jonathan and APC: Nigerians don’t kno...
728,"JUST IN: Buhari orders Emefiele, other appoint...",Unverified,2022-12-05 00:00:00,franktalknow,"JUST IN: Buhari orders Emefiele, other appoint..."
729,"PDP cancels zoning, declares presidential tick...",Unverified,2022-12-05 00:00:00,franktalknow,"PDP cancels zoning, declares presidential tick..."
730,2023 presidency: Osinbajo submits nomination form,Unverified,2022-12-05 00:00:00,franktalknow,2023 presidency: Osinbajo submits nomination f...


In [16]:
duplicateRows.shape

(590, 5)

In [17]:
# Drop repeated rows, keeping the first occurrence of each Text
big_df.drop_duplicates(subset="Text", keep="first", inplace=True)
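
The `keep` argument decides what happens to the first occurrence of a repeated row. The sketch below uses a made-up four-row frame (not the notebook's data) to contrast `keep="first"`, which the cell above uses, with `keep=False`, which discards every copy.

```python
import pandas as pd

# Toy frame: "a -- x" appears three times, "b -- y" once.
toy = pd.DataFrame({
    "Text": ["a -- x", "b -- y", "a -- x", "a -- x"],
    "News-Source": ["Verified", "Unverified", "Verified", "Verified"],
})

# duplicated() flags only the repeats, so the first "a -- x" is not counted.
n_repeats = int(toy.duplicated(["Text"]).sum())

# keep="first" retains one copy of each Text; keep=False removes all copies.
kept_first = toy.drop_duplicates(subset="Text", keep="first")
dropped_all = toy.drop_duplicates(subset="Text", keep=False)

print(n_repeats, len(kept_first), len(dropped_all))
```

This is consistent with the notebook's numbers: 26,409 rows minus the 590 flagged repeats leaves 25,819 rows, which is the combined size of the train, validation, and test sets reported further down.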

In [18]:
big_df.head()

Unnamed: 0,News-Headline,News-Source,Date,Publisher,Text
0,Appeal court sets aside judgement that voided ...,Unverified,2022-11-05 00:00:00,Linda Ikeji,Appeal court sets aside judgement that voided ...
1,Nnamdi Kanu to appear in court May 18 —Defence...,Unverified,2022-11-05 00:00:00,Linda Ikeji,Nnamdi Kanu to appear in court May 18 —Defence...
2,The outrageous cost of party nomination form w...,Unverified,2022-11-05 00:00:00,Linda Ikeji,The outrageous cost of party nomination form w...
3,2023 Presidency should go to South East - Obas...,Unverified,2022-11-05 00:00:00,LindaIkeji,2023 Presidency should go to South East - Obas...
4,President Buhari rejects call for tenure exten...,Unverified,2022-10-05 00:00:00,Linda-Ikeji,President Buhari rejects call for tenure exten...


In [19]:
from sklearn.utils import shuffle

# Keep only the text and label columns
big_df = big_df[["Text", "News-Source"]]

# Shuffle
big_df = shuffle(big_df).reset_index(drop=True)

display(big_df)

Unnamed: 0,Text,News-Source
0,My presidency to guarantee national stability ...,Verified
1,Muslim-Muslim ticket: What Tinubu would have d...,Verified
2,Operation vote and take money widespread at Ek...,Unverified
3,2023 Presidency: Ekweremadu Congratulates Okow...,Unverified
4,How Much Did APC Spend On The Ekiti Gubernator...,Unverified
...,...,...
25814,Wike Gives ₦30 Million To Families Of Slain Po...,Unverified
25815,Presidential Running Mate: Okowa’s an asset – ...,Verified
25816,"Alleged Organ Harvesting: Court orders NIMC, o...",Verified
25817,Tinubu: APC Must Not Become Like Other Parties...,Unverified


In [20]:
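# Hold out 20% of the rows as the test set
train_val_df = big_df.sample(frac=0.8)
test_df = big_df.drop(train_val_df.index)

# Split the remaining 80% into training (80%) and validation (20%)
train_df = train_val_df.sample(frac=0.8)
val_df = train_val_df.drop(train_df.index)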

# Reset Index
train_df = train_df.reset_index(drop=True)
val_df = val_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

print('trainset size:', train_df.shape)
print('valset size:', val_df.shape)
print('testset size:', test_df.shape)

trainset size: (16524, 2)
valset size: (4131, 2)
testset size: (5164, 2)
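
The two nested 80/20 splits give roughly 64% training, 16% validation, and 20% test data. Because `sample` is called without a seed, the rows in each set change on every run. The sketch below shows one way to make the split repeatable; the helper `split_frame`, the seed value, and the toy frame are all hypothetical and not part of the notebook.

```python
import pandas as pd

def split_frame(df, train_val_frac=0.8, train_frac=0.8, seed=42):
    """Split df into train/val/test with a fixed seed for repeatability."""
    train_val = df.sample(frac=train_val_frac, random_state=seed)
    test = df.drop(train_val.index)
    train = train_val.sample(frac=train_frac, random_state=seed)
    val = train_val.drop(train.index)
    return (
        train.reset_index(drop=True),
        val.reset_index(drop=True),
        test.reset_index(drop=True),
    )

# Toy frame with 100 unique rows (illustrative only).
toy = pd.DataFrame({
    "Text": [f"headline {i}" for i in range(100)],
    "News-Source": ["Verified", "Unverified"] * 50,
})

train, val, test = split_frame(toy)
print(len(train), len(val), len(test))
```

Calling `split_frame(toy)` again returns the same three sets, which makes later accuracy numbers comparable between runs.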


In [21]:
train_df.to_csv('traindata.csv',  index=False)
val_df.to_csv('valdata.csv',  index=False)
test_df.to_csv('testdata.csv', index=False)

#### **Choose a model architecture**
Choose one of the model architectures below and comment out the rest. Each architecture is different and will yield different results. MobileBERT takes longer to train because its architecture is more complex. Feel free to experiment with different architectures until you find the one that performs best.

In [22]:
spec = model_spec.get('average_word_vec')
# spec = model_spec.get('mobilebert_classifier')
# spec = model_spec.get('bert_classifier')
# spec = AverageWordVecSpec(wordvec_dim=32)


#### **Customize the MobileBERT model hyperparameters**

**Note:** Run this cell only if you have chosen the `MobileBERT Classifier` model architecture.

The model parameters you can adjust are:

* `seq_len`: Length of the sequence to feed into the model.
* `initializer_range`: The standard deviation of the `truncated_normal_initializer` for initializing all weight matrices.
* `trainable`: Boolean that specifies whether the pre-trained layer is trainable.

The training pipeline parameters you can adjust are:

* `model_dir`: The location of the model checkpoint files. If not set, a temporary directory will be used.
* `dropout_rate`: The dropout rate.
* `learning_rate`: The initial learning rate for the Adam optimizer.
* `tpu`: TPU address to connect to.

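For instance, you can set `seq_len=1024` (default is 128) so that the model can classify longer text. In this run the setting was applied to the `average_word_vec` spec chosen above, which is why the model summary reports an embedding output of `(None, 1024, 16)`.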

In [23]:
spec.seq_len = 1024 #512 #256
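
To make the role of `seq_len` concrete, the sketch below shows the general idea of forcing every tokenized input to one fixed length. It is a simplified illustration, not Model Maker's actual preprocessing: the helper `pad_or_truncate`, the padding id `0`, and padding on the right are all assumptions.

```python
def pad_or_truncate(token_ids, seq_len, pad_id=0):
    """Return exactly seq_len ids: cut long inputs, right-pad short ones."""
    clipped = token_ids[:seq_len]
    return clipped + [pad_id] * (seq_len - len(clipped))

short = pad_or_truncate([5, 9, 2], seq_len=6)
long = pad_or_truncate(list(range(1, 11)), seq_len=6)

print(short)  # [5, 9, 2, 0, 0, 0]
print(long)   # [1, 2, 3, 4, 5, 6]
```

Under this view, a larger `seq_len` avoids cutting off long inputs, while short inputs such as headlines would consist mostly of padding.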

#### **Load train, test and validation data**
Load the training, validation and test data CSV files to prepare the model training process. Make sure the `is_training` parameter for `test_data` and `val_data` is set to `False`.

In [24]:
train_data = DataLoader.from_csv(
      filename='traindata.csv',
      text_column='Text',
      label_column='News-Source',
      model_spec=spec,
      is_training=True)

test_data = DataLoader.from_csv(
      filename='testdata.csv',
      text_column='Text',
      label_column='News-Source',
      model_spec=spec,
      is_training=False)

val_data = DataLoader.from_csv(
      filename='valdata.csv',
      text_column='Text',
      label_column='News-Source',
      model_spec=spec,
      is_training=False) 

#### **Train model**
Train the model on the training dataset. Feel free to experiment with the number of epochs until you find the value that gives the best results.

In [25]:
model = text_classifier.create(train_data, model_spec=spec, epochs=10)

Epoch 2/2
Epoch 3/3
Epoch 4/4
Epoch 5/5
Epoch 6/6
Epoch 7/7
Epoch 8/8
Epoch 9/9
Epoch 10/10


#### **Examine your model structure - Layers of the neural network**

In [26]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 1024, 16)          160048    
                                                                 
 global_average_pooling1d (G  (None, 16)               0         
 lobalAveragePooling1D)                                          
                                                                 
 dense (Dense)               (None, 16)                272       
                                                                 
 dropout (Dropout)           (None, 16)                0         
                                                                 
 dense_1 (Dense)             (None, 2)                 34        
                                                                 
Total params: 160,354
Trainable params: 160,354
Non-trainable params: 0
__________________________________________________
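
The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic using the standard formulas for embedding and dense layers; the only inputs are the numbers printed above, and the helper name `dense_params` is hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

embedding_params = 160_048   # from the summary
embedding_dim = 16           # last axis of (None, 1024, 16)

# An embedding table has one row of embedding_dim weights per vocabulary entry.
vocab_rows = embedding_params // embedding_dim

hidden = dense_params(16, 16)   # dense
output = dense_params(16, 2)    # dense_1, one unit per class
total = embedding_params + hidden + output

print(vocab_rows, hidden, output, total)
```

The embedding table therefore has 10,003 rows and holds almost all of the model's weights. Note that `seq_len` does not appear in any of these formulas, so changing it alters the input shape but not the parameter count.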

#### **Evaluate the model**
Evaluate the model's accuracy on the test and validation data to see whether the model needs adjustments, such as more training data or hyperparameter tuning, to improve accuracy.

In [27]:
test_loss, test_acc = model.evaluate(test_data)



In [28]:
val_loss, val_acc = model.evaluate(val_data)
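
Accuracy is the share of predictions that match the true label. When deciding whether the model needs tuning, it can also help to look at precision and recall for one class. The sketch below computes all three in plain Python; the helper `binary_metrics` is hypothetical and the label lists are made up for illustration, so they are not results from this model.

```python
def binary_metrics(y_true, y_pred, positive="Verified"):
    """Accuracy, plus precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Made-up labels, for illustration only.
y_true = ["Verified", "Verified", "Unverified", "Unverified", "Verified"]
y_pred = ["Verified", "Unverified", "Unverified", "Verified", "Verified"]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```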



#### **Export TF Lite model**
The final model is exported in TF Lite format so that it can be downloaded and deployed directly in our Android app.

In [30]:
model.export(export_dir='average_word_vec')