# Classification

## Loading data

For the classification, let's load the same JSON files.

In [1]:
import pandas as pd
from pathlib import Path

In [2]:
data_path = Path(r'./data')

In [3]:
files_data = {
    file.stem : file for file in data_path.iterdir()
}

In [13]:
df_fake_test = pd.read_json(files_data["fake_test"])
df_real_test = pd.read_json(files_data["real_test"])

## Data preparation

As a proof of concept, we can use only the news title to perform the classification. We can drop the rows without `title`. We'll assign label `1` to the fake news and only keep the columns with `title` and `label` and rename the `title` column to `text` in order to use the [simple transformers](https://github.com/ThilinaRajapakse/simpletransformers) Python library. We'll assign label `0` to the real news. Finally, we concatenate the fake and real dataframes into one dataframe for both the training and the test data.

In [14]:
df_fake_test = df_fake_test.drop(df_fake_test.loc[df_fake_test["title"]==""].index)
df_fake_test["label"] = 1
df_fake_test=df_fake_test[["title", "label"]]
df_fake_test.rename(columns={"title":"text"})

Unnamed: 0,text,label
0,THE POST-GLOBAL ORDER IS SOMETHING INEVITABLE,1
1,Report: US military may transfer SK to Chinese...,1
2,Oats and corn flakes strengthen immunity again...,1
3,Sports,1
4,Egypt now,1
...,...,...
195,"Mrs. Minister ""in an hour"" among NATO's ""wise ...",1
196,"Study: ""Corona"" is not considered fatal to all...",1
197,THE PESTILENTIALS: THE GEOPOLITICS OF EPIDEMIC...,1
198,"Important news, ""a famous Russian doctor revea...",1


In [15]:
df_real_test = df_real_test.drop(df_real_test.loc[df_real_test["title"]==""].index)
df_real_test["label"] = 0
df_real_test=df_real_test[["title", "label"]]
df_real_test.rename(columns={"title":"text"})

Unnamed: 0,text,label
0,Steph Curry : The World Depends on Our Kindnes...,0
1,Russian Region Crowded Queues Spark Coronaviru...,0
2,Moscow Halts Abortions During Coronavirus Outb...,0
3,China reports further fall in new virus cases ...,0
4,"New Mexico counties , rural residents fight tr...",0
...,...,...
194,What We Can Learn From the 20th Century Deadli...,0
195,Two passengers dead from Coronavirus - hit cru...,0
197,How does the coronavirus death rate compare to...,0
198,Kenya is mobilising against coronavirus – but ...,0


In [16]:
test_data = pd.concat([df_fake_test, df_real_test])
test_data = train_data.sample(frac=1).reset_index(drop=True)

In [4]:
df_fake_train = pd.read_json(files_data["fake_train"])
df_real_train = pd.read_json(files_data["real_train"])

In [5]:
df_fake_train = df_fake_train.drop(df_fake_train.loc[df_fake_train["title"]==""].index)
df_fake_train["label"] = 1
df_fake_train=df_fake_train[["title", "label"]]
df_fake_train.rename(columns={"title":"text"})

Unnamed: 0,text,label
0,Online Facts New conspiracy theory: #Bel_Gates...,1
1,Revolutionary Guards: Corona could be an Ameri...,1
2,Yellow skin is the host environment of the vir...,1
3,China and Russia are doing what the European U...,1
5,Suspicious facts accompanied Corona ... Who knew?,1
...,...,...
795,Russia Today Channel / Mystery of the coronavi...,1
796,"In Russia, they invented a drug to treat coron...",1
797,How the Soviet Union responded to a more serio...,1
798,News 6060,1


In [6]:
df_real_train = df_real_train.drop(df_real_train.loc[df_real_train["title"]==""].index)
df_real_train["label"] = 0
df_real_train=df_real_train[["title", "label"]]
df_real_train.rename(columns={"title":"text"})

Unnamed: 0,text,label
0,Scots GPs told not to meet fever patients as f...,0
1,Coronavirus : Fighting al - Shabab propaganda ...,0
2,Engineer fears China virus impact,0
3,Coronavirus : South Korean PM vows swift act...,0
4,Finnair issues profit warning over Covid - 19 ...,0
...,...,...
794,Could There Be Another Coronavirus Quarantine ?,0
795,"If I get corona , I get corona : the America...",0
796,Could COVID - 19 coronavirus trigger a Europea...,0
798,How Coronavirus Is Changing Donald Trump 2020 ...,0


In [7]:
train_data = pd.concat([df_fake_train, df_real_train])
train_data.reset_index(drop=True, inplace=True)
train_data = train_data.sample(frac=1).reset_index(drop=True)

## Classifier training and evaluation

As a proof of concept, we'll use a small Transformer Language Model (LM), `distilBERT`. Which is a streamlined version of `BERT`.
We train it with the shuffled training data.

In [12]:
from simpletransformers.classification import ClassificationModel

model = ClassificationModel('distilbert', 'distilbert-base-uncased', num_labels=2, use_cuda=False)
model.train_model(train_data)

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=267967963.0, style=ProgressStyle(descri…




Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=231508.0, style=ProgressStyle(descripti…




  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."


HBox(children=(FloatProgress(value=0.0, max=1586.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, description='Epoch', max=1.0, style=ProgressStyle(description_width='i…

HBox(children=(FloatProgress(value=0.0, description='Running Epoch 0 of 1', max=199.0, style=ProgressStyle(des…











We pass the shuffled testing data in order to get a performance evaluation.

In [17]:
result, model_outputs, wrong_predictions = model.eval_model(test_data)

  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."


HBox(children=(FloatProgress(value=0.0, max=1586.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, description='Running Evaluation', max=199.0, style=ProgressStyle(descr…




In [18]:
result

{'mcc': 0.8689157469260813,
 'tp': 695,
 'tn': 783,
 'fp': 11,
 'fn': 97,
 'eval_loss': 0.17339340240306142}

We can compute the accuracy of the model<br><br> $$acc = \frac{TP + TN}{TP + TN + FP + FN}$$<br><br> where TP are true positives, TN are true negatives, FP are false positives and FN are false negatives. We can use this metric since our dataset was balanced among the two classes.

In [44]:
acc = (result["tp"] + result["tn"])/(result["tp"] + result["tn"] + result["fp"] + result["fn"])
print(f'Accuracy: {acc:.2%}')

Accuracy: 93.19%


The model was trained only one epoch and we already obtained a really high accuracy value! This is good proof of concept of the efficiency of using this kind of models. We can try using not only the title but the complete text news, and try with some hyperparameter tuning, such as the learning rate or the number of epochs in order to improve the result.

## Predictions

With the model trained, we can make predictions in arbitrary input. For example:

In [36]:
predictions, raw_outputs = model.predict(["New Zealand mosque shooter gets life in jail withput parole"])

HBox(children=(FloatProgress(value=0.0, max=1.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=1.0), HTML(value='')))




In [41]:
"Real news" if predictions == 0 else "Fake news"

'Real news'