# Part 2: Simple model

Running all models at once might crash the kernel.

# Task 0: Choice of label groups

The "political" and "reliable" categories are treated as "real news", and all remaining categories as "fake news". Some categories, such as "unknown" or "unreliable", may contain legitimate articles that ended up there only because they were never validated or confirmed. This could make it harder for the model to classify these articles correctly.
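
The grouping described above can be sketched as a small mapping function. This is only an illustration: the constant `REAL_LABELS`, the helper `to_binary_label` and the choice of 0 for "real" and 1 for "fake" are assumptions, not necessarily the encoding stored in the `.npz` files.

```python
# Categories treated as "real news"; every other category counts as "fake news".
REAL_LABELS = {"political", "reliable"}


def to_binary_label(article_type: str) -> int:
    """Return 0 for "real news" and 1 for "fake news" (assumed encoding)."""
    return 0 if article_type.strip().lower() in REAL_LABELS else 1


example_types = ["reliable", "political", "unknown", "unreliable", "clickbait"]
binary = [to_binary_label(t) for t in example_types]
print(binary)
```

Note that "unknown" and "unreliable" end up in the "fake" group here, which is exactly the source of label noise mentioned above.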


In [1]:
import numpy as np
from scipy.sparse import csr_matrix, hstack, save_npz
data = np.load('../Simple_995.npz', allow_pickle=True)
data_BBC = np.load('../Simple_995_BBC.npz', allow_pickle=True)


# Task 1: Simple model baselines

There are two baseline models: one uses a one-hot encoding, the other a Bag of Words encoding. Both use logistic regression, as it is one of the simplest classification models.

### Logistic regression with one-hot encoding:

This is the simplest model that made sense to us. It classifies articles using one-hot representations of their vocabularies, i.e. which words occur in an article. Because this representation ignores how often each word occurs, it performs poorly, with a validation accuracy of about 0.58.
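
To make the difference between the two encodings concrete, here is a minimal sketch on two toy sentences. It assumes scikit-learn's `CountVectorizer`; the features in `Simple_995.npz` were precomputed and may have been built differently.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, made up for illustration.
docs = ["shocking shocking shocking news", "news report"]

# binary=True records only whether a word occurs (one-hot style encoding).
onehot = CountVectorizer(binary=True).fit_transform(docs).toarray()

# The default settings keep raw term counts (Bag of Words).
bow = CountVectorizer().fit_transform(docs).toarray()

print(onehot)
print(bow)
```

In the one-hot matrix the repeated word "shocking" is indistinguishable from a word used once, while the Bag of Words matrix keeps the count.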

In [2]:
from scipy.sparse import hstack
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score, accuracy_score, precision_score, recall_score
import pickle


#load features
X_train=data['X_train_content_ONEHOT'].item()
X_val=data['X_val_content_ONEHOT'].item()
Y_train = data['Y_train'].ravel()
Y_val = data['Y_val'].ravel()

#fit,predict and evaluate
model = LogisticRegression(tol=1e-3, max_iter=1000,penalty = None)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_val)
print("Accuracy:",accuracy_score(Y_val,Y_pred))
print("Precision:",precision_score(Y_val,Y_pred,average="binary"))
print("Recall:",recall_score(Y_val,Y_pred,average="binary"))
print("F1 score:",f1_score(Y_val,Y_pred,average="binary"))
    

with open("../simple_model_ONEHOT.pkl",'wb') as f:
    pickle.dump(model,f)

del model, X_train, X_val, Y_train, Y_val, Y_pred

Accuracy: 0.5818633540372671
Precision: 0.5397275899998643
Recall: 0.9648249745108511
F1 score: 0.6922226189335818
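
As a quick consistency check on the numbers above, the F1 score can be recomputed as the harmonic mean of precision and recall. This is a minimal sketch that reuses the printed values and does not rerun the model.

```python
# Values copied from the printed output of the one-hot model.
precision = 0.5397275899998643
recall = 0.9648249745108511

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))
```

Very high recall combined with precision close to 0.5 is typically what one sees when a model assigns most articles to the positive class.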


### Logistic regression with Bag of words encoding:

This baseline takes word frequency into account and yields considerably better results: a validation accuracy of about 0.87, compared with about 0.58 for the one-hot encoding.

It is used as the simple model from here on, since it remains simple while still performing reasonably well.

In [3]:
from scipy.sparse import hstack
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score, accuracy_score, precision_score, recall_score
import pickle


#load features
X_train=data['X_train_content_BOW'].item()
X_val=data['X_val_content_BOW'].item()
Y_train = data['Y_train'].ravel()
Y_val = data['Y_val'].ravel()

#fit,predict and evaluate
model = LogisticRegression(tol=1e-3, max_iter=1000,penalty = None)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_val)
print("Accuracy:",accuracy_score(Y_val,Y_pred))
print("Precision:",precision_score(Y_val,Y_pred,average="binary"))
print("Recall:",recall_score(Y_val,Y_pred,average="binary"))
print("F1 score:",f1_score(Y_val,Y_pred,average="binary"))
    
del model, X_train, X_val, Y_train, Y_val, Y_pred

Accuracy: 0.8691156462585033
Precision: 0.8838441743738694
Recall: 0.8421129290673399
F1 score: 0.8624740499484107


# Task 2: Which column features might be useful

From the data exploration, it is clear that some domains produce much more real/fake news than others. This would make domain a good indicator of fake/real news.

Furthermore, the title of the article should also be helpful: "fake" articles, especially those in categories like "hate", "extreme bias" and "clickbait", tend to have distinctive titles that are provocative, misleading or exaggerated. This feature should therefore help distinguish fake news from real news.

The rest of the columns have too many empty cells to be used as features.
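
A check of this kind can be sketched with pandas as follows. The DataFrame, its column names and the threshold `MAX_MISSING` are hypothetical stand-ins, not values taken from the actual dataset.

```python
import pandas as pd

# Toy stand-in for the article table; columns and values are hypothetical.
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "authors": [None, "X", None, None],
    "keywords": [None, None, None, None],
})

# Fraction of empty cells per column, most incomplete column first.
missing_fraction = df.isna().mean().sort_values(ascending=False)

MAX_MISSING = 0.5  # illustrative threshold, not taken from the notebook
usable = missing_fraction[missing_fraction <= MAX_MISSING].index.tolist()

print(missing_fraction)
print(usable)
```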

### Domain:

Adding domain features to the simple model gives an almost perfect prediction (accuracy of about 0.985), which is surprising.

Training the simple model on domain features alone gives a near-perfect prediction (accuracy of about 0.9998), indicating that the domain of an article is directly tied to it being real or fake.

Because of this, it would not make sense to include domain in the model, as it effectively serves as the answer.
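
One way to confirm this kind of leakage directly is to count how many domains only ever appear with a single label. The sketch below uses made-up `(domain, label)` pairs; with the real data, the pairs would come from the raw article table rather than the encoded feature matrices.

```python
from collections import defaultdict

# Toy (domain, label) pairs; domains and labels are made up for illustration.
pairs = [
    ("a.example", 1), ("a.example", 1),
    ("b.example", 0), ("b.example", 0),
    ("c.example", 0), ("c.example", 1),
]

# Collect the set of labels observed for each domain.
labels_per_domain = defaultdict(set)
for domain, label in pairs:
    labels_per_domain[domain].add(label)

# A "pure" domain only ever occurs with one label, so it gives the label away.
pure = [d for d, labels in labels_per_domain.items() if len(labels) == 1]
pure_fraction = len(pure) / len(labels_per_domain)
print(pure_fraction)
```

A pure fraction close to 1 would mean that a classifier can recover the label from the domain alone, without learning anything about the article text.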

In [4]:
from scipy.sparse import hstack
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score, accuracy_score, precision_score, recall_score

#load features
X_train_content=data['X_train_content_BOW'].item()
X_val_content=data['X_val_content_BOW'].item()
X_train_domain = data['X_train_domain'].item()
X_val_domain = data['X_val_domain'].item()

Y_train = data['Y_train'].ravel()
Y_val = data['Y_val'].ravel()

#combine content and domain features
X_train = hstack([X_train_content,X_train_domain])
X_val = hstack([X_val_content, X_val_domain])

#train predict and evaluate with content and domain
model = LogisticRegression(tol=1e-3, max_iter=1000,penalty = None)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_val)
print("Model using Content and Domain")
print("Accuracy:",accuracy_score(Y_val,Y_pred))
print("Precision:",precision_score(Y_val,Y_pred,average="binary"))
print("Recall:",recall_score(Y_val,Y_pred,average="binary"))
print("F1 score:",f1_score(Y_val,Y_pred,average="binary"))
    
#train predict and evaluate with only domain
#(note: unlike the other models, this one uses the default L2 penalty and max_iter=100)
model1 = LogisticRegression(tol=1e-3)
model1.fit(X_train_domain, Y_train)
Y_pred = model1.predict(X_val_domain)
print("Model using Domain only")
print("Accuracy:",accuracy_score(Y_val,Y_pred))
print("Precision:",precision_score(Y_val,Y_pred,average="binary"))
print("Recall:",recall_score(Y_val,Y_pred,average="binary"))
print("F1 score:",f1_score(Y_val,Y_pred,average="binary"))
    

#free all large objects, including the combined matrices
del model, model1, X_train, X_val, X_train_content, X_val_content, X_train_domain, X_val_domain, Y_train, Y_val, Y_pred


Model using Content and Domain
Accuracy: 0.9852351375332742
Precision: 0.9872176416060887
Recall: 0.9824246249453804
F1 score: 0.9848153015038692
Model using Domain only
Accuracy: 0.9997633836143153
Precision: 1.0
Recall: 0.9995144924018061
F1 score: 0.9997571872571872


### Title:

Adding title features improves performance: accuracy rises from about 0.869 to 0.898 and the F1 score from about 0.862 to 0.895. This indicates that the title of an article helps distinguish real news from fake news. It therefore makes sense to include the title.

In [5]:
from scipy.sparse import hstack
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score, accuracy_score, precision_score, recall_score

#load features
X_train_content=data['X_train_content_BOW'].item()
X_val_content=data['X_val_content_BOW'].item()
X_train_title = data['X_train_title'].item()
X_val_title = data['X_val_title'].item()
Y_train = data['Y_train'].ravel()
Y_val = data['Y_val'].ravel()

#combine content and title features
X_train = hstack([X_train_content,X_train_title])
X_val = hstack([X_val_content, X_val_title])

#train predict and evaluate with content and title
model = LogisticRegression(tol=1e-3, max_iter=1000,penalty = None)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_val)
print("Model using Content and Title")
print("Accuracy:",accuracy_score(Y_val,Y_pred))
print("Precision:",precision_score(Y_val,Y_pred,average="binary"))
print("Recall:",recall_score(Y_val,Y_pred,average="binary"))
print("F1 score:",f1_score(Y_val,Y_pred,average="binary"))

#free all large objects, including the combined matrices
del model, X_train, X_val, X_train_content, X_val_content, X_train_title, X_val_title, Y_train, Y_val, Y_pred


Model using Content and Title
Accuracy: 0.8976989056492162
Precision: 0.8974283830317239
Recall: 0.8920473855415837
F1 score: 0.8947297938909923


# Task 3: Simple model with added BBC articles

Adding the BBC articles makes little difference to the results (accuracy of about 0.871, compared with about 0.869 for the Bag of Words baseline). However, since the dataset had slightly more fake articles than real ones, the BBC articles are kept: they make the dataset slightly more balanced, and additional training data is unlikely to hurt.
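
The effect on class balance can be checked with a short helper. This is a sketch on toy labels: the function `class_shares` is hypothetical, and it assumes integer labels with 0 for real and 1 for fake, which may not match the encoding in `Simple_995_BBC.npz`.

```python
import numpy as np


def class_shares(y):
    """Return the share of each class (index 0 and 1) in a label vector."""
    counts = np.bincount(np.asarray(y, dtype=int), minlength=2)
    return counts / counts.sum()


# Toy labels: slightly more fake (1) than real (0) articles.
y_before = np.array([1, 1, 1, 0, 0])

# Appending one extra real article, standing in for the added BBC articles.
y_after = np.concatenate([y_before, np.zeros(1, dtype=int)])

print(class_shares(y_before))
print(class_shares(y_after))
```

Applied to `Y_train` from both files, the same helper would show how far the BBC articles actually shift the balance.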

In [6]:
from scipy.sparse import hstack
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score, accuracy_score, precision_score, recall_score
import pickle

#load features
X_train=data_BBC['X_train'].item()
X_val = data_BBC['X_val'].item()
Y_train = data_BBC['Y_train'].ravel()
Y_val = data_BBC['Y_val'].ravel()

#fit,predict and evaluate
model = LogisticRegression(tol=1e-3, max_iter=1000,penalty = None)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_val)
print("simple model with BBC articles")
print("Accuracy:",accuracy_score(Y_val,Y_pred))
print("Precision:",precision_score(Y_val,Y_pred,average="binary"))
print("Recall:",recall_score(Y_val,Y_pred,average="binary"))
print("F1 score:",f1_score(Y_val,Y_pred,average="binary"))

with open("../simple_model_BOW.pkl",'wb') as f:
    pickle.dump(model,f)
    
del model, X_train, X_val, Y_train, Y_val, Y_pred

simple model with BBC articles
Accuracy: 0.8706857042129384
Precision: 0.8869285286484329
Recall: 0.845932308278193
F1 score: 0.8659454718222352
