# Part 2: Simple model

# Task 0: Choice of label groups

The political and reliable categories are treated as "real news", and all remaining categories as "fake news". Some categories, such as unknown or unreliable, may contain legitimate articles that were placed there only because they had not been validated or confirmed. This may make it difficult for the model to categorize these articles correctly.
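
The grouping described above can be sketched as a small mapping function. This is only an illustration: the function name `to_binary_label`, the string labels "real" and "fake", and the example category list are assumptions, not the preprocessing that produced `Simple_995.npz`.

```python
# "political" and "reliable" are the real-news categories named in the text.
# Every other category is treated as fake news.
REAL_TYPES = {"political", "reliable"}


def to_binary_label(article_type: str) -> str:
    """Map a fine-grained article category to 'real' or 'fake' (hypothetical helper)."""
    return "real" if article_type.strip().lower() in REAL_TYPES else "fake"


# Illustrative category names taken from the ones mentioned in this notebook.
examples = ["reliable", "political", "unknown", "unreliable", "clickbait"]
for article_type in examples:
    print(article_type, "->", to_binary_label(article_type))
```

Note that under this rule an article of type unknown is always labeled fake, which is exactly the source of label noise discussed above.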


In [6]:
import numpy as np
from scipy.sparse import csr_matrix, hstack, save_npz
data = np.load('../Simple_995.npz', allow_pickle=True)
data_BBC = np.load('../Simple_995_BBC.npz', allow_pickle=True)


# Task 1: Simple model baselines

There are two baseline models: one uses a one-hot encoding, the other a bag-of-words encoding. Both use logistic regression, since it is a simple, standard classification model.

### Logistic regression with one-hot encoding:

This is the simplest sensible model we could think of. It classifies articles using one-hot representations of their vocabularies, which record only whether a word occurs in an article. Because this representation ignores word frequency, it performs poorly: accuracy is 0.59, and recall on class 0 is only 0.19, so most articles are predicted as class 1.
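
To make the difference between the two encodings concrete, here is a minimal pure-Python sketch. The two toy documents and the helper names `bow_vector` and `onehot_vector` are made up for illustration; this is not the code that built the `X_train_content_ONEHOT` or `X_train_content_BOW` matrices.

```python
from collections import Counter

# Two toy documents (hypothetical data).
docs = ["shocking shocking shocking news", "news report"]

# Shared vocabulary, sorted so that every vector uses the same column order.
vocab = sorted({word for doc in docs for word in doc.split()})


def bow_vector(doc):
    """Bag of words: how many times each vocabulary word occurs."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]


def onehot_vector(doc):
    """One-hot: 1 if the vocabulary word occurs at all, else 0."""
    return [1 if count > 0 else 0 for count in bow_vector(doc)]


print(vocab)
for doc in docs:
    print("one-hot:", onehot_vector(doc), "bag of words:", bow_vector(doc))
```

In the one-hot vector, a word repeated three times is indistinguishable from a word used once, which is the information the bag-of-words encoding keeps.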

In [7]:
from scipy.sparse import hstack
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import pickle


#load features
X_train=data['X_train_content_ONEHOT'].item()
X_test=data['X_test_content_ONEHOT'].item()
Y_train = data['Y_train'].ravel()
Y_test = data['Y_test'].ravel()

#fit,predict and evaluate
model = LogisticRegression(tol=1e-3)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)
print(classification_report(Y_test, Y_pred))

del model, X_train, X_test, Y_train, Y_test, Y_pred

              precision    recall  f1-score   support

           0       0.98      0.19      0.32     42919
           1       0.54      1.00      0.70     41607

    accuracy                           0.59     84526
   macro avg       0.76      0.59      0.51     84526
weighted avg       0.76      0.59      0.51     84526



### Logistic regression with Bag of words encoding:

This baseline takes word frequency into consideration and yields considerably better results (accuracy 0.86 compared with 0.59).

This is the simple model used from here on, as it is simple but not overly so.

In [8]:
from scipy.sparse import hstack
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import pickle


#load features
X_train=data['X_train_content_BOW'].item()
X_test=data['X_test_content_BOW'].item()
Y_train = data['Y_train'].ravel()
Y_test = data['Y_test'].ravel()

#fit,predict and evaluate
model = LogisticRegression(tol=1e-3)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)
print(classification_report(Y_test, Y_pred))

with open("../simple_BOW_model.pkl",'wb') as f:
    pickle.dump(model,f)
    
del model, X_train, X_test, Y_train, Y_test, Y_pred

              precision    recall  f1-score   support

           0       0.85      0.89      0.87     42919
           1       0.88      0.83      0.86     41607

    accuracy                           0.86     84526
   macro avg       0.86      0.86      0.86     84526
weighted avg       0.86      0.86      0.86     84526



# Task 2: Which column features might be useful

From the data exploration, it is clear that some domains produce much more real/fake news than others. This would make domain a good indicator of fake/real news.

Furthermore, the title of the article would also be helpful: "fake" articles, especially ones labeled "hate", "extreme bias", "clickbait", etc., tend to have distinctive titles that are provocative, misleading or exaggerated. This feature would therefore help distinguish fake news from real news.

The rest of the columns have too many empty cells to be used as features.

### Domain:

Adding domain features to the simple model gives an almost perfect prediction (accuracy 0.96), which is surprising.

Training the simple model on domain features alone gives a perfect prediction, indicating that the domain of an article is directly tied to whether it is labeled real or fake.

Because of this, it would not make sense to include domain in the model, as it effectively serves as the answer.
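
One way to check for this kind of leakage directly, without training a model, is to test whether every domain maps to exactly one label. The sketch below uses invented `(domain, label)` pairs and hypothetical helper names; it assumes the raw domain and label columns are available as plain Python values.

```python
from collections import defaultdict

# Hypothetical (domain, label) pairs standing in for the real columns.
rows = [
    ("example-a.com", "fake"),
    ("example-a.com", "fake"),
    ("example-b.org", "real"),
    ("example-b.org", "real"),
    ("example-c.net", "fake"),
]


def labels_per_domain(rows):
    """Collect the set of labels observed for each domain."""
    seen = defaultdict(set)
    for domain, label in rows:
        seen[domain].add(label)
    return dict(seen)


def domain_determines_label(rows):
    """True if no domain ever appears with more than one label."""
    return all(len(labels) == 1 for labels in labels_per_domain(rows).values())


print(labels_per_domain(rows))
print("domain determines label:", domain_determines_label(rows))
```

If this check returns `True` on the full dataset, a perfect score for the domain-only model is expected, since the classifier only needs to memorize one label per domain.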

In [9]:
from scipy.sparse import hstack
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

#load features
X_train_content=data['X_train_content_BOW'].item()
X_test_content=data['X_test_content_BOW'].item()
X_train_domain = data['X_train_domain'].item()
X_test_domain = data['X_test_domain'].item()

Y_train = data['Y_train'].ravel()
Y_test = data['Y_test'].ravel()

#combine content and domain features
X_train = hstack([X_train_content,X_train_domain])
X_test = hstack([X_test_content, X_test_domain])

#train predict and evaluate with content and domain
model = LogisticRegression(tol=1e-3)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)
print("Model using Content and Domain")
print(classification_report(Y_test, Y_pred))

#train predict and evaluate with only domain
model1 = LogisticRegression(tol=1e-3)
model1.fit(X_train_domain, Y_train)
Y_pred = model1.predict(X_test_domain)
print("Model using Domain only")
print(classification_report(Y_test, Y_pred))

del model, model1, X_train, X_test, X_train_content, X_test_content, X_train_domain, X_test_domain, Y_train, Y_test, Y_pred


Model using Content and Domain
              precision    recall  f1-score   support

           0       0.96      0.97      0.96     42919
           1       0.97      0.96      0.96     41607

    accuracy                           0.96     84526
   macro avg       0.96      0.96      0.96     84526
weighted avg       0.96      0.96      0.96     84526

Model using Domain only
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     42919
           1       1.00      1.00      1.00     41607

    accuracy                           1.00     84526
   macro avg       1.00      1.00      1.00     84526
weighted avg       1.00      1.00      1.00     84526



### Title:

Adding title features gives a modest improvement: accuracy rises from 0.86 to 0.88. This indicates that the title of an article helps distinguish real news from fake news. It therefore makes sense to include title.
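
Combining content and title features amounts to placing the two sparse matrices side by side, so each article keeps one row and gains the extra title columns. A minimal sketch with toy matrices (the values are made up, and `format="csr"` is an optional choice here, not something the cells in this notebook set):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Two articles: 3 content features and 2 title features each (toy values).
content = csr_matrix(np.array([[2, 0, 1],
                               [0, 3, 0]]))
title = csr_matrix(np.array([[1, 0],
                             [0, 1]]))

# Stack column-wise; both blocks must have the same number of rows.
combined = hstack([content, title], format="csr")

print(combined.shape)
print(combined.toarray())
```

The row count is unchanged while the column count is the sum of the two blocks, which is why the same `Y_train` and `Y_test` vectors can be reused with the combined matrix.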

In [10]:
from scipy.sparse import hstack
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

#load features
X_train_content=data['X_train_content_BOW'].item()
X_test_content=data['X_test_content_BOW'].item()
X_train_title = data['X_train_title'].item()
X_test_title = data['X_test_title'].item()
Y_train = data['Y_train'].ravel()
Y_test = data['Y_test'].ravel()

#combine content and title features
X_train = hstack([X_train_content,X_train_title])
X_test = hstack([X_test_content, X_test_title])

#train predict and evaluate with content and title
model = LogisticRegression(tol=1e-3)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)
print("Model using Content and Title")
print(classification_report(Y_test, Y_pred))
del model, X_train, X_test, X_train_content, X_test_content, X_train_title, X_test_title, Y_train, Y_test, Y_pred


Model using Content and Title
              precision    recall  f1-score   support

           0       0.88      0.88      0.88     42919
           1       0.88      0.87      0.87     41607

    accuracy                           0.88     84526
   macro avg       0.88      0.88      0.88     84526
weighted avg       0.88      0.88      0.88     84526



# Task 3: Simple model with added BBC articles

Adding the BBC articles makes little difference to the result: accuracy stays at 0.86. However, since the dataset had slightly more fake articles, the BBC articles are included. They make the dataset slightly more balanced, and the extra data is unlikely to hurt.

In [11]:
from scipy.sparse import hstack
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import pickle

#load features
X_train=data_BBC['X_train'].item()
X_test = data_BBC['X_test'].item()
Y_train = data_BBC['Y_train'].ravel()
Y_test = data_BBC['Y_test'].ravel()

#fit,predict and evaluate
model = LogisticRegression(tol=1e-3)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)
print(classification_report(Y_test, Y_pred))

with open("../simple_model.pkl",'wb') as f:
    pickle.dump(model,f)
    
del model, X_train, X_test, Y_train, Y_test, Y_pred

              precision    recall  f1-score   support

           0       0.85      0.89      0.87     43036
           1       0.88      0.84      0.86     42060

    accuracy                           0.86     85096
   macro avg       0.87      0.86      0.86     85096
weighted avg       0.87      0.86      0.86     85096
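
The balance claim can be checked against the support counts printed in the reports above. The sketch below only uses the test-set supports, so it assumes the training split has a similar class ratio; the helper name `majority_share` is hypothetical.

```python
# Test-set supports per class, copied from the classification reports above.
supports_before = {0: 42919, 1: 41607}  # bag-of-words baseline, no BBC articles
supports_after = {0: 43036, 1: 42060}   # with BBC articles added


def majority_share(supports):
    """Fraction of samples belonging to the larger class (0.5 is perfectly balanced)."""
    total = sum(supports.values())
    return max(supports.values()) / total


print(f"majority share before BBC: {majority_share(supports_before):.4f}")
print(f"majority share after BBC:  {majority_share(supports_after):.4f}")
```

A lower majority share after adding the BBC articles is consistent with the statement that they make the dataset slightly more balanced, although the shift is small.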

