# Final Evaluation

The best models were logistic regression and XGBoost. I will use logistic regression because it is the simpler model, so it should be less likely to overfit and more likely to perform well on unseen data. First I will import the libraries and apply TF-IDF vectorization, and then I will fit the model on the training data and run it on the test data. I will use accuracy and precision as my metrics, since the goal is to separate real news from fake news. The higher the precision for the real-news class, the fewer fake news articles slip past the classifier.
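To make the two metrics concrete, here is a minimal sketch that computes them by hand. The eight labels are made up for illustration and are not taken from the dataset; `y_true` and `y_hat` are hypothetical names.

```python
# Hypothetical labels for eight articles: 1 = real news, 0 = fake news
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_hat = [1, 1, 0, 0, 0, 1, 0, 1]

# Accuracy: share of all articles that were labeled correctly
correct = sum(t == p for t, p in zip(y_true, y_hat))
accuracy = correct / len(y_true)

# Precision for the real-news class: of the articles predicted real,
# the share that really are real
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hat))
pred_pos = sum(p == 1 for p in y_hat)
precision = true_pos / pred_pos

# Fake articles that were predicted real, i.e. the ones that slipped by
false_pos = pred_pos - true_pos

print(f"accuracy={accuracy:.2f} precision={precision:.2f} fake_slipped={false_pos}")
```

In this toy case one fake article is predicted as real, which is what pulls precision below 1.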

# Load data

In [4]:
# import libraries
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
import string

In [2]:
# import data
train = pd.read_csv("train_clean.csv")
test = pd.read_csv("test_clean.csv")

train.head(3)

Unnamed: 0,text,real_news
0,"Upset at Trump, Mexico voices 'worry and irrit...",1
1,WATCH: Obama Directly Calls Trump Out For Car...,0
2,Trump lawyer says president not informed he is...,1


# TF-IDF Vectorization

In [5]:
# preprocess text, split X and y, and perform TF-IDF vectorization

# preprocess text
# remove punctuation and lowercase everything
def preprocess(text):
    for char in string.punctuation:
        text = text.replace(char, "")
    text = text.lower()
    return text

train["text"] = train["text"].map(preprocess)
test["text"] = test["text"].map(preprocess)

# split x and y
X, y = train.text, train.real_news
X_test, y_test = test.text, test.real_news


# TF-IDF vectorization: fit the vocabulary on the training text only,
# then reuse it to transform the test text
vz = TfidfVectorizer()
X = vz.fit_transform(X)
X_test = vz.transform(X_test)
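As a side note, the punctuation loop in `preprocess` can also be written as a single pass with `str.translate`. This is only a sketch of an equivalent alternative, not what the notebook ran; `preprocess_fast`, `PUNCT_TABLE`, and the sample headline are hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess_fast(text):
    """Hypothetical alternative to preprocess(): strip punctuation, then lowercase."""
    return text.translate(PUNCT_TABLE).lower()

# Made-up headline, used only to show the effect
sample = "WATCH: Obama Directly Calls Trump Out!"
cleaned = preprocess_fast(sample)
print(cleaned)
```

Note that both versions delete punctuation without inserting a space, so a token like `U.S.-Mexico` collapses into `usmexico`.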

In [7]:
X.shape

(26553, 186835)
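The shape reads as (number of training articles, number of distinct terms in the learned vocabulary). The toy sketch below shows the same behavior on a tiny scale, assuming the default `TfidfVectorizer` settings; the headlines and the `toy_*` names are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up, already-preprocessed headlines standing in for train and test text
toy_train = ["trump meets mexico", "obama calls trump"]
toy_test = ["mexico calls senator"]

toy_vz = TfidfVectorizer()
toy_X = toy_vz.fit_transform(toy_train)  # learns the vocabulary from the training text only
toy_X_test = toy_vz.transform(toy_test)  # reuses that vocabulary; unseen words are dropped

vocab = sorted(toy_vz.vocabulary_)
print(vocab)
print(toy_X.shape, toy_X_test.shape)
```

The test matrix has the same number of columns as the training matrix, and the word "senator", which never appears in the training text, contributes nothing to it.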

# Fitting and running the model

In [8]:
# create model with best params
model = LogisticRegression(C=1000)

# fit model
model.fit(X, y)

# running the model on test data
y_pred = model.predict(X_test)

# results
print(classification_report(y_test, y_pred))



              precision    recall  f1-score   support

           0       0.99      0.99      0.99      4539
           1       0.99      0.99      0.99      4313

    accuracy                           0.99      8852
   macro avg       0.99      0.99      0.99      8852
weighted avg       0.99      0.99      0.99      8852



So the final result on the 8,852 test articles is 99% accuracy and 99% precision for both classes.