<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Fake-news-detection" data-toc-modified-id="Fake-news-detection-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Fake news detection</a></span><ul class="toc-item"><li><span><a href="#Importing-libraries" data-toc-modified-id="Importing-libraries-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Importing libraries</a></span></li><li><span><a href="#Feature-extraction-functions" data-toc-modified-id="Feature-extraction-functions-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Feature extraction functions</a></span></li><li><span><a href="#Reading-and-preparing-the-corpus" data-toc-modified-id="Reading-and-preparing-the-corpus-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Reading and preparing the corpus</a></span></li><li><span><a href="#Parametrization-and-feature-extraction" data-toc-modified-id="Parametrization-and-feature-extraction-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Parametrization and feature extraction</a></span></li><li><span><a href="#Frequency-threshold" data-toc-modified-id="Frequency-threshold-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Frequency threshold</a></span></li><li><span><a href="#Weighting-schemes" data-toc-modified-id="Weighting-schemes-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Weighting schemes</a></span></li><li><span><a href="#Classification-Process" data-toc-modified-id="Classification-Process-1.7"><span class="toc-item-num">1.7&nbsp;&nbsp;</span>Classification Process</a></span></li><li><span><a href="#Training" data-toc-modified-id="Training-1.8"><span class="toc-item-num">1.8&nbsp;&nbsp;</span>Training</a></span><ul class="toc-item"><li><span><a href="#Initializing-classification-algorithms" data-toc-modified-id="Initializing-classification-algorithms-1.8.1"><span class="toc-item-num">1.8.1&nbsp;&nbsp;</span>Initializing classification algorithms</a></span></li><li><span><a 
href="#Cross-Validation-Classification-and-Evaluation-Function" data-toc-modified-id="Cross-Validation-Classification-and-Evaluation-Function-1.8.2"><span class="toc-item-num">1.8.2&nbsp;&nbsp;</span>Cross-Validation Classification and Evaluation Function</a></span></li></ul></li><li><span><a href="#Testing" data-toc-modified-id="Testing-1.9"><span class="toc-item-num">1.9&nbsp;&nbsp;</span>Testing</a></span><ul class="toc-item"><li><span><a href="#Test-the-performance-of-Character-bi-gram-Baseline" data-toc-modified-id="Test-the-performance-of-Character-bi-gram-Baseline-1.9.1"><span class="toc-item-num">1.9.1&nbsp;&nbsp;</span>Test the performance of Character bi-gram Baseline</a></span></li></ul></li><li><span><a href="#Evaluation-Matrix" data-toc-modified-id="Evaluation-Matrix-1.10"><span class="toc-item-num">1.10&nbsp;&nbsp;</span>Evaluation Matrix</a></span></li></ul></li></ul></div>

# Fake news detection

## Importing libraries

In [39]:
import re
import glob
import numpy as np
import os
import json
import argparse
import time
import codecs
import string
import random
import scipy.sparse as sp
from pathlib import Path
import pandas as pd 

from nltk.corpus import stopwords as sw
from string import punctuation
from IPython.display import display


from random import randrange
from scipy.sparse import csr_matrix, csc_matrix, hstack, coo_matrix
from gensim.matutils import Scipy2Corpus, corpus2csc
from gensim.models.logentropy_model import LogEntropyModel
from collections import defaultdict, Counter



from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC, SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, confusion_matrix, make_scorer, classification_report

In [40]:
#To create a list of required libraries and their versions 
#!pip freeze > requirements.txt

## Feature extraction functions

In [41]:
#Extracts word n-grams; with n=1 this is equivalent to a bag of words
def wordNgrams(text, n):
    # Drop stand-alone punctuation tokens before building the n-grams
    text = [word for word in text.split() if word not in string.punctuation]
    ngrams = [' '.join(text[i:i+n]) for i in range(len(text)-n+1)]
    return ngrams

In [42]:
#Extracts character n-grams; the '_cng' suffix marks them as character features
def charNgrams(text, n):
    ngrams = [text[i:i+n]+'_cng' for i in range(len(text)-n+1)]
    return ngrams
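
To illustrate what the two extractors above produce, here is a minimal sketch that re-implements them under hypothetical names (`word_ngrams_demo`, `char_ngrams_demo`) so it can run on its own. The sample strings are made up.

```python
import string

def word_ngrams_demo(text, n):
    # Drop stand-alone punctuation tokens (a token is dropped when it is a
    # substring of string.punctuation), then slide a window of n tokens
    tokens = [w for w in text.split() if w not in string.punctuation]
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams_demo(text, n):
    # Slide a window of n characters; '_cng' tags the result as a character n-gram
    return [text[i:i + n] + '_cng' for i in range(len(text) - n + 1)]

word_bigrams = word_ngrams_demo("fake news !", 2)
char_bigrams = char_ngrams_demo("news", 2)
print(word_bigrams)  # ['fake news']
print(char_bigrams)  # ['ne_cng', 'ew_cng', 'ws_cng']
```

The suffix keeps character features distinct from word features that happen to have the same spelling when several feature types are combined.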

In [43]:
def load_diccionario(ruta):
    # Loads a term list (one term per line) from the file 'ruta' into a lowercase set
    terms = set()
    try:
        # The 'with' block closes the file even if reading fails
        with open(ruta, "r", encoding="utf8") as tmp:
            for linea in tmp:
                terms.add(linea.rstrip().lower())
        return terms
    except IOError as e:
        print("Error: " + ruta + " I/O error({0}): {1}".format(e.errno, e.strerror))
        exit(1)

In [44]:
#Extracts function-word n-grams using a pre-loaded dictionary
def funcNgrams(text, n):
    # Note: the dictionary file is re-read on every call
    stop_words = load_diccionario('stop_words.txt')
    patt=r'\b(' + ('|'.join(re.escape(key) for key in stop_words)).lstrip('|') + r')\b'
    pattern = re.compile(patt)
    # Collapse line breaks and repeated spaces, normalize apostrophes, strip punctuation
    text = re.sub(r"(\n+|\r+|(\r\n)+)", " ", text)
    text = re.sub(r" +", " ", text)
    text = re.sub(r"’", "'", text)
    text = re.sub(r"[" + re.escape(punctuation) + "]+", "", text)
    terms = pattern.findall(text)
    n_grams=[('_'.join(terms[i:i+n])) + "_fwn" for i in range(len(terms)-n+1)]

    return n_grams
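
The sketch below shows the idea behind function-word n-grams on a made-up sentence. It is a simplified stand-in for `funcNgrams`: the name `func_ngrams_demo` and the four-word stop list are assumptions used only for illustration, in place of the contents of `stop_words.txt`.

```python
import re
from string import punctuation

def func_ngrams_demo(text, n, stop_words):
    # Match any stop word as a whole word
    pattern = re.compile(r'\b(' + '|'.join(re.escape(w) for w in sorted(stop_words)) + r')\b')
    # Lowercase and strip punctuation before matching
    text = re.sub('[' + re.escape(punctuation) + ']+', '', text.lower())
    terms = pattern.findall(text)
    # Join consecutive function words with '_' and tag them with '_fwn'
    return ['_'.join(terms[i:i + n]) + '_fwn' for i in range(len(terms) - n + 1)]

stop_words = {'the', 'of', 'is', 'a'}  # hypothetical stand-in for stop_words.txt
fw_bigrams = func_ngrams_demo("The source of the story is a rumor.", 2, stop_words)
print(fw_bigrams)  # ['the_of_fwn', 'of_the_fwn', 'the_is_fwn', 'is_a_fwn']
```

Note that the n-grams are built over the sequence of function words only, so content words such as "source" and "story" are skipped rather than breaking the sequence.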

In [45]:
def extract_features(text,cn,wn,fn):
    text = text.lower()
    #text=clean_text(text)
    features = []
    for n in wn:
        if n != 0:
            features.extend(wordNgrams(text,n))
    for n in cn:
        if n != 0:
            features.extend(charNgrams(text,n))
    for n in fn:
        if n != 0:
            features.extend(funcNgrams(text,n))
    return features

In [46]:
# Extracts all features from a set of 'texts'. Returns one string per text, with its
# features joined by the symbol '&%$', and a Counter with the frequency of every feature
def process_texts(texts,cn,wn,fn):
    featuresList=[]
    featuresDict=Counter()
    for text in texts:
        features=extract_features(text,cn,wn,fn)
        featuresDict.update(features)
        featuresList.append('&%$'.join(features))
    return featuresList, featuresDict
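
The following sketch illustrates the join-and-split round trip that `process_texts` relies on: each document becomes one string, and splitting on the same separator later recovers the feature list. The feature values here are made up.

```python
from collections import Counter

SEP = '&%$'  # separator used by process_texts

# Made-up feature lists for two documents
docs_features = [['fa_cng', 'ak_cng', 'ke_cng'], ['ke_cng', 'ey_cng']]

features_dict = Counter()
features_list = []
for feats in docs_features:
    features_dict.update(feats)            # corpus-wide feature counts
    features_list.append(SEP.join(feats))  # one string per document

# Splitting on the separator recovers the original feature lists
recovered = [doc.split(SEP) for doc in features_list]
print(features_list[1])         # ke_cng&%$ey_cng
print(features_dict['ke_cng'])  # 2
```

A multi-character separator that is unlikely to appear inside a feature keeps the split unambiguous, which a single space or comma would not, since character n-grams can contain both.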

## Reading and preparing the corpus

In [47]:
def preprocessText(text):
    # Replace every digit with '0' (raw string avoids an invalid-escape warning)
    cleantext = re.sub(r'\d','0',text)
    return cleantext
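
As a quick illustration of the digit normalization above, the sketch below applies the same substitution to a made-up sentence. The function name `preprocess_demo` is hypothetical.

```python
import re

def preprocess_demo(text):
    # Replace every digit with '0' so that numbers of the same shape share one form
    return re.sub(r'\d', '0', text)

cleaned = preprocess_demo("In 2016, 47% of 1,200 readers agreed.")
print(cleaned)  # In 0000, 00% of 0,000 readers agreed.
```

A plausible motivation is that specific numbers rarely generalize across articles, while the presence and shape of a number may still be informative.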

Function 1 (below) reads each file and passes its text through `preprocessText`, which replaces every digit with `0`. Use either Function 1 or Function 2, not both at the same time: comment out the one that is not needed.

In [48]:
# Function 1
# utility function for reading files (applies preprocessText to each file)
def read_txt_files(files):
    text=[]
    topic=[]
    for file_path in files:
        with open(file_path,"r", encoding="utf8") as infile:
            cleantext = preprocessText(infile.read())
            text.append(cleantext)
            # Topic = alphanumeric characters of the file name (last path component, no extension)
            file_topic=''.join(re.findall('[A-Za-z0-9]',file_path.split('\\')[-1].split('.')[0]))
            topic.append(file_topic)
    return text, topic

Use Function 2 (below) when the raw text should be read without applying `preprocessText`.

In [49]:
# #Function 2
# #utility function for reading files
# def read_txt_files(files):
#     text=[]
#     topic=[]
#     for i,file_path in sorted(enumerate(files)):
#         print('read', file_path)
#         with open(file_path,"r", encoding="utf8") as infile:
#             text.append(infile.read())
#             file_topic=''.join(re.findall('[A-Za-z]',file_path.split('\\')[3].split('.')[0]))
#             topic.append(file_topic)
#     return text, topic

In [50]:
#reading the path of real and fake news for training
train_path_real='Data\\Train\\Real\\'
train_path_fake='Data\\Train\\Fake\\'

real_news, real_news_topics = read_txt_files(sorted(glob.glob(train_path_real+'*.txt')))

fake_news, fake_news_topics = read_txt_files(sorted(glob.glob(train_path_fake+'*.txt')))

#concatenating real and fake news in one variable for training
train_texts = np.concatenate((real_news, fake_news))
train_labels = np.concatenate((np.zeros(len(real_news)), np.ones(len(fake_news))))
train_topics = np.concatenate((real_news_topics, fake_news_topics))

In [51]:
print ('Train:')
print ('\t Real:',len(real_news))
print ('\t Fake:',len(fake_news))

Train:
	 Real: 500
	 Fake: 400


In [52]:
#reading the path of real and fake news for testing
test_path_real='Data\\Test\\Real\\'
test_path_fake='Data\\Test\\Fake\\'

real_news, real_news_topics = read_txt_files(sorted(glob.glob(test_path_real+'*.txt')))
fake_news, fake_news_topics = read_txt_files(sorted(glob.glob(test_path_fake+'*.txt')))

#concatenating real and fake news in one variable for testing
test_texts = np.concatenate((real_news, fake_news))
test_labels = np.concatenate((np.zeros(len(real_news)), np.ones(len(fake_news))))
test_topics = np.concatenate((real_news_topics, fake_news_topics))

In [53]:
print ('Test:')
print ('\t Real:',len(real_news))
print ('\t Fake:',len(fake_news))

Test:
	 Real: 250
	 Fake: 150


## Parametrization and feature extraction

In [54]:
# Parameters
cnvalues=[2] #character n-grams
wnvalues=[0] # word n-grams; bag of words
fnvalues=[0] # function words n-grams

In [55]:
#Train feature extraction
print('Extracting features')
train_features, dicOfFeatures = process_texts(train_texts,cnvalues,wnvalues,fnvalues)

# min_df=2 drops features that occur in fewer than 2 documents; adjust as needed
vectorizer = CountVectorizer(lowercase=False, min_df=2, tokenizer=lambda x: x.split('&%$'))
train_data = vectorizer.fit_transform(train_features)
train_data = train_data.astype(float)
print('\t', 'Labels for each document: ', len(train_labels))
print('\t', 'Total training files (Real + Fake) : ', len(train_texts))
print('\t', 'Vocabulary size of', len(train_texts), 'files is : ',len(dicOfFeatures))
print ('\t','Train shape:',train_data.shape)
print('\t', 'class distribution',Counter(train_labels))

Extracting features
	 Labels for each document:  900
	 Total training files (Real + Fake) :  900
	 Vocabulary size of 900 files is :  2800
	 Train shape: (900, 2312)
	 class distribution Counter({0.0: 500, 1.0: 400})


"Labels for each document" is the number of labels, one per file. Of the 900 training files, the first 500 (real news) are labeled 0 and the remaining 400 (fake news) are labeled 1.
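
The output above reports a vocabulary of 2800 features but a training matrix with only 2312 columns. The most plausible explanation is the `min_df=2` setting of the vectorizer, which discards features that occur in fewer than two documents. The sketch below reproduces that filtering in plain Python on made-up feature lists; it illustrates the idea and is not the vectorizer's actual code.

```python
from collections import Counter

# Made-up feature lists for three documents
docs = [['ab_cng', 'bc_cng'],
        ['ab_cng', 'cd_cng'],
        ['ab_cng', 'bc_cng', 'de_cng']]

# Every distinct feature seen anywhere (what len(dicOfFeatures) reports)
all_features = Counter(f for doc in docs for f in doc)

# Document frequency: in how many documents each feature occurs
doc_freq = Counter(f for doc in docs for f in set(doc))

min_df = 2
vocabulary = sorted(f for f, df in doc_freq.items() if df >= min_df)
print(len(all_features))  # 4
print(vocabulary)         # ['ab_cng', 'bc_cng']
```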

In [73]:
# Test feature extraction
print('Extracting Test features')
test_features,dicOfFeaturesTest = process_texts(test_texts,cnvalues,wnvalues,fnvalues)
test_data = vectorizer.transform(test_features)
test_data = test_data.astype(float)

print('\t', 'Total testing files (Real + Fake): ', len(test_texts))
print('\t', 'vocabulary size: ',len(dicOfFeaturesTest))
print ('\t','Test shape:',test_data.shape)
print('\t', 'class distribution: ',Counter(test_labels))

Extracting Test features
	 Total testing files (Real + Fake):  400
	 vocabulary size:  2325
	 Test shape: (400, 2312)
	 class distribution:  Counter({0.0: 250, 1.0: 150})


## Frequency threshold

With N = 5, every feature whose total frequency in the training data is less than 5 is removed.

In [57]:
N=5
X=train_data
# Total count of each feature across all training documents
values=np.array(X.sum(axis=0)).ravel()
# Column indices of the features whose total count is below N
indices_zero = list(np.nonzero(values < N)[0])
all_cols = np.arange(X.shape[1])
cols_to_keep = np.where(np.logical_not(np.isin(all_cols, indices_zero)))[0]
train_data = X[:, cols_to_keep]
#####

scaled_train_data=train_data
scaled_test_data=test_data
print('Train shape:',scaled_train_data.shape)
print('Test shape:',scaled_test_data.shape)

Train shape: (900, 1803)
Test shape: (400, 2312)


To perform the experiments, the training and test feature matrices must have the same
number of columns. The cell below removes from the test data the same columns that were
dropped from the training data.

In [58]:
Z=test_data
all_cols = np.arange(Z.shape[1])
# Reuse the columns selected on the training data (np.isin replaces the deprecated np.in1d)
cols_to_keep = np.where(np.logical_not(np.isin(all_cols, indices_zero)))[0]
test_data = Z[:, cols_to_keep]
scaled_test_data=test_data
print('Test shape:',scaled_test_data.shape)

Test shape: (400, 1803)
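
The two cells above can be summarized with a small pure-Python sketch: the threshold is computed on the training counts only, and the resulting column selection is then applied unchanged to the test matrix. The matrices are made up.

```python
# Made-up count matrices: rows are documents, columns are features
train = [[3, 0, 1],
         [4, 1, 0],
         [2, 0, 1]]
test = [[1, 5, 0],
        [0, 2, 2]]

N = 5
# Total count of each column, computed on the training data only
col_totals = [sum(row[j] for row in train) for j in range(len(train[0]))]
cols_to_keep = [j for j, total in enumerate(col_totals) if total >= N]

# Apply the same column selection to both matrices
train_kept = [[row[j] for j in cols_to_keep] for row in train]
test_kept = [[row[j] for j in cols_to_keep] for row in test]
print(col_totals)    # [9, 1, 2]
print(cols_to_keep)  # [0]
print(test_kept)     # [[1], [0]]
```

Column 1 is frequent in the test matrix but is still dropped, because the decision depends on the training data alone. This keeps test information from influencing feature selection.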


## Weighting schemes

In [59]:
# print ('only frequency:',test_data)
feature_weight='tfidf' # possible values: binary, logent, tfidf, norm, relat

if feature_weight == 'binary':
    scaled_train_data = preprocessing.Binarizer().fit_transform(scaled_train_data)
    scaled_test_data = preprocessing.Binarizer().fit_transform(scaled_test_data)
    print ("feature_weight = binary")
    
elif feature_weight == 'logent':
    Xc = Scipy2Corpus(scaled_train_data)
    log_ent = LogEntropyModel(Xc)
    X = log_ent[Xc]
    X = corpus2csc(X)
    scaled_train_data = sp.csc_matrix.transpose(X)
    
    Xtest = Scipy2Corpus(scaled_test_data)
    X = log_ent[Xtest]
    X = corpus2csc(X, scaled_train_data.shape[1])
    scaled_test_data = sp.csc_matrix.transpose(X)
    print ("feature_weight = logent")
    
elif feature_weight == 'tfidf':
    transformer = TfidfTransformer()
    scaled_train_data = transformer.fit_transform(scaled_train_data)
    scaled_test_data = transformer.transform(scaled_test_data)
    print ("feature_weight = tfidf")
    
elif feature_weight=='norm':
    # Apply the same L2 normalization to train and test before max-abs scaling
    scaled_train_data = preprocessing.normalize(scaled_train_data, norm='l2')
    scaled_test_data = preprocessing.normalize(scaled_test_data, norm='l2')
    max_abs_scaler = preprocessing.MaxAbsScaler()
    scaled_train_data = max_abs_scaler.fit_transform(scaled_train_data)
    scaled_test_data = max_abs_scaler.transform(scaled_test_data)
    print ("feature_weight = norm")
    
elif feature_weight=='relat':
    s = scaled_train_data.sum(axis = 1)
    scaled_train_data = coo_matrix(np.nan_to_num(scaled_train_data/s))
    s = scaled_test_data.sum(axis = 1)
    scaled_test_data = coo_matrix(np.nan_to_num(scaled_test_data/s))
    print ("feature_weight = relat")
    
else:
    print ("feature_weight = tf")
    
# print ('with weighting scheme:',scaled_test_data)

feature_weight = tfidf
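
The `tfidf` branch selected above uses `TfidfTransformer` with its default settings. Assuming the scikit-learn defaults (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), the weighting amounts to a smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The pure-Python sketch below applies that formula to a made-up count matrix. It illustrates the weighting and does not replace the library call.

```python
import math

# Made-up counts: 2 documents x 3 features
counts = [[2, 1, 0],
          [1, 0, 1]]

n_docs = len(counts)
n_feats = len(counts[0])
# Document frequency of each feature
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_feats)]
# Smoothed inverse document frequency
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted)) or 1.0  # guard against empty rows
    tfidf.append([v / norm for v in weighted])

print(df)   # [2, 1, 1]
print([round(v, 3) for v in tfidf[0]])
```

As in the notebook, the IDF values must be learned from the training data and then reused for the test data, which is why the cell calls `fit_transform` on the training matrix and only `transform` on the test matrix.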


## Classification Process

## Training

### Initializing classification algorithms

In [60]:
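# Initializing the classifiers and fitting them on the training data
# (clf is not fitted here; it is only used for cross-validation below)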
clf=LinearSVC(C=0.01,class_weight='balanced', random_state=85)
clfSVC=SVC(C=0.01, kernel='linear',class_weight='balanced')
clfSVC.fit(scaled_train_data, train_labels)
clfMnb=MultinomialNB()
clfMnb.fit(scaled_train_data, train_labels)
clfBnb=BernoulliNB()
clfBnb.fit(scaled_train_data, train_labels)
clfLG=LogisticRegression(solver='lbfgs', tol=0.001, C=0.01,class_weight='balanced')
clfLG.fit(scaled_train_data, train_labels)
clfDT=DecisionTreeClassifier(random_state=0)
clfDT.fit(scaled_train_data, train_labels)
clfRFC=RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
clfRFC.fit(scaled_train_data, train_labels)
clfAB=AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=None)
clfAB.fit(scaled_train_data, train_labels)

AdaBoostClassifier()

### Cross-Validation Classification and Evaluation Function

In [61]:
# #Utility function
originalclass=[]
predictedclass=[]
def classification_report_with_f1_score(y_true, y_pred):
    originalclass.extend(y_true)
    predictedclass.extend(y_pred)
    return f1_score(y_true, y_pred) # return the F1 score of the positive (fake) class

In [62]:
print('Training Classifier')

# Custom scorer: returns the F1 score of each fold and also collects the true and predicted labels
nested_score = cross_val_score(clf, X=scaled_train_data, y=train_labels, cv=10, scoring=make_scorer(classification_report_with_f1_score))
#cvScoreLinearSVC=cross_val_score(clf, scaled_train_data, train_labels, cv=10, scoring='f1').mean()

cvScoreMnb=cross_val_score(clfMnb, scaled_train_data, train_labels, cv=10, scoring='f1').mean()
#print('10-Fold Cross-validation Multinomial Naive Bayes',cvScoreMnb)


cvScoreSVC=cross_val_score(clfSVC, scaled_train_data, train_labels, cv=10, scoring='f1').mean()
#print('10-Fold Cross-validation Linear SVC',cvScoreSVC)

cvScoreLG=cross_val_score(clfLG, scaled_train_data, train_labels, cv=10, scoring='f1').mean()
#print('10-Fold Cross-validation Logistic Regression',cvScoreLG)

cvScoreAD=cross_val_score(clfAB, scaled_train_data, train_labels, cv=10, scoring='f1').mean()
#print('10-Fold Cross-validation AdaBoost',cvScoreAD)

cvScoreDT=cross_val_score(clfDT, scaled_train_data, train_labels, cv=10, scoring='f1').mean()
#print('10-Fold Cross-validation Decision Tree',cvScoreDT)

cvScoreRFC=cross_val_score(clfRFC, scaled_train_data, train_labels, cv=10, scoring='f1').mean()
#print('10-Fold Cross-validation Random Forest',cvScoreRFC)

# print('10-Fold Cross-validation Linear SVC',nested_score.mean())
#print(classification_report(originalclass, predictedclass))

Training Classifier


## Testing

### Test the performance of Character bi-gram Baseline

In [63]:
#Multinomial Naive Bayes
prediction_MNB=clfMnb.predict(scaled_test_data)
print(classification_report(test_labels, prediction_MNB))
print('\n Multinomial Naive Bayes')
print('Accuracy',accuracy_score(test_labels, prediction_MNB))
print('F1-score',f1_score(test_labels, prediction_MNB))
print('F1-mac',f1_score(test_labels, prediction_MNB, average='macro'))
print('F1-weighted',f1_score(test_labels, prediction_MNB, average='weighted'))

              precision    recall  f1-score   support

         0.0       0.70      1.00      0.83       250
         1.0       1.00      0.30      0.46       150

    accuracy                           0.74       400
   macro avg       0.85      0.65      0.64       400
weighted avg       0.82      0.74      0.69       400


 Multinomial Naive Bayes
Accuracy 0.7375
F1-score 0.4615384615384615
F1-mac 0.6439923712650985
F1-weighted 0.6896058486967577


In [64]:
prediction_SVC=clfSVC.predict(scaled_test_data)
print(classification_report(test_labels, prediction_SVC))
print('\n SVC')
print('Accuracy',accuracy_score(test_labels, prediction_SVC))
print('F1-score',f1_score(test_labels, prediction_SVC))
print('F1-mac',f1_score(test_labels, prediction_SVC, average='macro'))
print('F1-weighted',f1_score(test_labels, prediction_SVC, average='weighted'))

              precision    recall  f1-score   support

         0.0       0.62      1.00      0.77       250
         1.0       0.00      0.00      0.00       150

    accuracy                           0.62       400
   macro avg       0.31      0.50      0.38       400
weighted avg       0.39      0.62      0.48       400


 SVC
Accuracy 0.625
F1-score 0.0
F1-mac 0.38461538461538464
F1-weighted 0.4807692307692308



In [65]:
prediction_Bnb=clfBnb.predict(scaled_test_data)
print(classification_report(test_labels, prediction_Bnb))
print('\n Bernoulli Naive Bayes')
print('Accuracy',accuracy_score(test_labels, prediction_Bnb))
print('F1-score',f1_score(test_labels, prediction_Bnb))
print('F1-mac',f1_score(test_labels, prediction_Bnb, average='macro'))
print('F1-weighted',f1_score(test_labels, prediction_Bnb, average='weighted'))

              precision    recall  f1-score   support

         0.0       0.96      0.64      0.77       250
         1.0       0.62      0.95      0.75       150

    accuracy                           0.76       400
   macro avg       0.79      0.80      0.76       400
weighted avg       0.83      0.76      0.76       400


 Bernoulli Naive Bayes
Accuracy 0.76
F1-score 0.7486910994764399
F1-mac 0.7595130138530526
F1-weighted 0.7622184924472057


In [66]:
prediction_LG=clfLG.predict(scaled_test_data)
print(classification_report(test_labels, prediction_LG))
print('\n Logistic Regression')
print('Accuracy',accuracy_score(test_labels, prediction_LG))
print('F1-score',f1_score(test_labels, prediction_LG))
print('F1-mac',f1_score(test_labels, prediction_LG, average='macro'))
print('F1-weighted',f1_score(test_labels, prediction_LG, average='weighted'))

              precision    recall  f1-score   support

         0.0       0.94      0.89      0.91       250
         1.0       0.83      0.90      0.87       150

    accuracy                           0.90       400
   macro avg       0.89      0.90      0.89       400
weighted avg       0.90      0.90      0.90       400


 Logistic Regression
Accuracy 0.895
F1-score 0.8653846153846153
F1-mac 0.8896595208070617
F1-weighted 0.8957282471626733


In [67]:
prediction_DT=clfDT.predict(scaled_test_data)
print(classification_report(test_labels, prediction_DT))
print('\n Decision Tree')
print('Accuracy',accuracy_score(test_labels, prediction_DT))
print('F1-score',f1_score(test_labels, prediction_DT))
print('F1-mac',f1_score(test_labels, prediction_DT, average='macro'))
print('F1-weighted',f1_score(test_labels, prediction_DT, average='weighted'))

              precision    recall  f1-score   support

         0.0       0.83      0.83      0.83       250
         1.0       0.72      0.73      0.72       150

    accuracy                           0.79       400
   macro avg       0.78      0.78      0.78       400
weighted avg       0.79      0.79      0.79       400


 Decision Tree
Accuracy 0.79
F1-score 0.7218543046357616
F1-mac 0.7765898029202905
F1-weighted 0.7902736774914225


In [68]:
prediction_RFC=clfRFC.predict(scaled_test_data)
print(classification_report(test_labels, prediction_RFC))
print('\n Random Forest')
print('Accuracy',accuracy_score(test_labels, prediction_RFC))
print('F1-score',f1_score(test_labels, prediction_RFC))
print('F1-mac',f1_score(test_labels, prediction_RFC, average='macro'))
print('F1-weighted',f1_score(test_labels, prediction_RFC, average='weighted'))

              precision    recall  f1-score   support

         0.0       0.62      1.00      0.77       250
         1.0       0.00      0.00      0.00       150

    accuracy                           0.62       400
   macro avg       0.31      0.50      0.38       400
weighted avg       0.39      0.62      0.48       400


 Random Forest
Accuracy 0.625
F1-score 0.0
F1-mac 0.38461538461538464
F1-weighted 0.4807692307692308


In [69]:
prediction_AB=clfAB.predict(scaled_test_data)
print(classification_report(test_labels, prediction_AB))
print('\n Adaboost')
print('Accuracy',accuracy_score(test_labels, prediction_AB))
print('F1-score',f1_score(test_labels, prediction_AB))
print('F1-mac',f1_score(test_labels, prediction_AB, average='macro'))
print('F1-weighted',f1_score(test_labels, prediction_AB, average='weighted'))

              precision    recall  f1-score   support

         0.0       0.74      0.96      0.84       250
         1.0       0.86      0.45      0.59       150

    accuracy                           0.77       400
   macro avg       0.80      0.70      0.72       400
weighted avg       0.79      0.77      0.75       400


 Adaboost
Accuracy 0.7675
F1-score 0.593886462882096
F1-mac 0.7155071543832546
F1-weighted 0.7459123272585443


Logistic Regression obtained the best results on the test set (accuracy 0.895, macro F1 0.89). We therefore compute the following metrics by hand for Logistic Regression only:

1. Precision
2. Recall
3. F1_Real
4. F1_Fake
5. Accuracy
6. F1-macro

## Evaluation Matrix

1. $\text{Precision} = \dfrac{TP}{TP + FP}$
2. $\text{Recall} = \dfrac{TP}{TP + FN}$
3. $F_1 = \dfrac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

In [70]:
cm = confusion_matrix(test_labels, prediction_LG)
cm

array([[223,  27],
       [ 15, 135]], dtype=int64)

In [71]:
tn, fp, fn, tp = cm.ravel()
tn, fp, fn, tp

(223, 27, 15, 135)
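
For a binary problem, `confusion_matrix` puts the true labels on the rows and the predicted labels on the columns, in sorted label order, so `cm.ravel()` yields `tn, fp, fn, tp` with label 1 (fake) as the positive class. The sketch below counts the same four cells by hand on made-up labels. The helper name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    # Label 0 = real (negative class), label 1 = fake (positive class)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return tn, fp, fn, tp

y_true = [0, 0, 0, 1, 1]  # made-up labels
y_pred = [0, 1, 0, 1, 0]  # made-up predictions
counts = confusion_counts(y_true, y_pred)
print(counts)  # (2, 1, 1, 1)
```

Read against the matrix above, 223 real articles were classified as real, 27 real articles were flagged as fake, 15 fake articles were missed, and 135 fake articles were detected.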

In [72]:
prec_fake = tp/(tp + fp)
print('Precision for Fake class :', prec_fake)


rec_fake = tp/(tp + fn)
print('Recall for Fake class :', rec_fake)


f1_fake = 2 * prec_fake * rec_fake / ( prec_fake + rec_fake)
print('F1_Fake :', f1_fake)



prec_real = tn/(tn + fn)
print('\nPrecision for Real class :', prec_real)



rec_real = tn/(tn + fp)
print('Recall for Real class :', rec_real)


f1_real = 2 * prec_real * rec_real / ( prec_real + rec_real)
print('F1_Real :', f1_real)


f1_mac = (f1_real + f1_fake )/2
print('\nF1_Macro :', f1_mac)



#Weighted F1: compute F1 for each label and average the scores weighted by support
#(the number of true instances of each label).
#This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.

n_real = tn + fp   # support of the Real class (250)
n_fake = tp + fn   # support of the Fake class (150)
f1_weighted = (n_real / (n_real + n_fake)) * f1_real + (n_fake / (n_real + n_fake)) * f1_fake
print('\nF1_Average :', f1_weighted)

Precision for Fake class : 0.8333333333333334
Recall for Fake class : 0.9
F1_Fake : 0.8653846153846153

Precision for Real class : 0.9369747899159664
Recall for Real class : 0.892
F1_Real : 0.9139344262295082

F1_Macro : 0.8896595208070617

F1_Average : 0.8957282471626733
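
As a check on the two averages printed above, the arithmetic can be written out with the per-class scores rounded to four decimals and the class supports 250 (real) and 150 (fake):

$$
F1_{\text{macro}} = \frac{F1_{\text{real}} + F1_{\text{fake}}}{2} = \frac{0.9139 + 0.8654}{2} \approx 0.8897
$$

$$
F1_{\text{weighted}} = \frac{250}{400} \cdot 0.9139 + \frac{150}{400} \cdot 0.8654 \approx 0.8957
$$

The weighted average is slightly higher than the macro average because the larger class (real) also has the higher F1 score.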
