# **Fake And Real News Classification**

---

Authors: [Femi Kamau](https://www.github.com/ctrl-Karugu), [Monicah Iwagit](), [Teofilo Gafna](), [Wendy Mwiti](https://www.github.com/WendyMwiti)

## 1. Business Understanding

### 1.1 Determine Business Objectives
#### 1.1.1 Background
In recent decades, the prevalence of fake news in the media has gradually increased. The term "fake news" refers to news, information, or stories that are wholly made up or inaccurate to some degree, and that are created to influence people's views, push a political agenda, or cause confusion.

According to a survey carried out by the public relations firm Edelman, trust in the media has consistently dropped by 8% every year. There is a general lack of confidence in the information disseminated by the media, with as many as 41% of people saying that they actively avoid the news.

In the past, we relied on reliable journalists, media organizations, and sources who were bound by stringent ethical standards. However, the internet has made it possible to publish, exchange, and consume news and information in a completely new way with hardly any restrictions or editorial guidelines.

Nowadays, many people get their news from social media networks and websites, and it is often difficult to determine whether a story is legitimate. The surge in false news and hoax stories has also been attributed to information overload and a general lack of understanding of how the internet functions.
Hence the need for our project: building an NLP-based classifier that categorizes articles as real or fake.

#### 1.1.2 Problem Statement
Fake news affects human life in various ways, some of them surprising: it can influence how we perceive risk, the content of our dreams, and even our likelihood of developing health complications. In a bid to reduce some of these effects, this project seeks a solution that can detect and filter out articles containing fake news. Such a solution would be useful to both readers and companies interested in the issue, who are the stakeholders of the project.

#### 1.1.3 Research Questions
* Which periods of the year have the most fake news?
* Which person appears most in fake news?
* What topics are most prevalent in fake news?
* What are the most common keywords in fake news?

#### 1.1.4 Business Objectives
* To establish which months have the most fake news.
* To ascertain which subject dominated the fake news.
* To predict whether a news article is fake or real.

#### 1.1.5 Business Success Criteria
Build an NLP classification model that predicts the validity of news articles with an accuracy of at least 90% and a precision of at least 90%.
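A minimal sketch of how these success criteria could be checked once a model has been scored. The function name `meets_success_criteria` and the example scores are hypothetical and are not taken from the notebook.

```python
def meets_success_criteria(accuracy, precision, threshold=0.90):
    """Return True when both accuracy and precision reach the threshold."""
    return accuracy >= threshold and precision >= threshold


# Hypothetical scores, for illustration only
print(meets_success_criteria(accuracy=0.93, precision=0.91))
print(meets_success_criteria(accuracy=0.93, precision=0.88))
```

Requiring both metrics to pass means a model cannot meet the criteria by being accurate overall while still labelling too many fake articles as real.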


In [24]:
!python -m spacy download en_core_web_sm
    

Collecting en-core-web-sm==3.4.1




  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.4.1
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [23]:
!pip install click --upgrade

Collecting click
  Downloading click-8.1.3-py3-none-any.whl (96 kB)
Installing collected packages: click
  Attempting uninstall: click
    Found existing installation: click 7.1.2
    Uninstalling click-7.1.2:
      Successfully uninstalled click-7.1.2
Successfully installed click-8.1.3


## 2. Data Understanding

Load Libraries

In [26]:
# Data Manipulation
import pandas as pd
import numpy as np

# Data Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# Text Processing
import re
import string
import pickle
import nltk
from nltk.corpus import stopwords
from nltk.collocations import *
from nltk.stem import WordNetLemmatizer
import spacy
import unidecode
import contractions
from word2number import w2n
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Machine Learning
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier

# Evaluation Metrics
from sklearn.metrics import (classification_report, confusion_matrix, accuracy_score,
                             auc, roc_auc_score, roc_curve, RocCurveDisplay,
                             precision_score, recall_score, f1_score,
                             mean_absolute_error, mean_squared_error)
from sklearn.metrics import precision_recall_fscore_support as score


nlp = spacy.load('en_core_web_sm')

# Remove 'no' and 'not' from SpaCy's stop words list
deselect_stop_words = ['no', 'not']
for w in deselect_stop_words:
    nlp.vocab[w].is_stop = False
    
pd.set_option('display.max_colwidth', None)

The data consists of 2 CSV files sourced from Kaggle, containing 23,481 fake news articles and 21,417 real news articles posted between 2015 and 2018 on news sites such as 21st Century Wire.


Load Data

In [27]:
# Load the real news data
real = pd.read_csv('./data/True.csv')

# Load the fake news data
fake = pd.read_csv('./data/Fake.csv')


In [None]:
# Preview real data
real.head(2)

In [None]:
# Preview the fake data
fake.head(2)

In [None]:
# Add a column called 'category' to both DataFrames; it will become our target variable
real['category'] = 1
fake['category'] = 0

In [None]:
# Combine both DataFrames 
data = pd.concat([real, fake])

In [None]:
# Preview the new DataFrame head
data.head(2)

In [None]:
# Preview the new DataFrame tail
data.tail(2)

### Description of columns in the file:

`title`- contains news headlines

`text`- contains news content/article

`subject`- the type of news

`date`- the date the news was published


In [8]:
class dataUnderstanding(object):
    """A class that does basic Data Understanding"""
    
    def __init__(self, df):
        self.shape = df.shape
        self.info = df.info
        self.duplicates = df.duplicated().sum()
        self.missing = df.isna().sum()
        self.types = df.dtypes

In [9]:
# Instantiate the class
understanding = dataUnderstanding(data)


In [None]:
# Summary of the dataset
print(f"Shape:{understanding.shape}")
print()
understanding.info()  # info() prints the summary itself and returns None

From the summary above, we can see that the dataset contains 44898 rows and spans 5 columns. The columns are: title, text, subject, date and category. The category column is the target variable and the rest are the features.

Furthermore, the dataset contains 4 object columns and 1 integer column. The object columns are: title, text, subject and date. The integer column is: category (target variable). We may need to convert the date column to datetime format in the data preparation phase.

The dataset does not contain any missing values.

In [None]:
# Check for duplicates
print(f"Duplicates: {understanding.duplicates}")

The dataset contains 209 duplicates. They shall be inspected and removed if necessary in the data preparation phase.

In [7]:
# Check the number of missing values
understanding.missing


In [None]:
# Inspect the value counts in the subject column
data['subject'].value_counts()

The articles that are within this dataset fall under 8 different subjects. These are: politicsNews, worldnews, News, politics, left-news, Government News, US_News, and Middle-east.

In [None]:
# Inspect the value counts in the category column
data['category'].value_counts()

The dataset is fairly balanced between fake and real news 
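To put "fairly balanced" in numbers, the sketch below computes each class's share from the article counts quoted in the data description (23,481 fake and 21,417 real, before duplicates are dropped). The larger share is also the accuracy that a trivial classifier would reach by always predicting the majority class, which gives a baseline to compare the 90% target against.

```python
# Article counts quoted in the data description (before duplicates are dropped)
n_fake = 23481
n_real = 21417
total = n_fake + n_real

fake_share = n_fake / total
real_share = n_real / total

# A classifier that always predicts the majority class scores this accuracy
baseline_accuracy = max(fake_share, real_share)

print(f"Total articles: {total}")
print(f"Fake share: {fake_share:.1%}")
print(f"Real share: {real_share:.1%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.1%}")
```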

In [None]:
# Inspect the date column
data['date'].nunique()

The date column contains 2397 unique dates

## 3. Data Preparation


### 3.1 Validity

> To ensure validity within the dataset, we will be checking that the data is in the correct format.

In [None]:
# Converting the subject column to category
data['subject'] = data['subject'].astype('category')

In [None]:
# Converting date from object to datetime
data['date'] = pd.to_datetime(data['date'], dayfirst=True, errors='coerce')

# Type of the date column
data['date'].dtype

# Preview the updated DataFrame 
data.head(1)

In [None]:
# Extract the months and the years from the date column
data['year'] = data['date'].dt.year
data['month'] = data['date'].dt.strftime('%B')

In [None]:
# Inspect the updated datatypes
data.dtypes

### 3.2 Consistency

> In this section, we will be observing the consistency of the data. We will be checking for duplicates.

In [None]:
# Create a dataframe for the duplicated to be inspected
duplicates  = data[data.duplicated()]
duplicates.head(3)

In [None]:
# Drop the duplicates
print(f"Before dropping: {len(data)}")
data.drop_duplicates(inplace=True)
print(f"After dropping: {len(data)}")

URLs are removed from the text column during data preprocessing (section 4.1).

### 3.3 Completeness

From the data understanding section, we found there to be no missing values. Therefore, we can confirm that the dataset is complete.

### 3.4 Uniformity

In this section, we check whether the same value is always written in the same format across the dataset.

In [None]:
# Check the value counts of the subject column
data['subject'].value_counts()

In [None]:
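# Rename the categories in the subject column
# Cast to string first: 'politicsNews' is merged into the existing 'politics' label,
# and the labels of a categorical column must stay unique
subject_names = {'politicsNews': 'politics',
                 'worldnews': 'world_news',
                 'News': 'news',
                 'left-news': 'left_news',
                 'Government News': 'government_news',
                 'US_News': 'us_news',
                 'Middle-east': 'middle_east'}
data['subject'] = data['subject'].astype(str).replace(subject_names).astype('category')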


In [None]:
# Preview the updated subject column value counts
data['subject'].value_counts()

### 3.5 Exploratory Data Analysis

#### 3.5.1 Univariate Analysis

In [None]:
# Create a function that visualizes the value counts of a column
def plot_bar(df, col) -> None:
    """Plot the value counts of a column as a bar chart, most frequent value first"""
    plt.figure(figsize=(12,8))
    sns.countplot(data=df, x = col, order=df[col].value_counts().index)
    plt.title(f"{col} count plot", fontsize=25)
    plt.ylabel("Count", fontsize=15)
    plt.xticks(rotation=45)
    plt.xlabel(f"{col}", fontsize=15)
    plt.show()
    

##### 3.5.1.1 `text`

In [None]:
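def plot_word_cloud(df, target, feature, i:int):
    """This function creates a wordcloud for the news texts of one category"""
    # Keep only the rows of the requested category (0 = fake, 1 = real)
    subset = df[df[target] == i]
    # Join the articles into one string so that every article contributes
    text = ' '.join(subset[feature].astype(str))
    wordcloud = WordCloud(
        max_words = 400,
        width = 800,
        height = 600,
        background_color = 'black',
        stopwords = STOPWORDS).generate(text)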
    fig = plt.figure(
        figsize = (14, 8),
        facecolor = 'k',
        edgecolor = 'k')
    plt.imshow(wordcloud, interpolation = 'bilinear')
    plt.axis('off')
    plt.tight_layout(pad=0)
    plt.show()

##### Fake News Visualization Word Cloud

In [None]:
plot_word_cloud(data, 'category', 'text', 0)

##### Real News Visualization Word Cloud

In [None]:
plot_word_cloud(data, 'category', 'text', 1)

##### 3.5.1.2 `subject`

In [None]:
# Plotting a countplot of the subject
plot_bar(data, 'subject')

Observations:
* Most of the published articles cover politics, followed by world news, while the least covered subject is the Middle East.

##### 3.5.1.3 `month`

In [None]:
# Plotting a countplot of the Months
plot_bar(data, 'month')

Observations:
* Most of the news is published during the month of November followed by October and September.
* The three busiest months are consecutive, so the volume of news rises from September through to November.
* The month of August registered the least number of published news

##### 3.5.1.4 `year`

In [None]:
# Plotting the years
plot_bar(data, 'year')

Observations:
* 2017 had the most news followed by 2016

##### 3.5.1.5 `category`

In [None]:
# Plotting a pie chart for the column 'category'
fig, ax = plt.subplots(figsize=(12,8))
data['category'].value_counts().plot(kind='pie', autopct='%.0f%%');

Observations:
* The data set is fairly balanced. However, fake news is slightly more than real news.

#### 3.5.2 Bivariate Analysis

In [None]:
# Creating a function that plots a count plot with respect to another column
def plot_bivariate(df, col, by):
    plt.figure(figsize=(12,8))
    sns.countplot(data=df, x=col, hue=by)
    plt.title(f"{col} count plot by {by}", fontsize=25)
    plt.ylabel("count", fontsize=15)
    plt.xticks(rotation=45)
    plt.xlabel(f"{col}", fontsize=15)
    plt.show()

##### 3.5.2.1 `month` & `category`

In [None]:
# Plotting month by category
plot_bivariate(data, 'month', 'category')

Observations:
  * From the visualization above, we see a spike in real news towards the end of the year. However, fake news remains constant throughout the year. 

##### 3.5.2.2 `year` & `category`

In [None]:
# plot year by category
plot_bivariate(data, 'year', 'category')

Observations:
* The visualization above tells us that the majority of the news captured in this dataset is from the years 2016 and 2017. Furthermore, 2017 has more articles than any other year. This could be attributed to the fact that 2017 was the beginning of a new presidential term in the United States, which tends to bring with it a fair share of news coverage.

##### 3.5.2.3 `subject` & `category`

In [None]:
plot_bivariate(data, 'subject', 'category')

Observations
* We see that the real news within this dataset falls under the politics and world_news subjects. Therefore, a prediction system would probably work best for these subjects.

## 4. Modeling

### 4.1 Data Preprocessing

In [4]:
# A function to remove web links
def remove_web_tags(text):
    """Remove web links from a string"""
    # Remove http and https links
    clean = re.compile(r'https?://\S+')
    text = re.sub(clean, '', text)

    # Remove remaining '.com' links, including ones that end at '.com'
    clean = re.compile(r'\S+\.com\S*')
    return re.sub(clean, '', text)


# Function to remove twitter handles
def remove_twitter_handles(text):
    """This function removes twitter handles from a string"""
    clean = re.compile(r'@\S*')
    return re.sub(clean, '', text)


# Function to convert Non-ASCII characters to ASCII
def remove_accented_chars(text):
    """This function converts accented characters to their closest ASCII form, e.g. café to cafe"""
    text = unidecode.unidecode(text)
    return text


# Function to expand contractions
def expand_contractions(text):
    """Expand shortened words, e.g. don't to do not"""
    text = contractions.fix(text)
    return text


# Function to remove special characters
def remove_special_characters(text):
    """This function removes special characters from text, e.g. $"""
    clean = re.compile(r'[^a-zA-Z0-9\s]')
    return re.sub(clean, ' ', text)


# Function to lowercase text
def lowercase_text(text):
    """This function converts characters to lowercase"""
    return text.lower()


# Function to convert number words to digits
def convert_number_words(text):
    """Convert number words to digits, e.g. 'three' to '3'"""
    
    pattern = r'(\W+)'
    tokens = re.split(pattern, text)

    for i, token in enumerate(tokens):
        try:
            tokens[i] = str(w2n.word_to_num(token))
        except Exception:
            # Leave tokens that are not number words unchanged
            pass
    
    return ''.join(tokens)


# Function to remove numbers
def remove_numbers(text):
    """This function removes numbers from text"""
    clean = re.compile(r'\d+')
    return re.sub(clean, '', text)


# Function to remove short words
def remove_small_words(text):
    """This function removes words with length 1 or 2"""
    clean = re.compile(r'\b\w{1,2}\b')
    return re.sub(clean, '', text)


# Function to remove names of people
def remove_names(text):
    """This is a function that removes the names from text"""
    with open('./data_preprocessing/names.txt', 'r') as f:
        # A set gives constant-time membership checks
        NAMES = {name.lower() for name in f.read().splitlines()}

    pattern = r'\W+'
    tokens = re.split(pattern, text)

    # Build a new list instead of removing items from the list being iterated
    words = [token for token in tokens if token not in NAMES]

    text = ' '.join(words)

    return text


# Function to remove countries
def remove_countries(text):
    """This is a function that removes the countries from text"""
    with open('./data_preprocessing/countries.txt', 'r') as f:
        # A set gives constant-time membership checks
        COUNTRIES = {name.lower() for name in f.read().splitlines()}

    # The capturing group keeps the separators as tokens
    pattern = r'(\W+)'
    tokens = re.split(pattern, text)

    # Build a new list instead of removing items from the list being iterated
    words = [token for token in tokens if token not in COUNTRIES]

    text = ' '.join(words)

    return text


# Function to remove US cities
def remove_cities(text):
    """This is a function that removes the US cities from text"""
    with open('./data_preprocessing/cities.txt', 'r') as f:
        # A set gives constant-time membership checks
        CITIES = {name.lower() for name in f.read().splitlines()}

    # The capturing group keeps the separators as tokens
    pattern = r'(\W+)'
    tokens = re.split(pattern, text)

    # Build a new list instead of removing items from the list being iterated
    words = [token for token in tokens if token not in CITIES]

    text = ' '.join(words)

    return text


# Function to remove days and months
def remove_days_and_months(text):
    """This is a function that removes the months and days of the week from text"""
    
    # Load the months
    with open('./data_preprocessing/months.txt', 'r') as f:
        MONTHS = {name.lower() for name in f.read().splitlines()}
    
    # Load the days of the week
    with open('./data_preprocessing/week.txt', 'r') as f:
        WEEK = {name.lower() for name in f.read().splitlines()}
      
    # The capturing group keeps the separators as tokens
    pattern = r'(\W+)'
    tokens = re.split(pattern, text)  
    
    # Build a new list instead of removing items from the list being iterated
    words = [token for token in tokens
             if token not in MONTHS and token not in WEEK]
            
    text = ' '.join(words)
            
    return text


def stopwords(text):
    """This function removes the stopwords in the text"""
    # Use a separate name so that the word list does not shadow this function
    stop_words = set(nltk.corpus.stopwords.words('english'))
    
    tokens = re.split(r'(\W+)', text)
    
    text = [token for token in tokens if token not in stop_words]

    return ' '.join(text)


# Function to remove extra spaces
def remove_whitespace(text):
    """Remove extra spaces from a string"""
    
    clean = re.compile(r'\s{2,}')
    text = re.sub(clean, ' ', text)
    
    return text


# Lemmatize text
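def lemmatize(text):
    """This function lemmatizes every word in the text"""
    lemma = WordNetLemmatizer()
    
    # Raw string, so that '\W' is not treated as an invalid escape sequence
    tokens = re.split(r'\W+', text)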
    
    text = [lemma.lemmatize(token) for token in tokens]
    
    return ' '.join(text)
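The word-list helpers above reopen their text file every time they are called, which happens once per article when they are used with `apply`. A minimal sketch of one way to avoid this is to cache the loaded list. The helper names `load_word_set` and `remove_listed_words` are hypothetical, and the temporary file stands in for files such as `months.txt`, whose contents are not shown in this notebook.

```python
import os
import re
import tempfile
from functools import lru_cache


@lru_cache(maxsize=None)
def load_word_set(path):
    """Read a one-word-per-line file once and return a lowercase frozenset."""
    with open(path, 'r') as f:
        return frozenset(line.strip().lower() for line in f if line.strip())


def remove_listed_words(text, words):
    """Drop every token of `text` that appears in `words`."""
    tokens = re.split(r'\W+', text)
    return ' '.join(token for token in tokens if token and token not in words)


# Demonstration with a temporary word list (hypothetical contents)
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("January\nMonday\n")
    word_list_path = tmp.name

words = load_word_set(word_list_path)
cleaned = remove_listed_words("the senate met on monday in january", words)
print(cleaned)

# The file is no longer needed: later calls are served from the cache
os.remove(word_list_path)
```

Because the cached result is a `frozenset`, membership checks stay fast however long the word list is, and the file is read only on the first call.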
  

In [5]:
# Remove the web tags in the text
data['text'] = data['text'].apply(remove_web_tags)

data.head(1)

NameError: name 'data' is not defined

In [None]:
# Remove the twitter handles from the text
data['text'] = data['text'].apply(remove_twitter_handles)

data.head(1)

In [None]:
# Remove the accented characters
data['text'] = data['text'].apply(remove_accented_chars)

data.head(1)

In [None]:
# Expand the contractions
data['text'] = data['text'].apply(expand_contractions)

data.head(1)

In [None]:
# Remove the special characters
data['text'] = data['text'].apply(remove_special_characters)

data.head(1)

In [None]:
# Lowercase the text
data['text'] = data['text'].apply(lambda x: x.lower())

data.head(1)

In [None]:
# Remove the numbers
data['text'] = data['text'].apply(remove_numbers)

data.head(1)

In [None]:
# Remove short words (less than 3 characters)
data['text'] = data['text'].apply(remove_small_words)

data.head(1)

In [None]:
# Remove names of countries
data['text'] = data['text'].apply(remove_countries)

data.head(1)

In [None]:
# Remove the days and months
data['text'] = data['text'].apply(remove_days_and_months)

data.head(1)

In [None]:
# Remove the stopwords
data['text'] = data['text'].apply(stopwords)

data.head(1)

In [None]:
# Remove the whitespace
data['text'] = data['text'].apply(remove_whitespace)

data.head(1)

In [None]:
# Lemmatize the data
data['text'] = data['text'].apply(lemmatize)

data.head(1)
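The cleaning steps above are applied one cell at a time. As an illustration, several of them could also be composed into a single function so that the order of the steps is stated in one place. This is a simplified sketch, not the notebook's pipeline: it re-implements only the regex-based steps, and the names `STEPS` and `clean_text` and the sample sentence are hypothetical.

```python
import re
from functools import reduce


def strip_links(text):
    return re.sub(r'https?://\S+', '', text)


def strip_handles(text):
    return re.sub(r'@\S*', '', text)


def strip_special(text):
    return re.sub(r'[^a-zA-Z0-9\s]', ' ', text)


def strip_numbers(text):
    return re.sub(r'\d+', '', text)


def strip_short_words(text):
    return re.sub(r'\b\w{1,2}\b', '', text)


def squeeze_spaces(text):
    return re.sub(r'\s+', ' ', text).strip()


# The order matters: links and handles go before punctuation is stripped
STEPS = [strip_links, strip_handles, strip_special, str.lower,
         strip_numbers, strip_short_words, squeeze_spaces]


def clean_text(text, steps=STEPS):
    """Apply each cleaning step in turn to the text."""
    return reduce(lambda acc, step: step(acc), steps, text)


sample = "BREAKING: @user says 3 senators met at https://t.co/abc on Monday!"
cleaned = clean_text(sample)
print(cleaned)
```

With a function like this, the whole column could be cleaned with a single `apply` call, at the cost of no longer being able to preview the data after each step.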

Fake News Word Cloud (After Preprocessing)

In [None]:
# WordCloud after preprocessing
plot_word_cloud(data, 'category', 'text', 0)

The word cloud above shows that trump, washington, military, said and government are among the most frequent words in fake news.

Real News Word Cloud (After Preprocessing)

In [None]:
# WordCloud after preprocessing
plot_word_cloud(data, 'category', 'text', 1)

On the other hand, washington, military, trump, said, new, one were the common words in real news

### 4.2 Train Models

#### 4.2.1 Splitting the dataset

In [None]:
X = data['text']
y = data['category']

# Split the data into train and test

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

#### 4.2.2 Vectorization

In [None]:
# Declaring a Vectoriser
tfidf_vect = TfidfVectorizer()

# 'Fitting' the Vectoriser
tfidf_vect_fit = tfidf_vect.fit(X_train)

# Creating 'Test' and 'Train' vectorised dataframes
tfidf_train = tfidf_vect_fit.transform(X_train)
tfidf_test = tfidf_vect_fit.transform(X_test)

# Checking, if we did everything alright
tfidf_train
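To make the weighting concrete, here is a simplified pure-Python sketch of TF-IDF using the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` and L2 normalisation, which are the documented defaults of `TfidfVectorizer`. It is an illustration, not the library's implementation: tokens are split on whitespace only, and the helper names `fit_idf` and `transform` and the sample sentences are hypothetical. As in the cell above, the weights are learned from the training documents only, so a word that appears only in the test document is ignored.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn smoothed idf weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        # Count each term once per document
        doc_freq.update(set(doc.split()))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}


def transform(doc, idf):
    """Turn one document into an L2-normalised tf-idf mapping."""
    counts = Counter(term for term in doc.split() if term in idf)
    weights = {term: tf * idf[term] for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}


# Hypothetical documents, for illustration only
train_docs = ["senate passes budget bill",
              "president signs budget bill",
              "shocking budget secret revealed"]
test_doc = "senate budget scandal"

idf = fit_idf(train_docs)
vector = transform(test_doc, idf)

for term, weight in sorted(vector.items()):
    print(f"{term}: {weight:.3f}")
```

A word that occurs in every training document ("budget") receives the minimum idf of 1, while a rarer word ("senate") is weighted more heavily.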

In [None]:
# Define function to get evaluation scores and plot a confusion matrix
def score_model(model, y_test_true, X_test):
    """ A function that returns scores of a model as well as a confusion matrix"""
    
    y_pred = model.predict(X_test)
    
    precision, recall, fscore, train_support = score(y_test_true, y_pred, pos_label=1, average='binary')
    print('Precision: {} / Recall: {} / F1-Score: {} / Accuracy: {}'.format(
    round(precision, 3), round(recall, 3), round(fscore,3), round(accuracy_score(y_test_true,y_pred), 3)))
    
    # Create a confusion matrix 
    cm = confusion_matrix(y_test_true, y_pred)

    # Make a Dataframe, of the metrics with classes
    class_label = [0, 1]
    df_cm = pd.DataFrame(cm, index=class_label,columns=class_label)

    # Plot the Model
    sns.heatmap(df_cm, annot=True, fmt='d')
    plt.title("Confusion Matrix")
    plt.xlabel("Predicted Label")
    plt.ylabel("True Label")
    plt.show()

#### 4.2.3 Simple Models

##### 4.2.3.1 Logistic Regression

In [None]:
# Instantiate the Logistic Regression Algorithm  
lr = LogisticRegression()

# Fit Algorithm
lr.fit(tfidf_train, y_train)

score_model(lr, y_test, tfidf_test)

The accuracy of the logistic regression model implies that 98.3% of the time, the model correctly classifies a news article as real or fake.

The precision and recall scores are 98.2% and 98.3% respectively. The precision implies that out of the articles predicted to be real, 98.2% were truly real, while the recall implies that out of the truly real articles, 98.3% were correctly identified.

112 samples were classified as fake yet they were real. On the other hand, 114 samples were classified as real yet they were fake.

13181 samples were classified correctly.
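The relationship between these counts and the scores can be sketched as follows. Real news is the positive class (`category` = 1), so the 114 fake articles predicted as real are false positives and the 112 real articles predicted as fake are false negatives. The split of the 13181 correct predictions into true positives and true negatives is not reported above, so the `tp` and `tn` values below are hypothetical; only the accuracy is independent of that split.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision and recall from confusion-matrix counts."""
    return {
        'accuracy': (tp + tn) / (tp + fp + fn + tn),
        # Of the articles predicted real, how many were truly real
        'precision': tp / (tp + fp),
        # Of the truly real articles, how many were predicted real
        'recall': tp / (tp + fn),
    }


fp = 114   # fake articles classified as real (reported above)
fn = 112   # real articles classified as fake (reported above)
tp = 6400  # hypothetical: the split of the correct predictions is not reported
tn = 6781  # hypothetical: chosen so that tp + tn = 13181

metrics = classification_metrics(tp, fp, fn, tn)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```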

##### 4.2.3.2 Random Forest Algorithm

In [None]:
# Instantiate the Random Forest Algorithm
rf = RandomForestClassifier(min_samples_leaf=20, min_samples_split=20, random_state=100)

# Fit Algorithm
rf = rf.fit(tfidf_train , y_train)

score_model(rf, y_test, tfidf_test)

The random forest model performs slightly worse than the logistic regression model, with an accuracy score of 98.2%.

This implies that 98.2% of the time, a news article is correctly classified as real or fake.

The precision and recall scores are 98% and 98.2% respectively. The precision implies that out of the articles predicted to be real, about 98% were truly real, while the recall implies that out of the truly real articles, 98.2% were correctly identified. 117 samples were classified as fake yet they were real.

On the other hand, 128 samples were classified as real yet they were fake. 13177 samples were classified correctly.

### 4.3 Advanced ML Models

#### 4.3.1 Ada Boosting Classifier

In [None]:
# Instantiating an AdaBoost Classifier
ada_boost = AdaBoostClassifier()

# Fitting the model on the training data
ada_boost.fit(tfidf_train, y_train)

# scoring the adaboost model
score_model(ada_boost, y_test, tfidf_test)

The Ada Boosting model performs better than the logistic regression and random forest models, with an accuracy score of 99.5%. This implies that a news article is correctly classified 99.5% of the time. The improvement may be attributed to its ability to capture more non-linear relationships in the data than a linear model such as logistic regression.

The precision and recall scores are 99.4% and 99.6% respectively. The precision implies that out of the articles predicted to be real, 99.4% were truly real, while the recall implies that out of the truly real articles, 99.6% were correctly identified.

23 samples were classified as fake yet they were real. On the other hand, 39 samples were classified as real yet they were fake.

13345 samples were classified correctly.

#### 4.3.2 Gradient Boosting Classifier

In [None]:
# Instantiating a Gradient Boosting Classifier
grad_boost = GradientBoostingClassifier()

# Fitting to train set
grad_boost.fit(tfidf_train, y_train)

# Scoring the gradient boosting classifier
score_model(grad_boost, y_test, tfidf_test)

The Gradient Boosting model performs slightly worse than the Ada Boosting model, with an accuracy score of 99.4% compared to 99.5%. This implies that a news article is correctly classified 99.4% of the time.

The precision and recall scores are 99.1% and 99.7% respectively. The precision implies that out of the articles predicted to be real, 99.1% were truly real, while the recall implies that out of the truly real articles, 99.7% were correctly identified.

17 samples were classified as fake yet they were real. On the other hand, 57 samples were classified as real yet they were fake.

13333 samples were classified correctly.

#### 4.3.3 XG Boosting Classifier

In [None]:
# Instantiating a class of XG Boost
xg_boost = XGBClassifier()

# Fitting on training data
xg_boost.fit(tfidf_train, y_train)

# scoring the XG Boost Classifier
score_model(xg_boost, y_test, tfidf_test)

The XG Boost model performs better than the other models, with an accuracy score of 99.7%. This implies that a news article is correctly classified 99.7% of the time.

The precision and recall scores are 99.8% and 99.6% respectively. The precision implies that out of the articles predicted to be real, 99.8% were truly real, while the recall implies that out of the truly real articles, 99.6% were correctly identified.

14 samples were classified as fake yet they were real. On the other hand, 28 samples were classified as real yet they were fake.

13365 samples were classified correctly.
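The accuracy scores reported in sections 4.2.3 and 4.3 can be ranked side by side. The sketch below uses only the figures quoted in the text; the dictionary and variable names are hypothetical.

```python
# Test accuracy of each model, as reported in the sections above
reported_accuracy = {
    'Logistic Regression': 0.983,
    'Random Forest': 0.982,
    'AdaBoost': 0.995,
    'Gradient Boosting': 0.994,
    'XGBoost': 0.997,
}

# Accuracy target from the business success criteria
target = 0.90

# Sort from the most to the least accurate model
ranking = sorted(reported_accuracy.items(), key=lambda item: item[1], reverse=True)

for name, accuracy in ranking:
    status = 'meets' if accuracy >= target else 'misses'
    print(f"{name}: {accuracy:.1%} ({status} the {target:.0%} target)")

best_model = ranking[0][0]
print(f"Best model by accuracy: {best_model}")
```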

### 4.4 ROC Curves

In [None]:
# Creating a class that plots ROC curves and reports the area under the curve (AUC)
class get_roc(object):
    """A collection of helpers that compute and plot ROC curves"""

    @staticmethod
    def train_rates(model, feature_train, target_train):
        """Plot the ROC curve and print the AUC for the training set"""
        
        # Calculate the decision scores of each point in the train set
        model_y_train_score = model.decision_function(feature_train)
        
        # Calculate the fpr, tpr, and thresholds for the training set
        model_train_fpr, model_train_tpr, model_thresholds = roc_curve(target_train, model_y_train_score)
        
        # Seaborn's beautiful styling
        sns.set_style('darkgrid', {'axes.facecolor': '0.9'})

        # ROC curve for training set
        plt.figure(figsize=(10, 8))
        lw = 2
        plt.plot(model_train_fpr, model_train_tpr, color='darkorange',
                lw=lw, label='ROC curve')
        plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
        plt.xlim([0.0, 1.0])
        plt.ylim([0.0, 1.05])
        plt.yticks([i/20.0 for i in range(21)])
        plt.xticks([i/20.0 for i in range(21)])
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title('Receiver operating characteristic (ROC) Curve for Training Set')
        plt.legend(loc='lower right')
        print('Training AUC: {}'.format(auc(model_train_fpr, model_train_tpr)))
        plt.show()
        
    @staticmethod
    def test_rates(model, feature_test, target_test):
        """Plot the ROC curve and print the AUC for the test set"""
        # Calculate the decision scores of each point in the test set,
        # using the model that was passed in rather than the global `lr`
        model_y_test_score = model.decision_function(feature_test)
        
        # Calculate the fpr, tpr, and thresholds for the test set
        model_test_fpr, model_test_tpr, model_test_thresholds = roc_curve(target_test, model_y_test_score)
        
        # Seaborn's beautiful styling
        sns.set_style('darkgrid', {'axes.facecolor': '0.9'})

        # ROC curve for test set
        plt.figure(figsize=(10, 8))
        lw = 2
        plt.plot(model_test_fpr, model_test_tpr, color='darkorange',
                lw=lw, label='ROC curve')
        plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
        plt.xlim([0.0, 1.0])
        plt.ylim([0.0, 1.05])
        plt.yticks([i/20.0 for i in range(21)])
        plt.xticks([i/20.0 for i in range(21)])
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title('Receiver operating characteristic (ROC) Curve for Test Set')
        plt.legend(loc='lower right')
        print('Testing AUC: {}'.format(auc(model_test_fpr, model_test_tpr)))
        plt.show()
        
    @staticmethod
    def combined_rates(model, feature_train, feature_test, target_train, target_test):
        """ A function that gets the roc curves of both train and test in one plot"""
        # Calculate the decision scores of each point in the train set
        model_y_train_score = model.decision_function(feature_train)
        
        # Calculate the fpr, tpr, and thresholds for the train set
        model_train_fpr, model_train_tpr, model_thresholds = roc_curve(target_train, model_y_train_score)
        
        # Calculate the decision scores of each point in the test set,
        # using the model that was passed in rather than the global `lr`
        model_y_test_score = model.decision_function(feature_test)
        
        # Calculate the fpr, tpr, and thresholds for the test set
        model_test_fpr, model_test_tpr, model_test_thresholds = roc_curve(target_test, model_y_test_score)
        
        print('Model Test AUC: {}'.format(auc(model_test_fpr, model_test_tpr)))
        print('Model Train AUC: {}'.format(auc(model_train_fpr, model_train_tpr)))
        
        plt.figure(figsize=(10,8))
        lw = 2
        
        plt.plot(model_test_fpr, model_test_tpr, color='darkorange',
         lw=lw, label='Model Test ROC curve')
        plt.plot(model_train_fpr, model_train_tpr, color='blue',
         lw=lw, label='Model Train ROC curve')
        
        # Formatting
        plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
        plt.xlim([0.0, 1.0])
        plt.ylim([0.0, 1.05])
        plt.yticks([i/20.0 for i in range(21)])
        plt.xticks([i/20.0 for i in range(21)])
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title('Receiver operating characteristic (ROC) Curve')
        plt.legend(loc="lower right")
        plt.show()
        
    def get_auc(model, feature_train, feature_test, target_train, target_test):
        """ A function that prints the AUC of both the train and test sets"""
        # Calculate the decision scores of each point in the train set
        model_y_train_score = model.decision_function(feature_train)
        
        # Calculate the fpr, tpr, and thresholds for the train set
        model_train_fpr, model_train_tpr, model_thresholds = roc_curve(target_train, model_y_train_score)
        
        # Calculate the decision scores of each point in the test set
        # (use the model passed in, not the global lr)
        model_y_test_score = model.decision_function(feature_test)
        
        # Calculate the fpr, tpr, and thresholds for the test set
        model_test_fpr, model_test_tpr, model_test_thresholds = roc_curve(target_test, model_y_test_score)
        
        print('Model Test AUC: {}'.format(auc(model_test_fpr, model_test_tpr)))
        print('Model Train AUC: {}'.format(auc(model_train_fpr, model_train_tpr)))

#### 4.4.1 Logistic Regression ROC

In [None]:
lr_combined = get_roc.combined_rates(lr, tfidf_train, tfidf_test, y_train, y_test)

#### 4.4.2 Random Forest ROC

In [None]:
print(f"The area under the curve of the random forest is {roc_auc_score(y_test, rf.predict_proba(tfidf_test)[:,1])}")
# Plotting ROC curve for Random Forest
fig, ax = plt.subplots(figsize=(12,8))
ax = plt.gca()
rf_train = RocCurveDisplay.from_estimator(rf, tfidf_train, y_train, ax=ax, alpha=0.8, label='Random Forest Train ROC Curve')
rf_disp = RocCurveDisplay.from_estimator(rf, tfidf_test, y_test, ax=ax, alpha=0.8, label='Random Forest Test ROC Curve')
plt.show()

#### 4.4.3 Ada Boosting Classifier ROC

In [None]:
print(f"The area under the curve of the Ada Boosting classifier is {roc_auc_score(y_test, ada_boost.predict_proba(tfidf_test)[:,1])}")

In [None]:
# Plotting ROC curve for Ada Boosting Classifier
fig, ax = plt.subplots(figsize=(12,8))
ax = plt.gca()
ada_disp = RocCurveDisplay.from_estimator(ada_boost, tfidf_test, y_test, ax=ax, alpha=0.8)
plt.show()

#### 4.4.4 Gradient Boosting Classifier ROC

In [None]:
print(f"The area under the curve of the gradient boosting classifier is {roc_auc_score(y_test, grad_boost.predict_proba(tfidf_test)[:,1])}")

In [None]:
# Plotting ROC curve for Gradient Boosting Classifier
fig, ax = plt.subplots(figsize=(12,8))
ax = plt.gca()
grad_disp = RocCurveDisplay.from_estimator(grad_boost, tfidf_test, y_test, ax=ax, alpha=0.8)
plt.show()

#### 4.4.5 XG Boosting Classifier ROC

In [None]:
print(f"The area under the curve of the XG Boost is {roc_auc_score(y_test, xg_boost.predict_proba(tfidf_test)[:,1])}")

In [None]:
# Plotting ROC curve for XG Boosting Classifier
fig, ax = plt.subplots(figsize=(12,8))
ax = plt.gca()
xg_disp = RocCurveDisplay.from_estimator(xg_boost, tfidf_test, y_test, ax=ax, alpha=0.8)
plt.show()
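
The ROC cells above score logistic regression with `decision_function` and the tree-based models with `predict_proba`. A small helper can hide that difference so every model is scored the same way. The sketch below is only illustrative: `positive_class_scores` is a hypothetical helper, and the data is a synthetic stand-in for `tfidf_train` / `tfidf_test`, not the project data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def positive_class_scores(model, features):
    """Return a continuous score for the positive class (hypothetical helper).

    Uses decision_function when the estimator has one, otherwise the
    second column of predict_proba.
    """
    if hasattr(model, "decision_function"):
        return model.decision_function(features)
    return model.predict_proba(features)[:, 1]


# Synthetic stand-in for the TF-IDF features (assumption, not the project data)
X_roc, y_roc = make_classification(n_samples=400, n_features=20, random_state=0)
X_roc_tr, X_roc_te, y_roc_tr, y_roc_te = train_test_split(
    X_roc, y_roc, test_size=0.25, random_state=0, stratify=y_roc
)

roc_demo_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000).fit(X_roc_tr, y_roc_tr),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0).fit(X_roc_tr, y_roc_tr),
}

# One test AUC per model, computed the same way for all of them
roc_demo_aucs = {
    name: roc_auc_score(y_roc_te, positive_class_scores(model, X_roc_te))
    for name, model in roc_demo_models.items()
}
print(roc_demo_aucs)
```

Because the helper takes the model as an argument, it cannot silently fall back to a global estimator such as `lr`.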

### 4.5 Summary Table

In [None]:
# A class that gets scores
class get_scores(object):
    """ a class that gets the scores from a model"""
    def acc(model, x, y):
        """ A function that gets the accuracy score"""
        y_pred = model.predict(x)
        score = accuracy_score(y, y_pred)
        return score
    
    def precision (model, x, y):
        """ A function that gets the precision scores"""
        return precision_score(y, model.predict(x))
    
    def recall(model, x, y):
        """ A function that gets the recall scores"""
        return recall_score(y, model.predict(x))
    
    def f1 (model, x, y):
        """ A function that that gets the f1 scores"""
        return f1_score(y, model.predict(x))
    

In [None]:
# Aliases of the get_scores class (not instances): its methods take no self,
# so they are called directly on the class
lr_scores  = get_scores
rf_scores = get_scores
gb_score = get_scores
ada_score = get_scores
xg_score = get_scores


####  4.5.1 Train Summary Table

In [None]:
# train summary table
train_summary_table = pd.DataFrame({'Model': [],
                              'Accuracy': [], 
                              'Precision': [], 'Recall': [], 'F1 Score': [],
                              })

In [None]:
# train summary
train_summary_table.loc[0] = ["Logistic Regression",
                         lr_scores.acc(lr, tfidf_train, y_train),
                         lr_scores.precision(lr, tfidf_train, y_train),
                         lr_scores.recall(lr, tfidf_train, y_train),
                         lr_scores.f1(lr, tfidf_train, y_train)]

train_summary_table.loc[1] = ["Random Forest",
                        rf_scores.acc(rf, tfidf_train, y_train),
                        rf_scores.precision(rf, tfidf_train, y_train),
                        rf_scores.recall(rf, tfidf_train, y_train),
                        rf_scores.f1(rf, tfidf_train, y_train)]

train_summary_table.loc[2] = ["Gradient Boost",
                        gb_score.acc(grad_boost, tfidf_train, y_train),
                        gb_score.precision(grad_boost, tfidf_train, y_train),
                        gb_score.recall(grad_boost, tfidf_train, y_train),
                        gb_score.f1(grad_boost, tfidf_train, y_train)]

train_summary_table.loc[3] = ["Ada Boosting Classifier",
                        ada_score.acc(ada_boost, tfidf_train, y_train),
                        ada_score.precision(ada_boost, tfidf_train, y_train),
                        ada_score.recall(ada_boost, tfidf_train, y_train),
                        ada_score.f1(ada_boost, tfidf_train, y_train)]

train_summary_table.loc[4] = ["XG Boosting Classifier",
                        xg_score.acc(xg_boost, tfidf_train, y_train),
                        xg_score.precision(xg_boost, tfidf_train, y_train),
                        xg_score.recall(xg_boost, tfidf_train, y_train),
                        xg_score.f1(xg_boost, tfidf_train, y_train)]

In [None]:
# showing the train summary table
train_summary_table

####  4.5.2 Test Summary Table

In [None]:
# test summary table
test_summary_table = pd.DataFrame({'Model': [],
                              'Accuracy': [], 
                              'Precision': [], 'Recall': [], 'F1 Score': [],
                              })

test_summary_table.loc[0] = ["Logistic Regression",
                         lr_scores.acc(lr, tfidf_test, y_test),
                         lr_scores.precision(lr, tfidf_test, y_test),
                         lr_scores.recall(lr, tfidf_test, y_test),
                         lr_scores.f1(lr, tfidf_test, y_test)]

test_summary_table.loc[1] = ["Random Forest",
                         rf_scores.acc(rf, tfidf_test, y_test),
                         rf_scores.precision(rf, tfidf_test, y_test),
                         rf_scores.recall(rf, tfidf_test, y_test),
                         rf_scores.f1(rf, tfidf_test, y_test)]

test_summary_table.loc[2] = ["Gradient Boost",
                        gb_score.acc(grad_boost, tfidf_test, y_test),
                        gb_score.precision(grad_boost, tfidf_test, y_test),
                        gb_score.recall(grad_boost, tfidf_test, y_test),
                        gb_score.f1(grad_boost, tfidf_test, y_test)]

test_summary_table.loc[3] = ["Ada Boosting Classifier",
                        ada_score.acc(ada_boost, tfidf_test, y_test),
                        ada_score.precision(ada_boost, tfidf_test, y_test),
                        ada_score.recall(ada_boost, tfidf_test, y_test),
                        ada_score.f1(ada_boost, tfidf_test, y_test)]

test_summary_table.loc[4] = ["XG Boosting Classifier",
                        xg_score.acc(xg_boost, tfidf_test, y_test),
                        xg_score.precision(xg_boost, tfidf_test, y_test),
                        xg_score.recall(xg_boost, tfidf_test, y_test),
                        xg_score.f1(xg_boost, tfidf_test, y_test)]

# showing the test summary table
test_summary_table
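
The two summary tables above are filled one row at a time, which makes it easy to reuse a row index by mistake. As a rough alternative, the same table could be built in a loop over a dictionary of fitted models. This is a minimal sketch, not the notebook's method: `build_summary_table` is a hypothetical helper, and the models and data are small synthetic stand-ins.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier


def build_summary_table(models, features, target):
    """Return one row of scores per fitted model (hypothetical helper)."""
    rows = []
    for name, model in models.items():
        y_pred = model.predict(features)  # predict once, reuse for every metric
        rows.append({
            "Model": name,
            "Accuracy": accuracy_score(target, y_pred),
            "Precision": precision_score(target, y_pred),
            "Recall": recall_score(target, y_pred),
            "F1 Score": f1_score(target, y_pred),
        })
    # Rows get a default 0..n-1 index, so no index can be reused by accident
    return pd.DataFrame(rows)


# Synthetic stand-in data and models (assumption, not the project data)
X_tab, y_tab = make_classification(n_samples=300, n_features=10, random_state=1)
table_demo_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000).fit(X_tab, y_tab),
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_tab, y_tab),
}

demo_summary_table = build_summary_table(table_demo_models, X_tab, y_tab)
print(demo_summary_table)
```

Calling the helper once with the train split and once with the test split would give both tables with identical columns and row order.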

### 4.6 Cross Validation


In [None]:
# Creating a class that performs a cross validation score
class cross_v_scores(object):
    """ A class that performs cross validation"""
    def acc_score(model, x, y):
        """ A function that gets the mean cross validation accuracy"""
        mean_score = np.mean(cross_val_score(model, x, y, cv=3, scoring='accuracy'))
        return mean_score
    
    def mse(model, x, y):
        """ A function that calculates the mean squared error"""
        mean_mse = np.mean(-cross_val_score(model, x, y, cv=5, scoring="neg_mean_squared_error"))
        return mean_mse

In [None]:
lr_cv = cross_v_scores
rf_cv = cross_v_scores
ada_cv = cross_v_scores
grad_cv = cross_v_scores
xg_cv = cross_v_scores

In [None]:
# Creating a cross validation summary table
cross_validation_summary_table = pd.DataFrame({'Model': [],
                              'Accuracy': [], 
                              'mean squared error': [],
                              })

In [None]:
# logistic regression
cross_validation_summary_table.loc[0] = ["Logistic Regression",
                                         lr_cv.acc_score(lr, tfidf_test, y_test),
                                         lr_cv.mse(lr, tfidf_test, y_test)]

# random forest
cross_validation_summary_table.loc[1] = ["Random Forest",
                                         rf_cv.acc_score(rf, tfidf_test, y_test),
                                         rf_cv.mse(rf, tfidf_test, y_test)]


# XG Boost
cross_validation_summary_table.loc[4] = ["XG Boosting Classifier",
                                         xg_cv.acc_score(xg_boost, tfidf_test, y_test),
                                         xg_cv.mse(xg_boost, tfidf_test, y_test)]

# showing the table
cross_validation_summary_table
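
A note on the two columns of this table: when the labels are encoded as 0 and 1, the mean squared error of the predicted labels on a fold is simply the fraction of misclassified samples, that is, 1 minus the accuracy on that fold. The sketch below illustrates this on synthetic data, using scikit-learn's `cross_validate` to compute both metrics on the same folds. The estimator, data, and fold settings are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic 0/1-labelled data (assumption, not the project data)
X_cv, y_cv = make_classification(n_samples=300, n_features=10, random_state=2)

# Use the same stratified folds for both metrics so they are comparable
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)
cv_results = cross_validate(
    LogisticRegression(max_iter=1000),
    X_cv,
    y_cv,
    cv=folds,
    scoring=["accuracy", "neg_mean_squared_error"],
)

fold_accuracy = cv_results["test_accuracy"]
fold_mse = -cv_results["test_neg_mean_squared_error"]  # flip the sign back

print("mean accuracy:", fold_accuracy.mean())
print("mean MSE:", fold_mse.mean())
print("MSE equals 1 - accuracy on every fold:", np.allclose(fold_mse, 1 - fold_accuracy))
```

In the class above, `acc_score` uses `cv=3` while `mse` uses `cv=5`, so the two columns of the notebook's table come from different splits and need not sum exactly to 1.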


## 5. Evaluation

We evaluated 5 different models in this project. Two of them were simpler algorithms (Logistic Regression and Random Forest) and three were more advanced boosting algorithms (Ada Boosting Classifier, Gradient Boosting Classifier, and Extreme Gradient Boosting Classifier).

The accuracy and precision scores were compared above using the summary tables. Of the 5 algorithms, XGBoost returned the best accuracy of 99.7% on our test data.

Given this extremely high accuracy, we used cross validation to check for potential overfitting. The XGBoost Classifier averaged the best accuracy (99.5%) as well as the lowest mean squared error (0.005). We therefore adopted it as the best choice for this binary classification task.

## 6. Deployment
> In this section, we build a pipeline that combines the text vectorizer with the XGBoost Classifier. This pipeline will be important when it comes to building our ml-app.

In [None]:
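# Create a pipeline that vectorizes the text and then classifies it
pipe = Pipeline([('tfidf', TfidfVectorizer()),
                 ('xgb', XGBClassifier())])

# Fit the whole pipeline on the training data
pipe.fit(X_train, y_train)

pipe.predict(X_test)

# Pickle the pipeline; the with-block closes the file once it is written
with open('./models/model.pkl', 'wb') as model_file:
    pickle.dump(pipe, model_file)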
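
For context, an application would later load the pickled pipeline and pass raw article text straight to `predict`, since the vectorizer is part of the pipeline. The sketch below shows that round trip under stated assumptions: it uses a toy dataset, a temporary file instead of `./models/model.pkl`, and `LogisticRegression` as a stand-in for `XGBClassifier` so that it runs with scikit-learn alone. The label encoding (1 for one class, 0 for the other) is also only illustrative.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy texts and labels (assumption, not the project data)
toy_texts = [
    "senate passes budget bill after long debate",
    "shocking secret cure doctors do not want you to know",
    "central bank holds interest rates steady",
    "celebrity clone spotted in secret government lab",
]
toy_labels = [1, 0, 1, 0]

# Stand-in pipeline: same structure as above, different final estimator
demo_pipe = Pipeline([("tfidf", TfidfVectorizer()),
                      ("clf", LogisticRegression())])
demo_pipe.fit(toy_texts, toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.pkl"

    # Save the fitted pipeline
    with open(model_path, "wb") as model_file:
        pickle.dump(demo_pipe, model_file)

    # Load it back, as a separate application would
    with open(model_path, "rb") as model_file:
        loaded_pipe = pickle.load(model_file)

# The loaded pipeline accepts raw text; no separate vectorizing step is needed
new_articles = ["parliament debates new budget bill"]
loaded_predictions = loaded_pipe.predict(new_articles)
print(loaded_predictions)
```

Pickled models should only be loaded from trusted sources, and generally with the same library versions that were used to save them.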

## 7. Findings
- The common keywords in fake news are "administration", "trump", "government", and "republican".

- Fake news is fairly evenly distributed throughout the year, but periods following elections, campaigns, and new governments see a slight surge in volume.

- The most prevalent topics in fake news are politics (evidenced by common words in the word cloud such as "Republican", "trump", "obama", and "government") and war (indicated by "conflict", "attack", and "terrorist").

- Trump is the name that appears most in fake news

## 8. Limitations
- The data was restricted to the 2015-2018 timeframe.

- Most of the news articles covered the USA, leaving out other locations.

- Our algorithms were computationally expensive.

## 9. Future Work
- Build an automated fact-checking system that combines data looking at different aspects to help non-experts in classifying news.

- Use data that covers a wide range of time focusing on world news.

- Use PySpark to distribute data processing and so reduce computation time.

- Use the Twitter API to get current news.

## 10. Conclusion
- Every news article has different characteristics, so there is a need for a system that can check the content of the news in depth.

- The results suggest that the approach is highly favorable, since the application helps in classifying fake news and in identifying key features that can be used for fake news detection.

## 11. Recommendations
- Use the classifier to detect whether a posted article is legitimate, so that readers are not misinformed.
- Use the classifier to improve the accuracy and effectiveness of other fake news detection tools and systems.

