# Final Project: Fake News Detection

By Felix Daubner - Hochschule der Medien

Module 'Supervised and Unsupervised Learning' - Prof. Dr.-Ing. Johannes Maucher

## Data Access

As stated in [Problem Understanding](01_problem-understanding.ipynb), the training data is scraped from [POLITIFACT.com](https://www.politifact.com/). POLITIFACT is a project run by the non-profit Poynter Institute for Media Studies that verifies the truth of political statements in the USA. It originated as an election-year project of the Tampa Bay Times and has since rated political statements on a scale of truth categories. The statements originate from politicians, members of the US Congress, the White House and lobbyists. Every statement is reviewed and then classified into one of six categories: "true", "mostly true", "half true", "mostly false", "false" or "pants on fire". The rating is based on intensive research using the information that was publicly available on the day the statement was made.

[2]

For the scope of this project, the website is scraped to retrieve the latest political statements, which are then used to train the fake-news and sentiment models. Since the models should only be trained on the statements themselves, without metadata such as speaker, date or channel, only the necessary data is scraped.

In [154]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import time

To scrape the statements, a class "Scraping" is defined. It contains all methods needed to access POLITIFACT.com and scrape the statements together with their truth rating and issue. All of this information is saved in a pandas DataFrame.

In [155]:
class Scraping:
    def __init__(self):
        '''
        Creates an object of class Scraping.
        '''
        self.__url = "https://www.politifact.com/factchecks/list/?page={page}&category={category}"
        self.__data = pd.DataFrame(columns=["statement", "issue", "truth"])
        self.__issues = self.__getIssues()

    def __getIssues(self):
        '''
        Scrapes POLITIFACT.com to retrieve all issues statements were made in. The scraped issues are then used to scrape up to 150 statements from every issue.
        Returns a list of strings, the strings being the issues which can be put into the URL.
        '''
        results = list()

        url_issues = "https://www.politifact.com/issues/"
        response = requests.get(url_issues)

        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "html.parser")
            
            divs = soup.find_all("div", class_="c-chyron__value")
            for div in divs:
                links = div.find_all("a", href=True)
                for link in links:
                    results.append(link["href"].replace("/",""))

            return results
                    
        else:
            print(f"Error! Aborted with status code: {response.status_code}")
            raise ConnectionRefusedError

    def scrape(self):
        '''
        For every issue, five pages each containing 30 statements are scraped and then added to the existing DataFrame.
        '''
        for issue in self.__issues:
            for i in range(1,6):
                
                response = requests.get(self.__url.format(page=i, category=issue))

                if response.status_code == 200:
                    soup = BeautifulSoup(response.text, "html.parser")

                    statement = self.__getQuote(soup)
                    truth = self.__getTruth(soup)
                
                    if len(statement) !=  0 and len(truth) != 0:
                        self.__addToDataFrame(statement, issue, truth)

                time.sleep(0.01)
    
    def __getQuote(self, soup):
        '''
        Searches the BeautifulSoup object for statements which can be retrieved and returned as a list.
        '''
        results = list()
        quotes = soup.find_all("div", class_="m-statement__quote")
        for quote in quotes:
            try:
                results.append(quote.text.strip())
            except AttributeError:
                # Append a placeholder so the list keeps one entry per element.
                results.append(None)

        return results

    def __getTruth(self, soup):
        '''
        Searches the BeautifulSoup object for truth classifications to retrieve and return them as a list.
        '''
        results = list()
        meter = soup.find_all("div", class_="m-statement__meter")
        for m in meter:
            true = m.find_all("img", class_="c-image__thumb")
            for tr in true:
                try:
                    results.append(tr.get("alt").strip())
                except AttributeError:
                    # tr.get("alt") returns None when the image has no alt text.
                    results.append(None)

        return results

    def __addToDataFrame(self, st, iss, tr):
        '''
        Creates a DataFrame from the given lists and concatenates the created DataFrame to the existing DataFrame. 
        '''
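        try:
            # Building the DataFrame fails if the statement and truth lists differ in length.
            new = pd.DataFrame({"statement": st, "issue": iss, "truth": tr})
            self.__data = pd.concat([self.__data, new], ignore_index=True)
        except Exception as e:
            print(f"Could not add data for issue:\t{iss}")
            print(e)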

    def getData(self):
        '''
        Returns the DataFrame.
        '''
        return self.__data
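
How `__getQuote` and `__getTruth` pick out their values can be tried offline on a small HTML snippet. The following is a minimal sketch: only the class names `m-statement__quote`, `m-statement__meter` and `c-image__thumb` are taken from the class above, while the surrounding markup, the sample statements and the helper `extract_pairs` are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the CSS class names come from the Scraping class,
# the structure around them is assumed for illustration.
SAMPLE_HTML = """
<ul>
  <li>
    <div class="m-statement__quote"><a href="#"> First example statement. </a></div>
    <div class="m-statement__meter">
      <img class="c-image__thumb" alt="half-true" src="#">
    </div>
  </li>
  <li>
    <div class="m-statement__quote"><a href="#"> Second example statement. </a></div>
    <div class="m-statement__meter">
      <img class="c-image__thumb" alt="false" src="#">
    </div>
  </li>
</ul>
"""


def extract_pairs(html):
    """Return (statement, truth) tuples found in the given HTML string."""
    soup = BeautifulSoup(html, "html.parser")

    # One entry per quote container, with surrounding whitespace removed.
    quotes = [
        q.get_text(strip=True)
        for q in soup.find_all("div", class_="m-statement__quote")
    ]

    # The rating is read from the alt text of the image inside each meter.
    truths = [
        img.get("alt", "").strip()
        for meter in soup.find_all("div", class_="m-statement__meter")
        for img in meter.find_all("img", class_="c-image__thumb")
    ]

    # Pairing by position assumes both lists are in the same order.
    return list(zip(quotes, truths))


pairs = extract_pairs(SAMPLE_HTML)
print(pairs)
```

Like the class above, the sketch pairs statements and ratings purely by their position on the page, so it relies on every statement having exactly one rating image.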

An object of class "Scraping" is created and then the data is scraped.

In [157]:
scrape = Scraping()
scrape.scrape()
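
`scrape()` skips every page whose request does not return status code 200 and pauses only 0.01 seconds between requests. If pages go missing because of temporary errors, retrying with a growing pause is one possible remedy. The following is a minimal sketch under that assumption: `fetch_with_retry`, `FakeResponse` and `fake_fetch` are hypothetical names, and the fake response only stands in for `requests.get` so that the example runs without network access.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, pause=1.0, sleep=time.sleep):
    """Call fetch(url) until a response with status code 200 is returned.

    fetch is any callable that returns an object with a status_code
    attribute, for example requests.get. Returns the response, or None
    if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        response = fetch(url)
        if response.status_code == 200:
            return response
        print(f"Attempt {attempt} failed with status code {response.status_code}")
        if attempt < attempts:
            # Wait a little longer after each failed attempt.
            sleep(pause * attempt)
    return None


class FakeResponse:
    """Hypothetical stand-in for a requests response object."""

    def __init__(self, status_code):
        self.status_code = status_code


# Simulate one failed request followed by a successful one.
_statuses = iter([503, 200])


def fake_fetch(url):
    return FakeResponse(next(_statuses))


waits = []  # records the pauses instead of actually sleeping
result = fetch_with_retry(fake_fetch, "https://example.invalid/page", sleep=waits.append)
print(result.status_code, waits)
```

With the real library, the call would look like `fetch_with_retry(requests.get, url)`.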

Next, the scraped data is retrieved. At this point the data is still raw, which means that no pre-processing has been done yet.

In [158]:
scraped_data = scrape.getData()

In [159]:
scraped_data.head()

|   | statement | issue | truth |
|---|---|---|---|
| 0 | Says Sen. Bob Casey, D-Pa., “is trying to chan... | 2024-senate-elections | false |
| 1 | Says the election results are suspicious becau... | 2024-senate-elections | false |
| 2 | A “ballot dump” around 4 a.m. in Milwaukee sho... | 2024-senate-elections | pants-fire |
| 3 | “Kari Lake is threatening Social Security and ... | 2024-senate-elections | half-true |
| 4 | Republican Senate candidate Sam Brown “wants t... | 2024-senate-elections | half-true |


In [None]:
scraped_data.shape
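
Because the statements are collected issue by issue, the same statement may appear under more than one issue, and the scraping methods can insert `None` for elements without text. A few quick checks on the raw data make both visible before any pre-processing. This is a minimal sketch on invented toy rows, so the frame `raw` and its values are assumptions and not the scraped data.

```python
import pandas as pd

# Toy rows standing in for the scraped data; all values are invented.
raw = pd.DataFrame({
    "statement": ["Claim A", "Claim B", "Claim A", None],
    "issue": ["economy", "economy", "taxes", "taxes"],
    "truth": ["false", "half-true", "false", "pants-fire"],
})

# Number of missing values in each column.
missing_per_column = raw.isna().sum()

# Statements that occur more than once, ignoring missing ones.
duplicated_statements = int(raw["statement"].dropna().duplicated().sum())

# How often each truth label occurs.
label_counts = raw["truth"].value_counts()

print(missing_per_column)
print(f"Duplicated statements:\t{duplicated_statements}")
print(label_counts)
```

The same three expressions can be applied to `scraped_data` directly.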

The data is now saved to a CSV file, which lays the foundation for the following steps.

In [None]:
scraped_data.to_csv("data/scraped.csv", header=True, sep=";")
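
`to_csv` also writes the row index as an unnamed first column. When the file is read back without further options, that column comes back as `Unnamed: 0`. The following sketch shows the round trip with the same separator, using an in-memory buffer instead of a file. The frame `frame` and its values are invented for illustration.

```python
import io

import pandas as pd

# Toy data; the first statement contains the separator on purpose.
frame = pd.DataFrame({
    "statement": ["Claim A; with a semicolon", "Claim B"],
    "issue": ["economy", "taxes"],
    "truth": ["false", "half-true"],
})

# Write to an in-memory buffer with the same options as above.
buffer = io.StringIO()
frame.to_csv(buffer, header=True, sep=";")
csv_text = buffer.getvalue()

# Reading back without index_col keeps the written index as a data column.
with_extra_column = pd.read_csv(io.StringIO(csv_text), sep=";")

# index_col=0 turns the first column back into the index.
restored = pd.read_csv(io.StringIO(csv_text), sep=";", index_col=0)

print(list(with_extra_column.columns))
print(list(restored.columns))
```

Statements that contain a semicolon are quoted automatically on writing, so they survive the round trip.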

## Join Test Data from LIAR Dataset

The downloaded LIAR dataset is used only as a validation and test dataset, as described in [Problem Understanding](01_problem-understanding.ipynb). It consists of three tsv-files which are already labeled as train, test and validation data. For the purposes of this project, these three datasets are joined into a single dataset, which is used after training, evaluation and optimization of the fake-news and sentiment models to test the performance of both models.

First, each of the three datasets is imported into a pandas DataFrame.

In [130]:
train = pd.read_csv("data/LIAR/train.tsv", header=None, sep="\t")

In [131]:
test = pd.read_csv("data/LIAR/test.tsv", header=None, sep="\t")

In [132]:
valid = pd.read_csv("data/LIAR/valid.tsv", header=None, sep="\t")

It is assumed that all datasets have the same number of columns and that the columns are ordered in the same way.
As a check, the shapes of all datasets are printed along with the total number of rows across the three datasets. This total is used later to verify that the join worked correctly.
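
The assumption about identical columns can also be checked in code before joining. The following is a minimal sketch: `check_same_layout` is a hypothetical helper and the tab-separated sample text is invented. It also shows why files without a header row need `header=None`, because otherwise the first data row is consumed as column labels.

```python
import io

import pandas as pd


def check_same_layout(frames):
    """Raise ValueError unless all DataFrames have identical column labels in the same order.

    frames is a dict mapping a name to a DataFrame. Returns the number of columns.
    """
    names = list(frames)
    reference = list(frames[names[0]].columns)
    for name in names[1:]:
        columns = list(frames[name].columns)
        if columns != reference:
            raise ValueError(f"{name} has columns {columns}, expected {reference}")
    return len(reference)


# Invented sample without a header row.
tsv = "id1\thalf-true\tClaim A\nid2\tpants-fire\tClaim B\n"

part_a = pd.read_csv(io.StringIO(tsv), header=None, sep="\t")
part_b = pd.read_csv(io.StringIO(tsv), header=None, sep="\t")
# Without header=None the first row becomes the column labels.
part_c = pd.read_csv(io.StringIO(tsv), sep="\t")

n_columns = check_same_layout({"a": part_a, "b": part_b})

try:
    check_same_layout({"a": part_a, "c": part_c})
    mismatch_detected = False
except ValueError as error:
    mismatch_detected = True
    print(error)

print(n_columns, part_a.shape, part_c.shape)
```

For the LIAR data the call would be `check_same_layout({"train": train, "test": test, "valid": valid})`.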

In [133]:
print(f"Shape train:\t{train.shape}")
print(f"Shape test:\t{test.shape}")
print(f"Shape valid:\t{valid.shape}")

print(f"Sum of rows:\t{train.shape[0]+test.shape[0]+valid.shape[0]}")

Shape train:	(10240, 14)
Shape test:	(1267, 14)
Shape valid:	(1284, 14)
Sum of rows:	12791


The three datasets are joined vertically using the pandas function 'concat'.

In [134]:
LIAR = pd.concat([train, test, valid], axis=0)

The result is a single DataFrame with 14 columns, like each of the original datasets. The number of rows matches the sum of rows from the earlier output, so we can assume that the data was concatenated correctly.

In [135]:
LIAR.shape

(12791, 14)
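
The row count confirms that no rows were lost, but it says nothing about the index. By default `pd.concat` keeps the index labels of the input frames, so labels repeat in the result. Whether this matters depends on the later steps. The following sketch on two invented toy frames shows the difference that `ignore_index=True` makes.

```python
import pandas as pd

# Two toy frames, each with its own index starting at 0.
first = pd.DataFrame({0: ["a", "b"]})
second = pd.DataFrame({0: ["c"]})

# Default behaviour: the original index labels are kept.
kept = pd.concat([first, second], axis=0)

# ignore_index=True assigns a fresh index from 0 to n-1.
renumbered = pd.concat([first, second], axis=0, ignore_index=True)

print(kept.index.tolist(), kept.index.is_unique)
print(renumbered.index.tolist(), renumbered.index.is_unique)
```

With repeated labels, a lookup such as `kept.loc[0]` returns several rows instead of one.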

Finally, the concatenated data is saved as a CSV file.

In [None]:
LIAR.to_csv("data/LIAR.csv", sep=";")

# Sources

[2] https://www.politifact.com/article/2018/feb/12/principles-truth-o-meter-politifacts-methodology-i/