# Final Project: Fake News Detection

By Felix Daubner - Hochschule der Medien

Module 'Supervised and Unsupervised Learning' - Prof. Dr.-Ing. Johannes Maucher

## Data Access

As stated in [problem understanding](01_problem-understanding.ipynb), the training data is scraped from [POLITIFACT.com](https://www.politifact.com/), a project run by the non-profit Poynter Institute for Media Studies. [7] Through its website, POLITIFACT verifies the truth of political statements made in the USA. It originated as an election-year project of the Tampa Bay Times. [7] Since then, it has rated political statements by their truthfulness. The statements originate from politicians, members of the US Congress, the White House and lobbyists. [7] Every statement is reviewed and then assigned one of six ratings: "true", "mostly true", "half true", "mostly false", "false" or "pants on fire". [7] Each rating is based on intensive research using the information that was publicly available on the day the statement was made. [7]

For the scope of this project, the website is scraped to retrieve the latest political statements including metadata which are then used to train the fake-news model. 

In [154]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import time

To scrape the statements, a class "Scraping" is defined. It contains all methods needed to access [POLITIFACT.com](https://www.politifact.com/) and to scrape the statements together with their truth rating, which will be the target variable of the model, and the available metadata. All of this information is saved in a pandas DataFrame.

[POLITIFACT.com](https://www.politifact.com/) assigns each reviewed statement to an issue such as health care, economy or immigration. For each issue, a list of all statements assigned to it can be found. Before the statements themselves are scraped, all available issues are scraped and saved in a list.
For each issue, a maximum of 150 statements is then scraped from [POLITIFACT.com](https://www.politifact.com/). Each issue page shows around 30 statements before the next page has to be requested, so five pages are requested per issue. When iterating over issues and pages, the URL changes only in the query parameters "page" and "category".

So the process of scraping statements and metadata can be implemented once and is then applied to all issues and subpages to retrieve a reasonable amount of data.
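
The URL pattern described above can be sketched in isolation. This is a minimal illustration, not part of the scraper itself: the helper `build_urls` and the issue slug `"health-care"` are assumptions made for the example, while the URL template is the one used in the class below.

```python
# Template as used by the Scraping class; only "page" and "category" vary.
URL_TEMPLATE = "https://www.politifact.com/factchecks/list/?page={page}&category={category}"


def build_urls(issue, max_pages=5):
    """Hypothetical helper: returns one list URL per page for a given issue slug."""
    return [
        URL_TEMPLATE.format(page=page, category=issue)
        for page in range(1, max_pages + 1)
    ]


# "health-care" is an assumed slug, used here only for illustration.
urls = build_urls("health-care")
for url in urls:
    print(url)
```

With roughly 30 statements per page, five pages yield about 150 statements per issue.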

In [191]:
class Scraping:
    def __init__(self):
        '''
        Creates an object of class Scraping. This class is designed specifically to scrape content from POLITIFACT.com
        '''
        self.__url = "https://www.politifact.com/factchecks/list/?page={page}&category={category}"
        self.__data = pd.DataFrame(columns=["statement", "issue", "person", "channel", "truth"])
        self.__issues = self.__getIssues()

    def __getIssues(self):
        '''
        Scrapes POLITIFACT.com to retrieve all issues statements were made in. 
        Returns a list of strings.
        '''
        results = list()

        url_issues = "https://www.politifact.com/issues/"
        response = requests.get(url_issues)

        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "html.parser")
            
            divs = soup.find_all("div", class_="c-chyron__value")
            for div in divs:
                links = div.find_all("a", href=True)
                for link in links:
                    results.append(link["href"].replace("/",""))

            return results
                    
        else:
            print(f"Error! Aborted with status code: {response.status_code}")
            raise ConnectionRefusedError

    def scrape(self):
        '''
        Coordinates the scraping process by calling private methods to retrieve up to 150 statements (five pages) and the associated truth classification per issue. 
        The scraped data is then added to a DataFrame.
        '''
        for issue in self.__issues:
            for i in range(1,6):
                
                response = requests.get(self.__url.format(page=i, category=issue))

                if response.status_code == 200:
                    soup = BeautifulSoup(response.text, "html.parser")

                    statement = self.__getQuote(soup)
                    truth = self.__getTruth(soup)
                    person = self.__getPerson(soup)
                    channel = self.__getChannel(soup)
                
                    if len(statement) !=  0 and len(truth) != 0:
                        self.__addToDataFrame(statement, issue, person, channel, truth)

                time.sleep(0.01)
    
    def __getQuote(self, soup):
        '''
        Searches the BeautifulSoup object containing HTML content for statements which can be retrieved and returned as a list.
        '''
        results = list()
        quotes = soup.find_all("div", class_="m-statement__quote")
        for quote in quotes:
            try:
                results.append(quote.text.strip())
            except:
                results.append(None)

        return results

    def __getTruth(self, soup):
        '''
        Searches the BeautifulSoup object containing HTML content for truth classifications to retrieve and return them as a list.
        '''
        results = list()
        meter = soup.find_all("div", class_="m-statement__meter")
        for m in meter:
            true = m.find_all("img", class_="c-image__thumb")
            for tr in true:
                try:
                    results.append(tr.get("alt").strip())
                except:
                    results.append(None)

        return results
    
    def __getPerson(self, soup):
        '''
        Searches the BeautifulSoup object containing HTML content for the person making the statements to retrieve and return them as a list.
        '''
        results = list()
        meta = soup.find_all("div", class_="m-statement__meta")
        for m in meta:
            name = m.find_all("a", class_="m-statement__name")
            for n in name:
                try:
                    results.append(n.text.strip())
                except:
                    results.append(None)

        return results
    
    def __getChannel(self, soup):
        '''
        Searches the BeautifulSoup object containing HTML content for the channel the statement was made to retrieve and return them as a list.
        '''
        results = list()
        desc = soup.find_all("div", class_="m-statement__desc")
        
        for d in desc:
            try:
                channel = d.text.strip()
                results.append(self.__extractChannel(channel))
            except Exception as e:
                results.append(None)
                print(f"Exception: {e}")

        return results
    
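    def __extractChannel(self, channel):
        '''
        Extracts the channel from a description string by slicing between the year of the date (", 20xx") and the colon.
        '''
        # skip the six characters of ", 20xx" so the slice starts right after the year
        start = channel.find(", 20") + 6
        stop = channel.find(":")

        return channel[start:stop]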

    def __addToDataFrame(self, st, iss, p, ch, tr):
        '''
        Creates a DataFrame using the given lists and concatenates the created DataFrame to the already existing DataFrame. 
        '''
        new = pd.DataFrame({"statement": st, "issue": iss, "person": p, "channel": ch, "truth": tr})
        try:
            self.__data = pd.concat([self.__data, new], ignore_index=True)
        except Exception as e:
            print(f"Data:\t\t{new}")
            print(e)

    def getData(self):
        '''
        Returns the DataFrame.
        '''
        return self.__data

An object of class "Scraping" is created and then the data is scraped using the defined 'scrape'-method.

In [186]:
scrape = Scraping()
scrape.scrape()
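
The slicing heuristic behind `__extractChannel` can be tried out on its own. The following is a sketch under assumptions: the sample description string is made up to resemble a date followed by a channel and a colon, and the standalone function `extract_channel` is a hypothetical variant that additionally strips whitespace and returns `None` when the expected markers are missing, which the class method does not do.

```python
def extract_channel(description):
    """Hypothetical standalone variant of the channel-slicing heuristic."""
    marker = description.find(", 20")  # start of the year, e.g. ", 2024"
    stop = description.find(":")       # the colon that ends the description
    if marker == -1 or stop == -1:
        # Assumption for this sketch: signal an unexpected format with None.
        return None
    # ", 20xx" is six characters long; strip the whitespace that follows it.
    return description[marker + 6:stop].strip()


# Assumed example input, not taken from the scraped data.
sample = "stated on November 5, 2024 in an X post:"
print(extract_channel(sample))
print(extract_channel("no date here"))
```

Returning `None` for unexpected formats keeps malformed descriptions visible as missing values instead of producing arbitrary slices.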

Next, the scraped data is retrieved. It is still in raw format, meaning no pre-processing has been done yet.

The data was scraped on December 31, 2024. Therefore, newer statements are not included in the dataset.

In [187]:
scraped_data = scrape.getData()

As defined in class 'Scraping', not only the statement and truth classification but also the existing metadata was scraped and collected in a pandas DataFrame. The following output displays the first rows of the scraped data.

The column "statement" contains the statement in the same format as published on [POLITIFACT.com](https://www.politifact.com/).
"issue" contains the issue the statement was made about. "person" is the person making the statement on the channel given in the "channel" column. Whether the data was scraped completely, without missing values, still needs to be investigated. Judging solely by the output below, all required information was scraped.

In [188]:
scraped_data.head(50)

| | statement | issue | person | channel | truth |
|---|---|---|---|---|---|
| 0 | Says Sen. Bob Casey, D-Pa., “is trying to chan... | 2024-senate-elections | Elon Musk | in an X post | false |
| 1 | Says the election results are suspicious becau... | 2024-senate-elections | Eric Hovde | in X, formerly Twitter | false |
| 2 | A “ballot dump” around 4 a.m. in Milwaukee sho... | 2024-senate-elections | Instagram posts | in an Instagram post | pants-fire |
| 3 | “Kari Lake is threatening Social Security and ... | 2024-senate-elections | WinSenate | in a Facebook ad | half-true |
| 4 | Republican Senate candidate Sam Brown “wants t... | 2024-senate-elections | Make the Road Nevada | in an X post | half-true |
| 5 | Says opponent Eric Hovde “opposes efforts to n... | 2024-senate-elections | Tammy Baldwin | in TV debate | barely-true |
| 6 | “Jacky Rosen voted to allow biological men to ... | 2024-senate-elections | Sam Brown | in an X post | false |
| 7 | “In Montana, we cherish our public lands. But ... | 2024-senate-elections | Jon Tester | in a post on X | false |
| 8 | Sen. Bob Casey "has voted in lockstep with his... | 2024-senate-elections | Dave McCormick | in a debate | false |
| 9 | "Ruben Gallego wanted to defund the police. He... | 2024-senate-elections | Kari Lake | in a press conference | false |


The scraped data consists of the five columns named above. Nearly 17,000 statements were scraped, which should be sufficient to train a machine learning model within the scope of this project.

In [189]:
scraped_data.shape

(16926, 5)
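
Whether any values are missing can be checked per column. The following is a minimal sketch on a small made-up DataFrame, not on the scraped data; the helper `missing_report` is hypothetical and counts both `None`/`NaN` entries and empty or whitespace-only strings as missing.

```python
import pandas as pd

# Made-up sample rows that mimic the column layout of the scraped data.
sample = pd.DataFrame({
    "statement": ["A claim", "Another claim", ""],
    "person": ["Speaker A", None, "Speaker C"],
    "channel": ["in a debate", "in an X post", None],
    "truth": ["false", "half-true", "true"],
})


def missing_report(df):
    """Hypothetical helper: counts missing or blank entries per column."""
    blank = df.apply(lambda col: col.fillna("").astype(str).str.strip() == "")
    return blank.sum()


report = missing_report(sample)
print(report)
```

Applied to the real data, the same call would be `missing_report(scraped_data)`; a non-zero count in any column would indicate that the scraper missed values for some statements.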

The data is now saved to a CSV file. This file lays the foundation for the coming steps.

In [190]:
scraped_data.to_csv("data/scraped.csv", header=True, sep=";")
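
Since the file is written with `sep=";"` and includes the index, it has to be read back with the same separator. The sketch below illustrates the round trip on a made-up DataFrame, using an in-memory buffer instead of a file on disk. It also shows that a statement containing a semicolon survives, because pandas quotes such fields by default.

```python
import io

import pandas as pd

# Made-up rows; the first statement deliberately contains the separator.
frame = pd.DataFrame({
    "statement": ["Claim; with a semicolon", "Plain claim"],
    "truth": ["false", "true"],
})

# Write with the same options as above, but into an in-memory buffer.
buffer = io.StringIO()
frame.to_csv(buffer, header=True, sep=";")
buffer.seek(0)

# index_col=0 restores the saved index instead of creating an "Unnamed: 0" column.
restored = pd.read_csv(buffer, sep=";", index_col=0)
print(restored)
```

Without `index_col=0`, the saved index would come back as an extra column named "Unnamed: 0".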

## Join Test Data from LIAR Dataset

The downloaded [LIAR](https://paperswithcode.com/dataset/liar) dataset is used only as a test dataset, as described in [problem understanding](01_problem-understanding.ipynb). It consists of three tsv-files which are already labeled as train, test and validation data. For the purposes of this project, those three datasets are joined into a single dataset, which is used to test performance after training, evaluation and optimization of the fake-news model.

First, each of the three files is imported into its own pandas DataFrame.

In [130]:
train = pd.read_csv("data/LIAR/train.tsv", header=None, sep="\t")

In [131]:
test = pd.read_csv("data/LIAR/test.tsv", header=None, sep="\t")

In [132]:
valid = pd.read_csv("data/LIAR/valid.tsv", header=None, sep="\t")
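
The `header=None` argument matters when a file has no header row, which the calls for the train and test files suggest is the case here. The sketch below uses a made-up two-row TSV string to show the difference: without `header=None`, pandas treats the first data row as column names and the DataFrame loses one row.

```python
import io

import pandas as pd

# Made-up TSV content without a header row.
raw = "1.json\tfalse\tFirst statement\n2.json\ttrue\tSecond statement\n"

# header=None keeps every line as data and numbers the columns 0, 1, 2.
with_header_none = pd.read_csv(io.StringIO(raw), header=None, sep="\t")

# The default consumes the first line as column names.
default_header = pd.read_csv(io.StringIO(raw), sep="\t")

print(with_header_none.shape)
print(default_header.shape)
```

Reading all three files the same way also keeps the column labels identical, which the later concatenation relies on.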

It is assumed that all three datasets have the same number of columns in the same order.
As a check, the shapes of all datasets are printed together with the total number of rows across the three datasets. This total is used later to verify that the join worked correctly.

In [133]:
print(f"Shape train:\t{train.shape}")
print(f"Shape test:\t{test.shape}")
print(f"Shape valid:\t{valid.shape}")

print(f"Sum of rows:\t{train.shape[0]+test.shape[0]+valid.shape[0]}")

Shape train:	(10240, 14)
Shape test:	(1267, 14)
Shape valid:	(1284, 14)
Sum of rows:	12791


The three datasets are joined vertically using the pandas 'concat' function.

In [134]:
LIAR = pd.concat([train, test, valid], axis=0)

The result is a single DataFrame with 14 columns, like each of the original datasets. The number of rows matches the sum of rows from the earlier output, so we can assume that the data was concatenated correctly.

In [135]:
LIAR.shape

(12791, 14)
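
The manual comparison of row counts can also be expressed as an assertion. The following sketch uses three tiny made-up DataFrames in place of the LIAR files. Unlike the cell above, it passes `ignore_index=True`, an optional addition that gives the combined DataFrame a fresh, unique index.

```python
import pandas as pd

# Tiny made-up stand-ins for the train, test and validation frames.
parts = [
    pd.DataFrame([[1, "a"], [2, "b"]]),
    pd.DataFrame([[3, "c"]]),
    pd.DataFrame([[4, "d"], [5, "e"]]),
]

combined = pd.concat(parts, axis=0, ignore_index=True)

# The join is correct if no rows were lost and no columns were added.
expected_rows = sum(len(part) for part in parts)
assert combined.shape == (expected_rows, parts[0].shape[1])
print(combined.shape)
```

If the column labels of the parts differed, `concat` would add columns filled with missing values and the assertion on the column count would fail.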

Finally, the concatenated data is saved as a CSV file.

In [None]:
LIAR.to_csv("data/LIAR.csv", sep=";")

# Sources

[7] https://www.politifact.com/article/2018/feb/12/principles-truth-o-meter-politifacts-methodology-i/