# Pre-Processing the articles

Goal: assign an id to each sentence of a text in order to "trap" hallucinations (Feldman et al. 2023).
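A minimal sketch of how such ids could be used later, assuming the model is asked to cite the ids of the sentences it relies on: any cited id that does not exist in the article's id dictionary points to a possible hallucination. The function name `find_unknown_ids` and the example data are hypothetical, and this is not necessarily the exact procedure of the cited paper.

```python
def find_unknown_ids(cited_ids, id_sentence_dict):
    """Return the cited ids that do not exist in the id -> sentence mapping."""
    return [cited for cited in cited_ids if cited not in id_sentence_dict]


# Hypothetical article: a headline plus two numbered sentences
article = {"headline": "Example headline", 1: "First sentence.", 2: "Second sentence."}

# Hypothetical model output citing sentence ids 1 and 7
print(find_unknown_ids([1, 7], article))  # [7]
```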

In [None]:
import pandas as pd
import os
from dotenv import load_dotenv
load_dotenv()

data_folder = os.getenv("DATA_FOLDER")
# The notebook was run for the 17 Aslett et al. F/M articles and their 1769 respective articles,
# and for the 10 True articles (control) and their respective articles ()
text_file_path = f"{data_folder}/True_Articles.xlsx"

#change the read function depending on source data type
df = pd.read_excel(text_file_path) 
#df = pd.read_csv(text_file_path, sep="|", on_bad_lines="skip")
df

Unnamed: 0,URL,Response,Wayback URL,Title,Body
0,www.washingtontimes.com/news/2021/jul/19/arizo...,,,Majority of Arizona Republicans believe electi...,\nMaricopa County ballots cast in the 2020 gen...
1,www.nbcnews.com/news/olympics/member-u-s-women...,,,"Kara Eaker, U.S. women's gymnastics alternate,...",U.S. women's gymnastics alternates Kara Eaker ...
2,www.dailywire.com/news/shock-nbc-poll-shows-am...,,,Shock NBC Poll Shows Americans Have ‘Lost Thei...,A whopping 71% of Americans believe the U.S. i...
3,www.newsmax.com/us/ama-medical-doctor-langauge...,,,AMA Document: Doctors Should Use Language 'Ins...,he American Medical Association on Thursday re...
4,www.huffpost.com/entry/steve-buscemi-30-rock-m...,,,Steve Buscemi Hands Out Candy Dressed As His O...,“Fargo” star Steve Buscemi handed out Hallowee...
5,crooksandliars.com/2021/11/major-ivermectin-st...,,,Ivermectin Study Retracted After Data Found To...,Remember all those studies that purportedly sh...
6,www.newsmax.com/headline/terry-mcauliffe-glenn...,,,McAuliffe Concedes Virginia Governor's Race,Democrat Terry McAuliffe conceded defeat in th...
7,news.yahoo.com/qanon-supporters-gather-over-th...,,,QAnon supporters gather over theory that JFK J...,Some supporters of the QAnon conspiracy gather...
8,bipartisanreport.com/2021/11/07/liz-cheney-app...,,,Liz Cheney Appears On ‘Fox Sunday’ To Hand Tru...,Rep. Liz Cheney (R-Wyo.) is continuing to face...
9,www.foxnews.com/politics/biden-approval-harris...,,,Nearly half of voters say Biden worse presiden...,With exactly one year until the midterm electi...
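If `DATA_FOLDER` is missing from the environment, `os.getenv` returns `None` and the f-string above silently produces a path starting with `None/`. A small sketch of a guard that fails early instead; the helper name `resolve_data_path` is hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path


def resolve_data_path(filename, env_var="DATA_FOLDER"):
    """Build the input path, failing early if the environment variable is unset."""
    folder = os.getenv(env_var)
    if not folder:
        raise RuntimeError(f"{env_var} is not set; check the .env file")
    return Path(folder) / filename


# Possible usage in the cell above (after load_dotenv()):
# text_file_path = resolve_data_path("True_Articles.xlsx")
```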


Further cleanup:

In [5]:
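#df.dropna(inplace=True)
# Note: rows are compared on all columns. If "Unnamed: 0" holds a unique row index (as it appears to here),
# no row can count as a duplicate; pass subset=[...] (e.g. the URL or Body column) to compare content only.
df.drop_duplicates(inplace=True)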
df

Unnamed: 0,URL,Response,Wayback URL,Title,Body
0,www.washingtontimes.com/news/2021/jul/19/arizo...,,,Majority of Arizona Republicans believe electi...,\nMaricopa County ballots cast in the 2020 gen...
1,www.nbcnews.com/news/olympics/member-u-s-women...,,,"Kara Eaker, U.S. women's gymnastics alternate,...",U.S. women's gymnastics alternates Kara Eaker ...
2,www.dailywire.com/news/shock-nbc-poll-shows-am...,,,Shock NBC Poll Shows Americans Have ‘Lost Thei...,A whopping 71% of Americans believe the U.S. i...
3,www.newsmax.com/us/ama-medical-doctor-langauge...,,,AMA Document: Doctors Should Use Language 'Ins...,he American Medical Association on Thursday re...
4,www.huffpost.com/entry/steve-buscemi-30-rock-m...,,,Steve Buscemi Hands Out Candy Dressed As His O...,“Fargo” star Steve Buscemi handed out Hallowee...
5,crooksandliars.com/2021/11/major-ivermectin-st...,,,Ivermectin Study Retracted After Data Found To...,Remember all those studies that purportedly sh...
6,www.newsmax.com/headline/terry-mcauliffe-glenn...,,,McAuliffe Concedes Virginia Governor's Race,Democrat Terry McAuliffe conceded defeat in th...
7,news.yahoo.com/qanon-supporters-gather-over-th...,,,QAnon supporters gather over theory that JFK J...,Some supporters of the QAnon conspiracy gather...
8,bipartisanreport.com/2021/11/07/liz-cheney-app...,,,Liz Cheney Appears On ‘Fox Sunday’ To Hand Tru...,Rep. Liz Cheney (R-Wyo.) is continuing to face...
9,www.foxnews.com/politics/biden-approval-harris...,,,Nearly half of voters say Biden worse presiden...,With exactly one year until the midterm electi...


In [None]:
df = df.rename(columns={'text': 'Body'}) #rename the full-text column to 'Body', if necessary (the cells below read df['Body'])

Use spaCy instead of NLTK for better sentence splitting.

In [6]:
#https://stackoverflow.com/questions/46290313/how-to-break-up-document-by-sentences-with-spacy
from spacy.lang.en import English

nlp = English()
nlp.add_pipe('sentencizer')

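def split_sentences(raw_text):
    """Split raw text into a list of stripped sentences; return "ERROR" on failure."""
    try:
        doc = nlp(raw_text)
        return [sent.text.strip() for sent in doc.sents]
    except Exception:  # e.g. non-string input such as NaN for a missing body
        return "ERROR"


def assign_ids(sentences, title):
    """Map 'headline' to the title and the ids 1..n to the sentences."""
    if sentences == "ERROR":  # compare with the marker itself, not with individual sentences
        return "INVALID INPUT"
    id_sentence_dict = {'headline': title}
    for sentence_id, sentence in enumerate(sentences, start=1):  # for our purposes, these ids are enough
        id_sentence_dict[sentence_id] = sentence
    return id_sentence_dict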
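To make the output format concrete, here is a small worked example of the numbering step. It is a pure-Python restatement without spaCy and without the error handling; the name `number_sentences` and the sample sentences are made up for illustration.

```python
def number_sentences(sentences, title):
    """Map 'headline' to the title and 1..n to the sentences, in order."""
    numbered = {"headline": title}
    for sentence_id, sentence in enumerate(sentences, start=1):
        numbered[sentence_id] = sentence
    return numbered


example = number_sentences(["It rained.", "The match was cancelled."], "Match cancelled")
print(example)
# {'headline': 'Match cancelled', 1: 'It rained.', 2: 'The match was cancelled.'}
```

The headline keeps a string key while sentences get integer keys starting at 1, so a reference to sentence 1 always means the first sentence of the body.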

Split, then assign ids:

In [9]:
df['id_body'] = df['Body'].apply(split_sentences) #apply this to the column with the full text
df['id_body'] = df.apply(lambda row: assign_ids(row['id_body'], row['Title']), axis=1)
df['id_body'] #16 invalid inputs across the 1769 entries (FM texts)

0    {'headline': 'Majority of Arizona Republicans ...
1    {'headline': 'Kara Eaker, U.S. women's gymnast...
2    {'headline': 'Shock NBC Poll Shows Americans H...
3    {'headline': 'AMA Document: Doctors Should Use...
4    {'headline': 'Steve Buscemi Hands Out Candy Dr...
5    {'headline': 'Ivermectin Study Retracted After...
6    {'headline': 'McAuliffe Concedes Virginia Gove...
7    {'headline': 'QAnon supporters gather over the...
8    {'headline': 'Liz Cheney Appears On ‘Fox Sunda...
9    {'headline': 'Nearly half of voters say Biden ...
Name: id_body, dtype: object
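The number of invalid inputs mentioned in the comment above can be counted rather than read off by eye. A minimal sketch, assuming the marker string is exactly `"INVALID INPUT"` as returned by `assign_ids`; the helper name `count_invalid` is hypothetical.

```python
INVALID = "INVALID INPUT"  # marker returned by assign_ids in this notebook


def count_invalid(values):
    """Count entries that are the invalid-input marker rather than an id dictionary."""
    return sum(1 for value in values if isinstance(value, str) and value == INVALID)


# Illustrative values standing in for the 'id_body' column
sample = [{"headline": "A", 1: "Text."}, INVALID, {"headline": "B"}]
print(count_invalid(sample))  # 1
```

Because the function only iterates over its argument, it should also accept the column directly, as in `count_invalid(df['id_body'])`.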

In [11]:
output_path = f"{data_folder}/TRUE_articles_pre_processed.csv" #or: article_contents_pre_processed.csv
df.to_csv(output_path, sep=";")
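One thing to keep in mind when reading this file back: a CSV cell can only hold text, so each id dictionary should come back as its string representation rather than as a `dict`. A minimal sketch of the round trip using `ast.literal_eval` from the standard library, assuming the cell contains what `str()` produces for the dictionary; the sample data is made up.

```python
import ast

numbered = {"headline": "Example headline", 1: "First sentence.", 2: "It's the second."}

# What ends up in the CSV cell: the dictionary's string representation
cell_text = str(numbered)

# Parse it back into a dict; literal_eval only accepts Python literals
restored = ast.literal_eval(cell_text)
print(restored == numbered)  # True
print(restored[2])           # It's the second.
```

Rows holding the plain marker `INVALID INPUT` are not valid Python literals and would need to be filtered out before parsing.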

## Pre-processing the example that will be included in the prompt (can be ignored during future runs of this notebook)

In [20]:
example_text_headline = "Three Russia Hoax Bombshells Hidden In IG Report On DOJ Surveillance Of Congress"
example_text_body = '''
By: Mollie Hemingway
December 17, 2024
9 min read
Sen. Adam Schiff
Image Credit
Sen. Adam Schiff/YouTube

These revelations show why the DOJ needs massive reform in the next administration.
Author Mollie Hemingway profile
Mollie Hemingway
Visit on Twitter
@mzhemingway
More Articles
Share

    Share Article on Facebook
    Share Article on Twitter

Share Article on Truth Social

    Share Article via Email

Last week the Department of Justice’s inspector general released a report on some of the DOJ’s tracking of communications from media and congressional figures as part of its purported investigation into who was leaking classified information against President Donald Trump in 2017. Three significant bombshells about the Russia collusion hoax were hidden inside the dense and dry 100-page report.

For context, when Trump won the 2016 presidential election, anonymous Democrat operatives in the federal government and Congress began leaking like sieves as part of a coordinated effort to paint Trump as a mastermind spy who had worked with Russian President Vladimir Putin for decades in order to steal the election.

Two Washington Post stories, a New York Times story, and a CNN story were all found to have included classified information. None of the four stories are specified in the report, but they all appeared in the first half of President Trump’s first year in office.

The first Washington Post story is likely the April 2017 story by Ellen Nakashima, Devlin Barrett, and Adam Entous revealing that DOJ had gotten a Foreign Intelligence Surveillance Act (FISA) warrant to spy on Carter Page, a Trump affiliate. The true story of that warrant would end up revealing the corruption of the DOJ, including how it falsified evidence in its application and relied on the laughable Steele dossier as the basis. But at the time of its publication, the FISA story suggested that an honorable DOJ had serious reason to suspect the Trump campaign of colluding with Russia to steal the election.

As outlandish and unhinged as the conspiracy theory was, it was fueled with daily drops of classified and deceptively packaged information designed to make it appear legitimate. The corporate media dutifully regurgitated, published, and aired the leaks as part of their campaign against the Republican president.

Nakashima, Barrett, and Entous were awarded a Pulitzer Prize for their perpetuation of the Russia collusion hoax in this and other stories. The leaks, which threatened national security and were intended to get Trump removed from office, threw the White House into chaos.

For years, polling has indicated that most Democrats continued to cling to the conspiracy theory as an explanation for Trump’s first presidential victory. The leakers have never been brought to justice.
Bombshell #1: A Democrat Whistleblower Identified Schiff, Swalwell As Leakers

One of the more surprising claims in the report was that a Democrat staffer on one of the congressional committees “voluntarily told the FBI” almost immediately after the investigation began in 2017 that he suspected two members of Congress and a number of Democrat staffers of being involved in the leaking of the classified information, leading to further investigation of those identified.

While the report doesn’t identify the whistleblower, his committee, or name the members of Congress, a 2021 New York Times story already identified then-Rep. Adam Schiff and Rep. Eric Swalwell, both of California, as the two congressmen on the House Permanent Select Committee on Intelligence (HPSCI) who were under investigation.

The DOJ report further notes that only these two members of Congress were investigated. Schiff was the top Democrat on HPSCI at the time its Republican chair Devin Nunes was engaged in painstaking efforts to reveal the Russia collusion hoax and many of its participants.

Both Schiff and Swalwell were notorious for going on left-wing media outlets such as CNN and MSNBC to push the Russia conspiracy theory. Schiff, now California’s junior senator, lied publicly for years about the matter, falsely claiming to have secret evidence substantiating the hoax. Schiff was widely suspected of leaking information to his allies in the press, or otherwise misrepresenting information from the committee.

Swalwell, for his part, famously had an intimate relationship with Communist Chinese spy “Fang Fang,” who had targeted him and other Democrats as part of a honey-trap operation. Despite these serious problems, both men served on HPSCI until former Speaker of the House Kevin McCarthy removed them in early 2023.

The whistleblower told the FBI he “suspected that Member 1 had previously leaked classified information and that Member 2 wanted to influence public opinion via the release of classified information.” However, the FBI said the whistleblower didn’t offer enough “direct evidence” of the suspected leaking.

The DOJ itself would go on to stonewall Nunes and Senate colleagues who were attempting to investigate DOJ’s lead role in the Russia collusion scam. Many of the top leadership at the FBI, including former Director James Comey and Deputy Director Andrew McCabe, were later unveiled as some of the worst leakers in government and leaders of the Russia collusion hoax. While they were removed from office, the Biden administration later paid some of them off.
Bombshell #2: A Top Democrat Staffer Was Caught Communicating With Three Reporters Who Published the Classified Info He Had Access To, But FBI Didn’t Think It Meant Much

The IG report says the whistleblower identified a top Democrat “staffer from the same committee” as a potential leaker. The report notes that DOJ “focused its investigation on the Senior Committee Staffer as the potential source of the leak” and that beyond being identified by the whistleblower, he was someone they “suspected of being the source of the unauthorized disclosure for other reasons as well.”

The DOJ was so interested in rooting out leaks of classified information, however, that it waited three full years to interview their main suspect and only after Attorney General Bill Barr apparently caused the investigation to be re-opened in 2020. One almost gets the sense that the DOJ wasn’t super-interested in stopping leaks that fed the Russia collusion hoax they ran.

The IG report doesn’t name the senior staffer, but another 2021 New York Times story identifies the suspected leaker. “[T]he leak investigation appeared to have been primarily focused on Michael Bahar, then a staff member on the House Intelligence Committee,” Russia collusion hoaxers Michael S. Schmidt and Charlie Savage wrote. “It remains unclear whether agents were pursuing a theory that Mr. Bahar had leaked on his own or whether they suspected him of talking to reporters with the approval of lawmakers. Either way, it appears they were unable to prove their suspicions that he was the source of any unauthorized disclosures; the case has been closed, and no charges were brought.”

It’s unclear if the DOJ was unable or, perhaps more likely, unwilling to go after participants in the Russia collusion hoax it helped run.

The IG report reveals that “[r]ecords showed that the Senior Committee Staffer visited the room where the classified material was made available to Members of Congress and congressional staff (Read Room) on at least one and possibly two occasions in early 2017, while still working for the committee.”

DOJ obtained his phone records and learned that they “showed that immediately before accessing the Read Room and continuing through shortly after publication of the articles containing the relevant classified information, the Senior Committee Staffer’s phone number was in contact with telephone numbers used by all three of the reporters who authored the articles that disclosed the classified information.”

Well, that seems like a big deal. But the senior staffer had the perfect alibi, at least in the view of our trusty FBI investigators. He explained to them that if they looked at more of his phone records, they’d see that he had been yakking it up with two of the three reporters long before they published the classified info he had access to.

They discovered that he had been talking to these reporters for a long time and therefore they … decided to drop the investigation with no charges. I’m sure that this explanation would have passed muster with the FBI if it were offered by a Republican.

Incidentally, the report keeps asserting the whistleblower had little foundation for his suspicions. In one case, he said he thought some individuals might be using their spouse’s phone to contact the media.

And in a December 2017 interview, he said he “overheard the Senior Committee Staffer tell other staffers that the Senior Committee Staffer would use their spouse’s cell phone to make calls, which the Committee Witness believed was intended to conceal the Senior Committee Staffer’s activity. However, the Committee Witness later admitted that they had little foundation for the belief that the Senior Committee Staffer used their spouse’s phone.”

Hunh? First off, the IG report said that in addition to the whistleblower’s claims, “the DOJ had other indicators that the Senior Committee Staffer and his spouse used each other’s accounts” and later reiterated they had “indicators that the Senior Committee Staffer and their spouse sometimes used each other’s accounts.”

Did the whistleblower later deny his own claim about what he heard the senior staffer tell people? Or is the DOJ simply dismissing eyewitness testimony it wished didn’t exist as having “little foundation?”
Bombshell #3: DOJ Spied On As Many Republicans As Democrats When Investigating Democrat Leaks

The IG report shows a surprisingly high number of congressional staff had their communications monitored secretly by the DOJ as part of the investigation into who was leaking classified information to hurt Republicans. A look into who the DOJ was monitoring suggests the investigation was never done in good faith.

Of the 43 congressional staffers who were monitored, 21 worked for Democrats and 20 worked for Republicans. Another two worked in nonpartisan positions.

At the time that the leak investigation began, Republicans on the HPSCI were famously battling against the Russia collusion conspiracy theory while Democrats were loudly pushing it. Yet the investigators decided that they’d surveil Republicans and Democrats equally because the lead career prosecutor said “because the leakers’ motivations are unknown, prosecutors must explore all possibilities and cannot assume political motives one way or the other.” This career prosecutor added that the Carter Page FISA story “was a good example of that principle because both parties had potential political motivations to leak the information.”

In fact, the Carter Page FISA story was transparently designed to cast suspicion on Carter Page, and by extension, the entire Trump campaign, as being Russian agents. Not a single rational person in the world thought that the leak came from the Republicans on HPSCI or other committees trying to get the truth of the Russia collusion hoax out.

The decision to use the leak investigation as a pretext to dig into those Republican staffers’ communications instead of tenaciously targeting Democrat leakers or taking the deluge of leaks from the DOJ itself seriously is a great example of why the DOJ needs massive reform in the next administration.
'''

In [None]:
# example_text_ids = assign_ids(split_sentences(example_text_body), example_text_headline)
# example_text_ids

{'headline': 'Three Russia Hoax Bombshells Hidden In IG Report On DOJ Surveillance Of Congress',
 1: 'By: Mollie Hemingway\nDecember 17, 2024\n9 min read\nSen. Adam Schiff\nImage Credit\nSen. Adam Schiff/YouTube\n\nThese revelations show why the DOJ needs massive reform in the next administration.',
 2: 'Author Mollie Hemingway profile\nMollie Hemingway\nVisit on Twitter\n@mzhemingway\nMore Articles\nShare\n\n    Share Article on Facebook\n    Share Article on Twitter\n\nShare Article on Truth Social\n\n    Share Article via Email\n\nLast week the Department of Justice’s inspector general released a report on some of the DOJ’s tracking of communications from media and congressional figures as part of its purported investigation into who was leaking classified information against President Donald Trump in 2017.',
 3: 'Three significant bombshells about the Russia collusion hoax were hidden inside the dense and dry 100-page report.',
 4: 'For context, when Trump won the 2016 presidential
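The output above shows the byline and share-widget lines merged into sentences 1 and 2, because those lines contain no sentence-ending punctuation. One possible mitigation, assuming line breaks in scraped text mark real boundaries, is to split on newlines first and only then split each line into sentences. In the sketch below, `naive_splitter` is a regex stand-in so the example runs without spaCy; in the notebook, `split_sentences` could be passed instead. Both helper names are hypothetical, and this changes the ids relative to the runs reported here.

```python
import re


def naive_splitter(text):
    """Stand-in sentence splitter: split after '.', '!' or '?' followed by whitespace."""
    return [part.strip() for part in re.split(r"(?<=[.!?])\s+", text) if part.strip()]


def split_lines_then_sentences(raw_text, splitter=naive_splitter):
    """Split on line breaks first, then split each non-empty line into sentences."""
    sentences = []
    for line in raw_text.splitlines():
        line = line.strip()
        if line:
            sentences.extend(splitter(line))
    return sentences


sample = "By: Jane Doe\n9 min read\n\nFirst sentence. Second sentence."
print(split_lines_then_sentences(sample))
# ['By: Jane Doe', '9 min read', 'First sentence.', 'Second sentence.']
```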

In [None]:
# example_path = "example_text.txt"

# with open(example_path, 'w', encoding="utf-8") as f:
#     results = f.write(str(example_text_ids))
