## Headlines Database

In [1]:
import requests
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm
import pandas as pd
import itertools
import re

### Fraud Headlines

To create a dataset of fraud headlines, we scrape the yearly headline archives on the [ACFE](https://www.acfe.com/default.aspx) (Association of Certified Fraud Examiners) website.

<u>Parameters</u>

In [2]:
from_ = 2012
to_ = 2022

In [3]:
archives = [f"https://www.acfe.com/fraud-headlines-{year}.aspx" for year in range(
    from_, to_)] + ["https://www.acfe.com/fraud-headlines.aspx"]
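
Because `range(from_, to_)` stops before `to_`, the list above holds the yearly archives for 2012 through 2021, with the undated page appended at the end. A quick sanity check, reusing the same parameter values, might look like this (a sketch only; it builds the URLs and makes no network requests):

```python
from_ = 2012
to_ = 2022

# range() excludes its upper bound, so the yearly pages cover 2012..2021
archives = [
    f"https://www.acfe.com/fraud-headlines-{year}.aspx"
    for year in range(from_, to_)
] + ["https://www.acfe.com/fraud-headlines.aspx"]  # undated page, appended last

print(len(archives))   # number of pages to scrape
print(archives[0])     # first yearly archive
print(archives[-2])    # last yearly archive
```

To include the 2022 yearly archive as well, `to_` would have to be set to 2023, assuming such a page exists on the site.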

In [4]:
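def scrape_website(url):
    # Download the page and parse it into a BeautifulSoup tree
    result = requests.get(url)
    c = result.content
    # Name the parser explicitly to avoid bs4's "no parser specified" warning
    return BeautifulSoup(c, "html.parser")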

In [5]:
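def parse_titles(soup):
    titles = []
    # Headlines are the <strong> tags inside the main content widget
    headlines = soup.find("div", {
                        "id": "ctl00_MainContent_DropZone1_uxColumnDisplay_ctl00_uxControlColumn_ctl00_uxWidgetHost_uxWidgetHost_widget_CB"}).find_all('strong')
    for h in headlines:
        # Collect every text node under the tag (find_all/string replace the deprecated findAll/text)
        titles.append(h.find_all(string=True))
    titles = ["".join(l) for l in titles]
    # Drop empty strings, then de-duplicate (set() does not preserve order)
    titles = list(filter(None, titles))
    return list(set(titles))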
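
To see what the extraction step does without hitting the live site, the same idea can be tried on a small inline fragment. This is a minimal sketch: the HTML and the `headlines` id are made up for illustration (the real container id is the long one used above), and `get_text()` stands in for joining the text nodes by hand.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, not taken from the ACFE site
html = """
<div id="headlines">
  <p><strong>Bank Fraud <em>Rises</em></strong></p>
  <p><strong></strong></p>
  <p><strong>Bank Fraud <em>Rises</em></strong></p>
  <p><strong>Insider Trading Case Opens</strong></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", {"id": "headlines"})

titles = []
if container is not None:  # find() returns None when the id is absent
    for strong in container.find_all("strong"):
        text = strong.get_text()  # concatenates all nested text nodes
        if text:                  # skip empty <strong> tags
            titles.append(text)

# De-duplicate; sorting makes the result order deterministic
unique_titles = sorted(set(titles))
print(unique_titles)
```

The `None` check matters if a page ever lacks the expected container, in which case chaining `.find_all` directly onto `soup.find(...)` would raise an `AttributeError`.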

In [6]:
titles = []
for url in tqdm(archives):
    soup = scrape_website(url)
    title = parse_titles(soup)
    titles.append(title)
titles = list(itertools.chain.from_iterable(titles))

In [7]:
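df_fraud = pd.DataFrame(titles, columns=["text"])
# Remove non-breaking spaces left over from the HTML
df_fraud["text"] = df_fraud["text"].str.replace("\xa0", "", regex=False)
# Label every fraud headline as the positive class
df_fraud["target"] = 1
# save to csv
df_fraud.to_csv(r"data\fraud_news.csv")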

### Classic News

The dataset comes from [Kaggle](https://www.kaggle.com/gennadiyr/us-equities-news-data).

It is a historical news archive covering the last 12 years for US equities publicly traded on NYSE/NASDAQ that still trade above $10 per share.

In [8]:
df_fin = pd.read_csv(r"data\stock_news.csv", encoding='latin-1')[["title"]]
df_fin = df_fin.rename(columns={"title":"text"})
df_fin = df_fin.sample(n=len(df_fraud)) # undersample the majority class
df_fin["target"] = 0
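
The undersampling step draws a different random subset on every run. If reproducibility matters, one option is to pass a fixed seed to `sample`. The sketch below uses toy frames in place of `df_fraud` and the Kaggle file; the `_demo` names, the rows, and the `random_state` value are all assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for df_fraud and the Kaggle titles (made-up rows)
df_fraud_demo = pd.DataFrame({"text": ["fraud a", "fraud b"], "target": 1})
df_fin_demo = pd.DataFrame({"title": [f"news {i}" for i in range(10)]})

df_fin_demo = df_fin_demo.rename(columns={"title": "text"})
# random_state is an addition here so the same rows are drawn on every run
df_fin_demo = df_fin_demo.sample(n=len(df_fraud_demo), random_state=42)
df_fin_demo["target"] = 0  # negative class

print(len(df_fraud_demo), len(df_fin_demo))
```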

### Merge dataframes

In [9]:
df_data = pd.concat([df_fraud, df_fin])
df_data = df_data.sample(frac=1).reset_index(drop=True)
# save to csv
df_data.to_csv(r"data\raw_dataset.csv")
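
After the merge, both classes should be the same size and the shuffled frame should carry a fresh 0..n-1 index. A small check along these lines, again on made-up toy frames rather than the real data, could be:

```python
import pandas as pd

# Toy stand-ins for the two labelled frames (made-up rows)
df_a = pd.DataFrame({"text": ["f1", "f2", "f3"], "target": 1})
df_b = pd.DataFrame({"text": ["n1", "n2", "n3"], "target": 0})

df_demo = pd.concat([df_a, df_b])
# frac=1 shuffles all rows; reset_index(drop=True) discards the old index
df_demo = df_demo.sample(frac=1, random_state=0).reset_index(drop=True)

counts = df_demo["target"].value_counts()
print(counts.to_dict())
```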

In [10]:
df_data

| | text | target |
|---|---|---|
| 0 | How Banks Fight Fraud In Electronic Banking | 1 |
| 1 | Nicolas Sarkozy, Ex-President Of France, Faces... | 1 |
| 2 | Medicare Fraud Is Often Cloaked As ‘Free’ Serv... | 1 |
| 3 | U S sees role for India in battle against Isl... | 0 |
| 4 | IBM seeks 167 million from Groupon in dispute... | 0 |
| ... | ... | ... |
| 2633 | China Injects $9.7 Billion Into Anbang After F... | 1 |
| 2634 | Watchdog Claims Billions In Covid Stimulus Fraud | 1 |
| 2635 | Potential Flat 35 Loan Fraud — Another Low Blo... | 1 |
| 2636 | Help Elderly Parents Fight Financial Fraud | 1 |
