# In this demo, we walk through how to retrieve and use our data

As described in the paper, we discovered press release/article pairs in two directions: __forwards__ (i.e., crawling news articles and discovering hyperlinks to press releases) and __backwards__ (i.e., crawling press releases and querying a backlink service to discover news articles).

We now show how to access the data collected from each direction.

## Backwards Direction: Press Releases $\rightarrow$ News Articles (Backlinks)

The data for this direction is primarily stored in three files under `data/press_release_to_article_data/`: `full-source-scored-data.jsonl.gz`, `all-coref-resolved` (a Hugging Face dataset directory) and `article-to-pr-mapper.csv.gz`.

They can be accessed from this [Google Drive Link](https://drive.google.com/drive/folders/11IpwmFKuFn7LryUHW1df1fcJ2RmFUub1?usp=drive_link).

This part of the dataset is far more comprehensive, clean and balanced, so we recommend starting with it. The other half of the dataset (Forwards Direction) is a little less clean, but it contains many more articles that could serve as background articles, so it might be suited to a different set of tasks.

In [64]:
from datasets import load_from_disk
import pandas as pd
import numpy as np 
from tqdm.auto import tqdm
data_dir = 'data'
source_df = pd.read_json(f'{data_dir}/press_release_to_article_data/full-source-scored-data.jsonl.gz', lines=True, compression='gzip', nrows=5000)
article_d = load_from_disk(f'{data_dir}/press_release_to_article_data/all-coref-resolved')


In [40]:
article_to_pr_mapper = pd.read_csv('data/press_release_to_article_data/article-to-pr-mapper.csv.gz', index_col=0)

In [49]:
print(f"Number of news articles: {article_to_pr_mapper['URL'].drop_duplicates().shape[0]}")
print(f"Number of press releases: {article_to_pr_mapper['Target URL'].drop_duplicates().shape[0]}")

Number of news articles: 587464
Number of press releases: 176777
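
Because `article_to_pr_mapper` holds one row per article/press-release link, grouping on `Target URL` gives the number of distinct news articles that cover each press release. The following is a minimal sketch on a toy frame that reuses the two column names above; the URLs are made up for illustration.

```python
import pandas as pd

# Toy stand-in for article_to_pr_mapper; all URLs here are hypothetical.
toy_mapper = pd.DataFrame({
    "URL": [
        "https://news.example/a1",
        "https://news.example/a2",
        "https://news.example/a2",  # the same link listed twice
        "https://news.example/a3",
    ],
    "Target URL": [
        "https://pr.example/r1",
        "https://pr.example/r1",
        "https://pr.example/r1",
        "https://pr.example/r2",
    ],
})

# Distinct news articles per press release, most-covered first.
articles_per_pr = (
    toy_mapper
    .groupby("Target URL")["URL"]
    .nunique()
    .sort_values(ascending=False)
)
print(articles_per_pr)
```

Running the same `groupby` on the real mapper should show how skewed coverage is across press releases.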


In [55]:
article_to_pr_mapper['company_name'].value_counts().head(20)

company_name
corporate                39823
newscorp                 39391
ice                      39382
go_factset               38991
usa_visa                 33739
moodys                   25340
mastercard               24385
bhhs                     20043
paramount                19533
pfizer                   17191
fisglobal                15572
sec                      13812
stories_starbucks        11429
prnewswire               11256
raymondjames             11079
verizon                  10820
progressive_mediaroom    10650
blackrock                10592
about_fb                  9912
tesla                     9781
Name: count, dtype: int64

In [60]:
article_to_pr_mapper['news_url_domain'].value_counts().head(20)

news_url_domain
businessinsider      80606
cbsnews              64546
cnbc                 52562
pbs                  44027
fortune              43387
wsj                  40784
forbes               30708
independent          22303
ft                   21140
propublica           18303
nytimes              18038
business-standard    15320
progressive          10655
prweb                10424
cnn                   9799
marketwatch           9526
thestreet             9482
washingtonpost        8472
theguardian           7170
patch                 7127
Name: count, dtype: int64

`article_d` is a large set of articles (in a Hugging Face `Dataset` object), along with metadata that we computed for each one:

In [66]:
article_d

Dataset({
    features: ['article_url', 'target_timestamp_key', 'target_timestamp', 'sort_criteria', 'wayback_url', 'wayback_timestamp', 'method', 'links', 'article_text', 'word_lists', 'sent_lists', 'best_class', 'coref_resolved_sents'],
    num_rows: 496380
})

A central goal of the paper was to examine the sources used by human journalists and assess how LLMs might select these sources:

![alt text](images/model-criticism-emnlp-paper.png "Title") 

Now, let's examine a single article and how we can look at the sources inside this article:

In [67]:
a_urls = set(source_df['article_url'])
filtered_article_d = article_d.filter(lambda x: x['article_url'] in a_urls, num_proc=10)
filtered_article_df = filtered_article_d.to_pandas()


Here, we join each article's sentences to their source annotations and keep the attribution only for sentences with a known quote type:

In [71]:
# Quote types that do not correspond to an attributable source
disallowed_quote_types = {'Other', 'Background/Narrative', 'No Quote'}

# One row per sentence: the three list columns are exploded together
sentences_with_quotes = (
    filtered_article_d
        .to_pandas()
        .merge(source_df, on='article_url')
        [['article_url', 'attributions', 'quote_type', 'sent_lists']]
        .explode(['attributions', 'quote_type', 'sent_lists'])
)

# Blank out the attribution wherever the sentence has a disallowed quote type
sentences_with_quotes = (
    sentences_with_quotes
        .assign(attributions=lambda df:
            df.apply(lambda x: x['attributions'] if x['quote_type'] not in disallowed_quote_types else np.nan, axis=1)
        )
)
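
To make the reshaping step concrete, here is a small sketch of the same explode-and-mask pattern on a single toy article. It assumes pandas 1.3 or later (needed for multi-column `explode`), uses a vectorized `.where` instead of the row-wise `apply`, and the label `Direct Quote` and all text in it are hypothetical.

```python
import pandas as pd

disallowed_quote_types = {"Other", "Background/Narrative", "No Quote"}

# One toy article with three parallel lists, one entry per sentence.
# "Direct Quote" is a hypothetical label used only for illustration.
toy = pd.DataFrame({
    "article_url": ["https://news.example/a1"],
    "sent_lists": [["'We are pleased,' Doe said.", "The company was founded in 1999."]],
    "attributions": [["Jane Doe", "passage"]],
    "quote_type": [["Direct Quote", "No Quote"]],
})

# explode() turns each list element into its own row; ignore_index gives
# the result a fresh 0..n-1 index.
rows = toy.explode(["sent_lists", "attributions", "quote_type"], ignore_index=True)

# Keep the attribution only when the quote type is allowed; otherwise NaN.
is_quote = ~rows["quote_type"].isin(disallowed_quote_types)
rows["attributions"] = rows["attributions"].where(is_quote)
print(rows[["sent_lists", "attributions"]])
```

The first sentence keeps its attribution, and the second is set to `NaN` because its quote type is in the disallowed set.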

We examine a single article (we will show in later sections how to extract sources from this article):

In [73]:
one_article = (
    sentences_with_quotes
        # Use the second article URL in the frame as a worked example
        .loc[lambda df: df['article_url'] == df['article_url'].unique()[1]]
        .reset_index(drop=True)
)

# Two serializations of the same (sentence, attribution) pairs: TSV and JSON lines
doc_str = one_article[['sent_lists', 'attributions']].to_csv(sep='\t', index=False)
json_str = one_article[['sent_lists', 'attributions']].to_json(lines=True, orient='records')

In [75]:
one_article[['sent_lists', 'attributions']]

| | sent_lists | attributions |
|---|---|---|
| 0 | Based on a cohesive and committed philanthropi... | |
| 1 | The Socrata Foundation will proactively suppor... | |
| 2 | We're determined to support these organization... | Socrata Foundation |
| 3 | We believe that the innovative use of open dat... | Socrata Foundation |
| 4 | Adds Robert Runge, a member of Socrata's Board... | Robert Runge |
| 5 | Making Data-Driven Government in Detroit a Rea... | |
| 6 | Detroit Mayor Mike Duggan wants to show reside... | |
| 7 | That's why he turned to data-driven government... | Mike Duggan |
| 8 | Detroit knew that the transparency, accountabi... | Mike Duggan wants to show residents in his cit... |
| 9 | So we asked the Socrata Foundation to remove t... | Mike Duggan |
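
The JSON-lines string built above (`json_str`) can also be regrouped by source without pandas. The following sketch uses three abbreviated records copied from the table; the helper name `sentences_by_source` is ours and not part of the repository.

```python
import json
from collections import defaultdict

# Three records in the shape produced by to_json(lines=True, orient='records').
# Missing attributions are serialized as null. The text is truncated as in the table.
json_lines = "\n".join([
    json.dumps({"sent_lists": "The Socrata Foundation will proactively suppor...", "attributions": None}),
    json.dumps({"sent_lists": "We're determined to support these organization...", "attributions": "Socrata Foundation"}),
    json.dumps({"sent_lists": "That's why he turned to data-driven government...", "attributions": "Mike Duggan"}),
])


def sentences_by_source(jsonl: str) -> dict:
    """Map each attributed source to the list of sentences attributed to it."""
    grouped = defaultdict(list)
    for line in jsonl.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        source = record.get("attributions")
        if source:  # skip sentences with no attribution (null)
            grouped[source].append(record["sent_lists"])
    return dict(grouped)


by_source = sentences_by_source(json_lines)
print(sorted(by_source))
```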


## Forwards Direction: Press Releases $\leftarrow$ News Articles (Hyperlinks)

The data for this direction is primarily stored in a SQLite3 database: `data/article_to_press_release_data.db.tar.gz`. Please download it and run `tar -xvzf ...` to untar it.

It can be accessed from this [Google Drive link](https://drive.google.com/drive/folders/1CmJkwpbV84pYaEMtNWhWkzG-B4_3hXR8?usp=sharing).
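
If you prefer to stay inside Python rather than calling `tar`, the standard-library `tarfile` module can unpack the archive. This is a minimal sketch; `extract_archive` is a hypothetical helper name, and the example path assumes you saved the download under `data/`.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path: str, dest_dir: str) -> list:
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        names = archive.getnames()
        archive.extractall(path=dest_dir)
    return names


# Example call (adjust the path to wherever you saved the download):
# extract_archive("data/article_to_press_release_data.db.tar.gz", "data")
```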

In [9]:
import sqlite3
import pandas as pd 

In [7]:
con = sqlite3.connect('data/article_to_press_release_data.db')

In [10]:
all_tables = """SELECT * FROM sqlite_master WHERE type='table';"""
pd.read_sql(all_tables, con=con)

| | type | name | tbl_name | rootpage | sql |
|---|---|---|---|---|---|
| 0 | table | article_map | article_map | 2 | `CREATE TABLE "article_map" (\n"target_article_...` |
| 1 | table | article_ents | article_ents | 1580014 | `CREATE TABLE "article_ents" (\n"common_crawl_u...` |
| 2 | table | article_to_href | article_to_href | 1585745 | `CREATE TABLE "article_to_href" (\n"article_url...` |
| 3 | table | article_data | article_data | 1536716 | `CREATE TABLE "article_data" (\n"common_crawl_u...` |
| 4 | table | press_release_data | press_release_data | 1537952 | `CREATE TABLE "press_release_data" (\n"article_...` |
| 5 | table | more_article_data | more_article_data | 3136060 | `CREATE TABLE "more_article_data" (\n"article_u...` |


`article_data` is a large table (938,880 rows) of news articles, along with metadata that we extracted from the HTML:

In [21]:
pd.read_sql("SELECT COUNT(1) from article_data", con=con)

| | COUNT(1) |
|---|---|
| 0 | 938880 |


Many of these articles are background articles, related articles, or articles that do not link directly to press releases. Filtering on `is_press_release_article = 1` returns only the articles with hyperlinks to known press release websites:

In [20]:
pd.read_sql("SELECT canonical_domain, canonical_timestamp, article_text from article_data where is_press_release_article = 1 limit 5", con=con)

| | canonical_domain | canonical_timestamp | article_text |
|---|---|---|---|
| 0 | `https://www.washingtonpost.com/technology/2022...` | 2022-07-29 11:00:40.529000+00:00 | Democracy Dies in Darkness Innovations Slowly ... |
| 1 | `https://www.washingtonpost.com/business/2021/0...` | 2021-08-20 11:00:00.084000+00:00 | Democracy Dies in Darkness Business Booming bu... |
| 2 | `https://www.washingtonpost.com/business/techno...` | 2012-10-16 04:44:41+00:00 | `Subscribe Account Profile Newsletters &amp...` |
| 3 | `https://www.washingtonpost.com/business/techno...` | 2013-10-28 11:26:01+00:00 | `Subscribe Account Profile Newsletters &amp...` |
| 4 | `https://www.washingtonpost.com/business/techno...` | 2012-02-04 02:57:17+00:00 | `var siteContext = '/rw/sites/twpweb', sectionC...` |


`press_release_data` contains the text of press releases:

In [36]:
pd.read_sql("SELECT COUNT(1) from press_release_data", con=con)

| | COUNT(1) |
|---|---|
| 0 | 75528 |


In [28]:
pd.read_sql("SELECT article_url, target_timestamp, data from press_release_data limit 100", con=con).drop_duplicates('data').sample(5)

| | article_url | target_timestamp | data |
|---|---|---|---|
| 53 | `http://www.globenewswire.com/newsroom/prs/?pkg...` | 2013-01-03 20:15:00+00:00 | `\nWe will keep fighting for all libraries - s...` |
| 55 | `https://www.globenewswire.com/NewsRoom/Attachm...` | 2018-01-03 21:57:00+00:00 | `\nEnglish \nFrançais \nRegister \nSign In \nAb...` |
| 26 | `https://www.globenewswire.com/Tracker?data=7zN...` | 2018-03-22 10:45:00+00:00 | `\nEnglish \nFrançais \nRegister \nSign In \nTh...` |
| 29 | `https://www.globenewswire.com/NewsRoom/Attachm...` | 2018-03-22 10:45:00+00:00 | `\nEnglish \nFrançais \nRegister \nSign In \nAb...` |
| 49 | `http://www.globenewswire.com/newsroom/ctr?d=10...` | 2013-01-03 20:15:00+00:00 | `\nEasily Send & Share Press Releases \n(800) 3...` |


The `article_to_href` table contains the mapping between articles and hyperlinks. Its `is_press_release` flag notes whether the hyperlink points to a known press release website:

In [27]:
pd.read_sql("SELECT * from article_to_href where is_press_release = 1 limit 5", con=con)

| | article_url | href | text | char_start_idx | char_end_idx | is_press_release | source |
|---|---|---|---|---|---|---|---|
| 0 | `com,washingtonpost)/technology/2022/07/29/com,...` | `https://news.mit.edu/2021/fiber-battery-longes...` | batteries | 1393 | 1404 | 1 | `../data/open-sourced` |
| 1 | `com,washingtonpost)/technology/2022/07/29/com,...` | `https://sam.gov/opp/8414c81ebfd445a8a491e91a7e...` | announced | 1562 | 1573 | 1 | `../data/open-sourced` |
| 2 | `com,washingtonpost)/business/2021/08/20/com,wa...` | `https://www.bls.gov/news.release/cesan.nr0.htm` | spent $4,643 | 8839 | 8853 | 1 | `../data/open-sourced` |
| 3 | `com,washingtonpost)/business/economy/federal-r...` | `http://www.cafepress.com/washingtonpost` | Post Store | 3502 | 3514 | 1 | `../data/open-sourced` |
| 4 | `com,washingtonpost)/business/economy/federal-r...` | `http://www.expressnightout.com` | Express | 3862 | 3871 | 1 | `../data/open-sourced` |


You can join the three tables to pair news article text with press release text like this:

In [34]:
pd.read_sql("""
    SELECT a.article_text as news_article_text, p.data as press_release_text
    FROM article_data a 
    JOIN article_to_href h ON h.article_url = a.common_crawl_url
    JOIN press_release_data p ON h.href = p.article_url
    LIMIT 10
""", con=con)

| | news_article_text | press_release_text |
|---|---|---|
| 0 | Democracy Dies in Darkness Innovations Slowly ... | `\nSkip to content ↓ \nMassachusetts Institute ...` |
| 1 | Democracy Dies in Darkness Innovations Slowly ... | `\n` |
| 2 | Democracy Dies in Darkness Business Booming bu... | `\nSkip to Content \nUS Department of Labor \nA...` |
| 3 | Real Estate Rentals Cars Today's Paper Go... | `\nCart & Checkout \nHelp \nOrder Status \nShop...` |
| 4 | Real Estate Rentals Cars Today's Paper Go... | `\n504 Gateway Time-out \nnginx/1.19.5 \n` |
| 5 | Real Estate Rentals Cars Today's Paper Go... | `\nCart & Checkout \nHelp \nOrder Status \nShop...` |
| 6 | Real Estate Rentals Cars Today's Paper Go... | `\n504 Gateway Time-out \nnginx/1.19.5 \n` |
| 7 | Real Estate Rentals Cars Today's Paper Go... | `\nCart & Checkout \nHelp \nOrder Status \nShop...` |
| 8 | Real Estate Rentals Cars Today's Paper Go... | `\n504 Gateway Time-out \nnginx/1.19.5 \n` |
| 9 | Real Estate Rentals Cars Today's Paper Go... | `\nCart & Checkout \nHelp \nOrder Status \nShop...` |
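
Several rows above pair an article with an empty page, an error page or a shop page, and some pairs repeat. One possible refinement is to filter on `is_press_release`, deduplicate with `DISTINCT`, and drop very short press release texts. The sketch below runs on toy in-memory tables that reuse the column names from the query above; the rows and the `min_chars` threshold are assumptions. As the `article_to_href` sample shows, the flag alone will not remove every non-press-release link.

```python
import sqlite3

# Toy tables that mirror only the columns used in the join above.
toy_con = sqlite3.connect(":memory:")
toy_con.executescript("""
    CREATE TABLE article_data (common_crawl_url TEXT, article_text TEXT);
    CREATE TABLE article_to_href (article_url TEXT, href TEXT, is_press_release INTEGER);
    CREATE TABLE press_release_data (article_url TEXT, data TEXT);

    INSERT INTO article_data VALUES ('news/1', 'Article about a battery.');
    INSERT INTO article_to_href VALUES
        ('news/1', 'pr/1', 1),
        ('news/1', 'pr/1', 1),      -- same link appearing twice
        ('news/1', 'shop/1', 0),    -- not a press release link
        ('news/1', 'pr/2', 1);
    INSERT INTO press_release_data VALUES
        ('pr/1', 'A long enough press release body about a new fiber battery.'),
        ('shop/1', 'Cart & Checkout'),
        ('pr/2', '504 Gateway Time-out');
""")

min_chars = 40  # hypothetical cutoff for "too short to be a real press release"

pairs = toy_con.execute(
    """
    SELECT DISTINCT a.article_text AS news_article_text, p.data AS press_release_text
    FROM article_data a
    JOIN article_to_href h ON h.article_url = a.common_crawl_url
    JOIN press_release_data p ON h.href = p.article_url
    WHERE h.is_press_release = 1
      AND LENGTH(TRIM(p.data)) >= ?
    """,
    (min_chars,),
).fetchall()
print(pairs)
```

On the real database, the same query can be passed to `pd.read_sql` with `params=(min_chars,)`.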


# Finding Contrasting Press Releases

In our paper, we described a method to find news articles that provided "critical coverage" of press releases. This method used aggregated sentence-level NLI scores on the document level.

![alt text](images/doc-level-nli.png "Title")

For more information, see `src/assess_factual_consistency.py`. However, we recommend a different, more modern approach, possibly one that uses LLMs to distill a BERT-based classification model.
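
To illustrate what aggregating sentence-level NLI scores to the document level can look like, here is one possible scheme. It is an assumption made for illustration and not necessarily the aggregation used in the paper or in `src/assess_factual_consistency.py`. It takes a matrix of contradiction probabilities (press release sentences by article sentences), keeps the strongest contradiction for each press release sentence, and averages the top `top_k` of those.

```python
def doc_level_score(pair_scores, top_k=3):
    """Aggregate sentence-pair contradiction scores into one document score.

    pair_scores[i][j] is assumed to be the probability that press release
    sentence i is contradicted by article sentence j.
    """
    # Strongest contradiction found for each press release sentence.
    per_sentence = [max(row) for row in pair_scores if row]
    if not per_sentence:
        return 0.0
    # Average the top_k most-contradicted press release sentences.
    top = sorted(per_sentence, reverse=True)[:top_k]
    return sum(top) / len(top)


# Made-up scores for 3 press release sentences x 3 article sentences.
scores = [
    [0.10, 0.90, 0.20],
    [0.05, 0.10, 0.00],
    [0.30, 0.20, 0.60],
]
print(doc_level_score(scores, top_k=2))
```

A higher score would mark an article as a candidate for critical coverage. Any threshold should be tuned on labeled pairs.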

# Extracting Sources

We extract sources in human-written news articles by running llama3.1. You can test locally using Ollama: https://ollama.com/. Follow the link to install the program.

This can be useful for trying things out before we work out compute access.

Run this in your terminal:

`ollama run llama3`

Even the 8B Llama 3 model is quite capable. The 70B model is better, but it may not run on your local computer.

In [111]:
import requests
import pyperclip

In [None]:
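# Copy one serialization to the clipboard for pasting into a chat UI.
# Each call replaces the clipboard contents, so run only the one you need.
pyperclip.copy(doc_str)
pyperclip.copy(json_str)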

In [105]:
r = requests.post(
    'http://localhost:11434/api/generate', 
    json = {
        "model": "llama3",
        "prompt":f"""
            Here is a news article, with each sentence annotated according to the source of its information:
            ```
            {json_str}
            ```

            Please summarize each of our source annotations. Tell me in one paragraph per source: (1) who the source is (2) what informational content they provide to the article. 
            Only rely on the annotations I have provided, don't identify additional sources.
        """,
        "stream": False
})

In [106]:
print(r.json()['response'])

**Socrata Foundation**: The Socrata Foundation is a 501(c)(3) organization that provides information about its philanthropic philosophy and mission. They describe their support for unique organizations that lack resources or financial means to fulfill their data-driven missions. They also mention the importance of open data in removing barriers to social justice and economic progress.

**Robert Runge**: Robert Runge, a member of Socrata's Board of Directors, provides additional context about the Socrata Foundation's purpose and how it bridges the gap between publicly funded open data projects and underfunded or unfunded opportunities.

**Mike Duggan**: Detroit Mayor Mike Duggan shares his perspective on why he turned to data-driven government in Detroit. He highlights the importance of transparency, accountability, and fact-based decision-making enabled by open data. He also explains how the Socrata Foundation helped Detroit gain access to the necessary technology and infrastructure.
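
If you want to run this prompt over many articles, it can help to wrap the request in small functions. The sketch below uses only the standard library and the same endpoint, model name and payload fields as the request above. The helper names `build_payload` and `summarize_sources` and the `timeout` value are our own choices, and the second function needs a local Ollama server to be running.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # same endpoint as above


def build_payload(annotated_article: str, model: str = "llama3") -> dict:
    """Build the JSON body for a non-streaming generate request."""
    prompt = (
        "Here is a news article, with each sentence annotated according to "
        "the source of its information:\n"
        f"{annotated_article}\n\n"
        "Please summarize each of our source annotations. Tell me in one "
        "paragraph per source: (1) who the source is (2) what informational "
        "content they provide to the article.\n"
        "Only rely on the annotations I have provided, don't identify "
        "additional sources."
    )
    return {"model": model, "prompt": prompt, "stream": False}


def summarize_sources(annotated_article: str, model: str = "llama3", timeout: float = 120.0) -> str:
    """Send the prompt to a local Ollama server and return the generated text."""
    body = json.dumps(build_payload(annotated_article, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    # urlopen raises an exception on HTTP error statuses and on timeouts.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]


# Example call (requires `ollama run llama3` to be running locally):
# print(summarize_sources(json_str))
```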

