# Process

# Inspect the data

The data were obtained from the official Guardian API and cover all articles published under the "world" section during 2025. The raw dataset contains 7384 items with 3 attributes (publication_date, headline and body_html). Because the article body is stored as HTML, the first step is to strip the HTML tags and extract the news content.

In [3]:
import pandas as pd
from bs4 import BeautifulSoup as bs
import country_converter as coco
import spacy
import ast
import csv

In [34]:
raw_data = pd.read_csv("../data/world_articles_2025.csv")

In [35]:
print(raw_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7384 entries, 0 to 7383
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   publication_date  7384 non-null   object
 1   headline          7384 non-null   object
 2   body_html         7384 non-null   object
dtypes: object(3)
memory usage: 173.2+ KB
None


In [36]:
print(raw_data.head(3))

       publication_date                                           headline  \
0  2025-01-01T01:44:41Z  Ukraine war briefing: Zelenskyy vows his count...   
1  2025-01-01T05:00:04Z  Top Venezuelan pianist urges music world to sn...   
2  2025-01-01T06:15:53Z  South Korea plane crash investigators extract ...   

                                           body_html  
0  <ul> <li><p><strong>Ukraine’s President Volody...  
1  <p>One of Venezuela’s most celebrated musician...  
2  <p>Investigators in South Korea have extracted...  


# Filter the data
## Clean Tags
In the raw content, several tags interrupt the news body:

1. \<aside\>: contains the navigation or other recommended news from the website
2. \<p class="block-time"\>: indicates the published time
3. \<figure\>: contains a figure

Importantly, the Guardian also uses \<h2\> for paragraph titles. The study therefore first removed the disruptive tags, then extracted text from both the \<p\> and \<h2\> tags.
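The tag handling described above can be tried on a toy fragment. This is a minimal sketch, assuming Beautiful Soup is installed; the sample HTML and the name `extract_text` are made up for illustration, and the built-in `html.parser` is used here only to keep the example free of extra dependencies. It also shows why whitespace is collapsed on the full text: `get_text(strip=True)` strips every text node separately, which glues inline elements to their neighbours.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the structure described above
sample_html = (
    '<aside><p>Related: another story</p></aside>'
    '<h2>Background</h2>'
    '<p class="block-time">10.15 GMT</p>'
    '<p>Investigators in <strong>South Korea</strong> have extracted data.</p>'
    '<figure><figcaption>A photo caption</figcaption></figure>'
)


def extract_text(html):
    soup = BeautifulSoup(html, "html.parser")
    # Drop navigation blocks, figures and timestamps before reading the text
    for aside in soup.find_all("aside"):
        aside.decompose()
    for figure in soup.find_all("figure"):
        figure.decompose()
    for stamp in soup.find_all("p", class_="block-time"):
        stamp.decompose()
    # Keep paragraph text and paragraph titles, collapsing internal whitespace
    parts = [" ".join(tag.get_text().split()) for tag in soup.find_all(["p", "h2"])]
    return " ".join(part for part in parts if part)


cleaned = extract_text(sample_html)
print(cleaned)

# Pitfall: strip=True strips each text node on its own, then joins without spaces
glued = BeautifulSoup(
    "<p>in <strong>South Korea</strong> today</p>", "html.parser"
).p.get_text(strip=True)
print(glued)
```

Whether the glued form matters depends on how often article bodies contain inline links or emphasis, so it is worth spot-checking a few cleaned articles.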

In [37]:
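def clean(html_raw):
    """Strip navigation, figures and timestamps, then return the text of <p> and <h2> tags."""
    if pd.isna(html_raw) or not html_raw:
        return ""

    soup = bs(str(html_raw), 'lxml')

    # remove elements that are not part of the article text
    for aside in soup.find_all('aside'):
        aside.decompose()
    for figure in soup.find_all('figure'):
        figure.decompose()
    for timestamp in soup.find_all('p', class_='block-time'):
        timestamp.decompose()

    extracted_text = []
    for tag in soup.find_all(['p', 'h2']):
        # get_text(strip=True) strips every text node separately and would glue
        # inline elements together ("in <strong>Kyiv</strong>" -> "inKyiv"),
        # so collapse whitespace on the full text instead
        text = " ".join(tag.get_text().split())
        if text:
            extracted_text.append(text)

    return " ".join(extracted_text)

# work on a copy so that raw_data keeps its original columns
data_cleaned = raw_data.copy()
data_cleaned['body_cleaned'] = data_cleaned['body_html'].apply(clean)
data_cleaned = data_cleaned.drop(columns=['body_html'])
data_cleaned.to_csv('../data/news_cleaned.csv', index=False, encoding='utf-8-sig')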


In [None]:
print(data_cleaned.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7384 entries, 0 to 7383
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   publication_date  7384 non-null   object
 1   headline          7384 non-null   object
 2   body_cleaned      7384 non-null   object
dtypes: object(3)
memory usage: 173.2+ KB
None


## Extract states
Based on the cleaned news content, the study then looked for relationships between states reflected in the news. The study used 2 criteria to identify connections:

1. Two states co-exist in a positive context.
2. Two states share the same issues.

Accordingly, a full list of the states mentioned in each article is necessary. However, a state may be referred to in different forms; for example, the U.S. may also appear as America or the United States. For generalization and standardization, the study first identified geopolitical entities (GPE) with spaCy, then passed them to country_converter to decide whether each entity represents a state and to return state names in a uniform format. Although the definition of a state varies between standards, the study drew on the UN member state list and included the 2 non-member observer states (Palestine and Holy See / Vatican City) to adjust the identifications returned by country_converter.
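The converter's result does not always have the same shape: the notebook's own comments note that a single input comes back as a bare string, and its flattening step guards against nested lists. The sketch below reproduces only that normalization, without spaCy or country_converter. The helper name `normalise_converted` and the sample values are hypothetical, and `"NA"` stands for the `not_found` placeholder used in the notebook.

```python
def normalise_converted(converted, placeholder="NA"):
    """Return a sorted list of unique state names from a converter result.

    `converted` may be a bare string, a flat list, or a list that contains
    nested lists; entries equal to `placeholder` are dropped.
    """
    if isinstance(converted, str):
        converted = [converted]

    flat = []
    for item in converted:
        if isinstance(item, list):
            flat.extend(item)   # unpack a nested list of names
        else:
            flat.append(item)

    return sorted({name for name in flat if name != placeholder})


single = normalise_converted("United States")
mixed = normalise_converted(["Russia", "NA", ["Ukraine", "Russia"], "NA"])
print(single)
print(mixed)
```

Sorting the result is a small design choice: it makes the stored lists reproducible between runs, which `list(set(...))` does not guarantee.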

Reference:

https://spacy.io/usage/linguistic-features#named-entities

https://github.com/IndEcol/country_converter

> Stadler, K. (2017). The country converter coco - a Python package for converting country names between different classification schemes. The Journal of Open Source Software. doi: 10.21105/joss.00332




In [22]:
data_cleaned = pd.read_csv("../data/news_cleaned.csv")

In [None]:
nlp = spacy.load("en_core_web_sm")


def extract_states(content):
    doc = nlp(str(content))
    states = [ent.text for ent in doc.ents if ent.label_ == "GPE"]
    coco_states = coco.convert(names=states, to='name_short', not_found="NA")

    # coco.convert returns a string if there is a single state in the list
    # when it cannot find the state, it will return the state itself when setting not_found = None
    if isinstance(coco_states, str):
        coco_states = [coco_states]

    # flatten
    coco_states = [item for sublist in coco_states for item in (
        sublist if isinstance(sublist, list) else [sublist])]
    unique_states = list(set(s for s in coco_states if s != "NA"))
    return unique_states


data_state = data_cleaned
data_state['state_list'] = data_state['body_cleaned'].apply(extract_states)

Kyiv not found in regex
Moscow not found in regex
Zelenskyy not found in regex
Moscow not found in regex
Smolensk not found in regex
Smolensk not found in regex
SBSOV not found in regex
El Sistema not found in regex
London not found in regex
Caracas not found in regex
El Sistema’s not found in regex
El Sistema not found in regex
El Sistema not found in regex
Caracas not found in regex
London not found in regex
New York’s not found in regex
El Sistema not found in regex
Caracas not found in regex
El Sistema not found in regex
El Sistema not found in regex
El Sistema not found in regex
Mexico City not found in regex
New York City not found in regex
Washington DC not found in regex
Detroit not found in regex
New York City not found in regex
New York not found in regex
New York not found in regex
New York City not found in regex
Long Island not found in regex
New York not found in regex
New York not found in regex
Chila not found in regex
Labrador not found in regex
Sydney not found in reg

In [None]:
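# 1. list every state found, using coco's short names
#    (the form in which Palestine and Vatican City would appear)
# state_list holds lists in memory but strings after a CSV round trip,
# so parse only the strings and leave existing lists untouched
data_state['state_list'] = data_state['state_list'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
all_states = data_state['state_list'].explode().dropna().unique()
print(all_states)

with open("../data/states.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for s in all_states:
        writer.writerow([s])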

['Russia' 'United States' 'Ukraine' 'Venezuela' 'Brazil' 'Colombia'
 'South Korea' 'Thailand' 'Haiti' 'Cayman Islands' 'Argentina' 'Slovenia'
 'Chile' 'Honduras' 'Panama' 'Taiwan' 'France' 'Bolivia' 'El Salvador'
 'Syria' 'Peru' 'Nepal' 'Canada' 'Guatemala' 'Switzerland' 'Germany'
 'Belgium' 'China' 'Spain' 'Hong Kong' 'Italy' 'Guyana' 'Singapore'
 'Ecuador' 'Mexico' 'Jamaica' 'Aruba' 'Puerto Rico' 'Bahamas'
 'Philippines' 'Dominican Republic' 'Australia' 'India' 'Portugal' 'Cuba'
 'United Kingdom' 'New Zealand' 'Belize' 'Costa Rica' 'Nicaragua' 'Israel'
 'Saudi Arabia' 'Sudan' 'Palestine' 'Iran' 'United Arab Emirates' 'Egypt'
 'Iraq' 'Afghanistan' 'Moldova' 'Slovakia' 'Serbia' 'Hungary' 'Poland'
 'Azerbaijan' 'Austria' 'Romania' 'Tunisia' 'Libya' 'Kenya' 'Anguilla'
 'Laos' 'Bosnia and Herzegovina' 'Kosovo' 'Montenegro' 'North Macedonia'
 'Albania' 'Finland' 'Netherlands' 'Czechia' 'Qatar' 'Lebanon' 'Eritrea'
 'North Korea' 'Burundi' 'Tanzania' 'Rwanda' 'Vietnam' 'Japan' 'Denmark'
 'Tü

In [10]:
# 2. build a delete list of non-UN entities, keeping Palestine (Vatican City is not in the list)
no_un_list = []
for s in all_states:
    try:
        # coco raises for names without a UNmember entry (see the BUG cell below)
        coco.convert(names=s, to='UNmember', not_found=None)
    except Exception:
        no_un_list.append(s)

# keep the two observer states; remove them one by one so that
# a missing name does not skip the other
for observer in ("Palestine", "Vatican City"):
    try:
        no_un_list.remove(observer)
    except ValueError as e:
        print({e})

{ValueError('list.remove(x): x not in list')}


In [None]:
# BUG
# when a state is in the short_name list but not in the UNmember list
# the package will throw an error
try:
    states = ["Taiwan"]
    un = coco.convert(names=states, to='UNmember', not_found=True)
except Exception as e:
    print({e})

{TypeError("int() argument must be a string, a bytes-like object or a real number, not 'NAType'")}


In [13]:
no_un_set = set(no_un_list)  # build the lookup set once

def delete_no_un(state_list, no_un_set):
    """Drop every entry of state_list that appears in no_un_set."""
    return list(set(state_list) - no_un_set)

data_state['state_list'] = data_state['state_list'].apply(lambda x: delete_no_un(x, no_un_set))

In [None]:
data_state.to_csv('../data/news_state.csv', index=False, encoding='utf-8-sig')

## Evaluate the state identification
Since correct and complete identification of states is a prerequisite for the study, the results of the NLP pipeline require verification for reliability. The study used Cochran's modified sample size formula for finite populations (e=0.05, p=0.5, Z=1.645 for a 90% confidence level, N=7384) and accordingly sampled 261 articles from the dataset. The results were evaluated by Precision, Recall and F1.

$$
n_0 = \frac{Z^2 \cdot p \cdot (1 - p)}{e^2},\quad
n = \frac{n_0}{1 + \frac{n_0 - 1}{N}}
$$
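As a quick check of the arithmetic, the formula can be evaluated directly with the stated parameters. This is a minimal sketch; the function name `cochran_finite` is hypothetical. Rounding to the nearest integer reproduces the sample of 261 articles, whereas rounding up would give 262.

```python
def cochran_finite(z, p, e, population):
    """Cochran's sample size with the finite population correction."""
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)   # size for an infinite population
    n = n0 / (1 + (n0 - 1) / population)     # corrected for N articles
    return n0, n


n0, n = cochran_finite(z=1.645, p=0.5, e=0.05, population=7384)
sample_size = round(n)
print(f"n0 = {n0:.2f}, n = {n:.2f}, sample size = {sample_size}")
```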

Precision: whether the identified entities actually refer to states in the article. For example, Georgia may refer to a state, but also to a person's name or to the U.S. state.

Recall: whether any states mentioned in the article were missed during identification.

F1: the harmonic mean of Precision and Recall.
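Once the sample has been annotated, the three scores can be computed from the predicted and the hand-labelled state sets of each article. The sketch below is one possible way to do it, not the study's actual calculation: the function name `prf1`, the toy data and the choice of micro-averaging (pooling counts over all articles) are assumptions.

```python
def prf1(predicted, gold):
    """Micro-averaged precision, recall and F1 over per-article state sets."""
    tp = fp = fn = 0
    for pred, true in zip(predicted, gold):
        pred, true = set(pred), set(true)
        tp += len(pred & true)   # states correctly identified
        fp += len(pred - true)   # identified, but not a state in this article
        fn += len(true - pred)   # mentioned, but missed by the pipeline
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy annotations (invented): "Georgia" is a false positive, "Poland" a miss
predicted = [["Georgia", "Russia"], ["Ukraine"], ["France", "Germany"]]
gold = [["Russia"], ["Ukraine", "Poland"], ["France", "Germany"]]

precision, recall, f1 = prf1(predicted, gold)
print(precision, recall, f1)
```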

In [10]:
data_state = pd.read_csv('../data/news_state.csv')

In [None]:
state_iden_sample = data_state.sample(n=261, random_state=22)
state_iden_sample.to_csv('../data/sample/state_iden_sample.csv', index=False, encoding='utf-8-sig')

In [None]:
# TODO: Annotation & Calculation

# Data Grouping

To prepare the corpus for the two-layer network, the study first removed the articles that do not explicitly mention any state, leaving a dataset of 7129 items. Each layer then required a different strategy. For the sentiment layer, an article must mention at least two states, so the study removed the articles with only one state and obtained a dataset of 5941 items. For the semantic layer, the study reorganized the data into state-article pairs, giving a larger dataset of 35079 items in which each row represents one state and one article that mentions it, a layout convenient for per-state semantic analysis.
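The three filters can be illustrated on a handful of invented records before applying them to the full DataFrame. This is a plain-Python sketch with hypothetical variable names and made-up state lists.

```python
# Invented records: article id plus the states identified in it
articles = [
    {"article_id": 0, "state_list": ["Russia", "Ukraine"]},
    {"article_id": 1, "state_list": ["Venezuela"]},
    {"article_id": 2, "state_list": []},
    {"article_id": 3, "state_list": ["China", "Japan", "South Korea"]},
]

# Step 1: drop articles that mention no state at all
with_states = [a for a in articles if a["state_list"]]

# Sentiment layer: keep only articles that mention at least two states
sentiment_input = [a for a in with_states if len(a["state_list"]) > 1]

# Semantic layer: one (state, article_id) row per mention
semantic_input = [
    (state, a["article_id"]) for a in with_states for state in a["state_list"]
]

print(len(with_states), len(sentiment_input), len(semantic_input))
```

The semantic layer grows because every article contributes one row per state it mentions, which is why its row count exceeds the number of articles.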

In [26]:
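# state_list was read back from CSV, so an empty list is the string "[]";
# copy the slice so that later column assignments do not touch data_state
data_state_cleaned = data_state[data_state['state_list'] != "[]"].copy()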
print(data_state_cleaned.info())

<class 'pandas.core.frame.DataFrame'>
Index: 7129 entries, 0 to 7383
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   publication_date  7129 non-null   object
 1   headline          7129 non-null   object
 2   body_cleaned      7129 non-null   object
 3   state_list        7129 non-null   object
dtypes: object(4)
memory usage: 278.5+ KB
None


In [None]:
network1 = data_state_cleaned[
    data_state_cleaned['state_list'].apply(lambda x: len(ast.literal_eval(x)) > 1)
]
print(network1.info())
network1.to_csv("../data/network1/network1.csv", index=False, encoding='utf-8-sig')

<class 'pandas.core.frame.DataFrame'>
Index: 5941 entries, 0 to 7383
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   publication_date  5941 non-null   object
 1   headline          5941 non-null   object
 2   body_cleaned      5941 non-null   object
 3   state_list        5941 non-null   object
dtypes: object(4)
memory usage: 232.1+ KB
None


In [None]:
# build one row per (state, article) pair
data_state_cleaned['article_id'] = data_state_cleaned.index
data_state_cleaned['state_list'] = data_state_cleaned['state_list'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
exploded = data_state_cleaned.explode('state_list')
network2 = exploded.groupby('state_list')['article_id'].apply(list).reset_index()
network2 = network2.explode('article_id')
network2.columns = ['state', 'article_id']
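The summary printed for network2 lists headline, publication_date and body_cleaned next to state and article_id, which the cell above does not produce on its own. One way to obtain that shape is sketched here on toy data. It is an assumption about the missing step, not the notebook's actual code; the sample rows and the name `state_articles` are hypothetical.

```python
import pandas as pd

# Toy stand-in for data_state_cleaned, with state_list already parsed into lists
articles = pd.DataFrame({
    "article_id": [0, 1],
    "headline": ["Headline A", "Headline B"],
    "publication_date": ["2025-01-01T01:44:41Z", "2025-01-01T05:00:04Z"],
    "body_cleaned": ["Text about Russia and Ukraine.", "Text about Venezuela."],
    "state_list": [["Russia", "Ukraine"], ["Venezuela"]],
})

# explode() repeats the other columns for every state in the list,
# so the article fields travel with each (state, article) row
state_articles = (
    articles.explode("state_list")
    .rename(columns={"state_list": "state"})
    [["state", "article_id", "headline", "publication_date", "body_cleaned"]]
    .sort_values(["state", "article_id"])
    .reset_index(drop=True)
)
print(state_articles)
```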

In [None]:
print(network2.info())
network2.to_csv("../data/network2/network2.csv", index=False, encoding='utf-8-sig')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35079 entries, 0 to 35078
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   state             35079 non-null  object
 1   article_id        35079 non-null  object
 2   headline          35079 non-null  object
 3   publication_date  35079 non-null  object
 4   body_cleaned      35079 non-null  object
dtypes: object(5)
memory usage: 1.3+ MB
None


# Sentiment Network
The study defined the context as a sentence; that is, it measured whether two or more states appear within a positive sentence. First, the study segmented the articles into sentences, then checked whether each sentence mentions two or more states. If so, the sentence was recorded in the sentiment corpus, resulting in 38688 items.
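The sentence-level filter can be sketched without the NLP stack. In the example below a regular expression stands in for spaCy's sentence segmentation and a small alias table stands in for entity recognition plus country_converter. Both are simplifying assumptions, and `ALIASES`, `states_in` and `multi_state_sentences` are hypothetical names.

```python
import re

# Hypothetical alias table: surface form -> normalized state name
ALIASES = {
    "United States": "United States",
    "America": "United States",
    "Russia": "Russia",
    "Ukraine": "Ukraine",
}


def states_in(sentence):
    """Return the set of normalized states whose alias occurs in the sentence."""
    return {
        state
        for alias, state in ALIASES.items()
        if re.search(rf"\b{re.escape(alias)}\b", sentence)
    }


def multi_state_sentences(text):
    """Keep only the sentences that mention at least two different states."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    rows = []
    for sentence in sentences:
        found = states_in(sentence)
        if len(found) > 1:
            rows.append({"sentence": sentence, "states": sorted(found)})
    return rows


text = (
    "Russia and Ukraine exchanged prisoners. "
    "America welcomed the move. "
    "The United States later hosted talks with Ukraine."
)
rows = multi_state_sentences(text)
for row in rows:
    print(row)
```

The second sentence is dropped because it mentions a single state, even though that state appears under an alias.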

In [12]:
network1 = pd.read_csv("../data/network1/network1.csv")
network1["content"] = network1["headline"].astype(str) + '. ' + network1["body_cleaned"].astype(str)

with open("../data/states.csv", "r", encoding="utf-8") as f:
    reader = csv.reader(f)
    all_states = [row[0] for row in reader]

network1_sentence = pd.DataFrame(columns=["sentence", "states"])

nlp = spacy.load("en_core_web_sm")

In [13]:
def record_sentence(states, doc, nlp):
    """Return one row per sentence of doc that mentions at least two known states."""
    rows = []
    for sent in doc.sents:
        sent_doc = nlp(str(sent))
        sent_states = [ent.text for ent in sent_doc.ents if ent.label_ == "GPE"]
        coco_states = coco.convert(names=sent_states, to='name_short', not_found="NA")

        # coco.convert returns a bare string for a single input name
        if isinstance(coco_states, str):
            coco_states = [coco_states]

        # flatten nested results and drop unmatched entities
        coco_states = [item for sublist in coco_states for item in (
            sublist if isinstance(sublist, list) else [sublist])]
        unique_states = set(s for s in coco_states if s != "NA")

        intersection = list(set(states) & unique_states)
        if len(intersection) > 1:
            rows.append({"sentence": str(sent), "states": str(intersection)})
    return rows

# collect rows in a list and build the DataFrame once;
# calling pd.concat inside the loop copies the whole frame on every match
sentence_rows = []
for _, row in network1.iterrows():
    doc = nlp(str(row["content"]))
    sentence_rows.extend(record_sentence(all_states, doc, nlp))
network1_sentence = pd.DataFrame(sentence_rows, columns=["sentence", "states"])

Kyiv not found in regex
Moscow not found in regex
Zelenskyy not found in regex
Moscow not found in regex
Smolensk not found in regex
Smolensk not found in regex
SBSOV not found in regex
El Sistema not found in regex
London not found in regex
Caracas not found in regex
El Sistema’s not found in regex
El Sistema not found in regex
El Sistema not found in regex
Caracas not found in regex
London not found in regex
New York’s not found in regex
El Sistema not found in regex
Caracas not found in regex
El Sistema not found in regex
El Sistema not found in regex
El Sistema not found in regex
New York’s not found in regex
Mexico City not found in regex
New York City not found in regex
Washington DC not found in regex
Detroit not found in regex
New York City not found in regex
New York not found in regex
New York not found in regex
New York City not found in regex
Long Island not found in regex
New York not found in regex
New York not found in regex
Chila not found in regex
Newfoundland not foun

In [14]:
network1_sentence.to_csv("../data/network1/network1_sentence.csv",index=False, encoding='utf-8-sig')

In [15]:
print(network1_sentence.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38688 entries, 0 to 38687
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   sentence  38688 non-null  object
 1   states    38688 non-null  object
dtypes: object(2)
memory usage: 604.6+ KB
None
