# Task 1.6: Building a Network Dataset with NLP

### Project: 20th Century Geopolitical Interrelations
### Author: Fariya Asghar
### Date: 23.07.2025

### Abstract
This notebook implements the Natural Language Processing (NLP) phase of the project. This approach uses a more robust and accurate methodology to extract geopolitical relationships from the "Key Events of the 20th Century" text.

The refined methodology is as follows:

1.  **Data Loading & Normalization:** Ingest the scraped country list and the raw event text. Both the country list and the event text are converted to lowercase for consistent matching.
2.  **Text Wrangling:** A targeted dictionary is used to standardize common adjectival forms (e.g., "Soviet" -> "russia") and abbreviations (e.g., "u.s." -> "united states"). This manual approach is more precise than the previous automated one.
3.  **Named Entity Recognition (NER):** The wrangled text is processed using spaCy. We then leverage spaCy's pre-trained ability to identify Geopolitical Entities (`GPE`).
4.  **Entity Filtering:** We iterate through the sentences and their identified `GPE` entities. An entity is only kept if it exists in our master list of countries, ensuring high accuracy and eliminating false positives. This is the correct filtering method as per the task requirements.
5.  **Relationship Extraction:** For sentences containing two or more verified countries, unique relationship pairs (edges) are created.
6.  **Output:** The final output is a structured pandas DataFrame counting the frequency of each unique country-pair interaction, saved as `country_relationships.csv`. This file is the direct input for the network analysis in the next task.

In [15]:
# Import Libraries and Load Data

import pandas as pd
import re
import itertools
import spacy

# Load the pre-trained spaCy English language model
try:
    nlp = spacy.load("en_core_web_sm")
    print("spaCy 'en_core_web_sm' model loaded successfully.")
except OSError:
    print("Model not found. Please run 'python -m spacy download en_core_web_sm'")

spaCy 'en_core_web_sm' model loaded successfully.


In [16]:
# --- Load and Prepare Data ---

# Load the list of countries.
# We immediately strip whitespace and convert to lowercase for consistent matching.
try:
    countries_df = pd.read_csv('countries_list_20th_century_1.5.csv')
    country_list = countries_df['country_name'].str.strip().str.lower().tolist()
    print(f"Successfully loaded and normalized {len(country_list)} countries.")
except FileNotFoundError:
    print("Error: 'countries_list_20th_century_1.5.csv' not found. Please ensure the file is in the same directory.")
    country_list = []

# Load the raw text from the events page.
try:
    with open('20th_century_key_events.txt', 'r', encoding='utf-8') as file: 
        raw_text = file.read()
    print("Successfully loaded the raw text from the events file.")
except FileNotFoundError:
    print("Error: The events text file was not found.")
    raw_text = ""

Successfully loaded and normalized 209 countries.
Successfully loaded the raw text from the events file.


### 2. Text Wrangling and Standardization

**Observations and Plan:**
The raw text contains many variations of country names (adjectives, abbreviations). To ensure the NER model performs optimally, we must standardize these. This approach uses a curated, manual dictionary for replacements, which is more accurate and avoids the errors (e.g., mapping "and" to a country). The entire text will first be converted to lowercase.

In [17]:
# Create a targeted dictionary for common replacements.
replacement_dict = {
    "u.s.": "united states",
    "us": "united states",
    "u.k.": "united kingdom",
    "uk": "united kingdom",
    "soviet union": "russia",
    "soviet": "russia",
    "german": "germany",
    "british": "united kingdom",
    "french": "france",
    "italian": "italy",
    "japanese": "japan",
    "chinese": "china",
    "american": "united states"
}

# First, normalize the entire text to lowercase.
normalized_text = raw_text.lower()

# Apply each replacement using regular expressions with word boundaries (\b).
# This prevents replacing parts of words (e.g., 'us' in 'house').
print("Applying text standardizations...")
for old, new in replacement_dict.items():
    normalized_text = re.sub(rf'\b{re.escape(old)}\b', new, normalized_text)

# Save the clean, wrangled version for traceability and future use.
wrangled_filename = "20th_century_wrangled_text.txt"
with open(wrangled_filename, 'w', encoding='utf-8') as f:
    f.write(normalized_text)
print(f"Wrangling complete. Cleaned text saved to '{wrangled_filename}'.")

Applying text standardizations...
Wrangling complete. Cleaned text saved to '20th_century_wrangled_text.txt'.


### 3. Apply NER and Filter for Country Entities

Now that the text is clean, we process it with spaCy. We will then iterate through each sentence, find entities labeled as `GPE` (Geopolitical Entity), and perform the final, crucial filtering step: checking if the found entity exists in our master `country_list`.

In [18]:
# Process the entire normalized text with spaCy
doc = nlp(normalized_text)

# This list will store dictionaries, each representing a sentence with verified countries.
filtered_sentences = []
print("Filtering sentences for verified country names...")

for sent in doc.sents:
    sentence_text = sent.text.strip()
    
    # Use a set to store unique countries found in this sentence to avoid duplicates.
    countries_in_sentence = set()
    
    # Check every entity identified by spaCy in the sentence.
    for ent in sent.ents:
        # We are only interested in Geopolitical Entities.
        if ent.label_ == "GPE":
            # Clean the entity text (lowercase, strip whitespace) for matching.
            ent_clean = ent.text.strip().lower()
            
            # THE KEY STEP: Check if this cleaned entity is in our master country list.
            if ent_clean in country_list:
                # If it's a match, add the standardized, title-cased name.
                countries_in_sentence.add(ent_clean.title())

    # Only if we found at least one verified country, we add it to our results.
    if len(countries_in_sentence) > 0:
        filtered_sentences.append({
            "sentence": sentence_text,
            "country_entities": list(countries_in_sentence)
        })

# Create the final filtered DataFrame.
filtered_sentences_df = pd.DataFrame(filtered_sentences)
print(f"Filtering complete. Found {len(filtered_sentences_df)} sentences containing at least one country.")
display(filtered_sentences_df.head())

Filtering sentences for verified country names...
Filtering complete. Found 162 sentences containing at least one country.


Unnamed: 0,sentence,country_entities
0,after a period of diplomatic and military esca...,"[France, Germany, Russia]"
1,the bolsheviks negotiated the treaty of brest-...,"[Russia, Germany]"
2,"in the treaty, bolshevik russia ceded the balt...","[Russia, Germany]"
3,it also recognized the independence of ukraine...,"[United States, Germany]"
4,combined with already existing malnourishment ...,[Russia]


### 4. Create and Aggregate Relationship Pairs

With the correctly filtered sentences, we can now extract the relationships. We will iterate through sentences containing two or more countries and create a pair for each unique combination. Finally, we will count the frequency of each pair to determine the strength of their relationship.

In [19]:
# This list will store all raw relationship pairs, e.g., ('Germany', 'Russia').
relationships = []

# Iterate through the rows of our filtered DataFrame.
for index, row in filtered_sentences_df.iterrows():
    countries = row['country_entities']
    
    # We only care about relationships, which require at least two countries.
    if len(countries) >= 2:
        # itertools.combinations creates all unique pairs from the list.
        # We sort the list first to ensure the pairs are created in a consistent order.
        for pair in itertools.combinations(sorted(countries), 2):
            relationships.append(pair)

# Create a DataFrame from the list of pairs.
relationships_df = pd.DataFrame(relationships, columns=["source", "target"])

# --- Count Frequencies and Finalize ---

# The most efficient way to count pairs is using value_counts().
final_relationships_df = relationships_df.value_counts().reset_index()
final_relationships_df.columns = ["source", "target", "value"]

print(f"Created a final DataFrame with {len(final_relationships_df)} unique country relationships.")
display(final_relationships_df.head(15))

Created a final DataFrame with 154 unique country relationships.


Unnamed: 0,source,target,value
0,Germany,Russia,12
1,Poland,Russia,6
2,Japan,Russia,6
3,France,Russia,5
4,Germany,Poland,5
5,France,Germany,5
6,Japan,United States,4
7,Germany,Italy,4
8,India,Pakistan,3
9,Germany,Japan,3


### 5. Save Final Output
The process is complete. The final DataFrame, containing the source country, target country, and the frequency of their co-occurrence (value), is now saved to a CSV file. This file is ready for the network visualization task in Exercise 1.7.

In [20]:
# Save the final DataFrame to a CSV file.
output_filename = "country_relationships.csv"
final_relationships_df.to_csv(output_filename, index=False)

print(f"\nSuccessfully saved the final relationships data to '{output_filename}'.")


Successfully saved the final relationships data to 'country_relationships.csv'.
