# Task 1.6: Building a Network Dataset with NLP

### Project: 20th Century Geopolitical Interrelations
### Author: Fariya Asghar
### Date: 22.07.2025

---

### Abstract

This notebook serves as the core of the Natural Language Processing (NLP) phase for this project. The primary objective is to transform the unstructured text data scraped from the "Key Events of the 20th Century" Wikipedia page into a structured dataset suitable for network analysis.

The methodology is as follows:
1.  **Data Loading:** Ingest the previously scraped list of countries and the full text of 20th-century events.
2.  **Text Wrangling:** Standardize the text by replacing common adjectival forms and abbreviations of country names with their official counterparts to improve recognition accuracy.
3.  **Named Entity Recognition (NER):** Utilize the `spaCy` library to process the wrangled text, identify all named entities, and segment the document into sentences.
4.  **Relationship Extraction:** Filter the sentences to keep only those containing two or more countries. For each of these sentences, create relationship pairs (edges) between the co-occurring countries.
5.  **Output:** The final deliverable of this notebook is a structured pandas DataFrame of these relationships. This DataFrame will serve as the direct input for creating, analyzing, and visualizing the geopolitical network in the final task.

### 1. Data Loading

In [1]:
# Import Libraries and Load Data

# --- Standard Libraries ---
import pandas as pd
import re
import itertools # We will use this for creating relationship pairs

# --- NLP and Network Libraries ---
import spacy
import networkx as nx

# --- Load the spaCy English language model ---
# Catch only OSError, which spaCy raises when the model is not installed.
try:
    nlp = spacy.load("en_core_web_sm")
    print("spaCy English language model loaded successfully.")
except OSError:
    print("spaCy 'en_core_web_sm' model not found. Please run 'python -m spacy download en_core_web_sm' in your terminal.")
    raise  # Stop here: later cells need `nlp` to be defined.

spaCy English language model loaded successfully.


In [2]:
# --- Load Scraped Data ---
# Load the list of countries from the CSV file.
try:
    countries_df = pd.read_csv('countries_list_20th_century_1.5.csv')
    country_list = countries_df['country_name'].tolist()
    print(f"Successfully loaded {len(country_list)} countries from 'countries_list_20th_century_1.5.csv'.")
except FileNotFoundError:
    print("Error: 'countries_list_20th_century_1.5.csv' not found.")
    country_list = []

# Load the raw text from the events page .txt file.
try:
    with open('20th_century_key_events.txt', 'r', encoding='utf-8') as file:
        raw_text = file.read()
    print("Successfully loaded the raw text from '20th_century_key_events.txt'.")
except FileNotFoundError:
    print("Error: '20th_century_key_events.txt' not found.")
    raw_text = ""

Successfully loaded 209 countries from 'countries_list_20th_century_1.5.csv'.
Successfully loaded the raw text from '20th_century_key_events.txt'.
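
The names read from the CSV may carry stray whitespace (the padded entries in the `country_entities` output later in this notebook suggest they do). Padded names can prevent exact-name lookups from matching. The following is a minimal sketch of a clean-up step, assuming the names arrive as a plain list; `clean_country_names` is a hypothetical helper and is not part of the original pipeline.

```python
def clean_country_names(names):
    """Trim whitespace, drop empty or non-string cells, and de-duplicate in order."""
    seen = set()
    cleaned = []
    for name in names:
        if not isinstance(name, str):
            continue  # skip NaN or other non-string cells
        name = " ".join(name.split())  # trims the ends and collapses inner runs of spaces
        if name and name not in seen:
            seen.add(name)
            cleaned.append(name)
    return cleaned


# Illustrative input only; these values are not taken from the project CSV.
sample = [" Germany ", "Japan", " Germany", "", "Papua  New Guinea", float("nan")]
print(clean_country_names(sample))
```

In this notebook the helper could be applied right after loading, for example `country_list = clean_country_names(country_list)`.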


### 2. Evaluate and Wrangle Data

**Observations:**

Before processing the text with the NER model, an evaluation of the raw text and the country list reveals several potential inconsistencies that could reduce accuracy:

1.  **Adjectival vs. Noun Forms:** The text frequently uses adjectival forms of country names (e.g., "German", "Soviet") which may not be correctly identified as the country entity itself (e.g., "Germany").
2.  **Abbreviations and Acronyms:** Common abbreviations like "U.S." or "U.K." are used instead of their full names.
3.  **Historical Names:** The term "Soviet Union" is prevalent and, for the scope of this analysis, should be mapped to "Russia" to maintain a consistent entity.

**Plan:**

To address these issues, I will apply a series of text replacements to the raw text, mapping common abbreviations and adjectival forms to the corresponding country names from the country list. The resulting text is more consistent, which should make country mentions easier to identify in the later steps. The wrangled text is saved to a new `.txt` file for traceability.

#### - Automated Text Wrangling with pycountry

In [3]:
import pycountry

print("Starting automated text wrangling using pycountry...")
wrangled_text = raw_text

# --- Part 1: Automatically build a comprehensive replacement dictionary ---
replacement_dict = {
    # Add a few common, non-standard cases manually that pycountry might miss
    "U.S.": "United States",
    "U.K.": "United Kingdom",
    "Soviet Union": "Russia"
}

# Loop through our country list to look for adjectival forms.
# Note: a rule is added only if the pycountry record exposes an 'adjective'
# attribute; records without one are skipped silently.
print("Searching for adjectival forms for all countries in the list...")
for country_name in country_list:
    try:
        # Look up the country in the pycountry database by its exact name
        country_data = pycountry.countries.get(name=country_name)

        # Check that the record exists and has an 'adjective' attribute
        if country_data and hasattr(country_data, 'adjective'):
            adjective = country_data.adjective

            # Add a rule to replace the adjective with the proper country name
            # e.g., "German" -> "Germany"
            # We only add it if it's not already in our manual list
            if adjective not in replacement_dict:
                replacement_dict[adjective] = country_name

    except LookupError:
        # This country isn't in the pycountry database, which is fine. We just skip it.
        continue

print(f"Built a replacement dictionary with {len(replacement_dict)} rules.")


# --- Part 2: Apply all replacements to the text ---
print("Applying replacements to the text...")
# Loop through the dictionary and apply each replacement
for old_word, new_word in replacement_dict.items():
    # Match case-insensitively and only as a whole token, so parts of longer
    # words are left alone. Lookarounds are used instead of \b because \b does
    # not match after the final "." of "U.S." when a space follows it.
    pattern = rf'(?<!\w){re.escape(old_word)}(?!\w)'
    wrangled_text = re.sub(pattern, new_word, wrangled_text, flags=re.IGNORECASE)

print("Automated text wrangling complete.")


# --- Part 3: Save the wrangled text as required ---
wrangled_filename = "20th_century_wrangled_text.txt"
with open(wrangled_filename, 'w', encoding='utf-8') as f:
    f.write(wrangled_text)
print(f"Wrangled text saved to '{wrangled_filename}'.")

Starting automated text wrangling using pycountry...
Searching for adjectival forms for all countries in the list...
Built a replacement dictionary with 3 rules.
Applying replacements to the text...
Automated text wrangling complete.
Wrangled text saved to '20th_century_wrangled_text.txt'.
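
The reported count of 3 rules equals the number of manual entries, which suggests the loop found no adjectival forms. As far as I know, pycountry records carry fields such as `name` and `alpha_2` rather than demonyms, so the adjective rules would need another source. Below is a minimal sketch of one option, a hand-written map applied in a single regex pass. The `DEMONYMS` entries and the function name are assumptions for illustration, and matching here is deliberately case-sensitive.

```python
import re

# Hypothetical, hand-written map; extend it to cover the countries of interest.
DEMONYMS = {
    "German": "Germany",
    "Japanese": "Japan",
    "French": "France",
    "U.S.": "United States",
    "U.K.": "United Kingdom",
    "Soviet Union": "Russia",
}


def standardize_country_mentions(text, mapping=DEMONYMS):
    """Replace each key of `mapping` with its country name, whole tokens only."""
    # Longer keys go first so a multi-word key wins over a shorter overlapping one.
    keys = sorted(mapping, key=len, reverse=True)
    # (?<!\w) and (?!\w) act like \b but also work for keys that end in ".".
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
    )
    return pattern.sub(lambda m: mapping[m.group(0)], text)


print(standardize_country_mentions("German and U.S. forces met Soviet Union troops."))
```

Because the trailing lookahead rejects a following word character, "Germany" and "Germans" are left untouched while "German" is replaced.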


### 3. Named Entity Recognition (NER)

#### - Create the NER Object

In [4]:
# The nlp() function processes the text and performs the NER analysis.
print("Processing the wrangled text with spaCy to create the NER object...")
ner_object = nlp(wrangled_text)
print("NER processing complete.")

# Let's inspect a sample of the entities spaCy found to verify its work.
print("\n--- Sample of the first 15 Entities Found by spaCy ---")
for entity in list(ner_object.ents)[:15]:
    print(f"Entity: '{entity.text}', Label: '{entity.label_}'")

Processing the wrangled text with spaCy to create the NER object...
NER processing complete.

--- Sample of the first 15 Entities Found by spaCy ---
Entity: 'The 20th century', Label: 'DATE'
Entity: 'The World Wars', Label: 'ORG'
Entity: 'the Cold War', Label: 'EVENT'
Entity: 'the Space Race', Label: 'ORG'
Entity: 'the World Wide Web', Label: 'EVENT'
Entity: 'the 21st century', Label: 'DATE'
Entity: 'today', Label: 'DATE'
Entity: 'Historic', Label: 'PERSON'
Entity: '20th', Label: 'ORDINAL'
Entity: 'the 20th century', Label: 'DATE'
Entity: 'The 1900s', Label: 'DATE'
Entity: 'the decade', Label: 'DATE'
Entity: '1914', Label: 'CARDINAL'
Entity: 'the Panama Canal', Label: 'FAC'
Entity: 'Scramble', Label: 'PERSON'
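
The sample above mixes dates, events, and other labels. In the label scheme used by `en_core_web_sm`, countries, cities, and states share the `GPE` label, so narrowing the entities to countries takes a label filter plus a check against the country list. The sketch below assumes the entities are available as `(text, label)` pairs; `country_entities` is a hypothetical helper and the sample pairs are made up for illustration.

```python
def country_entities(entities, known_countries):
    """Keep GPE entities whose text matches a known country name."""
    known = {name.lower(): name for name in known_countries}
    found = []
    for text, label in entities:
        if label != "GPE":
            continue  # skip dates, events, people, and so on
        key = text.strip().lower()
        if key.startswith("the "):
            key = key[4:]  # so "the United States" matches "United States"
        if key in known and known[key] not in found:
            found.append(known[key])
    return found


# Illustrative pairs, not actual output from this notebook.
sample_entities = [
    ("the Cold War", "EVENT"),
    ("Germany", "GPE"),
    ("the United States", "GPE"),
    ("Sarajevo", "GPE"),
    ("1914", "CARDINAL"),
]
print(country_entities(sample_entities, ["Germany", "United States", "Japan"]))
```

With the objects in this notebook, the pairs could be built as `[(ent.text, ent.label_) for ent in ner_object.ents]`.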


#### - Split the NER Object into Sentences

In [5]:
# Create an empty list to store our sentence-level data.
sentence_data = []

# The .sents attribute of the spaCy Doc object allows us to loop through each identified sentence.
print("Extracting entities from each sentence...")
for sentence in ner_object.sents:
    # For each sentence, we create a list of the text of every entity found within it.
    entity_list = [ent.text for ent in sentence.ents]
    
    # We append a dictionary containing the sentence and its entities to our main list.
    sentence_data.append({"sentence": sentence.text, "entities": entity_list})

# Convert the list of dictionaries into a pandas DataFrame for easier handling.
sentences_df = pd.DataFrame(sentence_data)

print("DataFrame of sentences and their entities created successfully.")

# Display the head of the new DataFrame to verify the result.
print("\n--- First 10 Sentences and Their Entities ---")
display(sentences_df.head(10))

Extracting entities from each sentence...
DataFrame of sentences and their entities created successfully.

--- First 10 Sentences and Their Entities ---


| | sentence | entities |
|---|---|---|
| 0 | The 20th century changed the world in unpreced... | [The 20th century] |
| 1 | The World Wars sparked tension between countri... | [The World Wars, the Cold War, the Space Race,... |
| 2 | These advancements have played a significant r... | [the 21st century, today] |
| 3 | Historic events in the 20th century[edit]\nWor... | [Historic, 20th, the 20th century] |
| 4 | The 1900s saw the decade herald a series of in... | [The 1900s, the decade] |
| 5 | 1914 saw the completion of the Panama Canal.\n | [1914, the Panama Canal] |
| 6 | The Scramble for Africa continued in the 1900s... | [Scramble, Africa, the 1900s] |
| 7 | The atrocities in the Congo Free State shocked... | [the Congo Free State] |
| 8 | From 1914 to 1918, the First World War, and it... | [1914 to 1918, the First World War] |
| 9 | "The war to end all wars": World War I (1914–1... | [World War I, World War I\nArrest, Sarajevo, A... |


### 4. Relationship Extraction

In [6]:
print(wrangled_text[:2000]) # Print the first 2000 characters of the wrangled text

The 20th century changed the world in unprecedented ways. The World Wars sparked tension between countries and led to the creation of atomic bombs, the Cold War led to the Space Race and the creation of space-based rockets, and the World Wide Web was created. These advancements have played a significant role in citizens' lives and shaped the 21st century into what it is today.
Historic events in the 20th century[edit]
World at the beginning of the century[edit]
Main article: Edwardian era
The new beginning of the 20th century marked significant changes. The 1900s saw the decade herald a series of inventions, including the automobile, airplane and radio broadcasting. 1914 saw the completion of the Panama Canal.
The Scramble for Africa continued in the 1900s and resulted in wars and genocide across the continent. The atrocities in the Congo Free State shocked the civilized world.
From 1914 to 1918, the First World War, and its aftermath, caused major changes in the power balance of the w

#### - Filter the Entities for Countries Only

In [7]:
# === 1. BUILD A KEYWORD SEARCH MAP ===
# Map each individual word of a country name (e.g., "states") to the full country name.
# Caveat: when several countries share a word, the last one in country_list
# overwrites the earlier entries for that word.
print("Building a robust search map from the country list...")
country_search_map = {}
for country in country_list:
    # Get the individual words from the country name
    words = country.lower().split()
    for word in words:
        # Skip words of one or two characters such as 'of'. Three-letter words
        # like 'the' and 'and' still pass this check and become keys.
        if len(word) > 2:
            country_search_map[word] = country
print("Search map built successfully.")


# === 2. THE SIMPLE AND DIRECT FILTERING LOGIC ===
filtered_data = []
print("Filtering sentences using the new search map...")

# Use spaCy's reliable sentence splitter from the ner_object.
for sentence in ner_object.sents:
    # This set will store the unique OFFICIAL country names found in this sentence.
    countries_found_in_sentence = set()
    
    # Tokenize the sentence into a clean list of lowercase words.
    sentence_words = re.findall(r'\b\w+\b', sentence.text.lower())
    
    # For each word in the sentence, check if it's a country keyword.
    for word in sentence_words:
        if word in country_search_map:
            # If it is, add the corresponding OFFICIAL country name to our set.
            official_name = country_search_map[word]
            countries_found_in_sentence.add(official_name)
            
    # If we found at least one country in this sentence...
    if countries_found_in_sentence:
        filtered_data.append({
            "sentence": sentence.text.strip(),
            "country_entities": list(countries_found_in_sentence)
        })

# === 3. CREATE THE FINAL DATAFRAME ===
filtered_sentences_df = pd.DataFrame(filtered_data)

print(f"\nFiltering complete. Found {len(filtered_sentences_df)} sentences that mention at least one country.")

# Display the head of the filtered DataFrame to confirm it is populated.
print("\n--- First 10 Sentences Containing Country Entities ---")
display(filtered_sentences_df.head(10))

Building a robust search map from the country list...
Search map built successfully.
Filtering sentences using the new search map...

Filtering complete. Found 632 sentences that mention at least one country.

--- First 10 Sentences Containing Country Entities ---


| | sentence | country_entities |
|---|---|---|
| 0 | The 20th century changed the world in unpreced... | [ Saint Vincent and the Grenadines ] |
| 1 | The World Wars sparked tension between countri... | [ Trinidad and Tobago , Saint Vincent and... |
| 2 | These advancements have played a significant r... | [ Trinidad and Tobago , Saint Vincent and... |
| 3 | Historic events in the 20th century[edit]\nWor... | [ Papua New Guinea , Saint Vincent and th... |
| 4 | The 1900s saw the decade herald a series of in... | [ Trinidad and Tobago , Saint Vincent and... |
| 5 | 1914 saw the completion of the Panama Canal. | [ Panama , Saint Vincent and the Grenadin... |
| 6 | The Scramble for Africa continued in the 1900s... | [ South Africa , Trinidad and Tobago , ... |
| 7 | The atrocities in the Congo Free State shocked... | [ Saint Vincent and the Grenadines ] |
| 8 | From 1914 to 1918, the First World War, and it... | [ Trinidad and Tobago , Saint Vincent and... |
| 9 | "The war to end all wars": World War I (1914–1... | [ Trinidad and Tobago , Saint Vincent and... |
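
In the output above, "Saint Vincent and the Grenadines" and "Trinidad and Tobago" appear in almost every row. That is consistent with the words "the" and "and" passing the `len(word) > 2` check and mapping to the last country in the list that contains them. A minimal sketch of one alternative, matching whole country names instead of single words, is shown here. The function names and the sample list are hypothetical, and this is not the method that produced the results in this notebook.

```python
import re


def build_country_pattern(country_names):
    """Compile one regex that matches any full country name as a whole token."""
    names = sorted({n.strip() for n in country_names if n.strip()}, key=len, reverse=True)
    return re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(n) for n in names) + r")(?!\w)",
        flags=re.IGNORECASE,
    )


def countries_in_sentence(sentence, pattern, canonical):
    """Return the canonical names found in the sentence, in order of appearance."""
    found = []
    for match in pattern.finditer(sentence):
        name = canonical[match.group(0).lower()]
        if name not in found:
            found.append(name)
    return found


# Illustrative list; the notebook would pass its own country_list instead.
countries = ["Germany", "Japan", "Trinidad and Tobago", "Saint Vincent and the Grenadines"]
canonical = {c.lower(): c for c in countries}  # lowercase text -> canonical name
pattern = build_country_pattern(countries)

print(countries_in_sentence("The war between Germany and Japan ended.", pattern, canonical))
print(countries_in_sentence("The 20th century changed the world.", pattern, canonical))
```

Whole-name matching misses partial mentions such as "Trinidad" on its own, so a short alias map may still be needed for names that are commonly shortened.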


#### - Create the Relationships DataFrame

In [8]:
# This list will store all the relationship pairs we find.
relationships = []

# Loop through each row of the filtered DataFrame.
for index, row in filtered_sentences_df.iterrows():
    countries = row['country_entities']
    
    # We only care about sentences that mention at least TWO countries.
    if len(countries) > 1:
        # The itertools.combinations function is perfect for this.
        # It creates all unique pairs from a list.
        # e.g., for ['A', 'B', 'C'], it will generate ('A', 'B'), ('A', 'C'), ('B', 'C')
        for pair in itertools.combinations(sorted(countries), 2):
            relationships.append(pair)

# Convert the list of pairs into a DataFrame.
relationships_df = pd.DataFrame(relationships, columns=['source', 'target'])

print(f"Created a DataFrame with {len(relationships_df)} relationship pairs.")

# Display the first 10 relationships found.
display(relationships_df.head(10))

Created a DataFrame with 1413 relationship pairs.


| | source | target |
|---|---|---|
| 0 | Saint Vincent and the Grenadines | Trinidad and Tobago |
| 1 | Saint Vincent and the Grenadines | Trinidad and Tobago |
| 2 | Papua New Guinea | Saint Vincent and the Grenadines |
| 3 | Saint Vincent and the Grenadines | Trinidad and Tobago |
| 4 | Panama | Saint Vincent and the Grenadines |
| 5 | Saint Vincent and the Grenadines | South Africa |
| 6 | Saint Vincent and the Grenadines | Trinidad and Tobago |
| 7 | South Africa | Trinidad and Tobago |
| 8 | Saint Vincent and the Grenadines | Trinidad and Tobago |
| 9 | Saint Vincent and the Grenadines | Trinidad and Tobago |


### 5. Output

In [9]:
# --- Finalize and Count Relationships ---

# To treat (A, B) and (B, A) as the same relationship, we will sort each pair alphabetically.
# This ensures that 'Germany' and 'Japan' always becomes ('Germany', 'Japan'), never ('Japan', 'Germany').
sorted_relationships = [tuple(sorted(pair)) for pair in relationships]

# Create a new DataFrame from the sorted pairs.
sorted_df = pd.DataFrame(sorted_relationships, columns=['source', 'target'])

# Now, we count the occurrences of each unique, sorted pair.
final_relationships_df = sorted_df.value_counts().reset_index()
final_relationships_df.columns = ['source', 'target', 'value']

# Display the top 15 most frequent relationships.
print("--- Top 15 Country Relationships by Frequency ---")
display(final_relationships_df.head(15))

--- Top 15 Country Relationships by Frequency ---


| | source | target | value |
|---|---|---|---|
| 0 | Saint Vincent and the Grenadines | Trinidad and Tobago | 228 |
| 1 | Papua New Guinea | Saint Vincent and the Grenadines | 39 |
| 2 | Russia | Saint Vincent and the Grenadines | 38 |
| 3 | Saint Vincent and the Grenadines | United States | 37 |
| 4 | Germany | Saint Vincent and the Grenadines | 34 |
| 5 | Papua New Guinea | Trinidad and Tobago | 28 |
| 6 | Russia | Trinidad and Tobago | 28 |
| 7 | Japan | Saint Vincent and the Grenadines | 26 |
| 8 | Trinidad and Tobago | United States | 25 |
| 9 | Germany | Trinidad and Tobago | 21 |
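
The pairing and counting steps can also be expressed in one pass with the standard library. The sketch below assumes each sentence has already been reduced to a list of country names; `count_cooccurrences` and the sample data are hypothetical and only illustrate how the `value` column is derived.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(sentence_countries):
    """Count how often each unordered pair of countries shares a sentence."""
    counts = Counter()
    for countries in sentence_countries:
        # set() removes repeats within a sentence; sorted() fixes the pair order
        # so ('Germany', 'Japan') and ('Japan', 'Germany') land on the same key.
        for pair in combinations(sorted(set(countries)), 2):
            counts[pair] += 1
    return counts


# Illustrative input, not data from this notebook.
sample = [
    ["Japan", "Germany"],
    ["Germany", "Japan", "Russia"],
    ["Russia"],  # a single country yields no pair
]
for (source, target), value in count_cooccurrences(sample).most_common():
    print(source, target, value)
```

Each `(source, target)` key with its count corresponds to one row of the relationships table, so the result can be turned into a DataFrame or an edge list as needed.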


In [10]:
# --- Save and Export Your DataFrame ---
output_filename = "country_relationships.csv"
final_relationships_df.to_csv(output_filename, index=False)
print(f"\nSuccessfully saved the final relationships data to '{output_filename}'.")


Successfully saved the final relationships data to 'country_relationships.csv'.
