`Preprocessing Text:`

Clean and normalize the text in each cell.
Extract key terms using TF-IDF or NER to identify terms related to spatial and subject matter jurisdiction.

`Term/Keyword Extraction:`

Apply TF-IDF, NER, or other keyword extraction techniques to identify important terms (spatial areas, key objectives, subject areas).

`Similarity Analysis:`

Build similarity matrices or networks of IGOs based on the extracted keywords.
Perform clustering to identify groups of IGOs with similar mandates and jurisdictions.
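
As a minimal sketch of the similarity step, assuming each IGO has already been reduced to a set of keywords, a pairwise similarity matrix can be built with the standard library alone. The `igo_keywords` dictionary and the helper names below are hypothetical and for illustration only; Jaccard similarity is one possible measure, and cosine similarity over TF-IDF vectors would be another.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two keyword collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity_matrix(keyword_sets):
    """Return {(name_i, name_j): similarity} for every unordered pair of IGOs."""
    names = sorted(keyword_sets)
    return {
        (x, y): jaccard(keyword_sets[x], keyword_sets[y])
        for x, y in combinations(names, 2)
    }


# Hypothetical keyword sets, not taken from the dataset
igo_keywords = {
    "IGO A": {"fisheries", "marine", "food"},
    "IGO B": {"shipping", "marine", "pollution"},
    "IGO C": {"fisheries", "food", "nutrition"},
}

sims = similarity_matrix(igo_keywords)
for pair, score in sims.items():
    print(pair, round(score, 2))
```

Pairs with a high score share most of their keywords and are natural candidates for the same cluster.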

`Visualization:`

Create visualizations using network graphs or dimensionality reduction techniques (PCA or t-SNE) to display relationships and clustering of IGOs.

`Gap and Overlap Analysis:`

Quantify the overlaps and gaps in IGO activities to understand the areas of focus that might need more attention or collaboration.
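
One simple way to quantify overlaps and gaps, sketched here under the assumption that each IGO is represented by a set of keywords, is to count how many IGOs mention each term. The data, the `topics_of_interest` list, and the threshold are hypothetical; the real analysis may need a richer measure.

```python
from collections import Counter


def keyword_coverage(keyword_sets):
    """Count how many IGOs mention each keyword."""
    counts = Counter()
    for keywords in keyword_sets.values():
        counts.update(set(keywords))
    return counts


def overlaps_and_gaps(keyword_sets, topics_of_interest, overlap_threshold=2):
    """Return (topics covered by several IGOs, topics covered by none)."""
    coverage = keyword_coverage(keyword_sets)
    overlaps = sorted(t for t, n in coverage.items() if n >= overlap_threshold)
    gaps = sorted(t for t in topics_of_interest if coverage[t] == 0)
    return overlaps, gaps


# Hypothetical keyword sets and topics, not taken from the dataset
igo_keywords = {
    "IGO A": {"fisheries", "marine", "food"},
    "IGO B": {"shipping", "marine", "pollution"},
    "IGO C": {"fisheries", "food", "nutrition"},
}
topics_of_interest = ["fisheries", "shipping", "seabed mining"]

overlaps, gaps = overlaps_and_gaps(igo_keywords, topics_of_interest)
print("Overlaps:", overlaps)
print("Gaps:", gaps)
```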

# Text Preprocessing and Summarization
<p>The first step is to clean and preprocess the raw text in the dataset, as outlined above. This means removing irrelevant elements such as URLs, punctuation, and stopwords, and normalizing the text by lowercasing it.</p>

<p>The raw text data collected for each IGO contained a variety of non-essential elements, such as URLs, punctuation, and stopwords (commonly used words that do not carry meaningful information). We applied a cleaning process to remove these elements, ensuring that only relevant terms are retained.

**Steps Taken:**
* **URL removal:** URLs were identified and removed, as they do not contribute to the analysis.
* **Punctuation removal:** All punctuation marks (e.g., commas, periods, quotation marks) were removed to standardize the text.
* **Stopword removal:** Common words such as "the", "and", "is", and "are" were removed. They occur frequently but add little meaning to the analysis.
* **Lowercasing:** All text was converted to lowercase so that the same word in different cases (e.g., "Marine" vs "marine") is treated as one term.</p>

In [None]:
# !pip install -U spacy
# !python -m spacy download en_core_web_sm
# !pip install keybert
# !pip install pandas
# !pip install flair
# !pip install keyphrase-vectorizers

In [3]:
# import relevant libraries
from keybert import KeyBERT
import pandas as pd
import ast


import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# from keyphrase_vectorizers import KeyphraseVectorizer
from keyphrase_vectorizers import KeyphraseCountVectorizer
import string

# Ensure that the necessary NLTK data is downloaded
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/milo/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/milo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
# File path
file_path = "../Data/Ocean Governance and ocean economy governance matrix_IGOs.xlsx"
# Load the dataset
df = pd.read_excel(file_path, sheet_name="Original Columns Cleaned")

# Preview the df 
df.head(4)

Unnamed: 0,Institution,Year,Scale,Spatial Jurisdiction,Subject Matter Jurisdiction,Source of Jurisdiction,Defined Objectives,Strategies,Inter-institutional Relationship,Practical Vertical Coordination,Practical Horizontal Coordination,Horizontal Coordination 1,Horizontal Coordination 2,Horizontal Coordination 3,Horizontal Coordination 4,Horizontal Coordination 5,Horizontal Coordination 6,Horizontal Coordination 7,Horizontal Coordination 8
0,Intergovernmental Oceanographic Commission (IOC),1960.0,Global,IOC jurisdiction is global delineated by the b...,The IOC's subject matter jurisdiction encompas...,The IOC’s authority is derived from its statut...,The objectives of the Intergovernmental Oceano...,IOC implements its objectives through series o...,IOC collaborates with UN specialized agencies ...,Vertical coordination within the IOC involves ...,Horizontal coordination within the IOC encompa...,,,,,,,,
1,Food and Agriculture Organization of the Unite...,1945.0,Global,The FAO’s jurisdiction spans a vast array of m...,"FAO’s remit includes nutrition, food and agric...",The FAO’s jurisdiction is established through ...,"As stated in Article 1 of the Constitution, FA...",The FAO executes its objectives through a ser...,"As stated in its constitution, the FAO maintai...",The FAO’s vertical coordination involves colla...,Horizontal coordination within the FAO involve...,https://www.jus.uio.no/english/services/librar...,FAO https://www.fao.org/strategic-framework/en,,,,,,
2,Convention on the Intergovernmental Maritime C...,1948.0,Global,The IMO’s authority spans a global geographica...,The IMO's jurisdiction encompasses a comprehen...,The IMO's jurisdiction is established by the C...,"According to Part I, Article 1 of the Internat...",IMO implements its objectives and mandates thr...,The IMO collaborates with a diverse array of o...,Vertical coordination within IMO involves coll...,Horizontal coordination within the IMO involve...,https://wwwcdn.imo.org/localresources/en/About...,https://wwwcdn.imo.org/localresources/en/Knowl...,https://www.imo.org/en/MediaCentre/HotTopics/P...,,,,,
3,Division for Ocean Affairs and the Law of the ...,1992.0,Global,UN DALOS does not have authority over any spec...,The UN DOALOS's mandate includes providing inf...,DOALOS derives its mandate from the United Nat...,According to the Secretary-General’s bulletin ...,. DOALOS) executes its objectives through a mu...,DOALOS collaborates with key organizations to ...,DOALOS engages in vertical coordination with v...,Horizontal coordination within DOALOS involves...,https://www.un.org/oceancapacity/projects,https://www.un.org/oceancapacity/tf,https://www.un.org/Depts/los/doalos_publicatio...,https://treaties.un.org/doc/source/docs/A_RES_...,https://documents-dds-ny.un.org/doc/UNDOC/GEN/...,https://documents-dds-ny.un.org/doc/UNDOC/GEN/...,https://www.un.org/depts/los/clcs_new/document...,https://unsceb.org/sites/default/files/2023-11...


## Preprocess

In [5]:
# Make a copy of the df
data = df.copy(deep=True)

#### Cleaning

The first step in this project was to clean the text data extracted from various sources, so that the subsequent keyword and keyphrase extraction would be accurate and relevant. Links, URLs, and parenthetical external references were removed first, since they introduce noise that can distort keyword extraction. Special characters, such as symbols and punctuation marks, were then eliminated to focus on the core content. All organization names, in full or abbreviated form, were also stripped from the text, including entities such as the World Health Organization (WHO) and the Food and Agriculture Organization (FAO). Removing these names shifts the focus to the key concepts in the documents and prevents frequently mentioned entities from dominating the extracted keywords.

Stop words, which are common words such as "the," "and," "is," "in," and "on," were removed as well. These words carry little meaning for keyword extraction and can cloud the analysis. After cleaning, the data was ready for the next step: extraction of meaningful keywords and keyphrases.

In [7]:
# List of organization names (both full and abbreviation)
# Matching in clean_text is case-sensitive, so entries must appear exactly as in the source text
organization_names = [
    "Intergovernmental Oceanographic Commission", "IOC", 
    "Food and Agriculture Organization of the United Nations", "FAO", 
    "Convention on the Intergovernmental Maritime Consultative Organization", "IMO", 
    "Division for Ocean Affairs and the Law of the Sea", "UN DOALOS", 
    "Climate Change Secretariat", "International Seabed Authority", 
    "United Nations Environment Programme", "UNEP", "United Nations Development Programme", "UNDP", 
    "United Nations Conference on Trade and Development", "UNCTAD", 
    "United Nations Industrial Development Organization", "UNIDO", "International Labour Organization", "ILO", 
    "International Telecommunication Union", "ITU", "United Nations Children’s Fund", "UNICEF", 
    "World Health Organization", "WHO", "Commissioner for Refugees", "UNHCR", 
    "Office of the United Nations High Commissioner for Human Rights", "OHCHR", 
    "United Nations Office for Disaster Risk Reduction", "UNDRR", 
    "UN Global Compact Office", "International Atomic Energy Agency", "IAEA", 
    "World Meteorological Organization", "WMO", "Organisation for Economic Co-operation and Development", "OECD", 
    "World Bank Group", "WBG", "International Monetary Fund", "IMF", 
    "International Hydrographic Organization", "IHO", "International Council for the Exploration of the Sea", "ICES", 
    "Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services", "IPBES", 
    "IPCC", "World Trade Organization", "WTO", 
    "International Organization for Migration", "IOM", "United Nations Office for Project Services", "UNOPS", 
    "UN Entity for Gender Equality and the Empowerment of Women", "UN-Women", "World Intellectual Property Organisation", "WIPO", 
    "UN Population Fund", "UNFPA", "United Nations Human Settlements Programme", "UN-Habitat", 
    "World Food Programme", "WFP", "World Tourism Organization", "UN Tourism", "UNWTO", 
    "UN Research Institute for Social Development", "UNRISD", 
    "Secretariat of the Basel, Rotterdam and Stockholm Conventions", "BRS", 
    "Secretariat of the Convention on Biological Diversity", "CBD", 
    "Secretariat of the Convention on International Trade in Endangered Species of Wild Fauna and Flora", "CITES", 
    "Secretariat of the Convention on Migratory Species", "CMS", "International Fund for Agricultural Development", "IFAD", 
    "International Trade Centre", "ITC", "Secretariat of the United Nations Convention to Combat Desertification", "UNCCD", 
    "United Nations University", "UNU", "Ramsar Convention on Wetlands Secretariat", "Ramsar", 
    "Minamata Convention on Mercury", "Minamata", "United Nations Office for Outer Space Affairs", "UNOOSA", 
    "United Nations Office on Drugs and Crime", "UNODC", "unfccc", "UNFCCC"
]

# Build the English stop-word set once instead of on every call
ENGLISH_STOP_WORDS = set(stopwords.words('english'))

# Function to clean the text
def clean_text(text):
    # Treat missing or non-string cells (e.g. NaN) as empty text
    if not isinstance(text, str):
        return ''

    # Step 1: Remove URLs
    text = re.sub(r'http[s]?://\S+', '', text)

    # Step 2: Remove parenthetical text (abbreviations, citations, external references)
    text = re.sub(r'\(.*?\)', '', text)

    # Step 3: Remove organization names (full and abbreviation); matching is case-sensitive
    for org in organization_names:
        text = re.sub(r'\b' + re.escape(org) + r'\b', '', text)

    # Step 4: Remove ASCII punctuation; string.punctuation does not cover
    # typographic characters such as curly quotes, so those are left in place
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Step 5: Lowercase the text and tokenize it into words
    words = word_tokenize(text.lower())

    # Step 6: Remove stop words
    cleaned_words = [word for word in words if word not in ENGLISH_STOP_WORDS]

    # Step 7: Join words back into a cleaned sentence
    cleaned_text = ' '.join(cleaned_words)

    return cleaned_text

#### Keyword and Keyphrase Extraction
After the text data was cleaned, the next step was to extract relevant keywords and keyphrases. These summarize the content of each document, identify its core themes, and support search and categorization. KeyBERT, a library that extracts keywords and keyphrases using transformer embeddings, was used for this step.

Keyword extraction focused on single terms that represent the core topics of the document. KeyBERT was run on the cleaned text with `keyphrase_ngram_range=(1, 1)`, so only unigrams (single words) were considered as candidates. The `stop_words='english'` parameter filtered out common, unimportant words. The returned keywords are the candidates most similar to the document as a whole, which helps identify its primary subjects and themes.

In parallel, keyphrase extraction captured multi-word phrases that indicate more complex concepts. Setting `keyphrase_ngram_range=(2, 3)` restricted the candidates to bigrams and trigrams (pairs and triplets of words), giving a more nuanced view of the document's content. Maximal Marginal Relevance (MMR) with `diversity=0.7` was used to diversify the extracted keyphrases, so that they cover a broader range of topics with less redundancy.

Both keyword and keyphrase extraction were designed to be flexible, allowing adjustments to parameters such as the n-gram range, stop words, and relevance scoring. The flexibility in these parameters ensured that the extraction process could be fine-tuned to meet the specific needs of the documents being analyzed, providing more precise results tailored to different contexts.
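
To make the effect of the n-gram range concrete, the sketch below generates candidate terms for the two ranges mentioned above. It is a simplification under stated assumptions: it splits on whitespace and skips stop-word removal, whereas KeyBERT builds its candidates with scikit-learn's `CountVectorizer`. The function name `ngram_candidates` and the sample text are hypothetical.

```python
def ngram_candidates(text, ngram_range=(1, 1)):
    """Return the unique word n-grams of `text` for every n in the inclusive range."""
    tokens = text.lower().split()
    low, high = ngram_range
    candidates = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            if gram not in candidates:
                candidates.append(gram)
    return candidates


doc = "marine scientific research capacity"

# (1, 1) yields single words only
unigrams = ngram_candidates(doc, (1, 1))
# (2, 3) yields bigrams and trigrams only
phrases = ngram_candidates(doc, (2, 3))

print(unigrams)
print(phrases)
```

The range only controls which candidates exist; ranking them against the document is a separate step.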

* **KeyBERT**

#### KeyBERT: An Overview and Why It Was the Best Choice for This Task
KeyBERT is a keyword and keyphrase extraction library built on pre-trained transformer embedding models from the BERT (Bidirectional Encoder Representations from Transformers) family. It embeds the document and each candidate word or phrase, then ranks the candidates by their cosine similarity to the document embedding, so that the terms closest in meaning to the text as a whole are returned. It produces usable keywords and keyphrases with minimal configuration, which makes it a good fit for content summarization and text analysis tasks.

**Why KeyBERT Was Chosen for This Task**

KeyBERT was selected for this task due to several reasons that align with the objectives of the project. The primary goal was to extract relevant keywords and keyphrases from cleaned text documents, focusing on meaningful terms that reflect the core concepts and ideas. KeyBERT excels in this area for the following reasons:

* Contextual Understanding: Unlike traditional frequency-based methods (such as TF-IDF), KeyBERT compares the meaning of each candidate term with the meaning of the whole document. This matters for extracting keywords that represent the content rather than simply counting occurrences of words. For example, a term that appears only once can still rank highly if it is semantically close to the document's overall subject matter.

* BERT-based Model: KeyBERT relies on embedding models pre-trained on large amounts of text. Calling `KeyBERT()` with no arguments, as in this notebook, loads the library's default sentence-transformers model (`all-MiniLM-L6-v2`) rather than the original BERT checkpoint. These embeddings capture semantic relationships between words, which lets KeyBERT identify keywords and keyphrases that are semantically meaningful within the document and not just frequent.

* Flexibility with N-grams: KeyBERT supports the extraction of both single-word keywords (unigrams) and multi-word keyphrases (bigrams and trigrams). This flexibility allows it to adapt to the needs of the analysis—whether a task requires the identification of individual keywords or more complex multi-word phrases that capture nuanced ideas. In this project, the n-gram range was adjusted to extract both unigrams and bigrams/trigrams, ensuring that the extracted terms ranged from simple concepts to more complex expressions, thus providing a more comprehensive summary of the text.

* Diversity through Maximal Marginal Relevance (MMR): KeyBERT can apply Maximal Marginal Relevance (MMR) to reduce redundancy in the extracted keyphrases. MMR penalizes candidates that are too similar to keyphrases already selected, which increases the diversity of the results. This was particularly important in this project, where a wide range of topics and themes needed to be captured without repetition.

* Speed and Efficiency: KeyBERT is known for its efficiency in performing keyword extraction. It can handle large documents and datasets with relative speed, making it a practical choice for this project, where numerous text documents needed to be processed. The model’s lightweight nature also ensures that the extraction process can be carried out without significant computational overhead, making it accessible for real-time applications or batch processing of large volumes of text.

* Minimal Preprocessing: KeyBERT requires minimal preprocessing of the text compared to traditional keyword extraction methods. Once the text is cleaned (removing stop words, organization names, and external references), KeyBERT can directly extract meaningful keywords and keyphrases without requiring elaborate manual adjustments or extensive parameter tuning. This made it ideal for this project, where the focus was on automating the extraction process and reducing the complexity of manual interventions.
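
The MMR selection described in the list above can be sketched in plain Python. This is an illustration under stated assumptions, not KeyBERT's actual code: the two-dimensional vectors are made-up stand-ins for real embeddings, and the function names are hypothetical. Each round picks the candidate with the best trade-off between similarity to the document and similarity to the phrases already chosen.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(doc_vec, candidates, top_n=2, diversity=0.7):
    """Select top_n phrases from `candidates` (phrase -> vector) using MMR."""
    relevance = {p: cosine(doc_vec, v) for p, v in candidates.items()}
    # Start with the candidate most similar to the document
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < min(top_n, len(candidates)):
        best, best_score = None, -math.inf
        for phrase, vec in candidates.items():
            if phrase in selected:
                continue
            # Redundancy: similarity to the closest phrase already selected
            redundancy = max(cosine(vec, candidates[s]) for s in selected)
            score = (1 - diversity) * relevance[phrase] - diversity * redundancy
            if score > best_score:
                best, best_score = phrase, score
        selected.append(best)
    return selected


# Made-up vectors standing in for real embeddings
doc_vec = [1.0, 0.5]
candidates = {
    "ocean science": [1.0, 0.1],
    "marine science": [0.9, 0.2],
    "capacity building": [0.1, 1.0],
}

diverse = mmr(doc_vec, candidates, top_n=2, diversity=0.7)
relevant_only = mmr(doc_vec, candidates, top_n=2, diversity=0.0)
print(diverse)
print(relevant_only)
```

With `diversity=0.7` the second pick avoids the near-duplicate of the first phrase, while `diversity=0.0` reduces to ranking by relevance alone.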


In [8]:
# Function to extract keywords from a document
kw_model = KeyBERT()

def keywords_extractor(doc):
    keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words='english')
    return [keyword[0] for keyword in keywords]

# Function to extract keyphrases from a document
def keyphrases_extractor(doc):
    keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(2, 3), stop_words='english',
                                     use_mmr=True, diversity=0.7)
    return [keyword[0] for keyword in keywords]  


#### Highlights

In [12]:
# Function to highlight keywords in a document
def keywords_highlight(doc):
    # Reuse the module-level kw_model instead of loading a new model on every call.
    # Note: extract_keywords(..., highlight=True) only prints the highlighted document
    # to the console; it still returns a list of (keyword, score) tuples.
    keywords = kw_model.extract_keywords(doc)

    # Build the highlighted text explicitly by wrapping each keyword in ** markers
    highlighted_text = doc
    for keyword, _score in keywords:
        pattern = r'\b' + re.escape(keyword) + r'\b'
        highlighted_text = re.sub(pattern, lambda m: '**' + m.group(0) + '**',
                                  highlighted_text, flags=re.IGNORECASE)

    # Return the text with the extracted keywords marked
    return highlighted_text

#### 1. Spatial Jurisdiction
    Spatial jurisdiction delineates geographical areas where an IGO operates (geographic coverage).

In [13]:
# Clean
data['Spatial Jurisdiction Cleaned'] = data['Spatial Jurisdiction'].apply(clean_text)
# Apply the extractors to each cleaned document and store the results in new columns
data['Spatial Jurisdiction_keywords'] = data['Spatial Jurisdiction Cleaned'].apply(keywords_extractor)
data['Spatial Jurisdiction_keyphrases'] = data['Spatial Jurisdiction Cleaned'].apply(keyphrases_extractor)
# Highlight keywords in the original (uncleaned) text; write to the working copy `data`, not `df`
data['Spatial Jurisdiction_highlights'] = data['Spatial Jurisdiction'].apply(keywords_highlight)

#### 2. Subject Matter Jurisdiction
    Subject matter jurisdiction defines thematic areas of focus, influence, and impact.

In [None]:
# Initialize the KeyBERT model
kw_model = KeyBERT()

# Custom stop words list (add any words/phrases you want to exclude)
# Entries are compared to the extracted terms exactly, so they must not have leading or trailing spaces
custom_stop_words = [
    "jurisdiction", "subject matter jurisdiction", "subject", "mandate", "encompasses",
    "responsible", "space2030 agenda 2030", "facilitating", "unfccc", "isa",
    "womens", "girls", "rotterdam stockholm conventions", "2030 agenda"
]

# Function to extract keywords from a document, excluding custom stop words
def keywords_extractor(doc):
    # Extract keywords using KeyBERT
    keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words='english')

    # Filter out keywords that are in the custom stop words list
    filtered_keywords = [keyword[0] for keyword in keywords if keyword[0].lower() not in custom_stop_words]

    return filtered_keywords

# Function to extract keyphrases from a document, excluding custom stop words
def keyphrases_extractor(doc):
    # Extract keyphrases using KeyBERT
    keyphrases = kw_model.extract_keywords(doc, keyphrase_ngram_range=(2, 3), stop_words='english',
                                           use_mmr=True, diversity=0.7)

    # Filter out keyphrases that are in the custom stop words list
    filtered_keyphrases = [keyword[0] for keyword in keyphrases if keyword[0].lower() not in custom_stop_words]

    return filtered_keyphrases


In [None]:
# Clean
data['Subject Matter Jurisdiction Cleaned'] = data['Subject Matter Jurisdiction'].apply(clean_text)
# Apply the function to each document and create a new column 'keywords'
data['Subject Matter Jurisdiction_keywords'] = data['Subject Matter Jurisdiction Cleaned'].apply(keywords_extractor)
data['Subject Matter Jurisdiction_keyphrases'] = data['Subject Matter Jurisdiction Cleaned'].apply(keyphrases_extractor)
# df['Spatial Jurisdiction_highlights'] = df['Spatial Jurisdiction'].apply(keywords_highlight)

#### 3. Source of Jurisdiction
    Indicates an IGO’s legal basis and authority, reflecting on compliance and enforcement.

In [None]:
# Initialize the KeyBERT model
kw_model = KeyBERT()

# Custom stop words list
custom_stop_words = [
    "jurisdiction", "authority", "organization", "organizations", "organizational", "nations", "oceans",
    "establishment", "development", "provides", "purpose", "outline", "ocean", "mandate", "doalos"
]

# Function to extract keywords from a document, excluding custom stop words
def keywords_extractor(doc):
    # Extract candidates with KeyphraseCountVectorizer (part-of-speech based phrases).
    # When a vectorizer is passed, KeyBERT uses the vectorizer's own settings and
    # ignores its stop_words and keyphrase_ngram_range arguments.
    keywords = kw_model.extract_keywords(doc, vectorizer=KeyphraseCountVectorizer(), stop_words='english')

    # Filter out keywords that are in the custom stop words list
    filtered_keywords = [keyword[0] for keyword in keywords if keyword[0].lower() not in custom_stop_words]

    return filtered_keywords


# Function to extract keyphrases from a document, excluding custom stop words
def keyphrases_extractor(doc):
    # Extract keyphrases using KeyBERT with Max Sum Distance for diversity.
    # highlight=True prints the highlighted document to the console for every row.
    keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 3), stop_words='english',
                              use_maxsum=True, nr_candidates=15, top_n=10, highlight=True)

    # Filter out keyphrases that are in the custom stop words list
    filtered_keyphrases = [keyword[0] for keyword in keywords if keyword[0].lower() not in custom_stop_words]

    return filtered_keyphrases

In [None]:
# from keybert import KeyBERT

# # Initialize KeyBERT
# kw_model = KeyBERT()

# # Sample document
# doc = data['Source of Jurisdiction Cleaned'][0]

# # Extract keywords
# keywords = kw_model.extract_keywords(doc, top_n=20)
# filtered_keywords = [keyword[0] for keyword in keywords if keyword[0].lower() not in custom_stop_words]

# # Print the keywords
# filtered_keywords

In [None]:
# Clean
data['Source of Jurisdiction Cleaned'] = data['Source of Jurisdiction'].apply(clean_text)
# Apply the extractors to each organization's document and create the keyword and keyphrase columns
data['Source of Jurisdiction Keywords'] = data['Source of Jurisdiction Cleaned'].apply(keywords_extractor)
data['Source of Jurisdiction Keyphrases'] = data['Source of Jurisdiction Cleaned'].apply(keyphrases_extractor)
# df['Spatial Jurisdiction_highlights'] = df['Spatial Jurisdiction'].apply(keywords_highlight)

In [None]:
df.head()

#### 4. Defined Objectives
`Indicates an IGO’s mission, vision, expected outcomes, and impacts.`

In [6]:
# from transformers import pipeline
# summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
# summary = summarizer(doc, max_length=200, min_length=150, do_sample=False)
# print(summary)


In [None]:
# from transformers import pipeline
# summarizer = pipeline("summarization", model="google/pegasus-xsum")
# summary = summarizer(doc, max_length=100, min_length=50, do_sample=False)
# print(summary)


In [None]:
# from transformers import T5Tokenizer, T5ForConditionalGeneration

# model = T5ForConditionalGeneration.from_pretrained("t5-small")
# tokenizer = T5Tokenizer.from_pretrained("t5-small")

# input_text = data['Defined Objectives'][0]
# inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
# summary_ids = model.generate(inputs['input_ids'], max_length=100, min_length=50, length_penalty=2.0, num_beams=4, early_stopping=True)
# summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
# print(summary)


In [7]:
# Initialize the KeyBERT model
kw_model = KeyBERT()

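# Custom stop words list (entries must not have leading or trailing spaces)
custom_stop_words = ["secretariats", "statutes", "include", "duties", "convention", "objectives"]

# Function to extract keywords from a document, excluding custom stop words
def keywords_extractor(doc):
    # Extract candidates with KeyphraseCountVectorizer (part-of-speech based phrases).
    # When a vectorizer is passed, KeyBERT uses the vectorizer's own settings and
    # ignores its stop_words and keyphrase_ngram_range arguments.
    keywords = kw_model.extract_keywords(doc, vectorizer=KeyphraseCountVectorizer(), stop_words='english')

    # Filter out keywords that are in the custom stop words list
    filtered_keywords = [keyword[0] for keyword in keywords if keyword[0].lower() not in custom_stop_words]

    return filtered_keywords


# Function to extract keyphrases from a document, excluding custom stop words
def keyphrases_extractor(doc):
    # Extract keyphrases using KeyBERT with Max Sum Distance.
    # nr_candidates equals top_n here, so there is only one combination to choose
    # from and the diversity step has no effect; a larger nr_candidates would change that.
    # highlight=True prints the highlighted document to the console for every row.
    keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(2, 4), stop_words='english',
                              use_maxsum=True, nr_candidates=25, top_n=25, highlight=True)

    # Filter out keyphrases that are in the custom stop words list
    filtered_keyphrases = [keyword[0] for keyword in keywords if keyword[0].lower() not in custom_stop_words]

    return filtered_keyphrases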


from transformers import pipeline
# Initialize the BART summarization model
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_text(text):
    summary = summarizer(text, max_length=150, min_length=50, do_sample=False)
    return summary[0]['summary_text']

In [8]:
# Clean
data['Defined Objectives Cleaned'] = data['Defined Objectives'].apply(clean_text)
# Apply the extractors to each organization's document and create the keyword and keyphrase columns
data['Defined Objectives Cleaned Keywords'] = data['Defined Objectives Cleaned'].apply(keywords_extractor)
data['Defined Objectives Cleaned Keyphrases'] = data['Defined Objectives Cleaned'].apply(keyphrases_extractor)
# Create a new column 'summarized' with the summarized text
data['Defined Objectives Summarized'] = data['Defined Objectives Cleaned'].apply(summarize_text)

Your max_length is set to 150, but your input_length is only 73. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=36)
Your max_length is set to 150, but your input_length is only 76. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=38)
Your max_length is set to 150, but your input_length is only 97. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=48)
Your max_length is set to 150, but your input_length is only 105. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=52)
You
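
The warnings above appear because `max_length=150` exceeds the length of many inputs. One possible way to avoid them, sketched here as an assumption and not as the notebook's method, is to scale the length limits to each input. The helper names, the 0.5 ratio, and the `count_tokens` callable are all hypothetical choices.

```python
def length_bounds(n_tokens, max_cap=150, min_floor=50, ratio=0.5):
    """Choose (min_length, max_length) so that max_length stays below the input length."""
    max_length = max(2, min(max_cap, int(n_tokens * ratio)))
    min_length = min(min_floor, max(1, max_length // 2))
    return min_length, max_length


def summarize_text_adaptive(text, summarizer, count_tokens):
    """Summarize `text` with length limits scaled to its token count.

    `summarizer` is any callable that accepts the same arguments as the
    summarization pipeline used in this notebook; `count_tokens` returns
    the number of input tokens for a text.
    """
    min_len, max_len = length_bounds(count_tokens(text))
    result = summarizer(text, max_length=max_len, min_length=min_len, do_sample=False)
    return result[0]["summary_text"]


# Bounds for a few input lengths, including two reported in the warnings
for n in (73, 105, 400):
    print(n, length_bounds(n))
```

With the pipeline defined earlier, this could be called as `summarize_text_adaptive(text, summarizer, lambda t: len(summarizer.tokenizer(t)["input_ids"]))`, assuming the pipeline's tokenizer is used to count tokens.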

In [9]:
cols = ["Institution", "Defined Objectives", "Defined Objectives Cleaned", "Defined Objectives Cleaned Keywords", "Defined Objectives Cleaned Keyphrases", "Defined Objectives Summarized"]

new_df = data[cols]

In [10]:
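# Save the Defined Objectives columns to their own sheet in the workbook
# if_sheet_exists='replace' overwrites a sheet of the same name left by an earlier run
with pd.ExcelWriter(file_path, mode='a', if_sheet_exists='replace') as writer:
    new_df.to_excel(writer, sheet_name='Defined Objectives', index=False)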

In [None]:
data.head(3)

#### 5. Strategies
`Indicate means and methods of achieving objectives, reflecting an IGO’s adaptation and innovation capabilities.`

In [11]:
# Clean
data['Strategies Cleaned'] = data['Strategies'].apply(clean_text)
# Apply the extractors to each organization's document and create the keyword and keyphrase columns
data['Strategies Keywords'] = data['Strategies Cleaned'].apply(keywords_extractor)
data['Strategies Keyphrases'] = data['Strategies Cleaned'].apply(keyphrases_extractor)
# Create a new column 'summarized' with the summarized text
data['Strategies Summarized'] = data['Strategies Cleaned'].apply(summarize_text)

Your max_length is set to 150, but your input_length is only 143. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=71)
Your max_length is set to 150, but your input_length is only 115. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=57)
Your max_length is set to 150, but your input_length is only 89. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=44)
Your max_length is set to 150, but your input_length is only 118. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=59)
Y

#### 6. Inter-institutional Relationship
`Indicates an IGO’s role in the institutional environment and its influence on other actors and institutions.`

In [91]:
import re

def remove_coordination_and_links(text):
    # Regex to match text starting with "Horizontal coordination within" or
    # "Vertical coordination within", up to the end of the line
    coordination_pattern = r'\b(?:Horizontal|Vertical) coordination within[^\n]*'
    
    # Regex to match and remove URLs (links)
    url_pattern = r'https?://\S+'
    
    # Remove sentences starting with the specified phrases
    text_without_coordination = re.sub(coordination_pattern, '', text)
    
    # Remove URLs (links)
    cleaned_text = re.sub(url_pattern, '', text_without_coordination)
    
    return cleaned_text

In [95]:
# Function to remove URLs from text
def remove_urls(text):
    url_pattern = r'https?://\S+'
    return re.sub(url_pattern, '', text)


In [31]:
import spacy

# Load the spaCy model for English
nlp = spacy.load("en_core_web_sm")

def extract_orgs(text):
    
    # Process the text using spaCy NLP model
    doc = nlp(text)

    # Extract named entities of type ORG (organizations) and store them in a set
    organizations_set = {ent.text for ent in doc.ents if ent.label_ == "ORG"}

    return organizations_set

In [32]:
# Clean urls 
data['Inter-institutional Relationship Cleaned'] = data['Inter-institutional Relationship'].apply(remove_urls)
data['Inter-institutional Relationship Summary'] = data['Inter-institutional Relationship Cleaned'].apply(extract_orgs)

In [42]:
new_df = data[['Institution', 'Inter-institutional Relationship', 'Inter-institutional Relationship Cleaned', 'Inter-institutional Relationship Summary']]

#### 7. Vertical Coordination
`Indicates an IGO’s interactions and collaborations across different levels of governance.`

In [69]:
from transformers import pipeline

# Use the summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def vert_summ(doc):
    # Summarize the text; this returns the raw pipeline output,
    # a list containing one {'summary_text': ...} dict
    summary = summarizer(doc, max_length=150, min_length=30, do_sample=False)
    return summary


In [70]:
data["Practical Vertical Coordination Cleaned"] = data["Practical Vertical Coordination"].apply(remove_urls)
data["Practical Vertical Coordination Summary"] = data["Practical Vertical Coordination Cleaned"].apply(vert_summ)

Your max_length is set to 150, but your input_length is only 67. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=33)


In [71]:
data["Practical Vertical Coordination Summary"][7]

[{'summary_text': "The UNDP operates through the Executive Board of 36 members, which oversees and supports the activities of UNDP, UNFPA, UNOPS, and UN Women. There is also the UN Development Group (UNDG), which unites the 40 UN funds, programmes, specialized agencies, departments, and offices that play a role in development. The UNDP also has country offices and technical experts that support the UNDP's work."}]

#### 8. Horizontal Coordination
`Indicates an IGO’s interaction with actors and institutions at the same governance level.`

In [101]:
data['Practical Horizontal Coordination Cleaned'] = data['Practical Horizontal Coordination'].apply(remove_urls)
data["Practical Horizontal Coordination Summary"] = data["Practical Horizontal Coordination Cleaned"].apply(vert_summ)

Your max_length is set to 150, but your input_length is only 126. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=63)
Your max_length is set to 150, but your input_length is only 78. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=39)


In [104]:
data['Practical Horizontal Coordination Summary'][19]

[{'summary_text': 'Horizontal coordination within the OECD involves collaboration with different sectors and stakeholders within the economic and social domains. This coordination is guided by Article 2 of the Convention on the Organisation for Economic Co-operation and Development. The OECD works with the education sector to improve learning outcomes and skills development.'}]

In [105]:
new_df = data[["Institution", "Practical Horizontal Coordination", "Practical Horizontal Coordination Cleaned", "Practical Horizontal Coordination Summary"]]

In [106]:
# Save the Horizontal Coordination columns to their own sheet in the workbook
with pd.ExcelWriter(file_path, mode='a') as writer:
    new_df.to_excel(writer, sheet_name='Horizontal Coordination', index=False)

