# <center>Themes Identification</center>

#### Overview:
<p>The proposed method uses KeyBERT, a keyword extraction library built on transformer embedding models such as BERT. KeyBERT extracts keywords and keyphrases that are semantically close to the document as a whole, which makes it suitable for large or complex text datasets. It was chosen for its context-aware extraction, ease of use, and ability to handle the diverse and nuanced language of academic and policy-related texts.</p>

##### Why KeyBERT is the Best Choice:

`Contextual Understanding:`

KeyBERT leverages BERT and other transformer-based models, which have been pre-trained on vast corpora to represent words in context. Traditional methods such as TF-IDF rely on term frequency, and Word2Vec assigns each word a single context-independent vector.
This makes it particularly useful for identifying themes that are deeply tied to the meaning of the words in the document rather than just their frequency.
In the context of ocean governance, KeyBERT can accurately identify complex terms and ideas related to international cooperation, policy frameworks, and environmental governance that might be overlooked by simpler methods.

`Theme Identification and Semantic Relevance:`

KeyBERT extracts the most relevant keywords by embedding the document and each candidate word or phrase, then ranking candidates by their cosine similarity to the document embedding. This allows for the identification of key concepts and themes (e.g., state actors, international law, environmental cooperation) without manual tagging or categorization.
It is especially helpful when extracting abstract themes from documents that discuss high-level concepts like global environmental governance or policy analysis.
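
The ranking idea can be sketched in a few lines. This is a minimal illustration rather than KeyBERT's own code: the three-dimensional vectors are invented stand-ins for real transformer embeddings, and `rank_candidates` is a hypothetical helper name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(doc_vec, candidate_vecs, top_n=3):
    """Return (candidate, score) pairs sorted by similarity to the document."""
    scored = [(word, cosine(doc_vec, vec)) for word, vec in candidate_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical 3-dimensional "embeddings", invented for illustration only.
doc_vec = [0.9, 0.8, 0.1]
candidate_vecs = {
    "governance": [1.0, 0.7, 0.0],
    "treaty": [0.6, 0.9, 0.2],
    "however": [0.0, 0.1, 1.0],
}

ranked = rank_candidates(doc_vec, candidate_vecs, top_n=2)
for word, score in ranked:
    print(f"{word}: {score:.3f}")
```

With these toy vectors, "governance" and "treaty" point in nearly the same direction as the document and rank first, while the filler word "however" scores low and is dropped.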

`Flexibility with Multilingual and Domain-Specific Texts:`

KeyBERT can handle multilingual datasets when it is paired with a multilingual embedding model, which may be useful for analyzing international documents related to governance, treaties, and policies that are written in multiple languages.
It can also be easily adapted to handle domain-specific jargon and technical terms by using appropriate transformer models tailored for the domain (e.g., domain-specific BERT models).

`Ease of Use and Minimal Setup:`

KeyBERT is easy to implement with minimal coding required. It provides an intuitive interface to extract keywords or themes, which is ideal for a researcher or practitioner with limited machine learning background.
The method requires no labeled data or complex training processes, as it is an unsupervised approach. This makes it highly practical for quick analysis of large datasets.

`Maximal Marginal Relevance (MMR) for Diversity:`

The MMR feature in KeyBERT allows for a balance between relevance and diversity in the keywords. This feature is valuable for ensuring that the extracted themes do not become repetitive and that the set of keywords represents a broad spectrum of ideas related to the subject.
This is particularly useful in documents where you might want to capture the full range of themes (e.g., different aspects of ocean governance like international treaties, climate change adaptation, and public-private partnerships).
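
The trade-off between relevance and diversity can be made concrete with a small sketch of the general MMR selection rule. It is an illustration under stated assumptions, not KeyBERT's source: the vectors are invented, and `mmr_select` is a hypothetical helper. Each step picks the candidate that maximizes `(1 - diversity) * similarity_to_document - diversity * max_similarity_to_already_selected`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mmr_select(doc_vec, candidate_vecs, top_n=2, diversity=0.7):
    """Greedy MMR: start with the most relevant phrase, then penalize redundancy."""
    names = list(candidate_vecs)
    relevance = {name: cosine(doc_vec, candidate_vecs[name]) for name in names}

    selected = [max(names, key=relevance.get)]
    remaining = [name for name in names if name != selected[0]]

    while remaining and len(selected) < top_n:
        def mmr_score(name):
            # Redundancy = similarity to the closest phrase already chosen
            redundancy = max(cosine(candidate_vecs[name], candidate_vecs[s]) for s in selected)
            return (1 - diversity) * relevance[name] - diversity * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical embeddings, invented for illustration only.
doc_vec = [1.0, 1.0, 0.3]
candidate_vecs = {
    "ocean governance": [1.0, 0.9, 0.0],
    "marine governance": [1.0, 0.8, 0.0],   # near-duplicate of the first phrase
    "climate adaptation": [0.5, 0.9, 0.6],
}

print(mmr_select(doc_vec, candidate_vecs, diversity=0.0))  # ['ocean governance', 'marine governance']
print(mmr_select(doc_vec, candidate_vecs, diversity=0.7))  # ['ocean governance', 'climate adaptation']
```

With `diversity=0.0` the two near-identical governance phrases are both returned; raising it to 0.7 swaps the duplicate for a phrase covering a different aspect.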

In [None]:
# import relevant libraries
from keybert import KeyBERT
import pandas as pd
import ast


In [None]:
# load cleaned sections
file_path = '../Data/Extracted Sections/Attribute_Papers (1).xlsx'
data = pd.read_excel(file_path, sheet_name="Cleaned_Extracted_Sections")

In [None]:
# Replacing NaN values with empty strings 
data.fillna('', inplace=True)
# preview the df
data.head()

In [None]:
# Create the function to remove list brackets and join words into a single string
def remove_list_brackets(input_str):
    """
    Function to remove list brackets and join words into a continuous string.
    Assumes input is a string representation of a list of lists.
    """
    # Empty cells (NaN replaced by '') have nothing to parse
    if not input_str.strip():
        return ''

    # Convert string to list using ast.literal_eval
    sections = ast.literal_eval(input_str)

    # Flatten the list of lists into one list of words
    flat_list = [word for sublist in sections for word in sublist]

    # Join the words into a single string separated by spaces
    return ' '.join(flat_list)

In [None]:
# Apply the function to the column
data['Cleaned_Introduction'] = data['Cleaned_Introduction'].apply(remove_list_brackets)


In [None]:
# Preview cleaned data
data.head()

#### KeyBERT Workflow:

$Input:$
>Provide KeyBERT with raw text (e.g., paragraphs from research papers, policy documents, or other content related to ocean governance).

$Keyword\ Extraction:$
>Use KeyBERT’s `extract_keywords()` method to generate semantically relevant keywords. The model embeds the document and each candidate word or phrase, then ranks the candidates by their similarity to the document.

$Post\text{-}processing:$
>Optionally, use Maximal Marginal Relevance (MMR) to adjust the diversity of the extracted keywords.
Review and refine the extracted keywords so that they capture the core themes and align with the aspects of ocean governance under study.

$Theme\ Identification:$
>After extracting the keywords, you can cluster similar keywords together or manually group them to identify abstract themes or topics (e.g., marine conservation, policy frameworks, climate change mitigation, etc.).
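
The four steps can be sketched end to end as follows. This is a hedged outline, not the notebook's final pipeline: `extract_phrases`, `group_into_themes`, and `THEME_SEEDS` are hypothetical names, the seed words are invented, and the model is passed in as an argument so that any object exposing KeyBERT's `extract_keywords` method can be used.

```python
from collections import defaultdict

def extract_phrases(doc, kw_model, top_n=5, diversity=0.5):
    """Input, extraction, and MMR: return only the phrase strings."""
    if not doc.strip():
        return []  # nothing to extract from an empty section
    scored = kw_model.extract_keywords(
        doc,
        keyphrase_ngram_range=(1, 2),
        stop_words="english",
        use_mmr=True,
        diversity=diversity,
        top_n=top_n,
    )
    return [phrase for phrase, _score in scored]

def group_into_themes(phrases, theme_seeds):
    """Theme identification by manual grouping: match on shared seed words."""
    grouped = defaultdict(list)
    for phrase in phrases:
        words = set(phrase.lower().split())
        for theme, seeds in theme_seeds.items():
            if words & seeds:
                grouped[theme].append(phrase)
    return dict(grouped)

# Hypothetical seed words, invented for illustration only.
THEME_SEEDS = {
    "Marine conservation": {"conservation", "biodiversity"},
    "Policy frameworks": {"policy", "treaty"},
}

# Typical use (needs the keybert package; a model is downloaded on first run):
#     from keybert import KeyBERT
#     phrases = extract_phrases(text, KeyBERT())
#     themes = group_into_themes(phrases, THEME_SEEDS)
```

Passing the model in, rather than creating it inside the function, means the embedding model is loaded once and reused for every document.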

#### Advantages of KeyBERT:
* Deep Semantic Understanding: Captures the meaning of words in context rather than relying on frequency or superficial relationships.
* Unsupervised Learning: Does not require labeled data, making it ideal for quick analysis and explorations of large text datasets.
* High Accuracy: Identifies themes reliably in complex, technical, or domain-specific texts (like ocean governance), particularly when paired with an embedding model suited to the domain.
* Flexible and Scalable: Suitable for both small datasets (e.g., individual reports) and large datasets (e.g., collections of academic papers or policy documents).
* Minimal Computational Overhead: The transformer models used by KeyBERT are pre-trained, so no model has to be trained from scratch. Only inference is run, which saves significant time and computational resources.

* #### Utilities

In [None]:
# Load the embedding model once and reuse it; calling KeyBERT() per document reloads the model each time
kw_model = KeyBERT()

# Function to extract keywords from a document
def keywords_extractor(doc):
    # Single words only; stop words are kept as candidates (stop_words=None)
    keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
    return [keyword[0] for keyword in keywords]  # Keep only the keyword from each (keyword, score) tuple

# Function to extract keyphrases from a document
def keyphrases_extractor(doc):
    # Phrases of 2 to 5 words; MMR with diversity=0.7 favours varied phrases over near-duplicates
    keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(2, 5), stop_words='english',
                                         use_mmr=True, diversity=0.7)
    return [keyword[0] for keyword in keywords]  # Keep only the keyphrase from each (keyphrase, score) tuple


In [None]:
# Function to print a document with its keywords highlighted using KeyBERT's highlighting
def keywords_highlight(doc):
    kw_model = KeyBERT()

    # highlight=True makes KeyBERT print the highlighted document as a side effect;
    # the return value is still the list of (keyword, score) tuples
    keywords = kw_model.extract_keywords(doc, highlight=True)

    # The highlighted text is only printed, not returned, so return the extracted keywords
    return [keyword[0] for keyword in keywords]

* #### Abstract

In [None]:
# Apply the function to each document and create a new column
data['Cleaned_Abstract_keywords'] = data['Cleaned_Abstract'].apply(keywords_extractor)
data['Cleaned_Abstract_keyphrases'] = data['Cleaned_Abstract'].apply(keyphrases_extractor)
# data['Cleaned_Abstract_highlights'] = data['Cleaned_Abstract'].apply(keywords_highlight)

* #### Introduction

In [None]:
# Apply the function to each document and create a new column 'keywords'
data['Cleaned_Introduction_keywords'] = data['Cleaned_Introduction'].apply(keywords_extractor)
data['Cleaned_Introduction_keyphrases'] = data['Cleaned_Introduction'].apply(keyphrases_extractor)
# data['Cleaned_Introduction_highlights'] = data['Cleaned_Introduction'].apply(keywords_highlight)

* #### Conclusion

In [None]:
# Apply the function to each document and create a new column 'keywords'
data['Cleaned_Conclusion_keywords'] = data['Cleaned_Conclusion'].apply(keywords_extractor)
data['Cleaned_Conclusion_keyphrases'] = data['Cleaned_Conclusion'].apply(keyphrases_extractor)
# data['Cleaned_Conclusion_highlights'] = data['Cleaned_Conclusion'].apply(keywords_highlight)

In [None]:
df = data[["File Name", "Cleaned_Introduction_keywords", "Cleaned_Introduction_keyphrases", "Cleaned_Abstract_keywords", "Cleaned_Abstract_keyphrases", "Cleaned_Conclusion_keywords", "Cleaned_Conclusion_keyphrases"]]

In [None]:
# # Save the keyword and keyphrase columns as a new sheet
# with pd.ExcelWriter(file_path, mode='a') as writer:
#     df.to_excel(writer, sheet_name='Keypharses_Extracted_Sections', index=False)

In [None]:
# load the extracted keywords and keyphrases
file_path = '../Data/Extracted Sections/Attribute_Papers (1).xlsx'
data = pd.read_excel(file_path, sheet_name="Keypharses_Extracted_Sections")
data.head()

In [None]:
df = data.copy(deep=True)

In [None]:
# Replacing NaN values with empty strings 
df.fillna('', inplace=True)
# preview the df
df.head()

In [None]:
# Create the function to remove list brackets and join words into a single string
def remove_list_brackets(input_str):
    """
    Function to remove list brackets and join words into a comma-separated string.
    Assumes input is a string representation of one or more lists, e.g. "['a'], ['b']".
    """
    # Rows where every section was empty have nothing to parse
    if not input_str.strip():
        return ''

    # Convert string to Python objects using ast.literal_eval
    parsed = ast.literal_eval(input_str)

    # A single list parses to a flat list of words; wrap it so it flattens like several lists
    if all(isinstance(item, str) for item in parsed):
        parsed = [parsed]

    # Flatten the lists and join the words into a single string separated by commas
    flat_list = [word for sublist in parsed for word in sublist]
    return ','.join(flat_list)

In [None]:
# Combine the keywords from the three sections into a new column
df['Keywords'] = (
    df['Cleaned_Introduction_keywords'] + ', ' +
    df['Cleaned_Abstract_keywords'] + ', ' +
    df['Cleaned_Conclusion_keywords']
)

# Remove any extra spaces or commas (raw string avoids an invalid-escape warning for \s)
df['Keywords'] = df['Keywords'].str.replace(r',\s*,', ',', regex=True).str.strip(', ')
# Apply the function to the column
df['Keywords'] = df['Keywords'].apply(remove_list_brackets)
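
Because the sheet is read back from Excel, each keyword cell is a string such as `"['governance', 'ocean']"`, and joining several cells with `', '` yields text that `ast.literal_eval` parses as a tuple of lists. The sketch below shows that behaviour and one possible alternative that parses each cell separately. `flatten_keyword_cells` is a hypothetical helper, and its de-duplication step is an addition that the cell above does not perform.

```python
import ast

def flatten_keyword_cells(*cells):
    """Parse each cell (a list written out as text) and merge the words in order."""
    merged = []
    for cell in cells:
        if not cell.strip():
            continue  # section was missing, so the cell is an empty string
        for word in ast.literal_eval(cell):
            if word not in merged:
                merged.append(word)  # keep the first occurrence only
    return ",".join(merged)

# Invented sample cells, for illustration only.
intro = "['governance', 'ocean']"
abstract = ""                         # missing section
conclusion = "['ocean', 'treaties']"

# Joining the raw strings first produces a tuple of lists:
combined = ast.literal_eval(intro + ", " + conclusion)
print(combined)  # (['governance', 'ocean'], ['ocean', 'treaties'])

print(flatten_keyword_cells(intro, abstract, conclusion))  # governance,ocean,treaties
```

Parsing cell by cell avoids the comma clean-up step, since empty cells are skipped before any strings are concatenated.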

In [None]:
# Combine the keyphrases from the three sections into a new column
df['Keyphrases'] = (
    df['Cleaned_Introduction_keyphrases'] + ', ' +
    df['Cleaned_Abstract_keyphrases'] + ', ' +
    df['Cleaned_Conclusion_keyphrases']
)

# Remove any extra spaces or commas left by empty sections
df['Keyphrases'] = df['Keyphrases'].str.replace(r',\s*,', ',', regex=True).str.strip(', ')

# Function to strip the list syntax, then de-duplicate while keeping first-seen order
def clean_phrases(phrase_str):
    # Strip brackets, quotes, and spaces from each piece; empty lists ('[]') collapse to ''
    phrases = [phrase.strip(" '\"[]") for phrase in phrase_str.split(',')]
    # De-duplicate after cleaning, so "['a b'" and "'a b'" count as the same phrase
    return ', '.join(dict.fromkeys(p for p in phrases if p))

# Apply the cleaning function to the column
df['Keyphrases'] = df['Keyphrases'].apply(clean_phrases)


# Display the resulting DataFrame with the new column
df[['File Name', 'Keyphrases']]

In [None]:
# df.head()
# df = df[["File Name", "Keywords", "Keyphrases"]]

In [None]:
# # Save the combined keywords and keyphrases as a new sheet
# with pd.ExcelWriter(file_path, mode='a') as writer:
#     df.to_excel(writer, sheet_name='Key_Sections_Extracted', index=False)

In [None]:
# load the combined keywords and keyphrases
file_path = '../Data/Extracted Sections/Attribute_Papers (1).xlsx'
data = pd.read_excel(file_path, sheet_name="Key_Sections_Extracted")

# Replacing NaN values with empty strings 
data.fillna('', inplace=True)
# preview the df
data.head()


### Key Themes Identified:

1. Global Governance and Policy Coordination
* `Keywords:` governance, accountability, organizations, intergovernmental, regimes, IGOs, policymaking, stakeholders, institutions, diplomats, governments, sanctions, international, multilateral, coordination, alliances, partnerships
* $Focus:$ This theme focuses on the broader global governance structures, mechanisms, and processes that coordinate and enforce ocean-related policies, often at the international and intergovernmental levels.
2. Institutional Design and Organizational Structures
* `Keywords:` institutional, organizations, bureaucracies, secretariat, bureaucratic, regulatory, institutionalization, organizational, decentralization, authority
* $Focus:$ This theme highlights the formal structures and institutional design involved in ocean governance, including organizations and their regulatory frameworks, decision-making bodies, and authority.
3. Sustainability and Environmental Management
* `Keywords:` sustainability, sustainable, environmental, ecosystems, biodiversity, conservation, reforestation, pollution, resource depletion, socio-economic, policies, initiatives
* $Focus:$ This theme involves the management and conservation of marine environments, emphasizing sustainable practices and policies aimed at protecting ecosystems, biodiversity, and preventing pollution.
4. Climate Change and Adaptation
* `Keywords:` climate, adaptation, global, mainstreaming, politicization, environmental risks
* $Focus:$ This theme focuses on the intersection of climate change with ocean governance, particularly in terms of adaptation strategies, climate mitigation efforts, and addressing environmental risks that impact marine ecosystems.
5. Scientific Research and Knowledge Creation
* `Keywords:` research, study, scholars, academics, researcher, journal, academic, discipline, examination, inquiry, findings, literature, methodology, survey, review, data, analysis
* $Focus:$ This theme involves the creation and dissemination of scientific knowledge related to the oceans, covering research efforts, methodologies, and findings that inform policy and governance decisions.
6. Private Sector, Non-State Actors, and Stakeholder Engagement
* `Keywords:` private sector, nongovernmental, NGO, deliberation, initiatives, transnational, stakeholder
* $Focus:$ This theme examines the involvement of non-state actors (e.g., NGOs, businesses, civil society) in ocean governance, and their influence in shaping policies, regulations, and actions through advocacy, initiatives, and partnerships.
7. Legal Frameworks and Norms
* `Keywords:` legal norms, treaties, regulatory, sanctions, law, internationalization, norms, policies
* $Focus:$ This theme focuses on the legal aspects of ocean governance, including international treaties, regulatory frameworks, legal norms, and enforcement mechanisms that guide actions and compliance in marine governance.

In [None]:
# Define the keyword-theme mapping based on the previous categorization
keyword_theme_mapping = {
    'Global Governance and Policy Coordination': [
        'governance', 'accountability', 'organizations', 'intergovernmental', 'regimes', 'igos', 'policymaking', 'stakeholders', 
        'institutions', 'diplomats', 'governments', 'sanctions', 'international', 'multilateral', 'coordination', 'alliances', 'partnerships'
    ],
    
    'Institutional Design and Organizational Structures': [
        'institutional', 'organizations', 'bureaucracies', 'secretariat', 'bureaucratic', 'regulatory', 'institutionalization', 
        'organizational', 'decentralization', 'authority'
    ],
    
    'Sustainability and Environmental Management': [
        'sustainability', 'sustainable', 'environmental', 'ecosystems', 'biodiversity', 'conservation', 'reforestation', 
        'pollution', 'resource depletion', 'socioeconomic', 'policies', 'initiatives'
    ],
    
    'Climate Change and Adaptation': [
        'climate', 'adaptation', 'global', 'mainstreaming', 'politicization', 'environmental risks'
    ],
    
    'Scientific Research and Knowledge Creation': [
        'research', 'study', 'scholars', 'academics', 'researcher', 'journal', 'academic', 'discipline', 'examination', 
        'inquiry', 'findings', 'literature', 'methodology', 'survey', 'review', 'data', 'analysis'
    ],
    
    'Private Sector, Non-State Actors, and Stakeholder Engagement': [
        'private sector', 'nongovernmental', 'ngo', 'deliberation', 'initiatives', 'transnational', 'stakeholder'
    ],
    
    'Legal Frameworks and Norms': [
        'legal norms', 'treaties', 'regulatory', 'sanctions', 'law', 'internationalization', 'norms', 'policies'
    ]
}

# Function to map keyphrases to themes based on the keywords
def map_themes(keyphrase):
    themes = []
    
    # Iterate through each theme and check if any keyword from that theme is in the keyphrase
    for theme, theme_keywords in keyword_theme_mapping.items():
        if any(keyword in keyphrase.lower() for keyword in theme_keywords):
            themes.append(theme)
    
    return ', '.join(themes) if themes else 'No Theme'
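
One caveat of plain substring matching is that short keywords also match inside unrelated words: `'ngo'` occurs in "ongoing" and `'law'` in "flawed". A possible variant, sketched below, requires each keyword to begin at a word boundary while still allowing plural endings. `SAMPLE_MAPPING`, `compile_patterns`, and `map_themes_by_word` are hypothetical names, and the two-theme mapping is a cut-down stand-in for `keyword_theme_mapping`.

```python
import re

# Hypothetical two-theme mapping, a cut-down stand-in for keyword_theme_mapping.
SAMPLE_MAPPING = {
    "Legal Frameworks and Norms": ["law", "treaties"],
    "Private Sector, Non-State Actors, and Stakeholder Engagement": ["ngo", "stakeholder"],
}

def compile_patterns(mapping):
    """Build one regex per theme that matches any keyword at the start of a word."""
    return {
        theme: re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keywords) + ")")
        for theme, keywords in mapping.items()
    }

def map_themes_by_word(keyphrase, patterns):
    """Return the list of themes whose keywords begin a word in the keyphrase."""
    text = keyphrase.lower()
    return [theme for theme, pattern in patterns.items() if pattern.search(text)]

patterns = compile_patterns(SAMPLE_MAPPING)

print(map_themes_by_word("ongoing flawed debate", patterns))  # []
print(map_themes_by_word("NGOs and international law", patterns))
```

The variant returns a list rather than a comma-joined string. Since one theme name itself contains commas, a separator such as `'; '` would keep the themes distinguishable if they are later joined into a single cell.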


In [None]:
# Apply the function to the keyphrases column
data['Themes'] = data['Keyphrases'].apply(map_themes)


In [None]:
df = data[["File Name", "Themes"]]

In [None]:
df.head()

In [None]:
# # Save the identified themes as a new sheet
# with pd.ExcelWriter(file_path, mode='a') as writer:
#     df.to_excel(writer, sheet_name='Identified Themes', index=False)