## Documentation of the Process to Prepare and Analyze Extracted Journal Sections
<p>This document outlines the steps undertaken to preprocess and prepare the extracted sections (Introduction, Abstract, Keywords, and Conclusion) from 78 journals for thematic analysis. The process was designed to ensure the text data was clean, consistent, and ready for the identification of recurring themes.</p>


In [None]:
# !pip install clean-text

In [1]:
# relevant libraries
import re
import nltk
import string
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [18]:
# file path
file_path = "../Data/Extracted Sections/Attribute_Papers - extracted sections.csv" 
df = pd.read_csv(file_path)  # read the file into df; 'File Name' stays a regular column, since later cells select it by name

In [19]:
# Replacing NaN values with empty strings 
df.fillna('', inplace=True)
# preview the df
df.head()

Unnamed: 0,File Name,Abstract,Introduction,Conclusion,Keywords,Type
0,Environment- The Path of Global Environmental...,After revisiting the concept of accountability...,,,,Research Paper
1,INTERGOVERNMENTAL ORGANIZATIONS (IGOS) AND TH...,"This journal explores the evolution, roles, an...",,,,Review
2,“Privatisation_’ in the United Nations system_...,This journal examines the growing influence of...,,,,Review
3,A Changing United Nations_Multilateral Evoluti...,Intergovernmental organisations (IGOs) help to...,In analysing international environmental co-op...,,,Research Paper
4,A typology of board design for highly effectiv...,The United Nations (UN) system comprises sever...,,"In conclusion, understanding how to monitor th...","board design, governance, intergovernmental or...",Research Paper


### Abstract Cleaning

<p>To prepare the journal abstracts for thematic analysis, we used a Python-based preprocessing workflow. The aim was to clean the text thoroughly by removing noise, reducing redundancy, and ensuring uniformity in the data. This facilitated a more effective thematic extraction from the textual content.</p>

#### Steps Taken

##### Importing Necessary Libraries:

$Pandas:$ For data manipulation and for reading and writing the CSV and Excel files.

$String Module:$ To handle punctuation removal.

$NLTK (Natural Language Toolkit):$ For tokenization and stopword removal.

##### Downloading Required NLTK Resources:
* Used nltk.download('stopwords') and nltk.download('punkt') to ensure the necessary datasets for stopword lists and tokenization were available.

##### Preprocessing Function:
A preprocessing function was created to handle the following:
* `Case Normalization:` Converted all text to lowercase to ensure uniformity and prevent case-sensitive mismatches during analysis.
* `Punctuation Removal:` Stripped punctuation marks to focus only on meaningful text.
* `Tokenization:` Split text into individual words (tokens) to facilitate further processing.
* `Stopword Removal:` Eliminated common English stopwords (e.g., "and," "the") to focus on significant words that contribute to themes.
* `Handling Missing Values:` Checked for and addressed null values, replacing them with empty strings to maintain data consistency.

##### Application of Preprocessing:
The function was applied to the "Abstract" column of the dataset using Pandas' .apply() method. A new column, Cleaned_Abstract, was created to store the cleaned text while preserving the original abstracts for reference.

##### Saving the Cleaned Data:
The processed data, including the new Cleaned_Abstract column, was exported to a new Excel file for further analysis.



In [20]:
# Ensure necessary NLTK data is downloaded
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /home/milo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/milo/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [21]:
# Function to clean the abstract section
def abstract_preprocess_text(text):
    """
    Preprocesses text by:
    1. Converting to lowercase.
    2. Removing punctuation.
    3. Tokenizing the text.
    4. Removing stopwords.
    """
    if pd.isnull(text):
        return ""  # Handle missing values as empty strings

    # Convert text to lowercase
    text = text.lower()
    
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # Tokenize text
    tokens = word_tokenize(text)
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]
    
    # Join tokens back to a single string
    cleaned_text = ' '.join(filtered_tokens)
    return cleaned_text
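
One side effect of this ordering is worth noting: because punctuation is stripped before tokenization, hyphenated terms are fused rather than split, which is why "4-year" appears as "4year" in the cleaned output. The sketch below reproduces the lowercase and punctuation steps using only the standard library. `str.split` and the tiny `DEMO_STOPWORDS` set are stand-ins (assumptions) for `word_tokenize` and the NLTK stopword list, so results on real text may differ slightly.

```python
import string

# Hypothetical stand-in for NLTK's English stopword list (illustration only)
DEMO_STOPWORDS = {"this", "on", "a", "the", "and", "of"}

def demo_preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = text.split()  # stand-in for word_tokenize
    return ' '.join(t for t in tokens if t not in DEMO_STOPWORDS)

sample = "This article draws on a 4-year research effort."
cleaned = demo_preprocess(sample)
print(cleaned)  # article draws 4year research effort
```

If hyphenated terms should survive as separate words, one option would be to replace hyphens with spaces before the punctuation step.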



The cleaning process was necessary for several reasons:
* `Reduce Noise:` Punctuation, stopwords, and mixed case could obscure patterns in the data.
* `Enhance Uniformity:` Standardizing text ensures consistency, which is crucial for computational analysis.
* `Improve Relevance:` By removing stopwords and focusing on meaningful words, we ensured that the extracted themes would be more focused and insightful.
* `Prepare for Thematic Analysis:` Clean text is essential for techniques like clustering, topic modeling, or manual coding of themes, which are sensitive to noisy data.
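
As a rough illustration of how cleaned text could feed a first pass at theme identification, the sketch below counts term frequencies across a few cleaned strings with `collections.Counter`. The `cleaned_abstracts` list is made-up sample data, and simple frequency counting is only one assumed starting point, not necessarily the method used in this analysis.

```python
from collections import Counter

# Made-up sample of cleaned abstracts (illustration only)
cleaned_abstracts = [
    "growing evidence climate change",
    "climate governance international organizations",
    "international organizations climate policy",
]

# Accumulate counts of every whitespace-separated term
term_counts = Counter()
for abstract in cleaned_abstracts:
    term_counts.update(abstract.split())

# The most frequent terms are candidate theme keywords
top_terms = term_counts.most_common(2)
print(top_terms)
```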


In [22]:
# Apply the cleaning function to the 'Abstract' column
df['Cleaned_Abstract'] = df['Abstract'].apply(abstract_preprocess_text)

# df['Cleaned_Conclusion'] = df['Conclusion'].apply(preprocess_text)

In [23]:
df.tail()

Unnamed: 0,File Name,Abstract,Introduction,Conclusion,Keywords,Type,Cleaned_Abstract
73,International Organizations- The Politics and ...,"Growing evidence of climate change, along with...",,,,,growing evidence climate change along continui...
74,International Organizations under Pressure_Leg...,"In this book, we document how the list of norm...",,,,,book document list normative expectations inte...
75,International regulation without international...,nternational organizations (IOs) have been wid...,,IOs have been broadly criticized as ineffectiv...,,,nternational organizations ios widely criticiz...
76,Learning in International Organizations in Glo...,This article draws on a 4-year research effort...,,This article has analyzed learning processes i...,,,article draws 4year research effort within glo...
77,Moving forward by looking back_Learning from U...,With a growing recognition that global problem...,,,,,growing recognition global problems demand glo...


### Process of Text Cleaning and Named Entity Recognition (NER) for the Introduction Column
<p>This section provides an overview of the steps taken to clean and process the text in the Introduction column of the pandas DataFrame. The objective was to handle missing data (NaN values), clean the text, and perform Named Entity Recognition (NER) to extract relevant named entities.</p>


#### Step-by-Step Process
1. Handling Missing Data (NaN Values)
Before applying any text processing functions, it's important to ensure that missing data in the Introduction column (NaN values) is handled appropriately. The presence of NaN values can cause errors when applying text functions.

$Replace NaN with an Empty String$

This ensures that the text processing functions receive a valid string, even if the original value was missing. The NaN values are replaced with empty strings ("").

2. Text Cleaning Function
The text cleaning function was applied to the Introduction column to prepare the text for further analysis, including Named Entity Recognition (NER).

$Tokenizing Text into Sentences:$
We used nltk.sent_tokenize to split the text into individual sentences. This ensures that we can process text at a more granular level (sentence by sentence).

$Tokenizing Sentences into Words:$
We used nltk.word_tokenize to break down each sentence into individual words.

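$Lowercasing and Removing Stopwords and Punctuation:$
Each token was converted to lowercase to maintain consistency and to avoid case-sensitive issues. Stopwords and standalone punctuation tokens were then dropped by checking each token against the NLTK English stopword list and string.punctuation. Punctuation inside a token (for example, the hyphen in "co-operation") is left in place.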

3. Named Entity Recognition (NER)
Named Entity Recognition (NER) was performed on the Introduction column to identify key entities such as names of people, organizations, locations, etc.

$Using NLTK for NER:$
We used NLTK's ne_chunk, applied to part-of-speech-tagged tokens from nltk.pos_tag, to extract named entities from the text. The chunker labels entities such as persons, organizations, and locations, and the words of each labelled chunk were joined into a single entity string.

In [24]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk import ne_chunk
from nltk.tree import Tree
import string

# Download necessary datasets from NLTK
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker_tab')

[nltk_data] Downloading package punkt to /home/milo/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/milo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/milo/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /home/milo/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/milo/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /home/milo/nltk_data...
[nltk_data]   Package maxent_ne_chunker_tab is already up-to-date!


True

In [25]:
# Stopwords and punctuation removal
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)

# Function for tokenizing and cleaning text
def introduction_clean_text(text):
    # Tokenize into sentences
    sentences = sent_tokenize(text)
    
    # Tokenize each sentence into words
    words = [word_tokenize(sentence) for sentence in sentences]
    
    # Remove stopwords and punctuation
    cleaned_words = []
    for word_list in words:
        cleaned_words.append([word.lower() for word in word_list if word.lower() not in stop_words and word not in punctuation])
    
    return cleaned_words

# Function for Named Entity Recognition (NER)
def extract_named_entities(text):
    sentences = nltk.sent_tokenize(text)
    named_entities = []
    for sentence in sentences:
        words = nltk.word_tokenize(sentence)
        pos_tags = nltk.pos_tag(words)
        chunked = ne_chunk(pos_tags)
        for subtree in chunked:
            if isinstance(subtree, Tree):
                entity = " ".join([word for word, tag in subtree])
                named_entities.append(entity)
    return named_entities



<p>By handling missing data (NaN values) effectively, cleaning the text, and applying Named Entity Recognition, we were able to prepare the Introduction column for further analysis. The process ensures that the data is consistent, clean, and enriched with useful named entity information, making it ready for any additional tasks or analyses.</p>
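
Because `extract_named_entities` appends every chunk it finds, the same entity can appear more than once in a list. If unique entities are wanted, one option is to deduplicate while keeping first-seen order. This is a minimal sketch: `dedupe_entities` is a hypothetical helper name and the sample list is hand-written for illustration.

```python
def dedupe_entities(entities):
    """Return entities with duplicates removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(entities))

sample_entities = ["Intergovernmental", "IR", "United Nations", "UN", "IR"]
unique_entities = dedupe_entities(sample_entities)
print(unique_entities)  # ['Intergovernmental', 'IR', 'United Nations', 'UN']
```

On the DataFrame this could be applied with `df['Named_Entities'].apply(dedupe_entities)`.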

In [26]:
# Apply the text cleaning function to the introduction column
df['Cleaned_Introduction'] = df['Introduction'].apply(introduction_clean_text)

# Apply the NER function to extract named entities
df['Named_Entities'] = df['Introduction'].apply(extract_named_entities)

In [27]:
df.head(3)

Unnamed: 0,File Name,Abstract,Introduction,Conclusion,Keywords,Type,Cleaned_Abstract,Cleaned_Introduction,Named_Entities
0,Environment- The Path of Global Environmental...,After revisiting the concept of accountability...,,,,Research Paper,revisiting concept accountability national gov...,[],[]
1,INTERGOVERNMENTAL ORGANIZATIONS (IGOS) AND TH...,"This journal explores the evolution, roles, an...",,,,Review,journal explores evolution roles activities in...,[],[]
2,“Privatisation_’ in the United Nations system_...,This journal examines the growing influence of...,,,,Review,journal examines growing influence privatizati...,[],[]
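
Note that `Cleaned_Introduction` holds a list of token lists (one inner list per sentence), whereas `Cleaned_Abstract` is a single string; rows with no introduction show `[]`. If a flat string is more convenient for downstream analysis, the nested lists can be joined. In this sketch `flatten_tokens` is a hypothetical helper and the sample data is invented.

```python
from itertools import chain

def flatten_tokens(sentences):
    """Join a list of per-sentence token lists into one space-separated string."""
    return ' '.join(chain.from_iterable(sentences))

sample = [["analysing", "international", "environmental"], ["states", "cooperate"]]
flat = flatten_tokens(sample)
print(flat)                      # analysing international environmental states cooperate
print(repr(flatten_tokens([])))  # ''
```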


### Conclusion Cleaning

In [28]:
import re
# Function to clean the conclusion
def conclusion_clean_text(text):
    """
    Cleans raw conclusion text in the following steps:
    1. Collapses runs of whitespace (spaces, newlines, tabs) into a single space.
    2. Removes parenthesised in-text citations and reference markers.
    3. Normalizes spacing before punctuation marks.
    4. Removes boilerplate lead-in phrases such as "In conclusion,".
    5. Normalizes spacing around commas and drops duplicate or trailing commas.
    6. Converts the text to lowercase.
    7. Strips leading and trailing whitespace.

    Parameters:
    text (str): The raw text to be cleaned.

    Returns:
    str: The cleaned text.
    """

    # Step 1: Remove extra spaces, newlines, and tabs
    text = re.sub(r'\s+', ' ', text)  # Replaces any whitespace (including newlines and tabs) with a single space

    # Step 2: Remove footnote markers, numbers in parentheses, or references like (e.g., Jordan & Lenschow, 2010)
    text = re.sub(r'\([A-Za-z\s&\d,\.-]+\)', '', text)  # Removes in-text citations like (e.g., Jordan & Lenschow, 2010)
    
    # Step 3: Normalize punctuation
    # Remove whitespace before punctuation marks; "(" is excluded so that "word (" keeps its space
    text = re.sub(r'\s+([.,;!?)])', r'\1', text)
    
    # Step 4: Remove any unnecessary words like "In conclusion," or "Summary Points"
    unwanted_phrases = ["SUMMARY POINTS :", "In summary,", "In conclusion,", "In grand sense,"]
    for phrase in unwanted_phrases:
        text = text.replace(phrase, '')
    
    # Step 5: Remove unnecessary commas
    # Normalize comma spacing (note: this also turns "1,000" into "1, 000")
    text = re.sub(r'\s*,\s*', ', ', text)  # No space before a comma, exactly one space after it
    text = re.sub(r',\s*,', ',', text)  # Remove consecutive commas if any
    text = re.sub(r',\s*$', '', text)   # Remove trailing commas at the end of the text
    
    # Step 6: Normalize capitalization and improve readability (optional, depends on needs)
    # For example, you can convert to lowercase if necessary:
    text = text.lower()  # Converts the text to lowercase (if required), adjust this as needed.
    
    # Step 7: Final cleanup - remove any leading or trailing whitespace
    text = text.strip()

    return text
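
It helps to see what the Step 2 pattern actually matches. The sketch below applies the same regular expression to two invented sentences using only `re`. It removes an author-year citation, but it also removes any other parenthetical made up solely of letters, digits, whitespace and the characters `& , . -`, such as an acronym in parentheses. The sample strings are assumptions for illustration.

```python
import re

CITATION_PATTERN = r'\([A-Za-z\s&\d,\.-]+\)'  # same pattern as Step 2 above

with_citation = "Policy integration is contested (Jordan & Lenschow, 2010) in practice."
with_acronym = "The United Nations (UN) system is complex."

# Remove the parenthetical, then collapse the double space it leaves behind
no_citation = re.sub(r'\s+', ' ', re.sub(CITATION_PATTERN, '', with_citation))
no_acronym = re.sub(r'\s+', ' ', re.sub(CITATION_PATTERN, '', with_acronym))

print(no_citation)  # Policy integration is contested in practice.
print(no_acronym)   # The United Nations system is complex.
```

If acronyms should be kept, a stricter pattern that requires a four-digit year inside the parentheses would be one option to consider.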


In [29]:
# Apply the text cleaning function to the conclusion column
df['Cleaned_Conclusion'] = df['Conclusion'].apply(conclusion_clean_text)
# df['Cleaned_Conclusion'] = df['Cleaned_Conclusion'].apply(conclusion_clean_text)


In [30]:
df.head(17)

Unnamed: 0,File Name,Abstract,Introduction,Conclusion,Keywords,Type,Cleaned_Abstract,Cleaned_Introduction,Named_Entities,Cleaned_Conclusion
0,Environment- The Path of Global Environmental...,After revisiting the concept of accountability...,,,,Research Paper,revisiting concept accountability national gov...,[],[],
1,INTERGOVERNMENTAL ORGANIZATIONS (IGOS) AND TH...,"This journal explores the evolution, roles, an...",,,,Review,journal explores evolution roles activities in...,[],[],
2,“Privatisation_’ in the United Nations system_...,This journal examines the growing influence of...,,,,Review,journal examines growing influence privatizati...,[],[],
3,A Changing United Nations_Multilateral Evoluti...,Intergovernmental organisations (IGOs) help to...,In analysing international environmental co-op...,,,Research Paper,intergovernmental organisations igos help crea...,"[[analysing, international, environmental, co-...","[Intergovernmental, IR, United Nations, UN, IR...",
4,A typology of board design for highly effectiv...,The United Nations (UN) system comprises sever...,,"In conclusion, understanding how to monitor th...","board design, governance, intergovernmental or...",Research Paper,united nations un system comprises several int...,[],[],understanding how to monitor the un system is ...
5,A World Environment Organization_Solution.pdf,,Document T1 Summary:\nThe introduction of the ...,,,Unidentified,,"[[document, t1, summary, introduction, book, `...",[Effective International Environmental Governa...,
6,Accountability in International Governance and...,The debate on a World Environment Organization...,,,,Reviews,debate world environment organization weo star...,[],[],
7,An Unfinished Foundation_The United Nations an...,Environmental management has emerged as an imp...,,,,Reviews,environmental management emerged important ele...,[],[],
8,Assessing the effectiveness of intergovernment...,Our world is getting smaller. Political and ec...,"At the beginning of the 21st century, globaliz...",,,Research Paper,world getting smaller political economic liber...,"[[beginning, 21st, century, globalization, bec...","[GPP, United Nations, Economic]",
9,Autonomous Institutional Arrangements in Multi...,,,This article demonstrates that intergovernment...,,Research Paper,,[],[],this article demonstrates that intergovernment...


In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78 entries, 0 to 77
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   File Name             78 non-null     object
 1   Abstract              78 non-null     object
 2   Introduction          78 non-null     object
 3   Conclusion            78 non-null     object
 4   Keywords              78 non-null     object
 5   Type                  78 non-null     object
 6   Cleaned_Abstract      78 non-null     object
 7   Cleaned_Introduction  78 non-null     object
 8   Named_Entities        78 non-null     object
 9   Cleaned_Conclusion    78 non-null     object
dtypes: object(10)
memory usage: 6.2+ KB


In [32]:
df = df[["File Name", "Cleaned_Abstract", "Cleaned_Introduction", "Named_Entities", "Cleaned_Conclusion"]]

In [33]:
df.head()

Unnamed: 0,File Name,Cleaned_Abstract,Cleaned_Introduction,Named_Entities,Cleaned_Conclusion
0,Environment- The Path of Global Environmental...,revisiting concept accountability national gov...,[],[],
1,INTERGOVERNMENTAL ORGANIZATIONS (IGOS) AND TH...,journal explores evolution roles activities in...,[],[],
2,“Privatisation_’ in the United Nations system_...,journal examines growing influence privatizati...,[],[],
3,A Changing United Nations_Multilateral Evoluti...,intergovernmental organisations igos help crea...,"[[analysing, international, environmental, co-...","[Intergovernmental, IR, United Nations, UN, IR...",
4,A typology of board design for highly effectiv...,united nations un system comprises several int...,[],[],understanding how to monitor the un system is ...


In [35]:
# file_path = '../Data/Extracted Sections/Attribute_Papers (1).xlsx'

In [36]:
# # Save Spatial df sheet
# with pd.ExcelWriter(file_path, mode='a') as writer:
#     df.to_excel(writer, sheet_name='Cleaned_Extracted_Sections', index=False)