# Keyword-based Text Mining

**This notebook presents an approach for identifying documents related to land use (LU) in large historical archives, using keywords identified in an earlier round of text mining. It is part of a set of three notebooks, all of which can be found on [GitHub](https://github.com/Yegberink/VOC_land_use).**

## Identify keywords

The first step is to identify keywords. This is done by hand, but the following code can help by listing the most frequently occurring words in the interesting documents identified during the location-based text mining. When more interesting documents are found, the code block can be run again to identify additional keywords. To run the code properly, your documents should be in a format similar to datasets.xlsx, which is available on [GitHub](https://github.com/Yegberink/VOC_land_use) for reference.
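The counting idea can be illustrated without the archive data. The sketch below is a minimal, simplified version that splits on whitespace instead of using the flair tokeniser; the two example sentences, the stopword set and the function name `count_keywords` are made up for illustration.

```python
from collections import Counter

# Toy stand-ins for page texts and a stopword list (hypothetical, not archive data)
documents = [
    "De velden en de tuinen bij Colombo",
    "De velden en tuinen zijn beplant met kaneel",
]
stopwords = {"de", "en", "bij", "zijn", "met"}


def count_keywords(docs, stopwords, min_length=3):
    """Count lower-cased words of at least min_length characters that are not stopwords."""
    counts = Counter()
    for doc in docs:
        for word in doc.lower().split():
            if len(word) >= min_length and word not in stopwords:
                counts[word] += 1
    return counts


word_counts = count_keywords(documents, stopwords)

# The most frequent words are candidates for manual inspection
print(word_counts.most_common(3))
```

Words at the top of the frequency list are only candidates; deciding which of them are useful LU keywords remains a manual step.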

In [10]:
#pip install pandas flair fuzzywuzzy python-Levenshtein bertopic ipywidgets spacy xlrd

In [6]:
# Load packages
import pandas as pd
import os
from flair.data import Sentence
from collections import Counter

# Set the directory
path = "/Users/Yannick/Documents/RA work"
os.chdir(path)

# Load document numbers of previously assessed documents
found_datasets = pd.read_excel("output/results/datasets.xlsx", sheet_name='short_summary_of_datasets')

# Load the document content of the documents from Ceylon
ceylon_documents_clean = pd.read_csv("Data/ceylon_documents_clean", sep=";")

# Load stopwords
stopwords_nl = pd.read_table('Data/stopwords_nl.txt', header=None)
stopwords_nl = stopwords_nl[0].tolist()

# initiate empty list to hold all documents that are previously assessed
checked_document_labels = []

# Iterate over each row in the DataFrame
for index, row in found_datasets.iterrows():
    
    #extract useful information
    label_number = row['label_number']
    start_page = row['start_page']
    end_page = row['end_page']
    
    # Generate labels for each page in the range
    for page in range(start_page, end_page + 1):
        formatted_page = str(page).zfill(4) #add 0's in front of the page number
        label = f"HaNA_1.04.02_{label_number}_{formatted_page}" #fix the formatting
        checked_document_labels.append(label) #store the labels

#initiate empty list to hold all documents that were assessed to be interesting
interesting_documents = []

#filter for interesting datasets
interesting_datasets = found_datasets[(found_datasets['interesting'] == 'yes') | (found_datasets['interesting'] == 'maybe')]

# Iterate over each row in the DataFrame
for index, row in interesting_datasets.iterrows():
    
    #extract useful information
    label_number = row['label_number']
    start_page = row['start_page']
    end_page = row['end_page']
    
    # Generate labels for each page in the range
    for page in range(start_page, end_page + 1):
        formatted_page = str(page).zfill(4) #add 0's in front of the page number
        label = f"HaNA_1.04.02_{label_number}_{formatted_page}" #fix the formatting
        interesting_documents.append({
            'label': label,
            'interesting': row['interesting'],
            'year': row['year'],
            'list': row['List?'],
            'remarks': row['remarks'],
            'archive_number': row['label_number']}) 


interesting_documents = pd.DataFrame(interesting_documents)
interesting_documents_labels = set(interesting_documents['label'])

# create a df with interesting documents
merged_interesting_documents = pd.merge(interesting_documents, ceylon_documents_clean, left_on='label', right_on='LabelIdentifier')

# Drop the 'LabelIdentifier' column
merged_interesting_documents = merged_interesting_documents.drop('LabelIdentifier', axis=1)

# List to hold all tokens
all_tokens = []
texts = []

# Tokenize each document using the flair Sentence tokeniser
for text in merged_interesting_documents['DocumentContent']:
    sentence = Sentence(text)
    tokens = [token.text.lower() for token in sentence.tokens if len(token.text) >= 3 and token.text and token.text.lower() not in stopwords_nl]
    all_tokens.extend(tokens)
    texts.append(tokens)

# Count the frequency of each word
word_counts = Counter(all_tokens)

# Sort all words from most to least frequent
most_common_words = word_counts.most_common()

# Convert the result to a DataFrame for easier viewing
most_common_words_df = pd.DataFrame(most_common_words, columns=['Word', 'Frequency'])

# Save the df
most_common_words_df.to_csv("output/datasets/keyword_identification.csv")
print("done")

done
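The cell above builds page labels in two nearly identical loops. As a possible simplification, the label construction could be wrapped in one helper. This is only a sketch: the function name `page_labels` and the example inventory number are hypothetical, and the `int()` casts assume that page numbers read from Excel may arrive as floats.

```python
def page_labels(label_number, start_page, end_page):
    """Return one label per page, with the page number zero-padded to four digits."""
    return [
        f"HaNA_1.04.02_{label_number}_{str(page).zfill(4)}"
        for page in range(int(start_page), int(end_page) + 1)
    ]


# Hypothetical inventory number and page range, for illustration only
labels = page_labels(1234, 7, 9)
print(labels)
```

Both loops could then call the helper once per row and extend their own lists with the result.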


## BERTopic

The following code block fits a BERTopic model to check the results of the keyword identification. It produces a hierarchical visualisation of the topics; the code block after it extracts the keywords of the topics judged interesting. The topic modelling takes some time to run.

In [18]:
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, PartOfSpeech
import spacy
from hdbscan import HDBSCAN

# Extract the document content from the previously loaded ceylon_documents_clean
# and convert it to a list of strings
data = ceylon_documents_clean['DocumentContent'].astype(str).tolist()

# Create the embeddings
embed_model = SentenceTransformer("all-MiniLM-L6-v2")

# Create the dimension reduction model
umap_model = UMAP(n_neighbors = 30, n_components = 5, min_dist = 0.0, metric = "cosine", random_state = 42)

# Create the clustering model
hdbscan_model = HDBSCAN(min_cluster_size = 7, metric = "euclidean", cluster_selection_method = 'eom', prediction_data = True)

# Create the representation model
vect_model = CountVectorizer(stop_words = stopwords_nl, min_df = 2, ngram_range = (1,1)) 
keybert_model = KeyBERTInspired()
mmr_model = MaximalMarginalRelevance(diversity = 0.3)
representation_model = {
    "keyBERT": keybert_model,
    "MMR": mmr_model,
 #  "POS": pos_model
}

# regroup all the models into one
topic_model = BERTopic(
    embedding_model = embed_model,
    umap_model = umap_model,
    hdbscan_model = hdbscan_model,
    vectorizer_model = vect_model,
    representation_model = representation_model,
    top_n_words = 20, # how many words to include in the representation
    verbose = True # this option makes the model report progress while running
)


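#Apply the model
topics, probs = topic_model.fit_transform(data)

# Build the topic hierarchy and visualise it
hierarchical_topics = topic_model.hierarchical_topics(data)
visualisation = topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)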

# Save the visualization as an HTML file
visualisation.write_html("hierarchy_visualization.html")

print("done")

2024-10-03 11:46:50,902 - BERTopic - Embedding - Transforming documents to embeddings.
Batches:  15%|████▊                            | 330/2267 [01:38<09:38,  3.35it/s]


KeyboardInterrupt: 

## Extract keywords of interesting topics

The topics in the hierarchy are analysed manually, and the numbers of the interesting topics are collected in the `interesting_topics` list. This list is then used to extract the keywords of those topics. Any additional interesting keywords should be added to the more_keywords document (`Data/more_keywords.txt`).
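To show what the extraction step produces without fitting a model, the sketch below uses a hand-made dictionary in place of `topic_model.get_topic(...)`. It assumes that each topic is represented as a list of (word, score) pairs; the words, scores and the helper name `topic_row` are invented for illustration.

```python
# Hypothetical stand-in for the (word, score) pairs of two topics
example_topics = {
    24: [("rijst", 0.61), ("velden", 0.47), ("tuinen", 0.35)],
    63: [("kaneel", 0.58), ("schillers", 0.41)],
}


def topic_row(topic_id, word_scores):
    """Flatten one topic into a row with comma-separated keywords and scores."""
    words, scores = zip(*word_scores)  # split the pairs into two tuples
    return {
        "Topic": topic_id,
        "Keywords": ", ".join(words),
        "Scores": ", ".join(f"{score:.4f}" for score in scores),
    }


rows = [topic_row(topic_id, pairs) for topic_id, pairs in example_topics.items()]
for row in rows:
    print(row)
```

Each row corresponds to one line in the saved CSV file, which makes it easy to scan the keywords per topic by eye.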

In [None]:
# List of interesting topic IDs
interesting_topics = [236, 63, 167, 285, 66, 69, 24, 129, 162, 26]

# Initialize an empty list to store topic keywords
topic_keywords = []

# Loop through each interesting topic and extract keywords
for topic in interesting_topics:
    # Get the keywords and their respective scores for the topic
    words, scores = zip(*topic_model.get_topic(topic))
    
    # Create a dictionary for the topic with words and scores
    topic_data = {
        "Topic": topic,
        "Keywords": ', '.join(words),  # Join keywords into a single string for readability
        "Scores": ', '.join([f'{score:.4f}' for score in scores])  # Format the scores
    }
    
    # Append to the list
    topic_keywords.append(topic_data)

# Convert the list of dictionaries into a DataFrame
df_topics = pd.DataFrame(topic_keywords)

# Save the results
df_topics.to_csv("output/datasets/BERTopic_keywords.csv")

print("done")

## Load and clean documents for keyword based text mining

**The following part of the code can be used independently of the keyword identification. Therefore, some parts of the previous code are repeated.**

In the following code block, packages and data are loaded and prepared for matching the keywords against the words in the texts.

In [23]:
# import packages
import pandas as pd
import os
from flair.data import Sentence
from collections import Counter
import string
from fuzzywuzzy import fuzz

# Set the directory
path = "/Users/Yannick/Documents/RA work"
os.chdir(path)

# Load previously assessed documents
found_datasets = pd.read_excel("output/results/datasets.xlsx", sheet_name='short_summary_of_datasets')

#initiate empty list to hold the labels of previously assessed documents
checked_document_labels = []

# Iterate over each row in the DataFrame
for index, row in found_datasets.iterrows():
    
    #extract useful information
    label_number = row['label_number']
    start_page = row['start_page']
    end_page = row['end_page']
    
    # Generate labels for each page in the range
    for page in range(start_page, end_page + 1):
        formatted_page = str(page).zfill(4) #add 0's in front of the page number
        label = f"HaNA_1.04.02_{label_number}_{formatted_page}" #fix the formatting
        checked_document_labels.append(label) #store the labels

# Load ceylon documents
ceylon_documents_clean = pd.read_csv("Data/ceylon_documents_clean", sep=";")

# Load the more_keywords document

more_keywords = pd.read_table("Data/more_keywords.txt", header=None) 
more_keywords = set(more_keywords[0])

# List to hold all tokens
all_tokens = []
texts = []

#load stopwords_nl
stopwords_nl = pd.read_table('Data/stopwords_nl.txt', header=None)
stopwords_nl = stopwords_nl[0].tolist()

#Load trade data
#Trade datasets
cargo_1 = pd.read_excel("Data/trade_data/cargos (1).xls", "cargo", header=None)
cargo_2 = pd.read_excel("Data/trade_data/cargos (2).xls", "cargo", header=None)
cargo_3 = pd.read_excel("Data/trade_data/cargos (3).xls", "cargo", header=None)

#Concat into single df
trade_data = pd.concat([cargo_1, cargo_2, cargo_3])

#Select first 8 columns
trade_data = trade_data.iloc[:,0:8]

#Name these columns
trade_data.columns = ["book_year", "quantity", "unit", "product", "departure_place", "departure_region", "arrival_place", "arrival_region"]

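#Convert - to missing values (an explicit missing value avoids version-dependent behaviour of replace(..., None))
trade_data = trade_data.replace('-', pd.NA)

#Create set of unique commodities (missing values excluded)
commodity_list = set(trade_data['product'].dropna())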

print("done")

done


## Flag interesting documents

The following code block applies two keyword-based filters. First, only texts containing a high number of LU keywords (x) are kept; second, only documents containing a low number of commodities (y) are kept. Relaxing the rule to (x >= 2 and y < 3) or (x >= 3) yielded no additional interesting documents, so the best-performing filter was found to be (x >= 3 and y < 3) or (x >= 4). The threshold for y was kept fixed, since changing it led to the inclusion of many irrelevant texts.
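The decision rule can be isolated in a small function to make the thresholds explicit. This is a sketch under simplifying assumptions: tokens are plain strings rather than flair tokens, and the function name `flag_document`, its parameter names and the toy keyword and commodity sets are hypothetical.

```python
def flag_document(tokens, keywords, commodities,
                  min_keywords=3, max_commodities=3, override_keywords=4):
    """Apply the rule (x >= 3 and y < 3) or (x >= 4) with configurable thresholds."""
    x = sum(1 for token in tokens if token.lower() in keywords)     # LU keyword hits
    y = sum(1 for token in tokens if token.lower() in commodities)  # commodity hits
    return (x >= min_keywords and y < max_commodities) or (x >= override_keywords)


# Toy keyword and commodity sets, for illustration only
keywords = {"velden", "tuinen", "landerijen"}
commodities = {"kaneel", "peper", "rijst"}

few_commodities = "velden tuinen landerijen kaneel".split()
many_commodities = "velden tuinen landerijen kaneel peper rijst".split()
many_keywords = "velden velden tuinen landerijen kaneel peper rijst".split()

print(flag_document(few_commodities, keywords, commodities))
print(flag_document(many_commodities, keywords, commodities))
print(flag_document(many_keywords, keywords, commodities))
```

Keeping the thresholds as parameters makes it easier to repeat the iterative tuning described above without editing the loop itself.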


In [25]:
#%%Flag interesting documents

flagged_text = []

# Iterate over each row in the DataFrame
for index, row in ceylon_documents_clean.iterrows():
    text = row['DocumentContent']  # Extract the document content
    tokenized_text = Sentence(text)  # Tokenize the text with flair's Sentence class
    
    #initiate counter
    x = 0
    y = 0
   
    #initiate empty lists for matched words
    found_tokens = []
    found_commodities = []
    
    # Iterate over each token in the tokenized text
    for token in tokenized_text:
        word = token.text.lower()  # Convert token to lowercase
        if word in more_keywords:  # Check if token is one of the LU keywords
            x += 1  # Increment the counter if a keyword is found
            found_tokens.append(word)  # Store the matching token
        if word in commodity_list:  # Check if token is in commodity_list
            y += 1  # Increment the counter if a commodity is found
            found_commodities.append(word)  # Store the matching commodity

    # Determine the flag that is stored in flagged_text
    if (x >= 3 and y < 3) or (x >= 4):
        flag = 'yes'
    else:
        flag = 'no'
    
    #append the list
    flagged_text.append({
        'LabelIdentifier': row['LabelIdentifier'],  # Store the label identifier
        'DocumentContent': text,  # Store the original document content
        'tokened_text': tokenized_text,  # Store the tokenized text
        'cultivation_words': found_tokens,  # Store the list of found cultivation keywords
        'flag': flag,  # Store the determined flag
        'commodities': found_commodities  # Store the list of found commodities
    })
    #Print index to track progress
    if index % 10000 == 0:
        print(index)

#convert to df
flagged_text = pd.DataFrame(flagged_text)

#%%Filter for documents that are flagged 'yes'

filtered_flagged = flagged_text[(flagged_text['flag'] == 'yes')].reset_index()

#%% save the dataframe for checking
filtered_flagged.to_csv("output/datasets/filtered_flagged_text.csv", index=False, sep=';')

print("done")

0
10000
20000
30000
40000
50000
60000
70000
done


## Table identification

The last step in identifying potentially interesting documents is the identification of tables. Because of the structure of the HTR texts from the GLOBALISE project used in this study, this can be done based on the percentage of punctuation characters. Thresholds below 30% returned many non-tables, so the threshold was set to 30%. The resulting dataset is saved and can be assessed qualitatively to find pages on LU. If this assessment is done using the Excel file on [GitHub](https://github.com/Yegberink/VOC_land_use), previously assessed documents are filtered out, which makes the iterative process of setting the thresholds easier.
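The punctuation criterion can be tried out on a few tokens before running it on the full corpus. The sketch below works on plain string tokens instead of flair tokens; the function names `punctuation_share` and `has_table_like_window` and the example tokens are hypothetical, and the default window size and threshold mirror the values used in this notebook.

```python
import string


def punctuation_share(tokens):
    """Percentage of punctuation characters in the tokens joined by spaces."""
    text = " ".join(tokens)
    if not text:
        return 0.0
    return 100 * sum(char in string.punctuation for char in text) / len(text)


def has_table_like_window(tokens, window_size=30, threshold=30.0):
    """Return True if any window of window_size tokens reaches the threshold."""
    if len(tokens) <= window_size:
        windows = [tokens]  # short texts are treated as a single window
    else:
        windows = (tokens[i:i + window_size]
                   for i in range(len(tokens) - window_size + 1))
    return any(punctuation_share(window) >= threshold for window in windows)


# Toy examples: running prose versus a row of numbers separated by leader dots
prose = "de velden liggen bij de rivier".split()
table_like = ["12", ".", ".", ".", "5", ".", ".", ".", "8", ".", ".", "."]

print(has_table_like_window(prose))
print(has_table_like_window(table_like))
```

Because `any` stops at the first window that reaches the threshold, each document is reported at most once, so no duplicate rows need to be removed afterwards in this variant.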

In [29]:
#%% Find documents with high amounts of punctuation

#Define functions
def sliding_window(elements, window_size):
    """
    Extracts sliding windows of a specified size from a list.
    
    Parameters:
    elements (list): List of elements.
    window_size (int): Size of the sliding window.
    
    Returns:
    list: List of sliding windows.
    """
    if len(elements) <= window_size:
        return [elements]  # Return the whole list as one window
    return [elements[i:i + window_size] for i in range(len(elements) - window_size + 1)]


def count_punctuation(sentence):
    """
    Counts the number of punctuation marks in a sentence.

    Parameters:
    sentence (str): The input sentence.

    Returns:
    int: The count of punctuation marks in the sentence.
    """
    return sum(1 for char in sentence if char in string.punctuation)


def punctuation_percentage(sentence):
    """
    Calculates the percentage of punctuation marks in a sentence. Uses the count_punctuation function.

    Parameters:
    sentence (str): The input sentence.

    Returns:
    float: The percentage of punctuation marks in the sentence.
    """
    num_punctuation = count_punctuation(sentence)
    total_chars = len(sentence)
    return (num_punctuation / total_chars) * 100 if total_chars > 0 else 0

#do the filtering 

promising_text = [] #initiate list

#loop over the filtered and flagged documents
for index, row in filtered_flagged.iterrows():
    text = row['tokened_text']  # Extract the tokenized document
    
    #divide into parts of the document
    for element in sliding_window(text, 30):
        sentence = " ".join([str(token.text) for token in element]) #create a workable string
        percentage = punctuation_percentage(sentence) #calculate punctuation percentage
        
        #append the row if there is high punctuation somewhere in the document
        if percentage >= 30:
            promising_text.append(row)  # Append the row if percentage criteria met
    
    #Print index to keep track of progress
    if index % 1000 == 0:
        print(index)

#%%Filter the promising text

#create df
promising_text = pd.DataFrame(promising_text)

#filter out previously identified documents
#promising_text = promising_text[~promising_text['LabelIdentifier'].isin(checked_document_labels)]

#filter out duplicates (these are created with the sliding window approach)
promising_text_unique = promising_text.drop_duplicates(subset=['LabelIdentifier'], keep='first')

#save the results
promising_text_unique.to_csv("output/datasets/promising_text2.csv", index=False, sep=';')

print("done")

0
1000
2000
done


## Done

The resulting dataframe can now be assessed qualitatively, and the tables that are found can be transcribed. The transcriptions can then be used to georeference the locations in the tables with the `matching_placenames` notebook.