# Location-based Text Mining

**This notebook presents an approach to identify documents related to land use (LU) in large historical archives, without ex-ante knowledge of the content of these archives. The notebook is part of a set of three notebooks, which can all be found on [GitHub](https://github.com/Yegberink/VOC_land_use).**

## Load data

The data for the approach is loaded and cleaned in this code block using the index of the historical archives. This research focuses on Ceylon; update the label list according to your own region or archives.

In [2]:
#Install packages if necessary
#pip install pandas flair fuzzywuzzy python-Levenshtein

In [8]:
# Load packages
import pandas as pd
import os
from flair.data import Sentence
from flair.nn import Classifier
from fuzzywuzzy import fuzz

# Set the directory (change accordingly)
path = "/Users/Yannick/Documents/RA work"
os.chdir(path)

# Load data
files_df = pd.read_csv("Data/files.csv") #Text files
ceylon_labels = pd.read_table("Data/index_ceylon.txt", header=None) #Labels for ceylon documents

# Load the list of known placenames
placenames_ceylon = pd.read_excel("Data/placenames_sri.xlsx", sheet_name="alloetkoer_corle", header=None) #placenames in ceylon

#filter for the labels of ceylon
#Create set of complete labels
ceylon_labels_set = set("HaNA_1.04.02_" + ceylon_labels[0].astype(str))

#Create short number in the files used for filtering
files_df['file_number_short'] = files_df['LabelIdentifier'].apply(lambda x: '_'.join(x.split('_')[:-1]))

#Filter for the documents that concern Ceylon
ceylon_documents = files_df[files_df["file_number_short"].isin(ceylon_labels_set)].reset_index(drop=True)

#Make smaller one with just the label and the content (copy so that new columns can be added safely later)
ceylon_documents_clean = ceylon_documents[["LabelIdentifier", "DocumentContent"]].copy()

print("packages and data loaded") 

packages and data loaded
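
The label filtering above can be illustrated on a few made-up identifiers. This is a minimal sketch in plain Python: the inventory numbers and page-level identifiers are hypothetical, `short_file_number` is a helper name introduced here for illustration, and it assumes that the last underscore-separated part of a `LabelIdentifier` is a page or scan number.

```python
def short_file_number(label_identifier):
    """Drop the last underscore-separated part of an identifier."""
    return "_".join(label_identifier.split("_")[:-1])

# Hypothetical inventory numbers, as they might appear in the index file
index_numbers = ["1234", "5678"]
label_set = {"HaNA_1.04.02_" + number for number in index_numbers}

# Hypothetical page-level identifiers
page_labels = [
    "HaNA_1.04.02_1234_0001",
    "HaNA_1.04.02_1234_0002",
    "HaNA_1.04.02_9999_0001",
]

# Keep only the pages whose short file number occurs in the index
kept = [label for label in page_labels if short_file_number(label) in label_set]
print(kept)
```

Only the two pages of the hypothetical inventory number 1234 are kept, because 9999 is not in the index.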


## Apply Named Entity Recognition (NER)

In this code block NER is applied to find locations in the texts. This runs for a long time (±48 hours in the case of Ceylon), so the dataframe is saved after the identification. IMPORTANT: update the save path before you run the code block; otherwise you will have to start over.
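
Because a wrong save path only surfaces after the long run, it can help to check the path first. The sketch below is one possible check and not part of the original workflow: `ensure_output_path` is a hypothetical helper, and a temporary folder stands in for the real project directory.

```python
import os
import tempfile

def ensure_output_path(csv_path):
    """Create the parent folder of csv_path if needed and confirm it is writable."""
    parent = os.path.dirname(os.path.abspath(csv_path))
    os.makedirs(parent, exist_ok=True)
    # Create and remove a small probe file so permission problems surface early
    with tempfile.NamedTemporaryFile(dir=parent):
        pass
    return parent

# Example with a temporary base folder standing in for the project directory
base = tempfile.mkdtemp()
target = os.path.join(base, "output", "datasets", "NER_Ceylon.csv")
parent = ensure_output_path(target)
print(os.path.isdir(parent))
```

Calling such a check before loading the model means a missing or read-only folder raises an error immediately instead of after the tagging has finished.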

In [None]:
#Load the NER model
tagger = Classifier.load("nl-ner-large")

# apply the model to the generale missiven
for i in range(len(ceylon_documents_clean)):
    
    # Extract the document content from the second column
    content = ceylon_documents_clean.iloc[i, 1]
    
    # make sentence (tokenisation) --> other tokenisers can be used this one just gives back the words
    sentence = Sentence(content)
    
    # predict NER tags
    tagger.predict(sentence)
    
    # Extract the named entities
    entities = sentence.get_spans('ner')
    
    # Initiate empty list
    location_words = []

    #Filter for locations  
    for span in entities:
        if span.tag == "LOC":
            location_words.append(span.text) #Append the list
            
    # Save the list of location words as a string in the dataframe
    locations_string = ', '.join(location_words)
    ceylon_documents_clean.at[i, "NamedEntities"] = locations_string #Save data
    
    #Track the progress (make the division number smaller to get more feedback)
    if i % 10000 == 0:
        print(i)
        
#%% Save the results (update the path according to your data structure)
results_NER_ceylon = ceylon_documents_clean
results_NER_ceylon.to_csv("output/datasets/NER_Ceylon.csv", index=False)

print("done")

## Clean output of NER and split into sentences

The output of the NER is now cleaned, and for each location a context of up to 30 tokens on either side is extracted so that these sentences can be used to identify useful documents.
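
The context window can be illustrated without Flair. The sketch below is a simplified stand-in: it assumes whitespace tokenisation instead of the Flair tokeniser, joins the tokens with spaces, and uses an invented example sentence; `context_windows` is a name introduced here for illustration.

```python
def context_windows(tokens, locations, window=30):
    """Return (location, context) pairs with up to `window` tokens on each side."""
    results = []
    for i, token in enumerate(tokens):
        if token in locations:
            start = max(0, i - window)                # do not run past the start
            end = min(len(tokens), i + window + 1)    # do not run past the end
            results.append((token, " ".join(tokens[start:end])))
    return results

# Invented example sentence, tokenised on whitespace
tokens = "de inwoonders van Colombo leveren kaneel aan de Compagnie".split()
pairs = context_windows(tokens, {"Colombo"}, window=2)
print(pairs)
# → [('Colombo', 'inwoonders van Colombo leveren kaneel')]
```

With `window=30` the same logic returns at most 61 tokens per location, and fewer near the start or end of a document.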

In [11]:
#Load the NER sentences if needed
results_NER_ceylon = pd.read_csv("output/datasets/NER_Ceylon.csv")

# define functions
# Function to split values at commas and return a single list of named entities
def split_and_concatenate(row):
    if isinstance(row, str):
        entities = row.split(", ")  # Split the string by comma and space
        return entities
    else:
        return []

# Define function for extracting locations
def extract_locations(NER_output):
    #Loop over the locations to keep only locations with more than 2 characters
    NER_output_filtered = [loc for loc in NER_output if len(loc) > 2]
    locations_from_function = set(NER_output_filtered)
    return locations_from_function

# split the NER results into lists
results_NER_ceylon['listed_values'] = results_NER_ceylon['NamedEntities'].apply(lambda x: split_and_concatenate(x))

#Initiate empty list
tokened_sentences = []

#Extract the locations and save the 30 tokens around it

for index, row in results_NER_ceylon.iterrows():
    text = row['DocumentContent']  # Extract the content
    NER_output = row['listed_values']  # Extract locations
    locations_set = extract_locations(NER_output) #clean up the locations

    #Tokenise the sentence again using the tokeniser
    tokenized_text = Sentence(text)

    #Define the context window (Change if you want longer sentences)
    context_window = 30

    #Loop over tokens from the tokenized text
    for i, token in enumerate(tokenized_text):

        #Append the list if there is a location in the sentence
        if token.text in locations_set:
            start_index = max(0, i - context_window)
            end_index = min(len(tokenized_text), i + context_window + 1)
            context_tokens = tokenized_text[start_index:end_index]
            context_text = ', '.join(token.text for token in context_tokens if token.text.strip(" ") != ",")
            tokened_sentences.append({'LabelIdentifier': row['LabelIdentifier'], 'sentence': context_text, 'location': token.text})
    
    #Track the progress       
    if index % 30000 == 0:
        print(index)

# Convert the list of dictionaries to a DataFrame
tokened_sentences = pd.DataFrame(tokened_sentences)

print("done")

0
30000
60000
done


## Identify promising pages

The last step is to identify promising text based on the locations found in the NER step and a list of known locations. By matching these two lists we can identify documents that mention locations that are probably interesting for LU. This is done using the two functions specified below. The first is a sliding window function, which breaks up placenames so that two placenames that are pasted together are not ignored. The second is a fuzzy matching function, which uses the Levenshtein distance between two strings to compute a similarity score. By setting a threshold, only the most similar locations are returned, giving a manageable dataframe to assess for the occurrence of LU information.
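
The two ideas can be tried out on a small example first. The sketch below uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`, so the scores may differ slightly from the Levenshtein-based scores of the notebook; the helper names and the pasted-together string are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity score from 0 to 100 (difflib stand-in)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

def windows(text, size):
    """All substrings of length `size`, or the whole text if it is not longer than `size`."""
    if len(text) <= size:
        return [text]
    return [text[i:i + size] for i in range(len(text) - size + 1)]

def best_window_score(word, placename):
    """Highest similarity between the placename and any equally long window of the word."""
    return max(similarity(w, placename) for w in windows(word, len(placename)))

print(windows("abcd", 2))
# → ['ab', 'bc', 'cd']

# Two placenames pasted together: the whole string matches poorly,
# but one of its windows matches the known placename exactly
print(similarity("GalleColombo", "Colombo") < 90)
# → True
print(best_window_score("GalleColombo", "Colombo"))
# → 100
```

Comparing every window against the known placename is what lets a pasted-together string still be recognised, at the cost of many more comparisons per word.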

In [12]:
# Initiate sliding window function
def sliding_window(elements, window_size):
    """
    Extracts sliding windows of a specified size from a list.
    
    Parameters:
    elements (list): List of elements.
    window_size (int): Size of the sliding window.
    
    Returns:
    list: List of sliding windows.
    """
    if len(elements) <= window_size:
        return [elements]  # A single window containing the whole input
    return [elements[i:i+window_size] for i in range(len(elements)- window_size + 1)]

def find_matches(word, list_to_match, threshold=90, threshold2=90):
    """
    Finds matches to a word by fuzzy matching with a list of items.
    
    Parameters:
    word (str): Input word to match.
    list_to_match (list): List of items for matching.
    threshold (int, optional): Minimum threshold for fuzzy matching. Defaults to 90.
    threshold2 (int, optional): Higher threshold for more exact matching used in combination with the sliding window. Defaults to 90.
    
    Returns:
    list: List of tuples containing (word, item, similarity_score).
    """
    matches = []  # Create empty list
    
    # Iterate over the items in the list
    for item in list_to_match:
        similarity_score = fuzz.ratio(word.lower(), item.lower())  # Apply fuzzy matching to the item in the list and the word that has to be matched
        if similarity_score >= threshold:
            matches.append((word, item, similarity_score))  # Append the list
        
        else:
            # If the location as a whole does not match, a window of the word with the length of the item might
            for element in sliding_window(word, len(item)): #Loop over the elements of the sliding window
                similarity_score = fuzz.ratio(element.lower(), item.lower())  # Compare the window with the item
                if similarity_score >= threshold2:
                    matches.append((word, item, similarity_score))  # Append the list
    
    return matches  # Return the list of matches


## Apply functions

In the following part the functions are applied and a dataframe of promising text is created by merging with the original documents. An area list is specified to extract only the areas found on the map. The results are saved to be qualitatively assessed. 

In [15]:
#Initiate list for promising sentences
promising_sentences = []

#List of areas
areas_list = ["alloetkoer corle", "hina corle", "happitigam corle"]

#Iterate over the areas so the placenames can be identified
for area in areas_list:
    
    #Print the area for tracking progress
    print(area)
    
    #filter out locations that give problems with matching
    problematic_locations = ["oeddetoetterepittige", "talloewatteheenpittie"]

    # Filter out problematic locations
    filtered_placenames = placenames_ceylon.loc[~placenames_ceylon[0].isin(problematic_locations)]
    
    # Filter the filtered placenames to only have the current area
    placename_set = set(filtered_placenames.loc[filtered_placenames[1] == area, 0])
        
    #Iterate over the tokened_sentences df
    for index, row in tokened_sentences.iterrows():
        
        #Extract the location
        location = row["location"]
    
        #Apply the find_matches function to match the location on the map with a location mentioned in the texts
        found_match = find_matches(location, placename_set, threshold=95, threshold2=98)
            
        #If a match is found add the match and the area to the row and append dataframe
        if found_match:
            row["match"] = found_match #Add match
            row["area"] = area #Add area
            promising_sentences.append(row) #append df
            
        #Print index to track progress
        if index % 30000 == 0:
            print(index)
      
#Create df from the list
promising_sentences = pd.DataFrame(promising_sentences)

#merge the df with the generale missiven
#remove the dot from the generale missiven label
ceylon_documents["labels"] = ceylon_documents["LabelIdentifier"].str.rstrip(".")

# Remove trailing dot from the "LabelIdentifier" column
promising_sentences["labels"] = promising_sentences["LabelIdentifier"].str.rstrip(".")

#Drop the unnecessary columns
ceylon_documents = ceylon_documents.drop('LabelIdentifier', axis=1)

#%% merge the df
promising_text_ceylon = pd.merge(promising_sentences, ceylon_documents, on="labels", how="inner")
promising_text_ceylon = promising_text_ceylon.sort_values(by='labels').reset_index(drop=True)

promising_text_ceylon = promising_text_ceylon[["labels", "DocumentContent", "location", "area", "match"]]

#%% write csv for checking
promising_text_ceylon.to_csv("output/datasets/promising_text_ceylon.csv", index=False, sep=';')

print("done")

alloetkoer corle
0
30000
60000
90000
hina corle
0
30000
60000
90000
happitigam corle
0
30000
60000
90000
done


## Done

After the qualitative assessment, keywords can be identified and the structure of the documents can be used to inform a second round of text mining, which can be found in the Jupyter notebook keyword_based_TM.