**Documentation for the Data Processing Code**

This code processes sentences about mountains and generates token-level annotations in the BIO (Begin-Inside-Outside) format. The pipeline reads the data, replaces the placeholder word "mountains" with specific mountain names, tokenizes each sentence, assigns BIO labels, and saves the results. Below is a step-by-step explanation.

In [57]:
!pip install pandas
        



**Reading the List of Mountain Names**

- What happens:

Opens the file mountains_list100.txt, which contains one mountain name per line.
Reads every line into the list mountains.
Strips surrounding whitespace from each name with strip() and drops lines that are empty after stripping.
 - Result:
 
A cleaned list of mountain names stored in mountains.

In [None]:

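file_path = 'data/mountains_list100.txt'

# Read one mountain name per line; UTF-8 is assumed because names may contain non-ASCII characters
with open(file_path, 'r', encoding='utf-8') as file:
    mountains = file.readlines()

# Strip surrounding whitespace and skip blank lines
mountains = [mountain.strip() for mountain in mountains if mountain.strip()]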


**Replacing the Word "mountains" in Sentences with Mountain Names**

 - What happens:

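Opens the file mountains_sentences.csv, which contains one sentence per line.
In each sentence that contains the word "mountains", that word is replaced with a mountain name chosen at random from the mountains list. The name is drawn once per sentence, so every occurrence within a sentence receives the same name.
All sentences, modified or not, are collected in new_sentences.
Writes the sentences to the file annotated_sentences500.csv.
 - Result:
 
A file annotated_sentences500.csv with "mountains" replaced by specific mountain names. At this stage the file contains only sentences; the BIO labels are added in a later step.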

In [None]:
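import random
import re

sentences_file_path = 'data/mountains_sentences.csv'
mountains_file_path = 'data/mountains_list100.txt'
output_file_path = 'data/annotated_sentences500.csv'

# Load mountain names, skipping blank lines so random.choice never returns an empty name
with open(mountains_file_path, 'r', encoding='utf-8') as file:
    mountains = [line.strip() for line in file if line.strip()]

# Match the placeholder "mountains" regardless of letter case
placeholder = re.compile('mountains', re.IGNORECASE)

new_sentences = []
with open(sentences_file_path, 'r', encoding='utf-8') as file:
    for sentence in file:
        if placeholder.search(sentence):
            # One name is drawn per sentence; every occurrence receives that same name
            name = random.choice(mountains)
            # A function replacement inserts the name literally (no backslash escapes are interpreted)
            new_sentences.append(placeholder.sub(lambda match: name, sentence))
        else:
            new_sentences.append(sentence)

# Lines keep their original newline characters, so they are written back unchanged
with open(output_file_path, 'w', encoding='utf-8') as file:
    file.writelines(new_sentences)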

**Tokenizing Sentences**

 - What happens:

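The tokenize function splits a sentence into word tokens and periods.
It uses the regular expression \b\w+\b|\. with two alternatives:
\b\w+\b matches words (runs of letters, digits, and underscores).
\. matches a literal period.
Any other punctuation, such as commas, hyphens, or apostrophes, matches neither alternative and is discarded.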
 - Result:

A list of tokens for each sentence.
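
The behaviour of this regular expression is easiest to see on a couple of sentences. The snippet below is a minimal sketch that reuses the same pattern; the example sentences are made up for illustration and do not come from the dataset.

```python
import re

def tokenize(sentence):
    """Return word tokens and periods; all other punctuation is dropped."""
    return re.findall(r'\b\w+\b|\.', sentence)

# The comma is discarded, the final period is kept as its own token
print(tokenize("We climbed Mont Blanc, then rested."))
# ['We', 'climbed', 'Mont', 'Blanc', 'then', 'rested', '.']

# Apostrophes and hyphens split a word into separate tokens
print(tokenize("It's a well-known peak."))
# ['It', 's', 'a', 'well', 'known', 'peak', '.']
```

The second call shows a side effect worth keeping in mind: a mountain name that contains a hyphen or an apostrophe is split into several tokens in the sentence.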

**Annotating Sentences in BIO Format**

- What happens:

The bio_annotate function assigns BIO labels to each token:
B-MOUNTAIN (Begin) for the first token of a mountain name.
I-MOUNTAIN (Inside) for subsequent tokens in the mountain name.
O (Outside) for tokens not related to mountain names.
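Multi-word mountain names are matched as a unit: the list of names is sorted longest first before annotation, and once a name matches, its remaining tokens are skipped so they are not labelled twice.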
 - Result:

A list of BIO labels corresponding to the tokens in the sentence.
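
The labelling scheme can be illustrated on one sentence. The sketch below is a compact variant of the idea, not the notebook's exact function: it uses a while loop instead of a skip counter, and it tokenizes the mountain names with the same tokenizer as the sentence, which is an assumption made here for consistency. The sentence and the two-name list are illustrative only.

```python
import re

def tokenize(sentence):
    return re.findall(r'\b\w+\b|\.', sentence)

def bio_annotate(tokens, mountains):
    # Tokenize each name once; try names with more tokens first
    names = sorted((tokenize(m) for m in mountains), key=len, reverse=True)
    labels = []
    i = 0
    while i < len(tokens):
        for name in names:
            # Ignore empty names, which would otherwise match everywhere
            if name and tokens[i:i + len(name)] == name:
                labels += ['B-MOUNTAIN'] + ['I-MOUNTAIN'] * (len(name) - 1)
                i += len(name)  # jump past the whole name
                break
        else:
            # No name starts at this position
            labels.append('O')
            i += 1
    return labels

tokens = tokenize("We saw Mont Blanc and Manaslu.")
labels = bio_annotate(tokens, ["Manaslu", "Mont Blanc"])
for token, label in zip(tokens, labels):
    print(token, label)
```

Each token receives exactly one label, so the two lists always have the same length: "Mont" is labelled B-MOUNTAIN, "Blanc" I-MOUNTAIN, "Manaslu" B-MOUNTAIN, and every other token O.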

**Annotating Text and Saving the Results**

 - What happens:

Loads the sentences produced by the replacement step from the file annotated_sentences500.csv (column sentence).
Loads the mountain names and sorts them from longest to shortest.
For each sentence:
Tokenizes it using tokenize.
Annotates it using bio_annotate.
Stores the tokens and labels in the list annotated_data.
Creates a DataFrame df_annotations with columns:
tokens: the list of tokens in each sentence.
labels: the corresponding BIO labels.
Saves the DataFrame to the file annotated_sentences500.csv, overwriting the input file.
 - Result:
 
An annotated CSV file with tokens and their BIO labels.


In [None]:
import re

import pandas as pd

# Sentences written by the replacement step; a 'sentence' column is expected
file_path = 'data/annotated_sentences500.csv'
df = pd.read_csv(file_path)

mountains_file_path = 'data/mountains_list100.txt'
with open(mountains_file_path, 'r', encoding='utf-8') as file:
    # Skip blank lines: an empty name would match at every token position
    mountains = [line.strip() for line in file if line.strip()]
    # Longest names first, so a longer name is tried before a shorter one it contains
    mountains = sorted(mountains, key=len, reverse=True)

def tokenize(sentence):
    # Words (letters, digits, underscores) and periods; other punctuation is dropped
    return re.findall(r'\b\w+\b|\.', sentence)

def bio_annotate(tokens, mountains):
    labels = []
    skip = 0
    for i, token in enumerate(tokens):
        if skip > 0:
            # This token was already labelled as part of a multi-token name
            skip -= 1
            continue
        matched = False
        for mountain in mountains:
            # Tokenize the name like the sentence so punctuation is handled the same way
            mountain_tokens = tokenize(mountain)
            if mountain_tokens and tokens[i:i+len(mountain_tokens)] == mountain_tokens:
                labels += ['B-MOUNTAIN'] + ['I-MOUNTAIN'] * (len(mountain_tokens) - 1)
                skip = len(mountain_tokens) - 1
                matched = True
                break
        if not matched:
            labels.append('O')
    return labels

annotated_data = []
for sentence in df['sentence']:
    tokens = tokenize(sentence)
    labels = bio_annotate(tokens, mountains)
    annotated_data.append([tokens, labels])

df_annotations = pd.DataFrame(annotated_data, columns=['tokens', 'labels'])
# Note: this overwrites the input file read above
df_annotations.to_csv('data/annotated_sentences500.csv', index=False)



**Previewing the Results**

 - What happens:

Loads the annotated file annotated_sentences500.csv.
Displays the first 5 rows of the DataFrame for verification.
 - Result:
 
A preview of the DataFrame with tokens and labels columns.

In [None]:
import pandas as pd

df = pd.read_csv("data/annotated_sentences500.csv")
df.head(5)

Unnamed: 0,tokens,labels
0,"['There', 'are', 'no', 'Manaslu', 'between', '...","['O', 'O', 'O', 'B-MOUNTAIN', 'O', 'O', 'O', '..."
1,"['We', 'were', 'just', 'about', 'to', 'go', 'u...","['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ..."
2,"['The', 'quaint', 'village', 'is', 'surrounded...","['O', 'O', 'O', 'O', 'O', 'O', 'B-MOUNTAIN', '..."
3,"['Ridgway', 'a', 'few', 'more', 'miles', 'away...","['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ..."
4,"['He', 'was', 'angry', 'with', 'her', 'for', '...","['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ..."
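
One detail to be aware of when reusing this file: each cell held a Python list when it was saved, so the CSV stores its string representation, and read_csv returns a string rather than a list. A minimal sketch of converting such a cell back with the standard library is shown below; the two strings reproduce the first four tokens and labels of row 0 from the preview, shortened by hand for illustration.

```python
import ast

# Cells as they come back from the CSV: text that looks like a list
raw_tokens = "['There', 'are', 'no', 'Manaslu']"
raw_labels = "['O', 'O', 'O', 'B-MOUNTAIN']"

# literal_eval safely parses Python literals without executing code
tokens = ast.literal_eval(raw_tokens)
labels = ast.literal_eval(raw_labels)

# Pair every token with its label
pairs = list(zip(tokens, labels))
print(pairs[3])  # ('Manaslu', 'B-MOUNTAIN')
```

On the full DataFrame the same conversion could be applied per column, for example with df['tokens'].apply(ast.literal_eval).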
