# Dataset creation
This Jupyter notebook builds a structured dataset for a named-entity recognition (NER) task: detecting mountain names in text. The raw dataset was generated by ChatGPT and is stored as a CSV file. The notebook corrects errors in the raw data, augments it, and adds sentences from an external dataset available on Kaggle, so that the result is more accurate and better suited to subsequent neural network training and analysis.

In [1]:
import pandas as pd
import re
from termcolor import colored
from IPython.display import HTML
import random

In [2]:
"""
Convert a list of tuples to a dataframe. Each tuple contains a string and a list of strings; the list is stored as a comma-separated string.
"""

def lst2df(lst, column_name1, column_name2):
    data = []
    for col1, col2 in lst:
        col2_seq = ",".join(col2)
        data.append({column_name1: col1, column_name2: col2_seq})

    return pd.DataFrame(data)

"""
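Reverse of lst2df: convert a two-column dataframe back to a list of (string, list of strings) tuples.
"""
def df2lst(df):
    lst = []
    for _, row in df.iterrows():
        # .iloc makes the positional access explicit; plain row[0] on a labelled Series is deprecated in recent pandas
        col1 = row.iloc[0]
        col2 = row.iloc[1].split(',')
        lst.append((col1, col2))
    return lst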

# loading our raw data
df_set_GPT = pd.read_csv('raw_set_GPT.csv')
display(df_set_GPT)

Unnamed: 0,sentence,named_mountains
0,"Mount Everest, the world's highest peak, is a ...",Mount Everest
1,Kilimanjaro in Tanzania is known for its snow-...,Kilimanjaro
2,The Rocky Mountains are famous for their stunn...,Rocky Mountains
3,Mount Fuji is a sacred symbol in Japanese cult...,Mount Fuji
4,The Andes stretch along the western edge of So...,Andes
...,...,...
775,The perilous journey across the Stormlands alw...,Thunder Peak
776,"The mysterious Moonlit Path, winding through t...",Moonstone Cliffs
777,"The ancient Frostvale Path, known for its icy ...",Frostcrown Mountain
778,"In the realm of the sky islands, the floating ...",Cloudreach Summit


## Addressing mistakes made by ChatGPT
This dataframe consists of sentences and, for each sentence, the list of mountains named in it.
Note that ChatGPT can make mistakes and sometimes omits a named mountain from the list. We address this by collecting every mountain name that ChatGPT listed anywhere in the dataset and checking each sentence for matches. Although 100% accuracy is not guaranteed, this approach produces results that are good enough for our purpose.
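
One caveat of plain substring matching is that a short name can also match inside a longer word. Below is a minimal sketch of a stricter check, assuming whole-word matching is acceptable for this data. The helper `find_missing_names` and the sample sentence are hypothetical and not part of the notebook; names that end in punctuation would need separate handling.

```python
import re

def find_missing_names(text, listed, known_names):
    """Return known names that occur in `text` as whole words but are not covered by `listed`."""
    missing = []
    for name in known_names:
        # Skip names already listed, or contained in a listed name (e.g. "Everest" in "Mount Everest").
        if any(name in mountain for mountain in listed):
            continue
        # \b on both sides prevents matches inside longer words.
        if re.search(rf"\b{re.escape(name)}\b", text):
            missing.append(name)
    return missing

text = "Mount Everest rises in the Himalayas, far from the Andesite cliffs."
print(find_missing_names(text, ["Mount Everest"], ["Everest", "Himalayas", "Andes"]))
# prints ['Himalayas']
```

Here "Andes" is not reported because it only occurs inside "Andesite", whereas a plain `in` test would have matched it.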


In [3]:

# creating a list of unique mountain names
name_lst = df_set_GPT['named_mountains'].to_list()
new_name_lst = []
for name_seq in name_lst:
    for name in name_seq.split(","):
        if name not in new_name_lst:
            new_name_lst.append(name)
print(len(new_name_lst))


524


In [4]:
# adding mountains that appear in a sentence but are missing from its list
for i, row in df_set_GPT.iterrows():
    text, mountains = row['sentence'], row['named_mountains'].split(",")
    for name in new_name_lst:
        # skip names that are already listed or are part of a listed name
        if name in text and not any(name in mountain for mountain in mountains):
            # append in place so that several missing names accumulate instead of overwriting each other
            mountains.append(name)
    df_set_GPT.at[i, 'named_mountains'] = ",".join(mountains)

In [5]:
display(df_set_GPT)
df_set_GPT.to_csv('purified_set_GPT.csv', index=False)

Unnamed: 0,sentence,named_mountains
0,"Mount Everest, the world's highest peak, is a ...","Mount Everest,Himalayas"
1,Kilimanjaro in Tanzania is known for its snow-...,Kilimanjaro
2,The Rocky Mountains are famous for their stunn...,Rocky Mountains
3,Mount Fuji is a sacred symbol in Japanese cult...,Mount Fuji
4,The Andes stretch along the western edge of So...,Andes
...,...,...
775,The perilous journey across the Stormlands alw...,Thunder Peak
776,"The mysterious Moonlit Path, winding through t...",Moonstone Cliffs
777,"The ancient Frostvale Path, known for its icy ...",Frostcrown Mountain
778,"In the realm of the sky islands, the floating ...",Cloudreach Summit


## Data augmentation
At this point, we face two problems:

1. The dataset is small.
2. Named entities tend to appear at the start of sentences, which can lead the model to rely on positional encoding more than it should.

To address these issues, we will augment our data in two ways:

1. Add a short phrase at the beginning of sentences, which doesn't significantly impact their meaning.
2. Concatenate two sentences with a meaningless part in between.
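
The positional concern in the second problem can be checked with a quick measurement before and after augmentation. This is a rough sketch, assuming whitespace tokenization is precise enough for an estimate. The helper `first_entity_positions` and the sample rows are illustrative only.

```python
from collections import Counter

def first_entity_positions(rows):
    """Count the word index at which the first listed name starts in each sentence."""
    positions = Counter()
    for sentence, names in rows:
        # Character offsets of every listed name that occurs in the sentence.
        offsets = [sentence.find(name) for name in names if name in sentence]
        if not offsets:
            continue
        # Convert the earliest character offset into a word index.
        word_index = len(sentence[:min(offsets)].split())
        positions[word_index] += 1
    return positions

rows = [
    ("Mount Fuji is a sacred symbol.", ["Mount Fuji"]),
    ("The Andes stretch along the coast.", ["Andes"]),
    ("Legends say that Mount Fuji is sacred.", ["Mount Fuji"]),
]
print(dict(first_entity_positions(rows)))
# prints {0: 1, 1: 1, 3: 1}
```

A distribution concentrated at indices 0 and 1 would support the concern; a flatter distribution after augmentation would suggest the added phrases help.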

In [6]:
def augment_sentences(sentences, augmentations):
    # Adds a short phrase at the beginning of sentences, which doesn't significantly impact their meaning.
    augmented_sentences = []
    for sentence, tags in sentences:
        if random.random() < 0.5:
            intro = random.choice(augmentations)
            new_sentence = f"{intro} {sentence}"
            augmented_sentences.append((new_sentence, tags))
    return augmented_sentences

augmentations = [
    "According to explorers,",
    "As reported in the news,",
    "In the history of mountaineering,",
    "From a geographical perspective,",
    "As many hikers have observed,",
    "Legends say that",
    "Tourists often find that",
    "It is well-known that",
    "Historically speaking,",
    "Among the local population, it is believed that",
    "Travel guides mention that",
    "According to ancient myths,",
    "Frequent climbers state that",
    "Geologists have discovered that",
    "Nature enthusiasts enjoy that",
    "It's a lesser-known fact that",
    "In the realm of folklore,",
    "Photographers love that",
    "Experienced guides often say that",
    "The local lore suggests that",
    "In popular travel blogs,",
    "Adventure seekers have noted that",
    "It is often said that",
    "From a bird's-eye view,",
    "In the eyes of historians,",
    "Environmental studies show that",
    "Mountain climbers often recount that",
    "According to recent studies,",
    "Nature documentaries often highlight that",
    "In travel documentaries,",
    "From a climber's perspective,",
    "Local traditions state that",
    "As seen in satellite imagery,"
]

middles = [
    ", however",
    ", another fact is that",
    ", but"
]

def combine_sentences(sentences, middles):
    # Concatenates two randomly chosen sentences with a meaningless part in between.
    combined_sentences = []
    n = len(sentences)
    # two independent random index sequences; position i of each forms one pair
    first_ids = random.choices(range(n), k=n)
    second_ids = random.choices(range(n), k=n)
    for first_id, second_id in zip(first_ids, second_ids):
        mid = random.choice(middles)
        first_text, first_names = sentences[first_id]
        second_text, second_names = sentences[second_id]
        # dropping the final character of the first sentence (assumed to be its closing punctuation)
        first = first_text[:-1]
        # keeping the capital letter only if the second sentence starts with a mountain name
        if any(second_text.startswith(name) for name in second_names):
            second = second_text
        else:
            second = f"{second_text[0].lower()}{second_text[1:]}"
        new_sentence = f"{first}{mid} {second}"
        names = [*first_names, *second_names]
        combined_sentences.append((new_sentence, names))
    return combined_sentences

combined_data_lst = combine_sentences(df2lst(df_set_GPT), middles)
combined_data_df = lst2df(combined_data_lst, 'sentence', 'named_mountains')
augmented_data_lst = augment_sentences(df2lst(df_set_GPT), augmentations)
augmented_data_df = lst2df(augmented_data_lst, 'sentence', 'named_mountains')

In [7]:
display(augmented_data_df)
augmented_data_df.to_csv('augmented_part.csv', index=False)

Unnamed: 0,sentence,named_mountains
0,Nature documentaries often highlight that The ...,Rocky Mountains
1,"As many hikers have observed, Denali, formerly...","Denali,Mount McKinley"
2,"According to explorers, The Ural Mountains for...",Ural Mountains
3,"From a bird's-eye view, The Appalachians are o...",Appalachians
4,"In travel documentaries, Mount Kilimanjaro is ...",Mount Kilimanjaro
...,...,...
356,Environmental studies show that In the far nor...,Aurora Point
357,Travel guides mention that The treacherous Fro...,Iceheart Summit
358,Mountain climbers often recount that The enigm...,Shadowfall Mountains
359,"In the eyes of historians, The mysterious Moon...",Moonstone Cliffs


In [8]:
display(combined_data_df)
combined_data_df.to_csv('combined_part.csv', index=False)

Unnamed: 0,sentence,named_mountains
0,The Bone Mountains in George R.R. Martin's A S...,"Bone Mountains,Skyward Peaks"
1,"The Brocken, in the Harz Mountains of Germany,...","Brocken,Harz Mountains,Mount Kazbek"
2,"According to local folklore, the Skyfire Range...","Skyfire Range,Mount Vesuvius"
3,Mount Kailash in Tibet is revered as a sacred ...,"Mount Kailash,Mount Kailash"
4,The Glass Mountains in the Land of Oz are a fa...,"Glass Mountains,Mount St. Helens"
...,...,...
775,"Mount Helios, in the sun-drenched isles of Hel...","Mount Helios,Carpathian Mountains"
776,Mount Vesuvius is famous for its eruption in 7...,"Mount Vesuvius,Mount Cook,Aoraki,Southern Alps"
777,"The Altai Mountains, straddling the borders of...","Altai Mountains,Mount Aspiring"
778,The Sierra Madre Oriental range in Mexico is k...,"Sierra Madre Oriental,Mount Shasta"
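
Row 3 above lists `Mount Kailash` twice, which can happen when both halves of a combined sentence name the same mountain. The repeated entry looks harmless, but the list can be de-duplicated while preserving order if a cleaner column is wanted. This is a small sketch; `unique_names` is a hypothetical helper, not part of the notebook.

```python
def unique_names(names):
    """Drop repeated names while keeping first-seen order."""
    # dict preserves insertion order, so fromkeys acts as an ordered set.
    return list(dict.fromkeys(names))

print(unique_names(["Mount Kailash", "Mount Kailash"]))
# prints ['Mount Kailash']
print(unique_names(["Mount Vesuvius", "Mount Cook", "Aoraki", "Mount Cook"]))
# prints ['Mount Vesuvius', 'Mount Cook', 'Aoraki']
```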


In [9]:
# combining our dataset with all augmentations 
full_data_df = pd.concat([df_set_GPT, combined_data_df, augmented_data_df], ignore_index=True)
display(full_data_df)
full_data_df.to_csv('full_data.csv', index=False)

Unnamed: 0,sentence,named_mountains
0,"Mount Everest, the world's highest peak, is a ...","Mount Everest,Himalayas"
1,Kilimanjaro in Tanzania is known for its snow-...,Kilimanjaro
2,The Rocky Mountains are famous for their stunn...,Rocky Mountains
3,Mount Fuji is a sacred symbol in Japanese cult...,Mount Fuji
4,The Andes stretch along the western edge of So...,Andes
...,...,...
1916,Environmental studies show that In the far nor...,Aurora Point
1917,Travel guides mention that The treacherous Fro...,Iceheart Summit
1918,Mountain climbers often recount that The enigm...,Shadowfall Mountains
1919,"In the eyes of historians, The mysterious Moon...",Moonstone Cliffs


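## Creating Beginning-Outside-Inside tags
We will use the "named_mountains" column to produce BOI tags: B marks the first token of a mountain name, I marks its remaining tokens, and O marks every other token.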
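
As a small illustration of the scheme on a made-up sentence, the sketch below applies the same tokenization pattern that the notebook uses. `tag_single_name` is a simplified, hypothetical helper that handles a single name and only its first occurrence.

```python
import re

TOKEN_PATTERN = r"\b\w+\b|\S"

def tag_single_name(sentence, name):
    """Tag the first occurrence of `name` in `sentence` with B/I, everything else with O."""
    words = re.findall(TOKEN_PATTERN, sentence)
    name_words = re.findall(TOKEN_PATTERN, name)
    tags = ["O"] * len(words)
    for i in range(len(words) - len(name_words) + 1):
        if words[i:i + len(name_words)] == name_words:
            tags[i] = "B"
            tags[i + 1:i + len(name_words)] = ["I"] * (len(name_words) - 1)
            break
    return list(zip(words, tags))

for word, tag in tag_single_name("The Rocky Mountains are famous.", "Rocky Mountains"):
    print(f"{word}\t{tag}")
```

Note that punctuation becomes its own token, so the final period receives an O tag of its own.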

In [10]:
def generate_boi_tags(df):
    token_pattern = r'\b\w+\b|\S'
    boi_data = []

    for _, row in df.iterrows():
        sentence = row['sentence']
        named_mountains = row['named_mountains'].split(',')
        words = re.findall(token_pattern, sentence)
        tags = ['O'] * len(words)

        # longest names first, so that "Mount Kilimanjaro" is not overwritten by "Kilimanjaro"
        for mountain in sorted(named_mountains, key=len, reverse=True):
            mountain_words = re.findall(token_pattern, mountain)
            n = len(mountain_words)
            if n == 0:
                # an empty name would otherwise match at position 0
                continue

            # tagging every occurrence, not only the first one
            for i in range(len(words) - n + 1):
                span_is_free = all(tag == 'O' for tag in tags[i:i + n])
                if words[i:i + n] == mountain_words and span_is_free:
                    tags[i] = 'B'
                    tags[i + 1:i + n] = ['I'] * (n - 1)

        boi_data.append((sentence, tags))

    return boi_data

boi_dataset = generate_boi_tags(full_data_df)

## Visualizing the tagged data in an easy-to-read way

In [11]:
def visualize_tagged_data_html(sentences, tags):
    visualized_sentences = ""
    for index, (sentence, tag_seq) in enumerate(zip(sentences, tags)):
        visualized_sentences += f"{index}: "
        words = re.findall(r'\b\w+\b|\S', sentence)
        for word, tag in zip(words, tag_seq):
            if tag == 'B':
                visualized_sentences += f"<span style='color:red'>{word}</span> "
            elif tag == 'I':
                visualized_sentences += f"<span style='color:orange'>{word}</span> "
            else:
                visualized_sentences += f"{word} "
        visualized_sentences += "<br>"
    return HTML(visualized_sentences)

In [12]:
painted = visualize_tagged_data_html([sentence for sentence, _ in boi_dataset], [tag_seq for _, tag_seq in boi_dataset])

In [13]:
display(painted)

## Adding sentences with no mountains to our dataset
The dataset should be representative of the real data on which the neural network will be used: the share of named entities and the share of sentences containing them should match the distribution in real data. Since this neural network is intended solely to demonstrate my skills, we will assume that it is sufficient to add a number of sentences in which no mountain names appear.
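
One way to make the idea of a representative distribution concrete is to measure the share of sentences that contain at least one entity and the share of tokens that are not tagged O. This is a minimal sketch on toy data; `label_statistics` is a hypothetical helper, and the target values would have to come from the data the model is actually meant for.

```python
def label_statistics(dataset):
    """Return (share of sentences with an entity, share of tokens tagged B or I)."""
    sentences_with_entity = sum(1 for _, tags in dataset if any(t != "O" for t in tags))
    total_tokens = sum(len(tags) for _, tags in dataset)
    entity_tokens = sum(1 for _, tags in dataset for t in tags if t != "O")
    return sentences_with_entity / len(dataset), entity_tokens / total_tokens

toy_dataset = [
    ("Mount Fuji is sacred.", ["B", "I", "O", "O", "O"]),
    ("Police counted the marchers.", ["O", "O", "O", "O", "O"]),
]
sentence_share, token_share = label_statistics(toy_dataset)
print(f"{sentence_share:.2f} {token_share:.2f}")
# prints 0.50 0.20
```

Adding one mountain-free sentence per mountain sentence, as done below, corresponds to a sentence share of roughly 0.5.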

In [14]:
# dataset from https://www.kaggle.com/datasets/debasisdotcom/name-entity-recognition-ner-dataset/data
kaggle_df = pd.read_csv('NER_Kaggle.csv', encoding='unicode_escape')

In [15]:
display(kaggle_df)
# transforming kaggle dataset to be compatible with ours
def create_sentence_list(df):
    sentence_list = []
    current_sentence = []
    current_tags = []

    for _, row in df.iterrows():
        if pd.isna(row['Sentence #']):
            current_sentence.append(row['Word'])
            current_tags.append(row['Tag'])
        else:
            if current_sentence:
                sentence_list.append((" ".join(current_sentence), current_tags))
            current_sentence = [row['Word']]
            current_tags = [row['Tag']]

    if current_sentence:
        sentence_list.append((" ".join(current_sentence), current_tags))

    return sentence_list


kaggle_lst_transformed = create_sentence_list(kaggle_df)
print(kaggle_lst_transformed[:100])

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O
...,...,...,...,...
1048570,,they,PRP,O
1048571,,responded,VBD,O
1048572,,to,TO,O
1048573,,the,DT,O


[('Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .', ['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']), ('Families of soldiers killed in the conflict joined the protesters who carried banners with such slogans as " Bush Number One Terrorist " and " Stop the Bombings . "', ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-per', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']), ('They marched from the Houses of Parliament to a rally in Hyde Park .', ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'I-geo', 'O']), ('Police put the number of marchers at 10,000 while organizers claimed it was 1,00,000 .', ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']), ("The protest comes on the eve of the annual conference of Britai

In [17]:
# dropping every sentence that contains a B-geo tag, since geographical entities may include mountains
kaggle_lst_filtered = [
    (sentence, tag_seq)
    for sentence, tag_seq in kaggle_lst_transformed
    if 'B-geo' not in tag_seq
]
print(len(kaggle_lst_filtered))

# re-tokenizing with our own pattern and resetting all tags to 'O'
kaggle_lst_tagged_sentences = []
for sentence, _ in kaggle_lst_filtered:
    words = re.findall(r'\b\w+\b|\S', sentence)
    tags = ['O'] * len(words)
    kaggle_lst_tagged_sentences.append((sentence, tags))
print(kaggle_lst_tagged_sentences[:100])

# the number of sentences we add is equal to the length of our dataset with mountains
final_dataset_lst = boi_dataset.copy()
final_dataset_lst.extend(kaggle_lst_tagged_sentences[:len(boi_dataset)])
23548
[('Families of soldiers killed in the conflict joined the protesters who carried banners with such slogans as " Bush Number One Terrorist " and " Stop the Bombings . "', ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']), ('Police put the number of marchers at 10,000 while organizers claimed it was 1,00,000 .', ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']), ('Iranian officials say they expect to get access to sealed sensitive parts of the plant Wednesday , after an IAEA surveillance system begins functioning .', ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']), ('The step will allow the facility to operate at full capacity .', ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']), ('The European Union , with U.S. backing , has threatene

In [18]:
print(len(final_dataset_lst), len(boi_dataset))

3842 1921


In [19]:
# converting final dataset to dataframe and saving it
final_data_df = lst2df(final_dataset_lst, 'sentence', 'word_labels')
final_data_df.to_csv('dataset_mountains_NER.csv', index=False)
final_data_df.head(20)

Unnamed: 0,sentence,word_labels
0,"Mount Everest, the world's highest peak, is a ...","B,I,O,O,O,O,O,O,O,O,O,O,O,O,O,B,O"
1,Kilimanjaro in Tanzania is known for its snow-...,"B,O,O,O,O,O,O,O,O,O,O,O"
2,The Rocky Mountains are famous for their stunn...,"O,B,I,O,O,O,O,O,O,O"
3,Mount Fuji is a sacred symbol in Japanese cult...,"B,I,O,O,O,O,O,O,O,O"
4,The Andes stretch along the western edge of So...,"O,B,O,O,O,O,O,O,O,O,O"
5,"Denali, formerly known as Mount McKinley, is t...","B,O,O,O,O,B,I,O,O,O,O,O,O,O,O"
6,Mont Blanc straddles the border between France...,"B,I,O,O,O,O,O,O,O,O"
7,The Swiss Alps are known for their beautiful s...,"O,B,I,O,O,O,O,O,O,O,O"
8,Mount Vesuvius is famous for its eruption in 7...,"B,I,O,O,O,O,O,O,O,O,O"
9,"K2, also known as Mount Godwin-Austen, is the ...","B,O,O,O,O,B,I,I,I,O,O,O,O,O,O,O,O,O,O"
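
Because both the sentence and its labels are stored as strings, it is worth checking after loading that every sentence still yields as many tokens as it has labels. The sketch below uses only the standard library, with an in-memory CSV standing in for `dataset_mountains_NER.csv`. `load_and_check` is a hypothetical helper, and the sample rows are made up.

```python
import csv
import io
import re

def load_and_check(csv_file):
    """Read (sentence, word_labels) rows and return sentences whose token and label counts differ."""
    mismatches = []
    for row in csv.DictReader(csv_file):
        tokens = re.findall(r"\b\w+\b|\S", row["sentence"])
        labels = row["word_labels"].split(",")
        if len(tokens) != len(labels):
            mismatches.append(row["sentence"])
    return mismatches

sample = io.StringIO(
    'sentence,word_labels\n'
    'Mount Fuji is sacred.,"B,I,O,O,O"\n'
    'The Andes are long.,"O,B,O,O"\n'  # one label short on purpose
)
print(load_and_check(sample))
# prints ['The Andes are long.']
```

An empty result on the real file would indicate that the tokenization used at training time matches the one used to create the labels.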
