# This notebook is to prepare a dataset for training the ANNABELL model.
The dataset is derived from the SQuAD database.  Each Question and Answer pair was used to prompt a LLM to provide a declarative statement.


In [2]:
import pandas as pd
from generate_declarative_sentences import load_squad_dataset
from dataset_processing import remove_quotes_from_file, filter_by_max_words, clean_text, write_training_file, write_testing_file, dataset_summary
import os

In [3]:
train_dir = "/Users/chris/Library/CloudStorage/GoogleDrive-cjameswalmsley@gmail.com/My Drive/Shared with Julia/Education/Kent University/PhD/work/annabell/training"
train_filename = "declarative_sentences_train_gemma3:4b_20250617_201822.tsv"
train_filepath = os.path.join(train_dir, train_filename)

## To be optimal for training the ANNABELL model,  the examples need to follow the specific format below:
* Uppercase letters are used only for first letter of proper nouns – e.g. Chris, London, Big Ben
* Questions start with a question mark – e.g. "how old are you"
* Words with a suffix are split in the form base –suffix.  e.g. animals -> animal –s, writing \t *> write \t *ing
    * Apart from the above exceptions the following rules apply:
    * every character must be lowercase
    * No punctuation
    * No Special Characters
    * No Whitespace between lines
    * Lines can be prefixed with # to insert comments
    * If .ph is used, the entire phrase in the exact format must be input

In [4]:
# Load the datafile into a pandas DataFrame
train_filepath = remove_quotes_from_file(train_filepath)
#validation_filepath = remove_quotes_from_file("/Volumes/X9 Pro/datasets/declarative_sentences_validation_gemma3:4b_20250618_200853.tsv")
train_df = pd.read_csv(train_filepath, sep="\t")
#validation_df = pd.read_csv(validation_filepath, sep="\t")
#remove rows with null values in any columns
train_df = train_df.dropna()
#validation_df = validation_df.dropna()
print(train_df.info())
#print(validation_df.info())

Cleaned data saved to /Users/chris/Library/CloudStorage/GoogleDrive-cjameswalmsley@gmail.com/My Drive/Shared with Julia/Education/Kent University/PhD/work/annabell/training/declarative_sentences_train_gemma3:4b_20250617_201822_cleaned.tsv
<class 'pandas.core.frame.DataFrame'>
Index: 85419 entries, 0 to 85483
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 85419 non-null  object
 1   title              85419 non-null  object
 2   question           85419 non-null  object
 3   answer             85419 non-null  object
 4   response_question  85419 non-null  object
 5   response_answer    85419 non-null  object
 6   statement          85419 non-null  object
dtypes: object(7)
memory usage: 5.2+ MB
None


In [5]:
filtered_train_df = filter_by_max_words(train_df, max_words=20)
filtered_train_df["response_question"] = clean_text(filtered_train_df["response_question"], True)
filtered_train_df["response_answer"] = clean_text(filtered_train_df["response_answer"], False)
filtered_train_df["statement"] = clean_text(filtered_train_df["statement"], False)
filtered_train_df.reset_index(drop=True, inplace=True)
filtered_train_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_train_df["response_question"] = clean_text(filtered_train_df["response_question"], True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_train_df["response_answer"] = clean_text(filtered_train_df["response_answer"], False)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_tra

Unnamed: 0,id,title,question,answer,response_question,response_answer,statement
0,5733be284776f41900661182,University_of_Notre_Dame,To whom did the Virgin Mary allegedly appear i...,Saint Bernadette Soubirous,? to whom did the virgin mary allegedly appear...,saint Bernadette Soubirous,the virgin mary allegedly appeared in 1858 in ...
1,5733be284776f4190066117f,University_of_Notre_Dame,What is in front of the Notre Dame Main Building?,a copper statue of Christ,? what is in front of the Notre Dame Main Buil...,a copper statue of Christ,the Notre Dame Main Building is in front of a ...
2,5733be284776f41900661180,University_of_Notre_Dame,The Basilica of the Sacred heart at Notre Dame...,the Main Building,? the Basilica of the Sacred heart at Notre Da...,the Main Building,the Basilica of the Sacred heart at Notre Dame...
3,5733be284776f41900661181,University_of_Notre_Dame,What is the Grotto at Notre Dame?,a Marian place of prayer and reflection,? what is the grotto at notre dame,a Marian place of prayer and reflection,the grotto at notre dame is a Marian place of ...
4,5733be284776f4190066117e,University_of_Notre_Dame,What sits on top of the Main Building at Notre...,a golden statue of the Virgin Mary,? what sits on top of the Main Building at Not...,a golden statue of the Virgin Mary,the Main Building has a golden statue of the V...
...,...,...,...,...,...,...,...
84053,5732a8a6328d981900601fed,Geological_history_of_Earth,Which current resulted in the cooling of Antar...,the Antarctic Circumpolar Current,? which current resulted in the cooling of Ant...,the Antarctic Circumpolar Current,the Antarctic Circumpolar Current resulted in ...
84054,5732ac1fcc179a14009dabe7,Geological_history_of_Earth,Which continent was India colliding with in th...,Asia,? Which continent was India colliding with in ...,asia,india was colliding with asia in the Miocene
84055,5732ac1fcc179a14009dabe8,Geological_history_of_Earth,When Africa was colliding with Eurasia which s...,The Tethys Seaway,? when Africa was colliding with Eurasia which...,the Tethys Seaway,Africa was colliding with Eurasia the Tethys S...
84056,5732ac1fcc179a14009dabe9,Geological_history_of_Earth,Between what period of time did the Tethys dis...,19 and 12 Ma,? between what period of time did the tethys d...,19 and 12 ma,the tethys disappeared between the period of 1...


In [6]:
filtered_train_df = filtered_train_df[filtered_train_df["title"] == "New_York_City"]
filtered_train_df.reset_index(drop=True, inplace=True)
filtered_train_df

Unnamed: 0,id,title,question,answer,response_question,response_answer,statement
0,56ce304daab44d1400b8850e,New_York_City,What city in the United States has the highest...,New York,? What city in the United States has the highe...,new york,new york city has the highest population in th...
1,56ce304daab44d1400b8850f,New_York_City,In what city is the United Nations based?,New York,? in what city is the united nations based,new york,the united nations is based in New York
2,56ce304daab44d1400b88510,New_York_City,What city has been called the cultural capital...,New York,? what city has been called the cultural capit...,new york,new york has been called the cultural capital ...
3,56ce304daab44d1400b88511,New_York_City,What American city welcomes the largest number...,New York,? What American city welcomes the largest numb...,new york,new york welcomes the largest number of legal ...
4,56cf5d41aab44d1400b89130,New_York_City,The major gateway for immigration has been whi...,New York City,? the major gateway for immigration has been w...,new york city,the major gateway for immigration has been in ...
...,...,...,...,...,...,...,...
801,56d1218c17492d1400aaba1e,New_York_City,What ZIP code was responsible for the greatest...,10021,? what ZIP code was responsible for the greate...,10021,the ZIP code 10021 was responsible for the gre...
802,56d1218c17492d1400aaba1f,New_York_City,How much money in cents does New York City rec...,83,? how much money in cents does New York City r...,83,New York City receives 83 cents for every doll...
803,56d1218c17492d1400aaba20,New_York_City,How much more money does the city give to the ...,$11 billion,? how much more money does the city give to th...,11 billion,the city gives 11 billion to the state of new ...
804,56d1218c17492d1400aaba21,New_York_City,"Each year, how much more money does New York C...",$11.4 billion,? each year how much more money does New York ...,11 point 4 billion,new york city gives 11 point 4 billion more to...


In [5]:
#write a file that can be used to train ANNABELL
#write_training_file(filtered_train_df, "training/filtered_train_data")

file created: training/filtered_train_data.txt
Compressed file created: training/filtered_train_data.tar.xz


In [6]:
#write a file that can be used to test annabell
test_filepath = "testing/test_nyc_questions.txt"
write_testing_file(filtered_train_df["response_question"].tolist(), test_filepath)

with open(test_filepath, "r") as test_file:
    test_lines = test_file.readlines()
test_lines

['? What city in the United States has the highest population\n',
 '.x\n',
 '? in what city is the united nations based\n',
 '.x\n',
 '? what city has been called the cultural capital of the world\n',
 '.x\n',
 '? What American city welcomes the largest number of legal immigrants\n',
 '.x\n',
 '? the major gateway for immigration has been which US city\n',
 '.x\n',
 '? the most populated city in the united states is which city\n',
 '.x\n',
 '? How many boroughs comprise New York City\n',
 '.x\n',
 '? in what year were the five borough -s combined into one city\n',
 '.x\n',
 '? in what year were the five borough -s combined into one city\n',
 '.x\n',
 '? what is the size of New York City in square miles\n',
 '.x\n',
 '? what is the population of New Yorks Combined Statistical Area\n',
 '.x\n',
 '? How man boroughs does New York City contain\n',
 '.x\n',
 '? the five boroughs of New York City are named what\n',
 '.x\n',
 '? all five borough -s of New York City formed into one city on wha

In [12]:
#write a file that can be used to test annabell that does not include the questions that were used in pre-training
pretraining_questions_filepath	=  "/Users/chris/PycharmProjects/dataset/pre_training/pre_training_nyc_samples.txt"
with open(pretraining_questions_filepath, "r") as pre_training_file:
	pre_training_questions = [line.strip() for line in pre_training_file.readlines() if line.startswith("?")]
print(pre_training_questions[:5])
test_no_pretrain_filepath = "testing/test_nyc_questions_without_pretrain.txt"
#the comparision should ignore case in the questions and the dataframe

#add acolumn to the dataframe that is the lowercase version of the response_question column
filtered_train_df["response_question_lower"] = filtered_train_df["response_question"].str.lower()
#filter the dataframe to remove any rows where the response_question_lower is in the pre_training_questions
pre_training_questions_lower = [q.lower() for q in pre_training_questions]
filtered_train_no_pretrain_df = filtered_train_df[~filtered_train_df["response_question_lower"].isin(pre_training_questions_lower)]
filtered_train_no_pretrain_df.reset_index(drop=True, inplace=True)
filtered_train_no_pretrain_df

['? How many school and universities are in NYC', '? How many leader terrorists of Al Quada were involved with the 911 attacks directly that day', '? how many square miles are land in NYC', '? The mean snowfall between 1981 and 2010 in NYC has been how many inches', '? How many Hispanic people live in the New York metropolitan area']


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_train_df["response_question_lower"] = filtered_train_df["response_question"].str.lower()


Unnamed: 0,id,title,question,answer,response_question,response_answer,statement,response_question_lower
0,56ce304daab44d1400b8850e,New_York_City,What city in the United States has the highest...,New York,? What city in the United States has the highe...,new york,new york city has the highest population in th...,? what city in the united states has the highe...
1,56ce304daab44d1400b88510,New_York_City,What city has been called the cultural capital...,New York,? what city has been called the cultural capit...,new york,new york has been called the cultural capital ...,? what city has been called the cultural capit...
2,56ce304daab44d1400b88511,New_York_City,What American city welcomes the largest number...,New York,? What American city welcomes the largest numb...,new york,new york welcomes the largest number of legal ...,? what american city welcomes the largest numb...
3,56cf5d41aab44d1400b89130,New_York_City,The major gateway for immigration has been whi...,New York City,? the major gateway for immigration has been w...,new york city,the major gateway for immigration has been in ...,? the major gateway for immigration has been w...
4,56cf5d41aab44d1400b89131,New_York_City,The most populated city in the United States i...,New York City,? the most populated city in the united states...,new york city,the most populated city in the united states i...,? the most populated city in the united states...
...,...,...,...,...,...,...,...,...
770,56d1218c17492d1400aaba1e,New_York_City,What ZIP code was responsible for the greatest...,10021,? what ZIP code was responsible for the greate...,10021,the ZIP code 10021 was responsible for the gre...,? what zip code was responsible for the greate...
771,56d1218c17492d1400aaba1f,New_York_City,How much money in cents does New York City rec...,83,? how much money in cents does New York City r...,83,New York City receives 83 cents for every doll...,? how much money in cents does new york city r...
772,56d1218c17492d1400aaba20,New_York_City,How much more money does the city give to the ...,$11 billion,? how much more money does the city give to th...,11 billion,the city gives 11 billion to the state of new ...,? how much more money does the city give to th...
773,56d1218c17492d1400aaba21,New_York_City,"Each year, how much more money does New York C...",$11.4 billion,? each year how much more money does New York ...,11 point 4 billion,new york city gives 11 point 4 billion more to...,? each year how much more money does new york ...


In [13]:
write_testing_file(filtered_train_no_pretrain_df["response_question"].tolist(), test_no_pretrain_filepath)

with open(test_no_pretrain_filepath, "r") as test_file:
    test_lines = test_file.readlines()
test_lines

['? What city in the United States has the highest population\n',
 '.x\n',
 '? what city has been called the cultural capital of the world\n',
 '.x\n',
 '? What American city welcomes the largest number of legal immigrants\n',
 '.x\n',
 '? the major gateway for immigration has been which US city\n',
 '.x\n',
 '? the most populated city in the united states is which city\n',
 '.x\n',
 '? How many boroughs comprise New York City\n',
 '.x\n',
 '? in what year were the five borough -s combined into one city\n',
 '.x\n',
 '? in what year were the five borough -s combined into one city\n',
 '.x\n',
 '? what is the size of New York City in square miles\n',
 '.x\n',
 '? what is the population of New Yorks Combined Statistical Area\n',
 '.x\n',
 '? How man boroughs does New York City contain\n',
 '.x\n',
 '? the five boroughs of New York City are named what\n',
 '.x\n',
 '? all five borough -s of New York City formed into one city on what date\n',
 '.x\n',
 '? what is the population of New York

In [14]:
len(test_lines)

1550

In [7]:
#produce a summary of a dataset by splits
# Load the SQuAD dataset
ds = load_squad_dataset()
dataset_summary(ds)

summary of train split
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87599 entries, 0 to 87598
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        87599 non-null  object
 1   title     87599 non-null  object
 2   context   87599 non-null  object
 3   question  87599 non-null  object
 4   answers   87599 non-null  object
dtypes: object(5)
memory usage: 3.3+ MB
None
number of titles: 442
{'Antenna_(radio)', 'Hydrogen', 'Imamah_(Shia_doctrine)', 'Identity_(social_science)', 'Green', 'Elevator', 'Crucifixion_of_Jesus', 'Group_(mathematics)', '2008_Sichuan_earthquake', 'Solar_energy', 'Military_history_of_the_United_States', 'Federal_Aviation_Administration', 'Capacitor', 'Tuberculosis', 'Catalan_language', 'Intellectual_property', 'Myocardial_infarction', 'Transistor', 'Cardinal_(Catholicism)', 'Nutrition', 'Exhibition_game', 'Nonprofit_organization', 'Boston', 'Railway_electrification_system', 'IBM', 'Middle_Ages',

In [None]:
df = ds["train"].to_pandas()
df.columns