In [8]:
import pandas as pd
import ast
from sklearn.model_selection import train_test_split

**Loading CSV file**

In [13]:
df = pd.read_csv("data/annotated_sentences500.csv")

**Convert Strings to Python Lists**

 - What happens:

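read_csv loads the tokens and labels columns as plain strings (e.g., '["token1", "token2"]'), not as lists.
The ast.literal_eval function parses each of these strings into an actual Python list. It accepts only literals (strings, numbers, lists, dicts, and so on), so it is safer than eval for data read from a file.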
 - Why this is needed:
 
A string such as '["token1", "token2"]' is a single run of characters, so individual tokens cannot be indexed or paired with their labels. Converting each cell to a list makes token-level processing possible during tokenization and label alignment.
 - Result:

Each row of the tokens and labels columns now holds a Python list, with one label per token.
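The conversion can be tried on a single pair of strings before applying it to the whole DataFrame. This is a minimal sketch: the sentence and the variable names raw_tokens, raw_labels, and pairs are made up for illustration and do not come from the dataset.

```python
import ast

# Hypothetical cell contents, as they would look right after read_csv.
raw_tokens = '["Mount", "Everest", "is", "high"]'
raw_labels = '["B-MOUNTAIN", "I-MOUNTAIN", "O", "O"]'

# literal_eval turns each string into a real Python list.
tokens = ast.literal_eval(raw_tokens)
labels = ast.literal_eval(raw_labels)

# Every token needs exactly one label, so the lengths must match.
assert len(tokens) == len(labels)

# Pair each token with its label for inspection.
pairs = list(zip(tokens, labels))
print(pairs)
```

A length check like this one is a cheap way to catch rows whose annotations drifted out of step with their tokens.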

In [14]:
df['tokens'] = df['tokens'].apply(ast.literal_eval)
df['labels'] = df['labels'].apply(ast.literal_eval)

print(df.head())

                                              tokens  \
0  [There, are, no, Manaslu, between, the, Atlant...   
1  [We, were, just, about, to, go, up, in, the, M...   
2  [The, quaint, village, is, surrounded, by, Ann...   
3  [Ridgway, a, few, more, miles, away, from, the...   
4  [He, was, angry, with, her, for, going, up, in...   

                                              labels  
0  [O, O, O, B-MOUNTAIN, O, O, O, O, O, O, O, O, ...  
1  [O, O, O, O, O, O, O, O, O, B-MOUNTAIN, I-MOUN...  
2         [O, O, O, O, O, O, B-MOUNTAIN, O, O, O, O]  
3  [O, O, O, O, O, O, O, O, O, O, B-MOUNTAIN, I-M...  
4   [O, O, O, O, O, O, O, O, O, O, B-MOUNTAIN, O, O]  


**Split the Dataset into Train, Validation, and Test Sets / Print / Save**

 - What happens:

The dataset is split into three subsets:
Training set (train_df): Used to train the model.
Validation set (val_df): Used to tune hyperparameters and evaluate the model during training.
Test set (test_df): Used for final evaluation after training is complete.
 - The splitting process:

The first split separates 70% of the data into the training set and 30% into a temporary set (temp_df).
The temporary set is then split equally (50/50) into validation and test sets, so each holds 15% of the full dataset.
 - Parameters:

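test_size=0.30 (first call): Reserves 30% of the data for the temporary set that becomes the validation and test sets.
test_size=0.50 (second call): Sends half of the temporary set to the test set and leaves the other half as the validation set.
random_state=42: Fixes the shuffle seed so that both splits are reproducible.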
 - Result:
 
With the 500 rows loaded here, the split yields 350 training, 75 validation, and 75 test examples.
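The two-step split described above works out to a 70/15/15 partition. The arithmetic can be sketched without scikit-learn, assuming (as I understand its behavior) that train_test_split rounds a fractional test_size up to a whole number of rows and gives the remainder to the other subset. The helper name split_sizes is hypothetical.

```python
import math

def split_sizes(n_samples, first_test_size=0.30, second_test_size=0.50):
    """Estimate (train, validation, test) sizes for the two-step split.

    Assumption: the held-out share is rounded up to a whole row and the
    remainder goes to the other subset.
    """
    # First split: the held-out share becomes the temporary set.
    n_temp = math.ceil(first_test_size * n_samples)
    n_train = n_samples - n_temp

    # Second split: the held-out share of the temporary set is the test set.
    n_test = math.ceil(second_test_size * n_temp)
    n_val = n_temp - n_test
    return n_train, n_val, n_test

train_n, val_n, test_n = split_sizes(500)
print(train_n, val_n, test_n)  # 350 75 75
```

When the temporary set has an odd number of rows, this rounding rule would make the test set one row larger than the validation set.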

In [15]:
train_df, temp_df = train_test_split(df, test_size=0.30, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.50, random_state=42)

print("Train size:", len(train_df))
print("Validation size:", len(val_df))
print("Test size:", len(test_df))


train_df.to_csv("data/train.csv", index=False)
val_df.to_csv("data/val.csv", index=False)
test_df.to_csv("data/test.csv", index=False)

Train size: 350
Validation size: 75
Test size: 75
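One thing to keep in mind when reloading the saved files: to_csv writes each list as its string form, so the tokens and labels columns come back as strings and need ast.literal_eval again. A minimal sketch of the round trip, using an in-memory buffer and a made-up one-row DataFrame instead of the real files:

```python
import ast
import io

import pandas as pd

# Hypothetical one-row frame standing in for train_df.
df = pd.DataFrame({
    "tokens": [["Mount", "Everest", "is", "high"]],
    "labels": [["B-MOUNTAIN", "I-MOUNTAIN", "O", "O"]],
})

# Write to an in-memory buffer instead of data/train.csv.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# After reloading, each cell is a string again, not a list.
reloaded = pd.read_csv(buffer)
raw_value = reloaded["tokens"].iloc[0]
print(type(raw_value))

# Re-apply literal_eval to restore the lists.
for col in ("tokens", "labels"):
    reloaded[col] = reloaded[col].apply(ast.literal_eval)
print(type(reloaded["tokens"].iloc[0]))
```

The same two-line conversion used after the initial load therefore applies to data/train.csv, data/val.csv, and data/test.csv as well.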
