Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

# Lab 1.6: Splitting the Dataset

When we run a neural language model, we usually rely on a [Huggingface checkpoint](https://huggingface.co/models). To facilitate data processing, Huggingface has published a [library](https://huggingface.co/docs/datasets/) for working with datasets. In this lab, we transform our dataset into the Huggingface format and split it into a train and a test portion.

## 1. Huggingface dataset

In [21]:
import pandas as pd

# We first check whether the dataset is well formatted. If you get warnings, check the erroneous lines.
# For now, wrongly formatted lines are skipped with a warning.
pandas_dataset = pd.read_csv("../results/mediastack_results/veganism_overview.tsv", sep="\t", on_bad_lines="warn")


Skipping line 12: expected 6 fields, saw 69
Skipping line 32: expected 6 fields, saw 10
Skipping line 64: expected 6 fields, saw 69
Skipping line 89: expected 6 fields, saw 192
Skipping line 90: expected 6 fields, saw 60



**Investigate what went wrong in the erroneous lines and maybe adjust the crawling process.** Your peers need to be able to read in your test dataset without errors.
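One way to start this investigation is to count the fields on every line yourself. The following is a minimal sketch using only the standard library; the helper name `find_malformed_rows` and the sample rows are hypothetical and not part of the lab code. It treats quote characters as ordinary text, which is an assumption, so its line numbers may not match the pandas warnings exactly if your file contains quoted fields that span several lines.

```python
import csv
import io


def find_malformed_rows(handle, delimiter="\t"):
    """Return (line_number, field_count) for rows whose field count differs from the header's."""
    # QUOTE_NONE: quotes are treated as plain characters (an assumption of this sketch).
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    header = next(reader)
    expected = len(header)
    problems = []
    for row in reader:
        if len(row) != expected:
            # reader.line_num is the 1-based number of the line just read.
            problems.append((reader.line_num, len(row)))
    return problems


# Hypothetical sample data: the last row contains an unescaped tab inside the text.
sample = io.StringIO(
    "URL\tTitle\tText\n"
    "https://example.org/a\tFirst\tA clean row\n"
    "https://example.org/b\tSecond\tA row with\tan unescaped tab\n"
)
print(find_malformed_rows(sample))
```

To run it on your own file, pass `open(path, newline="", encoding="utf-8")` instead of the `StringIO` object. Unescaped tabs or newlines inside the article text are a likely cause of such rows, but check the flagged lines before drawing conclusions.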

In [22]:
from datasets import Dataset

# You could also load the Huggingface dataset directly from the tsv file but only if it does not contain erroneous lines
# Check the documentation on Huggingface for more options on loading data. 
dataset = Dataset.from_pandas(pandas_dataset)
reduced_dataset = dataset.select_columns(['URL', 'Text'])
print(reduced_dataset[0:3])



{'URL': ['https://tribuneonlineng.com/10-alternative-sources-of-protein/', 'https://www.reviewjournal.com/opinion/letters/letter-las-vegans-face-the-mosquito-menace-2917655/', 'https://www.twincities.com/2023/10/07/five-great-veggie-burgers-that-wont-have-you-missing-the-meat/'], 'Text': [None, 'Where are the abatement efforts? This summer, we in Clark County have been besieged by the aides aegypti mosquito. They mostly lay their eggs in standing water. The washes throughout the county are full of just that. Isn’t it time for the powers that be to institute a mosquito abatement plan? Public health is at risk.', 'Over the years, I’ve gotten more than a few emails from readers asking me where they can get a good veggie burger. And though I really appreciate a great one, I find that many restaurants either serve a pre-made patty or a flavorless facsimile thereof. But lately, I’ve been researching regular burgers for our annual burger guide, and where there’s a house-made veggie burger on 

In [29]:
# Let's remove empty articles
filtered_dataset = reduced_dataset.filter(lambda instance: instance["Text"] is not None)
print(filtered_dataset[0:3])

{'URL': ['https://www.reviewjournal.com/opinion/letters/letter-las-vegans-face-the-mosquito-menace-2917655/', 'https://www.twincities.com/2023/10/07/five-great-veggie-burgers-that-wont-have-you-missing-the-meat/', 'https://www.essentiallysports.com/soccer-football-news-wta-tennis-news-despite-opposing-vegan-novak-djokovic-principles-five-hundred-million-dollar-rich-zlatan-ibrahimovic-lauds-twenty-four-times-grand-slam-champion/'], 'Text': ['Where are the abatement efforts? This summer, we in Clark County have been besieged by the aides aegypti mosquito. They mostly lay their eggs in standing water. The washes throughout the county are full of just that. Isn’t it time for the powers that be to institute a mosquito abatement plan? Public health is at risk.', 'Over the years, I’ve gotten more than a few emails from readers asking me where they can get a good veggie burger. And though I really appreciate a great one, I find that many restaurants either serve a pre-made patty or a flavorles

In [33]:
# Let's keep track of the articles we removed. We might want to investigate what went wrong during crawling.
removed = reduced_dataset.filter(lambda instance: instance["Text"] is None)

print("URLs with empty articles:")
print(removed["URL"])

print("\nThis should add up:")
print(len(filtered_dataset), len(removed), len(reduced_dataset))

URLs with empty articles:
['https://tribuneonlineng.com/10-alternative-sources-of-protein/', 'https://torontosun.com/news/world/men-wont-go-vegan-because-its-not-seen-as-masculine-study', 'https://www.healthcanal.com/nutrition/diet/is-vegetable-oil-vegan', 'https://www.healthcanal.com/nutrition/diet/is-white-chocolate-vegan', 'https://www.healthcanal.com/nutrition/diet/is-tahini-vegan', 'https://www.marketscreener.com/quote/stock/CARNIVAL-CORPORATION-PLC-12213/news/Princess-Cruises-Introduces-Expansive-Vegan-Menus-For-Plant-Based-Cruisers-44964714/?utm_medium=RSS&utm_content=20231002', 'https://montreal.ctvnews.ca/a-taste-for-plant-based-foods-is-growing-and-so-is-montreal-s-vegan-festival-1.6584727', 'https://www.jutarnji.hr/dobrahrana/promo/nove-di-go-gotove-mjesavine-idealno-su-rjesenje-za-domaci-kruh-iz-snova-15379427', 'https://www.healthcanal.com/nutrition/diet/is-greek-yogurt-vegan', 'https://www.hellomagazine.com/cuisine/503411/pow-food-dinner-party-review/', 'https://thebusine
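Several of the failed URLs share a domain, which suggests that the crawler struggles with particular sites. A minimal sketch for counting failures per domain is shown below; `count_domains` is a hypothetical helper, and the list contains only a few of the URLs printed above. In the notebook you would pass `removed["URL"]` instead.

```python
from collections import Counter
from urllib.parse import urlparse


def count_domains(urls):
    """Count how many URLs belong to each host name."""
    return Counter(urlparse(url).netloc for url in urls)


# A few of the URLs with empty articles, copied from the output above.
removed_urls = [
    "https://tribuneonlineng.com/10-alternative-sources-of-protein/",
    "https://www.healthcanal.com/nutrition/diet/is-vegetable-oil-vegan",
    "https://www.healthcanal.com/nutrition/diet/is-white-chocolate-vegan",
    "https://www.healthcanal.com/nutrition/diet/is-tahini-vegan",
    "https://torontosun.com/news/world/men-wont-go-vegan-because-its-not-seen-as-masculine-study",
]

# Print the domains with the most failures first.
for domain, count in count_domains(removed_urls).most_common():
    print(domain, count)
```

A domain that fails repeatedly is a good first candidate when you adjust the crawling process.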

## 2. Split into train and test

For NLP experiments, we want to split our dataset into training and test data. When working with machine learning models, the training data is further split into a training and a development portion. The development data is used for exploring hyperparameters and for finetuning the model. During the development phase, the test data is not touched at all. For many shared tasks, the test data is not even publicly available, so that participants cannot overfit to it.
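The three-way partition described above can be sketched in plain Python. This is an illustration under stated assumptions, not the Huggingface implementation: `three_way_split`, its default proportions, and the seed value are hypothetical, and the rounding may differ from what a library does.

```python
import random


def three_way_split(items, dev_size=0.1, test_size=0.2, seed=5):
    """Shuffle items and partition them into train, dev and test lists."""
    shuffled = list(items)
    # A dedicated generator keeps the shuffle independent of global random state.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    n_dev = round(len(shuffled) * dev_size)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test


# 81 stands in for the number of non-empty articles; we split indices, not texts.
train_ids, dev_ids, test_ids = three_way_split(range(81))
print(len(train_ids), len(dev_ids), len(test_ids))
```

Every item ends up in exactly one portion, so the three sizes add up to the size of the input.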

In [35]:
seed = 5
# Pass the seed to the split; otherwise the variable above has no effect.
splitted = filtered_dataset.train_test_split(test_size=0.2, seed=seed)
train = splitted["train"]
test = splitted["test"]
print(len(train), len(test), len(filtered_dataset))

64 17 81


Take a look at the source code of the method. **Why do we need to set a seed?**