Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

# Lab 1.6: Splitting the Dataset

When we run a neural language model, we usually rely on a [Huggingface checkpoint](https://huggingface.co/models). To facilitate data processing, Huggingface has published a [library](https://huggingface.co/docs/datasets/) for working with datasets. In this lab, we transform our dataset into the Huggingface format and split it into a train and a test portion.

## 1. Huggingface dataset

In [1]:
import pandas as pd

# We first check whether the dataset is well formatted. If you get warnings, check the erroneous lines.
# For now, on_bad_lines="warn" skips the wrongly formatted lines and reports them.
pandas_dataset = pd.read_csv("../results/mediastack_results/veganism_overview.tsv", sep="\t", on_bad_lines="warn")


Skipping line 7: expected 6 fields, saw 69
Skipping line 59: expected 6 fields, saw 192
Skipping line 74: expected 6 fields, saw 19
Skipping line 85: expected 6 fields, saw 69



**Investigate what went wrong in the erroneous lines and maybe adjust the crawling process.** Your peers need to be able to read in your test dataset without errors.
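One way to start this investigation is to count the fields in every row yourself. The following is a minimal sketch using only the standard library; the helper name `find_malformed_rows` and the sample rows are made up for illustration. The reported line numbers may not match the pandas warnings exactly, for example when a quoted field spans several lines.

```python
import csv
import io


def find_malformed_rows(handle, expected_fields, delimiter="\t"):
    """Return (line_number, field_count) for every row with an unexpected field count."""
    malformed = []
    reader = csv.reader(handle, delimiter=delimiter)
    for row in reader:
        # Skip blank lines; flag rows whose field count differs from the header's
        if row and len(row) != expected_fields:
            # reader.line_num is the physical line on which the row ended
            malformed.append((reader.line_num, len(row)))
    return malformed


# Tiny in-memory stand-in for the real file (hypothetical content)
sample = io.StringIO(
    "URL\tTitle\tText\n"
    "https://example.org/a\tFirst\tSome text\n"
    "https://example.org/b\tSecond\tText with\ta stray tab\n"
)
print(find_malformed_rows(sample, expected_fields=3))
```

For the real dataset, you would pass a file handle opened with `open(path, newline="", encoding="utf-8")` and `expected_fields=6`, the number of fields the warnings above expect. A stray tab inside the article text is one plausible cause, but check the flagged lines to be sure.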

In [2]:
from datasets import Dataset

# You could also load the Huggingface dataset directly from the tsv file but only if it does not contain erroneous lines
# Check the documentation on Huggingface for more options on loading data. 
dataset = Dataset.from_pandas(pandas_dataset)
reduced_dataset = dataset.select_columns(['URL', 'Text'])
print(reduced_dataset[0:3])



{'URL': ['https://www.healthcanal.com/nutrition/healthy-eating/kefir-vs-kombucha', 'https://www.ksro.com/2023/11/02/miyokos-production-moving-out-of-petaluma/', 'https://www.yenisafak.com/ekonomi/veganlara-ozel-humm-organic-lezzetleri-4571893'], 'Text': [None, 'Miyoko’s Creamery is closing its production facility in Petaluma on January 1st. The move will impact 30 to 40 employees. A spokesperson for the company, which is known for its vegan dairy alternatives, says a new production facility is needed to increase production and efficiency after the company saw double-digit growth in the past few years. Miyoko’s Creamery will still have its headquarters in Petaluma.', 'Sürdürülebilir organik tarımı ve üretimi destekleyen “Temiz Reçete”li Humm Organic; vegan beslenmeyi tercih edenler için kurabiyeden grissiniye, kekten gevreğe 13 farklı lezzet sunuyor. Hiçbir katkı maddesi, koruyucu, renklendirici ve ilave şeker bulundurmayan “Temiz Reçete”li Humm Organic atıştırmalıkları, lezzetini ve yü

In [3]:
# Let's remove empty articles
filtered_dataset = reduced_dataset.filter(lambda instance: instance["Text"] is not None)
print(filtered_dataset[0:3])

Filter:   0%|          | 0/96 [00:00<?, ? examples/s]

{'URL': ['https://www.ksro.com/2023/11/02/miyokos-production-moving-out-of-petaluma/', 'https://www.yenisafak.com/ekonomi/veganlara-ozel-humm-organic-lezzetleri-4571893', 'https://www.hellomagazine.com/hfm/wish-list/506404/guide-to-vegan-beauty/'], 'Text': ['Miyoko’s Creamery is closing its production facility in Petaluma on January 1st. The move will impact 30 to 40 employees. A spokesperson for the company, which is known for its vegan dairy alternatives, says a new production facility is needed to increase production and efficiency after the company saw double-digit growth in the past few years. Miyoko’s Creamery will still have its headquarters in Petaluma.', 'Sürdürülebilir organik tarımı ve üretimi destekleyen “Temiz Reçete”li Humm Organic; vegan beslenmeyi tercih edenler için kurabiyeden grissiniye, kekten gevreğe 13 farklı lezzet sunuyor. Hiçbir katkı maddesi, koruyucu, renklendirici ve ilave şeker bulundurmayan “Temiz Reçete”li Humm Organic atıştırmalıkları, lezzetini ve yükse

In [4]:
# Let's keep track of the articles we removed. We might want to investigate what went wrong during crawling. 
removed = reduced_dataset.filter(lambda instance: instance["Text"] is None)

print("URLs with empty articles:")
print(removed["URL"])

print("\nThis should add up:" )
print(len(filtered_dataset), len(removed), len(reduced_dataset))

Filter:   0%|          | 0/96 [00:00<?, ? examples/s]

URLs with empty articles:
['https://www.healthcanal.com/nutrition/healthy-eating/kefir-vs-kombucha', 'https://euroweeklynews.com/2023/11/01/world-vegan-day-is-a-vegan-diet-wise/', 'https://www.dewsburyreporter.co.uk/lifestyle/food-and-drink/world-vegan-day-here-are-the-11-best-rated-restaurants-with-vegan-friendly-food-options-in-dewsbury-mirfield-batley-and-spen-according-to-tripadvisor-4392551', 'http://www.wnyc.org/story/kung-food-cookbook-shares-third-culture-recipes/', 'https://sfist.com/2023/10/30/first-five-food-vendors-several-of-them-vegan-announced-for-ikea-adjacent-food-hall-in-sf/', 'https://www.yorkshirepost.co.uk/news/people/beck-hall-malham-yorkshire-dales-hotel-to-become-first-in-england-to-go-fully-vegan-4387931', 'http://www.wnyc.org/articles/splendidtable', 'https://www.healthcanal.com/nutrition/weight-management/best-vegan-fat-burner', 'https://www.tmz.com/2023/10/27/kylie-jenner-vegan-leather-clothing-line-support-peta/', 'https://www.healthcanal.com/nutrition/diet

## 2. Split into train and test

For NLP experiments, we split our dataset into training and test data. When working with machine learning models, the training data is often further split into a training and a development portion. The development data is used for exploring hyperparameters and for tuning the model. During the development phase, the test data is not touched at all. For many shared tasks, the test data is not even publicly available, to prevent overfitting to it.
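The three-way split described above can be sketched without any library. This is an illustration only, not the Huggingface implementation: the function name `three_way_split`, the split proportions, and the placeholder article names are assumptions.

```python
import random


def three_way_split(items, dev_size=0.1, test_size=0.2, seed=5):
    """Shuffle items with a fixed seed and cut them into train, dev and test lists."""
    shuffled = list(items)
    # A dedicated Random instance keeps the shuffle independent of global state
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    n_dev = round(len(shuffled) * dev_size)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test


# Placeholder data standing in for the 78 filtered articles
articles = [f"article_{i}" for i in range(78)]
train, dev, test = three_way_split(articles)
print(len(train), len(dev), len(test))
```

With a Huggingface dataset, the same effect can be obtained by calling `train_test_split` a second time on the training portion.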

In [5]:
seed = 5
# Hold out 20% of the articles as test data
splitted = filtered_dataset.train_test_split(test_size=0.2, seed=seed)
train = splitted["train"]
test = splitted["test"]
print(len(train), len(test), len(filtered_dataset))

62 16 78


Take a look at the source code of the method. **Why do we need to set a seed?**
