# Hugging Face Datasets

In this notebook we will learn how to work with Hugging Face datasets.

## Loading Existing Datasets

A dataset hosted on Hugging Face can be loaded directly by its name.

In [5]:
from datasets import load_dataset

In [6]:
# load a dataset
dataset = load_dataset("fka/awesome-chatgpt-prompts")

If we display the dataset, we see that it is stored as a dictionary-like object, which shows:
- the names of the features, and
- the number of rows (data points)

In [None]:
dataset

The dataset can be explored split by split and row by row:

In [7]:
# The 'train' split (use dataset['train'][0] for its first row)
dataset['train']

Dataset({
    features: ['act', 'prompt'],
    num_rows: 170
})

In [None]:
# Last row of the dataset
dataset['train'][-1]

In [8]:
# for a better exploration of the content, you can also list the features
feature_names = list(dataset['train'].features)
print(feature_names)


['act', 'prompt']


In [None]:
# This means the data can be converted into a pandas DataFrame (and vice versa)
import pandas as pd

# Build the DataFrame one column at a time from the dataset features
df = pd.DataFrame({feature: list(dataset['train'][feature]) for feature in feature_names})

print(f"Size of DataFrame = {len(df)}\n")
df.head()

In [None]:
# and save the data in a preferred format such as .csv or .json
import os

os.makedirs('data', exist_ok=True)  # make sure the target folder exists
print('Storing dataframe as file locally ...\n')
df.to_csv('data/awesome_chatgpt_prompts.csv', encoding='utf-8', index=False)
print('File saved.\n')

## Data Preprocessing Method

In this section we will preprocess data loaded from Hugging Face.

<br>

### Shuffling

To create a train-test split, we first **shuffle** the dataset and then select a number of samples.

In [None]:
# Shuffle with a fixed seed (for reproducibility) and keep the first 100 samples
shuffled_data = dataset['train'].shuffle(seed=42).select(range(100))

print(f"Size shuffled sampled dataset: {len(shuffled_data)}")

shuffled_data

### Train-Test Split

A dataset can be divided into two parts: one for training and one for testing.

The most common splits are 80/20 and 70/30, depending on the size of the dataset and the purpose.

In [None]:
# Perform Train-Test Split (80/20 split)
split_data = shuffled_data.train_test_split(train_size=0.8, seed=42)

split_data
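
Conceptually, an 80/20 split shuffles the row indices and cuts the shuffled list at 80% of its length. The sketch below illustrates the idea in plain Python. It is not how `train_test_split` is implemented internally, and the helper name `split_indices` is hypothetical:

```python
import random


def split_indices(num_rows, train_size=0.8, seed=42):
    """Return (train_indices, test_indices) for a shuffled split."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    cut = int(num_rows * train_size)      # number of rows in the training part
    return indices[:cut], indices[cut:]


train_idx, test_idx = split_indices(100)
print(len(train_idx), len(test_idx))  # 80 20
```

Every row index ends up in exactly one of the two parts, so no sample is shared between training and testing.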

## Create a dataset

In this section we will load an unprocessed dataset and prepare it for LLM training.

**Data Source**: old articles from Reuters

Source Link = https://archive.ics.uci.edu/ml/machine-learning-databases/reuters21578-mld/reuters21578.tar.gz 

### Download the raw data

<i>The command `!wget file_link` downloads a file to the local machine.</i>

In [None]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/reuters21578-mld/reuters21578.tar.gz 

**UNIX/LINUX ONLY:**

The command `!tar -xzvf filename.tar.gz` extracts the contents of a .tar.gz file.

In [9]:
!mkdir -p extracted_tars && tar -xzvf reuters21578.tar.gz -C extracted_tars

x README.txt
x all-exchanges-strings.lc.txt
x all-orgs-strings.lc.txt
x all-people-strings.lc.txt
x all-places-strings.lc.txt
x all-topics-strings.lc.txt
x cat-descriptions_120396.txt
x feldman-cia-worldfactbook-data.txt
x lewis.dtd
x reut2-000.sgm
x reut2-001.sgm
x reut2-002.sgm
x reut2-003.sgm
x reut2-004.sgm
x reut2-005.sgm
x reut2-006.sgm
x reut2-007.sgm
x reut2-008.sgm
x reut2-009.sgm
x reut2-010.sgm
x reut2-011.sgm
x reut2-012.sgm
x reut2-013.sgm
x reut2-014.sgm
x reut2-015.sgm
x reut2-016.sgm
x reut2-017.sgm
x reut2-018.sgm
x reut2-019.sgm
x reut2-020.sgm
x reut2-021.sgm
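
On systems without the `tar` command (for example Windows), the same extraction can be sketched with Python's standard `tarfile` module. The helper name `extract_archive` is hypothetical; the archive and folder names follow the cell above:

```python
import os
import tarfile


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    os.makedirs(dest_dir, exist_ok=True)  # create the target folder if needed
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)  # only extract archives you trust
    return names


# Only run the extraction when the downloaded archive is actually present
if os.path.exists("reuters21578.tar.gz"):
    extracted = extract_archive("reuters21578.tar.gz", "extracted_tars")
    print(f"Extracted {len(extracted)} files")
```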


### Parse the dataset

We can use BeautifulSoup to parse the dataset and make it readable.

In [10]:
from bs4 import BeautifulSoup
import os

In [11]:
# establish directory where files are stored (if not same as notebook or script)
directory = "/Users/apavigli/gitrepos/hugging-face-learn/extracted_tars/"


# Open file and parse its content
articles = []
for file_name in os.listdir(directory):
    if file_name.endswith('.sgm'):  # Ensure we only process .sgm files
        full_path = os.path.join(directory, file_name)  # Create full file path
        print(f"Processing file: {full_path}")  # Show the file being processed
        
        try:
            with open(full_path, "r", encoding="latin-1") as file:
                soup = BeautifulSoup(file, "html.parser")
                articles.append(soup)  # Append parsed content to articles
        except Exception as e:
            print(f"Error processing file {full_path}: {e}")

Processing file: /Users/apavigli/gitrepos/hugging-face-learn/extracted_tars/reut2-004.sgm
Processing file: /Users/apavigli/gitrepos/hugging-face-learn/extracted_tars/reut2-010.sgm
Processing file: /Users/apavigli/gitrepos/hugging-face-learn/extracted_tars/reut2-011.sgm
Processing file: /Users/apavigli/gitrepos/hugging-face-learn/extracted_tars/reut2-005.sgm
Processing file: /Users/apavigli/gitrepos/hugging-face-learn/extracted_tars/reut2-013.sgm
Processing file: /Users/apavigli/gitrepos/hugging-face-learn/extracted_tars/reut2-007.sgm
Processing file: /Users/apavigli/gitrepos/hugging-face-learn/extracted_tars/reut2-006.sgm
Processing file: /Users/apavigli/gitrepos/hugging-face-learn/extracted_tars/reut2-012.sgm
Processing file: /Users/apavigli/gitrepos/hugging-face-learn/extracted_tars/reut2-016.sgm
Processing file: /Users/apavigli/gitrepos/hugging-face-learn/extracted_tars/reut2-002.sgm
Processing file: /Users/apavigli/gitrepos/hugging-face-learn/extracted_tars/reut2-003.sgm
Processing

In [112]:
# Extract article titles and bodies from every parsed file
parsed_articles = []

for soup in articles:  # one BeautifulSoup object per .sgm file
    for reuters in soup.find_all('reuters'):
        title = reuters.title.string if reuters.title else ""
        body = reuters.body.string if reuters.body else ""
        parsed_articles.append(
            {
                "title":title,
                "body":body
            }
        )

In [12]:
# Print out first few articles for inspection
for i, article in enumerate(parsed_articles[:5]):
    print(article)
    print("-"*100)


In [None]:
# Another way to print the data
print(parsed_articles[1]['title'])
print("-"*75)
print(parsed_articles[1]['body'])
print("-"*75)
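
To see what the extraction loop does, the sketch below runs the same BeautifulSoup calls on a small hand-written snippet. The snippet only imitates the layout of the `.sgm` files: its two articles are invented, and the second one deliberately has no body:

```python
from bs4 import BeautifulSoup

# Hand-written sample imitating the .sgm layout (content is invented)
sample_sgml = """
<REUTERS NEWID="1">
<TEXT><TITLE>FIRST TITLE</TITLE>
<BODY>First body text.</BODY></TEXT>
</REUTERS>
<REUTERS NEWID="2">
<TEXT><TITLE>SECOND TITLE</TITLE></TEXT>
</REUTERS>
"""

sample_soup = BeautifulSoup(sample_sgml, "html.parser")

sample_articles = []
for reuters in sample_soup.find_all("reuters"):  # html.parser lowercases tag names
    title = reuters.title.string if reuters.title else ""
    body = reuters.body.string if reuters.body else ""  # no <BODY> tag -> ""
    sample_articles.append({"title": title, "body": body})

print(sample_articles)
```

Articles without a body are kept with an empty string, so it may be worth filtering them out before training.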

### Split the dataset: Train, Test, and Validation

In [108]:
import json

In [130]:
# Establish % of training and validation set
TRAIN_PCT, VALID_PCT = 0.8, 0.1
total_articles = len(parsed_articles)


In [None]:
# Split Data
train_end = int(total_articles * TRAIN_PCT)
valid_end = int(total_articles * (TRAIN_PCT + VALID_PCT))

train_set = parsed_articles[:train_end]
valid_set = parsed_articles[train_end:valid_end]
test_set = parsed_articles[valid_end:]

# Report the size of each subset and its share of all articles
for subset, subset_name in zip([train_set, valid_set, test_set], ['train set', 'valid set', 'test set']):
    print(f"Length {subset_name} = {len(subset)} == {len(subset) / total_articles:.2%}")


### Save data as JSON

In [138]:
# Define directory for output
output_dir = "/Users/apavigli/gitrepos/hugging-face-learn/extracted_json_articles/"
os.makedirs(output_dir, exist_ok=True)  # create the folder if it does not exist yet

# Helper function: writes one JSON object per line (JSON Lines format)
def save_as_json(data, filename):
    with open(filename, "w", encoding="utf-8") as f:
        for article in data:
            f.write(json.dumps(article) + '\n')


# Save as json
save_as_json(train_set, f"{output_dir}train.json")
save_as_json(valid_set, f"{output_dir}valid.json")
save_as_json(test_set, f"{output_dir}test.json")
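
Because the helper writes one JSON object per line, the files can be read back line by line to check that nothing was lost. A minimal sketch using only the standard library; the helper name `load_json_lines` is hypothetical and the two records are invented:

```python
import json
import os
import tempfile


def load_json_lines(filename):
    """Read a file with one JSON object per line into a list of dicts."""
    with open(filename, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Round trip on a throwaway file with invented records
records = [{"title": "A", "body": "first"}, {"title": "B", "body": "second"}]
path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

print(load_json_lines(path) == records)  # True
```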

## Load preprocessed dataset from JSON for model training

In [None]:
# Map each split name to its JSON file and load all splits as one dataset

data_files = {
    "train": f"{output_dir}train.json",
    "validation": f"{output_dir}valid.json",
    "test": f"{output_dir}test.json"
}

dataset = load_dataset("json", data_files=data_files)

In [None]:
# Explore dataset: General
dataset

In [None]:
# Explore dataset: test set
dataset['test'][0]

In [None]:
# Explore dataset: validation set
dataset['validation'][0]

In [None]:
# Explore dataset: training set
dataset['train'][0]

## Upload Dataset to the Hugging Face Hub

We can upload our dataset to the Hugging Face Hub and thereby contribute to the community.

In [156]:
from huggingface_hub import notebook_login

In [None]:
notebook_login()
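
Logging in only stores the access token; the dataset still has to be pushed. A minimal sketch, assuming the `push_to_hub` method of the `datasets` library. The helper `upload_dataset` and the repository id are hypothetical placeholders, and the call itself is left commented out because it needs a valid login and network access:

```python
def upload_dataset(dataset, repo_id, private=False):
    """Push a Dataset or DatasetDict to the Hugging Face Hub under repo_id."""
    # private=True would keep the repository hidden from other users
    return dataset.push_to_hub(repo_id, private=private)


# Replace the placeholder id with your own namespace and dataset name:
# upload_dataset(dataset, "your-username/reuters-articles")
```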
