# Hugging Face Datasets

In this notebook we will learn how to work with Hugging Face datasets.

## Loading Existing Datasets

If you know the name of a dataset, you can load it directly from the Hugging Face Hub.

In [31]:
# Linux/Unix: create subfolders for organizing data
# (-p avoids an error when a folder already exists, e.g. on re-runs)

!mkdir -p data
!mkdir -p extracted_json_articles
!mkdir -p extracted_tars
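
The `!mkdir` commands above assume a Unix-like shell. A minimal sketch of the same step in plain Python follows, which should also work on Windows. The helper name `make_data_dirs` and its `base` parameter are illustrative and not part of the original notebook.

```python
from pathlib import Path

# Folder names used throughout this notebook
FOLDERS = ["data", "extracted_json_articles", "extracted_tars"]

def make_data_dirs(base=".", folders=FOLDERS):
    """Create each folder under `base`; do nothing if it already exists."""
    created = []
    for name in folders:
        path = Path(base) / name
        path.mkdir(parents=True, exist_ok=True)  # no error on re-runs
        created.append(path)
    return created
```

Calling `make_data_dirs()` from the notebook's working directory would create the three folders next to the notebook.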


In [1]:
from datasets import load_dataset

In [2]:
# load a dataset
dataset = load_dataset("fka/awesome-chatgpt-prompts")

If we display the dataset, we see that it is stored as a dictionary-like object (a `DatasetDict`), from which we can read:
- the names of the features, and
- the number of rows (datapoints)

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['act', 'prompt'],
        num_rows: 170
    })
})

It is possible to explore the dataset by inspecting a split or its individual rows:

In [4]:
# The train split (index it, e.g. dataset['train'][0], to get the first row)
dataset['train']

Dataset({
    features: ['act', 'prompt'],
    num_rows: 170
})

In [5]:
# Last row of the dataset
dataset['train'][-1]

{'act': 'Architectural Expert',
 'prompt': 'I am an expert in the field of architecture, well-versed in various aspects including architectural design, architectural history and theory, structural engineering, building materials and construction, architectural physics and environmental control, building codes and standards, green buildings and sustainable design, project management and economics, architectural technology and digital tools, social cultural context and human behavior, communication and collaboration, as well as ethical and professional responsibilities. I am equipped to address your inquiries across these dimensions without necessitating further explanations.'}

In [6]:
# For a closer look at the content, you can also list the features
feature_names = list(dataset['train'].features)
print(feature_names)


['act', 'prompt']


In [7]:
# The data can therefore be converted into a pandas DataFrame (and back again)
import pandas as pd
df = pd.DataFrame(columns=feature_names)

# Copy each feature (column) of the train split into the DataFrame
for feature in feature_names:
    df[feature] = list(dataset['train'][feature])

print(f"Size of DataFrame = {len(df)}\n")
df.head()

Size of DataFrame = 170



| | act | prompt |
|---|---|---|
| 0 | An Ethereum Developer | Imagine you are an experienced Ethereum develo... |
| 1 | SEO Prompt | Using WebPilot, create an outline for an artic... |
| 2 | Linux Terminal | I want you to act as a linux terminal. I will ... |
| 3 | English Translator and Improver | I want you to act as an English translator, sp... |
| 4 | `position` Interviewer | I want you to act as an interviewer. I will be... |


In [8]:
# and save the data in a preferred format such as .csv or .json
print('Storing dataframe as file locally ...\n')
df.to_csv('data/awesome_chatgpt_prompts.csv', encoding='UTF-8', index=False)
print('File saved.\n')

Storing dataframe as file locally ...

File saved.



## Data Preprocessing Method

In this section we will preprocess data loaded from Hugging Face.

<br>

### Shuffling

Suppose we want to create a train-test split. We first **shuffle** the dataset and then select a number of samples.

In [9]:
# Select 100 random samples (seed fixes the random number generator; range(100) sets the subset size)
shuffled_data = dataset['train'].shuffle(seed=42).select(range(100))

print(f"Size shuffled sampled dataset: {len(shuffled_data)}")

shuffled_data

Size shuffled sampled dataset: 100


Dataset({
    features: ['act', 'prompt'],
    num_rows: 100
})

### Train-Test Split

A dataset can be divided into two parts, one for training and another one for testing.

The most common splits are 80/20 or 70/30, depending on the size of the dataset and the purpose.

In [10]:
# Perform Train-Test Split (80/20 split)
split_data = shuffled_data.train_test_split(train_size=0.8, seed=42)

split_data

DatasetDict({
    train: Dataset({
        features: ['act', 'prompt'],
        num_rows: 80
    })
    test: Dataset({
        features: ['act', 'prompt'],
        num_rows: 20
    })
})

## Create a dataset

In this section we will load a raw, unprocessed dataset and prepare it for LLM training.

**Data Source**: old articles from Reuters

Source Link = https://archive.ics.uci.edu/ml/machine-learning-databases/reuters21578-mld/reuters21578.tar.gz 

### Download the raw data

<i>The command `!wget file_link` downloads a file to the local machine.</i>

In [11]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/reuters21578-mld/reuters21578.tar.gz 

--2024-09-27 16:24:13--  https://archive.ics.uci.edu/ml/machine-learning-databases/reuters21578-mld/reuters21578.tar.gz
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: 'reuters21578.tar.gz.1'

reuters21578.tar.gz     [  <=>               ]   7.25M   149KB/s               ^C
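
The log above shows the file being saved as `reuters21578.tar.gz.1`, which is what `wget` does when a file with the target name already exists. A hedged alternative in plain Python is sketched below: it skips the download when the file is already present and writes to a temporary `.part` file first, so an interrupted transfer is not mistaken for a finished one. The helper name `download_once` is illustrative.

```python
import urllib.request
from pathlib import Path

def download_once(url, dest):
    """Download `url` to `dest` unless `dest` already exists and is non-empty."""
    dest = Path(dest)
    if dest.exists() and dest.stat().st_size > 0:
        return dest  # reuse the earlier download
    # Write to a temporary name first, then rename once the transfer completes
    partial = dest.with_name(dest.name + ".part")
    urllib.request.urlretrieve(url, str(partial))
    partial.replace(dest)
    return dest

# Example call (not executed here, it needs network access):
# download_once(
#     "https://archive.ics.uci.edu/ml/machine-learning-databases/reuters21578-mld/reuters21578.tar.gz",
#     "reuters21578.tar.gz",
# )
```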


**UNIX/LINUX ONLY:**

The command `!tar -xzvf filename.tar.gz` extracts the contents of a .tar.gz file.

In [9]:
!tar -xzvf reuters21578.tar.gz -C extracted_tars



x README.txt
x all-exchanges-strings.lc.txt
x all-orgs-strings.lc.txt
x all-people-strings.lc.txt
x all-places-strings.lc.txt
x all-topics-strings.lc.txt
x cat-descriptions_120396.txt
x feldman-cia-worldfactbook-data.txt
x lewis.dtd
x reut2-000.sgm
x reut2-001.sgm
x reut2-002.sgm
x reut2-003.sgm
x reut2-004.sgm
x reut2-005.sgm
x reut2-006.sgm
x reut2-007.sgm
x reut2-008.sgm
x reut2-009.sgm
x reut2-010.sgm
x reut2-011.sgm
x reut2-012.sgm
x reut2-013.sgm
x reut2-014.sgm
x reut2-015.sgm
x reut2-016.sgm
x reut2-017.sgm
x reut2-018.sgm
x reut2-019.sgm
x reut2-020.sgm
x reut2-021.sgm
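
Since the `tar` command is limited to Unix-like systems, a cross-platform sketch using Python's standard `tarfile` module is given below. The helper name `extract_tar_gz` is illustrative; as with any archive, it is best used only on files from a trusted source.

```python
import tarfile
from pathlib import Path

def extract_tar_gz(archive, dest):
    """Extract a .tar.gz archive into `dest` and return the member names."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter where this Python version has it
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest, **kwargs)
    return names

# Example call matching the shell command above (not executed here):
# extract_tar_gz("reuters21578.tar.gz", "extracted_tars")
```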


### Parse the dataset

We can use BeautifulSoup to parse the dataset and make it readable.

In [12]:
from bs4 import BeautifulSoup
import os

In [13]:
# establish directory where files are stored (if not same as notebook or script)
directory = "/Users/apavigli/gitrepos/hugging-face-learn/extracted_tars/"


# Open file and parse its content
articles = []
for file_name in os.listdir(directory):
    if file_name.endswith('.sgm'):  # Ensure we only process .sgm files
        full_path = os.path.join(directory, file_name)  # Create full file path
        print(f"Processing file: {full_path}")  # Show the file being processed
        
        try:
            with open(full_path, "r", encoding="latin-1") as file:
                soup = BeautifulSoup(file, "html.parser")
                articles.append(soup)  # Append parsed content to articles
        except Exception as e:
            print(f"Error processing file {full_path}: {e}")

Processing file: /Users/apavigli/gitrepos/hugging-face-learn/extracted_tars/reut2-004.sgm
Processing file: /Users/apavigli/gitrepos/hugging-face-learn/extracted_tars/reut2-010.sgm
Processing file: /Users/apavigli/gitrepos/hugging-face-learn/extracted_tars/reut2-011.sgm
Processing file: /Users/apavigli/gitrepos/hugging-face-learn/extracted_tars/reut2-005.sgm
Processing file: /Users/apavigli/gitrepos/hugging-face-learn/extracted_tars/reut2-013.sgm
Processing file: /Users/apavigli/gitrepos/hugging-face-learn/extracted_tars/reut2-007.sgm
Processing file: /Users/apavigli/gitrepos/hugging-face-learn/extracted_tars/reut2-006.sgm
Processing file: /Users/apavigli/gitrepos/hugging-face-learn/extracted_tars/reut2-012.sgm
Processing file: /Users/apavigli/gitrepos/hugging-face-learn/extracted_tars/reut2-016.sgm
Processing file: /Users/apavigli/gitrepos/hugging-face-learn/extracted_tars/reut2-002.sgm
Processing file: /Users/apavigli/gitrepos/hugging-face-learn/extracted_tars/reut2-003.sgm
Processing

In [14]:
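# Extract article titles and bodies.
# Note: `soup` still refers to the last file parsed in the loop above, so only
# that file's articles are extracted here. To cover every file, iterate over
# the parsed soups stored in `articles` instead.
parsed_articles = []

for reuters in soup.find_all('reuters'):
    # Fall back to an empty string when a tag is missing
    title = reuters.title.string if reuters.title else ""
    body = reuters.body.string if reuters.body else ""
    parsed_articles.append(
        {
            "title": title,
            "body": body
        }
    )
# Caution: this adds dicts to a list that already holds BeautifulSoup objects
articles.extend(parsed_articles)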
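
The extraction cell reads from the variable `soup`, which after the parsing loop refers to the last file that was parsed. A possible way to collect articles from every file is sketched below on two tiny made-up documents that imitate the tag layout of the `.sgm` files. The helper name `extract_articles` and the sample strings are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Two tiny made-up documents in the same tag layout as the .sgm files
SAMPLE_FILES = [
    '<REUTERS NEWID="1"><TEXT><TITLE>FIRST TITLE</TITLE>'
    '<BODY>First body.</BODY></TEXT></REUTERS>',
    '<REUTERS NEWID="2"><TEXT><TITLE>SECOND TITLE</TITLE></TEXT></REUTERS>',
]

def extract_articles(sgml_texts):
    """Collect title/body dicts from every document, not only the last one."""
    records = []
    for text in sgml_texts:
        soup = BeautifulSoup(text, "html.parser")
        # html.parser lowercases tag names, hence 'reuters'
        for reuters in soup.find_all("reuters"):
            title = reuters.title.get_text() if reuters.title else ""
            body = reuters.body.get_text() if reuters.body else ""
            records.append({"title": title, "body": body})
    return records

all_articles = extract_articles(SAMPLE_FILES)
print(len(all_articles))
```

Keeping the plain dicts in their own list, separate from the parsed soup objects, makes the later splitting and JSON export steps simpler.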

In [15]:
# Print out the first few articles for inspection
for article in parsed_articles[:5]:
    print(article)
    print("-"*100)

{'title': 'CITYFED FINANCIAL CORP SAYS IT CUT QTRLY DIVIDEND TO ONE CENT FROM 10 CTS/SHR\n', 'body': ''}
----------------------------------------------------------------------------------------------------
{'title': 'HUGE OIL PLATFORMS DOT GULF LIKE BEACONS', 'body': 'Huge oil platforms dot the Gulf like\nbeacons -- usually lit up like Christmas trees at night.\n    One of them, sitting astride the Rostam offshore oilfield,\nwas all but blown out of the water by U.S. Warships on Monday.\n    The Iranian platform, an unsightly mass of steel and\nconcrete, was a three-tier structure rising 200 feet (60\nmetres) above the warm waters of the Gulf until four U.S.\nDestroyers pumped some 1,000 shells into it.\n    The U.S. Defense Department said just 10 pct of one section\nof the structure remained.\n    U.S. helicopters destroyed three Iranian gunboats after an\nAmerican helicopter came under fire earlier this month and U.S.\nforces attacked, seized, and sank an Iranian ship they said had\

In [16]:
# Another way to print the data
print(parsed_articles[1]['title'])
print("-"*75)
print(parsed_articles[1]['body'])
print("-"*75)

HUGE OIL PLATFORMS DOT GULF LIKE BEACONS
---------------------------------------------------------------------------
Huge oil platforms dot the Gulf like
beacons -- usually lit up like Christmas trees at night.
    One of them, sitting astride the Rostam offshore oilfield,
was all but blown out of the water by U.S. Warships on Monday.
    The Iranian platform, an unsightly mass of steel and
concrete, was a three-tier structure rising 200 feet (60
metres) above the warm waters of the Gulf until four U.S.
Destroyers pumped some 1,000 shells into it.
    The U.S. Defense Department said just 10 pct of one section
of the structure remained.
    U.S. helicopters destroyed three Iranian gunboats after an
American helicopter came under fire earlier this month and U.S.
forces attacked, seized, and sank an Iranian ship they said had
been caught laying mines.
    But Iran was not deterred, according to U.S. defense
officials, who said Iranian forces used Chinese-made Silkworm
missiles to hit a U

### Split the dataset: Train, Test, and Validation

In [17]:
import json

In [18]:
# Establish % of training and validation set
TRAIN_PCT, VALID_PCT = 0.8, 0.1
total_articles = len(parsed_articles)


In [19]:
# Split Data
train_set = parsed_articles[:int(total_articles * TRAIN_PCT)]
valid_set = parsed_articles[int(total_articles * TRAIN_PCT): int(total_articles * (TRAIN_PCT + VALID_PCT))]
test_set = parsed_articles[int(total_articles * (TRAIN_PCT + VALID_PCT)):]

# `subset` avoids shadowing the built-in `set`; `:.2%` prints the share as a percentage
for subset, set_name in zip([train_set, valid_set, test_set], ['train set', 'valid set', 'test set']):
    print(f"Length {set_name} = {len(subset)} == {len(subset) / total_articles:.2%}")


Length train set = 462 == 79.93%
Length valid set = 58 == 10.03%
Length test set = 58 == 10.03%
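
The split above slices the articles in their original order. If the order of the articles carries any pattern (for example, by date), it may be preferable to shuffle first. A minimal sketch with the standard library follows; the helper name `split_three_way` and the seed value are illustrative, and 578 is simply the total of the three set sizes printed above.

```python
import random

def split_three_way(items, train_pct=0.8, valid_pct=0.1, seed=42):
    """Shuffle a copy of `items`, then slice it into train/valid/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    train_end = int(n * train_pct)
    valid_end = int(n * (train_pct + valid_pct))
    return shuffled[:train_end], shuffled[train_end:valid_end], shuffled[valid_end:]

# 578 stand-in items, matching the number of articles split above
train, valid, test = split_three_way(range(578))
print(len(train), len(valid), len(test))
```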


### Save data as JSON

In [20]:
# Define directory for output
output_dir = "/Users/apavigli/gitrepos/hugging-face-learn/extracted_json_articles/"

# Helper function: write one JSON object per line (JSON Lines format)
def save_as_json(data, filename):
    with open(filename, "w") as f:
        for article in data:
            f.write(json.dumps(article) + '\n')


# Save as json
save_as_json(train_set, f"{output_dir}train.json")
save_as_json(valid_set, f"{output_dir}valid.json")
save_as_json(test_set, f"{output_dir}test.json")
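
The helper writes one JSON object per line, a layout often called JSON Lines. As a quick sanity check, such a file can be read back line by line. The sketch below does a round trip in a temporary folder with two invented records; `write_jsonl` and `read_jsonl` are illustrative names.

```python
import json
import tempfile
from pathlib import Path

def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

def read_jsonl(path):
    """Read the file back, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Invented records; the newline inside `body` is escaped by json.dumps
records = [
    {"title": "A TITLE", "body": "Line one\nline two"},
    {"title": "NO BODY", "body": ""},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.json"
    write_jsonl(records, path)
    restored = read_jsonl(path)

print(restored == records)
```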

## Load preprocessed dataset from JSON for model training

In [21]:
# Build the training dataset from the saved JSON files

data_files = {
    "train": f"{output_dir}train.json",
    "validation": f"{output_dir}valid.json",
    "test": f"{output_dir}test.json"
}

dataset = load_dataset("json", data_files=data_files)

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

In [22]:
# Explore dataset: General
dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'body'],
        num_rows: 462
    })
    validation: Dataset({
        features: ['title', 'body'],
        num_rows: 58
    })
    test: Dataset({
        features: ['title', 'body'],
        num_rows: 58
    })
})

In [23]:
# Explore dataset: test set
dataset['test'][0]

{'title': 'BONN MINISTRY HAS NO COMMENT ON BAKER REMARKS',
 'body': "The West German Finance Ministry declined to\ncomment on weekend criticism by U.S. Treasury Secretary James\nBaker of recent West German interest rate increases.\n    Baker said the U.S. Would re-examine the February Louvre\nAccord to stabilise currencies reached by leading industrial\ndemocracies. The rise in West Germany short term interest rates\nwas not in the spirit of an agreement by these nations in\nWashington, which reaffirmed the Louvre pact, he said.\n    A Finance Ministry spokesman, asked for an official\nministry reaction to Baker's remarks, said he could make no\ncomment.\n REUTER\n\x03"}

In [24]:
# Explore dataset: validation set
dataset['validation'][0]

{'title': 'MICROSOFT CORP 1ST QTR SHR 38 CTS VS 29 CTS\n', 'body': ''}

In [25]:
# Explore dataset: training set
dataset['train'][0]

{'title': 'CITYFED FINANCIAL CORP SAYS IT CUT QTRLY DIVIDEND TO ONE CENT FROM 10 CTS/SHR\n',
 'body': ''}

## Upload Dataset to the Hugging Face Hub

We can upload our dataset to the Hugging Face Hub and thereby contribute to the community.

In [26]:
from huggingface_hub import notebook_login

In [28]:
# Log in to your personal Hugging Face Hub account (via a token with write permission)
notebook_login()



In [29]:
# Store dataset on Hugging Face Hub
dataset.push_to_hub("reuters_articles")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

CommitInfo(commit_url='https://huggingface.co/datasets/Kain17/reuters_articles/commit/837a2bdf56f2c24eaa581564094f1e447776143f', commit_message='Upload dataset', commit_description='', oid='837a2bdf56f2c24eaa581564094f1e447776143f', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/Kain17/reuters_articles', endpoint='https://huggingface.co', repo_type='dataset', repo_id='Kain17/reuters_articles'), pr_revision=None, pr_num=None)

<hr>

###### End of the Notebook