In [1]:
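# Note: the name `datasets` is bound to sklearn.datasets here. The second import
# still resolves to the separate Hugging Face `datasets` package.
from sklearn import datasets
from datasets import Dataset, DatasetDict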

In [2]:
seed = 42
data_cache = "../data/"
train_data = datasets.fetch_20newsgroups(
    data_home=data_cache, subset="train", random_state=seed
)
test_data = datasets.fetch_20newsgroups(
    data_home=data_cache, subset="test", random_state=seed
)

In [3]:
print(train_data.DESCR)

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`~sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features      

In [4]:
print(f"Size of training data: {len(train_data.data)}")

Size of training data: 11314


In [5]:
print(f"Size of testing data: {len(test_data.data)}")

Size of testing data: 7532


In [6]:
print("Target class names")
for i, cn in enumerate(train_data.target_names):
    print(f"{i + 1: >2d}. {cn}")

Target class names
 1. alt.atheism
 2. comp.graphics
 3. comp.os.ms-windows.misc
 4. comp.sys.ibm.pc.hardware
 5. comp.sys.mac.hardware
 6. comp.windows.x
 7. misc.forsale
 8. rec.autos
 9. rec.motorcycles
10. rec.sport.baseball
11. rec.sport.hockey
12. sci.crypt
13. sci.electronics
14. sci.med
15. sci.space
16. soc.religion.christian
17. talk.politics.guns
18. talk.politics.mideast
19. talk.politics.misc
20. talk.religion.misc
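
It can also help to check how many posts fall under each class before training. A minimal sketch using `collections.Counter`, assuming short hypothetical stand-ins for `train_data.target_names` and `train_data.target`:

```python
from collections import Counter

# Hypothetical stand-ins for train_data.target_names and train_data.target.
target_names = ["alt.atheism", "comp.graphics", "rec.autos"]
target = [2, 0, 2, 1, 2, 0]

# Count how often each integer label occurs.
counts = Counter(target)
for idx, name in enumerate(target_names):
    print(f"{idx + 1: >2d}. {name}: {counts[idx]}")
```

With the real data, `Counter(train_data.target.tolist())` should give the per-class counts for the training split.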


In [7]:
print("Example")
print(f"Text: {train_data.data[0]}")
print(f"Label: {train_data.target_names[train_data.target[0]]}")

Example
Text: From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----





Label: rec.autos


In [8]:
print("Example")
print(f"Text: {train_data.data[1]}")
print(f"Label: {train_data.target_names[train_data.target[1]]}")

Example
Text: From: guykuo@carson.u.washington.edu (Guy Kuo)
Subject: SI Clock Poll - Final Call
Summary: Final call for SI clock reports
Keywords: SI,acceleration,clock,upgrade
Article-I.D.: shelley.1qvfo9INNc3s
Organization: University of Washington
Lines: 11
NNTP-Posting-Host: carson.u.washington.edu

A fair number of brave souls who upgraded their SI clock oscillator have
shared their experiences for this poll. Please send a brief message detailing
your experiences with the procedure. Top speed attained, CPU rated speed,
add on cards and adapters, heat sinks, hour of usage per day, floppy disk
functionality with 800 and 1.4 m floppies are especially requested.

I will be summarizing in the next two days, so please add to the network
knowledge base if you have done the clock upgrade and haven't answered this
poll. Thanks.

Guy Kuo <guykuo@u.washington.edu>

Label: comp.sys.mac.hardware
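
Both examples begin with a block of header fields (`From:`, `Subject:`, `Organization:`) separated from the body by a blank line. A minimal sketch for splitting the two, assuming every post follows that layout; `split_header_body` is a hypothetical helper, not part of scikit-learn:

```python
def split_header_body(post: str) -> tuple[dict[str, str], str]:
    """Split a raw post at the first blank line into header fields and body."""
    head, _, body = post.partition("\n\n")
    headers = {}
    for line in head.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # keep only lines that look like "Key: value"
            headers[key.strip()] = value.strip()
    return headers, body.strip()


# Shortened copy of the second example above.
sample = (
    "From: guykuo@carson.u.washington.edu (Guy Kuo)\n"
    "Subject: SI Clock Poll - Final Call\n"
    "Lines: 11\n"
    "\n"
    "A fair number of brave souls who upgraded their SI clock oscillator have\n"
    "shared their experiences for this poll.\n"
)
headers, body = split_header_body(sample)
print(headers["Subject"])
print(body)
```

`fetch_20newsgroups` also accepts a `remove` argument, for example `remove=("headers", "footers", "quotes")`, which strips this metadata at load time so a classifier cannot simply memorize sender or organization names.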


In [9]:
train = {"text": train_data.data, "labels": train_data.target}
test = {"text": test_data.data, "labels": test_data.target}

In [10]:
# Create a Hugging Face dataset

from datasets import DatasetInfo, Features, Value, ClassLabel

In [11]:
label_names = train_data.target_names

In [12]:
data_features = {
    "text": Value(dtype="string"),
    "labels": ClassLabel(num_classes=len(label_names), names=label_names),
}
features = Features(data_features)

In [13]:
data_info = DatasetInfo(description=train_data.DESCR, features=features)

In [14]:
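# Rebinds `train` and `test` from plain dicts to Dataset objects;
# rerun cell In [9] before rerunning this cell.
train = Dataset.from_dict(mapping=train, features=features, info=data_info)
test = Dataset.from_dict(mapping=test, features=features, info=data_info)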

dataset = DatasetDict({"train": train, "test": test})
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'labels'],
        num_rows: 11314
    })
    test: Dataset({
        features: ['text', 'labels'],
        num_rows: 7532
    })
})

In [15]:
ds_name = "20-news-groups"
dataset.save_to_disk(data_cache + ds_name)

Saving the dataset (0/1 shards):   0%|          | 0/11314 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/7532 [00:00<?, ? examples/s]