Reference: https://towardsdatascience.com/build-nlp-pipelines-with-huggingface-datasets-d597ff5f68ad

The essay covers the NLP datasets provided by HuggingFace (HF). It cites 718 available datasets (the list returned below contains 2,028), and HF provides an app to browse them: https://huggingface.co/datasets/viewer/

This script is just a starting point for exploring these datasets.

# Import Libraries

In [1]:
!pip install datasets

Collecting datasets
  Downloading datasets-1.16.1-py3-none-any.whl (298 kB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2021.11.1-py3-none-any.whl (132 kB)
Collecting tqdm>=4.62.1
  Downloading tqdm-4.62.3-py2.py3-none-any.whl (76 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-macosx_10_9_x86_64.whl (570 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.12.2-py37-none-any.whl (112 kB)
Collecting xxhash
  Downloading xxhash-2.0.2-cp37-cp37m-macosx_10_9_x86_64.whl (31 kB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.1.2-py3-none-any.whl (59 kB)

In [2]:
import datasets
import json

# View List of Datasets Available

In [3]:
ds_list = datasets.list_datasets()

In [5]:
type(ds_list), len(ds_list)

(list, 2028)

In [7]:
ds_list[-5:]

['jinmang2/temp',
 'kiyoung2/temp',
 'Graphcore/wikipedia-bert-512',
 's3h/customized-qalb-v2',
 's3h/arabic-grammar-corrections']

In [8]:
[ds for ds in ds_list if 'squad' in ds.lower()]

['GEM/squad_v2',
 'Gabriel/squad_v2_sv',
 'Serhii/Custom_SQuAD',
 'Tevatron/wikipedia-squad-corpus',
 'Tevatron/wikipedia-squad',
 'Wikidepia/IndoSQuAD',
 'adamlin/coqa_squad',
 'dweb/squad_with_cola_scores',
 'lhoestq/custom_squad',
 'lhoestq/squad',
 'lhoestq/squad_titles',
 'lincoln/newsquadfr',
 'philschmid/test_german_squad',
 'piEsposito/squad_20_ptbr',
 'qwant/squad_fr',
 'shivmoha/squad-unanswerable',
 'shivmoha/squad_adversarial_manual',
 'susumu2357/squad_v2_sv',
 'vershasaxena91/squad_multitask',
 'z-uo/squad-it',
 'iapp_wiki_qa_squad',
 'squad',
 'squad_adversarial',
 'squad_es',
 'squad_it',
 'squad_kor_v1',
 'squad_kor_v2',
 'squad_v1_pt',
 'squad_v2',
 'squadshifts',
 'thaiqa_squad']

# Load Data

We can load a whole dataset directly, but this may not be a good idea if the dataset is very large. An alternative is to stream it, so that records are downloaded lazily as we iterate, by setting streaming=True.

In [52]:
# Load dataset. SQuAD is split into train and validation; here we load only the train split
data = datasets.load_dataset('squad', split='train', streaming=True)

In [53]:
type(data)

datasets.iterable_dataset.IterableDataset
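
Since a streamed dataset is consumed lazily, one way to peek at a few records without pulling the whole split is `itertools.islice`, which works on any iterable. The sketch below uses a hypothetical stand-in generator, `fake_stream`, in place of the real `IterableDataset` so that it runs offline; the record fields are made up for illustration.

```python
from itertools import islice


def fake_stream():
    """Hypothetical stand-in for a streamed dataset: yields records lazily."""
    n = 0
    while True:  # unbounded on purpose; nothing is produced until requested
        yield {"id": str(n), "question": f"question {n}"}
        n += 1


# Take only the first three records; the rest of the stream is never generated.
first_three = list(islice(fake_stream(), 3))
for record in first_three:
    print(record)
```

With the streamed dataset loaded above, the equivalent call would be `islice(data, 3)`.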

In [54]:
print(data.dataset_size)
print(data.citation + '\n')
print(data.description + '\n')
print(data.features)

89789763
@article{2016arXiv160605250R,
       author = {{Rajpurkar}, Pranav and {Zhang}, Jian and {Lopyrev},
                 Konstantin and {Liang}, Percy},
        title = "{SQuAD: 100,000+ Questions for Machine Comprehension of Text}",
      journal = {arXiv e-prints},
         year = 2016,
          eid = {arXiv:1606.05250},
        pages = {arXiv:1606.05250},
archivePrefix = {arXiv},
       eprint = {1606.05250},
}


Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.


{'id': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None), 'context': Value(dtype='string', id=None), 'question': Value(dtype='string', id=None), 'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int3

# View Data

If we load the whole dataset, we can use data[0] to view a record, but with streaming=True we need to iterate over it, as in the following loop. Note: calling next() directly on the dataset doesn't work, because an 'IterableDataset' object is an iterable, not an iterator; next(iter(data)) works instead.

In [55]:
for i in data:
    print(json.dumps(i, indent=4))
    break

{
    "id": "5733be284776f41900661182",
    "title": "University_of_Notre_Dame",
    "context": "Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.",
    "question": "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?",
    "answers": {
        "text": [
            "Saint Bernadette Soubirous"
        ],
        "answer_start": [

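The iterable-versus-iterator distinction mentioned above can be shown with plain Python. The class `Streamed` below is a hypothetical stand-in for an `IterableDataset`, not the library class: it defines `__iter__` but not `__next__`, so `next()` fails on the object itself and succeeds on `iter()` of it.

```python
class Streamed:
    """Hypothetical stand-in: an iterable that is not itself an iterator."""

    def __init__(self, records):
        self._records = records

    def __iter__(self):
        # Each call returns a fresh iterator over the records.
        return iter(self._records)


stream = Streamed([{"id": "a"}, {"id": "b"}])

try:
    next(stream)  # fails: Streamed has no __next__ method
    direct_next_works = True
except TypeError:
    direct_next_works = False

first = next(iter(stream))  # works: iter() returns a real iterator
print(direct_next_works, first)
```

Because `iter()` returns a fresh iterator each time in this sketch, repeated calls to `next(iter(stream))` keep returning the first record.
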
# Process Data

You can iterate over the data to turn it into a data frame or any other format you are familiar with, and then start processing. Alternatively, you can use the data processing functions already available on the HF data object.
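
As a sketch of the first option, nested records can be flattened into plain rows before being handed to a tabular library. The helper `flatten_record` and the sample record below are illustrative assumptions (the field names follow the SQuAD record shown earlier, and the context string is shortened); the resulting list of dicts is a shape that, for instance, `pandas.DataFrame` accepts.

```python
def flatten_record(record):
    """Turn one SQuAD-style record into one flat row per answer (hypothetical helper)."""
    answers = record["answers"]
    return [
        {
            "id": record["id"],
            "title": record["title"],
            "question": record["question"],
            "answer_text": text,
            "answer_start": start,
        }
        for text, start in zip(answers["text"], answers["answer_start"])
    ]


sample = {
    "id": "5733be284776f41900661182",
    "title": "University_of_Notre_Dame",
    "context": "(shortened for the example)",
    "question": "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?",
    "answers": {"text": ["Saint Bernadette Soubirous"], "answer_start": [515]},
}

rows = flatten_record(sample)
print(rows)
```

The long `context` field is dropped here on purpose to keep the rows small; it could be kept by adding it to the row dict.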

In [63]:
# Delete a column
data1 = data.remove_columns(['question'])
print(data1.features)  # prints None: after remove_columns, data1 no longer carries the feature metadata that data has

None


In [66]:
# Keep only the necessary columns and modify them with "map".
# Both examples index [0], so they assume every record has at least one answer.
# Example 1: return only 'answers', extended with an 'answer_end' offset
data1 = data.map(
    lambda x: {
        'answers': {
            **x['answers'],
            'answer_end': [x['answers']['answer_start'][0] + len(x['answers']['text'][0])]
        }
    }
)

for i in data1:
    print(i)
    break

# Example 2: also return 'id', 'question' and 'title'
data2 = data.map(
    lambda x: {
        'id': x['id'],
        'answers': {
            **x['answers'],
            'answer_end': [x['answers']['answer_start'][0] + len(x['answers']['text'][0])]
        },
        'question': x['question'],
        'title': x['title']
    }
)

for i in data2:
    print(i)
    break

{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous'], 'answer_end': [541]}}
{'id': '5733be284776f41900661182', 'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous'], 'answer_end': [541]}, 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'title': 'University_of_Notre_Dame'}
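
The `answer_end` arithmetic can be checked on its own: the end offset is the start offset plus the answer length, so slicing the context with the two offsets should return the answer text. In the output above, 515 + 26 = 541 because the answer has 26 characters. The sketch below uses a shortened, hypothetical context string (so the offsets differ from 515 and 541), and the helper `add_answer_end` is an illustrative name; unlike the lambdas above, it covers every answer in the list rather than only the first.

```python
def add_answer_end(answers):
    """Return a copy of a SQuAD-style answers dict with an 'answer_end' list added."""
    ends = [
        start + len(text)  # end offset is exclusive, matching Python slicing
        for text, start in zip(answers["text"], answers["answer_start"])
    ]
    return {**answers, "answer_end": ends}


# Shortened stand-in for the real context paragraph.
context = "the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858."
text = "Saint Bernadette Soubirous"
answers = {"text": [text], "answer_start": [context.index(text)]}

result = add_answer_end(answers)
start, end = result["answer_start"][0], result["answer_end"][0]
print(start, end, context[start:end])
```

Since the helper builds the list with `zip`, a record with no answers simply gets an empty `answer_end` list instead of raising an `IndexError`.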
