# Natural Language Processing

## 1. Working on your own file

In this section, we’ll show you how Datasets can be used to load datasets that aren’t available on the Hugging Face Hub.

### Working with local and remote datasets

Datasets provides loading scripts for local and remote datasets in several common data formats:

| Data format        | Loading script | Example                                                 |
| ---                | ---            | ---                                                     |
| CSV & TSV          | `csv`          | `load_dataset("csv", data_files="my_file.csv")`         |
| Text files         | `text`         | `load_dataset("text", data_files="my_file.txt")`        |
| JSON & JSON Lines  | `json`         | `load_dataset("json", data_files="my_file.jsonl")`      |
| Pickled DataFrames | `pandas`       | `load_dataset("pandas", data_files="my_dataframe.pkl")` |
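
As a small illustration of the table, the sketch below picks a loading script name from a file's extension. The `guess_loader` helper and the extension mapping are hypothetical conveniences written for this example, not part of the Datasets API; the mapping simply restates the table and skips a trailing `.gz`.

```python
from pathlib import Path

# Mapping restated from the table above: file extension -> loading script name.
LOADER_BY_EXTENSION = {
    ".csv": "csv",
    ".tsv": "csv",
    ".txt": "text",
    ".json": "json",
    ".jsonl": "json",
    ".pkl": "pandas",
}

# Compression suffixes that are skipped when looking for the real extension.
COMPRESSION_SUFFIXES = {".gz"}


def guess_loader(filename: str) -> str:
    """Return the loading script name for a file (hypothetical helper)."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    # Drop trailing compression suffixes such as ".gz".
    while suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes.pop()
    if not suffixes or suffixes[-1] not in LOADER_BY_EXTENSION:
        raise ValueError(f"No known loading script for {filename!r}")
    return LOADER_BY_EXTENSION[suffixes[-1]]


print(guess_loader("my_file.csv"))                  # csv
print(guess_loader("data/SQuAD_it-train.json.gz"))  # json
```

The returned name would then be passed as the first argument of `load_dataset()`, with the file path given through `data_files`.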

### Loading a local dataset

For this example we’ll use the SQuAD-it dataset, which is a large-scale dataset for question answering in Italian.

Download the training and test splits from https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz and https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz.

You can also use `wget` in your terminal:

    wget -P data/ https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
    wget -P data/ https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz
    
This will download two compressed files, `SQuAD_it-train.json.gz` and `SQuAD_it-test.json.gz`, into the `data/` directory.

We can load a JSON file with the `load_dataset()` function. We just need to know whether we’re dealing with ordinary JSON (similar to a nested dictionary) or JSON Lines (line-separated JSON). Like many question answering datasets, SQuAD-it uses the nested format, with all the text stored in a `data` field. This means we can load the dataset by specifying the `field` argument as follows:

    from datasets import load_dataset

    # Datasets supports automatic decompression of input files, so .gz is fine
    squad_it_dataset = load_dataset("json", data_files="data/SQuAD_it-train.json.gz", field="data")

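If this call raises `ModuleNotFoundError: No module named 'datasets'`, install the library first (for example with `pip install datasets`) and run the cell again.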
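
To make the difference between nested JSON and JSON Lines concrete, here is a minimal sketch that uses only the Python standard library. The records are made up for the example, and the code only imitates what decompression and the `field="data"` argument accomplish; it is not how Datasets implements them.

```python
import gzip
import json
import tempfile
from pathlib import Path

# Made-up records that imitate the shape of a question answering dataset.
records = [
    {"title": "Example A", "paragraphs": []},
    {"title": "Example B", "paragraphs": []},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    nested_path = Path(tmp_dir) / "nested.json.gz"
    lines_path = Path(tmp_dir) / "records.jsonl"

    # Nested JSON: one document, with the records stored under a "data" key.
    with gzip.open(nested_path, "wt", encoding="utf-8") as f:
        json.dump({"version": "demo", "data": records}, f)

    # JSON Lines: one JSON object per line, no wrapping document.
    with open(lines_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # Nested file: decompress, parse once, then pick the "data" field.
    with gzip.open(nested_path, "rt", encoding="utf-8") as f:
        nested_rows = json.load(f)["data"]

    # JSON Lines file: parse each line separately.
    with open(lines_path, encoding="utf-8") as f:
        jsonl_rows = [json.loads(line) for line in f if line.strip()]

print(nested_rows == jsonl_rows)  # True
```

Both files describe the same rows; the nested layout just needs one extra step to reach them, which is the role the `field` argument plays for SQuAD-it.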

By default, loading local files creates a `DatasetDict` object with a train split. We can see this by inspecting the `squad_it_dataset` object:

    squad_it_dataset

Output:

    DatasetDict({
        train: Dataset({
            features: ['title', 'paragraphs'],
            num_rows: 442
        })
    })

This shows us the number of rows and the column names associated with the training set. We can view one of the examples by indexing into the train split as follows:

    squad_it_dataset["train"][0]

Output:

    {
        "title": "Terremoto del Sichuan del 2008",
        "paragraphs": [
            {
                "context": "Il terremoto del Sichuan del 2008 o il terremoto...",
                "qas": [
                    {
                        "answers": [{"answer_start": 29, "text": "2008"}],
                        "id": "56cdca7862d2951400fa6826",
                        "question": "In quale anno si è verificato il terremoto nel Sichuan?",
                    },
                    ...
                ],
            },
            ...
        ],
    }
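
Because each row nests paragraphs, questions, and answers, it can help to see how one article unrolls into flat question records. The sketch below works on a hand-written sample that mirrors the structure shown above (shortened context, a single question); `flatten_article` is a hypothetical helper for illustration, not a Datasets function.

```python
# Hand-written sample that mirrors the structure shown above; the context is
# shortened and only one question is included.
example = {
    "title": "Terremoto del Sichuan del 2008",
    "paragraphs": [
        {
            "context": "Il terremoto del Sichuan del 2008 o il terremoto...",
            "qas": [
                {
                    "answers": [{"answer_start": 29, "text": "2008"}],
                    "id": "56cdca7862d2951400fa6826",
                    "question": "In quale anno si è verificato il terremoto nel Sichuan?",
                }
            ],
        }
    ],
}


def flatten_article(article):
    """Yield one flat dict per question in a nested article (hypothetical helper)."""
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            yield {
                "id": qa["id"],
                "title": article["title"],
                "context": paragraph["context"],
                "question": qa["question"],
                "answers": [answer["text"] for answer in qa["answers"]],
            }


rows = list(flatten_article(example))
print(len(rows))           # 1
print(rows[0]["answers"])  # ['2008']
```

A function of this shape is one way to get from the 442 article-level rows to one record per question, should a later preprocessing step need that.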

Great, we’ve loaded our first local dataset! But while this worked for the training set, what we really want is to include both the train and test splits in a single `DatasetDict` object so we can apply `Dataset.map()` functions across both splits at once. To do this, we can provide a dictionary to the `data_files` argument that maps each split name to a file associated with that split:

    data_files = {"train": "data/SQuAD_it-train.json.gz", "test": "data/SQuAD_it-test.json.gz"}
    squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
    squad_it_dataset

Output:

    DatasetDict({
        train: Dataset({
            features: ['title', 'paragraphs'],
            num_rows: 442
        })
        test: Dataset({
            features: ['title', 'paragraphs'],
            num_rows: 48
        })
    })

This is exactly what we wanted. Now we can apply various preprocessing techniques to clean up the data, tokenize the questions and contexts, and so on.
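
When a directory holds several split files, the split-to-file dictionary can also be built from the file names instead of being typed by hand. This is a minimal sketch under the assumption that files follow the `SQuAD_it-<split>.json.gz` naming used here; `build_data_files` is a hypothetical helper, and the demo runs on empty placeholder files rather than the real data.

```python
import tempfile
from pathlib import Path


def build_data_files(directory, prefix="SQuAD_it-", suffix=".json.gz"):
    """Map split names to file paths, based on the file names (hypothetical helper)."""
    data_files = {}
    for path in sorted(Path(directory).glob(f"{prefix}*{suffix}")):
        # "SQuAD_it-train.json.gz" -> "train"
        split = path.name[len(prefix):-len(suffix)]
        data_files[split] = str(path)
    return data_files


# Demonstrate on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    for split in ("train", "test"):
        (Path(tmp_dir) / f"SQuAD_it-{split}.json.gz").touch()
    data_files = build_data_files(tmp_dir)

print(sorted(data_files))  # ['test', 'train']
```

Calling `build_data_files("data")` on the real download directory would give a dictionary that can be passed to the `data_files` argument of `load_dataset()`.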

If you’re working as a data scientist or coder in a company, there’s a good chance the datasets you want to analyze are stored on a remote server. Fortunately, loading remote files is just as simple as loading local ones. Instead of providing a path to local files, we point the `data_files` argument of `load_dataset()` to one or more URLs where the remote files are stored. For example, for the SQuAD-it dataset hosted on GitHub, we can point `data_files` to the `SQuAD_it-*.json.gz` URLs as follows:

    url = "https://github.com/crux82/squad-it/raw/master/"
    data_files = {
        "train": url + "SQuAD_it-train.json.gz",
        "test": url + "SQuAD_it-test.json.gz",
    }
    squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

In the run recorded for this section, this call stopped with `DatasetGenerationError: An error occurred while generating the dataset`. If you hit the same error, download the two files and load them from disk as in the local example above.
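
The URL dictionary can also be assembled with the standard library, which makes it easy to sanity-check each URL before handing it to `load_dataset()`. The sketch below makes no network request; `remote_data_files` is a hypothetical helper, and the file naming pattern is assumed from the two URLs used in this section.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "https://github.com/crux82/squad-it/raw/master/"


def remote_data_files(base_url, splits=("train", "test")):
    """Build a split -> URL mapping for the SQuAD-it files (hypothetical helper)."""
    data_files = {}
    for split in splits:
        url = urljoin(base_url, f"SQuAD_it-{split}.json.gz")
        parsed = urlparse(url)
        # Reject anything that is not a fully qualified HTTP(S) URL.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"Not a valid remote URL: {url!r}")
        data_files[split] = url
    return data_files


data_files = remote_data_files(BASE_URL)
print(data_files["train"])
# https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
```

Note that `urljoin` relies on the trailing slash in `BASE_URL`; without it, the last path segment would be replaced rather than extended.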