The load_dataset function loads a dataset from the Hugging Face Hub or from local files. Its arguments customize the loading process, and the table below explains each one:

| Argument | Description |
| --- | --- |
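| path | The path or name of the dataset to load. It can be a dataset identifier on the Hub, a local directory, or the name of a packaged builder such as csv, json, parquet, or text (used together with data_files). A bare URL is not a valid path. |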
| name | The name of the configuration to use when loading a dataset with multiple configurations. |
| data_dir | The directory where the data files are stored. It can be used to load a dataset from a specific subdirectory of the path. |
| data_files | The data files to load. It can be a single file, a list of files, or a dictionary that maps splits to files. |
| split | The split or splits to load. It can be a single split, a list of splits, or a slice expression. |
| cache_dir | The directory where the cached data are stored. It can be used to specify a custom location for the cache files. |
| features | The features of the dataset. It can be used to specify the schema of the data, such as the column names and types. |
| download_config | A DownloadConfig object with parameters for the download process, such as the cache directory, proxies, or the maximum number of retries. |
| download_mode | The download mode to use. It can be one of FORCE_REDOWNLOAD, REUSE_CACHE_IF_EXISTS, or REUSE_DATASET_IF_EXISTS. |
| verification_mode | The verification mode, which determines the checks (checksums, sizes, splits) to run on the downloaded data. It can be one of BASIC_CHECKS, ALL_CHECKS, or NO_CHECKS. |
| ignore_verifications | A deprecated argument replaced by verification_mode. Setting it to True is equivalent to setting verification_mode to NO_CHECKS. |
| keep_in_memory | Whether to load the dataset in memory or not. It can be used to improve performance by avoiding disk I/O. |
| save_infos | Whether to save the dataset information, such as checksums, sizes, and splits. |
| revision | The version of the dataset to load. It can be a git commit hash, a branch name, or a tag name. |
| token | The token to use for authentication when loading a private dataset from the Hub. It can be a string or a boolean. |
| use_auth_token | A deprecated argument that has been replaced by token. |
| task | A deprecated argument that was used to prepare the dataset for a given task by casting its features to a standard task template. |
| streaming | Whether to load the dataset in streaming mode or not. It can be used to load large datasets without downloading or caching them. |
| num_proc | The number of processes to use when downloading and generating the dataset locally. Multiprocessing is disabled by default. |
| storage_options | The options to pass to the backend file system when loading data from remote sources. It can be used to specify credentials or other parameters. |
| **config_kwargs | Additional keyword arguments to pass to the builder configuration of the dataset. It can be used to customize the dataset loading script. |




- To load a dataset from the Hugging Face Hub with a specific configuration and split:

```python
from datasets import load_dataset
dataset = load_dataset("glue", "mrpc", split="train")
```

- To load a dataset from a local directory with multiple data files for different splits:

```python
from datasets import load_dataset
data_files = {"train": "train.csv", "test": "test.csv"}
dataset = load_dataset("path/to/my_dataset", data_files=data_files)
```

- To load a data file from a URL with custom features and download options (the URL is a placeholder; because a bare URL is not a valid path, it is passed through data_files to a packaged builder):

```python
from datasets import load_dataset, Features, Value, DownloadConfig
features = Features({"text": Value("string"), "label": Value("int32")})
# Retry a failed download up to 3 times
download_config = DownloadConfig(max_retries=3)
dataset = load_dataset("csv", data_files="https://example.com/my_dataset.csv", features=features, download_config=download_config)
```



- To stream a specific revision of a private dataset from the Hub with an access token:

```python
from datasets import load_dataset
dataset = load_dataset("namespace/private_dataset", streaming=True, revision="main", token="my_token")
```

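- To load a dataset with parallel processing (any additional keyword arguments are forwarded to the builder configuration and must be parameters that the dataset's builder defines):

```python
from datasets import load_dataset
# num_proc=4 uses four processes to download and prepare the dataset
dataset = load_dataset("squad", num_proc=4)
```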

Here are some more code snippets that show how to use the load_dataset function with different file formats:

- To load a dataset from a JSON file whose records are nested under a specific field:

```python
from datasets import load_dataset
dataset = load_dataset("json", data_files="my_data.json", field="data")
```

- To load a dataset from a text file with one example per line:

```python
from datasets import load_dataset
dataset = load_dataset("text", data_files="my_data.txt")
```

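- To load a dataset from a pandas DataFrame that is already in memory (the pandas builder of load_dataset reads pickled DataFrames from files, so Dataset.from_pandas is used instead):

```python
from datasets import Dataset
import pandas as pd
df = pd.read_csv("my_data.csv")
dataset = Dataset.from_pandas(df)
```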

- To load a dataset from a CSV file with a header row and a delimiter:

```python
from datasets import load_dataset
# header=0 means the first row holds the column names, as in pandas.read_csv
dataset = load_dataset("csv", data_files="my_data.csv", delimiter=",", header=0)
```

- To load a dataset from multiple JSON files in a folder:

```python
from datasets import load_dataset
dataset = load_dataset("json", data_files="path/to/my_folder/*.json", field="data")
```


In [1]:
!pip install -q -U datasets

In [6]:
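from datasets import load_dataset

# Load every split of the dataset from the Hub as a DatasetDict
dataset = load_dataset("fka/awesome-chatgpt-prompts")

# Collect the split names, then the column names of the first split
split = list(dataset.keys())
columns = dataset[split[0]].column_names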


In [None]:
from datasets import load_dataset
import os

# Get a list of all csv files in the directory
folder_path = "/content/sample_data/"
csv_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]

# Create a dictionary with file labels as keys (train, test, etc.) and file paths as values
data_files = {os.path.splitext(file)[0]: os.path.join(folder_path, file) for file in csv_files}

# Load the dataset
dataset = load_dataset('csv', data_files=data_files)
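
The cell above uses each file name as a split name. When file names only contain the split name (for example a `_train` or `_test` suffix), the files can instead be grouped by that token. The helper below is a hypothetical sketch using only the standard library; `build_data_files` and the demo file names are assumptions, and it presumes that all files grouped together share the same columns.

```python
import os
import re
import tempfile


def build_data_files(folder_path, extension=".csv", split_names=("train", "validation", "test")):
    """Group files in `folder_path` by the split name that appears as a token in their file name."""
    data_files = {}
    for file_name in sorted(os.listdir(folder_path)):
        if not file_name.endswith(extension):
            continue
        stem = os.path.splitext(file_name)[0]
        # Split the stem on "_", "-" and "." so that "train" matches only as a whole token.
        tokens = re.split(r"[_\-.]+", stem)
        for split_name in split_names:
            if split_name in tokens:
                data_files.setdefault(split_name, []).append(os.path.join(folder_path, file_name))
                break
    return data_files


# Demo on a temporary folder with made-up file names.
demo_folder = tempfile.mkdtemp()
for file_name in ["housing_train.csv", "housing_test.csv", "README.md"]:
    with open(os.path.join(demo_folder, file_name), "w") as f:
        f.write("placeholder\n")

data_files = build_data_files(demo_folder)
print(data_files)
```

The resulting mapping has the shape that the data_files argument accepts, so it could be passed as `load_dataset('csv', data_files=data_files)`.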


In [None]:
import datasets

# List of datasets to load
datasets_to_load = [
    "ag_news",
    "civil_comments",
    "dbpedia_14",
    "emotion",
    "gss",
    "hate_speech_offensive",
    "imdb",
    "quora_duplicate_questions",
    "sms_spam",
    "trec",
]

# Load the datasets
datasets_loaded = [datasets.load_dataset(dataset_name) for dataset_name in datasets_to_load]
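
A list comprehension like the one above stops at the first identifier that fails to load. One possible alternative is to collect failures and continue. The sketch below is hypothetical: `load_many` is not part of the `datasets` library, and `fake_loader` stands in for `datasets.load_dataset` so the example runs without network access.

```python
def load_many(names, loader):
    """Call `loader` on each name, returning the loaded objects and the errors separately."""
    loaded, failed = {}, {}
    for name in names:
        try:
            loaded[name] = loader(name)
        except Exception as error:
            # Record the failure and keep going with the remaining names.
            failed[name] = error
    return loaded, failed


def fake_loader(name):
    """Stand-in for datasets.load_dataset that fails for one made-up name."""
    if name == "missing_dataset":
        raise FileNotFoundError(f"Couldn't find {name}")
    return {"name": name}


loaded, failed = load_many(["imdb", "missing_dataset", "trec"], fake_loader)
print(list(loaded))
print(list(failed))
```

In the notebook, the call would presumably look like `load_many(datasets_to_load, datasets.load_dataset)`.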

In [4]:
from datasets import load_dataset

# Define a dictionary of data type to loader function mappings
loaders = {
    "remote": load_dataset,  # a dataset identifier on the Hub
    "json": lambda source, **kwargs: load_dataset("json", data_files=source, **kwargs),
    "csv": lambda source, **kwargs: load_dataset("csv", data_files=source, **kwargs),
    "parquet": lambda source, **kwargs: load_dataset("parquet", data_files=source, **kwargs),
    "text": lambda source, **kwargs: load_dataset("text", data_files=source, **kwargs),
    # The pandas builder expects the path of a pickled DataFrame, not a DataFrame object
    "pandas": lambda source, **kwargs: load_dataset("pandas", data_files={'train': source}, **kwargs)
}

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from typing import Any, Dict, Optional, Union
from datasets import load_dataset

def load_dataset_from_source(
    source: Union[str, Dict[str, str]],
    data_type: Optional[str] = None,  # 'json', 'csv', 'parquet', 'text', etc.
    split: str = "train",
    **kwargs: Any
):
    """
    Loads a dataset from a specified source.

    :param source: A path or URL to a data file, a dataset identifier on the Hub,
                   or a mapping of split names to paths or URLs.
    :param data_type: The format of the dataset (e.g., 'json', 'csv', 'parquet', 'text'),
                      'remote' for a Hub identifier loaded with `split`, or '' for a Hub
                      identifier loaded with all of its splits.
                      If None, it is inferred from the file extension.
    :param split: The dataset split to load. Defaults to 'train' if not specified.
    :param kwargs: Additional keyword arguments for specific data types or custom configurations.
    :return: A `Dataset` or `DatasetDict` object.
    """

    # Infer the data type from the source if not provided.
    # Test against None so that an explicit data_type="" still reaches its branch below.
    if data_type is None:
        if isinstance(source, str):
            sample = source
        elif isinstance(source, dict):
            # Use the first file in the dictionary to infer the data type,
            # but keep the whole mapping as the source to load
            sample = source[next(iter(source))]
        else:
            raise ValueError("Source must be a string or a dictionary mapping split names to paths or URLs.")

        if sample.endswith(".json"):
            data_type = "json"
        elif sample.endswith(".csv"):
            data_type = "csv"
        elif sample.endswith(".parquet"):
            data_type = "parquet"
        elif sample.endswith(".txt"):
            data_type = "text"
        else:
            # A bare URL is not a valid `path` for load_dataset, so it is not treated as "remote"
            raise ValueError("Unable to infer the data type from the source. Please specify the data_type.")

    # Load the dataset based on the inferred or provided data type
    if data_type == "remote":
        dataset = load_dataset(source, split=split, **kwargs)
    elif data_type == "":
        dataset = load_dataset(source)
    else:
        dataset = load_dataset(data_type, data_files=source, split=split, **kwargs)

    return dataset

In [None]:
load_dataset("https://huggingface.co/datasets/fka/awesome-chatgpt-prompts/blob/main/prompt",data_type='remote')

FileNotFoundError: Couldn't find a dataset script at /content/https:/huggingface.co/datasets/fka/awesome-chatgpt-prompts/blob/main/prompt/prompt.py or any data file in the same directory.
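
The FileNotFoundError above is consistent with load_dataset treating the URL as a local path: its first argument is expected to be a Hub identifier, a local directory, or a builder name, and data_type is not one of its parameters. One way around this is to recover the identifier from the URL. The helper below is a hypothetical sketch using only the standard library; `hub_url_to_repo_id` is an assumed name, and it assumes a dataset URL of the form `https://huggingface.co/datasets/<namespace>/<name>/...`.

```python
from urllib.parse import urlparse


def hub_url_to_repo_id(url):
    """Extract "<namespace>/<name>" from a https://huggingface.co/datasets/... URL."""
    parsed = urlparse(url)
    parts = [part for part in parsed.path.split("/") if part]
    # Expected path layout: datasets/<namespace>/<name>[/blob/<revision>/...]
    if parsed.netloc != "huggingface.co" or len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"Not a namespaced Hugging Face dataset URL: {url}")
    return "/".join(parts[1:3])


repo_id = hub_url_to_repo_id(
    "https://huggingface.co/datasets/fka/awesome-chatgpt-prompts/blob/main/prompt"
)
print(repo_id)  # fka/awesome-chatgpt-prompts
```

The resulting identifier can then be passed directly, as in `load_dataset(repo_id)`.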

In [None]:


# Test 2: Load dataset from Hugging Face with dataset identifier.
dataset2 = load_dataset_from_source(
    "fka/awesome-chatgpt-prompts",data_type=""
)
print(dataset2)


# Test 3: Load dataset from a local CSV file.
dataset3 = load_dataset_from_source(
    "/content/sample_data/california_housing_test.csv",
    data_type="csv"
)
print(dataset3)

# Test 4: Load dataset from a local JSON file.
dataset4 = load_dataset_from_source(
    "/content/sample_data/anscombe.json",
    data_type="json"
)
print(dataset4)

In [None]:
dataset3.column_names

['longitude',
 'latitude',
 'housing_median_age',
 'total_rooms',
 'total_bedrooms',
 'population',
 'households',
 'median_income',
 'median_house_value']

In [None]:
dataset4.column_names

['Series', 'Y', 'X']

In [None]:
# Quote the path: it contains spaces and an apostrophe, which the shell would otherwise misparse
!cat "/content/drive/MyDrive/Github repo adversarical attack and defence's .gdoc"
