In [1]:
from importlib.metadata import version
print(f"You are building with dataset.sh version: {version('dataset_sh')}\n")

You are building with dataset.sh version: 0.0.34.post1



## How to convert a JSON file to a dataset using dataset.sh

In this tutorial, we’ll walk through how to convert JSON data into a `dataset.sh` dataset.

We’ll cover the following steps:

1. Get the JSON data – Fetch or load your raw JSON content.
2. Define the data type – Describe the shape and structure of the data using an `easytype` type annotation.
3. Create and save the dataset – Convert the JSON into a standardized dataset format and save it locally.
4. (Optional) Upload the dataset to a remote server – Share your dataset by pushing it to a hosted location.


In [2]:
from IPython.display import Markdown, display
import pandas as pd

dataset_readme = """

This dataset contains **ISO 639-1** codes: the international standard for two-letter codes that represent the names of languages.

This dataset provides information on languages, including:

- **ISO Language Code** (`code`): A unique, two-letter code representing the language, following the ISO 639-1 standard.
- **Language Name in English** (`name`): The name of the language in English.
- **Native Name** (`native`): The name of the language as written in its native script or language form.
"""

display(Markdown(dataset_readme))



This dataset contains **ISO 639-1** codes: the international standard for two-letter codes that represent the names of languages.

This dataset provides information on languages, including:

- **ISO Language Code** (`code`): A unique, two-letter code representing the language, following the ISO 639-1 standard.
- **Language Name in English** (`name`): The name of the language in English.
- **Native Name** (`native`): The name of the language as written in its native script or language form.


## Download and parse the JSON data


In [4]:
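import requests

# Fetch the ISO 639-1 language list from a pinned gist revision.
languages = requests.get('https://gist.githubusercontent.com/joshuabaker/d2775b5ada7d1601bcd7b31cb4081981/raw/fb71e8ff9d7d970899d690fe23351601c5b70f04/languages.json')
languages.raise_for_status()  # stop here if the download failed
codes = languages.json()  # parse the response body into a list of dicts
codes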

[{'code': 'aa', 'name': 'Afar', 'native': 'Afar'},
 {'code': 'ab', 'name': 'Abkhazian', 'native': 'Аҧсуа'},
 {'code': 'af', 'name': 'Afrikaans', 'native': 'Afrikaans'},
 {'code': 'ak', 'name': 'Akan', 'native': 'Akana'},
 {'code': 'am', 'name': 'Amharic', 'native': 'አማርኛ'},
 {'code': 'an', 'name': 'Aragonese', 'native': 'Aragonés'},
 {'code': 'ar', 'name': 'Arabic', 'native': 'العربية', 'rtl': 1},
 {'code': 'as', 'name': 'Assamese', 'native': 'অসমীয়া'},
 {'code': 'av', 'name': 'Avar', 'native': 'Авар'},
 {'code': 'ay', 'name': 'Aymara', 'native': 'Aymar'},
 {'code': 'az', 'name': 'Azerbaijani', 'native': 'Azərbaycanca'},
 {'code': 'ba', 'name': 'Bashkir', 'native': 'Башҡорт'},
 {'code': 'be', 'name': 'Belarusian', 'native': 'Беларуская'},
 {'code': 'bg', 'name': 'Bulgarian', 'native': 'Български'},
 {'code': 'bh', 'name': 'Bihari', 'native': 'भोजपुरी'},
 {'code': 'bi', 'name': 'Bislama', 'native': 'Bislama'},
 {'code': 'bm', 'name': 'Bambara', 'native': 'Bamanankan'},
 {'code': 'bn', 
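
Before importing, it can help to confirm that every record has the three string fields the rest of this tutorial relies on. The sketch below is a minimal, standard-library check. `check_records` and `REQUIRED_FIELDS` are hypothetical names introduced here, not part of `dataset.sh` or `easytype`, and the sample records are copied from the output above plus one deliberately broken entry.

```python
# Field names taken from the JSON shown above; the helper itself is hypothetical.
REQUIRED_FIELDS = ("code", "name", "native")


def check_records(records):
    """Return a list of messages describing records with missing or non-string fields."""
    problems = []
    for index, record in enumerate(records):
        for field in REQUIRED_FIELDS:
            if not isinstance(record.get(field), str):
                problems.append(f"record {index}: '{field}' is missing or not a string")
    return problems


# Two records copied from the output above, plus one deliberately broken record.
sample = [
    {"code": "aa", "name": "Afar", "native": "Afar"},
    {"code": "ar", "name": "Arabic", "native": "العربية", "rtl": 1},
    {"code": "xx", "name": "Broken"},
]

problems = check_records(sample)
print(problems)
```

In the notebook you would call `check_records(codes)`; an empty list means every record has all three fields.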

## Define the data types

Dataset creators should document the structure and contents of their datasets by defining types, so that developers who use the dataset can understand its records quickly and accurately.

A clear type definition also reduces errors and makes downstream integration smoother, which saves time for everyone who works with the dataset.


In [5]:
from easytype import TypeBuilder
LanguageInfo = TypeBuilder.create(
    'LanguageInfo',
    code=str,
    name=str,
    native=str,
)
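
The Arabic record shown earlier carries an `rtl` key that `LanguageInfo` does not declare. This notebook imports the data as is. If you prefer records that match the declared fields exactly, one option is a small projection step before the import. The sketch below uses only plain Python; `DECLARED_FIELDS` and `keep_declared_fields` are hypothetical names, and nothing here is required by `easytype` or `dataset.sh` as far as this tutorial shows.

```python
# The three fields declared in LanguageInfo above.
DECLARED_FIELDS = ("code", "name", "native")


def keep_declared_fields(record):
    """Return a copy of the record containing only the declared fields."""
    return {field: record[field] for field in DECLARED_FIELDS}


# Record copied from the JSON output earlier in this notebook.
arabic = {"code": "ar", "name": "Arabic", "native": "العربية", "rtl": 1}

trimmed = keep_declared_fields(arabic)
print(trimmed)
```

Applied to the whole list, this would be `[keep_declared_fields(r) for r in codes]`. The function raises `KeyError` if a declared field is missing, which surfaces incomplete records early.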

## Create and save the dataset

In [7]:
import dataset_sh as dsh

new_dataset = dsh.dataset('iso/language-639-1')
new_dataset.set_readme(dataset_readme)

latest_version = new_dataset.import_collection(
    codes,
    type_annotation=LanguageInfo, 
    description='my first commit.'
)

latest_version

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 185/185 [00:00<00:00, 388750.62it/s]


iso/language-639-1:version=cc1b5c883184985c5a9d886b0afaaa18ea64ac7b3ab1c8416e06516c6088ba2b
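
The value printed above combines the dataset name with a version identifier. If you need the two parts separately, for example to record the identifier in a log, the printed text can be split as sketched below. This assumes the `name:version=id` layout seen in the output above, which is inferred from this notebook only; `split_version_ref` is a hypothetical helper, and the `dataset.sh` object may well expose these values directly.

```python
def split_version_ref(ref):
    """Split a 'name:version=id' string into (name, version_id)."""
    name, separator, version_id = ref.partition(":version=")
    if not separator:
        raise ValueError(f"no ':version=' marker in {ref!r}")
    return name, version_id


# String copied from the cell output above.
ref = "iso/language-639-1:version=cc1b5c883184985c5a9d886b0afaaa18ea64ac7b3ab1c8416e06516c6088ba2b"

name, version_id = split_version_ref(ref)
print(name)
print(len(version_id))
```

For the string above, the identifier is 64 hexadecimal characters long.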

## (Optional) Upload the dataset to a remote server

To upload the dataset to a remote host, first create a profile for that host with the following command:

```shell
dataset.sh profile add
```




In [8]:
# You can also use a specific profile:
# rmt = dsh.remote(profile='profile_name')
rmt = dsh.remote()  # use the default profile

remote_dataset = rmt.dataset('haowu4/language-639-1')
remote_dataset

haowu4/language-639-1 at https://base.dataset.sh

In [9]:
latest_version.upload_as_latest_to(remote_dataset)

Uploading haowu4/language-639-1: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% Completed
