# Dataset Uploader for Hugging Face Hub
## Objective
This notebook takes the final, curated dataset files from Google Drive and publishes them to a new dataset repository on the Hugging Face Hub. It uploads both the core, structured parallel corpus (`authentic_corpus_final.jsonl`) and the training-ready version (`bidirectional_corpus_final.jsonl`), making both assets publicly available.

## Methodology
The script uses the `huggingface_hub` library to programmatically manage the upload process. It performs the following steps:

1. **Authentication:** Logs in to a Hugging Face account with an access token read from the Colab secret `HF_TOKEN`.
2. **Repository Creation:** Creates a single dataset repository on the Hub, or reuses it if it already exists (`exist_ok=True`).
3. **File Upload:** Sequentially uploads both specified `.jsonl` files from their local Google Drive paths to the same Hub repository.
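
The repository creation and upload steps above can be condensed into one helper. This is a minimal sketch rather than the notebook's own code: the function name `publish_files` is hypothetical, and it assumes `api` is an already authenticated `HfApi` instance.

```python
import os
from typing import Iterable, List


def publish_files(api, repo_id: str, local_paths: Iterable[str]) -> List[str]:
    """Create the dataset repo if needed, then upload each file to its root.

    `api` is assumed to behave like an authenticated huggingface_hub.HfApi.
    Returns the file names as they are stored in the repository.
    """
    # exist_ok=True makes the call safe to repeat on later runs.
    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)

    uploaded = []
    for local_path in local_paths:
        filename = os.path.basename(local_path)
        api.upload_file(
            path_or_fileobj=local_path,
            path_in_repo=filename,  # keep the local file name at the repo root
            repo_id=repo_id,
            repo_type="dataset",
        )
        uploaded.append(filename)
    return uploaded
```

Passing the API object in as an argument keeps the helper easy to exercise with a stand-in object before any real upload is attempted.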

## Workflow
1. Mounts Google Drive to access the final corpus files.
2. Installs the `huggingface_hub` library.
3. Logs in to Hugging Face using the `HF_TOKEN` secret stored in Colab.
4. Creates the dataset repository on the Hub under the configured name, if it does not already exist.
5. Uploads `authentic_corpus_final.jsonl`.
6. Uploads `bidirectional_corpus_final.jsonl` to the same repository.

## Input & Output
* **Input:** Two `.jsonl` files located in Google Drive.
* **Output:** A new, public dataset repository on the Hugging Face Hub containing both of the uploaded `.jsonl` files.
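
Because the uploaded files become public, it may be worth confirming that each input is well-formed JSON Lines before publishing. The following is a sketch under stated assumptions: the helper `count_jsonl_records` is hypothetical and not part of the notebook, and it checks only that every non-blank line parses as JSON, since the record schema is not described here.

```python
import json


def count_jsonl_records(path: str) -> int:
    """Return the number of records in a JSON Lines file.

    Blank lines are skipped. Raises ValueError naming the first line
    that does not parse as JSON.
    """
    count = 0
    with open(path, "r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(
                    f"{path}: line {line_number} is not valid JSON"
                ) from exc
            count += 1
    return count
```

Calling it on each of the two Drive paths before the upload cell would surface a truncated or corrupted file early.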

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
# Install the library to interact with the Hub
!pip install -q huggingface_hub

In [None]:
from huggingface_hub import login, HfApi
import os

In [None]:
# Authenticate with Hugging Face using the HF_TOKEN secret stored in Colab
from google.colab import userdata
huggingface_token = userdata.get('HF_TOKEN')
login(token=huggingface_token)
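
The cell above relies on Colab's secret store. As a hedged sketch for running the same steps outside Colab, the token lookup could fall back to an environment variable. The helper name `resolve_hf_token` and the environment-variable fallback are assumptions, not part of the original notebook.

```python
import os


def resolve_hf_token(name: str = "HF_TOKEN") -> str:
    """Look up the access token in Colab secrets, then in the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        userdata = None

    token = None
    if userdata is not None:
        try:
            token = userdata.get(name)
        except Exception:
            # The secret is missing or notebook access was not granted.
            token = None

    # Assumed fallback for non-Colab environments.
    token = token or os.environ.get(name)
    if not token:
        raise RuntimeError(f"No Hugging Face token found under '{name}'.")
    return token
```

The returned value can then be passed to `login(token=...)` exactly as in the cell above.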

In [None]:
# --- Configuration ---
# Hugging Face username
HF_USERNAME = "abhinandansamal"  # Replace with your own Hugging Face username

# The name of the dataset repository
DATASET_REPO_ID = f"{HF_USERNAME}/bidirectional_odia_german_translation_parallel_corpus"

# The local path in Google Drive where the datasets are saved
LOCAL_PATH_AUTHENTIC_CORPUS = "/content/drive/MyDrive/Thesis/data/transformed/authentic_corpus_final.jsonl"
LOCAL_PATH_BIDIRECTIONAL_CORPUS = "/content/drive/MyDrive/Thesis/data/transformed/bidirectional_corpus_final.jsonl"

# --- Initialize HuggingFace API ---
api = HfApi()

In [None]:
# Create the dataset repository if it doesn't exist (using exist_ok=True)
# This step ensures the repo is ready for upload.
print(f"Ensuring dataset repository exists: {DATASET_REPO_ID}")
api.create_repo(
    repo_id=DATASET_REPO_ID,
    repo_type="dataset",  # Must be set explicitly; the default repo type is "model"
    exist_ok=True
)
print(f"Dataset repository ready at: https://huggingface.co/datasets/{DATASET_REPO_ID}")

Ensuring dataset repository exists: abhinandansamal/bidirectional_odia_german_translation_parallel_corpus
Dataset repository ready at: https://huggingface.co/datasets/abhinandansamal/bidirectional_odia_german_translation_parallel_corpus


In [None]:
# Upload both data files to the root of the dataset repository
for local_path in (LOCAL_PATH_AUTHENTIC_CORPUS, LOCAL_PATH_BIDIRECTIONAL_CORPUS):
    filename = os.path.basename(local_path)
    print(f"\nUploading {filename} to the Hub...")
    api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=filename,  # keep the local file name in the repo
        repo_id=DATASET_REPO_ID,
        repo_type="dataset",
    )
    print(f"Successfully uploaded: {filename}")

print("\n\n✅ Both data files have been successfully uploaded to your Hugging Face Dataset repository!")
print(f"You can view them under 'Files and versions' here: https://huggingface.co/datasets/{DATASET_REPO_ID}/tree/main")


Uploading authentic_corpus_final.jsonl to the Hub...
Successfully uploaded: authentic_corpus_final.jsonl

Uploading bidirectional_corpus_final.jsonl to the Hub...
Successfully uploaded: bidirectional_corpus_final.jsonl


✅ Both data files have been successfully uploaded to your Hugging Face Dataset repository!
You can view them under 'Files and versions' here: https://huggingface.co/datasets/abhinandansamal/bidirectional_odia_german_translation_parallel_corpus/tree/main
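
Beyond the printed messages, the upload can be confirmed by listing the files now present in the repository. This is a minimal sketch: the helper name `find_missing_files` is hypothetical, and it assumes `api` is an `HfApi` instance with read access to the repository.

```python
from typing import Iterable, List


def find_missing_files(api, repo_id: str, expected: Iterable[str]) -> List[str]:
    """Return the expected file names that are absent from the dataset repo."""
    present = set(api.list_repo_files(repo_id=repo_id, repo_type="dataset"))
    return [name for name in expected if name not in present]
```

For this notebook, `find_missing_files(api, DATASET_REPO_ID, ["authentic_corpus_final.jsonl", "bidirectional_corpus_final.jsonl"])` should return an empty list once both uploads have completed.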
