In [None]:
!python --version

In [None]:
# NLP Purpose
!pip install "transformers[sentencepiece]" datasets

## **Using Pretrained Models**

Let’s say we’re looking for a French-based model that can perform mask filling.

We select the `camembert-base` checkpoint to try it out. The identifier `camembert-base` is all we need to start using it!

In [None]:
from transformers import pipeline

camembert_fill_mask = pipeline("fill-mask", model="camembert-base")
results = camembert_fill_mask("Le camembert est <mask> :)")

As you can see, loading a model within a pipeline is extremely simple. The only thing you need to watch out for is that the **chosen checkpoint is suitable for the task it’s going to be used for**.
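
The `results` returned by a `fill-mask` pipeline are a list of dictionaries, one per candidate token, with keys such as `score`, `token_str`, and `sequence`. Below is a minimal sketch of how one might summarise them. The helper name `summarize_predictions` is hypothetical, and the sample entries are made-up placeholders so that the snippet runs without downloading the model; real tokens and scores will differ.

```python
def summarize_predictions(results, top_k=3):
    """Return (token, rounded score) pairs for the highest-scoring candidates."""
    # Sort explicitly rather than relying on the order the pipeline returns.
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"], round(r["score"], 3)) for r in ranked[:top_k]]


# Placeholder data shaped like fill-mask output (NOT real model predictions).
sample_results = [
    {"score": 0.25, "token_str": "bon", "sequence": "Le camembert est bon :)"},
    {"score": 0.5, "token_str": "délicieux", "sequence": "Le camembert est délicieux :)"},
    {"score": 0.125, "token_str": "excellent", "sequence": "Le camembert est excellent :)"},
]

for token, score in summarize_predictions(sample_results, top_k=2):
    print(f"{token}: {score}")
```

With the real pipeline, pass the `results` variable from the cell above instead of `sample_results`.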

We recommend using the task selector in the Hugging Face Hub interface to select appropriate checkpoints.

You can also instantiate the checkpoint using the model architecture directly:

In [None]:
from transformers import CamembertTokenizer, CamembertForMaskedLM

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = CamembertForMaskedLM.from_pretrained("camembert-base")

However, we recommend using the `Auto*` classes instead, as they are architecture-agnostic by design. While the previous code sample limits users to checkpoints loadable in the CamemBERT architecture, using the `Auto*` classes makes switching checkpoints simple:

In [None]:
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = AutoModelForMaskedLM.from_pretrained("camembert-base")

When using a pretrained model, make sure to **check how it was trained, on which datasets, its limits, and its biases.**

## **Sharing Pretrained Models**
There are three ways to go about creating new model repositories:

- Using the `push_to_hub` API
- Using the `huggingface_hub` Python library
- Using the web interface

Once you’ve created a repository, you can upload files to it via git and git-lfs.

### **Using `push_to_hub` API**

1. **Authenticate** (recent `huggingface_hub` releases prompt for a user access token rather than a *username* and *password*)

In [None]:
from huggingface_hub import notebook_login

notebook_login()

# --- terminal
# huggingface-cli login

2. If using the `Trainer`, we set `push_to_hub=True` inside `TrainingArguments`

In [None]:
from transformers import TrainingArguments

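training_args = TrainingArguments(
    "bert-finetuned-mrpc",    # output directory, also used as the repository name
    save_strategy="epoch",    # save (and therefore upload) at the end of every epoch
    push_to_hub=True,         # upload the model to the Hub each time it is saved
    # hub_model_id="a_different_name",  # optional: override the repository name
)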

When you call `trainer.train()`, the `Trainer` will then upload your model to the Hub each time it is saved (here every epoch) in a repository in your namespace.

That repository will be named like the output directory you picked (here `bert-finetuned-mrpc`) but you can choose a different name with **`hub_model_id = "a_different_name"`.**
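
The naming rule can be sketched in plain Python. This is only an illustration of the behaviour described above, not the `Trainer`'s actual code, and `resolve_hub_repo_name` is a hypothetical helper.

```python
from pathlib import Path


def resolve_hub_repo_name(output_dir, hub_model_id=None):
    """Use hub_model_id when it is given, otherwise the output directory's name."""
    if hub_model_id is not None:
        return hub_model_id
    # Only the last path component is used, e.g. "runs/my-model" -> "my-model".
    return Path(output_dir).name


print(resolve_hub_repo_name("bert-finetuned-mrpc"))
print(resolve_hub_repo_name("bert-finetuned-mrpc", hub_model_id="a_different_name"))
```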

Once your training is finished, you should do a final `trainer.push_to_hub()`:

In [None]:
# After training finished, run this
trainer.push_to_hub()

The call above **uploads the last version of the model**. It also generates a model card with all the relevant metadata, reporting the hyperparameters used and the evaluation results!

3. If not using the `Trainer`, we can also call `push_to_hub()` directly on models, tokenizers, and configurations

In [None]:
from transformers import AutoModelForMaskedLM, AutoTokenizer

checkpoint = "camembert-base"

model = AutoModelForMaskedLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

We can do whatever we want with these (add tokens, change labels, fine-tune, etc.), then push them to the Hub:

In [None]:
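repo_name = "repo_location"

model.push_to_hub(repo_name)
tokenizer.push_to_hub(repo_name)
# Older transformers releases also accepted the arguments below; newer ones expect
# "<organization>/<repo_name>" as the repository id and `token=` for the token.
# tokenizer.push_to_hub(repo_name, organization="")
# tokenizer.push_to_hub(repo_name, organization="", use_auth_token="<TOKEN>")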

### **Using the `huggingface_hub` Python library**
The `huggingface_hub` Python library is a package which offers a set of tools for the model and datasets hubs.
- **It provides simple methods and classes for common tasks** (information about repositories on the hub and managing them).
- **It provides simple APIs** that work on top of git to manage those repositories’ content and to integrate the Hub in your projects and libraries.

1. **Authenticate**

In [None]:
# Run in a terminal
huggingface-cli login

# If running in a notebook such as Google Colab, prefix the command with "!"
# !huggingface-cli login

2. **Methods to create and manage repositories on the Hub**

In [None]:
from huggingface_hub import (
    # User management
    login,
    logout,
    whoami,

    # Repository creation and management
    create_repo,
    delete_repo,
    update_repo_visibility,

    # And some methods to retrieve/change information about the content
    list_models,
    list_datasets,
    list_metrics,
    list_repo_files,
    upload_file,
    delete_file,
)

---- `create_repo` ➡ create a new repository on the hub

In [None]:
from huggingface_hub import create_repo

create_repo("dummy-model")  # create the repo 'dummy-model' in your namespace
# Create a repo that belongs to an organization. The `organization` argument comes
# from older huggingface_hub releases; newer ones take "<organization>/dummy-model".
create_repo("dummy-model", organization="")

# Other arguments
## -- private ➡ visibility of the repository
## -- token ➡ override the stored token
## -- repo_type ➡ create a 'dataset' or 'space' instead of a model
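
The `repo_type` also determines where a repository appears on the Hub: model repositories sit directly under `https://huggingface.co/`, while datasets and Spaces live under `datasets/` and `spaces/`. Here is a small sketch of that mapping. `hub_repo_url` is a hypothetical helper (it is not part of `huggingface_hub`) and `my-namespace` is a placeholder.

```python
HUB_ENDPOINT = "https://huggingface.co"

# URL prefix used for each repository type.
URL_PREFIXES = {"model": "", "dataset": "datasets/", "space": "spaces/"}


def hub_repo_url(repo_id, repo_type="model"):
    """Build the web URL of a repository from its id and type."""
    if repo_type not in URL_PREFIXES:
        raise ValueError(f"Unknown repo_type: {repo_type!r}")
    return f"{HUB_ENDPOINT}/{URL_PREFIXES[repo_type]}{repo_id}"


print(hub_repo_url("my-namespace/dummy-model"))
print(hub_repo_url("my-namespace/dummy-model", repo_type="dataset"))
```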

### **Using the Web Interface**
The web interface offers tools to manage repositories directly in the Hub. Using the interface, you can easily create repositories, add files (even large ones!), explore models, visualize diffs, and much more.

### **Uploading the Model files**
The system to manage files on the Hugging Face Hub is based on **`git` for regular files**, and **`git-lfs`** (which stands for Git Large File Storage) **for larger files**.

#### **1. The `upload_file` approach**
This approach does not require git or git-lfs: it **pushes files directly to the Hub**.

A limitation of this approach is that it **doesn't handle files larger than 5GB**.

If your files are larger than 5GB, use one of the other two approaches described below. Otherwise:

In [None]:
from huggingface_hub import upload_file

upload_file(
    "<path_to_file>/config.json",
    path_in_repo="config.json",
    repo_id="<namespace>/dummy-model",
)

This uploads the file `config.json`, available at `<path_to_file>`, to the root of the `dummy-model` repository as `config.json`.
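
Because of the 5GB limit mentioned above, it can help to check a file's size before choosing an upload method. The sketch below uses only the standard library. `fits_upload_file` is a hypothetical helper, and reading "5GB" as `5 * 1024**3` bytes is an assumption.

```python
import os
import tempfile

# Assumption: "5GB" is interpreted here as 5 * 1024**3 bytes.
UPLOAD_FILE_LIMIT_BYTES = 5 * 1024**3


def fits_upload_file(path, limit_bytes=UPLOAD_FILE_LIMIT_BYTES):
    """Return True if the file at `path` is no larger than the limit."""
    return os.path.getsize(path) <= limit_bytes


# Demonstration with a tiny temporary file standing in for config.json.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "config.json")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("{}")
    small_file_ok = fits_upload_file(demo_path)
    # An artificially small limit shows the "too large" branch.
    tiny_limit_ok = fits_upload_file(demo_path, limit_bytes=1)

print(small_file_ok, tiny_limit_ok)
```

Files that fail this check are candidates for the git-based approaches instead.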

#### **2. The `Repository` class**
Using this class **requires having git and git-lfs** installed and set up before you begin.

---- **Initialising** the repository into a local folder **by cloning the remote repository**

In [None]:
from huggingface_hub import Repository

repo = Repository("<path_to_dummy_folder>", clone_from="<namespace>/dummy-model")

This creates the folder `<path_to_dummy_folder>` in our working directory. At this point it contains only the `.gitattributes` file.

From here, we can use the following methods:
- `repo.git_pull()`
- `repo.git_add()`
- `repo.git_commit()`
- `repo.git_push()`
- `repo.git_tag()`

---- `repo.git_pull()` [make sure the local clone is up to date with the latest changes]

In [None]:
repo.git_pull()

---- **save the model and tokenizer files**

In [None]:
model.save_pretrained("<path_to_dummy_folder>")
tokenizer.save_pretrained("<path_to_dummy_folder>")

The `<path_to_dummy_folder>` now contains all the model and tokenizer files.

Next, follow the usual git workflow (adding files to the staging area, committing, and pushing to the Hub):

In [None]:
repo.git_add()
repo.git_commit("Add model and tokenizer files")
repo.git_push()

#### **3. The `git-based` approach**
This approach requires having git and git-lfs installed and set up before you begin.

---- **Initializing `git-lfs`**

In [None]:
git lfs install

---- **Clone model repository**

In [None]:
git clone https://huggingface.co/<namespace>/<your-model-id>

## namespace     = your username (or organization)
## your-model-id = the name of your model repository

---- cd to the folder and look at the contents

In [None]:
cd dummy && ls

---- **Generate the files to commit** to the dummy repository

In [None]:
from transformers import AutoModelForMaskedLM, AutoTokenizer

checkpoint = "camembert-base"

model = AutoModelForMaskedLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Do whatever with the model, train it, fine-tune it...

model.save_pretrained("<path_to_dummy_folder>")
tokenizer.save_pretrained("<path_to_dummy_folder>")

-------- list the contents of the `dummy` folder

In [None]:
ls
#config.json  pytorch_model.bin  README.md  sentencepiece.bpe.model  special_tokens_map.json tokenizer_config.json  tokenizer.json

ls -lh
# check the size of the model

✏️ When creating the repository from the web interface, the *.gitattributes* file is automatically set up to consider files with certain extensions, such as *.bin* and *.h5*, as large files, and git-lfs will track them with no necessary setup on your side.
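
To see which files such a `.gitattributes` would send through git-lfs, one can parse its patterns. The sketch below is a simplification: it matches file names with `fnmatch`, which does not cover every rule of git's own pattern matching. The excerpt follows the usual git-lfs line format, but it is an assumed example and the exact patterns in your repository may differ. Both helper names are hypothetical.

```python
from fnmatch import fnmatch

# Assumed example excerpt; check the .gitattributes in your own repository.
GITATTRIBUTES_EXCERPT = """\
*.bin filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
"""


def lfs_patterns(gitattributes_text):
    """Collect the path patterns whose attributes include filter=lfs."""
    patterns = []
    for line in gitattributes_text.splitlines():
        parts = line.split()
        # Skip blank lines and comments; keep patterns marked for LFS.
        if len(parts) > 1 and not parts[0].startswith("#") and "filter=lfs" in parts[1:]:
            patterns.append(parts[0])
    return patterns


def is_lfs_tracked(filename, patterns):
    """Simplified check: match the file name against each LFS pattern."""
    return any(fnmatch(filename, pattern) for pattern in patterns)


patterns = lfs_patterns(GITATTRIBUTES_EXCERPT)
print(is_lfs_tracked("pytorch_model.bin", patterns))
print(is_lfs_tracked("config.json", patterns))
```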

In [None]:
# Add everything to the staging area
git add .

# Show the files that are currently staged
git status

# Check that git-lfs is tracking the correct files
git lfs status
## git-lfs handles the large files
## git handles the smaller files

# Commit and push to the Hugging Face Hub
git commit -m "Message"
git push

## **Building a Model Card**
The model card is the **central definition of the model, ensuring reusability by fellow community members** and **reproducibility of results**, and **providing a platform** on which other members may build their artifacts.

Documenting the training and evaluation process helps others **understand what to expect of a model** — and **providing sufficient information** regarding the **data that was used** and the **preprocessing and postprocessing that were done** ensures that the limitations, biases, and contexts in which the model is and is not useful can be identified and understood.

⚠ **Therefore, creating a model card that clearly defines your model is a very important step.**

The model card usually starts with a very brief, high-level overview of what the model is for, followed by additional details in the following sections:
- Model description
- Intended uses & limitations
- How to use
- Limitations and bias
- Training data
- Training procedure
- Evaluation results
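
Since a model card is a Markdown file (the repository's `README.md`), the sections listed above can be turned into a starting skeleton. This is a minimal sketch: `model_card_skeleton` is a hypothetical helper and the `TODO` placeholders are meant to be replaced by hand.

```python
MODEL_CARD_SECTIONS = [
    "Model description",
    "Intended uses & limitations",
    "How to use",
    "Limitations and bias",
    "Training data",
    "Training procedure",
    "Evaluation results",
]


def model_card_skeleton(model_name, overview):
    """Return Markdown text with a title, a short overview, and empty sections."""
    lines = [f"# {model_name}", "", overview, ""]
    for section in MODEL_CARD_SECTIONS:
        lines += [f"## {section}", "", "TODO", ""]
    return "\n".join(lines)


card = model_card_skeleton(
    "dummy-model",
    "A very brief, high-level overview of what the model is for.",
)
print(card)
```

The resulting text could be saved as `README.md` in the local clone and committed like any other file.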

### **Model Description**
📓 Basic details about the model. Includes:
1. Architecture version
2. If it was introduced in a paper
3. If an original implementation is available
4. The author
5. Any copyright
6. General information about the model, training procedures, parameters, and disclaimers

### **Intended uses & limitations**
📓 Use cases the model is intended for:
1. The language
2. Fields and domains where it can be applied
3. Document areas that are known to be out of scope for the model
4. Where it is likely to perform suboptimally

### **How to Use**
📓 Include some examples of how to use the model:
1. Showcase usage of the `pipeline()` function
2. Usage of the model and tokenizer classes
3. Any other code you think might be helpful

### **Training data**
📓 Indicate which dataset(s) the model was trained on. A brief description of the dataset(s) is also welcome.

### **Training procedure**
📓 Describe all the relevant aspects of training that are useful from a reproducibility perspective.
1. Any preprocessing and postprocessing that were done on the data
2. Details such as:
    - the number of `epochs` the model was trained for,
    - the `batch size`,
    - the `learning rate`,
    - and so on.
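
One way to keep these details reproducible is to render them straight from the values used in training. The sketch below formats a dictionary as a Markdown list for the model card. `format_hyperparameters` is a hypothetical helper and the values are invented for illustration only.

```python
def format_hyperparameters(hyperparameters):
    """Render a dict of hyperparameters as a Markdown bullet list."""
    return "\n".join(f"- {name}: {value}" for name, value in hyperparameters.items())


# Hypothetical values, for illustration only.
example_hyperparameters = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 8,
    "learning_rate": 2e-5,
}

print(format_hyperparameters(example_hyperparameters))
```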

### **Variables and metrics**
📓 Describe the **metrics** you use **for evaluation**, and the **different factors you are measuring**.
1. Mentioning which metric(s) were used, on which dataset and which dataset split, makes it easy to compare your model’s performance to that of other models.
2. These should be informed by the previous sections, such as the intended users and use cases.

### **Evaluation results**
📓 Provide an **indication of how well the model performs on the evaluation dataset**. If the model uses a decision threshold, either provide the decision threshold used in the evaluation, or provide details on evaluation at different thresholds for the intended uses.
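
Reporting results at several decision thresholds can be sketched as follows for a binary classifier. The scores and labels are toy values invented for illustration, and `precision_recall_at` is a hypothetical helper written in plain Python rather than taken from a metrics library.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    predictions = [score >= threshold for score in scores]
    true_pos = sum(p and y == 1 for p, y in zip(predictions, labels))
    false_pos = sum(p and y == 0 for p, y in zip(predictions, labels))
    false_neg = sum((not p) and y == 1 for p, y in zip(predictions, labels))
    # Guard against division by zero when nothing is predicted or nothing is positive.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Toy scores and labels, invented for illustration only.
toy_scores = [0.9, 0.8, 0.6, 0.4, 0.2]
toy_labels = [1, 1, 0, 1, 0]

for threshold in (0.3, 0.5, 0.7):
    precision, recall = precision_recall_at(toy_scores, toy_labels, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

A table built this way lets readers pick the threshold that suits their intended use.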