- Environment setup and paths
    - `export HF_HOME='/media/whaow/.cache/huggingface'`
    - model path: `$HF_HOME/hub/models--<org>--<name>`
        - `models--meta-llama--Llama-2-7b-hf` is the cache folder for the repo id `meta-llama/Llama-2-7b-hf`
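
The folder naming above can be reproduced with a few lines of standard-library Python. This is a minimal sketch: the helper names `hub_cache_folder` and `expected_model_dir` are hypothetical, and the fallback to `~/.cache/huggingface` when `HF_HOME` is unset is an assumption about the default location.

```python
import os
from pathlib import Path


def hub_cache_folder(repo_id: str, repo_type: str = "model") -> str:
    """Return the cache folder name for a repo id, e.g. 'models--org--name'."""
    # Slashes in the repo id become '--', prefixed with the pluralised repo type.
    return f"{repo_type}s--" + repo_id.replace("/", "--")


def expected_model_dir(repo_id: str) -> Path:
    """Directory where the model is expected to be cached (hypothetical helper)."""
    # Assumption: fall back to ~/.cache/huggingface when HF_HOME is not set.
    hf_home = os.environ.get("HF_HOME", os.path.expanduser("~/.cache/huggingface"))
    return Path(hf_home) / "hub" / hub_cache_folder(repo_id)


print(hub_cache_folder("meta-llama/Llama-2-7b-hf"))
print(expected_model_dir("meta-llama/Llama-2-7b-hf"))
```

Checking whether this directory exists is a quick way to tell if a model has already been downloaded.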

In [1]:
from datasets import load_dataset
import os
os.environ['http_proxy'] = 'http://127.0.0.1:7890'
os.environ['https_proxy'] = 'http://127.0.0.1:7890'

## dtypes

In [2]:
from transformers.models.gpt2.modeling_gpt2 import GPT2Block
from transformers import AutoConfig

In [6]:
config = AutoConfig.from_pretrained('gpt2-medium')
gpt2_block = GPT2Block(config, layer_idx=0)

In [9]:
# default torch.float32
next(gpt2_block.parameters()).dtype

torch.float32

In [11]:
config = AutoConfig.from_pretrained('gpt2-medium')
# .half() casts all floating-point parameters to half precision (torch.float16)
gpt2_block = GPT2Block(config, layer_idx=0).half()
next(gpt2_block.parameters()).dtype

torch.float16
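
To get a feel for what the cast to half precision trades away, the sketch below uses only the standard library's `struct` module (formats `e`, `f`, `d` for 16-, 32- and 64-bit floats). The helpers `round_trip` and `param_mib` and the parameter count are illustrative assumptions, not part of `torch` or `transformers`.

```python
import struct

# Bytes per element for half ('e'), single ('f') and double ('d') precision.
BYTES = {
    "float16": struct.calcsize("<e"),
    "float32": struct.calcsize("<f"),
    "float64": struct.calcsize("<d"),
}


def round_trip(value: float, fmt: str) -> float:
    """Store a Python float in the given struct format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def param_mib(n_params: int, dtype: str) -> float:
    """Rough memory footprint of n_params parameters, in MiB."""
    return n_params * BYTES[dtype] / 2**20


half_value = round_trip(0.1, "<e")
single_value = round_trip(0.1, "<f")
print(BYTES)
# The rounding error of 0.1 is larger in half precision than in single precision.
print(abs(half_value - 0.1), abs(single_value - 0.1))
# Hypothetical parameter count, only to compare the two footprints.
print(param_mib(2**20, "float32"), param_mib(2**20, "float16"))
```

Halving the bytes per parameter halves the memory needed for the weights, at the cost of a coarser representation of each value.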

### load dataset

- `lvwerra/stack-exchange-paired`
    - data_dir
- `cais/mmlu`
    - subset
    - split
-----
- data_dir
    - "data/finetune"
    - "data/rl"
    - "data/evaluate"
    - "data/reward"
- subset: mmlu
- split
    - "train"
    - "test"
    - "valid"
- num_proc: number of CPU processes used for downloading;
    - 4
- streaming
    - returns an `IterableDataset`
        - which has no `len()`

In [2]:
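streaming = True
dataset = load_dataset(
    'lvwerra/stack-exchange-paired',
    data_dir='data/finetune',
    split='train',
    # newer `datasets` releases use `token=True` instead of `use_auth_token`
    use_auth_token=True,
    # parallel download workers are only used when not streaming
    num_proc=4 if not streaming else None,
    streaming=streaming,
)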



Resolving data files:   0%|          | 0/20 [00:00<?, ?it/s]

In [3]:
len(dataset)

TypeError: object of type 'IterableDataset' has no len()

In [4]:
dataset.features

{'qid': Value(dtype='int64', id=None),
 'question': Value(dtype='string', id=None),
 'date': Value(dtype='string', id=None),
 'metadata': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'response_j': Value(dtype='string', id=None),
 'response_k': Value(dtype='string', id=None)}

In [5]:
valid_ds = dataset.take(4000)
train_ds = dataset.skip(4000)
train_ds = train_ds.shuffle(buffer_size=5000, seed=None)
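
Because a streamed dataset is consumed lazily, `take`, `skip` and `shuffle(buffer_size=...)` behave like operations on an iterator. The sketch below mimics them with the standard library only; `records`, `take`, `skip` and `buffer_shuffle` are illustrative stand-ins, not the `datasets` implementation.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, Optional


def take(stream: Iterable, n: int) -> Iterator:
    """Yield only the first n items."""
    return islice(stream, n)


def skip(stream: Iterable, n: int) -> Iterator:
    """Yield everything after the first n items."""
    return islice(stream, n, None)


def buffer_shuffle(stream: Iterable, buffer_size: int, seed: Optional[int] = None) -> Iterator:
    """Approximate shuffle: keep a fixed-size buffer and emit random elements from it."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)      # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]            # emit a random buffered item ...
        buffer[idx] = item           # ... and put the new item in its place
    rng.shuffle(buffer)
    yield from buffer                # flush what is left at the end


def records(n: int) -> Iterator[int]:
    """Stand-in for a streamed dataset: a generator has no len()."""
    yield from range(n)


valid = list(take(records(20), 4))
train = list(buffer_shuffle(skip(records(20), 4), buffer_size=5, seed=0))
print(valid)
print(sorted(train))
```

A small buffer only mixes neighbouring items, which is why a fairly large `buffer_size` is normally chosen for training data.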

In [6]:
data = next(iter(valid_ds))

In [7]:
# type(data): dict
data.keys()

dict_keys(['qid', 'question', 'date', 'metadata', 'response_j', 'response_k'])

In [8]:
# https://stackoverflow.com/questions/12891264/jquery-file-upload-plugin-not-calling-success-callback
data['qid']

12891264

In [9]:
print(data['response_j'])

Looking at the library code, seems all events are renamed removing 'fileupload' ... so 'fileuploaddone' becomes just 'done'. It is valid for all other callbacks.
look at this section:

```
    // Other callbacks:
    // Callback for the submit event of each file upload:
    // submit: function (e, data) {}, // .bind('fileuploadsubmit', func);
    // Callback for the start of each file upload request:
    // send: function (e, data) {}, // .bind('fileuploadsend', func);
    // Callback for successful uploads:
    // done: function (e, data) {}, // .bind('fileuploaddone', func);
    // Callback for failed (abort or error) uploads:
    // fail: function (e, data) {}, // .bind('fileuploadfail', func);
    // Callback for completed (success, abort or error) requests:
    // always: function (e, data) {}, // .bind('fileuploadalways', func);
    // Callback for upload progress events:
    // progress: function (e, data) {}, // .bind('fileuploadprogress', func);
    // Callback for global uplo
```

### cache_dir

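- `cache_dir`: the files are downloaded automatically and cached in this `cache_dir` instead of the default `HF_HOME`;
    - with the code unchanged, a second run finds the files directly in `cache_dir` and skips the download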

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf', cache_dir='./model')
```
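
The "second run finds it in the cache" behaviour can be pictured as a directory lookup. The sketch below is a simplified illustration using only the standard library: `find_cached_snapshot` is a hypothetical helper, the `models--.../snapshots/<hash>` layout is assumed to match the hub cache layout shown earlier, and `abc123` is a made-up snapshot name.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_cached_snapshot(cache_dir: str, repo_id: str) -> Optional[Path]:
    """Return a cached snapshot directory for repo_id, or None if nothing is cached."""
    repo_dir = Path(cache_dir) / ("models--" + repo_id.replace("/", "--"))
    snapshots = repo_dir / "snapshots"
    if not snapshots.is_dir():
        return None
    # Each subfolder of snapshots/ is named after a revision hash.
    candidates = sorted(p for p in snapshots.iterdir() if p.is_dir())
    return candidates[0] if candidates else None


with tempfile.TemporaryDirectory() as cache_dir:
    repo_id = "meta-llama/Llama-2-7b-chat-hf"
    first_run = find_cached_snapshot(cache_dir, repo_id)   # nothing cached yet
    # Simulate what a first download would leave behind.
    fake = Path(cache_dir) / "models--meta-llama--Llama-2-7b-chat-hf" / "snapshots" / "abc123"
    fake.mkdir(parents=True)
    second_run = find_cached_snapshot(cache_dir, repo_id)
    second_run_name = second_run.name if second_run else None

print(first_run, second_run_name)
```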

## Loading local files

In [1]:
from transformers import AutoConfig

In [3]:
config = AutoConfig.from_pretrained('/home/whaow/.cache/huggingface/hub/models--gpt2-medium/snapshots/425b0cc90498ac177aa51ba07be26fc2fea6af9d/config.json')
config

GPT2Config {
  "_name_or_path": "/home/whaow/.cache/huggingface/hub/models--gpt2-medium/snapshots/425b0cc90498ac177aa51ba07be26fc2fea6af9d/config.json",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 1024,
  "n_head": 16,
  "n_inner": null,
  "n_layer": 24,
  "n_positions": 1024,
  "n_special": 0,
  "predict_special_tokens": true,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version":

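Since `config.json` is plain JSON, it can also be inspected without `transformers`. The sketch below parses a few fields copied from the printed `GPT2Config` (a real file has more) and derives two sizes from them; the variable names are illustrative.

```python
import json

# A few fields copied from the printed GPT2Config; a real config.json has more.
config_text = """
{"model_type": "gpt2", "n_embd": 1024, "n_head": 16,
 "n_layer": 24, "n_positions": 1024, "n_inner": null}
"""
config = json.loads(config_text)

# Width of each attention head.
head_dim = config["n_embd"] // config["n_head"]
# GPT-2 falls back to 4 * n_embd for the MLP width when n_inner is null.
mlp_width = config["n_inner"] or 4 * config["n_embd"]

print(config["model_type"], config["n_layer"], head_dim, mlp_width)
```
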
## models

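- device_map
    - Single GPU: use `device_map={"": "cuda"}`.
      - everything is placed on cuda:0;
    - Multiple GPUs (automatic sharding, recommended): use `device_map="auto"`.
    - Multiple GPUs (manual sharding): specify the device for each layer explicitly in `device_map`.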
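
For the manual case, the mapping is just a dictionary from module names to device indices. The sketch below builds one for a GPT-2-style model; `build_device_map` is a hypothetical helper, and the module names (`transformer.wte`, `transformer.h.<i>`, `lm_head`, ...) are assumptions that hold for GPT-2 and differ for other architectures.

```python
from typing import Dict


def build_device_map(n_layer: int, n_gpus: int, prefix: str = "transformer") -> Dict[str, int]:
    """Assign contiguous groups of transformer blocks to GPUs 0..n_gpus-1."""
    per_gpu = -(-n_layer // n_gpus)  # ceiling division: blocks per GPU
    # Embeddings go on the first GPU.
    device_map = {f"{prefix}.wte": 0, f"{prefix}.wpe": 0}
    for i in range(n_layer):
        device_map[f"{prefix}.h.{i}"] = i // per_gpu
    # The final layer norm follows the last block.
    device_map[f"{prefix}.ln_f"] = (n_layer - 1) // per_gpu
    # GPT-2 ties lm_head to wte, so this sketch keeps them on the same device.
    device_map["lm_head"] = 0
    return device_map


device_map = build_device_map(n_layer=24, n_gpus=2)
print(device_map["transformer.h.11"], device_map["transformer.h.12"])
```

With 24 blocks and 2 GPUs, blocks 0 to 11 land on GPU 0 and blocks 12 to 23 on GPU 1.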

### single, half, double

- `--fp16` and `--bf16`
    - single (single precision, float32): `fp16 == False and bf16 == False`
    - half (half precision): `fp16 == True and bf16 == False` (float16) or `fp16 == False and bf16 == True` (bfloat16)
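
The flag combinations can be written down as a small lookup. This is a sketch of the mapping only: `resolve_dtype` is a hypothetical helper that returns dtype names as strings, and treating `fp16` and `bf16` as mutually exclusive is stated here as an assumption.

```python
def resolve_dtype(fp16: bool = False, bf16: bool = False) -> str:
    """Map the two flags to a dtype name; at most one of them may be set."""
    if fp16 and bf16:
        raise ValueError("fp16 and bf16 are mutually exclusive")
    if fp16:
        return "float16"
    if bf16:
        return "bfloat16"
    return "float32"  # neither flag set: single precision


for flags in [(False, False), (True, False), (False, True)]:
    print(flags, resolve_dtype(*flags))
```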