# get_pretrained_t2v

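## Overview

Use the pretrained models released by the EduNLP project team to represent the token sequences of a group of items (questions) as vectors.

- Pros: simple and convenient.
- Cons: only the models shipped with the project can be used, which limits flexibility.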

## Import modules

In [1]:
from tqdm import tqdm
from EduNLP.SIF.segment import seg
from EduNLP.SIF.tokenization import tokenize
from EduNLP.Pretrain import GensimWordTokenizer
from EduNLP.Vector import get_pretrained_t2v



## Input

Type: `list`  
Content: the token sequences of all items in an item group, one sequence per item.
> Call `GensimWordTokenizer` here to convert each item's text (`str`) into tokens.
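
To make the expected input shape concrete, here is a minimal sketch. It assumes each item's token sequence is a plain list of strings, so the whole input is a list of such lists. The helper `check_token_items` and the sample tokens are hypothetical illustrations; they are not part of EduNLP and are not real tokenizer output.

```python
def check_token_items(token_items):
    """Return True if token_items is a list of non-empty lists of str tokens."""
    if not isinstance(token_items, list):
        return False
    return all(
        isinstance(tokens, list)
        and len(tokens) > 0
        and all(isinstance(tok, str) for tok in tokens)
        for tokens in token_items
    )


# Hand-written sample in the assumed shape: one inner list per item.
example_token_items = [
    ["z", "=", "x", "+", "7", "y"],
    ["[FIGURE]", "x", ",", "y"],
]

print(check_token_items(example_token_items))
```

A quick check like this can catch a common mistake, namely passing raw strings instead of token lists, before the vectors are computed.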

In [2]:
def load_items():
    test_items = [
        {'ques_content':'有公式$\\FormFigureID{wrong1?}$和公式$\\FormFigureBase64{wrong2?}$，如图$\\FigureID{088f15ea-8b7c-11eb-897e-b46bfc50aa29}$,若$x,y$满足约束条件$\\SIFSep$，则$z=x+7 y$的最大值为$\\SIFBlank$'},
        {'ques_content':'如图$\\FigureID{088f15ea-8b7c-11eb-897e-b46bfc50aa29}$,若$x,y$满足约束条件$\\SIFSep$，则$z=x+7 y$的最大值为$\\SIFBlank$'},
        {'ques_content':'<div>Below is a discussion on a website.<br><table border=\1'},
    ]
    for line in test_items:
        yield line


token_items = []
# Build the tokenizer once and reuse it for every item.
# symbol="gmas": replace 'g', 'm', 'a' and 's' content with special symbols;
# text ('t') and formulas ('f') are kept unchanged.
# general=True: symbolize formulas given in figure format and use the 'linear' method for formula segmentation.
tokenizer = GensimWordTokenizer(symbol="gmas", general=True)
for item in tqdm(load_items(), "sifing"):
    token_item = tokenizer(item["ques_content"])
    if token_item:
        token_items.append(token_item.tokens)

sifing: 3it [00:01,  1.70it/s]


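## Choosing and using a model

Choose a pretrained model according to the subject the items belong to:

| Pretrained model | Subject of the training data |
| ---------------- | ---------------------------- |
| d2v_all_256      | All subjects                 |
| d2v_sci_256      | Science                      |
| d2v_eng_256      | English                      |
| d2v_lit_256      | Liberal arts (humanities)    |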
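
The table can be turned into a small lookup so that the model name is chosen programmatically. This is only a sketch: the subject keys, the helper `pick_model_name`, and the fallback to the all-subject model are assumptions made for illustration, not part of EduNLP. Only the four model names come from the table.

```python
# Subject key (hypothetical) -> pretrained model name (from the table).
SUBJECT_TO_MODEL = {
    "all": "d2v_all_256",
    "science": "d2v_sci_256",
    "english": "d2v_eng_256",
    "liberal_arts": "d2v_lit_256",
}


def pick_model_name(subject, default="d2v_all_256"):
    """Return the model name for a subject key, or the default if unknown."""
    key = subject.strip().lower()
    return SUBJECT_TO_MODEL.get(key, default)


model_name = pick_model_name("science")
print(model_name)
```

The resulting name can then be passed to `get_pretrained_t2v`. Falling back to the all-subject model for unknown subjects is one possible design choice; raising an error instead may be preferable when a wrong subject should not go unnoticed.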

In [3]:
# Load the pretrained science model as a token-to-vector (t2v) converter; it is downloaded on first use.
t2v = get_pretrained_t2v("d2v_sci_256")
print(t2v(token_items))

downloader, INFO http://base.ustc.edu.cn/data/model_zoo/EduNLP/d2v/general_science_256.zip is saved as /home/lvrui/.EduNLP/model/general_science_256.zip
downloader, INFO file existed, skipped


[array([ 0.06439726, -0.00687894,  0.00189043, -0.09257554,  0.05736929,
       -0.11668563, -0.10532057, -0.05405452, -0.0999706 ,  0.06865267,
       -0.02944315,  0.01962002, -0.05162324,  0.20003118,  0.0697258 ,
       -0.07881501,  0.02241233, -0.09130172,  0.00737528,  0.12381789,
        0.17466484,  0.06978347, -0.04626915,  0.0546972 ,  0.04325071,
        0.08251449,  0.02235107,  0.03021882,  0.06331849,  0.047343  ,
       -0.0523185 ,  0.03466854,  0.04471075, -0.157977  , -0.17309383,
       -0.0940221 , -0.03896975,  0.10220466, -0.03738068,  0.01231602,
       -0.03859513,  0.04407888, -0.13459732, -0.00885631, -0.03051728,
        0.10884743,  0.09097778,  0.10442872, -0.08097434, -0.10975049,
       -0.07766262,  0.0059125 ,  0.04991117, -0.01943343,  0.04166247,
       -0.00084558,  0.05679052,  0.08069104, -0.11168302, -0.06381108,
        0.02258897, -0.04278664, -0.06129966, -0.0530447 ,  0.07391489,
       -0.04376722, -0.15739298, -0.06582512, -0.12223923,  0.0