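### Overview

#### Data description
1. `triple_zh.txt`: triples with Chinese relations, 104941 in total
2. `triple_en.txt`: triples with English relations, 162544 in total
3. `ILLs(zh-en).txt`: Chinese-English keyword mapping, 13636 entries
4. `train_data.txt`: training set, 14262 questions; each answer consists of 2-3 triples
5. `valid_data.txt`: validation set

#### Preprocessing steps
1. Extract the keywords from the triples
2. Split the training set into question + answer
3. Extract the mixed-language input in the questions (the Unicode range still has to be determined; pending)
4. Count the frequency of the relation words (optional; not done)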
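
Step 3 depends on deciding which code points count as Chinese. A minimal sketch, assuming the CJK Unified Ideographs block (U+4E00 to U+9FFF) is enough for this data set; `is_han` and `han_runs` are hypothetical helper names that do not appear in the notebook:

```julia
# Assumption: "Chinese" means the CJK Unified Ideographs block U+4E00..U+9FFF.
is_han(c::Char) = '\u4e00' <= c <= '\u9fff'

# Return the maximal runs of Han characters in a question.
# Every non-Han character is replaced by a space, then the result is split on whitespace.
han_runs(question) = [String(run) for run in split(map(c -> is_han(c) ? c : ' ', question))]

han_runs("what is the title leader of the bay that 瑞曼 is famous for?")
```

If the questions turn out to contain characters outside this block (extension blocks, full-width punctuation), the range in `is_han` would need to be widened.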

### Keyword extraction
The output is stored in the `extract` directory.

In [1]:
;cd ../data

/home/rex/work_space/7 others/ccks/CCKS-mKGQA/data


#### Files to extract
ILLs(zh-en).txt, triple_en.txt, triple_zh.txt

In [2]:
txt2triple(txt) = Tuple(rstrip(last(split(st, '/')), '>') for st in split(txt))

In [3]:
filename = "ILLs(zh-en).txt"
output_io = open("extract/$filename", "w")
open(filename, "r") do io
    for i in 1:13636 # line counts: 162544 (triple_en), 104941 (triple_zh), 13636 (ILLs)
        # txt = join(txt2triple(readline(io)), '\t') # use this line for triple_zh / triple_en
        txt = txt2triple(readline(io))
        txt = txt[1] * '\t' * txt[3]
        write(output_io, txt, "\n")
    end
end
close(output_io)
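
The cell above hard-codes the line count of each file. A hedged alternative, assuming every non-blank line holds whitespace-separated `<...>` links, iterates with `eachline` so the count does not need to be known in advance. `extract_file` and its `cols` keyword are hypothetical names introduced for this sketch:

```julia
txt2triple(txt) = Tuple(rstrip(last(split(st, '/')), '>') for st in split(txt))

# Write the selected columns of every non-blank line of `src` to `dst`, tab-separated.
function extract_file(src, dst; cols=(1, 2, 3))
    open(dst, "w") do out
        for line in eachline(src)
            isempty(strip(line)) && continue   # skip blank lines
            t = txt2triple(line)
            println(out, join((t[i] for i in cols), '\t'))
        end
    end
end

# extract_file("triple_en.txt", "extract/triple_en.txt")
# extract_file("ILLs(zh-en).txt", "extract/ILLs(zh-en).txt"; cols=(1, 3))
```

The `do` block also closes the output file if an error occurs partway through.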

#### Splitting the training set
train_data.txt

In [9]:
zh_source = r"<http://zh.dbpedia.org/resource/(.*)>"
zh_property = r"<http://zh.dbpedia.org/property/(.*)>"
en_source = r"<http://dbpedia.org/resource/(.*)>"
en_property = r"<http://dbpedia.org/property/(.*)>"
"""Extract the keyword and relation names from one triple of links."""
function get_info(triple)
    s1 = match(zh_source, first(triple))
    if !isnothing(s1)
        s1 = s1.captures[1]
        p1 = match(zh_property, triple[2]).captures[1]
        s2 = match(zh_source, last(triple)).captures[1]
        return "zh:\t$(s1)\t$(p1)\t$(s2)"
    end
    s1 = match(en_source, first(triple)).captures[1]
    p1 = match(en_property, triple[2]).captures[1]
    s2 = match(en_source, last(triple)).captures[1]
    "en:\t$(s1)\t$(p1)\t$(s2)"
end

"""Split one training line into the question and its answer triples."""
function QandA(txt)
    que, ans = split(txt, '\t')
    ans = split(ans, '#')
    len = Int(length(ans)/3)
    "{$(len)} $(que)\n" * join([get_info(ans[3 * i - 2: 3 * i]) for i in 1:len],'\n')
end

QandA

In [10]:
filename = "train_data.txt"
output_io = open("extract/$filename", "w")
dict = Dict{Char,Int}('2'=>0, '3'=>0)
open(filename, "r") do io
    for i in 1:14262
        println(output_io, QandA(readline(io)))
    end
end
close(output_io)
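
The extracted file starts every question with a `{n}` header, where `n` is the number of answer triples, so the split between two-triple and three-triple answers can be tallied from those headers. A minimal sketch, assuming `n` is a single digit as in this data set; `count_by_size` is a hypothetical helper name:

```julia
# Count how many questions have n answer triples, keyed by n.
# Header lines look like "{2} question text"; triple lines start with "zh:" or "en:".
function count_by_size(lines)
    counts = Dict{Int,Int}()
    for line in lines
        startswith(line, "{") || continue   # only question headers carry the count
        n = parse(Int, line[2])
        counts[n] = get(counts, n, 0) + 1
    end
    counts
end

# count_by_size(eachline("extract/train_data.txt"))
```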

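### Data analysis
1. The triples in the knowledge graph contain only these kinds of links
   - `<http://zh.dbpedia.org/resource/{Chinese keyword}>`
   - `<http://zh.dbpedia.org/property/{Chinese or English relation}>`
   - `<http://dbpedia.org/resource/{English keyword}>`
   - `<http://dbpedia.org/property/{English relation}>`
   
   Train on the keywords and relations only, and restore the full links when submitting answers
2. In `train_data.txt`, answers made of two triples account for about two thirds of the training set (9739/14262)
3. In `triple_zh.txt` the keywords are Chinese, but most relations are English. There are 1155 distinct relations: 1097 English and only 58 Chinese. English relations occur 102857 times in total
4. `triple_en.txt` has 1637 distinct relations
5. `ILLs(zh-en).txt` is more closely tied to the training set
   - For comparison, `triple_zh.txt` reaches only about 1/10
6. For mixed-language questions in the training set, for example "mostly Chinese with English keywords embedded", the embedded keyword is very likely the one to extract (whether this holds in 100% of cases still needs to be confirmed)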
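
Restoring the full links mentioned in item 1 is the inverse of `get_info`. A hedged sketch, assuming all three parts of a triple share one language prefix (as `get_info` assumes) and that the three links are joined with `#` as in the raw `train_data.txt` answers; the exact submission format is not stated here, and `restore_links` is a hypothetical name:

```julia
# Turn one extracted line such as "en:\tsubject\trelation\tobject" back into full links.
function restore_links(line)
    lang, s, p, o = split(line, '\t')
    # Assumption: "zh:" lines use the zh.dbpedia.org host, everything else dbpedia.org.
    host = lang == "zh:" ? "http://zh.dbpedia.org" : "http://dbpedia.org"
    join(("<$(host)/resource/$(s)>", "<$(host)/property/$(p)>", "<$(host)/resource/$(o)>"), '#')
end

restore_links("en:\tTencent\tindustry\tInternet")
```

The sample triple is made up for illustration and is not taken from the data files.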

In [190]:
# Count the number of distinct relations
filename = "triple_en.txt"
txts = read("extract/$filename", String)
triples = [Tuple(split(txt, '\t')) for txt in split(txts, '\n')];
pop!(triples) # drop the trailing empty line
length(unique([triple[2] for triple in triples]))
# length(unique([triple[2] for triple in triples if all(isletter, triple[2])])) # relations made of letters only (note: isletter also accepts CJK characters)

In [101]:
# Check how ILLs(zh-en).txt relates to triple_zh.txt / triple_en.txt

# en triples
txts = read("extract/triple_en.txt", String)
en_triples = [Tuple(split(txt, '\t')) for txt in split(txts, '\n')]

# zh triples
txts = read("extract/triple_zh.txt", String)
zh_triples = [Tuple(split(txt, '\t')) for txt in split(txts, '\n')]

# zh-en mapping (the original cell defined `zh_en_double` but read `zh_en_triples`; one name is used here)
txts = read("extract/ILLs(zh-en).txt", String)
zh_en_pairs = [Tuple(split(txt, '\t')) for txt in split(txts, '\n')]
zh_dicts = Set{String}(last.(zh_en_pairs))
en_dicts = Set{String}(first.(zh_en_pairs))

Set{String} with 13250 elements:
  "Tencent"
  "Altamira_do_Paraná"
  "New_London_County,_Connecticut"
  "Priapus"
  "Xi_County,_Henan"
  "Jataí"
  "Li_Yeguang"
  "Casablanca"
  "Association_of_American_Universities"
  "Badules"
  "Tiszabercel"
  "Lexus"
  "Joseph_McCarthy"
  "Tangier,_Virginia"
  "Luxor"
  "Beire_(Paredes)"
  "Cleveland,_Utah"
  "N-I_(rocket)"
  "Herentals"
  "Filipe_Nyusi"
  "Province_of_Jaén_(Spain)"
  "Roman_Catholic_Diocese_of_Dunkeld"
  "Argente"
  "Sikorsky_(crater)"
  "Chien-Shiung_Wu"
  ⋮ 

In [6]:
file = "extract/train_data.txt"
open(file, "r")  do io
    print(readline(io))
end

{2} what is the title leader of the bay that 瑞曼 is famous for?

In [19]:
## Check the training-set answers against the keywords in the questions
file = "extract/train_data.txt"
solutions = Dict{String, Vector{NTuple{4, String}}}()
open(file, "r") do io
    for _ in 1:14262
        que = readline(io)
        ind, que = parse(Int, que[2]), que[5:end]
        solutions[que] = [Tuple(split(readline(io), '\t')) for _ in 1:ind]
    end
end

In [25]:
for (que, sol) in solutions
    keyword = replace(sol[1][2], '_'=>' ')
    occursin(keyword, que) || println(que,'\n', sol[1][2], '\t', sol[1][4])
end

what is the timezong of C's southern regionabanes, Girona?
Cabanes,_Girona	Vilabertran
which bay makes the origin of R's nameayleigh (lunar crater) famous?
Rayleigh_(lunar_crater)	John_William_Strutt,_3rd_Baron_Rayleigh
what is the program that is before the works that make M knownari Yamazaki?
Mari_Yamazaki	Thermae_Romae
who does the origin of Y's nameoung (crater) influence?
Young_(crater)	Thomas_Young_(scientist)
who is the producer of P's famous workshilip Eisner?
Philip_Eisner	Event_Horizon_(film)
Savez-vous ce qu’est la dynastie ?
Elisabeth_of_Bavaria,_Queen_of_Belgium	Prince_Charles,_Count_of_Flanders
which language does I's famous workssmail Shahid belong to?
Ismail_Shahid	Pashto
which program is before A's famous worksyame Goriki?
Ayame_Goriki	Biblia_Koshodō_no_Jiken_Techō
who is the director of the works that make L knownenore Aubert?
Lenore_Aubert	Abbott_and_Costello_Meet_Frankenstein
what is the program that is before A's famous worksyame Goriki?
Ayame_Goriki	Biblia_Koshodō