# Baseline
python: 3.8.*

Use ```Ctrl + ]``` to collapse all sections :)

Download our starter pack (3~5 min)

notebook1
## PART 1. Document retrieval

In [1]:
!pip install opencc wikipedia pandarallel

In [2]:
!pip install hanlp

Prepare the environment and import all the libraries we need.

In [1]:
# built-in libs
import json
import pickle
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Set, Tuple, Union

# 3rd party libs
import hanlp
import opencc
import pandas as pd
import wikipedia
from hanlp.components.pipeline import Pipeline
from pandarallel import pandarallel

# our own libs


def load_json(file_path: Union[Path, str]) -> List[dict]:
    """Read a jsonl file and return its records as a list of dicts.

    Args:
        file_path (Union[Path, str]): The jsonl file path.

    Returns:
        List[dict]: One dict per line of the jsonl file.

    Example:
        >>> records = load_json("data/public_train.jsonl")
        >>> pd.DataFrame(records)  # wrap in a DataFrame when needed
    """
    with open(file_path, "r", encoding="utf8") as json_file:
        return [json.loads(json_str) for json_str in json_file]


pandarallel.initialize(progress_bar=True, verbose=0, nb_workers=10)
wikipedia.set_lang("zh")

Preload the data.

In [2]:
TRAIN_DATA = load_json("data/public_train.jsonl")
TEST_DATA = load_json("data/public_test.jsonl")
CONVERTER_T2S = opencc.OpenCC("t2s.json")
CONVERTER_S2T = opencc.OpenCC("s2t.json")

Data class for type hinting

In [3]:
@dataclass
class Claim:
    data: str

@dataclass
class AnnotationID:
    id: int

@dataclass
class EvidenceID:
    id: int

@dataclass
class PageTitle:
    title: str

@dataclass
class SentenceID:
    id: int

@dataclass
class Evidence:
    data: List[List[Tuple[AnnotationID, EvidenceID, PageTitle, SentenceID]]]

### Helper function

For the sake of consistency, we first convert Traditional Chinese to Simplified Chinese and then convert it back to Traditional Chinese. The round trip maps variant Traditional characters to a single form, which avoids the mismatches that occur when Traditional text from different sources is compared directly.

In [4]:
def do_st_corrections(text: str) -> str:
    simplified = CONVERTER_T2S.convert(text)

    return CONVERTER_S2T.convert(simplified)
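
To see why a round trip can normalise text, here is a minimal sketch with toy conversion tables. The tables `TOY_T2S` and `TOY_S2T` and the helper names are hypothetical stand-ins for illustration only; they are not OpenCC's real dictionaries, and the real converters handle far more than single characters.

```python
# Toy tables (hypothetical, NOT OpenCC's real dictionaries).
TOY_T2S = {"為": "为", "爲": "为"}  # many Traditional variants -> one Simplified form
TOY_S2T = {"为": "爲"}              # one Simplified form -> one Traditional form


def convert(text: str, table: dict) -> str:
    """Replace every character found in the table; keep the others."""
    return "".join(table.get(ch, ch) for ch in text)


def toy_round_trip(text: str) -> str:
    """Traditional -> Simplified -> Traditional, as do_st_corrections does."""
    return convert(convert(text, TOY_T2S), TOY_S2T)


# Two spellings of the same word end up identical after the round trip.
print(toy_round_trip("因為") == toy_round_trip("因爲"))  # True
```

Because the first step is many-to-one and the second is one-to-one, every variant collapses to the same output, so later string comparisons against the claim behave consistently.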

We use constituency parsing to split each claim into constituents (phrases labelled with their syntactic category) and extract the noun phrases. In the later stages, we use the noun phrases as queries to search for relevant documents.

In [5]:
def get_nps_hanlp(
    predictor: Pipeline,
    d: Dict[str, Union[int, Claim, Evidence]],
) -> List[str]:
    claim = d["claim"]
    tree = predictor(claim)["con"]
    nps = [
        do_st_corrections("".join(subtree.leaves()))
        for subtree in tree.subtrees(lambda t: t.label() == "NP")
    ]

    return nps
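
The extraction above can be pictured on a hand-built tree. This is only a sketch: the nested-tuple tree, `leaves`, and `noun_phrases` are hypothetical stand-ins for the tree object that HanLP returns, written so the example runs without loading any model.

```python
from typing import List

# A node is (label, child, child, ...); a leaf child is a plain string token.
toy_tree = (
    "IP",
    ("NP", ("NR", "天王星")),
    ("VP", ("VC", "是"), ("NP", ("NN", "青色"), ("NN", "行星"))),
)


def leaves(node) -> List[str]:
    """Collect the tokens under a node from left to right."""
    out = []
    for child in node[1:]:
        if isinstance(child, str):
            out.append(child)
        else:
            out.extend(leaves(child))
    return out


def noun_phrases(node) -> List[str]:
    """Return the joined tokens of every subtree labelled NP (pre-order)."""
    found = []
    if node[0] == "NP":
        found.append("".join(leaves(node)))
    for child in node[1:]:
        if not isinstance(child, str):
            found.extend(noun_phrases(child))
    return found


print(noun_phrases(toy_tree))  # ['天王星', '青色行星']
```

Note that a nested noun phrase would be returned in addition to the phrase that contains it, so one claim can yield overlapping queries.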

Precision is the fraction of retrieved documents that are relevant. Recall is the fraction of relevant documents that are retrieved.

In [6]:
def calculate_precision(
    data: List[Dict[str, Union[int, Claim, Evidence]]],
    predictions: pd.Series,
) -> None:
    precision = 0
    count = 0

    for i, d in enumerate(data):
        if d["label"] == "NOT ENOUGH INFO":
            continue

        # Extract all ground truth of titles of the wikipedia pages
        # evidence[2] refers to the title of the wikipedia page
        gt_pages = set([
            evidence[2]
            for evidence_set in d["evidence"]
            for evidence in evidence_set
        ])

        predicted_pages = predictions.iloc[i]
        hits = predicted_pages.intersection(gt_pages)
        if len(predicted_pages) != 0:
            precision += len(hits) / len(predicted_pages)

        count += 1

    # Macro precision
    print(f"Precision: {precision / count}")


def calculate_recall(
    data: List[Dict[str, Union[int, Claim, Evidence]]],
    predictions: pd.Series,
) -> None:
    recall = 0
    count = 0

    for i, d in enumerate(data):
        if d["label"] == "NOT ENOUGH INFO":
            continue

        gt_pages = set([
            evidence[2]
            for evidence_set in d["evidence"]
            for evidence in evidence_set
        ])
        predicted_pages = predictions.iloc[i]
        hits = predicted_pages.intersection(gt_pages)
        recall += len(hits) / len(gt_pages)
        count += 1

    print(f"Recall: {recall / count}")
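
As a worked illustration of the two metrics for a single claim, the sketch below computes them on toy page sets. The helper `page_precision_recall` and the example page titles are assumptions made for illustration; the functions above average these per-claim values over all claims that are not labelled NOT ENOUGH INFO.

```python
from typing import Set, Tuple


def page_precision_recall(predicted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Per-claim precision and recall over page titles."""
    hits = predicted & gold
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall


predicted_pages = {"天王星", "天衛三", "仲夏夜之夢", "莎士比亞"}
gold_pages = {"天衛三"}

precision, recall = page_precision_recall(predicted_pages, gold_pages)
print(precision, recall)  # 0.25 1.0
```

Retrieving more pages tends to raise recall and lower precision, which is the trade-off to keep in mind when changing the number of retrieved documents.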

By default, at most five documents are retrieved per claim. Adjust `num_pred_doc` to suit your objective; note that in `save_doc` it only sets the name of the output file. The data is saved in jsonl format.

In [7]:
def save_doc(
    data: List[Dict[str, Union[int, Claim, Evidence]]],
    predictions: pd.Series,
    mode: str = "train",
    num_pred_doc: int = 5,
) -> None:
    with open(
        f"data/{mode}_doc{num_pred_doc}.jsonl",
        "w",
        encoding="utf8",
    ) as f:
        for i, d in enumerate(data):
            d["predicted_pages"] = list(predictions.iloc[i])
            f.write(json.dumps(d, ensure_ascii=False) + "\n")

### Main function for document retrieval

In [8]:
def get_pred_pages(series_data: pd.Series) -> Set[str]:
    results = []
    tmp_muji = []
    # wiki_page: the index at which it appears in the claim
    mapping = {}
    claim = series_data["claim"]
    nps = series_data["hanlp_results"]
    first_wiki_term = []

    for i, noun_phrase in enumerate(nps):
        # Simplified Traditional Chinese Correction
        wiki_search_results = [
            do_st_corrections(w) for w in wikipedia.search(noun_phrase)
        ]

        # Remove the wiki page's description in brackets
        wiki_set = [re.sub(r"\s\(\S+\)", "", w) for w in wiki_search_results]
        wiki_df = pd.DataFrame({
            "wiki_set": wiki_set,
            "wiki_results": wiki_search_results
        })

        # Elements in wiki_set --> index
        # Extracting only the first element is one way to avoid extracting
        # too many of the similar wiki pages
        grouped_df = wiki_df.groupby("wiki_set", sort=False).first()
        candidates = grouped_df["wiki_results"].tolist()
        # muji refers to wiki_set
        muji = grouped_df.index.tolist()

        for prefix, term in zip(muji, candidates):
            if prefix not in tmp_muji:
                matched = False

                # Take at least one term from the first noun phrase
                if i == 0:
                    first_wiki_term.append(term)

                # Walrus operator :=
                # https://docs.python.org/3/whatsnew/3.8.html#assignment-expressions
                # Through these filters, we are trying to figure out if the term
                # is within the claim
                if (((new_term := term) in claim) or
                    ((new_term := term.replace("·", "")) in claim) or
                    ((new_term := term.split(" ")[0]) in claim) or
                    ((new_term := term.replace("-", " ")) in claim)):
                    matched = True

                elif "·" in term:
                    splitted = term.split("·")
                    for split in splitted:
                        if (new_term := split) in claim:
                            matched = True
                            break

                if matched:
                    # post-processing
                    term = term.replace(" ", "_")
                    term = term.replace("-", "")
                    results.append(term)
                    mapping[term] = claim.find(new_term)
                    tmp_muji.append(new_term)

    # 5 is a hyperparameter
    if len(results) > 5:
        assert -1 not in mapping.values()
        results = sorted(mapping, key=mapping.get)
    elif len(results) < 1:
        results = first_wiki_term

    return set(results)
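
The step that strips the bracketed description and keeps only the first page per title can be sketched without pandas. The helper `first_per_prefix` and the English titles are hypothetical; the regular expression is the one used in `get_pred_pages`.

```python
import re
from typing import Dict, List


def first_per_prefix(titles: List[str]) -> Dict[str, str]:
    """Map each title without its bracketed description to the first full title seen."""
    first = {}
    for title in titles:
        prefix = re.sub(r"\s\(\S+\)", "", title)
        # setdefault keeps the earliest search result for each prefix
        first.setdefault(prefix, title)
    return first


titles = ["Mercury", "Mercury (planet)", "Mercury (element)", "Venus (planet)"]
print(first_per_prefix(titles))
# {'Mercury': 'Mercury', 'Venus': 'Venus (planet)'}
```

The pattern only removes a description that contains no whitespace, so a title such as "Mercury (Roman god)" would be left unchanged and treated as its own prefix.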

### Step 1. Get noun phrases from the HanLP constituency parsing tree

Setup [HanLP](https://github.com/hankcs/HanLP) predictor (1 min)

In [9]:
predictor = (hanlp.pipeline().append(
    hanlp.load("MSR_TOK_ELECTRA_BASE_CRF"),
    output_key="tok",
).append(
    hanlp.load("CTB9_CON_ELECTRA_SMALL"),
    output_key="con",
    input_key="tok",
))
# FINE_ELECTRA_SMALL_ZH

                                             

We skip this step (building the parse trees) during the in-class demo and load the cached results instead.

In [10]:
hanlp_file = f"data/hanlp_con_results.pkl"
if Path(hanlp_file).exists():
    with open(hanlp_file, "rb") as f:
        hanlp_results = pickle.load(f)
else:
    hanlp_results = [get_nps_hanlp(predictor, d) for d in TRAIN_DATA]
    with open(hanlp_file, "wb") as f:
        pickle.dump(hanlp_results, f)
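
The load-or-compute pattern used in the cell above could be wrapped in a small helper. This is a sketch under assumptions: `load_or_compute` and `expensive` are hypothetical names, and the temporary directory only serves to keep the example self-contained.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_or_compute(cache_path: Path, compute: Callable[[], Any]) -> Any:
    """Return the pickled result if it exists; otherwise compute and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def expensive() -> list:
    """Stand-in for a slow step such as parsing every claim."""
    calls.append(1)
    return [["noun phrase 1"], ["noun phrase 2"]]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cache.pkl"
    first = load_or_compute(path, expensive)   # computes and writes the cache
    second = load_or_compute(path, expensive)  # reads the cache

print(first == second, len(calls))  # True 1
```

Remember to delete the cache file whenever the input data or the parser changes; otherwise stale results are loaded silently.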

Get pages via the online Wikipedia API

In [None]:
import wikipedia
doc_path = f"data/train_doc5.jsonl"
if Path(doc_path).exists():
    with open(doc_path, "r", encoding="utf8") as f:
        predicted_results = pd.Series([
            set(json.loads(line)["predicted_pages"])
            for line in f
        ])
else:
    train_df = pd.DataFrame(TRAIN_DATA)
    train_df.loc[:, "hanlp_results"] = hanlp_results
    predicted_results = train_df.parallel_apply(get_pred_pages, axis=1)
    save_doc(TRAIN_DATA, predicted_results, mode="train")
predicted_results.head()


### Step 2. Calculate our results

In [None]:
calculate_precision(TRAIN_DATA, predicted_results)
calculate_recall(TRAIN_DATA, predicted_results)

### Step 3. Repeat the same process on test set
Create parsing tree

In [None]:
hanlp_test_file = f"data/hanlp_con_test_results.pkl"
if Path(hanlp_test_file).exists():
    with open(hanlp_test_file, "rb") as f:
        hanlp_results = pickle.load(f)
else:
    hanlp_results = [get_nps_hanlp(predictor, d) for d in TEST_DATA]
    with open(hanlp_test_file, "wb") as f:
        pickle.dump(hanlp_results, f)

Get pages via the online Wikipedia API

In [None]:
test_doc_path = f"data/test_doc5.jsonl"
if Path(test_doc_path).exists():
    with open(test_doc_path, "r", encoding="utf8") as f:
        test_results = pd.Series(
            [set(json.loads(line)["predicted_pages"]) for line in f])
else:
    test_df = pd.DataFrame(TEST_DATA)
    test_df.loc[:, "hanlp_results"] = hanlp_results
    test_results = test_df.parallel_apply(get_pred_pages, axis=1)
    save_doc(TEST_DATA, test_results, mode="test")
print(test_results.head())

notebook2
## PART 2. Sentence retrieval

In [1]:
import torch

Import some libs

In [2]:
# built-in libs
import json
import re
from pathlib import Path
from typing import Dict, List, Set, Tuple, Union

# third-party libs
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from pandarallel import pandarallel
from sklearn.model_selection import train_test_split
from torch.optim import AdamW
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
from tqdm.auto import tqdm


# Helper functions


def load_json(file_path: Union[Path, str]) -> List[dict]:
    """Read a jsonl file and return its records as a list of dicts.

    Args:
        file_path (Union[Path, str]): The jsonl file path.

    Returns:
        List[dict]: One dict per line of the jsonl file.

    Example:
        >>> records = load_json("data/public_train.jsonl")
        >>> pd.DataFrame(records)  # wrap in a DataFrame when needed
    """
    with open(file_path, "r", encoding="utf8") as json_file:
        return [json.loads(json_str) for json_str in json_file]


def jsonl_dir_to_df(dir_path: Union[Path, str]) -> pd.DataFrame:
    """jsonl_dir_to_df read jsonl dir and return a pandas DataFrame.

    This function will read all jsonl files in the dir_path and concat them.

    Args:
        dir_path (Union[Path, str]): The jsonl dir path.

    Returns:
        pd.DataFrame: The jsonl dir content.

    Example:
        >>> jsonl_dir_to_df("data/extracted_dir/")
               id            label  ... predicted_label                                      evidence_list
        0    3984          refutes  ...         REFUTES  [城市規劃是城市建設及管理的依據 ， 位於城市管理之規劃 、 建設 、 運作三個階段之首 ，...
        ..    ...              ...  ...             ...                                                ...
        945  3042         supports  ...         REFUTES  [北歐人相傳每當雷雨交加時就是索爾乘坐馬車出來巡視 ， 因此稱呼索爾為 “ 雷神 ” 。, ...

        [946 rows x 10 columns]
    """
    print(f"Reading and concatenating jsonl files in {dir_path}")
    return pd.concat(
        [pd.DataFrame(load_json(file)) for file in Path(dir_path).glob("*.jsonl")]
    )


def generate_evidence_to_wiki_pages_mapping(
    wiki_pages: pd.DataFrame,
) -> Dict[str, Dict[str, str]]:
    """Generate a mapping from page id to that page's sentences.

    Args:
        wiki_pages (pd.DataFrame): The wiki pages dataframe, with the
            columns `id` and `lines`.

    Returns:
        Dict[str, Dict[str, str]]: For each page id, a dict that maps a
            sentence index (as a string) to the sentence text.
    """

    def make_dict(x):
        result = {}
        sentences = re.split(r"\n(?=[0-9])", x)
        for sent in sentences:
            splitted = sent.split("\t")
            if len(splitted) < 2:
                # Avoid empty articles
                return result
            result[splitted[0]] = splitted[1]
        return result

    # copy wiki_pages
    wiki_pages = wiki_pages.copy()

    # generate parse mapping
    print("Generate parse mapping")
    wiki_pages["evidence_map"] = wiki_pages["lines"].parallel_map(make_dict)
    # generate id to evidence_map mapping
    print("Transform to id to evidence_map mapping")
    mapping = dict(
        zip(
            wiki_pages["id"].to_list(),
            wiki_pages["evidence_map"].to_list(),
        )
    )
    # release memory
    del wiki_pages
    return mapping
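
To make the parsing inside `make_dict` concrete, the sketch below applies the same logic to a hand-written `lines` string. The function name `parse_lines` and the sample text are assumptions; the split pattern and the tab-separated layout are taken from the code above.

```python
import re
from typing import Dict


def parse_lines(lines: str) -> Dict[str, str]:
    """Split a page's `lines` field into {sentence index: sentence text}."""
    result = {}
    # A new sentence starts at a newline that is followed by a digit.
    for sent in re.split(r"\n(?=[0-9])", lines):
        parts = sent.split("\t")
        if len(parts) < 2:
            # Same early exit as make_dict: stop at a line without a tab
            return result
        result[parts[0]] = parts[1]
    return result


sample = "0\tFirst sentence.\n1\tSecond sentence.\n2\t"
print(parse_lines(sample))
# {'0': 'First sentence.', '1': 'Second sentence.', '2': ''}
```

The keys are strings, which is why later code converts the integer sentence index with `str(...)` before looking it up, and empty sentences are kept as empty strings and filtered out later.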


def save_checkpoint(model, ckpt_dir: str, current_step: int, mark: str = ""):
    if mark != "":
        mark += "_"
    torch.save(model.state_dict(), f"{ckpt_dir}/{mark}model.{current_step}.pt")


def load_model(model, ckpt_name, ckpt_dir: str):
    model.load_state_dict(torch.load(f"{ckpt_dir}/{ckpt_name}"))
    return model

pandarallel.initialize(progress_bar=True, verbose=0, nb_workers=10)

Global variables

In [3]:
SEED = 42

TRAIN_DATA = load_json("data/public_train.jsonl")
TEST_DATA = load_json("data/public_test.jsonl")
DOC_DATA = load_json("data/train_doc5.jsonl")

LABEL2ID: Dict[str, int] = {
    "supports": 0,
    "refutes": 1,
    "NOT ENOUGH INFO": 2,
}
ID2LABEL: Dict[int, str] = {v: k for k, v in LABEL2ID.items()}

_y = [LABEL2ID[data["label"]] for data in TRAIN_DATA]
# GT means Ground Truth
TRAIN_GT = DOC_DATA

Preload wiki database (1 min)

In [4]:
wiki_pages = jsonl_dir_to_df("data/wiki-pages")
mapping = generate_evidence_to_wiki_pages_mapping(wiki_pages)
del wiki_pages

Reading and concatenating jsonl files in data/wiki-pages
Generate parse mapping



Transform to id to evidence_map mapping


### Helper function

Calculate precision for sentence retrieval

In [5]:
def evidence_macro_precision(
    instance: Dict,
    top_rows: pd.DataFrame,
) -> Tuple[float, float]:
    """Calculate precision for sentence retrieval
    This function is modified from fever-scorer.
    https://github.com/sheffieldnlp/fever-scorer/blob/master/src/fever/scorer.py

    Args:
        instance (dict): a row of the dev set (dev.jsonl) or test set (test.jsonl)
        top_rows (pd.DataFrame): our predictions with the top probabilities

        IMPORTANT!!!
        instance (dict) should have the key of `evidence`.
        top_rows (pd.DataFrame) should have a column `predicted_evidence`.

    Returns:
        Tuple[float, float]:
        [1]: relevant and retrieved (numerator of precision)
        [2]: retrieved (denominator of precision)
    """
    this_precision = 0.0
    this_precision_hits = 0.0

    # Return 0, 0 if label is not enough info since not enough info does not
    # contain any evidence.
    if instance["label"].upper() != "NOT ENOUGH INFO":
        # e[2] is the page title, e[3] is the sentence index
        all_evi = [[e[2], e[3]]
                   for eg in instance["evidence"]
                   for e in eg
                   if e[3] is not None]
        claim = instance["claim"]
        predicted_evidence = top_rows[top_rows["claim"] ==
                                      claim]["predicted_evidence"].tolist()

        for prediction in predicted_evidence:
            if prediction in all_evi:
                this_precision += 1.0
            this_precision_hits += 1.0

        return (this_precision /
                this_precision_hits) if this_precision_hits > 0 else 1.0, 1.0

    return 0.0, 0.0

Calculate recall for sentence retrieval

In [6]:
def evidence_macro_recall(
    instance: Dict,
    top_rows: pd.DataFrame,
) -> Tuple[float, float]:
    """Calculate recall for sentence retrieval
    This function is modified from fever-scorer.
    https://github.com/sheffieldnlp/fever-scorer/blob/master/src/fever/scorer.py

    Args:
        instance (dict): a row of the dev set (dev.jsonl) or test set (test.jsonl)
        top_rows (pd.DataFrame): our predictions with the top probabilities

        IMPORTANT!!!
        instance (dict) should have the key of `evidence`.
        top_rows (pd.DataFrame) should have a column `predicted_evidence`.

    Returns:
        Tuple[float, float]:
        [1]: relevant and retrieved (numerator of recall)
        [2]: relevant (denominator of recall)
    """
    # We only score the retrieved evidence of claims that are not NEI
    if instance["label"].upper() != "NOT ENOUGH INFO":
        # If there's no evidence to predict, return 1
        if len(instance["evidence"]) == 0 or all(
            [len(eg) == 0 for eg in instance["evidence"]]):
            return 1.0, 1.0

        claim = instance["claim"]

        predicted_evidence = top_rows[top_rows["claim"] ==
                                      claim]["predicted_evidence"].tolist()

        for evidence_group in instance["evidence"]:
            evidence = [[e[2], e[3]] for e in evidence_group]
            if all([item in predicted_evidence for item in evidence]):
                # We only want to score complete groups of evidence. Incomplete
                # groups are worthless.
                return 1.0, 1.0
        return 0.0, 1.0
    return 0.0, 0.0
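
The rule that only complete evidence groups count can be shown on toy data. The helper `any_group_complete` and the page names are hypothetical; the check mirrors the loop over `instance["evidence"]` above.

```python
from typing import List


def any_group_complete(evidence_groups: List[list], predicted: List[list]) -> bool:
    """True if every [page, sentence_id] pair of at least one group was predicted."""
    return any(
        all(item in predicted for item in group)
        for group in evidence_groups
    )


evidence_groups = [
    [["PageA", 0], ["PageA", 3]],  # this group needs both sentences
    [["PageB", 1]],                # this group needs a single sentence
]

partial = [["PageA", 0], ["PageC", 2]]   # half of group 1 only
complete = [["PageC", 2], ["PageB", 1]]  # all of group 2

print(any_group_complete(evidence_groups, partial))   # False
print(any_group_complete(evidence_groups, complete))  # True
```

A claim therefore contributes either 0 or 1 to recall, never a fraction, so retrieving one sentence of a two-sentence group earns nothing.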

Calculate the scores of sentence retrieval

In [7]:
def evaluate_retrieval(
    probs: np.ndarray,
    df_evidences: pd.DataFrame,
    ground_truths: pd.DataFrame,
    top_n: int = 5,
    cal_scores: bool = True,
    save_name: str = None,
) -> Dict[str, float]:
    """Calculate the scores of sentence retrieval

    Args:
        probs (np.ndarray): probabilities of the candidate retrieved sentences
        df_evidences (pd.DataFrame): the candidate evidence sentences paired with claims
        ground_truths (pd.DataFrame): the loaded data of dev.jsonl or test.jsonl
        top_n (int, optional): the number of the retrieved sentences. Defaults to 5.

    Returns:
        Dict[str, float]: F1 score, precision, and recall
    """
    df_evidences["prob"] = probs
    top_rows = (
        df_evidences.groupby("claim").apply(
        lambda x: x.nlargest(top_n, "prob"))
        .reset_index(drop=True)
    )

    if cal_scores:
        macro_precision = 0
        macro_precision_hits = 0
        macro_recall = 0
        macro_recall_hits = 0

        for i, instance in enumerate(ground_truths):
            macro_prec = evidence_macro_precision(instance, top_rows)
            macro_precision += macro_prec[0]
            macro_precision_hits += macro_prec[1]

            macro_rec = evidence_macro_recall(instance, top_rows)
            macro_recall += macro_rec[0]
            macro_recall_hits += macro_rec[1]

        pr = (macro_precision /
              macro_precision_hits) if macro_precision_hits > 0 else 1.0
        rec = (macro_recall /
               macro_recall_hits) if macro_recall_hits > 0 else 0.0
        # Guard against division by zero when both scores are 0
        f1 = 2.0 * pr * rec / (pr + rec) if (pr + rec) > 0 else 0.0

    if save_name is not None:
        # write the claims with their top-n predicted evidence to data/{save_name}
        with open(f"data/{save_name}", "w", encoding="utf8") as f:
            for instance in ground_truths:
                claim = instance["claim"]
                predicted_evidence = top_rows[
                    top_rows["claim"] == claim]["predicted_evidence"].tolist()
                instance["predicted_evidence"] = predicted_evidence
                f.write(json.dumps(instance, ensure_ascii=False) + "\n")

    if cal_scores:
        return {"F1 score": f1, "Precision": pr, "Recall": rec}
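
The top-n selection and the F1 computation can be sketched in plain Python. This is not the pandas code used above: `top_n_per_claim`, `f1_score`, and the toy rows are assumptions that only illustrate keeping the `top_n` highest-probability sentences per claim and combining precision and recall.

```python
import heapq
from collections import defaultdict
from typing import Dict, List


def top_n_per_claim(rows: List[dict], n: int) -> Dict[str, List[dict]]:
    """Group rows by claim and keep the n rows with the highest probability."""
    by_claim = defaultdict(list)
    for row in rows:
        by_claim[row["claim"]].append(row)
    return {
        claim: heapq.nlargest(n, group, key=lambda r: r["prob"])
        for claim, group in by_claim.items()
    }


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, 0.0 when both are 0."""
    total = precision + recall
    return 2.0 * precision * recall / total if total > 0 else 0.0


rows = [
    {"claim": "A", "predicted_evidence": ["p", 0], "prob": 0.1},
    {"claim": "A", "predicted_evidence": ["p", 1], "prob": 0.9},
    {"claim": "A", "predicted_evidence": ["q", 0], "prob": 0.5},
    {"claim": "B", "predicted_evidence": ["r", 0], "prob": 0.3},
]

top = top_n_per_claim(rows, n=2)
print([r["predicted_evidence"] for r in top["A"]])  # [['p', 1], ['q', 0]]
print(f1_score(0.5, 0.5))  # 0.5
```

A larger `n` gives each claim more chances to cover a complete evidence group (higher recall) at the cost of more irrelevant sentences (lower precision).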

Inference script to get probabilities for the candidate evidence sentences

AicupTopkEvidenceBERTDataset class for AICUP dataset with top-k evidence sentences

### Main function for sentence retrieval

In [8]:
def pair_with_wiki_sentences(
    mapping: Dict[str, Dict[str, str]],
    df: pd.DataFrame,
    negative_ratio: float,
) -> pd.DataFrame:
    """Only for creating train sentences.

    Labels: 0 = negative (not evidence), 1 = evidence of a "supports"
    claim, 2 = evidence of a "refutes" claim.
    """
    claims = []
    sentences = []
    labels = []
    idx = []
    # positive
    mappinglabel = {"supports": 1, "refutes": 2}
    for i in range(len(df)):
        if df["label"].iloc[i] == "NOT ENOUGH INFO":
            continue

        claim = df["claim"].iloc[i]
        evidence_sets = df["evidence"].iloc[i]
        labelmap = mappinglabel[ df["label"].iloc[i] ]
        for evidence_set in evidence_sets:
            sents = []
            for evidence in evidence_set:
                # evidence[2] is the page title
                page = evidence[2].replace(" ", "_")
                # the only page with weird name
                if page == "臺灣海峽危機#第二次臺灣海峽危機（1958）":
                    continue
                # evidence[3] is in form of int however, mapping requires str
                sent_idx = str(evidence[3])
                sents.append(mapping[page][sent_idx])

            whole_evidence = " ".join(sents)

            claims.append(claim)
            sentences.append(whole_evidence.replace(" ",""))
            # idx.append(evidence[2])
            labels.append(labelmap)

    # negative
    for i in range(len(df)):
        if df["label"].iloc[i] == "NOT ENOUGH INFO":
            continue
        claim = df["claim"].iloc[i]

        evidence_set = set([(evidence[2], evidence[3])
                            for evidences in df["evidence"][i]
                            for evidence in evidences])
        predicted_pages = df["predicted_pages"][i]
        for page in predicted_pages:
            page = page.replace(" ", "_")
            try:
                page_sent_id_pairs = [
                    (page, sent_idx) for sent_idx in mapping[page].keys()
                ]
            except KeyError:
                # print(f"{page} is not in our Wiki db.")
                continue

            for pair in page_sent_id_pairs:
                if pair in evidence_set:
                    continue
                text = mapping[page][pair[1]]
                # Keep each negative candidate with probability `negative_ratio`
                # so that negatives do not swamp the positive samples
                if text != "" and np.random.rand(1) <= negative_ratio:
                    claims.append(claim)
                    sentences.append(text.replace(" ",""))
                    labels.append(0)
                    # idx.append(page)

    # return pd.DataFrame({"claim": claims, "text": sentences, "idx": idx, "label": labels})
    return pd.DataFrame({"claim": claims, "text": sentences,  "label": labels})
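
The effect of `negative_ratio` can be estimated with a small simulation. This sketch uses Python's `random` module with a fixed seed instead of `np.random.rand`, and the candidate count of 10,000 is an arbitrary assumption for illustration.

```python
import random


def sample_negatives(candidates: list, negative_ratio: float, seed: int = 42) -> list:
    """Keep each candidate independently with probability negative_ratio."""
    rng = random.Random(seed)
    return [c for c in candidates if rng.random() <= negative_ratio]


candidates = [f"sentence {i}" for i in range(10_000)]
kept = sample_negatives(candidates, negative_ratio=0.04)

# On average about negative_ratio * len(candidates) sentences survive.
print(len(kept), "of", len(candidates), "negatives kept")
```

Since every non-evidence sentence on every predicted page is a candidate, even a small ratio yields more negatives than positives; tune the ratio by checking the label counts of the resulting training set.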


def pair_with_wiki_sentences_eval(
    mapping: Dict[str, Dict[int, str]],
    df: pd.DataFrame,
    is_testset: bool = False,
) -> pd.DataFrame:
    """Only for creating dev and test sentences."""
    claims = []
    sentences = []
    evidence = []
    idx = []
    predicted_evidence = []

    # negative
    for i in range(len(df)):
        # if df["label"].iloc[i] == "NOT ENOUGH INFO":
        #     continue
        claim = df["claim"].iloc[i]

        predicted_pages = df["predicted_pages"][i]
        for page in predicted_pages:
            page = page.replace(" ", "_")
            try:
                page_sent_id_pairs = [(page, k) for k in mapping[page]]
            except KeyError:
                # print(f"{page} is not in our Wiki db.")
                continue

            for page_name, sentence_id in page_sent_id_pairs:
                text = mapping[page][sentence_id]
                if text != "":
                    claims.append(claim)
                    sentences.append(text.replace(" ",""))
                    idx.append(page)
                    if not is_testset:
                        evidence.append(df["evidence"].iloc[i])
                    predicted_evidence.append([page_name, int(sentence_id)])

    return pd.DataFrame({
        "claim": claims,
        "text": sentences,
        # "idx": idx,
        "evidence": evidence if not is_testset else None,
        "predicted_evidence": predicted_evidence,
    })

### Step 1. Setup training environment

Hyperparams

In [9]:
#@title  { display-mode: "form" }
NEGATIVE_RATIO = 0.04  #@param {type:"number"}
TOP_N = 5  #@param {type:"integer"}

Experiment Directory

### Step 2. Combine claims and evidence

In [10]:
train_df = pair_with_wiki_sentences(
    mapping,
    pd.DataFrame(TRAIN_GT),
    NEGATIVE_RATIO,
)
counts = train_df["label"].value_counts()
print("Now using the following train data with 0 (negative), 1 (supports) and 2 (refutes)")
print(counts)
train_df.head(10)

Now using the following train data with 0 (negative), 1 (supports) and 2 (refutes)
0    13735
1     5159
2     3308
Name: label, dtype: int64


Unnamed: 0,claim,text,label
0,天衛三軌道在天王星內部的磁層，以《仲夏夜之夢》作者緹坦妮雅命名。,1787年由威廉·赫歇爾發現，並以威廉·莎士比亞的《仲夏夜之夢》中的妖精王后緹坦妮雅命名。,2
1,信天翁科的活動範圍位於北冰洋以及南太平洋，牠的翼展可達到3.7米，是世界上現存的翼展最大的鳥類。,信天翁的活動範圍位於南冰洋以及北太平洋。,2
2,南京大學附屬中學，從中國江蘇省遷移。,南京大學附屬中學，位於中國江蘇省南京市。,2
3,毒魚豆的萃取物被西印度群島的原住民發掘可以導致魚麻醉安靜，讓他們得以趁機徒手抓魚。,西印度群島的原住民發現這種植物的提取物可以令魚麻醉安靜，讓他們可以徒手抓魚。,1
4,軟件開發是一項包括需求獲取、開發規劃、需求分析和設計、編程實現、軟件測試、版本控制的系統工程...,軟件開發是一項包括需求獲取、開發規劃、需求分析和設計、編程實現、軟件測試、版本控制的系統工程...,1
5,國立臺灣大學應用力學研究所從1984年開始招收碩、博士班研究生，首任所長爲理論及應用力學專家...,於1984年招收第一屆碩、博士班研究生，首任所長爲理論及應用力學專家，美國國家工程院院士、中...,1
6,威廉·倫琴拒絕定名新電子波爲倫琴射線，堅持稱作X射線。,有人提議將他發現的新射線定名爲“倫琴射線”，倫琴卻堅持用“X射線”這一名稱，產生X射線的機器...,1
7,數學的基礎分支之一的幾何學在中古世紀的西方並未出現在教育中。,許多文化中都有幾何學的發展，包括許多有關長度、面積及體積的知識，在西元前六世紀泰勒斯的時代，...,2
8,收入豐厚的湯姆克魯斯獲得了許多獎項。,作爲世界上收入最多的演員之一，他獲得了多項榮譽，包含三次金球獎和榮譽金棕櫚獎，以及三次奧斯卡...,1
9,收入豐厚的湯姆克魯斯獲得了許多獎項。,他的電影在北美擁有超過45億票房，在全球擁有超過110億美元票房，使他成爲有史以來票房最高的巨星。,1


In [11]:
train_df[train_df["label"]==0]

Unnamed: 0,claim,text,label
8467,天衛三軌道在天王星內部的磁層，以《仲夏夜之夢》作者緹坦妮雅命名。,夢也有可能發生在其他睡眠階段中，不過這時的夢並不真切也難以記憶。,0
8468,天衛三軌道在天王星內部的磁層，以《仲夏夜之夢》作者緹坦妮雅命名。,這種「記憶抹除」的情況通常發生在一個人是自然緩和地從快速動眼睡眠階段經過慢波睡眠期而進入清醒狀態。,0
8469,天衛三軌道在天王星內部的磁層，以《仲夏夜之夢》作者緹坦妮雅命名。,西格蒙德·弗洛伊德創立了精神分析學，在1900年代早期的許多著作中闡述了夢的理論和解釋。,0
8470,信天翁科的活動範圍位於北冰洋以及南太平洋，牠的翼展可達到3.7米，是世界上現存的翼展最大的鳥類。,北冰洋有巴倫支海、波弗特海、楚克奇海、東西伯利亞海、格陵蘭海、哈得遜灣、哈得遜海峽、喀拉海、...,0
8471,信天翁科的活動範圍位於北冰洋以及南太平洋，牠的翼展可達到3.7米，是世界上現存的翼展最大的鳥類。,諸如“世界宗教”、“世界語言”、“世界政府”、“世界大戰”、“世界人口”、“世界經濟”或及“...,0
...,...,...,...
22197,朱邦復於1976年創制了由蔣經國命名的倉頡輸入法。,曾任中華民國總統、行政院院長、中國國民黨主席、制憲國民大會代表、國防部部長、行政院國軍退除役...,0
22198,加拿大國家銀行是加拿大第六大商業銀行，是具有信用創作功能的金融機構。,15世紀末，英國和法國殖民者開始探索北美洲的東岸，並在此建立殖民地。,0
22199,法國南部的摩納哥除了南部海岸線是靠地中海，其他三面皆被法國包圍。,這一帶對奧克西塔尼亞很大程度上說，哪些是奧克語(langued'oc）與法國南部的奧依語（l...,0
22200,法國南部的摩納哥除了南部海岸線是靠地中海，其他三面皆被法國包圍。,地中海沿岸夏季炎熱乾燥，冬季溫暖溼潤，被稱作地中海性氣候。,0


In [12]:
train_df['claim'].iloc[4],train_df['claim'].iloc[1650],

('軟件開發是一項包括需求獲取、開發規劃、需求分析和設計、編程實現、軟件測試、版本控制的系統工程，其中包含任何最終獲得軟件產品的活動。',
 '曾擔任中共中央委員的羅榮桓，被授予過中華人民共和國元帥軍銜。')

### Step 3. Start training

Fine-tune a text classifier with AutoGluon's `MultiModalPredictor`

In [None]:
from autogluon.multimodal import MultiModalPredictor
import uuid
from ray import tune
from fastai.metrics import Recall
# Recall(average='weighted')
model_path = f"./tmp/{uuid.uuid4().hex}-automm_sst"
predictor = MultiModalPredictor(label='label', path=model_path, presets="best_quality")
predictor.fit(
    train_df,
    hyperparameters={
        "model.hf_text.checkpoint_name": "hfl/chinese-lert-base",
        "optimization.max_epochs": 20,
        "optimization.top_k": 5,
    },
)
# hfl/chinese-bert-wwm-ext
# hfl/chinese-macbert-base
# hfl/chinese-roberta-wwm-ext-large
# tune.choice(
#    [
#        'hfl/chinese-lert-base', 'hfl/chinese-bert-wwm-ext', 'hfl/chinese-macbert-base'
#    ]
# ),
# presets="best_quality"

Global seed set to 123
AutoMM starts to create your model. ✨

- Model will be saved to "/workspace/AI_Text/tmp/a90295f8b21041368c3dedde440f789a-automm_sst".

- Validation metric is "accuracy".

- To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /workspace/AI_Text/tmp/a90295f8b21041368c3dedde440f789a-automm_sst
    ```

Enjoy your coffee, and let AutoMM do the job ☕☕☕ Learn more at https://auto.gluon.ai

Using 16bit None Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name              | Type                         | Params
-------------------------------------------------------------------
0 | model             | HFAutoModelForTextPrediction | 102 M 
1 | validation_metric | Accuracy            

Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

In [None]:
predictor.save("step2_split")

In [14]:
train_df.head()

Unnamed: 0,claim,text,label
0,天衛三軌道在天王星內部的磁層，以《仲夏夜之夢》作者緹坦妮雅命名。,1787年由威廉·赫歇爾發現，並以威廉·莎士比亞的《仲夏夜之夢》中的妖精王后緹坦妮雅命名。,1
1,信天翁科的活動範圍位於北冰洋以及南太平洋，牠的翼展可達到3.7米，是世界上現存的翼展最大的鳥類。,信天翁的活動範圍位於南冰洋以及北太平洋。,1
2,南京大學附屬中學，從中國江蘇省遷移。,南京大學附屬中學，位於中國江蘇省南京市。,1
3,毒魚豆的萃取物被西印度群島的原住民發掘可以導致魚麻醉安靜，讓他們得以趁機徒手抓魚。,西印度群島的原住民發現這種植物的提取物可以令魚麻醉安靜，讓他們可以徒手抓魚。,1
4,軟件開發是一項包括需求獲取、開發規劃、需求分析和設計、編程實現、軟件測試、版本控制的系統工程，其中包含任何最終獲得軟件產品的活動。,軟件開發是一項包括需求獲取、開發規劃、需求分析和設計、編程實現、軟件測試、版本控制的系統工程。換句話說，軟件開發就是一系列最終構建出軟件產品的活動。,1


Save your memory.

Trainer

In [16]:
train_evidences = pair_with_wiki_sentences_eval(
    mapping=mapping,
    df=pd.DataFrame(TRAIN_GT),
)
train_evidences.head()

Unnamed: 0,claim,text,evidence,predicted_evidence
0,天衛三軌道在天王星內部的磁層，以《仲夏夜之夢》作者緹坦妮雅命名。,天王星（Uranus）是一顆在太陽系中離太陽第七近的青色行星，其體積在太陽系中排名第三、質量排名第四。,"[[[4209, 4331, 天衛三, 2]]]","[天王星, 0]"
1,天衛三軌道在天王星內部的磁層，以《仲夏夜之夢》作者緹坦妮雅命名。,天王星的英文名稱Uranus來自古希臘神話的天空之神烏拉諾斯，是克洛諾斯的父親、宙斯的祖父。,"[[[4209, 4331, 天衛三, 2]]]","[天王星, 3]"
2,天衛三軌道在天王星內部的磁層，以《仲夏夜之夢》作者緹坦妮雅命名。,在西方文化中，天王星是太陽系中唯一以希臘神祇命名的行星，其他行星都依照羅馬神祇命名。,"[[[4209, 4331, 天衛三, 2]]]","[天王星, 4]"
3,天衛三軌道在天王星內部的磁層，以《仲夏夜之夢》作者緹坦妮雅命名。,與在古代就爲人們所知的五顆行星（水星、金星、火星、木星、土星）相比，天王星的亮度也是肉眼可見的，但由於較爲黯淡以及緩慢的繞行速度而未被古代的觀測者認定爲一顆行星。,"[[[4209, 4331, 天衛三, 2]]]","[天王星, 7]"
4,天衛三軌道在天王星內部的磁層，以《仲夏夜之夢》作者緹坦妮雅命名。,直到1781年3月13日，威廉·赫歇耳爵士宣佈發現天王星，從而在太陽系的現代史上首度擴展已知的界限，也是第一顆使用望遠鏡發現的行星。,"[[[4209, 4331, 天衛三, 2]]]","[天王星, 8]"


Validation part (15 mins)

In [16]:
# predictor.save("step2_unbalance")

In [None]:
from autogluon.multimodal import MultiModalPredictor
# predictor = MultiModalPredictor.load("step2_unbalance")
probs = predictor.predict_proba(train_evidences)
max_prob_indices = np.argmax(probs, axis=1)
probs = probs.to_numpy()

Predicting: 0it [00:00, ?it/s]

In [None]:
probs

In [None]:
probs[:, 1] += probs[:, 2]

In [None]:
second_values = [p[1] for p in probs]
second_values[:5]

In [None]:
threshold = 0
train_evidences["probs"] = second_values
train_evidences = train_evidences[train_evidences["probs"] > threshold]

In [None]:
second_values = train_evidences["probs"].values
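The cells above attach a score to every candidate sentence and drop rows that do not exceed `threshold`. As a rough illustration of what a per-claim top-n selection on those scores could look like, here is a minimal sketch. The helper `top_n_per_claim` and the toy data are hypothetical, and this is not necessarily how `evaluate_retrieval` selects sentences.

```python
import pandas as pd


def top_n_per_claim(df: pd.DataFrame, n: int, threshold: float = 0.0) -> pd.DataFrame:
    """Keep rows scoring above `threshold`, then the n best rows per claim.

    Hypothetical helper for illustration; expects `claim` and `probs` columns.
    """
    kept = df[df["probs"] > threshold]
    # A stable sort keeps the original order among rows with equal scores.
    ranked = kept.sort_values("probs", ascending=False, kind="stable")
    return ranked.groupby("claim", sort=False).head(n)


# Toy data: three candidate sentences for claim A, one for claim B.
toy = pd.DataFrame(
    {
        "claim": ["A", "A", "A", "B"],
        "text": ["a1", "a2", "a3", "b1"],
        "probs": [0.2, 0.9, 0.5, 0.0],
    }
)
top2 = top_n_per_claim(toy, n=2)
print(top2)
```

With `threshold = 0`, a claim whose candidates all score exactly 0 ends up with no rows at all, so downstream code has to cope with claims that have empty evidence.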

In [None]:
import json
train_results = evaluate_retrieval(
    probs=second_values,
    df_evidences=train_evidences,
    ground_truths=TRAIN_GT,
    top_n=TOP_N,
    save_name=f"train_doc5sent{TOP_N}.jsonl",
)
print(f"Training scores => {train_results}")

In [None]:
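# Tag the saved model with its F1 score and recall so runs are easy to compare.
file_name = f"{train_results['F1 score']:.2f}-{train_results['Recall']:.2f}"
predictor.save("step2_base" + file_name)
# Use a context manager so the file is closed even if the write fails.
with open("step2.txt", "w") as f:
    f.write(str(train_results))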
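The cell above stores the scores as the string form of a dict, which is awkward to read back. If the scores need to be reloaded later, writing them as JSON is one option. This is a minimal sketch: `save_metrics`, `load_metrics`, the file name `step2.json` and the numbers are all made up for illustration, and it assumes the values in `train_results` are plain Python numbers (NumPy scalars would need converting with `float()` first).

```python
import json
import tempfile
from pathlib import Path


def save_metrics(metrics: dict, path: Path) -> None:
    """Write a metrics dict as JSON (hypothetical helper)."""
    with open(path, "w", encoding="utf8") as f:
        json.dump(metrics, f, ensure_ascii=False, indent=2)


def load_metrics(path: Path) -> dict:
    """Read a metrics dict back from JSON (hypothetical helper)."""
    with open(path, "r", encoding="utf8") as f:
        return json.load(f)


# Made-up numbers for illustration only; not results from this notebook.
example_metrics = {"F1 score": 0.5, "Recall": 0.75}

with tempfile.TemporaryDirectory() as tmp_dir:
    metrics_path = Path(tmp_dir) / "step2.json"
    save_metrics(example_metrics, metrics_path)
    reloaded = load_metrics(metrics_path)

print(reloaded == example_metrics)
```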

### Step 4. Check on our test data
(5 min)

In [24]:
test_data = load_json("data/test_doc5.jsonl")

test_evidences = pair_with_wiki_sentences_eval(
    mapping,
    pd.DataFrame(test_data),
    is_testset=True,
)
test_evidences.head()



Unnamed: 0,claim,text,evidence,predicted_evidence
0,光學顯微鏡是以電磁學原理來將不可見或難見的微小物放大至肉眼可見的儀器。,光學顯微鏡（Opticalmicroscope、Lightmicroscope）是一種利用光學透鏡產生影像放大效應的顯微鏡。,,"[光學顯微鏡, 0]"
1,光學顯微鏡是以電磁學原理來將不可見或難見的微小物放大至肉眼可見的儀器。,由物體入射的光被至少兩個光學系統（物鏡和目鏡）放大。,,"[光學顯微鏡, 3]"
2,光學顯微鏡是以電磁學原理來將不可見或難見的微小物放大至肉眼可見的儀器。,首先物鏡產生一個被放大實像，人眼通過作用相當於放大鏡的目鏡觀察這個已經被放大了的實像。,,"[光學顯微鏡, 4]"
3,光學顯微鏡是以電磁學原理來將不可見或難見的微小物放大至肉眼可見的儀器。,一般的光學顯微鏡有多個可以替換的物鏡，這樣觀察者可以按需要更換放大倍數，也就是增加放大倍率，放大倍率是由目鏡倍率乘上物鏡倍率所得來的。,,"[光學顯微鏡, 5]"
4,光學顯微鏡是以電磁學原理來將不可見或難見的微小物放大至肉眼可見的儀器。,這些物鏡一般被安置在一個可以轉動的物鏡盤上，轉動物鏡盤就可以使不同的物鏡方便地進入光路，物鏡盤的英文是Nosepiece，又譯作鼻輪。,,"[光學顯微鏡, 6]"


In [25]:
print("Start predicting the test data")
probs = predictor.predict_proba(test_evidences)
probs = probs.to_numpy()
second_values = [p[1] for p in probs]
test_evidences["probs"] = second_values
test_evidences = test_evidences[test_evidences["probs"] > threshold]

Start predicting the test data


Predicting: 0it [00:00, ?it/s]

In [26]:
second_values = test_evidences["probs"].values
evaluate_retrieval(
    probs=second_values,
    df_evidences=test_evidences,
    ground_truths=test_data,
    top_n=TOP_N,
    cal_scores=False,
    save_name=f"test_doc5sent{TOP_N}.jsonl",
)

notebook3
## PART 3. Claim verification

Import libraries

In [1]:
import pickle
from pathlib import Path
from typing import Dict, Tuple

import numpy as np
import pandas as pd
from pandarallel import pandarallel
from tqdm.auto import tqdm

import torch
from sklearn.metrics import accuracy_score
from torch.optim import AdamW
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    get_scheduler,
)
from utils import (
    generate_evidence_to_wiki_pages_mapping,
    jsonl_dir_to_df,
    load_json,
    load_model,
    save_checkpoint,
)


pandarallel.initialize(progress_bar=True, verbose=0, nb_workers=4)

Global variables

In [2]:
LABEL2ID: Dict[str, int] = {
    "supports": 0,
    "refutes": 1,
    "NOT ENOUGH INFO": 2,
}
ID2LABEL: Dict[int, str] = {v: k for k, v in LABEL2ID.items()}

TRAIN_DATA = load_json("data/train_doc5sent5.jsonl")
TRAIN_PKL_FILE = Path("data/train_doc5sent5.pkl")


Preload the wiki database (same as in Part 2).

In [3]:
wiki_pages = jsonl_dir_to_df("data/wiki-pages")
mapping = generate_evidence_to_wiki_pages_mapping(wiki_pages)
del wiki_pages

Reading and concatenating jsonl files in data/wiki-pages
Generate parse mapping



Transform to id to evidence_map mapping


In [4]:
print("wiki_pages")

wiki_pages


In [5]:
def join_with_topk_evidence(
    df: pd.DataFrame,
    mapping: dict,
    mode: str = "train",
    topk: int = 5,
) -> pd.DataFrame:
    """Attach the top-k predicted evidence sentences to each row of the dataset.

    Note:
        After extraction, the dataset will look like this:
               id     label         claim                           evidence            evidence_list
        0    4604  supports       高行健...     [[[3393, 3552, 高行健, 0], [...  [高行健 （ ）江西赣州出...
        ..    ...       ...            ...                                ...                     ...
        945  2095  supports       美國總...  [[[1879, 2032, 吉米·卡特, 16], [...  [卸任后 ， 卡特積極參與...

        [946 rows x 5 columns]

    Args:
        df (pd.DataFrame): The dataset with a `predicted_evidence` column.
        mapping (dict): Maps a wiki page title to a dict of
            {sentence index (str): sentence text}.
        mode (str, optional): Only used in the progress message.
            Defaults to "train".
        topk (int, optional): Number of evidence sentences to keep.
            Defaults to 5.

    Returns:
        pd.DataFrame: The dataset with a top-k `evidence_list` column.
            The `evidence_list` column will be: List[str]
    """

    # Normalize the evidence column to List[List[Tuple[str, str, str, str]]].
    if "evidence" in df.columns:
        df["evidence"] = df["evidence"].parallel_map(
            lambda x: [[x]] if not isinstance(x[0], list)
            else ([x] if not isinstance(x[0][0], list) else x)
        )

    print(f"Extracting evidence_list for the {mode} mode ...")
    df["evidence_list"] = df["predicted_evidence"].parallel_map(lambda x: [
            mapping.get(evi_id, {}).get(str(evi_idx), "")
            for evi_id, evi_idx in x  # for each evidence list
        ][:topk] if isinstance(x, list) else [])
    print(df["evidence_list"][:5])

    return df
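To make the lookup inside `join_with_topk_evidence` easier to follow, here is a minimal sketch of the same per-row logic without pandas or pandarallel. The helper `lookup_evidence` and the toy mapping are hypothetical. The mapping shape (page title to a dict keyed by the sentence index as a string) is inferred from the `mapping.get(evi_id, {}).get(str(evi_idx), "")` call above.

```python
from typing import Dict, List


def lookup_evidence(
    predicted_evidence,
    mapping: Dict[str, Dict[str, str]],
    topk: int = 5,
) -> List[str]:
    """Resolve [page_title, sentence_index] pairs to sentence text.

    Hypothetical stand-alone version of the per-row lambda above.
    """
    if not isinstance(predicted_evidence, list):
        # Rows without a usable prediction get an empty evidence list.
        return []
    sentences = [
        mapping.get(page, {}).get(str(idx), "")
        for page, idx in predicted_evidence
    ]
    return sentences[:topk]


# Toy mapping with placeholder sentences (not real wiki content).
toy_mapping = {"天衛三": {"0": "sentence zero", "2": "sentence two"}}

result = lookup_evidence(
    [["天衛三", 2], ["天衛三", 9], ["磁層", 1]],
    toy_mapping,
    topk=2,
)
print(result)
```

An unknown page or sentence index does not raise an error. It yields an empty string, so `evidence_list` can contain blank entries that later steps should expect.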

### Step 1. Setup training environment

Hyperparams

In [6]:
#@title  { display-mode: "form" }

EVIDENCE_TOPK = 5  #@param {type:"integer"}

Experiment Directory

### Step 2. Concatenate claim and evidence
Join the top-k evidence

In [7]:
train_df = join_with_topk_evidence(
        pd.DataFrame(TRAIN_DATA),
        mapping,
        topk=EVIDENCE_TOPK,
    )
train_df.to_pickle(TRAIN_PKL_FILE, protocol=4)

train_df.head()


Extracting evidence_list for the train mode ...



0    [1787年由威廉 · 赫歇爾發現 ， 並以威廉 · 莎士比亞的 《 仲夏夜之夢 》 中的妖...
1    [漂泊信天翁的翼展可達到 3.7 米 ， 是世界上現存的翼展最大的飛行鳥類 。, 信天翁的活...
2    [組合F.I.R.的鍵盤手 。, 也是F.I.R.飛兒樂團的成員 ， 負責作曲 、 編曲 、...
3    [香港國際機場全年24小時運作 ， 2010年處理 5,390 萬人次旅客及410萬公噸貨物...
4    [黨委書記和校長列入中央管理的高校 ， 簡稱中管高校 ， 俗稱 “ 副部級高校 ” ， 爲中...
Name: evidence_list, dtype: object


Unnamed: 0,id,label,evidence,claim,predicted_pages,predicted_evidence,evidence_list
0,2663,refutes,"[[[4209, 4331, 天衛三, 2]]]",天衛三軌道在天王星內部的磁層，以《仲夏夜之夢》作者緹坦妮雅命名。,"[天王星, 仲夏夜之夢_(消歧義), 仲夏夜_(羅文專輯), 緹坦妮雅, 天衛三, 磁層, ...","[[天衛三, 2], [天衛三, 3], [磁層, 1], [天衛三, 0], [緹坦妮雅,...",[1787年由威廉 · 赫歇爾發現 ， 並以威廉 · 莎士比亞的 《 仲夏夜之夢 》 中的妖...
1,2399,refutes,"[[[2719, 2928, 信天翁科, 2]]]",信天翁科的活動範圍位於北冰洋以及南太平洋，牠的翼展可達到3.7米，是世界上現存的翼展最大的鳥類。,"[牠, 翼展, 鳥, 信天翁科, 太平洋, 北冰洋, 世界, 南太平洋]","[[信天翁科, 4], [信天翁科, 2], [信天翁科, 1], [信天翁科, 0], [...","[漂泊信天翁的翼展可達到 3.7 米 ， 是世界上現存的翼展最大的飛行鳥類 。, 信天翁的活..."
2,8075,NOT ENOUGH INFO,"[[[8075, null, null, null]]]",F.I.R.的團員有主唱Faye飛（詹雯婷）、吉他手Real阿沁（黃漢青）、鍵盤手Ian（陳...,"[亞洲, 人, 陳建寧, 吉他, 吉他手, 飛_(消歧義), F._R._大衛, 詹雯婷, ...","[[陳建寧, 1], [阿沁, 1], [飛_(消歧義), 24], [飛_(消歧義), 1...","[組合F.I.R.的鍵盤手 。, 也是F.I.R.飛兒樂團的成員 ， 負責作曲 、 編曲 、..."
3,8931,NOT ENOUGH INFO,"[[[8931, null, null, null]]]",香港國際機場全年24小時運作，它從2001年起一直躋身世界最佳機場，並8度獲評級爲全宇宙最佳機場。,"[機場, 2001年, 24小時, 香港國際機場, 24_(電視劇), 世界, 宇宙]","[[香港國際機場, 7], [香港國際機場, 14], [香港國際機場, 15], [香港國...","[香港國際機場全年24小時運作 ， 2010年處理 5,390 萬人次旅客及410萬公噸貨物..."
4,332,NOT ENOUGH INFO,"[[[332, null, null, null]]]",北理工是歷史上最後一批副部級高校，黨委書記和校長列入中央管理的高校，簡稱中管高校，俗稱“副部...,"[中華人民共和國, 黨委書記和校長列入中央管理的高校, 中央部屬高校, 高等學校, 歷史, 校長]","[[黨委書記和校長列入中央管理的高校, 0], [黨委書記和校長列入中央管理的高校, 1],...",[黨委書記和校長列入中央管理的高校 ， 簡稱中管高校 ， 俗稱 “ 副部級高校 ” ， 爲中...


### Step 3. Training

Prevent CUDA out of memory

In [8]:
torch.cuda.empty_cache()

Training (30 mins)

In [10]:
def list_to_string(lst):
    # Convert the list to a single string and remove all spaces.
    return "[PAD]".join(str(item).replace(' ', '') for item in lst)
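A quick usage sketch of `list_to_string`, redefined here so the snippet runs on its own. The input strings are made-up examples. It shows two behaviours worth knowing: ASCII spaces are removed inside every item (including spaces between English words), and an empty list becomes an empty string.

```python
def list_to_string(lst):
    # Same logic as the cell above: join items with "[PAD]" and drop spaces.
    return "[PAD]".join(str(item).replace(" ", "") for item in lst)


joined = list_to_string(["1787 年 由 威廉", "Waterloo Bridge 是 電影"])
empty = list_to_string([])

print(joined)
print(repr(empty))
```

The empty-string case is what the later `train_df['evidence_list'] == ''` check looks for when it searches for claims with no retrieved evidence.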

In [10]:
# Copy the selection so the later column assignment does not modify a view.
train_df = train_df[["label", "claim", "evidence_list"]].copy()

In [11]:


# Apply the function to the entire column.
train_df['evidence_list'] = train_df['evidence_list'].apply(list_to_string)
train_df

Unnamed: 0,label,claim,evidence_list
0,refutes,天衛三軌道在天王星內部的磁層，以《仲夏夜之夢》作者緹坦妮雅命名。,1787年由威廉·赫歇爾發現，並以威廉·莎士比亞的《仲夏夜之夢》中的妖精王后緹坦妮雅命名。[...
1,refutes,信天翁科的活動範圍位於北冰洋以及南太平洋，牠的翼展可達到3.7米，是世界上現存的翼展最大的鳥類。,漂泊信天翁的翼展可達到3.7米，是世界上現存的翼展最大的飛行鳥類。[PAD]信天翁的活動範圍...
2,NOT ENOUGH INFO,F.I.R.的團員有主唱Faye飛（詹雯婷）、吉他手Real阿沁（黃漢青）、鍵盤手Ian（陳...,組合F.I.R.的鍵盤手。[PAD]也是F.I.R.飛兒樂團的成員，負責作曲、編曲、製作與吉...
3,NOT ENOUGH INFO,香港國際機場全年24小時運作，它從2001年起一直躋身世界最佳機場，並8度獲評級爲全宇宙最佳機場。,"香港國際機場全年24小時運作，2010年處理5,390萬人次旅客及410萬公噸貨物；2001..."
4,NOT ENOUGH INFO,北理工是歷史上最後一批副部級高校，黨委書記和校長列入中央管理的高校，簡稱中管高校，俗稱“副部...,黨委書記和校長列入中央管理的高校，簡稱中管高校，俗稱“副部級高校”，爲中華人民共和國中央部屬...
...,...,...,...
11341,NOT ENOUGH INFO,伯明翰大學一共出了11位諾貝爾獎得主，其研究成果包括過敏性鼻炎的應用。,在學術研究方面，伯明翰大學的研究成果包括成功研製代替心臟運行的塑料心臟、維生素C的合成、利用...
11342,NOT ENOUGH INFO,由其士集團負責承建並於2003年建築完成的翠豐臺位於香港新界，原是荃灣南豐紗廠的設廠地。,南豐紗廠(前稱：），位於香港荃灣柴灣角白田壩街45號，於1954年由人稱「棉紗大王」的南豐集...
11343,NOT ENOUGH INFO,行經行政區域有中正區、大安區、信義區的臺北市仁愛路其中有100公尺寬的林園大道。,仁愛路亦爲臺北市著名的林蔭大道之一，路中央佈設公車專用道。[PAD]仁愛路爲臺北市的重要幹道...
11344,NOT ENOUGH INFO,位處於臺南市由文化部所管轄的國定古蹟臺南北極殿，在臺南市區最高點，坐南朝北。,臺南北極殿，位於臺南市中西區，俗稱大上帝廟，爲中華民國國定古蹟[PAD]府城中和境鷲嶺北極殿...


In [12]:
train_df['label'].value_counts()

supports           4819
NOT ENOUGH INFO    3276
refutes            3251
Name: label, dtype: int64

In [13]:
selected_rows = train_df[(train_df['evidence_list'] == '') & (train_df['label'] == 'supports')]
selected_rows

Unnamed: 0,label,claim,evidence_list
6453,supports,鐒是在1961年被發現的。,
7557,supports,單一經濟共同體的貸款和買賣外匯是由央行負責。,


In [None]:
from autogluon.multimodal import MultiModalPredictor
import uuid

model_path = f"./tmp/{uuid.uuid4().hex}-automm_sst"
predictor = MultiModalPredictor(label="label", eval_metric="acc", path=model_path)
predictor.fit(
    train_df,
    hyperparameters={
        "model.hf_text.checkpoint_name": "hfl/chinese-lert-base",
        "env.per_gpu_batch_size": 4,
        "env.eval_batch_size_ratio": 2,
        "optimization.max_epochs": 40,
        "optimization.top_k": 3,
    },
)
predictor.save("step3_base", standalone=True)

Global seed set to 123
AutoMM starts to create your model. ✨

- Model will be saved to "/workspace/AI_Text/tmp/38aa6c0adf2742538b49d5bd353b0b58-automm_sst".

- Validation metric is "acc".

- To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /workspace/AI_Text/tmp/38aa6c0adf2742538b49d5bd353b0b58-automm_sst
    ```

Enjoy your coffee, and let AutoMM do the job ☕☕☕ Learn more at https://auto.gluon.ai

Using 16bit None Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name              | Type                         | Params
-------------------------------------------------------------------
0 | model             | HFAutoModelForTextPrediction | 102 M 
1 | validation_metric | Accuracy                 

Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

### Step 4. Make your submission

In [7]:
from autogluon.multimodal import MultiModalPredictor
# predictor = MultiModalPredictor.load("step3_base")
TEST_DATA = load_json("data/test_doc5sent5.jsonl")
TEST_PKL_FILE = Path("data/test_doc5sent5.pkl")

test_df = join_with_topk_evidence(
        pd.DataFrame(TEST_DATA),
        mapping,
        mode="eval",
        topk=EVIDENCE_TOPK,
    )
test_df.to_pickle(TEST_PKL_FILE, protocol=4)


Extracting evidence_list for the eval mode ...



0    [顯微鏡泛指將微小不可見或難見物品之影像放大 ， 而能被肉眼或其他成像儀器觀察之工具 。, ...
1    [蠶產絲 ， 蜜蜂產蜂蜜 ， 兩者都已被人類馴化 。, 家蠶 （ 學名 ： Bombyx m...
2    [綠山城縣  ， 是波蘭的縣份 ， 位於該國西部 ， 由盧布斯卡省負責管轄 ， 首府設於綠山...
3    [《 魂斷藍橋 》 （ Waterloo Bridge ） 是美國黑白電影 ， 由米高梅電影...
4    [2015年以 《 刺客聶隱娘 》 獲得第68屆坎城影展最佳導演獎及第52屆金馬獎最佳導演獎...
Name: evidence_list, dtype: object


In [8]:
test_df_org = test_df.copy()
test_df

Unnamed: 0,id,claim,predicted_pages,predicted_evidence,evidence_list
0,5208,光學顯微鏡是以電磁學原理來將不可見或難見的微小物放大至肉眼可見的儀器。,"[光學顯微鏡, 電磁學, 顯微鏡, 肉眼, 物]","[[顯微鏡, 0], [光學顯微鏡, 0], [顯微鏡, 1], [顯微鏡, 6], [光學...","[顯微鏡泛指將微小不可見或難見物品之影像放大 ， 而能被肉眼或其他成像儀器觀察之工具 。, ..."
1,1019,產絲的蠶或產蜜的蜜蜂爲提供間接經濟利益的昆蟲。,"[昆蟲, 蠶]","[[昆蟲, 23], [蠶, 0], [蠶, 4], [昆蟲, 22], [蠶, 1]]","[蠶產絲 ， 蜜蜂產蜂蜜 ， 兩者都已被人類馴化 。, 家蠶 （ 學名 ： Bombyx m..."
2,8514,波蘭西部的綠山城縣平均每平方公里的土地有0人。,"[土地, 人, 西部, 綠山城縣, 波蘭, 0]","[[綠山城縣, 0], [波蘭, 1], [土地, 10], [土地, 8], [土地, 4]]",[綠山城縣 ， 是波蘭的縣份 ， 位於該國西部 ， 由盧布斯卡省負責管轄 ， 首府設於綠山...
3,1874,VivienLeigh主演魂斷藍橋中的女配角。,"[橋, 配角, 藍橋, 魂斷藍橋]","[[魂斷藍橋, 0], [藍橋, 2], [魂斷藍橋, 8], [魂斷藍橋, 1], [藍橋...",[《 魂斷藍橋 》 （ Waterloo Bridge ） 是美國黑白電影 ， 由米高梅電影...
4,8352,侯孝賢改編自唐代文言文學的電影獲得金馬獎最佳劇情片獎。,"[侯孝賢, 劇情片, 金馬獎, 文學, 文言文, 電影]","[[侯孝賢, 2], [侯孝賢, 1], [侯孝賢, 0], [侯孝賢, 3], [金馬獎,...",[2015年以 《 刺客聶隱娘 》 獲得第68屆坎城影展最佳導演獎及第52屆金馬獎最佳導演獎...
...,...,...,...,...,...
984,5668,中國的汾河是長江的支流之一。,"[中國, 汾河, 長江]","[[汾河, 1], [汾河, 2], [汾河, 0], [汾河, 9], [汾河, 3]]",[汾河源於中國山西省北部忻州市寧武縣管涔山南側 ， 經太原盆地南流到新絳縣折向西 ， 由運城...
985,4372,鐵達尼號首航的贊助公司在紐約。,"[鐵達尼號_(1943年電影), 贊助, 紐約]","[[贊助, 2], [贊助, 4], [鐵達尼號_(1943年電影), 0], [紐約, 9...",[其中由於受到贊助 ， 使某些有意義的事項及心願 ， 由原來的不可能實現 ， 成爲可以實現的...
986,8250,麗臺科技力求『創新和好品質的信念』，成爲了亞洲知名的電腦及智慧醫療研發製造商。,"[亞洲, 麗臺科技, 創新, 品質, 信念]","[[麗臺科技, 1], [麗臺科技, 0], [麗臺科技, 9], [麗臺科技, 4], [...",[麗臺科技是全球知名的電腦及智慧醫療研發製造商 、 NVIDIA長期合作伙伴 ， 以 「 研...
987,7215,秋紅谷廣場在天災時有控制洪水蔓延速度的功能。,"[功能, 秋紅谷廣場]","[[秋紅谷廣場, 0], [功能, 0]]",[秋紅谷生態公園 ， 又稱秋紅谷廣場 、 秋紅谷景觀生態公園 ， 通常直稱秋紅谷 ， 是位於...


In [11]:
# Copy the selection so the column assignment does not modify a view.
test_df = test_df[["claim", "evidence_list"]].copy()
test_df['evidence_list'] = test_df['evidence_list'].apply(list_to_string)

In [12]:
test_df

Unnamed: 0,claim,evidence_list
0,光學顯微鏡是以電磁學原理來將不可見或難見的微小物放大至肉眼可見的儀器。,顯微鏡泛指將微小不可見或難見物品之影像放大，而能被肉眼或其他成像儀器觀察之工具。[PAD]光...
1,產絲的蠶或產蜜的蜜蜂爲提供間接經濟利益的昆蟲。,蠶產絲，蜜蜂產蜂蜜，兩者都已被人類馴化。[PAD]家蠶（學名：Bombyxmori）是鱗翅目...
2,波蘭西部的綠山城縣平均每平方公里的土地有0人。,"綠山城縣，是波蘭的縣份，位於該國西部，由盧布斯卡省負責管轄，首府設於綠山城，面積1,571平..."
3,VivienLeigh主演魂斷藍橋中的女配角。,《魂斷藍橋》（WaterlooBridge）是美國黑白電影，由米高梅電影公司於1940年出品...
4,侯孝賢改編自唐代文言文學的電影獲得金馬獎最佳劇情片獎。,2015年以《刺客聶隱娘》獲得第68屆坎城影展最佳導演獎及第52屆金馬獎最佳導演獎與金馬獎最...
...,...,...
984,中國的汾河是長江的支流之一。,汾河源於中國山西省北部忻州市寧武縣管涔山南側，經太原盆地南流到新絳縣折向西，由運城市的河津市...
985,鐵達尼號首航的贊助公司在紐約。,其中由於受到贊助，使某些有意義的事項及心願，由原來的不可能實現，成爲可以實現的奇蹟，當中贊助...
986,麗臺科技力求『創新和好品質的信念』，成爲了亞洲知名的電腦及智慧醫療研發製造商。,麗臺科技是全球知名的電腦及智慧醫療研發製造商、NVIDIA長期合作伙伴，以「研究創新、品質至...
987,秋紅谷廣場在天災時有控制洪水蔓延速度的功能。,秋紅谷生態公園，又稱秋紅谷廣場、秋紅谷景觀生態公園，通常直稱秋紅谷，是位於臺灣台中市西屯區七...


In [13]:
predictor = MultiModalPredictor.load("step3_base")
predicted_label = predictor.predict(test_df)

Load pretrained checkpoint: /workspace/AI_Text/step3_base/model.ckpt


Predicting: 0it [00:00, ?it/s]

Prediction

In [14]:
predict_dataset = test_df_org
predict_dataset["predicted_label"] = predicted_label
OUTPUT_FILENAME = "submission.jsonl"
predict_dataset[["id", "predicted_label", "predicted_evidence"]].to_json(
    OUTPUT_FILENAME,
    orient="records",
    lines=True,
    force_ascii=False,
)

Change any prediction of SUPPORTS or REFUTES whose predicted_evidence is empty to NOT ENOUGH INFO.

In [None]:

import pandas as pd 
from utils import load_json

data = load_json(OUTPUT_FILENAME)
# Relabel any supports/refutes prediction that has no predicted evidence.
for item in data:
    if item['predicted_label'] == 'NOT ENOUGH INFO':
        continue
    if len(item['predicted_evidence']) == 0:
        print(item)
        item['predicted_label'] = 'NOT ENOUGH INFO'
# Save the corrected predictions to a separate file.

df = pd.DataFrame(data)
df.to_json(OUTPUT_FILENAME+"_Del", orient='records', lines=True, force_ascii=False)
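The relabeling rule above can also be written as a small pure function, which makes it easy to check on a few hand-made records before touching the real submission file. This is a sketch only: `downgrade_without_evidence` and the sample records are hypothetical, and the lowercase label strings follow `LABEL2ID` from earlier in this part.

```python
from typing import Dict, List


def downgrade_without_evidence(records: List[Dict]) -> List[Dict]:
    """Return copies of the records, relabeling evidence-free predictions.

    Hypothetical helper: any record that is not already NOT ENOUGH INFO
    and has an empty `predicted_evidence` list is changed to NOT ENOUGH INFO.
    """
    fixed = []
    for record in records:
        record = dict(record)  # shallow copy so the input is left untouched
        has_evidence = len(record["predicted_evidence"]) > 0
        if record["predicted_label"] != "NOT ENOUGH INFO" and not has_evidence:
            record["predicted_label"] = "NOT ENOUGH INFO"
        fixed.append(record)
    return fixed


# Made-up records for illustration.
sample = [
    {"id": 1, "predicted_label": "supports", "predicted_evidence": []},
    {"id": 2, "predicted_label": "refutes", "predicted_evidence": [["頁面", 0]]},
    {"id": 3, "predicted_label": "NOT ENOUGH INFO", "predicted_evidence": []},
]
cleaned = downgrade_without_evidence(sample)
print([r["predicted_label"] for r in cleaned])
```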