# Loading data

This notebook prepares the FEVER data for training and testing a textual entailment model.
The goal is to have, for each claim in FEVER, a decoded list of all its evidence (resolved from Wikipedia dump page ID and sentence ID into natural language sentences) and a label among Supported, Refuted, and Not Enough Info.

First, we load the raw data, exactly as it comes from the FEVER challenge, into a dictionary of datasets (one per split), reading each file through a pandas dataframe.

In [137]:
from datasets import load_dataset, Features, Value

features = Features(
{
    'id': Value('int64'),
    'verifiable': Value('string'),
    'label': Value('string'),
    'claim': Value('string'),
    'evidence': Value('string'),
})

FILENAMES = ['shared_task_dev','train','shared_task_test']

In [29]:
from pandas import read_json
from datasets import Dataset
from datasets import DatasetDict

dataset_dict = {}
for filename in FILENAMES:
    df = read_json(f'/home/k20036346/sharedscratch/fever/fever_jsonl/{filename}.jsonl', lines=True)
    if 'evidence' in df.columns:
        df['evidence'] = df['evidence'].astype(str)
              
    dataset_dict[filename] = Dataset.from_pandas(
        df,
        features=Features({column: features[column] for column in df.columns})
    )
    
data = DatasetDict(dataset_dict)
data

DatasetDict({
    shared_task_dev: Dataset({
        features: ['id', 'verifiable', 'label', 'claim', 'evidence'],
        num_rows: 19998
    })
    train: Dataset({
        features: ['id', 'verifiable', 'label', 'claim', 'evidence'],
        num_rows: 145449
    })
    shared_task_test: Dataset({
        features: ['id', 'claim'],
        num_rows: 19998
    })
})

In [36]:
# Here we'll create a numeric version of the labels:
# SUPPORTS -> 0, REFUTES -> 1, NOT ENOUGH INFO -> 2, and -1 for rows without a label (the test split)

LABELS = ['SUPPORTS', 'REFUTES', 'NOT ENOUGH INFO']
def label_to_numeric(row):
    if 'label' in row:
        return {'label_numeric': LABELS.index(row['label'])}
    return {'label_numeric': -1}

def filter_verifiable(row):
    # Keep only rows explicitly marked as VERIFIABLE
    return 'verifiable' in row and row['verifiable'] == 'VERIFIABLE'

In [37]:
data = data.map(label_to_numeric)#.filter(filter_verifiable)


In [38]:
data

DatasetDict({
    shared_task_dev: Dataset({
        features: ['id', 'verifiable', 'label', 'claim', 'evidence', 'label_numeric'],
        num_rows: 19998
    })
    train: Dataset({
        features: ['id', 'verifiable', 'label', 'claim', 'evidence', 'label_numeric'],
        num_rows: 145449
    })
    shared_task_test: Dataset({
        features: ['id', 'claim', 'label_numeric'],
        num_rows: 19998
    })
})

# Getting supporting sentences

As can be seen in the cells below, the evidence in these datasets is not stored as plain-text sentences but as tuples (see https://fever.ai/2018/task.html).

An evidence set is a list of [Annotation ID, Evidence ID, Wikipedia URL, sentence ID] tuples (or an [Annotation ID, Evidence ID, null, null] tuple if the label is NOT ENOUGH INFO).

The Annotation ID and Evidence ID fields are for FEVER's internal use only and are not used for scoring. They may help FEVER's authors debug or correct annotation issues and **do not matter for us**.
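
As a rough illustration of the tuple format described above, the sketch below parses an evidence string and keeps only the (Wikipedia URL, sentence ID) pairs. The helper name `evidence_pointers` is hypothetical and not part of FEVER or of this notebook; the evidence strings are copied from examples shown further down.

```python
import ast

# Evidence string copied from a training example shown in this notebook
evidence_str = "[[[47740, 57010, 'Curly_Top_-LRB-film-RRB-', 0], [47740, 57010, 'Irving_Cummings', 0]]]"

def evidence_pointers(evidence_str):
    """Return one list of (page, sentence_id) pairs per evidence set.

    Hypothetical helper: it drops the annotation and evidence IDs and skips
    the [annotation_id, evidence_id, None, None] placeholders.
    """
    evidence_sets = ast.literal_eval(evidence_str)
    return [
        [(page, sent_id) for _, _, page, sent_id in evidence_set if page is not None]
        for evidence_set in evidence_sets
    ]

pointers = evidence_pointers(evidence_str)
# A NOT ENOUGH INFO example has only placeholder tuples
nei_pointers = evidence_pointers("[[[191656, None, None, None], [191657, None, None, None]]]")
```

Each inner list corresponds to one evidence set, so a NOT ENOUGH INFO claim yields evidence sets with no pointers in them.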

Wikipedia URL and sentence ID are how we find the text. They refer to a June 2017 Wikipedia dump made available by FEVER at https://fever.ai/dataset/fever.html. The dump consists of multiple JSON objects, each corresponding to a Wikipedia page from June 2017. Each has the following format:
- id: The page's title, like 1951_Baylor_Bears_football_team or Don_Kendell
- text: The whole page's text (or at least the part that could be captured)
- lines: The same content as text, but split into lines. Each line entry consists of a number (the line number), a tab (\t), the line itself, and a newline (\n).

Here, we use the Wikipedia URL (the same as the Wikipedia page id) and the sentence ID (the same as the line number) to turn the evidence set tuples into natural language sentences by querying the Wikipedia page dump from June 2017.
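
To make the lookup concrete, here is a minimal sketch of resolving a sentence ID against a page's `lines` field. The page record and the `get_sentence` helper are made up for illustration and are not a real dump entry or a function from this notebook.

```python
# Hypothetical page record in the format described above (not a real dump entry)
page = {
    "id": "Example_Page",
    "text": "First sentence . Second sentence .",
    "lines": "0\tFirst sentence .\n1\tSecond sentence .\n2\t",
}

def get_sentence(page, sentence_id):
    """Look up a sentence by its line number in the page's `lines` field."""
    sentences = {}
    for line in page["lines"].split("\n"):
        # Split on the first tab only: the part before it is the line number
        number, _, text = line.partition("\t")
        if number.isdigit():
            sentences[int(number)] = text
    return sentences[sentence_id]

sentence = get_sentence(page, 1)
```

Keying the sentences by the number stored in each line, rather than by list position, is a design choice of this sketch; it keeps the lookup valid even if a line were missing.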

## Encoding issues
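We noticed a Unicode normalization mismatch in the FEVER data. When the jsonl files are loaded, the evidence is stored in the dataframe as a string instead of as a list of lists. To revert this, we use ast.literal_eval. However, the page titles we get back are in a decomposed form (NFD/NFKD), where each accent is a separate combining character. ast.literal_eval does not normalize strings itself, so this form most likely comes from the jsonl files.

The wikipages dump, however, seems to use the composed form (NFC), so titles must be normalized before they can be matched.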

In [49]:
# The title recovered through ast.literal_eval is in decomposed form (NFD/NFKD)
import ast
ast.literal_eval(data['shared_task_dev'][26]['evidence'])[0][0][2].encode('utf-8')

b'Simo\xcc\x81n_Boli\xcc\x81var'

In [52]:
# These two look the same but contain different code points, so a == b returns False
# The first is the NFC form used by the wiki files; the second is the decomposed form obtained via ast.literal_eval
'Sim\u00f3n_Bol\u00edvar'.encode('utf-8'), 'Simo\u0301n_Boli\u0301var'.encode('utf-8')

(b'Sim\xc3\xb3n_Bol\xc3\xadvar', b'Simo\xcc\x81n_Boli\xcc\x81var')

In [69]:
# We fix this by normalizing both to NFC
import unicodedata
unicodedata.normalize('NFC', 'Sim\u00f3n_Bol\u00edvar').encode('utf-8'), unicodedata.normalize('NFC', 'Simo\u0301n_Boli\u0301var').encode('utf-8')

(b'Sim\xc3\xb3n_Bol\xc3\xadvar', b'Sim\xc3\xb3n_Bol\xc3\xadvar')
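
The same normalization can be applied to every page title in a parsed evidence list before it is matched against the dump. The sketch below is one possible way to do that; `normalize_titles` is a hypothetical helper and the evidence entry is made up for illustration.

```python
import unicodedata

def normalize_titles(evidence_sets, form='NFC'):
    """Return a copy of the evidence sets with every page title normalized.

    Hypothetical helper: placeholder tuples with a None title are kept as-is.
    """
    return [
        [
            [ann_id, ev_id,
             unicodedata.normalize(form, page) if page is not None else None,
             sent_id]
            for ann_id, ev_id, page, sent_id in evidence_set
        ]
        for evidence_set in evidence_sets
    ]

# Made-up evidence entry whose title is in decomposed form
decomposed = [[[1, 2, 'Simo\u0301n_Boli\u0301var', 0], [3, 4, None, None]]]
normalized = normalize_titles(decomposed)
```

After this step the title compares equal to the NFC form used as the page id.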

## Using a file index

Looking pages up through a file index is very slow, so this approach is not used here.

## Using an indexed database (much faster)
Run the 0.5_Wikipedia_fever_corpus_to_sql.ipynb notebook first to build the database.
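
That notebook is not shown here, so the following is only a sketch of the kind of table the queries below rely on: a `pages` table with an `id` and a `lines` column. The schema, the in-memory database, and the sample rows are assumptions made for illustration; the real database may be built differently.

```python
import sqlite3

# In-memory stand-in for the real database; the schema is an assumption
db = sqlite3.connect(':memory:')
# Declaring id as PRIMARY KEY makes SQLite index it, so lookups by title
# do not need to scan the whole table
db.execute('CREATE TABLE pages (id TEXT PRIMARY KEY, lines TEXT)')
db.executemany(
    'INSERT INTO pages (id, lines) VALUES (?, ?)',
    [
        ('Example_Page', '0\tFirst sentence .\n1\tSecond sentence .'),
        ('Empty_Page', ''),
    ],
)
db.commit()

def get_lines(db, page_id):
    """Fetch the `lines` field of one page, or None if the page is missing."""
    row = db.execute('SELECT lines FROM pages WHERE id = ?', (page_id,)).fetchone()
    return None if row is None else row[0]

found = get_lines(db, 'Example_Page')
missing = get_lines(db, 'No_Such_Page')
```

Returning None for a missing page is a choice of this sketch; it makes a failed title match (for example from a normalization mismatch) easy to detect.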

In [79]:
import sqlite3
class WikiPagesDB():
    def __init__(self, path='/home/k20036346/sharedscratch/fever/wikipedia_processed.db'):
        self.db = sqlite3.connect(path)
        self.cursor = self.db.cursor()
        self.cursor.execute('select count(*) from pages')
        self.size = self.cursor.fetchall()[0][0]
        
    def get_lines_by_id(self, identifier):
        self.cursor.execute(
            '''SELECT lines
            FROM pages 
            WHERE id = ?
            ''', [identifier]
        )
        return self.cursor.fetchall()[0][0]
    
    def get_lines_by_row_number(self, row_n):
        self.cursor.execute(
            ''' SELECT lines
            FROM pages
            WHERE rowid = ?
            ''', [row_n]
        )
        return self.cursor.fetchall()[0][0]
    
wikidb = WikiPagesDB()

In [115]:
import ast
import random
import unicodedata
def get_sentence_sqlite(row, ix):
    # Splits without an evidence column (the test set) are left unchanged
    if 'evidence' not in row:
        return None
    # Defined up front so the except block can always print them
    random_row_id, lines = None, None
    try:
        evidences = ast.literal_eval(row['evidence'])
        sentences = []
        for evidence_set in evidences:
            sentence_set = []
            for evidence in evidence_set:
                if evidence[2] is None:
                    continue
                lines = wikidb.get_lines_by_id(unicodedata.normalize('NFC', evidence[2]))
                assert isinstance(lines, str)
                lines_parsed = [line.split('\t', 1)[1] for line in lines.split('\n')]
                assert lines_parsed[evidence[3]].strip() != ''
                sentence_set.append(lines_parsed[evidence[3]])
            if len(sentence_set) > 0:
                sentences.append(sentence_set)
        if len(sentences) == 0:
            # If there are no sentences, the label is NOT ENOUGH INFO
            assert row['label'] == 'NOT ENOUGH INFO'
            while True:
                # With no evidence to decode, we fill in a sentence from a random page.
                # It is unclear whether this filler is used later on.
                random_row_id = random.randint(1, wikidb.size)
                lines = wikidb.get_lines_by_row_number(random_row_id)
                if lines.strip() != '':  # Many entries in the DB have no text, perhaps for being mainly links/lists/tables
                    break
            lines_parsed = [line.split('\t', 1)[1] for line in lines.split('\n')]
            lines_parsed_non_empty = [l for l in lines_parsed if l.strip() != '']
            random_line_id = random.randint(0, len(lines_parsed_non_empty) - 1)
            random_line = lines_parsed_non_empty[random_line_id]
            sentences.append([random_line])  # A single random line serves as the filler
        return {'sentences': sentences, 'first_sentence': sentences[0][0]}  # First sentence from first evidence set

    except Exception as e:
        print(e)
        print(ix)
        print(random_row_id, lines)
        print(row)
        raise

test_ids = [3,17,23,447]
for test_id in test_ids:
    print(data['shared_task_dev'][test_id])
    print('|||')
    print(get_sentence_sqlite(data['shared_task_dev'][test_id],test_id))
    print('---')

{'id': 166626, 'verifiable': 'NOT VERIFIABLE', 'label': 'NOT ENOUGH INFO', 'claim': 'Anne Rice was born in New Jersey.', 'evidence': '[[[191656, None, None, None], [191657, None, None, None]]]', 'label_numeric': 2}
|||
{'sentences': [['The ENI or electronic neutron initiator -LRB- generator -RRB- was Blue Stone .\tBlue Stone\tBlue Stone (neutron initiator)']], 'first_sentence': 'The ENI or electronic neutron initiator -LRB- generator -RRB- was Blue Stone .\tBlue Stone\tBlue Stone (neutron initiator)'}
---
{'id': 167997, 'verifiable': 'NOT VERIFIABLE', 'label': 'NOT ENOUGH INFO', 'claim': 'Don Bradman retired from soccer.', 'evidence': '[[[193413, None, None, None]]]', 'label_numeric': 2}
|||
{'sentences': [['Tekoulo is a town and sub-prefecture in the Guéckédou Prefecture in the Nzérékoré Region of south-western Guinea .\tRegion\tRegions of Guinea\tNzérékoré Region\tNzérékoré Region\tPrefecture\tPrefectures of Guinea\tGuéckédou Prefecture\tGuéckédou Prefecture\ttown\ttown\tsub-prefectu

In [122]:
data = data.map(get_sentence_sqlite, with_indices=True)


In [123]:
data

DatasetDict({
    shared_task_dev: Dataset({
        features: ['id', 'verifiable', 'label', 'claim', 'evidence', 'label_numeric', 'sentences', 'first_sentence'],
        num_rows: 19998
    })
    train: Dataset({
        features: ['id', 'verifiable', 'label', 'claim', 'evidence', 'label_numeric', 'sentences', 'first_sentence'],
        num_rows: 145449
    })
    shared_task_test: Dataset({
        features: ['id', 'claim', 'label_numeric'],
        num_rows: 19998
    })
})

In [124]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [125]:
show_random_elements(data["train"])

Unnamed: 0,id,verifiable,label,claim,evidence,label_numeric,sentences,first_sentence
0,18120,VERIFIABLE,SUPPORTS,Red Headed Stranger had a strong cultural impact.,"[[[33643, 40802, 'Red_Headed_Stranger', 21]]]",0,"[[The album has had a strong cultural impact ; the song `` Time of the Preacher '' has been used often in the series Edge of Darkness , and its lyrics were used as well in the first issue of the comic Preacher .\tEdge of Darkness\tEdge of Darkness\tPreacher\tPreacher (comics)]]","The album has had a strong cultural impact ; the song `` Time of the Preacher '' has been used often in the series Edge of Darkness , and its lyrics were used as well in the first issue of the comic Preacher .\tEdge of Darkness\tEdge of Darkness\tPreacher\tPreacher (comics)"
1,31608,VERIFIABLE,SUPPORTS,Curly Top was directed by a person.,"[[[47740, 57010, 'Curly_Top_-LRB-film-RRB-', 0], [47740, 57010, 'Irving_Cummings', 0]]]",0,"[[Curly Top -LRB- 1935 -RRB- is an American musical film directed by Irving Cummings .\tIrving Cummings\tIrving Cummings\tmusical film\tmusical film, Irving Camisky -LRB- October 9 , 1888 -- April 18 , 1959 -RRB- was an American movie actor , director , producer and writer .]]",Curly Top -LRB- 1935 -RRB- is an American musical film directed by Irving Cummings .\tIrving Cummings\tIrving Cummings\tmusical film\tmusical film
2,27945,NOT VERIFIABLE,NOT ENOUGH INFO,Donald Glover hated the album Because the Internet.,"[[[44062, None, None, None]]]",2,"[[At the 2006 census , its population was 208 , in 53 families .]]","At the 2006 census , its population was 208 , in 53 families ."
3,169884,VERIFIABLE,SUPPORTS,Peyton Manning's name is Williams.,"[[[195967, 206104, 'Peyton_Manning', 0]]]",0,"[[Peyton Williams Manning -LRB- born March 24 , 1976 -RRB- is a former American football quarterback who played 18 seasons in the National Football League -LRB- NFL -RRB- , primarily with the Indianapolis Colts .\tIndianapolis Colts\tIndianapolis Colts\tAmerican football\tAmerican football\tquarterback\tquarterback\tNational Football League\tNational Football League]]","Peyton Williams Manning -LRB- born March 24 , 1976 -RRB- is a former American football quarterback who played 18 seasons in the National Football League -LRB- NFL -RRB- , primarily with the Indianapolis Colts .\tIndianapolis Colts\tIndianapolis Colts\tAmerican football\tAmerican football\tquarterback\tquarterback\tNational Football League\tNational Football League"
4,58954,VERIFIABLE,REFUTES,Cry Freedom shot all their scenes in South Africa.,"[[[75359, 86621, 'Cry_Freedom', 7]]]",1,[[The film was primarily shot on location in Zimbabwe and in Kenya due to political turmoil in South Africa at the time of production .\tSouth Africa\tSouth Africa\tZimbabwe\tZimbabwe\tKenya\tKenya]],The film was primarily shot on location in Zimbabwe and in Kenya due to political turmoil in South Africa at the time of production .\tSouth Africa\tSouth Africa\tZimbabwe\tZimbabwe\tKenya\tKenya
5,44013,VERIFIABLE,SUPPORTS,Toy Story is a film.,"[[[60346, 70726, 'Toy_Story', 0]], [[60346, 70727, 'Toy_Story', 1]], [[60346, 70728, 'Toy_Story', 6]], [[60346, 70729, 'Toy_Story', 12]], [[60346, 70730, 'Toy_Story', 15]], [[60346, 70731, 'Toy_Story', 16]], [[60346, 70732, 'Toy_Story', 17]], [[60346, 70733, 'Toy_Story', 19]]]",0,"[[Toy Story is a 1995 American computer-animated buddy comedy adventure film produced by Pixar Animation Studios and released by Walt Disney Pictures .\tWalt Disney Pictures\tWalt Disney Pictures\tPixar Animation Studios\tPixar\tcomputer-animated\tComputer-generated imagery\tbuddy\tbuddy film\tcomedy\tcomedy film\tadventure film\tadventure film], [The directorial debut of John Lasseter , Toy Story was the first feature-length computer-animated film and the first theatrical film produced by Pixar .\tJohn Lasseter\tJohn Lasseter\tcomputer-animated\tComputer-generated imagery\tdirectorial debut\tList of directorial debuts], [The film features music by Randy Newman , and was executive-produced by Steve Jobs and Edwin Catmull .\tRandy Newman\tRandy Newman\tSteve Jobs\tSteve Jobs\tEdwin Catmull\tEdwin Catmull], [The studio , then consisting of a relatively small number of employees , produced the film under minor financial constraints .], [Released in theaters on November 22 , 1995 , Toy Story was the highest-grossing film on its opening weekend and earned over $ 373 million worldwide .], [The film was widely acclaimed by critics , who praised the animation 's technical innovation , the wit and thematic sophistication of the screenplay , and the performances of Hanks and Allen .], [It is now considered by many critics to be one of the best animated films ever made .\tbest animated films ever made\tList of films considered the best#Animation], [It was inducted into the National Film Registry as being `` culturally , historically , or aesthetically significant '' in 2005 , its first year of eligibility .\tNational Film Registry\tNational Film 
Registry]]",Toy Story is a 1995 American computer-animated buddy comedy adventure film produced by Pixar Animation Studios and released by Walt Disney Pictures .\tWalt Disney Pictures\tWalt Disney Pictures\tPixar Animation Studios\tPixar\tcomputer-animated\tComputer-generated imagery\tbuddy\tbuddy film\tcomedy\tcomedy film\tadventure film\tadventure film
6,28986,VERIFIABLE,REFUTES,Yung Rich Nation featured only Young Thug.,"[[[46471, 55433, 'Yung_Rich_Nation', 2]]]",1,"[[The album features guest appearances from Chris Brown and Young Thug , while the production was handled by Zaytoven , Honorable C.N.O.T.E. and Murda Beatz , among others .\tMurda Beatz\tMurda Beatz\tZaytoven\tZaytoven\tChris Brown\tChris Brown\tYoung Thug\tYoung Thug\tproduction\tHip hop production]]","The album features guest appearances from Chris Brown and Young Thug , while the production was handled by Zaytoven , Honorable C.N.O.T.E. and Murda Beatz , among others .\tMurda Beatz\tMurda Beatz\tZaytoven\tZaytoven\tChris Brown\tChris Brown\tYoung Thug\tYoung Thug\tproduction\tHip hop production"
7,127393,VERIFIABLE,SUPPORTS,The Grammys nominated Nicki Minaj.,"[[[149269, 164255, 'Nicki_Minaj', 17]]]",0,"[[Minaj has received ten Grammy nominations throughout her career , and has won six American Music Awards , eleven BET Awards , three MTV Video Music Awards , four Billboard Music Awards , and was the recipient of Billboards Women in Music 2011 Rising Star award .\tGrammy\tGrammy Award\tAmerican Music Awards\tAmerican Music Awards\tBET Awards\tBET Awards]]","Minaj has received ten Grammy nominations throughout her career , and has won six American Music Awards , eleven BET Awards , three MTV Video Music Awards , four Billboard Music Awards , and was the recipient of Billboards Women in Music 2011 Rising Star award .\tGrammy\tGrammy Award\tAmerican Music Awards\tAmerican Music Awards\tBET Awards\tBET Awards"
8,211614,VERIFIABLE,REFUTES,Gabourey Sidibe made her acting debut in a 2006 film.,"[[[250774, 250754, 'Gabourey_Sidibe', 0]]]",1,"[[Gabourey Sidibe -LRB- -LSB- ˈɡæbəˌreɪ_ˈsɪdɪˌbeɪ -RSB- ; born May 6 , 1983 -RRB- is an American actress who made her acting debut in the 2009 film Precious , a role that brought her a nomination for the Academy Award for Best Actress .\tPrecious\tPrecious (film)\tAcademy Award for Best Actress\tAcademy Award for Best Actress]]","Gabourey Sidibe -LRB- -LSB- ˈɡæbəˌreɪ_ˈsɪdɪˌbeɪ -RSB- ; born May 6 , 1983 -RRB- is an American actress who made her acting debut in the 2009 film Precious , a role that brought her a nomination for the Academy Award for Best Actress .\tPrecious\tPrecious (film)\tAcademy Award for Best Actress\tAcademy Award for Best Actress"
9,134832,NOT VERIFIABLE,NOT ENOUGH INFO,Louis C.K. has won six Emmy awards as of 2013.,"[[[157397, None, None, None]]]",2,"[[Dr. Carlo Levi -LRB- -LSB- ˈkarlo ˈlɛːvi -RSB- -RRB- -LRB- November 29 , 1902 -- January 4 , 1975 -RRB- was an Italian-Jewish painter , writer , activist , anti-fascist , and doctor .\tanti-fascist\tanti-fascism]]","Dr. Carlo Levi -LRB- -LSB- ˈkarlo ˈlɛːvi -RSB- -RRB- -LRB- November 29 , 1902 -- January 4 , 1975 -RRB- was an Italian-Jewish painter , writer , activist , anti-fascist , and doctor .\tanti-fascist\tanti-fascism"


## Saving dataset for fine-tuning model

In [139]:
for filename in FILENAMES:
    data[filename].to_csv(f'./data/{filename}_wikidecoded.csv', index=False, encoding='utf-8')


In [142]:
from pandas import DataFrame
df_train = DataFrame(data['train'])
df_dev = DataFrame(data['shared_task_dev'])

In [147]:
# Checking whether the train and dev splits overlap
# There are 3 overlapping claims, but their evidence sets are distinct!
pd.set_option('display.max_colwidth', None)
df_train[df_train["claim"].isin(df_dev["claim"])]

Unnamed: 0,id,verifiable,label,claim,evidence,label_numeric,sentences,first_sentence
10795,216752,VERIFIABLE,SUPPORTS,Warren Beatty co-wrote Reds.,"[[[257528, 256524, 'Reds_-LRB-film-RRB-', 0]], [[257535, 256531, 'Reds_-LRB-film-RRB-', 0]], [[257552, 256547, 'Reds_-LRB-film-RRB-', 0]]]",0,"[[Reds is a 1981 American epic drama film co-written , produced and directed by Warren Beatty .\tWarren Beatty\tWarren Beatty\tepic\tEpic film\tdrama film\tDrama (film and television)], [Reds is a 1981 American epic drama film co-written , produced and directed by Warren Beatty .\tWarren Beatty\tWarren Beatty\tepic\tEpic film\tdrama film\tDrama (film and television)], [Reds is a 1981 American epic drama film co-written , produced and directed by Warren Beatty .\tWarren Beatty\tWarren Beatty\tepic\tEpic film\tdrama film\tDrama (film and television)]]","Reds is a 1981 American epic drama film co-written , produced and directed by Warren Beatty .\tWarren Beatty\tWarren Beatty\tepic\tEpic film\tdrama film\tDrama (film and television)"
45757,57807,VERIFIABLE,SUPPORTS,Floyd Mayweather Jr. is a professional boxer.,"[[[74147, 85375, 'Floyd_Mayweather_Jr.', 0]], [[74147, 85376, 'Floyd_Mayweather_Jr.', 1]]]",0,"[[Floyd Joy Mayweather Jr. -LRB- born Floyd Joy Sinclair ; February 24 , 1977 -RRB- is an American former professional boxer who competed from 1996 to 2015 , and currently works as a boxing promoter .\tprofessional boxer\tprofessional boxer\tpromoter\tpromoter (entertainment)], [Widely considered to be one of the greatest boxers of all time , undefeated as a professional , and a five-division world champion , Mayweather won fifteen world titles and the lineal championship in four different weight classes -LRB- twice at welterweight -RRB- .\tfive-division world champion\tquintuple champion\tlineal championship\tlineal championship\twelterweight\twelterweight]]","Floyd Joy Mayweather Jr. -LRB- born Floyd Joy Sinclair ; February 24 , 1977 -RRB- is an American former professional boxer who competed from 1996 to 2015 , and currently works as a boxing promoter .\tprofessional boxer\tprofessional boxer\tpromoter\tpromoter (entertainment)"
100294,142165,VERIFIABLE,SUPPORTS,Saturn is a planet in the Solar System.,"[[[165224, 179348, 'Solar_System', 7], [165224, 179348, 'Saturn', 0]], [[165225, 179349, 'Solar_System', 7]], [[165225, 179350, 'Solar_System', 15]]]",0,"[[Of the objects that orbit the Sun indirectly , the moons , two are larger than the smallest planet , Mercury.The two moons larger than Mercury are Ganymede , which orbits Jupiter , and Titan , which orbits Saturn .\tSun\tSun\tMercury\tMercury (planet)\tJupiter\tJupiter\tSaturn\tSaturn\tplanet\tplanet\tmoons\tNatural satellite\tGanymede\tGanymede (moon)\tTitan\tTitan (moon), Saturn is the sixth planet from the Sun and the second-largest in the Solar System , after Jupiter .\tSaturn\tSaturn (mythology)\tSun\tSun\tplanet\tplanet\tSolar System\tSolar System\tJupiter\tJupiter], [Of the objects that orbit the Sun indirectly , the moons , two are larger than the smallest planet , Mercury.The two moons larger than Mercury are Ganymede , which orbits Jupiter , and Titan , which orbits Saturn .\tSun\tSun\tMercury\tMercury (planet)\tJupiter\tJupiter\tSaturn\tSaturn\tplanet\tplanet\tmoons\tNatural satellite\tGanymede\tGanymede (moon)\tTitan\tTitan (moon)], [The two largest , Jupiter and Saturn , are gas giants , being composed mainly of hydrogen and helium ; the two outermost planets , Uranus and Neptune , are ice giants , being composed mostly of substances with relatively high melting points compared with hydrogen and helium , called volatiles , such as water , ammonia and methane .\tJupiter\tJupiter\tSaturn\tSaturn\tUranus\tUranus\tNeptune\tNeptune\tplanets\tList of gravitationally rounded objects of the Solar System#Planets\thydrogen\thydrogen\thelium\thelium\tvolatiles\tvolatiles\tammonia\tammonia\tmethane\tmethane]]","Of the objects that orbit the Sun indirectly , the moons , two are larger than the smallest planet , Mercury.The two moons larger than Mercury are Ganymede , which orbits Jupiter , and Titan , which orbits Saturn 
.\tSun\tSun\tMercury\tMercury (planet)\tJupiter\tJupiter\tSaturn\tSaturn\tplanet\tplanet\tmoons\tNatural satellite\tGanymede\tGanymede (moon)\tTitan\tTitan (moon)"


In [148]:
df_dev[df_dev["claim"].isin(df_train["claim"])]

Unnamed: 0,id,verifiable,label,claim,evidence,label_numeric,sentences,first_sentence
8058,198218,VERIFIABLE,SUPPORTS,Saturn is a planet in the Solar System.,"[[[233056, 236058, 'Saturn', 0]]]",0,"[[Saturn is the sixth planet from the Sun and the second-largest in the Solar System , after Jupiter .\tSaturn\tSaturn (mythology)\tSun\tSun\tplanet\tplanet\tSolar System\tSolar System\tJupiter\tJupiter]]","Saturn is the sixth planet from the Sun and the second-largest in the Solar System , after Jupiter .\tSaturn\tSaturn (mythology)\tSun\tSun\tplanet\tplanet\tSolar System\tSolar System\tJupiter\tJupiter"
10688,154383,VERIFIABLE,SUPPORTS,Warren Beatty co-wrote Reds.,"[[[293532, 286037, 'Warren_Beatty', 2]], [[297406, 289064, 'Warren_Beatty', 2]], [[341783, 326077, 'Warren_Beatty', 2]], [[341784, 326078, 'Warren_Beatty', 2]]]",0,"[[Beatty is the first and only person to have been twice nominated for acting in , directing , writing , and producing the same film -- first with Heaven Can Wait -LRB- 1978 -RRB- , which was co-written by Elaine May and co-directed by Buck Henry , and again with Reds , which he co-wrote with Trevor Griffiths .\tReds\tReds (film)\tHeaven Can Wait\tHeaven Can Wait (1978 film)\tElaine May\tElaine May\tBuck Henry\tBuck Henry\tTrevor Griffiths\tTrevor Griffiths], [Beatty is the first and only person to have been twice nominated for acting in , directing , writing , and producing the same film -- first with Heaven Can Wait -LRB- 1978 -RRB- , which was co-written by Elaine May and co-directed by Buck Henry , and again with Reds , which he co-wrote with Trevor Griffiths .\tReds\tReds (film)\tHeaven Can Wait\tHeaven Can Wait (1978 film)\tElaine May\tElaine May\tBuck Henry\tBuck Henry\tTrevor Griffiths\tTrevor Griffiths], [Beatty is the first and only person to have been twice nominated for acting in , directing , writing , and producing the same film -- first with Heaven Can Wait -LRB- 1978 -RRB- , which was co-written by Elaine May and co-directed by Buck Henry , and again with Reds , which he co-wrote with Trevor Griffiths .\tReds\tReds (film)\tHeaven Can Wait\tHeaven Can Wait (1978 film)\tElaine May\tElaine May\tBuck Henry\tBuck Henry\tTrevor Griffiths\tTrevor Griffiths], [Beatty is the first and only person to have been twice nominated for acting in , directing , writing , and producing the same film -- first with Heaven Can Wait -LRB- 1978 -RRB- , which was co-written by Elaine May and co-directed by Buck Henry , and again with Reds , which he co-wrote with Trevor Griffiths .\tReds\tReds (film)\tHeaven Can Wait\tHeaven Can Wait (1978 
film)\tElaine May\tElaine May\tBuck Henry\tBuck Henry\tTrevor Griffiths\tTrevor Griffiths]]","Beatty is the first and only person to have been twice nominated for acting in , directing , writing , and producing the same film -- first with Heaven Can Wait -LRB- 1978 -RRB- , which was co-written by Elaine May and co-directed by Buck Henry , and again with Reds , which he co-wrote with Trevor Griffiths .\tReds\tReds (film)\tHeaven Can Wait\tHeaven Can Wait (1978 film)\tElaine May\tElaine May\tBuck Henry\tBuck Henry\tTrevor Griffiths\tTrevor Griffiths"
17816,36399,VERIFIABLE,SUPPORTS,Floyd Mayweather Jr. is a professional boxer.,"[[[105200, 118615, 'Floyd_Mayweather_Jr.', 0]], [[105200, 118616, 'Floyd_Mayweather_Jr.', 10]], [[108253, 121786, 'Floyd_Mayweather_Jr.', 0]], [[108253, 121787, 'Floyd_Mayweather_Jr.', 1]], [[108253, 121788, 'Floyd_Mayweather_Jr.', 3]], [[108253, 121789, 'Floyd_Mayweather_Jr.', 6]], [[108253, 121790, 'Floyd_Mayweather_Jr.', 9]], [[108253, 121791, 'Floyd_Mayweather_Jr.', 10]], [[110070, 123695, 'Floyd_Mayweather_Jr.', 0]], [[308093, 298563, 'Floyd_Mayweather_Jr.', 0]], [[308093, 298564, 'Floyd_Mayweather_Jr.', 1]], [[308093, 298565, 'Floyd_Mayweather_Jr.', 3]], [[308093, 298566, 'Floyd_Mayweather_Jr.', 6]], [[308093, 298567, 'Floyd_Mayweather_Jr.', 7]], [[308093, 298568, 'Floyd_Mayweather_Jr.', 8]], [[308093, 298569, 'Floyd_Mayweather_Jr.', 9]], [[308093, 298570, 'Floyd_Mayweather_Jr.', 10]], [[308093, 298571, 'Floyd_Mayweather_Jr.', 15]], [[309497, 299702, 'Floyd_Mayweather_Jr.', 0]], [[309497, 299703, 'Floyd_Mayweather_Jr.', 1]]]",0,"[[Floyd Joy Mayweather Jr. -LRB- born Floyd Joy Sinclair ; February 24 , 1977 -RRB- is an American former professional boxer who competed from 1996 to 2015 , and currently works as a boxing promoter .\tprofessional boxer\tprofessional boxer\tpromoter\tpromoter (entertainment)], [He finished his career with a record of 26 wins without a loss or draw in world title fights -LRB- 10 by KO -RRB- ; 23 wins -LRB- 9 KOs -RRB- in lineal title fights ; 24 wins -LRB- 7 KOs -RRB- against former or current world titlists ; 12 wins -LRB- 3 KOs -RRB- against former or current lineal champions ; and 2 wins -LRB- 1 KO -RRB- against International Boxing Hall of Fame inductees .\tKO\tknockout\tInternational Boxing Hall of Fame\tInternational Boxing Hall of Fame], [Floyd Joy Mayweather Jr. 
-LRB- born Floyd Joy Sinclair ; February 24 , 1977 -RRB- is an American former professional boxer who competed from 1996 to 2015 , and currently works as a boxing promoter .\tprofessional boxer\tprofessional boxer\tpromoter\tpromoter (entertainment)], [Widely considered to be one of the greatest boxers of all time , undefeated as a professional , and a five-division world champion , Mayweather won fifteen world titles and the lineal championship in four different weight classes -LRB- twice at welterweight -RRB- .\tfive-division world champion\tquintuple champion\tlineal championship\tlineal championship\twelterweight\twelterweight], [Mayweather is a two-time winner of The Ring magazine 's Fighter of the Year award -LRB- 1998 and 2007 -RRB- , a three-time winner of the Boxing Writers Association of America -LRB- BWAA -RRB- Fighter of the Year award -LRB- 2007 , 2013 , and 2015 -RRB- , and a six-time winner of the Best Fighter ESPY Award -LRB- 2007 -- 2010 , 2012 -- 2014 -RRB- .\tFighter of the Year\tSugar Ray Robinson Award\tBoxing Writers Association of America\tBoxing Writers Association of America\tBest Fighter ESPY Award\tBest Fighter ESPY Award], [In 2016 , Mayweather was ranked by ESPN as the greatest boxer , pound for pound , of the last 25 years .\tESPN\tESPN\tpound for pound\tpound for pound], [He is also regarded as the best defensive boxer in the sport , as well as being the most accurate puncher since the existence of CompuBox , having the highest plus -- minus ratio in recorded boxing history .\tCompuBox\tCompuBox], [He finished his career with a record of 26 wins without a loss or draw in world title fights -LRB- 10 by KO -RRB- ; 23 wins -LRB- 9 KOs -RRB- in lineal title fights ; 24 wins -LRB- 7 KOs -RRB- against former or current world titlists ; 12 wins -LRB- 3 KOs -RRB- against former or current lineal champions ; and 2 wins -LRB- 1 KO -RRB- against International Boxing Hall of Fame inductees .\tKO\tknockout\tInternational Boxing Hall of 
Fame\tInternational Boxing Hall of Fame], [Floyd Joy Mayweather Jr. -LRB- born Floyd Joy Sinclair ; February 24 , 1977 -RRB- is an American former professional boxer who competed from 1996 to 2015 , and currently works as a boxing promoter .\tprofessional boxer\tprofessional boxer\tpromoter\tpromoter (entertainment)], [Floyd Joy Mayweather Jr. -LRB- born Floyd Joy Sinclair ; February 24 , 1977 -RRB- is an American former professional boxer who competed from 1996 to 2015 , and currently works as a boxing promoter .\tprofessional boxer\tprofessional boxer\tpromoter\tpromoter (entertainment)], [Widely considered to be one of the greatest boxers of all time , undefeated as a professional , and a five-division world champion , Mayweather won fifteen world titles and the lineal championship in four different weight classes -LRB- twice at welterweight -RRB- .\tfive-division world champion\tquintuple champion\tlineal championship\tlineal championship\twelterweight\twelterweight], [Mayweather is a two-time winner of The Ring magazine 's Fighter of the Year award -LRB- 1998 and 2007 -RRB- , a three-time winner of the Boxing Writers Association of America -LRB- BWAA -RRB- Fighter of the Year award -LRB- 2007 , 2013 , and 2015 -RRB- , and a six-time winner of the Best Fighter ESPY Award -LRB- 2007 -- 2010 , 2012 -- 2014 -RRB- .\tFighter of the Year\tSugar Ray Robinson Award\tBoxing Writers Association of America\tBoxing Writers Association of America\tBest Fighter ESPY Award\tBest Fighter ESPY Award], [In 2016 , Mayweather was ranked by ESPN as the greatest boxer , pound for pound , of the last 25 years .\tESPN\tESPN\tpound for pound\tpound for pound], [In the same year , he peaked as BoxRec 's number one fighter of all time , pound for pound , as well as the greatest welterweight of all time .\twelterweight\twelterweight\tpound for pound\tpound for pound\tBoxRec\tBoxRec], [Many sporting news and boxing websites ranked Mayweather as the best boxer in the world , pound for 
pound , twice in a span of ten years ; including The Ring , Sports Illustrated , ESPN , BoxRec , Fox Sports , and Yahoo! Sports .\tESPN\tESPN\tpound for pound\tpound for pound\tBoxRec\tBoxRec\tSports Illustrated\tSports Illustrated\tFox Sports\tFox Sports (United States)\tYahoo! Sports\tYahoo! Sports], [He is also regarded as the best defensive boxer in the sport , as well as being the most accurate puncher since the existence of CompuBox , having the highest plus -- minus ratio in recorded boxing history .\tCompuBox\tCompuBox], [He finished his career with a record of 26 wins without a loss or draw in world title fights -LRB- 10 by KO -RRB- ; 23 wins -LRB- 9 KOs -RRB- in lineal title fights ; 24 wins -LRB- 7 KOs -RRB- against former or current world titlists ; 12 wins -LRB- 3 KOs -RRB- against former or current lineal champions ; and 2 wins -LRB- 1 KO -RRB- against International Boxing Hall of Fame inductees .\tKO\tknockout\tInternational Boxing Hall of Fame\tInternational Boxing Hall of Fame], [In 2007 , he founded Mayweather Promotions , his own boxing promotional firm after defecting from Bob Arum 's Top Rank .\tMayweather Promotions\tMayweather Promotions\tBob Arum\tBob Arum\tTop Rank\tTop Rank], [Floyd Joy Mayweather Jr. -LRB- born Floyd Joy Sinclair ; February 24 , 1977 -RRB- is an American former professional boxer who competed from 1996 to 2015 , and currently works as a boxing promoter .\tprofessional boxer\tprofessional boxer\tpromoter\tpromoter (entertainment)], [Widely considered to be one of the greatest boxers of all time , undefeated as a professional , and a five-division world champion , Mayweather won fifteen world titles and the lineal championship in four different weight classes -LRB- twice at welterweight -RRB- .\tfive-division world champion\tquintuple champion\tlineal championship\tlineal championship\twelterweight\twelterweight]]","Floyd Joy Mayweather Jr. 
-LRB- born Floyd Joy Sinclair ; February 24 , 1977 -RRB- is an American former professional boxer who competed from 1996 to 2015 , and currently works as a boxing promoter .\tprofessional boxer\tprofessional boxer\tpromoter\tpromoter (entertainment)"
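
The overlap check above can also be done with a join, which places the train and dev evidence for each shared claim side by side. The sketch below uses toy dataframes in place of `df_train` and `df_dev`; the rows and the `pointer_set` helper are hypothetical and serve only to illustrate the comparison.

```python
import ast
import pandas as pd

# Toy stand-ins for df_train and df_dev (made-up rows)
train = pd.DataFrame({
    'claim': ['A is B.', 'C is D.'],
    'evidence': ["[[[1, 1, 'A', 0]], [[1, 2, 'A', 3]]]", "[[[2, 2, 'C', 3]]]"],
})
dev = pd.DataFrame({
    'claim': ['A is B.', 'E is F.'],
    'evidence': ["[[[9, 9, 'A', 0]]]", "[[[3, 3, 'E', 1]]]"],
})

def pointer_set(evidence_str):
    """Collect the (page, sentence_id) pairs of an evidence string, ignoring the IDs."""
    return {
        (page, sent_id)
        for evidence_set in ast.literal_eval(evidence_str)
        for _, _, page, sent_id in evidence_set
    }

# Inner join on the claim text keeps only claims present in both splits
overlap = train.merge(dev, on='claim', suffixes=('_train', '_dev'))
overlap['shared_pointers'] = [
    pointer_set(a) & pointer_set(b)
    for a, b in zip(overlap['evidence_train'], overlap['evidence_dev'])
]
```

Comparing (page, sentence ID) pairs rather than the raw evidence strings avoids treating two evidence sets as different only because their annotation IDs differ.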
