# Process Climate FEVER

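The dataset, as published on the Hugging Face Hub, isn't in the right format for textual entailment: it has one row per claim, with all of that claim's evidence nested inside. This notebook reshapes it into one row per claim-evidence pair.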

## Environment

In [1]:
! pip install transformers
! pip install torch
! pip install datasets


In [2]:
import numpy as np
import pandas as pd

from html import unescape
from random import randint
import math
import gc

from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
from datasets import load_dataset, load_metric, Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from huggingface_hub import notebook_login

import torch

Log in to Hugging Face

In [3]:
# Get an access token from the Hugging Face website: Settings > Access Tokens (it must be a write token, since we push to the Hub later)
!git config --global credential.helper store
notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token


## Original format

In [4]:
ds_path = 'climate_fever'
# climate_fever is a public dataset on the Hub, so no auth token is passed here
ds = load_dataset(ds_path)

Using custom data configuration default

Downloading and preparing dataset climate_fever/default (download: 671.03 KiB, generated: 2.32 MiB, post-processed: Unknown size, total: 2.97 MiB) to /root/.cache/huggingface/datasets/climate_fever/default/1.0.1/60c6cf5ebdf73f1cad68b9a15e9da57d65e2d35416a13516080f6a0a34d8cbe6...

Dataset climate_fever downloaded and prepared to /root/.cache/huggingface/datasets/climate_fever/default/1.0.1/60c6cf5ebdf73f1cad68b9a15e9da57d65e2d35416a13516080f6a0a34d8cbe6. Subsequent calls will reuse this data.

In [5]:
ds

DatasetDict({
    test: Dataset({
        features: ['claim_id', 'claim', 'claim_label', 'evidences'],
        num_rows: 1535
    })
})

Each entry in the `evidences` column is a list of dictionaries, one for each of the 5 evidence sentences associated with the claim. This keeps the dataset to one row per claim, but we need to expand it to one row per claim-evidence pair if we want to use the data for textual entailment.

In [6]:
print("claim:", ds['test']['claim'][1])
ds['test']['evidences'][1]

claim: The sun has gone into ‘lockdown’ which could cause freezing weather, earthquakes and famine, say scientists


[{'article': 'Famine',
  'entropy': 0.0,
  'evidence': "The current consensus of the scientific community is that the aerosols and dust released into the upper atmosphere causes cooler temperatures by preventing the sun's energy from reaching the ground.",
  'evidence_id': 'Famine:386',
  'evidence_label': 0,
  'votes': ['SUPPORTS', 'SUPPORTS', None, None, None]},
 {'article': 'Weather',
  'entropy': 0.0,
  'evidence': 'The Little Ice Age caused crop failures and famines in Europe.',
  'evidence_id': 'Weather:67',
  'evidence_label': 0,
  'votes': ['SUPPORTS', 'SUPPORTS', None, None, None]},
 {'article': 'Winter',
  'entropy': 0.0,
  'evidence': 'The persistently cold, wet weather caused great hardship, was primarily responsible for the Great Famine of 1315–1317, and strongly contributed to the weakened immunity and malnutrition leading up to the Black Death (1348–1350).',
  'evidence_id': 'Winter:114',
  'evidence_label': 0,
  'votes': ['SUPPORTS', 'SUPPORTS', None, None, None]},
 {'a
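
The nested layout shown above can be flattened with plain Python before reaching for pandas. The sketch below is only an illustration: `flatten_claims` is a hypothetical helper, and `sample` is a hand-trimmed copy of the row printed above (two of its evidence entries, with the claim text shortened), not data loaded from the Hub.

```python
def flatten_claims(records):
    """Turn one-row-per-claim records into one row per (claim, evidence) pair."""
    rows = []
    for record in records:
        for ev in record["evidences"]:
            rows.append({
                "claim": record["claim"],
                "evidence": ev["evidence"],
                "evidence_label": ev["evidence_label"],
            })
    return rows


# Hand-trimmed stand-in for a single dataset row (illustrative only).
sample = [{
    "claim": "The sun has gone into 'lockdown' ...",
    "evidences": [
        {"evidence_id": "Weather:67",
         "evidence": "The Little Ice Age caused crop failures and famines in Europe.",
         "evidence_label": 0},
        {"evidence_id": "Famine:386",
         "evidence": "The current consensus of the scientific community is ...",
         "evidence_label": 0},
    ],
}]

flat_rows = flatten_claims(sample)
for row in flat_rows:
    print(row["evidence_label"], "|", row["evidence"][:40])
```

Each output row repeats the claim text, which is exactly the redundancy the target format calls for.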

## Fix format

The desired format is:

|claim|evidence|label|
|--|--|--|
|claim 1|evidence 1|entailment|
|claim 1|evidence 2|contradiction|
|...|...|...|
|claim n|evidence 4|neutral|

This is easier to do with the original JSONL file, so we read it in directly from [GitHub](https://github.com/tdiggelm/climate-fever-dataset).

In [7]:
cf_url = 'https://raw.githubusercontent.com/tdiggelm/climate-fever-dataset/main/dataset/climate-fever.jsonl'
cf_orig = pd.read_json(cf_url, lines=True)
cf_orig.head()

| |claim_id|claim|claim_label|evidences|
|--|--|--|--|--|
|0|0|Global warming is driving polar bears toward e...|SUPPORTS|[{'evidence_id': 'Extinction risk from global ...|
|1|5|The sun has gone into ‘lockdown’ which could c...|SUPPORTS|[{'evidence_id': 'Famine:386', 'evidence_label...|
|2|6|The polar bear population has been growing.|REFUTES|[{'evidence_id': 'Polar bear:1332', 'evidence_...|
|3|9|Ironic' study finds more CO2 has slightly cool...|REFUTES|[{'evidence_id': 'Atmosphere of Mars:131', 'ev...|
|4|10|Human additions of CO2 are in the margin of er...|REFUTES|[{'evidence_id': 'Carbon dioxide in Earth's at...|


We must expand the `evidences` column so that each evidence sentence gets its own row.

In [8]:
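# Expand each claim's list of evidence dicts into one row per evidence:
# the first apply(pd.Series) spreads the list across columns, stack() turns
# those columns into rows, and the second apply(pd.Series) spreads each
# dict's keys into columns.
df = (cf_orig.set_index(['claim_id'])['evidences']
      .apply(pd.Series).stack()
      .apply(pd.Series).reset_index()
      .drop(columns='level_1'))  # level_1 is the evidence position added by stack()

# Re-attach the claim-level columns
claim_info = cf_orig[['claim_id', 'claim', 'claim_label']]
df = claim_info.merge(df, how='left', on='claim_id')

df.head(5)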



| |claim_id|claim|claim_label|evidence_id|evidence_label|article|evidence|entropy|votes|
|--|--|--|--|--|--|--|--|--|--|
|0|0|Global warming is driving polar bears toward e...|SUPPORTS|Extinction risk from global warming:170|NOT_ENOUGH_INFO|Extinction risk from global warming|"Recent Research Shows Human Activity Driving ...|0.693147|[SUPPORTS, NOT_ENOUGH_INFO, None, None, None]|
|1|0|Global warming is driving polar bears toward e...|SUPPORTS|Global warming:14|SUPPORTS|Global warming|Environmental impacts include the extinction o...|0.0|[SUPPORTS, SUPPORTS, None, None, None]|
|2|0|Global warming is driving polar bears toward e...|SUPPORTS|Global warming:178|NOT_ENOUGH_INFO|Global warming|Rising temperatures push bees to their physiol...|0.693147|[SUPPORTS, NOT_ENOUGH_INFO, None, None, None]|
|3|0|Global warming is driving polar bears toward e...|SUPPORTS|Habitat destruction:61|SUPPORTS|Habitat destruction|Rising global temperatures, caused by the gree...|0.0|[SUPPORTS, SUPPORTS, None, None, None]|
|4|0|Global warming is driving polar bears toward e...|SUPPORTS|Polar bear:1328|NOT_ENOUGH_INFO|Polar bear|"Bear hunting caught in global warming debate".|0.693147|[SUPPORTS, NOT_ENOUGH_INFO, None, None, None]|
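
The same expansion can also be written with `DataFrame.explode` and `pd.json_normalize`, which avoids calling `apply(pd.Series)` twice. This is a sketch on a made-up toy frame (`toy` and its contents are assumptions for illustration), not a drop-in replacement tested against the full file.

```python
import pandas as pd

# Toy stand-in for cf_orig: one row per claim, evidence dicts nested in a list.
toy = pd.DataFrame({
    "claim_id": [0, 5],
    "claim": ["claim A", "claim B"],
    "evidences": [
        [{"evidence_id": "a:1", "evidence_label": "SUPPORTS", "evidence": "e1"},
         {"evidence_id": "a:2", "evidence_label": "NOT_ENOUGH_INFO", "evidence": "e2"}],
        [{"evidence_id": "b:1", "evidence_label": "REFUTES", "evidence": "e3"}],
    ],
})

# One row per evidence dict; claim-level columns are repeated automatically.
exploded = toy.explode("evidences").reset_index(drop=True)

# Spread each dict's keys into their own columns.
evidence_cols = pd.json_normalize(exploded["evidences"].tolist())

flat = pd.concat([exploded.drop(columns="evidences"), evidence_cols], axis=1)
print(flat)
```

Because `explode` carries the other columns along, no separate merge with the claim columns is needed in this variant.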


Next, we map the evidence labels to the names commonly used in textual entailment (entailment, contradiction, neutral).

In [9]:
label_map = {'SUPPORTS':'entailment', 'REFUTES':'contradiction', 'NOT_ENOUGH_INFO':'neutral'}
df['label'] = df['evidence_label'].map(label_map)
df.head(10)

| |claim_id|claim|claim_label|evidence_id|evidence_label|article|evidence|entropy|votes|label|
|--|--|--|--|--|--|--|--|--|--|--|
|0|0|Global warming is driving polar bears toward e...|SUPPORTS|Extinction risk from global warming:170|NOT_ENOUGH_INFO|Extinction risk from global warming|"Recent Research Shows Human Activity Driving ...|0.693147|[SUPPORTS, NOT_ENOUGH_INFO, None, None, None]|neutral|
|1|0|Global warming is driving polar bears toward e...|SUPPORTS|Global warming:14|SUPPORTS|Global warming|Environmental impacts include the extinction o...|0.0|[SUPPORTS, SUPPORTS, None, None, None]|entailment|
|2|0|Global warming is driving polar bears toward e...|SUPPORTS|Global warming:178|NOT_ENOUGH_INFO|Global warming|Rising temperatures push bees to their physiol...|0.693147|[SUPPORTS, NOT_ENOUGH_INFO, None, None, None]|neutral|
|3|0|Global warming is driving polar bears toward e...|SUPPORTS|Habitat destruction:61|SUPPORTS|Habitat destruction|Rising global temperatures, caused by the gree...|0.0|[SUPPORTS, SUPPORTS, None, None, None]|entailment|
|4|0|Global warming is driving polar bears toward e...|SUPPORTS|Polar bear:1328|NOT_ENOUGH_INFO|Polar bear|"Bear hunting caught in global warming debate".|0.693147|[SUPPORTS, NOT_ENOUGH_INFO, None, None, None]|neutral|
|5|5|The sun has gone into ‘lockdown’ which could c...|SUPPORTS|Famine:386|SUPPORTS|Famine|The current consensus of the scientific commun...|0.0|[SUPPORTS, SUPPORTS, None, None, None]|entailment|
|6|5|The sun has gone into ‘lockdown’ which could c...|SUPPORTS|Weather:67|SUPPORTS|Weather|The Little Ice Age caused crop failures and fa...|0.0|[SUPPORTS, SUPPORTS, None, None, None]|entailment|
|7|5|The sun has gone into ‘lockdown’ which could c...|SUPPORTS|Winter:114|SUPPORTS|Winter|The persistently cold, wet weather caused grea...|0.0|[SUPPORTS, SUPPORTS, None, None, None]|entailment|
|8|5|The sun has gone into ‘lockdown’ which could c...|SUPPORTS|Winter:20|NOT_ENOUGH_INFO|Winter|The manifestation of the meteorological winter...|0.693147|[REFUTES, NOT_ENOUGH_INFO, None, None, None]|neutral|
|9|5|The sun has gone into ‘lockdown’ which could c...|SUPPORTS|Winter:5|NOT_ENOUGH_INFO|Winter|In many regions, winter is associated with sno...|0.693147|[REFUTES, NOT_ENOUGH_INFO, None, None, None]|neutral|
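
One thing worth checking after a dictionary-based `Series.map` is that every value was covered, because keys missing from the dictionary silently become `NaN`. The snippet below is a minimal sketch on toy data; the `"DISPUTED"` value is included only as an illustrative unexpected label to trigger the check.

```python
import pandas as pd

label_map = {"SUPPORTS": "entailment",
             "REFUTES": "contradiction",
             "NOT_ENOUGH_INFO": "neutral"}

# Toy labels; "DISPUTED" is an illustrative value that label_map does not cover.
toy = pd.DataFrame(
    {"evidence_label": ["SUPPORTS", "REFUTES", "NOT_ENOUGH_INFO", "DISPUTED"]}
)
toy["label"] = toy["evidence_label"].map(label_map)

# Collect the source values that ended up as NaN after mapping.
unmapped = sorted(toy.loc[toy["label"].isna(), "evidence_label"].unique())
print("unmapped evidence labels:", unmapped)
```

An empty list here means the mapping covered every label present in the column.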


In line with the desired format above, we only need a few of these columns. I also keep `article`, renamed to `category`, for a potential text classification task.

In [10]:
df_final = df[['claim_id', 'claim', 'evidence', 'evidence_label', 'label', 'article']]
df_final = df_final.rename({'article': 'category'}, axis=1)
df_final.head()

| |claim_id|claim|evidence|evidence_label|label|category|
|--|--|--|--|--|--|--|
|0|0|Global warming is driving polar bears toward e...|"Recent Research Shows Human Activity Driving ...|NOT_ENOUGH_INFO|neutral|Extinction risk from global warming|
|1|0|Global warming is driving polar bears toward e...|Environmental impacts include the extinction o...|SUPPORTS|entailment|Global warming|
|2|0|Global warming is driving polar bears toward e...|Rising temperatures push bees to their physiol...|NOT_ENOUGH_INFO|neutral|Global warming|
|3|0|Global warming is driving polar bears toward e...|Rising global temperatures, caused by the gree...|SUPPORTS|entailment|Habitat destruction|
|4|0|Global warming is driving polar bears toward e...|"Bear hunting caught in global warming debate".|NOT_ENOUGH_INFO|neutral|Polar bear|


## Convert to dataset and split into test/train/val

In [11]:
ds = Dataset.from_pandas(df_final, preserve_index=False)
ds.features

{'category': Value(dtype='string', id=None),
 'claim': Value(dtype='string', id=None),
 'claim_id': Value(dtype='int64', id=None),
 'evidence': Value(dtype='string', id=None),
 'evidence_label': Value(dtype='string', id=None),
 'label': Value(dtype='string', id=None)}
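
Note that `label` is stored as a plain string here. Classification code often expects integer class ids, so a string-to-id mapping may be needed downstream. The sketch below shows one way to build it in plain Python; the id order and the helper name `encode_labels` are assumptions for illustration and should be matched to whatever model is used later.

```python
# Assumed id order (illustrative); align it with the downstream model's config.
label_names = ["entailment", "neutral", "contradiction"]
label2id = {name: idx for idx, name in enumerate(label_names)}
id2label = {idx: name for name, idx in label2id.items()}


def encode_labels(labels):
    """Map string labels to integer ids, failing loudly on unknown values."""
    unknown = sorted(set(labels) - set(label2id))
    if unknown:
        raise ValueError(f"unknown labels: {unknown}")
    return [label2id[label] for label in labels]


encoded = encode_labels(["neutral", "entailment", "neutral", "contradiction"])
print(encoded)
print([id2label[i] for i in encoded])
```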

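Split into train/test/validation sets: 20% of the rows are held out for testing, then 30% of the remaining rows are held out for validation.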

In [12]:
# start with train/test split: hold out 20% of the rows for testing
ds = ds.train_test_split(test_size=0.2, seed=727)

# split training into train and validation
train_val_ds = ds['train'].train_test_split(test_size=0.3, seed=451)

# update original ds with re-split training and validation
ds['train'] = train_val_ds['train']
ds['valid'] = train_val_ds['test']

Check sizes

In [13]:
ds

DatasetDict({
    train: Dataset({
        features: ['claim_id', 'claim', 'evidence', 'evidence_label', 'label', 'category'],
        num_rows: 4298
    })
    test: Dataset({
        features: ['claim_id', 'claim', 'evidence', 'evidence_label', 'label', 'category'],
        num_rows: 1535
    })
    valid: Dataset({
        features: ['claim_id', 'claim', 'evidence', 'evidence_label', 'label', 'category'],
        num_rows: 1842
    })
})
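
As a sanity check, the reported sizes can be reproduced with simple arithmetic from the row counts shown earlier (1535 claims, 5 evidence rows each). This sketch only redoes the arithmetic with `round`; it does not claim to mirror how `train_test_split` rounds internally.

```python
# Row counts taken from the notebook outputs above.
n_claims = 1535
evidences_per_claim = 5
n_rows = n_claims * evidences_per_claim

# First split: 20% of all rows held out for testing.
n_test = round(0.2 * n_rows)
n_train_val = n_rows - n_test

# Second split: 30% of the remaining rows become the validation set.
n_valid = round(0.3 * n_train_val)
n_train = n_train_val - n_valid

print(n_train, n_test, n_valid)
for name, n in [("train", n_train), ("test", n_test), ("valid", n_valid)]:
    print(f"{name}: {n / n_rows:.0%} of all rows")
```

The three counts match the `DatasetDict` above, which works out to roughly a 56/20/24 split of the 7675 claim-evidence rows.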

## Push to hub

Finally, let's push the reshaped dataset to the Hugging Face Hub so the rest of the project can use it.

In [14]:
!git config --global credential.helper store

In [15]:
ds.push_to_hub('amandakonet/climate_fever_adopted', private=True)

Pushing split train to the Hub.

Pushing split test to the Hub.
The repository already exists: the `private` keyword argument will be ignored.

Pushing split valid to the Hub.
The repository already exists: the `private` keyword argument will be ignored.