## Reading Causal Relations Corpora
By: Pedram Hosseini (pdr.hosseini@gmail.com)

There have been several efforts to create causal relation corpora, with different levels of granularity and based on multiple annotation schemes. These efforts, although admirable, are fairly sparse, which makes it hard for people in the NLP community to use the knowledge these resources provide. To alleviate this problem and make it easier to benefit from these data sets and their rich human annotations, we wrote methods in a **CausalDataReader** class that convert all of these resources into a simple, friendly format that anyone can easily use.
We try to keep most of the annotations from the original sources in the new schema so that no information is lost during conversion. The data sets currently covered in our collection are:

- **SemEval 2007 task 4** - Public (source: **1**)
- **SemEval 2010 task 8** - Public (source: **2**)
- **EventCausality data set** - Public (source: **3**)
- **Causal-TimeBank** - Not public (source: **4**)
- **Crowdsourcing-StoryLines 1.2** - Public (source: **5**)
- **CaTeRS** - Public (source: **6**)
- **BECAUSE v2.1** - Public (source: **7**)
- **Crowdsourcing-StoryLines 1.5** - Public (source: **8**)
- **Your data set?**

#### JOIN US
We invite everyone in the ML/NLP/NLU community, and groups of researchers who work on causal/counterfactual relation extraction in language, to contribute to this repository, **crest/crest_converter.py** in particular, so that we can all take a step forward in improving the quality of available data resources and alleviating the sparseness issue.
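
As a starting point for a contribution, here is a minimal sketch of what a converter for a new data set might return. The column names are copied from the `data.head()` outputs in this notebook; the function name `convert_my_dataset`, the source id `9`, the empty `ann_file`, and the use of `str.index` to locate the first occurrence of each span are assumptions made for illustration, not the repository's actual API.

```python
import pandas as pd

# Column names as they appear in the data.head() outputs of this notebook.
COLUMNS = ["id", "span1", "span2", "context", "idx", "label", "source", "ann_file", "split"]


def convert_my_dataset(records, source_id):
    """Hypothetical converter: (context, span1, span2, label) tuples -> schema rows."""
    rows = []
    for i, (context, span1, span2, label) in enumerate(records, start=1):
        # Assumption: each span occurs in the context, and the first match is the annotated one.
        start1 = context.index(span1)
        start2 = context.index(span2)
        rows.append({
            "id": i,
            "span1": span1,
            "span2": span2,
            "context": context,
            "idx": {
                "span1": [[start1, start1 + len(span1)]],
                "span2": [[start2, start2 + len(span2)]],
            },
            "label": label,
            "source": source_id,
            "ann_file": "",
            "split": 0,
        })
    return pd.DataFrame(rows, columns=COLUMNS)


# Toy record, not taken from any of the corpora above.
df = convert_my_dataset([("The storm caused a blackout.", "storm", "blackout", 1)], source_id=9)
print(df.iloc[0]["idx"])
```

A real converter would read the offsets from the original annotation files rather than searching for the span text, since a span string can occur more than once in a context.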

In [1]:
import os
import sys

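# "__file__" is passed as a string on purpose: notebooks do not define __file__,
# so dirname() returns '' and root_path resolves to the parent of the working directory.
root_path = os.path.abspath(os.path.join(os.path.dirname("__file__"), '..'))
sys.path.insert(0, root_path)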

In [3]:
from crest import crest_converter
obj = crest_converter.CrestConverter()
total_samples = 0

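def find_len_max(df_data):
    # print the length and text of the longest context in df_data
    len_max = 0
    i_max = -1
    for index, row in df_data.iterrows():
        if len(row.context) > len_max:
            len_max = len(row.context)
            i_max = index
    print("max length = " + str(len_max))
    # look the row up in the frame that was passed in, not in the global `data`
    if i_max != -1:
        print(df_data.loc[i_max].context)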
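
The same lookup can be done without an explicit loop. A minimal sketch, assuming the frame has a string `context` column and a unique index; `longest_context` is a hypothetical helper, not part of the repository:

```python
import pandas as pd


def longest_context(df, column="context"):
    """Return (length, text) of the longest string in `column`."""
    lengths = df[column].str.len()   # character count per row
    label = lengths.idxmax()         # index label of the longest entry
    return int(lengths.loc[label]), df.loc[label, column]


# Toy frame with a non-default index to show that labels, not positions, are used.
toy = pd.DataFrame(
    {"context": ["short", "a much longer context", "mid-size"]},
    index=[10, 20, 30],
)
length, text = longest_context(toy)
print("max length = {}".format(length))
print(text)
```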

## SemEval 2010 task 8

In [4]:
data = obj.convert_semeval_2010_8()
total_samples += len(data)
print("samples: {}".format(len(data)))
print("+ causal: {}".format(len(data.loc[(data["label"] == 1) | (data["label"] == 2)])))
print("- non-causal: {}".format(len(data.loc[data["label"] == 0])))

samples: 10717
+ causal: 1331
- non-causal: 9386


In [5]:
data.head()

Unnamed: 0,id,span1,span2,context,idx,label,source,ann_file,split
0,1,configuration,elements,The system as described above has its greatest...,"{'span1': [[73, 86]], 'span2': [[98, 106]]}",0,2,,0
1,2,child,cradle,The child was carefully wrapped and bound into...,"{'span1': [[4, 9]], 'span2': [[51, 57]]}",0,2,,0
2,3,author,disassembler,The author of a keygen uses a disassembler to ...,"{'span1': [[4, 10]], 'span2': [[30, 42]]}",0,2,,0
3,4,ridge,surge,A misty ridge uprises from the surge.\n,"{'span1': [[8, 13]], 'span2': [[31, 36]]}",0,2,,0
4,5,student,association,The student association is the voice of the un...,"{'span1': [[4, 11]], 'span2': [[12, 23]]}",0,2,,0
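
The `idx` column stores character offsets into `context` as lists of `[start, end]` pairs. A minimal sketch of how the spans can be recovered from those offsets, using the fourth row shown above; `extract_spans` is a hypothetical helper, and parsing with `ast.literal_eval` assumes the column may hold a string (for example after a CSV round trip) rather than a dict:

```python
import ast


def extract_spans(context, idx, key):
    """Return the text fragments for `key` ('span1' or 'span2')."""
    # After a CSV round trip the idx column holds a string, so parse it safely.
    if isinstance(idx, str):
        idx = ast.literal_eval(idx)
    # Each span is a list of [start, end] pairs, so it may have several fragments.
    return [context[start:end] for start, end in idx[key]]


context = "A misty ridge uprises from the surge.\n"
idx = "{'span1': [[8, 13]], 'span2': [[31, 36]]}"
print(extract_spans(context, idx, "span1"))
print(extract_spans(context, idx, "span2"))
```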


In [6]:
find_len_max(data)

max length = 493
Literary criticism is the study of literature by means of a microscopic knowledge of the language in which a book is written, of its growth from various roots, of its stages of development and the factors influencing them, of its condition in the period of this particular composition, of the writer's idiosyncrasies of thought and style in his ripening periods, of the general history and literature of his race, and of the special characteristics of his age and of his contemporary writers.



## SemEval 2007 task 4 

In [7]:
data = obj.convert_semeval_2007_4()
total_samples += len(data)
print("samples: " + str(len(data)))
print("+ causal: {}".format(len(data.loc[(data["label"] == 1) | (data["label"] == 2)])))
print("- non-causal: {}".format(len(data.loc[data["label"] == 0])))

samples: 220
+ causal: 114
- non-causal: 106


In [8]:
data.head()

Unnamed: 0,id,span1,span2,context,idx,label,source,ann_file,split
0,1,tumor shrinkage,radiation therapy,The period of tumor shrinkage after radiation ...,"{'span1': [[14, 29]], 'span2': [[36, 53]]}",2,1,,0
1,2,Habitat degradation,stream channels,Habitat degradation from within stream channel...,"{'span1': [[0, 19]], 'span2': [[32, 47]]}",0,1,,0
2,3,discomfort,traveling,Earplugs relieve the discomfort from traveling...,"{'span1': [[21, 31]], 'span2': [[37, 46]]}",2,1,,0
3,4,daily terror,antipersonnel land mines,We continue to see progress toward a world fre...,"{'span1': [[55, 67]], 'span2': [[71, 95]]}",2,1,,0
4,5,segment,anecdotes,The Global Warming segment starts off with two...,"{'span1': [[19, 26]], 'span2': [[53, 62]]}",0,1,,0


In [9]:
find_len_max(data)

max length = 493
Literary criticism is the study of literature by means of a microscopic knowledge of the language in which a book is written, of its growth from various roots, of its stages of development and the factors influencing them, of its condition in the period of this particular composition, of the writer's idiosyncrasies of thought and style in his ripening periods, of the general history and literature of his race, and of the special characteristics of his age and of his contemporary writers.



## EventCausality data set

In [10]:
data = obj.convert_event_causality()
total_samples += len(data)
print("samples: " + str(len(data)))

samples: 485


In [11]:
for index, row in data.iterrows():
    span1 = row["context"][row["idx"]["span1"][0][0]:row["idx"]["span1"][0][1]]
    span2 = row["context"][row["idx"]["span2"][0][0]:row["idx"]["span2"][0][1]]
    print("{}:{}".format(span1, row["span1"]))
    print("{}:{}".format(span2, row["span2"]))
    print("------")

attacks:attacks
conclude:conclude
------
conclude:conclude
review:review
------
deliveries:deliveries
interpreted:interpreted
------
leaves:leaves
advancement:advancement
------
leaves:leaves
success:success
------
leaves:leaves
oppression:oppression
------
determine:determine
seek:seek
------
slowed:slowed
created:created
------
controls:controls
created:created
------
blocked:blocked
created:created
------
created:created
hoped:hoped
------
controls:controls
slowed:slowed
------
controls:controls
blocked:blocked
------
cooperating:cooperating
provide:provide
------
dominates:dominates
dwarfed:dwarfed
------
exits:exits
come:come
------
exits:exits
decision:decision
------
found:found
got:got
------
found:found
happened:happened
------
lured:lured
raped:raped
------
delivered:delivered
raped:raped
------
made:made
asked:asked
------
made:made
come:come
------
made:made
work:work
------
tell:tell
go:go
------
sold:sold
pay:pay
------
bought:bought
pay:pay
------
found:found
pay:pay
---

------
ensued:ensued
captured:captured
------
exchanges:exchanges
shootings:shootings
------
theft:theft
captured:captured
------
responded:responded
in:in
------
captured:captured
occurred:occurred
------
captured:captured
identified:identified
------
wounding:wounding
occurred:occurred
------
shootout:shootout
died:died
------
fight:fight
killings:killings
------
enlisted:enlisted
exchanges:exchanges
------
cracking:cracking
exchanges:exchanges
------
directs:directs
attacks:attacks
------
arrest:arrest
replaced:replaced
------
arrests:arrests
signify:signify
------


In [18]:
data.head()

Unnamed: 0,id,span1,span2,context,idx,label,source,ann_file,split
0,1,attacks,conclude,"The company says the attacks "" have led us to ...","{'span1': [[21, 28]], 'span2': [[46, 54]]}",1,3,,1
1,2,conclude,review,"The company says the attacks "" have led us to ...","{'span1': [[46, 54]], 'span2': [[70, 76]]}",1,3,,1
2,3,deliveries,interpreted,A large number of flower deliveries were made ...,"{'span1': [[25, 35]], 'span2': [[105, 116]]}",1,3,,1
3,4,leaves,advancement,""" If Google leaves China , it is likely to be ...","{'span1': [[12, 18]], 'span2': [[57, 68]]}",1,3,,1
4,5,leaves,success,""" If Google leaves China , it is likely to be ...","{'span1': [[12, 18]], 'span2': [[87, 94]]}",1,3,,1


In [19]:
find_len_max(data)

max length = 1076
French police responded to reports of car theft in a town near Paris late Tuesday and a shootout ensued with a group of alleged thieves . Most of them escaped but police captured one and he was later identified as a suspected ETA member , said the spokeswoman , who by custom is not identified . Spanish media reported that the shootout occurred in the town of Dammarie-les-Lys . The dead French policeman was wearing a bullet-proof vest but bullets struck fatally elsewhere on his body . He was reported to be in his 50s , and the father of four children . ETA has traditionally used France as its rearguard logistics and planning base to prepare attacks across the border in Spain , officials say . But in recent years as Spain has enlisted increased cooperation from France in cracking down on ETA hideouts , there have been various exchanges of gunfire between ETA suspects and French police , wounding some officers . Almost all of ETA 's fatal shootings and car bombings have 

## Causal-TimeBank

In [12]:
data = obj.read_causal_timebank()
total_samples += len(data)
print("samples: " + str(len(data)))

samples: 312


In [13]:
data.head()

Unnamed: 0,id,arg1,arg2,text,direction,label,source,ann_file,split
0,1,downturn,spending,"But in the past three months , stocks have plu...",0,1,4,ABC19980108.1830.0711.xml,
1,2,change,reposition,The Indonesian currency has lost twenty six pe...,0,1,4,ABC19980108.1830.0711.xml,
2,3,rains,landslides,Officials in California are warning residents ...,0,1,4,PRI19980213.2000.0313.xml,
3,4,get,rains,Forecasters say the picture will <@rg1>get</@r...,1,1,4,PRI19980213.2000.0313.xml,
4,5,dispute,invasion,"Bush , commenting on the two-week gulf crisis ...",1,1,4,AP900816-0139.xml,


In [14]:
find_len_max(data)

max length = 784
WASHINGTON _ Following are <@rg1>statements</@rg1> made Friday and Thursday by Lawrence Wechsler , a lawyer for the White House secretary , Betty Currie ; the White House ; White House spokesman Mike McCurry , and President Clinton in response to an article in The New York Times on Friday about her <@rg2>statements</@rg2> regarding a meeting with the president : Wechsler on Thursday " Without commenting on the allegations raised in this article , to the extent that there is any implication or suggestion that Mrs. Currie was aware of any legal or ethical impropriety by anyone , that implication or suggestion is entirely inaccurate . " I was pleased that Ms. Currie 's lawyers stated unambiguously this morning _ unambiguously _ that she 's not aware of any unethical conduct . 


## Crowdsourcing-StoryLines

In [15]:
data = obj.read_story_lines()
total_samples += len(data)
print("samples: " + str(len(data)))

samples: 1540


In [16]:
data.head()

Unnamed: 0,id,arg1,arg2,text,direction,label,source,ann_file,split
0,1,double murder,killing,Cumbria <@rg1>double murder</@rg1> : Son suspe...,1,1,5,,
1,2,sectioned,suicide attempt,"John Jenkin , 23 , had been <@rg1>sectioned</@...",1,1,5,,
2,3,charged,killing,Millom double murder : Man <@rg1>charged</@rg1...,1,1,5,,
3,4,double murder,charged,Millom <@rg1>double murder</@rg1> : Man <@rg2>...,0,1,5,,
4,5,charged,murder,A MAN has been <@rg1>charged</@rg1> with the <...,1,1,5,,


In [17]:
find_len_max(data)

max length = 416
Then came the anger : a Monday evening vigil <@rg1>marred</@rg1> by an unruly young mob <@rg2>thrashing</@rg2> its way through local businesses ; a second protest the next night ; and another on Wednesday night , after which , the police said , someone hit an officer in the face with a brick , another brick was thrown through the window of a police van , and there were 46 arrests — mostly for disorderly conduct .


## CaTeRS

In [18]:
data = obj.read_CaTeRS()
total_samples += len(data)
print("samples: " + str(len(data)))

samples: 203


In [19]:
i = 200
print(data.iloc[i]['text'])
print(data.iloc[i]['arg1'])
print(data.iloc[i]['arg2'])
print(data.iloc[i]['ann_file'])

Deb was visiting family for the holidays. They were at dinner when Deb's job came up. Everyone asked Deb about work, and Deb <@rg2>got</@rg2> nervous. The truth was that she had been <@rg1>unemployed</@rg1> for some time. Finally, she decided to just admit it to her family.
unemployed
got
batch_8.ann
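
The `text` column here, as in the Causal-TimeBank and StoryLines samples above, marks the two arguments inline with `<@rg1>` and `<@rg2>` tags. A minimal sketch of how those tags could be parsed, using two sentences from the sample printed above; `parse_tagged_text` and the regular expressions are assumptions for illustration, not code from the repository:

```python
import re

# Matches <@rg1>...</@rg1> or <@rg2>...</@rg2>; the backreference keeps open/close tags paired.
ARG_TAG = re.compile(r"<@rg([12])>(.*?)</@rg\1>", re.DOTALL)
ANY_TAG = re.compile(r"</?@rg[12]>")


def parse_tagged_text(text):
    """Return (plain_text, {'arg1': [...], 'arg2': [...]}) from inline-tagged text."""
    args = {"arg1": [], "arg2": []}
    for number, fragment in ARG_TAG.findall(text):
        args["arg" + number].append(fragment.strip())
    plain = ANY_TAG.sub("", text)  # drop the markers, keep the words
    return plain, args


sample = (
    "Everyone asked Deb about work, and Deb <@rg2>got</@rg2> nervous. "
    "The truth was that she had been <@rg1>unemployed</@rg1> for some time."
)
plain, args = parse_tagged_text(sample)
print(plain)
print(args)
```

Arguments are collected in lists because some samples, such as the longest BECAUSE sample shown later in this notebook, tag a single argument in several fragments.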


In [20]:
data.head()

Unnamed: 0,id,arg1,arg2,text,direction,label,source,ann_file,split
0,1,lost,fit,Kay <@rg1>lost</@rg1> 50 pounds and her clothe...,0,1,6,part_12.ann,1
1,2,impressed,bought,Kay lost 50 pounds and her clothes no longer f...,0,1,6,part_12.ann,1
2,3,fit,decided,Kay lost 50 pounds and her clothes no longer <...,0,1,6,part_12.ann,1
3,4,skied,be,Charles was 10-years-old when his uncle took h...,0,1,6,part_12.ann,1
4,5,was,cleaning,Marion <@rg1>was</@rg1> about to move. She was...,0,1,6,part_12.ann,1


In [21]:
find_len_max(data)

max length = 348
There used to be a candy store down the block from where Joe lived. One day, Joe was walking down the street and saw that it was <@rg1>closed</@rg1>. He asked someone who was walking by why it was closed. Joe didn't go there very often, but it was still <@rg2>upsetting</@rg2>. Now that Joe has kids, he has to drive three miles just to get to one.


## BECAUSE v2.1

Since the raw text files for **PTB** and **NYT** require an LDC subscription, these files are not yet covered by our data reader. Once we have access to the raw files from these resources, we will write the proper data readers for them.

In [22]:
data = obj.read_because()
total_samples += len(data)
print("samples: " + str(len(data)))

samples: 636


In [23]:
data.head()

Unnamed: 0,id,arg1,arg2,text,direction,label,source,ann_file,split
0,1,that has arisen,the past few years,We should redesign it to reflect current reali...,0,0,7,CHRG-111shrg61651.ann,
1,2,these banks are too big to fail,"they have lower funding costs, they are able t...",This is unfair competition. <S!G>Because</S!G>...,0,1,7,CHRG-111shrg61651.ann,
2,3,they make more money,the cycle,This is unfair competition. Because these bank...,0,0,7,CHRG-111shrg61651.ann,
3,4,too big,fail,This is unfair competition. Because these bank...,0,1,7,CHRG-111shrg61651.ann,
4,5,you look at the European situation today,it is much worse than what we have in this cou...,And I do not think that we have seen the end o...,0,0,7,CHRG-111shrg61651.ann,


In [24]:
find_len_max(data)

max length = 1262
Your acceptance of a television and stereo CD player upon payment from David Chang of an amount you understood to be the cost to Mr. Chang, rather than fair market retail value, evidenced poor judgment, displayed a lack of due regard for Senate rules and resulted in a violation of the Senate Gifts Rules (35) and, consequently, a violation of your public disclosure obligations under Senate rules.  2. <@rg1> Your acceptance on loan from Mr. Chang of bronze statues (eagle and bronco buster) for</@rg1> <@rg2> Your acceptance on loan from Mr. Chang of bronze statues (eagle and bronco buster) for</@rg2> <@rg1>display in your Senate office under your office's policy of accepting the loan of home state</@rg1> <@rg2>display in your Senate office under your office's policy of accepting the loan of home state</@rg2> <@rg1>artwork</@rg1> <@rg2> artwork</@rg2> was not consistent with Senate rules governing such loans, evidenced poor judgment, displayed a lack of due regard for the

## EventStoryLine v1.5

In [26]:
data = obj.read_story_lines_v_1_5()
total_samples += len(data)
print("samples: " + str(len(data)))

samples: 2055


In [27]:
pos = data.loc[data.label == 1]
neg = data.loc[data.label == 0]
print("# causal: " + str(len(pos)))
print("# non-causal: " +str(len(neg)))

# causal: 104
# non-causal: 1951


In [28]:
find_len_max(data)

max length = 593
One place Lindsay is finding this particularly difficult though is at the Betty Ford Centre , where the actress has just finished 30 days' worth of treatment and has decided that the renowned facilities there won't be enough to keep her around for her court - ordered 90 - day stay in rehab , so she's <@rg1>gone</@rg1> elsewhere .According to a report from TMZ published this week ( June 13 ) , the Mean Girls actress has <@rg2>had enough</@rg2> of the Palm Springs rehab centre and has decided to spend the rest of her court - ordered stay in a rehab centre in Malibu , named the Cliffside .


### Data set statistics

In [31]:
import os
import pandas as pd

data_path = root_path + '/data/causal/gold_causal.csv'
if os.path.exists(data_path):
    data = pd.read_csv(data_path)
else:
    data = obj.read_all()
    # write to the same path that is checked above
    data.to_csv(data_path)
train = data.loc[data.split == 0]
dev = data.loc[data.split == 1]
test = data.loc[data.split == 2]
pos = data.loc[data.label == 1]
neg = data.loc[data.label == 0]
print("Total samples: " + str(len(data)))
print("----------------------------")
print("# Causal: " + str(len(pos)))
print("# Non-Causal: " + str(len(neg)))
print("----------------------------")
print("# train: " + str(len(train)))
print("# dev: " + str(len(dev)))
print("# test: " + str(len(test)))
print("----------------------------")
for i in range(7):
    print("# source [" + str(i+1) + "]: " + str(len(data.loc[data.source == (i+1)])))

Total samples: 4724
----------------------------
# Causal: 4461
# Non-Causal: 263
----------------------------
# train: 1268
# dev: 149
# test: 822
----------------------------
# source [1]: 220
# source [2]: 1331
# source [3]: 485
# source [4]: 312
# source [5]: 1540
# source [6]: 203
# source [7]: 633
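
The per-source and per-split counts above can also be computed in one pass with pandas. A minimal sketch on a toy frame, assuming the merged data has `source`, `label`, and `split` columns as in the cell above; `summarize` is a hypothetical helper:

```python
import pandas as pd


def summarize(df):
    """Count samples per source and label, and per split."""
    per_source = pd.crosstab(df["source"], df["label"])    # rows: source, columns: label
    per_split = df["split"].value_counts().sort_index()    # rows without a split are skipped
    return per_source, per_split


# Toy data, not taken from the corpora.
toy = pd.DataFrame({
    "source": [1, 1, 2, 2, 2, 3],
    "label":  [1, 0, 1, 1, 0, 1],
    "split":  [0, 0, 0, 1, 2, 2],
})
per_source, per_split = summarize(toy)
print(per_source)
print(per_split)
```

Passing `dropna=False` to `value_counts` would also count rows that have no split assigned, which would explain cases where the train, dev, and test counts do not add up to the total.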


In [32]:
# checking if all samples have labels
print("# unlabeled samples: " + str(len(data.loc[(data.label != 0) & (data.label != 1)])))

# unlabeled samples: 0
