## Reading Causal Relations Corpora
By: Pedram Hosseini (pdr.hosseini@gmail.com)

There have been several efforts to create causal relation corpora at different levels of granularity and under different annotation schemes. These efforts, although admirable, are fairly sparse, which makes it hard for people in the NLP community to use the knowledge these resources contain. To help alleviate this problem and make it easier to benefit from these data sets and their rich human annotation, we wrote methods in a **CausalDataReader** class that convert all of these resources into a simple, friendly format that anyone can use easily.
We try to keep most of the annotations from the original sources in the new schema so that as little information as possible is lost during conversion. Our collection currently covers the following data sets:

- **SemEval 2007 task 4** - Public (source: **1**)
- **SemEval 2010 task 8** - Public (source: **2**)
- **EventCausality data set** - Public (source: **3**)
- **Causal-TimeBank** - Not public (source: **4**)
- **Crowdsourcing-StoryLines 1.2** - Public (source: **5**)
- **CaTeRS** - Public (source: **6**)
- **BECAUSE v2.1** - Public (source: **7**)
- **Crowdsourcing-StoryLines 1.5** - Public (source: **8**)
- **Your data set?**

#### JOIN US
We invite everyone in the ML/NLP/NLU community, and groups of researchers who work on causal relation extraction in language, to contribute to this repository, **data_reader.py** in particular, so that together we can improve the quality of the available data resources and alleviate the sparseness issue.

In [1]:
import data_reader as dr
obj = dr.CausalDataReader()
total_samples = 0

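def find_len_max(df_data):
    """Print the length and content of the longest text in df_data."""
    len_max = 0
    i_max = -1
    for index, row in df_data.iterrows():
        if len(row.text) > len_max:
            len_max = len(row.text)
            i_max = index
    print("max length = " + str(len_max))
    # Use the frame passed in (not the global `data`); `index` is a label, so use .loc
    print(df_data.loc[i_max].text)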
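
The row-by-row loop above can also be written with vectorized pandas string methods. The following is a minimal sketch and not part of `data_reader.py`: the helper name `longest_text` and the toy frame are made up for illustration, and the only assumption is a `text` column of strings.

```python
import pandas as pd


def longest_text(df):
    """Hypothetical helper: return (length, text) for the longest 'text' entry."""
    lengths = df["text"].str.len()   # character count per row
    idx = lengths.idxmax()           # index label (not position) of the longest row
    return int(lengths.loc[idx]), df.loc[idx, "text"]


# Toy frame with a non-default index, to show that labels are used for the lookup.
toy = pd.DataFrame(
    {"text": ["short", "a much longer sentence", "mid length"]},
    index=[10, 20, 30],
)
length, text = longest_text(toy)
print("max length = " + str(length))
print(text)
```

Because `idxmax` returns an index label, the lookup uses `.loc`; mixing it with `.iloc` would only be safe for a default 0..n-1 index.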

## SemEval 2010 task 8

In [2]:
data = obj.read_semeval_2010_8()
total_samples += len(data)
print("samples: " + str(len(data)))

samples: 1331


In [3]:
data.head()

Unnamed: 0,id,arg1,arg2,text,direction,label,source,ann_file,split
0,7,inflammation,infection,The current view is that the chronic <@rg1>inf...,1,1,2,,0
1,14,burst,pressure,The <@rg1>burst</@rg1> has been caused by wate...,1,1,2,,0
2,23,singer,commotion,"The <@rg1>singer</@rg1>, who performed three o...",0,1,2,,0
3,27,Suicide,death,<@rg1>Suicide</@rg1> is one of the leading cau...,0,1,2,,0
4,32,headaches,mold,He had chest pains and <@rg1>headaches</@rg1> ...,1,1,2,,0


In [4]:
find_len_max(data)

max length = 369
The report links conditions in some of the worst affected localities and the likelihood that dire poverty - combined with despair and <@rg1>outrage</@rg1> over rampant <@rg2>corruption</@rg2>, repressive policies, and governments' failure to address local needs - could lead to outbreaks of localised unrest with the potential to spread into a wider regional conflict. 
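
As the output above shows, each `text` value marks the two arguments inline with `<@rg1>` and `<@rg2>` tags. A minimal sketch of recovering the argument spans and the untagged sentence with regular expressions follows. The helper names and the sample sentence are invented for illustration and are not taken from `data_reader.py` or from any corpus; the sketch assumes each argument is one contiguous span.

```python
import re

# Matches any opening or closing argument tag, e.g. <@rg1> or </@rg2>.
ARG_TAG = re.compile(r"</?@rg[12]>")


def extract_arg(text, n):
    """Hypothetical helper: return the first span wrapped in <@rgN>...</@rgN>."""
    match = re.search(r"<@rg{0}>(.*?)</@rg{0}>".format(n), text, flags=re.DOTALL)
    return match.group(1) if match else None


def strip_arg_tags(text):
    """Hypothetical helper: remove the argument tags, keeping the words."""
    return ARG_TAG.sub("", text)


# Invented sample in the same tagging style as the rows above.
sample = "The <@rg1>flood</@rg1> was caused by heavy <@rg2>rain</@rg2>."
print(extract_arg(sample, 1))
print(extract_arg(sample, 2))
print(strip_arg_tags(sample))
```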


## SemEval 2007 task 4 

In [5]:
data = obj.read_semeval_2007_4()
total_samples += len(data)
print("samples: " + str(len(data)))

samples: 220


In [6]:
data.head()

Unnamed: 0,id,arg1,arg2,text,direction,label,source,ann_file,split
0,1,tumor shrinkage,radiation therapy,The period of <@rg1>tumor shrinkage</@rg1> aft...,1,1,1,,0
1,2,Habitat degradation,stream channels,<@rg1>Habitat degradation</@rg1> from within <...,1,0,1,,0
2,3,discomfort,traveling,Earplugs relieve the <@rg1>discomfort</@rg1> f...,1,1,1,,0
3,4,daily terror,antipersonnel land mines,We continue to see progress toward a world fre...,1,1,1,,0
4,5,segment,anecdotes,The Global Warming <@rg1>segment</@rg1> starts...,1,0,1,,0


In [7]:
find_len_max(data)

max length = 519
Literary criticism is the study of literature by means of a microscopic knowledge of the language in which a book is written, of its <@rg1>growth</@rg1> from various <@rg2>roots</@rg2>, of its stages of development and the factors influencing them, of its condition in the period of this particular composition, of the writer's idiosyncrasies of thought and style in his ripening periods, of the general history and literature of his race, and of the special characteristics of his age and of his contemporary writers. 


## EventCausality data set

In [8]:
data = obj.read_event_causality()
total_samples += len(data)
print("samples: " + str(len(data)))

samples: 485


In [9]:
data.head()

Unnamed: 0,id,arg1,arg2,text,direction,label,source,ann_file,split
0,1,attacks,conclude,"The company says the <@rg1>attacks</@rg1> "" ha...",1,1,3,,1
1,2,conclude,review,"The company says the attacks "" have led us to ...",1,1,3,,1
2,3,deliveries,interpreted,A large number of flower <@rg1>deliveries</@rg...,1,1,3,,1
3,4,leaves,advancement,""" If Google <@rg1>leaves</@rg1> China , it is ...",1,1,3,,1
4,5,leaves,success,""" If Google <@rg1>leaves</@rg1> China , it is ...",1,1,3,,1


In [10]:
find_len_max(data)

max length = 1102
French police <@rg1>responded</@rg1> to reports of car theft in a town near Paris late Tuesday and a shootout ensued with a group of alleged thieves . Most of them escaped but police captured one and he was later identified as a suspected ETA member , said the spokeswoman , who by custom is not identified . Spanish media reported that the shootout occurred in the town of Dammarie-les-Lys . The dead French policeman was wearing a bullet-proof vest but bullets struck fatally elsewhere on his body . He was reported to be in his 50s , and the father of four children . ETA has traditionally used France as its rearguard logistics and planning base to prepare attacks across the border in Spain , officials say . But in recent years as Spain has enlisted increased cooperation from France in cracking down on ETA hideouts , there have been various exchanges of gunfire between ETA suspects and French police , wounding some officers . Almost all of ETA 's fatal shootings and car b

## Causal-TimeBank

In [11]:
data = obj.read_causal_timebank()
total_samples += len(data)
print("samples: " + str(len(data)))

samples: 312


In [12]:
data.head()

Unnamed: 0,id,arg1,arg2,text,direction,label,source,ann_file,split
0,1,downturn,spending,"But in the past three months , stocks have plu...",0,1,4,ABC19980108.1830.0711.xml,
1,2,change,reposition,The Indonesian currency has lost twenty six pe...,0,1,4,ABC19980108.1830.0711.xml,
2,3,rains,landslides,Officials in California are warning residents ...,0,1,4,PRI19980213.2000.0313.xml,
3,4,get,rains,Forecasters say the picture will <@rg1>get</@r...,1,1,4,PRI19980213.2000.0313.xml,
4,5,dispute,invasion,"Bush , commenting on the two-week gulf crisis ...",1,1,4,AP900816-0139.xml,


In [13]:
find_len_max(data)

max length = 784
WASHINGTON _ Following are <@rg1>statements</@rg1> made Friday and Thursday by Lawrence Wechsler , a lawyer for the White House secretary , Betty Currie ; the White House ; White House spokesman Mike McCurry , and President Clinton in response to an article in The New York Times on Friday about her <@rg2>statements</@rg2> regarding a meeting with the president : Wechsler on Thursday " Without commenting on the allegations raised in this article , to the extent that there is any implication or suggestion that Mrs. Currie was aware of any legal or ethical impropriety by anyone , that implication or suggestion is entirely inaccurate . " I was pleased that Ms. Currie 's lawyers stated unambiguously this morning _ unambiguously _ that she 's not aware of any unethical conduct . 


## Crowdsourcing-StoryLines 1.2

In [14]:
data = obj.read_story_lines()
total_samples += len(data)
print("samples: " + str(len(data)))

samples: 1540


In [15]:
data.head()

Unnamed: 0,id,arg1,arg2,text,direction,label,source,ann_file,split
0,1,double murder,killing,Cumbria <@rg1>double murder</@rg1> : Son suspe...,1,1,5,,
1,2,sectioned,suicide attempt,"John Jenkin , 23 , had been <@rg1>sectioned</@...",1,1,5,,
2,3,charged,killing,Millom double murder : Man <@rg1>charged</@rg1...,1,1,5,,
3,4,double murder,charged,Millom <@rg1>double murder</@rg1> : Man <@rg2>...,0,1,5,,
4,5,charged,murder,A MAN has been <@rg1>charged</@rg1> with the <...,1,1,5,,


In [16]:
find_len_max(data)

max length = 416
Then came the anger : a Monday evening vigil <@rg1>marred</@rg1> by an unruly young mob <@rg2>thrashing</@rg2> its way through local businesses ; a second protest the next night ; and another on Wednesday night , after which , the police said , someone hit an officer in the face with a brick , another brick was thrown through the window of a police van , and there were 46 arrests — mostly for disorderly conduct .


## CaTeRS

In [17]:
data = obj.read_CaTeRS()
total_samples += len(data)
print("samples: " + str(len(data)))

samples: 203


In [18]:
i = 200
print(data.iloc[i]['text'])
print(data.iloc[i]['arg1'])
print(data.iloc[i]['arg2'])
print(data.iloc[i]['ann_file'])

Deb was visiting family for the holidays. They were at dinner when Deb's job came up. Everyone asked Deb about work, and Deb <@rg2>got</@rg2> nervous. The truth was that she had been <@rg1>unemployed</@rg1> for some time. Finally, she decided to just admit it to her family.
unemployed
got
batch_8.ann


In [19]:
data.head()

Unnamed: 0,id,arg1,arg2,text,direction,label,source,ann_file,split
0,1,lost,fit,Kay <@rg1>lost</@rg1> 50 pounds and her clothe...,0,1,6,part_12.ann,1
1,2,impressed,bought,Kay lost 50 pounds and her clothes no longer f...,0,1,6,part_12.ann,1
2,3,fit,decided,Kay lost 50 pounds and her clothes no longer <...,0,1,6,part_12.ann,1
3,4,skied,be,Charles was 10-years-old when his uncle took h...,0,1,6,part_12.ann,1
4,5,was,cleaning,Marion <@rg1>was</@rg1> about to move. She was...,0,1,6,part_12.ann,1


In [20]:
find_len_max(data)

max length = 348
There used to be a candy store down the block from where Joe lived. One day, Joe was walking down the street and saw that it was <@rg1>closed</@rg1>. He asked someone who was walking by why it was closed. Joe didn't go there very often, but it was still <@rg2>upsetting</@rg2>. Now that Joe has kids, he has to drive three miles just to get to one.


## BECAUSE v2.1

Since the raw text files for **PTB** and **NYT** require an LDC subscription, these files are not yet covered by our data reader. Once we have access to the raw files from these resources, we will write the proper data readers for them.

In [21]:
data = obj.read_because()
total_samples += len(data)
print("samples: " + str(len(data)))

samples: 636


In [22]:
data.head()

Unnamed: 0,id,arg1,arg2,text,direction,label,source,ann_file,split
0,1,that has arisen,the past few years,"And second, we should address the issue <@rg1>...",0,0,7,CHRG-111shrg61651.ann,
1,2,these banks are too big to fail,"they have lower funding costs, they are able t...",This is unfair competition. <S!G>Because</S!G>...,0,1,7,CHRG-111shrg61651.ann,
2,3,they make more money,the cycle,This is unfair competition. Because these bank...,0,0,7,CHRG-111shrg61651.ann,
3,4,too big,fail,This is unfair competition. Because these bank...,0,1,7,CHRG-111shrg61651.ann,
4,5,you look at the European situation today,it is much worse than what we have in this cou...,<S!G> If</S!G> <@rg1>you look at the European ...,0,0,7,CHRG-111shrg61651.ann,


In [23]:
find_len_max(data)

max length = 1616
3. <@rg1>Your failure to act to prevent the acceptance of or to pay for gifts of earrings from Mr.</@rg1> <@rg2>Your failure to act to prevent the acceptance of or to pay for gifts of earrings from Mr.</@rg2> <@rg1>Chang to individuals (your sister, an employee and a friend) in your home at Christmas on the</@rg1> <@rg2>Chang to individuals (your sister, an employee and a friend) in your home at Christmas on the</@rg2> <@rg1>mistaken belief that such items were of little value or were not gifts to you under the</@rg1> <@rg2>mistaken belief that such items were of little value or were not gifts to you under the</@rg2> <@rg1>circumstances</@rg1> <@rg2>circumstances</@rg2>, evidenced poor judgment, displayed a lack of due regard for Senate rules and <@rg1>resulted in a violation of the Senate Gifts Rule (35)</@rg1> <@rg2>resulted in</@rg2> a violation of the Senate Gifts Rule (35) and, <S!G>consequently</S!G>, <@rg2>a violation of your</@rg2> <@rg2>public disclosure obli
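
The BECAUSE output above differs from the other corpora in two ways: an argument can be split across several tagged fragments, and `<S!G>` tags appear to mark the causal connective. A minimal sketch of collecting every fragment for a given tag follows. The helper name `tagged_fragments` and the sample sentence are invented for illustration; this is not how `data_reader.py` is known to handle these tags.

```python
import re


def tagged_fragments(text, tag):
    """Hypothetical helper: return all fragments wrapped in <tag>...</tag>, in order."""
    pattern = r"<{0}>(.*?)</{0}>".format(re.escape(tag))
    return [frag.strip() for frag in re.findall(pattern, text, flags=re.DOTALL)]


# Invented sample using the same tag names as the output above.
sample = (
    "<S!G>Because</S!G> <@rg1>it rained</@rg1>, "
    "<@rg2>the game</@rg2> <@rg2>was cancelled</@rg2>."
)
print(tagged_fragments(sample, "S!G"))
print(tagged_fragments(sample, "@rg1"))
print(tagged_fragments(sample, "@rg2"))
```

Returning a list rather than a single string keeps discontinuous arguments intact, so the caller can decide whether to join them.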

## EventStoryLine v1.5

In [24]:
data = obj.read_story_lines_v15()
total_samples += len(data)
print("samples: " + str(len(data)))

samples: 2055


In [29]:
pos = data.loc[data.label == 1]
neg = data.loc[data.label == 0]
print("# causal: " + str(len(pos)))
print("# non-causal: " +str(len(neg)))

# causal: 104
# non-causal: 1951
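
The two counts above show a strong class imbalance. As a quick worked check, the following is plain arithmetic on the numbers printed above; the variable names are illustrative only.

```python
# Counts copied from the output above.
n_causal = 104
n_non_causal = 1951

total = n_causal + n_non_causal
causal_share = n_causal / total
print("total: " + str(total))                                  # total: 2055
print("causal share: {:.2%}".format(causal_share))             # causal share: 5.06%
print("non-causal per causal: {:.1f}".format(n_non_causal / n_causal))  # 18.8
```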


In [30]:
find_len_max(data)

max length = 593
One place Lindsay is finding this particularly difficult though is at the Betty Ford Centre , where the actress has just finished 30 days' worth of treatment and has decided that the renowned facilities there won't be enough to keep her around for her court - ordered 90 - day stay in rehab , so she's <@rg1>gone</@rg1> elsewhere .According to a report from TMZ published this week ( June 13 ) , the Mean Girls actress has <@rg2>had enough</@rg2> of the Palm Springs rehab centre and has decided to spend the rest of her court - ordered stay in a rehab centre in Malibu , named the Cliffside .


### Data set statistics

In [None]:
import os
import pandas as pd

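data_path = 'data/causal/gold_causal.csv'
if os.path.exists(data_path):
    data = pd.read_csv(data_path)
else:
    data = obj.read_all()
    # Cache the merged data only when it was freshly built, so an existing
    # file is not rewritten (and re-indexed) on every run.
    data.to_csv(data_path)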
train = data.loc[data.split == 0]
dev = data.loc[data.split == 1]
test = data.loc[data.split == 2]
pos = data.loc[data.label == 1]
neg = data.loc[data.label == 0]
print("Total samples: " + str(len(data)))
print("----------------------------")
print("# Causal: " + str(len(pos)))
print("# Non-Causal: " + str(len(neg)))
print("----------------------------")
print("# train: " + str(len(train)))
print("# dev: " + str(len(dev)))
print("# test: " + str(len(test)))
print("----------------------------")
# Eight sources are listed at the top of this notebook (ids 1 to 8).
for i in range(8):
    print("# source [" + str(i+1) + "]: " + str(len(data.loc[data.source == (i+1)])))
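
The per-source loop above depends on a hard-coded range of source ids. A cross-tabulation gives the same per-source counts without that range and also splits them by label. The sketch below runs on a small invented frame, since the merged data is not reproduced here; only the column names `source` and `label` come from the tables above.

```python
import pandas as pd

# Invented toy rows that only mimic the 'source' and 'label' columns.
toy = pd.DataFrame(
    {
        "source": [1, 1, 2, 2, 2, 8],
        "label": [1, 0, 1, 1, 0, 0],
    }
)

# Rows: source id; columns: label (0 = non-causal, 1 = causal); cells: sample counts.
counts = pd.crosstab(toy["source"], toy["label"])
print(counts)

# Row sums reproduce the per-source totals that the loop prints.
print(counts.sum(axis=1))
```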

In [None]:
# checking if all samples have labels
print("# unlabeled samples: " + str(len(data.loc[(data.label != 0) & (data.label != 1)])))