# 01 Seed Pairs Preparation

This notebook extracts the cause-effect pairs from 'SemEval2010_task8' so that they can serve as seed pairs in the following steps.

* **Input**: The tagged text from [this link of SemEval2010_task8](https://docs.google.com/leaf?id=0B_jQiLugGTAkMDQ5ZjZiMTUtMzQ1Yy00YWNmLWJlZDYtOWY1ZDMwY2U4YjFk&sort=name&layout=list&num=50)
* **Approach**: Linguistic patterns (regular expressions) that extract the tagged cause-effect pairs
* **Output**: The causal pairs, stored in a DataFrame
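
To make the pattern-based approach concrete, here is a minimal sketch applied to a single record. The record is made up for illustration (it is not quoted from the dataset) and only mimics the layout the extraction code relies on: an ID, a tab, and a sentence with `<e1>`/`<e2>` tags, followed by a relation line. The variable names are assumptions of this sketch.

```python
import re

# Illustrative record in the assumed SemEval2010_task8 layout (made up, not from the dataset)
sentence_line = '42\t"The <e1>storm</e1> caused severe <e2>flooding</e2> in the valley."'
relation_line = 'Cause-Effect(e1,e2)'

# The relation line says which tagged entity is the cause and which is the effect
relation = re.match(r'Cause-Effect\((e\d),(e\d)\)', relation_line)

# Pull the text inside each entity tag
entities = {
    'e1': re.search(r'<e1>(.*?)</e1>', sentence_line).group(1),
    'e2': re.search(r'<e2>(.*?)</e2>', sentence_line).group(1),
}

sent_id = re.match(r'(\d+)\t', sentence_line).group(1)
cause = entities[relation.group(1)]   # first argument of Cause-Effect(...)
effect = entities[relation.group(2)]  # second argument of Cause-Effect(...)

print(sent_id, cause, effect)  # 42 storm flooding
```

For a record labelled `Cause-Effect(e2,e1)` the same lookup would return the `<e2>` text as the cause and the `<e1>` text as the effect.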

In [2]:
### import necessary packages

import os
import pandas as pd
import numpy as np
import re
import random


Download the source data for the seed pairs by running **download_data.sh** from this folder:
> sh download_data.sh


In [4]:
### define the path of dataset

path_here = os.getcwd()
path_semEval2010 = path_here +'/data/SemEval2010_task8_all_data/'
# training pairs
path_semEval2010_train = path_semEval2010 +'SemEval2010_task8_training/TRAIN_FILE.TXT'
# test pairs
path_semEval2010_test = path_semEval2010 +'SemEval2010_task8_testing_keys/TEST_FILE_FULL.TXT'


## 01-A. Get the causal positive pairs

In [8]:

### purpose: extract causal pairs from the tagged text
### input: the path of the tagged text file
### output: a DataFrame of causal pairs with columns ['SentID', 'Cause', 'Effect', 'Label']

def extract_causalpairs(path_text):
    
    # read the train file
    with open(path_text, 'r') as f:
        content = f.readlines()
        content = [x.strip() for x in content] 


    # the patterns to extract pairs from text (raw strings avoid invalid-escape warnings)
    pattern_causal = r'Cause-Effect\((e.),(e.)\)'
    pattern_e1 = r'.*<e1>(.*)</e1>.*'
    pattern_e2 = r'.*<e2>(.*)</e2>.*'
    pattern_sentID = r'(\d+)\t.*'

    LABEL = 1

    # collect the pairs in a list; DataFrame.append was removed in pandas 2.0
    rows = []

    for inx_l, lines in enumerate(content):

        res = re.match(pattern_causal, lines)
        if res is not None:
            # the tagged sentence sits on the line before the relation label
            sentence = content[inx_l-1]

            # sentence ID
            sent_id = re.match(pattern_sentID, sentence)[1]

            # cause part + effect part (e1 or e2)
            if res[1] == 'e1':
                res_cause = re.match(pattern_e1, sentence)[1]
                res_effect = re.match(pattern_e2, sentence)[1]
            else:  # res[1] == 'e2'
                res_cause = re.match(pattern_e2, sentence)[1]
                res_effect = re.match(pattern_e1, sentence)[1]

            rows.append({'SentID': sent_id, 'Cause': res_cause, 'Effect': res_effect, 'Label': LABEL})

    # the dataframe to store the pairs
    df_causalpairs = pd.DataFrame(rows, columns=['SentID', 'Cause','Effect','Label'])

    # drop duplicate (Cause, Effect) pairs, keeping the last occurrence
    df_causalpairs = df_causalpairs.drop_duplicates(subset=['Cause', 'Effect'], keep='last', ignore_index=True)

    
    return df_causalpairs
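
The de-duplication step at the end of the function can be illustrated on a tiny hand-made frame. This is only a sketch: the rows are invented for the example and do not come from the dataset.

```python
import pandas as pd

# Invented rows: sentences 1 and 3 carry the same (Cause, Effect) pair
pairs = pd.DataFrame([
    {'SentID': '1', 'Cause': 'storm', 'Effect': 'flooding', 'Label': 1},
    {'SentID': '2', 'Cause': 'smoking', 'Effect': 'cancer', 'Label': 1},
    {'SentID': '3', 'Cause': 'storm', 'Effect': 'flooding', 'Label': 1},
])

# keep='last' retains the final occurrence of each pair;
# ignore_index=True renumbers the remaining rows from 0
deduped = pairs.drop_duplicates(subset=['Cause', 'Effect'], keep='last', ignore_index=True)

print(deduped['SentID'].tolist())  # ['2', '3']
```

Because only `Cause` and `Effect` are compared, the `SentID` kept for a repeated pair is that of its last occurrence in the file.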
    
       

In [10]:
# get the causal pairs from the train and test files of SemEval2010
df_causalpairs_train = extract_causalpairs(path_semEval2010_train)
df_causalpairs_test = extract_causalpairs(path_semEval2010_test)
# save to CSV files in the 'res' folder (created if it does not exist)
os.makedirs(path_here + '/res', exist_ok=True)
df_causalpairs_train.to_csv(path_here + '/res/df_causalpairs_train.csv')
df_causalpairs_test.to_csv(path_here + '/res/df_causalpairs_test.csv')
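
Since `to_csv` writes the row index as an unnamed first column by default, later steps may want to read the files back with `index_col=0`. The sketch below shows the round trip on an invented one-row frame in a temporary directory; the file name `df_causalpairs_demo.csv` is a placeholder chosen for this example, and forcing `SentID` to `str` is an assumption about how later steps use the ID.

```python
import os
import tempfile

import pandas as pd

# Invented one-row frame with the same columns as the saved causal pairs
df = pd.DataFrame({'SentID': ['7'], 'Cause': ['storm'], 'Effect': ['flooding'], 'Label': [1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'df_causalpairs_demo.csv')  # placeholder file name
    df.to_csv(csv_path)  # the index is written as an unnamed first column
    # index_col=0 restores the index; dtype keeps SentID as text rather than an integer
    reloaded = pd.read_csv(csv_path, index_col=0, dtype={'SentID': str})

print(reloaded.columns.tolist())  # ['SentID', 'Cause', 'Effect', 'Label']
```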
