# 02 Data Sets Preparation

This notebook preprocesses plain text, extracts a pair of noun phrases (NPs) from each sentence, and prepares the resulting datasets for the next steps.

* **Input**: The plain text extracted from [Wikipedia Dump 20210601](https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Current_Mirrors) 
* **Approach**: Apply the [**spaCy-entity-linker**](https://github.com/egerber/spaCy-entity-linker) to recognize WikiData entities in the original sentences.
* **Output**: Each sentence is tagged with a pair of NPs and stored in a dataframe.
                                                                     

In [2]:
### import and install necessary packages

import os
import re
import random
import glob
import sys
import pickle

import pandas as pd
import numpy as np

import time

# run the following two lines if spaCy and its English model are not installed yet
# !{sys.executable} -m pip install spacy
# !{sys.executable} -m spacy download en_core_web_sm

import spacy
nlp = spacy.load('en_core_web_sm')

# run the following line if spacy-entity-linker is not installed yet
# !{sys.executable} -m pip install spacy-entity-linker  
nlp.add_pipe("entityLinker", last=True)   # to make use of the entityLinker


<spacy_entity_linker.EntityLinker.EntityLinker at 0x122b0c438>

Please download the source dataset by running **download_data.sh** from this folder:
> sh download_data.sh

Note: If you have already run it, skip this step.

## 02-A. Preprocess big file
The source file stores all WikiPages together. This step splits it into separate texts, one per WikiPage ID.

In [3]:

### purpose: split one big file into a separate text for each wikipage
### input: the stripped lines of one big file
### output: the dataframe(columns= ['Directory','file_ID', 'file_title', 'file_text'])
### note: relies on the global variable dir_str, which is set in section 02-C
def parse_segmt_file(content):

    # collect one row per wikipage and build the dataframe once at the end
    # (DataFrame.append was removed in pandas 2.0)
    rows = []

    # predefined variables
    file_ID = ''
    file_title = ''
    file_text = ''
    # markers for the first and last line of the current wikipage
    start = -1
    end = -1

    for inx_i, tx in enumerate(content):
        # find the beginning of a separate file
        res1 = re.match('<doc id="(.*)" .* title="(.*)">', tx)
        if res1 is not None:
            start = inx_i
            file_ID = res1[1]
            file_title = res1[2]

        # find the end of a separate file
        if re.match('</doc>', tx) is not None:
            end = inx_i
        if start != -1 and end != -1:
            file_text = ' '.join(content[start+1:end])
            # reset the markers
            start = -1
            end = -1

            # keep only the pages with more than 500 characters of text
            if len(file_text) > 500:
                rows.append({'Directory': dir_str, 'file_ID': file_ID, 'file_title': file_title, 'file_text': file_text})

    return pd.DataFrame(rows, columns=['Directory','file_ID', 'file_title', 'file_text'])
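
To make the expected input concrete, here is a minimal, self-contained sketch of the splitting step on a tiny in-memory sample. It assumes the text files use the `<doc id="..." ... title="...">` ... `</doc>` layout that the regular expression above targets. The sample lines, the URLs, and the helper name `split_docs` are hypothetical, and the 500-character filter is left out so that the short sample is not discarded.

```python
import re

# Hypothetical sample lines, already stripped as in the main loop of 02-C.
sample = [
    '<doc id="1" url="https://example.org/1" title="First page">',
    'First page',
    'Some text about the first page.',
    '</doc>',
    '<doc id="2" url="https://example.org/2" title="Second page">',
    'Second page',
    'Other text.',
    '</doc>',
]

# Same pattern as in parse_segmt_file: group 1 is the page ID, group 2 the title.
DOC_START = re.compile('<doc id="(.*)" .* title="(.*)">')


def split_docs(lines):
    """Return one dict per <doc> ... </doc> block (hypothetical helper)."""
    docs, current, body = [], None, []
    for line in lines:
        m = DOC_START.match(line)
        if m:
            # a new page starts: remember its metadata and reset the body
            current = {'file_ID': m.group(1), 'file_title': m.group(2)}
            body = []
        elif line.startswith('</doc>') and current is not None:
            # the page ends: join its lines into one text
            current['file_text'] = ' '.join(body)
            docs.append(current)
            current = None
        elif current is not None:
            body.append(line)
    return docs


docs = split_docs(sample)
for d in docs:
    print(d['file_ID'], '|', d['file_title'], '|', d['file_text'])
```

Each resulting dict corresponds to one row of the dataframe returned by `parse_segmt_file`, minus the `Directory` column.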


## 02-B. Recognize NP pairs for each sentence 
Each sentence may contain many noun phrases (NPs). We want to find the ones that exist in the **WikiData knowledge base** and tag them in the sentence.

In [4]:

### purpose: find the index of a tagged NP in a sentence
### input: (a tagged NP, a sentence tokenized into a list)
### output: a tuple with the start and end indexes of this NP in the token list, or None if it is not found
def find_word_inx(ele, sents_ls):
    # None means "not found"; (0, 0) is a valid position, so it cannot serve as the sentinel
    word_inx = None

    # split the element into tokens once
    ele_tokens = ele.split(' ')
    length = len(ele_tokens)

    # the element is a single token
    if length == 1:
        if ele in sents_ls:
            a = sents_ls.index(ele)
            word_inx = (a, a)

    # the element is an NP with two or more tokens
    elif length > 1:
        a_ls = []
        for eleHere in ele_tokens:
            if eleHere in sents_ls:
                a_ls.append(sents_ls.index(eleHere))
                word_inx = (a_ls[0], a_ls[-1])

    return word_inx


    
### purpose: helper used inside the function <twoEntites_sentence_file2>
### input: the tokens of the NP to search for; the tokens of a sentence
### output: a list of (start, end) indexes, both inclusive, for every occurrence of the NP in the sentence tokens
def find_sub_list(sl,l):
    results=[]
    sll=len(sl)
    for ind in (i for i,e in enumerate(l) if e==sl[0]):
        if l[ind:ind+sll]==sl:
            results.append((ind,ind+sll-1))

    return results  


### purpose: recognize the NP pair in each sentence
### input: (a dataframe with one row per wikipage, the resulting dataframe so far, a list of the seed pairs)
### output: a dataframe to store the sentences and other info. DataFrame(columns=['pairs', 'ele1_word_idx', 'ele2_word_idx', 'sentence', 'tokens', 'file_ID', 'file_title', 'SeedTF'])
def twoEntites_sentence_file2(df_file, df_enwiki_causality, causality_pairs_list):

    # collect the new rows in a list and concatenate once at the end
    # (DataFrame.append was removed in pandas 2.0)
    rows = []

    for inx in range(len(df_file)):
        file = df_file.iloc[inx]['file_text']
        doc = nlp(file.lower())

        # iterate over the sentences and collect their linked entities
        for sent in doc.sents:
            candid = []
            for entities in sent._.linkedEntities:
                # (1). keep an entity only if its text is the same as its label in WikiData
                if entities.get_label() == entities.get_span().text:
                    candid.append(entities.get_label())

            tokens = [s.text for s in sent]
            ele1_2_word_idx = []
            SeedTF = False

            two_entites = []
            # (2). if the sentence includes a seed pair, use the seed pair
            for pairs in causality_pairs_list:
                if len(set(pairs).intersection(set(candid))) == 2:
                    two_entites = pairs
                    SeedTF = True

            # (3). otherwise keep only the two longest candidate strings
            if not SeedTF:
                # remove duplicates
                candid = list(set(candid))
                candid.sort(key = len)
                two_entites = candid[-2:]

            # (4). find the token indexes of the two entities
            if len(two_entites) == 2:

                for ele in two_entites:
                    res = find_sub_list(ele.split(' '), tokens)
                    if res != []:
                        ele1_2_word_idx.append(res[0])
                    else:
                        ele1_2_word_idx.append(None)
                rows.append({'pairs': two_entites, 'ele1_word_idx': ele1_2_word_idx[0], 'ele2_word_idx': ele1_2_word_idx[1],
                             'sentence': sent.text, 'tokens': tokens,
                             'file_ID': df_file.iloc[inx]['file_ID'], 'file_title': df_file.iloc[inx]['file_title'],
                             'SeedTF': SeedTF})

    # nothing new was found: return the input dataframe unchanged
    if not rows:
        return df_enwiki_causality
    return pd.concat([df_enwiki_causality, pd.DataFrame(rows)], ignore_index=True)
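
The selection rules (2) to (4) can be tried without spaCy. The sketch below is a simplified illustration under stated assumptions, not the notebook's exact code: the helper names `select_pair` and `first_span`, the example tokens, and the seed pair are hypothetical, and ties between candidates of equal length are broken alphabetically, which the original function does not guarantee.

```python
def select_pair(candidates, seed_pairs):
    """Return (pair, is_seed): a seed pair if both members occur among the
    candidates, otherwise the two longest candidate strings."""
    present = set(candidates)
    chosen = None
    for pair in seed_pairs:
        # as in the original loop, a later matching seed pair overrides an earlier one
        if len(set(pair) & present) == 2:
            chosen = list(pair)
    if chosen is not None:
        return chosen, True
    # assumption: sort by length, then alphabetically, to make ties deterministic
    ordered = sorted(present, key=lambda s: (len(s), s))
    return ordered[-2:], False


def first_span(phrase, tokens):
    """Return the inclusive (start, end) token span of the first occurrence
    of phrase in tokens, or None if the phrase does not occur."""
    words = phrase.split(' ')
    n = len(words)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == words:
            return (i, i + n - 1)
    return None


# Hypothetical sentence and candidate entities.
tokens = ['the', 'double', 'bass', 'has', 'a', 'similar',
          'structure', 'to', 'the', 'cello']
candidates = ['double bass', 'structure', 'cello']
seed_pairs = [('smoking', 'lung cancer')]

pair, is_seed = select_pair(candidates, seed_pairs)
spans = [first_span(p, tokens) for p in pair]
print(pair, is_seed, spans)
```

Because no seed pair is present in this sentence, the two longest candidates are kept and `is_seed` is `False`. A phrase that cannot be located in the token list yields `None`, which mirrors the `None` entries that the function above stores in `ele1_word_idx` and `ele2_word_idx`.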


## 02-C. Execute these functions and get the tagged datasets


In [6]:
###!-------------------- main function --------------------!###


path_here = os.getcwd()
# wikipedia: part of the whole dataset (please see details in <download_data.sh>)
enwiki_data = path_here + '/data/enwiki_20210601/text/'
# get the seed pairs
with open(path_here+'/res/causality_pairs_list.pickle', 'rb') as f:
    causality_pairs_list = pickle.load(f)


# the dataframe to store the sentences info with CORRECT causal pairs            
df_enwiki_causality_AA = pd.DataFrame()

# To proceed with each segmentation file
dir_str = 'AA'
for filename in glob.iglob(enwiki_data+dir_str+'/*',recursive = True):
    round_start = time.time()
    with open(filename, 'r', encoding='utf-8') as f:
        content = f.readlines()
        content = [x.strip() for x in content] 
        print('-----------SegmentationFile: '+filename+'------------')
        
        # process segmentation into separate files with info
        df_file = parse_segmt_file(content)
        print('segmentate into separate files')
        
        # extract sentences where causal pairs appear
        df_enwiki_causality_AA = twoEntites_sentence_file2(df_file, df_enwiki_causality_AA, causality_pairs_list)
        print('------------Finished this file----------------')
    
    round_end = time.time()
    print('This round use '+ str((round_end-round_start) / 60) +'mins')

    
# save to the disk 
df_enwiki_causality_AA.to_csv(path_here + '/res/df_enwiki_causality_AA.csv')
df_enwiki_causality_AA.to_pickle(path_here + '/res/df_enwiki_causality_AA.pkl')



-----------SegmentationFile: /Users/zoe/Desktop/datasets/enwiki-20210601/text/AA/wiki_73------------
segmentate into separate files
------------Finished this file----------------
This round use 1.6008496801058452mins
-----------SegmentationFile: /Users/zoe/Desktop/datasets/enwiki-20210601/text/AA/wiki_87------------
segmentate into separate files
------------Finished this file----------------
This round use 1.2242977142333984mins
-----------SegmentationFile: /Users/zoe/Desktop/datasets/enwiki-20210601/text/AA/wiki_80------------
segmentate into separate files
------------Finished this file----------------
This round use 1.274128246307373mins
-----------SegmentationFile: /Users/zoe/Desktop/datasets/enwiki-20210601/text/AA/wiki_74------------
segmentate into separate files
------------Finished this file----------------
This round use 1.2159496665000915mins
-----------SegmentationFile: /Users/zoe/Desktop/datasets/enwiki-20210601/text/AA/wiki_89------------
segmentate into separate files
-

If you would like to **skip this time-consuming step**, download the processed data **df_enwiki_causality_AA** from this [link](https://drive.google.com/file/d/1Oaqg1mnnGTrk_OKnbULzd1c6BPDdDy3f/view?usp=sharing). Note that only the *pickle* format is available, not the *csv* format.

After downloading, unzip the file and put it in the folder *res*.


In [5]:
path_here = os.getcwd()
df_enwiki_causality_AA = pd.read_pickle(path_here + '/res/df_enwiki_causality_AA.pkl')

In [6]:
df_enwiki_causality_AA

| | SeedTF | ele1_word_idx | ele2_word_idx | file_ID | file_title | pairs | sentence | tokens |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | (0, 1) | (32, 33) | 8816 | Double bass | [double bass, string instrument] | double bass the double bass, also known simpl... | [double, bass, , the, double, bass, ,, also, ... |
| 1 | 0.0 | (6, 6) | (1, 2) | 8816 | Double bass | [structure, double bass] | the double bass has a similar structure to the... | [the, double, bass, has, a, similar, structure... |
| 2 | 0.0 | (17, 18) | (29, 30) | 8816 | Double bass | [concert band, chamber music] | the bass is a standard member of the orchestra... | [the, bass, is, a, standard, member, of, the, ... |
| 3 | 0.0 | (36, 37) | (29, 30) | 8816 | Double bass | [folk music, country music] | the bass is used in a range of other genres, s... | [the, bass, is, used, in, a, range, of, other,... |
| 4 | 0.0 | (11, 11) | (4, 5) | 8816 | Double bass | [octave, transposing instrument] | the bass is a transposing instrument and is ty... | [the, bass, is, a, transposing, instrument, an... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 392792 | 0.0 | (11, 11) | (4, 4) | 4865 | Roman Breviary | [history, breviary] | lay use of the breviary has varied throughout ... | [lay, use, of, the, breviary, has, varied, thr... |
| 392793 | 0.0 | (17, 17) | (8, 8) | 4865 | Roman Breviary | [extent, breviary] | in some periods laymen did not use the breviar... | [in, some, periods, laymen, did, not, use, the... |
| 392794 | 0.0 | (6, 6) | (56, 56) | 4865 | Roman Breviary | [recitation, translation] | the late medieval period saw the recitation of... | [the, late, medieval, period, saw, the, recita... |
| 392795 | 0.0 | (14, 14) | (4, 4) | 4865 | Roman Breviary | [website, publication] | in 2013, the publication has resumed printing ... | [in, 2013, ,, the, publication, has, resumed, ... |
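
The `ele1_word_idx` and `ele2_word_idx` columns hold `(start, end)` token positions with both ends inclusive, as returned by `find_sub_list`. The short sketch below shows one way to map such a span back to its phrase, using the visible part of the tokens from row 1 of the table above. The helper name `phrase_at` is hypothetical. The `ast.literal_eval` step is only an assumption for the case where the spans arrive as strings such as `'(1, 2)'`, which can happen after a round trip through a CSV file.

```python
import ast


def phrase_at(tokens, span):
    """Join the tokens covered by an inclusive (start, end) span.
    A missing span (None) is passed through unchanged."""
    if span is None:
        return None
    start, end = span
    return ' '.join(tokens[start:end + 1])


# Visible part of the tokens of row 1 in the table above.
tokens = ['the', 'double', 'bass', 'has', 'a', 'similar', 'structure']

ele1 = phrase_at(tokens, (6, 6))
ele2 = phrase_at(tokens, (1, 2))
print(ele1, '|', ele2)

# Assumption: a span stored as text can be turned back into a tuple of ints.
span_from_text = ast.literal_eval('(1, 2)')
print(phrase_at(tokens, span_from_text))
```

The two recovered phrases match the `pairs` column of that row, which is a quick sanity check that a span and its sentence belong together.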
