# Identification of temporal knowledge

This notebook processes text from the biographies of music personalities and extracts information about historical meetups.

Prerequisites:
    Text organised in sentences

The implementation of the algorithm is based on the work presented by Zhong et al.

    @inproceedings{Zhong_Sun_Cambria_2017, address={Vancouver, Canada}, 
    title={Time Expression Analysis and Recognition Using Syntactic Token Types and General Heuristic Rules}, 
    url={http://aclweb.org/anthology/P17-1039}, DOI={10.18653/v1/P17-1039}, booktitle={Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, publisher={Association for Computational Linguistics}, 
    author={Zhong, Xiaoshi and Sun, Aixin and Cambria, Erik}, year={2017}, pages={420–429}, language={en} }

The authors use heuristic rules to identify time tokens, and POS tags to filter out ambiguous time tokens.
Java implementation: https://github.com/zhongxiaoshi/syntime

The approach, following the analysis of Zhong et al., has three steps:
    1) identify time tokens, 2) identify time segments, 3) identify time expressions
    
Our implementation:

    - Identify time tokens, we use three types of tokens: TIME, MODIFIER, NUMERAL.
        Each type has more specific types:
            MODIFIER = ["PREFIX","SUFFIX","LINKAGE","COMMA","PARENTHESIS","INARTICLE"]
            NUMERAL = ["BASIC","DIGIT","ORDINAL"]
            TIME = ["DECADE", "YEAR", "SEASON", "MONTH", "WEEK", "DATE", "TIME", "DAY_TIME", "TIMELINE", "HOLIDAY", "PERIOD", "DURATION", "TIME_UNIT", "TIME_ZONE", "ERA", "MID", "DAY", "HALFDAY"]
        Added PARENTHESIS and improved the regular expressions
    - Initialize regular expressions:
        Read regular expressions stored in:
        - timeRegex.txt
    - Build additional regular expressions using the base expressions in timeRegex.txt
    - Compile regex objects just once for better performance
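
The initialisation steps above can be sketched as follows. The base patterns here are simplified placeholders and not the contents of timeRegex.txt; only the composition scheme (a range pattern built from `YEAR_REGEX_1`) follows the notebook, and the variable names are illustrative.

```python
import re

# Hypothetical, simplified base patterns; the real ones are read from timeRegex.txt.
base_patterns = {
    "YEAR_REGEX_1": r"[12][0-9]{3}",
    "MONTH_REGEX": r"january|february|march",
}

# Derived pattern, composed from a base pattern (a year followed by "-" and a
# second year or a two-digit suffix).
derived_patterns = {
    "YEAR_YEAR_REGEX": base_patterns["YEAR_REGEX_1"]
    + "-(" + base_patterns["YEAR_REGEX_1"] + "|[0-9]{2})",
}

all_patterns = {**base_patterns, **derived_patterns}

# Compile every pattern once, up front, so the per-token loop only calls .match().
compiled = {
    name: re.compile(pattern, flags=re.IGNORECASE)
    for name, pattern in all_patterns.items()
}

print(bool(compiled["YEAR_YEAR_REGEX"].match("1888-89")))
print(bool(compiled["MONTH_REGEX"].match("March")))
```

Compiling up front matters because every token of every sentence is tested against every pattern.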
    
For each sentence in the biography:

    a) Identify token types. Function "def get_time_tokens(text):"
        - Tokenize 
        - Obtain POS tags 
        - Use regular expressions to identify type of token: time, modifier, numeral
        - Filter out ambiguous words by matching POS tags and type of token 
        - Output: A list with one entry per word: the word, its POS tag, the matched regex name, the token type and the word index

    b) Identify time segments. Function "def get_time_segments(time_token_list):"
        - A time segment has one time token and zero or more modifiers or numerals
        - Search for a time token; once found, search the surroundings:
          - Search tokens on the left 
            If PREFIX or NUMERAL or INARTICLE, continue searching
          - Search tokens on the right 
            If SUFFIX or NUMERAL, continue searching
          - For right and left search, stop at COMMA or LINKAGE, unless the token beyond it is a time token or numeral (see the function for the exact rules)
          - If time segments overlap, then apply heuristic rules and merge segments
        - Output: A list of time segments; each time segment holds the indexes of its words in the sentence

    c) Identify time expressions (our implementation). Function "def classify_type_time_expression(time_expression_list):"
        - Three types of time expressions: time point, time reference and time range
        - Apply heuristic rules to classify the type of time expression
        - Output: A dataframe with the sentence, the type of time expression, the time expression and indexes
        - Store each biography as a CSV file in extractedTimeExpressions/
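
The segment search in step b) can be illustrated with a minimal sketch. It assumes the tokens already carry hand-assigned types, and it leaves out the extra COMMA and LINKAGE rules and the merging of overlapping segments; `expand_segment` is a hypothetical helper, not a function of this notebook.

```python
# Each token: (word, specific_type, general_type); types are assigned by hand here.
tokens = [
    ("He", "", ""),
    ("left", "", ""),
    ("about", "PREFIX", "MODIFIER"),
    ("three", "BASIC", "NUMERAL"),
    ("years", "TIME_UNIT", "TIME"),
    ("ago", "SUFFIX", "MODIFIER"),
    (",", "COMMA", "MODIFIER"),
    ("then", "", ""),
    ("returned", "", ""),
]

def expand_segment(tokens, time_index):
    """Return the sorted word indexes of the segment around one time token."""
    segment = [time_index]
    # Search left: keep prefixes, numerals and INARTICLE tokens.
    for i in reversed(range(time_index)):
        specific, general = tokens[i][1], tokens[i][2]
        if specific in ("PREFIX", "INARTICLE") or general == "NUMERAL":
            segment.append(i)
        else:
            break  # COMMA, LINKAGE or an untyped word ends the search
    # Search right: keep suffixes and numerals.
    for i in range(time_index + 1, len(tokens)):
        specific, general = tokens[i][1], tokens[i][2]
        if specific == "SUFFIX" or general == "NUMERAL":
            segment.append(i)
        else:
            break
    return sorted(segment)

segments = [expand_segment(tokens, i) for i, t in enumerate(tokens) if t[2] == "TIME"]
words = [" ".join(tokens[i][0] for i in seg) for seg in segments]
print(segments)
print(words)
```

Here the single time token "years" grows into the segment "about three years ago", and the comma ends the search on the right.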

Directories information:

    - indexedSentences/ : collection of biographies in CSV format. Each row of the file represents a sentence and has a section name, a paragraph index and a sentence index
    - extractedTimeExpressions/ : collection of annotated time expressions grouped by biography
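
As a rough illustration of the input layout, the sketch below parses one row shaped like a file in indexedSentences/. The column names are the ones the processing loop reads (`wikiId`, `section`, `paragraphIndex`, `sentenceIndex`, `sentences`); the row values, including the section name, are made up, and the standard `csv` module stands in for pandas to keep the example self-contained.

```python
import csv
import io

# Invented sample standing in for a file under indexedSentences/.
sample_csv = io.StringIO(
    "wikiId,section,paragraphIndex,sentenceIndex,sentences\n"
    '10085,Biography,0,0,"On April 26, 1941, he was involved in a serious traffic accident."\n'
)

rows = list(csv.DictReader(sample_csv))
for row in rows:
    # Each row is one sentence plus its position in the biography.
    print(row["section"], row["paragraphIndex"], row["sentenceIndex"], row["sentences"])
```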

In [1]:
# For nltk time entities
# time entity
import nltk.tokenize as nt
import nltk
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from nltk import Tree
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

import os
import pandas as pd
import re
# if not installed, run the following lines
# nltk.download('averaged_perceptron_tagger')
# nltk.download('maxent_ne_chunker')
# nltk.download('words')
# nltk.download('punkt')

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
# reading every CSV with indexed sentences
# return a list object of files in the given folder
files_list = [f for f in os.listdir('indexedSentences') if not f.startswith('.')]
# parse to dataframe
df_files = pd.DataFrame(files_list, columns=['file_name'])
# df_files = df_files.query("file_name=='10002116.csv'")
# df_files = df_files.query("file_name=='10085.csv'")
# df_files = df_files.query("file_name=='1000228.csv'")
# df_files.to_csv('totalBiographiesEntities.csv',index=False)

df_files.info()
df_files.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 76 to 76
Data columns (total 1 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   file_name  1 non-null      object
dtypes: object(1)
memory usage: 16.0+ bytes


Unnamed: 0,file_name
76,10085.csv


In [3]:
# extract only the ones that do not exist in folder
files_list = [f for f in os.listdir('extractedEntities') if not f.startswith('.')]
# parse to dataframe
df_query = pd.DataFrame(files_list, columns=['file_name'])
df_result = df_files[~df_files['file_name'].isin(df_query['file_name'])]
df_files = df_result
df_files.info()
df_files.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 0 entries
Data columns (total 1 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   file_name  0 non-null      object
dtypes: object(1)
memory usage: 0.0+ bytes


Unnamed: 0,file_name


# HEURISTIC RULES

In [2]:
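def read_text_file(filename,folder):
    # folder and filename are concatenated as-is, so folder must end with a path separator (or be empty)
    with open(folder+filename, "r") as fileObject:
        data = fileObject.readlines()
    return data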

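def parse_text_to_dict(textFile):
    regexDictionary = {}
    for line in textFile:
        # skip empty and single-character lines
        if len(line.strip())>1:
            # each line holds a name and a pattern separated by a tab
            item = line.strip().split("\t")
            # drop the first and last character of the name (its delimiters)
            regexDictionary[item[0][1:-1]] = item[1]
    return regexDictionary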

def get_patterns(regex_dict):
    new_patterns = {}
    # POS: CD; Type: DATE
    new_patterns["YEAR_MID_REGEX"] = "(" + regex_dict["MID_REGEX"] + ")(" + regex_dict["YEAR_REGEX_1"] + "|" + regex_dict["YEAR_REGEX_2"] + ")"
    new_patterns["YEAR_YEAR_REGEX"] = regex_dict["YEAR_REGEX_1"] + "-(" + regex_dict["YEAR_REGEX_1"] + "|[0-9]{2})"
    # print(new_patterns)
    
    # POS: NNP; Type: DATE
    new_patterns["MONTH_MID_REGEX"] = "(" + regex_dict["MID_REGEX"] + ")(" + regex_dict["MONTH_REGEX"] + "|" + regex_dict["MONTH_ABBR_REGEX"] + ")"
    new_patterns["MONTH_MONTH_REGEX"] = "(" + regex_dict["MONTH_REGEX"] + "|" + regex_dict["MONTH_ABBR_REGEX"] + ")-(" + regex_dict["MONTH_REGEX"] + "|" + regex_dict["MONTH_ABBR_REGEX"] + ")"
    new_patterns["YEAR_MONTH_REGEX_1"] = regex_dict["YEAR_REGEX_1"] + "-(" + regex_dict["MONTH_REGEX"] + "|" + regex_dict["MONTH_ABBR_REGEX"] + ")"
    new_patterns["YEAR_MONTH_REGEX_2"] = "(" + regex_dict["MONTH_REGEX"] + "|" + regex_dict["MONTH_ABBR_REGEX"] + ")-" + regex_dict["YEAR_REGEX_1"]
    
    # POS: NNP; Type: DATE
    new_patterns["WEEK_WEEK_REGEX"] = "(" + regex_dict["WEEK_REGEX"] + "|" + regex_dict["WEEK_ABBR_REGEX"] + ")-(" + regex_dict["WEEK_REGEX"] + "|" + regex_dict["WEEK_ABBR_REGEX"] + ")"

    # POS: CD
    new_patterns["BASIC_NUMBER_NUMBER_REGEX"] = "(" + regex_dict["BASIC_NUMBER_REGEX_1"] + ")-(" + regex_dict["BASIC_NUMBER_REGEX_1"] + ")"

    # POS: CD
    new_patterns["DIGIT_REGEX_2"] = regex_dict["DIGIT_REGEX_1"] + "[/\\.]" + regex_dict["DIGIT_REGEX_1"]
    new_patterns["DIGIT_DIGIT_REGEX"] = regex_dict["DIGIT_REGEX_1"] + "-" + regex_dict["DIGIT_REGEX_1"]

    # POS: JJ, CD
    new_patterns["ORDINAL_ORDINAL_REGEX"] = "(" + regex_dict["ORDINAL_REGEX_1"] + ")-(" + regex_dict["ORDINAL_REGEX_1"] + ")"

    # Type: TIME
    new_patterns["TIME_TIME_REGEX"] = "(" + regex_dict["TIME_REGEX_1"] + "|" + regex_dict["TIME_REGEX_2"] + ")-(" + regex_dict["TIME_REGEX_1"] + "|" + regex_dict["TIME_REGEX_2"] + ")"
    new_patterns["ERA_YEAR_REGEX"] = regex_dict["DIGIT_REGEX_1"] + "(" + regex_dict["ERA_REGEX"] + ")"

    # POS: NN, VBD; Type: TIME
    new_patterns["HALFDAY_REGEX_2"] = "(" + regex_dict["DIGIT_REGEX_1"] + "|" + new_patterns["DIGIT_REGEX_2"] + "|" + regex_dict["TIME_REGEX_1"] + "|" + regex_dict["TIME_REGEX_2"] + ")(" + regex_dict["HALFDAY_REGEX_1"] + ")"
    new_patterns["HALFDAY_HALFDAY_REGEX"] = new_patterns["HALFDAY_REGEX_2"] + "-" + new_patterns["HALFDAY_REGEX_2"]

    # POS: NNS; Type: DATE
    new_patterns["DECADE_MID_REGEX"] = "(" + regex_dict["MID_REGEX"] + ")(" + regex_dict["DECADE_REGEX"] + ")"

    # POS: NN, JJ; Type: DURATION
    new_patterns["DURATION_REGEX"] = "(" + regex_dict["DIGIT_REGEX_1"] + "|" + new_patterns["DIGIT_REGEX_2"] + "|" + regex_dict["BASIC_NUMBER_REGEX_1"] + "|" + regex_dict["BASIC_NUMBER_REGEX_2"] + "|" + regex_dict["ORDINAL_REGEX_1"] + "|" + regex_dict["ORDINAL_REGEX_2"] + "|" + regex_dict["INARTICLE_REGEX"] + ")-?(" + regex_dict["TIME_UNIT_REGEX"] + ")"

    new_patterns["DURATION_DURATION_REGEX_1"] = "(" + new_patterns["DIGIT_DIGIT_REGEX"] + "|" + new_patterns["BASIC_NUMBER_NUMBER_REGEX"] + "|" + new_patterns["ORDINAL_ORDINAL_REGEX"] + ")-?(" + regex_dict["TIME_UNIT_REGEX"] +")"
    new_patterns["DURATION_DURATION_REGEX_2"] = new_patterns["DURATION_REGEX"] + "-" + new_patterns["DURATION_REGEX"]

    # POS: NN, NNP, NNS, RB, JJ; Type: TIME
    new_patterns["DAY_TIME_MID_REGEX"] = "(" + regex_dict["MID_REGEX"] + ")(" + regex_dict["DAY_TIME_REGEX"] + ")"

    # POS: NN, NNS; Type: DATE
    new_patterns["SEASON_MID_REGEX"] = "(" + regex_dict["MID_REGEX"] + ")(" + regex_dict["SEASON_REGEX"] + ")"
    
    return new_patterns

# returns compiled regex objects
def get_compiled_regex(patterns_dict):
    compiled_regex_dict = {}
    for regex in patterns_dict:
        if regex == "TIME_ZONE_REGEX":
            compiled_regex_dict[regex] = re.compile(patterns_dict[regex])
        else:
            compiled_regex_dict[regex] = re.compile(patterns_dict[regex],flags=re.IGNORECASE)
    return compiled_regex_dict

def get_token_type(regex_name):
    regex_name = regex_name[:regex_name.find("_")]
    types_dict = {}
    types_dict["MODIFIER"]=["PREFIX","SUFFIX","LINKAGE","COMMA","PARENTHESIS","INARTICLE"]
    types_dict["NUMERAL"]=["BASIC","DIGIT","ORDINAL"]
    types_dict["TIME"] = ["DECADE", "YEAR", "SEASON", "MONTH", "WEEK", "DATE", "TIME", "DAY_TIME", "TIMELINE",
                      "HOLIDAY", "PERIOD", "DURATION", "TIME_UNIT","TIME_ZONE", "ERA","MID","TIME_ZONE","DAY","HALFDAY"]
    for type_name in types_dict:
        if regex_name in types_dict[type_name]:
            return type_name
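
The name-to-type lookup can be tried in isolation with the sketch below. It reuses the same type lists; `type_from_regex_name` is a hypothetical stand-in that splits with `str.partition`, so, unlike the function above, a name without an underscore is kept whole. Only the part before the first underscore is looked up, which is why `DAY_TIME_REGEX` resolves through the `DAY` entry and not through `DAY_TIME`.

```python
# Same type lists as in get_token_type.
TYPES = {
    "MODIFIER": ["PREFIX", "SUFFIX", "LINKAGE", "COMMA", "PARENTHESIS", "INARTICLE"],
    "NUMERAL": ["BASIC", "DIGIT", "ORDINAL"],
    "TIME": ["DECADE", "YEAR", "SEASON", "MONTH", "WEEK", "DATE", "TIME", "DAY_TIME",
             "TIMELINE", "HOLIDAY", "PERIOD", "DURATION", "TIME_UNIT", "TIME_ZONE",
             "ERA", "MID", "DAY", "HALFDAY"],
}

def type_from_regex_name(regex_name):
    """Map a regex name to MODIFIER, NUMERAL or TIME via the part before the first underscore."""
    prefix = regex_name.partition("_")[0]
    for type_name, prefixes in TYPES.items():
        if prefix in prefixes:
            return type_name
    return None  # unknown prefix

for name in ["YEAR_MID_REGEX", "DAY_TIME_REGEX", "BASIC_NUMBER_REGEX_1", "PREFIX_REGEX_1"]:
    print(name, "->", type_from_regex_name(name))
```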


In [3]:
# TIME TOKEN RULES
# 15 types of time tokens: 
#  DECADE, YEAR, SEASON, MONTH, WEEK, DATE, TIME, DAY_TIME, TIMELINE, 
#  HOLIDAY, PERIOD, DURATION, TIME_UNIT, TIME_ZONE, ERA
# MODIFIER TOKENS
#  5 types: PREFIX, SUFFIX, LINKAGE, COMMA, INARTICLE
#  ** plus PARENTHESIS, added here
# NUMERAL TOKENS
#  3 types: BASIC, DIGIT, ORDINAL

# 1.1 Read regex
textFile = read_text_file("timeRegex.txt","")
# print(textFile)
regex_dict = parse_text_to_dict(textFile)
# print(len(regex_dict))
new_patterns_dict = get_patterns(regex_dict)
# print(len(new_patterns_dict))
regex_dict.update(new_patterns_dict)
# print(len(regex_dict))
compiled_regex_dict = get_compiled_regex(regex_dict)
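
A regex match alone cannot tell "May" the month from "may" the modal verb, which is why each match is paired with a POS check. The sketch below shows the idea on hand-tagged words; the month pattern is a shortened placeholder and `is_month` is a hypothetical helper, so neither reflects the exact rules of this notebook.

```python
import re

# Hypothetical, shortened month pattern; the notebook reads MONTH_REGEX from timeRegex.txt.
month_regex = re.compile(r"(january|may|june)$", flags=re.IGNORECASE)

def is_month(word, pos_tag):
    """Accept a month match only when the POS tag is a noun tag (NN, NNP, ...)."""
    return month_regex.match(word) is not None and pos_tag.startswith("NN")

# (word, POS) pairs tagged by hand for illustration.
print(is_month("May", "NNP"))  # month name, tagged as a proper noun
print(is_month("may", "MD"))   # modal verb, rejected by the POS check
```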



In [4]:
def print_tokens_list(tokens_list):
    for item in tokens_list:
        print(item)
        
# tokenize, POS-tag, and identify the matched regex and token type of each word
def get_time_tokens(text):
    tokenized_sent=nt.word_tokenize(text)
    pos_sentences=nltk.pos_tag(tokenized_sent)
    # print(pos_sentences)

    tokens_list = []
    index = 0
    # for each word in the text
    for token in pos_sentences:
        # match regex, call objects already compiled
        for compile_pattern in compiled_regex_dict:

            # print(type(compiled_regex_dict))
            # order in list [word - 0, POS - 1, time_token - 2, type_token - 3, index - 4]
            result = compiled_regex_dict[compile_pattern].match(token[0])
            if result is not None:
                # use POS tags to filter ambiguous matches, e.g. "May" (the month) versus "may" (I may leave)
                # flag set when the match passes its POS check
                valid = False
                # YEAR_REGEX_1, YEAR_REGEX_2, YEAR_MID, ERA_YEAR,YEAR_YEAR, YEAR_MONTH_1, YEAR_MONTH_2
                if compile_pattern.startswith("YEAR_REGEX") or compile_pattern.startswith("YEAR_MID") or compile_pattern.startswith("ERA_YEAR") or compile_pattern.startswith("YEAR_YEAR") or compile_pattern.startswith("YEAR_MONTH"):
                    valid = True
                # SEASON_REGEX, SEASON_MID
                if compile_pattern.startswith("SEASON_REGEX") and token[1].startswith("NN") or compile_pattern.startswith("SEASON_MID"):
                    valid = True
                # MONTH_REGEX, MONTH_MID
                if compile_pattern.startswith("MONTH_REGEX") and token[1].startswith("NN") or compile_pattern.startswith("MONTH_MID"):
                    valid = True
                # MONTH_ABBR
                if compile_pattern.startswith("MONTH_ABBR") and (token[1].startswith("NN") or token[1].startswith("JJ")):
                    valid = True
                # MONTH_MONTH
                if compile_pattern.startswith("MONTH_MONTH"):
                    valid = True
                # WEEK_REGEX, WEEK_ABBR ("and" binds tighter than "or", so the NN check applies to WEEK_ABBR only)
                if compile_pattern.startswith("WEEK_REGEX") or compile_pattern.startswith("WEEK_ABBR") and token[1].startswith("NN"):
                    valid = True
                # WEEK_WEEK
                if compile_pattern.startswith("WEEK_WEEK"):
                    valid = True
                # DATE_REGEX_1, 2, 3
                if compile_pattern.startswith("DATE_REGEX"):
                    valid = True
                # TIME_REGEX_1, 2
                if compile_pattern.startswith("TIME_REGEX"):
                    valid = True
                # TIME_TIME
                if compile_pattern.startswith("TIME_TIME"):
                    valid = True
                # HALFDAY_REGEX_1, 2
                if compile_pattern.startswith("HALFDAY_REGEX"):
                    valid = True
                # HALFDAY_HALFDAY
                if compile_pattern.startswith("HALFDAY_HALFDAY"):
                    valid = True
                # TIME_ZONE
                if compile_pattern.startswith("TIME_ZONE"):
                    valid = True
                # ERA_REGEX
                if compile_pattern.startswith("ERA_REGEX") and token[1].startswith("NN"):
                    valid = True
                # TIME_UNIT
                if compile_pattern.startswith("TIME_UNIT") and token[1].startswith("NN"):
                    valid = True
                # DURATION_REGEX
                if compile_pattern.startswith("DURATION_REGEX") and (token[1].startswith("NN") or token[1].startswith("JJ") or token[1].startswith("CD")):
                    valid = True
                # DURATION_DURATION_1, 2
                if compile_pattern.startswith("DURATION_DURATION"):
                    valid = True
                # DAY_TIME_REGEX, DAY_TIME_MID
                if compile_pattern.startswith("DAY_TIME_REGEX") and (token[1].startswith("NN") or token[1].startswith("RB") or token[1].startswith("JJ")) or compile_pattern.startswith("DAY_TIME_MID") :
                    valid = True
                # TIMELINE_REGEX
                if compile_pattern.startswith("TIMELINE_REGEX") and (token[1].startswith("NN") or token[1].startswith("RB")):
                    valid = True
                # HOLIDAY
                if compile_pattern.startswith("HOLIDAY"):
                    valid = True
                # PERIOD_REGEX
                if compile_pattern.startswith("PERIOD_REGEX") and (token[1].startswith("NN") or token[1].startswith("RB") or token[1].startswith("JJ")):
                    valid = True
                # DECADE_REGEX, DECADE_MID
                if compile_pattern.startswith("DECADE_REGEX") and token[1].startswith("NN") or compile_pattern.startswith("DECADE_MID"):
                    valid = True
                # DIGIT_1,2, BASIC_NUMBER_REGEX_1,2, ORDINAL_REGEX_1,2 ("and" binds tighter than "or", so the POS check applies to ORDINAL_REGEX only)
                if compile_pattern.startswith("DIGIT_REGEX") or compile_pattern.startswith("BASIC_NUMBER_R") or compile_pattern.startswith("ORDINAL_REGEX")  and (token[1].startswith("JJ") or token[1].startswith("CD") or token[1].startswith("RB")):
                    valid = True
                # DIGIT_DIGIT, BASIC_NUMBER_NUMBER, ORDINAL_ORDINAL
                if compile_pattern.startswith("DIGIT_DIGIT") or compile_pattern.startswith("BASIC_NUMBER_NUMBER") or compile_pattern.startswith("ORDINAL_ORDINAL"):
                    valid = True
                # INARTICLE_REGEX
                if compile_pattern.startswith("INARTICLE_REGEX"):
                    valid = True
                # PREFIX_REGEX_1,2
                if (compile_pattern.startswith("PREFIX_REGEX_1") and not token[1].startswith("NN")) or (compile_pattern.startswith("PREFIX_REGEX_2") and token[1].startswith("NN")):
                    valid = True
                # SUFFIX
                if compile_pattern.startswith("SUFFIX"):
                    valid = True
                # LINKAGE
                if compile_pattern.startswith("LINKAGE"):
                    valid = True
                # COMMA
                if compile_pattern.startswith("COMMA"):
                    valid = True
                    
                if valid:
                    item = [token[0],token[1],compile_pattern,get_token_type(compile_pattern),index]
                else:
                    item = [token[0],token[1],"","",index]
                break
            else:
                item = [token[0],token[1],"","",index]
                
        tokens_list.append(item)
        index += 1

    # print_tokens_list(tokens_list)
    return tokens_list

# extract time expressions using time tokens
def extract_time_expression(index_list,time_token_list):
    expressions = []
    for row in index_list:
        # print(row)
        expression = ""
        for i in range(min(row), max(row)+1):
            # print(i)
            expression = expression + time_token_list[i][0] + " "
            
        expressions.append(expression.rstrip())
    return expressions
    # print(expressions)
    
# extract time expressions using time tokens
def extract_time_expression_extended(index_list,time_token_list):
    # order in list [word - 0, POS - 1, time_token - 2, type_token - 3, index - 4]
    expressions = []
    for row in index_list:
        # print(row)
        expression = ""
        POS_tag = ""
        time_token = ""
        type_token = ""
        for i in range(min(row), max(row)+1):
            # print(i)
            expression = expression + time_token_list[i][0] + " "
            POS_tag = POS_tag + time_token_list[i][1] + " "
            time_token = time_token + time_token_list[i][2] + " "
            type_token = type_token + time_token_list[i][3] + " "
            
        expressions.append([expression.rstrip(),POS_tag.rstrip(),time_token.rstrip(),type_token.rstrip()])
    # print(expressions)
    return expressions
    
# identify time segments
def get_time_segments(time_token_list):
    # order in list [word - 0, POS - 1, time_token - 2, type_token - 3, index - 4]
    # order in list
    time_segment = []
    index_token = 0
    # for token in time_token_list:
    while index_token < len(time_token_list):
        # print(index_token)
        current_token = time_token_list[index_token]
        type_token = current_token[3]
        # search time tokens
        if type_token == "TIME":
            type_expression = current_token[2]
            if type_expression.startswith("DURATION") or type_expression.startswith("PERIOD"):
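                # single-token segment; note that it is never appended to time_segment,
                # because the append below sits inside the else branch
                segment = []
                segment.append(index_token)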
            else:
                # start searching time segments
                segment = []
                # current token
                index = current_token[4]
                # time_token_one = True
                segment.append(index)
                # print(current_token)

                # search tokens on the left
                # reverse loop (current index -1 to 0)
                for i in reversed(range(index)):
                    # previous token
                    prev_token = time_token_list[i]
                    # print("prev token left",prev_token)
                    
                    if prev_token[2].startswith("PREFIX") or prev_token[3].startswith("NUMERAL") or prev_token[2].startswith("INARTICLE"):
                        segment.append(i)
                    elif prev_token[2].startswith("COMMA"):
                        break
                    elif prev_token[2].startswith("LINKAGE"):
                        # keep the linkage only if a numeral precedes it;
                        # i > 0 stops index i-1 from wrapping around to the last token
                        if type_token.startswith("TIME") and i > 0 and time_token_list[i-1][3].startswith("NUMERAL"):
                            segment.append(i)
                        else:
                            break
                    else:
                        break
                    # if prev_token[3] == "":
                    #     break
                    # elif prev_token[2].startswith("COMMA") or prev_token[2].startswith("LINKAGE") or prev_token[2].startswith("PARENTHESIS"):
                    #     break
                    # elif prev_token[2].startswith("PREFIX") or prev_token[3].startswith("NUMERAL") or prev_token[2].startswith("INARTICLE"):
                    #     segment.append(i)
                    # # merge two time segments
                    # elif prev_token[3].startswith("TIME") and len(segment)==1:
                    #     segment.append(i)
                    # elif prev_token[3].startswith("TIME") and len(segment)>1:
                    #     if not time_token_list[min(segment)][2].startswith("COMMA") or not time_token_list[min(segment)][2].startswith("LINKAGE"):
                    #         segment.append(i)
                    
                # search tokens on the right
                for i in range(index+1, len(time_token_list)):
                    next_token = time_token_list[i]
                    # print("sub token right")
                    # print(next_token)
                    if next_token[2].startswith("SUFFIX") or next_token[3].startswith("NUMERAL"):
                        segment.append(i)
                    elif next_token[2].startswith("COMMA"):
                        if ((i+1)<len(time_token_list)) and time_token_list[i+1][3].startswith("TIME"):
                            segment.append(i)
                        else:
                            break
                    elif next_token[2].startswith("LINKAGE") and ((i+1)<len(time_token_list)) and time_token_list[i+1][3].startswith("TIME"):
                        segment.append(i)
                    elif next_token[2].startswith("LINKAGE"): 
                        if type_token.startswith("TIME") and ((i+1)<len(time_token_list)) and time_token_list[i+1][3].startswith("NUMERAL"):
                            segment.append(i)
                        else:
                            break
                    elif next_token[3].startswith("TIME") and len(segment)>=1:
                        segment.append(i)
                    elif next_token[3] == "":
                        break
                    # elif next_token[2].startswith("COMMA") or next_token[2].startswith("LINKAGE") or next_token[2].startswith("PARENTHESIS") or next_token[2].startswith("INARTICLE"):
                    #     break
                    # elif next_token[2].startswith("SUFFIX") or next_token[3].startswith("NUMERAL"):
                    #     segment.append(i)
                    # # merge two time segments
                    # elif next_token[3].startswith("TIME") and len(segment)==1:
                    #     segment.append(i)
                    # elif next_token[3].startswith("TIME") and len(segment)>1:
                    #     if not time_token_list[max(segment)][2].startswith("COMMA") or not time_token_list[max(segment)][2].startswith("LINKAGE"):
                    #         segment.append(i)
                    else:
                        break

                # calculate next item in list
                index_token = max(segment)
                # print(index_token)
                # add to the list of all time segments in sentence
                time_segment.append(segment)
                # print(segment)
                # break
        index_token +=1
    return time_segment

def classify_type_time_expression(time_expression_list):
    # print(len(time_expression_list))
    type_time_expression_name = []
    for time_expression_details in time_expression_list:
        # input: a list of lists with the following information ['time_expression', 'POST_tag',"time_token","type_token"]
        POS_tag_list = time_expression_details[1].split()
        # print(POS_tag_list)
        POS_tag_lenght = len(POS_tag_list)

        if POS_tag_list[0] == "CD":
            # eg. 1924	CD
            if POS_tag_lenght == 1:
                type_time_expression_name.append("time_reference")
            # 1899 and 1920	CD CC CD
            # 1846 to 1885	CD TO CD
            elif POS_tag_lenght == 3 and (POS_tag_list[1] == "CC" or POS_tag_list[1] == "TO"):
                type_time_expression_name.append("time_range")
            # 30-Jun-02	CD NNP CD
            elif POS_tag_lenght == 3 and POS_tag_list[1] == "NNP":
                type_time_expression_name.append("time_point")
            # eg. 2 June 1857 - 23 February 1934	CD NNP CD NNP CD NNP CD
            elif POS_tag_lenght>3:
                type_time_expression_name.append("time_range")
            # three years	CD NNS
            elif POS_tag_lenght == 2 and (POS_tag_list[1] == "NNS" or POS_tag_list[1] == "RB"):
                type_time_expression_name.append("time_reference")
            # Eight years older	CD NNS JJR
            elif POS_tag_lenght == 3 and POS_tag_list[1] == "NNS" and POS_tag_list[2] == "JJR":
                type_time_expression_name.append("time_reference")
            else:
                type_time_expression_name.append("")
        elif POS_tag_list[0] == "DT":
            # # the 1890s	DT CD
            # if POS_tag_lenght == 2 and POS_tag_list[1] == "CD":
            #     return "time_reference"
            # # the early 1920s	DT JJ CD
            # # the next seven years	DT JJ CD NNS
            # if (POS_tag_lenght == 3 or POS_tag_lenght == 4 or POS_tag_lenght == 5) and POS_tag_list[1] == "JJ":
            #     return "time_reference"
            # # some years	DT NNS
            # # a decade	DT NN
            # # the beginning of the twentieth century	DT NN IN DT JJ NN
            # if POS_tag_list[1].startswith("NN"):
            #     return "time_reference"
            type_time_expression_name.append("time_reference")
        # of 1910	IN CD
        # of the 1890s	IN DT CD
        # of the first march	IN DT JJ NN
        elif POS_tag_list[0] == "IN":
            type_time_expression_name.append("time_reference")
        # early 2007	JJ CD
        # many years	JJ NNS
        elif POS_tag_list[0] == "JJ":
            type_time_expression_name.append("time_reference")
        # expression starting with a comparative adjective (JJR)
        elif POS_tag_list[0] == "JJR":
            type_time_expression_name.append("time_reference")
        # Sep-05	NNP CD
        # forties	NNS
        elif POS_tag_list[0].startswith("NN"):
            type_time_expression_name.append("time_reference")
        # now	RB
        # nearly ten years	RB JJ NNS
        # more than three years	RBR IN CD NNS
        elif POS_tag_list[0].startswith("RB"):
            type_time_expression_name.append("time_reference")
        # off the next year	RP DT JJ NN
        elif POS_tag_list[0].startswith("RP"):
            type_time_expression_name.append("time_reference")
        # lasting less than fifteen minutes	VBG JJR IN JJ NNS
        elif POS_tag_list[0].startswith("VBG"):
            type_time_expression_name.append("time_reference")
        else:
            type_time_expression_name.append("")

    # print(len(type_time_expression_name))
    return type_time_expression_name
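
To make the classification rules easier to follow, here is a reduced sketch that covers only some of the branches for expressions starting with a CD tag. `classify_cd_expression` is a hypothetical helper, and the POS sequences are written by hand instead of coming from the tagger.

```python
def classify_cd_expression(pos_tags):
    """Classify an expression whose first POS tag is CD (a subset of the rules above)."""
    tags = pos_tags.split()
    if not tags or tags[0] != "CD":
        return ""
    if len(tags) == 1:
        return "time_reference"      # a single number, e.g. a bare year
    if len(tags) == 3 and tags[1] in ("CC", "TO"):
        return "time_range"          # number, connector, number
    if len(tags) == 3 and tags[1] == "NNP":
        return "time_point"          # day, month, year
    if len(tags) > 3:
        return "time_range"
    return ""

# Expression text mapped to its hand-written POS sequence.
examples = {"1924": "CD", "1846 to 1885": "CD TO CD", "4 May 1861": "CD NNP CD"}
labels = {text: classify_cd_expression(tags) for text, tags in examples.items()}
print(labels)
```

Because the rules look only at POS tags, two expressions with the same tag sequence always receive the same label, whatever their words are.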

In [5]:
## testing
# sentence = "He made his United States debut at the Steinway Hall in New York City on November 10, 1888, and his first tour of the United States in 1888–1889 with Moriz Rosenthal."
# sentence = "On April 26, 1941, he was involved in a serious traffic accident."
# sentence = "Brahms published a manifesto for the Serious Music side on 4 May 1861, signed by Clara Schumann, Joachim, Albert Dietrich, Woldemar Bargiel, and twenty others, which decried the purveyors of the Music of the Future as contrary to the innermost spirit of music, strongly to be deplored and condemned"
# sentence = "December 14–15, 1928;"
# sentence = "In The Musical Times Vol. 146 (Winter 2005), pp."
# time_token_list = get_time_tokens(sentence)
# print(time_token_list)
# # identify time_segments
# time_segments = get_time_segments(time_token_list)
# print(time_segments)
# if len(time_segments) > 0:
#     # print(time_segments)
#     time_expressions_list = extract_time_expression_extended(time_segments,time_token_list)
#     print(time_expressions_list)

In [6]:
# for file_name_item in df_files.itertuples():
for chunk in pd.read_csv('list_wikiIdSample.csv', chunksize=50):
# for chunk in pd.read_csv('totalBiographiesBenchmark.csv', chunksize=50):
    df_file_name = pd.DataFrame()
    df_file_name['file_name'] = chunk['file_name']
    for file_name_item in df_file_name.itertuples():
        file_exists = os.path.isfile('indexedSentences/'+file_name_item.file_name.replace(".txt",".csv"))
        if file_exists:
            print(file_name_item.file_name.replace(".txt",".csv"))
            # read the cached results from the query
            biography_df = pd.read_csv('indexedSentences/'+file_name_item.file_name.replace(".txt",".csv"))
            df_time_expression = pd.DataFrame()

            # for each sentence in the biography
            for sentence_row in biography_df.itertuples():
                # print(sentence_row)
                # identify time tokens
                time_token_list = get_time_tokens(sentence_row.sentences)
                # print(time_token_list)
                # identify time_segments
                time_segments = get_time_segments(time_token_list)

                if len(time_segments) > 0:
                    # print(time_segments)
                    time_expressions_list = extract_time_expression_extended(time_segments,time_token_list)

                    df_temp = pd.DataFrame(time_expressions_list, columns=['time_expression', 'POST_tag',"time_token","type_token"])
                    df_temp["time_expression_type"] = classify_type_time_expression(time_expressions_list)
                    df_temp["sentences"] = sentence_row.sentences
                    df_temp["sentenceIndex"] = sentence_row.sentenceIndex
                    df_temp["paragraphIndex"] = sentence_row.paragraphIndex
                    df_temp["section"] = sentence_row.section
                    df_temp["wikiId"] = sentence_row.wikiId

                    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
                    df_time_expression = pd.concat([df_time_expression, df_temp])

            df_time_expression.to_csv('extractedTimeExpressions/'+file_name_item.file_name.replace(".txt",".csv"),index=False)


608845.csv
2232977.csv
409969.csv
1551347.csv
579599.csv
1174545.csv
2898019.csv
3450382.csv
2553865.csv
2320846.csv
144624.csv
827409.csv
1396921.csv
2815597.csv
858538.csv
2269540.csv
701860.csv
2334176.csv
8716.csv
576282.csv
113049.csv
529161.csv
167975.csv
3126224.csv
78231.csv
1022191.csv
223497.csv
3081864.csv
63747.csv
739770.csv
1566844.csv
439467.csv
181985.csv
51560453.csv
3236079.csv
720273.csv
419012.csv
994118.csv
341837.csv
3818460.csv
2657736.csv
623861.csv
489381.csv
1113259.csv
1433271.csv
756836.csv
165726.csv
165113.csv
313835.csv
67892.csv
2704521.csv
2771033.csv
526281.csv
450629.csv
2416191.csv
173225.csv
632683.csv
991714.csv
1023303.csv
481738.csv
1566892.csv
2659909.csv
12945.csv
3141790.csv
838629.csv
492026.csv
512518.csv
1429918.csv
58067.csv
1213916.csv
155965.csv
1253793.csv
49906.csv
5033.csv
100273.csv
442318.csv
1098118.csv
1321577.csv
553740.csv
523339.csv
20405.csv
238175.csv
1599087.csv
1786533.csv
797572.csv
655613.csv
379324.csv
1070521.csv
262463
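
The loop above derives the CSV name with `str.replace(".txt", ".csv")` each time a path is needed. A minimal sketch of a helper that builds the path in one place, assuming the entries in `file_name` are plain names such as `608845.txt`; `csv_path_for` is a hypothetical name and is not part of the notebook:

```python
from pathlib import PurePosixPath

def csv_path_for(directory, file_name):
    """Return '<directory>/<stem>.csv' for a biography file name.

    Hypothetical helper: only the final suffix is swapped, so a '.txt'
    that occurs elsewhere in the name is left untouched.
    """
    return str(PurePosixPath(directory) / PurePosixPath(file_name).with_suffix(".csv"))

# input and output locations used by the loop above
print(csv_path_for("indexedSentences", "608845.txt"))
print(csv_path_for("extractedTimeExpressions", "608845.txt"))
```

Note one difference from `str.replace`: a name without any extension also receives `.csv` here, which may or may not be wanted for this dataset.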

# NLTK process
#### old code

In [9]:
pattern = set_pattern_time()
count = 0
for file_name in df_files.itertuples():
    count +=1
    # start = time.time()
    print(file_name.file_name)
    # Read file with segmented sentences
    biography_df = pd.read_csv('indexedSentences/'+file_name.file_name)
    df_time_ent = pd.DataFrame()
    # df_entities = pd.DataFrame()
    
    # for each sentence in each biography
    for sentence_row in biography_df.itertuples():
        # print(sentence_row.sentences)
        # now use the same sentence to analyse if a time entity is present
        # for sentence_row in biography_df.itertuples():
        df_temp = pd.DataFrame()
        timeEntityList = []
        #added to include timeEntityType
        timeEntityTypeList = []

        tokenized_sent=nt.word_tokenize(sentence_row.sentences)
        pos_sentences=nltk.pos_tag(tokenized_sent)
        # print(pos_sentences)
        
        cp = nltk.RegexpParser(pattern)
        cs = cp.parse(pos_sentences)

        # loop over the parsed chunks to find the POS-tag patterns related to TIME
        for ne in cs:
            # only chunks matched by the grammar are subtrees with a label
            if hasattr(ne, "label"):
                # join the tokens of the chunk into a single string
                res = " ".join(token for token, _tag in ne)
                
                # added to include timeEntityType
                time_type = ""
                if ne.label() == "RN":
                    # then type is RANGE
                    time_type = "TimeRange"
                else:
                    time_type = "TimePoint"
                if ('–' in res):
                    time_type = "TimeRange"
                timeEntityTypeList.append(time_type)
                
                #added to include timeEntityType
                timeEntityList.append(res)
                
        # if some time entities were identified
        if timeEntityList:
            df_temp['entity']=timeEntityList
            df_temp['sentence']= sentence_row.sentences
            df_temp['sentenceIndex']=sentence_row.sentenceIndex
            df_temp['paragraphIndex'] = sentence_row.paragraphIndex
            df_temp['section'] = sentence_row.section
            df_temp['entType'] = 'time'
            df_temp['wikiPageID'] = sentence_row.wikiId
            #added to include timeEntityType
            df_temp['timeEntityType']=timeEntityTypeList

            df_time_ent = pd.concat([df_time_ent, df_temp])

    # append the time entities to the person/place entities already extracted
    df_entities = pd.read_csv('extractedEntitiesPersonPlaceOnly/'+file_name.file_name)
    pd.concat([df_entities, df_time_ent]).to_csv('extractedEntities/'+file_name.file_name,index=False)

10085.csv
Sir Edward William Elgar, 1st Baronet,  ( (listen); 2 June 1857 – 23 February 1934) was an English composer, many of whose works have entered the British and international classical concert repertoire.
[('Sir', 'NNP'), ('Edward', 'NNP'), ('William', 'NNP'), ('Elgar', 'NNP'), (',', ','), ('1st', 'CD'), ('Baronet', 'NNP'), (',', ','), ('(', '('), ('(', '('), ('listen', 'VBN'), (')', ')'), (';', ':'), ('2', 'CD'), ('June', 'NNP'), ('1857', 'CD'), ('–', 'NNP'), ('23', 'CD'), ('February', 'NNP'), ('1934', 'CD'), (')', ')'), ('was', 'VBD'), ('an', 'DT'), ('English', 'NNP'), ('composer', 'NN'), (',', ','), ('many', 'JJ'), ('of', 'IN'), ('whose', 'WP$'), ('works', 'NNS'), ('have', 'VBP'), ('entered', 'VBN'), ('the', 'DT'), ('British', 'JJ'), ('and', 'CC'), ('international', 'JJ'), ('classical', 'JJ'), ('concert', 'NN'), ('repertoire', 'NN'), ('.', '.')]
Among his best-known compositions are orchestral works including the Enigma Variations, the Pomp and Circumstance Marches, concertos
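
The cell above labels a chunk as `TimeRange` when it comes from the `RN` grammar stage or when its text contains an en dash, and as `TimePoint` otherwise. The same rule as a standalone sketch. `flatten_chunk` and `classify_time_entity` are hypothetical names, chunks are modelled as plain lists of `(token, tag)` pairs rather than `nltk.Tree` objects, and the `DT1`/`DT2` labels passed in are chosen for illustration only:

```python
def flatten_chunk(tagged_tokens):
    """Join the tokens of a list of (token, POS tag) pairs into one string."""
    return " ".join(token for token, _tag in tagged_tokens)

def classify_time_entity(label, text):
    """Mirror the notebook rule: RN chunks and en-dash spans are ranges."""
    if label == "RN" or "–" in text:
        return "TimeRange"
    return "TimePoint"

# tokens copied from the tagged Elgar sentence printed above
life_span = [("2", "CD"), ("June", "NNP"), ("1857", "CD"), ("–", "NNP"),
             ("23", "CD"), ("February", "NNP"), ("1934", "CD")]
text = flatten_chunk(life_span)
print(text, "->", classify_time_entity("DT1", text))
print("in 1996 ->", classify_time_entity("DT2", "in 1996"))
```

Keeping the rule in one function means the range test lives in a single place instead of being repeated inside the sentence loop.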


# Set pattern, nltk

In [7]:
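"""
Quantifiers and wildcards used in the tag patterns:
+ = match 1 or more repetitions
? = match 0 or 1 repetitions
* = match 0 or more repetitions
. = any character except a new line
A quantifier inside the angle brackets applies to the characters of the tag
name (<CD?> matches the tag C or CD); placed after the closing bracket it
applies to the whole token (<CD>? makes a CD token optional).

POS tags (Penn Treebank):
CD      cardinal number
DT      determiner
NN      noun, singular 'desk'
NNS     noun, plural 'desks'
NNP     proper noun, singular 'Harrison'
NNPS    proper noun, plural 'Americans'
IN      preposition/subordinating conjunction
RB      adverb 'very', 'silently'
RBR     adverb, comparative 'better'
RBS     adverb, superlative 'best'
"""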
def set_pattern_time():
    pattern = r"""DT1: #dates, time point
    {<CD?><NNP|CD?><CD?>} #complete dates Eg. 23 January 1983
    {<NNP?><CD?><,?><CD?>} # December 13, 2000
    {<CD?></><CD?></><CD?>} #complete dates 23/02/2021
    {<CD?><-><CD?><-><CD?>} #complete dates 23-02-2021
    HO: # HOURS
    {<CD>+<NN>+} # hour only
    RN: # range 
    {<IN><CD>+<IN|TO|CC>+<CD>+} # between YYYY and <> YYYY, from 1938 to 1939
    DT2: #date from explicit, to implicit DT2 [('each', 'DT'), ('one', 'CD')]
    {<IN>+<DT>?<CD>} # "in XXXX" (year)
    <\W?>{<CD>}<\W?> # year in between special characters
    {<NNP><CD>} #incomplete date January 2003, Fall 1994
    {<IN>+<DT>+<CD>+} # years dt the 1990s, leukemia in 1996, age of 43,the 1990s
    {<NN>+<IN|DT>+<CD>} # years dt the 1990s, leukemia in 1996, age of 43,the 1990s
    {<CD><IN>} # 1984 novel,1954–58
    DT3:
    <DT>+{<CD>}
    REF: # references, implicit
    {<IN>+<NN>+<CD>+} # by age 43
    {<NN><IN><CD>} # Eg. fall/winter of 1345, age of 43
    <NN>{<DT>+<CD>} # Eg. fall/winter of 1345, age of 43
    """
    return pattern
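
In the grammar above, the `?` in `<CD?>` sits inside the angle brackets, so it quantifies the last character of the tag name rather than the whole token. The sketch below illustrates the difference with the standard `re` module. It is a simplified stand-in for the way NLTK reads tag patterns (each `<...>` is treated as one group over a string of bracketed tags), not NLTK's own code; `tag_pattern_to_regex` and `matches` are hypothetical helpers.

```python
import re

def tag_pattern_to_regex(tag_pattern):
    """Turn a tag pattern such as '<NNP><CD>?' into a regex over '<TAG><TAG>...'.

    Simplifying assumption: every <...> is one group, so a quantifier after
    '>' applies to the whole tag and one before '>' to the tag's characters.
    """
    body = tag_pattern.replace("<", "(?:<(?:").replace(">", ")>)")
    return re.compile(body)

def matches(tag_pattern, tags):
    """True when the whole tag sequence is matched by the tag pattern."""
    tag_string = "".join("<%s>" % tag for tag in tags)
    return tag_pattern_to_regex(tag_pattern).fullmatch(tag_string) is not None

# '23 January 1983' is tagged CD NNP CD; 'January 2003' is tagged NNP CD
print(matches("<CD?><NNP><CD?>", ["CD", "NNP", "CD"]))  # True
print(matches("<CD?><NNP><CD?>", ["NNP", "CD"]))        # False: the leading CD is required
print(matches("<CD>?<NNP><CD>?", ["NNP", "CD"]))        # True: the leading CD is optional
```

Under this reading, rules such as `{<CD?><NNP|CD?><CD?>}` always need three tokens, which is worth keeping in mind when a partial date is expected to match.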