### How do researchers deal with it:
- [Word embeddings Wiki](https://en.wikipedia.org/wiki/Word_embedding)
- [Gensim Python library](https://en.wikipedia.org/wiki/Gensim)
- [Inference Rules Wiki](https://en.wikipedia.org/wiki/Rule_of_inference)

### LSTM: 
- [Long short-term memory (LSTM)](https://www.datacamp.com/community/tutorials/lstm-python-stock-market#lstm)
- [Learn via example](https://iamtrask.github.io/2015/11/15/anyone-can-code-lstm/)

### Literature:
- [Intent extraction from social media texts using sequential segmentation and deep learning models](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8119461) uses CRFs and Bi-LSTM to extract intent from social media texts in two categories: Cosmetics and Tourism. Look into these algorithms.
    - Citation: 
`@INPROCEEDINGS{8119461, 
author={T. L. Luong and M. S. Cao and D. T. Le and X. H. Phan}, 
booktitle={2017 9th International Conference on Knowledge and Systems Engineering (KSE)}, 
title={Intent extraction from social media texts using sequential segmentation and deep learning models}, 
year={2017}, 
pages={215-220}, 
doi={10.1109/KSE.2017.8119461}, 
month={Oct},}`


- In [Semantic Indexing for Recorded Educational Lecture Videos](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1598977), the authors extract scripts from videos with a timestamp on each word and cluster them, so that the exact position of a particular topic in the video can be found. They also use a retrieval method to find “example”, “explanation”, “overview”, “repetition”, and “exercise” segments for a particular word or topic word.
    - Citation: `@INPROCEEDINGS{1598977, 
author={S. Repp and M. Meinel}, 
booktitle={Fourth Annual IEEE International Conference on Pervasive Computing and Communications Workshops (PERCOMW'06)}, 
title={Semantic indexing for recorded educational lecture videos}, 
year={2006}, 
pages={5 pp.-245}, 
month={March},}`


- [Olex: Effective Rule Learning for Text Categorization](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4641927) treats the problem as a text classification task and applies inference rules to it. It does not target intent mining specifically, but it works with different categories, similar to what I have. The inference rules are of the form: \begin{equation} \text{if } T_1 \text{ or } \dots \text{ or } T_n \text{ occurs in document } d, \text{ and none of } T_{n+1}, \dots, T_{n+m} \text{ occurs in } d, \text{ then classify } d \text{ under category } C \end{equation} Each rule has `one` positive literal and `0+` negative literals, and the terms are `n-grams`.

    - Citation `@ARTICLE{4641927, 
author={P. Rullo and V. L. Policicchio and C. Cumbo and S. Iiritano}, 
journal={IEEE Transactions on Knowledge and Data Engineering}, 
title={Olex: Effective Rule Learning for Text Categorization}, 
year={2009}, 
volume={21}, 
number={8}, 
pages={1118-1132}, 
doi={10.1109/TKDE.2008.206}, 
ISSN={1041-4347}, 
month={Aug},}`
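
The rule form above can be sketched in a few lines of Python. This is a minimal illustration, not the Olex implementation: the function name `olex_rule_matches` is hypothetical, and terms are matched on whitespace-separated lowercase tokens, so punctuation attached to a word will prevent a match.

```python
def olex_rule_matches(document, positive_terms, negative_terms):
    """Return True if any positive term occurs in the document and no negative term does."""
    # Normalise case and whitespace, then pad so that terms match on word boundaries.
    padded = " " + " ".join(document.lower().split()) + " "

    def occurs(term):
        return " " + term.lower() + " " in padded

    has_positive = any(occurs(term) for term in positive_terms)
    has_negative = any(occurs(term) for term in negative_terms)
    return has_positive and not has_negative


doc = "so let's look at the proof of this theorem"
print(olex_rule_matches(doc, ["look at", "prove"], ["example", "diagram"]))  # True
```

Terms can be n-grams of any length because matching is done on the padded string rather than on single tokens.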

# Inference Rules method

## Theory
### General Rules
1. If sentence has no label, proceed with label search.
2. If no label can be assigned, assign the last label applied to a previous sentence.
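
As a rough sketch of the second rule, the fallback can be expressed as a single pass that remembers the most recent label. The function name `propagate_labels` and the use of `"NOLBL"` as the "no label" marker are assumptions made for this illustration.

```python
def propagate_labels(labels, no_label="NOLBL"):
    """Replace each missing label with the most recently assigned label."""
    result = []
    last = no_label          # stays no_label until the first real label is seen
    for label in labels:
        if label != no_label:
            last = label
        result.append(last)
    return result


print(propagate_labels(["EX", "NOLBL", "NOLBL", "CD", "NOLBL"]))
# ['EX', 'EX', 'EX', 'CD', 'CD']
```

Sentences before the first labeled sentence keep the "no label" marker, since there is nothing to fall back on yet.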

-----------------------------------
### ALL RULES
- EX_1 <<< `example` || `for instance` || `assume` || `suppose` || `imagine` || `as` || `simulation` || `diagram` [✔] 
- EX_2 <<< `Let's` && try || think || see || pick || take a look || say .. [✔]
- CD_1 <<< `Let's` && look at || make || put || do || start || prove || evaluate || back || try || just && NOT `example` || `assume` || `suppose` || `imagine` || `diagram` [✔]
- CD_2 <<< `in other words` && present tense
- CD_3 <<< `so` && `it's` || `i'm`
- CD_4 <<< `so this is` || `actually` && NOT `example` || `summary` || `next` || `last`
- SM_1 <<< `Let's` && summarize
- SM_2 <<< `in other words` && past tense
- SM_3 <<< `later` || `next time` || `last time` || `summary` || `summarize`
- SM_4 <<< if (lineNr < 10 `OR` lineNr > fileLinesNr - 10) `&&` (past tense) =>> (within the first or last 10 lines + past tense)
- SM_5 <<< `going to` && `look` || `see` || `be` || `think`
- AP_1 <<< `in other words` && `should` || `could` || `would`
- AP_2 <<< `encourage` || `step` || `first` || `finally` || `second` || `should` || `could` || `would` || `best practice(s)` || `need to`
- AP_3 <<< `if` && `use` || `can` || `should` || `could` || `want`
- CM_1 <<< `called` && concept
- CM_2 <<< `what is` .. && concept
- CM_3 <<< `theorem` || `algorithm` || `method` || `let's use`
- CM_4 <<< `let's` && `use`
- CM_5 <<< first occurrence of the terms in the title of the file
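
Most rules above are keyword checks, but SM_4 depends on the position of the line in the file. A minimal sketch of that rule, assuming the tense check has already been done and is passed in as a boolean (the function name `sm_4` and the `window` parameter are illustrative only):

```python
def sm_4(line_nr, file_lines_nr, is_past_tense, window=10):
    """SM_4: label as SM when the line is within the first or last `window` lines
    of the file and the sentence is in the past tense."""
    near_edge = line_nr < window or line_nr > file_lines_nr - window
    if near_edge and is_past_tense:
        return "SM"
    return "NOLBL"


print(sm_4(3, 100, True))    # SM
print(sm_4(50, 100, True))   # NOLBL
```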

-----------------------------------

### Logical expressions (copy/paste in thesis later - LATEX style)
#### ♦ EXAMPLE
1. \begin{equation} d \leftarrow EX, \text{ if } (\text{"let's"} \in d) \land (\text{"try"} \in d \lor \text{"see"} \in d \lor \text{"think"} \in d \lor \text{"pick"} \in d \lor \text{"say"} \in d) \end{equation} 

2. \begin{equation} d \leftarrow EX, \text{ if } (\text{"example"} \in d) \lor (\text{"for instance"} \in d) \lor (\text{"suppose"} \in d) \lor (\text{"assume"} \in d) \lor (\text{"includes"} \in d) \lor (\text{"imagine"} \in d) \end{equation} 

Latex Formula Formatter: https://www.codecogs.com/eqnedit.php
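
The same notation extends to rules with negative literals. As a sketch only (it is a direct transcription of CD_1 from the rule list, using `\text{}` from `amsmath` for the quoted terms), CD_1 could be written as:

```latex
\begin{equation}
d \leftarrow CD, \text{ if } (\text{"let's"} \in d)
\land (\text{"look at"} \in d \lor \text{"make"} \in d \lor \text{"put"} \in d \lor \text{"do"} \in d
\lor \text{"start"} \in d \lor \text{"prove"} \in d \lor \text{"evaluate"} \in d
\lor \text{"back"} \in d \lor \text{"try"} \in d \lor \text{"just"} \in d)
\land \lnot (\text{"example"} \in d \lor \text{"assume"} \in d \lor \text{"suppose"} \in d
\lor \text{"imagine"} \in d \lor \text{"diagram"} \in d)
\end{equation}
```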

## Pseudocode

`Disregard all sentences that have the NL label (ignore them entirely), then: [✔]
    if sentence has no label:
        for line in text:
            if lineNR < 10 OR lineNR > nrOfLines-10:
                if (60%+ of the verbs on the line are in PAST TENSE):
                    go over the SM rules  (append res to curSentLabels)
                    go over the CM rules  (append res to curSentLabels)
                if (60%+ of the verbs on the line are in PRESENT TENSE):
                    go over the CD rules  (append res to curSentLabels)
                if (60%+ of the verbs on the line are in FUTURE TENSE):
                    go over the CM rules  (append res to curSentLabels)
            if lineNR >= 10 AND lineNR <= nrOfLines-10:
                go over the EX rules  (append res to curSentLabels)
                go over the CD rules  (append res to curSentLabels)
                go over the AP rules  (append res to curSentLabels)
                go over the CM rules  (append res to curSentLabels)
        Count the labels with a special method for this: [✔]
            if all rules fail to assign a label, i.e. if all labels return count 0:
                search for the last labeled sentence:
                    assign its label to the current sentence`

### Checking the tense of the verbs in the sentence

- [All POS Tags](http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

NLTK POS tags for verbs:

- `VBD`: verb, past tense (`took`)
- `VBN`: verb, past participle (`taken`)
- `VB`: verb, base form (`take`)
- `VBG`: verb, gerund/present participle (`taking`)
- `VBP`: verb, non-3rd person singular present (`take`)
- `VBZ`: verb, 3rd person singular present (`takes`)


**Simply counting verbs isn't enough. An artificial boost is added when certain verb forms occur together
so that they form a specific English tense. All combinations are listed below:**

- **Past perfect**: `VBD`(had) + `VBN`(been), but no `VBG`(-ing/gerund)
- **Past continuous**: `VBD`(was/were) + `VBG`(-ing/gerund)
- **Past perfect continuous**: `VBD`(had) + `VBN`(been) + `VBG`(-ing/gerund)
- **Present perfect**: `VBP`/`VBZ`(have/has) + `VBN`(been), but no `VBG`(-ing/gerund)
- **Present perfect continuous**: `VBP`/`VBZ`(have/has) + `VBN`(been) + `VBG`(-ing/gerund)
- **Present continuous**: `VBP`/`VBZ`(am/are/is) + `VBG`(-ing/gerund)
- **Future continuous**: `MD`(will) + `VBG`(-ing/gerund)
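
The combinations above can be checked directly on a tagged sentence. The sketch below is one possible reading of the list, not the code used later in this notebook: the function name `detect_compound_tense` is hypothetical, the input is a list of `(word, tag)` pairs in the shape returned by `nltk.pos_tag`, and the example tags are written by hand so that no NLTK model download is needed. Note that `MD` covers every modal verb, not only "will".

```python
def detect_compound_tense(tagged):
    """Return the compound tense suggested by the POS tags, or None.

    tagged: list of (word, tag) pairs, e.g. the output of nltk.pos_tag.
    """
    tags = {tag for _word, tag in tagged}
    has_vbd = "VBD" in tags
    has_present = "VBP" in tags or "VBZ" in tags
    has_vbn = "VBN" in tags
    has_vbg = "VBG" in tags
    has_md = "MD" in tags

    # Check the most specific combinations first.
    if has_vbd and has_vbn and has_vbg:
        return "past perfect continuous"
    if has_vbd and has_vbn:
        return "past perfect"
    if has_vbd and has_vbg:
        return "past continuous"
    if has_present and has_vbn and has_vbg:
        return "present perfect continuous"
    if has_present and has_vbn:
        return "present perfect"
    if has_present and has_vbg:
        return "present continuous"
    if has_md and has_vbg:
        return "future continuous"
    return None


# Hand-tagged example: "I had been waiting"
example = [("I", "PRP"), ("had", "VBD"), ("been", "VBN"), ("waiting", "VBG")]
print(detect_compound_tense(example))  # past perfect continuous
```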

In [229]:
import nltk

sentence = """I have lived here since 1987."""

puncDict = [",",".",";","-","_","`","'","?","!",":"]

def checkPastTense(sentence):
    text = nltk.word_tokenize(sentence)
    verbs, pastTense, VBGgerund, VBNpp, MDmodalverb, VBDPast = 0, 0, 0, 0, 0, 0    
    
    textPOS = nltk.pos_tag(text)
    for tag in textPOS:
        if str(tag[1]).startswith("V"):
#            print(tag)
            verbs += 1
            if str(tag[1]) == "VBD": 
                pastTense += 1
                VBDPast += 1
            elif str(tag[1]) == "VBN": 
                pastTense += 1
                VBNpp += 1
            elif str(tag[1]) == "VBG":
                pastTense += 1
                VBGgerund += 1
        
        if str(tag[1]) == "MD":
            MDmodalverb += 1
    
    # artificial boost over the past tense verbs when the tags found together
    # form a compound past tense (see the list of tenses above)
    if VBDPast > 0 and VBNpp > 0 and VBGgerund == 0: pastTense += 2                   #Past perfect 
    elif VBDPast > 0 and VBNpp == 0 and VBGgerund > 0: pastTense += 2                 #Past continuous tense 
    elif VBDPast > 0 and VBNpp > 0 and VBGgerund > 0: pastTense += 3                  #Past perfect continuous
    
    # no verbs in the sentence: nothing to measure (avoids ZeroDivisionError)
    if verbs == 0:
        return "NPAST"
    
    #calculate verbs over words (currently unused)
    perc_verbs = 100 * verbs / len(text)
    
    #calculate past tense verbs over all verbs (the boost can push this above 100)
    perc_PastVerbsOverVerbs = 100 * pastTense / verbs
    
    #calculate past tense verbs over all words (currently unused)
    perc_VerbsPastTenseOverWords = 100 * pastTense / len(text)
    
    if perc_PastVerbsOverVerbs >= 50:
        return perc_PastVerbsOverVerbs
    else:
        return "NPAST"


# -------------- quick check (to delete later) -----------------
# checkPastTense returns either a percentage or the string "NPAST"
pastChance = checkPastTense(sentence)
if pastChance == "NPAST":
    print("\nNot past tense")
else:
    print("\nPast tense chance: {0:.2f} %".format(pastChance))

def checkPresentTense(sentence):
    text = nltk.word_tokenize(sentence)
    verbs = 0
    presTense = 0
    
    textPOS = nltk.pos_tag(text)
    for tag in textPOS:
        if str(tag[1]).startswith("V"):
            verbs += 1
            # present-tense tags from the list above (VBP, VBZ)
            if str(tag[1]) == "VBP" or str(tag[1]) == "VBZ":
                presTense += 1
    
    # no verbs in the sentence: nothing to measure (avoids ZeroDivisionError)
    if verbs == 0:
        return "NPRES"
    
    #calculate present tense verbs over all verbs
    perc_PresVerbsOverVerbs = 100 * presTense / verbs
    
    if perc_PresVerbsOverVerbs >= 50:
        return perc_PresVerbsOverVerbs
    else:
        return "NPRES"



Past tense chance: 50.00 %


## Implementation of Inference Rules

### Import modules

In [None]:
# Import all necessary modules for EVERYTHING here
import os
import sys
import os.path
import string
import time
import re

import math
from textblob import TextBlob as tb
import nltk
from nltk import word_tokenize, sent_tokenize
from nltk.util import ngrams

import unicodedata
import contractions
import inflect
from bs4 import BeautifulSoup
from tabulate import tabulate

### Rules implementation

In [None]:
##### RULES #####

### ------------------EX--------------------
def EX_1(sent):
    nrOfWordsFound = 0
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    
    if "example" in monograms: nrOfWordsFound += 1
    elif "for instance" in bigrams: nrOfWordsFound += 1
    elif "assume" in monograms: nrOfWordsFound += 1
    elif "suppose" in monograms: nrOfWordsFound += 1
    elif "imagine" in monograms: nrOfWordsFound += 1
    elif "as" in monograms: nrOfWordsFound += 1
    elif "simulation" in monograms: nrOfWordsFound += 1
    elif "diagram" in monograms: nrOfWordsFound += 1
        
    if nrOfWordsFound > 0:
        return "EX"
    else: 
        return "NOLBL"
    
### ------------------EX--------------------
    
def EX_2(sent):
    mainWordNR = 0      # Let's
    secondaryWordsNR = 0     
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    trigrams = get_ngrams(sent, 3)
    
    if "let's" in bigrams or "let 's" in bigrams: mainWordNR += 1
    
    if "try" in monograms: secondaryWordsNR += 1
    elif "think" in monograms: secondaryWordsNR += 1
    elif "see" in monograms: secondaryWordsNR += 1
    elif "pick" in monograms: secondaryWordsNR += 1
    elif "take a look" in trigrams: secondaryWordsNR += 1
    elif "say" in monograms: secondaryWordsNR += 1
        
    if secondaryWordsNR > 0 and mainWordNR > 0:
        return "EX"
    else:
        return "NOLBL"
    
### -------------------CD-------------------

def CD_1(sent):
    mainWordNR = 0      # Let's     # main word looking for in conjunction with one or more of the secondary words
    secondaryWordsNR = 0 
    negWords = 0      # words that must NOT occur for the label to apply, i.e. this should stay at ZERO
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    
    if "let's" in bigrams or "let 's" in bigrams: mainWordNR += 1
    
    if "look at" in bigrams: secondaryWordsNR += 1
    elif "make" in monograms: secondaryWordsNR += 1
    elif "put" in monograms: secondaryWordsNR += 1
    elif "do" in monograms: secondaryWordsNR += 1
    elif "start" in monograms: secondaryWordsNR += 1
    elif "prove" in monograms: secondaryWordsNR += 1
    elif "back" in monograms: secondaryWordsNR += 1
    elif "try" in monograms: secondaryWordsNR += 1
    elif "just" in monograms: secondaryWordsNR += 1
    elif "be" in monograms: secondaryWordsNR += 1
    elif "take" in monograms: secondaryWordsNR += 1
    elif "bring" in monograms: secondaryWordsNR += 1

    if "example" in monograms: negWords += 1
    if "diagram" in monograms: negWords += 1
    if "assume" in monograms: negWords += 1
    if "imagine" in monograms: negWords += 1
    if "suppose" in monograms: negWords += 1
        
    if secondaryWordsNR > 0 and mainWordNR > 0 and negWords == 0:
        return "CD"
    else:
        return "NOLBL"

### --------------------------------------

def CD_2(sent):
    # TODO: `in other words` && present tense (not implemented yet)
    return "NOLBL"

### --------------------------------------

# TODO: remaining rules (CD_3, CD_4, SM_1 to SM_5, AP_1 to AP_3, CM_1 to CM_5)

### --------------------------------------

### Final label count method + title keywords extraction to search for CMs

In [124]:
testSent = "so let's just get rid of the ones we don't need, for example"
res = EX_1(testSent)
# ----------- disregard the above test ---------

# WORKS - takes a file name
# USE to find keywords from the title, then find the first occurrence of those concepts in the text and label it as CM
def title_getConcept(oFileNoExt):
    
    # remove all surrounding stuff like my naming convention etc from the title 
    mainTitlelist = oFileNoExt.split("_")
    maintitle = mainTitlelist[1]
    punct = {'_','-','.'}
    finaltitle = ""
    
    # extract the final title, i.e. the main part of the title,
    # replacing punctuation with spaces
    for letter in maintitle:
        if letter in punct:
            finaltitle += " "
        else:
            finaltitle += letter
    
    # drop the language marker 'en' and any empty tokens
    title_keywords = [word for word in finaltitle.split() if word != 'en']
    
    return title_keywords  # returns a list of keywords
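
Rule CM_5 needs the first occurrence of each title keyword in the text. A minimal sketch of that lookup, assuming the sentences are available as a list of strings (the function name `first_occurrences` is hypothetical, and words are compared as lowercase whitespace-separated tokens, so attached punctuation is not stripped):

```python
def first_occurrences(sentences, keywords):
    """Map each keyword to the index of the first sentence that contains it."""
    found = {}
    for index, sentence in enumerate(sentences):
        words = set(sentence.lower().split())
        for keyword in keywords:
            key = keyword.lower()
            if key not in found and key in words:
                found[key] = index
    return found


sentences = [
    "today we talk about trees",
    "a binary tree has nodes",
    "binary search is fast",
]
print(first_occurrences(sentences, ["Binary", "search"]))
# {'binary': 1, 'search': 2}
```

The sentences at the returned indices are the candidates for the CM label.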


### -----------------------------

# WORKS 
# USE at the end to export only one label per sentence - the one with the majority vote
# NOTE: ties are NOT handled specially; the first label in the order EX, AP, CD, CM, SM wins.
#       Returns "" if no rule assigned a label.
def getFinalLabel(curSentLabels):     # gets the list of assigned labels after all rules have been checked
    # counting the labels returned from the rules checks
    counts = {"EX": 0, "AP": 0, "CD": 0, "CM": 0, "SM": 0}
    for item in curSentLabels:
        if item in counts:      # skips NOLBL items
            counts[item] += 1

    # getting the label with the max count
    maxLabel, maxCount = "", 0
    for label, count in counts.items():
        if count > maxCount:
            maxCount = count
            maxLabel = label

    return maxLabel
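
The equal-count case mentioned in the note can be made explicit by returning every label that shares the top count. This is only a sketch of one way to detect ties; `majority_labels` is a hypothetical name and is not used elsewhere in this notebook.

```python
from collections import Counter


def majority_labels(cur_sent_labels, ignore=("NOLBL",)):
    """Return all labels that share the highest count, sorted alphabetically."""
    counts = Counter(label for label in cur_sent_labels if label not in ignore)
    if not counts:
        return []
    top = max(counts.values())
    return sorted(label for label, count in counts.items() if count == top)


print(majority_labels(["EX", "CD", "EX", "NOLBL"]))  # ['EX']
print(majority_labels(["EX", "CD"]))                 # ['CD', 'EX']
```

A result with more than one element signals a tie, which the caller can then resolve, for example by falling back to the previous sentence's label.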

### Main program

In [116]:
### ---------- Global variables ----------------------------------------
correctLabels = 0
totalSentences = 0
# accuracy is only defined once at least one sentence has been processed
accuracy = correctLabels / totalSentences if totalSentences > 0 else 0.0

### ---------- Local variables - reset per file ------------------------
lineNR = 0
totalLines = 0
curSentLabels = []          # All the labels assigned to the current sentence (to get majority vote from it later)

originalLabel = ""          # label from the sentence
finalMajorityLabel = ""          # label assigned after the rules application
finalMajorityLabelUsers = ""     # label assigned by other users (majority vote)
##### ----------------------------------------------------------------------------------------------------- #####

ignore_dict = ['inaudible','OMITTED','NUMBER','sound','music','laughter','yeah','blank_audio']


def get_ngrams(text, n):
    n_grams = ngrams(word_tokenize(text), n)
    return [' '.join(grams) for grams in n_grams]

# quick test of the rules implemented so far (get_ngrams must be defined before they are called)
curSentLabels.append(EX_1(testSent))
curSentLabels.append(EX_2(testSent))
curSentLabels.append(CD_1(testSent))

# label_sentences(takesAFile)
# get_ngrams(takesASentence)
# ruleX(takesASentence)
# main(takesInputFile)
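
To make clear what `get_ngrams` returns, here is a dependency-free sketch of the same idea. It is an approximation: `simple_ngrams` is a hypothetical helper that splits on whitespace and lowercases the text, whereas `word_tokenize` also splits contractions such as "let's" into separate tokens and keeps the original case.

```python
def simple_ngrams(text, n):
    """Return all n-grams of the text as space-joined, lowercase strings."""
    tokens = text.lower().split()
    # range() is empty when the text has fewer than n tokens
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


trigrams = simple_ngrams("Let's take a look at this", 3)
print("take a look" in trigrams)  # True
```

A multi-word phrase can only be found in the n-gram list of matching length, so a three-word phrase has to be looked up in the trigrams.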

In [None]:
def main(iFile, oPathNoExt):    # main application with all logic following the pseudocode
    print("[LABELLING file: ] " + os.path.basename(iFile.name))
    baseName = oPathNoExt.split(".en", 1)[0]
    OFName = baseName + ".en_AutoRuleLabels.txt"

    sentences = iFile.read().split("\n")
    lastLabel = ""      # last label assigned, used as the fallback (General Rules, item 2)

    with open(OFName, "w") as oFile:    # the output file is written in the same place as the original file
        for lineNR, line in enumerate(sentences):
            if "|" not in line:         # skip empty or malformed lines
                continue
            originalLabel, sent = line.split("|", 1)
            if originalLabel == "NL":   # NL sentences are ignored entirely
                continue

            curSentLabels = []
            if lineNR < 10 or lineNR > len(sentences) - 10:   # threshold of first or last 10 sentences
                pastTense = checkPastTense(sent)
                if pastTense != "NPAST" and pastTense > 60:
                    pass    # TODO: go over the SM and CM rules once they are implemented
            else:
                curSentLabels.append(EX_1(sent))
                curSentLabels.append(EX_2(sent))
                curSentLabels.append(CD_1(sent))
                # TODO: go over the AP and CM rules once they are implemented

            # majority vote; fall back to the last assigned label if no rule fired
            finalMajorityLabel = getFinalLabel(curSentLabels) or lastLabel
            lastLabel = finalMajorityLabel
            oFile.write(finalMajorityLabel + "|" + sent + "\n")   # write the final label and sentence

### RUN THIS PART - going over all files and calling the main program on each of them

In [114]:
# TODO
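
One possible shape for this step, as a sketch under assumptions: the transcripts are assumed to sit under a single root folder and to end in `.en.txt` (the real naming convention may differ), and `run_on_all` and `find_transcripts` are hypothetical helpers. The `process` argument is any callable with the same parameters as `main(iFile, oPathNoExt)`.

```python
import os


def find_transcripts(root, suffix=".en.txt"):
    """Yield the paths of all files under root whose name ends with suffix."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(suffix):
                yield os.path.join(dirpath, name)


def run_on_all(root, process, suffix=".en.txt"):
    """Open every matching file and pass it to process(iFile, oPathNoExt)."""
    count = 0
    for path in find_transcripts(root, suffix):
        with open(path, encoding="utf-8") as in_file:
            # strip only the last extension, so "x.en.txt" becomes "x.en"
            process(in_file, os.path.splitext(path)[0])
        count += 1
    return count
```

With the functions above in scope, `run_on_all("path/to/transcripts", main)` would label every matching file and return how many were processed.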