### How do researchers deal with it:
- [Word embeddings Wiki](https://en.wikipedia.org/wiki/Word_embedding)
- [Gensim Python library](https://en.wikipedia.org/wiki/Gensim)
- [Inference Rules Wiki](https://en.wikipedia.org/wiki/Rule_of_inference)

### LSTM: 
- [Long-Short term memory (LSTM)](https://www.datacamp.com/community/tutorials/lstm-python-stock-market#lstm)
- [Learn via example](https://iamtrask.github.io/2015/11/15/anyone-can-code-lstm/)

### Literature:
- [Intent extraction from social media texts using sequential segmentation and deep learning models](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8119461) uses CRFs and Bi-LSTM for intent extraction from texts from social media in 2 categories - Cosmetics and Tourism. Look into these algos
    - Citation: 
`@INPROCEEDINGS{8119461, 
author={T. L. Luong and M. S. Cao and D. T. Le and X. H. Phan}, 
booktitle={2017 9th International Conference on Knowledge and Systems Engineering (KSE)}, 
title={Intent extraction from social media texts using sequential segmentation and deep learning models}, 
year={2017}, 
pages={215-220}, 
doi={10.1109/KSE.2017.8119461}, 
month={Oct},}`


- In [Semantic Indexing for Recorded Educational Lecture Videos](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1598977) they extracted scripts from videos with timestamps on each word and cluster them in order to allow for finding of the exact position of a particular thing in the video. They also use a retrieval method to find “example”, “explanation”, “overview”, “repetition”, “exercise” for a particular word or topic word. 
    - Citation: `@INPROCEEDINGS{1598977, 
author={S. Repp and M. Meinel}, 
booktitle={Fourth Annual IEEE International Conference on Pervasive Computing and Communications Workshops (PERCOMW'06)}, 
title={Semantic indexing for recorded educational lecture videos}, 
year={2006}, 
pages={5 pp.-245}, 
month={March},}`


- In [Olex: Effective Rule Learning for Text Categorization](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4641927) Sees the problem as a text classification task and applied Inference Rules onto it. Not particularly for intent mining, but for different categories, similar to what I have. The inference rules are of the form: \begin{equation}If \space T_1 \space or \space \dots \space T_n \space occurs \space in \space document \space d,\space and \space none \space of \space T_{n+1} \dots T_{n+m} \space occurs \space in \space d, \space then \space classify \space d \space under \space category \space C \end{equation}  This includes `one` positive literal and `0+` negative literals and temrs are `n-grams`

    - Citation `@ARTICLE{4641927, 
author={P. Rullo and V. L. Policicchio and C. Cumbo and S. Iiritano}, 
journal={IEEE Transactions on Knowledge and Data Engineering}, 
title={Olex: Effective Rule Learning for Text Categorization}, 
year={2009}, 
volume={21}, 
number={8}, 
pages={1118-1132}, 
doi={10.1109/TKDE.2008.206}, 
ISSN={1041-4347}, 
month={Aug},}`

# Inference Rules method

## Theory
### General Rules
1. If sentence has no label, proceed with label search.
2. If no label can be assigned, assign the last applied labeled from a previous sentence

-----------------------------------
### ALL RULES
- [✔] EX_1 <<< `example` || OR `for instance` || `assume` || `suppose` || `imagine` || `as` || `simulation` || `diagram` 
- [✔] EX_2 <<< `Let's` __&&__ try || think || see || pick || take a look || say ..
- [✔] CD_1 <<< `Let's` __&&__ look at || make || put || do || start || prove || evaluate || back || try || just __AND NO__ `example` || `assume` || `suppose` || `imagine` || `diagram`
- [✔] CD_2 <<< `in other words` || `basically` __AND NO__ `should` || `have to` || `must` 
- [✔] CD_3 <<< `so` is the first word in the sentence __&&__ `it's` || `i'm`
- [✔] CD_4 <<< `so this is` || `actually` __AND NO__ `example` || `summary` || `next` || `last`
- [✔] CD_5 <<< `means` || `mean` || `given` || `define` || `explain` __AND NO__ Present continuous tense (going to)
- [✔] CD_6 <<< `what if` __AND NO__ `example` || `instance`
- [✔] SM_1 <<< `Let's` __&&__ summarize || `recap` 
- [✔] SM_2 <<< `in other words` __&&__ past tense
- [✔] SM_3 <<< `this week` || `this lesson` || `today` __&&__ present tense || future tense
- [✔] SM_4 <<< `later` || `next time` || `last time` || `summary` || `summarize` || `here is` || `here are` || `discuss` || `next` 
- [✔] SM_5 <<< if (lineNr < 10 `OR` lineNr > fileLinesNr - 10) __&&__ (past tense) =>> (within the first or last 10 lines + past tense)
- [✔] SM_6 <<< `going to` __&&__ `look` || `see` || `be` || `think` || `explain` || `explained` __AND NO__ present tense (&& future or past)
- [✔] AP_1 <<< `in other words` __&&__ `should` || `could` || `would` [✔]
- [✔] AP_2 <<< `encourage` || `step` || `first` || `finally` || `second` || `should` || `could` || `would` || `best practice(s)` || `need to`
- [✔] AP_3 <<< `if` __&&__ `use` || `can` || `should` || `could` || `want`
- [✔] CM_1 <<< `called` __&&__ concept
- [✔] CM_2 <<< `what is` .. __&&__ concept
- [✔] CM_3 <<< `theorem` || `algorithm` || `method` || `let's use` || `theory`
- [✔] CM_4 <<< first occurence of the terms in the title of the file
- [**X**] CM_5 <<< `let's` __&&__ `use` - **REPEATS CM_3** 

-----------------------------------

### Logical expressions (copy/paste in thesis later - LATEX style)
#### ♦ EXAMPLE
1. \begin{equation} d \leftarrow EX \space, if\space ("let's" \in d \space) \space \land ("try" \in d \space \lor "see" \in d \space \lor "think" \in d \space \lor "pick" \in d \space \lor "say" \in d) \end{equation} 

2. \begin{equation} d \leftarrow EX \space, if\space ("example" \in d \space) \lor ("for \space instance" \in d) \space \lor ("suppose" \in d) \space \lor ("assume" \in d) \space \lor ("includes" \in d) \space \lor ("imagine" \in d) \space \end{equation} 

Latex Formula Formatter: https://www.codecogs.com/eqnedit.php

## Pseudocode

`Disregard all sentences that have NL label, totally ignore, then: [✔]
    if sentence has no label:
        for line in text:
            if lineNR < 10 OR lineNR > nrOfLines-10:
                for word in line:
                    if (60%+ of the words on the line are in PAST TENSE):
                        go over the SM rules  (append res to curSentLabels)
                        go over the CM rules  (append res to curSentLabels)
                    if (60%+ of the words on the line are in PRESENT TENSE):
                        go over the CD rules  (append res to curSentLabels)
                    if (60%+ of the words on the line are in FUTURE TENSE):
                        go over the CM rules  (append res to curSentLabels)
            if lineNR > 10 AND lineNR < nrOfLines-10:
                go over the EX rules  (append res to curSentLabels)
                go over the CD rules  (append res to curSentLabels)
                go over the AP rules  (append res to curSentLabels)
                go over the CM rules  (append res to curSentLabels)
        Count the labels with a special method for this: [✔]
            if all rules fail to assign a label, i.e. if all labels return count 0:
                search for the last labeled sentence:
                    assign its label to the current sentence`                                                     

### [THEORY] Checking the tense of the verbs in the sentence

- [All POS Tags](http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

NLTK PPOS TAGS for verbs:

- VBD		verb, past tense					`took`      +1 if checked for PAST, else 0
- VBN		verb, past participle				`taken`     +1 if checked for PAST, else 0
- VB		verb, base form						`take`      +0.5
- VBG		verb, gerund/present participle		`taking`    +0.5
- VBP		verb, sing. present, non-3d			`take`      +1 if checked for PRESENT, else 0
- VBZ		verb, 3rd person sing. present		`takes`     +1 if checked for PRESENT, else 0
- MD        modal verb (will, shall)            `will`      +1 if checked for FUTURE, else 0


**Simply counting verbs isn't enough, artificial boost is added if certain types of verbs are present so that they
form a specific English tense. All listed below:**

- **Past perfect**: `VBD`(had) + `VBN`(been) ------- BUT NO VBG(-ing/gerund)
- **Past continuous tense**: `VBD`(was/were) + `VBG`(-ing/gerund)
- **Past perfect continuous**: `VBD`(had) + `VBN`(been) + `VBG`(-ing/gerund)
- **PRESENT perfect**: `VBP`(have) + `VBN`(been) ------- BUT NO VBG(-ing/gerund)
- **PRESENT perfect continuous**: `VBP`(have) + `VBN`(been) + `VBG`(-ing/gerund)
- **PRESENT continuous**: `VBP`(is/are) + `VBG`(-ing/gerund)
- **Future continuous**: `MD`(WILL) + `VBG`(-ing/gerund)
- **Future perfect**: `MD`(will) + `VB`(have) + `VBN`(PP)
- **Future perfect continuous**: `MD`(will) + `VBN`(been) + `VB`(have) + `VBG`(-ing/gerund)

## Implementation of Inference Rules

### Import modules

In [1]:
# Import all necessary modules for EVERYTHING here
import os
import sys
import os.path
import string
import time
import re

import math
from textblob import TextBlob as tb
import nltk
from nltk import word_tokenize, sent_tokenize
from nltk.util import ngrams

import re, string, unicodedata
import contractions
import inflect
from bs4 import BeautifulSoup
from tabulate import tabulate

### Check for present / past / future tense

All return a number which is the percentage of likelihood that the sentence is of certain tense

**Version 1** checked also for the more complicated tenses such as perfect, continuous etc of each present, past, future by assigning addiotnal scores according to whether all criteria is covered, in order to make the output higher and ==> more certain.

**Version 2** only counts the types of verbs for each tense
- checkPRES(sentence)
- checkPAST(sentence)
- checkFUTURE(sentence)

#### Version 1

In [2]:
# ======================================== PAST TENSE CHECK ==============================================

puncDict = [",",".",";","-","_","`","'","?","!",":"]

def checkPAST2(sentence):
    text = nltk.word_tokenize(sentence)
    verbs, pastTense, grammarCase, VBGgerund, VBNpp, VBDHad = 0, 0, 0, 0, 0, 0
    
    textPOS = nltk.pos_tag(text)
    for tag in textPOS:
        if str(tag[1]).startswith("V"):
#            print(tag)
            verbs += 1
            if str(tag[1]) == "VBD": 
                pastTense += 1
                VBDHad += 1
            elif str(tag[1]) == "VBN": 
                pastTense += 1
                VBNpp += 1
            elif str(tag[1]) == "VBG": VBGgerund += 1
            else: pass
    
    # artifial boost over the past tense verbs +1 if there is at least one from all gerund, past participle 
    # and modal verb that defines past continuous tense
    # TODO re-evaluate points
    #  1   1   0.5
    # VBD VBN  VBG
    #print("VBD ",VBDHad," VBN ",VBNpp," VBG ",VBGgerund)
    if VBDHad > 0 and VBNpp > 0 and VBGgerund == 0: grammarCase += 2                   #Past perfect 
    elif VBDHad > 0 and VBNpp == 0 and VBGgerund > 0: grammarCase += 1.5                #Past continuous tense 
    elif VBDHad > 0 and VBNpp > 0 and VBGgerund > 0: grammarCase += 2.5                  #Past perfect continuous
    else: grammarCase += 0
        
    if pastTense == 0:
        return -1
    else:
        #grammarCase = grammarCase * 0.1 
        # big / small = 100 / x
        #calculate verbs over words
        ratio_verbs = len(text) / verbs
        perc_verbs= (100 / ratio_verbs) 

        #calculate past tense verbs over all verbs
        ratio_PastVerbs = verbs / pastTense
        perc_PastVerbsOverVerbs = (100 / ratio_PastVerbs)

        #calculate past tense verbs over all words
        ratio_PastVerbsOverWords = len(text) / pastTense
        perc_VerbsPastTenseOverWords = 100 / ratio_PastVerbsOverWords

        """
        if perc_PastVerbsOverVerbs >= 30:
            return perc_PastVerbsOverVerbs
        else:
            print("{0:.2f}".format(perc_PastVerbsOverVerbs))
            return "NPAST"
        """

        if not grammarCase == 0:
            return grammarCase * 100
        else:
            if perc_PastVerbsOverVerbs >= 30:
                return perc_PastVerbsOverVerbs
            else:
                #print("{0:.2f}".format(perc_PastVerbsOverVerbs))
                return -1
            
            
# ========================================= PRESENT TENSE CHECK =============================================
    
def checkPRESENT2(sentence):
    text = nltk.word_tokenize(sentence)
    verbs, presTenseVerbs, grammarCase, VBGgerund, VBPhave, VBNbeen = 0, 0, 0, 0, 0, 0 
    
    textPOS = nltk.pos_tag(text)
    for tag in textPOS:
        if str(tag[1]).startswith("V"):
            verbs += 1
            if str(tag[1]) == "VBP": 
                presTenseVerbs += 1
                VBPhave += 1
            elif str(tag[1]) == "VBN": 
                VBNbeen += 1
            elif str(tag[1]) == "VBG":
                presTenseVerbs += 1
                VBGgerund += 1
            elif str(tag[1]) == "VB":
                presTenseVerbs += 1
            elif str(tag[1]) == "VBZ":
                presTenseVerbs += 1
            else: pass
                
    # artifial boost over the past tense verbs +1 if there is at least one from all gerund, past participle 
    # and modal verb that defines past continuous tense.
    #print("VBG ",VBGgerund," VBP ",VBPhave," VBN ",VBNbeen)
    # TODO re-evaluate points
    # 0.5   1    0
    # VBG  VBP  VBN
    
    if VBGgerund == 0 and VBPhave > 0 and VBNbeen > 0: grammarCase += 1                   # PRESENT perfect
    elif VBGgerund > 0 and VBPhave > 0 and VBNbeen == 0: grammarCase += 1.5               # PRESENT continuous 
    elif VBGgerund > 0 and VBPhave > 0 and VBNbeen > 0: grammarCase += 1.5                # PRESENT perfect continuous
    else: grammarCase += 0
    
    #grammarCase = grammarCase * 0.1
    #calculate past tense verbs over all verbs
    if presTenseVerbs == 0:
        return -1
    else:
        ratio_PresVerbs = verbs / presTenseVerbs
        perc_PresVerbsOverVerbs = (100 / ratio_PresVerbs)

        if not grammarCase == 0:
            return grammarCase * 100
        else:
            if perc_PresVerbsOverVerbs >= 30:
                return perc_PresVerbsOverVerbs
            else:
                #print("{0:.2f}".format(perc_PresVerbsOverVerbs))
                return -1


# ======================================= FUTURE TENSE CHECK ===============================================

def checkFUTURE2(sentence):
    text = nltk.word_tokenize(sentence)
    verbs, futureTenseWords, grammarCase, MDmodal, VBGgerund, VBhave, VBNpp = 0, 0, 0, 0, 0, 0, 0
    
    textPOS = nltk.pos_tag(text)
    for tag in textPOS:
        if str(tag[1]).startswith("V"):
            verbs += 1
            if str(tag[1]) == "VBG":
                VBGgerund += 1
                futureTenseWords += 1
            elif str(tag[1]) == "VB":
                VBhave += 1
            elif str(tag[1]) == "VBN":
                VBNpp += 1
        if str(tag[1]) == "MD" and str(tag[0]) == "will" or str(tag[0]) == "shall":
            MDmodal += 1
            futureTenseWords += 1
                
    # artifial boost over the past tense verbs +1 if there is at least one from all gerund, past participle 
    # and modal verb that defines past continuous tense.
    #print("VBG ",VBGgerund," VBP ",VBPhave," VBN ",VBNbeen)
    # TODO re-evaluate points
    # 0.5   0    0    1
    # VBG   VB  VBN   MD
    
    if MDmodal > 0 and VBhave > 0 and VBNpp > 0: grammarCase += 1                         # Fut. perf.    MD/VB/VBN
    elif MDmodal > 0 and VBGgerund > 0: grammarCase += 1.5                                # Fut. cont.  MD/VBG
    elif MDmodal > 0 and VBhave > 0 and VBNpp > 0 and VBGgerund > 0: grammarCase += 1.5   # Fut. perf. cont. MD/VB/VBN/VBG
    else: grammarCase += 0
    
    #grammarCase = grammarCase * 1.0
    #calculate past tense verbs over all verbs
    if futureTenseWords == 0:
        return -1
    else:
        ratio_FutureVerbs = verbs / futureTenseWords
        perc_FutureWordsOverVerbs = (100 / ratio_FutureVerbs)

        if not grammarCase == 0:
            return grammarCase * 100
        else:
            if perc_FutureWordsOverVerbs >= 30:
                return perc_FutureWordsOverVerbs

#### Version 2 (USE THIS)

In [3]:
# ================================= SECOND VERSION OF TENSE CHECK ========================================

# ======================================== PAST TENSE CHECK ==============================================

puncDict = [",",".",";","-","_","`","'","?","!",":"]

def checkPAST(sentence):
    text = nltk.word_tokenize(sentence)
    verbs, pastTense, VBGgerund = 0, 0, 0
    
    textPOS = nltk.pos_tag(text)
    for tag in textPOS:
        if str(tag[1]).startswith("V"):
#            print(tag)
            verbs += 1
            if str(tag[1]) == "VBD": pastTense += 1
            elif str(tag[1]) == "VBN": pastTense += 1
            elif str(tag[1]) == "VBG": pastTense += 1
            else: pass
        
    if pastTense == 0:
        return -1
    else:
        #calculate past tense verbs over all verbs
        ratio_PastVerbs = verbs / pastTense
        perc_PastVerbsOverVerbs = (100 / ratio_PastVerbs)
        
        return perc_PastVerbsOverVerbs
            
            
# ========================================= PRESENT TENSE CHECK =============================================
    
def checkPRESENT(sentence):
    text = nltk.word_tokenize(sentence)
    verbs, presTenseVerbs = 0, 0 
    
    textPOS = nltk.pos_tag(text)
    for tag in textPOS:
        if str(tag[1]).startswith("V"):
            verbs += 1
            if str(tag[1]) == "VBP": presTenseVerbs += 1
            elif str(tag[1]) == "VBG":
                presTenseVerbs += 1
            elif str(tag[1]) == "VBZ": presTenseVerbs += 1
            else: pass
            
    if presTenseVerbs == 0:
        return -1
    else:
        ratio_PresVerbs = verbs / presTenseVerbs
        perc_PresVerbsOverVerbs = (100 / ratio_PresVerbs)
        
        return perc_PresVerbsOverVerbs


# ======================================= FUTURE TENSE CHECK ===============================================

def checkFUTURE(sentence):
    text = nltk.word_tokenize(sentence)
    verbs, futureTenseWords, grammarCase = 0, 0, 0
    
    textPOS = nltk.pos_tag(text)
    for tag in textPOS:
        if str(tag[1]).startswith("V"):
            verbs += 1
            if str(tag[1]) == "VBG": futureTenseWords += 1
        if str(tag[1]) == "MD" and str(tag[0]) == "will" or str(tag[0]) == "shall":
            futureTenseWords += 1
            
    if futureTenseWords == 0:
        return -1
    else:
        ratio_FutureVerbs = verbs / futureTenseWords
        perc_FutureWordsOverVerbs = (100 / ratio_FutureVerbs)
        
        return perc_FutureWordsOverVerbs

In [4]:
sentence = """
that means we will create this next time
"""

# -------- check sent for PAST tense
print("Past tense chance: {} %".format(int(checkPAST(sentence))))

# -------- check sent for PRESENT tense    
print("Present tense chance: {} %".format(int(checkPRESENT(sentence))))

# -------- check sent for FUTURE tense
print("Future tense chance: {} %".format(int(checkFUTURE(sentence))))

print("i am going to eat a kiwi".find("to"))

Past tense chance: -1 %
Present tense chance: 50 %
Future tense chance: 50 %
11


### Functions:
- Final label count 
- title keywords exttract to search for CMs
- extract n-grams

In [5]:
def get_ngrams(text, n):
    n_grams = ngrams(word_tokenize(text), n)
    return [' '.join(grams) for grams in n_grams]

# WORKS - takes a file name
# USE to find key words from tghe title and find the first occurence of the concepts in the text and label as CM
def title_getConcept(oFileNoExt):
    
    # remove all surrounding stuff like my naming convention etc from the title 
    mainTitlelist = oFileNoExt.split("_")
    maintitle = mainTitlelist[1]
    punct = {'_','-','.'}
    finaltitle = ""
    
    # extract the final title, i.e. the main part of the title
    for word in maintitle.split():
        for letter in word:
            if letter in punct:
                finaltitle += " "
                pass
            else:
                finaltitle += letter
    
    title_keywords = []
    for word in finaltitle.split(" "):
        if word == 'en':
             pass
        else:
            title_keywords.append(word)
    
    return title_keywords  # returns a list of keywords


### -----------------------------

# WORKS 
# USE at the end to export only one label per sentence - the one with the majority vote
# NOTE: equal case is NOT considered, so it may crash
def getFinalLabel(curSentLabels):     # gets the list of assigned labels after all rules have been checked
    EX,AP,CD,CM,SM, maxCount = 0, 0, 0, 0, 0, 0
    labels = []
    maxLabel = ""
    
    # counting the labels returned from the rules checks
    for item in curSentLabels:
        if item == "EX": EX += 1
        elif item == "AP": AP += 1
        elif item == "CD": CD += 1
        elif item == "CM": CM += 1
        elif item == "SM": SM += 1
        else: pass   #pass NOLBL items

    # adding the labels into a list of items to make it easy to get the max value
    labels.append("EX,"+str(EX))
    labels.append("AP,"+str(AP))
    labels.append("CD,"+str(CD))
    labels.append("CM,"+str(CM))
    labels.append("SM,"+str(SM))

    # getting the max value 
    for item in labels:
        parts = item.split(",")
        label = parts[0]
        count = int(parts[1])
        if count > 0:
            if count > maxCount:
                maxCount = count
                maxLabel = label

    return maxLabel

### Rules implementaion

In [97]:
##### RULES #####

### ------------------EX--------------------
def EX_1(sent):
    nrOfWordsFound = 0
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    
    words = {"example", "for instance", "assume", "suppose", "imagine", "as", "simulation", "diagram"}
    
    for wd in words:
        if len(wd) == 2:
            if wd in bigrams:
                nrOfWordsFound += 1
        else:
            if wd in monograms:
                nrOfWordsFound += 1
        
    if nrOfWordsFound > 0:
        return "EX"
    else: 
        return "NOLBL"
    
### ------------------EX--------------------
    
def EX_2(sent):
    mainWordNR = 0      # Let's
    secondaryWordsNR = 0     
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    trigrams = get_ngrams(sent, 3)
    
    words = {"try", "think", "see", "pick", "take a look", "say"}
    
    if ("let's" or "let 's") in bigrams: mainWordNR += 1
    
    for wd in words:
        if len(wd) == 3:
            if wd in trigrams:
                secondaryWordsNR += 1
        if wd in monograms:
                secondaryWordsNR += 1
        
    #print(," ", " ",)
    #print(mainWordNR," ",secondaryWordsNR)
    if secondaryWordsNR > 0 and mainWordNR > 0:
        return "EX"
    else:
        return "NOLBL"
    
### -------------------CD-------------------

def CD_1(sent):
    mainWordNR = 0      # Let's     # main word looking for in conjunction with one or more of the secondary words
    secondaryWordsNR = 0 
    negWordsNR = 0      # words that must NOT occur for the label to apply, i.e. this should stay at ZERO
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    
    secondaryWords = {"look at", "make", "put", "do", "start", "prove", "back", "try", "just", "be", "take", "bring"}
    negWords = {"example", "diagram", "assume", "imagine", "suppose"}
    
    if ("let's" or "let 's") in bigrams: mainWordNR += 1
    
    for wd in secondaryWords:
        if len(wd) == 2:
            if wd in bigrams:
                secondaryWordsNR += 1
        else:
            if wd in monograms:
                secondaryWordsNR += 1
    
    for wd in negWords:
        if wd in monograms:
                negWordsNR += 1
    
    if secondaryWordsNR > 0 and mainWordNR > 0 and negWordsNR == 0:
        return "CD"
    else:
        return "NOLBL"

### --------------------------------------

def CD_2(sent):
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    trigrams = get_ngrams(sent, 3)
    
    if ("in other words" or "basically") in monograms and not ("should" or "must") in monograms or not "have to" in bigrams and checkPRESENT(sent) >= 50:
        return "CD"
    else: return "NOLBL"
    
### --------------------------------------

def CD_3(sent):
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    
    if monograms[0] == "so" and ("it 's" or "it's") or ("i 'm" or "i'm") in bigrams: return "CD"
    else: return "NOLBL"
    
### --------------------------------------

def CD_4(sent):
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    trigrams = get_ngrams(sent, 3)
    if "so this is" in trigrams or "actually" in monograms and not ('example' or 'summary' or 'next' or 'last') in monograms: return "CD"
    else: return "NOLBL"
    
### --------------------------------------   

def CD_5(sent):
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    
    if ("mean" or "given" or "define") in monograms and checkPRESENT(sent) >= 50 or "going to" in bigrams: return "CD"
    else: return "NOLBL"

### --------------------------------------
    
    
def CD_6(sent):
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    
    if "what if" in bigrams and not ("example" or "instance") in monograms: return "CD"
    else: return "NOLBL"
    
    
### --------------------------------------
def SM_1(sent):
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    
    if ("let's" or "let 's") in bigrams and "summarize" or "recap" in monograms: return "SM"
    else: return "NOLBL"

### --------------------------------------

def SM_2(sent):
    trigrams = get_ngrams(sent, 3)
    
    if "in other words" in trigrams and (checkFUTURE(sent) or checkPAST(sent)) >= 40: return "SM"
    else: return "NOLBL"

### --------------------------------------

def SM_3(sent):
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    
    if (("this week" or "this lesson") in bigrams or "today" in monograms) and (checkPRESENT(sent) or checkFUTURE(sent)) >= 50: 
        return "SM"
    else: return "NOLBL"

### --------------------------------------

def SM_4(sent):
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    
    words = {'later' , 'next time' , 'last time' , 'summary' , 'summarize' , 
             'here is' , 'here are' , 'discuss' , 'next', 'recap'}
    
    for wd in words:
        if len(wd) == 2:
            if wd in bigrams: return "SM"
        elif wd in monograms: return "SM"
        else: return "NOLBL"

### --------------------------------------

def SM_5(sent, lineNR, NrOfLines):
    if lineNR < 15 or lineNR > NrOfLines - 15 and checkPAST > 40: return "SM"
    else: return "NOLBL"

### --------------------------------------
    
def SM_6(sent):
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    
    words = {"look", "see", "be", "think", "explain"} 
    
    if "going to" in bigrams:
        for wd in words:
            if wd in monograms and checkPRESENT(sent) < 30 and (checkFUTURE(sent) or checkPAST(sent)) > 40 : return "SM"
    else: return "NOLBL"
    
### --------------------------------------

def AP_1(sent):
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    trigrams = get_ngrams(sent, 3)
    
    if "in other words" in trigrams and ("should" or "would" or "could") in monograms: return "AP"
    else: return "NOLBL"

### --------------------------------------

def AP_2(sent):
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    words = {'encourage' , 'step' , 'first' , 'finally' , 'second' , 'should' ,
             'could' , 'would' , 'best practice', 'need to', 'good idea'}
    
    for wd in words:
        if len(wd) == 2:
            if wd in bigrams: return "AP"
        elif wd in monograms: return "AP"
        else: return "NOLBL"  

### --------------------------------------

def AP_3(sent):
    monograms = get_ngrams(sent, 1)
    
    if "if" in monograms and ("use" or "can" or "should", "could" or "want") in monograms: return "AP"
    else: return "NOLBL"

### --------------------------------------

def CM_1(sent, oFileNoExt):
    mainConcepts = title_getConcept(oFileNoExt)
    monograms = get_ngrams(sent, 1)    
    
    if "called" in monograms:
        for wd in mainConcepts:
            if wd in monograms: return "CM"
    else: return "NOLBL"


### --------------------------------------

def CM_2(sent, oFileNoExt):
    mainConcepts = title_getConcept(oFileNoExt)
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    
    if "what is" in bigrams:
        for wd in mainConcepts:
            if wd in monograms: return "CM"
    else: return "NOLBL"
    
### --------------------------------------

def CM_3(sent):
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    
    if ("theory" or "theorem" or "algorithm" or "method") in monograms or ("let's" or "let 's") in bigrams and "use" in monograms: return "CM"
    else: return "NOLBL"

### --------------------------------------

def CM_4(sent, oFileNoExt, seen):
    mainConcepts = title_getConcept(oFileNoExt)
    monograms = get_ngrams(sent, 1)
    
    for gr in monograms:
        for cpt in mainConcepts:
            if (gr in mainConcepts or cpt in monograms) and seen == 0: return "CM"
    else: return "NOLBL"

### Main program logic, called from the file traverse

In [7]:
#---------------------------------------------- STUPID TEST EXAMPLE ------------------------------------
testSent = "this week,in week nine,we're still doing applications of the derivative,but it's not so much word problems anymore"
#res = EX_2(testSent)
#print(res)
# ----------- disregard the above test ---------
curSentLabels = []

ignore_dict = ['inaudible','OMITTED','NUMBER','sound','music','laughter','yeah','blank_audio']

curSentLabels.append(EX_1(testSent))
curSentLabels.append(EX_2(testSent))
curSentLabels.append(CD_1(testSent))
curSentLabels.append(CD_2(testSent))
curSentLabels.append(CD_3(testSent))
curSentLabels.append(CD_4(testSent))
curSentLabels.append(CD_5(testSent))
curSentLabels.append(CD_6(testSent))
curSentLabels.append(SM_1(testSent))
curSentLabels.append(SM_2(testSent))
curSentLabels.append(SM_3(testSent))
curSentLabels.append(SM_4(testSent))
curSentLabels.append(SM_5(testSent, 5, 76))
curSentLabels.append(SM_6(testSent))
curSentLabels.append(AP_1(testSent))
curSentLabels.append(AP_2(testSent))
curSentLabels.append(AP_3(testSent))
curSentLabels.append(CM_1(testSent, "02_backup.en_labels.txt"))
curSentLabels.append(CM_2(testSent, "02_backup.en_labels.txt"))
curSentLabels.append(CM_3(testSent))
curSentLabels.append(CM_4(testSent, "03_data-management-across-the-research-lifecycle.en_labels.txt", 0))

#print(curSentLabels)

print(getFinalLabel(curSentLabels))

# label_sentences(takesAFile)
# get_ngrams(takesASentence)
# ruleX(takesASentence)
# main(takesInputFile)

curSentLabels = []

SM


In [90]:
# TODO
# figure out how to detect whether a term has been seen in a file or not. (meaning a term from the title)

def termSeen(sent, term):
    monograms = get_ngrams(sent, 1)
    
    if term in sent: return 1
    else: return 0
    
"""
    // TODO
    When this returns 0, we can update the value for the respective term in titleTermsSeen from 0 to 1 and 
    we can call CM_4 for each term and it will work only if the term's value is 0, 
    so for each sentence:
        check dictionary, if value is 0, call CM_4
        if CM_4 returns a value "CM", then update the dictionary for that value with 1
        else don't call CM_4 at all
"""

In [99]:
def main(iFile, oPathNoExt):    # main application with all logic following the pseudocode=
    correctLabels = 0
    #accuracy = correctLabels / totalSentences   * 100

    ### ---------- Local variables - reset per file ------------------------
    lineNR = 0
    totalLines = 0
    curSentLabels = []          # All the labels assigned to the current sentence (to get majority vote from it later)

    originalLabel = ""          # label from the sentence
    finalMajorityLabel = ""     # label assigned after the rules application SHOULD NOT BE MAJORITY, 
                                # but at the end only one real label should be inside the list of labels after the evaluation
    finalMajorityLabelUsers = ""     # label assigned by other users (majority vote)
    countCorLabels = 0
    latestLabeledSent = ""
    latestLabeledSentLBL = ""
    titleTerms = ""
    titleTermsSeen = {}

    #accuracy = (countCorLabels/totalSentences) * 100
    #print("Accuracy: {0:.2f} %".format(accuracy))

    print("[LABELLING file: ] " + os.path.basename(iFile.name))
    baseName = oPathNoExt.split(".en", 1)[0]
    OFName = baseName + ".en_AutoRuleLabels.txt"

    sentences = iFile.read().split("\n")
    titleTerms = title_getConcept(oPathNoExt)
    
    for term in titleTerms:
        titleTermsSeen[term] = 0
    
    for sent in sentences:
        totalLines += 1
        

    with open(OFName, "w") as oFile:    # opening the output file to write in the same place where the original file is
        for sent in sentences:
            lineNR += 1
            sentLBLandText = sent.split("|")
            sent = sentLBLandText[1]
            originalLabel = sentLBLandText[0]

            if originalLabel == "NL":
                pass
            else: 
                if finalMajorityLabel == "":
                    if lineNR < 15 or lineNR > len(sentences)-15:   # assigning threshold of first or last 10 sentences
                        if checkPAST(sent) > 50:
                            curSentLabels.append(SM_1(sent))
                            curSentLabels.append(SM_2(sent))
                            curSentLabels.append(SM_3(sent))
                            curSentLabels.append(SM_4(sent))
                            curSentLabels.append(SM_5(sent, lineNR, totalLines))
                            curSentLabels.append(SM_6(sent))
                        elif checkPRESENT(sent) > 50:
                            curSentLabels.append(CD_1(sent))
                            curSentLabels.append(CD_2(sent))
                            curSentLabels.append(CD_3(sent))
                            curSentLabels.append(CD_4(sent))
                            curSentLabels.append(CD_5(sent))
                            curSentLabels.append(CD_6(sent))
                        elif checkFUTURE(sent) > 50:
                            curSentLabels.append(CM_1(sent, oPathNoExt))
                            curSentLabels.append(CM_2(sent, oPathNoExt))
                            curSentLabels.append(CM_3(sent))
                            #curSentLabels.append(CM_4(sent))   # special case, TODO
                    else:
                        curSentLabels.append(EX_1(sent))
                        curSentLabels.append(EX_2(sent))
                        curSentLabels.append(CD_1(sent))
                        curSentLabels.append(CD_2(sent))
                        curSentLabels.append(CD_3(sent))
                        curSentLabels.append(CD_4(sent))
                        curSentLabels.append(CD_5(sent))
                        curSentLabels.append(CD_6(sent))
                        curSentLabels.append(AP_1(sent))
                        curSentLabels.append(AP_2(sent))
                        curSentLabels.append(AP_3(sent))
                        curSentLabels.append(CM_1(sent, oPathNoExt))
                        curSentLabels.append(CM_2(sent, oPathNoExt))
                        curSentLabels.append(CM_3(sent))
                        #curSentLabels.append(CM_4(sent, oPathNoExt, seen))   # special case, TODO
            
            """ ### UPDATE FINALMAJORITYLABEL HERE - IT DOES NOT OUTPUT IN THE FILE!!!! """
            if getFinalLabel(curSentLabels) != ("SM" or "CD" or "CP" or "AP" or "EX"):
                # if there exists a sentence that's already labeled, take its label and assign to the current sent
                # this only happens if no label is returned from the finalMajorityLabel function
                if latestLabeledSent != "":
                    finalMajorityLabel = latestLabeledSentLBL
            else: 
                latestLabeledSent = sent
                latestLabeledSentLBL = getFinalLabel(curSentLabels)
                
            oFile.write(finalMajorityLabel+"|"+sent+"\n") #write the final label and sentence in the output file
            print(finalMajorityLabel)
            if finalMajorityLabel == originalLabel:
                correctLabels += 1
            finalMajorityLabel = "" #reset finalMajorityLabel back to ""   finalMajorityLabel = ""
            titleTerms = ""
        

### RUN THIS PART - going over all files and calling the main program on each of them

In [100]:
# TODO

path = r"C:\Users\a.dimitrova\Desktop\Course data Thesis\INTENT MINING"
counter = 0

for root, subdirs, files in os.walk(path):

    for curFile in os.listdir(root):

        filePath = os.path.join(root, curFile)

        if os.path.isdir(filePath):
            pass

        else:
            if filePath.endswith("_AutoRuleLabels.txt"): pass
            else:
                counter += 1
                curFile = open(filePath, 'r', encoding = "ISO-8859-1") #IMPORTANT ENCODING! UTF8 DOESN'T WORK
                fileExtRemoved = os.path.splitext(os.path.abspath(filePath))[0]

                main(curFile, fileExtRemoved)

                curFile.close()
            
                
print("\nTotal number of {} {} files found.".format(counter, "TXT"))

[LABELLING file: ] 06_cruise-controllers.en_labels.txt












































































































Total number of 1 TXT files found.
