### How do researchers deal with it:
- [Word embeddings Wiki](https://en.wikipedia.org/wiki/Word_embedding)
- [Gensim Python library](https://en.wikipedia.org/wiki/Gensim)
- [Inference Rules Wiki](https://en.wikipedia.org/wiki/Rule_of_inference)

### LSTM: 
- [Long-Short term memory (LSTM)](https://www.datacamp.com/community/tutorials/lstm-python-stock-market#lstm)
- [Learn via example](https://iamtrask.github.io/2015/11/15/anyone-can-code-lstm/)

### Literature:
- [Intent extraction from social media texts using sequential segmentation and deep learning models](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8119461) uses CRFs and Bi-LSTM for intent extraction from texts from social media in 2 categories - Cosmetics and Tourism. Look into these algos
    - Citation: 
`@INPROCEEDINGS{8119461, 
author={T. L. Luong and M. S. Cao and D. T. Le and X. H. Phan}, 
booktitle={2017 9th International Conference on Knowledge and Systems Engineering (KSE)}, 
title={Intent extraction from social media texts using sequential segmentation and deep learning models}, 
year={2017}, 
pages={215-220}, 
doi={10.1109/KSE.2017.8119461}, 
month={Oct},}`


- In [Semantic Indexing for Recorded Educational Lecture Videos](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1598977) they extracted scripts from videos with timestamps on each word and cluster them in order to allow for finding of the exact position of a particular thing in the video. They also use a retrieval method to find “example”, “explanation”, “overview”, “repetition”, “exercise” for a particular word or topic word. 
    - Citation: `@INPROCEEDINGS{1598977, 
author={S. Repp and M. Meinel}, 
booktitle={Fourth Annual IEEE International Conference on Pervasive Computing and Communications Workshops (PERCOMW'06)}, 
title={Semantic indexing for recorded educational lecture videos}, 
year={2006}, 
pages={5 pp.-245}, 
month={March},}`


- In [Olex: Effective Rule Learning for Text Categorization](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4641927) Sees the problem as a text classification task and applied Inference Rules onto it. Not particularly for intent mining, but for different categories, similar to what I have. The inference rules are of the form: \begin{equation}If \space T_1 \space or \space \dots \space T_n \space occurs \space in \space document \space d,\space and \space none \space of \space T_{n+1} \dots T_{n+m} \space occurs \space in \space d, \space then \space classify \space d \space under \space category \space C \end{equation}  This includes `one` positive literal and `0+` negative literals and temrs are `n-grams`

    - Citation `@ARTICLE{4641927, 
author={P. Rullo and V. L. Policicchio and C. Cumbo and S. Iiritano}, 
journal={IEEE Transactions on Knowledge and Data Engineering}, 
title={Olex: Effective Rule Learning for Text Categorization}, 
year={2009}, 
volume={21}, 
number={8}, 
pages={1118-1132}, 
doi={10.1109/TKDE.2008.206}, 
ISSN={1041-4347}, 
month={Aug},}`

# Inference Rules method

## Theory
### General Rules
1. If sentence has no label, proceed with label search.
2. If no label can be assigned, assign the last applied labeled from a previous sentence
3. We need priority as we often get results from more than one label and therefore we need to put weight on the results from some of the rules over others. 

-----------------------------------
### ALL RULES
- [Priority] [Status] [Label] <<< [RULE]
- [1] [✔] EX_1 <<< `example` || OR `for instance` || `assume` || `suppose` || `imagine` || `as` || `simulation` || `diagram` 
- [2] [✔] EX_2 <<< `Let's` __&&__ try || think || see || pick || take a look || say ..
- [2] [✔] CD_1 <<< `Let's` __&&__ look at || make || put || do || start || prove || evaluate || back || try || just __AND NO__ `example` || `assume` || `suppose` || `imagine` || `diagram`
- [4] [✔] CD_2 <<< `in other words` || `basically` __AND NO__ `should` || `have to` || `must` 
- [4] [✔] CD_3 <<< `so` is the first word in the sentence __&&__ `it's` || `i'm`
- [3] [✔] CD_4 <<< `so this is` || `actually` __AND NO__ `example` || `summary` || `next` || `last`
- [1] [✔] CD_5 <<< `means` || `mean` || `given` || `define` || `explain` __AND NO__ Present continuous tense (going to)
- [3] [✔] CD_6 <<< `what if` __AND NO__ `example` || `instance`
- [1] [✔] SM_1 <<< `Let's` __&&__ summarize || `recap` 
- [4] [✔] SM_2 <<< `in other words` __&&__ past tense
- [3] [✔] SM_3 <<< `this week` || `this lesson` || `today` __&&__ present tense || future tense
- [1] [✔] SM_4 <<< `later` || `next time` || `last time` || `summary` || `summarize` || `here is` || `here are` || `discuss` || `next` 
- [4] [✔] SM_5 <<< if (lineNr < 10 `OR` lineNr > fileLinesNr - 10) __&&__ (past tense) =>> (within the first or last 10 lines + past tense)
- [2] [✔] SM_6 <<< `going to` __&&__ `look` || `see` || `be` || `think` || `explain` || `explained` __AND NO__ present tense (&& future or past)
- [1] [✔] AP_1 <<< `in other words` __&&__ `should` || `could` || `would` [✔]
- [1] [✔] AP_2 <<< `encourage` || `step` || `first` || `finally` || `second` || `should` || `could` || `would` || `best practice(s)` || `need to` || `homework` || `you can` || `make sure`
- [2] [✔] AP_3 <<< `if` __&&__ `use` || `can` || `should` || `could` || `want`
- [1] [✔] CM_1 <<< `called` __&&__ concept
- [2] [✔] CM_2 <<< `what is` .. __&&__ concept
- [3] [✔] CM_3 <<< `theorem` || `algorithm` || `method` || `let's use` || `theory`
- [1] [✔] CM_4 <<< first occurence of the terms in the title of the file
- [ ] [**X**] CM_5 <<< `let's` __&&__ `use` - **REPEATS CM_3** 

Based on manual analysis:
Label prioritization:
- CM == EX   =>    EX
- CM == SM   =>    SM
- AP == CD   =>    CD
- SM == AP   =>    AP

-----------------------------------

### Logical expressions (copy/paste in thesis later - LATEX style)
#### ♦ EXAMPLE
1. \begin{equation} d \leftarrow EX \space, if\space ("let's" \in d \space) \space \land ("try" \in d \space \lor "see" \in d \space \lor "think" \in d \space \lor "pick" \in d \space \lor "say" \in d) \end{equation} 

2. \begin{equation} d \leftarrow EX \space, if\space ("example" \in d \space) \lor ("for \space instance" \in d) \space \lor ("suppose" \in d) \space \lor ("assume" \in d) \space \lor ("includes" \in d) \space \lor ("imagine" \in d) \space \end{equation} 

Latex Formula Formatter: https://www.codecogs.com/eqnedit.php

## Pseudocode

`Disregard all sentences that have NL label, totally ignore, then: [✔]
    if sentence has no label:
        for line in text:
            if lineNR < 10 OR lineNR > nrOfLines-10:
                for word in line:
                    if (60%+ of the words on the line are in PAST TENSE):
                        go over the SM rules  (append res to curSentLabels)
                        go over the CM rules  (append res to curSentLabels)
                    if (60%+ of the words on the line are in PRESENT TENSE):
                        go over the CD rules  (append res to curSentLabels)
                    if (60%+ of the words on the line are in FUTURE TENSE):
                        go over the CM rules  (append res to curSentLabels)
            if lineNR > 10 AND lineNR < nrOfLines-10:
                go over the EX rules  (append res to curSentLabels)
                go over the CD rules  (append res to curSentLabels)
                go over the AP rules  (append res to curSentLabels)
                go over the CM rules  (append res to curSentLabels)
        Count the labels with a special method for this: [✔]
            if all rules fail to assign a label, i.e. if all labels return count 0:
                search for the last labeled sentence:
                    assign its label to the current sentence`                                                     

### [THEORY] Checking the tense of the verbs in the sentence

- [All POS Tags](http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

NLTK PPOS TAGS for verbs:

- VBD		verb, past tense					`took`      +1 if checked for PAST, else 0
- VBN		verb, past participle				`taken`     +1 if checked for PAST, else 0
- VB		verb, base form						`take`      +0.5
- VBG		verb, gerund/present participle		`taking`    +0.5
- VBP		verb, sing. present, non-3d			`take`      +1 if checked for PRESENT, else 0
- VBZ		verb, 3rd person sing. present		`takes`     +1 if checked for PRESENT, else 0
- MD        modal verb (will, shall)            `will`      +1 if checked for FUTURE, else 0


**Simply counting verbs isn't enough, artificial boost is added if certain types of verbs are present so that they
form a specific English tense. All listed below:**

- **Past perfect**: `VBD`(had) + `VBN`(been) ------- BUT NO VBG(-ing/gerund)
- **Past continuous tense**: `VBD`(was/were) + `VBG`(-ing/gerund)
- **Past perfect continuous**: `VBD`(had) + `VBN`(been) + `VBG`(-ing/gerund)
- **PRESENT perfect**: `VBP`(have) + `VBN`(been) ------- BUT NO VBG(-ing/gerund)
- **PRESENT perfect continuous**: `VBP`(have) + `VBN`(been) + `VBG`(-ing/gerund)
- **PRESENT continuous**: `VBP`(is/are) + `VBG`(-ing/gerund)
- **Future continuous**: `MD`(WILL) + `VBG`(-ing/gerund)
- **Future perfect**: `MD`(will) + `VB`(have) + `VBN`(PP)
- **Future perfect continuous**: `MD`(will) + `VBN`(been) + `VB`(have) + `VBG`(-ing/gerund)

## Implementation of Inference Rules

### Import modules

In [1]:
# Import all necessary modules for EVERYTHING here
import os
import sys
import os.path
import string
import time
import re
import dis
import time

import math
from textblob import TextBlob as tb
import nltk
from nltk import word_tokenize, sent_tokenize
from nltk.util import ngrams

import re, string, unicodedata
import contractions
import inflect
from bs4 import BeautifulSoup
from tabulate import tabulate

### Check for present / past / future tense

All return a number which is the percentage of likelihood that the sentence is of certain tense

**Version 1** checked also for the more complicated tenses such as perfect, continuous etc of each present, past, future by assigning addiotnal scores according to whether all criteria is covered, in order to make the output higher and ==> more certain.

**Version 2** only counts the types of verbs for each tense
- checkPRES(sentence)
- checkPAST(sentence)
- checkFUTURE(sentence)

#### Version 1

In [None]:
# ======================================== PAST TENSE CHECK ==============================================

puncDict = [",",".",";","-","_","`","'","?","!",":"]

def checkPAST2(sentence):
    text = nltk.word_tokenize(sentence)
    verbs, pastTense, grammarCase, VBGgerund, VBNpp, VBDHad = 0, 0, 0, 0, 0, 0
    
    textPOS = nltk.pos_tag(text)
    for tag in textPOS:
        if str(tag[1]).startswith("V"):
#            print(tag)
            verbs += 1
            if str(tag[1]) == "VBD": 
                pastTense += 1
                VBDHad += 1
            elif str(tag[1]) == "VBN": 
                pastTense += 1
                VBNpp += 1
            elif str(tag[1]) == "VBG": VBGgerund += 1
            else: pass
    
    # artifial boost over the past tense verbs +1 if there is at least one from all gerund, past participle 
    # and modal verb that defines past continuous tense
    # TODO re-evaluate points
    #  1   1   0.5
    # VBD VBN  VBG
    #print("VBD ",VBDHad," VBN ",VBNpp," VBG ",VBGgerund)
    if VBDHad > 0 and VBNpp > 0 and VBGgerund == 0: grammarCase += 2                   #Past perfect 
    elif VBDHad > 0 and VBNpp == 0 and VBGgerund > 0: grammarCase += 1.5                #Past continuous tense 
    elif VBDHad > 0 and VBNpp > 0 and VBGgerund > 0: grammarCase += 2.5                  #Past perfect continuous
    else: grammarCase += 0
        
    if pastTense == 0:
        return -1
    else:
        #grammarCase = grammarCase * 0.1 
        # big / small = 100 / x
        #calculate verbs over words
        ratio_verbs = len(text) / verbs
        perc_verbs= (100 / ratio_verbs) 

        #calculate past tense verbs over all verbs
        ratio_PastVerbs = verbs / pastTense
        perc_PastVerbsOverVerbs = (100 / ratio_PastVerbs)

        #calculate past tense verbs over all words
        ratio_PastVerbsOverWords = len(text) / pastTense
        perc_VerbsPastTenseOverWords = 100 / ratio_PastVerbsOverWords

        """
        if perc_PastVerbsOverVerbs >= 30:
            return perc_PastVerbsOverVerbs
        else:
            print("{0:.2f}".format(perc_PastVerbsOverVerbs))
            return "NPAST"
        """

        if not grammarCase == 0:
            return grammarCase * 100
        else:
            if perc_PastVerbsOverVerbs >= 30:
                return perc_PastVerbsOverVerbs
            else:
                #print("{0:.2f}".format(perc_PastVerbsOverVerbs))
                return -1
            
            
# ========================================= PRESENT TENSE CHECK =============================================
    
def checkPRESENT2(sentence):
    text = nltk.word_tokenize(sentence)
    verbs, presTenseVerbs, grammarCase, VBGgerund, VBPhave, VBNbeen = 0, 0, 0, 0, 0, 0 
    
    textPOS = nltk.pos_tag(text)
    for tag in textPOS:
        if str(tag[1]).startswith("V"):
            verbs += 1
            if str(tag[1]) == "VBP": 
                presTenseVerbs += 1
                VBPhave += 1
            elif str(tag[1]) == "VBN": 
                VBNbeen += 1
            elif str(tag[1]) == "VBG":
                presTenseVerbs += 1
                VBGgerund += 1
            elif str(tag[1]) == "VB":
                presTenseVerbs += 1
            elif str(tag[1]) == "VBZ":
                presTenseVerbs += 1
            else: pass
                
    # artifial boost over the past tense verbs +1 if there is at least one from all gerund, past participle 
    # and modal verb that defines past continuous tense.
    #print("VBG ",VBGgerund," VBP ",VBPhave," VBN ",VBNbeen)
    # TODO re-evaluate points
    # 0.5   1    0
    # VBG  VBP  VBN
    
    if VBGgerund == 0 and VBPhave > 0 and VBNbeen > 0: grammarCase += 1                   # PRESENT perfect
    elif VBGgerund > 0 and VBPhave > 0 and VBNbeen == 0: grammarCase += 1.5               # PRESENT continuous 
    elif VBGgerund > 0 and VBPhave > 0 and VBNbeen > 0: grammarCase += 1.5                # PRESENT perfect continuous
    else: grammarCase += 0
    
    #grammarCase = grammarCase * 0.1
    #calculate past tense verbs over all verbs
    if presTenseVerbs == 0:
        return -1
    else:
        ratio_PresVerbs = verbs / presTenseVerbs
        perc_PresVerbsOverVerbs = (100 / ratio_PresVerbs)

        if not grammarCase == 0:
            return grammarCase * 100
        else:
            if perc_PresVerbsOverVerbs >= 30:
                return perc_PresVerbsOverVerbs
            else:
                #print("{0:.2f}".format(perc_PresVerbsOverVerbs))
                return -1


# ======================================= FUTURE TENSE CHECK ===============================================

def checkFUTURE2(sentence):
    text = nltk.word_tokenize(sentence)
    verbs, futureTenseWords, grammarCase, MDmodal, VBGgerund, VBhave, VBNpp = 0, 0, 0, 0, 0, 0, 0
    
    textPOS = nltk.pos_tag(text)
    for tag in textPOS:
        if str(tag[1]).startswith("V"):
            verbs += 1
            if str(tag[1]) == "VBG":
                VBGgerund += 1
                futureTenseWords += 1
            elif str(tag[1]) == "VB":
                VBhave += 1
            elif str(tag[1]) == "VBN":
                VBNpp += 1
        if str(tag[1]) == "MD" and str(tag[0]) == "will" or str(tag[0]) == "shall":
            MDmodal += 1
            futureTenseWords += 1
                
    # artifial boost over the past tense verbs +1 if there is at least one from all gerund, past participle 
    # and modal verb that defines past continuous tense.
    #print("VBG ",VBGgerund," VBP ",VBPhave," VBN ",VBNbeen)
    # TODO re-evaluate points
    # 0.5   0    0    1
    # VBG   VB  VBN   MD
    
    if MDmodal > 0 and VBhave > 0 and VBNpp > 0: grammarCase += 1                         # Fut. perf.    MD/VB/VBN
    elif MDmodal > 0 and VBGgerund > 0: grammarCase += 1.5                                # Fut. cont.  MD/VBG
    elif MDmodal > 0 and VBhave > 0 and VBNpp > 0 and VBGgerund > 0: grammarCase += 1.5   # Fut. perf. cont. MD/VB/VBN/VBG
    else: grammarCase += 0
    
    #grammarCase = grammarCase * 1.0
    #calculate past tense verbs over all verbs
    if futureTenseWords == 0:
        return -1
    else:
        ratio_FutureVerbs = verbs / futureTenseWords
        perc_FutureWordsOverVerbs = (100 / ratio_FutureVerbs)

        if not grammarCase == 0:
            return grammarCase * 100
        else:
            if perc_FutureWordsOverVerbs >= 30:
                return perc_FutureWordsOverVerbs

#### Version 2 (USE THIS)

In [2]:
# ================================= SECOND VERSION OF TENSE CHECK ========================================

# ======================================== PAST TENSE CHECK ==============================================

puncDict = [",",".",";","-","_","`","'","?","!",":"]

def checkPAST(sentence):
    text = nltk.word_tokenize(sentence)
    verbs, pastTense, VBGgerund = 0, 0, 0
    
    textPOS = nltk.pos_tag(text)
    for tag in textPOS:
        if str(tag[1]).startswith("V"):
#            print(tag)
            verbs += 1
            if str(tag[1]) == "VBD": pastTense += 1
            elif str(tag[1]) == "VBN": pastTense += 1
            elif str(tag[1]) == "VBG": pastTense += 1
            else: pass
        
    if pastTense == 0:
        return -1
    else:
        #calculate past tense verbs over all verbs
        ratio_PastVerbs = verbs / pastTense
        perc_PastVerbsOverVerbs = (100 / ratio_PastVerbs)
        
        return perc_PastVerbsOverVerbs
            
            
# ========================================= PRESENT TENSE CHECK =============================================
    
def checkPRESENT(sentence):
    text = nltk.word_tokenize(sentence)
    verbs, presTenseVerbs = 0, 0 
    
    textPOS = nltk.pos_tag(text)
    for tag in textPOS:
        if str(tag[1]).startswith("V"):
            verbs += 1
            if str(tag[1]) == "VBP": presTenseVerbs += 1
            elif str(tag[1]) == "VBG":
                presTenseVerbs += 1
            elif str(tag[1]) == "VBZ": presTenseVerbs += 1
            else: pass
            
    if presTenseVerbs == 0:
        return -1
    else:
        ratio_PresVerbs = verbs / presTenseVerbs
        perc_PresVerbsOverVerbs = (100 / ratio_PresVerbs)
        
        return perc_PresVerbsOverVerbs


# ======================================= FUTURE TENSE CHECK ===============================================

def checkFUTURE(sentence):
    text = nltk.word_tokenize(sentence)
    verbs, futureTenseWords, grammarCase = 0, 0, 0
    
    textPOS = nltk.pos_tag(text)
    for tag in textPOS:
        if str(tag[1]).startswith("V"):
            verbs += 1
            if str(tag[1]) == "VBG": futureTenseWords += 1
        if str(tag[1]) == "MD" and str(tag[0]) == "will" or str(tag[0]) == "shall":
            futureTenseWords += 1
            
    if futureTenseWords == 0:
        return -1
    else:
        ratio_FutureVerbs = verbs / futureTenseWords
        perc_FutureWordsOverVerbs = (100 / ratio_FutureVerbs)
        
        return perc_FutureWordsOverVerbs

In [None]:
sentence = """
that means we will create this next time
"""

# -------- check sent for PAST tense
print("Past tense chance: {} %".format(int(checkPAST(sentence))))

# -------- check sent for PRESENT tense    
print("Present tense chance: {} %".format(int(checkPRESENT(sentence))))

# -------- check sent for FUTURE tense
print("Future tense chance: {} %".format(int(checkFUTURE(sentence))))

print("i am going to eat a kiwi".find("to"))

### Functions:
- **avg**(list)
- **get_ngrams**(text, n)
- **title_getConcept**(oFileNoExt)
- **getPerLabelMax**(tupleList)
- **getFinalLabelTupleList**(curSentLabels)
- **getFinalLabelDict**(resDict)    #NOT working
- **checkPriority**(funcName)
- **updatePriority**(ruleFunc, funcName)
- **termSeen**(listOfTermsSeen, term)

In [3]:
import operator

def avg(listOfItems):
    return sum(listOfItems, 0.0) / len(listOfItems)

def get_ngrams(text, n):
    n_grams = ngrams(word_tokenize(text), n)
    return [' '.join(grams) for grams in n_grams]

# WORKS - takes a file name
# USE to find key words from tghe title and find the first occurence of the concepts in the text and label as CM
def title_getConcept(oFileNoExt):
    
    # remove all surrounding stuff like my naming convention etc from the title 
    mainTitlelist = oFileNoExt.split("_")
    maintitle = mainTitlelist[1]
    punct = {'_','-','.'}
    finaltitle = ""
    
    # extract the final title, i.e. the main part of the title
    for word in maintitle.split():
        for letter in word:
            if letter in punct:
                finaltitle += " "
                pass
            else:
                finaltitle += letter
    
    title_keywords = []
    for word in finaltitle.split(" "):
        if word == 'en':
             pass
        else:
            title_keywords.append(word)
    
    return title_keywords  # returns a list of keywords


### -----------------------------

# since more then one rule may return the same label and summing up the results can give incorrect result, we will
# only get the highest value per label for the final comparison for label output
def getPerLabelMax(tupleList):
    maxCD, maxAP, maxEX, maxCM, maxSM = 0,0,0,0,0
    uniqueValueList = []
    
    for item in tupleList:
        label = item[0]
        score = item[1]
        
        if label == 'NOLBL': pass
        if label == 'CD': 
            if score > maxCD: maxCD = score
        if label == 'CM': 
            if score > maxCM: maxCM = score
        if label == 'AP': 
            if score > maxAP: maxAP = score
        if label == 'SM': 
            if score > maxSM: maxSM = score
        if label == 'EX': 
            if score > maxEX: maxEX = score
                
    uniqueValueList.append(tuple(("CD", maxCD)))
    uniqueValueList.append(tuple(("CM", maxCM)))
    uniqueValueList.append(tuple(("AP", maxAP)))
    uniqueValueList.append(tuple(("SM", maxSM)))
    uniqueValueList.append(tuple(("EX", maxEX)))
    
    return uniqueValueList

### -----------------------------

# WORKS 
# USE at the end to export only one label per sentence - the one with the majority vote
# NOTE: equal case is NOT considered, so it may crash
def getFinalLabelTupleList(curSentLabels):
    uniqueList = getPerLabelMax(curSentLabels)
    #print("final list before comparison: ", uniqueList)
    
    maxLabel = ""
    maxCount = 0
    labels = []
    
    # getting the max value
    for item in uniqueList:
        label = item[0]
        count = item[1]
        
        if count > maxCount:
            maxCount = count
            maxLabel = label
        elif count == maxCount:
            """ PRIORITIES of LABELS over each other
            CM == EX => EX [✔]
            CM == SM => SM [✔]
            AP == CD => CD [✔]
            SM == AP => AP [✔]
            """
            if label == 'CM' and maxLabel == 'EX': pass
            elif label == 'EX' and maxLabel == 'CM': maxLabel = label
            elif label == 'SM' and maxLabel == 'CM': maxLabel = label
            elif label == 'CM' and maxLabel == 'SM': pass
            elif label == 'CD' and maxLabel == 'AP': pass
            elif label == 'AP' and maxLabel == 'CD': maxLabel = label
            elif label == 'AP' and maxLabel == 'AP': maxLabel = label
            elif label == 'SM' and maxLabel == 'AP': pass
            #elif label == 'SM' and maxLabel == 'CD': maxLabel = label
            #elif label == 'CD' and maxLabel == 'SM': pass
            #elif label == 'EX' and maxLabel == 'CD': maxLabel = label
            #elif label == 'CD' and maxLabel == 'EX': pass
                
    return maxLabel

# ---------------------------------------------------------

def getFinalLabelDict(resDict):     # gets the list of assigned labels after all rules have been checked
    #print(resDict)
    EX,AP,CD,CM,SM = 0, 0, 0, 0, 0
    
    numericalDict = {}
    # counting the labels returned from the rules checks
    for value in resDict:
        if value == "EX": numericalDict["EX"] += 1
        elif value == "AP": numericalDict["AP"] += 1
        elif value == "CD": numericalDict["CD"] += 1
        elif value == "CM": numericalDict["CM"] += 1
        elif value == "SM": numericalDict["SM"] += 1
        else: pass   #pass NOLBL items
        
    maxValueLabel = getKeyWithMaxVal(numericalDict)

    return maxValueLabel

### -----------------------------

# checks the rule name and since the rules have priority returns int based on that: where higher priority = more points
def checkPriority(funcName):
    res = 0
    if funcName in ("EX_1", "CD_5", "SM_1", "SM_4", "AP_1", "AP_2", "CM_1", "CM_4"): res = 4    # priority 1 (highest)
    if funcName in ("EX_2", "CD_1", "SM_6", "AP_3", "CM_2"): res = 3    # priority 2
    if funcName in ("CD_4", "CD_6", "SM_3", "CM_3"): res = 2    # priority 3
    if funcName in ("SM_5", "SM_2", "CD_3", "CD_2"): res = 1     # priority 4 (lowest)
    
    return res

### -----------------------------

# gets function and sentence, checkswhat the function returns and if the result is not NOLBL,
# returns a TUPLE to be added to a dictionary as (KEY-VALUE): e.g. ("EX_1", 4)
def updatePriority(ruleFunc, funcName):
    keyValPair = ()
    points = 0
    
    if not ruleFunc == "NOLBL":
        points = checkPriority(funcName)
        #print(funcName," (",ruleFunc,points,")")
    
    keyValPair = (ruleFunc,points)
    return keyValPair    # returns a tuple



### -----------------------------
# figure out how to detect whether a term from the title has been seen in a file or not.

def termSeen(listOfTermsSeen, term):
    if term in listOfTermsSeen: return 1
    else: return 0
   
"""
    // TODO
    When this returns 0, we can update the value for the respective term in titleTermsSeen from 0 to 1 and 
    we can call CM_4 for each term and it will work only if the term's value is 0, 
    so for each sentence:
        check dictionary, if value is 0, call CM_4
        if CM_4 returns a value "CM", then update the dictionary for that value with 1
        else don't call CM_4 at all
"""

'\n    // TODO\n    When this returns 0, we can update the value for the respective term in titleTermsSeen from 0 to 1 and \n    we can call CM_4 for each term and it will work only if the term\'s value is 0, \n    so for each sentence:\n        check dictionary, if value is 0, call CM_4\n        if CM_4 returns a value "CM", then update the dictionary for that value with 1\n        else don\'t call CM_4 at all\n'

### Rules implementaion

In [4]:
##### RULES #####

### ------------------EX--------------------
def EX_1(sent):
    nrOfWordsFound = 0
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    
    wordsEX = {"example", "examples", "for instance", "assume", "sketch", "chart", "cartoon", "suppose", "imagine", "as", "simulation", "diagram", "might have", "may have", "want", "draw"}
    
    for wd in wordsEX:
        splitwords = wd.split()
        if len(splitwords) == 2:
            if wd in bigrams:
                nrOfWordsFound += 1
        else:
            if wd in monograms:
                nrOfWordsFound += 1
        
    if nrOfWordsFound > 0:
        return "EX"
    else: 
        return "NOLBL"
    
### ------------------EX--------------------
    
def EX_2(sent):
    mainWordNR = 0      # Let's
    secondaryWordsNR = 0     
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    trigrams = get_ngrams(sent, 3)
    
    wordsEX2 = {"try", "think", "see", "pick", "take a look", "say"}
    
    if ("let's" or "let 's") in bigrams: mainWordNR += 1
    
    for wd in wordsEX2:
        splitwords = wd.split()
        if len(splitwords) == 3:
            if wd in trigrams:
                secondaryWordsNR += 1
        if wd in monograms:
                secondaryWordsNR += 1
        
    #print(," ", " ",)
    #print(mainWordNR," ",secondaryWordsNR)
    if secondaryWordsNR > 0 and mainWordNR > 0:
        return "EX"
    else:
        return "NOLBL"
    
### -------------------CD-------------------

def CD_1(sent):
    mainWordNR = 0      # Let's     # main word looking for in conjunction with one or more of the secondary words
    secondaryWordsNR = 0 
    negWordsNR = 0      # words that must NOT occur for the label to apply, i.e. this should stay at ZERO
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    
    secondaryWords = {"look at", "make", "put", "do", "start", "prove", "back", "try", "just", "be", "take", "bring"}
    negWords = {"example", "examples", "diagram", "assume", "imagine", "suppose"}
    
    if ("let's" or "let 's") in bigrams: mainWordNR += 1
    
    for wd in secondaryWords:
        splitwords = wd.split()
        if len(splitwords) == 2:
            if wd in bigrams:
                secondaryWordsNR += 1
        else:
            if wd in monograms:
                secondaryWordsNR += 1
    
    for wd in negWords:
        if wd in monograms:
                negWordsNR += 1
    
    if secondaryWordsNR > 0 and mainWordNR > 0 and negWordsNR == 0:
        return "CD"
    else:
        return "NOLBL"

### --------------------------------------

def CD_2(sent):
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    trigrams = get_ngrams(sent, 3)
    
    if ("in other words" or "basically") in monograms and not ("should" or "must") in monograms or not "have to" in bigrams and checkPRESENT(sent) >= 50:
        return "CD"
    else: return "NOLBL"
    
### --------------------------------------

def CD_3(sent):
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    
    if monograms[0] == "so" and ("it 's" or "it's") or ("i 'm" or "i'm") in bigrams: return "CD"
    else: return "NOLBL"
    
### --------------------------------------

def CD_4(sent):
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    trigrams = get_ngrams(sent, 3)
    if "so this is" in trigrams or "actually" in monograms and not ('example' or 'summary' or 'next' or 'last') in monograms: return "CD"
    else: return "NOLBL"
    
### --------------------------------------   

def CD_5(sent):
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    
    if ("mean" or "given" or "define") in monograms and checkPRESENT(sent) >= 50 or "going to" in bigrams: return "CD"
    else: return "NOLBL"

### --------------------------------------
    
    
def CD_6(sent):
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    
    if "what if" in bigrams and not ("example" or "instance") in monograms: return "CD"
    else: return "NOLBL"
    
    
### --------------------------------------
def SM_1(sent):
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    
    if ("let's" or "let 's") in bigrams and "summarize" or "recap" in monograms: return "SM"
    else: return "NOLBL"

### --------------------------------------

def SM_2(sent):
    trigrams = get_ngrams(sent, 3)
    
    if "in other words" in trigrams and (checkFUTURE(sent) or checkPAST(sent)) >= 40: return "SM"
    else: return "NOLBL"

### --------------------------------------

def SM_3(sent):
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    
    if (("this week" or "this lesson") in bigrams or "today" in monograms) and (checkPRESENT(sent) or checkFUTURE(sent)) >= 50: 
        return "SM"
    else: return "NOLBL"

### --------------------------------------

def SM_4(sent):
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    
    wordsSM = {'later' , 'next time' , 'last time' , 'summary' , 'summarize' , 
             'here is' , 'here are' , 'discuss' , 'next', 'recap'}
    
    for wd in wordsSM:
        splitwords = wd.split()
        if len(splitwords) == 2:
            if wd in bigrams: return "SM"
        elif wd in monograms: return "SM"
        else: return "NOLBL"

### --------------------------------------

def SM_5(sent, lineNR, NrOfLines):
    if lineNR < 15 or lineNR > NrOfLines - 15 and checkPAST(sent) > 40: return "SM"
    else: return "NOLBL"

### --------------------------------------
    
def SM_6(sent):
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    
    wordsSM6 = {"look", "see", "be", "think", "explain"} 
    
    if "going to" in bigrams:
        for wd in wordsSM6:
            if wd in monograms and checkPRESENT(sent) < 30 and (checkFUTURE(sent) or checkPAST(sent)) > 40 : return "SM"
            else: return "NOLBL"
    else: return "NOLBL"
    
### --------------------------------------

def AP_1(sent):
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    trigrams = get_ngrams(sent, 3)
    
    if "in other words" in trigrams and ("should" or "would" or "could") in monograms: return "AP"
    else: return "NOLBL"

### --------------------------------------

def AP_2(sent):
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    wordsAP2 = {'encourage' , 'step' , 'first' , 'finally' , 'second' , 'should' ,
             'could' , 'would' , 'best practice', 'need to', 'good idea', 'homework', 'you can', 'make sure'}
    
    res = ""
    for wd in wordsAP2:
        splitwords = wd.split()
        if len(splitwords) == 2 and wd in bigrams: res = "AP"
        elif len(splitwords) == 1 and wd in monograms: res = "AP"
        else: 
            if not res == "AP": res = "NOLBL"
    
    if res == "": return "NOLBL"  
    else: return res
   
    ### --------------------------------------

def AP_3(sent):
    monograms = get_ngrams(sent, 1)
    
    if "if" in monograms and ("use" or "can" or "should", "could" or "want") in monograms: return "AP"
    else: return "NOLBL"

### --------------------------------------

def CM_1(sent, origLbl):
    #mainConcepts = title_getConcept(oFileNoExt)
    monograms = get_ngrams(sent, 1)    
    
    if "called" in monograms and origLbl == "CM": return "CM"
    else: return "NOLBL"


### --------------------------------------

def CM_2(sent, oFileNoExt):
    mainConcepts = title_getConcept(oFileNoExt)
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    
    if "what is" in bigrams:
        for wd in mainConcepts:
            if wd in monograms: return "CM"
    else: return "NOLBL"
    
### --------------------------------------

def CM_3(sent):
    monograms = get_ngrams(sent, 1)
    bigrams = get_ngrams(sent, 2)
    
    if ("theory" or "theorem" or "algorithm" or "method") in monograms or ("let's" or "let 's") in bigrams and "use" in monograms: return "CM"
    else: return "NOLBL"

### --------------------------------------

def CM_4(sent, termsSeenList, term):
    seen = -1
    monograms = get_ngrams(sent, 1)
    
    if term in sent and not originalLabel == "EX": 
        if term not in termsSeenList:
            termsSeenList.append(term)
            seen = 1
    else: seen = 0

    #print(term,"\t",seen)
    if seen == 0: return "CM"
    else: return "NOLBL"
        

### Main program logic, called from the file traverse

#### TEST

In [None]:
#---------------------------------------------- STUPID TEST EXAMPLE ------------------------------------
testSent = "this week,in week nine,we're still doing applications of the derivative,but it's not so much word problems anymore"
#res = EX_2(testSent)
#print(res)
# ----------- disregard the above test ---------

ignore_dict = ['inaudible','OMITTED','NUMBER','sound','music','laughter','yeah','blank_audio']

text = """EX|so here's a example we want to run through 
EX|here,have some principle inertias that are NUMBER,NUMBER,and NUMBER 
CD|now,just by the numbers,a little bit unfortunate,the intermediate one if NUMBER beneath the max and is NUMBER above the min,so it's plus and minus NUMBER 
CD|this is why when we look at these wheels,the critical wheel speeds,where things go stable unstable,it happens to be symmetric 
CD|but,generally these inertias aren't split evenly like that 
CD|so generally those critical wheel speeds are not necessarily symmetric,all right 
EX|this example will have them symmetric 
EX|the wheel inertia is ten kilograms,just to make a lot of math easy 
EX|spacecraft is to spin about NUMBER rpm,about b1,so that's our omega e1 that you have 
EX|and so now,we're trying to figure out how fast do we have to spin this wheel at least to guarantee linear stability of the system 
CD|so you want both bracketed terms to be either positive or negative 
CD|now,those two bracketed terms could be positive,that's the set 
CD|that's where we have i1 must be greater than these other two terms here 
CD|in that case,the left bracket and the right bracket are both positive,and then we have a stable system 
CD|set,actually just simply reverses this,greater than becomes a less than 
CD|this is the condition that makes the first and second bracket go negative and that is also a stable configuration,all right 
CD|so now,we want to look at what is the range of spin rates 
CD|so the way like to think of this is just look at the spin axis 
CD|we have zero 
CD|and then,can go positive spin rate and negative spin rate 
CD|and we're going to put in inequality condition to start to go it has to be at least to the right of this mark 
CD|and it has to be at least to the left of this mark,so what's left 
CD|the union or intersections of these different areas,that's how solve this stuff at least 
CD|so if we're looking at this,in this problem if the wheel speed is zero,is this system going to be stable 
NL|brian 
CD|they'd take a look at the inequalities again 
NL|no,no 
SM|if the wheel speed is zero,there's no longer a dual spin,there's simply a single rigid body 
CD|and we're spinning about b1 dual spin
CD|about which inertia axis are we spinning then 
CD|it's not going to be stable 
CD|all right,because now we have NUMBER 
CD|so b1 is an axis of intermediate inertia in this particular example,right 
CD|so we know up front,zero can not be included in my solution space 
CD|have to,if this is going to work,have to have something positive or negative 
CD|and we'll see how that drops out 
CD|so if we do this,here's the two inequality conditions 
EX|the second one,for example here,this is one set 
CD|you can say okay,i2 minus,you can bring this term over,that gives you i2 minus i1,divide by this 
CD|iw,i2 minus i1,gives you minus NUMBER divided by NUMBER 
CD|no,it gets you minus NUMBER divided by ten,gives you minus five 
CD|so by bringing this over,omega hat has to be bigger than minus five 
CD|if'm just going to draw that here,so let's say,here's the minus five part,that means,what did say,less than 
CD|all right,thinking too many steps ahead,okay 
CD|no,it's greater than 
CD|okay,so that means we would have to be somewhere greater than minus five 
CD|but that's only one condition 
CD|now the other condition,here'm just going through the math a little bit more steps,right 
CD|i'm bringing this condition and bringing this part over to the left hand side 
CD|i1 back over to the right hand side 
CD|divide by iw2,here it's i3 minus i1 
CD|you plug in the NUMBERs,which you do something similar in the homework,so'm just not going to 
CD|you know how to do this algebra 
EX|then omega hat as to be greater than plus five 
EX|so if we look at this now and say,okay,there's also a plus five,and here's the other point 
CD|this is the domain that makes the second bracket go to positive 
CM|what is the actual solution space then for stability 
NL|inaudible great 
CD|so that union between both,right 
CD|because we need both of them to be positive 
CD|that's going to be here,all right 
CD|outs greater than 
CD|is that the only domain,though 
NL|carlo,what do you think 
CD|will be less than minus five 
CD|less than minus five,now talk me through why that it's also domain correct 
CD|because the second set of equations doesn't give us that answer 
CD|yep,the first set always has greater than
CD|you've identified two points where each bracketed term flips signs 
CD|once you've found those,you've really found,instead of greater than,it's all less than 
CD|so instead of always being to the right of,it's always going to be all to the left of 
CD|and,again,the same unions 
CD|so if you did that to make both brackets negative,if pick a different color,let me make blue for negative right 
CD|then you would have to be here or here and the union of that is just going to be here 
CD|so you can see as expected the origin is not included in the solution space 
CD|because as brian was pointing out,this is a single rigid body spinning about axis remediated inertia 
CD|we know it's not stable,but if the wheel spins up enough then at some point,you'll notice with this wheel spinning,you're spinning oars is an oblate body,you're spinning by axis of max inertia 
CD|that wheel by itself,the way it's defined 
CD|it's always going to be stable,right 
CD|so if that wheel spins fast enough,one way to think of this is the stability of that wheel is going to overcome the instability of the spacecraft spin and that's how we stabilize it 
CD|but you have to get to some minimum amount where it's just equal and then you have to be greater than that,hopefully,quite a bit greater that it will increase your stiffness and your response time as well,right 
CD|so either you can go positive or you can go negative and that will get you there 
CD|now,so good 
EX|so in this case,have,this is one of the wheel's speed,the critical speeds,you could use or it has to be bigger than that really,that's what you have to write 
CD|not equal,but probably bigger than that 
CD|or it could also be less than minus NUMBER in this case 
CD|there's two sets,both brackets positive,both brackets negative 
CD|so if'm giving you some problem like this though,here am saying,we are spinning about major axis 
CD|so in this case,b1 is axis of maximum inertia and without a rotor spin,so the origin,we would have to be stable,which is what am showing 
CD|if we are spinning up more about the spacecraft by definition,already has a positive spin,about b1 
CD|that is how we derived to all this stuff 
CD|that's how picked b1 
CD|and now,the wheel is also spinning about that 
CD|if the spacecraft is max inertia stable,the wheel is max inertia stable and they're both spinning about positive b1,it's always just going to be stable 
CD|more and more stable 
CD|if you start to spin it the opposite direction though,even though the space craft by itself without that wheel will stable,it has the speed momentum that helps stabilize it 
CD|what happens with the wheel,if it's in the opposite direction,it's going to start to pull out the total momentum gets reduced,actually 
CD|if this is spinning at NUMBER rpm positive and the wheel is NUMBER rpm negative,you're pulling out momentum 
CD|and in fact,the momentum perspective,at some point it acts like the separatrix,it acts like the unstable intermediate axis motion and you will drive a system unstable 
CD|but the good news is no matter what axis,principle axis,you want to spin nominally,we can always find a wheel speed that will stabilize it 
CD|laugh the bad news is no matter how stable the spacecraft was before you touched it,once you touch it,you have the capability to drive it unstable 
AP|so make sure you have the right real speeds otherwise your sponsor will not be happy 
AP|at some point again,if you go fast enough,then the wheel momentum will dominate and it's very,very stable 
CD|whatever the spacecraft's doing is almost noise 
CM|so with all duo spinners to extremes infinity real speeds always stable 
CD|but in between,there is a region,a finite zone,that is unstable 
CD|so for an intermediate axis speed,which we looked at 
CD|and in our problem,we had plus and minus five,that was just because of the inertias,it doesn't have to be symmetric 
CD|if you see something like this that excludes the origin,right away could say,this must be spinning about an axis of intermediate inertia with a dual-spinner 
CD|just from the solution space 
CD|and,if you have an axis of a least inertia,you're spinning about this region that you shouldn't be spinning about is actually in the positive spin direction and just all comes out of the mathematics then 
NL|jordan 
CD|sorry,missed how can it be stable if you're spinning it less than negative five 
CD|both bracketed terms 
CD|because the two brackets have to either be positive,and that was the argument for the black hash lines 
CD|but we found the two critical points 
CD|if we also less than those critical points,then both brackets both become negative 
NL|okay 
CD|and that's why this is also a possible answer,out of the mathematics 
NL|thanks 
NL|yeah,good 
NL|so anyway,but this gives you now a quick solution space 
AP|you can do these easy homework with this yourself,to come up with it 
SM|but just remember the extremums are always included 
CD|the origin is only included for a max and min inertia case and then you kind of look at the pattern 
CD|if see the pattern,know right away what type of spin we're doing """


text2 = """SM|welcome to my last additional video,which is about doing things many,many times 
SM|this is something called a loop 
SM|before we get to loops,let's talk about why we want to do things many,many times 
EX|simplest reason is that we might want to draw lots of things that are very similar 
EX|a bunch of concentric circles,a set of lines forming a grid,these are the examples i'm going to use 
EX|but also,things get much,much more complex than that 
EX|i mean,you might have a game with lots of enemies you want to draw 
EX|or in the case of this week's example,we're creating complex shapes made out of many similar simple shapes 
EX|and,in a sense,a grid or concentric circles is a good starting point 
EX|here,this is a sketch that gives us,draws a bunch of lines across the screen  """

curSentLabels = []
sentences = text2.split("\n")
originalLabel = ""
res = {}
lineNR = 0
totalLines = len(text2)
prevLabeledSent = ""
prevLabeledSentLABEL = ""
finalLabelSent = ""

titleTermList = title_getConcept("04_9-example-dual-spinner-stability.en_labels.txt")
#print(titleTermList)
termsSeenList = []   # will hold values for main terms

for sent in sentences:
    lineNR += 1
    splitSent = sent.split("|")
    originalLabel = splitSent[0]
    sent = splitSent[1]
    termInCurSent = ""
    
    if originalLabel == "NL":
        pass
    else:
        curSentLabels.append(tuple(updatePriority(SM_1(sent),"SM_1")))
        curSentLabels.append(tuple(updatePriority(SM_2(sent),"SM_2")))
        curSentLabels.append(tuple(updatePriority(SM_3(sent),"SM_3")))
        curSentLabels.append(tuple(updatePriority(SM_4(sent),"SM_4")))
        curSentLabels.append(tuple(updatePriority(SM_5(sent, lineNR, totalLines),"SM_5")))
        curSentLabels.append(tuple(updatePriority(SM_6(sent),"SM_6")))
        curSentLabels.append(tuple(updatePriority(CD_1(sent),"CD_1")))
        curSentLabels.append(tuple(updatePriority(CD_2(sent),"CD_2")))
        curSentLabels.append(tuple(updatePriority(CD_3(sent),"CD_3")))
        curSentLabels.append(tuple(updatePriority(CD_4(sent),"CD_4")))
        curSentLabels.append(tuple(updatePriority(CD_5(sent),"CD_5")))
        curSentLabels.append(tuple(updatePriority(CD_6(sent),"CD_1")))
        curSentLabels.append(tuple(updatePriority(EX_1(sent),"EX_1")))
        curSentLabels.append(tuple(updatePriority(EX_2(sent),"EX_2")))
        curSentLabels.append(tuple(updatePriority(AP_1(sent),"AP_1")))
        curSentLabels.append(tuple(updatePriority(AP_2(sent),"AP_2")))
        curSentLabels.append(tuple(updatePriority(AP_3(sent),"AP_3")))
        for term in titleTermList:
            curSentLabels.append(tuple(updatePriority(CM_4(sent, termsSeenList, term),"CM_4")))
        curSentLabels.append(tuple(updatePriority(CM_1(sent,originalLabel),"CM_1")))
        curSentLabels.append(tuple(updatePriority(CM_2(sent, "04_9-example-dual-spinner-stability.en_labels.txt"),"CM_2")))
        curSentLabels.append(tuple(updatePriority(CM_3(sent),"CM_3")))
        #print(curSentLabels)
        
        if getFinalLabelTupleList(curSentLabels) in ("SM", "AP", "EX", "CD", "CM"):
            prevLabeledSentLABEL = finalLabelSent     # before assigning the new value, we still have the old one
            finalLabelSent = getFinalLabelTupleList(curSentLabels)
            prevLabeledSent = sent
        else:
            finalLabelSent = prevLabeledSentLABEL
            prevLabeledSent = sent
                       
        print("Original: ", originalLabel, " | Assigned: ", finalLabelSent)
        print(finalLabelSent, "\t", sent, "\n")
        
        termInCurSent = ""
        curSentLabels.clear()
    #print(termsSeenList)

# label_sentences(takesAFile)
# get_ngrams(takesASentence)
# ruleX(takesASentence)
# main(takesInputFile)

In [None]:
# TODO
# figure out how to detect whether a term has been seen in a file or not. (meaning a term from the title)

def termSeen(listOfTermsSeen, term):
    
    if term in listOfTermsSeen: return 1
    else: return 0
    
"""
    // TODO
    When this returns 0, we can update the value for the respective term in titleTermsSeen from 0 to 1 and 
    we can call CM_4 for each term and it will work only if the term's value is 0, 
    so for each sentence:
        check dictionary, if value is 0, call CM_4
        if CM_4 returns a value "CM", then update the dictionary for that value with 1
        else don't call CM_4 at all
"""

#### MAIN METHOD (LOGIC)

In [20]:
def main(iFile, oPathNoExt):    # main application with all logic following the pseudocode
    correctLabels = 0
    correctCM = 0
    correctCD = 0
    correctAP = 0
    correctSM = 0
    correctEX = 0
    #print(oPathNoExt)

    print("[LABELLING file: ] " + os.path.basename(iFile.name))
    #print("============================================================================\n")
    baseName = oPathNoExt.split(".en", 1)[0]
    OFName = baseName + ".en_AutoRuleLabels.txt"

    sentences = iFile.read().lower().split("\n")
    
    
    ### ---------- Local variables - reset per file ------------------------
    
    originalLabel = ""
    res = {}
    lineNR = 0
    totalLines = len(sentences)
    prevLabeledSent = ""
    prevLabeledSentLABEL = ""
    curSentLabels = []          # All the labels assigned to the current sentence (to get majority vote from it later)
    finalLabelSent = ""
    
    titleTermList = title_getConcept(oPathNoExt)
    termsSeenList = []   # will hold values for main terms
    #accuracy = (countCorLabels/totalSentences) * 100
    #accuracy = (correctLabels/totalLines) * 100
    #print("Accuracy: {0:.2f} %".format(accuracy))
    
    #for term in titleTermList:
        #titleTermsSeen[term] = 0
        
    with open(OFName, "w") as oFile:    # opening the output file to write in the same place where the original file is
        oFile.write("Original|Assigned|Sentence\n".upper())
        for sent in sentences:
            if len(sent) == 0:   #empty line
                totalLines -= 1   # don't count these sentences as part of the labelling and directly assign "NL" to them
                continue
            else:
                lineNR += 1
                splitSent = sent.split("|")
                originalLabel = splitSent[0].upper()
                sent = splitSent[1]

                if originalLabel == "NL":
                    oFile.write(originalLabel+"|"+"NL"+"|"+sent+"\n")
                    totalLines -= 1   # don't count these sentences as part of the labelling and directly assign "NL" to them
                    continue
                else:
                    curSentLabels.append(tuple(updatePriority(SM_1(sent),"SM_1")))
                    curSentLabels.append(tuple(updatePriority(SM_2(sent),"SM_2")))
                    curSentLabels.append(tuple(updatePriority(SM_3(sent),"SM_3")))
                    curSentLabels.append(tuple(updatePriority(SM_4(sent),"SM_4")))
                    curSentLabels.append(tuple(updatePriority(SM_5(sent, lineNR, totalLines),"SM_5")))
                    curSentLabels.append(tuple(updatePriority(SM_6(sent),"SM_6")))
                    curSentLabels.append(tuple(updatePriority(CD_1(sent),"CD_1")))
                    curSentLabels.append(tuple(updatePriority(CD_2(sent),"CD_2")))
                    curSentLabels.append(tuple(updatePriority(CD_3(sent),"CD_3")))
                    curSentLabels.append(tuple(updatePriority(CD_4(sent),"CD_4")))
                    curSentLabels.append(tuple(updatePriority(CD_5(sent),"CD_5")))
                    curSentLabels.append(tuple(updatePriority(CD_6(sent),"CD_1")))
                    curSentLabels.append(tuple(updatePriority(EX_1(sent),"EX_1")))
                    curSentLabels.append(tuple(updatePriority(EX_2(sent),"EX_2")))
                    curSentLabels.append(tuple(updatePriority(AP_1(sent),"AP_1")))
                    curSentLabels.append(tuple(updatePriority(AP_2(sent),"AP_2")))
                    curSentLabels.append(tuple(updatePriority(AP_3(sent),"AP_3")))
                    curSentLabels.append(tuple(updatePriority(CM_1(sent, "04_9-example-dual-spinner-stability.en_labels.txt"),"CM_1")))
                    curSentLabels.append(tuple(updatePriority(CM_2(sent, "04_9-example-dual-spinner-stability.en_labels.txt"),"CM_2")))
                    curSentLabels.append(tuple(updatePriority(CM_3(sent),"CM_3")))
                    #for term in titleTermList:
                     #   curSentLabels.append(tuple(updatePriority(CM_4(sent, termsSeenList, term),"CM_4")))
                    #curSentLabels.append(tuple(updatePriority(CM_4(sent, "04_9-example-dual-spinner-stability.en_labels.txt", 0),"CM_4")))
                    #print(curSentLabels)

                    if getFinalLabelTupleList(curSentLabels) in ("SM", "AP", "EX", "CD", "CM"):
                        prevLabeledSentLABEL = finalLabelSent     # before assigning the new value, we still have the old one
                        finalLabelSent = getFinalLabelTupleList(curSentLabels)
                        prevLabeledSent = sent
                    else:
                        finalLabelSent = prevLabeledSentLABEL
                        prevLabeledSent = sent

                    if originalLabel == finalLabelSent:
                        correctLabels += 1
                        if originalLabel == "CM" and finalLabelSent == "CM": correctCM += 1
                        if originalLabel == "CD" and finalLabelSent == "CD": correctCD += 1
                        if originalLabel == "AP" and finalLabelSent == "AP": correctAP += 1
                        if originalLabel == "SM" and finalLabelSent == "SM": correctSM += 1
                        if originalLabel == "EX" and finalLabelSent == "EX": correctEX += 1

            oFile.write(originalLabel+"|"+finalLabelSent+"|"+sent+"\n")
            #print("Original: ", originalLabel, " | Assigned: ", finalLabelSent, " | Label of prev. sent: ", prevLabeledSentLABEL)
            #print(finalLabelSent, "\t", sent, "\n")

            curSentLabels.clear()
                
    lineNR = lineNR - 2 #removing a line for the header and because at the end of every file there's one empty line
    
    #accuracyCM,accuracyCD,accuracyEX,accuracySM,accuracyAP = 0,0,0,0,0
    accuracy = (correctLabels/totalLines) * 100
    
    try:
        accuracyCM = (correctCM / correctLabels) * 100
        accuracyCD = (correctCD / correctLabels) * 100
        accuracyEX = (correctEX / correctLabels) * 100
        accuracySM = (correctSM / correctLabels) * 100
        accuracyAP = (correctAP / correctLabels) * 100
        print("Accuracy: {0:.2f} %".format(accuracy))
        #print("Accuracy: {0:.2f} % | CM = {1:.2f} % | CD = {2:.2f} % | AP = {3:.2f} % | EX = {4:.2f} % | SM = {5:.2f} % |".format(accuracy, accuracyCM, accuracyCD, accuracyAP, accuracyEX, accuracySM))
    except ZeroDivisionError:
        pass
    #print("Accuracy: {0:.2f} %".format(accuracy))
    
    finalTuple = (accuracy,accuracyCM,accuracyCD,accuracyAP,accuracyEX,accuracySM)
    #print(finalTuple)
    return finalTuple

### ---RUN THIS PART--- (going over all files and calling the main program on each of them)

#### With multiple files

In [5]:
# TODO

#path = r"C:\Users\a.dimitrova\Desktop\Course data Thesis\INTENT MINING"     # Toshiba path
#path = r"C:\Users\ani\Desktop\Course data Thesis\INTENT MINING"    # HP path SINGLE FILE
path = r"C:\Users\ani\Desktop\Course data Thesis\Intent Mining ALL files"  #HP path ALL files
# return [0]accuracy [1]accuracyCM [2]accuracyCD [3]accuracyEX [4]accuracySM [5]accuracyAP

counter = 0
accuracyAllFiles,accuracyAllCM,accuracyAllCD,accuracyAllEX,accuracyAllSM,accuracyAllAP = [],[],[],[],[],[]

start = time.time()

for root, subdirs, files in os.walk(path):

    for curFile in os.listdir(root):

        filePath = os.path.join(root, curFile)

        if os.path.isdir(filePath):
            pass

        else:
            if filePath.endswith(".txt"):
                if filePath.endswith("_AutoRuleLabels.txt"): 
                    pass
                elif filePath.endswith(".txt"):
                    curFileRes = ()
                    counter += 1
                    curFile = open(filePath, 'r', encoding = "ISO-8859-1") #IMPORTANT ENCODING! UTF8 DOESN'T WORK
                    fileExtRemoved = os.path.splitext(os.path.abspath(filePath))[0]

                    # Running the main method and assigning the tuple it returns to the local file tuple
                    curFileRes = main(curFile, fileExtRemoved) 
                    # ----------------------------------------------------------------------------------
                    
                    curFAccuracy,curFCM,curFCD,curFEX,curFSM,curFAP = curFileRes  # assigning parts of tuple to variables
                    accuracyAllFiles.append(curFAccuracy)
                    accuracyAllCM.append(curFCM)
                    accuracyAllCD.append(curFCD)
                    accuracyAllEX.append(curFEX)
                    accuracyAllSM.append(curFSM)
                    accuracyAllAP.append(curFAP)                    
                    
                    curFile.close()
            else:
                continue

print("Average accuracy: {0:.2f} %".format(avg(accuracyAllFiles))) 
print("Average accuracy per label: CM {0:.2f} % | CD {1:.2f} % | EX {2:.2f} % | SM {3:.2f} % | AP {4:.2f} % ".format(avg(accuracyAllCM),avg(accuracyAllCD),avg(accuracyAllEX),avg(accuracyAllSM),avg(accuracyAllAP)))

print("\nTotal number of {} {} files found.".format(counter, "TXT"))

end = time.time()
print("Execution time: {0:.2f} min".format((end - start)/60))

"""
# 62.05 %
# updating priority 62.08 %

OUTPUT:
Average accuracy: 62.08 %
Execution time: 1.35 min
"""

NameError: name 'main' is not defined

#### With a single file

In [28]:
start = time.time()
filePath = r"C:\Users\ani\Desktop\Course data Thesis\INTENT MINING\allMerged.txt"

curFile = open(filePath, 'r', encoding = "ISO-8859-1") #IMPORTANT ENCODING! UTF8 DOESN'T WORK
fileExtRemoved = os.path.splitext(os.path.abspath(filePath))[0]

# Running the main method and assigning the tuple it returns to the local file tuple
curFileRes = main(curFile, fileExtRemoved) 
# ----------------------------------------------------------------------------------

curFAccuracy,curFCM,curFCD,curFEX,curFSM,curFAP = curFileRes  # assigning parts of tuple to variables

curFile.close()

end = time.time()
print("\nExecution time: {0:.2f} min".format((end - start)/60))

[LABELLING file: ] allMerged.txt


IndexError: list index out of range