Jupyter Notebook Shortcuts:
- **A**: Insert cell --**ABOVE**--
- **B**: Insert cell --**BELOW**--
- **M**: Change cell to --**MARKDOWN**--
- **Y**: Change cell to --**CODE**--
    
- **Shift + Tab** will show you the Docstring (**documentation**) for the the object you have just typed in a code cell  you can keep pressing this short cut to cycle through a few modes of documentation.
- **Ctrl + Shift + -** will split the current cell into two from where your cursor is.
- **Esc + O** Toggle cell output.
- **Esc + F** Find and replace on your code but not the outputs.

[MORE SHORTCUTS](https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/)

### ----------------------------------------------------------------------------------------------------------------------------------------------------

## Data analysis

#### Steps:
1. Apply TF-IDF
2. Try Wikipedia linking
3. Try linking with WordNet
4. Try Bag of Words
5. Try other algorithms? 
6. Define a clear dictionary with words for each category
7. Other Classification algorithms?
8. Try finding n-grams

---

### TODO next:
- get the _tokPOStag.txt files
- read and save every line as key-value pair or list of 2 elements
- compare the second element of each line, i.e the POS, match with the wordnet pos tags
- process stemming correctly
- [lemmatize](https://stackoverflow.com/questions/25534214/nltk-wordnet-lemmatizer-shouldnt-it-lemmatize-all-inflections-of-a-word)
- [find n-grams](https://stackoverflow.com/questions/17531684/n-grams-in-python-four-five-six-grams)
- Perform [Bag of Words](https://pythonprogramminglanguage.com/bag-of-words/)

------

#### Offtopic
- [Puthon theory](http://xahlee.info/python/python_basics.html)
- [Text classification](https://gallery.azure.ai/Experiment/Text-Classification-Step-2-of-5-text-preprocessing-2)
- [Preprocessing steps](https://www.kdnuggets.com/2018/03/text-data-preprocessing-walkthrough-python.html)

In [5]:
import os
import sys
import os.path
import string
import time
import pathlib
from unidecode import unidecode
import pprint
from tabulate import tabulate

import scipy
import numpy
import sklearn
import math
from textblob import TextBlob as tb
import nltk
from nltk.corpus import wordnet as wn
from beautifultable import BeautifulTable
#nltk.download('punkt')
#nltk.download('wordnet')

## TF-IDF implementation
#### ---------------

IF-IDF is implemented in order to check whether the terms extracted from LOs will have anything in common with the terms that would be extracted with manual MOOC analysis and to compare with of the two methods will bring better results in the classification part

Below is the main TF-IDF implementation without any text provided to it yet.

##### Term frequency
\\( tf(t,d) = 0.5 + 0.5 * (\frac{f_{t,d}}{f_{t',d}:t' \in d}) \\) 

##### Inversed document frequency
\\( idf(t,D) = log * (\frac{N}{d \in D  :  t \in d}) \\)

##### Computing tf-idf
\\( tfidf(t,d,D) = tf(t,d) * idf(t,D) \\)

In [9]:
# blob is the the text where to look for the word
def tf(term, doc):
    #return ratio between nr of certain word count and total document word count
    return doc.words.count(term) / len(doc.words)

def docsWithTermIn(term, doclist):
    return sum(1 for doc in doclist if term in doc.words)

def idf(term, doclist):
    return math.log(len(doclist) / (1 + docsWithTermIn(term, doclist)))

def tfidf(term,doc,doclist):
    return tf(term, doc) * idf(term,doclist)

### Running TF-IDF with data

#### TODO: Fix the input, it takes strings, and not files right now

In [13]:
# traverse each folder and sub-folder
# create an array of files to add each file in it
# if the file is TXT, add to the array
# create a String array of documents with the file of the array with files 
# so we can store the contents of each inside
# read each line of each file and save to the strings
# perform algorithms on the documents

# -------------------------------------------------------------------------------------------------------

# WINDOWS
#path = r"C:\Users\ani\Desktop\Course data Thesis\one file"
path = r"C:\Users\ani\Desktop\Course data Thesis\Course test files"

# LINUX
#path = "/media/sf_Shared_Folder/TEST/one file" # FEW TEST FILES
#path = "/media/sf_Shared_Folder/TEST/RAW"   # TEST DATA PATH
#path = "/media/sf_Shared_Folder/Coursera Downloads PreProcessed"   # REAL DATA PATH

docnames = []
counter = 0

# DOCUMENT LIST CONSISTS OF TEXTBLOB files. All input files need to be converted to TEXTBLOB 
# and then saved in this list in order for TF-IDF to work
doclist = []

for root, subdirs, files in os.walk(path):

    for curFile in os.listdir(root):

        filePath = os.path.join(root, curFile)

        if os.path.isdir(filePath):
            pass

        else:
            # check for file extension and if not TXT, continue and disregard the current file
            if not filePath.endswith(".txt"):
                pass
            elif filePath.endswith("_lemmatized.txt"): 
                try: 
                    counter += 1
                    curFile = open(filePath, 'r', encoding = "ISO-8859-1") #IMPORTANT ENCODING! UTF8 DOESN'T WORK
                    fileExtRemoved = os.path.splitext(os.path.abspath(filePath))[0]
                    docnames.append(curFile)
                    
                    fcontentTBlob = tb(curFile.read())
                    #print(fcontentTBlob)
                    doclist.append(fcontentTBlob)
                    
                    # bag of words processing:
                    
                finally: 
                    curFile.close()
            else:
                pass

print("Total number of files in docnames[]:", len(docnames))
print("Total number of files in doclist[]:", len(docnames))

Total number of files in docnames[]: 3
Total number of files in doclist[]: 3


In [14]:
# ------------------------------------ TF-IDF --------------------------------------------------------

# arrays to hold the terms found in text and also a custom list to test domain-specific terms
exportedList = []
ownList = {"data management","database","example","iot","lifecycle","bloom","filter","integrity",
           "java","pattern","design pattern","svm","Support vector machine","knn","k-nearest neighbors","machine learning"}

table = BeautifulTable()
table.column_headers = ["TERM", "TF-IDF"]

topNwords = 10;

for i, doc in enumerate(doclist):
    print("\nTop {} terms in document {} | {}".format(topNwords, i + 1, docnames[i]))
    scores = {term: tfidf(term, doc, doclist) for term in doc.words}
    sortedTerms = sorted(scores.items(),key=lambda x: x[1], reverse=True)
    
    for term, score in sortedTerms[:topNwords]:
        #print(table.append_row([term, round(score, 5)]))
        print("\tTERM: {} \t|\t TF-IDF: {}".format(term, round(score, 5)))
        exportedList.append(term)
        #print tabulate([term, round(score, 5)], headers=['tTERM', 'TF-IDF'])
        


Top 10 terms in document 1 | <_io.TextIOWrapper name='C:\\Users\\ani\\Desktop\\Course data Thesis\\Course test files\\01_driving-robots-around.en_lemmatized.txt' mode='r' encoding='ISO-8859-1'>
	TERM: plan 	|	 TF-IDF: 0.00544
	TERM: goal 	|	 TF-IDF: 0.00454
	TERM: mod 	|	 TF-IDF: 0.00363
	TERM: sun 	|	 TF-IDF: 0.00272
	TERM: understand 	|	 TF-IDF: 0.00272
	TERM: allow 	|	 TF-IDF: 0.00272
	TERM: adv 	|	 TF-IDF: 0.00272
	TERM: build 	|	 TF-IDF: 0.00272
	TERM: switch 	|	 TF-IDF: 0.00272
	TERM: introduc 	|	 TF-IDF: 0.00181

Top 10 terms in document 2 | <_io.TextIOWrapper name='C:\\Users\\ani\\Desktop\\Course data Thesis\\Course test files\\02_differential-drive-robots.en_lemmatized.txt' mode='r' encoding='ISO-8859-1'>
	TERM: sub 	|	 TF-IDF: 0.01138
	TERM: dot 	|	 TF-IDF: 0.00854
	TERM: paramet 	|	 TF-IDF: 0.0064
	TERM: input 	|	 TF-IDF: 0.0064
	TERM: car 	|	 TF-IDF: 0.00285
	TERM: vr 	|	 TF-IDF: 0.00285
	TERM: vl 	|	 TF-IDF: 0.00285
	TERM: unicyc 	|	 TF-IDF: 0.00285
	TERM: spee 	|	 TF-IDF

In [15]:
# ----------------------------------------- NLTK, WORDNET -------------------------------------------
print("\n\n------- EXPORTED TERMS in WORDNET ----------") 
for word in exportedList:
    if not wn.synsets(word):
        print("\n", word, ": NO SYNSETS\n")
    else:
        print("\n", word)
        for ss in wn.synsets(word):
            print("- ",ss.name()," | ",ss.definition())

print("\n\n------- CUSTOM TERMS in WORDNET (also domain specific) ----------")    
for word in ownList:
    if not wn.synsets(word):
        print("\n", word, ": NO SYNSETS\n")
    else:
        print("\n", word)
        for ss in wn.synsets(word):
            print("- ",ss.name()," | ",ss.definition())
    



------- EXPORTED TERMS in WORDNET ----------


  for i, line in enumerate(self.open('lexnames')):
  for i, line in enumerate(self.open('index.%s' % suffix)):
  for i, line in enumerate(self.open('index.%s' % suffix)):
  for i, line in enumerate(self.open('index.%s' % suffix)):
  for i, line in enumerate(self.open('index.%s' % suffix)):
  for line in self.open('%s.exc' % suffix):
  for line in self.open('%s.exc' % suffix):
  for line in self.open('%s.exc' % suffix):
  for line in self.open('%s.exc' % suffix):



 plan
-  plan.n.01  |  a series of steps to be carried out or goals to be accomplished
-  design.n.02  |  an arrangement scheme
-  plan.n.03  |  scale drawing of a structure
-  plan.v.01  |  have the will and intention to carry out some action
-  plan.v.02  |  make plans for something
-  plan.v.03  |  make or work out a plan for; devise
-  design.v.04  |  make a design of; plan out in systematic, often graphic form

 goal
-  goal.n.01  |  the state of affairs that a plan is intended to achieve and that (when achieved) terminates behavior intended to achieve it
-  finish.n.04  |  the place designated as the end (as of a race or journey)
-  goal.n.03  |  game equipment consisting of the place toward which players of a game try to advance a ball or puck in order to score points
-  goal.n.04  |  a successful attempt at scoring

 mod
-  mod.n.01  |  a British teenager or young adult in the 1960s; noted for their clothes consciousness and opposition to the rockers
-  mod.s.01  |  relating t

### Bag of Words, and all the rest

In [121]:
# Algorithms

# BoF
def bagOfWords(iFile,POSfileIn):
    
    from sklearn.feature_extraction.text import CountVectorizer
    vectorizer = CountVectorizer()
    data_corpus = []  
    
    if iFile.name.endswith("_lemmatized.txt"):
        print("\n[Bag of words: ]\t" + os.path.basename(iFile.name))

        baseName = iFile.name.split(".en", 1)[0]
        OFName = baseName + ".en_tokPOStag.txt" 
        
        LemmafileCont = iFile.read().split()
        #POSfileCont = POSfileIn.read().split()
       
        listLemmas = []
        with open(OFName, "w") as oFile:
            for line in LemmafileCont:
                LemmafileCont.append(line)
                
            listLemmas = findRealTerm(listLemmas,POSfileIn)
            
            text = ""
            for line in LemmafileCont:
                #fullTerm = findRealTerm(iFile,line,POSfileList)
                #print(fullTerm)
                #text = text," ",fullTerm," "
                #fullT = findRealTerm(line,POSfileIn)
                #findRealTerm(line,POSfileIn)   # Look for the full term from the lemma
                text += line+" "
                #pass
                continue
            data_corpus.append(text)
        
        #print(data_corpus)
        vector = vectorizer.fit_transform(data_corpus) 
        #print(vector.toarray())
        #print(vectorizer.get_feature_names())  
        
        array = vector.toarray()
        featureNames = vectorizer.get_feature_names()
        
    else:
        pass

# ----------------------------------------- FINISH ---------------------------------------------

# Match the lemma to first found word from file _lemmatized.txt for that file
def findRealTerm(listLemmas, POSfileIn):   
    # take the path of the input file and look for the file ending on "_stemmedbyPOS.txt"
    # split each line into 3, look in line[3] for the first match of the current lemma
    # when found, take line[0] which is the full word and return that word
    # exit the function
    #print("Looking for term for lemma: " + lemmaIn)

    #print("Searching for term")
    #res = ""
    resList = []
    posfileCont = POSfileIn.read().split()

    for curLemma in listLemmas:
        for line in posfileCont:
            line = line.split(",")
            #print(len(line))

            word = line[0]
            lemma = line[2]

            if lemma == curLemma:
                print("WORD: ", word, " for LEMMA: ", lemma)
                res = res.append(word)

    return resList

In [122]:
#path = r"C:\Users\a.dimitrova\Desktop\TEST data\PROCESSED\Course-data\mobile-robot\02_mobile-robots\01_week-2"
path = "/media/sf_Shared_Folder/TEST/one file"

# LINUX
#path = "/media/sf_Shared_Folder/TEST/one file" # FEW TEST FILES
#path = "/media/sf_Shared_Folder/TEST/RAW"   # TEST DATA PATH
#path = "/media/sf_Shared_Folder/Coursera Downloads PreProcessed"   # REAL DATA PATH

counter = 0
POSfiles = []

# --- Collecting the POS files
for root, subdirs, files in os.walk(path):

    for curFile in os.listdir(root):

        filePath = os.path.join(root, curFile)

        if os.path.isdir(filePath):
            pass

        else:
            # create a list of files for POS so that it can be sent along with BoF to look for the right file and terms
            if filePath.endswith("_stemmedbyPOS.txt"):
                curFile = open(filePath, 'r', encoding = "ISO-8859-1") #IMPORTANT ENCODING! UTF8 DOESN'T WORK
                baseName = os.path.basename(curFile.name.split(".en", 1)[0])
                curFilePOS = baseName+".en_stemmedbyPOS.txt"
                POSfiles.append(curFilePOS)
            else:
                pass

# --------------------------------------------------------------------------------
            
# --- processing Lemmatized files with Algos
for root, subdirs, files in os.walk(path):

    for curFile in os.listdir(root):

        filePath = os.path.join(root, curFile)

        if os.path.isdir(filePath):
            pass

        else:                
            # check for file extension and if not TXT, continue and disregard the current file
            if filePath.endswith("_lemmatized.txt"): 
                try: 
                    counter += 1
                    curFile = open(filePath, 'r', encoding = "ISO-8859-1") #IMPORTANT ENCODING! UTF8 DOESN'T WORK
                    
                    baseName = curFile.name.split(".en", 1)[0]
                    POSfileName = baseName+".en_stemmedbyPOS.txt"
                    if os.path.basename(POSfileName) in POSfiles:
                        
                        try:
                            lookFile = open(POSfileName, 'r', encoding = "ISO-8859-1") #IMPORTANT ENCODING! UTF8 DOESN'T WORK
                            bagOfWords(curFile,lookFile)
                            #findRealTerm(curFile,"wiggl",lookFile)
                        # bag of words processing:
                        #findRealTerm(curFile,item,POSfile)
                            
                    #unifyLemmas(curFile,POSfiles)
                        finally: 
                            lookFile.close()
                finally: 
                    curFile.close()            
            else:
                pass

#print("Total number of POS Files[]:", len(POSfiles))

  from ipykernel import kernelapp as app
  from ipykernel import kernelapp as app
  from ipykernel import kernelapp as app



[Bag of words: ]	01_what-is-the-definition-of-derivative.en_lemmatized.txt


KeyboardInterrupt: 

### --------------------------------------------------------------------------------------------------------------------------------------
### NOTES
### --------------------------------------------------------------------------------------------------------------------------------------

**TF-IDF** doesn't output the necessary result, I need n-grams selected as a combined keyword and these are often very general words like `for example` or `key concept` etc. in order to classify the text into the GOAL element. 

**TextBlob** provides options for n-grams and also connection to WordNet ontology which could be useful, so will look more into it.

**WordNet** finds multiple definitions and synsets (synonyms) for most of the general words, however if provided specific e.g. computer science algorithm names, or specific terms, it doesn find any synonyms, nor descriptions of any of them.

**Wikipedia** recognized some of the terms, but not all. For instance if we give it KNN it doesn't find anything, but if we give it K-nearest neighbour, if finds it. This is how the name is in Wikipedia, so that may be the reason. But on Google first returned result for KNN is this article. Same for SVM and Support vector machine. I've modified the script to return "NO DESCRIPTION or DISAMBIGUATION" everytime if finds nopthing ot if there's a disambiguation error, otherwise it wouldn continue checking the rest of the terms. So now it skips the error. 
 
**Full list** of identified key words so far [HERE](https://docs.google.com/spreadsheets/d/1Dj4UAh6U5jAelcsz-gDCdDE9JRVhwaNei0Ctn8m0Ui4/edit?usp=sharing)