Jupyter Notebook Shortcuts:
- **A**: Insert cell --**ABOVE**--
- **B**: Insert cell --**BELOW**--
- **M**: Change cell to --**MARKDOWN**--
- **Y**: Change cell to --**CODE**--
    
- **Shift + Tab** will show you the Docstring (**documentation**) for the the object you have just typed in a code cell  you can keep pressing this short cut to cycle through a few modes of documentation.
- **Ctrl + Shift + -** will split the current cell into two from where your cursor is.
- **Esc + O** Toggle cell output.
- **Esc + F** Find and replace on your code but not the outputs.

[MORE SHORTCUTS](https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/)

### ----------------------------------------------------------------------------------------------------------------------------------------------------

## Data analysis

#### Steps:
1. Apply TF-IDF
2. Try Wikipedia linking
3. Try linking with WordNet
4. Try Bag of Words
5. Try other algorithms? 
6. Define a clear dictionary with words for each category
7. Other Classification algorithms?
8. Try finding n-grams

---

### TODO next:
- get the _tokPOStag.txt files
- read and save every line as key-value pair or list of 2 elements
- compare the second element of each line, i.e the POS, match with the wordnet pos tags
- process stemming correctly
- [lemmatize](https://stackoverflow.com/questions/25534214/nltk-wordnet-lemmatizer-shouldnt-it-lemmatize-all-inflections-of-a-word)
- [find n-grams](https://stackoverflow.com/questions/17531684/n-grams-in-python-four-five-six-grams)
- Perform [Bag of Words](https://pythonprogramminglanguage.com/bag-of-words/)

------

#### Offtopic
- [Puthon theory](http://xahlee.info/python/python_basics.html)
- [Text classification](https://gallery.azure.ai/Experiment/Text-Classification-Step-2-of-5-text-preprocessing-2)
- [Preprocessing steps](https://www.kdnuggets.com/2018/03/text-data-preprocessing-walkthrough-python.html)

In [83]:
import os
import sys
import os.path
import string
import time
import pathlib
from unidecode import unidecode
import pprint
from tabulate import tabulate

import scipy
import numpy
import sklearn
import math
from textblob import TextBlob as tb
import nltk
from nltk.corpus import wordnet as wn
from beautifultable import BeautifulTable
#nltk.download('punkt')
#nltk.download('wordnet')

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

## TF-IDF implementation
#### ---------------

IF-IDF is implemented in order to check whether the terms extracted from LOs will have anything in common with the terms that would be extracted with manual MOOC analysis and to compare with of the two methods will bring better results in the classification part

Below is the main TF-IDF implementation without any text provided to it yet.

##### Term frequency
\\( tf(t,d) = 0.5 + 0.5 * (\frac{f_{t,d}}{f_{t',d}:t' \in d}) \\) 

##### Inversed document frequency
\\( idf(t,D) = log * (\frac{N}{d \in D  :  t \in d}) \\)

##### Computing tf-idf
\\( tfidf(t,d,D) = tf(t,d) * idf(t,D) \\)

In [None]:
# blob is the the text where to look for the word
def tf(term, doc):
    #return ratio between nr of certain word count and total document word count
    return doc.words.count(term) / len(doc.words)

def docsWithTermIn(term, doclist):
    return sum(1 for doc in doclist if term in doc.words)

def idf(term, doclist):
    return math.log(len(doclist) / (1 + docsWithTermIn(term, doclist)))

def tfidf(term,doc,doclist):
    return tf(term, doc) * idf(term,doclist)

### Running TF-IDF with data

#### TODO: Fix the input, it takes strings, and not files right now

In [None]:
# traverse each folder and sub-folder
# create an array of files to add each file in it
# if the file is TXT, add to the array
# create a String array of documents with the file of the array with files 
# so we can store the contents of each inside
# read each line of each file and save to the strings
# perform algorithms on the documents

# -------------------------------------------------------------------------------------------------------

# WINDOWS
#path = r"C:\Users\ani\Desktop\Course data Thesis\one file"
path = r"C:\Users\ani\Desktop\Course data Thesis\Course test files"

# LINUX
#path = "/media/sf_Shared_Folder/TEST/one file" # FEW TEST FILES
#path = "/media/sf_Shared_Folder/TEST/RAW"   # TEST DATA PATH
#path = "/media/sf_Shared_Folder/Coursera Downloads PreProcessed"   # REAL DATA PATH

docnames = []
counter = 0

# DOCUMENT LIST CONSISTS OF TEXTBLOB files. All input files need to be converted to TEXTBLOB 
# and then saved in this list in order for TF-IDF to work
doclist = []

for root, subdirs, files in os.walk(path):

    for curFile in os.listdir(root):

        filePath = os.path.join(root, curFile)

        if os.path.isdir(filePath):
            pass

        else:
            # check for file extension and if not TXT, continue and disregard the current file
            if not filePath.endswith(".txt"):
                pass
            elif filePath.endswith("_lemmatized.txt"): 
                try: 
                    counter += 1
                    curFile = open(filePath, 'r', encoding = "ISO-8859-1") #IMPORTANT ENCODING! UTF8 DOESN'T WORK
                    fileExtRemoved = os.path.splitext(os.path.abspath(filePath))[0]
                    docnames.append(curFile)
                    
                    fcontentTBlob = tb(curFile.read())
                    #print(fcontentTBlob)
                    doclist.append(fcontentTBlob)
                    
                    # bag of words processing:
                    
                finally: 
                    curFile.close()
            else:
                pass

print("Total number of files in docnames[]:", len(docnames))
print("Total number of files in doclist[]:", len(docnames))

In [None]:
# ------------------------------------ TF-IDF --------------------------------------------------------

# arrays to hold the terms found in text and also a custom list to test domain-specific terms
exportedList = []
ownList = {"data management","database","example","iot","lifecycle","bloom","filter","integrity",
           "java","pattern","design pattern","svm","Support vector machine","knn","k-nearest neighbors","machine learning"}

table = BeautifulTable()
table.column_headers = ["TERM", "TF-IDF"]

topNwords = 10;

for i, doc in enumerate(doclist):
    print("\nTop {} terms in document {} | {}".format(topNwords, i + 1, docnames[i]))
    scores = {term: tfidf(term, doc, doclist) for term in doc.words}
    sortedTerms = sorted(scores.items(),key=lambda x: x[1], reverse=True)
    
    for term, score in sortedTerms[:topNwords]:
        #print(table.append_row([term, round(score, 5)]))
        print("\tTERM: {} \t|\t TF-IDF: {}".format(term, round(score, 5)))
        exportedList.append(term)
        #print tabulate([term, round(score, 5)], headers=['tTERM', 'TF-IDF'])
        

In [None]:
# ----------------------------------------- NLTK, WORDNET -------------------------------------------
print("\n\n------- EXPORTED TERMS in WORDNET ----------") 
for word in exportedList:
    if not wn.synsets(word):
        print("\n", word, ": NO SYNSETS\n")
    else:
        print("\n", word)
        for ss in wn.synsets(word):
            print("- ",ss.name()," | ",ss.definition())

print("\n\n------- CUSTOM TERMS in WORDNET (also domain specific) ----------")    
for word in ownList:
    if not wn.synsets(word):
        print("\n", word, ": NO SYNSETS\n")
    else:
        print("\n", word)
        for ss in wn.synsets(word):
            print("- ",ss.name()," | ",ss.definition())
    

### Bag of Words, and all the rest

In [84]:
# Algorithms

# BoF
def bagOfWords(iFilePath,iPOSfPath,choice):
    
    from sklearn.feature_extraction.text import CountVectorizer
    vectorizer = CountVectorizer()
    data_corpus = []  
    
    if iFilePath.endswith("_lemmatized.txt"):
        print("[Bag of words: ]\t" + iFilePath+"\n")

        baseName = iFilePath.split(".en", 1)[0]
        OFName = baseName + ".en_FullLemTerm.txt" 
        
        try:
            iLemmaFile = open(iFilePath, 'r', encoding = "ISO-8859-1")  # open lemma file in read mode
            LemmafileCont = iLemmaFile.read().split()   # read file content and save it into the string variable

            text = ""
            for line in LemmafileCont:
                fullTerm = findRealTerm(line,iPOSfPath)
                text += fullTerm+" "
                
                if(choice == 0):
                    pass
                elif(choice == 1):
                    with open(OFName, "w") as oFile:
                        oFile.write(fullTerm+"\n")
                else:
                    print("Invalid output file option.. 0 - NO file, 1 - SAVE file")
                    break
                
                continue
            data_corpus.append(text)
            
            #print(data_corpus)
            #vector = vectorizer.fit_transform(data_corpus).todense() 
            #print(vector.toarray())
            print(vectorizer.get_feature_names())  

            #array = vector.toarray()
            #featureNames = vectorizer.get_feature_names()
            
            print(vectorizer.vocabulary_)
            
            features = vectorizer.fit_transform(data_corpus).todense() 
            print(vectorizer.vocabulary_)

            for f in features:
                print(euclidean_distances(features[0], f))
            
        finally:
            iLemmaFile.close()              
        
    else:
        pass

#### ----------------------------------------------------------------------------------------------------------------------------------------

In [85]:
# Match the lemma to first found word from file _lemmatized.txt for that file
def findRealTerm(lemmaIn, POSfilePath):   
    # take the path of the input file and look for the file ending on "_stemmedbyPOS.txt"
    # split each line into 3, look in line[3] for the first match of the current lemma
    # when found, take line[0] which is the full word and return that word
    # exit the function

    #print("Searching for term")
    res = ""
    try:
        iPOSFile = open(POSfilePath, 'r', encoding = "ISO-8859-1")  # open POS file in read mode
        posfileCont = iPOSFile.read().split()   # read file content and save it into the string variable
    
        for line in posfileCont:
            line = line.split(",")

            word = line[0]
            lemma = line[2]

            if lemma == lemmaIn:
                #print("WORD: ", word, " || LEMMA: ", lemma)
                res = word
                break
    
    finally:
        iPOSFile.close()
        
    if res == "":
        res = "NOT FOUND: "+lemmaIn
        
    return res

#### ----------------------------------------------------------------------------------------------------------------------------------------

In [82]:
path = r"C:\Users\ani\Desktop\Course data Thesis\Course test files"
#path = "/media/sf_Shared_Folder/TEST/one file"

# LINUX
#path = "/media/sf_Shared_Folder/TEST/one file" # FEW TEST FILES
#path = "/media/sf_Shared_Folder/TEST/RAW"   # TEST DATA PATH
#path = "/media/sf_Shared_Folder/Coursera Downloads PreProcessed"   # REAL DATA PATH

counter = 0
POSfiles = []

# --- Collecting the POS files
for root, subdirs, files in os.walk(path):

    for curFile in os.listdir(root):

        curFilePath = os.path.join(root, curFile)

        if os.path.isdir(curFilePath):
            pass

        else:
            # create a list of files for POS so that it can be sent along with BoF to look for the right file and terms
            if curFilePath.endswith("_stemmedbyPOS.txt"):
                curFile = open(curFilePath, 'r', encoding = "ISO-8859-1") #IMPORTANT ENCODING! UTF8 DOESN'T WORK
                baseName = os.path.basename(curFile.name.split(".en", 1)[0])
                curFilePOS = baseName+".en_stemmedbyPOS.txt"
                POSfiles.append(curFilePOS)
            else:
                pass

# --------------------------------------------------------------------------------
            
# --- processing Lemmatized files with Algos
for root, subdirs, files in os.walk(path):

    for curFile in os.listdir(root):

        curFilePath = os.path.join(root, curFile)

        if os.path.isdir(curFilePath):
            pass

        else:                
            # check for file extension and if not TXT, continue and disregard the current file
            if curFilePath.endswith("_lemmatized.txt"): 
                counter += 1
                try:
                    # need this only to extract the file path and send it to the algorithm later. Send path, !not file!
                    tempFile = open(curFilePath, 'r', encoding = "ISO-8859-1")                    

                    baseName = tempFile.name.split(".en", 1)[0]
                    POSfilePath = baseName+".en_stemmedbyPOS.txt"

                    if os.path.basename(POSfilePath) in POSfiles:
                        print("\n\nprocessing.. " + POSfilePath)
                        
                        # ---------- bag of words processing: ------------
                        # last index is whether an output file to be saved or not. 0 - NO, 1 - YES
                        bagOfWords(curFilePath,POSfilePath,0)  
                        
                finally:
                    tempFile.close()
            else:
                pass
#print("Total number of POS Files[]:", len(POSfiles))



processing.. C:\Users\ani\Desktop\Course data Thesis\Course test files\01_driving-robots-around.en_stemmedbyPOS.txt
[Bag of words: ]	C:\Users\ani\Desktop\Course data Thesis\Course test files\01_driving-robots-around.en_lemmatized.txt



  from ipykernel import kernelapp as app
  from ipykernel import kernelapp as app
  from ipykernel import kernelapp as app


NotFittedError: CountVectorizer - Vocabulary wasn't fitted.

### --------------------------------------------------------------------------------------------------------------------------------------
### NOTES
### --------------------------------------------------------------------------------------------------------------------------------------

**TF-IDF** doesn't output the necessary result, I need n-grams selected as a combined keyword and these are often very general words like `for example` or `key concept` etc. in order to classify the text into the GOAL element. 

**TextBlob** provides options for n-grams and also connection to WordNet ontology which could be useful, so will look more into it.

**WordNet** finds multiple definitions and synsets (synonyms) for most of the general words, however if provided specific e.g. computer science algorithm names, or specific terms, it doesn find any synonyms, nor descriptions of any of them.

**Wikipedia** recognized some of the terms, but not all. For instance if we give it KNN it doesn't find anything, but if we give it K-nearest neighbour, if finds it. This is how the name is in Wikipedia, so that may be the reason. But on Google first returned result for KNN is this article. Same for SVM and Support vector machine. I've modified the script to return "NO DESCRIPTION or DISAMBIGUATION" everytime if finds nopthing ot if there's a disambiguation error, otherwise it wouldn continue checking the rest of the terms. So now it skips the error. 
 
**Full list** of identified key words so far [HERE](https://docs.google.com/spreadsheets/d/1Dj4UAh6U5jAelcsz-gDCdDE9JRVhwaNei0Ctn8m0Ui4/edit?usp=sharing)