Jupyter Notebook Shortcuts:
- **A**: Insert cell --**ABOVE**--
- **B**: Insert cell --**BELOW**--
- **M**: Change cell to --**MARKDOWN**--
- **Y**: Change cell to --**CODE**--
    
- **Shift + Tab** will show you the Docstring (**documentation**) for the the object you have just typed in a code cell  you can keep pressing this short cut to cycle through a few modes of documentation.
- **Ctrl + Shift + -** will split the current cell into two from where your cursor is.
- **Esc + O** Toggle cell output.
- **Esc + F** Find and replace on your code but not the outputs.

[MORE SHORTCUTS](https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/)

### ----------------------------------------------------------------------------------------------------------------------------------------------------

## Data analysis

#### Steps:
1. Apply TF-IDF
2. Try Wikipedia linking
3. Try linking with WordNet
4. Try Bag of Words
5. Try other algorithms? 
6. Define a clear dictionary with words for each category
7. Other Classification algorithms?
8. Try finding n-grams

---

### TODO next:
- get the _tokPOStag.txt files
- read and save every line as key-value pair or list of 2 elements
- compare the second element of each line, i.e the POS, match with the wordnet pos tags
- process stemming correctly
- [lemmatize](https://stackoverflow.com/questions/25534214/nltk-wordnet-lemmatizer-shouldnt-it-lemmatize-all-inflections-of-a-word)
- [find n-grams](https://stackoverflow.com/questions/17531684/n-grams-in-python-four-five-six-grams)
- Perform [Bag of Words](https://pythonprogramminglanguage.com/bag-of-words/)

------

#### Offtopic
- [Puthon theory](http://xahlee.info/python/python_basics.html)
- [Text classification](https://gallery.azure.ai/Experiment/Text-Classification-Step-2-of-5-text-preprocessing-2)
- [Preprocessing steps](https://www.kdnuggets.com/2018/03/text-data-preprocessing-walkthrough-python.html)

In [43]:
import os
import sys
import os.path
import string
import time
from unidecode import unidecode

import math
from textblob import TextBlob as tb
import nltk
from nltk.corpus import wordnet as wn
from beautifultable import BeautifulTable
#nltk.download('punkt')
#nltk.download('wordnet')

## TF-IDF implementation
#### ---------------

IF-IDF is implemented in order to check whether the terms extracted from LOs will have anything in common with the terms that would be extracted with manual MOOC analysis and to compare with of the two methods will bring better results in the classification part

Below is the main TF-IDF implementation without any text provided to it yet.

##### Term frequency
\\( tf(t,d) = 0.5 + 0.5 * (\frac{f_{t,d}}{f_{t',d}:t' \in d}) \\) 

##### Inversed document frequency
\\( idf(t,D) = log * (\frac{N}{d \in D  :  t \in d}) \\)

##### Computing tf-idf
\\( tfidf(t,d,D) = tf(t,d) * idf(t,D) \\)

In [44]:
# blob is the the text where to look for the word
def tf(term, doc):
    #return ratio between nr of certain word count and total document word count
    return doc.words.count(term) / len(doc.words)

def docsWithTermIn(term, doclist):
    return sum(1 for doc in doclist if term in doc.words)

def idf(term, doclist):
    return math.log(len(doclist) / (1 + docsWithTermIn(term, doclist)))

def tfidf(term,doc,doclist):
    return tf(term, doc) * idf(term,doclist)

### Running TF-IDF with data

#### TODO: Fix the input, it takes strings, and not files right now

In [56]:
# traverse each folder and sub-folder
# create an array of files to add each file in it
# if the file is TXT, add to the array
# create a String array of documents with the file of the array with files 
# so we can store the contents of each inside
# read each line of each file and save to the strings
# perform algorithms on the documents

# -------------------------------------------------------------------------------------------------------

path = "/media/sf_Shared_Folder/TEST/RAW/03_the-end-of-limits/03_infinity-how-can-i-work-with-that" # FEW TEST FILES
#path = "/media/sf_Shared_Folder/TEST/RAW"   # TEST DATA PATH
#path = "/media/sf_Shared_Folder/Coursera Downloads PreProcessed"   # REAL DATA PATH

docnames = []
counter = 0

# DOCUMENT LIST CONSISTS OF TEXTBLOB files. All input files need to be converted to TEXTBLOB 
# and then saved in this list in order for TF-IDF to work
doclist = []

for root, subdirs, files in os.walk(path):

    for curFile in os.listdir(root):

        filePath = os.path.join(root, curFile)

        if os.path.isdir(filePath):
            pass

        else:
            # check for file extension and if not TXT, continue and disregard the current file
            if not filePath.endswith(".txt"):
                pass
            elif filePath.endswith("_tokens.txt"): 
                try: 
                    counter += 1
                    curFile = open(filePath, 'r', encoding = "ISO-8859-1") #IMPORTANT ENCODING! UTF8 DOESN'T WORK
                    fileExtRemoved = os.path.splitext(os.path.abspath(filePath))[0]
                    docnames.append(curFile)
                    
                    fcontentTBlob = tb(curFile.read())
                    #print(fcontentTBlob)
                    doclist.append(fcontentTBlob)
                    
                finally: 
                    curFile.close()
            else:
                pass

print("Total number of files in docnames[]:", len(docnames))
print("Total number of files in doclist[]:", len(docnames))

Total number of files in docnames[]: 4
Total number of files in doclist[]: 4


In [55]:
# ------------------------------------ TF-IDF --------------------------------------------------------

# arrays to hold the terms found in text and also a custom list to test domain-specific terms
exportedList = []
ownList = {"data management","database","example","iot","lifecycle","bloom","filter","integrity",
           "java","pattern","design pattern","svm","Support vector machine","knn","k-nearest neighbors","machine learning"}

table = BeautifulTable()
table.column_headers = ["TERM", "TF-IDF"]

topNwords = 5;

for i, doc in enumerate(doclist):
    print("\nTop {} terms in document {} | {}".format(topNwords, i + 1, docnames[i]))
    scores = {term: tfidf(term, doc, doclist) for term in doc.words}
    sortedTerms = sorted(scores.items(),key=lambda x: x[1], reverse=True)
    
    for term, score in sortedTerms[:topNwords]:
         table.append_row([term, round(score, 5)]) 
         exportedList.append(term)
    
    print(table)
#    print(exportedWords, "\n")


Top 5 terms in document 1 | <_io.TextIOWrapper name='/media/sf_Shared_Folder/TEST/RAW/03_the-end-of-limits/03_infinity-how-can-i-work-with-that/01_why-is-there-an-x-so-that-f-x-x.en_tokens.txt' mode='r' encoding='ISO-8859-1'>
+---------+--------+
|  TERM   | TF-IDF |
+---------+--------+
| cosine  | 0.015  |
+---------+--------+
|  input  | 0.011  |
+---------+--------+
|  fixed  | 0.007  |
+---------+--------+
|  point  | 0.006  |
+---------+--------+
| between | 0.005  |
+---------+--------+

Top 5 terms in document 2 | <_io.TextIOWrapper name='/media/sf_Shared_Folder/TEST/RAW/03_the-end-of-limits/03_infinity-how-can-i-work-with-that/02_what-does-lim-f-x-infinity-mean.en_tokens.txt' mode='r' encoding='ISO-8859-1'>
+----------+--------+
|   TERM   | TF-IDF |
+----------+--------+
|  cosine  | 0.015  |
+----------+--------+
|  input   | 0.011  |
+----------+--------+
|  fixed   | 0.007  |
+----------+--------+
|  point   | 0.006  |
+----------+--------+
| between  | 0.005  |
+--------

In [57]:
# ----------------------------------------- NLTK, WORDNET -------------------------------------------
print("\n\n------- EXPORTED TERMS in WORDNET ----------") 
for word in exportedList:
    if not wn.synsets(word):
        print("\n", word, ": NO SYNSETS\n")
    else:
        print("\n", word)
        for ss in wn.synsets(word):
            print("- ",ss.name()," | ",ss.definition())

print("\n\n------- CUSTOM TERMS in WORDNET (also domain specific) ----------")    
for word in ownList:
    if not wn.synsets(word):
        print("\n", word, ": NO SYNSETS\n")
    else:
        print("\n", word)
        for ss in wn.synsets(word):
            print("- ",ss.name()," | ",ss.definition())
    



------- EXPORTED TERMS in WORDNET ----------


  for i, line in enumerate(self.open('lexnames')):
  for i, line in enumerate(self.open('index.%s' % suffix)):
  for i, line in enumerate(self.open('index.%s' % suffix)):
  for i, line in enumerate(self.open('index.%s' % suffix)):
  for i, line in enumerate(self.open('index.%s' % suffix)):
  for line in self.open('%s.exc' % suffix):
  for line in self.open('%s.exc' % suffix):
  for line in self.open('%s.exc' % suffix):
  for line in self.open('%s.exc' % suffix):



 cosine
-  cosine.n.01  |  ratio of the adjacent side to the hypotenuse of a right-angled triangle

 input
-  input_signal.n.01  |  signal going into an electronic system
-  remark.n.01  |  a statement that expresses a personal opinion or belief or adds information
-  stimulation.n.02  |  any stimulating information or event; acts to arouse action
-  input.n.04  |  a component of production; something that goes into the production of output
-  input.v.01  |  enter (data or a program) into a computer

 fixed
-  repair.v.01  |  restore by replacing a part or putting together what is torn or broken
-  fasten.v.01  |  cause to be firmly attached
-  specify.v.02  |  decide upon or fix definitely
-  cook.v.02  |  prepare for eating by applying heat
-  pay_back.v.02  |  take vengeance on or get even
-  fix.v.06  |  set or place definitely
-  fix.v.07  |  kill, preserve, and harden (tissue) in order to prepare for microscopic study
-  fixate.v.03  |  make fixed, stable or stationary
-  steril


 divided
-  divide.v.01  |  separate into parts or portions
-  divide.v.02  |  perform a division
-  separate.v.01  |  act as a barrier between; stand between
-  separate.v.12  |  come apart
-  separate.v.07  |  make a division or separation
-  separate.v.02  |  force, take, or pull apart
-  divided.a.01  |  separated into parts or pieces
-  divided.s.02  |  having a median strip or island between lanes of traffic moving in opposite directions
-  divided.s.03  |  distributed in portions (often equal) on the basis of a plan or purpose

 100
-  hundred.n.01  |  ten 10s
-  hundred.s.01  |  being ten more than ninety

 difference
-  difference.n.01  |  the quality of being unlike or dissimilar
-  deviation.n.01  |  a variation that deviates from the standard or norm
-  dispute.n.01  |  a disagreement or argument about something important
-  difference.n.04  |  a significant change
-  remainder.n.03  |  the number that remains after subtraction; the number that when added to the subtrahend

### --------------------------------------------------------------------------------------------------------------------------------------
### NOTES
### --------------------------------------------------------------------------------------------------------------------------------------

**TF-IDF** doesn't output the necessary result, I need n-grams selected as a combined keyword and these are often very general words like `for example` or `key concept` etc. in order to classify the text into the GOAL element. 

**TextBlob** provides options for n-grams and also connection to WordNet ontology which could be useful, so will look more into it.

**WordNet** finds multiple definitions and synsets (synonyms) for most of the general words, however if provided specific e.g. computer science algorithm names, or specific terms, it doesn find any synonyms, nor descriptions of any of them.

**Wikipedia** recognized some of the terms, but not all. For instance if we give it KNN it doesn't find anything, but if we give it K-nearest neighbour, if finds it. This is how the name is in Wikipedia, so that may be the reason. But on Google first returned result for KNN is this article. Same for SVM and Support vector machine. I've modified the script to return "NO DESCRIPTION or DISAMBIGUATION" everytime if finds nopthing ot if there's a disambiguation error, otherwise it wouldn continue checking the rest of the terms. So now it skips the error. 
 
**Full list** of identified key words so far [HERE](https://docs.google.com/spreadsheets/d/1Dj4UAh6U5jAelcsz-gDCdDE9JRVhwaNei0Ctn8m0Ui4/edit?usp=sharing)