## TF-IDF implementation
### --------------------------------------------------------------------------------------------------------------------------------------

IF-IDF is implemented in order to check whether the terms extracted from LOs will have anything in common with the terms that would be extracted with manual MOOC analysis and to compare with of the two methods will bring better results in the classification part

Below is the main TF-IDF implementation without any text provided to it yet.

##### Term frequency
\\( tf(t,d) = 0.5 + 0.5 * (\frac{f_{t,d}}{f_{t',d}:t' \in d}) \\) 

##### Inversed document frequency
\\( idf(t,D) = log * (\frac{N}{d \in D  :  t \in d}) \\)

##### Computing tf-idf
\\( tfidf(t,d,D) = tf(t,d) * idf(t,D) \\)

In [24]:
import math
from textblob import TextBlob as tb
import nltk
from nltk.corpus import wordnet as wn
from beautifultable import BeautifulTable
#nltk.download('punkt')
#nltk.download('wordnet')

# blob is the the text where to look for the word
def tf(term, doc):
    #return ratio between nr of certain word count and total document word count
    return doc.words.count(term) / len(doc.words)

def docsWithTermIn(term, doclist):
    return sum(1 for doc in doclist if term in doc.words)

def idf(term, doclist):
    return math.log(len(doclist) / (1 + docsWithTermIn(term, doclist)))

def tfidf(term,doc,doclist):
    return tf(term, doc) * idf(term,doclist)

### --------------------------------------------------------------------------------------------------------------------------------------
### Now it's time to supply several documents and see what happens:
### --------------------------------------------------------------------------------------------------------------------------------------

In [25]:
# traverse each folder and sub-folder
# create an array of files to add each file in it
# if the file is TXT, add to the array
# create a String array of documents with the file of the array with files 
# so we can store the contents of each inside
# read each line of each file and save to the strings
# process each string by tokenization, lemmatization etc. 
# perform tf-idf on the documents


# 01-understanding-research-data/01_research-data-defined.en.txt
document1 = tb("""[sound] so, who knows what a function is, right
but i know what it does
it takes an input value, and produces an output value
and we've got a whole bunch of functions, right
and we can take these functions and start asking questions about them
what happens when you plug in a really big number, or a really small number
or, or what happens when you plug in two numbers that are nearby each other
how are the outputs related, right
those are the kinds of questions that are going to occupy us for the rest of the term
but even before we start thinking about questions like that, right
there are some things that we can still ask about functions
like, how do you know when two functions are the same function
for instance, here's two functions
f(x) = (1+x)^2, g(x) = x^2 + 2x + 1
are these the same function
now, let's try
look at the value like f(2)
f(2) = (1+2)^2, start replacing the x by two, 1 + 2 = 3
3^2 = 9
well, what's what's g(2)
well, g(2) would be 2^2 + 2 * 2 + 1
2^2 = 4, 2 * 2 = 4 + 1, 4 + 4 = 8 + 1 = 9
look, f and g, when i plug in x = 2 give me the same output value of nine
and that should be a little bit surprising, right
because the way that f and g are telling me to compute their output is totally different
f takes the input two, adds one to it and squares it to get nine
g takes two, squares it, doubles it, adds those two numbers to one, to get nine
so, the method by which f and g are doing the calculations is totally different, right
this sequence of operations is not the same as this sequence of operations
the, the rules are different
and yet, look at this
f(x), for any value of x, right
is 1 + x * 1 + x, right
that's 1 + x^2
well, i could expand this out, right
1 * 1 + x, and then x * 1 + x
i could combine some of these terms, right
1 + x + x = 2x
x * x = x^2
look, 1 + 2x + x^2, that's g(x)
this is really quite surprising
f and g don't compute their output in the same way, right
this one is doing something different than this function, and yet, for any input value, f's output value is this, which is the same by expanding out as g(x)
now, how we're going to deal with this
we're going to say that f and g are at the same function, right
not because they have the same rule, right
but because for every input value, they have the same output value
here's a much more subtle example
again, i got two functions
f is defined like this
f(x) = x^2 / x, and g is defined like this, g(x) is just x, the identity function
same question, is f the same as g
are these the same function
now, they're not the same rule, right
this is not the same as this
so, you know, it's a little more subtle, you know
but that's okay, right
two functions are the same if they have the same output for each input
so, let's see if that happens here
let's just pick some value to get a first test
let's take a look at f(5), right
f(5) would be 5^2 / 5, that's 25
5^2 / 5, that's 5
well, that's the same as g(5), right
if i plug anything into g, i just get the same thing out
so, plug in five, you get five
so, at least at the value five, f and g agree
you might think this always works, right
because of something like this
you might want to say, well, f(x) that's x^2 / x, no matter what x is
you might rewrite this x^22 as x * x / x
and then, you'd be tempted to say, cancel one of these xes with the x in the denominator
and then, you'd write equals x
and x, well that's, that's g(x)
so, this looks like a pretty convincing argument, right
over here, i've got f of x, i've got a bunch of equal signs
and over here, i've got g(x)
so maybe that means f and g are the same function
ha, but not so fast
what happens if you plug in zero
what's f(0)
well, i know what g(0) is
g(0) is zero, right
zero is in the domain of g because zero makes sense for this rule
but, what's f(0)
well, that would be zero squared over zero, whoa
okay
you see this is terrible, right
i cannot divide by zero
this rule, x^2 / x doesn't make sense when x is equal to zero
so, zero is not in the domain of f, but it is in the domain of g
so, i'm going to say that these are not the same function
they don't have the same domain, right
f isn't defined at zero, and g is defined at zero
in that sense, these are really different functions
this example suggests that there's a real richness to this theory of functions, right
and we're going to be studying it a lot more this term.""")

# 01-understanding-research-data/02_types-of-data-and-metadata.en.txt
document2 = tb("""it's also a good idea toconsider data transformations
there are a number of reasons,why you may wish to transform your data, either during your project, or afterwards
unlike the earlier discussion aboutmigrating files from one format to the other, data transformationsinvolve changing the actual data
for example via anonymization
for example in survey data collectedfrom questionnaires, multiple choice and other kinds of responses are usually codedas numbers instead of character strings
this simple type of transformation hasthe advantage of easing data entry if you're typing in paper responses, and it also avoids inconsistenciessuch as typos in data values
qualitative data such as interviewtranscripts, can be transformed into quantitative data by applying textualcoding and categorization techniques
another reason for data transformation maybe to visualize the data more effectively
a simple example is converting data wherethere's a numerator and a denominator from ratios to percentages, so you candisplay it on a bar chart or pie graph
a number of methods may be usedto transform confidential or sensitive data, sothey can be shared with other researchers
these include aggregation,and anonymization
you've now completed this module
take a look at the further reading, whereyou'll find additional resources to learn more about file formats, compression,normalization, and data transformations
you may wish to move on to the next moduleand return to these references later
we recommend that you move on tothe module about documentation and data citation.""")

# 01-understanding-research-data/03_research-data-lifecycle.en.txt
document3 = tb("""so welcome to module 5 of the course, control of mobile robots
so far the first 2 modules were kind of introductions to the topic and the previous 2 modules they went rather deeply into the issue of, control the sign for linear time in variant systems
and we ended the last module on a high note
which showed that we could do rather remarkable things, like make a humanoid robot wave in, exciting ways
using, control of linear time in variant systems
now, unfortunately, the world isn't that easy
and especially, the robotic world is not that easy
so this entire module is devoted to how do we take module 4
and make it cope with the realities of the world in which robots are embedded
then in module 6, we're actually going to go full force robotics, 100%
and this module can be thought of as the glue module between linear systems and actual robotics
and in fact, the title of this lecture is, switches everywhere, because so far the modules, or models have been the same all the time, and in fact, because the models are the same, we design a size fits all type of controller
the way we've kind of messed with the controllers, is by changing the, the, reference values to some desired angles for the...
this humanoid robot, or desired velocities and angular velocities for the segue robot
but, it's certainly not always true that the models never change in the real world
and, this is never true in robotics, unfortunately
in fact, if you recall, what we talked about in module 2
we talked about something called behavior based control
where the robot switches between different modes of operation, or behaviors, in response to what the world throws at you
you see an obstacle, you switch to avoid that obstacle
if you, need to go recharge you switch to a, let's look for outlets behavior
so what we need to do is somehow come to grips, or terms with this reality
and that is how we're going to start this module
the first thing to note is that in the world there are switches everywhere
so here, over on our left, is a bipedal walking robot
well, bipedal walking robots have a leg that's swinging, has one dynamics, and then bam, the ro, the leg is on the ground, now the dynamics changed
so because of the fact that the dynamics changes
depending on whether or not you're in the swing face, or the stance face
the dynamics is different
here, i drew a cartoon of a bouncing ball
well, it has 2 modes
it's in the air, [sound]
and then it bounces
something new happens
the b-, the ball gets squished, and then it releases again
so, there
we have switches
i actually included this wonderful picture of locusts because i have actually worked on locusts
these locusts don't bother anyone during what's called a solitary mode
and then something happens and they switch to a gregarious mode where they lump together and devastate harvests everywhere
and this is a transition that occurs because of a lack of food, for instance
but in the, all of these cases, these are naturally occurring switches that our models have to take into account
now, from a robotics point of view, it may be slightly more relevant to look at switches by design instead of doing it by necessity
so here is a gear box to a car
we're switching between different gears because we want the car to run more smoothly at different rpm's
so we're switching the dynamics not because we have to but because it's better
here, this is a, a cockpit
it's supposed to be representing an autopilot on an aircraft where you're switching between cruise climb turn land takeoff modes
so, instead of designing 1 controller that takes me from atlanta, georgia to stockholm, sweden
you have a bunch of different controllers that you're switching through in order to do this
and at the bottom here, we have a sensor network
where, in order to preserve power, you're turning sensors on and off on purpose
so, you're switching by design, rather by necessity
and, in robotics
everywhere it switches by design
this is our south driving georgia tech car that switches between different behaviors depending on what's happening in the world
here's a mobile robot that switches between behaviors depending or not, whether or not there are obstacles, i wish we worked on snake robotics
we're switching between different modes, depending on what is going on in the environment
this is a friend in now, this is the sensor network, and this is the aerial robot that we've seen, and we're going to see more of
in all of these cases, we're going to have to switch in order to respond to what the world throw's our way
now, there are some issues that we need to deal with
issue number one is really how do we model these switches
how do we model systems that aren't staying the same all the time
well, the other question then, of course, is if these models change, what about stability
what about the performance
can we go with our old methods to try to understand this
well, this all boils down to the fact, that t does not go to infinity, within a single mode
meaning you are not staying in 1 mode forever, and stability is defined what happens when t goes to infinity, but t does not go to infinity in the in, individual mode
so, we somewhat need to understand that
we also have issues of compositionality
this is fancy speak for saying, if i have multiple modes and multiple controllers, how do i put them together
what is the way in which they fit together like lego pieces in a big lego drawing
and, most importantly, are there traps
are there issues that arise because of these switches that we don't fully know how to deal with
now, this module will deal with all of this, and more
and in the next lecture, we're going to start with the first question which is, how do we really model hybrid or switched systems in a systematic and coherent way?""")




### --------------------------------------------------------------------------------------------------------------------------------------
### Now, finally the **MAIN()** method: traversing through the documents and output the terms and their frequencies
### --------------------------------------------------------------------------------------------------------------------------------------

In [26]:
# arrays to hold the terms found in text and also a custom list to test domain-specific terms
exportedList = []
ownList = {"data management","database","example","iot","lifecycle","bloom","filter","integrity",
           "java","pattern","design pattern","svm","Support vector machine","knn","k-nearest neighbors","machine learning"}

table = BeautifulTable()
table.column_headers = ["TERM", "TF-IDF"]

doclist = [document1, document2, document3]
#doclist = [document4, document5, document6]
docnames = ["01_research-data-defined.en.txt","02_types-of-data-and-metadata.en.txt","03_research-data-lifecycle.en.txt"]
topNwords = 15;

for i, doc in enumerate(doclist):
    print("\nTop {} terms in document {} | {}".format(topNwords, i + 1, docnames[i]))
    scores = {term: tfidf(term, doc, doclist) for term in doc.words}
    sortedTerms = sorted(scores.items(),key=lambda x: x[1], reverse=True)
    
    for term, score in sortedTerms[:topNwords]:
         table.append_row([term, round(score, 5)]) 
         exportedList.append(term)
    
    print(table)
#    print(exportedWords, "\n")


# ----------------------------------------- NLTK, WORDNET -------------------------------------------
print("\n\n------- EXPORTED TERMS in WORDNET ----------") 
for word in exportedList:
    if not wn.synsets(word):
        print("\n", word, ": NO SYNSETS\n")
    else:
        print("\n", word)
        for ss in wn.synsets(word):
            print("- ",ss.name()," | ",ss.definition())

print("\n\n------- CUSTOM TERMS in WORDNET (also domain specific) ----------")    
for word in ownList:
    if not wn.synsets(word):
        print("\n", word, ": NO SYNSETS\n")
    else:
        print("\n", word)
        for ss in wn.synsets(word):
            print("- ",ss.name()," | ",ss.definition())
    


Top 15 terms in document 1 | 01_research-data-defined.en.txt
+-----------+--------+
|   TERM    | TF-IDF |
+-----------+--------+
|     x     | 0.015  |
+-----------+--------+
|     g     | 0.011  |
+-----------+--------+
|     f     |  0.01  |
+-----------+--------+
|   right   |  0.01  |
+-----------+--------+
|   value   | 0.005  |
+-----------+--------+
|   zero    | 0.005  |
+-----------+--------+
| function  | 0.004  |
+-----------+--------+
| functions | 0.004  |
+-----------+--------+
|    two    | 0.004  |
+-----------+--------+
|    x^2    | 0.003  |
+-----------+--------+
|  output   | 0.003  |
+-----------+--------+
|   plug    | 0.003  |
+-----------+--------+
|   input   | 0.002  |
+-----------+--------+
|    got    | 0.002  |
+-----------+--------+
|    get    | 0.002  |
+-----------+--------+

Top 15 terms in document 2 | 02_types-of-data-and-metadata.en.txt
+-----------------+--------+
|      TERM       | TF-IDF |
+-----------------+--------+
|        x        | 0.015


 function
-  function.n.01  |  (mathematics) a mathematical relation such that each element of a given set (the domain of the function) is associated with an element of another set (the range of the function)
-  function.n.02  |  what something is used for
-  function.n.03  |  the actions and activities assigned to or required or expected of a person or group
-  function.n.04  |  a relation such that one thing is dependent on another
-  function.n.05  |  a formal or official social gathering or ceremony
-  affair.n.03  |  a vaguely specified social event
-  routine.n.03  |  a set sequence of steps, part of larger computer program
-  function.v.01  |  perform as expected when applied
-  serve.v.01  |  serve a purpose, role, or function
-  officiate.v.02  |  perform duties attached to a particular office or place or function

 functions
-  function.n.01  |  (mathematics) a mathematical relation such that each element of a given set (the domain of the function) is associated with an elem


 responses
-  response.n.01  |  a result
-  reaction.n.03  |  a bodily process occurring due to the effect of some antecedent stimulus or agent
-  answer.n.01  |  a statement (either spoken or written) that is made to reply to a question or request or criticism or accusation
-  reception.n.01  |  the manner in which something is greeted
-  response.n.05  |  a phrase recited or sung by the congregation following a versicle by the priest or minister
-  reply.n.02  |  the speech act of continuing a conversational exchange
-  response.n.07  |  the manner in which an electrical or mechanical device responds to an input signal or a range of input signals

 transformation
-  transformation.n.01  |  a qualitative change
-  transformation.n.02  |  (mathematics) a function that changes the position or direction of the axes of a coordinate system
-  transformation.n.03  |  a rule describing the conversion of one syntactic structure into another related syntactic structure
-  transformation.n.04 

### --------------------------------------------------------------------------------------------------------------------------------------
### Let's try with **Wikipedia** data as ontology and see if it will return better result than WordNet
### --------------------------------------------------------------------------------------------------------------------------------------

In [27]:
import wikipedia as wiki
from contextlib import suppress
import sys 

wiki.set_lang("en")

#wiki.search("Java")

#wiki.summary("Java виртуална машина (JVM)")
#wiki.summary("Java (programmeertaal)")


# ------------------------------------------- WIKIPEDIA -------------------------------------------
print("\n\n-------Mapping with Wikipedia ----------") 
for word in exportedList:
    try: 
        print("\n- ",word," |\t ",wiki.summary(word),"\n")  
    except:
        print(word," |\t NO DESCRIPTION or DISAMBIGUATION")
        pass
    
print ("------------------- Custom domain-specific words for testing ----------------------)")
for word in ownList:
    try: 
        print("\n- ",word," |\t ",wiki.summary(word),"\n")  
    except:
        print(word," |\t NO DESCRIPTION or DISAMBIGUATION")
        pass



-------Mapping with Wikipedia ----------

-  x  |	  X (named ex , plural exes) is the 24th and antepenultimate letter in the modern English alphabet and the ISO basic Latin alphabet. 


-  g  |	  G (named gee ) is the 7th letter in the ISO basic Latin alphabet. 


-  f  |	  F (named ef ) is the sixth letter in the modern English alphabet and the ISO basic Latin alphabet. 


-  right  |	  Rights are legal, social, or ethical principles of freedom or entitlement; that is, rights are the fundamental normative rules about what is allowed of people or owed to people, according to some legal system, social convention, or ethical theory. Rights are of essential importance in such disciplines as law and ethics, especially theories of justice and deontology.
Rights are often considered fundamental to civilization, for they are regarded as established pillars of society and culture, and the history of social conflicts can be found in the history of each right and its development. According to 



 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "lxml")

  markup_type=markup_type))
  self.parser.feed(markup)


value  |	 NO DESCRIPTION or DISAMBIGUATION

-  zero  |	  0 (zero; ) is both a number and the numerical digit used to represent that number in numerals. The number 0 fulfills a central role in mathematics as the additive identity of the integers, real numbers, and many other algebraic structures. As a digit, 0 is used as a placeholder in place value systems. Names for the number 0 in English include zero, nought (UK), naught (US) (), nil, or—in contexts where at least one adjacent digit distinguishes it from the letter "O"—oh or o (). Informal or slang terms for zero include zilch and zip. Ought and aught (), as well as cipher, have also been used historically. 

function  |	 NO DESCRIPTION or DISAMBIGUATION
functions  |	 NO DESCRIPTION or DISAMBIGUATION

-  two  |	  2 (two;  ( listen)) is a number, numeral, and glyph. It is the natural number following 1 and preceding 3. 

x^2  |	 NO DESCRIPTION or DISAMBIGUATION
output  |	 NO DESCRIPTION or DISAMBIGUATION
plug  |	 NO DESCRIPTION or DI


-  chart  |	  A chart is a graphical representation of data, in which "the data is represented by symbols, such as bars in a bar chart, lines in a line chart, or slices in a pie chart". A chart can represent tabular numeric data, functions or some kinds of qualitative structure and provides different info.
The term "chart" as a graphical representation of data has multiple meanings:
A data chart is a type of diagram or graph, that organizes and represents a set of numerical or qualitative data.
Maps that are adorned with extra information (map surround) for a specific purpose are often known as charts, such as a nautical chart or aeronautical chart, typically spread over several map sheets.
Other domain specific constructs are sometimes called charts, such as the chord chart in music notation or a record chart for album popularity.
Charts are often used to ease understanding of large quantities of data and the relationships between parts of the data. Charts can usually be read more qu

switching  |	 NO DESCRIPTION or DISAMBIGUATION
modes  |	 NO DESCRIPTION or DISAMBIGUATION

-  systems  |	  A system is a regularly interacting or interdependent group of items forming an integrated whole. Every system is delineated by its spatial and temporal boundaries, surrounded and influenced by its environment, described by its structure and purpose and expressed in its functioning. 

go  |	 NO DESCRIPTION or DISAMBIGUATION
models  |	 NO DESCRIPTION or DISAMBIGUATION
dynamics  |	 NO DESCRIPTION or DISAMBIGUATION

-  fact  |	  A fact is a statement that is consistent with reality or can be proven with evidence. The usual test for a statement of fact is verifiability — that is, whether it can be demonstrated to correspond to experience. Standard reference works are often used to check facts. Scientific facts are verified by repeatable careful observation or measurement (by experiments or other means). 

mode  |	 NO DESCRIPTION or DISAMBIGUATION
our  |	 NO DESCRIPTION or DISAMBIGUATI

filter  |	 NO DESCRIPTION or DISAMBIGUATION
svm  |	 NO DESCRIPTION or DISAMBIGUATION

-  java  |	  Java (Indonesian: Jawa; Javanese: ꦗꦮ; Sundanese: ᮏᮝ) is an island of Indonesia. At about 139,000 square kilometres (54,000 sq mi), the island is comparable in size to England, the U.S. State of North Carolina, or Omsk Oblast. With a population of over 141 million (the island itself) or 145 million (the administrative region), Java is home to 56.7 percent of the Indonesian population and is the world's most populous island. The Indonesian capital city, Jakarta, is located on western Java. Much of Indonesian history took place on Java. It was the center of powerful Hindu-Buddhist empires, the Islamic sultanates, and the core of the colonial Dutch East Indies. Java was also the center of the Indonesian struggle for independence during the 1930s and 1940s. Java dominates Indonesia politically, economically and culturally. Four of Indonesia's eight UNESCO world heritage sites are located in Ja

### --------------------------------------------------------------------------------------------------------------------------------------
### NOTES
### --------------------------------------------------------------------------------------------------------------------------------------

**TF-IDF** doesn't output the necessary result, I need n-grams selected as a combined keyword and these are often very general words like `for example` or `key concept` etc. in order to classify the text into the GOAL element. 

TextBlob provides options for n-grams and also connection to WordNet ontology which could be useful, so will look more into it.

**WordNet** finds multiple definitions and synsets (synonyms) for most of the general words, however if provided specific e.g. computer science algorithm names, or specific terms, it doesn find any synonyms, nor descriptions of any of them.

**Wikipedia** recognized some of the terms, but not all. For instance if we give it KNN it doesn't find anything, but if we give it K-nearest neighbour, if finds it. This is how the name is in Wikipedia, so that may be the reason. But on Google first returned result for KNN is this article. Same for SVM and Support vector machine. I've modified the script to return "NO DESCRIPTION or DISAMBIGUATION" everytime if finds nopthing ot if there's a disambiguation error, otherwise it wouldn continue checking the rest of the terms. So now it skips the error. 
 
**Full list** of identified key words so far [HERE](https://docs.google.com/spreadsheets/d/1Dj4UAh6U5jAelcsz-gDCdDE9JRVhwaNei0Ctn8m0Ui4/edit?usp=sharing)