## TF-IDF implementation

TF-IDF is implemented to check whether the terms extracted from LOs have anything in common with the terms that would be extracted by manual MOOC analysis, and to compare which of the two methods gives better results in the classification part.

Below is the main TF-IDF implementation without any text provided to it yet.

In [5]:
import math
from textblob import TextBlob as tb
import nltk
from nltk.corpus import wordnet as wn
from beautifultable import BeautifulTable
#nltk.download('punkt')
#nltk.download('wordnet')

# doc is the TextBlob in which to look for the term
def tf(term, doc):
    # ratio of the term's occurrence count to the document's total word count
    return doc.words.count(term) / len(doc.words)

def docsWithTermIn(term, doclist):
    return sum(1 for doc in doclist if term in doc.words)

def idf(term, doclist):
    return math.log(len(doclist) / (1 + docsWithTermIn(term, doclist)))

def tfidf(term,doc,doclist):
    return tf(term, doc) * idf(term,doclist)
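
To see how these formulas behave on a corpus as small as three documents, here is a minimal sketch that applies the same arithmetic to plain word lists instead of TextBlob objects. The toy documents and the snake_case helper names are assumptions made for illustration only. Note the effect of the `1 +` in the IDF denominator: with three documents, a term found in two of them scores exactly 0, and a term found in all three scores below 0.

```python
import math

def tf(term, words):
    # share of the document's words that are this term
    return words.count(term) / len(words)

def docs_with_term(term, docs):
    # number of documents that contain the term at least once
    return sum(1 for words in docs if term in words)

def idf(term, docs):
    # same smoothing as above: 1 is added to the document count
    return math.log(len(docs) / (1 + docs_with_term(term, docs)))

def tfidf(term, words, docs):
    return tf(term, words) * idf(term, docs)

# Hypothetical toy corpus (not taken from the MOOC transcripts)
docs = [
    ["data", "management", "plan"],
    ["data", "metadata", "metadata", "plan"],
    ["data", "lifecycle"],
]

score_data = tfidf("data", docs[0], docs)        # in 3 of 3 docs: log(3/4) < 0
score_plan = tfidf("plan", docs[0], docs)        # in 2 of 3 docs: log(3/3) = 0
score_mgmt = tfidf("management", docs[0], docs)  # in 1 of 3 docs: log(3/2) > 0

print(score_data, score_plan, score_mgmt)
```

With so few documents, only terms unique to a single document get a positive score, which is worth keeping in mind when reading the rankings.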

Now it's time to supply several documents and see what happens:

In [6]:
# 01-understanding-research-data/01_research-data-defined.en.txt
document1 = tb("""To begin this course,
we need to first discuss some basics. The most fundamental question for
us is what are data? While this may seem like
a straightforward question, numerous organizations have tackled it
resulting in a range of definitions. It's important to note that data
are different for various disciplines and different context. By the end of this lesson, you will be introduced to multiple
types of data in an array of contexts. Data come in many forms from numeric and textual data to biological samples and
physical collections. You will also be able to make
the distinction between research data and other associated researcher materials. Some of which may be
required alongside the data to understand the data themselves. And finally, and this is probably one of the most
important concepts in the entire course. You will understand data in the context
of the research data life cycle. To understand data management, is to
understand data in all of its constructs. From project planning, through the
collection of data to archiving that data as a research output after
the project period has ended. First, let's take a look at
how others have defined data. The National Institutes of Health, or NIH,
define data as “recorded factual material commonly accepted in
the scientific community as necessary to validate
research findings.” Key concepts here include recorded and accessible data that others may look
at to validate a studies findings. The National Science Foundation or
NSF considers data to be something, determined by the community of interest
through the process of peer review and project management. NSF’s definition points out the data are
essential to the research community and not just to individual researchers or
research teams. NSF also offers some examples of data that make the rather abstract
definition easier to understand. As we know, examples are always useful. The NSF list includes data, publications, samples, physical collections,
software, and models. It's interesting that NSF includes
data as an example of data, but the point here is there are many types
of data beyond quantitative datasets. The National Endowment for
the Humanities, or NEH, also offers examples of data that
includes citations, software code, algorithms, digital tools,
documentation, databases, geospatial coordinates,
reports and articles. It is interesting to note that NEH
includes a wider array of data types than we see from NSF or NIH. Any of these data types NEH
defines as humanities data, or “materials generated or collected during
the course of conducting research.” So what are some key concepts
in these definitions? For NIH and NSF, the research
community establishes definitions of data that involve validity and
presume data sharing among the community. When you look at the examples provided
by any of these organizations, you can see that data come in
quite a significant variety of forms, almost to the point of being nebulous. What is important to understand here is that data are products of research
that are heterogeneous across and contextualized within
the academic disciplines. So to reiterate,
data should be valid, shared, and are heterogeneous, and contextualized
within research communities.""")

# 01-understanding-research-data/02_types-of-data-and-metadata.en.txt
document2 = tb("""Let's look at different types of data. Probably what most of us think about
when we use the term data is numeric or tabular data, what researchers might
refer to as quantitative data. And then there are other types. Samples such as DNA or
blood samples, physical collections, including plant specimens,
software programs and code, databases, algorithms, model, and geodatabases. Obviously, these are not all
the types of data out there. Can you think of other examples? What types have you encountered? Please share them in the forum. There are other research products that
need to be considered alongside data in order for the data to be meaningful. These include questionnaires, code books,
and descriptions of methodologies. Jillian Wallis, Elizabeth Rolando, and Christine Borgman describe
this as background data. Background data provides contextual
information important in the analysis of the primary foreground
data collected for analysis. For example, what the temperature was
at the time a sensor reading is taken may be critical to understanding
variances in data a sensor collected. And then there are research products
built on the data that are essential for secondary analysis or meta-analysis and
for content dissemination. These include reports, conference posters,
articles, white papers, and books. And we should include websites and blogs. Another essential concept in
data management is metadata. This is a term that you will hear
me refer to throughout this course. Metadata is often defined
as data about data. This, of course, is a basic and circular
definition that may not help us too much. More specifically, metadata is
structured information that describes, explains, locates, or
otherwise represents something else. For our interest in this course,
that something else is, of course, our target research data. Metadata makes it easier to retrieve,
use or manage an information source. Data are nothing without metadata,
especially digital data. One cannot search for, identify, or
interpret data without robust metadata. For each dataset, we need to know
at minimum, who created the data, when the data were created or
published, and a title or descriptive name
used to refer to the dataset. In this digital landscape,
we also need a unique and persistent identifier for
the data so that we can locate it, even if the data are moved to
a different location on the web. Beyond these metadata elements,
we can increase our ability to find and identify data if we have
information about these, along with the other 11 metadata elements that make
up the Dublin Core Metadata Element Set. Dublin Core allows us to find and
identify data. But to be able to interpret and
use data as they were intended, we also need to know quite a bit of
other information about the data. The Data Documentation Initiative, or DDI, developed a metadata scheme
specifically for this purpose. Along with basic Dublin Core Elements, DDI prescribes additional
metadata elements that provide specific information
about data collection processes, variable-level descriptions,
and methodologies.""")

# 01-understanding-research-data/03_research-data-lifecycle.en.txt
document3 = tb("""When thinking about and doing data
management, it is critical to understand data in terms of their lifecycle and
the research project’s lifecycle. For project planning to archiving, proper data management happens
throughout the research lifecycle. Each stage of the lifecycle
produces specific data products and requires a variety of considerations,
responsibilities, and activities. There are numerous lifecycle
models from simple to complex. Let's look at a few. Here is the research lifecycle
model from the UK Data Archive. Once data are created, which is
represented in the circle at 12 o'clock, the data undergo subsequent stages,
processing, analyzing, preserving, providing access to,
and re-using the data. The University of Virginia library created
this research data life cycle model, which I think directly aligns
to common research practice. Here is the more complex and comprehensive data lifecycle model
developed by the Digital Curation Center. With data at the center of the graphic,
you can see the various data curation activities as you
move to the outer rings. This model brings together data,
researchers and curators. Now that you have been
introduced to data and understand that data management refers to
activities throughout the data lifecycle, please take a moment to introduce yourself
on the forums and tell us about your data experiences, or maybe even questions
you have about data or the course. We have also provided some
additional readings for you to explore if you're
interested in learning more. And make sure you check out
the resources for this module.""")

Now, finally, the **MAIN()** method: traverse the documents and output the top terms with their TF-IDF scores.

In [31]:
exportedWords = []
ownList = {"data management","database","example","iot","lifecycle","bloom","filter","integrity",
           "java","pattern","design pattern","svm","knn","machine learning"}

doclist = [document1, document2, document3]
docnames = ["01_research-data-defined.en.txt","02_types-of-data-and-metadata.en.txt","03_research-data-lifecycle.en.txt"]
topNwords = 15

for i, doc in enumerate(doclist):
    print("\nTop {} terms in document {} | {}".format(topNwords, i + 1, docnames[i]))
    scores = {term: tfidf(term, doc, doclist) for term in doc.words}
    sortedTerms = sorted(scores.items(), key=lambda x: x[1], reverse=True)

    # Build a fresh table per document so rows from earlier documents are not repeated
    table = BeautifulTable()
    table.column_headers = ["TERM", "TF-IDF"]
    for term, score in sortedTerms[:topNwords]:
        table.append_row([term, round(score, 5)])
        exportedWords.append(term)

    print(table)
#    print(exportedWords, "\n")

print("\n\n------- EXPORTED TERMS in WORDNET ----------") 
for word in exportedWords:
    if not wn.synsets(word):
        print("\n", word, ": NO SYNSETS\n")
    else:
        print("\n", word)
        for ss in wn.synsets(word):
            print("- ",ss.name()," | ",ss.definition())

print("\n\n------- CUSTOM TERMS in WORDNET (also domain specific) ----------")    
for word in ownList:
    if not wn.synsets(word):
        print("\n", word, ": NO SYNSETS\n")
    else:
        print("\n", word)
        for ss in wn.synsets(word):
            print("- ",ss.name()," | ",ss.definition())
# print (wn.synsets(word))
# for ss in wn.synsets(word):
#    print("- ",ss.name()," | ",ss.definition())
    


Top 15 terms in document 1 | 01_research-data-defined.en.txt
+-------------+--------+
|    TERM     | TF-IDF |
+-------------+--------+
|     To      | 0.015  |
+-------------+--------+
|     NSF     | 0.005  |
+-------------+--------+
|     As      | 0.005  |
+-------------+--------+
|     You     | 0.004  |
+-------------+--------+
|  community  | 0.004  |
+-------------+--------+
|     It      | 0.003  |
+-------------+--------+
|  includes   | 0.003  |
+-------------+--------+
|    Some     | 0.003  |
+-------------+--------+
|     NIH     | 0.002  |
+-------------+--------+
|  National   | 0.002  |
+-------------+--------+
|  concepts   | 0.002  |
+-------------+--------+
| definitions | 0.002  |
+-------------+--------+
|     By      | 0.002  |
+-------------+--------+
|    From     | 0.002  |
+-------------+--------+
|    here     | 0.002  |
+-------------+--------+

Top 15 terms in document 2 | 02_types-of-data-and-metadata.en.txt
+-------------+--------+
|    TERM     | TF-ID

### Conclusion

TF-IDF does not produce the result I need. To classify the text into the GOAL element, I need n-grams selected as combined keywords, and these are often very general phrases such as `for example` or `key concept`.
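
Part of the noise in the ranking above (capitalised function words such as `To`, `As` and `You`) seems to come from case handling rather than from TF-IDF itself: as far as I can tell, TextBlob's `WordList.count` ignores case by default, while the `term in doc.words` test is case-sensitive. A minimal sketch of one possible cleanup step follows, assuming lowercasing, whitespace tokenisation and a tiny hand-picked stop-word list. The `STOP_WORDS` set, the `normalize` helper and the sample texts are all hypothetical.

```python
import math

# Hypothetical, deliberately tiny stop-word list; a real run would need a fuller one
STOP_WORDS = {"to", "as", "you", "it", "by", "from", "the", "of", "and", "a", "is", "are"}

def normalize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (tok.strip(".,;:?!\"'()").lower() for tok in text.split())
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]

def tfidf(term, words, docs):
    # same formula as the notebook, applied to plain word lists
    tf = words.count(term) / len(words)
    containing = sum(1 for other in docs if term in other)
    return tf * math.log(len(docs) / (1 + containing))

# Hypothetical sample texts, loosely modelled on the transcripts
texts = [
    "To begin, data are different. To understand data is to understand context.",
    "Metadata is data about data.",
    "The lifecycle of research data.",
]
docs = [normalize(t) for t in texts]

# Rank the distinct terms of the first document by score, highest first
ranked = sorted(set(docs[0]), key=lambda t: tfidf(t, docs[0], docs), reverse=True)
print(ranked)
```

This only removes one source of noise; it does not by itself yield the multi-word keywords needed for the GOAL element.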

TextBlob supports n-grams and also connects to the WordNet ontology, which could be useful, so I will look into it further.
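
As a first look at the n-gram idea, the sketch below scores bigrams with the same TF-IDF formula. TextBlob exposes an `ngrams` method for this, but the example uses plain Python so that it runs without NLTK corpora. The `bigrams` helper, the whitespace tokenisation and the sample texts are assumptions for illustration, not part of the notebook.

```python
import math

def bigrams(text):
    """Lowercase, split on whitespace, strip punctuation, pair adjacent tokens.

    Sentence boundaries are ignored, so a pair may span two sentences.
    """
    tokens = [tok.strip(".,;:?!\"'()").lower() for tok in text.split()]
    tokens = [tok for tok in tokens if tok]
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def tfidf(term, terms, docs):
    # same formula as the notebook, with bigrams taking the place of words
    tf = terms.count(term) / len(terms)
    containing = sum(1 for other in docs if term in other)
    return tf * math.log(len(docs) / (1 + containing))

# Hypothetical sample texts
texts = [
    "Data management is key. Data management covers the data lifecycle.",
    "For example, metadata is a key concept.",
    "For example, the data lifecycle has stages.",
]
docs = [bigrams(t) for t in texts]

scores = {bg: tfidf(bg, docs[0], docs) for bg in set(docs[0])}
top = max(scores, key=scores.get)
print(top, round(scores[top], 5))
```

In this toy corpus the domain phrase `data management` ranks first for the first text, while a general phrase such as `for example` appears in two of the three texts and would therefore score 0 in either of them. That is consistent with the observation above that general n-grams are hard to pick out with TF-IDF alone.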

The full list of keywords identified so far is [HERE](https://docs.google.com/spreadsheets/d/1Dj4UAh6U5jAelcsz-gDCdDE9JRVhwaNei0Ctn8m0Ui4/edit?usp=sharing).