# Term-weighting techniques

In this practical you are going to:

- Implement TF-IDF weighting
- Apply these techniques to the collection of documents provided
- Return the TF-IDF scores for the provided set of words

Before starting, try the following warm-up exercise:

NLP toolkits such as NLTK provide a variety of stemming algorithms, so you rarely need to write your own stemmer or lemmatizer from scratch. It is still worth getting to know the strengths and weaknesses of the existing implementations.

For hands-on experience, try several of the stemmers documented at https://www.nltk.org/api/nltk.stem.html and https://www.nltk.org/howto/stem.html on the same words and compare their output.
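
As a starting point for this comparison, the sketch below runs three NLTK stemmers over the same words. The word list is an arbitrary choice made for illustration, not part of the practical, and the three stemmers are only a sample of what `nltk.stem` offers.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# Arbitrary sample words (an assumption); swap in your own to explore further.
words = ["running", "libraries", "classification", "retrieval", "generously"]

stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),
}

# For every word, collect the stem produced by each stemmer
stems = {
    word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
    for word in words
}

for word, row in stems.items():
    print(f"{word:15}", "  ".join(f"{name}={stem}" for name, stem in row.items()))
```

Reading the rows side by side shows where the algorithms agree and where one of them truncates more aggressively than the others.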


## Load the data

There are three components to the provided `CISI` dataset (http://ir.dcs.gla.ac.uk/resources/test_collections/cisi/):
- documents with their ids and content – there are $1460$ of those to be precise;
- questions / queries with their ids and content – there are $112$ of those;
- mapping between the queries and relevant documents.

First, let's read in the documents from the `CISI.ALL` file and store the result in the `documents` data structure, a dictionary that maps each document id to its content:

In [2]:
#if using Google Colab
from google.colab import drive
drive.mount('/content/drive')

def read_documents():
    f = open("/content/drive/My Drive/datasets_ecoleit/4NLP/data/CISI.ALL")
    merged = ""

    for a_line in f.readlines():
        if a_line.startswith("."):
            merged += "\n" + a_line.strip()
        else:
            merged += " " + a_line.strip()

    documents = {}

    content = ""
    doc_id = ""

    for a_line in merged.split("\n"):
        if a_line.startswith(".I"):
            doc_id = a_line.split(" ")[1].strip()
        elif a_line.startswith(".X"):
            documents[doc_id] = content
            content = ""
            doc_id = ""
        else:
            content += a_line.strip()[3:] + " "
    f.close()
    return documents

documents = read_documents()
print(f"{len(documents)} documents in total")
print("Document with id 1:")
print(documents.get("1"))

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
1460 documents in total
Document with id 1:
 18 Editions of the Dewey Decimal Classifications Comaromi, J.P. The present study is a history of the DEWEY Decimal Classification.  The first edition of the DDC was published in 1876, the eighteenth edition in 1971, and future editions will continue to appear as needed.  In spite of the DDC's long and healthy life, however, its full story has never been told.  There have been biographies of Dewey that briefly describe his system, but this is the first attempt to provide a detailed history of the work that more than any other has spurred the growth of librarianship in this country and abroad. 
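
To see what the two loops in `read_documents` do without mounting the drive, the sketch below applies the same logic to a small in-memory string. The sample records and the name `parse_records` are made up for illustration; only the `.I`, `.T`, `.W` and `.X` markers are taken from the code above, and `.A` is assumed to be one more field marker of the same kind.

```python
# Made-up sample in the marker style handled by read_documents (an assumption)
sample = """.I 1
.T
Toy title
.A
Doe, J.
.W
First line of the abstract
second line of the abstract.
.X
1 5 1
.I 2
.T
Another title
.W
Short abstract.
.X
2 1 2
"""

def parse_records(raw):
    # Pass 1: glue continuation lines onto the marker line that precedes them
    merged = ""
    for line in raw.splitlines():
        if line.startswith("."):
            merged += "\n" + line.strip()
        else:
            merged += " " + line.strip()

    # Pass 2: start a record at ".I", store it when ".X" is reached
    records = {}
    doc_id, content = "", ""
    for line in merged.split("\n"):
        if line.startswith(".I"):
            doc_id = line.split(" ")[1].strip()
        elif line.startswith(".X"):
            records[doc_id] = content.strip()
            doc_id, content = "", ""
        else:
            # Drop the two-character marker and the space after it
            content += line.strip()[3:] + " "
    return records

records = parse_records(sample)
print(records)
```

Unlike `read_documents`, this version strips the stored content. The leading space in the output above comes from the empty first element that `merged.split("\n")` produces.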


Second, let's read in the queries from the `CISI.QRY` file and store the result in the `queries` data structure, a dictionary that maps each query id to its content:

In [3]:
def read_queries():
    f = open("/content/drive/My Drive/datasets_ecoleit/4NLP/data/CISI.QRY")
    merged = ""

    for a_line in f.readlines():
        if a_line.startswith("."):
            merged += "\n" + a_line.strip()
        else:
            merged += " " + a_line.strip()

    queries = {}

    content = ""
    qry_id = ""

    for a_line in merged.split("\n"):
        if a_line.startswith(".I"):
            if not content=="":
                queries[qry_id] = content
                content = ""
                qry_id = ""
            qry_id = a_line.split(" ")[1].strip()
        elif a_line.startswith(".W") or a_line.startswith(".T"):
            content += a_line.strip()[3:] + " "
    queries[qry_id] = content
    f.close()
    return queries

queries = read_queries()
print(f"{len(queries)} queries in total")
print("Query with id 1:")
print(queries.get("1"))

112 queries in total
Query with id 1:
What problems and concerns are there in making up descriptive titles? What difficulties are involved in automatically retrieving articles from approximate titles? What is the usual relevance of the content of articles to their titles? 


Finally, let's read in the mapping between the queries and the documents. We'll keep it in the `mappings` data structure, a dictionary in which each query id (key) corresponds to a list of one or more relevant document ids (value):

In [4]:
def read_mappings():
    f = open("/content/drive/My Drive/datasets_ecoleit/4NLP/data/CISI.REL")

    mappings = {}

    for a_line in f.readlines():
        voc = a_line.strip().split()
        key = voc[0].strip()
        current_value = voc[1].strip()
        value = []
        if key in mappings.keys():
            value = mappings.get(key)
        value.append(current_value)
        mappings[key] = value

    f.close()
    return mappings

mappings = read_mappings()
print(f"{len(mappings)} mappings in total")
print(mappings.keys())
print("Mapping for query with id 1:")
print(mappings.get("1"))

76 mappings in total
dict_keys(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '37', '39', '41', '42', '43', '44', '45', '46', '49', '50', '52', '54', '55', '56', '57', '58', '61', '62', '65', '66', '67', '69', '71', '76', '79', '81', '82', '84', '90', '92', '95', '96', '97', '98', '99', '100', '101', '102', '104', '109', '111'])
Mapping for query with id 1:
['28', '35', '38', '42', '43', '52', '65', '76', '86', '150', '189', '192', '193', '195', '215', '269', '291', '320', '429', '465', '466', '482', '483', '510', '524', '541', '576', '582', '589', '603', '650', '680', '711', '722', '726', '783', '813', '820', '868', '869', '894', '1162', '1164', '1195', '1196', '1281']
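
The same grouping can be written more compactly with `collections.defaultdict`. The sketch below is one possible alternative, not the notebook's method. The sample lines are invented, and the columns after the second one are placeholders that the code ignores, just as `read_mappings` does.

```python
from collections import defaultdict

# Invented sample lines: query id, document id, then ignored columns
sample_lines = [
    "     1     28  0   0.000000",
    "     1     35  0   0.000000",
    "     2     29  0   0.000000",
    "",
]

def build_mappings(lines):
    mappings = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        query_id, doc_id = fields[0], fields[1]
        mappings[query_id].append(doc_id)
    return dict(mappings)

toy_mappings = build_mappings(sample_lines)
print(toy_mappings)
```

Note from the output above that only 76 of the 112 queries have relevance judgements, so `mappings.get(qry_id)` can return `None` for a valid query id.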


## Preprocess the data

Practise applying the following steps:
- tokenize the texts
- convert everything to lowercase
- remove stopwords
- apply stemming

Implement and apply these steps to a sample text:

In [5]:
import nltk
import string

nltk.download('stopwords')
nltk.download('punkt')
# Recent NLTK releases may additionally require: nltk.download('punkt_tab')

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.lancaster import LancasterStemmer

def process(text):
    stoplist = set(stopwords.words('english'))
    st = LancasterStemmer()
    # Tokenize the lowercased text, drop stopwords and punctuation marks
    # (from string.punctuation), then stem what remains
    word_list = [st.stem(word) for word in word_tokenize(text.lower())
                 if word not in stoplist and word not in string.punctuation]
    return word_list

word_list = process(documents.get("27"))
print(word_list)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['cost', 'analys', 'sim', 'proc', 'evalu', 'larg', 'inform', 'system', 'bourn', 'c.p', 'ford', 'd.f', 'comput', 'program', 'writ', 'us', 'sim', 'several-year', 'op', 'inform', 'system', 'comput', 'estim', 'expect', 'op', 'cost', 'wel', 'amount', 'equip', 'personnel', 'requir', 'tim', 'period', 'program', 'us', 'analys', 'sev', 'larg', 'system', 'prov', 'us', 'research', 'tool', 'study', 'system', 'many', 'compon', 'interrel', 'op', 'equ', 'man', 'analys', 'would', 'extrem', 'cumbersom', 'tim', 'consum', 'perhap', 'ev', 'impract', 'pap', 'describ', 'program', 'show', 'exampl', 'result', 'sim', 'two', 'sev', 'suggest', 'design', 'spec', 'inform', 'system']
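
One detail of the punctuation filter is worth checking. Assuming the filter is written as `word not in string.punctuation`, it is a substring test against one long string, so multi-character tokens such as `''` or `...` pass through. The sketch below contrasts it with a character-wise test on a made-up token list. Neither variant is prescribed by the practical.

```python
import string

# Made-up tokens of the kind a tokenizer might emit
tokens = ["system", ",", "'s", "''", "c.p", "./", "..."]

# Variant 1: substring test against the string.punctuation string
kept_substring = [t for t in tokens if t not in string.punctuation]

# Variant 2: drop tokens that consist only of punctuation characters
punct_chars = set(string.punctuation)
kept_charwise = [t for t in tokens if not all(ch in punct_chars for ch in t)]

print(kept_substring)  # ['system', "'s", "''", 'c.p', '...']
print(kept_charwise)   # ['system', "'s", 'c.p']
```

The first variant is consistent with the shared vocabulary later in this practical starting with `''`. Switching to the second variant would change the vocabulary size reported there.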


## Term weighting

First, calculate the term frequencies in each document and query:

In [7]:
def get_terms(text):
    terms = {}
    st = LancasterStemmer()
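    stoplist = set(stopwords.words('english'))
    # Same preprocessing as in process(): tokenize, lowercase, filter, stem
    word_list = [st.stem(word) for word in word_tokenize(text.lower())
                 if word not in stoplist and word not in string.punctuation]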
    for word in word_list:
        terms[word] = terms.get(word, 0) + 1
    return terms

doc_terms = {}
qry_terms = {}
for doc_id in documents.keys():
    doc_terms[doc_id] = get_terms(documents.get(doc_id))
for qry_id in queries.keys():
    qry_terms[qry_id] = get_terms(queries.get(qry_id))


print(f"{len(doc_terms)} documents in total") # Sanity check – this should be the same number as before
d1_terms = doc_terms.get("1")
print("Terms and frequencies for document with id 1:")
print(d1_terms)
print(f"{len(d1_terms)} terms in this document")
print()
print(f"{len(qry_terms)} queries in total") # Sanity check – this should be the same number as before
q1_terms = qry_terms.get("1")
print("Terms and frequencies for query with id 1:")
print(q1_terms)
print(f"{len(q1_terms)} terms in this query")

1460 documents in total
Terms and frequencies for document with id 1:
{'18': 1, 'edit': 4, 'dewey': 3, 'decim': 2, 'class': 2, 'comarom': 1, 'j.p.': 1, 'pres': 1, 'study': 1, 'hist': 2, 'first': 2, 'ddc': 2, 'publ': 1, '1876': 1, 'eighteen': 1, '1971': 1, 'fut': 1, 'continu': 1, 'appear': 1, 'nee': 1, 'spit': 1, "'s": 1, 'long': 1, 'healthy': 1, 'lif': 1, 'howev': 1, 'ful': 1, 'story': 1, 'nev': 1, 'told': 1, 'biograph': 1, 'brief': 1, 'describ': 1, 'system': 1, 'attempt': 1, 'provid': 1, 'detail': 1, 'work': 1, 'spur': 1, 'grow': 1, 'libr': 1, 'country': 1, 'abroad': 1}
43 terms in this document

112 queries in total
Terms and frequencies for query with id 1:
{'problem': 1, 'concern': 1, 'mak': 1, 'describ': 1, 'titl': 3, 'difficul': 1, 'involv': 1, 'autom': 1, 'retriev': 1, 'artic': 2, 'approxim': 1, 'us': 1, 'relev': 1, 'cont': 1}
14 terms in this query
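
The counting loop in `get_terms` is equivalent to `collections.Counter` from the standard library. The sketch below shows both on a hand-written list of stems that mimics part of query 1. The list is an illustration, not actual output of the preprocessing.

```python
from collections import Counter

# Hand-written stems (an assumption), not produced by the stemmer
tokens = ["titl", "artic", "titl", "retriev", "titl", "artic"]

# Manual counting, as in get_terms
manual = {}
for token in tokens:
    manual[token] = manual.get(token, 0) + 1

# The same result with Counter
counted = Counter(tokens)

print(manual)
print(counted.most_common(2))
```

`Counter` also returns 0 for terms it has never seen, which can save a membership check later on.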


Second, collect the shared vocabulary from all documents and queries:

In [8]:
def collect_vocabulary():
    all_terms = []
    for doc_id in doc_terms.keys():
        for term in doc_terms.get(doc_id).keys():
            all_terms.append(term)
    for qry_id in qry_terms.keys():
        # Apply the same procedure to the query terms
        for term in qry_terms.get(qry_id).keys():
            all_terms.append(term)
    return sorted(set(all_terms))

all_terms = collect_vocabulary()
print(f"{len(all_terms)} terms in the shared vocabulary") # Later vectors should all have this length
print("First 10:")
print(all_terms[:10])

7775 terms in the shared vocabulary
First 10:
["''", "'60", "'70", "'anyhow", "'apparent", "'basic", "'better", "'bibliograph", "'bibliometrics", "'building"]


Third, represent each document and query as a vector of word counts in the shared vocabulary space:

In [9]:
def vectorize(input_terms, shared_vocabulary):
    output = {}
    for item_id in input_terms.keys(): # e.g., a document in doc_terms
        terms = input_terms.get(item_id)
        output_vector = []
        for word in shared_vocabulary:
            if word in terms.keys():
                # add the raw count of the word from the shared vocabulary in doc to the doc vector
                output_vector.append(int(terms.get(word)))
            else:
                # if the word from the shared vocabulary is not in doc, add 0 to the doc vector in this position
                output_vector.append(0)
        output[item_id] = output_vector
    return output

# Apply vectorize to doc_terms and the shared vocabulary all_terms
doc_vectors = vectorize(doc_terms, all_terms)
# Apply vectorize to qry_terms and the shared vocabulary all_terms
qry_vectors = vectorize(qry_terms, all_terms)

print(f"{len(doc_vectors)} document vectors") # This should be the same number as before
d1460_vector = doc_vectors.get("1460")
print(f"{len(d1460_vector)} terms in this document") # This should be the same number as before
print(f"{len(qry_vectors)} query vectors") # This should be the same number as before
q112_vector = qry_vectors.get("112")
print(f"{len(q112_vector)} terms in this query") # This should be the same number as before

1460 document vectors
7775 terms in this document
112 query vectors
7775 terms in this query
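
To make the vocabulary and vectorization steps concrete, here is a minimal sketch on toy data. All names prefixed with `toy_` and the helper `to_vector` are hypothetical, and the counts are invented.

```python
# Invented term counts for two documents and one query
toy_doc_terms = {
    "d1": {"inform": 2, "system": 1},
    "d2": {"system": 3, "libr": 1},
}
toy_qry_terms = {"q1": {"inform": 1, "retriev": 1}}

# Shared vocabulary: every term seen in any document or query, sorted
toy_vocabulary = sorted(
    {term
     for terms in list(toy_doc_terms.values()) + list(toy_qry_terms.values())
     for term in terms}
)

def to_vector(terms, vocabulary):
    # One position per vocabulary term; 0 where the term is absent
    return [terms.get(word, 0) for word in vocabulary]

toy_doc_vectors = {d: to_vector(t, toy_vocabulary) for d, t in toy_doc_terms.items()}
toy_qry_vectors = {q: to_vector(t, toy_vocabulary) for q, t in toy_qry_terms.items()}

print(toy_vocabulary)   # ['inform', 'libr', 'retriev', 'system']
print(toy_doc_vectors)  # {'d1': [2, 0, 0, 1], 'd2': [0, 1, 0, 3]}
print(toy_qry_vectors)  # {'q1': [1, 0, 1, 0]}
```

Every vector has the length of the vocabulary, so for the real collection the dense representation holds 1460 × 7775 = 11,351,500 counts for the documents alone, most of them zero.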


Next, calculate the inverse document frequency (idf) of every term in the shared vocabulary:

In [10]:
import math

def calculate_idfs(shared_vocabulary, d_terms):
    doc_idfs = {}
    for term in shared_vocabulary:
        doc_count = 0 # the number of documents containing this term
        for doc_id in d_terms.keys():
            terms = d_terms.get(doc_id)
            if term in terms.keys():
                doc_count += 1
        doc_idfs[term] = math.log(float(len(d_terms.keys()))/float(1 + doc_count), 10)
    return doc_idfs

# Apply calculate_idfs to the shared vocabulary all_terms and to doc_terms
doc_idfs = calculate_idfs(all_terms, doc_terms)
print(f"{len(doc_idfs)} terms with idf scores") # This should be the same number as before
print("Idf score for the word system:")
print(doc_idfs.get("system"))

7775 terms with idf scores
Idf score for the word system:
0.4287539560862571
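
Read off the code, the score computed by `calculate_idfs` is $\mathrm{idf}(t) = \log_{10}\frac{N}{1 + \mathrm{df}(t)}$, where $N$ is the number of documents and $\mathrm{df}(t)$ is the number of documents containing $t$. The sketch below computes the same quantity in a single pass over the documents instead of one pass per vocabulary term. The function name and toy data are hypothetical.

```python
import math
from collections import Counter

# Invented term counts; "titl" occurs only in a query, so its df is 0
toy_doc_terms = {
    "d1": {"inform": 2, "system": 1},
    "d2": {"system": 3, "libr": 1},
    "d3": {"system": 1, "retriev": 2},
}
toy_vocabulary = ["inform", "libr", "retriev", "system", "titl"]

def idf_single_pass(shared_vocabulary, d_terms):
    n_docs = len(d_terms)
    doc_freq = Counter()
    for terms in d_terms.values():
        doc_freq.update(terms.keys())  # each document counts a term once
    # Counter returns 0 for terms that occur in no document
    return {t: math.log(n_docs / (1 + doc_freq[t]), 10) for t in shared_vocabulary}

toy_idfs = idf_single_pass(toy_vocabulary, toy_doc_terms)
for term, score in toy_idfs.items():
    print(f"{term:8} {score:+.4f}")

# The score printed above for "system" equals log10(1460 / 544)
print(math.log(1460 / (1 + 543), 10))
```

Two consequences of the `1 +` in the denominator show up here: a term that occurs in every document gets a slightly negative score, and a term that occurs in no document still gets a finite one. The last line suggests that "system" occurs in 543 of the 1460 documents.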


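Finally, multiply the raw term counts by the idf scores to obtain the TF-IDF vector of each document:

In [12]: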
def vectorize_idf(input_terms, input_idfs, shared_vocabulary):
    output = {}
    for item_id in input_terms.keys():
        terms = input_terms.get(item_id)  # term counts for this document
        output_vector = []
        for term in shared_vocabulary:
            if term in terms.keys():
                output_vector.append(input_idfs.get(term)*float(terms.get(term)))
            else:
                output_vector.append(float(0))
        output[item_id] = output_vector
    return output

# Apply vectorize_idf to doc_terms, doc_idfs and the shared vocabulary all_terms
doc_vectors = vectorize_idf(doc_terms, doc_idfs, all_terms)

print(f"{len(doc_vectors)} document vectors") # This should be the same number as before
print("Number of idf-scored words in a particular document:")
print(len(doc_vectors.get("1460"))) # This should be the same number as before

1460 document vectors
Number of idf-scored words in a particular document:
7775
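
To return the TF-IDF scores for a given set of words, each word's position in the shared vocabulary is needed. The sketch below builds a term-to-index lookup once and reads the scores from a document vector. The helper `tfidf_scores` and all toy data are hypothetical, and the words passed in are assumed to be preprocessed (lowercased and stemmed) the same way as the documents.

```python
import math

# Invented term counts for three documents
toy_doc_terms = {
    "d1": {"inform": 2, "system": 1},
    "d2": {"system": 3, "libr": 1},
    "d3": {"system": 1, "retriev": 2},
}
toy_vocabulary = sorted({t for terms in toy_doc_terms.values() for t in terms})
n_docs = len(toy_doc_terms)

# idf as in calculate_idfs: log10(N / (1 + df))
toy_idfs = {
    t: math.log(n_docs / (1 + sum(t in terms for terms in toy_doc_terms.values())), 10)
    for t in toy_vocabulary
}
# TF-IDF vectors as in vectorize_idf: raw count times idf
toy_vectors = {
    d: [toy_idfs[t] * terms.get(t, 0) for t in toy_vocabulary]
    for d, terms in toy_doc_terms.items()
}

# Position of every term in the vocabulary, built once
term_index = {term: i for i, term in enumerate(toy_vocabulary)}

def tfidf_scores(vectors, index, doc_id, words):
    """Return {word: score}; None marks words outside the vocabulary."""
    vector = vectors[doc_id]
    return {w: (vector[index[w]] if w in index else None) for w in words}

scores = tfidf_scores(toy_vectors, term_index, "d1", ["inform", "system", "titl"])
print(scores)
```

With the real data, the same lookup would use `all_terms` and `doc_vectors`. A dictionary lookup avoids calling `all_terms.index(word)`, which scans the whole vocabulary each time.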
