## Report

### Question *a*
The component $\log(1 + f_{ij})$ dampens the impact of terms with a high frequency. Because the logarithm grows much more slowly than the raw count $f_{ij}$, frequent but less important terms carry less weight in the computation than they would if the raw frequency were used directly.
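The damping effect can be illustrated with a minimal sketch. It assumes the natural logarithm (the base is not stated in the question), and `local_weight` is a hypothetical helper name, not part of the submitted code.

```python
import math

def local_weight(f_ij: int) -> float:
    """Local log weight log(1 + f_ij) of a term occurring f_ij times in a document."""
    # Assumption: natural logarithm; another base only rescales the weights.
    return math.log(1 + f_ij)

# The raw frequency grows 1000-fold here, the log weight only about 10-fold.
for f_ij in (1, 10, 100, 1000):
    print(f"f_ij = {f_ij:>4}  ->  weight = {local_weight(f_ij):.3f}")
```

A term that never occurs still gets weight 0, since $\log(1 + 0) = 0$; this is why 1 is added inside the logarithm.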

### Question *b*
The second component is a global weight: it depends only on the term *i*, so it is the same constant in the calculation for every document in which the term appears. It is derived from the entropy of the term's distribution over the documents: the more evenly distributed a term is across the documents, the lower the value of this component (less 'weight'). Its purpose is to make the more informative terms in the document body score higher. This matters most for queries containing multiple terms, because it tells the algorithm which term to prioritize.
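As a rough illustration, the sketch below computes this component under the commonly used formulation $g_i = 1 + \sum_j \frac{p_{ij} \log p_{ij}}{\log n}$ with $p_{ij} = f_{ij} / \sum_j f_{ij}$. That formula is an assumption here, since the report does not restate it, and `entropy_global_weight` is a hypothetical name rather than a method of the submitted `InvertedIndex`.

```python
import math

def entropy_global_weight(term_counts: list[int]) -> float:
    """Entropy-based global weight of one term.

    term_counts holds the term's count in each of the n documents
    (zeros included for documents that do not contain it).
    """
    n = len(term_counts)
    total = sum(term_counts)
    if n < 2 or total == 0:
        raise ValueError("need at least two documents and one occurrence")
    entropy_sum = 0.0
    for f_ij in term_counts:
        if f_ij > 0:  # p * log(p) tends to 0 as p tends to 0, so skip zeros
            p_ij = f_ij / total
            entropy_sum += p_ij * math.log(p_ij)
    return 1 + entropy_sum / math.log(n)

# Evenly spread term: weight close to 0. Concentrated term: weight 1.
print(entropy_global_weight([5, 5, 5, 5]))
print(entropy_global_weight([20, 0, 0, 0]))
print(entropy_global_weight([17, 1, 1, 1]))
```

Under this formulation the weight always lies between 0 and 1, so it scales the local log weight down rather than up.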

### Question *c*
*{ see code }*

### Question *d*

Log-entropy weighting works better than plain TF: TF accounts only for how often a word is used and ignores how informative the word is in context, whereas log-entropy weighting also takes the word's spread across the documents into account.
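A toy example may make the difference concrete. The counts below are invented for illustration (they are not taken from the Shakespeare corpus), and `log_entropy_weights` is a hypothetical helper that assumes the usual log-entropy formulation with natural logarithms.

```python
import math

def log_entropy_weights(counts_by_term: dict[str, list[int]]) -> dict[str, list[float]]:
    """Map each term to its log-entropy weight in every document."""
    weights = {}
    for term, counts in counts_by_term.items():
        n = len(counts)
        total = sum(counts)
        # Entropy-based global weight: near 0 for even spread, 1 if concentrated.
        entropy_sum = sum((f / total) * math.log(f / total) for f in counts if f > 0)
        g = 1 + entropy_sum / math.log(n)
        # Multiply by the local weight log(1 + f) for each document.
        weights[term] = [g * math.log(1 + f) for f in counts]
    return weights

# Invented counts: both terms occur 5 times in document 0.
toy_counts = {
    "king": [5, 5, 5, 5],   # spread evenly over all four documents
    "thane": [5, 0, 0, 0],  # concentrated in document 0
}
weights = log_entropy_weights(toy_counts)
for term, counts in toy_counts.items():
    print(term, "TF in doc 0:", counts[0], "log-entropy in doc 0:", round(weights[term][0], 3))
```

Plain TF cannot tell the two terms apart in document 0, while the log-entropy weight for the evenly spread term collapses towards zero and the concentrated term keeps its full local weight.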

Log-entropy weighting performs comparably to TF-IDF, but it has a drawback relative to TF-IDF: if a term occurs very often across a set of documents, log-entropy weighting can treat its occurrences as noise and assign it a very low weight, even though the term is actually relevant.

In [1]:
# First install the required packages:
# !pip3 install nltk
# !pip3 install scipy
# !pip3 install numpy
import nltk
nltk.download('stopwords')  # fetch NLTK's stopword list

from inverted_index import InvertedIndex
from utils import read_data

# Build the inverted index over every document in the corpus
inv_ind = InvertedIndex()
documents = read_data("./shakespeare")
for d in documents:
    inv_ind.add_document(d)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
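# Log-entropy weighting: compute the weights, build the matrix, run the query
inv_ind.calcLogEntropy()
inv_ind.generate_term_by_doc_matrix(log_entropy=True)
results = inv_ind.search("scotland kings and thanes", log_entropy=True)

# Print the ten best-scoring documents
for r in range(10):
    print(results[r])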

('Macbeth', 0.16118727379382877)
('King Henry VI', 0.030048860936556884)
('King Henry IV', 0.02948619616502898)
('King Henry IV, II', 0.023764658494466812)
('King Richard III', 0.01760253727154099)
('King Henry V', 0.014994910138431003)
('King John', 0.012133633289107127)
('King Richard II', 0.012106985654825615)
("All's Well that Ends Well", 0.01119139253053665)
('King Lear', 0.010978210515097522)


In [3]:
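# TF-IDF weighting: compute the weights, build the matrix, run the same query
inv_ind.calcTFIDF()
inv_ind.generate_term_by_doc_matrix(tfidf=True)
results = inv_ind.search("scotland kings and thanes", tfidf=True)

# Print the ten best-scoring documents
for r in range(10):
    print(results[r])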

('Macbeth', 0.08559316237351267)
('King Henry IV', 0.005789261723483593)
('King Henry VI', 0.003660436049077642)
('King Henry IV, II', 0.003121709934588564)
('King Henry V', 0.0019193400093131976)
('King Richard III', 0.0013431147327243318)
('King John', 0.0007488196759316429)
('King Richard II', 0.0006742482860831404)
('King Henry VIII', 0.0005161221482695165)
('The Comedy of Errors', 0.00046244490997942336)


In [4]:
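# Plain TF: rebuild the matrix without any weighting flag
inv_ind.generate_term_by_doc_matrix()
# Note: the query is still issued with tfidf=True in this cell
results = inv_ind.search("scotland kings and thanes", tfidf=True)

# Print the ten best-scoring documents
for r in range(10):
    print(results[r])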

('Macbeth', 0.06342271855923712)
('King Henry VI', 0.01072781887182939)
('King Henry V', 0.009657775892130944)
('King John', 0.008469965122871384)
('King Richard II', 0.0077213170531481206)
('King Lear', 0.006993816919843515)
('King Henry IV', 0.006848287703722218)
('King Henry VIII', 0.0068475238333851225)
('King Richard III', 0.006723494450381792)
('King Henry IV, II', 0.005135632065177295)
