## Work through all of the cells in this notebook and answer the questions

### Part I: Text preprocessing

The following cells show a basic pipeline for normalizing text: tokenization, lowercasing, stop-word removal, and Porter stemming.

In [None]:
import nltk

In [None]:
s = '''Alzheimer's disease is a progressive disorder that causes brain cells to waste away (degenerate) and die. Alzheimer's disease is the most common cause of dementia — a continuous decline in thinking, behavioral and social skills that disrupts a person's ability to function independently.

The early signs of the disease may be forgetting recent events or conversations. As the disease progresses, a person with Alzheimer's disease will develop severe memory impairment and lose the ability to carry out everyday tasks.

Current Alzheimer's disease medications may temporarily improve symptoms or slow the rate of decline. These treatments can sometimes help people with Alzheimer's disease maximize function and maintain independence for a time. Different programs and services can help support people with Alzheimer's disease and their caregivers.

There is no treatment that cures Alzheimer's disease or alters the disease process in the brain. In advanced stages of the disease, complications from severe loss of brain function — such as dehydration, malnutrition or infection — result in death.'''

adTokens = nltk.word_tokenize(s)
print(adTokens)

In [None]:
adTokLow = [x.lower() for x in adTokens]
print(adTokLow)

In [None]:
from nltk.corpus import stopwords
set(stopwords.words('english'))

In [None]:
stop_words = set(stopwords.words('english'))
adTokLowStop = [w for w in adTokLow if w not in stop_words]
print(adTokLowStop)

In [None]:
porter = nltk.PorterStemmer()
adTokLowStopPorter = [porter.stem(w) for w in adTokLowStop]
print(adTokLowStopPorter)
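
The four steps above can also be chained into a single function. The sketch below is a standard-library-only illustration and does not reproduce NLTK's behavior: the regular-expression tokenizer, the short `TOY_STOP_WORDS` set, and the `naive_stem` suffix stripper are all hypothetical stand-ins for `nltk.word_tokenize`, the NLTK English stop list, and the Porter stemmer.

```python
import re

# Tiny illustrative stop list (assumption); NLTK's English list is much longer.
TOY_STOP_WORDS = {"a", "an", "and", "is", "of", "or", "the", "to", "that"}

def naive_stem(token):
    """Strip a few common suffixes. This is NOT the Porter algorithm."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters of the token remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def normalize(text):
    """Tokenize, lowercase, remove stop words, and stem, in that order."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    kept = [t for t in tokens if t not in TOY_STOP_WORDS]
    return [naive_stem(t) for t in kept]

print(normalize("The early signs of the disease"))
# → ['early', 'sign', 'disease']
```

Keeping the steps in one function makes it easier to apply exactly the same normalization to documents and, later, to queries.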

#### Question 1: True/False: Stemming always increases the precision of an IR system.

#### Answer: 

### Part II: Indexing

The following cells show the creation of an inverted index for Herman Melville's Moby Dick. Two indexes are created: one over the raw tokens and one over a Porter-stemmed term dictionary.

In [None]:
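# On first run, uncomment to download the data used by nltk.book
#nltk.download('book')
from nltk.book import *   # loads text1 (Moby Dick) among other sample texts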

In [None]:
text1.tokens[0:30]

In [None]:
len(text1.vocab())

In [None]:
idxTxt = nltk.Index((word, i) for (i, word) in enumerate(text1))

In [None]:
text1.vocab().most_common()[0:200]

In [None]:
text1.vocab().most_common()[-200:]

In [None]:
porter = nltk.PorterStemmer()
[porter.stem(t) for t in text1.tokens][0:10]

In [None]:
idxTxtPrtr = nltk.Index((porter.stem(word), i) for (i, word) in enumerate(text1))

In [None]:
idxTxtPrtr['herman']

In [None]:
idxTxtPrtr['mobi']
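
To make the data structure concrete, here is a minimal sketch of a positional inverted index built with the standard library only. It is meant to mirror what the `nltk.Index` calls above produce (a mapping from each term to the list of token positions where it occurs); the function name `build_inverted_index` and the toy sentence are illustrative assumptions, not part of the notebook.

```python
from collections import defaultdict

def build_inverted_index(tokens):
    """Map each token to the list of positions at which it occurs."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[token].append(position)
    return index

toy_tokens = "the whale and the ship and the sea".split()
toy_index = build_inverted_index(toy_tokens)

print(toy_index["the"])    # → [0, 3, 6]
print(toy_index["whale"])  # → [1]
```

Looking up a term is then a single dictionary access, rather than a scan over every token in the text.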

#### Question 2: True/False: Indexing reduces data storage requirements but increases the computational demand placed on IR systems.

#### Answer: 

### Part III: Example of term weighting (TF-IDF) using code from https://nlpforhackers.io/tf-idf/

The following cells compute TF-IDF term weights for the NLTK 'reuters' corpus and print a small weighted document-term matrix for six example terms.

In [None]:
from nltk.corpus import reuters
 
# OPTIONAL Print the categories associated with a file
#print(reuters.categories('training/999'))        # ['interest', 'money-fx']
 
# OPTIONAL Print the contents of the file
#print(reuters.raw('test/14829'))

In [None]:
from string import punctuation
from nltk.corpus import stopwords
from nltk import word_tokenize

# A set gives constant-time membership tests during filtering
stop_words = set(stopwords.words('english')) | set(punctuation)

def tokenize(text):
    # Tokenize, lowercase, then drop stop words, punctuation, and pure-digit tokens
    words = word_tokenize(text)
    words = [w.lower() for w in words]
    return [w for w in words if w not in stop_words and not w.isdigit()]
 

In [None]:
# build the vocabulary in one pass
vocabulary = set()
for file_id in reuters.fileids():
    words = tokenize(reuters.raw(file_id))
    vocabulary.update(words)
 
vocabulary = list(vocabulary)
word_index = {w: idx for idx, w in enumerate(vocabulary)}
 
VOCABULARY_SIZE = len(vocabulary)
DOCUMENTS_COUNT = len(reuters.fileids())
 
print([VOCABULARY_SIZE, DOCUMENTS_COUNT])      # 51553, 10788

In [None]:
import collections
import math

# word_freq holds document frequency: the number of documents containing each word
word_freq = collections.defaultdict(int)
word_idf = collections.defaultdict(float)

for file_id in reuters.fileids():
    words = set(tokenize(reuters.raw(file_id)))
    for word in words:
        word_freq[word] += 1

# Smoothed IDF: log(N / (1 + df)), keyed by the word itself
for word in vocabulary:
    word_idf[word] = math.log(DOCUMENTS_COUNT / float(1 + word_freq[word]))
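
The arithmetic behind these weights is easier to follow on a corpus small enough to check by hand. The sketch below applies the same formulas, IDF as `log(N / (1 + df))` and TF as the term count divided by document length, to three made-up documents. The documents and the `toy_*` names are assumptions for illustration, chosen so they do not shadow the notebook's own variables.

```python
import math

# Three hypothetical, already-tokenized documents.
toy_docs = {
    "doc1": ["oil", "prices", "rise"],
    "doc2": ["oil", "exports", "fall"],
    "doc3": ["grain", "exports", "rise", "rise"],
}
toy_n_docs = len(toy_docs)

def toy_doc_freq(term):
    """Number of documents that contain the term at least once."""
    return sum(1 for tokens in toy_docs.values() if term in tokens)

def toy_idf(term):
    """Smoothed inverse document frequency: log(N / (1 + df))."""
    return math.log(toy_n_docs / (1 + toy_doc_freq(term)))

def toy_tf(term, tokens):
    """Term count normalized by document length."""
    return tokens.count(term) / len(tokens)

def toy_tf_idf(term, doc_id):
    return toy_tf(term, toy_docs[doc_id]) * toy_idf(term)

print(toy_idf("oil"))              # df = 2, so log(3 / 3) = 0.0
print(toy_idf("grain"))            # df = 1, so log(3 / 2)
print(toy_tf_idf("grain", "doc3")) # (1 / 4) * log(3 / 2)
```

One consequence of the `1 +` in the denominator is that a term occurring in every document gets a slightly negative IDF, since `log(N / (1 + N))` is below zero.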

In [None]:
print(word_freq['deliberations'])
print(word_idf['deliberations'])     # 7.49443021503

print(word_freq['committee'])
print(word_idf['committee'])     # 3.61286641709

In [None]:
def word_tf(word, document):
    # Tokenize raw text; token lists are used as-is
    if isinstance(document, str):
        document = tokenize(document)
    return float(document.count(word)) / len(document)


def tf_idf(word, document):
    # Tokenize raw text; token lists are used as-is
    if isinstance(document, str):
        document = tokenize(document)

    if word not in word_index:
        return .0

    # word_idf is keyed by the word itself, not by its vocabulary index
    return word_tf(word, document) * word_idf[word]

In [None]:
print(word_index['year'])
print(word_tf('year',reuters.raw('test/14829')))
print(word_idf['year'])   # word_idf is keyed by word, not by vocabulary index
print(tf_idf('year', reuters.raw('test/14829')))

In [None]:
print(tf_idf('year', reuters.raw('test/14829')))                 # 0.0209031169481
print(tf_idf('following', reuters.raw('test/14829')))            # 0.0306117802726
print(tf_idf('provided', reuters.raw('test/14829')))             # 0.0388082713404
print(tf_idf('structural', reuters.raw('test/14829')))           # 0.0534999300236
print(tf_idf('japanese', reuters.raw('test/14829')))             # 0.0613707825494
print(tf_idf('downtrend', reuters.raw('test/14829')))            # 0.068131183773

In [None]:
print('document\t','\t'.join(['year','following','provided','structural','japanese','downtrend']))
for d in ['test/14826', 'test/14828', 'test/14829', 'test/14832', 'test/14833', 'test/14839', 'test/14840', 'test/14841', 'test/14842', 'test/14843', 'test/14844', 'test/14849', 'test/14852', 'test/14854', 'test/14858', 'test/14859', 'test/14860', 'test/14861', 'test/14862', 'test/14863', 'test/14865', 'test/14867', 'test/14872', 'test/14873', 'test/14875', 'test/14876', 'test/14877', 'test/14881', 'test/14882', 'test/14885', 'test/14886', 'test/14888', 'test/14890', 'test/14891', 'test/14892', 'test/14899', 'test/14900', 'test/14903', 'test/14904', 'test/14907', 'test/14909', 'test/14911', 'test/14912', 'test/14913', 'test/14918', 'test/14919', 'test/14921', 'test/14922', 'test/14923', 'test/14926', 'test/14928', 'test/14930', 'test/14931', 'test/14932', 'test/14933', 'test/14934', 'test/14941', 'test/14943', 'test/14949', 'test/14951', 'test/14954', 'test/14957', 'test/14958', 'test/14959', 'test/14960', 'test/14962', 'test/14963', 'test/14964', 'test/14965', 'test/14967', 'test/14968', 'test/14969', 'test/14970', 'test/14971', 'test/14974', 'test/14975', 'test/14978', 'test/14981', 'test/14982', 'test/14983', 'test/14984', 'test/14985', 'test/14986', 'test/14987', 'test/14988', 'test/14993', 'test/14995', 'test/14998', 'test/15000', 'test/15001', 'test/15002', 'test/15004', 'test/15005', 'test/15006', 'test/15011', 'test/15012', 'test/15013', 'test/15016', 'test/15017', 'test/15020', 'test/15023', 'test/15024', 'test/15026', 'test/15027', 'test/15028', 'test/15029', 'test/15031', 'test/15032', 'test/15033', 'test/15037', 'test/15038', 'test/15043', 'test/15045', 'test/15046', 'test/15048', 'test/15049', 'test/15052', 'test/15053', 'test/15055', 'test/15056', 'test/15060', 'test/15061', 'test/15062', 'test/15063', 'test/15065', 'test/15067', 'test/15069', 'test/15070', 'test/15074', 'test/15077', 'test/15078', 'test/15079', 'test/15082', 'test/15090', 'test/15091', 'test/15092', 'test/15093', 'test/15094', 'test/15095']:
    tfIdfL = []
    for w in ['year','following','provided','structural','japanese','downtrend']:
        tfIdfL.append(str(round(tf_idf(w,reuters.raw(d)),2)))
    print(d,':\t','\t'.join(tfIdfL))
    

#### Question 3: What steps would be required to implement a query over this weighted document-term matrix? Be specific about how you would process the query, calculate query/document similarity scores, and rank the results.

#### Answer: 

### Part IV: Evaluation measures using pytrec_eval (https://arxiv.org/pdf/1805.01597.pdf)

The following cells load the TREC evaluation library and apply it to a synthetic IR system evaluation.

In [None]:
import pytrec_eval
import json

In [None]:
pytrec_eval.supported_measures

#### The following cell defines two dictionaries, 'qrel' and 'run'. The 'qrel' dictionary maps each of 20 synthetic query IDs to binary relevance judgments (1 = relevant, 0 = not relevant) over a corpus of 12 documents. It serves as a synthetic reference set for an imaginary IR challenge. The 'run' dictionary holds the scores that an imaginary system assigned to each document for 2 of the queries (q1 and q2).

In [None]:
qrel = {
    'q1': {
        'd1': 0,
        'd2': 1,
        'd3': 0,
        'd4': 0,
        'd5': 1,
        'd6': 0,
        'd7': 0,
        'd8': 1,
        'd9': 0,
        'd10': 0,
        'd11': 1,
        'd12': 0
    },
    'q2': {
        'd1': 1,
        'd2': 0,
        'd3': 0,
        'd4': 0,
        'd5': 0,
        'd6': 0,
        'd7': 0,
        'd8': 1,
        'd9': 0,
        'd10': 0,
        'd11': 0,
        'd12': 0
    },
    'q3': {
        'd1': 0,
        'd2': 0,
        'd3': 0,
        'd4': 0,
        'd5': 0,
        'd6': 0,
        'd7': 0,
        'd8': 1,
        'd9': 0,
        'd10': 0,
        'd11': 1,
        'd12': 0
    },
    'q4': {
        'd1': 1,
        'd2': 1,
        'd3': 1,
        'd4': 0,
        'd5': 0,
        'd6': 0,
        'd7': 1,
        'd8': 1,
        'd9': 0,
        'd10': 0,
        'd11': 0,
        'd12': 0
    },
    'q5': {
        'd1': 0,
        'd2': 1,
        'd3': 0,
        'd4': 0,
        'd5': 0,
        'd6': 0,
        'd7': 0,
        'd8': 0,
        'd9': 0,
        'd10': 0,
        'd11': 1,
        'd12': 1
    },
    'q6': {
        'd1': 0,
        'd2': 0,
        'd3': 0,
        'd4': 0,
        'd5': 0,
        'd6': 0,
        'd7': 0,
        'd8': 0,
        'd9': 0,
        'd10': 0,
        'd11': 0,
        'd12': 0
    },    
    'q7': {
        'd1': 1,
        'd2': 1,
        'd3': 1,
        'd4': 1,
        'd5': 1,
        'd6': 0,
        'd7': 0,
        'd8': 0,
        'd9': 0,
        'd10': 0,
        'd11': 0,
        'd12': 0
    },
    'q8': {
        'd1': 0,
        'd2': 0,
        'd3': 0,
        'd4': 0,
        'd5': 0,
        'd6': 0,
        'd7': 1,
        'd8': 1,
        'd9': 1,
        'd10': 1,
        'd11': 1,
        'd12': 0
    },
    'q9': {
        'd1': 0,
        'd2': 0,
        'd3': 1,
        'd4': 1,
        'd5': 0,
        'd6': 0,
        'd7': 1,
        'd8': 1,
        'd9': 0,
        'd10': 0,
        'd11': 1,
        'd12': 1
    },
    'q10': {
        'd1': 0,
        'd2': 0,
        'd3': 0,
        'd4': 0,
        'd5': 0,
        'd6': 0,
        'd7': 0,
        'd8': 1,
        'd9': 0,
        'd10': 0,
        'd11': 1,
        'd12': 1
    },
    'q11': {
        'd1': 0,
        'd2': 1,
        'd3': 0,
        'd4': 0,
        'd5': 1,
        'd6': 0,
        'd7': 0,
        'd8': 1,
        'd9': 0,
        'd10': 0,
        'd11': 1,
        'd12': 0
    },
    'q12': {
        'd1': 1,
        'd2': 0,
        'd3': 0,
        'd4': 0,
        'd5': 0,
        'd6': 0,
        'd7': 0,
        'd8': 1,
        'd9': 0,
        'd10': 0,
        'd11': 0,
        'd12': 0
    },
    'q13': {
        'd1': 0,
        'd2': 0,
        'd3': 0,
        'd4': 0,
        'd5': 0,
        'd6': 0,
        'd7': 0,
        'd8': 1,
        'd9': 0,
        'd10': 0,
        'd11': 1,
        'd12': 0
    },
    'q14': {
        'd1': 1,
        'd2': 1,
        'd3': 1,
        'd4': 0,
        'd5': 0,
        'd6': 0,
        'd7': 1,
        'd8': 1,
        'd9': 0,
        'd10': 0,
        'd11': 0,
        'd12': 0
    },
    'q15': {
        'd1': 0,
        'd2': 1,
        'd3': 0,
        'd4': 0,
        'd5': 0,
        'd6': 0,
        'd7': 0,
        'd8': 0,
        'd9': 0,
        'd10': 0,
        'd11': 1,
        'd12': 1
    },
    'q16': {
        'd1': 0,
        'd2': 0,
        'd3': 0,
        'd4': 0,
        'd5': 0,
        'd6': 0,
        'd7': 0,
        'd8': 0,
        'd9': 0,
        'd10': 0,
        'd11': 0,
        'd12': 0
    },    
    'q17': {
        'd1': 1,
        'd2': 1,
        'd3': 1,
        'd4': 1,
        'd5': 1,
        'd6': 0,
        'd7': 0,
        'd8': 0,
        'd9': 0,
        'd10': 0,
        'd11': 0,
        'd12': 0
    },
    'q18': {
        'd1': 0,
        'd2': 0,
        'd3': 0,
        'd4': 0,
        'd5': 0,
        'd6': 0,
        'd7': 1,
        'd8': 1,
        'd9': 1,
        'd10': 1,
        'd11': 1,
        'd12': 0
    },
    'q19': {
        'd1': 0,
        'd2': 0,
        'd3': 1,
        'd4': 1,
        'd5': 0,
        'd6': 0,
        'd7': 1,
        'd8': 1,
        'd9': 0,
        'd10': 0,
        'd11': 1,
        'd12': 1
    },
    'q20': {
        'd1': 0,
        'd2': 0,
        'd3': 0,
        'd4': 0,
        'd5': 0,
        'd6': 0,
        'd7': 0,
        'd8': 1,
        'd9': 0,
        'd10': 0,
        'd11': 1,
        'd12': 1
    }
}

# q1 run ranking: d9 > d1 > d4 > d5 > d11; relevant in top 5: d5, d11
run = {
    'q1': {
        'd1': 1.3,
        'd2': 0.4,
        'd3': 0.3,
        'd4': 1.2,
        'd5': 1.1,
        'd6': 0.5,
        'd7': 0.1,
        'd8': 0.2,
        'd9': 1.5,
        'd10': 0.9,
        'd11': 1.0,
        'd12': 0.8
     },
    'q2': {
        'd1': 1.0,
        'd2': 1.4,
        'd3': 1.3,
        'd4': 0.2,
        'd5': 0.1,
        'd6': 1.5,
        'd7': 0.4,
        'd8': 0.5,
        'd9': 1.1,
        'd10': 0.9,
        'd11': 0.7,
        'd12': 0.8
    }
}

The following cells show how to evaluate the imaginary IR system using the two queries and the TREC measures.

In [None]:
evaluator = pytrec_eval.RelevanceEvaluator(
    qrel, {'recall'})
print(json.dumps(evaluator.evaluate(run), indent=1))

In [None]:
evaluator = pytrec_eval.RelevanceEvaluator(
    qrel, {'P'})
print(json.dumps(evaluator.evaluate(run), indent=1))

In [None]:
evaluator = pytrec_eval.RelevanceEvaluator(
    qrel, {'11pt_avg'})
print(json.dumps(evaluator.evaluate(run), indent=1))

In [None]:
evaluator = pytrec_eval.RelevanceEvaluator(
    qrel, {'map'})
print(json.dumps(evaluator.evaluate(run), indent=1))
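
As a sanity check on the library output, the same measures can be computed by hand for a single query. The sketch below copies the q1 judgments and scores from the cell above into two illustratively named dictionaries and computes precision at 5, recall at 5, and average precision in plain Python. It assumes binary judgments and no tied scores, both of which hold for q1. The results should agree with the `P_5`, `recall_5`, and `map` entries reported for q1, though that is worth confirming against the actual output.

```python
# Copy of the q1 judgments and scores from the cell above (names are illustrative).
q1_judgments = {'d1': 0, 'd2': 1, 'd3': 0, 'd4': 0, 'd5': 1, 'd6': 0,
                'd7': 0, 'd8': 1, 'd9': 0, 'd10': 0, 'd11': 1, 'd12': 0}
q1_scores = {'d1': 1.3, 'd2': 0.4, 'd3': 0.3, 'd4': 1.2, 'd5': 1.1, 'd6': 0.5,
             'd7': 0.1, 'd8': 0.2, 'd9': 1.5, 'd10': 0.9, 'd11': 1.0, 'd12': 0.8}

# Rank documents by descending score (no tied scores occur for q1).
ranking = sorted(q1_scores, key=q1_scores.get, reverse=True)
total_relevant = sum(q1_judgments.values())

def precision_at(k):
    """Fraction of the top-k ranked documents that are relevant."""
    hits = sum(q1_judgments[d] for d in ranking[:k])
    return hits / k

def recall_at(k):
    """Fraction of all relevant documents found in the top k."""
    hits = sum(q1_judgments[d] for d in ranking[:k])
    return hits / total_relevant

def average_precision():
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranking, start=1):
        if q1_judgments[doc_id]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / total_relevant

print(ranking[:5])         # → ['d9', 'd1', 'd4', 'd5', 'd11']
print(precision_at(5))     # → 0.4
print(recall_at(5))        # → 0.5
print(average_precision())
```

For q1 the relevant documents sit at ranks 4, 5, 9, and 11, so average precision is the mean of 1/4, 2/5, 3/9, and 4/11.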

#### Question 4: True/False: The setup for a “TREC-like” information retrieval evaluation includes a set of queries and document relevance judgements.

#### Answer: 

#### Question 5: True/False: Term weights derived using TF-IDF term weighting follow the three axioms of probability.

#### Answer: 

#### Question 6: True/False: R-precision is a metric that is useful for assessing IR system performance when there is an unknown number of relevant documents in a very large corpus.

#### Answer: 

#### Question 7: True/False: The balanced harmonic mean of precision and recall (i.e., F-measure) places more emphasis on recall than precision.

#### Answer: 

#### Question 8: What are the possible values for interpolated precision at recall level 0?

#### Answer: 

#### Question 9: An IR system returns 8 relevant documents, and 10 non-relevant documents. There are a total of 20 relevant documents in the collection. What is the precision, recall, and F-measure of the system on this search?

#### Answer: 

#### Question 10: What relationship between precision and recall is used to construct an 11-point interpolated precision-recall curve?

#### Answer: 

#### Question 11: Why would the Area Under the Precision Recall Curve make more sense than AUROC for a comparative evaluation of IR systems?

#### Answer: 