# Data Engineering: Lab 07 - Solution
---------------

### Task 01: Indexing

##### a) Read the file 'sprichwoerter.txt' into a variable called corpus.

In [None]:
def readCorpus(filename):
    content = ''
    with open(filename) as f:
        content = f.readlines()
    return content

corpus = readCorpus('sprichwoerter.txt')

##### b) Read the stopwords from the file 'stopwoerter.txt' into a variable called stopwords.

In [None]:
def readStopwords(filename):
    stopwords = set()
    with open(filename) as f:
        content = f.readlines()
    for c in content:
        tc = c.strip()
        stopwords.add(tc)
    return stopwords

stopwords = readStopwords('stopwoerter.txt')

##### c) Create an index of the corpus called 'index1'. Remove all non-word characters, split by whitespace, and convert all terms to lower-case.

In [None]:
import re

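def index1Corpus(corpus):
    index = {}
    # Map each line number to the set of normalized terms on that line.
    for pos, c in enumerate(corpus):
        cindex = set()
        for t in c.split():  # split on any whitespace, not only single spaces
            # Drop non-word characters (punctuation) and lower-case the term.
            tt = re.sub(r'\W+', '', t).lower()
            if tt:  # skip tokens that consisted only of punctuation
                cindex.add(tt)
        index[pos] = cindex
    return index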

index1 = index1Corpus(corpus)
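
The normalization step can be tried on a single line before indexing the whole file. The following is a minimal sketch: the helper name `normalizeLine` and the sample sentence are assumptions for illustration and are not taken from 'sprichwoerter.txt'. Unlike a plain split on `' '`, it also skips tokens that become empty after punctuation is removed.

```python
import re

def normalizeLine(line):
    """Return the set of lower-cased terms of one line, punctuation removed."""
    terms = set()
    for token in line.split():  # split on any whitespace
        term = re.sub(r'\W+', '', token).lower()
        if term:  # skip tokens that consisted only of punctuation
            terms.add(term)
    return terms

sample = 'Morgenstund hat Gold im Mund.\n'  # made-up sample line
print(sorted(normalizeLine(sample)))
# ['gold', 'hat', 'im', 'morgenstund', 'mund']
```

Because `\W` is Unicode-aware for Python 3 strings, umlauts and 'ß' are kept as word characters.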

##### d) Create an index of the corpus called 'index2'. Remove all non-word characters, split by whitespace, convert all terms to lower-case, and also remove all stopwords.

In [None]:
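def index2Corpus(corpus):
    index = {}
    # Same as index1, but terms found in the global stopwords set are dropped.
    for pos, c in enumerate(corpus):
        cindex = set()
        for t in c.split():  # split on any whitespace, not only single spaces
            tt = re.sub(r'\W+', '', t).lower()
            # Skip empty tokens (pure punctuation) and stopwords.
            if tt and tt not in stopwords:
                cindex.add(tt)
        index[pos] = cindex
    return index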

index2 = index2Corpus(corpus)

##### e) Create an index of the corpus called 'index3'. Remove all non-word characters, split by whitespace, convert all terms to lower-case, remove all stopwords, and also use stemming (with a library of your choice, e.g., Snowball Stemmer).

In [None]:
from nltk.stem import SnowballStemmer

def index3Corpus(corpus):
    stemmer = SnowballStemmer('german')
    index = {}
    # Same as index2, but every remaining term is reduced to its stem.
    for pos, c in enumerate(corpus):
        cindex = set()
        for t in c.split():  # split on any whitespace, not only single spaces
            tt = re.sub(r'\W+', '', t).lower()
            # Skip empty tokens (pure punctuation) and stopwords.
            if tt and tt not in stopwords:
                cindex.add(stemmer.stem(tt))
        index[pos] = cindex
    return index

index3 = index3Corpus(corpus)

### Task 02: Matching

##### a) Create a method called evaluateQuery which uses the boolean retrieval function to evaluate a search query against the documents in the chosen index.

In [None]:
def evaluateQuery(index, corpus, query):
    result = ''
    for ind in index:
        if query in index[ind]:
            result = result + corpus[ind]
    return result
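
Boolean retrieval of a single term is a set-membership test per document. The following is a minimal sketch on a toy index: `toyCorpus`, `toyIndex`, and the helper `matchDocuments` are assumptions for illustration, and the two sample lines are not taken from 'sprichwoerter.txt'.

```python
# Toy data: two made-up lines and their hand-built term sets.
toyCorpus = ['Aus den Augen, aus dem Sinn.\n', 'Ohne Fleiß kein Preis.\n']
toyIndex = {
    0: {'aus', 'den', 'augen', 'dem', 'sinn'},
    1: {'ohne', 'fleiß', 'kein', 'preis'},
}

def matchDocuments(index, query):
    """Return the ids of all documents whose term set contains the query term."""
    return [docId for docId, terms in index.items() if query in terms]

hits = matchDocuments(toyIndex, 'augen')
print([toyCorpus[i].strip() for i in hits])
# ['Aus den Augen, aus dem Sinn.']
```

The match is exact, so the query has to be normalized the same way as the index: `matchDocuments(toyIndex, 'Augen')` returns an empty list because the index only holds lower-case terms. The same applies to stemming when a stemmed index is queried.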

##### b) Create a method called stemQueryterm which stems and returns a query term.

In [None]:
def stemQueryterm(term):
    # Stem the query with the same German Snowball stemmer used for index3.
    stemmer = SnowballStemmer('german')
    return stemmer.stem(term)

##### c) Evaluate the following search queries with the corresponding indexes.

In [None]:
print('Answer 1: The results for the search term augen and the index1 are:\n {}'.format(evaluateQuery(index1, corpus, 'augen')))
print('Answer 2: The results for the search term ohne and the index1 are:\n {}'.format(evaluateQuery(index1, corpus, 'ohne')))
print('Answer 3: The results for the search term ohne and the index2 are:\n {}'.format(evaluateQuery(index2, corpus, 'ohne')))
print('Answer 4: The results for the search term augen and the index3 are:\n {}'.format(evaluateQuery(index3, corpus, 'augen')))
print('Answer 5: The results for the stemmed search term augen and the index3 are:\n {}'.format(evaluateQuery(index3, corpus, stemQueryterm('augen'))))

### Task 03: Ranking

##### a) Create a method called calculateScore1, which creates a ranking of the results based on the number of occurrences of the search term in the document. More occurrences of a search term in the document lead to a higher score in the ranking.

In [None]:
def calculateScore1(index, query):
    scores = []
    qterms = query.split(' ')
    for ind in index:
        score = 0
        for t in qterms:
            for dt in index[ind]:
                if t == dt:
                    score = score + 1
        scores.append(score)
    return scores
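
Counting occurrences only works if the index keeps duplicate terms. The following is a minimal sketch of the per-document count using `collections.Counter`; the helper name `countScore` and the sample terms are assumptions for illustration.

```python
from collections import Counter

def countScore(terms, query):
    """Sum the occurrences of every query term in one document's terms."""
    counts = Counter(terms)  # a missing term counts as 0
    return sum(counts[t] for t in query.split())

asList = ['alt', 'lieb', 'rost', 'alt']  # made-up stemmed terms of one document
print(countScore(asList, 'alt'))       # 2
print(countScore(set(asList), 'alt'))  # 1, a set has dropped the duplicate
```

With a set-based index every score is therefore capped at one per query term, which is why a list-based index is needed for a count-based ranking.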

##### b) Create a method called calculateScore2, which creates a ranking of the results based on the position of the occurrences of the search terms in the document. Earlier occurrences of a search term in the document lead to a higher score in the ranking. The maximum score is 10.0.

In [None]:
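def calculateScore2(index, query):
    # Expects an index whose values are ordered lists of terms.
    scores = []
    qterms = query.split(' ')
    for ind in index:
        score = 10.0  # default rank; yields the minimum score of 1.0 below
        for t in qterms:
            # Scan the document's terms in order, not the document id itself.
            for pos, dt in enumerate(index[ind]):
                if t == dt:
                    if pos < score:
                        score = pos + 1  # 1-based rank of the earliest match
                    break  # only the first occurrence of each term counts
        fsc = 10.0 / score
        scores.append(max(1.0, fsc))
    return scores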
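
The position-based score of a single document can be worked through by hand: a match at 0-based position `pos` gives `10.0 / (pos + 1)`, floored at 1.0. The following is a minimal sketch of that formula; the helper name `positionScore` and the sample terms are assumptions for illustration, and it assumes the document's terms are stored as an ordered list.

```python
def positionScore(terms, query, maxScore=10.0):
    """Score one document by the earliest position of any query term."""
    firstPositions = [terms.index(t) for t in query.split() if t in terms]
    if not firstPositions:
        return 1.0  # no match: same floor as a very late match
    return max(1.0, maxScore / (min(firstPositions) + 1))

doc = ['lieb', 'alt', 'rost', 'alt']  # made-up stemmed terms of one document
print(positionScore(doc, 'alt'))      # 5.0, first match at position 1
print(positionScore(['alt'], 'alt'))  # 10.0, match at position 0
print(positionScore(doc, 'jung'))     # 1.0, no match
```

Note that under this scheme a document without any match receives the same score as one whose first match is at position 9 or later, so the score alone does not separate the two cases.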

##### c) Use the calculateScore1 function to rank the results for the search term 'alt' with the index3 from Task 01. You will probably need to adapt index3 slightly, so reimplement the index3 function if needed.

In [None]:
from nltk.stem import SnowballStemmer

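def index3Corpus(corpus):
    stemmer = SnowballStemmer('german')
    index = {}
    for pos, c in enumerate(corpus):
        # A list (not a set) keeps duplicates and term order for the ranking.
        cindex = list()
        for t in c.split():  # split on any whitespace, not only single spaces
            tt = re.sub(r'\W+', '', t).lower()
            # Skip empty tokens (pure punctuation) and stopwords.
            if tt and tt not in stopwords:
                cindex.append(stemmer.stem(tt))
        index[pos] = cindex
    return index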

index3 = index3Corpus(corpus)

calculateScore1(index3, 'alt')
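
calculateScore1 returns one score per document in corpus order, so an explicit sort is still needed to obtain a ranking. The following is a minimal sketch: the helper name `rankDocuments` and the toy inputs are assumptions for illustration, and dropping documents with a score of 0 is a design choice, not something the task prescribes.

```python
def rankDocuments(scores, corpus):
    """Return (score, line) pairs, best score first, skipping non-matches."""
    # sorted() is stable, so documents with equal scores keep corpus order.
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return [(score, corpus[docId].strip()) for docId, score in ranked if score > 0]

toyScores = [0, 2, 1]                                 # made-up scores
toyLines = ['line A\n', 'line B\n', 'line C\n']       # made-up corpus lines
print(rankDocuments(toyScores, toyLines))
# [(2, 'line B'), (1, 'line C')]
```

With the notebook's variables, the call would look like `rankDocuments(calculateScore1(index3, 'alt'), corpus)`.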

##### d) Use the calculateScore2 function to rank the results for the search term 'alt' with the index3 from Task 01. You will probably need to adapt index3 slightly, so reimplement the index3 function if needed.