### Pre-Processing
We study the effect of preprocessing on the relevance of the retrieved results. To do so, the system is tested with three preprocessing models:
* Preprocessing model 1: cleaning + tokenization + removal of stop words 
* Preprocessing model 2: Model 1 + lemmatization
* Preprocessing model 3: Model 1 + lemmatization + synonym enrichment



In [3]:
# Importing libraries
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.tokenize import sent_tokenize, TreebankWordTokenizer
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
import re
from bs4 import BeautifulSoup #to remove HTML tags

lemmatizer = WordNetLemmatizer()
stop_list = stopwords.words('english')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


#### Documents in "cran.all.1400" look like this: 

**.I 1** (index) \
**.T** (Title) \
experimental investigation of the aerodynamics of a
wing in a slipstream .\
**.A** (Author) \
brenckman,m.\
**.B** (Bibliographic reference)\
j. ae. scs. 25, 1958, 324. \
**.W** (content)\
experimental investigation of the aerodynamics of a
wing in a slipstream .
  an experimental study of a wing in a propeller slipstream was
made in order to determine the spanwise distribution of the lift
increase due to slipstream at different angles of attack of the wing
and at different free stream to slipstream velocity ratios .  the
results were intended in part as an evaluation basis for different
theoretical treatments of this problem .
  the comparative span loading curves, together with
supporting evidence, showed that a substantial part of the lift increment
produced by the slipstream was due to a /destalling/ or
boundary-layer-control effect .  the integrated remaining lift
increment, after subtracting this destalling lift, was found to agree
well with a potential flow theory .
  an empirical evaluation of the destalling effects was made for
the specific configuration of the experiment .\
**.I 2**\
...

#### Queries in "cran.qry" look like this:

**.I** 001 \
**.W**\
what similarity laws must be obeyed when constructing aeroelastic models
of heated high speed aircraft .\
**.I** 002\
...

In [4]:
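def parseDocs(filename):
    """Return the text of every non-empty .W field in a Cranfield-format file."""
    docs = []
    doc = ""
    in_body = False  # True while reading the lines that follow a .W marker

    with open(filename, "r") as f:
        for line in f:
            if line.startswith(".I"):
                # A new entry starts: store the previous body, if any.
                # Text between .I and .W (title, author, ...) is skipped.
                in_body = False
                if doc:
                    docs.append(doc)
                    doc = ""
            elif line.startswith(".W"):
                in_body = True
            elif in_body:
                doc += line

    # Needed for the last document. Entries with an empty body are never
    # appended, so list positions do not necessarily match the .I numbers.
    if doc:
        docs.append(doc)

    return docs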

In [6]:
docs = parseDocs("cran/cran.all.1400")
print(docs[5])
print("number of documents: ",len(docs))
queries = parseDocs("cran/cran.qry")
print(queries[5])
print("number of queries: ",len(queries))

one-dimensional transient heat flow in a multilayer
slab .
  in a recent contribution to the readers'
forum wassermann gave analytic
solutions for the temperature in a double
layer slab, with a triangular heat
rate input at one face, insulated at the other,
and with no thermal resistance
at the interface .  his solutions were for the
three particular cases..
i propose here to give the general solution
to this problem, to indicate
briefly how it is obtained using the method of
reference 2, and to point out
that the solutions given by wassermann are
incomplete for times longer
than the duration of the heat input .

number of documents:  1398
what theoretical and experimental guides do we have as to turbulent
couette flow behaviour .

number of queries:  225
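The document count printed above (1398) is lower than the 1400 suggested by the file name. One plausible cause is that `parseDocs` never appends an entry whose `.W` body is empty, which would also shift list positions away from the `.I` numbers. If results are later matched against those identifiers, keeping them explicitly may help. The following is a minimal sketch under that assumption; `parse_cran_with_ids` and the inline sample are hypothetical and not part of the notebook.

```python
def parse_cran_with_ids(lines):
    """Map each .I identifier to the text of its .W field (possibly empty)."""
    records = {}
    current_id = None
    in_body = False
    for line in lines:
        if line.startswith(".I "):
            # ".I 001" and ".I 1" both become the integer 1
            current_id = int(line.split()[1])
            records[current_id] = ""
            in_body = False
        elif line.startswith(".W"):
            in_body = True
        elif in_body and current_id is not None:
            records[current_id] += line
    return records


# Tiny made-up sample in the same layout as the files shown earlier.
sample = """.I 1
.T
a title
.A
someone
.B
somewhere
.W
first body line
second body line
.I 2
.T
entry without body
.W
.I 3
.W
third body
""".splitlines(keepends=True)

records = parse_cran_with_ids(sample)
print(sorted(records))   # [1, 2, 3]
print(repr(records[2]))  # ''
```

With a real file, the same function could be called as `parse_cran_with_ids(open(path))`, since a file object iterates over its lines.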


In [8]:
def tokenize_and_clean(docs):
    """Tokenize texts into lowercased, filtered tokens.

    Each string is lowercased, stripped of HTML tags, split into sentences
    with sent_tokenize() and into tokens with TreebankWordTokenizer().tokenize().
    Stop words and tokens that start with a non-alphanumeric character are removed.

    Parameters
    ----------
    docs : list of strings
        list of document contents

    Returns
    -------
    tokens : list of list of strings
        each text as a list of lowercased tokens
    """
    tokenizer = TreebankWordTokenizer()
    stop_words = set(stop_list)  # set membership is faster than scanning the list
    tokens = []

    for doc in docs:
        # convert to lower case
        txt = doc.lower()

        # remove HTML tags
        txt = BeautifulSoup(txt, 'html.parser').get_text()

        # tokenize sentence by sentence, then flatten into one list
        tok = [word for sent in sent_tokenize(txt) for word in tokenizer.tokenize(sent)]

        # remove stop words and tokens whose first character is not alphanumeric
        # (re.match only anchors at the start, so a token like 'cases..' is kept)
        clean_tokens = [word for word in tok
                        if word not in stop_words and not re.match('[^A-Za-z0-9]', word)]

        tokens.append(clean_tokens)

    return tokens

In [9]:
doc_tokens = tokenize_and_clean(docs)
print(doc_tokens[5])

['one-dimensional', 'transient', 'heat', 'flow', 'multilayer', 'slab', 'recent', 'contribution', "readers'", 'forum', 'wassermann', 'gave', 'analytic', 'solutions', 'temperature', 'double', 'layer', 'slab', 'triangular', 'heat', 'rate', 'input', 'one', 'face', 'insulated', 'thermal', 'resistance', 'interface', 'solutions', 'three', 'particular', 'cases..', 'propose', 'give', 'general', 'solution', 'problem', 'indicate', 'briefly', 'obtained', 'using', 'method', 'reference', '2', 'point', 'solutions', 'given', 'wassermann', 'incomplete', 'times', 'longer', 'duration', 'heat', 'input']


In [10]:
query_tokens = tokenize_and_clean(queries)
print(query_tokens[5])

['theoretical', 'experimental', 'guides', 'turbulent', 'couette', 'flow', 'behaviour']
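The document tokens printed earlier still contain entries such as `'cases..'` and `"readers'"`. This is because `re.match('[^A-Za-z0-9]', word)` only inspects the first character of each token. A stricter filter could trim punctuation at the token edges and then require the whole token to be alphanumeric, allowing internal hyphens. The sketch below assumes lowercased input; `strip_edges`, `keep_token` and `ALNUM_TOKEN` are hypothetical names, not part of the notebook.

```python
import re

# Whole-token pattern: alphanumeric runs, optionally joined by single hyphens.
ALNUM_TOKEN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def strip_edges(token):
    """Remove non-alphanumeric characters from both ends of a token."""
    return re.sub(r"^[^a-z0-9]+|[^a-z0-9]+$", "", token)


def keep_token(token, stop_words):
    """Keep tokens that are not stop words and match the pattern entirely."""
    return token not in stop_words and ALNUM_TOKEN.fullmatch(token) is not None


tokens = ["one-dimensional", "readers'", "cases..", "2", "the", "."]
trimmed = [strip_edges(t) for t in tokens]
kept = [t for t in trimmed if keep_token(t, {"the"})]
print(kept)  # ['one-dimensional', 'readers', 'cases', '2']
```

Whether trimming is desirable depends on the collection. It merges `'cases..'` with `'cases'`, which should help matching between queries and documents, but it also discards tokens made only of punctuation.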


In [11]:
def lemmatize(doc_tokens):
    """This function lemmatizes texts with NLTK WordNetLemmatizer

    Parameters
    ----------
    doc_tokens : list of list of tokens
    
    Returns
    -------
    doc_lemmas : list of list of lemmatized tokens
    """
    doc_lemmas = []
    
    for doc in doc_tokens:
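        # lemmatize() defaults to pos='n', so only noun forms are reduced;
        # verb forms such as 'gave' are returned unchanged
        lemmas = [lemmatizer.lemmatize(token) for token in doc]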
        doc_lemmas.append(lemmas)
        
    return doc_lemmas

In [12]:
doc_lemmas = lemmatize(doc_tokens)
print(doc_lemmas[5])

['one-dimensional', 'transient', 'heat', 'flow', 'multilayer', 'slab', 'recent', 'contribution', "readers'", 'forum', 'wassermann', 'gave', 'analytic', 'solution', 'temperature', 'double', 'layer', 'slab', 'triangular', 'heat', 'rate', 'input', 'one', 'face', 'insulated', 'thermal', 'resistance', 'interface', 'solution', 'three', 'particular', 'cases..', 'propose', 'give', 'general', 'solution', 'problem', 'indicate', 'briefly', 'obtained', 'using', 'method', 'reference', '2', 'point', 'solution', 'given', 'wassermann', 'incomplete', 'time', 'longer', 'duration', 'heat', 'input']


In [13]:
query_lemmas = lemmatize(query_tokens)
print(query_lemmas[5])

['theoretical', 'experimental', 'guide', 'turbulent', 'couette', 'flow', 'behaviour']
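In the lemmatized output, plural nouns are reduced (`'solutions'` becomes `'solution'`), but `'gave'` and `'given'` stay as they are. `WordNetLemmatizer.lemmatize` treats every token as a noun unless a `pos` argument is passed. One simple workaround, sketched below, is to try several WordNet part-of-speech codes in turn and keep the first lemma that differs from the token. The helper `lemmatize_any_pos` is hypothetical. The toy lookup stands in for the real lemmatizer so that the example runs without NLTK data.

```python
def lemmatize_any_pos(token, lemmatize_fn, pos_tags=("n", "v", "a", "r")):
    """Try each part of speech in order; return the first lemma that changes the token."""
    for pos in pos_tags:
        lemma = lemmatize_fn(token, pos)
        if lemma != token:
            return lemma
    return token


# Stand-in for WordNetLemmatizer().lemmatize(word, pos), for illustration only.
_TOY_LEMMAS = {
    ("solutions", "n"): "solution",
    ("gave", "v"): "give",
    ("given", "v"): "give",
}


def toy_lemmatize(word, pos):
    return _TOY_LEMMAS.get((word, pos), word)


result = [lemmatize_any_pos(t, toy_lemmatize) for t in ["solutions", "gave", "given", "heat"]]
print(result)  # ['solution', 'give', 'give', 'heat']
```

With NLTK, the call would be `lemmatize_any_pos(token, lemmatizer.lemmatize)`. Trying every part of speech blindly can over-reduce some words, so tagging tokens with a part-of-speech tagger first would likely be more accurate.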


In [14]:
def synonym_enrichment(doc_lemmas):
    """Enrich the documents with synonyms from a semantic knowledge base such as WordNet (not implemented yet)

    Parameters
    ----------
    doc_lemmas : list of list of strings (lemmas)
    
    Returns
    -------
    doc_enrich : list of list of strings
    """
    doc_enrich = []
    
    for doc in doc_lemmas:
        for lemma in doc:
            for syn in wordnet.synsets(lemma):
                # TODO: collect the synonyms of each synset and add them to the document
                raise NotImplementedError()
                    
    return doc_enrich

In [15]:
wordnet.synsets(doc_lemmas[5][1])

[Synset('transient.n.01'),
 Synset('transient.n.02'),
 Synset('transeunt.a.01'),
 Synset('ephemeral.s.01')]
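Each synset returned above groups several lemma names, which is what an enrichment step could append to a document. The sketch below is one possible approach, not the notebook's implementation. `enrich_with_synonyms`, `wordnet_lookup` and the `max_synonyms` cap are assumptions introduced for illustration. The lookup is passed in as a function so that the logic can be tried on a toy thesaurus first.

```python
def enrich_with_synonyms(doc_lemmas, lookup, max_synonyms=3):
    """Append up to `max_synonyms` new synonyms per lemma to each document."""
    doc_enrich = []
    for doc in doc_lemmas:
        enriched = list(doc)
        seen = set(doc)  # avoid adding words the document already contains
        for lemma in doc:
            added = 0
            for synonym in lookup(lemma):
                if added >= max_synonyms:
                    break
                if synonym not in seen:
                    enriched.append(synonym)
                    seen.add(synonym)
                    added += 1
        doc_enrich.append(enriched)
    return doc_enrich


def wordnet_lookup(lemma):
    """Lemma names of all WordNet synsets of `lemma` (needs NLTK and its 'wordnet' corpus)."""
    from nltk.corpus import wordnet  # lazy import: the toy demo below runs without NLTK
    # Multi-word names come back with underscores, e.g. 'short_circuit'.
    return [name.lower() for syn in wordnet.synsets(lemma) for name in syn.lemma_names()]


# Toy thesaurus with made-up entries, for illustration only.
toy_thesaurus = {"transient": ["transient", "ephemeral", "fleeting"], "heat": ["warmth"]}
demo = enrich_with_synonyms([["transient", "heat", "flow"]],
                            lambda word: toy_thesaurus.get(word, []))
print(demo)  # [['transient', 'heat', 'flow', 'ephemeral', 'fleeting', 'warmth']]
```

With WordNet, the call would be `enrich_with_synonyms(doc_lemmas, wordnet_lookup)`. Since every sense of a word contributes synonyms, the cap per lemma is one crude way to limit noise. Choosing senses more carefully is left open here.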

In [16]:
def remove_duplicates(t): 
    return list(set(t))
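Converting to a `set` removes duplicates but does not keep the tokens in their original order. If the order matters later, for example for phrase or position features, an order-preserving variant can rely on the fact that `dict` keys keep insertion order in Python 3.7 and later. `remove_duplicates_ordered` is a hypothetical name.

```python
def remove_duplicates_ordered(tokens):
    """Drop repeated tokens while keeping the first occurrence of each."""
    return list(dict.fromkeys(tokens))


unique = remove_duplicates_ordered(["heat", "slab", "heat", "flow", "slab"])
print(unique)  # ['heat', 'slab', 'flow']
```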

In [81]:
print(remove_duplicates(doc_lemmas[5]))

["readers'", 'resistance', 'multilayer', 'general', 'layer', 'insulated', 'duration', 'forum', 'indicate', 'heat', 'point', 'slab', 'recent', 'solution', 'double', 'face', 'give', 'gave', 'obtained', 'analytic', '2', 'propose', 'wassermann', 'one', 'incomplete', 'triangular', 'transient', 'interface', 'time', 'flow', 'three', 'cases..', 'briefly', 'using', 'longer', 'particular', 'problem', 'thermal', 'given', 'rate', 'one-dimensional', 'contribution', 'input', 'method', 'reference', 'temperature']


### Get 3 pre-processing models:
* Preprocessing model 1: cleaning + tokenization + removal of stop words 
* Preprocessing model 2: Model 1 + lemmatization
* Preprocessing model 3: Model 1 + lemmatization + synonym enrichment

In [93]:
docs = parseDocs("cran/cran.all.1400")
model1 = tokenize_and_clean(docs)
model2 = lemmatize(model1)
model3 = synonym_enrichment(model2)

NotImplementedError: 