## Code for extracting WordNet terms

This Python code extracts all the terms present in the WordNet knowledge base and cleans them by removing leading articles, which makes the terms easier to use for classification.<br>
To run the script on stout, use the command **python WordNet_term_Extract.py**


In [None]:
#Importing suitable libraries
from nltk.corpus import wordnet as wn
from itertools import islice
import re
import spacy
import string
import nltk
import en_core_web_sm
nlp = en_core_web_sm.load()

### Cleaning the Terms from WordNet
In the following code snippet we remove the determiners that appear at the beginning of the terms, such as 'a', 'an', and 'the'.<br>Moreover, we define a helper that detects digits in a term, so that terms containing digits can be skipped later.

In [None]:
def removearticles(text):
    #text is the raw term extracted from Wordnet
    doc = nlp(text)
    
    temp = []
    i = 0
    #Skipping determiners that appear at the beginning of the term
    while i<len(doc) and doc[i].pos_ == 'DET':
        
        i+=1
    #Appending the rest of the tokens of the term
    while i!= len(doc):
        temp.append(str(doc[i]))
        i+=1

    
    fin = ' '.join(temp)
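    #Re-attaching punctuation that the tokenizer split off; the raw string avoids an invalid-escape warning
    fin = re.sub(r' ([@.#$/:-]) ?', r'\1', fin)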
    return fin

#Checking if any digit appears in our string
def hasNumbers(inputString):
    return any(char.isdigit() for char in inputString)

In [None]:
#all_words consists of all terms found on WordNet
all_words = list(wn.words())


### Creating list and dictionary of cleaned terms
The following code cleans the terms from WordNet while keeping track of their original form. A dictionary stores each cleaned term as the key, and its value is the term as it originally appeared in WordNet. Terms that contain digits are skipped.

In [None]:
list_of_terms = []
dictOfLemma =  {}
for w in all_words:
    lemmaword = w
    if hasNumbers(w):
        continue
    w=w.replace("_"," ")
    w = removearticles(w)
    dictOfLemma[w] = lemmaword
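
Because the cleaned term is the dictionary key, two WordNet entries that clean to the same string overwrite each other, and only the last original spelling survives. The sketch below illustrates this on made-up entries. `toy_clean` is a hypothetical stand-in for `removearticles`, and collecting the originals in a list is one possible variation, not something the code above does.

```python
from collections import defaultdict


def toy_clean(term):
    """Hypothetical stand-in for removearticles: drop one leading 'the'."""
    words = term.split()
    if words and words[0] == "the":
        words = words[1:]
    return " ".join(words)


# Made-up entries in WordNet's underscore style, for illustration only.
raw_terms = ["the_example", "example", "sample_term"]

# Same shape as dictOfLemma: cleaned term -> one original spelling.
single = {}
# Variation: cleaned term -> every original spelling that maps to it.
grouped = defaultdict(list)

for raw in raw_terms:
    cleaned = toy_clean(raw.replace("_", " "))
    single[cleaned] = raw
    grouped[cleaned].append(raw)

print(single["example"])   # example (the earlier 'the_example' was overwritten)
print(grouped["example"])  # ['the_example', 'example']
```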

In [None]:
#The cleaned part from the WordNet stored in a list
list_of_terms=list(dictOfLemma.keys())
from nltk.corpus import stopwords
nltk.download('stopwords')
#Removing stopwords and punctuation marks in our extracted cleaned terms
#The stopword list is loaded once into a set instead of being re-read for every term
stop_words = set(stopwords.words('english'))
list_of_terms = [word for word in list_of_terms if word not in stop_words]
list_of_terms = [''.join(c for c in s if c not in string.punctuation) for s in list_of_terms]
#Dropping terms that became empty after punctuation removal
list_of_terms = [s for s in list_of_terms if s]
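
The three filters above change the list in different ways, which a small run makes visible. In this sketch `TOY_STOPWORDS` is a hypothetical three-word stand-in for NLTK's English stopword list, and the sample terms are made up.

```python
import string

# Assumption: tiny stand-in for stopwords.words('english').
TOY_STOPWORDS = {"the", "of", "and"}

terms = ["of", "jack-o-lantern", "rock 'n' roll", "...", "apple"]

# 1. Drop terms that are exactly a stopword (multi-word terms are untouched).
terms = [t for t in terms if t not in TOY_STOPWORDS]
# 2. Delete every punctuation character inside each term.
terms = ["".join(c for c in t if c not in string.punctuation) for t in terms]
# 3. Drop terms that became empty after step 2.
terms = [t for t in terms if t]

print(terms)  # ['jackolantern', 'rock n roll', 'apple']
```

Hyphens are deleted rather than replaced by spaces, so a list entry such as `jackolantern` no longer matches its hyphenated key in `dictOfLemma`.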


### Pickling
Since this list is huge and we will need it again later, we serialize it to disk with Python's pickle module. Generating the list every time we need to pre-process is expensive, so pickling it once avoids repeating that work.

In [None]:

import pickle
#Pickling our List of cleaned terms
with open('list_wordnet', 'wb') as fp:
    pickle.dump(list_of_terms, fp)
#Pickling the dictionary
with open('dict_wordnet', 'wb') as fpp:
    pickle.dump(dictOfLemma, fpp)
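
Reading the files back later mirrors the step above. The following is a minimal sketch: the helper name `load_pickle` is hypothetical, and the round trip uses toy data in a temporary directory so that it runs without the real files. `pickle.load` can execute arbitrary code, so it should only be applied to files you created yourself.

```python
import os
import pickle
import tempfile


def load_pickle(path):
    """Load one pickled object from the file at `path`."""
    with open(path, "rb") as fp:
        return pickle.load(fp)


# Round trip on toy data in a temporary directory, so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    list_path = os.path.join(tmp, "list_wordnet")
    with open(list_path, "wb") as fp:
        pickle.dump(["apple", "big apple"], fp)
    restored = load_pickle(list_path)

print(restored)  # ['apple', 'big apple']
```

In a later script, `load_pickle('list_wordnet')` and `load_pickle('dict_wordnet')` would restore the list and the dictionary, assuming both files are in the working directory.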