<h1><center style="color:#2B3698">Merging the different methods for generating matching keywords</center></h1>

## Table Of Contents :
1. [Synset selection for each category](#syn)
2. [Wordnet Scores](#score-wordnet)
3. [Vocabulary Generator](#vocab)
4. [Putting All Together](#all)
5. [Next Steps](#next)

#### Importing the different libraries

In [1]:
# Import libraries 
from nltk.corpus import wordnet as wn
from stemming.porter2 import stem
from nltk.corpus import wordnet_ic
from tqdm import tqdm
from time import time
import gensim
import os



# *Vocabulary creation* 

The process that we will use in order to generate the different words will be the following :
<ol> 
<li> Define different models that can generate n words with a score (wordnet, word2vec).
   <br>The output of this generators will be of the form {word : {"model" : score}}.
<li> Given the outputs of different models, merge them into one dictionnary and define 
   the weights for each model.
<li> Loop through all the categories and generate all the words.

Defining global variables :

In [2]:
# Path of the pretrained word2vec 
PATH = r"C:\Users\Nasser Benab\Documents\git\data"

# Categories used in weanswer
CATEGORIES = [
    'advice',
    'hygiene',
    'equipment',
    'activities',
    'technology',
    'info',
    'administrative',
    'job',
    'education',
    'home',
    'health',
    'food'
]

## 1. Synset selection for each category <a id = "syn">

The goal here is to choose the right synset for each category.

In [3]:
# For each category, for each synset print its lemma names and the definition
for category in CATEGORIES:
    print(">>> {}".format(category))
    i = 0
    for synset in wn.synsets(category):
        print(i, "-", synset.lemma_names(), ":", synset.definition())
        i += 1
    print("\n")

>>> advice
0 - ['advice'] : a proposal for an appropriate course of action


>>> hygiene
0 - ['hygiene'] : a condition promoting sanitary practices
1 - ['hygiene', 'hygienics'] : the science concerned with the prevention of illness and maintenance of health


>>> equipment
0 - ['equipment'] : an instrumentality needed for an undertaking or to perform a service


>>> activities
0 - ['activity'] : any specific behavior
1 - ['action', 'activity', 'activeness'] : the state of being active
2 - ['bodily_process', 'body_process', 'bodily_function', 'activity'] : an organic process that takes place in the body
3 - ['activity'] : (chemistry) the capacity of a substance to take part in a chemical reaction
4 - ['natural_process', 'natural_action', 'action', 'activity'] : a process existing in or produced by nature (rather than by the intent of human beings)
5 - ['activeness', 'activity'] : the trait of being active; moving or acting rapidly and energetically


>>> technology
0 - ['technology', 'e

In [15]:
# For each category, we will select only the relevant synsets (the numbers represent the indices of the right synsets)
d_category_synsets = {"advice": [0], "hygiene": [0], "equipment": [0], "activities": [0], "technology": [0], "info": [0], 
                      "administrative": [0], "job": [0, 1], "education": [0], "home": [0], "health": [0], "food": [0,1]}

## 2. Wordnet scores   <a id="score-wordnet"> 

Here we compute 4 different scores for wordnet (http://www.nltk.org/howto/wordnet.html):
<li> path_similarity score :  Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hypnoym) taxonomy. The score is in the range 0 to 1.
<li> lch_similarity : Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p/2d) where p is the shortest path length and d the taxonomy depth. (Here is a website that simulates the path to better understand the taxonomy : http://ws4jdemo.appspot.com/?mode=w&s1=&w1=cat&s2=&w2=dog.
<li> wup_similarity : The Wu & Palmer measure (wup) calculates similarity by considering the depths of the two concepts in wordnet, along with the depth of the LCS (least common ancestor in the taxonomy) The formula is score = 2*depth(lcs) / (depth(s1) + depth(s2)). This means that 0 $<$ score $<=$ 1. The score can never be zero because the depth of the LCS is never zero (the depth of the root of a taxonomy is one). The score is one if the two input concepts are the same.
<li> jcn_similarity : Jiang-Conrath Similarity Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)). (Information Content (IC) is a measure of specificity for a concept. Higher values are associated with more specific concepts (e.g., pitch fork), while those with lower values are more general (e.g., idea). In- formation Content is computed based on frequency counts of concepts as found in a corpus of text).

Also, this article http://ftp.cs.toronto.edu/pub/gh/Budanitsky+Hirst-2001.pdf presents the different score computation and compares them to human judgment of semantic similarity. The similarity that matched the most the human judgment was jcn_similarity.

In [5]:
def score_wordnet(word, matching_keyword) : 
    """
    Function that generates a score for the wordnet outputs
    
    Parameters
    ----------
    word              : string we want to compute the similarity to.
    matching_keywords : string for computring the similarity.
    
    Returns
    -------
    score : The similarity score (float).
    """
    
    # We will nomalise the scores for each similarity
    min_max_values = {'path' : [0, 1], 'lch' : [0, 3.6375861597263857 ], 'wup' : [0,1], 'jcn' : [0, 10000]}
    # Reference for range : (lch) https://stackoverflow.com/questions/20112828/maximum-score-in-wordnet-based-similarity
    # (jcn) https://stackoverflow.com/questions/35751207/how-to-normalize-similarity-measurements-lch-wup-path-res-lin-jcn-between
    
    # For jcn we will take 10 000 as the maximum (quite arbitrary but seemed relevant)
    
    word = wn.synsets(word)[0]
    matching_keyword = wn.synsets(matching_keyword)[0]
    score = 0
    try :
        score += word.path_similarity(matching_keyword)
    except :
        pass
    try : 
        score += (word.lch_similarity(matching_keyword) /  min_max_values['lch'][1])
    except :
        pass
    try :
        score += word.wup_similarity(matching_keyword)
    except :
        pass
    try : 
        brown_ic = wordnet_ic.ic('ic-brown.dat')
        score += (word.jcn_similarity(matching_keyword) / min_max_values['jcn'][1])
    except :
        pass
    
    return score

## 3. Vocabulary generator  <a id="vocab">

Here we create a class that is able to generate different words according to different models.

We will keep only stems in our inputs and then generate the lems afterward.

In [6]:
class VocabularyGenerator():
    
    def __init__(self):
        pass

    def generate_words_wordnet(self, word, n, depth = 4, synsets_indices = d_category_synsets) :
        """
        Function that generates matching keywords given a word.

        Parameters
        ----------
        word    : word to compute similarities to (can be a category word).
        depth   : number of layers we use when generating maching keywords.
        n       : number of words we take.
        synsets : dictionary mapping a word to the a list of indices
                 specifying its relevant synsets in wordnet
                 Ex: {"food": [0]}

        Returns
        -------
        d : {similar_word: {wordnet: wordnet_score}
        """

        name_model = "wordnet"
        d = {}

        # First iteration (select only relevant synsets as specified in d_category_synsets)
        # All the synsets of word
        synsets = wn.synsets(word)    
        # Only relevant synsets of word
        if word in CATEGORIES:
            synsets = [synsets[i] for i in range(len(synsets)) if i in synsets_indices[word]]
        for synset in tqdm(synsets):
            for lemma in synset.lemma_names():
                d[stem(lemma)] = {name_model : score_wordnet(word, lemma)}

        # Other iterations

        for i in tqdm(range(depth)) :
            dic = d.copy()
            for origin in dic.keys():
                for synset in wn.synsets(origin):
                    for lemma in synset.lemma_names():
                        d[stem(lemma)] = {name_model : score_wordnet(word, lemma)}

        if n > len(d) :
            return d

        d_max_values = {k: d[k] for k in sorted(d, key = lambda k: d[k]["wordnet"], reverse = True)[:n]}

        return d_max_values

    def generate_words_word2vec(self, word, n, path = PATH, model_name = "text8-vector.bin"):
        """ 
        Most similar words to word and their scores.

        Parameters
        ----------
        word       : word to compute similarities to (can be a category word)
        n          : number of similar words to get
        path       : path of the pretrained word2vec model
        model_name : name of the pretrained word2vec model

        Returns
        -------
        d : {similar_word: {word2vec: word2vec_score}
        """
        
        # Load Google's pre-trained Word2Vec model
        model = gensim.models.KeyedVectors.load_word2vec_format(os.path.join(path, "text8-vector.bin"), binary=True) 
        d = {key: {"word2vec": value} for (key, value) in model.most_similar(word, topn = n)}
        return d
    
    def generate_word_spacy(self):
        pass


In [8]:
# Examples
generator = VocabularyGenerator()
d_wordnet = generator.generate_words_wordnet("food", 40)
d_word2vec = generator.generate_words_word2vec("food", 40)

100%|██████████| 2/2 [00:13<00:00,  8.98s/it]
100%|██████████| 4/4 [05:32<00:00, 94.96s/it]


In [9]:
d_wordnet

{'aliment': {'wordnet': 2.218539496664136},
 'back': {'wordnet': 0.9169821263560435},
 'backup': {'wordnet': 0.9169821263560435},
 'baffl': {'wordnet': 0.7746942172313404},
 'beat': {'wordnet': 0.8404107581326701},
 'begin': {'wordnet': 0.7746942172313404},
 'brook': {'wordnet': 0.9169821263560435},
 'cargo_deck': {'wordnet': 0.7174237954982008},
 'cargo_hold': {'wordnet': 0.7174237954982008},
 'come': {'wordnet': 1.107911697275812},
 'comest': {'wordnet': 2.218539496664136},
 'conserv': {'wordnet': 1.3407650777506728},
 'curb': {'wordnet': 0.7174237954982008},
 'donjon': {'wordnet': 0.7174237954982008},
 'draw': {'wordnet': 0.7746942172313404},
 'dungeon': {'wordnet': 0.7174237954982008},
 'eatabl': {'wordnet': 2.218539496664136},
 'edibl': {'wordnet': 2.218539496664136},
 'food': {'wordnet': 3.0},
 'gravel': {'wordnet': 1.107911697275812},
 'guard': {'wordnet': 0.8404107581326701},
 'handgrip': {'wordnet': 0.9169821263560435},
 'handl': {'wordnet': 0.9169821263560435},
 'harbor': {'w

In [10]:
d_word2vec

{'bananas': {'word2vec': 0.6068102717399597},
 'beef': {'word2vec': 0.6227272748947144},
 'beverages': {'word2vec': 0.6333991289138794},
 'canned': {'word2vec': 0.6140711903572083},
 'cashew': {'word2vec': 0.6168662905693054},
 'chemicals': {'word2vec': 0.5928491950035095},
 'citrus': {'word2vec': 0.5871732831001282},
 'cocoa': {'word2vec': 0.6041184663772583},
 'cooking': {'word2vec': 0.6009650826454163},
 'dairy': {'word2vec': 0.6460014581680298},
 'fermented': {'word2vec': 0.5760336518287659},
 'fertilizers': {'word2vec': 0.5925315618515015},
 'fish': {'word2vec': 0.5927667617797852},
 'foods': {'word2vec': 0.7001853585243225},
 'foodstuffs': {'word2vec': 0.6066914200782776},
 'fruit': {'word2vec': 0.5887770652770996},
 'fruits': {'word2vec': 0.6141229271888733},
 'grain': {'word2vec': 0.5781475305557251},
 'ingredients': {'word2vec': 0.5744912624359131},
 'liquor': {'word2vec': 0.5889161825180054},
 'livestock': {'word2vec': 0.6447887420654297},
 'maize': {'word2vec': 0.60753554105

## 4. Putting it all together <a id="all">

First we create a function that normalizes the scores of a model.

In [7]:
def normalize_scores(d_scores):
    """"
    The function normalize the scores (between 0 and 1).
    
    Parameters
    ----------
    d_scores : dictionnary of the form {matching_keywords : {model : score}}.
    
    Returns
    -------
    d_normalized : same dictionnary with normalized scores.
    """
    
    # Get the model name
    modelname = list(list(d_scores.values())[0].keys())[0]
        
    scores = [d_scores[k][modelname] for k in d_scores.keys()]
    M = max(scores)
    m = min(scores)
    
    # Normalize the scores
    # If there is only one word, thus one score in the generated vocabulary
    if m == M:
        d_normalized = {k: {modelname: d_scores[k][modelname] / M} for k in d_scores.keys()}
    else:
        d_normalized = {k: {modelname: (d_scores[k][modelname] - m) / (M - m)} for k in d_scores.keys()}
    
    return d_normalized

Merging all the result obtained :

In [8]:
def mixing_model(models, word, n, generator = VocabularyGenerator()):
    """
    Function that takes as input a list of models and a word and outputs
    a normalized dictionary that gathers all the scores of the different
    models.
    
    Parameters
    ----------
    models    : list of models (names).
    word      : the word in question.
    n         : input to the generators (number of words).
    generator : default value, our created classe.
    
    Returns
    ------- 
    d_score, d_count : output dictionnary with the generate words 
                       (scores and counts).
    
    """
    
    d = {}
    for modelname in models :
        d_scores = getattr(generator, "generate_words_{}".format(modelname))(word, n)
        d.update(normalize_scores(d_scores))
        
    d_count = {}
    d_score = {}
    # counting occurrence (could also use counter)
    for key in d.keys() :
        d_count[key] = len(d[key])
        d_score[key] = sum(d[key].values())
        
    return d_score, d_count    

In [13]:
mixing_model(["word2vec", "wordnet"], "food", n = 40)

100%|██████████| 2/2 [00:01<00:00,  1.27it/s]
100%|██████████| 4/4 [05:27<00:00, 93.28s/it]


({'aliment': 0.6576410015163425,
  'back': 0.08742679892319248,
  'backup': 0.08742679892319248,
  'baffl': 0.025090256185177198,
  'bananas': 0.2628202745200088,
  'beat': 0.05388076962859285,
  'beef': 0.3884821873510358,
  'begin': 0.025090256185177198,
  'beverages': 0.4727345793662483,
  'brook': 0.08742679892319248,
  'canned': 0.3201439373655058,
  'cargo_deck': 0.0,
  'cargo_hold': 0.0,
  'cashew': 0.34221075709228455,
  'chemicals': 0.15260005336242072,
  'citrus': 0.10778973700420358,
  'cocoa': 0.2415689492826425,
  'come': 0.17107332539762463,
  'comest': 0.6576410015163425,
  'conserv': 0.2730867346391723,
  'cooking': 0.21667354480349313,
  'curb': 0.0,
  'dairy': 0.5722277304830711,
  'donjon': 0.0,
  'draw': 0.025090256185177198,
  'dungeon': 0.0,
  'eatabl': 0.6576410015163425,
  'edibl': 0.6576410015163425,
  'fermented': 0.019844326666481577,
  'fertilizers': 0.15009239604327532,
  'fish': 0.15194925770237605,
  'food': 1.0,
  'foods': 1.0,
  'foodstuffs': 0.26188196

Function that generates the final output :

In [9]:
def categories_generate(models, n, categories = CATEGORIES, d = ({}, {})) :
    """
    Function that generate the final output.
    
    Parameters
    ----------
    models     : list of models (names).
    n          : input to the generators (number of words).
    categories : list of categories (default value, the global variable
                 CATEGORIES).
    dics        : already generated dictionaries to complete 
                 (d_output_score, d_output_count)
                 
    Returns
    -------
    d_output_score, d_output_count : output dictionnary with the generate words 
                       (scores and counts) for each category.
     
    """
    
    d_output_score = d[0]
    d_output_count = d[1]
    for word in tqdm(categories):
        if word in d_output_score.keys:
            continue
        print(">>> Generating vocabulary for {}".format(word))
        d_output_score[word], d_output_count[word] = mixing_model(models, word, n)
        # Select only the top n similar words
        d_output_score[word] = {k: d_output_score[word][k] for k in sorted(d_output_score[word], 
                                key = lambda k: d_output_score[word][k], reverse = True)[:n]}
        d_output_count[word] = {k: d_output_count[word][k] for k in d_output_score[word].keys()}

        
    return d_output_score, d_output_count

In [10]:
d = categories_generate(["word2vec", "wordnet"], 40)

  0%|          | 0/12 [00:00<?, ?it/s]

>>> Generating vocabulary for advice



  0%|          | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:11<00:00, 11.83s/it]

  0%|          | 0/4 [00:00<?, ?it/s]
  8%|▊         | 1/12 [00:12<02:21, 12.89s/it]]

>>> Generating vocabulary for hygiene



  0%|          | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00,  1.74it/s]

  0%|          | 0/4 [00:00<?, ?it/s]
 17%|█▋        | 2/12 [00:14<01:35,  9.59s/it]]

>>> Generating vocabulary for equipment



  0%|          | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00,  1.89it/s]

  0%|          | 0/4 [00:00<?, ?it/s]
 25%|██▌       | 1/4 [00:02<00:06,  2.17s/it]
 50%|█████     | 2/4 [00:35<00:22, 11.42s/it]
 75%|███████▌  | 3/4 [07:01<02:04, 124.04s/it]
100%|██████████| 4/4 [56:27<00:00, 976.49s/it]
 25%|██▌       | 3/12 [56:44<2:33:31, 1023.48s/it]

>>> Generating vocabulary for activities



  0%|          | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00,  2.42it/s]

  0%|          | 0/4 [00:00<?, ?it/s]
 33%|███▎      | 4/12 [56:45<1:35:34, 716.86s/it] 

>>> Generating vocabulary for technology



  0%|          | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00,  1.09it/s]

  0%|          | 0/4 [00:00<?, ?it/s]
 42%|████▏     | 5/12 [56:47<58:36, 502.39s/it]  

>>> Generating vocabulary for info



  0%|          | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00,  1.17it/s]

  0%|          | 0/4 [00:00<?, ?it/s]
 25%|██▌       | 1/4 [00:02<00:06,  2.10s/it]
 50%|█████     | 2/4 [00:04<00:04,  2.07s/it]
 75%|███████▌  | 3/4 [00:06<00:02,  2.06s/it]
100%|██████████| 4/4 [00:08<00:00,  2.05s/it]
 50%|█████     | 6/12 [56:58<35:29, 354.97s/it]

>>> Generating vocabulary for administrative


KeyError: 'administrative'

## 5. Next Steps <a id="next">

**The next steps for the project would be : **
<ul>
<li> Generate the lemmas from the stems for the output. 
<li> Creating other models *(spacy, word2vec with other training set, mixing word2vec with wordnet with scores as the number of occurence of each word, learn a model from the users (for instance if given a same keyword from different questions, the user answer with food related ressources 2 times in a row, then we may attribute this keyword to food and with a score of 2)* ....)
<li> Defining a relevant threshold in order to extract the most relevant words.
<li> Find a way to combine smartly the output of the different models for instance if model 1 generates fruit and model 2 generates fruit, it should be counted as 1 word with 2 occurences.
<li> Finally, may be we will need manual data cleaning.


##### Documentation (error)

import error : https://stackoverflow.com/questions/15526996/ipython-notebook-locale-error