<h1><center style="color:#2B3698">Merging the different methods for generating matching keywords</center></h1>

Importing the different libraries

Code that was previously written by the appmachine

In [5]:
from nltk.corpus import wordnet as wn
from stemming.porter2 import stem
from nltk.corpus import wordnet_ic
from tqdm import tqdm

## Merging the different methods

In [6]:
categories  = [
    'advice',
    'hygiene',
    'equipment',
    'activities',
    'technology',
    'info',
    'administrative',
    'job',
    'education',
    'home',
    'health',
    'food'
]

# Selecting the right synset for each category 

In [7]:
# For each category, for each synset print its lemma names and the definition
for category in categories:
    print(">>> {}".format(category))
    i = 0
    for synset in wn.synsets(category):
        print(i, "-", synset.lemma_names(), ":", synset.definition())
        i += 1
    print("\n")

>>> advice
0 - ['advice'] : a proposal for an appropriate course of action


>>> hygiene
0 - ['hygiene'] : a condition promoting sanitary practices
1 - ['hygiene', 'hygienics'] : the science concerned with the prevention of illness and maintenance of health


>>> equipment
0 - ['equipment'] : an instrumentality needed for an undertaking or to perform a service


>>> activities
0 - ['activity'] : any specific behavior
1 - ['action', 'activity', 'activeness'] : the state of being active
2 - ['bodily_process', 'body_process', 'bodily_function', 'activity'] : an organic process that takes place in the body
3 - ['activity'] : (chemistry) the capacity of a substance to take part in a chemical reaction
4 - ['natural_process', 'natural_action', 'action', 'activity'] : a process existing in or produced by nature (rather than by the intent of human beings)
5 - ['activeness', 'activity'] : the trait of being active; moving or acting rapidly and energetically


>>> technology
0 - ['technology', 'e

In [8]:
# For each category, we will select only the relevant synsets
d_category_synsets = {"advice": [0], "hygiene": [0], "equipment": [0], "activities": [0], "technology": [0], "info": [0], 
                      "job": [0, 1], "education": [0], "home": [0], "health": [0], "food": [0]}

Hypothesis : the input is a dictionary of the form {"name_category" : {"matching_keywords" : {"Model" : score}}} and the output is a dictionary of the form {"name_category" : {"matching_keywords" : score}}
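
To make the two shapes concrete, here is a small sketch. The keywords, model names and scores are made up for illustration, and `collapse_scores` is a hypothetical helper. It assumes, as one possible choice, that the per-model scores of a keyword are simply summed.

```python
# Hypothetical input: {"name_category": {"matching_keyword": {"model": score}}}
d_input = {
    "food": {
        "meal": {"wordnet": 0.9, "word2vec": 0.7},
        "dish": {"wordnet": 0.6},
    }
}


def collapse_scores(d_input):
    """Sum the per-model scores of each keyword (summing is an assumption)."""
    return {
        category: {
            keyword: sum(scores.values())
            for keyword, scores in keywords.items()
        }
        for category, keywords in d_input.items()
    }


# Output shape: {"name_category": {"matching_keyword": score}}
d_output = collapse_scores(d_input)
print(d_output)
```

Another reduction (mean, weighted sum) would keep the same input and output shapes.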

Question : should we work with stems or with all words ?
Hypothesis : we keep only stems in our inputs and generate the lemmas afterwards.

### TO DO : create a function generate_input(models, name_models, category) and return an input

##### TO DO : generate a function that takes as input a word and returns matching keywords and scores using wordnet

Remark : maybe code the models as classes

Here we compute 4 different scores for wordnet (http://www.nltk.org/howto/wordnet.html):
<li> path_similarity : returns a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1.
<li> lch_similarity : returns a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p/2d) where p is the shortest path length and d the taxonomy depth. (Here is a website that simulates the path, to better understand the taxonomy : http://ws4jdemo.appspot.com/?mode=w&s1=&w1=cat&s2=&w2=dog)
<li> wup_similarity : the Wu & Palmer measure (wup) calculates similarity by considering the depths of the two concepts in wordnet, along with the depth of the LCS (least common subsumer, i.e. the most specific common ancestor in the taxonomy). The formula is score = 2*depth(lcs) / (depth(s1) + depth(s2)). This means that $0 <$ score $\leq 1$. The score can never be zero because the depth of the LCS is never zero (the depth of the root of a taxonomy is one). The score is one if the two input concepts are the same.
<li> jcn_similarity : the Jiang-Conrath similarity returns a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)). (Information Content (IC) is a measure of specificity for a concept. Higher values are associated with more specific concepts (e.g., pitchfork), while lower values are associated with more general ones (e.g., idea). Information Content is computed from frequency counts of concepts found in a corpus of text.)
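
The three formulas quoted above can be checked by hand with a few lines of plain Python. This is only a sketch: the function names are hypothetical helpers, the numbers passed in are made up, and nothing here queries wordnet.

```python
import math


def lch_score(p, d):
    """-log(p / 2d): p is the shortest path length, d the taxonomy depth."""
    return -math.log(p / (2.0 * d))


def wup_score(depth_lcs, depth_s1, depth_s2):
    """2 * depth(lcs) / (depth(s1) + depth(s2))."""
    return 2.0 * depth_lcs / (depth_s1 + depth_s2)


def jcn_score(ic_s1, ic_s2, ic_lcs):
    """1 / (IC(s1) + IC(s2) - 2 * IC(lcs)); undefined when the difference is 0."""
    return 1.0 / (ic_s1 + ic_s2 - 2.0 * ic_lcs)


# Made-up depths, path lengths and information contents, for illustration only
print("lch :", lch_score(p=2, d=19))
print("wup :", wup_score(depth_lcs=4, depth_s1=6, depth_s2=6))
print("jcn :", jcn_score(ic_s1=8.0, ic_s2=7.0, ic_lcs=5.0))
```

Note that lch and jcn are not bounded by 1, which is why they have to be rescaled before being added to path and wup.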

In [9]:
def score_wordnet(word, matching_keyword) : 
    """
    Function that generates a score for the wordnet outputs
    
    Parameters
    ----------
    word             : string we want to compute the similarity to.
    matching_keyword : string for computing the similarity.
    
    Returns
    -------
    score : The similarity score (float), sum of the similarities that could be computed.
    """
    
    # We will normalise the scores for each similarity
    min_max_values = {'path' : [0, 1], 'lch' : [0, 3.6375861597263857 ], 'wup' : [0,1], 'jcn' : [0, 10000]}
    # Reference for range : (lch) https://stackoverflow.com/questions/20112828/maximum-score-in-wordnet-based-similarity
    # (jcn) https://stackoverflow.com/questions/35751207/how-to-normalize-similarity-measurements-lch-wup-path-res-lin-jcn-between
    
    # For jcn we will take 10 000 as the maximum (quite arbitrary but seemed relevant)
    
    # Only the first synset of each word is compared
    word = wn.synsets(word)[0]
    matching_keyword = wn.synsets(matching_keyword)[0]
    score = 0
    try :
        # path_similarity may return None (no path found), the addition then fails and the term is skipped
        score += word.path_similarity(matching_keyword)
    except Exception :
        pass
    try : 
        # lch_similarity raises an error when the two synsets do not have the same part of speech
        score += (word.lch_similarity(matching_keyword) /  min_max_values['lch'][1])
    except Exception :
        pass
    try :
        score += word.wup_similarity(matching_keyword)
    except Exception :
        pass
    try : 
        brown_ic = wordnet_ic.ic('ic-brown.dat')
        # NOTE : brown_ic is loaded but not passed to jcn_similarity, so this call raises a TypeError
        # and the jcn term is always skipped : the score is at most 3 (path + lch + wup).
        score += (word.jcn_similarity(matching_keyword) / min_max_values['jcn'][1])
    except Exception :
        pass
    
    return score

This model does not give good results. Maybe we should add a control variable that checks the score (for instance a synonym of the word or the real corresponding synset).

In [10]:
def generate_words_wordnet(word, n, depth = 3, synsets_indices = d_category_synsets) :
    """
    Function that generates matching keywords given a word.
    
    Parameters
    ----------
    word            : word to compute similarities to (can be a category word).
    n               : number of words we take.
    depth           : number of layers we use when generating matching keywords.
    synsets_indices : dictionary mapping a word to a list of indices
                      specifying its relevant synsets in wordnet
                      Ex: {"food": [0]}
    
    Returns
    -------
    d : {similar_word: {"wordnet": wordnet_score}} where similar_word is a stem
    """
    
    name_model = "wordnet"
    d = {}
    
    # First iteration (select only relevant synsets as specified in d_category_synsets)
    # All the synsets of word
    synsets = wn.synsets(word)    
    # Only relevant synsets of word
    if word in synsets_indices:
        synsets = [synsets[i] for i in range(len(synsets)) if i in synsets_indices[word]]
    for synset in tqdm(synsets):
        for lemma in synset.lemma_names():
            d[stem(lemma)] = {name_model : score_wordnet(word, lemma)}
    
    # Other iterations
    
    for i in tqdm(range(depth)) :
        dic = d.copy()
        for origin in dic.keys():
            for synset in wn.synsets(origin):
                for lemma in synset.lemma_names():
                    d[stem(lemma)] = {name_model : score_wordnet(word, lemma)}
    
    if n > len(d) :
        return d
    
    d_max_values = {k: d[k] for k in sorted(d, key = lambda k: d[k]["wordnet"], reverse = True)[:n]}
    
    return d_max_values
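
The layer-by-layer expansion and the final top-n selection can be tried without wordnet on a toy graph. This is a sketch under stated assumptions: `toy_synonyms`, `expand_keywords` and `top_n` are hypothetical names, and the graph is made up.

```python
# Toy synonym graph standing in for wordnet (made-up data)
toy_synonyms = {
    "education": ["teaching", "instruction"],
    "teaching": ["instruction", "pedagogy"],
    "instruction": ["direction"],
    "pedagogy": [],
    "direction": ["guidance"],
    "guidance": [],
}


def expand_keywords(word, depth, synonyms=toy_synonyms):
    """Collect the words reachable from `word` in at most depth + 1 hops."""
    found = set(synonyms.get(word, []))       # first iteration
    for _ in range(depth):                    # other iterations
        for origin in list(found):            # copy: the set grows while we loop
            found.update(synonyms.get(origin, []))
    return found


def top_n(scores, n):
    """Keep the n best-scoring entries of {keyword: score}."""
    best = sorted(scores, key=scores.get, reverse=True)[:n]
    return {keyword: scores[keyword] for keyword in best}


print(expand_keywords("education", depth=1))
print(top_n({"a": 0.2, "b": 0.9, "c": 0.5}, 2))
```

Each extra layer re-expands every keyword already found, so the number of lookups (and of score computations in the wordnet version) grows quickly with `depth`.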

In [11]:
d = generate_words_wordnet("education", 5)

100%|██████████| 1/1 [00:16<00:00, 16.14s/it]
100%|██████████| 3/3 [05:28<00:00, 96.80s/it]


In [12]:
d

{'conduct': {'wordnet': 1.8884593473203033},
 'didact': {'wordnet': 3.0},
 'educ': {'wordnet': 3.0},
 'educational_act': {'wordnet': 3.0},
 'tri': {'wordnet': 1.8884593473203033}}

##### TO DO : generate a function that takes as input a word and returns matching keywords and scores using word2vec

In [6]:
import gensim
import os

In [7]:
path = r"C:\Users\Nasser Benab\Documents\git\data"

In [8]:
def generate_words_word2vec(word, n, path = path, model_name = "text8-vector.bin"):
    """ 
    Most similar words to word and their scores.
    
    Parameters
    ----------
    word: word to compute similarities to (can be a category word)
    n: number of similar words to get
    path: directory containing the pretrained word2vec model
    model_name: file name of the pretrained word2vec model

    Returns
    -------
    d: {word: {similar_word: {"word2vec": word2vec_score}}}
    """
    
    d = {}
    # Load the pre-trained Word2Vec model (binary format)
    model = gensim.models.KeyedVectors.load_word2vec_format(os.path.join(path, model_name), binary=True) 
    d[word] = {key: {"word2vec": value} for (key, value) in model.most_similar(word, topn = n)}
    return d
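
`most_similar` ranks the vocabulary by cosine similarity to the query vector. The sketch below reproduces that ranking in plain Python on made-up two-dimensional vectors; `toy_vectors`, `cosine` and `most_similar_toy` are hypothetical names, not part of gensim.

```python
import math

# Made-up 2-dimensional "embeddings", for illustration only
toy_vectors = {
    "food": [1.0, 0.0],
    "meal": [0.9, 0.1],
    "car": [0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_toy(word, topn, vectors=toy_vectors):
    """Return the topn (other_word, cosine) pairs, best first."""
    scores = [
        (other, cosine(vectors[word], vector))
        for other, vector in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


print(most_similar_toy("food", 2))
```

Cosine scores lie between -1 and 1, so they are on a different scale from the summed wordnet score and need rescaling before the two are combined.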

In [9]:
generate_words_word2vec("food", 40)

NotImplementedError: unknown URI scheme 'c' in 'C:\\Users\\Nasser Benab\\Documents\\git\\data/text8-vector.bin'

##### TO DO : generate a function that takes as input a word and returns matching keywords and scores using spacy

### TO DO : generate a function that outputs scores and list of matching keywords

First we create a function that normalizes the scores of a model.

There may be a simpler way to do it (maybe when generating the scores).

In [None]:
def normalize_scores(d_scores):
    """
    The function normalizes the scores of each model (between 0 and 1).
    
    Parameters
    ----------
    d_scores : dictionary of the form {matching_keywords : {model : score}}.
    
    Returns
    -------
    d_normalized : same dictionary with normalized scores.
    """
    
    # Getting the models used (hypothesis : same models used for all keywords)
    models = list(d_scores.values())[0].keys()
    
    # We keep the maximum and minimum for each model
    max_dic = {}
    min_dic = {}
    
    for model in models :
        temp = [dic[model] for dic in d_scores.values()]
        max_dic[model] = max(temp)
        min_dic[model] = min(temp)
        
    d_normalized = {}
    
    for key in d_scores.keys() :
        d_normalized[key] = {}
        for (key_bis, value) in d_scores[key].items() :
            score_range = max_dic[key_bis] - min_dic[key_bis]
            # If all the scores of a model are equal the range is 0 : we return 0 instead of dividing by zero
            d_normalized[key][key_bis] = float(value - min_dic[key_bis]) / score_range if score_range else 0.0
    
    return d_normalized

### TO DO : finish the function.

We need to find a way to match the different outputs from the models

In [None]:
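def mixing_model(models, word, n):
    """
    Draft : merges the normalized scores given by several models for a word.
    
    Returns d_score ({keyword : sum of the scores}) and d_count ({keyword : number of models}).
    """
    # TO DO : Check the words output for each model and how to match them
    # (generate_words_word2vec nests its result under word, generate_words_wordnet does not)
    d = {}
    for model in models :
        # Merge per keyword so that a keyword found by several models keeps all its scores
        for key, scores in normalize_scores(model(word, n)).items() :
            d.setdefault(key, {}).update(scores)
    
    d_count = {}
    d_score = {}
    # counting occurrence (could also use counter)
    for key in d.keys() :
        d_count[key] = len(d[key])
        d_score[key] = sum(d[key].values())
        
    return d_score, d_count    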

TO DO : generate lemmas from the obtained dictionary.

In [12]:
def score_categories(list_categories, models, n) :
    # Final function : TO DO : complete
    # Named score_categories so that it does not shadow the categories list defined above
    d_output_score = {}
    d_output_count = {}
    for word in list_categories :
        d_output_score[word], d_output_count[word] = mixing_model(models, word, n)
        
    return d_output_score, d_output_count

# Sources

http://ftp.cs.toronto.edu/pub/gh/Budanitsky+Hirst-2001.pdf

# Documentation (error)

import error : https://stackoverflow.com/questions/15526996/ipython-notebook-locale-error