<h1><center style="color:#2B3698">Merging the different methods for generating matching keywords</center></h1>

Importing the different libraries

Code that was previously written by the appmachine

## Merging the different methods

In [1]:
categories  = [
    'advice',
    'hygiene',
    'equipment',
    'activities',
    'technology',
    'info',
    'administrative',
    'job',
    'education',
    'home',
    'health',
    'food'
]

Hypothesis: the input is assumed to be a dictionary of the form {"name_category" : {"matching_keywords" : {"Model" : score}}}, and the output a dictionary of the form {"name_category" : {"matching_keywords" : score}}.

Question: should we work with stems or with all word forms?
Hypothesis: we keep only stems in our inputs and generate the lemmas afterward.

### TO DO : create a function generate_input(models, name_models, category) that returns an input
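One possible shape for `generate_input`, given as a minimal sketch rather than the final implementation. It assumes that each entry of `models` is a callable taking the category word and returning `{matching_keyword: score}`, and that `name_models` holds the corresponding model names. Both conventions are assumptions, and the two toy models and their scores are hypothetical stand-ins used only to exercise the function.

```python
def generate_input(models, name_models, category):
    """Merge the outputs of several models for one category.

    Assumption: each model is a callable `model(category)` returning
    {matching_keyword: score}; this calling convention is hypothetical.
    """
    merged = {}
    for model, name in zip(models, name_models):
        for keyword, score in model(category).items():
            # Group the scores of all models under the same keyword
            merged.setdefault(keyword, {})[name] = score
    return {category: merged}


# Toy stand-ins for the real generators (hypothetical keywords and scores)
def toy_wordnet(word):
    return {"meal": 2.0, "nutrient": 1.5}


def toy_word2vec(word):
    return {"meal": 0.6, "dairy": 0.65}


result = generate_input([toy_wordnet, toy_word2vec], ["wordnet", "word2vec"], "food")
print(result)
```

A keyword found by only one model carries a single entry, so any downstream step has to tolerate missing models or fill them in.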

##### TO DO : generate a function that takes as input a word and returns matching keywords and scores using wordnet

In [77]:
from nltk.corpus import wordnet as wn
from stemming.porter2 import stem
from nltk.corpus import wordnet_ic
from tqdm import tqdm

In [78]:
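def score_wordnet(word, matching_keyword):
    """
    Function that generates a score for the wordnet outputs.

    Parameters
    ----------
    word             : string we want to compute the similarity to.
    matching_keyword : string whose similarity to `word` is computed.

    Returns
    -------
    score : The similarity score (float), summed over the measures below.
    """

    synsets_word = wn.synsets(word)
    synsets_keyword = wn.synsets(matching_keyword)
    # A word unknown to WordNet cannot be compared: score it 0
    if not synsets_word or not synsets_keyword:
        return 0

    # Only the first synset of each word is compared
    word = synsets_word[0]
    matching_keyword = synsets_keyword[0]
    score = 0
    try:
        score += word.path_similarity(matching_keyword)
        score += word.lch_similarity(matching_keyword)
        score += word.wup_similarity(matching_keyword)
    except Exception:
        # A measure may return None or raise (e.g. lch_similarity across
        # parts of speech); the measures summed so far are kept
        pass
    # TO DO: add another argument to the function (jcn_similarity needs 3 arguments)
    # score += word.jcn_similarity(matching_keyword)

    return score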

In [79]:
def generate_words_wordnet(word, n = 3):
    """
    Function that generates matching keywords given a word.

    Parameters
    ----------
    word : word to compute similarities to (can be a category word).
    n    : number of expansion layers applied after the first iteration.

    Returns
    -------
    d : {stem_of_similar_word: {"wordnet": wordnet_score}}
    """

    name_model = "wordnet"
    d = {}

    # First iteration: lemmas of the synsets of the word itself

    for synset in wn.synsets(word):
        for lemma in synset.lemma_names():
            d[stem(lemma)] = {name_model : score_wordnet(word, lemma)}

    # Other iterations: expand from the keywords found so far

    for i in range(n):
        # Iterate over a copy of the keys, since d grows inside the loop
        for origin in list(d.keys()):
            for synset in wn.synsets(origin):
                for lemma in synset.lemma_names():
                    d[stem(lemma)] = {name_model : score_wordnet(word, lemma)}

    return d

Rule of thumb for the correspondence between n (the number of layers) and the number of words generated. In the output below, key k corresponds to a call with n = k - 1.

In [83]:
corresp = {}

for n in tqdm(range(7)) :
    corresp[n + 1] = len(generate_words_wordnet("food", n = n))
    
print(corresp)

100%|██████████| 7/7 [00:10<00:00,  2.96s/it]

{1: 5, 2: 10, 3: 16, 4: 34, 5: 253, 6: 1393, 7: 5447}
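The measured counts can be reused as a lookup, as in the sketch below: given a target number of keywords, it returns the smallest key of `corresp` whose count reaches the target. The helper name `layers_for` is hypothetical, and the table was measured for the word "food" only, so other words may grow differently.

```python
# Counts measured above for the word "food"
corresp = {1: 5, 2: 10, 3: 16, 4: 34, 5: 253, 6: 1393, 7: 5447}


def layers_for(target, table):
    """Return the smallest key whose word count reaches `target`, or None."""
    for layers in sorted(table):
        if table[layers] >= target:
            return layers
    return None


print(layers_for(30, corresp))     # 4
print(layers_for(10000, corresp))  # None
```

Because the keys were stored as `n + 1`, the value to pass as `n` to `generate_words_wordnet` is the returned key minus one.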





##### TO DO : generate a function that takes as input a word and returns matching keywords and scores using word2vec

In [6]:
import gensim
import os



In [7]:
path = r"C:\Users\Nasser Benab\Documents\git\data"

In [13]:
def generate_words_word2vec(word, n, path, model_name = "text8-vector.bin"):
    """
    Most similar words to word and their scores.

    Parameters
    ----------
    word: word to compute similarities to (can be a category word)
    n: number of similar words to get
    path: directory containing the pretrained word2vec model
    model_name: file name of the pretrained word2vec model

    Returns
    -------
    d: {word: {similar_word: {"word2vec": word2vec_score}}}
    """

    d = {}
    # Load the pre-trained word2vec model (binary format) named by model_name
    model = gensim.models.KeyedVectors.load_word2vec_format(os.path.join(path, model_name), binary=True)
    d[word] = {key: {"word2vec": value} for (key, value) in model.most_similar(word, topn = n)}
    return d

In [18]:
generate_words_word2vec("food", 40, path)

{'food': {'bananas': {'word2vec': 0.6068102717399597},
  'beef': {'word2vec': 0.6227272748947144},
  'beverages': {'word2vec': 0.6333991289138794},
  'canned': {'word2vec': 0.6140711903572083},
  'cashew': {'word2vec': 0.6168662905693054},
  'chemicals': {'word2vec': 0.5928491950035095},
  'citrus': {'word2vec': 0.5871732831001282},
  'cocoa': {'word2vec': 0.6041184663772583},
  'cooking': {'word2vec': 0.6009650826454163},
  'dairy': {'word2vec': 0.6460014581680298},
  'fermented': {'word2vec': 0.5760336518287659},
  'fertilizers': {'word2vec': 0.5925315618515015},
  'fish': {'word2vec': 0.5927667617797852},
  'foods': {'word2vec': 0.7001853585243225},
  'foodstuffs': {'word2vec': 0.6066914200782776},
  'fruit': {'word2vec': 0.5887770652770996},
  'fruits': {'word2vec': 0.6141229271888733},
  'grain': {'word2vec': 0.5781475305557251},
  'ingredients': {'word2vec': 0.5744912624359131},
  'liquor': {'word2vec': 0.5889161825180054},
  'livestock': {'word2vec': 0.6447887420654297},
  'maiz

##### TO DO : generate a function that takes as input a word and returns matching keywords and scores using spacy

### TO DO : generate a function that outputs scores and list of matching keywords

First we create a function that normalizes the scores of a model.

There may be a simpler way to do this (perhaps when generating the scores).

In [55]:
def normalize_scores(d_scores):
    """
    Normalize the scores of each model (between 0 and 1).

    Parameters
    ----------
    d_scores : dictionary of the form {matching_keywords : {model : score}}.

    Returns
    -------
    d_normalized : same dictionary with normalized scores.
    """

    # Getting the models used (hypothesis : same models used for all matching keywords)
    models = list(list(d_scores.values())[0].keys())

    # We keep the maximum and minimum for each model
    max_dic = {}
    min_dic = {}

    for model in models:
        temp = [dic[model] for dic in d_scores.values()]
        max_dic[model] = max(temp)
        min_dic[model] = min(temp)

    d_normalized = {}

    # there may be an easier way to do it with dictionary comprehensions
    for key in d_scores.keys():
        d_normalized[key] = {}
        for (key_bis, value) in d_scores[key].items():
            spread = max_dic[key_bis] - min_dic[key_bis]
            # Assumption: when all scores of a model are equal, map them to 0
            d_normalized[key][key_bis] = float(value - min_dic[key_bis]) / spread if spread else 0.0

    return d_normalized
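The comment inside the function suggests that dictionary comprehensions could shorten it. A compact min-max version might look like the sketch below. The name `normalize_scores_compact` is hypothetical, and mapping every score to 0.0 when all scores of a model are equal is an assumption, chosen only to avoid a division by zero.

```python
def normalize_scores_compact(d_scores):
    """Min-max normalize {keyword: {model: score}} separately for each model."""
    # Collect all the scores produced by each model
    per_model = {}
    for scores in d_scores.values():
        for model, value in scores.items():
            per_model.setdefault(model, []).append(value)
    bounds = {m: (min(v), max(v)) for m, v in per_model.items()}

    def scale(model, value):
        low, high = bounds[model]
        # Assumption: a model whose scores are all equal maps to 0.0
        return (value - low) / float(high - low) if high > low else 0.0

    return {keyword: {m: scale(m, v) for m, v in scores.items()}
            for keyword, scores in d_scores.items()}


example = {"meal": {"wordnet": 2.0}, "dairy": {"wordnet": 1.0}, "fish": {"wordnet": 1.5}}
print(normalize_scores_compact(example))
```

Unlike the loop-based version, this variant does not require every keyword to have a score from every model, since the bounds are collected per model from whatever entries exist.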

### TO DO : finish the function.

In [None]:
def mixing_model():
    for 
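
Since the function above is unfinished, its intended logic is not known. Purely as an illustration, one way to go from `{matching_keyword: {model: score}}` to `{matching_keyword: score}` is a weighted average of the normalized scores. The signature, the `weights` parameter and the averaging rule are all assumptions, not the notebook's method.

```python
def mixing_model(d_normalized, weights=None):
    """Collapse {keyword: {model: score}} into {keyword: score}.

    Assumption: scores are combined with a weighted average over the
    models that produced the keyword; models missing for a keyword are
    ignored, and unlisted models get a weight of 1.0.
    """
    mixed = {}
    for keyword, scores in d_normalized.items():
        total, total_weight = 0.0, 0.0
        for model, score in scores.items():
            weight = 1.0 if weights is None else weights.get(model, 1.0)
            total += weight * score
            total_weight += weight
        mixed[keyword] = total / total_weight if total_weight else 0.0
    return mixed


# Hypothetical normalized scores
scores = {"meal": {"wordnet": 1.0, "word2vec": 0.5}, "dairy": {"word2vec": 1.0}}
print(mixing_model(scores))
```

Averaging only over the models present favors keywords found by a single model with a high score; dividing by the total number of models instead would penalize them, and which behavior is wanted is an open design choice.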

# Sources

http://ftp.cs.toronto.edu/pub/gh/Budanitsky+Hirst-2001.pdf

# Documentation (error)

import error : https://stackoverflow.com/questions/15526996/ipython-notebook-locale-error