# Calculate recipe-ingredient similarity
We will compare recipes by the similarity of their ingredients using three techniques.
* Similarity based on co-occurrence count
* Similarity based on Jaccard
* Similarity based on [tfidf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

#### Before you read further, please check the following terminology
* token: each ingredient. As this is a tutorial for 
First, we will use a simple count of tokens. Then we will calculate the similarity using the Jaccard and TF-IDF techniques. For simplicity, we will treat each recipe's ingredient list as one document.

In [152]:
# Load libraries
import pandas as pd
import numpy as np
import pickle
import copy
import datetime
import math
import scipy.sparse as sp
from collections import Counter

## 0. Load dataset

In [96]:
data_path = 'data/'
with open( data_path + 'recipe_2_ingredients_dict.pkl', 'rb') as fp:
    recipe_2_ingredient_dict = pickle.load(fp)
    
with open( data_path + 'recipe_2_recipename_dict.pkl', 'rb') as fp:
    recipe_2_recipename_dict = pickle.load(fp)

## 1. Calculate similarity based on counts
This is perhaps the most straightforward way to calculate similarity between documents. We record all the terms that occur in each document and count the terms that co-occur in any two documents.
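
To make the idea concrete before the real data is processed, here is a minimal sketch on made-up data. The recipe names and ingredients in `toy_recipes` are hypothetical. It builds a binary recipe-by-ingredient matrix as plain lists and multiplies it by its transpose, so each entry counts the ingredients two recipes share.

```python
# Toy data (hypothetical): three recipes and their ingredient lists
toy_recipes = {
    "r1": ["salt", "egg", "flour"],
    "r2": ["salt", "egg", "milk"],
    "r3": ["rice", "soy sauce"],
}

# Fixed column order for the ingredient vocabulary
vocab = sorted({ing for ings in toy_recipes.values() for ing in ings})
recipe_ids = list(toy_recipes)

# Binary recipe-by-ingredient matrix: 1 if the recipe uses the ingredient
incidence = [[1 if ing in toy_recipes[r] else 0 for ing in vocab] for r in recipe_ids]

# Multiplying the matrix by its transpose counts shared ingredients per recipe pair
co_occurrence = [
    [sum(a * b for a, b in zip(row_i, row_j)) for row_j in incidence]
    for row_i in incidence
]

for r, row in zip(recipe_ids, co_occurrence):
    print(r, row)
```

The diagonal holds each recipe's own ingredient count, which is why the recipe itself has to be excluded when ranking candidates.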

In [12]:
recipe_key_list = list(recipe_2_ingredient_dict.keys())

In [15]:
# Get a list of all unique tokens (ingredients) across all documents (recipes)
unique_ingredient_set = set()
for key in recipe_key_list:
    unique_ingredient_set.update(recipe_2_ingredient_dict[key])

# Remove the null ingredient; discard() does not raise if '' is absent
unique_ingredient_set.discard('')
unqiue_ingredient_list = list(unique_ingredient_set)

In [40]:
# Create matrix to update the counts
# But before we update the matrix, we will create matrix_idx - recipe mapping table
recipe_idx_dict     = dict(zip(recipe_key_list, np.arange(0, len(recipe_key_list))))
ingredient_idx_dict = dict(zip(unqiue_ingredient_list, np.arange(0, len(unqiue_ingredient_list))))

# Make reverse dictionary to use later
idx_recipe_dict     = {idx: recipe for recipe, idx in recipe_idx_dict.items()}
idx_ingredient_dict = {idx: ingredient for ingredient, idx in ingredient_idx_dict.items()}

# Also, define variables
row_num = len(recipe_idx_dict)
col_num = len(ingredient_idx_dict)

In [30]:
# Populate matrix
start = datetime.datetime.now()
mat_a = sp.dok_matrix(( row_num, col_num ), dtype = np.int64)
for recipe in recipe_key_list:
    mx              = recipe_idx_dict[recipe]
    ingredient_list = [ing for ing in recipe_2_ingredient_dict[recipe] if ing != '']
    for ingredient in ingredient_list:
        mx_ingre = ingredient_idx_dict[ingredient]
        mat_a[mx, mx_ingre] = 1
        
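# Transpose mat_a (ingredient x recipe)
mat_b = mat_a.transpose().tocsr()
       
# Multiply mat_a by its transpose to get recipe-by-recipe co-occurrence counts.
# Use the sparse .dot() method, because np.dot is not aware of scipy sparse matrices.
coo_mx = mat_a.tocsr().dot(mat_b)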
        
end = datetime.datetime.now()
print( 'Populating co-occurance matrix completed. {} seconds'.format(end - start) )

Populating co-occurance matrix completed. 0:00:00.226271 seconds


Now we want to rank recipes by similarity using these counts. The more ingredients two recipes share, the more similar we consider them.

In [35]:
# Convert to dense matrix
coo_mx_dense = coo_mx.todense()

# Squeeze the matrix into an array so that each row can be sorted in descending order
# Each value is the number of ingredients shared by two recipes (row index mx and column index)

coo_ingredient_array = np.squeeze(np.asarray(coo_mx_dense))

In [44]:
# Sort each row's indices by co-occurrence count in descending order (zeros included).
coo_ingredient_array_index = np.argsort(-coo_ingredient_array)
coo_model = coo_ingredient_array_index

# For recommendations, keep ONLY the non-zero counts: a recipe that shares no ingredient is not a proper recommendation.
def non_zero_argsort(arr):
    indices = np.nonzero(arr)[0]
    return indices[np.argsort(-arr[indices])]

# Convert row indices back to recipe ids and store the ranked candidates in a dictionary
recipe_2_simRecipe_dict = dict()
for arr in range(coo_ingredient_array.shape[0]):
    non_zero_rec_candidate = non_zero_argsort(coo_ingredient_array[arr])
    # Exclude the recipe itself from its own candidate list
    recipe_2_simRecipe_dict[idx_recipe_dict[arr]] = [idx_recipe_dict[mx] for mx in non_zero_rec_candidate if mx != arr]
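
The ranking step can also be sketched without NumPy, which may make the logic easier to follow. The helper `rank_by_shared_count` and the counts in `toy_row` are hypothetical and not part of the notebook. The sketch keeps only non-zero counts, drops the recipe itself, and sorts the remaining indices by count in descending order.

```python
def rank_by_shared_count(counts_row, self_index):
    """Return indices of other recipes with a non-zero count, highest count first."""
    candidates = [i for i, c in enumerate(counts_row) if c > 0 and i != self_index]
    # sorted() is stable, so tied counts keep their original index order
    return sorted(candidates, key=lambda i: -counts_row[i])


# Hypothetical co-occurrence counts of recipe index 0 against all recipes
toy_row = [3, 2, 0, 5, 2]
ranked = rank_by_shared_count(toy_row, self_index=0)
print(ranked)
```

Index 2 is dropped because its count is zero, and index 0 is dropped because it is the recipe itself.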

In [115]:
def prt_model_result(model, idx_recipe_dict, recipe_idx_dict, recipe_2_ingredient_dict, list_of_recipe_to_test, rec_size):
    for test_recipe in list_of_recipe_to_test:
        print(test_recipe)
        test_recipe_ingredient_list  = recipe_2_ingredient_dict[test_recipe]
        close_recipe                 = recipe_2_simRecipe_dict[test_recipe][:rec_size]   # Top recommendations (recipe ids, most shared ingredients first)
        rec_candidates               = list(close_recipe)     # Already recipe ids, so no index conversion is needed
        print( 'Test recipe:{} <{}>'.format(test_recipe, recipe_2_recipename_dict[test_recipe]) )
        print( 'Test recipe ingredients: {}'.format(recipe_2_ingredient_dict[test_recipe]) )
        print( ' ')
        print( 'Similar recipes ')
        for rec in rec_candidates:
            co_occur_num = len(set(test_recipe_ingredient_list)&set(recipe_2_ingredient_dict[rec]))
            print( ' {}, <{}>.............co-occurance: {}'.format(rec, recipe_2_recipename_dict[rec], co_occur_num))
            print( ' Co-occuring ingredients: {}'.format(set(test_recipe_ingredient_list)&set(recipe_2_ingredient_dict[rec])))
            print( ' ' )
            print( ' ' )
        print('--------------------------------------------------------------------')    

In [116]:
list_of_recipe_to_test = [6128]

In [117]:
prt_model_result(coo_model, idx_recipe_dict, recipe_idx_dict, recipe_2_ingredient_dict, list_of_recipe_to_test, 20)

6128
Test recipe:6128 <마이 ♥ 케이크>
Test recipe ingredients: ['인스턴트 카레', '돈코츠 라면 수프', '트러플 오일', '달걀', '크림소스', '핫케이크 가루', '브로콜리', '파프리카', '팽이버섯', '칵테일 새우', '날치알', '물', '후추', '대파', '소금']
 
Similar recipes 
 6129, <크림 속에 비친 파프리카>.............co-occurance: 6
 Co-occuring ingredients: {'팽이버섯', '파프리카', '날치알', '소금', '달걀', '후추'}
 
 
 6046, <허니버터징>.............co-occurance: 5
 Co-occuring ingredients: {'대파', '소금', '달걀', '브로콜리', '후추'}
 
 
 6228, <무적 카레레인저>.............co-occurance: 5
 Co-occuring ingredients: {'대파', '소금', '브로콜리', '후추', '인스턴트 카레'}
 
 
 6156, <미트볼그레>.............co-occurance: 5
 Co-occuring ingredients: {'대파', '소금', '달걀', '파프리카', '후추'}
 
 
 6127, <로맨티스타 케이크>.............co-occurance: 5
 Co-occuring ingredients: {'팽이버섯', '소금', '달걀', '후추', '크림소스'}
 
 
 6459, <떡이라자냐>.............co-occurance: 5
 Co-occuring ingredients: {'소금', '달걀', '파프리카', '후추', '크림소스'}
 
 
 6208, <떡.고.치>.............co-occurance: 5
 Co-occuring ingredients: {'대파', '소금', '달걀', '파프리카', '후추'}
 
 
 6034, <누르삼>............

We can see that after the first three recipes, the rest are not as related as I would hope. This is one reason to consider tfidf, which in theory down-weights ingredients that occur more frequently and therefore carry less meaning in defining a relationship.
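
One way to see why frequent ingredients dominate the counts is to look at how many recipes contain each ingredient. The sketch below uses made-up data in `toy_recipes` and is not taken from the notebook's dataset. It computes the document frequency of every ingredient.

```python
from collections import Counter

# Toy data (hypothetical): seasoning appears almost everywhere
toy_recipes = {
    "r1": ["salt", "pepper", "egg"],
    "r2": ["salt", "pepper", "curry"],
    "r3": ["salt", "truffle oil"],
}

# Document frequency: number of recipes that contain each ingredient
doc_freq = Counter()
for ingredients in toy_recipes.values():
    doc_freq.update(set(ingredients))   # set() counts each ingredient once per recipe

n_docs = len(toy_recipes)
for ingredient, df in doc_freq.most_common():
    print('{}: in {} of {} recipes'.format(ingredient, df, n_docs))
```

An ingredient that appears in nearly every recipe, such as salt here, adds to almost every pair's count and so says little about which recipes are actually alike.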

## 2. Calculate similarity based on Jaccard

In [130]:
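def get_jaccard(doc_1, doc_2):
    a = set(doc_1)
    b = set(doc_2)
    c = a.intersection(b)
    union_size = len(a) + len(b) - len(c)
    # Two empty documents have nothing to compare; return 0.0 instead of dividing by zero
    if union_size == 0:
        return 0.0
    return float(len(c)) / union_size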

In [143]:
def get_similar_recipe(test_recipe, n ):
    """
    Input: recipe to test and n, the number of results to return
           (recipe_2_ingredient_dict and recipe_key_list are read as globals)
    Output: top n similar recipes as (score, recipe) tuples, ranked by Jaccard
    """
    recipe_score_tuple_list = []
    test_ingredient = recipe_2_ingredient_dict[test_recipe]
    for key in recipe_key_list:
        compare_ingredient = recipe_2_ingredient_dict[key]
        score = get_jaccard(test_ingredient, compare_ingredient)
        recipe_score_tuple_list.append((score, key))

    # Sort based on the value 
    sorted_recipe_score_tuple_list = sorted(recipe_score_tuple_list, key = lambda element: element[0],reverse=True)
    sorted_recipe_score_list = [candidate for candidate in sorted_recipe_score_tuple_list if candidate[1] != test_recipe]
    return sorted_recipe_score_list[:n]

In [144]:
get_similar_recipe(6128, 10)

[(0.2222222222222222, 6129),
 (0.21739130434782608, 6127),
 (0.19047619047619047, 5998),
 (0.19047619047619047, 6114),
 (0.18181818181818182, 6238),
 (0.18181818181818182, 6012),
 (0.17857142857142858, 6147),
 (0.17857142857142858, 6228),
 (0.17857142857142858, 6560),
 (0.17391304347826086, 6212)]

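Looking at the result, we can see that the ranking differs from the one based on counts. For example, recipe 6127 moves up to second place, because Jaccard divides the number of shared ingredients by the size of the combined ingredient set.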
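
The arithmetic behind the two top scores can be checked by hand. The test recipe lists 15 distinct ingredients and shares 6 with recipe 6129 and 5 with recipe 6127. The sizes 18 and 13 used below for those two recipes are not printed anywhere in the notebook; they are inferred from the printed scores under that assumption. The third entry, `"large"`, is a purely hypothetical recipe added to show how a long ingredient list lowers the score.

```python
def jaccard_from_counts(shared, size_a, size_b):
    """Jaccard score from the overlap size and the two set sizes."""
    return shared / (size_a + size_b - shared)


test_size = 15   # distinct ingredients printed for recipe 6128

# recipe id -> (shared ingredients, inferred or hypothetical ingredient count)
candidates = {
    6129: (6, 18),       # size inferred from the printed score
    6127: (5, 13),       # size inferred from the printed score
    "large": (5, 30),    # hypothetical recipe with a long ingredient list
}

scores = {
    rid: jaccard_from_counts(shared, test_size, size)
    for rid, (shared, size) in candidates.items()
}
for rid, score in scores.items():
    print(rid, round(score, 4))
```

With the same five shared ingredients, the short recipe scores much higher than the long one, which is exactly the normalization that a raw count lacks.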

In [150]:
# print the result
test_recipe = 6128
similar_recipe = get_similar_recipe(test_recipe, 10)
test_recipe_ingredient_list = recipe_2_ingredient_dict[test_recipe]
print( 'Test recipe:{} <{}>'.format(test_recipe, recipe_2_recipename_dict[test_recipe]) )
print( 'Test recipe ingredients: {}'.format(test_recipe_ingredient_list) )
print( ' ')
print( 'Similar recipes ')
for rec in similar_recipe:
    co_occur_num = len(set(test_recipe_ingredient_list)&set(recipe_2_ingredient_dict[rec[1]]))
    print( ' {}, <{}>.............co-occurance: {}'.format(rec[1], recipe_2_recipename_dict[rec[1]], co_occur_num))
    print( ' Co-occuring ingredients: {}'.format(set(test_recipe_ingredient_list)&set(recipe_2_ingredient_dict[rec[1]])))
    print( ' ' )
    print( ' ' )
    print('--------------------------------------------------------------------')    

Test recipe:6128 <마이 ♥ 케이크>
Test recipe ingredients: ['인스턴트 카레', '돈코츠 라면 수프', '트러플 오일', '달걀', '크림소스', '핫케이크 가루', '브로콜리', '파프리카', '팽이버섯', '칵테일 새우', '날치알', '물', '후추', '대파', '소금']
 
Similar recipes 
 6129, <크림 속에 비친 파프리카>.............co-occurance: 6
 Co-occuring ingredients: {'팽이버섯', '파프리카', '날치알', '소금', '달걀', '후추'}
 
 
--------------------------------------------------------------------
 6127, <로맨티스타 케이크>.............co-occurance: 5
 Co-occuring ingredients: {'팽이버섯', '소금', '달걀', '후추', '크림소스'}
 
 
--------------------------------------------------------------------
 5998, <소테미너>.............co-occurance: 4
 Co-occuring ingredients: {'소금', '브로콜리', '파프리카', '후추'}
 
 
--------------------------------------------------------------------
 6114, <아이스테키>.............co-occurance: 4
 Co-occuring ingredients: {'대파', '소금', '달걀', '후추'}
 
 
--------------------------------------------------------------------
 6238, <렌틸콩그레츄레이션>.............co-occurance: 4
 Co-occuring ingredients: {'대파', '소금', '파프리카', 

## 3. Cosine similarity based on TF-IDF
Cosine similarity measures how similar two vectors are by the cosine of the angle between them. To use it, we first need to convert documents into vectors. One way to do that is a bag of words weighted by either TF (term frequency) or TF-IDF (term frequency-inverse document frequency). The choice between TF and TF-IDF depends on the application and does not affect how cosine similarity itself is computed, which only needs vectors. TF is good for text similarity in general, while TF-IDF is good for search query relevance.
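
For reference, this is the standard definition for two vectors $\mathbf{a}$ and $\mathbf{b}$ with $n$ components each. The symbols are introduced here for illustration and do not appear elsewhere in the notebook.

$$
\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
= \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}
$$

With non-negative weights such as raw term counts, the value lies between 0 and 1, and any non-zero vector compared with itself scores exactly 1.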

There are two main differences between TF / TF-IDF with bag of words and word embeddings:
1. TF / TF-IDF creates one number per word, whereas word embeddings typically create one vector per word.
2. TF / TF-IDF is good for classifying documents as a whole, whereas word embeddings are good for identifying contextual content.



In [154]:
# Resource: https://stevenloria.com/tf-idf/
# Resource: https://towardsdatascience.com/overview-of-text-similarity-metrics-3397c4601f50

In [184]:
# Define the following functions
def tf(word, blob):
    """
    tf(word, blob) computes "term frequency": the number of times
    a word appears in document blob (a recipe id), normalized by
    dividing by the number of unique words in blob.
    """
    all_the_words_counter = Counter(recipe_2_ingredient_dict[blob])
    num_word_appears = all_the_words_counter[word]
    total_num_of_words = len(set(recipe_2_ingredient_dict[blob]))
    return num_word_appears / total_num_of_words

def n_containing(word, bloblist):
    """
    Returns the number of documents containing word.
    A generator expression is passed to the sum() function.
    """
    return sum(1 for blob in bloblist if word in recipe_2_ingredient_dict[blob])

def idf(word, bloblist):
    """
    Computes "inverse document frequency", which measures
    how rare a word is among all documents in bloblist.
    The more common a word is, the lower its idf.
    As written, this divides the log of the total number of
    documents by the number of documents containing word;
    the textbook form instead takes the log of that ratio.
    1 is added to the divisor to prevent division by zero.
    """
    return float(math.log(len(bloblist)) / float((1 + n_containing(word, bloblist))))

def tfidf(word, blob, bloblist):
    """
    Computes the TF-IDF score. It's the product of tf and idf.
    """
    return tf(word, blob) * idf(word, bloblist)
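
For comparison, here is a minimal sketch of the textbook form, where the logarithm is applied to the whole ratio. The function name `idf_textbook` and the data in `toy_documents` are hypothetical. Unlike the functions above, it takes the token lists directly rather than recipe ids.

```python
import math


def idf_textbook(word, documents):
    """log(N / (1 + df)), where df is the number of documents containing word."""
    n_docs = len(documents)
    n_with_word = sum(1 for doc in documents if word in doc)
    return math.log(n_docs / (1 + n_with_word))


# Toy documents (hypothetical): each is a list of ingredients
toy_documents = [
    ["salt", "pepper", "egg"],
    ["salt", "pepper", "curry"],
    ["salt", "truffle oil"],
    ["salt", "rice"],
]

for word in ["salt", "pepper", "truffle oil"]:
    print(word, round(idf_textbook(word, toy_documents), 4))
```

Note that with the extra 1 in the divisor, a word that appears in every document gets a slightly negative value. Swapping this form into the notebook would change every score printed in the cells that follow.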

In [185]:
bloblist = recipe_key_list
test_recipe = [6128,6129]
tfidf_score_dict = dict()
for i, blob in enumerate(bloblist):
    #print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in recipe_2_ingredient_dict[blob]}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    tfidf_score_dict[blob] = scores
#     if blob in test_recipe:
#         for word, score in sorted_words[:10]:
#             print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))
#         print("-------")

In [197]:
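def get_cosine_similarity(doc_1, doc_2):
    """
    Sums the products of the TF-IDF scores of the ingredients shared by
    doc_1 and doc_2. This is the dot product of the two TF-IDF vectors.
    It is not divided by the vector norms, so the score is unnormalized
    and a document compared with itself does not score 1.
    """
    co_occur_words = list(set(recipe_2_ingredient_dict[doc_1]) & set(recipe_2_ingredient_dict[doc_2]))
    cosine_similarity = 0
    for word in co_occur_words:
        vec_1 = tfidf_score_dict[doc_1][word]
        vec_2 = tfidf_score_dict[doc_2][word]
        cosine_similarity += np.dot(vec_1, vec_2)
    return cosine_similarity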

In [198]:
get_cosine_similarity(6128,6129)

0.004596486194561721
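
The value above is a sum of products over shared ingredients, that is, a dot product. To turn it into a cosine, it would also need to be divided by the lengths of both vectors. The sketch below shows one way to do that for sparse vectors stored as dictionaries. The function `cosine_from_weights` and the weights in `toy_a` and `toy_b` are hypothetical. In the notebook, the per-recipe dictionaries in `tfidf_score_dict` could presumably be passed in the same way.

```python
import math


def cosine_from_weights(weights_1, weights_2):
    """Cosine similarity of two sparse vectors given as {token: weight} dicts."""
    shared = set(weights_1) & set(weights_2)
    dot = sum(weights_1[w] * weights_2[w] for w in shared)
    norm_1 = math.sqrt(sum(v * v for v in weights_1.values()))
    norm_2 = math.sqrt(sum(v * v for v in weights_2.values()))
    # An all-zero or empty vector has no direction; treat it as not similar
    if norm_1 == 0 or norm_2 == 0:
        return 0.0
    return dot / (norm_1 * norm_2)


# Hypothetical TF-IDF weights for two recipes
toy_a = {"salt": 0.01, "egg": 0.05, "truffle oil": 0.20}
toy_b = {"salt": 0.01, "egg": 0.04, "rice": 0.10}

print(cosine_from_weights(toy_a, toy_b))
print(cosine_from_weights(toy_a, toy_a))
```

After normalization, scores are comparable across recipes with different numbers of ingredients, and a recipe compared with itself scores 1 up to floating-point rounding.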

In [201]:
def rank_based_on_cosine_similarity(doc_1, all_the_doc):
    compare_dict = dict()
    for compare_doc in all_the_doc:
        score = get_cosine_similarity(doc_1, compare_doc)
        compare_dict[compare_doc] = score
    # sort
    sorted_doc = sorted(compare_dict.items(), key = lambda x: x[1], reverse = True)
    return sorted_doc[:20]

In [216]:
recipe_key_list[-10:]

[6134, 6135, 6136, 6137, 6138, 6139, 6140, 6141, 6142, 6143]

In [225]:
rank_based_on_cosine_similarity(6142, recipe_key_list)

[(6142, 0.055135872735961766),
 (6504, 0.022171458156960384),
 (6003, 0.008496461529052778),
 (6138, 0.006953647368766969),
 (6507, 0.006132103494184278),
 (6514, 0.005880420000713393),
 (6564, 0.005479137948729378),
 (6006, 0.005463580075459761),
 (6128, 0.0051031529877613785),
 (6130, 0.004499418885672744),
 (6562, 0.004204035747129451),
 (6423, 0.0005813939380748428),
 (6139, 0.0004908841208334472),
 (6225, 0.00045107130802424835),
 (6396, 0.0004160094726963732),
 (6416, 0.0004002141612553351),
 (6314, 0.0003894637038674107),
 (6336, 0.0003676214024024194),
 (6008, 0.000365350865301032),
 (6353, 0.0003540566773712902)]

Differences between Jaccard similarity and cosine similarity:
1. Jaccard similarity uses only the set of unique words in each sentence / document, while cosine similarity uses the full vectors, including their lengths. (These vectors could be built from bag-of-words term frequency or TF-IDF.)
2. This means that if you repeat the word “friend” in Sentence 1 several times, cosine similarity changes but Jaccard similarity does not. For example, if the word “friend” is repeated in the first sentence 50 times, cosine similarity drops to 0.4 but Jaccard similarity remains at 0.5.
3. Jaccard similarity suits cases where duplication does not matter, while cosine similarity suits cases where duplication matters when analyzing text similarity. For two product descriptions, Jaccard similarity is the better choice, because repeating a word does not reduce their similarity.
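
The effect of repetition can be reproduced on a small scale. The two sentences below are made up for this sketch, so the numbers differ from the ones quoted in the list above. The helpers `jaccard` and `cosine_tf` are hypothetical and use raw term counts as the vector weights.

```python
import math
from collections import Counter


def jaccard(tokens_1, tokens_2):
    """Jaccard similarity over the unique tokens of two non-empty token lists."""
    a, b = set(tokens_1), set(tokens_2)
    return len(a & b) / len(a | b)


def cosine_tf(tokens_1, tokens_2):
    """Cosine similarity of raw term-count vectors of two non-empty token lists."""
    counts_1, counts_2 = Counter(tokens_1), Counter(tokens_2)
    dot = sum(counts_1[w] * counts_2[w] for w in counts_1.keys() & counts_2.keys())
    norm_1 = math.sqrt(sum(v * v for v in counts_1.values()))
    norm_2 = math.sqrt(sum(v * v for v in counts_2.values()))
    return dot / (norm_1 * norm_2)


# Made-up sentences; repeated_1 contains "friend" 50 times in total
sentence_1 = "my friend likes pasta".split()
sentence_2 = "my friend likes rice".split()
repeated_1 = sentence_1 + ["friend"] * 49

print('Jaccard:', jaccard(sentence_1, sentence_2), jaccard(repeated_1, sentence_2))
print('Cosine: ', cosine_tf(sentence_1, sentence_2), cosine_tf(repeated_1, sentence_2))
```

Jaccard stays the same because the set of unique words is unchanged, while the cosine score drops because the repeated word stretches one vector along a single axis.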