## Recommendation Engines via Nearest Neighbors

#### Objectives

In this section, I will use a TF-IDF format of the recipe data that I have been working with [here](https://github.com/emenriquez/Springboard-Coursework/blob/master/Capstone%20Project%202/EDA%20-%20Cuisines.ipynb) to create recommendation engines for the following applications that could be used by users of the Yummly service:

1. **Simple recommendation engine** - recommend the most similar recipes to a user's current recipe
2. **Recipe discovery engine** - recommend the most similar recipes to a user's current recipe that are from different cuisines unique from the user's original recipe cuisine.  
    Example: (User's recipe is Italian) "Here are 3 similar recipes from French, Irish, and Spanish cuisines."  
3. **Ingredient list recommendation engine** - take the ingredients that a user has, and recommend recipes that can be made, along with suggestions for which additional ingredients they may need.
4. **Ingredient pairing recommendation engine** - take ingredients for a user's recipe and recommend ingredients that are likely to pair well based on how often they are used together in the dataset.

In [1]:
# import packages to read and work with data
import pandas as pd
import numpy as np
from collections import defaultdict
from random import shuffle, sample
import itertools
import time

# Visualization packages
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
sns.set(font_scale=1.5)

# Packages for working with text data
from nltk import tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# Tools for Recommendations
from sklearn.neighbors import NearestNeighbors

First I will redefine the TF-IDF sparse matrix for the data.

In [2]:
# Load the cleaned dataset from the data folder
data = pd.read_pickle('data/data_clean.pkl')

# drop the words column of data
data.drop('words', axis=1, inplace=True)

# Convert the recipe ingredient lists into strings
ingredient_strings = [', '.join(recipe) for recipe in data.ingredients]
data.ingredients = ingredient_strings

In [3]:
# Custom tokenizer to separate list into tokens by commas
tokenized = tokenize.regexp.RegexpTokenizer(pattern=", ", gaps=True)

# Create TF-IDF weighting dictionary for each cuisine, exclude terms that appear in every cuisine
tfidf = TfidfVectorizer(tokenizer=tokenized.tokenize, max_df=0.12, binary=True, use_idf=False, norm=None)

# Fit and transform cuisine ingredient lists to generate sparse matrix
ingredients_weighted = tfidf.fit_transform(data.ingredients)

In [4]:
ingredients_weighted

<39757x720 sparse matrix of type '<class 'numpy.float64'>'
	with 267634 stored elements in Compressed Sparse Row format>

### 1. Simple Recommendation Engine

Now that the sparse matrix is created, I will create the first recommendation engine, which will output the nearest neighbor recipes that can be suggested to the user.

In [5]:
def similar_cuisine_recommendations(user_recipe_id, n_recommendations=3):
    """
    This function takes in the id of a recipe for a user, and generates similar recipe recommendations.
    """
    
    # Define a map between ingredients_weighted and data ID's
    assert ingredients_weighted.shape[0] == data.shape[0]
    
    ingredients_weighted_indices_dict = defaultdict(int)
    for i in range(data.shape[0]):
        ingredients_weighted_indices_dict[data.index[i]] = i
    
    # Find nearest neigbors among all recipes
    nbrs = NearestNeighbors(n_recommendations+1).fit(ingredients_weighted)
    indices = nbrs.kneighbors(ingredients_weighted[ingredients_weighted_indices_dict[user_recipe_id]], return_distance=False)
    print('Your recipe:')
    print(data.loc[[user_recipe_id]])

    print('\n\nSimilar recipes you might be interested in:')
    print(data.iloc[indices[0][1:]])

In [6]:
similar_cuisine_recommendations(4758)

Your recipe:
      cuisine                                        ingredients
id                                                              
4758  italian  garlic clove, chicken broth, basil, flour, sal...


Similar recipes you might be interested in:
       cuisine                                        ingredients
id                                                               
40073  italian  garlic clove, chicken broth, basil, tomato, on...
44504  italian  chicken broth, basil, salt, butter, cheese, pe...
26040  italian  basil, cheese, olive oil, tomato, sugar, garli...


It looks like it works well, but as expected, the recipes may be overly similar. For example, if a user's current recipe is for brownies in a Southern U.S. cuisine style, the recommended recipes may also be Southern U.S. cuisine-style brownie recipes with very small differences. 

One way to address this is to create a recommendation engine that will let users discover recipes they may not have tried otherwise by suggesting similar recipes from different cuisine categories.

### 2. Recipe Discovery Engine

In [7]:
# Find nearest neighbors from cuisines distinct from input
def unique_cuisine_recommendations(user_recipe_id, n_recommendations=3):
    """
    This function takes in the id of a recipe for a user, and generates similar recipe recommendations from other cuisines.
    """
    
    # Define a map between ingredients_weighted and data ID's
    assert ingredients_weighted.shape[0] == data.shape[0]
    
    ingredients_weighted_indices_dict = defaultdict(int)
    for i in range(data.shape[0]):
        ingredients_weighted_indices_dict[data.index[i]] = i
    
    # Display the record for the user's recipe
    print('Your recipe:')
    print(data.loc[[user_recipe_id]])
    
    # Create a list of all cuisines
    cuisine_list = data.cuisine.value_counts().index
    cuisines_observed = [data.cuisine[user_recipe_id]]
    recommendations = []
    
    # Let the user know that the algorithm is searching
    print('\nSearching...\n')
    
    for recommendation in range(n_recommendations):
        data_subset = [data.index[i] for i in range(data.shape[0]) if data.loc[data.index[i]].cuisine not in cuisines_observed]
        ingredients_indices = [ingredients_weighted_indices_dict[value] for value in data_subset]
        ingredients_subset = ingredients_weighted[ingredients_indices]
        
        # Find nearest neigbors among cuisine subsets
        nbrs = NearestNeighbors(n_neighbors=2).fit(ingredients_subset)
        index = nbrs.kneighbors(ingredients_weighted[ingredients_weighted_indices_dict[user_recipe_id]],
                                  return_distance=False)
        
        # Append the result to the list for output and add the cuisine to cuisines_observed
        recommendations.append(index[0][0])
        cuisines_observed.append(data.iloc[index[0][0]].cuisine)

    # Display the results!
    print('\n\nTry some of these new recipes you might enjoy!')
    print(data.iloc[recommendations])

In [8]:
unique_cuisine_recommendations(23260)

Your recipe:
       cuisine                                        ingredients
id                                                               
23260  italian  milk, chicken broth, basil, cheese, flour, sal...

Searching...



Try some of these new recipes you might enjoy!
           cuisine                                        ingredients
id                                                                   
37648  southern_us  chicken broth, margarine, biscuit, onion, cele...
11935       korean  sesame oil, onion, sugar, flank steak, garlic,...
4407          thai  medium shrimp, lemongrass, lime juice, salt, g...


Great! This engine is also useful and may increase user engagement when they want to be a little adventurous with their cooking and are looking for a good place to start!

### 3. Ingredient list recommendation engine

For this engine, I will take a list of ingredients (perhaps whatever the user has in their fridge and pantry at the moment) and suggest the nearest recipes that can make use of a similar set of ingredients.

In [9]:
def ingredient_list_recommendations(user_ingredient_list, n_recommendations=3):
    """
    This function takes in the list of ingredients from a user, and generates similar recipe recommendations.
    """
    
    # Transform the recipe ingredients vector to TF-IDF format
    user_ingredients_string = ', '.join(user_ingredient_list)
    user_tfidf = tfidf.transform([user_ingredients_string])
    
    # Find nearest neigbors among all recipes
    nbrs = NearestNeighbors(n_recommendations).fit(ingredients_weighted)
    indices = nbrs.kneighbors(user_tfidf, return_distance=False)

    print('\n\nYou can try these recipes!')
    
    # loop through nearest neighbors = n_recommendations
    for neighbor in range(n_recommendations):
        
        # List ingredients (if needed) to make neighbor recipe
        neighbor_ingredients = (data.iloc[indices[0][neighbor]].ingredients).split(', ')
        missing_ingredients = [ingredient for ingredient in neighbor_ingredients if ingredient not in user_ingredient_list]
        
        # Print a border between recipes
        print('------------------------------')
        
        # Display recipes that user can make
        if len(missing_ingredients) == 0:
            print('\nYou have all of the ingredients needed to make this recipe!')
            print(data.iloc[indices[0][neighbor]])
            
        else:
            # Display recipes that can be made
            print('\nIf you buy: {0}'.format(missing_ingredients))
            print('\nYou can make:')
            print(data.iloc[indices[0][neighbor]])
    
    print('------------------------------')

In [10]:
sample_list = ['chicken broth', 'margarine', 'biscuit', 'onion', 'celery', 'pepper', 'sage', 'egg']

ingredient_list_recommendations(sample_list)



You can try these recipes!
------------------------------

You have all of the ingredients needed to make this recipe!
cuisine                                              southern_us
ingredients    chicken broth, margarine, biscuit, onion, cele...
Name: 37648, dtype: object
------------------------------

If you buy: ['bacon', 'bell pepper']

You can make:
cuisine                                              southern_us
ingredients    chicken broth, bacon, onion, celery, bell pepp...
Name: 31111, dtype: object
------------------------------

If you buy: ['flour', 'tomato', 'sugar', 'butter']

You can make:
cuisine                                              southern_us
ingredients    chicken broth, flour, tomato, sugar, butter, p...
Name: 2273, dtype: object
------------------------------


### Ingredient pairing recommendation engine

Lastly, I will use the a measure of the co-occurence of ingredients in a recipe to recommend an ingredient that could be added to either a single ingredient(best ingredient pair) or to a list of ingredients (best ingredient list compliment). This can be a useful suggestion for users looking to spice up their favorite recipes with a new ingredient!

In [11]:
# Convert to dense format for simpler handling
ingredients_dense = ingredients_weighted.todense()

# create a matrix of zeros to fill co-occurence values into
ingredient_pairs = np.zeros((ingredients_dense.shape[1], ingredients_dense.shape[1]))

# loop through ingredients and calculate fraction of recipes in which they occur together
for i in range(ingredients_dense.shape[1]):
    ingredient_indices = np.where(ingredients_dense[:,i] == 1)[0]
    ingredient_pairs[i,:] = ingredients_dense[ingredient_indices, :].sum(axis=0)/len(ingredient_indices)
    ingredient_pairs[i,i] = 0

In [12]:
def complimentary_ingredients(user_ingredient_list, n_recommendations=1):
    """
    This function takes in a list of ingredients from a user and suggests ingredients that are 
    most likely to compliment the set.
    """
    
    # Create a sublists of users ingredients that are in the ingredient pairings matrix
    ingredient_sublist = [ingredient for ingredient in user_ingredient_list if ingredient in tfidf.vocabulary_.keys()]
    print('Searching for compliments to the following ingredients:')
    print(ingredient_sublist)
    
    if len(ingredient_sublist) == 0:
        print('\nNo ingredients recognized for matching!')
    else:
        # Extract indices of the rows in the matrix for the given ingredients
        ingredient_indices = [sorted(tfidf.vocabulary_).index(ingredient) for ingredient in ingredient_sublist]
            
        # Sum up the co-occurences of all ingredients and return the ingredients with the highest combined co-occurence scores
        top_pairings = np.argsort(ingredient_pairs[ingredient_indices,:].sum(axis=0))[::-1]
        top_complimentary_ingredients = [ingredient for ingredient in top_pairings if ingredient not in ingredient_sublist]
        
        # Convert the indices back into ingredient names for display
        recommended_ingredients = [sorted(tfidf.vocabulary_)[i] for i in top_complimentary_ingredients[:n_recommendations]]
        
        # Display the recommendations to the user
        print('\nThis might go well with your ingredients!')
        print(recommended_ingredients)

In [13]:
complimentary_ingredients(sample_list)

Searching for compliments to the following ingredients:
['chicken broth', 'margarine', 'biscuit', 'celery', 'sage']

This might go well with your ingredients!
['milk']


In [14]:
complimentary_ingredients(['chicken'])

Searching for compliments to the following ingredients:
['chicken']

This might go well with your ingredients!
['ginger']


In [15]:
complimentary_ingredients(['strawberry'], n_recommendations=3)

Searching for compliments to the following ingredients:
['strawberry']

This might go well with your ingredients!
['cream', 'vanilla extract', 'milk']


In [16]:
# Create Second vocabulary TF-IDF weighting dictionary with all ingredients
tfidf_all = TfidfVectorizer(tokenizer=tokenized.tokenize, binary=True, use_idf=False, norm=None)

# Fit and transform cuisine ingredient lists to get vocabulary list
_ = tfidf_all.fit_transform(data.ingredients)

# Compare vocabulary lists to see which ingredients are not included in the recommendation engines
print('The following ingredients are not considered for recommendations because of their universal usage:')
print(set(tfidf_all.vocabulary_).difference(set(tfidf.vocabulary_)))

The following ingredients are not considered for recommendations because of their universal usage:
{'sugar', 'garlic clove', 'salt', 'olive oil', 'tomato', 'flour', 'water', 'pepper', 'garlic', 'egg', 'butter', 'onion'}


With these recommendation engines in place, we are able to utilize the data and analysis that has been done on the cuisines, recipes and ingredient levels to create prototype services that could enhance the experiences for users of this service. In addition, these engines can be used to learn more about cuisines and experiment with new fusion recipes that have a higher probability of a good outcome than random guessing by adding in complimentary ingredients to existing recipes.