<b>Recommendation Engine</b>

A recommendation engine/system helps the service providers to enhance the customer experience by recommending them the personalised products/services that they might like.

Recommendataion Systems may be of 2 type:
  1. Content-based: Given a few products, recommending the similar products. (Users who bought this,also bought..)
  2. User-based: Finding similar users by on the basis of their shopping patterns and recommending products that    other similar users liked!
  
<b>Problem Statement and Approach:</b>
     <br>
Given a dataset of thousands of restaurants in Bangalore and a user query with 4 parameters(text, location, cuisine, budget), we have to recommend restaurants to the user(Top 3 recommendations).
    
This is clearly a Content-based recommendation problem as we do not have any user past restaurant vising patterns.
    
Hence I used the following Approach step-by-step:
   1. Filling the missing data which is required during recommendation(dishes_liked, approx cost).
   2. Writing helper functions for text preprocesing - This is required to get recommendations based on text provided by user. Used dishes_liked, cuisined, reviews_list to find similarity (more about this in the later blocks)
   3. Writing functions to vectorize (BOW, TF-IDF), TF-IDF worked better and faster.
   4. Wrting recommendation function which returns the top 3 recommendations.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict, Counter
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from bs4 import BeautifulSoup
import unicodedata
import re
import spacy

In [2]:
data = pd.read_csv('RestoInfo.csv')

In [3]:
data.head()

Unnamed: 0.1,Unnamed: 0,name,online_order,book_table,rate,votes,location,rest_type,dish_liked,cuisines,approx_cost(for two people),reviews_list,menu_item,listed_in(type),listed_in(city)
0,46019,Unique Brew Cafe Resto,No,No,,0,Indiranagar,Quick Bites,,Fast Food,200.0,[],[],Dine-out,Old Airport Road
1,28849,Jayanthi Sagar,No,No,3.1 /5,21,Koramangala 5th Block,Quick Bites,,"South Indian, North Indian, Chinese",200.0,"[('Rated 2.0', ""RATEDn Works only because it'...",[],Dine-out,Koramangala 4th Block
2,19855,Rock Stone Ice Cream Factory,Yes,No,4.0/5,131,BTM,Dessert Parlor,"Icecream Cake, Brownie, Waffles, Chocolate Ice...",Ice Cream,230.0,"[('Rated 4.0', ""RATEDn Ice creams are really ...","['Midnight Indulgence Cake', 'Butterscotch Mel...",Delivery,Jayanagar
3,35188,Punjabi by Nature 2.0,No,No,4.2 /5,3236,BTM,"Casual Dining, Microbrewery","Paneer Tikki, Mutton Raan, Mango Margarita, Cr...",North Indian,,"[('Rated 3.0', ""RATEDn It has a beautiful amb...",[],Delivery,Koramangala 7th Block
4,7070,Rayalaseema Chefs,Yes,Yes,3.9/5,225,Marathahalli,Casual Dining,"Bamboo Chicken, Butter Naan, Mutton Biryani, P...","North Indian, Biryani, Andhra, Chinese",800.0,"[('Rated 5.0', 'RATEDn Had Good experience wi...",[],Delivery,Brookefield


Filling in the missing values in required columns only as follows:
   1. It seemed the best approach to fill dishes_liked with "NA" as filling it with other values might have deviated the purpose of restaurant and potentially give wrong information.
   2. First converted approx cost to float by removing the "," and type casting. Then filled it with the mean value. 

In [4]:
data["dish_liked"] = data["dish_liked"].fillna("NA")
data["approx_cost(for two people)"] = (data["approx_cost(for two people)"].str.replace(",", "")).astype(float)
data["approx_cost(for two people)"] = data["approx_cost(for two people)"].fillna(data["approx_cost(for two people)"].mean())

<b>Vectorization and Final Recommendations</b>

Following are the functions to vectorize the text and also the main recommendation.
   1. Used cosine_similarity (sklearn-linear_kernel) between user text and (dishes+cuisine) also between user text and reviews_list(cleaned), and then taking average of both the similarity score to get final score. the restaurants are sorted in decreasing order and top 3 recommendations are chosen.
   
   2. For each user query, first data is filtered by location and budget, then cosine similarity is used for remaining 2 parameters (user text, cuisines)
   
   3. While vectoring reviews_list, only top 2000 words are selected for each query. As this reduces time and space complexity still giving good results.

In [5]:
import math
from collections import Counter
from nltk import cluster
from sklearn.metrics.pairwise import linear_kernel

def buildVector(iterable1, iterable2):
    counter1 = Counter(iterable1).most_common(200)
    counter2= Counter(iterable2)
    all_items = set(counter1.keys()).union( set(counter2.keys()) )
    vector1 = [counter1[k] for k in all_items]
    vector2 = [counter2[k] for k in all_items]
    return vector1, vector2

def tfidf_vectorizer(datay, text):
    tfidf1 = TfidfVectorizer()
    tfidf2 = TfidfVectorizer(max_features = 2000)
    x = tfidf1.fit_transform(datay["dishes+cuisines"])
    y = tfidf2.fit_transform(datay["reviews_cleaned"])
    #print([text])
    first_text = tfidf1.transform([text.lower()])
    second_text = tfidf2.transform(normalize_corpus([text]))
    
    return x.toarray(), y.toarray(), first_text.toarray(), second_text.toarray()

def recommend(datax, text, location, budget, cuisine):
    filter_data = (datax.loc[(datax["location"]==location) & (datax["approx_cost(for two people)"]<=budget)]).reset_index(drop = True)
    #print(filter_data.shape)
    text = text + " " + cuisine
    #print(text)
    first_vect, sec_vect, first_text, second_text = tfidf_vectorizer(filter_data, text)
    #print(first_vect.shape)
    #print(sec_vect.shape)
    cosine_sim_1 = linear_kernel(first_text, first_vect)
    cosine_sim_2 = linear_kernel(second_text, sec_vect)
    final = pd.DataFrame({"name": filter_data["name"], "cosine_1": cosine_sim_1[0], "cosine_2": cosine_sim_2[0]})
    final["score"] = final[["cosine_1", "cosine_2"]].mean(axis = 1)
    final = pd.DataFrame(final.sort_values(by = "score", ascending = False)[:3]["name"]).reset_index(drop = True).rename(columns = {"name": "Restaurants Recommended"})
    return final
    

<b> Text Preprocessing - Helper functions</b>

In [6]:
tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')

#remove negative words from stopwords as it might be useful in semantic analysis
stopword_list.remove('no')
stopword_list.remove('not')
stopword_list.append('ratedn')
stopword_list.append('rated')
stopword_list.append('rate')

def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text

def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

def remove_special_characters(text, remove_digits=False):
    pattern = r'[^a-zA-z0-9\s]' if not remove_digits else r'[^a-zA-z\s]'
    text = re.sub(pattern, '', text)
    return text

def simple_stemmer(text):
    ps = nltk.porter.PorterStemmer()
    text = ' '.join([ps.stem(word) for word in text.split()])
    return text

def lemmatize_text(text):
    #text = nlp(text)
    lemmatizer = WordNetLemmatizer() 
    text = ' '.join([lemmatizer.lemmatize(word) for word in text.split()])
    return text

def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'',text)

def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

def remove_punct(text):
    table=str.maketrans('','',string.punctuation)
    return text.translate(table)

def normalize_corpus(corpus, html_stripping=True, accented_char_removal=True, text_lower_case=True, 
                     text_lemmatization=False, special_char_removal=True, 
                     stopword_removal=True, remove_digits=True, text_stemming = True):
    
    normalized_corpus = []
    
    # normalize each document in the corpus
    for doc in corpus:
        
        # strip HTML
        if html_stripping:
            doc = strip_html_tags(doc)
        
        # remove accented characters
        if accented_char_removal:
            doc = remove_accented_chars(doc)
        
        #Lower case the text
        if text_lower_case:
            doc = doc.lower()
        
        # remove extra newlines
        doc = re.sub(r'[\r|\n|\r\n]+', ' ',doc)
        
        if stopword_removal:
            doc = remove_stopwords(doc, is_lower_case=text_lower_case)
        
        # lemmatize/stemming text
        if text_lemmatization:
            doc = lemmatize_text(doc)
        if text_stemming:
            doc = simple_stemmer(doc)
        
        # remove special characters and\or digits    
        if special_char_removal:
            # insert spaces between special characters to isolate them    
            special_char_pattern = re.compile(r'([{.(-)!}])')
            doc = special_char_pattern.sub(" \\1 ", doc)
            doc = remove_special_characters(doc, remove_digits=remove_digits)  
        
        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)
        
        # remove stopwords
        if stopword_removal:
            doc = remove_stopwords(doc, is_lower_case=text_lower_case)
        
        #Removing words with length more than 15
        x = doc.split()
        a = []
        for i in x:
            if len(i)<=15:
                a.append(i)
        doc = " ".join(a)
        normalized_corpus.append(doc)
        
        
    return normalized_corpus

In [7]:
data["reviews_cleaned"] = normalize_corpus(data["reviews_list"])

In [8]:
from collections import Counter
data["len"] = 0
for i in range(data.shape[0]):
    data["len"][i] = len(Counter(data["reviews_cleaned"][i].split()))

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


In [9]:
data = data.sort_values(by = "len", ascending = False).reset_index(drop = True)
data.head(20)

Unnamed: 0.1,Unnamed: 0,name,online_order,book_table,rate,votes,location,rest_type,dish_liked,cuisines,approx_cost(for two people),reviews_list,menu_item,listed_in(type),listed_in(city),reviews_cleaned,len
0,32624,Brooks and Bonds Brewery,Yes,Yes,4.5 /5,3987,Koramangala 5th Block,"Casual Dining, Microbrewery","Beef Chilli, Salad, Cocktails, Pizza, Craft Be...","Continental, Mediterranean, North Indian, Chin...",1600.0,"[('Rated 4.0', 'RATEDn Nice chill place.. pre...","['Beer Butter Fish Finger', 'Murg Malai Tikka'...",Delivery,Koramangala 6th Block,[ nice chill place prefer would recommend sit ...,1666
1,35270,Brooks and Bonds Brewery,Yes,Yes,4.5 /5,3991,Koramangala 5th Block,"Casual Dining, Microbrewery","Beef Chilli, Salad, Cocktails, Pizza, Craft Be...","Continental, Mediterranean, North Indian, Chin...",1600.0,"[('Rated 5.0', 'RATEDn Deepak was the one hap...","['Beer Butter Fish Finger', 'Murg Malai Tikka'...",Delivery,Koramangala 7th Block,[ deepak one happi go person serv us complimen...,1655
2,17568,Ciclo Cafe,Yes,No,4.3/5,1269,Indiranagar,"Cafe, Casual Dining","Pizza, Cheesy Fries, Tiramisu, Ravioli Pasta, ...","Cafe, Italian, American",1000.0,"[('Rated 4.0', 'RATEDn A very cute bicycle sh...","['Home Made Fettucini Aglio Olio Pasta', 'Basi...",Cafes,Indiranagar,[ cute bicycl shop cum cafe ground floor bicyc...,1597
3,45827,Ciclo Cafe,Yes,No,4.3 /5,1332,Indiranagar,"Cafe, Casual Dining","Pizza, Cheesy Fries, Chicken Ceaser Salad, Tir...","Cafe, Italian, American",1000.0,"[('Rated 5.0', ""RATEDn I recently visited to ...",[],Dine-out,Old Airport Road,[ recent visit bangalor stop ciclo cafe indira...,1450
4,43831,WYT RestroPub,No,No,2.6 /5,408,MG Road,Pub,"Tandoori Chicken, Onion Rings, Masala Peanuts,...","North Indian, Chinese",1000.0,"[('Rated 1.0', ""RATEDn Drinks and food price ...",[],Drinks & nightlife,MG Road,[ drink food price increased place look plain ...,1358
5,13261,WYT RestroPub,No,No,2.7/5,402,MG Road,Pub,"Tandoori Chicken, Onion Rings, Masala Peanuts,...","North Indian, Chinese",1000.0,"[('Rated 5.0', 'RATEDn Well had been visit th...",[],Drinks & nightlife,Church Street,[ well visit yesterday tri potato basket vodka...,1343
6,9301,New Friends,Yes,No,3.8/5,273,BTM,Casual Dining,"Fritters, Lasagne, Biryani, Fish, Pasta, Draug...","North Indian, Continental, Chinese, Steak",900.0,"[('Rated 5.0', 'RATEDn Our team dinner was at...",[],Delivery,BTM,[ team dinner enjoy alot atmospher way arrang ...,1311
7,20232,New Friends,Yes,No,3.8/5,273,BTM,Casual Dining,"Fritters, Lasagne, Biryani, Fish, Pasta, Draug...","North Indian, Continental, Chinese, Steak",900.0,"[('Rated 5.0', 'RATEDn Our team dinner was at...","['Friends Special Cocktail', 'Veg Hot Pot', 'C...",Delivery,Jayanagar,[ team dinner enjoy alot atmospher way arrang ...,1311
8,20539,Nouvelle Garden,No,No,3.7/5,203,JP Nagar,Casual Dining,"Tiramisu, Pasta, Salads, Gulab Jamun, Butter N...","North Indian, Continental, Italian",900.0,"[('Rated 2.5', ""RATEDn The staff cares more a...",[],Delivery,Jayanagar,[ staff care cricket match play tv screen take...,1307
9,38634,Cafe Azzure,Yes,Yes,4.2 /5,2720,MG Road,Cafe,"Pasta, Wedges, Pizza, Nachos, Mocktails, Burge...","Cafe, Continental, Italian, Burger",1200.0,"[('Rated 5.0', 'RATEDn Cafe azzure has a roof...","['Azzure Special Twister Veg Pizza', 'Azzure S...",Dine-out,Lavelle Road,[ cafe azzur roof top ambience perfectli decor...,1306


As we can see from the above table that longest review list has around 1650 words after preprocessing, hence taking most common 2000 words for each query makes sense!

A new coloumn is created by concatenating dishes_liked and cuisines for cosine similarity with user text.

In [10]:
data["dishes+cuisines"] = (data["dish_liked"] + " " + data["cuisines"]).str.lower()

Here the recommend() function with parameters being user text, location, Budget, cuisine. Please change the below values to test with other queries.

In [11]:
import time
start = time.time()
try:
    final = recommend(data, "Good ambiance restaurants, serving fish", "Koramangala 5th Block", 1000, "North Indian")
    print("Recommendations Generated!")
except:
    print("Location not found, please enter location from folowing: ", )
    print(data["location"].unique())
end = time.time()
end-start

Recommendations Generated!


0.05191302299499512

Below are the top 3 restaurant recommendations.

In [12]:
final

Unnamed: 0,Restaurants Recommended
0,Jayanthi Sagar
1,North Indian And Bengali Mess
2,Thalassery Restaurant


How to further improve the recommendations:
   1. Use advaced word embedding (Glove, Word2vec, fasttext, etc...)
   2. Use custom similarity score (R&D)
   3. Getting user past account data (previous restaurant visiting patterns).