### Outline

#### Restaurant Recommendations for (User, City)

##### Steps

1. Encode Restaurants as an Embedding that captures two things:
  - Restaurants that have similar review text should be closer
    - Tried here: Mean of GloVe word embeddings after text cleanup.
    - A TF-IDF weighted average of the word embeddings might improve results
    - Other text encoders should be explored (BERT et al.)
  - Restaurants that are rated similarly by different users should be closer (irrespective of whether the actual star rating is the same)
    - Tried here: Use Word2Vec architecture
      - User1, 4 Stars: (Res1, Res2, Res3)
      - User2, 1 Stars: (Res1, Res4, Res3)
      - => Res2 and Res4 are similar in some way because they appear in the company of similar restaurants (Res1, Res3). User1 loves the group, User2 hates the group, but there is something common about the group. Word2Vec tries to capture this. 
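
The TF-IDF weighted average mentioned in step 1 could look roughly like the sketch below. It is not part of this notebook: the function name `tfidf_weighted_mean`, the toy corpus, and the two-dimensional toy vectors are all assumptions made for illustration, and the smoothed IDF formula is one common choice among several.

```python
import math
from collections import Counter

def tfidf_weighted_mean(doc_tokens, corpus, vectors):
    """Average the word vectors of doc_tokens, weighting each word by TF-IDF.

    Hypothetical helper: `corpus` is a list of token lists and `vectors`
    maps a word to a list of floats (a stand-in for the GloVe model).
    """
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each word
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    term_freq = Counter(doc_tokens)

    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, count in term_freq.items():
        if word not in vectors:
            continue  # skip out-of-vocabulary words
        # Smoothed IDF stays positive even for words present in every document
        idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
        weight = (count / len(doc_tokens)) * idf
        total = [t + weight * v for t, v in zip(total, vectors[word])]
        weight_sum += weight

    if weight_sum == 0:
        return []  # no known words in this document
    return [t / weight_sum for t in total]

# Toy data (assumed): three tokenized reviews and 2-d word vectors
corpus = [['pizza', 'good'], ['sushi', 'good'], ['pizza', 'pizza', 'cheap']]
vectors = {'pizza': [1.0, 0.0], 'good': [0.0, 1.0], 'sushi': [1.0, 0.0]}

emb_common = tfidf_weighted_mean(corpus[0], corpus, vectors)
emb_rare = tfidf_weighted_mean(corpus[1], corpus, vectors)
print(emb_common, emb_rare)
```

In the second review the rarer word ("sushi") gets a larger weight than the common one ("good"), which is the effect a plain mean cannot capture.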


2. Encode Users as an Embedding that captures their preferences based on their preferred and not preferred restaurants.
  - Tried here:
    - User Embedding = Preferred Restaurants Embedding - Not Preferred Restaurants Embedding
    - Preferred Restaurants Embedding: Average of Positively Reviewed (Stars > 3) Restaurants
    - Not Preferred Restaurants Embedding: Average of Negatively Reviewed (Stars < 3) Restaurants. Neutral reviews (Stars = 3) are ignored.
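
As a small worked example of the subtraction in step 2, the sketch below uses two-dimensional toy vectors in place of real restaurant embeddings. The helper names `mean_vector` and `user_embedding` are hypothetical; the behaviour for users with no liked restaurants (an empty embedding) is assumed to mirror `get_user_embedding` further down.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def user_embedding(liked, disliked):
    """Mean of liked restaurant vectors minus mean of disliked ones."""
    if not liked:
        return []  # no positive reviews: no embedding can be built
    embedding = mean_vector(liked)
    if disliked:
        negative = mean_vector(disliked)
        embedding = [p - n for p, n in zip(embedding, negative)]
    return embedding

# Toy restaurant embeddings (assumed values)
liked = [[1.0, 0.0], [0.0, 1.0]]   # mean is [0.5, 0.5]
disliked = [[0.5, 0.0]]            # mean is [0.5, 0.0]

print(user_embedding(liked, disliked))  # [0.0, 0.5]
```

The result points away from the direction shared by the disliked restaurants and keeps only what is specific to the liked ones.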


3. Train a Neural Network Model to take in a User Embedding and a Restaurant Embedding and predict the Stars Rating
  - Criterion: Mean Squared Error
  - MSE loss observed on test data: 0.01. This looks too good and needs more scrutiny: the evaluation loop summed per-batch mean losses and divided by the number of samples, so the figure is likely understated by roughly the batch size.
  - Alternative: This could be a model that takes one-hot vectors of Users and Restaurants and has a learnable Embedding layer in between. It might not scale to a huge number of Restaurants and Users.


4. Use the above model to predict Stars Rating of all Restaurants for a given (User, City), sort descending and recommend top 5.
  - Sample function given at the end of the notebook
  - Alternative: To avoid the Neural Network Model, rank the Restaurants of the given City by Cosine Similarity between the User Embedding and each Restaurant Embedding, sorted descending (most similar first). It could be a simpler model to serve if it works decently. (Right now it gives very different results from the Neural Net, and the quality of those results has not been evaluated.)
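
The cosine-similarity alternative in step 4 can be sketched without any model at all. The example below is a minimal illustration with made-up restaurant ids and two-dimensional vectors; `top_k_by_similarity` is a hypothetical helper, not a function from this notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_by_similarity(user_vec, restaurant_vecs, k=5):
    """Return the k restaurant ids most similar to the user, best first."""
    scored = [(cosine_similarity(user_vec, vec), rid)
              for rid, vec in restaurant_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [rid for _, rid in scored[:k]]

# Toy data (assumed): one user and three restaurants
user_vec = [1.0, 0.0]
restaurant_vecs = {
    'res_a': [2.0, 0.0],   # same direction as the user
    'res_b': [1.0, 1.0],   # 45 degrees away
    'res_c': [-1.0, 0.0],  # opposite direction
}

print(top_k_by_similarity(user_vec, restaurant_vecs, k=2))  # ['res_a', 'res_b']
```

Because cosine similarity ignores vector length, `res_a` ranks first even though its vector is twice as long as the user's.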


#### Price Range Prediction for Restaurants (Not Done)

##### Steps

1. Train a Neural Network Model to take in a Restaurant Embedding based on only reviews text and predict Price Range
  - The GloVe word embeddings based method mentioned above can be used as a start


2. Use the above model for prediction

### Set Data Path

In [None]:
env = 'local' # 'colab'

if env == 'colab':
    from google.colab import drive
    
    drive.mount('/content/gdrive')
    drivepath = 'gdrive/My Drive'

else:
    drivepath = '/Users/cijogeorge/Google Drive'

datapath = drivepath + '/github/yelp-dataset'

print('Done.')

### Import Libraries and Pre-trained Models

In [None]:
!pip install pandas==1.2.4
!pip install tqdm==4.60.0
!pip install gensim==4.0.1
!pip install python-Levenshtein==0.12.2
!pip install nltk==3.6.2
!pip install torch==1.8.1

import gensim.downloader as api
import logging
import multiprocessing
import nltk
import numpy as np
import os
import pandas as pd
import pickle
import re
import torch

from collections import Counter
from gensim.models import Word2Vec
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from numpy import dot
from numpy.linalg import norm
from sklearn import random_projection
from sklearn.model_selection import train_test_split
from torch import nn
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', 
                    level=logging.INFO)

# Load Glove Word Embeddings for Encoding Reviews Text
glove_word_model = api.load("glove-wiki-gigaword-50")

print('Done.')

### Load data

- Note: J

In [None]:
df_businesses_file = datapath + '/data/df_businesses.pkl'
df_reviews_file = datapath + '/data/df_reviews.pkl'

# df_businesses: 
#       Pandas dataframe with restaurants from yelp business dataset
# df_reviews: 
#       Pandas dataframe with restaurant reviews from yelp reviews dataset

# Load df_businesses data from disk if they exist, else generate
if os.path.isfile(df_businesses_file):
    print('Loading file:', df_businesses_file)

    with open(df_businesses_file, 'rb') as f:
        df_businesses = pickle.load(f)
    
else:
    print('Generating file:', df_businesses_file)

    # Import the data (chunksize returns jsonReader for iteration)
    businesses = pd.read_json("./data/yelp_academic_dataset_business.json", 
                              lines=True, orient='columns', chunksize=100000)

    # Read the data
    frames = []
    for business in tqdm(businesses):
        business = \
          business[business['categories'].str.contains('Restaurant.*') == True].reset_index()
        frames.append(business)

    df_businesses = pd.concat(frames, sort=False)
    del frames
    del businesses

    df_businesses = df_businesses[df_businesses['is_open'] == 1]
    df_businesses.set_index('business_id', inplace=True, drop=True)
        
    with open(df_businesses_file, 'wb') as f:
        pickle.dump(df_businesses, f)
    
# Load df_reviews data from disk if they exist, else generate
if os.path.isfile(df_reviews_file):
    print('Loading file:', df_reviews_file)

    with open(df_reviews_file, 'rb') as f:
        df_reviews = pickle.load(f)
else:
    print('Generating file:', df_reviews_file)

    reviews = pd.read_json("./data/yelp_academic_dataset_review.json", 
                           lines=True, orient='columns', chunksize=100000)
    frames = []
    for review in tqdm(reviews):
        review = review[review['business_id'].isin(df_businesses.index)]
        frames.append(review)

    df_reviews = pd.concat(frames, sort=False)

    del frames
    del reviews
    
    with open(df_reviews_file, 'wb') as f:
        pickle.dump(df_reviews, f)

# Select indices to keep aside test data for model evaluation 
train_indices, test_indices = train_test_split(range(0, len(df_reviews)), 
                                               test_size=0.20, 
                                               random_state=42)

# business_reviews: 
#       Dictionary mapping Restaurants and their review texts
# user_restaurants: 
#       Dictionary with mapping between Users and their preferred (stars > 3) 
#       and not preferred (stars < 3) Restaurants.

business_reviews_file = datapath + '/data/business_reviews.pkl'
user_restaurants_file = datapath + '/data/user_restaurants.pkl'

# Load business_reviews and user_restaurants from file if exists, else generate
if os.path.isfile(business_reviews_file) and os.path.isfile(user_restaurants_file):
    print('Loading files:')
    print(business_reviews_file)
    print(user_restaurants_file)
    
    with open(business_reviews_file, 'rb') as f:
        business_reviews = pickle.load(f)

    with open(user_restaurants_file, 'rb') as f:          
        user_restaurants = pickle.load(f)
    
else:
    print('Generating files:')
    print(business_reviews_file)
    print(user_restaurants_file)
    
    business_reviews = {}
    user_restaurants = {}

    # Only train indices are used to fetch data. Not touching test data.
    for index in tqdm(train_indices):
        business_reviews.setdefault(
            df_reviews.iloc[index]['business_id'], []).append(
                df_reviews.iloc[index]['text'])

        if df_reviews.iloc[index]['stars'] > 3:
            user_restaurants.setdefault(
                df_reviews.iloc[index]['user_id'], {}).setdefault(
                    'positive', []).append(df_reviews.iloc[index]['business_id'])

        if df_reviews.iloc[index]['stars'] < 3:
            user_restaurants.setdefault(
                df_reviews.iloc[index]['user_id'], {}).setdefault(
                    'negative', []).append(df_reviews.iloc[index]['business_id'])  

    with open(business_reviews_file, 'wb') as f:
        pickle.dump(business_reviews, f)

    with open(user_restaurants_file, 'wb') as f:          
        pickle.dump(user_restaurants, f)

print('Done.')

### Generate Restaurant User Stars Model using Word2Vec


In [None]:
restaurants_w2v_model_file = datapath + '/data/restaurants_w2v_model.pkl'

# Load the model if exists, else generate
if os.path.isfile(restaurants_w2v_model_file):
    print('Loading model:', restaurants_w2v_model_file)
    
    with open(restaurants_w2v_model_file, 'rb') as f:
        restaurants_w2v_model = pickle.load(f)
    
else:
    print('Generating model:', restaurants_w2v_model_file)
    
    # Group restaurants with same stars given by a user
    restaurant_groups = {}

    reviews = {'user_id': list(df_reviews['user_id']),
               'business_id': list(df_reviews['business_id']),
               'stars': list(df_reviews['stars'])}

    # Only train indices are used to fetch data. Not touching test data.
    for i in tqdm(train_indices):
        user_id = reviews['user_id'][i]
        business_id = reviews['business_id'][i]
        stars = reviews['stars'][i]

        restaurant_groups.setdefault(user_id, {}).setdefault(
            stars, []).append(business_id)

    # Generate restaurant lists for Word2Vec training.
    #     Idea: Restaurants that appear in similar "context" are similar.
    #           Context: Obtained same stars by a user.
    same_stars_restaurants = []
    for user_id in tqdm(restaurant_groups):
        for stars in restaurant_groups[user_id]:
            same_stars_restaurants.append(restaurant_groups[user_id][stars])

    # Word2Vec Model Training
    restaurants_w2v_model = Word2Vec(sentences=same_stars_restaurants, 
                            vector_size=50, 
                            window=1000, 
                            min_count=1, 
                            # Keep at least one worker on single-core machines
                            workers=max(1, multiprocessing.cpu_count() - 1))


    with open(restaurants_w2v_model_file, 'wb') as f:
        pickle.dump(restaurants_w2v_model, f)

    del same_stars_restaurants
    del reviews 
    del restaurant_groups

print('Done.')

### Helper functions

In [None]:
# Utils to clean text

def _remove_url(text):
    return re.sub(r'http\S+', '', text)

def _remove_non_alphabets(text):
    return re.sub(r'[^a-zA-Z]', ' ', text)

def _lowercase(text):
    return str(text).lower()

def _tokenize(text):
    return word_tokenize(text)

def _remove_stopwords(tokens):
    return [word for word in tokens if word not in stop_words]

def _lemmatize(tokens):
    return [lemmatizer.lemmatize(word=word, pos='v') for word in tokens]

def _remove_shortwords(tokens, length=2):
    return [word for word in tokens if len(word) > length]

def clean_tokenize(text):
    text = _remove_url(text)
    text = _remove_non_alphabets(text)
    text = _lowercase(text)
    
    tokens = _tokenize(text)
    tokens = _remove_stopwords(tokens)
    tokens = _lemmatize(tokens)
    tokens = _remove_shortwords(tokens, length=2)
    
    return tokens


# Utils to Generate/ Fetch Restaurant and User Embeddings

def get_restaurant_embedding(business_id):
    return business_embeddings.get(business_id, [])

def get_user_embedding(user_id):
    preferred_restaurants = user_restaurants.get(user_id, {}).get('positive', [])
    not_preferred_restaurants = user_restaurants.get(user_id, {}).get('negative', [])
    
    embedding = []
    
    if len(preferred_restaurants):
        preferred_embedding = \
            np.mean([get_restaurant_embedding(business_id) 
                     for business_id in preferred_restaurants], axis=0)
        embedding = preferred_embedding
        
    if len(not_preferred_restaurants):
        not_preferred_embedding = \
            np.mean([get_restaurant_embedding(business_id) 
                     for business_id in not_preferred_restaurants], axis=0)
    
        if len(embedding):
            embedding -= not_preferred_embedding
                    
    return list(embedding)

def generate_restaurant_embedding(business_id, userstars=True, reviews=True):
    _embedding = []
    
    if userstars:
        _embedding.extend(_generate_restaurant_userstars_embedding(business_id))
    
    if reviews: 
        _embedding.extend(_generate_restaurant_reviews_embedding(business_id))
        
    return _embedding

def _generate_restaurant_reviews_embedding(business_id):    
    reviews_tokens = []
    for text in business_reviews[business_id]:
        tokens = clean_tokenize(text)
        reviews_tokens.extend(tokens)
        
    reviews_tokens = [word for word in reviews_tokens 
                      if word in glove_word_model]
    _embedding = list(np.mean([glove_word_model[word] 
                               for word in reviews_tokens], axis=0))
    
    return _embedding

def _generate_restaurant_userstars_embedding(business_id):
    _embedding = list(restaurants_w2v_model.wv[business_id])
        
    return _embedding


print('Done.')

### Generate Restaurant Embeddings

In [None]:
# business_embeddings: 
#       Dictionary mapping Restaurants and their generated embeddings

# Load Restaurant Embeddings from file if exists, else generate
business_embeddings_file = datapath + '/data/business_embeddings.pkl'

if os.path.isfile(business_embeddings_file):
    print('Loading file:', business_embeddings_file)
    
    with open(business_embeddings_file, 'rb') as f:
        business_embeddings = pickle.load(f)

else:
    print('Generating file:', business_embeddings_file)
    
    business_embeddings = {}
    
    for business_id in tqdm(business_reviews):
        business_embeddings[business_id] = generate_restaurant_embedding(business_id)
        
    with open(business_embeddings_file, 'wb') as f:
        pickle.dump(business_embeddings, f)
    
print('Done.')

In [None]:
len(get_restaurant_embedding('AWsOwlorVHRSpgPJy1I0eg'))

In [None]:
print(get_user_liked_restaurants('q_QQ5kBBwlCcbL1s4NVK3g', 'Atlanta'))

In [None]:
print(get_recommendations('q_QQ5kBBwlCcbL1s4NVK3g', 'Atlanta'))

In [None]:
print('Done.')

### Simple Neural Network Model

In [None]:
# Custom Dataset

class ReviewsDataset(Dataset):
    def __init__(self, indices):
        super().__init__()
        self.indices = indices
        self.fallback_data = None
        
    def __len__(self):
        return len(self.indices)
    
    def __getitem__(self, idx):
        try:
            features, target = self._get_data(idx)
            return (features, target)
            
        except Exception:
            # Missing or malformed embeddings: substitute a known-good sample
            return self._get_fallback_data()

    def _get_fallback_data(self):
        if self.fallback_data is not None:
            return (self.fallback_data[0], self.fallback_data[1])
        
        for idx in range(0, len(self.indices)):            
            try:
                features, target = self._get_data(idx)
                
                if self.fallback_data is None:
                    self.fallback_data = (features, target)
                
                return (features, target)

            except Exception:
                continue

        
    def _get_data(self, idx):
        user_embedding = get_user_embedding(df_reviews.iloc[self.indices[idx]]['user_id'])
        restaurant_embedding = get_restaurant_embedding(df_reviews.iloc[self.indices[idx]]['business_id'])

        assert len(user_embedding) == 100
        assert len(restaurant_embedding) == 100

        features = np.array(user_embedding + restaurant_embedding, dtype=float)
        target = np.array(df_reviews.iloc[self.indices[idx]]['stars'] - 1, dtype=float)

        assert 0 <= target <= 4
        
        return (torch.Tensor(features), torch.Tensor(target))
       

# Neural Network Architecture

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        
        # Input: 100-d user embedding concatenated with 100-d restaurant embedding
        self.hidden = nn.Linear(200, 50)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=0.2)
        self.final = nn.Linear(50, 1)

    def forward(self, x):
        hidden = self.dropout(
            self.relu(
                self.hidden(x)))
        final = self.relu(self.final(hidden))
        
        return final


# Set device 
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('device:', device)


# Load the Neural Network Model if exists, else train
nn_model_file = datapath + '/data/nn_model.pt'

if os.path.isfile(nn_model_file):
    print('Loading Neural Network Model from file: ', nn_model_file)
    nn_model = NeuralNetwork().to(device)
    # map_location lets a model saved on GPU load on a CPU-only machine
    nn_model.load_state_dict(torch.load(nn_model_file, map_location=device))
    nn_model.eval()

else:
    print('Training Neural Network Model')

    # Train 
    
    # Epoch is set to 1 due to time & resource constraints
    num_epochs = 1
    batch_size = 128

    train = ReviewsDataset(train_indices)
    train_loader = DataLoader(train, batch_size=batch_size, shuffle=True)

    # Send model to device
    nn_model = NeuralNetwork().to(device)

    # Define loss & optimizer
    loss_function = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(nn_model.parameters(), lr=0.001)

    # Train the model
    total_step = len(train_loader)

    nn_model.train()
    for epoch in range(num_epochs):
        for i, (features, targets) in enumerate(train_loader):
            targets = targets.view(targets.size(0), 1)

            # Move tensors to the configured device
            features = features.to(device)
            targets = targets.to(device)

            # Forward pass
            outputs = nn_model(features)
            loss = loss_function(outputs, targets)

            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            if (i+1) % 100 == 0:
                print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch+1, num_epochs, i+1, total_step, loss.item()))

    # Save the model         
    print('Saving model to file: ', nn_model_file)
    torch.save(nn_model.state_dict(), nn_model_file)
    print('Done.')


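# Test the model using the test data kept aside
print('Running on test data')

nn_model.eval()

# Defined here as well so that evaluation also works when the model was
# loaded from file and the training branch above was skipped
batch_size = 128
loss_function = torch.nn.MSELoss()

test = ReviewsDataset(test_indices)
test_loader = DataLoader(test, batch_size=batch_size, shuffle=True)

with torch.no_grad():
    loss = 0.0
    total = 0
    
    for features, targets in tqdm(test_loader):
        targets = targets.view(targets.size(0), 1)

        features = features.to(device)
        targets = targets.to(device)
        
        outputs = nn_model(features)
        # MSELoss returns the batch mean, so weight it by the batch size
        # before summing; dividing by `total` then gives the per-sample MSE
        loss += loss_function(outputs, targets).item() * targets.size(0)
        total += targets.size(0)
        
    print('Avg. MSE loss on test data: {:.4f}'.format(loss / total))

print('Done.')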
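
A note on aggregating test loss: `torch.nn.MSELoss` with its default reduction returns the mean over a batch, so a dataset-level MSE needs each batch mean weighted by its batch size. Summing batch means and dividing by the number of samples understates the loss by roughly a factor of the batch size. The pure-Python sketch below illustrates the arithmetic on made-up predictions; `mse` and `dataset_mse` are hypothetical helpers, not part of this notebook.

```python
import math

def mse(preds, targets):
    """Mean squared error of one batch."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def dataset_mse(batches):
    """Per-sample MSE over all batches, weighting each batch by its size."""
    total_squared_error = 0.0
    n_samples = 0
    for preds, targets in batches:
        total_squared_error += mse(preds, targets) * len(preds)
        n_samples += len(preds)
    return total_squared_error / n_samples

# Toy batches (assumed values): (predictions, targets)
batches = [([1.0, 2.0], [0.0, 0.0]),   # batch MSE = (1 + 4) / 2 = 2.5
           ([3.0], [0.0])]             # batch MSE = 9 / 1 = 9.0

overall = dataset_mse(batches)         # (1 + 4 + 9) / 3
naive = sum(mse(p, t) for p, t in batches) / sum(len(p) for p, _ in batches)

print('weighted MSE:', overall)
print('sum of batch means / samples:', naive)
print('RMSE (in units of stars):', math.sqrt(overall))
```

Reporting the RMSE alongside the MSE makes the number easier to read, since it is in the same units as the star rating.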

### Get recommendations

In [None]:
# Cosine Similarity

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))


# Utils for recommendations

nn_model = NeuralNetwork().to(device)
nn_model.load_state_dict(torch.load(nn_model_file, map_location=device))
nn_model.eval()

# Use method='embedding_similarity' to use cosine similarity between
# User Embedding and Restaurant Embedding for scoring
def get_recommendations(user_id, city, method='neural_network'):
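    # Copy the slice so adding the 'reco_score' column below does not
    # trigger pandas' SettingWithCopyWarning
    df_businesses_subset = df_businesses[df_businesses['city'] == city].copy()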
        
    user_embedding = get_user_embedding(user_id)
    
    scores = []
    for business_id in df_businesses_subset.index:
        try:
            restaurant_embedding = get_restaurant_embedding(business_id)
            if method == 'embedding_similarity':
                scores.append(cosine_similarity(restaurant_embedding, 
                                                user_embedding))
            else:
                features = np.array(user_embedding + restaurant_embedding, 
                                  dtype=float)
                features = torch.Tensor(features)
                features = features.to(device)
        
                output = nn_model(features)
                scores.append(float(output))

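        except Exception:
            # No usable embedding for this user/restaurant pair:
            # fall back to a fixed default score
            if method == 'embedding_similarity':
                scores.append(0.5)

            else:
                # Model outputs are on the 0-4 scale (stars - 1)
                scores.append(3)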
        
    df_businesses_subset['reco_score'] = scores
    df_businesses_subset_sorted = df_businesses_subset.sort_values(by='reco_score', 
                                                                   ascending=False)
    
    return df_businesses_subset_sorted

print('Done.')

### Sample recommendation function calls

In [None]:
# Using Neural Network Model

reco = get_recommendations('q_QQ5kBBwlCcbL1s4NVK3g', 'Atlanta', method='neural_network')
print(reco[['name', 'address', 'reco_score']])

In [None]:
# Using Embedding Similarity (not evaluated)

reco = get_recommendations('q_QQ5kBBwlCcbL1s4NVK3g', 'Atlanta', method='embedding_similarity')
print(reco[['name', 'address', 'reco_score']])

### Done.