# RESTAURANT MENU RECOMMENDER

#### We popped into a restaurant today and were confused by the long menu card and the inventive dish names.
#### A diner may not know what a dish actually is; what they really need to know is what to order from this long menu.
#### Earlier reviews can help here, but they contain a lot of noise, and you still have to work out which dishes people are recommending.
#### This program reads restaurant and review information from Zomato and runs a basic sentiment analysis to report what people recommend at a restaurant.

## Let's begin by installing the libraries we need

In [1]:
!pip install dropbox

Collecting dropbox
  Downloading https://files.pythonhosted.org/packages/60/37/1874bedfdbac91c8abfb1ddb133599306d65fca44dcd592fa5e84afbf181/dropbox-9.4.0-py3-none-any.whl (543kB)
Installing collected packages: dropbox
Successfully installed dropbox-9.4.0


In [2]:
!pip install contractions

Collecting contractions
  Downloading https://files.pythonhosted.org/packages/85/41/c3dfd5feb91a8d587ed1a59f553f07c05f95ad4e5d00ab78702fbf8fe48a/contractions-0.0.24-py2.py3-none-any.whl
Collecting textsearch (from contractions)
  Downloading https://files.pythonhosted.org/packages/42/a8/03407021f9555043de5492a2bd7a35c56cc03c2510092b5ec018cae1bbf1/textsearch-0.0.17-py2.py3-none-any.whl
Collecting pyahocorasick (from textsearch->contractions)
  Downloading https://files.pythonhosted.org/packages/f4/9f/f0d8e8850e12829eea2e778f1c90e3c53a9a799b7f412082a5d21cd19ae1/pyahocorasick-1.4.0.tar.gz (312kB)
Collecting Unidecode (from textsearch->contractions)
  Downloading https://files.pythonhosted.org/packages/d0/42/d9edfed04228bacea2d824904cae367ee9efd05e6cce7ceaaedd0b0ad964/Unidecode-1.1.1-py2.py3-none-any.whl (238kB)
Building wheels 

In [3]:
!pip install TextBlob

Collecting TextBlob
  Downloading https://files.pythonhosted.org/packages/60/f0/1d9bfcc8ee6b83472ec571406bd0dd51c0e6330ff1a51b2d29861d389e85/textblob-0.15.3-py2.py3-none-any.whl (636kB)
Installing collected packages: TextBlob
Successfully installed TextBlob-0.15.3


In [4]:
!pip install num2words

Collecting num2words
  Downloading https://files.pythonhosted.org/packages/eb/a2/ea800689730732e27711c41beed4b2a129b34974435bdc450377ec407738/num2words-0.5.10-py3-none-any.whl (101kB)
Collecting docopt>=0.6.2 (from num2words)
  Downloading https://files.pythonhosted.org/packages/a2/55/8f8cab2afd404cf578136ef2cc5dfb50baa1761b68c9da1fb1e4eed343c9/docopt-0.6.2.tar.gz
Building wheels for collected packages: docopt
  Building wheel for docopt (setup.py) ... done
  Stored in directory: /home/dsxuser/.cache/pip/wheels/9b/04/dd/7daf4150b6d9b12949298737de9431a324d4b797ffd63f526e
Successfully built docopt
Installing collected packages: docopt, num2words
Successfully installed docopt-0.6.2 num2words-0.5.10


## The program makes REST calls to the Zomato API, so we import the libraries needed for that

In [5]:
import requests as rq
import json
import pandas as pd
import dropbox
from pandas.io.json import json_normalize

In [6]:
headers = {}

## Let's begin by defining a function for each task

In [7]:
def initialize(api_key):
    '''
    func to initialize the headers for the request URL, such as the API key
    '''
    headers = {
    'Accept': 'application/json',
    'user-key': api_key,
    }
    return headers

In [8]:
def make_query(func,args):

    '''
    func - the API endpoint to call, e.g. reviews/locations/search/collections
    args - dict of query parameters (string values) passed along with that call
    '''

    #construct the query
    req_url = "https://developers.zomato.com/api/v2.1/"
    req_url=req_url+func
    key_index=0
    for key, value in args.items() :
        if key_index == 0:
            req_url=req_url+"?"+key+"="+value
        else:
            req_url=req_url+"&"+key+"="+value
        key_index+=1
    
    return execute_query(req_url)
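
As a rough illustration of the URL construction above, the sketch below builds the same kind of query string with the standard library's `urlencode`, which also escapes spaces and special characters in the values. The helper name `build_url` is hypothetical and not part of this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://developers.zomato.com/api/v2.1/"

def build_url(func, args):
    """Hypothetical helper: join the endpoint name and query parameters into one URL."""
    query = urlencode(args)  # escapes spaces, '&' and other special characters
    return BASE_URL + func + ("?" + query if query else "")

url = build_url("locations", {"query": "New Delhi"})
print(url)
```

Escaping matters for location names that contain spaces or an ampersand, which plain string concatenation would pass through unchanged.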

In [9]:
def execute_query(query):
    '''
    query - the full request URL built by make_query
    headers - the global dict of meta data for the API call (set by initialize)
    '''
    response=rq.get(query,headers=headers)
    return(response.json())

In [10]:
def getLocation(location_name):
    '''
    get details of the most relevant location matching the given name
    '''
    func = "locations"
    args = { 'query': location_name }
    output = make_query(func,args)
    if (checkKey(output,'location_suggestions') == 1):
        print("location found")
    else:
        print("location key not found in this iteration")
    return output['location_suggestions'][0]

In [11]:
def getCollections(city_id):
    '''
    get the restaurant collections available for a city
    '''
    func = "collections"
    args =dict()
    args["city_id"] = city_id
    output = make_query(func,args)
    if (checkKey(output,'collections') == 1):
        print("collections found ")
    else:
        print("collections key not found in this iteration")
    return(output['collections'])

In [12]:
def getReviews(res_id):
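    '''
    get the reviews of a restaurant; returns an empty list when the response has none
    '''
    func = "reviews"
    args =dict()
    args["res_id"] = res_id
    output = make_query(func,args)
    if (checkKey(output,'reviews_shown') == 1):
        print("reviews found")
    else:
        print("reviews key not found in this iteration")
    #.get avoids a KeyError (which would abort the caller's loop) when 'user_reviews' is missing
    return(output.get('user_reviews', []))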

In [13]:
def search(city_id,collection_id,start=0,count=1):
    '''
    get the restaurants of one collection in the given city (entity_id), sorted by rating in descending order; start and count control which slice of the results is fetched
    '''
    func = "search"
    args = { 'entity_id':city_id, 'entity_type':'city', 'start':str(start), 'count':str(count),'collection_id':collection_id,'sort':'rating','order':'desc'}
    output = make_query(func,args)
    return(output)

In [14]:
def checkKey(d, key):
    '''
    func to check if a dictionary key exists before taking any actions on it; returns 1 if it does, else 0
    '''
    return 1 if key in d else 0

In [15]:
def extract_resto_and_reviews(collection_id,output,total_results):
    '''
    func to extract all restaurants of a collection from one search response. The caller invokes it repeatedly because one API call returns only 20 restaurants.
    The data from the JSON objects goes into individual (global) lists, which are finally combined into a dictionary that can be easily converted to a pandas frame.
    '''
    restro_temp_list = []
    resto_index = 0
    restro_temp_list.append(output['restaurants'])
    #print("Length of resto temp list ",len(restro_temp_list))
    for i in range(0,len(restro_temp_list)): #always len(restro_temp_list) = 1 
        try:
            for rest_dict_val in restro_temp_list[i]: #loop over restaurants within a collection

                if resto_index <=total_results: #till we get data for all restaurants in this collection  
                    if (checkKey(rest_dict_val,'restaurant') == 1):
                        #print("processing res_id ", rest_dict_val['restaurant']['id'])
                        collection_id_list.append(collection_id)
                        rest_id_list.append(rest_dict_val['restaurant']['id'])
                        rest_name_list.append(rest_dict_val['restaurant']['name'])
                        rest_locality_list.append(rest_dict_val['restaurant']['location']['locality'])
                        rest_user_rating_list.append(rest_dict_val['restaurant']['user_rating']['aggregate_rating'])

                        rev_output_list = getReviews(str(rest_dict_val['restaurant']['id']))

                        for rev_text_dict_val in rev_output_list: #loop for reviews within a restaurant
                            if (checkKey(rev_text_dict_val,'review') == 1):
                                if rev_text_dict_val['review']['review_text'] != '':
                                    review_rest_id_list.append(rest_dict_val['restaurant']['id'])
                                    review_id_list.append(rev_text_dict_val['review']['id'])
                                    review_text_list.append(rev_text_dict_val['review']['review_text'])
                                    review_rating_list.append(rev_text_dict_val['review']['rating'])                        

                        resto_index = resto_index+1 
        except KeyError:
            pass

In [96]:
def exhaustiveSearch(city_id,collection_id,collection_res_count):
    '''
    The basic call to the SEARCH API of Zomato.
    The API returns at most 20 results per call, so we keep querying the same collection until all its restaurants are obtained.
    '''
    start = 0
    cnt = 20
    resto_index = 0
    offset = start
    
    output = search(city_id,collection_id,start=offset,count=cnt)
    if (checkKey(output,'results_shown') == 1):
        total_results = output['results_shown']
        #print("results_shown = ", total_results)
        #print("collection_res_count = ",collection_res_count)

        if total_results == collection_res_count: #1st extraction itself gave all restaurants
            #print("ONLY 1 CALL ENOUGH to get all restaurants")        
            extract_resto_and_reviews(collection_id,output,total_results)               
        else:
            #print("WILL BE DOING MORE CALLS to get all restaurants")
            extract_resto_and_reviews(collection_id,output,total_results) #1st extraction as is since output is already obtained above 

            offset = start+total_results #determine new offset for next search

            while (offset <=collection_res_count):
                #print("offset=",offset)
                output = search(city_id,collection_id,start=offset,count=cnt)
                if (checkKey(output,'results_shown') == 1):
                    total_results = output['results_shown']
                    print("new results_shown = ", total_results)

                    extract_resto_and_reviews(collection_id,output,total_results)

                    offset = offset+total_results #determine new offset for next search
                    if(total_results == 0): #no more results coming
                        print("no more restaurants data to fetch")
                        break
                else:
                    print("no results found in this iteration")
                    break #offset cannot advance without results, so stop instead of looping forever
                    
    else:
        print("no results found in this iteration")

    #build the result dictionaries once; the lists are filled in by extract_resto_and_reviews
    all_review_data = {'rest_id':review_rest_id_list,'review_id':review_id_list, 
                       'review_text':review_text_list,'review_rating':review_rating_list}
    all_resto_data = {'collection_id':collection_id_list,'rest_id':rest_id_list, 
                      'rest_name':rest_name_list,'rest_locality':rest_locality_list,
                      'rest_user_rating':rest_user_rating_list}
    return(all_resto_data,all_review_data)
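
The paging idea behind the function above (ask for 20 results, move the offset forward, stop when nothing more arrives) can be sketched independently of the API. In this sketch `fetch_page` is a hypothetical stand-in for the `search` call, and the data source is a plain list rather than a real response.

```python
def fetch_all(fetch_page, total, page_size=20):
    """Collect items page by page until `total` items are seen or a page comes back empty.

    fetch_page(start, count) is a hypothetical stand-in for the search() call.
    """
    items = []
    offset = 0
    while offset < total:
        page = fetch_page(offset, page_size)
        if not page:  # nothing more to fetch, so stop instead of looping forever
            break
        items.extend(page)
        offset += len(page)
    return items

# Fake data source standing in for the API: 45 restaurant ids
fake_ids = list(range(45))

def fake_fetch(start, count):
    return fake_ids[start:start + count]

all_ids = fetch_all(fake_fetch, total=45)
print(len(all_ids))
```

Advancing the offset by the number of items actually returned, rather than by the page size, keeps the loop correct when the last page is short.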

In [113]:
def upload_file(file_from, file_to):
    '''Optionally export the pandas data to Dropbox so you can review it independently on your PC. Put your own Dropbox API token in the call below.'''
    dbx = dropbox.Dropbox("YOUR_DROPBOX_ACCESS_TOKEN")
    with open(file_from, 'rb') as f: #closes the file once the upload is done
        dbx.files_upload(f.read(), file_to)
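
Access tokens and API keys are easier to keep out of a shared notebook if they are read from the environment at run time. The sketch below is one way to do that; the helper `read_secret` and the variable name `DROPBOX_TOKEN` are assumptions, not part of the original program.

```python
import os

def read_secret(name, default=None):
    """Return the value of environment variable `name`, or raise if it is not set."""
    value = os.environ.get(name, default)
    if value is None:
        raise RuntimeError(f"Set the {name} environment variable first")
    return value

# Dummy value so the sketch runs on its own; in practice set it outside the notebook
os.environ["DROPBOX_TOKEN"] = "dummy-token"
token = read_secret("DROPBOX_TOKEN")
print(token)
```

The returned value could then be passed to `dropbox.Dropbox(...)` in place of a literal string.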

## The main processing begins here

In [86]:
if __name__ == "__main__":
    api_key = "YOUR_ZOMATO_API_KEY" #replace with your own Zomato API key
    headers = initialize(api_key)

    user_location=input("Enter location:")
    #user_location="Mumbai"
    location_result = getLocation(user_location)
    #print(location_result)
    
    collection_list = getCollections(str(location_result['city_id']))

Enter location:Mumbai
location found
collections found 


In [87]:
#collection_list

In [88]:
df_collections = pd.DataFrame.from_dict(json_normalize(collection_list), orient='columns')

In [89]:
df_collections.shape

(50, 7)

In [90]:
new_columns={'collection.collection_id':'collection_id','collection.description':'description','collection.res_count':'res_count','collection.title':'title'}
unwanted_columns=['collection.image_url','collection.share_url','collection.url']
df_collections = df_collections.rename(columns=new_columns)
df_collections = df_collections.drop(columns=unwanted_columns,axis=1)

In [91]:
df_collections.head()

Unnamed: 0,collection_id,description,res_count,title
0,1,Most popular restaurants in town this week,30,Trending This Week
1,274852,The hunt for the highest-rated restaurants in ...,233,"Great Food, No Bull"
2,29,The best new places in town,25,Newly Opened
3,304361,Binge. Chug. Groove!,15,Mumbai's Best Food & Party Destination - The Orb
4,4,The most idyllic outdoor-dining spots in the city,22,Outdoor Seating


## We now have all the collections; this program fetches data for only the first 8 of them

In [97]:
df_sample_collection = df_collections.head(8)

In [98]:
df_sample_collection

Unnamed: 0,collection_id,description,res_count,title
0,1,Most popular restaurants in town this week,30,Trending This Week
1,274852,The hunt for the highest-rated restaurants in ...,233,"Great Food, No Bull"
2,29,The best new places in town,25,Newly Opened
3,304361,Binge. Chug. Groove!,15,Mumbai's Best Food & Party Destination - The Orb
4,4,The most idyllic outdoor-dining spots in the city,22,Outdoor Seating
5,304866,Zomato's most searched restaurants of 2019 in ...,17,Most Searched 2019
6,304503,Zomato's best openings of 2019 - the crème de ...,19,Best Openings of 2019
7,40,From cookies and doughnuts to ice cream and ca...,86,Sweet Tooth


## Creating dataframe objects to hold review & restaurant data

In [99]:
df_reviews = pd.DataFrame(columns = ['rest_id', 'review_id','review_text','review_rating'])
df_restaurant = pd.DataFrame(columns = ['collection_id','rest_id', 'rest_name','rest_locality','rest_user_rating'])

## For every collection we fetch all its restaurants and their reviews

In [100]:
for index in df_sample_collection.index: #loop over each collection (8 for now)
    all_review_data = {}
    all_resto_data = {}
    rest_id_list = []
    rest_name_list = []
    rest_locality_list = []
    rest_user_rating_list = []
    collection_id_list = []
    review_rest_id_list = []
    review_id_list = []  
    review_text_list = []
    review_rating_list = []
    
    coll_res_count = df_sample_collection['res_count'][index]
    print("processing collection = ",df_sample_collection['collection_id'][index])
    
    all_resto_data,all_review_data = exhaustiveSearch(str(location_result['city_id']),str(df_sample_collection['collection_id'][index]),coll_res_count)
    df_restaurant = pd.concat([df_restaurant, pd.DataFrame(all_resto_data)],ignore_index=True) #add this collection's restaurant data to the frame
    df_reviews = pd.concat([df_reviews, pd.DataFrame(all_review_data)],ignore_index=True) #add this collection's review data to the frame
    print("")

print("processed all collections")

processing collection =  1
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found

processing collection =  274852
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
new results_shown =  20
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews found
reviews key not found in this iteration
new results_shown =  20
reviews foun

## We have now collected the restaurant and review data in pandas frames

## A restaurant may appear in more than one collection, so let's remove duplicates before further analysis

In [101]:
df_resto_orig = df_restaurant.copy()
df_review_orig = df_reviews.copy()

In [102]:
df_resto_orig.shape

(82, 5)

In [103]:
df_review_orig.shape

(298, 4)

In [107]:
#df_restaurant = df_restaurant.sort_values("rest_id", inplace = True) 
#df_reviews = df_reviews.sort_values("review_id", inplace = True) 

df_restaurant = df_restaurant.drop_duplicates(subset = 'rest_id', keep='first')
df_restaurant = df_restaurant.reset_index(drop=True)

df_reviews = df_reviews.drop_duplicates(subset = 'review_id', keep='first')
df_reviews = df_reviews.reset_index(drop=True)

In [108]:
df_restaurant.shape

(81, 5)

In [109]:
df_reviews.shape

(293, 4)

In [110]:
df_restaurant.head()

Unnamed: 0,collection_id,rest_id,rest_name,rest_locality,rest_user_rating
0,1,19296768,Mitron At George,Fort,3.9
1,1,19294334,Bora Bora Duty Free,"Linking Road, Bandra West",3.6
2,1,18969541,High Jack,"Versova, Andheri West",4.0
3,1,19289844,KPJ - Veg. Kitchen & Bar,Kandivali East,4.0
4,1,19281880,Queen's Deck,Churchgate,3.7


In [111]:
df_restaurant.to_csv('restaurant.csv', header=True)
df_reviews.to_csv('reviews.csv', header=True)

In [115]:
file1_from = 'restaurant.csv'
file1_to = '/DataScience/restaurant.csv'
file2_from = 'reviews.csv'
file2_to = '/DataScience/reviews.csv'

try:
    upload_file(file1_from,file1_to)
    upload_file(file2_from,file2_to)
except Exception as e: #the upload is optional, so report the problem and carry on
    print("upload skipped:", e)

In [116]:
#df_restaurant['rest_id'].value_counts()

In [117]:
#df_reviews['review_id'].value_counts()

## Now we import the libraries needed for NLP: NLTK and TextBlob

In [118]:
import nltk
import re
import unicodedata
import contractions
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize
from textblob import TextBlob
from num2words import num2words

In [119]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('brown')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/dsxuser/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/dsxuser/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/dsxuser/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package brown to /home/dsxuser/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/dsxuser/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [120]:
stopword = stopwords.words('english')
stopword.extend(["rated","rating","charge","charges","bill","gst","service","visited","again","repeat","cost","friend","place","ambience"])

## Defining a function for each NLP pre-processing task

In [121]:
def to_lowercase(text):
    return text.lower()

In [122]:
def remove_html(text):
    text = re.sub(r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)', '', text, flags=re.MULTILINE) # to remove links that start with HTTP/HTTPS in the review
    text = re.sub(r'[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)', '', text, flags=re.MULTILINE) # to remove other url links
    
    return text
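
Character ranges in patterns like these need a plain ASCII hyphen. A typographic en dash, which copying from formatted text can introduce, silently turns the range into three literal characters. The short check below shows the difference; the pattern names are made up for this illustration.

```python
import re

digits_ok = re.compile(r"[0-9]+")          # ASCII hyphen: a range covering 0 through 9
digits_broken = re.compile("[0\u20139]+")  # en dash: only the characters '0', en dash and '9'

print(bool(digits_ok.fullmatch("2019")))      # True
print(bool(digits_broken.fullmatch("2019")))  # False, because '2' and '1' are not in the class
```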

In [123]:
def remove_hastags_mentions(text):
    text = re.sub(r"#(\w+)", ' ', text, flags=re.MULTILINE) # to remove hashtags
    text = re.sub(r"@(\w+)", ' ', text, flags=re.MULTILINE) # to remove mentions
    return text

In [124]:
def remove_emoji(text):
    '''func to remove emoji & symbols'''
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)
    return text

In [125]:
def remove_punctuations(text):
    punctuations = r'''!()-![]{};:+'"\,<>/?@#$%^&*_~''' #raw string so the backslash is taken literally
    text = ''.join([i for i in text if i not in punctuations]) # to remove punctuations except .
    return text

In [126]:
def remove_contractions(text):
    '''func to convert contractions to full words e.g. can't will become cannot'''
    return ' '.join([contractions.fix(word) for word in text.split()])

In [127]:
def replace_ordinal_numbers(text):
    '''func to spell out ordinal numbers e.g. 2nd will become second'''
    re_results = re.findall(r'(\d+(st|nd|rd|th))', text)
    for entire_result, suffix in re_results:
        num = int(entire_result[:-2])
        text = text.replace(entire_result, num2words(num, ordinal=True))
    return text

In [128]:
def remove_numbers(text):
    return ''.join(c for c in text if not c.isdigit())

In [129]:
def remove_stopwords(text):
    word_tokens = nltk.word_tokenize(text)
    text = ' '.join([word for word in word_tokens if word not in stopword])
    return text

In [130]:
def lemmatize(text):
    wordnet_lemmatizer = WordNetLemmatizer()
    word_tokens = nltk.word_tokenize(text)
    text = ' '.join([wordnet_lemmatizer.lemmatize(word) for word in word_tokens])
    return text

In [131]:
def clean_review(text,is_Lower=True,remove_links=True,remove_hash_mentions=True,remove_emojis=True,
                 remove_punct=True,remove_cont=True,remove_ordinals=True,remove_nos=True,remove_stop=True,
                 is_lemma=True):
    '''
    func to pre-process the review text. Set a param to False to skip that pre-processing task
    '''
    if is_Lower:
        text = to_lowercase(text)
    if remove_links:
        text= remove_html(text)
    if remove_hash_mentions:
        text=remove_hastags_mentions(text)
    if remove_emojis:
        text= remove_emoji(text)
    if remove_cont: #expand contractions before punctuation removal strips the apostrophes they rely on
        text= remove_contractions(text)
    if remove_punct:
        text= remove_punctuations(text)
    if remove_ordinals:
        text= replace_ordinal_numbers(text)
    if remove_nos:
        text= remove_numbers(text)
    if remove_stop:
        text= remove_stopwords(text)
    if is_lemma:
        text= lemmatize(text)
    
    return (text)    
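
The order of the steps matters: a contraction can only be expanded while its apostrophe is still present. The sketch below shows this with a tiny hand-written map, which is an assumption for illustration only and stands in for the `contractions` library.

```python
# Tiny illustrative map; the notebook itself uses the contractions library instead
CONTRACTIONS = {"can't": "cannot", "don't": "do not"}

def expand(text):
    """Replace each known contraction with its expanded form."""
    return " ".join(CONTRACTIONS.get(word, word) for word in text.split())

def strip_punct(text):
    """Remove apostrophes, commas and exclamation marks."""
    return "".join(c for c in text if c not in "',!")

review = "can't wait, don't miss it!"
expand_first = strip_punct(expand(review))
strip_first = expand(strip_punct(review))
print(expand_first)
print(strip_first)
```

Expanding first yields full words such as "cannot", while stripping first leaves "cant" and "dont" behind, which a stopword list will not recognise.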

In [132]:
def get_ngram_noun_phrases(sentence, ngrams = 2):
    '''
    func to get noun phrases using TextBlob. A dish name is usually a noun phrase, but it could be 2, 3, 4 or 5 words long,
    so this extracts the noun phrases from every n-gram of the sentence and adds the new ones to a list
    '''
    sent_blob = TextBlob(str(sentence))
    ngram_list = sent_blob.ngrams(n=ngrams)   
    #print(ngram_list)
    #print("")
    for pair in ngram_list:
        ngram = ' '.join(pair)
        word_blob = TextBlob(ngram)
        
        for np in word_blob.noun_phrases:
            #print(np)
            if np not in noun_phrases_list:
                noun_phrases_list.append(np)

In [133]:
def generate_noun_phrases(rev_blob):
    '''
    func to generate all noun phrases from a blob object
    '''
    for sent in rev_blob.sentences:   
        get_ngram_noun_phrases(sent,2) #bigram phrases
        get_ngram_noun_phrases(sent,3) #trigram phrases
        get_ngram_noun_phrases(sent,4) 
        get_ngram_noun_phrases(sent,5)
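
For readers unfamiliar with n-grams, the sliding window that produces them can be sketched in plain Python. This is only an illustration of the idea; the helper `ngrams` below is hypothetical and is not TextBlob's implementation.

```python
def ngrams(words, n):
    """Return every run of n consecutive words as a list."""
    return [words[i:i + n] for i in range(len(words) - n + 1)]

words = "started rocket leaf salad".split()
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)
print(bigrams)
print(trigrams)
```

A sentence shorter than `n` words simply yields no n-grams, which is why asking for 4-grams and 5-grams on short sentences is harmless.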

In [134]:
def generate_possible_menu_list(noun_phrases_list):
    '''
    func that keeps the noun phrases whose first and last words are both tagged as nouns (NN/NNS), which makes them possible dish names
    '''
    for element in noun_phrases_list:
        tag_list = []
        phrase_blob = TextBlob(element)
        tag_list = phrase_blob.tags
        #print("tag_list ",tag_list)
        if ((tag_list[0][1] == 'NN' or tag_list[0][1] == 'NNS') and (tag_list[-1][1] == 'NN' or tag_list[-1][1] == 'NNS')):
            possible_menu_list.append(element)
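
The filter above looks only at the first and last tag of each phrase. A minimal sketch of that rule on pre-tagged input follows; the tags in the examples are written by hand as an assumption and were not produced by a tagger, and `looks_like_dish` is a hypothetical name.

```python
NOUN_TAGS = {"NN", "NNS"}

def looks_like_dish(tagged_phrase):
    """tagged_phrase is a list of (word, tag) pairs; empty input is rejected."""
    if not tagged_phrase:
        return False
    return tagged_phrase[0][1] in NOUN_TAGS and tagged_phrase[-1][1] in NOUN_TAGS

# Hand-written tags for illustration only
salad = [("rocket", "NN"), ("leaf", "NN"), ("salad", "NN")]
spicy = [("spicy", "JJ"), ("noodles", "NNS")]
print(looks_like_dish(salad))
print(looks_like_dish(spicy))
```

Checking for an empty list first avoids an `IndexError` on phrases for which the tagger returns nothing.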

In [135]:
def remove_duplicate_noun_phrases(possible_menu_list):
    '''
    We could get similar dish names while parsing each noun phrase,
    e.g. chicken noodles vs spicy chicken noodles. In such cases we keep only the longest phrase
    '''
    for i, elements in enumerate(possible_menu_list):
        try:
            thiselem = str(elements)
            matching_elements = [s for s in possible_menu_list if thiselem in s]
            #print(matching_elements)
            for j, val in enumerate(matching_elements):
                curr_val = str(val)
                next_val = str(matching_elements[(j + 1) % len(matching_elements)])

                if len(curr_val) > len(next_val):
                    #print("Removing ",next_val)
                    if next_val not in exclude_list:
                        exclude_list.append(next_val)
                elif len(curr_val) < len(next_val):
                    #print("Removing ",curr_val)
                    if curr_val not in exclude_list:
                        exclude_list.append(curr_val)
        except:
            pass
    
    for x in exclude_list:
        try:
            possible_menu_list.remove(x)
        except:
            pass
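
The same goal, keeping only the longest of several overlapping phrases, can be sketched more compactly. This is an alternative illustration under the assumption that "overlapping" means one phrase is a substring of another; it is not guaranteed to match the function above case for case, and `keep_longest` is a hypothetical name.

```python
def keep_longest(phrases):
    """Drop any phrase that is contained in a different phrase from the same list."""
    unique = list(dict.fromkeys(phrases))  # remove exact repeats, keep order
    return [p for p in unique
            if not any(p != q and p in q for q in unique)]

candidates = ["chicken noodles", "spicy chicken noodles", "star anise", "star anise"]
menu = keep_longest(candidates)
print(menu)
```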

In [136]:
def sentiment_textblob(feedback): 
    '''
    func to map the sentence polarity to a user defined sentiment label
    '''
    senti = TextBlob(feedback) 
    polarity = senti.sentiment.polarity 
    if -1 <= polarity < -0.5: 
        label = 'very bad' 
    elif -0.5 <= polarity < -0.1: 
        label = 'bad' 
    elif -0.1 <= polarity < 0.2: 
        label = 'ok' 
    elif 0.2 <= polarity < 0.6: 
        label = 'good' 
    elif 0.6 <= polarity <= 1: 
        label = 'best' 
    
    return (polarity, label) 
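
The threshold logic above can be checked without TextBlob by separating the bucketing from the scoring. The sketch below reproduces the same boundaries with `bisect`; the function name `polarity_to_label` is hypothetical.

```python
from bisect import bisect_right

BOUNDS = [-0.5, -0.1, 0.2, 0.6]                    # lower edge of each label after the first
LABELS = ['very bad', 'bad', 'ok', 'good', 'best']

def polarity_to_label(polarity):
    """Map a polarity in [-1, 1] to one of the five labels."""
    return LABELS[bisect_right(BOUNDS, polarity)]

for score in (-1.0, -0.5, 0.136364, 0.525, 0.6):
    print(score, polarity_to_label(score))
```

Because `bisect_right` places a value equal to a boundary in the higher bucket, a score of exactly 0.6 is labelled 'best', matching the `0.6 <= polarity <= 1` branch.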

In [137]:
def analyze_review_to_get_menu_and_sentiment(id,text):
    '''
    func to analyze each review text
    1. Convert to a BLOB
    2. Get noun phrases using bigrams, trigrams, quadgrams and so on
    3. Get possible menu list items (all nouns POS)
    4. Retain menu items(phrase) with max length i.e. remove duplicates
    5. Search each item with each sentence of the given review text and if found
       a. Get sentiment score, label of that sentence
       b. break
       c. Store in list for adding to a pandas frame
    '''
    
    #clean_review params in order: is_Lower, remove_links, remove_hash_mentions, remove_emojis,
    #remove_punct, remove_cont, remove_ordinals, remove_nos, remove_stop, is_lemma
    
    cl_review = clean_review(text,True,True,True,True,True,True,True,True,True,True)
    
    rev_blob = TextBlob(cl_review)
    rev_blob_s = rev_blob #same cleaned text, used below for sentence splitting
        
    generate_noun_phrases(rev_blob)
    #print("noun_phrases_list = ", noun_phrases_list)
    
    generate_possible_menu_list(noun_phrases_list)
    #print("possible menu_list = ",possible_menu_list) 
    
    remove_duplicate_noun_phrases(possible_menu_list)
    #print("possible menu_list after duplicates removal=",possible_menu_list)
    
    raw_sentence_list = sent_tokenize(str(rev_blob_s))
    
    for m in possible_menu_list:
        #reset for every item so that an unmatched item does not inherit the previous item's sentence and score
        detected_sentence = ''
        polarity= -99 #sentinel meaning no sentence matched
        label = ''
        for sentence in raw_sentence_list:
            if m in sentence:
                (polarity, label) = sentiment_textblob(str(sentence))
               
                detected_sentence = sentence
                break
        res_id_list.append(id)
        menu_item_list.append(m)
        detected_sentence_list.append(detected_sentence)
        sent_polarity_list.append(polarity)
        sentiment_label_list.append(label)
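
Step 5 of the docstring, finding the first sentence that mentions a menu item, can be isolated into a small helper. The sketch below uses sentences taken from the sample output of this notebook; the helper name `first_sentence_with` is hypothetical.

```python
def first_sentence_with(item, sentences):
    """Return the first sentence containing `item`, or '' when none does."""
    return next((s for s in sentences if item in s), '')

sentences = [
    "started rocket leaf salad totally worth .",
    "amazing cocktail flavoured cinnamon star anise .",
]
print(first_sentence_with("star anise", sentences))
print(repr(first_sentence_with("paneer", sentences)))
```

Returning an empty string for an unmatched item makes it explicit that no sentence, and therefore no sentiment score, belongs to that item.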

## Temporary code to read data files directly from dropbox

In [138]:
#dbx = dropbox.Dropbox("YOUR_DROPBOX_ACCESS_TOKEN")
#with open("reviews.csv", "wb") as f:
#    metadata, res = dbx.files_download(path="/DataScience/reviews.csv")
#    f.write(res.content)

In [139]:
#with open("restaurant.csv", "wb") as f:
#    metadata, res = dbx.files_download(path="/DataScience/restaurant.csv")
#    f.write(res.content)

In [140]:
#df_reviews = pd.read_csv("reviews.csv")
#df_restaurant = pd.read_csv('restaurant.csv')

In [141]:
df_reviews['clean_review_text'] = df_reviews.review_text.apply(clean_review)

In [142]:
df_reviews.head()

Unnamed: 0,rest_id,review_id,review_text,review_rating,clean_review_text
0,19296768,47205382,"MITRON is a new place thats opened up in Fort,...",5,mitron new opened fort mustvisit . good food g...
1,19296768,47149493,Our bill was around 6k which was inclusive of ...,2,around k inclusive bomb ₹ authority ask waive ...
2,19296768,47112378,The place has newly opened. The ambience is re...,5,newly opened . really good comfortable sitting...
3,19296768,47101566,From serving one of the best pizzas in town to...,5,serving one best pizza town range quirky pocke...
4,19294334,47274716,Loved the pizza and the baos!,5,loved pizza baos


### Now we merge the reviews (rated 3 and above) into a single review text per restaurant

In [143]:
df_merged_reviews = pd.DataFrame(columns = ['rest_id','review_text'])
res_id_list = []
merged_reviews_list = []
old_rest_id = 0
curr_rest_id = 0
temp_text = ''
for index in df_reviews.index:
    if int(df_reviews['review_rating'][index]) >=3:
        cleaned_review = df_reviews['review_text'][index]
        curr_rest_id = df_reviews['rest_id'][index]
        #print(curr_rest_id)
        if old_rest_id == 0 or old_rest_id == curr_rest_id: #1st iteration or same restaurant, so keep joining its reviews
            if temp_text == '':
                temp_text = temp_text+cleaned_review
            else:
                temp_text = temp_text+'.'+cleaned_review
        else:
            res_id_list.append(old_rest_id)
            merged_reviews_list.append(temp_text)
            temp_text = ''
            old_rest_id = 0
            curr_rest_id = 0
            if temp_text == '':
                temp_text = temp_text+cleaned_review
            else:
                temp_text = temp_text+'.'+cleaned_review

        old_rest_id = curr_rest_id

res_id_list.append(curr_rest_id) #from last iteration
merged_reviews_list.append(temp_text) #from last iteration

all_cleaned_review_data = {'rest_id':res_id_list,'review_text':merged_reviews_list}
df_merged_reviews = pd.concat([df_merged_reviews, pd.DataFrame(all_cleaned_review_data)],ignore_index=True)
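
The merging step can also be expressed as a grouping by restaurant id. The sketch below does this on plain tuples with made-up ids and texts; unlike the loop above, it does not rely on the reviews of one restaurant sitting next to each other. The function name `merge_reviews` is hypothetical.

```python
from collections import defaultdict

def merge_reviews(rows, min_rating=3):
    """rows: iterable of (rest_id, review_text, rating) tuples.

    Returns {rest_id: merged_text} for reviews rated min_rating or above.
    """
    grouped = defaultdict(list)
    for rest_id, text, rating in rows:
        if int(rating) >= min_rating:
            grouped[rest_id].append(text)
    return {rest_id: ".".join(texts) for rest_id, texts in grouped.items()}

# Made-up rows for illustration
rows = [
    (101, "Loved the pizza", 5),
    (102, "Great experience", 4),
    (101, "Too salty", 2),
    (101, "Nice baos", 3),
]
merged = merge_reviews(rows)
print(merged)
```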

In [144]:
df_merged_reviews.head()

Unnamed: 0,rest_id,review_text
0,19296768,"MITRON is a new place thats opened up in Fort,..."
1,19294334,Loved the pizza and the baos!.Soo soothing & S...
2,18969541,Amazing experience!.Great experience!.really l...
3,19289844,KPJ - Veg. Kitchen & Bar..... Khao Piyo Jiyo ...
4,19281880,I consider myself somewhat of a tea aficionado...


In [145]:
#df_merged_reviews['review_text'][0]

## Processing all merged reviews and storing the results in a new frame

In [146]:
df_analyzed_reviews = pd.DataFrame(columns = ['rest_id','menu_item','sentiment_score','sentiment_label','what_people_said'])
for index in df_merged_reviews.index: #loop over each restaurant's merged review
    
    res_id_list = []
    menu_item_list = []
    detected_sentence_list = []
    sent_polarity_list = []
    sentiment_label_list = []
    
    noun_phrases_list = []
    possible_menu_list = []
    exclude_list = []
    raw_sentence_list=[]
    
    analyze_review_to_get_menu_and_sentiment(df_merged_reviews['rest_id'][index],str(df_merged_reviews['review_text'][index]))
    
    #print(menu_item_list)
    #print(sent_polarity_list)
    #print(sentiment_label_list)
    #print(detected_sentence_list)
    #break
    
    all_cleaned_review_data = {'rest_id':res_id_list, 'menu_item':menu_item_list,'sentiment_score':sent_polarity_list,
                           'sentiment_label':sentiment_label_list,'what_people_said':detected_sentence_list}

    #add this restaurant's analyzed rows to the result frame (DataFrame.append was removed in pandas 2.0)
    df_analyzed_reviews = pd.concat([df_analyzed_reviews, pd.DataFrame(all_cleaned_review_data)], ignore_index=True)

print("processed all reviews")

processed all reviews
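
Growing a DataFrame inside the loop copies the whole frame on every pass. A lighter pattern is to collect one small frame per restaurant and concatenate once at the end. The sketch below illustrates that pattern only: `analyze_stub` is a hypothetical stand-in for analyze_review_to_get_menu_and_sentiment that returns rows with dummy values instead of filling shared lists, and the demo frame is made up.

```python
import pandas as pd

COLUMNS = ['rest_id', 'menu_item', 'sentiment_score', 'sentiment_label', 'what_people_said']

def analyze_stub(rest_id, review_text):
    """Hypothetical stand-in: one row per sentence, with placeholder values."""
    rows = []
    for sentence in review_text.split('.'):
        sentence = sentence.strip()
        if sentence:
            rows.append({'rest_id': rest_id,
                         'menu_item': sentence.split()[0],  # placeholder, not real extraction
                         'sentiment_score': 0.0,
                         'sentiment_label': 'ok',
                         'what_people_said': sentence})
    return rows

def analyze_all(df_merged_reviews, analyze=analyze_stub):
    """Build one frame per restaurant, then concatenate them in a single step."""
    frames = []
    for rest_id, text in zip(df_merged_reviews['rest_id'], df_merged_reviews['review_text']):
        frames.append(pd.DataFrame(analyze(rest_id, str(text)), columns=COLUMNS))
    if not frames:
        return pd.DataFrame(columns=COLUMNS)
    return pd.concat(frames, ignore_index=True)

# Made-up input, for illustration only
demo = pd.DataFrame({'rest_id': [1, 2],
                     'review_text': ['Loved the pizza.Great baos', 'Nice tea']})
analyzed_demo = analyze_all(demo)
```

Returning rows from the analysis function, rather than mutating module-level lists, also removes the need to reset those lists at the top of every iteration.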


In [147]:
df_analyzed_reviews

Unnamed: 0,rest_id,menu_item,sentiment_score,sentiment_label,what_people_said
0,19296768,fort mustvisit,0.136364,ok,mitron new opened fort mustvisit .
1,19296768,cocktail sheesha,0.525000,good,good food great interior really nice cocktail ...
2,19296768,rocket leaf salad,0.300000,good,started rocket leaf salad totally worth .
3,19296768,paneer starter peri peri fry,0.150000,ok,post healthy start pan fried paneer starter pe...
4,19296768,cinnamon star,0.600000,best,amazing cocktail flavoured cinnamon star anise .
5,19296768,star anise,0.600000,best,amazing cocktail flavoured cinnamon star anise .
6,19296768,course lemon,0.038095,ok,main course lemon marinated grilled chicken gr...
7,19296768,perfection succulent,0.038095,ok,main course lemon marinated grilled chicken gr...
8,19296768,amount lemon,0.038095,ok,main course lemon marinated grilled chicken gr...
9,19296768,fire bowl streetstyle chinese dish,0.300000,good,fire bowl streetstyle chinese dish really quit...


## Let's sort the menu items of each restaurant in descending order of sentiment score

In [148]:
df_analyzed_reviews = df_analyzed_reviews.sort_values(['rest_id','sentiment_score'],ascending=[True, False]) #rest_id ascending, score descending

In [149]:
df_analyzed_reviews.head()

Unnamed: 0,rest_id,menu_item,sentiment_score,sentiment_label,what_people_said
525,41304,splendid evening,0.766667,best,good stop splendid evening vegetarian lover .
510,41304,group hv quality,0.64,best,food great spread expectation also good.. over...
511,41304,family dinner plan birthday celebration,0.64,best,food great spread expectation also good.. over...
521,41304,menu ensures,0.6,best,huge spread menu ensures everyone happy .
512,41304,time gettogether,0.497273,good,friend long time gettogether.. really good chi...
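
With the rows sorted this way, the top few items for each restaurant can be picked with `groupby(...).head(n)`. The following is a minimal sketch on made-up rows; `top_items_per_restaurant` and the sample values are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

def top_items_per_restaurant(df_analyzed, n=3):
    """Return the n highest-scoring rows for every rest_id."""
    ordered = df_analyzed.sort_values(['rest_id', 'sentiment_score'],
                                      ascending=[True, False])
    # head(n) on a groupby keeps the first n rows of each group in the sorted order
    return ordered.groupby('rest_id').head(n).reset_index(drop=True)

# Made-up rows, for illustration only
sample = pd.DataFrame({
    'rest_id': [2, 1, 1, 2, 1],
    'menu_item': ['tea', 'pizza', 'bao', 'fries', 'salad'],
    'sentiment_score': [0.2, 0.7, 0.9, 0.6, 0.1],
})
top2 = top_items_per_restaurant(sample, n=2)
```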


## Now we merge all frames with pandas merge to arrive at the final output
## The final output looks like
### COLLECTION DETAILS | RESTAURANT DETAILS | MENU ITEMS | SENTIMENT SCORE of what people said about each menu item

In [150]:
df_restaurant['collection_id'] = pd.to_numeric(df_restaurant['collection_id']) #match the key dtype in df_collections before merging

In [151]:
df_temp1 = pd.merge(df_collections,df_restaurant,on='collection_id',how='inner') #merged collections with resto frame

In [152]:
df_final = pd.merge(df_temp1,df_analyzed_reviews,on='rest_id',how='inner') #merged with analyzed reviews

In [153]:
unwanted_columns=['collection_id','res_count']
df_final = df_final.drop(columns=unwanted_columns)

In [154]:
df_final = df_final.reset_index(drop=True)

In [155]:
df_final.head()

Unnamed: 0,description,title,rest_id,rest_name,rest_locality,rest_user_rating,menu_item,sentiment_score,sentiment_label,what_people_said
0,Most popular restaurants in town this week,Trending This Week,19296768,Mitron At George,Fort,3.9,pav bhaji fondue,1.0,best,pav bhaji fondue must one best .
1,Most popular restaurants in town this week,Trending This Week,19296768,Mitron At George,Fort,3.9,mitron triangle,0.7,best,mitron triangle good like patti samosa stuffed...
2,Most popular restaurants in town this week,Trending This Week,19296768,Mitron At George,Fort,3.9,triangle good,0.7,best,mitron triangle good like patti samosa stuffed...
3,Most popular restaurants in town this week,Trending This Week,19296768,Mitron At George,Fort,3.9,corn cheese,0.7,best,mitron triangle good like patti samosa stuffed...
4,Most popular restaurants in town this week,Trending This Week,19296768,Mitron At George,Fort,3.9,cinnamon star,0.6,best,amazing cocktail flavoured cinnamon star anise .


In [156]:
good_sentiments = ['best', 'good']
df_final = df_final.loc[df_final['sentiment_label'].isin(good_sentiments)]

In [157]:
df_final.head()

Unnamed: 0,description,title,rest_id,rest_name,rest_locality,rest_user_rating,menu_item,sentiment_score,sentiment_label,what_people_said
0,Most popular restaurants in town this week,Trending This Week,19296768,Mitron At George,Fort,3.9,pav bhaji fondue,1.0,best,pav bhaji fondue must one best .
1,Most popular restaurants in town this week,Trending This Week,19296768,Mitron At George,Fort,3.9,mitron triangle,0.7,best,mitron triangle good like patti samosa stuffed...
2,Most popular restaurants in town this week,Trending This Week,19296768,Mitron At George,Fort,3.9,triangle good,0.7,best,mitron triangle good like patti samosa stuffed...
3,Most popular restaurants in town this week,Trending This Week,19296768,Mitron At George,Fort,3.9,corn cheese,0.7,best,mitron triangle good like patti samosa stuffed...
4,Most popular restaurants in town this week,Trending This Week,19296768,Mitron At George,Fort,3.9,cinnamon star,0.6,best,amazing cocktail flavoured cinnamon star anise .
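
An inner merge silently drops rows that have no match, so it can be worth checking what survives each join. The sketch below replays the two merges and the label filter on tiny made-up frames. The column names follow the notebook, but the frame contents are hypothetical, and the `validate` argument of `pd.merge` is an optional check added here, not something the notebook uses.

```python
import pandas as pd

# Made-up frames, for illustration only
collections = pd.DataFrame({'collection_id': [1, 2],
                            'title': ['Trending This Week', 'Best Cafes']})
restaurants = pd.DataFrame({'collection_id': ['1', '1', '2'],
                            'rest_id': [10, 11, 12],
                            'rest_name': ['A', 'B', 'C']})
analyzed = pd.DataFrame({'rest_id': [10, 10, 12],
                         'menu_item': ['pizza', 'bao', 'tea'],
                         'sentiment_label': ['best', 'ok', 'good']})

# Merge keys must share a dtype, so convert the string ids first
restaurants['collection_id'] = pd.to_numeric(restaurants['collection_id'])

# validate raises MergeError if collection_id is not unique on the left side
temp = pd.merge(collections, restaurants, on='collection_id',
                how='inner', validate='one_to_many')
final = pd.merge(temp, analyzed, on='rest_id', how='inner')
final = final.drop(columns=['collection_id'])

# Keep only the items people spoke well of
recommended = final[final['sentiment_label'].isin(['best', 'good'])].reset_index(drop=True)
```

In this toy data restaurant 11 has no analyzed reviews, so it disappears from `final`; using `how='left'` in the second merge would keep it with empty menu columns instead.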


In [158]:
df_final.to_csv('final.csv')

In [159]:
try:
    file1_from = 'final.csv'
    file1_to = '/DataScience/final1.csv'

    upload_file(file1_from,file1_to)
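except Exception as exc:
    #report the failure instead of silently ignoring it; final.csv is still available locally
    print("Upload failed:", exc)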
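
If the upload is optional, one way to keep the notebook running while still recording what went wrong is a small wrapper around the upload call. This is a sketch under assumptions: `safe_upload` and `failing_upload` are hypothetical helpers, and the `upload` argument stands in for the notebook's upload_file, whose definition is not shown in this part.

```python
import logging

logger = logging.getLogger(__name__)

def safe_upload(upload, local_path, remote_path):
    """Call upload(local_path, remote_path); return True on success, False on failure."""
    try:
        upload(local_path, remote_path)
    except Exception as exc:
        # Log the reason so a failed upload does not go unnoticed
        logger.warning("Upload of %s to %s failed: %s", local_path, remote_path, exc)
        return False
    return True

def failing_upload(local_path, remote_path):
    """Hypothetical uploader that always fails, used to exercise the error path."""
    raise IOError("simulated network failure")

# A recording lambda stands in for a working uploader
uploaded = []
ok = safe_upload(lambda src, dst: uploaded.append((src, dst)),
                 'final.csv', '/DataScience/final1.csv')
failed = safe_upload(failing_upload, 'final.csv', '/DataScience/final1.csv')
```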