# Computational Approaches to the Study of Food Sign Patterns
Today, we're going to explore the ways in which restaurants of different cuisines describe food on their menus. Specifically, we're interested in the following questions about food cultures in Chicago:
1. Are there patterns in the ways in which particular cuisine genres describe food on menus?
    * Identification of indexical/iconic legisigns that position a particular restaurant within a cuisine or social status (which consumers can then use to position themselves in the same light via social media posts, and so on)
2. Are these cuisines (and/or menu discourse patterns identified in #1) geographically patterned?
    * Identification of dicent indexical legisigns that point to a particular cuisine or broader menu discourse pattern on the basis of spatial location
    
First, let's load our packages. To run this notebook, you will need to install the `folium` package, which you can install through the Anaconda Navigator or on the command line with `conda install -c conda-forge folium`.

# Notes from class
* View the HTML source of the website.
* Find the menu items in the source, where they are embedded as pseudo-JSON.


In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import json
import re
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import string
from gensim import corpora, models
from sklearn.manifold import TSNE
import folium

# Some Functions from Last Time to get us started:
def get_wordnet_pos(word):
    import nltk

    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": nltk.corpus.wordnet.ADJ,
                "N": nltk.corpus.wordnet.NOUN,
                "V": nltk.corpus.wordnet.VERB,
                "R": nltk.corpus.wordnet.ADV}

    return tag_dict.get(tag, nltk.corpus.wordnet.NOUN)

def get_lemmas(text):
    import nltk

    # Combine list elements together into a single string for analysis
    text = ' '.join(text)

    stop = nltk.corpus.stopwords.words('english') + list(string.punctuation) + ["amp", "39"]
    tokens = [i for i in nltk.word_tokenize(text.lower()) if i not in stop]
    lemmas = [nltk.stem.WordNetLemmatizer().lemmatize(t, get_wordnet_pos(t)) for t in tokens]
    return lemmas

def plot_top_tfidf(series, data_description):
    import nltk

    # Apply 'get lemmas' function to any Pandas Series that we pass in to get lemmas for each row in the Series
    lemmas = series.apply(get_lemmas)

    # Initialize Series of lemmas as Gensim Dictionary for further processing
    dictionary = corpora.Dictionary([i for i in lemmas])

    # Convert dictionary into bag of words format: list of (token_id, token_count) tuples
    bow_corpus = [dictionary.doc2bow(text) for text in lemmas]

    # Calculate TFIDF based on bag of words counts for each token and return weights:
    tfidf = models.TfidfModel(bow_corpus)
    tfidf_weights = tfidf[bow_corpus[0]]

    # Sort TFIDF weights highest to lowest:
    sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)

    # Plot the top 10 weighted words:
    top_10 = {dictionary[k]:v for k,v in sorted_tfidf_weights[:10]} # dictionary comprehension
    plt.plot(list(top_10.keys()), list(top_10.values()), label=data_description)
    plt.xticks(rotation='vertical')
    plt.title('Top 10 Lemmas (TFIDF) for ' + data_description);

    return

html = requests.get("https://www.allmenus.com/il/chicago/22019-francescas-bryn-mawr/menu/")  # fetch the page with the requests library
# A <Response [200]> result means the request succeeded
html.text  # the raw page source as one string; it is still unparsed HTML, so we can't pull out the menu yet
# Use the Beautiful Soup library to parse the HTML and find the tags that hold the menu, ignoring irrelevant content
soup = BeautifulSoup(html.text, "html.parser")  # imported with the packages above; "html.parser" tells Beautiful Soup to parse the text as HTML
soup  # easier to read now that it is parsed; it matches the source shown in the browser, and Beautiful Soup can now search it by tag
# Next, identify where the data lives in the source so we can tell Beautiful Soup what to extract
# Websites embed their data as JSON because they want Google to be able to find it
# Tell Beautiful Soup to find the script tag of type application/ld+json
restaurant_data = soup.find('script', type='application/ld+json')  # returns the script element; the JSON sits between its opening and closing tags
# Load the contents as JSON, because Python doesn't know yet that the text is JSON
restaurant_data_json = json.loads(restaurant_data.text, strict=False)  # strict=False allows control characters such as newlines (\n) inside strings, which the scraped HTML contains
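To see what `strict=False` changes, here is a minimal sketch using only the standard library. The string `raw` is made-up data (not taken from the scraped page) that contains a literal newline inside a JSON string value, which is the situation the note above describes.

```python
import json

# Hypothetical JSON text with a raw newline inside a string value
raw = '{"name": "Bruschetta", "description": "Toasted bread\nwith tomato"}'

# The default (strict) parser rejects control characters inside strings
try:
    json.loads(raw)
    strict_ok = True
except json.JSONDecodeError:
    strict_ok = False

# With strict=False the same text parses and the newline is kept
item = json.loads(raw, strict=False)
print(strict_ok, repr(item['description']))
```

The newline survives in the parsed value, so it may still need cleaning before text analysis.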
# The actual menu data is embedded in several levels of nested lists
# e.g. 'hasMenu' holds a list of menus; a menu has a 'hasMenuSection' list (appetizers and so on), and each section has a 'hasMenuItem' list, so we need to get inside these nested lists
# A long list comprehension reaches the nested items
nested_items = [section['hasMenuItem'] for section in restaurant_data_json['hasMenu'][0]['hasMenuSection']]
# About the line above: 'hasMenu' holds a single entry, so take index 0; collecting 'hasMenuItem' from each section under 'hasMenuSection' gives a list of lists of dictionaries
# Flatten the names and descriptions into single lists
names = [j['name'] for i in nested_items for j in i]  # for each list in nested_items, and each dictionary in that list, take the value stored under 'name'
descriptions = [j['description'] for i in nested_items for j in i]  # same for the item descriptions
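The comprehensions above index each key directly, so an item without a 'description' key would raise a `KeyError`. A minimal sketch of a more forgiving version follows. The function name `flatten_menu`, the `sample` data, and the assumption that each section carries its own 'name' are all hypothetical; only the keys 'hasMenu', 'hasMenuSection', 'hasMenuItem', 'name', and 'description' come from the notes. Unlike the line above, this sketch loops over every entry in 'hasMenu' rather than only index 0.

```python
def flatten_menu(restaurant_json):
    """Return one dict per menu item with its section, name, and description."""
    rows = []
    for menu in restaurant_json.get('hasMenu', []):
        for section in menu.get('hasMenuSection', []):
            for item in section.get('hasMenuItem', []):
                rows.append({
                    'section': section.get('name', ''),          # assumed key
                    'name': item.get('name', ''),
                    'description': item.get('description', ''),  # empty if missing
                })
    return rows

# Hypothetical data shaped like the nested structure described above
sample = {
    'hasMenu': [{
        'hasMenuSection': [
            {'name': 'Appetizers',
             'hasMenuItem': [
                 {'name': 'Bruschetta', 'description': 'Toasted bread with tomato'},
                 {'name': 'Calamari'},  # no description on purpose
             ]},
            {'name': 'Pasta',
             'hasMenuItem': [
                 {'name': 'Rigatoni', 'description': 'Spicy tomato cream sauce'},
             ]},
        ]
    }]
}

rows = flatten_menu(sample)
names = [r['name'] for r in rows]
descriptions = [r['description'] for r in rows]
print(names)
```

Keeping the section next to each item also makes it easy to build a DataFrame later with `pd.DataFrame(rows)`.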
# But we want data from many restaurants: we need to either know the link for each menu, or scrape the links from allmenus and then scrape the menu at each link
# Filter to Italian restaurants in Chicago (sorted by popularity) and take the top however many
# Since we are scraping menu links, we don't want Grubhub links or other links
# In the HTML of the allmenus search page, find each restaurant's name and the link associated with it
html = requests.get('allmenus search link')  # placeholder for the allmenus search URL
soup = BeautifulSoup(html.text, "html.parser")
# The restaurant list starts with a "ul" tag
soup.find('ul', class_='restaurant-list')  # the underscore after class is needed because class is a reserved word in Python
# Grubhub links have a grubhub class specified, but the restaurant links have no class, so we can use:
restaurant_anchors = soup.find('ul', class_='restaurant-list').find_all('a', class_=None)
# Pull the links (the href values) out of the anchors
restaurant_links = ['first part of link' + i.get('href') for i in restaurant_anchors[:5]]  # hrefs in the HTML are only the ends of the links, so prepend the base of the URL to each one; this takes the first five
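As a sketch of the "prepend the base of the link" step, `urllib.parse.urljoin` from the standard library can do the joining. The value of `BASE_URL` is an assumption taken from the menu URL used earlier, and `absolute_links`, `fetch_menu_json`, and the second href in `hrefs` are hypothetical. `fetch_menu_json` repeats the single-page steps above for one link; it imports `requests` and `bs4` inside the function, as the helper functions at the top of the notebook do with `nltk`.

```python
import json
from urllib.parse import urljoin

BASE_URL = "https://www.allmenus.com"  # assumption: site root of the menu URL above

def absolute_links(hrefs, base=BASE_URL, limit=5):
    """Join relative hrefs onto the base URL, skipping empty ones; keep up to `limit`."""
    return [urljoin(base, h) for h in hrefs if h][:limit]

def fetch_menu_json(url):
    """Download one menu page and return its ld+json data, or None if absent."""
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop on 4xx/5xx responses
    soup = BeautifulSoup(response.text, "html.parser")
    tag = soup.find('script', type='application/ld+json')
    if tag is None or not tag.string:
        return None
    return json.loads(tag.string, strict=False)

# Hypothetical hrefs, as they might come from i.get('href')
hrefs = ["/il/chicago/22019-francescas-bryn-mawr/menu/", None,
         "/il/chicago/0000-example/menu/"]
links = absolute_links(hrefs)
print(links)

# To scrape each menu (requires network access):
# menus = [fetch_menu_json(u) for u in links]
```

Pausing between requests (for example with `time.sleep`) is a common courtesy when looping over many pages.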


# The following comes from a separate script written before class, which is not included here
menu_df = pd.read_json('menu_df_top100.json')
# Assign a map m by calling folium, center the map on Chicago, and set the tiles to a base map because it looks nice
folium.Marker(menu_df['Coordinates'][1]).add_to(m)  # create a marker at the coordinates of entry 1 and plot it on m
# Create a function that builds the map: it takes in the DataFrame of menus and outputs a point map
# To color individual points, loop through each point and assign a color to each (a for loop inside the function)
# Assign a different color to each cuisine using a dictionary
# The key also has code to add pop-up content that appears when you click on a point
# For loop within the function: create a dictionary assigning colors to cuisines, then loop through each restaurant, look up its color, and add its marker to the map with the styling options
# The legend is written in HTML because folium makes it hard to add a legend
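The map-building function described above is not included in these notes, so the following is only a sketch of one way it could look, not the version from class. The column names 'Cuisine' and 'Name', the palette, the default tiles, the zoom level, and the legend styling are all assumptions; 'Coordinates' is the column used above and is assumed to hold [latitude, longitude] pairs. The Chicago center uses the city's approximate coordinates.

```python
def assign_colors(cuisines, palette=('red', 'blue', 'green', 'purple', 'orange')):
    """Map each distinct cuisine to a color, cycling through the palette."""
    unique = sorted(set(cuisines))
    return {c: palette[i % len(palette)] for i, c in enumerate(unique)}

def build_map(menu_df, cuisine_col='Cuisine', name_col='Name'):
    """Return a folium point map with one colored marker per restaurant."""
    import folium

    m = folium.Map(location=[41.8781, -87.6298], zoom_start=11)  # centered on Chicago
    colors = assign_colors(menu_df[cuisine_col])

    for _, row in menu_df.iterrows():
        folium.CircleMarker(
            location=row['Coordinates'],      # assumed [lat, lon]
            radius=5,
            color=colors[row[cuisine_col]],
            fill=True,
            popup=f"{row[name_col]} ({row[cuisine_col]})",  # shown on click
        ).add_to(m)

    # Legend as raw HTML attached to the map's root element
    items = ''.join(
        f'<li><span style="color:{col}">&#9679;</span> {c}</li>'
        for c, col in colors.items()
    )
    legend_html = (
        '<ul style="position: fixed; bottom: 30px; left: 30px; z-index: 9999; '
        f'background: white; padding: 8px; list-style: none;">{items}</ul>'
    )
    m.get_root().html.add_child(folium.Element(legend_html))
    return m

# The color dictionary on its own, with hypothetical cuisine labels
demo_colors = assign_colors(['Italian', 'Thai', 'Italian', 'Mexican'])
print(demo_colors)
```

`CircleMarker` is used here because it accepts any CSS color name, which keeps the color dictionary simple.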

# Word2Vec: turn words into vectors to see how words are used in sentences
# Model parameters:
# min_count: minimum number of occurrences for a word to be considered
# window: only consider this many words around the target word
# workers: how many threads can run at one time on the computer
model.wv.key_to_index  # the words that made it into the vocabulary (gensim 4 name; gensim 3 used model.wv.vocab)
model.wv['garlic']  # the actual vector for a specific word: a 100-dimensional vector
# Reduce the vectors to 2D
# Build a list of labels and a list of tokens
# Append each 100-dimensional vector to the list of tokens
# Append the actual words to the list of labels
# Define the t-SNE model; n_components is 2 because we are reducing the dimensionality to 2
# Pull out the x and y values, since the data is now in 2D
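The Word2Vec and t-SNE steps are only outlined above, so here is a hedged sketch of how they might fit together. The function names are hypothetical, the parameter values are illustrative rather than the ones used in class, and `vector_size` and `index_to_key` are the gensim 4 names (gensim 3 used `size` and `index2word`). The third-party imports sit inside the functions, so the label-and-token helper can be tried on its own.

```python
def train_word2vec(sentences):
    """Train Word2Vec on a list of token lists (illustrative parameter values)."""
    from gensim.models import Word2Vec

    return Word2Vec(sentences=sentences, vector_size=100,
                    min_count=5, window=5, workers=4)

def labels_and_tokens(vectors, words):
    """Build parallel lists: the words (labels) and their vectors (tokens)."""
    labels, tokens = [], []
    for word in words:
        tokens.append(vectors[word])  # the word's vector
        labels.append(word)           # the word itself
    return labels, tokens

def project_to_2d(tokens, random_state=0):
    """Reduce the vectors to two dimensions with t-SNE and return x and y."""
    import numpy as np
    from sklearn.manifold import TSNE

    X = np.asarray(tokens)
    # perplexity must be smaller than the number of samples
    tsne = TSNE(n_components=2, perplexity=min(30, len(X) - 1),
                random_state=random_state)
    coords = tsne.fit_transform(X)
    return coords[:, 0], coords[:, 1]

# A plain dict stands in for model.wv here (hypothetical 2-dimensional vectors)
toy_vectors = {'garlic': [0.1, 0.9], 'basil': [0.2, 0.8]}
labels, tokens = labels_and_tokens(toy_vectors, ['garlic', 'basil'])
print(labels, tokens)

# With a trained model (gensim 4):
# model = train_word2vec(sentences)
# labels, tokens = labels_and_tokens(model.wv, model.wv.index_to_key)
# x, y = project_to_2d(tokens)
```

The resulting `x` and `y` can be passed to `plt.scatter`, with each point annotated by its entry in `labels`.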



