The goal of this task is to mine the data set to discover the common/popular dishes of a particular cuisine. Typically when you go to try a new cuisine, you don’t know beforehand the types of dishes that are available for that cuisine. For this task, we would like to identify the dishes that are available for a cuisine by building a dish recognizer.

**Task 3.1: Manual Tagging**

You are given a list of candidate dish names, all of which are frequent (appearing at least 10 times in the corresponding corpus) and were automatically generated by the auto-labeling process of SegPhrase[2]. The list can be found in the manualAnnotationTask.zip file. Some of the candidates have been verified against an outside knowledge base, so they are all good phrases, and some of them may be good dish names. However, some of the labels might be wrong. Therefore, your task here is to refine the label list for one cuisine. You may modify or add phrases. Here are some actions you may take:

- Remove a false positive non-dish name phrase (recommended), e.g., hong kong 1 could be removed in Chinese cuisine.
- Change a false positive non-dish name phrase to a negative label, e.g., hong kong 1 could be modified as hong kong 0.
- Remove a false negative dish name phrase, e.g., wonton strips 0 could be removed in Chinese cuisine.
- Change a false negative dish name phrase to a positive label (recommended), e.g., wonton strips 0 could be modified as wonton strips 1.
- Add some new annotated phrases in the same format.

Tip: Notice that the character between a phrase and its label is a tab instead of a space.
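The editing actions above can be scripted rather than done by hand. The following is a minimal sketch, assuming each line of the label file has the form phrase, tab, label. The helper names (`load_labels`, `refine`, `dump_labels`) and the phrases `egg roll` and `mapo tofu` are illustrative and do not come from the assignment files.

```python
import io

def load_labels(handle):
    """Read 'phrase<TAB>label' lines into a dict {phrase: int label}."""
    labels = {}
    for line in handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        phrase, label = parts
        labels[phrase.strip()] = int(label)
    return labels

def refine(labels, remove=(), flip=(), add=None):
    """Apply the manual actions: drop phrases, flip 0/1 labels, add new phrases."""
    to_remove = set(remove)
    refined = {p: l for p, l in labels.items() if p not in to_remove}
    for phrase in flip:
        if phrase in refined:
            refined[phrase] = 1 - refined[phrase]
    refined.update(add or {})
    return refined

def dump_labels(labels):
    """Serialize back to the tab-separated format."""
    return "".join(f"{phrase}\t{label}\n" for phrase, label in labels.items())

# Toy input standing in for one cuisine's label file
sample = io.StringIO("hong kong\t1\nwonton strips\t0\negg roll\t1\n")
labels = load_labels(sample)
refined = refine(
    labels,
    remove=["hong kong"],      # false positive: remove it
    flip=["wonton strips"],    # false negative: change 0 to 1
    add={"mapo tofu": 1},      # newly annotated phrase
)
print(dump_labels(refined), end="")
```

Keeping the edits in a script makes them easy to review and to rerun if the candidate list is regenerated.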

Remember that the tools we are using were originally designed for general phrase mining instead of dish name mining. Therefore, it will be much safer if we just remove those ambiguous labels, while aggressively changing them into opposites may lead to some undetermined risks, although it is still worth a try.

**Task 3.2: Mining Additional Dish Names**

Once you have a list of dish names, it is likely that many dish names are still missing. In this step, you would expand the list of dishes by using other pattern mining techniques and/or word association methods.

For example, ToPMine[1], as we mentioned in the previous pattern mining course, is an unsupervised frequent pattern-based phrase mining algorithm. It merges consecutive words based on statistical significance (stopwords are removed first and put back later). The current state-of-the-art framework is SegPhrase[2], which needs the (refined) labels from the first task. SegPhrase uses a classifier to assign a quality score to each phrase candidate based on its statistical features. The classification is in turn refined by the phrasal segmentation results, so the two parts mutually enhance each other.
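To give a feel for merging by statistical significance, here is a simplified sketch that scores adjacent word pairs by how much their observed count exceeds the count expected if the two words occurred independently. It is not the ToPMine implementation: the function name `bigram_significance` and the toy documents are assumptions, and the real algorithm merges phrases iteratively against a threshold instead of scoring pairs once.

```python
import math
from collections import Counter

def bigram_significance(token_lists):
    """Score each adjacent pair (a, b) as (observed - expected) / sqrt(observed)."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())
    scores = {}
    for (a, b), observed in bigrams.items():
        # Expected count of the pair if a and b were independent
        expected = unigrams[a] * unigrams[b] / total
        scores[(a, b)] = (observed - expected) / math.sqrt(observed)
    return scores

# Toy tokenized documents (stopwords already removed)
docs = [
    ["fried", "chicken", "great"],
    ["fried", "chicken", "crispy"],
    ["great", "service"],
    ["crispy", "fried", "chicken"],
]
scores = bigram_significance(docs)
best = max(scores, key=scores.get)
print("Most significant pair:", best)
```

Pairs that score above a chosen threshold would be treated as candidate phrases. The threshold itself has to be tuned on the actual corpus.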

Another approach to possibly extending the dish names is using word association. You have previously learned and implemented methods to judge word associations (paradigmatic & syntagmatic relations), such as Mutual Information. There are also some more state-of-the-art methods such as word2vec[3], which you are welcome to experiment with.
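As a reminder of how Mutual Information can rank word associations, the sketch below computes it from document-level presence and absence of two words. The function name `mutual_information` and the toy documents are illustrative, and no smoothing is applied, so cells with zero count are simply skipped.

```python
import math

def mutual_information(docs, w1, w2):
    """Mutual information (in bits) between the presence of w1 and w2 in a document."""
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    n11 = sum(1 for s in doc_sets if w1 in s and w2 in s)
    n10 = sum(1 for s in doc_sets if w1 in s and w2 not in s)
    n01 = sum(1 for s in doc_sets if w1 not in s and w2 in s)
    n00 = n - n11 - n10 - n01
    # Each cell: (joint count, marginal count for w1's state, marginal count for w2's state)
    cells = (
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    )
    mi = 0.0
    for joint, w1_marginal, w2_marginal in cells:
        if joint:  # 0 * log(0) is treated as 0
            mi += (joint / n) * math.log2(joint * n / (w1_marginal * w2_marginal))
    return mi

# Toy tokenized documents
docs = [
    ["fries", "burger"],
    ["fries", "burger", "shake"],
    ["salad"],
    ["salad", "soup"],
]
print("MI(fries, burger):", mutual_information(docs, "fries", "burger"))
print("MI(fries, shake): ", mutual_information(docs, "fries", "shake"))
```

Words with high Mutual Information relative to known dish names are candidates for a syntagmatic relation with them, which may point to ingredients or side dishes as well as to new dish names.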


In [None]:
# Mount Google Drive

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import re
import json
import nltk
import pandas as pd
import gensim
from gensim.models import Word2Vec

from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# Generator to read large JSON file line by line
def json_reader(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            yield json.loads(line)

# Load review and tip dataset
review_file_path = '/content/drive/MyDrive/Data Mining Project UIUC/dataset/yelp_academic_dataset_review.json'
tip_file_path = '/content/drive/MyDrive/Data Mining Project UIUC/dataset/yelp_academic_dataset_tip.json'

reviews = json_reader(review_file_path)
tips = json_reader(tip_file_path)

In [None]:
# Check for phrases in the raw review and tip data
raw_review_sample = json_reader(review_file_path)
raw_tip_sample = json_reader(tip_file_path)

# Function to check for the presence of phrases in raw data
def check_raw_data_for_phrases(phrases, data_gen):
    found_phrases = {phrase: False for phrase in phrases}
    for entry in data_gen:
        text = entry.get('text', '').lower()
        for phrase in phrases:
            if phrase.replace('_', ' ') in text:  # Check as regular phrase
                found_phrases[phrase] = True
    return found_phrases

# Check sample phrases in both reviews and tips
sample_phrases = ["goat cheese", "coca cola", "onion rings"]
review_presence = check_raw_data_for_phrases(sample_phrases, raw_review_sample)
tip_presence = check_raw_data_for_phrases(sample_phrases, raw_tip_sample)

print("Presence in raw review data:", review_presence)
print("Presence in raw tip data:", tip_presence)

Presence in raw review data: {'goat cheese': True, 'coca cola': True, 'onion rings': True}
Presence in raw tip data: {'goat cheese': True, 'coca cola': True, 'onion rings': True}


In [None]:
# Load the labeled file
file_path = '/content/drive/MyDrive/Data Mining Project UIUC/task3/American_Cuisine_Phrases.csv'
df = pd.read_csv(file_path, sep=',', names=['phrase', 'label'], skiprows=1)

# Separate dishes and non-dishes based on labels
dishes = df[df['label'] == 1]['phrase'].tolist()
non_dishes = df[df['label'] == 0]['phrase'].tolist()

In [None]:
df.head()

Unnamed: 0,phrase,label
0,spring training,0
1,golden corral,0
2,in n out,0
3,finger food,1
4,service stars,0


In [None]:
dishes

['finger food',
 'goat cheese',
 'coca cola',
 'rock candy',
 'wedding cake',
 "hors d'oeuvres",
 'coffee bean',
 'onion rings',
 'cuban sandwich',
 'sweet potato',
 'pine nuts',
 'hot dog',
 'ice cream sandwich',
 'stir fry',
 'pale ale',
 'fast food',
 'foie gras',
 'panna cotta',
 'baked potato',
 'cheddar cheese',
 'whipped cream',
 'chocolate cake',
 'soft drinks',
 'chocolate syrup',
 'french toast',
 'potato salad',
 'milk chocolate',
 'soda fountain',
 'monte cristo',
 'cotton candy',
 'hot chocolate',
 'american cheese',
 'king crab',
 'fried egg',
 'sea bass',
 'tap water',
 'fried chicken',
 'rye bread',
 'kobe beef',
 'red bull',
 'bone marrow',
 'corn dog',
 'chicken fried steak',
 'bread pudding',
 'red velvet cake',
 'cottage cheese',
 'amuse bouche',
 'ice cream',
 'potato chip',
 'salisbury steak',
 'green beans',
 'blue cheese',
 'tandoori chicken',
 'pound cake',
 'hard boiled egg',
 'english muffin',
 'cream pie',
 'white bread',
 'dried fruit',
 'fried fish',
 'car

In [None]:
# Define a batch size for processing
batch_size = 50000

# Convert all dish names into a dictionary mapping original to underscore-connected versions
multi_word_phrases = {dish: dish.replace(' ', '_') for dish in dishes}

# Build the stop-word set once instead of on every call
stop_words = set(stopwords.words('english'))

# Preprocessing function to preserve multi-word dish names
def preprocess(text):
    # Lowercase first so that capitalized mentions such as "Goat Cheese" also match
    text = text.lower()
    # Replace each multi-word dish name with its underscore-connected version
    for phrase, underscore_phrase in multi_word_phrases.items():
        text = text.replace(phrase, underscore_phrase)
    # Remove punctuation (underscores survive because \w matches them) and tokenize
    text = re.sub(r'[^\w\s]', '', text)
    tokens = [word for word in text.split() if word not in stop_words]
    return tokens

# Function to process all data in batches and accumulate results
def all_batch_process_json(json_reader_func, batch_size):
    batch = []
    for entry in json_reader_func:
        if 'text' in entry:
            processed_text = preprocess(entry['text'])
            if processed_text:
                batch.append(processed_text)
        # Yield batch when it reaches the desired size
        if len(batch) >= batch_size:
            yield batch
            batch = []
    # Yield any remaining data in the final batch
    if batch:
        yield batch

# Process reviews and tips fully, one batch at a time
review_batches = all_batch_process_json(json_reader(review_file_path), batch_size)
tip_batches = all_batch_process_json(json_reader(tip_file_path), batch_size)

# Accumulate all processed batches into one corpus
corpus_total = []
for batch in review_batches:
    corpus_total.extend(batch)
    print("Current Corpus Size after review batch:", len(corpus_total))

for batch in tip_batches:
    corpus_total.extend(batch)
    print("Current Corpus Size after tip batch:", len(corpus_total))

Current Corpus Size after review batch: 50000
Current Corpus Size after review batch: 100000
Current Corpus Size after review batch: 150000
Current Corpus Size after review batch: 200000
Current Corpus Size after review batch: 250000
Current Corpus Size after review batch: 300000
Current Corpus Size after review batch: 350000
Current Corpus Size after review batch: 400000
Current Corpus Size after review batch: 450000
Current Corpus Size after review batch: 500000
Current Corpus Size after review batch: 550000
Current Corpus Size after review batch: 600000
Current Corpus Size after review batch: 650000
Current Corpus Size after review batch: 700000
Current Corpus Size after review batch: 750000
Current Corpus Size after review batch: 800000
Current Corpus Size after review batch: 850000
Current Corpus Size after review batch: 900000
Current Corpus Size after review batch: 950000
Current Corpus Size after review batch: 1000000
Current Corpus Size after review batch: 1050000
Current Corp

In [None]:
# Check the final size and sample of the processed corpus
print("Final Corpus Size:", len(corpus_total))
print("Sample Corpus Entry:", corpus_total[:3])

Final Corpus Size: 1527498
Sample Corpus Entry: [['dr', 'goldberg', 'offers', 'everything', 'look', 'general', 'practitioner', 'hes', 'nice', 'easy', 'talk', 'without', 'patronizing', 'hes', 'always', 'time', 'seeing', 'patients', 'hes', 'affiliated', 'topnotch', 'hospital', 'nyu', 'parents', 'explained', 'important', 'case', 'something', 'happens', 'need', 'surgery', 'get', 'referrals', 'see', 'specialists', 'without', 'see', 'first', 'really', 'need', 'im', 'sitting', 'trying', 'think', 'complaints', 'im', 'really', 'drawing', 'blank'], ['unfortunately', 'frustration', 'dr', 'goldbergs', 'patient', 'repeat', 'experience', 'ive', 'many', 'doctors', 'nyc', 'good', 'doctor', 'terrible', 'staff', 'seems', 'staff', 'simply', 'never', 'answers', 'phone', 'usually', 'takes', '2', 'hours', 'repeated', 'calling', 'get', 'answer', 'time', 'wants', 'deal', 'run', 'problem', 'many', 'doctors', 'dont', 'get', 'office', 'workers', 'patients', 'medical', 'needs', 'isnt', 'anyone', 'answering', 'pho

In [None]:
# Example of checking for specific phrases in corpus_total
sample_phrases = ["goat_cheese", "coca_cola", "onion_rings"]
found_phrases = {phrase: any(phrase in sublist for sublist in corpus_total) for phrase in sample_phrases}
print("Presence of sample phrases in corpus:", found_phrases)

Presence of sample phrases in corpus: {'goat_cheese': True, 'coca_cola': True, 'onion_rings': True}


We're on the right track: the underscore-connected dish names that we found in the raw data also appear in the corpus, which means the preprocessing worked as intended. Now we can train a Word2Vec model and use it to find similar words for each known dish name.

In [None]:
# Train Word2Vec model
model = Word2Vec(sentences=corpus_total, vector_size=100, window=10, min_count=1, workers=4)

# Find similar words for each known multi-word dish name
expanded_dish_list = []
for dish in dishes:
    try:
        similar_words = model.wv.most_similar(multi_word_phrases[dish], topn=10)
        expanded_dish_list.extend([word for word, similarity in similar_words])
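    except KeyError:
        # Raised when the underscore form never became a token, e.g. "hors_d'oeuvres":
        # preprocess() strips the apostrophe, so the corpus holds "hors_doeuvres" instead
        print(f"{multi_word_phrases[dish]} not found in the text corpus vocabulary")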

# Remove duplicates and output the expanded list
expanded_dish_list = list(set(expanded_dish_list))
print("Expanded Dish List:", expanded_dish_list)

hors_d'oeuvres not found in the text corpus vocabulary
Expanded Dish List: ['cheddar', 'hasbrowns', 'biryani', 'snowcrab', 'baker', 'bellini', 'cabernet', 'caviar', 'breadlike', 'latte', 'scallops', 'grains', 'bouche', 'bokchoy', 'lengua', 'finely', 'lays', 'cotta', 'pistachio', 'chocolate_cake', 'lox', 'ale', 'appetizer', 'fruit', 'shiitake', 'venezias', 'pound_cake', 'annihilator', 'diced', 'crumbles', 'smore', 'machaca', 'salmon', 'shortribs', 'blizzards', 'chilean', 'chowmein', 'charsiu', 'milanesa', 'chiken', 'hoagie', 'omlet', 'swai', 'wendys', 'softserve', 'dill', 'cocacola', 'veggie', 'lager', 'wagyu', 'tritip', 'tofu', 'hushpuppies', 'pepsi', 'gravlax', 'fatburger', 'gardenburger', 'concoction', 'branzino', 'sitdown', 'cfs', 'coca', 'chocolate', 'sauteed', 'spatzle', 'tenderloin', 'strawberries', 'hotdogs', 'cornbeef', 'sorbet', 'mule', 'pork_belly', 'fast_foods', 'plantanos', 'evian', 'gin', 'mcdonalds', 'lentils', 'mochas', 'mojito', 'sweet_potato_salad', 'chicken', 'lobster

In [None]:
# Save the DataFrame to a CSV file without the index
output_file_path = 'expanded_dish_list.csv'
df = pd.DataFrame(expanded_dish_list, columns=['dish_name'])
df.to_csv(output_file_path, index=False)

print(f"Expanded dish list saved to {output_file_path}")

Expanded dish list saved to expanded_dish_list.csv


In [None]:
check_df = pd.read_csv("expanded_dish_list.csv")
check_df.sample(5)

Unnamed: 0,dish_name
282,cheddar_cheese
410,chianti
429,homefries
115,american_cheese
487,aracelli


We did it! We now have an expanded dish list, but it still contains noise (restaurant names, drinks, misspellings, and generic words), so further refinement or manual filtering is needed.
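One possible refinement step is sketched below. It assumes the neighbours of each seed dish are kept as `(word, similarity)` pairs, the shape returned by `most_similar`. A candidate is kept only if it is close to several seed dishes and is not already labeled. The function name `refine_candidates`, both thresholds, and the toy similarity values are made up for illustration and would need tuning on the real model.

```python
from collections import Counter

def refine_candidates(neighbors_by_dish, known, rejected,
                      min_similarity=0.6, min_votes=2):
    """Keep candidates that are similar enough to at least min_votes seed dishes."""
    votes = Counter()
    for neighbors in neighbors_by_dish.values():
        for word, similarity in neighbors:
            if similarity >= min_similarity:
                votes[word] += 1
    excluded = set(known) | set(rejected)
    return sorted(
        word for word, count in votes.items()
        if count >= min_votes and word not in excluded
    )

# Toy neighbour lists with invented similarity scores
toy_neighbors = {
    "fried_chicken": [("wings", 0.81), ("tenders", 0.74), ("wendys", 0.62), ("finely", 0.41)],
    "onion_rings": [("tenders", 0.70), ("wings", 0.66), ("hot_dog", 0.65)],
    "hot_dog": [("wendys", 0.58), ("wings", 0.61)],
}
result = refine_candidates(
    toy_neighbors,
    known={"fried_chicken", "onion_rings", "hot_dog"},  # already labeled as dishes
    rejected={"wendys"},                                # already labeled as non-dishes
)
print(result)
```

Requiring several seed dishes to agree tends to suppress one-off neighbours, although a manual pass over the surviving candidates is still advisable.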