# Restaurant atmosphere/vibes analysis

## Introduction

I found some apps that show images of restaurants together with a vibe label.

`Tripadvisor` only offers the `business meetings`, `romantic`, and `ideal for groups` vibes, which is not enough.

After searching through a lot of apps, I found the `One Zone` app, designed specifically to choose a restaurant by the vibe it evokes. However, its dataset is really small, because restaurants are added manually, and it does not use any kind of deep learning method. Maybe we can use it as a test dataset.

Therefore, what I am going to try is to develop an NLP model that classifies customer reviews by restaurant vibe. Then I would label the restaurant images with the vibe obtained from the NLP model.

**(INSERT IMAGE OF THE WORKFLOW I AM GOING TO FOLLOW)**
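
As a rough illustration of the second step of this plan, the sketch below turns per-review vibe predictions into one vibe per restaurant. The majority vote, the function name `aggregate_vibes`, and the example labels and ids are all assumptions made for illustration; the notebook has not fixed a label set or an aggregation rule yet.

```python
from collections import Counter


def aggregate_vibes(predictions):
    """Map each business_id to its most frequently predicted vibe.

    predictions: iterable of (business_id, vibe) pairs, one per review.
    Majority voting is an assumed aggregation rule, not a settled design.
    """
    votes = {}
    for business_id, vibe in predictions:
        votes.setdefault(business_id, Counter())[vibe] += 1
    # most_common(1) returns a one-element list [(vibe, count)]
    return {bid: counts.most_common(1)[0][0] for bid, counts in votes.items()}


# Hypothetical per-review outputs of the future NLP model
review_predictions = [
    ("biz_a", "romantic"),
    ("biz_a", "romantic"),
    ("biz_a", "cozy"),
    ("biz_b", "lively"),
]

restaurant_vibes = aggregate_vibes(review_predictions)
print(restaurant_vibes)  # {'biz_a': 'romantic', 'biz_b': 'lively'}
```

The resulting mapping could then be joined with the photo ids by `business_id` to label each restaurant image.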

## Yelp Dataset

We'll be working with the Yelp Open Dataset, an extensive collection that Yelp provides for academic and research purposes. It draws on Yelp's repository of local business reviews and user interactions, and contains:

1. **Businesses:**
Basic information about local businesses, including their name, location (address, city, state, postal code), latitude and longitude, average rating, category (e.g., Restaurants, Shopping, Beauty & Spas), and the number of reviews they have received. This allows for analyses focused on geographical trends, category-based studies, and more.

2. **Reviews:**
Millions of text reviews written by Yelp users for various businesses. Each review includes the user's text review, the star rating they gave, and the date of the review. This data is invaluable for sentiment analysis, natural language processing tasks, and understanding consumer preferences.

3. **Users:**
Information about Yelp users who have written reviews, including their user ID, name, review count, `yelping_since` (the date they joined Yelp), friends, and other social metrics. This can be used for social network analysis, studying user behavior, and personalization algorithms.

4. **Check-ins:**
Data about check-ins at businesses by Yelp users, which can help in analyzing foot traffic trends, popular times, and loyalty or frequency of visits.

5. **Tips:**
Short notes left by users about a business. Tips can contain advice, recommendations, or comments about what's good or what to avoid. This dataset is useful for extracting quick insights or highlights about businesses.

6. **Photos:**
Metadata about photos associated with businesses, including a photo ID, business ID, caption, and labels indicating the photo category (e.g., food, interior, exterior).
This can support visual analysis of businesses or complement review-based insights as in our case.
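
Before filtering anything, it can help to look at one record of each file to see which fields are actually present. The loading cells in the next section read each file as one JSON object per line; the helper below relies on that. The names `peek_json_lines` and `describe` are made up for this sketch.

```python
import json
from itertools import islice


def peek_json_lines(path, n=1):
    """Return the first n records of a JSON Lines file as dicts."""
    with open(path, "r", encoding="utf-8") as handle:
        return [json.loads(line) for line in islice(handle, n)]


def describe(path, n=1):
    """Print the sorted field names of the first n records."""
    for record in peek_json_lines(path, n):
        print(sorted(record.keys()))


# Example call once the archives are extracted:
# describe('/content/data/yelp_academic_dataset_business.json')
```

Only the first `n` lines are read, so this stays fast even on the multi-gigabyte review file.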

## Loading Yelp data

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Unzip files from drive.

In [3]:
import tarfile

# Path to the Yelp dataset archive on Google Drive
file_path = '/content/drive/MyDrive/yelp_dataset.tgz'

# Open the .tgz file
with tarfile.open(file_path, 'r:gz') as file:
    file.extractall(path='./data')

print('Extraction completed.')

Extraction completed.


In [8]:
import tarfile

# Path to the Yelp photos archive on Google Drive
file_path = '/content/drive/MyDrive/yelp_photos.tgz'

# Open the .tgz file
with tarfile.open(file_path, 'r:gz') as file:
    file.extractall(path='./data')

print('Extraction completed.')

Extraction completed.
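
The two extraction cells above differ only in the archive path, so they could share one helper. This is a minimal sketch; the function name `extract_archive` is an assumption. It also uses the `data` extraction filter when the running Python version provides it, which the `tarfile` documentation recommends for archives from outside sources.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest="./data"):
    """Extract a gzip-compressed tar archive into dest and return its member names."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        if hasattr(tarfile, "data_filter"):
            # Extraction filters exist in newer Python releases; use the safe one
            archive.extractall(path=dest, filter="data")
        else:
            archive.extractall(path=dest)
    return names


# Example calls with the paths used above:
# extract_archive('/content/drive/MyDrive/yelp_dataset.tgz')
# extract_archive('/content/drive/MyDrive/yelp_photos.tgz')
```

Returning the member names makes it easy to confirm which files ended up under `./data`.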


Let's select just restaurants:

In [1]:
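import json

# Collect the business_id of every business with a restaurant-related category
restaurant_ids = set()

with open('/content/data/yelp_academic_dataset_business.json', 'r', encoding='utf-8') as business_file:
    for line in business_file:
        business = json.loads(line)  # One JSON object per line

        # 'categories' is a comma-separated string and may be missing or null
        if business.get('categories') and any("Restaurant" in category for category in business['categories'].split(', ')):
            restaurant_ids.add(business['business_id'])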
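
The filter above is a substring match, so any category that merely contains the word "Restaurant" is kept. The sketch below isolates that check and contrasts it with an exact match on `Restaurants`. The sample records, including the `Restaurant Supplies` category, are made up for illustration, and the function names are assumptions.

```python
def is_restaurant(business):
    """Substring match, same logic as the filter cell above."""
    categories = business.get("categories")
    if not categories:
        return False
    return any("Restaurant" in category for category in categories.split(", "))


def is_restaurant_exact(business):
    """Stricter variant: require the exact category 'Restaurants'."""
    categories = business.get("categories")
    if not categories:
        return False
    return "Restaurants" in categories.split(", ")


# Made-up records, only to show how the two checks differ
samples = [
    {"business_id": "a", "categories": "Restaurants, Italian"},
    {"business_id": "b", "categories": "Shopping, Beauty & Spas"},
    {"business_id": "c", "categories": None},
    {"business_id": "d", "categories": "Restaurant Supplies, Wholesale"},
]

loose = [b["business_id"] for b in samples if is_restaurant(b)]
strict = [b["business_id"] for b in samples if is_restaurant_exact(b)]
print(loose)   # ['a', 'd']
print(strict)  # ['a']
```

Whether the looser or the stricter rule is preferable depends on how many non-restaurant businesses the substring match lets through in the real data, which would need to be checked.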


In [3]:
import csv

with open('/content/data/yelp_academic_dataset_review.json', 'r') as review_file, \
     open('restaurant_reviews.csv', 'w', newline='', encoding='utf-8') as csv_file:

    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['business_id', 'text'])

    for line in review_file:
        review = json.loads(line)

        if review['business_id'] in restaurant_ids:
            csv_writer.writerow([review['business_id'], review['text']])

In [4]:
!mv /content/restaurant_reviews.csv /content/drive/MyDrive

In [7]:
import pandas as pd
df_restaurants = pd.read_csv("/content/drive/MyDrive/restaurant_reviews.csv")

In [9]:
# Step 1: Map each business_id to the first photo labelled 'inside'
indoor_photos = {}
with open('/content/data/photos.json', 'r') as photos_file:
    for line in photos_file:
        photo = json.loads(line)
        if photo['label'] == 'inside':
            if photo['business_id'] not in indoor_photos:
                indoor_photos[photo['business_id']] = photo['photo_id']

# Initialize a set to track businesses that have already been assigned a photo
assigned_photos = set()

# Step 2: Read the existing CSV, append photo_id to the first instance, and NaN to the rest
updated_rows = []
with open('/content/drive/MyDrive/restaurant_reviews.csv', 'r', newline='', encoding='utf-8') as csv_file:
    csv_reader = csv.reader(csv_file)
    header = next(csv_reader)  # Assuming the first row is the header
    header.append('photo_id')  # Add the photo_id column to the header

    for row in csv_reader:
        business_id = row[0]
        if business_id in indoor_photos and business_id not in assigned_photos:
            # Append the photo_id to the row if an indoor photo exists and hasn't been assigned yet
            row.append(indoor_photos[business_id])
            assigned_photos.add(business_id)  # Mark this business as having been assigned a photo
        else:
            # Append NaN for businesses without an indoor photo or already assigned
            row.append('NaN')
        updated_rows.append(row)

# Step 3: Save the updated data to a new CSV file
with open('restaurant_reviews_with_photos.csv', 'w', newline='', encoding='utf-8') as new_csv_file:
    csv_writer = csv.writer(new_csv_file)
    csv_writer.writerow(header)  # Write the updated header
    csv_writer.writerows(updated_rows)  # Write the updated rows
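
The cell above keeps every updated row in `updated_rows` before writing, which means several million rows sit in memory at once. A possible alternative, sketched here under the same rules (first row of a business gets its indoor photo, all other rows get the string `'NaN'`), streams rows from the input file straight to the output file. The function names `attach_first_photo` and `add_photo_column` are assumptions.

```python
import csv


def attach_first_photo(rows, indoor_photos):
    """Yield each row with a photo_id appended.

    The first row seen for a business gets its indoor photo_id;
    every other row gets the string 'NaN', as in the cell above.
    """
    assigned = set()
    for row in rows:
        business_id = row[0]
        if business_id in indoor_photos and business_id not in assigned:
            assigned.add(business_id)
            yield row + [indoor_photos[business_id]]
        else:
            yield row + ["NaN"]


def add_photo_column(src_path, dst_path, indoor_photos):
    """Copy src_path to dst_path row by row, adding a photo_id column."""
    with open(src_path, "r", newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)  # Assumes the first row is the header
        writer.writerow(header + ["photo_id"])
        writer.writerows(attach_first_photo(reader, indoor_photos))
```

Because `attach_first_photo` is a generator, only one row is held in memory at a time, and the row logic can be tested on a small list without touching the file system.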

In [14]:
!mv /content/restaurant_reviews_with_photos.csv /content/drive/MyDrive

In [None]:
import shutil

# Source path: the CSV that was moved to Drive above.
# requests.get() only accepts http(s) URLs, so a local file has to be copied instead.
source_path = '/content/drive/MyDrive/restaurant_reviews_with_photos.csv'

# The local path where the copy is saved
local_filename = 'downloaded_file.csv'

# Copy the file contents to the local path
shutil.copyfile(source_path, local_filename)

print(f'File copied and saved as {local_filename}')

## EDA

Let's explore the data a bit. We just want to analyse restaurants.

In [18]:
import pandas as pd

# Load the first 2000 rows
df = pd.read_csv('/content/drive/MyDrive/restaurant_reviews_with_photos.csv', nrows=2000)

# Now df contains the first 2000 rows of your CSV file
df.head()

|   | business_id | text | photo_id |
|---|---|---|---|
| 0 | XQfwVwDr-v0ZS3_CbbE5Xw | If you decide to eat here, just be aware it is... | NaN |
| 1 | YjUWPpI6HXG530lwP-fb2A | Family diner. Had the buffet. Eclectic assortm... | NaN |
| 2 | kxX2SOes4o-D3ZQBkiMRfA | Wow!  Yummy, different,  delicious.   Our favo... | NaN |
| 3 | e4Vwtrqf-wpJfwesgvdgxQ | Cute interior and owner (?) gave us tour of up... | -rCqVHSxxfNSCBLvFE_U6Q |
| 4 | 04UD14gamNjLY0IDYVhHJg | I am a long term frequent customer of this est... | 1eKNwiFMPTLfviad0Sh-Ew |


In [25]:
# Count the rows that have no photo_id
df['photo_id'].isnull().sum()

860

In [19]:
def get_csv_size(file_path):
    """Return (row_count, column_count) for a CSV file; the header row is counted."""
    with open(file_path, 'r', newline='', encoding='utf-8') as file:
        reader = csv.reader(file)
        first_row = next(reader, None)  # Header row, or None if the file is empty
        if first_row is None:
            return 0, 0
        row_count = 1 + sum(1 for _ in reader)  # First row plus the remaining rows

    return row_count, len(first_row)


In [20]:
rows, columns = get_csv_size('/content/drive/MyDrive/restaurant_reviews.csv')

print(f'The CSV file has {rows} rows and {columns} columns.')

The CSV file has 4724685 rows and 2 columns.


In [21]:
rows, columns = get_csv_size('/content/drive/MyDrive/restaurant_reviews_with_photos.csv')

print(f'The CSV file has {rows} rows and {columns} columns.')

The CSV file has 4724685 rows and 3 columns.


## Exploring some NLP models

### RoBERTa-Large Variant

Let's try loading a RoBERTa-based model from the Hugging Face Hub that targets aspect extraction in restaurant reviews.

In [None]:


# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("AliHaider0343/implicit-and-explicit-aspects-Extraction-in-Restaurant-Reviews-Domain")
model = AutoModelForMaskedLM.from_pretrained("AliHaider0343/implicit-and-explicit-aspects-Extraction-in-Restaurant-Reviews-Domain")

OSError: Can't load tokenizer for 'AliHaider0343/implicit-and-explicit-aspects-Extraction-in-Restaurant-Reviews-Domain'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'AliHaider0343/implicit-and-explicit-aspects-Extraction-in-Restaurant-Reviews-Domain' is the correct path to a directory containing all relevant files for a RobertaTokenizerFast tokenizer.

In [None]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="AliHaider0343/implicit-and-explicit-aspects-Extraction-in-Restaurant-Reviews-Domain")

ValueError: Could not load model AliHaider0343/implicit-and-explicit-aspects-Extraction-in-Restaurant-Reviews-Domain with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForSequenceClassification'>, <class 'transformers.models.auto.modeling_tf_auto.TFAutoModelForSequenceClassification'>, <class 'transformers.models.roberta.modeling_roberta.RobertaForMaskedLM'>, <class 'transformers.models.roberta.modeling_tf_roberta.TFRobertaForMaskedLM'>). See the original errors:

while loading with AutoModelForSequenceClassification, an error is thrown:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/transformers/pipelines/base.py", line 269, in infer_framework_load_model
    model = model_class.from_pretrained(model, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3098, in from_pretrained
    raise EnvironmentError(
OSError: AliHaider0343/implicit-and-explicit-aspects-Extraction-in-Restaurant-Reviews-Domain does not appear to have a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.

while loading with TFAutoModelForSequenceClassification, an error is thrown:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/transformers/pipelines/base.py", line 269, in infer_framework_load_model
    model = model_class.from_pretrained(model, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_tf_utils.py", line 2829, in from_pretrained
    raise EnvironmentError(
OSError: AliHaider0343/implicit-and-explicit-aspects-Extraction-in-Restaurant-Reviews-Domain does not appear to have a file named pytorch_model.bin, tf_model.h5 or model.ckpt

while loading with RobertaForMaskedLM, an error is thrown:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/transformers/pipelines/base.py", line 269, in infer_framework_load_model
    model = model_class.from_pretrained(model, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3098, in from_pretrained
    raise EnvironmentError(
OSError: AliHaider0343/implicit-and-explicit-aspects-Extraction-in-Restaurant-Reviews-Domain does not appear to have a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.

while loading with TFRobertaForMaskedLM, an error is thrown:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/transformers/pipelines/base.py", line 269, in infer_framework_load_model
    model = model_class.from_pretrained(model, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_tf_utils.py", line 2829, in from_pretrained
    raise EnvironmentError(
OSError: AliHaider0343/implicit-and-explicit-aspects-Extraction-in-Restaurant-Reviews-Domain does not appear to have a file named pytorch_model.bin, tf_model.h5 or model.ckpt




In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "AliHaider0343/implicit-and-explicit-aspects-Extraction-in-Restaurant-Reviews-Domain"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

OSError: Can't load tokenizer for 'AliHaider0343/implicit-and-explicit-aspects-Extraction-in-Restaurant-Reviews-Domain'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'AliHaider0343/implicit-and-explicit-aspects-Extraction-in-Restaurant-Reviews-Domain' is the correct path to a directory containing all relevant files for a RobertaTokenizerFast tokenizer.
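
All three attempts fail in the same way: the tokenizer files cannot be loaded, and the pipeline reports that the repository has no weight file such as `pytorch_model.bin`. One way to confirm this before trying more loader classes is to list what the repository actually contains. The sketch below assumes the `huggingface_hub` package is installed and uses its `list_repo_files` function; the helper names and the sets of expected file names are assumptions (the weight names come from the error messages above, plus `model.safetensors`), and only files at the repository root are checked.

```python
# File names taken from the error messages above, plus model.safetensors (assumption)
WEIGHT_FILES = {
    "pytorch_model.bin",
    "tf_model.h5",
    "model.ckpt",
    "flax_model.msgpack",
    "model.safetensors",
}
# Typical tokenizer files for a RoBERTa-style tokenizer (assumption)
TOKENIZER_FILES = {"tokenizer.json", "vocab.json"}


def summarize_repo_files(filenames):
    """Report whether a list of repo file names includes weights and tokenizer files."""
    names = set(filenames)
    return {
        "has_weights": bool(names & WEIGHT_FILES),
        "has_tokenizer": bool(names & TOKENIZER_FILES),
    }


def check_repo(repo_id):
    """List the files of a Hub repository and summarize them (needs network access)."""
    from huggingface_hub import list_repo_files

    return summarize_repo_files(list_repo_files(repo_id))


# Example call:
# check_repo("AliHaider0343/implicit-and-explicit-aspects-Extraction-in-Restaurant-Reviews-Domain")
```

If both values come back `False`, the repository does not ship a loadable model and a different checkpoint is needed.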