# Restaurant atmosphere/vibes analysis

## Introduction

I found some apps that show images of restaurants together with a vibe label.

`Tripadvisor` only offers the `business meetings`, `romantic`, and `ideal for groups` vibes, which is not enough.

After searching through a lot of apps, I found the `One Zone` app, designed specifically for choosing a restaurant by the vibe it evokes. However, its dataset is really small because restaurants are added manually; they don't use any kind of deep learning method. Maybe we can use it as a test dataset.

Therefore, my plan is to develop an NLP model that classifies customer reviews by restaurant vibe, and then to label each restaurant's images with the vibe obtained from the NLP model.
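
One step of this plan is turning review-level predictions into a single label per restaurant. The sketch below assumes a majority vote over the predicted vibes; the `predictions` table, its `vibe` column, and the `restaurant_vibes` helper are hypothetical placeholders, since the NLP model does not exist yet.

```python
import pandas as pd

# Hypothetical per-review predictions; in the real workflow these would
# come from the NLP model applied to each Yelp review.
predictions = pd.DataFrame({
    "business_id": ["a", "a", "a", "b", "b"],
    "vibe": ["romantic", "romantic", "cozy", "lively", "lively"],
})

def restaurant_vibes(preds: pd.DataFrame) -> pd.Series:
    """Return the most frequent predicted vibe for each restaurant.

    Ties are broken alphabetically because Series.mode() returns sorted values.
    """
    return preds.groupby("business_id")["vibe"].agg(lambda s: s.mode().iloc[0])

vibes = restaurant_vibes(predictions)
print(vibes.to_dict())
```

The resulting label per `business_id` could then be attached to that restaurant's photos. Majority voting is only one option; averaging class probabilities would be another.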

**(TODO: INSERT AN IMAGE OF THE WORKFLOW I AM GOING TO FOLLOW)**

## Yelp Dataset

We'll be working with the Yelp dataset. The Yelp Open Dataset is an extensive collection provided by Yelp for academic and research purposes. It offers a rich source of data from Yelp's vast repository of local business reviews and user interactions. It contains:

1. **Businesses:**
Basic information about local businesses, including their name, location (address, city, state, postal code), latitude and longitude, average rating, category (e.g., Restaurants, Shopping, Beauty & Spas), and the number of reviews they have received. This allows for analyses focused on geographical trends, category-based studies, and more.

2. **Reviews:**
Millions of text reviews written by Yelp users for various businesses. Each review includes the user's text review, the star rating they gave, and the date of the review. This data is invaluable for sentiment analysis, natural language processing tasks, and understanding consumer preferences.

3. **Users:**
Information about Yelp users who have written reviews, including their user ID, name, review count, yelping since (the date they joined Yelp), friends, and other social metrics. This can be used for social network analysis, studying user behavior, and personalization algorithms.

4. **Check-ins:**
Data about check-ins at businesses by Yelp users, which can help in analyzing foot traffic trends, popular times, and loyalty or frequency of visits.

5. **Tips:**
Short notes left by users about a business. Tips can contain advice, recommendations, or comments about what's good or what to avoid. This dataset is useful for extracting quick insights or highlights about businesses.

6. **Photos:**
Metadata about photos associated with businesses, including a photo ID, business ID, caption, and labels indicating the photo category (e.g., food, interior, exterior).
This can support visual analysis of businesses or complement review-based insights as in our case.

## Loading Yelp data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!unzip "/content/drive/MyDrive/yelp.zip" -d "/content/data/"

Archive:  /content/drive/MyDrive/yelp.zip
  inflating: /content/data/Dataset_User_Agreement.pdf  
  inflating: /content/data/yelp_academic_dataset_business.json  
  inflating: /content/data/yelp_academic_dataset_checkin.json  
  inflating: /content/data/yelp_academic_dataset_review.json  
  inflating: /content/data/yelp_academic_dataset_tip.json  
  inflating: /content/data/yelp_academic_dataset_user.json  


## EDA

Let's explore the data a bit. We just want to analyse restaurants.
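
Before loading a whole file, it can help to look at a few records first. The dataset files are JSON Lines (one JSON object per line), so the first records can be read without parsing the rest. This is a minimal sketch: the `peek` helper is hypothetical, and the demo runs on made-up records written to a temporary file rather than on the real Yelp files.

```python
import json
import tempfile
from itertools import islice
from pathlib import Path

def peek(path, n=3):
    """Return the first n records of a JSON Lines file as dicts."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in islice(fh, n)]

# Made-up records, only for the demo.
sample = [
    {"business_id": "b1", "name": "Cafe Uno", "categories": "Restaurants, Italian"},
    {"business_id": "b2", "name": "Nail Bar", "categories": "Beauty & Spas"},
]
demo_path = Path(tempfile.mkdtemp()) / "demo_business.json"
demo_path.write_text("\n".join(json.dumps(r) for r in sample), encoding="utf-8")

records = peek(demo_path, n=1)
print(sorted(records[0].keys()))
```

Calling `peek` on `/content/data/yelp_academic_dataset_business.json` in the same way would show the available fields without reading the full file into memory.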

In [None]:
import pandas as pd

json_file_path = '/content/data/yelp_academic_dataset_review.json'  # Update this to your file path

# Define a list to hold chunks of the DataFrame
df_list = []

# Use a chunk size that fits well into your memory. Adjust as necessary.
chunksize = 10 ** 5

# Read the JSON file in chunks
with pd.read_json(json_file_path, lines=True, chunksize=chunksize) as reader:
    for chunk in reader:
        # Filter out only the 'business_id' and 'text' columns
        filtered_chunk = chunk[['business_id', 'text']]
        df_list.append(filtered_chunk)

# Concatenate all chunks into a single DataFrame
df = pd.concat(df_list, ignore_index=True)
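
As a possible alternative, the restaurant filter could be applied while reading, so that non-restaurant reviews never accumulate in memory. The sketch below assumes the restaurant IDs are already known; `restaurant_ids` is a hypothetical set, and the demo reads made-up rows from a temporary file instead of the real review file.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Made-up reviews written to a temporary JSON Lines file for the demo.
rows = [
    {"business_id": "b1", "text": "great pasta", "stars": 5},
    {"business_id": "b2", "text": "nice manicure", "stars": 4},
    {"business_id": "b1", "text": "cozy place", "stars": 4},
]
demo_path = Path(tempfile.mkdtemp()) / "demo_review.json"
demo_path.write_text("\n".join(json.dumps(r) for r in rows), encoding="utf-8")

# Hypothetical: in practice this set would be built from the business file.
restaurant_ids = {"b1"}

kept = []
with pd.read_json(demo_path, lines=True, chunksize=2) as reader:
    for chunk in reader:
        # Keep only rows for known restaurants, and only the needed columns.
        mask = chunk["business_id"].isin(restaurant_ids)
        kept.append(chunk.loc[mask, ["business_id", "text"]])

small_df = pd.concat(kept, ignore_index=True)
print(small_df["text"].tolist())
```

This requires loading the business file before the reviews, which is the reverse of the order used in this notebook.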

In [None]:
csv_file_path = '/content/drive/MyDrive/yelp_reviews.csv'  # Same path that is read back in the next cell
df.to_csv(csv_file_path, index=False)

In [None]:
import pandas as pd
data_merged = pd.read_csv("/content/drive/MyDrive/yelp_reviews.csv")

In [None]:


# Path to your business JSON file
business_json_path = '/content/data/yelp_academic_dataset_business.json'

# Load the business data
business_df = pd.read_json(business_json_path, lines=True)

# Filter for restaurants
# Note: Adjust the condition as necessary to accurately capture all restaurants in your dataset
restaurant_df = business_df[business_df['categories'].str.contains('Restaurants', case=False, na=False)]

In [None]:
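# Inner join: keep only reviews whose business_id belongs to a restaurant.
# (A left join would also keep non-restaurant reviews, with NaN in the restaurant columns.)
final_df = pd.merge(data_merged, restaurant_df[['business_id', 'name', 'longitude', 'latitude', 'review_count']], on='business_id', how='inner')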
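
The choice of `how` in this merge decides which reviews survive. A left merge keeps every review row, while an inner merge keeps only reviews whose `business_id` appears in the restaurant table. The toy example below uses made-up data, with the `indicator` option counting the review rows that have no matching restaurant.

```python
import pandas as pd

# Made-up data: two reviews for a restaurant, one for a non-restaurant.
reviews = pd.DataFrame({
    "business_id": ["b1", "b1", "b2"],
    "text": ["great pasta", "cozy place", "nice manicure"],
})
restaurants = pd.DataFrame({"business_id": ["b1"], "name": ["Cafe Uno"]})

left = pd.merge(reviews, restaurants, on="business_id", how="left", indicator=True)
inner = pd.merge(reviews, restaurants, on="business_id", how="inner")

# Rows marked "left_only" are reviews without a matching restaurant.
unmatched = int((left["_merge"] == "left_only").sum())
print(len(left), len(inner), unmatched)  # → 3 2 1
```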

In [None]:
csv_file_path = '/content/drive/MyDrive/yelp_restaurants.csv'  # Specify your desired path
final_df.to_csv(csv_file_path, index=False)

In [None]:
sorted_df = final_df.sort_values(by='business_id', ascending=True)

## Exploring some NLP models

### RoBERTa-Large Variant

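Let's try loading this model, a RoBERTa variant published on the Hugging Face Hub for aspect extraction in restaurant reviews.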

In [None]:


# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("AliHaider0343/implicit-and-explicit-aspects-Extraction-in-Restaurant-Reviews-Domain")
model = AutoModelForMaskedLM.from_pretrained("AliHaider0343/implicit-and-explicit-aspects-Extraction-in-Restaurant-Reviews-Domain")

OSError: Can't load tokenizer for 'AliHaider0343/implicit-and-explicit-aspects-Extraction-in-Restaurant-Reviews-Domain'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'AliHaider0343/implicit-and-explicit-aspects-Extraction-in-Restaurant-Reviews-Domain' is the correct path to a directory containing all relevant files for a RobertaTokenizerFast tokenizer.

In [None]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="AliHaider0343/implicit-and-explicit-aspects-Extraction-in-Restaurant-Reviews-Domain")

ValueError: Could not load model AliHaider0343/implicit-and-explicit-aspects-Extraction-in-Restaurant-Reviews-Domain with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForSequenceClassification'>, <class 'transformers.models.auto.modeling_tf_auto.TFAutoModelForSequenceClassification'>, <class 'transformers.models.roberta.modeling_roberta.RobertaForMaskedLM'>, <class 'transformers.models.roberta.modeling_tf_roberta.TFRobertaForMaskedLM'>). See the original errors:

while loading with AutoModelForSequenceClassification, an error is thrown:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/transformers/pipelines/base.py", line 269, in infer_framework_load_model
    model = model_class.from_pretrained(model, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3098, in from_pretrained
    raise EnvironmentError(
OSError: AliHaider0343/implicit-and-explicit-aspects-Extraction-in-Restaurant-Reviews-Domain does not appear to have a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.

while loading with TFAutoModelForSequenceClassification, an error is thrown:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/transformers/pipelines/base.py", line 269, in infer_framework_load_model
    model = model_class.from_pretrained(model, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_tf_utils.py", line 2829, in from_pretrained
    raise EnvironmentError(
OSError: AliHaider0343/implicit-and-explicit-aspects-Extraction-in-Restaurant-Reviews-Domain does not appear to have a file named pytorch_model.bin, tf_model.h5 or model.ckpt

while loading with RobertaForMaskedLM, an error is thrown:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/transformers/pipelines/base.py", line 269, in infer_framework_load_model
    model = model_class.from_pretrained(model, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3098, in from_pretrained
    raise EnvironmentError(
OSError: AliHaider0343/implicit-and-explicit-aspects-Extraction-in-Restaurant-Reviews-Domain does not appear to have a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.

while loading with TFRobertaForMaskedLM, an error is thrown:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/transformers/pipelines/base.py", line 269, in infer_framework_load_model
    model = model_class.from_pretrained(model, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_tf_utils.py", line 2829, in from_pretrained
    raise EnvironmentError(
OSError: AliHaider0343/implicit-and-explicit-aspects-Extraction-in-Restaurant-Reviews-Domain does not appear to have a file named pytorch_model.bin, tf_model.h5 or model.ckpt




In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "AliHaider0343/implicit-and-explicit-aspects-Extraction-in-Restaurant-Reviews-Domain"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

OSError: Can't load tokenizer for 'AliHaider0343/implicit-and-explicit-aspects-Extraction-in-Restaurant-Reviews-Domain'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'AliHaider0343/implicit-and-explicit-aspects-Extraction-in-Restaurant-Reviews-Domain' is the correct path to a directory containing all relevant files for a RobertaTokenizerFast tokenizer.
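
The errors above suggest that the repository contains neither model weights nor tokenizer files, so no choice of `Auto*` class will load it. One way to check this before calling `from_pretrained` is to list the files in the repository. The sketch below is a rough heuristic: the file-name sets and the `diagnose` and `diagnose_repo` helpers are assumptions made for illustration, and only `list_repo_files` comes from the `huggingface_hub` library.

```python
# Common file names for weights and tokenizers (not an exhaustive list;
# sharded or nested files would not be detected by this simple check).
WEIGHT_FILES = {"pytorch_model.bin", "model.safetensors", "tf_model.h5", "flax_model.msgpack"}
TOKENIZER_FILES = {"tokenizer.json", "vocab.json", "merges.txt", "tokenizer_config.json"}

def diagnose(files):
    """Report whether a file listing contains weights and tokenizer files."""
    names = set(files)
    return {
        "has_weights": bool(names & WEIGHT_FILES),
        "has_tokenizer": bool(names & TOKENIZER_FILES),
    }

def diagnose_repo(repo_id):
    """Same check against a Hub repository (needs network and huggingface_hub)."""
    from huggingface_hub import list_repo_files
    return diagnose(list_repo_files(repo_id))

# Offline demo with a made-up file listing.
print(diagnose(["README.md", "config.json"]))
```

Running `diagnose_repo` with the model ID used above would show whether the repository is missing these files; if it is, a different checkpoint is needed.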