# Challenge Name

The goal of this challenge is to use data from Google Maps restaurant reviews to build a Natural Language Processing (NLP) pipeline. For example, you could train a Deep Learning model to generate new random reviews, or you could try to predict the number of stars a reviewer gave a restaurant from the review they wrote.

If you prefer, you can also use this dataset for data analysis or visualization. For example, you could show the relationship between a restaurant's location, price and average rating. However, we encourage you to try some NLP/ML tasks. Even if you have no previous experience in Deep Learning or Machine Learning, you can try to learn the basics over the weekend. At the end of the notebook we share some resources we think might be helpful.

The possibilities are endless. This notebook begins with a brief description of the dataset and continues with an example NLP task, with some tips on how to tackle this challenge. At the end, we link to some resources for those of you who want to learn more about NLP and ML.

## Taking a look at our data

First of all let's take a look at the data:

In [34]:
import json # load data
import nltk # tokenizer

In [35]:
# Open the file in a context manager so it is closed after loading
with open("reviews_data.json", encoding="utf-8") as f:
    json_data = json.load(f)

In [36]:
print("Number of records (restaurants): ",len(json_data))

Number of records (restaurants):  300


#### Example record

In [38]:
json_data[0]

{'position': 1,
 'title': 'Shalimar Braseria (Bar y Restaurante)',
 'place_id': 'ChIJla091gKZpBIRKfT7KEr1a9M',
 'data_id': '0x12a49902d63dad95:0xd36bf54a28fbf429',
 'data_cid': '15234539863374820393',
 'gps_coordinates': {'latitude': 41.3724499, 'longitude': 2.1033573},
 'rating': 4.5,
 'reviews': 140,
 'price': '$',
 'type': 'Restaurant',
 'address': "Carrer de les Aigües del Llobregat, 116, 08906 L'Hospitalet de Llobregat, Barcelona, Spain",
 'open_state': 'Open ⋅ Closes 10PM',
 'hours': 'Open until 10:00 PM',
 'phone': '+34 667 78 80 31',
 'service_options': {'dine_in': True, 'takeout': True, 'delivery': True},
 'thumbnail': 'https://lh5.googleusercontent.com/p/AF1QipM4UFPIkJBPaN6kZKHZBT_A64trnWoNJEHvMrrZ=w163-h92-k-no',
 'reviews_data': [{'user': {'name': 'Sonu Sonu',
    'link': 'https://www.google.com/maps/contrib/117302158031083735630?hl=en-CA&sa=X&ved=2ahUKEwir5ZGs5P7zAhXqgf0HHYVXAJQQvvQBegQIARAh',
    'thumbnail': 'https://lh3.googleusercontent.com/a/AATXAJzE052d2Vlg4Es2OsNZBX

In [100]:
reviews=json_data[0]["reviews_data"] # reviews for restaurant 0
print(len(reviews), "reviews per restaurant")

130 reviews per restaurant


If every restaurant has 130 reviews like this one, we have 300 × 130 = 39,000 reviews in total.
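
The total above assumes every restaurant has the same number of reviews as the first one. A minimal sketch to check that directly, assuming each record has a `reviews_data` list like the example record (the `count_reviews` helper and the tiny `sample` list are illustrative and not part of the dataset):

```python
def count_reviews(records):
    """Return (total number of reviews, set of distinct per-restaurant counts)."""
    # .get() with a default guards against records that lack a "reviews_data" key.
    counts = [len(record.get("reviews_data", [])) for record in records]
    return sum(counts), set(counts)


# Hypothetical stand-in for json_data, with the same nesting as the real records.
sample = [
    {"title": "A", "reviews_data": [{"snippet": "good"}, {"snippet": "bad"}]},
    {"title": "B", "reviews_data": [{"snippet": "ok"}]},
]
total, distinct_counts = count_reviews(sample)
print(total, distinct_counts)
```

On the real data, `count_reviews(json_data)` gives the exact total. If the second value contains more than one number, the restaurants do not all have the same number of reviews.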

#### Example review

In [101]:
reviews[0]

{'user': {'name': 'Sonu Sonu',
  'link': 'https://www.google.com/maps/contrib/117302158031083735630?hl=en-CA&sa=X&ved=2ahUKEwir5ZGs5P7zAhXqgf0HHYVXAJQQvvQBegQIARAh',
  'thumbnail': 'https://lh3.googleusercontent.com/a/AATXAJzE052d2Vlg4Es2OsNZBXfDXFWtyinRhNHInqf2=s40-c-c0x00000000-cc-rp-mo-br100',
  'reviews': 1},
 'rating': 1.0,
 'date': '3 months ago',
 'snippet': 'Very bad taste , always sahi paneer is prepared with the sauce of cashew but i ate with tomato sauce very bad experience regarding taste but service ok'}

### Example task: Sentiment Analysis

You can either train an NLP model (TODO: add an intro to ML, DL and NLP and link resources) from scratch using all 39000 reviews, or you can download a pretrained NLP model from the Internet (we recommend [Hugging Face](https://huggingface.co/)) and fine-tune it for the task you select.

In this example, we show a Deep Learning model you can get from Hugging Face. Some documentation for the library *transformers* can be found [here](https://huggingface.co/transformers/quicktour.html#getting-started-on-a-task-with-a-pipeline). The [model](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) we are loading is called “distilbert-base-uncased-finetuned-sst-2-english”. 

In [48]:
from transformers import pipeline
sentiment_analysis = pipeline("sentiment-analysis")
print(sentiment_analysis("I love this!"))

All model checkpoint layers were used when initializing TFDistilBertForSequenceClassification.

All the layers of TFDistilBertForSequenceClassification were initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


[{'label': 'POSITIVE', 'score': 0.9998764991760254}]


In [108]:
r=reviews[6]
print(r["rating"], " stars review:\n", r["snippet"], sep="")
print("Prediction:", sentiment_analysis(r["snippet"]))

2.0 stars review:
Many items on the menu were not Available and the food quality was low. The curry was all bones and the portions were not big enough. Service was good from one person but then another waiter got involved and wasnt good. Expected more after looking at the reviews and pictures but done expect much from this place
Prediction: [{'label': 'NEGATIVE', 'score': 0.9983540177345276}]


In [109]:
r=reviews[1]
print(r["rating"], " stars review:\n", r["snippet"], sep="")
print("Prediction:", sentiment_analysis(r["snippet"]))

5.0 stars review:
Very good value. Tasty grilled chicken.
Prediction: [{'label': 'POSITIVE', 'score': 0.9996946454048157}]


As you can see, it labels a bad review (2 stars) as NEGATIVE and a good review (5 stars) as POSITIVE. Feel free to try other reviews or your own texts.
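
Two reviews are only anecdotal evidence, so it may be worth measuring how often the predicted label agrees with the star rating. The sketch below is one possible way to do that. The thresholds (4 stars or more counts as POSITIVE, 2 or fewer as NEGATIVE, 3 is skipped) are an assumption, and `star_label`, `agreement` and `toy_predict` are hypothetical helpers. A keyword-based toy predictor is used so the example runs without downloading a model.

```python
def star_label(rating):
    """Map a star rating to a coarse sentiment label (3 stars is treated as neutral)."""
    if rating >= 4:
        return "POSITIVE"
    if rating <= 2:
        return "NEGATIVE"
    return None


def agreement(reviews, predict):
    """Fraction of non-neutral reviews where predict(snippet) matches the star-based label."""
    hits = total = 0
    for review in reviews:
        expected = star_label(review["rating"])
        text = review.get("snippet")
        if expected is None or not text:
            continue  # skip 3-star reviews and reviews without text
        total += 1
        hits += predict(text) == expected
    return hits / total if total else 0.0


# Toy keyword predictor so the sketch runs without downloading a model.
def toy_predict(text):
    return "NEGATIVE" if "bad" in text.lower() else "POSITIVE"


sample_reviews = [
    {"rating": 1.0, "snippet": "Very bad taste"},
    {"rating": 5.0, "snippet": "Very good value. Tasty grilled chicken."},
    {"rating": 3.0, "snippet": "It was fine"},
    {"rating": 2.0, "snippet": "Food quality was low"},
]
score = agreement(sample_reviews, toy_predict)
print(score)
```

To evaluate the Hugging Face model instead, you could pass `lambda text: sentiment_analysis(text)[0]["label"]` as `predict` and the `reviews` list from above as the first argument.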

So, given this model, you have several options. For example, you could replace the last layer of the model to perform multiclass classification and fine-tune it to predict the rating of the review. Another option is to extract the representations that this model generates in its hidden intermediate layers and use them in an ML pipeline with other algorithms. Some info on how to fine-tune your models can be found [here](https://huggingface.co/transformers/training.html). As previously mentioned, this is just one example; many other tasks and models (even from scratch if you want) can be trained.

However, it's not all about Deep Learning. The suggestion above might seem daunting if you've never done any DL before. Luckily for us, there are other ways to generate numerical representations from texts. For example, you could use tf-idf (TODO: add info about this) to compute a representation for every review. Then you could use this representation to measure the similarity between texts or to set up an ML pipeline using k-NN or other algorithms of your choice.
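
To make the tf-idf idea concrete, here is a minimal from-scratch sketch using only the standard library. It weights each term by its frequency in the review times `log(N / df)`, then compares reviews with cosine similarity. The `tokenize`, `tfidf_vectors` and `cosine` helpers and the three example texts are illustrative assumptions. In practice a library such as scikit-learn's `TfidfVectorizer` would normally do this for you, with slightly different weighting.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters (a crude stand-in for a real tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(texts):
    """Return one {term: weight} dict per text, using tf * log(N / df)."""
    docs = [Counter(tokenize(text)) for text in texts]
    n_docs = len(docs)
    # Document frequency: in how many texts each term appears.
    df = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        length = sum(doc.values()) or 1
        vectors.append({
            term: (count / length) * math.log(n_docs / df[term])
            for term, count in doc.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


texts = [
    "Tasty grilled chicken and very good value",
    "Very bad taste, bad experience",
    "Great grilled chicken, tasty food",
]
vecs = tfidf_vectors(texts)
print(cosine(vecs[0], vecs[2]), cosine(vecs[0], vecs[1]))
```

The first and third texts share more terms, so their similarity comes out higher than that of the first and second. For a simple k-NN rating predictor, you could rank all reviews by cosine similarity to a new text and average the ratings of the closest ones.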

In [89]:
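# Other odds and ends
# Note: word_tokenize needs NLTK's tokenizer data; if it is missing, nltk.download("punkt") may be required first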

nltk_tokens = nltk.word_tokenize(json_data[1]["reviews_data"][0]["snippet"])
print(nltk_tokens)

['Amazing', 'discovery', '!', 'We', 'had', 'the', 'degustation', 'menu', 'of', '5', 'dishes', 'and', 'OMG', '!', 'Delicious', 'food', 'and', 'our', 'waiter', 'was', 'super', 'nice', '.', 'The', 'focaccia', 'is', 'unbelievable', 'and', 'I', 'also', 'recommend', 'to', 'try', 'Aperol', 'as', 'they', 'do', 'it', 'with', 'Prosecco', 'like', 'in', 'Italy', '!', 'We', 'will', 'come', 'back', '!', 'The', 'menu', 'with', '5', 'dishes', '+', 'desert', 'is', 'more', 'than', 'enough', '!', 'We', 'were', 'so', 'full', 'and', 'so', 'happy', '!', 'Thanks', 'so', 'much', '!']
