# Homework: Sentiment Analysis with Yelp Review Dataset

https://huggingface.co/datasets/Yelp/yelp_review_full

## What is the Yelp Dataset?

This dataset is derived from Yelp reviews, where each review expresses a sentiment (1 to 5 stars) about a particular service, product, or experience. The task focuses on analyzing these reviews to extract the sentiment conveyed.





## Motivation

Yelp is a platform where users share their experiences and opinions about various businesses, such as restaurants, stores, and services. By analyzing these reviews, we can gain insights into customer satisfaction, identify trends in consumer behavior, and understand the general perception of different businesses. This analysis can be valuable for businesses aiming to improve their services based on customer feedback.


## Problem Statement

The task is to classify each review based on its star rating, ranging from 1 to 5 stars, reflecting the sentiment expressed by the user.
This analysis will help understand the general perception of various businesses and services based on user feedback, providing valuable insights into
customer satisfaction and areas for improvement for businesses.



## What Do We Expect from You in This Assignment?

We expect you to use NLP techniques and potentially deep learning methods to analyze the text data from Yelp reviews. Your goal is to accurately classify each review based on its star rating, ranging from 1 to 5 stars.
This classification will help interpret the sentiment expressed in each review, giving insights into customer satisfaction levels across different businesses.


## Dataset Information

The Yelp dataset consists of two files:
- `yelp_review_train.csv`: Training dataset containing labeled reviews for model training.
- `yelp_review_test.csv`: Validation dataset for evaluating the model's performance on unseen data.

Each review is associated with a `label` ranging from 0 to 4, where:
- `label 0`: 1 star
- `label 1`: 2 stars
- `label 2`: 3 stars
- `label 3`: 4 stars
- `label 4`: 5 stars

The code provided below includes a step to map these labels to their corresponding star ratings for better interpretability.

## If you have any questions about the homework, you can contact us at the following e-mail addresses:



*   burcusunturlu@gmail.com
*   ozgeflzcn@gmail.com



## 1 - Import Libraries

Main libraries you can use to build your model (feel free to use any other libraries you find helpful):

*   Pandas
*   Numpy
*   Sklearn
*   nltk
*   keras

## 2 - Importing the Data (65 points)

## 2.1 - Loading the Data


*   Load the training and validation datasets from the CSV files.


In [None]:
import pandas as pd

# Load the datasets
train_df = pd.read_csv('yelp_review_train.csv')
val_df = pd.read_csv('yelp_review_test.csv')

# Map labels to star ratings
label_to_star = {0: '1 star', 1: '2 stars', 2: '3 stars', 3: '4 stars', 4: '5 stars'}
train_df['star_rating'] = train_df['label'].map(label_to_star)
val_df['star_rating'] = val_df['label'].map(label_to_star)

# Display the first few rows to confirm the mapping
print(train_df.head())
print(val_df.head())

## 2.2 - Exploratory Data Analysis (EDA) (25 points)

Please investigate your data as follows:
* Understand the classes.
* Check distributions.
* Check null values.
* Drop unnecessary columns (e.g., unrelated metadata).
* Visualize the data distribution across sentiment categories (1 to 5 stars).
* Consider creating other insightful visualizations from the dataset, such as analyzing the average star rating across different categories, frequently used words in positive versus negative reviews, or creating word clouds for each star rating.
* What trends or patterns can you identify in the Yelp reviews' star ratings? For instance, are there more positive (4-5 stars) or negative (1-2 stars) reviews?
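As a starting point for the checks above, here is a minimal sketch that summarizes null values, class counts, and average review length per class. It assumes the review text lives in a column named `text` (the `label` column appears in the loading code; the `text` name is an assumption, so adjust it to match your CSV). The helper name `summarize_reviews` and the toy DataFrame are hypothetical and only stand in for `train_df`.

```python
import pandas as pd

def summarize_reviews(df, label_col="label", text_col="text"):
    """Return basic EDA statistics for a review DataFrame."""
    # Number of whitespace-separated words per review (NaN for missing text)
    word_counts = df[text_col].str.split().str.len()
    return {
        # Missing values per column
        "null_counts": df.isnull().sum().to_dict(),
        # Number of reviews per label, ordered by label
        "class_counts": df[label_col].value_counts().sort_index().to_dict(),
        # Average review length (in words) for each label
        "mean_words_by_class": word_counts.groupby(df[label_col]).mean().to_dict(),
    }

# Toy data standing in for train_df (hypothetical values)
toy_df = pd.DataFrame({
    "label": [0, 4, 4, 2],
    "text": ["bad food", "great place loved it", "amazing", None],
})
summary = summarize_reviews(toy_df)
print(summary["null_counts"])
print(summary["class_counts"])
```

The class counts can then be passed to a bar chart to visualize the distribution across star ratings.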


## 2.3 - Data Preparation (25 points)

* Clean the comments. Remove irrelevant characters (e.g., URLs, mentions). Normalize the text (lowercasing, removing punctuation, etc.).
* Remove or keep stopwords, and justify your choice.
* Tokenize the comments.
* Lemmatize the comments.
* Vectorization.
* Word count analysis and outlier detection.
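The cleaning and tokenization steps above could be sketched as follows, using only the standard library. The regular expressions, the tiny `STOPWORDS` set, and the function names are illustrative assumptions, not a required implementation; in practice you might take the stopword list from nltk instead. Lemmatization is left out here because nltk's `WordNetLemmatizer` needs the WordNet corpus to be downloaded first.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Small illustrative stopword set (hypothetical; replace with a fuller list)
STOPWORDS = {"the", "a", "an", "and", "is", "was", "at", "this", "to", "of"}
# Translation table that turns every punctuation character into a space
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_text(text):
    """Lowercase, strip URLs and mentions, drop punctuation, collapse spaces."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text, remove_stopwords=True):
    """Split cleaned text on whitespace, optionally filtering stopwords."""
    tokens = clean_text(text).split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

print(clean_text("Great food!! Visit https://example.com @bob"))
print(tokenize("The pizza was AMAZING, and cheap."))
```

The length of each token list also gives you the per-review word count needed for the outlier analysis.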

## 2.4 - TF-IDF (Term Frequency - Inverse Document Frequency) (15 points)

* Explain term frequency (TF) and inverse document frequency (IDF), and how they combine into TF-IDF.
* Apply TF-IDF to the review text.
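To make the two quantities concrete, here is a small hand-rolled sketch of the textbook definition: TF is a term's count divided by the document length, and IDF is the natural log of (number of documents / number of documents containing the term). The function name `tf_idf` is hypothetical. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ from this unsmoothed version.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    docs: list of token lists.
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / count) for term, count in doc_freq.items()}
    weights = []
    for doc in docs:
        counts = Counter(doc)
        # TF (relative frequency in this document) times IDF
        weights.append({t: (c / len(doc)) * idf[t] for t, c in counts.items()})
    return weights

toy_docs = [["good", "food"], ["bad", "food"]]
for doc_weights in tf_idf(toy_docs):
    print(doc_weights)
```

In the toy example, "food" appears in every document, so its IDF (and therefore its weight) is zero, while "good" and "bad" each receive a positive weight.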

## 3 - Training Deep Learning Models (30 points)

* Import relevant libraries.
* Explain the differences between Neural Networks (NN), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN).

In [None]:
from keras.models import Sequential
from keras.layers import Conv1D, GlobalMaxPooling1D, Flatten
from keras.layers import Dense, Input, Embedding, Dropout, Activation

## 3.1 - Training NN, RNN and CNN models

* Construct models starting from a simple neural network (NN) with a single layer, and incrementally add layers to build more complex architectures. Include experiments with RNN and CNN models, and analyze the performance differences among them.
* Experiment with different activation functions, optimizers, and regularization techniques (such as dropout rates). For each trial, document the effects of these changes. For example, observe how adding or removing layers, changing activation functions, or adjusting dropout rates impacts performance.
* Tune hyperparameters like learning rate, number of layers, and dropout percentage. Explain how each adjustment affects overfitting, underfitting, and generalization on the test data.
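One possible way to organize these experiments is a single builder that shares the embedding and output layers and swaps only the middle of the network. The sketch below is an assumption-laden starting point, not a reference solution: the function name `build_model`, the layer sizes, the default `vocab_size` and `max_len`, and the use of `LSTM` as the recurrent layer are all illustrative choices. It expects integer-encoded, padded token sequences as input and the 0 to 4 integer labels as targets.

```python
from keras import Input
from keras.models import Sequential
from keras.layers import (Conv1D, Dense, Dropout, Embedding,
                          GlobalAveragePooling1D, GlobalMaxPooling1D, LSTM)

def build_model(kind, vocab_size=20000, max_len=200, embed_dim=64,
                num_classes=5, dropout=0.3):
    """Build a small text classifier; kind is 'nn', 'cnn', or 'rnn'."""
    model = Sequential([Input(shape=(max_len,)),
                        Embedding(vocab_size, embed_dim)])
    if kind == "nn":
        # Plain feed-forward baseline: average the word embeddings
        model.add(GlobalAveragePooling1D())
    elif kind == "cnn":
        # Convolution over windows of 5 tokens, then max over time
        model.add(Conv1D(64, 5, activation="relu"))
        model.add(GlobalMaxPooling1D())
    elif kind == "rnn":
        # Recurrent layer that reads the sequence token by token
        model.add(LSTM(64))
    else:
        raise ValueError(f"unknown model kind: {kind}")
    model.add(Dropout(dropout))
    model.add(Dense(num_classes, activation="softmax"))
    # Integer labels 0-4 pair with sparse categorical cross-entropy
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

for kind in ("nn", "cnn", "rnn"):
    print(kind, build_model(kind, vocab_size=1000, max_len=50).count_params())
```

Keeping everything except the middle layers fixed makes it easier to attribute a performance difference to the architecture rather than to an unrelated hyperparameter.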

## 4 - Model Evaluation on the Validation Set (10 points)

* Evaluate the best model's performance on the validation set using a Confusion Matrix along with metrics such as accuracy, precision, recall, and F1-score. How well does the model generalize to new data based on these metrics?
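The metrics listed above can be computed with scikit-learn, as in the following sketch. The helper name `evaluate` and the toy label lists are hypothetical; in the notebook, `y_true` would be the validation labels and `y_pred` the argmax of your model's predicted probabilities. Macro averaging is used here as one reasonable choice for a five-class problem, since it weights every star rating equally.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

def evaluate(y_true, y_pred, labels=(0, 1, 2, 3, 4)):
    """Return accuracy, macro precision/recall/F1, and the confusion matrix."""
    labels = list(labels)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=labels, average="macro", zero_division=0)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_precision": precision,
        "macro_recall": recall,
        "macro_f1": f1,
        # Rows are true labels, columns are predicted labels
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=labels),
    }

# Toy example: one 5-star review (label 4) misclassified as 4 stars (label 3)
toy_true = [0, 1, 2, 3, 4, 4]
toy_pred = [0, 1, 2, 3, 4, 3]
results = evaluate(toy_true, toy_pred)
print(results["accuracy"])
print(results["confusion_matrix"])
```

Off-diagonal entries close to the diagonal (for example, 5 stars predicted as 4 stars) indicate near misses, which are usually less serious than confusing 1-star and 5-star reviews.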



## 4.1 - Testing with your Own Input

* You can test the trained model by inputting your own sentences to predict the sentiment:


In [None]:
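# Example of testing with a custom input sentence.
# NOTE: `predict` is not defined in this notebook. Implement it yourself so that it
# applies the same preprocessing as training and returns a (prediction, sentiment) pair.
sentence = 'The food and ambiance at this restaurant were fantastic!'
tmp_pred, tmp_sentiment = predict(sentence)
print(f'The predicted sentiment for the review is: {tmp_sentiment} (based on star ratings)')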

## 5 - Bonus - Adding a Transformer Layer to the NN Model (15 points)

* How can a transformer layer be added to the NN model created in Section 3? Research and implement a solution. Document your findings, including how the transformer layer integrates with the existing architecture, its impact on model performance, and any adjustments required. Explain your approach and reasoning based on your research.
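As one possible direction for your research, the sketch below inserts a single transformer encoder block (self-attention plus a feed-forward sublayer, each with a residual connection and layer normalization) between the embedding and the pooling layer. Everything here is an assumption for illustration: the function name `build_transformer_classifier`, the layer sizes, and the default `vocab_size` and `max_len`. It also omits positional embeddings, which a full transformer would normally add so the model can use word order.

```python
from keras import Input, Model
from keras.layers import (Add, Dense, Dropout, Embedding,
                          GlobalAveragePooling1D, LayerNormalization,
                          MultiHeadAttention)

def build_transformer_classifier(vocab_size=20000, max_len=200, embed_dim=64,
                                 num_heads=2, ff_dim=128, num_classes=5):
    """Embedding -> one transformer encoder block -> pooling -> softmax."""
    inputs = Input(shape=(max_len,), dtype="int32")
    x = Embedding(vocab_size, embed_dim)(inputs)

    # Self-attention sublayer: the sequence attends to itself
    attn = MultiHeadAttention(num_heads=num_heads,
                              key_dim=embed_dim // num_heads)(x, x)
    x = LayerNormalization()(Add()([x, attn]))

    # Position-wise feed-forward sublayer, projected back to embed_dim
    ff = Dense(ff_dim, activation="relu")(x)
    ff = Dense(embed_dim)(ff)
    x = LayerNormalization()(Add()([x, ff]))

    # Collapse the sequence dimension and classify into 5 star ratings
    x = GlobalAveragePooling1D()(x)
    x = Dropout(0.1)(x)
    outputs = Dense(num_classes, activation="softmax")(x)

    model = Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_transformer_classifier(vocab_size=1000, max_len=50)
print(model.output_shape)
```

Because the block keeps the `(max_len, embed_dim)` shape of its input, it can be dropped into an existing embedding-based model without changing the layers around it.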

## Additional Notes

* Ensure all models and visualizations are well-commented.
* Include explanations for key steps such as tokenization, vectorization, hyperparameter tuning, and model selection.
* Please complete your homework using this notebook.