# Homework: Sentiment Analysis with Yelp Review Dataset

## What is the Yelp Dataset?

This dataset is derived from Yelp reviews, where each review expresses a sentiment (positive, negative, or neutral) about a particular service, product, or experience. The task focuses on analyzing these reviews to extract the sentiment conveyed.

## Dataset Information

The Yelp dataset consists of two files:
- `train.csv`: Training dataset containing labeled reviews for model training.
- `val.csv`: Validation dataset for evaluating the model's performance on unseen data.

Each review is associated with a `label` ranging from 0 to 4, where:
- `label 0`: 1 star
- `label 1`: 2 stars
- `label 2`: 3 stars
- `label 3`: 4 stars
- `label 4`: 5 stars

The code provided below includes a step to map these labels to their corresponding star ratings for better interpretability.

## Data Exploration and Preprocessing

### Missing Values

- **Question**: Are there any missing values in the Yelp reviews? Explain your approach to handling missing data.

Before processing, it is essential to check for any missing values in the dataset. Handling these can be crucial to avoid errors during model training.
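
The check described above can be sketched with pandas. This is a minimal illustration, not a prescribed solution: the `text` column name and the tiny inline DataFrame are assumptions standing in for `train.csv`, and dropping rows is only one of several reasonable strategies.

```python
import pandas as pd

# Toy stand-in for train.csv; the 'text' column name is an assumption.
toy_df = pd.DataFrame({
    "text": ["Great food!", None, "   ", "Slow service."],
    "label": [4, 2, 0, 1],
})

# Count missing values per column.
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# Treat whitespace-only reviews as missing too, then drop rows without text.
text_is_blank = toy_df["text"].fillna("").str.strip().eq("")
clean_df = toy_df.loc[~text_is_blank].reset_index(drop=True)
print(f"Kept {len(clean_df)} of {len(toy_df)} rows")
```

Whichever approach you choose, report how many rows were affected so the decision can be judged against the dataset size.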

### Cleaning Text Data

- **Question**: How would you clean special characters, links, or emojis from the reviews?

To enhance model performance, we need to clean the text data by removing any irrelevant elements like links, special characters, or emojis. This can be achieved using regular expressions or libraries such as `re` and `emoji` in Python.
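
A minimal sketch of regex-based cleaning, using only the standard-library `re` module. The character whitelist is an assumption that suits English reviews: it removes emojis simply because they are non-ASCII, rather than using the `emoji` package. The helper name `clean_review` is hypothetical.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# Keep ASCII letters, digits, whitespace, and basic punctuation. Everything
# else (emojis, hashtags symbols, other special characters) is replaced.
DISALLOWED_PATTERN = re.compile(r"[^a-zA-Z0-9\s.,!?']")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_review(text: str) -> str:
    """Remove links, emojis, and special characters, then normalize spacing."""
    text = URL_PATTERN.sub(" ", text)
    text = DISALLOWED_PATTERN.sub(" ", text)
    return WHITESPACE_PATTERN.sub(" ", text).strip()


print(clean_review("Loved it 😍!! See https://example.com #yum"))
# prints: Loved it !! See yum
```

Links are removed first so that their punctuation is not left behind as stray characters by the second pattern.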

### Sentiment Distribution Visualization

- **Question**: Visualize the data distribution across sentiment categories (1 to 5 stars).

The following code snippet helps visualize the distribution of reviews across the 1-5 star categories:

```python
import matplotlib.pyplot as plt

# Assumes train_df is loaded and already has the 'star_rating' column
# (created in the Label Adjustment Code section)
train_df['star_rating'].value_counts().sort_index().plot(kind='bar', title='Sentiment Distribution')
plt.xlabel('Star Rating')
plt.ylabel('Number of Reviews')
plt.show()
```
This will provide insights into whether the data is balanced or skewed across different ratings.

## Motivation

Twitter is a platform where users share their opinions in real-time. By analyzing these messages, we can gain insights into public perception and trends related to specific entities.

## Problem Statement

The task is to classify each review as expressing a Positive, Negative, or Neutral sentiment. This will help understand the general sentiment towards various businesses and services based on user feedback.
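
Because the dataset labels are star ratings (0 to 4) while the task names three classes, the labels need to be grouped. The sketch below shows one possible grouping, which is an assumption rather than something the assignment fixes: 1-2 stars as Negative, 3 stars as Neutral, and 4-5 stars as Positive. The `sample_df` frame is a toy stand-in for the training data.

```python
import pandas as pd

# Assumed grouping of the 0-4 labels into three sentiment classes.
LABEL_TO_SENTIMENT = {
    0: "Negative",  # 1 star
    1: "Negative",  # 2 stars
    2: "Neutral",   # 3 stars
    3: "Positive",  # 4 stars
    4: "Positive",  # 5 stars
}

# Toy frame standing in for the training data; only 'label' is needed.
sample_df = pd.DataFrame({"label": [0, 1, 2, 3, 4, 4]})
sample_df["sentiment"] = sample_df["label"].map(LABEL_TO_SENTIMENT)
print(sample_df["sentiment"].value_counts())
```

If you choose a different grouping, state it explicitly, since it changes the class balance.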

## Label Adjustment Code

To adjust the labels, run the following code:
```python
import pandas as pd

# Load the datasets
train_df = pd.read_csv('train.csv')
val_df = pd.read_csv('val.csv')

# Map labels to star ratings
label_to_star = {0: '1 star', 1: '2 stars', 2: '3 stars', 3: '4 stars', 4: '5 stars'}
train_df['star_rating'] = train_df['label'].map(label_to_star)
val_df['star_rating'] = val_df['label'].map(label_to_star)

# Display the first few rows to confirm the mapping
print(train_df.head())
print(val_df.head())
```

This code snippet loads the `train.csv` and `val.csv` files, maps the `label` values to their respective star ratings, and displays the first few rows to verify the changes.

## Hyperparameter Tuning

- **Question**: Explain the hyperparameters you used to train your model (e.g., learning rate, number of epochs) and show how changes in these parameters impacted the model’s performance.

Hyperparameters such as the learning rate, number of epochs, and batch size play a critical role in model training. By experimenting with different values, we can observe their impact on training and validation accuracy, convergence speed, and overall model performance.

For instance, a lower learning rate may result in more stable training but slower convergence, while a higher learning rate can speed up training but risks overshooting the optimal solution. Similarly, the number of epochs controls how long the model trains: too few epochs can leave the model underfit, while too many can lead to overfitting.

Example:

```python
# Example hyperparameter adjustments
learning_rate = 0.001  # Initial setting
num_epochs = 10        # Set initial number of epochs
# Adjust and observe the model performance
```
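
The learning-rate trade-off can be illustrated on a toy problem. The sketch below is not the sentiment model: it runs plain gradient descent on the one-parameter function f(w) = w², and the three learning rates are arbitrary values chosen to show slow convergence, fast convergence, and divergence.

```python
def gradient_descent(learning_rate: float, num_steps: int = 20, start: float = 5.0) -> float:
    """Minimize f(w) = w**2 with gradient descent and return the final loss."""
    w = start
    for _ in range(num_steps):
        gradient = 2 * w  # derivative of w**2
        w -= learning_rate * gradient
    return w ** 2


# Starting loss is 5.0**2 = 25.0 in every run.
for lr in (0.01, 0.4, 1.1):
    print(f"learning_rate={lr}: final loss = {gradient_descent(lr):.4g}")
```

With the smallest rate the loss drops slowly, with the middle rate it reaches almost zero, and with the largest rate each step overshoots the minimum so the loss grows beyond its starting value.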


## What Do We Expect from You in This Assignment?

We expect you to use NLP techniques and possibly deep learning methods to analyze the text data from Yelp reviews. Your goal is to accurately classify each review into one of the sentiment categories: Positive, Negative, or Neutral.

## Additional Questions for Exploration

1. **Data Analysis**: What trends or patterns can you identify in the Yelp reviews' star ratings? For instance, are there more positive (4-5 stars) or negative (1-2 stars) reviews?
2. **Data Preprocessing**: Are there any reviews that should be removed or cleaned (e.g., empty reviews, excessive punctuation)?
3. **Sentiment Distribution**: How is the sentiment distributed across different star ratings? Is there a balanced distribution of sentiments across the dataset?
4. **Model Improvement**: What strategies might you consider to improve model accuracy on the validation set?


## Model Performance Analysis

- **Question**: Compare the accuracy of the training and validation data to analyze if the model is overfitting.

Comparing training and validation accuracy provides insights into overfitting or underfitting. A large gap with high training accuracy but low validation accuracy indicates overfitting.
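
A minimal sketch of this comparison, assuming you have per-epoch accuracy lists for both sets. The numbers and the threshold below are made up for illustration, and `accuracy_gap` is a hypothetical helper name.

```python
def accuracy_gap(train_acc: list[float], val_acc: list[float]) -> float:
    """Return final-epoch training accuracy minus validation accuracy."""
    return train_acc[-1] - val_acc[-1]


# Hypothetical per-epoch accuracies, made up for illustration only.
train_acc = [0.62, 0.78, 0.90, 0.97]
val_acc = [0.60, 0.68, 0.70, 0.69]

gap = accuracy_gap(train_acc, val_acc)
GAP_THRESHOLD = 0.10  # arbitrary cut-off; pick one that suits your task
print(f"gap = {gap:.2f}, possible overfitting: {gap > GAP_THRESHOLD}")
```

Plotting both curves over the epochs is usually more informative than a single number, because it shows where validation accuracy stops improving.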

- **Question**: How would you assess if there is any bias in your model’s results?

Evaluating bias involves checking if the model consistently misclassifies a particular sentiment class or star rating. This can be assessed by examining confusion matrices and analyzing error rates across categories.
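
The per-class error analysis can be sketched with the standard library alone. The labels below are hypothetical, and both helper names are illustrative rather than part of any library.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true_label, predicted_label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def per_class_error_rate(y_true, y_pred):
    """Fraction of examples of each true class that were misclassified."""
    totals = Counter(y_true)
    errors = Counter(t for t, p in zip(y_true, y_pred) if t != p)
    return {label: errors[label] / totals[label] for label in sorted(totals)}


# Hypothetical labels and predictions, made up for illustration only.
y_true = ["Positive", "Positive", "Negative", "Negative",
          "Neutral", "Neutral", "Neutral", "Neutral"]
y_pred = ["Positive", "Positive", "Negative", "Positive",
          "Positive", "Neutral", "Negative", "Positive"]

print(confusion_counts(y_true, y_pred))
print(per_class_error_rate(y_true, y_pred))
```

In this made-up example the Neutral class has the highest error rate and is mostly confused with Positive, which is the kind of systematic pattern worth reporting. scikit-learn's `sklearn.metrics.confusion_matrix` gives the same counts in matrix form.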

## Dataset

The dataset has been shared along with the homework *(twitter_training.csv)*.

## If you have any questions about the homework, you can contact us at the following e-mail addresses:



*   burcusunturlu@gmail.com
*   ozgeflzcn@gmail.com



## 1 - Import Libraries

Main libraries for building your model (feel free to use any other libraries you find helpful):

*   pandas
*   NumPy
*   scikit-learn (`sklearn`)
*   NLTK
*   Keras

## 2 - Importing the Data (65 points)

## 2.1 - Loading the Data


*   Import the dataset from the file.


```python
import pandas as pd

# Read your CSV file and define the column names
columns = ['tweet_id', 'entity', 'sentiment', 'tweet_content']
data = pd.read_csv('/content/twitter_training.csv', names=columns)

# Replace 'Irrelevant' sentiment with 'Neutral'
data['sentiment'] = data['sentiment'].replace('Irrelevant', 'Neutral')

# Look at your data
data.head()
```

## 2.2 - Exploratory Data Analysis (EDA) (20 points)

Please investigate your data according to:
* Understand the classes. Visualize the distribution of sentiment classes within the dataset.
* Check distributions.
* Check null values.
* Drop unnecessary columns (e.g., unrelated metadata).

## 2.3 - Data Preparation (25 points)

* Clean the comments. Remove irrelevant characters (e.g., URLs, mentions). Normalize the text (lowercasing, removing punctuation, etc.).
* Remove or keep stopwords, depending on your assumptions.
* Tokenize the comments.
* Lemmatize the comments.
* Vectorization.
* Word count analysis and outlier detection.
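
For the last step in the list, one common way to flag unusually short or long texts is the interquartile-range (IQR) rule. The sketch below is a minimal illustration with made-up texts. The whitespace tokenization, the multiplier `k=1.5`, and the helper name are all assumptions.

```python
import statistics


def word_count_outliers(texts, k=1.5):
    """Return word counts and the indices of texts outside the IQR fences."""
    word_counts = [len(text.split()) for text in texts]
    q1, _, q3 = statistics.quantiles(word_counts, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [i for i, n in enumerate(word_counts) if n < lower or n > upper]
    return word_counts, outliers


# Made-up examples; the last one is deliberately very long.
texts = [
    "good game",
    "love this phone so much",
    "not great not terrible",
    "worst update ever",
    "ok",
    "fine I guess",
    "word " * 60,
]

counts, outlier_indices = word_count_outliers(texts)
print(counts)
print(outlier_indices)
```

Whether flagged texts should be dropped, truncated, or kept is a modeling decision, so inspect them before removing anything.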

## 2.4 - TF-IDF (Term Frequency - Inverse Document Frequency) (15 points)

* Explain TF & IDF.
* Apply TF & IDF methods.
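
As a starting point for the explanation, the sketch below computes TF-IDF by hand on three made-up documents. It uses one common textbook variant, chosen here as an assumption: TF is the term count divided by document length, and IDF is the natural log of (number of documents / number of documents containing the term). Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {term: tf-idf weight} dict per document."""
    num_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(num_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up documents, already lowercased and split on whitespace.
docs = [
    "the food was great".split(),
    "the service was slow".split(),
    "great great pizza".split(),
]
for doc_weights in tf_idf(docs):
    print({term: round(weight, 3) for term, weight in doc_weights.items()})
```

Terms that appear in fewer documents, such as "pizza", receive a higher IDF than terms shared across documents, such as "the", and a term present in every document would receive a weight of zero under this variant.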

## 2.5 - Train/Test Split (5 points)

* Prepare the target variables and split the data into training and testing sets.

## 3 - Training Deep Learning Models (30 points)

* Import relevant libraries.
* Explain the difference between Neural Networks (NN) and Convolutional Neural Networks (CNN).

```python
from keras.models import Sequential
from keras.layers import Conv1D, GlobalMaxPooling1D, Flatten
from keras.layers import Dense, Input, Embedding, Dropout, Activation
```

## 3.1 - Training NN models

* Construct NN models from basic (e.g., one layer) to complex (more layers included).
* Experiment with different optimizers, regularization methods, dropout rates, and normalization techniques.
* Evaluate each trial on the test data.

## 4 - Testing with Your Own Input (5 points)

* Test the trained model with your own input sentences to predict the sentiment based on an entity.

```python
# Try a sentence related to an entity; you can replace it with your own example.
# Assumes predict(sentence, entity) is the helper you defined for your trained model.
sentence = "I love the new features of the Windows!!"
entity = "Microsoft"  # specify the entity
tmp_pred, tmp_sentiment = predict(sentence, entity)
print(f"The sentiment of the sentence about {entity}: \n***\n{sentence}\n***\nis {tmp_sentiment}.")
```

## 5 - Bonus - Training CNN Models (20 points)

* Construct CNN models from basic (e.g., one layer) to complex (more layers included).
* Use different optimizers, regularization methods, dropout rates, and normalization techniques.
* Evaluate each trial on the test data.

## Additional Notes

* Ensure all models and visualizations are well-commented.
* Include explanations for key steps like tokenization, vectorization, and model selection.
* Please complete your homework using this notebook.