
# Task 1: Data Preparation and Management for Yelp Sentiment Analysis

## Overview

This task prepares the Yelp dataset for a sentiment analysis project. The goal is to classify Yelp reviews as positive, negative, or neutral based on their content, restricted to restaurant and hotel reviews. This document outlines the steps taken to acquire, clean, preprocess, and prepare the dataset for subsequent modeling tasks.

## Dataset Description and Acquisition

- **Source**: The dataset was obtained from Yelp's Dataset Challenge, which is publicly available for educational and research purposes. It includes a comprehensive compilation of business reviews, user interactions, and metadata associated with Yelp businesses.
- **Scope**: From the larger Yelp dataset, we kept only the reviews explicitly linked to restaurants and hotels, as these categories are most relevant to our sentiment analysis objectives.
- **Volume**: After filtering, our dataset consists of approximately [X number of reviews], spanning from [start year] to [end year].
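
The scope filter described above could be sketched in pandas roughly as follows. The column names (`business_id`, `categories`, `text`, `stars`), the miniature tables, and the keyword pattern are assumptions made for illustration; the project's actual filter may differ.

```python
import pandas as pd

# Hypothetical miniature tables; the real column names may differ.
businesses = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "categories": ["Restaurants, Italian", "Hotels, Event Planning", "Auto Repair"],
})
reviews = pd.DataFrame({
    "business_id": ["b1", "b2", "b3", "b1"],
    "text": ["Great pasta", "Room was dirty", "Fixed my car", "It was okay"],
    "stars": [5, 1, 4, 3],
})

# Flag businesses whose category string mentions restaurants or hotels.
pattern = r"\b(?:restaurants|hotels)\b"
is_target = businesses["categories"].str.contains(pattern, case=False, na=False)
target_ids = businesses.loc[is_target, "business_id"]

# Keep only the reviews attached to those businesses.
filtered_reviews = reviews[reviews["business_id"].isin(target_ids)].reset_index(drop=True)
print(filtered_reviews)
```

Filtering on the business table first and then selecting reviews by ID avoids repeating the category match once per review.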

## Detailed Data Cleaning and Preprocessing Steps

### Text Cleaning

1. **HTML Tag Removal**: Utilizing regular expressions, we stripped out any HTML tags that appear in the review texts, ensuring only textual content is retained for analysis.
2. **Special Characters and Punctuation**: We removed special characters and punctuation, again using regular expressions, to focus on the words within the reviews.
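
A minimal sketch of these two cleaning steps, using only Python's `re` module. The exact patterns and the function name `clean_review` are assumptions, not the project's actual expressions; lowercasing is included because the script documentation lists it as part of cleaning.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")            # anything that looks like an HTML tag
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")  # punctuation and other special characters
WHITESPACE_RE = re.compile(r"\s+")         # runs of spaces, tabs, newlines


def clean_review(text: str) -> str:
    """Strip HTML tags, lowercase, drop punctuation, and collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = text.lower()
    text = NON_ALNUM_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


print(clean_review("Loved it!<br/>The staff's service was <b>GREAT</b>..."))
# loved it the staff s service was great
```

Note that replacing apostrophes with spaces splits contractions and possessives (`staff's` becomes `staff s`), which is one trade-off of removing all punctuation.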

### Tokenization

- Employing the NLTK library, we tokenized the cleaned review texts into individual words. This step is crucial for breaking down the texts into manageable units for further processing.

### Stop Words Removal

- Common words that typically don’t contribute to sentiment (e.g., "and", "is", "in") were removed using NLTK’s predefined list of stop words. This helps reduce the dataset's noise, focusing on more meaningful words.

### Lemmatization

- Words were converted to their lemma or dictionary form to consolidate different forms of a word into a base form. We used NLTK’s WordNetLemmatizer for this purpose.
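
The tokenization, stop word removal, and lemmatization steps chain together as sketched below. To keep the example runnable without downloading NLTK data, the tokenizer, stop word set, and lemmatizer are passed in as parameters, and tiny stand-ins are used in the demo; the helper name `preprocess_tokens` and the stand-ins are hypothetical.

```python
from typing import Callable, List, Set


def preprocess_tokens(
    text: str,
    tokenize: Callable[[str], List[str]],
    stop_words: Set[str],
    lemmatize: Callable[[str], str],
) -> List[str]:
    """Tokenize, drop stop words, then reduce each remaining token to its lemma."""
    tokens = tokenize(text)
    kept = [token for token in tokens if token not in stop_words]
    return [lemmatize(token) for token in kept]


# Stand-ins for illustration only; they are NOT NLTK's stop word list or lemmatizer.
DEMO_STOP_WORDS = {"and", "is", "in", "the", "was"}
DEMO_LEMMAS = {"rooms": "room", "waiters": "waiter"}


def demo_lemmatize(token: str) -> str:
    return DEMO_LEMMAS.get(token, token)


tokens = preprocess_tokens(
    "the rooms and the waiters", str.split, DEMO_STOP_WORDS, demo_lemmatize
)
print(tokens)  # ['room', 'waiter']
```

With NLTK and its data packages installed, the same function can be called with `nltk.word_tokenize`, `set(stopwords.words("english"))`, and `WordNetLemmatizer().lemmatize` in place of the stand-ins.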

### Numerical Representation

- The Tokenizer class from TensorFlow’s Keras API was utilized to convert text tokens into numerical format. This involves mapping each unique word to a unique integer and transforming the texts into sequences of these integers, making the data suitable for input into deep learning models.
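
To make the word-to-integer mapping concrete, here is a plain-Python sketch of the idea. It is a simplified stand-in, not the Keras `Tokenizer` API: the function names are hypothetical, and ties between equally frequent words are broken alphabetically here purely to keep the output deterministic.

```python
from collections import Counter
from typing import Dict, List


def build_word_index(token_lists: List[List[str]]) -> Dict[str, int]:
    """Map each word to an integer rank, with 1 as the most frequent word.

    Index 0 is left unused so it can serve as a padding value later.
    """
    counts = Counter(token for tokens in token_lists for token in tokens)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: rank for rank, (word, _) in enumerate(ordered, start=1)}


def to_sequences(token_lists: List[List[str]], word_index: Dict[str, int]) -> List[List[int]]:
    """Replace each known token with its index; unknown tokens are dropped."""
    return [[word_index[t] for t in tokens if t in word_index] for tokens in token_lists]


corpus = [["great", "food", "great", "service"], ["bad", "food"]]
word_index = build_word_index(corpus)
sequences = to_sequences(corpus, word_index)
print(word_index)  # {'food': 1, 'great': 2, 'bad': 3, 'service': 4}
print(sequences)   # [[2, 1, 2, 4], [3, 1]]
```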

## Sentiment Labeling

Based on the star ratings accompanying each review, we classified sentiments as follows:

- **Positive**: Reviews rated with 4 or 5 stars.
- **Negative**: Reviews rated with 1 or 2 stars.
- **Neutral**: Reviews with a 3-star rating.

This categorical labeling facilitates a supervised learning approach, allowing models to learn from labeled examples.
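
The labeling rule above translates directly into a small function. The integer encoding (0 = negative, 1 = neutral, 2 = positive) and the function name are assumptions; the source only states that labels are integers.

```python
def stars_to_label(stars: float) -> int:
    """Map a 1-5 star rating to a sentiment class index.

    Assumed encoding: 0 = negative, 1 = neutral, 2 = positive.
    """
    if not 1 <= stars <= 5:
        raise ValueError(f"star rating out of range: {stars}")
    if stars >= 4:
        return 2  # 4 or 5 stars
    if stars <= 2:
        return 0  # 1 or 2 stars
    return 1      # 3 stars


print([stars_to_label(s) for s in [1, 2, 3, 4, 5]])  # [0, 0, 1, 2, 2]
```

Labels that start at 0 and run consecutively are what a sparse categorical cross-entropy loss expects.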

## Challenges and Solutions

- **Missing Data**: Encountering reviews with missing or incomplete text posed a challenge. We opted to remove such instances to maintain the quality and consistency of our analysis.
- **Large Vocabulary**: The wide variety of words in the reviews introduced challenges in memory usage and computational efficiency during the tokenization and numerical conversion process. To address this, we limited our vocabulary size to the top [X] most frequent words for the numerical representation, ensuring a balance between computational efficiency and retaining meaningful textual information.
- **Class Imbalance**: The dataset exhibited a skew in the distribution of sentiments, with an overrepresentation of positive reviews. To mitigate potential biases, we plan to explore techniques such as class weighting during the model training phase to ensure a fair representation of each sentiment class.
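
One way to choose a vocabulary cap is to check how much of the corpus the top-k words cover. The sketch below computes that fraction on a toy token list; the function name and the data are illustrative assumptions and do not reflect the project's actual vocabulary.

```python
from collections import Counter
from typing import List


def coverage_of_top_k(tokens: List[str], k: int) -> float:
    """Return the fraction of token occurrences covered by the k most frequent words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(k))
    return covered / total


# Toy corpus: 10 tokens over 4 distinct words.
toy_tokens = ["food"] * 5 + ["great"] * 3 + ["service"] + ["ambience"]
print(coverage_of_top_k(toy_tokens, 2))  # 0.8
```

Because word frequencies in natural text are heavily skewed, a cap well below the full vocabulary size often still covers most token occurrences.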

## Tools and Libraries Used

- **Pandas** and **NumPy** for data manipulation.
- **NLTK** for natural language processing tasks, including tokenization, stop words removal, and lemmatization.
- **TensorFlow** and specifically the Keras API for preprocessing text data and preparing it for deep learning models.



# Documentation for `trainmodel.py`

## Overview

`trainmodel.py` serves as a foundational script in our sentiment analysis project, handling the acquisition, cleaning, preprocessing, and preparation of Yelp review data. Its primary function is to prepare the dataset for subsequent deep learning models, ensuring data is in the correct format for effective model training.

## Functions

### `load_data(file_path)`

- **Purpose**: Loads the dataset from a specified CSV file path.
- **Input**: `file_path` (str) - The path to the CSV file containing the Yelp reviews.
- **Output**: `DataFrame` - A pandas DataFrame containing the loaded dataset.
- **Description**: This function reads a CSV file into a pandas DataFrame, making the dataset available for further processing steps. It ensures that the dataset is accessible and in a manipulable format.
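
A minimal sketch of what such a loader might look like, assuming the CSV has `text` and `stars` columns (the column names are not confirmed by the source). An in-memory buffer stands in for a file path so the example is self-contained.

```python
import io

import pandas as pd


def load_data(file_path):
    """Read a CSV of reviews into a DataFrame (accepts a path or file-like object)."""
    return pd.read_csv(file_path)


# In-memory stand-in for a CSV file; the second row has no review text.
sample_csv = io.StringIO("text,stars\nGreat food,5\n,3\nTerrible room,1\n")
df = load_data(sample_csv)
print(df["text"].isna().sum())  # 1

# Empty text fields are read as NaN, so incomplete reviews can be dropped afterwards.
clean_df = df.dropna(subset=["text"]).reset_index(drop=True)
print(len(clean_df))  # 2
```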

### `prepare_input_data(train_data_path, test_data_path)`

- **Purpose**: Prepares the training and testing datasets for model input.
- **Input**:
  - `train_data_path` (str) - The path to the training data CSV file.
  - `test_data_path` (str) - The path to the testing data CSV file.
- **Output**: Tuple containing:
  - `train_texts` (List[str]) - Preprocessed training text data.
  - `train_labels` (List[int]) - Corresponding labels for the training data.
  - `test_texts` (List[str]) - Preprocessed testing text data.
  - `test_labels` (List[int]) - Corresponding labels for the testing data.
  - `word_index` (Dict) - A dictionary mapping words to their numerical index.
- **Description**: This function encompasses the core preprocessing workflow, including text cleaning, tokenization, stop words removal, and numerical conversion. It loads the training and testing sets from their respective files and applies the same preprocessing to each, so both are ready for model training and evaluation.

### `clean_text(texts)`

- **Purpose**: Cleans the raw review texts by removing HTML tags and special characters and converting all text to lowercase.
- **Input**: `texts` (List[str]) - A list of review texts to be cleaned.
- **Output**: `cleaned_texts` (List[str]) - The cleaned review texts.
- **Description**: Applies regular expressions and other text processing techniques to clean the provided review texts, preparing them for further NLP tasks.

### `tokenize_and_pad(texts)`

- **Purpose**: Tokenizes the cleaned texts and pads them to a uniform length.
- **Input**: `texts` (List[str]) - A list of cleaned review texts.
- **Output**: Tuple containing:
  - `padded_sequences` (ndarray) - Numerically encoded and padded text sequences.
  - `word_index` (Dict) - A dictionary mapping words to their numerical index.
- **Description**: Utilizes the Keras Tokenizer to convert text to sequences of integers, then pads these sequences to ensure uniform length across all texts.
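
The padding step can be illustrated without Keras. The sketch below pads and truncates at the front of each sequence, which matches the default behavior of Keras's `pad_sequences`; the function name is hypothetical, and it returns plain lists rather than an ndarray.

```python
from typing import List


def pad_to_length(sequences: List[List[int]], max_len: int, pad_value: int = 0) -> List[List[int]]:
    """Left-pad short sequences and keep only the last max_len items of long ones."""
    padded = []
    for seq in sequences:
        # Truncate from the front when the sequence is too long.
        trimmed = seq[len(seq) - max_len:] if len(seq) > max_len else list(seq)
        # Fill the front with the padding value when it is too short.
        padded.append([pad_value] * (max_len - len(trimmed)) + trimmed)
    return padded


print(pad_to_length([[2, 1, 2, 4], [3, 1]], max_len=3))
# [[1, 2, 4], [0, 3, 1]]
```

Reserving 0 as the padding value is why word indices conventionally start at 1.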

### `compute_class_weights(labels)`

- **Purpose**: Computes class weights to address class imbalance in the training data.
- **Input**: `labels` (List[int]) - The list of labels for the training data.
- **Output**: `class_weights` (Dict) - A dictionary mapping class indices to their corresponding weight.
- **Description**: Calculates weights for each class based on their frequency in the dataset, providing a mechanism to counteract the effects of class imbalance during model training.
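
A sketch of one common weighting scheme, in which each class weight is `n_samples / (n_classes * class_count)`. This is the same heuristic scikit-learn uses for its "balanced" mode; whether `trainmodel.py` uses exactly this formula is an assumption.

```python
from collections import Counter
from typing import Dict, List


def compute_class_weights(labels: List[int]) -> Dict[int, float]:
    """Weight each class inversely to its frequency (assumed 'balanced' heuristic)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels: 6 positive (2), 2 negative (0), 2 neutral (1).
toy_labels = [2, 2, 2, 2, 2, 2, 0, 0, 1, 1]
class_weights = compute_class_weights(toy_labels)
print(class_weights)
```

In this toy example the majority class receives a weight below 1 and the two minority classes receive equal weights above 1. A dictionary in this form can be passed to Keras through the `class_weight` argument of `model.fit`.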

## Challenges and Solutions

During the development of `trainmodel.py`, several challenges were encountered:

- **Data Cleaning Complexity**: The diversity of text in Yelp reviews required robust cleaning methods. Regular expressions and NLTK functions were employed to effectively clean and standardize the text data.
- **Large Vocabulary**: The initial tokenization revealed a vast vocabulary, leading to high memory consumption. A decision was made to limit the tokenizer's vocabulary to the most frequent words, striking a balance between model complexity and performance.
- **Class Imbalance**: An imbalance in sentiment labels was addressed by computing class weights, allowing the model to give more importance to underrepresented classes during training.

## Dependencies

- **pandas**: For data loading and manipulation.
- **NumPy**: For numerical operations.
- **TensorFlow/Keras**: For text tokenization, sequence padding, and numerical data preparation.
- **NLTK**: For natural language processing tasks such as stop words removal.


# LSTM Model Training and Evaluation Documentation

## Model Overview

This document outlines the training and evaluation process for a Long Short-Term Memory (LSTM) model developed for sentiment analysis on Yelp reviews, focusing on classifying sentiments as positive, negative, or neutral. The model's architecture consists of LSTM layers tailored to process sequential data inherent in text, aiming to capture the contextual nuances essential for accurate sentiment classification.

## Training Process

The model was trained over 10 epochs with a batch size of [specify batch size], employing the Adam optimizer and sparse categorical cross-entropy as the loss function. The training and validation datasets were prepared using preprocessed Yelp review data, aiming for a balanced representation of sentiment classes.
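
As a worked illustration of the loss function named above, sparse categorical cross-entropy is the mean of the negative log-probability the model assigns to each true class. The sketch below computes it by hand on made-up predictions; it omits the numerical clipping a framework implementation would apply.

```python
import math
from typing import List


def sparse_categorical_crossentropy(probabilities: List[List[float]], true_labels: List[int]) -> float:
    """Mean of -log(p[true_class]) over all examples; labels are integer class indices."""
    losses = [-math.log(probs[label]) for probs, label in zip(probabilities, true_labels)]
    return sum(losses) / len(losses)


# Made-up predicted distributions over (negative, neutral, positive).
example_probs = [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]
example_labels = [2, 0]  # integer labels, not one-hot vectors
example_loss = sparse_categorical_crossentropy(example_probs, example_labels)
print(round(example_loss, 4))
```

The "sparse" variant takes integer labels directly, which is why the labels need no one-hot encoding.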

### Training Results Summary

- **Epochs 1-3**: The model showed rapid improvement, with accuracy increasing significantly and loss decreasing both on training and validation datasets. This phase marked the initial learning curve where the model started capturing the underlying sentiment patterns.
- **Epochs 4-6**: While training accuracy continued to improve, validation accuracy, which had peaked at epoch 3, declined slightly, with a corresponding increase in validation loss. This suggests the beginning of overfitting to the training data, where the model's generalization to unseen data started to decrease.

## Evaluation Metrics

- **Final Training Accuracy**: 85.13%
- **Final Training Loss**: 0.4210
- **Validation Accuracy**: 78.75% (at last epoch)
- **Validation Loss**: 0.5135 (at last epoch)
- **Test Accuracy**: 79.65%
- **Test Loss**: 0.4744

These metrics indicate that the model classifies sentiment with reasonable accuracy, but the gap between final training accuracy (85.13%) and validation and test accuracy (78.75% and 79.65%, roughly 5 to 6 percentage points lower) is a sign of overfitting.

## Observations and Insights

- The model's ability to improve significantly in the initial epochs is promising, demonstrating its capacity to learn from the Yelp review data effectively.
- The onset of overfitting from epoch 4 onwards, as indicated by the divergence of training and validation metrics, suggests the model's increasing specialization to the training data, which could limit its applicability to new, unseen data.
- The close alignment between validation and test metrics (78.75% versus 79.65% accuracy) suggests that the validation set is representative of the test set and that the model performs consistently on unseen data.

## Conclusions and Future Directions

While the LSTM model has shown a commendable ability to classify sentiments within Yelp reviews, the training process revealed critical insights, particularly regarding the balance between learning and overfitting. For future iterations or models:

- Implementing regularization techniques and early stopping could mitigate overfitting, enhancing the model's generalization capabilities.
- Adjusting the learning rate dynamically or experimenting with the model's architecture might yield improvements in performance and efficiency.
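
The early-stopping idea can be sketched as a patience rule over per-epoch validation losses. The loss values below are invented for illustration and are not the project's training logs; the function name is hypothetical.

```python
from typing import List, Tuple


def best_epoch_with_patience(val_losses: List[float], patience: int) -> Tuple[int, int]:
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs pass without a new
    lowest validation loss.
    """
    best_epoch = 0
    best_loss = float("inf")
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses)


# Invented validation losses that bottom out at epoch 3 and then rise.
illustrative_losses = [0.70, 0.58, 0.52, 0.53, 0.55, 0.56]
best, stopped = best_epoch_with_patience(illustrative_losses, patience=2)
print(best, stopped)  # 3 5
```

In Keras, the `EarlyStopping` callback provides this behavior through its `monitor`, `patience`, and `restore_best_weights` arguments.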

This documentation covers the LSTM model's development and evaluation phases and provides a foundation for future iterations. The main lesson from this process is that training and validation metrics need to be monitored continuously so that training can be adjusted before overfitting erodes performance.
