
# Task 1: Data Preparation and Management for Yelp Sentiment Analysis

## Overview

This task focuses on preparing the Yelp dataset for a sentiment analysis project. The goal is to classify Yelp reviews into three categories (positive, negative, and neutral) based on their content, specifically targeting restaurant and hotel reviews. This document outlines the steps taken to acquire, clean, preprocess, and prepare the dataset for subsequent modeling tasks.

## Dataset Description and Acquisition

- **Source**: The dataset was obtained from Yelp's Dataset Challenge, which is publicly available for educational and research purposes. It includes a comprehensive compilation of business reviews, user interactions, and metadata associated with Yelp businesses.
- **Scope**: From the larger Yelp dataset, we kept only the reviews explicitly linked to restaurants and hotels, as these categories are most relevant to our sentiment analysis objectives.
- **Volume**: After filtering, our dataset consists of approximately [X number of reviews], spanning from [start year] to [end year].

## Detailed Data Cleaning and Preprocessing Steps

### Text Cleaning

1. **HTML Tag Removal**: Utilizing regular expressions, we stripped out any HTML tags that appear in the review texts, ensuring only textual content is retained for analysis.
2. **Special Characters and Punctuation**: We removed special characters and punctuation, again using regular expressions, to focus on the words within the reviews.
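
The two cleaning steps above can be sketched with Python's `re` module. The patterns and the `clean_text` name are assumptions for illustration; the project's actual expressions are not shown in this document.

```python
import re

# Assumed patterns for illustration only.
HTML_TAG = re.compile(r"<[^>]+>")          # anything that looks like <tag ...>
NON_WORD = re.compile(r"[^A-Za-z0-9\s]")   # punctuation and special characters
WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Strip HTML tags, then punctuation and special characters."""
    text = HTML_TAG.sub(" ", text)   # replace with a space so adjacent words do not fuse
    text = NON_WORD.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


print(clean_text("<p>Great food!!!</p> Service was <b>slow</b>, though."))
# Great food Service was slow though
```

Note that replacing punctuation with a space splits contractions such as "don't" into "don" and "t", which may matter for negation handling.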

### Tokenization

- Employing the NLTK library, we tokenized the cleaned review texts into individual words. This step is crucial for breaking down the texts into manageable units for further processing.

### Stop Words Removal

- Common words that typically don’t contribute to sentiment (e.g., "and", "is", "in") were removed using NLTK’s predefined list of stop words. This helps reduce the dataset's noise, focusing on more meaningful words.

### Lemmatization

- Words were converted to their lemma or dictionary form to consolidate different forms of a word into a base form. We used NLTK’s WordNetLemmatizer for this purpose.
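
The tokenization, stop-word removal, and lemmatization steps above could be chained roughly as follows. This is a sketch, not the project's code: the function names are hypothetical, lowercasing is an added assumption (NLTK's English stop-word list is lowercase), and the NLTK resource names depend on the installed NLTK version.

```python
def filter_and_lemmatize(tokens, stop_words, lemmatize):
    """Lowercase tokens, drop stop words, and map the remaining tokens to lemmas."""
    lowered = (tok.lower() for tok in tokens)
    return [lemmatize(tok) for tok in lowered if tok not in stop_words]


def load_nltk_tools():
    """Return (tokenize, stop_words, lemmatize) backed by NLTK.

    Imports are local so filter_and_lemmatize stays usable without NLTK.
    "punkt" and "punkt_tab" are both requested because the tokenizer
    resource name differs between NLTK versions.
    """
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    for resource in ("punkt", "punkt_tab", "stopwords", "wordnet"):
        nltk.download(resource, quiet=True)

    stop_words = set(stopwords.words("english"))
    return word_tokenize, stop_words, WordNetLemmatizer().lemmatize


# Example wiring (requires NLTK and its data packages):
# tokenize, stop_words, lemmatize = load_nltk_tools()
# tokens = filter_and_lemmatize(tokenize("The waiters were friendly"), stop_words, lemmatize)
```

`WordNetLemmatizer.lemmatize` treats each word as a noun unless a part-of-speech tag is passed, so verb forms are left largely unchanged in this sketch.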

### Numerical Representation

- The Tokenizer class from TensorFlow’s Keras API was utilized to convert text tokens into numerical format. This involves mapping each unique word to a unique integer and transforming the texts into sequences of these integers, making the data suitable for input into deep learning models.
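
To make the word-to-integer mapping concrete, here is a plain-Python stand-in for the idea. It is not Keras's implementation, and `build_word_index` and `encode_tokens` are hypothetical names; it only illustrates ranking words by frequency, keeping the most frequent ones, and dropping out-of-vocabulary words when encoding.

```python
from collections import Counter


def build_word_index(token_lists, max_vocab=None):
    """Map each word to an integer rank, most frequent word first.

    Index 0 is left unused so it can later serve as the padding value.
    """
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    most_common = counts.most_common(max_vocab)  # None keeps every word
    return {word: rank for rank, (word, _) in enumerate(most_common, start=1)}


def encode_tokens(token_lists, word_index):
    """Replace each known word with its index; unknown words are skipped."""
    return [[word_index[tok] for tok in tokens if tok in word_index]
            for tokens in token_lists]


docs = [["good", "food"], ["good", "service"], ["bad", "food", "good"]]
word_index = build_word_index(docs, max_vocab=2)
print(word_index)                       # {'good': 1, 'food': 2}
print(encode_tokens(docs, word_index))  # [[1, 2], [1], [2, 1]]
```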

## Sentiment Labeling

Based on the star ratings accompanying each review, we classified sentiments as follows:

- **Positive**: Reviews rated with 4 or 5 stars.
- **Negative**: Reviews rated with 1 or 2 stars.
- **Neutral**: Reviews with a 3-star rating.

This categorical labeling facilitates a supervised learning approach, allowing models to learn from labeled examples.
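
The star-to-sentiment rule above can be written as a small function. The integer encoding (0 = Negative, 1 = Neutral, 2 = Positive) is an assumption chosen to match the class order used later in the evaluation report, and `stars_to_label` is a hypothetical name.

```python
# Assumed integer encoding for the three sentiment classes.
LABEL_NAMES = {0: "Negative", 1: "Neutral", 2: "Positive"}


def stars_to_label(stars):
    """Map a 1-5 star rating to a sentiment class index."""
    if stars not in (1, 2, 3, 4, 5):
        raise ValueError(f"unexpected star rating: {stars!r}")
    if stars >= 4:
        return 2  # Positive: 4 or 5 stars
    if stars == 3:
        return 1  # Neutral: exactly 3 stars
    return 0      # Negative: 1 or 2 stars


print([LABEL_NAMES[stars_to_label(s)] for s in (1, 2, 3, 4, 5)])
# ['Negative', 'Negative', 'Neutral', 'Positive', 'Positive']
```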

## Challenges and Solutions

- **Missing Data**: Encountering reviews with missing or incomplete text posed a challenge. We opted to remove such instances to maintain the quality and consistency of our analysis.
- **Large Vocabulary**: The wide variety of words in the reviews introduced challenges in memory usage and computational efficiency during the tokenization and numerical conversion process. To address this, we limited our vocabulary size to the top [X] most frequent words for the numerical representation, ensuring a balance between computational efficiency and retaining meaningful textual information.
- **Class Imbalance**: The dataset exhibited a skew in the distribution of sentiments, with an overrepresentation of positive reviews. To mitigate potential biases, we plan to explore techniques such as class weighting during the model training phase to ensure a fair representation of each sentiment class.

## Tools and Libraries Used

- **Pandas** and **NumPy** for data manipulation.
- **NLTK** for natural language processing tasks, including tokenization, stop words removal, and lemmatization.
- **TensorFlow** and specifically the Keras API for preprocessing text data and preparing it for deep learning models.



# Documentation for `trainmodel.py`

### Overview

The `trainmodel.py` module is an essential component of our sentiment analysis project, dedicated to preparing and processing textual data for training deep learning models, specifically LSTM networks. This module encapsulates the functionality required for data loading, preprocessing, tokenization, and preparation to ensure the data is suitable for model training.

### Functions and Their Descriptions

#### 1. `load_data(file_path)`
This function is responsible for loading data from a CSV file into a pandas DataFrame, which serves as the primary data structure for further manipulations and analysis.

**Parameters:**
- **`file_path`** (str): The path to the CSV file.

**Returns:**
- **`DataFrame`**: A pandas DataFrame containing the loaded data.

#### 2. `create_tokenizer(texts, max_vocab=MAX_VOCAB)`
This function initializes and fits a Keras Tokenizer. It is configured to only consider the top `max_vocab` words ordered by word frequency across the texts. This tokenizer later transforms text strings into integer sequences.

**Parameters:**
- **`texts`** (list of str): List of text strings to tokenize.
- **`max_vocab`** (int, optional): The maximum size of the vocabulary. Defaults to `MAX_VOCAB`.

**Returns:**
- **`Tokenizer`**: A fitted Keras Tokenizer instance.

#### 3. `tokenize_and_pad(texts, tokenizer, max_length=MAX_LENGTH)`
This function converts the texts to integer sequences using the fitted tokenizer, then pads or truncates each sequence to a uniform length, which is required for batch processing in neural networks.

**Parameters:**
- **`texts`** (list of str): The text strings to tokenize and pad.
- **`tokenizer`** (Tokenizer): The tokenizer to use for converting text to sequences.
- **`max_length`** (int, optional): The maximum length of the sequences after padding. Defaults to `MAX_LENGTH`.

**Returns:**
- **`ndarray`**: An array of shape (n_samples, max_length) containing the padded sequences.
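
As an illustration of the padding and truncation step, the sketch below mimics the default behaviour of Keras's `pad_sequences`, which pads and truncates at the start of each sequence. Whether `trainmodel.py` relies on those defaults is an assumption, and `pad_or_truncate` is a hypothetical pure-Python helper that returns nested lists rather than a NumPy array.

```python
def pad_or_truncate(sequences, max_length, pad_value=0):
    """Force every sequence to exactly max_length items.

    Longer sequences keep their last max_length items; shorter ones are
    left-padded with pad_value.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    result = []
    for seq in sequences:
        kept = list(seq)[-max_length:]              # truncate from the front
        padding = [pad_value] * (max_length - len(kept))
        result.append(padding + kept)               # pad at the front
    return result


print(pad_or_truncate([[1, 2], [1, 2, 3, 4, 5]], max_length=4))
# [[0, 0, 1, 2], [2, 3, 4, 5]]
```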

#### 4. `compute_class_weights(labels)`
To handle class imbalance effectively, this function calculates the weights for each class based on their frequency in the data. These weights can be used during model training to give higher priority to minority classes.

**Parameters:**
- **`labels`** (array-like): An array-like structure of class labels.

**Returns:**
- **`dict`**: A dictionary mapping class indices to their respective weights.
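
One common way to derive such weights is the "balanced" heuristic, where each class receives the weight `n_samples / (n_classes * class_count)`. The sketch below assumes that formula; the document does not state which weighting scheme `compute_class_weights` actually uses, and `balanced_class_weights` is a hypothetical name.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency.

    weight(c) = n_samples / (n_classes * count(c)), so rarer classes
    receive larger weights.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in sorted(counts.items())}


weights = balanced_class_weights([0, 0, 0, 1, 2, 2])
print(weights)  # class 1 is rarest, so it gets the largest weight (2.0)
```

The resulting dictionary has the shape Keras expects for the `class_weight` argument of `model.fit`.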

#### 5. `prepare_input_data(file_path)`
This high-level function orchestrates the data preparation process by calling the aforementioned functions sequentially: it loads data, replaces missing text values, tokenizes texts, pads sequences, and computes class weights.

**Parameters:**
- **`file_path`** (str): The path to the dataset in CSV format.

**Returns:**
- **`tuple`**: A tuple containing:
  - **`X`** (ndarray): The tokenized and padded feature data.
  - **`y`** (array): The target labels.
  - **`word_index`** (dict): A dictionary mapping words to their integer indices.
  - **`class_weights`** (dict): Weights for each class based on their frequency.

### Conclusion

The `trainmodel.py` module plays a critical role in the sentiment analysis pipeline by ensuring that the input data is adequately prepared for training our LSTM models. By automating the preprocessing and preparation steps, it streamlines the workflow and keeps model training consistent and reproducible.





## LSTM Model for Sentiment Analysis

### Introduction
In this project, we developed a sentiment analysis model using Long Short-Term Memory (LSTM) networks. This type of recurrent neural network (RNN) is particularly suited to text data due to its ability to process sequences and remember previous information, which is crucial for understanding the context in textual data.

### Data Preparation
Data preparation involved using the `trainmodel.py` script, which performed several key functions:
- **Data Loading:** Text data was loaded from a CSV file into a DataFrame.
- **Text Preprocessing:** The text was cleaned, tokenized, and padded to ensure uniform sequence lengths.
- **Class Weights Computation:** To address class imbalance, weights were computed for each sentiment class (Negative, Neutral, Positive) so that errors on under-represented classes count more heavily during training.

### LSTM Model Architecture
The LSTM model was constructed with the following layers:
- **Embedding Layer:** Converts text to fixed-size dense vectors. We used an embedding dimension of 64, allowing the model to learn an effective representation of words.
- **LSTM Layer:** With 128 units, this layer processes the embeddings by capturing dependencies in text sequences.
- **Dropout Layers:** Set at a rate of 0.5, these layers help prevent overfitting by randomly setting input units to 0 during training.
- **Output Layers:** A Dense layer with 50 units followed by a ReLU activation, and a final Dense layer with 3 units (one for each sentiment class) with a softmax activation.

### Model Compilation
The model was compiled using the Adam optimizer and sparse categorical crossentropy as the loss function, suitable for multi-class classification tasks.
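
A minimal Keras sketch of the architecture and compilation described above follows. The layer sizes, dropout rate, optimizer, and loss come from the text; the exact placement of the two dropout layers, the `accuracy` metric, and the `vocab_size` and `max_length` parameters are assumptions, and `build_lstm_model` is a hypothetical name.

```python
from tensorflow.keras import layers, models


def build_lstm_model(vocab_size, max_length, embedding_dim=64,
                     lstm_units=128, dropout_rate=0.5):
    """Embedding -> LSTM -> Dense(50, ReLU) -> Dense(3, softmax)."""
    model = models.Sequential([
        layers.Input(shape=(max_length,)),
        layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim),
        layers.LSTM(lstm_units),
        layers.Dropout(dropout_rate),            # assumed position
        layers.Dense(50, activation="relu"),
        layers.Dropout(dropout_rate),            # assumed position
        layers.Dense(3, activation="softmax"),   # one unit per sentiment class
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",  # expects integer labels 0, 1, 2
        metrics=["accuracy"],
    )
    return model
```

Sparse categorical crossentropy takes integer class labels directly, so the targets do not need to be one-hot encoded.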

### Training Process
Training involved using K-Fold cross-validation with 5 folds to ensure robust evaluation across different subsets of data. Each fold of the training involved:
- Splitting the data into training and validation subsets.
- Training the model for up to 10 epochs with early stopping based on validation loss to prevent overfitting.
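
The cross-validation loop could look roughly like the sketch below, which uses scikit-learn's `KFold` and Keras's `EarlyStopping`. Shuffling, the random seed, the patience value, and restoring the best weights are all assumptions, as is the `build_model` callable, which is expected to return a freshly compiled model that reports accuracy.

```python
import numpy as np
from sklearn.model_selection import KFold
from tensorflow.keras.callbacks import EarlyStopping


def cross_validate(build_model, X, y, class_weights=None,
                   n_splits=5, epochs=10, patience=2):
    """Train one fresh model per fold and return each fold's validation accuracy."""
    X, y = np.asarray(X), np.asarray(y)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    fold_accuracies = []
    for train_idx, val_idx in kfold.split(X):
        model = build_model()  # new weights for every fold
        early_stop = EarlyStopping(monitor="val_loss", patience=patience,
                                   restore_best_weights=True)
        model.fit(
            X[train_idx], y[train_idx],
            validation_data=(X[val_idx], y[val_idx]),
            epochs=epochs,
            class_weight=class_weights,
            callbacks=[early_stop],
            verbose=0,
        )
        # evaluate returns [loss, accuracy] when the model tracks accuracy
        _, val_accuracy = model.evaluate(X[val_idx], y[val_idx], verbose=0)
        fold_accuracies.append(float(val_accuracy))
    return fold_accuracies
```

Building a new model inside the loop keeps weights learned on one fold from leaking into the next.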

#### Training Outputs:
The training process highlighted model performance across different folds:
- **Accuracy:** Started at 89.38% and reached up to 94.49% on training data across epochs.
- **Validation Accuracy:** Showed a decrease from 88.41% to 86.78%, indicating potential overfitting as training progressed.

### Hyperparameter Tuning
Using the Keras Tuner, we optimized key model parameters:
- **Embedding Dimension**
- **LSTM Units**
- **Dropout Rate**

The best configuration found by the tuner was then retrained on a larger subset of the data.

### Evaluation and Results
The model's effectiveness was evaluated using a held-out test set:
- **Confusion Matrix:**
  ```
  [[ 6066   688   409]
   [  526  1410  1450]
   [  178   407 19729]]
  ```
- **Classification Report:**
  ```
                precision    recall  f1-score   support
      Negative       0.90      0.85      0.87      7163
       Neutral       0.56      0.42      0.48      3386
      Positive       0.91      0.97      0.94     20314

      accuracy                           0.88     30863
     macro avg       0.79      0.74      0.76     30863
  weighted avg       0.87      0.88      0.87     30863
  ```
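
The report values follow directly from the confusion matrix. The sketch below recomputes them, assuming rows are true labels and columns are predictions (the row sums match the support column, which is consistent with that reading). The function names are hypothetical.

```python
CLASS_NAMES = ["Negative", "Neutral", "Positive"]
CONFUSION = [
    [6066, 688, 409],
    [526, 1410, 1450],
    [178, 407, 19729],
]


def per_class_metrics(matrix, class_names):
    """Compute precision, recall, F1, and support for each class."""
    report = {}
    for i, name in enumerate(class_names):
        true_positive = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum
        actual = sum(matrix[i])                    # row sum (support)
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[name] = {"precision": precision, "recall": recall,
                        "f1": f1, "support": actual}
    return report


def accuracy(matrix):
    """Fraction of all samples that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


report = per_class_metrics(CONFUSION, CLASS_NAMES)
for name, m in report.items():
    print(f"{name:>8}  {m['precision']:.2f}  {m['recall']:.2f}  {m['f1']:.2f}  {m['support']}")
print(f"accuracy  {accuracy(CONFUSION):.2f}")
```

Rounded to two decimals, these values match the classification report above.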

### Discussion
The model performed best on the positive class (recall 0.97, F1 0.94), which is also the most heavily represented class in the test set (20,314 of 30,863 reviews). Recall for the neutral class was only 0.42: the confusion matrix shows that more neutral reviews were misclassified as positive (1,450) than were classified correctly (1,410). This suggests difficulty in distinguishing neutral sentiment, potentially due to features that overlap with the other classes.

### Conclusion
This LSTM model classifies positive and negative reviews reliably (88% overall accuracy on the test set) but remains weak on neutral reviews. Future work could explore more sophisticated text preprocessing techniques, alternative RNN architectures like GRUs, or transformer-based models for potentially better performance.

