

### 1. Importing Required Libraries
Begin by importing the necessary libraries for data processing, splitting, and model building. These include tools for handling data, tokenizing text, padding sequences, and defining the LSTM model.

### Key Components:
- Libraries for data manipulation and preprocessing (e.g., pandas).
- Tokenization and sequence padding to handle text input.
- Neural network layers, including Embedding and LSTM, for building the sentiment analysis model.
- Train-test splitting functionality to prepare data for training and evaluation.


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Embedding,LSTM
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


### 2. Loading the Dataset
The IMDB dataset, which contains labeled text reviews, is loaded for the sentiment analysis task.

### Dataset Details:
- **Source**: The IMDB dataset.
- **Encoding**: UTF-8 to ensure proper handling of text data.
- **Structure**: The dataset typically consists of two columns:
  - `review`: Contains the text of the reviews.
  - `sentiment`: Labels indicating the sentiment (e.g., positive or negative).

This dataset serves as the input for preprocessing and model training.


In [None]:
df=pd.read_csv("/content/IMDB Dataset.csv", encoding='utf-8')

### 3. Exploring the Dataset
After loading the dataset, we can inspect the first few rows to understand its structure and get a glimpse of the data.

### Action:
- The `head()` function is used to display the first five rows of the dataset.
- This helps us verify the data format and check the presence of any missing or unusual values.

### Example Output:
The dataset should display two columns:
- **review**: The text content of the review.
- **sentiment**: The sentiment label, which typically could be 'positive' or 'negative'.


In [None]:
df.head()

,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


### 4. Checking the Dataset Shape
To understand the size of the dataset, we use the `shape` attribute. This reveals the number of rows and columns in the dataset.

### Action:
- The `shape` attribute provides the dimensions of the dataset, i.e., the number of samples (rows) and features (columns).

### Example Output:
- The output will be in the form `(num_rows, num_columns)`, indicating the total number of data points and features.
- This helps confirm the dataset's size and structure before moving forward with further processing.


In [None]:
df.shape

(50000, 2)

### 5. Checking the Distribution of Sentiment Labels
To understand the distribution of the sentiment labels in the dataset, we use the `value_counts()` function.

### Action:
- The `value_counts()` function is applied to the `sentiment` column to count how many instances belong to each class (e.g., positive or negative).
- This step helps us assess whether the dataset is balanced or imbalanced with respect to the target classes.

### Example Output:
The output will display the count of each sentiment label:
- **Positive**: Number of positive reviews.
- **Negative**: Number of negative reviews.

This step is crucial for deciding if any class balancing techniques are necessary during preprocessing.


In [None]:
df['sentiment'].value_counts()

sentiment,count
positive,25000
negative,25000
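
The balance check described above can be sketched without pandas. This is a minimal illustration, assuming the labels are available as a plain Python list; the helper name `class_shares` is hypothetical and not part of the notebook.

```python
from collections import Counter

def class_shares(labels):
    """Return the fraction of samples that belongs to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Same class counts as reported by value_counts() above
labels = ["positive"] * 25000 + ["negative"] * 25000
shares = class_shares(labels)
print(shares)
```

Equal shares for both classes suggest that no resampling or class weighting is needed for this dataset.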


### 6. Encoding the Sentiment Labels
The sentiment labels, which are categorical (e.g., positive, negative), are transformed into numerical values for model training using label encoding.

### Action:
- The `LabelEncoder` from scikit-learn is used to convert the sentiment labels into numerical format.
- The `fit_transform()` function is applied to the `sentiment` column, mapping each unique label (e.g., positive, negative) to an integer value.

### Example:
- `LabelEncoder` assigns integers to the classes in sorted order, so 'negative' is encoded as 0 and 'positive' as 1.
- This transformation is necessary since machine learning models generally require numerical input.

### Benefit:
This step enables the model to interpret the sentiment as numeric values, which are essential for training the machine learning model.


In [None]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
df['sentiment']=le.fit_transform(df['sentiment'])
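
To make the label-to-integer mapping explicit, here is a small pure-Python sketch of the same idea: distinct labels are sorted and numbered from zero. The helper `fit_label_mapping` is a hypothetical name used only for illustration; it is not the scikit-learn implementation.

```python
def fit_label_mapping(labels):
    """Assign integers to the distinct labels in sorted order."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}

sample_labels = ["positive", "negative", "positive"]
mapping = fit_label_mapping(sample_labels)
encoded = [mapping[label] for label in sample_labels]
print(mapping, encoded)
```

In the notebook itself, `le.classes_` lists the classes in the order of their integer codes, which is a quick way to confirm which sentiment corresponds to 1.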


### 7. Splitting the Data into Features and Target
Next, we separate the features (input variables) and the target variable (sentiment labels) for model training.

### Action:
- `X`: The features, which consist of all columns except the target column (`sentiment`). This is done using the `drop()` function.
- `Y`: The target variable, which is the `sentiment` column that we aim to predict.

### Explanation:
- **X (Features)**: The input data used by the model to make predictions.
- **Y (Target)**: The target labels (encoded sentiment values) that the model will learn to predict.

This step prepares the data for the next phase of splitting into training and test sets.


In [None]:
X=df.drop('sentiment',axis=1)
Y=df['sentiment']

### 8. Splitting the Data into Training and Test Sets
To evaluate the performance of the model, the dataset is divided into training and test sets.

### Action:
- The `train_test_split()` function from scikit-learn is used to randomly split the data into training and testing subsets.
- **X_train, Y_train**: The features and target for training the model.
- **X_test, Y_test**: The features and target for evaluating the model’s performance.
- The `test_size=0.2` parameter indicates that 20% of the data will be used for testing, while 80% will be used for training.
- `random_state=42` ensures reproducibility by fixing the random seed.

### Benefit:
This step allows the model to be trained on one portion of the data (training set) and evaluated on another portion (test set), so the test score measures how well the model generalizes to unseen reviews rather than how well it memorized the training data.


In [None]:
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=42)
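
The effect of `test_size` and `random_state` can be illustrated with a simplified, pure-Python split over row indices. This is only a sketch of the idea (shuffle with a fixed seed, then cut); the function `split_indices` is hypothetical and does not reproduce the exact rows that scikit-learn selects.

```python
import random

def split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and cut off a test portion."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(50000)
print(len(train_idx), len(test_idx))
```

Calling the function again with the same seed returns the same partition, which is the reproducibility that a fixed `random_state` provides.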

### 9. Verifying the Shape of Training and Test Sets
After splitting the data, we check the dimensions of the training and test datasets to ensure the split is correct.

### Action:
- The `shape` attribute is read from `X_train`, `X_test`, `Y_train`, and `Y_test`.
- This step confirms the number of samples in both the training and test sets, helping to verify that the data split was done properly.

### Example:
- `X_train.shape` and `X_test.shape` show the number of rows and columns in the training and test features.
- `Y_train.shape` and `Y_test.shape` show the number of labels in each set.


In [None]:
print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)

(40000, 1) (10000, 1) (40000,) (10000,)


### 10. Tokenizing and Padding Text Data
To prepare the text data for the LSTM model, we tokenize the text (convert words into numerical representations) and then pad the sequences to ensure uniform length.

### Action:
- **Tokenizer**: The `Tokenizer` from Keras is initialized with a `num_words=5000` parameter, which limits the tokenization to the 5000 most frequent words in the dataset.
- **Fitting the Tokenizer**: The `fit_on_texts()` function is applied to the training data (`X_train['review']`) to build the vocabulary based on the training set only.
- **Text to Sequences**: The `texts_to_sequences()` function converts the text reviews into sequences of integers where each integer corresponds to a word in the tokenizer’s vocabulary.
- **Padding Sequences**: The `pad_sequences()` function is used to ensure all input sequences have the same length (200 in this case). Shorter sequences are padded with zeros, and longer sequences are truncated to the specified length.

### Benefit:
- **Tokenization** converts words into numerical form so the model can process them.
- **Padding** ensures that all sequences are of equal length, making them compatible with the LSTM model.


In [None]:
tokenizer=Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_train['review'])
X_train=tokenizer.texts_to_sequences(X_train['review'])
X_test=tokenizer.texts_to_sequences(X_test['review'])
X_train=pad_sequences(X_train,maxlen=200)
X_test=pad_sequences(X_test,maxlen=200)
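
The tokenization step can be pictured with a simplified, pure-Python version. It is a sketch under stated assumptions: words are split on whitespace only (the Keras `Tokenizer` also strips punctuation), ties in frequency are broken alphabetically for determinism, and the helper names `build_word_index` and `texts_to_sequences` are illustrative rather than the Keras implementation.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a 1-based rank, most frequent word first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Ties are broken alphabetically here (an assumption made for determinism)
    ranked = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace words by their rank; drop unknown words and ranks >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

train_texts = ["good movie", "bad movie", "good good plot"]
word_index = build_word_index(train_texts)
sequences = texts_to_sequences(["good plot twist", "bad movie"], word_index, num_words=4)
print(word_index)
print(sequences)
```

Words that were never seen during fitting ("twist") or that fall outside the `num_words` cutoff ("plot") are silently dropped, which is why the vocabulary is built from the training reviews only and then reused unchanged for the test reviews.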


### 11. Displaying the Tokenized and Padded Data
After tokenizing and padding the text data, we print the training and test data to inspect the numerical representation of the reviews.

### Action:
- The `print(X_train)` and `print(X_test)` commands display the tokenized and padded sequences of the training and test data.
- This step allows us to check the format of the data before feeding it into the LSTM model.

### Expected Output:
- The output will show the padded sequences, where each sequence is represented as an array of integers corresponding to the words in the vocabulary.
- Each sequence will be of length 200: shorter reviews are padded with zeros at the start, and longer reviews are truncated to 200 tokens.

This step helps confirm that the data is properly prepared for model training.
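
The padding behaviour can be sketched for a single sequence in a few lines. This assumes the default `pad_sequences` settings, where both padding and truncation happen at the start of the sequence; `pad_sequence` is a hypothetical helper written for illustration.

```python
def pad_sequence(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

short = pad_sequence([5, 8], maxlen=4)
long = pad_sequence([1, 2, 3, 4, 5, 6], maxlen=4)
print(short)
print(long)
```

This matches the leading zeros visible in the printed arrays: short reviews start with a run of zeros, while long reviews keep their final 200 tokens.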


In [None]:
print(X_train)
print(X_test)

[[1935    1 1200 ...  205  351 3856]
 [   3 1651  595 ...   89  103    9]
 [   0    0    0 ...    2  710   62]
 ...
 [   0    0    0 ... 1641    2  603]
 [   0    0    0 ...  245  103  125]
 [   0    0    0 ...   70   73 2062]]
[[   0    0    0 ...  995  719  155]
 [  12  162   59 ...  380    7    7]
 [   0    0    0 ...   50 1088   96]
 ...
 [   0    0    0 ...  125  200 3241]
 [   0    0    0 ... 1066    1 2305]
 [   0    0    0 ...    1  332   27]]


### 12. Building the LSTM Model
The model is built using the Sequential API in Keras. It consists of an embedding layer, an LSTM layer, and a dense output layer.

### Action:
- **Sequential Model**: The `Sequential()` function is used to initialize the model, which allows layers to be stacked on top of each other.
- **Embedding Layer**: The `Embedding()` layer is added as the first layer to convert integer-encoded words into dense vector representations. It has:
  - `5000`: The size of the vocabulary (number of unique words).
  - `128`: The size of the embedding vectors (dimensionality).
  - `input_length=200`: The length of the input sequences (padded to 200 words).
- **LSTM Layer**: The `LSTM()` layer is used to capture the sequential dependencies in the text data. It has:
  - `128`: The number of LSTM units (neurons).
  - `dropout=0.2`: Dropout rate to prevent overfitting.
  - `recurrent_dropout=0.2`: Dropout rate for the recurrent connections within the LSTM.
- **Dense Layer**: The `Dense()` layer is the output layer with a single neuron and a sigmoid activation function, suitable for binary classification tasks (positive or negative sentiment).

### Model Summary:
- The `summary()` function displays the architecture of the model, including the number of layers, the number of parameters in each layer, and the total number of parameters in the model.

### Model Architecture:
- The output of this step will show the complete structure of the LSTM model with details about each layer.


In [None]:
model=Sequential()
model.add(Embedding(5000,128,input_length=200))
model.add(LSTM(128,dropout=0.2,recurrent_dropout=0.2))
model.add(Dense(1,activation='sigmoid'))
model.summary()
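
As a sanity check on the architecture, the parameter counts can be worked out by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights, and one bias vector) and a dense layer with a bias term; the variable names are chosen for illustration.

```python
vocab_size, embedding_dim, lstm_units = 5000, 128, 128

# Embedding: one vector of size embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias vector
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Dense output: one weight per LSTM unit plus a single bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Under these assumptions the embedding layer holds most of the weights, so the vocabulary size and embedding dimension are the main levers on model size.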





### 13. Compiling the Model
After building the model, the next step is to compile it. This configures the model for training by specifying the loss function, optimizer, and evaluation metrics.

### Action:
- **Loss Function**:
  - `binary_crossentropy`: This is used for binary classification problems, where the target variable has two classes (e.g., positive and negative sentiment).
- **Optimizer**:
  - `adam`: The Adam optimizer is an adaptive learning rate optimization algorithm that combines the advantages of both AdaGrad and RMSProp. It is widely used due to its efficiency and low memory requirements.
- **Metrics**:
  - `accuracy`: The accuracy metric will be used to evaluate the model’s performance during training and testing.

### Benefit:
Compiling the model sets up the necessary components for training, ensuring that the model is ready to learn from the data and optimize for the specified objective.


In [None]:
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
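
To see what the loss measures, binary cross-entropy can be computed by hand for a few predictions. This is a minimal sketch of the formula averaged over samples; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for target, prob in zip(y_true, y_pred):
        prob = min(max(prob, eps), 1 - eps)  # keep log() finite
        total += -(target * math.log(prob) + (1 - target) * math.log(1 - prob))
    return total / len(y_true)

confident = binary_crossentropy([1, 0], [0.9, 0.1])
unsure = binary_crossentropy([1, 0], [0.5, 0.5])
print(confident, unsure)
```

Confident correct predictions yield a small loss, while a prediction of 0.5 for every sample yields a loss of ln 2, roughly 0.693, which is a useful baseline when reading the training log.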

### 14. Training the Model
After compiling the model, we train it using the training data. During training, the model learns to predict sentiment from the review text.

### Action:
- **Model Training**: The `fit()` function is used to train the model on the training data.
  - `X_train`: The input features (tokenized and padded text data).
  - `Y_train`: The target variable (encoded sentiment labels).
  - `epochs=5`: The number of times the entire training dataset will be passed through the model. Each epoch represents one complete pass through the training data.
  - `batch_size=64`: The number of samples per gradient update. The model will update its weights after processing 64 samples at a time.
  - `validation_split=0.2`: A portion of the training data (20%) is set aside for validation during training. This helps monitor the model’s performance on unseen data during training.

### Benefit:
Training the model enables it to learn the relationship between the text features and sentiment labels. By using validation data, we can monitor the model's performance and adjust as needed.
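
The `500/500` step counter in the training log follows from these settings. The arithmetic below is a small sketch that takes the 40,000 training rows from the earlier split; the variable names are illustrative.

```python
import math

n_train_rows = 40000      # rows in X_train after the 80/20 split
validation_split = 0.2
batch_size = 64

n_validation = round(n_train_rows * validation_split)  # rows held out for validation
n_fit = n_train_rows - n_validation                    # rows used for weight updates
steps_per_epoch = math.ceil(n_fit / batch_size)
print(n_validation, n_fit, steps_per_epoch)
```

Note that Keras takes the validation rows from the end of the arrays passed to `fit()`, before shuffling, so the held-out rows are the same in every epoch.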


In [None]:
model.fit(X_train,Y_train,epochs=5,batch_size=64,validation_split=0.2)

Epoch 1/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 257s 509ms/step - accuracy: 0.7180 - loss: 0.5329 - val_accuracy: 0.8371 - val_loss: 0.3831
Epoch 2/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 261s 508ms/step - accuracy: 0.8511 - loss: 0.3637 - val_accuracy: 0.8421 - val_loss: 0.3784
Epoch 3/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 256s 496ms/step - accuracy: 0.8563 - loss: 0.3438 - val_accuracy: 0.8514 - val_loss: 0.3717
Epoch 4/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 262s 496ms/step - accuracy: 0.8739 - loss: 0.3194 - val_accuracy: 0.8501 - val_loss: 0.3603
Epoch 5/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 261s 495ms/step - accuracy: 0.8915 - loss: 0.2758 - val_accuracy: 0.8689 - val_loss: 0.3352


<keras.src.callbacks.history.History at 0x7b2ea26b63e0>

### 15. Evaluating the Model
After training the model, we evaluate its performance on the test set to determine how well it generalizes to unseen data.

### Action:
- The `evaluate()` function is used to assess the model’s performance on the test data (`X_test` and `Y_test`).
  - `loss`: The value of the loss function, which indicates how well the model's predictions match the true labels. A lower loss value indicates better performance.
  - `accuracy`: The accuracy metric shows the proportion of correct predictions. Higher accuracy indicates better model performance.

### Example Output:
- `Test Loss`: Displays the loss value on the test data.
- `Test Accuracy`: Displays the accuracy achieved on the test data.

### Benefit:
Evaluating the model helps us assess its effectiveness and determine if further improvements or adjustments are necessary before deploying it for sentiment prediction.


In [None]:
loss,accuracy=model.evaluate(X_test,Y_test)
print("Test Loss:",loss)
print("Test Accuracy:",accuracy)

313/313 ━━━━━━━━━━━━━━━━━━━━ 34s 109ms/step - accuracy: 0.8728 - loss: 0.3169
Test Loss: 0.3139933943748474
Test Accuracy: 0.8748000264167786
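
The accuracy figure is the share of reviews whose thresholded prediction matches the true label. A minimal sketch of that calculation, assuming a 0.5 threshold on the sigmoid output; the helper `binary_accuracy` and the sample values are made up for illustration.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability equals the label."""
    predictions = [1 if prob > threshold else 0 for prob in y_prob]
    correct = sum(int(pred == true) for pred, true in zip(predictions, y_true))
    return correct / len(y_true)

# Illustrative values only: the third prediction falls on the wrong side of 0.5
score = binary_accuracy([1, 0, 1, 1], [0.9, 0.2, 0.4, 0.7])
print(score)
```

Because accuracy only looks at which side of the threshold a prediction falls on, it should be read together with the loss, which also reflects how confident each prediction was.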


### 16. Building the Predictive System
Once the model is trained and evaluated, we can create a function to predict the sentiment of new, unseen reviews.

### Action:
- **Predict Sentiment Function**: The `predict_sentiment()` function takes a single review as input and predicts its sentiment (positive or negative).
  - **Tokenization**: The review is first converted into a sequence of integers using the `texts_to_sequences()` function of the tokenizer.
  - **Padding**: The sequence is then padded to ensure it has the same length as the sequences used during training (200 in this case).
  - **Prediction**: The `predict()` function is used to generate a prediction based on the padded sequence.
  - **Interpretation**: If the predicted value is greater than 0.5, the sentiment is classified as "Positive Review". Otherwise, it is classified as "Negative Review".

### Example:
- When a user inputs a review, the model will output whether the review is positive or negative based on the sentiment learned during training.

### Benefit:
This predictive system allows the trained model to be used in real-world applications where users can input reviews and get an instant sentiment classification (positive or negative).


In [None]:
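# Building a predictive system
def predict_sentiment(review):
  # Convert the raw text to a padded integer sequence, as was done for the training data
  sequence=tokenizer.texts_to_sequences([review])
  padded_sequence=pad_sequences(sequence,maxlen=200)
  # predict() returns an array of shape (1, 1); take the single probability out of it
  prediction=model.predict(padded_sequence)[0][0]
  if prediction>0.5:
    print("Positive Review")
  else:
    print("Negative Review")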
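
If the caller needs the result as a value rather than printed text, the interpretation step can be separated out. The sketch below is a hypothetical helper, `interpret_score`, that turns a sigmoid output into a label plus a rough confidence; it assumes the score has already been extracted as a plain float.

```python
def interpret_score(score, threshold=0.5):
    """Turn a sigmoid output into a label and the probability of that label."""
    if score > threshold:
        return "Positive Review", score
    return "Negative Review", 1.0 - score

print(interpret_score(0.9))
print(interpret_score(0.25))
```

Returning the probability alongside the label makes it possible to flag borderline reviews, for example those scoring close to 0.5, instead of treating every prediction as equally certain.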

### 17. Predicting the Sentiment of a New Review
Once the predictive system is in place, we can test it by providing a new review and observing the model’s sentiment prediction.

### Action:
- The new review, `"This movie was Fantastic and Mindblowing"`, is passed into the `predict_sentiment()` function.
- The function processes the review and outputs whether the sentiment is "Positive Review" or "Negative Review".

### Example Output:
- Given that the review contains positive words like "Fantastic" and "Mindblowing", the model will likely classify it as a "Positive Review".

### Benefit:
This step demonstrates the model's ability to make predictions on new data and provides an immediate sentiment classification for user input.


In [None]:
new_review="This movie was Fantastic and Mindblowing"
predict_sentiment(new_review)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 228ms/step
Positive Review


In [None]:
new_review1="This movie was not that much of good.The concept of movie was very bad"
predict_sentiment(new_review1)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 40ms/step
Negative Review


In [None]:
# Assuming 'tokenizer' is your trained tokenizer
with open('tokenizer.json', 'w') as f:
    f.write(tokenizer.to_json())  # Save the tokenizer in JSON format
