
# **NLP Dataset Cleaning & Sentiment Analysis Using LSTM**

## **1️⃣ Load and Clean Text Data**

In [15]:
import pandas as pd
import re
import nltk
import tensorflow as tf
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


# Download NLTK stopwords
nltk.download('stopwords')
# Load dataset
df = pd.read_csv("https://raw.githubusercontent.com/dD2405/Twitter_Sentiment_Analysis/master/train.csv")

# Select relevant columns
df = df[['label', 'tweet']]

# Build the stopword set once; calling stopwords.words() for every word is very slow
STOP_WORDS = set(stopwords.words("english"))

# Text Cleaning Function
def clean_text(text):
    text = re.sub(r'@[A-Za-z0-9]+', '', text)  # Remove mentions
    text = re.sub(r'#[A-Za-z0-9]+', '', text)  # Remove hashtags
    text = re.sub(r'http\S+', '', text)  # Remove links
    text = re.sub(r'[^a-zA-Z ]', '', text)  # Keep only letters and spaces
    text = text.lower()  # Convert to lowercase
    text = " ".join(word for word in text.split() if word not in STOP_WORDS)  # Drop stopwords
    return text

# Apply cleaning
df['clean_tweet'] = df['tweet'].apply(clean_text)

# Tokenize and pad sequences
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(df['clean_tweet'])
X = tokenizer.texts_to_sequences(df['clean_tweet'])
X = pad_sequences(X, maxlen=50)

# Encode labels (0 = Negative, 1 = Positive)
y = df['label'].values

# Split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
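
To see what each step of `clean_text` strips out, here is a minimal standard-library sketch run on a made-up tweet. The sample text and the tiny `STOPWORDS` set are assumptions for illustration only; the notebook itself uses NLTK's full English stopword list.

```python
import re

# Hypothetical, tiny stand-in for NLTK's English stopword list.
STOPWORDS = {"is", "a", "the", "this", "so", "to"}

def clean_text_demo(text):
    text = re.sub(r'@[A-Za-z0-9]+', '', text)  # drop @mentions
    text = re.sub(r'#[A-Za-z0-9]+', '', text)  # drop #hashtags, tag text included
    text = re.sub(r'http\S+', '', text)        # drop links
    text = re.sub(r'[^a-zA-Z ]', '', text)     # keep letters and spaces only
    text = text.lower()
    return " ".join(w for w in text.split() if w not in STOPWORDS)

sample = "@user This is SO great!!! Loving the weather #happy http://t.co/abc 100%"
cleaned = clean_text_demo(sample)
print(cleaned)  # great loving weather
```

Note that the hashtag rule discards the tag's text as well as the `#`, so a tweet whose sentiment lives mostly in its hashtags loses that signal.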

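The tokenize-and-pad step turns each cleaned tweet into a fixed-length row of integers. The sketch below imitates that idea in plain Python so the effect of `num_words` and `maxlen` is visible on a toy corpus. It is a simplified stand-in, not the Keras implementation; the helper names `build_word_index`, `to_sequence`, and `pad_pre` are hypothetical, and the toy texts are made up.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency: the most frequent word gets index 1 (0 is kept for padding)."""
    counts = Counter(word for t in texts for word in t.split())
    # most_common() orders by count; ties keep first-seen order.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index, num_words):
    """Map words to indices, dropping unseen words and words ranked at or beyond num_words."""
    return [word_index[w] for w in text.split()
            if w in word_index and word_index[w] < num_words]

def pad_pre(seq, maxlen):
    """Left-pad with zeros; if the sequence is too long, keep its last maxlen tokens."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

texts = ["good day", "good good mood", "bad day"]
word_index = build_word_index(texts)
seqs = [pad_pre(to_sequence(t, word_index, num_words=4), maxlen=4) for t in texts]

print(word_index)  # {'good': 1, 'day': 2, 'mood': 3, 'bad': 4}
print(seqs)        # [[0, 0, 1, 2], [0, 1, 1, 3], [0, 0, 0, 2]]
```

With `num_words=4` only indices 1 to 3 survive, which is why "bad" disappears from the last row. The same reasoning suggests that `num_words=5000` in the notebook keeps the 4,999 most frequent words.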

## **2️⃣ Build and Train LSTM Model**

In [27]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

# Define LSTM Model
model = Sequential([
    Embedding(input_dim=5000, output_dim=128, input_length=50),
    LSTM(64, return_sequences=True),
    Dropout(0.2),
    LSTM(32),
    Dropout(0.2),
    Dense(1, activation='sigmoid')
])


# Compile Model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train Model
model.fit(X_train, y_train, epochs=5, batch_size=64, validation_data=(X_test, y_test))

Epoch 1/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 7s 11ms/step - accuracy: 0.9181 - loss: 0.2719 - val_accuracy: 0.9432 - val_loss: 0.1695
Epoch 2/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 4s 10ms/step - accuracy: 0.9563 - loss: 0.1309 - val_accuracy: 0.9453 - val_loss: 0.1638
Epoch 3/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 4s 10ms/step - accuracy: 0.9667 - loss: 0.0993 - val_accuracy: 0.9438 - val_loss: 0.1725
Epoch 4/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 4s 10ms/step - accuracy: 0.9759 - loss: 0.0776 - val_accuracy: 0.9435 - val_loss: 0.1921
Epoch 5/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 4s 10ms/step - accuracy: 0.9794 - loss: 0.0610 - val_accuracy: 0.9381 - val_loss: 0.2116


<keras.src.callbacks.history.History at 0x7f3b5130b110>
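
In the log above, training loss falls every epoch while `val_loss` bottoms out at epoch 2 and then climbs, which is the usual sign of overfitting. The sketch below applies a simple patience rule to the logged values to show where training could have stopped. The `stop_epoch` helper is hypothetical and only illustrates the idea of early stopping; it is not the Keras callback.

```python
# val_loss per epoch, copied from the training log above.
val_loss = [0.1695, 0.1638, 0.1725, 0.1921, 0.2116]

def stop_epoch(losses, patience=1):
    """Return (best_epoch, stopping_epoch), both 1-indexed, for a simple patience rule."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            # New best validation loss: remember it and reset the counter.
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(losses)

print(stop_epoch(val_loss, patience=1))  # (2, 3)
print(stop_epoch(val_loss, patience=2))  # (2, 4)
```

Either setting points to epoch 2 as the best checkpoint, so the remaining epochs mostly fit the training set more tightly without improving validation loss.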

## **3️⃣ Evaluate & Test Sentiment Analysis Model**

In [28]:
# Evaluate model
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {accuracy:.4f}")

# Function to predict sentiment
def predict_sentiment(text):
    text = clean_text(text)
    text_seq = tokenizer.texts_to_sequences([text])
    text_pad = pad_sequences(text_seq, maxlen=50)
    predictions = model.predict(text_pad)
    print(predictions)
    prediction = predictions[0][0]
    # 0.03 is a very low cut-off (0.5 is the usual default for a sigmoid output)
    return "Positive" if prediction > 0.03 else "Negative"

# Test examples
print(predict_sentiment("This is good."))  # Expected: Positive
print(predict_sentiment("This is bad."))  # Expected: Negative

200/200 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.9376 - loss: 0.2028
Test Accuracy: 0.9381
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 189ms/step
[[0.06657148]]
Positive
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 35ms/step
[[0.02833427]]
Negative
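
Both probabilities printed above are far below 0.5, so the label depends entirely on the 0.03 cut-off used in `predict_sentiment`. The sketch below re-labels the same two scores at 0.03 and at 0.5 to make that dependence explicit. The scores are copied from the output; `label` is a hypothetical helper, and which threshold suits this dataset is not something the notebook establishes.

```python
# Probabilities copied from the model output above.
scores = {"This is good.": 0.06657148, "This is bad.": 0.02833427}

def label(p, threshold):
    """Apply the same rule as predict_sentiment with a configurable cut-off."""
    return "Positive" if p > threshold else "Negative"

for threshold in (0.03, 0.5):
    labels = {text: label(p, threshold) for text, p in scores.items()}
    print(threshold, labels)
```

At 0.03 the two sentences receive different labels; at 0.5 both come out "Negative". A threshold is more defensibly chosen from validation data than hand-tuned on two example sentences.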


## **🎯 Lab Summary**

- ✅ Cleaned and preprocessed real-world tweets.
- ✅ Tokenized and padded text sequences for LSTM input.
- ✅ Built a **deep learning-based LSTM sentiment classifier**.
- ✅ Evaluated and tested on **real examples**.