
# **NLP Dataset Cleaning & Sentiment Analysis Using LSTM**

## **1️⃣ Load and Clean Text Data**

In [None]:
import pandas as pd
import re
import nltk
import tensorflow as tf
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


# Download NLTK stopwords
nltk.download('stopwords')
# Load dataset
df = pd.read_csv("https://raw.githubusercontent.com/dD2405/Twitter_Sentiment_Analysis/master/train.csv")

# Select relevant columns
df = df[['label', 'tweet']]

# Build the stop-word set once; calling stopwords.words() for every word is very slow
STOP_WORDS = set(stopwords.words("english"))

# Text Cleaning Function
def clean_text(text):
    text = re.sub(r'@[A-Za-z0-9]+', '', text)  # Remove mentions
    text = re.sub(r'#[A-Za-z0-9]+', '', text)  # Remove hashtags
    text = re.sub(r'http\S+', '', text)  # Remove links
    text = re.sub(r'[^a-zA-Z ]', '', text)  # Keep only letters and spaces
    text = text.lower()  # Convert to lowercase
    text = " ".join(word for word in text.split() if word not in STOP_WORDS)
    return text

# Apply cleaning
df['clean_tweet'] = df['tweet'].apply(clean_text)

# Tokenize and pad sequences
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(df['clean_tweet'])
X = tokenizer.texts_to_sequences(df['clean_tweet'])
X = pad_sequences(X, maxlen=50)

# Encode labels (0 = Negative, 1 = Positive)
y = df['label'].values

# Split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
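
To make the tokenize-and-pad step concrete, here is a minimal pure-Python sketch of what `Tokenizer(num_words=...)` and `pad_sequences(maxlen=...)` do with their default settings, assuming no `oov_token` is set. The helper names `build_word_index`, `texts_to_ids`, and `pad_left` are illustrative and are not part of Keras.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer rank: 1 = most frequent (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index, num_words):
    """Convert texts to id lists, dropping unknown words and ranks >= num_words."""
    return [
        [word_index[w] for w in text.split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

def pad_left(sequences, maxlen):
    """Left-pad with zeros and keep only the last maxlen ids ('pre' padding and truncation)."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]  # truncate from the front
        padded.append([0] * (maxlen - len(kept)) + kept)
    return padded

texts = ["good day good mood", "bad day"]
word_index = build_word_index(texts)
ids = texts_to_ids(texts, word_index, num_words=4)
padded = pad_left(ids, maxlen=3)
print(word_index)  # {'good': 1, 'day': 2, 'mood': 3, 'bad': 4}
print(padded)      # [[2, 1, 3], [0, 0, 2]]
```

Note that `num_words=4` keeps only ranks 1 to 3, so `bad` is dropped, and the four-word first sentence loses its first token when truncated to three ids.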


## **2️⃣ Build and Train LSTM Model**

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

# Define LSTM Model
model = Sequential([
    Embedding(input_dim=5000, output_dim=128),  # input_length is deprecated in Keras 3 and not needed
    LSTM(64, return_sequences=True),
    Dropout(0.2),
    LSTM(32),
    Dropout(0.2),
    Dense(1, activation='sigmoid')
])


# Compile Model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train Model
model.fit(X_train, y_train, epochs=5, batch_size=64, validation_data=(X_test, y_test))

Epoch 1/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 7s 11ms/step - accuracy: 0.9181 - loss: 0.2719 - val_accuracy: 0.9432 - val_loss: 0.1695
Epoch 2/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 4s 10ms/step - accuracy: 0.9563 - loss: 0.1309 - val_accuracy: 0.9453 - val_loss: 0.1638
Epoch 3/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 4s 10ms/step - accuracy: 0.9667 - loss: 0.0993 - val_accuracy: 0.9438 - val_loss: 0.1725
Epoch 4/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 4s 10ms/step - accuracy: 0.9759 - loss: 0.0776 - val_accuracy: 0.9435 - val_loss: 0.1921
Epoch 5/5
400/400 ━━━━━━━━━━━━━━━━━━━━ 4s 10ms/step - accuracy: 0.9794 - loss: 0.0610 - val_accuracy: 0.9381 - val_loss: 0.2116


<keras.src.callbacks.history.History at 0x7f3b5130b110>
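
The training log above shows validation loss bottoming out at epoch 2 and rising afterward while training accuracy keeps climbing, which usually indicates overfitting. As a minimal sketch, the patience rule behind early stopping can be replayed on the logged values; `best_epoch_with_patience` is a hypothetical helper, not a Keras API.

```python
def best_epoch_with_patience(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-based, under a simple patience rule."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch  # no improvement for `patience` epochs
    return len(val_losses), best_epoch

# val_loss values copied from the training log above
val_losses = [0.1695, 0.1638, 0.1725, 0.1921, 0.2116]
stop_epoch, best_epoch = best_epoch_with_patience(val_losses, patience=2)
print(f"stop after epoch {stop_epoch}, restore weights from epoch {best_epoch}")
```

In Keras the same idea is available as `tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)`, passed to `model.fit` through the `callbacks` argument.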

## **3️⃣ Evaluate & Test Sentiment Analysis Model**

In [None]:
# Evaluate model
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {accuracy:.4f}")

# Function to predict sentiment
def predict_sentiment(text, threshold=0.03):
    # threshold is far below the usual 0.5 sigmoid cutoff; treat it as a hand-picked value
    text = clean_text(text)
    text_seq = tokenizer.texts_to_sequences([text])
    text_pad = pad_sequences(text_seq, maxlen=50)
    predictions = model.predict(text_pad)
    print(predictions)
    prediction = predictions[0][0]
    return "Positive" if prediction > threshold else "Negative"

# Test examples
print(predict_sentiment("This is good."))  # Expected: Positive
print(predict_sentiment("This is bad."))  # Expected: Negative

200/200 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.9376 - loss: 0.2028
Test Accuracy: 0.9381
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 189ms/step
[[0.06657148]]
Positive
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 35ms/step
[[0.02833427]]
Negative
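
Both recorded probabilities are far below 0.5, so the predicted label depends entirely on the cutoff. A small sketch, using the two values printed above, shows how the same outputs are labeled under the notebook's 0.03 cutoff and under the conventional 0.5 cutoff; `label_for` is an illustrative helper.

```python
def label_for(probability, threshold):
    """Map a sigmoid output to a label using a strict greater-than cutoff."""
    return "Positive" if probability > threshold else "Negative"

# Probabilities copied from the recorded output above
recorded = {"This is good.": 0.06657148, "This is bad.": 0.02833427}

results = {}
for threshold in (0.03, 0.5):
    results[threshold] = {text: label_for(p, threshold) for text, p in recorded.items()}
    print(threshold, results[threshold])
```

With the 0.5 cutoff both sentences come out as Negative, which suggests the model's raw scores deserve a closer look before the cutoff is tuned further.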


## **🎯 Lab Summary**
✅ Cleaned and preprocessed real-world tweets.

✅ Tokenized and padded text sequences for LSTM input.

✅ Built a **deep learning-based LSTM sentiment classifier**.

✅ Evaluated and tested on **real examples**.

### **Example: Teaching Students to Build Trustworthy AI Systems**

#### **📌 Objective:**  
Help students **understand and apply ethical principles** in AI development, ensuring models are **fair, reliable, and transparent**.

---

### **1️⃣ Start with Real-World Examples**  
Begin with **case studies** showing the impact of untrustworthy AI:
- **Amazon’s Hiring Bias (2018)** – AI favored male candidates.
- **Self-Driving Car Failures** – Errors in AI perception leading to accidents.
- **Deepfakes & Fake News** – AI-generated misinformation risks.

📢 **Discussion Prompt:** *“What could have been done differently to prevent these failures?”*

---

### **2️⃣ Define the 4 Pillars of Trustworthy AI**  
1. **Fairness & Bias Mitigation**  
   - Teach **data fairness**: balancing datasets to avoid biased predictions.
   - Introduce **techniques** like re-weighting classes & adversarial debiasing.

2. **Transparency & Explainability**  
   - Use **SHAP/LIME** to interpret AI decisions.
   - Example: *Show why an image classifier predicts "dog" instead of "cat".*

3. **Robustness & Security**  
   - Demonstrate **adversarial attacks**: how minor pixel changes fool AI.
   - Teach **defenses** like adversarial training & anomaly detection.

4. **Privacy & Ethical Considerations**  
   - Discuss **GDPR, AI ethics guidelines**.
   - Example: *Show how federated learning improves privacy in AI training.*
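
As one concrete instance of the class re-weighting idea from pillar 1, the sketch below computes per-class weights that are inversely proportional to class frequency, using the common "balanced" heuristic `n_samples / (n_classes * class_count)`. The label list is made up for illustration, and `balanced_class_weights` is a hypothetical helper.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical imbalanced label list: 8 examples of class 0, 2 of class 1
labels = [0] * 8 + [1] * 2
weights = balanced_class_weights(labels)
print(weights)  # {0: 0.625, 1: 2.5}
```

A dictionary of this shape can be passed to Keras through `model.fit(..., class_weight=weights)` so that errors on the rarer class count more during training.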

---

### **3️⃣ Hands-on Activity: Bias Detection in AI**  
💻 **Example Code: Detecting Gender Bias in NLP Models**
```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier(["He is a doctor.", "She is a doctor."]))
```
📢 **Discussion:** *"Do the sentiment scores differ between the two sentences, and if so, is the gap large enough to suggest gender bias?"*

---

### **4️⃣ Implement Fairness Techniques**
💡 **Fixing AI Bias with Counterfactual Data**
```python
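# Balance dataset by augmenting underrepresented groups.
# Sketch only: add_synthetic_samples is defined in a later cell and filters on the
# "label" column, so minority_class must be a value present there (e.g. 1), not "female".
augmented_data = add_synthetic_samples(original_dataset, minority_class=1)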
```
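
One way to read "counterfactual data" is to pair each sentence with a copy in which gendered words are swapped, so the model sees both variants with the same label. The following is a minimal sketch under that assumption; the word table `SWAPS` and the helpers `gender_swap` and `add_counterfactuals` are illustrative and far from exhaustive.

```python
import re

# Small illustrative swap table; a real one needs many more pairs and care with
# ambiguous words such as "her" (which can map to "him" or "his").
SWAPS = {"he": "she", "she": "he", "man": "woman", "woman": "man",
         "himself": "herself", "herself": "himself"}

def gender_swap(sentence):
    """Return the sentence with each listed gendered word replaced by its counterpart."""
    def swap(match):
        word = match.group(0)
        target = SWAPS.get(word.lower())
        if target is None:
            return word  # not in the table: leave untouched
        # Preserve the capitalization of the original token
        return target.capitalize() if word[0].isupper() else target
    return re.sub(r"[A-Za-z]+", swap, sentence)

def add_counterfactuals(rows):
    """Append a gender-swapped copy of every (text, label) row that actually changes."""
    extra = [(gender_swap(text), label) for text, label in rows
             if gender_swap(text) != text]
    return rows + extra

rows = [("He is a doctor.", 1), ("The service was slow.", 0)]
augmented = add_counterfactuals(rows)
print(augmented)
```

Only the first row gains a counterpart ("She is a doctor." with the same label), because the second sentence contains no word from the swap table.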

📢 **Activity:** *Students apply fairness methods to their own dataset and compare model results.*

---

### **5️⃣ Final Takeaways**
✅ **AI should be ethical, explainable, and fair.**  
✅ **Bias detection & fairness techniques must be part of AI development.**  
✅ **Trustworthy AI = Better public trust & regulatory compliance.**  

🚀 **Next Steps:** Have students create a **"Fair AI Checklist"** to follow in future projects.

---


In [1]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier(["He is a doctor.", "She is a doctor."]))


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9941556453704834}, {'label': 'POSITIVE', 'score': 0.9947004318237305}]
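
Both sentences are labeled POSITIVE, and the scores in the output above differ only in the fourth decimal place. A quick sketch for quantifying such a gap follows; `score_gap` and the 0.01 tolerance are illustrative choices, not an established fairness criterion.

```python
def score_gap(result_a, result_b):
    """Signed gap in positive-class probability between two pipeline results."""
    def positive_prob(result):
        # The pipeline reports the score of the predicted label, so flip it
        # when the predicted label is NEGATIVE (valid for a two-class model).
        return result["score"] if result["label"] == "POSITIVE" else 1.0 - result["score"]
    return positive_prob(result_a) - positive_prob(result_b)

# Results copied from the output above
he = {"label": "POSITIVE", "score": 0.9941556453704834}
she = {"label": "POSITIVE", "score": 0.9947004318237305}

gap = score_gap(he, she)
print(f"gap = {gap:+.4f}")
print("within tolerance" if abs(gap) < 0.01 else "worth investigating")
```

A single sentence pair says little on its own; repeating the comparison over many templates and occupations gives a more reliable picture.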



In [8]:
import pandas as pd
import random
from textblob import TextBlob
from nltk.corpus import wordnet
import nltk

# Download necessary NLTK resources
nltk.download("wordnet")
nltk.download("omw-1.4")

# Small sample dataset (5 positive and 5 negative examples, so the classes start out balanced)
data = {
    "clean_tweet": [
        "I love this product!", "Horrible experience, never again!", "Amazing quality!", "Worst purchase ever.",
        "Absolutely fantastic!", "Terrible service.", "Best thing I bought!", "Not worth the money.",
        "Excellent customer support!", "Awful and disappointing."
    ],
    "label": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1 = Positive, 0 = Negative
}

# Create a DataFrame
original_dataset = pd.DataFrame(data)


# Function for Synonym Replacement
def synonym_replacement(text, n=2):
    """Replace up to n randomly chosen words with their first WordNet lemma."""
    words = text.split()
    new_words = words.copy()
    random_indices = random.sample(range(len(words)), min(n, len(words)))

    for i in random_indices:
        synonyms = wordnet.synsets(words[i])
        if synonyms:
            # Multi-word lemmas use underscores (e.g. "ice_cream"); restore the spaces
            new_words[i] = synonyms[0].lemmas()[0].name().replace("_", " ")

    return " ".join(new_words)

# Function for Back Translation
def back_translation(text, from_lang="en", to_lang="fr"):
    """Translate text to another language and back (French by default)."""
    try:
        # TextBlob.translate() was removed in recent TextBlob releases (0.18+); there the
        # call raises AttributeError and the original text is returned unchanged.
        blob = TextBlob(text)
        translated = blob.translate(from_lang=from_lang, to=to_lang)
        return str(translated.translate(from_lang=to_lang, to=from_lang))
    except Exception:
        return text  # Return original text if translation fails

# Function to Add Synthetic Samples
def add_synthetic_samples(df, minority_class=1, num_samples=5):
    """
    Generates synthetic data for underrepresented classes.
    - Uses synonym replacement and back-translation.
    """
    minority_df = df[df["label"] == minority_class]
    augmented_texts = []

    for _, row in minority_df.iterrows():
        new_text = synonym_replacement(row["clean_tweet"])
        new_text = back_translation(new_text)
        augmented_texts.append(new_text)

        if len(augmented_texts) >= num_samples:
            break

    new_samples = pd.DataFrame({"clean_tweet": augmented_texts, "label": minority_class})
    return pd.concat([df, new_samples], ignore_index=True)

# Augment the positive class (label 1) with up to 3 synthetic samples
balanced_dataset = add_synthetic_samples(original_dataset, minority_class=1, num_samples=3)

display(balanced_dataset)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


|   | clean_tweet | label |
|---|---|---|
| 0 | I love this product! | 1 |
| 1 | Horrible experience, never again! | 0 |
| 2 | Amazing quality! | 1 |
| 3 | Worst purchase ever. | 0 |
| 4 | Absolutely fantastic! | 1 |
| 5 | Terrible service. | 0 |
| 6 | Best thing I bought! | 1 |
| 7 | Not worth the money. | 0 |
| 8 | Excellent customer support! | 1 |
| 9 | Awful and disappointing. | 0 |
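
Because `text.split()` leaves punctuation attached, tokens such as `product!` are unlikely to match any WordNet entry, so many words may never be replaced. A minimal sketch of a punctuation-aware replacement step follows; the `SYNONYMS` dictionary is a made-up stand-in for the WordNet lookup, and `replace_words` is a hypothetical helper.

```python
import re

# Made-up stand-in for a WordNet lookup, used only to keep the sketch self-contained
SYNONYMS = {"love": "adore", "product": "item", "terrible": "awful"}

def replace_words(text, synonyms):
    """Replace known words while leaving punctuation and spacing untouched."""
    def swap(match):
        word = match.group(0)
        target = synonyms.get(word.lower())
        if target is None:
            return word  # no synonym known: keep the original word
        return target.capitalize() if word[0].isupper() else target
    # Match runs of letters only, so "product!" is looked up as "product"
    return re.sub(r"[A-Za-z]+", swap, text)

first = replace_words("I love this product!", SYNONYMS)
second = replace_words("Terrible service.", SYNONYMS)
print(first)   # I adore this item!
print(second)  # Awful service.
```

The same regular-expression approach could wrap the WordNet lookup in `synonym_replacement`, at the cost of replacing every matching word rather than a random subset.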
