### Real-Time Sentiment Analysis for Customer Feedback Using Neural Networks and Streamlit App


**Goal: develop a system that uses a neural network (NN) model to perform sentiment analysis on customer feedback submitted through a web application.**

#### Dataset Loading and Preprocessing

In [1]:
# Install Hugging Face dataset loader
!pip install datasets --quiet

In [7]:
# Load TweetEval Sentiment Dataset
from datasets import load_dataset
import pandas as pd

dataset = load_dataset("tweet_eval", "sentiment")

In [8]:
# Convert to pandas DataFrames
df_train = dataset["train"].to_pandas()
df_val = dataset["validation"].to_pandas()
df_test = dataset["test"].to_pandas()

In [9]:
# Map numerical labels to text
label_map = {0: "Negative", 1: "Neutral", 2: "Positive"}
df_train["sentiment"] = df_train["label"].map(label_map)
df_val["sentiment"] = df_val["label"].map(label_map)
df_test["sentiment"] = df_test["label"].map(label_map)


In [10]:
# View sample data
df_train.head()

| | text | label | sentiment |
|---|---|---|---|
| 0 | "QT @user In the original draft of the 7th boo... | 2 | Positive |
| 1 | "Ben Smith / Smith (concussion) remains out of... | 1 | Neutral |
| 2 | Sorry bout the stream last night I crashed out... | 1 | Neutral |
| 3 | Chase Headley's RBI double in the 8th inning o... | 1 | Neutral |
| 4 | @user Alciato: Bee will invest 150 million in ... | 2 | Positive |


#### LSTM-Based Sentiment Classifier

In [6]:
# Import Required Libraries
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from sklearn.metrics import classification_report


In [7]:
# Tokenize and Pad the text
# Parameters
vocab_size = 20000
max_len = 100

# Tokenization
tokenizer = Tokenizer(num_words=vocab_size, oov_token='<OOV>')
tokenizer.fit_on_texts(df_train['text'])

X_train = tokenizer.texts_to_sequences(df_train['text'])
X_val = tokenizer.texts_to_sequences(df_val['text'])
X_test = tokenizer.texts_to_sequences(df_test['text'])

# Padding
X_train = pad_sequences(X_train, maxlen=max_len, padding='post')
X_val = pad_sequences(X_val, maxlen=max_len, padding='post')
X_test = pad_sequences(X_test, maxlen=max_len, padding='post')

y_train = df_train['label']
y_val = df_val['label']
y_test = df_test['label']
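
To make the padding step concrete, here is a minimal pure-Python sketch of what `pad_sequences(..., padding='post')` does to a single sequence. The helper name `pad_post` is hypothetical and this is not the Keras implementation; it only mirrors the documented defaults, where `truncating='pre'` means that sequences longer than `maxlen` lose tokens from the front, while zeros are appended at the end.

```python
def pad_post(seq, maxlen, value=0):
    """Illustrative stand-in for pad_sequences(padding='post', truncating='pre')."""
    # Default truncation is 'pre': keep only the last `maxlen` tokens.
    trimmed = seq[-maxlen:]
    # Post-padding: append the pad value until the sequence has length `maxlen`.
    return trimmed + [value] * (maxlen - len(trimmed))

short = pad_post([5, 8, 2], maxlen=6)
long = pad_post([1, 2, 3, 4, 5, 6, 7, 8], maxlen=6)
print(short)
print(long)
```

The Keras `Tokenizer` never assigns index 0 to a word, which is why 0 is safe to use as the padding value.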


**Checking Class Distribution (for class weights)**

In [10]:
from sklearn.utils import class_weight

# Compute class weights for imbalance handling
class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train),
    y=y_train
)
class_weights = dict(enumerate(class_weights))
print("Class Weights:", class_weights)


Class Weights: {0: 2.14366276610743, 1: 0.7355004111643206, 2: 0.8518684520141184}
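
The printed weights follow scikit-learn's "balanced" heuristic, `n_samples / (n_classes * count_c)`. The sketch below re-derives them in plain Python. The per-class counts are an assumption, back-computed from the printed weights and the 45,615 training rows rather than read from the dataset, and `balanced_weights` is a hypothetical helper.

```python
from collections import Counter

def balanced_weights(labels):
    """Weight for class c = n_samples / (n_classes * count of c)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}

# Assumed counts (back-computed from the printed weights):
# 0 = Negative, 1 = Neutral, 2 = Positive
labels = [0] * 7093 + [1] * 20673 + [2] * 17849
weights = balanced_weights(labels)
print(weights)
```

The rarest class (Negative) receives the largest weight, so each of its examples contributes roughly three times as much to the loss as a Neutral example.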


**LSTM Model Building**

In [11]:
model_lstm = Sequential([
    Embedding(input_dim=vocab_size, output_dim=128),
    LSTM(128, return_sequences=True),
    Dropout(0.5),
    LSTM(64),
    Dropout(0.5),
    Dense(3, activation='softmax')
])

model_lstm.build(input_shape=(None, max_len))
model_lstm.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model_lstm.summary()



In [16]:
print(X_train.shape,y_train.shape)
print(X_val.shape,y_val.shape)
print(X_test.shape,y_test.shape)

(45615, 50) (45615,)
(2000, 50) (2000,)
(12284, 50) (12284,)


**Model Training**

In [12]:
history_lstm = model_lstm.fit(X_train, y_train,
                              validation_data=(X_val, y_val),
                              epochs=5,
                              batch_size=32,
                              class_weight=class_weights)


Epoch 1/5
1426/1426 ━━━━━━━━━━━━━━━━━━━━ 85s 57ms/step - accuracy: 0.3232 - loss: 1.1005 - val_accuracy: 0.1560 - val_loss: 1.1029
Epoch 2/5
1426/1426 ━━━━━━━━━━━━━━━━━━━━ 81s 57ms/step - accuracy: 0.3621 - loss: 1.0825 - val_accuracy: 0.4345 - val_loss: 1.0927
Epoch 3/5
1426/1426 ━━━━━━━━━━━━━━━━━━━━ 81s 57ms/step - accuracy: 0.3118 - loss: 1.1011 - val_accuracy: 0.1560 - val_loss: 1.0995
Epoch 4/5
1426/1426 ━━━━━━━━━━━━━━━━━━━━ 80s 56ms/step - accuracy: 0.3110 - loss: 1.0983 - val_accuracy: 0.4095 - val_loss: 1.0959
Epoch 5/5
1426/1426 ━━━━━━━━━━━━━━━━━━━━ 81s 57ms/step - accuracy: 0.2326 - loss: 1.1034 - val_accuracy: 0.4095 - val_loss: 1.0868


**Evaluation on Test Data**

In [22]:
y_pred_probs = model_lstm.predict(X_test)
y_pred = np.argmax(y_pred_probs, axis=1)

print(classification_report(y_test, y_pred, target_names=['Negative', 'Neutral', 'Positive'], zero_division=0))


384/384 ━━━━━━━━━━━━━━━━━━━━ 6s 17ms/step
              precision    recall  f1-score   support

    Negative       0.00      0.00      0.00      3972
     Neutral       0.00      0.00      0.00      5937
    Positive       0.19      1.00      0.32      2375

    accuracy                           0.19     12284
   macro avg       0.06      0.33      0.11     12284
weighted avg       0.04      0.19      0.06     12284



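**The LSTM model is severely underperforming.**

- Test accuracy is 19%.
- The model predicts "Positive" for every input, so recall is 1.00 for "Positive" and 0.00 for the other two classes.
- It never predicts "Negative" or "Neutral".

The training loss stays near 1.10 and training accuracy stays near one third across all five epochs, so the model has not learned to separate the classes and has collapsed onto a single output. "Positive" is not the majority class in the test set (2,375 of 12,284 examples), so this is not simply a bias toward the majority class. Likely contributors include:

- Class imbalance in the training data.
- Embeddings trained from scratch, which give the model no prior semantic knowledge to generalize from.
- The `Embedding` layer does not set `mask_zero=True`, so the LSTM also processes the trailing padding zeros after each tweet, which may wash out the signal from the real tokens.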
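
Two quick sanity checks help when reading this output. The sketch below uses only numbers from the classification report and training log above: the cross-entropy of a uniform guess over three classes, and the accuracy obtained by labelling every test tweet "Positive".

```python
import math

# Cross-entropy of a uniform prediction over 3 classes: -ln(1/3) = ln(3)
uniform_loss = math.log(3)

# Test-set support per class, taken from the classification report
support = {"Negative": 3972, "Neutral": 5937, "Positive": 2375}
total = sum(support.values())

# Accuracy (and "Positive" precision) when every input is labelled "Positive"
all_positive_accuracy = support["Positive"] / total

print(f"uniform-guess loss: {uniform_loss:.4f}")
print(f"all-Positive accuracy: {all_positive_accuracy:.2f}")
```

Both values line up with the log: the loss hovers around 1.10 and the reported accuracy is 0.19, which is consistent with a model that has not moved away from an uninformed guess.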

**As the LSTM model is not performing well, let's use the pre-trained transformer model BERT for fine-tuning.**

In [None]:
!pip install ipywidgets



#### BERT-Based Sentiment Classifier (with Hugging Face)

In [1]:
# Import Libraries and Load Tokenizer

from transformers import BertTokenizer, TFBertForSequenceClassification
import tensorflow as tf

# Load BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')




**Encode the Text Data**

In [11]:
# Tokenize the text and truncate/pad to max length
def encode_texts(texts, labels):
    encodings = tokenizer(texts.tolist(), truncation=True, padding=True, max_length=128, return_tensors='tf')
    dataset = tf.data.Dataset.from_tensor_slices((
        dict(encodings),
        labels
    ))
    return dataset

train_dataset = encode_texts(df_train['text'], df_train['label']).batch(16)
val_dataset = encode_texts(df_val['text'], df_val['label']).batch(16)
test_dataset = encode_texts(df_test['text'], df_test['label']).batch(16)
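
As an illustration of what `truncation=True, padding=True, max_length=128` produce, the following sketch truncates a batch of token-id lists, pads them to the longest remaining sequence, and builds the matching attention mask. It is a simplified stand-in, not the Hugging Face implementation: `pad_batch` is a hypothetical helper, the ids and `pad_id=0` are assumptions for the example, and the real tokenizer also inserts special tokens and keeps the final `[SEP]` when it truncates.

```python
def pad_batch(batch, max_length=128, pad_id=0):
    """Truncate to max_length, pad to the longest remaining sequence, build masks."""
    truncated = [ids[:max_length] for ids in batch]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids; max_length=4 is chosen small to show truncation
encoded = pad_batch([[101, 2023, 102], [101, 2023, 2003, 2307, 102]], max_length=4)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Because `padding=True` pads to the longest sequence actually present, the encoded width can be smaller than `max_length` when every tweet is short.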



**Load and Compile the BERT Model**

In [12]:
model_bert = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics = ['accuracy']

model_bert.compile(optimizer=optimizer, loss=loss, metrics=metrics)
model_bert.summary()


All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  109482240 
                                                                 
 dropout_37 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  2307      
                                                                 
Total params: 109,484,547
Trainable params: 109,484,547
Non-trainable params: 0
_________________________________________________________________


**Train the BERT Model**

In [14]:
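# Quick fine-tuning run: 100 batches of 16 (1,600 tweets) for training,
# 30 batches of 16 (480 tweets) for validation, one epoch
model_bert.fit(train_dataset.take(100), validation_data=val_dataset.take(30), epochs=1)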





<keras.callbacks.History at 0x1b22c09c3d0>

**Evaluate on Test Set**

In [15]:
logits = model_bert.predict(test_dataset).logits
y_pred = tf.argmax(logits, axis=1).numpy()

from sklearn.metrics import classification_report
print(classification_report(df_test['label'], y_pred, target_names=['Negative', 'Neutral', 'Positive']))


              precision    recall  f1-score   support

    Negative       0.58      0.86      0.69      3972
     Neutral       0.82      0.34      0.48      5937
    Positive       0.50      0.82      0.62      2375

    accuracy                           0.60     12284
   macro avg       0.63      0.67      0.60     12284
weighted avg       0.68      0.60      0.58     12284
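
The summary rows of the report can be re-derived from the per-class rows, which is a useful check when reading it. The sketch below recomputes the macro averages and the overall accuracy from the rounded values printed above, so small rounding differences are possible; the variable names are illustrative.

```python
# Per-class (precision, recall, f1, support) copied from the BERT report
report = {
    "Negative": (0.58, 0.86, 0.69, 3972),
    "Neutral":  (0.82, 0.34, 0.48, 5937),
    "Positive": (0.50, 0.82, 0.62, 2375),
}
rows = list(report.values())
total = sum(r[3] for r in rows)

# Macro average: unweighted mean over the three classes
macro_precision = sum(r[0] for r in rows) / len(rows)
macro_recall = sum(r[1] for r in rows) / len(rows)
macro_f1 = sum(r[2] for r in rows) / len(rows)

# Accuracy equals recall weighted by class support
accuracy = sum(r[1] * r[3] for r in rows) / total

print(f"macro avg: {macro_precision:.2f} {macro_recall:.2f} {macro_f1:.2f}")
print(f"accuracy:  {accuracy:.2f}")
```

The low Neutral recall (0.34) is what holds accuracy at 0.60, since Neutral is the largest class in the test set.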

