# NLP for Business Applications: Sentiment Analysis

1. Text Representation Methods (Three)

2. Pre-trained models

3. Models from scratch

**Choosing the Right Deep Learning Model**

| Task                                | Use FFNN? | Use LSTM/RNN? | Use CNN? | Use Transformer? |
|-------------------------------------|-----------|---------------|----------|------------------|
| Customer churn (structured data)    | ✅ Yes    | ❌            | ❌       | ⚠️ Emerging use  |
| Product review sentiment (text)     | ❌        | ✅ Yes        | ❌       | ✅ Yes           |
| Sales forecasting (time series)     | ❌        | ✅            | ❌       | ✅ Growing trend |
| Image classification                | ❌        | ❌            | ✅ Yes   | ✅ (Vision Transformer) |


# 1. Text Representation: Old & New

- Bag of Words (BoW)
- TF-IDF (Term Frequency-Inverse Document Frequency)
- **Word Embedding**: Words are represented as learned dense vectors that **capture semantic relationships** (e.g., `king - man + woman ≈ queen`)

These different text representations become the input for (supervised) text models such as sentiment analysis.

## Bag of Words (BoW)

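This example uses **one-hot encoding**, which gives each word its own position in a vector as long as the vocabulary. Marking which words occur in a review produces that review's **BoW** vector. However, there are some key limitations: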

- There’s **no relationship** between words like `'buy'` and `'product'` — they are treated as completely unrelated.
- All vectors are the **same length as the vocabulary**, which can become very large.
- The vectors are **sparse** — most of the elements are zeros, which is inefficient for computation.
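The first limitation can be checked directly. The following is a minimal sketch that assumes a hypothetical four-word vocabulary (not the reviews used in this notebook); the helper names `one_hot` and `dot` are illustrative only.

```python
# Hypothetical vocabulary, chosen only to illustrate the point
vocab = ["buy", "product", "purchase", "service"]

def one_hot(word, vocab):
    """Return a vector with a single 1 at the word's position in the vocabulary."""
    return [1 if w == word else 0 for w in vocab]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

buy = one_hot("buy", vocab)
product = one_hot("product", vocab)
purchase = one_hot("purchase", vocab)

print(buy)                 # as long as the vocabulary, mostly zeros
print(dot(buy, product))   # 0: different words never overlap
print(dot(buy, purchase))  # 0: even near-synonyms look unrelated
print(dot(buy, buy))       # 1: a word only matches itself
```

Because every pair of distinct words has a dot product of zero, no notion of "similar meaning" can be read from these vectors.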


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# Toy customer reviews
docs = [
    "Great product and fast delivery",
    "Terrible product, very slow service",
    "Fast shipping and excellent service",
]

In [None]:
# Token-level one-hot using CountVectorizer with binary encoding
vectorizer = CountVectorizer(binary=True)  # binary=True makes it one-hot like
X = vectorizer.fit_transform(docs)

# Convert to DataFrame for visualization
one_hot_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
display(one_hot_df)

## The TF-IDF Matrix

- Each **row** represents a document (in this case, a customer review).
- Each **column** represents a word from the vocabulary.
- Each **value** in the matrix reflects how important that word is in the context of the specific document — higher means more important.
- **Common but uninformative words** (like *and*, *the*, *is*) are automatically down-weighted.
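To see where these values come from, here is a rough by-hand sketch using only the standard library. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row, which is scikit-learn's default as far as I know; the helper names are illustrative.

```python
import math
import re
from collections import Counter

docs = [
    "Great product and fast delivery",
    "Terrible product, very slow service",
    "Fast shipping and excellent service",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(d) for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    # Smoothed inverse document frequency (assumed default, see lead-in)
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf(tokens):
    counts = Counter(tokens)
    raw = {w: c * idf(w) for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))  # L2 norm of the row
    return {w: v / norm for w, v in raw.items()}

weights = tfidf(tokenized[0])
for word, weight in sorted(weights.items()):
    print(f"{word:10s} {weight:.3f}")
```

In the first review, words that appear in two of the three reviews (such as *and*) end up with a lower weight than words unique to that review (such as *great*).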


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Toy customer reviews
docs = [
    "Great product and fast delivery",
    "Terrible product, very slow service",
    "Fast shipping and excellent service",
]

# TF-IDF vectorizer
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(docs)

# Convert to DataFrame for readability
tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=vectorizer.get_feature_names_out())
display(tfidf_df)

## Word Embeddings (GloVe Example)

- Word embeddings represent words as **dense vectors** in a continuous vector space (e.g., 50 dimensions), which is far smaller than the vocabulary-sized vectors used by one-hot encoding.
- Unlike one-hot or TF-IDF, embeddings **capture semantic meaning**.
- Words like `'buy'` and `'purchase'` have **similar vector representations**, so they appear close together in the embedding space.
- Pre-trained models like **GloVe** or **Word2Vec** are trained on massive corpora (e.g., Wikipedia, news) and can be used directly for downstream NLP tasks.
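"Close together" is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; they are not real GloVe values, and real embeddings have many more dimensions.

```python
import math

# Hypothetical toy vectors, invented for this illustration
toy_vectors = {
    "buy":      [0.9, 0.1, 0.3],
    "purchase": [0.8, 0.2, 0.4],
    "banana":   [-0.2, 0.9, -0.5],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_synonyms = cosine_similarity(toy_vectors["buy"], toy_vectors["purchase"])
sim_unrelated = cosine_similarity(toy_vectors["buy"], toy_vectors["banana"])

print(f"buy vs purchase: {sim_synonyms:.3f}")
print(f"buy vs banana:   {sim_unrelated:.3f}")
```

Vectors that point in nearly the same direction score close to 1, while unrelated vectors score near or below 0.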


In [None]:
!pip install --upgrade gensim --quiet

In [None]:
import gensim.downloader as api

# Load small pre-trained word embeddings (GloVe 50-dimensions)
glove = api.load("glove-wiki-gigaword-50")     # Trained on Wikipedia + Gigaword news corpus

In [None]:
# Look at the vector for a word
print("Vector for 'buy':")
print(glove['buy'])

In [None]:
glove.most_similar("buy")

In [None]:
# try a different word (e.g., king)


In [None]:
# Similarity between two words
print("\nSimilarity between 'buy' and 'purchase':")
print(glove.similarity('buy', 'purchase'))

In [None]:
# try another two words (e.g., king, queen)



In [None]:
# Most similar words to 'cheap'
print("\nWords most similar to 'cheap':")
print(glove.most_similar('cheap'))

In [None]:
# Word vector analogy
result = glove.most_similar(positive=["king", "woman"], negative=["man"])  # glove["king"] - glove["man"] + glove["woman"]
print(result[:5])  # Top 5 most similar words to the analogy

# 2. Sentiment Analysis Using Pre-trained Models

<img src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/cecbccba-6358-476e-9fd8-e2807de9f220/Frame_118.png?t=1693044751" width=500>

Hugging Face was founded in 2016.

It hosts thousands of pre-trained models (e.g., BERT, DistilBERT, RoBERTa) that you can use **without training from scratch**!

[Go to Hugging Face](https://huggingface.co/) and explore [the pre-trained models available on the website](https://huggingface.co/models).

In [None]:
# Sample product reviews
reviews = [
    "The product quality is excellent and exceeded my expectations!",
    "Terrible experience. I want a refund.",
    "Pretty good, but the delivery was slow.",
    "Absolutely love it! Will buy again.",
    "The item broke after one week. Very disappointed.",
]

df = pd.DataFrame({'Review': reviews})
df

In [None]:
from transformers import pipeline

In [None]:
# Use HuggingFace sentiment analysis pipeline
classifier = pipeline("sentiment-analysis",
                      model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")
results = classifier(df['Review'].tolist())

for result in results:
    print(f"Label: {result['label']}, with score: {round(result['score'], 4)}")

In [None]:
# Add results to DataFrame
df['Sentiment'] = [r['label'] for r in results]
df['Score'] = [r['score'] for r in results]
df

In [None]:
sentiment_counts = df['Sentiment'].value_counts()
sentiment_counts.plot(kind='bar', color=['lightgreen', 'salmon'], edgecolor='black')
plt.title('Sentiment Breakdown')
plt.ylabel('Count')
plt.xlabel('Sentiment')
plt.xticks(rotation=0)
plt.show()


## Search for sentiment analysis models

In [None]:
# Install huggingface_hub if not already installed
!pip install -q huggingface_hub

In [None]:
from huggingface_hub import HfApi

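# Initialize the API client (named hf_api so it does not overwrite gensim's `api` imported earlier)
hf_api = HfApi()

# Search text-classification models whose name mentions "sentiment"
# ("sentiment-analysis" is a transformers pipeline alias; the Hub tag is "text-classification")
models = hf_api.list_models(search="sentiment",
                            filter="text-classification", limit=50)

# Print the model IDs
for model in models:
    print(model.id)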

## Twitter-roBERTa-base for Sentiment Analysis

This is a [RoBERTa-base model](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest) trained on ~124M tweets from January 2018 to December 2021, and finetuned for sentiment analysis with the TweetEval benchmark.

In [None]:
model_name = "cardiffnlp/twitter-roberta-base-sentiment-latest"
sentiment_task = pipeline("sentiment-analysis", model=model_name)
sentiment_task("Covid cases are increasing fast!")

## Financial sentiment analysis: FinancialBERT for Sentiment Analysis

- [FinancialBERT](https://huggingface.co/ahmedrachid/FinancialBERT-Sentiment-Analysis) is a BERT model pre-trained on a large corpus of financial texts. Its purpose is to support financial NLP research and practice, so that practitioners and researchers can benefit from the model without the significant computational resources required to train it.

- The model was fine-tuned for the sentiment analysis task on the Financial PhraseBank dataset. According to its authors, experiments show that this model outperforms general BERT and other financial domain-specific models.

- More details on FinancialBERT's pre-training process can be found in [this article](https://www.researchgate.net/publication/358284785_FinancialBERT_-_A_Pretrained_Language_Model_for_Financial_Text_Mining).

**Training data**

- The FinancialBERT model was fine-tuned on Financial PhraseBank, a dataset of 4840 sentences from financial news categorised by sentiment (negative, neutral, positive).

In [None]:
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import pipeline

model = BertForSequenceClassification.from_pretrained("ahmedrachid/FinancialBERT-Sentiment-Analysis",num_labels=3)
tokenizer = BertTokenizer.from_pretrained("ahmedrachid/FinancialBERT-Sentiment-Analysis")

nlp = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

sentences = ["Operating profit rose to EUR 13.1 mn from EUR 8.7 mn in the corresponding period in 2007 representing 7.7 % of net sales.",
             "Bids or offers include at least 1,000 shares and the value of the shares must correspond to at least EUR 4,000.",
             "Raute reported a loss per share of EUR 0.86 for the first half of 2009 , against EPS of EUR 0.74 in the corresponding period of 2008.",
             ]
results = nlp(sentences)

for result in results:
    print(f"Label: {result['label']}, with score: {round(result['score'], 4)}")

# 3. RNN / LSTM (Recurrent Neural Network / Long Short-Term Memory) - Supervised Learning


<img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*3ltsv1uzGR6UBjZ6CUs04A.jpeg" width=500>

- Commonly used for **supervised learning**

- Designed for **sequence data** like **text**, time series, or logs

- Remembers context across steps (e.g., previous words or sales days)

LSTM is an improved version of the RNN that mitigates the “forgetting” problem: a plain RNN tends to lose track of earlier inputs in long sequences.
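The idea of "remembering context across steps" can be sketched with a one-unit vanilla RNN (a plain RNN, not an LSTM). The weights `w_x` and `w_h` below are arbitrary values chosen for illustration; a real model learns them during training.

```python
import math

def run_rnn(inputs, w_x=0.5, w_h=0.8):
    """Run a one-unit vanilla RNN over a sequence and return every hidden state."""
    h = 0.0  # the hidden state starts empty
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

forward = run_rnn([1.0, 0.0, -1.0])
backward = run_rnn([-1.0, 0.0, 1.0])

print("Same values, original order:", round(forward[-1], 4))
print("Same values, reversed order:", round(backward[-1], 4))
```

The two sequences contain the same values, yet the final hidden states differ, because each step depends on what came before it. This is why sequence models can tell "not good" apart from "good, not ...".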

```python
model.add(Embedding(input_dim=5000, output_dim=128))  # Converts word indices into 128-dimensional dense vectors (embedding layer)
model.add(LSTM(64))                                   # Adds an LSTM layer with 64 units to capture sequential patterns in the data
model.add(Dense(1, activation='sigmoid'))             # Adds an output layer for binary classification (sigmoid activation outputs probability between 0 and 1)
```



🔡 Embedding Layer

- Converts each word (as an integer index) into a 128-dimensional vector

- input_dim=5000: The model will recognize up to 5,000 unique words

- output_dim=128: Each word will be mapped to a 128-length vector

🔁 LSTM Layer (Long Short-Term Memory)

- Processes the sequence of word embeddings

- Remembers context from earlier words (helps with understanding meaning like “not good”)

- 64 = number of LSTM units, i.e., the size of the hidden state, or how much “memory power” this layer has

🧠 Think of this as the "reader" of the sentence, remembering important pieces as it goes.
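As a rough check on the size of this architecture, the number of trainable weights can be worked out by hand. The sketch assumes the standard LSTM layout (four gate blocks, each with input weights, recurrent weights, and a bias), which should correspond to what `model.summary()` reports for these layer sizes.

```python
vocab_size = 5000     # input_dim of the Embedding layer
embedding_dim = 128   # output_dim of the Embedding layer
lstm_units = 64       # units in the LSTM layer

# Embedding: one 128-dimensional vector per word index
embedding_params = vocab_size * embedding_dim

# LSTM: four gate blocks, each with input weights, recurrent weights, and a bias
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Dense: one weight per LSTM unit plus a single bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params

print("Embedding:", embedding_params)  # 640000
print("LSTM:     ", lstm_params)       # 49408
print("Dense:    ", dense_params)      # 65
print("Total:    ", total_params)      # 689473
```

Most of the weights sit in the embedding layer, which is one reason pre-trained embeddings are attractive when labelled data is scarce.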


## A Simple Example (Supervised Learning / Supervised Sentiment Analysis)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Tokenize and pad the text
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Compile the LSTM model and train the model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.optimizers import Adam             # Adam: an efficient optimizer for training

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

import warnings
warnings.filterwarnings("ignore")

# Set seeds for reproducibility
import tensorflow as tf
import random
seed_value = 42  # Choose any seed value you want
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

In [None]:
reviews = [
    "The product quality is excellent and exceeded my expectations!",
    "Terrible experience. I want a refund.",
    "Pretty good, but the delivery was slow.",
    "Absolutely love it! Will buy again.",
    "The item broke after one week. Very disappointed.",
]

labels = [1, 0, 0, 1, 0]  # 1 = positive, 0 = negative

df = pd.DataFrame({'Review': reviews, 'Label': labels})

df

### Tokenize and pad the text

In [None]:
# Tokenize and pad the text
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(df['Review'])    # builds a word-to-index dictionary

tokenizer.word_index

In [None]:
# turns each sentence into a list of token IDs based on the vocabulary we've built.
X = tokenizer.texts_to_sequences(df['Review'])
X

In deep learning for NLP, especially with models like LSTMs or Transformers, the model expects every input in a batch to be **a sequence of the same length**. That’s where **padding** comes in.
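The tokenize-then-pad steps can be sketched without Keras. This is a simplified stand-in, not the Keras implementation: the helper names are hypothetical, the word splitting is cruder than the Keras `Tokenizer`, and it assumes the Keras defaults of reserving index 0 for padding and adding zeros at the front.

```python
import re
from collections import Counter

WORD_PATTERN = r"[a-z0-9']+"

def build_word_index(texts):
    """Map each word to an integer, most frequent first; 0 is reserved for padding."""
    counts = Counter(w for t in texts for w in re.findall(WORD_PATTERN, t.lower()))
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace every known word with its integer ID; unknown words are skipped."""
    return [
        [word_index[w] for w in re.findall(WORD_PATTERN, t.lower()) if w in word_index]
        for t in texts
    ]

def pad(sequences, value=0):
    """Left-pad every sequence with `value` up to the length of the longest one."""
    max_len = max(len(s) for s in sequences)
    return [[value] * (max_len - len(s)) + s for s in sequences]

toy_texts = ["good product", "not a good product at all"]
word_index = build_word_index(toy_texts)
sequences = texts_to_sequences(toy_texts, word_index)
padded = pad(sequences)

print(word_index)
print(sequences)  # [[1, 2], [3, 4, 1, 2, 5, 6]]
print(padded)     # [[0, 0, 0, 0, 1, 2], [3, 4, 1, 2, 5, 6]]
```

After padding, both rows have the same length, so they can be stacked into a single array and fed to the model as one batch.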



In [None]:
X = pad_sequences(X)   # padding
y = df['Label'].values

In [None]:
# after padding
X

### Build and compile the LSTM model

In [None]:
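# Initializing the model (layers follow the Embedding -> LSTM -> Dense sketch above)
model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=128))  # word indices -> 128-dimensional vectors
model.add(LSTM(64))                                   # 64 LSTM units read the sequence
model.add(Dense(1, activation='sigmoid'))             # probability that the review is positive

In [None]:
# Configuring the model (one reasonable choice for binary classification)
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [None]:
# train the model
model.fit(X, y, epochs=15)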

In [None]:
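# Note: this evaluates on the same five reviews used for training, so the score is optimistic
loss, accuracy = model.evaluate(X, y)
print(f"Training accuracy: {accuracy:.2f}")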

In [None]:
# Step 1: Predict probabilities
y_probs = model.predict(X)

# Step 2: Convert probabilities to class labels (0 or 1)
y_pred = (y_probs > 0.5).astype(int)

# Step 3: Create confusion matrix
cm = confusion_matrix(y, y_pred)
cm

In [None]:
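# Add results to your DataFrame (flatten the (n, 1) arrays into plain columns)
df['Predicted'] = y_pred.ravel()
df['Confidence'] = y_probs.ravel()   # predicted probability of the positive class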

df

### Make Predictions

In [None]:
test_sentence = ["Terrible. I want a refund"]

In [None]:
# Tokenize and pad the sentence to the same length as the training sequences
test_seq = tokenizer.texts_to_sequences(test_sentence)
test_pad = pad_sequences(test_seq, maxlen=X.shape[1])

# Predict sentiment
pred = model.predict(test_pad)[0][0]

# Show result
sentiment = "Positive" if pred > 0.5 else "Negative"
print(f"Review: {test_sentence[0]}")
print(f"Predicted Sentiment: {sentiment} (Confidence: {pred:.2f})")

---

# Discussion Prompts

1. How could a business use this sentiment data from customer reviews?
2. What might be some challenges with relying solely on sentiment analysis?
3. What business decisions could you inform with this insight?
