## EmpathyBot – Notebook Overview

This notebook demonstrates building a personalized empathetic chatbot using transformer-based emotion models and semantic embeddings. The chatbot predicts the emotion of user input, retrieves similar texts from a dataset, highlights key words, and generates an empathetic response.

## 📦 Install Required Packages

In [66]:
!pip install -q transformers sentence-transformers faiss-cpu streamlit fastapi uvicorn pandas nltk pyngrok datasets scikit-learn

## Load the Emotion Dataset

In [95]:
from datasets import load_dataset
import pandas as pd

# Load emotion dataset
ds = load_dataset("cardiffnlp/tweet_eval", "emotion")

# Convert to DataFrame
df_train = pd.DataFrame(ds["train"])

In [96]:
df_train

|  | text | label |
| ---- | ---- | ----- |
| 0 | “Worry is a down payment on a problem you may ... | 2 |
| 1 | My roommate: it's okay that we can't spell bec... | 0 |
| 2 | No but that's so cute. Atsu was probably shy a... | 1 |
| 3 | Rooneys fucking untouchable isn't he? Been fuc... | 0 |
| 4 | it's pretty depressing when u hit pan on ur fa... | 3 |
| ... | ... | ... |
| 3252 | I get discouraged because I try for 5 fucking ... | 3 |
| 3253 | The @user are in contention and hosting @user ... | 3 |
| 3254 | @user @user @user @user @user as a fellow UP g... | 0 |
| 3255 | You have a #problem? Yes! Can you do #somethin... | 0 |


## 🎯 Filter Important Emotion Labels

The `emotion` subset of `tweet_eval` has four labels. For our analysis and chatbot, we keep only joy (1) and sadness (3), which the rest of the notebook treats as Positive and Negative. Anger and optimism are ignored for simplicity.

| Label | Emotion  | Used in this notebook       |
| ----- | -------- | --------------------------- |
| 0     | anger    | No                          |
| 1     | joy      | Yes (Positive / happiness)  |
| 2     | optimism | No                          |
| 3     | sadness  | Yes (Negative / sadness)    |


In [97]:
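# Keep only Positive (1) and Negative (3) labels.
# .copy() gives an independent DataFrame, so later column assignments
# do not trigger pandas' SettingWithCopyWarning.
df = df_train[df_train['label'].isin([1, 3])].copy()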
df.head()

|  | text | label |
| ---- | ---- | ----- |
| 2 | No but that's so cute. Atsu was probably shy a... | 1 |
| 4 | it's pretty depressing when u hit pan on ur fa... | 3 |
| 6 | Making that yearly transition from excited and... | 3 |
| 7 | Tiller and breezy should do a collab album. Ra... | 1 |
| 11 | #NewYork: Several #Baloch &amp; Indian activis... | 3 |


## Text Preprocessing Step

Before feeding text data into the model, it's important to clean and normalize it. Preprocessing helps the model focus on meaningful content and reduces noise from URLs, mentions, or extra spaces.

In [84]:
import re

def preprocess_text(text):
    """
    Preprocess text without removing emojis.
    """
    text = text.lower()
    # Escape the dot so only literal "www." prefixes count as URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

df['clean_text'] = df['text'].apply(preprocess_text)
df.drop(columns=["text"], inplace=True)
df.head()


|  | label | clean_text |
| ---- | ----- | ---------- |
| 2 | 1 | no but that's so cute. atsu was probably shy a... |
| 4 | 3 | it's pretty depressing when u hit pan on ur fa... |
| 6 | 3 | making that yearly transition from excited and... |
| 7 | 1 | tiller and breezy should do a collab album. ra... |
| 11 | 3 | #newyork: several #baloch &amp; indian activis... |
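
As a quick check of the cleaning rules, the sketch below re-declares the same steps in a standalone helper and runs it on made-up strings (the name `clean_tweet` and the samples are illustrative, not part of the notebook). It also compares the URL pattern with and without an escaped dot after `www`: unescaped, `.` matches any character, so a stretched-out word such as "awww" followed by a space is treated as the start of a URL.

```python
import re

def clean_tweet(text, url_pattern=r'http\S+|www\.\S+'):
    """Lowercase, drop URLs and @mentions, collapse whitespace (emojis are kept)."""
    text = text.lower()
    text = re.sub(url_pattern, '', text)
    text = re.sub(r'@\w+', '', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = "Check THIS out @user https://t.co/abc   so cute 😊"
print(clean_tweet(sample))          # check this out so cute 😊

# Same input, two URL patterns: unescaped dot vs. escaped dot.
loose = clean_tweet("awww so sweet", url_pattern=r'http\S+|www.\S+')
strict = clean_tweet("awww so sweet")
print(repr(loose), repr(strict))    # 'a sweet' 'awww so sweet'
```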


## Load Pre-trained Emotion Model

To predict emotions (happiness/sadness) from text, I use a pre-trained BERT-based sentiment model.

The `nlptown/bert-base-multilingual-uncased-sentiment` model predicts sentiment across multiple languages and returns a score for each rating from 1 to 5 stars. We map these ratings to binary emotion labels (Positive → Happiness, Negative → Sadness). Because it is a review-sentiment model rather than an emotion classifier, this mapping is only an approximation.

In [85]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

def create_emotion_model(model_name="nlptown/bert-base-multilingual-uncased-sentiment"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    classifier = pipeline(
        "text-classification",
        model=model,
        tokenizer=tokenizer,
        return_all_scores=True
    )
    return classifier

emotion_model = create_emotion_model()

Device set to use cuda:0


## Test the Emotion Model

In [86]:
# Test
text = "I feel really happy today"
results = emotion_model(text)
for r in results[0]:
    print(f"Emotion: {r['label']}, Score: {r['score']:.4f}")

Emotion: 1 star, Score: 0.0033
Emotion: 2 stars, Score: 0.0035
Emotion: 3 stars, Score: 0.0247
Emotion: 4 stars, Score: 0.2547
Emotion: 5 stars, Score: 0.7138


## Binary Emotion Mapping

1–2 stars → Sadness (negative)

3–5 stars → Happiness (positive)

In [87]:
# 1–2 stars → negative (3), 3–5 stars → positive (1)
def predict_binary_label_star(text, model):
    results = model(text)[0]
    top_star = max(results, key=lambda x: x['score'])['label']
    star_num = int(top_star.split()[0])

    return 3 if star_num <= 2 else 1
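
To make the mapping concrete, here is a small standalone sketch that applies the same arg-max rule to the scores printed in the test cell above, without loading the model. The `expected_stars` helper is not part of the notebook: it is a hypothetical alternative that uses the whole score distribution, which might make it easier to experiment with where the positive/negative cut-off sits.

```python
# Scores copied from the "I feel really happy today" test output.
scores = [
    {"label": "1 star", "score": 0.0033},
    {"label": "2 stars", "score": 0.0035},
    {"label": "3 stars", "score": 0.0247},
    {"label": "4 stars", "score": 0.2547},
    {"label": "5 stars", "score": 0.7138},
]

def binary_label_from_scores(results):
    """Arg-max rule: 1-2 stars -> 3 (negative), 3-5 stars -> 1 (positive)."""
    top_star = max(results, key=lambda r: r["score"])["label"]
    return 3 if int(top_star.split()[0]) <= 2 else 1

def expected_stars(results):
    """Hypothetical alternative: probability-weighted mean star rating."""
    total = sum(r["score"] for r in results)
    return sum(int(r["label"].split()[0]) * r["score"] for r in results) / total

label = binary_label_from_scores(scores)
mean_stars = expected_stars(scores)
print(label, round(mean_stars, 2))  # 1 4.67
```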

In [88]:
# Apply to dataset
df['predicted_label'] = df['clean_text'].apply(lambda x: predict_binary_label_star(x, emotion_model))
df['predicted_emotion'] = df['predicted_label'].map({1: "happiness", 3: "sadness"})
df.head()


|  | label | clean_text | predicted_label | predicted_emotion |
| ---- | ----- | ---------- | --------------- | ----------------- |
| 2 | 1 | no but that's so cute. atsu was probably shy a... | 1 | happiness |
| 4 | 3 | it's pretty depressing when u hit pan on ur fa... | 1 | happiness |
| 6 | 3 | making that yearly transition from excited and... | 1 | happiness |
| 7 | 1 | tiller and breezy should do a collab album. ra... | 3 | sadness |
| 11 | 3 | #newyork: several #baloch &amp; indian activis... | 3 | sadness |


## Evaluating Model Performance

This cell evaluates how well the binary emotion mapping (Happiness vs Sadness) agrees with the original dataset labels, using a confusion matrix, a per-class report, and the overall match rate.

In [89]:
from sklearn.metrics import classification_report, confusion_matrix

cm = confusion_matrix(df['label'], df['predicted_label'], labels=[1,3])
print("Confusion Matrix:\n", cm)

report = classification_report(df['label'], df['predicted_label'], labels=[1,3])
print("Classification Report:\n", report)

matches = (df['label'] == df['predicted_label']).sum()
total = len(df)
print(f"Exact matches: {matches}/{total} ({matches/total:.2%})")


Confusion Matrix:
 [[514 194]
 [237 618]]
Classification Report:
               precision    recall  f1-score   support

           1       0.68      0.73      0.70       708
           3       0.76      0.72      0.74       855

    accuracy                           0.72      1563
   macro avg       0.72      0.72      0.72      1563
weighted avg       0.73      0.72      0.72      1563

Exact matches: 1132/1563 (72.42%)
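
The figures in the report can be reproduced by hand from the confusion matrix, which is a useful sanity check. In the sketch below, rows are true labels and columns are predicted labels, in the order `[1, 3]` passed to `confusion_matrix`. The helper name `per_class_metrics` is illustrative.

```python
# Confusion matrix from the output above: rows = true label, columns = predicted label.
cm = [[514, 194],   # true 1 (positive): 514 predicted as 1, 194 predicted as 3
      [237, 618]]   # true 3 (negative): 237 predicted as 1, 618 predicted as 3

def per_class_metrics(cm, i):
    """Precision, recall and F1 for the class at row/column i."""
    tp = cm[i][i]
    predicted = sum(row[i] for row in cm)   # column sum: everything predicted as class i
    actual = sum(cm[i])                     # row sum: everything that truly is class i
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))
for name, i in (("1", 0), ("3", 1)):
    p, r, f = per_class_metrics(cm, i)
    print(f"label {name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.4f}")
```

The printed values agree with the report: 0.68 / 0.73 / 0.70 for label 1, 0.76 / 0.72 / 0.74 for label 3, and an accuracy of 0.7242.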


## Creating Embeddings and Building FAISS Index

In [90]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Load embedding model
embed_model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode texts
texts = df['clean_text'].tolist()
text_embeddings = embed_model.encode(texts, convert_to_tensor=True)

# FAISS index
dimension = text_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(text_embeddings.cpu()))

# Map label to emotion
label_to_emotion = {1: "happiness", 3: "sadness"}
df['predicted_emotion'] = df['predicted_label'].map(label_to_emotion)



In [91]:
df["clean_text"].iloc[0], df["predicted_emotion"].iloc[0]

("no but that's so cute. atsu was probably shy about photos before but cherry helped her out uwu",
 'happiness')
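
`IndexFlatL2` performs an exact, brute-force search and reports squared Euclidean distances. The pure-Python sketch below mimics that behaviour on toy 2-D vectors so the idea can be followed without FAISS installed. `flat_l2_search` and the vectors are made up for illustration and are not part of the notebook.

```python
def flat_l2_search(vectors, query, k):
    """Return (distances, indices) of the k nearest vectors by squared L2 distance."""
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()                       # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

toy_vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
distances, indices = flat_l2_search(toy_vectors, query=(1.0, 1.0), k=2)
print(distances, indices)               # [1.0, 2.0] [1, 0]
```

A flat index compares the query with every stored vector, which is fine for a dataset of this size (1,563 rows) but grows linearly with the number of texts.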

## Retrieving Similar Responses and Generating Empathetic Replies

In [92]:
def retrieve_similar_responses(user_text, user_emotion, top_k=3):
    user_emb = embed_model.encode([user_text], convert_to_tensor=True).cpu().numpy()
    distances, indices = index.search(user_emb, top_k*5)

    retrieved = []
    for idx in indices[0]:
        if df.iloc[idx]['predicted_emotion'] == user_emotion:
            retrieved.append(df.iloc[idx]['clean_text'])
        if len(retrieved) == top_k:
            break
    return retrieved

def generate_empathetic_reply(user_text, user_emotion, top_k=3):
    retrieved = retrieve_similar_responses(user_text, user_emotion, top_k=top_k)
    combined_reply = " ".join(retrieved)
    final_reply = f"{combined_reply} I’m not a therapist; please seek professional help for serious issues."
    return final_reply

# Test sample
sample = "heard of panic! at the disco? how about kach-ing! at the atm"
reply = generate_empathetic_reply(sample, "sadness")
print("Sample reply:", reply)

Sample reply: heard of panic! at the disco? how about kach-ing! at the atm i really want to go for fright night but i really don't 😁 anyyyyone wanna go to fright fest with me on friday night? 👻 I’m not a therapist; please seek professional help for serious issues.
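
Note that the sample reply starts with the query itself: the sample sentence comes from the dataset, so its nearest neighbour is its own entry. One possible refinement, sketched below with plain Python lists standing in for the FAISS results, is to skip exact matches before applying the emotion filter. The function name `pick_neighbours` and the toy data are assumptions for illustration.

```python
def pick_neighbours(query, candidates, wanted_emotion, top_k=3):
    """Keep up to top_k candidates with the wanted emotion, skipping the query itself.

    `candidates` is a list of (text, emotion) pairs already sorted by distance,
    standing in for the rows looked up from the FAISS indices.
    """
    picked = []
    for text, emotion in candidates:
        if text == query:               # the query's own dataset entry
            continue
        if emotion == wanted_emotion:
            picked.append(text)
        if len(picked) == top_k:
            break
    return picked

candidates = [
    ("rough day at work", "sadness"),           # identical to the query
    ("best day ever", "happiness"),             # wrong emotion
    ("work was exhausting today", "sadness"),
    ("missing my old job", "sadness"),
]
result = pick_neighbours("rough day at work", candidates, "sadness", top_k=2)
print(result)   # ['work was exhausting today', 'missing my old job']
```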


## Testing the EmpathyBot on Sample Dataset Entries

In [93]:
# Sample test
for i, row in df.sample(5).iterrows():
    user_text = row['clean_text']
    predicted_emotion = row['predicted_emotion']

    reply = generate_empathetic_reply(user_text, predicted_emotion)

    print("User:", user_text)
    print("Predicted Emotion:", predicted_emotion)
    print("Bot Reply:", reply)
    print("Original Emotion:", label_to_emotion[row['label']])
    print("-"*50)


User: just had a steak pie supper
Predicted Emotion: happiness
Bot Reply: just had a steak pie supper b day next week, catch the sauce and watch the heaviness! may i suggest, that you have a meal that is made with beans, onions &amp; garlic, the day before class. I’m not a therapist; please seek professional help for serious issues.
Original Emotion: happiness
--------------------------------------------------
User: one day i'm drinking a bottle of nyquil, the other i'm sleeping zero. my lovely #horror fam, which should i watch? 🎩
Predicted Emotion: sadness
Bot Reply: one day i'm drinking a bottle of nyquil, the other i'm sleeping zero. my lovely #horror fam, which should i watch? 🎩 me, myself, and i \n #horror movie alone again tonight maybe a #zombie would eat me and finish my life game already - i want #gameover this is the scariest american horror story out of all of them... i'm gonna have to watch in the daytime. #frightened I’m not a therapist; please seek professional help for s

# 💛 EmpathyBot Streamlit App

This Streamlit app allows users to input a text prompt and receive a detailed **emotion analysis** along with an **empathetic response** based on a curated dataset. It is designed for both usability and visual appeal.

---

## **Features:**

- **Emotion prediction:** Detects whether the input text expresses *happiness* or *sadness* using a BERT-based sentiment model.
- **Model confidence visualization:** Displays a bar chart showing the confidence scores for all star ratings.
- **Similar texts from dataset:** Retrieves up to 3 examples from the dataset that match the predicted emotion.
- **Highlighted keywords:** Highlights words in the input text that appear in a small, fixed list of happy or sad keywords matching the predicted emotion.
- **Empathetic combined reply:** Provides a synthesized response based on the retrieved similar texts.

---

## **App Workflow:**

### 1. User Input
- The user types or pastes text into the **centered text area**.

### 2. Emotion Prediction
- The text is fed into a pretrained `nlptown/bert-base-multilingual-uncased-sentiment` model.
- Predictions are mapped into **binary emotions**:
  - `1–2 stars → sadness`
  - `3–5 stars → happiness`

### 3. Keyword Highlighting
- Key emotion words (like *happy*, *sad*, *love*, *pain*) are highlighted with **color-coded backgrounds**.
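
A minimal sketch of this step, assuming a small hand-picked keyword list like the one used in the app. Because the result is rendered as raw HTML, the sketch escapes each word with `html.escape` before wrapping it. The function name `highlight_happy` and the sample input are illustrative.

```python
import html
import re

HAPPY_WORDS = {"happy", "love", "good", "great", "fun", "joy", "excited"}

def highlight_happy(text, color="#d4edda"):
    """Wrap known happy keywords in a coloured <span>; escape every word first."""
    out = []
    for word in text.split():
        key = re.sub(r"[^\w\s]", "", word.lower())   # "happy!" -> "happy"
        safe = html.escape(word)                     # neutralise any markup in the input
        if key in HAPPY_WORDS:
            out.append(f"<span style='background-color:{color}'>{safe}</span>")
        else:
            out.append(safe)
    return " ".join(out)

print(highlight_happy("So happy! <b>today</b>"))
```

The keyword match ignores punctuation, so `happy!` is highlighted, while the `<b>` tags typed by the user come out as `&lt;b&gt;` instead of being interpreted as HTML.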

### 4. Vector Retrieval
- The app encodes the input text using **SentenceTransformer embeddings**.
- Uses **FAISS** to find similar texts from the dataset that share the same emotion.

### 5. Empathetic Reply Generation
- Combines the similar texts into a **single empathetic response**.
- Displayed as a **bullet-point item** below the input.

### 6. Output Display
Shows:
- Highlighted input text
- Predicted emotion
- Similar dataset texts
- Empathetic combined reply
- Confidence score chart

---

## **UI/UX Design**
- Everything is **centered** using custom CSS.
- Input box and buttons are visually aligned.
- Highlighted words use **green** for happy and **red** for sad emotions.
- Bullet points summarize key outputs.


In [115]:
%%writefile app.py
import streamlit as st
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import pandas as pd
import re

# --------------------------
# App Config
# --------------------------
st.set_page_config(
    page_title="EmpathyBot 💛",
    page_icon="💛",
    layout="wide"
)

# --------------------------
# Custom CSS for centering everything
# --------------------------
st.markdown("""
<style>
    .css-18e3th9 {
        display: flex;
        flex-direction: column;
        align-items: center;
    }

    .stTextInput>div>div>input, .stTextArea>div>div>textarea {
        text-align: center;
    }

    .stButton>button {
        margin-left: auto;
        margin-right: auto;
        display: block;
    }

    h1, h2, h3, h4, h5, h6, p, span, div, li {
        text-align: center !important;
    }
</style>
""", unsafe_allow_html=True)

st.title("💛 EmpathyBot")
st.markdown("""
**Enter your text and get:**
- Emotion prediction (Happiness / Sadness)
- Model confidence visualization
- Similar texts from dataset
- Highlighted key words contributing to emotion
- Empathetic combined reply
""")

# --------------------------
# Load Dataset
# --------------------------
@st.cache_data
def load_data():
    # Named load_data so it does not shadow datasets.load_dataset
    from datasets import load_dataset
    ds = load_dataset("cardiffnlp/tweet_eval", "emotion")
    df = pd.DataFrame(ds["train"])
    # Keep labels 1 and 3; .copy() avoids pandas' SettingWithCopyWarning
    df = df[df['label'].isin([1,3])].copy()
    df['clean_text'] = df['text'].str.lower()
    # Escaped dot: only literal "www." prefixes are treated as URLs
    df['clean_text'] = df['clean_text'].str.replace(r'http\S+|www\.\S+', '', regex=True)
    df['clean_text'] = df['clean_text'].str.replace(r'@\w+', '', regex=True)
    df['clean_text'] = df['clean_text'].str.replace(r'\s+', ' ', regex=True).str.strip()
    # In the app this column holds the dataset's own label, not a model prediction
    df['predicted_emotion'] = df['label'].map({1:"happiness", 3:"sadness"})
    return df

df = load_data()

# --------------------------
# Load Models
# --------------------------
@st.cache_resource
def load_models():
    model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    classifier = pipeline("text-classification", model=model, tokenizer=tokenizer, return_all_scores=True)

    embed_model = SentenceTransformer('all-MiniLM-L6-v2')
    return classifier, embed_model

emotion_model, embed_model = load_models()

# --------------------------
# Build FAISS Index
# --------------------------
@st.cache_resource
def build_faiss(_embed_model, df):
    embeddings = _embed_model.encode(df['clean_text'].tolist(), convert_to_tensor=True)
    dim = embeddings.shape[1]
    index = faiss.IndexFlatL2(dim)
    index.add(np.array(embeddings.cpu()))
    return index

index = build_faiss(embed_model, df)

# --------------------------
# Helper Functions
# --------------------------
def predict_emotion(text, model):
    results = model(text)[0]
    scores = {r['label']: r['score'] for r in results}
    top_star = max(results, key=lambda x: x['score'])['label']
    star_num = int(top_star.split()[0])
    predicted_emotion = "happiness" if star_num >= 3 else "sadness"
    return predicted_emotion, scores

def highlight_text(text, emotion):
    from html import escape  # local import keeps this helper self-contained
    happy_words = ["happy","love","good","great","fun","joy","excited"]
    sad_words = ["sad","bad","angry","upset","depressed","pain","worried"]
    words = text.split()
    highlighted = []
    for w in words:
        clean_w = re.sub(r'[^\w\s]', '', w.lower())
        # The result is rendered with unsafe_allow_html=True, so escape user text
        safe_w = escape(w)
        if emotion=="happiness" and clean_w in happy_words:
            highlighted.append(f"<span style='background-color:#d4edda'>{safe_w}</span>")
        elif emotion=="sadness" and clean_w in sad_words:
            highlighted.append(f"<span style='background-color:#f8d7da'>{safe_w}</span>")
        else:
            highlighted.append(safe_w)
    return " ".join(highlighted)

def retrieve_similar_texts(user_text, user_emotion, top_k=3):
    user_emb = embed_model.encode([user_text], convert_to_tensor=True).cpu().numpy()
    distances, indices = index.search(user_emb, top_k*5)
    retrieved = []
    for idx in indices[0]:
        if df.iloc[idx]['predicted_emotion'] == user_emotion:
            retrieved.append(df.iloc[idx]['clean_text'])
        if len(retrieved) == top_k:
            break
    return retrieved

def generate_empathetic_reply(similar_texts):
    if similar_texts:
        combined = " ".join(similar_texts)
        return f"💡 Empathetic reply based on dataset:\n\n{combined}"
    else:
        return "💡 Sorry, no similar texts found in the dataset."

# --------------------------
# User Input
# --------------------------
user_input = st.text_area("Enter your text:")

if st.button("Analyze") and user_input.strip():
    # Predict emotion
    predicted_emotion, scores = predict_emotion(user_input, emotion_model)

    # Highlight important words
    highlighted_text = highlight_text(user_input, predicted_emotion)
    st.markdown(f"""
    <div style='text-align:center; padding:20px; border-radius:10px; background-color:#f0f8ff; font-size:18px'>
        {highlighted_text}
    </div>
    """, unsafe_allow_html=True)

    # Predicted Emotion Title
    st.markdown(f"<h2 style='color:#FF5733; text-align:center'>Predicted Emotion: {predicted_emotion.upper()}</h2>", unsafe_allow_html=True)

    # Retrieve similar texts
    similar_texts = retrieve_similar_texts(user_input, predicted_emotion)

    # Empathetic reply
    empathetic_reply = generate_empathetic_reply(similar_texts)

    # Bullets (only once)
    bullet_html = "<ul style='list-style-type:disc; text-align:left; display:inline-block;'>"
    bullet_html += f"<li>💡 Predicted Emotion: <b>{predicted_emotion.upper()}</b></li>"
    bullet_html += "<li>📊 Model confidence visualization below</li>"

    if similar_texts:
        bullet_html += "<li>🔹 Similar texts from dataset:</li>"
        for txt in similar_texts:
            bullet_html += f"<li style='margin-left:20px'>{txt}</li>"

    # empathetic_reply already carries its own "💡 ..." label, so do not repeat it
    bullet_html += f"<li>{empathetic_reply}</li>"
    bullet_html += "</ul>"

    st.markdown(f"<div style='text-align:center'>{bullet_html}</div>", unsafe_allow_html=True)

    # Model confidence bar chart
    st.markdown("<h3 style='text-align:center'>Model Confidence Scores</h3>", unsafe_allow_html=True)
    score_df = pd.DataFrame(list(scores.items()), columns=["Label","Score"]).sort_values(by="Score", ascending=False)
    st.bar_chart(score_df.set_index("Label"))



Overwriting app.py


## 🌐 Exposing the Streamlit App with Ngrok

This cell allows us to run the **EmpathyBot Streamlit app** in Google Colab (or any remote environment) and expose it via a public URL using **ngrok**.

**Steps:**

1. **Set your Ngrok Auth Token**  
   Replace the placeholder with your own [ngrok auth token](https://dashboard.ngrok.com/get-started/your-authtoken):
   ```python
   NGROK_AUTH_TOKEN = "YOUR_NGROK_AUTH_TOKEN"
   ```
2. **Configure ngrok**
3. **Run Streamlit in the background**
4. **Get the public URL**

In [100]:
from pyngrok import ngrok

# Replace with your own token; keep real tokens out of shared notebooks
NGROK_AUTH_TOKEN = "YOUR_NGROK_AUTH_TOKEN"

!ngrok config add-authtoken $NGROK_AUTH_TOKEN


Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml


In [102]:
from pyngrok import ngrok
!streamlit run app.py &>/dev/null &
url = ngrok.connect(8501)
print('Chatbot running at:', url)

Chatbot running at: NgrokTunnel: "https://3731c030a273.ngrok-free.app" -> "http://localhost:8501"
