# Sentiment Analysis for Customer Feedback
In this project, we develop a sentiment analysis solution to gauge customer satisfaction.
## 1. Data Exploration & Preprocessing
### 1.1. Examine the Data

In [6]:
import pandas as pd
import numpy as np

import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [1]:
# Load the dataset
df = pd.read_csv('bank_reviews3.csv')

# Display the first few rows
display(df.head())

# Show the structure and info
print('--- DataFrame Info ---')
df.info()

# Show summary statistics for all columns (numeric and non-numeric)
print('--- Summary Statistics ---')
display(df.describe(include='all'))

| | author | date | address | bank | rating | review_title_by_user | review | bank_image | rating_title_by_user | useful_count |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AMRENDRA T | Mar 21, 2020 | New delhi | SBI | 4.0 | "Best saving" | State Bank Of India is located nearby in our a... | https://static.bankbazaar.com/images/common/ba... | Great! | 133 |
| 1 | BISHWA | Mar 20, 2020 | Kolkata | SBI | 5.0 | "Good service" | I have my salary account in SBI, when I applie... | https://static.bankbazaar.com/images/common/ba... | Blown Away! | 89 |
| 2 | SANTOSH | Mar 20, 2020 | Hooghly | Axis Bank | 5.0 | "Excellent Service" | I am using Axis bank saving account for the p... | https://static.bankbazaar.com/images/common/ba... | Blown Away! | 48 |
| 3 | MAHADEV | Mar 20, 2020 | Pune | HDFC Bank | 5.0 | "Excellent service" | I have my salary bank account in HDFC bank for... | https://static.bankbazaar.com/images/common/ba... | Blown Away! | 52 |
| 4 | R | Mar 20, 2020 | Bangalore | review | 5.0 | "Good account" | Close to around 10 years, I am holding this Co... | https://static.bankbazaar.com/images/common/ba... | Blown Away! | 22 |


--- DataFrame Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   author                996 non-null    object 
 1   date                  1000 non-null   object 
 2   address               1000 non-null   object 
 3   bank                  1000 non-null   object 
 4   rating                1000 non-null   float64
 5   review_title_by_user  1000 non-null   object 
 6   review                1000 non-null   object 
 7   bank_image            1000 non-null   object 
 8   rating_title_by_user  1000 non-null   object 
 9   useful_count          1000 non-null   int64  
dtypes: float64(1), int64(1), object(8)
memory usage: 78.3+ KB
--- Summary Statistics ---


| | author | date | address | bank | rating | review_title_by_user | review | bank_image | rating_title_by_user | useful_count |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 996 | 1000 | 1000 | 1000 | 1000.0 | 1000 | 1000 | 1000 | 1000 | 1000.0 |
| unique | 620 | 110 | 107 | 10 | | 352 | 999 | 10 | 10 | |
| top | ANONYMOUS | Jan 20, 2020 | Bangalore | review | | "Good Account" | In SBI customer care, they are not responding ... | https://static.bankbazaar.com/images/common/ba... | Blown Away! | |
| freq | 117 | 26 | 245 | 285 | | 105 | 2 | 285 | 550 | |
| mean | | | | | 4.3515 | | | | | 2.752 |
| std | | | | | 0.940788 | | | | | 7.638904 |
| min | | | | | 0.5 | | | | | 0.0 |
| 25% | | | | | 4.0 | | | | | 0.0 |
| 50% | | | | | 5.0 | | | | | 0.0 |
| 75% | | | | | 5.0 | | | | | 2.0 |


#### Dataset Overview
- Rows: 1000
- Columns: 10
- Key Text Fields: `review_title_by_user`, `review`
- Metadata Fields: `author`, `date`, `address`, `bank`, `bank_image`, `rating`, `rating_title_by_user`, `useful_count`

In [2]:
# Check for missing values
print('--- Missing Values ---')
display(df.isnull().sum())

# Check for anomalies: e.g., unique values in rating, useful_count, etc.
print('--- Unique Values in Key Columns ---')
for col in ['bank', 'address', 'rating', 'useful_count', 'rating_title_by_user']:
    print(f'{col}:', df[col].unique()[:10], '...')

--- Missing Values ---


author                  4
date                    0
address                 0
bank                    0
rating                  0
review_title_by_user    0
review                  0
bank_image              0
rating_title_by_user    0
useful_count            0
dtype: int64

--- Unique Values in Key Columns ---
bank: ['SBI' 'Axis Bank' 'HDFC Bank' 'review' 'IDBI' 'Kotak' 'IndusInd Bank'
 'Canara Bank' 'Citibank' 'Punjab National Bank'] ...
address: ['New delhi' 'Kolkata' 'Hooghly' 'Pune' 'Bangalore' 'Hyderabad' 'Chennai'
 'Darbhanga' 'Jaipur' 'Nasik'] ...
rating: [4.  5.  3.  4.5 0.5 2.  3.5 1.5 1.  2.5] ...
useful_count: [133  89  48  52  22  37  44  35  29  28] ...
rating_title_by_user: ['Great!' 'Blown Away!' 'Satisfactory' 'Excellent!' 'Unacceptable'
 'Expected more' 'Pretty good' 'Bad' 'Really Bad' 'Just OK'] ...


##### Data Types & Structure
Most columns have dtype `object` (strings); the exceptions are:
- `rating`: float (customer's numeric rating, e.g., 0.5 to 5.0)
- `useful_count`: integer (number of users who found the review useful)
##### Categorical & Numeric Insights
- `bank`: 10 unique values: 9 named banks plus a generic 'review' label.
- `rating`: Ranges from 0.5 to 5.0 (steps of 0.5), with most ratings at 4.0 or 5.0.
- `rating_title_by_user`: 10 unique sentiment labels (e.g., "Blown Away!", "Great!", "Satisfactory", "Unacceptable", "Bad").
- `useful_count`: Ranges from 0 to 133, with a median of 0 (most reviews are not marked as useful by others).
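
The observation that most ratings sit at 4.0 or 5.0 can be checked directly from the column. The following is a minimal sketch, assuming a small hypothetical `ratings` series that stands in for `df['rating']`; the values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for df['rating']; illustrative values only.
ratings = pd.Series(
    [4.0, 5.0, 5.0, 5.0, 5.0, 4.5, 0.5, 3.0, 5.0, 4.0], name='rating'
)

# Relative frequency of each rating value, ordered by rating.
share = ratings.value_counts(normalize=True).sort_index()
print(share)

# Fraction of reviews rated exactly 4.0 or 5.0.
high_share = ratings.isin([4.0, 5.0]).mean()
print(f'Share of 4.0 or 5.0 ratings: {high_share:.1%}')
```

On the real data, replacing `ratings` with `df['rating']` would give the actual distribution.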

##### Missing Values
- The `author` column has 4 missing values (996 of 1000 entries are non-null).
- All other columns are complete (no missing values).

In [3]:
# Fill missing author values with 'ANONYMOUS'
df['author'] = df['author'].fillna('ANONYMOUS')

Missing authors are filled with "ANONYMOUS" to stay consistent with the existing anonymous entries.

##### Anomalies & Observations
- The `bank` column contains the generic value "review" for 285 of the 1000 entries (its most frequent value), which may need to be filtered or handled separately.
- The `review` field is almost always unique (999 distinct texts across 1000 rows), indicating that nearly every row is a distinct piece of customer feedback.
- The most common author is "ANONYMOUS" (117 times), suggesting many users prefer not to disclose their names.

We flag rows where `bank == "review"` so they can be filtered or analysed separately later.

In [4]:
# Add a flag for generic reviews
df['is_generic_review'] = df['bank'].str.lower() == 'review'
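
One way the flag might be used downstream is to count the generic rows and exclude them from per-bank summaries. This is a sketch under stated assumptions: `demo` is a hypothetical mini-frame with illustrative values, and the per-bank mean rating is only an example aggregation, not a step taken in this notebook.

```python
import pandas as pd

# Hypothetical mini-frame mirroring the `bank` and `rating` columns.
demo = pd.DataFrame({
    'bank': ['SBI', 'review', 'HDFC Bank', 'Review', 'Axis Bank'],
    'rating': [4.0, 5.0, 5.0, 3.5, 5.0],
})

# Same flag as above; lowercasing makes the match case-insensitive.
demo['is_generic_review'] = demo['bank'].str.lower() == 'review'

# Number of rows without a real bank name.
n_generic = int(demo['is_generic_review'].sum())
print('Generic rows:', n_generic)

# Keep only rows with a named bank before aggregating per bank.
named_banks = demo.loc[~demo['is_generic_review']]
per_bank_mean = named_banks.groupby('bank')['rating'].mean()
print(per_bank_mean)
```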

### 1.2. Text Cleaning
To prepare the customer feedback text for sentiment analysis, we will:
- Convert all text to lowercase (normalization)
- Remove punctuation and irrelevant symbols
- Tokenize the text (split into individual words)
- Drop non-alphabetic tokens (e.g., numbers)
- Remove stop words (common words that add little meaning, e.g., 'the', 'is', 'and')
- Lemmatize each token (reduce each word to its base or dictionary form, the lemma)

We will apply these steps to the `review` column and display sample outputs to illustrate the transformations.

In [8]:
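# Download NLTK resources if not already present
# Note: recent NLTK releases load the tokenizer from 'punkt_tab';
# download that resource as well if word_tokenize raises a LookupError.
nltk.download('punkt')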
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Lowercase
    text = text.lower()
    # Remove punctuation (but keep spaces)
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize
    tokens = word_tokenize(text)
    # Remove non-alphabetic tokens
    tokens = [word for word in tokens if word.isalpha()]
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatize each token
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return tokens
    
# Show before/after for a few samples
sample_reviews = df['review'].sample(3, random_state=42)
for i, review in enumerate(sample_reviews):
    print(f'Original Review {i+1}:', review)
    print(f'Cleaned Tokens {i+1}:', clean_text(review))
    print('-'*60)

# Apply cleaning to the whole column and store as new column
df['review_clean_tokens'] = df['review'].apply(clean_text)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sghas\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sghas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sghas\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Original Review 1: I have been holding salary account with HDFC bank for the past 3 months.  The customer service and response are  good.  If I call the customer care also, they responding properly. This is a zero balance account, so no need to maintain minimum balance.
Cleaned Tokens 1: ['holding', 'salary', 'account', 'hdfc', 'bank', 'past', 'month', 'customer', 'service', 'response', 'good', 'call', 'customer', 'care', 'also', 'responding', 'properly', 'zero', 'balance', 'account', 'need', 'maintain', 'minimum', 'balance']
------------------------------------------------------------
Original Review 2: I am holding salary account with HDFC Bank for more than 10 years. It is a zero balance account with no hidden charges. I use to get alert message on time whenever I do a transaction. ATM is near but branches are far of 10-15 km away from my place. ATM charges are applicable, if I do a transaction more than 5-6 times in a month. I use to get offers messages from bank.
Cleaned Tokens 2:

- All reviews were lowercased, and punctuation was removed.
- Text was tokenized into individual words, and non-alphabetic tokens were dropped.
- Stop words were filtered out.
- Tokens were lemmatized.
- The cleaned tokens are stored in a new column: `review_clean_tokens`.
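
One side effect of the punctuation step is worth noting: `clean_text` deletes punctuation characters outright, so hyphenated or apostrophized text is fused into a single token before tokenization. The sketch below illustrates this with the standard library only. The `sample` string is made up for illustration, and the space-substitution variant is an assumption offered for comparison, not the method used in this notebook.

```python
import string

# Strategy used in clean_text: delete punctuation characters outright.
delete_punct = str.maketrans('', '', string.punctuation)

# Alternative (an assumption, not the notebook's method): map each
# punctuation character to a space so neighbouring pieces stay separate.
space_punct = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

# Made-up sample text, already lowercased.
sample = "branches are 10-15 km away; i don't mind"

# str.split() is used here instead of word_tokenize to avoid NLTK downloads.
deleted = sample.translate(delete_punct).split()
spaced = sample.translate(space_punct).split()

print(deleted)  # "10-15" and "don't" are fused into single tokens
print(spaced)   # the same pieces are kept as separate tokens
```

With deletion, a contraction such as "don't" becomes "dont", which may no longer match a stop-word list that stores the apostrophized form; whether that matters depends on the downstream model.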