# **Project Description:Customer Feedback Sentiment Predictor**
 
# **Data Description**:

- A sentiment analysis job about the customer feedback
- Feedback talks about different IT Services, Infrastructure etc.

# **Dataset**:

- Contains two columns "review" & "label"
    - review : Customer Feedback about the Product and the Service
    - label : '1' for Negative and '0' for Positive

# **Objective**:

- To classify the sentiment of customer reviews into the positive or negative, with negative sentiments being in focus

# **Steps Applied**:

- Basic Text pre-processing.

- Build the classification model.
    - HF BERT Based Models
    - Fine Tuning based on Feedback Data

- Tune & Evaluate the Model performance.

In [96]:
import re, string, unicodedata                          # Import Regex, string and unicodedata.
import contractions                                     # Import contractions library.
from bs4 import BeautifulSoup                           # Import BeautifulSoup.

import numpy as np                                      # Import numpy.
import pandas as pd                                     # Import pandas.
import nltk                                             # Import Natural Language Tool-Kit.

nltk.download('stopwords')                              # Download Stopwords.
nltk.download('punkt')
nltk.download('wordnet')

# NLP Related Dependencies
from nltk.corpus import stopwords                       # Import stopwords.
from nltk.tokenize import word_tokenize, sent_tokenize  # Import Tokenizer.
from nltk.stem.wordnet import WordNetLemmatizer         # Import Lemmatizer.
import matplotlib.pyplot as plt                         
import seaborn as sns
from collections import Counter


import contractions
from bs4 import BeautifulSoup
import numpy as np
import re
import tqdm
import unicodedata

# Evaluation Libraries
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.metrics import classification_report


# TRANSFER LEARNING BASED LIBRARIES
from transformers import pipeline
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification


# Logging all the Actions.
import logging                                           
logging.basicConfig(filename='sentiment_analyzer_transfer_learning.log', \
                    encoding='UTF-8', level=logging.INFO)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/anchitsaxena/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/anchitsaxena/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/anchitsaxena/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [52]:
def read_analyze_data(datafile):
    """
    Read the Data File and Log the Dimension
    Count and Drop Null Values
    
    datafile : Source Data File
    data : Returns non-null df
    """
    data = pd.read_csv(datafile)
    logging.info(str(data.shape))
    logging.info(str(data.isnull().sum(axis=0)))
    data.dropna(inplace=True)
    return data

In [53]:
data = read_analyze_data('review_data.csv')

In [54]:
data_org = pd.read_csv('review_data.csv')

# Data Pre-processing:
- Remove html tags.
- Replace contractions in string. (e.g. replace I'm --> I am) and so on.\
- Remove numbers.
- Tokenization
- To remove Stopwords.
- Lemmatized data
- We have used NLTK library to tokenize words , remove stopwords and lemmatize the remaining words.

In [55]:
def strip_html(text):
    """
    Remove HTML Tags, if any
    """
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

In [56]:
def replace_contractions(text):
    """Replace contractions in string of text"""
    return contractions.fix(text)

In [57]:
def remove_numbers(text):
    """Remove Numbers from the string of text"""
    text = re.sub(r'\d+', '', text)
    return text

In [58]:
def tokenize_data(data):
    data['review'] = data.apply(lambda row: nltk.word_tokenize(row['review']), axis=1)
    return data

In [59]:
data['review'] = data['review'].apply(lambda x: strip_html(str(x)))
data['review'] = data['review'].apply(lambda x: replace_contractions(x))
data['review'] = data['review'].apply(lambda x: remove_numbers(x))
data = tokenize_data(data)

In [60]:
def filter_stopwords():
    stopwords_ls = stopwords.words('english')

    customlist = ['not', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn',
        "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',
        "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn',
        "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

    # Set custom stop-word's list as not, couldn't etc. words matter in Sentiment, 
    # so not removing them from original data.

    stopwords_mod = list(set(stopwords_ls) - set(customlist))
    return stopwords_mod

In [61]:
stopwords = filter_stopwords()

In [62]:
lemmatizer = WordNetLemmatizer()

def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words

def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words

def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    new_words = []
    for word in words:
        if word not in stopwords:
            new_words.append(word)
    return new_words

def lemmatize_list(words):
    new_words = []
    for word in words:
        new_words.append(lemmatizer.lemmatize(word, pos='v'))
    return new_words

In [63]:
def normalize(words):
    words = remove_non_ascii(words)
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = remove_stopwords(words)
    words = lemmatize_list(words)
    return ' '.join(words)

In [64]:
data['review'] = data.apply(lambda row: normalize(row['review']), axis=1)

### Model evaluation criterion

### Model can make wrong predictions as:

1. Predicting a Review/Feedback being Negative, but in reality it is Positive. 
2. Predicting a Review/Feednack being Positive, but in actuality it is Negative

### Which case is more important? 
* Both the cases are important as:

* If a positive Feedback is termed as Negative, it's a loss of resource. False Positive

* If a Negative Feedback is termed as Positive, it's a loss of opportunity which in turn will result high Churn Rate. False Negative 



### How to reduce the losses?

* Product would want `F1 Score` to be maximized, greater the F1  score higher are the chances of minimizing False Negatives and False Positives. 

In [65]:
def load_pipeline(pipeline_name, model_name):
    """
    Load the "Sentiment-Analysis" Pipeline
    and the corresponding model
    
    pipeline_name : Pipeline Utility to be used
    model_name : Specific Model Name
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    sentiment_pipeline = pipeline(pipeline_name, 
                     model = model_name,
                     tokenizer = tokenizer)
    return sentiment_pipeline, tokenizer
    

In [66]:
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

In [67]:
sentiment_pipeline, tokenizer = load_pipeline('sentiment-analysis', model_name)

In [68]:
def split_data(data, test_ratio):
    """
    Split the Data into Independent and Target variable
    """
    X = data.review
    y = data.label
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_ratio, random_state=42, stratify=y)
    return X_train, X_test, y_train, y_test

In [69]:
X_train, X_test, y_train, y_test = split_data(data, 0.3)

### FEW DATA CHALLENGES
- Few of the Feedbacks are too large. Max Tensor size for BERT is 512
- Encountered few size related exceptions and that's why applying Summarization

In [70]:
summary_model = "facebook/bart-large-cnn"

In [71]:
summary_pipeline, tokenizer = load_pipeline('summarization', summary_model)

In [72]:
def get_summary_for_longer_texts(X_train, max_size, min_size, summary_pipeline, tokenizer):
    """
    Extract only the problematic Reviews with >= 3000 length and apply Summarizer
    
    X_train : Train set containing reviews
    max_size : Maximum words to be in Summary
    min_size : Minimum words to be in Summary
    summary_pipeline : Pipeline Summary Instance
    """
    train_samples = X_train.tolist()
    index_ls = [i for i, x in enumerate(train_samples) if len(x) >= 3000]
    counter = 0
    for index in index_ls:
        sample = train_samples[index]
        if len(sample) >= 3000:
            """
            SINCE SEQ2SEQ BERT HAVE ONLY 512 MAX LENGTH
            """
            sample = sample[:512]
            summary = summary_pipeline(sample, \
                max_length=max_size, min_length=min_size)

            train_samples[index] = summary[0]['summary_text']
    return train_samples

In [73]:
X_train_samples = get_summary_for_longer_texts(X_train, 100, 50, summary_pipeline, tokenizer)

Your max_length is set to 100, but you input_length is only 84. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=42)
Your max_length is set to 100, but you input_length is only 97. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=48)
Your max_length is set to 100, but you input_length is only 90. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=45)
Your max_length is set to 100, but you input_length is only 85. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=42)
Your max_length is set to 100, but you input_length is only 88. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=44)
Your max_length is set to 100, but you input_length is only 74. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=37)
Your max_length is set to 100, but you input_length is only 87. You might consider

In [82]:
X_test_samples = get_summary_for_longer_texts(X_test, 100, 50, summary_pipeline, tokenizer)

Your max_length is set to 100, but you input_length is only 91. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=45)
Your max_length is set to 100, but you input_length is only 85. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=42)
Your max_length is set to 100, but you input_length is only 84. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=42)
Your max_length is set to 100, but you input_length is only 81. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=40)
Your max_length is set to 100, but you input_length is only 94. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=47)
Your max_length is set to 100, but you input_length is only 97. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=48)
Your max_length is set to 100, but you input_length is only 97. You might consider

In [74]:
def perform_sentiment_analysis(samples, sentiment_pipeline, label_map):
    """
    Run Sentiment Analyzer on DistillBERT
    """
    pred_ls = []
    for sample in samples:
        """
        SINCE SEQ2SEQ BERT HAVE ONLY 512 MAX LENGTH
        """
        sample = sample[:512]
        pred_ls.append(label_map[sentiment_pipeline(sample)[0]['label']])
    return pred_ls

In [78]:
def get_metrics(y_train, y_pred):
    """
    Print F1 Score and the Classification Report
    """
    print('F1-Score %s' % f1_score(y_pred, y_train))
    logging.info('F1-Score %s' % f1_score(y_pred, y_train))
    print(classification_report(y_train, y_pred,target_names=['Positive', 'Negative']))
    logging.info(classification_report(y_train, y_pred,target_names=['Positive', 'Negative']))

### Making Predictions on Summarized Feedbacks

In [75]:
pred_ls_summ = perform_sentiment_analysis(X_train_samples, sentiment_pipeline, {'POSITIVE':0, 'NEGATIVE':1})

### Training Score 

In [79]:
get_metrics(y_train, pred_ls_summ)

F1-Score 0.06435944140862174
              precision    recall  f1-score   support

    Positive       0.99      0.41      0.58      2615
    Negative       0.03      0.87      0.06        61

    accuracy                           0.42      2676
   macro avg       0.51      0.64      0.32      2676
weighted avg       0.97      0.42      0.57      2676



### Test Score

In [83]:
pred_ls_test_summ = perform_sentiment_analysis(X_test_samples, sentiment_pipeline, {'POSITIVE':0, 'NEGATIVE':1})

In [84]:
get_metrics(y_test, pred_ls_test_summ)

F1-Score 0.06423357664233577
              precision    recall  f1-score   support

    Positive       0.99      0.43      0.60      1122
    Negative       0.03      0.85      0.06        26

    accuracy                           0.44      1148
   macro avg       0.51      0.64      0.33      1148
weighted avg       0.97      0.44      0.59      1148



### Making Predictions on Text Preprocessed Data (including Lemmatization) Feedbacks

In [85]:
pred_ls_pre = perform_sentiment_analysis(X_train.tolist(), sentiment_pipeline, {'POSITIVE':0, 'NEGATIVE':1})

In [86]:
get_metrics(y_train, pred_ls_pre)

F1-Score 0.06432038834951456
              precision    recall  f1-score   support

    Positive       0.99      0.41      0.58      2615
    Negative       0.03      0.87      0.06        61

    accuracy                           0.42      2676
   macro avg       0.51      0.64      0.32      2676
weighted avg       0.97      0.42      0.57      2676



In [87]:
pred_ls_test_pre = perform_sentiment_analysis(X_test.tolist(), sentiment_pipeline, {'POSITIVE':0, 'NEGATIVE':1})

In [88]:
get_metrics(y_test, pred_ls_test_pre)

F1-Score 0.0641399416909621
              precision    recall  f1-score   support

    Positive       0.99      0.43      0.60      1122
    Negative       0.03      0.85      0.06        26

    accuracy                           0.44      1148
   macro avg       0.51      0.64      0.33      1148
weighted avg       0.97      0.44      0.59      1148



### WITHOUT TEXT PROCESSING

In [89]:
data_org = data_org.dropna()

In [90]:
X_train_w_norm, X_test_w_norm, y_train_w_norm, y_test_w_norm = split_data(data_org, 0.3)

In [91]:
pred_ls_w_norm = perform_sentiment_analysis(X_train_w_norm.tolist(), \
                                sentiment_pipeline, {'POSITIVE':0, 'NEGATIVE':1})

In [92]:
get_metrics(y_train, pred_ls_w_norm)

F1-Score 0.10478359908883828
              precision    recall  f1-score   support

    Positive       0.99      0.71      0.82      2615
    Negative       0.06      0.75      0.10        61

    accuracy                           0.71      2676
   macro avg       0.52      0.73      0.46      2676
weighted avg       0.97      0.71      0.81      2676



In [94]:
pred_ls_test_w_norm = perform_sentiment_analysis(X_test_w_norm.tolist(), \
                                sentiment_pipeline, {'POSITIVE':0, 'NEGATIVE':1})

In [95]:
get_metrics(y_test, pred_ls_test_w_norm)

F1-Score 0.10382513661202186
              precision    recall  f1-score   support

    Positive       0.99      0.71      0.83      1122
    Negative       0.06      0.73      0.10        26

    accuracy                           0.71      1148
   macro avg       0.52      0.72      0.47      1148
weighted avg       0.97      0.71      0.81      1148



### Text Preprocessing Changes
- Strip HTML Tags
- Remove Accented Characters

In [97]:
def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    [s.extract() for s in soup(['iframe', 'script'])]
    stripped_text = soup.get_text()
    stripped_text = re.sub(r'[\r|\n|\r\n]+', '\n', stripped_text)
    return stripped_text

def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

def pre_process_corpus(docs):
    norm_docs = []
    for doc in tqdm.tqdm(docs):
        doc = strip_html_tags(doc)
        doc = str(doc)
        doc = doc.translate(doc.maketrans("\n\t\r", "   "))
        doc = doc.lower()
        doc = remove_accented_chars(doc)
        doc = contractions.fix(doc)
        # lower case and remove special characters\whitespaces
        doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, re.I|re.A)
        doc = re.sub(' +', ' ', doc)
        doc = doc.strip()  
        norm_docs.append(doc)
    return norm_docs

In [98]:
norm_train_reviews = pre_process_corpus(X_train_w_norm)
norm_test_reviews = pre_process_corpus(X_test_w_norm)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


100%|████████████████████████████████████| 2676/2676 [00:00<00:00, 9359.65it/s]
100%|███████████████████████████████████| 1148/1148 [00:00<00:00, 10054.57it/s]


In [100]:
pred_ls_norm = perform_sentiment_analysis(norm_train_reviews, sentiment_pipeline, {'POSITIVE':0, 'NEGATIVE':1})

In [101]:
get_metrics(y_train, pred_ls_norm)

F1-Score 0.08742194469223906
              precision    recall  f1-score   support

    Positive       0.99      0.61      0.76      2615
    Negative       0.05      0.80      0.09        61

    accuracy                           0.62      2676
   macro avg       0.52      0.71      0.42      2676
weighted avg       0.97      0.62      0.74      2676



In [102]:
pred_ls_test_norm = perform_sentiment_analysis(norm_test_reviews, sentiment_pipeline, {'POSITIVE':0, 'NEGATIVE':1})

In [103]:
get_metrics(y_test, pred_ls_test_norm)

F1-Score 0.08869179600886919
              precision    recall  f1-score   support

    Positive       0.99      0.64      0.78      1122
    Negative       0.05      0.77      0.09        26

    accuracy                           0.64      1148
   macro avg       0.52      0.70      0.43      1148
weighted avg       0.97      0.64      0.76      1148



### LOADING FINE-TUNE MODEL

### Hosted Model at HuggingFace Github repository
- https://huggingface.co/anchit48/fine-tuned-sentiment-analysis-customer-feedback/tree/main


In [141]:
model_fine_tuned = "anchit48/fine-tuned-sentiment-analysis-customer-feedback"

In [142]:
sentiment_pipeline_fine = pipeline("sentiment-analysis", model=model_fine_tuned)

Downloading:   0%|          | 0.00/812 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/255M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/405 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

In [143]:
y_pred_fine_tune = perform_sentiment_analysis(X_train_w_norm.tolist(), sentiment_pipeline_fine, {'POSITIVE':0, 'NEGATIVE':1})

In [144]:
get_metrics(y_train_w_norm, y_pred_fine_tune)

F1-Score 0.5252525252525253
              precision    recall  f1-score   support

    Positive       0.99      1.00      0.99      2615
    Negative       0.68      0.43      0.53        61

    accuracy                           0.98      2676
   macro avg       0.84      0.71      0.76      2676
weighted avg       0.98      0.98      0.98      2676



In [145]:
y_pred_test_fine_tune = perform_sentiment_analysis(X_test_w_norm.tolist(), \
                            sentiment_pipeline_fine, {'POSITIVE':0, 'NEGATIVE':1})

In [146]:
get_metrics(y_test_w_norm, y_pred_test_fine_tune)

F1-Score 0.5217391304347826
              precision    recall  f1-score   support

    Positive       0.99      0.99      0.99      1122
    Negative       0.60      0.46      0.52        26

    accuracy                           0.98      1148
   macro avg       0.79      0.73      0.76      1148
weighted avg       0.98      0.98      0.98      1148



### Analyzing Negative Reviews

In [132]:
def extract_neg_reviews(y_test, data_org):
    """
    Spot Checks on few Negative Samples
    """
    index_ls = [i for i, num in enumerate(y_test) if num == 1]
    neg_reviews = [(idx, data_org['review'].tolist()[idx]) for idx in index_ls]
    return neg_reviews

In [133]:
neg_reviews = extract_neg_reviews(y_test, data_org)

In [134]:
neg_reviews

[(14,
  "Overview\n\nThis is a great array for someone looking to expand their home storage network or possibly someone with a SOHO setup that needs plenty of storage. I used this array to combine many different drives I had laying around from past system builds as well as to give me a central place to go for over 15TB of storage.\n\nUnpacking\n\nThe array arrived well packed with two layers of boxes and internal Styrofoam.\n\nInstallation\n\nYou will definitely want to buy the optional rackmount mounting rails. The railkit costs approximately $32 more but was worth not having to sit the array on top of another rackmount device. There is no way the front brackets will hold the weight of the array empty let alone full of hard drives.\n\nHardware\n\nAll hardware was included. eSATA cables, SATA cables, power cables, drive trays, and drive tray mounting screws were included. The chassis is reasonably high quality metal, and I believe the power supply is a standard ATX power supply which i

### OBSERVATIONS
- Clearly few of the Samples looks Positives like "Thanks", "Nice Program", "Good seller and ok item.", "Downloaded right after purchase, and installed with no problem."
- There needs to be a thorough look-up to the Data