This project explores how to implement NLP applications using Hugging Face models. It follows the DataCamp tutorial at https://www.datacamp.com/tutorial/an-introduction-to-using-transformers-and-hugging-face.
It uses the runway data (Rent the Runway reviews).

Objectives:
1. Learn about NLP and Transformers
2. Implement models with Hugging Face
3. Experiment and take notes for personal understanding

In [2]:
# import libraries
import pandas as pd
import numpy as np
import datetime
import re
import string
import matplotlib.pyplot as plt
import seaborn


from transformers import pipeline
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search

In [8]:
# 1. import data (downloaded from https://github.com/barkaat-ali/NLP-Marvels-Mastering-Sentiment-Text-Embedding-Semantic-Search-with-Hugging-Face/blob/main/runway.csv)
## Read data
runway = pd.read_csv("runway.csv", parse_dates=['review_date'])

## Print column information
print(runway.head(10))
print(runway.shape)
print(runway.info())

   user_id  item_id  rating     rented for  \
0   476109   139086       8  formal affair   
1   203660  1126889       6          party   
2   868581   652189       8        wedding   
3   935076  1879504       8        wedding   
4   995023  1179146      10          party   
5   307791   226072      10          party   
6    49264   145906       8  formal affair   
7   859603  1498329       8        wedding   
8   124317  1268360      10          other   
9    22414   859889       8        wedding   

                                         review_text category height  size  \
0  it hit the floor perfectly with a pair of heel...     gown  5' 3"    15   
1  the dress is absolutely gorgeous unfortunately...    dress  5' 4"    12   
2  even though it was lined with satin this was a...    dress  5' 5"    24   
3  this dress was greatit fit really well and was...   sheath  5' 3"    14   
4  super flattering i am usually a sizemi have a ...    dress  5' 2"    14   
5  thefit better in the c

In [15]:
# Other checks
runway.isnull().sum()  # Total missing values per column (not displayed: only a cell's last expression is echoed, so wrap it in print() to see it)
# print(runway['item_id'].unique())
print(runway['rating'].unique())

[ 8  6 10  4  2]


In [18]:
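# review_text is unstructured, so it may need processing and cleaning.
# Open question: is cleaning necessary before passing the text to a transformer pipeline?
# An earlier attempt raised KeyError: 'review _text' because of the stray space in the column name.
# Corrected version (still disabled); str.translate needs a table built with str.maketrans:
# runway['review_text_cleaned'] = runway['review_text'].str.replace("/", "", regex=False).str.translate(str.maketrans("", "", string.punctuation))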
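
On the cleaning question in the cell above, here is a minimal sketch of what the disabled line appears to aim for: removing punctuation and tidying whitespace. `clean_review` is a hypothetical helper name, not something from the tutorial, and the lower-casing step is an added assumption. Whether this helps is not established here. Pretrained tokenizers are designed to handle raw punctuation, so cleaning may be unnecessary for the pipeline.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_review(text):
    """Hypothetical helper: lower-case, drop punctuation, collapse whitespace."""
    no_punct = text.lower().translate(_PUNCT_TABLE)
    # Collapse runs of spaces, tabs and newlines into a single space
    return re.sub(r"\s+", " ", no_punct).strip()


print(clean_review("The dress is GORGEOUS!  Fit/size: perfect..."))
```

Note that deleting punctuation outright fuses neighbouring words ("Fit/size" becomes "fitsize"), so replacing punctuation with a space may be the safer choice. With pandas, the helper could be applied as `runway['review_text'].apply(clean_review)`.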

In [19]:
# 3. Initialize the Hugging Face Pipeline
sentiment_pipeline = pipeline("sentiment-analysis")
# defaults to distilbert/distilbert-base-uncased-finetuned-sst-2-english

# 4. Define a Function to Analyze Sentiment
def analyze_sentiment(text):
    try:
        result = sentiment_pipeline(text)[0]
        return result['label'], result['score']
    except Exception:
        # e.g. a review longer than the model's maximum input length; flag the row instead of stopping
        return "Error", 0.0

# 5. Apply the Sentiment Analysis to the Data
runway[['sentiment', 'confidence']] = runway['review_text'].apply(
    lambda x: pd.Series(analyze_sentiment(x))
)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

In [21]:
# Testing
example_text = "I love using Hugging Face models!"
result = sentiment_pipeline(example_text)
print(result)

[{'label': 'POSITIVE', 'score': 0.9992625117301941}]
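
The cell that scores the reviews calls the pipeline once per row. A pipeline can also be called with a list of strings, in which case it returns one result dict per string. The sketch below shows one way to score reviews in chunks. `classify_in_batches` and `fake_classifier` are hypothetical names introduced for illustration, and the stand-in classifier is there only so the example runs without downloading a model.

```python
def classify_in_batches(texts, classifier, batch_size=32, **kwargs):
    """Run `classifier` over `texts` in chunks and return (label, score) pairs.

    Assumption: `classifier` behaves like a Hugging Face text-classification
    pipeline, i.e. called with a list of strings it returns one dict per
    string with 'label' and 'score' keys.
    """
    pairs = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for result in classifier(batch, **kwargs):
            pairs.append((result["label"], result["score"]))
    return pairs


# Stand-in classifier (not a real model) so the sketch is runnable offline
def fake_classifier(batch, **kwargs):
    return [
        {"label": "POSITIVE" if "love" in text else "NEGATIVE", "score": 1.0}
        for text in batch
    ]


reviews = ["i love it", "too tight", "love the color"]
print(classify_in_batches(reviews, fake_classifier, batch_size=2))
```

With the real pipeline, the call would look something like `classify_in_batches(runway['review_text'].tolist(), sentiment_pipeline, truncation=True)`. The `truncation=True` argument is intended to cut over-long reviews down to the model's maximum input length. Its effect should be checked against the installed transformers version.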


In [22]:
# 6. See the output
print("Data after sentiment analysis:")
print(runway.head())


Data after sentiment analysis:
   user_id  item_id  rating     rented for  \
0   476109   139086       8  formal affair   
1   203660  1126889       6          party   
2   868581   652189       8        wedding   
3   935076  1879504       8        wedding   
4   995023  1179146      10          party   

                                         review_text category height  size  \
0  it hit the floor perfectly with a pair of heel...     gown  5' 3"    15   
1  the dress is absolutely gorgeous unfortunately...    dress  5' 4"    12   
2  even though it was lined with satin this was a...    dress  5' 5"    24   
3  this dress was greatit fit really well and was...   sheath  5' 3"    14   
4  super flattering i am usually a sizemi have a ...    dress  5' 2"    14   

    age review_date sentiment  confidence  
0  27.0  2017-12-19  POSITIVE    0.999809  
1  28.0  2022-01-03  NEGATIVE    0.950723  
2  30.0  2021-08-05  NEGATIVE    0.987216  
3  37.0  2021-10-02  POSITIVE    0.996077  
4  

In [37]:
# Compare the model's predictions with the ratings column
# Define a function to map ratings to sentiment labels
def map_rating_to_sentiment(rating):
    if 1 <= rating <= 4:
        return 'Negative'
    elif 5 <= rating <= 6:
        return 'Neutral'
    elif 7 <= rating <= 10:
        return 'Positive'
    else:
        return 'Unknown'  # Handle unexpected ratings

# Apply the mapping to create a new column
runway['true_sentiment'] = runway['rating'].apply(map_rating_to_sentiment)

# Verify the mapping
print("\nData with True Sentiment:")
print(runway[['rating', 'true_sentiment']].head())

# Check for missing values in essential columns
essential_columns = ['review_text', 'confidence', 'rating', 'sentiment', 'true_sentiment']
missing_values = runway[essential_columns].isnull().sum()
print("\nMissing Values:")
print(missing_values)
df = runway[essential_columns].copy()

# The pipeline returns upper-case labels ('POSITIVE'/'NEGATIVE'); convert them to
# title case so they can be compared with true_sentiment
df['sentiment'] = df['sentiment'].str.title()


Data with True Sentiment:
   rating true_sentiment
0       8       Positive
1       6        Neutral
2       8       Positive
3       8       Positive
4      10       Positive

Missing Values:
review_text       0
confidence        0
rating            0
sentiment         0
true_sentiment    0
dtype: int64


In [50]:
# Check unique values in predicted and true sentiments
print("\nUnique Predicted Sentiments:", df['sentiment'].unique())
print("Unique True Sentiments:", df['true_sentiment'].unique())
print(df.info())

# The binary SST-2 model never predicts 'Neutral', so drop neutral-rated rows before comparing
df_new = df[df['true_sentiment'] != 'Neutral']
print(df_new.info())

# Rows where the predicted label disagrees with the rating-derived label
misclassified = df_new[df_new['sentiment'] != df_new['true_sentiment']]
print("\nMisclassified Examples:", len(misclassified))
print(misclassified[['review_text', 'rating', 'true_sentiment', 'sentiment']].head())

# # Verify the standardization
# print("\nStandardized Sentiments:")
# print(df[['sentiment', 'true_sentiment']].head())


Unique Predicted Sentiments: ['Positive' 'Negative']
Unique True Sentiments: ['Positive' 'Negative']
<class 'pandas.core.frame.DataFrame'>
Index: 1431 entries, 0 to 1505
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   review_text     1431 non-null   object 
 1   confidence      1431 non-null   float64
 2   rating          1431 non-null   int64  
 3   sentiment       1431 non-null   object 
 4   true_sentiment  1431 non-null   object 
dtypes: float64(1), int64(1), object(3)
memory usage: 67.1+ KB
None
<class 'pandas.core.frame.DataFrame'>
Index: 1431 entries, 0 to 1505
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   review_text     1431 non-null   object 
 1   confidence      1431 non-null   float64
 2   rating          1431 non-null   int64  
 3   sentiment       1431 non-null   object 
 4   true_sentiment  1431 non-null   object

In [51]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

y_true = df['true_sentiment']
y_pred = df['sentiment']

# Compute evaluation metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average='weighted', zero_division=0)
recall = recall_score(y_true, y_pred, average='weighted', zero_division=0)
f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1:.2f}")
# Generate confusion matrix
conf_matrix = confusion_matrix(y_true, y_pred, labels=['Positive', 'Neutral', 'Negative'])
conf_matrix_df = pd.DataFrame(conf_matrix,
                               index=['True Positive', 'True Neutral', 'True Negative'],
                               columns=['Predicted Positive', 'Predicted Neutral', 'Predicted Negative'])

print("\nConfusion Matrix:")
print(conf_matrix_df)

Accuracy: 0.79
Precision: 0.98
Recall: 0.79
F1-Score: 0.86

Confusion Matrix:
               Predicted Positive  Predicted Neutral  Predicted Negative
True Positive                1100                  0                 300
True Neutral                    0                  0                   0
True Negative                   4                  0                  27
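
The weighted averages above are dominated by the Positive class (1400 of the 1431 rows), so they say little about how the Negative class fares. As a rough sketch, the per-class figures can be recomputed directly from the counts in the confusion matrix printed above. `per_class_metrics` is a hypothetical helper written for this note, not part of scikit-learn.

```python
# Counts copied from the confusion matrix printed above.
# Keys are (rating-derived label, predicted label).
counts = {
    ("Positive", "Positive"): 1100,
    ("Positive", "Negative"): 300,
    ("Negative", "Positive"): 4,
    ("Negative", "Negative"): 27,
}


def per_class_metrics(counts, label):
    """Precision, recall and F1 for one class from (true, predicted) counts."""
    tp = counts.get((label, label), 0)
    predicted = sum(n for (_, pred), n in counts.items() if pred == label)
    actual = sum(n for (true, _), n in counts.items() if true == label)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


for label in ("Positive", "Negative"):
    p, r, f = per_class_metrics(counts, label)
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Read this way, only 27 of the 327 reviews predicted Negative carry a rating of 4 or lower, a precision of about 0.08 for that class. The `classification_report` function already imported in the metrics cell should give the same per-class breakdown via `print(classification_report(y_true, y_pred))`.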
