<a href="https://colab.research.google.com/github/freehtet/NLP/blob/main/M508A_Big_Data_Analytics_(WS0924)_Wai_Yann_Htet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Problem Statement


My company, a local coffee shop, assigned me to analyze customer reviews from Yelp. Customer feedback is now easy to collect, and every review matters because it can raise or lower sales. Previously there were only a handful of reviews, so our team could process them manually. As the number of internet users has grown, the volume has become unmanageable, so I have been asked to build a machine learning model that classifies each customer review as positive, negative, or neutral. This will help the company quickly see which areas of the shop need improvement and, eventually, increase sales.

The data was collected from Kaggle (https://www.kaggle.com/datasets/sripaadsrinivasan/yelp-coffee-reviews/data).



#High level system design

I built three models in order to compare their results and identify which performs best: (1) a transformer, (2) a neural network, and (3) a traditional model using an SVM.
At the end, I compare their performance and recommend the best model for this NLP task.

At a high level, the process is as follows.

Step 1. Data collection: the data was downloaded from Kaggle, as mentioned above.

Step 2. Data processing and feature engineering: although the transformer's tokenizer handles tokenization, encoding, and padding, I also used the spaCy library for tokenization for the neural network model and TF-IDF vectorization for the traditional model. Additionally, because the data is imbalanced, I sampled data from each rating to obtain a balanced dataset. Finally, the labels were converted from numerical ratings to the categories Positive, Neutral, and Negative.

Step 3. Model training and evaluation: the data was split into training and test sets, with 20% held out for testing. The models are trained on the provided data (reviews and ratings) using a transformer, a neural network, and an SVM. Evaluation is done at the end by measuring accuracy on the test set.

Overall, these steps are essential to implementing an NLP model. In Step 2, if the data is not properly cleaned and pre-processed, the results will suffer; done well, it helps the model perform better. Step 3 is equally important: the model must be trained to learn the task, and evaluating it provides the basis for deciding whether the model should be deployed or a different approach is required.

#Importing Libraries

In [None]:
!pip install evaluate
!pip install datasets

In [145]:
import numpy as np
import pandas as pd
import spacy
import sklearn.metrics
import sklearn.model_selection
from sklearn.preprocessing import LabelEncoder
import datasets
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import layers, models
import evaluate
import transformers
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from sklearn.svm import SVC
from prettytable import PrettyTable

#Loading Dataset

In [28]:
raw = pd.read_csv('raw_yelp_review_data.csv')

In [29]:
raw.head()

Unnamed: 0,coffee_shop_name,full_review_text,star_rating
0,The Factory - Cafe With a Soul,11/25/2016 1 check-in Love love loved the atm...,5.0 star rating
1,The Factory - Cafe With a Soul,"12/2/2016 Listed in Date Night: Austin, Ambia...",4.0 star rating
2,The Factory - Cafe With a Soul,11/30/2016 1 check-in Listed in Brunch Spots ...,4.0 star rating
3,The Factory - Cafe With a Soul,11/25/2016 Very cool decor! Good drinks Nice ...,2.0 star rating
4,The Factory - Cafe With a Soul,12/3/2016 1 check-in They are located within ...,4.0 star rating


In [127]:
raw.isnull().sum().sum()

0

The data has no null values.

#Data Processing

Extracting the integer rating from the combined number-and-text rating column.

In [32]:
raw['ratings'] = raw['star_rating'].apply(lambda x: int(x[:2].strip()))
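
The slice `x[:2]` relies on the digit sitting within the first two characters of every rating string, with only whitespace beside it. A minimal sketch, assuming the column always contains a leading number such as ` 5.0 star rating `, that extracts the number with a regular expression instead (the helper name `parse_star_rating` and the sample values are hypothetical):

```python
import pandas as pd

def parse_star_rating(series: pd.Series) -> pd.Series:
    """Extract the leading number from strings like ' 5.0 star rating ' as an int."""
    # Capture the first integer or decimal number in each string.
    numbers = series.str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    # Convert via float first so that '5.0' becomes 5.
    return numbers.astype(float).astype(int)

# Hypothetical sample values in the same style as the star_rating column.
sample = pd.Series([" 5.0 star rating ", " 4.0 star rating ", "2.0 star rating"])
parsed = parse_star_rating(sample)
print(parsed.tolist())
```

This version does not depend on how much whitespace surrounds the number.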

Removing the date from the start of the review text.

In [33]:
raw['full_review_text'] = raw['full_review_text'].apply(lambda x: x[11:])
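
`x[11:]` removes a fixed number of characters, but dates such as 11/25/2016 and 12/2/2016 differ in length. A minimal sketch, assuming every review starts with an M/D/YYYY date, that strips the date with a regular expression (the names `DATE_PREFIX` and `strip_leading_date` and the example strings are illustrative):

```python
import re

# Matches optional leading whitespace, a date such as 11/25/2016 or 12/2/2016,
# and any whitespace that follows it.
DATE_PREFIX = re.compile(r"^\s*\d{1,2}/\d{1,2}/\d{4}\s*")

def strip_leading_date(text: str) -> str:
    """Remove a leading M/D/YYYY date; leave the text unchanged if none is found."""
    return DATE_PREFIX.sub("", text, count=1)

# Example strings written in the style of the full_review_text column.
examples = [
    " 11/25/2016 1 check-in Love love loved the atmosphere!",
    " 12/2/2016 Listed in Date Night: Austin",
    "No date at the start",
]
cleaned = [strip_leading_date(t) for t in examples]
print(cleaned)
```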

Loading spaCy

In [34]:
nlp = spacy.load('en_core_web_sm')

Applying spaCy to the review text to tokenize it, as part of text pre-processing for the neural network model.

In [35]:
raw['clean text'] = raw['full_review_text'].apply(lambda x: nlp(x))

In [36]:
raw['text'] = raw['clean text'].apply(lambda x: ' '.join([token.text for token in x]))

Creating a new DataFrame with renamed columns

In [37]:
df = raw[['full_review_text','text', 'ratings']]
df.columns = ['text','c_text', 'label']
df.head()

Unnamed: 0,text,c_text,label
0,1 check-in Love love loved the atmosphere! Ev...,1 check - in Love love loved the atmosphere ...,5
1,"Listed in Date Night: Austin, Ambiance in Aust...","Listed in Date Night : Austin , Ambiance in Au...",4
2,1 check-in Listed in Brunch Spots I loved the...,1 check - in Listed in Brunch Spots I loved ...,4
3,Very cool decor! Good drinks Nice seating Ho...,Very cool decor ! Good drinks Nice seating ...,2
4,1 check-in They are located within the Northcr...,1 check - in They are located within the North...,4


Checking the label distribution of the dataset

In [38]:
df["label"].value_counts()

label,count
5,3780
4,2360
3,738
2,460
1,278


Splitting the dataset by rating in order to create a well-balanced sample

In [39]:
df0 = df[df['label'] == 1]
df1 = df[df['label'] == 2]
df2 = df[df['label'] == 3]
df3 = df[df['label'] == 4]
df4 = df[df['label'] == 5]

Creating a new DataFrame with 100 samples from each rating, except rating 3, which gets 200 samples. A total of 600 samples are taken.

In [40]:
df_final = pd.concat([df0.sample(100), df1.sample(100), df2.sample(200), df3.sample(100), df4.sample(100)])
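
Because `sample` is called without `random_state`, each run draws a different subset. A minimal sketch of a reproducible version, assuming the same idea of a fixed count per rating (the `sample_per_rating` helper, the seed, and the toy data are illustrative and not part of the original notebook):

```python
import pandas as pd

def sample_per_rating(df: pd.DataFrame, counts: dict, seed: int = 42) -> pd.DataFrame:
    """Draw a fixed number of rows for each rating with a fixed random seed."""
    parts = [
        df[df["label"] == rating].sample(n=n, random_state=seed)
        for rating, n in counts.items()
    ]
    return pd.concat(parts)

# Toy data standing in for df: 10 rows for each rating 1-5.
toy = pd.DataFrame({"text": [f"review {i}" for i in range(50)],
                    "label": [1, 2, 3, 4, 5] * 10})
# Twice as many rating-3 rows, mirroring the 100/100/200/100/100 split.
counts = {1: 2, 2: 2, 3: 4, 4: 2, 5: 2}
first = sample_per_rating(toy, counts)
second = sample_per_rating(toy, counts)
print(first["label"].value_counts().sort_index().to_dict())
```

With a fixed seed, repeated calls return the same rows, so later results can be reproduced.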

In [41]:
df_final['label'].value_counts()

label,count
3,200
1,100
2,100
4,100
5,100


Shuffling the DataFrame so that the ratings are spread out.

In [42]:
df_final = df_final.sample(frac=1, random_state=42).reset_index(drop=True)

In [43]:
df_final.head()

Unnamed: 0,text,c_text,label
0,I LOVE THIS PLACE... ...and I will 100% upda...,I LOVE THIS PLACE ... ... and I will 100 %...,2
1,1 check-in Listed in Be Weird in Austin Have y...,1 check - in Listed in Be Weird in Austin Have...,4
2,This place is so cute. New favorite coffee sp...,This place is so cute . New favorite coffee ...,5
3,was thinking this place was okay until a huge...,was thinking this place was okay until a hug...,1
4,Very cool decor! Good drinks Nice seating Ho...,Very cool decor ! Good drinks Nice seating ...,2


Creating a function to change the numeric rating into the textual labels Positive, Negative, and Neutral

In [44]:
def label_change(num):
  # Ratings 1-2 are negative, 3 is neutral, 4-5 are positive.
  if num < 3:
    return 'Negative'
  elif num == 3:
    return 'Neutral'
  else:
    return 'Positive'

In [45]:
df_final['label'].value_counts()

label,count
3,200
2,100
4,100
5,100
1,100


Applying the function to the label column

In [46]:
df_final['label'] = df_final['label'].apply(label_change)

The dataset is now well balanced.

In [47]:
df_final['label'].value_counts()

label,count
Negative,200
Positive,200
Neutral,200


Encoding the labels as integers

In [48]:
le = LabelEncoder()
df_final['label'] = le.fit_transform(df_final['label'])
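
`LabelEncoder` assigns integers in sorted order of the class names, so it helps to record which integer stands for which sentiment. A small sketch on stand-alone example labels (not the notebook's data; the variable names are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Positive", "Negative", "Neutral", "Negative"]
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the original names; position i is the name encoded as i.
mapping = {name: int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())
```

The same `classes_` attribute can be used later to turn predicted integers back into readable labels with `inverse_transform`.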

In [49]:
df_final.head()

Unnamed: 0,text,c_text,label
0,I LOVE THIS PLACE... ...and I will 100% upda...,I LOVE THIS PLACE ... ... and I will 100 %...,0
1,1 check-in Listed in Be Weird in Austin Have y...,1 check - in Listed in Be Weird in Austin Have...,2
2,This place is so cute. New favorite coffee sp...,This place is so cute . New favorite coffee ...,2
3,was thinking this place was okay until a huge...,was thinking this place was okay until a hug...,0
4,Very cool decor! Good drinks Nice seating Ho...,Very cool decor ! Good drinks Nice seating ...,0


Train/test split for the transformer model.

In [50]:
df_train, df_test = sklearn.model_selection.train_test_split(df_final, test_size=0.2, random_state=42)

The transformer model used is DistilBERT, with its matching tokenizer loaded through AutoTokenizer. A data collator with padding is then created so that the texts in a batch have the same length.

In [51]:
model_name = 'distilbert-base-uncased'

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
data_collator = transformers.DataCollatorWithPadding(tokenizer=tokenizer)

Creating a function that tokenizes the text, truncating it to a maximum length of 128 tokens and padding so that all texts in a batch have the same length.

In [52]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=True, max_length=128)
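
To illustrate what `truncation=True`, `padding=True`, and `max_length` do within one batch, here is a simplified pure-Python sketch on lists of token ids. It is not the Hugging Face implementation; the function name `truncate_and_pad`, the pad id of 0, and the toy ids are assumptions for illustration.

```python
def truncate_and_pad(batch, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad to the longest sequence in the batch."""
    truncated = [seq[:max_length] for seq in batch]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return input_ids, attention_mask

toy_batch = [[101, 7, 8, 9, 10, 102], [101, 5, 102]]
ids, mask = truncate_and_pad(toy_batch, max_length=4)
print(ids)
print(mask)
```

Unlike this sketch, the real tokenizer keeps the special end-of-sequence token when it truncates.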

Encoding the training and test data using the function created above.

In [53]:
df_train_encoded = datasets.Dataset.from_pandas(df_train).map(preprocess_function, batched=True)
df_test_encoded = datasets.Dataset.from_pandas(df_test).map(preprocess_function, batched=True)

Creating the evaluation metric to compute the performance of the model.

In [54]:
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
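
As a small illustration of what `compute_metrics` calculates, the sketch below applies the same argmax-then-compare logic to toy logits with NumPy only (the logit and label values are made up):

```python
import numpy as np

# Toy logits for 3 examples and 3 classes (values are made up for illustration).
logits = np.array([[2.0, 1.0, 0.0],
                   [0.0, 1.0, 3.0],
                   [0.1, 0.5, 0.2]])
labels = np.array([0, 2, 0])

# The predicted class is the index of the largest logit in each row.
predictions = np.argmax(logits, axis=-1)
# Accuracy is the share of predictions that match the reference labels.
accuracy = float((predictions == labels).mean())
print(predictions.tolist(), accuracy)
```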

Loading the pre-trained model.

In [55]:
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=df_final['label'].nunique())

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Creating the arguments for training

In [56]:
training_args = transformers.TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    report_to="none",
)



Setting up the trainer with the arguments created above, the training and evaluation data, and the evaluation metric.

In [57]:
trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=df_train_encoded,
    eval_dataset=df_test_encoded,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)


Training the model and saving it.

In [58]:
trainer.train()
trainer.save_model("./results/" + model_name)

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,1.006204,0.633333
2,No log,0.766511,0.633333
3,No log,0.659148,0.733333
4,No log,0.648007,0.7
5,No log,0.612786,0.75
6,No log,0.581912,0.75
7,No log,0.568121,0.75
8,No log,0.585472,0.766667
9,No log,0.576149,0.758333
10,No log,0.578952,0.758333
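
The log shows that validation loss is lowest at epoch 7 while accuracy peaks at epoch 8, so the final checkpoint is not necessarily the best one. A short sketch that picks those epochs from the values logged above (the variable names are illustrative):

```python
# (epoch, validation loss, accuracy) copied from the training log above.
log = [
    (1, 1.006204, 0.633333), (2, 0.766511, 0.633333), (3, 0.659148, 0.733333),
    (4, 0.648007, 0.7), (5, 0.612786, 0.75), (6, 0.581912, 0.75),
    (7, 0.568121, 0.75), (8, 0.585472, 0.766667), (9, 0.576149, 0.758333),
    (10, 0.578952, 0.758333),
]

# Epoch with the lowest validation loss.
best_by_loss = min(log, key=lambda row: row[1])
# Epoch with the highest accuracy (max returns the earliest epoch on ties).
best_by_accuracy = max(log, key=lambda row: row[2])
print(best_by_loss[0], best_by_accuracy[0])
```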


Evaluating the trained model on the test set

In [59]:
trainer.evaluate()

{'eval_loss': 0.578951895236969,
 'eval_accuracy': 0.7583333333333333,
 'eval_runtime': 23.1115,
 'eval_samples_per_second': 5.192,
 'eval_steps_per_second': 0.346,
 'epoch': 10.0}

#Neural network

Creating a neural network that follows the same process as above, in order to compare the two models.

Splitting the training and testing set.

In [60]:
X = df_final['c_text']
y = df_final['label']

In [61]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.2, random_state=42)

Setting the maximum vocabulary size and sequence length

In [62]:
max_words = 40000
max_len = 200

Tokenizing the text: creating a tokenizer limited to max_words and fitting it on X_train.

In [63]:
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train)

Converting the text into sequences of integers using the tokenizer.

In [64]:
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

Padding the sequences so that all inputs have the same length.

In [65]:
X_train_pad = pad_sequences(X_train_seq, maxlen=max_len)
X_test_pad = pad_sequences(X_test_seq, maxlen=max_len)
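
With its default settings, `pad_sequences` pads with zeros at the front and, for sequences longer than `maxlen`, keeps the last `maxlen` tokens. A simplified pure-Python sketch of that default behaviour (the function `pad_pre` and the toy sequences are illustrative, not the Keras implementation):

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last maxlen items of long ones."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]  # truncate from the front
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

toy_sequences = [[4, 8, 15], [16, 23, 42, 7, 9, 3]]
result = pad_pre(toy_sequences, maxlen=5)
print(result)
```

For reviews longer than `max_len`, this means the beginning of the review is dropped rather than the end.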

Creating the sequential layers for a simple deep learning network with 2 LSTM layers, along with dropout, batch normalization, and 2 dense layers.

In [129]:
model = models.Sequential([
    layers.Embedding(max_words, 200),
    layers.SpatialDropout1D(0.2),
    layers.LSTM(100, return_sequences=True),
    layers.BatchNormalization(),
    layers.LSTM(50),
    layers.Dense(100, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="softmax")
])

Creating an Adam optimizer with learning rate 0.002.

In [130]:
adam = tf.keras.optimizers.Adam(learning_rate=0.002)

Compiling the model.

In [131]:
# Pass the optimizer object created above so that its learning rate of 0.002 is used.
model.compile(loss="categorical_crossentropy", optimizer=adam, metrics=["accuracy"])

Creating an early-stopping callback to prevent overfitting and save training time.

In [132]:
early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)
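
The patience rule can be illustrated with a simplified pure-Python simulation (not the Keras implementation; `stopping_epoch` is a hypothetical helper). Training stops once the monitored loss has failed to improve for `patience` consecutive epochs:

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training stops, or None if it never stops early."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss  # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1    # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# A loss that never improves after the first epoch stops at epoch 1 + patience.
flat = stopping_epoch([0.0] * 10, patience=3)
# A steadily improving loss never triggers early stopping.
improving = stopping_epoch([0.9, 0.8, 0.7, 0.6], patience=3)
print(flat, improving)
```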

Starting the training process and storing the results in history.

In [133]:
history = model.fit(X_train_pad, y_train, epochs=10, batch_size=32, validation_split=0.2, callbacks=[early_stopping])

Epoch 1/10
12/12 ━━━━━━━━━━━━━━━━━━━━ 15s 591ms/step - accuracy: 0.3440 - loss: 0.0000e+00 - val_accuracy: 0.3438 - val_loss: 0.0000e+00
Epoch 2/10
12/12 ━━━━━━━━━━━━━━━━━━━━ 5s 405ms/step - accuracy: 0.3440 - loss: 0.0000e+00 - val_accuracy: 0.3438 - val_loss: 0.0000e+00
Epoch 3/10
12/12 ━━━━━━━━━━━━━━━━━━━━ 7s 561ms/step - accuracy: 0.3440 - loss: 0.0000e+00 - val_accuracy: 0.3438 - val_loss: 0.0000e+00
Epoch 4/10
12/12 ━━━━━━━━━━━━━━━━━━━━ 9s 410ms/step - accuracy: 0.3440 - loss: 0.0000e+00 - val_accuracy: 0.3438 - val_loss: 0.0000e+00


Testing on the test set to get results on unseen data.

In [134]:
y_pred = model.predict(X_test_pad)


In [135]:
accuracy = sklearn.metrics.accuracy_score(y_test, y_pred.round())

In [136]:
round(accuracy *100)

36
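
The zero loss and unchanging accuracy in the training log are what one would expect from the final `Dense(1, activation="softmax")` layer: softmax over a single value is always 1, so the model outputs the same prediction for every review and `y_pred.round()` is 1 for all of them. The reported accuracy would then simply be the share of test reviews whose encoded label is 1. A small NumPy sketch of this effect, followed by the three-class case (the logit values are made up):

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# One output unit: every row has a single logit, whatever its value.
single_unit = softmax(np.array([[-3.2], [0.0], [5.7]]))
# Three output units: one probability per class, so the rows can differ.
three_units = softmax(np.array([[2.0, 0.5, -1.0], [-1.0, 0.0, 3.0]]))

print(single_unit.ravel().tolist())
print(three_units.argmax(axis=-1).tolist())
```

Under that reading, a three-class setup would typically use an output layer with three units together with `sparse_categorical_crossentropy` for the integer labels, and take the `argmax` of the predictions before computing accuracy. This is a suggestion for a future run, not a result from this notebook.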

#Traditional model

Downloading the nltk packages needed for text processing.

In [None]:
nltk.download('punkt')
nltk.download('punkt_tab')

Creating a function to do tokenization with nltk.

In [88]:
def custom_tokenizer(text):
    return nltk.word_tokenize(text)

Creating the numerical features with a TF-IDF vectorizer.

In [96]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=custom_tokenizer)
tfidf_features = tfidf_vectorizer.fit_transform(df_final['c_text'])
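
Fitting the vectorizer on all of `df_final` before splitting lets vocabulary and IDF statistics from the test reviews influence the features. A hedged sketch of an alternative, assuming scikit-learn's `Pipeline`, where the vectorizer is fitted on the training texts only (the toy reviews, labels, and variable names are invented for illustration, and the nltk tokenizer is left out to keep the example self-contained):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Invented toy reviews; 0 = negative, 1 = positive.
train_texts = ["great coffee and friendly staff", "terrible service and cold coffee",
               "lovely atmosphere great latte", "rude staff terrible experience"]
train_labels = [1, 0, 1, 0]
test_texts = ["great latte", "terrible coffee"]

# The pipeline fits TF-IDF on the training texts only, then trains the SVM.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svm", SVC(kernel="linear")),
])
pipeline.fit(train_texts, train_labels)
predicted = pipeline.predict(test_texts)
print(predicted.tolist())
```

Calling `predict` on raw test texts reuses the vocabulary learned from the training set, so no information from the test reviews enters the features.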

Train/test split to train and test the model.

In [103]:
X = tfidf_features
y = df_final['label']
X_train_tf, X_test_tf, y_train_tf, y_test_tf = sklearn.model_selection.train_test_split(X, y, test_size=0.2, random_state=100)

Creating the SVM model

In [110]:
svm_model_tf = SVC(kernel='linear', probability=True)
svm_model_tf.fit(X_train_tf, y_train_tf)

Predicting with the SVM model

In [142]:
# Use the model fitted above (svm_model_tf); svm_model is not defined in this notebook.
svm_preds_tf = svm_model_tf.predict(X_test_tf)

Checking accuracy of the model

In [143]:
accuracy_svm_tf = sklearn.metrics.accuracy_score(y_test_tf, svm_preds_tf)

In [144]:
round(accuracy_svm_tf *100)

61

Creating a table to compare the results.

In [146]:
table = PrettyTable()

In [147]:
table.field_names = ['Model', 'Accuracy']
table.add_row(['Transformer model', 75])
table.add_row(['Neural network model', 36])
table.add_row(['Traditional model with SVM', 61])

In [148]:
display('The result of models')
print(table)

'The result of models'

+----------------------------+----------+
|           Model            | Accuracy |
+----------------------------+----------+
|     Transformer model      |    75    |
|    Neural network model    |    36    |
| Traditional model with SVM |    61    |
+----------------------------+----------+


#Final Discussion

I built three models, and each has different requirements for proper training. First, the transformer model is the most effective in terms of evaluation results, reaching about 75% accuracy. However, training it is time-consuming, so only a sample of the dataset could be used in order to save time and stay within limited resources. Second, the neural network trains somewhat faster than the transformer model, but it reached only 36% accuracy. Third, the traditional SVM model performed better than the neural network (61% accuracy), but the transformer model does better still. In conclusion, neither the neural network nor the SVM performed as well as the transformer model.
Overall, based on the comparison of the results among the transformer model, the neural network, and the traditional model (SVM), I recommend using the transformer model, even though it requires significantly more time and resources than the neural network and the traditional model.
