<a href="https://colab.research.google.com/github/adityakalkeri1/Projects/blob/main/Ratings_project/Review_Sentiment_using_HuggingFace.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install transformers
!pip install tensorflow-text
!pip install tf-models-official

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d5/43/cfe4ee779bbd6a678ac6a97c5a5cdeb03c35f9eaebbb9720b036680f9a2d/transformers-4.6.1-py3-none-any.whl (2.2MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
Collecting huggingface-hub==0.0.8
  Downloading https://files.pythonhosted.org/packages/a1/88/7b1e45720ecf59c6c6737ff332f41c955963090a18e72acbcbeac6b25e86/huggingface_hub-0.0.8-py3-none-any.whl
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/d4/e2/df3543e8ffdab68f5acc73f613de9c2b155ac47f162e725dcac87c521c11/tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3MB)

In [3]:
#Mounting Google Drive, where the dataset is stored
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
#Importing all the dependencies
import pandas as pd                                                              #Package for reading structured data 
import numpy as np                                                               #Package for interacting with arrays
import re                                                                        #For cleaning the text

import nltk                                                                      #Package for NLP tasks
nltk.download('stopwords')                                                       #Downloading Stopwords
from nltk.corpus import stopwords

from transformers import DistilBertTokenizerFast                                 #importing the tokenizer
from transformers import TFDistilBertForSequenceClassification
import tensorflow as tf                                                          #Package for Deep Learning
import tensorflow_text as text                                                   #Required import for BERT
#from official.nlp import optimization                                            #Package for AdamW (optimizer for BERT)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [20]:
#Importing the data
df = pd.read_csv('/content/drive/MyDrive/Full_comments_dataset.csv')

In [21]:
#Only copying the required columns and dropping all rows with null values
df = df[['Comment', 'Rating']].copy()
df.dropna(inplace = True)

In [35]:
#There are 5 ratings in total, but for this application we will use only
#positive and negative sentiment

#np.where works as an 'if condition': 1 if Rating is greater than 3,
#0 if Rating is less than or equal to 3
rating = np.where(df['Rating'].values > 3, 1, 0)
df['Rating'] = rating
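
The thresholding rule can be sanity-checked on a few hand-picked values. The sketch below is written in plain Python so it has no dependencies; `sample_ratings` and `binarize` are hypothetical names used only for illustration, and the result is meant to mirror what the `np.where` call computes for the same inputs.

```python
# Hypothetical sample ratings on the notebook's 1-5 scale
sample_ratings = [5, 4, 3, 2, 1]

def binarize(rating, threshold=3):
    """Return 1 (positive) for ratings above the threshold, else 0 (negative)."""
    return 1 if rating > threshold else 0

sample_labels = [binarize(r) for r in sample_ratings]
print(sample_labels)  # [1, 1, 0, 0, 0]
```

Note that a rating of exactly 3 falls into the negative class under this rule.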

In [22]:
df['Rating'].unique()

array([5, 4, 3, 2, 1])

In [23]:
#Here, we will write a function to do basic cleaning of text
stop_words = set(stopwords.words('english'))                                                #Set for fast membership tests
def text_clean(row):
    row = re.sub(r'\n', ' ', row)                                                           #For removing \n in the text
    row = re.sub(r'@[A-Za-z0-9_]+', '', row)                                                #For removing usernames (@ followed by letters, digits or underscores)
    row = re.sub(r"""[!"#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]""", ' ', row)                   #For removing punctuation
    row = row.split()
    row = [word for word in row if word not in stop_words]                                  #For removing stopwords (NLTK's list is lowercase)
    row = (' ').join(row)
    return row
#Applying the text_clean function to every comment. apply() works like a for loop,
#calling text_clean once per row
df['Comment'] = df['Comment'].apply(text_clean)
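
To see what these cleaning steps do to a single comment, here is a minimal, self-contained sketch. It assumes a tiny hypothetical stop-word set (`demo_stop_words`) in place of NLTK's English list, builds the punctuation class from `string.punctuation`, and uses `+` in the username pattern so the whole handle is removed. The function name `clean_demo` and the sample sentence are made up for illustration.

```python
import re
import string

# Hypothetical mini stop-word set; the notebook uses NLTK's English list instead
demo_stop_words = {"the", "is", "a", "and", "this"}

# Character class matching any single ASCII punctuation character
punct_pattern = "[" + re.escape(string.punctuation) + "]"

def clean_demo(row):
    row = re.sub(r"\n", " ", row)             # newlines -> spaces
    row = re.sub(r"@[A-Za-z0-9_]+", "", row)  # drop whole @usernames
    row = re.sub(punct_pattern, " ", row)     # punctuation -> spaces
    words = [w for w in row.split() if w not in demo_stop_words]
    return " ".join(words)

cleaned = clean_demo("@john_doe this phone is great,\nbattery lasts a day!")
print(cleaned)  # phone great battery lasts day
```

The username step runs before the punctuation step on purpose: once `@` and `_` have been replaced by spaces, the handle can no longer be recognized.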


In [24]:
#Here we have some additional words that might have low predictive value. They will be removed
remove_words = ['I', 'The', 'Amazon','Flipkart', 'It', 'mobiles', 'TV', 'DSLR', 'Smartwatch', 'Laptop']
def words_to_be_removed(row):
  row = row.split()
  row = [word for word in row if word not in remove_words]
  row = (' ').join(row)
  return row
df['Comment'] = df['Comment'].apply(words_to_be_removed)

In [36]:
text = df['Comment'].tolist()
label = df['Rating'].tolist()

In [37]:
#Splitting the dataset into train and test sets. We will not use a separate validation set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(text, label)

In [38]:
#Downloading the tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

A BERT layer requires three inputs:

1. Word IDs: numeric IDs, one for each token, taken from the tokenizer's vocabulary.
2. Attention mask: lets BERT distinguish real tokens from padding tokens.
3. Segment IDs: let BERT distinguish between the two sentences of a sentence pair.

DistilBERT requires only the first two inputs.

The DistilBERT tokenizer takes raw text as input and produces output in the form the DistilBERT layer expects.
This is an efficient way to build the dataset, since we don't have to encode everything by hand.

Tokenizer arguments:
truncation=True
BERT accepts at most 512 tokens per input; this argument truncates longer texts to that limit.

padding=True
The tokenizer pads every text in the call to the length of the longest one, so all sequences end up the same length.
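
Truncation, padding and the attention mask can be pictured with a small plain-Python sketch. This is not the tokenizer's implementation: the token IDs are made up, `PAD_ID = 0` and `MAX_LEN = 4` are assumptions chosen for the demo (the real limit is 512), and `pad_and_truncate` is a hypothetical helper. The real tokenizer also keeps the closing special token when it truncates, which this sketch does not.

```python
# Made-up token-ID sequences of different lengths (not real DistilBERT IDs)
batch = [[101, 2023, 102], [101, 2307, 3712, 2166, 102]]
PAD_ID = 0   # assumed padding ID for this sketch
MAX_LEN = 4  # tiny cap standing in for the model's 512-token limit

def pad_and_truncate(sequences, max_len, pad_id):
    # Truncation: cut every sequence down to at most max_len tokens
    truncated = [seq[:max_len] for seq in sequences]
    # Padding: extend every sequence to the longest length in the batch
    target = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in truncated]
    # Attention mask: 1 for real tokens, 0 for padding
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

encoded = pad_and_truncate(batch, MAX_LEN, PAD_ID)
print(encoded["input_ids"])       # [[101, 2023, 102, 0], [101, 2307, 3712, 2166]]
print(encoded["attention_mask"])  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

The two keys mirror the two inputs DistilBERT needs: the IDs carry the text, and the mask tells the model which positions to ignore.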

In [39]:
#Tokenizing the train and test datasets
X_train = tokenizer(X_train, truncation=True, padding=True)
X_test = tokenizer(X_test, truncation=True, padding=True)

In [41]:
#Making a tensorflow dataset
train_dataset = tf.data.Dataset.from_tensor_slices((dict(X_train), y_train))
test_dataset = tf.data.Dataset.from_tensor_slices((dict(X_test), y_test))

In [45]:
from transformers import TFDistilBertForSequenceClassification

model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss, metrics = ['accuracy']) # can also use any keras loss fn

model.summary()

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_layer_norm', 'vocab_projector', 'activation_13', 'vocab_transform']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['dropout_139', 'pre_classifier', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use 

Model: "tf_distil_bert_for_sequence_classification_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
distilbert (TFDistilBertMain multiple                  66362880  
_________________________________________________________________
pre_classifier (Dense)       multiple                  590592    
_________________________________________________________________
classifier (Dense)           multiple                  1538      
_________________________________________________________________
dropout_139 (Dropout)        multiple                  0         
=================================================================
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0
_________________________________________________________________


In [58]:
#The datasets are already batched, so batch_size is not passed to fit()
model.fit(train_dataset.shuffle(1000).batch(16), validation_data=test_dataset.batch(16), epochs=1)

<tensorflow.python.keras.callbacks.History at 0x7f30025389d0>

In [52]:
model.save('/content/drive/MyDrive/')





INFO:tensorflow:Assets written to: /content/drive/MyDrive/assets
