<a href="https://colab.research.google.com/github/eshmaneva/DS-Practical/blob/main/Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [27]:
#Preinstalling the necessary libraries
#Certain versions are required to avoid compatibility issues 

In [28]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [29]:
!pip install numpy==1.19.5
!pip install tensorflow==2.7.0
!pip install transformers==4.7.0
!pip install sacremoses==0.0.45



In [30]:
#Importing necessary classes for classification and summarizaton
import tensorflow as tf
import tensorflow_datasets as tfds
from transformers import DistilBertTokenizerFast
from transformers import TFDistilBertForSequenceClassification

import pandas as pd
import numpy as np
import nltk
import re

nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

from six import viewitems
#Importing methods for splitting and shuffling data (as dataset contains no pre-trained data)
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [31]:
#Checking the avaliable GPUs (not necessary, made as a test of the system)

#num_gpus_available = len(tf.config.experimental.list_physical_devices('GPU'))
#print("Num GPUs Available: ", num_gpus_available)
#assert num_gpus_available > 0

In [32]:
#Used dataset: Amazon product reviews / Mobile Electronics

ds = tfds.load('amazon_us_reviews/Mobile_Electronics_v1_00', split='train', shuffle_files=True)
assert isinstance(ds, tf.data.Dataset)
#print(ds)

INFO:absl:Load dataset info from /root/tensorflow_datasets/amazon_us_reviews/Mobile_Electronics_v1_00/0.1.0
INFO:absl:Reusing dataset amazon_us_reviews (/root/tensorflow_datasets/amazon_us_reviews/Mobile_Electronics_v1_00/0.1.0)
INFO:absl:Constructing tf.data.Dataset for split train, from /root/tensorflow_datasets/amazon_us_reviews/Mobile_Electronics_v1_00/0.1.0


In [33]:
#Setting the dataset as a frame (trasforming it from tensor)
df = tfds.as_dataframe(ds)
#Preview of the data
df.head()

Unnamed: 0,data/customer_id,data/helpful_votes,data/marketplace,data/product_category,data/product_id,data/product_parent,data/product_title,data/review_body,data/review_date,data/review_headline,data/review_id,data/star_rating,data/total_votes,data/verified_purchase,data/vine
0,b'20980074',0,b'US',b'Mobile_Electronics',b'B00D1847NE',b'274617424',b'Teenage Mutant Ninja Turtles Boombox CD Play...,b'Does not work',b'2015-01-09',b'One Star',b'R1OVS0D6SEXPW7',1,0,0,1
1,b'779273',0,b'US',b'Mobile_Electronics',b'B00KMO6DYG',b'397452138',b'4 Gauge Amp Kit Amplifier Install Wiring Com...,b'This is a great wiring kit i used it to set ...,b'2015-08-06',b'Great kit',b'R9VSD0ET8FERB',4,0,0,1
2,b'15410531',0,b'US',b'Mobile_Electronics',b'B000GWLL0K',b'948304826',b'Travel Wall Charger fits Creative Zen Vision...,b'It works great so much faster than USB charg...,b'2007-03-15',b'A/C Charger for Creative Zen Vision M',b'R3ISXCZHWLJLBH',5,0,0,1
3,b'27389005',0,b'US',b'Mobile_Electronics',b'B008L3JE6Y',b'466340015',b'High Grade Robust 360\xc2\xb0 Adjustable Car...,b'This product was purchased to hold a monitor...,b'2013-07-30',b'camera stand',b'R1TWVUDOFJSQAW',5,0,0,1
4,b'2663569',0,b'US',b'Mobile_Electronics',b'B00GHZS4SC',b'350592810',b'HDE Multifunctional Bluetooth FM Audio Car K...,"b""it works but it has really bad sound quality...",b'2014-12-31',b'bad sound quality',b'R2PEOEUR1LP0GH',3,0,0,1


In [34]:
#Classifying the data into two classes: positive and negative based on their star rating
df["Sentiment"] = df["data/star_rating"].apply(lambda score: "positive" if score >= 3 else "negative")
df['Sentiment'] = df['Sentiment'].map({'positive':1, 'negative':0})

In [35]:
df['short_review'] = df['data/review_body'].str.decode("utf-8")

In [36]:
df = df[["short_review", "Sentiment"]]

In [37]:
#Dropping last n rows using drop
n = 54975
df.drop(df.tail(n).index,
        inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [38]:
#To check how big is the dataset / num of rows
#index = df.index
#number_of_rows = len(index)
#print(number_of_rows)

#Printing the beginning part to see if the data is read correctly
#df.head()

#Printing the beginning part to see if the data is read correctly
#df.tail()

In [39]:
#Testing the labels
reviews = df['short_review'].values.tolist()
labels = df['Sentiment'].tolist()
#print(reviews[:2])
#print(labels[:2])

In [40]:
training_sentences, validation_sentences, training_labels, validation_labels = train_test_split(reviews, labels, test_size=.2)
#training_sentences, validation_sentences, training_labels, validation_labels = train_test_split(reviews, labels, test_size=0.33, random_state=42, stratify=labels)


In [41]:
#Preprocessing the data using DistilBert for punctuation splitting and wordpieces
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

In [42]:
tokenizer([training_sentences[0]], truncation=True,
                            padding=True, max_length=128)

{'input_ids': [[101, 1045, 4149, 2023, 2000, 5672, 2026, 2214, 4718, 11659, 1012, 1996, 2214, 2028, 2001, 1037, 5592, 1010, 1998, 2074, 5941, 3031, 1996, 3953, 1997, 1996, 26322, 1010, 2478, 2009, 1005, 1055, 2219, 2373, 1012, 1045, 2001, 5305, 1997, 2383, 2000, 3715, 1996, 26322, 19802, 22139, 2135, 1998, 2025, 2108, 2583, 2000, 4952, 2000, 2009, 1999, 1996, 2482, 2065, 1045, 9471, 2000, 2079, 2061, 1010, 2061, 1045, 2359, 1037, 25025, 11659, 2023, 2051, 2105, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2023, 2518, 2003, 3492, 3722, 2041, 1997, 1996, 3482, 1010, 3733, 2000, 3275, 2041, 2302, 2383, 2000, 3191, 2083, 1996, 8128, 1010, 1998, 2009, 3849, 3492, 2092, 3833, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 2034, 2518, 1045, 2018, 2019, 3277, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

In [43]:

train_encodings = tokenizer(training_sentences,
                            truncation=True,
                            padding=True)
val_encodings = tokenizer(validation_sentences,
                            truncation=True,
                            padding=True)

In [44]:
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    training_labels
))

val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    validation_labels
))

In [45]:
#tbd
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased',num_labels=2)

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['activation_13', 'vocab_layer_norm', 'vocab_transform', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['dropout_59', 'pre_classifier', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use i

In [46]:
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5, epsilon=1e-08)
callbacks=tf.keras.callbacks.EarlyStopping(
    monitor='accuracy', 
    min_delta=0.0001,
    patience=3,
    mode='auto',
    verbose=2,
    baseline=None
)
model.compile(optimizer=optimizer, loss=model.compute_loss, metrics=['accuracy'])
model.fit(train_dataset.shuffle(100).batch(16),
          epochs=2,
          batch_size=16,
          validation_data=val_dataset.shuffle(100).batch(16),callbacks=callbacks)

Epoch 1/2
























Epoch 2/2
Epoch 2/2


<keras.callbacks.History at 0x7fa9104f8690>

<keras.callbacks.History at 0x7fa9104f8690>

In [None]:
'''
import matplotlib.pyplot as plt

plt.title('Loss curves')
plt.plot(model.train_loss_history, '-', label='train')
plt.plot(model.val_loss_history, '-', label='val')
plt.legend(loc='lower right')
plt.xlabel('Iteration')
plt.show()

'''

In [48]:
model.save_pretrained("./sentiment")

In [49]:
loaded_model = TFDistilBertForSequenceClassification.from_pretrained("./sentiment")

Some layers from the model checkpoint at ./sentiment were not used when initializing TFDistilBertForSequenceClassification: ['dropout_59']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers from the model checkpoint at ./sentiment were not used when initializing TFDistilBertForSequenceClassification: ['dropout_59']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassificati

In [50]:
import pandas as pd
#Testing a model with a user-written input

#df = pd.DataFrame({'Text': ["This is a not a good product. I hate it", "This product is okay", "I hate you", 'I think this apple is good', "This toast is terrible"]})
#test_sentence = "This is a not a good product. I hate it"

df = pd.read_csv("/content/drive/MyDrive/DS Practical/summary_test.csv", engine="python", error_bad_lines=False)
df=df.loc[0:50]
#df = df.drop('index',axis=1, inplace=True)

df.head()



  exec(code_obj, self.user_global_ns, self.user_ns)


  exec(code_obj, self.user_global_ns, self.user_ns)


Unnamed: 0.1,Unnamed: 0,index,Summary,Text,Generated_summary
0,0,900,Dessert to Them. But is it Nutritional Enough?,My twin 10-month old babies treat this one as ...,Sweet Potato
1,1,901,Repeated deliveries of broken jars. Great pro...,I liked getting this one for my twin 10-month ...,I like this one because it's vegetarian
2,2,902,"Good flavor, too runny though","Baby loved this one, but as he has progressed ...",I love this one!
3,3,903,12% Protein and 50% Vitamin A,Earth's Best Turkey Vegetable Dinner hits the ...,Delicious!
4,4,904,Baby Lilly says 2 thumbs up!,My little girl can't get enough of this! I lik...,Delicious!


Unnamed: 0.1,Unnamed: 0,index,Summary,Text,Generated_summary
0,0,900,Dessert to Them. But is it Nutritional Enough?,My twin 10-month old babies treat this one as ...,Sweet Potato
1,1,901,Repeated deliveries of broken jars. Great pro...,I liked getting this one for my twin 10-month ...,I like this one because it's vegetarian
2,2,902,"Good flavor, too runny though","Baby loved this one, but as he has progressed ...",I love this one!
3,3,903,12% Protein and 50% Vitamin A,Earth's Best Turkey Vegetable Dinner hits the ...,Delicious!
4,4,904,Baby Lilly says 2 thumbs up!,My little girl can't get enough of this! I lik...,Delicious!


In [None]:
for i in range(0, len(df)):
  df['Sentiment_text'] = i 
  df['Sentiment_summary'] = i

In [52]:
for i in range(0,len(df)):
  
  predict_input_text = tokenizer.encode(df['Text'][i],
                                 truncation=True,
                                 padding=True,
                                 return_tensors="tf")
  tf_output_text = loaded_model.predict(predict_input_text)[0]
  tf_prediction_text = tf.nn.softmax(tf_output_text, axis=1)
  labels = ['Negative','Positive']
  label_text = tf.argmax(tf_prediction_text, axis=1)
  label_text = label_text.numpy()
  df["Sentiment_text"][i] = (labels[label_text[0]])

  

  predict_input_sum = tokenizer.encode(df['Generated_summary'][i],
                                 truncation=True,
                                 padding=True,
                                 return_tensors="tf")
  tf_output_sum = loaded_model.predict(predict_input_sum)[0]
  tf_prediction_sum = tf.nn.softmax(tf_output_sum, axis=1)
  #labels = ['Negative','Positive']
  label_sum = tf.argmax(tf_prediction_sum, axis=1)
  label_sum = label_sum.numpy()
  df["Sentiment_summary"][i] = (labels[label_sum[0]])

  


#df = df.append(data, columns = "Sentiment")
print(df)











A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if sys.path[0] == '':
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if sys.path[0] == '':
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)








A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy






TypeError: ignored

TypeError: ignored