<a href="https://colab.research.google.com/github/baneabhishek/tensorflow_yelp/blob/main/Tensorflow_Yelp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [59]:
import pandas as pd
import numpy as np
import os
import tensorflow as tf
from sklearn.model_selection import train_test_split

In [60]:
!pip install pickle5
import pickle5



In [3]:
from google.colab import drive
drive.mount('/content/drive',force_remount=True)

Mounted at /content/drive


In [61]:
with open('/content/drive/My Drive/archive/reviews_curtailed.pkl','rb') as file:
    user_review = pickle5.load(file)

In [62]:
user_review = user_review[['text','stars']]
user_review.head()

Unnamed: 0,text,stars
4,"Oh happy day, finally have a Canes near my cas...",4.0
5,This is definitely my favorite fast food sub s...,5.0
6,"Really good place with simple decor, amazing f...",5.0
8,Most delicious authentic Italian I've had in t...,5.0
11,ORDER In (Delivery) Review\n\nI discovered thi...,4.0


In [63]:
user_review.shape

(3305941, 2)

Data is too big so taking a fraction to accomodate it in the Colab memory 

In [64]:
user_review.stars = user_review.stars.astype(int)
user_review = user_review.sample(frac=0.1)
user_review.shape

(330594, 2)

Change stars to boolean value to predict if a review was positive or negative. Stars >3 will be a positive review and less than 3 will be negative reviews

In [65]:
user_review['label'] = np.where(user_review['stars']>=3,1,0)
user_review.head()

Unnamed: 0,text,stars,label
7532887,I had a very good experience at the store I wa...,5,1
2984593,One off my favorite spot in Vegas to eat. To b...,5,1
1846807,The girls here are wonderful:) Yasmin in parti...,5,1
6777744,I violated my rule of avoiding grocery shoppin...,5,1
7627782,I LOVE nachos and these are hands down the bes...,5,1


Preprocessing the text data

In [66]:
user_review['text'] = user_review['text'].str.lower()
user_review.head()

Unnamed: 0,text,stars,label
7532887,i had a very good experience at the store i wa...,5,1
2984593,one off my favorite spot in vegas to eat. to b...,5,1
1846807,the girls here are wonderful:) yasmin in parti...,5,1
6777744,i violated my rule of avoiding grocery shoppin...,5,1
7627782,i love nachos and these are hands down the bes...,5,1


Train Test Split

In [67]:
train, test = train_test_split(user_review, test_size=0.2, shuffle=True)
train, valid = train_test_split(train, test_size=0.1, shuffle=True)

In [73]:
def df_to_tensor(data):
  data = user_review[['text','label']]
  target = data.pop('label')
  ### use np.array to create an array of text column instead of data.values, data.values will create a shape (10,1), we want (10,)
  dataset = tf.data.Dataset.from_tensor_slices((np.array(data.text), target.values))
  return dataset

In [74]:
## example dataset
next(iter(df_to_tensor(valid).batch(10)))

(<tf.Tensor: shape=(10,), dtype=string, numpy=
 array([b'i had a very good experience at the store i was in there last week as soon as i walked in our young lady offered to help me find what i needed i told her what my budget was she showed me the uniforms that was in my price range and then i asked her what the dressing rooms were she want me over to it on the way i saw more uniforms that way in my budget. i was able to get everything that i needed and i told all my nursing  friends about this store. needless to say i was very very pleased and i will be returning',
        b"one off my favorite spot in vegas to eat. to bad that my bf is stubborn and afraid of food poisoning so we didn't eat there if not i will defiantly go back.\\n\\nanyways the menu change during the day which is cool but i think in my opinion the best menu is at midnight until morning.\\n\\nthe portion is big so you can be 2-3 people in one plate depends on what you order. last time i order that was something with f

In [75]:
train_data = df_to_tensor(train)
valid_data = df_to_tensor(valid)
test_data = df_to_tensor(test)

In [76]:
import tensorflow_hub as hub
import tensorflow_datasets as tfds

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.list_physical_devices("GPU") else "NOT AVAILABLE")

Version:  2.4.1
Eager mode:  True
Hub version:  0.11.0
GPU is available


Embeddings from Tensorflow Hub

In [15]:
## Embeddings will change the shape of all sentences to a dimension of 50(in case of this embedding), irrespective of sentence length
embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, input_shape=[], 
                           dtype=tf.string, trainable=True)

In [80]:
### Temporary examples of embeddings
temp_train, temp_labels = next(iter(train_data.batch(5)))
hub_layer(temp_train[:2])

<tf.Tensor: shape=(2, 50), dtype=float32, numpy=
array([[ 0.3326142 ,  0.13766728, -0.10483283,  0.78116024,  0.04657589,
         0.10187986,  0.7211402 , -0.2967714 , -0.3339088 ,  0.34028798,
        -0.01863033,  0.35074052,  0.30993348,  0.11579325, -0.30461624,
        -0.3857032 , -0.03695595,  0.18487148,  0.15475176, -0.5163114 ,
        -0.19074649,  0.06762648,  0.33590364, -0.03583926, -0.31726542,
         0.24449469, -1.3563333 ,  0.01606731,  0.15576953, -0.12061935,
        -0.0585728 ,  0.61368513,  0.92193604, -0.4416206 , -0.5482102 ,
         0.20726408,  0.33545595, -0.10232715,  0.16410267, -0.30076176,
         0.11632676,  0.5154865 , -0.29869187,  0.47768974, -0.05003402,
         0.0180978 , -0.19847   , -0.5053115 , -0.12412173,  0.15891649],
       [ 0.8205721 ,  0.02564078, -0.2729957 ,  0.5263815 , -0.19107832,
        -0.22988442,  0.5678372 ,  0.17527197, -0.7139991 ,  0.44599622,
         0.47246113,  0.21661897,  0.14366399,  0.15856886, -0.4217355 ,
 

In [81]:
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer (KerasLayer)     (None, 50)                48190600  
_________________________________________________________________
dense (Dense)                (None, 16)                816       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 48,191,433
Trainable params: 48,191,433
Non-trainable params: 0
_________________________________________________________________


In [82]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

In [84]:
history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=10,
                    validation_data=valid_data.batch(512),
                    verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Evaluate model on test data

In [85]:
results = model.evaluate(test_data.batch(512), verbose=2)

for name, value in zip(model.metrics_names, results):
  print("%s: %.3f" % (name, value))

646/646 - 6s - loss: 0.0772 - accuracy: 0.9673
loss: 0.077
accuracy: 0.967
