Copyright 2023 Aaryan Chandna

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

I treated this as a sequential-data task: each input is a passage of text, and the goal is to predict a binary output (product related or not). The model below is simpler than a true recurrent network. It has no recurrent layer; it is an embedding layer followed by global average pooling and a single dense output unit. Vectorizing the words was important for the first layer, since the embedding lets the network learn relations between words directly. I chose an embedding dimension of 16 because this is a small dataset, so a smaller embedding is more efficient, and I set the batch size to 1 for a similar reason. I trained for 20 epochs in a first run and then for 10 more epochs in a second run that continues from the same weights, aiming to improve accuracy while limiting overfitting. Test accuracy was about 69% after the first run and about 75% after the second, so the model consistently lands at around 70% accuracy on test cases.

In [1]:
#imports
import matplotlib.pyplot as plt
import os
import re
import shutil
import string
import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses



In [2]:
import pandas as pd
df = pd.read_csv('trainingsample.csv')

In [4]:
df['Ans']=df['product_related'].apply(lambda x: 1 if x=='Yes' else 0)
df = df.drop('product_related', axis=1)
df.head()

                                             Content  Content_Length  Ans
0  The benefits I think we will see from the chan...              55    0
1  I would just add one more thing. While we shou...              64    0
2  Ken, I don't have that number at my fingertips...              51    0
3  I think the only part of the segment that I di...             137    1
4  No, nothing has changed. I have been an invest...              75    0


In [5]:
df = df.drop('Content_Length', axis=1)

In [6]:
#splitting data into 3 sets: train and validation for creating model, and test for unbiased examination of model
from sklearn.model_selection import train_test_split

trainer, test = train_test_split(df, test_size=0.25)
train, val = train_test_split(trainer, test_size=0.15)
print(trainer.shape)

(225, 2)
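
The two splits above can be checked with a little arithmetic. This is only a sketch: the total of 300 rows is an assumption (it is the total implied by 225 rows being 75% of the data), `split_sizes` is a hypothetical helper, and the rounding rule (held-out part rounded up, remainder kept) is my understanding of how `train_test_split` handles a fractional `test_size`.

```python
import math

def split_sizes(n_samples, test_size):
    # Assumed rounding: the held-out part is rounded up, the rest is the remainder.
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

# 300 is an assumption: the total implied by 225 rows being 75% of the data.
total_rows = 300
n_trainer, n_test = split_sizes(total_rows, 0.25)  # trainer (train + val) vs test
n_train, n_val = split_sizes(n_trainer, 0.15)      # train vs validation
print(n_trainer, n_test, n_train, n_val)
```

Under these assumptions the model is fit on fewer than 200 examples, which is part of why the embedding is kept small.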


In [7]:
#converting pandas dfs to tensorflow
train_dataset=tf.data.Dataset.from_tensor_slices((train['Content'].values, train['Ans'].values))
val_dataset=tf.data.Dataset.from_tensor_slices((val['Content'].values, val['Ans'].values))
test_dataset=tf.data.Dataset.from_tensor_slices((test['Content'].values, test['Ans'].values))

In [8]:
#shuffling data into batches of size 1
train_dataset=train_dataset.shuffle(10000).batch(1)
val_dataset = val_dataset.shuffle(10000).batch(1)
test_dataset=test_dataset.shuffle(10000).batch(1)

In [9]:
#cleaning input strings
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation),
                                  '')
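
To see what this cleaning step does to a single string, here is a plain-Python sketch of the same three steps (lowercase, replace `<br />` with a space, strip punctuation). `standardize_plain` is a hypothetical helper for illustration only; the notebook itself uses the `tf.strings` version above, and the sample sentence is made up.

```python
import re
import string

def standardize_plain(text):
    # Hypothetical plain-Python mirror of custom_standardization, for illustration only.
    lowercase = text.lower()
    stripped_html = lowercase.replace('<br />', ' ')
    # Same character class as above: every ASCII punctuation mark is removed.
    return re.sub('[%s]' % re.escape(string.punctuation), '', stripped_html)

print(standardize_plain("Ken, I don't have that number.<br />Thanks!"))
```

Note that apostrophes are punctuation too, so "don't" becomes "dont" before tokenization.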

In [10]:
#creating word vectorization layer
max_features = 10000
sequence_length = 250

vectorize_layer = layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length)

In [11]:
# Make a text-only dataset (without labels), then call adapt
train_text = train_dataset.map(lambda x, y: x)
vectorize_layer.adapt(train_text)


In [12]:
#vectorizing text method
def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label

In [13]:
#vectorizing dataset strings
train_ds = train_dataset.map(vectorize_text)
val_ds = val_dataset.map(vectorize_text)
test_ds = test_dataset.map(vectorize_text)

In [14]:
#tuning data
AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)

In [15]:
embedding_dim = 16

In [16]:
#creating model
model = tf.keras.Sequential([
  layers.Embedding(max_features + 1, embedding_dim),
  layers.Dropout(0.2),
  layers.GlobalAveragePooling1D(),
  layers.Dropout(0.2),
  layers.Dense(1)])

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 16)          160016    
                                                                 
 dropout (Dropout)           (None, None, 16)          0         
                                                                 
 global_average_pooling1d (G  (None, 16)               0         
 lobalAveragePooling1D)                                          
                                                                 
 dropout_1 (Dropout)         (None, 16)                0         
                                                                 
 dense (Dense)               (None, 1)                 17        
                                                                 
Total params: 160,033
Trainable params: 160,033
Non-trainable params: 0
__________________________________________________
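
The parameter counts in the summary can be reproduced by hand. A small sketch using the same `max_features` and `embedding_dim` values as the notebook:

```python
max_features = 10000
embedding_dim = 16

# Embedding: one 16-dimensional vector per token id (max_features + 1 ids).
embedding_params = (max_features + 1) * embedding_dim
# Dense(1): 16 weights (one per pooled feature) plus 1 bias.
dense_params = embedding_dim * 1 + 1
# Dropout and GlobalAveragePooling1D have no trainable parameters.
total_params = embedding_params + dense_params
print(embedding_params, dense_params, total_params)
```

Almost all of the parameters sit in the embedding table, which is why the embedding dimension is the main knob for model size here.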

In [17]:
#Dense(1) outputs raw logits, so the loss uses from_logits=True and accuracy thresholds at 0.0
model.compile(loss=losses.BinaryCrossentropy(from_logits=True),
              optimizer='adam',
              metrics=[tf.metrics.BinaryAccuracy(threshold=0.0)])
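
The accuracy threshold of 0.0 may look odd next to the usual 0.5. Because the model outputs logits, thresholding the logit at 0 gives the same label as thresholding the sigmoid probability at 0.5. A minimal sketch with hypothetical helper names (away from values extremely close to zero, where floating-point rounding can blur the boundary):

```python
import math

def sigmoid(logit):
    return 1.0 / (1.0 + math.exp(-logit))

def label_from_logit(logit):
    # Threshold applied directly to the raw model output.
    return 1 if logit > 0.0 else 0

def label_from_probability(logit):
    # Threshold applied after converting the logit to a probability.
    return 1 if sigmoid(logit) > 0.5 else 0

print(sigmoid(0.0))
for logit in (-2.0, -0.1, 0.1, 1.7):
    assert label_from_logit(logit) == label_from_probability(logit)
```

This is also why a sigmoid layer can be appended later for export without changing which class the model picks.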

In [18]:
#model training
epochs = 20
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs)

Epoch 1/20


Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [19]:
loss, accuracy = model.evaluate(test_ds)

print("Loss: ", loss)
print("Accuracy: ", accuracy)

Loss:  0.5829906463623047
Accuracy:  0.6933333277702332



In [20]:
#2nd training run: 10 more epochs, continuing from the weights learned in the first run
epochs = 10
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [21]:
loss, accuracy = model.evaluate(test_ds)

print("Loss: ", loss)
print("Accuracy: ", accuracy)

Loss:  0.5216538310050964
Accuracy:  0.746666669845581


In [22]:
export_model = tf.keras.Sequential([
  vectorize_layer,
  model,
  layers.Activation('sigmoid')
])

export_model.compile(
    loss=losses.BinaryCrossentropy(from_logits=False), optimizer="adam", metrics=['accuracy']
)

loss, accuracy = export_model.evaluate(test_dataset)
print(accuracy)

0.746666669845581


In [25]:
example = ["And it is so far. It just depends on -- what we're hearing from those clients is not that they're making cuts to their capital spending programs, but that they might if the country doesn't open back up or their respective states don't open up because their biggest consumers of gas are restaurants, bars, hotel services, hotels that have restaurants and bars and that type of industry. So if those demands stay down, they may not have the need. You're not seeing a lot of -- you're seeing drops in new housing and some of those -- so those customers, those new customers that they would have, there won't be that need there. So there'll be reductions there. As far as replacement or maintenance of existing systems or old systems, that work is going to continue."]
export_model.predict(example)



array([[0.8452713]], dtype=float32)

In [27]:
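def prediction(text):
  #predict expects a batch, so wrap the single string in a list
  listText = [text]
  #predict returns an array of shape (1, 1); take the single probability
  probability = float(export_model.predict(listText)[0][0])
  #label as product related (1) when the sigmoid output is at least 0.5
  return 1 if probability >= 0.5 else 0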

In [28]:
prediction("And it is so far. It just depends on -- what we're hearing from those clients is not that they're making cuts to their capital spending programs, but that they might if the country doesn't open back up or their respective states don't open up because their biggest consumers of gas are restaurants, bars, hotel services, hotels that have restaurants and bars and that type of industry. So if those demands stay down, they may not have the need. You're not seeing a lot of -- you're seeing drops in new housing and some of those -- so those customers, those new customers that they would have, there won't be that need there. So there'll be reductions there. As far as replacement or maintenance of existing systems or old systems, that work is going to continue.")



1