<p align="center"><img width="50%" src="https://aimodelsharecontent.s3.amazonaws.com/aimodshare_banner.jpg" /></p>


---

## Stanford Sentiment Treebank - Movie Review Classification Competition
Let's share our models to a centralized leaderboard, so that we can collaborate and learn from the model experimentation process...

**Instructions:**
1. Get data in and set up X_train / X_test / y_train
2. Preprocess data using the Keras Tokenizer / write and save a preprocessor function
3. Fit a model on the preprocessed data and save the preprocessor function and model
4. Generate predictions from X_test data and submit the model to the competition
5. Repeat the submission process to improve your place on the leaderboard



## 1. Get data in and set up X_train, X_test, y_train objects

In [2]:
# Get competition data
from aimodelshare import download_data
download_data('public.ecr.aws/y2e2a1d6/sst2_competition_data-repository:latest') 


Data downloaded successfully.


In [1]:
# Set up X_train, X_test, and y_train_labels objects
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=Warning)

X_train=pd.read_csv("sst2_competition_data/X_train.csv").squeeze("columns") 
X_test=pd.read_csv("sst2_competition_data/X_test.csv").squeeze("columns") 

y_train_labels=pd.read_csv("sst2_competition_data/y_train_labels.csv").squeeze("columns") 

# One-hot encode the y labels
y_train = pd.get_dummies(y_train_labels)

X_train.head()

0    The Rock is destined to be the 21st Century 's...
1    The gorgeously elaborate continuation of `` Th...
2    Singer/composer Bryan Adams contributes a slew...
3                 Yet the act is still charming here .
4    Whether or not you 're enlightened by any of D...
Name: text, dtype: object
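
The one-hot encoding step above can be illustrated without pandas. This is a minimal plain-Python sketch, not the library's implementation; the sample labels are made up, and the sorted column order is chosen to match the Negative, Positive order that `pd.get_dummies` produces for these labels.

```python
# Made-up sample labels, standing in for y_train_labels.
labels = ["Positive", "Negative", "Positive", "Negative", "Negative"]

# One column per distinct label, in sorted order.
columns = sorted(set(labels))

# One row per label, with a 1 in the column that matches it and 0 elsewhere.
one_hot = [[int(label == col) for col in columns] for label in labels]

print(columns)
for row in one_hot:
    print(row)
```

Each row sums to 1, which is the property the softmax output layer and the cross-entropy losses used later rely on.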

## 2. Discuss the dataset in general terms and describe why building a predictive model using this data might be practically useful.  Who could benefit from a model like this? Explain.

In [2]:
X_train

0       The Rock is destined to be the 21st Century 's...
1       The gorgeously elaborate continuation of `` Th...
2       Singer/composer Bryan Adams contributes a slew...
3                    Yet the act is still charming here .
4       Whether or not you 're enlightened by any of D...
                              ...                        
6915                                      A real snooze .
6916                                       No surprises .
6917    We 've seen the hippie-turned-yuppie plot befo...
6918    Her fans walked out muttering words like `` ho...
6919                                  In this case zero .
Name: text, Length: 6920, dtype: object

In [3]:
X_test

0       If you sometimes like to go to the movies to h...
1       Emerges as something rare , an issue movie tha...
2       Offers that rare combination of entertainment ...
3       Perhaps no picture ever made has more literall...
4       Steers turns in a snappy screenplay that curls...
                              ...                        
1816                     An imaginative comedy/thriller .
1817                        ( A ) rare , beautiful film .
1818                   ( An ) hilarious romantic comedy .
1819                  Never ( sinks ) into exploitation .
1820                          ( U ) nrelentingly stupid .
Name: text, Length: 1821, dtype: object

In [4]:
y_train

      Negative  Positive
0            0         1
1            0         1
2            0         1
3            0         1
4            0         1
...        ...       ...
6915         1         0
6916         1         0
6917         0         1
6918         1         0


### The training dataset contains 6,920 observations and the test dataset contains 1,821. X holds the movie review text, and y holds the one-hot encoded sentiment label (negative or positive) for each training review.
### A sentiment classifier for movie reviews built with deep learning techniques such as neural networks can process large volumes of unstructured review text, learn useful features automatically, and categorize each review as positive or negative. Such models tend to become more accurate as they are trained on more data, which benefits movie critics and review websites. Automating the classification of reviews can reduce manual effort and human error and give more consistent results.
### Similarly, classic machine learning models can be trained on labeled data to classify new movie reviews into sentiment categories. These classifiers can be deployed in real-world settings; for example, a movie studio could use one to analyze audience sentiment and adjust its marketing strategy or make changes to a movie based on feedback. Once trained, a model can score reviews of newly released movies without being retrained for each release, although periodic retraining on fresh data helps keep it accurate. This provides insight into audience reactions and preferences and helps the film industry make informed decisions.

## 3.   Preprocess data using keras tokenizer / Write and Save Preprocessor function


In [14]:
# This preprocessor function makes use of the tf.keras tokenizer

from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import pad_sequences
import numpy as np

# Build vocabulary from training text data
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(X_train)

# preprocessor tokenizes words and pads/truncates every document to the same length.
# Note: max_words is not used inside the function; the vocabulary size is fixed
# by the Tokenizer above (num_words=10000).
def preprocessor(data, maxlen=100, max_words=10000):

    sequences = tokenizer.texts_to_sequences(data)
    X = pad_sequences(sequences, maxlen=maxlen)

    return X

print(preprocessor(X_train).shape)
print(preprocessor(X_test).shape)

(6920, 100)
(1821, 100)
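
To make the tokenize-and-pad step more concrete, here is a rough plain-Python sketch of the same idea. It is an approximation, not the Keras implementation: it splits on whitespace only (the Keras Tokenizer also strips punctuation), and the helper names `build_word_index` and `texts_to_padded` are hypothetical. It follows the Keras defaults of ranking words by frequency starting at 1, keeping only ranks below `num_words`, and padding and truncating at the front.

```python
from collections import Counter

def build_word_index(texts, num_words):
    """Rank words by frequency (1 = most frequent); keep ranks below num_words."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    return {word: rank for rank, word in enumerate(ranked, start=1) if rank < num_words}

def texts_to_padded(texts, word_index, maxlen):
    """Map words to ids, drop unknown words, then front-pad or front-truncate."""
    rows = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        ids = ids[-maxlen:]                           # keep the last maxlen tokens
        rows.append([0] * (maxlen - len(ids)) + ids)  # pad with zeros at the front
    return rows

toy_texts = ["a real snooze", "no surprises", "a real real treat for fans"]
word_index = build_word_index(toy_texts, num_words=10)
padded = texts_to_padded(toy_texts, word_index, maxlen=4)

for row in padded:
    print(row)
```

Every row has exactly `maxlen` entries, which is why the shapes printed above are (6920, 100) and (1821, 100) regardless of how long each review is.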


### Model 1 with LSTM


In [6]:
from tensorflow.keras.layers import Dense, Embedding,Flatten, LSTM, SimpleRNN
from tensorflow.keras.models import Sequential

model1 = Sequential()
model1.add(Embedding(10000, 64, input_length=100))
model1.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2)) 
model1.add(Dense(2, activation='softmax'))
model1.summary()

# try using different optimizers and different optimizer configs

model1.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])


model1.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=64,
                    validation_split=0.2)

2023-04-12 16:44:08.726108: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 100, 64)           640000    
                                                                 
 lstm (LSTM)                 (None, 128)               98816     
                                                                 
 dense (Dense)               (None, 2)                 258       
                                                                 
Total params: 739,074
Trainable params: 739,074
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fc89d15da90>
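
The parameter counts in the summary above can be reproduced by hand. The sketch below uses the standard formulas for Embedding, LSTM, and Dense layers; the helper function names are only for illustration.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary index.
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates (input, forget, cell, output), each with input weights,
    # recurrent weights, and a bias vector.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

counts = {
    "embedding": embedding_params(10000, 64),
    "lstm": lstm_params(64, 128),
    "dense": dense_params(128, 2),
}
total = sum(counts.values())
print(counts, total)
```

The totals agree with the 640,000, 98,816, and 258 parameters reported by `model1.summary()`. Most of the model's capacity sits in the embedding table, not in the LSTM.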

#### Save preprocessor function to local "preprocessor.zip" file

In [7]:
import aimodelshare as ai
ai.export_preprocessor(preprocessor,"") 

Your preprocessor is now saved to 'preprocessor.zip'


#### Save model to local ".onnx" file

In [8]:
# Save tf.keras model (or any tensorflow model) to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model1, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

2023-04-12 16:46:14,569 - INFO - Signatures found in model: [serving_default].
2023-04-12 16:46:14,569 - INFO - Output names: ['dense']
2023-04-12 16:46:14,771 - INFO - Using tensorflow=2.12.0, onnx=1.13.1, tf2onnx=1.14.0/8f8d49
2023-04-12 16:46:14,771 - INFO - Using opset <onnx, 13>
2023-04-12 16:46:14,815 - INFO - Computed 0 values for constant folding
2023-04-12 16:46:14,831 - INFO - Computed 0 values for constant folding
2023-04-12 16:46:14,846 - INFO - Computed 1 values for constant folding
2023-04-12 16:46:14,861 - INFO - folding node using tf type=StridedSlice, name=StatefulPartitionedCall/sequential/lstm/strided_slice_1
2023-04-12 16:46:14,919 - INFO - Optimizing ONNX model
2023-04-12 16:46:15,409 - INFO - After optimization: Cast -3 (10->7), Concat -1 (2->1), Const -23 (52->29), Expand -1 (3->2), Identity -30 (31->1), Placeholder -1 (11->10), Squeeze -1 (2->1), Unsqueeze -4 (4->0)
2023-04-12 16:46:15,421 - INFO - 
2023-04-12 16:46:15,421 - INFO - Successfully converted TensorF

In [9]:
with open("model1.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [10]:
#Set credentials using modelshare.org username/password

from aimodelshare.aws import set_credentials
    
apiurl="https://rlxjxnoql9.execute-api.us-east-1.amazonaws.com/prod/m" #This is the unique rest api that powers this specific Playground

set_credentials(apiurl=apiurl)

AI Modelshare Username:········
AI Modelshare Password:········
AI Model Share login credentials set successfully.


In [11]:
#Instantiate Competition

mycompetition= ai.Competition(apiurl)

In [13]:
#Submit Model 1: 

#-- Generate predicted y values (Model 1)
#Note: Keras predict returns class probabilities; argmax(axis=1) picks the column index of the most likely class
prediction_column_index=model1.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 1 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model1.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 58

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763
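
The prediction step in the submission cell turns softmax probabilities into label strings. A small plain-Python stand-in for `.argmax(axis=1)` shows the mapping; the probabilities below are invented for illustration, and the column order is assumed to be the Negative, Positive order of `y_train`.

```python
# Hypothetical softmax outputs for three reviews; columns follow y_train's order.
columns = ["Negative", "Positive"]
probabilities = [[0.91, 0.09], [0.20, 0.80], [0.45, 0.55]]

# argmax: index of the largest probability in each row.
predicted_index = [max(range(len(row)), key=row.__getitem__) for row in probabilities]

# Look up the label name for each predicted column index.
prediction_labels = [columns[i] for i in predicted_index]
print(prediction_labels)
```

Because the labels are looked up through `y_train.columns`, the submission stays correct as long as the model was trained on the same one-hot column order.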


In [14]:
# Get leaderboard to explore current best model architectures

# Get raw data in pandas data frame
data = mycompetition.get_leaderboard()

# Stylize leaderboard data
mycompetition.stylize_leaderboard(data)

Unnamed: 0,accuracy,f1_score,precision,recall,ml_framework,transfer_learning,deep_learning,model_type,depth,num_params,embedding_layers,conv1d_layers,maxpooling1d_layers,dropout_layers,flatten_layers,lstm_layers,inputlayer_layers,bidirectional_layers,globalmaxpooling1d_layers,globalaveragepooling1d_layers,dense_layers,sigmoid_act,softmax_act,tanh_act,relu_act,loss,optimizer,memory_size,team,username,version
0,80.90%,80.89%,80.96%,80.90%,keras,,True,Sequential,4.0,640130.0,1.0,,,1.0,,,,,,1.0,1.0,,1.0,,,str,RMSprop,3111632.0,,chachagsedaro,26
1,80.57%,80.35%,82.02%,80.58%,keras,,True,Sequential,4.0,201154.0,1.0,,,,1.0,,,,,,2.0,,2.0,,,str,RMSprop,805360.0,,1jiahe,46
2,79.80%,79.63%,80.85%,79.81%,keras,,True,Sequential,5.0,193702.0,1.0,,,,1.0,2.0,,,,,1.0,,1.0,2.0,,str,RMSprop,776112.0,,amsay99,43
3,80.13%,80.13%,80.16%,80.13%,keras,,True,Sequential,4.0,640130.0,1.0,,,1.0,,,,,,1.0,1.0,,1.0,,,str,RMSprop,3111632.0,,chachagsedaro,25
4,79.58%,79.46%,80.34%,79.59%,keras,,True,Sequential,4.0,206850.0,1.0,,,,1.0,1.0,,,,,1.0,,1.0,1.0,,str,RMSprop,828272.0,,jer2240,51
5,79.25%,79.06%,80.41%,79.26%,keras,,True,Sequential,5.0,174658.0,1.0,,,,1.0,2.0,,,,,1.0,,1.0,2.0,,str,RMSprop,699936.0,7.0,lprockop,55
6,77.94%,77.77%,78.80%,77.95%,keras,,True,Sequential,3.0,739074.0,1.0,,,,,1.0,,,,,1.0,,1.0,1.0,,str,RMSprop,2957168.0,,francesyang,58
7,77.50%,77.21%,78.98%,77.51%,keras,,True,Sequential,3.0,164290.0,1.0,,,,,,,1.0,,,1.0,1.0,,,,str,RMSprop,658464.0,,adrianwang,38
8,77.50%,77.39%,78.07%,77.51%,keras,,True,Sequential,3.0,161282.0,1.0,,,,1.0,,,,,,1.0,,1.0,,,str,RMSprop,645600.0,,adrianwang,31
9,77.50%,77.41%,77.97%,77.50%,keras,,True,Sequential,3.0,161282.0,1.0,,,,1.0,,,,,,1.0,,1.0,,,str,RMSprop,645600.0,,lprockop,44


### Model 2 with Conv1D


In [44]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, Flatten, Conv1D, InputLayer, MaxPooling1D, GlobalMaxPooling1D

model2 = Sequential()
model2.add(Embedding(10000, 64, input_length=100))
model2.add(Conv1D(filters=32, kernel_size=2, padding='same', activation='relu'))
model2.add(MaxPooling1D(2))
model2.add(Flatten())
model2.add(Dense(64, activation="relu"))
model2.add(Dense(2, activation='softmax'))

model2.summary()
model2.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model2.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)


Model: "sequential_27"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_25 (Embedding)    (None, 100, 64)           640000    
                                                                 
 conv1d_24 (Conv1D)          (None, 100, 32)           4128      
                                                                 
 max_pooling1d_14 (MaxPoolin  (None, 50, 32)           0         
 g1D)                                                            
                                                                 
 flatten_13 (Flatten)        (None, 1600)              0         
                                                                 
 dense_32 (Dense)            (None, 64)                102464    
                                                                 
 dense_33 (Dense)            (None, 2)                 130       
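
Model 2 is compiled with `binary_crossentropy`, while Model 1 uses `categorical_crossentropy`. For a two-unit softmax output with one-hot labels the two losses give the same value, which the sketch below checks numerically. It assumes, as Keras does, that binary cross-entropy is averaged over the output units; the functions are simplified re-implementations, not the Keras code.

```python
import math

def categorical_crossentropy(y_true, y_pred):
    # Only the true class contributes, since the other targets are 0.
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

def binary_crossentropy(y_true, y_pred):
    # Element-wise binary cross-entropy, averaged over the output units.
    terms = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
             for t, p in zip(y_true, y_pred)]
    return sum(terms) / len(terms)

y_true = [0, 1]        # one-hot label: Positive
y_pred = [0.3, 0.7]    # softmax output, sums to 1

cce = categorical_crossentropy(y_true, y_pred)
bce = binary_crossentropy(y_true, y_pred)
print(cce, bce)
```

The equality holds only because the two probabilities sum to 1; with more than two classes the losses differ, and `categorical_crossentropy` is the usual choice.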
                                                     

In [45]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model2, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model2.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

2023-04-12 17:11:52,792 - INFO - Signatures found in model: [serving_default].
2023-04-12 17:11:52,792 - INFO - Output names: ['dense_33']
2023-04-12 17:11:53,021 - INFO - Using tensorflow=2.12.0, onnx=1.13.1, tf2onnx=1.14.0/8f8d49
2023-04-12 17:11:53,021 - INFO - Using opset <onnx, 13>
2023-04-12 17:11:53,036 - INFO - Computed 0 values for constant folding
2023-04-12 17:11:53,065 - INFO - Optimizing ONNX model
2023-04-12 17:11:53,293 - INFO - After optimization: Cast -3 (4->1), Const -2 (15->13), Identity -2 (2->0), Reshape -1 (3->2), Transpose -2 (4->2)
2023-04-12 17:11:53,301 - INFO - 
2023-04-12 17:11:53,301 - INFO - Successfully converted TensorFlow model /var/folders/l0/b3zg_5450c93plszkqzs6dm00000gn/T/tmps32ti9mr to ONNX
2023-04-12 17:11:53,301 - INFO - Model inputs: ['embedding_25_input']
2023-04-12 17:11:53,301 - INFO - Model outputs: ['dense_33']
2023-04-12 17:11:53,301 - INFO - ONNX model is saved at /var/folders/l0/b3zg_5450c93plszkqzs6dm00000gn/T/tmps32ti9mr/temp.onnx


In [46]:
#Submit Model 2: 

#-- Generate predicted y values (Model 2)
prediction_column_index=model2.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 2 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model2.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 59

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


In [47]:
# Compare two or more models 
data=mycompetition.compare_models([58, 59], verbose=1)
mycompetition.stylize_compare(data)

Unnamed: 0,Model_58_Layer,Model_58_Shape,Model_58_Params,Model_59_Layer,Model_59_Shape,Model_59_Params
0,Embedding,"[None, 100, 64]",640000.0,Embedding,"[None, 100, 64]",640000
1,LSTM,"[None, 128]",98816.0,Conv1D,"[None, 100, 32]",4128
2,Dense,"[None, 2]",258.0,MaxPooling1D,"[None, 50, 32]",0
3,,,,Flatten,"[None, 1600]",0
4,,,,Dense,"[None, 64]",102464
5,,,,Dense,"[None, 2]",130


### Model 3 with transfer learning with GloVe embeddings

In [62]:
! wget http://nlp.stanford.edu/data/wordvecs/glove.6B.zip

--2023-04-12 17:27:29--  http://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/wordvecs/glove.6B.zip [following]
--2023-04-12 17:27:29--  https://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip [following]
--2023-04-12 17:27:29--  https://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182753 (822M) [app

In [63]:
! unzip glove.6B.zip 

Archive:  glove.6B.zip
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       
  inflating: glove.6B.50d.txt        


In [87]:
import os
# Extract embedding data for the 300-dimensional embedding matrix
glove_dir = os.getcwd()

embeddings_index = {}
# Use a context manager so the file is closed even if parsing fails
with open(os.path.join(glove_dir, 'glove.6B.300d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400001 word vectors.


In [68]:
word_index = tokenizer.word_index
max_words = 10000

In [88]:
# Build embedding matrix
embedding_dim = 300
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        # Words not found in the embedding index keep their all-zero row.
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
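
A toy version of the two cells above shows what ends up in the embedding matrix. The GloVe-style lines and the vocabulary are made up, and plain lists stand in for NumPy arrays; treat it as a sketch of the lookup logic only.

```python
# Two made-up GloVe-style lines: a word followed by its vector components.
glove_lines = ["movie 0.1 0.2 0.3", "great 0.4 0.5 0.6"]

embeddings_index = {}
for line in glove_lines:
    word, *values = line.split()
    embeddings_index[word] = [float(v) for v in values]

# Hypothetical tokenizer vocabulary; index 0 is reserved for padding.
word_index = {"movie": 1, "snoozefest": 2, "great": 3, "rare": 4}
max_words, embedding_dim = 4, 3

embedding_matrix = [[0.0] * embedding_dim for _ in range(max_words)]
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    # Skip indices outside the matrix and words without a pre-trained vector.
    if i < max_words and vector is not None:
        embedding_matrix[i] = vector

for row in embedding_matrix:
    print(row)
```

Row 0 (padding) and row 2 (a word missing from the pre-trained vectors) stay all-zero, and the word with index 4 is dropped because it falls outside `max_words`.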

In [93]:
# Set up a simple Embedding + Flatten + Dense model, then import the GloVe weights into the Embedding layer:

model3 = Sequential()
model3.add(Embedding(max_words, embedding_dim, input_length=100))
model3.add(Flatten())
model3.add(Dense(64, activation='relu'))
model3.add(Dense(2, activation='softmax'))
model3.summary()

Model: "sequential_11"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_10 (Embedding)    (None, 100, 300)          3000000   
                                                                 
 flatten_10 (Flatten)        (None, 30000)             0         
                                                                 
 dense_18 (Dense)            (None, 64)                1920064   
                                                                 
 dense_19 (Dense)            (None, 2)                 130       
                                                                 
Total params: 4,920,194
Trainable params: 4,920,194
Non-trainable params: 0
_________________________________________________________________


In [94]:
# Load the GloVe weights into the Embedding layer, as in transfer learning, and turn off its trainable option before fitting so the weights stay frozen.
model3.layers[0].set_weights([embedding_matrix])
model3.layers[0].trainable = False



model3.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['acc'])
model3.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)
model3.save_weights('pre_trained_glove_model.h5')

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [95]:
# Save tf.keras model (or any tensorflow model) to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model3, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

2023-04-12 17:55:56,532 - INFO - Signatures found in model: [serving_default].
2023-04-12 17:55:56,532 - INFO - Output names: ['dense_19']
2023-04-12 17:55:57,309 - INFO - Using tensorflow=2.12.0, onnx=1.13.1, tf2onnx=1.14.0/8f8d49
2023-04-12 17:55:57,309 - INFO - Using opset <onnx, 13>
2023-04-12 17:55:57,417 - INFO - Computed 0 values for constant folding
2023-04-12 17:55:57,516 - INFO - Optimizing ONNX model
2023-04-12 17:55:57,992 - INFO - After optimization: Cast -1 (2->1), Const -1 (7->6), Identity -2 (2->0)
2023-04-12 17:55:58,022 - INFO - 
2023-04-12 17:55:58,022 - INFO - Successfully converted TensorFlow model /var/folders/l0/b3zg_5450c93plszkqzs6dm00000gn/T/tmp8ziwwhnw to ONNX
2023-04-12 17:55:58,022 - INFO - Model inputs: ['embedding_10_input']
2023-04-12 17:55:58,022 - INFO - Model outputs: ['dense_19']
2023-04-12 17:55:58,023 - INFO - ONNX model is saved at /var/folders/l0/b3zg_5450c93plszkqzs6dm00000gn/T/tmp8ziwwhnw/temp.onnx


In [96]:
with open("model3.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [97]:
#Submit Model 3: 


prediction_column_index=model3.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 3 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model3.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 62

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


In [98]:
# Compare two or more models 
data=mycompetition.compare_models([58, 59, 62], verbose=1)
mycompetition.stylize_compare(data)

Unnamed: 0,Model_58_Layer,Model_58_Shape,Model_58_Params,Model_59_Layer,Model_59_Shape,Model_59_Params,Model_62_Layer,Model_62_Shape,Model_62_Params
0,Embedding,"[None, 100, 64]",640000.0,Embedding,"[None, 100, 64]",640000,Embedding,"[None, 100, 300]",3000000.0
1,LSTM,"[None, 128]",98816.0,Conv1D,"[None, 100, 32]",4128,Flatten,"[None, 30000]",0.0
2,Dense,"[None, 2]",258.0,MaxPooling1D,"[None, 50, 32]",0,Dense,"[None, 64]",1920064.0
3,,,,Flatten,"[None, 1600]",0,Dense,"[None, 2]",130.0
4,,,,Dense,"[None, 64]",102464,,,
5,,,,Dense,"[None, 2]",130,,,


## 4. Discuss which models performed better and point out relevant hyper-parameter values for successful models.

### Model 1 with an LSTM layer performed the best, with 0.779 accuracy, 0.778 F1 score, 0.788 precision, and 0.779 recall. In my opinion, the LSTM performed better because it can capture contextual information and long-term dependencies in text. This is crucial for determining the sentiment of a review, since movie reviews often contain contextual structures that are challenging for traditional machine learning models. By remembering important information from earlier words in the sequence, an LSTM can make more informed predictions about the sentiment of the entire review. LSTMs can also process sequences of varying length, which suits text data where review length varies considerably (movie reviews in general can range from a few words to several paragraphs); in this notebook, the reviews are padded or truncated to a fixed length of 100 tokens before they reach the model. For the embedding layer parameters, I will experiment with different max words and max length values in the following models.
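
One way to choose candidate values for the max length is to look at how long the reviews actually are. The sketch below is only an illustration: `length_percentile` is a hypothetical helper, token counts come from whitespace splitting, and the sample reviews are a handful of short ones taken from the outputs above.

```python
def length_percentile(texts, percent):
    """Smallest token count that covers at least `percent` % of the texts."""
    lengths = sorted(len(text.split()) for text in texts)
    # Integer ceiling of percent/100 * n, converted to a 0-based position.
    position = max(0, (percent * len(lengths) + 99) // 100 - 1)
    return lengths[position]

sample_reviews = [
    "A real snooze .",
    "No surprises .",
    "In this case zero .",
    "Yet the act is still charming here .",
    "An imaginative comedy/thriller .",
]

print(length_percentile(sample_reviews, 80))
print(length_percentile(sample_reviews, 100))
```

Run on the full `X_train`, a high percentile of this distribution gives a max length that keeps most reviews intact while avoiding long runs of padding.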

### Model 4: Tune model within range of hyperparameters with Keras Tuner

In [102]:
#Separate validation data 
from sklearn.model_selection import train_test_split
x_train_split, x_val, y_train_split, y_val = train_test_split(
     X_train, y_train, test_size=0.2, random_state=42)

In [104]:
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, Flatten
import keras_tuner as kt

#Define model structure & parameter search space with function
def build_model(hp):
    model = keras.Sequential()
    model.add(Embedding(10000, 64, input_length=100))
    model.add(LSTM(units=hp.Int("units", min_value=16, max_value=1024, step=32),
                   return_sequences=True, dropout=0.2, recurrent_dropout=0.2))
    model.add(Flatten())
    model.add(Dense(2, activation='softmax'))
    model.compile(
        optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"],
    )
    return model

#initialize the tuner (which will search through parameters)
tuner = kt.RandomSearch(
    hypermodel=build_model, 
    objective="val_accuracy", # objective to optimize
    max_trials=3, #max number of trials to run during search
    executions_per_trial=1, #higher number reduces variance of results; gauges model performance more accurately
    overwrite=True,
    directory="tuning_model",
    project_name="tuning_units",
)

tuner.search(preprocessor(x_train_split), y_train_split, epochs=1, validation_data=(preprocessor(x_val), y_val))


Trial 3 Complete [00h 10m 05s]
val_accuracy: 0.6965317726135254

Best val_accuracy So Far: 0.7348265647888184
Total elapsed time: 00h 11m 11s
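
It helps to see how much of the search space three trials cover. The sketch below lists the candidate values for `hp.Int("units", min_value=16, max_value=1024, step=32)`, assuming the values lie on a grid that starts at `min_value`, and mimics random search with Python's `random` module. Drawing without repeats is also an assumption, and the seed is arbitrary.

```python
import random

# Candidate unit counts: 16, 48, 80, ... up to the largest value <= 1024.
unit_choices = list(range(16, 1024 + 1, 32))

rng = random.Random(0)                    # fixed seed so the sketch is repeatable
trials = rng.sample(unit_choices, k=3)    # max_trials=3 draws from the grid

coverage = len(trials) / len(unit_choices)
print(len(unit_choices), unit_choices[-1], f"{coverage:.1%}")
```

With 32 candidate values, three trials of one epoch each explore under a tenth of the grid, so the best `val_accuracy` reported above is a rough indication and not a tuned optimum.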


In [105]:
# Build model with best hyperparameters

# Get up to the top 5 hyperparameter sets, best first.
best_hps = tuner.get_best_hyperparameters(5)
# Build the model with the best hp.
tuned_model = build_model(best_hps[0])
# Fit with the entire dataset.
tuned_model.fit(x=preprocessor(X_train), y=y_train, epochs=1)




<keras.callbacks.History at 0x7fc84db0e7f0>

In [106]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(tuned_model, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("tuned_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

2023-04-12 18:29:11,299 - INFO - Signatures found in model: [serving_default].
2023-04-12 18:29:11,299 - INFO - Output names: ['dense_1']
2023-04-12 18:29:11,576 - INFO - Using tensorflow=2.12.0, onnx=1.13.1, tf2onnx=1.14.0/8f8d49
2023-04-12 18:29:11,577 - INFO - Using opset <onnx, 13>
2023-04-12 18:29:11,626 - INFO - Computed 0 values for constant folding
2023-04-12 18:29:11,629 - INFO - Computed 0 values for constant folding
2023-04-12 18:29:11,678 - INFO - Computed 1 values for constant folding
2023-04-12 18:29:11,697 - INFO - folding node using tf type=StridedSlice, name=StatefulPartitionedCall/sequential_1/lstm_1/strided_slice_1
2023-04-12 18:29:11,765 - INFO - Optimizing ONNX model
2023-04-12 18:29:12,327 - INFO - After optimization: Cast -4 (11->7), Concat -1 (2->1), Const -18 (46->28), Expand -1 (3->2), Identity -30 (31->1), Placeholder -1 (11->10), Squeeze -1 (1->0), Unsqueeze -4 (4->0)
2023-04-12 18:29:12,340 - INFO - 
2023-04-12 18:29:12,341 - INFO - Successfully converted T

In [107]:
#Submit Model 4: 

#-- Generate predicted y values (Model 4)
prediction_column_index=tuned_model.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 4 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "tuned_model.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 64

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


### Model 5

In [39]:
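# Note: max_words is not used inside the function. The vocabulary size is set when
# the Tokenizer is built (num_words=10000 in section 3), so changing max_words here
# does not change which word indices are produced.
def preprocessor(data, maxlen=80, max_words=20000):

    sequences = tokenizer.texts_to_sequences(data)
    X = pad_sequences(sequences, maxlen=maxlen)

    return X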

print(preprocessor(X_train).shape)
print(preprocessor(X_test).shape)

(6920, 80)
(1821, 80)


In [40]:
from tensorflow.keras.layers import Dense, Embedding,Flatten, LSTM, SimpleRNN, Conv1D, MaxPooling1D
from tensorflow.keras.models import Sequential
model5 = Sequential()
model5.add(Embedding(20000, 32, input_length=80))
model5.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model5.add(MaxPooling1D(pool_size=2))
model5.add(Flatten())
model5.add(Dense(250, activation='relu'))
model5.add(Dense(2, activation='softmax'))
model5.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model5.summary()

Model: "sequential_9"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_6 (Embedding)     (None, 80, 32)            640000    
                                                                 
 conv1d_4 (Conv1D)           (None, 80, 32)            3104      
                                                                 
 max_pooling1d_4 (MaxPooling  (None, 40, 32)           0         
 1D)                                                             
                                                                 
 flatten_4 (Flatten)         (None, 1280)              0         
                                                                 
 dense_14 (Dense)            (None, 250)               320250    
                                                                 
 dense_15 (Dense)            (None, 2)                 502       
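
The shapes and parameter counts in this summary follow from the layer settings. The sketch below recomputes them with the standard Conv1D and Dense formulas; the helper names are for illustration only.

```python
def conv1d_params(kernel_size, in_channels, filters):
    # One kernel of size kernel_size x in_channels per filter, plus a bias each.
    return kernel_size * in_channels * filters + filters

def dense_params(input_dim, units):
    return input_dim * units + units

seq_len, embed_dim, filters = 80, 32, 32
pooled_len = seq_len // 2            # MaxPooling1D(pool_size=2) halves the length
flat_dim = pooled_len * filters      # Flatten joins the time steps and channels

layer_params = {
    "embedding": 20000 * embed_dim,
    "conv1d": conv1d_params(3, embed_dim, filters),
    "dense_250": dense_params(flat_dim, 250),
    "dense_2": dense_params(250, 2),
}
print(flat_dim, layer_params)
```

The convolution itself is cheap (3,104 parameters); the first Dense layer after Flatten accounts for most of the trainable weights outside the embedding.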
                                                      

In [41]:
model5.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=64,
                    validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f9b8e1e0550>

In [42]:
import aimodelshare as ai
ai.export_preprocessor(preprocessor,"") 

Your preprocessor is now saved to 'preprocessor.zip'


In [43]:
# Save tf.keras model (or any tensorflow model) to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model5, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

2023-04-12 19:12:18,754 - INFO - Signatures found in model: [serving_default].
2023-04-12 19:12:18,754 - INFO - Output names: ['dense_15']
2023-04-12 19:12:19,003 - INFO - Using tensorflow=2.12.0, onnx=1.13.1, tf2onnx=1.14.0/8f8d49
2023-04-12 19:12:19,003 - INFO - Using opset <onnx, 13>
2023-04-12 19:12:19,030 - INFO - Computed 0 values for constant folding
2023-04-12 19:12:19,063 - INFO - Optimizing ONNX model
2023-04-12 19:12:19,347 - INFO - After optimization: Cast -3 (4->1), Const -2 (15->13), Identity -2 (2->0), Reshape -1 (3->2), Transpose -2 (4->2)
2023-04-12 19:12:19,358 - INFO - 
2023-04-12 19:12:19,358 - INFO - Successfully converted TensorFlow model /var/folders/l0/b3zg_5450c93plszkqzs6dm00000gn/T/tmpmu_dqrau to ONNX
2023-04-12 19:12:19,359 - INFO - Model inputs: ['embedding_6_input']
2023-04-12 19:12:19,359 - INFO - Model outputs: ['dense_15']
2023-04-12 19:12:19,359 - INFO - ONNX model is saved at /var/folders/l0/b3zg_5450c93plszkqzs6dm00000gn/T/tmpmu_dqrau/temp.onnx


In [44]:
with open("model5.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [47]:
#Submit Model 5: 

#-- Generate predicted y values (Model 5)
#Note: Keras predict returns the predicted column index location for classification models
prediction_column_index=model5.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 5 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model5.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

100% [..........................................................] 21357 / 21357
Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 65

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763
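
The label extraction step in the submission cell can be illustrated without Keras. The sketch below assumes the model outputs one row of class probabilities per review and that the one-hot columns are named `Negative` and `Positive`; both the class names and the probabilities are hypothetical stand-ins for `y_train.columns` and `model5.predict(...)`.

```python
# Sketch of turning softmax outputs into label strings.
# ASSUMPTIONS: the class names and probabilities below are made up for illustration.

def argmax(row):
    # index of the largest value (first one wins on ties, like numpy's argmax)
    return max(range(len(row)), key=row.__getitem__)

columns = ["Negative", "Positive"]      # stands in for y_train.columns
probabilities = [
    [0.90, 0.10],
    [0.20, 0.80],
    [0.45, 0.55],
]

prediction_column_index = [argmax(row) for row in probabilities]
prediction_labels = [columns[i] for i in prediction_column_index]
print(prediction_labels)
```

Because the labels are looked up by column position, the order of the one-hot columns at training time has to match the order used here.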


### Model 6 with transformer block

In [3]:
import nltk, random
import numpy as np
from nltk.corpus import movie_reviews
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
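from lime.lime_text import LimeTextExplainer


# Transformer encoder block: self-attention and a feed-forward network,
# each followed by dropout, a residual connection, and layer normalization
class TransformerLayer(layers.Layer):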
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.001):
        super(TransformerLayer, self).__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads,
                                             key_dim=embed_dim)
        self.ffn = keras.Sequential([
            layers.Dense(ff_dim, activation="relu"),
            layers.Dense(embed_dim),
        ])
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training):
        attn_output = self.att(inputs, inputs)  # self-attention layer
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)  # layer norm
        ffn_output = self.ffn(out1)  #feed-forward layer
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)  # layer norm
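
To make the first half of `call` more concrete, here is a minimal pure-Python sketch of self-attention followed by a residual connection and layer normalization. It is a simplification, not what `layers.MultiHeadAttention` does internally: it uses a single head, skips the learned query/key/value projections and dropout, and omits the learned scale and offset of `LayerNormalization`. The token vectors are made up.

```python
import math

def softmax(xs):
    m = max(xs)                                   # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head scaled dot-product self-attention with no learned projections."""
    d = len(x[0])
    outputs, weights = [], []
    for q in x:
        # similarity of this token to every token, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        weights.append(w)
        # weighted average of all token vectors
        outputs.append([sum(wj * v[c] for wj, v in zip(w, x)) for c in range(d)])
    return outputs, weights

def layer_norm(vec, eps=1e-6):
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

# Hypothetical 3-token sequence with 4-dimensional embeddings
tokens = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0],
          [1.0, 1.0, 0.0, 0.0]]

attn_output, attn_weights = self_attention(tokens)
# residual connection, then layer norm (mirrors out1 in the class above)
out1 = [layer_norm([a + b for a, b in zip(x_row, a_row)])
        for x_row, a_row in zip(tokens, attn_output)]
```

Each row of `attn_weights` sums to 1, so every output vector is a weighted mix of all tokens in the sequence, which is how the block lets a word's representation depend on words far away from it.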

In [4]:
class EmbeddingLayer22(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super(EmbeddingLayer22, self).__init__()
        self.token_emb = layers.Embedding(input_dim=vocab_size,
                                          output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions

In [7]:
from tensorflow.keras.utils import pad_sequences
## Hyperparameters for tokenizer
vocab_size = 10000
maxlen = 80

## tokenizer
tokenizer = keras.preprocessing.text.Tokenizer(num_words=vocab_size)
## fit tokenizer
tokenizer.fit_on_texts(X_train)

# preprocessor tokenizes words and makes sure all documents have the same length
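def preprocessor(data, maxlen=80, max_words=10000):
    # max_words is unused here: the vocabulary cap comes from the tokenizer's num_words

    # map each document to a list of integer word indices
    sequences = tokenizer.texts_to_sequences(data)

    # pad (or truncate) every sequence to exactly maxlen tokens
    X = pad_sequences(sequences, maxlen=maxlen)

    return X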

print(preprocessor(X_train).shape)
print(preprocessor(X_test).shape)

(6920, 80)
(1821, 80)
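
Both shapes end in 80 because `pad_sequences` forces every review to the same length. With its default settings it pads with zeros at the front and, for reviews that are too long, drops tokens from the front. The helper below is a hypothetical pure-Python re-implementation of that default behavior, written only to illustrate it; the notebook itself uses the Keras function.

```python
# Illustrative stand-in for pad_sequences with its defaults
# (padding='pre', truncating='pre', value=0). pad_pre is a hypothetical helper.

def pad_pre(sequences, maxlen, value=0):
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                          # keep the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)  # left-pad short sequences
    return padded

example = pad_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(example)
```

The short sequence gains leading zeros and the long one loses its first two tokens, so with `maxlen=80` only the last 80 words of a long review reach the model.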


In [8]:
embed_dim = 32  # Embedding size for each token
num_heads = 2  # Number of attention heads
ff_dim = 32  # Hidden layer size in feed forward network inside transformer

## Using Sequential API
model6 = keras.Sequential()
model6.add(layers.Input(shape=(maxlen, )))
model6.add(EmbeddingLayer22(maxlen, vocab_size, embed_dim))
model6.add(TransformerLayer(embed_dim, num_heads, ff_dim))
model6.add(layers.GlobalAveragePooling1D())
model6.add(layers.Dropout(0.1))
model6.add(layers.Dense(ff_dim, activation='relu'))
model6.add(layers.Dropout(0.1))
model6.add(layers.Dense(2, activation='softmax'))

2023-04-12 20:51:42.097425: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [9]:
model6.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

model6.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_layer22 (Embeddin  (None, 80, 32)           322560    
 gLayer22)                                                       
                                                                 
 transformer_layer (Transfor  (None, 80, 32)           10656     
 merLayer)                                                       
                                                                 
 global_average_pooling1d (G  (None, 32)               0         
 lobalAveragePooling1D)                                          
                                                                 
 dropout_2 (Dropout)         (None, 32)                0         
                                                                 
 dense_2 (Dense)             (None, 32)                1056      
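
The counts for the two custom layers can be checked by hand from the hyperparameters set earlier (`vocab_size=10000`, `maxlen=80`, `embed_dim=32`, `num_heads=2`, `ff_dim=32`). The sketch below is plain arithmetic; it assumes the usual Keras layout for `MultiHeadAttention`, where the query, key, and value projections each map `embed_dim` to `num_heads * key_dim` and an output projection maps back to `embed_dim`, all with biases.

```python
# Reproduce the Param # column of the Model 6 summary by hand.

def dense_params(in_units, out_units):
    return in_units * out_units + out_units

vocab_size, maxlen, embed_dim, num_heads, ff_dim = 10000, 80, 32, 2, 32
key_dim = embed_dim                     # TransformerLayer passes key_dim=embed_dim

# EmbeddingLayer22: token embedding table plus position embedding table
embedding_layer = vocab_size * embed_dim + maxlen * embed_dim

# TransformerLayer
qkv = 3 * dense_params(embed_dim, num_heads * key_dim)      # query, key, value projections
attn_out = dense_params(num_heads * key_dim, embed_dim)     # output projection
ffn = dense_params(embed_dim, ff_dim) + dense_params(ff_dim, embed_dim)
layer_norms = 2 * (2 * embed_dim)                           # scale and offset, two layers
transformer_layer = qkv + attn_out + ffn + layer_norms

head_dense = dense_params(embed_dim, ff_dim)                # dense_2 in the summary

print(embedding_layer, transformer_layer, head_dense)
```

Almost all of the weights sit in the token embedding table; the transformer block itself is small at this embedding size.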
                                                        

In [10]:
history = model6.fit(preprocessor(X_train),
                    y_train,
                    batch_size=64,
                    epochs=10,
                    validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model6, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)
with open("model6.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [121]:
#Submit Model 6: 

#-- Generate predicted y values (Model 6)
#Note: Keras predict returns the predicted column index location for classification models
prediction_column_index=model6.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 6 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model6.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

100% [..........................................................] 21357 / 21357
Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 67

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


### Discuss results

Model 6 has the best results, with 0.818 accuracy and a 0.818 F1 score. The model had an embedding layer with a maximum sequence length of 80 and an embedding size of 32, followed by a Conv1D layer and a MaxPooling1D layer. As for why I think it worked: in a text classification task such as movie sentiment, Conv1D can learn to identify important patterns and relationships between words in a text sequence, which helps when distinguishing between classes of documents or predicting the sentiment of a sentence. Conv1D models also tend to be less prone to overfitting than recurrent neural networks, which can be especially helpful on smaller text classification datasets.

### Discuss which models you tried and which models performed better and point out relevant hyper-parameter values for successful models.

LSTM and Conv1D worked well in the movie review sentiment models because both are good at identifying patterns and relationships in text. LSTM, a type of RNN, processes a sequence one token at a time and carries state forward, so it can relate a word to the words that came before it, which lets the model capture context and sentiment. It is also designed to handle long-term dependencies within sequences, which matters in natural language processing. Conv1D, a type of CNN, instead slides filters over a one-dimensional sequence. In text classification it picks up local patterns among adjacent words in a sentence, such as short phrases, and these are often the features that carry sentiment.
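
The idea that Conv1D detects local patterns among adjacent words can be shown with a tiny hand-built example. The sketch below is pure Python, not the Keras layer: it slides a single filter of width 3 over a sequence of 2-dimensional word vectors with zero padding and a ReLU. The embeddings, the filter weights, and the bias are all hypothetical and chosen by hand so that the filter responds only to "not" immediately followed by "good"; a trained model would learn such weights from data.

```python
# One Conv1D filter as a phrase detector (illustrative values only).

def conv1d_same(sequence, kernel, bias=0.0):
    """Slide one filter over a list of vectors with zero 'same' padding, then ReLU."""
    k = len(kernel)
    pad = k // 2
    zero = [0.0] * len(sequence[0])
    padded = [zero] * pad + list(sequence) + [zero] * pad
    out = []
    for t in range(len(sequence)):
        window = padded[t:t + k]                       # k adjacent positions
        total = bias + sum(w * x
                           for w_row, x_row in zip(kernel, window)
                           for w, x in zip(w_row, x_row))
        out.append(max(0.0, total))                    # ReLU
    return out

# Hypothetical embeddings: dimension 0 marks "not", dimension 1 marks "good"
emb = {"not": [1.0, 0.0], "good": [0.0, 1.0], "movie": [0.0, 0.0], "very": [0.0, 0.0]}

# Filter rows apply to (previous word, current word, next word)
kernel = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

negated = conv1d_same([emb[w] for w in ["movie", "not", "good"]], kernel, bias=-1.0)
plain = conv1d_same([emb[w] for w in ["very", "good", "movie"]], kernel, bias=-1.0)
print(negated, plain)
```

The filter activates only at the position where "good" directly follows "not", and stays silent for "very good". Each output depends on just three neighboring words, which is the local view that distinguishes Conv1D from an LSTM's running state.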