# Semi-supervised Sequence Learning


# Description

In this project, I employed two innovative approaches to enhance sequence learning in Recurrent Neural Networks (RNNs).

Initially, I trained a model for sentiment analysis on the IMDB Review dataset using a common LSTM network with two LSTM layers. Following this, I implemented a highly effective machine learning technique known as the Sequence-to-Sequence (seq2seq) model. In this technique, I trained a seq2seq model and used its LSTM layer weights to initialize the main sequence model for sentiment analysis. This approach leverages the abundance of unlabeled datasets, allowing us to train an unsupervised network more easily and subsequently use the resulting weights to improve the performance of our primary project.

The second approach involved using a language model. I experimented with two different language models for this task. The first approach utilized a pretrained language model, while the second approach involved building a language model from scratch and then using its weights for the main sequence learning project.

By incorporating these techniques, the project aims to enhance the stability and effectiveness of LSTM networks for sentiment analysis.

Paper Link : <a href="https://paperswithcode.com/paper/semi-supervised-sequence-learning">Semi-Supervised Sequence Learning</a>


In [13]:
! kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
! unzip -o -q imdb-dataset-of-50k-movie-reviews.zip -d Data
! rm imdb-dataset-of-50k-movie-reviews.zip

! kaggle datasets download -d stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset
! unzip -o -q rotten-tomatoes-movies-and-critic-reviews-dataset.zip -d Data
! rm rotten-tomatoes-movies-and-critic-reviews-dataset.zip
! rm Data/rotten_tomatoes_movies.csv 

Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
License(s): other
Downloading imdb-dataset-of-50k-movie-reviews.zip to c:\Users\ASUS\Desktop\KULIAH\Pembelajaran Mesin\FP\DeepLearning




  0%|          | 0.00/25.7M [00:00<?, ?B/s]
100%|██████████| 25.7M/25.7M [00:00<00:00, 749MB/s]


Dataset URL: https://www.kaggle.com/datasets/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset
License(s): CC0-1.0
Downloading rotten-tomatoes-movies-and-critic-reviews-dataset.zip to c:\Users\ASUS\Desktop\KULIAH\Pembelajaran Mesin\FP\DeepLearning




  0%|          | 0.00/77.2M [00:00<?, ?B/s]
100%|██████████| 77.2M/77.2M [00:00<00:00, 854MB/s]


In [14]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from tensorflow.keras.models import Model, Sequential, load_model
from tensorflow.keras.layers import Input, LSTM, RepeatVector, TimeDistributed, Dense, Embedding, Dropout, InputLayer
from tensorflow.keras.optimizers import Adam
import tensorflow_hub as hub
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

## Loading Data

- IMDB Review:<a href="https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews"> Kaggle </a>

- Rotten Tomatoes Review: <a href = "https://www.kaggle.com/datasets/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset">Kaggle</a>

In [15]:
datasets_folder = 'Data'
datasets_file = 'IMDB Dataset.csv'
datasets_path = f'{datasets_folder}/{datasets_file}'

try:
    df = pd.read_csv(datasets_path)
    print(f"File dataset {datasets_file} berhasil dimuat dari folder {datasets_folder}.")
except FileNotFoundError:
    print(f"Tidak ada file {datasets_file} di dalam folder {datasets_folder}.")


df.head(5)

File dataset IMDB Dataset.csv berhasil dimuat dari folder Data.


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [16]:
datasets_folder = 'Data'
datasets_file = 'rotten_tomatoes_critic_reviews.csv'
datasets_path = f'{datasets_folder}/{datasets_file}'

try:
    to_df = pd.read_csv(datasets_path)
    print(f"File dataset {datasets_file} berhasil dimuat dari folder {datasets_folder}.")
except FileNotFoundError:
    print(f"Tidak ada file {datasets_file} di dalam folder {datasets_folder}.")

to_df.head(5)

File dataset rotten_tomatoes_critic_reviews.csv berhasil dimuat dari folder Data.


Unnamed: 0,rotten_tomatoes_link,critic_name,top_critic,publisher_name,review_type,review_score,review_date,review_content
0,m/0814255,Andrew L. Urban,False,Urban Cinefile,Fresh,,2010-02-06,A fantasy adventure that fuses Greek mythology...
1,m/0814255,Louise Keller,False,Urban Cinefile,Fresh,,2010-02-06,"Uma Thurman as Medusa, the gorgon with a coiff..."
2,m/0814255,,False,FILMINK (Australia),Fresh,,2010-02-09,With a top-notch cast and dazzling special eff...
3,m/0814255,Ben McEachen,False,Sunday Mail (Australia),Fresh,3.5/5,2010-02-09,Whether audiences will get behind The Lightnin...
4,m/0814255,Ethan Alter,True,Hollywood Reporter,Rotten,,2010-02-10,What's really lacking in The Lightning Thief i...


In [17]:
df.shape, to_df.shape

((50000, 2), (1130017, 8))

In [18]:
df.sentiment.value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

In [19]:
to_df.head()

Unnamed: 0,rotten_tomatoes_link,critic_name,top_critic,publisher_name,review_type,review_score,review_date,review_content
0,m/0814255,Andrew L. Urban,False,Urban Cinefile,Fresh,,2010-02-06,A fantasy adventure that fuses Greek mythology...
1,m/0814255,Louise Keller,False,Urban Cinefile,Fresh,,2010-02-06,"Uma Thurman as Medusa, the gorgon with a coiff..."
2,m/0814255,,False,FILMINK (Australia),Fresh,,2010-02-09,With a top-notch cast and dazzling special eff...
3,m/0814255,Ben McEachen,False,Sunday Mail (Australia),Fresh,3.5/5,2010-02-09,Whether audiences will get behind The Lightnin...
4,m/0814255,Ethan Alter,True,Hollywood Reporter,Rotten,,2010-02-10,What's really lacking in The Lightning Thief i...


In [20]:
to_df.review_type.value_counts()

review_type
Fresh     720210
Rotten    409807
Name: count, dtype: int64

In [21]:
df1 = to_df[["review_type", "review_content"]]
df1.head()

Unnamed: 0,review_type,review_content
0,Fresh,A fantasy adventure that fuses Greek mythology...
1,Fresh,"Uma Thurman as Medusa, the gorgon with a coiff..."
2,Fresh,With a top-notch cast and dazzling special eff...
3,Fresh,Whether audiences will get behind The Lightnin...
4,Rotten,What's really lacking in The Lightning Thief i...


# Text Preprocessing

Before everything we have to clean and process on our data before applying any machine learning algorithm.

# Convert String values to Numeric

In [22]:
df["label"] = df["sentiment"].apply(lambda x: 1 if x == "positive" else 0)
df.head()

Unnamed: 0,review,sentiment,label
0,One of the other reviewers has mentioned that ...,positive,1
1,A wonderful little production. <br /><br />The...,positive,1
2,I thought this was a wonderful way to spend ti...,positive,1
3,Basically there's a family where a little boy ...,negative,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,1


In [23]:
df1["label"] = df1["review_type"].apply(lambda x: 1 if x == "Fresh" else 0)
df1.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1["label"] = df1["review_type"].apply(lambda x: 1 if x == "Fresh" else 0)


Unnamed: 0,review_type,review_content,label
0,Fresh,A fantasy adventure that fuses Greek mythology...,1
1,Fresh,"Uma Thurman as Medusa, the gorgon with a coiff...",1
2,Fresh,With a top-notch cast and dazzling special eff...,1
3,Fresh,Whether audiences will get behind The Lightnin...,1
4,Rotten,What's really lacking in The Lightning Thief i...,0


## Sample Cleaning text

In [24]:
review = re.sub(r'^RT[\s]+', '', df.iloc[1]["review"])
print(review)

A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.


In [25]:
review = re.sub(r'<br />', '', review)
review = review.replace("\'", "")
review

'A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great masters of comedy and his life. The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional dream techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwells murals decorating every surface) are terribly well done.'

In [26]:
tokens = review.split()
print(tokens)

['A', 'wonderful', 'little', 'production.', 'The', 'filming', 'technique', 'is', 'very', 'unassuming-', 'very', 'old-time-BBC', 'fashion', 'and', 'gives', 'a', 'comforting,', 'and', 'sometimes', 'discomforting,', 'sense', 'of', 'realism', 'to', 'the', 'entire', 'piece.', 'The', 'actors', 'are', 'extremely', 'well', 'chosen-', 'Michael', 'Sheen', 'not', 'only', '"has', 'got', 'all', 'the', 'polari"', 'but', 'he', 'has', 'all', 'the', 'voices', 'down', 'pat', 'too!', 'You', 'can', 'truly', 'see', 'the', 'seamless', 'editing', 'guided', 'by', 'the', 'references', 'to', 'Williams', 'diary', 'entries,', 'not', 'only', 'is', 'it', 'well', 'worth', 'the', 'watching', 'but', 'it', 'is', 'a', 'terrificly', 'written', 'and', 'performed', 'piece.', 'A', 'masterful', 'production', 'about', 'one', 'of', 'the', 'great', 'masters', 'of', 'comedy', 'and', 'his', 'life.', 'The', 'realism', 'really', 'comes', 'home', 'with', 'the', 'little', 'things:', 'the', 'fantasy', 'of', 'the', 'guard', 'which,', '

In [27]:
stopwords_english = stopwords.words("english")

cleaned_worlds = []

for x in tokens:
    if (x not in stopwords_english and string.punctuation):
        cleaned_worlds.append(x)

In [28]:
len(cleaned_worlds), len(tokens)

(94, 156)

In [29]:
stemmer = PorterStemmer()
text_stem = []

for x in cleaned_worlds:
    stem_word = stemmer.stem(x)
    text_stem.append(stem_word)

In [30]:
for i in range(len(cleaned_worlds)):
    print(f"{cleaned_worlds[i]} ----- {text_stem[i]}")

A ----- a
wonderful ----- wonder
little ----- littl
production. ----- production.
The ----- the
filming ----- film
technique ----- techniqu
unassuming- ----- unassuming-
old-time-BBC ----- old-time-bbc
fashion ----- fashion
gives ----- give
comforting, ----- comforting,
sometimes ----- sometim
discomforting, ----- discomforting,
sense ----- sens
realism ----- realism
entire ----- entir
piece. ----- piece.
The ----- the
actors ----- actor
extremely ----- extrem
well ----- well
chosen- ----- chosen-
Michael ----- michael
Sheen ----- sheen
"has ----- "ha
got ----- got
polari" ----- polari"
voices ----- voic
pat ----- pat
too! ----- too!
You ----- you
truly ----- truli
see ----- see
seamless ----- seamless
editing ----- edit
guided ----- guid
references ----- refer
Williams ----- william
diary ----- diari
entries, ----- entries,
well ----- well
worth ----- worth
watching ----- watch
terrificly ----- terrificli
written ----- written
performed ----- perform
piece. ----- piece.
A ----- a
mast

# Data Cleaning Function

In [31]:
def preprocessing(text):
    text = str(text)
    text = re.sub(r'^RT[\s]+', '', text)
    text = re.sub(r'<br />', '', text)
    text = text.replace("\'", "")
    
    tokens = text.split()
    stopwords_english = stopwords.words("english")
    stemmer = PorterStemmer()
    
    cleaned_words = []

    for x in tokens:
        if (x not in stopwords_english and string.punctuation):
            stem_word = stemmer.stem(x)
            cleaned_words.append(stem_word)
    return ' '.join(cleaned_words)

In [32]:
df["cleaned_text"] = df["review"].apply(preprocessing)

Selecting first 120,000 data of rotten tomato

In [33]:
df1 = df1[:120000]
df1["cleaned_text"] = df1["review_content"].apply(preprocessing)

In [34]:
df1.head()

Unnamed: 0,review_type,review_content,label,cleaned_text
0,Fresh,A fantasy adventure that fuses Greek mythology...,1,a fantasi adventur fuse greek mytholog contemp...
1,Fresh,"Uma Thurman as Medusa, the gorgon with a coiff...",1,"uma thurman medusa, gorgon coiffur writh snake..."
2,Fresh,With a top-notch cast and dazzling special eff...,1,"with top-notch cast dazzl special effects, tid..."
3,Fresh,Whether audiences will get behind The Lightnin...,1,whether audienc get behind the lightn thief ha...
4,Rotten,What's really lacking in The Lightning Thief i...,0,what realli lack the lightn thief genuin sens ...


In [35]:
X = df["cleaned_text"]
y = df["label"]

train_x, test_x, train_y, test_y = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Tokenizing Data

In [36]:
tokenizer = Tokenizer(num_words = 5000, oov_token="<OOV>")
tokenizer.fit_on_texts(train_x)

train_x_sequences = tokenizer.texts_to_sequences(train_x)
test_x_sequences = tokenizer.texts_to_sequences(test_x)

max_length = 100
train_x_padded = pad_sequences(train_x_sequences, maxlen=max_length, padding='post', truncating='post')
test_x_padded = pad_sequences(test_x_sequences, maxlen=max_length, padding='post', truncating='post')

In [37]:
train_x_tomato = df1["cleaned_text"]
tokenizer.fit_on_texts(train_x_tomato)
train_x_tomato_sequences = tokenizer.texts_to_sequences(train_x_tomato)
train_x_tomato_padded = pad_sequences(train_x_tomato_sequences, maxlen=max_length, padding='post', truncating='post')

In [38]:
len(train_x_tomato)

120000

# Recurrent Neural Network 

In [39]:
model = Sequential([
    Embedding(input_dim=5000, output_dim=64, input_length=max_length),
    LSTM(64, return_sequences=True),
    Dropout(0.2),
    LSTM(64),
    Dropout(0.2),
    Dense(1, activation = "sigmoid")
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 100, 64)           320000    
                                                                 
 lstm (LSTM)                 (None, 100, 64)           33024     
                                                                 
 dropout (Dropout)           (None, 100, 64)           0         
                                                                 
 lstm_1 (LSTM)               (None, 64)                33024     
                                                                 
 dropout_1 (Dropout)         (None, 64)                0         
                                                                 
 dense (Dense)               (None, 1)                 65        
                                                                 
Total params: 386,113
Trainable params: 386,113
Non-trai

In [40]:
history = model.fit(train_x_padded, train_y, epochs=10, batch_size=32, validation_data=(test_x_padded, test_y))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Evaluation

In [41]:
loss, accuracy = model.evaluate(test_x_padded, test_y)
print(f'Accuracy: {accuracy}')

Accuracy: 0.8176000118255615


## Checking some Sample Review

In [42]:
text_sample = [
    "It was an awful movie but I liked it.",
    "It was a good movie",
    "I think it was the only movie that I could see until the end because it wasn't like the other movies that I have ever seen"
]

text_sample_sequences = tokenizer.texts_to_sequences(text_sample)
max_length = 100
text_sample_padded = pad_sequences(text_sample_sequences, maxlen=max_length, padding='post', truncating='post')

predictions = model.predict(text_sample_padded)

print(predictions)

[[0.9240088]
 [0.9369654]
 [0.9212563]]


# SA-LSTM

In [43]:
max_length = 100
latent_dim = 64
vocab_size = 5000

## Encoder

In [44]:
inputs = Input(shape=(max_length,))
embedding = Embedding(vocab_size, latent_dim, input_length=max_length)(inputs)
encoded = LSTM(latent_dim, return_sequences=False)(embedding)

## Decoder

In [45]:
decoded = RepeatVector(max_length)(encoded)
decoded = LSTM(latent_dim, return_sequences=True)(decoded)
decoded = TimeDistributed(Dense(vocab_size, activation='softmax'))(decoded)

## Sequence model

In [46]:
sequence_autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)

In [47]:
sequence_autoencoder.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
sequence_autoencoder.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 100)]             0         
                                                                 
 embedding_1 (Embedding)     (None, 100, 64)           320000    
                                                                 
 lstm_2 (LSTM)               (None, 64)                33024     
                                                                 
 repeat_vector (RepeatVector  (None, 100, 64)          0         
 )                                                               
                                                                 
 lstm_3 (LSTM)               (None, 100, 64)           33024     
                                                                 
 time_distributed (TimeDistr  (None, 100, 5000)        325000    
 ibuted)                                                     

In [48]:
sequence_autoencoder.fit(train_x_tomato_padded[:50000], np.expand_dims(train_x_tomato_padded[:50000], -1),
                         epochs=10,
                         batch_size=128,
                         validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
 17/313 [>.............................] - ETA: 9:00 - loss: 0.8127

KeyboardInterrupt: 

In [None]:
model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=latent_dim, input_length=max_length),
    LSTM(latent_dim, return_sequences=True),
    Dropout(0.2),
    LSTM(latent_dim),
    Dropout(0.2),
    Dense(1, activation='sigmoid')
])

model.layers[1].set_weights(encoder.layers[2].get_weights())
model.layers[3].set_weights(encoder.layers[2].get_weights())

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

model.fit(train_x_padded[:, :-1], train_y,
          epochs=10,
          batch_size=128,
          validation_split=0.2)

loss, accuracy = model.evaluate(test_x_padded, test_y)
print(f'Test Loss: {loss}')
print(f'Test Accuracy: {accuracy}')

y_pred = (model.predict(test_x_padded) > 0.5).astype("int32")
print(f'Accuracy: {accuracy_score(test_y, y_pred)}')

Model: "sequential_6"

_________________________________________________________________

 Layer (type)                Output Shape              Param #   


 embedding_16 (Embedding)    (None, 100, 64)           320000    

                                                                 

 lstm_23 (LSTM)              (None, 100, 64)           33024     

                                                                 

 dropout_22 (Dropout)        (None, 100, 64)           0         

                                                                 

 lstm_24 (LSTM)              (None, 64)                33024     

                                                                 

 dropout_23 (Dropout)        (None, 64)                0         

                                                                 

 dense_14 (Dense)            (None, 1)                 65        

                                                                 


Total params: 386113 (1.47 MB)

Train

<keras.src.callbacks.History at 0x2afedca7dd0>

In [None]:
text_sample = [
    "It was awful movie but I liked it.",
    "It was a good movie",
    "I think it was the only movie that I could see until the end because it wasn't like the other movies that I have ever seen"
]
text_sample_cleaned = [preprocessing(text) for text in text_sample]

text_sample_sequences = tokenizer.texts_to_sequences(text_sample_cleaned)
text_sample_padded = pad_sequences(text_sample_sequences, maxlen=max_length, padding='post', truncating='post')

predictions = model.predict(text_sample_padded)
print(predictions)

for i, prediction in enumerate(predictions):
    print(f'Text: {text_sample[i]}')
    print(f'Predicted Sentiment: {"Positive" if prediction > 0.5 else "Negative"}\n')


[[0.01109009]

 [0.86672544]

 [0.74528325]]

Text: It was awful movie but I liked it.

Predicted Sentiment: Negative



Text: It was a good movie

Predicted Sentiment: Positive



Text: I think it was the only movie that I could see until the end because it wasn't like the other movies that I have ever seen

Predicted Sentiment: Positive




# Saving trained models

In [None]:
sequence_autoencoder.save('sequence_autoencoder.h5')
encoder.save('encoder.h5')
model.save('sentiment_classifier.h5')


  saving_api.save_model(


# Loading Trained Models

In [None]:
loaded_sequence_autoencoder = load_model('sequence_autoencoder.h5')
loaded_encoder = load_model('encoder.h5')
loaded_sentiment_classifier = load_model('sentiment_classifier.h5')



# LM-LSTM

In [None]:
pretrained_model_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
pretrained_model = hub.KerasLayer(pretrained_model_url, trainable=False)

train_embeddings = np.array([pretrained_model([text]).numpy()[0] for text in train_x])
test_embeddings = np.array([pretrained_model([text]).numpy()[0] for text in test_x])

In [None]:
LM_LSTM_model = Sequential([
    InputLayer(input_shape=(train_embeddings.shape[1],)),
    tf.keras.layers.Reshape((1, train_embeddings.shape[1])),
    LSTM(64, return_sequences=True),
    Dropout(0.2),
    LSTM(64),
    Dropout(0.2),
    Dense(1, activation='sigmoid')
])

LM_LSTM_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
LM_LSTM_model.summary()

Model: "sequential_1"

_________________________________________________________________

 Layer (type)                Output Shape              Param #   


 reshape (Reshape)           (None, 1, 512)            0         

                                                                 

 lstm_6 (LSTM)               (None, 1, 64)             147712    

                                                                 

 dropout_2 (Dropout)         (None, 1, 64)             0         

                                                                 

 lstm_7 (LSTM)               (None, 64)                33024     

                                                                 

 dropout_3 (Dropout)         (None, 64)                0         

                                                                 

 dense_3 (Dense)             (None, 1)                 65        

                                                                 


Total params: 180801 (706.25 KB)

Tra

In [None]:
LM_LSTM_model.fit(train_embeddings, train_y, epochs=10, batch_size=128, validation_split=0.2)

loss, accuracy = LM_LSTM_model.evaluate(test_embeddings, test_y)
print(f'Test Loss: {loss}')
print(f'Test Accuracy: {accuracy}')

from sklearn.metrics import accuracy_score
y_pred = (LM_LSTM_model.predict(test_embeddings) > 0.5).astype("int32")
print(f'Accuracy: {accuracy_score(test_y, y_pred)}')

Epoch 1/10


Epoch 2/10


Epoch 3/10


Epoch 4/10


Epoch 5/10


Epoch 6/10


Epoch 7/10


Epoch 8/10


Epoch 9/10


Epoch 10/10



Test Loss: 0.40373358130455017

Test Accuracy: 0.8123999834060669


Accuracy: 0.8124


In [None]:
text_sample = [
    "It was awful movie but i liked it.",
    "it was a good movie",
    "i think it was the only movie that I could see until the end because it wan't like the other movies that I have ever seen"
]
text_sample_cleaned = [preprocessing(text) for text in text_sample]

text_sample_embeddings = np.array([pretrained_model([text]).numpy()[0] for text in text_sample_cleaned])

predictions = LM_LSTM_model.predict(text_sample_embeddings)
print(predictions)

for i, prediction in enumerate(predictions):
    print(f'Text: {text_sample[i]}')
    print(f'Predicted Sentiment: {"Positive" if prediction > 0.5 else "Negative"}\n')


[[0.54970556]

 [0.7660237 ]

 [0.8268357 ]]

Text: It was awful movie but i liked it.

Predicted Sentiment: Positive



Text: it was a good movie

Predicted Sentiment: Positive



Text: i think it was the only movie that I could see until the end because it wan't like the other movies that I have ever seen

Predicted Sentiment: Positive




# Implementing Language model from Scratch

In [None]:
inputs = Input(shape=(max_length,)) 
embedding = Embedding(vocab_size, latent_dim, input_length=max_length)(inputs)
lstm = LSTM(latent_dim)(embedding)
output = Dense(vocab_size, activation='softmax')(lstm)

language_model = Model(inputs, output)
language_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy']) 
language_model.summary()

Model: "model_11"

_________________________________________________________________

 Layer (type)                Output Shape              Param #   


 input_11 (InputLayer)       [(None, 100)]             0         

                                                                 

 embedding_8 (Embedding)     (None, 100, 64)           320000    

                                                                 

 lstm_8 (LSTM)               (None, 64)                33024     

                                                                 

 dense_6 (Dense)             (None, 5000)              325000    

                                                                 


Total params: 678024 (2.59 MB)

Trainable params: 678024 (2.59 MB)

Non-trainable params: 0 (0.00 Byte)

_________________________________________________________________


In [None]:
train_x_padded = pad_sequences(train_x_sequences, maxlen=max_length + 1, padding='post', truncating='post') 
test_x_padded = pad_sequences(test_x_sequences, maxlen=max_length + 1, padding='post', truncating='post')

language_model = Model(inputs, output)
language_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy']) 

language_model.fit(
    train_x_padded[:, :-1], 
    train_x_padded[:, 1:].reshape(-1), 
    epochs=10,
    batch_size=128,
    validation_split=0.2
)

Epoch 1/10


Epoch 2/10


Epoch 3/10


Epoch 4/10


Epoch 5/10


Epoch 6/10


Epoch 7/10


Epoch 8/10


Epoch 9/10


Epoch 10/10



<keras.src.callbacks.History at 0x2afe99b44d0>

In [None]:
model_LM = Sequential([
    Embedding(input_dim=vocab_size, output_dim=latent_dim, input_length=max_length),
    LSTM(latent_dim, return_sequences=True),
    Dropout(0.2),
    LSTM(latent_dim),
    Dropout(0.2),
    Dense(1, activation='sigmoid')
])

model_LM.layers[1].set_weights(language_model.layers[2].get_weights())
model_LM.layers[3].set_weights(language_model.layers[2].get_weights())

model_LM.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_LM.summary()

model_LM.fit(train_x_padded[:, :-1], train_y,
          epochs=10,
          batch_size=128,
          validation_split=0.2)

Model: "sequential_4"

_________________________________________________________________

 Layer (type)                Output Shape              Param #   


 embedding_13 (Embedding)    (None, 100, 64)           320000    

                                                                 

 lstm_17 (LSTM)              (None, 100, 64)           33024     

                                                                 

 dropout_18 (Dropout)        (None, 100, 64)           0         

                                                                 

 lstm_18 (LSTM)              (None, 64)                33024     

                                                                 

 dropout_19 (Dropout)        (None, 64)                0         

                                                                 

 dense_11 (Dense)            (None, 1)                 65        

                                                                 


Total params: 386113 (1.47 MB)

Train

<keras.src.callbacks.History at 0x2aff0a574d0>