[Dataset](https://www.kaggle.com/datasets/kazanova/sentiment140?fbclid=IwAR3s2VPpKlJ28onFUBhUndGJ-MtMzdtQ8Rbvj7A5PhOOxnkXDRnHTYsOCV0)

Link Colab: [Twitter-sentiment](https://colab.research.google.com/drive/1fYb57IpSfIvmBGhyaONgX3ej5Ez9idKu?usp=sharing)

**Dataset**

- **Link dataset**: [Sentiment Twitter](https://www.kaggle.com/datasets/kazanova/sentiment140?fbclid=IwAR3s2VPpKlJ28onFUBhUndGJ-MtMzdtQ8Rbvj7A5PhOOxnkXDRnHTYsOCV0)
- **Link description**: [Description](https://www.linkedin.com/pulse/social-machine-learning-h2o-twitter-python-marios-michailidis/)

- **Data description**:

**target**: the polarity of the tweet (0 = negative, 4 = positive)

**ids**: the id of the tweet (2087)

**date**: the date of the tweet (Sat May 16 23:58:44 UTC 2009)

**flag**: The query (lyx). If there is no query, then this value is NO_QUERY.

**user**: the user that tweeted (robotickilldozr)

**text**: the text of the tweet (Lyx is cool)

The CSV file is encoded in ISO-8859-1.
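To make the column layout concrete, here is a minimal sketch that parses one row in this format with the standard library only. The sample row reuses the example values from the description above. The target value `4` for that row and the helper name `parse_rows` are assumptions made for illustration.

```python
import csv
import io

COLUMNS = ["target", "ids", "date", "flag", "user", "text"]
POLARITY = {0: "negative", 4: "positive"}

# One illustrative row in the same six-column layout, encoded as ISO-8859-1 bytes
raw = (
    '"4","2087","Sat May 16 23:58:44 UTC 2009","lyx","robotickilldozr","Lyx is cool"\n'
).encode("ISO-8859-1")


def parse_rows(raw_bytes):
    """Decode ISO-8859-1 bytes and return one dict per CSV row."""
    text = raw_bytes.decode("ISO-8859-1")
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        row = dict(zip(COLUMNS, fields))
        row["target"] = int(row["target"])
        row["label"] = POLARITY[row["target"]]  # 0 = negative, 4 = positive
        rows.append(row)
    return rows


rows = parse_rows(raw)
print(rows[0]["label"], "-", rows[0]["text"])
```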


# Prepare Data

In [None]:
! gdown 1TzXYHe9Yu7QCx_kpJolufdQtiG_4_vcr


Downloading...
From: https://drive.google.com/uc?id=1TzXYHe9Yu7QCx_kpJolufdQtiG_4_vcr
To: /content/sentiment-twitter.csv
100% 239M/239M [00:04<00:00, 54.7MB/s]


In [None]:
DATASET_ENCODING = "ISO-8859-1"
DATASET_COLUMNS = ["target", "ids", "date", "flag", "user", "text"]

In [None]:
import pandas as pd
import nltk
from sklearn.model_selection import train_test_split


In [None]:
df = pd.read_csv('sentiment-twitter.csv', encoding=DATASET_ENCODING, names=DATASET_COLUMNS)
df.head()

Unnamed: 0,target,ids,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [None]:
len(df)

1600000

In [None]:
X = df['text']
y = df['target']

In [None]:
X[3]

'my whole body feels itchy and like its on fire '

## Check for imbalanced data

In [None]:
df.groupby('target').count()

target,ids,date,flag,user,text
0,800000,800000,800000,800000,800000
4,800000,800000,800000,800000,800000


In [None]:
df['target'].unique()

array([0, 4])
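The same balance check can be expressed as class proportions. The sketch below uses only the standard library; `class_balance` and the toy label list are made up for this example and stand in for the `target` column.

```python
from collections import Counter


def class_balance(labels):
    """Return the share of each label in the list."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy labels using the same coding as the dataset (0 = negative, 4 = positive)
toy_labels = [0, 4, 0, 4, 4, 0]
balance = class_balance(toy_labels)
print(balance)
```

Shares close to 0.5 for both classes, as in this dataset, mean that plain accuracy is a reasonable first metric.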

**Preprocessing Workflow**:
1. Text Cleaning
- Remove any unnecessary characters, such as special characters, numbers, or punctuation.
- Convert text to lowercase to ensure uniformity.
- Handle contractions (e.g., "don't" to "do not").
- Remove or replace stop words (common words like "the," "is," "and" that don't contribute much to the meaning).

2. Tokenization
- Split the text into individual words or tokens. Tokenization is the process of breaking text into words, phrases, symbols, or other meaningful elements (tokens).
- Consider using a tokenizer from a natural language processing library like NLTK, spaCy, or the tokenization functions provided by frameworks like TensorFlow or PyTorch.
3. Removing Noise
- Remove irrelevant information, such as URLs, special characters, or HTML tags.
- Handle or remove rare words or typos that might not contribute much to the model's learning.

4. Stemming and Lemmatization
- Reduce words to their base or root form to normalize the text.
- Stemming strips suffixes with heuristic rules, so the result is not always a real word.
- Lemmatization reduces words to their dictionary form (lemma) using vocabulary and morphological analysis.

5. Handling Missing Data
- Check for missing or null values and decide on an appropriate strategy for handling them (e.g., removing rows, imputation).

6. Vectorization
- Convert the text into numerical representations that machine learning models can work with.
- Techniques include Bag-of-Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), or more advanced word embeddings like Word2Vec or GloVe.
- For deep learning, you might use pre-trained embeddings or train your own embeddings, for example with Word2Vec or GloVe.

7. Padding or Truncation
- Ensure that input sequences are of the same length by padding shorter sequences or truncating longer ones. This step is crucial when working with sequence models like recurrent neural networks (RNNs) or transformers.

8. Splitting the Data
- Split the dataset into training, validation, and test sets for model training, hyperparameter tuning, and final evaluation.

9. Feature Engineering (Optional)
- Depending on your task, you might perform additional feature engineering, such as extracting features from the text (e.g., sentiment, named entities).

10. Encoding Labels (if classification)
- If you're working on a classification task, encode categorical labels into numerical format (e.g., using one-hot encoding or label encoding).
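Step 6 mentions TF-IDF, which the rest of this notebook does not use. As a rough illustration, the sketch below computes a simplified TF-IDF (term frequency times `log(N / df)`) in plain Python. Library implementations such as scikit-learn's `TfidfVectorizer` apply extra smoothing and normalization, so their numbers will differ. The function name `simple_tfidf` and the toy documents are assumptions for this example.

```python
import math
from collections import Counter


def simple_tfidf(documents):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each word appears
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights


toy_docs = ["good movie", "bad movie", "good day"]
tfidf = simple_tfidf(toy_docs)
print(tfidf[1])
```

In the second document, "bad" appears in fewer documents than "movie", so it receives the larger weight. A word that occurs in every document gets weight 0 under this formula.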

# Preprocessing

## 1. Text cleaning
- Remove any unnecessary characters, such as special characters, numbers, or punctuation.
- Convert text to lowercase to ensure uniformity.
- Handle contractions (e.g., "don't" to "do not").
- Remove or replace stop words (common words like "the," "is," "and" that don't contribute much to the meaning).

In [None]:
!pip install contractions
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Collecting textsearch>=0.0.21 (from contractions)
  Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Collecting anyascii (from textsearch>=0.0.21->contractions)
  Downloading anyascii-0.3.2-py3-none-any.whl (289 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 289.9/289.9 kB 7.6 MB/s eta 0:00:00
Collecting pyahocorasick (from textsearch>=0.0.21->contractions)
  Downloading pyahocorasick-2.0.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (110 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 110.8/110.8 kB 11.9 MB/s eta 0:00:00
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.2 contractions-0.1.73 pyahocorasick-2.0.0 textsearch-0.0.24


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from contractions import contractions_dict

def remove_twitter_handles(text):
    # Remove Twitter handles (user mentions)
    return re.sub(r'@[^\s]+', '', text)

def remove_special_characters(text):
    # Remove special characters, numbers, and punctuation
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text

def convert_to_lowercase(text):
    # Convert text to lowercase
    return text.lower()

def expand_contractions(text):
    # contractions_dict (from the contractions package) maps each contraction
    # to its expansion, e.g. "you're" -> "you are"
    # Note: str.replace is case-sensitive and also matches inside longer words
    for contraction, expansion in contractions_dict.items():
        text = text.replace(contraction, expansion)
    return text

def normalize_repeated_characters(text):
    # Use a regular expression to find repeated characters (more than two occurrences)
    pattern = re.compile(r"(.)\1{2,}")
    # Replace repeated characters with a single occurrence
    normalized_text = pattern.sub(r"\1", text)
    return normalized_text
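Plain `str.replace` is case-sensitive and can match inside longer words. One possible alternative, sketched below, is a single compiled regular expression with word boundaries and case-insensitive matching. The mapping `TOY_CONTRACTIONS` and the function name `expand_contractions_regex` are hypothetical and only cover three contractions for illustration.

```python
import re

# Hypothetical, tiny mapping used only for this illustration
TOY_CONTRACTIONS = {
    "don't": "do not",
    "you're": "you are",
    "it's": "it is",
}

# One alternation of all contractions, anchored on word boundaries
_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in TOY_CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions_regex(text):
    """Replace each known contraction with its lowercase expansion."""
    return _CONTRACTION_RE.sub(
        lambda match: TOY_CONTRACTIONS[match.group(0).lower()], text
    )


expanded = expand_contractions_regex("You're right, I don't know")
print(expanded)
```

The expansion is always lowercase, which is harmless here because the pipeline lowercases the text in a later step anyway.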





## 2. Removing Noise
- Remove irrelevant information, such as URLs, special characters, or HTML tags.
- Handle or remove rare words or typos that might not contribute much to the model's learning.

In [None]:
# Compile the pattern once at module level so it is not rebuilt on every call.
REPEATED_CHARS = re.compile(r"(.)\1{2,}")

def normalize_repeated_characters(text):
  # Collapse any character repeated three or more times into a single occurrence
  return REPEATED_CHARS.sub(r"\1", text)

def remove_link(text):
  return re.sub(r'http[s]?://\S+', '', text)

# normalize_repeated_characters(text_test)
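The order of the cleaning steps matters: links have to be removed before punctuation is stripped, otherwise the URL collapses into a meaningless word. The sketch below repeats the regular expressions from the cells above so that it runs on its own. The sample tweet, the `clean` helper, and the whitespace collapse (a stand-in for tokenizing and re-joining) are assumptions for illustration.

```python
import re


def remove_twitter_handles(text):
    return re.sub(r'@[^\s]+', '', text)


def remove_link(text):
    return re.sub(r'http[s]?://\S+', '', text)


def remove_special_characters(text):
    return re.sub(r'[^a-zA-Z\s]', '', text)


def normalize_repeated_characters(text):
    return re.sub(r"(.)\1{2,}", r"\1", text)


def clean(text):
    text = remove_twitter_handles(text)
    text = remove_link(text)                # must run before punctuation is stripped
    text = remove_special_characters(text)
    text = text.lower()
    text = normalize_repeated_characters(text)
    return " ".join(text.split())           # collapse leftover whitespace


sample = "@someone Awww, sooo cool!!! http://example.com/abc"
cleaned = clean(sample)
print(cleaned)

# Stripping punctuation first would glue the URL into one token
broken_url = remove_special_characters("http://example.com/abc")
print(broken_url)
```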

## 3. Tokenization and Lemmatization
- Split the text into individual words or tokens. Tokenization is the process of breaking text into words, phrases, symbols, or other meaningful elements (tokens).
- Reduce words to their base or root form to normalize the text.
- Lemmatization involves reducing words to their base or root form using vocabulary and morphological analysis.


In [None]:
def stemmer(tokens):
  lancaster = nltk.LancasterStemmer()
  return [lancaster.stem(t) for t in tokens]


def lemmatizer(tokens):
  wnl = nltk.WordNetLemmatizer()
  return [wnl.lemmatize(t) for t in tokens]


# We use the lemmatizer because we want to preserve the meaning of each word in context.
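The difference between the two approaches can be shown with a deliberately tiny toy model. The suffix rules in `toy_stem` and the lookup table `TOY_LEMMAS` are invented for this sketch and are far simpler than the Lancaster stemmer or the WordNet lemmatizer used above.

```python
def toy_stem(word):
    """Strip the first matching suffix; the result may not be a real word."""
    for suffix in ("ing", "ies", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical mini dictionary standing in for a real lexical database
TOY_LEMMAS = {"studies": "study", "feet": "foot", "cars": "car"}


def toy_lemmatize(word):
    """Look the word up in a vocabulary and return its dictionary form."""
    return TOY_LEMMAS.get(word, word)


for word in ("studies", "feet"):
    print(word, "->", toy_stem(word), "(stem) /", toy_lemmatize(word), "(lemma)")
```

The stemmer turns "studies" into the non-word "stud" and cannot handle the irregular plural "feet", while the dictionary lookup returns "study" and "foot".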

## 4. Check for missing data

In [None]:
from typing import Tuple
from pandas import DataFrame, Series

def handleMissing(data: Series, label: Series) -> Tuple[Series, Series]:
  # Join texts and labels on their shared index so both lose the same rows
  result_df = pd.merge(data, label, left_index=True, right_index=True)
  # Drop every row that has a missing text or a missing label
  result_df = result_df.dropna()
  new_data = result_df.iloc[:, 0]
  new_label = result_df.iloc[:, 1]
  return new_data, new_label

In [20]:
from typing import Dict
import numpy as np
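def preprocessing_pipeline(text_df: pd.Series, labels: pd.Series) -> Dict[str, str]:
  # Maps each cleaned tweet to its sentiment name.
  # Note: identical cleaned tweets share one key, so a later tweet overwrites an earlier one.
  result = {}
  sentiment_mapping = {0: 'negative', 2: 'neutral', 4: 'positive'}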
  for i in range(len(text_df)):
    sentiment = labels[i]
    text = text_df[i]
    text = expand_contractions(text)
    text = remove_twitter_handles(text)
    text = remove_link(text)
    text = remove_special_characters(text)
    text = convert_to_lowercase(text)
    text = normalize_repeated_characters(text)
    text = word_tokenize(text)
    text = lemmatizer(text)
    text = ' '.join(text)
    result[text] = sentiment_mapping[sentiment]
  return result



In [21]:
from typing import Dict
import numpy as np
def preprocessing_text(text):
  text = expand_contractions(text)
  text = remove_twitter_handles(text)
  text = remove_link(text)
  text = remove_special_characters(text)
  text = convert_to_lowercase(text)
  text = normalize_repeated_characters(text)
  text = word_tokenize(text)
  text = lemmatizer(text)
  text = ' '.join(text)
  return text

In [22]:
preprocessing_text("you're so beautiful @ghbihuy")

'you are so beautiful'

In [None]:
preprocessing_pipeline(X[:2], y[:2])

{'aw that is a bummer you shoulda got david carr of third day to do it d': 'negative',
 'is upset that he can not update his facebook by texting it and might cry a a result school today also blah': 'negative'}
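Because the result is a dictionary keyed by the cleaned text, two tweets that clean to the same string end up as a single entry, and the label of the later one wins. The sketch below illustrates this with a stand-in cleaner. `toy_clean` and the three toy tweets are assumptions, and the list of pairs is only one possible alternative.

```python
def toy_clean(text):
    """Stand-in for the real cleaning pipeline: lowercase and collapse spaces."""
    return " ".join(text.lower().split())


# Toy (text, target) pairs; the first two clean to the same string
raw = [("Good morning", 4), ("good   morning", 0), ("So tired", 0)]

as_dict = {}
for text, label in raw:
    as_dict[toy_clean(text)] = label  # a later duplicate overwrites the earlier one

# A list of pairs keeps every example, including duplicates
as_pairs = [(toy_clean(text), label) for text, label in raw]

print(len(as_dict), len(as_pairs))
print(as_dict["good morning"])
```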

In [None]:
data = preprocessing_pipeline(X, y)

In [None]:
data

In [None]:
import json

# Serialize data into file (the with-block closes the file afterwards):
with open("sentiment-data-clean.json", "w") as f:
    json.dump(data, f)

# Read data from file:
# with open("file_name.json") as f:
#     data = json.load(f)

# Model

## Train-Dev-Test split: 0.8, 0.1, 0.1

In [16]:
# Download the already preprocessed data
! gdown 1LHWbtjvr95AKPg6tS8UZaU53eYDkL04x

Downloading...
From: https://drive.google.com/uc?id=1LHWbtjvr95AKPg6tS8UZaU53eYDkL04x
To: /content/sentiment-data-clean.json
100% 12.4M/12.4M [00:00<00:00, 28.9MB/s]


In [17]:
import numpy as np


def encode_labels(labels):
  encode_list = []
  for label in labels:
    if label == 'positive':
      encode_list.append(1)
    elif label == 'negative':
      encode_list.append(0)
  return np.array(encode_list)


In [18]:
# Your dictionary of sentences and labels
import json
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

with open("sentiment-data-clean.json") as f:
    data = json.load(f)

# Step 1: Data Preprocessing
sentences = list(data.keys())
labels = list(data.values())
# labels = encode_labels(labels)

# Step 2: Feature Extraction
# fit_transform learns the vocabulary and builds the document-term matrix in one call
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

# Step 3: Train-Dev-Test Split
# First, split the data into training and a temporary set (temp_set)
X_train, X_temp, y_train, y_temp = train_test_split(X, labels, test_size=0.2, random_state=42)

# Now, split the temporary set into development and testing sets
X_dev, X_test, y_dev, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print(f'Train dataset shape: X_train: {X_train.shape}, y_train: {len(y_train)}')
print(f'Dev dataset shape: X_dev: {X_dev.shape}, y_dev: {len(y_dev)}')
print(f'Test dataset shape: X_test: {X_test.shape}, y_test: {len(y_test)}')


Train dataset shape: X_train: (124521, 84954), y_train: 124521
Dev dataset shape: X_dev: (15565, 84954), y_dev: 15565
Test dataset shape: X_test: (15566, 84954), y_test: 15566
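The two-stage split above first holds out 20% of the data and then halves that part. The same 0.8 / 0.1 / 0.1 idea can be sketched without scikit-learn. `train_dev_test_split` is a hypothetical helper, and its shuffling will not reproduce the exact rows chosen by `train_test_split`.

```python
import random


def train_dev_test_split(items, train_frac=0.8, dev_frac=0.1, seed=42):
    """Shuffle once, then cut the list into train, dev, and test parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train_frac)
    n_dev = int(len(items) * dev_frac)
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # whatever remains goes to the test set
    return train, dev, test


train, dev, test = train_dev_test_split(range(100))
print(len(train), len(dev), len(test))
```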


In [26]:
def predict_sentiment(text, model):
  my_preprocessing_text = preprocessing_text(text)
  text_vector = vectorizer.transform([my_preprocessing_text])
  return model.predict(text_vector)

In [27]:
# text = "you looks beautiful"
# predict_sentiment(text, perceptron)

array(['positive'], dtype='<U8')

## Perceptron

In [23]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score, classification_report

# The labels stay as the strings 'negative' / 'positive'; scikit-learn encodes them internally


# Step 4: Perceptron Model
perceptron = Perceptron(
    alpha=0.0001,        # Regularization strength (only used when `penalty` is set; the default is None)
    fit_intercept=True,  # Whether to learn an intercept (bias) term
    max_iter=1000,       # Maximum number of passes over the training data (epochs)
    tol=1e-3,            # Tolerance for the stopping criterion
    shuffle=True,        # Shuffle the training data after each epoch
    random_state=42,     # Random seed
    early_stopping=True, # Hold out part of the training data and stop when its score stops improving
    n_iter_no_change=5,  # Number of epochs with no improvement to wait before stopping
    verbose=1,           # Verbosity (0 for no output, 1 for per-epoch output)
    n_jobs=-1            # CPU cores for one-vs-all training (only matters for multi-class problems)
)
# Step 5: Model Training
perceptron.fit(X_train, y_train)

# Step 6: Model Evaluation
y_pred = perceptron.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Classification Report:\n", classification_rep)


-- Epoch 1
Norm: 271.61, NNZs: 29785, Bias: 1.040000, T: 112068, Avg. loss: 1.865450
Total training time: 0.11 seconds.
-- Epoch 2
Norm: 367.86, NNZs: 38863, Bias: 1.120000, T: 224136, Avg. loss: 1.656593
Total training time: 0.17 seconds.
-- Epoch 3
Norm: 449.17, NNZs: 43895, Bias: 1.050000, T: 336204, Avg. loss: 1.532766
Total training time: 0.23 seconds.
-- Epoch 4
Norm: 519.12, NNZs: 47046, Bias: 1.150000, T: 448272, Avg. loss: 1.425961
Total training time: 0.28 seconds.
-- Epoch 5
Norm: 581.54, NNZs: 49164, Bias: 0.960000, T: 560340, Avg. loss: 1.359831
Total training time: 0.35 seconds.
-- Epoch 6
Norm: 639.79, NNZs: 50763, Bias: 1.020000, T: 672408, Avg. loss: 1.287197
Total training time: 0.42 seconds.
-- Epoch 7
Norm: 692.64, NNZs: 51992, Bias: 1.010000, T: 784476, Avg. loss: 1.240351
Total training time: 0.47 seconds.
Convergence after 7 epochs took 0.48 seconds
Accuracy: 0.7135423358602081
Classification Report:
               precision    recall  f1-score   support

    neg
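For intuition, the classic perceptron update rule can be written in a few lines of plain Python: whenever an example is misclassified, its feature vector is added to (or subtracted from) the weights. This is a textbook sketch, not scikit-learn's implementation, which adds shuffling, a stopping tolerance, and optional early stopping. The toy count vectors for a two-word vocabulary and the labels in {-1, +1} are assumptions.

```python
def train_perceptron(samples, labels, epochs=10):
    """Textbook perceptron: update weights only on misclassified examples."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):  # y is -1 or +1
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * score <= 0:             # wrong side of (or on) the boundary
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:                  # a full pass without errors: converged
            break
    return weights, bias


def predict(weights, bias, x):
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else -1


# Toy bag-of-words counts for the vocabulary ["love", "hate"]
X_toy = [[1, 0], [2, 0], [0, 1], [0, 2]]
y_toy = [1, 1, -1, -1]

weights, bias = train_perceptron(X_toy, y_toy)
print(weights, bias)
print([predict(weights, bias, x) for x in X_toy])
```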

In [None]:
def model_predict(text, model):
  # Clean the raw text with the same steps used for the training data
  text = preprocessing_text(text)
  text_vector = vectorizer.transform([text])
  return model.predict(text_vector)

model_predict('I love you my brother', perceptron)

array(['positive'], dtype='<U8')

### Save model

In [None]:
!pip install joblib



In [None]:
from joblib import dump, load
dump(perceptron, 'perceptron.joblib')

['perceptron.joblib']

In [None]:
clf = load('perceptron.joblib')
model_predict('I love you my brother', clf)

array(['positive'], dtype='<U8')

## Ridge Regression

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Step 4: Customize Hyperparameters for RidgeClassifier Model
ridge_classifier = RidgeClassifier(
    fit_intercept=True,  # Whether to calculate the intercept (default: True)
    solver='auto',       # Solver to use ('auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga')
    random_state=42
)

# Step 5: Model Training
ridge_classifier.fit(X_train, y_train)

# Step 6: Model Evaluation
y_pred = ridge_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Classification Report:\n", classification_rep)


Accuracy: 0.7616600282667352
Classification Report:
               precision    recall  f1-score   support

    negative       0.77      0.75      0.76      7887
    positive       0.75      0.77      0.76      7679

    accuracy                           0.76     15566
   macro avg       0.76      0.76      0.76     15566
weighted avg       0.76      0.76      0.76     15566



In [None]:
model_predict('I love you my brother', ridge_classifier)

array(['positive'], dtype='<U8')

### Save model

In [None]:
from joblib import dump, load
dump(ridge_classifier, 'ridge.joblib')

['ridge.joblib']

In [None]:
clf_2 = load('ridge.joblib')
model_predict('I love you my brother', clf_2)

array(['positive'], dtype='<U8')

## Maxent Model

In [None]:
import nltk
from nltk.classify import MaxentClassifier


# Feature extraction function
def extract_features(sentence):
    return {word: True for word in sentence.split()}

# Prepare the labeled data
labeled_data = [(extract_features(sentence), label) for sentence, label in data.items()]

# Split the data into training and testing sets
train_data, test_data = train_test_split(labeled_data, test_size=0.2, random_state=42)

# Train the MaxEnt model using the 'iis' algorithm
classifier = MaxentClassifier.train(train_data, algorithm='iis', max_iter=4)



  ==> Training (4 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.500
             2          -0.61663        0.820
             3          -0.56079        0.837
         Final          -0.51857        0.843
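A maximum entropy classifier scores each label by summing the weights of the active (feature, label) pairs and turns the scores into probabilities with a softmax. The sketch below shows only that scoring step. The weights in `TOY_WEIGHTS` are hand-set and hypothetical, and nothing here reproduces NLTK's IIS training.

```python
import math


def extract_features(sentence):
    # Same word-presence features as in the cell above
    return {word: True for word in sentence.split()}


# Hypothetical hand-set weights; a trained model would learn these values
TOY_WEIGHTS = {
    ("love", "positive"): 1.5,
    ("hate", "negative"): 2.0,
    ("you", "positive"): 0.1,
}
LABELS = ["negative", "positive"]


def label_probabilities(features):
    """Softmax over the summed weights of the active features for each label."""
    scores = {
        label: sum(TOY_WEIGHTS.get((feature, label), 0.0) for feature in features)
        for label in LABELS
    }
    normalizer = sum(math.exp(score) for score in scores.values())
    return {label: math.exp(score) / normalizer for label, score in scores.items()}


probs = label_probabilities(extract_features("i love you"))
print(probs)
```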


In [None]:
def maxent_predict(text, classifier):
  # Clean the raw text, then turn it into the word-presence features used in training
  text = preprocessing_text(text)
  features = extract_features(text)
  return classifier.classify(features)

maxent_predict('I love you brother', classifier)

'positive'

In [None]:
# Evaluate the model
y_test_maxent = [label for _, label in test_data]
y_pred_maxent = [classifier.classify(sentence[0]) for sentence in test_data]

accuracy = accuracy_score(y_test_maxent, y_pred_maxent)
classification_rep = classification_report(y_test_maxent, y_pred_maxent)

print(f"Accuracy: {accuracy:.2%}")
print("Classification Report:\n", classification_rep)

Accuracy: 76.58%
Classification Report:
               precision    recall  f1-score   support

    negative       0.75      0.81      0.78     15713
    positive       0.79      0.72      0.75     15418

    accuracy                           0.77     31131
   macro avg       0.77      0.77      0.77     31131
weighted avg       0.77      0.77      0.77     31131



In [None]:
# Classify new sentences
new_sentences = ['a positive example', 'a negative example']
for sentence in new_sentences:
    features = extract_features(sentence)
    prediction = classifier.classify(features)
    print(f'Sentence: {sentence}, Prediction: {prediction}')

Sentence: a positive example, Prediction: positive
Sentence: a negative example, Prediction: positive


### Save model

In [None]:
import pickle

# Save the trained MaxentClassifier
with open('maxent_classifier.pkl', 'wb') as classifier_file:
    pickle.dump(classifier, classifier_file)

In [None]:
import pickle

# Load the saved MaxentClassifier
with open('maxent_classifier.pkl', 'rb') as classifier_file:
    loaded_classifier = pickle.load(classifier_file)

maxent_predict('I love you brother', loaded_classifier)

'positive'

## CountVectorizer Explained (Optional)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Horizontal axis (columns): vocabulary
# Vertical axis (rows): documents
# Value: count of the word in the document

# Sample documents
documents = ["This is the first document.",
              "This document is the second document.",
              "And this is the third one.",
              "Is this the first document?"]

# Use separate names so the `vectorizer` and `X` fitted on the tweets are not overwritten
demo_vectorizer = CountVectorizer()

# Fit and transform the documents
X_demo = demo_vectorizer.fit_transform(documents)

# Get the feature names (words) in the vocabulary
feature_names = demo_vectorizer.get_feature_names_out()

# Convert the sparse matrix to a dense array for easier exploration
X_array = X_demo.toarray()

# Display the matrix and feature names
print("Feature names:", feature_names)
print("Document-term matrix:")
print(X_array)

Feature names: ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
Document-term matrix:
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
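The matrix above can be reproduced by hand to see what the vectorizer does. The sketch below lowercases each document, keeps tokens of two or more word characters (which mirrors `CountVectorizer`'s default token pattern), sorts the vocabulary, and counts. The helper name `tokenize` is an assumption.

```python
import re
from collections import Counter

documents = ["This is the first document.",
             "This document is the second document.",
             "And this is the third one.",
             "Is this the first document?"]


def tokenize(doc):
    # Lowercase, then keep runs of at least two word characters
    return re.findall(r"\b\w\w+\b", doc.lower())


# Columns: the sorted vocabulary across all documents
vocabulary = sorted({token for doc in documents for token in tokenize(doc)})

# Rows: one count vector per document
matrix = []
for doc in documents:
    counts = Counter(tokenize(doc))
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```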


## Deep Learning Model

## LSTM

### Build model

In [43]:
max_length

40

In [28]:
import json
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Step 1: Load the preprocessed data
with open("sentiment-data-clean.json") as f:
    data = json.load(f)

texts = list(data.keys())
my_labels = list(data.values())


# Step 2: Label Encoding ('negative' -> 0, 'positive' -> 1)
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(my_labels)

# Step 3: Tokenization
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# Step 4: Padding Sequences
max_length = max(len(seq) for seq in sequences)
padded_sequences = pad_sequences(sequences, maxlen=max_length, padding='post')


# Step 5: Train, dev, test split
# First, split the data into training and a temporary set (temp_set)
X_train, X_temp, y_train, y_temp = train_test_split(padded_sequences, encoded_labels, test_size=0.2, random_state=42)

# Now, split the temporary set into development and testing sets
X_dev, X_test, y_dev, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print(f'Train dataset shape: X_train: {X_train.shape}, y_train: {len(y_train)}')
print(f'Dev dataset shape: X_dev: {X_dev.shape}, y_dev: {len(y_dev)}')
print(f'Test dataset shape: X_test: {X_test.shape}, y_test: {len(y_test)}')

Train dataset shape: X_train: (124521, 40), y_train: 124521
Dev dataset shape: X_dev: (15565, 40), y_dev: 15565
Test dataset shape: X_test: (15566, 40), y_test: 15566
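The padding step gives every sequence the same length, which is where the second dimension of 40 comes from. A plain-Python sketch of the idea follows: zeros are appended at the end ("post" padding), and sequences that are too long keep only their last `maxlen` tokens, which matches the default truncation side of `pad_sequences`. The helper `pad_post` and the toy sequences are assumptions.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end with `value`; truncate from the front if too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                        # keep the last maxlen tokens
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


toy_sequences = [[1, 2], [3, 4, 5, 6, 7]]
padded = pad_post(toy_sequences, maxlen=4)
print(padded)
```

Index 0 is reserved for padding, which is why the embedding layer later uses `len(tokenizer.word_index) + 1` as its input dimension.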


In [None]:
import tensorflow as tf
import json
from tensorflow.keras.regularizers import l2
from tensorflow.keras.callbacks import EarlyStopping


# Define the embedding dimension (e.g., 64)
embedding_dim = 64
epochs = 20
batch_size = 128

# Optimizer arguments
learning_rate = 0.001
beta_1 = 0.9
beta_2 = 0.999
epsilon = 1e-07  # Small constant for numerical stability



# Step 5: Define the RNN Model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=embedding_dim, input_length=max_length),
    tf.keras.layers.SimpleRNN(64, kernel_regularizer=l2(0.0001), recurrent_regularizer=l2(0.0001), bias_regularizer=l2(0.0001)),
    tf.keras.layers.Dropout(0.25),  # Add dropout
    tf.keras.layers.Dense(1, activation='sigmoid')
])

optimizer = tf.keras.optimizers.Adam(
    learning_rate=learning_rate,
    beta_1=beta_1,
    beta_2=beta_2,
    epsilon=epsilon
)

early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)


# Step 6: Compile the Model
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

# Step 7: Train the Model
history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_data=(X_dev, y_dev), callbacks=[early_stopping])


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
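Training can end before the 20 configured epochs because of the early-stopping callback: it tracks the best validation loss and stops once `patience` epochs pass without improvement. The sketch below illustrates that logic in plain Python. It is a simplification and not the Keras implementation, and both `early_stopping_epoch` and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (epoch at which training stops, epoch with the best loss)."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:           # no improvement for `patience` epochs
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses, one per epoch (0-indexed)
toy_losses = [0.60, 0.55, 0.56, 0.57, 0.58]
stopped_at, best_at = early_stopping_epoch(toy_losses, patience=2)
print(stopped_at, best_at)
```

With `restore_best_weights=True`, the model would afterwards be reset to the weights from the best epoch rather than the last one.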


In [None]:
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.regularizers import l2



# Define the embedding dimension (e.g., 64)
embedding_dim = 64
epochs = 20
batch_size = 128

# Optimizer arguments
learning_rate = 0.001
beta_1 = 0.9
beta_2 = 0.999
epsilon = 1e-07  # Small constant for numerical stability

# Step 5: Define the RNN Model with LSTM
model1 = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=embedding_dim, input_length=max_length),
    tf.keras.layers.LSTM(64, kernel_regularizer=l2(0.0001), recurrent_regularizer=l2(0.0001), bias_regularizer=l2(0.0001)),
    tf.keras.layers.Dropout(0.2),  # Add dropout
    tf.keras.layers.Dense(1, activation='sigmoid')
])

optimizer = tf.keras.optimizers.Adam(
    learning_rate=learning_rate,
    beta_1=beta_1,
    beta_2=beta_2,
    epsilon=epsilon
)

early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)


# Step 6: Compile the Model
model1.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

# Step 7: Train the Model
history = model1.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_data=(X_dev, y_dev), callbacks=[early_stopping])


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20


In [None]:
y_pred = model1.predict(X_test)



In [None]:
import numpy as np

def y_round(y_pred, threshold):
  # Convert predicted probabilities of shape (n, 1) into a flat array of 0/1 labels
  return (np.asarray(y_pred) >= threshold).astype(int).ravel()
my_y_pred = y_round(y_pred, 0.5)

In [None]:
from sklearn.metrics import accuracy_score, classification_report


# y_pred = model1.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, my_y_pred)
classification_rep = classification_report(y_test, my_y_pred)

print(f"Accuracy: {accuracy}")
print("Classification Report:\n", classification_rep)


Accuracy: 0.7831812925607092
Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.80      0.79      7887
           1       0.79      0.77      0.78      7679

    accuracy                           0.78     15566
   macro avg       0.78      0.78      0.78     15566
weighted avg       0.78      0.78      0.78     15566
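The columns of the report follow directly from counts of true positives, false positives, and false negatives. The sketch below computes them for one class in plain Python. `binary_report` and the toy label lists are assumptions, not values from the model above.

```python
def binary_report(y_true, y_pred, positive=1):
    """Precision, recall, F1 for the `positive` class, plus overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 1]
report = binary_report(toy_true, toy_pred)
print(report)
```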



### Save model and tokenizer

In [None]:
model1.save('lstm.h5')



In [None]:
# Serialize the Tokenizer to JSON
tokenizer_json = tokenizer.to_json()

# Save the Tokenizer JSON to a file
with open('tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(tokenizer_json, ensure_ascii=False, indent=4))

### Load model and tokenizer

In [29]:
!gdown 10Ok_uWUZ0tXBL6HnjglImEW-xHxMgTj9
!gdown 1Ouqs4xUtha_9peSwzayd2Un4CUNNmk-l




Downloading...
From: https://drive.google.com/uc?id=10Ok_uWUZ0tXBL6HnjglImEW-xHxMgTj9
To: /content/lstm.h5
100% 65.7M/65.7M [00:01<00:00, 39.4MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Ouqs4xUtha_9peSwzayd2Un4CUNNmk-l
To: /content/tokenizer.json
100% 10.1M/10.1M [00:00<00:00, 30.1MB/s]


In [30]:
import tensorflow as tf

# Load the Tokenizer from JSON
with open('tokenizer.json', 'r', encoding='utf-8') as f:
    tokenizer_json = json.load(f)
    tokenizer = tf.keras.preprocessing.text.tokenizer_from_json(tokenizer_json)


In [31]:
import tensorflow as tf

# Load the model
model = tf.keras.models.load_model('lstm.h5')

In [32]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 40, 64)            5438784   
                                                                 
 lstm (LSTM)                 (None, 64)                33024     
                                                                 
 dropout (Dropout)           (None, 64)                0         
                                                                 
 dense (Dense)               (None, 1)                 65        
                                                                 
Total params: 5471873 (20.87 MB)
Trainable params: 5471873 (20.87 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [33]:
y_pred = model.predict(X_test)




In [34]:
import numpy as np

def y_round(y_pred, threshold):
  # Convert predicted probabilities of shape (n, 1) into a flat array of 0/1 labels
  return (np.asarray(y_pred) >= threshold).astype(int).ravel()
my_y_pred = y_round(y_pred, 0.5)

In [35]:
from sklearn.metrics import accuracy_score, classification_report


# y_pred = model1.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, my_y_pred)
classification_rep = classification_report(y_test, my_y_pred)

print(f"Accuracy: {accuracy}")
print("Classification Report:\n", classification_rep)


Accuracy: 0.7847873570602596
Classification Report:
               precision    recall  f1-score   support

           0       0.77      0.83      0.80      7887
           1       0.81      0.74      0.77      7679

    accuracy                           0.78     15566
   macro avg       0.79      0.78      0.78     15566
weighted avg       0.79      0.78      0.78     15566



In [61]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

def lstm_predict(model, tokenizer, text):
  max_length = 40
  threshold = 0.5
  my_preprocessing_text = preprocessing_text(text)
  token = tokenizer.texts_to_sequences([my_preprocessing_text])
  padded_sequences = pad_sequences(token, maxlen=max_length, padding='post')
  y_pred = model.predict(padded_sequences)
  return y_round(y_pred, threshold)


In [63]:
my_text = "you're so bad"
lstm_predict(model, tokenizer, my_text)



array([0])