**Model to classify SMS as either spam or non-spam**

https://miro.medium.com/v2/resize:fit:620/0*ww5ON8MjhmUdMvoi.jpeg


**Overview/Introduction:**
In today's digital age, mobile devices and Short Message Service (SMS) have become ubiquitous tools for communication. However, this convenience also brings the challenge of dealing with spam messages. These unwanted texts can not only be annoying but also pose security risks, such as phishing attempts. This project aims to leverage artificial intelligence to effectively detect and categorize spam messages, thereby providing a shield for users and enhancing their overall communication experience.

**Problem Statement:**
The landscape of spam messages is constantly evolving to evade detection, making their identification a complex task. It is essential to accurately identify these messages while ensuring that legitimate messages are not incorrectly labeled as spam and no spam messages are missed. Given the high volume and rapid transmission of messages, manual inspection is not feasible. We require an automated, accurate, and adaptable solution.

**Objectives:**
1. Develop a machine learning model that accurately classifies SMS messages into spam or ham (legitimate) categories.
2. Ensure the model's capability to adapt to the changing nature of spam messages, maintaining a high level of accuracy over time.
3. Minimize the risk of mislabeling messages, preventing legitimate messages from being marked as spam and ensuring the detection of spam messages.
4. Evaluate different algorithms to identify the most suitable one for the task.
5. Contribute to ongoing research in spam detection by providing a robust and scalable solution that can potentially be extended to other forms of digital communication.

**Dataset Description:**
**Context:** The SMS Spam Collection is a curated dataset of SMS messages designed to support research in SMS spam detection. It comprises a total of 5,574 English SMS messages, each explicitly labeled as either ham (legitimate) or spam.

**Content:** The dataset is straightforward and well-structured. Each row represents an individual message and is divided into two columns: 'v1' for labeling the message as spam or ham, and 'v2' containing the actual message text.

**Source:** The data has been sourced from various origins, primarily those that are freely available or intended for research purposes:
1. 425 spam SMS messages were manually selected from the Grumbletext website, a UK forum where mobile users discuss spam-related issues. This selection process involved sifting through numerous web pages to identify the spam content.
2. 3,375 legitimate SMS messages were randomly sampled from the NUS SMS Corpus (NSC), a collection of approximately 10,000 genuine messages used for research at the National University of Singapore. These messages are primarily from students in Singapore.
3. An additional 450 legitimate SMS messages were extracted from Caroline Tagg's Ph.D. thesis.
4. Finally, the SMS Spam Corpus v.0.1 Big, a publicly available dataset containing 1,002 legitimate messages and 322 spam messages, was included in the dataset.

This dataset serves as a valuable resource for advancing the field of SMS spam detection and contributes to the development of effective anti-spam solutions.
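
As a quick sanity check, the per-source counts listed above can be added up to confirm the stated total and to see how imbalanced the two classes are. This is only an illustrative sketch; the dictionary keys and variable names are made up for the example.

```python
# Message counts per source, taken from the source list above
spam_sources = {"Grumbletext": 425, "SMS Spam Corpus v.0.1 Big": 322}
ham_sources = {
    "NUS SMS Corpus": 3375,
    "Ph.D. thesis messages": 450,
    "SMS Spam Corpus v.0.1 Big": 1002,
}

n_spam = sum(spam_sources.values())
n_ham = sum(ham_sources.values())
total = n_spam + n_ham
spam_share = n_spam / total  # fraction of all messages that are spam

print(f"spam: {n_spam}, ham: {n_ham}, total: {total}")
print(f"spam share: {spam_share:.1%}")
```

The counts add up to the 5,574 messages stated in the context paragraph, and spam makes up only a small minority of them, which is why the notebook rebalances the classes before training.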



**Technologies Used**

- **Python**: The primary programming language used for building machine learning models and conducting data analysis.

- **Scikit-Learn (Sklearn)**: A Python library for machine learning that offers efficient tools for data analysis, model building, and evaluation.

- **TensorFlow**: An open-source machine learning platform developed by the Google Brain team, used for building and training deep learning models.

- **TensorFlow Hub**: A library that simplifies the sharing and consumption of pre-trained machine learning models, facilitating model reuse and integration.

- **Matplotlib**: A Python library for creating a wide range of static, animated, and interactive data visualizations to aid in data exploration and presentation.

- **NLTK (Natural Language Toolkit)**: A Python library designed for working with human language data, enabling natural language processing tasks such as text analysis and classification.

**Methodology:**
This project employed a supervised learning approach for text classification, with the primary objective of using artificial intelligence to identify spam SMS messages. A variety of machine learning and deep learning models were utilized, including LSTM, Bidirectional LSTM, GRU, a simple dense network, a model based on the Neural Network Language Model (NNLM), and a model utilizing the Universal Sentence Encoder (USE). Additionally, Conv1D and Bidirectional GRU models were included, along with a baseline model for benchmarking.

**Implementation:**
The implementation began with crucial text data preprocessing, a fundamental step in effective Natural Language Processing (NLP). This preprocessing involved tasks such as removing unwanted elements like emojis, newline characters, URLs, mentions, and special characters. Additionally, text was converted to lowercase, common stop words were eliminated, relevant hashtags were retained, and lemmatization was applied.

The initial model was a baseline Multinomial Naive Bayes classifier with TF-IDF vectorization. Subsequently, various models, including simple dense networks, LSTM, Bidirectional LSTM, GRU, Bidirectional GRU, and Conv1D, were designed and trained using the dataset. Two additional models were created: one using the Neural Network Language Model (NNLM) and another incorporating the Universal Sentence Encoder (USE) for generating the embedding layer.

**Results:**
Comprehensive model evaluation was conducted using critical performance metrics, including accuracy, precision, recall, F1-score, and loss. Notably, the simple dense network achieved perfect scores across all metrics, closely followed by the LSTM and an ensemble model, formed by combining multiple models. Models based on LSTM, Bidirectional GRU, and the USE also demonstrated strong performance.

**Discussion/Interpretation of Results:**
The results were insightful, highlighting the outstanding performance of the simple dense network in text classification tasks, particularly for spam detection. The LSTM model also exhibited exceptional performance, underscoring the effectiveness of recurrent neural networks for such tasks. The ensemble model, leveraging the strengths of multiple models, secured a solid position in terms of performance, affirming the advantages of ensemble methods in enhancing model robustness.

**Conclusion:**
This project underscored the potential of machine learning and deep learning models in text classification, specifically for the identification of spam SMS messages. A diverse array of model architectures was explored, ranging from straightforward dense networks to more intricate structures like LSTM and GRU. The results demonstrated that well-designed models can achieve impressive accuracy. Future work can focus on further enhancing these models through larger datasets, refined data preprocessing techniques, hyperparameter optimization, and the exploration of alternative architectures and ensemble strategies.

**Acknowledgments:**
The original dataset used in this project can be accessed at the following URL: [SMS Spam Collection Dataset](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection).



In [None]:
import numpy as np # Import the NumPy library for numerical operations
import pandas as pd # Import the Pandas library for data processing and manipulation
import tensorflow as tf # Import TensorFlow for deep learning operations

# Input data files are available in the read-only "/kaggle/input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

# Use the os library to traverse the directory structure starting from '/kaggle/input'
import os

# Iterate through the directory structure and list all files in the '/kaggle/input' directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        # Print the full path to each file in the '/kaggle/input' directory
        print(os.path.join(dirname, filename))

# Alternative: the same traversal for any directory of your choice

# Specify the directory path you want to list files from
directory_path = '/your/directory/path'  # Replace with your directory path

# Use the os library to traverse the specified directory structure
for dirname, _, filenames in os.walk(directory_path):
    for filename in filenames:
        # Print the full path to each file in the specified directory
        print(os.path.join(dirname, filename))


In [None]:
import requests
import importlib.util

# Define the URL of the helper functions script
helper_functions_url = "https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py"

# Define the local file name for the downloaded script
local_script_filename = "helper_functions.py"

# Download the script from the specified URL
response = requests.get(helper_functions_url)

# Check if the download was successful
if response.status_code == 200:
    # Save the downloaded script to a local file
    with open(local_script_filename, "w") as script_file:
        script_file.write(response.text)

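    # Load the downloaded script as a module so its functions can be used
    spec = importlib.util.spec_from_file_location("helper_functions", local_script_filename)
    helper_functions = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(helper_functions)
else:
    print(f"Download failed with status code {response.status_code}")

# Alternative: download with wget (notebook shell command) and import directly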
# Import a series of helper functions for the notebook

# Download the helper functions script from a specified URL
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py

# Import specific functions from the downloaded script
from helper_functions import unzip_data, create_tensorboard_callback, plot_loss_curves, compare_historys


**Import with Pandas**

In [None]:
DATA_DIR='/kaggle/input/ham-spam-messages-dataset/spam-ham v2.csv'

In [None]:
# Load CSV file into a DataFrame
df = pd.read_csv(DATA_DIR, encoding='latin1')

# Display the DataFrame
df.head(20)

The next cell renames the columns 'v1' and 'v2' to the more descriptive 'label' and 'text' and assigns the result back to 'df'.


In [None]:
# Rename the columns and assign the modified DataFrame back to 'df'
df = df.rename(columns={'v1': 'label', 'v2': 'text'})


In [None]:
df.info()

The next cell computes the value counts of the 'label' column and plots them as a bar chart and a pie chart to show the class distribution.

In [None]:
import matplotlib.pyplot as plt

# Calculate the value counts of the 'label' column
category_counts = df['label'].value_counts()

# Bar chart
plt.figure(figsize=(6, 4))
category_counts.plot(kind='bar')
plt.xlabel('Label')
plt.ylabel('Counts')
plt.title('Bar Chart of Counts')
plt.show()
print()

# Pie chart
plt.figure(figsize=(6, 4))
category_counts.plot(kind='pie', autopct='%1.1f%%')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle
plt.title('Pie Chart of Distribution')

# Add legend
plt.legend()

plt.show()

**Duplicates and Missing Values**

The next cell reports the number of rows, the number of fully duplicated rows, and the number of missing values per column and in total.

In [None]:
# How many messages do we have?
print('There are', df.shape[0], 'rows in this dataset')

# Do we have duplicates?
print('Number of Duplicates:', len(df[df.duplicated()]))

# Do we have missing values?
missing_values = df.isnull().sum()
print('Number of Missing Values by column:\n',missing_values)

print('Number of Missing Values:', df.isnull().sum().sum())

Empty strings are then replaced with NaN using 'df.replace("", np.nan, inplace=True)' so that blank entries are counted as missing, and the per-column missing-value counts are printed again.

In [None]:
df.replace("", np.nan, inplace=True)
missing_values = df.isnull().sum()
print('Number of Missing Values and Empty Spaces by column:\n',missing_values)

**Review Duplicates**

The code selects every row that has an exact duplicate across all columns (keep=False flags all copies, not only the repeats), sorts them so that identical rows are adjacent, and displays the first 20 rows.

In [None]:
# First, get all duplicate rows (keep=False ensures all duplicates are kept)
duplicate_rows = df[df.duplicated(keep=False)]

# Then sort the dataframe on all columns to ensure duplicates are adjacent
sorted_duplicates = duplicate_rows.sort_values(by=list(duplicate_rows.columns))

# Show the first 20 rows; identical rows sit next to each other after sorting
duplicate_preview = sorted_duplicates.head(20)
duplicate_preview
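
To make the difference between the two duplicate checks concrete, here is a small pure-Python sketch. It mimics what `df.duplicated()` (default `keep='first'`) and `df.duplicated(keep=False)` flag, using made-up rows rather than the real dataset.

```python
from collections import Counter

# Toy (label, text) rows; the values are made up for illustration
rows = [
    ("ham", "ok see you later"),
    ("spam", "free entry win a prize"),
    ("ham", "ok see you later"),
    ("ham", "call me"),
    ("spam", "free entry win a prize"),
    ("ham", "ok see you later"),
]

counts = Counter(rows)

# Like df.duplicated(): every occurrence after the first one is flagged
seen = set()
extra_copies = []
for row in rows:
    extra_copies.append(row in seen)
    seen.add(row)

# Like df.duplicated(keep=False): every member of a duplicated group is flagged
all_members = [counts[row] > 1 for row in rows]

print("rows that drop_duplicates() would remove:", sum(extra_copies))
print("rows shown when reviewing duplicates:", sum(all_members))
```

The first number is what the earlier "Number of Duplicates" printout counts; the second is larger because it also includes the first occurrence of each repeated row.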

**Drop Duplicates**

This code removes duplicate rows from the DataFrame 'df'. Removing duplicate rows ensures that each row in the DataFrame is unique and eliminates redundant data, which can affect analysis and modeling.

In [None]:
df = df.drop_duplicates()
print('Number of Duplicates:', len(df[df.duplicated()]))

**Drop Missing Values**

This code drops rows containing missing values (NaN) from the DataFrame 'df'. Removing rows with missing values is important to ensure data quality and avoid errors or biased analysis caused by missing information.

In [None]:
df = df.dropna()
print('Number of Missing Values:', df.isnull().sum().sum())

In [None]:
df.info()

**View random samples for each category**

This code defines a function 'random_sample_reviews' that takes a DataFrame 'df' and the number of samples to be extracted as input and returns a DataFrame with random samples from each group based on the 'label' column. This can be useful for data exploration, model training, or analysis.

In [None]:
def random_sample_reviews(df, num_samples):
    # Use groupby on 'label' and then apply the sample function to 'text' of each group
    samples = df.groupby('label')['text'].apply(lambda x: x.sample(num_samples))

    # Convert series to dataframe and reset index
    # samples_df = samples.reset_index()
    samples_df = samples.reset_index().drop(columns='level_1')

    return samples_df

pd.set_option('display.max_colwidth', 200) # Display up to 200 characters per column
samples = random_sample_reviews(df, num_samples=10)
samples.head(20)

**Data Cleaning**

The following functions perform various text cleaning operations, such as removing emojis, links, mentions, punctuation, and extra spaces. Text cleaning is an essential preprocessing step in natural language processing because it removes noise and irrelevant characters and standardizes the text for analysis and modeling.

In [None]:
# Necessary libraries
from sklearn.model_selection import train_test_split
from sklearn import metrics

import re
import string

from tensorflow import keras
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding
from tensorflow.keras.layers import SimpleRNN, LSTM
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
def strip_emoji(text):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing, geometric shapes, misc symbols and arrows
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u200d"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector-16 (emoji presentation)
        u"\u3030"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

# Remove punctuation, links, mentions and \r\n newline characters
def strip_all_entities(text):
    text = text.replace('\r', '').replace('\n', ' ').lower()  # drop \r, turn \n into a space, lowercase
    text = re.sub(r"(?:@|https?://)\S+", "", text)  # remove links and mentions
    text = re.sub(r'[^\x00-\x7f]', '', text)  # remove non-ASCII characters such as '\x9a\x91\x97\x9a\x97'
    banned_list = string.punctuation + 'Ã'+'±'+'ã'+'¼'+'â'+'»'+'§'
    table = str.maketrans('', '', banned_list)
    text = text.translate(table)
    return text

# Remove hashtags at the end of the sentence; keep those in the middle but strip the '#' symbol
def clean_hashtags(text):
    # Raw strings keep \b, \w and \s as regex escapes (in a normal string, \b is a backspace character)
    new_text = " ".join(word.strip() for word in re.split(r'#(?!(?:hashtag)\b)[\w-]+(?=(?:\s+#[\w-]+)*\s*$)', text))  # remove trailing hashtags
    new_text2 = " ".join(word.strip() for word in re.split(r'#|_', new_text))  # remove '#' and '_' from words in the middle of the sentence
    return new_text2

#Filter special characters such as & and $ present in some words
def filter_chars(a):
    sent = []
    for word in a.split(' '):
        if ('$' in word) | ('&' in word):
            sent.append('')
        else:
            sent.append(word)
    return ' '.join(sent)

def remove_mult_spaces(text):  # collapse runs of whitespace into a single space
    return re.sub(r"\s\s+", " ", text)

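This code applies the text cleaning functions in sequence to the 'text' column of the DataFrame 'df' and assigns the cleaned text to a new column 'text1'. Note that strip_all_entities already removes all punctuation, including '#', '_', '$', and '&', so in this order clean_hashtags and filter_chars have little left to act on.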

In [None]:
df['text1'] = (df['text']
                     .apply(strip_emoji)
                     .apply(strip_all_entities)
                     .apply(clean_hashtags)
                     .apply(filter_chars)
                     .apply(remove_mult_spaces))
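
To see what the cleaning chain does to a single message, the sketch below condenses the main steps (lowercasing, link and mention removal, non-ASCII removal, punctuation stripping, whitespace collapsing) into one function. `clean_message` and the sample text are hypothetical stand-ins written for this example, not the notebook's functions; the emoji and hashtag steps are left out for brevity.

```python
import re
import string

def clean_message(text):
    """Condensed stand-in for the cleaning chain (illustration only)."""
    text = text.replace('\r', '').replace('\n', ' ').lower()          # normalise newlines, lowercase
    text = re.sub(r"(?:@|https?://)\S+", "", text)                    # drop links and mentions
    text = re.sub(r"[^\x00-\x7f]", "", text)                          # drop non-ASCII characters
    text = text.translate(str.maketrans('', '', string.punctuation))  # drop punctuation
    return re.sub(r"\s\s+", " ", text).strip()                        # collapse repeated spaces

# Made-up spam-like message containing a link, a non-ASCII symbol and a newline
sample = "WINNER!! Claim your \u00a3900 prize at http://example.com now\nCall 09061701461"
cleaned = clean_message(sample)
print(cleaned)
```

The link, the pound sign, and the exclamation marks disappear, while the words and digits survive in lowercase.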

In [None]:
df.head()

This code creates a DataFrame 'df_comparison' that places the original and cleaned text side by side, together with their word counts, to show the impact of cleaning on text length.

In [None]:
df_comparison = pd.DataFrame()

# Original text and its length
df_comparison['pre-clean text'] = df['text']
df_comparison['pre-clean len'] = df['text'].apply(lambda x: len(str(x).split()))

# Cleaned text and its length
df_comparison['post-clean text'] = df['text1']
df_comparison['post-clean len'] = df['text1'].apply(lambda x: len(str(x).split()))

df_comparison.head(20)

**Remove Stopwords**

This next code defines a function 'remove_stopwords' that takes a sentence as input and removes common stopwords from it. Removing stopwords can help eliminate words that carry little semantic meaning and improve the quality of text analysis or modeling tasks.

In [None]:
def remove_stopwords(sentence):
    """
    Removes a list of stopwords

    Args:
        sentence (string): sentence to remove the stopwords from

    Returns:
        sentence (string): lowercase sentence without the stopwords
    """
    # List of stopwords
    stopwords = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves" ]

    # Sentence converted to lowercase-only
    sentence = sentence.lower()

    words = sentence.split()
    no_words = [w for w in words if w not in stopwords]
    sentence = " ".join(no_words)

    return sentence


This code applies the 'remove_stopwords' function to the 'text1' column of the DataFrame 'df' and assigns the stopwords-removed text to a new column 'text2'.

In [None]:
df['text2'] = (df['text1'].apply(remove_stopwords))

The following code creates a DataFrame 'df_comp' that compares the text before and after stopword removal, together with the word counts, to show the impact of removing stopwords on text length.

In [None]:
df_comp = pd.DataFrame()

# Original text and its length
df_comp['pre-clean text'] = df['text1']
df_comp['pre-clean len'] = df['text1'].apply(lambda x: len(str(x).split()))

# Cleaned text and its length
df_comp['post-clean text'] = df['text2']
df_comp['post-clean len'] = df['text2'].apply(lambda x: len(str(x).split()))

df_comp.head(20)

**Lemmatization**

The following code defines a lemmatization function, which uses the WordNetLemmatizer from the NLTK library to lemmatize a given text. Lemmatization reduces words to their base or root form, allowing for better analysis and modeling.

In [None]:
!unzip /usr/share/nltk_data/corpora/wordnet.zip -d /usr/share/nltk_data/corpora/

import nltk
nltk.download('wordnet')
nltk.download('punkt')

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    # Tokenize the sentence
    word_list = nltk.word_tokenize(text)

    # Lemmatize list of words and join
    lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])

    return lemmatized_output

The lemmatization function is then applied to the 'text2' column of the DataFrame 'df' using the apply method, and the lemmatized output is assigned to the 'text3' column.

In [None]:
df['text3'] = df['text2'].apply(lemmatize_text)

A new DataFrame, 'df_lemma', is created to store the original and lemmatized text data along with their respective lengths.

In [None]:
df_lemma = pd.DataFrame()

# Original text and its length
df_lemma['pre-clean text'] = df['text2']
df_lemma['pre-clean len'] = df['text2'].apply(lambda x: len(str(x).split()))

# Cleaned text and its length
df_lemma['post-clean text'] = df['text3']
df_lemma['post-clean len'] = df['text3'].apply(lambda x: len(str(x).split()))

df_lemma.head(20)


**Text Length**

The code calculates the length (in words) of each text in the 'text3' column and finds the 95th percentile of the text lengths.

In [None]:
df['text_length'] = df['text3'].apply(lambda x: len(str(x).split()))

In [None]:
# Calculate the length of each text
text_lengths = [len(text.split()) for text in df["text3"]]

# Find the 95th percentile
percentile_95 = np.percentile(text_lengths, 95)

print(f"95th Percentile of Text Lengths: {percentile_95}")
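
For readers unfamiliar with how a percentile is obtained, the following pure-Python sketch reproduces the linear-interpolation rule that `np.percentile` uses by default. The function name and the toy word counts are made up for illustration.

```python
import math

def percentile_linear(values, q):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100  # fractional index into the sorted data
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

toy_lengths = [3, 5, 8, 4, 20, 6, 7, 5, 9, 6, 30]  # made-up word counts
p95 = percentile_linear(toy_lengths, 95)
median = percentile_linear(toy_lengths, 50)
print(p95, median)
```

In this toy data a couple of long messages pull the 95th percentile far above the median, which is the kind of long tail that makes a percentile a more useful length cutoff than the maximum.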

A histogram of the text lengths is plotted with Matplotlib to visualize their distribution. The 95th percentile is marked with a red dashed line.

In [None]:
# Plotting the histogram
plt.figure(figsize=(10, 6))
plt.hist(text_lengths, bins=20, edgecolor='black')
plt.xlabel('Number of Words')
plt.ylabel('Frequency')
plt.title('Distribution of Text Lengths')

# Adding a vertical line for the 95th percentile
percentile_95 = np.percentile(text_lengths, 95)
plt.axvline(x=percentile_95, color='red', linestyle='--', label='95th Percentile')
plt.legend()

plt.grid(True)
plt.show()

Descriptive statistics of the text lengths are displayed using the describe method of the DataFrame 'df.text_length'.

In [None]:
df.text_length.describe()

**Visualize text with fewer than 10 words**

A count plot shows how many texts there are at each length below 10 words, filtering the data on the 'text_length' column.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(7,5))
ax = sns.countplot(x='text_length', data=df[df['text_length']<10], palette='mako')
plt.title('Texts with fewer than 10 words')
plt.yticks([])
ax.bar_label(ax.containers[0])
plt.ylabel('Count')
plt.xlabel('')
plt.show()


A subset of the DataFrame 'df' is created, including only the rows where the text length is less than 2.

In [None]:
data_head=df[df['text_length']<2]
data_head.head(30)

**Remove rows**

As the output shows, rows with fewer than 2 words are either empty or carry very little information, so they are dropped from the DataFrame 'df'.

In [None]:
df = df[df['text_length'] >= 2]

**Drop columns and shuffle**

Columns 'text', 'text1', and 'text2' are dropped from the DataFrame 'df'.

In [None]:
df = df.drop(['text', 'text1', 'text2'], axis=1)

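The DataFrame 'df' is shuffled so that the row order carries no pattern (for example, blocks of messages from the same source). Shuffling does not change the class balance, which is addressed later by oversampling.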

In [None]:
# Shuffle training dataframe
df = df.sample(frac=1, random_state=42) # shuffle with random_state=42 for reproducibility
df.head()

In [None]:
df.label.value_counts()

**Recategorise Labels**

Label encoding is applied to the 'label' column of the DataFrame 'df', converting categorical labels into numerical representations. A DataFrame 'dr' is then created to display the original labels and their corresponding encoded values.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Define the encoder
le = LabelEncoder()

# Apply label encoding
df['label_le'] = le.fit_transform(df['label'])

# Define data
data = {
    'Label': le.classes_,
    'Label Encoded': le.transform(le.classes_)
}

# Create DataFrame
dr = pd.DataFrame(data)
dr
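
As a rough illustration of what label encoding does, the sketch below builds the same kind of mapping by hand: the unique class names are sorted and each one is assigned its position in that order, which is also how `LabelEncoder` orders its `classes_`. The toy labels are made up.

```python
labels = ["ham", "spam", "ham", "ham", "spam"]  # toy labels

# Sorted unique classes play the role of le.classes_
classes = sorted(set(labels))
class_to_index = {name: index for index, name in enumerate(classes)}

# Equivalent of le.fit_transform(labels) and le.inverse_transform(encoded)
encoded = [class_to_index[name] for name in labels]
decoded = [classes[index] for index in encoded]

print(class_to_index)
print(encoded)
```

Because the classes are sorted alphabetically, 'ham' maps to 0 and 'spam' maps to 1, so the positive class in later metrics is spam.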

The 'Label' column of the DataFrame 'dr' is converted into a list, 'class_names', for further use in classification tasks.

In [None]:
class_names=dr.Label.tolist()
class_names

**Data Balancing**

The following code oversamples the data with RandomOverSampler from the imbalanced-learn library. It balances the classes by randomly duplicating existing minority-class (spam) rows; no new synthetic messages are generated.

In [None]:
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)  # fixed seed so the resampling is reproducible
train_x, train_y = ros.fit_resample(np.array(df['text3']).reshape(-1, 1), np.array(df['label_le']))
train_os = pd.DataFrame(list(zip([x[0] for x in train_x], train_y)), columns=['text3', 'label_le'])

In [None]:
train_os.head()

In [None]:
train_os['label_le'].value_counts()
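
The idea behind random oversampling can be sketched in a few lines of plain Python: minority-class rows are drawn at random, with replacement, until every class is as large as the biggest one. This is a simplified illustration under that assumption, not the imbalanced-learn implementation; `random_oversample` and the toy messages are hypothetical.

```python
import random
from collections import Counter

def random_oversample(texts, labels, seed=42):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_texts, out_labels = list(texts), list(labels)
    for cls, count in counts.items():
        pool = [t for t, l in zip(texts, labels) if l == cls]
        extra = rng.choices(pool, k=target - count)  # sampling with replacement
        out_texts.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_texts, out_labels

texts = ["hi", "lunch?", "on my way", "call me", "ok", "win cash now"]  # toy messages
labels = [0, 0, 0, 0, 0, 1]                                              # 0 = ham, 1 = spam
bal_texts, bal_labels = random_oversample(texts, labels)
print(Counter(bal_labels))
```

The class counts end up equal, but the set of distinct messages is unchanged: the extra spam rows are exact copies of existing ones.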

In [None]:
# Shuffle training dataframe
train_os = train_os.sample(frac=1, random_state=42) # shuffle with random_state=42 for reproducibility
train_os.head()

In [None]:
X = train_os['text3'].to_numpy()
y = train_os['label_le'].to_numpy()

In [None]:
X, y

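The oversampled data is then split into training and validation sets with the train_val_split function, which takes the sentences, the labels, and a training split ratio and returns the four resulting splits. Because oversampling happens before the split, copies of the same spam message can end up in both sets, so the validation metrics reported later may be optimistic.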

In [None]:
def train_val_split(sentences, labels, training_split):
    """
    Splits the dataset into training and validation sets

    Args:
        sentences (array of str): cleaned, lower-cased sentences without stopwords
        labels (array of int): encoded labels aligned with the sentences
        training_split (float): proportion of the dataset to include in the train set

    Returns:
        train_sentences, validation_sentences, train_labels, validation_labels - lists containing the data splits
    """

    # Compute the number of sentences that will be used for training (should be an integer)
    train_size = int(len(sentences) * training_split)

    # Split the sentences and labels into train/validation splits
    train_sentences = sentences[:train_size]
    train_labels = labels[:train_size]

    validation_sentences = sentences[train_size:]
    validation_labels = labels[train_size:]

    return train_sentences, validation_sentences, train_labels, validation_labels

In [None]:
# Split the data: 80% for training, 20% for validation
X_train, X_valid, y_train, y_valid = train_val_split(X, y, 0.8)

print(f"There are {len(X_train)} sentences for training.\n")
print(f"There are {len(y_train)} labels for training.\n")
print(f"There are {len(X_valid)} sentences for validation.\n")
print(f"There are {len(y_valid)} labels for validation.")
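
One possible alternative ordering, which is not what this notebook does, is to hold out the validation rows first and oversample only the training rows, so that no validation message has a copy in the training set. The sketch below illustrates that idea in plain Python; `split_then_oversample` and the toy data are assumptions made for this example.

```python
import random
from collections import Counter

def split_then_oversample(texts, labels, training_split=0.8, seed=42):
    """Hold out validation rows first, then balance only the training rows."""
    rng = random.Random(seed)
    rows = list(zip(texts, labels))
    rng.shuffle(rows)
    train_size = int(len(rows) * training_split)
    train, valid = rows[:train_size], rows[train_size:]

    # Oversample within the training portion only
    counts = Counter(label for _, label in train)
    target = max(counts.values())
    balanced = list(train)
    for cls, count in counts.items():
        pool = [row for row in train if row[1] == cls]
        balanced.extend(rng.choices(pool, k=target - count))
    rng.shuffle(balanced)
    return balanced, valid

# Toy data: 14 distinct ham messages and 6 distinct spam messages
texts = [f"ham message {i}" for i in range(14)] + [f"spam message {i}" for i in range(6)]
labels = [0] * 14 + [1] * 6
train_rows, valid_rows = split_then_oversample(texts, labels)
print(Counter(label for _, label in train_rows), len(valid_rows))
```

With this ordering the validation set keeps the original class proportions and contains only messages the model has never seen, at the cost of a smaller and still imbalanced validation set.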

In [None]:
# Check the shape
X_train.shape, X_valid.shape, y_train.shape, y_valid.shape

In [None]:
# Check the lengths
len(X_train), len(X_valid), len(y_train), len(y_valid)

In [None]:
X_train[:5], X_valid[:5]

In [None]:
y_train[:5], y_valid[:5]

The train and valid texts are copied to be used later.

In [None]:
X_train_tx=X_train.copy()
X_valid_tx=X_valid.copy()

In [None]:
X_train_tx[:5], X_valid_tx[:5]

**Model 0: Baseline**

Next, a simple baseline model is defined using a TF-IDF vectorizer and a multinomial Naive Bayes classifier within a Scikit-Learn pipeline. The model is then trained on the training data.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Create tokenization and modelling pipeline
model_0 = Pipeline([
                    ("tfidf", TfidfVectorizer()), # convert words to numbers using tfidf
                    ("clf", MultinomialNB()) # model the text
])

# Now fit the model
model_0.fit(X_train_tx, y_train)
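
To give a feel for the numbers the TF-IDF step feeds into Naive Bayes, here is a simplified pure-Python sketch of the weighting. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which matches scikit-learn's documented defaults, but it splits on whitespace instead of using the library's tokenizer. The function name and documents are made up.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenised = [doc.split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    vectors = []
    for tokens in tokenised:
        weights = {term: count * idf[term] for term, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0  # avoid division by zero
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

docs = ["free prize call now", "call me later", "see you later"]  # toy messages
vectors = tfidf_vectors(docs)
print(vectors[0])
```

Terms that occur in fewer documents, such as "free" here, receive a higher weight than terms shared across documents, such as "call".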

The accuracy of the baseline model is then evaluated on the validation data and printed out.

In [None]:
baseline_score = model_0.score(X_valid_tx, y_valid)
print(f"Our baseline model achieves an accuracy of: {baseline_score*100:.2f}%")

Now, predictions are made using the baseline model on the validation sentences.

In [None]:
# Make predictions
baseline_preds = model_0.predict(X_valid_tx)
baseline_preds[:20]

Then a function calculate_results is defined to calculate the accuracy, precision, recall, and F1 score of the model's predictions.

In [None]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_results(y_true, y_pred, loss=None):
    """
    Calculates model accuracy, precision, recall, f1-score, and loss of a binary classification model.

    Args:
    -----
    y_true: true labels in the form of a 1D array
    y_pred: predicted labels in the form of a 1D array
    loss: (optional) loss value of the model, default is None

    Returns a dictionary of accuracy, precision, recall, f1-score, and loss.
    """
    # Calculate model accuracy
    model_accuracy = accuracy_score(y_true, y_pred) * 100
    # Calculate model precision, recall, and f1 score using "weighted" average
    model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
    model_results = {
        "accuracy": model_accuracy,
        "precision": model_precision,
        "recall": model_recall,
        "f1": model_f1,
        "loss": loss
    }
    return model_results
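
The "weighted" average used above can be unpacked as follows: precision, recall, and F1 are computed per class and then averaged with weights proportional to each class's number of true samples. The sketch below is a hand-rolled illustration of that definition on made-up labels, not a replacement for the scikit-learn function.

```python
from collections import Counter

def weighted_precision_recall_f1(y_true, y_pred):
    """Per-class precision/recall/F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    totals = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        n_pred = sum(1 for p in y_pred if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        weight = n_true / len(y_true)  # share of samples that truly belong to this class
        totals["precision"] += weight * precision
        totals["recall"] += weight * recall
        totals["f1"] += weight * f1
    return totals

y_true = [0, 0, 0, 0, 1, 1]  # toy labels: 0 = ham, 1 = spam
y_pred = [0, 0, 0, 1, 1, 0]
scores = weighted_precision_recall_f1(y_true, y_pred)
print(scores)
```

One consequence of this weighting is that weighted recall always equals plain accuracy, so the two columns carry the same information (apart from accuracy being reported as a percentage here).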


The function is used to calculate the results of the baseline model and these are printed out.

In [None]:
# Get baseline results
baseline_results = calculate_results(y_true=y_valid,
                                     y_pred=baseline_preds)
baseline_results

A helper function compare_baseline_to_new_results is defined to compare the results of the baseline model to a new model. It does this by calculating the difference between the baseline and new results for each metric.

In [None]:
# Create a helper function to compare our baseline results to new model results
def compare_baseline_to_new_results(baseline_results, new_model_results):
  for key, value in baseline_results.items():
    if key != 'loss': # Do not compare if the key is 'loss'
      print(f"Baseline {key}: {value:.2f}, New {key}: {new_model_results[key]:.2f}, Difference: {new_model_results[key]-value:.2f}")

**Text Length in Training Data**

Next, the code analyzes the text length in the training data. It calculates the average number of tokens (words) in the training texts, finds the 98th percentile of text lengths, and plots a histogram to visualize the distribution of text lengths.

In [None]:
# Find average number of tokens (words) in training texts
round(sum([len(i.split()) for i in X_train])/len(X_train))

In [None]:
# Calculate the length of each text in X_train
text_lengths = [len(text.split()) for text in X_train]

# Find the 98th percentile
percentile_98 = np.percentile(text_lengths, 98)

print(f"98th Percentile of Text Lengths: {percentile_98}")

In [None]:
# Plotting the histogram
plt.figure(figsize=(10, 6))
plt.hist(text_lengths, bins=20, edgecolor='black')
plt.xlabel('Number of Words')
plt.ylabel('Frequency')
plt.title('Distribution of Text Lengths')

# Adding a vertical line for the 98th percentile
percentile_98 = np.percentile(text_lengths, 98)
plt.axvline(x=percentile_98, color='red', linestyle='--', label='98th Percentile')
plt.legend()

plt.grid(True)
plt.show()

In [None]:
max_text_length = max(text_lengths)
print(f"Maximum Text Length: {max_text_length}")
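
Using the 98th percentile as the sequence length keeps padded inputs short while truncating only about 2% of messages. The sketch below illustrates that trade-off with synthetic word counts; `toy_lengths` is a stand-in for `text_lengths`, and its values are assumptions, not the dataset's.

```python
import numpy as np

# Synthetic word counts standing in for text_lengths (illustrative values only)
rng = np.random.default_rng(0)
toy_lengths = rng.integers(low=1, high=60, size=1000)

cap = int(np.percentile(toy_lengths, 98))   # candidate sequence length
covered = np.mean(toy_lengths <= cap)       # share of texts that fit without truncation

print(f"cap = {cap} tokens, longest text = {toy_lengths.max()} tokens")
print(f"{covered:.1%} of texts fit within the cap")
```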

**Text Tokenization and Sequence Padding**

The code then defines the `fit_tokenizer` function, which creates a tokenizer and fits it to the training sentences only.

In [None]:
# FUNCTION: fit_tokenizer
def fit_tokenizer(train_sentences, num_words, oov_token):
    """
    Instantiates the Tokenizer class on the training sentences

    Args:
        train_sentences (list of string): lower-cased sentences without stopwords to be used for training
        num_words (int) - number of words to keep when tokenizing
        oov_token (string) - symbol for the out-of-vocabulary token

    Returns:
        tokenizer (object): an instance of the Tokenizer class containing the word-index dictionary
    """

    # Instantiate the Tokenizer class, passing in the correct values for num_words and oov_token
    tokenizer = Tokenizer(num_words=num_words, oov_token=oov_token)

    # Fit the tokenizer to the training sentences
    tokenizer.fit_on_texts(train_sentences)

    return tokenizer

In [None]:
OOV_TOKEN = "<OOV>"

In [None]:
NUM_WORDS = 7000

Subsequently, this function is employed to train a tokenizer on the training sentences, extracting the word index. The word index acts as a dictionary that associates words with their respective numerical identifiers.

In [None]:
# Fit the tokenizer on the training sentences
tokenizer = fit_tokenizer(X_train, NUM_WORDS, OOV_TOKEN)
word_index = tokenizer.word_index

print(f"Vocabulary contains {len(word_index)} words\n")
print("<OOV> token included in vocabulary" if "<OOV>" in word_index else "<OOV> token NOT included in vocabulary")
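
To make the roles of `num_words` and `oov_token` concrete, here is a simplified re-implementation in plain Python. It is an illustration under stated assumptions, not the Keras source: the helper names `toy_fit` and `toy_to_sequence` are hypothetical, and unlike the real `Tokenizer` it does not strip punctuation.

```python
from collections import Counter

def toy_fit(sentences, oov_token="<OOV>"):
    """Build a word -> index map: OOV gets 1, then words by descending frequency."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    ordered = [w for w, _ in sorted(counts.items(), key=lambda kv: -kv[1])]
    return {w: i for i, w in enumerate([oov_token] + ordered, start=1)}

def toy_to_sequence(sentence, word_index, num_words, oov_token="<OOV>"):
    """Map words to indices; unknown or too-rare words become the OOV index."""
    oov = word_index[oov_token]
    seq = []
    for w in sentence.lower().split():
        idx = word_index.get(w)
        seq.append(idx if idx is not None and idx < num_words else oov)
    return seq

train = ["free prize call now", "call me now", "see you now"]
toy_word_index = toy_fit(train)
toy_sequence = toy_to_sequence("call now for free cash", toy_word_index, num_words=4)

print(len(toy_word_index))   # full vocabulary size, independent of num_words
print(toy_sequence)
```

Note that the word index holds every word seen during fitting. The `num_words` limit only takes effect when texts are converted to sequences, so the vocabulary size printed above can exceed `NUM_WORDS`.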

In [None]:
max_vocab_length=NUM_WORDS
# max_vocab_length=len(word_index)

In the subsequent code block, a function named `seq_and_pad` is introduced. This function's purpose is to transform sentences into sequences of integers (tokens) and then pad these sequences to ensure they all share the same length.

In [None]:
# FUNCTION: seq_and_pad
def seq_and_pad(sentences, tokenizer, padding, maxlen):
    """
    Generates an array of token sequences and pads them to the same length

    Args:
        sentences (list of string): list of sentences to tokenize and pad
        tokenizer (object): Tokenizer instance containing the word-index dictionary
        padding (string): type of padding to use
        maxlen (int): maximum length of the token sequence

    Returns:
        padded_sequences (array of int): tokenized sentences padded to the same length
    """

    # Convert sentences to sequences
    sequences = tokenizer.texts_to_sequences(sentences)

    # Pad the sequences using the correct padding and maxlen
    padded_sequences = pad_sequences(sequences, padding=padding, maxlen=maxlen)

    return padded_sequences

In [None]:
PADDING = 'post'
max_length=int(percentile_98)

Next, this function is utilized to process the training and validation sentences, rendering them suitable for model input. The code prints out the shape of the resulting sequences as a confirmation of the data preparation.

In [None]:
# Tokenize and pad the training and validation sentences
X_train = seq_and_pad(X_train, tokenizer, PADDING, max_length)
X_valid = seq_and_pad(X_valid, tokenizer, PADDING, max_length)

print(f"Padded training sequences have shape: {X_train.shape}\n")
print(f"Padded validation sequences have shape: {X_valid.shape}")
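
`seq_and_pad` sets `padding` but leaves `truncating` at the `pad_sequences` default, which is `'pre'`. Messages longer than `max_length` therefore lose their first tokens, while shorter ones get zeros appended. The sketch below mimics that behavior for a single sequence; `toy_pad` is a hypothetical helper written for illustration, not the Keras function.

```python
def toy_pad(seq, maxlen, padding="post", truncating="pre", value=0):
    """Pad or truncate one token list to exactly `maxlen` entries."""
    if len(seq) > maxlen:
        # "pre" drops tokens from the front, "post" drops them from the end
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    return seq + pad if padding == "post" else pad + seq

short_seq = [5, 8, 2]
long_seq = [1, 2, 3, 4, 5, 6]

padded_short = toy_pad(short_seq, maxlen=4)                     # zeros appended at the end
cut_front = toy_pad(long_seq, maxlen=4)                         # first two tokens dropped
cut_back = toy_pad(long_seq, maxlen=4, truncating="post")       # last two tokens dropped

print(padded_short, cut_front, cut_back)
```

If the start of a message matters more than its end, passing `truncating='post'` to `pad_sequences` would be worth trying.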

In [None]:
X_train[:5], X_valid[:5]

In [None]:
y_train[:10], y_valid[:10]

**Callbacks**

Callbacks are introduced, among which is the ModelCheckpoint callback. This callback is configured to save the best model during training, determined by the validation loss.

In [None]:
from tensorflow.keras.callbacks import ModelCheckpoint

def create_checkpoint_callback(checkpoint_path):
    """
    This function returns a ModelCheckpoint callback that saves the full model only when the
    validation loss improves.

    Parameters:
    checkpoint_path (str): The filepath where the model should be saved.

    Returns:
    ModelCheckpoint callback
    """
    checkpoint_callback = ModelCheckpoint(filepath=checkpoint_path,
                                          monitor='val_loss',
                                          mode='min',
                                          save_best_only=True,
                                          verbose=1)
    return checkpoint_callback
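
The effect of `save_best_only=True` with `monitor='val_loss'` and `mode='min'` can be sketched in a few lines of plain Python. The function name and the loss values below are hypothetical and only illustrate the rule; the real callback writes the model to disk at each improvement.

```python
def epochs_that_would_save(val_losses):
    """Return the 1-based epochs at which a save_best_only checkpoint would be written."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:          # mode="min": lower is better
            best = loss
            saved.append(epoch)  # the real callback would save the model here
    return saved

# Hypothetical validation losses, not taken from an actual run
toy_val_losses = [0.40, 0.25, 0.27, 0.21, 0.30]
saved_epochs = epochs_that_would_save(toy_val_losses)
print(saved_epochs)
```

Only the last saved epoch survives on disk, because every save overwrites the same path.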

**Embedding Layer**

In [None]:
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras import layers

In [None]:
max_length

In [None]:
max_vocab_length

In this part of the code, an embedding layer is established to convert input words into fixed-size dense vectors. This transformation captures semantic relationships between words. Additionally, `tf.random.set_seed(42)` is employed to ensure reproducibility by controlling randomness in the process.

In [None]:
# from tensorflow.keras import layers

tf.random.set_seed(42)

embedding = layers.Embedding(input_dim=max_vocab_length, # vocabulary size (number of token ids)
                             output_dim=300, # set size of embedding vector
                             embeddings_initializer="uniform", # default, initialize randomly
                             input_length=max_length, # how long is each input
                             name="embedding_1")

embedding
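
Conceptually, an embedding layer is a trainable lookup table with one row per token id. The NumPy sketch below shows the lookup with toy sizes; the matrix values and the uniform range are assumptions made for illustration, and NumPy is used here in place of Keras.

```python
import numpy as np

rng = np.random.default_rng(42)
vocab_size, embed_dim = 10, 4          # toy sizes (the notebook uses 7000 and 300)

# One row per token id, initialised uniformly at random (range assumed for the sketch)
embedding_matrix = rng.uniform(-0.05, 0.05, size=(vocab_size, embed_dim))

batch = np.array([[3, 7, 0, 0],        # two padded sequences of token ids
                  [1, 2, 5, 9]])
vectors = embedding_matrix[batch]      # row lookup per token id

print(vectors.shape)                   # (batch, sequence length, embed_dim)
```

Padding positions (id 0) receive an ordinary embedding row as well, unless masking is enabled on the layer.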

**Model: Simple Dense**

In [None]:
X_train.shape[1]

This code block constructs our first neural network model for text classification. The `GlobalAveragePooling1D` layer averages the embedding vectors across the sequence dimension, collapsing each message from a matrix of shape (sequence length, embedding size) into a single fixed-size vector that the final dense layer can classify.

In [None]:
from tensorflow.keras import layers

inputs = layers.Input(shape=(X_train.shape[1],), dtype="int32")
x = embedding(inputs)
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model_dense = tf.keras.Model(inputs, outputs, name="model_dense")
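
The forward pass of this model after the embedding step is small enough to write out by hand. The NumPy sketch below averages over the sequence axis and applies one sigmoid unit; the weights are random placeholders and `toy_dense_forward` is a hypothetical name, so this shows the shape of the computation rather than the trained model.

```python
import numpy as np

def toy_dense_forward(embedded, weights, bias):
    """Average over the sequence axis, then apply a single sigmoid unit."""
    pooled = embedded.mean(axis=1)            # (batch, embed_dim)
    logits = pooled @ weights + bias          # (batch, 1)
    return 1.0 / (1.0 + np.exp(-logits))      # probabilities in (0, 1)

rng = np.random.default_rng(0)
embedded = rng.normal(size=(2, 5, 4))   # (batch, sequence length, embed_dim), toy values
weights = rng.normal(size=(4, 1))       # placeholder for the Dense(1) kernel
bias = np.zeros(1)

probs = toy_dense_forward(embedded, weights, bias)
print(probs.shape)
```

Because the average runs over every position, padded positions also contribute to the pooled vector unless the embedding layer is created with `mask_zero=True`.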

The code compiles the `model_dense` by specifying `binary_crossentropy` as the loss function, `Adam` as the optimizer, and `accuracy` as the evaluation metric. This configuration prepares the model for binary classification tasks, utilizing the Adam optimizer for training and assessing its performance by measuring the accuracy of its predictions.

In [None]:
# Compile model
model_dense.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [None]:
# Get a summary of the model
model_dense.summary()

In this step, a ModelCheckpoint callback is created. It saves the model during training whenever the validation loss improves, so the best version is kept on disk.

In [None]:
# Define the checkpoint path
checkpoint_path = "best_model_dense"

cc = create_checkpoint_callback(checkpoint_path)

The `model_dense` is trained using the training data (`X_train` and `y_train`) for 20 epochs. The validation data (`X_valid` and `y_valid`) is used for evaluation during training. The training progress is recorded in `model_dense_history`, and the defined callbacks (including the ModelCheckpoint callback) are applied throughout the training process.

In [None]:
# Fit the model
model_dense_history = model_dense.fit(X_train,
                              y_train,
                              epochs=20,
                              validation_data=(X_valid, y_valid),
                              callbacks=[cc])

Upon completing the training process, the code generates plots that visualize the model's accuracy and loss over the course of multiple epochs. These plots provide insights into how the model's performance evolved during training.

In [None]:
# Plot Utility
def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()

# Plot the accuracy and loss history
plot_graphs(model_dense_history, 'accuracy')
plot_graphs(model_dense_history, 'loss')

The code loads the best model saved during training, that is, the checkpoint with the lowest validation loss, so that this version is used for the evaluations that follow.

In [None]:
from tensorflow.keras.models import load_model

# Load the entire model
model_dense = load_model(checkpoint_path)

The code proceeds to evaluate the model's performance on the validation set. This assessment provides insights into how well the model generalizes to data it hasn't seen during training.

In [None]:
# Check the results
model_dense_ev = model_dense.evaluate(X_valid, y_valid)
model_dense_loss = model_dense_ev[0]
model_dense_loss

The code predicts probabilities on the validation set and subsequently transforms these probabilities into class predictions. This step determines which class (e.g., binary classification labels) each data point belongs to based on the predicted probabilities.

In [None]:
# Make predictions (these come back in the form of probabilities)
model_dense_pred_probs = model_dense.predict(X_valid)
model_dense_pred_probs[:10] # only print out the first 10 prediction probabilities

In [None]:
# Convert prediction probabilities to labels
model_dense_preds = tf.squeeze(tf.round(model_dense_pred_probs))
model_dense_preds[:10]
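
Rounding the sigmoid outputs amounts to a decision threshold of 0.5. Making the threshold explicit allows it to be tuned, for example raised to reduce the number of legitimate messages flagged as spam. The sketch below uses NumPy instead of TensorFlow, and the probabilities are toy values chosen for illustration.

```python
import numpy as np

toy_probs = np.array([[0.03], [0.50], [0.51], [0.97]])   # toy sigmoid outputs

# Rounding uses round-half-to-even, so exactly 0.50 becomes 0
rounded = np.round(toy_probs).squeeze().astype(int)

# An explicit threshold makes the boundary case and the cutoff visible
threshold = 0.5
thresholded = (toy_probs.squeeze() >= threshold).astype(int)

print(rounded)
print(thresholded)
```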

Multiple metrics, including accuracy, precision, recall, and F1-score, are computed to evaluate the performance of the model. These metrics provide a comprehensive assessment of the model's ability to classify data accurately and are particularly valuable for classification tasks.

In [None]:
# Calculate model_dense metrics
model_dense_results = calculate_results(y_true=y_valid,
                                    y_pred=model_dense_preds,
                                       loss=model_dense_loss)
model_dense_results

In [None]:
y_true = y_valid.tolist()  # Convert labels to a list
preds = model_dense.predict(X_valid)
y_probs = preds.squeeze().tolist()  # Store the prediction probabilities as a list
y_preds = tf.round(y_probs).numpy().tolist()  # Convert probabilities to class predictions and convert to a list

A confusion matrix is created to summarize the classification model's performance. After the raw matrix is printed, a custom function, `make_confusion_matrix`, plots it with class labels, raw counts, and row-normalized percentages so it is easier to read.

In [None]:
# Check out the non-prettified confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_true=y_true,
                 y_pred=y_preds)

In [None]:
import itertools
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def make_confusion_matrix(y_true, y_pred, classes=None, figsize=(10, 10), text_size=15):
    """
    Create a labeled confusion matrix comparing predictions and ground truth labels.
    
    Args:
        y_true: Array of true labels (must be the same shape as y_pred).
        y_pred: Array of predicted labels (must be the same shape as y_true).
        classes: Array of class labels (e.g., string form). If None, integer labels are used.
        figsize: Size of the output figure (default=(10, 10)).
        text_size: Size of the text in the output figure (default=15).
    
    Returns:
        A labeled confusion matrix plot comparing y_true and y_pred.
    """
    # Create the confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    cm_norm = cm.astype("float") / cm.sum(axis=1)[:, np.newaxis]  # Normalize it
    n_classes = cm.shape[0]  # Find the number of classes we're dealing with
    
    # Plot the figure and make it visually appealing
    fig, ax = plt.subplots(figsize=figsize)
    cax = ax.matshow(cm, cmap=plt.cm.Blues)  # Use colors to represent how 'correct' a class is (darker == better)
    fig.colorbar(cax)
    
    # Check if there's a list of class labels
    if classes:
        labels = classes
    else:
        labels = np.arange(cm.shape[0])
    
    # Label the axes
    ax.set(title="Confusion Matrix",
           xlabel="Predicted label",
           ylabel="True label",
           xticks=np.arange(n_classes),  # Create enough axis slots for each class
           yticks=np.arange(n_classes),
           xticklabels=labels,  # Axes are labeled with class names (if available) or integers
           yticklabels=labels)
    
    # Place x-axis labels at the bottom
    ax.xaxis.set_label_position("bottom")
    ax.xaxis.tick_bottom()
    
    # Set the threshold for different colors
    threshold = (cm.max() + cm.min()) / 2.
    
    # Plot text on each cell
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, f"{cm[i, j]} ({cm_norm[i, j] * 100:.1f}%)",
                 horizontalalignment="center",
                 color="white" if cm[i, j] > threshold else "black",
                 size=text_size)

# Example usage:
# make_confusion_matrix(y_true, y_pred, classes=class_names, figsize=(15, 15), text_size=10)


In [None]:
# Make a prettier confusion matrix
make_confusion_matrix(y_true=y_true,
                      y_pred=y_preds,
                      classes=class_names,
                      figsize=(15, 15),
                      text_size=10)
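
The counts and row-normalized percentages shown in each cell can be reproduced by hand. The sketch below does so in plain Python with invented labels; it assumes 0 means ham and 1 means spam, and `toy_confusion_matrix` is a hypothetical helper rather than the scikit-learn function.

```python
def toy_confusion_matrix(y_true, y_pred, n_classes=2):
    """Rows are true classes, columns are predicted classes."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

# Invented labels for illustration (assumption: 0 = ham, 1 = spam)
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 0, 1, 0, 1, 0, 1, 0]

toy_cm = toy_confusion_matrix(y_true_toy, y_pred_toy)
# Row-normalise so each row shows how one true class was distributed
toy_cm_norm = [[count / sum(row) for count in row] for row in toy_cm]

print(toy_cm)
print(toy_cm_norm)
```

With this layout, the top-right cell counts ham flagged as spam and the bottom-left cell counts spam that slipped through.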

The following show_random_predictions function picks random samples from the validation set, makes predictions on them, and prints the actual and predicted labels. It provides an intuitive way to see how the model performs on unseen data.

In [None]:
index_word = {v: k for k, v in tokenizer.word_index.items()}

from colorama import Fore, Style
def show_random_predictions(model, X_valid, y_valid, tokenizer, num_samples=5, class_names=None):
    # Check if it's binary or multi-class classification
    is_binary_classification = len(np.unique(y_valid)) == 2

    # Getting indices of the random samples
    random_indices = np.random.choice(np.arange(len(X_valid)), size=num_samples, replace=False)

    # Selecting the random samples
    random_X_samples = X_valid[random_indices]
    random_y_samples = y_valid[random_indices]

    # Making predictions on the random samples
    y_pred_probs = model.predict(random_X_samples)

    if is_binary_classification:
        y_pred = np.squeeze(np.round(y_pred_probs).astype(int))
    else:
        y_pred = np.argmax(y_pred_probs, axis=1)

    # Print the actual and predicted labels
    for i in range(num_samples):
        text_tokens = random_X_samples[i]
        text = ' '.join([index_word.get(token) for token in text_tokens if token != 0])  # 0 is typically the padding token
        true_label = random_y_samples[i] if is_binary_classification else np.argmax(random_y_samples[i])
        predicted_label = y_pred[i]

        # If class names are provided, use them for printing
        if class_names is not None:
            true_label_name = class_names[true_label]
            predicted_label_name = class_names[predicted_label]
        else:
            true_label_name = true_label
            predicted_label_name = predicted_label

        # Determine the color of the text (green for correct, red for incorrect)
        text_color = Fore.GREEN if true_label == predicted_label else Fore.RED

        print(f"\nSample {i + 1}:")
        print(f"Text: {text}")
        print(text_color + f"True: {true_label_name} \n Predicted: {predicted_label_name}" + Style.RESET_ALL)

The show_random_predictions function is called to generate and display predictions on random samples.

In [None]:
show_random_predictions(model_dense,
                   X_valid,
                   y_valid,
                   tokenizer,
                   num_samples=10,
                   class_names=class_names)

**Model: LSTM**

The code establishes an LSTM (Long Short-Term Memory) model for sequence classification. Here's a summary of the code's key components:

1. **Random Seed for Reproducibility**:
   - The `tf.random.set_seed(42)` function is used to set a random seed, ensuring reproducibility by fixing the randomness of the model initialization.

2. **Embedding Layer**:
   - An `Embedding` layer is added to convert input words into dense fixed-size vectors. This layer is commonly used for text data to capture semantic relationships between words.

3. **LSTM Layers**:
   - Two consecutive LSTM layers are added. The first LSTM layer returns sequences and allows for layer stacking. LSTM layers are recurrent layers that can capture sequential dependencies in the data.

4. **Dense Layer with 'relu' Activation**:
   - A dense layer with a 'relu' (Rectified Linear Unit) activation function is included. This layer helps the model learn relevant features from the LSTM-encoded representations.

5. **Output Dense Layer with 'sigmoid' Activation**:
   - The model concludes with an output dense layer featuring a 'sigmoid' activation function. This configuration is suitable for binary classification tasks, as it produces probabilities ranging from 0 to 1.

6. **Model Name**:
   - The model is named "model_1LSTM" and is constructed to accept inputs and generate outputs according to the defined layers.

Overall, this code sets up an LSTM-based neural network model for sequence classification, which is capable of learning from sequential data and making binary classification predictions.

In [None]:
# Set random seed and create embedding layer (new embedding layer for each model)
tf.random.set_seed(42)
from tensorflow.keras import layers
model_1LSTM_embedding = layers.Embedding(input_dim=max_vocab_length,
                                     output_dim=300,
                                     embeddings_initializer="uniform",
                                     input_length=max_length,
                                     name="embedding_2")


# Create LSTM model
inputs = layers.Input(shape=(X_train.shape[1],), dtype="int32")
x = model_1LSTM_embedding(inputs)
x = layers.LSTM(64, return_sequences=True)(x)
x = layers.LSTM(64)(x)
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model_1LSTM = tf.keras.Model(inputs, outputs, name="model_1LSTM")
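
As a sanity check on the model summary, the parameter count of each LSTM layer can be computed by hand. The sketch below assumes the standard LSTM layout with bias terms enabled: four weight sets (input, forget, and output gates plus the cell candidate), each with input weights, recurrent weights, and a bias vector. `lstm_param_count` is a hypothetical helper.

```python
def lstm_param_count(input_dim, units):
    """Four weight sets, each with input weights, recurrent weights, and a bias vector."""
    return 4 * (input_dim * units + units * units + units)

first_lstm = lstm_param_count(input_dim=300, units=64)   # fed by the 300-d embedding
second_lstm = lstm_param_count(input_dim=64, units=64)   # fed by the first LSTM's 64-d outputs

print(first_lstm, second_lstm)
```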

The code compiles the LSTM model using binary_crossentropy as the loss function, Adam as the optimizer, and accuracy as the evaluation metric. This configuration sets up the model for binary classification tasks, optimizing it with the Adam optimizer and assessing its performance based on the accuracy of its predictions.

In [None]:
# Compile model
model_1LSTM.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [None]:
model_1LSTM.summary()

We now create a checkpoint callback for the LSTM model.

In [None]:
# Define the checkpoint path
checkpoint_path = "best_model_LSTM"

cc = create_checkpoint_callback(checkpoint_path)


The LSTM model is fit to the training data (X_train and y_train) for 20 epochs, with validation data (X_valid and y_valid) used for evaluation. The training progress is recorded in model_1LSTM_history, and the defined callbacks (cc) are utilized during training.

In [None]:
# Fit model
model_1LSTM_history = model_1LSTM.fit(X_train, y_train,
                              epochs=20,
                              validation_data=(X_valid, y_valid),
                              callbacks=[cc])

Following training, the history of LSTM model's accuracy and loss over the epochs is plotted.

In [None]:
# Plot Utility
def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()

# Plot the accuracy and loss history
plot_graphs(model_1LSTM_history, 'accuracy')
plot_graphs(model_1LSTM_history, 'loss')

The best LSTM model (as determined by validation loss) is loaded for further analysis.

In [None]:
from tensorflow.keras.models import load_model

# Load the entire model
model_1LSTM = load_model(checkpoint_path)

The LSTM model is evaluated on the validation set to understand its performance on unseen data.

In [None]:
# Check the results
model_1LSTM_ev  = model_1LSTM.evaluate(X_valid, y_valid)
model_1LSTM_loss = model_1LSTM_ev [0]
model_1LSTM_loss

The LSTM model predicts probabilities on the validation set, which are then converted into class predictions.

In [None]:
# Make predictions on the validation dataset
model_1LSTM_pred_probs = model_1LSTM.predict(X_valid)
model_1LSTM_pred_probs.shape, model_1LSTM_pred_probs[:10] # view the first 10

In [None]:
# Convert prediction probabilities to labels
model_1LSTM_preds = tf.squeeze(tf.round(model_1LSTM_pred_probs))
model_1LSTM_preds[:10]

Metrics such as accuracy, precision, recall, and F1-score are calculated to evaluate the performance of the LSTM model.

In [None]:
# Calculate LSTM model results
model_1LSTM_results = calculate_results(y_true=y_valid,
                                    y_pred=model_1LSTM_preds,
                                       loss=model_1LSTM_loss)
model_1LSTM_results

The helper function compares the baseline model's metrics with those of the LSTM model: accuracy, precision, recall, and F1-score.

In [None]:
# Compare the LSTM model to baseline
compare_baseline_to_new_results(baseline_results, model_1LSTM_results)

In [None]:
y_true = y_valid.tolist()  # Convert labels to a list
preds = model_1LSTM.predict(X_valid)
y_probs = preds.squeeze().tolist()  # Store the prediction probabilities as a list
y_preds = tf.round(y_probs).numpy().tolist()  # Convert probabilities to class predictions and convert to a list

A confusion matrix is generated to visualize the classification performance of the LSTM model. A custom function is used to make the matrix more readable.

In [None]:
# Check out the non-prettified confusion matrix
confusion_matrix(y_true=y_true,
                 y_pred=y_preds)

In [None]:
# Make a prettier confusion matrix
make_confusion_matrix(y_true=y_true,
                      y_pred=y_preds,
                      classes=class_names,
                      figsize=(15, 15),
                      text_size=10)

The show_random_predictions function is called to generate and display predictions of the LSTM model on random samples.

In [None]:
show_random_predictions(model_1LSTM,
                   X_valid,
                   y_valid,
                   tokenizer,
                   num_samples=10,
                   class_names=class_names)

**Model: Bidirectional LSTM**

The code introduces a Bidirectional LSTM (Long Short-Term Memory) model, which is a more advanced version of a recurrent neural network (RNN) model and is particularly suitable for text classification tasks. Here's an overview of the key components of this model:

1. **Reproducibility**:
   - Reproducibility is ensured by setting a random seed using `tf.random.set_seed(42)`.

2. **Input and Embedding Layer**:
   - The model, named `model_lstm`, accepts word indices as input and transforms them into dense vectors through an embedding layer.
   - The embedding layer has a uniform embedding initializer and an output dimension of 128.

3. **Bidirectional LSTM Layers**:
   - Two bidirectional LSTM layers are employed. Each wraps a 64-unit LSTM that reads the sequence forward and a second 64-unit LSTM that reads it backward, so each layer outputs 128 features. Reading in both directions lets the model use the context before and after each token.

4. **Dense Layer**:
   - A dense layer with 512 units and a 'relu' activation function is added. This layer is intended to learn high-level features from the LSTM-encoded representations.

5. **Output Dense Layer with 'sigmoid' Activation**:
   - The model is finalized with an output dense layer featuring a 'sigmoid' activation function, suitable for binary classification tasks.

This model is designed to capture complex sequential patterns in text data and make binary classification predictions. The bidirectional LSTM layers enhance the model's ability to understand both past and future context, making it well-suited for text classification tasks.

In [None]:
import tensorflow as tf
from tensorflow.keras import layers

# Parameters
embedding_dim=128

tf.random.set_seed(42)

# Input layer
inputs = layers.Input(shape=(X_train.shape[1],), dtype="int32")
# Create an embedding of the tokenized word indices
x = layers.Embedding(input_dim=max_vocab_length,
                     output_dim=embedding_dim,
                     embeddings_initializer="uniform",
                     input_length=max_length,
                     name="embedding_2")(inputs)

# Bidirectional LSTM
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
# Another LSTM Layer
x = layers.Bidirectional(layers.LSTM(64))(x)
# Dense layer
x = layers.Dense(512, activation='relu')(x)
# Output layer
outputs = layers.Dense(1, activation='sigmoid')(x)
# Create the model
model_lstm = tf.keras.Model(inputs, outputs)
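
The idea behind the `Bidirectional` wrapper can be sketched with NumPy: run one recurrent pass forward, another over the reversed sequence, and concatenate the two final states. For brevity this sketch swaps in a plain tanh RNN instead of an LSTM, and all weights are random placeholders, so it shows only the wiring and the resulting output size.

```python
import numpy as np

def simple_rnn(sequence, w_in, w_rec):
    """Run a plain tanh RNN over a (timesteps, features) array; return the final state."""
    h = np.zeros(w_rec.shape[0])
    for x_t in sequence:
        h = np.tanh(x_t @ w_in + h @ w_rec)
    return h

rng = np.random.default_rng(1)
timesteps, features, units = 6, 8, 64
sequence = rng.normal(size=(timesteps, features))   # one embedded toy message

# Each direction keeps its own weights
fwd = simple_rnn(sequence, rng.normal(size=(features, units)) * 0.1,
                 rng.normal(size=(units, units)) * 0.1)
bwd = simple_rnn(sequence[::-1], rng.normal(size=(features, units)) * 0.1,
                 rng.normal(size=(units, units)) * 0.1)

combined = np.concatenate([fwd, bwd])               # concatenation doubles the feature size
print(combined.shape)
```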


After these layers are assembled, the model, referred to as 'model_lstm', is compiled using the Adam optimizer and a binary cross-entropy loss function - appropriate for binary classification tasks.

In [None]:
# Set the training parameters
model_lstm.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
# Print the model summary
model_lstm.summary()

In [None]:
# Define the checkpoint path
checkpoint_path = "best_model_Bi-LSTM"

cc = create_checkpoint_callback(checkpoint_path)

The model_lstm is fit to the training data (X_train and y_train) for 20 epochs, with validation data (X_valid and y_valid) used for evaluation. The training progress is recorded in history_lstm, and the defined callbacks (cc) are utilized during training.

In [None]:
NUM_EPOCHS = 20

# Train the model
history_lstm = model_lstm.fit(X_train, y_train, epochs=NUM_EPOCHS, validation_data=(X_valid, y_valid),callbacks=[cc])

Post-training, the model's accuracy and loss evolution across epochs is visualized.

In [None]:
# Plot Utility
def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()

# Plot the accuracy and loss history
plot_graphs(history_lstm, 'accuracy')
plot_graphs(history_lstm, 'loss')

The model with the best validation loss is loaded for further usage.

In [None]:
from tensorflow.keras.models import load_model

# Load the entire model
model_lstm = load_model(checkpoint_path)

Performance of this model is assessed on the validation dataset.

In [None]:
# Check the results
model_lstm_ev = model_lstm.evaluate(X_valid, y_valid)
model_lstm_loss = model_lstm_ev[0]
model_lstm_loss

Class predictions are generated by transforming predicted probabilities on the validation dataset.

In [None]:
# Make predictions with model
model_lstm_pred_probs = model_lstm.predict(X_valid)
model_lstm_pred_probs[:10]

In [None]:
# Convert prediction probabilities to labels
model_lstm_preds = tf.squeeze(tf.round(model_lstm_pred_probs))
model_lstm_preds[:10]

To evaluate the model's performance, metrics such as accuracy, precision, recall, and F1-score are computed.

In [None]:
# Calculate model performance metrics
model_lstm_results = calculate_results(y_valid, model_lstm_preds, loss=model_lstm_loss)
model_lstm_results

Performance metrics of the baseline model and the bidirectional LSTM model are compared.

In [None]:
# Compare the bidirectional LSTM model to baseline
compare_baseline_to_new_results(baseline_results, model_lstm_results)


In [None]:
y_true = y_valid.tolist()  # Convert labels to a list
preds = model_lstm.predict(X_valid)
y_probs = preds.squeeze().tolist()  # Store the prediction probabilities as a list
y_preds = tf.round(y_probs).numpy().tolist()  # Convert probabilities to class predictions and convert to a list


A confusion matrix is constructed to summarize the classification model's performance. A custom function then plots it in a more readable form.

In [None]:
# Check out the non-prettified confusion matrix
confusion_matrix(y_true=y_true,
                 y_pred=y_preds)


In [None]:
# Make a prettier confusion matrix
make_confusion_matrix(y_true=y_true,
                      y_pred=y_preds,
                      classes=class_names,
                      figsize=(15, 15),
                      text_size=10)

The function 'show_random_predictions' is invoked to generate and display predictions on random samples.

In [None]:
show_random_predictions(model_lstm,
                   X_valid,
                   y_valid,
                   tokenizer,
                   num_samples=10,
                   class_names=class_names)

**Model: GRU**

The provided code initializes a recurrent neural network (RNN) with Gated Recurrent Units (GRUs), which are known for their efficiency in sequence processing. Here's an overview of the key components of this model:

1. **Reproducibility**:
   - Reproducibility is ensured by setting a random seed using `tf.random.set_seed(42)`.

2. **Input and Embedding Layer**:
   - The model, named `model_GRU`, accepts input sequences and transforms them into dense vectors using an embedding layer.

3. **Two GRU Layers**:
   - Two GRU layers are incorporated into the model. GRUs are a type of recurrent layer designed for efficient sequence modeling.
   
4. **Dense Layer**:
   - A dense layer with 64 units and a 'relu' activation function is included. This layer is responsible for learning relevant features from the GRU-encoded representations.

5. **Output Dense Layer with 'sigmoid' Activation**:
   - The model concludes with an output dense layer featuring a 'sigmoid' activation function. This setup is suitable for binary text classification tasks, as it produces probabilities in the range of 0 to 1.

Overall, the code configures an RNN model with GRUs for efficient sequence processing, making it well-suited for text classification tasks. The model is designed to learn from sequential data and make binary classification predictions.

In [None]:
# Set random seed and create embedding layer (new embedding layer for each model)
tf.random.set_seed(42)

from tensorflow.keras import layers
model_GRU_embedding = layers.Embedding(input_dim=max_vocab_length,
                                     output_dim=300,
                                     embeddings_initializer="uniform",
                                     input_length=max_length,
                                     name="embedding_GRU")

# Build an RNN using the GRU cell
inputs = layers.Input(shape=(X_train.shape[1],), dtype="int32")
x = model_GRU_embedding(inputs)
x = layers.GRU(64, return_sequences=True)(x)
x = layers.GRU(64)(x)
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model_GRU = tf.keras.Model(inputs, outputs, name="model_GRU")

The 'model_GRU' is compiled using the Adam optimizer and binary cross-entropy as the loss function, suitable for binary classification tasks.

In [None]:
# Compile GRU model
model_GRU.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])
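
For reference, binary cross-entropy averages the negative log-likelihood of each true label under the predicted probability. The following is a minimal pure-Python sketch of that formula; the function name `binary_cross_entropy` and the clipping constant `eps` are illustrative assumptions, not the Keras implementation.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy for 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Confident, correct predictions give a lower loss than uninformative ones
confident = binary_cross_entropy([1, 0], [0.9, 0.1])
unsure = binary_cross_entropy([1, 0], [0.5, 0.5])
print(confident, unsure)
```

A prediction of 0.5 for every message yields a loss of ln 2, which is a useful reference point when reading the loss curves.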

In [None]:
# Get a summary of the GRU model
model_GRU.summary()
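
As a sanity check on the summary, the parameter counts of the recurrent and dense layers can be derived by hand. The sketch below assumes the TensorFlow 2 Keras default `reset_after=True` for `GRU`, which keeps separate input and recurrent bias vectors; the helper names are hypothetical. The embedding layer is left out because its size depends on `max_vocab_length`, which is set elsewhere in the notebook.

```python
def gru_params(input_dim, units):
    # Update gate, reset gate, and candidate state each have input weights,
    # recurrent weights, and two bias vectors (assumes reset_after=True).
    return 3 * (input_dim * units + units * units + 2 * units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

first_gru = gru_params(300, 64)      # embedding output_dim=300 feeds GRU(64)
second_gru = gru_params(64, 64)      # GRU(64) feeds GRU(64)
hidden_dense = dense_params(64, 64)  # Dense(64, relu)
output_dense = dense_params(64, 1)   # Dense(1, sigmoid)
print(first_gru, second_gru, hidden_dense, output_dense)
```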

A checkpoint callback is created for the GRU model.

In [None]:
# Define the checkpoint path
checkpoint_path = "best_model_GRU"

cc = create_checkpoint_callback(checkpoint_path)

The model_GRU is fit to the training data (X_train and y_train) for 20 epochs, with validation data (X_valid and y_valid) used for evaluation. The training progress is recorded in model_GRU_history, and the defined callbacks (cc) are utilized during training.

In [None]:
# Fit model
model_GRU_history = model_GRU.fit(X_train, y_train,
                              epochs=20,
                              validation_data=(X_valid, y_valid),
                              callbacks=[cc])

The model's accuracy and loss history is visualized post-training.

In [None]:
# Plot Utility
def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()

# Plot the accuracy and loss history
plot_graphs(model_GRU_history, 'accuracy')
plot_graphs(model_GRU_history, 'loss')

The best model, determined by validation loss, is then loaded.

In [None]:
from tensorflow.keras.models import load_model

# Load the entire model
model_GRU = load_model(checkpoint_path)

Model evaluation occurs on the validation set.

In [None]:
# Check the results
model_GRU_ev = model_GRU.evaluate(X_valid, y_valid)
model_GRU_loss = model_GRU_ev[0]
model_GRU_loss

The model predicts probabilities on the validation set, converting these into class predictions.

In [None]:
# Make predictions on the validation data
model_GRU_pred_probs = model_GRU.predict(X_valid)
model_GRU_pred_probs.shape, model_GRU_pred_probs[:10]

In [None]:
# Convert prediction probabilities to labels
model_GRU_preds = tf.squeeze(tf.round(model_GRU_pred_probs))
model_GRU_preds[:10]
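
`tf.round` rounds to the nearest integer (ties go to the even value, so exactly 0.5 becomes 0), which amounts to a fixed cut-off at 0.5. If a different trade-off between missed spam and mislabeled legitimate messages is wanted, an explicit threshold can be used instead. The sketch below is an illustration only; `probs_to_labels` is a hypothetical helper, and it assumes label 1 denotes spam.

```python
def probs_to_labels(probs, threshold=0.5):
    """Map predicted probabilities to 0 or 1 using an explicit cut-off."""
    return [1 if p >= threshold else 0 for p in probs]

probs = [0.02, 0.49, 0.51, 0.97]
default_labels = probs_to_labels(probs)
strict_labels = probs_to_labels(probs, threshold=0.9)  # flag only very confident cases
print(default_labels, strict_labels)
```

Raising the threshold makes the classifier more conservative about flagging a message, at the cost of letting more borderline cases through.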

Performance metrics, including accuracy, precision, recall, and F1-score, are computed for model evaluation.

In [None]:
# Calculate model_GRU results
model_GRU_results = calculate_results(y_true=y_valid,
                                      y_pred=model_GRU_preds,
                                      loss=model_GRU_loss)
model_GRU_results
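
`calculate_results` is defined earlier in the notebook and may average the metrics differently (for example, weighted across classes). The sketch below only illustrates the standard definitions of accuracy, precision, recall, and F1-score for a single positive class; `binary_metrics` is a hypothetical name, and label 1 is assumed to be the positive class.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels for illustration only
metrics = binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(metrics)
```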

The baseline model's performance is compared with the GRU model.

In [None]:
# Compare to baseline
compare_baseline_to_new_results(baseline_results, model_GRU_results)

In [None]:
y_true = y_valid.tolist()  # Convert labels to a list
preds = model_GRU.predict(X_valid)
y_probs = preds.squeeze().tolist()  # Store the prediction probabilities as a list
y_preds = tf.round(y_probs).numpy().tolist()  # Convert probabilities to class predictions and convert to a list


A confusion matrix is created for visualization of the model's classification performance. The matrix readability is enhanced via a custom function.

In [None]:
# Check out the non-prettified confusion matrix
confusion_matrix(y_true=y_true,
                 y_pred=y_preds)
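
To make the layout of the matrix concrete, the following pure-Python sketch counts the four outcomes directly. It follows the scikit-learn convention of rows for true labels and columns for predicted labels; `binary_confusion_matrix` is a hypothetical helper and the labels are toy values.

```python
def binary_confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]]: rows are true labels, columns are predictions."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[int(t)][int(p)] += 1  # int() also accepts 0.0 / 1.0 floats
    return matrix

toy_matrix = binary_confusion_matrix([0, 0, 1, 1, 1, 0],
                                     [0.0, 1.0, 1.0, 0.0, 1.0, 0.0])
print(toy_matrix)
```

The off-diagonal cells are the two error types the problem statement cares about: the top-right cell counts class-0 messages predicted as class 1, and the bottom-left cell counts class-1 messages predicted as class 0.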

In [None]:
# Make a prettier confusion matrix
make_confusion_matrix(y_true=y_true,
                      y_pred=y_preds,
                      classes=class_names,
                      figsize=(15, 15),
                      text_size=10)

Lastly, the 'show_random_predictions' function generates and displays predictions on random samples.

In [None]:
show_random_predictions(model_GRU,
                   X_valid,
                   y_valid,
                   tokenizer,
                   num_samples=10,
                   class_names=class_names)

**Model: Bi-directional GRU**

The following code forms a bidirectional recurrent neural network (RNN) using Gated Recurrent Units (GRUs). After setting a random seed for consistency, an Embedding layer transforms inputs into dense vectors. Two bidirectional GRU layers allow past and future context capture. The model, 'model_bi_GRU', includes a Dense layer and concludes with a sigmoid-function output layer, apt for binary classification tasks.

In [None]:
# Set random seed and create embedding layer
tf.random.set_seed(42)

from tensorflow.keras import layers
model_GRU_embedding = layers.Embedding(input_dim=max_vocab_length,
                                       output_dim=300,
                                       embeddings_initializer="uniform",
                                       input_length=max_length,
                                       name="embedding_GRU")

# Build a bidirectional RNN using the GRU cell
inputs = layers.Input(shape=(X_train.shape[1],), dtype="int32")
x = model_GRU_embedding(inputs)
x = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)
x = layers.Bidirectional(layers.GRU(64))(x)
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model_bi_GRU = tf.keras.Model(inputs, outputs, name="model_bi_GRU")
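
The idea behind the `Bidirectional` wrapper can be sketched without TensorFlow: one recurrent pass reads the sequence left to right, a second pass with its own weights reads it right to left, and the two final states are concatenated. The toy below uses a scalar tanh recurrence as a stand-in for a GRU, with arbitrary illustrative weights; it is not the Keras implementation.

```python
import math

def run_rnn(sequence, w_in, w_rec):
    """Toy scalar recurrence: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h = 0.0
    for x in sequence:
        h = math.tanh(w_in * x + w_rec * h)
    return h

def bidirectional(sequence, fwd=(0.5, 0.8), bwd=(0.3, 0.6)):
    """Run separate forward and backward passes and concatenate their states."""
    forward_state = run_rnn(sequence, *fwd)
    backward_state = run_rnn(list(reversed(sequence)), *bwd)
    return [forward_state, backward_state]

features = bidirectional([1.0, 0.0, 2.0])
print(len(features), features)
```

With the default concatenation in Keras, wrapping `GRU(64)` in `Bidirectional` therefore yields 128 features per step (or per sequence), which is why the bidirectional model has roughly twice as many recurrent parameters as the plain GRU model.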

The 'model_bi_GRU' is compiled using the Adam optimizer and the binary cross-entropy as the loss function, which is suitable for binary classification tasks.

In [None]:
# Compile bi-GRU model
model_bi_GRU.compile(loss="binary_crossentropy",
                     optimizer=tf.keras.optimizers.Adam(),
                     metrics=["accuracy"])

In [None]:
# Get a summary of the model
model_bi_GRU.summary()

In [None]:
# Define the checkpoint path
checkpoint_path = "best_model_bi_GRU"

cc = create_checkpoint_callback(checkpoint_path)

The model_bi_GRU is fit to the training data (X_train and y_train) for 20 epochs, with validation data (X_valid and y_valid) used for evaluation. The training progress is recorded in model_bi_GRU_history, and the defined callbacks (cc) are utilized during training.

In [None]:
# Fit model
model_bi_GRU_history = model_bi_GRU.fit(X_train, y_train,
                                        epochs=20,
                                        validation_data=(X_valid, y_valid),
                                        callbacks=[cc])

After the training phase, the model's history of accuracy and loss is plotted over epochs.

In [None]:
# Plot Utility
def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()

# Plot the accuracy and loss history
plot_graphs(model_bi_GRU_history, 'accuracy')
plot_graphs(model_bi_GRU_history, 'loss')

The optimal model, based on validation loss, is loaded for further use.

In [None]:
from tensorflow.keras.models import load_model

# Load the entire model
model_bi_GRU = load_model(checkpoint_path)

The model's performance is evaluated on the validation dataset.

In [None]:
# Check the results
model_bi_GRU_ev = model_bi_GRU.evaluate(X_valid, y_valid)
model_bi_GRU_loss = model_bi_GRU_ev[0]
model_bi_GRU_loss

The model predicts class probabilities on the validation dataset, which are then transformed into class predictions.

In [None]:
# Make predictions on the validation data
model_bi_GRU_pred_probs = model_bi_GRU.predict(X_valid)
model_bi_GRU_pred_probs.shape, model_bi_GRU_pred_probs[:10]

In [None]:
# Convert prediction probabilities to labels
model_bi_GRU_preds = tf.squeeze(tf.round(model_bi_GRU_pred_probs))
model_bi_GRU_preds[:10]

Various performance metrics, such as accuracy, precision, recall, and the F1-score, are computed for model evaluation.

In [None]:
# Calculate model_bi_GRU results
model_bi_GRU_results = calculate_results(y_true=y_valid,
                                         y_pred=model_bi_GRU_preds,
                                         loss=model_bi_GRU_loss)
model_bi_GRU_results

The baseline model's performance is compared with that of the bidirectional GRU model.

In [None]:
# Compare to baseline
compare_baseline_to_new_results(baseline_results, model_bi_GRU_results)

In [None]:
y_true = y_valid.tolist()  # Convert labels to a list
preds = model_bi_GRU.predict(X_valid)
y_probs = preds.squeeze().tolist()  # Store the prediction probabilities as a list
y_preds = tf.round(y_probs).numpy().tolist()  # Convert probabilities to class predictions and convert to a list


To visualize the model's classification performance, a confusion matrix is generated. Its readability is enhanced through a custom function.

In [None]:
# Check out the non-prettified confusion matrix
confusion_matrix(y_true=y_true,
                 y_pred=y_preds)

In [None]:
# Make a prettier confusion matrix
make_confusion_matrix(y_true=y_true,
                      y_pred=y_preds,
                      classes=class_names,
                      figsize=(15, 15),
                      text_size=10)

Finally, the function 'show_random_predictions' is invoked to generate and display predictions on random samples.

In [None]:
show_random_predictions(model_bi_GRU,
                   X_valid,
                   y_valid,
                   tokenizer,
                   num_samples=10,
                   class_names=class_names)

**Model: Conv1D**

The provided code block establishes a 1D Convolutional Neural Network (CNN) model for binary text classification. Here's an overview of the key components of this model:

1. **Reproducibility**:
   - Reproducibility is ensured by setting a random seed using `tf.random.set_seed(42)`.

2. **Input and Embedding Layer**:
   - The model begins with an input layer that accepts sequences of word indices.
   - An embedding layer is introduced to convert these indices into dense vectors, which capture semantic relationships between words.

3. **Convolutional Layer (Conv1D)**:
   - A Conv1D layer is included with a specific number of filters and kernel size. This layer performs convolution operations on the input sequences to learn spatial patterns and features.

4. **GlobalMaxPooling1D Layer**:
   - Following the Conv1D layer, a GlobalMaxPooling1D layer is applied to reduce the spatial dimensions of the output. This step helps capture the most relevant information from the convolutional layer's output.

5. **Dense Layer with 'relu' Activation**:
   - A dense layer is added to the model, with a 'relu' (Rectified Linear Unit) activation function. This layer is responsible for learning high-level features from the extracted representations.

6. **Output Dense Layer with 'sigmoid' Activation**:
   - The model is finalized with an output dense layer featuring a 'sigmoid' activation function. This configuration is well-suited for binary text classification tasks, as it produces probabilities in the range of 0 to 1.

This model is designed to leverage convolutional operations to capture patterns in the input sequences and make binary classification predictions. It can be particularly effective for text classification tasks where local patterns in the text data are important.

In [None]:
from tensorflow.keras import layers

# Parameters
embedding_dim = 300
filters = 64
kernel_size = 5

tf.random.set_seed(42)

# Input layer
inputs = layers.Input(shape=(X_train.shape[1],), dtype="int32")

# Create an embedding of the tokenized (integer-encoded) text
x = layers.Embedding(input_dim=max_vocab_length,
                     output_dim=embedding_dim,
                     embeddings_initializer="uniform",
                     input_length=max_length,
                     name="embedding_2")(inputs)
# Conv1D layer
x = layers.Conv1D(filters, kernel_size, activation='relu')(x)
# GlobalMaxPooling1D layer
x = layers.GlobalMaxPooling1D()(x)
# Dense layer
x = layers.Dense(512, activation='relu')(x)
# Output layer
outputs = layers.Dense(1, activation='sigmoid')(x)
# Create the model
model_conv = tf.keras.Model(inputs, outputs)
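
The convolution and pooling steps can be illustrated on a toy sequence. The sketch below applies a single filter without padding (the Keras `Conv1D` default), followed by ReLU and a global maximum. It uses scalar inputs and one hand-picked kernel for readability, whereas the model above slides 64 learned filters of width 5 over 300-dimensional embedding vectors.

```python
def conv1d_valid(sequence, kernel):
    """Slide `kernel` over `sequence` without padding and apply ReLU."""
    k = len(kernel)
    outputs = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        activation = sum(x * w for x, w in zip(window, kernel))
        outputs.append(max(0.0, activation))
    return outputs

def global_max_pool(feature_map):
    """Keep only the strongest response of the filter across all positions."""
    return max(feature_map)

sequence = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 2.0]
kernel = [1.0, 0.0, -1.0]
feature_map = conv1d_valid(sequence, kernel)
pooled = global_max_pool(feature_map)
print(feature_map, pooled)
```

A sequence of length 7 and a kernel of width 3 give 7 - 3 + 1 = 5 positions, and pooling reduces them to one number per filter regardless of where in the message the pattern occurred.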



The model is then compiled with the 'adam' optimizer and 'binary_crossentropy' loss function, and the model summary is printed.

In [None]:
# Set the training parameters
model_conv.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Print the model summary
model_conv.summary()

In [None]:
# Define the checkpoint path
checkpoint_path = "best_model_conv"

cc = create_checkpoint_callback(checkpoint_path)

The model_conv is fit to the training data (X_train and y_train) for 20 epochs, with validation data (X_valid and y_valid) used for evaluation. The training progress is recorded in history_conv1d, and the defined callbacks (cc) are utilized during training.

In [None]:
NUM_EPOCHS = 20

# Train the model
history_conv1d = model_conv.fit(X_train, y_train,
                                epochs=NUM_EPOCHS,
                                validation_data=(X_valid, y_valid),
                                callbacks=[cc])

After training, the model's accuracy and loss history are plotted over the epochs.

In [None]:
# Plot Utility
def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()

# Plot the accuracy and loss history
plot_graphs(history_conv1d, 'accuracy')
plot_graphs(history_conv1d, 'loss')

The best-performing model, determined by validation loss, is loaded for future use.

In [None]:
from tensorflow.keras.models import load_model

# Load the entire model
model_conv = load_model(checkpoint_path)

The model is evaluated on the validation set, after which probabilities are predicted and transformed into class predictions.

In [None]:
# Check the results
model_conv_ev = model_conv.evaluate(X_valid, y_valid)
model_conv_loss = model_conv_ev[0]
model_conv_loss

In [None]:
# Make predictions with the model
model_conv_pred_probs = model_conv.predict(X_valid)
model_conv_pred_probs[:10]

In [None]:
# Convert prediction probabilities to labels
model_conv_preds = tf.squeeze(tf.round(model_conv_pred_probs))
model_conv_preds[:10]

Performance metrics such as accuracy, precision, recall, and the F1-score are calculated for model evaluation.

In [None]:
# Calculate model performance metrics
model_conv_results = calculate_results(y_valid, model_conv_preds, loss=model_conv_loss)
model_conv_results

Performance comparisons are made between the baseline and the convolutional model.

In [None]:
# Compare model to baseline
compare_baseline_to_new_results(baseline_results, model_conv_results)

In [None]:
y_true = y_valid.tolist()  # Convert labels to a list
preds = model_conv.predict(X_valid)
y_probs = preds.squeeze().tolist()  # Store the prediction probabilities as a list
y_preds = tf.round(y_probs).numpy().tolist()  # Convert probabilities to class predictions and convert to a list

A confusion matrix is created to visualize the classification model's performance, and a custom function enhances the matrix's readability.

In [None]:
# Check out the non-prettified confusion matrix
confusion_matrix(y_true=y_true,
                 y_pred=y_preds)

In [None]:
# Make a prettier confusion matrix
make_confusion_matrix(y_true=y_true,
                      y_pred=y_preds,
                      classes=class_names,
                      figsize=(15, 15),
                      text_size=10)

Then, the 'show_random_predictions' function generates and displays predictions on random samples.

In [None]:
show_random_predictions(model_conv,
                   X_valid,
                   y_valid,
                   tokenizer,
                   num_samples=10,
                   class_names=class_names)

**Model: USE**

The following code constructs and trains a binary text classification model using Google's Universal Sentence Encoder (USE) from TensorFlow Hub. Here's a summary of the key steps and components:

1. **Universal Sentence Encoder (USE)**: 
   - Google's Universal Sentence Encoder is used as a pre-trained model from TensorFlow Hub. It takes text sentences as input and encodes them into high-dimensional vectors. The USE layer is non-trainable, meaning it retains the pre-trained weights and functions as a feature extractor.

2. **Dense Layer with 'relu' Activation**:
   - The encoded vectors from the USE layer are passed through a dense layer with a Rectified Linear Unit (ReLU) activation function. This dense layer is responsible for learning relevant features from the encoded text representations.

3. **Output Layer with 'sigmoid' Activation**:
   - The final layer of the model is an output layer with a 'sigmoid' activation function. This is suitable for binary classification tasks, as it outputs probabilities ranging from 0 to 1.

4. **Model Compilation**:
   - The model is compiled using the Adam optimizer, which is a popular optimization algorithm for training neural networks. The binary cross-entropy loss function is chosen, which is commonly used for binary classification problems.

5. **Model Summary**:
   - A summary of the model is printed, displaying the architecture, layer details, and the number of trainable and non-trainable parameters.

The remaining cells in this section carry out the steps that follow compilation:

6. **Model Training**:
   - The model is fit to the training data with the `fit` method for 20 epochs, using the validation data for monitoring and a checkpoint callback to keep the best model.

7. **Evaluation and Prediction**:
   - The best checkpoint is evaluated on the validation set, and its predicted probabilities are converted into class predictions.

Overall, this code builds a binary text classification model on top of the Universal Sentence Encoder from TensorFlow Hub, which is then trained and evaluated on the SMS dataset.

In [None]:
import tensorflow_hub as hub

# We can use this encoding layer in place of our text_vectorizer and embedding layer
sentence_encoder_layer = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
                                        input_shape=[], # shape of inputs coming to our model
                                        dtype=tf.string, # data type of inputs coming to the USE layer
                                        trainable=False, # keep the pretrained weights (we'll create a feature extractor)
                                        name="USE")

In [None]:
tf.random.set_seed(42)

# Create model using the Sequential API
model_USE = tf.keras.Sequential([
sentence_encoder_layer, # take in sentences and then encode them into an embedding
layers.Dense(512, activation="relu"),
layers.Dense(1, activation="sigmoid")
], name="model_USE")

# Compile model
model_USE.compile(loss="binary_crossentropy",
              optimizer=tf.keras.optimizers.Adam(),
              metrics=["accuracy"])

model_USE.summary()

A ModelCheckpoint callback is set to save only the best model, as determined by validation loss.

In [None]:
# Define the checkpoint path
checkpoint_path = "best_model_USE"

# Create a ModelCheckpoint callback that saves the full model only when the validation loss improves
cc = ModelCheckpoint(filepath=checkpoint_path,
                                      monitor='val_loss',
                                      mode='min',
                                      save_best_only=True,
                                      verbose=1)
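
The effect of `save_best_only=True` with `monitor='val_loss'` and `mode='min'` can be sketched in plain Python: a checkpoint is written only at epochs where the validation loss is strictly lower than every earlier value. The helper name and the loss values below are made up for illustration.

```python
def epochs_that_save(val_losses):
    """Return the 1-based epochs at which a save-best-only checkpoint is written."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:  # strictly better than the best seen so far
            best = loss
            saved.append(epoch)
    return saved

example_val_losses = [0.30, 0.12, 0.15, 0.09, 0.11]  # made-up values
saved_epochs = epochs_that_save(example_val_losses)
print(saved_epochs)
```

Because the file at `checkpoint_path` is overwritten on each improvement, loading it after training returns the model from the last epoch in that list, not from the final epoch.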

In [None]:
X_train_tx, y_train


The model_USE is fit to the training data (X_train_tx and y_train) for 20 epochs, with validation data (X_valid_tx and y_valid) used for evaluation. The training progress is recorded in model_USE_history, and the defined callbacks (cc) are utilized during training.

In [None]:
# Train a classifier on top of pretrained embeddings
model_USE_history = model_USE.fit(X_train_tx,
                              y_train,
                              epochs=20,
                              validation_data=(X_valid_tx, y_valid),
                              callbacks=[cc])

The model demonstrating the best validation loss is loaded for use in subsequent steps.

In [None]:
# Load the entire model
model_USE = load_model(checkpoint_path)

Next, the model is evaluated on the validation set, where probabilities are predicted and converted into class predictions.

In [None]:
# Check the results
model_USE_ev = model_USE.evaluate(X_valid_tx, y_valid)
model_USE_loss = model_USE_ev[0]
model_USE_loss

Post-training, the model's performance is assessed by plotting the history of accuracy and loss over the epochs.

In [None]:
# Plot Utility
def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()

# Plot the accuracy and loss history
plot_graphs(model_USE_history, 'accuracy')
plot_graphs(model_USE_history, 'loss')

In [None]:
# Make predictions with USE TF Hub model
model_USE_pred_probs = model_USE.predict(X_valid_tx)
model_USE_pred_probs[:10]

In [None]:
# Convert prediction probabilities to labels
model_USE_preds = tf.squeeze(tf.round(model_USE_pred_probs))
model_USE_preds[:10]

Various performance metrics such as accuracy, precision, recall, and the F1-score are calculated for the model.

In [None]:
# Calculate model performance metrics
model_USE_results = calculate_results(y_valid, model_USE_preds, loss=model_USE_loss)
model_USE_results

A comparison of the baseline and USE models is then performed based on these metrics.

In [None]:
# Compare model to baseline
compare_baseline_to_new_results(baseline_results, model_USE_results)

In [None]:
y_true = y_valid.tolist()  # Convert labels to a list
preds = model_USE.predict(X_valid_tx)
y_probs = preds.squeeze().tolist()  # Store the prediction probabilities as a list
y_preds = tf.round(y_probs).numpy().tolist()  # Convert probabilities to class predictions and convert to a list

A confusion matrix is created to offer a visual perspective of the classification model's performance, which is further refined for readability using a custom function.

In [None]:
# Check out the non-prettified confusion matrix
confusion_matrix(y_true=y_true,
                 y_pred=y_preds)

In [None]:
# Make a prettier confusion matrix
make_confusion_matrix(y_true=y_true,
                      y_pred=y_preds,
                      classes=class_names,
                      figsize=(15, 15),
                      text_size=10)

The function `random_predictions` generates and displays predictions on random samples for the USE model.

In [None]:
def random_predictions(model, val_padded_seq, val_labels, num_samples=5, class_names=None):
    # Check if it's binary or multi-class classification
    num_classes = val_labels.shape[1] if len(val_labels.shape) > 1 else 2
    is_binary_classification = num_classes == 2

    # Getting indices of the random samples
    random_indices = np.random.choice(np.arange(len(val_padded_seq)), size=num_samples, replace=False)

    # Selecting the random samples
    random_X_samples = val_padded_seq[random_indices]
    random_y_samples = val_labels[random_indices]

    # Making predictions on the random samples
    y_pred_probs = model.predict(random_X_samples)

    if is_binary_classification:
        y_pred = np.squeeze(np.round(y_pred_probs).astype(int))
    else:
        y_pred = np.argmax(y_pred_probs, axis=1)

    # Print the actual and predicted labels
    for i in range(num_samples):
        text = random_X_samples[i]
        true_label = np.argmax(random_y_samples[i]) if not is_binary_classification else np.squeeze(random_y_samples[i])
        predicted_label = y_pred[i]

        # If class names are provided, use them for printing
        if class_names is not None:
            true_label_name = class_names[true_label]
            predicted_label_name = class_names[predicted_label]
        else:
            true_label_name = true_label
            predicted_label_name = predicted_label

        # Determine the color of the text (green for correct, red for incorrect)
        text_color = Fore.GREEN if true_label == predicted_label else Fore.RED

        print(f"\nSample {i + 1}:")
        print(f"Text: {text}")
        print(text_color + f"True: {true_label_name} \n Predicted: {predicted_label_name}" + Style.RESET_ALL)

In [None]:
random_predictions(model_USE,
                   X_valid_tx,
                   y_valid,
                   num_samples=20,
                   class_names=class_names)

**Model: nnlm-en-dim128-with-normalization**


The provided code block constructs a sequential Neural Network model in TensorFlow Keras, utilizing a pre-trained embedding layer from TensorFlow Hub. Here's a summary of the key components of this model:

1. **Pre-trained Embedding Layer**:
   - The model uses Google's NNLM (Neural Network Language Model) with 128 dimensions for the embedding layer. This pre-trained embedding layer converts input text into fixed-size vectors.
   - The "with-normalization" variant of the module normalizes the input text (for example, by removing punctuation) before the embedding lookup.

2. **Embedding Training**:
   - The model allows for embedding training, meaning the pre-trained embeddings can be fine-tuned during the training process to adapt to the specific dataset.

3. **Dropout Layer**:
   - To prevent overfitting, a Dropout layer is introduced. During training, it randomly nullifies 40% of the input units. This helps improve model generalization.

4. **Dense Layer with 'sigmoid' Activation**:
   - The model concludes with a Dense layer featuring a 'sigmoid' activation function. This configuration is suitable for binary classification tasks, as it produces probabilities in the range of 0 to 1.

This model leverages pre-trained embeddings to represent input text, allows for fine-tuning of these embeddings, incorporates dropout for regularization, and makes binary classification predictions. It can be highly effective for text classification tasks, benefiting from the pre-trained embeddings' ability to capture semantic information in text data.

In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout

# embedding_url = "https://tfhub.dev/google/nnlm-en-dim50-with-normalization/2"
embedding_url = "https://tfhub.dev/google/nnlm-en-dim128-with-normalization/2"

model_NN = Sequential()
model_NN.add(hub.KerasLayer(embedding_url, input_shape=(), dtype=tf.string, trainable=True))
model_NN.add(Dropout(0.4))
model_NN.add(Dense(1, activation="sigmoid"))
model_NN.summary()
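
The behaviour of the dropout step can be sketched in plain Python. During training, each value is zeroed with probability `rate` and the survivors are scaled by 1 / (1 - rate) so the expected sum is unchanged; at inference the input passes through untouched. The function below is an illustrative stand-in, not the Keras layer, and the seed and input values are arbitrary.

```python
import random

def dropout(values, rate, training, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not training:
        return list(values)  # no-op at inference time
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(42)
activations = [1.0] * 10
train_out = dropout(activations, rate=0.4, training=True, rng=rng)
infer_out = dropout(activations, rate=0.4, training=False, rng=rng)
print(train_out)
print(infer_out)
```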

This next line compiles the neural network model (model_NN) using the RMSprop optimizer and binary cross-entropy loss function. It also specifies that the model's performance should be evaluated using accuracy as a metric.

In [None]:
model_NN.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])

This cell creates a ModelCheckpoint callback (cc). The callback monitors the validation loss (val_loss) during training and saves the full model to the specified checkpoint_path only when the validation loss decreases.

In [None]:
# Define the checkpoint path
checkpoint_path = "best_model_nn"

# Create a ModelCheckpoint callback that saves the full model only when the validation loss improves
cc = ModelCheckpoint(filepath=checkpoint_path,
                                      monitor='val_loss',
                                      mode='min',
                                      save_best_only=True,
                                      verbose=1)


The fit method is called to train the neural network model. It takes the training data (X_train_tx and y_train) and runs for 20 epochs. The validation_data parameter is set to evaluate the model's performance on the validation data (X_valid_tx and y_valid). The callbacks argument is provided with the cc callback, which triggers the model checkpointing process during training.

In [None]:
model_NN_history = model_NN.fit(X_train_tx,
                              y_train,
                              epochs=20,
                              validation_data=(X_valid_tx, y_valid),
                              callbacks=[cc])


After training is complete, we load the entire model from the saved checkpoint path (checkpoint_path).

In [None]:
# Load the entire model
model_NN = load_model(checkpoint_path)

In [None]:
# Check the results
model_NN_ev = model_NN.evaluate(X_valid_tx, y_valid)
model_NN_loss = model_NN_ev[0]
model_NN_loss

Post-training, the model's performance is assessed by plotting the history of accuracy and loss over the epochs.

In [None]:
# Plot Utility
def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()

# Plot the accuracy and loss history
plot_graphs(model_NN_history, 'accuracy')
plot_graphs(model_NN_history, 'loss')

Next, the model is evaluated on the validation set, where probabilities are predicted and converted into class predictions.

In [None]:
# Make predictions with the model
model_NN_pred_probs = model_NN.predict(X_valid_tx)
model_NN_pred_probs[:10]

In [None]:
# Convert prediction probabilities to labels
model_NN_preds = tf.squeeze(tf.round(model_NN_pred_probs))
model_NN_preds[:10]

Various performance metrics such as accuracy, precision, recall, and the F1-score are calculated for the model.

In [None]:
# Calculate model performance metrics
model_NN_results = calculate_results(y_valid, model_NN_preds, loss=model_NN_loss)
model_NN_results

A comparison of the baseline and NNLM model is then performed based on these metrics.

In [None]:
# Compare model to baseline
compare_baseline_to_new_results(baseline_results, model_NN_results)

In [None]:
y_true = y_valid.tolist()  # Convert labels to a list
preds = model_NN.predict(X_valid_tx)
y_probs = preds.squeeze().tolist()  # Store the prediction probabilities as a list
y_preds = tf.round(y_probs).numpy().tolist()  # Convert probabilities to class predictions and convert to a list

A confusion matrix is created to offer a visual perspective of the classification model's performance, which is further refined for readability using a custom function.

In [None]:
# Check out the non-prettified confusion matrix
confusion_matrix(y_true=y_true,
                 y_pred=y_preds)

In [None]:
# Make a prettier confusion matrix
make_confusion_matrix(y_true=y_true,
                      y_pred=y_preds,
                      classes=class_names,
                      figsize=(15, 15),
                      text_size=10)

The `random_predictions` function defined above generates and displays predictions on random samples for the NNLM model.

In [None]:
random_predictions(model_NN,
                   X_valid_tx,
                   y_valid,
                   num_samples=20,
                   class_names=class_names)

**Comparing all models**

The following section compares and visualizes the performance of the various models. The evaluation results from each model are first compiled into a DataFrame, accuracy is rescaled to the 0 to 1 range used by the other metrics, and the models are then sorted by validation loss from lowest to highest.

In [None]:
# Combine model results into a DataFrame
all_model_results = pd.DataFrame({"baseline": baseline_results,
                                  "Simple Dense": model_dense_results,
                                  "LSTM": model_1LSTM_results,
                                  "Bidirectional LSTM": model_lstm_results,
                                  "GRU": model_GRU_results,
                                  "Bidirectional GRU": model_bi_GRU_results,
                                  "Conv1D": model_conv_results,
                                  "USE": model_USE_results,
                                  "NNLM": model_NN_results,
                                  })


all_model_results = all_model_results.transpose()
# Reduce the accuracy to same scale as other metrics
all_model_results["accuracy"] = all_model_results["accuracy"]/100
all_model_results=all_model_results.sort_values(by="loss", ascending=True)
all_model_results


The summary of the model performance results revealed distinct variations in accuracy, precision, recall, F1 score, and loss among the different models. The Simple Dense model stood out as the best-performing model, achieving perfect accuracy, precision, recall, and F1 score all at 1.0, accompanied by the smallest loss of 0.003321. The LSTM model followed closely with an accuracy, precision, recall, and F1 score of approximately 0.998 and a slightly higher loss of 0.004826. Bidirectional GRU and Bidirectional LSTM also showed impressive performance, albeit slightly lower than the LSTM model, with similar scores in the range of 0.993 to 0.995 and a marginally higher loss. The NNLM and GRU models also produced high-level performance metrics, ranging around 0.992 to 0.994, with incrementally higher losses. The USE and Conv1D models were on par with GRU, but with an increased loss. Lastly, the baseline model demonstrated the lowest performance metrics in the range of approximately 0.980 and an undefined loss.
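
One detail worth illustrating is how a model without a loss value behaves when the table is sorted by loss. The sketch below uses a toy DataFrame with rounded, illustrative numbers and assumes the baseline's undefined loss is stored as `NaN`; both the numbers and that representation are assumptions.

```python
import numpy as np
import pandas as pd

# Toy results for illustration only; the baseline has no loss (assumed to be NaN)
toy_results = pd.DataFrame(
    {"f1": [0.98, 1.0, 0.998], "loss": [np.nan, 0.0033, 0.0048]},
    index=["baseline", "Simple Dense", "LSTM"],
)

# sort_values places NaN last by default (na_position="last")
by_loss = toy_results.sort_values(by="loss", ascending=True)

# Ranking by F1 does not depend on the loss column at all
by_f1 = toy_results.sort_values(by="f1", ascending=False)

print(by_loss)
print(by_f1)
```

Because missing values sort last by default, a baseline without a loss lands at the bottom of a loss-sorted table regardless of its other metrics, so the F1 ranking is the fairer view for it.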

In [None]:
# Plot and compare all of the model results
all_model_results.plot(kind="bar", figsize=(10, 7)).legend(bbox_to_anchor=(1.0, 1.0));

In [None]:
# Sort model results by f1-score
all_model_results.sort_values("f1", ascending=False)["f1"].plot(kind="bar", figsize=(10, 7));

**Evaluation Metrics**

Moving forward, the Simple Dense model (`model_dense`) is used for further analysis.

In [None]:
y_true = y_valid.tolist()  # Convert labels to a list
preds = model_dense.predict(X_valid)
y_probs = preds.squeeze().tolist()  # Store the prediction probabilities as a list
y_preds = tf.round(y_probs).numpy().tolist()  # Convert probabilities to class predictions and convert to a list

In [None]:
from sklearn.metrics import classification_report, accuracy_score, f1_score, recall_score, precision_score

report = classification_report(y_true, y_preds)
print(report)

**Make Prediction on Text from Wild**

In [None]:
# Sample message 1: spam-like text
textx = "text meet someone sexy today u can find date even flirt, call customer service 78990gh3. xxx9893jkdljk"

In [None]:
# Sample message 2: spam-like text
textx2 = "thanks ringtone order ref number r836 mobile will charged 450 tone not arrive please call customer service 09065069154"

In [None]:
# Sample message 3: ordinary (ham-like) text
textx3 = "The weather today is beautiful, perfect for a walk in the park."

The 'predict_on_sentence' function is used to make predictions on unseen sentences, displaying the predicted class and its corresponding probability.

In [None]:
def predict_on_sentence(model, sentence, category_reverse_mapping, tokenizer, max_length):
    """
    Uses model to make a prediction on sentence.

    Prints the predicted label, the prediction probability and the sentence.
    """

    # Convert the sentence into sequences
    sequence = tokenizer.texts_to_sequences([sentence])

    # Pad the sequences to ensure consistent length
    padded_sequence = pad_sequences(sequence, maxlen=max_length)

    # Make the prediction
    pred_prob = model.predict(padded_sequence)
    pred_label = np.round(pred_prob).astype(int)[0]  # Converting to int to match the format of your labels

    # Get the label names of the predicted class
    pred_label_str = category_reverse_mapping[pred_label[0]]  # Use the first element of pred_label
    pred_prob_str = pred_prob[0][0]

    print(f"Prediction: {pred_label_str}")  # Print the predicted label
    print(f"Prediction probability: {pred_prob_str}")  # Print the prediction probabilities
    print(f"Text:\n{sentence}")


In [None]:
# Make a prediction on text from the wild
predict_on_sentence(model=model_dense,
                    sentence=textx,
                    category_reverse_mapping=class_names,
                    tokenizer=tokenizer,
                    max_length=max_length
                   )

In [None]:
# Make a prediction on text from the wild
predict_on_sentence(model=model_dense,
                    sentence=textx2,
                    category_reverse_mapping=class_names,
                    tokenizer=tokenizer,
                    max_length=max_length)

In [None]:
# Make a prediction on text from the wild
predict_on_sentence(model=model_dense,
                    sentence=textx3,
                    category_reverse_mapping=class_names,
                    tokenizer=tokenizer,
                    max_length=max_length)
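
A small caveat about turning a probability into a label with `np.round`: NumPy rounds halves to the nearest even number, so a probability of exactly 0.5 becomes 0. The sketch below shows an explicit threshold as an alternative. The helper name `prob_to_label` and the class order `["ham", "spam"]` are assumptions for illustration, not the notebook's definitions.

```python
import numpy as np


def prob_to_label(prob, class_names, threshold=0.5):
    """Hypothetical helper: map a spam probability to a class name via a threshold."""
    index = int(prob >= threshold)  # 1 if prob reaches the threshold, else 0
    return class_names[index]


toy_class_names = ["ham", "spam"]  # assumed order: index 0 = ham, index 1 = spam

# np.round uses round-half-to-even, so exactly 0.5 rounds down to 0
rounded_half = int(np.round(0.5))

for prob in (0.49, 0.5, 0.93):
    print(prob, "->", prob_to_label(prob, toy_class_names))
```

An explicit threshold also makes it easy to trade precision against recall later, for example by raising it so that fewer legitimate messages are flagged as spam.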

**Most Wrong Predictions**

The 'Most Wrong Predictions' code segment below allows for a deeper understanding of model errors by identifying and analyzing predictions that deviate most from actual values, assisting in pinpointing model weaknesses and potential areas of improvement. Here, a DataFrame is created with the validation sentences and predictions of the best-performing model, which helps to identify the most incorrect predictions, both false positives and negatives.

In [None]:
val_df = pd.DataFrame({
    "text": X_valid_tx.tolist(),
    "target": [class_names[label] for label in y_valid],
    "target_label": y_valid.tolist(),
    "pred": [class_names[int(label)] for label in y_preds],
    "pred_label": [int(label) for label in y_preds],
    "pred_prob": y_probs  # raw prediction probabilities, not the rounded labels
})
val_df

In [None]:
# Find the wrong predictions and sort by prediction probabilities
most_wrong = val_df[val_df["target"] != val_df["pred"]].sort_values("pred_prob", ascending=False)
most_wrong[:10]

Because the Simple Dense model classified every validation sample correctly, the filter above returns an empty DataFrame: there are no wrong predictions to analyze. A perfect validation score should still be treated with caution. It can also be a sign of overfitting, or of a validation set that is too small or too similar to the training data, so the model should be evaluated on additional unseen data to confirm that it generalizes.
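
Since the validation set produced no errors to inspect, the sketch below shows on made-up data what a "most wrong" table looks like when errors do exist. The messages, labels and probabilities are invented for illustration, and the mapping 0 = ham, 1 = spam is an assumption.

```python
import pandas as pd

# Invented examples for illustration only (assumed mapping: 0 = ham, 1 = spam)
toy_df = pd.DataFrame({
    "text": ["win a free prize now", "see you at lunch",
             "call this number urgently", "ok thanks",
             "free entry in a weekly draw"],
    "target_label": [1, 0, 0, 0, 1],
    "pred_prob": [0.97, 0.02, 0.91, 0.10, 0.08],
})
toy_df["pred_label"] = (toy_df["pred_prob"] >= 0.5).astype(int)

# Keep only the misclassified rows, most confident spam predictions first
wrong = toy_df[toy_df["target_label"] != toy_df["pred_label"]]
most_wrong_toy = wrong.sort_values("pred_prob", ascending=False)

# Top of the table: confident false positives; bottom: confident false negatives
false_positives = most_wrong_toy[most_wrong_toy["pred_label"] == 1]
false_negatives = most_wrong_toy[most_wrong_toy["pred_label"] == 0]
print(most_wrong_toy)
```

Sorting by the raw probability rather than the rounded label is what makes this useful: the rows at either end are the cases where the model was most confident and still wrong.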

**The speed/score tradeoff**

The speed-performance trade-off code below helps quantify and visualize the balance between a model's prediction accuracy (performance) and the computational resources required (time taken for predictions), which is critical in optimizing real-world applications.

In [None]:
# Calculate the time of predictions
import time
def pred_timer(model, samples):
  """
  Times how long a model takes to make predictions on samples.

  Args:
  ----
  model = a trained model
  samples = a list of samples

  Returns:
  ----
  total_time = total elapsed time for model to make predictions on samples
  time_per_pred = time in seconds per single sample
  """
  start_time = time.perf_counter() # get start time
  model.predict(samples) # make predictions
  end_time = time.perf_counter() # get finish time
  total_time = end_time-start_time # calculate how long predictions took to make
  time_per_pred = total_time/len(samples) # find prediction time per sample
  return total_time, time_per_pred

In [None]:
# Calculate model prediction times
model_total_pred_time, model_time_per_pred = pred_timer(model_dense, X_valid)
model_total_pred_time, model_time_per_pred

In [None]:
# Calculate Naive Bayes prediction times
baseline_total_pred_time, baseline_time_per_pred = pred_timer(model_0, X_valid_tx)
baseline_total_pred_time, baseline_time_per_pred
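
A single timing run can be noisy, for instance because the first call to a model often includes one-off setup work. The sketch below repeats the measurement and keeps the fastest run. The helper `timed_predict` and the stand-in `toy_predict` function are hypothetical; any callable that accepts a batch of samples could be passed instead.

```python
import time


def timed_predict(predict_fn, samples, repeats=5):
    """Hypothetical helper: time predict_fn over several runs and keep the fastest.

    Returns (best_total_time, time_per_sample), both in seconds.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn(samples)
        timings.append(time.perf_counter() - start)
    best_total = min(timings)  # the fastest run is least affected by background noise
    return best_total, best_total / len(samples)


def toy_predict(samples):
    """Stand-in for a model's predict method (illustration only)."""
    return [len(s) % 2 for s in samples]


toy_samples = ["hello", "free prize", "see you soon"] * 100
total, per_sample = timed_predict(toy_predict, toy_samples)
print(f"total: {total:.6f}s, per sample: {per_sample:.9f}s")
```

Dividing by `len(samples)` rather than by a fixed dataset keeps the per-sample figure correct when different models are timed on different inputs.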

The time per prediction and F1-score for each model are plotted on a scatter plot to visualize this trade-off.

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 7))
plt.scatter(baseline_time_per_pred, baseline_results["f1"], label="baseline")
plt.scatter(model_time_per_pred, model_dense_results["f1"], label="model dense")
plt.legend()
plt.title("F1-score versus time per prediction")
plt.xlabel("Time per prediction")
plt.ylabel("F1-Score");

![](https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/images/08-ideal-performance-speed-of-pred-tradeoff-highlighted.png)
*Ideal position for speed and performance tradeoff model (fast predictions with great results).*

As can be seen, the best-performing model, while significantly improving the F1-score, takes considerably more time per prediction, highlighting a crucial consideration for machine learning applications: balancing model performance and prediction speed.

**Improving model performance: Ensemble Models**

Ensemble methods are a cornerstone of machine learning: multiple models are trained and their predictions are combined, which often gives a more robust and accurate result than any single model. Depending on the method, ensembles can reduce variance, bias, or both, and they tend to generalize better than a single estimator.

In this project, an ensemble of three models (Bi-LSTM, Dense, and Bi-GRU) is created. The prediction probabilities of these models are added together and averaged (by dividing by 3). Because this is a binary problem with a single probability per message, the averaged probability is then rounded to the nearest class label (0 or 1) to give the final ensemble prediction. This is a form of soft voting: each model contributes its confidence rather than just its vote, so the ensemble draws on what all three models have learned.

In [None]:
# Get mean pred probs for 3 models
combined_pred_probs = tf.squeeze(model_dense_pred_probs, axis=1) + tf.squeeze(model_bi_GRU_pred_probs, axis=1) + tf.squeeze(model_lstm_pred_probs)
combined_preds = tf.round(combined_pred_probs/3) # average and round the prediction probabilities to get prediction classes
combined_preds[:20]
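
The averaging step can be illustrated without any trained models. The sketch below uses three invented probability arrays and compares averaging the probabilities (soft voting) with counting per-model votes (hard voting). The arrays and the 0.5 threshold are assumptions for illustration.

```python
import numpy as np

# Invented spam probabilities from three hypothetical models for four messages
probs_a = np.array([0.90, 0.40, 0.10, 0.55])
probs_b = np.array([0.80, 0.30, 0.20, 0.60])
probs_c = np.array([0.95, 0.65, 0.05, 0.05])

# Soft voting: average the probabilities, then threshold the mean
mean_probs = np.mean([probs_a, probs_b, probs_c], axis=0)
ensemble_labels = (mean_probs >= 0.5).astype(int)

# Hard voting: threshold each model first, then take the majority
votes = (np.stack([probs_a, probs_b, probs_c]) >= 0.5).sum(axis=0)
majority_labels = (votes >= 2).astype(int)

print("mean probabilities:", mean_probs)
print("soft voting labels:", ensemble_labels)
print("hard voting labels:", majority_labels)
```

The last message shows where the two schemes can disagree: two models lean slightly towards spam, but one is very confident it is ham, so the averaged probability falls below 0.5 while the majority vote says spam.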

In [None]:
from tensorflow.keras.losses import BinaryCrossentropy

loss_fn = BinaryCrossentropy()
loss = loss_fn(y_valid, combined_pred_probs/3) # Note, this is before rounding
loss_np=loss.numpy()
print('Ensemble model loss: ', loss_np)
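
To show why the loss is computed on the averaged probabilities rather than on the rounded labels, here is a hand-written binary cross-entropy in NumPy. It is a sketch of the standard formula, not Keras's implementation; the clipping constant `eps` and the toy values are assumptions.

```python
import numpy as np


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; eps guards against log(0) (assumed value)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    per_sample = y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
    return float(-np.mean(per_sample))


toy_labels = [1, 0, 1, 0]

# Both sets of predictions round to the correct labels,
# but the loss still distinguishes confident from hesitant models
confident = binary_cross_entropy(toy_labels, [0.9, 0.1, 0.9, 0.1])
hesitant = binary_cross_entropy(toy_labels, [0.6, 0.4, 0.6, 0.4])
print(f"confident: {confident:.4f}, hesitant: {hesitant:.4f}")
```

After rounding, both prediction sets would look identical, so the loss would no longer carry any information about how confident the ensemble was.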

The following code evaluates the ensemble model's performance, adds these results to a DataFrame for comparison with other models, and plots the F1 scores of all models.

In [None]:
# Calculate results from averaging the prediction probabilities
ensemble_results = calculate_results(y_valid, combined_preds, loss=loss_np)
ensemble_results

In [None]:
# Add our combined model's results to the results DataFrame
all_model_results.loc["ensemble_results"] = ensemble_results
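# Convert the ensemble's accuracy to the same scale as the other metrics
# (a single .loc call avoids chained assignment, which may not modify the DataFrame)
all_model_results.loc["ensemble_results", "accuracy"] = all_model_results.loc["ensemble_results", "accuracy"]/100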

In [None]:
all_model_results

In [None]:
# Sort all results, including the ensemble, by loss (lowest first)
sorted_df = all_model_results.sort_values(by='loss', ascending=True)
sorted_df

The ensemble model's performance evaluation revealed intriguing results, surpassing most individual models, including Bidirectional GRU, Bidirectional LSTM, NNLM, GRU, USE, Conv1D, and the baseline model. It achieved remarkable accuracy, precision, recall, and F1 score (approximately 0.997) with a low loss of 0.006873, ranking third in performance, only behind the Simple Dense and LSTM models.

This shows that combining models can pay off: by averaging out the errors of individual models, the ensemble outperformed most of its competitors, and ensembles often generalize better and behave more stably than a single model. Even so, it did not outperform the Simple Dense and LSTM models, which suggests there is room for fine-tuning or for exploring alternative ensemble techniques.

**Save Model**

In [None]:
# Save dense model to HDF5 format
model_dense.save("model.h5")

**Load saved model**

In [None]:
# Load saved model
loaded_model_SavedModel = tf.keras.models.load_model("model.h5")

In [None]:
# Evaluate loaded model
loaded_model_SavedModel.evaluate(X_valid, y_valid)

In [None]:
loaded_model_SavedModel.summary()

In [None]:
# Evaluate the loaded model again (it was loaded from the HDF5 file, not the SavedModel format)
loaded_model_SavedModel.evaluate(X_valid, y_valid)