
# Preparation



Dataset: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

In [2]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")

print("Path to dataset files:", path)

Path to dataset files: /kaggle/input/imdb-dataset-of-50k-movie-reviews


In [3]:
import pandas as pd

df = pd.read_csv(path + "/IMDB Dataset.csv")

df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


## Class Variables
Binary Classification Task


*   Sentiment
    *   Positive
    *   Negative

In [4]:
# Change 'positive' to 1 and 'negative' to 0 in the 'sentiment' column
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})

df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


# Tokenization and Sequence Preparation

For this task, I've chosen the BERT tokenizer, which uses WordPiece tokenization. WordPiece breaks words into subword units, so an out-of-vocabulary word is represented as a sequence of known pieces instead of a single unknown token. The maximum sequence length is set to 256 tokens: longer reviews are truncated and shorter ones are padded. This length balances:

*   Coverage: Most movie reviews fit within this length
*   Information Preservation: Key sentiment indicators are typically distributed throughout the text, so a truncated review usually still carries sentiment signal
*   Computational Efficiency: Keeps memory requirements manageable
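
The coverage claim can be checked by measuring what fraction of reviews fit under a given token limit. The sketch below is only an illustration: `length_coverage` is a hypothetical helper, and the token counts are made-up values, not measurements from the IMDB data.

```python
def length_coverage(token_counts, max_length=256):
    """Return the fraction of documents whose token count fits in max_length."""
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    fits = sum(1 for n in token_counts if n <= max_length)
    return fits / len(token_counts)

# Hypothetical token counts, for illustration only (not measured on IMDB).
# With the notebook's tokenizer, real counts could be gathered with:
#   counts = [len(ids) for ids in tokenizer(texts, truncation=False)["input_ids"]]
example_counts = [120, 180, 240, 256, 300, 512, 90, 700]

# Compare a few candidate limits to see how much text each one would keep whole
for limit in (128, 256, 512):
    print(f"max_length={limit}: {length_coverage(example_counts, limit):.2%} fit")
```

Running the same comparison on the real token counts would show whether 256 is a reasonable trade-off or whether a larger limit is worth the extra memory.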

In [5]:
import tensorflow as tf
from transformers import BertTokenizer

# Load BERT tokenizer (lowercased version)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize a pandas Series of strings into fixed-length ID sequences
def preprocess_text_bert(texts, max_length=256):
    return tokenizer(
        texts.tolist(),  # list of strings
        padding='max_length',         # pad every sequence to max_length
        truncation=True,              # cut sequences longer than max_length
        max_length=max_length,
        return_tensors='tf',          # returns TensorFlow tensors
        return_attention_mask=True,   # include attention mask
        return_token_type_ids=False   # not needed for single-sentence inputs
    )

encoded_inputs = preprocess_text_bert(df['review'])

# You now have:
# - encoded_inputs['input_ids']: token ID sequences
# - encoded_inputs['attention_mask']: attention masks

# 'sentiment' was already mapped to 0/1 in the earlier cell, so use the values
# directly; mapping the strings a second time would turn every label into NaN.
dataset = tf.data.Dataset.from_tensor_slices((
    dict(input_ids=encoded_inputs['input_ids'],
         attention_mask=encoded_inputs['attention_mask']),
    df['sentiment'].values
))



## Stratified Ten-Fold Cross-Validation

In [6]:
from sklearn.model_selection import StratifiedKFold
import numpy as np

# Define the stratified k-fold cross-validation
n_splits = 10
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# Prepare data arrays
X = encoded_inputs['input_ids'].numpy()  # Tokenized input IDs
attention_masks = encoded_inputs['attention_mask'].numpy()  # Attention masks
y = df['sentiment'].values  # Target labels

# Lists to store metrics for each fold
fold_metrics = []

# Implement cross-validation
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    print(f"\n--- Fold {fold+1}/{n_splits} ---")

    # Split data
    X_train, X_test = X[train_idx], X[test_idx]
    mask_train, mask_test = attention_masks[train_idx], attention_masks[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Verify stratification (should be close to original distribution)
    print(f"Training set class distribution: {np.bincount(y_train) / len(y_train)}")
    print(f"Testing set class distribution: {np.bincount(y_test) / len(y_test)}")



    # For demonstration purposes, we'll just show how to create the datasets
    train_dataset = tf.data.Dataset.from_tensor_slices((
        {'input_ids': X_train, 'attention_mask': mask_train},
        y_train
    )).shuffle(10000).batch(32)

    test_dataset = tf.data.Dataset.from_tensor_slices((
        {'input_ids': X_test, 'attention_mask': mask_test},
        y_test
    )).batch(32)




--- Fold 1/10 ---
Training set class distribution: [0.5 0.5]
Testing set class distribution: [0.5 0.5]

--- Fold 2/10 ---
Training set class distribution: [0.5 0.5]
Testing set class distribution: [0.5 0.5]

--- Fold 3/10 ---
Training set class distribution: [0.5 0.5]
Testing set class distribution: [0.5 0.5]

--- Fold 4/10 ---
Training set class distribution: [0.5 0.5]
Testing set class distribution: [0.5 0.5]

--- Fold 5/10 ---
Training set class distribution: [0.5 0.5]
Testing set class distribution: [0.5 0.5]

--- Fold 6/10 ---
Training set class distribution: [0.5 0.5]
Testing set class distribution: [0.5 0.5]

--- Fold 7/10 ---
Training set class distribution: [0.5 0.5]
Testing set class distribution: [0.5 0.5]

--- Fold 8/10 ---
Training set class distribution: [0.5 0.5]
Testing set class distribution: [0.5 0.5]

--- Fold 9/10 ---
Training set class distribution: [0.5 0.5]
Testing set class distribution: [0.5 0.5]

--- Fold 10/10 ---
Training set class distribution: [0.5 0.5]
T

In [None]:
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

# Define a BERT-based classifier with a two-class output head
def create_model():
    model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])
    return model

# Train and evaluate a fresh model on each fold
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    print(f"\n--- Training Fold {fold+1}/{n_splits} ---")

    # Split data
    X_train, X_test = X[train_idx], X[test_idx]
    mask_train, mask_test = attention_masks[train_idx], attention_masks[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Create train and test datasets
    train_dataset = tf.data.Dataset.from_tensor_slices((
        {'input_ids': X_train, 'attention_mask': mask_train},
        y_train
    )).shuffle(10000).batch(32)

    test_dataset = tf.data.Dataset.from_tensor_slices((
        {'input_ids': X_test, 'attention_mask': mask_test},
        y_test
    )).batch(32)

    # Release the previous fold's Keras state, then build a new model
    tf.keras.backend.clear_session()
    model = create_model()

    # Train the model; note that the held-out fold doubles as validation data,
    # so the validation curve is not independent of the test score
    history = model.fit(train_dataset, epochs=3, validation_data=test_dataset)

    # Evaluate the model on the test dataset and record the result
    test_loss, test_accuracy = model.evaluate(test_dataset)
    fold_metrics.append(test_accuracy)
    print(f"Fold {fold+1} Test Accuracy: {test_accuracy:.4f}")



--- Training Fold 1/10 ---


All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/3
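
Once every fold has finished, the per-fold test accuracies can be reduced to a single cross-validated estimate. The following is a minimal sketch of that step. `summarize_folds` is a hypothetical helper, and the accuracy values are placeholders for illustration only, not results from this notebook.

```python
import statistics

def summarize_folds(accuracies):
    """Return (mean, sample standard deviation) of per-fold accuracies."""
    mean = statistics.mean(accuracies)
    # The sample standard deviation needs at least two folds
    std = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean, std

# Placeholder values for illustration only; these are NOT results from this notebook.
placeholder_accuracies = [0.90, 0.92, 0.91, 0.89, 0.93]

mean_acc, std_acc = summarize_folds(placeholder_accuracies)
print(f"Accuracy: {mean_acc:.4f} +/- {std_acc:.4f} over {len(placeholder_accuracies)} folds")
```

Reporting the spread alongside the mean shows how sensitive the model is to which reviews land in the held-out fold.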


# Modeling

# Exceptional Work