# Cybersec 590 - Offensive and Defensive Uses of AI
### Module: AI Fundamentals
### Task: Demonstrating AI Workflows

#### Compare workflows for:
1. Building models from scratch
2. Fine-tuning a model

**Building models from scratch** involves creating an entirely new neural network architecture, defining layers, parameters, and training protocols from the ground up. This workflow requires data collection, preprocessing, architecture design, training from random weights, and comprehensive validation. It's computationally expensive and time-consuming but offers complete control over the model's behavior and is necessary when no existing models suit the specific problem domain.

**Fine-tuning a model** starts with a pre-trained model that has already learned general patterns from large datasets. The workflow involves selecting an appropriate base model, adapting its final layers for the new task, and training either specific layers or the entire model with a smaller learning rate on task-specific data. This approach usually needs less data and compute than training from scratch, and it often reaches good performance sooner because it reuses what the base model has already learned.
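The idea of "training only specific layers" can be illustrated with a small pure-Python sketch. It is not the notebook's model: `BASE_WEIGHTS`, `base_features`, and the toy data are hypothetical stand-ins for a pre-trained base, and the only thing the loop updates is a new logistic-regression head.

```python
import math

# Hypothetical "pre-trained" base: fixed weights mapping 2 inputs to 2 features.
BASE_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]

def base_features(x):
    """Frozen base: linear map followed by tanh. Its weights are never updated."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in BASE_WEIGHTS]

def predict(head, x):
    """New task head: logistic regression on top of the frozen features."""
    feats = base_features(x)
    z = sum(w * f for w, f in zip(head["w"], feats)) + head["b"]
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(head, data):
    """Average binary cross-entropy over the dataset."""
    total = 0.0
    for x, y in data:
        p = predict(head, x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

def fine_tune_head(head, data, lr=0.5, epochs=200):
    """Plain SGD that touches only the head parameters."""
    for _ in range(epochs):
        for x, y in data:
            feats = base_features(x)
            err = predict(head, x) - y  # gradient of the loss w.r.t. the logit
            head["w"] = [w - lr * err * f for w, f in zip(head["w"], feats)]
            head["b"] -= lr * err
    return head

# Toy task-specific data: label 1 when the first input dominates.
data = [([2.0, 0.0], 1), ([1.5, 0.5], 1), ([0.0, 2.0], 0), ([0.5, 1.5], 0)]
head = {"w": [0.0, 0.0], "b": 0.0}

loss_before = mean_loss(head, data)
fine_tune_head(head, data)
loss_after = mean_loss(head, data)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

Training the whole model instead would mean updating `BASE_WEIGHTS` as well, typically with a smaller learning rate so the pre-trained values are not overwritten too aggressively.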

_______________________

Run locally or open in Google Colab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dukecybersec/cybersec_590_ODUAI/blob/main/ai_fundamentals/ai_fundamentals_workflows.ipynb)


## Task: Sentiment Analysis


### Code Setup

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Embedding, LSTM
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.datasets import imdb
import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import torch
import openai
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [2]:
# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)
torch.manual_seed(42)

<torch._C.Generator at 0x784b2589e670>

In [3]:
# Download NLTK tools
nltk.download('punkt')      # Punkt tokenizer models (used to split text into sentences and words)
nltk.download('stopwords')  # stop-word lists (very common words such as "the" that can be removed before sentiment analysis)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [4]:
# Load IMDB movie reviews dataset
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17464789/17464789 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step


In [5]:
# Get word index for decoding
word_index = imdb.get_word_index()
reverse_word_index = {value: key for key, value in word_index.items()}

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
1641221/1641221 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step


In [6]:
# Helper function to convert a numerical review back to text
def decode_review(encoded_review):
    """Decode numerical review back to text"""
    return ' '.join([reverse_word_index.get(i - 3, '?') for i in encoded_review])

# Convert back to text for processing using helper function decode_review()
x_train_text = [decode_review(review) for review in x_train]
x_test_text = [decode_review(review) for review in x_test]
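The `i - 3` in `decode_review` exists because `imdb.load_data` shifts every word index by 3 by default and reserves ids 0, 1, and 2 for padding, start-of-review, and out-of-vocabulary markers. A minimal sketch with a hypothetical three-word vocabulary (`toy_word_index` is made up for illustration):

```python
# Toy stand-in for imdb.get_word_index(): word -> frequency rank (1 = most common).
toy_word_index = {"the": 1, "movie": 2, "great": 3}
toy_reverse_index = {rank: word for word, rank in toy_word_index.items()}

def decode_toy_review(encoded_review):
    """Undo the +3 shift; reserved ids 0-2 map to no word and decode to '?'."""
    return " ".join(toy_reverse_index.get(i - 3, "?") for i in encoded_review)

# 1 marks the start of a review and 2 marks an out-of-vocabulary word.
encoded = [1, 4, 5, 6, 2]
decoded = decode_toy_review(encoded)
print(decoded)  # ? the movie great ?
```

This is why the decoded IMDB reviews begin with `?` and contain `?` wherever a word fell outside the `num_words=10000` vocabulary.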

In [7]:
# Create DataFrame for easier handling
train_df = pd.DataFrame({
    'text': x_train_text,
    'label': y_train
})

test_df = pd.DataFrame({
    'text': x_test_text,
    'label': y_test
})

# Combine for proper train/val/test split
full_df = pd.concat([train_df, test_df], ignore_index=True)

print(f"Dataset shape: {full_df.shape}")
print(f"Label distribution:\n{full_df['label'].value_counts()}")

Dataset shape: (50000, 2)
Label distribution:
label
1    25000
0    25000
Name: count, dtype: int64


In [8]:
# Split data: 70% train, 15% validation, 15% test
train_texts, temp_texts, train_labels, temp_labels = train_test_split(
    full_df['text'], full_df['label'],
    test_size=0.3, random_state=42, stratify=full_df['label']
)

val_texts, test_texts, val_labels, test_labels = train_test_split(
    temp_texts, temp_labels,
    test_size=0.5, random_state=42, stratify=temp_labels
)

print(f"Train set: {len(train_texts)} samples")
print(f"Validation set: {len(val_texts)} samples")
print(f"Test set: {len(test_texts)} samples")

Train set: 35000 samples
Validation set: 7500 samples
Test set: 7500 samples
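The 70/15/15 ratio comes from splitting twice: 30% is held out first, then that held-out part is halved. The sketch below reproduces the arithmetic in pure Python on a toy label list; `stratified_split` is a simplified, hypothetical stand-in for `train_test_split(..., stratify=...)`, not its actual implementation.

```python
import random
from collections import defaultdict

def stratified_split(labels, held_fraction, seed=42):
    """Return (kept, held) index lists, holding out the same fraction of each class."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    kept, held = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_held = round(len(indices) * held_fraction)
        held.extend(indices[:n_held])
        kept.extend(indices[n_held:])
    return kept, held

labels = [0] * 500 + [1] * 500  # balanced toy dataset

# Step 1: keep 70% for training, hold out 30%.
train_idx, temp_idx = stratified_split(labels, 0.3)

# Step 2: split the held-out 30% in half, giving 15% validation and 15% test.
temp_labels = [labels[i] for i in temp_idx]
val_pos, test_pos = stratified_split(temp_labels, 0.5)
val_idx = [temp_idx[i] for i in val_pos]
test_idx = [temp_idx[i] for i in test_pos]

print(len(train_idx), len(val_idx), len(test_idx))  # 700 150 150
```

Stratifying both steps keeps the 50/50 label balance in every subset, which matches the equal class supports in the classification reports.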


In [9]:
# Text preprocessing function
def preprocess_text(text):
    """Basic text preprocessing"""
    # Convert to lowercase
    text = text.lower()
    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text

# Apply preprocessing
train_texts_clean = [preprocess_text(text) for text in train_texts]
val_texts_clean = [preprocess_text(text) for text in val_texts]
test_texts_clean = [preprocess_text(text) for text in test_texts]
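To see what the cleaning step does to a single string, the function can be run on a couple of made-up reviews. The function body is copied from the cell above so the snippet runs on its own; the sample sentences are illustrative.

```python
import re

def preprocess_text(text):
    """Lowercase, drop everything except letters and whitespace, collapse spaces."""
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = ' '.join(text.split())
    return text

sample = "This movie was GREAT!!! 10/10, would   watch again."
cleaned = preprocess_text(sample)
print(cleaned)  # this movie was great would watch again

# Apostrophes are removed too, so contractions are merged into one token.
contraction = preprocess_text("Don't miss it!")
print(contraction)  # dont miss it
```

Note that numeric ratings such as "10/10" disappear entirely, which may remove a useful sentiment signal in other datasets.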

### Building a model from scratch

In [10]:
# Tokenization and vectorization
max_features = 5000
max_len = 500

tokenizer = Tokenizer(num_words=max_features, oov_token="<OOV>")
tokenizer.fit_on_texts(train_texts_clean)

# Convert texts to sequences
train_sequences = tokenizer.texts_to_sequences(train_texts_clean)
val_sequences = tokenizer.texts_to_sequences(val_texts_clean)
test_sequences = tokenizer.texts_to_sequences(test_texts_clean)

# Pad sequences
train_data = pad_sequences(train_sequences, maxlen=max_len)
val_data = pad_sequences(val_sequences, maxlen=max_len)
test_data = pad_sequences(test_sequences, maxlen=max_len)

print(f"Training data shape: {train_data.shape}")

Training data shape: (35000, 500)
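The sketch below imitates, in plain Python, what the tokenization and padding steps produce. It is a simplified approximation rather than the Keras implementation: `fit_vocab`, `texts_to_ids`, and `pad_pre` are hypothetical helpers. It assumes the Keras defaults, where the OOV token gets id 1, only ids below `num_words` are kept, and `pad_sequences` pads and truncates at the front of the sequence.

```python
from collections import Counter

def fit_vocab(texts, num_words, oov_token="<OOV>"):
    """Assign ids by descending frequency; id 1 is reserved for unknown words."""
    counts = Counter(word for text in texts for word in text.split())
    vocab = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        if rank >= num_words:  # keep only ids strictly below num_words
            break
        vocab[word] = rank
    return vocab

def texts_to_ids(text, vocab):
    """Map each word to its id, falling back to the OOV id."""
    return [vocab.get(word, 1) for word in text.split()]

def pad_pre(seq, maxlen):
    """Keep the last maxlen ids and left-pad with zeros."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

texts = ["good movie", "bad movie", "good good plot"]
vocab = fit_vocab(texts, num_words=4)
print(vocab)  # {'<OOV>': 1, 'good': 2, 'movie': 3}

short = pad_pre(texts_to_ids("bad movie", vocab), maxlen=4)
print(short)  # [0, 0, 1, 3]

long = pad_pre(texts_to_ids("good good plot twist good movie", vocab), maxlen=4)
print(long)   # [1, 1, 2, 3]
```

With `max_len = 500`, reviews longer than 500 tokens lose their beginning, and shorter ones are left-padded with zeros.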


In [11]:
model_scratch = Sequential([
    Embedding(max_features, 64, input_length=max_len),  # Smaller embedding
    tf.keras.layers.GlobalAveragePooling1D(),  # Much faster than LSTM
    Dense(32, activation='relu'),
    Dropout(0.3),
    Dense(16, activation='relu'),
    Dropout(0.3),
    Dense(1, activation='sigmoid')
])

model_scratch.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

model_scratch.summary()
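As a rough check on the size of this architecture, the trainable parameter count can be worked out by hand from the layer sizes in the cell above. The snippet also shows what global average pooling does to a sequence of embedding vectors; `average_pool` is an illustrative helper, not a Keras function.

```python
vocab_size, embed_dim = 5000, 64  # max_features and the embedding width above

def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out

layer_params = {
    "embedding": vocab_size * embed_dim,   # one 64-d vector per vocabulary id
    "global_average_pooling": 0,           # no trainable weights
    "dense_32": dense_params(64, 32),
    "dropout_1": 0,
    "dense_16": dense_params(32, 16),
    "dropout_2": 0,
    "dense_1": dense_params(16, 1),
}
total_params = sum(layer_params.values())
print(total_params)  # 322625

def average_pool(vectors):
    """Average a list of equal-length vectors position by position."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

pooled = average_pool([[1.0, 2.0], [3.0, 4.0]])
print(pooled)  # [2.0, 3.0]
```

Almost all of the parameters sit in the embedding table. Pooling collapses the 500 token vectors into a single 64-dimensional vector, so word order is ignored.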


In [12]:
# Train the model
history_scratch = model_scratch.fit(
    train_data, train_labels,
    batch_size=32,
    epochs=3,  # Few epochs for demo
    validation_data=(val_data, val_labels),
    verbose=1
)

Epoch 1/3
1094/1094 ━━━━━━━━━━━━━━━━━━━━ 10s 5ms/step - accuracy: 0.5329 - loss: 0.6826 - val_accuracy: 0.8252 - val_loss: 0.4024
Epoch 2/3
1094/1094 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.8054 - loss: 0.4385 - val_accuracy: 0.8397 - val_loss: 0.3539
Epoch 3/3
1094/1094 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.8463 - loss: 0.3624 - val_accuracy: 0.8705 - val_loss: 0.3136


In [13]:
# Evaluate on test set
test_loss_scratch, test_acc_scratch = model_scratch.evaluate(test_data, test_labels, verbose=0)
test_predictions_scratch = (model_scratch.predict(test_data) > 0.5).astype(int).flatten()

print(f"\nFrom Scratch Model Results:")
print(f"Test Accuracy: {test_acc_scratch:.4f}")
print(f"Classification Report:")
print(classification_report(test_labels, test_predictions_scratch))

235/235 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step

From Scratch Model Results:
Test Accuracy: 0.8663
Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.94      0.88      3750
           1       0.93      0.79      0.86      3750

    accuracy                           0.87      7500
   macro avg       0.87      0.87      0.87      7500
weighted avg       0.87      0.87      0.87      7500
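The columns of the classification report follow the standard definitions of precision, recall, and F1. A minimal pure-Python sketch on made-up labels (not the notebook's predictions) shows how each per-class number is derived:

```python
def per_class_metrics(y_true, y_pred, cls):
    """Precision, recall, and F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: one positive review is misclassified as negative.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1]

for cls in (0, 1):
    p, r, f1 = per_class_metrics(y_true, y_pred, cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# class 0: precision=0.80 recall=1.00 f1=0.89
# class 1: precision=1.00 recall=0.75 f1=0.86
```

The toy example has the same shape as the report above: a model that leans toward predicting class 0 gets high recall for class 0 and high precision, but lower recall, for class 1.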



### Fine-tune a pre-trained model

In [14]:
# Select pre-trained model, download from HuggingFace via transformers library
model_name = "prajjwal1/bert-tiny"
tokenizer_bert = AutoTokenizer.from_pretrained(model_name)
model_bert = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2
)

max_length = 128

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [15]:
def tokenize_batch(texts, labels, tokenizer, max_length=128):
   """Tokenize a batch of texts"""
   encodings = tokenizer(
       texts,
       truncation=True,
       padding='max_length',
       max_length=max_length,
       return_tensors='pt'
   )
   return {
       'input_ids': encodings['input_ids'],
       'attention_mask': encodings['attention_mask'],
       'labels': torch.tensor(labels, dtype=torch.long)
   }
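What `truncation=True` and `padding='max_length'` produce can be sketched without the tokenizer itself. The helper below is a simplified illustration, not the Hugging Face implementation: `encode_for_bert` is hypothetical, the wordpiece ids are made up, and the special-token ids (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are assumed from the standard BERT uncased vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed special-token ids

def encode_for_bert(token_ids, max_length):
    """Add special tokens, truncate on the right, pad on the right, build the mask."""
    body = token_ids[: max_length - 2]  # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)  # 1 = real token
    n_pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad  # 0 = padding

# Short input: padded up to max_length.
ids, mask = encode_for_bert([7, 8, 9], max_length=8)
print(ids)   # [101, 7, 8, 9, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]

# Long input: everything past the first max_length - 2 tokens is dropped.
long_ids, long_mask = encode_for_bert(list(range(10, 20)), max_length=6)
print(long_ids)   # [101, 10, 11, 12, 13, 102]
print(long_mask)  # [1, 1, 1, 1, 1, 1]
```

With `max_length = 128`, the fine-tuned model sees only the beginning of longer reviews, whereas the from-scratch model above uses up to 500 tokens per review. That difference is worth keeping in mind when comparing the two accuracy numbers.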

In [16]:
# Use the same full dataset as the other approaches
print(f"Using full training set: {len(train_texts)} samples")

Using full training set: 35000 samples


In [17]:
# Convert to lists for easier handling
train_texts_list = train_texts.tolist()
train_labels_list = train_labels.tolist()
val_texts_list = val_texts.tolist()
val_labels_list = val_labels.tolist()
test_texts_list = test_texts.tolist()
test_labels_list = test_labels.tolist()

In [18]:
# Set up training parameters
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model_bert.to(device)
print(f"Using device: {device}")

Using device: cuda


In [19]:
# Optimizer and loss
optimizer = torch.optim.AdamW(model_bert.parameters(), lr=2e-5)
batch_size = 32
num_epochs = 1
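For intuition about what one optimizer step does at `lr=2e-5`, here is a scalar sketch of the AdamW update rule. It is a hand-written illustration, not PyTorch's code; the `betas`, `eps`, and `weight_decay` values are the `torch.optim.AdamW` defaults, which the cell above leaves unchanged.

```python
import math

def adamw_step(param, grad, m, v, t, lr=2e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter at step t (1-based)."""
    param = param * (1 - lr * weight_decay)    # decoupled weight decay
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

new_param, m, v = adamw_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(f"{new_param:.7f}")  # 0.9999798
```

On the first step the update is close to `lr * sign(grad)`, so each weight moves by about 2e-5 regardless of gradient size. Steps this small change the pre-trained weights only gradually.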

In [20]:
# Training loop
model_bert.train()
print("Starting fine-tuning...")

for epoch in range(num_epochs):
   total_loss = 0
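   # Note: floor division yields 1093, but the loop below runs 1094 batches (the last
   # one is partial), so the progress total and the average loss use a count one short.
   num_batches = len(train_texts_list) // batch_size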

   for i in range(0, len(train_texts_list), batch_size):
       # Get batch
       batch_texts = train_texts_list[i:i+batch_size]
       batch_labels = train_labels_list[i:i+batch_size]

       # Tokenize batch
       batch_data = tokenize_batch(batch_texts, batch_labels, tokenizer_bert, max_length)

       # Move to device
       input_ids = batch_data['input_ids'].to(device)
       attention_mask = batch_data['attention_mask'].to(device)
       labels = batch_data['labels'].to(device)

       # Forward pass
       outputs = model_bert(input_ids=input_ids,
                          attention_mask=attention_mask,
                          labels=labels)
       loss = outputs.loss

       # Backward pass
       optimizer.zero_grad()
       loss.backward()
       optimizer.step()

       total_loss += loss.item()

       # Print progress
       if (i // batch_size) % 50 == 0:
           print(f"Epoch {epoch+1}/{num_epochs}, Batch {i//batch_size}/{num_batches}, Loss: {loss.item():.4f}")

   avg_loss = total_loss / num_batches
   print(f"Epoch {epoch+1} completed. Average Loss: {avg_loss:.4f}")

Starting fine-tuning...


Epoch 1/1, Batch 0/1093, Loss: 0.6956
Epoch 1/1, Batch 50/1093, Loss: 0.6970
Epoch 1/1, Batch 100/1093, Loss: 0.6614
Epoch 1/1, Batch 150/1093, Loss: 0.6640
Epoch 1/1, Batch 200/1093, Loss: 0.6652
Epoch 1/1, Batch 250/1093, Loss: 0.5941
Epoch 1/1, Batch 300/1093, Loss: 0.6759
Epoch 1/1, Batch 350/1093, Loss: 0.5276
Epoch 1/1, Batch 400/1093, Loss: 0.5911
Epoch 1/1, Batch 450/1093, Loss: 0.5683
Epoch 1/1, Batch 500/1093, Loss: 0.6221
Epoch 1/1, Batch 550/1093, Loss: 0.3669
Epoch 1/1, Batch 600/1093, Loss: 0.3547
Epoch 1/1, Batch 650/1093, Loss: 0.4125
Epoch 1/1, Batch 700/1093, Loss: 0.4898
Epoch 1/1, Batch 750/1093, Loss: 0.4351
Epoch 1/1, Batch 800/1093, Loss: 0.4628
Epoch 1/1, Batch 850/1093, Loss: 0.4458
Epoch 1/1, Batch 900/1093, Loss: 0.3060
Epoch 1/1, Batch 950/1093, Loss: 0.4915
Epoch 1/1, Batch 1000/1093, Loss: 0.5987
Epoch 1/1, Batch 1050/1093, Loss: 0.6564
Epoch 1 completed. Average Loss: 0.5370
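The batch total shown in the log (1093) comes from floor division, while the loop actually iterates once more for the final partial batch. The arithmetic can be checked directly:

```python
import math

n_samples, batch_size = 35000, 32

batch_starts = list(range(0, n_samples, batch_size))  # same stepping as the training loop
floor_batches = n_samples // batch_size               # value used for num_batches
actual_batches = len(batch_starts)                    # iterations the loop really runs
last_batch_size = n_samples - batch_starts[-1]

print(floor_batches, actual_batches, last_batch_size)  # 1093 1094 24
print(actual_batches == math.ceil(n_samples / batch_size))  # True
```

Because the summed loss covers 1094 batches but is divided by 1093, the reported average is about 0.1% higher than the true per-batch mean. Using `math.ceil(len(train_texts_list) / batch_size)` for `num_batches` would make the two counts agree.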


In [21]:
# Evaluation on test set
print("Evaluating on test set...")
model_bert.eval()
test_predictions = []

with torch.no_grad():
   for i in range(0, len(test_texts_list), batch_size):
       batch_texts = test_texts_list[i:i+batch_size]
       batch_labels = test_labels_list[i:i+batch_size]

       # Tokenize batch
       batch_data = tokenize_batch(batch_texts, batch_labels, tokenizer_bert, max_length)

       # Move to device
       input_ids = batch_data['input_ids'].to(device)
       attention_mask = batch_data['attention_mask'].to(device)

       # Get predictions
       outputs = model_bert(input_ids=input_ids, attention_mask=attention_mask)
       predictions = torch.argmax(outputs.logits, dim=-1)
       test_predictions.extend(predictions.cpu().numpy())

# Calculate accuracy
test_acc_bert = accuracy_score(test_labels_list, test_predictions)

print(f"\nFine-tuned Model Results:")
print(f"Test Accuracy: {test_acc_bert:.4f}")
print(f"Classification Report:")
print(classification_report(test_labels_list, test_predictions))

Evaluating on test set...

Fine-tuned Model Results:
Test Accuracy: 0.8068
Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.78      0.80      3750
           1       0.79      0.84      0.81      3750

    accuracy                           0.81      7500
   macro avg       0.81      0.81      0.81      7500
weighted avg       0.81      0.81      0.81      7500
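The two models turn their outputs into labels differently. The from-scratch model emits one sigmoid probability and thresholds it at 0.5, while the BERT classifier emits two logits and takes the argmax. For two classes these rules agree, as this small sketch with made-up logits shows:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical [negative, positive] logits for three reviews.
batch_logits = [[-1.2, 0.3], [2.0, -0.5], [0.1, 0.4]]

preds = [argmax(logits) for logits in batch_logits]
probs_positive = [softmax(logits)[1] for logits in batch_logits]
thresholded = [int(p > 0.5) for p in probs_positive]

print(preds)        # [1, 0, 1]
print(thresholded)  # [1, 0, 1]
```

Applying softmax is only needed when calibrated probabilities are wanted. For hard labels, argmax on the raw logits gives the same answer.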

