This project was part of my Applied Machine Learning CSCI-B 565 assignment at Indiana University.

In [1]:
import pandas as pd
import gdown
import numpy
from sklearn.model_selection import train_test_split

I have created my own dataset containing at least 1000 words in total and at least two categories with at least 100 examples per category.

I scraped quotes from 'Star Wars' and the 'Friends' series using Selenium and BeautifulSoup. For classification, label 0 is Star Wars and label 1 is Friends.


In [None]:
# Download Star Wars quotes
star_wars_url = '1AvxQPVe9zwFpyA74gI2_nnvbNVQ0X5Ci'
star_wars_download_url = f"https://drive.google.com/uc?id={star_wars_url}"
gdown.download(star_wars_download_url, 'star_wars.txt', quiet=False)
star_wars = pd.read_fwf('star_wars.txt',header=None, names=['Quotes'],encoding='utf-8')

# Download Friends quotes
friends_url = '1oohK6Yplzh43d-qmu7LRbhF789CnKhJT'
friends_download_url = f"https://drive.google.com/uc?id={friends_url}"
gdown.download(friends_download_url, 'friends.txt', quiet=False)
# friends = pd.read_fwf('friends.txt', encoding='utf-8')

friends = pd.read_csv('friends.txt', delimiter='\t', header=None, names=['Quotes'])

# Add labels
star_wars['Label'] = 0
friends['Label'] = 1


Downloading...
From: https://drive.google.com/uc?id=1AvxQPVe9zwFpyA74gI2_nnvbNVQ0X5Ci
To: /geode2/home/u030/gchaudh/Carbonate/Desktop/hw4_q2/star_wars.txt
100%|██████████| 5.77k/5.77k [00:00<00:00, 3.88MB/s]
Downloading...
From: https://drive.google.com/uc?id=1oohK6Yplzh43d-qmu7LRbhF789CnKhJT
To: /geode2/home/u030/gchaudh/Carbonate/Desktop/hw4_q2/friends.txt
100%|██████████| 7.37k/7.37k [00:00<00:00, 13.7MB/s]


In [None]:
star_wars.shape, friends.shape

((100, 2), (100, 2))

In [None]:
quotes = pd.concat([star_wars, friends], ignore_index=True)
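
The dataset requirements stated earlier (at least 100 examples per category and at least 1000 words in total) can be checked directly on the combined DataFrame. The following is a minimal sketch, assuming a DataFrame with the same `Quotes` and `Label` columns; the `sample` rows are stand-ins (two quotes from this notebook plus two placeholder strings), not the scraped dataset.

```python
import pandas as pd

# Stand-in rows with the same columns as the notebook's `quotes` DataFrame.
# The label-1 rows are placeholders, not real scraped quotes.
sample = pd.DataFrame({
    "Quotes": [
        "Try not. Do or do not. There is no try.",
        "Your focus determines your reality.",
        "placeholder quote number one",
        "another placeholder quote",
    ],
    "Label": [0, 0, 1, 1],
})

# Examples per category (the assignment asks for at least 100 each).
per_class = sample["Label"].value_counts().sort_index()

# Total number of whitespace-separated words (the assignment asks for at least 1000).
total_words = int(sample["Quotes"].str.split().str.len().sum())

print(per_class.to_dict())
print(total_words)
```

Running the same two expressions on `quotes` instead of `sample` would report the counts for the real data.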

In [None]:
quotes.head()

   Quotes                                             Label
0  “Try not. Do or do not. There is no try.”              0
1  “Your eyes can deceive you; don’t trust them.”         0
2  “Luminous beings we are, not this crude matter.”       0
3  “Who’s the more foolish: the fool or the fool ...      0
4  “Your focus determines your reality.”                  0


We perform some basic preliminary data cleaning.

In [None]:
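# Removing quotation marks.
# Note: this only strips straight double quotes ("); the curly quotes (“ ”)
# visible in the head() output above are left in place.
quotes['Quotes'] = quotes['Quotes'].str.replace('"', '')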
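
The quotes shown in the `head()` output are wrapped in curly quotation marks, which a replace on the straight `"` character does not match. Below is a hedged sketch of one way to strip both kinds; the `QUOTE_CHARS` pattern and the small `df` are assumptions for illustration, and this is not what the recorded runs in this notebook used.

```python
import pandas as pd

# Two quotes from the head() output, including their curly quotation marks.
df = pd.DataFrame({"Quotes": [
    "“Try not. Do or do not. There is no try.”",
    "“Your focus determines your reality.”",
]})

# Character class covering straight (") and curly (“ ”) double quotes.
QUOTE_CHARS = r'["“”]'

# regex=True makes pandas treat the pattern as a regular expression.
df["Quotes"] = df["Quotes"].str.replace(QUOTE_CHARS, "", regex=True).str.strip()

print(df["Quotes"].tolist())
```

Apostrophes such as the one in "don’t" are a different character and are left untouched by this pattern.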

Split the dataset into training (at least 160 examples) and test (at least 40 examples) sets.

In [None]:
# Stratified 80/20 split: 160 training and 40 test examples, with both labels equally represented
train, test = train_test_split(quotes, test_size=0.2, random_state=42, stratify=quotes['Label'])

# # Ensure that both sets have at least the specified number of examples
# while len(train) < 160 or len(test) < 40:
#     quotes = quotes.sample(frac=1, random_state=42).reset_index(drop=True)
#     train, test = train_test_split(quotes, test_size=0.2, random_state=42)

print("Training set size:", len(train))
print("Test set size:", len(test))


Training set size: 160
Test set size: 40
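
Because the split is stratified on `Label`, each set should keep the 50/50 class balance of the full dataset. A small sketch of how this could be verified, assuming stand-in data of the same shape (100 examples per label; the text strings are placeholders):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Stand-in data shaped like the notebook's dataset: 100 examples per label.
texts = [f"quote {i}" for i in range(200)]
labels = [0] * 100 + [1] * 100

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Count how many examples of each label ended up in each set.
train_counts = Counter(y_tr)
test_counts = Counter(y_te)

print(len(X_tr), dict(train_counts))
print(len(X_te), dict(test_counts))
```

With stratification the test set holds 20 examples of each class, so no class is under-represented when accuracy is measured.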


In [None]:
X_train = train.drop(columns='Label')
y_train = train['Label']
X_test = test.drop(columns='Label')
y_test = test['Label']

Fine-tune a pretrained language model capable of generating text (e.g., GPT), taken from the Hugging Face Transformers library, on the dataset you created (this tutorial could be very helpful: https://huggingface.co/docs/transformers/training).

In [None]:
# Only the names actually used below are imported
import numpy as np
from transformers import BertTokenizer, TFBertForSequenceClassification


Tokenize the data, applying padding and truncation.

In [None]:
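# Note: this loads the cased vocabulary, while the model below is loaded from
# 'bert-base-uncased'. Tokenizer and model checkpoints should normally match.
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")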


In [None]:
X_train_tokenized = tokenizer(X_train['Quotes'].tolist(),return_tensors='np',padding=True, truncation=True,max_length=512, return_attention_mask=True)
X_test_tokenized = tokenizer(X_test['Quotes'].tolist(),return_tensors='np',padding=True, truncation=True,max_length=512, return_attention_mask=True)
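
With `padding=True` the tokenizer pads every sequence to the longest one in the batch, and with `truncation=True` it cuts sequences longer than `max_length`; the attention mask marks which positions are real tokens. The toy function below illustrates that idea on already-tokenized input. It is a simplified sketch, not the BERT tokenizer: `pad_and_truncate` and the token ids are hypothetical, and special tokens such as `[CLS]` and `[SEP]` are ignored.

```python
def pad_and_truncate(token_lists, max_length, pad_id=0):
    """Toy illustration of padding and truncation on lists of token ids."""
    # Truncate every sequence to at most max_length tokens.
    truncated = [tokens[:max_length] for tokens in token_lists]
    # Pad to the longest sequence remaining in the batch (not to max_length).
    width = max(len(tokens) for tokens in truncated)
    input_ids = [t + [pad_id] * (width - len(t)) for t in truncated]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(t) + [0] * (width - len(t)) for t in truncated]
    return input_ids, attention_mask


# Hypothetical token ids for three sentences of different lengths.
batch = [[11, 12, 13], [21, 22, 23, 24, 25, 26], [31]]
ids, mask = pad_and_truncate(batch, max_length=4)

print(ids)
print(mask)
```

Since the training and test sets are tokenized in separate calls, each is padded to its own longest quote, so the two arrays may have different widths.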

In [None]:
y_train_array = np.array(y_train)
y_test_array = np.array(y_test)

In [None]:
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)


2023-12-11 18:08:34.500020: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory

In [None]:
# Creating tensorflow datasets
import tensorflow as tf
train_dataset = tf.data.Dataset.from_tensor_slices((dict(X_train_tokenized), y_train_array)).batch(64)
test_dataset = tf.data.Dataset.from_tensor_slices((dict(X_test_tokenized), y_test_array)).batch(64)

In [None]:
# Compile the model (training happens in the next cell)
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
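
To make the loss and metric concrete: with `from_logits=True` the loss applies a softmax to the model's raw scores and averages the negative log-probability of the true class, and the accuracy metric compares the arg-max class with the label. The NumPy sketch below reimplements both for illustration; the function names and the example logits are assumptions, not Keras code.

```python
import numpy as np


def sparse_categorical_crossentropy_from_logits(logits, labels):
    """Mean cross-entropy for integer labels and unnormalised scores."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    # Log-softmax, shifted by the row maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log-probability of the true class, averaged over examples.
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def accuracy_from_logits(logits, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    return float((np.argmax(logits, axis=1) == np.asarray(labels)).mean())


# Made-up logits for three examples and two classes.
example_logits = [[0.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
example_labels = [0, 0, 0]

loss_value = sparse_categorical_crossentropy_from_logits(example_logits, example_labels)
acc_value = accuracy_from_logits(example_logits, example_labels)
print(loss_value, acc_value)
```

A model that scores both classes equally has a loss of ln 2 ≈ 0.693, which is a useful reference point when reading the test losses reported in this notebook.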

In [None]:
model.fit(train_dataset, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f2105cab460>

In [None]:
test_loss, test_accuracy = model.evaluate(test_dataset)

print(f"Test Loss: {test_loss}")
print(f"Test Accuracy: {test_accuracy}")

Test Loss: 0.6027485132217407
Test Accuracy: 0.699999988079071


Next, we check whether changing the learning rate, batch size, and number of epochs gives a higher test accuracy.
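
Batch size and epochs together determine how many gradient updates the model receives, which is worth keeping in mind when comparing runs. A short worked calculation for the four (epochs, batch size) combinations used in this notebook, assuming the 160-example training set and that the last partial batch is kept (the default for `tf.data` batching):

```python
import math

N_TRAIN = 160  # training set size reported above


def gradient_updates(epochs, batch_size, n_examples=N_TRAIN):
    """Number of optimizer steps: batches per epoch times epochs."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * epochs


for epochs, batch_size in [(3, 64), (3, 32), (5, 64), (5, 32)]:
    print(epochs, batch_size, gradient_updates(epochs, batch_size))
```

Halving the batch size from 64 to 32 raises the steps per epoch from 3 to 5, so the smaller-batch runs apply noticeably more updates for the same number of epochs.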

In [None]:
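# Creating a function to test different parameters.
# Note: it reuses the global `model`, so each call continues training from the
# weights left by the previous run instead of starting from the pretrained checkpoint.
def fine_tune_bert(epochs, learning_rate, batch_size=64):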

    train_dataset = tf.data.Dataset.from_tensor_slices((dict(X_train_tokenized), y_train_array)).batch(batch_size)
    test_dataset = tf.data.Dataset.from_tensor_slices((dict(X_test_tokenized), y_test_array)).batch(batch_size)

    optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

    model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

    model.fit(train_dataset, epochs=epochs)

    test_loss, test_accuracy = model.evaluate(test_dataset)
    print(f"Test Loss: {test_loss}")
    print(f"Test Accuracy: {test_accuracy}")
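
Because `fine_tune_bert` compiles and fits the global `model`, successive calls keep training the same weights. One way to make runs comparable is to build a fresh model for every configuration. The sketch below shows that pattern with stand-ins so it runs without downloading BERT; `run_configs`, `CountingModel`, and `fake_train_and_evaluate` are hypothetical names introduced for illustration.

```python
def run_configs(build_model, train_and_evaluate, configs):
    """Train one freshly built model per configuration and collect results."""
    results = []
    for config in configs:
        model = build_model()  # new model each time, nothing carried over
        metrics = train_and_evaluate(model, **config)
        results.append({**config, **metrics})
    return results


# --- Stand-ins so the sketch runs without a real model (hypothetical) ---
class CountingModel:
    """Fake model that only remembers how many epochs it has been trained."""

    def __init__(self):
        self.epochs_seen = 0


def fake_train_and_evaluate(model, epochs, learning_rate, batch_size):
    # A real version would compile, fit, and evaluate the model here.
    model.epochs_seen += epochs
    return {"epochs_seen": model.epochs_seen}


configs = [
    {"epochs": 3, "learning_rate": 2e-5, "batch_size": 32},
    {"epochs": 5, "learning_rate": 1e-5, "batch_size": 64},
    {"epochs": 5, "learning_rate": 1e-5, "batch_size": 32},
]
results = run_configs(CountingModel, fake_train_and_evaluate, configs)
print([r["epochs_seen"] for r in results])
```

In this notebook, `build_model` could be a small function returning `TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)`, so that every configuration starts from the same pretrained weights.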


In [None]:
fine_tune_bert(epochs=3, learning_rate=2e-5, batch_size=32)


Epoch 1/3
Epoch 2/3
Epoch 3/3
Test Loss: 0.5370607376098633
Test Accuracy: 0.7749999761581421


In [None]:
fine_tune_bert(epochs=5, learning_rate=1e-5, batch_size=64)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test Loss: 0.8103048205375671
Test Accuracy: 0.675000011920929


In [None]:
fine_tune_bert(epochs=5, learning_rate=1e-5, batch_size=32)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test Loss: 0.49499520659446716
Test Accuracy: 0.800000011920929


## Conclusion
The highest accuracy we got on the test dataset was 80%, with 5 epochs, a learning rate of 1e-5, and a batch size of 32. Note that the runs were not independent: each call to `fine_tune_bert` continued training the same model.
A few ways to improve the model, and hence its accuracy, are:
- Increase the number of samples in the dataset. 100 samples per class is too few for the model to learn a good representation of both classes.
- Try other fine-tuning techniques, or experiment with different pretrained models such as DistilBERT or GPT-3, as well as larger pretrained models such as BERT-large.
- Experiment further with the hyperparameters.
- Add more layers and use regularization and dropout in the experiments.
- Preprocess the text, for example by removing stopwords and punctuation, to see whether the accuracy improves.
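
To put the reported accuracies in perspective, it helps to convert them back to counts on the 40-example test set and to estimate their uncertainty. The sketch below uses the four test accuracies printed in this notebook; the binomial standard error is a rough approximation, added here as an assumption about how one might quantify the noise.

```python
import math

N_TEST = 40  # test set size reported in the notebook

# Test accuracies printed by the four runs, keyed by (epochs, learning rate, batch size).
reported = {
    (3, 2e-5, 64): 0.699999988079071,
    (3, 2e-5, 32): 0.7749999761581421,
    (5, 1e-5, 64): 0.675000011920929,
    (5, 1e-5, 32): 0.800000011920929,
}

# Number of correctly classified test examples behind each accuracy.
correct_counts = {cfg: round(acc * N_TEST) for cfg, acc in reported.items()}

# Binomial standard error of an accuracy estimated from N_TEST examples.
std_errors = {cfg: math.sqrt(acc * (1 - acc) / N_TEST) for cfg, acc in reported.items()}

for cfg in reported:
    print(cfg, correct_counts[cfg], round(std_errors[cfg], 3))
```

The best and worst runs differ by only five test examples, and each accuracy carries a standard error of roughly 0.06 to 0.07, which supports the point that a larger dataset is needed before drawing firm conclusions about hyperparameters.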