The following code explores using fastText for sentiment analysis and how different hyperparameters affect the accuracy of the trained models. <br>
The datasets in this example are from https://www.kaggle.com/datasets/bittlingmayer/amazonreviews. <br>
See https://fasttext.cc/docs/en/supervised-tutorial.html for more information on installing and using fastText. <br>

The first section can easily be run in Colab.

In [None]:
# if running in Colab, uncomment the next line
# !pip install fasttext

In [1]:
import fasttext

In [None]:
# For Colab, download/upload datasets (and unzip if needed)
# if running locally, copy datasets to project folder
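
The dataset files follow fastText's supervised format: each line carries one or more `__label__`-prefixed labels plus the review text. The sketch below shows one way to inspect such a line. `split_labeled_line` is a hypothetical helper, the sample line is made up for illustration, and the helper assumes the labels sit at the start of the line.

```python
def split_labeled_line(line, label_prefix="__label__"):
    """Split one fastText-format line into (labels, text).

    Assumes the labels are the leading tokens of the line.
    """
    tokens = line.strip().split()
    labels = []
    idx = 0
    # Collect leading tokens that carry the label prefix.
    while idx < len(tokens) and tokens[idx].startswith(label_prefix):
        labels.append(tokens[idx])
        idx += 1
    return labels, " ".join(tokens[idx:])


# Made-up example line, not taken from the dataset.
sample = "__label__2 Great product: works exactly as described"
labels, text = split_labeled_line(sample)
print(labels, text)
```

Running this over the first few lines of `train.ft.txt` is a quick way to confirm that the file was downloaded and unzipped correctly.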

In [2]:
# train model with default hyperparameters
model = fasttext.train_supervised(input="train.ft.txt")

In [3]:
# output is: (number of reviews in test set, precision at k, recall at k)
model.test("test.ft.txt", k=1)

(400000, 0.9160625, 0.9160625)
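
Precision and recall are identical in the output above because every review has exactly one label and `k=1`, so the number of predictions equals the number of true labels. The sketch below illustrates this with toy data, assuming the usual definitions (hits divided by predictions made, and hits divided by true labels). `precision_recall_at_k` is a hypothetical helper, not part of fastText.

```python
def precision_recall_at_k(gold, predicted):
    """gold, predicted: lists of label lists, one entry per example."""
    hits = sum(len(set(g) & set(p)) for g, p in zip(gold, predicted))
    n_predicted = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    return hits / n_predicted, hits / n_gold


# Toy data: four single-label examples, one prediction each (k=1).
gold = [["__label__1"], ["__label__2"], ["__label__2"], ["__label__1"]]
pred = [["__label__1"], ["__label__2"], ["__label__1"], ["__label__1"]]
precision, recall = precision_recall_at_k(gold, pred)
print(precision, recall)  # 0.75 0.75
```

With `k=2` on a single-label dataset, the two numbers diverge: each example yields two predictions but still has only one true label.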

In [4]:
# per-label breakdown: precision, recall, and F1 score for each class
model.test_label("test.ft.txt", k=1)

{'__label__2': {'precision': 0.9158732776586653,
  'recall': nan,
  'f1score': 1.8317465553173307},
 '__label__1': {'precision': 0.9162518946120485,
  'recall': nan,
  'f1score': 1.832503789224097}}
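
A recall of `nan` and an F1 score above 1 cannot be valid (F1 is bounded by 1), so it may be worth recomputing the per-label metrics directly from predictions. The sketch below assumes one gold label and one predicted label per example. `per_label_metrics` is a hypothetical helper, and the toy lists stand in for labels read from `test.ft.txt` and the output of `model.predict`.

```python
from collections import Counter


def per_label_metrics(gold, predicted):
    """Compute precision, recall, and F1 per label for single-label data."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    metrics = {}
    for label in sorted(set(gold) | set(predicted)):
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        precision = tp[label] / n_pred if n_pred else 0.0
        recall = tp[label] / n_gold if n_gold else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1score": f1}
    return metrics


# Toy data standing in for gold labels and model predictions.
gold = ["__label__1", "__label__1", "__label__2", "__label__2"]
pred = ["__label__1", "__label__2", "__label__2", "__label__2"]
metrics = per_label_metrics(gold, pred)
for label, values in metrics.items():
    print(label, values)
```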

In [5]:
# test positive review (__label__2)
model.predict("This product is amazing", k=1)

(('__label__2',), array([1.00000703]))

In [6]:
# test negative review (__label__1)
model.predict("This product is terrible", k=1)

(('__label__1',), array([1.00001001]))

In [7]:
# test a mixed-sentiment review; the expected label is negative (__label__1)
model.predict("I wish I could say this product was great")

(('__label__2',), array([0.99511999]))

In [8]:
# the default model misclassifies the mixed-sentiment review; retrain with word bigrams
model2 = fasttext.train_supervised(input="train.ft.txt", wordNgrams=2)
model2.test("test.ft.txt", k=1)

(400000, 0.9372, 0.9372)

In [9]:
# test a mixed-sentiment review; the expected label is negative (__label__1)
model2.predict("I wish I could say this product was great")

(('__label__2',), array([0.59926981]))

In [10]:
# lower the learning rate from the default (0.1) to 0.05 and see how accuracy changes
model2_1 = fasttext.train_supervised(input="train.ft.txt", lr=0.05, wordNgrams=2)
model2_1.test("test.ft.txt", k=1)

(400000, 0.938095, 0.938095)

In [11]:
# test a mixed-sentiment review; the expected label is negative (__label__1)
model2_1.predict("I wish I could say this product was great")

(('__label__2',), array([0.51977015]))

In [12]:
# a slight improvement; now increase the number of epochs from the default (5) to 10
model2_2 = fasttext.train_supervised(input="train.ft.txt", epoch=10, lr=0.05, wordNgrams=2)
model2_2.test("test.ft.txt", k=1)

(400000, 0.9338625, 0.9338625)

In [13]:
# test a mixed-sentiment review; the expected label is negative (__label__1)
model2_2.predict("I wish I could say this product was great")

(('__label__2',), array([0.8499288]))

In [14]:
# accuracy dropped slightly; go back to 5 epochs and try word n-grams up to length 4
model3 = fasttext.train_supervised(input="train.ft.txt", epoch=5, lr=0.05, wordNgrams=4)
model3.test("test.ft.txt", k=1)

(400000, 0.9327925, 0.9327925)

In [15]:
# test a mixed-sentiment review; the expected label is negative (__label__1)
model3.predict("I wish I could say this product was great")

(('__label__1',), array([0.99916196]))
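
The cells above run the same probe sentence through each model one at a time. The sketch below gathers such checks into a single loop. `compare_on_probe` is a hypothetical helper that assumes each model's `predict` returns a `(labels, probabilities)` pair as in the outputs above. `StubModel` and its values are placeholders so the example runs without the dataset; in the notebook, the dictionary would hold `model`, `model2`, `model3`, and so on.

```python
def compare_on_probe(models, text, expected_label):
    """Return (name, predicted label, probability, is_correct) per model."""
    rows = []
    for name, model in models.items():
        labels, probs = model.predict(text)
        rows.append((name, labels[0], float(probs[0]), labels[0] == expected_label))
    return rows


class StubModel:
    """Placeholder for a trained fastText model (illustration only)."""

    def __init__(self, label, prob):
        self.label, self.prob = label, prob

    def predict(self, text, k=1):
        return (self.label,), [self.prob]


# Placeholder models and values, not real training results.
models = {
    "bigrams": StubModel("__label__2", 0.6),
    "4-grams": StubModel("__label__1", 0.99),
}
rows = compare_on_probe(models, "I wish I could say this product was great", "__label__1")
for name, label, prob, ok in rows:
    print(f"{name:10s} {label} {prob:.3f} {'correct' if ok else 'wrong'}")
```

A single probe sentence is only anecdotal evidence; a small list of hand-labeled tricky reviews would give a steadier signal than one example.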

Tuning the hyperparameters manually is quite time-consuming. Thankfully, fastText comes with autotuning. <br>
See https://fasttext.cc/docs/en/autotune.html for more info on automatic hyperparameter optimization. <br>
It is recommended to run the following code in a local environment to avoid Colab timeouts and memory usage errors.<br>
Below, autotune runs for 1, 2, and 3 hours, and we inspect the hyperparameters fastText selects in each run.

In [16]:
# 1 hour optimization
model4 = fasttext.train_supervised(input='train.ft.txt', autotuneValidationFile='test.ft.txt', autotuneDuration=3600)
model4.test("test.ft.txt", k=1)

(400000, 0.94062, 0.94062)
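
The run above passes `test.ft.txt` both as the autotune validation file and as the evaluation file, so the reported score may be somewhat optimistic. One option is to hold out part of the training data for autotuning and keep the test file for the final check. The sketch below shows such a split. `split_train_valid`, the default fraction, and the file names are all hypothetical, and the demo writes a small temporary file so it runs without the dataset.

```python
import os
import random
import tempfile


def split_train_valid(src_path, train_path, valid_path, valid_fraction=0.1, seed=0):
    """Randomly route each line of src_path to a train or validation file."""
    rng = random.Random(seed)
    n_train = n_valid = 0
    with open(src_path, encoding="utf-8") as src, \
         open(train_path, "w", encoding="utf-8") as train, \
         open(valid_path, "w", encoding="utf-8") as valid:
        for line in src:
            if rng.random() < valid_fraction:
                valid.write(line)
                n_valid += 1
            else:
                train.write(line)
                n_train += 1
    return n_train, n_valid


# Demo on a small made-up file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "all.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.writelines(f"__label__{1 + i % 2} review {i}\n" for i in range(1000))
    counts = split_train_valid(
        src, os.path.join(tmp, "train.txt"), os.path.join(tmp, "valid.txt")
    )
print(counts)
```

The validation file produced this way could then be passed as `autotuneValidationFile`, with `model.test("test.ft.txt")` reserved for the final measurement.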

In [19]:
# this way of reading the tuned hyperparameters comes from https://github.com/facebookresearch/fastText/issues/887
args_obj = model4.f.getArgs()
for hparam in dir(args_obj):
    if not hparam.startswith('__'):
        print(f"{hparam} -> {getattr(args_obj, hparam)}")

autotuneDuration -> 3600
autotuneMetric -> f1
autotuneModelSize -> 
autotunePredictions -> 1
autotuneValidationFile -> test.ft.txt
bucket -> 10000000
cutoff -> 0
dim -> 52
dsub -> 2
epoch -> 30
input -> train.ft.txt
label -> __label__
loss -> loss_name.softmax
lr -> 0.04519001700813223
lrUpdateRate -> 100
maxn -> 0
minCount -> 1
minCountLabel -> 0
minn -> 0
model -> model_name.supervised
neg -> 5
output -> 
pretrainedVectors -> 
qnorm -> False
qout -> False
retrain -> False
saveOutput -> False
seed -> 0
setManual -> <bound method PyCapsule.setManual of <fasttext_pybind.args object at 0x000001751783AB70>>
t -> 0.0001
thread -> 15
verbose -> 2
wordNgrams -> 4
ws -> 5
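
The listing above also prints `setManual`, which is a bound method, not a hyperparameter. The sketch below collects only the non-callable attributes into a dictionary, which makes runs easier to compare later. `args_to_dict` is a hypothetical helper, and `StubArgs` is a placeholder for the object returned by `model4.f.getArgs()` so the example runs on its own.

```python
def args_to_dict(args_obj):
    """Collect public, non-callable attributes of an args object."""
    params = {}
    for name in dir(args_obj):
        if name.startswith("__"):
            continue
        value = getattr(args_obj, name)
        if callable(value):
            continue  # skip methods such as setManual
        params[name] = value
    return params


class StubArgs:
    """Placeholder for the fastText args object (illustration only)."""

    dim = 52
    epoch = 30
    wordNgrams = 4

    def setManual(self, name):
        pass


params = args_to_dict(StubArgs())
print(params)
```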


In [17]:
# 2 hour optimization
model5 = fasttext.train_supervised(input='train.ft.txt', autotuneValidationFile='test.ft.txt', autotuneDuration=7200)
model5.test("test.ft.txt", k=1)

(400000, 0.9430025, 0.9430025)

In [20]:
args_obj = model5.f.getArgs()
for hparam in dir(args_obj):
    if not hparam.startswith('__'):
        print(f"{hparam} -> {getattr(args_obj, hparam)}")

autotuneDuration -> 7200
autotuneMetric -> f1
autotuneModelSize -> 
autotunePredictions -> 1
autotuneValidationFile -> test.ft.txt
bucket -> 10000000
cutoff -> 0
dim -> 65
dsub -> 8
epoch -> 15
input -> train.ft.txt
label -> __label__
loss -> loss_name.softmax
lr -> 0.05246053864839083
lrUpdateRate -> 100
maxn -> 6
minCount -> 1
minCountLabel -> 0
minn -> 3
model -> model_name.supervised
neg -> 5
output -> 
pretrainedVectors -> 
qnorm -> False
qout -> False
retrain -> False
saveOutput -> False
seed -> 0
setManual -> <bound method PyCapsule.setManual of <fasttext_pybind.args object at 0x0000017536C90730>>
t -> 0.0001
thread -> 15
verbose -> 2
wordNgrams -> 3
ws -> 5


In [18]:
# 3 hour optimization
model6 = fasttext.train_supervised(input='train.ft.txt', autotuneValidationFile='test.ft.txt', autotuneDuration=10800)
model6.test("test.ft.txt", k=1)

(400000, 0.9421525, 0.9421525)

In [21]:
args_obj = model6.f.getArgs()
for hparam in dir(args_obj):
    if not hparam.startswith('__'):
        print(f"{hparam} -> {getattr(args_obj, hparam)}")

autotuneDuration -> 10800
autotuneMetric -> f1
autotuneModelSize -> 
autotunePredictions -> 1
autotuneValidationFile -> test.ft.txt
bucket -> 10000000
cutoff -> 0
dim -> 31
dsub -> 2
epoch -> 4
input -> train.ft.txt
label -> __label__
loss -> loss_name.softmax
lr -> 0.03466143057439238
lrUpdateRate -> 100
maxn -> 0
minCount -> 1
minCountLabel -> 0
minn -> 0
model -> model_name.supervised
neg -> 5
output -> 
pretrainedVectors -> 
qnorm -> False
qout -> False
retrain -> False
saveOutput -> False
seed -> 0
setManual -> <bound method PyCapsule.setManual of <fasttext_pybind.args object at 0x0000017536C90A70>>
t -> 0.0001
thread -> 15
verbose -> 2
wordNgrams -> 3
ws -> 5
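
The three autotune runs can be compared side by side. The sketch below uses values copied from the outputs above, with learning rates rounded; the `accuracy` field is the precision at 1 reported by `model.test`.

```python
# Values copied from the autotune outputs above (lr rounded).
runs = [
    {"hours": 1, "accuracy": 0.94062, "dim": 52, "epoch": 30,
     "lr": 0.04519, "wordNgrams": 4, "minn": 0, "maxn": 0},
    {"hours": 2, "accuracy": 0.9430025, "dim": 65, "epoch": 15,
     "lr": 0.05246, "wordNgrams": 3, "minn": 3, "maxn": 6},
    {"hours": 3, "accuracy": 0.9421525, "dim": 31, "epoch": 4,
     "lr": 0.03466, "wordNgrams": 3, "minn": 0, "maxn": 0},
]

columns = ["hours", "accuracy", "dim", "epoch", "lr", "wordNgrams", "minn", "maxn"]
print("  ".join(f"{c:>10}" for c in columns))
for run in runs:
    print("  ".join(f"{run[c]:>10}" for c in columns))

# Pick the run with the highest reported score.
best = max(runs, key=lambda r: r["accuracy"])
print(f"best run: {best['hours']} hour(s), accuracy {best['accuracy']:.4f}")
```

In these results, the 2-hour run scores highest and the 3-hour run is slightly lower, so a longer search budget did not translate into a better model here. The selected hyperparameters also differ considerably between runs while the scores stay within a fraction of a percentage point of each other.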
