### Lab 7.1: Bag of Words Model

In this lab you will use the bag of words model to learn author attribution with a [dataset of texts from Victorian authors](https://github.com/agungor2/Authorship_Attribution?tab=readme-ov-file).

In [37]:
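import numpy as np
import pandas as pd
import sklearn.feature_extraction.text
import sklearn.metrics
import sklearn.model_selection
import sklearn.neural_network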

Here we download the CSV file containing the text snippets and author IDs.

In [38]:
!wget --no-clobber -O Gungor_2018_VictorianAuthorAttribution_data-train.csv -q "https://www.dropbox.com/scl/fi/emk9db05t9u8yzgrjje7t/Gungor_2018_VictorianAuthorAttribution_data-train.csv?rlkey=kzvbl0mbpnrpjr4c3q18le6w2&dl=1"

In [39]:
df = pd.read_csv('Gungor_2018_VictorianAuthorAttribution_data-train.csv', encoding = "ISO-8859-1")
df.head()

Unnamed: 0,text,author
0,ou have time to listen i will give you the ent...,1
1,wish for solitude he was twenty years of age a...,1
2,and the skirt blew in perfect freedom about th...,1
3,of san and the rows of shops opposite impresse...,1
4,an hour s walk was as tiresome as three in a s...,1


In [40]:
text = list(df['text'])
labels = df['author'].values

In [41]:
text[0]

'ou have time to listen i will give you the entire story he said it may form the basis of a future novel and prove quite as interesting as one of your own invention i had the time to listen of course one has time for anything and everything agreeable in the best place to hear the tale was in a victoria and with my good on the box with the coachman we set out at once on a drive to the as the recital was only half through when we reached the house we postponed the remainder while we stopped there for an excellent lunch on the way back to my friend continued and finished the story it was indeed quite suitable for use and i told my friend with thanks that i should at once put it in shape for my readers i said i should make a few alterations in it for the sake of dramatic interest but in the main would follow the lines he had given me it would spoil my romance were i to answer on this page the question that must be uppermost in the reader s mind i have already revealed almost too much of th

### Exercises

1. Use the `CountVectorizer` from `sklearn.feature_extraction.text` to produce a term frequency vector for each text.  Set `max_features=1000` to only use the top 1000 terms.

Prepare a 90/10 train-test split with `random_state=42`.

Train the default `MLPClassifier` from `sklearn.neural_network` on the data and report the train and test accuracy.  You can pass `verbose=True` to `MLPClassifier` to monitor training.

In [42]:
# Split the dataset

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(text, labels, test_size=0.1, random_state=42)


In [43]:
# Vectorize using CountVectorizer (Bag of Words model)
vectorizer = sklearn.feature_extraction.text.CountVectorizer(max_features=1000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train an MLP Classifier
clf = sklearn.neural_network.MLPClassifier(verbose=True, random_state=42)
clf.fit(X_train_vec, y_train)



Iteration 1, loss = 1.68576527
Iteration 2, loss = 0.46098605
Iteration 3, loss = 0.27107289
Iteration 4, loss = 0.18910646
Iteration 5, loss = 0.13841788
Iteration 6, loss = 0.10578279
Iteration 7, loss = 0.07876808
Iteration 8, loss = 0.06420576
Iteration 9, loss = 0.04930479
Iteration 10, loss = 0.03712261
Iteration 11, loss = 0.02948432
Iteration 12, loss = 0.02314298
Iteration 13, loss = 0.01946625
Iteration 14, loss = 0.01451554
Iteration 15, loss = 0.01279456
Iteration 16, loss = 0.01085259
Iteration 17, loss = 0.00891533
Iteration 18, loss = 0.00816644
Iteration 19, loss = 0.00875316
Iteration 20, loss = 0.00942821
Iteration 21, loss = 0.00914984
Iteration 22, loss = 0.01910544
Iteration 23, loss = 0.01705638
Iteration 24, loss = 0.00852662
Iteration 25, loss = 0.00459065
Iteration 26, loss = 0.00467162
Iteration 27, loss = 0.00378190
Iteration 28, loss = 0.00523108
Iteration 29, loss = 0.00383228
Iteration 30, loss = 0.00425710
Iteration 31, loss = 0.00427542
Iteration 32, los

In [44]:
# Evaluate the classifier
y_train_pred = clf.predict(X_train_vec)
y_test_pred = clf.predict(X_test_vec)

train_accuracy = sklearn.metrics.accuracy_score(y_train, y_train_pred)
test_accuracy = sklearn.metrics.accuracy_score(y_test, y_test_pred)

print("CountVectorizer")
print(f"Train Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")

CountVectorizer
Train Accuracy: 0.9998
Test Accuracy: 0.9428



2. Repeat the steps but using `TfidfVectorizer` to produce term frequency - inverse document frequency vectors.

Does the IDF weighting improve the results?

In [45]:
tfidf_vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(max_features=1000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Train an MLP Classifier
clf_tfidf = sklearn.neural_network.MLPClassifier(verbose=True, random_state=42)
clf_tfidf.fit(X_train_tfidf, y_train)


Iteration 1, loss = 3.04019973
Iteration 2, loss = 2.00485404
Iteration 3, loss = 1.38298370
Iteration 4, loss = 0.99747716
Iteration 5, loss = 0.76315327
Iteration 6, loss = 0.61275993
Iteration 7, loss = 0.51019012
Iteration 8, loss = 0.43480632
Iteration 9, loss = 0.37814895
Iteration 10, loss = 0.33316464
Iteration 11, loss = 0.29763367
Iteration 12, loss = 0.26810621
Iteration 13, loss = 0.24269852
Iteration 14, loss = 0.22143185
Iteration 15, loss = 0.20305507
Iteration 16, loss = 0.18726756
Iteration 17, loss = 0.17315653
Iteration 18, loss = 0.16047429
Iteration 19, loss = 0.14966460
Iteration 20, loss = 0.13929915
Iteration 21, loss = 0.13049441
Iteration 22, loss = 0.12224010
Iteration 23, loss = 0.11459937
Iteration 24, loss = 0.10786202
Iteration 25, loss = 0.10159498
Iteration 26, loss = 0.09596896
Iteration 27, loss = 0.09048200
Iteration 28, loss = 0.08575272
Iteration 29, loss = 0.08087992
Iteration 30, loss = 0.07706993
Iteration 31, loss = 0.07280369
Iteration 32, los

In [46]:
y_train_pred_tfidf = clf_tfidf.predict(X_train_tfidf)
y_test_pred_tfidf = clf_tfidf.predict(X_test_tfidf)


train_accuracy_tfidf = sklearn.metrics.accuracy_score(y_train, y_train_pred_tfidf)
test_accuracy_tfidf = sklearn.metrics.accuracy_score(y_test, y_test_pred_tfidf)
print("TfidfVectorizer")
print(f"Train Accuracy: {train_accuracy_tfidf:.4f}")
print(f"Test Accuracy: {test_accuracy_tfidf:.4f}")

TfidfVectorizer
Train Accuracy: 0.9999
Test Accuracy: 0.9327


Both models reach nearly identical training accuracy (0.9998 with `CountVectorizer`, 0.9999 with `TfidfVectorizer`). This is expected, because each `MLPClassifier` keeps training until the loss has not improved for 10 consecutive epochs. The IDF weighting did not improve the results: test accuracy was slightly lower with TF-IDF (0.9327) than with raw counts (0.9428). I would also prefer `CountVectorizer` here because the TF-IDF model took around 40 minutes to train, whilst the count-based model took only 3.5 minutes.
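One way to check the training-time and stopping behaviour described above is to time `fit` and read the classifier's `n_iter_` attribute afterwards. The sketch below does this on a small synthetic dataset from `make_classification`, not on the lab data. The `demo_` names, the dataset size, and `max_iter=50` are assumptions chosen only to keep the example fast, and timings will vary by machine.

```python
import time

from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Small synthetic stand-in for the vectorized text features (assumption).
demo_X, demo_y = make_classification(n_samples=200, n_features=20, random_state=0)

# max_iter is capped at 50 here purely to keep the demo quick;
# the lab uses the default settings.
demo_clf = MLPClassifier(max_iter=50, random_state=0)

start = time.perf_counter()
demo_clf.fit(demo_X, demo_y)
elapsed = time.perf_counter() - start

# n_iter_ is the number of epochs actually run; loss_curve_ holds one loss per epoch.
print(f"Epochs run: {demo_clf.n_iter_}")
print(f"Final training loss: {demo_clf.loss_curve_[-1]:.4f}")
print(f"Fit time: {elapsed:.2f} s")
```

Applying the same pattern to `clf` and `clf_tfidf` would show how many epochs each model ran before stopping, which may help explain the difference in training time.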