### Lab 7.1: Bag of Words Model

In this lab you will use the bag of words model to learn author attribution with a [dataset of texts from Victorian authors](https://github.com/agungor2/Authorship_Attribution?tab=readme-ov-file).

In [2]:
import numpy as np
import sklearn
import pandas as pd
import kagglehub


Here we download the CSV file containing the text snippets and author IDs.

In [3]:
!wget --no-clobber -O Gungor_2018_VictorianAuthorAttribution_data-train.csv -q "https://www.dropbox.com/scl/fi/emk9db05t9u8yzgrjje7t/Gungor_2018_VictorianAuthorAttribution_data-train.csv?rlkey=kzvbl0mbpnrpjr4c3q18le6w2&dl=1"

In [4]:
df = pd.read_csv('Gungor_2018_VictorianAuthorAttribution_data-train.csv', encoding = "ISO-8859-1")
df.head()

Unnamed: 0,text,author
0,ou have time to listen i will give you the ent...,1
1,wish for solitude he was twenty years of age a...,1
2,and the skirt blew in perfect freedom about th...,1
3,of san and the rows of shops opposite impresse...,1
4,an hour s walk was as tiresome as three in a s...,1


In [5]:
df.tail()

Unnamed: 0,text,author
53673,after surrounding and searching the whole plac...,50
53674,giant who could make a young earthquake or a w...,50
53675,waters of the lake at the bottom of the hill c...,50
53676,fingers and thumb in it exactly as it came out...,50
53677,giant s sake he won t meet with for if he does...,50


In [6]:
text = list(df['text'])
labels = df['author'].values

In [7]:
text[0]

'ou have time to listen i will give you the entire story he said it may form the basis of a future novel and prove quite as interesting as one of your own invention i had the time to listen of course one has time for anything and everything agreeable in the best place to hear the tale was in a victoria and with my good on the box with the coachman we set out at once on a drive to the as the recital was only half through when we reached the house we postponed the remainder while we stopped there for an excellent lunch on the way back to my friend continued and finished the story it was indeed quite suitable for use and i told my friend with thanks that i should at once put it in shape for my readers i said i should make a few alterations in it for the sake of dramatic interest but in the main would follow the lines he had given me it would spoil my romance were i to answer on this page the question that must be uppermost in the reader s mind i have already revealed almost too much of th

### Exercises

1. Use the `CountVectorizer` from `sklearn.feature_extraction.text` to produce a term frequency vector for each text.  Set `max_features=1000` to only use the top 1000 terms.

Prepare a 90/10 train-test split with `random_state=42`.

Train the default `MLPClassifier` from `sklearn.neural_network` on the data and report the train and test accuracy.  You can pass `verbose=True` to `MLPClassifier` to monitor training.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

In [9]:
test_size = 0.10
vector_text = CountVectorizer(max_features=1_000).fit_transform(text)
X_train, X_test, y_train, y_test = train_test_split(vector_text, labels, test_size=test_size, random_state=42)

In [10]:
model = MLPClassifier(max_iter=200)
model.fit(X_train, y_train)

print(f'Train acc: {model.score(X_train, y_train)}')
print(f'Test acc: {model.score(X_test, y_test)}')

Train acc: 0.9998137031670462
Test acc: 0.944113263785395



2. Repeat the steps using `TfidfVectorizer` to produce term frequency-inverse document frequency (TF-IDF) vectors.

Does the IDF weighting improve the results?

In [11]:
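# Note: no max_features here, so the full vocabulary is used (the count model kept 1,000 terms).
tfidf_text = TfidfVectorizer().fit_transform(text)
X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(tfidf_text, labels, test_size=test_size, random_state=42)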

In [14]:
model_tfidf = MLPClassifier(max_iter=200)
model_tfidf.fit(X_train_tfidf, y_train_tfidf)

print(f'Train acc: {model_tfidf.score(X_train_tfidf, y_train_tfidf)}')
print(f'Test acc: {model_tfidf.score(X_test_tfidf, y_test_tfidf)}')

Train acc: 1.0
Test acc: 0.9837928464977646


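Yes, in this run it does: train accuracy reaches 1.0 and test accuracy rises from about 0.944 to about 0.984. One caveat is that the `TfidfVectorizer` above was built without `max_features=1000`, so it uses the full vocabulary, and some of the gain may come from the extra features rather than from the IDF weighting alone.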
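
One way to make the comparison like-for-like is to give both vectorizers the same `max_features` and the same split. The sketch below does this on a tiny made-up corpus; `compare_vectorizers`, the toy sentences, and the small `hidden_layer_sizes` are hypothetical choices for illustration, not the lab's setup, and the toy scores say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline


def compare_vectorizers(texts, labels, max_features=1000, test_size=0.10, seed=42):
    """Hypothetical helper: score both vectorizers on one shared split."""
    txt_train, txt_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    scores = {}
    for name, vectorizer in [
        ("count", CountVectorizer(max_features=max_features)),
        ("tfidf", TfidfVectorizer(max_features=max_features)),
    ]:
        # The pipeline fits the vectorizer on the training texts only.
        clf = make_pipeline(
            vectorizer,
            MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=seed),
        )
        clf.fit(txt_train, y_train)
        scores[name] = clf.score(txt_test, y_test)
    return scores


# Made-up two-"author" corpus, only to show the call.
toy_texts = [
    "the sea was grey and the ship rolled",
    "the ship sailed over the grey sea",
    "a storm rose over the sea at night",
    "the captain watched the sea and the storm",
    "the sailors feared the storm at sea",
    "grey waves broke over the ship",
    "the garden was bright with roses",
    "she walked in the garden at noon",
    "roses and lilies filled the garden",
    "the lady gathered roses in the morning",
    "bright flowers grew along the garden wall",
    "she admired the lilies in the garden",
]
toy_labels = [1] * 6 + [2] * 6

toy_scores = compare_vectorizers(toy_texts, toy_labels, max_features=20, test_size=0.25)
print(toy_scores)
```

Passing the lab's `text` and `labels` with `max_features=1000` would apply the same idea to the real data, at the cost of retraining both models.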