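Let's start by quickly going over a solution for turning a count matrix into a tf-idf matrix. Assume a 6$\times$3 count matrix: six terms (rows) across three documents (columns). Recall that the tf-idf weight of term $i$ in document $j$ is $tf_{i,j}\cdot idf_{i}$, where $tf_{i,j}=\log(freq(i,j)+1)$ and $idf_{i} = \log(\frac{d}{df_{i}})$, with $d$ the number of documents and $df_{i}$ the number of documents that contain term $i$. For simplicity, the function below uses the raw count $freq(i,j)$ as the tf term and a base-2 logarithm for idf.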

In [380]:
import numpy as np

toy_counts = np.array([[0,1,2], [0,1,1], [1,1,0], [0,1,0], [1,1,1], [1,3,2]])
#print(toy_counts)

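def tfidfify(count_matrix):
    """Weight a term-by-document count matrix by idf (raw counts serve as tf)."""
    t, d = count_matrix.shape  # t terms (rows), d documents (columns)
    df = []
    for term in count_matrix:
        # Document frequency: number of documents in which this term occurs
        df.append(np.count_nonzero(term))
    # idf_i = log2(d / df_i); the (t, 1) shape broadcasts across the document columns
    idf = np.log2(d / np.array(df)).reshape(t, 1)
    tfidf = count_matrix * idf
    return tfidf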

my_tfidf = tfidfify(toy_counts)

print(toy_counts)
print(my_tfidf)


[[0 1 2]
 [0 1 1]
 [1 1 0]
 [0 1 0]
 [1 1 1]
 [1 3 2]]
[[0.        0.5849625 1.169925 ]
 [0.        0.5849625 0.5849625]
 [0.5849625 0.5849625 0.       ]
 [0.        1.5849625 0.       ]
 [0.        0.        0.       ]
 [0.        0.        0.       ]]
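The explicit loop above can also be written with vectorized NumPy operations. The following is a minimal sketch rather than the original solution: the name `tfidf_vectorized` is hypothetical, and it assumes (as in the output above) raw counts as the tf term, a base-2 logarithm, and that every term occurs in at least one document.

```python
import numpy as np

def tfidf_vectorized(count_matrix):
    """Hypothetical vectorized variant: rows are terms, columns are documents."""
    d = count_matrix.shape[1]                    # number of documents
    df = np.count_nonzero(count_matrix, axis=1)  # documents containing each term
    idf = np.log2(d / df)                        # one idf value per term
    return count_matrix * idf[:, np.newaxis]     # broadcast idf across documents

toy_counts = np.array([[0,1,2], [0,1,1], [1,1,0], [0,1,0], [1,1,1], [1,3,2]])
toy_tfidf = tfidf_vectorized(toy_counts)
print(toy_tfidf)
```

Terms that occur in all three documents get an idf of $\log_2(3/3)=0$, which is why the last two rows are entirely zero.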


In [381]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [382]:
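data = pd.read_csv("author_train.csv", encoding="ISO-8859-1")
# Relabel author 6 as author 5, then keep only rows with an author id below 7
data.replace(to_replace={"author":6}, value=5, inplace=True)
data = data[data.author < 7].values
# Unseeded shuffle: the 1800-row sample differs from run to run
np.random.shuffle(data)
data = data[:1800]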

In [384]:
print(data.shape)
print(data[:, 1])
print(data[:,0])

(1800, 2)
[1 1 5 ... 4 4 4]
['place of normal rest the left the boat at and was afterwards glad he did so he found one of the prettiest spots he had ever seen and a vegetation thoroughly tropical combined with a modern hotel whose were satisfactory in all respects by the time the next steamer arrived he had done the telegraph act himself and had the pleasure of sharing a room eight feet by five with a physician from who proved to be equal to several ordinary individuals in capacity for entertaining his companion the descent or possibly it is the ascent of the indian river is a journey not to be despised i learn that the means of travel have been improved since the date of which i am writing and are to be still more improved in the near future but was sure that had the steamer been even worse than it was a not easily conceivable and had he been obliged to share the easy chairs and sections of cabin floor on which most of the passengers had to sleep he would have found the trip worth tak

In [385]:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

texts = data[:,0]

cv = CountVectorizer()
counts = cv.fit_transform(texts).toarray()

tf = TfidfVectorizer()
tfidf = tf.fit_transform(texts).toarray()

svd = TruncatedSVD(n_components=1000)
lsa = svd.fit_transform(tfidf)

print("Counts shape:", counts.shape)
print("tfidf shape:", tfidf.shape)
print("LSA shape:", lsa.shape)
print("SVD variance explained with 1000d:", svd.explained_variance_ratio_.sum())


Counts shape: (1800, 9966)
tfidf shape: (1800, 9966)
LSA shape: (1800, 1000)
SVD variance explained with 1000d: 0.8541220749129519
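To see how many components are needed for a given share of the variance, the cumulative sum of `explained_variance_ratio_` can be inspected. This is only a sketch on a small random matrix (not the author data); the 0.85 target and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X = rng.random((200, 50))            # stand-in for a tf-idf matrix (assumption)

svd = TruncatedSVD(n_components=40, random_state=0)
svd.fit(X)

# Running total of the variance share captured by the first k components
cumulative = np.cumsum(svd.explained_variance_ratio_)
target = 0.85                        # illustrative threshold
reached = cumulative >= target
if reached.any():
    k = int(np.argmax(reached)) + 1  # first component count meeting the target
    print("Components needed for", target, "of the variance:", k)
else:
    print("Target not reached with", len(cumulative), "components; got", cumulative[-1])
```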


In [386]:

import pickle

# Context managers ensure each file is flushed and closed after writing
for name, matrix in [("counts", counts), ("tfidf", tfidf), ("lsa", lsa)]:
    with open(f"author_{name}.p", "wb") as f:
        pickle.dump(matrix, f)
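
In a later session the saved matrices can be restored with `pickle.load`. A minimal round-trip sketch, using a temporary directory and a small stand-in array rather than the real files (the file name is hypothetical):

```python
import os
import pickle
import tempfile

import numpy as np

matrix = np.arange(6).reshape(2, 3)   # stand-in for counts / tfidf / lsa

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_matrix.p")   # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(matrix, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(np.array_equal(matrix, restored))
```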


Now we have three vector-space representations of our short texts from Victorian authors: raw counts, tf-idf, and LSA. We wish to see whether we can predict the authors (whose identities are known to us) using the text alone.

We have different methods of doing this. The first two are unsupervised: k-means clustering (with k=5) and agglomerative clustering.

In [387]:
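from functools import wraps
from time import time

def timer(func):
    """Decorator that reports the wall-clock time of a clustering function."""
    @wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(vecs):
        print("Beginning function", func.__name__)
        begin = time()
        clusters = func(vecs)
        end = time()
        print("Function took:", end - begin, "seconds\n")
        return clusters
    return wrapper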

In [388]:
# Our models to test: counts, tfidf, lsa
# Clustering methods to use: KMeans, Hierarchical Clustering (Agglomerative variety) 
from sklearn.cluster import KMeans, AgglomerativeClustering

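@timer
def kmeans(vectors):
    km = KMeans(n_clusters=5, random_state=0)
    km.fit(vectors)
    return km

@timer
def ac(vectors):
    # Ward linkage works on Euclidean distances, which is the default; the
    # `affinity` keyword was removed in newer scikit-learn releases, so it is omitted
    model = AgglomerativeClustering(n_clusters=5, linkage="ward")
    model.fit(vectors)
    return model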

kcounts = kmeans(counts)
ktfidf = kmeans(tfidf)
klsa = kmeans(lsa)
accounts = ac(counts)
actfidf = ac(tfidf)
aclsa = ac(lsa)

Beginning function kmeans
Function took: 14.868891477584839 seconds

Beginning function kmeans
Function took: 13.37974500656128 seconds

Beginning function kmeans
Function took: 2.0183801651000977 seconds

Beginning function ac
Function took: 9.762129545211792 seconds

Beginning function ac
Function took: 10.021447658538818 seconds

Beginning function ac
Function took: 0.9806346893310547 seconds



In [389]:
from sklearn.metrics import adjusted_rand_score as ars

# The adjusted Rand index (ARI) compares two partitions: 1.0 means identical
# groupings, values near 0 mean chance-level agreement. It is not an accuracy.
print("KMeans on count vectors ARI:", ars(kcounts.labels_, data[:,1]))
print("KMeans on tfidf vectors ARI:", ars(ktfidf.labels_, data[:,1]))
print("KMeans on lsa vectors ARI:", ars(klsa.labels_, data[:,1]))
print("Agglomerative Clustering on count vectors ARI:", ars(accounts.labels_, data[:,1]))
print("Agglomerative Clustering on tfidf vectors ARI:", ars(actfidf.labels_, data[:,1]))
print("Agglomerative Clustering on lsa vectors ARI:", ars(aclsa.labels_, data[:,1]))


KMeans on count vectors ARI: 0.18276678923743214
KMeans on tfidf vectors ARI: 0.29369151006860555
KMeans on lsa vectors ARI: 0.28076957318972373
Agglomerative Clustering on count vectors ARI: 0.11676855293488145
Agglomerative Clustering on tfidf vectors ARI: 0.36603069953819645
Agglomerative Clustering on lsa vectors ARI: 0.49984915179395073
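
Note that `adjusted_rand_score` measures agreement between two partitions rather than classification accuracy: cluster ids are arbitrary, so a clustering that recovers the author groups under different ids still scores 1.0. A small sketch with made-up labels (all variable names are illustrative):

```python
from sklearn.metrics import adjusted_rand_score

true_authors = [0, 0, 1, 1, 2, 2]
same_groups_renamed = [2, 2, 0, 0, 1, 1]   # identical partition, different cluster ids
one_big_cluster = [0, 0, 0, 0, 0, 0]       # everything lumped together

ari_renamed = adjusted_rand_score(true_authors, same_groups_renamed)
ari_lumped = adjusted_rand_score(true_authors, one_big_cluster)
print(ari_renamed)
print(ari_lumped)
```

The renamed partition scores 1.0 even though a naive element-wise accuracy would be 0, while lumping every text into one cluster scores 0.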


In [367]:
# One call splits every array with the same shuffle, so rows stay aligned
# across the three representations and the labels
labels = data[:,1].astype(int) - 1  # shift author ids so classes start at 0
(counts_train, counts_test, tfidf_train, tfidf_test,
 lsa_train, lsa_test, y_train, y_test) = train_test_split(
    counts, tfidf, lsa, labels, test_size=0.15, random_state=42)

In [390]:
print(counts_train.shape, counts_test.shape)
print(tfidf_train.shape, tfidf_test.shape)
print(lsa_train.shape, lsa_test.shape)
print(y_train.shape, y_test.shape)

(1530, 9968) (270, 9968)
(1530, 9968) (270, 9968)
(1530, 1000) (270, 1000)
(1530,) (270,)


In [391]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from tqdm import tqdm

In [392]:
class MyNN(nn.Module):
    def __init__(self, train_X, train_y, test_X, test_y, activation="tanh", init=None, lr=0.1):
        super(MyNN, self).__init__()
        
        self.train_X = torch.tensor(train_X, dtype=torch.float)
        self.train_y = torch.tensor([[i] for i in train_y])
        self.test_X = torch.tensor(test_X, dtype=torch.float)
        self.test_y = torch.tensor([[i] for i in test_y])
        self.activation = activation
        self.init = init
        self.lr = lr
        
        self.input_dim = train_X.shape[1]
        self.num_classes = len(np.unique(train_y))
        
        self.Linear1 = nn.Linear(self.input_dim, 300, bias=True)
        self.Linear2 = nn.Linear(300, self.num_classes, bias=True)
        if self.init:
            self.init(self.Linear1.weight)
            self.init(self.Linear2.weight)
            
        
        if self.activation == "tanh":
            self.activation = torch.tanh
        elif self.activation == "relu":
            self.activation = torch.relu
        elif self.activation == "identity":
            self.activation = lambda x: x
        else:
            raise ValueError("Unknown activation: " + str(activation))
            
        
        self.loss = torch.nn.CrossEntropyLoss()
        self.optimizer = torch.optim.SGD(self.parameters(), lr=self.lr, momentum = 0.0)
        
    def forward(self, vector):
        # Expects a single 1-D feature vector; returns the logits and the index of the largest logit
        a1 = self.activation(self.Linear1(vector))
        #a2 = self.activation(self.Linear2(a1))
        y_hat = self.Linear2(a1)
        return y_hat, y_hat.argmax(dim=0)
        
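    # Note: this shadows nn.Module.train(mode), so model.eval(), which calls
    # self.train(False), no longer switches the module to evaluation mode
    def train(self, epochs=5):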
        for epoch in range(1, epochs+1):
            counter = 0
            correct = 0
            running_loss = 0
            
            for i in tqdm(range(len(self.train_X))):
                self.optimizer.zero_grad()
                y_hat, prediction = self.forward(self.train_X[i])
                y = self.train_y[i]
                
                if prediction == y:
                    correct += 1
                    
                loss = self.loss(y_hat.unsqueeze(0), y)
                loss.backward()
                self.optimizer.step()
                
                running_loss += loss.item()
                counter += 1
                
                    
            print("Average loss on epoch", str(epoch) + ":", running_loss / counter)
            print("Percentage correct on epoch", str(epoch) + ":", correct / counter)
            
    def test(self):
        counter = 0
        correct = 0
            
        print("Beginning Test")
        for i in tqdm(range(len(self.test_X))):
            with torch.no_grad():
                y_hat, prediction = self.forward(self.test_X[i])
                y = self.test_y[i]
                    
                if prediction == y:
                    correct += 1
                counter += 1
                    
        print("Test accuracy:", correct / counter)
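
The class above processes one example at a time. The same two-layer architecture can also score a whole batch in a single forward pass. The sketch below is an illustration under assumptions, not the model used in this notebook: it uses random inputs and labels, an `nn.Sequential` stand-in, and hypothetical sizes (20 features, 5 classes).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

num_features, num_classes = 20, 5              # hypothetical sizes
model = nn.Sequential(
    nn.Linear(num_features, 300),
    nn.Tanh(),
    nn.Linear(300, num_classes),
)

X = torch.randn(32, num_features)              # random stand-in batch
y = torch.randint(0, num_classes, (32,))       # random stand-in labels

with torch.no_grad():
    logits = model(X)                          # shape (32, num_classes)
    predictions = logits.argmax(dim=1)         # most likely class per row
    accuracy = (predictions == y).float().mean().item()

print("Batch accuracy on random data:", accuracy)
```

With a batch, the class index is taken along `dim=1` (one prediction per row) rather than `dim=0` as in the single-vector case.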
            

In [393]:
NN_on_Counts = MyNN(counts_train, y_train, counts_test, y_test, activation="tanh", init=None, lr=0.001)
NN_on_Counts.train(epochs=5)
NN_on_Counts.test()

100%|██████████| 1530/1530 [00:13<00:00, 111.64it/s]

Average loss on epoch 1: 0.8064674368870803
Percentage correct on epoch 1: 0.6993464052287581


100%|██████████| 1530/1530 [00:13<00:00, 116.70it/s]

Average loss on epoch 2: 0.4524309561922659
Percentage correct on epoch 2: 0.8464052287581699


100%|██████████| 1530/1530 [00:13<00:00, 114.20it/s]

Average loss on epoch 3: 0.3106791121118209
Percentage correct on epoch 3: 0.9


100%|██████████| 1530/1530 [00:12<00:00, 120.53it/s]

Average loss on epoch 4: 0.22148335502038594
Percentage correct on epoch 4: 0.9333333333333333


100%|██████████| 1530/1530 [00:12<00:00, 118.72it/s]
100%|██████████| 270/270 [00:00<00:00, 1672.95it/s]

Average loss on epoch 5: 0.15960837532492245
Percentage correct on epoch 5: 0.9535947712418301
Beginning Test
Test accuracy: 0.9333333333333333





In [394]:
NN_on_tfidf = MyNN(tfidf_train, y_train, tfidf_test, y_test, activation="tanh", init=torch.nn.init.xavier_uniform_, lr=0.1)
NN_on_tfidf.train(epochs=5)
NN_on_tfidf.test()

100%|██████████| 1530/1530 [00:16<00:00, 95.55it/s] 

Average loss on epoch 1: 0.9181893484265197
Percentage correct on epoch 1: 0.6679738562091503


100%|██████████| 1530/1530 [00:16<00:00, 103.04it/s]

Average loss on epoch 2: 0.3112680663470349
Percentage correct on epoch 2: 0.8993464052287582


100%|██████████| 1530/1530 [00:16<00:00, 93.58it/s] 

Average loss on epoch 3: 0.09866771394131231
Percentage correct on epoch 3: 0.9758169934640523


100%|██████████| 1530/1530 [00:15<00:00, 97.89it/s] 

Average loss on epoch 4: 0.032291598569333946
Percentage correct on epoch 4: 0.9986928104575163


100%|██████████| 1530/1530 [00:15<00:00, 99.19it/s] 
100%|██████████| 270/270 [00:00<00:00, 1715.14it/s]

Average loss on epoch 5: 0.01538658453748117
Percentage correct on epoch 5: 1.0
Beginning Test
Test accuracy: 0.9592592592592593





In [395]:
NN_on_lsa = MyNN(lsa_train, y_train, lsa_test, y_test, activation="tanh", init=torch.nn.init.xavier_uniform_, lr=0.1)
NN_on_lsa.train(epochs=5)
NN_on_lsa.test()

100%|██████████| 1530/1530 [00:00<00:00, 1838.65it/s]

Average loss on epoch 1: 0.8857710665737102
Percentage correct on epoch 1: 0.6797385620915033


100%|██████████| 1530/1530 [00:00<00:00, 1650.05it/s]

Average loss on epoch 2: 0.29873171538309334
Percentage correct on epoch 2: 0.9058823529411765


100%|██████████| 1530/1530 [00:00<00:00, 1784.13it/s]

Average loss on epoch 3: 0.10294982416957033
Percentage correct on epoch 3: 0.9751633986928104


100%|██████████| 1530/1530 [00:00<00:00, 1818.30it/s]

Average loss on epoch 4: 0.03618009316375832
Percentage correct on epoch 4: 0.9973856209150327


100%|██████████| 1530/1530 [00:01<00:00, 1280.06it/s]
100%|██████████| 270/270 [00:00<00:00, 9499.72it/s]

Average loss on epoch 5: 0.017427550577649883
Percentage correct on epoch 5: 1.0
Beginning Test
Test accuracy: 0.9666666666666667



