# Topic Classification 

This notebook builds a topic classification pipeline using BEER. It is designed mainly to work from the output of an AUD model, but it can easily be adapted to other inputs.

In [1]:
from collections import defaultdict
import math
import os
import pickle 
import random
import sys

sys.path.insert(0, '../../')

import numpy as np
import torch
import beer

from bokeh.plotting import figure, show, output_notebook, gridplot
output_notebook()

from utils.ngramfeatures import NGramCounter, select_ngrams
import utils.plotting as plotting

%load_ext autoreload
%autoreload 2

## Settings

Global configuration of the pipeline.

In [2]:
# Directory structure
datadir = 'data'
expdir = 'exp'

# Input
dbname = 'fisher'  
model = 'aud_mfcc_8k_4g_gamma_dirichlet_process'

# N-gram order for the document representation.
ngram_order = 3

## Data preparation

The pipeline assumes the following directory structure:

```
<datadir>
└── <dbname>
    ├── test
    │   ├── docids      # list of document ids (test set)
    │   └── doclabels   # list of pairs document id <-> topic id (test set)
    ├── topics          # list of all the topic ids (optional)
    └── train
        ├── docids      # list of document ids (train set)
        └── doclabels   # list of pairs document id <-> topic id (train set)
```

The script `local/<dbname>/prepare_data.sh <datadir>` prepares everything except the documents themselves (`docs`), which you have to provide to run the recipe.

In [3]:
%%bash -s "$datadir" "$dbname"

datadir=$1
dbname=$2
local/$dbname/prepare_data.sh $datadir

FISHER data already prepared.


Now we prepare the input to the classification. We use the transcription from an AUD system, organized in the following way:

```
<expdir>/<dbname>/<modeldir>
├── test
│   └── trans
└── train
    └── trans
```

The documents are stored one line at a time, each line prefixed with its document id. Lines from different documents may be interleaved:
```
docid1 first line of the document 
docid2 first line of the second document
docid2 second line of the second document
docid1 second line of the first document
... 
```
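
Because lines of different documents can be interleaved, a reader has to group them by their first token. The following is a minimal sketch of that grouping on made-up lines; the helper name `group_by_docid` and the toy data are illustrative assumptions, not part of the recipe.

```python
from collections import defaultdict


def group_by_docid(lines):
    """Group transcription lines by their leading document id."""
    docs = defaultdict(list)
    for line in lines:
        tokens = line.strip().split()
        if not tokens:  # skip empty lines
            continue
        docs[tokens[0]].append(' '.join(tokens[1:]))
    return dict(docs)


# Toy input: the lines of two documents are interleaved.
toy_lines = [
    'docid1 au1 au2 au3',
    'docid2 au4 au5',
    'docid2 au6 au7',
    'docid1 au8 au9',
]
toy_docs = group_by_docid(toy_lines)
print(toy_docs)
```

The order of the sentences within a document follows the order of the lines in the file.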

We use the time-aligned transcription of an AUD system, converted to a "standard" transcription with the tool `recipes/aud/utils/ali2trans.py`. Adjust the paths accordingly before running the following cell.

In [4]:
%%bash -s "$expdir" "$dbname" "$model"

expdir=$1
dbname=$2
model=$3

mkdir -p $expdir/$dbname/$model/{test,train}

# your code here

ntrain=$(cat $expdir/$dbname/$model/train/trans | awk '{print $1}' | sort | uniq | wc -l) 
ntest=$(cat $expdir/$dbname/$model/test/trans | awk '{print $1}' | sort | uniq | wc -l)
echo "number of documents for training: $ntrain"
echo "number of documents for testing: $ntest"

number of documents for training: 1374
number of documents for testing: 1372


The data is now ready. The next cell provides a quick access to the data:
  * `topics` list of all the topics sorted alphabetically
  * `train_doc_topic` mapping document id -> topic label (train set)
  * `test_doc_topic` mapping document id -> topic label (test set)
  * `train_docs` mapping document id -> list of sentences (train set)
  * `test_docs` mapping document id -> list of sentences (test set)

In [5]:
with open(os.path.join(datadir, dbname, 'topics'), 'r') as f:
    topics = sorted([line.strip() for line in f])

def iterate_doclabels(dataset, datadir=datadir, dbname=dbname):
    with open(os.path.join(datadir, dbname, dataset, 'doclabels'), 'r') as f:
        for line in f:
            tokens = line.strip().split()
            yield tokens[0], ' '.join(tokens[1:])
train_doc_topic = {docid: topicid for docid, topicid in iterate_doclabels('train')}
test_doc_topic = {docid: topicid for docid, topicid in iterate_doclabels('test')}

    
def iterate_trans(dataset, expdir=expdir, dbname=dbname, model=model):
    with open(os.path.join(expdir, dbname, model, dataset, 'trans'), 'r') as f:
        for line in f:
            tokens = line.strip().split()
            yield tokens[0], ' '.join(tokens[1:])
train_docs = defaultdict(list)
test_docs = defaultdict(list)
for docid, sentence in iterate_trans('train'): train_docs[docid].append(sentence)
for docid, sentence in iterate_trans('test'): test_docs[docid].append(sentence)

To verify that everything is properly set up, we plot the topic distribution for the train and test sets.
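
The quantity being plotted is simply the normalized count of documents per topic. A minimal sketch on toy data, assuming a mapping from document id to topic label like the ones built above (the helper `topic_distribution` and all toy values are hypothetical):

```python
from collections import Counter


def topic_distribution(doc_topic, topics):
    """Return the fraction of documents assigned to each topic."""
    counts = Counter(doc_topic.values())
    total = sum(counts.values())
    return [counts[topic] / total for topic in topics]


toy_topics = ['food', 'pets', 'sports']
toy_doc_topic = {'doc1': 'food', 'doc2': 'sports',
                 'doc3': 'sports', 'doc4': 'food'}
toy_probs = topic_distribution(toy_doc_topic, toy_topics)
print(toy_probs)  # [0.5, 0.0, 0.5]
```

A topic with probability zero in one set but not in the other is a quick sign that the train/test split or the label files are inconsistent.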

In [6]:
train_topic_counts = defaultdict(int)
for docid, topicid in train_doc_topic.items():
    train_topic_counts[topicid] += 1
tot = sum(train_topic_counts.values())
train_topic_prob = [train_topic_counts[topic] / tot for topic in topics]
    
test_topic_counts = defaultdict(int)
for docid, topicid in test_doc_topic.items():
    test_topic_counts[topicid] += 1
tot = sum(test_topic_counts.values())
test_topic_prob = [test_topic_counts[topic] / tot for topic in topics]

fig1 = figure(title='Topic Distribution (train set)', x_range=topics, 
              width=800, height=400)
fig1.vbar(x=topics, top=train_topic_prob, width=0.8)
fig1.xaxis.major_label_orientation = math.pi/3
fig1.xgrid.grid_line_color = None

fig2 = figure(title='Topic Distribution (test set)', x_range=topics, 
              width=800, height=400)
fig2.vbar(x=topics, top=test_topic_prob, width=0.8)
fig2.xaxis.major_label_orientation = math.pi/3
fig2.xgrid.grid_line_color = None

show(gridplot([[fig1], [fig2]]))

## Features extraction

Each document is represented by a 'bag-of-ngrams'. Because the vocabulary (i.e. the set of unique n-grams) can be quite large, we keep, for each topic, only the n-grams with the highest smoothed conditional probability of the topic given the n-gram:

$$
p(t | w ) = \frac{f_{wt} + |T|p(t)}{f_w + |T|}
$$

where:
  * $t$ is a topic label
  * $w$ is a n-gram
  * $|T|$ is the total number of topics
  * $f_{wt}$ is the number of times the n-gram $w$ occurs in all documents associated to the topic $t$
  * $f_w$ is the number of times the n-gram $w$ occurs in the whole corpus
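
As a worked example of the formula, the sketch below scores one n-gram against two topics. The function name `topic_given_ngram`, the toy occurrences and the uniform prior are assumptions made for illustration; `select_ngrams` may organize the computation differently.

```python
from collections import Counter


def topic_given_ngram(f_wt, f_w, p_t, n_topics):
    """Smoothed estimate p(t | w) = (f_wt + |T| p(t)) / (f_w + |T|)."""
    return (f_wt + n_topics * p_t) / (f_w + n_topics)


# Toy corpus: one (topic, n-gram) pair per occurrence. All values are made up.
toy_occurrences = [
    ('sports', ('au1', 'au2', 'au3')),
    ('sports', ('au1', 'au2', 'au3')),
    ('sports', ('au1', 'au2', 'au3')),
    ('food', ('au1', 'au2', 'au3')),
    ('food', ('au4', 'au5', 'au6')),
]
toy_prior = {'sports': 0.5, 'food': 0.5}  # p(t), assumed uniform here

toy_f_wt = Counter(toy_occurrences)                       # counts per (topic, n-gram)
toy_f_w = Counter(ngram for _, ngram in toy_occurrences)  # counts per n-gram

w = ('au1', 'au2', 'au3')
toy_scores = {
    topic: topic_given_ngram(toy_f_wt[(topic, w)], toy_f_w[w], prior, len(toy_prior))
    for topic, prior in toy_prior.items()
}
print(toy_scores)
```

Here the n-gram occurs 4 times, 3 of them in `sports`, so the scores are (3 + 1) / (4 + 2) = 2/3 for `sports` and (1 + 1) / (4 + 2) = 1/3 for `food`. Because the priors sum to one, the scores of a given n-gram always sum to one over the topics.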

In [7]:
vocab = select_ngrams(
    ngram_order, 
    nbest=100, 
    corpus=iterate_trans('train'), 
    doc_topic=train_doc_topic, 
    topic_prob={topicid: prob for topicid, prob in zip(topics, train_topic_prob)}
)

print(f'Vocabulary size: {len(vocab)}')
print('A few samples of the selected n-grams: ')
for ngram in random.sample(vocab, min(10, len(vocab))):
    print(ngram)

Vocabulary size: 3975
A few samples of the selected n-grams: 
('au66', 'au80', 'au94')
('au73', 'au51', 'au89')
('au99', 'au61', 'au20')
('au84', 'au40', 'au33')
('au6', 'au11', 'au31')
('au35', 'au77', 'au54')
('au13', 'au91', 'au78')
('au50', 'au60', 'au18')
('au94', 'au6', 'au39')
('au50', 'au72', 'au5')


Now we encode the corpus into a matrix as follows:

$$
\mathbf{X} = \begin{pmatrix}
    f_{11} & f_{12} & \dots & f_{1M} \\
    \vdots & \vdots & \ddots & \vdots \\
    f_{D1} & f_{D2} & \dots & f_{DM}
\end{pmatrix}
$$
where:
  * $D$ is the number of documents in the corpus
  * $M$ is the size of the vocabulary, i.e. the number of selected n-grams
  * $f_{dm}$ is the number of times the $m$th n-gram occurs in the $d$th document
  
Similarly, we encode the document labels into a vector:

$$
\mathbf{y} = \begin{pmatrix} t_1, t_2, \dots, t_D \end{pmatrix}
$$

where $t_i$ is the (index of the) topic label of the $i$th document.

NOTE: our implementation is not very memory efficient as we store a dense matrix even though the data is very sparse.
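
For reference, a sparse encoding of the same kind of count matrix could look like the sketch below. It uses `scipy.sparse.csr_matrix`, which is not used elsewhere in this notebook, and toy counts; the BEER cells that follow expect dense tensors, so this is only an illustration of the memory-saving alternative.

```python
import numpy as np
from scipy import sparse

# Toy vocabulary and per-document n-gram counts (all values are made up).
toy_vocab = [('au1', 'au2'), ('au2', 'au3'), ('au3', 'au4'), ('au4', 'au5')]
toy_doc_counts = [
    {('au1', 'au2'): 2, ('au3', 'au4'): 1},
    {('au2', 'au3'): 4},
    {},  # a document with no n-gram from the vocabulary
]

ngram_index = {ngram: j for j, ngram in enumerate(toy_vocab)}
rows, cols, vals = [], [], []
for i, counts in enumerate(toy_doc_counts):
    for ngram, count in counts.items():
        j = ngram_index.get(ngram)
        if j is not None:  # ignore n-grams outside the vocabulary
            rows.append(i)
            cols.append(j)
            vals.append(count)

# Only the non-zero entries are stored.
toy_X_sparse = sparse.csr_matrix(
    (vals, (rows, cols)),
    shape=(len(toy_doc_counts), len(toy_vocab)),
    dtype=float,
)
print(toy_X_sparse.shape, toy_X_sparse.nnz)  # (3, 4) 3
```

Calling `toy_X_sparse.toarray()` recovers the dense matrix when a downstream step requires it.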

In [14]:
train_X = np.zeros((len(train_docs), len(vocab)), dtype=float)
train_y = np.zeros((len(train_docs)))
for i, docid in enumerate(train_docs):
    counter = NGramCounter(ngram_order)
    for sentence in train_docs[docid]: counter.add(sentence)
    train_X[i] = counter.get_counts(vocab)
    train_y[i] = topics.index(train_doc_topic[docid])
    
test_X = np.zeros((len(test_docs), len(vocab)), dtype=float)
test_y = np.zeros((len(test_docs)))
for i, docid in enumerate(test_docs):
    counter = NGramCounter(ngram_order)
    for sentence in test_docs[docid]: counter.add(sentence)
    test_X[i] = counter.get_counts(vocab)
    test_y[i] = topics.index(test_doc_topic[docid])
    
train_X = torch.from_numpy(train_X)
train_y = torch.from_numpy(train_y).long()
test_X = torch.from_numpy(test_X)
test_y = torch.from_numpy(test_y).long()
    
print(f'train set size: {train_X.shape}')
print(f'test set size: {test_X.shape}')

train set size: torch.Size([1374, 3975])
test set size: torch.Size([1372, 3975])


## Topic Model

We use a revisited version of the Subspace Multinomial Model (SMM) to model the data.
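
As background, the sketch below evaluates the likelihood of one document under the usual SMM formulation, in which the n-gram distribution of a document is `softmax(bias + subspace @ latent)`. This is an assumption about the general model family, written in plain NumPy with hypothetical names; it does not reproduce the revisited version or the BEER implementation used in the next cell.

```python
import numpy as np


def smm_log_likelihood(counts, bias, subspace, latent):
    """Multinomial log-likelihood of n-gram counts, up to the multinomial
    coefficient, under the distribution softmax(bias + subspace @ latent)."""
    logits = bias + subspace @ latent
    # Numerically stable log-softmax.
    log_probs = logits - logits.max()
    log_probs = log_probs - np.log(np.exp(log_probs).sum())
    return float(counts @ log_probs)


rng = np.random.default_rng(0)
toy_vocab_size, toy_latent_dim = 6, 2
toy_bias = np.zeros(toy_vocab_size)
toy_subspace = rng.normal(size=(toy_vocab_size, toy_latent_dim))
toy_counts = np.array([3., 0., 1., 0., 2., 0.])

# With a zero bias and a zero latent vector the model is uniform over n-grams.
ll_zero = smm_log_likelihood(toy_counts, toy_bias, toy_subspace,
                             np.zeros(toy_latent_dim))
print(ll_zero)
```

Each document gets its own low-dimensional latent vector, and in this notebook a mixture with one Gaussian per topic serves as the prior over these vectors.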

In [None]:
latent_dim = 50

# Latent prior: mixture of Gaussians with one component per topic.
modelset = beer.NormalSet.create(
    mean=torch.zeros(latent_dim), 
    cov=torch.ones(latent_dim),
    size=len(topics),
    cov_type='full'
)
latent_prior = beer.Mixture.create(modelset).double()

# Build the Multinomial model.
mean = torch.ones(len(vocab)) / len(vocab)
model = beer.Categorical.create(mean).double()
newparams = {
    param: beer.SubspaceBayesianParameter.from_parameter(param, latent_prior)
    for param in model.bayesian_parameters()
}
model.replace_parameters(newparams)

# Create the Generalized Subspace Model
gsm = beer.GSM.create(model, latent_dim, latent_prior).double()

# Create one model (and one latent posterior) per training document.
models, latent_posts = gsm.new_models(len(train_docs), cov_type='diagonal')

# Accumulate the statistics of each document into its own model.
dim = train_X.shape[1]
for i, doc_model in enumerate(models):
    # Diagonal matrix: row j is the one-hot vector of the j-th n-gram
    # scaled by its count in document i.
    data = torch.eye(dim, dtype=train_X.dtype, device=train_X.device)
    data[range(dim), range(dim)] = train_X[i]
    elbo = beer.evidence_lower_bound(doc_model, data)
    elbo.backward(std_params=False)

epochs = 10
batch_size = 100

# Optimizer for the conjugate parameters.
params = gsm.conjugate_bayesian_parameters(keepgroups=True)
cjg_optim = beer.VBConjugateOptimizer(params, lrate=1e-2)

# Optimizer for the standard parameters.
params = list(latent_posts.parameters()) + list(gsm.parameters())
std_optim = torch.optim.Adam(params, lr=1e-3)

# Global optimizer.
optim = beer.VBOptimizer(cjg_optim, std_optim)

elbos = []
for epoch in range(1, epochs + 1): 
    for batch in range(0, len(train_docs), batch_size):
        batch_idxs = random.sample(range(len(models)), k=batch_size)
        batch_models = [models[s] for s in batch_idxs]
        batch_latent_posts = latent_posts[batch_idxs]
        batch_labels = train_y[batch_idxs]
        
        optim.init_step()
        elbo = beer.evidence_lower_bound(
            gsm, 
            batch_models, 
            labels=batch_labels,
            latent_posts=batch_latent_posts, 
            latent_nsamples=1, 
            params_nsamples=1,
            datasize=len(models)
        )
        elbo.backward()
        optim.step()
        
        elbos.append(float(elbo))

fig = figure()
fig.line([epochs * e/len(elbos) for e in range(len(elbos))],
         elbos)
show(fig)

In [10]:
means = latent_posts.params.mean
pred = latent_prior.posteriors(means).argmax(dim=1).detach().numpy()

# Compare NumPy arrays on both sides: train_y is a torch tensor.
print(f'Accuracy: {100 * np.mean(pred == train_y.numpy()):.5f} %')

NameError: name 'latent_posts' is not defined

In [11]:
latent_prior.modelset[0].mean

tensor([-0.0505,  0.0697, -0.0406,  0.0295, -0.0198,  0.0505, -0.0629, -0.0046,
        -0.0217, -0.0671,  0.0528,  0.0230,  0.0701,  0.0235,  0.0232,  0.0121,
        -0.0742, -0.0159,  0.0282, -0.0015,  0.0308,  0.0507, -0.0261, -0.0719,
         0.0316, -0.0259,  0.0643,  0.0394, -0.0676,  0.0250,  0.0566, -0.0631,
         0.0198, -0.0827,  0.0052,  0.0072, -0.0434, -0.0554, -0.0698, -0.0721,
         0.0618,  0.0156,  0.0272,  0.0269,  0.0520, -0.0181,  0.0312, -0.0019,
         0.0174, -0.0549], dtype=torch.float64)

In [12]:
fig = figure()

plotting.plot_gmm(fig, latent_prior, alpha=.5)
show(fig)

ValueError: shapes (50,2) and (50,50) not aligned: 2 (dim 1) != 50 (dim 0)

## Data preparation

In [None]:
NTOPICS = 40  # Number of topics {6, 40}
#datapath = 'data/aud_subspace_mbn_4g_gamma_dirichlet_process_ldim40.trans'
#datapath = 'data/aud_subspace_mbn_4g_gamma_dirichlet_process_ldim100.trans'
#datapath = 'data/aud_mbn_4g_gamma_dirichlet_process.trans'
datapath = 'data/aud_mfcc_8k_4g_gamma_dirichlet_process.trans'


with open(f'data/fisher_{NTOPICS}c_train.flist', 'r') as f:
    train_docs = [line.strip() for line in f]
    
with open(f'data/fisher_{NTOPICS}c_test.flist', 'r') as f:
    test_docs = [line.strip() for line in f]
    
with open('data/tID_tName.pkl', 'rb') as f:
    topic_names = pickle.load(f)
        
# Load all documents with their associated topic label.
#document2topic = {}                                                            
#with open('data/fe_03_p1_calldata.tbl', 'r') as f:                                            
#    next(f) # skip the first line.                                             
#    for line in f:                                                             
#        tokens = line.strip().split(',')                                       
#        docid, topicid = tokens[0], tokens[2]             
#        if topicid != '' and (docid in train_docs or docid in test_docs):                              
#            document2topic[docid] = topicid     
            
document2topic = {}
with open('data/fID_tID.pkl', 'rb') as f:
    labels = pickle.load(f)
for doc, topicid in labels.items():
    document2topic[doc] = topic_names[topicid]
    
        
train_documents = []
test_documents = []
for docid in document2topic:
    if docid in train_docs:
        train_documents.append(docid)
    elif docid in test_docs:
        test_documents.append(docid)
            
# Build the reverse mapping topic -> documents
topic2document = defaultdict(list)
for doc, topic in document2topic.items():
    topic2document[topic].append(doc)
topics = sorted(list(topic2document.keys()))

topic2document_train = defaultdict(list)
for doc, topic in document2topic.items():
    if doc in train_docs:
        topic2document_train[topic].append(doc)

topic2document_test = defaultdict(list)
for doc, topic in document2topic.items():
    if doc in test_docs:
        topic2document_test[topic].append(doc)

# Load the raw data
rawdata = defaultdict(list)                                                     
with open(datapath, 'r') as f:                                                  
    for line in f:                                                          
        tokens = line.strip().split()                                       
        docid = tokens[0].replace('fe_03_', '')[:5]                        
        rawdata[docid].append(' '.join(tokens[1:]))

## Features selection

The vocabulary is selected per topic by choosing the $K$ n-grams $w$ with the highest value of the following probability:

$$
p(t | w ) = \frac{f_{wt} + |T|p(t)}{f_w + |T|}
$$

In [None]:
p_t = np.array([float(len(topic2document_train[topic])) for topic in topics]) 
p_t /= p_t.sum()
ngram_order = 3

class NGramCounter:
    
    def __init__(self, order=3, prior_count=0.):
        self.order = order
        self.counts = defaultdict(lambda: prior_count)
        
    def add(self, doc):
        for utt in doc:
            new_utt = '<s> ' * (self.order - 1)  + utt
            tokens = new_utt.split()
            for i in range(len(tokens) - self.order + 1):  # include the last n-gram
                ngram = tuple(tokens[i:i+self.order])
                self.counts[ngram] += 1
                
    def get_counts(self, vocab=None):
        if vocab is None:
            vocab = sorted(list(self.counts.keys()))
        return [self.counts[word] for word in vocab]
        
# Evaluate the counts. 
global_counter = NGramCounter(order=ngram_order, prior_count=NTOPICS)
topic_counters = {}
for p_topic, topic in zip(p_t, topics):
    topic_counters[topic] = NGramCounter(order=ngram_order, prior_count=NTOPICS * p_topic)
    for doc in topic2document_train[topic]:
        global_counter.add(rawdata[doc])
        topic_counters[topic].add(rawdata[doc])

In [None]:
topic_ranked_ngrams = {}
for topic in topics:
    ranked_ngrams = []
    for ngram in global_counter.counts:
        score = topic_counters[topic].counts[ngram] / global_counter.counts[ngram]
        ranked_ngrams.append((ngram, score))
    topic_ranked_ngrams[topic]  = list(reversed(sorted(ranked_ngrams, key=lambda x: x[1])))

In [None]:
nbest = 100
vocab = set()
for ngrams in topic_ranked_ngrams.values():
    vocab = vocab.union([ngram for ngram, score in ngrams[:nbest]])
vocab = sorted(list(vocab))
len(vocab), vocab[:100]

## Features

Using the selected n-grams, we represent each document as a bag-of-ngrams.
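
To make the representation concrete, the sketch below counts padded trigrams in a toy document and reads the counts off in vocabulary order. The function `count_ngrams` is a hypothetical stand-in that uses the same `<s>` start-padding convention as the `NGramCounter` class; the toy sentences and vocabulary are made up.

```python
from collections import Counter


def count_ngrams(sentences, order=3, pad='<s>'):
    """Count every n-gram of the given order, padding each sentence start."""
    counts = Counter()
    for sentence in sentences:
        tokens = [pad] * (order - 1) + sentence.split()
        for i in range(len(tokens) - order + 1):
            counts[tuple(tokens[i:i + order])] += 1
    return counts


toy_doc = ['au1 au2 au3', 'au2 au3']
toy_counts = count_ngrams(toy_doc, order=3)

# One row of the feature matrix: counts in the order of the vocabulary.
toy_vocab = [('au1', 'au2', 'au3'), ('<s>', 'au2', 'au3'), ('au9', 'au9', 'au9')]
toy_row = [toy_counts[ngram] for ngram in toy_vocab]
print(toy_row)  # [1, 1, 0]
```

N-grams that are not in the vocabulary are simply dropped, and vocabulary entries that never occur in the document get a count of zero.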

In [None]:
train_X = np.zeros((len(train_documents), len(vocab)), dtype=float)
train_y = np.zeros((len(train_documents)))
for i, docid in enumerate(train_documents):
    counter = NGramCounter(order=ngram_order)
    counter.add(rawdata[docid])
    train_X[i] = counter.get_counts(vocab)
    train_y[i] = topics.index(document2topic[docid])
    
test_X = np.zeros((len(test_documents), len(vocab)), dtype=float)
test_y = np.zeros((len(test_documents)))
for i, docid in enumerate(test_documents):
    counter = NGramCounter(order=ngram_order)
    counter.add(rawdata[docid])
    test_X[i] = counter.get_counts(vocab)
    test_y[i] = topics.index(document2topic[docid])

In [None]:
train_X.sum(), test_X.sum(), len(train_X), len(test_X)
train_X[:10].sum(axis=1)

## Topic classification

Using "sklearn" to build the pipeline

In [11]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV


clf = Pipeline([
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=13, max_iter=20, tol=1e-1)),
])


parameters = {
}

# The `iid` argument has been removed from recent scikit-learn releases.
gs_clf = GridSearchCV(clf, parameters, cv=5)
gs_clf.fit(train_X.numpy(), train_y.numpy())


predicted = gs_clf.predict(test_X.numpy())
print(f'Accuracy: {np.mean(predicted == test_y.numpy()) * 100:.3f} %')

Accuracy: 18.586 %
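
The parameter grid above is empty, so the search only cross-validates a single configuration. The sketch below shows how a non-empty grid could be passed, using scikit-learn's `<step>__<parameter>` naming. The synthetic data, the grid values and the `toy_` names are assumptions for illustration and are not tuned for this task.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(13)

# Synthetic counts: 60 documents, 20 n-grams, 3 topics (all made up).
toy_y = np.repeat(np.arange(3), 20)
toy_X = rng.poisson(1.0, size=(60, 20)).astype(float)
for topic in range(3):
    # Make a block of 5 n-grams more frequent for each topic.
    toy_X[toy_y == topic, topic * 5:(topic + 1) * 5] += rng.poisson(3.0, size=(20, 5))

toy_pipeline = Pipeline([
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2', random_state=13)),
])

# Keys follow the '<step name>__<parameter name>' convention.
toy_grid = {
    'tfidf__use_idf': [True, False],
    'clf__alpha': [1e-4, 1e-3, 1e-2],
}

toy_search = GridSearchCV(toy_pipeline, toy_grid, cv=5)
toy_search.fit(toy_X, toy_y)
print(toy_search.best_params_)
print(f'Cross-validated accuracy: {toy_search.best_score_:.3f}')
```

After fitting, `toy_search.predict` uses the best configuration refitted on all of the training data, in the same way as `gs_clf.predict` above.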


accuracy for the AUD HMM without features selection: 37.755 % <br>
accuracy for the AUD subspace HMM (dim 40): 48.834 %

In [20]:
train_corpus = ['\n'.join(rawdata[doc]) for doc in train_documents]
test_corpus = ['\n'.join(rawdata[doc]) for doc in test_documents]

In [21]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(ngram_range=(3, 3))
train_X = count_vect.fit_transform(train_corpus)
test_X = count_vect.transform(test_corpus)

In [22]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier

tf_transformer = TfidfTransformer(use_idf=True).fit(train_X)
train_X_tfidf = tf_transformer.transform(train_X)
test_X_tfidf = tf_transformer.transform(test_X)

clf = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=13,
                    max_iter=20, tol=1e-3).fit(train_X_tfidf, train_y)

predicted = clf.predict(train_X_tfidf)
print(f'Train accuracy: {np.mean(predicted == train_y) * 100:.3f} %')

predicted = clf.predict(test_X_tfidf)
print(f'Accuracy: {np.mean(predicted == test_y) * 100:.3f} %')

Accuracy: 16.108 %
