<a href="https://colab.research.google.com/github/deniskapel/topic_rec/blob/master/topic_rec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Next entity recommendation using spaCy noun chunks and LSTM

Steps:
1. Extract noun chunks using spaCy
2. Filter out the most frequent chunks
3. Cluster dialogues using chunk vectors and KMeans
4. Take the largest cluster and split it into train and test sets
5. Train an LSTM model to generate the next entity (noun chunk)

In [1]:
%%bash
# Skip if you are using Jupyter instead of Colab. A GPU is required.
wget https://raw.githubusercontent.com/deniskapel/topic_rec/master/data_tools.py
wget https://raw.githubusercontent.com/deniskapel/topic_rec/master/torch_tools.py

In [2]:
from itertools import chain

import numpy as np
import spacy
import torch

from torch.utils.data import DataLoader
from sklearn.cluster import KMeans

from data_tools import Chunker, Vectorizer, SequenceGenerator
from torch_tools import TopicDataset, GModel, train, predict

In [3]:
spacy.prefer_gpu()
# load larger model to use spacy word vectors
spacy.cli.download("en_core_web_md")
nlp = spacy.load("en_core_web_md")

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


In [4]:
device = torch.device('cuda')
BATCH_SIZE = 32
SEQ_LEN = 2  # number of preceding chunks used to predict the next one

## Data

In [5]:
%%bash
# skip if data is pre-loaded
wget http://yanran.li/files/ijcnlp_dailydialog.zip
unzip ijcnlp_dailydialog.zip
mv ijcnlp_dailydialog data

In [6]:
"""
Work with the dialogue text only, since topics are supposed to be
filtered out using key words rather than topic labels.
"""
with open('data/dialogues_text.txt') as file:
    dialogues = file.readlines()

In [7]:
dialogues[0]

"The kitchen stinks . __eou__ I'll throw out the garbage . __eou__\n"

### Normalizing data

In [8]:
chunker = Chunker(spacy_model=nlp, stop_words=nlp.Defaults.stop_words)
chunked_dials = chunker.normalize(dialogues)

  0%|          | 0/13118 [00:00<?, ?it/s]

In [9]:
chunked_dials[90]

[Ve,
 your computer,
 Ve,
 your personal files,
 ’ m,
 all my important personal documents,
 that computer,
 no laughing matter,
 I ’ m,
 Don ’ t,
 my computer,
 t]

Remove the 50 most common noun chunks, as they do not characterize individual dialogues.

**TODO**: experiment with a larger/smaller number and a stop list

In [10]:
most_common = [topic[0] for topic in chunker.counter.most_common(50)]
filtered = [chunker.filter_chunks(dial, most_common) for dial in chunked_dials]
set(filtered[90]) ^ set(chunked_dials[90])

{’ m, I ’ m, Ve, Don ’ t, Ve, t}
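The filtering step above can be sketched without `data_tools`. This is only an illustration, assuming chunks are plain strings rather than spaCy spans; `filter_common` and the toy dialogues are hypothetical and are not the `Chunker` API.

```python
from collections import Counter


def filter_common(dialogues, n_most_common):
    """Drop the n most frequent chunks from every dialogue."""
    # Count every chunk across all dialogues
    counts = Counter(chunk for dial in dialogues for chunk in dial)
    common = {chunk for chunk, _ in counts.most_common(n_most_common)}
    return [[chunk for chunk in dial if chunk not in common]
            for dial in dialogues]


# Toy data: "I" occurs four times and is the single most common chunk
toy_dialogues = [
    ["I", "your computer", "I"],
    ["I", "my computer", "that computer"],
    ["I", "your computer"],
]
filtered_toy = filter_common(toy_dialogues, n_most_common=1)
print(filtered_toy)
```

Frequent chunks such as pronouns and tokenization fragments appear in almost every dialogue, so dropping them leaves the chunks that distinguish one dialogue from another.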

In [11]:
# keep only dialogues with more than SEQ_LEN + 1 chunks,
# as no padding is implemented for now
filtered = [doc for doc in filtered if len(doc) > (SEQ_LEN+1)]

### Vectorizing data to cluster

In [12]:
vectorizer = Vectorizer(len(filtered))
vecs, doc2id = vectorizer.vectorize(filtered)

In [13]:
kmeans = KMeans(n_clusters=10, random_state=0)
kmeans.fit(vecs)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=10, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)

In [14]:
unique, counts = np.unique(kmeans.labels_, return_counts=True)
topics = dict(zip(unique, counts))
topics

{0: 1275,
 1: 772,
 2: 1194,
 3: 1222,
 4: 642,
 5: 1568,
 6: 574,
 7: 574,
 8: 1840,
 9: 1418}

In [15]:
id_max = max(topics, key=topics.get)
id_max

8

Take the largest cluster so that there is enough data for training.

In [16]:
largest_theme = doc2id.loc[(kmeans.labels_== id_max).nonzero()]

In [17]:
largest_theme.sample(5)

| | doc |
|---|---|
| 9593 | [(your, technical, title), (an, Assistant, Ele... |
| 11075 | [(Lindsay, Tipping), (a, reference), (an, edit... |
| 4183 | [(a, book), (healthy, habits), (traveling), (T... |
| 6803 | [(your, new, job), (my, new, job), (the, grape... |
| 9785 | [(Any, specific, reason), (case), (the, dentis... |


Create Vocabulary.

In [18]:
id2chunk = list(set(
        [chunk for chunk in chain(*largest_theme.values.squeeze())]))
chunk2id = {chunk: i for i, chunk in enumerate(id2chunk)}

Build X and y for training

In [19]:
sgen = SequenceGenerator(SEQ_LEN, chunk2id=chunk2id, id2chunk=id2chunk)
seqs = sgen.get_sequences(largest_theme)
seqs.head()

| | seq | target |
|---|---|---|
| 0 | [21667, 17227] | [17227, 7517] |
| 1 | [17227, 7517] | [7517, 26961] |
| 2 | [7517, 26961] | [26961, 17348] |
| 3 | [26961, 17348] | [17348, 30148] |
| 4 | [16089, 29540] | [29540, 1895] |
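The rows above suggest that each target is the input sequence shifted one chunk ahead. A minimal sketch of that sliding window follows, assuming a dialogue is already a list of chunk ids; `make_sequences` is a hypothetical helper and not the `SequenceGenerator` implementation.

```python
def make_sequences(chunk_ids, seq_len):
    """Return (seq, target) pairs where target is seq shifted one step ahead."""
    pairs = []
    # A dialogue with n chunks yields n - seq_len pairs (none if it is too short)
    for start in range(len(chunk_ids) - seq_len):
        seq = chunk_ids[start:start + seq_len]
        target = chunk_ids[start + 1:start + seq_len + 1]
        pairs.append((seq, target))
    return pairs


# The first four ids from the table above
toy_ids = [21667, 17227, 7517, 26961]
pairs = make_sequences(toy_ids, seq_len=2)
for seq, target in pairs:
    print(seq, target)
```

The last element of each target is the chunk the model has to predict; the earlier elements repeat the tail of the input.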


Train/test split

In [20]:
train_df = seqs.sample(frac=0.8, random_state=42)
test_df = seqs.drop(train_df.index)

In [21]:
train_df.shape, test_df.shape

((21548, 2), (5387, 2))

**data to torch DataLoader**

In [22]:
# the LSTM expects hidden_state and cell_state of the same size in every batch;
# to avoid re-initializing them for each batch, make all batches equal in size
# an alternative solution is to top up the last batch with a few examples from validation
trim_train = train_df.shape[0] % BATCH_SIZE
trim_test = test_df.shape[0] % BATCH_SIZE

train_x = train_df['seq'].values
train_y = train_df['target'].values
test_x = test_df['seq'].values
test_y = test_df['target'].values
if trim_train>0:
    train_x = train_x[:-trim_train]
    train_y = train_y[:-trim_train]
if trim_test>0:
    test_x = test_x[:-trim_test]
    test_y = test_y[:-trim_test]

train_dataset = TopicDataset(
    x=train_x,
    y=train_y,
    n_features=300,
    id2chunk=sgen.id2chunk,
    seq_len=sgen.seq_len)

test_dataset = TopicDataset(
    x=test_x,
    y=test_y,
    n_features=300,
    id2chunk=sgen.id2chunk,
    seq_len=sgen.seq_len)
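The trimming in the cell above is plain modular arithmetic. The sketch below reproduces it for the split sizes reported earlier; `trim_to_full_batches` is a hypothetical helper written for this illustration. As a side note, `torch.utils.data.DataLoader` accepts `drop_last=True`, which discards an incomplete final batch and may be a simpler way to get the same effect.

```python
def trim_to_full_batches(n_examples, batch_size):
    """Return (examples kept, examples dropped, number of full batches)."""
    dropped = n_examples % batch_size
    kept = n_examples - dropped
    return kept, dropped, kept // batch_size


BATCH = 32  # same value as BATCH_SIZE in the notebook

train_stats = trim_to_full_batches(21548, BATCH)  # train split size from above
test_stats = trim_to_full_batches(5387, BATCH)    # test split size from above
print(train_stats)
print(test_stats)
```

The train split therefore loses 12 examples and the test split 11, which is negligible at this data size.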

In [23]:
# vector size               target                  prev_seq
train_dataset[0][0].shape, train_dataset[0][1],train_dataset[0][2]

((2, 300), [11104, 5190], [12313, 11104])

In [24]:
# vector size               target                  prev_seq
test_dataset[0][0].shape, test_dataset[0][1], test_dataset[0][2]

((2, 300), [2091, 26340], [14236, 2091])

Wrap the train and test datasets in DataLoaders

In [25]:
train_loader = DataLoader(
    train_dataset, batch_size=BATCH_SIZE, 
    shuffle=True, collate_fn=train_dataset.collate,
    # TODO find out why shuffling requires a generator on gpu 
    # probably due to spacy vectors on gpu,
    # but shuffle=False works fine without it for some reason
    generator=torch.Generator(device='cuda'))

for batch in train_loader:
    break

batch[0].shape

torch.Size([32, 2, 300])

In [26]:
test_loader = DataLoader(
    test_dataset, batch_size=BATCH_SIZE, 
    shuffle=True, collate_fn=test_dataset.collate,
    generator=torch.Generator(device='cuda'))

In [27]:
for batch in test_loader:
    break

batch[1].shape

torch.Size([32, 2])

## Train

Use an LSTM to generate the next topic.

**TODO:** experiment with the number of epochs, various combinations of layers, and longer/shorter sequences

**TODO2:** allow switching between bidirectional and unidirectional LSTM

In [28]:
model = GModel(vocab_size=len(sgen.id2chunk), seq_len=sgen.seq_len)

In [29]:
model.to(device)

GModel(
  (lstm): LSTM(300, 128, batch_first=True)
  (linear): Linear(in_features=128, out_features=30615, bias=True)
)

In [30]:
for p in model.named_parameters():
    print(p[0], p[1].shape)

lstm.weight_ih_l0 torch.Size([512, 300])
lstm.weight_hh_l0 torch.Size([512, 128])
lstm.bias_ih_l0 torch.Size([512])
lstm.bias_hh_l0 torch.Size([512])
linear.weight torch.Size([30615, 128])
linear.bias torch.Size([30615])
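The shapes printed above follow from the layer sizes: PyTorch stacks the weights of the four LSTM gates along the first dimension, hence 4 × 128 = 512 rows. The sketch below recomputes the shapes and the total parameter count in plain Python; both helper functions are hypothetical and written only for this check.

```python
import math


def lstm_linear_shapes(input_size, hidden_size, vocab_size):
    """Parameter shapes of a single-layer unidirectional LSTM plus a linear head."""
    gates = 4 * hidden_size  # input, forget, cell and output gates are stacked
    return {
        "lstm.weight_ih_l0": (gates, input_size),
        "lstm.weight_hh_l0": (gates, hidden_size),
        "lstm.bias_ih_l0": (gates,),
        "lstm.bias_hh_l0": (gates,),
        "linear.weight": (vocab_size, hidden_size),
        "linear.bias": (vocab_size,),
    }


def count_parameters(shapes):
    """Sum the number of elements over all parameter tensors."""
    return sum(math.prod(shape) for shape in shapes.values())


shapes = lstm_linear_shapes(input_size=300, hidden_size=128, vocab_size=30615)
total = count_parameters(shapes)
print(shapes["lstm.weight_ih_l0"], total)
```

Most of the parameters sit in the output layer, because it has one row per noun chunk in the vocabulary.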


In [31]:
train(model, train_loader, epochs=15, lr=.001, clip_value=1.0)

Epoch 1:   0%|          | 0/673 [00:00<?, ?it/s]

Prediction uses only the first batch, for demonstration purposes. The data is shuffled, so the output differs on every run.

In [32]:
# id2chunk is necessary to restore Noun Chunk's string
predict(
    model, test_loader, 
    old_vocab=sgen.id2chunk, 
    new_vocab=sgen.id2chunk # intentional
    )

SEQ: good care, my elderly grandfather, PRED: the period, TRUE: a teacher
SEQ: a continued claim form, that form, PRED: this form, TRUE: the questions
SEQ: a supermarket, last summer holidays, PRED: your spare time, TRUE: your spare time
SEQ: both aspects, our customer base, PRED: qc problems, TRUE: our brand
SEQ: the customer, the law, PRED: inventions, TRUE: effort
SEQ: this mess, so many projects, PRED: that ’ s, TRUE: That ’ s
SEQ: your advertisement, an experienced software engineer, PRED: my background, TRUE: my background
SEQ: the final examination, this month, PRED: good preparation, TRUE: good preparation
SEQ: unit, that part, PRED: the very end, TRUE: charge
SEQ: the high unemployment rate, politicians, PRED: more than a hundred and fifty countries, TRUE: the problems
SEQ: the book, our teacher, PRED: 42 yuan, TRUE: 42 yuan
SEQ: my judgment, commitment, PRED: projects, TRUE: projects
SEQ: the position, accountant, PRED: a student cadre, TRUE: What university
SEQ: the software

In [33]:
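# skip if you do not need to save the model
import os

os.makedirs('model', exist_ok=True)  # no error if the directory already exists
model_save_name = 'entity_generator.pt'
path = os.path.join('model', model_save_name)  # relative path works in Colab and Jupyter
torch.save(model.state_dict(), path)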

## Test on another group of dialogues

OOVs are handled by spaCy, but this data has a different theme. The model works on vectors and knows nothing about tokens, so it is interesting to look at its predictions when the sequences come from another domain.

In [34]:
topics

{0: 1275,
 1: 772,
 2: 1194,
 3: 1222,
 4: 642,
 5: 1568,
 6: 574,
 7: 574,
 8: 1840,
 9: 1418}

In [35]:
with_oov = doc2id.loc[(kmeans.labels_== 1).nonzero()]
with_oov.sample(5)

| | doc |
|---|---|
| 6188 | [(m), (Rick, Fields), (Bob, Copeland), (Howdy,... |
| 3238 | [(the, name), (the, book), (Harry, Potter), (H... |
| 5542 | [(Death), (the, Nile), (No, ,, not, that, one)... |
| 5592 | [(bridge), (a, change), (a, long, time), (the,... |
| 719 | [(Chinese, antiques), (a, great, variety), (Ch... |


In [36]:
new_id2chunk = list(
    set([chunk for chunk in chain(*with_oov.values.squeeze())]))

new_chunk2id = {chunk: i for i, chunk in enumerate(new_id2chunk)}

In [37]:
new_sgen = SequenceGenerator(SEQ_LEN, chunk2id=new_chunk2id, id2chunk=new_id2chunk)
with_oov_seqs = new_sgen.get_sequences(with_oov)
with_oov_seqs.head(5)

| | seq | target |
|---|---|---|
| 0 | [10870, 1286] | [1286, 7582] |
| 1 | [1286, 7582] | [7582, 5362] |
| 2 | [7582, 5362] | [5362, 6158] |
| 3 | [5362, 6158] | [6158, 696] |
| 4 | [6158, 696] | [696, 11298] |


In [42]:
trim_oov = with_oov_seqs.shape[0] % BATCH_SIZE

oov_x = with_oov_seqs['seq'].values
oov_y = with_oov_seqs['target'].values
if trim_oov>0:
    oov_x = oov_x[:-trim_oov]
    oov_y = oov_y[:-trim_oov]

oov_dataset = TopicDataset(
    x=oov_x,
    y=oov_y,
    n_features=300,
    id2chunk=new_sgen.id2chunk,
    seq_len=new_sgen.seq_len)

# use the collate function of the new dataset so that ids map to the new vocabulary
oov_loader = DataLoader(
    oov_dataset, batch_size=BATCH_SIZE,
    shuffle=True, collate_fn=oov_dataset.collate,
    generator=torch.Generator(device='cuda'))

# sanity check on the new loader rather than on train_loader
for batch in oov_loader:
    break

batch[1].shape

torch.Size([32, 2])

In [43]:
# id2chunk is necessary to restore Noun Chunk's string
# two vocabs are for demonstration purposes
# to explore sequence, prediction and a correct result when
# a test dataset comes from a different domain
predict(
    model, oov_loader, 
    old_vocab=sgen.id2chunk, 
    new_vocab=new_sgen.id2chunk)

SEQ: football, daddy, PRED: ’ d, TRUE: Mom
SEQ: times, a father, PRED: a child, TRUE: my baseball
SEQ: nice clothing, the yard sale, PRED: many different things, TRUE: the university flea market
SEQ: a popular game, some people, PRED: a tough approach, TRUE: your country
SEQ: future, this interview, PRED: our candidates, TRUE: My pleasure
SEQ: the singer, the subway exit, PRED: their reason, TRUE: the subway exit
SEQ: just stories, some popular songs, PRED: your visitors, TRUE: an artist or song writer
SEQ: a chance, the contest, PRED: a writing sample, TRUE: the third audition
SEQ: the folk antique handicrafts and collectibles exhibition halls, how many kinds, PRED: 3 secretarial categories, TRUE: folk collections
SEQ: hanging, chinese, PRED: americans, TRUE: my pleasure
SEQ: music, this music, PRED: your zone, TRUE: everybody
SEQ: any sympathy, show biz stars, PRED: both entry level and management positions, TRUE: some credit
SEQ: the last part, the sixteenth note, PRED: working cond

## Outcomes

Outputs using test samples from the same domain are more meaningful, which is expected, but there are some relatively good cases in the "out-of-vocabulary" dataset:

**Same theme**:

SEQ: four years, a tuition scholarship, PRED: a vocational school, TRUE: the university

SEQ: this model, extraordinary capabilities, PRED: trial, TRUE: the next generation

**Different theme**:

SEQ: future, this interview, PRED: our candidates, TRUE: My pleasure

SEQ: the owner, the integration, PRED: software products, TRUE: the Han and Mongolian cultures


**TODO**

Explore the following approach:
1. Cluster themes and save centroids
2. Train smaller models for each theme
3. When a new sample comes, based on its vector and the centroids, find the most relevant theme and use its model
4. Compare the results
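Step 3 of the plan above could be sketched as a nearest-centroid lookup. This is a minimal illustration under stated assumptions: the toy 2-d centroids stand in for `kmeans.cluster_centers_`, and `nearest_centroid` and `theme_models` are hypothetical names, with strings in place of trained per-theme models.

```python
import numpy as np


def nearest_centroid(vector, centroids):
    """Return the index of the centroid closest to vector (Euclidean distance)."""
    distances = np.linalg.norm(centroids - vector, axis=1)
    return int(np.argmin(distances))


# Toy centroids; a real run would take them from the fitted KMeans object
toy_centroids = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])

# Hypothetical per-theme models, represented here by plain labels
theme_models = {0: "model_theme_0", 1: "model_theme_1", 2: "model_theme_2"}

new_dialogue_vec = np.array([9.0, 8.5])
theme = nearest_centroid(new_dialogue_vec, toy_centroids)
chosen_model = theme_models[theme]
print(theme, chosen_model)
```

A fitted scikit-learn `KMeans` also exposes `predict`, which assigns new vectors to the closest centroid, so saving the fitted object may be enough instead of saving the centroids separately.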