<a href="https://colab.research.google.com/github/deniskapel/topic_rec/blob/master/topic_rec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Next entity recommendation using spaCy noun chunks and LSTM

Steps:
1. Extract noun chunks using spaCy
2. Filter out the most frequent chunks
3. Cluster dialogues using chunk vectors and KMeans
4. Take the largest cluster and split it into train and test sets
5. Train an LSTM model to generate the next entity (noun chunk)

In [1]:
%%bash
# Skip this cell if you are using Jupyter instead of Colab. A GPU is required.
wget https://raw.githubusercontent.com/deniskapel/topic_rec/master/data_tools.py
wget https://raw.githubusercontent.com/deniskapel/topic_rec/master/torch_tools.py

In [2]:
import numpy as np
import spacy
import torch

from torch.utils.data import DataLoader
from sklearn.cluster import KMeans

from data_tools import Chunker, Vectorizer, SequenceGenerator
from torch_tools import TopicDataset, GModel, train, predict

In [3]:
spacy.prefer_gpu()
# load larger model to use spacy word vectors
spacy.cli.download("en_core_web_md")
nlp = spacy.load("en_core_web_md")

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


In [32]:
device = torch.device('cuda')
BATCH_SIZE = 16
SEQ_LEN = 1  # length of the input sequence used to predict the next chunk

## Data

In [5]:
%%bash
# skip this cell if the data is already downloaded
wget http://yanran.li/files/ijcnlp_dailydialog.zip
unzip ijcnlp_dailydialog.zip
mv ijcnlp_dailydialog data

In [6]:
"""
Work only with the train set: topics were supposed to be filtered out
using keywords, not topic labels.
"""
with open('data/dialogues_text.txt') as file:
    dialogues = file.readlines()
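
Each line read here holds one whole dialogue, with every utterance terminated by an `__eou__` marker. A minimal sketch of splitting such a line into utterances follows. `split_utterances` is a hypothetical helper for illustration; whether `Chunker` splits on the marker internally is not shown in this notebook.

```python
EOU = "__eou__"  # end-of-utterance marker used in dialogues_text.txt


def split_utterances(line):
    """Split one raw dialogue line into stripped, non-empty utterances."""
    return [part.strip() for part in line.split(EOU) if part.strip()]


utterances = split_utterances(
    "The kitchen stinks . __eou__ I'll throw out the garbage . __eou__\n"
)
print(utterances)
# ['The kitchen stinks .', "I'll throw out the garbage ."]
```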

In [7]:
dialogues[0]

"The kitchen stinks . __eou__ I'll throw out the garbage . __eou__\n"

### Normalizing data

In [8]:
chunker = Chunker(spacy_model=nlp, stop_words=nlp.Defaults.stop_words)
chunked_dials = chunker.normalize(dialogues[0:1000])

In [9]:
chunked_dials[90]

[Ve,
 your computer,
 Ve,
 your personal files,
 ’ m,
 all my important personal documents,
 that computer,
 no laughing matter,
 I ’ m,
 Don ’ t,
 my computer,
 t]
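
Fragments such as `’ m`, `Don ’ t` and `t` suggest that some dialogues write contractions with a spaced apostrophe, which the noun chunker then splits. One possible pre-cleaning step, not part of this notebook, is to rejoin such contractions before passing text to spaCy. The helper name and the suffix list below are assumptions for illustration.

```python
import re

# An apostrophe (curly or straight) with a space on each side, preceded by a
# word character and followed by a common contraction suffix such as m, t, ve.
_SPACED_APOSTROPHE = re.compile(
    r"(?<=\w) [’'] (?=(?:m|s|t|d|ll|re|ve)\b)", re.IGNORECASE
)


def rejoin_contractions(text):
    """Turn "I ’ m" into "I'm" while leaving other apostrophes alone."""
    return _SPACED_APOSTROPHE.sub("'", text)


print(rejoin_contractions("I ’ m sure you ’ ve lost it . Don ’ t worry ."))
# I'm sure you've lost it . Don't worry .
```

The suffix list keeps the pattern from gluing unrelated words together, so a phrase like `students ’ books` is left unchanged.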

Remove the 100 most common noun chunks, as they do not characterize individual dialogues.

**TODO**: experiment with a larger/smaller number and a stop list

In [33]:
most_common = [topic[0] for topic in chunker.counter.most_common(100)]
filtered = [chunker.filter_chunks(dial, most_common) for dial in chunked_dials]
set(filtered[90]) ^ set(chunked_dials[90])

{t}
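
The filtering idea can be sketched with plain strings and `collections.Counter`. This is only an illustration: `filter_common` and the toy data are hypothetical, and the notebook's `Chunker` works on spaCy spans rather than strings.

```python
from collections import Counter


def filter_common(dialogues, top_n):
    """Drop the top_n most frequent chunks from every dialogue."""
    counts = Counter(chunk for dial in dialogues for chunk in dial)
    common = {chunk for chunk, _ in counts.most_common(top_n)}
    return [[c for c in dial if c not in common] for dial in dialogues]


toy = [["i", "my computer", "you"], ["you", "the bill", "i"], ["i", "a favor"]]
print(filter_common(toy, top_n=2))
# [['my computer'], ['the bill'], ['a favor']]
```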

In [34]:
# drop dialogues with SEQ_LEN + 1 chunks or fewer,
# as no padding is implemented for now
filtered = [doc for doc in filtered if len(doc) > (SEQ_LEN+1)]

### Vectorizing data to cluster

In [35]:
vectorizer = Vectorizer(len(filtered))
vecs, doc2id = vectorizer.vectorize(filtered)
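
How `Vectorizer` builds one vector per dialogue is not shown here. A common choice, assumed for this sketch only, is to average the vectors of a dialogue's chunks so that every dialogue maps to a single point for KMeans. `mean_vector` is a hypothetical helper, written without NumPy to stay self-contained.

```python
def mean_vector(chunk_vectors):
    """Average a non-empty list of equal-length vectors element-wise."""
    n = len(chunk_vectors)
    return [sum(values) / n for values in zip(*chunk_vectors)]


# Two toy 3-dimensional chunk vectors standing in for 300-dimensional ones.
dialogue = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]]
print(mean_vector(dialogue))
# [2.0, 2.0, 1.0]
```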

In [36]:
kmeans = KMeans(n_clusters=5, random_state=0)
kmeans.fit(vecs)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)

In [37]:
unique, counts = np.unique(kmeans.labels_, return_counts=True)
topics = dict(zip(unique, counts))
topics

{0: 202, 1: 225, 2: 150, 3: 128, 4: 71}

In [38]:
id_max = max(topics, key=topics.get)
id_max

1

Take the largest cluster so that there is enough data for training

In [39]:
largest_theme = doc2id.loc[(kmeans.labels_== id_max).nonzero()]

In [40]:
largest_theme.sample(5)

| | doc |
|---|---|
| 78 | [(the, screen), (your, steering, wheel), (a, p... |
| 709 | [(some, dessert), (Your, bill), (my, tip)] |
| 736 | [(any, outdoor, interests), (My, only, recreat... |
| 769 | [(a, tumble), (No, big, deal), (trouble), (the... |
| 53 | [(a, favor), (No, sweat), (cake), (I, ’, m, a,... |


Build X and y for training

In [41]:
sgen = SequenceGenerator(SEQ_LEN)
seqs = sgen.get_sequences(largest_theme)
seqs.head()

| | seq | target |
|---|---|---|
| 0 | [725] | [154] |
| 1 | [154] | [1328] |
| 2 | [1328] | [835] |
| 3 | [835] | [989] |
| 4 | [989] | [60] |
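
The `seq`/`target` rows are consistent with a sliding window over the chunk ids of a dialogue: each window of `SEQ_LEN` ids is paired with the id that follows it. A minimal sketch under that assumption is below. `make_pairs` is hypothetical, and treating the five rows as coming from one dialogue is also an assumption.

```python
def make_pairs(chunk_ids, seq_len):
    """Return (sequence, target) pairs from a sliding window over chunk_ids."""
    return [
        (chunk_ids[i:i + seq_len], [chunk_ids[i + seq_len]])
        for i in range(len(chunk_ids) - seq_len)
    ]


ids = [725, 154, 1328, 835, 989, 60]
for seq, target in make_pairs(ids, seq_len=1):
    print(seq, target)
# [725] [154]
# [154] [1328]
# [1328] [835]
# [835] [989]
# [989] [60]
```

With a window like this, a dialogue needs at least `seq_len + 1` chunks to yield a single pair, which is why short dialogues have to be dropped when no padding is available.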


Train/test split

In [42]:
train_df=seqs.sample(frac=0.8,random_state=42)
test_df=seqs.drop(train_df.index)

In [43]:
train_df.shape, test_df.shape

((1544, 2), (386, 2))

## Data to torch DataLoader

In [44]:
# The LSTM's hidden state and cell state are sized for a fixed batch size.
# To avoid re-initializing them for every batch, make all batches equal in size.
# An alternative is to top up the last batch with a few examples from validation.
trim_train = train_df.shape[0] % BATCH_SIZE
trim_test = test_df.shape[0] % BATCH_SIZE

train_x = train_df['seq'].values
train_y = train_df['target'].values
test_x = test_df['seq'].values
test_y = test_df['target'].values
if trim_train>0:
    train_x = train_x[:-trim_train]
    train_y = train_y[:-trim_train]
if trim_test>0:
    test_x = test_x[:-trim_test]
    test_y = test_y[:-trim_test]

train_dataset = TopicDataset(
    x=train_x,
    y=train_y,
    n_features=300,
    id2chunk=sgen.id2chunk,
    chunk2id=sgen.chunk2id,
    seq_len=sgen.seq_len)

test_dataset = TopicDataset(
    x=test_x,
    y=test_y,
    n_features=300,
    id2chunk=sgen.id2chunk,
    chunk2id=sgen.chunk2id,
    seq_len=sgen.seq_len)
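
The trimming arithmetic for the split sizes reported earlier, (1544, 2) and (386, 2), can be checked with a small sketch. `full_batches` is a hypothetical helper, not part of `torch_tools`.

```python
def full_batches(n_examples, batch_size):
    """Return (examples kept, examples dropped, number of batches)."""
    dropped = n_examples % batch_size
    kept = n_examples - dropped
    return kept, dropped, kept // batch_size


print(full_batches(1544, 16))  # (1536, 8, 96)
print(full_batches(386, 16))   # (384, 2, 24)
```

PyTorch's `DataLoader` also accepts `drop_last=True`, which discards the final incomplete batch without manual slicing. With `shuffle=True`, the discarded examples then differ from epoch to epoch instead of always being the last rows.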

In [45]:
# vector size               target                  prev_seq
train_dataset[0][0].shape, train_dataset[0][1],train_dataset[0][2]

((1, 300), [878], [2057])

In [46]:
# vector size               target                  prev_seq
test_dataset[0][0].shape, test_dataset[0][1], test_dataset[0][2]

((1, 300), [1328], [154])

Wrap the train and test datasets in DataLoaders

In [47]:
train_loader = DataLoader(
    train_dataset, batch_size=BATCH_SIZE, 
    shuffle=True, collate_fn=train_dataset.collate,
    # TODO find out why shuffling requires a generator on gpu 
    # probably due to spacy vectors on gpu,
    # but shuffle=False works fine without it for some reason
    generator=torch.Generator(device='cuda'))

for batch in train_loader:
    break

batch[0].shape

torch.Size([16, 1, 300])

In [48]:
test_loader = DataLoader(
    test_dataset, batch_size=BATCH_SIZE, 
    shuffle=True, collate_fn=test_dataset.collate,
    generator=torch.Generator(device='cuda'))

In [49]:
for batch in test_loader:
    break

batch[1]

tensor([[1179],
        [  92],
        [ 474],
        [1640],
        [1790],
        [1436],
        [  96],
        [1558],
        [1737],
        [ 281],
        [1468],
        [1307],
        [1502],
        [2117],
        [ 695],
        [2084]])

## Train

Use an LSTM to generate the next topic.

**TODO:** experiment with the number of epochs, various combinations of layers, and longer/shorter sequences

**TODO2:** allow switching between bidirectional and one-way LSTM

In [50]:
model = GModel(vocab_size=len(sgen.id2chunk), seq_len=sgen.seq_len)

In [51]:
model.to(device)

GModel(
  (lstm): LSTM(300, 128, batch_first=True)
  (linear): Linear(in_features=128, out_features=2155, bias=True)
)
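
The printed architecture is enough to estimate the model size. The sketch below assumes a single unidirectional LSTM layer with PyTorch's default two bias vectors per gate, which is what `LSTM(300, 128, batch_first=True)` implies when no other arguments are shown. Both helper functions are hypothetical.

```python
def lstm_params(input_size, hidden_size):
    """Parameters of one unidirectional LSTM layer with both bias vectors."""
    # 4 gates, each with input weights, recurrent weights and two biases.
    per_gate = input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size
    return 4 * per_gate


def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


lstm = lstm_params(300, 128)
linear = linear_params(128, 2155)
print(lstm, linear, lstm + linear)
# 220160 277995 498155
```

More than half of the parameters sit in the output layer, because it scores every one of the 2155 chunks in the vocabulary.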

In [52]:
train(model, train_loader, epochs=10, lr=.001, clip_value=1.0)

Prediction uses only the first batch, for demonstration purposes. The data is shuffled, so the output differs on every call.

In [53]:
predict(model, test_loader, sgen.id2chunk)

After --so much time--, I suggest discussing --my doctor--
After --news--, I suggest discussing --my washing machine--
After --a drama queen--, I suggest discussing --the monkey--
After --the back--, I suggest discussing --which kind--
After --how about this one--, I suggest discussing --no regrets--
After --a dining table--, I suggest discussing --the rooms--
After --that window--, I suggest discussing --that table--
After --writing--, I suggest discussing --word-processing--
After --the piano--, I suggest discussing --a disco--
After --a huge slide--, I suggest discussing --home--
After --this tv--, I suggest discussing --very high quality--
After --the middle part--, I suggest discussing --the back--
After --other people--, I suggest discussing --the value--
After --even one song--, I suggest discussing --the front--
After --a ten-minute wait--, I suggest discussing --noon--
After --no results--, I suggest discussing --the discipline--


In [54]:
predict(model, test_loader, sgen.id2chunk)

After --a combination--, I suggest discussing --no gain--
After --a ride--, I suggest discussing --the facilities--
After --the cover charge--, I suggest discussing --your homework--
After --the sulfur--, I suggest discussing --the whole--
After --a football match--, I suggest discussing --tv play--
After --singing--, I suggest discussing --other people--
After --a doctor--, I suggest discussing --high blood pressure--
After --the brazilian team--, I suggest discussing --no doubt--
After --bar bell--, I suggest discussing --the man--
After --what a pity--, I suggest discussing --the lounge--
After --carmen--, I suggest discussing --a dining table--
After --even a little brook--, I suggest discussing --ice cream--
After --no results--, I suggest discussing --the discipline--
After --her birthday--, I suggest discussing --some advice--
After --some coffee--, I suggest discussing --especially the elder people--
After --the show--, I suggest discussing --the circus--
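
The internals of `predict` are not shown in this notebook. A plausible reading of the output format is that the highest-scoring chunk id is picked for each example and both ids are mapped back to text through `id2chunk`. The sketch below makes that assumption; `suggest`, the toy scores and the toy ids are all hypothetical.

```python
def suggest(prev_ids, scores, id2chunk):
    """Pick the highest-scoring chunk per row and format the suggestion."""
    lines = []
    for prev_id, row in zip(prev_ids, scores):
        best_id = max(range(len(row)), key=row.__getitem__)  # argmax of the row
        lines.append(
            f"After --{id2chunk[prev_id]}--, "
            f"I suggest discussing --{id2chunk[best_id]}--"
        )
    return lines


# Toy vocabulary of four chunks and one score row per example.
id2chunk = {0: "a doctor", 1: "high blood pressure", 2: "the show", 3: "the circus"}
scores = [[0.1, 2.3, 0.0, -1.0], [0.0, 0.2, 0.1, 1.5]]
for line in suggest([0, 2], scores, id2chunk):
    print(line)
# After --a doctor--, I suggest discussing --high blood pressure--
# After --the show--, I suggest discussing --the circus--
```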
