<a href="https://colab.research.google.com/github/deniskapel/topic_rec/blob/master/topic_rec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Next entity recommendation using spaCy noun chunks and LSTM

Steps:
1. Extract noun chunks using spaCy
2. Filter out the most frequent noun chunks
3. Cluster dialogues using chunk vectors and KMeans
4. Take the largest cluster and split it into train and test sets
5. Train an LSTM model to generate the next entity (noun chunk)

In [1]:
%%bash
# skip if you are using jupyter instead of colab. GPU is required.
wget https://raw.githubusercontent.com/deniskapel/topic_rec/master/data_tools.py
wget https://raw.githubusercontent.com/deniskapel/topic_rec/master/torch_tools.py

In [2]:
import numpy as np
import spacy
import torch

from torch.utils.data import DataLoader
from sklearn.cluster import KMeans

from data_tools import Chunker, Vectorizer, SequenceGenerator
from torch_tools import TopicDataset, GModel, train, predict

In [3]:
spacy.prefer_gpu()
# load larger model to use spacy word vectors
spacy.cli.download("en_core_web_md")
nlp = spacy.load("en_core_web_md")

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


In [4]:
device = torch.device('cuda')
BATCH_SIZE = 32
SEQ_LEN = 2  # number of preceding chunks used to predict the next one

## Data

In [5]:
%%bash
# skip if data is pre-loaded
wget http://yanran.li/files/ijcnlp_dailydialog.zip
unzip ijcnlp_dailydialog.zip
mv ijcnlp_dailydialog data

In [6]:
"""
Work with the train set only, as we were supposed to filter out topics
using key words, not topic labels.
"""
with open('data/dialogues_text.txt') as file:
    dialogues = file.readlines()

In [7]:
dialogues[0]

"The kitchen stinks . __eou__ I'll throw out the garbage . __eou__\n"

### Normalizing data

In [8]:
chunker = Chunker(spacy_model=nlp, stop_words=nlp.Defaults.stop_words)
chunked_dials = chunker.normalize(dialogues)

  0%|          | 0/13118 [00:00<?, ?it/s]

In [9]:
chunked_dials[90]

[Ve,
 your computer,
 Ve,
 your personal files,
 ’ m,
 all my important personal documents,
 that computer,
 no laughing matter,
 I ’ m,
 Don ’ t,
 my computer,
 t]

Remove the 100 most common noun chunks, as they do not characterize dialogues.

**TODO**: experiment with a larger/smaller number and a stop list

In [10]:
most_common = [topic[0] for topic in chunker.counter.most_common(100)]
filtered = [chunker.filter_chunks(dial, most_common) for dial in chunked_dials]
set(filtered[90]) ^ set(chunked_dials[90])

{’ m, I ’ m, Ve, Don ’ t, Ve, t}
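
The frequency filter above can be illustrated without spaCy. This is a minimal sketch on made-up data: it assumes chunks are plain strings (the notebook works with spaCy spans through `Chunker`, whose internals are not shown here), and `toy_dials`, `top_n` and `stop_chunks` are hypothetical names.

```python
from collections import Counter

# Toy dialogues, each already reduced to a list of noun-chunk strings (made-up data).
toy_dials = [
    ["i", "your computer", "my files"],
    ["i", "the kitchen", "the garbage"],
    ["i", "my computer", "the garbage"],
]

# Count every chunk across the whole corpus.
chunk_counts = Counter(chunk for dial in toy_dials for chunk in dial)

# Use the N most frequent chunks as a stop list (N=100 in the notebook, 2 here).
top_n = 2
stop_chunks = {chunk for chunk, _ in chunk_counts.most_common(top_n)}

# Drop stop-listed chunks from each dialogue, preserving the original order.
toy_filtered = [[c for c in dial if c not in stop_chunks] for dial in toy_dials]

print(sorted(stop_chunks))
print(toy_filtered)
```

Keeping the stop list in a `set` makes each membership check constant-time, which matters once the list grows to hundreds of chunks.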

In [11]:
# keep only dialogues with more than SEQ_LEN + 1 chunks,
# as no padding is implemented for now
filtered = [doc for doc in filtered if len(doc) > (SEQ_LEN+1)]

### Vectorizing data to cluster

In [12]:
vectorizer = Vectorizer(len(filtered))
vecs, doc2id = vectorizer.vectorize(filtered)
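
KMeans needs one fixed-size vector per dialogue, while dialogues contain varying numbers of chunks. One common way to get there is mean pooling of the chunk vectors. The sketch below assumes that approach; it is not necessarily what `Vectorizer` does, and the random vectors only stand in for spaCy chunk vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 300  # vector size used throughout the notebook

# Stand-in for chunk vectors: each toy dialogue has a different number of chunks.
toy_chunk_vectors = [rng.normal(size=(n_chunks, n_features)) for n_chunks in (4, 7, 5)]

# Mean-pool the chunk vectors so that each dialogue becomes one fixed-size row.
toy_doc_vectors = np.stack([chunks.mean(axis=0) for chunks in toy_chunk_vectors])

print(toy_doc_vectors.shape)
```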

In [13]:
kmeans = KMeans(n_clusters=10, random_state=0)
kmeans.fit(vecs)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=10, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)

In [14]:
unique, counts = np.unique(kmeans.labels_, return_counts=True)
topics = dict(zip(unique, counts))
topics

{0: 1422,
 1: 1076,
 2: 1606,
 3: 1662,
 4: 600,
 5: 628,
 6: 1195,
 7: 523,
 8: 1440,
 9: 740}

In [15]:
id_max = max(topics, key=topics.get)
id_max

3

Take the largest cluster to have data for training
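
As a cross-check, the same selection can be reproduced from the cluster sizes printed above. This is only a sketch: the sizes are copied from the output, and `toy_labels` is a made-up label array showing how `np.bincount` would count cluster members directly from labels.

```python
import numpy as np

# Cluster sizes copied from the output above (cluster id == position in the array).
cluster_sizes = np.array([1422, 1076, 1606, 1662, 600, 628, 1195, 523, 1440, 740])

largest = int(cluster_sizes.argmax())          # id of the biggest cluster
total = int(cluster_sizes.sum())               # number of clustered dialogues
share = float(cluster_sizes[largest] / total)  # fraction kept for training and testing

print(largest, total, round(share, 3))

# With raw labels, np.bincount gives the same per-cluster counts.
toy_labels = np.array([0, 2, 2, 1, 2])
toy_largest = int(np.bincount(toy_labels).argmax())
```

Only about 15% of the clustered dialogues end up in the largest cluster, so the model is trained on a fairly small slice of the corpus.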

In [16]:
largest_theme = doc2id.loc[(kmeans.labels_== id_max).nonzero()]

In [17]:
largest_theme.sample(5)

| | doc |
|---|---|
| 4181 | [(my, handwriting), (several, weeks), (my, han... |
| 8826 | [(your, strongest, trait), (Adaptability), (se... |
| 5507 | [(Morning), (your, certificate), (sick, -, lea... |
| 9978 | [(Beijing, Normal, University), (a, major), (E... |
| 2537 | [(Damn, it), (a, blackout), (Seinfeld), (So, w... |


Build X and y for training

In [18]:
sgen = SequenceGenerator(SEQ_LEN)
seqs = sgen.get_sequences(largest_theme)
seqs.head()

| | seq | target |
|---|---|---|
| 0 | [8225, 20887] | [20887, 12564] |
| 1 | [20887, 12564] | [12564, 18878] |
| 2 | [12564, 18878] | [18878, 20988] |
| 3 | [18878, 20988] | [20988, 15593] |
| 4 | [1318, 13025] | [13025, 7623] |
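
The rows above suggest a sliding window in which each target is the input sequence shifted one chunk ahead. Here is a minimal sketch of that pattern. `make_pairs` is a hypothetical helper, not the actual `SequenceGenerator` code, and the toy dialogue reuses the chunk ids from rows 0 to 3.

```python
def make_pairs(chunk_ids, seq_len):
    """Return (seq, target) pairs where target is seq shifted one step ahead."""
    pairs = []
    # A dialogue with n chunks yields n - seq_len pairs.
    for start in range(len(chunk_ids) - seq_len):
        seq = chunk_ids[start:start + seq_len]
        target = chunk_ids[start + 1:start + seq_len + 1]
        pairs.append((seq, target))
    return pairs

toy_dialogue = [8225, 20887, 12564, 18878, 20988, 15593]
toy_pairs = make_pairs(toy_dialogue, seq_len=2)

for seq, target in toy_pairs:
    print(seq, target)
```

Under this scheme a dialogue needs at least `seq_len + 1` chunks to produce a single pair, which is why short dialogues are filtered out earlier.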


Train/test split

In [19]:
train_df=seqs.sample(frac=0.8,random_state=42)
test_df=seqs.drop(train_df.index)

In [20]:
train_df.shape, test_df.shape

((18533, 2), (4633, 2))

## Data to torch DataLoader

In [21]:
# the LSTM expects hidden_state and cell_state of a fixed size
# to avoid re-initializing them for each batch, make all batches the same size
# an alternative is to top up the last batch with a few examples from validation
trim_train = train_df.shape[0] % BATCH_SIZE
trim_test = test_df.shape[0] % BATCH_SIZE

train_x = train_df['seq'].values
train_y = train_df['target'].values
test_x = test_df['seq'].values
test_y = test_df['target'].values
if trim_train>0:
    train_x = train_x[:-trim_train]
    train_y = train_y[:-trim_train]
if trim_test>0:
    test_x = test_x[:-trim_test]
    test_y = test_y[:-trim_test]

train_dataset = TopicDataset(
    x=train_x,
    y=train_y,
    n_features=300,
    id2chunk=sgen.id2chunk,
    chunk2id=sgen.chunk2id,
    seq_len=sgen.seq_len)

test_dataset = TopicDataset(
    x=test_x,
    y=test_y,
    n_features=300,
    id2chunk=sgen.id2chunk,
    chunk2id=sgen.chunk2id,
    seq_len=sgen.seq_len)
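
The trimming arithmetic can be checked in isolation. This sketch uses the split sizes reported above; `full_batches` is a hypothetical helper written only for the illustration.

```python
def full_batches(n_examples, batch_size):
    """Return (examples kept, number of batches) after dropping the incomplete batch."""
    remainder = n_examples % batch_size
    kept = n_examples - remainder
    return kept, kept // batch_size

train_kept, train_batches = full_batches(18533, 32)
test_kept, test_batches = full_batches(4633, 32)

print(train_kept, train_batches)
print(test_kept, test_batches)
```

The `if trim > 0` guards in the cell above matter because slicing with `[:-0]` would return an empty array. `DataLoader` also accepts `drop_last=True`, which discards the final incomplete batch without manual slicing.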

In [22]:
# vector size               target                  prev_seq
train_dataset[0][0].shape, train_dataset[0][1],train_dataset[0][2]

((2, 300), [25485, 22436], [6245, 25485])

In [23]:
# vector size               target                  prev_seq
test_dataset[0][0].shape, test_dataset[0][1], test_dataset[0][2]

((2, 300), [20586, 8204], [9899, 20586])

Wrap the train and test datasets in DataLoaders

In [24]:
train_loader = DataLoader(
    train_dataset, batch_size=BATCH_SIZE, 
    shuffle=True, collate_fn=train_dataset.collate,
    # TODO find out why shuffling requires a generator on gpu 
    # probably due to spacy vectors on gpu,
    # but shuffle=False works fine without it for some reason
    generator=torch.Generator(device='cuda'))

for batch in train_loader:
    break

batch[0].shape

torch.Size([32, 2, 300])

In [25]:
test_loader = DataLoader(
    test_dataset, batch_size=BATCH_SIZE, 
    shuffle=True, collate_fn=test_dataset.collate,
    generator=torch.Generator(device='cuda'))

In [26]:
for batch in test_loader:
    break

batch[1].shape

torch.Size([32, 2])

## Train

Use an LSTM to generate the next topic

**TODO:** experiment with number of epochs, various combinations of layers, larger/smaller sequence

**TODO2:** allow switching between bidirectional and unidirectional LSTM

In [27]:
model = GModel(vocab_size=len(sgen.id2chunk),
               seq_len=sgen.seq_len, bidirectional=False)

In [28]:
model.to(device)

GModel(
  (lstm): LSTM(300, 128, batch_first=True)
  (linear): Linear(in_features=128, out_features=26490, bias=True)
)
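
For readers without `torch_tools.py` at hand, a model with the same two layers as the printout might look like the sketch below. `NextChunkLSTM` is a hypothetical class, not the actual `GModel`: it assumes the linear layer is applied at every time step, and it lets PyTorch create zero initial states instead of passing hidden and cell states explicitly.

```python
import torch
from torch import nn


class NextChunkLSTM(nn.Module):
    """Hypothetical stand-in for GModel: LSTM over chunk vectors plus a vocabulary classifier."""

    def __init__(self, vocab_size, n_features=300, hidden_size=128, bidirectional=False):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True,
                            bidirectional=bidirectional)
        # A bidirectional LSTM concatenates both directions, doubling the output width.
        out_features = hidden_size * (2 if bidirectional else 1)
        self.linear = nn.Linear(out_features, vocab_size)

    def forward(self, x):
        output, _ = self.lstm(x)    # (batch, seq_len, out_features)
        return self.linear(output)  # (batch, seq_len, vocab_size)


toy_vocab = 50  # tiny vocabulary for the demo; the notebook has 26490 chunks
toy_net = NextChunkLSTM(vocab_size=toy_vocab)
toy_x = torch.randn(4, 2, 300)               # (batch, SEQ_LEN, n_features)
toy_y = torch.randint(0, toy_vocab, (4, 2))  # one target id per time step

logits = toy_net(toy_x)
# Flatten batch and time so cross-entropy sees one prediction per row.
loss = nn.functional.cross_entropy(logits.reshape(-1, toy_vocab), toy_y.reshape(-1))
print(tuple(logits.shape), loss.item())
```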

In [29]:
for p in model.named_parameters():
    print(p[0], p[1].shape)

lstm.weight_ih_l0 torch.Size([512, 300])
lstm.weight_hh_l0 torch.Size([512, 128])
lstm.bias_ih_l0 torch.Size([512])
lstm.bias_hh_l0 torch.Size([512])
linear.weight torch.Size([26490, 128])
linear.bias torch.Size([26490])
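
The first dimension of the LSTM weights is 512 because PyTorch stacks the four gates (input, forget, cell, output) into one matrix: 4 × 128 = 512. The sketch below recomputes the parameter counts from the shapes printed above, using only those shapes as input.

```python
input_size, hidden_size, vocab_size = 300, 128, 26490
gates = 4  # input, forget, cell and output gates are stacked in one matrix

lstm_params = (
    gates * hidden_size * input_size     # weight_ih_l0: (512, 300)
    + gates * hidden_size * hidden_size  # weight_hh_l0: (512, 128)
    + 2 * gates * hidden_size            # bias_ih_l0 and bias_hh_l0: (512,) each
)
linear_params = vocab_size * hidden_size + vocab_size  # weight plus bias

print(lstm_params, linear_params, lstm_params + linear_params)
```

Most parameters sit in the output layer, because every one of the 26490 noun chunks is its own class.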


In [30]:
train(model, train_loader, epochs=15, lr=.001, clip_value=1.0)

Epoch 1:   0%|          | 0/579 [00:00<?, ?it/s]

Epoch 2:   0%|          | 0/579 [00:00<?, ?it/s]

Epoch 3:   0%|          | 0/579 [00:00<?, ?it/s]

Epoch 4:   0%|          | 0/579 [00:00<?, ?it/s]

Epoch 5:   0%|          | 0/579 [00:00<?, ?it/s]

Epoch 6:   0%|          | 0/579 [00:00<?, ?it/s]

Epoch 7:   0%|          | 0/579 [00:00<?, ?it/s]

Epoch 8:   0%|          | 0/579 [00:00<?, ?it/s]

Epoch 9:   0%|          | 0/579 [00:00<?, ?it/s]

Epoch 10:   0%|          | 0/579 [00:00<?, ?it/s]

Epoch 11:   0%|          | 0/579 [00:00<?, ?it/s]

Epoch 12:   0%|          | 0/579 [00:00<?, ?it/s]

Epoch 13:   0%|          | 0/579 [00:00<?, ?it/s]

Epoch 14:   0%|          | 0/579 [00:00<?, ?it/s]

Epoch 15:   0%|          | 0/579 [00:00<?, ?it/s]

Prediction uses only the first batch, for demonstration purposes. The data is shuffled, so the output differs on every run.
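
Turning model scores into a suggestion like the ones printed by `predict` could work roughly as sketched here. This is an assumption about the decoding step, not the actual `predict` code: `suggest` and `toy_id2chunk` are hypothetical, and the scores are made up.

```python
import numpy as np

# Tiny made-up vocabulary; the notebook uses sgen.id2chunk instead.
toy_id2chunk = {0: "campus", 1: "the freshmen", 2: "the weekends", 3: "homework"}


def suggest(prev_ids, last_step_logits, id2chunk):
    """Pick the highest-scoring chunk and format it next to the preceding chunks."""
    next_id = int(np.argmax(last_step_logits))
    history = ", ".join(id2chunk[i] for i in prev_ids)
    return f"After --{history}--, I suggest discussing --{id2chunk[next_id]}--"


message = suggest([0, 1], np.array([0.1, 0.3, 2.5, 0.7]), toy_id2chunk)
print(message)
```

Taking the arg max always returns the single most likely chunk; sampling from the scores would be an alternative if more varied suggestions are wanted.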

In [31]:
predict(model, test_loader, sgen.id2chunk)

After --paperwork, other responsibilities--, I suggest discussing --any experience--
After --any books, genetic engineering--, I suggest discussing --books--
After --a big party, the professors--, I suggest discussing --students--
After --campus, the freshmen--, I suggest discussing --the weekends--
After --my overwork, my boy--, I suggest discussing --my sympathy--
After --a curve, percentages--, I suggest discussing --homework--
After --jane, our company's insurance policies--, I suggest discussing --attention--
After --a new trademark, the patent law--, I suggest discussing --effort--
After --working conditions, the scale--, I suggest discussing --workers--
After --how many employees, this plant--, I suggest discussing --three shifts--
After --study, the library--, I suggest discussing --a new student--
After --a strike, a dispute--, I suggest discussing --the significance--
After --projects, jobs--, I suggest discussing --the high tech industry--
After --the details, any major prob

In [32]:
predict(model, test_loader, sgen.id2chunk)

After --a three-day vacation, only three days--, I suggest discussing --the company policy--
After --a joint venture, a market development manager--, I suggest discussing --the market--
After --shanghai, my father--, I suggest discussing --the oil business--
After --the point, human beings--, I suggest discussing --your point--
After --consumer, the overhead--, I suggest discussing --an entire computer system--
After --oh , breaktime flies, mam--, I suggest discussing --your service--
After --this transaction, the near future--, I suggest discussing --a higher position--
After --a college graduate, tianjin college--, I suggest discussing --commerce--
After --an economic expert, valuable suggestions--, I suggest discussing --the effort--
After --remembers ’ day, poland--, I suggest discussing --their independence day--
After --your educational background, honors--, I suggest discussing --the scholarship--
After --not everyone, such a great boss--, I suggest discussing --my company--
Aft

In [33]:
!mkdir -p model
model_save_name = 'entity_generator.pt'
# /content is Colab's default working directory; adjust the path when running elsewhere
path = f"/content/model/{model_save_name}"
torch.save(model.state_dict(), path)
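
Since only the `state_dict` is saved, the weights have to be loaded into a freshly built model of the same architecture. The round trip below is a sketch with a tiny `nn.Linear` standing in for the trained model and a temporary directory instead of the Colab path.

```python
import os
import tempfile

import torch
from torch import nn

toy_model = nn.Linear(4, 2)  # stand-in for the trained model

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "entity_generator.pt")
    torch.save(toy_model.state_dict(), toy_path)

    # Rebuild the same architecture first, then load the saved weights into it.
    restored = nn.Linear(4, 2)
    restored.load_state_dict(torch.load(toy_path, map_location="cpu"))
    restored.eval()  # switch to inference mode before predicting

weights_match = torch.equal(toy_model.weight, restored.weight)
print(weights_match)
```

For the notebook's model, the rebuilt object would need the same `vocab_size`, `seq_len` and `bidirectional` arguments that were used for training; `map_location="cpu"` lets GPU-trained weights load on a machine without CUDA.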