# Language Modeling


Let's try a language modeling task using chariot and TensorFlow.

* Download the text8 dataset with chazutsu.
* Preprocess text8 with chariot.
* Build the model with TensorFlow (using tf.keras).
* Train and evaluate the model.

This tutorial needs the following libraries.

* chazutsu
* tensorflow


## Prepare the packages

In [1]:
%load_ext autoreload
%autoreload 2


import os
import sys
from pathlib import Path
import numpy as np


def set_path():
    if "../" not in sys.path:
        sys.path.append("../")
    root_dir = Path.cwd()
    return root_dir

ROOT_DIR = set_path()

## Download the Language Modeling Data

In [2]:
import chazutsu
from chariot.storage import Storage

storage = Storage.setup_data_dir(ROOT_DIR)
r = chazutsu.datasets.Text8().download(storage.data_path("raw"))

Read resource from the existed resource(if you want to retry, set force=True).


In [3]:
r.train_data().head(1)

Unnamed: 0,sentence
0,anarchism originated as a term of abuse first ...


In [4]:
train_data = r.train_data()
train_data["sentence"] = train_data["sentence"].apply(lambda x: x[:100000])
train_data["sentence"].apply(lambda x: len(x))

0    100000
Name: sentence, dtype: int64

## Preprocess the text with chariot

### Make preprocessor

In [5]:
import chariot.transformer as ct
from chariot.preprocessor import Preprocessor


lm_processor = Preprocessor(
                    text_transformers=[ct.text.UnicodeNormalizer()],
                    tokenizer=ct.Tokenizer(lang=None),
                    vocabulary=ct.Vocabulary(min_df=2, max_df=5000))

preprocessed = lm_processor.fit_transform(train_data)

In [6]:
print(len(lm_processor.vocabulary.get()))

1549


## Make model by TensorFlow

In [7]:
from tensorflow import keras as K  # public tf.keras entry point


vocab_size = lm_processor.vocabulary.count
embedding_size = 50
hidden_size = 75

def make_model():
    model = K.Sequential()
    model.add(K.layers.Embedding(input_dim=vocab_size, output_dim=embedding_size))
    model.add(K.layers.LSTM(hidden_size))
    model.add(K.layers.Dense(vocab_size, activation="softmax"))
    return model

model = make_model()
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])

## Train the Model

In [8]:
from chariot.feeder import LanguageModelFeeder


feeder = LanguageModelFeeder({"sentence": ct.formatter.ShiftGenerator()})
steps_per_epoch, generator = feeder.make_generator(preprocessed, batch_size=25, sequence_length=10,
                                                   sequencial=False)

metrics = model.fit_generator(generator(), steps_per_epoch, epochs=100, verbose=0)
print("loss={}, acc={}".format(metrics.history["loss"][-1], metrics.history["acc"][-1]))

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


loss=0.5220610836986452, acc=0.9424999924376607
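
The model above ends in a single softmax layer, so each training example pairs a window of `sequence_length` token ids with the one token that follows it. The sketch below illustrates that shifting idea in plain NumPy. It is a minimal illustration and not chariot's implementation: `make_shifted_windows` is a hypothetical helper, and the exact windows that `ShiftGenerator` and `LanguageModelFeeder` produce may differ.

```python
import numpy as np


def make_shifted_windows(token_ids, sequence_length):
    """Build (input window, next-token target) pairs from one id sequence.

    Hypothetical helper for illustration only; not part of chariot.
    """
    ids = np.asarray(token_ids)
    n_windows = len(ids) - sequence_length
    if n_windows <= 0:
        raise ValueError("sequence is too short for the requested window length")
    # Row i holds ids[i : i + sequence_length].
    inputs = np.stack([ids[i:i + sequence_length] for i in range(n_windows)])
    # The target for row i is the token that directly follows that window.
    targets = ids[sequence_length:]
    return inputs, targets


inputs, targets = make_shifted_windows([5, 8, 2, 9, 4, 7], sequence_length=3)
print(inputs)   # rows: [5 8 2], [8 2 9], [2 9 4]
print(targets)  # [9 4 7]
```

With `sparse_categorical_crossentropy`, targets of this form stay as integer ids, so no one-hot encoding is needed.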


## Try Generating Text

In [9]:
def generate_text(seed_text, lm_processor, model, sequence_length=10, iteration=20):
    preprocessed = lm_processor.transform([seed_text])[0]

    def pad_sequence(tokens, length):
        if len(tokens) < length:
            pad_size = length - len(tokens)
            return tokens + [lm_processor.vocabulary.pad] * pad_size
        elif len(tokens) > length:
            return tokens[-length:]
        else:
            return tokens

    for _ in range(iteration):
        x = pad_sequence(preprocessed, sequence_length)
        # Pass a (1, sequence_length) array so the tokens form one sequence.
        y = model.predict(np.array([x]))[0]
        w = np.random.choice(np.arange(len(y)), 1, p=y)[0]
        preprocessed.append(w)
    
    decoded = lm_processor.inverse_transform([preprocessed])
    text = " ".join(decoded[0])

    return text

In [12]:
generate_text("when you", lm_processor, model)

'when you play patients anti up mainland student something interests european communism others on continues develop society individualist due distinct arab personal'

In [13]:
generate_text("i wish to", lm_processor, model)

'i wish to called state established active impact also determining personal tend cnt many there from for programming asperger the a common toxin'
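
`generate_text` draws each next word at random from the softmax output, which is why the generated text changes from run to run. The sketch below isolates that sampling step and contrasts it with greedy (argmax) decoding. It is a minimal sketch under stated assumptions: `sample_next_id` is a hypothetical helper, not part of chariot or Keras, and the renormalization step is an added precaution against float32 rounding rather than something the original code does.

```python
import numpy as np


def sample_next_id(probabilities, rng, greedy=False):
    """Pick the next token id from a probability vector.

    Hypothetical helper for illustration only; not part of chariot or Keras.
    """
    p = np.asarray(probabilities, dtype=np.float64)
    # Renormalize so float32 rounding in a softmax output cannot make the
    # probabilities fail the sampler's sum-to-one check.
    p = p / p.sum()
    if greedy:
        # Always take the most likely token (deterministic).
        return int(np.argmax(p))
    # Draw one token id in proportion to its probability.
    return int(rng.choice(len(p), p=p))


rng = np.random.default_rng(0)
probs = np.array([0.1, 0.7, 0.2], dtype=np.float32)
print(sample_next_id(probs, rng, greedy=True))  # 1
print(sample_next_id(probs, rng))               # 0, 1, or 2, chosen at random
```

Greedy decoding tends to repeat the most frequent words, while sampling gives more varied but less predictable output.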