# Tutorial 2: Create a machine translator for your language of choice

## Step 1: Choose your language pair

The first step is to choose the language pair you want to translate. Go to the following website: http://www.manythings.org/anki/. This website has a large collection of bilingual sentence pairs in many different languages. Choose the language pair that you want to work with and download the corresponding data set. For example, if you want to translate from English to French, download the "French - English" data set.
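Each line of the downloaded file holds a sentence, its translation, and a trailing comment column, separated by tabs (this is the layout the loading code in Step 2 assumes). A minimal sketch of that layout follows, using two illustrative lines instead of the real download; the comment text is a made-up placeholder, and `read_pairs` is a hypothetical helper name.

```python
import csv
import io

# Two illustrative lines in the tab-separated layout: source, target, comment.
# The comment column here is placeholder text, not real file content.
sample = (
    "Come in.\tGaan binne.\tplaceholder attribution\n"
    "I'm full.\tEk is vol.\tplaceholder attribution\n"
)

def read_pairs(handle):
    """Return (source, target) tuples, ignoring any trailing comment column."""
    # QUOTE_NONE keeps apostrophes and quotation marks as ordinary characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[0], row[1]) for row in reader if len(row) >= 2]

pairs = read_pairs(io.StringIO(sample))
print(pairs)
```

To try it on a real download, pass an open file handle instead of the `io.StringIO` object.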

## Step 2: Prepare the data

The next step is to prepare the data for training. This involves preprocessing the text, converting it into sequences of integers that can be fed into the model, and splitting the result into training and validation sets. You can use Python and the Keras library to do this.

In [1]:
import pandas as pd
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.layers import Input, LSTM, Dense, Embedding
from keras.models import Model
from keras.utils import pad_sequences

# Read in the dataset
data = pd.read_csv("afr.txt", delimiter='\t', header=None, names=['source', 'target', 'comments'])

# Remove comments column
data = data[['source', 'target']]

# Convert text to lowercase
data['source'] = data['source'].apply(lambda x: x.lower())
data['target'] = data['target'].apply(lambda x: x.lower())

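# Wrap each target sentence in start and end markers so the decoder knows
# where a translation begins and ends
target_texts = '<start> ' + data['target'] + ' <end>'

# Tokenize the text
source_tokenizer = Tokenizer()
source_tokenizer.fit_on_texts(data['source'])
# Keep '<' and '>' out of the filter list so the markers survive tokenization
target_tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n')
target_tokenizer.fit_on_texts(target_texts)

# Convert text to sequences of integers
source_sequences = source_tokenizer.texts_to_sequences(data['source'])
target_sequences = target_tokenizer.texts_to_sequences(target_texts)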

# Pad sequences to a fixed length
max_sequence_length = 100
source_data = pad_sequences(source_sequences, maxlen=max_sequence_length, padding='post')
target_data = pad_sequences(target_sequences, maxlen=max_sequence_length, padding='post')


In this code, we first read in the dataset and remove the comments column. We then convert the text to lowercase and tokenize the text using Keras' Tokenizer class. We also pad the sequences to a fixed length of 100.
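To make the tokenizing and padding steps concrete, here is a minimal pure-Python sketch of the same idea: assign each word an integer by frequency, then pad every sequence with zeros on the right. It is a simplified illustration, not the Keras implementation; the helper names are hypothetical, and Keras differs in details such as how ties are ordered and which end is truncated.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(word for text in texts for word in text.split())
    # Ties are broken alphabetically here (a simplification).
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    # Index 0 is left unused so it can serve as the padding value.
    return {word: i for i, word in enumerate(ordered, start=1)}

def to_padded(texts, word_index, maxlen):
    """Convert texts to integer lists, cut or right-padded with 0 to maxlen."""
    rows = []
    for text in texts:
        ids = [word_index[w] for w in text.split() if w in word_index]
        ids = ids[:maxlen]  # keep the first maxlen tokens
        rows.append(ids + [0] * (maxlen - len(ids)))
    return rows

texts = ["she runs", "you lost", "she lost"]
word_index = build_word_index(texts)
padded = to_padded(texts, word_index, maxlen=4)
print(word_index)
print(padded)
```

Reserving index 0 for padding is also why the model code adds 1 to `len(word_index)` when sizing the embedding and output layers.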

Next, we split the data into training and validation sets.

In [2]:
# Split the data into training and validation sets
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(source_data, target_data, test_size=0.2, random_state=42)


We can now define our encoder and decoder models.

In [3]:
# Define encoder input (source token ids)
encoder_inputs = Input(shape=(None,))
# Define decoder input (target token ids, fed in during training)
decoder_inputs = Input(shape=(None,))

# Define encoder embedding layer
encoder_embedding = Embedding(len(source_tokenizer.word_index) + 1, 256)
encoder_embedding_output = encoder_embedding(encoder_inputs)

# Define encoder LSTM layer
encoder_lstm = LSTM(256, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding_output)
encoder_states = [state_h, state_c]

# Define decoder embedding layer
decoder_embedding = Embedding(len(target_tokenizer.word_index) + 1, 256)
decoder_embedding_output = decoder_embedding(decoder_inputs)

# Define decoder LSTM layer
decoder_lstm = LSTM(256, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding_output, initial_state=encoder_states)

# Define output layer
decoder_dense = Dense(len(target_tokenizer.word_index) + 1, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Define model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)


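In this code, we define the encoder and decoder inputs, along with an embedding layer and an LSTM layer for each. The encoder's final hidden and cell states initialize the decoder LSTM, and a softmax output layer predicts a distribution over the target vocabulary at every time step. Finally, we combine everything into a single model.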
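To get a feel for the size of this architecture, the sketch below counts trainable parameters per layer using the standard formulas for embedding, LSTM, and dense layers. The vocabulary sizes are hypothetical placeholders; the real values are `len(word_index) + 1` for each tokenizer and depend on the dataset you downloaded.

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, output_dim):
    """A weight matrix plus one bias per output."""
    return input_dim * output_dim + output_dim

# Hypothetical vocabulary sizes; substitute the values from your own tokenizers.
source_vocab, target_vocab = 2000, 2500
dim = units = 256  # matches the layer sizes used in the model

total = (
    embedding_params(source_vocab, dim) + lstm_params(dim, units)    # encoder
    + embedding_params(target_vocab, dim) + lstm_params(dim, units)  # decoder
    + dense_params(units, target_vocab)                              # output layer
)
print(f"LSTM parameters per layer: {lstm_params(dim, units):,}")
print(f"Total parameters: {total:,}")
```

Once the model is built, `model.summary()` reports the actual counts, which you can compare against these formulas.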

We can now compile and train the model.

In [4]:
from keras.utils import to_categorical


# Compile model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Define batch size and number of epochs
batch_size = 64
epochs = 10

# Define generator for training data
def generate_batch(X=X_train, y=y_train, batch_size=batch_size):
    while True:
        for i in range(0, len(X), batch_size):
            encoder_input_data = X[i:i + batch_size]
            decoder_input_data = y[i:i + batch_size, :-1]
            decoder_output_data = y[i:i + batch_size, 1:]
            encoder_input_data = np.array(encoder_input_data)
            decoder_input_data = np.array(decoder_input_data)
            decoder_output_data = np.array(decoder_output_data)
            decoder_output_data = to_categorical(decoder_output_data, num_classes=len(target_tokenizer.word_index) + 1)
            yield ([encoder_input_data, decoder_input_data], decoder_output_data)

# Define generator for validation data
def generate_validation(X=X_val, y=y_val):
    encoder_input_data = np.array(X)
    decoder_input_data = np.array(y[:, :-1])
    decoder_output_data = np.array(y[:, 1:])
    decoder_output_data = to_categorical(decoder_output_data, num_classes=len(target_tokenizer.word_index) + 1)
    return ([encoder_input_data, decoder_input_data], decoder_output_data)

# Train model (Model.fit accepts generators; fit_generator is deprecated)
model.fit(generate_batch(),
          steps_per_epoch=len(X_train) // batch_size,
          epochs=epochs,
          validation_data=generate_validation())



Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1f0421e4130>

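In this code, we define the batch size and number of epochs, along with generators for the training and validation data. generate_batch yields the training data one batch at a time, while generate_validation returns the whole validation set at once. In both, the decoder input is the target sequence without its last token, and the expected output is the same sequence shifted one position to the left, one-hot encoded with to_categorical to match the categorical_crossentropy loss. We then train the model on the batches produced by generate_batch.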
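The shift between decoder input and expected output is easier to see on a single row. Below is a minimal pure-Python sketch of that slicing and of one-hot encoding; the token ids are hypothetical (1 for the start marker, 2 for the end marker, 0 for padding), and the helper names are made up for illustration.

```python
def shift_for_teacher_forcing(target_row):
    """Split one padded target row into decoder input and expected output."""
    decoder_input = target_row[:-1]   # everything except the last position
    decoder_output = target_row[1:]   # the same row shifted left by one
    return decoder_input, decoder_output

def one_hot(ids, num_classes):
    """Turn each integer id into a vector with a single 1 at that index."""
    return [[1 if k == i else 0 for k in range(num_classes)] for i in ids]

# Hypothetical ids: 1 = start marker, 2 = end marker, 0 = padding
row = [1, 4, 5, 2, 0, 0]
dec_in, dec_out = shift_for_teacher_forcing(row)
encoded = one_hot(dec_out, num_classes=6)

print("decoder input: ", dec_in)
print("expected output:", dec_out)
print("first one-hot row:", encoded[0])
```

At every position the decoder sees the current token and is trained to predict the next one. The one-hot array has one entry per vocabulary word at each position, which is why the training data is encoded batch by batch rather than all at once.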

Finally, we can make predictions using the trained model.

In [5]:
# Define encoder model
encoder_model = Model(encoder_inputs, encoder_states)

# Define decoder inputs
decoder_state_input_h = Input(shape=(256,))
decoder_state_input_c = Input(shape=(256,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_inputs = Input(shape=(None,))
decoder_embedding_output = decoder_embedding(decoder_inputs)
decoder_outputs, state_h, state_c = decoder_lstm(decoder_embedding_output, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)

# Define function to decode sequence
def decode_sequence(input_seq):
    # Encode the input sequence to get the encoder states
    states_value = list(encoder_model.predict(input_seq))

    # Generate empty target sequence of length 1
    target_seq = np.zeros((1, 1))

    # Populate the first position of the target sequence with the start token
    target_seq[0, 0] = target_tokenizer.word_index['<start>']

    # Generate the output sequence one word at a time
    decoded_words = []
    while len(decoded_words) < max_sequence_length:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Pick the most likely next token (greedy decoding)
        sampled_token_index = int(np.argmax(output_tokens[0, -1, :]))
        # Index 0 is the padding value and has no entry in index_word
        sampled_token = target_tokenizer.index_word.get(sampled_token_index, '<end>')

        # Exit condition: stop when the end token is predicted
        if sampled_token == '<end>':
            break
        decoded_words.append(sampled_token)

        # Update the target sequence with the token just predicted
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index

        # Update states
        states_value = [h, c]

    # Return only after the loop has finished
    return ' '.join(decoded_words)



In this code, we build two models for inference that reuse the trained layers. The encoder model maps an input sequence to its final LSTM states. The decoder model takes the previous token and the current states, and returns a distribution over the next token together with the updated states. The decode_sequence function ties them together: it starts from the start token, repeatedly picks the most likely next word, and stops when it predicts the end token or reaches the maximum length.
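The loop inside decode_sequence can be illustrated without a trained model. The sketch below replaces the decoder with a hypothetical lookup table that maps each token to the next one, so only the control flow is shown; a real decoder also carries its LSTM states from step to step.

```python
def greedy_decode(next_token, start="<start>", end="<end>", max_len=10):
    """Ask next_token for the following word until end appears or max_len is hit."""
    tokens = []
    current = start
    while len(tokens) < max_len:
        current = next_token(current)
        if current == end:
            break
        tokens.append(current)
    return " ".join(tokens)

# Hypothetical stand-in for the decoder: a fixed next-token table
toy_table = {"<start>": "gaan", "gaan": "binne", "binne": "<end>"}
translation = greedy_decode(lambda tok: toy_table[tok])
print(translation)

# A table that never emits the end marker is cut off by max_len
looping = greedy_decode(lambda tok: "a", max_len=3)
print(looping)
```

The length check matters in practice because an undertrained model may never predict the end token.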

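That's it! You now have a working sequence-to-sequence model that can be used for machine translation. The remaining cells define a helper, sequence_to_text, that converts a sequence of integers back to text, and then try the trained model on a new sentence.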

In [6]:
def sequence_to_text(sequence, tokenizer):
    """
    Converts a sequence of integers to its corresponding text sequence.
    
    Args:
    - sequence (np.array): A sequence of integers.
    - tokenizer (keras.preprocessing.text.Tokenizer): A tokenizer fitted on the text data.
    
    Returns:
    - A string representing the text sequence.
    """
    text = tokenizer.sequences_to_texts([sequence])[0]
    return text


In [10]:
data.head()

       source        target
0    come in.   gaan binne.
1   i'm full.    ek is vol.
2   she runs.  sy hardloop.
3   you lost.   jy verloor.
4  go inside.   gaan binne.


In [14]:
# Define a new sentence to translate
new_sentence = "come in"

# Convert the sentence to a sequence of integers
new_sentence_seq = source_tokenizer.texts_to_sequences([new_sentence])[0]

# Pad the sequence to the same length as the training data
new_sentence_seq = pad_sequences([new_sentence_seq], maxlen=max_sequence_length, padding='post')

# Feed the training model a single padding token as decoder input, so it
# predicts only the first word of the translation; use decode_sequence to
# generate a full sentence
pred_seq = model.predict([new_sentence_seq, np.zeros((len(new_sentence_seq), 1))])

# Convert the predicted sequence to text
pred_text = sequence_to_text(np.argmax(pred_seq, axis=2)[0], target_tokenizer)

print("Input:", new_sentence)
print("Predicted translation:", pred_text)

Input: come in
Predicted translation: het
