<h2>Normalizing informal language using Deep Learning</h2>

1 Overview:

1.1. Introduction:

The project's objective is to convert informal text into a structured, normalized form using Natural Language Processing (NLP) techniques. Normalized text has applications in areas such as translation, information retrieval, and accessibility. In this work, we tackle the problem with an Encoder-Decoder sequence-to-sequence (seq2seq) architecture with an attention mechanism.
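
To make the attention idea concrete, here is a minimal sketch of one decoder step of dot-product attention in plain Python. The scoring function (a dot product), the helper name `attend`, and the toy vectors are assumptions for illustration only; this section does not state which attention variant the model uses.

```python
import math

def attend(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step using dot-product attention."""
    # Score every encoder state against the current decoder state
    scores = [sum(d * e for d, e in zip(decoder_state, state)) for state in encoder_states]
    # Softmax over the scores (shifted by the maximum for numerical stability)
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    # Context vector: attention-weighted sum of the encoder states
    dim = len(encoder_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context

# Toy example: three encoder positions with 2-dimensional hidden states
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([0.0, 2.0], encoder_states)
print(weights)
print(context)
```

The weights sum to 1, and the encoder positions that align best with the decoder state receive the largest share of the context vector.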


1.2. Business Problem:

The problem we address is the difficulty of processing and understanding text written in an informal or unstructured way. Informal language can be hard for both humans and machines to interpret, which leads to misreadings of the text's meaning. Normalizing the text produces a more structured representation that is easier to process and understand. This can improve the accuracy of machine translation systems, assist information retrieval tasks, and make the text more accessible to individuals with language processing difficulties.

Informal input : U wan me to chop seat 4 u nt?

Formal output : Do you want me to reserve seat for you or not?


1.3. Dataset:

We are going to use the NUS Social Media Text Normalization and Translation Corpus dataset for the project. The corpus was created for social media text normalization and translation by randomly selecting 2,000 messages from the NUS English SMS corpus.

2 Reading data and Data Preprocessing:

In [5]:
# Import the required libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from scipy import stats

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import nlpaug.augmenter.word
import plotly.figure_factory as ff
import pickle


import warnings
warnings.filterwarnings("ignore")

2.1. Loading data:

The dataset file stores each record as three consecutive lines: the informal text, its formal English correction, and a Chinese translation. We read the text file into a pandas DataFrame with three columns (Informal, Formal English, Formal Chinese) and then keep only the two English columns. Case, punctuation, and stopwords are left untouched because the text comes from messages and general conversation.

In [6]:
# Read the file
with open("../data/raw/en2cn-2k.en2nen2cn", "r", encoding = 'utf-8') as file:
    text = file.read()
    
# One entry per line; drop the empty string left after the final newline
text = text.split('\n')
text.pop()

# Every record spans three consecutive lines: informal, formal English, formal Chinese
data = [text[idx:idx + 3] for idx in range(0, len(text), 3)]

# creating the dataframe
df = pd.DataFrame(data, columns = ['Informal', 'Formal English', 'Formal Chinese'])
display(df.head())
df = df[['Informal', 'Formal English']]

Unnamed: 0,Informal,Formal English,Formal Chinese
0,"U wan me to ""chop"" seat 4 u nt?",Do you want me to reserve seat for you or not?,你要我帮你预留坐位吗？
1,Yup. U reaching. We order some durian pastry a...,Yeap. You reaching? We ordered some Durian pas...,对。你要到了吗？我们已经点了一些榴莲糕点。你快点来。
2,They become more ex oredi... Mine is like 25.....,They become more expensive already. Mine is li...,他们变得更贵了。我的是大概25。这么坏然后他们还比我以前做得少。
3,I'm thai. what do u do?,I'm Thai. What do you do?,我是泰国人。你做什么？
4,Hi! How did your week go? Haven heard from you...,Hi! How did your week go? Haven't heard from y...,嗨！你这周过的怎么样？好长时间没听到你的消息了。一切顺利吗？
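
The three-lines-per-record layout can be parsed defensively, as in the sketch below. The helper name `parse_triples` and the length check are assumptions added for illustration; the two sample records are copied from the table above rather than read from the corpus file.

```python
def parse_triples(raw_text):
    """Group a three-lines-per-record text into (informal, formal, chinese) tuples."""
    lines = raw_text.split('\n')
    # Drop the empty string left behind by a trailing newline, if any
    if lines and lines[-1] == '':
        lines.pop()
    # Fail early if a record is incomplete instead of misaligning the columns
    if len(lines) % 3 != 0:
        raise ValueError(f'expected a multiple of 3 lines, got {len(lines)}')
    return [tuple(lines[i:i + 3]) for i in range(0, len(lines), 3)]

sample = (
    'U wan me to "chop" seat 4 u nt?\n'
    'Do you want me to reserve seat for you or not?\n'
    '你要我帮你预留坐位吗？\n'
    "I'm thai. what do u do?\n"
    "I'm Thai. What do you do?\n"
    '我是泰国人。你做什么？\n'
)
records = parse_triples(sample)
print(len(records))
print(records[1][:2])
```

Raising on an incomplete record is a design choice: a single missing line would otherwise shift every later record into the wrong column.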


2.2 Data Augmentation:

The 'nlpaug' library is used to augment the data with two strategies suited to this task: synonym augmentation and spelling augmentation. Synonym augmentation is applied first, doubling the 2,000 original samples to 4,000. Spelling augmentation is then applied to all 4,000 rows, giving a final total of 8,000 samples.

In [7]:
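# Build each augmenter once instead of once per sentence
synonym_aug = nlpaug.augmenter.word.SynonymAug(aug_src = 'wordnet')
spelling_aug = nlpaug.augmenter.word.SpellingAug()

# DataFrame.append was removed in pandas 2.0, so collect the new rows and concatenate.
# Note: recent nlpaug versions return a list from augment(); it is stored as-is here,
# which is why the recorded output below starts each augmented row with '['.

# Synonym Augmentation: 2000 -> 4000 samples
formal = df['Formal English'].values
augmented = pd.DataFrame({"Informal": [synonym_aug.augment(text) for text in formal], "Formal English": formal})
df = pd.concat([df, augmented], ignore_index = True)

# Spelling Augmentation over all 4000 rows: 4000 -> 8000 samples
formal = df['Formal English'].values
augmented = pd.DataFrame({"Informal": [spelling_aug.augment(text) for text in formal], "Formal English": formal})
df = pd.concat([df, augmented], ignore_index = True)

df.tail()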

Unnamed: 0,Informal,Formal English
7995,[Hmm. in think i'd usually book os weekends. I...,Hmm. I think I usually book on weekends. It de...
7996,[Ken you ask them wheter they have for eny sms...,Can you ask them whether they have for any sms...
7997,[Wi ars near Coca already.],We are near Coca already.
7998,[Hall eleven. Got lectures. And forgete about ...,Hall eleven. Got lectures. And forget about co...
7999,[I bring for you. I cen not promise yus 100% t...,I bring for you. I can not promise you 100% to...
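
The idea behind the augmentation step (derive a noisy "informal" variant from each formal sentence and pair it with the clean original) can be sketched without any external library. The function `swap_adjacent_chars` below is a hypothetical toy noise function, not nlpaug's algorithm; it only shows how each pass doubles the number of training pairs.

```python
import random

def swap_adjacent_chars(sentence, rng):
    """Toy noise function: swap two adjacent characters inside one randomly chosen word."""
    words = sentence.split(' ')
    # Only words with at least two characters can be perturbed
    candidates = [i for i, w in enumerate(words) if len(w) >= 2]
    if not candidates:
        return sentence
    i = rng.choice(candidates)
    w = words[i]
    j = rng.randrange(len(w) - 1)
    words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return ' '.join(words)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
pairs = [("I'm thai. what do u do?", "I'm Thai. What do you do?")]
# Each formal sentence yields one extra (noisy, formal) pair, doubling the data
augmented = [(swap_adjacent_chars(formal, rng), formal) for _, formal in pairs]
pairs = pairs + augmented
print(len(pairs))
print(pairs[-1])
```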


2.3 Adding Beginning of the Sentence token and End of the Sentence token

Each encoder input is wrapped in a beginning-of-sentence token ('<') and an end-of-sentence token ('>'), so the model can tell where each sentence starts and ends. For the decoder, the input is prefixed with '<' and the output is suffixed with '>', which makes the output the input shifted by one character.


In [8]:
# Creating encoder input, decoder input and decoder output along with adding appropriate BOS and EOS tokens
preprocessed_df = pd.DataFrame()
preprocessed_df['encoder_input'] = df.apply(lambda row: '<'+str(row['Informal'])+'>', axis=1)
preprocessed_df['decoder_input'] = df.apply(lambda row: '<'+str(row['Formal English']), axis=1)
preprocessed_df['decoder_output'] = df.apply(lambda row: str(row['Formal English']+'>'), axis=1)
preprocessed_df.head()

Unnamed: 0,encoder_input,decoder_input,decoder_output
0,"<U wan me to ""chop"" seat 4 u nt?>",<Do you want me to reserve seat for you or not?,Do you want me to reserve seat for you or not?>
1,<Yup. U reaching. We order some durian pastry ...,<Yeap. You reaching? We ordered some Durian pa...,Yeap. You reaching? We ordered some Durian pas...
2,<They become more ex oredi... Mine is like 25....,<They become more expensive already. Mine is l...,They become more expensive already. Mine is li...
3,<I'm thai. what do u do?>,<I'm Thai. What do you do?,I'm Thai. What do you do?>
4,<Hi! How did your week go? Haven heard from yo...,<Hi! How did your week go? Haven't heard from ...,Hi! How did your week go? Haven't heard from y...
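
The relationship between the two decoder columns is easiest to see step by step. The sketch below assumes the usual arrangement in which the decoder is fed the ground-truth prefix and asked to predict the next character (often called teacher forcing, a term this section does not use); the helper name `make_decoder_pair` is hypothetical.

```python
BOS, EOS = '<', '>'

def make_decoder_pair(formal_text):
    """Return (decoder_input, decoder_output) for one target sentence."""
    decoder_input = BOS + formal_text    # what the decoder is fed
    decoder_output = formal_text + EOS   # what the decoder should predict
    return decoder_input, decoder_output

dec_in, dec_out = make_decoder_pair("We are near Coca already.")
# At step t the decoder has seen dec_in[:t + 1] and must predict dec_out[t]
for t in range(3):
    print(repr(dec_in[:t + 1]), '->', repr(dec_out[t]))
```

Both sequences have the same length, and the final target character is the end-of-sentence token, which tells the decoder when to stop generating.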


2.4 Distribution of length of Encoder Input, Decoder Input and Decoder Output

As the distribution plots below show, most sentences are around 50 characters long and almost all are shorter than 200 characters. We therefore drop samples longer than 200 characters.

In [9]:
# Distribution plot for length of encoder input
fig = ff.create_distplot([preprocessed_df['encoder_input'].apply(len).values], ['Count'])
fig.update_layout(title= 'Length of Encoder Input (Informal text)')
fig.show()

display(stats.describe(preprocessed_df['encoder_input'].apply(len)))

DescribeResult(nobs=8000, minmax=(4, 321), mean=79.7555, variance=2163.971216152019, skewness=0.9474529511444769, kurtosis=0.4625659279018093)

In [10]:
# Distribution plot for length of decoder input
fig = ff.create_distplot([preprocessed_df['decoder_input'].apply(len).values], ['Count'])
fig.update_layout(title= 'Length of Decoder Input (Normalized text)')
fig.show()

display(stats.describe(preprocessed_df['decoder_input'].apply(len)))

DescribeResult(nobs=8000, minmax=(4, 282), mean=72.438, variance=1953.5483495436931, skewness=0.9469251210577967, kurtosis=0.40664669799959263)

In [11]:
# Distribution plot for length of decoder output
fig = ff.create_distplot([preprocessed_df['decoder_output'].apply(len).values], ['Count'])
fig.update_layout(title= 'Length of Decoder Output (Normalized text)')
fig.show()

display(stats.describe(preprocessed_df['decoder_output'].apply(len)))

DescribeResult(nobs=8000, minmax=(4, 282), mean=72.438, variance=1953.5483495436931, skewness=0.9469251210577967, kurtosis=0.40664669799959263)

In [12]:
# Filter samples with length less than or equal to 200
preprocessed_df = preprocessed_df[(preprocessed_df['encoder_input'].apply(len) <= 200) & (preprocessed_df['decoder_input'].apply(len) <= 200) & (preprocessed_df['decoder_output'].apply(len) <= 200)]
print('Total samples after filtering are:', preprocessed_df.shape[0])

Total samples after filtering are: 7876
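
A quick way to judge a length cutoff is to check how much data it keeps. The sketch below uses a hypothetical helper `retention` and made-up toy lengths (not corpus values); the last line only restates the figures reported above, where 7876 of 8000 samples pass the filter on all three columns.

```python
def retention(lengths, max_len):
    """Fraction of samples whose length does not exceed max_len."""
    kept = sum(1 for n in lengths if n <= max_len)
    return kept / len(lengths)

# Toy character lengths (hypothetical, not the corpus values)
toy_lengths = [4, 35, 50, 52, 80, 120, 199, 200, 201, 321]
for cutoff in (100, 200, 300):
    print(cutoff, retention(toy_lengths, cutoff))

# Share of samples kept by the 200-character filter, from the counts reported above
print(round(7876 / 8000 * 100, 2))
```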


2.5. Splitting the data into training, validation and test sets:

Split the data into training, validation and test sets in the ratio 90:5:5.

In [13]:
# Split the data into training, validation and test
train, validation = train_test_split(preprocessed_df, train_size=0.9, random_state = 42)
validation, test = train_test_split(validation, test_size = 0.5, random_state = 42)
train.reset_index(inplace=True, drop=True)
validation.reset_index(inplace=True, drop=True)
test.reset_index(inplace=True, drop=True)
train.to_csv('../data/processed/train.csv')
validation.to_csv('../data/processed/validation.csv')
test.to_csv('../data/processed/test.csv')
print('Shape of Training set:', train.shape)
print('Shape of Validation set:', validation.shape)
print('Shape of Test set:', test.shape)

Shape of Training set: (7088, 3)
Shape of Validation set: (394, 3)
Shape of Test set: (394, 3)
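
The set sizes follow directly from the two-stage split: 90% of the 7876 samples go to training (rounded down), and the remainder is halved. The plain-Python sketch below reproduces that arithmetic only; the helper name `three_way_split` is hypothetical, and its shuffling does not match the row order produced by scikit-learn.

```python
import random

def three_way_split(items, train_frac=0.9, seed=42):
    """Shuffle items, keep train_frac for training and halve the rest."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)  # rounds down
    held_out = shuffled[n_train:]
    n_val = len(held_out) // 2
    return shuffled[:n_train], held_out[:n_val], held_out[n_val:]

train_ids, val_ids, test_ids = three_way_split(range(7876))
print(len(train_ids), len(val_ids), len(test_ids))  # 7088 394 394
```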


2.6 Tokenizing data:

Each character in the training vocabulary is assigned a unique integer ID (the tokenizers are created with char_level = True). Every sentence is then encoded as the sequence of IDs of the characters it contains.

In [14]:
#Tokenizing the informal and normalized sentences.
tokenizer_informal = Tokenizer(filters = '"#$%&()*+-/=@[\\]^_`{|}~\t\n', lower = False, char_level = True)
tokenizer_informal.fit_on_texts(train['encoder_input'].values)

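# Note: this tokenizer is fitted on decoder_input only, so the end token '>' enters the
# vocabulary only if it also occurs inside some sentence; characters missing from the
# vocabulary are skipped by texts_to_sequences.
tokenizer_normalized = Tokenizer(filters = '"#$%&()*+-/=@[\\]^_`{|}~\t\n', lower = False, char_level = True)
tokenizer_normalized.fit_on_texts(train['decoder_input'].values)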

tokenizer_hashmap = {'informal': tokenizer_informal, 'normalized': tokenizer_normalized}

#Save the tokenizer model
with open('../model/tokenizer.pkl', 'wb') as file:
    pickle.dump(tokenizer_hashmap, file, protocol=pickle.HIGHEST_PROTOCOL)

In [15]:
print('Vocab size of Informal text:', len(tokenizer_informal.word_index.keys()))
print('Vocab size of Normalized text:', len(tokenizer_normalized.word_index.keys()))

Vocab size of Informal text: 120
Vocab size of Normalized text: 91
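
Character-level tokenization can be sketched in a few lines of plain Python. The sketch assumes the same conventions the Keras tokenizer appears to follow here (IDs start at 1, more frequent characters get smaller IDs, 0 is left free for padding); `fit_char_vocab` and `encode` are hypothetical helper names, not library functions.

```python
from collections import Counter

def fit_char_vocab(texts):
    """Map each character to an integer ID, most frequent first; 0 is left for padding."""
    counts = Counter(ch for text in texts for ch in text)
    return {ch: idx for idx, (ch, _) in enumerate(counts.most_common(), start=1)}

def encode(text, vocab):
    """Encode a string as character IDs, skipping characters absent from the vocabulary."""
    return [vocab[ch] for ch in text if ch in vocab]

vocab = fit_char_vocab(["<I'm Thai. What do you do?", "<We are near Coca already."])
ids = encode("<We do", vocab)
print(len(vocab), ids)
```

Because unseen characters are silently skipped in this sketch, the vocabulary should be fitted on text that contains every symbol the model must read or produce, including the start and end tokens.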


2.7 Padding Data:

To give every sequence the same length, sequences shorter than 200 are padded with zeros at the end (padding='post') until they reach length 200.

In [16]:
# Padding the sentences to make all the sentences of same length
padded_encoder_input_sequence = pad_sequences(tokenizer_informal.texts_to_sequences(train['encoder_input'].values), maxlen = 200, dtype='int32', padding='post')
padded_decoder_input_sequence = pad_sequences(tokenizer_normalized.texts_to_sequences(train['decoder_input'].values), maxlen = 200, dtype='int32', padding='post')
padded_decoder_output_sequence = pad_sequences(tokenizer_normalized.texts_to_sequences(train['decoder_output'].values), maxlen = 200, dtype='int32', padding='post')

print('Original sentence:')
print(train['encoder_input'][0], '\n')

print('Tokenized and padded input sentence:')
print(padded_encoder_input_sequence[0], '\n')

print('Length of tokenized and padded input sentence:', padded_encoder_input_sequence.shape[1])

Original sentence:
<["Leaving around yhat time tÃ'o. Bringing laptop homme?"]> 

Tokenized and padded input sentence:
[ 22  25  29  52   2   5  34   7   6  18   1   5   8   3  13   6  16   1
  14  10   5   4   1   4   7  15   2   1   4 106  17   3  11   1  46   8
   7   6  18   7   6  18   1  12   5  23   4   3  23   1  10   3  15  15
   2  30  29  26  21   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0] 

Length of tokenized and padded input sentence: 200
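
Post-padding itself is a one-line operation, as the sketch below shows. The helper name `pad_post` is hypothetical, and it cuts over-long inputs at the end, which is an assumption of this sketch; no truncation occurs in the notebook because every sample was already filtered to at most 200 characters.

```python
def pad_post(sequence, maxlen, value=0):
    """Right-pad a list of IDs with `value` up to maxlen; longer inputs are cut at the end."""
    clipped = sequence[:maxlen]
    return clipped + [value] * (maxlen - len(clipped))

padded = pad_post([22, 25, 29, 52, 2], maxlen=8)
print(padded)  # [22, 25, 29, 52, 2, 0, 0, 0]
```

Reserving 0 for padding is what allows a downstream layer to mask these positions, since no real character is ever assigned that ID.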
