# AI TEXT GENERATION FOR BEGINNERS: CREATING AND FINE TUNING YOUR OWN SINGLISH CHATBOT

Text generation is one of the most exciting areas in modern NLP, thanks to the work by companies like Hugging Face and OpenAI. It is also one of the harder areas for NLP beginners to navigate, due to the higher technical and resource barrier to entry (access to GPUs for training, for instance). Finding an appropriate dataset for your own experiment can be just as tough. 

For this third installment of notebooks on practical NLP tasks, I'll be sharing a series of notebooks that I've created and adapted from other sources for my own learning. The goal is to speed up your own learning progress, so that more time is spent on understanding the data and experimentation, instead of finding the basic building blocks for the project.

The final product is far from polished, and reflects the limits of my knowledge and the suitability of the dataset. But as they say, don't let the perfect get in the way of the good. Things are moving very fast in this area, and it would be a shame to sit it out while waiting for the perfect conditions.

This series contains a combination of Colab and local-machine-based notebooks. I would recommend upgrading to Colab Pro and increasing your Google account storage limit if you can afford it. You can still run a limited version of the experiments here if you don't wish to upgrade, but bear in mind that you'll have to manage your storage quite regularly, as the fine-tuning process tends to generate pretty huge files. 15GB of free storage sounds like a lot, but it isn't once you get started.

In [1]:
import codecs
import json
import numpy as np
import pandas as pd
import re

from sklearn.model_selection import train_test_split

# PART 1: DATA EXTRACTION AND PREPARATION

Finding the right dataset for your use case is likely the biggest stumbling block to getting started. While online tutorials tend to come with demo datasets, they may not appeal to you or the audience you are building for.

For this project, I'm focusing on [Singlish](https://en.wikipedia.org/wiki/Singlish), or colloquial Singaporean English. It's a mish-mash of several languages and local slang, and can be bewildering for native English speakers who have never been to Singapore.

The first half of the data preparation is specific to the corpus used: a collection of [SMS messages by Singaporean students at a local university](https://scholarbank.nus.edu.sg/handle/10635/137343).

The second half of the data preparation is specific to the requirements of chatbot fine-tuning, as outlined in this [Colab notebook](https://colab.research.google.com/drive/15wa925dj7jvdvrz8_z3vU7btqAFQLVlG), on which I've based most of the fine-tuning code.

If you use a different dataset, you'll have to change the first half of the processing accordingly.

## 1. EXTRACTING DATA FROM THE JSON FILE

The SMSes are nested pretty deeply in the original JSON file. The next couple of cells extract the data into a dataframe.

In [2]:
# read the file line by line; the context manager closes the file handle afterwards
with open('../data/singlish.json', 'r') as f:
    raw = [json.loads(line) for line in f]

In [3]:
df_raw = pd.json_normalize(raw)

df_raw.head()

| | smsCorpus.@date | smsCorpus.@version | smsCorpus.message |
|---|---|---|---|
| 0 | 2015.03.09 | 1.2 | [{'@id': 10120, 'text': {'$': 'Bugis oso near ... |


In [4]:
raw_messages = pd.concat(
    df_raw["smsCorpus.message"]
    .apply(pd.DataFrame)
    .tolist(),
    keys=df_raw["smsCorpus.@date"],
    sort=False,
).reset_index(level="smsCorpus.@date")
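To make the flattening step easier to picture, here is a minimal sketch on a made-up two-message corpus. The `toy` data is hypothetical and only mimics the nesting visible in the `df_raw.head()` output. The sketch also uses `pd.json_normalize` with `record_path` and `meta`, which is an alternative way of getting one row per message, not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical miniature of the corpus structure (values are made up)
toy = [
    {
        "smsCorpus": {
            "@date": "2015.03.09",
            "message": [
                {"@id": 1, "text": {"$": "Meet after lunch la"}},
                {"@id": 2, "text": {"$": "Ok can"}},
            ],
        }
    }
]

# record_path points at the nested list of messages;
# meta carries the corpus-level date down to every message row
toy_messages = pd.json_normalize(
    toy,
    record_path=["smsCorpus", "message"],
    meta=[["smsCorpus", "@date"]],
)

print(toy_messages)
```

Each message becomes one row, and the nested `text` dictionary is flattened into a `text.$` column along the way.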


In [5]:
raw_messages['sms_text'] = [x.get('$') for x in raw_messages['text']]

In [6]:
source = pd.json_normalize(raw_messages['source'], meta='@id')

destination = pd.json_normalize(raw_messages['destination'], meta='@id')

profile = pd.json_normalize(raw_messages['messageProfile'], meta='@id')

collection = pd.json_normalize(raw_messages['collectionMethod'], meta='@id')


In [7]:
sms_raw = pd.concat([raw_messages, source, destination, profile, collection], axis=1)

In [8]:
cols = [
    "@id",
    "userProfile.userID.$",
    "sms_text",
    "userProfile.country.$",
    "userProfile.age.$",
    "userProfile.gender.$",
    "srcNumber.$",
    "phoneModel.@manufactuer",
    "phoneModel.@smartphone",
    "userProfile.frequency.$",
]

sms = sms_raw[cols].copy()


In [9]:
sms['sms_text'] = sms['sms_text'].astype('str')

# simple function to clean the text and remove non-ASCII characters
def clean_text(text):
    text = text.encode("ascii", errors="ignore").decode("ascii")  # drop non-ASCII (e.g. Chinese) characters
    text = re.sub(r"http\S+", "", text)  # remove URLs
    text = re.sub(r"\n", " ", text)  # replace line breaks with spaces
    text = re.sub(r"\W", " ", text)  # replace punctuation and other non-word characters with spaces
    text = re.sub(r"^\d+\s|\s\d+\s|\s\d+$", " ", text)  # remove standalone numbers
    text = re.sub(" +", " ", text).strip()  # collapse multiple spaces into one and trim the ends
    return text

sms["clean_text"] = sms['sms_text'].map(clean_text)

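# clean_text always returns a string, so this is only a safeguard;
# empty and very short rows are removed by the word-count filter in section 1.1
sms = sms.dropna(subset=['clean_text'])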
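A quick way to check what the cleaning does is to run it on a couple of made-up messages. The sketch below uses a condensed copy of the cleaning steps (substitutions that cannot match anything after the earlier steps are left out, which should not change the result); the sample strings are hypothetical.

```python
import re


def clean_text(text):
    text = text.encode("ascii", errors="ignore").decode("ascii")  # drop non-ASCII characters
    text = re.sub(r"http\S+", "", text)  # remove URLs
    text = re.sub(r"\n", " ", text)  # line breaks to spaces
    text = re.sub(r"\W", " ", text)  # punctuation and other non-word characters to spaces
    text = re.sub(r"^\d+\s|\s\d+\s|\s\d+$", " ", text)  # remove standalone numbers
    return re.sub(" +", " ", text).strip()  # collapse repeated spaces


# Hypothetical sample messages
sample = "Meet at 5 lah!! http://t.co/abc\nOk?"
cleaned = clean_text(sample)
print(cleaned)

# Two standalone numbers in a row: the space between them is consumed by the
# first match, so the second number survives
back_to_back = clean_text("call 12 34 now")
print(back_to_back)
```

The second call shows a small limitation of the number pattern: when two standalone numbers sit next to each other, only the first is removed. For SMS text this is probably rare enough to ignore.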

In [10]:
#adding a word count col for filtering

sms['word_count'] = sms['clean_text'].str.count(' ') + 1
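Counting spaces works here because `clean_text` has already collapsed repeated spaces and trimmed the ends. One quirk is that an empty string still counts as one word, which is harmless given that short messages are filtered out in section 1.1. The sketch below compares the space-based count with a split-based one on a made-up series.

```python
import pandas as pd

# Hypothetical cleaned messages, including an empty one
toy = pd.Series(["Meet after lunch la", "Ok", ""])

by_spaces = toy.str.count(" ") + 1    # approach used in this notebook
by_split = toy.str.split().str.len()  # alternative: count whitespace-separated tokens

print(pd.DataFrame({"text": toy, "by_spaces": by_spaces, "by_split": by_split}))
```

The two counts agree for non-empty messages and differ only for the empty string.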

In [11]:
# narrowing down col selection

cols = ["@id", "userProfile.userID.$", "userProfile.country.$", "sms_text", "clean_text", "word_count"]

sms = sms[cols].copy()


In [12]:
# renaming cols for clarity

sms = sms.rename(
    columns={
        "@id": "data_id",
        "userProfile.userID.$": "user_id",
        "userProfile.country.$": "country",
    }
)


In [13]:
sms.shape

(55835, 6)

In [14]:
sms.head()

| | data_id | user_id | country | sms_text | clean_text | word_count |
|---|---|---|---|---|---|---|
| 0 | 10120 | 51 | SG | Bugis oso near wat... | Bugis oso near wat | 4 |
| 1 | 10121 | 51 | SG | Go until jurong point, crazy.. Available only ... | Go until jurong point crazy Available only in ... | 20 |
| 2 | 10122 | 51 | SG | I dunno until when... Lets go learn pilates... | I dunno until when Lets go learn pilates | 8 |
| 3 | 10123 | 51 | SG | Den only weekdays got special price... Haiz...... | Den only weekdays got special price Haiz Cant ... | 25 |
| 4 | 10124 | 51 | SG | Meet after lunch la... | Meet after lunch la | 4 |


## 1.1 CUTTING OUT THE NOISE

A bigger dataset isn't necessarily a better one if it is merely noisy. Before creating the training and validation sets, I filtered out SMSes of three words or fewer and kept only those sent by users in Singapore.

In [15]:
crit1 = sms['word_count'] > 3
crit2 = sms['country'] == 'SG'
crit3 = sms['country'] == 'Singapore'

sms = sms[crit1 & (crit2 | crit3)].copy().reset_index()
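The same filter can be written a little more compactly with `isin`. This is a minimal sketch on a made-up dataframe; the rows and the variable names `long_enough` and `in_singapore` are hypothetical.

```python
import pandas as pd

# Hypothetical rows standing in for the real sms dataframe
toy = pd.DataFrame(
    {
        "clean_text": [
            "Meet after lunch la",
            "Ok",
            "See you at the station later",
            "I dunno until when",
        ],
        "country": ["SG", "SG", "India", "Singapore"],
        "word_count": [4, 1, 6, 4],
    }
)

long_enough = toy["word_count"] > 3                      # more than three words
in_singapore = toy["country"].isin(["SG", "Singapore"])  # either spelling of the country

filtered = toy[long_enough & in_singapore].reset_index(drop=True)
print(filtered)
```

Note that `reset_index(drop=True)` discards the old index, whereas the cell above keeps it as an extra `index` column. Either way, the fresh 0-based index is what the context-building loop in the next section relies on.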

## 2. CREATE TRAIN-VALIDATION SETS

The data format for training a chatbot differs from the usual CSV files I've encountered, but it makes sense: for every response, the model is fed a fixed number of previous SMSes as "context".

In this case, I'm using the 7 previous messages as context. You can increase or decrease the number as you wish. Here's the [link to another Colab file](https://colab.research.google.com/drive/1kKErlSSpewQbWexFPEj1rPWsYpMx69ZS?usp=sharing) that shows you how to tweak your dataset ahead of fine-tuning the DialoGPT model.

In [16]:
contexted = []

n = 7

for i in range(n, len(sms['clean_text'])):
    row = []
    prev = i - 1 - n  # subtract an extra 1 so the row holds the current response plus the n previous messages
    for j in range(i, prev, -1):
        row.append(sms['clean_text'][j])
    contexted.append(row)  

In [17]:
columns = ['response', 'context'] 
columns = columns + ['context/'+str(i) for i in range(n-1)]

df = pd.DataFrame.from_records(contexted, columns=columns)
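The sliding window is easier to follow on a tiny example. The sketch below uses hypothetical messages `m0` to `m4` and `n = 2`, and builds each row with a slice instead of the nested loop; it should produce the same ordering, with the response first and the most recent context next.

```python
import pandas as pd

# Hypothetical messages in the order they appear in the dataframe
messages = ["m0", "m1", "m2", "m3", "m4"]
n = 2  # number of previous messages used as context

# For each position i, take messages i-n..i and reverse them so that
# the response comes first, followed by the context from newest to oldest
rows = [messages[i - n : i + 1][::-1] for i in range(n, len(messages))]

columns = ["response", "context"] + [f"context/{k}" for k in range(n - 1)]
toy_df = pd.DataFrame.from_records(rows, columns=columns)
print(toy_df)
```

With five messages and `n = 2`, the first two messages can only ever serve as context, so the result has three rows.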

In [18]:
df.shape

(29353, 8)

In [19]:
df.head()

| | response | context | context/0 | context/1 | context/2 | context/3 | context/4 | context/5 |
|---|---|---|---|---|---|---|---|---|
| 0 | Hey pple or for nights Excellent location wif ... | nights We nt staying at port step liao Too ex | m walking in citylink now faster come down Me ... | Meet after lunch la | Den only weekdays got special price Haiz Cant ... | I dunno until when Lets go learn pilates | Go until jurong point crazy Available only in ... | Bugis oso near wat |
| 1 | Yun ah the ubi one say if wan call by tomorrow... | Hey pple or for nights Excellent location wif ... | nights We nt staying at port step liao Too ex | m walking in citylink now faster come down Me ... | Meet after lunch la | Den only weekdays got special price Haiz Cant ... | I dunno until when Lets go learn pilates | Go until jurong point crazy Available only in ... |
| 2 | Hey tmr maybe can meet you at yck | Yun ah the ubi one say if wan call by tomorrow... | Hey pple or for nights Excellent location wif ... | nights We nt staying at port step liao Too ex | m walking in citylink now faster come down Me ... | Meet after lunch la | Den only weekdays got special price Haiz Cant ... | I dunno until when Lets go learn pilates |
| 3 | Oh i asked for fun Haha take care | Hey tmr maybe can meet you at yck | Yun ah the ubi one say if wan call by tomorrow... | Hey pple or for nights Excellent location wif ... | nights We nt staying at port step liao Too ex | m walking in citylink now faster come down Me ... | Meet after lunch la | Den only weekdays got special price Haiz Cant ... |
| 4 | We are supposed to meet to discuss abt our tri... | Oh i asked for fun Haha take care | Hey tmr maybe can meet you at yck | Yun ah the ubi one say if wan call by tomorrow... | Hey pple or for nights Excellent location wif ... | nights We nt staying at port step liao Too ex | m walking in citylink now faster come down Me ... | Meet after lunch la |


In [20]:
# Split the df into training and validation sets

train_df, validate_df = train_test_split(df, random_state=42, test_size=0.2)

In [21]:
train_df.shape, validate_df.shape

((23482, 8), (5871, 8))
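One thing to keep in mind: because each row shares most of its context with its neighbours, a random split places near-identical rows in both sets. If that matters for your evaluation, a contiguous split is one alternative. The sketch below uses a hypothetical stand-in dataframe and plain slicing; it is not what was used to produce the numbers above.

```python
import pandas as pd

# Hypothetical stand-in for the contexted dataframe
toy_df = pd.DataFrame({"response": [f"m{i}" for i in range(10)]})

# Keep the first 80% of rows for training and the last 20% for validation
cut = int(len(toy_df) * 0.8)
toy_train = toy_df.iloc[:cut]
toy_validate = toy_df.iloc[cut:]

print(toy_train.shape, toy_validate.shape)
```

A few rows around the cut would still share context with each other, but the overlap is limited to that boundary rather than spread across the whole validation set.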

In [22]:
# uncomment the 2 lines below to generate the CSV files for training in the next notebook

#train_df.to_csv('../data/train_df.csv', index=False)
#validate_df.to_csv('../data/validate_df.csv', index=False)

## 3. OPTIONAL: ADDITIONAL DATASET FOR AITEXTGEN

There are a number of other options out there for those who want to experiment further with text generation. One interesting library is [aitextgen](https://github.com/minimaxir/aitextgen). I didn't get very good results from the dataset, but I'm including the option here for those who want to try it out in any case.

In [23]:
#sms_text = sms['clean_text'].values.tolist()

#with open("../data/singlish_sms.txt", "w") as output:
#    output.write(str(sms_text))
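Note that `str(sms_text)` writes the Python representation of the list, brackets, quotes and commas included. If you would rather have a plain text file with one SMS per line, a minimal sketch could look like this. The two sample messages and the temporary output path are hypothetical stand-ins for `sms['clean_text'].values.tolist()` and `../data/singlish_sms.txt`.

```python
import os
import tempfile

# Hypothetical stand-in for sms['clean_text'].values.tolist()
sms_text = ["Meet after lunch la", "I dunno until when Lets go learn pilates"]

# Write to a temporary folder so the sketch does not touch the real data directory
out_path = os.path.join(tempfile.mkdtemp(), "singlish_sms.txt")

with open(out_path, "w", encoding="utf-8") as output:
    output.write("\n".join(sms_text) + "\n")  # one message per line

# Read the file back to confirm the layout
with open(out_path, encoding="utf-8") as check:
    lines = check.read().splitlines()

print(lines)
```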