# Data Preparation
I will be using text from [Kwayedza Newspaper](https://kwayedza.co.zw), which I scraped into [content.txt](../datasets/content.txt). The scraping code can be found in the `1. Getting the data.ipynb` notebook.

Here is a sample of what the text in `content.txt` looks like:
```text
SANGANO reZimbabwe Indigenous Women Farmers Trust (ZIWFT) rakatanga
chirongwa chekudzidzisa varimi kugadzira fetireza pachishandiswa
zviwanikwa zvemunharaunda dzavo (organic fertiliser) sezvo mhando iyi
isingadhure mukuigadzira uye ichiwanisa chikafu chisina njodzi kuutano
hweveruzhinji.
```
I will train a language model on this text and then use it to generate new sequences of text.

In [1]:
DATASETS_DIR = '../datasets'

## Load the text

In [2]:
# load a text document into memory
def load_doc(filename):
  # open the file read-only; the with block closes it automatically
  with open(filename, 'r', encoding='utf-8') as file:
    # read all the text in the file
    text = file.read()
  return text

In [3]:
# load document
in_filename = f"{DATASETS_DIR}/content.txt"
doc = load_doc(in_filename)
doc[:200]

'PREMIER Soccer League (PSL) iri kudya magaka mambishi zvichitevera mhirizhonga yakaitika kuBabourfields, kuBulawayo nezuro apo vatsigiri veHighlanders vakapinda munhandare ndokutanga kukanda matombo v'

### Clean the text
The text needs to be transformed into a sequence of word tokens that we can use to train the model. Before that, a few operations are needed to clean the text:
- replace `--` with a space so that words joined by a double hyphen split apart
- split the text into words on whitespace
- remove all punctuation from the words to reduce the vocabulary size
- remove all tokens that are not alphabetic, which drops standalone punctuation and numbers
- normalize all words to lowercase to reduce the vocabulary size

In [4]:
import re
import string

# turn a doc into clean tokens
def clean_doc(doc):
  # replace '--' with a space ' '
  doc = doc.replace('--', ' ')

  # split into tokens by white space
  tokens = doc.split()

  # prepare regex for char filtering
  re_punc = re.compile('[%s]' % re.escape(string.punctuation))

  # remove punctuation from each word
  tokens = [re_punc.sub('', w) for w in tokens]

  # remove remaining tokens that are not alphabetic
  tokens = [word for word in tokens if word.isalpha()]

  # make lower case
  tokens = [word.lower() for word in tokens]
  return tokens
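
To make the effect of each step visible, here is a minimal sketch that runs the same operations one at a time on a single line. The `sample` string is made up for illustration (it reuses words from the excerpt above but is not a line from `content.txt`), and the `step1` to `step5` names are hypothetical.

```python
import re
import string

# made-up sample line for illustration, not taken from content.txt
sample = "SANGANO reZimbabwe (ZIWFT) rakatanga--chirongwa 2022 , chekudzidzisa varimi."

# 1. replace '--' with a space so the two joined words split apart
step1 = sample.replace('--', ' ')

# 2. split into tokens on whitespace
step2 = step1.split()

# 3. strip every punctuation character from each token
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
step3 = [re_punc.sub('', w) for w in step2]

# 4. keep only alphabetic tokens (drops '2022' and the now-empty ',' token)
step4 = [w for w in step3 if w.isalpha()]

# 5. lowercase everything
step5 = [w.lower() for w in step4]

print(step5)
```

Note that `string.punctuation` includes `-`, so a single hyphen inside a word is removed in step 3 and the two halves end up joined into one token, whereas `--` is turned into a space first and so produces two tokens.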

In [5]:
# clean document
tokens = clean_doc(doc)
print(tokens[:200])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))

['premier', 'soccer', 'league', 'psl', 'iri', 'kudya', 'magaka', 'mambishi', 'zvichitevera', 'mhirizhonga', 'yakaitika', 'kubabourfields', 'kubulawayo', 'nezuro', 'apo', 'vatsigiri', 'vehighlanders', 'vakapinda', 'munhandare', 'ndokutanga', 'kukanda', 'matombo', 'vachirwisana', 'nemapurisa', 'vachinyunyuta', 'kuti', 'chikwata', 'chavo', 'change', 'chanyimwa', 'pena', 'mutambo', 'wecastle', 'premier', 'soccer', 'league', 'uyu', 'waive', 'pakati', 'pedynamos', 'fc', 'nehighlanders', 'wakazomiswa', 'nekuda', 'kwemhirizhonga', 'yakatangiswa', 'nevatsigiri', 'vehighlanders', 'watambwa', 'maminitsi', 'apo', 'dembare', 'yaitungamira', 'mhirizhonga', 'iyi', 'yakaona', 'vamwe', 'vatsigiri', 'venhabvu', 'vachikuvara', 'zvakaipisisa', 'muchinyorwa', 'sachigaro', 'wepsl', 'farai', 'jere', 'anoti', 'psl', 'inoshora', 'nyaya', 'dzemhizhonga', 'dzinoitika', 'munhabvu', 'uye', 'vari', 'kuongorora', 'chiitiko', 'chekubulawayo', 'soccer', 'league', 'inoshora', 'zvikuru', 'nyaya', 'yemhirizhonga', 'yakai
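
Since several cleaning steps exist to keep the vocabulary small, it can help to look at how often each token occurs. The following is a rough sketch using `collections.Counter` from the standard library. The `demo_tokens` list is a stand-in I made up so the snippet runs on its own; in the notebook it would be the `tokens` list returned by `clean_doc`.

```python
from collections import Counter

# stand-in tokens for illustration; replace with `tokens` in the notebook
demo_tokens = ['premier', 'soccer', 'league', 'psl', 'premier',
               'soccer', 'league', 'mhirizhonga', 'psl', 'soccer']

# count how many times each token occurs
counts = Counter(demo_tokens)

print('Total Tokens: %d' % len(demo_tokens))
print('Unique Tokens: %d' % len(counts))

# the most frequent tokens, as (token, count) pairs
print(counts.most_common(2))

# tokens that occur only once
singletons = [w for w, c in counts.items() if c == 1]
print(singletons)
```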

### Save the clean text
Tokens can be organized into sequences of 10 input words and 1 output word, i.e. sequences of 11 words. This can be done by iterating over the list of tokens from index 11 onwards and taking the 11 tokens before that index as one sequence, then repeating this process to the end of the list of tokens. Each sequence is then joined into a space-separated string for later storage in a file.

In [6]:
# sequence length: 10 input words + 1 output word
length = 10 + 1
sequences = list()
for i in range(length, len(tokens)):
  # select a sequence of tokens
  seq = tokens[i - length: i]

  # convert the sequence into a line
  line = ' '.join(seq)

  # store the line
  sequences.append(line)

print('Total Sequences: %d' % len(sequences))

Total Sequences: 5498
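
The windowing is easier to follow on a toy list. The sketch below uses the same loop with a hypothetical, much smaller window (3 input words + 1 output word) and then splits each stored line back into its inputs and its output. The names `toy_tokens`, `toy_length` and `toy_sequences` are made up for this illustration.

```python
# toy token list, using words from the sample at the top of the notebook
toy_tokens = ['sangano', 'rezimbabwe', 'indigenous', 'women', 'farmers', 'trust']

# hypothetical small window: 3 input words + 1 output word
toy_length = 3 + 1

toy_sequences = []
for i in range(toy_length, len(toy_tokens)):
    # the toy_length tokens that come before index i
    seq = toy_tokens[i - toy_length:i]
    toy_sequences.append(' '.join(seq))

# split each line back into input words and the output word
for line in toy_sequences:
    *inputs, output = line.split()
    print(inputs, '->', output)

print('Total Sequences: %d' % len(toy_sequences))
```

With this loop bound the number of sequences is `len(tokens) - length`, and the window that ends on the very last token is not emitted. If that final window is wanted, `range(length, len(tokens) + 1)` would include it.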


In [7]:
# save sequences to file, one sequence per line (checkpoint)
def save_doc(lines, filename):
  data = '\n'.join(lines)
  # the with block closes the file automatically
  with open(filename, 'w', encoding='utf-8') as file:
    file.write(data)

In [8]:
# save sequences to file (checkpoint)
out_filename = f"{DATASETS_DIR}/content_sequences.txt"
save_doc(sequences, out_filename)
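
As a sanity check on the checkpoint, the file can be read back and each line verified to hold the expected number of words. This is a minimal sketch under some assumptions: it writes made-up `demo_sequences` to a temporary file rather than touching the real `content_sequences.txt`, and the file name `demo_sequences.txt` is hypothetical.

```python
import os
import tempfile

# stand-in sequences for illustration; in the notebook this would be `sequences`
demo_sequences = ['sangano rezimbabwe indigenous women',
                  'rezimbabwe indigenous women farmers']

# write to a temporary location, one sequence per line
path = os.path.join(tempfile.mkdtemp(), 'demo_sequences.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('\n'.join(demo_sequences))

# read the file back and split it into lines again
with open(path, 'r', encoding='utf-8') as f:
    loaded = f.read().split('\n')

# every line should round-trip unchanged
round_trip_ok = (loaded == demo_sequences)

# collect the distinct line lengths (in words); a single value means all lines match
lengths = {len(line.split()) for line in loaded}
print(round_trip_ok, lengths)
```

For the real file, `lengths` would be expected to contain only the value of `length` used when building the sequences.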