# Creating random text data for Pandas

## Creating random-length sentences of random-length words from random characters

### Background

A friend recently asked about transforming data in a Pandas data frame and provided an example of what the target column currently contains (before the transformation).

After a few others and I made some suggestions, there was also some discussion about the performance of the different methods of applying the desired transformation.

So that we could test properly, I decided to generate some test data, and this turned out to be a challenge in itself.

Briefly, the column contained JSON-like data with:

1. A single-element array
2. Containing an Object (actually, more like a set in Python)
3. Containing one or more words separated by commas

An example (similar to that provided by my friend):

`[{These, words, are, contained, by, an, Object, or, set, which, itself, is, contained, by, an, array}]`  
`[{The, next, set, of, words, is, like, this}]`

It is not yet clear whether the individual words in the data were wrapped in double quotes, so the data may actually have been:

`[{"These", "words", "are", "contained", "by", "an", "Object", "or", "set", "which", "itself", "is", "contained", "by", "an", "array"}]`  
`[{"The", "next", "set", "of", "words", "is", "like", "this"}]`

With this example in hand, the challenge is to create a lot of data with a random number of words in each "sentence", where each word contains a random number of letters.

The word "random" immediately suggests `numpy.random`; however, none of the functions in that module can help us directly with this task. For that, we will also need `itertools` and a not-so-well-known (well, at least to me) way of splitting an array at unequal intervals using `numpy.split`.
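
To make the `numpy.split` idea concrete before building the real data, here is a minimal sketch on a tiny hand-written input. The names `lengths`, `letters`, `cut_points` and `small_words` are made up for this illustration only.

```python
import itertools

import numpy as np

# Hypothetical word lengths: three words of 2, 3 and 1 characters.
lengths = [2, 3, 1]
letters = list("ABCDEF")

# Running totals of the lengths: [2, 5, 6].
cut_points = list(itertools.accumulate(lengths))

# Split at every running total except the last one,
# which is simply the end of the array.
pieces = np.split(np.array(letters), cut_points[:-1])
small_words = ["".join(piece) for piece in pieces]
print(small_words)  # ['AB', 'CDE', 'F']
```

The pieces come out with unequal lengths because the second argument is a list of split positions rather than a number of equal sections.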

In [1]:
import pandas as pd
from numpy import random as rand
import numpy as np
import itertools as itools
#import unicodedata
import re

In [2]:
NUM_RECORDS = 1000 # Make it 1000000 if you want (smaller set only to keep the csv and possibly the ipynb file smaller)
MAX_WORDS_PER_SENTENCE = 20
MAX_CHARS_PER_WORD = 10

In [3]:
# Create a "template" (of sorts) for our records.
# This template holds the random number of words in each record
# (high is exclusive in randint, hence the +1).
sentence_templates = rand.randint(low=1, high=MAX_WORDS_PER_SENTENCE+1, size=NUM_RECORDS)

# Create another "template" which will be a random number of characters for each word.
# We know the count of all words across all records by using the sum() method of the numpy array sentence_templates
word_templates = rand.randint(low=1, high=MAX_CHARS_PER_WORD+1, size=sentence_templates.sum())
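
`randint` excludes `high` from the range it draws from, which is why both calls add 1 to the maximum. A quick sketch to confirm this, using a made-up constant `MAX_WORDS` and an arbitrary sample size:

```python
from numpy import random as rand

MAX_WORDS = 20  # hypothetical stand-in for MAX_WORDS_PER_SENTENCE
sample = rand.randint(low=1, high=MAX_WORDS + 1, size=10_000)

# Every draw lies in the closed range 1..MAX_WORDS.
in_range = bool(sample.min() >= 1 and sample.max() <= MAX_WORDS)
print(in_range)  # True
```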

In [4]:
# Let's just see how many words we will have in total
len(word_templates)

10293

In [5]:
# Now use itertools to create a running count of characters across all words
# For example, if the first 5 words had lengths of 7, 3, 4, 8, 2
# itertools.accumulate would return 7, 10, 14, 22, 24
# This running count is needed in the numpy.split method later
words_iter = list(itools.accumulate(word_templates))
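
As a small worked example of the running count, the sketch below reproduces the five numbers from the comment. It also shows `numpy.cumsum`, which is not used in this notebook but gives the same totals; the variable names are illustrative only.

```python
import itertools

import numpy as np

# The five counts from the example in the comment.
counts = np.array([7, 3, 4, 8, 2])

running_itertools = [int(v) for v in itertools.accumulate(counts)]
running_numpy = np.cumsum(counts).tolist()

print(running_itertools)  # [7, 10, 14, 22, 24]
print(running_numpy)      # [7, 10, 14, 22, 24]
```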

In [6]:
# Now let's create an array of random ASCII codes for A-Z (65-90; high is exclusive),
# with length equal to the last element of our running count.
# In the example of 5 above, 24 would be the size.
asc_codes = rand.randint(low=65, high=91, size=words_iter[-1])

In [7]:
# Let's take a peek at the ascii codes that were generated
asc_codes[:50]

array([72, 79, 86, 83, 66, 73, 90, 71, 66, 78, 67, 82, 78, 78, 68, 89, 86,
       69, 73, 86, 86, 69, 89, 74, 81, 83, 78, 66, 76, 68, 77, 86, 73, 83,
       77, 90, 76, 84, 71, 84, 78, 78, 79, 71, 76, 72, 80, 73, 89, 80])

In [8]:
#f = lambda x: chr(x)
#chars = f(asc_codes.tolist())
# The above did not work because chr() accepts a single integer, not a list,
# so we use a list comprehension to convert the ASCII codes to characters
chars = [chr(x) for x in asc_codes.tolist()]
chars[:20]

['H',
 'O',
 'V',
 'S',
 'B',
 'I',
 'Z',
 'G',
 'B',
 'N',
 'C',
 'R',
 'N',
 'N',
 'D',
 'Y',
 'V',
 'E',
 'I',
 'V']
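
For comparison, here is a sketch of the element-by-element conversion next to one possible bulk alternative that packs the codes into bytes and decodes them in a single call. The bulk version is my own assumption about what might be quicker, not something tested in this notebook, and it only works while every code is in the ASCII range.

```python
import numpy as np

codes = np.array([72, 79, 86, 83])  # the first four codes shown above

# chr() takes one integer, so it is applied to each element in turn.
via_comprehension = [chr(c) for c in codes.tolist()]

# Alternative: one byte per code, decoded in one call.
# Assumes every code is between 0 and 127.
via_bytes = list(codes.astype(np.uint8).tobytes().decode("ascii"))

print(via_comprehension)  # ['H', 'O', 'V', 'S']
print(via_bytes)          # ['H', 'O', 'V', 'S']
```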

In [9]:
# Taking the long list of characters that was generated by our conversion from ASCII to char
# we use the numpy.split() method in which the 2nd parameter tells us at which index locations
# to split the array (this is why itertools was handy)
words = [''.join(x) for x in np.split(chars, words_iter[:-1])]

# Let's see the first 20 words we created
words[:20]


['HO',
 'VSBIZ',
 'GBN',
 'CR',
 'NNDYVE',
 'IVVEYJQSNB',
 'L',
 'DM',
 'VISMZLT',
 'GT',
 'NNOGL',
 'HPIYP',
 'YNALADVSAQ',
 'K',
 'JYKEGFHK',
 'QVEUMAB',
 'UWHKT',
 'XAQWXCJLXF',
 'BS',
 'E']
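
A cheap sanity check at this stage is to confirm that every generated word has the length its template asked for. The sketch below repeats the steps on a small scale with its own hypothetical names (`word_lengths`, `ends`, `check_words`) so that it runs on its own.

```python
import itertools

import numpy as np
from numpy import random as rand

# Small, arbitrary sizes so the check runs instantly.
word_lengths = rand.randint(low=1, high=11, size=50)
ends = list(itertools.accumulate(word_lengths))
codes = rand.randint(low=65, high=91, size=int(ends[-1]))
letters = np.array([chr(c) for c in codes.tolist()])
check_words = ["".join(w) for w in np.split(letters, ends[:-1])]

# Each word should be exactly as long as its template entry.
lengths_match = [len(w) for w in check_words] == word_lengths.tolist()
print(lengths_match)  # True
```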

In [10]:
# At this point, we have a list of words but we still need to split those words up into sentences (or records)
# Once again, we use itertools to accumulate the count so that we can pass this to the numpy.split() method
sentences_iter = list(itools.accumulate(sentence_templates))

sentences_iter[:20]

[19,
 29,
 42,
 46,
 56,
 75,
 77,
 87,
 104,
 115,
 130,
 140,
 152,
 166,
 173,
 190,
 209,
 216,
 217,
 218]

In [11]:
# Here we apply the numpy.split method, but this time to a list of words rather than a list of characters

# For the source data, if we want the individual words to be wrapped in quotes, use:
#sentences = ['[{"' + '", "'.join(x) + '"}]' for x in np.split(words, sentences_iter[:-1])]
# For the source data, if we want no quotes around the individual words (as originally presented by my friend), use:
sentences = ['[{' + ', '.join(x) + '}]' for x in np.split(words, sentences_iter[:-1])]
sentences[:10]

['[{HO, VSBIZ, GBN, CR, NNDYVE, IVVEYJQSNB, L, DM, VISMZLT, GT, NNOGL, HPIYP, YNALADVSAQ, K, JYKEGFHK, QVEUMAB, UWHKT, XAQWXCJLXF, BS}]',
 '[{E, ZLGHBQVMLP, KMWTCT, RRQVHEHGLO, Q, ISQOQFUHAF, CB, SPHJE, KIIPT, TBNYVSV}]',
 '[{VUNFJGD, OWY, X, EBCOAETC, BYZYGNAXOC, ZCOMTJVB, AT, IYXHI, MOXTG, RXVAHA, J, WFIOOCSO, RJXBYHRNP}]',
 '[{IZO, YOFZAGOCT, TXRQGW, MF}]',
 '[{DUVKRNYHT, F, GWFIIUBYW, H, WKLEOSHCK, SKWPTPS, TPIMGLCUEI, VUACPAF, GVYHJVXZX, EER}]',
 '[{NA, QPE, GO, KSPHATCKZW, FFNHHZ, SZBB, MMATRGIEUR, HLHEJWQ, XLA, L, MFTRJ, CZD, NHITJGVZN, REHVC, VTVIR, I, INXP, NEPOZZDGC, MGDP}]',
 '[{RDRQ, UJGPJYNK}]',
 '[{NKXXAV, QJG, KDHNULPZM, JFUENMMDIH, BHPIRIZF, YOQYBJ, FIHZFQ, PIF, L, FHZCVUX}]',
 '[{NBR, HTKKYEZUD, UBMKBH, GITBIPAW, DIOFWYO, PWS, TUEX, LWHC, YFMSXQQVTM, OVFYAEZB, HGNS, FFCYHROQA, TVLB, UGACUBTEVU, RTYYUHVZNZ, BSBNU, LRQKNQKEQD}]',
 '[{WXBBF, AXXJR, IFBHNSHCR, VKQGZBRT, O, Y, SDHLYCXL, AVLVA, TCGLKPT, XXWGJE, R}]']
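
To see the two output formats side by side, here is a small sketch. The helper `format_sentence` and its `quoted` flag are hypothetical and exist only for this illustration. The sketch also shows that neither format is strict JSON, which is why the transformation later relies on string and regex operations rather than a JSON parser.

```python
import json


def format_sentence(words, quoted=False):
    """Hypothetical helper: wrap a list of words as '[{...}]'."""
    if quoted:
        return '[{"' + '", "'.join(words) + '"}]'
    return "[{" + ", ".join(words) + "}]"


sample_words = ["HO", "VSBIZ", "GBN"]
plain = format_sentence(sample_words)
with_quotes = format_sentence(sample_words, quoted=True)
print(plain)        # [{HO, VSBIZ, GBN}]
print(with_quotes)  # [{"HO", "VSBIZ", "GBN"}]

# Count how many of the two forms json.loads rejects.
rejected = 0
for text in (plain, with_quotes):
    try:
        json.loads(text)
    except json.JSONDecodeError:
        rejected += 1
print(rejected)  # 2
```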

In [12]:
# Now that we have a list of "sentences", we can push them into a Pandas data frame
df_sentences = pd.DataFrame(sentences, columns=["JSON_words"])
df_sentences.head()

Unnamed: 0,JSON_words
0,"[{HO, VSBIZ, GBN, CR, NNDYVE, IVVEYJQSNB, L, D..."
1,"[{E, ZLGHBQVMLP, KMWTCT, RRQVHEHGLO, Q, ISQOQF..."
2,"[{VUNFJGD, OWY, X, EBCOAETC, BYZYGNAXOC, ZCOMT..."
3,"[{IZO, YOFZAGOCT, TXRQGW, MF}]"
4,"[{DUVKRNYHT, F, GWFIIUBYW, H, WKLEOSHCK, SKWPT..."


In [13]:
# And now, we can save the data frame so that
# we can repeatedly test against the same data using different methods
df_sentences.to_csv("create_random_words_test_data.csv", sep="|", index=False)

In [14]:
# Read the data back from the file we saved, and re-display the head
df_map = pd.read_csv("create_random_words_test_data.csv", sep="|")
df_map.head()

Unnamed: 0,JSON_words
0,"[{HO, VSBIZ, GBN, CR, NNDYVE, IVVEYJQSNB, L, D..."
1,"[{E, ZLGHBQVMLP, KMWTCT, RRQVHEHGLO, Q, ISQOQF..."
2,"[{VUNFJGD, OWY, X, EBCOAETC, BYZYGNAXOC, ZCOMT..."
3,"[{IZO, YOFZAGOCT, TXRQGW, MF}]"
4,"[{DUVKRNYHT, F, GWFIIUBYW, H, WKLEOSHCK, SKWPT..."
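
The pipe separator matters here because the values themselves contain commas. The sketch below checks the round trip on two made-up rows, writing to an in-memory buffer rather than a file so that it leaves nothing on disk.

```python
import io

import pandas as pd

original = pd.DataFrame(["[{HO, VSBIZ}]", "[{E}]"], columns=["JSON_words"])

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
original.to_csv(buffer, sep="|", index=False)
buffer.seek(0)

restored = pd.read_csv(buffer, sep="|")
same_values = restored["JSON_words"].tolist() == original["JSON_words"].tolist()
print(same_values)  # True
```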


In [15]:
# Apply one of the transformations as an example

# If the source data does not contain words wrapped with double quotes, use:
df_map["JSON_words"] = df_map["JSON_words"].map(lambda x: ['"' + y + '"' for y in re.split("[, ]+", x[2:-2])])
# If the source data already contains words wrapped with double quotes, use:
#df_map["JSON_words"] = [re.sub("[{}]", "", str(x)) for x in df_map["JSON_words"]]
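
To see what each variant does to a single value, here is a sketch on two hand-written strings (`unquoted` and `quoted` are illustrative names). Note that the first variant produces a Python list, while the second produces a string.

```python
import re

unquoted = "[{HO, VSBIZ, GBN}]"
quoted = '[{"HO", "VSBIZ", "GBN"}]'

# Unquoted source: drop the outer "[{" and "}]", split on commas and
# spaces, then wrap each word in double quotes.
from_unquoted = ['"' + w + '"' for w in re.split("[, ]+", unquoted[2:-2])]
print(from_unquoted)  # ['"HO"', '"VSBIZ"', '"GBN"']

# Quoted source: removing the braces is enough.
from_quoted = re.sub("[{}]", "", quoted)
print(from_quoted)  # ["HO", "VSBIZ", "GBN"]
```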

In [16]:
df_map.head()

Unnamed: 0,JSON_words
0,"[""HO"", ""VSBIZ"", ""GBN"", ""CR"", ""NNDYVE"", ""IVVEYJ..."
1,"[""E"", ""ZLGHBQVMLP"", ""KMWTCT"", ""RRQVHEHGLO"", ""Q..."
2,"[""VUNFJGD"", ""OWY"", ""X"", ""EBCOAETC"", ""BYZYGNAXO..."
3,"[""IZO"", ""YOFZAGOCT"", ""TXRQGW"", ""MF""]"
4,"[""DUVKRNYHT"", ""F"", ""GWFIIUBYW"", ""H"", ""WKLEOSHC..."


In [17]:
df_map.tail()

Unnamed: 0,JSON_words
995,"[""JCN"", ""HLGXI"", ""BDXX"", ""A"", ""ZUFVLGWIT"", ""JV..."
996,"[""XCFLD"", ""HTBVRRUJT"", ""UFPFMMNEYC"", ""XEZPDNZYB""]"
997,"[""BCVZJH""]"
998,"[""CRM"", ""MXZKPLG"", ""AFMM"", ""KOHO"", ""KWIPO"", ""L..."
999,"[""DRHXVSLSV"", ""YOWKJBUJ"", ""DQJZNA"", ""T"", ""XDFX..."
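
With test data like this in place, comparing methods comes down to timing them on the same column. The sketch below is one way that could look, using `timeit` on a small stand-in frame. `with_comprehension` is a hypothetical alternative written for this comparison and is not one of the suggestions from the original discussion; the timings will vary by machine, so none are shown.

```python
import re
import timeit

import pandas as pd

# Tiny stand-in for the generated column.
df = pd.DataFrame({"JSON_words": ["[{HO, VSBIZ, GBN}]", "[{E, ZLG}]"] * 500})


def with_map(series):
    # The transformation used in this notebook.
    return series.map(
        lambda x: ['"' + y + '"' for y in re.split("[, ]+", x[2:-2])]
    )


def with_comprehension(series):
    # Hypothetical alternative: plain str.split in a list comprehension.
    return pd.Series(
        [['"' + y + '"' for y in x[2:-2].split(", ")] for x in series],
        index=series.index,
    )


for func in (with_map, with_comprehension):
    seconds = timeit.timeit(lambda: func(df["JSON_words"]), number=20)
    print(f"{func.__name__}: {seconds:.4f} s for 20 runs")
```

Before trusting any timing, it is worth checking that the candidates return the same result on the same input.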
