Open the English file

In [1]:
english_file = open("english.txt")

Using ```<file_object>.readlines()```, we can load the data so that each line of the text file becomes an individual element of a list. I print the first 10 elements of the ```readlines()``` output so you can see what it looks like. These elements match the first 10 lines of the english.txt file.

In [2]:
english_lines = english_file.readlines()
print(english_lines[:10])

['a\n', 'aa\n', 'aaa\n', 'aachen\n', 'aardvark\n', 'aardvarks\n', 'aardwolf\n', 'aardwolves\n', 'aarhus\n', 'aaron\n']
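
The file opened above is never explicitly closed. As a minimal sketch of one alternative, the same ```readlines()``` call can sit inside a ```with``` block, which closes the file when the block ends. The file name ```sample_english.txt``` and its three words are assumptions made up for this example so that it runs on its own; they are not the real english.txt.

```python
import os
import tempfile

# Hypothetical stand-in for english.txt so the sketch runs on its own
sample_path = os.path.join(tempfile.mkdtemp(), "sample_english.txt")
with open(sample_path, "w") as f:
    f.write("a\naa\naaa\n")

# The with block closes the file automatically when it ends
with open(sample_path) as sample_file:
    sample_lines = sample_file.readlines()

print(sample_lines)
print(sample_file.closed)
```

The list has the same shape as ```english_lines```: one string per line, each still ending in ```'\n'```.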


As you can see, the data is a bit noisy: every word ends with a new-line character, represented by ```'\n'```. We can remove it with a string method called ```replace()```, replacing the new-line character ```'\n'``` with the empty string ```''```. The following code block is temporary and is only there so you can see the words with the ```'\n'``` character removed.

In [3]:
print([s.replace('\n', '') for s in english_lines[:10]])

['a', 'aa', 'aaa', 'aachen', 'aardvark', 'aardvarks', 'aardwolf', 'aardwolves', 'aarhus', 'aaron']


Let's use this in loading our English data into a list:

In [4]:
training_dataset = []
target_dataset = []

for word in english_lines:
    # Clean the line by removing the new-line character
    cleaned_word = word.replace('\n', '')
    
    # Check if the length of the cleaned word is equal to 7, to get words with 7 characters.
    if len(cleaned_word) == 7:
        # Make a list to hold the ord representation of the word
        word_to_ord = []
        
        # Iterate through the cleaned word characters, ord the character, and append it to the word_to_ord list.
        for char in cleaned_word:
            word_to_ord.append(ord(char))
            
        # Append the ord'ed word to the training dataset
        training_dataset.append(word_to_ord)
        
        # Append the correct answer to the target dataset
        target_dataset.append(0)

We can see that we get what we want: each word is represented as a list of ```ord()``` values, and those lists sit within a bigger list.

In [5]:
print("First 10 training data:")
for word in training_dataset[:10]: print(word)
    
print(f"\nFirst 10 target data: \n{target_dataset[:10]}")

First 10 training data:
[97, 97, 114, 111, 110, 105, 99]
[97, 98, 97, 99, 116, 111, 114]
[97, 98, 97, 100, 100, 111, 110]
[97, 98, 97, 108, 111, 110, 101]
[97, 98, 97, 110, 100, 111, 110]
[97, 98, 97, 115, 104, 101, 100]
[97, 98, 97, 115, 104, 101, 115]
[97, 98, 97, 115, 105, 110, 103]
[97, 98, 97, 116, 105, 110, 103]
[97, 98, 97, 116, 111, 114, 115]

First 10 target data: 
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
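
A quick way to sanity-check the encoding is to reverse it. ```chr()``` is the inverse of ```ord()```, so joining the characters back together should recover the original word. The sketch below uses the first row of the output above; the variable names are just for illustration.

```python
# First row of the training data printed above
encoded_word = [97, 97, 114, 111, 110, 105, 99]

# chr() turns each code point back into its character
decoded_word = ''.join(chr(code) for code in encoded_word)

print(decoded_word)
```

This prints ```aaronic```, a 7-character word, which is what the length check is meant to keep.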


The above code was deliberately explicit to accommodate the comments. A much shorter version that does the same thing is below (this is how I would code it).

In [6]:
training_dataset = []
target_dataset = []

for line in english_lines:
    line = line.replace('\n', '')
    
    if len(line) == 7:
        training_dataset.append([ord(char) for char in line])
        target_dataset.append(0)

We can see that this code is functionally the same as the version above.

In [7]:
print("First 10 training data:")
for word in training_dataset[:10]: print(word)
    
print(f"\nFirst 10 target data: \n{target_dataset[:10]}")

First 10 training data:
[97, 97, 114, 111, 110, 105, 99]
[97, 98, 97, 99, 116, 111, 114]
[97, 98, 97, 100, 100, 111, 110]
[97, 98, 97, 108, 111, 110, 101]
[97, 98, 97, 110, 100, 111, 110]
[97, 98, 97, 115, 104, 101, 100]
[97, 98, 97, 115, 104, 101, 115]
[97, 98, 97, 115, 105, 110, 103]
[97, 98, 97, 116, 105, 110, 103]
[97, 98, 97, 116, 111, 114, 115]

First 10 target data: 
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
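
If the same steps need to be repeated for another word list, one option is to wrap them in a function. This is only a sketch: the function name ```encode_lines``` and its ```word_length``` and ```label``` parameters are hypothetical and not part of the code above, and the sample lines are made up so the block runs on its own.

```python
def encode_lines(lines, word_length=7, label=0):
    """Return (data, targets) for every line that has word_length characters."""
    data, targets = [], []
    for line in lines:
        word = line.replace('\n', '')
        if len(word) == word_length:
            data.append([ord(char) for char in word])
            targets.append(label)
    return data, targets

# Made-up sample input; only the two 7-character words should be kept
sample_lines = ['a\n', 'aaronic\n', 'abactor\n', 'aardvark\n']
data, targets = encode_lines(sample_lines)

print(data)
print(targets)
```

Calling ```encode_lines(english_lines)``` should then build the same two lists as the loop above.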


#### Hope this helps!