<h1>Task 1: Third-order letter approximation model</h1>

For Task 1, I am tasked to select five free English works from Project Gutenberg. I then have to preprocess these texts by removing any preamble and postamble, and removing all characters except for ASCII letters, full stops, and spaces. Finally, I then have to make all letters uppercase.

When I have preprocessed all the texts, I then have to create a trigram model by counting the number of times each trigram (a sequence of three characters) appears. For example, "It is what it is." would become "IT IS WHAT IT IS" when processed. This will then give a trigram model like {'IT ': 2, 'T I': 3, ' IS': 2, 'IS ': 1, 'S W': 1, ' WH': 1, 'WHA': 1, 'HAT': 1, 'AT ': 1}

In [29]:
import os
import re
import random

Initialises the directory, inistialises the dictionaries and reads each text file.

- The texts dictionary stores the texts.
- The trigram_counts dictionary stores trigrams and their counts.

In [30]:
# Directory containing the text files
directory = "texts"

texts = {}
# Initialize an empty dictionary for storing trigrams and their counts
trigram_counts = {}

# Load each text file into a dictionary
for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        with open(os.path.join(directory, filename), "r", encoding="utf-8") as file:
            texts[filename] = file.read()

<h2>Prepare Text</h2>
The first step in preparing the texts is to remove any preamble and postamble. Project Gutenburg texts often have a preamble that ends with "*** START OF THE PROJECT GUTENBERG EBOOK (EBOOK NAME) ***", and a postamble that starts with "*** END OF THE PROJECT GUTENBERG EBOOK (EBOOK NAME) ***". This can be removed by setting markers, in this case "*** START" and "*** END". The function then searches for where these markers first occur in the text. If both markers are found, the function extracts the portion of the text between these markers.

The next step is to remove all characters except letters, full stops and spaces from the text. To do this, the function uses the re.sub() method. This is used to replace unwanted characters in the text. Any character that is not a letter, a full stop or a space is replaced with an empty string, effectively removing them. The text is now cleaned, leaving only letters, full stops and spaces.

Finally, I have to convert all the characters to uppercase. This is done by using the .upper() method, which transforms all the letters to uppercase.
The .strip() method is also used to remove any leading or trailing whitespace from the text.

In [31]:
def prepare_text(text):
    # Remove preamble and postamble from the text
    preamble = text.find("*** START")
    postamble = text.find("*** END")
    if preamble != -1 and postamble != -1:
        text = text[preamble:postamble]
    
    # Remove all characters except letters, full stops and spaces
    prepared_text = re.sub(r"[^A-Za-z. ]", "", text)
    
    return prepared_text.upper().strip()

<h2>Generate Trigrams</h2>
The next step is to generate the trigrams and count the number of times each trigram appears.

- for i in range(len(prepared_text) - 2):<br>
First, it iterates through the text to extract all possible trigrams.

- trigram = prepared_text[i:i + 3]:<br>
Then, we have to slice the text to extract a substring of 3 characters.

- if trigram in trigram_counts:
    trigram_counts[trigram] += 1
else:
    trigram_counts[trigram] = 1

This checks if the trigram is in trigram_counts. If it is, then increase the count by 1. If it's not, then initialise it's count to 1.

In [32]:
def build_trigram_model(prepared_text):
    # Iterate through the text to extract trigrams
    for i in range(len(prepared_text) - 2):
        # Extract a trigram
        trigram = prepared_text[i:i + 3]
        
        # Increment the count of the trigram in the dictionary
        if trigram in trigram_counts:
            trigram_counts[trigram] += 1
        else:
            trigram_counts[trigram] = 1
    
    return trigram_counts

Now we have to read all the text files, prepare them and get the trigram counts.

In [33]:
# Process each text file
for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        with open(os.path.join(directory, filename), "r", encoding="utf-8") as file:
            content = file.read()
        
        # Prepare the text
        prepared_text = prepare_text(content)
        
        build_trigram_model(prepared_text)

The trigram counts are displayed in task1trigrams.txt

In [34]:
with open("task1trigrams.txt", "w", encoding="utf-8") as file:
    for trigram, count in sorted(trigram_counts.items()):
        file.write(f"{trigram}: {count}\n")

<h1>Task 2: Third-order letter approximation generation</h1>

In this task, I have to use the model from task 1 to make a string with 10,000 characters, starting with "TH." For each new character, look at the last two characters to find matching trigrams in the model. Then, pick the next character randomly based on how often it appears in those trigrams.

First, I have to select the starting string, which is "TH".

In [35]:
start = 'TH'

Now I have to make the loop that will find all the trigrams in the trigram_counts dictionary whose first two characters match the last two characters of the "start" string, randomly select the next character based on the weights, and append it to the "start" string. For referance, I used the notes provided on the emerging-technologies repo: https://github.com/ianmcloughlin/2425_emerging_technologies/blob/main/02_language_models.ipynb.

<h5>letters, weights = list(zip(*[(x[2], trigram_counts[x]) for x in trigram_counts.keys() if x[:2] == start[-2:]]))</h5>

- x[:2] == start[-2:]:<br>
This will filter trigrams whose first two characters (x[:2]) match the last two characters of start (start[-2:]).
- for x in trigram_counts.keys() if x[:2] == start[-2:]]:<br>
If they are equal, retrieve all trigrams stored in the trigram_counts dictionary.
- (x[2], trigram_counts[x]):<br>
(x[2]) extracts the third character of the trigram, and (trigram_counts[x]) retrieves the count of that trigram from the model, which will be used as the weight for weighted random selection.
- zip: Separates the third characters into a letters tuple and their corresponding counts into a weights tuple.

<h5>start += random.choices(letters, weights=weights, k=1)[0]</h5>

In this line,
- letters represents the list of possible next characters,
- weights represents their corresponding counts, which is used as probabilities,
- and k=1 generates one random choice.

[0] then extracts the single character that is returned, and it is then appended to start.

In [36]:
for i in range(1, 9999):
    # Select all of the keys that start with the last 2 characters in start.
    letters, weights = list(zip(*[(x[2], trigram_counts[x]) for x in trigram_counts.keys() if x[:2] == start[-2:]]))

    # Generate the next character.
    start += random.choices(letters, weights=weights, k=1)[0]

The generated text is displayed in task2randomstring.txt

In [37]:
with open("task2randomstring.txt", "w", encoding="utf-8") as file:
    file.write(start)

<h1>Task 3: Analyze the model</h1>

For this task, I have to copy the list of English words available in words.txt from the assignment repository to my own repository. I will then use it to determine the percentage of words in my 10,000 characters that are actual words in the English language.

First I have load the list of English words into a set. A set is used here rather than a list because it's faster for checking if a word exists. This is because sets are implemented as hash tables, so checking membership is a constant-time operation: O(1). In contrast, checking membership in a list requires scanning through each element, which is a linear-time operation: O(n).

I then prepare the text by using .upper() to convert the words to uppercase to match the format of the text, and use .strip() to remove any extra spaces.

In [38]:
with open("words.txt", "r", encoding="utf-8") as file:
    words = set(word.strip().upper() for word in file.readlines())

Then I split the text into words by using start.split(). This will split the text by spaces, creating a list of words. I also initialise the word count so that it starts from 0 every time it's ran.

In [39]:
split_words = start.split()
word_count = 0

Now I have to count the number of English words. To do this, I iterate through the words in split_words and check if the word exists in the words set. If it does, increase the count of word_count by 1.

In [40]:
for word in split_words:
    if word in words:
        word_count = word_count + 1

The final step is to calculate the percentage of English words. This is done by dividing the word_count by the total_words, and multiplying by 100.

In [41]:
total_words = len(start)
percentage = (word_count / total_words) * 100

Now we can print our results to see the total words, word count, and the percentage of English words.

In [42]:
print(f"Total words: {total_words}")
print(f"English words: {word_count}")
print(f"Percentage of English words: {percentage:.2f}%")

Total words: 10000
English words: 623
Percentage of English words: 6.23%
