<h1>Task 1: Third-order letter approximation model</h1>

Task 1 requires me to select five free English works from Project Gutenberg. I then have to preprocess these texts by removing any preamble and postamble, and by removing all characters except ASCII letters, full stops, and spaces. Finally, I have to convert all letters to uppercase.

Once all the texts are preprocessed, I have to create a trigram model by counting the number of times each trigram (a sequence of three characters) appears. For example, "It is what it is." becomes "IT IS WHAT IT IS." when processed, because full stops are kept. This gives a trigram model like {'IT ': 2, 'T I': 3, ' IS': 2, 'IS ': 1, 'S W': 1, ' WH': 1, 'WHA': 1, 'HAT': 1, 'AT ': 1, ' IT': 1, 'IS.': 1}
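The example above can be checked with a few lines of Python. This is a minimal sketch, not the notebook's own implementation: the helper name count_trigrams is hypothetical, and collections.Counter is used here only for brevity.

```python
from collections import Counter

def count_trigrams(text):
    """Count every overlapping three-character window in text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# The processed form of "It is what it is." (the full stop is kept)
example = "IT IS WHAT IT IS."
counts = count_trigrams(example)

print(counts["T I"])  # 3
print(counts[" IT"])  # 1
print(len(counts))    # 11 distinct trigrams
```

A string of length n contains n - 2 overlapping trigrams, so the 17-character example yields 15 trigrams in total, 11 of them distinct.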

In [18]:
import os
import re

Initialises the directory and the dictionaries, and reads each text file.

- The texts dictionary stores the texts.
- The trigram_counts dictionary stores trigrams and their counts.
- The combined_trigrams dictionary stores the trigram counts combined across all texts.

In [19]:
# Directory containing the text files
directory = "texts"

texts = {}
# Initialize an empty dictionary for storing trigrams and their counts
trigram_counts = {}
# Initialize a dictionary to store combined trigram counts
combined_trigrams = {}

# Load each text file into a dictionary
for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        with open(os.path.join(directory, filename), "r", encoding="utf-8") as file:
            texts[filename] = file.read()

<h2>Prepare Text</h2>
The first step in preparing the texts is to remove any preamble and postamble. Project Gutenberg texts usually have a preamble that ends with "*** START OF THE PROJECT GUTENBERG EBOOK (EBOOK NAME) ***", and a postamble that starts with "*** END OF THE PROJECT GUTENBERG EBOOK (EBOOK NAME) ***". Both can be removed by searching for two markers, in this case "*** START" and "*** END". The function finds where each marker first occurs in the text. If both markers are found, the function extracts the portion of the text between them. If either marker is missing, the text is left unchanged.

The next step is to remove all characters except letters, full stops and spaces from the text. To do this, the function uses the re.sub() method, which replaces every match of a pattern. Any character that is not a letter, a full stop or a space is replaced with an empty string, which removes it. The cleaned text contains only letters, full stops and spaces.

Finally, all the characters are converted to uppercase using the .upper() method.
The .strip() method is also used to remove any leading or trailing whitespace from the text.

In [20]:
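def prepare_text(text):
    # Remove preamble and postamble from the text
    preamble = text.find("*** START")
    postamble = text.find("*** END")
    if preamble != -1 and postamble != -1:
        # Start after the marker line so the marker itself is not kept
        line_end = text.find("\n", preamble)
        start = line_end + 1 if line_end != -1 else preamble
        text = text[start:postamble]
    
    # Remove all characters except letters, full stops and spaces
    prepared_text = re.sub(r"[^A-Za-z. ]", "", text)
    
    # Convert to uppercase and trim leading/trailing whitespace
    return prepared_text.upper().strip()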
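As a quick sanity check, the cleaning steps can be tried on a small made-up sample. This is only a sketch: the sample text and the helper name clean_sample are invented for illustration, and dropping the start-marker line itself is an assumption about the intended behaviour. The regular expression is the same one described above.

```python
import re

def clean_sample(text):
    """Hypothetical helper that mirrors the cleaning steps on a tiny input."""
    start = text.find("*** START")
    end = text.find("*** END")
    if start != -1 and end != -1:
        # Assumption: skip to the end of the start-marker line
        line_end = text.find("\n", start)
        text = text[line_end + 1:end]
    # Keep only letters, full stops and spaces, then uppercase and trim
    return re.sub(r"[^A-Za-z. ]", "", text).upper().strip()

# Invented sample that imitates the layout of a Project Gutenberg file
sample = (
    "Licence notes\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "It is what it is.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "More licence notes\n"
)

print(clean_sample(sample))  # IT IS WHAT IT IS.
```

Note that line breaks are deleted rather than replaced by spaces, so the last word of one line is joined directly to the first word of the next.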

<h2>Build Trigram Model</h2>
The next step is to build a trigram model by counting the number of times each trigram (a sequence of three characters) appears.

- for i in range(len(prepared_text) - 2):<br>
First, the function iterates over every position in the text at which a trigram can start.

- trigram = prepared_text[i:i + 3]:<br>
Then, it slices the text to extract a substring of three characters.

- if trigram in trigram_counts:
    trigram_counts[trigram] += 1
else:
    trigram_counts[trigram] = 1

Finally, it checks whether the trigram is already in trigram_counts. If it is, its count is increased by 1. If it is not, its count is initialised to 1.

In [21]:
def build_trigram_model(prepared_text):
    # Use a fresh dictionary per text so counts from earlier files
    # are not carried over and merged twice
    trigram_counts = {}
    
    # Traverse the text to extract trigrams
    for i in range(len(prepared_text) - 2):
        # Extract a trigram
        trigram = prepared_text[i:i + 3]
        
        # Increment the count of the trigram in the dictionary
        if trigram in trigram_counts:
            trigram_counts[trigram] += 1
        else:
            trigram_counts[trigram] = 1
    
    return trigram_counts
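One detail worth checking is whether counts from one text can leak into the next when the function is called several times. The following is a minimal sketch, assuming a hypothetical variant named build_trigram_model_local that creates its dictionary inside the function and uses dict.get instead of the if/else test.

```python
def build_trigram_model_local(prepared_text):
    """Hypothetical variant: count trigrams into a dictionary created per call."""
    counts = {}
    for i in range(len(prepared_text) - 2):
        trigram = prepared_text[i:i + 3]
        # dict.get returns 0 for a trigram that has not been seen yet
        counts[trigram] = counts.get(trigram, 0) + 1
    return counts

first = build_trigram_model_local("THE CAT.")
second = build_trigram_model_local("THE DOG.")

print(first["THE"], second["THE"])  # 1 1
print("CAT" in second)              # False
```

Because each call returns its own dictionary, the second result contains nothing from the first text, which matters when per-file counts are later added together.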

Now each text file is prepared and its trigram counts are computed. The counts for each file are then merged into the combined_trigrams dictionary.

In [22]:
# Process each text that was loaded into the texts dictionary
for content in texts.values():
    # Prepare the text
    prepared_text = prepare_text(content)
    
    # Generate the trigram model
    trigram_model = build_trigram_model(prepared_text)
    
    # Merge the current file's trigram counts into the combined dictionary
    for trigram, count in trigram_model.items():
        combined_trigrams[trigram] = combined_trigrams.get(trigram, 0) + count
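The merge step can also be sketched with collections.Counter, whose update method adds counts instead of overwriting them. The helper name merge_counts and the two small dictionaries are made up for illustration, and the sketch assumes each dictionary holds the counts of one file only.

```python
from collections import Counter

def merge_counts(per_file_counts):
    """Sum a list of per-file trigram dictionaries into one Counter."""
    combined = Counter()
    for counts in per_file_counts:
        # Counter.update adds the given counts to the existing ones
        combined.update(counts)
    return combined

# Invented per-file counts
file_a = {"THE": 2, "HE ": 1}
file_b = {"THE": 3, "E C": 1}

combined = merge_counts([file_a, file_b])
print(combined["THE"])  # 5
```

The total of the combined counts should equal the sum of the per-file totals, which gives a simple check that nothing has been counted twice.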

The combined trigram counts are then written to trigrams.txt, one trigram per line in sorted order.

In [23]:
# Write the combined trigram counts to a single text file
output_file = "trigrams.txt"
with open(output_file, "w", encoding="utf-8") as file:
    for trigram, count in sorted(combined_trigrams.items()):
        file.write(f"{trigram}: {count}\n")
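If the counts need to be loaded again later, the same line format can be parsed back into a dictionary. This is a hedged sketch: read_counts and write_counts are hypothetical helpers, the sample counts are invented, and a temporary directory stands in for the real output location.

```python
import os
import tempfile

def write_counts(path, counts):
    """Write counts in the 'TRIGRAM: COUNT' line format."""
    with open(path, "w", encoding="utf-8") as file:
        for trigram, count in sorted(counts.items()):
            file.write(f"{trigram}: {count}\n")

def read_counts(path):
    """Parse 'TRIGRAM: COUNT' lines back into a dictionary."""
    counts = {}
    with open(path, "r", encoding="utf-8") as file:
        for line in file:
            # Remove only the newline, since trigrams may start or end with a space
            trigram, count = line.rstrip("\n").rsplit(": ", 1)
            counts[trigram] = int(count)
    return counts

# Invented sample counts, including trigrams with leading and trailing spaces
sample_counts = {" IS": 2, "IT ": 2, "IS.": 1}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "trigrams.txt")
    write_counts(path, sample_counts)
    restored = read_counts(path)

print(restored == sample_counts)  # True
```

Splitting on the last ": " is safe here because a prepared trigram can only contain letters, full stops and spaces, never a colon.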