# Task 1: Third-order letter approximation model

Select five free English works in Plain Text UTF8 format from Project Gutenberg. Use them to create a model of the English language as follows. Remove any preamble and postamble. Remove all characters except for (ASCII) letters (uppercase and lowercase), full stops, and spaces. Make all letters uppercase.

Next create a trigram model by counting the number of times each sequence of three characters (that is, each trigram) appears. You can design your own data structure for storing the results but explain your design and its rationale in your answer.
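
One possible design, sketched here as an assumption rather than a prescribed answer, is a flat dictionary keyed by the three-character string with the count as its value. Lookup and increment are constant time on average, only trigrams that actually occur are stored (at most 28^3 = 21,952 keys for 26 letters plus space and full stop), and counts from several books can be combined by adding per key. The sample string and variable names below are illustrative only.

```python
from collections import Counter

# Illustrative text, assumed to be already cleaned and uppercased.
sample = "THE CAT. THE HAT."

# Slide a window of width 3 over the text; each window is one trigram.
counts = Counter(sample[i:i + 3] for i in range(len(sample) - 2))

print(counts["THE"])         # occurrences of the trigram "THE"
print(sum(counts.values()))  # total windows, equal to len(sample) - 2
```

A text of n characters always yields n - 2 windows, which gives a quick sanity check on any counting routine.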

In [4]:
# Import libraries
import re
from collections import defaultdict

In [7]:
''' Functions defined below '''

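def preprocess_text(text):
    ''' Retain only ASCII letters, spaces, and full stops; return uppercase text '''
    text = re.sub(r'\s+', ' ', text)  # Turn line breaks and tabs into single spaces so words do not fuse
    text = re.sub(r'[^a-zA-Z. ]', '', text)  # Remove all other unwanted characters
    return text.upper()  # Convert to uppercase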

def extract_trigrams(text):
    trigram_counts = defaultdict(int)  # Default dictionary to store trigram counts
    for i in range(len(text) - 2):  # Sliding window of 3
        trigram = text[i:i+3]
        trigram_counts[trigram] += 1
    return trigram_counts

def process_gutenberg_text(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()

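    # Remove preamble and postamble. Gutenberg files commonly mark the body with lines such as
    # "*** START OF THE PROJECT GUTENBERG EBOOK <TITLE> ***" (older files say "THIS").
    start = re.search(r'\*\*\* ?START OF TH(?:E|IS) PROJECT GUTENBERG EBOOK[^\n]*\n', text)
    end = re.search(r'\*\*\* ?END OF TH(?:E|IS) PROJECT GUTENBERG EBOOK', text)
    if start and end:
        text = text[start.end():end.start()]  # Keep only the text between the two marker lines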

    # Preprocess the text and extract trigrams
    cleaned_text = preprocess_text(text)
    return extract_trigrams(cleaned_text)

def merge_trigram_counts(all_counts, new_counts):
    for trigram, count in new_counts.items():
        all_counts[trigram] += count
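
As a cross-check on `merge_trigram_counts`, the standard library's `collections.Counter` offers the same per-key addition through its `update` method. The sketch below uses two made-up count tables, not values taken from the books.

```python
from collections import Counter

# Made-up counts standing in for two books.
book_a = Counter({"THE": 3, "HE ": 2})
book_b = Counter({"THE": 1, "AND": 2})

total = Counter()
for book in (book_a, book_b):
    total.update(book)  # adds counts key by key rather than overwriting

print(total["THE"])          # 3 + 1
print(total.most_common(1))  # the single most frequent trigram with its count
```

`Counter` is a `dict` subclass that returns 0 for missing keys, so it could stand in for `defaultdict(int)` here if the built-in `most_common` helper is wanted.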

In [6]:
if __name__ == "__main__":
    # Paths to the downloaded texts
    file_paths = [
        "pride_and_prejudice.txt",
        "a_tale_of_two_cities.txt",
        "moby_dick.txt",
        "sherlock_holmes.txt",
        "dracula.txt",
    ]

    all_trigram_counts = defaultdict(int)
    for file_path in file_paths:
        book_trigrams = process_gutenberg_text(file_path)
        merge_trigram_counts(all_trigram_counts, book_trigrams)

    # Sort trigrams by frequency (optional)
    sorted_trigrams = sorted(all_trigram_counts.items(), key=lambda x: -x[1])

    # Display top 10 trigrams
    print("Top 10 Trigrams:")
    for trigram, count in sorted_trigrams[:10]:
        print(f"{trigram}: {count}")


FileNotFoundError: [Errno 2] No such file or directory: 'pride_and_prejudice.txt'
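
The error above means the text files were not found in the notebook's working directory. A minimal sketch of a guard, assuming the files are expected next to the notebook, is to check each path before processing it. The helper name `existing_paths` and the second file name are hypothetical.

```python
from pathlib import Path

def existing_paths(paths):
    """Return only the paths that exist as files, reporting the ones that do not."""
    found = []
    for p in map(Path, paths):
        if p.is_file():
            found.append(p)
        else:
            print(f"Missing: {p} (looked in {Path.cwd()})")
    return found

# Hypothetical usage: filter the list before looping over it.
usable = existing_paths(["pride_and_prejudice.txt", "no_such_file_for_demo.txt"])
print(f"{len(usable)} file(s) ready to process")
```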
