# Task 1: Third-order letter approximation model

###### Task description:
Select five free English works in Plain Text UTF8 format from Project Gutenberg. Use them to create a model of the English language as follows. Remove any preamble and postamble. Remove all characters except for (ASCII) letters (uppercase and lowercase), full stops, and spaces. Make all letters uppercase.

Next create a trigram model by counting the number of times each sequence of three characters (that is, each trigram) appears. You can design your own data structure for storing the results but explain your design and its rationale in your answer.

In [17]:
#import libraries
import re
from collections import defaultdict
import requests
import random

In [26]:
import requests
import random
import re

def task_one():

    def select_random_gutenberg_works():
        """
        Fetches the list of top 100 eBooks from Project Gutenberg,
        parses the links manually without external libraries, and selects five works randomly.

        Returns:
            list: A list of tuples with eBook titles and their download URLs in Plain Text UTF-8 format.
        """
        url = "https://www.gutenberg.org/browse/scores/top"

        try:
            # Fetch the top 100 eBooks page
            response = requests.get(url, timeout=30)
            response.raise_for_status()

            # Extract links to eBook pages with a regex. The same book can be linked
            # more than once on the page, so drop repeats while keeping their order.
            page_content = response.text
            ebook_links = re.findall(r'href="(/ebooks/\d+)"', page_content)
            ebook_links = [f"https://www.gutenberg.org{link}" for link in dict.fromkeys(ebook_links)]

            if len(ebook_links) < 5:
                print("Not enough eBooks found.")
                return []

            # Randomly select 5 eBook pages
            selected_pages = random.sample(ebook_links, 5)
            ebooks_with_text_links = []

            for page_url in selected_pages:
                try:
                    # Fetch each eBook page
                    page_response = requests.get(page_url, timeout=30)
                    page_response.raise_for_status()

                    # Extract the Plain Text UTF-8 link; [^"]* keeps the match inside one href
                    text_match = re.search(r'href="([^"]*\.txt\.utf-8)"', page_response.text)
                    title_match = re.search(r'<title>(.*?)\| Project Gutenberg</title>', page_response.text)

                    if text_match and title_match:
                        full_text_url = f"https://www.gutenberg.org{text_match.group(1)}"
                        title = title_match.group(1).strip()
                        ebooks_with_text_links.append((title, full_text_url))

                except requests.RequestException as e:
                    print(f"Error fetching eBook page {page_url}: {e}")

            if len(ebooks_with_text_links) < 5:
                print("Not enough eBooks with Plain Text UTF-8 format found.")
                return []

            return ebooks_with_text_links

        except requests.RequestException as e:
            print(f"Error fetching data from Project Gutenberg: {e}")
            return []

    # Select five works and list their titles and download URLs
    selected_works = select_random_gutenberg_works()
    print(selected_works)
    for idx, (title, url) in enumerate(selected_works, 1):
        print(f"{idx}. {title}: {url}")


task_one()        

[('Second Treatise of Government by John Locke', 'https://www.gutenberg.org/ebooks/7370.txt.utf-8'), ('The Reign of Greed by José Rizal', 'https://www.gutenberg.org/ebooks/10676.txt.utf-8'), ('Great Expectations by Charles Dickens', 'https://www.gutenberg.org/ebooks/1400.txt.utf-8'), ('A Room with a View by E. M. Forster', 'https://www.gutenberg.org/ebooks/2641.txt.utf-8'), ('Dracula by Bram Stoker', 'https://www.gutenberg.org/ebooks/345.txt.utf-8')]
1. Second Treatise of Government by John Locke: https://www.gutenberg.org/ebooks/7370.txt.utf-8
2. The Reign of Greed by José Rizal: https://www.gutenberg.org/ebooks/10676.txt.utf-8
3. Great Expectations by Charles Dickens: https://www.gutenberg.org/ebooks/1400.txt.utf-8
4. A Room with a View by E. M. Forster: https://www.gutenberg.org/ebooks/2641.txt.utf-8
5. Dracula by Bram Stoker: https://www.gutenberg.org/ebooks/345.txt.utf-8


In [7]:

def select_texts():
    print('texts selected : ')


# Helper functions for cleaning the text and counting trigrams

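def preprocess_text(text):
    """Keep only ASCII letters, spaces, and full stops, then uppercase."""
    # Turn newlines and tabs into spaces first so words at line breaks do not merge
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'[^a-zA-Z. ]', '', text)  # Remove all other characters
    return text.upper()  # Convert to uppercase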

def extract_trigrams(text):
    trigram_counts = defaultdict(int)  # Default dictionary to store trigram counts
    for i in range(len(text) - 2):  # Sliding window of 3
        trigram = text[i:i+3]
        trigram_counts[trigram] += 1
    return trigram_counts

def process_gutenberg_text(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()

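    # Remove preamble and postamble. The marker lines read '*** START OF THE ...' or
    # '*** START OF THIS ...' depending on the file, so accept both wordings.
    start = re.search(r'\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^\n]*\n', text)
    end = re.search(r'\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK', text)
    if start and end:
        # Keep only the body; the whole START line (title included) is dropped
        text = text[start.end():end.start()]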

    # Preprocess the text and extract trigrams
    cleaned_text = preprocess_text(text)
    return extract_trigrams(cleaned_text)

def merge_trigram_counts(all_counts, new_counts):
    for trigram, count in new_counts.items():
        all_counts[trigram] += count

In [20]:
if __name__ == "__main__":
    # Paths to the downloaded texts
    file_paths = [
        "pride_and_prejudice.txt",
        "a_tale_of_two_cities.txt",
        "moby_dick.txt",
        "sherlock_holmes.txt",
        "dracula.txt",
    ]

    all_trigram_counts = defaultdict(int)
    for file_path in file_paths:
        book_trigrams = process_gutenberg_text(file_path)
        merge_trigram_counts(all_trigram_counts, book_trigrams)

    # Sort trigrams by frequency (optional)
    sorted_trigrams = sorted(all_trigram_counts.items(), key=lambda x: -x[1])

    # Display top 10 trigrams
    print("Top 10 Trigrams:")
    for trigram, count in sorted_trigrams[:10]:
        print(f"{trigram}: {count}")


NameError: name 'process_gutenberg_text' is not defined


# Task 1: Third-order letter approximation model
Select five free English works in Plain Text UTF8 format from Project Gutenberg. Use them to create a model of the English language as follows. Remove any preamble and postamble. Remove all characters except for (ASCII) letters (uppercase and lowercase), full stops, and spaces. Make all letters uppercase.

Next create a trigram model by counting the number of times each sequence of three characters (that is, each trigram) appears. You can design your own data structure for storing the results but explain your design and its rationale in your answer.

For example, the sentence: It is what it is. would become IT IS WHAT IT IS. This will give a model like {'IT ': 2, 'T I': 3, ' IS': 2, 'IS ': 1, ...}.
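
The worked example above can be checked with a short sketch. The helper name `count_trigrams` is an assumption made for illustration, and `Counter` is used here only for brevity; it is not necessarily the data structure the notebook settles on.

```python
import re
from collections import Counter


def count_trigrams(text):
    """Clean the text as the task describes, then count every 3-character window."""
    # Keep ASCII letters, full stops, and spaces only, then uppercase
    cleaned = re.sub(r"[^A-Za-z. ]", "", text).upper()
    # Slide a window of width 3 across the cleaned string
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))


model = count_trigrams("It is what it is.")
print(model["IT "], model["T I"], model[" IS"], model["IS "])  # prints: 2 3 2 1
```

A flat mapping from trigram to count keeps counting simple and converts directly to JSON; the trade-off is that generation needs a second index keyed by the first two characters.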

# Task 2: Third-order letter approximation generation
Use your model from Task 1 to generate a string of 10,000 characters. Start with the string TH. Generate each next character by looking at the previous two characters. Find the trigrams in your model that start with those two characters. Randomly select one of the third letters of those trigrams, using the counts as weights.

For example, suppose your model has five trigrams starting with TH: THE appeared 150 times, THA appeared 70 times, THI 60 times, TH followed by a space 50 times, and TH. appeared 10 times. The total of the counts is 340. Select the next character as E with probability 150/340, A with probability 70/340, and so on.
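
One possible way to implement this weighted selection is sketched below. The function names `build_lookup` and `generate` are hypothetical, and the sketch assumes the model is a flat dictionary of trigram counts as in Task 1. It stops early if the last two characters have no continuation in the model, which is a design choice rather than part of the task.

```python
import random
from collections import defaultdict


def build_lookup(trigram_counts):
    """Index the model by its first two characters: prefix -> {next_char: count}."""
    lookup = defaultdict(dict)
    for trigram, count in trigram_counts.items():
        lookup[trigram[:2]][trigram[2]] = count
    return lookup


def generate(trigram_counts, length=10_000, seed="TH"):
    """Extend the seed one character at a time, weighting choices by trigram count."""
    lookup = build_lookup(trigram_counts)
    out = list(seed)
    while len(out) < length:
        options = lookup.get("".join(out[-2:]))
        if not options:
            break  # no trigram starts with this pair, so stop early
        chars = list(options)
        weights = [options[c] for c in chars]
        out.append(random.choices(chars, weights=weights, k=1)[0])
    return "".join(out)


# The five TH trigrams from the example above
example = {"THE": 150, "THA": 70, "THI": 60, "TH ": 50, "TH.": 10}
th_options = build_lookup(example)["TH"]
print(sum(th_options.values()))  # prints: 340
print(generate(example, length=3))
```

Building the prefix index once avoids rescanning every trigram for each of the 10,000 generated characters.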

# Task 3: Analyze your model
Copy the list of English words available in words.txt in this repository to your own repository. Use it to determine the percentage of words in your 10,000 characters that are actual words in the English language.
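
A minimal sketch of this check follows. It assumes `words.txt` holds one word per line, and that a "word" in the generated text is a run of characters between spaces with full stops stripped from its ends; both are assumptions, as are the function names.

```python
def load_vocabulary(path="words.txt"):
    """Read one word per line into an uppercase set (file layout is an assumption)."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip().upper() for line in fh if line.strip()}


def percent_real_words(generated, vocabulary):
    """Return the percentage of space-separated tokens found in the vocabulary."""
    # Strip full stops from token ends and discard tokens that were only full stops
    tokens = [t.strip(".") for t in generated.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in vocabulary)
    return 100.0 * hits / len(tokens)


# Tiny in-memory vocabulary so the example runs without words.txt
print(percent_real_words("THE CAT. XQZ SAT", {"THE", "CAT", "SAT"}))  # prints: 75.0
```

With the real data, the call would look like `percent_real_words(text, load_vocabulary())`, where `text` is the 10,000-character string from Task 2.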

# Task 4: Export your model as JSON
Export your model as JavaScript Object Notation (JSON), saving it in your repository as trigrams.json.
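
The export could be sketched as below, assuming the model is a dictionary (or `defaultdict`) of trigram counts. The names `export_model` and `load_model` are hypothetical; `load_model` is included only to show that the counts survive a round trip.

```python
import json


def export_model(trigram_counts, path="trigrams.json"):
    """Write the trigram counts to a JSON file, sorted by trigram for stable diffs."""
    with open(path, "w", encoding="utf-8") as fh:
        # dict() turns a defaultdict or Counter into a plain mapping
        json.dump(dict(trigram_counts), fh, indent=2, sort_keys=True)


def load_model(path="trigrams.json"):
    """Read the trigram counts back from the JSON file."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

Because the keys are plain strings and the values are integers, the model needs no conversion step beyond `dict()` before serialising.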
