
Problem: we would like to know what traits define great authors by comparing the works of three historic and well-regarded authors: William Shakespeare, Jane Austen, and Charles Dickens. Specifically, we need to determine the total size of each author's vocabulary and the average length of a sentence in their works.



Requirements: write a Python program using the Natural Language Toolkit (NLTK) to tokenize the works (or a subset) of each author retrieved from Project Gutenberg. You should count the total number of unique words (types) that appear in the works of each author, as well as the average length of a sentence in the books. You do not have to use every book by the three authors in your investigation, but you should use at least the 3 most popular complete novels. Exclude poetry, as it will interfere with your investigation of sentence length.
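
To make the two measurements concrete, here is a minimal sketch on a hand-made toy corpus. It assumes the text has already been split into sentences and lowercase word tokens; the helper name `vocabulary_and_avg_length` and the sample sentences are illustrative and not part of the assignment.

```python
def vocabulary_and_avg_length(tokenized_sentences):
    """Return (number of unique word types, mean number of words per sentence)."""
    vocab = set()
    total_words = 0
    for sentence in tokenized_sentences:
        vocab.update(sentence)        # a set keeps each type only once
        total_words += len(sentence)  # every token counts toward length
    n_sentences = len(tokenized_sentences)
    avg = total_words / n_sentences if n_sentences else 0.0
    return len(vocab), avg


# Toy input: two already-tokenized sentences (hypothetical data).
toy = [
    ["it", "was", "the", "best", "of", "times"],
    ["it", "was", "the", "worst", "of", "times"],
]
types, avg_len = vocabulary_and_avg_length(toy)
print(types, avg_len)  # prints: 7 6.0
```

The twelve tokens contain only seven distinct types, which is the difference between counting words and counting vocabulary.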

Your program should do a good job of cleaning and normalizing the tokens.
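
One simple normalization rule, sketched below, is to lowercase each token, trim punctuation from both ends, and keep only what is purely alphabetic. The function name `normalize_tokens` and the sample tokens are assumptions made for illustration. This is one possible rule, not the only reasonable one.

```python
import string


def normalize_tokens(tokens):
    """Lowercase tokens, trim surrounding punctuation, keep alphabetic ones."""
    cleaned = []
    for token in tokens:
        token = token.lower().strip(string.punctuation)
        # str.isalpha() is False for empty strings, digits, and tokens
        # that still contain an internal apostrophe or hyphen.
        if token.isalpha():
            cleaned.append(token)
    return cleaned


# Hypothetical tokens as a word tokenizer might produce them.
raw = ["The", "Dog", ",", "the", "dog's", "--bone--", "1843", "!"]
print(normalize_tokens(raw))  # prints: ['the', 'dog', 'the', 'bone']
```

Note the trade-off: tokens with internal punctuation, such as `dog's`, are dropped entirely under this rule, so contractions and hyphenated words do not contribute to the vocabulary count.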


In [9]:
import requests
import nltk
import re
import time
import string

In [10]:
nltk.download('punkt_tab')
#tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
#sentences = tokenizer.tokenize(text)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [11]:
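def clean_gutenberg_text(text):
    """Return the text between Project Gutenberg's START and END marker lines.

    If either marker is missing, the text is returned unchanged.
    """
    # Marker lines look like "*** START OF THE PROJECT GUTENBERG EBOOK <TITLE> ***".
    start_match = re.search(r'\*\*\*\s*START OF (THIS|THE) PROJECT GUTENBERG EBOOK.*\*\*\*', text)
    end_match = re.search(r'\*\*\*\s*END OF (THIS|THE) PROJECT GUTENBERG EBOOK.*\*\*\*', text)
    if start_match and end_match:
        text = text[start_match.end():end_match.start()]
    return text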
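
As a quick check of the marker-stripping idea, the sketch below runs a variant of it on a made-up sample. The name `strip_gutenberg_boilerplate`, the case-insensitive flag, and the returned `found` flag are assumptions added for illustration; they are not part of the function above.

```python
import re

# Case-insensitive patterns for the marker lines (an assumption; the
# notebook's own function matches upper-case markers only).
START_RE = re.compile(
    r'\*\*\*\s*START OF (?:THIS|THE) PROJECT GUTENBERG EBOOK.*\*\*\*',
    re.IGNORECASE,
)
END_RE = re.compile(
    r'\*\*\*\s*END OF (?:THIS|THE) PROJECT GUTENBERG EBOOK.*\*\*\*',
    re.IGNORECASE,
)


def strip_gutenberg_boilerplate(text):
    """Return (body, found); found is False when a marker is missing."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    if start and end:
        return text[start.end():end.start()].strip(), True
    return text, False


# Hypothetical miniature "ebook" with a header, a body, and a footer.
sample = (
    "Project Gutenberg license header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "It was a dark night. The rain fell.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "License footer\n"
)
body, found = strip_gutenberg_boilerplate(sample)
print(found, repr(body))  # prints: True 'It was a dark night. The rain fell.'
```

Returning a flag makes it possible to warn when a file has no recognizable markers, since in that case the license text would otherwise be counted along with the novel.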

In [12]:
works = {
    'Jane Austen': [
        ("Pride and Prejudice", "https://www.gutenberg.org/files/1342/1342-0.txt"),
        ("Sense and Sensibility", "https://www.gutenberg.org/files/161/161-0.txt"),
        ("Emma", "https://www.gutenberg.org/files/158/158-0.txt")
    ],
    'Charles Dickens': [
        ("A Tale of Two Cities", "https://www.gutenberg.org/files/98/98-0.txt"),
        ("Oliver Twist", "https://www.gutenberg.org/files/730/730-0.txt"),
        ("Great Expectations", "https://www.gutenberg.org/files/1400/1400-0.txt")
    ],
    'William Shakespeare': [
        # Note: Shakespeare’s works are plays. We use three popular ones.
        ("Hamlet", "https://www.gutenberg.org/files/2265/2265-0.txt"),
        ("Macbeth", "https://www.gutenberg.org/files/2264/2264-0.txt"),
        ("Romeo and Juliet", "https://www.gutenberg.org/files/1112/1112-0.txt")
    ]
}

In [13]:
#import nltk

#nltk.download('punkt')

results = {}

for author, texts in works.items():
    vocab = set()         # Unique word types for this author.
    total_words = 0       # Total count of (cleaned) word tokens.
    total_sentences = 0   # Total count of sentences.
    print(f"Processing texts for {author}...")

    for title, url in texts:
        print(f"  Downloading '{title}'...")
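        try:
            # Fail fast on network problems and on HTTP errors (e.g. 404),
            # so an error page is never tokenized as if it were the book.
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            response.encoding = 'utf-8'
            text = response.text
        except requests.RequestException as e:
            print(f"Error downloading '{title}': {e}")
            continue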

        text = clean_gutenberg_text(text)

        # Tokenize the text into sentences.
        sentences = nltk.sent_tokenize(text)

        for sentence in sentences:
            # Tokenize the sentence into words.
            words = nltk.word_tokenize(sentence)
            cleaned_words = []
            for token in words:
                token = token.lower().strip(string.punctuation)
                # Include the token only if it is alphabetic.
                if token.isalpha():
                    cleaned_words.append(token)
                    vocab.add(token)
            # Count words in this sentence (if any remain after cleaning).
            if cleaned_words:
                total_words += len(cleaned_words)
                total_sentences += 1

        # Pause between downloads to avoid overloading Project Gutenberg's servers.
        time.sleep(1)

    # Calculate average sentence length.
    avg_sentence_length = total_words / total_sentences if total_sentences > 0 else 0
    results[author] = {
        'vocab_size': len(vocab),
        'avg_sentence_length': avg_sentence_length
    }
# Print out the results.
print("\n=== Analysis Results ===")
for author, data in results.items():
    print(f"\nAuthor: {author}")
    print(f"  Vocabulary Size: {data['vocab_size']}")
    print(f"  Average Sentence Length: {data['avg_sentence_length']:.2f} words")

Processing texts for Jane Austen...
  Downloading 'Pride and Prejudice'...
  Downloading 'Sense and Sensibility'...
  Downloading 'Emma'...
Processing texts for Charles Dickens...
  Downloading 'A Tale of Two Cities'...
  Downloading 'Oliver Twist'...
  Downloading 'Great Expectations'...
Processing texts for William Shakespeare...
  Downloading 'Hamlet'...
  Downloading 'Macbeth'...
  Downloading 'Romeo and Juliet'...

=== Analysis Results ===

Author: Jane Austen
  Vocabulary Size: 10455
  Average Sentence Length: 27.49 words

Author: Charles Dickens
  Vocabulary Size: 16661
  Average Sentence Length: 24.84 words

Author: William Shakespeare
  Vocabulary Size: 3924
  Average Sentence Length: 15.53 words
