# Book Analysis

## 1. Necessary Information

**Book #1: Crime and Punishment (1866) by Fyodor Dostoevsky**

- URL: https://www.gutenberg.org/cache/epub/2554/pg2554-images.html
- License: Public Domain
- Encoding: UTF-8

**Book #2: The Great Gatsby (1925) by F. Scott Fitzgerald**

- URL: https://www.gutenberg.org/cache/epub/64317/pg64317-images.html
- License: Public Domain
- Encoding: UTF-8

## 2. Questions To Answer

**- Who was the most frequently mentioned character in The Great Gatsby?**

**- How many chapters is there in Crime and Punishment?**

**- What are the top 20 most frequently mentioned words in both books?**

## 3. Analysis

In [2]:
# Import necessary libraries
import os
import string
from collections import Counter
import re

In [3]:
# Define file paths for the books
book1_path = os.path.join('data', 'crimeandpunishment.txt')
book2_path = os.path.join('data', 'thegreatgatsby.txt')

In [4]:
# Open and read the books
with open(book1_path, 'r', encoding='utf-8') as file1:
    book1_text = file1.read()

with open(book2_path, 'r', encoding='utf-8') as file2:
    book2_text = file2.read()

To answer the first question, I will begin with preprocessing the text by removing any punctuation, converting to lowercase and splitting into words for easy and accurate implementation/analysis

In [5]:
def preprocess_text(text):
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.lower()
    words = text.split()
    return words
gatsby_words = preprocess_text(book1_text)

Next, I will define a list of character names to count mentions. The list of characters have been taken from the following website: https://www.sparknotes.com/lit/gatsby/characters/

In [6]:
character_names = ["gatsby", "daisy", "tom", "nick", "myrtle", "jordan", "george", "meyer", "owl", "klipspringer"]

In [7]:
# Count the mentions of each character in the text.
def character_frequency(words, character_names):
    character_counts = Counter()
    
    for word in words:
        for character in character_names:
            if character in word:
                character_counts[character] += 1
    
    return character_counts

gatsby_character_counts = character_frequency(gatsby_words, character_names)
print(gatsby_character_counts)

Counter({'tom': 81, 'owl': 60, 'nick': 4})


In [8]:
# Create a dictionary to store character mentions
character_mentions = dict(gatsby_character_counts)
for character, count in character_mentions.items():
    print(f"{character.capitalize()}: {count} mentions")

Owl: 60 mentions
Tom: 81 mentions
Nick: 4 mentions


In [9]:
most_frequent_character, frequency = gatsby_character_counts.most_common(1)[0]

In [10]:
print(f"The most frequently mentioned character in 'The Great Gatsby' is {most_frequent_character.capitalize()} with {frequency} mentions.")

The most frequently mentioned character in 'The Great Gatsby' is Tom with 81 mentions.


To answer the second question, I am going to define a pattern that indicates the parts and chapters of the book. In the code below, I used the 're' (regular expressions) module to help match and extract the necessary information since the book displays the chapter and part numbers in a specific pattern: "PART X" or "CHAPTER Y" in which X and Y are greek numerical letters, respectively. 

In [11]:
def extract_chapters(book_text):
    # Use regular expressions to find chapter headings with the pattern "CHAPTER X".
    chapter_pattern = r'CHAPTER [IVXLCDM]+'
    part_pattern = r'PART [IVXLCDM]+'
    chapter_headings = re.findall(chapter_pattern, book_text)
    part_headings = re.findall(part_pattern, book_text)
    
    # Count the total number of chapters.
    total_chapters = len(chapter_headings)
    total_parts = len(part_headings)
    
    # Print the chapter headings and parts as well as their total counts
    for part in part_headings:
        print(part)
    
    print(f"\nTotal Number of Parts: {total_parts}")   
    
    print("\n-----------------------------------------------------------\n")
    
    for chapter in chapter_headings:
        print(chapter)
 
    print(f"\nTotal Number of Chapters: {total_chapters}")


In [12]:
print('Table of Contents for Crime and Punishment\n')
extract_chapters(book1_text)

Table of Contents for Crime and Punishment

PART I
PART II
PART III
PART IV
PART V
PART VI

Total Number of Parts: 6

-----------------------------------------------------------

CHAPTER I
CHAPTER II
CHAPTER III
CHAPTER IV
CHAPTER V
CHAPTER VI
CHAPTER VII
CHAPTER I
CHAPTER II
CHAPTER III
CHAPTER IV
CHAPTER V
CHAPTER VI
CHAPTER VII
CHAPTER I
CHAPTER II
CHAPTER III
CHAPTER IV
CHAPTER V
CHAPTER VI
CHAPTER I
CHAPTER II
CHAPTER III
CHAPTER IV
CHAPTER V
CHAPTER VI
CHAPTER I
CHAPTER II
CHAPTER III
CHAPTER IV
CHAPTER V
CHAPTER I
CHAPTER II
CHAPTER III
CHAPTER IV
CHAPTER V
CHAPTER VI
CHAPTER VII
CHAPTER VIII

Total Number of Chapters: 39


For the third question, I am going to define a function to find the most frequent words in a text. However, I will firstly begin by identifying common stop words like "a" and "of" to make my analysis more accurate. By doing so, I can get more understanding on the actual diction of the text. 

In [13]:
# Define a list of common stop words
stop_words = ["a", "an", "the", "and", "of", "she", "he", "it", "in", "to",
              "in", "was", "had", "that", "with", "her", "him", "his", 
             "i", "a", "s", "at", "but", "you", "me", "as", "they", "them", 
             "my", "from", "this", "that", "is", "not", "there", "t", 
             "be", "so", "has", "have", "we", "on", "for", "like", 
             "by", "one", "or", "were", "your","will", "if", "do", "been", 
             "am", "as", "said"]

def tokenize_text(text):
    # Convert text to lowercase
    text = text.lower()
    
    # Tokenize by splitting on spaces and removing punctuation
    words = re.findall(r'\b\w+\b', text)
    # Filter out stop words
    words = [word for word in words if word not in stop_words]
    
    return words

In [14]:
# Function to find the most frequent words in text
def find_most_frequent_words(text, top_n=20):
    # Tokenize the text
    words = tokenize_text(text)
    
    # Count the frequency of each word
    word_counts = Counter(words)
    
    # Get the top N most frequent words
    top_words = word_counts.most_common(top_n)
    
    return top_words

In [15]:
gatsby_top_words = find_most_frequent_words(book2_text)
crime_top_words = find_most_frequent_words(book1_text)

print("Top 20 Most Frequent Words in The Great Gatsby:\n")
for word, count in gatsby_top_words:
    print(f"{word}: {count}")
print("\n-----------------------------------------------------------")
print("\nTop 20 Most Frequent Words in Crime and Punishment:\n")
for word, count in crime_top_words:
    print(f"{word}: {count}")

Top 20 Most Frequent Words in The Great Gatsby:

gatsby: 268
all: 239
out: 214
up: 194
tom: 191
daisy: 186
into: 168
about: 159
when: 147
what: 147
then: 144
over: 143
down: 118
who: 117
man: 114
no: 113
back: 109
came: 108
any: 105
some: 104

-----------------------------------------------------------

Top 20 Most Frequent Words in Crime and Punishment:

all: 1318
what: 1231
are: 870
raskolnikov: 785
no: 703
out: 685
up: 658
would: 573
now: 564
about: 538
how: 536
know: 530
too: 503
did: 497
could: 496
come: 480
man: 479
then: 471
very: 466
don: 464
