#### Task 4:

1. Download Alice in Wonderland by Lewis Carroll from Project Gutenberg's website http://www.gutenberg.org/files/11/11-0.txt
2. Perform any necessary preprocessing on the text, including converting to lower case, removing stop words, numbers / non-alphabetic characters, lemmatization.
3. Find Top 10 most important (for example, in terms of TF-IDF metric) words from each chapter in the text (not "Alice"); how would you name each chapter according to the identified tokens?
4. Find the Top 10 most used verbs in sentences with Alice. What does Alice do most often?

## 1. Download Alice in Wonderland by Lewis Carroll from Project Gutenberg's website http://www.gutenberg.org/files/11/11-0.txt

In [41]:
!wget https://www.gutenberg.org/files/11/11-0.txt

--2023-04-11 03:24:24--  https://www.gutenberg.org/files/11/11-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 174313 (170K) [text/plain]
Saving to: ‘11-0.txt.6’


2023-04-11 03:24:25 (241 KB/s) - ‘11-0.txt.6’ saved [174313/174313]
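The log above shows wget saving to `11-0.txt.6`, which means earlier copies already existed, while the later cells read `11-0.txt`. A minimal sketch of a download step that always targets one fixed filename follows; `fetch_once` is a hypothetical helper introduced here for illustration and is not part of the original notebook.

```python
from pathlib import Path
from urllib.request import urlopen

URL = "https://www.gutenberg.org/files/11/11-0.txt"

def fetch_once(url: str, dest: str) -> Path:
    """Download url to dest unless dest already exists (hypothetical helper)."""
    path = Path(dest)
    if not path.exists():
        with urlopen(url) as response:
            path.write_bytes(response.read())
    return path

# Usage (performs a network request only when the file is missing):
# book_path = fetch_once(URL, "11-0.txt")
```

With this approach, re-running the cell neither re-downloads the book nor creates numbered duplicates.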



## 2. Perform any necessary preprocessing on the text, including converting to lower case, removing stop words, numbers / non-alphabetic characters, lemmatization.

In [42]:
!pip3.11 install nltk



In [43]:
import nltk

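# The Gutenberg file is UTF-8; 'utf-8-sig' also drops a leading BOM if present
with open('11-0.txt', 'r', encoding='utf-8-sig') as f:
    text = f.read()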

In [44]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

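# Keep only letters and whitespace (drops digits and punctuation, including apostrophes)
text = ''.join(char for char in text if char.isalpha() or char.isspace())
words = nltk.word_tokenize(text)

# Remove stop words and lemmatize. Case is left untouched here, which keeps the
# 'CHAPTER ' markers intact for the split in step 3; as a side effect the stop-word
# check is case-sensitive, so capitalized forms such as 'The' are kept.
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]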

# Joining the words back to form the preprocessed text
preprocessed_text = ' '.join(words)

# preprocessed_text
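
Step 2 asks for lower-casing, but the cell above compares tokens against the stop-word list without changing case, which is presumably why `the` appears among the Chapter 8 tokens in step 3. A minimal stdlib-only sketch of the intended order of operations follows. It uses a tiny stand-in stop-word set (`DEMO_STOP_WORDS` is an assumption for illustration, not NLTK's list) and leaves out lemmatization.

```python
import re

# Stand-in stop-word set for illustration only (an assumption, not NLTK's list).
DEMO_STOP_WORDS = {"the", "a", "and", "was", "to", "of", "her", "by"}

def preprocess(raw: str) -> list[str]:
    """Lower-case, keep alphabetic tokens only, then drop stop words."""
    tokens = re.findall(r"[a-z]+", raw.lower())
    return [tok for tok in tokens if tok not in DEMO_STOP_WORDS]

print(preprocess("The Queen was furious, and Alice ran to the garden by 5."))
```

Lower-casing this early would also lower-case the `'CHAPTER '` markers used for splitting in step 3, so in the notebook the chapters would need to be split before this step.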

In [45]:
preprocessed_text[1047:]
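
The slice `[1047:]` hard-codes where the book body starts and will drift if the file or the preprocessing changes. A sketch of an alternative that locates chapter headings with a regular expression on the raw text is shown below. It assumes headings look like `CHAPTER I.` at the start of a line and that table-of-contents entries are much shorter than the chapters themselves; both assumptions should be checked against the actual file.

```python
import re

HEADING = re.compile(r"^[ \t]*CHAPTER ([IVXLC]+)\.", re.MULTILINE)

def split_chapters(raw: str) -> dict[str, str]:
    """Map each Roman numeral to the longest text segment that follows its heading."""
    matches = list(HEADING.finditer(raw))
    chapters: dict[str, str] = {}
    for current, nxt in zip(matches, matches[1:] + [None]):
        end = nxt.start() if nxt else len(raw)
        body = raw[current.end():end].strip()
        # Table-of-contents entries are short, so the longest segment wins.
        if len(body) > len(chapters.get(current.group(1), "")):
            chapters[current.group(1)] = body
    return chapters

# Made-up miniature of the file layout, for illustration only.
sample = (
    "Contents\n CHAPTER I. Down the Rabbit-Hole\n CHAPTER II. The Pool of Tears\n\n"
    "CHAPTER I.\nDown the Rabbit-Hole\n\nAlice was beginning to get very tired.\n\n"
    "CHAPTER II.\nThe Pool of Tears\n\nCuriouser and curiouser! cried Alice.\n"
)
for numeral, body in split_chapters(sample).items():
    print(numeral, "->", body[:40])
```

The last segment would still run into the Project Gutenberg license text, so it needs trimming at `THE END`, as the notebook does in step 3.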



## 3. Find Top 10 most important (for example, in terms of TF-IDF metric) words from each chapter in the text (not "Alice"); how would you name each chapter according to the identified tokens?

In [46]:
!pip3.11 install scikit-learn



In [48]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Splitting the text into chapters
chapter_list = preprocessed_text[1047:].split('CHAPTER ')[1:]
chapters = []
for chapter in chapter_list:
    chapter_parts = chapter.split('THE END')
    if len(chapter_parts) > 1:
        chapters.append(chapter_parts[0] + 'THE END')
    else:
        chapters.append(chapter_parts[0])

# Vectorizing the preprocessed text
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(chapters)
feature_array = vectorizer.get_feature_names_out()

# Finding the top 10 most important words from each chapter
for i, chapter_tfidf in enumerate(tfidf):
    sorted_indices = np.argsort(chapter_tfidf.toarray()).flatten()[::-1]
    top_10 = [feature_array[index] for index in sorted_indices[:11] if feature_array[index] != 'alice'][:10]
    chapter_name = ', '.join(top_10)
    print(f'Chapter {i+1}: {chapter_name}')


Chapter 1: little, bat, door, key, eat, think, like, way, either, see
Chapter 2: mouse, pool, little, im, swam, cat, dear, said, foot, mabel
Chapter 3: mouse, said, dodo, lory, dry, prize, thimble, know, bird, soon
Chapter 4: little, window, puppy, rabbit, bill, bottle, glove, fan, one, said
Chapter 5: caterpillar, said, serpent, pigeon, im, egg, youth, size, father, little
Chapter 6: said, cat, footman, baby, duchess, mad, pig, wow, like, cook
Chapter 7: hatter, dormouse, said, march, hare, twinkle, time, tea, draw, know
Chapter 8: queen, said, hedgehog, gardener, the, king, soldier, five, executioner, procession
Chapter 9: said, mock, turtle, gryphon, duchess, moral, queen, went, say, day
Chapter 10: turtle, mock, gryphon, said, lobster, dance, beautiful, soup, join, whiting
Chapter 11: king, hatter, said, court, dormouse, witness, queen, officer, juror, breadandbutter
Chapter 12: said, king, jury, sister, dream, queen, would, slate, rabbit, fit
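
Tokens such as `said` rank highly in almost every chapter even though they occur everywhere. With `TfidfVectorizer`'s default settings (smoothed IDF, L2 normalization), the IDF factor never drops below 1, so a very frequent word keeps a large weight. A stdlib-only sketch of that default formula on three toy documents (the documents are made up for illustration):

```python
import math
from collections import Counter

def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """TF-IDF with smoothed IDF, ln((1 + n) / (1 + df)) + 1, and L2 normalization."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    weighted = []
    for doc in docs:
        raw = {t: count * idf[t] for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({t: w / norm for t, w in raw.items()})
    return weighted

docs = [
    "said said said hatter tea".split(),
    "said said queen croquet".split(),
    "said turtle turtle soup".split(),
]
for i, weights in enumerate(tfidf(docs), start=1):
    top = sorted(weights, key=weights.get, reverse=True)[:2]
    print(f"Doc {i}: {top}")
```

Here `said` occurs in every document, yet it still tops the first two because its raw count outweighs the rarer terms. Adding such words to the vectorizer's `stop_words`, or setting `max_df`, is one way to suppress them.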


## 3b. How would you name each chapter according to the identified tokens?

**Based on the identified tokens, each chapter could be named as follows:**

Chapter 1: Little Door and the Key
Chapter 2: Mouse, Pool, and the Cat
Chapter 3: Dodo, Lory, and the Thimble Prize
Chapter 4: Window, Puppy, and Rabbit
Chapter 5: Caterpillar, Serpent, and the Pigeon
Chapter 6: Footman, Baby, and the Duchess
Chapter 7: Hatter, Dormouse, and the Tea Party
Chapter 8: Queen, Hedgehog, and the Gardeners
Chapter 9: Turtle, Gryphon, and the Duchess' Morals
Chapter 10: Lobster Dance and Beautiful Soup
Chapter 11: King, Hatter, and the Court Witness
Chapter 12: King's Family and the Jury

These nouns/names provide a general idea of the characters or themes in each chapter, based on the top tokens identified.

## 4. Find the Top 10 most used verbs in sentences with Alice. What does Alice do most often?

In [8]:
!pip3.11 install spacy
!python3.11 -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [9]:
import spacy

nlp = spacy.load("en_core_web_sm")

# Tokenize the text into sentences
doc = nlp(text)
sentences = [sent for sent in doc.sents]

# Find sentences containing 'Alice'
alice_sentences = [sent for sent in sentences if 'Alice' in sent.text]

# Extract verbs from these sentences
verb_counts = {}

for sent in alice_sentences:
    for token in sent:
        if token.text == 'Alice' and token.dep_ == 'nsubj':
            verb = token.head
            if verb.pos_ == 'VERB':
                verb_counts[verb.lemma_] = verb_counts.get(verb.lemma_, 0) + 1

# Get the top 10 most used verbs
top_10_verbs = sorted(verb_counts.items(), key=lambda x: x[1], reverse=True)[:10]

# Print the top 10 verbs
for verb, count in top_10_verbs:
    print(f'{verb}: {count}')

say: 30
think: 17
reply: 12
begin: 10
look: 9
feel: 7
go: 6
hear: 6
get: 5
have: 3
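
The manual dictionary tally and sort in the cell above can also be expressed with `collections.Counter`. A minimal sketch on a made-up list of verb lemmas (the list is illustrative and not taken from the book):

```python
from collections import Counter

# Hypothetical lemmas of verbs whose subject is Alice, for illustration only.
verb_lemmas = ["say", "think", "say", "look", "say", "think", "reply"]

verb_counts = Counter(verb_lemmas)
for verb, count in verb_counts.most_common(3):
    print(f"{verb}: {count}")
```

`most_common(n)` returns the `n` highest counts already sorted, so no separate `sorted` call is needed.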


## What does Alice do most often?

According to the verb counts, Alice most often says (30), thinks (17), and replies (12), followed by begins, looks, feels, goes, hears, gets, and has. In short, she mostly talks and thinks.

P.S. By the way, I LOVE ALICE: the kingfisher bird tattooed on my left arm is named Alice ;-)