# TASK 2

<a name='000'></a>

<h2>Content</h2>

<ul>
    <ol type='1'>
    <li><a href='#001'>Libraries and Open file</a></li>
    <li><a href='#002'>Preprocessing</a></li>
    <li><a href='#003'>Top 10 most important words from each chapter</a></li>
    <li><a href='#004'>Top 10 most used verbs in sentences with Alice</a></li> 
    <li><a href='#005'>Conclusions</a></li>
    </ol>
</ul>

**"Alice's Adventures in Wonderland"** is a novel by **Lewis Carroll**, published in **1865**. The story follows **Alice**, a young girl who falls down a rabbit hole into a **surreal world** filled with **peculiar characters**. Throughout her **adventure**, she interacts with the **White Rabbit**, **the Mad Hatter**, **the Queen of Hearts**, and **the Cheshire Cat**. The narrative is **full of wordplay** and **illogical situations**. The work has been adapted into movies and plays and has become a classic of children's and fantasy literature.

Link: https://www.gutenberg.org/files/11/11-0.txt

**Natural Language Processing (NLP)** is a **field** of **artificial intelligence** that focuses on **enabling machines to understand and generate human language**. It is **used** in **chatbots**, **machine translation**, **sentiment analysis**, **text summarization**, **information extraction** and **text classification**, **among other applications**. NLP is **based on text processing techniques** and **machine learning algorithms**, and its significant advances in recent years have made it relevant in various sectors.

In this task we will **apply the NLP topics** we were **taught in class** to the book **"Alice's Adventures in Wonderland"**.

<a name='001'></a>

<h2>Libraries and Open file</h2>

We import the necessary libraries and open the text file.

In [1]:
import nltk
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
from collections import Counter

In [2]:
with open('C://Users//Daan_//Downloads//Alice.txt', 'r', encoding='utf8') as file:
    text = file.read()

In [3]:
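# Skip the table of contents: keep everything after its last entry.
new_text = text.split('CHAPTER XII.   Alice’s Evidence')[1]
# Drop everything after the story's closing line (the Project Gutenberg footer).
new_text = new_text.split('THE END')[0]
# Split on the chapter headings; element 0 is the text before Chapter I, so it is skipped.
new_text_1 = new_text.split('CHAPTER ')[1:]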

We separate the text into chapters.

In [4]:
# Build one row per chapter: a label plus the raw chapter text.
df = pd.DataFrame({
    'Chapter': ['CHAPTER {}'.format(i + 1) for i in range(len(new_text_1))],
    'Text': new_text_1,
})

df

| | Chapter | Text |
|---|---|---|
| 0 | CHAPTER 1 | I.\nDown the Rabbit-Hole\n\n\nAlice was beginn... |
| 1 | CHAPTER 2 | II.\nThe Pool of Tears\n\n\n“Curiouser and cur... |
| 2 | CHAPTER 3 | III.\nA Caucus-Race and a Long Tale\n\n\nThey ... |
| 3 | CHAPTER 4 | IV.\nThe Rabbit Sends in a Little Bill\n\n\nIt... |
| 4 | CHAPTER 5 | V.\nAdvice from a Caterpillar\n\n\nThe Caterpi... |
| 5 | CHAPTER 6 | VI.\nPig and Pepper\n\n\nFor a minute or two s... |
| 6 | CHAPTER 7 | VII.\nA Mad Tea-Party\n\n\nThere was a table s... |
| 7 | CHAPTER 8 | VIII.\nThe Queen’s Croquet-Ground\n\n\nA large... |
| 8 | CHAPTER 9 | IX.\nThe Mock Turtle’s Story\n\n\n“You can’t t... |
| 9 | CHAPTER 10 | X.\nThe Lobster Quadrille\n\n\nThe Mock Turtle... |
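
The split above depends on the literal string `'CHAPTER '` and on knowing the number of chapters in advance. As a hedged alternative, the sketch below finds the headings with a regular expression and returns however many chapters it detects. The `sample` text and the `split_chapters` helper are invented for illustration and only mimic the heading layout of the Gutenberg file.

```python
import re

# Invented miniature text that only mimics the heading layout of the book.
sample = (
    "CHAPTER I.\nDown the Rabbit-Hole\n\nAlice was beginning to get very tired.\n\n"
    "CHAPTER II.\nThe Pool of Tears\n\nCuriouser and curiouser!\n\n"
    "CHAPTER III.\nA Caucus-Race and a Long Tale\n\nThey were indeed a queer-looking party.\n"
)


def split_chapters(text):
    """Return a list of (roman_numeral, body) pairs, one per chapter heading."""
    # A heading is 'CHAPTER', a space, a Roman numeral and a period at the start of a line.
    pattern = re.compile(r"^CHAPTER ([IVXLC]+)\.", flags=re.MULTILINE)
    parts = pattern.split(text)
    # parts = [text before first heading, numeral 1, body 1, numeral 2, body 2, ...]
    numerals, bodies = parts[1::2], parts[2::2]
    return [(numeral, body.strip()) for numeral, body in zip(numerals, bodies)]


chapters = split_chapters(sample)
for numeral, body in chapters:
    # Print the numeral and the chapter title (first line of the body).
    print(numeral, "->", body.splitlines()[0])
```

Because the numeral is captured separately, it does not end up at the start of the chapter text, which is what happens with the plain `split('CHAPTER ')` approach.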


<a name='002'></a>

<h2>Preprocessing</h2>

We preprocess the text: conversion to lowercase, removal of stop words, removal of numbers and other non-alphabetic characters, and lemmatization.

In [5]:
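def preprocessing(text):
    # Requires the NLTK data packages for stop words, the punkt tokenizer and WordNet.
    stoplist = set(nltk.corpus.stopwords.words('english'))
    # Lowercase, replace every non-letter character with a space, collapse repeated spaces.
    text = text.lower()
    text = re.sub('[^a-zA-Z ]+', ' ', text)
    text = re.sub(' +', ' ', text)
    tokens = nltk.word_tokenize(text)
    # Drop stop words and the name 'alice'.
    palabras = [token for token in tokens if token not in stoplist and token != 'alice']
    # Lemmatize each remaining word (WordNet treats words as nouns by default).
    wnlemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
    lemmatize= [wnlemmatizer.lemmatize(palabra) for palabra in palabras]

    return ' '.join(lemmatize)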

In [6]:
for i in range(len(df)):
    df.loc[i, 'Preprocessed Text'] = preprocessing(df.loc[i, 'Text'])
    
df

| | Chapter | Text | Preprocessed Text |
|---|---|---|---|
| 0 | CHAPTER 1 | I.\nDown the Rabbit-Hole\n\n\nAlice was beginn... | rabbit hole beginning get tired sitting sister... |
| 1 | CHAPTER 2 | II.\nThe Pool of Tears\n\n\n“Curiouser and cur... | ii pool tear curiouser curiouser cried much su... |
| 2 | CHAPTER 3 | III.\nA Caucus-Race and a Long Tale\n\n\nThey ... | iii caucus race long tale indeed queer looking... |
| 3 | CHAPTER 4 | IV.\nThe Rabbit Sends in a Little Bill\n\n\nIt... | iv rabbit sends little bill white rabbit trott... |
| 4 | CHAPTER 5 | V.\nAdvice from a Caterpillar\n\n\nThe Caterpi... | v advice caterpillar caterpillar looked time s... |
| 5 | CHAPTER 6 | VI.\nPig and Pepper\n\n\nFor a minute or two s... | vi pig pepper minute two stood looking house w... |
| 6 | CHAPTER 7 | VII.\nA Mad Tea-Party\n\n\nThere was a table s... | vii mad tea party table set tree front house m... |
| 7 | CHAPTER 8 | VIII.\nThe Queen’s Croquet-Ground\n\n\nA large... | viii queen croquet ground large rose tree stoo... |
| 8 | CHAPTER 9 | IX.\nThe Mock Turtle’s Story\n\n\n“You can’t t... | ix mock turtle story think glad see dear old t... |
| 9 | CHAPTER 10 | X.\nThe Lobster Quadrille\n\n\nThe Mock Turtle... | x lobster quadrille mock turtle sighed deeply ... |
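
To see what the cleaning steps do to a single string, here is a minimal standard-library sketch of the lowercase, letters-only and stop word steps. It skips tokenization with NLTK and lemmatization, and the tiny `STOPWORDS` set and the `clean` helper are assumptions made for illustration (the notebook uses NLTK's English stop word list instead).

```python
import re

# Hypothetical miniature stop word list; the notebook uses NLTK's English list instead.
STOPWORDS = {"the", "a", "and", "was", "to", "of", "in", "it", "s"}


def clean(text, stopwords=STOPWORDS):
    """Lowercase, keep letters only, collapse spaces, and drop stop words."""
    text = text.lower()
    # Digits, punctuation and newlines all become spaces.
    text = re.sub("[^a-z ]+", " ", text)
    # Collapse runs of spaces left behind by the previous step.
    text = re.sub(" +", " ", text).strip()
    return [token for token in text.split() if token not in stopwords]


tokens = clean("II.\nThe Pool of Tears\n\n“Curiouser and curiouser!” cried Alice, 9 feet tall.")
print(tokens)
```

The Roman numeral of the heading survives as the token `ii`, which is why most rows of the "Preprocessed Text" column start with a numeral. In the first row, `i` is presumably dropped because it is an English stop word.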


<a name='003'></a>

<h2>Top 10 most important words from each chapter</h2>

We use **TfidfVectorizer** to **extract** text **features**.

In [7]:
def extraction(text):
    # Fit TF-IDF on all chapters, then return one row per chapter and one column per word.
    vector = TfidfVectorizer().fit(text)
    df2 = pd.DataFrame(vector.transform(text).toarray(), columns=vector.get_feature_names_out())

    return df2

In [8]:
tk3=extraction(df['Preprocessed Text'])
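
To make the scores less of a black box, the sketch below computes TF-IDF by hand on an invented three-document corpus. It assumes the weighting that, to my understanding, `TfidfVectorizer` applies by default: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The `docs` list and the `tfidf` helper are hypothetical and not part of the notebook.

```python
import math
from collections import Counter

# Invented three-"chapter" corpus, already preprocessed into space-separated tokens.
docs = [
    "rabbit hole rabbit door key",
    "mouse pool mouse cat said",
    "hatter tea hatter said rabbit",
]


def tfidf(docs):
    """Return one {word: score} dict per document (smoothed IDF, L2-normalized rows)."""
    counts = [Counter(doc.split()) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(word for count in counts for word in count)
    scores = []
    for count in counts:
        # Term count multiplied by the smoothed inverse document frequency.
        raw = {
            word: tf * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, tf in count.items()
        }
        # Scale the row to unit Euclidean length.
        norm = math.sqrt(sum(value * value for value in raw.values()))
        scores.append({word: value / norm for word, value in raw.items()})
    return scores


for i, row in enumerate(tfidf(docs), start=1):
    top = sorted(row, key=row.get, reverse=True)[:3]
    print("Chapter {}: {}".format(i, top))
```

A word repeated within one document and absent from the others (such as `mouse` or `hatter` here) scores highest, while a word shared across documents (such as `said`) is pushed down.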

We sort the TF-IDF scores of each chapter in descending order to find its 10 most important words.

In [9]:
# Each row of tk3 holds the TF-IDF scores of one chapter; keep its 10 highest-scoring words.
for Chapter in range(len(tk3)):
    top = tk3.loc[Chapter, :].sort_values(ascending=False).head(10)
    print()
    print('10 Most Important Words of "Chapter {}"'.format(Chapter+1))
    for i, datos in enumerate(top.index):
        print('{}. {}'.format(i+1, datos))


10 Most Important Words of "Chapter 1"
1. little
2. bat
3. rabbit
4. door
5. key
6. way
7. eat
8. like
9. think
10. either

10 Most Important Words of "Chapter 2"
1. mouse
2. pool
3. little
4. oh
5. swam
6. cat
7. dear
8. said
9. foot
10. mabel

10 Most Important Words of "Chapter 3"
1. said
2. mouse
3. dodo
4. race
5. prize
6. lory
7. dry
8. thimble
9. know
10. bird

10 Most Important Words of "Chapter 4"
1. bill
2. rabbit
3. little
4. window
5. puppy
6. glove
7. one
8. chimney
9. bottle
10. fan

10 Most Important Words of "Chapter 5"
1. caterpillar
2. said
3. pigeon
4. serpent
5. egg
6. youth
7. size
8. father
9. little
10. well

10 Most Important Words of "Chapter 6"
1. said
2. footman
3. cat
4. baby
5. mad
6. duchess
7. wow
8. like
9. pig
10. cook

10 Most Important Words of "Chapter 7"
1. hatter
2. dormouse
3. said
4. march
5. hare
6. tea
7. twinkle
8. time
9. well
10. treacle

10 Most Important Words of "Chapter 8"
1. queen
2. said
3. hedgehog
4. king
5. gardener
6. soldier
7. c

I would name each chapter as follows:

**Chapter 1:** ***"A Little Rabbit's Key to Wonderland"***

**Chapter 2:** ***"Mouse Swam in the Little Pool with a Cat"***

**Chapter 3:** ***"The Dodo Said, 'Let's Race for a Prize'"***

**Chapter 4:** ***"Bill the Rabbit's Little Window Adventure"***

**Chapter 5:** ***"Caterpillar Said, 'Pigeon, Serpent, Egg!'"***

**Chapter 6:** ***"The Footman's Mad Adventure with a Pig"***

**Chapter 7:** ***"Hatter, Dormouse, and the Twinkling Tea Time"***

**Chapter 8:** ***"The Queen's Hedgehog, King, and Rose Garden"***

**Chapter 9:** ***"The Mock Turtle's Moral School"***

**Chapter 10:** ***"The Mock Turtle and the Lobster's Dance"***

**Chapter 11:** ***"King and Hatter's Courtroom Drama"***

**Chapter 12:** ***"King, Queen, and the Dream of Wonderland"***


<a name='004'></a>

<h2>Top 10 most used verbs in sentences with Alice</h2>

We split the text into sentences and keep only those containing the word **"Alice"**.

In [10]:
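# Split on sentence-ending punctuation.
oraciones = re.split('[.?!]', new_text)
# Join the sentences that mention Alice into one string, stored as a one-element list.
oraciones_alice = [' '.join(oracion for oracion in oraciones if 'Alice' in oracion)]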
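
The filtering step can be checked on a short passage. The sketch below uses an invented `passage` and keeps the matching sentences as a list instead of joining them, which makes the result easier to inspect.

```python
import re

# Invented passage used only to illustrate the filtering step.
passage = "Alice was tired. The Rabbit ran past! Where did Alice go? Nobody knew."

# Split on '.', '?' or '!' and discard empty fragments.
sentences = [s.strip() for s in re.split(r"[.?!]", passage) if s.strip()]

# Keep the sentences that mention Alice (case-sensitive, as in the notebook).
alice_sentences = [s for s in sentences if "Alice" in s]

print(alice_sentences)
```

Splitting on punctuation alone also breaks at abbreviations such as "Mr."; a dedicated sentence tokenizer such as `nltk.sent_tokenize` may be more robust, although it was not used here.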

We preprocess the text again, this time adding part-of-speech tagging so that only the verbs are kept.

In [11]:
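def preprocessing_2(text):
    # Same cleaning as preprocessing(), but 'alice' is kept and only verbs are returned.
    stoplist = set(nltk.corpus.stopwords.words('english'))
    text = text.lower()
    text = re.sub('[^a-zA-Z ]+', ' ', text)
    text = re.sub(' +', ' ', text)
    tokens = nltk.word_tokenize(text)
    palabras = [token for token in tokens if token not in stoplist]
    wnlemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
    lemmatize= [wnlemmatizer.lemmatize(palabra) for palabra in palabras]
    # Tag each word with its part of speech and keep the verbs (Penn Treebank tags starting with 'VB').
    # Note: tagging runs on lowercased, stop-word-free lemmas, so some tags may be less reliable.
    tags = nltk.pos_tag(lemmatize)
    verbos = [verbo for verbo, tag in tags if tag.startswith('VB')]
    # Lemmatize again as verbs so that inflected forms ('said', 'went') map to 'say', 'go'.
    lemmatize = [wnlemmatizer.lemmatize(verbo,'v') for verbo in verbos]

    return lemmatize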

In [12]:
s=Counter(preprocessing_2(oraciones_alice[0])).most_common(10)

In [13]:
print("10 Most Important Verbs In Sentences with Alice")
print()

for i in range(len(s)):
    print('{}. {} '.format(i+1, s[i][0]))

10 Most Important Verbs In Sentences with Alice

1. say 
2. go 
3. think 
4. get 
5. look 
6. begin 
7. see 
8. come 
9. know 
10. find 


The list above shows the 10 most frequent verbs in sentences that mention Alice, which gives a rough picture of what she does most throughout the book.
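
The ranking itself comes from filtering by tag and counting. As a minimal sketch, the example below applies the same two steps to invented `(word, tag)` pairs that stand in for the output of a part-of-speech tagger, and prints the counts next to each verb so the gap between ranks is visible.

```python
from collections import Counter

# Invented (word, tag) pairs standing in for the output of a part-of-speech tagger.
tagged = [
    ("alice", "NN"), ("said", "VBD"), ("rabbit", "NN"), ("ran", "VBD"),
    ("alice", "NN"), ("said", "VBD"), ("think", "VB"), ("door", "NN"),
    ("said", "VBD"), ("think", "VBP"),
]

# Penn Treebank verb tags all start with 'VB' (VB, VBD, VBG, VBN, VBP, VBZ).
verbs = [word for word, tag in tagged if tag.startswith("VB")]

# most_common(n) returns the n highest counts as (item, count) pairs.
top_verbs = Counter(verbs).most_common(2)

for rank, (verb, count) in enumerate(top_verbs, start=1):
    print("{}. {} ({})".format(rank, verb, count))
```

Printing `s[i][1]` alongside `s[i][0]` in the notebook's loop would show the same counts for the real data.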

<a name='005'></a>

<h2>Conclusions</h2>

**Natural language processing (NLP)** is of **great importance today** because of its **ability to understand and generate text automatically**. It is **applied** in a **wide range of fields**, from virtual assistants to sentiment analysis and machine translation, making it a **crucial technology** in the **digital age**.

**"Alice's Adventures in Wonderland"**, in turn, is a **classic literary work** by Lewis Carroll. Its **importance** lies in its ability to **captivate readers** of all ages **with** its **boundless imagination and social satire**. Alice's story remains relevant and timeless, exploring themes of curiosity, logic and absurdity, which makes it a significant work in the history of literature.