Task 4:
1.	Download Alice in Wonderland by Lewis Carroll from Project Gutenberg's website http://www.gutenberg.org/files/11/11-0.txt
2.	Perform any necessary preprocessing on the text, including converting to lower case, removing stop words, numbers / non-alphabetic characters, lemmatization.
3.	Find Top 10 most important (for example, in terms of TF-IDF metric) words from each chapter in the text (not "Alice"); how would you name each chapter according to the identified tokens?
4.	Find the Top 10 most used verbs in sentences with Alice. What does Alice do most often?


# P.1

In [33]:
import pandas as pd
import re

In [34]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# P.2

Alice_by_chapters is a TSV file with one row per chapter: the chapter number and the chapter text.

It is needed in P.3 to find the most important words in each chapter.
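
The notebook does not show how the TSV was produced. One possible way to split the raw Gutenberg text into chapters is sketched below. It assumes that every chapter opens with a heading line of the form `CHAPTER I.`; `CHAPTER_HEADING` and `split_chapters` are hypothetical names, not part of the original work.

```python
import re

# Assumption: a chapter starts with a line that contains only "CHAPTER <roman numeral>."
CHAPTER_HEADING = re.compile(r"^CHAPTER [IVXLC]+\.?[ \t]*$", re.MULTILINE)

def split_chapters(raw_text):
    """Return a list of (chapter_number, chapter_text) pairs."""
    parts = CHAPTER_HEADING.split(raw_text)
    # parts[0] is whatever precedes the first heading (front matter), so it is skipped
    return [(number, body.strip()) for number, body in enumerate(parts[1:], start=1)]

sample = (
    "Front matter\n"
    "CHAPTER I.\nDown the Rabbit-Hole\nAlice was beginning to get very tired\n"
    "CHAPTER II.\nThe Pool of Tears\nCuriouser and curiouser!\n"
)
chapters = split_chapters(sample)
print(chapters[1])
```

The resulting pairs could then be written out with the `csv` module or `pandas` using a tab delimiter.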

In [35]:
filename = "/content/drive/MyDrive/MLT/Alice_by_chapters.tsv"
data = pd.read_csv(filename, delimiter="\t", encoding='cp1251')
data = data.iloc[:12].copy()  # keep the first 12 rows; copy so a column can be added later without a pandas warning
data

Unnamed: 0,chapter,text
0,1.0,Alice was beginning to get very tired of sitti...
1,2.0,“Curiouser and curiouser!” cried Alice (she wa...
2,3.0,They were indeed a queer-looking party that as...
3,4.0,"It was the White Rabbit, trotting slowly back ..."
4,5.0,The Caterpillar and Alice looked at each other...
5,6.0,For a minute or two she stood looking at the h...
6,7.0,There was a table set out under a tree in fron...
7,8.0,A large rose-tree stood near the entrance of t...
8,9.0,“You can’t think how glad I am to see you agai...
9,10.0,"The Mock Turtle sighed deeply, and drew the ba..."


In [36]:
text = data.at[0, "text"][:1000]
text

'Alice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into\nthe book her sister was reading, but it had no pictures or\nconversations in it, “and what is the use of a book,” thought Alice\n“without pictures or conversations?”\n\nSo she was considering in her own mind (as well as she could, for the\nhot day made her feel very sleepy and stupid), whether the pleasure of\nmaking a daisy-chain would be worth the trouble of getting up and\npicking the daisies, when suddenly a White Rabbit with pink eyes ran\nclose by her.\n\nThere was nothing so _very_ remarkable in that; nor did Alice think it\nso _very_ much out of the way to hear the Rabbit say to itself, “Oh\ndear! Oh dear! I shall be late!” (when she thought it over afterwards,\nit occurred to her that she ought to have wondered at this, but at the\ntime it all seemed quite natural); but when the Rabbit actually _took a\nwatch out of its waistcoat-pocket_

In [37]:
# drop punctuation and the underscores used for emphasis (_very_)
text = re.sub(r"[^\w\s]|_", "", text)
# replace line breaks with spaces
text = re.sub(r"\n", " ", text)
text

'Alice was beginning to get very tired of sitting by her sister on the bank and of having nothing to do once or twice she had peeped into the book her sister was reading but it had no pictures or conversations in it and what is the use of a book thought Alice without pictures or conversations  So she was considering in her own mind as well as she could for the hot day made her feel very sleepy and stupid whether the pleasure of making a daisychain would be worth the trouble of getting up and picking the daisies when suddenly a White Rabbit with pink eyes ran close by her  There was nothing so very remarkable in that nor did Alice think it so very much out of the way to hear the Rabbit say to itself Oh dear Oh dear I shall be late when she thought it over afterwards it occurred to her that she ought to have wondered at this but at the time it all seemed quite natural but when the Rabbit actually took a watch out of its waistcoatpocket and looked at it'

In [38]:
from nltk.tokenize import TreebankWordTokenizer, WhitespaceTokenizer

In [39]:
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [40]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = stopwords.words("english")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [41]:
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

In [31]:
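def clean_text(text):
    # drop punctuation and underscores, then flatten line breaks
    text = re.sub(r"[^\w\s]|_", "", text)
    text = re.sub(r"\n", " ", text)
    text = text.lower()
    # lemmatize with the default (noun) part of speech
    text = [lemmatizer.lemmatize(token) for token in text.split(" ")]
    # note: stop words are filtered after lemmatization, so "was" has already
    # become "wa" by this point and is not removed
    text = [word for word in text if word not in stop_words]
    text = " ".join(text)
    return text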

In [67]:
data["processed_text"] = data["text"].apply(clean_text)
data.head()

Unnamed: 0,chapter,text,processed_text
0,1.0,Alice was beginning to get very tired of sitti...,alice wa beginning get tired sitting sister ba...
1,2.0,“Curiouser and curiouser!” cried Alice (she wa...,curiouser curiouser cried alice wa much surpri...
2,3.0,They were indeed a queer-looking party that as...,indeed queerlooking party assembled bankthe bi...
3,4.0,"It was the White Rabbit, trotting slowly back ...",wa white rabbit trotting slowly back looking a...
4,5.0,The Caterpillar and Alice looked at each other...,caterpillar alice looked time silence last cat...
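
The `wa` token in `processed_text` comes from the order of operations in `clean_text`: the WordNet lemmatizer reduces "was" to "wa" before the stop-word filter runs, and "wa" is not in the stop list. A possible variant that filters first, and also drops digits as the task asks, is sketched below. `clean_text_v2` and the tiny demo stop list are illustrative assumptions; in the notebook one would pass `set(stop_words)` and `lemmatizer.lemmatize` instead.

```python
import re

def clean_text_v2(text, stop_words, lemmatize=lambda token: token):
    """Lower-case, keep letters only, drop stop words, then lemmatize."""
    # keep ASCII letters and whitespace only: drops punctuation, underscores and digits
    text = re.sub(r"[^a-z\s]", "", text.lower())
    tokens = text.split()  # split() without arguments also discards empty tokens
    # filter stop words BEFORE lemmatizing, so "was" is removed instead of turning into "wa"
    tokens = [lemmatize(token) for token in tokens if token not in stop_words]
    return " ".join(tokens)

# hypothetical mini stop list, for the demo only
demo_stop_words = {"was", "to", "of", "the", "in"}
print(clean_text_v2("Alice was _very_ tired of 2 books.", demo_stop_words))
# prints: alice very tired books
```

Passing the stop words as a set keeps each membership test constant-time, which matters when the function is applied to whole chapters.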


# P.3

In [68]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [86]:
vectorizer_tfidf = TfidfVectorizer(max_features=3000)

In [87]:
X = data["processed_text"]
vectorizer_tfidf.fit(X)

TfidfVectorizer(max_features=3000)

In [88]:
X_tfidf = vectorizer_tfidf.transform(X)

In [89]:
tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=vectorizer_tfidf.get_feature_names_out())

In [90]:
tfidf_df.shape

(12, 2424)
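
For reference, with its default settings `TfidfVectorizer` weights a term by its count in the document times idf = ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df the number of documents containing the term, and then scales every document row to unit L2 length. The sketch below reimplements that on a toy corpus. `tfidf_rows` is a hypothetical helper, and it tokenizes on whitespace, which is simpler than the vectorizer's own token pattern.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with smoothed idf and L2-normalised rows, one dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

toy_docs = ["alice rabbit rabbit", "alice mouse", "alice dodo dodo dodo"]
rows = tfidf_rows(toy_docs)
print(rows[0])
```

Because this idf never drops below 1, a word that occurs in every chapter still receives a high score when it is frequent enough, which is why ubiquitous words such as "alice" and "said" can reach the top of a chapter's list.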

In [107]:
# transpose so that each column is one chapter (a pd.Series indexed by term)
dft = tfidf_df.T
dft.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
abide,0.0,0.0,0.0,0.0,0.0,0.020543,0.0,0.0,0.0,0.0,0.0,0.0
able,0.0,0.028168,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
absence,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.019324,0.0,0.0,0.0
absurd,0.0,0.0,0.025185,0.0,0.0,0.017643,0.0,0.0,0.0,0.0,0.0,0.0
acceptance,0.0,0.0,0.029326,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [108]:
# iterate over the chapter columns and print the 10 highest TF-IDF terms of each
for i in dft:
    ser = dft[i]
    print(ser.nlargest(10))
    print()

wa        0.503459
alice     0.256479
little    0.142488
bat       0.140570
door      0.127001
key       0.124173
eat       0.117908
like      0.104492
think     0.104492
way       0.104492
Name: 0, dtype: float64

wa        0.343301
mouse     0.278280
alice     0.235406
little    0.166746
pool      0.149587
im        0.148813
swam      0.140842
cat       0.139140
dear      0.136202
said      0.117703
Name: 1, dtype: float64

mouse      0.380251
said       0.347196
dodo       0.302224
alice      0.234868
wa         0.204233
prize      0.175955
lory       0.151112
dry        0.133486
thimble    0.117303
know       0.112328
Name: 2, dtype: float64

wa        0.287660
alice     0.253818
window    0.194377
little    0.186133
bill      0.181993
puppy     0.170080
rabbit    0.163388
bottle    0.125200
fan       0.125200
glove     0.125200
Name: 3, dtype: float64

caterpillar    0.419145
said           0.400165
alice          0.269342
pigeon         0.265199
serpent        0.265199
wa        
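
The task asks for important words other than "Alice", while the lists above still contain `alice` and the `wa` fragment. One way to skip such terms is sketched below. `top_terms` is a hypothetical helper, and the toy scores are the first five values printed for chapter index 0. The function only needs an object with `.items()`, so a column of `dft` could be passed in directly.

```python
import heapq

def top_terms(scores, k=10, exclude=()):
    """Return the k highest-scoring (term, score) pairs, skipping excluded terms."""
    excluded = set(exclude)
    candidates = (
        (term, score)
        for term, score in scores.items()
        if term not in excluded and score > 0
    )
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

# first five scores from the chapter-0 output above
toy_scores = {"wa": 0.503459, "alice": 0.256479, "little": 0.142488,
              "bat": 0.140570, "door": 0.127001}
best = top_terms(toy_scores, k=3, exclude={"alice", "wa"})
print([term for term, _ in best])
# prints: ['little', 'bat', 'door']
```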

# P.4

Alice_text is a plain-text file containing all the chapters.

In [25]:
from pathlib import Path
txt = Path('/content/drive/MyDrive/MLT/Alice_text.txt').read_text()

In [26]:
txt[:100]

'"Alice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing t'

In [27]:
txt = txt.replace('\n', ' ')

In [28]:
txt[:100]

'"Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing t'

In [29]:
# naive sentence split on ".", "?" and "!" (the terminators themselves are dropped)
sentences = re.split(r"[.?!]", txt)

In [30]:
sentences[:5]

['"Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, “and what is the use of a book,” thought Alice “without pictures or conversations',
 '”  So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her',
 '  There was nothing so _very_ remarkable in that; nor did Alice think it so _very_ much out of the way to hear the Rabbit say to itself, “Oh dear',
 ' Oh dear',
 ' I shall be late']
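
In the output above, the closing quote of one sentence ends up at the start of the next one. A slightly more careful regex split, which keeps the terminator and any closing quote with the sentence it ends, might look like the sketch below. `SENTENCE` and `split_sentences` are hypothetical names; this is still a heuristic, it breaks quoted speech at every "!" or "?", and it drops any trailing text that has no terminator.

```python
import re

# a sentence = text up to ".", "?" or "!", plus any closing quotes or brackets that follow
SENTENCE = re.compile(r'[^.?!]+[.?!]+[”"’)]*')

def split_sentences(text):
    """Split text into sentences, keeping end punctuation and closing quotes."""
    return [match.strip() for match in SENTENCE.findall(text)]

sample = "“Oh dear! Oh dear! I shall be late!” said the Rabbit. Alice ran."
for sentence in split_sentences(sample):
    print(sentence)
```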

In [49]:
# cleaning each sentence
sentences = [clean_text(s) for s in sentences]

In [51]:
# keeping only sentences that contain 'alice'
# (substring match, so forms such as 'alices' from "Alice's" are kept too)
sents_with_alice = [s for s in sentences if 'alice' in s]

In [54]:
sents_with_alice[:5]

['alice wa beginning get tired sitting sister bank nothing twice peeped book sister wa reading picture conversation use book thought alice without picture conversation',
 '  wa nothing remarkable alice think much way hear rabbit say oh dear',
 ' thought afterwards occurred ought wondered time seemed quite natural rabbit actually took watch waistcoatpocket looked hurried alice started foot flashed across mind never seen rabbit either waistcoatpocket watch take burning curiosity ran across field fortunately wa time see pop large rabbithole hedge',
 '  another moment went alice never considering world wa get',
 '  rabbithole went straight like tunnel way dipped suddenly suddenly alice moment think stopping found falling deep well']

In [55]:
prepared_text = " ".join(sents_with_alice)

In [62]:
import spacy
from collections import Counter
# the 'en' shortcut no longer exists in spaCy 3; load the small English model by name
nlp = spacy.load('en_core_web_sm')
doc = nlp(prepared_text)

# verb tokens that aren't stop words or punctuation
verbs = [token.text
         for token in doc
         if (not token.is_stop and
             not token.is_punct and
             token.pos_ == "VERB")]

# ten most common verb tokens
verb_freq = Counter(verbs)
common_verbs = verb_freq.most_common(10)

In [60]:
common_verbs

[('said', 174),
 ('thought', 40),
 ('went', 30),
 ('looked', 24),
 ('know', 23),
 ('think', 21),
 ('began', 21),
 ('got', 18),
 ('m', 17),
 ('ve', 14)]

'm' and 've' are most likely fragments of contractions such as "I'm" and "I've" (forms of to be and to have) left over after the apostrophes were stripped, so they carry no useful information.
Counting "thought" and "think" together, the actions Alice performs most often are: say, think, go, look, know, begin and get.
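
The ranking by base form can be checked by merging the inflected forms in the counts above. The sketch below does this with a small hand-written mapping, which is an assumption made for illustration; in the notebook itself, collecting `token.lemma_` instead of `token.text` would be the more natural route.

```python
from collections import Counter

# counts copied from the common_verbs output above, without the 'm' and 've' fragments
surface_counts = {"said": 174, "thought": 40, "went": 30, "looked": 24,
                  "know": 23, "think": 21, "began": 21, "got": 18}

# hand-written mapping from inflected form to base form, for illustration only
base_form = {"said": "say", "thought": "think", "went": "go",
             "looked": "look", "began": "begin", "got": "get"}

merged = Counter()
for verb, count in surface_counts.items():
    merged[base_form.get(verb, verb)] += count

print(merged.most_common(3))
# prints: [('say', 174), ('think', 61), ('go', 30)]
```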
