# Preprocessing

## Question 1

Please use the `gutenberg` corpus provided in `nltk` and extract the text written by Lewis Carroll, i.e., `carroll-alice.txt`, as your corpus data.

With this corpus data, please perform text preprocessing on the **sentences** of the corpus.

In particular, please:

- POS-tag all the sentences to get the part of speech of each word
- lemmatize all words using `WordNetLemmatizer`, sentence by sentence

Please provide your output as shown below:

- it is a data frame
- the column `sents` includes the original sentence texts
- the column `sents_pos` includes the word/POS-tag annotations for each sentence
- the column `sents_lem` includes the lemmatized version of the sentences


```{note}
Please note that the lemmatized form of the BE verbs (e.g., *was*) should be *be*.
```


In [4]:
import nltk
from nltk.corpus import gutenberg
from nltk.stem import WordNetLemmatizer
import pandas as pd

# download gutenberg if you don't have it
# nltk.download('gutenberg')

# print(gutenberg.fileids())

alice = gutenberg.sents('carroll-alice.txt')

# Step1: pos_tag each sentence in Alice
alice_tag = [nltk.pos_tag(s) for s in alice]
#alice_tag[:5]

# Step2: merge word tokens into a long-string sentence
alice_sents = [' '.join(s) for s in alice]
alice_sents_pos = [
    " ".join([(w + "/" + t) for (w, t) in s]) for s in alice_tag
]

# print(len(alice_sents))
# print(len(alice_sents_pos))

# Step3: Lemmatize each word of each sentence
wnl = WordNetLemmatizer()

alice_tag_lem = list()
for s in alice_tag:
    sl = list()
    for (w, t) in s:
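        # NOTE: Penn Treebank adjective tags start with 'J' (JJ, JJR, JJS),
        # so the 'a' entry never matches and only N* and V* tags reach `else`
        if t[0].lower() not in ['v', 'n', 'a']:
            sl.append(wnl.lemmatize(w))  # default pos is 'n' (noun)
        else:  # if n, v, specify the pos
            sl.append(wnl.lemmatize(w, t[0].lower()))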
    alice_tag_lem.append(sl)

alice_sents_lem = [' '.join(s) for s in alice_tag_lem]

alice_sents_df = pd.DataFrame({
    "sents": alice_sents,
    "sents_pos": alice_sents_pos,
    "sents_lem": alice_sents_lem
})
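
The lemmatization step above passes the first letter of the Penn Treebank tag to `WordNetLemmatizer`. A possible refinement, sketched here under the assumption that adjectives and adverbs should also be lemmatized with their own part of speech, is to translate the tag explicitly. The helpers `penn_to_wordnet` and `lemmatize_tagged` are hypothetical names, not part of the original solution, and using them would change the outputs recorded in this notebook (for example, function words such as *as* would be left untouched).

```python
# Hypothetical helpers (an illustration, not the original solution).

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("NN"):
        return "n"  # noun
    if tag.startswith("RB"):
        return "r"  # adverb
    return None     # no WordNet counterpart (determiners, prepositions, ...)


def lemmatize_tagged(tagged_sent, lemmatize):
    """Lemmatize (word, tag) pairs with a `lemmatize(word, pos)` callable.

    Words whose tag has no WordNet counterpart are returned unchanged.
    """
    lemmas = []
    for word, tag in tagged_sent:
        pos = penn_to_wordnet(tag)
        lemmas.append(lemmatize(word, pos) if pos else word)
    return lemmas
```

With the objects defined above, this could be called as `[lemmatize_tagged(s, wnl.lemmatize) for s in alice_tag]`.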

In [5]:
alice_sents_df

| | sents | sents_pos | sents_lem |
|---|---|---|---|
| 0 | [ Alice ' s Adventures in Wonderland by Lewis ... | [/JJ Alice/NNP '/POS s/NN Adventures/NNS in/IN... | [ Alice ' s Adventures in Wonderland by Lewis ... |
| 1 | CHAPTER I . | CHAPTER/NN I/PRP ./. | CHAPTER I . |
| 2 | Down the Rabbit - Hole | Down/IN the/DT Rabbit/NNP -/: Hole/NN | Down the Rabbit - Hole |
| 3 | Alice was beginning to get very tired of sitti... | Alice/NNP was/VBD beginning/VBG to/TO get/VB v... | Alice be begin to get very tired of sit by her... |
| 4 | So she was considering in her own mind ( as we... | So/IN she/PRP was/VBD considering/VBG in/IN he... | So she be consider in her own mind ( a well a ... |
| ... | ... | ... | ... |
| 1698 | But her sister sat still just as she left her ... | But/CC her/PRP$ sister/NN sat/VBD still/RB jus... | But her sister sit still just a she leave her ... |
| 1699 | First , she dreamed of little Alice herself , ... | First/RB ,/, she/PRP dreamed/VBD of/IN little/... | First , she dream of little Alice herself , an... |
| 1700 | The long grass rustled at her feet as the Whit... | The/DT long/JJ grass/NN rustled/VBD at/IN her/... | The long grass rustle at her foot a the White ... |
| 1701 | So she sat on , with closed eyes , and half be... | So/IN she/PRP sat/VBD on/IN ,/, with/IN closed... | So she sit on , with closed eye , and half bel... |


## Question 2

Based on the output of the previous question, please create a lemma frequency list of `carroll-alice.txt` using the lemmatized forms, including only lemmas that:
- consist only of letters or hyphens
- are at least 5 characters long

The casing is irrelevant.

The expected output is provided as follows.


In [None]:
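import re
from itertools import chain
# flatten the per-sentence lemma lists, then keep lemmas of 5+ letters/hyphens
alice_words = list(chain.from_iterable(alice_tag_lem))
alice_wf = nltk.FreqDist(w.lower() for w in alice_words
                         if len(w) >= 5 and re.fullmatch(r'[a-zA-Z-]+', w))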

In [12]:
# Top 20 frequent lemmas
alice_wf.most_common(20)

[('alice', 398),
 ('little', 128),
 ('think', 118),
 ('there', 99),
 ('about', 94),
 ('begin', 92),
 ('would', 83),
 ('again', 83),
 ('herself', 83),
 ('thing', 80),
 ('could', 77),
 ('queen', 75),
 ('turtle', 61),
 ('hatter', 57),
 ('quite', 55),
 ('gryphon', 55),
 ('rabbit', 52),
 ('their', 52),
 ('first', 51),
 ('voice', 51)]
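
To see what the two inclusion criteria of Question 2 do, the same filter can be tried on a small hand-made list. This is only a sketch: the list `toy_lemmas` and the helper `keep` are made up for illustration and are not taken from the corpus or from the original solution.

```python
import re
from collections import Counter

# Toy lemma list (invented for illustration, not corpus data)
toy_lemmas = ["Alice", "alice", "Rabbit-Hole", "be", "think",
              "don't", "1865", "little"]


def keep(lemma):
    """True if the lemma has 5+ characters, all letters or hyphens."""
    return len(lemma) >= 5 and re.fullmatch(r"[A-Za-z-]+", lemma) is not None


# Lowercase after filtering so that "Alice" and "alice" are counted together
toy_freq = Counter(lemma.lower() for lemma in toy_lemmas if keep(lemma))
print(toy_freq.most_common())
```

Here `be` and `1865` are dropped for being too short, and `don't` is dropped because the apostrophe is neither a letter nor a hyphen.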

## Question 3

Please identify the top verbs that co-occur with the name *Alice* in the text, with the name being the **subject** of the verb.

Please use the `en_core_web_sm` model in `spacy` for English dependency parsing.

To simplify matters, please identify all the verbs that have a dependency relation of `nsubj` with the noun `Alice` (where `Alice` is the **dependent**, and the verb is the **head**).

In [35]:
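import spacy
# load language model
nlp = spacy.load('en_core_web_sm')
# parse every sentence; nlp.pipe() yields one Doc per input string
alice_docs = nlp.pipe(alice_sents)

# for each noun chunk, record its text, its root word, the root's
# dependency label, and the root's syntactic head
np_chunks = []
for d in alice_docs:
    for chunk in d.noun_chunks:
        np_chunks.append((chunk.text, chunk.root.text, chunk.root.dep_,
                          chunk.root.head.text))

# keep the heads of chunks rooted in "Alice" that function as subjects
np_chunks_targets = [
    head for (text, root, dep, head) in np_chunks
    if root.lower() == "alice" and dep == "nsubj"
]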

In [36]:
nltk.FreqDist(np_chunks_targets).most_common(20)

[('said', 127),
 ('thought', 32),
 ('replied', 13),
 ('was', 10),
 ('began', 8),
 ('went', 7),
 ('looked', 7),
 ('felt', 5),
 ('like', 5),
 ('think', 4),
 ('had', 4),
 ('ventured', 4),
 ('beginning', 3),
 ('been', 3),
 ('heard', 3),
 ('hear', 3),
 ('waited', 3),
 ('remarked', 3),
 ('asked', 3),
 ('see', 3)]
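
The list above counts surface forms, so *think* and *thought* or *hear* and *heard* appear as separate entries. One possible variant, sketched here as an assumption rather than as the expected answer, works on tokens directly, keeps only heads tagged `VERB` or `AUX`, and counts their lemmas. The function `verbs_with_subject` is a hypothetical name; it relies only on the `text`, `dep_`, `head`, `pos_` and `lemma_` attributes that spaCy tokens provide, and the stand-in tokens below are hand-built so the example runs without a language model.

```python
from collections import Counter
from types import SimpleNamespace


def verbs_with_subject(tokens, subject="alice"):
    """Count lemmas of verbal heads whose `nsubj` dependent is `subject`."""
    counts = Counter()
    for tok in tokens:
        if (tok.dep_ == "nsubj"
                and tok.text.lower() == subject
                and tok.head.pos_ in ("VERB", "AUX")):
            counts[tok.head.lemma_] += 1
    return counts


# Hand-built stand-ins for parsed tokens (invented for illustration)
said = SimpleNamespace(text="said", lemma_="say", pos_="VERB", dep_="ROOT")
said.head = said  # a root token is its own head
says = SimpleNamespace(text="says", lemma_="say", pos_="VERB", dep_="ROOT")
says.head = says
alice_1 = SimpleNamespace(text="Alice", lemma_="Alice", pos_="PROPN",
                          dep_="nsubj", head=said)
alice_2 = SimpleNamespace(text="Alice", lemma_="Alice", pos_="PROPN",
                          dep_="nsubj", head=says)
rabbit = SimpleNamespace(text="Rabbit", lemma_="Rabbit", pos_="PROPN",
                         dep_="nsubj", head=said)

toy_counts = verbs_with_subject([alice_1, said, alice_2, says, rabbit])
print(toy_counts.most_common())
```

On the real data the same function could be fed with `(tok for d in nlp.pipe(alice_sents) for tok in d)`; the resulting counts would differ from the ones shown above because inflected forms are merged.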