<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Exercise.png"  style="display: block; margin-left: auto; margin-right: auto;";/>
</div>

# Exercise: Natural language processing
© ExploreAI Academy

In this exercise, we will preprocess text by converting it to lowercase, removing punctuation, building a bag-of-words, and applying stemming and lemmatisation. We will then analyse the processed text to calculate some basic statistics.

## Learning objectives

By the end of this exercise, you should be able to:
* Implement text preprocessing techniques such as converting to lowercase and removing punctuation.
* Apply stemming and lemmatisation techniques to extract the root forms of words.
* Create a bag-of-words representation to quantify the occurrence of words in text.
* Calculate statistics such as the number of stopwords, unique words, and word frequencies in text data.

## Import libraries and read in the data

In [1]:
import nltk
from nltk import TreebankWordTokenizer, SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import string
import urllib.request

#nltk.download('wordnet')
#nltk.download('stopwords')
#nltk.download('omw-1.4')

The data used in this notebook is text from the book "Alice's Adventures in Wonderland" by Lewis Carroll. 

In [5]:
# read in the data
def print_some_url():
    with urllib.request.urlopen('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/classification_sprint//alice_in_wonderland.txt') as f:
        return f.read().decode('ISO-8859-1')

data = print_some_url()
print(data[:863])

Alice's Adventures in Wonderland

                ALICE'S ADVENTURES IN WONDERLAND

                          Lewis Carroll

               THE MILLENNIUM FULCRUM EDITION 3.0




                            CHAPTER I

                      Down the Rabbit-Hole


  Alice was beginning to get very tired of sitting by her sister
on the bank, and of having nothing to do:  once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
thought Alice `without pictures or conversation?'

  So she was considering in her own mind (as well as she could,
for the hot day made her feel very sleepy and stupid), whether
the pleasure of making a daisy-chain would be worth the trouble
of getting up and picking the daisies, when suddenly a White
Rabbit with pink eyes ran close by her.

 


## Data preprocessing
We start by providing the functions needed to remove punctuation, create a bag-of-words, and define a stemmer, tokeniser and lemmatiser. Once you have applied these functions to preprocess the data, the exercise questions below ask you to perform some calculations and analysis.


**Convert to lowercase and remove punctuation** 

In [6]:
#Function to convert to lowercase and remove punctuation

def remove_punctuation(words):
    words = words.lower()
    return ''.join([x for x in words if x not in string.punctuation])

In [7]:
#Apply the remove_punctuation function to the data
data = remove_punctuation(data)
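
To see what this step does to a single line, here is a minimal sketch that applies the same function to a short sample string. The sample text is made up for illustration. Note that apostrophes and hyphens are removed too, so "Alice's" becomes "alices" and "Rabbit-Hole" becomes "rabbithole".

```python
import string

def remove_punctuation(words):
    # lowercase first, then drop every character listed in string.punctuation
    words = words.lower()
    return ''.join([x for x in words if x not in string.punctuation])

# illustrative sample only, not taken from the data set
sample = "Alice's Adventures: `Down the Rabbit-Hole'!"
cleaned = remove_punctuation(sample)
print(cleaned)  # alices adventures down the rabbithole
```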

**Create a bag-of-words and assign our stemmer and lemmatiser.**

In [8]:
# define stemmer function
stemmer = SnowballStemmer('english')

# tokenise data
tokeniser = TreebankWordTokenizer()
tokens = tokeniser.tokenize(data)

# define lemmatiser
lemmatizer = WordNetLemmatizer()

# bag-of-words
def bag_of_words_count(words, word_dict=None):
    """ this function takes in a list of words and returns a dictionary 
        with each word as a key, and the value represents the number of 
        times that word appeared"""
    # use a fresh dictionary unless the caller passes one in, so that
    # counts do not carry over between calls
    if word_dict is None:
        word_dict = {}
    for word in words:
        if word in word_dict:
            word_dict[word] += 1
        else:
            word_dict[word] = 1
    return word_dict

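# remove stopwords (build the stopword set once instead of on every iteration)
english_stopwords = set(stopwords.words('english'))
tokens_less_stopwords = [word for word in tokens if word not in english_stopwords]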

# create bag of words
bag_of_words = bag_of_words_count(tokens_less_stopwords,{})

Pay special attention to what these functions return and how the subsequent texts and lists look.
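
As a rough illustration of what a bag-of-words looks like, the sketch below builds one for a tiny, made-up token list. It uses `collections.Counter` from the standard library as a stand-in for `bag_of_words_count()`. This is an alternative way to get the same kind of dictionary, not the approach used in the cells above.

```python
from collections import Counter

# illustrative tokens only; the real list comes from the tokeniser
sample_tokens = ["alice", "rabbit", "alice", "hole", "rabbit", "alice"]

# Counter tallies occurrences; dict() gives a plain word -> count mapping
sample_bag = dict(Counter(sample_tokens))
print(sample_bag)  # {'alice': 3, 'rabbit': 2, 'hole': 1}
```

Each unique word appears once as a key, and its value is the number of times it occurs in the list.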

## Exercises

### Exercise 1

Use the `stemmer` and `lemmatizer` objects defined in the cells above to write a function that finds the stem and lemma of the nth word in the token list.

_**Function Specifications:**_
* Should take a `list` and an `int` n as input and return a `dict` as output.
* The dictionary should have the keys **'original', 'stem' and 'lemma'**, with the corresponding values being the nth word transformed in that way.

**Example result:**

`{'original': 'daisies', 
'stem': 'daisi', 
'lemma': 'daisy'}`

Use your function to find the stem and lemma of the 120th word in `tokens`.

In [14]:
def get_stem_and_lemma(tokens:list[str],nth_index:int):
    token = tokens[nth_index - 1]  # the nth word sits at index n - 1 (zero-based indexing)
    result = {'original':token}
    result["stem"] = stemmer.stem(token)
    result["lemma"] = lemmatizer.lemmatize(token)
    return result

### Exercise 2

Create a function that calculates the total number of stopwords in the text, including repetitions.

_Hint_: you can use the NLTK stopwords corpus (`stopwords.words('english')`).

_**Function Specifications:**_
* Function should take a `list` as input 
* The number of stopwords should be returned as an `int` 

Use your function to calculate the total number of stopwords in `tokens`.
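
To make "including repetitions" concrete, here is a small sketch with a made-up token list and a hypothetical four-word stopword set. The set stands in for the NLTK list so that the snippet runs without any downloads. Every occurrence of a stopword is counted, so "the" contributes twice.

```python
# hypothetical mini stopword set, used only for illustration
sample_stopwords = {"the", "was", "to", "of"}

# illustrative tokens only
sample_tokens = ["alice", "was", "beginning", "to", "get",
                 "very", "tired", "of", "the", "the"]

# count every token that is a stopword, repeats included
total = sum(1 for word in sample_tokens if word in sample_stopwords)
print(total)  # 5
```

Storing the stopwords in a `set` keeps each membership test fast, which matters when the token list is long.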

In [24]:
def count_stopwords(text:list[str])->int:
    # build the stopword set once; membership tests on a set are fast
    stop_words = set(stopwords.words("english"))
    # count every token that is a stopword, repeats included
    return sum(1 for word in text if word in stop_words)

In [25]:
count_stopwords(tokens)

13774

### Exercise 3

Write a function that calculates the number of **unique** words in the text.

_**Function Specifications:**_
* Function should take a `list` as input and return an `int` 


Use your function to calculate the number of **unique** words in `tokens`.

In [30]:
def count_unique(text:list[str])->int:
    return len(set(text))

In [31]:
count_unique(tokens)

2749

### Exercise 4

Write a function that finds the kth most frequently occurring word in the bag of words.

_**Function Specifications:**_
* Function should take a `dict` and an `int` k as input
* Function should return the kth most common word as a `str`

_Hint: `bag_of_words` already excludes stopwords_

**Example input:**
```python
most_common_word({'apple': 30, 'orange': 12, 'pear': 50, 'banana': 12}, 2)

>>> 'apple'
```


Use the function to find the 3rd most frequently occurring word in the bag-of-words.
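
One possible way to rank words by frequency is `collections.Counter.most_common()`, sketched below on the example input. The helper name `kth_most_common` is hypothetical and is not part of the exercise. When several words share the same count, the word returned depends on how ties are ordered, so different implementations may disagree on tied ranks.

```python
from collections import Counter

def kth_most_common(bag, k):
    # hypothetical helper: most_common(k) returns the k highest-count
    # (word, count) pairs in descending order of count
    ranked = Counter(bag).most_common(k)
    # the kth most common word is the last of those k pairs
    return ranked[k - 1][0]

sample_bag = {'apple': 30, 'orange': 12, 'pear': 50, 'banana': 12}
print(kth_most_common(sample_bag, 1))  # pear
print(kth_most_common(sample_bag, 2))  # apple
```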

In [53]:
def most_common_word(bag, kth_freq):
    return sorted(bag.items(), key=lambda items: items[1], reverse=True)[kth_freq - 1][0]

In [54]:
most_common_word({'apple': 30, 'orange': 12, 'pear': 50, 'banana': 12}, 2)

'apple'

In [57]:
most_common_word(bag_of_words, 3)

'little'

### Exercise 5

Write a function that calculates the number of words that appear n times in the text.

_**Function Specifications:**_
* Input is taken as a `dict` and an `int` n, where n is the number of times the word appears in the text
* Count the number of words that appear n times in the text
* Output should be the count as an `int`

**Example input:** 
```python
word_frequency_count({'apple': 30, 'orange': 12, 'pear': 50, 'banana': 12}, 12)

>>> 2
```

Use the function to calculate the number of words that appear 8 times in the bag-of-words.
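
One way to think about this task is as a "count of counts": tally how many words share each frequency value, then look up n. The sketch below does this with `collections.Counter` on the example input. It is only an illustration, and the variable names are made up. A `Counter` returns 0 for a frequency that never occurs instead of raising an error.

```python
from collections import Counter

sample_bag = {'apple': 30, 'orange': 12, 'pear': 50, 'banana': 12}

# tally how many words have each frequency: {30: 1, 12: 2, 50: 1}
frequency_of_counts = Counter(sample_bag.values())

print(frequency_of_counts[12])  # 2
print(frequency_of_counts[7])   # 0, no word appears exactly 7 times
```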


In [69]:
def word_frequency_count(bag, knth_freq)->int:
    # tally how many words share each frequency value
    word_freq = {}
    for value in bag.values():
        word_freq[value] = word_freq.get(value, 0) + 1

    # return 0 instead of raising KeyError when no word appears knth_freq times
    return word_freq.get(knth_freq, 0)

In [65]:
word_frequency_count({'apple': 30, 'orange': 12, 'pear': 50, 'banana': 12}, 12)

2

In [68]:
word_frequency_count(bag_of_words, 8)

49

## Solutions

### Exercise 1

In [None]:
def find_roots(token_list, n):
    root_dict = {}
    word = token_list[n-1]
    root_dict['original'] = word
    root_dict["stem"] = stemmer.stem(word)
    root_dict["lemma"] = lemmatizer.lemmatize(word)
    
    return root_dict

In [None]:
find_roots(tokens, 120) 

### Exercise 2

In [22]:
def count_stopwords(token_list):
    # build the stopword set once instead of on every iteration
    stop_words = set(stopwords.words("english"))
    return len([word for word in token_list if word in stop_words])

In [23]:
count_stopwords(tokens)

13774

### Exercise 3

In [32]:
def unique_words(token_list):
    return len(set(token_list))

In [33]:
unique_words(tokens)

2749

Note: The same result can be obtained by taking the `len()` of the dictionary returned by `bag_of_words_count()`, since each unique word in `tokens` appears exactly once as a key.

In [None]:
len(bag_of_words_count(tokens,{}))

### Exercise 4

In [55]:
def most_common_word(bag, k):
    switch = [(value, key) for key, value in bag.items()]
    switch = sorted(switch)
    return switch[-k][1]

In [56]:
most_common_word(bag_of_words, 3)

'little'

### Exercise 5

In [66]:
def word_frequency_count(bag, n):
    total = sum(1 for value in bag.values() if value == n)
    return total

In [67]:
word_frequency_count(bag_of_words, 8)

49


<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px";/>
</div>