<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Exercise.png"  style="display: block; margin-left: auto; margin-right: auto;"/>
</div>

# Exercise: Natural language processing
© ExploreAI Academy

In this exercise, we will preprocess text by converting it to lowercase, removing punctuation, creating a bag-of-words, and applying stemming and lemmatization, and then analyse the processed text to gain some insights.

## Learning objectives

By the end of this exercise, you should be able to:
* Implement text preprocessing techniques such as converting to lowercase and removing punctuation.
* Apply stemming and lemmatization techniques to extract the root forms of words.
* Create a bag-of-words representation to quantify the occurrence of words in text.
* Calculate statistics such as the number of stop words, unique words, and word frequencies in text data.

## Import libraries and read in the data

In [103]:
import nltk
from nltk import TreebankWordTokenizer, SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import string
import urllib.request
from text_processor_utils import read_online_text_data, read_local_text_data, remove_punctuations
from text_processor_utils import bag_of_words_count
import pandas as pd
#nltk.download('wordnet')
#nltk.download('stopwords')
#nltk.download('omw-1.4')

The data used in this notebook is text from the book "Alice's Adventures in Wonderland" by Lewis Carroll. 

In [104]:
# The code block below only needs to run once, so it is kept disabled inside a string.
# Read in the data; the steps are wrapped in a single function to avoid repetition.
"""
def print_some_url():
    with urllib.request.urlopen('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/classification_sprint//alice_in_wonderland.txt') as f:
        return f.read().decode('ISO-8859-1')

data = print_some_url()
print(data[:863])

"""

In [105]:
#Download the data
"""
url = 'https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/classification_sprint//alice_in_wonderland.txt'
dataset = read_online_text_data(url)

"""

In [106]:
file_path = r"C:\Users\Allan\Desktop\Natural Language Processing\txt_data.txt"
data = read_local_text_data(file_path=file_path)

In [107]:
print(data[:863])

Alice's Adventures in Wonderland

                ALICE'S ADVENTURES IN WONDERLAND

                          Lewis Carroll

               THE MILLENNIUM FULCRUM EDITION 3.0




                            CHAPTER I

                      Down the Rabbit-Hole


  Alice was beginning to get very tired of sitting by her sister
on the bank, and of having nothing to do:  once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
thought Alice `without pictures or conversation?'

  So she was considering in her own mind (as well as she could,
for the hot day made her feel very sleepy and stupid), whether
the pleasure of making a daisy-chain would be worth the trouble
of getting up and picking the daisies, when suddenly a White
Rabbit with pink eyes ran close by her.

 


In [108]:
len(data)

148574

## Data preprocessing
We start by providing the functions needed to remove punctuation and create a bag-of-words, and by defining a tokenizer, stemmer, and lemmatizer. Once you have applied these to preprocess the data, you will be asked to perform some calculations and analysis in the exercise questions below.
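
As a rough illustration of the first preprocessing step, the sketch below lowercases a short sample and deletes ASCII punctuation with `str.translate`. The helper name `clean_text` and the sample sentence are assumptions made for this example; it is not the `remove_punctuations` function from `text_processor_utils`, whose implementation is not shown here.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text(text: str) -> str:
    """Lowercase the text and delete ASCII punctuation (hypothetical helper)."""
    return text.lower().translate(_PUNCT_TABLE)

sample = "Down the Rabbit-Hole: `and what is the use of a book,' thought Alice."
cleaned = clean_text(sample)
print(cleaned)
# down the rabbithole and what is the use of a book thought alice
```

Note that deleting punctuation outright joins hyphenated words ("rabbit-hole" becomes "rabbithole") and turns "3.0" into "30", which matches what the cleaned text in this notebook looks like.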


**Convert to lowercase and remove punctuation** 

In [109]:
# Function to convert to lowercase and remove punctuation
# A custom version of this function also lives in the text_processor_utils file

def remove_punctuation(words):
    words = words.lower()
    return ''.join([x for x in words if x not in string.punctuation])


_data_ =  remove_punctuation(data)
print(_data_[:863])

alices adventures in wonderland

                alices adventures in wonderland

                          lewis carroll

               the millennium fulcrum edition 30




                            chapter i

                      down the rabbithole


  alice was beginning to get very tired of sitting by her sister
on the bank and of having nothing to do  once or twice she had
peeped into the book her sister was reading but it had no
pictures or conversations in it and what is the use of a book
thought alice without pictures or conversation

  so she was considering in her own mind as well as she could
for the hot day made her feel very sleepy and stupid whether
the pleasure of making a daisychain would be worth the trouble
of getting up and picking the daisies when suddenly a white
rabbit with pink eyes ran close by her

  there was nothing so


In [110]:
# Apply the imported remove_punctuations function to the data
data = remove_punctuations(data)

In [111]:
print(data[:863])

alices adventures in wonderland

                alices adventures in wonderland

                          lewis carroll

               the millennium fulcrum edition 30




                            chapter i

                      down the rabbithole


  alice was beginning to get very tired of sitting by her sister
on the bank and of having nothing to do  once or twice she had
peeped into the book her sister was reading but it had no
pictures or conversations in it and what is the use of a book
thought alice without pictures or conversation

  so she was considering in her own mind as well as she could
for the hot day made her feel very sleepy and stupid whether
the pleasure of making a daisychain would be worth the trouble
of getting up and picking the daisies when suddenly a white
rabbit with pink eyes ran close by her

  there was nothing so


**Create a bag-of-words and assign our stemmer and lemmatizer**

In [112]:
# Tokenize
tokenizer = TreebankWordTokenizer()
data_tokens = tokenizer.tokenize(text=data)
tokens = data_tokens  # alias: the exercise text and the solutions refer to `tokens`
# Stemmer
stemmer = SnowballStemmer('english')
# Lemmatizer
lemmatizer = WordNetLemmatizer()

In [113]:
print(data_tokens[:10])

['alices', 'adventures', 'in', 'wonderland', 'alices', 'adventures', 'in', 'wonderland', 'lewis', 'carroll']


In [114]:
# Create a bag of words
# This function is also included in text_processor_utils
"""
def bag_of_words_count(words:list, word_dict= {}):
    for word in words:
        if word in word_dict.keys():
            word_dict[word] += 1
        else:
            word_dict[word]= 1
    return word_dict

"""

In [115]:
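# Remove stop words (build the set once instead of reloading the list for every token)
english_stopwords = set(stopwords.words('english'))
tokens_less_stopwords = [word for word in data_tokens if word not in english_stopwords]
# Then create a bag of words from the filtered tokens
bag_of_words = bag_of_words_count(words=tokens_less_stopwords, word_dict={})
#bag_of_words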

Pay special attention to what these functions return and how the subsequent texts and lists look.
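
The commented-out `bag_of_words_count` definition above declares `word_dict={}` as a default argument. In Python a mutable default is created once and shared by every call that relies on it, so counts can leak from one call into the next; passing `word_dict={}` explicitly, as the cell above does, sidesteps this. The sketch below shows one way to avoid the problem altogether. The name `count_words` and the toy token list are assumptions for illustration only.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

def count_words(words: Iterable[str],
                counts: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """Count occurrences of each word (hypothetical helper)."""
    if counts is None:  # create a fresh dict per call instead of sharing a default
        counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

toy_tokens = ["alice", "rabbit", "alice", "hole", "rabbit", "alice"]
bag = count_words(toy_tokens)
print(bag)  # {'alice': 3, 'rabbit': 2, 'hole': 1}

# A second call starts from an empty dict rather than inheriting earlier counts.
second = count_words(["hatter"])
print(second)  # {'hatter': 1}

# collections.Counter from the standard library produces the same counts.
same_as_counter = bag == dict(Counter(toy_tokens))
```

Either form returns a mapping from each distinct word to the number of times it occurs, which is the structure the exercises below operate on.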

## Exercises

### Exercise 1

Use the stemmer and lemmatizer (defined in the cells above) to write a function that finds the stem and lemma of the nth word in the token list.

_**Function specifications:**_
* Should take a `list` and an `int` n as input and return a `dict` as output.
* Words are counted from 1, so the nth word sits at index `n - 1`.
* The dictionary should have the keys **'original', 'stem', and 'lemma'**, with the corresponding values being the nth word transformed in that way.

**Example result:**

`{'original': 'daisies', 
'stem': 'daisi', 
'lemma': 'daisy'}`

Use your function to find the stem and lemma of the 120th word in `tokens`.

In [116]:
tokens_less_stopwords[120]

'another'

In [140]:
#Your code here
def find_stemmer_and_lemma(words:list, n:int):
    out_put_dict = { }
    target_word = words[n-1]
    stemmed = stemmer.stem(target_word)
    lemmatized = lemmatizer.lemmatize(target_word)
    out_put_dict['original'] = target_word
    out_put_dict['stem'] = stemmed
    out_put_dict['lemma'] = lemmatized
    return out_put_dict


find_stemmer_and_lemma(data_tokens, 120)

{'original': 'daisies', 'stem': 'daisi', 'lemma': 'daisy'}

### Exercise 2

Create a function that calculates the number of stop words that are in the text in total, including repetitions.   

_Hint:_ You can use the nltk stopwords dictionary. 

_**Function specifications:**_
* Function should take a `list` as input. 
* The number of stop words should be returned as an `int`. 

Use your function to calculate the total number of stop words in `tokens`.
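
To make "including repetitions" concrete, here is a minimal sketch on a toy sentence. The stop-word set below is a small hand-picked assumption, not NLTK's English list; with NLTK, `set(stopwords.words('english'))` would play the same role. Using a set keeps each membership test fast.

```python
# Small illustrative stop-word set; an assumption, not NLTK's English list.
TOY_STOPWORDS = frozenset({"the", "of", "and", "a", "to", "in", "was", "her"})

def count_stop_words(tokens, stop_words=TOY_STOPWORDS) -> int:
    """Count every token that is a stop word, repetitions included."""
    return sum(1 for token in tokens if token in stop_words)

toy_tokens = "alice was beginning to get very tired of sitting by her sister".split()
print(count_stop_words(toy_tokens))  # 4  ('was', 'to', 'of', 'her')
```

The count must be taken on the full token list: a list that has already had its stop words removed will always give 0.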

In [118]:
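#Your code here
stopwords_set = set(stopwords.words('english'))  # a set gives fast membership tests
def calculate_number_of_stopwords(word_list: list) -> int:
    # Count every occurrence of a stop word, repetitions included
    return sum(1 for word in word_list if word in stopwords_set)


# Count on the full token list; the stop-word-free list would always give 0
calculate_number_of_stopwords(data_tokens)

13774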

### Exercise 3

Write a function that calculates the number of **unique** words in the text.

_**Function specifications:**_
* Function should take a `list` as input and return an `int`. 


Use your function to calculate the number of **unique** words in `tokens`.

In [138]:
#Your code here
def get_number_of_unique_words(word_list: list):
    unique_words = set(word_list)
    total_number_of_unique_words = len(unique_words)
    return total_number_of_unique_words

get_number_of_unique_words(data_tokens)

2749

### Exercise 4

Write a function that calculates the kth most frequently occurring word in the bag-of-words.

_**Function specifications:**_
* Function should take a `dict` and an `int` k as input.
* Function should return the kth most common word as a `str`.

_Hint:_ bag_of_words already does not include stop words.

**Example input:**
```python
most_common_word(bag={'apple': 30, 'orange': 12, 'pear': 50, 'banana': 12}, k=2)

>>> 'apple'
```


Use the function to calculate the 3rd most frequently occurring word in the bag-of-words.
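
One possible way to rank the words, sketched on the example bag above, is to sort by descending count and break ties alphabetically so the result is deterministic. The name `kth_most_common` and the tie-breaking rule are assumptions for this example; the exercise itself does not specify how ties should be resolved.

```python
def kth_most_common(bag: dict, k: int) -> str:
    """Return the kth most frequent word, counting k from 1 (hypothetical helper)."""
    # Sort by count descending, then by word ascending to break ties.
    ranked = sorted(bag.items(), key=lambda item: (-item[1], item[0]))
    return ranked[k - 1][0]

toy_bag = {'apple': 30, 'orange': 12, 'pear': 50, 'banana': 12}
print(kth_most_common(toy_bag, 2))  # apple
print(kth_most_common(toy_bag, 3))  # banana ('banana' and 'orange' tie on 12)
```

Sorting `(count, word)` tuples in reverse order, as some answers do, breaks ties in reverse alphabetical order instead, so tied words can come out in a different order.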

In [120]:
switched =[(value, key) for key, value in bag_of_words.items()]
switched_list = sorted(switched, reverse= True)
switched_list[2][1]

'little'

In [121]:
#Your code here
def kth_frequency(bag: dict, k: int) -> str:
    # Swap to (count, word) pairs so that sorting orders by count first
    switched = [(value, key) for key, value in bag.items()]
    switched_list = sorted(switched, reverse=True)
    return switched_list[k-1][1]

kth_frequency(bag_of_words, 3)

'little'

### Exercise 5

Write a function that calculates the number of words that appear n times in the text.

_**Function specifications:**_
* Input is taken as a `dict` and an `int` n, where n is the number of times the word appears in the text.
* Count the number of words that appear n times in the text.
* Output should be the count as an `int`.

**Example input:** 
```python
word_frequency_count(bag={'apple': 30, 'orange': 12, 'pear': 50, 'banana': 12}, n=12)

>>> 2
```

Use the function to calculate the number of words that appear eight times in the bag-of-words.
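
An alternative way to think about this task is as a "frequency of frequencies": tally how often each count value occurs in the bag, then look up n. The sketch below does this with `collections.Counter` on the example bag; the name `words_with_frequency` is an assumption for illustration.

```python
from collections import Counter

def words_with_frequency(bag: dict, n: int) -> int:
    """Return how many words occur exactly n times (hypothetical helper)."""
    frequency_of_counts = Counter(bag.values())
    # Counter returns 0 for a count value that never occurs.
    return frequency_of_counts[n]

toy_bag = {'apple': 30, 'orange': 12, 'pear': 50, 'banana': 12}
print(words_with_frequency(toy_bag, 12))  # 2
print(words_with_frequency(toy_bag, 7))   # 0
```

Building the tally once is convenient when several values of n need to be checked against the same bag.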


In [122]:
n = 8
_words_ = [word for word in bag_of_words if bag_of_words[word] == n]

len(_words_)

49

In [123]:
#Your code here
def word_frequency_count(bag:dict, n):
    _words_ = []
    _words_ +=(word for word in bag.keys() if bag[word]==n)
    return len(_words_)


word_frequency_count(bag_of_words, 8)

49

## Solutions

### Exercise 1

In [126]:
tokens = data_tokens

In [127]:
def find_roots(token_list, n):
    
    root_dict = {}
    word = token_list[n-1]
    root_dict['original'] = word
    root_dict["stem"] = stemmer.stem(word)
    root_dict["lemma"] = lemmatizer.lemmatize(word)
    
    return root_dict

In [128]:
find_roots(tokens, 120) 

{'original': 'daisies', 'stem': 'daisi', 'lemma': 'daisy'}

### Exercise 2

In [129]:
def count_stopwords(token_list):
    # Build the stop-word set once; calling stopwords.words() for every token is slow
    stop_words = set(stopwords.words("english"))
    return sum(1 for word in token_list if word in stop_words)

In [130]:
count_stopwords(tokens)

13774

### Exercise 3

In [131]:
def unique_words(token_list):
    return len(set(token_list))

In [132]:
unique_words(tokens)

2749

Note: The same result can be obtained by applying `len()` to the dictionary returned by `bag_of_words_count()`, because each unique word in `tokens` becomes exactly one key.

In [133]:
len(bag_of_words_count(tokens,{}))

2749

### Exercise 4

In [134]:
def most_common_word(bag, k):
    switch = [(value, key) for key, value in bag.items()]
    switch = sorted(switch)
    return switch[-k][1]

In [135]:
most_common_word(bag_of_words, 3)

'little'

### Exercise 5

In [136]:
def word_frequency_count(bag, n):
    total = sum(1 for value in bag.values() if value == n)
    return total

In [137]:
word_frequency_count(bag_of_words, 8)

49


<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px"/>
</div>