Instructions: Read each question carefully and write your answers in the new code blocks provided. Remember to run all the existing code blocks before you start writing your own code.

Note that your answer might deviate slightly from the given answer. As long as the difference between your answer and the sample answer is small, it will be accepted. For clarification, ask your instructor in class or post your questions in the forum.

In [4]:
# Import dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [5]:
import warnings
import gdown
warnings.filterwarnings("ignore")

## Question 1: Processing a list of words

Text Preprocessing

In [6]:
# choose some words to be stemmed
words = ["program", "programs", "programmer", "programming", "programmers"]

In [7]:
# Import nltk
import nltk

# Download the Punkt Tokenizer from the nltk library
nltk.download('punkt')

# Import snowball from nltk
from nltk.stem import snowball

# Declare snowballStemmer for English words
snowballStemmer = snowball.SnowballStemmer("english")


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\benlo\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


In [8]:
# Declare new list called stemmed_words
stemmed_words = []

# Stem each word in words and store in stemmed_words
for word in words:
  stemmed_words.append(snowballStemmer.stem(word.lower()))

# Print out stemmed_words
print(stemmed_words)

['program', 'program', 'programm', 'program', 'programm']



```
Expected Answer: ['program', 'program', 'programm', 'program', 'programm']
```

Feature Extraction

In [21]:
# choose some words to be vectorized
word_list = ["hello", "world", "programmer", "student", "programmer"]

In [22]:
# import countvectorizer from sklearn
from sklearn.feature_extraction.text import CountVectorizer

# declare CountVectorizer as vectorizer
vectorizer = CountVectorizer()

# fit vectorizer with words
vectorizer.fit(word_list)


In [24]:
# list out the vocabularies recognized by vectorizer
vectorizer.vocabulary_

{'hello': 0, 'world': 3, 'programmer': 1, 'student': 2}

```
Expected Answer: ['hello', 'programmer', 'student', 'world']
```
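The cell output above is a dictionary that maps each term to its column index, while the expected answer lists the terms in column order. A minimal sketch of how the two relate, using only the mapping printed above (the name `feature_names` is chosen here for illustration). In recent scikit-learn versions, `vectorizer.get_feature_names_out()` should give the same ordering directly.

```python
# Mapping copied from the vectorizer.vocabulary_ output above.
vocabulary = {'hello': 0, 'world': 3, 'programmer': 1, 'student': 2}

# Sort the terms by their column index to recover the feature order.
feature_names = sorted(vocabulary, key=vocabulary.get)

print(feature_names)
```

Because the indices are assigned alphabetically, sorting by index and sorting the terms themselves give the same list.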

In [26]:
# transform data into bag of words and save it as words_bow
words_bow = vectorizer.transform(word_list)

# print out words_bow in array format (use words_bow.toarray())
print(words_bow.toarray())

[[1 0 0 0]
 [0 0 0 1]
 [0 1 0 0]
 [0 0 1 0]
 [0 1 0 0]]


```
Expected Answer:
[[1 0 0 0]
 [0 0 0 1]
 [0 1 0 0]
 [0 0 1 0]
 [0 1 0 0]]
```
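To see where each row of the matrix comes from, here is a simplified pure-Python sketch of a bag-of-words count. It is not how scikit-learn is implemented: the helper `bag_of_words` is hypothetical, and it splits on whitespace instead of using the vectorizer's tokenizer. Each row is one input string and each column is one vocabulary term in alphabetical order.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, matrix) for a list of strings (illustrative helper)."""
    # Simplification: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is the sorted set of all distinct tokens.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One count per vocabulary term; missing terms count as 0.
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

word_list = ["hello", "world", "programmer", "student", "programmer"]
vocabulary, matrix = bag_of_words(word_list)

print(vocabulary)
for row in matrix:
    print(row)
```

The third and fifth rows are identical because both inputs are the word "programmer", which occupies column 1.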

## Question 2: Processing a sentence

Text Preprocessing

In [50]:
# choose a sentence
sentence = "Programmers program with programming languages"

In [51]:
# Import nltk
import nltk

# Download the Punkt Tokenizer from the nltk library
nltk.download('punkt')
nltk.download('punkt_tab')

# Import snowball from nltk
from nltk.stem import snowball

# Declare snowballStemmer for English words
snowballStemmer = snowball.SnowballStemmer("english")


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\benlo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\benlo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [52]:
# Tokenize sentence and store it as tokens
tokens = nltk.word_tokenize(sentence.lower())

# Declare new list called stemmed_tokens
stemmed_tokens = []

# Stem each token in tokens and store in stemmed_tokens
for token in tokens:
    stemmed_tokens.append(snowballStemmer.stem(token))

# Join stemmed words in stemmed_tokens and save as new_sentence
new_sentence = ' '.join(stemmed_tokens)

# Print out new_sentence
print(new_sentence)

programm program with program languag


```
Expected Answer: programm program with program languag
```

Feature Extraction

In [53]:
# import countvectorizer from sklearn
from sklearn.feature_extraction.text import CountVectorizer

# declare CountVectorizer as vectorizer
vectorizer = CountVectorizer()

# fit vectorizer with new_sentence
vectorizer.fit([new_sentence])



In [54]:
# list out the vocabularies recognized by vectorizer
vectorizer.vocabulary_

{'programm': 2, 'program': 1, 'with': 3, 'languag': 0}

```
Expected Answer: ['languag', 'program', 'programm', 'with']
```

In [56]:
# transform new_sentence into bag of words and save it as sentence_bow
sentence_bow = vectorizer.transform([new_sentence])

# print out sentence_bow in array format (use sentence_bow.toarray())
print(sentence_bow.toarray())

[[1 2 1 1]]


```
Expected Answer: [[1 2 1 1]]
```

## Question 3: Processing a sentence with punctuation

Text Preprocessing

In [57]:
# choose a sentence
sentence = "Hello student, You have to build a very good site and I love visiting your site."

In [58]:
# Tokenize sentence and store it as tokens
tokens = nltk.word_tokenize(sentence.lower())

# Declare new list called stem_tokens
stem_tokens = []

# Stem each token in tokens and store in stem_tokens
for token in tokens:
    stem_tokens.append(snowballStemmer.stem(token))

# Remove punctuation and special characters from the list stem_tokens
stem_tokens = [token for token in stem_tokens if token.isalnum()]

# Join stem words in stem_tokens and save as new_sentence
new_sentence = ' '.join(stem_tokens)

# Print out new_sentence
print(new_sentence)

hello student you have to build a veri good site and i love visit your site


```
Expected Answer: hello student you have to build a veri good site and i love visit your site
```
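The punctuation filter relies on `str.isalnum()`, which is true only when every character in the token is a letter or a digit. A small sketch with a hypothetical token list (not the tokenizer's actual output for the sentence above) shows what the filter keeps and drops. Note that a token containing an apostrophe, such as "n't", is dropped as well.

```python
# Hypothetical tokens, similar in shape to what a word tokenizer might return.
tokens = ["hello", "student", ",", "you", "do", "n't", "site", ".", "2024"]

# Keep tokens made up entirely of letters and digits.
kept = [token for token in tokens if token.isalnum()]
# Everything else (punctuation, tokens with apostrophes) is removed.
dropped = [token for token in tokens if not token.isalnum()]

print(kept)
print(dropped)
```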

Feature Extraction

In [59]:
# import CountVectorizer from sklearn
from sklearn.feature_extraction.text import CountVectorizer

# declare CountVectorizer as vectorizer
vectorizer = CountVectorizer()

# fit vectorizer with new_sentence
vectorizer.fit([new_sentence])



In [60]:
# list out the vocabularies recognized by vectorizer
vectorizer.vocabulary_

{'hello': 4,
 'student': 7,
 'you': 11,
 'have': 3,
 'to': 8,
 'build': 1,
 'veri': 9,
 'good': 2,
 'site': 6,
 'and': 0,
 'love': 5,
 'visit': 10,
 'your': 12}

```
Expected Answer:
['and',
 'build',
 'good',
 'have',
 'hello',
 'love',
 'site',
 'student',
 'to',
 'veri',
 'visit',
 'you',
 'your']
```

In [61]:
# transform new_sentence into bag of words and save it as sentence_bow
sentence_bow = vectorizer.transform([new_sentence])

# print out sentence_bow in array format (use sentence_bow.toarray())
print(sentence_bow.toarray())


[[1 1 1 1 1 1 2 1 1 1 1 1 1]]


```
Expected Answer: [[1 1 1 1 1 1 2 1 1 1 1 1 1]]
```
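The preprocessed sentence contains the one-letter words "a" and "i", yet neither appears in the vocabulary. This follows from `CountVectorizer`'s documented default `token_pattern`, `(?u)\b\w\w+\b`, which only matches tokens of two or more word characters. The sketch below applies that pattern with Python's `re` module rather than scikit-learn itself, so treat it as an approximation of the vectorizer's behaviour.

```python
import re
from collections import Counter

# Default token pattern documented for CountVectorizer: 2+ word characters.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

new_sentence = ("hello student you have to build a veri good site "
                "and i love visit your site")

all_words = new_sentence.split()
kept = token_pattern.findall(new_sentence)
# Words present in the sentence but not matched by the pattern.
dropped = [word for word in all_words if word not in kept]

counts = Counter(kept)
vocabulary = sorted(counts)
row = [counts[term] for term in vocabulary]

print(dropped)
print(len(vocabulary))
print(row)
```

The single 2 in the row belongs to "site", the only term that occurs twice.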

# Advanced

## Challenge 1 - Processing a list of sentences

Text Preprocessing

In [62]:
list_of_sentences = ['Hello everyone.', 'How are you?', 'I am fine.']

In [63]:
# Declare an empty array to store the preprocessed sentences
preprocessed_sentences = np.array([])

# Use a for loop to loop through list_of_sentences with each element as sentence
for sentence in list_of_sentences:
  # Tokenize sentence and store it as tokens
  tokens = nltk.word_tokenize(sentence.lower())
  # Declare new list called stem_tokens
  stem_tokens = []
  # Stem each token in tokens and store in stem_tokens
  for token in tokens:
    stem_tokens.append(snowballStemmer.stem(token))

  # Remove punctuation and special characters from stem_tokens:
  # keep only the tokens that are alphanumeric (letters and numbers)
  stem_tokens = [token for token in stem_tokens if token.isalnum()]

  # Join stem words in stem_tokens and save as new_sentence
  new_sentence = ' '.join(stem_tokens)
  # Use np.append to append new_sentence into preprocessed_sentences
  preprocessed_sentences = np.append(preprocessed_sentences, new_sentence)
# End of for loop

In [64]:
# Print out preprocessed_sentences
print(preprocessed_sentences)

['hello everyon' 'how are you' 'i am fine']


```
Expected Answer: ['hello everyon', 'how are you', 'i am fine']
```

Feature Extraction

In [65]:
# import CountVectorizer from sklearn
from sklearn.feature_extraction.text import CountVectorizer
# declare CountVectorizer as vectorizer
vectorizer = CountVectorizer()
# fit vectorizer with preprocessed_sentences
vectorizer.fit(preprocessed_sentences)


In [66]:
# list out the vocabularies recognized by vectorizer
vectorizer.vocabulary_

{'hello': 4, 'everyon': 2, 'how': 5, 'are': 1, 'you': 6, 'am': 0, 'fine': 3}

```
Expected Answer: ['am', 'are', 'everyon', 'fine', 'hello', 'how', 'you']
```

In [67]:
# transform preprocessed_sentences into bag of words and save it as sentences_bow
sentences_bow = vectorizer.transform(preprocessed_sentences)
# print out sentences_bow in array format (use sentences_bow.toarray())
print(sentences_bow.toarray())

[[0 0 1 0 1 0 0]
 [0 1 0 0 0 1 1]
 [1 0 0 1 0 0 0]]


```
Expected Answer:
[[0 0 1 0 1 0 0]
 [0 1 0 0 0 1 1]
 [1 0 0 1 0 0 0]]
```

## Challenge 2 - Processing a dataset

Read more about the dataset: [Kaggle Link](https://www.kaggle.com/praveengovi/emotions-dataset-for-nlp)

Text Preprocessing

In [68]:
# Download and read data
gdown.download('https://drive.google.com/uc?id=1ube5ON5i1m2Y-FtmLkpVcesTm0iqr3Zq', 'emotion_data.csv', quiet=False)
text_data = pd.read_csv('emotion_data.csv')

Downloading...
From: https://drive.google.com/uc?id=1ube5ON5i1m2Y-FtmLkpVcesTm0iqr3Zq
To: c:\Users\benlo\telebort\Al2\L9-10 NLP\emotion_data.csv
100%|██████████| 3.70M/3.70M [00:12<00:00, 303kB/s]


In [69]:
# Create a function to preprocess text
def text_preprocessing(text):
  # Tokenize text and store it as tokens
  tokens = nltk.word_tokenize(text.lower())
  # Declare new list called stem_tokens
  stem_tokens = []
  # Stem each token in tokens and store in stem_tokens
  for token in tokens:
    stem_tokens.append(snowballStemmer.stem(token))

  # Remove punctuation and special characters from the list stem_tokens
  stem_tokens = [token for token in stem_tokens if token.isalnum()]
  # Join stem words in stem_tokens and save as preprocessed_text
  preprocessed_text = ' '.join(stem_tokens)
  # Return preprocessed_text as the function's output
  return preprocessed_text
# End of function

In [70]:
# Declare preprocessed_text_data as an empty array
preprocessed_text_data = np.array([])
# Use for loop to loop through every single sentence in text_data['Document'] and save each sentence as sentence
for sentence in text_data['Document']:
  # Run text_preprocessing with sentence as its argument, then store the result as x
  x = text_preprocessing(sentence)
  # append x onto preprocessed_text_data
  preprocessed_text_data = np.append(preprocessed_text_data, x)

In [71]:
# Print out the first 5 sentences in text_data['Document']
text_data['Document'][:5]

0                              i didnt feel humiliated
1    i can go from feeling so hopeless to so damned...
2     im grabbing a minute to post i feel greedy wrong
3    i am ever feeling nostalgic about the fireplac...
4                                 i am feeling grouchy
Name: Document, dtype: object

In [72]:
# Print out the first 5 sentences in preprocessed_text_data
preprocessed_text_data[:5]

array(['i didnt feel humili',
       'i can go from feel so hopeless to so damn hope just from be around someon who care and is awak',
       'im grab a minut to post i feel greedi wrong',
       'i am ever feel nostalg about the fireplac i will know that it is still on the properti',
       'i am feel grouchi'], dtype='<U286')

Feature Extraction

In [74]:
# import CountVectorizer from sklearn
from sklearn.feature_extraction.text import CountVectorizer
# declare CountVectorizer as vectorizer
vectorizer = CountVectorizer()
# fit vectorizer with preprocessed_text_data
vectorizer.fit(preprocessed_text_data)


In [75]:
# list out the vocabularies recognized by vectorizer
vectorizer.vocabulary_

{'didnt': 2534,
 'feel': 3420,
 'humili': 4562,
 'can': 1356,
 'go': 3973,
 'from': 3744,
 'so': 8793,
 'hopeless': 4497,
 'to': 9730,
 'damn': 2256,
 'hope': 4495,
 'just': 5173,
 'be': 787,
 'around': 496,
 'someon': 8848,
 'who': 10654,
 'care': 1402,
 'and': 343,
 'is': 4960,
 'awak': 644,
 'im': 4652,
 'grab': 4018,
 'minut': 6090,
 'post': 7338,
 'greedi': 4064,
 'wrong': 10820,
 'am': 293,
 'ever': 3210,
 'nostalg': 6569,
 'about': 27,
 'the': 9582,
 'fireplac': 3509,
 'will': 10690,
 'know': 5305,
 'that': 9577,
 'it': 4982,
 'still': 9115,
 'on': 6710,
 'properti': 7500,
 'grouchi': 4095,
 'ive': 4999,
 'been': 821,
 'littl': 5588,
 'burden': 1270,
 'late': 5387,
 'wasnt': 10530,
 'sure': 9313,
 'whi': 10630,
 'was': 10525,
 'take': 9418,
 'or': 6753,
 'milligram': 6062,
 'time': 9699,
 'recommend': 7767,
 'amount': 321,
 'fallen': 3348,
 'asleep': 538,
 'lot': 5662,
 'faster': 3379,
 'but': 1293,
 'also': 281,
 'like': 5544,
 'funni': 3787,
 'as': 521,
 'confus': 1930,
 'life

In [76]:
# transform preprocessed_text_data into bag of words and save it as text_data_bow
text_data_bow = vectorizer.transform(preprocessed_text_data)
# print out text_data_bow[0]
print(text_data_bow[0])

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 3 stored elements and shape (1, 10955)>
  Coords	Values
  (0, 2534)	1
  (0, 3420)	1
  (0, 4562)	1


```
Expected Answer:
  (0, 2534)	1
  (0, 3420)	1
  (0, 4562)	1
```

In [77]:
# convert text_data_bow[0] to array using .toarray() and print it out
text_data_bow[0].toarray()

array([[0, 0, 0, ..., 0, 0, 0]], dtype=int64)

```
Expected Answer: [[0 0 0 ... 0 0 0]]
```
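The two outputs above describe the same row in different forms. The sparse form stores only the non-zero entries as (row, column) and value, and the dense form writes out all 10955 columns. A pure-Python sketch of that relationship, using the coordinates printed above (the names `stored` and `dense_row` are illustrative, and this does not use SciPy's actual sparse classes):

```python
# Number of columns, taken from the printed shape (1, 10955).
n_features = 10955

# Non-zero entries of row 0 as {column index: count}, copied from the output above.
# In the vocabulary listing these columns are 'didnt', 'feel' and 'humili'.
stored = {2534: 1, 3420: 1, 4562: 1}

# Build the dense row: zeros everywhere except the stored columns.
dense_row = [0] * n_features
for column, count in stored.items():
    dense_row[column] = count

print(sum(dense_row))
print(dense_row[:3], dense_row[-3:])
```

Only 3 of the 10955 entries are non-zero, which is why the abbreviated dense printout shows nothing but zeros at both ends.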