<a href="https://colab.research.google.com/github/Natural-Language-Processing-YU/Module-5-Assignment/blob/main/scripts/Part%20I.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Part I:  Vector Semantics and Motivation for Word Embeddings

It is important to understand the words meaning (recall semantics) AND their context. Words that are seen in the similar context also often have similar meaning. The distributional hypothesis expresses this phenomenon saying that there is a link in similarity in how words are distributed and their likeness.  Vector semantics is the concept of learning representations of meanings of words – called embeddings—from their distributions in a corpus or corpora. Fundamentally, we are asking the question with NLP: how might we represent the meaning of a word and interpret it?

A word embedding is simply a to represent words in a numerical context -- a vector.  This is important because Neural Networks and Machine Learning models don't learn on the text itself, but the numerical representation of the text. In fact, there is typically an "embedding layer" as part of the simplest NLP-based neural networks as you will find.

The simplest way to show this is called a one-hot vector, other forms include term frequencies of words (as we have seen with Bayesian models), Term Frequency-Inverse Document Frequency, which normalizes terms across documents, and distributional representations, which are context-based encodings that help derive similarity-- i.e., "queen is to female as king is to male".

We will start simple and discuss some of the challenges, then move to more complex transformations.

## Setup
As part of completing the assignment, you will see that there are areas in the note book for you to complete your own coding input.

It will be look like following:
```
### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
'Some coding activity for you to complete'
### END CODE HERE ###

```
Please be sure to fill these code snippets out as you turn in your assignment.


##1.1 One-hot vector

A one-hot vector helps to translate categorical or sequential data to something that is machine readable and also does not have an impact on your model. Each word in the sequence is given a binary encoding and is mapped to a vector of the length of the the input. This is a common pre-processing step for the input layer in a neural network.

One hot encoding assigns a unique code for each unique word. As an example, we can take the following sentence and convert it to a one-hot vector.

"Live as if you were to die tomorrow. Learn as if you were to live forever"

We will use NLTK to tokenize the sentence, then Sci-Kit Learn to apply the one-hot encoder. Note, that SK-Learn will apply a single value for a unique word in a vector which is great for categorical representations. This one-hot encoding has traditionally been used for feeding categorical data to many scikit-learn estimators in shallow learning models such as notably linear models and SVMs with the standard kernels.

Note: This approach is inefficient. A one-hot encoded vector is sparse (meaning, most indices are zero). Imagine you have 10,000 words in the vocabulary. To one-hot encode each word, you would create a vector where 99.99% of the elements are zero.


In [1]:
import numpy as np
import pandas as pd
import nltk
import re
import string
from numpy import argmax
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
nltk.download('punkt')

#%matplotlib inline
tokenizer = RegexpTokenizer(r'\w+')

# define input string
data = 'Live as if you were to die tomorrow. Learn as if you were to live forever'
#tokenize that string
wordlist = nltk.word_tokenize(data.lower())
#create a vector representation of the wordlist
wordlist_clean = []

for i in wordlist: # Go through every word in your tokens list
    if (i not in string.punctuation):  # remove punctuation
        wordlist_clean.append(i)
# define universe of possible input values
wordlist_clean_df = pd.DataFrame(data=wordlist_clean, columns=['words'])

#encode using scki-kit learn
one_hot_encoder = OneHotEncoder(sparse=False)
one_hot_encoder.fit(wordlist_clean_df)
wordlist_clean_df_encoded = one_hot_encoder.transform(wordlist_clean_df)
wordlist_clean_df_encoded = pd.DataFrame(data=wordlist_clean_df_encoded, columns=one_hot_encoder.categories_)
print('\n\n One-Hot Encoded Vector using SKLearn')
display(wordlist_clean_df_encoded)



 One-Hot Encoded Vector using SKLearn


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,as,die,forever,if,learn,live,to,tomorrow,were,you
0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
6,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
8,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
9,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


##1.2 Encoding as a dense  - Singular Value Decomposition
A second approach you might try is to encode each word using a unique number. This helps with reducing dimensionality and attempts to address the problem of very large sparse matrices. Continuing the example above, you could assign 1 to "live", 2 to "the", and so on. You could then encode the sentence "The cat sat on the mat" as a dense vector. Now, instead of a sparse vector, you now have a dense one. A dense vector is a vector where all elements are populated with a non-zero value.

There are several challenges:

1.   The integer-encoding is arbitrary (it does not capture any relationship between words)
2.   An integer-encoding can be challenging for a model to interpret. A linear classifier, for example, learns a single weight for each feature. Because there is no relationship between the similarity of any two words and the similarity of their encodings, this feature-weight combination is not meaningful.
3.  Word order is ignored.
4.  Raw absolute frequency counts of words do not necessarily represent the meaning of the text properly




In [2]:
import numpy as np
import pandas as pd
import nltk
import re
import string
from numpy import argmax
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
nltk.download('punkt')
"""## Default Style Settings
matplotlib.rcParams['figure.dpi'] = 150
pd.options.display.max_colwidth = 200
#%matplotlib inline"""
tokenizer = RegexpTokenizer(r'\w+')

# define input string
data = 'Live as if you were to die tomorrow. Learn as if you were to live forever'
#tokenize that string
wordlist = nltk.word_tokenize(data.lower())
#create a vector representation of the wordlist
wordlist_clean = []

for i in wordlist: # Go through every word in your tokens list
    if (i not in string.punctuation):  # remove punctuation
        wordlist_clean.append(i)
# define universe of possible input values
wordlist_clean_df = pd.DataFrame(data=wordlist_clean, columns=['words'])
dense_vector = np.unique(wordlist_clean_df, return_counts=True)
dense_vector_df = pd.DataFrame(data=dense_vector, columns = np.unique(wordlist_clean_df))
display(dense_vector_df)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,as,die,forever,if,learn,live,to,tomorrow,were,you
0,as,die,forever,if,learn,live,to,tomorrow,were,you
1,2,1,1,2,1,2,2,1,2,2


##1.3 Text Vectorization


*   Overview
*   N-Gram Bag of Words
*   Term Frequency-Inverse Document Frequency (TF-IDF)
*   Document Similarity: Cosine Similarity, Jaccard Similarity, Euclidian Similarity
*   Topic Modeling Exercise

### Why do we do it?
These subsquent categories of text vectorizations are ways to derive similarity from text documents. This is useful for NLP tasks such as topic modeling -- where we aim to show the relationship between documents via a category or topic. You will see how TF-IDF can be used to support topic modeling.

Here are some text vectorization approaches in summary:
![Text Vectorization Approaches](https://drive.google.com/uc?export=view&id=12GYWDaK5_offSn3Gy-hv_KpuTc4A_mGA)





### 1.3.1 N-Gram Bag-Of Words Model
You've already learned the bag-of words model above with one-hot encoding and dense vectorization! We are counting the frequencies of words in the matrix in a dense representation of the word vector. What happens if we took some steps to improve the Bag-of-Words model by incorporating the n-gram approach we have learned earlier in the class.

What does this do?
If our goal is to identify words in texts that represent meaning of that text, then recall that taking the bi-gram, tri-gram, or n-gram of a corpus allows us to bring in context via the word order. With a simple BOW approach, no word order is considers. Moreover, we can filter words based on distributional counts -- that is, term frequencies. Imagine that the counts of a word fall into say a Gaussian (normal) Distribution across a number of corpora. We can use the distribution to filter out salient word-phrases or sequences in which we can infer the meaning of the text. Finally, we can apply weights to the frequency counts -- similar to weight vector in a NN-- in which those weights have an impact on word relationships or salience.


![Bag of Words](https://drive.google.com/uc?export=view&id=1btCVz_8JWYTvE73qGLCRZb7nXg-kCiDU)




#### 1.3.1.2 Example: N-Gram Bag-Of Words Model

In [2]:
#example from: https://alvinntnu.github.io/NTNU_ENC2045_LECTURES/exercise/5-text-vectorization.html#

import pandas as pd                        # Python library for pandas - data maniplation
import numpy as np                         # Python library for numpy -- matrix algebra library
import matplotlib                          # Python library for matplotlib -- visual display of data
import matplotlib.pyplot as plt            # Python library for matplotlib -- visual display of data
import nltk                                # Python library for NLP
import re                                  # library for regular expression operations
import string                              # for string operations
nltk.download('stopwords')                 # package for stop words
from nltk.corpus import stopwords          # module for stop words that come with NLTK

from nltk.stem import PorterStemmer        # module for stemming
from sklearn.feature_extraction.text import CountVectorizer

## Default Style Settings
matplotlib.rcParams['figure.dpi'] = 150
pd.options.display.max_colwidth = 200
#%matplotlib inline

corpus = [
    'The sky is blue and beautiful.', 'Love this blue and beautiful sky!',
    'The quick brown fox jumps over the lazy dog.',
    "A king's breakfast has sausages, ham, bacon, eggs, toast and beans",
    'I love green eggs, ham, sausages and bacon!',
    'The brown fox is quick and the blue dog is lazy!',
    'The sky is very blue and the sky is very beautiful today',
    'The dog is lazy but the brown fox is quick!'
]
labels = [
    'weather', 'weather', 'animals', 'food', 'food', 'animals', 'weather',
    'animals'
]

corpus = np.array(corpus) # np.array better than list
corpus_df = pd.DataFrame({'Document': corpus, 'Category': labels})
corpus_df

wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
    # lower case and remove special characters\whitespaces
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, re.I | re.A)
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

normalize_corpus = np.vectorize(normalize_document)

norm_corpus = normalize_corpus(corpus)
print(corpus)
print("="*50)
print(norm_corpus)

# you can set the n-gram range to 1,2 to get unigrams as well as bigrams
bv = CountVectorizer(ngram_range=(2, 2))
bv_matrix = bv.fit_transform(norm_corpus)

bv_matrix = bv_matrix.toarray()
vocab = bv.get_feature_names_out()
pd.DataFrame(bv_matrix, columns=vocab)

['The sky is blue and beautiful.' 'Love this blue and beautiful sky!'
 'The quick brown fox jumps over the lazy dog.'
 "A king's breakfast has sausages, ham, bacon, eggs, toast and beans"
 'I love green eggs, ham, sausages and bacon!'
 'The brown fox is quick and the blue dog is lazy!'
 'The sky is very blue and the sky is very beautiful today'
 'The dog is lazy but the brown fox is quick!']
['sky blue beautiful' 'love blue beautiful sky'
 'quick brown fox jumps lazy dog'
 'kings breakfast sausages ham bacon eggs toast beans'
 'love green eggs ham sausages bacon' 'brown fox quick blue dog lazy'
 'sky blue sky beautiful today' 'dog lazy brown fox quick']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,bacon eggs,beautiful sky,beautiful today,blue beautiful,blue dog,blue sky,breakfast sausages,brown fox,dog lazy,eggs ham,...,lazy dog,love blue,love green,quick blue,quick brown,sausages bacon,sausages ham,sky beautiful,sky blue,toast beans
0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,0,1,0,1,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,1,0,0,...,1,0,0,0,1,0,0,0,0,0
3,1,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,1
4,0,0,0,0,0,0,0,0,0,1,...,0,0,1,0,0,1,0,0,0,0
5,0,0,0,0,1,0,0,1,1,0,...,0,0,0,1,0,0,0,0,0,0
6,0,0,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
7,0,0,0,0,0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0


### 1.3.2 Term Frequency-Inverse Document Frequency (TF-IDF)
As an extension of the BOW model, we can weight the frequency (counts) of the terms in a document by considering its *dispersion*. Fundamentally, we are taken the total frequency of a word and dividing it by the number of documents with that term -- this gives us term frequency.


Then we take the inverse The formula for TF-IDF will look something like:


*   Term Frequency(TF): the number of times a word appears in a document. These are the raw absolute frequency counts of the words in the BOW model.
*   Inverse Document Frequency(IDF): total documents in corpus over number of documents with term.

> $\textit{TF-IDF} = {tf \times idf}$

Here, the general idea is that we can extropolate the meaningful words from a corpus by inversing their frequency. For example, "The" in the corpus may be frequently observed, but does not garner meaning. We can use this for keyword extraction, and information retrieval tasks.

Let's normalize this function to account for divide-by-zero erros and to also smooth the weighting scheme.

Addressing divide-by-zero errors. Similar to Laplace Smoothing techniques, we will typically add one to the IDF formula:

> $\textit{IDF} = 1 + log\frac{N}{1+df}$

We also might normalize the final IF-IDF function using an L2 Norm (see more in Jurafsky, Chapter 6).

> $\textit{TF-IDF}_{normalized} = \frac{tf \times idf}{\sqrt{(tf\times idf)^2}}$


#### 1.3.2.1 Example: TF-IDF Usage
In our example, we use the TfidfTransformer function to apply L2 norms and smoothing techniques.

```
tt = TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True)
```


Let's use the same corpus from above in our example for TD-IDF

In [5]:
norm_corpus = ['sky blue beautiful', 'love blue beautiful sky',
 'quick brown fox jumps lazy dog',
 'kings breakfast sausages ham bacon eggs toast beans',
 'love green eggs ham sausages bacon', 'brown fox quick blue dog lazy',
 'sky blue sky beautiful today' ,'dog lazy brown fox quick']


from sklearn.feature_extraction.text import CountVectorizer
# get bag of words features in sparse format
cv = CountVectorizer(min_df=0., max_df=1.)
cv_matrix = cv.fit_transform(norm_corpus)
cv_matrix

from sklearn.feature_extraction.text import TfidfTransformer


"""Note: With Tfidftransformer you will systematically compute word counts using CountVectorizer
and then compute the Inverse Document Frequency (IDF) values and only then compute the Tf-idf scores."""
tt = TfidfTransformer(norm='l2',
                      use_idf=True,
                      smooth_idf=True)
tt_matrix = tt.fit_transform(cv_matrix)
tt_matrix = tt_matrix.toarray()
vocab = cv.get_feature_names_out()
tt_df = pd.DataFrame(np.round(tt_matrix, 2), columns=vocab)
display(tt_df)



"""Note: WWith Tfidfvectorizer on the contrary, you will do all three steps at once.
Under the hood, it computes the word counts, IDF values, and Tf-idf scores all using the same dataset."""

from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer(min_df=0.,
                     max_df=1.,
                     norm='l2',
                     use_idf=True,
                     smooth_idf=True)
tv_matrix = tv.fit_transform(norm_corpus)
tv_matrix = tv_matrix.toarray()

vocab = tv.get_feature_names_out()
tv_df  = pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)
display(tv_df)


Unnamed: 0,bacon,beans,beautiful,blue,breakfast,brown,dog,eggs,fox,green,ham,jumps,kings,lazy,love,quick,sausages,sky,toast,today
0,0.0,0.0,0.6,0.53,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6,0.0,0.0
1,0.0,0.0,0.49,0.43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.57,0.0,0.0,0.49,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.38,0.38,0.0,0.38,0.0,0.0,0.53,0.0,0.38,0.0,0.38,0.0,0.0,0.0,0.0
3,0.32,0.38,0.0,0.0,0.38,0.0,0.0,0.32,0.0,0.0,0.32,0.0,0.38,0.0,0.0,0.0,0.32,0.0,0.38,0.0
4,0.39,0.0,0.0,0.0,0.0,0.0,0.0,0.39,0.0,0.47,0.39,0.0,0.0,0.0,0.39,0.0,0.39,0.0,0.0,0.0
5,0.0,0.0,0.0,0.37,0.0,0.42,0.42,0.0,0.42,0.0,0.0,0.0,0.0,0.42,0.0,0.42,0.0,0.0,0.0,0.0
6,0.0,0.0,0.36,0.32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.72,0.0,0.5
7,0.0,0.0,0.0,0.0,0.0,0.45,0.45,0.0,0.45,0.0,0.0,0.0,0.0,0.45,0.0,0.45,0.0,0.0,0.0,0.0


Unnamed: 0,bacon,beans,beautiful,blue,breakfast,brown,dog,eggs,fox,green,ham,jumps,kings,lazy,love,quick,sausages,sky,toast,today
0,0.0,0.0,0.6,0.53,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6,0.0,0.0
1,0.0,0.0,0.49,0.43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.57,0.0,0.0,0.49,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.38,0.38,0.0,0.38,0.0,0.0,0.53,0.0,0.38,0.0,0.38,0.0,0.0,0.0,0.0
3,0.32,0.38,0.0,0.0,0.38,0.0,0.0,0.32,0.0,0.0,0.32,0.0,0.38,0.0,0.0,0.0,0.32,0.0,0.38,0.0
4,0.39,0.0,0.0,0.0,0.0,0.0,0.0,0.39,0.0,0.47,0.39,0.0,0.0,0.0,0.39,0.0,0.39,0.0,0.0,0.0
5,0.0,0.0,0.0,0.37,0.0,0.42,0.42,0.0,0.42,0.0,0.0,0.0,0.0,0.42,0.0,0.42,0.0,0.0,0.0,0.0
6,0.0,0.0,0.36,0.32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.72,0.0,0.5
7,0.0,0.0,0.0,0.0,0.0,0.45,0.45,0.0,0.45,0.0,0.0,0.0,0.0,0.45,0.0,0.45,0.0,0.0,0.0,0.0


### 1.3.4 Document Similarity and Word Semantics

Lexical semantics is a branch of linguistics focused on meaning and word relationships. Moreover, the idea behind word sense is the interpretation of the word (often requiring context to understand). Where multiple meanings can occur for a word – take the example of a mouse that can mean both the cursor controller and the rodent—we must discern using context. Relationships between word senses can be referred to as synonyms (e.g., couch/sofa).

**Word Similarity** is not the same as a synonym, rather it is the idea that words have relationships. The example of cat and dog is used to show that while they are not synonymous, they are both animals, often they are domesticated – their semantics are similar.

**Word relatedness** is slightly different than word similarity where there is more of a psychological association—for example, that coffee and cup are related.
Recall that vectors for representing words are called embeddings implying that a point in space can be mapped to another point in space.

This is important because word similarity (measured through a vector representing distance between two words in space) can be powerful for tasks we have previously done, such as sentiment analysis. Moreover, we can derive the meaning of the word using the nearby counts of similar words.

We look at three similarity metrics to score word and/or document similarity:

*   Manhattan Distance: is the sum of absolute differences between points across all the dimensions. Called "Manhattan" because we can think of getting from point (a,b) to point (c,d) on a Cartesian plane by only travelining vertically or horizontally, not diagnally.
*   Euclidian Distance: is the shortest distance between two points in mathmatics. Not as useful in the field of NLP. The "as the crow flies" distance.
*   Cosine Similarity: measure similarity based on the content overlap between documents.
*   Jaccard Similarity: Used to identify documents we measure it as proportion of number of common words to number of unique words in both documents.

Note: Generally speaking the difference betweem *distance* and *similarity* is basically that distance is just equal to 1 - similarity.


Let's take a look at Cosine Similarity metrics since this is most commonly used with NLP and also with word2vec.

> $similarity(doc_1, doc_2) = cos(\theta) = \frac{doc_1  doc_2}{\lvert doc_1\rvert \lvert doc_2\rvert}$

By cosine distance/dissimilarity we assume following:
> $distance(doc_1, doc_2) = 1 - similarity(doc_1, doc_2)$

The similarity-based metics look like the following🇰
> cos(\vec{x},\vec{y}) = \frac{\sum_{i=1}^{n}{x_i\times y_i}}{\sqrt{\sum_{i=1}^{n}x_i^2}\times \sqrt{\sum_{i=1}^{n}y_i^2}}



```
cosine_similarity(xyz)
array([[1.        , 0.97780241, 0.30320366],
       [0.97780241, 1.        , 0.49613894],
       [0.30320366, 0.49613894, 1.        ]])
```

![](https://drive.google.com/uc?export=view&id=1c9D33toCdC1W3_SRiUawykcspho8IVfI)



### **1.3.5: Exercise: BOW with n-gram**
Use the *brown* corpus to create a n-gram BOW model. First, you must clean and organize the data. Then enter your code to complete the exercise.

The Brown Corpus is an collection of text samples of American English categorized by various genres such as science-fiction, adventure, etc.

Create a tri-gram bag of words matrix using the brown corpus as its inputs.



In [19]:
import pandas as pd                        # Python library for pandas - data maniplation
import numpy as np                         # Python library for numpy -- matrix algebra library
import matplotlib                          # Python library for matplotlib -- visual display of data
import matplotlib.pyplot as plt            # Python library for matplotlib -- visual display of data
import nltk                                # Python library for NLP
import re                                  # library for regular expression operations
import string                              # for string operations

nltk.download('stopwords')                 # package for stop words
nltk.download('brown')                 # package for stop words
from nltk.corpus import stopwords          # module for stop words that come with NLTK
from nltk.corpus import brown              # this is the corpus you use for this exercise.
from nltk.stem import PorterStemmer        # module for stemming
from sklearn.feature_extraction.text import CountVectorizer

#The seed() method is used to initialize the random number generator
np.random.seed(100)

brown_cat= brown.categories() # Creates a list of categories

docs=[]
for cat in brown_cat: # We append tuples of each document and categories in a list
    t1=brown.sents(categories=cat) # At each iteration we retrieve all documents of a given category
    for doc in t1:
        docs.append((' '.join(doc), cat)) # These documents are appended as a tuple (document, category) in the list

brown_df=pd.DataFrame(docs, columns=['sentence', 'category']) #The data frame is created using the generated tuple.

brown_df.head()


#Step 1. Pre-Processing the Brown Corpus Text
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
    # lower case and remove special characters\whitespaces
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, re.I | re.A)
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

normalize_corpus = np.vectorize(normalize_document)

### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###

#create some normalized corpus from the pre-processing functiong above
normalize_corpus = normalize_corpus(brown_df['sentence'].values)

#Using the nromalized corpus.
#Because the brown corpus is very large,select 10,000 random records from the corpus. Set seed so you can return the same results.
np.random.seed(100)
norm_corpus = np.random.choice(normalize_corpus, size=10000, replace=False)

#Step 2. Create a tri-gram data frame and count its frequencies
vectorizer = CountVectorizer(ngram_range=(3, 3))
tri_gram_matrix = vectorizer.fit_transform(norm_corpus)

#print the dataframe to show the tri-gram BOW
bv_matrix = tri_gram_matrix.toarray()
vocab = vectorizer.get_feature_names_out()
pd.DataFrame(bv_matrix, columns=vocab)
### END CODE HERE ###

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


Unnamed: 0,aab follows vowel,aah go said,aaron cohn san,ab plane end,ab plasma control,abandon atom bomb,abandon ceremonies winter,abandon efforts would,abandon real celebration,abandon whole life,...,zion stayed get,zodiacal sign watercolorist,zoning board county,zoo grimace lions,zoo reservations extremely,zoooop snag around,zu longing simple,zur khaneh latter,zurich prince boun,zworykin novel technique
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 1.3.6 Exercise: TD-IDF

Now, using anyone of the following datasets, create you're own TF-IDF implementation. Provide your output in the form of a matrix.

For more on the intuition behind TF-IDF, read the article [here](https://alvinntnu.github.io/NTNU_ENC2045_LECTURES/nlp/text-vec-traditional.html).

Refer to [this article](https://sci2lab.github.io/ml_tutorial/tfidf/) related to TF-IDF and Elastisearch. Note how the TF-IDF approach can be used for information retrieval.

Datasets that you may choose from:
*   [Reviews Dataset](https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences#). this dataset uses classified data from Yelp!, Amazon, and IMBD. You can use this to determine TF-IDF across the datasets.


*   Presidential speeches in NLTK. You can use this dataset to determine the TFIDF vector of words across presidential speeches.

```
nltk.corpus.inaugural
```

Please provide your code in the cell below.





In [27]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import inaugural

# Define the WordPunctTokenizer and stopwords
wpt = nltk.WordPunctTokenizer()
stop_words = set(stopwords.words('english'))

# Define the normalize_document function
def normalize_document(doc):
    # lower case and remove special characters\whitespaces
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, re.I | re.A)
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

# Load the dataset and normalize the speeches
speeches = [normalize_document(inaugural.raw(fileid)) for fileid in inaugural.fileids()]

# Initialize the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit the vectorizer and transform the speeches
tfidf_matrix = tfidf_vectorizer.fit_transform(speeches)

# Convert the TF-IDF matrix to a pandas DataFrame for better visualization
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Display the TF-IDF matrix
tfidf_df


Unnamed: 0,000,15,15th,1801,1817,1850,1886,1890,1893,1897,...,younger,youngest,yourselfxand,youth,youthful,youâ,zeal,zealous,zealously,zone
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.025876,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.033079,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.08147,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.047632,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.030448,0.0,0.020043,0.022848,0.0,0.030448
8,0.043552,0.023986,0.0,0.023986,0.023986,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017999,0.017158,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.022267,0.0,0.0,0.0


In [26]:
import numpy as np
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Define the WordPunctTokenizer and stopwords
wpt = nltk.WordPunctTokenizer()
stop_words = set(stopwords.words('english'))

# Define the normalize_document function
def normalize_document(doc):
    # lower case and remove special characters\whitespaces
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, re.I | re.A)
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

# Load and preprocess the Amazon dataset
with open("amazon_cells_labelled.txt", "r") as file:
    lines = file.readlines()
    amazon_corpus = [normalize_document(line.split("\t")[0]) for line in lines]


# Initialize the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit the vectorizer and transform the texts
tfidf_matrix = tfidf_vectorizer.fit_transform(amazon_corpus)

# Convert the TF-IDF matrix to a pandas DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Add labels to the DataFrame
tfidf_df["Label"] = labels

# Display the TF-IDF matrix with labels
tfidf_df

Unnamed: 0,abhor,ability,able,abound,abovepretty,absolutel,absolutely,ac,accept,acceptable,...,years,yearsgreat,yell,yes,yet,youd,youll,za,zero,Label
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0


##A. References

1.   Chapter 6 – Vector Semantics and Word Embeddings Speech and Language rocessing. Daniel Jurafsky & James H. Martin. Copyright © 2021. All rights reserved. Draft of September 21, 2021.
2.   [Word2vec from Scratch with NumPy](https://towardsdatascience.com/word2vec-from-scratch-with-numpy-8786ddd49e72)
3.   [A hands=on intutive approach to Deep Learning Methods for Text Data - Word2Vec,GloVe and FastText](https://towardsdatascience.com/understanding-feature-engineering-part-4-deep-learning-methods-for-text-data-96c44370bbfa)
4.    [Traditional Methods for Text Data](https://towardsdatascience.com/understanding-feature-engineering-part-3-traditional-methods-for-text-data-f6f7d70acd41)
5.    [Word Embeddings](https://colab.research.google.com/github/tensorflow/text/blob/master/docs/guide/word_embeddings.ipynb#scrollTo=Q6mJg1g3apaz)
6. [CS 224D: Deep Learning for NLP](https://cs224d.stanford.edu/lecture_notes/LectureNotes1.pdf)
7. [Text Vectorization] (https://alvinntnu.github.io/NTNU_ENC2045_LECTURES/nlp/text-vec-traditional.html)
8. [Brown Corpus](https://en.wikipedia.org/wiki/Brown_Corpus)
9. [TF-IDF](https://ethen8181.github.io/machine-learning/clustering_old/tf_idf/tf_idf.html)
10. [Applying TF-IDF algorithm in practice](https://plumbr.io/blog/programming/applying-tf-idf-algorithm-in-practice)
11. [text2vec](http://text2vec.org/similarity.html)

