# Coding Practice 2 


## Corpus 
For the following exercises, use this paragraph as the corpus:

```The use of natural language processing (NLP) is rapidly becoming an integral part of many businesses. NLP is the process of deriving meaning and understanding from text, such as analyzing the sentiment of a customer review or determining the intent of an email response. For instance, a company may use NLP to extract meaningful insights from customer feedback and influence product design decisions. By leveraging the power of NLP, organizations can gain valuable insights into customer thoughts, behaviors, patterns, and preferences. Consequently, NLP is being used in many industries to uncover opportunities for better customer experiences, increased efficiency in operations, improved decision-making, and advanced analytics.```

In [None]:
corpus = """The use of natural language processing (NLP) is rapidly becoming an integral part of many businesses.
NLP is the process of deriving meaning and understanding from text, such as analyzing the sentiment of a customer 
review or determining the intent of an email response. For instance, a company may use NLP to extract meaningful 
insights from customer feedback and influence product design decisions. By leveraging the power of NLP, 
organizations can gain valuable insights into customer thoughts, behaviors, patterns, and preferences. 
Consequently, NLP is being used in many industries to uncover opportunities for better customer experiences, 
increased efficiency in operations, improved decision-making, and advanced analytics."""

Use the corpus above to generate embeddings using the following techniques:
1. `TF-IDF`
2. `Bag of Words`
3. `CBOW`
4. `Skip-Gram`

Before we generate embeddings, we'll reuse the functions from the last exercise's notebook to clean up the corpus by tokenizing, removing stop words, and stemming.

In [None]:
import nltk
from nltk.corpus import stopwords
import pprint
from nltk.stem import PorterStemmer
from functools import reduce

# This function removes English stop words
def cleanupDoc(s):
    stopset = set(stopwords.words('english'))
    tokens = nltk.word_tokenize(s)
    # Compare in lowercase so capitalized stop words (e.g. "The") are removed too
    cleanup = " ".join(filter(lambda word: word.lower() not in stopset, tokens))
    return cleanup

# This function uses the `PorterStemmer` to reduce each word to its stem
def stemmer_(s):
    tokens = nltk.word_tokenize(s)
    port_stemmer = PorterStemmer()
    # Stem each token and join the stems with single spaces
    cleanup = " ".join(port_stemmer.stem(token) for token in tokens)
    return cleanup

In [None]:
# Apply the functions defined above to clean data 
cleaned_text = cleanupDoc(corpus)
cleaner_text = stemmer_(cleaned_text)

## Using TF-IDF to Generate Embeddings

Now that the data has been cleaned, fill in the code below to apply the `TF-IDF` vectorizer from the `sklearn` library.
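
To make the quantity being computed concrete, here is a minimal sketch of the textbook TF-IDF formula, `tf * ln(N / df)`, in plain Python. The toy documents and the `tf_idf` helper are hypothetical and exist only for illustration; this is not the `sklearn` implementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus: three tiny, already-tokenized documents.
docs = [
    ["nlp", "extract", "insight"],
    ["nlp", "custom", "feedback"],
    ["custom", "review", "custom"],
]

def tf_idf(docs):
    """Return one {term: score} dict per document using tf * ln(N / df)."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # Term frequency is the term's share of the document's tokens.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

scores = tf_idf(docs)
for doc_scores in scores:
    print({term: round(score, 3) for term, score in doc_scores.items()})
```

Terms that appear in fewer documents receive a larger IDF weight, so in the first toy document `extract` outscores `nlp`. By default, `sklearn` uses a smoothed IDF and L2-normalizes each row, so its numbers will differ from this sketch.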

In [None]:
# Fill in code in the appropriate spots
from sklearn.feature_extraction.text import FILL_IN


# Instantiate an instance of the TfidfVectorizer

vectorizer = FILL_IN(use_idf=True)

# Fit the vectorizer to the corpus
fitted_vectorizer = vectorizer.fit([cleaner_text])

# Transform the corpus using the fitted vectorizer
X = FILL_IN

We can convert the transformed output into a `DataFrame` to take a quick look at which words have the highest TF-IDF scores. Because the corpus is a single document, every term has the same IDF, so this ranking matches term frequency.

In [None]:
# Fill in code in the appropriate spots
# Convert the sparse matrix into a Pandas DataFrame for quick analysis
import pandas as pd

df = pd.DataFrame(X[0].T.todense(),
                  index=FILL_IN,
                  columns=["TF-IDF"])
print(df.sort_values("TF-IDF", ascending=False).head(10))

## Using Bag of Words to Generate Embeddings

As an alternative to `TF-IDF`, the `Bag of Words` approach generates count-based embeddings: each document is represented by how many times each vocabulary term occurs in it. In the cells below, please fill in the missing code to generate `Bag of Words` embeddings.
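
As a rough illustration of what a count-based representation looks like, the sketch below builds Bag of Words vectors by hand with `collections.Counter`. The toy documents and the `bag_of_words` helper are assumptions made for this example and are not part of the exercise.

```python
from collections import Counter

# Hypothetical toy documents, already lowercased and tokenized.
docs = [
    ["nlp", "custom", "review", "custom"],
    ["nlp", "email", "intent"],
]

# The vocabulary is the sorted set of all terms; each term gets one column.
vocab = sorted({term for doc in docs for term in doc})

def bag_of_words(doc, vocab):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]

vectors = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Each vector has one entry per vocabulary term, and word order within a document is discarded.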

In [None]:
# Fill in code in the appropriate spots
from sklearn.feature_extraction.text import FILL_IN

# Instantiate Count Vectorizer 
vectorizer = FILL_IN

# Fit to the corpus 
fitted_vectorizer = FILL_IN

# Transform using fitted vectorizer 
X = fitted_vectorizer.transform([cleaner_text])

We can convert the transformed output into a `DataFrame` to take a quick look at which words have the highest `Bag of Words` counts.

In [None]:
# Convert the sparse matrix into a Pandas DataFrame for quick analysis

import pandas as pd

df = pd.DataFrame(X[0].T.todense(),
                  index=fitted_vectorizer.get_feature_names_out(),
                  columns=["Bag of Words"])
print(df.sort_values("Bag of Words", ascending=False).head(10))

## Using Word2Vec to Generate Embeddings
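
The cells in this section train Word2Vec twice, once with `sg=0` (CBOW) and once with `sg=1` (Skip-Gram). The two modes differ in how training examples are formed from a sliding window: CBOW predicts a center word from its surrounding context, while Skip-Gram predicts each context word from the center word. The sketch below shows only that pairing step in plain Python. The `context_windows` helper, the toy sentence, and the window size of 2 are hypothetical, and real implementations add details such as negative sampling that are omitted here.

```python
def context_windows(tokens, window=2):
    """Yield (center, context_words) for every position in the sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

# Hypothetical toy sentence of stemmed tokens.
tokens = ["nlp", "extract", "insight", "custom", "feedback"]

# CBOW: the context words jointly predict the center word.
cbow_examples = [(context, center) for center, context in context_windows(tokens)]

# Skip-Gram: the center word predicts each context word separately.
skip_gram_examples = [
    (center, word)
    for center, context in context_windows(tokens)
    for word in context
]

print(cbow_examples[2])
print(skip_gram_examples[:2])
```

CBOW produces one training example per position, whereas Skip-Gram produces one per (center, context word) pair, so it yields more examples from the same text.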

In [None]:
# Fill in code in the appropriate spots
# Train Word2Vec using CBOW

import gensim
from gensim.models import FILL_IN
from nltk.tokenize import sent_tokenize, word_tokenize

# Sentence and word tokenize cleaned text 
data = []

for i in sent_tokenize(cleaner_text):
    temp = []
    
    for j in word_tokenize(i):
        temp.append(j.lower())
        
    data.append(temp)

In [None]:
# Fill in code in the appropriate spots
cbow = FILL_IN(data, min_count=1, vector_size=1000, window=5, sg=0)

In [None]:
# Fill in code in the appropriate spots
# Train Word2Vec using Skip Gram
skip_gram = FILL_IN(data, min_count=1, vector_size=1000, window=5, sg=1)

Now that we've built the two models, let's compare them by looking at the similarity score for a pair of words under each model.
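
The scores printed in the next cell are labeled as cosine similarities. As a reminder of what that measure computes, here is a minimal sketch in plain Python. The `cosine_similarity` helper and the three-dimensional vectors are hypothetical stand-ins for real word vectors.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors standing in for word embeddings.
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # same direction as a, twice the length
c = [0.0, 0.0, 3.0]  # orthogonal to a

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0.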

In [None]:
# Fill in code in the appropriate spots
# Calculate similarities 
cbow_similarity = cbow.wv.FILL_IN("nlp", "custom")
skip_gram_similarity = skip_gram.wv.FILL_IN("nlp", "custom")

# Print results
print(f"Cosine similarity between `nlp` and `custom` using CBOW Model: {cbow_similarity}")
print(f"Cosine similarity between `nlp` and `custom` using Skip Gram Model: {skip_gram_similarity}")
